forked from piskvorky/gensim
-
Notifications
You must be signed in to change notification settings - Fork 6
/
changes_080.html
213 lines (184 loc) · 13.4 KB
/
changes_080.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Change Set for 0.8.0 — gensim</title>
<link rel="stylesheet" href="_static/default.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.2',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="author" title="About these documents" href="about.html" />
<link rel="top" title="gensim" href="index.html" />
<!-- twitter search widget
<script type="text/javascript" src="_static/widget.js"></script>
-->
<meta property="og:title" content="#gensim" />
<meta property="og:description" content="Efficient topic modelling in Python" />
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-24066335-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><a href="index.html">Gensim home</a>| </li>
<li><a href="tutorial.html">Tutorials</a>| </li>
<li><a href="http://groups.google.com/group/gensim">Support</a>| </li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>| </li>
<li><a href="apiref.html">API reference</a>»</li>
</ul>
</div>
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Change Set for 0.8.0</a><ul>
<li><a class="reference internal" href="#codestyle-changes">Codestyle Changes</a></li>
<li><a class="reference internal" href="#similarity-queries">Similarity Queries</a></li>
<li><a class="reference internal" href="#simplified-directory-structure">Simplified Directory Structure</a></li>
<li><a class="reference internal" href="#other-changes-that-you-re-unlikely-to-notice-unless-you-look">Other changes (that you’re unlikely to notice unless you look)</a></li>
</ul>
</li>
</ul>
<div id="searchbox" style="display: none">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<input type="text" name="q" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
<p class="searchtip" style="font-size: 90%">
Enter search terms or a module, class or function name.
</p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
<div class="section" id="change-set-for-0-8-0">
<span id="changes-080"></span><h1>Change Set for 0.8.0<a class="headerlink" href="#change-set-for-0-8-0" title="Permalink to this headline">¶</a></h1>
<p>Release 0.8.0 concludes the 0.7.x series, which was about API consolidation and performance.
In 0.8.x, I’d like to extend <cite>gensim</cite> with new functionality and features.</p>
<div class="section" id="codestyle-changes">
<h2>Codestyle Changes<a class="headerlink" href="#codestyle-changes" title="Permalink to this headline">¶</a></h2>
<p>Codebase was modified to comply with <a class="reference external" href="http://www.python.org/dev/peps/pep-0008/">PEP8: Style Guide for Python Code</a>.
This means the 0.8.0 API is <strong>backward incompatible</strong> with the 0.7.x series.</p>
<p>That’s not as tragic as it sounds, gensim was almost there anyway. The changes are few and pretty straightforward:</p>
<ol class="arabic simple">
<li>the <cite>numTopics</cite> parameter is now <cite>num_topics</cite></li>
<li><cite>addDocuments()</cite> method becomes <cite>add_documents()</cite></li>
<li><cite>toUtf8()</cite> => <cite>to_utf8()</cite></li>
<li>... you get the idea: replace <cite>camelCase</cite> with <cite>lowercase_with_underscores</cite>.</li>
</ol>
<p>If you stored a model that is affected by this to disk, you’ll need to rename its attributes manually:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">lsa</span> <span class="o">=</span> <span class="n">gensim</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">LsiModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'/some/path'</span><span class="p">)</span> <span class="c"># load old <0.8.0 model</span>
<span class="gp">>>> </span><span class="n">lsa</span><span class="o">.</span><span class="n">num_terms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">num_topics</span> <span class="o">=</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTerms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTopics</span> <span class="c"># rename attributes</span>
<span class="gp">>>> </span><span class="k">del</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTerms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTopics</span> <span class="c"># clean up old attributes (optional)</span>
<span class="gp">>>> </span><span class="n">lsa</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'/some/path'</span><span class="p">)</span> <span class="c"># save again to disk, as 0.8.0 compatible</span>
</pre></div>
</div>
<p>Only attributes (variables) need to be renamed; method names (functions) are not affected, due to the way <cite>pickle</cite> works.</p>
</div>
<div class="section" id="similarity-queries">
<h2>Similarity Queries<a class="headerlink" href="#similarity-queries" title="Permalink to this headline">¶</a></h2>
<p>Improved speed and scalability of <a class="reference internal" href="tut2.html"><em>similarity queries</em></a>.</p>
<p>The <cite>Similarity</cite> class can now index corpora of arbitrary size more efficiently.
Internally, this is done by splitting the index into several smaller pieces (“shards”) that fit in RAM
and can be processed independently. In addition, documents can now be added to a <cite>Similarity</cite> index dynamically.</p>
<p>There is also a new way to query the similarity indexes:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">index</span> <span class="o">=</span> <span class="n">MatrixSimilarity</span><span class="p">(</span><span class="n">corpus</span><span class="p">)</span> <span class="c"># create an index</span>
<span class="gp">>>> </span><span class="n">sims</span> <span class="o">=</span> <span class="n">index</span><span class="p">[</span><span class="n">document</span><span class="p">]</span> <span class="c"># get cosine similarity of query "document" against every document in the index</span>
<span class="gp">>>> </span><span class="n">sims</span> <span class="o">=</span> <span class="n">index</span><span class="p">[</span><span class="n">chunk_of_documents</span><span class="p">]</span> <span class="c"># new syntax!</span>
</pre></div>
</div>
<p>Advantage of the last line (querying multiple documents at the same time) is faster execution.</p>
<p>This faster execution is also utilized <em>automatically for you</em> if you’re using the <tt class="docutils literal"><span class="pre">for</span> <span class="pre">sims</span> <span class="pre">in</span> <span class="pre">index:</span> <span class="pre">...</span></tt> syntax
(which returns pairwise similarities of documents in the index).</p>
<p>To see the speed-up on your machine, run <tt class="docutils literal"><span class="pre">python</span> <span class="pre">-m</span> <span class="pre">gensim.test.simspeed</span></tt> (and compare to my results <a class="reference external" href="http://groups.google.com/group/gensim/msg/4f6f171a869e4fca?">here</a> to see how your machine fares).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This current functionality of querying is as far as I wanted to get with gensim.
More optimizations and smarter indexing are certainly possible, but I’d like to
focus on other features now. Pull requests are still welcome though :)</p>
</div>
<p>Check out the <a class="reference internal" href="similarities/docsim.html#module-gensim.similarities.docsim" title="gensim.similarities.docsim: Document similarity queries"><tt class="xref py py-mod docutils literal"><span class="pre">updated</span> <span class="pre">documentation</span></tt></a> of the similarity classes for more info.</p>
</div>
<div class="section" id="simplified-directory-structure">
<h2>Simplified Directory Structure<a class="headerlink" href="#simplified-directory-structure" title="Permalink to this headline">¶</a></h2>
<p>Instead of the java-esque <tt class="docutils literal"><span class="pre">ROOT_DIR/src/gensim</span></tt> directory structure of gensim,
the packages now reside directly in <tt class="docutils literal"><span class="pre">ROOT_DIR/gensim</span></tt> (no superfluous <tt class="docutils literal"><span class="pre">src</span></tt>). See the new structure <a class="reference external" href="https://github.com/piskvorky/gensim">on github</a>.</p>
</div>
<div class="section" id="other-changes-that-you-re-unlikely-to-notice-unless-you-look">
<h2>Other changes (that you’re unlikely to notice unless you look)<a class="headerlink" href="#other-changes-that-you-re-unlikely-to-notice-unless-you-look" title="Permalink to this headline">¶</a></h2>
<ul class="simple">
<li>Improved efficiency of <tt class="docutils literal"><span class="pre">lsi[corpus]</span></tt> transformations (documents are chunked internally for better performance).</li>
<li>Large matrices (numpy/scipy.sparse, in <cite>LsiModel</cite>, <cite>Similarity</cite> etc.) are now mmapped to/from disk when doing <cite>save/load</cite>. The <cite>cPickle</cite> approach used previously was too <a class="reference external" href="http://groups.google.com/group/gensim/browse_thread/thread/3c4c6c0f76c5938c#">buggy</a> and <a class="reference external" href="http://dieter.plaetinck.be/poor_mans_pickle_implementations_benchmark.html">slow</a>.</li>
<li>Renamed <cite>chunks</cite> parameter to <cite>chunksize</cite> (i.e. <cite>LsiModel(corpus, num_topics=100, chunksize=20000)</cite>). This better reflects its purpose: size of a chunk=number of documents to be processed at once.</li>
<li>Also improved memory efficiency of LSI and LDA model generation (again).</li>
<li>Removed SciPy 0.6 from the list of supported SciPi versions (need >=0.7 now).</li>
<li>Added more unit tests.</li>
<li>Several smaller fixes; see the <a class="reference external" href="https://github.com/piskvorky/gensim/commits/0.8.0">commit history</a> for full account.</li>
</ul>
<div class="admonition-future-directions admonition">
<p class="first admonition-title">Future Directions?</p>
<p class="last">If you have ideas or proposals for new features for 0.8.x, now is the time to let me know:
<a class="reference external" href="http://groups.google.com/group/gensim">gensim mailing list</a>.</p>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><a href="index.html">Gensim home</a>| </li>
<li><a href="tutorial.html">Tutorials</a>| </li>
<li><a href="http://groups.google.com/group/gensim">Support</a>| </li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>| </li>
<li><a href="apiref.html">API reference</a>»</li>
</ul>
</div>
<div class="footer">
© Copyright 2011, Radim Řehůřek <radimrehurek(at)seznam.cz>.
Last updated on Oct 30, 2011.
</div>
</body>
</html>