<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Corpora and Vector Spaces — gensim</title>
<link rel="stylesheet" href="_static/default.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="author" title="About these documents" href="about.html" />
<link rel="top" title="gensim" href="index.html" />
<link rel="up" title="Tutorials" href="tutorial.html" />
<link rel="next" title="Topics and Transformations" href="tut2.html" />
<link rel="prev" title="Tutorials" href="tutorial.html" />
<!-- twitter search widget
<script type="text/javascript" src="_static/widget.js"></script>
-->
<meta property="og:title" content="#gensim" />
<meta property="og:description" content="Efficient topic modelling in Python" />
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-24066335-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="tut2.html" title="Topics and Transformations"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="tutorial.html" title="Tutorials"
accesskey="P">previous</a> |</li>
<li><a href="index.html">Gensim home</a>| </li>
<li><a href="tutorial.html">Tutorials</a>| </li>
<li><a href="http://groups.google.com/group/gensim">Support</a>| </li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>| </li>
<li><a href="apiref.html">API reference</a>»</li>
<li><a href="tutorial.html" accesskey="U">Tutorials</a> »</li>
</ul>
</div>
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Corpora and Vector Spaces</a><ul>
<li><a class="reference internal" href="#from-strings-to-vectors">From Strings to Vectors</a></li>
<li><a class="reference internal" href="#corpus-streaming-one-document-at-a-time">Corpus Streaming – One Document at a Time</a></li>
<li><a class="reference internal" href="#corpus-formats">Corpus Formats</a></li>
</ul>
</li>
</ul>
<h4>Previous topic</h4>
<p class="topless"><a href="tutorial.html"
title="previous chapter">Tutorials</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="tut2.html"
title="next chapter">Topics and Transformations</a></p>
<div id="searchbox" style="display: none">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<input type="text" name="q" size="24" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
<p class="searchtip" style="font-size: 90%">
Enter search terms or a module, class or function name.
</p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
<div class="section" id="corpora-and-vector-spaces">
<span id="tut1"></span><h1>Corpora and Vector Spaces<a class="headerlink" href="#corpora-and-vector-spaces" title="Permalink to this headline">¶</a></h1>
<p>Don’t forget to set</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">logging</span>
<span class="gp">>>> </span><span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="s">'</span><span class="si">%(asctime)s</span><span class="s"> : </span><span class="si">%(levelname)s</span><span class="s"> : </span><span class="si">%(message)s</span><span class="s">'</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>
</pre></div>
</div>
<p>if you want to see logging events.</p>
<div class="section" id="from-strings-to-vectors">
<span id="second-example"></span><h2>From Strings to Vectors<a class="headerlink" href="#from-strings-to-vectors" title="Permalink to this headline">¶</a></h2>
<p>This time, let’s start from documents represented as strings:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">gensim</span> <span class="kn">import</span> <span class="n">corpora</span><span class="p">,</span> <span class="n">models</span><span class="p">,</span> <span class="n">similarities</span>
<span class="go">>>></span>
<span class="gp">>>> </span><span class="n">documents</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Human machine interface for lab abc computer applications"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"A survey of user opinion of computer system response time"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"The EPS user interface management system"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"System and human system engineering testing of EPS"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"Relation of user perceived response time to error measurement"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"The generation of random binary unordered trees"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"The intersection graph of paths in trees"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"Graph minors IV Widths of trees and well quasi ordering"</span><span class="p">,</span>
<span class="gp">>>> </span> <span class="s">"Graph minors A survey"</span><span class="p">]</span>
</pre></div>
</div>
<p>This is a tiny corpus of nine documents, each consisting of only a single sentence.</p>
<p>First, let’s tokenize the documents, remove common words (using a toy stoplist)
as well as words that only appear once in the corpus:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># remove common words and tokenize</span>
<span class="gp">>>> </span><span class="n">stoplist</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="s">'for a of the and to in'</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="gp">>>> </span><span class="n">texts</span> <span class="o">=</span> <span class="p">[[</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">document</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">()</span> <span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stoplist</span><span class="p">]</span>
<span class="gp">>>> </span> <span class="k">for</span> <span class="n">document</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">]</span>
<span class="go">>>></span>
<span class="gp">>>> </span><span class="c"># remove words that appear only once</span>
<span class="gp">>>> </span><span class="n">all_tokens</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="p">[])</span>
<span class="gp">>>> </span><span class="n">tokens_once</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_tokens</span><span class="p">)</span> <span class="k">if</span> <span class="n">all_tokens</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">texts</span> <span class="o">=</span> <span class="p">[[</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">text</span> <span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">tokens_once</span><span class="p">]</span>
<span class="gp">>>> </span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">]</span>
<span class="go">>>></span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">texts</span>
<span class="go">[['human', 'interface', 'computer'],</span>
<span class="go"> ['survey', 'user', 'computer', 'system', 'response', 'time'],</span>
<span class="go"> ['eps', 'user', 'interface', 'system'],</span>
<span class="go"> ['system', 'human', 'system', 'eps'],</span>
<span class="go"> ['user', 'response', 'time'],</span>
<span class="go"> ['trees'],</span>
<span class="go"> ['graph', 'trees'],</span>
<span class="go"> ['graph', 'minors', 'trees'],</span>
<span class="go"> ['graph', 'minors', 'survey']]</span>
</pre></div>
</div>
<p>Your way of processing the documents will likely vary; here, I only split on whitespace
to tokenize, followed by lowercasing each word. In fact, I use this particular
(simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.’s
original LSA article <a class="footnote-reference" href="#id3" id="id1">[1]</a>.</p>
<p>The ways to process documents are so varied and application- and language-dependent that I
decided to <em>not</em> constrain them by any interface. Instead, a document is represented
by the features extracted from it, not by its “surface” string form: how you get to
the features is up to you. Below I describe one common, general-purpose approach (called
<em class="dfn">bag-of-words</em>), but keep in mind that different application domains call for
different features, and, as always, it’s <a class="reference external" href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out">garbage in, garbage out</a>...</p>
<p>To convert documents to vectors, we’ll use a document representation called
<a class="reference external" href="http://en.wikipedia.org/wiki/Bag_of_words">bag-of-words</a>. In this representation,
each document is represented by one vector where each vector element represents
a question-answer pair, in the style of:</p>
<blockquote>
<div>“How many times does the word <cite>system</cite> appear in the document? Once.”</div></blockquote>
<p>It is advantageous to represent the questions only by their (integer) ids. The mapping
between the questions and ids is called a dictionary:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">dictionary</span> <span class="o">=</span> <span class="n">corpora</span><span class="o">.</span><span class="n">Dictionary</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">dictionary</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'/tmp/deerwester.dict'</span><span class="p">)</span> <span class="c"># store the dictionary, for future reference</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">dictionary</span>
<span class="go">Dictionary(12 unique tokens)</span>
</pre></div>
</div>
<p>Here we assigned a unique integer id to all words appearing in the corpus with the
<a class="reference internal" href="corpora/dictionary.html#gensim.corpora.dictionary.Dictionary" title="gensim.corpora.dictionary.Dictionary"><tt class="xref py py-class docutils literal"><span class="pre">gensim.corpora.dictionary.Dictionary</span></tt></a> class. This sweeps across the texts, collecting word counts
and relevant statistics. In the end, we see there are twelve distinct words in the
processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector).
To see the mapping between words and their ids:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">token2id</span>
<span class="go">{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,</span>
<span class="go">'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}</span>
</pre></div>
</div>
<p>To actually convert tokenized documents to vectors:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">new_doc</span> <span class="o">=</span> <span class="s">"Human computer interaction"</span>
<span class="gp">>>> </span><span class="n">new_vec</span> <span class="o">=</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">new_doc</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">new_vec</span> <span class="c"># the word "interaction" does not appear in the dictionary and is ignored</span>
<span class="go">[(0, 1), (1, 1)]</span>
</pre></div>
</div>
<p>The function <tt class="xref py py-func docutils literal"><span class="pre">doc2bow()</span></tt> simply counts the number of occurrences of
each distinct word, converts the word to its integer word id
and returns the result as a sparse vector. The sparse vector <tt class="docutils literal"><span class="pre">[(0,</span> <span class="pre">1),</span> <span class="pre">(1,</span> <span class="pre">1)]</span></tt>
therefore reads: in the document <cite>“Human computer interaction”</cite>, the words <cite>computer</cite>
(id 0) and <cite>human</cite> (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.</p>
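<p>To make that reading concrete, here is a minimal plain-Python sketch of the idea behind <cite>doc2bow()</cite> (not gensim’s actual implementation): count each token the dictionary knows and emit sorted <cite>(id, count)</cite> pairs. The <cite>token2id</cite> excerpt below is copied from the mapping printed earlier:</p>

```python
# A plain-Python sketch of the doc2bow() idea (not gensim's implementation):
# count each token the dictionary knows and emit sorted (id, count) pairs.
from collections import Counter

token2id = {'computer': 0, 'human': 1, 'interface': 2}  # excerpt of the mapping above
tokens = "Human computer interaction".lower().split()
counts = Counter(t for t in tokens if t in token2id)    # unknown words are simply ignored
bow = sorted((token2id[t], cnt) for t, cnt in counts.items())
print(bow)  # [(0, 1), (1, 1)] -- the same sparse vector as above
```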
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">corpus</span> <span class="o">=</span> <span class="p">[</span><span class="n">dictionary</span><span class="o">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">MmCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/deerwester.mm'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span> <span class="c"># store to disk, for later use</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">corpus</span>
<span class="go">[(0, 1), (1, 1), (2, 1)]</span>
<span class="go">[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]</span>
<span class="go">[(2, 1), (5, 1), (7, 1), (8, 1)]</span>
<span class="go">[(1, 1), (5, 2), (8, 1)]</span>
<span class="go">[(3, 1), (6, 1), (7, 1)]</span>
<span class="go">[(9, 1)]</span>
<span class="go">[(9, 1), (10, 1)]</span>
<span class="go">[(9, 1), (10, 1), (11, 1)]</span>
<span class="go">[(4, 1), (10, 1), (11, 1)]</span>
</pre></div>
</div>
<p>By now it should be clear that the vector feature with <tt class="docutils literal"><span class="pre">id=10</span></tt> stands for the question “How many
times does the word <cite>graph</cite> appear in the document?” and that the answer is “zero” for
the first six documents and “one” for the remaining three. As a matter of fact,
we have arrived at exactly the same corpus of vectors as in the <a class="reference internal" href="tutorial.html#first-example"><em>Quick Example</em></a>.</p>
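<p>Since a sparse vector implicitly encodes zeros for every absent id, it can always be expanded into its dense 12-dimensional form. A small illustration in plain Python (not a gensim call; gensim keeps vectors sparse precisely to avoid storing all those zeros):</p>

```python
# Expand a sparse (id, count) vector into a dense list of length num_terms;
# ids that do not appear in the sparse vector stay at zero.
def densify(sparse_vec, num_terms=12):
    dense = [0] * num_terms
    for term_id, count in sparse_vec:
        dense[term_id] = count
    return dense

print(densify([(9, 1), (10, 1), (11, 1)]))  # 'graph minors trees': ones at ids 9, 10, 11
```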
</div>
<div class="section" id="corpus-streaming-one-document-at-a-time">
<h2>Corpus Streaming – One Document at a Time<a class="headerlink" href="#corpus-streaming-one-document-at-a-time" title="Permalink to this headline">¶</a></h2>
<p>Note that <cite>corpus</cite> above resides fully in memory, as a plain Python list.
In this simple example, it doesn’t matter much, but just to make things clear,
let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do.
Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim
only requires that a corpus be able to return one document vector at a time:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">class</span> <span class="nc">MyCorpus</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="gp">>>> </span> <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="gp">>>> </span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="s">'mycorpus.txt'</span><span class="p">):</span>
<span class="gp">>>> </span> <span class="c"># assume there's one document per line, tokens separated by whitespace</span>
<span class="gp">>>> </span> <span class="k">yield</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
</pre></div>
</div>
<p>Download the sample <a class="reference external" href="./mycorpus.txt">mycorpus.txt file here</a>. The assumption that
each document occupies one line in a single file is not important; you can mold
the <cite>__iter__</cite> function to fit your input format, whatever it is.
Walking directories, parsing XML, accessing the network...
Just parse your input to retrieve a clean list of tokens in each document,
then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside <cite>__iter__</cite>.</p>
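<p>As one illustration of molding <cite>__iter__</cite> to a different input format, here is the same streaming idea applied to a directory containing one plain-text document per file. The class name and layout are hypothetical (this is a sketch, not part of gensim); <cite>dictionary</cite> is assumed to be any object with a <cite>doc2bow()</cite> method, such as the gensim dictionary built earlier:</p>

```python
# A sketch of the streaming pattern for a directory of text files,
# one document per file (class name and layout are hypothetical).
import os

class DirCorpus(object):
    """Stream one bag-of-words vector per text file in a directory."""
    def __init__(self, dirname, dictionary):
        self.dirname = dirname
        self.dictionary = dictionary  # any object with a doc2bow() method

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):  # one document per file
            with open(os.path.join(self.dirname, fname)) as f:
                yield self.dictionary.doc2bow(f.read().lower().split())
```

As with <cite>MyCorpus</cite> above, nothing is read from disk until iteration starts, so the directory can hold arbitrarily many documents.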
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">corpus_memory_friendly</span> <span class="o">=</span> <span class="n">MyCorpus</span><span class="p">()</span> <span class="c"># doesn't load the corpus into memory!</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">corpus_memory_friendly</span>
<span class="go"><__main__.MyCorpus object at 0x10d5690></span>
</pre></div>
</div>
<p>The corpus is now an object. We didn’t define any way to print it, so <cite>print</cite> just outputs the
object’s address in memory. Not very useful. To see the constituent vectors, let’s
iterate over the corpus and print each document vector (one at a time):</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">for</span> <span class="n">vector</span> <span class="ow">in</span> <span class="n">corpus_memory_friendly</span><span class="p">:</span> <span class="c"># load one vector into memory at a time</span>
<span class="gp">>>> </span> <span class="k">print</span> <span class="n">vector</span>
<span class="go">[(0, 1), (1, 1), (2, 1)]</span>
<span class="go">[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]</span>
<span class="go">[(2, 1), (5, 1), (7, 1), (8, 1)]</span>
<span class="go">[(1, 1), (5, 2), (8, 1)]</span>
<span class="go">[(3, 1), (6, 1), (7, 1)]</span>
<span class="go">[(9, 1)]</span>
<span class="go">[(9, 1), (10, 1)]</span>
<span class="go">[(9, 1), (10, 1), (11, 1)]</span>
<span class="go">[(4, 1), (10, 1), (11, 1)]</span>
</pre></div>
</div>
<p>Although the output is the same as for the plain Python list, the corpus is now much
more memory friendly, because at most one vector resides in RAM at a time. Your
corpus can now be as large as you want.</p>
<p>Similarly, to construct the dictionary without loading all texts into memory:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># collect statistics about all tokens</span>
<span class="gp">>>> </span><span class="n">dictionary</span> <span class="o">=</span> <span class="n">corpora</span><span class="o">.</span><span class="n">Dictionary</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="s">'mycorpus.txt'</span><span class="p">))</span>
<span class="gp">>>> </span><span class="c"># remove stop words and words that appear only once</span>
<span class="gp">>>> </span><span class="n">stop_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">dictionary</span><span class="o">.</span><span class="n">token2id</span><span class="p">[</span><span class="n">stopword</span><span class="p">]</span> <span class="k">for</span> <span class="n">stopword</span> <span class="ow">in</span> <span class="n">stoplist</span>
<span class="gp">>>> </span> <span class="k">if</span> <span class="n">stopword</span> <span class="ow">in</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">token2id</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">once_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenid</span> <span class="k">for</span> <span class="n">tokenid</span><span class="p">,</span> <span class="n">docfreq</span> <span class="ow">in</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">dfs</span><span class="o">.</span><span class="n">iteritems</span><span class="p">()</span> <span class="k">if</span> <span class="n">docfreq</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">dictionary</span><span class="o">.</span><span class="n">filter_tokens</span><span class="p">(</span><span class="n">stop_ids</span> <span class="o">+</span> <span class="n">once_ids</span><span class="p">)</span> <span class="c"># remove stop words and words that appear only once</span>
<span class="gp">>>> </span><span class="n">dictionary</span><span class="o">.</span><span class="n">compactify</span><span class="p">()</span> <span class="c"># remove gaps in id sequence after words that were removed</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="n">dictionary</span>
<span class="go">Dictionary(12 unique tokens)</span>
</pre></div>
</div>
<p>And that is all there is to it! At least as far as bag-of-words representation is concerned.
Of course, what we do with such a corpus is another question; it is not at all clear
how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and
we will need to apply a transformation on this simple representation first, before
we can use it to compute any meaningful document vs. document similarities.
Transformations are covered in the <a class="reference internal" href="tut2.html"><em>next tutorial</em></a>, but before that, let’s
briefly turn our attention to <em>corpus persistency</em>.</p>
</div>
<div class="section" id="corpus-formats">
<span id="id2"></span><h2>Corpus Formats<a class="headerlink" href="#corpus-formats" title="Permalink to this headline">¶</a></h2>
<p>There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.
<cite>Gensim</cite> implements them via the <em>streaming corpus interface</em> mentioned earlier:
documents are read from (resp. stored to) disk in a lazy fashion, one document at
a time, without the whole corpus being read into main memory at once.</p>
<p>One of the more notable file formats is the <a class="reference external" href="http://math.nist.gov/MatrixMarket/formats.html">Matrix Market format</a>.
To save a corpus in the Matrix Market format:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">gensim</span> <span class="kn">import</span> <span class="n">corpora</span>
<span class="gp">>>> </span><span class="c"># create a toy corpus of 2 documents, as a plain Python list</span>
<span class="gp">>>> </span><span class="n">corpus</span> <span class="o">=</span> <span class="p">[[(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)],</span> <span class="p">[]]</span> <span class="c"># make one document empty, for the heck of it</span>
<span class="go">>>></span>
<span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">MmCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/corpus.mm'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
</pre></div>
</div>
<p>Other formats include <a class="reference external" href="http://svmlight.joachims.org/">Joachim’s SVMlight format</a>,
<a class="reference external" href="http://www.cs.princeton.edu/~blei/lda-c/">Blei’s LDA-C format</a> and
<a class="reference external" href="http://gibbslda.sourceforge.net/">GibbsLDA++ format</a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">SvmLightCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/corpus.svmlight'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">BleiCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/corpus.lda-c'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">LowCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/corpus.low'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
</pre></div>
</div>
<p>Conversely, to load a corpus iterator from a Matrix Market file:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">corpus</span> <span class="o">=</span> <span class="n">corpora</span><span class="o">.</span><span class="n">MmCorpus</span><span class="p">(</span><span class="s">'/tmp/corpus.mm'</span><span class="p">)</span>
</pre></div>
</div>
<p>Corpus objects are streams, so typically you won’t be able to print them directly:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span> <span class="n">corpus</span>
<span class="go">MmCorpus(2 documents, 2 features, 1 non-zero entries)</span>
</pre></div>
</div>
<p>Instead, to view the contents of a corpus:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># one way of printing a corpus: load it entirely into memory</span>
<span class="gp">>>> </span><span class="k">print</span> <span class="nb">list</span><span class="p">(</span><span class="n">corpus</span><span class="p">)</span> <span class="c"># calling list() will convert any sequence to a plain Python list</span>
<span class="go">[[(1, 0.5)], []]</span>
</pre></div>
</div>
<p>or</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># another way of doing it: print one document at a time, making use of the streaming interface</span>
<span class="gp">>>> </span><span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">corpus</span><span class="p">:</span>
<span class="gp">>>> </span> <span class="k">print</span> <span class="n">doc</span>
<span class="go">[(1, 0.5)]</span>
<span class="go">[]</span>
</pre></div>
</div>
<p>The second way is obviously more memory-friendly, but for testing and development
purposes, nothing beats the simplicity of calling <tt class="docutils literal"><span class="pre">list(corpus)</span></tt>.</p>
<p>To save the same Matrix Market document stream in Blei’s LDA-C format,</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">corpora</span><span class="o">.</span><span class="n">BleiCorpus</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="s">'/tmp/corpus.lda-c'</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
</pre></div>
</div>
<p>In this way, <cite>gensim</cite> can also be used as a memory-efficient <strong>I/O format conversion tool</strong>:
just load a document stream using one format and immediately save it in another format.
Adding new formats is dead easy; check out the <a class="reference external" href="https://github.com/piskvorky/gensim/blob/master/src/gensim/corpora/svmlightcorpus.py">code for the SVMlight corpus</a> for an example.</p>
<hr class="docutils" />
<p>For a complete reference (Want to prune the dictionary to a smaller size?
Convert between corpora and NumPy/SciPy arrays?), see the <a class="reference internal" href="apiref.html"><em>API documentation</em></a>.
Or continue to the next tutorial on <a class="reference internal" href="tut2.html"><em>Topics and Transformations</em></a>.</p>
<table class="docutils footnote" frame="void" id="id3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>This is the same corpus as used in
<a class="reference external" href="http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf">Deerwester et al. (1990): Indexing by Latent Semantic Analysis</a>, Table 2.</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="tut2.html" title="Topics and Transformations"
>next</a> |</li>
<li class="right" >
<a href="tutorial.html" title="Tutorials"
>previous</a> |</li>
<li><a href="index.html">Gensim home</a>| </li>
<li><a href="tutorial.html">Tutorials</a>| </li>
<li><a href="http://groups.google.com/group/gensim">Support</a>| </li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>| </li>
<li><a href="apiref.html">API reference</a>»</li>
<li><a href="tutorial.html" >Tutorials</a> »</li>
</ul>
</div>
<div class="footer">
© Copyright 2012, Radim Řehůřek <radimrehurek(at)seznam.cz>.
Last updated on Mar 09, 2012.
</div>
</body>
</html>