In [3]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora, models, and other data.


In [4]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [5]:
import gensim.downloader as api
from gensim import utils 


Now, let's download the text8 corpus, unless it is already cached locally, and load it as an iterable Python object.




In [7]:
corpus = api.load('text8')

One of Gensim's features is simple and easy access to some common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a variety of corpora, models and other data.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
The module leverages a local cache that ensures data is downloaded at most once.
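
The "download at most once" behaviour can be pictured with a minimal sketch. This is not gensim's actual implementation: ``fetch_once`` and ``fake_download`` are hypothetical names used only for illustration, and the "download" writes a few bytes instead of touching the network.

```python
import os
import tempfile


def fetch_once(name, cache_dir, download):
    """Return the cached path for `name`, calling `download` only on a cache miss."""
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as fout:
            fout.write(download(name))
    return path


calls = []


def fake_download(name):
    # Hypothetical stand-in for a network request; records each invocation.
    calls.append(name)
    return b"payload for " + name.encode("utf8")


cache_dir = tempfile.mkdtemp()
first = fetch_once("text8", cache_dir, fake_download)   # cache miss: "downloads"
second = fetch_once("text8", cache_dir, fake_download)  # cache hit: no download
print(first == second, len(calls))
```

The second call returns the same path without invoking the downloader again, which is the property the cache is meant to guarantee.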

This tutorial:

* Retrieves the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Everything below goes through the ``api`` module imported above.




In this case, corpus is an iterable.
If you look under the covers, it has the following definition:



In [8]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [9]:
print(inspect.getfile(corpus.__class__))

C:\Users\ahmed/gensim-data\text8\__init__.py
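
Wrapping the corpus in a class with ``__iter__`` matters because training needs several passes over the data, and a plain generator can only be consumed once. The sketch below illustrates the difference under stated assumptions: ``LineCorpus`` and ``one_shot`` are hypothetical names, and the two-line text file is made up for the demonstration.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical Dataset-style wrapper: stores a path and reads it lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf8") as fin:
            for line in fin:
                yield line.split()


def one_shot(path):
    """Plain generator function: the returned generator is exhausted after one pass."""
    with open(path, encoding="utf8") as fin:
        for line in fin:
            yield line.split()


# Build a tiny two-document corpus file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf8") as fout:
    fout.write("the quick brown fox\njumps over the lazy dog\n")

docs = LineCorpus(path)
first_pass = list(docs)
second_pass = list(docs)   # same documents again: __iter__ starts over

gen = one_shot(path)
gen_first = list(gen)
gen_second = list(gen)     # empty: nothing is left to yield

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```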


Now that the corpus has been downloaded and loaded, let's train a Word2Vec model on it.




In [10]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2020-08-03 17:22:13,126 : INFO : collecting all words and their counts
2020-08-03 17:22:13,168 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-08-03 17:22:18,092 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-08-03 17:22:18,093 : INFO : Loading a fresh vocabulary
2020-08-03 17:22:18,262 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2020-08-03 17:22:18,263 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2020-08-03 17:22:18,494 : INFO : deleting the raw counts dictionary of 253854 items
2020-08-03 17:22:18,502 : INFO : sample=0.001 downsamples 38 most-common words
2020-08-03 17:22:18,503 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2020-08-03 17:22:18,676 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2020-08-03 17:22:18,677 : 

Now that we have our word2vec model, let's query it with word analogies, such as "king" - "man" + "woman", and then save and reload the word vectors.




In [13]:
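# Analogy queries: king - man + woman, then baghdad - london + england.
print(model.wv.most_similar(positive=["woman", "king"], negative=["man"]))
print(model.wv.most_similar_cosmul(positive=["woman", "king"], negative=["man"]))
print(model.wv.most_similar(positive=["baghdad", "england"], negative=["london"]))
print(model.wv.most_similar_cosmul(positive=["baghdad", "england"], negative=["london"]))

# Save only the word vectors, then reload them. From here on, `model` holds the
# reloaded keyed vectors rather than the full Word2Vec model.
model.wv.save("1.bin")
model = model.wv.load("1.bin")
print(model.most_similar_cosmul(positive=["baghdad", "england"], negative=["london"]))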



2020-08-03 17:25:37,400 : INFO : saving Word2VecKeyedVectors object under 1.bin, separately None
2020-08-03 17:25:37,401 : INFO : not storing attribute vectors_norm
[('queen', 0.7042675018310547), ('emperor', 0.6428842544555664), ('prince', 0.6371762752532959), ('empress', 0.6277985572814941), ('princess', 0.6248342394828796), ('throne', 0.6222509741783142), ('daughter', 0.5983394384384155), ('mary', 0.5940885543823242), ('son', 0.5868792533874512), ('regent', 0.5725945234298706)]
[('queen', 0.9349927306175232), ('emperor', 0.8972660303115845), ('empress', 0.8960575461387634), ('princess', 0.8838215470314026), ('prince', 0.8807331323623657), ('throne', 0.8766660690307617), ('mary', 0.8723936676979065), ('daughter', 0.8676536083221436), ('son', 0.8573718667030334), ('elizabeth', 0.8516519069671631)]
[('sicily', 0.7211719155311584), ('gaul', 0.6693712472915649), ('tripoli', 0.6660892367362976), ('normandy', 0.657295286655426), ('persia', 0.6572530269622803), ('ethiopia', 0.65650671720504
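
To make the two scoring schemes more concrete, here is a minimal sketch of the additive and multiplicative analogy objectives as they are commonly described. It is an illustration, not gensim's code: the three-dimensional vectors are invented, and ``analogy_add`` and ``analogy_mul`` are hypothetical helper names.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cos(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


# Invented toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ unrelated.
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man": [0.1, 1.0, 0.0],
    "woman": [0.1, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def analogy_add(positive, negative):
    """Additive objective: rank by cosine to (sum of positives - sum of negatives)."""
    query = [0.0] * 3
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    scores = [(w, cos(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def analogy_mul(positive, negative, eps=1e-6):
    """Multiplicative objective: product of shifted cosines to the positives,
    divided by the product of shifted cosines to the negatives."""
    exclude = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        num = math.prod((1 + cos(vec, vectors[p])) / 2 for p in positive)
        den = math.prod((1 + cos(vec, vectors[n])) / 2 for n in negative) + eps
        scores.append((word, num / den))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(analogy_add(["woman", "king"], ["man"])[0][0])
print(analogy_mul(["woman", "king"], ["man"])[0][0])
```

Both objectives exclude the query words themselves from the candidates. The multiplicative scores are ratios rather than cosines, so the two sets of scores are not on the same scale and should not be compared with each other.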

In [53]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

print(model.similarity('woman', 'man'))
print(model.similar_by_word("hello"))
# wmdistance relies on an optional dependency (pyemd in gensim 3.x).
print(model.wmdistance(["hello"], ["hello"]))

In [69]:
from nltk.corpus import stopwords

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Drop English stop words; needs the NLTK stop word list, e.g. nltk.download('stopwords').
stop_words = set(stopwords.words('english'))
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]

# `model` is the keyed-vectors object reloaded earlier, so it is queried directly.
print(model.most_similar("kill"))
sim = model.n_similarity(['kill', 'man'], ['murder', 'man'])
print("{:.4f}".format(sim))

[('summon', 0.7948280572891235), ('steal', 0.7809569835662842), ('destroy', 0.7747186422348022), ('hide', 0.7598826885223389), ('avenge', 0.7469691634178162), ('commit', 0.7427475452423096), ('devour', 0.7037591934204102), ('decapitate', 0.6890674829483032), ('heal', 0.6874472498893738), ('inflict', 0.6832908391952515)]
0.8443
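
A set-to-set similarity such as the ``n_similarity`` call above can be understood as the cosine between the mean vectors of the two word sets. The sketch below shows that idea with invented two-dimensional vectors; ``toy_vectors`` and the helper functions are hypothetical and are not taken from gensim.

```python
import math

# Invented toy vectors, for illustration only.
toy_vectors = {
    "kill": [1.0, 0.2],
    "murder": [0.9, 0.3],
    "man": [0.1, 1.0],
}


def mean_vector(words):
    """Component-wise mean of the vectors of `words`."""
    rows = [toy_vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def set_similarity(words_a, words_b):
    """Cosine between the mean vectors of two word sets."""
    return cosine(mean_vector(words_a), mean_vector(words_b))


sim_toy = set_similarity(["kill", "man"], ["murder", "man"])
print("{:.4f}".format(sim_toy))
```

Because both sets share the word "man" and the other two vectors point in similar directions, the toy score is close to 1. The measure is symmetric in its two arguments.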


The API gives access to many corpora and models. To list everything that is available, use:




In [17]:
import json
info = api.info()
print(json.dumps(info, indent=4))

it__.py",
            "license": "https://dumps.wikimedia.org/legal.html",
            "fields": {
                "section_texts": "list of body of sections",
                "section_titles": "list of titles of sections",
                "title": "Title of wiki article"
            },
            "description": "Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz`",
            "checksum-0": "a7d7d7fd41ea7e2d7fa32ec1bb640d71",
            "checksum-1": "b2683e3356ffbca3b6c2dca6e9801f9f",
            "checksum-2": "c5cde2a9ae77b3c4ebce804f6df542c2",
            "checksum-3": "00b71144ed5e3aeeb885de84f7452b81",
            "file_name": "wiki-english-20171001.gz",
            "read_more": [
                "https://dumps.wikimedia.org/enwiki/20171001/"
            ],
            "parts": 4
        },
        "text8": {
            "num_records": 1701,
            "record_format": "list of 

There are two types of data: corpora and models.



In [18]:
print(info.keys())

dict_keys(['corpora', 'models'])


Let's have a look at the available corpora:



In [19]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

20-newsgroups (18846 records): The notorious collection of approximatel...
__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
fake-news (12999 records): News dataset, contains text and metadata...
patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...
quora-duplicate-questions (404290 records): Over 400,000 lines of potential question...
semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 / 2017 Task 3 Subtask A una...
semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 Subtask B and...
text8 (1701 records): First 100,000,000 bytes of plain text fr...
wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from October 20...


... and the same for models:



In [20]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai
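
The two listing loops differ only in the section they read, so they can be folded into one helper. The sketch below assumes a dictionary shaped like the ``api.info()`` output shown earlier; ``sample_info`` is a hand-written stand-in with hypothetical entries, and ``summarize`` is a hypothetical helper name.

```python
def summarize(section, width=40):
    """Return one summary line per entry, sorted by name.

    Entries without a `num_records` field are reported as -1 records.
    """
    lines = []
    for name, data in sorted(section.items()):
        lines.append(
            "%s (%d records): %s..." % (
                name,
                data.get("num_records", -1),
                data["description"][:width],
            )
        )
    return lines


# Hand-written stand-in for the dictionary returned by api.info().
sample_info = {
    "corpora": {
        "toy-corpus": {
            "num_records": 3,
            "description": "Hypothetical corpus used only for this sketch",
        },
    },
    "models": {
        "toy-vectors": {
            "description": "Hypothetical model entry without a record count",
        },
    },
}

for kind in sorted(sample_info):
    print(kind)
    for line in summarize(sample_info[kind]):
        print("  " + line)
```

With the real catalogue, the same helper could be called as ``summarize(info['corpora'])`` or ``summarize(info['models'])``.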

To get detailed information about a particular model or corpus, pass its name to ``api.info``:




In [None]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

Sometimes you do not want to load the model into memory and only need the path to its file. In that case, pass ``return_path=True``:




In [21]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

2020-08-02 14:08:37,421 : INFO : glove-wiki-gigaword-50 downloaded
C:\Users\ahmed/gensim-data\glove-wiki-gigaword-50\glove-wiki-gigaword-50.gz
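
Having the path is useful when you want to inspect the file without loading every vector. The sketch below assumes the file is a gzip-compressed text file in word2vec format, that is, a header line with the vocabulary size and dimensionality followed by one word and its values per line. That assumption has not been checked against every model in the catalogue, so the sketch runs on a tiny file it writes itself; ``peek_word2vec_text`` is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def peek_word2vec_text(path, limit=2):
    """Read the header and the first `limit` entries of a gzipped word2vec text file."""
    with gzip.open(path, "rt", encoding="utf8") as fin:
        vocab_size, dim = (int(x) for x in fin.readline().split())
        entries = []
        for _, line in zip(range(limit), fin):
            word, *values = line.rstrip().split(" ")
            entries.append((word, [float(v) for v in values]))
    return vocab_size, dim, entries


# Write a tiny made-up file so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "toy-vectors.gz")
with gzip.open(sample_path, "wt", encoding="utf8") as fout:
    fout.write("2 3\nglass 0.1 0.2 0.3\nplastic 0.4 0.5 0.6\n")

vocab_size, dim, entries = peek_word2vec_text(sample_path)
print(vocab_size, dim, [word for word, _ in entries])
```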


To load the model into memory instead, omit ``return_path``:




In [22]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

2020-08-02 14:08:57,577 : INFO : loading projection weights from C:\Users\ahmed/gensim-data\glove-wiki-gigaword-50\glove-wiki-gigaword-50.gz
2020-08-02 14:09:22,259 : INFO : loaded (400000, 50) matrix from C:\Users\ahmed/gensim-data\glove-wiki-gigaword-50\glove-wiki-gigaword-50.gz
2020-08-02 14:09:22,281 : INFO : precomputing L2-norms of word weight vectors


[('plastic', 0.7942505478858948),
 ('metal', 0.770871639251709),
 ('walls', 0.7700636386871338),
 ('marble', 0.7638524174690247),
 ('wood', 0.7624281048774719),
 ('ceramic', 0.7602593302726746),
 ('pieces', 0.7589111924171448),
 ('stained', 0.7528817057609558),
 ('tile', 0.748193621635437),
 ('furniture', 0.746385931968689)]

Corpora, in contrast, are never loaded into memory in full: every corpus is wrapped in a special ``Dataset`` class that provides an ``__iter__`` method.




In [23]:
# Pass `negative` as a list. A bare string such as "man" can be iterated character by
# character ('m', 'a', 'n') by older gensim releases, which yields meaningless neighbours.
model.most_similar(positive=["woman", "king"], negative=["man"])

In [36]:
from gensim.models.keyedvectors import KeyedVectors

# Load the GoogleNews vectors from a local copy in the binary word2vec format.
t_model = KeyedVectors.load_word2vec_format(
    'GoogleWord2Vec/GoogleNews-vectors-negative300.bin', binary=True
)