In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

In [2]:
import smart_open
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

unable to import 'smart_open.gcs', disabling that module


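* The [gensim-data](https://github.com/RaRe-Technologies/gensim-data) project stores a variety of corpora, models and other data. *gensim.downloader* is a Python module that downloads these resources on demand and keeps them in a local cache, so a resource that has already been fetched is loaded from disk instead of being downloaded again.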
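* The local cache can be inspected without touching the network. The sketch below is not part of gensim: `cached_data_dir` is a hypothetical helper, and it assumes the cache keeps one folder per resource directly under a base directory (`~/gensim-data` by default, which matches the paths printed in this notebook). The demo runs against a throwaway directory, not the real cache.

```python
import os
import tempfile


def cached_data_dir(name, base_dir=None):
    """Return the expected cache folder for `name`, or None if it is absent.

    Assumption: one folder per dataset or model sits directly under the
    base directory (by default ~/gensim-data).
    """
    if base_dir is None:
        base_dir = os.path.expanduser("~/gensim-data")
    path = os.path.join(base_dir, name)
    return path if os.path.isdir(path) else None


# Demonstrate against a temporary directory instead of the real cache.
with tempfile.TemporaryDirectory() as demo_base:
    os.makedirs(os.path.join(demo_base, "text8"))
    found = cached_data_dir("text8", base_dir=demo_base)
    missing = cached_data_dir("fake-news", base_dir=demo_base)
    found_name = os.path.basename(found)

print(found_name, missing)
```

Calling `cached_data_dir("text8")` with no `base_dir` checks the default location instead.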

In [3]:
import gensim.downloader as api

In [4]:
corpus = api.load('text8')

In [5]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



In [6]:
print(inspect.getfile(corpus.__class__))

/home/bjpcjp/gensim-data/text8/__init__.py


In [7]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2020-04-28 09:45:04,126 : INFO : collecting all words and their counts
2020-04-28 09:45:04,256 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-28 09:45:08,920 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-04-28 09:45:08,921 : INFO : Loading a fresh vocabulary
2020-04-28 09:45:09,144 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2020-04-28 09:45:09,145 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2020-04-28 09:45:09,344 : INFO : deleting the raw counts dictionary of 253854 items
2020-04-28 09:45:09,352 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-28 09:45:09,353 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2020-04-28 09:45:09,604 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2020-04-28 09:45:09,605 : 

* Find words that are similar to 'tree'

In [8]:
import pprint
pprint.pprint(model.wv.most_similar('tree'))

2020-04-28 09:47:21,594 : INFO : precomputing L2-norms of word weight vectors


[('trees', 0.7071799039840698),
 ('leaf', 0.6649410724639893),
 ('bark', 0.6580039262771606),
 ('bird', 0.641632080078125),
 ('flower', 0.622328519821167),
 ('fruit', 0.6136927604675293),
 ('avl', 0.6125401258468628),
 ('cactus', 0.5706822276115417),
 ('vine', 0.5641566514968872),
 ('moth', 0.5637525320053101)]
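* The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking in plain Python on hand-made three-dimensional vectors. The helper names (`cosine`, `toy_most_similar`) and the toy vectors are illustrative assumptions, not gensim's implementation and not values from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hand-made vectors, chosen only to make the ranking easy to follow.
toy_vectors = {
    "tree": [1.0, 0.0, 0.0],
    "trees": [0.9, 0.1, 0.0],
    "leaf": [0.6, 0.8, 0.0],
    "car": [0.0, 0.0, 1.0],
}
ranking = toy_most_similar(toy_vectors, "tree", topn=3)
print([word for word, _ in ranking])
```

A vector pointing in nearly the same direction as the query scores close to 1, and an orthogonal one scores 0.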


* The same API lists everything that is available for download. Call `api.info()` with no arguments to get the full catalogue of corpora and models:

In [9]:
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

There are two types of data: corpora and models.



In [10]:
print(info.keys())

dict_keys(['corpora', 'models'])


* Corpora:



In [11]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    pprint.pprint(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

'20-newsgroups (18846 records): The notorious collection of approximatel...'
('__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis '
 'of t...')
('__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] '
 'Synopsis of t...')
'fake-news (12999 records): News dataset, contains text and metadata...'
'patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...'
('quora-duplicate-questions (404290 records): Over 400,000 lines of potential '
 'question...')
('semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 '
 '/ 2017 Task 3 Subtask A una...')
('semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 '
 'Subtask B and...')
'text8 (1701 records): First 100,000,000 bytes of plain text fr...'
('wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from '
 'October 20...')
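* Catalogue listings like the one above include `__testing_` entries that are only meant for gensim's own tests. A small filter can hide them. The sketch below works on a hand-written dictionary shaped like the `api.info()` result (record counts copied from the notebook output), so it runs without a download; `public_entries` is a hypothetical helper name.

```python
def public_entries(section):
    """Return sorted (name, num_records) pairs, skipping '__testing_' entries."""
    return sorted(
        (name, data.get("num_records", -1))
        for name, data in section.items()
        if not name.startswith("__testing_")
    )


# Hand-written sample that mirrors the two top-level sections of the catalogue.
sample_info = {
    "corpora": {
        "text8": {"num_records": 1701},
        "fake-news": {"num_records": 12999},
        "__testing_matrix-synopsis": {},
    },
    "models": {
        "glove-wiki-gigaword-50": {"num_records": 400000},
        "__testing_word2vec-matrix-synopsis": {},
    },
}
corpora = public_entries(sample_info["corpora"])
models = public_entries(sample_info["models"])
print(corpora)
print(models)
```

With the real catalogue, the same function could be applied to `info['corpora']` and `info['models']`.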


* Models:

In [15]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

* Pass a name to `api.info()` to get detailed information about a single model or corpus:

In [13]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

{
    "num_records": 12999,
    "record_format": "dict",
    "file_size": 20102776,
    "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "fields": {
        "crawled": "date the story was archived",
        "ord_in_thread": "",
        "published": "date published",
        "participants_count": "number of participants",
        "shares": "number of Facebook shares",
        "replies_count": "number of replies",
        "main_img_url": "image from story",
        "spam_score": "data from webhose.io",
        "uuid": "unique identifier",
        "language": "data from webhose.io",
        "title": "title of story",
        "country": "data from webhose.io",
        "domain_rank": "data from webhose.io",
        "author": "author of story",
        "comments": "number of Facebook comments",
        "site_url": "site URL from BS detector",
        "text": "tex

* Sometimes you don't want to load the model; you just want the path to the downloaded file:

In [14]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

/home/bjpcjp/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
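* Once you have the path, you can read the file with your own code. The sketch below assumes the `.gz` file is gzip-compressed text in word2vec format: a header line with the vector count and dimensionality, then one `word v1 v2 ...` line per vector. That layout is an assumption here, and `read_word2vec_text` is a hypothetical helper, not a gensim function. The demo writes and reads a tiny file of its own rather than the real download.

```python
import gzip
import os
import tempfile


def read_word2vec_text(path, limit=None):
    """Parse a gzip-compressed word2vec text file into (count, dims, vectors).

    Assumed layout: first line "<count> <dims>", then one line per word
    holding the word followed by its space-separated vector components.
    """
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        count, dims = (int(x) for x in handle.readline().split())
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dims, vectors


# Build a two-word toy file so the example runs without any download.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "toy-vectors.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("2 3\nglass 0.1 0.2 0.3\nmetal 0.4 0.5 0.6\n")
    count, dims, vectors = read_word2vec_text(demo_path)

print(count, dims, sorted(vectors))
```

The `limit` argument stops after a given number of words, which helps when you only want to peek at a large file.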


* To load the model into memory:

In [15]:
model = api.load("glove-wiki-gigaword-50")
pprint.pprint(model.most_similar("glass"))

2020-04-28 09:49:14,570 : INFO : loading projection weights from /home/bjpcjp/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2020-04-28 09:49:35,361 : INFO : loaded (400000, 50) matrix from /home/bjpcjp/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2020-04-28 09:49:35,378 : INFO : precomputing L2-norms of word weight vectors


[('plastic', 0.7942505478858948),
 ('metal', 0.770871639251709),
 ('walls', 0.7700636386871338),
 ('marble', 0.7638524174690247),
 ('wood', 0.7624281048774719),
 ('ceramic', 0.7602593302726746),
 ('pieces', 0.7589111924171448),
 ('stained', 0.7528817057609558),
 ('tile', 0.748193621635437),
 ('furniture', 0.746385931968689)]


* Unlike a model, a corpus is never loaded into memory in full. Every corpus is wrapped in a special ``Dataset`` class that provides an ``__iter__`` method, so documents are streamed from disk one at a time as you iterate.
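* The streaming pattern is easy to reproduce for your own data. The sketch below is a minimal, hypothetical `StreamingCorpus` class in the spirit of the ``Dataset`` wrapper shown earlier. It assumes one whitespace-tokenised document per line, which is a simplification and not how `Text8Corpus` splits its input.

```python
import os
import tempfile


class StreamingCorpus:
    """Re-iterable corpus that reads one document per line from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated
        # several times, e.g. once per training epoch.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny corpus to a temporary file and stream it twice.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "toy-corpus.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("the quick brown fox\n\njumps over the lazy dog\n")
    toy_corpus = StreamingCorpus(demo_path)
    first_pass = list(toy_corpus)
    second_pass = list(toy_corpus)

print(first_pass)
```

Because `__iter__` starts a fresh read each time, the object can be consumed more than once, which a plain generator cannot.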


