# Word Embeddings Tutorial

We will be using some utilities from the Gensim package.
For more details on the implementation of word2vec in gensim, see this tutorial: https://rare-technologies.com/word2vec-tutorial/.

## Install Gensim.

In [59]:
!pip3 install gensim

Defaulting to user installation because normal site-packages is not writeable


Define a helper function for downloading files.

In [60]:
import urllib.request
import os

def download_file(url, local_file):
    """
    Helper function to download a file and store it locally
    """
    if not os.path.exists(local_file):
        print('Downloading')
        with urllib.request.urlopen(url) as opener, \
             open(local_file, mode='wb') as outfile:
                    outfile.write(opener.read())
    else:
        print('Already downloaded')

Let's use a corpus from Peter Norvig.

In [61]:
big_text_file = 'data/big.txt'
download_file('http://norvig.com/big.txt', big_text_file)

Already downloaded


Look at the first 10 lines of the file, then convert each line into a list of tokens:

In [62]:
with open(big_text_file, mode='r', encoding='utf-8') as infile:
    for i in range(10):
        line = next(infile)
        print(line)

The Project Gutenberg EBook of The Adventures of Sherlock Holmes

by Sir Arthur Conan Doyle

(#15 in our series by Sir Arthur Conan Doyle)



Copyright laws are changing all over the world. Be sure to check the

copyright laws for your country before downloading or redistributing

this or any other Project Gutenberg eBook.



This header should be the first thing seen when viewing this Project

Gutenberg file.  Please do not remove it.  Do not change or edit the



In [63]:
import re

sentences = []
with open(big_text_file, mode='r', encoding='utf-8') as infile:
    for line in infile:
        # Raw string, so that \W and \d reach the regex engine unchanged
        sentences.append(re.split(r'[\W\d_]+', line.lower()))
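
To see what this tokenization produces, here is a minimal sketch applied to one line of the corpus header. The helper `tokenize` and the constant `TOKEN_SPLIT` are names introduced here for illustration only. Note that splitting leaves empty strings where a line starts or ends with punctuation; filtering them out is one possible variant, not what the cell above does.

```python
import re

# Same pattern as in the tutorial: split on runs of non-word characters,
# digits and underscores.
TOKEN_SPLIT = re.compile(r'[\W\d_]+')

def tokenize(line):
    """Lowercase a line and split it into tokens, dropping empty strings."""
    return [token for token in TOKEN_SPLIT.split(line.lower()) if token]

sample = "(#15 in our series by Sir Arthur Conan Doyle)\n"

# Plain split, as in the tutorial: empty strings appear at both ends
raw_tokens = TOKEN_SPLIT.split(sample.lower())
# Filtered variant (an assumption of this sketch)
clean_tokens = tokenize(sample)

print(raw_tokens)
print(clean_tokens)
```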

Enable logging to follow what gensim is doing.

In [64]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Train a `word2vec` model on this corpus.

In [65]:
from gensim.models import Word2Vec
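# Parameter names follow gensim 3.x; gensim 4.x renames `size` to `vector_size` and `iter` to `epochs`.
model = Word2Vec(sentences, size=100, window=10, min_count=5, sg=1, iter=20, workers=8, negative=10)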

2020-02-27 09:26:03,901 : INFO : collecting all words and their counts
2020-02-27 09:26:03,903 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-27 09:26:03,948 : INFO : PROGRESS: at sentence #10000, processed 163547 words, keeping 10860 word types
2020-02-27 09:26:03,976 : INFO : PROGRESS: at sentence #20000, processed 268578 words, keeping 13976 word types
2020-02-27 09:26:03,998 : INFO : PROGRESS: at sentence #30000, processed 357495 words, keeping 15631 word types
2020-02-27 09:26:04,025 : INFO : PROGRESS: at sentence #40000, processed 459274 words, keeping 18863 word types
2020-02-27 09:26:04,052 : INFO : PROGRESS: at sentence #50000, processed 563414 words, keeping 20226 word types
2020-02-27 09:26:04,073 : INFO : PROGRESS: at sentence #60000, processed 642721 words, keeping 21550 word types
2020-02-27 09:26:04,099 : INFO : PROGRESS: at sentence #70000, processed 740152 words, keeping 23109 word types
2020-02-27 09:26:04,125 : INFO : PROGRESS: at 

2020-02-27 09:26:21,796 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-02-27 09:26:21,803 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-02-27 09:26:21,821 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-02-27 09:26:21,872 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-02-27 09:26:21,873 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-02-27 09:26:21,883 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-27 09:26:21,895 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-27 09:26:21,989 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-27 09:26:21,990 : INFO : EPOCH - 6 : training on 1279946 raw words (824749 effective words) took 2.3s, 355006 effective words/s
2020-02-27 09:26:23,006 : INFO : EPOCH 7 - PROGRESS: at 34.87% examples, 327752 words/s, in_qsize 15, out_qsize 0
2020-02-27 09:26:24,00

2020-02-27 09:26:38,365 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-27 09:26:38,473 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-27 09:26:38,474 : INFO : EPOCH - 13 : training on 1279946 raw words (824862 effective words) took 2.3s, 356144 effective words/s
2020-02-27 09:26:39,496 : INFO : EPOCH 14 - PROGRESS: at 33.39% examples, 313355 words/s, in_qsize 15, out_qsize 0
2020-02-27 09:26:40,538 : INFO : EPOCH 14 - PROGRESS: at 84.28% examples, 353272 words/s, in_qsize 15, out_qsize 0
2020-02-27 09:26:40,637 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-02-27 09:26:40,669 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-02-27 09:26:40,671 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-02-27 09:26:40,684 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-02-27 09:26:40,697 : INFO : worker thread finished; awaiting finish of 3 more thre
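
The log above reports how many word types are kept while the vocabulary is built. As a rough illustration of what the `min_count` parameter does (this is not gensim's code), the sketch below counts tokens and discards those seen fewer than `min_count` times. The function `build_vocab` and the toy sentences are made up for this example.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Count every token and keep only those occurring at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ['the', 'dog', 'barks'],
    ['the', 'cat', 'sleeps'],
    ['the', 'dog', 'sleeps'],
]

# With min_count=2, the words seen only once ('barks', 'cat') are dropped
vocab = build_vocab(toy_sentences, min_count=2)
print(vocab)
```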

The method `most_similar` returns the 10 words closest to the input, ranked by the cosine similarity between their embedding vectors (higher scores mean more similar).

In [69]:
model.wv.most_similar('lemon')

[('parchment', 0.7194594740867615),
 ('chocolate', 0.7141339778900146),
 ('robber', 0.7136014699935913),
 ('voltaire', 0.7000627517700195),
 ('basic', 0.6810798645019531),
 ('orange', 0.6780054569244385),
 ('conscientious', 0.6769378781318665),
 ('symbols', 0.6705729961395264),
 ('magazines', 0.669700026512146),
 ('hound', 0.6680736541748047)]
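
The ranking behind such a list can be sketched in plain Python. The vectors below are toy values invented for the example, and `cosine` and `most_similar` are illustrative helpers, not the gensim implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by its cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 3-dimensional embeddings (hypothetical values)
toy_vectors = {
    'lemon': [1.0, 0.9, 0.0],
    'orange': [0.9, 1.0, 0.1],
    'hound': [0.0, 0.2, 1.0],
    'robber': [-0.5, 0.1, 0.8],
}

neighbours = most_similar('lemon', toy_vectors, topn=2)
print(neighbours)
```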

Embedding vectors are stored in `model.wv`; here is the vector of a word:

In [70]:
model.wv['project']

array([-2.5432853e-02,  1.7475882e-01, -2.4659063e-01,  1.5796936e-01,
       -1.5934931e-01, -6.4758068e-01, -2.3359986e-01, -6.1180305e-01,
       -5.1521289e-01,  6.1115885e-01, -8.6010331e-01, -2.7175543e-01,
        4.1587451e-01,  9.4346142e-01, -2.1622545e-01,  1.1367339e-01,
       -2.9938603e-02,  5.4370159e-01, -3.1393487e-02,  3.1429020e-01,
       -3.5453072e-01,  1.0455128e+00, -3.1041902e-01, -7.5869292e-01,
        1.8373707e-01, -5.8236009e-01,  7.8112155e-01,  1.1550092e+00,
        4.6694726e-01, -9.9365258e-01, -2.3194110e-02, -2.8324774e-01,
        2.7289498e-01, -7.6772898e-02,  5.0243080e-01, -6.8995404e-01,
        5.4123241e-02,  1.4126004e-02,  6.2727880e-01,  5.0530732e-01,
       -1.1450762e-01,  6.4593035e-01,  7.6572210e-01,  4.5214301e-01,
       -4.7532442e-01,  8.4757209e-01,  1.3727783e-01,  8.9591360e-01,
       -9.5968646e-01,  1.7984517e-01, -7.4194402e-02, -1.9932204e-01,
        3.1237483e-01,  8.1662911e-01, -4.3899754e-01, -1.3364074e-01,
      

## Compute the similarity among all embeddings and plot them.
We will use `sklearn`.

In [71]:
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable


In [72]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(model.wv.vectors)

In [79]:
similarities[1,2]

0.80474955

Load the analogy test and measure the accuracy:

In [73]:
test_file = 'data/questions-words.txt'
download_file('http://download.tensorflow.org/data/questions-words.txt', test_file)

model.wv.evaluate_word_analogies(test_file)

2020-02-27 09:37:38,990 : INFO : Evaluating word analogies for top 300000 words in the model on data/questions-words.txt
2020-02-27 09:37:39,014 : INFO : capital-common-countries: 20.0% (6/30)
2020-02-27 09:37:39,061 : INFO : capital-world: 12.5% (3/24)
2020-02-27 09:37:39,070 : INFO : currency: 0.0% (0/2)
2020-02-27 09:37:39,138 : INFO : city-in-state: 2.9% (3/103)


Already downloaded


2020-02-27 09:37:39,254 : INFO : family: 13.2% (36/272)
2020-02-27 09:37:39,587 : INFO : gram1-adjective-to-adverb: 0.4% (3/702)
2020-02-27 09:37:39,654 : INFO : gram2-opposite: 0.8% (1/132)
2020-02-27 09:37:39,964 : INFO : gram3-comparative: 4.6% (32/702)
2020-02-27 09:37:40,046 : INFO : gram4-superlative: 0.6% (1/156)
2020-02-27 09:37:40,310 : INFO : gram5-present-participle: 4.7% (24/506)
2020-02-27 09:37:40,450 : INFO : gram6-nationality-adjective: 2.7% (7/264)
2020-02-27 09:37:40,902 : INFO : gram7-past-tense: 15.1% (150/992)
2020-02-27 09:37:41,060 : INFO : gram8-plural: 5.0% (17/342)
2020-02-27 09:37:41,167 : INFO : gram9-plural-verbs: 0.8% (2/240)
2020-02-27 09:37:41,169 : INFO : Quadruplets with out-of-vocabulary words: 77.1%
2020-02-27 09:37:41,172 : INFO : NB: analogies containing OOV words were skipped from evaluation! To change this behavior, use "dummy4unknown=True"
2020-02-27 09:37:41,173 : INFO : Total accuracy: 6.4% (285/4467)


(0.06380120886501008,
 [{'section': 'capital-common-countries',
   'correct': [('BERLIN', 'GERMANY', 'ROME', 'ITALY'),
    ('MADRID', 'SPAIN', 'PARIS', 'FRANCE'),
    ('MADRID', 'SPAIN', 'ROME', 'ITALY'),
    ('MADRID', 'SPAIN', 'LONDON', 'ENGLAND'),
    ('PARIS', 'FRANCE', 'LONDON', 'ENGLAND'),
    ('PARIS', 'FRANCE', 'MADRID', 'SPAIN')],
   'incorrect': [('BERLIN', 'GERMANY', 'LONDON', 'ENGLAND'),
    ('BERLIN', 'GERMANY', 'MADRID', 'SPAIN'),
    ('BERLIN', 'GERMANY', 'MOSCOW', 'RUSSIA'),
    ('BERLIN', 'GERMANY', 'PARIS', 'FRANCE'),
    ('LONDON', 'ENGLAND', 'MADRID', 'SPAIN'),
    ('LONDON', 'ENGLAND', 'MOSCOW', 'RUSSIA'),
    ('LONDON', 'ENGLAND', 'PARIS', 'FRANCE'),
    ('LONDON', 'ENGLAND', 'ROME', 'ITALY'),
    ('LONDON', 'ENGLAND', 'BERLIN', 'GERMANY'),
    ('MADRID', 'SPAIN', 'MOSCOW', 'RUSSIA'),
    ('MADRID', 'SPAIN', 'BERLIN', 'GERMANY'),
    ('MOSCOW', 'RUSSIA', 'PARIS', 'FRANCE'),
    ('MOSCOW', 'RUSSIA', 'ROME', 'ITALY'),
    ('MOSCOW', 'RUSSIA', 'BERLIN', 'GERMANY'),
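
The log above says that 77.1% of the quadruplets were skipped because they contain out-of-vocabulary words. The sketch below shows how such a figure could be derived, assuming the layout of `questions-words.txt`: section headers start with `:` and every other line holds four words. `parse_analogies`, `oov_ratio`, the sample text and the toy vocabulary are all hypothetical; gensim's own evaluation differs in details such as case handling.

```python
def parse_analogies(lines):
    """Group analogy quadruplets by section; section headers start with ':'."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            if current is not None and len(words) == 4:
                sections[current].append(tuple(words))
    return sections

def oov_ratio(sections, vocabulary):
    """Fraction of quadruplets with at least one word missing from the vocabulary."""
    quads = [quad for group in sections.values() for quad in group]
    skipped = sum(1 for quad in quads
                  if any(word.lower() not in vocabulary for word in quad))
    return skipped / len(quads) if quads else 0.0

sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Berlin Germany Paris France
: family
boy girl brother sister
"""

toy_vocabulary = {'berlin', 'germany', 'paris', 'france',
                  'boy', 'girl', 'brother', 'sister'}

sections = parse_analogies(sample.splitlines())
ratio = oov_ratio(sections, toy_vocabulary)
print(sections)
print(ratio)
```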
 

# Explore embeddings from the Google News corpus

Download pretrained word embeddings from https://code.google.com/archive/p/word2vec/ and gunzip the file.
The two cells below do the work; alternatively, you can do it manually.
__WARNING__: the gz file is 1.5GB, the extracted model is 3.5GB!

In [80]:
google_w2v_file = 'data/GoogleNews-vectors-negative300.bin'
download_file('https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz',
              google_w2v_file+'.gz')

Already downloaded


The file is compressed, so it must be decompressed first:

In [81]:
!if [ ! -f data/GoogleNews-vectors-negative300.bin ]; then gunzip data/GoogleNews-vectors-negative300.bin.gz; fi
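
If the `gunzip` command is not available (for example on Windows), the same step can be done in Python. This is a minimal sketch using only the standard library; `gunzip_file` is a name introduced here, and unlike `gunzip` it leaves the `.gz` file in place.

```python
import gzip
import os
import shutil

def gunzip_file(gz_path, out_path):
    """Decompress gz_path into out_path, unless out_path already exists.

    Returns True if the file was decompressed, False if it was already there.
    """
    if os.path.exists(out_path):
        return False
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        # copyfileobj streams the data in chunks, so a multi-gigabyte
        # file is never held in memory all at once
        shutil.copyfileobj(src, dst)
    return True

# Intended usage with the names defined in the cells above:
# gunzip_file(google_w2v_file + '.gz', google_w2v_file)
```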

Google's vectors were computed some time ago and are distributed in the binary format of the original word2vec tool, so they must be loaded with a dedicated function. Use this code if you want to use Google's embeddings in your own code.

In [82]:
import gensim  # needed for gensim.models.KeyedVectors

news_model = gensim.models.KeyedVectors.load_word2vec_format(google_w2v_file, binary=True)

2020-02-27 09:44:53,270 : INFO : loading projection weights from data/GoogleNews-vectors-negative300.bin
2020-02-27 09:47:32,413 : INFO : loaded (3000000, 300) matrix from data/GoogleNews-vectors-negative300.bin


## Playing with word similarity

See which words are closest to a given word in the embedding vector space.

In [83]:
news_model.most_similar(['sun'])

2020-02-27 09:47:58,687 : INFO : precomputing L2-norms of word weight vectors


[('sunlight', 0.7269680500030518),
 ('sun_rays', 0.6871298551559448),
 ('sunshine', 0.6767958402633667),
 ('sunrays', 0.6644463539123535),
 ('noonday_sun', 0.6227596402168274),
 ('rays', 0.601547360420227),
 ('suns_rays', 0.5943776369094849),
 ('dried_tomato_basil', 0.5823874473571777),
 ('sun_shining', 0.5802727937698364),
 ('UV_rays', 0.5749224424362183)]

In [18]:
news_model.most_similar('obama')

[('mccain', 0.7319011688232422),
 ('hillary', 0.7284599542617798),
 ('obamas', 0.7229631543159485),
 ('george_bush', 0.7205674648284912),
 ('barack_obama', 0.7045838832855225),
 ('palin', 0.7043113708496094),
 ('clinton', 0.6934447884559631),
 ('clintons', 0.6816834211349487),
 ('sarah_palin', 0.6815145611763),
 ('john_mccain', 0.6800708770751953)]

In [84]:
news_model.most_similar('apple')

[('apples', 0.720359742641449),
 ('pear', 0.6450696587562561),
 ('fruit', 0.6410146355628967),
 ('berry', 0.6302294731140137),
 ('pears', 0.613396167755127),
 ('strawberry', 0.6058260798454285),
 ('peach', 0.6025872230529785),
 ('potato', 0.596093475818634),
 ('grape', 0.5935864448547363),
 ('blueberry', 0.5866668820381165)]

In [85]:
news_model.most_similar(negative=['banana'])

[('NORWALK_CONN', 0.23768547177314758),
 ('JIM_HANNON_TimesDaily', 0.2376764714717865),
 ('KITCHENER_ONTARIO', 0.23705120384693146),
 ('DENVER_CO_PRWEB', 0.23474794626235962),
 ('GRANDE_BAY_MAURITIUS', 0.23164315521717072),
 ('Dr._Parviz_Azar', 0.22826240956783295),
 ('HuMax_IL8_TM', 0.22697213292121887),
 ('subsidiary_Airstar', 0.22571393847465515),
 ('AB_OMX_Stockholm', 0.2248857468366623),
 ('MotoTron_electronic_controls', 0.22116824984550476)]

Which word is to `woman` as `king` is to `man`?

`most_similar` computes the cosine similarity between the mean of the projection weight vectors of the given words (positive words contribute positively, negative words negatively) and the vector of every word in the model.

In [86]:
news_model.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.7118191719055176),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]
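
The arithmetic behind this query can be sketched on toy data. The code below roughly follows the description above: normalize each input vector, average the positive ones with weight +1 and the negative ones with weight -1, then rank the remaining words by cosine similarity to that mean. The vectors are invented, and `analogy_query` is an illustrative helper rather than gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def analogy_query(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to the signed mean of the input words."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, weight in signed:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += weight * x / len(signed)
    mean = unit(mean)
    excluded = set(positive) | set(negative)  # input words are never returned
    scored = [(w, dot(mean, unit(v)))
              for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors (hypothetical): axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ fruit
toy_vectors = {
    'king': [1.0, 1.0, 0.0],
    'queen': [1.0, -1.0, 0.0],
    'prince': [0.8, 0.9, 0.0],
    'man': [0.1, 1.0, 0.0],
    'woman': [0.1, -1.0, 0.0],
    'apple': [0.0, 0.0, 1.0],
}

result = analogy_query(['king', 'woman'], ['man'], toy_vectors)
print(result)
```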

Define a function that completes an analogy: it returns the word that is to `y1` as `x2` is to `x1`.

In [87]:
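def analogy(x1, x2, y1):
    """Return the word that is to y1 as x2 is to x1."""
    result = news_model.most_similar(positive=[y1, x2], negative=[x1])
    # most_similar returns (word, score) pairs; keep only the best word
    return result[0][0]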

In [88]:
analogy('japan', 'japanese', 'australia')

'canada'

In [89]:
analogy('australia', 'beer', 'france')

'beers'

In [90]:
analogy('obama', 'clinton', 'reagan')

'kerry'

In [91]:
analogy('tall', 'tallest', 'long')

'longest'

In [92]:
analogy('good', 'fantastic', 'bad')

'horrible'

Which word is to `cold` as `summer` is to `warm`?

In [93]:
news_model.most_similar(['summer', 'cold'], ['warm'])

[('winter', 0.5936813950538635),
 ('spring', 0.550656259059906),
 ('summertime', 0.5165988206863403),
 ('summers', 0.5085427761077881),
 ('autumn', 0.49106645584106445),
 ('week', 0.45701587200164795),
 ('midwinter', 0.45652201771736145),
 ('Summer', 0.4489293098449707),
 ('springtime', 0.4475139081478119),
 ('month', 0.44610556960105896)]

Accuracy on the analogy test is much better than with our very small model:

In [94]:
# accuracy() is deprecated in gensim 3.x in favour of evaluate_word_analogies(), used above
news_model.accuracy(test_file)

2020-02-27 09:58:34,478 : INFO : capital-common-countries: 83.6% (423/506)
2020-02-27 09:58:42,669 : INFO : capital-world: 82.7% (1144/1383)
2020-02-27 09:58:43,400 : INFO : currency: 39.8% (51/128)
2020-02-27 09:58:56,281 : INFO : city-in-state: 74.6% (1739/2330)
2020-02-27 09:58:58,181 : INFO : family: 90.1% (308/342)
2020-02-27 09:59:02,699 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2020-02-27 09:59:04,885 : INFO : gram2-opposite: 50.5% (192/380)
2020-02-27 09:59:12,469 : INFO : gram3-comparative: 91.9% (1224/1332)
2020-02-27 09:59:16,381 : INFO : gram4-superlative: 88.0% (618/702)
2020-02-27 09:59:21,368 : INFO : gram5-present-participle: 79.8% (694/870)
2020-02-27 09:59:28,301 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2020-02-27 09:59:36,697 : INFO : gram7-past-tense: 66.5% (986/1482)
2020-02-27 09:59:42,309 : INFO : gram8-plural: 85.6% (849/992)
2020-02-27 09:59:46,321 : INFO : gram9-plural-verbs: 68.9% (484

[{'section': 'capital-common-countries',
  'correct': [('ATHENS', 'GREECE', 'BANGKOK', 'THAILAND'),
   ('ATHENS', 'GREECE', 'BEIJING', 'CHINA'),
   ('ATHENS', 'GREECE', 'BERLIN', 'GERMANY'),
   ('ATHENS', 'GREECE', 'BERN', 'SWITZERLAND'),
   ('ATHENS', 'GREECE', 'CAIRO', 'EGYPT'),
   ('ATHENS', 'GREECE', 'CANBERRA', 'AUSTRALIA'),
   ('ATHENS', 'GREECE', 'HAVANA', 'CUBA'),
   ('ATHENS', 'GREECE', 'HELSINKI', 'FINLAND'),
   ('ATHENS', 'GREECE', 'ISLAMABAD', 'PAKISTAN'),
   ('ATHENS', 'GREECE', 'MADRID', 'SPAIN'),
   ('ATHENS', 'GREECE', 'MOSCOW', 'RUSSIA'),
   ('ATHENS', 'GREECE', 'OSLO', 'NORWAY'),
   ('ATHENS', 'GREECE', 'OTTAWA', 'CANADA'),
   ('ATHENS', 'GREECE', 'PARIS', 'FRANCE'),
   ('ATHENS', 'GREECE', 'ROME', 'ITALY'),
   ('ATHENS', 'GREECE', 'STOCKHOLM', 'SWEDEN'),
   ('ATHENS', 'GREECE', 'TEHRAN', 'IRAN'),
   ('ATHENS', 'GREECE', 'TOKYO', 'JAPAN'),
   ('BAGHDAD', 'IRAQ', 'BANGKOK', 'THAILAND'),
   ('BAGHDAD', 'IRAQ', 'BEIJING', 'CHINA'),
   ('BAGHDAD', 'IRAQ', 'BERLIN', 'GERMA

Which word from the given list doesn’t go with the others?

In [95]:
news_model.doesnt_match(['sun', 'moon', 'sand', 'jupiter'])

'sand'

In [96]:
news_model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
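
One way to picture how the odd word is chosen: normalize the vectors, average them, and return the word whose vector is least similar to that average. The sketch below does this on invented 2-dimensional vectors; `doesnt_match` here is an illustrative re-implementation, not the gensim method itself.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def doesnt_match(words, vectors):
    """Return the word whose unit vector is farthest from the mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    return min(words, key=lambda w: dot(units[w], mean))

# Toy vectors (hypothetical values): three point one way, one another
toy_vectors = {
    'sun': [1.0, 0.1],
    'moon': [0.9, 0.2],
    'jupiter': [0.8, 0.1],
    'sand': [0.1, 1.0],
}

odd_one = doesnt_match(['sun', 'moon', 'sand', 'jupiter'], toy_vectors)
print(odd_one)
```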

# Visualize embeddings

Explore the vectors, by mapping them into 2 dimensions and plotting them on the plane.

In [None]:
import numpy as np

# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Use Principal Component Analysis (PCA) to reduce the dimensionality of the word embedding vectors.

In [97]:
from sklearn.decomposition import PCA

def display_pca_scatterplot(model, words=None, sample=0):
    """
    Collect from `model` the vectors for the given `words`
    (or for `sample` random words, or for the whole vocabulary).

    Apply PCA to project the vectors into 2-dimensional space, then plot them.
    """
    if words is None:
        # model.vocab is the gensim 3.x vocabulary (key_to_index in gensim 4.x)
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = list(model.vocab)

    word_vectors = np.array([model[w] for w in words])

    # Keep only the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]

    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    # Label each point with its word, slightly offset from the marker
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
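
To make the projection step less of a black box, here is a minimal NumPy sketch of PCA: center the data, take the singular value decomposition, and project onto the first two principal directions. The function `pca_2d` and the random toy data are assumptions of this example. The result should agree with `sklearn`'s `PCA` up to the sign of each axis, which is arbitrary.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value (that is, by decreasing explained variance)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data: 20 "embeddings" of dimension 5, with most variance on the first axes
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

projected = pca_2d(toy_vectors)
print(projected.shape)
```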

Show the vectors for a list of words in 2 dimensions.

In [98]:
display_pca_scatterplot(news_model, 
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'italy', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

In [107]:
# news_model is already a KeyedVectors object, so it is indexed directly (no .wv)
cosine_similarity(np.stack((news_model['beer'], news_model['italy'])))

array([[1.0000001 , 0.09480633],
       [0.09480633, 0.9999993 ]], dtype=float32)

Check the Euclidean distance between the vectors of two words:

In [None]:
np.linalg.norm(news_model['beer'] - news_model['italy'])
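
Euclidean distance and cosine similarity are closely related. For vectors of length 1, the squared distance equals `2 - 2 * cosine`, so both give the same ranking. On raw, unnormalized vectors the distance also depends on vector length. The sketch below checks the identity on two toy vectors; the helper names are introduced here for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a = unit([3.0, 4.0])
b = unit([4.0, 3.0])

cos_ab = cosine(a, b)
dist_ab = euclidean(a, b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(dist_ab ** 2, 2 - 2 * cos_ab)
```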

# Explore embeddings from the Italian Wikipedia

Load word embeddings of dimension 100, trained on the text of the Italian Wikipedia.

In [108]:
italian = gensim.models.KeyedVectors.load_word2vec_format('data/it-vectors.100.5.50.w2v')

2020-02-27 10:15:14,248 : INFO : loading projection weights from data/it-vectors.100.5.50.w2v
2020-02-27 10:16:37,501 : INFO : loaded (214689, 100) matrix from data/it-vectors.100.5.50.w2v


In [109]:
italian.most_similar(['Francia'])

2020-02-27 10:16:38,060 : INFO : precomputing L2-norms of word weight vectors


[('Belgio', 0.8703928589820862),
 ('Spagna', 0.8083794713020325),
 ('Lussemburgo', 0.7751995921134949),
 ('Svezia', 0.7663754224777222),
 ('Danimarca', 0.7570080161094666),
 ('Svizzera', 0.749411940574646),
 ('Portogallo', 0.7451502084732056),
 ('Olanda', 0.7380844950675964),
 ('Germania', 0.7370772957801819),
 ('Norvegia', 0.7283455729484558)]

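Which word is to `donna` (woman) as `re` (king) is to `uomo` (man)?

`most_similar` computes the cosine similarity between the mean of the projection weight vectors of the given words (positive words contribute positively, negative words negatively) and the vector of every word in the model.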

In [110]:
italian.most_similar(['donna', 're'], ['uomo'])

[('regina', 0.7099511623382568),
 ('consorte', 0.653093695640564),
 ('sposa', 0.6479295492172241),
 ('principessa', 0.633419394493103),
 ('Ranavalona', 0.6286108493804932),
 ('vedova', 0.6271299123764038),
 ('Brunechilde', 0.624060869216919),
 ('Haakon', 0.6232239007949829),
 ('Richilde', 0.6197010278701782),
 ('reggente', 0.6190868616104126)]

Word embeddings do not capture the distinction between the positive and negative polarity of adjectives: the nearest neighbours of `buono` (good) include its opposite, `cattivo` (bad):

In [111]:
italian.most_similar(['buono'])

[('cattivo', 0.8344665169715881),
 ('stupido', 0.8254504203796387),
 ('sciocco', 0.8137079477310181),
 ('noioso', 0.811944305896759),
 ('pigro', 0.7901057004928589),
 ('vigliacco', 0.7848994731903076),
 ('ignorante', 0.7752730846405029),
 ('carino', 0.7609102725982666),
 ('onesto', 0.7528376579284668),
 ('spregevole', 0.7483676671981812)]