# iphone2Vec

Based on Siraj Raval's tutorial video/repository; original © Yuriy Guts, 2016.
Adapted to cluster the words within a particular text.

Sample text kindly compiled by Eugene Yuen

#### Note: Converted to run on Python 3 and gensim-3.4.0


## Imports

In [2]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [3]:
import nltk
import gensim.models.word2vec as w2v
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# from sklearn.cluster import KMeans
# from sklearn.feature_extraction.text import TfidfVectorizer

from pprint import pprint

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gabrielmanuelsidik/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gabrielmanuelsidik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gabrielmanuelsidik/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Prepare Corpus

**Load books from files**

In [7]:
book_filenames = sorted(glob.glob("data/iphone.txt"))


In [8]:
print("Found books:")
print(book_filenames)

Found books:
['data/iphone.txt']


**Combine the books into one string**

In [9]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading 'data/iphone.txt'...
Corpus is now 507121 characters long



**Split the corpus into sentences**

In [10]:
# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [11]:
raw_sentences = tokenizer.tokenize(corpus_raw)
# raw_sentences = raw_sentences[0].split("\n")
print(raw_sentences)



In [12]:
# Convert a raw sentence into a list of words:
# replace every non-letter (including hyphens) with a space, split on whitespace,
# drop stop words, then lowercase and lemmatize what remains.

def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    filtered_words = []
    
    for word in words:
        if word not in stop_words:
            filtered_words.append(lemmatizer.lemmatize(word.lower()))
    
    return filtered_words
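
One detail worth noting in the function above: the stop-word test runs on the original casing, so a capitalised stop word such as "The" passes through and is only lowercased afterwards. A minimal sketch of the alternative order (lowercase first, then filter) follows. The function name and the tiny hard-coded stop list are hypothetical, and lemmatization is skipped so the snippet runs without NLTK data; it is an illustration, not the notebook's actual pipeline.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's English stop words.
DEMO_STOP_WORDS = {"the", "of", "to", "your", "and"}

def sentence_to_wordlist_lowercase_first(raw, stop_words=DEMO_STOP_WORDS):
    """Keep letters only, lowercase, then drop stop words (no lemmatization)."""
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return [word for word in clean.lower().split() if word not in stop_words]

print(sentence_to_wordlist_lowercase_first("The iPhone's battery: dispose of it safely."))
```

Switching the real function to this order would change the token counts reported later in the notebook.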

In [13]:
# Build the list of sentences, each one a list of cleaned tokens
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [14]:
print(raw_sentence)
print(sentence_to_wordlist(raw_sentence))
print(sentences[-1])

Dispose of batteries according to your local environmental laws and guidelines.
['dispose', 'battery', 'according', 'local', 'environmental', 'law', 'guideline']
['dispose', 'battery', 'according', 'local', 'environmental', 'law', 'guideline']


In [15]:
token_count = sum([len(sentence) for sentence in sentences])
print("The corpus contains {0:,} tokens".format(token_count))

The corpus contains 56,022 tokens


## Train Word2Vec

In [16]:
# Step 3: build the model.
# Once we have vectors, they help with three main tasks:
# distance, similarity, and ranking.

# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train,
# but can also be more accurate.
num_features = 300

# Minimum word count threshold.
min_word_count = 1

# Number of threads to run in parallel.
# More workers means faster training.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling setting for frequent words.
# A value between 0 and 1e-5 is suggested; 1e-3 is used here.
downsampling = 1e-3

# Seed for the random number generator, to make the results reproducible.
# Deterministic runs are good for debugging.
# Note: with more than one worker, thread scheduling can still cause run-to-run variation.
seed = 1
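
To make the downsampling setting more concrete, here is a minimal sketch of the keep-probability formula used by the original word2vec C code, which gensim follows as far as I can tell. The function name `keep_probability` and the toy counts are assumptions for illustration; they are not taken from this corpus.

```python
import math

def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word during training."""
    threshold_count = sample * total_words   # `sample` expressed as a word count
    ratio = word_count / threshold_count
    # Frequent words (large ratio) are kept less often; rare words are always kept.
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)

# Toy numbers: one word seen 4,000 times and one seen 50 times in 1,000,000 tokens.
print(keep_probability(4000, 1_000_000))
print(keep_probability(50, 1_000_000))
```

A smaller `sample` value lowers the threshold, so more of the frequent words are thinned out.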

In [17]:
iphone2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
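
With `sg=1` the model is trained in skip-gram mode, where each word is used to predict the words around it inside the context window. The sketch below lists the (center, context) pairs for one short sentence. It is a simplification: `skipgram_pairs` is a hypothetical helper, and it uses a fixed window, whereas gensim, to my understanding, randomly shrinks the window for each center word.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for each token and its neighbours within `window`."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

sentence = ["dispose", "battery", "according", "local"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```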

In [18]:
iphone2vec.build_vocab(sentences)

2018-08-07 16:52:19,807 : INFO : collecting all words and their counts
2018-08-07 16:52:19,809 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-08-07 16:52:19,826 : INFO : collected 3393 word types from a corpus of 56022 raw words and 6944 sentences
2018-08-07 16:52:19,827 : INFO : Loading a fresh vocabulary
2018-08-07 16:52:19,836 : INFO : min_count=1 retains 3393 unique words (100% of original 3393, drops 0)
2018-08-07 16:52:19,837 : INFO : min_count=1 leaves 56022 word corpus (100% of original 56022, drops 0)
2018-08-07 16:52:19,846 : INFO : deleting the raw counts dictionary of 3393 items
2018-08-07 16:52:19,848 : INFO : sample=0.001 downsamples 72 most-common words
2018-08-07 16:52:19,849 : INFO : downsampling leaves estimated 46628 word corpus (83.2% of prior 56022)
2018-08-07 16:52:19,859 : INFO : estimated required memory for 3393 words and 300 dimensions: 9839700 bytes
2018-08-07 16:52:19,860 : INFO : resetting layer weights


In [19]:
print("Word2Vec vocabulary length:", len(iphone2vec.wv.vocab))

Word2Vec vocabulary length: 3393


**Start training, this might take a minute or two...**

In [20]:
iphone2vec.train(sentences, total_examples=iphone2vec.corpus_count, epochs=10)

2018-08-07 16:52:24,828 : INFO : training model with 4 workers on 3393 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2018-08-07 16:52:25,024 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-08-07 16:52:25,030 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-08-07 16:52:25,098 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-08-07 16:52:25,119 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-08-07 16:52:25,123 : INFO : EPOCH - 1 : training on 56022 raw words (46563 effective words) took 0.3s, 167429 effective words/s
2018-08-07 16:52:25,250 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-08-07 16:52:25,253 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-08-07 16:52:25,323 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-08-07 16:52:25,351 : INFO : worker thread finished; awaiting finish of 0 mor

(466160, 560220)

**Save to file, can be useful later**

In [21]:
if not os.path.exists("trained"):
    os.makedirs("trained")

In [22]:
iphone2vec.save(os.path.join("trained", "iphone2vec.w2v"))

2018-08-07 16:52:31,758 : INFO : saving Word2Vec object under trained/iphone2vec.w2v, separately None
2018-08-07 16:52:31,759 : INFO : not storing attribute vectors_norm
2018-08-07 16:52:31,762 : INFO : not storing attribute cum_table
2018-08-07 16:52:31,860 : INFO : saved trained/iphone2vec.w2v


## Explore the trained model.

In [23]:
iphone2vec = w2v.Word2Vec.load(os.path.join("trained", "iphone2vec.w2v"))

2018-08-07 16:52:34,418 : INFO : loading Word2Vec object from trained/iphone2vec.w2v
2018-08-07 16:52:34,492 : INFO : loading wv recursively from trained/iphone2vec.w2v.wv.* with mmap=None
2018-08-07 16:52:34,493 : INFO : setting ignored attribute vectors_norm to None
2018-08-07 16:52:34,495 : INFO : loading vocabulary recursively from trained/iphone2vec.w2v.vocabulary.* with mmap=None
2018-08-07 16:52:34,497 : INFO : loading trainables recursively from trained/iphone2vec.w2v.trainables.* with mmap=None
2018-08-07 16:52:34,499 : INFO : setting ignored attribute cum_table to None
2018-08-07 16:52:34,502 : INFO : loaded trained/iphone2vec.w2v


### Compress the word vectors into 2D space and plot them

In [57]:
# VISUALIZATIONS

# my video - how to visualize a dataset easily
# tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [58]:
# all_word_vectors_matrix = iphone2vec.wv.vectors

**Train t-SNE, this could take a minute or two...**

In [59]:
# all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

**Plot the big picture**

In [60]:
# points = pd.DataFrame(
#     [
#         (word, coords[0], coords[1])
#         for word, coords in [
#             (word, all_word_vectors_matrix_2d[iphone2vec.wv.vocab[word].index])
#             for word in iphone2vec.wv.vocab
#         ]
#     ],
#     columns=["word", "x", "y"]
# )

In [61]:
# points.head(10)

In [62]:
# sns.set_context("poster")

In [63]:
# points.plot.scatter("x", "y", s=10, figsize=(20, 12))

**Zoom in to some interesting places**

In [64]:
# def plot_region(x_bounds, y_bounds):
#     slice = points[
#         (x_bounds[0] <= points.x) &
#         (points.x <= x_bounds[1]) & 
#         (y_bounds[0] <= points.y) &
#         (points.y <= y_bounds[1])
#     ]
    
#     ax = slice.plot.scatter("x", "y", s=35, c=None, figsize=(10, 8))
#     for i, point in slice.iterrows():
#         ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

**Example region (plot call commented out)**

In [65]:
# plot_region(x_bounds=(-110, -95), y_bounds=(-20, -15))

**Can't find anything particularly interesting anymore :(**

In [66]:
# plot_region(x_bounds=(70, 80), y_bounds=(0, 20))

### Explore semantic similarities between words

**Words closest to the given word**

In [67]:
iphone2vec.wv.most_similar("power")

2018-08-07 11:55:13,389 : INFO : precomputing L2-norms of word weight vectors


[('adapter', 0.9425086379051208),
 ('compatible', 0.9358525276184082),
 ('charging', 0.9280827641487122),
 ('cable', 0.9280375838279724),
 ('connector', 0.9238523840904236),
 ('port', 0.9231459498405457),
 ('lightning', 0.9199609756469727),
 ('outlet', 0.9162002801895142),
 ('low', 0.9083038568496704),
 ('charger', 0.8988542556762695)]

In [69]:
iphone2vec.wv.most_similar("internet")

[('connection', 0.9563076496124268),
 ('connecting', 0.9117547869682312),
 ('transfer', 0.907412588596344),
 ('connects', 0.9065635800361633),
 ('join', 0.9020034670829773),
 ('lte', 0.8888698816299438),
 ('local', 0.8847479820251465),
 ('via', 0.8815390467643738),
 ('streaming', 0.8803442716598511),
 ('connected', 0.876605749130249)]

In [70]:
iphone2vec.wv.most_similar("music")

[('listen', 0.9246533513069153),
 ('podcasts', 0.9136156439781189),
 ('subscriber', 0.8771145343780518),
 ('podcast', 0.8657570481300354),
 ('listening', 0.8601361513137817),
 ('enjoy', 0.8511674404144287),
 ('synced', 0.8473389148712158),
 ('million', 0.8468717336654663),
 ('expert', 0.839187741279602),
 ('song', 0.8385128974914551)]
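
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of what that ranking does, here is a plain-Python sketch over made-up 3-D vectors. The helper names and the toy vectors are assumptions; the real model uses 300-dimensional vectors and gensim's own implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration only.
toy_vectors = {
    "power": [1.0, 0.2, 0.0],
    "charger": [0.9, 0.3, 0.1],
    "music": [0.0, 0.1, 1.0],
}
print(most_similar("power", toy_vectors))
```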

**Linear relationships between word pairs**

In [71]:
# def nearest_similarity_cosmul(start1, end1, end2):
#     similarities = iphone2vec.wv.most_similar_cosmul(
#         positive=[end2, start1],
#         negative=[end1]
#     )
#     start2 = similarities[0][0]
#     print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
#     return start2

In [72]:
# nearest_similarity_cosmul("screen", "brightness", "loud")

# Time to use this data and model to create the bins

Iterating through each word:

- Bin the words into clusters of similar words. The similarity threshold can be set (see: cosine distance against the average word), using either a greedy algorithm or k-means clustering.
- Note: the greedy approach would be O(n^2), where n is the number of words in the total text.
- Also keep a tally of the categories that each sentence's words fall into (still O(n^2)).

The resulting output includes:

1. 2-D array of sentences - categories
2. 2-D array of categories - words
3. 2-D array of categories - sentences

All of these can be made in O(n) if greedy is used.
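
The plan above can be sketched in plain Python on toy data. This is a simplified illustration under stated assumptions: `greedy_bins` is a hypothetical function, the 2-D vectors are made up, and each word is compared against the first word of a category rather than an average word.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bins(sentences, vectors, threshold):
    """Put each word in the first category whose seed word is similar enough."""
    categories = []            # categories -> words
    word_to_category = {}
    sentence_categories = []   # sentences -> categories
    for sentence in sentences:
        seen = set()
        for word in sentence:
            if word not in word_to_category:
                for index, members in enumerate(categories):
                    if cosine_similarity(vectors[members[0]], vectors[word]) > threshold:
                        members.append(word)
                        word_to_category[word] = index
                        break
                else:  # no existing category matched: open a new one
                    categories.append([word])
                    word_to_category[word] = len(categories) - 1
            seen.add(word_to_category[word])
        sentence_categories.append(seen)
    # Invert the sentence tally to get categories -> sentences.
    category_sentences = [set() for _ in categories]
    for sentence_index, found in enumerate(sentence_categories):
        for category_index in found:
            category_sentences[category_index].add(sentence_index)
    return categories, sentence_categories, category_sentences

toy_vectors = {"charger": [1.0, 0.1], "cable": [0.9, 0.2],
               "song": [0.1, 1.0], "podcast": [0.2, 0.9]}
toy_sentences = [["charger", "song"], ["cable"], ["podcast", "cable"]]
print(greedy_bins(toy_sentences, toy_vectors, threshold=0.9))
```

The three returned structures correspond to the three outputs listed above, with Python lists and sets standing in for the 2-D arrays.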

In [73]:
# iphone2vec.wv.similarity("listen", "play")

In [74]:
# iphone2vec.wv.similarity("listen", "network")

In [75]:
# iphone2vec.wv.similarity("listen", "library")

In [76]:
# iphone2vec.wv.similarity("listen", "songs")

In [77]:
# iphone2vec.wv.similarity("turn", "songs")

In [78]:
# iphone2vec.wv.similarity("songs", "iPhone")

In [79]:
# print(iphone2vec.wv.get_vector("speaker"))

In [24]:
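class Category:
    """A cluster of words, represented by the vocabulary word nearest to their mean vector."""
    
    def __init__(self, word, threshold = 0.9): 
        # Minimum cosine similarity a word needs to join this category.
        self.threshold = threshold
        self.words = set()
        self.words.add(word)
        self.num_of_words = 1
        self.average_vector = iphone2vec.wv.get_vector(word)
        self.average_word = word
        
    def add_word(self, word):
        if (word not in self.words):
            self.words.add(word)

            # Incrementally update the mean vector of the category.
            self.average_vector = (
                iphone2vec.wv.get_vector(word) 
                + self.average_vector * self.num_of_words
            )/(self.num_of_words + 1)

            # The "average word" is the vocabulary word closest to the mean vector.
            self.average_word = iphone2vec.wv.similar_by_vector(self.average_vector, topn = 1)[0][0]

            self.num_of_words += 1
    
    def is_similar(self, word):
        # Compare against the average word, not the raw mean vector.
        return iphone2vec.wv.similarity(self.average_word, word) > self.threshold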
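
The update in `add_word` is an incremental mean: the old average is weighted by the number of words already in the category, the new vector is added, and the sum is divided by the new count. The sketch below checks that idea on plain lists. `update_mean` is a hypothetical helper and the vectors are made up.

```python
def update_mean(mean, count, new_vector):
    """Fold one more vector into the mean of `count` earlier vectors."""
    return [(x + m * count) / (count + 1) for m, x in zip(mean, new_vector)]

vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

mean = vectors[0]
for count, vector in enumerate(vectors[1:], start=1):
    mean = update_mean(mean, count, vector)

# Compare with the mean computed over all vectors at once.
batch_mean = [sum(axis) / len(vectors) for axis in zip(*vectors)]
print(mean, batch_mean)
```

Both approaches give the same vector, but the incremental form avoids storing every member vector in the category.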
    


In [25]:
test_category = Category("speaker")

In [82]:
test_set = {"one", "two"}

In [83]:
test_set.add("three")

In [84]:
test_set.add("one")

In [85]:
print(test_set)

{'three', 'one', 'two'}


In [86]:
test_category.add_word("sound")

In [87]:
print(test_category.words)
print(test_category.average_word)
print(test_category.is_similar('speaker'))

{'sound', 'speaker'}
sound
False


In [26]:
categories = []
for sentence in sentences:
    for word in sentence:

        is_added = False

        for category in categories:
            if (category.is_similar(word)):
                category.add_word(word)
                is_added = True
                break

        if (not is_added):
            categories.append(Category(word, 0.80))
            

2018-08-07 16:52:56,610 : INFO : precomputing L2-norms of word weight vectors


In [89]:
for category in categories:
    print(category.words)

{'marked', 'skin', 'believe', 'executed', 'below', 'dealer', 'exit', 'property', 'where', 'backwards', 'adding', 'flight', 'generally', 'slow', 'step', 'correction', 'aircraft', 'hint', 'instant', 'ev', 'quoted', 'eye', 'improper', 'else', 'beyond', 'flammable', 'easily', 'magnify', 'multitask', 'representative', 'teletype', 'splash', 'resetting', 'speaks', 'highlighting', 'reject', 'notify', 'act', 'dictionary', 'www', 'black', 'specified', 'accepted', 'facemarks', 'latest', 'distracting', 'detects', 'personally', 'follower', 'sings', 'ratio', 'empty', 'pacemaker', 'well', 'contrast', 'sdh', 'participant', 'wubihua', 'week', 'thai', 'fraud', 'kit', 'closed', 'defibrillator', 'coffee', 'alex', 'syllable', 'adjustment', 'statistical', 'entire', 'choking', 'slot', 'routed', 'body', 'utilisant', 'lei', 'keeping', 'indicates', 'headset', 'exempt', 'head', 'warranty', 'gone', 'alphabetically', 'busy', 'diagnostics', 'rather', 'large', 'editor', 'over', 'boarding', 'dispose', 'revisit', 'pro

In [92]:
iphone2vec.wv.most_similar("german")

[('spanish', 0.9967426061630249),
 ('french', 0.9955888986587524),
 ('modethese', 0.9944429993629456),
 ('italian', 0.9941964149475098),
 ('acid', 0.9923858046531677),
 ('infrared', 0.9915544986724854),
 ('translates', 0.99146568775177),
 ('soapy', 0.9914011359214783),
 ('flashing', 0.9913240075111389),
 ('property', 0.99131178855896)]

In [29]:
from nltk.cluster import KMeansClusterer

# all_words = list(iphone2vec.wv.vocab.keys())
# all_vectors = []

# for word in all_words:
#     all_vectors.append(iphone2vec.wv.get_vector(word))
    
# all_vectors = [numpy.array(vector) for vector in iphone2vec.wv.vectors]
all_vectors = iphone2vec.wv.vectors

# need to convert the words into their vector representations!

NUM_CLUSTERS = 10
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance = nltk.cluster.util.euclidean_distance, repeats=25)
assigned_clusters = kclusterer.cluster(all_vectors, assign_clusters=True)

ValueError: math domain error
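
The NLTK call above ends in an error that this notebook does not diagnose. As a point of comparison, here is a minimal sketch of the k-means (Lloyd's) algorithm in plain Python on toy 2-D points. It is not NLTK's implementation: the function names, the deterministic farthest-point initialisation, and the rule that an empty cluster keeps its previous centroid are all assumptions of this sketch.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids

toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

For the real word vectors, each row of `iphone2vec.wv.vectors` would play the role of a point.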