# Project: Create a Word2Vec Model

### Step 1: Import libraries

In [8]:
import os
import nltk
from nltk.corpus import stopwords

### Step 2: Download stopwords
- Execute the following cell

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/rune/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Step 3: Read content and sentinize
- Initialize an empty list called **all_sentences**
- For each filename in **'files/holmes'**:
    - HINT: Use **os.listdir(...)** ([docs](https://docs.python.org/3/library/os.html#os.listdir))
- Open the file and read the content and convert to lowercase and apply **nltk.sent_tokenize** on content.
    - Use **lower()** on content.

In [11]:
all_sentences = []

for filename in os.listdir('files/holmes'):
    with open(f'files/holmes/{filename}') as f:
        text = f.read().lower()
        all_sentences += nltk.sent_tokenize(text)

In [12]:
all_sentences[0]

'                       the adventure of the speckled band\n\n     on glancing over my notes of the seventy odd cases in which i have\n     during the last eight years studied the methods of my friend sherlock\n     holmes, i find many tragic, some comic, a large number merely\n     strange, but none commonplace; for, working as he did rather for the\n     love of his art than for the acquirement of wealth, he refused to\n     associate himself with any investigation which did not tend towards\n     the unusual, and even the fantastic.'

### Step 4: Tokenize each sentence
- Get all words by applying **nltk.word_tokenize** on them and assign the result to **all_words**
    - HINT: Use list comprehension

In [15]:
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words[0][:10]

['the',
 'adventure',
 'of',
 'the',
 'speckled',
 'band',
 'on',
 'glancing',
 'over',
 'my']

### Step 5: Remove all stop words
- Use **stopwords.words('english')** to filter all the words in **all_words**
    - HINT: iterate over the length of **all_words**, for each index use list comprehension

In [16]:
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stopwords.words('english')]

In [17]:
all_words[0][:10]

['adventure',
 'speckled',
 'band',
 'glancing',
 'notes',
 'seventy',
 'odd',
 'cases',
 'last',
 'eight']

### Step 6: Remove special characters
- Iterate over items in **all_words** to remove words with special characters
    - HINT: Use **isalpha()** ([doc](https://docs.python.org/3/library/stdtypes.html#str.isalpha))

In [53]:
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w.isalpha()]

In [54]:
all_words[0][:10]

['adventure',
 'speckled',
 'band',
 'glancing',
 'notes',
 'seventy',
 'odd',
 'cases',
 'last',
 'eight']

### Step 7: Install gensim and python-Levenshtein
- Run the following cells

In [11]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.0.1-cp38-cp38-macosx_10_9_x86_64.whl (23.9 MB)
[K     |████████████████████████████████| 23.9 MB 4.1 MB/s eta 0:00:01
Collecting smart-open>=1.8.1
  Downloading smart_open-5.1.0-py3-none-any.whl (57 kB)
[K     |████████████████████████████████| 57 kB 2.4 MB/s eta 0:00:01
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.0.1 smart-open-5.1.0


In [13]:
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
[K     |████████████████████████████████| 50 kB 949 kB/s eta 0:00:01
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... [?25ldone
[?25h  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp38-cp38-macosx_10_9_x86_64.whl size=80648 sha256=380bc120ef35546e3940410f4e6a75461b9919b2fa6f1ffac9aab52e6dcad8c2
  Stored in directory: /Users/rune/Library/Caches/pip/wheels/d7/0c/76/042b46eb0df65c3ccd0338f791210c55ab79d209bcc269e2c7
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2


### Step 8: Import another library
- Run the following cell

In [55]:
from gensim.models import Word2Vec

### Step 9: Create a model
- Use **Word2Vec** on **all_words**
    - Use **min_count=2** : Ignores all words with total frequency lower than this.

In [63]:
model = Word2Vec(all_words)


### Step 10: Find distances
- Try to run **model.wv.distance('holmes', 'watson')**
- Try to run **model.wv.distance('holmes', 'water')**

In [64]:
model.wv.distance('holmes', 'watson')

0.000813901424407959

In [65]:
model.wv.distance('holmes', 'water')

0.0012812614440917969

### Step 11: Find closests words
- Get all the words
    - HINT: **words = model.wv.index_to_key**
- Implement a function **closets_words(word)**
    - HINT: **distances = {w: model.wv.distance(word, w) for w in words}**
    - HINT: **sorted(distances, key=lambda w: distances[w])[:15]**

In [71]:
words = model.wv.index_to_key

def closets_words(word):
    distances = {w: model.wv.distance(word, w) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:15]

In [72]:
closets_words('holmes')

['holmes',
 'friend',
 'must',
 'back',
 'come',
 'might',
 'words',
 'quite',
 'turned',
 'police',
 'held',
 'watson',
 'sat',
 'colonel',
 'eyes']