[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/demos/nlp/nlp_foundations.ipynb)

# Fundamentals of natural language processing (NLP)
Pre-processing textual data is a standard task. Several Python libraries, including `scikit-learn` and `Keras`, offer similar functionality. We will use `NLTK` (the Natural Language Toolkit) in this notebook. It has a clear and easy-to-understand syntax and is well suited to demonstrating standard NLP operations. Although not the focus of this tutorial, we also introduce a library called `Beautiful Soup`, which is very popular for web scraping. Make sure to have these libraries installed before running the following code. 
Here is the agenda of the session:

1. Preparing text for analysis: the standard NLP pipeline
2. Use case: the IMDB movie review data set
3. Demo: training word-to-vec word embeddings

In [1]:
import numpy as np
# Library for standard NLP workflow
import nltk  # When running this notebook for the first time, you have to download the following NLTK packages (the next 4 lines)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 1. Preparing text for analysis: the standard NLP pipeline

In [2]:
# Just some demo text; no need to quote I guess :)
text_raw = """ 
            I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? 
            I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' 
            Ah, that is the great puzzle!
           """

#### Remove Whitespace

In [3]:
def remove_whitespace(text):
    """ Function to remove whitespace (tabs, newlines). """
    return ' '.join(text.split())

text_processed = remove_whitespace(text_raw)
text_processed

"I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' Ah, that is the great puzzle!"

Hm, but the punctuation is still there. Is it noise or is it useful? Let's try removing it for now (there are many ways to do this). Additionally, we will drop words that contain non-alphabetic characters and convert everything to lower case.

#### Punctuation, Whitespace and Casing


In [4]:
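def remove_punctuation_and_casing(text):
    """
    Function to remove the punctuation, upper casing and words that include
    non-alphabetic characters.
    """
    # Characters to replace by a blank. Note that the apostrophe is not in the list,
    # so tokens that still contain one (e.g., 'Who) fail the isalpha() check below and are dropped.
    chars = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
    text = text.translate(str.maketrans(chars, ' ' * len(chars)))
    return ' '.join([word.lower() for word in text.split() if word.isalpha()])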

text_processed = remove_punctuation_and_casing(text_processed)
text_processed

'i wonder if i have been changed in the night let me think was i the same when i got up this morning i almost can remember feeling a little different but if i am not the same the next question is in the world am i ah that is the great puzzle'
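
A close look at the cleaned text shows that the word *who* has disappeared: in the raw text it is preceded by a quotation mark, which is not among the replaced characters, so the token is discarded as non-alphabetic. The following is a minimal sketch of a variant that also replaces quotation marks, using `string.punctuation` from the standard library. The function name `strip_punctuation_keep_words` is made up for this illustration and is not part of the notebook's pipeline.

```python
import string

def strip_punctuation_keep_words(text):
    """Replace every ASCII punctuation mark, including quotes, by a blank."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.translate(table).split()
    return ' '.join(word.lower() for word in words if word.isalpha())

sample = "the next question is 'Who in the world am I?'"
print(strip_punctuation_keep_words(sample))
```

The price of this variant is that contractions are split: *don't* becomes the two tokens *don* and *t*. Which behavior is preferable depends on the task.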

This is starting to look like a clean list of words already, right? There are some more issues we want to address though. One of them is 'stop words': words such as "the", "a", or "and" carry little meaning on their own and mainly serve to hold sentences together, so they add noise. NLTK offers its own list of stop words.

#### Stopwords

In [5]:
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
english_stopwords[0:9]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you']

In [6]:
len(english_stopwords)

179

The list of stop words looks comprehensive. However, say you miss a 'stop word' that you would also like to filter. You can extend the above list easily. After all, it is just a list.

In [7]:
t = type(english_stopwords)
print('Data type of stopwords is:', t )

Data type of stopwords is: <class 'list'>


In [8]:
# Add some custom stopwords
english_stopwords.append('some_word_you_dont_like')  # you can apply all the functions for lists
english_stopwords[-1]

'some_word_you_dont_like'

Finally, let's remove the stopwords from our processed sample text.

In [9]:
def remove_stopwords(text):
    """ Function to remove stopwords. """
    return ' '.join([word for word in str(text).split() if word not in english_stopwords])

text_processed = remove_stopwords(text_processed)
text_processed

'wonder changed night let think got morning almost remember feeling little different next question world ah great puzzle'
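
Checking membership in a list scans the list element by element, which becomes noticeable once we process thousands of reviews. A possible speed-up, sketched here under the assumption that the stop word list does not change during processing, is to convert the list to a set first. The mini stop word list and the function name `remove_stopwords_fast` are hypothetical and only serve the illustration.

```python
# Hypothetical mini stop word list, for illustration only; in the notebook
# you would pass stopwords.words('english') instead.
demo_stopwords = ['i', 'if', 'have', 'been', 'in', 'the']

def remove_stopwords_fast(text, stopword_list):
    """Remove stop words using a set for constant-time membership tests."""
    stopword_set = set(stopword_list)
    return ' '.join(word for word in text.split() if word not in stopword_set)

print(remove_stopwords_fast('i wonder if i have been changed in the night', demo_stopwords))
```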

#### Lemmatization and Stemming
You might have already thought of the issue: what if a word is used in different forms? It would be treated as several different words, right? That is where **stemming** and **lemmatization** come into play. The former approach is simpler and consists mainly of dropping suffixes. The latter reduces a word to its dictionary form. To that end, we need to have a dictionary available. Let's first illustrate simple stemming.

In [10]:
# For the sake of completeness, here is an example of stemming

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def stem_words(text):
    """ Function to stem words. """
    return ' '.join([stemmer.stem(word) for word in text.split()])

text_processed = stem_words(text_processed)
text_processed

'wonder chang night let think got morn almost rememb feel littl differ next question world ah great puzzl'

Simple, isn't it? With just one example sentence, it is hard to appreciate the benefits of stemming. The idea is that in a large corpus, many words will appear multiple times in different grammatical forms. Still, the meaning that these words carry is roughly identical. Running, run, ran, runner, etc.: all of these words indicate that the text has something to do with running. Assuming that this is all we need to know (yes, that is a bold assumption), stemming makes sense as it could greatly reduce the number of distinct words in a corpus. This number of distinct words, also called **vocabulary size**, is very important. It affects the efficiency of NLP operations and may also have a big impact on the accuracy of text classification. <br>
Let's now take a look at lemmatization. Here, things are a little more complicated. While NLTK offers a ready-to-use function, we need to tell it the part of speech of the word that we want to lemmatize. Consider this example:

In [11]:
# NLTK lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# You need to choose the type of word:
print(lemmatizer.lemmatize("stripes", 'n'))  
print(lemmatizer.lemmatize("stripes", 'v')) # what happens if we claim this is a verb? 

stripe
strip


How would we know the part of speech? In fact, determining it is an NLP task in its own right. It is called **POS tagging**. Much research has been done on coming up with clever ways to determine POS (part-of-speech) tags. We will not go into details. A simple POS tagger is available as part of the `NLTK` library. It can be used like this:

In [12]:
nltk.pos_tag(["She", "earned", "her", "stripes", "with", "great", "performance"])

[('She', 'PRP'),
 ('earned', 'VBD'),
 ('her', 'PRP'),
 ('stripes', 'NNS'),
 ('with', 'IN'),
 ('great', 'JJ'),
 ('performance', 'NN')]

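We will make use of the above POS tagger later. For now, let's simply apply the lemmatizer without specifying the part of speech, in which case NLTK treats every word as a noun. Note that `text_processed` has already been stemmed, so the lemmatizer finds little left to change.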

In [13]:
def lemmatize_words(text, **kwargs):
    """ Function to lemmatize words. """
    return ' '.join([lemmatizer.lemmatize(word, **kwargs) for word in text.split()])

text_processed = lemmatize_words(text_processed)
text_processed

'wonder chang night let think got morn almost rememb feel littl differ next question world ah great puzzl'

#### Cleaning HTML

For more sophisticated text cleaning, you might want to consider **regular expressions**. In a nutshell, a regular expression is a pattern that describes a set of strings; it lets you search for and replace matching text. The ability to match structured expressions in a text, for example an email address, is quite powerful. A quick read through the corresponding [Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression) would be useful. Also, here is a [nice playground](https://regexr.com/). The following demo text contains two common issues, an email address and HTML markup:


In [14]:
# Another piece of demo text illustrating some common issues
re_demo = """
            This text includes the email address of Stefan <stefan.lessmann@hu-berlin.de>. 
            Also, we use <em>html</em> to <b>emphasize</b> parts and include breaks <br> to separate lines.
          """

Finding or filtering email addresses is a common use case when processing text.

In [15]:
# Finding emails using RE
import re  # Python library for regular expressions

# Simple pattern to match email addresses (raw string, so the backslashes reach the regex engine unchanged)
pat = r'([\w\.-]+@[\w\.-]+\.[\w]+)+'

# Extracting email addresses
email = re.findall(pat, re_demo)
print('Found: ', email)

# Filter sub-strings
re.sub(pat, '', remove_whitespace(re_demo))

Found:  ['stefan.lessmann@hu-berlin.de']


'This text includes the email address of Stefan <>. Also, we use <em>html</em> to <b>emphasize</b> parts and include breaks <br> to separate lines.'
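
Instead of deleting email addresses, it is sometimes preferable to replace them with a placeholder token so that the information "an address was mentioned here" survives. The following sketch uses a simplified version of the pattern above; the function name `mask_emails`, the placeholder `EMAIL`, and the example addresses are assumptions made for this illustration.

```python
import re

# Simplified email pattern, compiled once for reuse
EMAIL_PATTERN = re.compile(r'[\w.-]+@[\w.-]+\.\w+')

def mask_emails(text, placeholder='EMAIL'):
    """Replace every email address in text by a placeholder token."""
    return EMAIL_PATTERN.sub(placeholder, text)

print(mask_emails('Contact jane.doe@example.org or admin@example.com for details.'))
```

Keep in mind that such a short pattern only covers common address formats; it is not a validator for every syntactically legal email address.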

In [16]:
# The library beautifulsoup4 handles html
from bs4 import BeautifulSoup

# Remove html content (naming the parser explicitly avoids a warning and keeps results reproducible)
remove_whitespace(BeautifulSoup(re_demo, 'html.parser').get_text())

'This text includes the email address of Stefan . Also, we use html to emphasize parts and include breaks to separate lines.'

#### Conversion of Emojis and Emoticons

Emoticons are sequences of ASCII characters and emojis are Unicode pictographs; both express moods or feelings in written communication. In use cases like sentiment analysis, they carry very valuable information.

One way to make use of this information is to convert emoticons and emojis into text that reflects their meaning. For that, we will use a condensed copy of the "emot" library by Neel Shah [(Github)](https://github.com/NeelShah18/emot/blob/master/emot/core.py). When running this notebook, make sure that the file ```emot_dictionary.py``` is in the same directory as the notebook. Note that when using Google Colab, it may be simpler to install the "emot" package directly.

In [18]:
import emot_dictionary as emot

emo_demo = """
            The movie was fantastic :o :-)) 🚀 👏
           """

Let's first convert the emoticons.

In [19]:
def convert_emoticons(text):
    """ Function to convert emoticons into a text that reflects their meaning. """
    EMOTICONS = emot.EMOTICONS()
    for i in EMOTICONS:
        text = text.replace(i, EMOTICONS[i])
    return text

emo_demo = convert_emoticons(remove_whitespace(emo_demo))
emo_demo

'The movie was fantastic Surprise Very happy 🚀 👏'

And now we also convert the emojis.

In [20]:
def convert_emojis(text):
    """ Function to convert emojis into a text that reflects their meaning. """
    EMOJIS = emot.EMOJIS_UNICODE()
    for i in EMOJIS:
        text = text.replace(EMOJIS[i], i.translate(str.maketrans('', '', ':')).replace(r'_', r' '))
    return text

convert_emojis(emo_demo)

'The movie was fantastic Surprise Very happy rocket clapping hands'
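
One pitfall of replacing emoticons one after another is that a short emoticon can be part of a longer one, for example `:-)` inside `:-))`. If the short one is replaced first, a stray `)` is left behind. A simple safeguard, sketched below with a hypothetical mini dictionary rather than the full "emot" data, is to try longer emoticons first.

```python
# Hypothetical mini dictionary, for illustration only; the notebook uses emot.EMOTICONS()
demo_emoticons = {':-)': 'Happy face', ':-))': 'Very happy', ':o': 'Surprise'}

def convert_emoticons_longest_first(text, emoticons):
    """Replace emoticons by their meaning, trying longer emoticons first."""
    for emoticon in sorted(emoticons, key=len, reverse=True):
        text = text.replace(emoticon, emoticons[emoticon])
    return text

print(convert_emoticons_longest_first('The movie was fantastic :o :-))', demo_emoticons))
```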

#### Wrapping up
Albeit simple, the above demos provide a glimpse of text cleaning. While you could do a lot more, tasks like stop word removal will come up in many NLP projects. We conclude this part by putting all of the above steps into a helper function, which we will use later in the session to clean a data set of online movie reviews. Our helper function will use lemmatization instead of stemming because it is likely to give better results in downstream tasks (e.g., text classification). The following function is yet another helper; it determines the part-of-speech code that the lemmatizer needs for a word.   

In [21]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Helper function that calls the POS tagger for an input word and returns a code that can be used for lemmatization"""
    # Extract the first letter of the POS tag (see the above example to understand the output coming from pos_tag)
    tag = nltk.pos_tag([word])[0][1][0].upper()  
    # Dictionary to map these letters to wordnet codes that the lemmatizer understands
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [22]:
# Test the helper function
[get_wordnet_pos(x) for x in ["She", "earned", "her", "stripes", "with", "great", "performance"]]

['n', 'v', 'n', 'n', 'n', 'a', 'n']

And here is the real helper function for text cleaning. We will make use of it right after introducing our data set for subsequent parts. Since that data is stored in the form of a data frame, we refrain from making our helper function more general and simply assume that incoming text is a Pandas Series object (i.e., one column of a data frame).

In [23]:
def text_cleaning(documents):
    """
    Function for standard NLP pre-processing including removal of html tags,
    whitespaces, non-alphanumeric characters, and stopwords. Emoticons are
    converted to text that reflects their meaning. Words are subject to
    lemmatization using their POS tags.
    """
    cleaned_text = []  # our output will be a list of documents
    lemmatizer = WordNetLemmatizer()
    
    print('Processing input array with {} elements...'.format(documents.shape[0]))
    counter = 0
    
    for doc in documents:
        text = BeautifulSoup(doc, 'html.parser').get_text() # remove html content
        text = remove_whitespace(text) # remove whitespaces
        text = convert_emoticons(text) # convert emoticons to text
        text = remove_punctuation_and_casing(text) # remove punctuation and casing
        text = remove_stopwords(text) # remove stopwords
        text = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()]) # lemmatize each word
        
        cleaned_text.append(text)

        if (counter > 0 and counter % 50 == 0):
            print('Processed {} documents'.format(counter))
            
        counter += 1
        
    return cleaned_text

## 2. Use case: the IMDB movie review data set
We use a popular NLP data set consisting of movie reviews posted at [IMDB](https://www.imdb.com/). The data is available in different sizes and shapes (cleaned, raw, ...) on the web. We use a version from Kaggle, which includes 50K reviews and binary labels indicating whether a review is positive or negative. The labels are useful for sentiment analysis, which we will do in our next programming session. Here, we simply prepare the data for the next session and thereby further elaborate on the NLP operations introduced in the previous part. You can download the raw data from Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data. A version is also available in the course folder corresponding to Block II.

### Load the data

In [34]:
# Remember to adjust the path so that it matches your environment
import pandas as pd

imdb_data = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [28]:
imdb_data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


So the data is really simple; just two columns, one for the binary sentiment and one for the text of the review. Apparently, some of the reviews include HTML. We already added functionality to handle HTML into our text cleaning function. So this should not cause us any trouble. Let's look at an arbitrary review to get a better understanding of the text.

In [29]:
imdb_data.loc[8, 'review']

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

In [30]:
imdb_data.loc[8, 'sentiment']

'negative'

### Sampling
Working with the full data set of 50K reviews is time-consuming. When experimenting with the notebook, you might want to draw a random sample to speed up the computations. On a modern computer, a sample size of 5000 should be feasible without increasing the runtime too much; the code below uses 500 to keep the demo fast. Note that the results of processing the full data set are available in our course folder.

In [31]:
# Draw a random sample to save time
sample_size = 500
np.random.seed(111)
# Note: randint draws with replacement, so a few reviews may appear more than once in the sample
idx = np.random.randint(low=0, high=imdb_data.shape[0], size=sample_size)
imdb_data = imdb_data.loc[idx,:]

imdb_data.reset_index(inplace=True, drop=True)  # dropping the index prohibits a reidentification of the cases in the original data frame
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     500 non-null    object
 1   sentiment  500 non-null    object
dtypes: object(2)
memory usage: 7.9+ KB
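
Drawing random integers one by one amounts to sampling with replacement, so the same review can end up in the sample more than once. If every review should appear at most once, sample without replacement instead. The following standard-library sketch contrasts the two approaches; the variable names are chosen for this illustration only.

```python
import random

population_size = 50000  # number of reviews in the full data set
sample_size = 500

rng = random.Random(111)  # seeded generator for reproducibility

# Independent draws, i.e., sampling with replacement: indices may repeat
idx_with = [rng.randrange(population_size) for _ in range(sample_size)]

# Sampling without replacement: every index appears at most once
idx_without = rng.sample(range(population_size), sample_size)

# Number of distinct indices in each sample
print(len(set(idx_with)), len(set(idx_without)))
```

With Pandas, `imdb_data.sample(n=sample_size, random_state=111)` draws rows without replacement by default and would be a compact alternative.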


### Data cleaning
Thanks to our careful preparation, cleaning the reviews should be easy. All it takes is applying our cleaning function to the data.

In [33]:
# Do the cleaning
# CAUTION: depending on your data set size, the processing might take a while 
import time  # To keep an eye on runtimes
start = time.time()
imdb_data['review_clean'] = text_cleaning(imdb_data.review)
print('Duration: {:.0f} sec'.format(time.time()-start))

Processing input array with 500 elements...
Processed 50 documents
Processed 100 documents
Processed 150 documents
Processed 200 documents
Processed 250 documents
Processed 300 documents
Processed 350 documents
Processed 400 documents
Processed 450 documents
Duration: 8 sec


In [None]:
# Check all is well
print('Original Review:\n' + imdb_data.review[46])
print('\nCleaned Review:\n' + imdb_data.review_clean[46])

Original Review:
One of the best horror/suspense movies I have seen in a long time. Wow, it was a big surprise and stunning at how good this movie was, sometimes a gem like this will surface but is rare. I expected a popcorn monster flick and a mildly diverting way to spend a late night but instead a very well made and directed movie with great acting and made with passion and heart. <br /><br />This is a movie that makes you feel for the characters and what happens to them, and it is filmed like you are there and it is really happening. I know some people in other reviews compare it to "Open Water", but I disagree because I thought Open Water was quite boring and mediocre, while this movie was the opposite, although superficially they are filmed in the same "realistic" style.<br /><br />The actors are unknowns, at least to me, but they all are very effective and convey the dire situation with frightening intensity and realism. The story is well done and flows smoothly, the plot is log

In [None]:
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review        500 non-null    object
 1   sentiment     500 non-null    object
 2   review_clean  500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB


In [None]:
imdb_data.head(8)

|   | review | sentiment | review_clean |
|---|--------|-----------|--------------|
| 0 | The script seems to have been wholesale (ahem ... | negative | script seem wholesale ahem ahem cough cough bo... |
| 1 | I watched this again after having not seen it ... | positive | watch see since first come still make laugh lo... |
| 2 | Ultimately too silly and pointless. Yes there ... | negative | ultimately silly pointless yes gild cage metap... |
| 3 | Nice attempt and good ideas (redemption of the... | negative | nice attempt good idea redemption prostitute h... |
| 4 | A touching love story reminiscent of ÂIn the ... | positive | touch love story reminiscent mood draw heavily... |
| 5 | Went to watch this movie expecting a 'nothing ... | negative | go watch movie expect really action flick stil... |
| 6 | Alan Rickman & Emma Thompson give good perform... | negative | alan rickman emma thompson give good performan... |
| 7 | The only reason I give it a 2 is that filmogra... | negative | reason give filmography stylize day least some... |


Looks like the cleaning has fulfilled its purpose.

### File input and output
Should you have used the full data set in the above cleaning, you will want to store your results. The following code exemplifies the use of `pickle`, a Python module for serializing objects in a binary format, which Pandas supports natively via `to_pickle`. Compared to csv, a binary format preserves the data types exactly and can need less space on disk. `pickle` is part of the Python standard library, so no installation is needed. 

In [38]:
# Saving objects to disk using pickle
import pickle

imdb_data.to_pickle('imdb_full_clean.pkl')
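
To see what happens behind the scenes, here is a minimal round trip with the `pickle` module itself. It is a sketch under the assumption that we want to store a plain Python object, here a small hypothetical list of tokenized reviews, in a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for cleaned reviews: a list of token lists
demo_reviews = [['great', 'movie'], ['waste', 'time']]

path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.pkl')

with open(path, 'wb') as f:   # binary mode is required for pickle
    pickle.dump(demo_reviews, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == demo_reviews)
```

For a data frame saved with `to_pickle`, the counterpart for loading is `pd.read_pickle`. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.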

### A bird's eye view on the data
Let's have a quick look at what folks talk about in this data set. Using the class *Counter* from the `collections` module, we can easily count word occurrences and query the most common words. We can also check the number of occurrences for specific words. We do not really need the *word_counter* here and only use it to get a feeling for the data set. Of course, these types of checks make more sense when using the full data set. 

In [50]:
# Here is a bit of code to load the data of the clean reviews from our course folder
import pickle
with open('imdb_clean_full.pkl','rb') as path_name:
    clean_reviews = pickle.load(path_name)
len(clean_reviews)

50000

In [51]:
# Loop through the reviews and update a counter keeping track of word counts
import collections

word_counter = collections.Counter()
for r in clean_reviews:
    word_counter.update(r)  # each review is a list of tokens; update() counts all of them at once

In [52]:
# Query the top most frequent words
top_n = 10
word_counter.most_common(top_n)

[('movie', 103239),
 ('film', 95848),
 ('one', 55432),
 ('make', 46127),
 ('like', 44297),
 ('see', 41611),
 ('get', 35196),
 ('well', 32874),
 ('time', 31548),
 ('good', 29852)]

The above results hint at some more challenges when working with text data. Among the top ten most frequent words, none is really surprising or appears interesting. Well, what is interesting depends on the task. For example, words like *like* and *good* have meaning in a sentiment analysis setting. However, words like *movie* or *film* will naturally appear in a data set of movie reviews and do not reveal much information for sentiment analysis. This indicates that, in addition to filtering stop words, there could be other 'normal' words (i.e., not stop words) that we might want to filter. Again, preparing text data can be rather laborious...<br>
Let's check if people also talk about something more relevant.

In [53]:
# Check frequency of some target word
word_counter["spielberg"]

263
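
The idea of filtering corpus-specific words such as *movie* can be sketched with the same `Counter` class. Instead of raw counts, the sketch below uses the document frequency, that is, the share of reviews in which a word occurs at least once. The toy corpus and the threshold of 80% are assumptions for illustration, not recommendations.

```python
import collections

# Hypothetical toy corpus: each review is a list of tokens
demo_reviews = [
    ['movie', 'great', 'plot'],
    ['movie', 'bad', 'acting'],
    ['movie', 'great', 'cast'],
    ['film', 'boring', 'movie'],
]

# Document frequency: in how many reviews does each word occur at least once?
doc_freq = collections.Counter()
for review in demo_reviews:
    doc_freq.update(set(review))

max_share = 0.8  # assumed threshold: drop words found in more than 80% of the reviews
too_common = {w for w, n in doc_freq.items() if n / len(demo_reviews) > max_share}

filtered = [[w for w in review if w not in too_common] for review in demo_reviews]
print(too_common)
print(filtered[0])
```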

## 3. Demo: training word-to-vec word embeddings
Strictly speaking, the session is over. We covered the fundamental steps in an NLP pre-processing pipeline. You can easily skip the remainder of the notebook, in which we demonstrate one way to train word vectors using word-to-vec (W2V). When it comes to embeddings, the most typical use case is to **download pre-trained embeddings** and employ these for some downstream tasks (with or without fine-tuning). Keras supports that use case very well, as we will see in the next session. At times, however, you may want to **train your own embeddings**. For example, our data is from IMDB. People talk about movies. Maybe they use a specific type of language that is not well represented in pre-trained W2V embeddings. Without going into too much detail on the pros and cons of training your own word embeddings, the point of the following code is simply to showcase how you can do it if you want to. We will use a library called `Gensim`.

`Gensim` is a popular library for text processing. Although maybe even more geared toward topic modeling, it offers, amongst others, implementations of several algorithms to learn word embeddings, including *W2V* and *FastText*; it can also work with pre-trained *GloVe* vectors. The following demonstrates training W2V embeddings using the cleaned reviews. Before moving on, you might need to install `Gensim`  ;)   

### The Gensim W2V model
Training word embeddings using `Gensim` is very easy and just a matter of calling a function. Well, the reason it takes so little code is that we have already cleaned our data and have it available as a list of tokenized texts; that is a format that `Gensim` supports. However, note that, depending on your data, the code may take quite a while to run. Again, word embeddings trained on the full 50K data set for 500 epochs are available in our course folder.

In [54]:
# CAUTION: Running the code might take a while
# NOTE: the parameter names below follow Gensim 3.x; Gensim 4.x renamed `iter` to `epochs` and `size` to `vector_size`
from gensim.models import Word2Vec    

emb_dim = 50  # embedding dimension
# Train a Word2Vec model
model = Word2Vec(clean_reviews, 
                 min_count=1,  # minimum frequency: words that occur fewer than min_count times are ignored
                 window=5,     # size of the context window
                 iter=10,      # number of passes (epochs) over the data set; set to >200 to obtain reasonable results
                 size=emb_dim, # size of embedding
                 workers=2)    # number of worker threads for parallel training

In [55]:
# summarize the loaded model
print(model)
words=list(model.wv.vocab)  # Gensim 3.x; in Gensim 4.x use model.wv.key_to_index instead

# get one embedding 
model.wv['movie']

Word2Vec(vocab=84673, size=50, alpha=0.025)


array([-0.26228845,  0.34722286,  1.4709738 ,  2.0605502 , -2.2747583 ,
       -3.351172  , -0.93709004,  0.30147228,  0.3833407 , -0.93977964,
       -3.5641367 ,  1.5985901 , -2.761732  , -0.8686265 ,  1.0060619 ,
       -2.178143  ,  1.3885053 , -0.72174406, -1.8205775 ,  0.76048607,
       -0.20302485,  1.9258785 ,  0.85206515, -1.9027067 ,  1.0411849 ,
        0.5764043 ,  1.5255504 , -2.4159296 ,  1.9932901 , -1.7513642 ,
       -0.5735383 , -0.02365238, -1.5335919 , -0.777492  , -1.1198854 ,
       -0.41005844,  0.9909722 ,  0.04843368,  1.2031087 ,  1.5734948 ,
       -0.14604816, -1.801444  ,  1.5539323 , -0.2381962 , -2.5179162 ,
       -1.8608973 , -0.6056832 ,  1.7911577 ,  2.10289   ,  2.5480464 ],
      dtype=float32)

### Input / output handling
Gensim supports saving and loading trained embeddings in different formats. Below is a simple demo. More information is available on the [Gensim homepage](https://radimrehurek.com/gensim/models/word2vec.html).

In [56]:
# Save the trained word vectors
file="w2v_embedding.model"
save_as_bin = False
model.wv.save_word2vec_format(file, binary=save_as_bin)  # set binary to True to save disk space; false facilitates inspecting the embeddings in a text editor

### Working with the trained embeddings
We use a pre-trained version of the embeddings, which were trained on the full IMDB data set for 500 epochs. The examples are inspired by [this Kaggle kernel](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial). If you want to visualize the trained word vectors, have a look at [this post](https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne). It is fairly easy to create a visualization, but to get meaningful results you would need to prepare the data more carefully, for example by removing words that are too frequent or too infrequent. Ultimately, working with the embeddings would involve building a proper NLP model and using it to solve some downstream tasks, as we will do in the next session. The following code simply showcases the embeddings and functionality of `Gensim`.

In [57]:
# Load model from disk
file = "w2v_imdb_dim50_embeddings.model"
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(file, binary=False)

#### Which word is most similar to another word?

In [58]:
model.most_similar(positive=['avatar'])

[('mcnaughton', 0.5568448901176453),
 ('gjs', 0.552051305770874),
 ('haht', 0.5311591029167175),
 ('furia', 0.529901921749115),
 ('shawniqua', 0.5293144583702087),
 ('emanate', 0.5260404348373413),
 ('vallee', 0.5185524821281433),
 ('imagineered', 0.5167945623397827),
 ('jingle', 0.5123077034950256),
 ('damir', 0.5118454694747925)]

#### How similar are two words?

In [59]:
model.similarity('good', 'great')

0.7539246
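
The similarity reported here is the cosine similarity between the two word vectors: the dot product divided by the product of the vector lengths. It ranges from -1 (opposite direction) to 1 (same direction). The following standard-library sketch computes it for hypothetical three-dimensional vectors; real embeddings have 50 dimensions in our case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", for illustration only
vec_good = [1.0, 2.0, 0.0]
vec_great = [2.0, 4.0, 0.0]   # same direction as vec_good
vec_other = [0.0, 0.0, 3.0]   # orthogonal to vec_good

print(cosine_similarity(vec_good, vec_great))  # close to 1
print(cosine_similarity(vec_good, vec_other))  # 0
```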

In [60]:
print('How similar is Tarantino to Spielberg: {}'.format(model.similarity('tarantino', 'spielberg')))
print('How similar is Emmerich to Spielberg: {}'.format(model.similarity('emmerich', 'spielberg')))

print('How similar is Paltrow to Bullock: {}'.format(model.similarity('paltrow', 'bullock')))
print('How similar is Paltrow to Alba: {}'.format(model.similarity('paltrow', 'alba')))

print('How similar is Cruise to Depp: {}'.format(model.similarity('cruise', 'depp')))
print('How similar is Cruise to Willis: {}'.format(model.similarity('cruise', 'willis')))


How similar is Tarantino to Spielberg: 0.5265995860099792
How similar is Emmerich to Spielberg: 0.20405761897563934
How similar is Paltrow to Bullock: 0.20976237952709198
How similar is Paltrow to Alba: 0.17420797049999237
How similar is Cruise to Depp: 0.16261133551597595
How similar is Cruise to Willis: -0.04004356637597084


#### Which word does not fit in?

In [61]:
# Note: a notebook cell displays only the value of its last expression
model.doesnt_match(['cool', 'great', 'lovely', 'weak'])
model.doesnt_match(['movie', 'film', 'good'])


'good'

#### A is to B as C is to ? 

In [62]:
model.most_similar(positive=['spielberg', 'woman'], negative=['man'], topn=5)

[('indefinably', 0.5315033793449402),
 ('directorial', 0.5310426950454712),
 ('grrrrrrr', 0.5281748175621033),
 ('tarrentino', 0.5250871777534485),
 ('meshugaas', 0.5238608717918396)]
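
The analogy query boils down to vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and look for the word whose vector is closest to the result. The sketch below illustrates this with hypothetical two-dimensional vectors. It is a simplification and leaves out details, such as vector normalization, that a library implementation may apply.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d embeddings: first axis ~ "royalty", second axis ~ "gender"
toy = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=['king', 'woman'], negative=['man'], vectors=toy))
```

With well-behaved toy vectors the answer is the expected one. The noisy neighbors in the output above suggest that our IMDB embeddings do not encode such relations nearly as cleanly.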

### Phrase detection
W2V trains one embedding per word. The model is agnostic of common phrases such as 'New York'. It would train one embedding for *new* and another for *york*, provided both words are part of the vocabulary. You can get better embeddings by adding common phrases to the vocabulary. W2V will then train individual embeddings for these phrases. Gensim also comes with a phrase detection model, which allows you to handle bigrams, trigrams and the like. We will not retrain our W2V model but sketch how you can use Gensim to get these common phrases. You could then consider adding (some of) them to your vocabulary to enhance the model.  

In [63]:
from gensim.models.phrases import Phrases

# Train a bigram model
bigram_model = Phrases(clean_reviews, min_count=10) 

After training, we can take text and put it through the bigram model. The model will then alter the text so as to introduce bigrams. Here is an example:

In [64]:
bigram_model[clean_reviews[0][0:9]]



['one',
 'reviewer_mention',
 'watch',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly']

In [65]:
# Compare to the original review
clean_reviews[0][0:9]

['one',
 'reviewer',
 'mention',
 'watch',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly']

We can also make use of our counter class to examine the most common bigrams in the corpus, as follows:

In [66]:
bigram_counter = collections.Counter()
for key in bigram_model.vocab.keys():
    if key.decode().find('_')>-1: # the decode is needed because this Gensim version stores keys as bytes
        bigram_counter[key] += bigram_model.vocab[key]

In [67]:
bigram_counter.most_common(25)

[(b'look_like', 3719),
 (b'watch_movie', 3109),
 (b'ever_see', 2977),
 (b'see_movie', 2746),
 (b'bad_movie', 2698),
 (b'year_old', 2422),
 (b'film_make', 2369),
 (b'make_movie', 2359),
 (b'special_effect', 2324),
 (b'movie_make', 2132),
 (b'even_though', 2000),
 (b'movie_ever', 1987),
 (b'main_character', 1924),
 (b'one_best', 1921),
 (b'movie_like', 1919),
 (b'low_budget', 1896),
 (b'make_film', 1878),
 (b'see_film', 1843),
 (b'waste_time', 1775),
 (b'watch_film', 1652),
 (b'good_movie', 1618),
 (b'horror_movie', 1612),
 (b'much_well', 1533),
 (b'want_see', 1485),
 (b'seem_like', 1473)]

The above bigrams might be frequent. However, you would not consider training individual embeddings for phrases such as *look_like* or *waste_time*. This shows that proper phrase detection in the scope of W2V is nontrivial and would require more work before we can hope to get decent results.     
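
One reason why raw counts are a poor guide is that frequent words form frequent pairs by chance. A common remedy is to relate the bigram count to the counts of its parts. The sketch below implements the score proposed by Mikolov et al. (2013), multiplied by the vocabulary size; treat the exact scaling as an assumption about how a phrase detector might score candidates. The toy corpus and the function name `phrase_score` are made up for illustration.

```python
import collections

# Hypothetical toy corpus of tokenized documents
docs = [
    ['new', 'york', 'movie', 'look', 'like', 'new', 'york'],
    ['look', 'like', 'bad', 'movie', 'new', 'york'],
    ['look', 'good', 'like', 'movie', 'look', 'bad'],
]

unigrams = collections.Counter(w for d in docs for w in d)
bigrams = collections.Counter(pair for d in docs for pair in zip(d, d[1:]))

def phrase_score(a, b, min_count=1):
    """(count(ab) - min_count) * vocabulary size / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])

print(phrase_score('new', 'york'))
print(phrase_score('look', 'like'))
```

In this toy corpus, *new york* obtains a higher score than *look like* because *new* and *york* rarely occur apart. Ranking candidate bigrams by such a score, rather than by frequency alone, is one possible starting point for finding phrases worth an embedding of their own.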