# Working with text

Text is one of the most widely available sources of sequential data. We can think of it either as a sequence of characters or as a sequence of words. Deep learning models can't take raw strings as input, so we need to convert text to a numeric form. This can be done in a few ways:

* Split text into words and convert each word into a vector.
* Split text into characters and convert each character into a vector.
* Extract n-grams and convert each n-gram into a vector.
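The three ways of splitting text listed above can be sketched in plain Python. This is only an illustration: the example sentence and the helper `word_ngrams` are made up for this sketch and are not part of any library.

```python
# Hypothetical example sentence, chosen only for illustration
text = "the cat sat on the mat"

# 1. Word-level tokens: split on whitespace
word_tokens = text.split()

# 2. Character-level tokens: every character, including spaces
char_tokens = list(text)

# 3. Word n-grams: every run of n consecutive words
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams(word_tokens, 2)

print(word_tokens)
print(char_tokens[:7])
print(bigrams)
```

Each of these token lists still has to be turned into numbers before a model can use it.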

Splitting text into units like these (tokens) is known as tokenization. There are two major ways to encode the resulting tokens:

* One-hot encoding - often used for characters and sometimes words.
* Word embeddings - often used for words or n-grams.
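To make the difference between the two encodings concrete, here is a minimal sketch. The three-word vocabulary is hypothetical, and the embedding matrix is filled with random numbers as a stand-in for weights that would normally be learned during training.

```python
import numpy as np

vocab = ["the", "cat", "sat"]  # hypothetical 3-word vocabulary
token_index = {word: i for i, word in enumerate(vocab)}

# One-hot: one dimension per vocabulary entry, a single 1 per vector
one_hot = np.eye(len(vocab))

# Embedding: a dense matrix with one row per word.
# Random values stand in for weights that would normally be learned.
rng = np.random.default_rng(0)
embedding_dim = 4
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

idx = token_index["cat"]
one_hot_cat = one_hot[idx]            # sparse vector of length len(vocab)
embedded_cat = embedding_matrix[idx]  # dense vector of length embedding_dim

# Looking up a row gives the same result as multiplying
# the one-hot vector by the embedding matrix
assert np.allclose(one_hot_cat @ embedding_matrix, embedded_cat)

print(one_hot_cat)
print(embedded_cat.shape)
```

The one-hot vector grows with the vocabulary, while the embedding dimension is something we choose.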

N-grams are used more frequently in traditional NLP (natural language processing) methods. There are many great Python libraries for NLP, such as spaCy, NLTK and gensim. In this notebook we'll use gensim because it has a nice word2vec implementation.

## One-hot encoding

We can perform some basic one-hot encoding using plain Python and NumPy:

In [17]:
import numpy as np

sentence = 'The quick brown for jumped over the tree?'
word_list = sentence.split()
token_index = {} #used to map word to int

#use set to remove duplicate words
for word in set(word_list):
    if word not in token_index:
        token_index[word] = len(token_index) 

token_index

{'The': 6,
 'brown': 2,
 'for': 0,
 'jumped': 7,
 'over': 1,
 'quick': 3,
 'the': 4,
 'tree?': 5}

In [18]:
n_words = len(token_index.keys())
#convert sentence to ints
word_ints = [ token_index[word] for word in sentence.split()]
#use sentence ints to generate onehot encodings
np.eye(n_words)[word_ints]

array([[0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.]])
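Because the code above iterates over a `set`, the index assigned to each word can change from one Python session to the next. The sketch below is a small variation, not a replacement: it sorts the vocabulary to get a stable mapping, and then decodes the one-hot rows back to words to confirm that nothing was lost. The names `vocab` and `decoded` are introduced here for illustration.

```python
import numpy as np

sentence = 'The quick brown for jumped over the tree?'
words = sentence.split()

# Sorting the unique words fixes the word-to-index mapping across runs
vocab = sorted(set(words))
token_index = {word: i for i, word in enumerate(vocab)}

# Encode: one row per word in the sentence
word_ints = [token_index[word] for word in words]
one_hot = np.eye(len(vocab))[word_ints]

# Decode: the position of the 1 in each row is the word index
decoded = [vocab[i] for i in one_hot.argmax(axis=1)]

print(token_index)
print(decoded == words)
```

Note that 'The' and 'the' still get separate indices, and 'tree?' keeps its question mark, which is one reason to lowercase the text and strip punctuation before encoding.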

In [19]:
#try to use the above code to one-hot encode characters

## Generating Word Embeddings with Gensim


We use gensim here because it has an intuitive word2vec implementation. This implementation is highly optimized and should run fast, even on a CPU alone.

## Processing text

We begin by importing the tools we need to process the text.

In [20]:
from gensim.models import Word2Vec #prebuilt word2vec implementation
import glob #finds all pathnames matching a shell-style wildcard pattern
import codecs #unicode support when reading files
from multiprocessing import cpu_count #used to get the number of CPUs on the host machine
from gensim.utils import simple_preprocess, simple_tokenize #text processing
from string import punctuation #string containing all ASCII punctuation

The first stage is to read in the text.

In [21]:
book_filenames = sorted(glob.glob("data/*.txt"))
print("Found books:")
book_filenames

Found books:


['data\\got1.txt',
 'data\\got2.txt',
 'data\\got3.txt',
 'data\\got4.txt',
 'data\\got5.txt']

In [22]:
corpus_raw = u""
#open each book as UTF-8, read it,
#and append it to the raw corpus
for book_filename in book_filenames:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()

print("Corpus is {0} characters long".format(len(corpus_raw)))

Corpus is 9719485 characters long


Below is how we could process the text using only the Python standard library:

In [23]:
table = str.maketrans("", "", punctuation) #create translation table that deletes punctuation
text = corpus_raw.translate(table) #remove punctuation
sentences = text.split('\n') #split into lines, treating each line as a sentence
sentences = list(filter(None, sentences)) #remove empty strings
for i, sentence in enumerate(sentences):
    sentences[i] = sentence.lower().split() #lowercase and split into words

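The same preprocessing can be done with gensim. This is probably the better option because `simple_preprocess` does more of the cleaning for us: it lowercases the text, tokenizes it on non-alphabetic characters, and discards very short or very long tokens (by default, fewer than 2 or more than 15 characters).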

In [24]:
sentences = corpus_raw.split('\n') #split at new lines
sentences =  filter(None, sentences) # remove empty strings
sentences =  list(map(simple_preprocess,sentences)) #clean text 

## Word2Vec model

To create the model, the data must be in the correct format. Each sentence must be tokenized (divided into individual words). We then feed the model a list of sentences, where each sentence is a list of words. The other parameters mean:

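* size - dimension of the resulting word vectors (renamed `vector_size` in gensim 4.0 and later)
* window - the maximum distance between the current and predicted word within a sentence
* min_count - ignores all words with a total frequency below this
* workers - number of worker threads used to train the model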

More extensive documentation can be found [here](https://radimrehurek.com/gensim/models/word2vec.html). By default, the word2vec model uses the CBOW (Continuous Bag of Words) algorithm.
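To get a feel for what `window` means under CBOW, the sketch below lists the context words that would be used to predict each target word in a single sentence. It is a simplified illustration, not gensim's actual training loop: the helper `cbow_pairs` and the example sentence are made up for this purpose.

```python
def cbow_pairs(sentence, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]    # up to `window` words before
        right = sentence[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = ["the", "king", "rode", "north"]  # hypothetical sentence
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
```

A larger `window` gives each target word more context words, which tends to capture broader, more topical relationships at the cost of more computation.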

In [25]:
workers = cpu_count()

In [43]:
model = Word2Vec(sentences, size=30, window=5, min_count=5, workers=workers) #fit model; size=30 matches the 30-dimensional vectors shown below

In [44]:
len(model.wv.vocab) #size of vocab

11766

In [45]:
model.wv.vectors.shape  # how we can access the word embeddings matrix

(11766, 30)

In [46]:
'word'  in model.wv.vocab #check if word in vocab

True

In [47]:
model.wv['man'] #get word vector for man

array([ 0.68739736,  0.05206719,  5.1409917 , -2.5213878 ,  0.6190936 ,
        0.31066605, -0.5831321 ,  2.170915  ,  0.09539993,  1.0989041 ,
       -2.8519135 , -2.8304002 ,  0.37002218,  0.902059  , -1.4979048 ,
        1.7596527 , -2.4026933 , -0.15181732, -1.3945677 ,  1.360345  ,
        1.4451095 , -2.0462534 , -0.40619943, -2.9830127 ,  3.3829682 ,
        2.2928636 , -3.6101625 , -0.48065746, -0.28180912,  3.1964552 ],
      dtype=float32)

In [48]:
model.wv.most_similar('man')  # find most similar words

[('woman', 0.7958261966705322),
 ('boy', 0.7713156938552856),
 ('bear', 0.7613998055458069),
 ('knight', 0.7192999720573425),
 ('wench', 0.7066137194633484),
 ('crow', 0.7050321698188782),
 ('wolf', 0.6626030206680298),
 ('one', 0.6473504304885864),
 ('fool', 0.6328256130218506),
 ('girl', 0.6240218877792358)]

In [49]:
# king +  woman  - man = ?
model.wv.most_similar(positive=['woman', 'king'], negative=['man']) 

[('queen', 0.8588622212409973),
 ('prince', 0.7311638593673706),
 ('daenerys', 0.644047737121582),
 ('joffrey', 0.6434864401817322),
 ('targaryen', 0.6206921935081482),
 ('princess', 0.5943112969398499),
 ('margaery', 0.5936480760574341),
 ('tourney', 0.580498993396759),
 ('traitor', 0.564799427986145),
 ('captain', 0.5633386373519897)]
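The similarity scores above are cosine similarities, and the analogy query combines the vectors of the positive and negative words before ranking. The sketch below reproduces that idea on a toy scale. Everything in it is an assumption made for illustration: the two-dimensional vectors are hand-picked rather than learned, and `most_similar` here is a simplified stand-in, not gensim's implementation.

```python
import numpy as np

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "girl":  np.array([-0.3, -1.0]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity: dot product of the two unit vectors."""
    return float(unit(a) @ unit(b))

def most_similar(positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive)
    for w in negative:
        query = query - unit(vectors[w])
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# king + woman - man = ?
print(most_similar(positive=["woman", "king"], negative=["man"]))
```

With these toy vectors, "queen" ranks first because subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis unchanged.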

## Dimensionality reduction with t-SNE

Before we can plot the word vectors, we need to use some form of dimensionality reduction to make them two-dimensional. We'll use t-SNE, which works very well for visualizing high-dimensional data.



In [50]:
from sklearn.manifold import TSNE #for dimensionality reduction
import pandas as pd 

In [51]:
from sklearn.manifold import MDS

In [52]:
n = 1000 #only use first 1000 vectors

In [53]:
tsne = TSNE(n_components=2, perplexity=3,random_state=0)
tsne_vectors = tsne.fit_transform(model.wv.vectors[:n])

In [54]:
words = model.wv.index2word[:n] #get first n words from model

In [55]:
# from sklearn.manifold import MDS

# mds = MDS(n_components=2)
# tsne_vectors = mds.fit_transform(model.wv.vectors[:1000])

## Plotting word vectors with Bokeh

Bokeh is an interactive plotting library, which is great for exploring the relationships between the word vectors. It won't work in JupyterLab by default (because inline JavaScript is disabled there), but it will still work in Jupyter Notebook. To get Bokeh working in JupyterLab, we must install a lab [extension](https://github.com/bokeh/jupyterlab_bokeh).

In [56]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
output_notebook()

In [57]:
#create a dataframe to plot with
df = pd.DataFrame(tsne_vectors,index=words,columns=['x_coord','y_coord'])
df.index.name = 'word'
df.head()

        x_coord    y_coord
word
the   -4.832664  30.268881
and  -38.466236   2.400502
to    23.073069  12.972664
of   -45.412277  45.649239
he   -27.383984 -60.624546


In [58]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(df)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')


In [59]:

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# plot!
show(tsne_plot);

## Exercises

* Play around with the word vectors to see if you can find any interesting relationships.

* Use a text of your choice from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page) and try to apply one of the above models to create word vectors.

* Afterwards, cluster the word vectors using a model from sklearn and try to visualize them.

## References


* Text
  * [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python)
  * [Introduction to word embeddings](https://www.springboard.com/blog/introduction-word-embeddings/)
  * [How to use word embedding layers for deep learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
  * [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)
  * [Learning word embeddings](https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html)
* Notebooks
  * [Thrones To Vec](https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE.git)
* Videos
  * [Word Embedding Explanation](https://www.youtube.com/watch?v=5PL0TmQhItY)
  * [Spacy - Modern NLP in Python](https://www.youtube.com/watch?v=6zm9NC9uRkk)