### Word Embeddings

- We'll be using the [pymagnitude](https://github.com/plasticityai/magnitude) library

In [1]:
import nltk
import numpy as np  # needed below for np.sum and np.array
from sklearn.linear_model import LogisticRegression
from pymagnitude import *

In [2]:
#path = 'data/fasttext-wiki-news-300d-1M.magnitude'
#path = 'data/glove.6B.300d.magnitude'
path = 'data/glove.6B.50d.magnitude'
# this isn't working: path = 'data/elmo_2x4096_512_2048cnn_2xhighway_weights.magnitude'

vectors = Magnitude(path)

In [3]:
len(vectors)

400000

In [4]:
vectors.dim # this is how big the vectors are for each word

50

In [5]:
"cat" in vectors

True

In [6]:
for key, vector in vectors[500:510]:
    print(key, vector[:3])

working [ 0.0547345 -0.0305866 -0.0075621]
community [ 0.0276732  0.117468  -0.1533174]
eight [0.0133356 0.0815326 0.1307856]
groups [ 0.0933181 -0.0622403 -0.0163335]
despite [-0.0066614  0.0074928 -0.0322814]
level [-0.0736265  0.1976634  0.0354784]
largest [0.1119611 0.0235172 0.0475007]
whose [ 0.0633574  0.144303  -0.0080723]
attacks [ 0.2780417 -0.1416092  0.1276424]
germany [0.0529495 0.009489  0.0464709]


In [7]:
vectors.query("cat")[:3]

array([ 0.1027278, -0.1136787, -0.1218595], dtype=float32)

In [8]:
vectors.query(["cat","dog"])[0][:3]

array([ 0.1027278, -0.1136787, -0.1218595], dtype=float32)

In [9]:
vectors.distance("cat", "dog")

0.39547303

In [10]:
vectors.distance("cat", "car")

1.1279846
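
The two distances above are easier to interpret next to cosine similarity. Here is a minimal sketch, assuming `distance` is the Euclidean distance between unit-length vectors (Magnitude normalizes vectors by default, as far as I know). The 3-d vectors are made up for illustration and do not come from the GloVe file.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Hypothetical toy embeddings (not real GloVe values)
cat = unit([0.9, 0.8, 0.1])
dog = unit([0.8, 0.9, 0.2])
car = unit([-0.2, 0.1, 0.9])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    # A plain dot product is the cosine because both vectors have length 1
    return float(np.dot(a, b))

# For unit vectors: distance**2 == 2 - 2 * cosine_similarity
for name, other in [("dog", dog), ("car", car)]:
    print(name, round(euclidean(cat, other), 3), round(cosine(cat, other), 3))
```

Under that assumption, a distance of 0.395 corresponds to a cosine similarity of roughly 0.92, and 1.128 to roughly 0.36.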

In [11]:
vectors.most_similar_to_given("cat", ["dog", "television", "laptop"]) 

'dog'

In [12]:
vectors.doesnt_match(["breakfast", "cereal", "dinner", "lunch"])

'cereal'

In [13]:
#vectors.most_similar("cat", topn = 5)

In [14]:
#vectors.most_similar(positive = ["woman", "king"], negative = ["man"])
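
The commented-out analogy query combines the `positive` vectors, subtracts the `negative` ones, and looks for the nearest remaining word. This is a rough sketch of that idea with hypothetical 2-d vectors. The `analogy` helper is my own, and the ranking inside pymagnitude may differ in its details.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.0]),
}

def analogy(positive, negative, vocab):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(vocab[w] for w in positive) - sum(vocab[w] for w in negative)
    target = target / np.linalg.norm(target)
    exclude = set(positive) | set(negative)  # never return a query word
    scores = {
        w: float(np.dot(v / np.linalg.norm(v), target))
        for w, v in vocab.items()
        if w not in exclude
    }
    return max(scores, key=scores.get)

print(analogy(["woman", "king"], ["man"], toy))
```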

### Topic Modeling

- Given a document, determine the topic of the document
- For this task, we'll use the Brown corpus of texts accessible via NLTK

In [15]:
from nltk.corpus import brown
from collections import defaultdict
from tqdm import tqdm_notebook as tqdm # tqdm displays a progress bar

category_vectors = []

cats = brown.categories()
    
# for each category
for cat in cats:
    print(cat)
    # grab all of the documents
    for fileid in tqdm(brown.fileids(categories=[cat])):
        words = list(map(str.lower, brown.words(fileids=[fileid])))
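        # grab all of the words, find their embedding, sum all embeddings
        # axis=0 sums across words, leaving one total per embedding dimension;
        # each query([w]) has shape (1, dim), so word_sum also has shape (1, dim)
        word_sum = np.sum([vectors.query([w]) for w in words if w in vectors], axis=0)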
        # add the now summed embedding to the list for this category
        category_vectors.append((cat,word_sum))
    

adventure
belles_lettres
editorial
fiction
government
hobbies
humor
learned
lore
mystery
news
religion
reviews
romance
science_fiction

In [16]:
import pandas as pd

keys,values=zip(*category_vectors) # unzip using a *

data = pd.DataFrame({'cat':keys,'vectors':values})

In [17]:
data[:3]

         cat                                            vectors
0  adventure  [[101.48395, 49.925434, -35.528347, -92.13484,...
1  adventure  [[93.14879, 49.615086, -33.30255, -89.50236, 1...
2  adventure  [[93.08724, 25.093605, 1.5583314, -87.50236, 1...
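
The double brackets in the `vectors` column come from querying with a one-element list: each lookup returns a `(1, dim)` array, so the per-document sum keeps that leading axis. Here is a small sketch of the shapes involved. The `toy` dictionary and `query` function are hypothetical stand-ins for the Magnitude object.

```python
import numpy as np

dim = 4
# Hypothetical lookup table standing in for the Magnitude object
toy = {"the": np.ones(dim), "cat": np.arange(dim, dtype=float)}

def query(words):
    """Mimic a list query: one row per word, shape (len(words), dim)."""
    return np.stack([toy[w] for w in words])

words = ["the", "cat", "sat", "the"]  # "sat" is out of vocabulary and is skipped
per_word = [query([w]) for w in words if w in toy]  # each item has shape (1, dim)

# axis=0 adds the word arrays together, element by element
doc = np.sum(per_word, axis=0)
print(doc.shape)  # leading axis of size 1 is kept
print(doc[0])     # the flat document vector
```

This is why later cells take `x[0]` for each row before building the feature matrix.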


In [18]:
total = len(data)

#### compute the baselines

In [19]:
print('random baseline {}'.format(1.0/len(cats))) # number of categories, not the length of a category name

print('most common baseline?')
for cat in cats:
    print(cat, len(data[data.cat==cat])/total)

random baseline 0.06666666666666667
most common baseline?
adventure 0.058
belles_lettres 0.15
editorial 0.054
fiction 0.058
government 0.06
hobbies 0.072
humor 0.018
learned 0.16
lore 0.096
mystery 0.048
news 0.088
religion 0.034
reviews 0.034
romance 0.058
science_fiction 0.012
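
The two baselines can also be computed directly from the per-category document counts. This is a sketch only: the counts are the fractions above multiplied by the 500 documents in the corpus, and the variable names are my own.

```python
# Documents per Brown category (fraction above * 500 documents)
counts = {
    "adventure": 29, "belles_lettres": 75, "editorial": 27, "fiction": 29,
    "government": 30, "hobbies": 36, "humor": 9, "learned": 80, "lore": 48,
    "mystery": 24, "news": 44, "religion": 17, "reviews": 17, "romance": 29,
    "science_fiction": 6,
}

total = sum(counts.values())

# Random baseline: guess uniformly among the categories
random_baseline = 1.0 / len(counts)

# Most-common baseline: always predict the largest category
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print("random baseline", random_baseline)
print("most common baseline", majority_label, majority_baseline)
```

Always predicting `learned` gives 16% accuracy, so a classifier needs to beat 16%, not just 6.7%, to show it has learned something.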


#### split the data into train/test

In [20]:
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

test.shape, train.shape 

((50, 2), (450, 2))

In [21]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder()
le.fit(data.cat)
y = le.transform(train.cat).reshape(-1, 1) # reshape(-1, 1) turns the 1-D label array into one column, since OneHotEncoder expects 2-D input
ohe.fit(y)
y = ohe.transform(y)

X = np.array([x[0] for x in train.vectors])

X.shape, y.shape

((450, 50), (450, 15))
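
To see what the label encoding and one-hot steps produce, here is a NumPy-only sketch on a hypothetical four-document sample. It mirrors the idea rather than scikit-learn's implementation, and it yields a dense array where `OneHotEncoder` returns a sparse matrix by default.

```python
import numpy as np

labels = np.array(["news", "humor", "news", "lore"])  # hypothetical sample

# Step 1 (like LabelEncoder): sorted class names and an integer id per label
classes, ids = np.unique(labels, return_inverse=True)

# Step 2 (like OneHotEncoder): row i of the identity matrix encodes class i
one_hot = np.eye(len(classes))[ids]

print(classes)  # ['humor' 'lore' 'news']
print(ids)      # [2 0 2 1]
print(one_hot)
```

With 15 Brown categories, each row would have 15 columns and exactly one 1, which matches the `(450, 15)` shape above.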

#### train a classifier

In [24]:
from keras.models import Model
from keras.layers import Input, Dense
from keras.models import Sequential
from keras import optimizers
from keras import regularizers

In [25]:
model = Sequential()
model.add(Dense(units=16, activation = 'sigmoid', input_dim=vectors.dim, kernel_regularizer=regularizers.l1(0.01))) # sigmoid, relu
model.add(Dense(units=8, activation = 'sigmoid'))
model.add(Dense(units=4, activation = 'sigmoid'))
model.add(Dense(units=len(cats), activation='softmax'))

In [26]:

sgd = optimizers.SGD(lr=0.10, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,#'sgd', # sgd, adam 
              metrics=['accuracy'])
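
The final `softmax` layer and the `categorical_crossentropy` loss work as a pair: softmax turns the last layer's raw scores into class probabilities, and the loss penalizes low probability on the true class. This is a NumPy sketch of the math on made-up numbers, not Keras's own code.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

# Two hypothetical documents, three classes
logits = np.array([[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

# A model that guesses uniformly scores log(number of classes)
uniform = np.full((2, 3), 1.0 / 3.0)
uniform_loss = categorical_crossentropy(y_true, uniform)

print(probs.round(3), loss, uniform_loss)
```

For the 15 Brown categories, a uniform guess gives a loss of about ln(15) ≈ 2.71, which is a useful reference point when reading the training loss.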

In [27]:
# OneHotEncoder returns a SciPy sparse matrix by default; Keras expects a dense array
model.fit(X, y.toarray(), epochs=100, verbose=False)

In [None]:
# making deep learning models more general purpose: use fewer layers and more epochs
test_y = ohe.transform(le.transform(test.cat).reshape(-1, 1)).toarray() # dense array for Keras
test_X = np.array([x[0] for x in test.vectors])
model.evaluate(test_X, test_y)

## Play around with the following and note the changes in accuracy, as well as how long the training step takes:
##### Begin with a single dense hidden layer
* The accuracy increased from 14% to 16%.

##### Change the number of nodes in the hidden layer
* Using 8 nodes instead of 16 resulted in 16% accuracy. Using 32 nodes resulted in 14% accuracy.

##### Change the activation function of the hidden layer
* Changing the activation function to 'relu' did not change the accuracy from 16%.

##### Change the learning rate
* Changing the learning rate from 0.01 to 0.1 did not change the accuracy from 16%.

##### Add additional Dense layers
* I added two more hidden layers, initially all with 8 units, then with 8, 4, and 2. The accuracy stayed at 16%.

##### Add regularization (e.g., Dropout, L1, L2)
* I tried adding L1 and L2 regularization, with values ranging from 0.01 to 0.4. The accuracy never rose above 16%, and it sometimes dropped back to 14%. To improve this model, I should try different numbers of layers and units per layer.

## Answer the following questions in a markdown cell:
##### What is the neural network learning?
* The neural network is learning the parameters (weights and biases) of the functions computed by its nodes, so that a document's summed word embedding maps to the correct Brown category.

##### How does the depth or width of the network affect the training and the results?
* Deeper and wider networks increase training time. As a network grows in depth and/or width, it has more parameters to fit and becomes more likely to overfit the training data.
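
One way to make "depth and width" concrete is to count trainable parameters. Here is a small sketch for fully connected layers; `dense_param_count` is a hypothetical helper, and the layer sizes are the ones used in the model above (50 inputs, then 16, 8, 4 and 15 units).

```python
def dense_param_count(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each Dense layer."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

deep = dense_param_count([50, 16, 8, 4, 15])   # the 3-hidden-layer model
single = dense_param_count([50, 16, 15])       # one hidden layer of 16 units
wide = dense_param_count([50, 32, 15])         # one hidden layer of 32 units

print(deep, single, wide)
```

Each of these networks has more than 1,000 parameters, while the training set has only 450 documents, which is part of why overfitting is a concern here.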

##### What is regularization? Why is it important?
* Regularization is a set of techniques (such as Dropout or L1/L2 weight penalties) used to prevent overfitting. It's important because an overfit model scores well on its training data but generalizes poorly to new data.
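
As a concrete illustration of L1 and L2 regularization, the penalty is an extra term added to the loss that grows with the size of the weights. This is a NumPy sketch of that arithmetic; the weights and the `data_loss` value are made up.

```python
import numpy as np

def l1_penalty(weights, rate):
    """rate * sum of absolute weights (pushes weights toward exactly zero)."""
    return rate * float(np.sum(np.abs(weights)))

def l2_penalty(weights, rate):
    """rate * sum of squared weights (shrinks large weights the most)."""
    return rate * float(np.sum(weights ** 2))

weights = np.array([[0.5, -2.0], [0.0, 1.5]])  # hypothetical layer weights
data_loss = 1.20                               # hypothetical cross-entropy value

total_l1 = data_loss + l1_penalty(weights, 0.01)
total_l2 = data_loss + l2_penalty(weights, 0.01)

print(total_l1, total_l2)
```

The optimizer minimizes the total, so it is rewarded for keeping weights small as well as for fitting the training data.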