# AUTMI Seminar 2019/2020 Spring

## Natural Language Processing

## April , 2020

# Text representations and analysis

## Preparation

[Download GLOVE](https://nlp.stanford.edu/projects/glove/)

In [None]:
!pip install spacy

!pip install textacy

!pip install flair

!pip install torchtext

!pip install -U scikit-learn

!pip install gensim

!pip install wordcloud

!python -m spacy download en

## Representations

Strings are a meaningful representation for humans, but a computer needs numerical representations to run machine learning algorithms. The easiest approach is to create a `word ---> id` mapping that maps each word to an integer id, starting from 0. Different words get different ids.

This mapping is the basis of **one-hot encoding**, where each word is represented by a vector that is as long as the vocabulary and contains a single 1 at the position of the word's id. Let's encode the following sentence:

In [None]:
sentence = "yesterday the lazy dog went to the store to buy food".split(" ")
print(sentence)

In [None]:
mapping = dict()
max_id = 0

for word in sentence:
    # a word we have not seen before
    if word not in mapping:
        # assign the smallest unused id
        mapping[word] = max_id
        # increment the id for the next word
        max_id = max_id + 1
        
mapping
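
The ids can be turned into one-hot vectors: a vector as long as the vocabulary, with a single 1 at the position given by the word's id. The following is a minimal sketch in plain Python; the helper name `one_hot` is introduced here for illustration, and the mapping is rebuilt so that the snippet runs on its own.

```python
sentence = "yesterday the lazy dog went to the store to buy food".split(" ")

# Rebuild the word -> id mapping in order of first occurrence
mapping = {}
for word in sentence:
    if word not in mapping:
        mapping[word] = len(mapping)

def one_hot(word, mapping):
    """Return a list of zeros with a single 1 at the id of `word`."""
    vector = [0] * len(mapping)
    vector[mapping[word]] = 1
    return vector

print(mapping["dog"])           # id of "dog"
print(one_hot("dog", mapping))  # vector with a 1 at that position
```

Note that the vector length equals the vocabulary size, so every new word adds a dimension.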

## Problems
- When representing words with ids, we assign the ids in the order in which we encounter the words.
- This means that a word's id (and therefore its vector) changes whenever the corpus or the order of its words changes.
- We also have no concept of similarity. Intuitively, we would want `similarity(cat, dog) > similarity(cat, computer)`, but all one-hot vectors are equally far from each other.
- The representation is very sparse and can have a very high dimension (one dimension per vocabulary word), which also slows down computations.
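
The first and third problems can be checked on a three-word vocabulary. This is only an illustrative sketch; the helpers `build_mapping`, `one_hot` and `dot` are names made up for this example.

```python
def build_mapping(words):
    """Assign ids in order of first occurrence."""
    mapping = {}
    for word in words:
        if word not in mapping:
            mapping[word] = len(mapping)
    return mapping

def one_hot(word, mapping):
    vector = [0] * len(mapping)
    vector[mapping[word]] = 1
    return vector

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

# The same vocabulary seen in a different order yields different ids
first = build_mapping(["cat", "dog", "computer"])
second = build_mapping(["computer", "cat", "dog"])
print(first["cat"], second["cat"])  # 0 1

# Any two distinct one-hot vectors are orthogonal, so "cat" is
# exactly as dissimilar to "dog" as it is to "computer"
cat, dog, computer = (one_hot(w, first) for w in ["cat", "dog", "computer"])
print(dot(cat, dog), dot(cat, computer))  # 0 0
```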

## Word embeddings

- map each word to a low-dimensional (around 100-300 dimensions), continuous vector.
- the goal is that similar words have similar vectors.
    - what do we mean by word similarity?
    
    
### Cosine similarity

- Now that we have word vectors, we need a way to quantify the similarity between individual words according to these vectors. One such metric is cosine similarity, the cosine of the angle between two vectors $p$ and $q$. We will use it to find words that are "close" to and "far" from one another.

$$s = \frac{p \cdot q}{\|p\| \, \|q\|}, \textrm{ where } s \in [-1, 1] $$
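
The formula translates directly into code. Below is a minimal sketch in plain Python; the three toy vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q (both non-zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Toy 3-dimensional vectors, made up for illustration (not real embeddings)
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
computer = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog) > cosine_similarity(cat, computer))  # True
```

Because the dot product is divided by both vector lengths, only the direction of the vectors matters, not their magnitude.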

    
### Creating word embeddings

"a word is characterized by the company it keeps" -- popularized by Firth

- A popular theory (the distributional hypothesis) is that words are as similar as the contexts they appear in
- Word embeddings are commonly created with neural networks trained on one of two tasks:
    1. predict a missing word based on its context (CBOW)
    2. predict a word's context given the word itself (skip-gram)

To create word embeddings, a neural network is trained to perform one of these tasks. However, the trained network is not actually used for the task it was trained on. The goal is to learn the weights of the hidden layer, and these weights become our vectors, called "word embeddings".

Given a specific word, the neural network looks at the words nearby and learns, for every word in the vocabulary, the probability of it being a "nearby word". What counts as "nearby" is given by a window size, which is a parameter of the algorithm (dog is more likely to appear next to cat than computer is).

The training examples are generated from large text corpora. For example, from the sentence “The quick brown fox jumps over the lazy dog.” we can generate the following inputs:

![training examples](http://mccormickml.com/assets/word2vec/training_data.png)
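
The same kind of (input word, nearby word) pairs can be generated with a few lines of plain Python. This is a sketch under the assumption of a symmetric window of size 2; the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center word, nearby word) pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window_size=2)
print(pairs[:5])
```

A larger window produces more pairs per word and captures broader, more topical context.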

To train the network, we first build a vocabulary of words from our training documents; let's say we have a vocabulary of 10,000 unique words. Each input word is then represented as a one-hot vector over this vocabulary. The output of the network is a single vector that contains, for every word in the vocabulary, the probability of it being a "nearby" word.

![architecture](http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png)

_(images from mccormickml.com)_
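
To make the architecture concrete, here is a sketch of a single forward pass in plain Python. The five-word vocabulary, the 3-dimensional hidden layer and the random weights are assumptions chosen for illustration; no training is performed.

```python
import math
import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]
embedding_dim = 3  # tiny on purpose; real embeddings use 100-300 dimensions

# Randomly initialised weights (training would adjust them)
W_in = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
W_out = [[random.uniform(-1, 1) for _ in vocab] for _ in range(embedding_dim)]

def softmax(scores):
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word):
    """Return the hidden vector and the probability of each vocabulary word being nearby."""
    one_hot = [1 if w == word else 0 for w in vocab]
    # Multiplying a one-hot vector with W_in simply selects the row of W_in for this word
    hidden = [sum(one_hot[i] * W_in[i][d] for i in range(len(vocab))) for d in range(embedding_dim)]
    scores = [sum(hidden[d] * W_out[d][j] for d in range(embedding_dim)) for j in range(len(vocab))]
    return hidden, softmax(scores)

hidden, probs = forward("fox")
print(hidden == W_in[vocab.index("fox")])  # True: the hidden layer is the word's embedding
print(round(sum(probs), 6))                # 1.0
```

This is why the rows of the input weight matrix can be read off as word embeddings after training.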

### Famous static word embeddings for English

- Word2vec
- GloVe

### Contextual embeddings?

- ELMo
- BERT
- Flair

For static embeddings, we will use GloVe with 100-dimensional vectors trained on 6B tokens.

[Download GLOVE](https://nlp.stanford.edu/projects/glove/)

In [None]:
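import gensim
from gensim.scripts.glove2word2vec import glove2word2vec

embedding_file = "glove.6B.100d.txt"
word2vec_file = "glove.6B.100d.word2vec.txt"
# GloVe files lack the "<vocab size> <dimensions>" header of the word2vec text format, so convert first
glove2word2vec(embedding_file, word2vec_file)

embedding = gensim.models.KeyedVectors.load_word2vec_format(word2vec_file, binary=False)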

In [None]:
dog_vector = embedding["dog"]
type(dog_vector), dog_vector.shape

In [None]:
embedding.most_similar("president")

In [None]:
embedding.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
embedding.similarity("woman", "computer")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

In [None]:
def tsne_plot(model, size=500):
    """Project the first `size` word vectors to 2D with t-SNE and plot them."""
    # index2word is the gensim 3.x attribute (renamed to index_to_key in gensim 4)
    labels = model.index2word[:size]
    tokens = np.array([model[word] for word in labels])

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x, y = new_values[:, 0], new_values[:, 1]
    plt.scatter(x, y)
    # Label every point with its word
    for i, label in enumerate(labels):
        plt.annotate(label,
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
tsne_plot(embedding, 200)

## Contextual embeddings

In GloVe and Word2vec, every word has a single static representation. But words can have different meanings in different contexts, e.g. the word "stick":

1. Find some dry sticks and we'll make a campfire.
2. Let's stick with glove embeddings.

![elmo](http://jalammar.github.io/images/elmo-embedding-robin-williams.png)

_(Peters et al., 2018 in the ELMo paper)_

In [None]:
# The Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence1 = Sentence("Find some dry sticks and we'll make a campfire.")
sentence2 = Sentence("Let's stick with glove embeddings.")

# embed words in sentence
flair_embedding_forward.embed(sentence2)
for token in sentence2:
    print(token)
    print(token.embedding)

In Flair, a pretrained NER tagger is also available.

### Load matplotlib, pandas and spacy

In [None]:
import os
import re

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import spacy
from spacy import displacy

%matplotlib inline

matplotlib.style.use('ggplot')
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.size'] = 20

In [None]:
nlp = spacy.load("en")

## Data analysis

- we use NLP frameworks for the basic tasks
- for the preprocessing tasks (lemmatization, tokenization) we use [spaCy](https://spacy.io/)
- for keyword extraction and various text analysis tasks we use [textacy](https://github.com/chartbeat-labs/textacy)
- textacy builds on spaCy output
- both are open source Python libraries

In [None]:
import os

import torch
import torchtext
from torchtext import data
from torchtext.datasets import text_classification

# Assumed value (unigrams only); the call is only used to download the AG_NEWS CSV files into ./data
NGRAMS = 1

os.makedirs('./data', exist_ok=True)
text_classification.DATASETS['AG_NEWS'](
    root='./data', ngrams=NGRAMS, vocab=None)

In [None]:
import pandas as pd

train_data = pd.read_csv("./data/ag_news_csv/train.csv",quotechar='"', names=['label', 'title', 'description'])
test_data = pd.read_csv("./data/ag_news_csv/test.csv",quotechar='"', names=['label', 'title', 'description'])

In [None]:
train_data["text"] = train_data.title +  "," + train_data.description
train_data = train_data.drop("title", axis=1)
train_data = train_data.drop("description", axis=1)

test_data["text"] = test_data.title +  "," + test_data.description
test_data = test_data.drop("title", axis=1)
test_data = test_data.drop("description", axis=1)

In [None]:
train_data.groupby(train_data.label).size().plot.pie(subplots=True,figsize=(5, 10),autopct="%.0lf%%")

In [None]:
doc = nlp("Donald Trump called and asked me to serve as his running mate and Vice Presidential nominee.")

In [None]:
for tok in doc:
    print(tok.pos_)

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
displacy.render(doc, style='ent', jupyter=True)

In [None]:
text_sports = train_data[train_data.label == 2]

text = " ".join(text_sports.text.tolist())
doc_text = nlp(text[:200000])

In [None]:
import textacy
from textacy.extract import ngrams
from collections import Counter

Counter([ng.text.lower() for n in [2,4] for ng in ngrams(doc_text, n)]).most_common(10)

Textacy can use graph-based keyword extraction methods.

* TextRank (focuses on words)
* SingleRank (focuses on phrases)
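
The idea behind these graph-based methods can be sketched in plain Python: words become nodes, words that occur close to each other are connected, and a PageRank-style iteration scores the nodes. This is a simplified illustration and not textacy's implementation; the function name, the window size, the damping factor of 0.85 and the toy word list are all assumptions.

```python
def textrank(words, window_size=2, damping=0.85, iterations=50):
    """Score words by PageRank on an undirected co-occurrence graph."""
    # Connect every pair of distinct words that occur within `window_size` of each other
    neighbours = {word: set() for word in words}
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window_size]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Repeatedly let every word receive a share of its neighbours' scores
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[other] / len(neighbours[other]) for other in neighbours[word]
            )
            for word in neighbours
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

words = "dog chases cat cat chases mouse mouse eats cheese dog eats".split()
for word, score in textrank(words)[:3]:
    print(word, round(score, 3))
```

Words that co-occur with many well-connected words end up with the highest scores, which is what makes them keyword candidates.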

In [None]:
from textacy import keyterms

keyterms.textrank(
    doc_text,
    normalize = "lemma",
    n_keyterms=10,
)

In [None]:
textacy.keyterms.singlerank(
    doc_text,
    normalize = "lemma",
    n_keyterms=10,
)

Count the word frequencies in the doc (alphabetic tokens only, without stop words):

In [None]:
import math
from collections import Counter 
words = [tok for tok in doc_text if tok.is_alpha and not tok.is_stop]
word_probs = {tok.text.lower(): tok.prob for tok in words}

freqs = Counter(tok.text for tok in words)

In [None]:
from wordcloud import WordCloud
print(len(freqs))
wordcloud = WordCloud(background_color="white", max_words=30, scale=1.5).generate_from_frequencies(freqs)
image = wordcloud.to_image()
image.save("./wordcloud.png")

In [None]:
from IPython.display import Image 
Image(filename='./wordcloud.png')

In [None]:
sample_df = train_data.groupby('label').apply(lambda x: x.sample(frac=0.2))

We add a new column to the table that contains the cleaned and preprocessed text:

In [None]:
from tqdm import tqdm

clean_text = []
for text in tqdm(sample_df['text']):
    doc = nlp(text)
    words = []
    for tok in doc:
        if not tok.is_stop and tok.is_alpha:
            words.append(tok.lemma_)
    clean_text.append(words)

# Add cleaned text to dataframe
sample_df['clean_text'] = clean_text
sample_df.head()

In [None]:
# Set variables for dependent and independent variables
labels = sample_df.label.tolist()
data = sample_df['clean_text'].tolist()

In [None]:
import gensim
from tqdm import tqdm
from sklearn.model_selection import train_test_split as split
import numpy as np

In [None]:
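# We use the pretrained GloVe embedding.
# To turn a variable-length token sequence into one fixed-size vector, we average its word vectors.
import os
from gensim.scripts.glove2word2vec import glove2word2vec

def average_vector(model, article, dim=100):
    # GloVe 6B is lowercased; out-of-vocabulary words contribute a zero vector
    vectors = [model[word.lower()] if word.lower() in model else np.zeros((dim,)) for word in article]
    # An article with no tokens left after cleaning maps to the zero vector
    return np.mean(vectors, axis=0) if vectors else np.zeros((dim,))

def vectorize(tr_data, tst_data):
    print('\nLoading existing glove model...')
    embedding_file = "glove.6B.100d.txt"
    word2vec_file = "glove.6B.100d.word2vec.txt"
    # GloVe files lack the "<vocab size> <dimensions>" header of the word2vec text format
    if not os.path.exists(word2vec_file):
        glove2word2vec(embedding_file, word2vec_file)

    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_file, binary=False)

    tr_vectors = [average_vector(model, article) for article in tqdm(tr_data, 'Vectorizing')]
    tst_vectors = [average_vector(model, article) for article in tqdm(tst_data, 'Vectorizing')]

    return tr_vectors, tst_vectors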
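
The averaging step can be illustrated on a toy example in plain Python. The 2-dimensional embedding table and the helper name `mean_vector` are made up for this sketch.

```python
# Toy 2-dimensional embedding table, made up for illustration
toy_embedding = {
    "dog": [1.0, 0.0],
    "store": [0.0, 1.0],
}

def mean_vector(words, embedding, dim=2):
    """Average the vectors of `words`; unknown words count as zero vectors."""
    if not words:
        return [0.0] * dim
    vectors = [embedding.get(word, [0.0] * dim) for word in words]
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

print(mean_vector(["dog", "store"], toy_embedding))                # [0.5, 0.5]
print(mean_vector(["dog", "store", "zzz", "qqq"], toy_embedding))  # [0.25, 0.25]
```

The second call shows a side effect of this design: unknown words pull the average toward zero, and the word order is lost in either case.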

In [None]:
def get_features_and_labels(data, labels):
    # Hold out 30% of the sampled articles for testing
    tr_data, tst_data, tr_labels, tst_labels = split(data, labels, test_size=0.3)
    tr_vecs, tst_vecs = vectorize(tr_data, tst_data)
    return tr_vecs, tr_labels, tst_vecs, tst_labels

In [None]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(data, labels)

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

### You can try different classifiers as well
- Multiple are available from [scikit-learn](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

In [None]:
rf  =  RandomForestClassifier(n_estimators=100, verbose=True, n_jobs=-1)
svc = SVC()
lr  = LogisticRegression(n_jobs=-1)

In [None]:
rf.fit(tr_vecs, tr_labels)
svc.fit(tr_vecs, tr_labels)
lr.fit(tr_vecs, tr_labels)

In [None]:
from sklearn.metrics import accuracy_score
print(type(tst_vecs))
rf_pred = rf.predict(tst_vecs)
svc_pred = svc.predict(tst_vecs)
lr_pred = lr.predict(tst_vecs)
print("Random Forest Test accuracy : {}".format(accuracy_score(tst_labels, rf_pred)))
print("SVC Test accuracy : {}".format(accuracy_score(tst_labels, svc_pred)))
print("Logistic Regression Test accuracy : {}".format(accuracy_score(tst_labels, lr_pred)))

# Building a Deep Learning model with PyTorch and torchtext

In [None]:
test_data.to_csv("dataset_test.csv", index=False)
train_data.to_csv("dataset_train.csv", index=False)

In [None]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.long)

In [None]:
fields = [('label',LABEL),('text', TEXT)]

train, test = data.TabularDataset.splits(
                                        path = '.',
                                        train = 'dataset_train.csv',
                                        test = 'dataset_test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

In [None]:
print(vars(train.examples[0]))

In [None]:
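import random

# random.seed() returns None, so seed first and pass the resulting generator state
random.seed(SEED)
train, valid = train.split(random_state = random.getstate())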

In [None]:
print(f'Number of training examples: {len(train)}')
print(f'Number of validation examples: {len(valid)}')
print(f'Number of testing examples: {len(test)}')

In [None]:
TEXT.build_vocab(train, vectors ="glove.6B.100d")  
LABEL.build_vocab(train)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test),
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch = False,
    device = device)

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.embedding.weight.data.copy_(TEXT.vocab.vectors)
        self.embedding.weight.requires_grad=False
        
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)

        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
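        # nn.LSTM returns the outputs and a (hidden state, cell state) tuple
        output, (hidden, cell) = self.lstm(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
                
        # final hidden state of the last layer: [batch size, hid dim]
        y = self.fc(hidden[-1])
        
        log_probs = F.log_softmax(y, dim=1)
        return log_probs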

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 100
OUTPUT_DIM = 4

model = LSTMClassifier(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
criterion = nn.NLLLoss()
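
The model returns log-probabilities, and `nn.NLLLoss` takes the negative log-probability of the correct class (averaged over the batch by default). The arithmetic for a single example can be sketched in plain Python; the scores below are made up, and the helper names are chosen for this illustration only.

```python
import math

def log_softmax(scores):
    """Log of the softmax probabilities, computed stably."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the correct class for one example."""
    return -log_probs[target]

scores = [2.0, 0.5, 0.1, -1.0]  # made-up scores for the 4 news classes
log_probs = log_softmax(scores)
print(round(sum(math.exp(lp) for lp in log_probs), 6))                # 1.0
print(nll_loss(log_probs, target=0) < nll_loss(log_probs, target=3))  # True
```

The loss is small when the model puts high probability on the correct class and grows as that probability shrinks.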

In [None]:
model = model.to(device)
criterion = criterion.to(device)

In [None]:
from sklearn.metrics import classification_report
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # pick the class with the highest log-probability
    predicted = preds.argmax(1)
    correct = (predicted == y).float() #convert into float for division 
    target_names = ['class 0', 'class 1', 'class 2', 'class 3']
    # classification_report expects (y_true, y_pred); fixed labels keep it valid for batches that miss a class
    print(classification_report(y.cpu().numpy(), predicted.cpu().numpy(), labels=[0, 1, 2, 3], target_names=target_names))
    acc = correct.sum() / len(correct)
    return acc

In [None]:
from sklearn.metrics import accuracy_score
import torch.nn.functional as F
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text)

        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            predictions = model(batch.text)
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 15

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

In [None]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')