# OHE, CountVectorizer

This notebook has three parts:
1) Part 1 covers the basics of tokenization and one-hot encoding (OHE)
2) Part 2 covers scikit-learn's `CountVectorizer`
3) Part 3 works through an end-to-end analysis of disaster tweets

## Part 1: Tokenization & OHE

In [None]:
%matplotlib inline

import string
from collections import Counter
from pprint import pprint
import gzip
import matplotlib.pyplot as plt
import numpy as np 


In [None]:
long_text = """It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."""
short_text = """In fairy-tales, witches always wear silly black hats and black coats, and they ride on broomsticks. But this is not a fairy-tale. This is about REAL WITCHES."""
text = short_text

#### Tokenization

Most ML algorithms cannot work with text (categorical) data directly. **Tokenization** splits the text into words (tokens), and **encoding** then maps each token to a number, which lets you represent the text _more expressively_ in a form a model can use.

In [None]:
def extract_words(text):
    text_words = []

    for word in text.split():
        # remove punctuation at the beginning and end of the word
        word = word.strip(string.punctuation)

        # skip tokens that consisted of punctuation only (now empty);
        # indexing word[0] on such a token would raise an IndexError
        if word:
            # append the lower-cased word to our list of words
            text_words.append(word.lower())

    return text_words

In [None]:
text_words = extract_words(text)
print(text_words)

#### One Hot Encoding

1. One-hot encoding is a simple, unambiguous way to represent words as vectors.
2. The feature vector lives in a high-dimensional space in which each dimension represents one word of the vocabulary.
3. Every element of the vector is zero, except the element in the dimension that represents the word.
4. For _full texts_ instead of single words, the vector representation of the text is simply the vector sum of the one-hot vectors of all the words it contains.
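
The vector-sum idea in point 4 can be sketched as follows. This is a minimal illustration on a made-up five-word vocabulary; the names `vocab`, `one_hot_vector` and `text_vector` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

# Toy vocabulary (assumed for illustration): word -> index in the vector
vocab = {"mary": 0, "had": 1, "a": 2, "little": 3, "lamb": 4}

def one_hot_vector(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab[word]] = 1
    return vec

def text_vector(words, vocab):
    """Sum the one-hot vectors of all words in a text."""
    total = np.zeros(len(vocab), dtype=int)
    for w in words:
        total += one_hot_vector(w, vocab)
    return total

sentence = "mary had a little lamb little lamb".split()
doc_vec = text_vector(sentence, vocab)
print(doc_vec)  # one count per vocabulary word
```

Each entry of the summed vector is the number of times the corresponding word occurs in the text, which is the representation `CountVectorizer` builds in Part 2.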



In [None]:
word_dict = {}
word_list = []
vocabulary_size = 0
text_tokens = []

for word in text_words:
    # create an ID for words seen for the first time & add to dictionary
    if word not in word_dict:
        word_dict[word] = vocabulary_size
        word_list.append(word)
        vocabulary_size += 1

    # add the token corresponding to the current word to the tokenized text.
    text_tokens.append(word_dict[word])

In [None]:
print("Word list:", word_list, "\n\n Word dictionary")
pprint(word_dict)

In [None]:
print(text_tokens)

In [None]:
import re
text = """
Mary had a little lamb, little lamb,
little lamb, Mary had a little lamb
whose fleece was white as snow. 
And everywhere that Mary went
Mary went, Mary went, everywhere 
that Mary went
the lamb was sure to go
"""

In [None]:
text = re.sub(r'[^\w\s]', '', text) 
word_list = text.lower().split()

In [None]:
word_dict = {}
for word in word_list:
    # give each new word the next free index: its position in the one-hot vector
    # (storing word counts here would make one_hot() index by frequency instead)
    if word not in word_dict:
        word_dict[word] = len(word_dict)

In [None]:
def one_hot(word, word_dict):
    """
    Generate a one-hot encoded vector for "word"
    """

    vector = np.zeros(len(word_dict))
    vector[word_dict[word]] = 1
    return vector

In [None]:
fleece_hot = one_hot('fleece', word_dict)
print(fleece_hot)

In [None]:
mary_hot = one_hot('mary', word_dict)
print(mary_hot)

In [None]:
# the single 1 sits at the index assigned to 'mary'
mary_hot[word_dict['mary']] == 1

#### OHE with Scikit-learn

1. Pre-process (transform) the data
    a. Lower case
    b. Split by space
2. Integer encode using `fit_transform` on the dataset
3. One-hot (binary) encode the integers with `OneHotEncoder`

`LabelEncoder` maps each distinct label (here, each word) to an integer between 0 and `n_classes - 1`


In [None]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder # Try ?LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
# 1. convert corpus documents to lower case
doc1 = long_text.lower()
doc2 = short_text.lower()

# split into unigram (single-word) tokens
doc1 = doc1.split()
doc2 = doc2.split()
dataset = array(doc1+doc2) # convert the combined list to a numpy array of strings

In [None]:
# 2. Integer encode the dataset
label_encoder = LabelEncoder()

# Option 1: Fit, then transform -- 2 steps

## Fit label encoder
le = label_encoder.fit(dataset) # learn the sorted set of unique tokens; fit returns the encoder itself
le_classes = le.classes_

## Transform labels to normalized encoding
integer_encoded = le.transform(dataset)

print(f'Corpus vocabulary\n\n{le.classes_}')

In [None]:
# Option 2 -> single command: fit_transform

## Fit label encoder and return encoded labels
integer_encoded = label_encoder.fit_transform(dataset) # alphabetically assign integers to each word

print(f'doc1 & doc2 encoded as integers: \n\n{integer_encoded}')
print(f'\n1D shape of integer_encoded: {integer_encoded.shape}') #1D shape

In [None]:
# 3. One-hot (binary) encoding
# `sparse_output` requires scikit-learn >= 1.2; older releases call this argument `sparse`
onehot_encoder = OneHotEncoder(sparse_output=False)

## Reshape `integer_encoded` into a 2D array required for `OneHotEncoder` instance
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(f'First couple of rows of OHE docs: \n\n{onehot_encoded[:2,:]}') # array of shape (number of tokens, number of unique tokens)

In [None]:
# Invert

## Transform labels back to original encoding
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(f'Verify that the encoded values match the text: \n{inverted}->{doc1[0]}')
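
As a side note, `OneHotEncoder` also accepts string categories directly, so the intermediate `LabelEncoder` step is optional. The sketch below is a standalone illustration on a short made-up token list; the variable names are hypothetical. It uses the default sparse output and converts it with `.toarray()`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy token list (assumed for illustration)
tokens = np.array("this is not a fairy-tale this is about real witches".split())

encoder = OneHotEncoder()
# OneHotEncoder expects a 2D array: one row per token, one column per feature
onehot = encoder.fit_transform(tokens.reshape(-1, 1)).toarray()

# inverse_transform maps each one-hot row back to its original token
recovered = encoder.inverse_transform(onehot).ravel()

print(encoder.categories_[0])  # the learned vocabulary, sorted
print(onehot.shape)            # (number of tokens, number of unique tokens)
```

The two-step route shown above remains useful when you also need the integer codes themselves.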

### Next steps

So where do we go from here? 

1. As you can see, we have managed to __encode__ the text corpus into a numerical vector format.
2. The text was broken into unigram (single-word) tokens (you could also create n-gram tokens), which were then encoded as integers, from which we created OHE vectors: all zeros, save for the index representing the token.
3. Next, we could feed these vectors into a neural network, with the input size equal to the length of each vector.
4. Although it is out of scope here, since we move on to TF-IDF and other topics, you could advance your programming skills by following this tutorial on implementation, which uses Airline Tweet Analysis: [Reference](https://towardsdatascience.com/word-embeddings-for-sentiment-analysis-65f42ea5d26e)
5. See below for the files you need to get started
    a. You can download the data directly from Kaggle: [Reference](https://www.kaggle.com/general/74235)
    b. Alternatively, download from the course's repo (`data`) folder
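
The n-gram tokens mentioned in point 2 can be produced with a few lines of plain Python. This is a minimal sketch; the helper name `ngrams` is hypothetical and is not part of any library used in this notebook.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words

print(bigrams)
```

Each n-gram can then be given an index and one-hot encoded exactly like a single word.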

## Part 2: Scikit-learn: CountVectorizer

Instead of fitting a label encoder and then converting the integer codes into one-hot vectors, you can use sklearn's `CountVectorizer`, which tokenizes the text, builds the vocabulary, and returns a document-term matrix of word counts in one step.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Each document is an element of the corpus:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# or use our documents
corpus = [long_text, short_text]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
corpus_features = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; get_feature_names() was removed in 1.2
print(f'Corpus features:\n\n{corpus_features}')

term_freq = X.toarray()
print(f'\nTerm Frequency: \n\n{term_freq}')
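
To connect this back to Part 1: `CountVectorizer` returns counts, whereas passing `binary=True` records only whether a word is present, which is closer to a one-hot style encoding of each document. The sketch below uses two made-up documents (an assumption for illustration). Note that the default tokenizer keeps only tokens of two or more characters, so the word "a" is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumed for illustration)
docs = ["little lamb little lamb", "mary had a little lamb"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(docs).toarray()

print(sorted(count_vec.vocabulary_))  # columns are in alphabetical order
print(counts)    # how many times each word occurs per document
print(presence)  # 1 if the word occurs at all, else 0
```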

#### Bigram implementation with CountVectorizer

In [None]:
# Bi-gram
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())  # scikit-learn >= 1.0; get_feature_names() was removed in 1.2

In [None]:
print(X2.toarray())

## Part 3: Example NLP problem: end to end

In [None]:
import numpy as np 
import pandas as pd 
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [None]:
train_df = pd.read_csv("data/twitter_disaster/train.csv")

In [None]:
train_df.head(3)

In [None]:
# Disaster Tweet sample: 
train_df[train_df["target"]==1]["text"].values[0]

In [None]:
# Not a disaster sample
train_df[train_df["target"]==0]["text"].values[0]

#### Approach

1. Load dataset
2. Divide the dataset into test & train components
3. Using sklearn's `CountVectorizer`, count how often each word occurs in each tweet, turning the text into numerical data


In [None]:
# 1. load the dataset as a pandas DataFrame
def load_dataset(filename, text="text", target="target"):
    data = pd.read_csv(filename) # header=None, if needed
    X = data[text]
    y = data[target]
    return X, y

In [None]:
X, y = load_dataset('data/twitter_disaster/train.csv')

In [None]:
# 2. test/train split
# train_test_split lives in the model_selection module imported above
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
# 3. Encode
count_vectorizer = feature_extraction.text.CountVectorizer()
X_train_enc = count_vectorizer.fit_transform(X_train)
X_test_enc = count_vectorizer.transform(X_test) # note, using only transform, not fit_transform

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()

**Train & Test integer encoding to vectors**

In [None]:
train_integer_encoded_vectors = count_vectorizer.fit_transform(train_df["text"])

**Notice** that we use `transform`, not `fit_transform`, for the test data.

This ensures that the test tweets are encoded with the vocabulary learned from the _training data_; tokens that appear only in the test dataset are ignored.
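
The effect can be seen on a toy example. This sketch uses made-up sentences and hypothetical variable names, not the tweet data: the word "earthquake" never occurs in the training text, so it gets no column and is silently dropped from the test encoding.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (assumed for illustration)
train_docs = ["forest fire near the town", "what a lovely day"]
test_docs = ["earthquake near the town"]

vec = CountVectorizer()
X_train_toy = vec.fit_transform(train_docs)  # learns the vocabulary from training text only
X_test_toy = vec.transform(test_docs)        # reuses that vocabulary unchanged

print(sorted(vec.vocabulary_))
print(X_test_toy.toarray())
```

Both matrices have the same number of columns, which is what allows a model fitted on the training matrix to score the test matrix.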

In [None]:
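# assumption: the Kaggle test split is stored next to train.csv; adjust the path if it lives elsewhere
test_df = pd.read_csv("data/twitter_disaster/test.csv")
test_integer_encoded_vectors = count_vectorizer.transform(test_df["text"])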

In [None]:
test_integer_encoded_vectors.todense()

In [None]:
print(train_integer_encoded_vectors[0].todense().shape)
print(train_integer_encoded_vectors[0].todense())
print(f'There are {train_integer_encoded_vectors[0].todense().shape[1]} unique words in the {len(train_df)} tweets')

#### Model

- We use a *ridge regression* model, which pushes the weights of our very wide vectors toward zero without completely discounting any individual word
- Ridge regression adds a "ridge" (L2 penalty) term that has the effect of "smoothing" the weights
- This is equivalent to training a linear model with weight decay: it decreases variance at the cost of a small increase in bias
- For details, this explanation may help: https://www.youtube.com/watch?v=Q81RR3yKn30

#### Linear model vs Ridge Regression
- A linear model fits $y \leftarrow C*X + B$ by minimizing the sum of squared residuals
- Ridge regression adds a $\lambda * slope^2$ penalty to that sum of squared residuals, which shrinks the slope and introduces a small bias
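
The shrinking effect of the penalty can be checked numerically. The sketch below is an illustration on synthetic data, not the tweet dataset: it fits plain least squares and ridge regression to the same points and compares the size of the learned weights. scikit-learn calls the penalty strength $\lambda$ `alpha`; the value 10 and all variable names are arbitrary choices for this example, and it uses the regressors rather than `RidgeClassifier`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y_toy = X_toy @ true_w + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X_toy, y_toy)   # no penalty
ridge = Ridge(alpha=10.0).fit(X_toy, y_toy)  # squared-weight penalty with strength alpha

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS   weight norm: {ols_norm:.3f}")
print(f"Ridge weight norm: {ridge_norm:.3f}")
```

The ridge weights come out smaller in norm than the least-squares weights, and raising `alpha` shrinks them further.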

In [None]:
classifier_model = linear_model.RidgeClassifier()

Determine the cross-validation score

In [None]:
scores = model_selection.cross_val_score(classifier_model, X_train_enc, y_train, cv=3, scoring="f1")
scores

In [None]:
# if the model looks good, then let's fit it to the training dataset
classifier_model.fit(X_train_enc, y_train)

In [None]:
from sklearn.metrics import accuracy_score, f1_score

In [None]:
prediction = classifier_model.predict(X_test_enc)

In [None]:
accuracy_score(y_test, prediction)  # scikit-learn's convention is (y_true, y_pred)
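
Since the cross-validation above is scored with F1, it can be useful to look at F1 alongside accuracy on held-out data as well. The sketch below uses small made-up label lists (an assumption for illustration) to show that the two metrics can differ on the same predictions.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (assumed for illustration): 1 = disaster, 0 = not a disaster
y_true_toy = [1, 0, 0, 0, 1, 0, 1, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 1, 1]

acc = accuracy_score(y_true_toy, y_pred_toy)  # fraction of labels predicted correctly
f1 = f1_score(y_true_toy, y_pred_toy)         # harmonic mean of precision and recall for class 1

print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

In this notebook the corresponding call would be `f1_score(y_test, prediction)`.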