<h1 id="Introduction-to-Python-and-Natural-Language-Technologies">Introduction to Python and Natural Language Technologies</h1>
<h2 id="Laboratory-06,-NLP-Introduction">Laboratory 06, NLP Introduction</h2>
<p><strong>March 18, 2020</strong></p>
<p><strong>&Aacute;d&aacute;m Kov&aacute;cs</strong></p>
<p>In this laboratory we use the classification dataset of SemEval 2019 Task 6, Identifying and Categorizing Offensive Language in Social Media.</p>
<h2 id="Preparation">Preparation</h2>
<p style="padding-left: 40px;"><a href="http://sandbox.hlt.bme.hu/~adaamko/glove.6B.100d.txt" target="_blank" rel="noopener">Download GloVe</a> (and place it into this directory)</p>
<p style="padding-left: 40px;">Download the dataset (with Python code)</p>

In [None]:
import os
if not os.path.isdir('./data'):
    os.mkdir('./data')

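# Import the submodule explicitly; `import urllib` alone does not guarantee that `urllib.request` is loaded
import urllib.request
# urlretrieve replaces the deprecated URLopener class
urllib.request.urlretrieve("http://sandbox.hlt.bme.hu/~adaamko/offenseval.tsv", "data/offenseval.tsv")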

# 1. Train a Logistic Regression on the dataset

Use a `CountVectorizer` to featurize your data. You can reuse the code presented during the lecture.

## 1.1 Read in the dataset into a Pandas DataFrame
Use `pd.read_csv` with the correct parameters to read in the dataset. If done correctly, the `DataFrame` should have three columns:
`id`, `tweet`, `subtask_a`.
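
As an illustration of reading tab-separated data, here is a minimal sketch on an in-memory sample. The two rows are hypothetical, and the sketch assumes the real file has a header row with the same column names; check the downloaded file before relying on that.

```python
import io
import pandas as pd

# Hypothetical two-row sample laid out like a tab-separated file with a header
sample_tsv = (
    "id\ttweet\tsubtask_a\n"
    "1\t@USER have a nice day\tNOT\n"
    "2\t@USER you are awful\tOFF\n"
)

# sep="\t" tells pandas that the columns are separated by tabs, not commas
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_df.columns.tolist())
```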

In [None]:
import pandas as pd
import numpy as np

In [None]:
def read_dataset():
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
train_data_unprocessed = read_dataset()

assert type(train_data_unprocessed) == pd.core.frame.DataFrame
assert len(train_data_unprocessed.columns) == 3
assert (train_data_unprocessed.columns == ['id', 'tweet', 'subtask_a']).all()

## 1.2 Convert `subtask_a` into a binary label
The task is to classify the given tweets into two categories: _offensive (OFF)_ and _not offensive (NOT)_. Machine learning algorithms need integer labels instead of strings. Add a new column called `label` to the dataframe by transforming the `subtask_a` column into a binary integer label.
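
One possible way to turn string labels into integers is a dictionary lookup with `Series.map`. The sketch below uses a hypothetical toy dataframe and assumes that OFF becomes 1 and NOT becomes 0; the opposite convention would be just as valid.

```python
import pandas as pd

# Hypothetical toy data with the same label strings as the dataset
toy = pd.DataFrame({"subtask_a": ["OFF", "NOT", "NOT", "OFF"]})

# Assumption: OFF -> 1, NOT -> 0
toy["label"] = toy["subtask_a"].map({"NOT": 0, "OFF": 1})
print(toy["label"].tolist())
```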

In [None]:
def transform(train_data):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from pandas.api.types import is_numeric_dtype

train_data = transform(train_data_unprocessed)

assert "label" in train_data
assert is_numeric_dtype(train_data.label)
assert (train_data.label.isin([0,1])).all()

In [None]:
train_data.groupby("label").size()

## 1.3 Initialize CountVectorizer and _train_ it on the _tweet_ column of the dataset
The _training_ prepares the vocabulary so that we can use it later to train a LogisticRegression model. Set `max_features` to 5000 so the vocabulary won't be too big for training. Also filter out English `stop_words`.
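
To see what these two parameters do, here is a small sketch on a hypothetical two-sentence corpus with a much smaller `max_features` than the exercise asks for.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_corpus = ["the cat sat on the mat", "the dog chased the cat"]

# max_features keeps only the most frequent terms;
# stop_words="english" drops words such as "the" and "on" before counting
toy_vectorizer = CountVectorizer(max_features=3, stop_words="english")
toy_vectorizer.fit(toy_corpus)

# Each row of the result has one column per vocabulary entry
toy_counts = toy_vectorizer.transform(["the cat and the dog"])
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())
```

Words outside the fitted vocabulary are simply ignored at transform time, which is why the output always has as many columns as the vocabulary has entries.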

In [None]:
# We will need to use a random seed for our methods so they will be reproducible
SEED = 1234

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def prepare_vectorizer(train_data):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
vectorizer = prepare_vectorizer(train_data)

transformed = vectorizer.transform(["hello this is the intro to nlp"])
assert transformed.dtype == np.dtype('int64')
assert transformed.shape == (1, 5000)

## 1.4 Featurize the dataset with the prepared CountVectorizer, and split it into _train_ and _test_ dataset
Use the random seed when splitting the dataset. The ratio of the training set to the test set should be 70% to 30%.
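
A seeded 70/30 split can be sketched on hypothetical toy arrays as follows. Passing the same `random_state` twice yields the same partition, which is what makes the experiment reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.3 reserves 30% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1234)

# Repeating the call with the same seed gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1234)
print(X_tr.shape, X_te.shape)
```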

In [None]:
import gensim
from tqdm import tqdm
from sklearn.model_selection import train_test_split as split

def vectorize_to_bow(tr_data, tst_data, vectorizer):   
    # YOUR CODE HERE
    raise NotImplementedError()

def get_features_and_labels(data, labels, vectorizer):
    # tr_data,tst_data,tr_labels,tst_labels = split...
    # ...
    # tr_vecs, tst_vecs = vectorize_to_bow(...
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label, vectorizer)
assert tr_vecs.shape == (9268, 5000)
assert tr_labels.shape == (9268,)
assert tst_vecs.shape == (3972, 5000)
assert tst_labels.shape == (3972,)
assert tr_vecs[0].toarray().shape == (1, 5000)

In [None]:
# Import a bunch of stuff from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# We will train a LogisticRegression algorithm for the classification
lr  = LogisticRegression(n_jobs=-1)

## 1.5 Train and evaluate your method!

In [None]:
# Training on the train dataset
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(lr)
except NotFittedError as e:
    assert None, repr(e)

In [None]:
from sklearn.metrics import accuracy_score

# Evaluation on the test dataset
def preds(lr, tst_vecs):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# If you have done everything right, the accuracy should be around 75%
lr_pred = preds(lr, tst_vecs)
assert lr_pred.shape == (3972,)
print("Logistic Regression Test accuracy : {}".format(
    accuracy_score(tst_labels, lr_pred)))

## 1.6 Change to TfidfVectorizer, and also change the configuration

Look up the documentation of [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It has a lot of parameters to play with. 

This time, change the parameters to include a _maximum_ of __10000__ features. Also filter out _stopwords_ and _lowercase_ the features. (Hint: look at the parameter names in the documentation.)

[_N-gram_](https://en.wikipedia.org/wiki/N-gram) features can also improve the performance of the model. A bigram is an n-gram with n=2, a trigram is one with n=3, etc.


Bigram features include not only the single words in the vocabulary but also the frequency of every bigram occurring in the text (e.g. not only the words _brown_ and _dog_ but also __brown dog__).

Change the configuration of the _TfidfVectorizer_ to also include the _bigrams_ and _trigrams_ in the vocabulary.
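
To illustrate how n-grams enter the vocabulary, here is a sketch on a single hypothetical sentence. It only goes up to bigrams; the exercise additionally asks for trigrams, stop-word filtering and a feature limit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) keeps unigrams and bigrams;
# a larger upper bound would add longer n-grams as well
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf.fit(["the brown dog barks"])

# The vocabulary now holds both single words and adjacent word pairs
print(sorted(toy_tfidf.vocabulary_))
```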


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def prepare_tfidf_vectorizer(train_data):
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(
    train_data.tweet, train_data.label, prepare_tfidf_vectorizer(train_data))

In [None]:
# Train and evaluate! 
lr  = LogisticRegression(n_jobs=-1)

#lr.fit...

#lr_pred = ..

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(lr)
except NotFittedError as e:
    assert None, repr(e)

## 1.7 Write a custom tokenizer for TfidfVectorizer

Right now, the vectorizer uses its own tokenizer to create the vocabulary. You can also write a custom function and tell the vectorizer to use it when tokenizing the text.

Use [spacy](https://spacy.io/) for tokenization and write your own function.

Your function should:
- get a sentence as an input
- run spacy on the input text
- return a token list that includes:
    - filtering of stop words
    - filtering of punctuation
    - lemmatizing the text
    - lowercasing the text
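
The mechanism of plugging a custom function into the vectorizer can be sketched without spaCy. The tokenizer below is a hypothetical stand-in: it uses a regular expression and a tiny hand-written stop-word set, and it does no lemmatization, so it is not a solution to the exercise.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, deliberately tiny stop-word set (spaCy ships a much larger one)
TOY_STOP_WORDS = {"this", "is", "the", "a"}


def toy_tokenizer(sentence):
    # Lowercase, keep alphabetic runs only (this drops punctuation),
    # then filter out the stop words
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]


# The vectorizer calls toy_tokenizer on every document instead of its default tokenizer
toy_vec = TfidfVectorizer(tokenizer=toy_tokenizer)
toy_vec.fit(["This is the NLP lab!"])
print(sorted(toy_vec.vocabulary_))
```

scikit-learn may emit a warning that its default `token_pattern` is unused when a tokenizer is supplied; that is expected here.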

In [None]:
import spacy

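# spaCy 3 removed the "en" shortcut, so the model is loaded by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")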


def spacy_tokenizer(sentence):
    # YOUR CODE HERE
    raise NotImplementedError()


vectorizer_with_spacy = TfidfVectorizer(
    max_features=10000, tokenizer=spacy_tokenizer)

In [None]:
assert (spacy_tokenizer("This is the NLP lab, this text should not contain any punctuations and stopwords, and the text should be lowercased.") == [
        'nlp', 'lab', 'text', 'contain', 'punctuation', 'stopword', 'text', 'lowercase'])

In [None]:
X = vectorizer_with_spacy.fit(train_data.tweet)

tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label, X)

In [None]:
# Train and evaluate! 
# If you have done everything right you should get the same or a little better performance than the standard
# TfidfVectorizer and CountVectorizer
lr  = LogisticRegression(n_jobs=-1)

#lr.fit...

#lr_pred = ..

# YOUR CODE HERE
raise NotImplementedError()

# 2. Word embeddings

## 2.1 Transform word vectors into a sentence vector by taking the average of the word vectors
Word vectors map words into a vector space where similar words have similar vectors.
These vectors can be used as features for ML algorithms. To featurize a sentence, however, you first need to create a _sentence vector_ from the vectors of its words. The easiest way to do this is to take the average of all the word vectors.

![ww](https://www.researchgate.net/profile/Md-Shajalal/publication/329394770/figure/fig1/AS:701809937088513@1544335936936/A-framework-for-learning-word-vectors-7_W640.jpg)

In [None]:
#Load the embedding file
embedding_file = "glove.6B.100d.txt"

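model = gensim.models.KeyedVectors.load_word2vec_format(embedding_file, binary=False)
# load_word2vec_format already returns KeyedVectors, so the deprecated `.wv` alias is not needed
vectorizer = model
# One row per vocabulary entry, independent of how the gensim version exposes the vocabulary
vocab_length = len(model.vectors)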

**Your transform function should:**
- tokenize the sentence with the spacy tokenizer
- get the embedding vector:
    - get the embedding vector from the model if the word is in the vocabulary
    - initialize a vector with zeros with the same dimension if the word is not in the vocabulary
- take the mean of the word vectors to return a sentence vector
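
The averaging step with a zero-vector fallback can be sketched on hypothetical 3-dimensional embeddings held in a plain dictionary. The real exercise uses the 100-dimensional GloVe model and the spaCy tokenizer instead.

```python
import numpy as np

# Hypothetical toy embeddings; the names and values are made up for illustration
toy_embeddings = {
    "hello": np.array([1.0, 0.0, 2.0]),
    "world": np.array([3.0, 2.0, 0.0]),
}
DIM = 3


def toy_sentence_vector(tokens):
    # Known tokens use their embedding; unknown tokens contribute a zero vector
    vectors = [toy_embeddings.get(tok, np.zeros(DIM)) for tok in tokens]
    # Averaging over axis 0 collapses the token dimension into one sentence vector
    return np.mean(vectors, axis=0)


toy_sentence = toy_sentence_vector(["hello", "world", "unknownword"])
print(toy_sentence)
```

Note that out-of-vocabulary tokens still count in the denominator, so many unknown words pull the sentence vector towards zero.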

In [None]:
def transform(words):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert transform("this is a nlp lab").shape == (100,)

**We can now calculate similarities between sentences the same way we did between words, using the `cosine_similarity` function.**
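
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a sanity check, the sketch below compares a hand-written version with scikit-learn's function on hypothetical toy vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy vectors: b points in the same direction as a, c in the opposite one
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
c = np.array([-1.0, -2.0, -3.0])


def cosine(u, v):
    # Dot product normalized by the Euclidean lengths of both vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


manual = cosine(a, b)
# scikit-learn expects 2-d inputs, hence the reshape to a single row
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]
opposite = cosine(a, c)
print(manual, from_sklearn, opposite)
```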

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(transform("hello my name is adam").reshape(
    1, -1), transform("hello my name is andrea").reshape(1, -1))[0][0])

In [None]:
assert cosine_similarity(transform("hello my name is adam").reshape(
    1, -1), transform("hello my name is andrea").reshape(1, -1)).shape == (1, 1)

## 2.2 Finding Analogies
Word vectors have been shown to sometimes have the ability to solve analogies.

During the lecture we discussed that for the analogy "man : king :: woman : x" (read: man is to king as woman is to x), x is _queen_.

Find more examples of analogies that hold according to these vectors (i.e. the intended word is ranked top)!

Also find an example of an analogy that does not hold according to these vectors!

Summarize your findings in a few sentences.
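
The arithmetic behind an analogy query can be sketched with hand-made 2-dimensional vectors. Everything below is hypothetical toy data; for the exercise itself, use the loaded embedding model, whose `most_similar(positive=..., negative=...)` method performs a roughly similar computation.

```python
import numpy as np

# Hypothetical toy vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def toy_cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def toy_analogy(a, b, c):
    # a : b :: c : x  ->  x is the word whose vector is closest to b - a + c
    target = toy_vectors[b] - toy_vectors[a] + toy_vectors[c]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: toy_cosine(toy_vectors[w], target))


toy_answer = toy_analogy("man", "king", "woman")
print(toy_answer)
```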

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 2.3 Bias in word vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias in word vectors can be dangerous because applications that employ these models can propagate the stereotypes they encode.

Run the cell below to examine a sample of the gender bias present in the data. Try to come up with other examples that reflect biases in datasets (gender, race, sexual orientation etc.).

Summarize your findings in a few sentences.

In [None]:
print(model.most_similar(positive=['woman', 'doctor'], negative=['man']))

print(model.most_similar(positive=['man', 'doctor'], negative=['woman']))

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# ================ PASSING LEVEL ====================

# 3. Logistic regression using word vectors

These sentence vectors can be used as feature vectors for classifiers. Rewrite the featurizing process and transform each sentence into a sentence vector using the embedding model!

__Note: it is OK if your model is not better than the other classifiers__

In [None]:
def vectorize_to_embedding(tr_data, tst_data):    
    # YOUR CODE HERE
    raise NotImplementedError()

def get_features_and_labels(data, labels):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label)

In [None]:
assert tr_vecs[0].shape == (100,)

In [None]:
# Train and evaluate! 
lr  = LogisticRegression(n_jobs=-1)

#lr.fit...

#lr_pred = ..

# YOUR CODE HERE
raise NotImplementedError()

## 3.1 Ensemble model

Try out other classifiers from [sklearn](https://scikit-learn.org/stable/supervised_learning.html). Choose three and build a [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) with the chosen classifiers. If the _voting_ strategy is set to _hard_, it does a majority vote among the classifiers and chooses the class with the most votes.

Make a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) with a TfidfVectorizer and your ensemble model. Pipeline objects make it easy to assemble several steps and make your machine learning pipeline executable in just one step.
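
How the two pieces fit together can be sketched on hypothetical toy data. The three classifiers below are an arbitrary choice made for illustration, not a recommendation; try your own combination in the exercise.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, repeated so every classifier has a few samples per class
toy_texts = ["you are great", "have a nice day",
             "you are awful", "what a terrible idiot"] * 3
toy_labels = [0, 0, 1, 1] * 3

# voting="hard" means each classifier casts one vote and the majority class wins
toy_ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("nb", MultinomialNB()),
        ("tree", DecisionTreeClassifier(random_state=1234)),
    ],
    voting="hard",
)

# The pipeline vectorizes raw text and feeds the result to the ensemble in one call
toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("ensemble", toy_ensemble)])
toy_pipeline.fit(toy_texts, toy_labels)
toy_predictions = toy_pipeline.predict(["what a nice day"])
print(toy_predictions)
```

Because the vectorizer is part of the pipeline, `fit` and `predict` take raw strings directly, so no separate featurizing step is needed.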

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import VotingClassifier


def make_pipeline_ensemble(tweet, label):
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
pipeline = make_pipeline_ensemble(train_data.tweet, train_data.label)

In [None]:
assert type(pipeline) == Pipeline
assert type(pipeline.steps[0][1]) == TfidfVectorizer
assert type(pipeline.steps[1][1]) == VotingClassifier

In [None]:
# Train and evaluate! 
# YOUR CODE HERE
raise NotImplementedError()


## 3.2 __Evaluate your classifiers separately as well. Summarize your results in a cell below. Did the ensemble model improve your performance?__

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# ================ EXTRA LEVEL ====================