<img src="assets/ucu-bg.jpg" alt="UCU Machine Learning Workshops 2017" style="height: 400px; border: 2px solid #C08050">

# Practice: Day 4

## Paraphrase Identification on Quora Question Dataset

You are given a dataset that contains 400k pairs of question titles from Quora. For each question pair, a supervised label is given by a human annotator: whether both questions in the pair are considered to have the same intent (`is_duplicate = 1`) or not (`is_duplicate = 0`).

Note that the human judgment about a particular pair being a duplicate can be subjective, so expect some "noise" in the target values.

Your task is to build a model that, given two question titles, predicts whether they have the same intent. Some scaffolding code is provided for your convenience. Fill in the rest as you go along.

**Example Plan**

1. Start with a baseline model, such as cosine similarity over bag-of-words (BoW) vectors. Optionally, try TF-IDF weighting afterwards and compare the results with plain counts.
2. Then, try leveraging pre-trained word embeddings (e.g. fastText trained on Wikipedia, or Word2Vec trained on Google News) and calculating the Word Mover's Distance (WMD) as a feature. You can also use this feature later in step 4.
3. Then, encode the questions as fixed-length padded sequences of word embeddings, and create a neural network (e.g. with a Multi-Layer Perceptron architecture). You might want to allocate a separate validation set for picking the hyperparameters.
4. (Advanced) Use BoW cosine similarity, TF-IDF cosine similarity, WMD, and the predictions of the neural network as features for a 2nd-level model.
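
Step 1 of the plan can be sketched roughly as follows. This is a minimal illustration on made-up toy pairs, not the reference solution: the helper name `pairwise_cosine` and the toy questions are assumptions introduced here. It fits one shared vocabulary, L2-normalizes each row, and takes the row-wise dot product, which equals the cosine similarity for unit-length vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize


def pairwise_cosine(vectorizer, questions1, questions2):
    """Row-wise cosine similarity between two aligned lists of texts.

    Hypothetical helper for illustration; not part of the workshop scaffolding.
    """
    # Fit one shared vocabulary so both sides live in the same vector space.
    vectorizer.fit(list(questions1) + list(questions2))
    a = normalize(vectorizer.transform(questions1))  # L2-normalize each row
    b = normalize(vectorizer.transform(questions2))
    # For unit-length rows, the element-wise product summed per row is the cosine.
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


# Toy pairs, made up for illustration only.
q1 = ["how do i learn python", "what is machine learning"]
q2 = ["how do i learn python", "best pizza in rome"]

bow_sim = pairwise_cosine(CountVectorizer(), q1, q2)
tfidf_sim = pairwise_cosine(TfidfVectorizer(), q1, q2)
```

On the real data, the two lists would be the two question columns of `df_orig`. Fitting the vectorizer on the training rows only, and merely transforming the test rows, keeps test-set vocabulary statistics out of the features.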

**Helpful Modules and Functions**

For baseline models:

* `gensim.models.wrappers.fasttext.FastText.wmdistance`
* `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.feature_extraction.text.TfidfVectorizer`

For neural models:

* `keras.preprocessing.text.Tokenizer`
* `keras.preprocessing.sequence.pad_sequences`
* `keras.models.Sequential`
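
To make "fixed-length padded sequences" concrete, here is a hand-rolled sketch in plain NumPy. It is not the Keras API: `build_vocab` and `encode_and_pad` are hypothetical helpers that only mimic the idea behind `Tokenizer` and `pad_sequences`. Id 0 is reserved for padding, and this sketch pads and truncates at the end of each sequence.

```python
import numpy as np


def build_vocab(texts):
    """Assign an integer id (starting at 1) to every distinct lower-cased token."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)  # 0 stays reserved for padding
    return vocab


def encode_and_pad(texts, vocab, max_len):
    """Return a (n_texts, max_len) integer matrix, zero-padded on the right."""
    out = np.zeros((len(texts), max_len), dtype=np.int64)
    for row, text in enumerate(texts):
        # Drop unknown tokens, then truncate to max_len.
        ids = [vocab[t] for t in text.lower().split() if t in vocab][:max_len]
        out[row, :len(ids)] = ids
    return out


# Toy questions, made up for illustration only.
questions = ["How do I learn Python", "What is Python"]
vocab = build_vocab(questions)
encoded = encode_and_pad(questions, vocab, max_len=4)
```

The resulting integer matrix is what an embedding layer would consume. With Keras, the side on which padding is applied is chosen through the `padding` argument of `pad_sequences`.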

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import log_loss, precision_score, recall_score, f1_score

In [None]:
%matplotlib inline

## Configuration

Make the subsequent runs reproducible.

In [None]:
RANDOM_STATE = 42

In [None]:
np.random.seed(RANDOM_STATE)

## Read Data

In [None]:
df_orig = pd.read_csv('quora-train.csv')

In [None]:
df_orig.head(5)

In [None]:
df_orig.is_duplicate.plot.hist()

## Partition Data

Compute and keep the row indices of the training and test sets, so that every feature matrix built later is split the same way.

In [None]:
splitter = StratifiedShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

In [None]:
ix_train, ix_test = next(splitter.split(df_orig, df_orig.is_duplicate))

In [None]:
print('Training set length:', len(ix_train))
print('Test set length:    ', len(ix_test))
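
A quick sanity check on a stratified split is that both parts keep roughly the same share of duplicates. The sketch below uses synthetic labels as a stand-in for `df_orig.is_duplicate`; all `*_demo` names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced labels standing in for df_orig.is_duplicate.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((len(y_demo), 1))  # the splitter only needs the number of rows

demo_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
ix_tr, ix_te = next(demo_splitter.split(X_demo, y_demo))

# Share of positive labels in each part; stratification should keep them close.
train_rate = y_demo[ix_tr].mean()
test_rate = y_demo[ix_te].mean()
print('Duplicate rate (train):', train_rate)
print('Duplicate rate (test): ', test_rate)
```

The same two `.mean()` calls can be applied to `y[ix_train]` and `y[ix_test]` once the labels are loaded.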

**<span style="color: red">TODO:</span> Create features for your model.**

In [None]:
# Placeholder: a single all-zero feature column. Replace it with real features.
X = np.zeros((len(df_orig), 1))
y = df_orig.is_duplicate.values
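
One possible way to assemble `X` is to compute each feature as a 1-D array with one value per question pair and stack the arrays column-wise. The sketch below is illustrative only: the two features (token overlap and length difference) and the helper `jaccard_overlap` are assumptions chosen for brevity, not part of the example plan.

```python
import numpy as np


def jaccard_overlap(q1, q2):
    """Share of distinct lower-cased tokens that two questions have in common."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy pairs, made up for illustration only.
pairs = [
    ("how do i learn python", "how can i learn python"),
    ("what is machine learning", "best pizza in rome"),
]

overlap = np.array([jaccard_overlap(q1, q2) for q1, q2 in pairs])
len_diff = np.array([abs(len(q1.split()) - len(q2.split())) for q1, q2 in pairs])

# One row per question pair, one column per feature.
X_feat = np.column_stack([overlap, len_diff])
```

Similarity scores from the baseline steps (BoW cosine, TF-IDF cosine, WMD) can be added as further columns in the same way. If any title in the real data is missing, it would need to be replaced with an empty string before calling `.split()`.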

## Begin Modeling

Split the data.

In [None]:
X_train = X[ix_train]
y_train = y[ix_train]

X_test = X[ix_test]
y_test = y[ix_test]

In [None]:
print('X train:', X_train.shape)
print('y train:', y_train.shape)
print('X test: ', X_test.shape)
print('y test: ', y_test.shape)

**<span style="color: red">TODO:</span> Train your model.**

**<span style="color: red">TODO:</span> Make predictions from your model.**

In [None]:
# Placeholder: uniformly random "probabilities". Replace with your model's output.
y_pred_train = np.random.uniform(size=y_train.shape)
y_pred_test = np.random.uniform(size=y_test.shape)
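
As a rough sketch of what the two TODOs above might look like for a simple 2nd-level or baseline model, the example below fits a logistic regression on a single synthetic similarity-like feature and extracts positive-class probabilities. The data, the `*_toy` names, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)

# Synthetic stand-in: one similarity-like feature that is higher for duplicates.
y_toy = rng.randint(0, 2, size=200)
X_toy = (0.3 + 0.4 * y_toy + rng.normal(scale=0.15, size=200)).reshape(-1, 1)

# Hold out the last 50 rows, mirroring the train/test split of the notebook.
X_toy_train, y_toy_train = X_toy[:150], y_toy[:150]
X_toy_test, y_toy_test = X_toy[150:], y_toy[150:]

model = LogisticRegression()
model.fit(X_toy_train, y_toy_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
proba_train = model.predict_proba(X_toy_train)[:, 1]
proba_test = model.predict_proba(X_toy_test)[:, 1]
```

Arrays like `proba_train` and `proba_test` have the shape and value range that the evaluation code expects for `y_pred_train` and `y_pred_test`.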

## Evaluate

In [None]:
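def evaluate(y_true, y_pred, threshold=0.5):
    """Print log loss on the probabilities and precision/recall/F1 at `threshold`."""
    # Binarize the predicted probabilities for the label-based metrics.
    y_pred_label = y_pred >= threshold

    # Log loss scores the raw probabilities; the other metrics score hard labels.
    print('Log loss: ', log_loss(y_true, y_pred))
    print('Precision:', precision_score(y_true, y_pred_label))
    print('Recall:   ', recall_score(y_true, y_pred_label))
    print('F1 score: ', f1_score(y_true, y_pred_label))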

**Training evaluation**

In [None]:
evaluate(y_train, y_pred_train)

**Test evaluation**

In [None]:
evaluate(y_test, y_pred_test)
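
For orientation when reading the log loss, it helps to know what uninformed predictions score. The sketch below, on synthetic labels, compares uniformly random probabilities (like the placeholder predictions) with a constant prediction of 0.5. The function `binary_log_loss` is a hand-written NumPy version of the binary log loss, introduced here only so the arithmetic is visible.

```python
import numpy as np


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


rng = np.random.RandomState(42)
y_true_demo = rng.randint(0, 2, size=100_000)  # synthetic labels

# Uniformly random "probabilities": the expected loss is 1, since E[-ln U] = 1.
loss_uniform = binary_log_loss(y_true_demo, rng.uniform(size=y_true_demo.shape))

# A constant 0.5 for every pair: the loss is exactly ln 2, about 0.693.
loss_half = binary_log_loss(y_true_demo, np.full(y_true_demo.shape, 0.5))

print('Uniform random:', loss_uniform)
print('Constant 0.5:  ', loss_half)
```

A useful model should land below both reference points; a log loss near or above 0.69 suggests the predicted probabilities carry little information about the labels.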