# Data Science Hiring Challenge
### Fixed Term Contract  Mar 2018 

## Overview

The venture we are currently hiring for is focused on a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing). We can think of this field as the intersection between language and machine learning. Tasks in this field include, for example, automatic translation (Google Translate), intelligent personal assistants (Siri), predictive text, and speech recognition.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. You are not expected to have any experience in this field; however, in this challenge we will test:

- your capacity to learn and apply these techniques, and:
- your general aptitude developing software in Python.

### Instructions

1. Create a Kaggle account and `fork` this notebook.
2. Answer each of the provided questions, including your source code as cells in this notebook.
3. Provide us with a link to your Kaggle notebook at your convenience.

**Notes**
- This environment comes with a wide range of ML libraries installed. If you wish to include more, go to the 'Settings' tab and input the `pip install` command as required.
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justify the choices you have made -- so make sure you understand the approach you have taken.

## Task description and dataset

### Task

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence "*This movie was terrible!*" has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training set, and test the performance of your model on the Amazon data.

Each file can be found in the `../input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 

In [None]:
import os
print(os.listdir("../input"))

In [None]:
!head "../input/amazon_cells_labelled.txt"

# Tools and Environment Setup

Considering the goal of the proposed task, we'll use:

## For Data Input

We'll use [pandas](https://pandas.pydata.org/) to handle input data.

In [None]:
! pip install --upgrade pandas

## For Natural Language Processing

The framework used to process natural language in this task is [NLTK](https://www.nltk.org/). Additional resources are the [NLTK Book](https://www.nltk.org/book/) and the NLTK source code itself.

In [None]:
! pip install --upgrade nltk

## For Machine Learning

For machine learning, we'll use [scikit-learn](http://scikit-learn.org/stable/). <!-- NLTK has a module which wraps scikit-learn classifiers named [nltk.classify.scikitlearn](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.scikitlearn) and we'll use such integration for this task. -->

In [None]:
! pip install --upgrade scipy numpy scikit-learn

# Tasks
### 1. Read and concatenate data into test and train sets.

Since all three data sets are small, we'll load them into memory as `pandas` dataframes. A different approach would be required for larger data sets or for data arriving as a text stream.

In [None]:
#Q1 soln.

import csv
from pathlib import Path

import numpy
import pandas


# dict to store pandas dataframes loaded from the txt files, keyed by source name
dataframes = dict()

# Iterate over all labelled files and load them into memory
for labelled_file in Path("../input").glob("*_labelled.txt"):

    # Extract data set name from file name
    source = labelled_file.stem.replace("_labelled", "")

    # Read txt files as csv, using tab as separator. QUOTE_NONE keeps double
    # quotes inside sentences as plain text instead of field delimiters.
    dataframes[source] = pandas.read_csv(
        labelled_file, sep="\t", header=None, quoting=csv.QUOTE_NONE
    )

# Setup training and testing data
x_training = numpy.concatenate((dataframes["imdb"][0].values, dataframes["yelp"][0].values))
y_training = numpy.concatenate((dataframes["imdb"][1].values, dataframes["yelp"][1].values))
x_testing = dataframes["amazon_cells"][0].values
y_testing = dataframes["amazon_cells"][1].values

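One pitfall worth checking when loading these files: by default `pandas.read_csv` treats a double quote at the start of a field as the opening of a quoted field, so a sentence that begins with an unmatched `"` may swallow the rows that follow it. The sketch below uses a made-up two-row sample (not taken from the dataset) and compares the default behaviour with `quoting=csv.QUOTE_NONE`.

```python
import csv
import io

import pandas

# Hypothetical two-row sample in the same "sentence<TAB>label" layout;
# the first sentence starts with an unmatched double quote.
sample = (
    '"Best phone I have owned\t1\n'
    'The so-called "smart" features are useless\t0\n'
)

# Default quoting: a leading double quote opens a quoted field.
df_default = pandas.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE: double quotes are kept as ordinary characters.
df_raw = pandas.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print("rows with default quoting:", len(df_default))
print("rows with QUOTE_NONE:", len(df_raw))
```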
### 2. Prepare the data for input into your model.

In [None]:
#Q2 soln.


#### 2a: Find the ten most frequent words in the training set.

In [None]:
#Q2a soln.

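As a starting point for this question, here is a minimal sketch using `collections.Counter`. The three sentences are a hypothetical stand-in for `x_training`, and the lowercase regex tokenizer is an assumption made for illustration; it is not the Treebank tokenizer used in the model pipeline.

```python
import re
from collections import Counter

# Hypothetical stand-in for x_training (not taken from the dataset)
sentences = [
    "The food was great and the service was great.",
    "The plot was terrible.",
    "I loved the food.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


word_counts = Counter(
    token for sentence in sentences for token in tokenize(sentence)
)

# most_common(10) returns up to ten (word, count) pairs, highest count first
for word, count in word_counts.most_common(10):
    print(f"{word:>10} {count}")
```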


### 3. Train your model and justify your choices.

Many textual resources were read while prototyping models in a trial-and-error fashion. The primary goal was to understand how Natural Language Processing (NLP) and Machine Learning (ML) relate to each other in this task.

At first, the candidate expected to use `NLTK` alone, without any external machine learning libraries. However, this turned out not to be the case.

Several hours were spent trying to adapt [NLTK's Sentiment Analysis HOWTO](https://www.nltk.org/howto/sentiment.html) using [`SentimentAnalyzer`](https://www.nltk.org/api/nltk.sentiment.html#nltk.sentiment.sentiment_analyzer.SentimentAnalyzer) and [VADER](https://github.com/cjhutto/vaderSentiment)'s [`SentimentIntensityAnalyzer`](https://www.nltk.org/api/nltk.sentiment.html#nltk.sentiment.vader.SentimentIntensityAnalyzer). Without prior NLP or ML knowledge, it wasn't clear to the candidate how to design other feature extractors.

While still studying NLTK, another resource used was [`nltk.sentiment.util`'s demos source code](https://www.nltk.org/_modules/nltk/sentiment/util.html), which provided some ideas on how a model could be built. Still, it wasn't clear how to adapt it.

[NLTK Book's chapter 6, Learning to Classify Text](https://www.nltk.org/book/ch06.html), gave a fairly good amount of conceptual explanation, but it still wasn't clear how features could be extracted.

One concept that the candidate knew before this task was stop words. Some NLP recipes suggest removing stop words, which are very frequent words that usually add little information for the classifier. However, according to NLTK's HOWTO, especially VADER's examples, removing words such as "*but*" could alter the outcome of the analysis. So, removing stop words was never considered as an option in this context.
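
To illustrate why dropping such words can hurt, here is a small sketch. The stop word set is a hand-picked, hypothetical subset chosen for this example; it is not NLTK's actual list.

```python
# Hypothetical, hand-picked stop word subset (for illustration only)
STOP_WORDS = {"the", "was", "but", "not", "a"}


def remove_stop_words(sentence):
    """Drop stop words from a lowercased, whitespace-tokenized sentence."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]


positive = "the plot was slow but the ending was great"
negative = "the ending was not great"

print(remove_stop_words(positive))
print(remove_stop_words(negative))
```

After filtering, the negative sentence is reduced to words that also appear in the filtered positive one, so the cue carried by "not" is lost, and the contrast signalled by "but" disappears as well.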

Still related to word contribution in context, [StreamHacker's Text Classification for Sentiment Analysis - Eliminate Low Information Features](https://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/) was an inspiring blog post on how to model based on High Information Feature Selection.

After understanding the basic dynamics of NLP and ML for this task, some prototypes were tried using [`nltk.classify.scikitlearn`](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.scikitlearn), but no progress was made.

While reading the [`nltk.tokenize` module documentation](https://www.nltk.org/api/nltk.tokenize.html), different tokenization strategies were tested and the Penn Treebank tokenizer was chosen empirically.

[Reading](https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html) [some](https://medium.com/@chrisfotache/text-classification-in-python-pipelines-nlp-nltk-tf-idf-xgboost-and-more-b83451a327e0) [articles](https://www.toptal.com/machine-learning/nlp-tutorial-text-classification) helped to understand why a pipeline is important when designing a Machine Learning model.

[Towards Data Science's Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a) helped the candidate understand the importance of the *Term Frequency (TF)* and *Term Frequency times Inverse Document Frequency (TF-IDF)* concepts. TF-IDF weights a word by how often it occurs in a document and discounts words that occur in many documents, so it highlights the words most relevant to classifying a document. This makes it a good fit for feature extraction in this task.
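
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python on a made-up three-document corpus. It uses raw counts for TF and the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the default formula documented for scikit-learn's `TfidfTransformer`; the L2 normalisation that scikit-learn also applies by default is left out for brevity.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the dataset)
documents = [
    "the movie was great",
    "the movie was terrible",
    "the food was great",
]
tokenized = [doc.split() for doc in documents]
n_documents = len(tokenized)

# Document frequency: number of documents containing each term
document_frequency = Counter(term for doc in tokenized for term in set(doc))


def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency[term])) + 1


def tf_idf(doc_tokens):
    """Map each term of one document to its raw count times smoothed IDF."""
    term_counts = Counter(doc_tokens)
    return {term: count * smoothed_idf(term) for term, count in term_counts.items()}


for doc_tokens in tokenized:
    weights = tf_idf(doc_tokens)
    print({term: round(weight, 3) for term, weight in weights.items()})
```

Words that occur in every document, such as "the", receive the lowest weight, while "terrible" and "food", which occur in only one document each, receive the highest.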

Finally, [Natural Language Processing for Hackers' Getting Started with Sentiment Analysis](https://nlpforhackers.io/sentiment-analysis-intro/) compared available methods, including those used in NLTK's HOWTO.

The following code is mainly based on NLP for Hackers' code. The original code only converts a collection of text documents into a sparse matrix of token counts ([`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).

Modifications include applying TF-IDF weighting to the sparse matrix of token counts ([`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) + [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) => [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)) and also using trigrams (instead of only unigrams and bigrams).
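
As an illustration of what a 1-to-3 n-gram range adds, the following sketch builds word n-grams by hand. The helper name `word_ngrams` is made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams of the tokens for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "not worth the money".split()
print(word_ngrams(tokens))
```

Trigrams such as "not worth the" keep a negation next to the words it modifies, at the cost of a much larger vocabulary.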

In [None]:
#from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


tokenizer = TreebankWordTokenizer()

# Establish a pipeline
n_gram_1_to_3_classifier = Pipeline(
    [
        ('tfidf_vectorizer',
         TfidfVectorizer(
             analyzer="word",
             ngram_range=(1, 3),
             tokenizer=tokenizer.tokenize,
         )),
        ('classifier', LinearSVC()),
    ]
)

# Train model with training data
n_gram_1_to_3_classifier.fit(x_training, y_training)

### 4. Evaluate your model using metric(s) you see fit and justify your choices.

#### Code adapted from [Natural Language Processing for Hackers' Getting Started with Sentiment Analysis](https://nlpforhackers.io/sentiment-analysis-intro/) to display a classification report of the trained model against testing data:

In [None]:
#Q4 soln.

from sklearn.metrics import classification_report


y_prediction = n_gram_1_to_3_classifier.predict(x_testing)
print(classification_report(y_testing, y_prediction))

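For reference, the per-class figures shown in such a report can be reproduced by hand. This is a minimal sketch on made-up labels (not model output) that computes precision, recall and F1 for the positive class; the function name is hypothetical.

```python
# Hypothetical true labels and predictions (not real model output)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]


def precision_recall_f1(y_true, y_pred, positive_label=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    false_pos = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    false_neg = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```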
#### Code adapted from [Natural Language Processing for Hackers' Getting Started with Sentiment Analysis](https://nlpforhackers.io/sentiment-analysis-intro/) to show top N contributing features:

In [None]:
from operator import itemgetter


model = n_gram_1_to_3_classifier
n = 100


# Extract the vectorizer and the classifier from the pipeline
vectorizer = model.named_steps['tfidf_vectorizer']
classifier = model.named_steps['classifier']

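# get_feature_names() was removed in scikit-learn 1.2; prefer
# get_feature_names_out() and fall back on older versions.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = vectorizer.get_feature_names_out()
else:
    feature_names = vectorizer.get_feature_names()

# Zip the feature names with the coefs and sort, largest coefficient first
coefs = sorted(
    zip(classifier.coef_[0], feature_names),
    key=itemgetter(0), reverse=True
)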

# Get the top n and bottom n coef, name pairs
topn  = zip(coefs[:n], coefs[:-(n+1):-1])

# Collect the formatted output lines
output = []

# Create two columns: most positive features (left), most negative (right).
for (cp, fnp), (cn, fnn) in topn:
    output.append(
        "{:0.4f}{: >20}         {:0.4f}{: >20}".format(
            cp, fnp, cn, fnn
        )
    )

print("\n".join(output))

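The slicing used to pick both ends of the sorted list can be checked on a toy example. The weights and feature names below are invented for illustration.

```python
from operator import itemgetter

# Hypothetical (coefficient, feature name) data
weights = [0.9, -1.2, 0.1, 1.5, -0.4]
names = ["great", "terrible", "phone", "love", "slow"]
n = 2

# Sort by coefficient, largest first
coefs = sorted(zip(weights, names), key=itemgetter(0), reverse=True)

most_positive = coefs[:n]            # first n items: largest coefficients
most_negative = coefs[:-(n + 1):-1]  # last n items, walked backwards

print(most_positive)
print(most_negative)
```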
## Conclusion

This was a fun task for learning some basic NLP and ML. Admittedly, some constraints made it easier, such as the classification being binary and the text being in English.