# Machine Learning Challenge

## Overview

The focus of this exercise is on a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing). We can think of this field as the intersection between language, and machine learning. Tasks in this field include automatic translation (Google translate), intelligent personal assistants (Siri), information extraction, and speech recognition for example.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. There is no expectation that you have any experience with NLP, however, to complete the challenge it will be useful to have the following skills:

- understanding of the python programming language
- understanding of basic machine learning concepts, i.e. supervised learning


### Instructions

1. Download this notebook!
2. Answer each of the provided questions, including your source code as cells in this notebook.
3. Share the results with us, e.g. a Github repo.

### Task description

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence: "*This movie was terrible!"* has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training-set, and test the performance of your model on the Amazon data.

Each file can be found in the `input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 

**Notes**
- Feel free to use existing machine learning libraries as components in you solution!
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data), `spacy` (for text processing).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justfify the choices you have made -- so make you understand the approach you have taken.

In [None]:
import os
print(os.listdir("./input"))

In [None]:
!head "./input/imdb_labelled.txt"

# Tasks
### 1. Read and concatenate data into test and train sets.
### 2. Prepare the data for input into your model.

In [None]:
!head "./input/yelp_labelled.txt"

In [253]:
import spacy
import numpy as np 
from spacy.matcher import Matcher
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

nlp = spacy.load('en_core_web_sm')

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

def clean_text(sentence):

    processed_feature = re.sub(r'\W', ' ', sentence)

    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature) 

    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    return processed_feature.lower().strip()



def spacy_tokenizer(sentence):
    tokens = nlp(sentence)   
    tokens = [ token for token in tokens if not token.is_stop and not token.is_punct and not token.is_space]
    tokens = [ token.lemma_.lower().strip() if token.lemma_ != "-PRON-" else token.lower_ for token in tokens ]
    return tokens

    
def spacy_tokenizer_v2(sentence):

    tokens = nlp(sentence)
    
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "is"},  {"OP": "*"},{"LOWER": "not"}]
    matcher.add("is_not", [pattern])
    
    matches = matcher(tokens)
    match_pos = []
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  
        match_pos.append(start)
        match_pos.append(end-1)
        

    tokens = [ token for token in tokens if token.i not in match_pos ]

    tokens = [ token for token in tokens if not token.is_stop and not token.is_punct and not token.is_space ]
    
    tokens = [ token.lemma_.lower().strip() if token.lemma_ != "-PRON-" else token.lower_ for token in tokens ]
    if match_pos:
        tokens.append("is not")
    
    return tokens




In [247]:

train_data = pd.read_csv("./input/yelp_labelled.txt", sep='\t', header=None)

train_data = train_data.append(pd.read_csv("./input/imdb_labelled.txt", sep='\t', header=None))
features = train_data.iloc[:, 0].values
labels = train_data.iloc[:, 1].values


#### 2a: Find the ten most frequent words in the training set.

In [248]:
from collections import Counter
complete_doc = nlp(' '.join(features))

words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct and not token.is_space]
word_freq = Counter(words)
common_words = word_freq.most_common(10)
print(common_words)

[('movie', 179), ('film', 163), ('good', 146), ('0', 138), ('1', 124), ('food', 114), ('place', 109), ('like', 94), ('great', 88), ('time', 84)]


### 3. Train your model and justify your choices.

In [250]:
from sklearn import metrics
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
test_data = pd.read_csv("./input/amazon_cells_labelled.txt", sep='\t', header=None)
test_feature = train_data.iloc[:, 0].values
test_labels = train_data.iloc[:, 1].values




### 4. Evaluate your model using metric(s) you see fit and justify your choices.

In [254]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', TfidfVectorizer(tokenizer = spacy_tokenizer)),
                 ('classifier', classifier)])

# model generation
pipe.fit(features,labels)

predicted = pipe.predict(test_feature)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(test_labels, predicted))
print("Logistic Regression Precision:",metrics.precision_score(test_labels, predicted))
print("Logistic Regression Recall:",metrics.recall_score(test_labels, predicted))

Logistic Regression Accuracy: 0.9456521739130435
Logistic Regression Precision: 0.9458850056369785
Logistic Regression Recall: 0.9469525959367946


In [None]:
for input, prediction, label in zip(test_feature, predicted, test_labels):
  if prediction != label:
    print(input, 'has been classified as ', prediction, 'and should be ', label) 

In [None]:
### 5. just try to improve the model a little bit

In [255]:
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', TfidfVectorizer(tokenizer = spacy_tokenizer_v2)),
                 ('classifier', classifier)])


pipe.fit(features,labels)

predicted = pipe.predict(test_feature)

print("Logistic Regression Accuracy:",metrics.accuracy_score(test_labels, predicted))
print("Logistic Regression Precision:",metrics.precision_score(test_labels, predicted))
print("Logistic Regression Recall:",metrics.recall_score(test_labels, predicted))

Logistic Regression Accuracy: 0.9462242562929062
Logistic Regression Precision: 0.9479638009049773
Logistic Regression Recall: 0.945823927765237


In [None]:
for input, prediction, label in zip(test_feature, predicted, test_labels):
  if prediction != label:
    print(input, 'has been classified as ', prediction, 'and should be ', label) 

In [257]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', TfidfVectorizer(tokenizer = spacy_tokenizer)),
                 ('classifier', classifier)])

pipe.fit(features,labels)
predicted = pipe.predict(test_feature)

print("Logistic Regression Accuracy:",metrics.accuracy_score(test_labels, predicted))
print("Logistic Regression Precision:",metrics.precision_score(test_labels, predicted))
print("Logistic Regression Recall:",metrics.recall_score(test_labels, predicted))


Logistic Regression Accuracy: 0.950228832951945
Logistic Regression Precision: 0.9414364640883978
Logistic Regression Recall: 0.9616252821670429


In [260]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=200, random_state=42)
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', TfidfVectorizer(tokenizer = spacy_tokenizer)),
                 ('classifier', classifier)])

pipe.fit(features,labels)
predicted = pipe.predict(test_feature)

print("Logistic Regression Accuracy:",metrics.accuracy_score(test_labels, predicted))
print("Logistic Regression Precision:",metrics.precision_score(test_labels, predicted))
print("Logistic Regression Recall:",metrics.recall_score(test_labels, predicted))


Logistic Regression Accuracy: 0.9954233409610984
Logistic Regression Precision: 0.996606334841629
Logistic Regression Recall: 0.9943566591422122


In [275]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=200, random_state=42)
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', TfidfVectorizer(tokenizer = spacy_tokenizer_v2)),
                 ('classifier', classifier)])

pipe.fit(features,labels)
predicted = pipe.predict(test_feature)

print("Logistic Regression Accuracy:",metrics.accuracy_score(test_labels, predicted))
print("Logistic Regression Precision:",metrics.precision_score(test_labels, predicted))
print("Logistic Regression Recall:",metrics.recall_score(test_labels, predicted))

Logistic Regression Accuracy: 0.9954233409610984
Logistic Regression Precision: 0.9954853273137697
Logistic Regression Recall: 0.9954853273137697
