# Machine Learning Challenge

## Overview

The focus of this exercise is a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing) (NLP). We can think of this field as the intersection of language and machine learning. Tasks in this field include automatic translation (Google Translate), intelligent personal assistants (Siri), information extraction, and speech recognition.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. There is no expectation that you have any experience with NLP; however, to complete the challenge it will be useful to have the following skills:

- understanding of the Python programming language
- understanding of basic machine learning concepts, e.g. supervised learning


### Instructions

1. Download this notebook!
2. Answer each of the provided questions, including your source code as cells in this notebook.
3. Share the results with us, e.g. via a GitHub repo.

### Task description

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence "*This movie was terrible!*" has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training-set, and test the performance of your model on the Amazon data.

Each file can be found in the `input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 

**Notes**
- Feel free to use existing machine learning libraries as components in your solution!
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data), `spacy` (for text processing).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justify the choices you have made -- so make sure you understand the approach you have taken.

In [1]:
import os
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords
import spacy
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from collections import Counter
import string
import re

print(os.listdir("./input"))

['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']


In [2]:
!head "./input/amazon_cells_labelled.txt"

So there is no way for me to plug it in here in the US unless I go by a converter.	0
Good case, Excellent value.	1
Great for the jawbone.	1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!	0
The mic is great.	1
I have to jiggle the plug to get it to line up right to get decent volume.	0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.	0
If you are Razr owner...you must have this!	1
Needless to say, I wasted my money.	0
What a waste of money and time!.	0


In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Chris\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Tasks
### 1. Read and concatenate data into test and train sets.
### 2. Prepare the data for input into your model.

Start by loading each data set into its own dataframe.

In [7]:
df_test = pd.read_table('./input/amazon_cells_labelled.txt', header=None, names=["review", "sentiment"])
df_imdb = pd.read_table('./input/imdb_labelled.txt', header=None, names=["review", "sentiment"])
df_yelp = pd.read_table('./input/yelp_labelled.txt', header=None, names=["review", "sentiment"])

In [8]:
df_imdb

Unnamed: 0,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
...,...,...
743,I just got bored watching Jessice Lange take h...,0
744,"Unfortunately, any virtue in this film's produ...",0
745,"In a word, it is embarrassing.",0
746,Exceptionally bad!,0
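
The preview above ends well short of row 999, even though each file is described as holding 1000 rows. One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that some reviews begin with a double quote that is never closed on the same line, so the parser reads several lines as a single quoted field. The sketch below reproduces that effect on a made-up three-line sample with Python's `csv` module; the sample text and variable names are hypothetical.

```python
import csv
import io

# Hypothetical sample in the "sentence<TAB>label" layout. The first review
# opens with a double quote that is not closed on the same line.
sample = '"Loved it.\t1\nAwful plot.\t0\nIt was "fine".\t1\n'

# Default settings: the opening quote starts a quoted field that keeps
# consuming text, including tabs and newlines, until the next quote.
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE: quote characters are treated as ordinary text.
raw_rows = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(raw_rows))
```

`pd.read_table` accepts the same `quoting=csv.QUOTE_NONE` argument, so checking `len(df_imdb)` with and without it would show whether this applies to the real file. Changing the loader would also change every count and score reported later in the notebook.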


Concatenate the IMDB and Yelp data to form the training set.

In [9]:
df_train = pd.concat([df_imdb, df_yelp], ignore_index=True, sort=False)

In [10]:
df_train = pd.concat([df_imdb, df_yelp], ignore_index=True)

#### 2a: Find the ten most frequent words in the training set.

In [11]:
df_train

Unnamed: 0,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
...,...,...
1743,I think food should have flavor and texture an...,0
1744,Appetite instantly gone.,0
1745,Overall I was not impressed and would not go b...,0
1746,"The whole experience was underwhelming, and I ...",0


In [12]:
df_test

Unnamed: 0,review,sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
...,...,...
995,The screen does get smudged easily because it ...,0
996,What a piece of junk.. I lose more calls on th...,0
997,Item Does Not Match Picture.,0
998,The only thing that disappoint me is the infra...,0


In [13]:
reviews_train = df_train.iloc[:, 0].values
scores_train = df_train.iloc[:, 1].values

reviews_test = df_test.iloc[:, 0].values
scores_test = df_test.iloc[:, 1].values

In [14]:
print(reviews_train)

['A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  '
 'Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  '
 'Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  '
 ... 'Overall I was not impressed and would not go back.'
 "The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time."
 "Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check."]


In [15]:
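processed_reviews_train = []

for review in reviews_train:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_review = re.sub(r'\W', ' ', str(review))

    # Remove single letters that are surrounded by whitespace
    processed_review = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_review)

    # Intended to remove a single letter at the start of the review.
    # NOTE: '\^' matches a literal caret, which the first substitution has
    # already replaced, so this step is a no-op as written. The unescaped
    # anchor r'^[a-zA-Z]\s+' would strip a leading letter, but would also
    # change the word counts reported below.
    processed_review = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_review)

    # Collapse runs of whitespace into a single space
    processed_review = re.sub(r'\s+', ' ', processed_review)

    # Remove a leading 'b' followed by whitespace (e.g. the prefix of a stringified bytes object)
    processed_review = re.sub(r'^b\s+', '', processed_review)

    # Convert to lowercase
    processed_review = processed_review.lower()

    processed_reviews_train.append(processed_review)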

In [16]:
processed_reviews_test = []

for review in reviews_test:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_review = re.sub(r'\W', ' ', str(review))

    # Remove single letters that are surrounded by whitespace
    processed_review = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_review)

    # Intended to remove a single letter at the start of the review.
    # NOTE: '\^' matches a literal caret, which the first substitution has
    # already replaced, so this step is a no-op as written. The unescaped
    # anchor r'^[a-zA-Z]\s+' would strip a leading letter instead.
    processed_review = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_review)

    # Collapse runs of whitespace into a single space
    processed_review = re.sub(r'\s+', ' ', processed_review)

    # Remove a leading 'b' followed by whitespace (e.g. the prefix of a stringified bytes object)
    processed_review = re.sub(r'^b\s+', '', processed_review)

    # Convert to lowercase
    processed_review = processed_review.lower()

    processed_reviews_test.append(processed_review)
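
The two preprocessing cells above apply the same substitutions to the training and test reviews. One way to avoid the duplication is a small helper, sketched here. `clean_review` is a hypothetical name. The sketch keeps only the punctuation, single-letter, whitespace and lowercase steps and adds a final `strip()`, so it is not a drop-in replacement for those cells.

```python
import re


def clean_review(text: str) -> str:
    """Normalise one review: drop punctuation and lone letters, lowercase."""
    # Replace every non-word character with a space
    text = re.sub(r"\W", " ", str(text))
    # Remove single letters that are surrounded by whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Lowercase and trim the ends (the trim is an addition in this sketch)
    return text.lower().strip()


print(clean_review("Good case, Excellent value."))
# prints: good case excellent value
```

With such a helper, each set could be cleaned with a list comprehension, for example `[clean_review(r) for r in reviews_train]`.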

In [17]:
review_words = []

for sentence in processed_reviews_train:
    processed_review = sentence.split()
    for word in processed_review:
        review_words.append(word)

print(len(review_words))

24683


In [18]:
word_freq = Counter(review_words)
common_words = word_freq.most_common(10)
print(common_words)

[('the', 1434), ('and', 827), ('is', 511), ('of', 504), ('was', 481), ('it', 478), ('to', 473), ('this', 435), ('in', 312), ('i', 253)]
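
All ten of the most frequent words are common function words, which carry little sentiment. If the goal is to see which content words dominate, one option is to filter out stop words before counting. The sketch below illustrates the idea on a made-up word list with a tiny hand-written stop list; both are hypothetical stand-ins for `review_words` and `stopwords.words('english')`.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for review_words.
words = "the movie was great and the food was great but the service was slow".split()

# Tiny hand-written stop list, for illustration only.
stop_words = {"the", "was", "and", "but"}

# Count only the words that are not in the stop list.
content_counts = Counter(w for w in words if w not in stop_words)

print(content_counts.most_common(3))
```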


### 3. Train your model and justify your choices.

In [19]:
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))

processed_reviews_train = vectorizer.fit_transform(processed_reviews_train)
processed_reviews_test = vectorizer.transform(processed_reviews_test)

In [20]:
X_train = processed_reviews_train
X_test = processed_reviews_test
y_train = scores_train
y_test = scores_test

In [21]:
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)

RandomForestClassifier(n_estimators=200, random_state=0)
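
Since the task asks for a justification of the model choice, it can help to compare the random forest against a simple linear baseline. The sketch below shows one way to set that up, assuming scikit-learn's `make_pipeline`. The toy sentences and labels are made up for illustration, and nothing here reflects results on the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive sentiment, 0 = negative sentiment.
train_texts = [
    "great food and friendly staff",
    "loved this movie",
    "excellent value",
    "the acting was wonderful",
    "terrible service",
    "a boring and aimless film",
    "waste of money",
    "the food was awful",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain the vectorizer and the classifier so both are fitted together.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["great movie", "awful service"])
print(predictions)
```

Fitting the same pipeline on the cleaned training reviews and scoring it on the test reviews would give a reference point for judging whether the random forest earns its extra complexity.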

### 4. Evaluate your model using metric(s) you see fit and justify your choices.

In [22]:
predictions = text_classifier.predict(X_test)

In [23]:
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

[[399 101]
 [177 323]]
              precision    recall  f1-score   support

           0       0.69      0.80      0.74       500
           1       0.76      0.65      0.70       500

    accuracy                           0.72      1000
   macro avg       0.73      0.72      0.72      1000
weighted avg       0.73      0.72      0.72      1000

0.722
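
To make the link between the confusion matrix and the report explicit, the positive-class metrics can be recomputed by hand. This sketch copies the four counts from the output above and relies on scikit-learn's layout, in which rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 399, 101
fn, tp = 177, 323

# Share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Of the reviews predicted positive, the share that really are positive.
precision_pos = tp / (tp + fp)
# Of the truly positive reviews, the share the model found.
recall_pos = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f} precision={precision_pos:.2f} "
      f"recall={recall_pos:.2f} f1={f1_pos:.2f}")
# prints: accuracy=0.722 precision=0.76 recall=0.65 f1=0.70
```

The lower recall for class `1` (177 positive reviews predicted as negative) shows that the model misses positive reviews more often than it mislabels negative ones.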
