# Machine Learning Challenge

## Overview

The focus of this exercise is a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing) (NLP). We can think of this field as the intersection of language and machine learning. Tasks in this field include automatic translation (Google Translate), intelligent personal assistants (Siri), information extraction, and speech recognition.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. You are not expected to have any experience with NLP. However, to complete the challenge it will be useful to have the following skills:

- understanding of the Python programming language
- understanding of basic machine learning concepts, e.g. supervised learning


### Instructions

1. Download this notebook!
2. Answer each of the provided questions, including your source code as cells in this notebook.
3. Share the results with us, e.g. a GitHub repo.

### Task description

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence "*This movie was terrible!*" has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training-set, and test the performance of your model on the Amazon data.

Each file can be found in the `input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 
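
The row format described above can be parsed with the standard library alone. The following is a minimal sketch, not part of the original notebook: `split_labelled_line` is a hypothetical helper name, and splitting from the right is an assumption made so that a stray tab inside a sentence would not break the parse.

```python
from typing import Tuple


def split_labelled_line(line: str) -> Tuple[str, int]:
    """Split one 'sentence<TAB>label' row into (sentence, label)."""
    # rsplit from the right: only the last tab separates the label,
    # so a tab inside the sentence itself is kept as sentence text.
    sentence, label = line.rstrip("\n").rsplit("\t", 1)
    return sentence.strip(), int(label)


# Row in the same shape as the Amazon sample printed in this notebook.
example = "Good case, Excellent value.\t1\n"
print(split_labelled_line(example))
```

Converting the label to `int` early means a malformed row fails loudly at load time rather than later during training.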

**Notes**
- Feel free to use existing machine learning libraries as components in your solution!
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data), `spacy` (for text processing).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justify the choices you have made, so make sure you understand the approach you have taken.

In [1]:
import os
print(os.listdir("./input"))

['amazon_cells_labelled.txt', 'yelp_labelled.txt', 'imdb_labelled.txt']


In [2]:
!head "./input/amazon_cells_labelled.txt"

So there is no way for me to plug it in here in the US unless I go by a converter.	0
Good case, Excellent value.	1
Great for the jawbone.	1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!	0
The mic is great.	1
I have to jiggle the plug to get it to line up right to get decent volume.	0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.	0
If you are Razr owner...you must have this!	1
Needless to say, I wasted my money.	0
What a waste of money and time!.	0


# Tasks
### 1. Read and concatenate data into test and train sets.
### 2. Prepare the data for input into your model.

**install libraries**

```shell
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install nltk
!pip install scikit-learn
```

In [3]:
import spacy
from spacy.tokens.token import Token
from nltk import ngrams
from dataclasses import dataclass
from typing import List
from sklearn.linear_model import LogisticRegression

In [4]:
nlp = spacy.load("en_core_web_sm")

In [5]:
@dataclass
class Sentence:
    sentence: str
    label: int
    tokens: List[Token]

In [6]:
def parse_line(line: str):
    sentence, label = line.strip().split('\t')
    label = int(label)
    tokens = nlp(sentence)
    return Sentence(sentence, label, tokens)


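def load_sentiment_data(*files: str):
    # Yield one parsed Sentence per non-empty line; each file is closed
    # as soon as it has been read.
    for f in files:
        with open(f, 'r') as handle:
            for line in handle:
                if line.strip():
                    yield parse_line(line)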

**Load the train and test sets. Each file has only 1000 records, so everything is loaded into memory.**

In [7]:
train_set = list(load_sentiment_data('input/imdb_labelled.txt', 'input/yelp_labelled.txt'))
test_set = list(load_sentiment_data('input/amazon_cells_labelled.txt'))

*To list all spaCy tags*

for label in nlp.get_pipe("tagger").labels:
    print(label, " -- ", spacy.explain(label))
    
*To list all spaCy dependency labels*

for label in nlp.get_pipe("parser").labels:
    print(label, " -- ", spacy.explain(label))

#### 2a: Find the ten most frequent words in the training set.

*Count words by tag and dep, then show the top 40 for each tag/dep*

In [8]:
def wordcount_by_tag_dep(dataset):
    tag_wordcount = {}
    dep_wordcount = {}

    for row in dataset:
        for token in row.tokens:
            dep = token.dep_
            tag = token.tag_
            lemma = token.lemma_
            tag_wordcount.setdefault(tag, {})[lemma] = tag_wordcount.get(tag, {}).get(lemma, 0) + 1
            dep_wordcount.setdefault(dep, {})[lemma] = dep_wordcount.get(dep, {}).get(lemma, 0) + 1
    
    return tag_wordcount, dep_wordcount

In [9]:
tag_wordcount, dep_wordcount = wordcount_by_tag_dep(train_set)

**Print word counts for each tag and dep**


print('Top frequent words for each tag_:')
for tag, counts in tag_wordcount.items():
    print(tag, '>>', sorted(counts.items(), key=lambda x: -x[1])[:40])

print('Top frequent words for each dep_:')
for dep, counts in dep_wordcount.items():
    print(dep, '>>', sorted(counts.items(), key=lambda x: -x[1])[:40])
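
For question 2a itself, a single overall ranking may be easier to read than the per-tag breakdown. Below is a minimal sketch using `collections.Counter`. It assumes plain lowercase whitespace tokens rather than spaCy lemmas, and `top_words` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(sentences: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent lowercase whitespace tokens."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Assumption: whitespace splitting stands in for real tokenisation.
        counts.update(sentence.lower().split())
    return counts.most_common(n)


# Toy sentences, not taken from the dataset.
toy = ["the food was good", "the service was bad", "good food"]
print(top_words(toy, n=4))
```

In this notebook the same counter could instead be fed `token.lemma_` for every token in `row.tokens`, which would merge inflected forms such as "loved" and "love".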


In [10]:
# After reviewing the word counts for each tag, these tags and deps were whitelisted

# It would be better to review the dep counts more thoroughly as well, time permitting

all_tags = {
    'RB', 'JJ', 'RBR', 'JJS', 'WRB', 'RBS', 'JJR', 'UH', 'VBD', 'NN', 'VBG', 'DT', 'VBN', 'CC', 'IN', 'VB', 'VBZ', 'WDT', 'VBP'
}
all_deps = {'advmod', 'amod', 'acomp', 'intj', 'preconj', 'predet', 'oprd'}

stopwords = {'-', 'be', 'the', 'and', 'of', 'a', 'have', 'in', 'this', 'for', 'that', 'do', 'with', 'to', 'as', 'on', 'at', 'by', 'from', 'about', 'an', 'here', 'movie', 'film'}

### 3. Train your model and justify your choices.

#### 3a: Generate features from single words and n-grams, filtered by the `all_tags`/`all_deps` whitelists and `stopwords` above

In [11]:
n_gram = 5

In [12]:
def as_feature(tokens: List[Token]):
    feature = []
    for tok in tokens:
        if (tok.lemma_ not in stopwords and (tok.tag_ in all_tags or tok.dep_ in all_deps)):
            feature.append(tok.lemma_)
    if feature:
        return ';;'.join(sorted(feature))
    return None


def generate_features(tokens: List[Token]):
    tokens = list(filter(lambda x: x.dep_ != 'punct', tokens))
    features = []
    for i in range(1, n_gram):
        for words in ngrams(tokens, i):
            feature = as_feature(words)
            if feature is not None:
                features.append(feature)
    return features
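
To make the key format easier to follow, here is a simplified sketch of the same idea on plain strings. It is an illustration only: it omits the tag/dep/stopword filtering done by `as_feature`, and `sliding_ngrams` and `ngram_keys` are hypothetical names. It shows how each window is sorted before joining, so word order inside a window does not matter.

```python
from typing import List, Sequence, Tuple


def sliding_ngrams(items: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous windows of length n over items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_keys(words: Sequence[str], max_n: int) -> List[str]:
    """Build order-insensitive keys for every 1..max_n gram."""
    keys = []
    for n in range(1, max_n + 1):
        for window in sliding_ngrams(words, n):
            # Sorting means ('very', 'good') and ('good', 'very') share a key.
            keys.append(";;".join(sorted(window)))
    return keys


print(ngram_keys(["not", "very", "good"], max_n=2))
# prints ['not', 'very', 'good', 'not;;very', 'good;;very']
```

Sorting trades away word order for fewer distinct keys, which may help with only 2000 training sentences but also means "not good" and "good, not" become the same feature.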

In [13]:
def generate_feature_dataset(dataset):
    feature_vector = {}
    for row in dataset:
        for feature in generate_features(row.tokens):
            feature_vector[feature] = feature_vector.get(feature, 0) + 1
    return feature_vector

##### a view of the features

In [14]:
feature_dataset = generate_feature_dataset(train_set)
feature_dataset_desc = sorted(feature_dataset.items(), key=lambda x: -x[1])

In [15]:
print('Top 20:', feature_dataset_desc[:20], '\nMid:', feature_dataset_desc[2200:2240], '\nLast 20:', feature_dataset_desc[-20:])

Top 20: [("n't", 677), ('good', 606), ('not', 551), ('bad', 431), ('like', 420), ('but', 412), ('all', 409), ('place', 406), ('food', 364), ('go', 357), ('so', 353), ('make', 338), ('great', 337), ('see', 312), ('just', 308), ('time', 304), ('very', 296), ('think', 292), ('love', 269), ('well', 262)] 
Mid: [('genius', 5), ('genius;;simply', 5), ('not;;play', 5), ('another;;just', 5), ('find;;mostly', 5), ('total;;waste', 5), ('most;;sabotage', 5), ('extremely;;uninteresting', 5), ("hold;;n't", 5), ('attention;;because', 5), ('adrift;;emotionally', 5), ('amateurish;;lousy', 5), ('first;;watch', 5), ('completely;;strike', 5), ('just;;too', 5), ('but;;speak', 5), ('theme', 5), ('director;;master', 5), ('care;;deeply', 5), ('patriotism;;very', 5), ('enjoy;;watch', 5), ('cinematography;;if', 5), ('funny;;so', 5), ('cheap;;trash', 5), ('angry', 5), ('angry;;feel', 5), ('craft;;superbly', 5), ('matter', 5), ('good;;mexican', 5), ('highly;;rank', 5), ('psychological', 5), ('incredible;;simply'

In [16]:
print(f'Feature dimension with n-gram (n is [1, {n_gram-1}]):', len(feature_dataset))

Feature dimension with n-gram (n is [1, 4]): 15753


##### Generate the n-dimensional training vector

In [17]:
def fillin_features(num_features, tokens: List[Token], _feature2index):
    row_array = [0 for _ in range(num_features)]
    for feature in generate_features(tokens):
        if feature in _feature2index:
            row_array[_feature2index[feature]] = 1
    return row_array

In [18]:
def generate_feature2index(feature_dataset):
    num_features = len(feature_dataset)
    feature2index = dict(zip(feature_dataset.keys(), range(num_features)))
    
    return feature2index
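
The two helpers above amount to a binary (presence/absence) encoding over a fixed vocabulary. A standalone sketch on toy strings is shown here. `build_index` and `to_binary_vector` are hypothetical names, and the feature strings are made up.

```python
from typing import Dict, Iterable, List


def build_index(features: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct feature a column, in first-seen order."""
    index: Dict[str, int] = {}
    for feature in features:
        # setdefault leaves an existing column number untouched.
        index.setdefault(feature, len(index))
    return index


def to_binary_vector(features: Iterable[str], index: Dict[str, int]) -> List[int]:
    """1 where a known feature occurs, 0 elsewhere; unknown features are dropped."""
    row = [0] * len(index)
    for feature in features:
        col = index.get(feature)
        if col is not None:
            row[col] = 1
    return row


index = build_index(["good", "bad", "good;;very"])
print(to_binary_vector(["good", "unseen", "good"], index))
# prints [1, 0, 0]
```

Note that features seen only at test time are silently ignored, and repeated features still produce a 1 rather than a count.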

In [19]:
def train_model(dataset, _feature2index):
    num_features = len(_feature2index)
    X_train = [
        fillin_features(num_features, row.tokens, _feature2index)
        for row in dataset
    ]
    y_train = [row.label for row in dataset]
    
    clf_model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)
    print('Train set accuracy:', clf_model.score(X_train, y_train))
    
    return clf_model

In [20]:
feature2index = generate_feature2index(feature_dataset)

In [21]:
clf_model = train_model(train_set, feature2index)

Train set accuracy: 0.9915


### 4. Evaluate your model using metric(s) you see fit and justify your choices.

In [22]:
def evaluate_model(model, dataset, _feature2index):
    num_features = len(_feature2index)
    X_test = [
        fillin_features(num_features, row.tokens, _feature2index)
        for row in dataset
    ]
    y_test = [row.label for row in dataset]
    
    print(model.score(X_test, y_test))

In [23]:
evaluate_model(clf_model, test_set, feature2index)

0.768
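
The score above is plain accuracy. If the two classes turn out not to be evenly represented, precision, recall and F1 would give a fuller picture. The following sketch computes them from scratch. The labels and predictions are made up for illustration, and `binary_metrics` is a hypothetical helper, not something used in this notebook.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, purely illustrative.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

With a fitted model, `y_pred` would come from `model.predict(X_test)`; scikit-learn's `sklearn.metrics.classification_report` reports the same quantities.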


### 5. Try to improve

The training set above leans toward film/movie reviews, so it may be skewed relative to the Amazon test set. Try mixing the IMDB, Yelp and Amazon data, then splitting the whole dataset 2/3 and 1/3 into train and test sets.

In [24]:
from random import shuffle

In [25]:
all_dataset = train_set + test_set

In [26]:
shuffle(all_dataset)

In [27]:
mix_train_set = all_dataset[:2000]
mix_test_set = all_dataset[2000:]
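
Because `shuffle` is unseeded here, each run produces a different split and therefore slightly different scores. A reproducible variant might look like the sketch below. `shuffled_split` is a hypothetical helper, and the seed value is arbitrary.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffled_split(items: Sequence[T], train_size: int, seed: int) -> Tuple[List[T], List[T]]:
    """Shuffle a copy with a fixed seed, then cut it into train and test."""
    shuffled = list(items)  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Integers stand in for the 3000 parsed sentences.
train, test = shuffled_split(range(3000), train_size=2000, seed=42)
print(len(train), len(test))
# prints 2000 1000
```

Using a private `random.Random` instance keeps the seed from affecting any other code that relies on the global random state.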

In [28]:
mix_tag_wordcount, mix_dep_wordcount = wordcount_by_tag_dep(mix_train_set)

In [29]:
mix_feature_dataset = generate_feature_dataset(mix_train_set)
mix_feature2index = generate_feature2index(mix_feature_dataset)

In [30]:
mix_clf_model = train_model(mix_train_set, mix_feature2index)

Train set accuracy: 0.988


In [31]:
evaluate_model(mix_clf_model, mix_test_set, mix_feature2index)

0.814
