This Jupyter notebook was prepared by [Chun-Kit Yeung](https://ckyeungac.com)

# Introduction
**What did we cover in the last tutorial?**

In the last tutorial, we went through the code for using the naive Bayes classifier in scikit-learn and for evaluating model performance with training accuracy and test accuracy.

**What do we cover in this tutorial?**

In this tutorial, I will introduce the Python packages that are required to complete assignment 1, including
+ extracting features from text: `sklearn.feature_extraction.text.CountVectorizer`
+ calculating the accuracy score: `sklearn.metrics.accuracy_score`
+ performing 5-fold cross-validation: `sklearn.model_selection.KFold`

As for `KFold`, I will only cover what it does, not its exact usage in machine learning. You have to work that out on your own, as it is one of the tasks in the assignment.


In [1]:
# import the required packages
import numpy as np
import sklearn 
import matplotlib
import matplotlib.pyplot as plt

# tell the Jupyter notebook to plot diagrams directly in the output
%matplotlib inline 
    
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sklearn.__version__)
print ('matplotlib version:', matplotlib.__version__)

SEP = '='*30

numpy version: 1.14.1
scikit-learn version: 0.19.1
matplotlib version: 2.1.2


### Sentiment Analysis

Let's create a toy dataset consisting of a list of tuples `(<sentence>, <sentiment_label>)` to get a taste of handling text data.

`<sentence>` is an English sentence. Usually, it carries some positive or negative information or emotion. One popular task in machine learning is therefore sentiment analysis, in which machine learning experts build models to predict the emotion of a sentence. In this toy example, `<sentiment_label>` is simply 0 or 1, indicating that a negative or positive emotion is observed in the sentence.

In [2]:
# (<sentence>, <sentiment_label>)
# sentiment_label=1 means positive, 0 means negative.
toy_sentiment_dataset = [
    ('Today I am so happy and you are so happy.', 1),
    ('Happy is so important and you are so important too', 1),
    ('you look so great and happy!', 1),
    ('Don\'t be so sad, dude', 0),
    ('Machine learning is difficult', 0),
    ('COMP 4211 is the goodest', 1),
    ('James is so good', 1),
    ('Kit is so sad', 0),
]

sentences = [data[0] for data in toy_sentiment_dataset]
sentiment = [data[1] for data in toy_sentiment_dataset]

**Convert text to vector using [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)**

It is a handy class that converts a collection of text documents to a matrix of token counts.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english') # stop words are common words that carry little information
# vectorizer.fit_transform(sentences) would fit and transform in one step
vectorizer.fit(sentences)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]:
# See what the sample looks like in vector representation
sample = ['COMP 4211 is the goodest.']
sample_vector = vectorizer.transform(sample).toarray() 
print(sample_vector)

[[1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]]


In [5]:
# See the content when we convert the vector back to words (not a sentence!)
print(vectorizer.inverse_transform(sample_vector))

[array(['4211', 'comp', 'goodest'], dtype='<U9')]
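
To make the idea of token counts more concrete, here is a minimal pure-Python sketch of what a bag-of-words conversion does conceptually. It is not how `CountVectorizer` is implemented. The helper names `tokenize`, `build_vocabulary` and `to_count_vector` are hypothetical, stop-word removal is left out, and the three documents are a reduced version of the toy dataset. It reuses the default `token_pattern` shown in the output above, which keeps only tokens of two or more characters.

```python
import re
from collections import Counter

# Same regular expression as the default token_pattern printed above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in the text.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokenize(text))
    ordered_tokens = sorted(vocabulary, key=vocabulary.get)
    return [counts.get(tok, 0) for tok in ordered_tokens]

docs = ['James is so good', 'Kit is so sad', 'so sad, so good']
vocab = build_vocabulary(docs)
print(vocab)
print(to_count_vector('so sad, so good', vocab))  # [1, 0, 0, 0, 1, 2]
```

Each position in the vector corresponds to one word in the vocabulary, which is why converting a vector back gives a set of words rather than the original sentence.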


**Building a toy sentiment prediction model using an [`artificial neural network`](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)**

In [6]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(10,))
clf.fit(vectorizer.transform(sentences), sentiment)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

**Evaluating the artificial neural network using the [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) provided by scikit-learn**

In [7]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(vectorizer.transform(sentences))
acc = accuracy_score(sentiment, y_pred)
print(acc)

1.0
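
The accuracy score is simply the fraction of predictions that match the true labels. The sketch below computes it by hand in plain Python to show the arithmetic. The function name `manual_accuracy` and the prediction list `y_hypothetical` are made up for illustration and are not outputs of the model above.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 1, 1, 0, 0, 1, 1, 0]          # labels of the toy dataset
y_hypothetical = [1, 1, 0, 0, 0, 1, 1, 1]  # made-up predictions, 2 mistakes

print(manual_accuracy(y_true, y_hypothetical))  # 6 of 8 correct: 0.75
```

Note that the score in the cell above is a training accuracy, because the model is evaluated on the same sentences it was fitted on.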


### K-Fold Cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples, the so-called **folds**. Of the k folds, **a single fold is retained as the test set**, and **the remaining k − 1 folds are used as the training set**. The cross-validation process is then repeated k times, with each of the k folds used exactly once as the test set.
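
The partitioning described above can be sketched in a few lines of plain Python. This is only an illustration of the index bookkeeping, without shuffling and without any model. The function name `contiguous_folds` is hypothetical and is not part of scikit-learn.

```python
def contiguous_folds(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

for fold, (train_idx, test_idx) in enumerate(contiguous_folds(10, 5), start=1):
    print("Fold", fold, "train:", train_idx, "test:", test_idx)
```

Every index appears in exactly one test fold, and in the training set of the other k − 1 rounds.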

scikit-learn provides a handy class called [`KFold`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to facilitate k-fold cross-validation.

In [8]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False) # without shuffling the data
some_2d_array = np.arange(20).reshape((10, 2))
fold_count = 0
for train_idx, test_idx in kf.split(some_2d_array):
    fold_count += 1
    print("Fold", fold_count)
    print(SEP)
    print("Training index:", train_idx)
    print("Test index:", test_idx)
    print(SEP, '\n')

Fold 1
Training index: [2 3 4 5 6 7 8 9]
Test index: [0 1]

Fold 2
Training index: [0 1 4 5 6 7 8 9]
Test index: [2 3]

Fold 3
Training index: [0 1 2 3 6 7 8 9]
Test index: [4 5]

Fold 4
Training index: [0 1 2 3 4 5 8 9]
Test index: [6 7]

Fold 5
Training index: [0 1 2 3 4 5 6 7]
Test index: [8 9]



In [9]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42) # with shuffling the data
some_2d_array = np.arange(20).reshape((10, 2))
fold_count = 0
for train_idx, test_idx in kf.split(some_2d_array):
    fold_count += 1
    print("Fold", fold_count)
    print(SEP)
    print("Training index:", train_idx)
    print("Test index:", test_idx)
    print(SEP, '\n')

Fold 1
Training index: [0 2 3 4 5 6 7 9]
Test index: [1 8]

Fold 2
Training index: [1 2 3 4 6 7 8 9]
Test index: [0 5]

Fold 3
Training index: [0 1 3 4 5 6 8 9]
Test index: [2 7]

Fold 4
Training index: [0 1 2 3 5 6 7 8]
Test index: [4 9]

Fold 5
Training index: [0 1 2 4 5 7 8 9]
Test index: [3 6]



In [10]:
# let's say the train index is
train_idx = [0, 1, 2, 3, 4, 5, 6, 7]

# you can access the training data by slicing/indexing
print("original some_2d_array:\n", some_2d_array)
print(SEP)
print("sliced some_2d_array[train_idx]:\n", some_2d_array[train_idx])

original some_2d_array:
 [[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
sliced some_2d_array[train_idx]:
 [[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]]
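
Indexing with a list of indices works on numpy arrays, but a plain Python list such as `sentences` does not support it. As a small sketch, assuming the data is kept in Python lists, the same selection can be done with a list comprehension. The helper `take`, the shortened lists and the index values below are hypothetical and chosen only for illustration.

```python
def take(items, indices):
    """Select the elements of a Python list at the given positions."""
    return [items[i] for i in indices]

# A shortened version of the toy dataset.
sentences = ['James is so good', 'Kit is so sad', 'you look so great and happy!']
sentiment = [1, 0, 1]

# Hypothetical indices, e.g. as produced for one fold.
train_idx = [0, 2]
test_idx = [1]

# Apply the same indices to the sentences and to their labels,
# so that each sentence stays paired with its own label.
train_sentences = take(sentences, train_idx)
train_labels = take(sentiment, train_idx)
test_sentences = take(sentences, test_idx)
test_labels = take(sentiment, test_idx)

print(train_sentences, train_labels)
print(test_sentences, test_labels)
```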
