## TEXT ANALYSIS AND MACHINE LEARNING: Part 2

#### Before proceeding, please read _"Text Analysis and Machine Learning Part 1"_ if you have not already. Click [Link](http://nbviewer.jupyter.org/github/aakinlalu/Sentimental-Analysis/blob/master/Sentimental_Analysis.ipynb)

We want to classify the positive and negative tweets labelled in the first part of the article. We will extract TF-IDF features from the tweets and classify them using Logistic Regression, tuned with GridSearchCV. The end goal is to check the performance of the classifiers.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
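from sklearn.pipeline import Pipeline
# grid_search and cross_validation were merged into model_selection
# (scikit-learn 0.18+); the old modules no longer exist in current releases.
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics import precision_score, recall_score  # used in the grid-search cell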

### Bag of words with TF-IDF (Term Frequency-Inverse Document Frequency) weights
According to Gavin Hackeling (Mastering Machine Learning with scikit-learn), many words might appear with the same frequency in two documents, but the documents could still be dissimilar if one document is many times larger than the other. Scikit-learn's TfidfTransformer object can mitigate this problem by transforming a matrix of term frequency vectors into a matrix of normalized term frequency weights.

The smoothed, **normalized term frequencies** are given by the following equation:

In [2]:
from IPython.display import Latex
Latex(r"""\begin{eqnarray}
\mathrm{tf}(t,d) & = & \frac{f(t,d) + 1}{\|x\|}
\end{eqnarray}""")

<IPython.core.display.Latex object>

f(t,d) is the frequency of term t in document d.
||x|| is the L2 norm of the document's count vector.
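
As a minimal sketch of the formula above, the snippet below applies it to a single document's count vector. The toy counts are made up for illustration and the helper name `normalized_tf` is hypothetical; it follows the equation as written here, which is not necessarily how scikit-learn computes its weights internally.

```python
import math

def normalized_tf(counts):
    """Apply tf(t, d) = (f(t, d) + 1) / ||x|| to one document's term counts."""
    # ||x|| is the L2 norm of the raw count vector.
    l2_norm = math.sqrt(sum(c * c for c in counts))
    return [(c + 1) / l2_norm for c in counts]

# Hypothetical document whose three vocabulary terms occur 3, 0 and 4 times.
counts = [3, 0, 4]
weights = normalized_tf(counts)
print(weights)  # ||x|| = 5, so the weights are 4/5, 1/5 and 5/5
```

Dividing by the norm is what makes a long document and a short one comparable: doubling every count doubles ||x|| as well.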

The **inverse document frequency (IDF)** is a measure of how rare or common a word is in a corpus.
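
To make the idea concrete, here is a small sketch that computes IDF by hand for a toy corpus. It assumes the smoothed formula that scikit-learn documents for its default `smooth_idf=True` setting, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The corpus and the helper name `smoothed_idf` are invented for illustration.

```python
import math

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Hypothetical three-tweet corpus.
corpus = ["war peace war", "war news", "peace talks"]

for term in ("war", "peace", "talks"):
    print("%s %.3f" % (term, smoothed_idf(term, corpus)))
```

A rare term such as `talks` gets a larger weight than `war` or `peace`, which each appear in two of the three documents.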

Scikit-learn provides a TfidfVectorizer class that wraps CountVectorizer and TfidfTransformer.

In [None]:
#load and split dataset into train and test 

In [2]:
#os.chdir('desktop/output')
os.getcwd()
os.listdir(os.curdir)

['.DS_Store',
 'Aliases.csv',
 'Analysis.png',
 'database.sqlite',
 'EmailReceivers.csv',
 'Emails.csv',
 'hashes.txt',
 'labelledTweet.csv',
 'Persons.csv',
 'SMS.csv',
 'tweet.csv']

In [3]:
df = pd.read_csv('labelledTweet.csv')

In [4]:
df.head()

Unnamed: 0,Tweet,Label
0,co jrpbjvmg z nie mehr wie es war von mariann...,positive
1,brink nasti war co du ucxicd,positive
2,vladimir putin kremlinrussia e vladimir war f...,positive
3,civilwarboutiq civil war victorian stripe ros...,positive
4,worcesternew woman charg caus death war veter...,negative


In [5]:
x_train_raw, x_test_raw, y_train, y_test = train_test_split(df['Tweet'], df['Label'])

In [6]:
Vectorizer = TfidfVectorizer()
x_train = Vectorizer.fit_transform(x_train_raw)
x_test = Vectorizer.transform(x_test_raw)

### Using Logistic Regression to classify the tweets

In [7]:
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Make prediction with Logistic Regression classifier

In [8]:
predicted = classifier.predict(x_test)

In [9]:
# Show the first five predictions next to the tweets they belong to.
for prediction, tweet in zip(predicted[:5], x_test_raw.iloc[:5]):
    print('Predicted: %s Tweet: %s' % (prediction, tweet))

Predicted: positive Tweet:  war guilt grace m live proof gracewin everi time matthew west co ahkjqbi ha jesu love
Predicted: positive Tweet:  icc investig russia georgia war co trqg aypzj
Predicted: positive Tweet:  three isra kill jerusalem attack co hknfpi lk
Predicted: positive Tweet:  co tvyclrz bm putin seek peac prepar war allpolit
Predicted: positive Tweet:  march war face changelog hotfix thursday novemb th co jy nygn vu


#### Check the performance of the classifier

**Accuracy** measures the fraction of the classifier's predictions that are correct.

In [20]:
accuracy_score(y_test, predicted)

0.97861540476814257

While accuracy measures the overall correctness of the classifier, it does not distinguish between false positive errors and false negative errors. A ROC curve visualizes a classifier's performance for all values of the discrimination threshold.
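
As a rough sketch of where the points on a ROC curve come from: for each candidate threshold, everything scoring at or above it is labelled positive, and the resulting false positive rate and true positive rate are recorded. The labels and scores below are invented, and `roc_points` is a hypothetical helper written for this illustration (scikit-learn provides `sklearn.metrics.roc_curve` for real use).

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / float(n_neg), tp / float(n_pos)))
    return points

# Hypothetical labels (1 = positive) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for fpr, tpr in roc_points(y_true, scores):
    print("FPR=%.2f TPR=%.2f" % (fpr, tpr))
```

Lowering the threshold can only add positive predictions, so both rates grow from (0, 0) towards (1, 1); a better classifier reaches a high true positive rate while the false positive rate is still low.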

**Precision** is the fraction of positive predictions that are correct.  
**Recall** is the fraction of the truly positive instances that the classifier recognises.  
The **F1 measure** is the harmonic mean of the precision and recall scores.

In [26]:
precision_recall_fscore_support(y_test, predicted)

(array([ 0.89572321,  0.98467756]),
 array([ 0.81043478,  0.99231478]),
 array([ 0.85094727,  0.98848142]),
 array([ 2300, 28236]))
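
The four arrays above are precision, recall, F1 and support, each listed per class in sorted label order (negative first, then positive). As a quick sanity check, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the output above (negative class, then positive class).
reported = [(0.89572321, 0.81043478), (0.98467756, 0.99231478)]
f1_values = [f1_from(p, r) for p, r in reported]
print(["%.4f" % f1 for f1 in f1_values])  # compare with the third array above
```

The negative class has far fewer examples (2300 against 28236), and its lower recall is what the overall accuracy figure hides.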

#### Tuning Models with Grid Search
---
Grid search is a common method for selecting the hyperparameter values that produce the best model. It takes a set of possible values for each hyperparameter to be tuned and evaluates a model trained on each element of the Cartesian product of those sets. The disadvantage of grid search is that it is computationally costly, even for small sets of hyperparameter values.

In [23]:
from sklearn.metrics import precision_score, recall_score

def main():
    # liblinear supports both the 'l1' and 'l2' penalties searched below.
    pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),
                        ('clf', LogisticRegression(solver='liblinear'))])
    parameters = {'vect__max_df': (0.25, 0.5, 0.75),
                 'vect__stop_words': ('english', None),
                 #'vect__max_features': (2500, 5000, 10000, None),
                 'vect__ngram_range': ((1, 1), (1, 2)),
                 'vect__use_idf': (True, False),
                 'vect__norm': ('l1', 'l2'),
                 'clf__penalty': ('l1', 'l2'),
                 'clf__C': (0.01, 0.1, 1, 10),}
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('labelledTweet.csv')
    X, y = df['Tweet'], df['Label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    # The labels are strings, so the positive class has to be named explicitly.
    print('Precision: %s' % precision_score(y_test, predictions, pos_label='positive'))
    print('Recall: %s' % recall_score(y_test, predictions, pos_label='positive'))

In [None]:
if __name__ == "__main__":
    main()

Fitting 3 folds for each of 384 candidates, totalling 1152 fits
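
The counts in the log line above follow directly from the size of the grid. This sketch enumerates the Cartesian product of the value sets with the standard library's `itertools.product`; it only counts the candidates and does not train anything.

```python
from itertools import product

# Same value sets as the `parameters` dict in main().
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

names = sorted(parameters)
# One dict per combination of hyperparameter values.
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_folds = 3  # cv=3 in the GridSearchCV call
print("%d candidates, %d fits" % (len(candidates), len(candidates) * n_folds))
```

Adding one more hyperparameter with four values, such as the commented-out `vect__max_features`, would multiply the number of fits by four, which is why grid search becomes expensive so quickly.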


### Reference
---
Hackeling, G 2014, _Mastering Machine Learning with scikit-learn_, Packt Publishing.