### Goal

We have collected a number of tweets and stored them in a database (if you have not done so yet, see the "gather-twitter-info-example" notebook). We want to get an idea of what people in the network are talking about without reading through the tweets manually. Let's use machine learning instead: we train a very simple text classifier on labeled example texts and use it to assign the tweets to the same categories.

### Train classifier

Let's train the text classifier on ~18000 newsgroup posts covering 20 different topics (http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). The dataset can be loaded directly through the scikit-learn library.

#### Load data

In [1]:
from sklearn.datasets import fetch_20newsgroups
 
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=32)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=32)

#### Train classifier

Ideally, this text classifier should be able to classify tweets from the pre-filtered Twitter stream directly, so speed matters. We therefore use a simple linear classifier: multinomial naive Bayes (http://scikit-learn.org/stable/modules/naive_bayes.html). To classify text, we first convert the documents into sparse vectors of term counts (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and then apply a tf-idf transform (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which weights each term by its relative frequency within a document (tf) and its inverse document frequency across the corpus (idf) (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
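
To make the two feature extraction steps concrete, here is a minimal sketch on a made-up three-document corpus (the documents are an assumption for illustration, not part of the newsgroup data). It shows the sparse count matrix produced by CountVectorizer and how the tf-idf transform down-weights a term that occurs in several documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, only for illustration
docs = [
    "the rocket launch was delayed",
    "the team won the match",
    "rocket engines and rocket fuel",
]

# Step 1: sparse matrix of raw term counts, shape (n_documents, n_terms)
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Step 2: re-weight the counts by tf-idf (rows are L2-normalized by default)
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

vocab = vect.vocabulary_  # maps each term to its column index

print(counts.shape)                      # (number of documents, vocabulary size)
print(counts[2, vocab["rocket"]])        # raw count of "rocket" in the third document
# "the" occurs in two documents, "delayed" in only one, so within the first
# document "the" receives the lower tf-idf weight although both occur once.
print(weights[0, vocab["the"]], weights[0, vocab["delayed"]])
```

Both matrices are SciPy sparse matrices, which keeps memory usage manageable even when bigrams enlarge the vocabulary considerably.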

We combine the CountVectorizer, TfidfTransformer and classifier into a scikit-learn pipeline and select the best hyperparameters using a grid search with 5-fold cross-validation. The parameters we tune are i) the n-gram range for feature extraction (unigrams only, or unigrams and bigrams) and ii) the additive smoothing term (alpha) of the naive Bayes classifier.
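
The internals of `MessageClassifier.train` are not shown in this notebook. As a rough sketch of what such a search could look like with plain scikit-learn, the following uses `GridSearchCV` on a tiny made-up corpus (the texts and labels are assumptions for illustration; a recent scikit-learn version with `sklearn.model_selection` is assumed as well). The pipeline and parameter grid mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 documents per class so that 5-fold
# stratified cross-validation has 2 documents of each class per fold.
texts = (
    [f"the team won game {i} with a late goal" for i in range(10)]
    + [f"the rocket reached orbit {i} after launch" for i in range(10)]
)
labels = [0] * 10 + [1] * 10

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])

# 2 n-gram settings x 5 smoothing values = 10 candidates
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
}

# cv=5 uses stratified 5-fold splits for classifiers: 10 candidates x 5 folds = 50 fits
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)  # best combination found on the toy data
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all training data with the best parameters, which is what one would use for prediction.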

In [6]:
%load_ext autoreload
%autoreload 2

from src.message_classifier import MessageClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

tc = MessageClassifier()
# Pipeline: token counts -> tf-idf weights -> multinomial naive Bayes
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
# Grid: unigrams vs. unigrams + bigrams, and the additive smoothing term alpha
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [10**-5, 10**-4, 10**-3, 10**-2, 10**-1],
}
tc.train(train_X=twenty_train.data, train_y=twenty_train.target,
         labels=twenty_train.target_names,
         pipeline=pipeline, param_grid=param_grid)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  6.2min finished


Best score 0.917093866007 of estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        st...alse,
         use_idf=True)), ('nb', MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True))])


Cross-validation reports a mean accuracy of about 0.92 across all classes for the best parameter setting (unigrams plus bigrams, alpha = 0.001). Let's measure the performance on the held-out test set.

In [None]:
tc.test(valid_X=twenty_test.data, valid_y=twenty_test.target)
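
What `tc.test` reports is not shown here. A minimal sketch of a comparable held-out evaluation with plain scikit-learn could look as follows; the tiny train and test sets, the label names and the example tweet are all made up for illustration, and the pipeline uses the parameter values selected by the grid search above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the newsgroup train/test split
label_names = ["sports", "space"]
train_texts = [
    "the team won the match",
    "a late goal decided the game",
    "the rocket reached orbit",
    "the satellite launch was delayed",
]
train_labels = [0, 0, 1, 1]
test_texts = ["the team scored a goal", "rocket launch scheduled for orbit"]
test_labels = [0, 1]

# Same pipeline structure as above, with the selected hyperparameters
model = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB(alpha=0.001)),
])
model.fit(train_texts, train_labels)

# Overall accuracy plus per-class precision, recall and F1 on unseen texts
predicted = model.predict(test_texts)
print(accuracy_score(test_labels, predicted))
print(classification_report(test_labels, predicted, target_names=label_names))

# Classifying a single tweet-like text works the same way
tweet = "what a goal by the team tonight"
print(label_names[model.predict([tweet])[0]])
```

The per-class report is useful here because an overall accuracy can hide topics on which the classifier performs poorly.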