# IDST2: Classification in Practice

    1. What is machine learning?

A computer program is said to learn from **experience E** with respect to some class of **tasks T** and **performance measure P** if its performance at tasks in T, as measured by P, improves with experience E.

    2. What is classification?

Experience - descriptions of objects together with the labels for those objects  
Task - given the description of an object, guess its label

    3. How to do it?

To get the data:  
https://drive.google.com/drive/folders/0By_3cvm7F4CgZVp1Njk3ZU8wZE0

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Read the data from the .csv file

In [2]:
csv = pd.read_csv('train_stoverflow.csv', encoding='cp1251', index_col='PostId', 
                  parse_dates=['PostCreationDate', 'OwnerCreationDate', 'PostClosedDate'], infer_datetime_format=True)
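
To see what `index_col` and `parse_dates` do without downloading the data set, here is a minimal sketch on a tiny in-memory CSV. The two rows and the reduced column set are made up for illustration; only the column names are taken from the real file.

```python
import io
import pandas as pd

# Made-up two-row sample that reuses a few column names from the real file
raw = (
    "PostId,PostCreationDate,PostClosedDate,OpenStatus\n"
    "1,2011-08-10 21:24:36,2011-08-10 21:57:14,1\n"
    "2,2011-07-09 15:46:51,,0\n"
)

toy = pd.read_csv(io.StringIO(raw), index_col='PostId',
                  parse_dates=['PostCreationDate', 'PostClosedDate'])

# Date columns become datetime64; a missing close date is stored as NaT
print(toy.dtypes)
print(toy)
```

Posts that were never closed have no `PostClosedDate`, which is why that column shows `NaT` for them.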

In [3]:
b
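
The definition of `removeMarkdown`, which the following cells call, is not shown in this export. As a rough idea of what such a helper could do, here is a hypothetical stand-in built on regular expressions. The name `remove_markdown_sketch` and all of its rules are assumptions, not the original implementation, and its output will not necessarily match the `CleanBody` column.

```python
import re

def remove_markdown_sketch(text):
    """Hypothetical stand-in for removeMarkdown; NOT the original helper."""
    # Drop fenced code blocks (three backticks ... three backticks)
    text = re.sub(r"`{3}.*?`{3}", " ", text, flags=re.DOTALL)
    # Replace images and links with their visible text
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)   # inline style
    text = re.sub(r"!?\[([^\]]*)\]\[[^\]]*\]", r"\1", text)  # reference style
    # Strip emphasis asterisks and inline-code backticks
    text = re.sub(r"[*`]+", "", text)
    # Strip heading and block-quote prefixes at the start of lines
    text = re.sub(r"^\s{0,3}(#{1,6}|>)\s*", "", text, flags=re.MULTILINE)
    # Collapse all whitespace runs (including \r\n) into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "# Title\r\n\r\nSee [docs](http://x.y) and **bold** `code`"
print(remove_markdown_sketch(sample))
```

A dedicated Markdown parser would be more robust than regular expressions; this sketch only shows the kind of cleaning involved.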

In [4]:
csv['CleanBody'] = [ removeMarkdown(x) for x in csv['BodyMarkdown']]

Let's look at what we have

In [5]:
csv.head()

| PostId | PostCreationDate | OwnerUserId | OwnerCreationDate | ReputationAtPostCreation | OwnerUndeletedAnswerCountAtPostTime | Title | BodyMarkdown | Tag1 | Tag2 | Tag3 | Tag4 | Tag5 | PostClosedDate | OpenStatus | CleanBody |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1402214 | 2011-08-10 21:24:36 | 877278 | 2011-08-03 17:57:10 | 17 | 0 | GET/Post values don't be refreshed/parsed by t... | `![Get-values don't be refreshed][1]\r\r\n\r\r\...` | php | apache |  |  |  | 2011-08-10 21:57:14 | 1 | `![Get-values don't be refreshed][1]How you see...` |
| 2338671 | 2011-07-09 15:46:51 | 45954 | 2008-12-12 21:03:08 | 1757 | 22 | Search control in Haskell | Suppose you're writing a program that searches... | search | haskell | lazy-evaluation | inference |  | NaT | 0 | Suppose you're writing a program that searches... |
| 2042048 | 2011-05-03 15:48:34 | 1185829 | 2011-03-28 13:20:36 | 1 | 0 | jQuery: event.preventdefault not working with ... | I have this bit of jQuery toggling a paragraph... | jquery | firefox | toggle | preventdefault |  | NaT | 0 | I have this bit of jQuery toggling a paragraph... |
| 1672620 | 2011-12-07 17:28:40 | 1086278 | 2011-12-07 17:18:32 | 1 | 0 | java 2 dimensional arrays | `\\I need to look through an array from east to...` | java | arrays |  |  |  | 2011-12-07 18:33:56 | 4 | `\\I need to look through an array from east to...` |
| 3103106 | 2011-12-16 04:07:33 | 1002323 | 2011-10-19 00:33:34 | 16 | 2 | Which interface is used to detect key events f... | I've created a simple custom dialog that asks ... | android | sdk | dialog | onkeyup | onkeydown | NaT | 0 | I've created a simple custom dialog that asks ... |


**Hypothesis 1: the text of a post can give high accuracy**

We need to represent each word with a number. To do this, we will count the number of times each word occurs in a post.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [7]:
X_train_counts = vectorizer.fit_transform(csv.CleanBody.values)
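
To make the counting step concrete, here is a small sketch on two made-up sentences (the sentences and the `toy_` names are illustrative, not part of the data set). Each column of the resulting matrix corresponds to one word of the learned vocabulary, and each row to one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]  # made-up example documents
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                 # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_counts.toarray())  # [[1 0 0 1 1], [1 1 1 1 2]] (one row per document)
```

The result is a sparse matrix, which matters for real data: most posts contain only a tiny fraction of the full vocabulary.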

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
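
A quick way to see the length effect is to compare a made-up document with a copy of itself repeated twice. The raw counts double, but with its default settings `TfidfTransformer` rescales every row to unit length, so both documents end up with the same vector. This is a toy sketch, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["cat sat", "cat sat cat sat"]  # second document repeats the first
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.toarray())              # raw counts differ: [[1 1], [2 2]]
print(np.round(tfidf.toarray(), 3))  # both rows come out identical

same_after_tfidf = np.allclose(tfidf.toarray()[0], tfidf.toarray()[1])
```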

That fixes the issue. Now we can try to build a very simple model: Naive Bayes.

In [9]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, csv.OpenStatus)

In [10]:
b

0.67600000000000005
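
The cell that produced this score is not shown in the export. One common way to get such an accuracy estimate is to hold out part of the labelled data and score the model on it. The sketch below does that on made-up toy posts; the texts, labels and `toy_` names are assumptions, and it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the post bodies and their OpenStatus labels (made-up data)
toy_texts = (["how do I parse json in python",
              "segfault when freeing a pointer in c"] * 5
             + ["please do my homework for me",
                "give me the code urgently"] * 5)
toy_labels = [0] * 10 + [1] * 10

train_txt, test_txt, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0)

# Fit the text transforms on the training part only, then reuse them on the test part
toy_vec = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_tr = toy_tfidf.fit_transform(toy_vec.fit_transform(train_txt))
X_te = toy_tfidf.transform(toy_vec.transform(test_txt))

toy_clf = MultinomialNB().fit(X_tr, y_train)
accuracy = toy_clf.score(X_te, y_test)  # fraction of held-out posts labelled correctly
print(accuracy)
```

Scoring on the same rows the model was trained on would give an over-optimistic number, which is why the held-out split matters.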

Grid search and cross validation

In [11]:
from sklearn.linear_model import SGDClassifier

In [12]:
from sklearn.grid_search import GridSearchCV

In [13]:
sgdclf = SGDClassifier(n_jobs=-1)

In [14]:
sgdgscv = GridSearchCV(sgdclf, {'loss':['hinge','log','perceptron','squared_hinge'], 
                               'penalty':['l1','l2','elasticnet'],
                               'n_iter':[5,15,25],
                               'shuffle':[True, False]})

In [15]:
sgdgscv.fit(X_train_tfidf, csv.OpenStatus)

GridSearchCV(cv=None, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'loss': ['hinge', 'log', 'perceptron', 'squared_hinge'], 'shuffle': [True, False], 'n_iter': [5, 15, 25], 'penalty': ['l1', 'l2', 'elasticnet']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [16]:
print(sgdgscv.best_score_, sgdgscv.best_params_)

0.700108108108 {'loss': 'hinge', 'shuffle': True, 'n_iter': 15, 'penalty': 'l2'}
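
The cells above match the older scikit-learn release they were run with. In later releases `sklearn.grid_search` was removed in favour of `sklearn.model_selection`, and the `n_iter` parameter of `SGDClassifier` was replaced by `max_iter`. A rough equivalent for a recent version might look like the sketch below. It runs on made-up toy data, and the `'log'` loss is left out of the grid because its name changed between releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Made-up toy data standing in for the posts and their labels
toy_texts = (["how do I parse json in python",
              "segfault when freeing a pointer in c"] * 5
             + ["please do my homework for me",
                "give me the code urgently"] * 5)
toy_labels = [0] * 10 + [1] * 10

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
X_toy = TfidfVectorizer().fit_transform(toy_texts)

param_grid = {'loss': ['hinge', 'perceptron', 'squared_hinge'],
              'penalty': ['l1', 'l2', 'elasticnet'],
              'max_iter': [5, 15, 25],   # replaces the old n_iter
              'shuffle': [True, False]}

# cv=3 is an illustrative choice; every parameter combination is scored on 3 folds
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X_toy, toy_labels)
print(search.best_score_, search.best_params_)
```

With such small `max_iter` values scikit-learn may emit convergence warnings; they do not stop the search.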


## Predicting

In [17]:
testset = pd.read_csv('test_stoverflow.csv', encoding='cp1251', index_col='PostId', 
                      parse_dates=['PostCreationDate', 'OwnerCreationDate'], infer_datetime_format=True)

In [18]:
testset['CleanBody'] = [ removeMarkdown(x) for x in testset['BodyMarkdown']]
X_new_counts = vectorizer.transform(testset.CleanBody.values)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [19]:
# Predict on the TF-IDF features, the same representation the model was trained on
Y_predicted = sgdgscv.predict(X_new_tfidf)

In [20]:
answer = pd.DataFrame(Y_predicted, index= testset.index)
answer.to_csv(path_or_buf='Answer_SGD_TFIDF.csv', header=['OpenStatus'], sep = ',')
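
At prediction time the new posts must go through exactly the same vectorizer and TF-IDF transform that were fitted on the training data. One way to make that hard to get wrong is scikit-learn's `Pipeline`, which bundles the steps into a single estimator. The sketch below uses made-up toy posts; the step names and data are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up toy training data standing in for CleanBody and OpenStatus
train_texts = (["how do I parse json in python",
                "segfault when freeing a pointer in c"] * 5
               + ["please do my homework for me",
                  "give me the code urgently"] * 5)
train_labels = [0] * 10 + [1] * 10

text_clf = Pipeline([
    ('counts', CountVectorizer()),       # words -> occurrence counts
    ('tfidf', TfidfTransformer()),       # counts -> TF-IDF weights
    ('clf', SGDClassifier(random_state=0)),
])

# fit() and predict() both take raw text; the pipeline applies every step in order
text_clf.fit(train_texts, train_labels)
predicted = text_clf.predict(["why does my python loop crash"])
print(predicted)
```

A pipeline like this can also be passed to a grid search directly, so the text transforms are refitted inside each cross-validation fold.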

### Useful links

A nice description of the ML problem-solving process
http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

How to choose an sklearn algorithm?
http://i.stack.imgur.com/BZJiN.png

How to choose the right classifier?
http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/

Kaggle
https://www.kaggle.com/

Scikit-Learn library
http://scikit-learn.org/stable/

Text data
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Try Jupyter
https://try.jupyter.org/