# Intro

This notebook tests a document classifier.
It covers some data exploration, feature extraction with a bag-of-words model, and training the classifier.

The code will later be moved into Python files.

In [1]:
import pandas as pd
import numpy as np

In [2]:
documents_df = pd.read_csv("../files/documents.csv")

In [3]:
print documents_df.shape
documents_df.head()

(1992, 4)


Unnamed: 0,id,title,content,category
0,0,Auditoría revela irregularidades en el Parlacen,GUATEMALA.- Una fiscalización de la Contralorí...,Other
1,1,Suspendidas las citas en Hospital Escuela,TEGUCIGALPA.- Una misteriosa obstrucción del s...,Other
2,2,Mariscos contaminados alarman a los “porteños”,"PUERTO CORTES, Cortés.- Alarmados se encuentra...",Other
3,3,Citan a 11 personas por vender pólvora,SAN PEDRO SULA.- Hasta el momento ocho bodegas...,Criminal
4,4,Con compra de granos se paliaría hambruna en e...,TEGUCIGALPA.- No llueve hace cuatro meses y la...,Other


# 1. Data Exploration

## Categories
As shown below, there are three categories: Other, Criminal, and Criminal-Other. The classes are heavily imbalanced, with Other far outnumbering the rest.

In [4]:
documents_df.category.value_counts()

Other             1717
Criminal           239
Criminal-Other      36
Name: category, dtype: int64
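
To put numbers on the imbalance, the share of each category can be computed directly from the counts. This is a quick check in plain Python 3, with the counts copied from the output above:

```python
# Category counts copied from the value_counts() output above.
counts = {"Other": 1717, "Criminal": 239, "Criminal-Other": 36}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print the categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print("%-15s %5.1f%%" % (name, 100 * share))
```

With Other this dominant, plain accuracy says little; precision and recall on the Criminal class are the more informative numbers.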

## Content

The content column appears to contain values that are not strings.

In [5]:
print documents_df.content.apply(lambda x: type(x) ).value_counts()

<type 'str'>      1706
<type 'float'>     286
Name: content, dtype: int64


Looking further into it, we can see that those values are NaN.

In [6]:
conditions = documents_df.content.apply(lambda x: type(x) == float )
documents_df[conditions].head()

Unnamed: 0,id,title,content,category
68,68,"La obra de la Línea 1 del Metro de Panamá, el ...",,Other
91,91,TRIBUNITO DICE,,Other
114,114,CANTERAS VISTAS DE… ¡REOJO!,,Other
115,115,"La guerrilla colombiana de las FARC, incorporó...",,Other
117,117,HUMORADAS SABATINAS 01/03/2014,,Other


In [7]:
print "null objects: %i" %documents_df.content.isnull().sum()

null objects: 286
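
The 286 float entries match the 286 nulls because a missing cell is stored as NaN, which is a float. A small standard-library sketch (Python 3, with made-up sample values) shows why the type check works and why an equality test against NaN would not:

```python
import math

# Hypothetical column values: two strings and one missing cell.
values = ["TEGUCIGALPA.- ...", float("nan"), "SAN PEDRO SULA.- ..."]

# NaN is a float, so a type check separates it from the strings.
missing_by_type = [type(v) is float for v in values]

# NaN never compares equal to anything, itself included.
missing_by_eq = [v == float("nan") for v in values]

# math.isnan is the reliable test once we know the value is a float.
missing_by_isnan = [isinstance(v, float) and math.isnan(v) for v in values]

print(missing_by_type)
print(missing_by_eq)
print(missing_by_isnan)
```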


Do any of the rows with NaN content belong to the Criminal category?

In [8]:
documents_df[(documents_df.content.isnull() ) & (documents_df["category"]=="Criminal")]

Unnamed: 0,id,title,content,category
198,198,Condenan a 40 años de cárcel a 8 pandilleros p...,,Criminal
237,237,VUELVEN FEMICIDIOS,,Criminal
265,269,"El director del Infah, aseguró que se le está ...",,Criminal
273,277,"Autoridades hondureñas, investigarán la muert...",,Criminal
295,299,"El nuevo gobierno, que ha cumplido un mes en e...",,Criminal
490,494,Una vecina oyó “gritos terribles” la noche en ...,,Criminal
612,617,"Hasta dos mil 500 millones de lempiras, se pag...",,Criminal
805,811,Conforman equipo especial para investigar robo...,,Criminal
837,845,"Autoridades hondureñas, han iniciado el proces...",,Criminal
901,909,IMPARABLE MUERTE DE BUSEROS,,Criminal


### Drop all NaN
Some of the rows with missing content are labelled Criminal, but without any text they add no information that would help with the final objective, so they are dropped. The Criminal-Other rows are removed as well.


The original data is kept as documents_df.
The processed data is documents.


In [10]:
documents = documents_df[(documents_df.content.notnull() ) & (documents_df["category"]!="Criminal-Other")]
documents[(documents.content.isnull() )]

Unnamed: 0,id,title,content,category


## Other things to consider...

- Most common words
- Text length
- Number of words
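
The last two items can be measured per document with very little code. This is a minimal sketch in plain Python 3; it uses two titles copied from the outputs above as stand-ins for the content column, and the same `\w+` token pattern the notebook uses further down:

```python
import re

# Two titles copied from the outputs above, standing in for documents.content.
texts = [
    "Suspendidas las citas en Hospital Escuela",
    "TRIBUNITO DICE",
]

WORD = re.compile(r"\w+")  # same pattern as RegexpTokenizer(r'\w+')

# (characters, words) per document
stats = [(len(t), len(WORD.findall(t))) for t in texts]

for text, (n_chars, n_words) in zip(texts, stats):
    print("%3d chars, %2d words: %s" % (n_chars, n_words, text))
```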


In [11]:
import nltk
from nltk import wordpunct_tokenize, RegexpTokenizer



In [13]:
stop_words = pd.read_json("../files/stopwords.json")[0].values.tolist()
# len(stop_words)
# type(stop_words)
# stop_words.tolist()
stop_words[10:30]

[u'_',
 u'a',
 u'actualmente',
 u'acuerdo',
 u'adelante',
 u'ademas',
 u'adem\xe1s',
 u'adrede',
 u'afirm\xf3',
 u'agreg\xf3',
 u'ahi',
 u'ahora',
 u'ah\xed',
 u'al',
 u'algo',
 u'alguna',
 u'algunas',
 u'alguno',
 u'algunos',
 u'alg\xfan']

In [14]:
# Most common words

def join_content(df):
    # Concatenate every document body into one string.
    return " ".join(df["content"])

full_text = join_content(documents)
criminal_text = join_content(documents[documents["category"] == "Criminal"])
other_text = join_content(documents[documents["category"] == "Other"])

regTokenizer = RegexpTokenizer(r'\w+')
stop_word_set = set(stop_words)  # set lookup is faster than scanning the list

def most_common(text, top=30):
    # Print the `top` most frequent tokens after lowercasing and removing stop words.
    tokens = regTokenizer.tokenize(unicode(text, "utf-8").lower())
    tokens = [token for token in tokens if token not in stop_word_set]

    fdist = nltk.FreqDist(tokens)
    for w in fdist.most_common(top):
        print w


most_common(criminal_text, 30)

(u'polic\xeda', 198)
(u'nacional', 175)
(u'autoridades', 156)
(u'colonia', 152)
(u'personas', 138)
(u'san', 111)
(u'a\xf1os', 108)
(u'armas', 108)
(u'honduras', 107)
(u'vida', 106)
(u'seguridad', 105)
(u'zona', 88)
(u'centro', 85)
(u'pedro', 79)
(u'jos\xe9', 78)
(u'sula', 78)
(u'policial', 77)
(u'p\xfablico', 72)
(u'agentes', 71)
(u'crimen', 71)
(u'investigaci\xf3n', 70)
(u'casa', 67)
(u'pa\xeds', 66)
(u'orden', 65)
(u'policiales', 65)
(u'tegucigalpa', 63)
(u'cort\xe9s', 63)
(u'hern\xe1ndez', 63)
(u'drogas', 63)
(u'criminal', 63)


In [15]:
documents.content.apply(lambda x: len(x) ).min()

2

In [16]:
# Final State of documents
documents.category.value_counts()

Other       1449
Criminal     222
Name: category, dtype: int64

# Extract Features

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
import scipy.sparse as sp

In [18]:
# Work on an explicit copy so the assignment does not target a slice of documents_df.
documents = documents.copy()
documents.loc[:,"content"] = documents.content.apply(lambda x: unicode(x, "utf-8").lower()  )


In [19]:

# NOTE: max_df=1 is an integer, so it keeps only terms found in at most one document;
# max_df=1.0 would mean "no upper limit".
# NOTE: stop_words only applies when analyzer='word', so it is ignored by vectorizer1.
vectorizer1 = CountVectorizer(min_df=1, max_df=1, #max_features=700,  
                              analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=False)
# vectorizer1 = TfidfVectorizer(min_df=1,stop_words=stop_words ) #max_features=500
vectorizer2 = CountVectorizer(min_df=1, max_df=1,  analyzer='word', stop_words=stop_words, binary=True)
# vectorizer2 = TfidfVectorizer(min_df=1,stop_words=stop_words ) 


# Fit the vectorizers on the positive (Criminal) samples only
vectorizer1.fit(documents[documents["category"]=="Criminal"].content.values)
vectorizer2.fit(documents[documents["category"]=="Criminal"].title.values)


# X1 (content features) is computed but not used below.
X1 = vectorizer1.transform(documents.content.values)
# fit_transform refits vectorizer2 on all titles, replacing the Criminal-only fit above.
X = vectorizer2.fit_transform(documents.title.values)

# X = sp.hstack((X1, X2), format='csr')


# Binary target: 1 = Criminal, 0 = Other
y = documents.category.apply(lambda x:  int(x=="Criminal") ).values
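
Fitting a vectorizer on the Criminal documents only means the vocabulary holds just the terms seen in positive samples, and every document is then encoded against that vocabulary. The idea can be sketched without scikit-learn. The toy documents and the `tokenize`, `build_vocab` and `to_counts` helpers below are hypothetical and only illustrate the mechanism (Python 3):

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")

def tokenize(text):
    return TOKEN.findall(text.lower())

def build_vocab(texts):
    """Map each distinct token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def to_counts(text, vocab):
    """Bag-of-words row: counts for in-vocabulary tokens only."""
    row = [0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:
            row[vocab[tok]] = n
    return row

# Toy corpus (made up): label 1 = Criminal, 0 = Other.
docs = ["policia decomisa armas", "lluvia afecta granos", "armas y drogas"]
labels = [1, 0, 1]

# Vocabulary from the positive documents only.
vocab = build_vocab(d for d, y in zip(docs, labels) if y == 1)

# Every document, positive or not, is encoded against that vocabulary.
X = [to_counts(d, vocab) for d in docs]
print(sorted(vocab))
print(X)
```

Documents that share no terms with the positive vocabulary become all-zero rows, as the second row shows.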

In [20]:
splits = 5
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)


avg_precision = 0.0
avg_recall = 0.0
avg_fscore = 0.0

for train_index, test_index in skf.split(X, y):    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
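    # Class prior taken from the full label vector (train and test folds together).
    prior = y.sum() *1.0 / len(y)
    
    # class_prior is given explicitly, so fit_prior has no effect here.
    clf = MultinomialNB(alpha=.001, class_prior=[1-prior,prior ], fit_prior=True)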
    clf.fit(X_train, y_train)
    
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
#     print(classification_report(y_test, y_pred )) #target_names=target_names
    precision_train, recall_train, fscore_train, support_train = precision_recall_fscore_support(y_train,y_pred_train)
    precision, recall, fscore, support = precision_recall_fscore_support(y_test,y_pred_test)
    
    print "---------------------------"
    print "- Train Score: %0.4f" % clf.score(X_train, y_train)
    print "- Test Score: %0.4f" % clf.score(X_test, y_test)
    print "%i of %i were criminal" % (y_test.sum(), len(y_test))
    print "precision: %0.4f, %0.4f" % (precision_train[1], precision[1])
    print "recall: %0.4f, %0.4f" % (recall_train[1], recall[1])
    print "f score: %0.4f, %0.4f" % (fscore_train[1], fscore[1])
    print "support: %0.4f, %0.4f" % (support_train[1], support[1])
    
    avg_precision += precision[1]
    avg_recall += recall[1]
    avg_fscore += fscore[1]
    
#     break
    
print 
print "avg precision: %0.4f" %(avg_precision/splits)
print "avg recall: %0.4f" %(avg_recall/splits)
print "avg f score: %0.4f" %(avg_fscore/splits)
    
# clf.fit(X[[0,1,2,3,5,6,7,8,9,10]], y[[0,1,2,3,5,6,7,8,9,10]])

---------------------------
- Train Score: 0.9760
- Test Score: 0.2418
45 of 335 were criminal
precision: 1.0000, 0.1307
recall: 0.8192, 0.8222
f score: 0.9006, 0.2256
support: 177.0000, 45.0000
---------------------------
- Train Score: 0.9768
- Test Score: 0.2567
45 of 335 were criminal
precision: 1.0000, 0.1304
recall: 0.8249, 0.8000
f score: 0.9040, 0.2243
support: 177.0000, 45.0000
---------------------------
- Train Score: 0.9768
- Test Score: 0.2575
44 of 334 were criminal
precision: 1.0000, 0.1277
recall: 0.8258, 0.7955
f score: 0.9046, 0.2201
support: 178.0000, 44.0000
---------------------------
- Train Score: 0.9738
- Test Score: 0.2425
44 of 334 were criminal
precision: 1.0000, 0.1359
recall: 0.8034, 0.8864
f score: 0.8910, 0.2356
support: 178.0000, 44.0000
---------------------------
- Train Score: 0.9768
- Test Score: 0.2733
44 of 333 were criminal
precision: 1.0000, 0.1306
recall: 0.8258, 0.7955
f score: 0.9046, 0.2244
support: 178.0000, 44.0000

avg precision: 0.1311
av
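
The per-class figures printed for each fold all derive from four confusion counts. The sketch below (Python 3) recomputes them for the first fold. The notebook does not print these counts; they were reconstructed from the reported values (45 of 335 test documents are Criminal, recall 0.8222, precision 0.1307), so treat them as an inference:

```python
# Confusion counts for the Criminal class in fold 1 (reconstructed, see above).
tp, fp, fn, tn = 37, 246, 8, 44

precision = tp / (tp + fp)   # share of predicted Criminal that is correct
recall = tp / (tp + fn)      # share of actual Criminal that is found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("precision %.4f  recall %.4f  f1 %.4f  accuracy %.4f"
      % (precision, recall, f1, accuracy))
```

With these counts the classifier labels 283 of the 335 test documents as Criminal, which is why recall is high while precision stays close to the 13% base rate.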

# TODO

- Add a cross-validated grid search or a Bayesian search
- Consider stacking (one classifier for recall, one for precision ;) )
- More data and more data exploration
- Try other classifiers: logistic regression, SVC, XGBoost

# Change of plans

Create two classifiers: one optimized for recall and one optimized for precision.



In [78]:
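def precision_classifier(df, splits=5):
    """Train the precision-oriented Naive Bayes on word features of the content.

    Returns the classifier, the vectorized training data, the training labels
    and the vectorized test data of the first fold.
    """
    # 0.3
    vectorizer1 = CountVectorizer(min_df=1, max_df=0.3, #max_features=700,  
#                               analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=True)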
    
    X = df 
    y = df.category.apply(lambda x:  int(x=="Criminal") ).values
    
    
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)
    avg_score = [0.0,0.0]
    avg_precision = [0.0,0.0]
    avg_recall = [0.0,0.0]
    avg_fscore = [0.0,0.0]
    
    
    for train_index, test_index in skf.split(X, y):   
        
            
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
                
        
        vectorizer1.fit(X_train.content[X_train["category"]=="Criminal"].values)
        X_train = vectorizer1.transform(X_train.content.values)
        X_test = vectorizer1.transform(X_test.content.values)
        

        prior = y_train.sum() *1.0 / len(y_train)

        clf = MultinomialNB(alpha=1, class_prior=[1-prior,prior ], fit_prior=True)
        clf.fit(X_train, y_train)

        y_pred_train = clf.predict(X_train)
        y_pred_test = clf.predict(X_test)

        #train
        prec, rec, f1, supp = precision_recall_fscore_support(y_train,y_pred_train)
        avg_score[0] += clf.score(X_train, y_train)
        avg_precision[0] += prec[1]
        avg_recall[0] += rec[1]
        avg_fscore[0] += f1[1]
        #test
        prec, rec, f1, supp = precision_recall_fscore_support(y_test,y_pred_test)
        avg_score[1] += clf.score(X_test, y_test)
        avg_precision[1] += prec[1]
        avg_recall[1] += rec[1]
        avg_fscore[1] += f1[1]
        
        # Only the first fold is evaluated, so the "averages" below are that fold's scores.
        splits =1.0
        break

    print "avg score    : %0.4f, %0.4f" %(avg_score[0]/splits, avg_score[1]/splits )
    print "avg precision: %0.4f, %0.4f" %(avg_precision[0]/splits, avg_precision[1]/splits  )
    print "avg recall   : %0.4f, %0.4f" %(avg_recall[0]/splits, avg_recall[1]/splits )
    print "avg f score  : %0.4f, %0.4f" %(avg_fscore[0]/splits, avg_fscore[1]/splits  )
    
    return clf, X_train, y_train, X_test
    
    


In [79]:
clf_prec, X_train_prec, y_train_prec, X_test_prec = precision_classifier(documents)

avg score    : 0.9805, 0.9493
avg precision: 0.8953, 0.7800
avg recall   : 0.9661, 0.8667
avg f score  : 0.9293, 0.8211
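
`MultinomialNB` combines a class prior with per-class term likelihoods, and `alpha` is the additive smoothing constant. The from-scratch sketch below (Python 3) is not scikit-learn's implementation; the toy tokens and the `train_nb` and `predict_nb` helpers are hypothetical and only show how `alpha` and the prior enter the score:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed per-class token log-likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = {c: Counter() for c in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    log_lik = {}
    for c, ctr in counts.items():
        # Additive smoothing: every vocabulary term gets alpha pseudo-counts.
        total = sum(ctr.values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((ctr[t] + alpha) / total) for t in vocab}
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in counts}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Pick the class with the highest log prior + summed log-likelihoods."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(scores, key=scores.get)

# Toy tokenized documents (made up): label 1 = Criminal, 0 = Other.
docs = [["policia", "armas"], ["policia", "crimen"], ["lluvia", "granos"],
        ["hospital", "citas"], ["granos", "hambruna"]]
labels = [1, 1, 0, 0, 0]

model = train_nb(docs, labels, alpha=1.0)
print(predict_nb(model, ["policia", "drogas"]))
print(predict_nb(model, ["granos"]))
```

A larger `alpha` pulls the likelihoods of rare and frequent terms closer together, which makes the prior weigh more in the final decision.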


In [46]:
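def recall_classifier(df, splits=5):
    """Train the recall-oriented Naive Bayes on content 5-grams plus title words.

    Returns the classifier and the vectorized data and labels of the first fold.
    """
    #0,9  0.2410
    # stop_words only applies when analyzer='word', so it is ignored by vectorizer1.
    vectorizer1 = CountVectorizer(min_df=1, max_df=0.9, #max_features=700,  
                              analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=True)
    vectorizer2 = CountVectorizer(min_df=1, max_df=0.9,  analyzer='word', stop_words=stop_words, binary=True)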
#     vectorizer2 = TfidfVectorizer(min_df=1,stop_words=stop_words ) 
    

    

    X = df
    y = df.category.apply(lambda x:  int(x=="Criminal") ).values
    
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)
    avg_score = [0.0,0.0]
    avg_precision = [0.0,0.0]
    avg_recall = [0.0,0.0]
    avg_fscore = [0.0,0.0]
    
    for train_index, test_index in skf.split(X, y):    
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        vectorizer1.fit(X_train[X_train["category"]=="Criminal"].content.values)
        vectorizer2.fit(X_train.title.values)
        
        X_train1 = vectorizer1.transform(X_train.content.values)
        X_train2 = vectorizer2.transform(X_train.title.values)
        X_train = sp.hstack((X_train1, X_train2), format='csr')
        
        X_test1 = vectorizer1.transform(X_test.content.values)
        X_test2 = vectorizer2.transform(X_test.title.values)
        X_test = sp.hstack((X_test1, X_test2), format='csr')

        # Prior computed from all labels, not only this fold's training labels
        prior = y.sum() *1.0 / len(y)
        #0.001
        clf = MultinomialNB(alpha=0.5, class_prior=[1-prior,prior ], fit_prior=True)
        clf.fit(X_train, y_train)

        y_pred_train = clf.predict(X_train)
        y_pred_test = clf.predict(X_test)

        #train
        prec, rec, f1, supp = precision_recall_fscore_support(y_train,y_pred_train)
        avg_score[0] += clf.score(X_train, y_train)
        avg_precision[0] += prec[1]
        avg_recall[0] += rec[1]
        avg_fscore[0] += f1[1]
        #test
        prec, rec, f1, supp = precision_recall_fscore_support(y_test,y_pred_test)
        avg_score[1] += clf.score(X_test, y_test)
        avg_precision[1] += prec[1]
        avg_recall[1] += rec[1]
        avg_fscore[1] += f1[1]
        
        # Only the first fold is evaluated, so the "averages" below are that fold's scores.
        splits =1.0
        break

    print "avg score    : %0.4f, %0.4f" %(avg_score[0]/splits, avg_score[1]/splits )
    print "avg precision: %0.4f, %0.4f" %(avg_precision[0]/splits, avg_precision[1]/splits  )
    print "avg recall   : %0.4f, %0.4f" %(avg_recall[0]/splits, avg_recall[1]/splits )
    print "avg f score  : %0.4f, %0.4f" %(avg_fscore[0]/splits, avg_fscore[1]/splits  )
    
    return clf, X_train, y_train, X_test, y_test
    
    
clf_recall, X_train_recall, y_train_recall, X_test_recall, y_test = recall_classifier(documents)

avg score    : 0.9746, 0.9522
avg precision: 0.8522, 0.7843
avg recall   : 0.9774, 0.8889
avg f score  : 0.9105, 0.8333


In [80]:
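# Cascade: a document stays Criminal only if the recall classifier flags it
# and the precision classifier confirms it. Both test sets are the first fold
# of the same StratifiedKFold (same seed), so the rows line up.
y_pred_test = clf_recall.predict(X_test_recall)
print sum(y_pred_test)
print len(y_pred_test)
print
preds = []
for index, val in enumerate(y_pred_test):
    if val == 0:
        preds.append(val)
    else:
        z = clf_prec.predict(X_test_prec[index])[0]
        preds.append(z)


precision_recall_fscore_support(y_test,preds)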

51
335



(array([ 0.97909408,  0.8125    ]),
 array([ 0.96896552,  0.86666667]),
 array([ 0.97400347,  0.83870968]),
 array([290,  45]))
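
The loop in this cell is a two-stage cascade: a document is labelled Criminal only when the recall-oriented classifier flags it and the precision-oriented classifier confirms it. For 0/1 predictions that is an element-wise AND, as this small sketch with made-up prediction vectors shows:

```python
# Hypothetical 0/1 predictions from the two classifiers on the same test rows.
recall_preds = [1, 1, 0, 1, 0, 1]
prec_preds = [1, 0, 0, 1, 1, 0]

# Cascade as in the notebook: consult the second model only on positives.
cascade = []
for r, p in zip(recall_preds, prec_preds):
    cascade.append(0 if r == 0 else p)

# Equivalent formulation: both models must predict the positive class.
both = [r & p for r, p in zip(recall_preds, prec_preds)]

print(cascade)
print(cascade == both)
```

The cascade can only remove positives from the recall classifier's output, so it can raise precision but never recall.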

## Voting

In [88]:
proba_test_recall = clf_recall.predict_log_proba(X_test_recall)
proba_test_prec   = clf_prec.predict_log_proba(X_test_prec)

final_preds = []
X_test_stacked = []
w_rec = .45
for rec, prec in zip(proba_test_recall, proba_test_prec):
    class0 = (rec[0]*w_rec) + (prec[0]*(1-w_rec))
    class1 = (rec[1]*w_rec) + (prec[1]*(1-w_rec))
    final_preds.append(int(class1 > class0))
    X_test_stacked.append([rec[0],rec[1],prec[0],prec[1]])
    
    
precision_recall_fscore_support(y_test,final_preds)
    


(array([ 0.98245614,  0.8       ]),
 array([ 0.96551724,  0.88888889]),
 array([ 0.97391304,  0.84210526]),
 array([290,  45]))
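
The voting rule weights the two classifiers' log-probabilities with `w_rec` for the recall model and `1 - w_rec` for the precision model, then picks the class with the larger score. Below is a standalone sketch of the same rule in Python 3; the probabilities are made up and `vote` is a hypothetical helper:

```python
import math

def vote(log_p_rec, log_p_prec, w_rec=0.45):
    """Return 1 if the weighted log-probability of class 1 beats class 0."""
    class0 = w_rec * log_p_rec[0] + (1 - w_rec) * log_p_prec[0]
    class1 = w_rec * log_p_rec[1] + (1 - w_rec) * log_p_prec[1]
    return int(class1 > class0)

# Made-up [P(Other), P(Criminal)] pairs for two documents.
rec_probs = [[0.45, 0.55], [0.10, 0.90]]
prec_probs = [[0.80, 0.20], [0.55, 0.45]]

preds = [
    vote([math.log(p) for p in rec], [math.log(p) for p in prec])
    for rec, prec in zip(rec_probs, prec_probs)
]
print(preds)
```

Because the weights are applied to log-probabilities, this is a weighted geometric mean of the two models' probabilities rather than an arithmetic average. In the first document the confident "Other" from the precision model outweighs the weak "Criminal" from the recall model.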

## Stacking

In [82]:
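# Meta-features: log-probabilities of both base classifiers on their training fold.
# These are in-sample predictions, so they tend to be more confident than the test-time inputs.
proba_train_recall = clf_recall.predict_log_proba(X_train_recall)
proba_train_prec   = clf_prec.predict_log_proba(X_train_prec)


X_train_stacked = []
for rec, prec in zip(proba_train_recall, proba_train_prec):
    X_train_stacked.append([rec[0],rec[1],prec[0],prec[1]])
    
    
from sklearn.ensemble import RandomForestClassifier

# No random_state is set, so the scores vary slightly between runs.
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20)

rf_clf = rf_clf.fit(X_train_stacked, y_train_recall)

# X_test_stacked was built in the Voting cell above.
stacked_preds = rf_clf.predict(X_test_stacked)
precision_recall_fscore_support(y_test,stacked_preds)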

(array([ 0.97269625,  0.88095238]),
 array([ 0.98275862,  0.82222222]),
 array([ 0.97770154,  0.85057471]),
 array([290,  45]))

In [65]:
print sum(stacked_preds)
print len(stacked_preds)

41
335


## Logistic Stacking

In [66]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(C=1, max_iter=600)
log_clf = log_clf.fit(X_train_stacked, y_train_recall)

stacked_preds = log_clf.predict(X_test_stacked)
precision_recall_fscore_support(y_test,stacked_preds)

(array([ 0.95959596,  0.86842105]),
 array([ 0.98275862,  0.73333333]),
 array([ 0.97103918,  0.79518072]),
 array([290,  45]))