# Intro

This notebook tests the classifier.
It involves a little data exploration, feature extraction with a bag-of-words model, and training a classifier.

Later the code will be moved into Python files.

In [113]:
import pandas as pd
import numpy as np

In [452]:
documents_df = pd.read_csv("../files/documents.csv")

In [454]:
print(documents_df.shape)
documents_df.head()

(1992, 4)


Unnamed: 0,id,title,content,category
0,0,Auditoría revela irregularidades en el Parlacen,GUATEMALA.- Una fiscalización de la Contralorí...,Other
1,1,Suspendidas las citas en Hospital Escuela,TEGUCIGALPA.- Una misteriosa obstrucción del s...,Other
2,2,Mariscos contaminados alarman a los “porteños”,"PUERTO CORTES, Cortés.- Alarmados se encuentra...",Other
3,3,Citan a 11 personas por vender pólvora,SAN PEDRO SULA.- Hasta el momento ocho bodegas...,Criminal
4,4,Con compra de granos se paliaría hambruna en e...,TEGUCIGALPA.- No llueve hace cuatro meses y la...,Other


# 1. Data Exploration

## Categories
As seen below, there are three categories: Other, Criminal, and Criminal-Other. The classes are heavily unbalanced: Other accounts for 1715 of the 1992 documents.

In [455]:
documents_df.category.value_counts()

Other             1715
Criminal           243
Criminal-Other      34
Name: category, dtype: int64
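
To put the imbalance in numbers, here is a minimal sketch that recomputes each category's share from the counts printed above. The `counts` dictionary is copied from that output; the baseline figure is simply the accuracy of always predicting the largest class, which is worth keeping in mind when reading the accuracy scores later in the notebook.

```python
# Category counts copied from the value_counts() output above.
counts = {"Other": 1715, "Criminal": 243, "Criminal-Other": 34}

total = sum(counts.values())
shares = {name: n / float(total) for name, n in counts.items()}

# Print the categories from most to least frequent.
for name in sorted(shares, key=shares.get, reverse=True):
    print("%-15s %5d  %.1f%%" % (name, counts[name], 100 * shares[name]))

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(shares.values())
print("majority-class baseline accuracy: %.3f" % majority_baseline)
```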

## Content

It seems there are strange objects in the content column: some values are not strings.

In [456]:
print(documents_df.content.apply(lambda x: type(x)).value_counts())

<type 'str'>      1702
<type 'float'>     290
Name: content, dtype: int64


Looking further into it, we can see that they are NaN values.

In [457]:
conditions = documents_df.content.apply(lambda x: type(x) == float )
documents_df[conditions].head()

Unnamed: 0,id,title,content,category
68,68,"La obra de la Línea 1 del Metro de Panamá, el ...",,Other
91,91,TRIBUNITO DICE,,Other
114,114,CANTERAS VISTAS DE… ¡REOJO!,,Other
115,115,"La guerrilla colombiana de las FARC, incorporó...",,Other
117,117,HUMORADAS SABATINAS 01/03/2014,,Other


In [458]:
print("null objects: %i" % documents_df.content.isnull().sum())

null objects: 290


Do any of the documents with NaN content belong to the Criminal category?

In [459]:
documents_df[(documents_df.content.isnull() ) & (documents_df["category"]=="Criminal")]

Unnamed: 0,id,title,content,category
198,198,Condenan a 40 años de cárcel a 8 pandilleros p...,,Criminal
237,237,VUELVEN FEMICIDIOS,,Criminal
297,297,"La Asociación de Jueces y Magistrados, pide tr...",,Criminal
349,349,“12 Years a Slave” se proclama mejor película ...,,Criminal
430,430,"Autoridades de Copeco, reportaron hoy 90 incen...",,Criminal
437,437,"Honduras, a las puertas de un acuerdo con el F...",,Criminal
465,465,Centro de Atención a mujeres víctimas de viole...,,Criminal
475,475,"El Gobierno de Kiev, arropado por Occidente, d...",,Criminal
490,490,El aire acondicionado hace tiritar la economía...,,Criminal
494,494,Una vecina oyó “gritos terribles” la noche en ...,,Criminal


In [35]:
print(documents_df.loc[198].title)

Condenan a 40 años de cárcel a 8 pandilleros por crimen de joven deportista.


### Drop all NaN
Even though some of these rows belong to the Criminal category, they carry no content, so they will not help with the final objective. Rows labeled Criminal-Other are dropped as well.


The original data is kept as documents_df.
The processed data is documents.


In [460]:
documents = documents_df[(documents_df.content.notnull() ) & (documents_df["category"]!="Criminal-Other")]
documents[(documents.content.isnull() )]

Unnamed: 0,id,title,content,category


## Other things to consider...

- Most common words
- Text size
- Number of words
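
The three checks listed above can be sketched in a few lines. The snippet below uses a tiny made-up corpus (`toy_docs` and `toy_stop_words` are hypothetical, not the notebook's data) and a simple `\w+` tokenizer. It assumes Python 3, where strings are Unicode, so accented words stay whole.

```python
import re
from collections import Counter

# Hypothetical stand-in for (category, content) pairs; not the real dataset.
toy_docs = [
    ("Criminal", "La policía capturó a dos sospechosos en San Pedro Sula"),
    ("Other", "El gobierno anunció la compra de granos"),
    ("Criminal", "La policía investiga el crimen"),
]
toy_stop_words = {"la", "el", "a", "en", "de", "dos"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def describe(docs, category, top=3):
    """Return character sizes, word counts and top words for one category."""
    texts = [content for cat, content in docs if cat == category]
    sizes = [len(t) for t in texts]
    word_counts = [len(tokenize(t)) for t in texts]
    freq = Counter(tok for t in texts for tok in tokenize(t)
                   if tok not in toy_stop_words)
    return sizes, word_counts, freq.most_common(top)

sizes, word_counts, common = describe(toy_docs, "Criminal")
print("sizes:", sizes)
print("word counts:", word_counts)
print("most common:", common)
```

Comparing these summaries between Criminal and Other documents could hint at whether text length or vocabulary is a useful signal.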


In [461]:
import nltk
from nltk import wordpunct_tokenize, RegexpTokenizer



In [462]:
stop_words = pd.read_json("../files/stopwords.json")[0].values.tolist()
# len(stop_words)
# type(stop_words)
# stop_words.tolist()
stop_words

[u'0',
 u'1',
 u'2',
 u'3',
 u'4',
 u'5',
 u'6',
 u'7',
 u'8',
 u'9',
 u'_',
 u'a',
 u'actualmente',
 u'acuerdo',
 u'adelante',
 u'ademas',
 u'adem\xe1s',
 u'adrede',
 u'afirm\xf3',
 u'agreg\xf3',
 u'ahi',
 u'ahora',
 u'ah\xed',
 u'al',
 u'algo',
 u'alguna',
 u'algunas',
 u'alguno',
 u'algunos',
 u'alg\xfan',
 u'alli',
 u'all\xed',
 u'alrededor',
 u'ambos',
 u'ampleamos',
 u'antano',
 u'anta\xf1o',
 u'ante',
 u'anterior',
 u'antes',
 u'apenas',
 u'aproximadamente',
 u'aquel',
 u'aquella',
 u'aquellas',
 u'aquello',
 u'aquellos',
 u'aqui',
 u'aqu\xe9l',
 u'aqu\xe9lla',
 u'aqu\xe9llas',
 u'aqu\xe9llos',
 u'aqu\xed',
 u'arriba',
 u'arribaabajo',
 u'asegur\xf3',
 u'asi',
 u'as\xed',
 u'atras',
 u'aun',
 u'aunque',
 u'ayer',
 u'a\xf1adi\xf3',
 u'a\xfan',
 u'b',
 u'bajo',
 u'bastante',
 u'bien',
 u'breve',
 u'buen',
 u'buena',
 u'buenas',
 u'bueno',
 u'buenos',
 u'c',
 u'cada',
 u'casi',
 u'cerca',
 u'cierta',
 u'ciertas',
 u'cierto',
 u'ciertos',
 u'cinco',
 u'claro',
 u'coment\xf3',
 u'com

In [463]:
# Most common words

# Concatenate the article bodies: all documents, Criminal only, Other only.
full_text = " ".join(documents.content)
criminal_text = " ".join(documents[documents["category"] == "Criminal"].content)
other_text = " ".join(documents[documents["category"] == "Other"].content)

# NOTE: under Python 2 the content holds UTF-8 byte strings, so \w+ can split
# accented words mid-character (e.g. 'polic\xc3' in the output below).
# Decoding the text to unicode first would likely avoid these fragments.
regTokenizer = RegexpTokenizer(r'\w+')
stop_word_set = set(stop_words)  # set lookup is faster than scanning a list

def most_common(text, top=30):
    """Print the `top` most frequent non-stop-word tokens in `text`."""
    tokens = regTokenizer.tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_word_set]

    fdist = nltk.FreqDist(tokens)
    for w in fdist.most_common(top):
        print(w)


most_common(criminal_text, 30)



('\xe2', 759)
('m\xc3', 262)
('\xc2', 219)
('est\xc3', 214)
('\xc3', 208)
('a\xc3', 184)
('nacional', 165)
('as', 146)
('tambi\xc3', 125)
('honduras', 125)
('ni\xc3', 123)
('polic\xc3', 123)
('d\xc3', 117)
('pa\xc3', 114)
('an', 108)
('seg\xc3\xban', 106)
('personas', 104)
('presidente', 92)
('ser\xc3', 90)
('hondure\xc3', 88)
('san', 88)
('vida', 86)
('hab\xc3', 84)
('gobierno', 83)
('seguridad', 82)
('v\xc3', 72)
('c\xc3', 71)
('autoridades', 71)
('pol\xc3', 70)
('as\xc3', 60)


In [464]:
documents.content.apply(lambda x: len(x) ).min()

2

In [465]:
# Final State of documents
documents.category.value_counts()

Other       1453
Criminal     218
Name: category, dtype: int64

# Extract Features

In [338]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
import scipy.sparse as sp

In [466]:
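# NOTE: an integer max_df is an absolute document count, so max_df=1 keeps only
# terms that occur in exactly one of the fitted documents; max_df=1.0
# (a proportion) may have been intended.
# NOTE: stop_words only applies when analyzer='word', so it has no effect on vectorizer1.
vectorizer1 = CountVectorizer(min_df=1, max_df=1, #max_features=700,  
                              analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=False)
# vectorizer1 = TfidfVectorizer(min_df=1,stop_words=stop_words ) #max_features=500
vectorizer2 = CountVectorizer(min_df=1, max_df=1,  analyzer='word', stop_words=stop_words, binary=True)
# vectorizer2 = TfidfVectorizer(min_df=1,stop_words=stop_words ) 


# Fit the vectorizers on the positive (Criminal) samples only
vectorizer1.fit(documents[documents["category"]=="Criminal"].content.values)
vectorizer2.fit(documents[documents["category"]=="Criminal"].title.values)


X1 = vectorizer1.transform(documents.content.values)
# NOTE: fit_transform refits vectorizer2 on all titles, overriding the Criminal-only fit above
X2 = vectorizer2.fit_transform(documents.title.values)

# Stack the content and title features side by side
X = sp.hstack((X1, X2), format='csr')


# Binary target: 1 for Criminal, 0 for Other
y = documents.category.apply(lambda x:  int(x=="Criminal") ).values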
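
scikit-learn documents `max_df` as a proportion of documents when it is a float and as an absolute document count when it is an integer. The plain-Python sketch below mimics that rule on a made-up corpus to show how far apart `max_df=1` and `max_df=1.0` are. `filter_by_df` is a hypothetical helper written for illustration, not part of scikit-learn, and it only imitates the document-frequency cut-off.

```python
from collections import Counter

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Follows the documented convention: a float is a proportion of the
    documents, an int is an absolute number of documents.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

# Made-up corpus for illustration only.
toy_corpus = [
    "policia captura sospechoso",
    "policia investiga crimen",
    "juez condena sospechoso",
]

rare_only = filter_by_df(toy_corpus, max_df=1)    # absolute count
all_terms = filter_by_df(toy_corpus, max_df=1.0)  # proportion
print(rare_only)
print(all_terms)
```

With the integer form, words shared by several documents (here `policia` and `sospechoso`) are discarded, which may not be what the cell above intends.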

In [467]:
splits = 5
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)


avg_precision = 0.0
avg_recall = 0.0
avg_fscore = 0.0

for train_index, test_index in skf.split(X, y):    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
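    # Class prior estimated from the full label vector, not just the training fold
    prior = y.sum() *1.0 / len(y)
    
    # NOTE: when class_prior is given it is used as-is, so fit_prior has no effect
    clf = MultinomialNB(alpha=3.0, class_prior=[1-prior,prior ], fit_prior=True)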
    clf.fit(X_train, y_train)
    
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
#     print(classification_report(y_test, y_pred )) #target_names=target_names
    precision_train, recall_train, fscore_train, support_train = precision_recall_fscore_support(y_train,y_pred_train)
    precision, recall, fscore, support = precision_recall_fscore_support(y_test,y_pred_test)
    
    # Each metric line reports (train, test) for the Criminal class
    print("---------------------------")
    print("- Train Score: %0.4f" % clf.score(X_train, y_train))
    print("- Test Score: %0.4f" % clf.score(X_test, y_test))
    print("%i of %i were criminal" % (y_test.sum(), len(y_test)))
    print("precision: %0.4f, %0.4f" % (precision_train[1], precision[1]))
    print("recall: %0.4f, %0.4f" % (recall_train[1], recall[1]))
    print("f score: %0.4f, %0.4f" % (fscore_train[1], fscore[1]))
    print("support: %0.4f, %0.4f" % (support_train[1], support[1]))
    
    avg_precision += precision[1]
    avg_recall += recall[1]
    avg_fscore += fscore[1]
    
#     break
    
print("")
print("avg precision: %0.4f" % (avg_precision/splits))
print("avg recall: %0.4f" % (avg_recall/splits))
print("avg f score: %0.4f" % (avg_fscore/splits))
    
# clf.fit(X[[0,1,2,3,5,6,7,8,9,10]], y[[0,1,2,3,5,6,7,8,9,10]])

---------------------------
- Train Score: 0.9693
- Test Score: 0.8746
44 of 335 were criminal
precision: 0.9926, 0.6667
recall: 0.7701, 0.0909
f score: 0.8673, 0.1600
support: 174.0000, 44.0000
---------------------------
- Train Score: 0.9686
- Test Score: 0.8687
44 of 335 were criminal
precision: 1.0000, 0.5000
recall: 0.7586, 0.0909
f score: 0.8627, 0.1538
support: 174.0000, 44.0000
---------------------------
- Train Score: 0.9783
- Test Score: 0.8955
44 of 335 were criminal
precision: 1.0000, 0.8462
recall: 0.8333, 0.2500
f score: 0.9091, 0.3860
support: 174.0000, 44.0000
---------------------------
- Train Score: 0.9731
- Test Score: 0.8799
43 of 333 were criminal
precision: 1.0000, 0.6364
recall: 0.7943, 0.1628
f score: 0.8854, 0.2593
support: 175.0000, 43.0000
---------------------------
- Train Score: 0.9686
- Test Score: 0.8649
43 of 333 were criminal
precision: 0.9854, 0.2500
recall: 0.7714, 0.0233
f score: 0.8654, 0.0426
support: 175.0000, 43.0000

avg precision: 0.5798
av

In [127]:
 y.sum()

88

# TODO

- Add a cross-validated grid search or a Bayesian hyperparameter search
- Consider stacking (one model for recall, one for precision ;) )
- More data, more data exploration
- Try other classifiers: logistic regression, SVC, XGBoost

# Change of plans

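Build two classifiers, one optimized for recall and one optimized for precision, and combine them: the recall model screens every document and the precision model confirms the documents it flags.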
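
One way to combine the two models is a two-stage cascade: keep the recall model's negatives and let the precision model decide on everything the recall model flags. The sketch below works on plain lists of 0/1 predictions; `cascade` and the example lists are hypothetical and only illustrate the combination rule.

```python
def cascade(recall_preds, precision_preds):
    """Combine two aligned lists of 0/1 predictions.

    A document is labeled 1 only if the recall-oriented model flags it
    and the precision-oriented model confirms it.
    """
    if len(recall_preds) != len(precision_preds):
        raise ValueError("prediction lists must be aligned")
    return [int(r == 1 and p == 1) for r, p in zip(recall_preds, precision_preds)]

# Hypothetical predictions for six documents.
recall_preds = [1, 1, 0, 1, 1, 0]
precision_preds = [1, 0, 0, 0, 1, 1]

combined = cascade(recall_preds, precision_preds)
print(combined)
```

Under this rule the combined output is the logical AND of the two models, so it can never flag more documents than the precision model does on its own.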



In [493]:
def precision_classifier(df, splits=5):
    
    vectorizer1 = CountVectorizer(min_df=1, max_df=1, #max_features=700,  
                              analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=True)
    
    vectorizer1.fit(df[df["category"]=="Criminal"].content.values)
    
    X = vectorizer1.transform(df.content.values)
    y = df.category.apply(lambda x:  int(x=="Criminal") ).values
    
    
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)
    avg_score = [0.0,0.0]
    avg_precision = [0.0,0.0]
    avg_recall = [0.0,0.0]
    avg_fscore = [0.0,0.0]
    
    
    for train_index, test_index in skf.split(X, y):    
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        prior = y.sum() *1.0 / len(y)

        clf = MultinomialNB(alpha=3.0, class_prior=[1-prior,prior ], fit_prior=True)
        clf.fit(X_train, y_train)

        y_pred_train = clf.predict(X_train)
        y_pred_test = clf.predict(X_test)

        #train
        prec, rec, f1, supp = precision_recall_fscore_support(y_train,y_pred_train)
        avg_score[0] += clf.score(X_train, y_train)
        avg_precision[0] += prec[1]
        avg_recall[0] += rec[1]
        avg_fscore[0] += f1[1]
        #test
        prec, rec, f1, supp = precision_recall_fscore_support(y_test,y_pred_test)
        avg_score[1] += clf.score(X_test, y_test)
        avg_precision[1] += prec[1]
        avg_recall[1] += rec[1]
        avg_fscore[1] += f1[1]
        
        # Only the first fold is evaluated: set the divisor to 1 and stop
        splits =1.0
        break

    # Each line reports (train, test)
    print("avg score    : %0.4f, %0.4f" % (avg_score[0]/splits, avg_score[1]/splits))
    print("avg precision: %0.4f, %0.4f" % (avg_precision[0]/splits, avg_precision[1]/splits))
    print("avg recall   : %0.4f, %0.4f" % (avg_recall[0]/splits, avg_recall[1]/splits))
    print("avg f score  : %0.4f, %0.4f" % (avg_fscore[0]/splits, avg_fscore[1]/splits))
    
    return clf, X_test
    
    
clf_prec, X_test_prec = precision_classifier(documents)

avg score    : 0.9678, 0.8836
avg precision: 1.0000, 1.0000
avg recall   : 0.7529, 0.1136
avg f score  : 0.8590, 0.2041


In [508]:
def recall_classifier(df, splits=5):
    
    vectorizer1 = CountVectorizer(min_df=1, max_df=1, #max_features=700,  
                              analyzer='char_wb', ngram_range=(5, 5),
                              stop_words=stop_words, binary=False)
#     vectorizer2 = CountVectorizer(min_df=1, max_df=1,  analyzer='word', stop_words=stop_words, binary=False)
    vectorizer2 = TfidfVectorizer(min_df=1,stop_words=stop_words ) 
    
    # Use the df argument throughout instead of the global documents frame
    vectorizer1.fit(df.content.values)
    vectorizer2.fit(df.title.values)
    
    X1 = vectorizer1.transform(df.content.values)
    X2 = vectorizer2.transform(df.title.values)
    X = sp.hstack((X1, X2), format='csr')
    y = df.category.apply(lambda x:  int(x=="Criminal") ).values
    
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=943)
    avg_score = [0.0,0.0]
    avg_precision = [0.0,0.0]
    avg_recall = [0.0,0.0]
    avg_fscore = [0.0,0.0]
    
    for train_index, test_index in skf.split(X, y):    
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        prior = y.sum() *1.0 / len(y)

        clf = MultinomialNB(alpha=0.1, class_prior=[1-prior,prior ], fit_prior=True)
        clf.fit(X_train, y_train)

        y_pred_train = clf.predict(X_train)
        y_pred_test = clf.predict(X_test)

        #train
        prec, rec, f1, supp = precision_recall_fscore_support(y_train,y_pred_train)
        avg_score[0] += clf.score(X_train, y_train)
        avg_precision[0] += prec[1]
        avg_recall[0] += rec[1]
        avg_fscore[0] += f1[1]
        #test
        prec, rec, f1, supp = precision_recall_fscore_support(y_test,y_pred_test)
        avg_score[1] += clf.score(X_test, y_test)
        avg_precision[1] += prec[1]
        avg_recall[1] += rec[1]
        avg_fscore[1] += f1[1]
        
        # Only the first fold is evaluated: set the divisor to 1 and stop
        splits =1.0
        break

    # Each line reports (train, test)
    print("avg score    : %0.4f, %0.4f" % (avg_score[0]/splits, avg_score[1]/splits))
    print("avg precision: %0.4f, %0.4f" % (avg_precision[0]/splits, avg_precision[1]/splits))
    print("avg recall   : %0.4f, %0.4f" % (avg_recall[0]/splits, avg_recall[1]/splits))
    print("avg f score  : %0.4f, %0.4f" % (avg_fscore[0]/splits, avg_fscore[1]/splits))
    
    return clf, X_test, y_test
    
    
clf_recall, X_test_recall, y_test = recall_classifier(documents)

avg score    : 0.9993, 0.1731
avg precision: 1.0000, 0.1371
avg recall   : 0.9943, 1.0000
avg f score  : 0.9971, 0.2411


In [509]:
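# Two-stage cascade: the recall classifier screens every test document and the
# precision classifier confirms the ones it flags as Criminal.
# Both functions split with the same StratifiedKFold seed and labels, so the rows
# of X_test_recall and X_test_prec should refer to the same documents.
y_pred_test = clf_recall.predict(X_test_recall)

preds = []
for index, val in enumerate(y_pred_test):
    if val == 0:
        preds.append(val)
    else:
        z = clf_prec.predict(X_test_prec[index])[0]
        preds.append(z)


# Returns per-class precision, recall, F-score and support
precision_recall_fscore_support(y_test,preds)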

(array([ 0.88181818,  1.        ]),
 array([ 1.        ,  0.11363636]),
 array([ 0.93719807,  0.20408163]),
 array([291,  44]))

array([0])
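
The four arrays returned by `precision_recall_fscore_support` above are per-class precision, recall, F-score and support. As a worked check, the sketch below recomputes the Criminal-class figures from confusion counts. The counts are inferred from the reported values rather than taken from the notebook: a recall of 0.11363636 with a support of 44 corresponds to 5 true positives, and a precision of 1.0 implies no false positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from confusion counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts for the Criminal class (an assumption, see the text above).
tp, fp, fn = 5, 0, 39

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
print("f1:        %.4f" % f1)
```

Read this way, the cascade flags only a handful of the 44 Criminal test documents, matching the test-fold figures printed for the precision classifier on its own.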