## Project Description

In this project we explore how to use basic natural languate processing and machine learning techniques to automatically grade essays (AES) in Brazilian Portuguese. In general, AES is a very difficult problem even in English. The available tools that can be used for Portuguese are even less. 

### Load Data

In [1]:
%qtconsole

In [2]:
# -*- coding: utf-8 -*-

In [3]:
import numpy as np
import pandas as pd
import re
import nltk

In [4]:
import xgboost as xgb

In [5]:
import lightgbm as lgb

### Essay Question

In [6]:
question = pd.read_excel(io="/home/alin/MyLearning/EasyNLP/Brazil_essay.xlsx", sheetname="Essay Question" )

In [8]:
question = question.Original[0]


In [10]:
print question

Conforme apresentado no vídeo da Revista Exame, o comportamento é muito importante no trabalho, nas organizações. Em sua opinião, o comportamento é responsabilidade da própria pessoa ou da empresa em que trabalha? O que as organizações podem fazer para ajudar seus colaboradores a desenvolverem comportamentos melhores e mais adequados às necessidades do trabalho?


###  Essays

In [11]:
essay_df = pd.read_excel(io="/home/alin/MyLearning/EasyNLP/Brazil_essay.xlsx", sheetname="result" )

In [12]:
essay_df.columns = essay_df.columns.str.lower()

In [13]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:..."


### Count # of paragraphs and remove html tags

In [14]:
essay_df['paragraphs'] = essay_df.apply(lambda r: len(re.findall(r'<p', r['resposta'])), axis = 1)

In [15]:
essay_df['text'] = essay_df.apply(lambda r: re.sub(r'<[^<>]*>', ' ', r['resposta']), axis = 1)



In [16]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...


### Create some basic features

In [17]:
essay_df['tokens'] = essay_df.apply(lambda r: nltk.wordpunct_tokenize(r['text']), axis = 1)
essay_df['nlp_text'] = essay_df.apply(lambda r: nltk.Text(r['tokens']), axis = 1) 


In [18]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text,tokens,nlp_text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...,"[Em, minha, opinião, que, quando, se, trata, d...","(Em, minha, opinião, que, quando, se, trata, d..."
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ...","[Olá, Patricia, ,, Concordo, com, sua, colocaç...","(Olá, Patricia, ,, Concordo, com, sua, colocaç..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...,"[O, comportamento, do, profissional, dentro, d...","(O, comportamento, do, profissional, dentro, d..."
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...,"[Em, um, primeiro, momento, o, comportamento, ...","(Em, um, primeiro, momento, o, comportamento, ..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...,"[Questões, comportamentais, estão, relacionada...","(Questões, comportamentais, estão, relacionada..."


#### Character counts

In [19]:
essay_df['chr_cnt'] = essay_df.apply(lambda r: len(r['text']), axis = 1)

#### Token counts (including stopwords)

In [20]:
essay_df['token_cnt'] = essay_df.apply(lambda r: len(r['tokens']), axis = 1)

#### Token counts (excluding stopwords)

In [21]:
stopwords = nltk.corpus.stopwords.words('portuguese')

In [22]:
essay_df['tokens_fld'] = essay_df.apply(lambda r: [w.lower() for w in r['tokens'] 
                                                   if w not in stopwords and not w.isnumeric() and len(w) > 1], axis = 1)

In [23]:
essay_df['token_cnt_fld'] = essay_df.apply(lambda r: len(r['tokens_fld']), axis = 1) 

#### Number of sentences and number of sentences longer than 250 characters

In [24]:
essay_df['sentences'] = essay_df.apply(lambda r: nltk.sent_tokenize(r['text']), axis = 1)

In [25]:
essay_df['sent_cnt'] = essay_df.apply(lambda r: len(r['sentences']), axis = 1)

In [26]:
essay_df['long_sent_cnt'] = essay_df.apply(lambda r: len([s for s in r['sentences'] if len(s) > 250]), axis = 1)

#### Average length (# of tokens) of sentence

In [27]:
essay_df['avg_sent_len'] = essay_df.apply(lambda r: float(r['token_cnt'] / r['sent_cnt']), axis = 1)

#### Number of words  that appear both in the question and the essay

In [28]:
question_token = set([w.lower() for w in nltk.wordpunct_tokenize(question) 
                      if w not in stopwords and not w.isnumeric() and len(w) > 1])

In [29]:
essay_df['question_tokens'] = essay_df.apply(lambda r: len(set(r['tokens_fld']).intersection(question_token)), axis = 1)

### Binary outcome

In [30]:
essay_df['pass'] = np.where(essay_df['average_score'] >= 2, 1,0)

### Train/test split


In [31]:
from sklearn.model_selection import train_test_split


In [32]:
train_df, test_df, _, _ = train_test_split(essay_df, essay_df['pass'], test_size = 0.3, random_state = 0)

In [33]:
y_train = train_df['pass']
z_train = train_df['average_score']
y_test = test_df['pass']
z_test = test_df['average_score']

### Features from above

In [36]:
features = ['paragraphs', 'chr_cnt', 'token_cnt', 'token_cnt_fld', 'sent_cnt', 'long_sent_cnt', 'avg_sent_len', 'question_tokens']
X_train_1 = train_df[features].values
X_test_1 = test_df[features].values

In [37]:
X_train_1lst = [X_train_1[:,i] for i in range(X_train_1.shape[1])]
X_test_1lst = [X_test_1[:,i] for  i in range(X_test_1.shape[1])]

### Features from POS Tagging

#### Create a tagger 

In [40]:
from nltk.corpus import floresta
Tagger0 = nltk.DefaultTagger('n')
def simplify_tag(t):
    if "+" in t:
        return t[t.index("+")+1:]
    else:
        return t
tsents = [[(w.lower(),simplify_tag(t)) for (w,t) in sent] for sent in floresta.tagged_sents() if sent]
Tagger1 = nltk.UnigramTagger(tsents, backoff=Tagger0)
Tagger2 = nltk.BigramTagger(tsents, backoff=Tagger1)

#### Add unigram and bigram pos tag features

In [41]:
all_unigram = set()
all_bigram = set()
train_unigrams = []
train_bigrams = []
for sents in train_df.sentences:
    tui = {}
    tbi = {}
    for sent in sents:
        tsent = Tagger2.tag(nltk.word_tokenize(sent))
        for i in range(len(tsent) - 1):
            t0 = tsent[i][1]
            t1 = tsent[i+1][1]
            all_unigram.add(t0)
            all_bigram.add((t0, t1))
            tui[t0] = tui[t0] + 1 if t0 in tui else 1
            tbi[(t0, t1)] = tbi[(t0, t1)] + 1 if (t0, t1) in tbi else 1
        t0 = tsent[len(tsent) - 1][1]
        all_unigram.add(t0)
        tui[t0] = tui[t0] + 1 if t0 in tui else 1
    train_unigrams.append(tui)
    train_bigrams.append(tbi)

Only keep those tag grams appear in at least 5 documents

In [42]:
L = 5
use_cnt = [(x, sum(x in tu for tu in train_unigrams)) for x in all_unigram]
use_unigram = set([x for (x,y) in use_cnt if y >L])
use_cnt = [(x, sum(x in tb for tb in train_bigrams)) for x in all_bigram]
use_bigram = set([x for (x,y) in use_cnt if y >L])


#### pos tag features of test data

In [43]:
test_unigrams = []
test_bigrams = []
for sents in test_df.sentences:
    tui = {}
    tbi = {}
    for sent in sents:
        tsent = Tagger2.tag(nltk.word_tokenize(sent))
        for i in range(len(tsent) - 1):
            t0 = tsent[i][1]
            t1 = tsent[i+1][1]
            if t0 in use_unigram:
                tui[t0] = tui[t0] + 1 if t0 in tui else 1
            if (t0, t1) in use_bigram:
                tbi[(t0, t1)] = tbi[(t0, t1)] + 1 if (t0, t1) in tbi else 1
        t0 = tsent[len(tsent) - 1][1]
        if t0 in use_unigram:
            tui[t0] = tui[t0] + 1 if t0 in tui else 1
    test_unigrams.append(tui)
    test_bigrams.append(tbi)

#### unigram pos tag features

In [44]:
train_unigram0 = [{u: tu[u] if u in tu else 0 for u in use_unigram} for tu in train_unigrams]

train_uni_mat = pd.DataFrame(train_unigram0).values

test_unigram0 = [{u: tu[u] if u in tu else 0 for u in use_unigram} for tu in test_unigrams]

test_uni_mat = pd.DataFrame(test_unigram0).values

In [45]:
X_train_2 = train_uni_mat
X_test_2 = test_uni_mat

#### bigram pos tag features

In [46]:
train_bigram0 = [{u: tu[u] if u in tu else 0 for u in use_bigram} for tu in train_bigrams]

train_bi_mat = pd.DataFrame(train_bigram0).values

test_bigram0 = [{u: tu[u] if u in tu else 0 for u in use_bigram} for tu in test_bigrams]

test_bi_mat = pd.DataFrame(test_bigram0).values

In [47]:
X_train_3 = train_bi_mat
X_test_3 = test_bi_mat

### Create term-document matrix

In [48]:
train_text = np.array(train_df['text'])
test_text = np.array(test_df['text'])

In [49]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Count vector  ignoring tokens with doc-frequency < 5 and excluding stopwords

In [50]:
vect = CountVectorizer(min_df=5, stop_words = stopwords).fit(train_text)

In [51]:
X_train_4 = vect.transform(train_text)
X_test_4 = vect.transform(test_text)

#### Tfidf ignoring tokens with doc-frequency < 5 and excluding stopwords

In [52]:
vect = TfidfVectorizer(min_df=5, stop_words=stopwords).fit(train_text)

In [53]:
X_train_5 = vect.transform(train_text)
X_test_5 = vect.transform(test_text)

#### Feature sets

In [54]:
X_train_123 = np.concatenate((X_train_1, X_train_2, X_train_3), axis = 1)

In [55]:
X_test_123 = np.concatenate((X_test_1, X_test_2, X_test_3), axis = 1)

#### normalize

In [56]:
from sklearn.preprocessing import StandardScaler


In [57]:
scaler = StandardScaler().fit(X_train_1)
X_train_1n = scaler.transform(X_train_1)
X_test_1n = scaler.transform(X_test_1)

In [58]:
scaler = StandardScaler().fit(X_train_123)
X_train_123n = scaler.transform(X_train_123)
X_test_123n = scaler.transform(X_test_123)

In [59]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

In [60]:
X_train_1lst = [X_train_1[:,i] for i in range(X_train_1n.shape[1])]
X_test_1lst = [X_test_1[:,i] for  i in range(X_test_1n.shape[1])]

In [61]:
X_train_14 = add_feature(X_train_4, X_train_1lst)
    

In [62]:
X_test_14 = add_feature(X_test_4, X_test_1lst)
X_train_15 = add_feature(X_train_5, X_train_1lst)
X_test_15 = add_feature(X_test_5, X_test_1lst)

### Try some basic models

#### constant model

In [64]:
from sklearn.metrics import accuracy_score, mean_squared_error, f1_score, confusion_matrix

In [65]:
z_predict_b = np.ones(y_test.shape[0])*2.5
y_predict_b = np.ones(y_test.shape[0])
print accuracy_score(y_test, y_predict_b)
print np.sqrt(mean_squared_error(y_test, y_predict_b))

0.879907621247
0.346543473107


In [66]:
def check_result(y_true, y_predict):
    print accuracy_score(y_true, y_predict)
    print f1_score(y_true, y_predict)
    print confusion_matrix(y_true, y_predict)

#### xgboost

### use feature 1

In [67]:
dtrain = xgb.DMatrix(X_train_1, y_train)

In [68]:
dtest = xgb.DMatrix(X_test_1)

In [70]:


xgb_params = {
    'eta': 0.1,
    'max_depth': 5,
    'min_child_weight': 1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'eval_metric': 'error',
    'silent': 1
}

In [71]:
cvresult = xgb.cv(xgb_params, dtrain, num_boost_round=1000, nfold=5,
            metrics='error', early_stopping_rounds=50, verbose_eval=True)

[0]	train-error:0.133582+0.0079291	test-error:0.152239+0.0321809
[1]	train-error:0.13209+0.00884382	test-error:0.135323+0.0312127
[2]	train-error:0.132836+0.00709709	test-error:0.136318+0.0302789
[3]	train-error:0.133333+0.00760219	test-error:0.136318+0.0312444
[4]	train-error:0.131094+0.00660948	test-error:0.137314+0.0314024
[5]	train-error:0.131095+0.00714949	test-error:0.139304+0.0295172
[6]	train-error:0.13209+0.00616393	test-error:0.139304+0.0295172
[7]	train-error:0.131841+0.00583373	test-error:0.136318+0.0296177
[8]	train-error:0.131841+0.00567237	test-error:0.136318+0.0296177
[9]	train-error:0.132089+0.00574819	test-error:0.137313+0.0294501
[10]	train-error:0.131592+0.0058549	test-error:0.135323+0.0304094
[11]	train-error:0.130099+0.00591788	test-error:0.136318+0.0302789
[12]	train-error:0.130597+0.00561754	test-error:0.137313+0.0302788
[13]	train-error:0.130348+0.0054156	test-error:0.137313+0.0302788
[14]	train-error:0.130846+0.0055287	test-error:0.137313+0.0289414
[15]	train-

In [72]:
num_rounds = 59

In [73]:
model = xgb.train(xgb_params, dtrain, num_boost_round = num_rounds, evals = [(dtrain, 'train')], verbose_eval = 20)

[0]	train-error:0.134787
[20]	train-error:0.133796
[40]	train-error:0.129832
[58]	train-error:0.115956


In [80]:
y_pred1 = model.predict(dtest)
predictions17 = [1 if x >=0.5 else 0 for x in y_pred1]

In [81]:
check_result(y_test, predictions17)

0.875288683603
0.933497536946
[[  0  52]
 [  2 379]]


In [84]:
dtrain = xgb.DMatrix(X_train_1, z_train)

In [88]:
dtest = xgb.DMatrix(X_test_1)

In [85]:
xgb_params1 = {
    'eta': 0.1,
    'max_depth': 5,
     'min_child_weight': 1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'silent': 1
}



In [86]:
cvresult = xgb.cv(xgb_params1, dtrain, num_boost_round=1000, nfold=5,
            metrics='rmse', early_stopping_rounds=50, verbose_eval=True)

[0]	train-rmse:1.66595+0.00464818	test-rmse:1.66588+0.0195561
[1]	train-rmse:1.50823+0.00417447	test-rmse:1.50886+0.0198445
[2]	train-rmse:1.36685+0.0033935	test-rmse:1.36752+0.0199878
[3]	train-rmse:1.24027+0.00278493	test-rmse:1.24136+0.0200829
[4]	train-rmse:1.12748+0.00274655	test-rmse:1.1293+0.0191711
[5]	train-rmse:1.02724+0.00243613	test-rmse:1.03017+0.0189689
[6]	train-rmse:0.93724+0.00162578	test-rmse:0.941997+0.019365
[7]	train-rmse:0.857745+0.00116613	test-rmse:0.864594+0.0178489
[8]	train-rmse:0.786553+0.00113758	test-rmse:0.794593+0.0166307
[9]	train-rmse:0.723274+0.00170383	test-rmse:0.732566+0.015698
[10]	train-rmse:0.667728+0.00201862	test-rmse:0.679702+0.0153046
[11]	train-rmse:0.618633+0.0024341	test-rmse:0.633003+0.0137845
[12]	train-rmse:0.576014+0.00238111	test-rmse:0.592497+0.012945
[13]	train-rmse:0.538097+0.00282257	test-rmse:0.557894+0.0115237
[14]	train-rmse:0.504966+0.00301833	test-rmse:0.528316+0.0103312
[15]	train-rmse:0.476052+0.00340289	test-rmse:0.503545

In [90]:
num_rounds = 83
model = xgb.train(xgb_params1, dtrain, num_boost_round = num_rounds, evals = [(dtrain, 'train')], verbose_eval = 20)
z_pred1 = model.predict(dtest)
print np.sqrt(mean_squared_error(z_test, z_pred1))
predictions18 = [1 if x >=2 else 0 for x in z_pred1]
check_result(y_test, predictions18)

[0]	train-rmse:1.66729
[20]	train-rmse:0.38721
[40]	train-rmse:0.295882
[60]	train-rmse:0.271656
[80]	train-rmse:0.250682
[82]	train-rmse:0.248841
0.36847334743
0.854503464203
0.92095357591
[[  3  49]
 [ 14 367]]


### Use features 123

In [82]:
dtrain = xgb.DMatrix(X_train_123n, y_train)
dtest = xgb.DMatrix(X_test_123n)
cvresult = xgb.cv(xgb_params, dtrain, num_boost_round=1000, nfold=5,
            metrics='error', early_stopping_rounds=50, verbose_eval=True)

[0]	train-error:0.126866+0.00854522	test-error:0.171144+0.03324
[1]	train-error:0.120647+0.00907218	test-error:0.139303+0.0343246
[2]	train-error:0.120398+0.0105948	test-error:0.140298+0.0308939
[3]	train-error:0.122886+0.00887861	test-error:0.140298+0.0316849
[4]	train-error:0.123383+0.0084502	test-error:0.140298+0.0307332
[5]	train-error:0.124627+0.00852337	test-error:0.135323+0.0287354
[6]	train-error:0.124129+0.00974264	test-error:0.135323+0.0287354
[7]	train-error:0.124876+0.00701827	test-error:0.136318+0.0306041
[8]	train-error:0.124876+0.00548397	test-error:0.137313+0.0271772
[9]	train-error:0.124129+0.00611368	test-error:0.140298+0.0287354
[10]	train-error:0.121144+0.00570514	test-error:0.138308+0.029917
[11]	train-error:0.122388+0.00559563	test-error:0.139303+0.0288386
[12]	train-error:0.121891+0.00588667	test-error:0.139303+0.0288386
[13]	train-error:0.119901+0.00651506	test-error:0.139303+0.0288386
[14]	train-error:0.119154+0.00722674	test-error:0.139303+0.0286664
[15]	train

In [83]:
num_rounds = 97
model = xgb.train(xgb_params, dtrain, num_boost_round = num_rounds, evals = [(dtrain, 'train')], verbose_eval = 20)
y_pred1 = model.predict(dtest)
predictions18 = [1 if x >=0.5 else 0 for x in y_pred1]
check_result(y_test, predictions18 )

[0]	train-error:0.124876
[20]	train-error:0.119921
[40]	train-error:0.084242
[60]	train-error:0.039643
[80]	train-error:0.013875
[96]	train-error:0.005946
0.877598152425
0.934809348093
[[  0  52]
 [  1 380]]


In [91]:
dtrain = xgb.DMatrix(X_train_123n, z_train)
dtest = xgb.DMatrix(X_test_123n)
cvresult = xgb.cv(xgb_params1, dtrain, num_boost_round=1000, nfold=5,
            metrics='rmse', early_stopping_rounds=50, verbose_eval=True)

[0]	train-rmse:1.66588+0.00464602	test-rmse:1.66722+0.0208145
[1]	train-rmse:1.50767+0.00383587	test-rmse:1.51074+0.0218353
[2]	train-rmse:1.36542+0.00335035	test-rmse:1.36961+0.0213826
[3]	train-rmse:1.23768+0.0025824	test-rmse:1.24345+0.0221618
[4]	train-rmse:1.12358+0.00211448	test-rmse:1.13245+0.0220914
[5]	train-rmse:1.02201+0.00153107	test-rmse:1.03259+0.0217388
[6]	train-rmse:0.930747+0.00180147	test-rmse:0.943324+0.0201998
[7]	train-rmse:0.849272+0.0018109	test-rmse:0.865748+0.0197805
[8]	train-rmse:0.776914+0.00194621	test-rmse:0.79606+0.0194041
[9]	train-rmse:0.712079+0.0019652	test-rmse:0.735464+0.0181199
[10]	train-rmse:0.654345+0.00225153	test-rmse:0.681601+0.0168056
[11]	train-rmse:0.602867+0.0022292	test-rmse:0.635026+0.016976
[12]	train-rmse:0.5583+0.00233603	test-rmse:0.595095+0.0150236
[13]	train-rmse:0.517275+0.0020376	test-rmse:0.560159+0.0131609
[14]	train-rmse:0.48184+0.00218133	test-rmse:0.529921+0.0119737
[15]	train-rmse:0.450327+0.00221836	test-rmse:0.502903+0.

In [92]:
num_rounds = 94
model = xgb.train(xgb_params1, dtrain, num_boost_round = num_rounds, evals = [(dtrain, 'train')], verbose_eval = 20)
z_pred1 = model.predict(dtest)
print np.sqrt(mean_squared_error(z_test, z_pred1))
predictions20 = [1 if x >=2 else 0 for x in z_pred1]
check_result(y_test, predictions20)

[0]	train-rmse:1.66704
[20]	train-rmse:0.34711
[40]	train-rmse:0.221447
[60]	train-rmse:0.178533
[80]	train-rmse:0.146926
[93]	train-rmse:0.130928
0.380832355499
0.849884526559
0.918032786885
[[  4  48]
 [ 17 364]]


#### Logistic Regression

In [54]:
from sklearn.linear_model import LogisticRegression


In [287]:
model = LogisticRegression(class_weight={1:1.0, 0:3.8}, C=100.0)

#### use feature set 1

In [288]:
def check_result(y_true, y_predict):
    print accuracy_score(y_true, y_predict)
    print f1_score(y_true, y_predict)
    print confusion_matrix(y_true, y_predict)

In [289]:
model.fit(X_train_1n, y_train)
predictions1 = model.predict(X_test_1n)
check_result(y_test, predictions1)


0.906666666667
0.951048951049
[[  0  25]
 [  3 272]]


### use feature set 123

In [291]:
model = LogisticRegression(class_weight={1:1.0, 0:2}, C=1000.0)

In [292]:
model.fit(X_train_123n, y_train)
predictions2 = model.predict(X_test_123n)
check_result(y_test, predictions2)


0.75
0.854932301741
[[  4  21]
 [ 54 221]]


#### use feature set X4

In [293]:
model = LogisticRegression(class_weight={1:1.0, 0:5}, C=1000.0)
model.fit(X_train_4, y_train)
predictions3 = model.predict(X_test_4)
check_result(y_test, predictions3)


ValueError: Found input variables with inconsistent numbers of samples: [1009, 700]

#### use feature set X5

In [100]:
model = LogisticRegression(class_weight={1:1.0, 0:1}, C=1000.0)
model.fit(X_train_5, y_train)
predictions3 = model.predict(X_test_5)
check_result(y_test, predictions3)

0.833718244804
0.908163265306
[[  5  47]
 [ 25 356]]


#### use feature set X14

In [133]:
model = LogisticRegression(class_weight={1:1.0, 0:1}, C=1000.0)
model.fit(X_train_14, y_train)
predictions4 = model.predict(X_test_14)
check_result(y_test, predictions4)

0.817551963048
0.897001303781
[[ 10  42]
 [ 37 344]]


#### use feature set X15

In [134]:
model = LogisticRegression(class_weight={1:1.0, 0:2.5}, C=1000.0)
model.fit(X_train_15, y_train)
predictions5 = model.predict(X_test_15)
check_result(y_test, predictions5)

0.826789838337
0.903969270166
[[  5  47]
 [ 28 353]]


### Random Forest

In [128]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

#### use feature set 1

In [294]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:20})
model.fit(X_train_1n, y_train)
predictions6 = model.predict(X_test_1n)
check_result(y_test, predictions6)

0.913333333333
0.954703832753
[[  0  25]
 [  1 274]]


In [295]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_1n, z_train)
pred = model.predict(X_test_1n)
print np.sqrt(mean_squared_error(z_test, pred))

#check_result(y_test, predictions6)

0.336921536452


In [153]:
predictions7 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions7)

0.849884526559
0.917825537295
[[  5  47]
 [ 18 363]]


#### Use feature set 123

In [158]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:10})
model.fit(X_train_123n, y_train)
predictions8 = model.predict(X_test_123n)
check_result(y_test, predictions8)


0.879907621247
0.936117936118
[[  0  52]
 [  0 381]]


In [159]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_123n, z_train)
pred = model.predict(X_test_123n)
print np.sqrt(mean_squared_error(z_test, pred))
predictions8 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions8)

0.356999391385
0.879907621247
0.936117936118
[[  0  52]
 [  0 381]]


### use feature 14

In [161]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:10})
model.fit(X_train_14, y_train)
predictions9 = model.predict(X_test_14)
check_result(y_test, predictions9)

0.877598152425
0.934648581998
[[  1  51]
 [  2 379]]


In [162]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_14, z_train)
pred = model.predict(X_test_14)
print np.sqrt(mean_squared_error(z_test, pred))
predictions10 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions10)

0.365751545531
0.877598152425
0.934809348093
[[  0  52]
 [  1 380]]


#### use feature 15

In [163]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:10})
model.fit(X_train_15, y_train)
predictions11 = model.predict(X_test_15)
check_result(y_test, predictions11)

0.875288683603
0.933497536946
[[  0  52]
 [  2 379]]


In [164]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_15, z_train)
pred = model.predict(X_test_15)
print np.sqrt(mean_squared_error(z_test, pred))
predictions12 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions12)

0.357219999833
0.863741339492
0.926889714994
[[  0  52]
 [  7 374]]


### GBM

In [165]:
import h2o

In [166]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,2 mins 33 secs
H2O cluster version:,3.10.5.4
H2O cluster version age:,1 month and 1 day
H2O cluster name:,alin
H2O cluster total nodes:,1
H2O cluster free memory:,3.538 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://localhost:54321


### Use feature set 1

In [None]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [169]:
y_train0 = y_train.reshape(y_train.shape[0],1)
z_train0 = z_train.reshape(z_train.shape[0],1)

In [218]:
train = np.concatenate((X_train_1n, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_1n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions13 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions13)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.875288683603
0.933497536946
[[  0  52]
 [  2 379]]


In [219]:
train = np.concatenate((X_train_1n, z_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_1n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])
print np.sqrt(mean_squared_error(z_test, pred1))
predictions14 = [1 if x >= 2 else 0 for x in pred1]
check_result(y_test, predictions14)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.363471146243
0.866050808314
0.928217821782
[[  0  52]
 [  6 375]]


#### Use feature 123

In [220]:
train = np.concatenate((X_train_123n, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_123n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions15 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions15)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.879907621247
0.936117936118
[[  0  52]
 [  0 381]]


In [221]:
train = np.concatenate((X_train_123n, z_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_123n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])
print np.sqrt(mean_squared_error(z_test, pred1))
predictions16 = [1 if x >= 2 else 0 for x in pred1]
check_result(y_test, predictions16)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.376590871913
0.854503464203
0.92095357591
[[  3  49]
 [ 14 367]]


### Use feature 4

In [226]:
X_train_4a = X_train_4.toarray()

In [228]:
X_test_4a = X_test_4.toarray()

In [229]:
train = np.concatenate((X_train_4a, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_4a)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions16 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions16)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.877598152425
0.934809348093
[[  0  52]
 [  1 380]]
