## Project Description

In this project we explore how to use basic natural languate processing and machine learning techniques to automatically grade essays (AES) in Brazilian Portuguese. In general, AES is a very difficult problem even in English. The available tools that can be used for Portuguese are even less. 

### Load Data

In [1]:
%qtconsole

In [2]:
# -*- coding: utf-8 -*-

In [3]:
import numpy as np
import pandas as pd


In [4]:
import re

In [5]:
import nltk

### Essay Question

In [6]:
question = pd.read_excel(io="C:/Users/alin/Documents/ORAnalytics/AES/data/Brazil_essay.xlsx", sheetname="Essay Question" )

In [7]:
question = question.Original[0]


In [8]:
print question

Conforme apresentado no vídeo da Revista Exame, o comportamento é muito importante no trabalho, nas organizações. Em sua opinião, o comportamento é responsabilidade da própria pessoa ou da empresa em que trabalha? O que as organizações podem fazer para ajudar seus colaboradores a desenvolverem comportamentos melhores e mais adequados às necessidades do trabalho?


###  Essays

In [9]:
essay_df = pd.read_excel(io="C:/Users/alin/Documents/ORAnalytics/AES/data/Brazil_essay.xlsx", sheetname="result" )

In [10]:
essay_df.columns = essay_df.columns.str.lower()

In [11]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:..."


### Count # of paragraphs and remove html tags

In [12]:
essay_df['paragraphs'] = essay_df.apply(lambda r: len(re.findall(r'<p', r['resposta'])), axis = 1)

In [13]:
essay_df['text'] = essay_df.apply(lambda r: re.sub(r'<[^<>]*>', ' ', r['resposta']), axis = 1)



In [14]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...


### Create some basic features

In [15]:
essay_df['tokens'] = essay_df.apply(lambda r: nltk.wordpunct_tokenize(r['text']), axis = 1)
essay_df['nlp_text'] = essay_df.apply(lambda r: nltk.Text(r['tokens']), axis = 1) 


In [16]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text,tokens,nlp_text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...,"[Em, minha, opinião, que, quando, se, trata, d...","(Em, minha, opinião, que, quando, se, trata, d..."
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ...","[Olá, Patricia, ,, Concordo, com, sua, colocaç...","(Olá, Patricia, ,, Concordo, com, sua, colocaç..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...,"[O, comportamento, do, profissional, dentro, d...","(O, comportamento, do, profissional, dentro, d..."
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...,"[Em, um, primeiro, momento, o, comportamento, ...","(Em, um, primeiro, momento, o, comportamento, ..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...,"[Questões, comportamentais, estão, relacionada...","(Questões, comportamentais, estão, relacionada..."


#### Character counts

In [17]:
essay_df['chr_cnt'] = essay_df.apply(lambda r: len(r['text']), axis = 1)

#### Token counts (including stopwords)

In [18]:
essay_df['token_cnt'] = essay_df.apply(lambda r: len(r['tokens']), axis = 1)

#### Token counts (excluding stopwords)

In [19]:
stopwords = nltk.corpus.stopwords.words('portuguese')

In [20]:
essay_df['tokens_fld'] = essay_df.apply(lambda r: [w.lower() for w in r['tokens'] 
                                                   if w not in stopwords and not w.isnumeric() and len(w) > 1], axis = 1)

In [21]:
essay_df['token_cnt_fld'] = essay_df.apply(lambda r: len(r['tokens_fld']), axis = 1) 

#### Number of sentences and number of sentences longer than 250 characters

In [22]:
essay_df['sentences'] = essay_df.apply(lambda r: nltk.sent_tokenize(r['text']), axis = 1)

In [23]:
essay_df['sent_cnt'] = essay_df.apply(lambda r: len(r['sentences']), axis = 1)

In [24]:
essay_df['long_sent_cnt'] = essay_df.apply(lambda r: len([s for s in r['sentences'] if len(s) > 250]), axis = 1)

#### Average length (# of tokens) of sentence

In [25]:
essay_df['avg_sent_len'] = essay_df.apply(lambda r: float(r['token_cnt'] / r['sent_cnt']), axis = 1)

#### Number of words  that appear both in the question and the essay

In [26]:
question_token = set([w.lower() for w in nltk.wordpunct_tokenize(question) 
                      if w not in stopwords and not w.isnumeric() and len(w) > 1])

In [27]:
essay_df['question_tokens'] = essay_df.apply(lambda r: len(set(r['tokens_fld']).intersection(question_token)), axis = 1)

### Binary outcome

In [28]:
essay_df['pass'] = np.where(essay_df['average_score'] >= 2, 1,0)

### Train/test split


In [29]:
from sklearn.model_selection import train_test_split


In [124]:
train_df, test_df, _, _ = train_test_split(essay_df, essay_df['pass'], test_size = 0.3, random_state = 2017)

In [125]:
y_train = train_df['pass']
z_train = train_df['average_score']
y_test = test_df['pass']
z_test = test_df['average_score']

### Features from above

In [128]:
features = ['chr_cnt', 'token_cnt', 'token_cnt_fld', 'sent_cnt', 'long_sent_cnt', 'avg_sent_len', 'question_tokens']
X_train_1 = train_df[features].values
X_test_1 = test_df[features].values

In [129]:
X_train_1lst = [X_train_1[:,i] for i in range(X_train_1.shape[1])]
X_test_1lst = [X_test_1[:,i] for  i in range(X_test_1.shape[1])]

### Features from POS Tagging

#### Create a tagger 

In [130]:
from nltk.corpus import floresta
Tagger0 = nltk.DefaultTagger('n')
def simplify_tag(t):
    if "+" in t:
        return t[t.index("+")+1:]
    else:
        return t
tsents = [[(w.lower(),simplify_tag(t)) for (w,t) in sent] for sent in floresta.tagged_sents() if sent]
Tagger1 = nltk.UnigramTagger(tsents, backoff=Tagger0)
Tagger2 = nltk.BigramTagger(tsents, backoff=Tagger1)

#### Add unigram and bigram pos tag features

In [131]:
all_unigram = set()
all_bigram = set()
train_unigrams = []
train_bigrams = []
for sents in train_df.sentences:
    tui = {}
    tbi = {}
    for sent in sents:
        tsent = Tagger2.tag(nltk.word_tokenize(sent))
        for i in range(len(tsent) - 1):
            t0 = tsent[i][1]
            t1 = tsent[i+1][1]
            all_unigram.add(t0)
            all_bigram.add((t0, t1))
            tui[t0] = tui[t0] + 1 if t0 in tui else 1
            tbi[(t0, t1)] = tbi[(t0, t1)] + 1 if (t0, t1) in tbi else 1
        t0 = tsent[len(tsent) - 1][1]
        all_unigram.add(t0)
        tui[t0] = tui[t0] + 1 if t0 in tui else 1
    train_unigrams.append(tui)
    train_bigrams.append(tbi)

Only keep those tag grams appear in at least 5 documents

In [132]:
L = 5
use_cnt = [(x, sum(x in tu for tu in train_unigrams)) for x in all_unigram]
use_unigram = set([x for (x,y) in use_cnt if y >L])
use_cnt = [(x, sum(x in tb for tb in train_bigrams)) for x in all_bigram]
use_bigram = set([x for (x,y) in use_cnt if y >L])


#### pos tag features of test data

In [133]:
test_unigrams = []
test_bigrams = []
for sents in test_df.sentences:
    tui = {}
    tbi = {}
    for sent in sents:
        tsent = Tagger2.tag(nltk.word_tokenize(sent))
        for i in range(len(tsent) - 1):
            t0 = tsent[i][1]
            t1 = tsent[i+1][1]
            if t0 in use_unigram:
                tui[t0] = tui[t0] + 1 if t0 in tui else 1
            if (t0, t1) in use_bigram:
                tbi[(t0, t1)] = tbi[(t0, t1)] + 1 if (t0, t1) in tbi else 1
        t0 = tsent[len(tsent) - 1][1]
        if t0 in use_unigram:
            tui[t0] = tui[t0] + 1 if t0 in tui else 1
    test_unigrams.append(tui)
    test_bigrams.append(tbi)

#### unigram pos tag features

In [134]:
train_unigram0 = [{u: tu[u] if u in tu else 0 for u in use_unigram} for tu in train_unigrams]

train_uni_mat = pd.DataFrame(train_unigram0).values

test_unigram0 = [{u: tu[u] if u in tu else 0 for u in use_unigram} for tu in test_unigrams]

test_uni_mat = pd.DataFrame(test_unigram0).values

In [135]:
X_train_2 = train_uni_mat
X_test_2 = test_uni_mat

#### bigram pos tag features

In [136]:
train_bigram0 = [{u: tu[u] if u in tu else 0 for u in use_bigram} for tu in train_bigrams]

train_bi_mat = pd.DataFrame(train_bigram0).values

test_bigram0 = [{u: tu[u] if u in tu else 0 for u in use_bigram} for tu in test_bigrams]

test_bi_mat = pd.DataFrame(test_bigram0).values

In [137]:
X_train_3 = train_bi_mat
X_test_3 = test_bi_mat

### Create term-document matrix

In [138]:
train_text = np.array(train_df['text'])
test_text = np.array(test_df['text'])

In [139]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Count vector  ignoring tokens with doc-frequency < 5 and excluding stopwords

In [140]:
vect = CountVectorizer(min_df=5, stop_words = stopwords).fit(train_text)

In [141]:
X_train_4 = vect.transform(train_text)
X_test_4 = vect.transform(test_text)

#### Tfidf ignoring tokens with doc-frequency < 5 and excluding stopwords

In [142]:
vect = TfidfVectorizer(min_df=5, stop_words=stopwords).fit(train_text)

In [143]:
X_train_5 = vect.transform(train_text)
X_test_5 = vect.transform(test_text)

#### Feature sets

In [144]:
X_train_123 = np.concatenate((X_train_1, X_train_2, X_train_3), axis = 1)

In [145]:
X_test_123 = np.concatenate((X_test_1, X_test_2, X_test_3), axis = 1)

#### normalize

In [146]:
from sklearn.preprocessing import StandardScaler


In [147]:
scaler = StandardScaler().fit(X_train_1)
X_train_1n = scaler.transform(X_train_1)
X_test_1n = scaler.transform(X_test_1)

In [148]:
scaler = StandardScaler().fit(X_train_123)
X_train_123n = scaler.transform(X_train_123)
X_test_123n = scaler.transform(X_test_123)

In [149]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

In [150]:
X_train_1lst = [X_train_1[:,i] for i in range(X_train_1n.shape[1])]
X_test_1lst = [X_test_1[:,i] for  i in range(X_test_1n.shape[1])]

In [151]:
X_train_14 = add_feature(X_train_4, X_train_1lst)
    

In [152]:
X_test_14 = add_feature(X_test_4, X_test_1lst)
X_train_15 = add_feature(X_train_5, X_train_1lst)
X_test_15 = add_feature(X_test_5, X_test_1lst)

In [216]:
X_train_1235 = add_feature(X_train_5, [np.array(x) for x in X_train_123.T.tolist()])

In [217]:
X_test_1235 = add_feature(X_test_5, [np.array(x) for x in X_test_123.T.tolist()])

### Try some basic models

#### constant model

In [153]:
from sklearn.metrics import accuracy_score, mean_squared_error, f1_score, confusion_matrix

In [154]:
z_predict_b = np.ones(y_test.shape[0])*2.5
y_predict_b = np.ones(y_test.shape[0])
print accuracy_score(y_test, y_predict_b)
print np.sqrt(mean_squared_error(y_test, y_predict_b))

0.847575057737
0.390416370383


#### Logistic Regression

In [155]:
from sklearn.linear_model import LogisticRegression


In [156]:
model = LogisticRegression(class_weight={1:1.0, 0:3.8}, C=100.0)

#### use feature set 1

In [157]:
def check_result(y_true, y_predict):
    print accuracy_score(y_true, y_predict)
    print f1_score(y_true, y_predict)
    print confusion_matrix(y_true, y_predict)

In [158]:
model.fit(X_train_1n, y_train)
predictions1 = model.predict(X_test_1n)
check_result(y_test, predictions1)


0.812933025404
0.892998678996
[[ 14  52]
 [ 29 338]]


### use feature set 123

In [159]:
model = LogisticRegression(class_weight={1:1.0, 0:2}, C=1000.0)

In [160]:
model.fit(X_train_123n, y_train)
predictions2 = model.predict(X_test_123n)
check_result(y_test, predictions2)


0.732101616628
0.843243243243
[[  5  61]
 [ 55 312]]


#### use feature set X4

In [161]:
model = LogisticRegression(class_weight={1:1.0, 0:5}, C=1000.0)
model.fit(X_train_4, y_train)
predictions3 = model.predict(X_test_4)
check_result(y_test, predictions3)


0.775981524249
0.870838881491
[[  9  57]
 [ 40 327]]


#### use feature set X5

In [162]:
model = LogisticRegression(class_weight={1:1.0, 0:1}, C=1000.0)
model.fit(X_train_5, y_train)
predictions3 = model.predict(X_test_5)
check_result(y_test, predictions3)

0.789838337182
0.881355932203
[[  4  62]
 [ 29 338]]


#### use feature set X14

In [163]:
model = LogisticRegression(class_weight={1:1.0, 0:1}, C=1000.0)
model.fit(X_train_14, y_train)
predictions4 = model.predict(X_test_14)
check_result(y_test, predictions4)

0.80831408776
0.89121887287
[[ 10  56]
 [ 27 340]]


#### use feature set X15

In [164]:
model = LogisticRegression(class_weight={1:1.0, 0:2.5}, C=1000.0)
model.fit(X_train_15, y_train)
predictions5 = model.predict(X_test_15)
check_result(y_test, predictions5)

0.792147806005
0.88188976378
[[  7  59]
 [ 31 336]]


### Random Forest

In [165]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

#### use feature set 1

In [166]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:20})
model.fit(X_train_1n, y_train)
predictions6 = model.predict(X_test_1n)
check_result(y_test, predictions6)

0.838337182448
0.911838790932
[[  1  65]
 [  5 362]]


In [167]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_1n, z_train)
pred = model.predict(X_test_1n)
print np.sqrt(mean_squared_error(z_test, pred))

#check_result(y_test, predictions6)

0.402651564589


In [168]:
predictions7 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions7)

0.815242494226
0.897698209719
[[  2  64]
 [ 16 351]]


#### Use feature set 123

In [174]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:1})
model.fit(X_train_123n, y_train)
predictions8 = model.predict(X_test_123n)
check_result(y_test, predictions8)


0.847575057737
0.9175
[[  0  66]
 [  0 367]]


In [175]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_123n, z_train)
pred = model.predict(X_test_123n)
print np.sqrt(mean_squared_error(z_test, pred))
predictions8 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions8)

0.393703827127
0.849884526559
0.918648310388
[[  1  65]
 [  0 367]]


### use feature 14

In [176]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:10})
model.fit(X_train_14, y_train)
predictions9 = model.predict(X_test_14)
check_result(y_test, predictions9)

0.845265588915
0.915934755332
[[  1  65]
 [  2 365]]


In [177]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_14, z_train)
pred = model.predict(X_test_14)
print np.sqrt(mean_squared_error(z_test, pred))
predictions10 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions10)

0.407113424484
0.847575057737
0.9175
[[  0  66]
 [  0 367]]


#### use feature 15

In [180]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:100})
model.fit(X_train_15, y_train)
predictions11 = model.predict(X_test_15)
check_result(y_test, predictions11)

0.842956120092
0.914141414141
[[  3  63]
 [  5 362]]


In [79]:
model = RandomForestRegressor(n_estimators=500)
model.fit(X_train_15, z_train)
pred = model.predict(X_test_15)
print np.sqrt(mean_squared_error(z_test, pred))
predictions12 = np.array([1 if x >= 2 else 0 for x in pred])
check_result(y_test, predictions12)

0.358633844767
0.863741339492
0.926889714994
[[  0  52]
 [  7 374]]


#### use feature 1235  (use this?)

In [224]:
model = RandomForestClassifier(n_estimators=500, class_weight = {1:1, 0:100})
model.fit(X_train_1235, y_train)
predictions11 = model.predict(X_test_1235)
check_result(y_test, predictions11)

0.845265588915
0.91572327044
[[  2  64]
 [  3 364]]


In [241]:
output = pd.DataFrame({'prediction': predictions11, 'pk1': test_df.pk1})

In [242]:
output.to_csv("C:/Users/alin/Documents/ORAnalytics/AES/data/output.csv")

### GBM

In [80]:
import h2o

In [81]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,19 mins 20 secs
H2O cluster version:,3.10.5.4
H2O cluster version age:,1 month and 1 day
H2O cluster name:,alin
H2O cluster total nodes:,1
H2O cluster free memory:,3.538 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://localhost:54321


### Use feature set 1

In [82]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [83]:
y_train0 = y_train.reshape(y_train.shape[0],1)
z_train0 = z_train.reshape(z_train.shape[0],1)

In [84]:
train = np.concatenate((X_train_1n, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_1n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions13 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions13)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.875288683603
0.933497536946
[[  0  52]
 [  2 379]]


In [85]:
train = np.concatenate((X_train_1n, z_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_1n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])
print np.sqrt(mean_squared_error(z_test, pred1))
predictions14 = [1 if x >= 2 else 0 for x in pred1]
check_result(y_test, predictions14)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.363471146243
0.866050808314
0.928217821782
[[  0  52]
 [  6 375]]


#### Use feature 123

In [86]:
train = np.concatenate((X_train_123n, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_123n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions15 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions15)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.879907621247
0.936117936118
[[  0  52]
 [  0 381]]


In [87]:
train = np.concatenate((X_train_123n, z_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_123n)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])
print np.sqrt(mean_squared_error(z_test, pred1))
predictions16 = [1 if x >= 2 else 0 for x in pred1]
check_result(y_test, predictions16)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.376590871913
0.854503464203
0.92095357591
[[  3  49]
 [ 14 367]]


### Use feature 14

In [88]:
X_train_14a = X_train_14.toarray()

In [89]:
X_test_14a = X_test_14.toarray()

In [90]:
train = np.concatenate((X_train_14a, y_train0), axis = 1)
train_hex = h2o.H2OFrame(train)
gbm = H2OGradientBoostingEstimator()
gbm.train(x = range(train.shape[1]-1), y = train.shape[1]-1, training_frame=train_hex)

test_hex = h2o.H2OFrame(X_test_14a)

pred = gbm.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions16 = [1 if x > 0.5 else 0 for x in pred1]
check_result(y_test, predictions16)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
0.879907621247
0.936117936118
[[  0  52]
 [  0 381]]


### NN

In [91]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [95]:
train = np.concatenate((X_train_14a, z_train0), axis = 1)
train_hex = h2o.H2OFrame(train)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [96]:
model = H2ODeepLearningEstimator(distribution="gaussian",
                                 activation="tanh",
                                 hidden=[32,32,32],
                                 input_dropout_ratio=0.2,
                                 sparse=True,
                                 l1=1e-5,
                                 epochs= 2)

In [97]:
model.train(x=range(train.shape[1]-1), y=train.shape[1]-1, training_frame=train_hex)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [99]:
test_hex = h2o.H2OFrame(X_test_14a)

pred = model.predict(test_hex)

pred1 = np.array([pred[i, 0] for i in range(y_test.shape[0])])

predictions16 = [1 if x > 2 else 0 for x in pred1]
check_result(y_test, predictions16)

Parse progress: |█████████████████████████████████████████████████████████| 100%
deeplearning prediction progress: |███████████████████████████████████████| 100%
0.856812933025
0.921914357683
[[  5  47]
 [ 15 366]]


In [111]:
output = pd.DataFrame(data = {'pred1': predictions1,  'pk1': test_df.pk1})

In [112]:
output.head()

Unnamed: 0,pk1,pred1
1432,604784,1
590,614030,1
1240,706801,1
364,567512,1
432,591678,1


In [105]:
test_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text,tokens,nlp_text,chr_cnt,token_cnt,tokens_fld,token_cnt_fld,sentences,sent_cnt,long_sent_cnt,avg_sent_len,question_tokens,pass
1432,1.610127,604784,ATIVIDADE 1,2.5,"<p>Olá,</p> \n<p>Acredito que 90% da responsab...",4,"Olá, \n Acredito que 90% da responsabilidade...","[Olá, ,, Acredito, que, 90, %, da, responsabil...","(Olá, ,, Acredito, que, 90, %, da, responsabil...",1307,223,"[olá, acredito, responsabilidade, comportament...",112,"[ Olá, \n Acredito que 90% da responsabilidad...",8,1,27.0,6,1
590,1.209834,614030,ATIVIDADE 1,2.5,<p>tbm concordo com a importancia que tem a mo...,1,tbm concordo com a importancia que tem a moti...,"[tbm, concordo, com, a, importancia, que, tem,...","(tbm, concordo, com, a, importancia, que, tem,...",137,26,"[tbm, concordo, importancia, motivação, ambien...",10,[ tbm concordo com a importancia que tem a mot...,1,0,26.0,1,1
1240,1.210197,706801,ATIVIDADE 1,2.5,<p>Podemos ver diversas empresas percebendo a ...,2,Podemos ver diversas empresas percebendo a ne...,"[Podemos, ver, diversas, empresas, percebendo,...","(Podemos, ver, diversas, empresas, percebendo,...",510,86,"[podemos, ver, diversas, empresas, percebendo,...",47,[ Podemos ver diversas empresas percebendo a n...,2,1,43.0,2,1
364,1.209087,567512,ATIVIDADE 1,2.5,<p> O comportamento é de resposanbilidade da p...,2,O comportamento é de resposanbilidade da pes...,"[O, comportamento, é, de, resposanbilidade, da...","(O, comportamento, é, de, resposanbilidade, da...",530,93,"[comportamento, resposanbilidade, pessoa, form...",44,[ O comportamento é de resposanbilidade da pe...,3,0,31.0,2,1
432,1.209182,591678,ATIVIDADE 1,2.0,"<p>Na minha opinião, o comportamento é unicame...",3,"Na minha opinião, o comportamento é unicament...","[Na, minha, opinião, ,, o, comportamento, é, u...","(Na, minha, opinião, ,, o, comportamento, é, u...",464,81,"[na, opinião, comportamento, unicamente, respo...",35,"[ Na minha opinião, o comportamento é unicamen...",3,0,27.0,7,1
