## Project Description

In this project we explore how to use basic natural language processing and machine learning techniques for automated essay scoring (AES) in Brazilian Portuguese. AES is a difficult problem even in English, and fewer tools are available for Portuguese.

### Load Data

In [1]:
%qtconsole

In [20]:
# -*- coding: utf-8 -*-

In [2]:
import numpy as np
import pandas as pd


In [27]:
import re

In [48]:
import nltk

### Essay Question

In [21]:
# Note: recent pandas versions use sheet_name; older releases spelled it sheetname.
question = pd.read_excel(io="C:/Users/alin/Documents/ORAnalytics/AES/data/Brazil_essay.xlsx", sheet_name="Essay Question")

In [22]:
question = question.Original[0]


In [26]:
print(question)

Conforme apresentado no vídeo da Revista Exame, o comportamento é muito importante no trabalho, nas organizações. Em sua opinião, o comportamento é responsabilidade da própria pessoa ou da empresa em que trabalha? O que as organizações podem fazer para ajudar seus colaboradores a desenvolverem comportamentos melhores e mais adequados às necessidades do trabalho?


### Essays

In [24]:
# Note: recent pandas versions use sheet_name; older releases spelled it sheetname.
essay_df = pd.read_excel(io="C:/Users/alin/Documents/ORAnalytics/AES/data/Brazil_essay.xlsx", sheet_name="result")

In [28]:
essay_df.columns = essay_df.columns.str.lower()

In [29]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:..."


### Count the number of paragraphs and remove HTML tags

In [40]:
essay_df['paragraphs'] = essay_df.apply(lambda r: len(re.findall(r'<p', r['resposta'])), axis = 1)

In [42]:
essay_df['text'] = essay_df.apply(lambda r: re.sub(r'<[^<>]*>', ' ', r['resposta']), axis = 1)
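
A small sketch of what the two regular expressions above do on a single answer. The sample string is made up for illustration and is not taken from the data set. Note that counting `<p` is only an approximation of the paragraph count, since it would also match tags such as `<pre>`.

```python
import re

# Hypothetical sample answer; not taken from the data set.
sample = '<p>Olá,</p> \n<p><span style="font-size: 10pt">Concordo.</span></p>'

# Count opening <p ...> tags as a rough proxy for the number of paragraphs.
paragraphs = len(re.findall(r'<p', sample))

# Replace every tag with a space so adjacent words are not glued together.
text = re.sub(r'<[^<>]*>', ' ', sample)

# Optional extra step: collapse the whitespace runs left by the substitution.
clean = ' '.join(text.split())

print(paragraphs)
print(clean)
```

The whitespace-collapsing step is not part of the notebook; it may matter because the character count feature is computed on the text with the extra spaces included.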



In [43]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...


### Create some basic features

In [58]:
essay_df['tokens'] = essay_df.apply(lambda r: nltk.wordpunct_tokenize(r['text']), axis = 1)
essay_df['nlp_text'] = essay_df.apply(lambda r: nltk.Text(r['tokens']), axis = 1) 


In [59]:
essay_df.head()

Unnamed: 0,user_id,pk1,title,average_score,resposta,paragraphs,text,tokens,nlp_text
0,1.071078,2500,ATIVIDADE 1,2.5,<p> Em minha opinião que quando se trata de co...,6,Em minha opinião que quando se trata de comp...,"[Em, minha, opinião, que, quando, se, trata, d...","(Em, minha, opinião, que, quando, se, trata, d..."
1,1.081038,4817,ATIVIDADE 1,2.5,"<p>Olá Patricia,</p> \n<p>Concordo com sua col...",4,"Olá Patricia, \n Concordo com sua colocação ...","[Olá, Patricia, ,, Concordo, com, sua, colocaç...","(Olá, Patricia, ,, Concordo, com, sua, colocaç..."
2,1.081924,7253,ATIVIDADE 1,2.5,<p>O comportamento do profissional dentro da e...,2,O comportamento do profissional dentro da emp...,"[O, comportamento, do, profissional, dentro, d...","(O, comportamento, do, profissional, dentro, d..."
3,1.091905,11815,ATIVIDADE 1,2.5,"<p><span style=""font-family: georgia , palatin...",4,\n Em um primeiro momento o comportamen...,"[Em, um, primeiro, momento, o, comportamento, ...","(Em, um, primeiro, momento, o, comportamento, ..."
4,1.101049,704153,ATIVIDADE 1,1.5,"<p><span style=""font-size: 10.0pt;font-family:...",2,Questões comportamentais estão relacionadas ...,"[Questões, comportamentais, estão, relacionada...","(Questões, comportamentais, estão, relacionada..."


#### Character counts

In [60]:
essay_df['chr_cnt'] = essay_df.apply(lambda r: len(r['text']), axis = 1)

#### Token counts (including stopwords)

In [62]:
essay_df['token_cnt'] = essay_df.apply(lambda r: len(r['tokens']), axis = 1)

#### Token counts (excluding stopwords)

In [64]:
stopwords = nltk.corpus.stopwords.words('portuguese')

In [65]:
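# Lowercase each token before the stopword lookup: the NLTK list is lowercase,
# so a capitalised stopword such as "Em" would otherwise slip through.
essay_df['tokens_fld'] = essay_df.apply(lambda r: [w.lower() for w in r['tokens'] 
                                                   if w.lower() not in stopwords and not w.isnumeric() and len(w) > 1], axis = 1)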

In [68]:
essay_df['token_cnt_fld'] = essay_df.apply(lambda r: len(r['tokens_fld']), axis = 1) 
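
To make the filtering rule concrete, here is a minimal sketch on one made-up sentence. It assumes a tiny hand-picked stopword set instead of the NLTK list, and a local `toy_tokenize` function (a hypothetical name) that uses the same splitting pattern as `nltk.wordpunct_tokenize`, so that it runs without NLTK. The sketch lowercases each token before the stopword lookup.

```python
import re

# Tiny illustrative stopword set; the notebook uses
# nltk.corpus.stopwords.words('portuguese') instead.
toy_stopwords = {'em', 'que', 'de', 'a', 'o', 'se'}

def toy_tokenize(text):
    # Runs of word characters, or runs of non-word, non-space characters.
    return re.findall(r'\w+|[^\w\s]+', text)

sample_text = 'Em 2016, o comportamento é importante.'
toy_tokens = toy_tokenize(sample_text)

# Drop stopwords, purely numeric tokens, and one-character tokens.
toy_filtered = [w.lower() for w in toy_tokens
                if w.lower() not in toy_stopwords
                and not w.isnumeric() and len(w) > 1]

print(len(toy_tokens), toy_filtered)
```

The `len(w) > 1` test removes punctuation tokens, but it also removes one-letter words such as "é", which may or may not be desirable.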

#### Number of sentences and number of sentences longer than 250 characters

In [70]:
essay_df['sentences'] = essay_df.apply(lambda r: nltk.sent_tokenize(r['text']), axis = 1)

In [72]:
essay_df['sent_cnt'] = essay_df.apply(lambda r: len(r['sentences']), axis = 1)

In [77]:
essay_df['long_sent_cnt'] = essay_df.apply(lambda r: len([s for s in r['sentences'] if len(s) > 250]), axis = 1)

#### Average sentence length (in tokens)

In [79]:
# Convert to float before dividing; otherwise Python 2 performs integer division.
essay_df['avg_sent_len'] = essay_df.apply(lambda r: float(r['token_cnt']) / r['sent_cnt'], axis = 1)
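
The three sentence-level features can be sketched on a toy text as follows. The splitter `naive_sent_tokenize` is a hypothetical, much cruder stand-in for `nltk.sent_tokenize` (it only splits after `.`, `!` or `?` followed by whitespace), used here so the example runs without NLTK data.

```python
import re

def naive_sent_tokenize(text):
    # Crude stand-in for nltk.sent_tokenize: split after ., ! or ?
    # when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

toy_text = 'O comportamento importa. A empresa pode ajudar! Como?'
toy_sentences = naive_sent_tokenize(toy_text)
toy_word_tokens = re.findall(r'\w+|[^\w\s]+', toy_text)

toy_sent_cnt = len(toy_sentences)
toy_long_sent_cnt = len([s for s in toy_sentences if len(s) > 250])

# Force true division and guard against essays with no sentences.
toy_avg_sent_len = float(len(toy_word_tokens)) / toy_sent_cnt if toy_sent_cnt else 0.0

print(toy_sent_cnt, toy_long_sent_cnt, toy_avg_sent_len)
```

The zero-sentence guard is an addition for safety; an empty answer would otherwise raise a division error.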

#### Number of words that appear in both the question and the essay

In [82]:
# Lowercase before the stopword lookup so capitalised stopwords are removed too.
question_token = set([w.lower() for w in nltk.wordpunct_tokenize(question) 
                      if w.lower() not in stopwords and not w.isnumeric() and len(w) > 1])

In [92]:
essay_df['question_tokens'] = essay_df.apply(lambda r: len(set(r['tokens_fld']).intersection(question_token)), axis = 1)
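
The overlap feature is a plain set intersection, so it counts distinct shared words rather than occurrences. A toy illustration with made-up token lists:

```python
# Made-up tokens standing in for the filtered question and essay tokens.
toy_question = {'comportamento', 'trabalho', 'empresa', 'organizações'}
toy_essay = ['comportamento', 'depende', 'pessoa', 'empresa', 'empresa']

# Duplicates in the essay collapse when it is converted to a set.
toy_overlap = set(toy_essay).intersection(toy_question)
toy_question_tokens = len(toy_overlap)

print(sorted(toy_overlap), toy_question_tokens)
```

Because the match is on exact word forms, inflected variants (for example "empresas") would not count unless the tokens were stemmed first.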

### Binary outcome

In [100]:
essay_df['pass'] = np.where(essay_df['average_score'] >= 2, 1,0)
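
A quick sketch of the thresholding on a few made-up scores. It also computes the share of passing essays, which is worth checking before training because a strongly unbalanced label makes plain accuracy hard to interpret.

```python
import numpy as np

# Made-up scores for illustration only.
toy_scores = np.array([2.5, 1.5, 2.0, 0.0, 3.0])

# A score of 2 or more counts as a pass, as in the cell above.
toy_pass = np.where(toy_scores >= 2, 1, 0)

# Fraction of essays labelled as passing.
toy_pass_rate = toy_pass.mean()

print(toy_pass, toy_pass_rate)
```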

### Train/test split


Use Chris' split method: shuffle the rows, then take the first 1000 essays for training and the rest for testing.

In [212]:
from sklearn.utils import shuffle

essay_df = shuffle(essay_df)


In [213]:
train_df = essay_df[:1000]

In [214]:
test_df = essay_df[1000:]
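
As written, the shuffle produces a different split on every run. A minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable; the toy frame and the names `toy_df`, `toy_train` and `toy_test` are made up for illustration.

```python
import pandas as pd
from sklearn.utils import shuffle

# Made-up frame standing in for essay_df.
toy_df = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e', 'f'],
                       'pass': [1, 0, 1, 1, 0, 1]})

n_train = 4  # the notebook uses 1000

# A fixed seed makes the row order, and therefore the split, repeatable.
toy_shuffled = shuffle(toy_df, random_state=0)
toy_train = toy_shuffled[:n_train]
toy_test = toy_shuffled[n_train:]

print(len(toy_train), len(toy_test))
```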

### Create term-document matrix

In [133]:
X_train = np.array(train_df['text'])
X_test = np.array(test_df['text'])

In [221]:
y_train = train_df['pass']
y_test = test_df['pass']

In [121]:
from sklearn.feature_extraction.text import CountVectorizer

#### Use a count vectorizer that ignores terms with document frequency < 5, uses word n-grams of length 1 to 3, and excludes stopwords

In [123]:
vect = CountVectorizer(min_df=5, ngram_range=(1,3), stop_words = stopwords).fit(X_train)

In [124]:
X_train_vec = vect.transform(X_train)


In [127]:
X_test_vec = vect.transform(X_test)
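
To see how `min_df` and `ngram_range` interact, here is a toy sketch with three made-up documents, using `min_df=2` and n-grams up to length 2 so the effect is visible on a tiny corpus. Only terms that occur in at least two documents survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
toy_docs = ['bom comportamento', 'bom comportamento sempre', 'empresa boa']

toy_vect = CountVectorizer(min_df=2, ngram_range=(1, 2)).fit(toy_docs)

# vocabulary_ maps each kept term to its column index.
toy_vocab = sorted(toy_vect.vocabulary_)
toy_matrix = toy_vect.transform(toy_docs)

print(toy_vocab)
print(toy_matrix.toarray())
```

The third document shares no frequent term with the others, so its row is all zeros. The same can happen to short test essays when the vocabulary is fitted on the training set only.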

In [147]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1000)

In [148]:
clf.fit(X_train_vec, y_train)
predictions = clf.predict(X_test_vec)

In [153]:
from sklearn.ensemble import RandomForestClassifier

In [154]:
clf = RandomForestClassifier(n_estimators=500)

In [155]:
clf.fit(X_train_vec, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [156]:
predictions = clf.predict(X_test_vec)

In [159]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [160]:
# Use the TfidfVectorizer imported above; the original cell re-created the same CountVectorizer.
vect = TfidfVectorizer(min_df=5, ngram_range=(1,3), stop_words = stopwords).fit(X_train)

In [161]:
X_train_vec = vect.transform(X_train)
X_test_vec = vect.transform(X_test)
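
For comparison, a toy sketch of how `TfidfVectorizer` output differs from raw counts on a made-up corpus. With the default settings each non-empty row is scaled to unit L2 norm, so long essays no longer get larger feature values just because of their length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration only.
toy_corpus = ['empresa empresa comportamento',
              'comportamento trabalho',
              'trabalho empresa pessoa']

toy_counts = CountVectorizer().fit_transform(toy_corpus)
toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)

# Default norm='l2' means every non-empty row has Euclidean length 1.
toy_row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

print(toy_counts.toarray())
print(toy_row_norms)
```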

In [162]:
clf = RandomForestClassifier(n_estimators=500)

In [163]:
clf.fit(X_train_vec, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [164]:
predictions = clf.predict(X_test_vec)

In [167]:
clf = LogisticRegression(C = 1)

In [169]:
clf.fit(X_train_vec, y_train)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [170]:
predictions = clf.predict(X_test_vec)
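
The cells above store `predictions` but do not show how they were scored. One possible way to compare them with `y_test`, sketched here on made-up label vectors, is accuracy together with a confusion matrix; the choice of metrics is an assumption, not something taken from the notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels standing in for y_test and predictions.
toy_y_true = [1, 0, 1, 1, 0]
toy_y_pred = [1, 0, 0, 1, 1]

toy_acc = accuracy_score(toy_y_true, toy_y_pred)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
toy_cm = confusion_matrix(toy_y_true, toy_y_pred, labels=[0, 1])

print(toy_acc)
print(toy_cm)
```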

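In [None]:
from scipy.sparse import csr_matrix, hstack

# ASSUMPTION: DIGIT and add_feature are not defined anywhere in this notebook.
# The two definitions below are a plausible reconstruction, not the original code.
DIGIT = set('0123456789')

def add_feature(X, feature_to_add):
    # Append dense feature columns to the right of the sparse matrix X.
    return hstack([X, csr_matrix(np.column_stack(feature_to_add))], 'csr')

vect = CountVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
# X_train and X_test are NumPy arrays, so wrap them in Series to use .apply.
train_text, test_text = pd.Series(X_train), pd.Series(X_test)

X_train_vec = vect.transform(X_train)
train_len = train_text.apply(len)
train_digit = train_text.apply(lambda r: len([c for c in r if c in DIGIT]))
train_nonword = train_text.apply(lambda r: len(re.findall(r'\W', r)))
X_train_vec1 = add_feature(X_train_vec, [train_len, train_digit, train_nonword])

X_test_vec = vect.transform(X_test)
test_len = test_text.apply(len)
test_digit = test_text.apply(lambda r: len([c for c in r if c in DIGIT]))
test_nonword = test_text.apply(lambda r: len(re.findall(r'\W', r)))
X_test_vec1 = add_feature(X_test_vec, [test_len, test_digit, test_nonword])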
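
A toy illustration of the kind of column appending this cell relies on: extra per-document features are stacked to the right of a sparse term matrix. The helper name `append_columns` and all values are hypothetical and only serve to show the resulting shape.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def append_columns(X, columns):
    # Turn the list of 1-D feature vectors into an (n_samples, n_features)
    # block and stack it to the right of the sparse matrix X.
    extra = csr_matrix(np.column_stack(columns))
    return hstack([X, extra], 'csr')

# Made-up term matrix with 3 documents and 2 terms.
toy_X = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
toy_lengths = [10, 20, 30]
toy_digits = [0, 1, 2]

toy_X_plus = append_columns(toy_X, [toy_lengths, toy_digits])

print(toy_X_plus.shape)
print(toy_X_plus.toarray())
```

Character lengths are usually much larger than n-gram counts, so scaling the appended columns may help a linear model such as logistic regression.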
    

In [177]:
from sklearn.ensemble import RandomForestRegressor

In [222]:
z_train = train_df.average_score
z_test = test_df.average_score

In [179]:
clf = RandomForestRegressor(n_estimators=500)
clf.fit(X_train_vec, z_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [180]:
predictions = clf.predict(X_test_vec)

In [201]:
features = ['chr_cnt', 'token_cnt', 'token_cnt_fld', 'sent_cnt', 'long_sent_cnt', 'avg_sent_len', 'question_tokens']

In [223]:
X_train1 = train_df[features].values

In [224]:
X_test1 = test_df[features].values

In [225]:
clf.fit(X_train1, z_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [226]:
predictions = clf.predict(X_test1)
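
The regression predictions are not scored in the notebook either. A hedged sketch of one way to do it, on made-up values standing in for `z_test` and `predictions`: mean absolute error, root mean squared error, and the pass/fail label obtained by applying the same threshold of 2 to the predicted score.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values standing in for z_test and predictions.
toy_z_true = np.array([2.5, 1.5, 3.0])
toy_z_pred = np.array([2.0, 2.0, 3.0])

toy_mae = mean_absolute_error(toy_z_true, toy_z_pred)

# Take the square root of the MSE to get an error in score units.
toy_rmse = np.sqrt(mean_squared_error(toy_z_true, toy_z_pred))

# Threshold the predicted score to compare against the binary outcome.
toy_pass_pred = np.where(toy_z_pred >= 2, 1, 0)

print(toy_mae, toy_rmse, toy_pass_pred)
```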