# Artificial Neural Network for Automatic Short Answer Grading
Automatic Short Answer Grading (ASAG) is the task of building a system that automatically assigns a class label reflecting the quality of a short answer (correct or incorrect). The aim of this project is to take a question from the SciEntsBank dataset and train an Artificial Neural Network (ANN) that classifies the students' answers. The architecture is based on the model of [Alikaniotis et al. (2016)](https://arxiv.org/abs/1606.04289).

# Work structure
## Project statement
Neural networks for automatic short answer grading
## Experimental design with different parameters (independent variable, factor)
Values of the independent variable by factor: each experimental setup is one level of the independent variable.

- Train word embeddings
- Start with a perceptron and end with an LSTM network (look at other possible architectures)
- Combine with other methods (SVM...)

## Define the performance measure (dependent variable)
AUC (ROC), inter-rater agreement, correlation coefficients.
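
As a rough illustration of these measures, the sketch below computes ROC AUC, Cohen's kappa (one common inter-rater agreement statistic) and Pearson's correlation in plain Python. The data and function names are made up for the example; in practice `sklearn.metrics.roc_auc_score`, `sklearn.metrics.cohen_kappa_score` and `scipy.stats.pearsonr` would likely be used instead.

```python
from math import sqrt


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters who labelled the same items."""
    n = float(len(rater_a))
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (observed - expected) / (1.0 - expected)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    mx, my = sum(x) / float(len(x)), sum(y) / float(len(y))
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical data: human grades, model scores, and thresholded model grades
human = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(human, scores)
kappa = cohen_kappa(human, predicted)
corr = pearson_r(human, scores)
```

AUC and the correlation work on the continuous scores, while kappa compares the thresholded predictions with the human grades, so the three numbers answer slightly different questions.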


In [5]:
# Libraries
import xml.etree.ElementTree as ET
#import sswe
import pandas as pd
import numpy as np
import keras
import theano
#import nltk
import matplotlib.pyplot as plt
%matplotlib inline


# ASOBEK Embeddings
#from nltk.tokenize import word_tokenize
from sklearn.cross_validation import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score as f1
from sklearn import cross_validation
import re
#from nltk.util import ngrams
from collections import Counter
from sklearn.preprocessing import StandardScaler



## Load Data


In [6]:
tree = ET.parse('PS-inv1-2a_ScientisTrain.xml')
root = tree.getroot()
question=root[0].text
grade=[branch.attrib["accuracy"] for branch in root[2]]
answers_st=[branch.text for branch in root[2]]
answers_ref=[branch.text for branch in root[1]]  # If there is more than one reference answer, this is a list of answers; merge them before processing

In [7]:
califs=[1 if resp=="correct" else 0 for resp in grade]
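
The loading steps above can be sketched on a small inline document. The tag names and the second reference answer below are assumptions made for illustration (only the positional layout and the `accuracy` attribute come from the notebook), and pairing every student answer with every reference answer is just one possible way to handle several references.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: question text, then reference answers, then student answers
SAMPLE_XML = """
<question>
  <questionText>Why does a rubber band make a sound when you pluck it?</questionText>
  <referenceAnswers>
    <referenceAnswer>The rubber band vibrates.</referenceAnswer>
    <referenceAnswer>Plucking makes the band vibrate.</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer accuracy="correct">Because it vibrates.</studentAnswer>
    <studentAnswer accuracy="incorrect">Because it is sticky!</studentAnswer>
  </studentAnswers>
</question>
"""


def load_question(xml_text):
    """Return the question, its reference answers and (reference, answer, label) triples."""
    root = ET.fromstring(xml_text.strip())
    question = root[0].text
    references = [ref.text for ref in root[1]]
    answers = [ans.text for ans in root[2]]
    labels = [1 if ans.attrib["accuracy"] == "correct" else 0 for ans in root[2]]
    # One row per (reference, student answer) combination
    pairs = [(ref, ans, label) for ans, label in zip(answers, labels) for ref in references]
    return question, references, pairs


question_text, references, pairs = load_question(SAMPLE_XML)
```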

In [8]:
question

'Why does a rubber band make a sound when you pluck it (pull and let go quickly)?'

In [9]:
answers_ref

['The rubber band vibrates.']

In [10]:
len(answers_st)
# Repeat the single reference answer once per student answer
anwsr_refArry=[answers_ref[0] for answer in answers_st]

In [11]:
pd.DataFrame(np.column_stack([answers_st,grade]),columns=["Respuesta","Calif"])[:15]
#answers_st[:25]

| | Respuesta | Calif |
|---|---|---|
| 0 | Because it is sticky! | incorrect |
| 1 | Because it hits the other side. | incorrect |
| 2 | Because vibration. | correct |
| 3 | It makes the plucking sound from stretching an... | correct |
| 4 | Because if you stretch it and let it go it mak... | incorrect |
| 5 | Plucking is pull. | incorrect |
| 6 | Because it hits the other part of the rubber b... | incorrect |
| 7 | Because it vibrates. | correct |
| 8 | Because your making vibrations. | correct |
| 9 | Because it vibrates. | correct |


## Preprocessing
- stemming
- lowercasing
- tf-idf (optional)

- http://www.nltk.org/book/ch03.html
- https://de.dariah.eu/tatom/preprocessing.html

- Query expansion (WordNet)
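
A minimal sketch of the lowercasing and tf-idf steps, using only the standard library. The tokenizer regex, the function names and the `tf * log(N / df)` weighting are choices made for this example, not necessarily what the final pipeline uses. Stemming is left out here; NLTK's `PorterStemmer` could be applied to each token if NLTK is installed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def tfidf(documents):
    """Return one {token: weight} dict per document, weighting by tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({tok: counts[tok] * math.log(n_docs / float(doc_freq[tok]))
                        for tok in counts})
    return weights


# Three student answers taken from the table above
sample_answers = ["Because it vibrates.", "Because it is sticky!", "Because vibration."]
sample_weights = tfidf(sample_answers)
```

With this weighting, a token that appears in every answer (here "because") gets weight zero, which is the intended effect of the idf term.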


## Baseline methods
- word count with word2vec (pretrained on Google News)
- ASOBEK
- c&w (???)

### word2vec Embeddings

In [12]:
import gensim, logging

ImportError: No module named gensim

In [13]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.Word2Vec([answer.lower().split() for answer in answers_st])  # Word2Vec expects tokenized sentences, not raw strings

NameError: name 'logging' is not defined

In [14]:
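sentences = [answer.lower().split() for answer in answers_st]  # Word2Vec expects lists of tokens, not raw strings
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(sentences)  # can be a non-repeatable, 1-pass generator
model.train(sentences)  # can be a non-repeatable, 1-pass generator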

NameError: name 'gensim' is not defined

### ASOBEK Embedding


ASOBEK feature extractor

In [15]:
def encode_asobek(dataA, dataB):
    '''
    Takes two aligned lists of sentences (here, arrays of answers) and returns the ASOBEK features
    
    Arguments
    --
    dataA: List containing the source sentences of paraphrasing
    dataB: List containing the candidate sentences of paraphrasing
    
    Returns
    --
    [Unigram word features, Bigram word features, Unigram character features, Bigram character features]
    '''
    if len(dataA) != len(dataB):
        print('Check length of your data')
        return
    features = []
    def get_cardinalities(ngramA, ngramB):
        vector = []
        vector.append(union(ngramA, ngramB))
        vector.append(intersect(ngramA, ngramB))
        vector.append(set(ngramA))
        vector.append(set(ngramB))
        return vector
    
    for i in range(len(dataA)):
        unigram_1 = get_wordngram(dataA[i],1)
        unigram_2 = get_wordngram(dataB[i],1)
        bigram_1 = get_wordngram(dataA[i],2)
        bigram_2 = get_wordngram(dataB[i],2)
        unigram_c_1 = get_characterngram(dataA[i],1)
        unigram_c_2 = get_characterngram(dataB[i],1)
        bigram_c_1 = get_characterngram(dataA[i],2)
        bigram_c_2 = get_characterngram(dataB[i],2)
        # Use a separate comprehension variable so the loop index is not shadowed
        w1 = [len(s) for s in get_cardinalities(unigram_1, unigram_2)]
        w2 = [len(s) for s in get_cardinalities(bigram_1, bigram_2)]
        c1 = [len(s) for s in get_cardinalities(unigram_c_1, unigram_c_2)]
        c2 = [len(s) for s in get_cardinalities(bigram_c_1, bigram_c_2)]
        features.append([w1, w2, c1, c2])
    return features

And these are some helpers for encode_asobek

In [16]:
def union(list1, list2):
    '''Distinct items of the multiset union of two lists'''
    merged = Counter(list1) | Counter(list2)
    return set(merged.elements())

def intersect(list1, list2):
    '''Items of the multiset intersection of two lists, repeats included'''
    inter = Counter(list1) & Counter(list2)
    return list(inter.elements())

def get_characterngram(string, n):
    '''Returns n-grams of characters'''
    char1 = [c for c in string]
    return list(ngrams(char1, n))

def get_wordngram(string, n):
    '''Returns n-grams of words'''
    words = word_tokenize(string)
    return list(ngrams(words, n))
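
For reference, the same idea can be sketched without NLTK. This is a simplified variant under stated assumptions, not the notebook's exact extractor: it tokenizes on whitespace, uses plain sets for every cardinality (the helpers above use a multiset intersection), and returns one flat list of 16 numbers per pair instead of four nested lists. The function names are made up for the example.

```python
def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def cardinalities(grams_a, grams_b):
    """Set-based sizes [|A union B|, |A intersect B|, |A|, |B|]."""
    a, b = set(grams_a), set(grams_b)
    return [len(a | b), len(a & b), len(a), len(b)]


def asobek_features(reference, answer):
    """Word uni/bigram then character uni/bigram cardinalities for one sentence pair."""
    reference, answer = reference.lower(), answer.lower()
    units = [
        (reference.split(), answer.split()),  # word level
        (list(reference), list(answer)),      # character level
    ]
    features = []
    for units_a, units_b in units:
        for n in (1, 2):
            features.extend(cardinalities(ngrams(units_a, n), ngrams(units_b, n)))
    return features


example = asobek_features("The rubber band vibrates.", "Because it vibrates.")
```

Because whitespace splitting keeps punctuation attached to words ("vibrates." rather than "vibrates"), a real tokenizer would give slightly different word-level counts.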

#### Compute ASOBEK features

In [17]:
dataA, dataB = [], []
#decode("utf-8") is really important
#Preprocessing: Only lowercase all the words
dataA = [' '.join(word_tokenize(x.decode("utf-8").lower())) for x in anwsr_refArry]  # Reference answer, repeated once per student answer
dataB = [' '.join(word_tokenize(x.decode("utf-8").lower())) for x in answers_st]  # Student answers
X_train = encode_asobek(dataA, dataB)

NameError: name 'word_tokenize' is not defined

In [18]:
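# keras tokenizer (the text submodule must be imported explicitly)
import keras.preprocessing.text

In [19]:
# The constructor takes options, not text; fit the tokenizer on the list of answers instead
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(anwsr_refArry)
#dataA = [keras.preprocessing.text.Tokenizer(text=x) for x in anwsr_refArry]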

In [16]:
# Import the class from its submodule; `import keras` alone does not expose keras.preprocessing.text
from keras.preprocessing.text import Tokenizer
tk = Tokenizer()
texts = ['I love you.', 'I love you, too.']
tk.fit_on_texts(texts)
tk.texts_to_matrix(texts, mode='tfidf')

# ANN for classification
### Long short-term memory network


## Validation