# **English Language Learners (ELLs)** Vocabulary Score prediction 

## An **NLP** and **ML** approach

### The dataset presented here comprises argumentative essays written by 8th-12th grade English Language Learners (ELLs). The essays have been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.

### Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5. Our task is to predict the score of each of the six measures for the essays given in the test set.

In [68]:
import numpy as np
import pandas as pd

import nltk


In [69]:
ELL_train = pd.read_csv("../input/feedback-prize-english-language-learning/train.csv")

In [70]:
ELL_train_text = ELL_train.full_text
ELL_train_text

0       I think that students would benefit from learn...
1       When a problem is a change you have to let it ...
2       Dear, Principal\n\nIf u change the school poli...
3       The best time in life is when you become yours...
4       Small act of kindness can impact in other peop...
                              ...                        
3906    I believe using cellphones in class for educat...
3907    Working alone, students do not have to argue w...
3908    "A problem is a chance for you to do your best...
3909    Many people disagree with Albert Schweitzer's ...
3910    Do you think that failure is the main thing fo...
Name: full_text, Length: 3911, dtype: object

In [71]:
# Each line is one essay.

for i, line in enumerate(ELL_train_text):
    if i >= 10:  # Let's take a look at the first 10 essays.
        break
    print(str(i) + ':\t' + line+"\n\n")

0:	I think that students would benefit from learning at home,because they wont have to change and get up early in the morning to shower and do there hair. taking only classes helps them because at there house they'll be pay more attention. they will be comfortable at home.

The hardest part of school is getting ready. you wake up go brush your teeth and go to your closet and look at your cloths. after you think you picked a outfit u go look in the mirror and youll either not like it or you look and see a stain. Then you'll have to change. with the online classes you can wear anything and stay home and you wont need to stress about what to wear.

most students usually take showers before school. they either take it before they sleep or when they wake up. some students do both to smell good. that causes them do miss the bus and effects on there lesson time cause they come late to school. when u have online classes u wont need to miss lessons cause you can get everything set up and go tak

In [72]:
single_no8 = ELL_train_text[8]
print(single_no8)

positive attitude is the key to success. I agree because you can do anything as long as you put your mind and soul into it and motivation you can accomplish it. Then so by doing it you feel good about yourself you'll feel unstable. But do what brings the best in you, what makes you yourself.

One way that importance of attitude is key to success is it motivates you to keep going forward. For example when you come across some difficulties you wont feel discouraged because your the limit and none other then you can or will change it. But the more discouraged you tend to feel at that moment you would wanna overcome it even if it makes whatever just as long as your mind is set to positive attitude you will accomplish it. Then so always remember what you put your mind to such as determination, positive attitude, willing you can its possible you will reach your goals. However you will with a positive mindset have a strong determination and you will build up self confidence in yourself.

Then

## Tokenization on Train Data


In [73]:
from nltk import sent_tokenize, word_tokenize

In [74]:
sent_tokenize(single_no8)  # breaks the essay into sentences

['positive attitude is the key to success.',
 'I agree because you can do anything as long as you put your mind and soul into it and motivation you can accomplish it.',
 "Then so by doing it you feel good about yourself you'll feel unstable.",
 'But do what brings the best in you, what makes you yourself.',
 'One way that importance of attitude is key to success is it motivates you to keep going forward.',
 'For example when you come across some difficulties you wont feel discouraged because your the limit and none other then you can or will change it.',
 'But the more discouraged you tend to feel at that moment you would wanna overcome it even if it makes whatever just as long as your mind is set to positive attitude you will accomplish it.',
 'Then so always remember what you put your mind to such as determination, positive attitude, willing you can its possible you will reach your goals.',
 'However you will with a positive mindset have a strong determination and you will build up s

In [75]:
for sent in sent_tokenize(single_no8):
    print(word_tokenize(sent)) 
    
# breaks essay into words per sentence

['positive', 'attitude', 'is', 'the', 'key', 'to', 'success', '.']
['I', 'agree', 'because', 'you', 'can', 'do', 'anything', 'as', 'long', 'as', 'you', 'put', 'your', 'mind', 'and', 'soul', 'into', 'it', 'and', 'motivation', 'you', 'can', 'accomplish', 'it', '.']
['Then', 'so', 'by', 'doing', 'it', 'you', 'feel', 'good', 'about', 'yourself', 'you', "'ll", 'feel', 'unstable', '.']
['But', 'do', 'what', 'brings', 'the', 'best', 'in', 'you', ',', 'what', 'makes', 'you', 'yourself', '.']
['One', 'way', 'that', 'importance', 'of', 'attitude', 'is', 'key', 'to', 'success', 'is', 'it', 'motivates', 'you', 'to', 'keep', 'going', 'forward', '.']
['For', 'example', 'when', 'you', 'come', 'across', 'some', 'difficulties', 'you', 'wont', 'feel', 'discouraged', 'because', 'your', 'the', 'limit', 'and', 'none', 'other', 'then', 'you', 'can', 'or', 'will', 'change', 'it', '.']
['But', 'the', 'more', 'discouraged', 'you', 'tend', 'to', 'feel', 'at', 'that', 'moment', 'you', 'would', 'wan', 'na', 'over

In [76]:
for sent in sent_tokenize(single_no8):
    print([word.lower() for word in word_tokenize(sent)])

# We convert every word to a single case (lower here; upper would work just as well)

['positive', 'attitude', 'is', 'the', 'key', 'to', 'success', '.']
['i', 'agree', 'because', 'you', 'can', 'do', 'anything', 'as', 'long', 'as', 'you', 'put', 'your', 'mind', 'and', 'soul', 'into', 'it', 'and', 'motivation', 'you', 'can', 'accomplish', 'it', '.']
['then', 'so', 'by', 'doing', 'it', 'you', 'feel', 'good', 'about', 'yourself', 'you', "'ll", 'feel', 'unstable', '.']
['but', 'do', 'what', 'brings', 'the', 'best', 'in', 'you', ',', 'what', 'makes', 'you', 'yourself', '.']
['one', 'way', 'that', 'importance', 'of', 'attitude', 'is', 'key', 'to', 'success', 'is', 'it', 'motivates', 'you', 'to', 'keep', 'going', 'forward', '.']
['for', 'example', 'when', 'you', 'come', 'across', 'some', 'difficulties', 'you', 'wont', 'feel', 'discouraged', 'because', 'your', 'the', 'limit', 'and', 'none', 'other', 'then', 'you', 'can', 'or', 'will', 'change', 'it', '.']
['but', 'the', 'more', 'discouraged', 'you', 'tend', 'to', 'feel', 'at', 'that', 'moment', 'you', 'would', 'wan', 'na', 'over

In [77]:
print(word_tokenize(single_no8))  # Treats the whole essay as one line.

['positive', 'attitude', 'is', 'the', 'key', 'to', 'success', '.', 'I', 'agree', 'because', 'you', 'can', 'do', 'anything', 'as', 'long', 'as', 'you', 'put', 'your', 'mind', 'and', 'soul', 'into', 'it', 'and', 'motivation', 'you', 'can', 'accomplish', 'it', '.', 'Then', 'so', 'by', 'doing', 'it', 'you', 'feel', 'good', 'about', 'yourself', 'you', "'ll", 'feel', 'unstable', '.', 'But', 'do', 'what', 'brings', 'the', 'best', 'in', 'you', ',', 'what', 'makes', 'you', 'yourself', '.', 'One', 'way', 'that', 'importance', 'of', 'attitude', 'is', 'key', 'to', 'success', 'is', 'it', 'motivates', 'you', 'to', 'keep', 'going', 'forward', '.', 'For', 'example', 'when', 'you', 'come', 'across', 'some', 'difficulties', 'you', 'wont', 'feel', 'discouraged', 'because', 'your', 'the', 'limit', 'and', 'none', 'other', 'then', 'you', 'can', 'or', 'will', 'change', 'it', '.', 'But', 'the', 'more', 'discouraged', 'you', 'tend', 'to', 'feel', 'at', 'that', 'moment', 'you', 'would', 'wan', 'na', 'overcome',
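To make the idea of tokenization concrete, here is a minimal regex-based sketch that uses only the standard library. It is a simplified stand-in, not NLTK's actual algorithm: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helpers, and unlike `word_tokenize` they keep contractions such as `you'll` as a single token.

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    # Words (optionally with an internal apostrophe) or single punctuation marks.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]", sentence)

sample = "positive attitude is the key to success. I agree because you can do anything."
for sent in simple_sent_tokenize(sample):
    print([tok.lower() for tok in simple_word_tokenize(sent)])
```

A rule this naive will mis-split abbreviations such as "e.g. this", which is one reason we rely on NLTK's tokenizers in the notebook itself.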

##  Stop Words 

### We keep them in the final model, as removing them decreases the measure scores

In [78]:
from nltk.corpus import stopwords

In [79]:
stopwords_en = stopwords.words('english')
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [80]:
single_no8_tokenized_lowered = list(map(str.lower, word_tokenize(single_no8)))
print(single_no8_tokenized_lowered)

['positive', 'attitude', 'is', 'the', 'key', 'to', 'success', '.', 'i', 'agree', 'because', 'you', 'can', 'do', 'anything', 'as', 'long', 'as', 'you', 'put', 'your', 'mind', 'and', 'soul', 'into', 'it', 'and', 'motivation', 'you', 'can', 'accomplish', 'it', '.', 'then', 'so', 'by', 'doing', 'it', 'you', 'feel', 'good', 'about', 'yourself', 'you', "'ll", 'feel', 'unstable', '.', 'but', 'do', 'what', 'brings', 'the', 'best', 'in', 'you', ',', 'what', 'makes', 'you', 'yourself', '.', 'one', 'way', 'that', 'importance', 'of', 'attitude', 'is', 'key', 'to', 'success', 'is', 'it', 'motivates', 'you', 'to', 'keep', 'going', 'forward', '.', 'for', 'example', 'when', 'you', 'come', 'across', 'some', 'difficulties', 'you', 'wont', 'feel', 'discouraged', 'because', 'your', 'the', 'limit', 'and', 'none', 'other', 'then', 'you', 'can', 'or', 'will', 'change', 'it', '.', 'but', 'the', 'more', 'discouraged', 'you', 'tend', 'to', 'feel', 'at', 'that', 'moment', 'you', 'would', 'wan', 'na', 'overcome',

In [81]:
stopwords_en = set(stopwords.words('english')) # Set checking is faster in Python than list.

# List comprehension with stop words elimination.
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en])

['positive', 'attitude', 'key', 'success', '.', 'agree', 'anything', 'long', 'put', 'mind', 'soul', 'motivation', 'accomplish', '.', 'feel', 'good', "'ll", 'feel', 'unstable', '.', 'brings', 'best', ',', 'makes', '.', 'one', 'way', 'importance', 'attitude', 'key', 'success', 'motivates', 'keep', 'going', 'forward', '.', 'example', 'come', 'across', 'difficulties', 'wont', 'feel', 'discouraged', 'limit', 'none', 'change', '.', 'discouraged', 'tend', 'feel', 'moment', 'would', 'wan', 'na', 'overcome', 'even', 'makes', 'whatever', 'long', 'mind', 'set', 'positive', 'attitude', 'accomplish', '.', 'always', 'remember', 'put', 'mind', 'determination', ',', 'positive', 'attitude', ',', 'willing', 'possible', 'reach', 'goals', '.', 'however', 'positive', 'mindset', 'strong', 'determination', 'build', 'self', 'confidence', '.', 'another', 'reason', 'grow', 'individually', 'hard', 'work', '.', 'positive', 'attitude', 'ise', "n't", 'quite', 'easiest', 'people', 'really', 'difficult', 'others', 'l

### Punctuation

#### We keep it in the final model, as removing it decreases the measure scores

In [82]:
from string import punctuation

# It's a string, so we have to convert it into a set

print('From string.punctuation:', type(punctuation), punctuation)

From string.punctuation: <class 'str'> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [83]:
stopwords_en_withpunct = stopwords_en.union(set(punctuation)) # adding punctuations to stop words
print(stopwords_en_withpunct)

{'who', 'hasn', 'both', 'him', "that'll", 'all', ',', 'in', 'more', 'wouldn', 'you', 'y', 'haven', ';', 'is', 'our', 'aren', 'were', '!', "doesn't", 'any', 'while', 'my', 'only', "wasn't", 'not', 'they', 'weren', 'through', 've', 'hers', 'being', 'has', 'then', "didn't", 'further', 'your', "'", 'their', '`', '~', 'are', '%', 'where', 'd', 'on', "isn't", 'yours', '*', 'just', "mustn't", 'it', 'very', 'other', 'off', "you'd", 'have', 'her', 'between', "weren't", 'to', 'those', 'such', 'from', "won't", 'down', 'yourself', 'them', 'ma', 'the', ']', 'most', 't', 'hadn', 'he', 'once', 'do', 'been', "mightn't", 'these', 'here', 'no', "wouldn't", 'shan', 'because', "you've", 'after', 'against', ')', 'about', 'needn', 'until', 'what', 'am', 'does', '"', 'but', '[', 'had', 'doing', '=', "should've", 'ourselves', '+', 'above', 'up', 'will', 'why', '/', 'mightn', '^', 'out', 'a', 'its', 'this', 're', 'doesn', 'theirs', 'few', '?', "she's", 'ain', 'each', "you're", 'so', 'his', "aren't", "couldn't"

In [84]:
# Print the essay without stop words and punctuation

print([word for word in single_no8_tokenized_lowered if word not in stopwords_en_withpunct])

['positive', 'attitude', 'key', 'success', 'agree', 'anything', 'long', 'put', 'mind', 'soul', 'motivation', 'accomplish', 'feel', 'good', "'ll", 'feel', 'unstable', 'brings', 'best', 'makes', 'one', 'way', 'importance', 'attitude', 'key', 'success', 'motivates', 'keep', 'going', 'forward', 'example', 'come', 'across', 'difficulties', 'wont', 'feel', 'discouraged', 'limit', 'none', 'change', 'discouraged', 'tend', 'feel', 'moment', 'would', 'wan', 'na', 'overcome', 'even', 'makes', 'whatever', 'long', 'mind', 'set', 'positive', 'attitude', 'accomplish', 'always', 'remember', 'put', 'mind', 'determination', 'positive', 'attitude', 'willing', 'possible', 'reach', 'goals', 'however', 'positive', 'mindset', 'strong', 'determination', 'build', 'self', 'confidence', 'another', 'reason', 'grow', 'individually', 'hard', 'work', 'positive', 'attitude', 'ise', "n't", 'quite', 'easiest', 'people', 'really', 'difficult', 'others', 'learn', 'even', 'willing', 'others', 'tend', 'always', 'positive',

## Stronger/longer list of stopwords

In [85]:
# Stopwords from stopwords-json
stopwords_json = {"en":["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into",
"inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try",
"trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]}
stopwords_json_en = set(stopwords_json['en'])
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)
# Combine the stopwords. It's a lot longer, so I'm not printing it out...
stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)

# Remove the stopwords from `single_no8`.
print('With combined stopwords:')
print([word for word in single_no8_tokenized_lowered if word not in stoplist_combined])

With combined stopwords:
['positive', 'attitude', 'key', 'success', 'agree', 'long', 'put', 'mind', 'soul', 'motivation', 'accomplish', 'feel', 'good', "'ll", 'feel', 'unstable', 'brings', 'makes', 'importance', 'attitude', 'key', 'success', 'motivates', 'forward', 'difficulties', 'wont', 'feel', 'discouraged', 'limit', 'change', 'discouraged', 'tend', 'feel', 'moment', 'wan', 'na', 'overcome', 'makes', 'long', 'mind', 'set', 'positive', 'attitude', 'accomplish', 'remember', 'put', 'mind', 'determination', 'positive', 'attitude', 'reach', 'goals', 'positive', 'mindset', 'strong', 'determination', 'build', 'confidence', 'reason', 'grow', 'individually', 'hard', 'work', 'positive', 'attitude', 'ise', "n't", 'easiest', 'people', 'difficult', 'learn', 'tend', 'positive', 'attitude', "'re", 'negative', 'havent', 'give', 'stuck', 'listen', 'thinks', 'focus', 'accomplish', 'dont', 'reflect', 'dont', "'s", 'negative', 'person', 'bring', 'effect', 'ignore', 'dont', 'bet', 'affect', 'dont', 'wan

## Stemming and Lemmatization

Stemming - reducing inflected or derived word forms to a common stem by stripping suffixes with a fixed set of rules. Algorithm based. Faster but less accurate, and the resulting stem is not always a real word.

Lemmatization - goes beyond suffix stripping and uses a vocabulary (dictionary) together with morphological analysis to return the base form (lemma) of a word. Dictionary based. More accurate, but slower.
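The difference between the two approaches can be sketched with a toy example. This is only an illustration under loud assumptions: `toy_stem` is a hypothetical four-rule suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a tiny hand-written lookup table standing in for a real dictionary.

```python
def toy_stem(word):
    # Rule-based: strip the first matching suffix, with no dictionary check.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary-based: a hand-written lemma table (hypothetical, for illustration only).
TOY_LEMMAS = {
    "walking": "walk",
    "walked": "walk",
    "went": "go",
    "studies": "study",
}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown.
    return TOY_LEMMAS.get(word, word)

for w in ["walking", "walked", "went", "studies"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer handles regular forms cheaply but leaves `went` untouched and turns `studies` into the non-word `studi`, while the lookup gets both right at the cost of needing a dictionary.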

In [86]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['walking', 'walks', 'walked']:
    print(porter.stem(word))

walk
walk
walk


In [87]:
## Create a preprocess function (unused: we end up using n-grams, and it relies on a `lemmatize_sent` helper that is not defined in this notebook)

def preprocess_text(text):
    # Input: str, i.e. document/sentence
    # Output: list(str) , i.e. list of lemmas
    return [word if not word.isdigit() else "<digit>" for word in lemmatize_sent(text)]

# Prediction

Now we take the Vocabulary score as the label. Scores take values in {1, 1.5, 2, ..., 4.5, 5}, i.e. the labels are discrete but numeric.

Both regression and classification can be used to predict this type of score. Let us run both and find out which works best.

In [88]:
np.sort(ELL_train.vocabulary.unique())

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

# Classification using Naive Bayes

In [89]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split 

# `train_test_split` splits the data into two parts
# according to the `test_size` argument.

# Here we split the training data into a training set (80%)
# and a validation set (20%).
train, valid = train_test_split(ELL_train, test_size=0.2,random_state=1)

In [90]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer with its default analyzer
# (we do not plug in preprocess_text()).
# Note: the vectorizer is just an 'empty' object now.
count_vect = TfidfVectorizer()

# When we call `TfidfVectorizer.fit_transform`,
# we build the vocabulary and
# vectorize our input text at the same time.
train_set = count_vect.fit_transform(train.full_text)
valid_set = count_vect.transform(valid.full_text)
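To show what a TF-IDF weight looks like, here is a hand-rolled sketch on three made-up documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to our knowledge matches scikit-learn's default settings; treat that as an assumption, and note that the documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Made-up mini corpus, not taken from the dataset.
docs = [
    "positive attitude is the key",
    "the key to success is hard work",
    "attitude matters",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalize so that long and short documents are comparable.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, `positive` appears in no other document and therefore gets the largest weight, while the terms shared with other documents are down-weighted.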

In [91]:
train_tags = train.vocabulary.astype('str')
valid_tags = valid.vocabulary.astype('str')

In [92]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() 

# To train the classifier, simply call .fit()
clf.fit(train_set, train_tags) 

MultinomialNB()

In [93]:
from sklearn.metrics import accuracy_score

# To predict the vocabulary labels,
# we feed the vectorized `valid_set` to .predict()
predictions_valid = clf.predict(valid_set)

print('Vocabulary accuracy = {}'.format(
        accuracy_score(valid_tags, predictions_valid) * 100)
     )

Vocabulary accuracy = 35.24904214559387


## What if regression works better?

In [94]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [95]:
train_tags = train.vocabulary.astype('float') # Regression accepts only numeric labels
valid_tags = valid.vocabulary.astype('float')

In [96]:
train_tags = np.array(train_tags)#.reshape(-1,1) # Regression accepts only in this format
valid_tags = np.array(valid_tags)#.reshape(-1,1)

In [97]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

reg = XGBRegressor().fit(train_set, train_tags)

predictions_valid = list(reg.predict(valid_set))

print((1 - mean_absolute_percentage_error(valid_tags, predictions_valid))*100) # (1 - MAPE) * 100: an accuracy-like percentage, not exact-match accuracy

85.98420840857874
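For reference, the quantity printed above can be written out by hand. The sketch below computes MAPE on a few hypothetical scores (not taken from the dataset) and contrasts `1 - MAPE` with exact-match accuracy after rounding, to show that the two percentages measure different things.

```python
def mape(y_true, y_pred):
    # Mean of |true - predicted| / |true| over all samples.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical scores, for illustration only.
y_true = [3.0, 2.5, 4.0, 3.5]
y_pred = [3.2, 3.0, 3.6, 3.5]

print("1 - MAPE (%):", (1 - mape(y_true, y_pred)) * 100)

# Exact-match accuracy after rounding to the nearest 0.5, for comparison.
rounded = [0.5 * round(p / 0.5) for p in y_pred]
exact = sum(r == t for r, t in zip(rounded, y_true)) / len(y_true)
print("exact-match accuracy (%):", exact * 100)
```

On these made-up numbers `1 - MAPE` is above 90% although only half of the rounded predictions hit the true score exactly, so a high `1 - MAPE` mostly says that predictions are close, not that they are identical to the labels.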


Regression appears to give better results even though the labels are discrete (1, 1.5, 2, ...).

This is plausibly because, while the labels are discrete, they are ordered (5 > 4.5 > 4 > 3.5 ...): a regressor is rewarded for being close to the true score, whereas the classifier treats every wrong label as equally wrong. Note, however, that 1 - MAPE and classification accuracy are different metrics, so the two percentages are not directly comparable.

One thing to note is that regression outputs are continuous, but we must predict Vocabulary scores from the discrete set {1, 1.5, 2, ..., 5}.

Thus, we round our predictions to the nearest valid score (e.g. a predicted 3.7 is rounded to 3.5).
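The rounding step can be sketched in isolation as follows. `round_to_half` uses the scale, round, scale-back expression from the next cell; `round_to_half_up` is a hypothetical alternative included only to show how ties behave, since Python's built-in `round` sends exact halves to the nearest even integer.

```python
import math

def round_to_half(x):
    # Scale so that valid scores become integers, round, then scale back.
    return 0.5 * round(x / 0.5)

def round_to_half_up(x):
    # Alternative that always rounds exact ties upward (an assumed tie rule).
    return 0.5 * math.floor(x / 0.5 + 0.5)

for x in [3.7, 3.2, 2.25, 2.75]:
    print(x, round_to_half(x), round_to_half_up(x))
```

The two versions only differ on exact ties such as 2.25, which a real-valued regressor will rarely produce, so the choice is unlikely to matter much in practice.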

In [98]:
predictions_valid_round_to_5 = ([[0.5 * round(x/0.5)] for x in predictions_valid])
print((1 - mean_absolute_percentage_error(valid_tags, predictions_valid_round_to_5))*100)

86.54756836749173


We see a score *improvement*

We also need to ensure that no score is less than 1 (the minimum) or greater than 5 (the maximum)

In [99]:
predictions_valid_round_to_5 = np.array([[1] if x[0] <1 else [5] if x[0]>5 else x for x in predictions_valid_round_to_5])
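As a possible alternative to the nested list comprehensions, rounding and clamping can be written as one vectorized NumPy expression. This is a sketch on hypothetical raw outputs, and it returns a flat 1-D array rather than the column-shaped array used in this notebook.

```python
import numpy as np

# Hypothetical raw regressor outputs, including out-of-range values.
raw = np.array([0.6, 3.7, 5.4])

# Round to the nearest 0.5, then clamp to the valid score range [1, 5].
scores = np.clip(np.round(raw * 2) / 2, 1.0, 5.0)
print(scores)
```

Here 0.6 is rounded to 0.5 and then raised to the minimum of 1, and 5.4 is rounded to 5.5 and then lowered to the maximum of 5.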

In [100]:
print(max(predictions_valid_round_to_5))
print(min(predictions_valid_round_to_5))

[4.5]
[2.]


In [101]:
print((1 - mean_absolute_percentage_error(valid_tags, predictions_valid_round_to_5))*100)

86.54756836749173


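The score is *unchanged*: all rounded predictions already lie between 2.0 and 4.5, so clamping to [1, 5] does not alter any of them.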

In [102]:
ELL_train = pd.read_csv("../input/feedback-prize-english-language-learning/train.csv")
ELL_test = pd.read_csv("../input/feedback-prize-english-language-learning/test.csv")

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer

count_vect = TfidfVectorizer(ngram_range=(1,2))

train_set = count_vect.fit_transform(ELL_train.full_text)
test_set = count_vect.transform(ELL_test.full_text)

measure_list = ['cohesion','syntax','vocabulary','phraseology','grammar','conventions']

measure_df = pd.DataFrame()

measure_df['text_id'] = ELL_test['text_id']

for measure in measure_list:

    train_tags = ELL_train[measure].astype('float') # Regression accepts only numeric labels

    train_tags = np.array(train_tags) # Regression accepts only in this format

    reg = XGBRegressor().fit(train_set, train_tags)

    predictions_test = reg.predict(test_set)

    predictions_test_round_to_5 = np.array([[0.5 * round(x/0.5)] for x in predictions_test])

    predictions_test_round_to_5 = np.array([[1] if x[0] <1 else [5] if x[0]>5 else x for x in predictions_test_round_to_5])
    
    print(str(measure) + " : " + str(predictions_test_round_to_5))

    measure_df[measure] = predictions_test_round_to_5

cohesion : [[3.]
 [3.]
 [4.]]
syntax : [[2.5]
 [3. ]
 [3.5]]
vocabulary : [[3. ]
 [3. ]
 [3.5]]
phraseology : [[3. ]
 [2.5]
 [3.5]]
grammar : [[2.5]
 [3. ]
 [4. ]]
conventions : [[3.5]
 [3. ]
 [3.5]]
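The `ngram_range=(1,2)` setting passed to the vectorizer in the cell above means that features are built from single words and from pairs of adjacent words. A minimal sketch of that idea, with a hypothetical `word_ngrams` helper and plain whitespace tokenization (the vectorizer's own tokenization differs in details):

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    # All contiguous token sequences of length n_min..n_max, joined by spaces.
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "positive attitude is the key".split()
print(word_ngrams(tokens))
```

Bigrams such as `positive attitude` keep a little word-order information that single words lose, at the cost of a much larger vocabulary.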


In [104]:
measure_df

        text_id  cohesion  syntax  vocabulary  phraseology  grammar  conventions
0  0000C359D63E       3.0     2.5         3.0          3.0      2.5          3.5
1  000BAD50D026       3.0     3.0         3.0          2.5      3.0          3.0
2  00367BB2546B       4.0     3.5         3.5          3.5      4.0          3.5


In [105]:
measure_df.set_index('text_id').to_csv('submission.csv')