# **The Hewlett Foundation Essay Scoring**

### **1. Introduction**

The purpose of this report is to carry out predictive analysis on a dataset using machine learning in Python. The dataset is provided as a single comma-separated values (CSV) file. It was derived from a set of essays and describes the features of each essay as numeric values.

An outline of the report is as follows:

   - Introduction
       - A brief overview
       - Importing necessary libraries
       - Reading the .csv file
       - Investigating the content of the DataFrame
   - Supervised Learning
       - Describing Supervised Machine Learning
       - What is the notion of Labelled Data?
       - What are training and test datasets?
       - Splitting Dataset into input and labelled data
       - Splitting Dataset into training and test datasets
   - Classification
       - Describing binary and multi-class classification
       - Why do we normalise data?
       - Normalisation of the Dataset
       - Describing Support Vector Machine Algorithm (SVM)
       - What is the kernel in SVM?
       - Building a model using SVM with training data
       - Predicting a score using test data
       - Creating a confusion matrix to check accuracy of model
       - What is Quadratic Weighted Kappa (QWK)?
       - Obtaining the QWK
   - Analysis
       - Reading the second .csv file
       - Investigating the content of the DataFrame
       - Separating the input Dataset
       - Normalisation of the Dataset
       - Using model to predict score for essay
       - Creating a new DataFrame of the predicted score
       - Writing the DataFrame to a .csv file
   - Conclusion

Importing necessary libraries

In [146]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Reading the .csv file

In [147]:
df = pd.read_csv('FIT1043-Essay-Features.csv')

Investigating the content of df

In [148]:
df.shape

(1332, 19)

In [149]:
df.head()

|  | essayid | chars | words | commas | apostrophes | punctuations | avg_word_length | sentences | questions | avg_word_sentence | POS | POS/total_words | prompt_words | prompt_words/total_words | synonym_words | synonym_words/total_words | unstemmed | stemmed | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1457 | 2153 | 426 | 14 | 6 | 0 | 5.053991 | 16 | 0 | 26.625 | 423.995272 | 0.995294 | 207 | 0.485915 | 105 | 0.246479 | 424 | 412 | 4 |
| 1 | 503 | 1480 | 292 | 9 | 7 | 0 | 5.068493 | 11 | 0 | 26.545455 | 290.993103 | 0.996552 | 148 | 0.506849 | 77 | 0.263699 | 356 | 345 | 4 |
| 2 | 253 | 3964 | 849 | 19 | 26 | 1 | 4.669022 | 49 | 2 | 17.326531 | 843.990544 | 0.9941 | 285 | 0.335689 | 130 | 0.153121 | 750 | 750 | 4 |
| 3 | 107 | 988 | 210 | 8 | 7 | 0 | 4.704762 | 12 | 0 | 17.5 | 207.653784 | 0.988828 | 112 | 0.533333 | 62 | 0.295238 | 217 | 209 | 3 |
| 4 | 1450 | 3139 | 600 | 13 | 8 | 0 | 5.231667 | 24 | 1 | 25.0 | 594.65215 | 0.991087 | 255 | 0.425 | 165 | 0.275 | 702 | 677 | 4 |


The DataFrame df has 1332 rows and 19 columns. The columns are essayid, chars, words, commas, apostrophes, punctuations, avg_word_length, sentences, questions, avg_word_sentence, POS, POS/total_words, prompt_words, prompt_words/total_words, synonym_words, synonym_words/total_words, unstemmed, stemmed and score. We will use this data to build and evaluate the model.

### **2. Supervised Learning**

Supervised machine learning is usually used for classification and regression problems. In supervised machine learning, the data is labelled, and the algorithm learns from the input data to predict the output. The objective is to approximate the mapping function from input to output well enough that the output can be predicted when only new input is provided.<br>
Labelled data is essential in supervised machine learning because each label is the known correct output for a set of input features. The algorithm learns the relationship between the features and the labels and uses it to predict the output for new data.<br>
In machine learning, the training dataset is used to fit the predictive model. Usually, more training data improves performance on the test data. The test dataset, on the other hand, is held back to evaluate how well the model predicts the labels of instances it has not seen.

We now split the dataset into the input data and its corresponding labelled data. The first 18 columns are intended as the input data and the last column, the score, is the output data.

In [150]:
# Input data: the list index [0, 17] selects only columns 0 and 17 (essayid and stemmed),
# which is why X_train has two columns; a slice such as 0:18 would select all 18 feature columns
X = df.iloc[:, [0, 17]].values
# Labelled data: score
y = df.iloc[:, 18].values

We split the data into a training dataset and a testing dataset, with 75% of the data used for training and 25% for testing.<br>
We use the *sklearn.model_selection.train_test_split* function to split the data.

In [151]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 130
)
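
To make the 75/25 split concrete, the following is a minimal standard-library sketch of a shuffled split over row indices. The helper name `split_indices` is hypothetical, and this is not how *train_test_split* is implemented internally; it only illustrates the idea and the resulting sizes for 1332 rows.

```python
import math
import random

def split_indices(n_rows, test_fraction, seed):
    """Hypothetical helper: shuffle row indices and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)         # reproducible shuffle
    n_test = math.ceil(n_rows * test_fraction)   # round the test size up
    return indices[n_test:], indices[:n_test]    # (train, test)

train_idx, test_idx = split_indices(1332, 0.25, seed=130)
print(len(train_idx), len(test_idx))  # 999 333
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on a row it was trained on.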

In [152]:
X_train

array([[ 659,  495],
       [ 183,  464],
       [ 693,  523],
       ...,
       [1201,  369],
       [1485,  601],
       [ 953,  418]])

### **3. Classification**

In machine learning, binary classification assigns each case to one of two categories, whereas multi-class classification assigns each case to one of three or more categories. Essay scoring here is a multi-class problem, since the score can take more than two values.

Because the range of values in the raw data varies a lot between features, the objective functions of some machine learning algorithms do not work properly without normalisation. For example, many classifiers calculate the distance between two points using the Euclidean distance, which is dominated by any feature with a wide range of values. The features are therefore rescaled so that each one contributes approximately equally to the distance.
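
The effect on the Euclidean distance can be illustrated with a small standard-library sketch. It uses the chars and avg_word_length values of the five essays shown by df.head(); the helper names `standardise` and `distance` are hypothetical, and the population standard deviation is assumed as the scaling factor.

```python
import math
import statistics

# chars and avg_word_length for the five essays shown by df.head()
chars = [2153, 1480, 3964, 988, 3139]
avg_word_length = [5.053991, 5.068493, 4.669022, 4.704762, 5.231667]

def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)

raw = list(zip(chars, avg_word_length))
scaled = list(zip(standardise(chars), standardise(avg_word_length)))

# Distance between the first two essays, before and after scaling
raw_gap = distance(raw[0], raw[1])
scaled_gap = distance(scaled[0], scaled[1])
print(raw_gap)     # about 673, almost exactly the difference in chars alone
print(scaled_gap)  # each feature is now measured in standard deviations
```

Before scaling, the difference in avg_word_length is effectively invisible in the distance because chars is several hundred times larger.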

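We use *StandardScaler()* from *sklearn.preprocessing* to scale the data. It standardises each feature to zero mean and unit variance. The scaling parameters are learned from the training set only and then applied to the test set.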

In [153]:
# Feature Scaling (Normalization)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Support Vector Machines (SVMs) are mostly used for classification problems, although they can also be applied to regression. Regression predicts a continuous output, while classification predicts a categorical one. The basic idea of an SVM is to separate the data into classes by finding the maximum marginal hyperplane (MMH), the boundary with the largest margin between the classes.

An SVM is implemented using a kernel, a function that transforms the input data space into the required form. The technique known as the kernel trick lets the algorithm work as if a low-dimensional input space had been mapped to a higher-dimensional one, where the classes may be easier to separate. This can help in building a more accurate model.
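
As an illustration of the kernel trick, the sketch below checks that a degree-2 polynomial kernel evaluated in the original 2-D space gives the same value as an ordinary dot product in an explicit 3-D feature space. The function names are hypothetical and the example is independent of the model in this report, which uses a linear kernel where no such mapping is involved.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

x, z = (1.0, 2.0), (3.0, 0.5)
implicit = poly_kernel(x, z)                     # no 3-D vectors are built
explicit = dot(feature_map(x), feature_map(z))   # same value via the 3-D map
print(implicit, explicit)  # both are 16 up to floating-point rounding
```

The kernel gives the algorithm the benefit of the higher-dimensional space without ever constructing the mapped vectors.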

As our data is ready, we can now use the training dataset to build a model using the Support Vector Machine classifier (*SVC*).

In [154]:
# Fitting a Support Vector Classifier (SVC) to the training set
# Note: gamma has no effect with the linear kernel (it applies to 'rbf', 'poly' and 'sigmoid')
from sklearn import svm
classifier = svm.SVC(kernel = 'linear', gamma = 10, random_state = 130)
classifier.fit(X_train, y_train)

SVC(gamma=10, kernel='linear', random_state=130)

Now that we have a model built with the Support Vector Machine algorithm, we can test it. First, we pass the testing data to the model and predict the scores.

In [155]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

We now have y_pred, the predicted output. We compare it with y_test, the actual output, to assess how accurate the model's predictions are, using a confusion matrix.

In [156]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[  0,   3,   0,   0,   0,   0],
       [  0,  16,  10,   0,   0,   0],
       [  0,   3,  86,  48,   0,   0],
       [  0,   0,  34, 118,   0,   0],
       [  0,   0,   1,  12,   0,   0],
       [  0,   0,   0,   2,   0,   0]])
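
The overall accuracy can be read off the confusion matrix as the share of predictions on the diagonal. The sketch below copies the matrix above into a plain list of lists (rows are actual scores, columns are predicted scores, following the *confusion_matrix* convention) and computes that share.

```python
# Confusion matrix reported above (rows: actual score, columns: predicted score)
cm = [
    [0, 3, 0, 0, 0, 0],
    [0, 16, 10, 0, 0, 0],
    [0, 3, 86, 48, 0, 0],
    [0, 0, 34, 118, 0, 0],
    [0, 0, 1, 12, 0, 0],
    [0, 0, 0, 2, 0, 0],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)              # all test essays
accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 220 333 0.6607
```

The empty first, fifth and sixth columns show that the model never predicts those three score classes, so every essay with one of those actual scores is misclassified.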

Quadratic Weighted Kappa (QWK) measures the agreement between two sets of ratings. The score usually ranges from 0 to 1: 0 means agreement no better than chance and 1 means complete agreement. The score can also fall below zero, which means there is less agreement than expected by chance. The quadratic weighting penalises each disagreement by the square of the distance between the two ratings, so large errors count more than small ones. Here, QWK is calculated between the predicted and actual scores.

In [157]:
# Using cohen_kappa_score from sklearn.metrics with quadratic weights to obtain the QWK score
from sklearn.metrics import cohen_kappa_score
qwk = cohen_kappa_score(y_test, y_pred, weights="quadratic")
print("QWK_score:", qwk)

QWK_score: 0.6160621467858696
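
As a cross-check, QWK can also be computed by hand from the confusion matrix reported earlier. The function below is a minimal sketch, not the *sklearn* implementation: it weights each cell by the squared distance between actual and predicted class, and compares the observed weighted disagreement with the disagreement expected by chance from the row and column totals. The function name is hypothetical.

```python
def quadratic_weighted_kappa(cm):
    """QWK from a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed += weight * cm[i][j]
            expected += weight * row_sums[i] * col_sums[j] / total
    return 1.0 - observed / expected

# Confusion matrix reported earlier in this section
cm = [
    [0, 3, 0, 0, 0, 0],
    [0, 16, 10, 0, 0, 0],
    [0, 3, 86, 48, 0, 0],
    [0, 0, 34, 118, 0, 0],
    [0, 0, 1, 12, 0, 0],
    [0, 0, 0, 2, 0, 0],
]
qwk_manual = quadratic_weighted_kappa(cm)
print(qwk_manual)  # approximately 0.616, in line with the score above
```

Because most errors in the matrix are only one score away from the diagonal, the quadratic weights penalise them lightly.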


### **4. Kaggle Submission**

Reading the second .csv file

In [158]:
df2 = pd.read_csv('FIT1043-Essay-Features-Submission.csv')

Investigating the content of df2

In [159]:
df2.shape

(199, 18)

In [160]:
df2.head()

|  | essayid | chars | words | commas | apostrophes | punctuations | avg_word_length | sentences | questions | avg_word_sentence | POS | POS/total_words | prompt_words | prompt_words/total_words | synonym_words | synonym_words/total_words | unstemmed | stemmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1623 | 4332 | 900 | 28 | 13 | 0 | 4.813333 | 39 | 1 | 23.076923 | 893.988852 | 0.993321 | 392 | 0.435556 | 196 | 0.217778 | 750 | 750 |
| 1 | 1143 | 1465 | 280 | 11 | 3 | 1 | 5.232143 | 14 | 3 | 20.0 | 278.321343 | 0.994005 | 131 | 0.467857 | 51 | 0.182143 | 339 | 316 |
| 2 | 660 | 1696 | 325 | 17 | 2 | 0 | 5.218462 | 19 | 1 | 17.105263 | 321.31677 | 0.988667 | 178 | 0.547692 | 92 | 0.283077 | 352 | 337 |
| 3 | 1596 | 2640 | 555 | 20 | 17 | 0 | 4.756757 | 28 | 0 | 19.821429 | 551.98915 | 0.994575 | 228 | 0.410811 | 107 | 0.192793 | 632 | 605 |
| 4 | 846 | 2844 | 596 | 33 | 4 | 1 | 4.771812 | 24 | 9 | 24.833333 | 593.65881 | 0.996072 | 279 | 0.468121 | 138 | 0.231544 | 626 | 607 |


We now select the input data from the submission dataset. This file has no score column, since the score is what we want to predict. The columns from the second to the last are intended as the input data.

In [161]:
# Input data: the list index [1, 17] selects only columns 1 and 17 (chars and stemmed).
# The training cell selected columns 0 and 17 (essayid and stemmed), so the two selections
# do not match; the same columns should be used for training and prediction.
X = df2.iloc[:, [1, 17]].values

We scale the submission data with the *StandardScaler()* object from *sklearn.preprocessing* created earlier.

In [162]:
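# Feature Scaling (Normalization)
# Note: fit_transform refits the scaler on the submission data;
# sc.transform(X) would instead reuse the statistics learned from the training set
X = sc.fit_transform(X)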
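
The cell above calls *fit_transform* on the submission data, which recomputes the mean and standard deviation instead of reusing those learned from the training set. The sketch below shows why this matters, using a hypothetical `SimpleScaler` class as a stand-in for a standardising scaler and the stemmed values from the first five rows of each file.

```python
import statistics

class SimpleScaler:
    """Hypothetical stand-in for a standardising scaler on one feature."""

    def fit(self, values):
        # Learn the statistics used for scaling
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the learned statistics without changing them
        return [(v - self.mean) / self.std for v in values]

train_stemmed = [412, 345, 750, 209, 677]  # stemmed, first training rows
new_stemmed = [750, 316, 337, 605, 607]    # stemmed, first submission rows

scaler = SimpleScaler().fit(train_stemmed)
reused = scaler.transform(new_stemmed)                            # training statistics
refit = SimpleScaler().fit(new_stemmed).transform(new_stemmed)    # new statistics
print(reused[0], refit[0])  # the same raw value 750 maps to two different inputs
```

A model trained on data scaled with the training statistics expects new data on that same scale, which is what reusing the fitted scaler provides.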

Using the model built using the Support Vector Machine algorithm, we will now predict the score of the essays.

In [163]:
# Predicting the results
Y_pred = classifier.predict(X)
Y_pred

array([4, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 4, 4, 4, 4, 3, 4, 3, 3,
       4, 4, 4, 3, 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 4, 4, 3, 3, 4, 3, 3, 3,
       2, 3, 3, 4, 4, 3, 3, 4, 4, 4, 3, 4, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4,
       3, 4, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 2, 4, 4,
       2, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 3, 4, 2, 4, 3, 4, 3, 3, 4, 4,
       4, 3, 4, 4, 4, 3, 2, 4, 2, 3, 4, 4, 4, 3, 2, 3, 3, 4, 3, 2, 4, 4,
       2, 3, 4, 3, 3, 4, 2, 3, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 4, 3, 3, 3,
       4, 3, 3, 3, 4, 4, 4, 3, 2, 3, 4, 4, 3, 3, 2, 4, 3, 4, 3, 3, 4, 3,
       4, 3, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3,
       4])

Creating a new DataFrame of the *essayid* and the predicted score.

In [164]:
# Predicted scores as the single column, with essayid as the index
df3 = pd.DataFrame(Y_pred, index=df2['essayid'])

Renaming the column to *score* and turning the *essayid* index back into a regular column using the *reset_index()* function

In [165]:
df3.rename(columns = {0: 'score'}, inplace = True) 
df3 = df3.reset_index()
df3

|  | essayid | score |
|---|---|---|
| 0 | 1623 | 4 |
| 1 | 1143 | 3 |
| 2 | 660 | 3 |
| 3 | 1596 | 4 |
| 4 | 846 | 4 |
| ... | ... | ... |
| 194 | 1226 | 3 |
| 195 | 862 | 4 |
| 196 | 1562 | 4 |
| 197 | 1336 | 3 |


Writing the df3 DataFrame to a .csv file 

In [143]:
columns = ["essayid", "score"]
df3.to_csv('99999999-YourName-1.csv', mode='w', header = columns, index = False)

### **5. Conclusion**

In conclusion, the file *'FIT1043-Essay-Features.csv'* contained the data used to build a model that predicts a score for each essay. This data was split into training and test datasets: the training dataset was used to train the model and the test dataset to evaluate it. I created a confusion matrix to check the accuracy of the model and also obtained a QWK score of about 0.62. The file *'FIT1043-Essay-Features-Submission.csv'* contained the essays for which a score had to be predicted. Finally, these predictions were written to a .csv file.