# Automatic Script Evaluvator

### Notebook by __[Vishaal Rao](https://www.linkedin.com/in/vishaal-rao/)__

## Table of Contents

 1. [Introduction](#Introduction)
 2. [Libraries used](#Important-Libraries-used-in-the-following-project)
 3. [Problem Domain](#The-Problem-Domain)
 4. [Methodology](#Methodology)
 5. [Program](#Program)
 6. [N/A value treatment](#N/A-value-treatment)
 7. [Performing Natural language process](#Performing-Natural-language-process)
 8. [Building Regression model](#Building-Regression-model)
 9. [Predicting the scores for the given test data](#Predicting-the-scores-for-the-given-test-data)

## Introduction

The number of students around the globe enrolled in higher education is forecast to more than double to 262 million by 2025. Nearly all of this growth will be in the developing world, with more than half in China and India alone. With every student giving exams to prove themselves to be cut above the rest, It leaves the teachers around the world with the daunting task of correcting every single paper a student submits.

In this notebook i'll be attempting to grade 10 different sets of essays answered by students, which has been marked according to the coherence and clarity by 2 different teachers. While the teacher have given 5 seperate scores with a maximum marks of either 2 or 3(depending on the set), i'll be taking the mean of these marks and will be assigning total marks out of 10 or 15 respectively.

## Important Libraries used in the following project

1. **NLTK**: The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.


2. **Pandas** :pandas is a software library written for the Python programming language for data manipulation and analysis.


3. **sklearn.feature_extraction.text** The library has been used to perform TFIDF vectorization.


4. **sklearn.metrics.pairwise** : The library has been used to obtain the cosine similarity for the essay text.


5. **sklearn.ensemble** : The library has been used to perform bagging for the model.


6. **sklearn.tree** : Descision tree regressor has been called from the library mentioned.
 

## The Problem Domain 
For the purpose of this exercise, let's pretend a school has asked us to buid a machine learning algorithm which shall automatically grade the essay's submitted by students of grade 10. We have been given 4 inputs namely Essay set, coherence, clarity and the essay text written by th student.

## Methodology



1. In order to judge the essays based on text we will be comparing all the essays having the highest clarity and coherence in their sets. We will be comparing it using:
    
    1. **Term frequency**: is a scoring of the frequency of the word in the current document.
                    
          *TF = (Number of times term t appears in a document)/(Number of terms in the document)*
    
    2. **Inverse Document frequency**:  is a scoring of how rare the word is across documents.
    
          *IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.*
        
Tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus



2. **Cosine Similarity** :Cosine similarity is a measure of similarity between two non-zero vectors. Using this formula we can find out the similarity between any two documents d1 and d2.

      *Cosine Similarity (d1, d2) =  Dot product(d1, d2) / ||d1|| * ||d2||*
      
      where d1,d2 are two non zero vectors. In our scenario, d1 will be one of the essay texts we are testing and d2 will be one of the essay text which has been rate highly.The similarity shall be stored in a column named 'EssayMrk
      


3. **Decision Tree Regressor**: We shall be using a descision tree regressor where the 'Essayset','clarity','coherent','EssayMrk' shall be the independent variable and the 'Total' shall be the dependent variable.




4. **Bagging Regressor**: We shall be using the ensamble Bagging to improve the accuracy of the said model.

      
      


## Program

In [1]:
import os
os.chdir('D:\Data science\Kaggle project\incedo_participant')

import nltk

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

import string

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.ensemble import BaggingRegressor

from sklearn.tree import DecisionTreeRegressor

import math

import warnings
warnings.filterwarnings("ignore")

nltk.download('punkt')

nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vishaal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vishaal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
Train=pd.read_csv('train_dataset.csv') 

In [3]:
Train.shape

(17043, 12)

In [4]:
Train.head()

Unnamed: 0,ID,Essayset,min_score,max_score,score_1,score_2,score_3,score_4,score_5,clarity,coherent,EssayText
0,1,1.0,0,3,1,1,1.0,1.0,1.0,average,worst,Some additional information that we would need...
1,2,1.0,0,3,1,1,,1.5,1.0,excellent,worst,"After reading the expirement, I realized that ..."
2,3,1.0,0,3,1,1,1.0,1.0,1.5,worst,above_average,"What you need is more trials, a control set up..."
3,4,1.0,0,3,0,0,0.0,0.0,1.0,worst,worst,The student should list what rock is better an...
4,5,1.0,0,3,2,2,2.0,2.5,1.0,above_average,worst,For the students to be able to make a replicat...


In [5]:
Train.dtypes

ID             int64
Essayset     float64
min_score      int64
max_score      int64
score_1        int64
score_2        int64
score_3      float64
score_4      float64
score_5      float64
clarity       object
coherent      object
EssayText     object
dtype: object

In [6]:
Train.isna().sum()

ID             0
Essayset     157
min_score      0
max_score      0
score_1        0
score_2        0
score_3      147
score_4      136
score_5      144
clarity      138
coherent     145
EssayText      0
dtype: int64

### N/A value treatment

In [7]:
Train.dropna(subset=['Essayset'],inplace=True)

In [8]:
Train['score_1']=Train['score_1'].fillna(round(Train['score_1'].mean()),axis=0)
Train['score_2']=Train['score_2'].fillna(round(Train['score_2'].mean()),axis=0)
Train['score_3']=Train['score_3'].fillna(round(Train['score_3'].mean()),axis=0)
Train['score_4']=Train['score_4'].fillna(round(Train['score_4'].mean()),axis=0)
Train['score_5']=Train['score_5'].fillna(round(Train['score_5'].mean()),axis=0)


In [9]:
Train['clarity'] = Train['clarity'].map( {'worst':0, 'average':1, 'above_average':2,'excellent':3})

In [10]:
Train['coherent'] = Train['coherent'].map( {'worst':0, 'average':1, 'above_average':2,'excellent':3})

In [11]:
Train['clarity']=Train['clarity'].fillna(round(Train['clarity'].mean()),axis=0)

In [12]:
Train['coherent']=Train['coherent'].fillna(round(Train['coherent'].mean()),axis=0)

In [13]:
Train.isna().sum()

ID           0
Essayset     0
min_score    0
max_score    0
score_1      0
score_2      0
score_3      0
score_4      0
score_5      0
clarity      0
coherent     0
EssayText    0
dtype: int64

In [14]:
Train.head()

Unnamed: 0,ID,Essayset,min_score,max_score,score_1,score_2,score_3,score_4,score_5,clarity,coherent,EssayText
0,1,1.0,0,3,1,1,1.0,1.0,1.0,1.0,0.0,Some additional information that we would need...
1,2,1.0,0,3,1,1,1.0,1.5,1.0,3.0,0.0,"After reading the expirement, I realized that ..."
2,3,1.0,0,3,1,1,1.0,1.0,1.5,0.0,2.0,"What you need is more trials, a control set up..."
3,4,1.0,0,3,0,0,0.0,0.0,1.0,0.0,0.0,The student should list what rock is better an...
4,5,1.0,0,3,2,2,2.0,2.5,1.0,2.0,0.0,For the students to be able to make a replicat...


Taking the mean of all the score and dividing by the maxium score that can be obtained.

In [15]:
Train['Total']=round((Train['score_1']+Train['score_2']+Train['score_3']+Train['score_4']+Train['score_5'])/(Train['max_score']*5),2)

In [16]:
Train.head()

Unnamed: 0,ID,Essayset,min_score,max_score,score_1,score_2,score_3,score_4,score_5,clarity,coherent,EssayText,Total
0,1,1.0,0,3,1,1,1.0,1.0,1.0,1.0,0.0,Some additional information that we would need...,0.33
1,2,1.0,0,3,1,1,1.0,1.5,1.0,3.0,0.0,"After reading the expirement, I realized that ...",0.37
2,3,1.0,0,3,1,1,1.0,1.0,1.5,0.0,2.0,"What you need is more trials, a control set up...",0.37
3,4,1.0,0,3,0,0,0.0,0.0,1.0,0.0,0.0,The student should list what rock is better an...,0.07
4,5,1.0,0,3,2,2,2.0,2.5,1.0,2.0,0.0,For the students to be able to make a replicat...,0.63


In [17]:
Train.drop(['ID','min_score','max_score','score_1','score_2','score_3','score_4','score_5'],inplace=True,axis=1)

In [18]:
Train.head()

Unnamed: 0,Essayset,clarity,coherent,EssayText,Total
0,1.0,1.0,0.0,Some additional information that we would need...,0.33
1,1.0,3.0,0.0,"After reading the expirement, I realized that ...",0.37
2,1.0,0.0,2.0,"What you need is more trials, a control set up...",0.37
3,1.0,0.0,0.0,The student should list what rock is better an...,0.07
4,1.0,2.0,0.0,For the students to be able to make a replicat...,0.63


In [19]:
for i in range(len(Train['EssayText'])):
    Train.iloc[i,3]=Train.iloc[i,3].lower()

In [20]:
Train.head()

Unnamed: 0,Essayset,clarity,coherent,EssayText,Total
0,1.0,1.0,0.0,some additional information that we would need...,0.33
1,1.0,3.0,0.0,"after reading the expirement, i realized that ...",0.37
2,1.0,0.0,2.0,"what you need is more trials, a control set up...",0.37
3,1.0,0.0,0.0,the student should list what rock is better an...,0.07
4,1.0,2.0,0.0,for the students to be able to make a replicat...,0.63


## Performing Natural language process

Dividing the rows based on the 'Essayset' they belong to. In this problem there are 10 different essay sets.

In [21]:
Train['Essayset'].unique().tolist()

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

In [22]:
Tab_1=[]
Tab_2=[]
Tab_3=[]
Tab_4=[]
Tab_5=[]
Tab_6=[]
Tab_7=[]
Tab_8=[]
Tab_9=[]
Tab_10=[]
for j in range(0,len(Train['Essayset'])):
        if Train.iloc[j,0]==1:
            Tab_1.insert(len(Tab_1),Train.iloc[j,:])
        elif Train.iloc[j,0]==2:
            Tab_2.insert(len(Tab_2),Train.iloc[j,:])
        elif Train.iloc[j,0]==3:
            Tab_3.insert(len(Tab_3),Train.iloc[j,:])
        elif Train.iloc[j,0]==4:
            Tab_4.insert(len(Tab_4),Train.iloc[j,:])
        elif Train.iloc[j,0]==5:
            Tab_5.insert(len(Tab_5),Train.iloc[j,:])
        elif Train.iloc[j,0]==6:
            Tab_6.insert(len(Tab_6),Train.iloc[j,:])
        elif Train.iloc[j,0]==7:
            Tab_7.insert(len(Tab_7),Train.iloc[j,:])
        elif Train.iloc[j,0]==8:
            Tab_8.insert(len(Tab_8),Train.iloc[j,:])
        elif Train.iloc[j,0]==9:
            Tab_9.insert(len(Tab_9),Train.iloc[j,:])
        elif Train.iloc[j,0]==10:
            Tab_10.insert(len(Tab_10),Train.iloc[j,:])

In [23]:
Tab_1=pd.DataFrame(Tab_1)
Tab_2=pd.DataFrame(Tab_2)
Tab_3=pd.DataFrame(Tab_3)
Tab_4=pd.DataFrame(Tab_4)
Tab_5=pd.DataFrame(Tab_5)
Tab_6=pd.DataFrame(Tab_6)
Tab_7=pd.DataFrame(Tab_7)
Tab_8=pd.DataFrame(Tab_8)
Tab_9=pd.DataFrame(Tab_9)
Tab_10=pd.DataFrame(Tab_10)

In [24]:
Tab_1.reset_index(inplace=True,drop=True)
Tab_2.reset_index(inplace=True,drop=True)
Tab_3.reset_index(inplace=True,drop=True)
Tab_4.reset_index(inplace=True,drop=True)
Tab_5.reset_index(inplace=True,drop=True)
Tab_6.reset_index(inplace=True,drop=True)
Tab_7.reset_index(inplace=True,drop=True)
Tab_8.reset_index(inplace=True,drop=True)
Tab_9.reset_index(inplace=True,drop=True)
Tab_10.reset_index(inplace=True,drop=True)

Defining a function **Best_vals** which shall return a list of the best rows with high 'coherence' and 'clarity'. It was noticed that essay set 6 did not have any good performers hence we have considered the rows with either high coherence or high clarity.

In [25]:
def Best_vals(df):
    B_V=[]
    for j in range(len(df)):
        if j!=6:
            if (df.iloc[j,1]==3) & (df.iloc[j,2]==3) :
                B_V.append(j)
        else:
            if (df.iloc[j,1]==max(df.iloc[:,1])) | (df.iloc[j,2]==max(df.iloc[:,2])) :
                B_V.append(j)
    return B_V

In [26]:
BV1=Best_vals(Tab_1)
BV2=Best_vals(Tab_2)
BV3=Best_vals(Tab_3)
BV4=Best_vals(Tab_4)
BV5=Best_vals(Tab_5)
BV6=Best_vals(Tab_6)
BV7=Best_vals(Tab_7)
BV8=Best_vals(Tab_8)
BV9=Best_vals(Tab_9)
BV10=Best_vals(Tab_10)

Defining functions to perform **Lemmatization**. Lemmatization means returning the base form of a particular word. Examples of Lemmatization are that “run” is a base form for words like “running” or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same.

In [27]:
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

function **Vals** returns the cosine similarity values of the essay texts. it shall be done seperately for seperate essays.

In [28]:
def Vals(Tab,BV):
    vals=[]
    Tab_list=list(Tab)
    z=np.zeros((1,len(Tab_list)))
    vectorizer = TfidfVectorizer(tokenizer=LemNormalize,stop_words='english')
    X = vectorizer.fit_transform(Tab_list)            #Input Tab will have to be a list, so input Tab_1.iloc[:,0]
    for i,v in enumerate(BV):    
        vals=cosine_similarity(X[v], X)
        z=np.append(z,vals,axis=0)
    z=np.delete(z, 0, 0)
    Mean=pd.DataFrame(z.mean(axis=0))
    
    return Mean

In [29]:
EssayMark1=Vals(Tab_1.iloc[:,3],BV1)
EssayMark2=Vals(Tab_2.iloc[:,3],BV2)
EssayMark3=Vals(Tab_3.iloc[:,3],BV3)
EssayMark4=Vals(Tab_4.iloc[:,3],BV4)
EssayMark5=Vals(Tab_5.iloc[:,3],BV5)
EssayMark6=Vals(Tab_6.iloc[:,3],BV6)
EssayMark7=Vals(Tab_7.iloc[:,3],BV7)
EssayMark8=Vals(Tab_8.iloc[:,3],BV8)
EssayMark9=Vals(Tab_9.iloc[:,3],BV9)
EssayMark10=Vals(Tab_10.iloc[:,3],BV10)

In [30]:
lst=[EssayMark1,EssayMark2,EssayMark3,EssayMark4,EssayMark5,EssayMark6,EssayMark7,EssayMark8,EssayMark9,EssayMark10]
EssayMark=[]
for i in range(len(lst)):
    EssayMark.extend(lst[i].iloc[:,0])

In [31]:
EssayMark=pd.DataFrame(EssayMark)

In [32]:
EssayMark.isna().sum()

0    0
dtype: int64

In [33]:
Train.reset_index(inplace=True,drop=True)

In [34]:
Train['EssayMrk']=EssayMark

In [35]:
Train.isna().sum()

Essayset     0
clarity      0
coherent     0
EssayText    0
Total        0
EssayMrk     0
dtype: int64

In [36]:
Train.drop('EssayText',inplace=True,axis=1)
Train.head()

Unnamed: 0,Essayset,clarity,coherent,Total,EssayMrk
0,1.0,1.0,0.0,0.33,0.145242
1,1.0,3.0,0.0,0.37,0.046966
2,1.0,0.0,2.0,0.37,0.042876
3,1.0,0.0,0.0,0.07,0.015729
4,1.0,2.0,0.0,0.63,0.07588


In [37]:
Train.head() # Final table.

Unnamed: 0,Essayset,clarity,coherent,Total,EssayMrk
0,1.0,1.0,0.0,0.33,0.145242
1,1.0,3.0,0.0,0.37,0.046966
2,1.0,0.0,2.0,0.37,0.042876
3,1.0,0.0,0.0,0.07,0.015729
4,1.0,2.0,0.0,0.63,0.07588


In [38]:
Train.isna().sum()

Essayset    0
clarity     0
coherent    0
Total       0
EssayMrk    0
dtype: int64

## Building Regression model 

In [39]:
regr = BaggingRegressor(base_estimator=DecisionTreeRegressor(),oob_score=True)
X=Train[['Essayset','clarity','coherent','EssayMrk']]
Y=Train[['Total']]

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [41]:
regr.fit(X_train,y_train)

BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [42]:
print ('Coefficients: ', regr.oob_score_)

Coefficients:  0.6546685882580758


In [43]:
regr.score(X_test,y_test)

0.7078440593172011

We perform Bagging on multiple estimatiors in order to achieve the most efficient model.

In [44]:
for w in range(10,300,20):
    regr=BaggingRegressor(oob_score=True,n_jobs=-1,n_estimators=w,random_state=400,
                          base_estimator=DecisionTreeRegressor())
    regr.fit(X_train,y_train)
    oob=regr.oob_score_
    print('For n_estimators = '+str(w))
    print('OOB score is '+str(oob))
    print('************************')

For n_estimators = 10
OOB score is 0.6636541351563024
************************
For n_estimators = 30
OOB score is 0.717162318458297
************************
For n_estimators = 50
OOB score is 0.7235625779116421
************************
For n_estimators = 70
OOB score is 0.7254154129084711
************************
For n_estimators = 90
OOB score is 0.7275251725391145
************************
For n_estimators = 110
OOB score is 0.7278178050931972
************************
For n_estimators = 130
OOB score is 0.7288367215486253
************************
For n_estimators = 150
OOB score is 0.7294000969816308
************************
For n_estimators = 170
OOB score is 0.7296874255154073
************************
For n_estimators = 190
OOB score is 0.729781389465202
************************
For n_estimators = 210
OOB score is 0.729901439260723
************************
For n_estimators = 230
OOB score is 0.7298908254762734
************************
For n_estimators = 250
OOB score is 0.7299471782

In [45]:
regr=BaggingRegressor(oob_score=True,n_jobs=-1,n_estimators=250,random_state=400, #It was noticed that when estimator=250, the model had the highest efficiency.
                          base_estimator=DecisionTreeRegressor())
regr.fit(X_train,y_train)

BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=250, n_jobs=-1, oob_score=True,
         random_state=400, verbose=0, warm_start=False)

We shall see the which split had the most affect in the regression.

In [46]:
imp=[]
for i in regr.estimators_:
    imp.append(i.feature_importances_)
imp=np.mean(imp,axis=0)

In [47]:
feature_importance=pd.Series(imp,index=X.columns.tolist())

In [48]:
feature_importance.sort_values(ascending=False)

coherent    0.506951
EssayMrk    0.219869
clarity     0.185002
Essayset    0.088178
dtype: float64

It was noticed that coherence has played the most important part in judging a students performance, followed by the cosine similarity to the best written essays.

In [49]:
regr.score(X_test,y_test)

0.7200323358633295

The above built model has an efficiency of **72%**.

## Predicting the scores for the given test data

In [50]:
Test=pd.read_csv('test_dataset.csv')

In [51]:
Test.head()

Unnamed: 0,ID,Essayset,min_score,max_score,clarity,coherent,EssayText
0,1673,1,0,3,average,worst,The procedures I think they should have includ...
1,1674,1,0,3,average,worst,"In order to replicate this experiment, you wou..."
2,1675,1,0,3,above_average,above_average,"In order to replicate their experiment, you wo..."
3,1676,1,0,3,worst,worst,Pleace a simple of one material into one conta...
4,1677,1,0,3,worst,worst,Determin the mass of four different samples ma...


In [52]:
Test['clarity'] = Test['clarity'].map( {'worst':0, 'average':1, 'above_average':2,'excellent':3})
Test['coherent'] = Test['coherent'].map( {'worst':0, 'average':1, 'above_average':2,'excellent':3})

In [53]:
Test.drop(['ID','min_score'],inplace=True,axis=1)

In [54]:
for i in range(len(Test['EssayText'])):
    Test.iloc[i,4]=Test.iloc[i,4].lower()

In [55]:
Test.head()

Unnamed: 0,Essayset,max_score,clarity,coherent,EssayText
0,1,3,1,0,the procedures i think they should have includ...
1,1,3,1,0,"in order to replicate this experiment, you wou..."
2,1,3,2,2,"in order to replicate their experiment, you wo..."
3,1,3,0,0,pleace a simple of one material into one conta...
4,1,3,0,0,determin the mass of four different samples ma...


In [56]:
Test_Tab_1=[]
Test_Tab_2=[]
Test_Tab_3=[]
Test_Tab_4=[]
Test_Tab_5=[]
Test_Tab_6=[]
Test_Tab_7=[]
Test_Tab_8=[]
Test_Tab_9=[]
Test_Tab_10=[]
for j in range(0,len(Test['Essayset'])):
        if Test.iloc[j,0]==1:
            Test_Tab_1.insert(len(Test_Tab_1),Test.iloc[j,:])
        elif Test.iloc[j,0]==2:
            Test_Tab_2.insert(len(Test_Tab_2),Test.iloc[j,:])
        elif Test.iloc[j,0]==3:
            Test_Tab_3.insert(len(Test_Tab_3),Test.iloc[j,:])
        elif Test.iloc[j,0]==4:
            Test_Tab_4.insert(len(Test_Tab_4),Test.iloc[j,:])
        elif Test.iloc[j,0]==5:
            Test_Tab_5.insert(len(Test_Tab_5),Test.iloc[j,:])
        elif Test.iloc[j,0]==6:
            Test_Tab_6.insert(len(Test_Tab_6),Test.iloc[j,:])
        elif Test.iloc[j,0]==7:
            Test_Tab_7.insert(len(Test_Tab_7),Test.iloc[j,:])
        elif Test.iloc[j,0]==8:
            Test_Tab_8.insert(len(Test_Tab_8),Test.iloc[j,:])
        elif Test.iloc[j,0]==9:
            Test_Tab_9.insert(len(Test_Tab_9),Test.iloc[j,:])
        elif Test.iloc[j,0]==10:
            Test_Tab_10.insert(len(Test_Tab_10),Test.iloc[j,:])

In [57]:
Test_Tab_1=pd.DataFrame(Test_Tab_1)
Test_Tab_2=pd.DataFrame(Test_Tab_2)
Test_Tab_3=pd.DataFrame(Test_Tab_3)
Test_Tab_4=pd.DataFrame(Test_Tab_4)
Test_Tab_5=pd.DataFrame(Test_Tab_5)
Test_Tab_6=pd.DataFrame(Test_Tab_6)
Test_Tab_7=pd.DataFrame(Test_Tab_7)
Test_Tab_8=pd.DataFrame(Test_Tab_8)
Test_Tab_9=pd.DataFrame(Test_Tab_9)
Test_Tab_10=pd.DataFrame(Test_Tab_10)

In [58]:
Test_Tab_1.reset_index(inplace=True,drop=True)
Test_Tab_2.reset_index(inplace=True,drop=True)
Test_Tab_3.reset_index(inplace=True,drop=True)
Test_Tab_4.reset_index(inplace=True,drop=True)
Test_Tab_5.reset_index(inplace=True,drop=True)
Test_Tab_6.reset_index(inplace=True,drop=True)
Test_Tab_7.reset_index(inplace=True,drop=True)
Test_Tab_8.reset_index(inplace=True,drop=True)
Test_Tab_9.reset_index(inplace=True,drop=True)
Test_Tab_10.reset_index(inplace=True,drop=True)

In [59]:
def Best_vals2(df):
    B_V=[]
    for j in range(len(df)):
        if (df.iloc[j,2]==max(df.iloc[:,2])) | (df.iloc[j,3]==max(df.iloc[:,3])) :
            B_V.append(j)
    return B_V

In [60]:
test_BV1=Best_vals2(Test_Tab_1)
test_BV2=Best_vals2(Test_Tab_2)
test_BV3=Best_vals2(Test_Tab_3)
test_BV4=Best_vals2(Test_Tab_4)
test_BV5=Best_vals2(Test_Tab_5)
test_BV6=Best_vals2(Test_Tab_6)
test_BV7=Best_vals2(Test_Tab_7)
test_BV8=Best_vals2(Test_Tab_8)
test_BV9=Best_vals2(Test_Tab_9)
test_BV10=Best_vals2(Test_Tab_10)

In [61]:
Test_EssayMark1=Vals(Test_Tab_1.iloc[:,4],test_BV1)
Test_EssayMark2=Vals(Test_Tab_2.iloc[:,4],test_BV2)
Test_EssayMark3=Vals(Test_Tab_3.iloc[:,4],test_BV3)
Test_EssayMark4=Vals(Test_Tab_4.iloc[:,4],test_BV4)
Test_EssayMark5=Vals(Test_Tab_5.iloc[:,4],test_BV5)
Test_EssayMark6=Vals(Test_Tab_6.iloc[:,4],test_BV6)
Test_EssayMark7=Vals(Test_Tab_7.iloc[:,4],test_BV7)
Test_EssayMark8=Vals(Test_Tab_8.iloc[:,4],test_BV8)
Test_EssayMark9=Vals(Test_Tab_9.iloc[:,4],test_BV9)
Test_EssayMark10=Vals(Test_Tab_10.iloc[:,4],test_BV10)

In [62]:
lst2=[Test_EssayMark1,Test_EssayMark2,Test_EssayMark3,Test_EssayMark4,Test_EssayMark5,Test_EssayMark6,Test_EssayMark7,Test_EssayMark8,Test_EssayMark9,Test_EssayMark10]
Test_EssayMark=[]
for i in range(len(lst2)):
    Test_EssayMark.extend(lst2[i].iloc[:,0])

In [63]:
EssayMark2=pd.DataFrame(Test_EssayMark)

In [64]:
Test['EssayMrk']=EssayMark2

In [65]:
Test.isna().sum()

Essayset     0
max_score    0
clarity      0
coherent     0
EssayText    0
EssayMrk     0
dtype: int64

In [66]:
Test.head()

Unnamed: 0,Essayset,max_score,clarity,coherent,EssayText,EssayMrk
0,1,3,1,0,the procedures i think they should have includ...,0.077698
1,1,3,1,0,"in order to replicate this experiment, you wou...",0.135626
2,1,3,2,2,"in order to replicate their experiment, you wo...",0.114849
3,1,3,0,0,pleace a simple of one material into one conta...,0.051692
4,1,3,0,0,determin the mass of four different samples ma...,0.047638


In [67]:
y_hat= regr.predict(Test[['Essayset','clarity','coherent','EssayMrk']])

In [68]:
Submission=pd.read_csv('test_dataset.csv')

In [69]:
Submission.head()

Unnamed: 0,ID,Essayset,min_score,max_score,clarity,coherent,EssayText
0,1673,1,0,3,average,worst,The procedures I think they should have includ...
1,1674,1,0,3,average,worst,"In order to replicate this experiment, you wou..."
2,1675,1,0,3,above_average,above_average,"In order to replicate their experiment, you wo..."
3,1676,1,0,3,worst,worst,Pleace a simple of one material into one conta...
4,1677,1,0,3,worst,worst,Determin the mass of four different samples ma...


In [70]:
Submission=Submission[['ID','Essayset','max_score']]

In [71]:
Submission['Score']=round(y_hat*Submission['max_score']*5,1)

In [72]:
Submission.drop('max_score',inplace=True,axis=1)

In [73]:
Submission.head(10)

Unnamed: 0,ID,Essayset,Score
0,1673,1,1.6
1,1674,1,4.3
2,1675,1,12.4
3,1676,1,3.0
4,1677,1,3.3
5,1678,1,11.0
6,1679,1,2.4
7,1680,1,3.4
8,1681,1,13.6
9,1682,1,10.5


In [74]:
Submission.to_csv('Submission.csv')