# Scoring Articles

In this project I try to score student essays automatically; the test file used here contains three essays. I use a neural network model built with Keras. Let's start.

#### Overview From Kaggle
The first automated essay scoring competition to tackle automated grading of student-written essays was twelve years ago. How far have we come from this initial competition? With an updated dataset and light years of new ideas we hope to see if we can get to the latest in automated grading to provide a real impact to overtaxed teachers who continue to have challenges with providing timely feedback, especially in underserved communities.
<br><br>
The goal of this competition is to train a model to score student essays. Your efforts are needed to reduce the high expense and time required to hand grade these essays. Reliable automated techniques could allow essays to be introduced in testing, a key indicator of student learning that is currently commonly avoided due to the challenges in grading.

Original Version of the <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2">Competition</a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv
/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv
/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv


In [2]:
ss=pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv")
test=pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv")
train=pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")

In [3]:
train.head()

Unnamed: 0,essay_id,full_text,score
0,000d118,Many people have car where they live. The thin...,3
1,000fe60,I am a scientist at NASA that is discussing th...,3
2,001ab80,People always wish they had the same technolog...,4
3,001bdc0,"We all heard about Venus, the planet without a...",4
4,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3


In [4]:
test.head()

Unnamed: 0,essay_id,full_text
0,000d118,Many people have car where they live. The thin...
1,000fe60,I am a scientist at NASA that is discussing th...
2,001ab80,People always wish they had the same technolog...


In [5]:
ss.head()

Unnamed: 0,essay_id,score
0,000d118,3
1,000fe60,3
2,001ab80,4


In [6]:
train.shape

(17307, 3)

# Feature Engineering

Here I lowercase the text and remove punctuation, digits, and newlines, which makes the data easier to work with. I apply the same steps to the test data, because the model will predict on it and both sets must be preprocessed identically; otherwise the predictions would be influenced by differences in formatting rather than content.

In [7]:
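# Lowercase everything
train["full_text"]=train["full_text"].str.lower()
test["full_text"]=test["full_text"].str.lower()
# Remove punctuation (anything that is not a word character or whitespace)
train["full_text"]=train["full_text"].str.replace(r'[^\w\s]','',regex=True)
test["full_text"]=test["full_text"].str.replace(r'[^\w\s]','',regex=True)

# Remove digits; regex=True is required, because pandas 2.0+ treats the pattern as a literal string by default
train["full_text"]=train["full_text"].str.replace(r'\d+','',regex=True)
test["full_text"]=test["full_text"].str.replace(r'\d+','',regex=True)

# Replace newlines with spaces
train["full_text"]=train["full_text"].str.replace('\n',' ')
test["full_text"]=test["full_text"].str.replace('\n',' ')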
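
As a quick check of the cleaning steps, here is a minimal sketch that applies the same three replacements to a single made-up sentence. The helper name `clean_text` and the sample text are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def clean_text(series: pd.Series) -> pd.Series:
    """Lowercase, then strip punctuation, digits, and newlines (hypothetical helper)."""
    return (
        series.str.lower()
        .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
        .str.replace(r"\d+", "", regex=True)      # drop digits
        .str.replace("\n", " ", regex=False)      # newlines become spaces
    )

# Made-up sample text, not taken from the dataset
sample = pd.Series(["Dear, State Senator\n\nIn 2024, cars matter!"])
cleaned = clean_text(sample)
print(repr(cleaned[0]))
```

Removed digits and newlines can leave double spaces behind. That is harmless here, because the tokenizer used later splits on whitespace runs.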

In [8]:
train.head()

Unnamed: 0,essay_id,full_text,score
0,000d118,many people have car where they live the thing...,3
1,000fe60,i am a scientist at nasa that is discussing th...,3
2,001ab80,people always wish they had the same technolog...,4
3,001bdc0,we all heard about venus the planet without al...,4
4,002ba53,dear state senator this is a letter to argue ...,3


In [9]:
test.head()

Unnamed: 0,essay_id,full_text
0,000d118,many people have car where they live the thing...
1,000fe60,i am a scientist at nasa that is discussing th...
2,001ab80,people always wish they had the same technolog...


In [10]:
import nltk

In [11]:
# Packages for English
nltk.download("punkt")

[nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>


False

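The **lemmafn** function, despite its name, performs stemming rather than lemmatization: it splits the text into words with TextBlob and reduces each word with NLTK's PorterStemmer. Stemming cuts a word down to a rule-based root that is not always a real word, whereas lemmatization maps a word to its dictionary base form. Either kind of normalization is useful in natural language processing (NLP) tasks such as text analysis, text mining, and machine learning.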

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
from nltk.stem import PorterStemmer
pr=PorterStemmer()
def lemmafn(text):
    words=TextBlob(text).words
    return [pr.stem(word) for word in words]
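
To see what the Porter stemmer actually returns, here is a small sketch. It assumes NLTK is installed and uses plain `str.split()` instead of TextBlob so that it needs no tokenizer data; the function name `stem_words` is made up for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    """Stem each whitespace-separated word (illustrative stand-in for lemmafn)."""
    return [stemmer.stem(word) for word in text.split()]

stems = stem_words("running studies essays")
print(stems)  # ['run', 'studi', 'essay']
```

Note that "studi" is not an English word. A lemmatizer would return "study" instead, which is the practical difference between the two techniques.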

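Here I define my vectorizer: English stop words, an n-gram range of (1,3), at most 100000 features, and lemmafn as the analyzer. Note that when the analyzer is a callable, scikit-learn ignores the stop words and the n-gram range, so in practice the features are single stemmed words, including stop words.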

In [13]:
vect=CountVectorizer(stop_words="english",ngram_range=(1,3),max_features=100000,analyzer=lemmafn)
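
The interaction between a callable analyzer and the other parameters can be checked on a toy corpus. The sketch below is an illustration, not the notebook's pipeline: the two documents and the `split_tokens` function are made up. It compares passing the function as `analyzer` with passing it as `tokenizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_tokens(text):
    """Hypothetical tokenizer: split on whitespace."""
    return text.split()

docs = ["the essay scores", "the essay wins"]

# Callable analyzer: stop_words and ngram_range are ignored (scikit-learn warns about this)
vect_callable = CountVectorizer(stop_words="english", ngram_range=(1, 2),
                                analyzer=split_tokens)
vect_callable.fit(docs)
vocab_callable = sorted(vect_callable.vocabulary_)

# Same function passed as tokenizer: stop words are removed and n-grams are built
vect_word = CountVectorizer(stop_words="english", ngram_range=(1, 2),
                            tokenizer=split_tokens, token_pattern=None)
vect_word.fit(docs)
vocab_word = sorted(vect_word.vocabulary_)

print(vocab_callable)  # ['essay', 'scores', 'the', 'wins']
print(vocab_word)      # ['essay', 'essay scores', 'essay wins', 'scores', 'wins']
```

If stop-word removal and n-grams are wanted together with stemming, one option is to pass the stemming function as `tokenizer` rather than `analyzer`. scikit-learn may then warn that the stop-word list does not match the stemmed tokens.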

In [14]:
x=train["full_text"]

In [15]:
x=vect.fit_transform(x)



In [16]:
d={1:0,2:1,3:2,4:3,5:4,6:5}
train["score"]=train["score"].map(d)
y=train["score"]
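
The scores are shifted from 1-6 to 0-5 because `sparse_categorical_crossentropy` expects integer class indices that start at 0. A minimal sketch of the mapping and its inverse, using made-up scores rather than the real training column:

```python
import pandas as pd

# Made-up scores on the 1-6 scale
scores = pd.Series([3, 3, 4, 4, 3, 1, 6])

# Same mapping as the dictionary d: score s becomes class index s - 1
to_class = {s: s - 1 for s in range(1, 7)}
classes = scores.map(to_class)

# Adding 1 maps class indices back to the original score scale
restored = classes + 1

print(classes.tolist())   # [2, 2, 3, 3, 2, 0, 5]
print(restored.tolist())  # [3, 3, 4, 4, 3, 1, 6]
```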

In [17]:
train.head()

Unnamed: 0,essay_id,full_text,score
0,000d118,many people have car where they live the thing...,2
1,000fe60,i am a scientist at nasa that is discussing th...,2
2,001ab80,people always wish they had the same technolog...,3
3,001bdc0,we all heard about venus the planet without al...,3
4,002ba53,dear state senator this is a letter to argue ...,2


# Modelling

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [20]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout



In [21]:
model=Sequential()
model.add(Dense(1000,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(600,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(400,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(250,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(6,activation="softmax"))
model.compile(optimizer="adam",loss="sparse_categorical_crossentropy",metrics=["accuracy"])

In [22]:
history=model.fit(x_train,y_train,epochs=10,validation_data=(x_test,y_test),batch_size=32,verbose=1)

Epoch 1/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 225s 505ms/step - accuracy: 0.3931 - loss: 1.3811 - val_accuracy: 0.5910 - val_loss: 0.9630
Epoch 2/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 262s 506ms/step - accuracy: 0.6020 - loss: 0.9377 - val_accuracy: 0.6031 - val_loss: 0.9256
Epoch 3/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 224s 516ms/step - accuracy: 0.6819 - loss: 0.7634 - val_accuracy: 0.5962 - val_loss: 0.9680
Epoch 4/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 222s 513ms/step - accuracy: 0.7502 - loss: 0.6120 - val_accuracy: 0.5812 - val_loss: 1.0224
Epoch 5/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 222s 512ms/step - accuracy: 0.8154 - loss: 0.4844 - val_accuracy: 0.5812 - val_loss: 1.2662
Epoch 6/10
433/433 ━━━━━━━━━━━━━━━━━━━━ 261s 509ms/step - accuracy: 0.8466 - loss: 0.4037 - val_accuracy: 0.5656 - val_loss: 1.2768
Epoc
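
The training log above suggests overfitting: training accuracy keeps rising while validation loss gets worse after the second epoch. The following sketch makes that visible using only the six epochs that appear in the log; the numbers are copied from it by hand, and the variable names are my own.

```python
import numpy as np

# Values copied from the training log (epochs 1-6; the log is cut off after that)
train_acc = np.array([0.3931, 0.6020, 0.6819, 0.7502, 0.8154, 0.8466])
val_acc = np.array([0.5910, 0.6031, 0.5962, 0.5812, 0.5812, 0.5656])
val_loss = np.array([0.9630, 0.9256, 0.9680, 1.0224, 1.2662, 1.2768])

# Epoch (1-based) with the lowest validation loss
best_epoch = int(np.argmin(val_loss)) + 1

# Gap between training and validation accuracy per epoch
gap = train_acc - val_acc

print(best_epoch)  # 2
print(gap.round(4))
```

One common response, not used in this notebook, is to stop training when validation loss stops improving, for example with the Keras `EarlyStopping` callback and `restore_best_weights=True`.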

In [23]:
model.save("model.h5")

# Predicting

In [24]:
predictions = []

# Predict one essay at a time, reusing the vectorizer fitted on the training data
for index, row in test.iterrows():
    data = vect.transform([row["full_text"]])
    pred = model.predict(data)
    predicted_class = np.argmax(pred)
    # Class indices are 0-5, so add 1 to get back to the 1-6 score scale
    predictions.append(predicted_class + 1)

test['score'] = predictions


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step


In [25]:
test.head()

Unnamed: 0,essay_id,full_text,score
0,000d118,many people have car where they live the thing...,1
1,000fe60,i am a scientist at nasa that is discussing th...,3
2,001ab80,people always wish they had the same technolog...,4
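
The prediction step calls the model once per essay. For larger test sets it is usually faster to transform all essays at once, call `model.predict` a single time, and take the argmax along each row. The sketch below shows only the last step with NumPy. The probabilities are made-up numbers standing in for softmax output, and `scores_from_probabilities` is a hypothetical helper.

```python
import numpy as np

def scores_from_probabilities(probs):
    """Convert an (n_samples, 6) array of class probabilities to scores 1-6."""
    probs = np.asarray(probs)
    # argmax over axis 1 gives a class index 0-5 per row; add 1 for the score scale
    return probs.argmax(axis=1) + 1

# Made-up softmax outputs for three essays (each row sums to 1)
probs = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.03, 0.02],
    [0.05, 0.15, 0.55, 0.15, 0.05, 0.05],
    [0.02, 0.08, 0.20, 0.50, 0.15, 0.05],
])

scores = scores_from_probabilities(probs)
print(scores)  # [1 3 4]
```

In the notebook, `probs` would presumably come from something like `model.predict(vect.transform(test["full_text"]))`, assuming the model accepts the whole matrix the same way it accepts single rows.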


In [26]:
submission = test[['essay_id', 'score']]
submission.to_csv('submission.csv', index=False)
submission.head()

Unnamed: 0,essay_id,score
0,000d118,1
1,000fe60,3
2,001ab80,4


In this project I tried to predict the scores of three essays. Reading and scoring essays by hand is hard work that takes up people's time, while a machine learning model can score them quickly. The validation accuracy of about 0.6, however, shows that this model still needs improvement before its scores can be relied on.