# **LIBRARIES IMPORT**

In [1]:
import sys
sys.path.append('D:\\Projects\\nlp-projects\\utils')
print(sys.path)

['C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\DLLs', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\Lib', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312', 'd:\\Projects\\nlp-projects\\.venv', '', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\win32', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\win32\\lib', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\Pythonwin', 'D:\\Projects\\nlp-projects\\utils']


In [2]:
import pandas as pd
import nltk
import utils
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# **DATA WRANGLING & PROCESSING PIPELINE**

The dataset is a collection of answers to questions. Each question is identified by a unique id and has multiple answers, both right and wrong, each with a score. The goal is to predict the score of an answer in order to judge whether it is correct.

Here we work with the answers to a single question, but the same process can be applied to any question.

In [3]:
dataframe = pd.read_csv('data/answers.csv')

# Drop duplicate rows
dataframe.drop_duplicates(inplace=True)

# Count answers per question id
print(dataframe['id'].value_counts())

# List the unique question ids
print(dataframe['id'].unique())

id
11.1    60
12.1    54
3.2     31
3.7     31
3.6     31
        ..
9.7     20
10.3    19
9.2     18
4.6     18
9.6     16
Name: count, Length: 85, dtype: int64
[ 1.1  1.2  1.3  1.4  1.5  1.6  1.7  2.1  2.2  2.3  2.4  2.5  2.6  2.7
  3.1  3.2  3.3  3.4  3.5  3.6  3.7  4.1  4.2  4.3  4.4  4.5  4.6  4.7
  5.1  5.2  5.3  5.4  6.1  6.2  6.3  6.4  6.5  6.6  6.7  7.1  7.2  7.3
  7.4  7.5  7.6  7.7  8.1  8.2  8.3  8.4  8.5  8.6  8.7  9.1  9.2  9.3
  9.4  9.5  9.6  9.7 10.1 10.2 10.3 10.4 10.5 10.6 10.7 11.1 11.2 11.3
 11.4 11.5 11.6 11.7 11.8 11.9 12.1 12.2 12.3 12.4 12.5 12.6 12.7 12.8
 12.9]
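
If the intent is to pick the question with the most answers, the id does not have to be hard-coded. The following is a minimal sketch on a toy frame; the `toy` data and its values are invented for illustration and are not taken from `data/answers.csv`.

```python
import pandas as pd

# Toy stand-in for the answers table (values are illustrative only).
toy = pd.DataFrame({
    'id': [1.1, 1.1, 11.1, 11.1, 11.1, 12.1],
    'answer': ['a', 'b', 'c', 'd', 'e', 'f'],
    'score': [3.5, 5.0, 4.0, 5.0, 3.0, 2.0],
})

# value_counts() sorts by frequency, so idxmax() returns the most frequent id.
most_answered_id = toy['id'].value_counts().idxmax()

# Keep only the answers that belong to that question.
subset = toy[toy['id'] == most_answered_id]
print(most_answered_id, len(subset))  # 11.1 3
```

Applied to the real `dataframe`, the same two lines would select whichever id tops the `value_counts()` listing.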


### **Choosing a single question**

In [4]:
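# Keep only the answers to question 1.1, then drop the columns not used below.
# Chaining .drop() on the filtered frame avoids pandas' SettingWithCopyWarning.
dataframe = dataframe[dataframe['id'] == 1.1].drop(columns=['correct', 'id'])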

dataframe

Unnamed: 0,answer,score
0,High risk problems are address in the prototyp...,3.5
1,To simulate portions of the desired final prod...,5.0
2,A prototype program simulates the behaviors of...,4.0
3,Defined in the Specification phase a prototype...,5.0
4,It is used to let the users have a first idea ...,3.0
5,To find problem and errors in a program before...,2.0
6,To address major issues in the creation of the...,2.5
7,you can break the whole program into prototype...,5.0
8,To provide an example or model of how the fini...,3.5
9,Simulating the behavior of only a portion of t...,5.0


In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\legion\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\legion\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\legion\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Tokenization**

In [6]:
tokenized_dataframe = dataframe.copy()

tokenized_dataframe['answer'] = tokenized_dataframe['answer'].apply(utils.tokenize)

tokenized_dataframe

Unnamed: 0,answer,score
0,"[high, risk, problems, address, prototype, pro...",3.5
1,"[simulate, portions, desired, final, product, ...",5.0
2,"[prototype, program, simulates, behaviors, por...",4.0
3,"[defined, specification, phase, prototype, sti...",5.0
4,"[used, let, users, first, idea, completed, pro...",3.0
5,"[find, problem, errors, program, finalized]",2.0
6,"[address, major, issues, creation, program, wa...",2.5
7,"[break, whole, program, prototype, programs, s...",5.0
8,"[provide, example, model, finished, program, p...",3.5
9,"[simulating, behavior, portion, desired, softw...",5.0
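
The implementation of `utils.tokenize` is not shown. Judging by the output, it lowercases the text, splits it into words, and removes stop words and punctuation. A rough, dependency-free sketch of that idea could look like the following; `tokenize_sketch` and the tiny `STOP_WORDS` set are hypothetical stand-ins, not the project's code, and the real helper presumably relies on the NLTK resources downloaded earlier.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {'a', 'an', 'and', 'are', 'in', 'is', 'it', 'of', 'the', 'to'}


def tokenize_sketch(text):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r'[a-z]+', text.lower())
    return [word for word in words if word not in STOP_WORDS]


tokens = tokenize_sketch('To find problems and errors in a program.')
print(tokens)  # ['find', 'problems', 'errors', 'program']
```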


### **Lemmatization**

In [7]:
tokenized_dataframe['answer'] = tokenized_dataframe['answer'].apply(utils.lemma)

tokenized_dataframe

Unnamed: 0,answer,score
0,"[high, risk, problem, address, prototype, prog...",3.5
1,"[simulate, portion, desired, final, product, q...",5.0
2,"[prototype, program, simulates, behavior, port...",4.0
3,"[defined, specification, phase, prototype, sti...",5.0
4,"[used, let, user, first, idea, completed, prog...",3.0
5,"[find, problem, error, program, finalized]",2.0
6,"[address, major, issue, creation, program, way...",2.5
7,"[break, whole, program, prototype, program, si...",5.0
8,"[provide, example, model, finished, program, p...",3.5
9,"[simulating, behavior, portion, desired, softw...",5.0


### **Sanity Check**

In [8]:
# Number of unique words
unique_words = set(word for sentence in tokenized_dataframe['answer'] for word in sentence)
print(len(unique_words))

# Number of non-empty sentences
non_empty_sentences = len([sentence for sentence in tokenized_dataframe['answer'] if sentence])
print(non_empty_sentences)

174
28


### **Word2Vec**

To predict answer scores from Word2Vec embeddings, we need to choose a suitable vector size for each regression model, since a model's performance can be sensitive to the embedding dimension.
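
How `utils` turns a tokenized answer into a fixed-length feature vector is not shown. One common approach, assumed here purely for illustration, is to average the word vectors of the tokens in an answer, so the feature length equals the vector size. In the sketch below, `average_vector` is a hypothetical helper and the random `embeddings` dictionary stands in for a trained Word2Vec model (in gensim, the dimension is set with the `vector_size` parameter).

```python
import numpy as np


def average_vector(tokens, embeddings, vector_size):
    """Average the vectors of known tokens; return zeros if none is known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)


vector_size = 4  # the quantity being tuned in the cells below
rng = np.random.default_rng(0)

# Random vectors standing in for trained word embeddings (assumption).
vocabulary = ['prototype', 'program', 'simulate', 'behavior']
embeddings = {word: rng.normal(size=vector_size) for word in vocabulary}

# 'unknown' has no embedding and is skipped by the helper.
features = average_vector(['prototype', 'program', 'unknown'], embeddings, vector_size)
print(features.shape)  # (4,)
```

A larger vector size gives the regressor more input features, which matters with only a few dozen answers to train on.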

##### **Fine-Tuning the Word2Vec Model for SVR**

In [9]:
utils.finetune_mse_r2(tokenized_dataframe, SVR())

Vector size:  100  - MSE:  0.25891109288636227  - R2:  0.6415077175419599  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  200  - MSE:  0.3732319797374766  - R2:  0.4832172588250323  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  300  - MSE:  0.36690605184096753  - R2:  0.4919762359125064  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  400  - MSE:  0.2771444376116618  - R2:  0.6162615479223144  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  500  - MSE:  0.26013077707383586  - R2:  0.6398189240516119  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  600  - MSE:  0.2589136302764537  - R2:  0.6415042042326025  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size: 

(0.07157048490702274, 0.9009024055133531, 23)

##### **Fine-Tuning the Word2Vec Model for Linear Regression**

In [10]:
utils.finetune_mse_r2(tokenized_dataframe, LinearRegression())

Vector size:  100  - MSE:  0.29445146078455764  - R2:  0.5922979773752279  - Best MSE:  0.06088525665688849  - Best R2:  0.9156973369366159  - Best Vector Size:  98
Vector size:  200  - MSE:  0.25029258991207826  - R2:  0.653441029352507  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  300  - MSE:  0.72143147834286  - R2:  0.0010948761406551766  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  400  - MSE:  0.1304774043959469  - R2:  0.8193389785286889  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  500  - MSE:  0.19339278082475175  - R2:  0.7322253803964975  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  600  - MSE:  0.13937666341972013  - R2:  0.8070169275726952  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector 

(0.002165965274343762, 0.997000971158601, 199)

##### **Fine-Tuning the Word2Vec Model for Decision Tree Regressor**

In [11]:
utils.finetune_mse_r2(tokenized_dataframe, DecisionTreeRegressor())

Vector size:  100  - MSE:  0.5  - R2:  0.3076923076923076  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  33
Vector size:  200  - MSE:  0.08333333333333333  - R2:  0.8846153846153846  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  33
Vector size:  300  - MSE:  0.8333333333333334  - R2:  -0.15384615384615397  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  33
Vector size:  400  - MSE:  1.0833333333333333  - R2:  -0.5  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  33
Vector size:  500  - MSE:  1.1666666666666667  - R2:  -0.6153846153846154  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  33
Vector size:  600  - MSE:  0.4166666666666667  - R2:  0.423076923076923  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  595
Vector size:  700  - MSE:  1.0833333333333333  - R2:  -0.5  - Best MSE: 

(0.0, 1.0, 595)

##### **Evaluating SVR with 0.1 test size**

In [15]:
print('SVR with 0.1 test size:')
utils.evaluate_reg_model(tokenized_dataframe, SVR(), 23)

SVR with 0.1 test size:
MSE:  0.07157048490702274  - R2:  0.9009024055133531


(0.07157048490702274, 0.9009024055133531)
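
For readers without access to `utils`, the evaluation step can be approximated as: hold out a fraction of the answers, fit the regressor on the rest, and report MSE and R2 on the held-out part. The sketch below assumes that structure. `evaluate_sketch`, the `seed` argument, and the synthetic data are all hypothetical, so the printed numbers will not match the results above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def evaluate_sketch(features, scores, model, test_size=0.1, seed=0):
    """Hypothetical stand-in for utils.evaluate_reg_model: returns (MSE, R2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, scores, test_size=test_size, random_state=seed
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions), r2_score(y_test, predictions)


# Synthetic stand-in data: 28 "answers" with 23-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(28, 23))
scores = rng.uniform(2.0, 5.0, size=28)

mse, r2 = evaluate_sketch(features, scores, SVR())
print('MSE: ', mse, ' - R2: ', r2)
```

Fixing the random seed keeps the split reproducible, which makes results for different vector sizes easier to compare.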

##### **Evaluating Linear Regression with 0.1 test size**

In [16]:
print('Linear Regression with 0.1 test size:')
utils.evaluate_reg_model(tokenized_dataframe, LinearRegression(), 199)

Linear Regression with 0.1 test size:
MSE:  0.002165965274343762  - R2:  0.997000971158601


(0.002165965274343762, 0.997000971158601)

##### **Evaluating Decision Tree Regressor with 0.1 test size**

In [43]:
print('Decision Tree with 0.1 test size:')
utils.evaluate_reg_model(tokenized_dataframe, DecisionTreeRegressor(), 82)

Decision Tree with 0.1 test size:
MSE:  0.08333333333333333  - R2:  0.8846153846153846


(0.08333333333333333, 0.8846153846153846)

##### **Fine-Tuning the Word2Vec Model for SVR with 0.2 test size**

In [44]:
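# The output below is identical to the 0.1 run, so the positional 0.2 may not
# be bound to the test size; check the signature of utils.finetune_mse_r2.
utils.finetune_mse_r2(tokenized_dataframe, SVR(), 0.2)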

Vector size:  100  - MSE:  0.25891109288636227  - R2:  0.6415077175419599  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  200  - MSE:  0.3732319797374766  - R2:  0.4832172588250323  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  300  - MSE:  0.36690605184096753  - R2:  0.4919762359125064  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  400  - MSE:  0.2771444376116618  - R2:  0.6162615479223144  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  500  - MSE:  0.26013077707383586  - R2:  0.6398189240516119  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size:  600  - MSE:  0.2589136302764537  - R2:  0.6415042042326025  - Best MSE:  0.07157048490702274  - Best R2:  0.9009024055133531  - Best Vector Size:  23
Vector size: 

(0.07157048490702274, 0.9009024055133531, 23)

##### **Fine-Tuning the Word2Vec Model for Linear Regression with 0.2 test size**

In [45]:
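# As with SVR, this output matches the 0.1 run exactly; verify that 0.2 is
# actually used as the test size by utils.finetune_mse_r2.
utils.finetune_mse_r2(tokenized_dataframe, LinearRegression(), 0.2)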

Vector size:  100  - MSE:  0.29445146078455764  - R2:  0.5922979773752279  - Best MSE:  0.06088525665688849  - Best R2:  0.9156973369366159  - Best Vector Size:  98
Vector size:  200  - MSE:  0.25029258991207826  - R2:  0.653441029352507  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  300  - MSE:  0.72143147834286  - R2:  0.0010948761406551766  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  400  - MSE:  0.1304774043959469  - R2:  0.8193389785286889  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  500  - MSE:  0.19339278082475175  - R2:  0.7322253803964975  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector size:  600  - MSE:  0.13937666341972013  - R2:  0.8070169275726952  - Best MSE:  0.002165965274343762  - Best R2:  0.997000971158601  - Best Vector Size:  199
Vector 

(0.002165965274343762, 0.997000971158601, 199)

##### **Fine-Tuning the Word2Vec Model for Decision Tree Regressor with 0.2 test size**

In [46]:
utils.finetune_mse_r2(tokenized_dataframe, DecisionTreeRegressor(), 0.2)

Vector size:  100  - MSE:  0.8333333333333334  - R2:  -0.15384615384615397  - Best MSE:  0.08333333333333333  - Best R2:  0.8846153846153846  - Best Vector Size:  22
Vector size:  200  - MSE:  0.08333333333333333  - R2:  0.8846153846153846  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  300  - MSE:  1.5  - R2:  -1.076923076923077  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  400  - MSE:  1.1666666666666667  - R2:  -0.6153846153846154  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  500  - MSE:  1.6666666666666667  - R2:  -1.307692307692308  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  600  - MSE:  0.4166666666666667  - R2:  0.423076923076923  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  700  - MSE:  0.75  - R2:  -0.03846153846153855  - Best MSE:  0.0  - Best R2:  1.0  - Best Vector Size:  157
Vector size:  800  - MSE:  0.4166666666666667  - R2:  0.

(0.0, 1.0, 157)

##### **Evaluating SVR with 0.2 test size**

In [48]:
print('SVR with 0.2 test size:')
utils.evaluate_reg_model(tokenized_dataframe, SVR(), 23, test_size=0.2)

SVR with 0.2 test size:
MSE:  0.173259253998202  - R2:  0.847869923318652


(0.173259253998202, 0.847869923318652)

##### **Evaluating Linear Regression with 0.2 test size**

In [49]:
print('Linear Regression with 0.2 test size:')
utils.evaluate_reg_model(tokenized_dataframe, LinearRegression(), 199, test_size=0.2)

Linear Regression with 0.2 test size:
MSE:  0.06778304202537318  - R2:  0.9404831826118675


(0.06778304202537318, 0.9404831826118675)

##### **Evaluating Decision Tree Regressor with 0.2 test size**

In [57]:
print('Decision Tree with 0.2 test size:')
utils.evaluate_reg_model(tokenized_dataframe, DecisionTreeRegressor(), 157, test_size=0.2)

Decision Tree with 0.2 test size:
MSE:  0.7916666666666666  - R2:  0.30487804878048785


(0.7916666666666666, 0.30487804878048785)
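
The coarse, repeating Decision Tree scores (0.0833..., 0.4166..., 0.8333...) are what a very small test set would produce. The arithmetic below is a sketch under two assumptions: that the 28 non-empty answers from the sanity check are all used, and that the split rounds the test count up the way scikit-learn's `train_test_split` does. Whether `utils` splits the data this way is not shown.

```python
import math

n_answers = 28  # non-empty answers reported by the sanity check

# Assumption: the test count is rounded up, as train_test_split does
# for a fractional test_size.
test_counts = {size: math.ceil(size * n_answers) for size in (0.1, 0.2)}
print(test_counts)  # {0.1: 3, 0.2: 6}

# With 3 test answers, a single prediction that is off by 0.5 gives:
mse_one_miss = (0.5 ** 2) / test_counts[0.1]
print(mse_one_miss)

# Total squared error implied by the 0.2 Decision Tree result (MSE * n_test):
total_squared_error = 0.7916666666666666 * test_counts[0.2]
print(round(total_squared_error, 2))  # 4.75
```

Under these assumptions, an MSE of 0.0833 on the 0.1 split corresponds to one half-point miss out of three answers, so a single test answer can move the score substantially.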