# Prediction on Unseen Data

## 1. Imports

## 2. Load Data
- Load Test Data
- Load Pre-trained Scaler, TF-IDF Vectorizer, PCA, and Model

## 3. Preprocess Data
### 3.1 Define Preprocessing Functions
- HTML Removal
- Punctuation Replacement
- Contraction Replacement
- Trailing 's Removal
- Special Character and Punctuation Removal
- Multiple Spaces Replacement
- Words with Numbers Removal
- Lemmatization
- Emoji and Non-ASCII Character Removal
- Stop Words Removal
- Misspelled Words Identification

### 3.2 Apply Preprocessing Functions
- Transformation
- Scaling Numerical Features
- TF-IDF Vectorizer and PCA
- Combine Numerical and TF-IDF Features

## 4. Predicting and Discretizing Test Data via CatBoost Regressor
- Ensure Target Classes Consistency
- Predict on Test Data
- Discretize Predictions

## 5. Evaluate Results on Unseen Data Set
- Calculate QWK Score

## 6. Traceability Matrix and Detailed Analysis
- Confusion Matrix
- Detailed Analysis


# Imports

In [25]:
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import numpy as np
import re
import string
import unicodedata
import time
import multiprocessing
from collections import Counter
from joblib import Parallel, delayed
# import spacy
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize, regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from spellchecker import SpellChecker
import textstat
# from scipy.sparse import hstack

from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor, Pool
from sklearn.feature_extraction.text import TfidfVectorizer

import joblib

warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.metrics')

# Get the number of available CPU cores
num_cores = multiprocessing.cpu_count()
print(f"Number of available CPU cores: {num_cores}")

Number of available CPU cores: 32


# Get Data

In [13]:
# Read the test data
test_df = pd.read_csv('test_split.csv')

In [58]:
test_df.shape

(4514, 21)

# Load pre-trained scaler, TF-IDF vectorizer, PCA, model

In [50]:
import joblib
from joblib import load

# Load the pre-trained scaler
scaler = joblib.load('scaler_exp2.pkl')
print("Scaler loaded")

# Load the pre-trained TF-IDF vectorizer
tfidf_vectorizer = joblib.load('tfidf_vectorizer_exp2.pkl')
print("TF-IDF Vectorizer loaded")

# Load the pre-trained PCA
pca = joblib.load('pca_500_exp2.pkl')
print("PCA loaded")

# Load the pre-trained model
model_path = 'catboost_model_exp2_pca_500.pkl'
cb_reg = load(model_path)
print("Model loaded")

Scaler loaded
TF-IDF Vectorizer loaded
PCA loaded
Model loaded
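
The scaler, vectorizer, PCA, and model were all fitted in a separate training run, so it can be worth checking that the loaded objects agree on dimensions before transforming anything. The sketch below is illustrative only. It fits tiny stand-in objects, round-trips them through joblib in a temporary directory, and adds up the feature widths. The file names, the toy data, and the choice of `StandardScaler` are assumptions; the real artifacts may be of different types.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(20, 8))     # stand-in for the 8 numerical features
X_text = rng.normal(size=(20, 30))   # stand-in for dense TF-IDF vectors

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, used only for this round trip
    scaler_path = os.path.join(tmp, "toy_scaler.pkl")
    pca_path = os.path.join(tmp, "toy_pca.pkl")
    joblib.dump(StandardScaler().fit(X_num), scaler_path)
    joblib.dump(PCA(n_components=5).fit(X_text), pca_path)

    toy_scaler = joblib.load(scaler_path)
    toy_pca = joblib.load(pca_path)

# Width of the combined feature matrix the downstream model would receive
expected_width = toy_scaler.n_features_in_ + toy_pca.n_components_
print(expected_width)  # 13 for the toy objects
```

For this notebook the same sum would be 8 numerical features plus the PCA components, which should equal the 508 combined features reported further down.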


# Preprocess data

### Define preprocessing functions

In [15]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Load stopwords
stop_words = set(stopwords.words('english'))

# Contractions dictionary
contractions = {
    "aren't": "are not", "can't": "cannot", "could've": "could have", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
    "he'd": "he would", "he'll": "he will", "he's": "he is", "i'd": "i would", "i'll": "i will", "i'm": "i am",
    "i've": "i have", "isn't": "is not", "it'd": "it would", "it'll": "it will", "it's": "it is", "might've": "might have",
    "mightn't": "might not", "must've": "must have", "mustn't": "must not", "shan't": "shall not", "she'd": "she would",
    "she'll": "she will", "she's": "she is", "should've": "should have", "shouldn't": "should not", "that'd": "that would",
    "that's": "that is", "there's": "there is", "they'd": "they would", "they'll": "they will", "they're": "they are",
    "they've": "they have", "wasn't": "was not", "we'd": "we would", "we'll": "we will", "we're": "we are",
    "we've": "we have", "weren't": "were not", "what'll": "what will", "what're": "what are", "what's": "what is",
    "what've": "what have", "where's": "where is", "who'd": "who would", "who'll": "who will", "who're": "who are",
    "who's": "who is", "who've": "who have", "won't": "will not", "would've": "would have", "wouldn't": "would not",
    "you'd": "you would", "you'll": "you will", "you're": "you are", "you've": "you have", "let's": "let us",
    "here's": "here is", "how's": "how is",
    "Aren't": "Are not", "Can't": "Cannot", "Could've": "Could have", "Couldn't": "Could not", "Didn't": "Did not",
    "Doesn't": "Does not", "Don't": "Do not", "Hadn't": "Had not", "Hasn't": "Has not", "Haven't": "Have not",
    "He'd": "He would", "He'll": "He will", "He's": "He is", "I'd": "I would", "I'll": "I will", "I'm": "I am",
    "I've": "I have", "Isn't": "Is not", "It'd": "It would", "It'll": "It will", "It's": "It is", "Might've": "Might have",
    "Mightn't": "Might not", "Must've": "Must have", "Mustn't": "Must not", "Shan't": "Shall not", "She'd": "She would",
    "She'll": "She will", "She's": "She is", "Should've": "Should have", "Shouldn't": "Should not", "That'd": "That would",
    "That's": "That is", "There's": "There is", "They'd": "They would", "They'll": "They will", "They're": "They are",
    "They've": "They have", "Wasn't": "Was not", "We'd": "We would", "We'll": "We will", "We're": "We are",
    "We've": "We have", "Weren't": "Were not", "What'll": "What will", "What're": "What are", "What's": "What is",
    "What've": "What have", "Where's": "Where is", "Who'd": "Who would", "Who'll": "Who will", "Who're": "Who are",
    "Who's": "Who is", "Who've": "Who have", "Won't": "Will not", "Would've": "Would have", "Wouldn't": "Would not",
    "You'd": "You would", "You'll": "You will", "You're": "You are", "You've": "You have", "Let's": "Let us",
    "Here's": "Here is", "How's": "How is"
}

transitional_phrases = [
    'Above all', 'Accordingly', 'Additionally', 'After', 'After all', 'Afterward', 'All in all', 'Also', 'Alternatively', 
    'As a result', 'As an illustration', 'As long as', 'As mentioned earlier', 'As noted', 'At the same time', 'Before', 
    'Besides', 'But', 'By all means', 'Consequently', 'Conversely', 'Correspondingly', 'Despite', 'During', 'Even if', 
    'Even so', 'Especially', 'Eventually', 'Finally', 'First', 'For example', 'For instance', 'Furthermore', 'Hence', 
    'However', 'If', 'In addition', 'In brief', 'In case', 'In comparison', 'In conclusion', 'In fact', 'In contrast', 
    'In other words', 'In particular', 'In simpler terms', 'In summary', 'In the meantime', 'In the same way', 'Indeed', 
    'Instead', 'Lastly', 'Later', 'Likewise', 'Meanwhile', 'Moreover', 'More importantly', 'Namely', 'Nevertheless', 
    'Next', 'Nonetheless', 'Notably', 'Now', 'On the contrary', 'On condition that', 'On one hand', 'On the other hand', 
    'Overall', 'Particularly', 'Plus', 'Previously', 'Provided that', 'Regardless', 'Second', 'Similarly', 'Since', 
    'Specifically', 'Still', 'Subsequently', 'That is', 'Then', 'Therefore', 'Third', 'Thus', 'To clarify', 'To conclude', 
    'To demonstrate', 'To illustrate', 'To put it another way', 'To summarize', 'To sum up', 'Ultimately', 'Unless', 'Unlike', 
    'Until', 'Whereas', 'Yet', 'Above and beyond', 'According to', 'After a while', 'All things considered', 'Although', 
    'Another key point', 'As a consequence', 'As a matter of fact', 'As can be seen', 'As far as', 'As soon as', 'At first', 
    'At last', 'At length', 'At this point', 'Be that as it may', 'By and large', 'By the same token', 'Even though', 
    'For fear that', 'For that reason', 'For the most part', 'Granted', 'Henceforth', 'If by chance', 'If so', 'In a moment', 
    'In any case', 'In any event', 'In light of', 'In order to', 'In particular', 'In reality', 'In short', 'In spite of', 
    'In view of', 'It follows that', 'Least of all', 'Most importantly', 'Needless to say', 'Of course', 'On the whole', 
    'One example is', 'One reason is', 'Or', 'Over time', 'Prior to', 'Provided that', 'Seeing that', 'So as to', 'Sooner or later', 
    'Such as', 'That being said', 'The next step', 'Thereafter', 'Thereby', 'Thirdly', 'Through', 'Till', 'To be sure', 
    'To begin with', 'To illustrate', 'To reiterate', 'To the end that', 'To this end', 'Until now', 'Up to now', 'What is more', 
    'Without a doubt', 'Without delay', 'Without exception', 'Yet again'
]

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Function to lemmatize text
def lemmatize_text(text):
    words_and_tags = pos_tag(word_tokenize(text))
    lemmatized_words = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in words_and_tags]
    return ' '.join(lemmatized_words)

# Function to remove HTML tags
def removeHTML(x):
    html = re.compile(r'<.*?>')
    return html.sub(r'', x)

# Function to replace punctuation with space if absent
def replace_punctuation_with_space_if_absent(text):
    pattern = r'([.,!?;:]+)(?!\s)'
    corrected_text = re.sub(pattern, r'\1 ', text)
    return corrected_text

# Function to replace contractions
def replace_contractions(text, contractions_dict):
    contractions_re = re.compile('|'.join(map(re.escape, contractions_dict.keys())))
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)

# Function to remove trailing 's
def remove_trailing_s(text):
    words = text.split()
    words = [word[:-2] if word.endswith("'s") else word for word in words]
    return ' '.join(words)

# Function to remove special characters and punctuation
def remove_special_characters_and_punctuation(text):
    normalized_text = unicodedata.normalize('NFKD', text)
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', normalized_text)
    return cleaned_text

# Function to replace multiple spaces with a single space
def replace_multiple_spaces(text):
    return re.sub(r'\s+', ' ', text)

# Function to remove words with numbers
def remove_words_with_numbers(text):
    cleaned_text = re.sub(r'\b\w*\d\w*\b', '', text)
    return re.sub(r'\s+', ' ', cleaned_text).strip()

def preprocessed_text_part1(text):
    text = removeHTML(text)
    text = re.sub(r"@\w+", '', text)
    text = re.sub(r"\b\d+(?:'s?)?\b", '', text)
    text = re.sub(r"http\w+", '', text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\.+", ".", text)
    text = re.sub(r"\,+", ",", text)
    text = replace_punctuation_with_space_if_absent(text)
    text = replace_contractions(text, contractions)
    text = text.strip()
    return text

def preprocessed_text(text):
    text = remove_trailing_s(text)
    text = remove_special_characters_and_punctuation(text)
    text = replace_multiple_spaces(text)
    text = remove_words_with_numbers(text)
    text = text.strip()
    return text

def clean_text_from_emojis_and_non_ascii(text):
    text = re.sub(r'[^\w\s,.\n]', '', text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'\s+', ' ', text).replace('\n ', '\n')

def find_misspelled_words(text):
    spell = SpellChecker()
    words = word_tokenize(text)
    misspelled_words = spell.unknown(words)
    return [word for word in misspelled_words if re.match(r'\w+', word)]

def remove_stop_words(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

def textstat_features(text):
    features = {}   
    features['difficult_words'] = textstat.difficult_words(text)  
    features['reading_time'] = textstat.reading_time(text)  
    features['sentence_count'] = textstat.sentence_count(text)  
    features['polysyllabcount'] = textstat.polysyllabcount(text)  
    return features  

# Function to find transitional phrases in text
def find_transitional_phrases(text):
    return [phrase for phrase in transitional_phrases if phrase.lower() in text.lower()]

# General text counting functions
def count_phrases_in_list(phrases_list):
    return len(phrases_list)

def count_distinct_mistakes(word_list):
    return len(set(word_list))

def count_words(text):
    return len(text.split())

def count_distinct_words(text):
    words = text.lower().split()
    return len(set(words))
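
To make the order of the first cleaning steps concrete, here is a small standalone sketch of two of the helpers defined above, applied to a made-up sentence. The four-entry contractions dictionary and the sample text are illustrative assumptions; the function bodies mirror the ones in this cell.

```python
import re

# Reduced, illustrative subset of the contractions dictionary
toy_contractions = {"can't": "cannot", "It's": "It is", "it's": "it is", "isn't": "is not"}

def replace_punctuation_with_space_if_absent(text):
    # Insert a space after punctuation that is not already followed by whitespace
    return re.sub(r'([.,!?;:]+)(?!\s)', r'\1 ', text)

def replace_contractions(text, contractions_dict):
    # Build one alternation pattern and substitute each match with its expansion
    contractions_re = re.compile('|'.join(map(re.escape, contractions_dict.keys())))
    return contractions_re.sub(lambda match: contractions_dict[match.group(0)], text)

sample = "I can't wait.It's great,isn't it"
spaced = replace_punctuation_with_space_if_absent(sample)
expanded = replace_contractions(spaced, toy_contractions)

print(spaced)    # I can't wait. It's great, isn't it
print(expanded)  # I cannot wait. It is great, is not it
```

The apostrophe is not in the punctuation class, so spacing out punctuation first leaves contractions intact for the dictionary lookup that follows.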


### Transformation

In [21]:
# Feature Engineering
# Preprocessed Text
print("Working on Preprocessed Text Features")
start_time = time.time()

test_df['preprocessed_text_part1'] = Parallel(n_jobs=num_cores)(delayed(preprocessed_text_part1)(row) for row in test_df['full_text'])
test_df['preprocessed_text'] = Parallel(n_jobs=num_cores)(delayed(preprocessed_text)(row) for row in test_df['preprocessed_text_part1'])
test_df['lemmatized_preprocessed_text'] = test_df['preprocessed_text'].apply(lemmatize_text)
test_df['clean_lemm_preprocessed_text'] = test_df['lemmatized_preprocessed_text'].apply(remove_stop_words)
test_df['full_text_without_non_ascii'] = Parallel(n_jobs=num_cores)(delayed(clean_text_from_emojis_and_non_ascii)(row) for row in test_df['full_text'])

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

# Misspelled Words
print("Working on Misspelled Features")
start_time = time.time()

test_df['misspelled_words_spell_checker'] = Parallel(n_jobs=num_cores)(delayed(find_misspelled_words)(row) for row in test_df['preprocessed_text'])

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

## Text Analysis Features
print("Working on Text Analysis Features")
start_time = time.time()

test_df['comma_count'] = test_df['full_text'].str.count(',')
test_df['transitional_phrases'] = test_df['preprocessed_text'].apply(find_transitional_phrases)
test_df['mistakes_dist_count'] = test_df['misspelled_words_spell_checker'].apply(count_distinct_mistakes)
test_df['transitional_phrases_c'] = test_df['transitional_phrases'].apply(count_phrases_in_list)
test_df['preprocessed_text_count'] = test_df['preprocessed_text'].apply(count_words)
test_df['preprocessed_text_dist_count'] = test_df['preprocessed_text'].apply(count_distinct_words)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

## Ratio Features
print("Working on Ratio Features")
start_time = time.time()

test_df['text_dist_words_ratio'] = test_df.apply(lambda x: x['preprocessed_text_dist_count'] / x['preprocessed_text_count'], axis=1)
test_df['mistakes_dist_ratio'] = test_df.apply(lambda x: x['mistakes_dist_count'] / x['preprocessed_text_count'], axis=1)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

## Text Statistics Features
print("Working on Text Statistics Features")
start_time = time.time()

results = Parallel(n_jobs=num_cores)(delayed(textstat_features)(text) for text in test_df['full_text_without_non_ascii'])
features_df = pd.DataFrame(results)
test_df = pd.concat([test_df, features_df], axis=1)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")


Working on Preprocessed Text Features
Elapsed time: 342.83931851387024 seconds
Working on Misspelled Features
Elapsed time: 108.97266125679016 seconds
Working on Text Analysis Features
Elapsed time: 8.923282623291016 seconds
Working on Ratio Features
Elapsed time: 0.3555586338043213 seconds
Working on Text Statistics Features
Elapsed time: 8.227802515029907 seconds
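
The two ratio features divide by `preprocessed_text_count`, which would be zero for an essay that is empty after preprocessing. The sketch below shows one vectorized way to compute the same ratios while guarding that case. The toy values and the fill value of `0.0` are assumptions; whatever rule is used here should match what was done when the scaler and model were trained.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "preprocessed_text_count": [4, 0, 10],
    "preprocessed_text_dist_count": [3, 0, 5],
    "mistakes_dist_count": [1, 0, 2],
})

# Replace zero word counts with NaN so the division yields NaN instead of failing
word_count = toy["preprocessed_text_count"].replace(0, np.nan)

# Assumed convention: an empty essay gets a ratio of 0.0
toy["text_dist_words_ratio"] = (toy["preprocessed_text_dist_count"] / word_count).fillna(0.0)
toy["mistakes_dist_ratio"] = (toy["mistakes_dist_count"] / word_count).fillna(0.0)

print(toy[["text_dist_words_ratio", "mistakes_dist_ratio"]])
```

Column-wise division also avoids the row-by-row `apply`, although the timing above shows this step is already cheap compared with preprocessing.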


### Scaling Numerical Features

In [34]:
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]
# Create a DataFrame for numerical features
test_numerical_features_df = test_df[numerical_features]

In [36]:
# Apply the pre-trained scaler to the numerical features of the test dataset
scaled_test_features = scaler.transform(test_numerical_features_df)

# Convert scaled numerical features back to DataFrame
scaled_test_features_df = pd.DataFrame(scaled_test_features, columns=numerical_features)

print("Scaling for test data completed.")

Scaling for test data completed.


### TF-IDF Vectorizer and PCA

In [38]:
# Apply the TF-IDF vectorizer to the test data
test_tfidf_vectors = tfidf_vectorizer.transform(test_df['clean_lemm_preprocessed_text'])

# Apply the PCA to the TF-IDF vectors of the test data
test_tfidf_vectors_reduced = pca.transform(test_tfidf_vectors.toarray())

# Create a DataFrame with the reduced TF-IDF features for the test data
test_tfidf_features_df = pd.DataFrame(test_tfidf_vectors_reduced, columns=[f'tfidf_feature_{i}' for i in range(1, test_tfidf_vectors_reduced.shape[1] + 1)])

print("TF-IDF and PCA transformation for test data completed")


TF-IDF and PCA transformation for test data completed
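
The vectorizer and PCA are applied with `transform`, so the vocabulary and the projection are fixed from training and words that only occur in the test essays are ignored. A minimal sketch on a toy corpus follows; the documents and the two-component PCA are made up for illustration and do not reflect the real 500-component setup.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "student write essay",
    "essay need evidence",
    "student need practice",
    "write clear evidence",
]
# The second test document contains only words never seen during fitting
test_docs = ["student write evidence", "zebra quokka"]

toy_vectorizer = TfidfVectorizer().fit(train_docs)
toy_pca = PCA(n_components=2).fit(toy_vectorizer.transform(train_docs).toarray())

test_vectors = toy_vectorizer.transform(test_docs)        # sparse; vocabulary fixed at fit time
test_reduced = toy_pca.transform(test_vectors.toarray())  # PCA needs a dense array

print(test_vectors.shape)    # (2, 7)
print(test_vectors[1].nnz)   # 0: out-of-vocabulary words contribute nothing
print(test_reduced.shape)    # (2, 2)
```

The `.toarray()` call densifies the whole TF-IDF matrix, which is manageable for a few thousand essays but may need batching for much larger test sets.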


### Combine Numerical and TF-IDF Features

In [40]:
# Combine the scaled numerical features with the TF-IDF features
combined_test_features_df = pd.concat([scaled_test_features_df, test_tfidf_features_df], axis=1)

# Print the count of the combined features
print(f"Total number of features for test data: {combined_test_features_df.shape[1]}")

Total number of features for test data: 508
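
`pd.concat(..., axis=1)` aligns rows by index label, not by position. Both frames in this notebook are built fresh, so their indexes already run from 0 to n-1, but the sketch below shows what could go wrong if one frame kept an older index, and a simple safeguard. The toy frames are illustrative assumptions.

```python
import pandas as pd

# Index left over from, say, a filtered frame
numeric = pd.DataFrame({"comma_count": [0.1, 0.2]}, index=[10, 11])
# Fresh default index 0..1
tfidf = pd.DataFrame({"tfidf_feature_1": [0.5, 0.6]})

# Label-based alignment produces four rows padded with NaN
misaligned = pd.concat([numeric, tfidf], axis=1)

# Resetting both indexes restores positional alignment
aligned = pd.concat(
    [numeric.reset_index(drop=True), tfidf.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (4, 2) 4
print(aligned.shape, int(aligned.isna().sum().sum()))        # (2, 2) 0
```

A quick check that the combined frame has the same number of rows as `test_df` and contains no NaN values would catch this kind of misalignment before prediction.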


# Predicting and Discretizing Test Data via CatBoost Regressor

In [54]:
# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

# Custom scorer
def qwk_scorer(y_true, y_pred):
    target_classes = np.sort(np.unique(y_true))
    y_pred_discretized = discretize_predictions(y_pred, target_classes)
    return cohen_kappa_score(y_true, y_pred_discretized, weights='quadratic')
    
# Ensure the target classes are consistent with the training data
target_classes = np.array([1, 2, 3, 4, 5, 6])

# Predict on the test data
y_test_pred = cb_reg.predict(combined_test_features_df)
y_test_pred_discretized = discretize_predictions(y_test_pred, target_classes)

# Print the predicted results
test_results_df = pd.DataFrame({
    'Predicted Score': y_test_pred_discretized
})
print("Predicted results on the test data:")
print(test_results_df)

Predicted results on the test data:
      Predicted Score
0                   5
1                   5
2                   3
3                   5
4                   2
...               ...
4509                2
4510                5
4511                2
4512                4
4513                3

[4514 rows x 1 columns]
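
As a worked example of the discretization step, the sketch below runs `discretize_predictions` (copied from the cell above) on a few hand-picked raw outputs. The sample values are made up. With classes 1 to 6 the bin edges are 0.5, 1.5, ..., 6.5, so each prediction maps to the nearest class, halves go up, and values outside the range are clipped to 1 or 6.

```python
import numpy as np

def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5,
                       num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

target_classes = np.array([1, 2, 3, 4, 5, 6])
raw = np.array([0.2, 1.49, 2.5, 3.7, 6.9])  # illustrative regressor outputs

binned = discretize_predictions(raw, target_classes)
print(binned)  # [1 1 3 4 6]

# Equivalent shortcut for consecutive classes starting at 1: round half up, then clip
rounded = np.clip(np.floor(raw + 0.5), 1, 6).astype(int)
print(rounded)  # [1 1 3 4 6]
```

Note that `np.round` would not be a drop-in replacement, because it rounds halves to the nearest even number and would map 2.5 to 2. The final `+ 1` in the function also relies on the classes being consecutive integers starting at 1.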


# Evaluate Results on Unseen Data Set

In [60]:
# Evaluate the performance
y_test_actual = test_df['score']
# qwk_scorer discretizes internally, so pass the raw regressor output
qwk_score = qwk_scorer(y_test_actual, y_test_pred)

# Print the predicted results
test_results_df = pd.DataFrame({
    'Actual Score': y_test_actual,
    'Predicted Score': y_test_pred_discretized
})
print("Predicted results on the test data:")
print(test_results_df)

# Print the QWK score
print(f"Quadratic Weighted Kappa Score: {qwk_score:.4f}")

Predicted results on the test data:
      Actual Score  Predicted Score
0                6                5
1                4                5
2                2                3
3                4                5
4                2                2
...            ...              ...
4509             2                2
4510             6                5
4511             3                2
4512             4                4
4513             2                3

[4514 rows x 2 columns]
Quadratic Weighted Kappa Score: 0.8359


# Traceability Matrix and Detailed Analysis

In [68]:
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Confusion Matrix and Detailed Analysis
conf_matrix = confusion_matrix(y_test_actual, y_test_pred_discretized, labels=target_classes)
conf_matrix_df = pd.DataFrame(conf_matrix, index=target_classes, columns=target_classes)

# Display the confusion matrix as a DataFrame
print("Confusion Matrix (DataFrame):")
print(conf_matrix_df)

# Detailed Prediction Analysis
detailed_analysis = []
for i, true_label in enumerate(target_classes):
    for j, pred_label in enumerate(target_classes):
        if true_label != pred_label:
            count = conf_matrix[i, j]
            detailed_analysis.append({
                'Actual': true_label,
                'Predicted': pred_label,
                'Count': count
            })

detailed_analysis_df = pd.DataFrame(detailed_analysis)

# Display the detailed analysis as a DataFrame
print("\nDetailed Analysis (DataFrame):")
print(detailed_analysis_df)

Confusion Matrix (DataFrame):
   1     2    3    4    5   6
1  1   204   44   18    0   0
2  0  1055  241   44    2   0
3  0   258  706  270   22   0
4  0     4  188  461  132   0
5  0     0    1   90  591   4
6  0     0    0    0  160  18

Detailed Analysis (DataFrame):
    Actual  Predicted  Count
0        1          2    204
1        1          3     44
2        1          4     18
3        1          5      0
4        1          6      0
5        2          1      0
6        2          3    241
7        2          4     44
8        2          5      2
9        2          6      0
10       3          1      0
11       3          2    258
12       3          4    270
13       3          5     22
14       3          6      0
15       4          1      0
16       4          2      4
17       4          3    188
18       4          5    132
19       4          6      0
20       5          1      0
21       5          2      0
22       5          3      1
23       5          4     90
24       5          6      4
25       6          1      0
26       6          2      0
27       6          3      0
28       6          4      0
29       6          5    160

The QWK score on the unseen data set is 0.8359, which is consistent with the QWK scores observed during the training and validation process. This indicates that the model has generalized well to the unseen data.
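
The reported score can be cross-checked from the confusion matrix printed above. The sketch below applies the standard quadratic weighted kappa definition, one minus the ratio of weighted observed disagreement to weighted chance disagreement, using only NumPy. The matrix values are copied from the output above; the variable names are illustrative.

```python
import numpy as np

# Rows: actual score 1..6, columns: predicted score 1..6
conf = np.array([
    [1,  204,  44,  18,   0,  0],
    [0, 1055, 241,  44,   2,  0],
    [0,  258, 706, 270,  22,  0],
    [0,    4, 188, 461, 132,  0],
    [0,    0,   1,  90, 591,  4],
    [0,    0,   0,   0, 160, 18],
])

n_classes = conf.shape[0]
idx = np.arange(n_classes)

# Quadratic penalty: 0 on the diagonal, 1 for the largest possible disagreement
weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

# Counts expected by chance from the row and column totals
expected = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / conf.sum()

qwk = 1.0 - (weights * conf).sum() / (weights * expected).sum()
print(f"{qwk:.4f}")  # 0.8359
```

This reproduces the value returned by `cohen_kappa_score(..., weights='quadratic')` for these counts.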

The detailed analysis shows where the misclassifications occur. Notable points include:

- About 63% of essays (2,832 of 4,514) receive exactly the right score. Recall is highest for class 5 (591 of 686) and class 2 (1,055 of 1,342); classes 3 and 4 are correct a little over half the time.
- Class 1 is almost never predicted: only 1 of 267 class 1 essays is scored correctly, and most are assigned class 2 (204) or class 3 (44).
- Class 3 errors split almost evenly between class 2 (258) and class 4 (270).
- Class 6 is mostly predicted as class 5 (160 of 178), so the model rarely outputs either extreme score.
- Most errors (1,547 of 1,682) are off by exactly one class, which helps explain why the QWK stays high despite the modest exact-match rate.
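
The per-class figures behind these observations can be recomputed from the confusion matrix printed above. The following is a small NumPy sketch, with the matrix copied from that output and illustrative variable names.

```python
import numpy as np

# Rows: actual score 1..6, columns: predicted score 1..6
conf = np.array([
    [1,  204,  44,  18,   0,  0],
    [0, 1055, 241,  44,   2,  0],
    [0,  258, 706, 270,  22,  0],
    [0,    4, 188, 461, 132,  0],
    [0,    0,   1,  90, 591,  4],
    [0,    0,   0,   0, 160, 18],
])

actual_totals = conf.sum(axis=1)     # essays per true class
predicted_totals = conf.sum(axis=0)  # how often each class is predicted
recall = np.diag(conf) / actual_totals

# Errors are all off-diagonal cells; off-by-one errors sit next to the diagonal
idx = np.arange(conf.shape[0])
distance = np.abs(idx[:, None] - idx[None, :])
errors = conf.sum() - np.trace(conf)
off_by_one = conf[distance == 1].sum()

for label, total, r in zip(idx + 1, actual_totals, recall):
    print(f"class {label}: {total} essays, recall {r:.3f}")
print(f"errors: {errors}, off by one: {off_by_one}")
print(f"times predicted per class: {predicted_totals.tolist()}")
```

The column totals show how rarely the extreme scores are predicted: class 1 once and class 6 only 22 times across 4,514 essays.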