<a href="https://colab.research.google.com/github/akamdem/capstone/blob/main/Word2VecGBClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



In [2]:
import pandas as pd

from google.colab import files
uploaded = files.upload()

Saving kaggle_train_diff_essays.csv to kaggle_train_diff_essays.csv


In [8]:
import io
df = pd.read_csv(io.BytesIO(uploaded['kaggle_train_diff_essays.csv']))

df.shape

(6020, 5)

In [9]:
df = df[['essay_id', 'essay_set', 'essay', 'domain1_score']]

In [10]:
df.head()

,essay_id,essay_set,essay,domain1_score
0,5978,3,The features of the setting affect the cyclist...,1
1,5979,3,The features of the setting affected the cycli...,2
2,5980,3,Everyone travels to unfamiliar places. Sometim...,1
3,5981,3,I believe the features of the cyclist affected...,1
4,5982,3,The setting effects the cyclist because of the...,2


**Preprocess and Vectorize Text**
reference: https://github.com/codebasics/nlp-tutorials/blob/main/16_word_vectors_gensim_text_classification/gensim_w2v_google.ipynb

In [11]:
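import numpy as np
import spacy
import spacy.cli

spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

def preprocess_and_vectorize(text):
    """Return the mean Word2Vec vector of an essay's lemmatized content words."""
    doc = nlp(text)

    # drop stop words and punctuation, keep the lemma of everything else
    filtered_tokens = [
        token.lemma_ for token in doc
        if not (token.is_stop or token.is_punct)
    ]

    # look up each lemma, skipping words missing from the Word2Vec vocabulary
    vectors = [wv[token] for token in filtered_tokens if token in wv]

    # an essay with no in-vocabulary words has nothing to average;
    # fall back to a zero vector (an assumption, not part of the referenced tutorial)
    if not vectors:
        return np.zeros(wv.vector_size, dtype=np.float32)

    # get mean vector of all remaining words in essay
    return wv.get_mean_vector(vectors)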

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
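
To make the averaging step concrete, here is a minimal sketch that uses a tiny hand-made embedding table in place of `wv`. The dictionary `toy_wv`, its 3-dimensional vectors, and the helper `mean_vector` are all made up for illustration; the real model maps each word to a 300-dimensional vector.

```python
import numpy as np

# Hypothetical 3-d embeddings standing in for the 300-d Google News vectors
toy_wv = {
    "cyclist": np.array([1.0, 0.0, 2.0]),
    "road":    np.array([3.0, 2.0, 0.0]),
    "desert":  np.array([2.0, 4.0, 1.0]),
}

def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that appear in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # no known words: return zeros of the embedding size
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# "xyzzy" is out of vocabulary and is silently skipped
essay_vector = mean_vector(["cyclist", "road", "desert", "xyzzy"], toy_wv)
print(essay_vector)  # [2. 2. 1.]
```

Every essay, whatever its length, ends up as one fixed-size vector, which is what lets the classifier treat essays as rows of a feature matrix.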


**Add vector column to dataframe**

In [12]:
df['vector'] = df['essay'].apply(preprocess_and_vectorize)

**Select Essay Set to Classify**

In [17]:
# change essay_set below to classify a different essay set. Available essay sets: 3, 4, 6, and 8

essay_set = 6

df_essay = df[df['essay_set'] == essay_set]

df_essay.shape

(1800, 5)

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_essay.vector.values,
    df_essay.domain1_score,
    test_size=0.2,
    random_state=1234,
    stratify=df_essay.domain1_score
)
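
The `stratify` argument matters here because the score classes are imbalanced. As a rough illustration, the sketch below splits a made-up label list (the class sizes 10/40/50 and the placeholder features are assumptions, not the essay data) and counts the classes on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 zeros, 40 ones, 50 twos
labels = [0] * 10 + [1] * 40 + [2] * 50
features = list(range(len(labels)))  # placeholder features

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=1234, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(sorted(train_counts.items()))  # [(0, 8), (1, 32), (2, 40)]
print(sorted(test_counts.items()))   # [(0, 2), (1, 8), (2, 10)]
```

Both sides keep the 10% / 40% / 50% class mix, so even the rarest score is represented in the test set.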

**Convert array of arrays to 2D array**

In [19]:
import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
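
The conversion is needed because a pandas column of vectors comes out as a 1-D object array whose elements are themselves arrays, while scikit-learn expects a 2-D numeric matrix. A small sketch with made-up 4-dimensional vectors (the name `column` is illustrative only):

```python
import numpy as np

# A 1-D object array holding three separate vectors, similar to what
# a DataFrame column of arrays yields via .values
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.full(4, float(i))

print(column.shape)   # (3,)

# np.stack joins the inner arrays along a new first axis
stacked = np.stack(column)
print(stacked.shape)  # (3, 4)
print(stacked.dtype)  # float64
```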

**Gradient Boosting Classifier**

In [20]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# default hyperparameters; no random_state is set, so scores can vary slightly between runs
clf = GradientBoostingClassifier()

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

# confusion matrix with labelled axes and row/column totals
pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

              precision    recall  f1-score   support

           0       1.00      0.11      0.20         9
           1       0.48      0.29      0.36        34
           2       0.45      0.41      0.43        81
           3       0.57      0.77      0.66       163
           4       0.51      0.32      0.39        73

    accuracy                           0.54       360
   macro avg       0.60      0.38      0.41       360
weighted avg       0.53      0.54      0.51       360



Actual \ Predicted,0,1,2,3,4,All
0,1,4,4,0,0,9
1,0,10,19,5,0,34
2,0,7,33,41,0,81
3,0,0,15,126,22,163
4,0,0,2,48,23,73
All,1,21,73,220,45,360
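
The figures in the classification report can be re-derived from the confusion matrix above. The sketch below copies the counts from the table by hand and recomputes accuracy, per-class recall, and per-class precision. The final "within one point" ratio is an extra illustrative measure that assumes the scores are ordinal; it is not something the notebook itself reports.

```python
import numpy as np

# Rows are actual scores 0-4, columns are predicted scores 0-4
# (counts copied from the crosstab, without the "All" margins)
cm = np.array([
    [1,  4,  4,   0,  0],
    [0, 10, 19,   5,  0],
    [0,  7, 33,  41,  0],
    [0,  0, 15, 126, 22],
    [0,  0,  2,  48, 23],
])

total = cm.sum()
accuracy = np.trace(cm) / total            # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual count, per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count, per class

# share of predictions at most one score point away from the actual score
actual, predicted = np.indices(cm.shape)
within_one = cm[np.abs(actual - predicted) <= 1].sum() / total

print(f"accuracy   {accuracy:.2f}")
print("recall    ", np.round(recall, 2))
print("precision ", np.round(precision, 2))
print(f"within one {within_one:.2f}")
```

The recomputed accuracy, recall, and precision agree with the report (0.54 accuracy, 0.77 recall for score 3, and so on). Most misclassifications fall in cells next to the diagonal, which is why the within-one ratio is much higher than the exact accuracy.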
