### This notebook shows the most accurate Support Vector Machine model (classes: A1, A2, B1, B2+)
(The MLP Classifier was more accurate than the SVM when I resampled 250 texts per class in the training set, but that is in a different notebook.)

##### Bootstrapping 
- with 500 texts in each class in the training set


In [1]:
# imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# load data
datafile_path = "bc_cam_with_ada_002_embeddings.csv"

df = pd.read_csv(datafile_path)
# merge the C-level labels into B2 (so the top class is effectively B2+)
df['cefr'] = df['cefr'].replace({'C': 'B2'})
df

Unnamed: 0.1,Unnamed: 0,filename,text,cefr,embedding
0,0,A1Movers_1_1,"Look, Grandpa. My friend's family are in the g...",A1,"[0.010332267731428146, -0.0009531814139336348,..."
1,1,A1Movers_1_2,"Come quickly, children. The train's waiting to...",A1,"[0.002182072727009654, -6.590186239918694e-05,..."
2,2,A1Movers_1_3,"Hello, Mrs Castle. Hello Sally, Oh I'm tired. ...",A1,"[-0.00018498786084819585, 0.013357731513679028..."
3,3,A1Movers_1_4,"Dad, come and watch this DVD with me. What's i...",A1,"[0.017183320596814156, -0.00948919914662838, 0..."
4,4,A1Movers_1_5,Can you colour this mountain picture now? Yes!...,A1,"[0.01187464315444231, 0.009958968497812748, 0...."
...,...,...,...,...,...
723,723,C2Prof_16-20,"Today, we're talking to marine biologists Gina...",B2,"[0.0013554414035752416, -0.0029449746944010258..."
724,724,C2Prof_21-30,I knew I'd be short of money if I didn't work ...,B2,"[-0.007415663916617632, -0.02614154852926731, ..."
725,725,C2Prof_3-4,"Last year, Tim Fitzgerald exhibited photograph...",B2,"[-0.009252717718482018, 0.008551654405891895, ..."
726,726,C2Prof_5-6,One of my own thoughts about this piece is the...,B2,"[-0.02017894573509693, -0.001436770660802722, ..."


I adapted the code so that the texts in the training set can be bootstrapped (using resample from sklearn)
- this code is probably not DRY (I think there are 1 or 2 unnecessary steps and it could be cleaner, but it works)

In [2]:
import ast
from sklearn.utils import resample

# Use ast.literal_eval to safely evaluate the string and convert it into a list
df['embedding'] = df['embedding'].apply(ast.literal_eval)

# create a column for each embedding
df_embeddings = pd.DataFrame(df['embedding'].to_list(), columns=[f'embed_{i}' for i in range(len(df['embedding'][0]))])

# Add the labels back
df_embeddings = pd.concat([df_embeddings, df["cefr"]], axis=1)

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df_embeddings.drop('cefr', axis=1), df_embeddings['cefr'], test_size=0.2, random_state=160923
)

# Recombine the training features and labels into one DataFrame
df_train = pd.concat([X_train, y_train], axis=1)

# Separate the classes in the training set
class_A1 = df_train[df_train['cefr'] == "A1"]
class_A2 = df_train[df_train['cefr'] == "A2"]
class_B1 = df_train[df_train['cefr'] == "B1"]
class_B2 = df_train[df_train['cefr'] == "B2"]

# Bootstrap each class in the training set to have 500 samples
class_A1_sampled = resample(class_A1, replace=True, n_samples=500, random_state=160923)
class_A2_sampled = resample(class_A2, replace=True, n_samples=500, random_state=160923)
class_B1_sampled = resample(class_B1, replace=True, n_samples=500, random_state=160923)
class_B2_sampled = resample(class_B2, replace=True, n_samples=500, random_state=160923)

# Concatenate the bootstrapped classes back together
df_train_sampled = pd.concat([class_A1_sampled, class_A2_sampled, class_B1_sampled, class_B2_sampled])

# Split the bootstrapped training set back into features and labels
X_train_sampled = df_train_sampled.drop('cefr', axis=1)
y_train_sampled = df_train_sampled.cefr
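
The per-class bootstrapping above repeats the same call four times. A more compact version might look like the sketch below. The helper name `bootstrap_per_class` and the toy DataFrame are my own illustrations (not part of the original pipeline), and the toy data uses two feature columns instead of 1536.

```python
import pandas as pd
from sklearn.utils import resample


def bootstrap_per_class(df_train, label_col="cefr", n_samples=500, random_state=160923):
    """Resample every class (with replacement) to n_samples rows."""
    sampled = [
        resample(group, replace=True, n_samples=n_samples, random_state=random_state)
        for _, group in df_train.groupby(label_col)
    ]
    return pd.concat(sampled)


# Hypothetical stand-in for df_train with unbalanced classes
toy_train = pd.DataFrame({
    "embed_0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "embed_1": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
    "cefr": ["A1", "A1", "A2", "B1", "B2", "B2"],
})

# Bootstrap each class to 5 rows (500 in the real notebook)
toy_sampled = bootstrap_per_class(toy_train, n_samples=5)
class_counts = toy_sampled["cefr"].value_counts().to_dict()
print(class_counts)
```

Because sampling is done with replacement, small classes are filled up with repeated rows, so every class ends up with the same number of training texts.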

Final Support Vector Machines Model  
See other notebooks for other alternatives

Polynomial
- this seemed to be the most accurate kernel (slightly higher than the default rbf)


Try MinMaxScaler
- Transform features by scaling each feature to a given range.

- This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

- The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) \
X_scaled = X_std * (max - min) + min \
where min, max = feature_range.

- This transformation is often used as an alternative to zero mean, unit variance scaling.
- Result: the macro avg is 1% lower with MinMaxScaler, but overall it seems to be more accurate across the classes
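
As a quick sanity check of the formula above, the sketch below applies it by hand to a small made-up array and compares the result with `MinMaxScaler`. The array values and the variable names `feature_min` / `feature_max` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

feature_min, feature_max = 0.0, 1.0  # the default feature_range

# Manual transformation, column by column (axis=0)
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (feature_max - feature_min) + feature_min

# Same transformation using scikit-learn
X_sklearn = MinMaxScaler(feature_range=(feature_min, feature_max)).fit_transform(X)

matches = np.allclose(X_manual, X_sklearn)
print(X_manual)
print("manual result matches MinMaxScaler:", matches)
```

Inside the pipeline the minimum and maximum are learned from the training set only, so test values that fall outside the training range can land slightly outside the 0 to 1 interval.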

In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# train SVM classifier
# I'm using poly because it was the most accurate 
clf = make_pipeline(MinMaxScaler(), SVC(kernel='poly', probability=True))
clf.fit(X_train_sampled, y_train_sampled)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)

# print confusion matrix
cm = confusion_matrix(y_test, preds)
print("Confusion Matrix:")
print(cm)

              precision    recall  f1-score   support

          A1       0.64      0.88      0.74         8
          A2       0.75      0.69      0.72        13
          B1       0.76      0.66      0.71        44
          B2       0.86      0.90      0.88        81

    accuracy                           0.81       146
   macro avg       0.75      0.78      0.76       146
weighted avg       0.81      0.81      0.81       146

Confusion Matrix:
[[ 7  1  0  0]
 [ 2  9  2  0]
 [ 1  2 29 12]
 [ 1  0  7 73]]


Save the model

In [4]:
import pickle

# Save to file in the current working directory
pkl_filename = "cefr_listening_finetuned.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

Load the model (this could also be done in another notebook, but I will keep it in this one)

In [5]:
# Load from file
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

In [6]:
yt_df = pd.read_csv("yt22_with_ada_002_embeddings.csv")
yt_df

Unnamed: 0.1,Unnamed: 0,video_id,text,embedding
0,0,---vnMZfsbY,sick of always in their little muscle shirts. ...,"[-0.013431421481072903, -0.0332380011677742, -..."
1,1,--2FQJwhoVI,God will not allow you to dream which you can ...,"[-0.013295106589794159, -0.012363391928374767,..."
2,2,--4PKpcm1B0,"okay, so we are in agreement. if you use the w...","[0.0010209871688857675, -0.039078325033187866,..."
3,3,--GDh3brZVg,so you've done all the work of building a busi...,"[-0.017448538914322853, 0.008592084050178528, ..."
4,4,--Le-wk1IBM,you just trotted around with it like it was th...,"[-0.008555376902222633, -0.010373649187386036,..."
...,...,...,...,...
25429,25429,__8zCbdNn1I,please welcome back to my channel. so I just w...,"[-0.010289541445672512, -0.02965848334133625, ..."
25430,25430,__ClCB4IZgY,"love. sorry, no, love. only have like 10 secon...","[-0.012538553215563297, -0.007643131073564291,..."
25431,25431,__KmlpDeJkQ,Foreign. who do I believe when they're talking...,"[-0.0016878355527296662, -0.022198883816599846..."
25432,25432,__sGJCXrl90,"foreign productive video. this year, chosen on...","[0.000500993337482214, -0.012254416011273861, ..."


In [7]:
# Use ast.literal_eval to safely evaluate the string and convert it into a list
yt_df['embedding'] = yt_df['embedding'].apply(ast.literal_eval)
yt_df["embedding"]

0        [-0.013431421481072903, -0.0332380011677742, -...
1        [-0.013295106589794159, -0.012363391928374767,...
2        [0.0010209871688857675, -0.039078325033187866,...
3        [-0.017448538914322853, 0.008592084050178528, ...
4        [-0.008555376902222633, -0.010373649187386036,...
                               ...                        
25429    [-0.010289541445672512, -0.02965848334133625, ...
25430    [-0.012538553215563297, -0.007643131073564291,...
25431    [-0.0016878355527296662, -0.022198883816599846...
25432    [0.000500993337482214, -0.012254416011273861, ...
25433    [0.014580446295440197, -0.004550808575004339, ...
Name: embedding, Length: 25434, dtype: object

In [8]:
len(yt_df['embedding'][0])

1536

I think I could just pass yt_df["embedding"] to the model (as an array)
- but since I converted the embeddings into a DataFrame when training the model, I will do the same here
- that way the data passed to the model has the same format as X_test
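
To illustrate the point above, here is a minimal sketch on made-up data (4-dimensional toy embeddings, and a toy pipeline named `toy_clf`, both assumptions for illustration). It fits a pipeline on a DataFrame and then predicts from both a stacked array and a DataFrame. As far as I know, recent scikit-learn versions may warn about missing feature names when given a plain array, which is one more reason to keep the DataFrame format.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Made-up training data with named columns, like X_train_sampled
rng = np.random.default_rng(0)
columns = [f"embed_{i}" for i in range(4)]
X_toy = pd.DataFrame(rng.normal(size=(40, 4)), columns=columns)
y_toy = np.where(X_toy["embed_0"] > 0, "B2", "B1")

toy_clf = make_pipeline(MinMaxScaler(), SVC(kernel="poly"))
toy_clf.fit(X_toy, y_toy)

# A Series of lists, like yt_df["embedding"] after ast.literal_eval
embeddings = pd.Series([row.tolist() for row in X_toy.to_numpy()])

as_array = np.vstack(embeddings.to_list())                     # option 1: 2D array
as_frame = pd.DataFrame(embeddings.to_list(), columns=columns)  # option 2: DataFrame

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the possible feature-name warning
    preds_array = toy_clf.predict(as_array)
preds_frame = toy_clf.predict(as_frame)

same = np.array_equal(preds_array, preds_frame)
print("same predictions:", same)
```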

In [9]:
# create a column for each embedding
yt_df_embeddings = pd.DataFrame(yt_df['embedding'].to_list(), columns=[f'embed_{i}' for i in range(len(yt_df['embedding'][0]))])
yt_df_embeddings

Unnamed: 0,embed_0,embed_1,embed_2,embed_3,embed_4,embed_5,embed_6,embed_7,embed_8,embed_9,...,embed_1526,embed_1527,embed_1528,embed_1529,embed_1530,embed_1531,embed_1532,embed_1533,embed_1534,embed_1535
0,-0.013431,-0.033238,-0.001576,-0.020406,-0.003048,0.001739,-0.022299,-0.009815,-0.015080,-0.014276,...,0.022858,0.006113,-0.001905,-0.031140,0.005071,0.013220,-0.009815,-0.002091,0.003329,-0.024820
1,-0.013295,-0.012363,-0.016335,-0.015951,-0.002116,0.012529,-0.023841,-0.019758,-0.027965,-0.009291,...,0.005917,0.016084,0.018542,-0.024661,-0.041339,0.008260,-0.004358,-0.011868,0.002575,-0.024833
2,0.001021,-0.039078,0.000406,-0.016821,-0.028317,-0.016239,-0.016169,-0.031063,-0.032283,-0.011274,...,0.016835,-0.007308,0.019165,-0.020122,-0.020135,-0.001105,0.006126,0.022008,-0.008924,-0.018818
3,-0.017449,0.008592,0.011778,-0.065035,-0.032914,0.006457,-0.019920,-0.016166,-0.010277,-0.022260,...,0.002774,-0.001005,0.012611,-0.013258,-0.011672,-0.011394,0.000145,-0.018109,0.020436,-0.009683
4,-0.008555,-0.010374,0.014438,-0.030748,-0.016242,0.016392,0.007660,-0.010109,-0.012423,-0.015727,...,0.000492,-0.022145,0.004471,-0.020313,-0.020517,0.004427,0.006866,-0.004895,-0.008854,-0.042580
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25429,-0.010290,-0.029658,-0.025313,-0.029471,-0.016327,0.002579,0.008317,-0.009273,-0.011714,-0.038537,...,0.018293,-0.023842,0.023240,-0.013445,-0.024791,0.044849,-0.018600,-0.008143,0.011533,-0.034419
25430,-0.012539,-0.007643,-0.004602,-0.023113,0.011379,0.014673,0.000938,-0.034009,-0.001321,-0.005093,...,0.019364,0.002441,0.009020,-0.022145,-0.018886,0.017918,-0.034009,-0.018273,-0.002086,-0.037936
25431,-0.001688,-0.022199,-0.003307,-0.034313,-0.007681,0.028053,0.010798,-0.015520,-0.009011,-0.023901,...,0.011879,-0.002015,0.023142,-0.017052,-0.008860,0.033344,-0.004407,-0.015873,-0.013594,-0.022042
25432,0.000501,-0.012254,-0.004184,-0.021104,-0.015761,0.016574,0.015069,-0.004293,-0.025241,-0.026706,...,0.017551,-0.007948,0.000401,-0.025431,-0.026177,0.031955,-0.013285,-0.019328,0.005859,-0.026204


Predict the CEFR level of the YouTube texts

In [10]:
yt_cefr = pickle_model.predict(yt_df_embeddings)
yt_cefr

array(['B2', 'B2', 'B2', ..., 'B1', 'B2', 'B1'], dtype=object)

In [11]:
yt_df["cefr"] = yt_cefr
yt_df

Unnamed: 0.1,Unnamed: 0,video_id,text,embedding,cefr
0,0,---vnMZfsbY,sick of always in their little muscle shirts. ...,"[-0.013431421481072903, -0.0332380011677742, -...",B2
1,1,--2FQJwhoVI,God will not allow you to dream which you can ...,"[-0.013295106589794159, -0.012363391928374767,...",B2
2,2,--4PKpcm1B0,"okay, so we are in agreement. if you use the w...","[0.0010209871688857675, -0.039078325033187866,...",B2
3,3,--GDh3brZVg,so you've done all the work of building a busi...,"[-0.017448538914322853, 0.008592084050178528, ...",B2
4,4,--Le-wk1IBM,you just trotted around with it like it was th...,"[-0.008555376902222633, -0.010373649187386036,...",B2
...,...,...,...,...,...
25429,25429,__8zCbdNn1I,please welcome back to my channel. so I just w...,"[-0.010289541445672512, -0.02965848334133625, ...",B2
25430,25430,__ClCB4IZgY,"love. sorry, no, love. only have like 10 secon...","[-0.012538553215563297, -0.007643131073564291,...",B2
25431,25431,__KmlpDeJkQ,Foreign. who do I believe when they're talking...,"[-0.0016878355527296662, -0.022198883816599846...",B1
25432,25432,__sGJCXrl90,"foreign productive video. this year, chosen on...","[0.000500993337482214, -0.012254416011273861, ...",B2


In [12]:
yt_df["cefr"].value_counts()

cefr
B2    24187
B1     1075
A2      139
A1       33
Name: count, dtype: int64
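
The counts above are easier to read as shares of the whole set. The sketch below rebuilds the counts by hand from the output above so that it runs on its own; in the notebook itself, `yt_df["cefr"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"B2": 24187, "B1": 1075, "A2": 139, "A1": 33}, name="count")

total = counts.sum()
shares = (counts / total).round(3)

print("total texts:", total)
print(shares)
```

Roughly 95% of the YouTube texts are predicted as B2, which is also by far the largest class in the test set above (81 of 146 texts).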

In [13]:
yt_df.to_csv("yt22_cefr_labels.csv", index=False, encoding="utf-8")