# LT2222 Machine learning for statistical NLP: introduction
# Assignment 1: Language Identification

**Language Identification** (also known as LangID, LID, LI, or Language Detection) is an NLP task whose goal is to correctly identify the language of a word or a passage. It is a type of **classification task.** It can be very useful in a variety of applications and situations, for example returning relevant results in the same language in Information Retrieval tasks. It is especially important when the possible user base is multilingual; in some cases, that can even apply to governmental websites or applications - think of those countries that have more than one official language. For a comprehensive survey of Language Identification, see [Jauhiainen et al. (2019)](https://www.proquest.com/scholarly-journals/automatic-language-identification-texts-survey/docview/2554056804/se-2?accountid=11162). As the authors write, simpler ML methods, such as **Support Vector Machines (SVMs),** can achieve very good performance in this task. In fact, many of the best submissions for various LI shared tasks have been SVM-based.

While Language Identification is a term that encompasses all the possible modalities (e.g. speech or sign language), in this assignment, the focus will be on the LI of **textual data.** The general task for this assignment is to import the [CoLI-Kenglish dataset](https://sites.google.com/view/kanglishicon2022/dataset?authuser=0), a dataset containing predominantly tokens in English and Kannada (one of the languages spoken in India), inspect its structure, select the features for the model to take into account, and train and evaluate an SVM model.

In this assignment, you will be provided with some pre-existing code and instructions for the missing parts. The assignment should therefore be completed in your copy of this notebook. It is possible to score 25 points in this assignment, with an additional 6 extra points.



### Part 1: Importing the dataset (4/5 points)




The first step for this assignment is downloading the [CoLI-Kenglish dataset's](https://sites.google.com/view/kanglishicon2022/dataset?authuser=0) train set and test set with labels. The *wget* commands below will download those two .csv files into your working directory as *kanglish-train.csv* and *kanglish-test.csv*.

In [2]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=15I5-evuUKgXjVfR1kFPWnhtfMXAjwVir' -O kanglish-train.csv
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ajTuulVO6uWH6izCLOOuefgI_GUjz0UH' -O kanglish-test.csv

--2024-02-12 09:17:53--  https://docs.google.com/uc?export=download&id=15I5-evuUKgXjVfR1kFPWnhtfMXAjwVir
Resolving docs.google.com (docs.google.com)... 142.250.74.14, 2a00:1450:400f:802::200e
Connecting to docs.google.com (docs.google.com)|142.250.74.14|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=15I5-evuUKgXjVfR1kFPWnhtfMXAjwVir&export=download [following]
--2024-02-12 09:17:53--  https://drive.usercontent.google.com/download?id=15I5-evuUKgXjVfR1kFPWnhtfMXAjwVir&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 216.58.211.1, 2a00:1450:400f:80c::2001
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|216.58.211.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176944 (173K) [application/octet-stream]
Saving to: ‘kanglish-train.csv’


2024-02-12 09:17:54 (3.51 MB/s) - ‘kanglish-train.csv’ saved [176944/176944]

--

Next, we need to import the libraries that are relevant for this assignment. Feel free to add more to this list if you discover that you need to use a different library.

In [3]:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import re



Now that we have both our dataset files and the necessary libraries, it is time to import and inspect the data in the notebook.

One easy way to import a .csv file in Python is using the pandas library. This will result in our data now being stored in a DataFrame object. These are very handy for storing and manipulating the data.

**YOUR TASK:**


*   Import the test and train files using [pandas' *read_csv* function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
*   Display the first 10 lines of the training set using the [*.head()* method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
*   Read about [indexing DataFrames](https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/) and use it to select the tag column, and then return the unique tags in that column using the [*.unique()* method](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html). Store that under a new variable name
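
The three steps above can be tried out on a tiny in-memory CSV before touching the real files. This is only a sketch: the `toy_csv` text and its layout (an unnamed first column holding row numbers, then `word` and `tag`) are assumptions made for illustration, based on the `Unnamed: 0` column that shows up when the training set is displayed. The optional `index_col=0` argument of `read_csv` tells pandas to use that first column as the row index rather than as data.

```python
import io
import pandas as pd

# Toy CSV text standing in for kanglish-train.csv
# (assumed layout: unnamed index column, then word and tag).
toy_csv = ",word,tag\n0,anusthu,kn\n1,woww,en\n2,staying,en\n"

# Default behaviour: the unnamed first column is loaded as 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(toy_csv))

# With index_col=0 the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(toy_csv), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))

# Select the tag column and collect its distinct values
# (returned in order of first appearance, not sorted).
toy_tags = df_indexed["tag"].unique()
print(toy_tags)
```

Either way of loading the file works for this assignment, since only the `word` and `tag` columns are used later.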




In [4]:
# read in the files as DataFrames
# kanglish_train =  pd.read_csv('Train_Kanglish.csv')
# kanglish_test =  pd.read_csv('Test_withLabels_Kanglish.csv')

# the part above was good, as long as those were the filenames - the wget was supposed to download it with names universal for everyone, and that's the ones I am using; it's not a mistake on your end though!
kanglish_test = pd.read_csv('./kanglish-test.csv')
kanglish_train = pd.read_csv('./kanglish-train.csv')
print(kanglish_train)

                word    tag
0            anusthu     kn
1               woww     en
2            staying     en
3               near     en
4             hostel     en
...              ...    ...
14842    hiremadtara  en-kn
14843    solutionila  en-kn
14844  accessmadkoli  en-kn
14845    glasshakisi  en-kn
14846        keybeku  en-kn

[14847 rows x 2 columns]


In [5]:
# display the first 10 lines of the training set
kanglish_train.head(11)  # that is 11 lines - .head(10) is perfectly fine, as the line numbering starts at 0

Unnamed: 0,word,tag
0,anusthu,kn
1,woww,en
2,staying,en
3,near,en
4,hostel,en
5,confirmed,en
6,faith,en
7,linked,en
8,gotila,kn
9,germany,en


In [6]:
labels =  kanglish_train.loc[:,'tag'].unique()

In [7]:
# display the labels
labels

array(['kn', 'en', 'name', 'location', 'en-kn', 'other'], dtype=object)

### Part 2: Feature selection (10/10 points)

Now that we have the data imported and we know what it looks like, it is time for us to select the features that our machine learning model should be looking at. Character-based features, such as co-occurring characters, character repetitions, or sequence length, are known to be informative for this task.

**YOUR TASK:**


*   Create a function that takes a word and returns a dictionary containing the following: word length (number of characters) and the last 2 letters of the word (e.g. for the word "tag" this dictionary should look somewhat like this: *{'len': 3, 'suffix': 'ag'}*, with the key names being up to you)
*   Iterate over the *word* column of the train set and test set to create two separate lists of features representing these words
*   Use sklearn's [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to turn the feature dictionaries into a machine learning model-readable version: with just numbers. The output should be stored as *X_train* and *X_test*. **Important!** Note that the training and test data have to be encoded the same way. Pay attention to the *fit*, *transform*, and *fit_transform* methods that the DictVectorizer has in order to first fit it to your training data, then transform the training data, and transform the test data using the same vectorizer
*   Store the *tag* column of the train and test sets as *y_train* and *y_test*.



In [8]:
def encode_features(word):
  encoding = {'length' : len(word), 'last_letters' : word[-2:] }
  return encoding


In [9]:
# encode all of the words in training and test sets, creating two lists of dictionaries
features_train = []
for word in kanglish_train.loc[:,'word']:
    features_train.append(encode_features(word))
features_test = []
for word in kanglish_test.loc[:,'word']:
    features_test.append(encode_features(word))
print(features_train)
print(features_test)

[{'length': 7, 'last_letters': 'hu'}, {'length': 4, 'last_letters': 'ww'}, {'length': 7, 'last_letters': 'ng'}, {'length': 4, 'last_letters': 'ar'}, {'length': 6, 'last_letters': 'el'}, {'length': 9, 'last_letters': 'ed'}, {'length': 5, 'last_letters': 'th'}, {'length': 6, 'last_letters': 'ed'}, {'length': 6, 'last_letters': 'la'}, {'length': 7, 'last_letters': 'ny'}, {'length': 6, 'last_letters': 'hu'}, {'length': 7, 'last_letters': 'dh'}, {'length': 6, 'last_letters': 'de'}, {'length': 7, 'last_letters': 're'}, {'length': 10, 'last_letters': 'on'}, {'length': 6, 'last_letters': 'ne'}, {'length': 7, 'last_letters': 'de'}, {'length': 7, 'last_letters': 'de'}, {'length': 13, 'last_letters': 'ke'}, {'length': 8, 'last_letters': 'te'}, {'length': 10, 'last_letters': 'hu'}, {'length': 8, 'last_letters': 'de'}, {'length': 8, 'last_letters': 'de'}, {'length': 5, 'last_letters': 'al'}, {'length': 7, 'last_letters': 'ed'}, {'length': 5, 'last_letters': 'si'}, {'length': 5, 'last_letters': 'ri'

In [10]:
# instantiate a vectorizer
vectorizer = DictVectorizer()

In [90]:
# attune the vectorizer to your data


In [11]:
# use the vectorizer on the training data and the test data
X_train =  vectorizer.fit_transform(features_train)
X_test =  vectorizer.transform(features_test) 
# using transform (not fit_transform) so that the test data is encoded with the vocabulary learned from the training data
# Named features not encountered during fit or fit_transform will be silently ignored.  # Exactly! Great observation.

In [12]:
# extract the 'tag' column (classes)
y_train = kanglish_train.loc[:,'tag']
y_test = kanglish_test.loc[:,'tag']



### Part 3: Training the model (4/4 points)


We now have our train and test sets encoded in a machine learning-friendly format, with our features (X) and our classes (y) separated. It is high time we train a machine learning model.

**YOUR TASK**:


*   Instantiate a [LinearSVC model](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)
*   Fit the model on *X_train* and *y_train*.
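
As an optional alternative to handling the vectorizer and the model as two separate objects, scikit-learn's `make_pipeline` can bundle them, so that fitting and predicting always use a matching vocabulary. This is a sketch, not a required part of the assignment: the four feature dictionaries mirror the first few encoded training words shown earlier, while `toy_tags`, `pipeline` and the query word are names and values chosen here for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A handful of feature dictionaries and their tags (tiny toy sample).
toy_features = [
    {"length": 7, "last_letters": "hu"},
    {"length": 4, "last_letters": "ww"},
    {"length": 7, "last_letters": "ng"},
    {"length": 6, "last_letters": "la"},
]
toy_tags = ["kn", "en", "en", "kn"]

# The pipeline fits the vectorizer and then the classifier in one call.
pipeline = make_pipeline(DictVectorizer(), LinearSVC())
pipeline.fit(toy_features, toy_tags)

# At prediction time the same fitted vectorizer is applied automatically.
toy_pred = pipeline.predict([{"length": 5, "last_letters": "ng"}])
print(toy_pred)
```

Creating a fresh pipeline per experiment also avoids refitting one shared `vectorizer` and `model` object several times, which makes it easier to keep track of which features a given model was trained on.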



In [13]:
model =  LinearSVC()


In [14]:
# fit the model on your data
model.fit(X_train,y_train)




### Part 4: Evaluating the model (5/6 points)


We have successfully trained a model - but now what? In order to know how successful it is, we should evaluate it using some measures.

**YOUR TASK:**


*   Use your model to predict the classes for *X_test*
*   Use [sklearn's evaluation measure functions](https://scikit-learn.org/0.15/modules/model_evaluation.html) to calculate the following measures: accuracy, and per-class precision, recall, and F1 for the predictions in comparison with the ground truth (*y_test*). Note that you will have to specify some parameters in order to get the per-class measures; print them in a way that makes it clear which score refers to which class.
*   Discuss the results. Do you think the model is performing well? What classes is the model having problems with?
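
A detail that is easy to miss when printing per-class measures: with `average=None` and no `labels` argument, scikit-learn returns the scores in sorted label order, which generally differs from the order of first appearance that `.unique()` gives. The sketch below uses made-up gold and predicted tags (not the real data) to show the difference and how the `labels` parameter fixes the alignment; `seen_order` is a name chosen here for illustration.

```python
from sklearn.metrics import precision_score

# Made-up gold and predicted tags for illustration only.
y_true = ["kn", "en", "en", "kn", "en-kn", "en", "en"]
y_pred = ["kn", "en", "en", "kn", "en", "en", "en"]

# Order of first appearance, as pandas' .unique() would return it.
seen_order = ["kn", "en", "en-kn"]

# Without labels=, scores come back in sorted order: en, en-kn, kn.
default_scores = precision_score(y_true, y_pred, average=None, zero_division=0)

# With labels=, scores follow exactly the order we pass in.
aligned_scores = precision_score(
    y_true, y_pred, average=None, labels=seen_order, zero_division=0
)

for label, score in zip(sorted(set(y_true)), default_scores):
    print(f"(sorted order)  precision for class {label}: {score:.2f}")
for label, score in zip(seen_order, aligned_scores):
    print(f"(labels= order) precision for class {label}: {score:.2f}")
```

Zipping `seen_order` with `default_scores` would silently attach the `en` score to `kn`, so it is safest to always pass `labels=` explicitly or to zip with the sorted labels.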



In [15]:
y_pred =  model.predict(X_test)

In [16]:
# print out the evaluation using various sklearn functions
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)
precision_per_class = precision_score(y_test,y_pred,average = None)
for label , precision in zip(labels,precision_per_class):  # the order of the scores is not the same as of the labels; see below how to solve it with the labels parameter (and how it changes).
    print(f'precision for class {label}: {precision}')
recall_per_class = recall_score(y_test,y_pred,average = None)
for label , recall in zip(labels,recall_per_class):
    print(f'recall for class {label}: {recall}')
f1_per_class = f1_score(y_test,y_pred,average=None)
for label , f1 in zip(labels,f1_per_class):
    print(f'F1 for class {label}: {f1}')



0.7491821155943293
precision for class kn: 0.8309776207302709
precision for class en: 0.18
precision for class name: 0.7334818114328137
precision for class location: 0.0
precision for class en-kn: 0.5370370370370371
precision for class other: 0.11363636363636363
recall for class kn: 0.7782680639823497
recall for class en: 0.0967741935483871
recall for class name: 0.9006381039197813
recall for class location: 0.0
recall for class en-kn: 0.08192090395480225
recall for class other: 0.1
F1 for class kn: 0.8037596126459698
F1 for class en: 0.1258741258741259
F1 for class name: 0.8085106382978723
F1 for class location: 0.0
F1 for class en-kn: 0.14215686274509803
F1 for class other: 0.10638297872340426


In [17]:
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)
precision_per_class = precision_score(y_test,y_pred,average = None, labels=labels)
for label , precision in zip(labels,precision_per_class):
    print(f'precision for class {label}: {precision}')
recall_per_class = recall_score(y_test,y_pred,average = None, labels=labels)
for label , recall in zip(labels,recall_per_class):
    print(f'recall for class {label}: {recall}')
f1_per_class = f1_score(y_test,y_pred,average=None, labels=labels)
for label , f1 in zip(labels,f1_per_class):
    print(f'F1 for class {label}: {f1}')



0.7491821155943293
precision for class kn: 0.7334818114328137
precision for class en: 0.8309776207302709
precision for class name: 0.5370370370370371
precision for class location: 0.0
precision for class en-kn: 0.18
precision for class other: 0.11363636363636363
recall for class kn: 0.9006381039197813
recall for class en: 0.7782680639823497
recall for class name: 0.08192090395480225
recall for class location: 0.0
recall for class en-kn: 0.0967741935483871
recall for class other: 0.1
F1 for class kn: 0.8085106382978723
F1 for class en: 0.8037596126459698
F1 for class name: 0.14215686274509803
F1 for class location: 0.0
F1 for class en-kn: 0.1258741258741259
F1 for class other: 0.10638297872340426


**DISCUSS** the model performance.

The model's accuracy is fairly good; however, its precision is quite poor for several classes. This shows us that the features we have chosen aren't quite sufficient for the model to perform well.



In [None]:
# What do the per-class scores tell us?
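
One way to approach that question is `classification_report`, which prints precision, recall, F1 and support with the class name attached to each row, so there is no risk of pairing a score with the wrong label. The sketch below runs on made-up tags rather than the real predictions. It illustrates how a decent accuracy can coexist with a class that is never predicted correctly, which the macro average makes visible.

```python
from sklearn.metrics import classification_report

# Made-up gold and predicted tags for illustration only.
y_true = ["kn", "en", "en", "kn", "en-kn", "en", "en"]
y_pred = ["kn", "en", "en", "kn", "en", "en", "en"]

# Human-readable table: one row per class, labelled by class name.
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a nested dict, convenient for further processing.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["en-kn"])
print(report["macro avg"]["precision"])
```

Here six of the seven toy tokens are classified correctly, yet the rare `en-kn` class gets precision and recall of zero; accuracy alone would hide that.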

### Extra part 1: Feature selection (3/3 points)

The features we have selected in part 2 do not need to be the best out there - so let us expand on the feature selection.

**YOUR TASK:**


*   Pick one more feature we could use and justify your choice
*   Expand upon the code from part 2 to include that feature
*   Train and evaluate the model as above
*   Discuss whether the model's performance has improved

In [18]:
def new_encode_features(word):
  consonant_cluster = re.findall(r'[bcdfghjklmnpqrstvwxz]{2,}',word) 
  clusters = ', '.join(consonant_cluster)
  encoding = {'length' : len(word), 'last_letters' : word[-2:] , 'cons_clusters' : clusters}
  return encoding

In [19]:
new_features_train = []
for word in kanglish_train.loc[:,'word']:
    new_features_train.append(new_encode_features(word))
new_features_test = []
for word in kanglish_test.loc[:,'word']:
    new_features_test.append(new_encode_features(word))
print(new_features_train)
print(new_features_test)

[{'length': 7, 'last_letters': 'hu', 'cons_clusters': 'sth'}, {'length': 4, 'last_letters': 'ww', 'cons_clusters': 'ww'}, {'length': 7, 'last_letters': 'ng', 'cons_clusters': 'st, ng'}, {'length': 4, 'last_letters': 'ar', 'cons_clusters': ''}, {'length': 6, 'last_letters': 'el', 'cons_clusters': 'st'}, {'length': 9, 'last_letters': 'ed', 'cons_clusters': 'nf, rm'}, {'length': 5, 'last_letters': 'th', 'cons_clusters': 'th'}, {'length': 6, 'last_letters': 'ed', 'cons_clusters': 'nk'}, {'length': 6, 'last_letters': 'la', 'cons_clusters': ''}, {'length': 7, 'last_letters': 'ny', 'cons_clusters': 'rm'}, {'length': 6, 'last_letters': 'hu', 'cons_clusters': 'dh'}, {'length': 7, 'last_letters': 'dh', 'cons_clusters': 'st, ndh'}, {'length': 6, 'last_letters': 'de', 'cons_clusters': 'ns'}, {'length': 7, 'last_letters': 're', 'cons_clusters': ''}, {'length': 10, 'last_letters': 'on', 'cons_clusters': ''}, {'length': 6, 'last_letters': 'ne', 'cons_clusters': 'dhn'}, {'length': 7, 'last_letters': '

In [20]:
X_train_new =  vectorizer.fit_transform(new_features_train)
X_test_new =  vectorizer.transform(new_features_test) 

In [21]:
model.fit(X_train_new,y_train)



In [22]:
y_pred_new =  model.predict(X_test_new)

In [23]:
accuracy = accuracy_score(y_test,y_pred_new)
print(accuracy)
precision_per_class = precision_score(y_test,y_pred_new,average = None)
for label , precision in zip(labels,precision_per_class):
    print(f'precision for class {label}: {precision}')
recall_per_class = recall_score(y_test,y_pred_new,average = None)
for label , recall in zip(labels,recall_per_class):
    print(f'recall for class {label}: {recall}')
f1_per_class = f1_score(y_test,y_pred_new,average=None)
for label , f1 in zip(labels,f1_per_class):
    print(f'F1 for class {label}: {f1}')

0.7659760087241003
precision for class kn: 0.8534883720930233
precision for class en: 0.2391304347826087
precision for class name: 0.758701603441533
precision for class location: 0.2857142857142857
precision for class en-kn: 0.6907216494845361
precision for class other: 0.11607142857142858
recall for class kn: 0.809707666850524
recall for class en: 0.23655913978494625
recall for class name: 0.8842297174111212
recall for class location: 0.06451612903225806
recall for class en-kn: 0.18926553672316385
recall for class other: 0.13
F1 for class kn: 0.8310217945089159
F1 for class en: 0.23783783783783785
F1 for class name: 0.8166701747000631
F1 for class location: 0.10526315789473684
F1 for class en-kn: 0.29711751662971175
F1 for class other: 0.12264150943396226


I chose to add consonant clusters as an additional feature, as I thought it would make English easier to classify, since the language has a lot of distinct clusters.
The extra feature that I added did make a difference in the scores, but quite a small one. Perhaps that's because the other two features are quite general and don't target the languages well. Furthermore, the extra classes make the data unclear, and that lowers the model's performance.

In [24]:
# Great job justifying, implementing and discussing your extra feature! One thing I can see that perhaps could be solved differently is that now, if you have more than one consonant cluster per word,
# you include both of them in one feature, as in this sample: {'length': 7, 'last_letters': 'ng', 'cons_clusters': 'st, ng'}. I think this may be contributing to the increase in performance not being
# as good as expected! It would be better if each cluster could be its own feature (but it's also not as straightforward to do with our current setup).

# Keep in mind that the label ordering issue still persists here!
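
One possible way to give each cluster its own feature with the current `DictVectorizer` setup is to add one dictionary key per cluster instead of a single joined string. This is a sketch under assumptions: the function name `encode_with_cluster_flags` and the `cluster=<...>` key naming are hypothetical choices made here, and the regular expression is the same one used in the submission above.

```python
import re

# Same consonant-cluster pattern as in new_encode_features above.
CLUSTER_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{2,}")

def encode_with_cluster_flags(word):
    """Encode a word with one binary feature per consonant cluster."""
    feats = {"length": len(word), "last_letters": word[-2:]}
    for cluster in CLUSTER_RE.findall(word):
        # Each cluster gets its own key, so 'st' in 'staying' and
        # 'st' in 'hostel' map to the same column after vectorizing.
        feats[f"cluster={cluster}"] = 1
    return feats

example = encode_with_cluster_flags("staying")
print(example)
```

With the joined-string version, `'st, ng'` is treated as one opaque value that only matches words with exactly that combination of clusters; with separate keys, every word containing `st` shares a feature, which gives the model more to generalize from.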

**DISCUSS** the model performance

### Extra part 2: Excluding the non-language classes (3/3 points)

As you may have noted in part 1, the dataset contains some tags that represent languages (kn, en, en-kn) and some that correspond to Named Entity types and miscellaneous tokens (name, location, other). Since our task is to detect the language of a token, and the ground truth is not provided for the latter three classes in the same way as it is for the first three, let us try to exclude them.

**YOUR TASK:**


*   Use [Boolean indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing) to filter out the words with tags other than kn, en, and en-kn
*   Proceed with encoding the features and training as in the main part of the assignment
*   Evaluate the model using whichever measures you deem relevant. Discuss your choice and whether the model's performance has improved
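
As a small illustration of Boolean indexing for this kind of filtering, the sketch below keeps only the rows whose tag is in a list of wanted values, using the `.isin()` method of a pandas Series. The toy DataFrame is invented for the example (the words `xname` and `xplace` are placeholders), and `language_tags` is a name chosen here.

```python
import pandas as pd

# Toy data for illustration; 'xname' and 'xplace' are placeholder words.
toy = pd.DataFrame(
    {
        "word": ["anusthu", "woww", "xname", "keybeku", "xplace"],
        "tag": ["kn", "en", "name", "en-kn", "location"],
    }
)

language_tags = ["kn", "en", "en-kn"]

# .isin() builds a Boolean Series; indexing with it keeps the True rows.
mask = toy["tag"].isin(language_tags)
toy_filtered = toy[mask]

print(mask.tolist())
print(toy_filtered)
```

Listing the tags to keep, rather than the tags to drop, has the advantage that an unexpected extra tag in the data is excluded automatically.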



In [25]:
mask_1 = kanglish_train['tag'] == 'name'
mask_2 = kanglish_train['tag'] == 'location'
mask_3 = kanglish_train['tag'] == 'other'
filtered_kanglish_train = kanglish_train[(~mask_1) & (~mask_2) & (~mask_3)]
mask_1 = kanglish_test['tag'] == 'name'
mask_2 = kanglish_test['tag'] == 'location'
mask_3 = kanglish_test['tag'] == 'other'
filtered_kanglish_test = kanglish_test[(~mask_1) & (~mask_2) & (~mask_3)]

In [26]:
labels2 =  filtered_kanglish_train.loc[:,'tag'].unique()



In [27]:
y_train_filtered = filtered_kanglish_train.loc[:,'tag']
y_test_filtered = filtered_kanglish_test.loc[:,'tag']



In [28]:
features_train_filtered1 = []
for word in filtered_kanglish_train.loc[:,'word']:
    features_train_filtered1.append(encode_features(word))
features_test_filtered1 = []
for word in filtered_kanglish_test.loc[:,'word']:
    features_test_filtered1.append(encode_features(word))


In [29]:
X_train_filtered1 =  vectorizer.fit_transform(features_train_filtered1)
X_test_filtered1 =  vectorizer.transform(features_test_filtered1) 

In [30]:
model.fit(X_train_filtered1,y_train_filtered)



In [31]:
y_pred_filtered1=  model.predict(X_test_filtered1)


In [32]:
accuracy = accuracy_score(y_test_filtered,y_pred_filtered1)
print(accuracy)
precision_per_class = precision_score(y_test_filtered,y_pred_filtered1,average = None)
for label , precision in zip(labels2,precision_per_class):
    print(f'precision for class {label}: {precision}')
recall_per_class = recall_score(y_test_filtered,y_pred_filtered1,average = None)
for label , recall in zip(labels2,recall_per_class):
    print(f'recall for class {label}: {recall}')
f1_per_class = f1_score(y_test_filtered,y_pred_filtered1,average=None)
for label , f1 in zip(labels2,f1_per_class):
    print(f'F1 for class {label}: {f1}')

0.84
precision for class kn: 0.8989394884591391
precision for class en: 0.1791044776119403
precision for class en-kn: 0.8193415637860082
recall for class kn: 0.794815223386652
recall for class en: 0.12903225806451613
recall for class en-kn: 0.9074749316317229
F1 for class kn: 0.8436768149882904
F1 for class en: 0.15
F1 for class en-kn: 0.861159169550173


In this experiment I use only the initial two features, as I want to see what difference cleaning up the data makes.
The results are a bit perplexing: in some cases the model does better, for example recall for class en-kn went up from 18% to 90%, but it still does poorly at classifying English correctly. The model seems to be doing a great job classifying kn, which was the case even before removing the extra classes. It seems that the features are not good enough to detect English.

**DISCUSS** the model performance and your choice of measures

In [33]:
features_train_filtered2 = []
for word in filtered_kanglish_train.loc[:,'word']:
    features_train_filtered2.append(new_encode_features(word))
features_test_filtered2 = []
for word in filtered_kanglish_test.loc[:,'word']:
    features_test_filtered2.append(new_encode_features(word))
print(features_train_filtered2)
print(features_test_filtered2)

[{'length': 7, 'last_letters': 'hu', 'cons_clusters': 'sth'}, {'length': 4, 'last_letters': 'ww', 'cons_clusters': 'ww'}, {'length': 7, 'last_letters': 'ng', 'cons_clusters': 'st, ng'}, {'length': 4, 'last_letters': 'ar', 'cons_clusters': ''}, {'length': 6, 'last_letters': 'el', 'cons_clusters': 'st'}, {'length': 9, 'last_letters': 'ed', 'cons_clusters': 'nf, rm'}, {'length': 5, 'last_letters': 'th', 'cons_clusters': 'th'}, {'length': 6, 'last_letters': 'ed', 'cons_clusters': 'nk'}, {'length': 6, 'last_letters': 'la', 'cons_clusters': ''}, {'length': 7, 'last_letters': 'ny', 'cons_clusters': 'rm'}, {'length': 6, 'last_letters': 'hu', 'cons_clusters': 'dh'}, {'length': 7, 'last_letters': 'dh', 'cons_clusters': 'st, ndh'}, {'length': 6, 'last_letters': 'de', 'cons_clusters': 'ns'}, {'length': 7, 'last_letters': 're', 'cons_clusters': ''}, {'length': 10, 'last_letters': 'on', 'cons_clusters': ''}, {'length': 6, 'last_letters': 'ne', 'cons_clusters': 'dhn'}, {'length': 7, 'last_letters': '

In [34]:
X_train_filtered2 =  vectorizer.fit_transform(features_train_filtered2)
X_test_filtered2 =  vectorizer.transform(features_test_filtered2) 

In [35]:
model.fit(X_train_filtered2,y_train_filtered)



In [36]:
y_pred_filtered2=  model.predict(X_test_filtered2)

In [37]:
accuracy = accuracy_score(y_test_filtered,y_pred_filtered2)
print(accuracy)
precision_per_class = precision_score(y_test_filtered,y_pred_filtered2,average = None)
for label , precision in zip(labels2,precision_per_class):
    print(f'precision for class {label}: {precision}')
recall_per_class = recall_score(y_test_filtered,y_pred_filtered2,average = None)
for label , recall in zip(labels2,recall_per_class):
    print(f'recall for class {label}: {recall}')
f1_per_class = f1_score(y_test_filtered,y_pred_filtered2,average=None)
for label , f1 in zip(labels2,f1_per_class):
    print(f'F1 for class {label}: {f1}')

0.8504878048780488
precision for class kn: 0.9020085209981741
precision for class en: 0.24210526315789474
precision for class en-kn: 0.8391193903471634
recall for class kn: 0.8174296745725317
recall for class en: 0.24731182795698925
recall for class en-kn: 0.9033728350045579
F1 for class kn: 0.8576388888888888
F1 for class en: 0.24468085106382978
F1 for class en-kn: 0.8700614574187884


The model does a bit better in all classes with the consonant clusters feature compared to the scores with only the two original features. It also does better overall than in all the other attempts. However, the scores for English are still quite low. In conclusion, the features aren't good enough for classifying English.

In [38]:
# Great job, it was a fantastic idea to compare these with the two different feature encoders!

# Keep in mind that the label ordering issue still persists here! This is why your results are perplexing, as what you print out as the score for en is actually the score for en-kn.