Most of this should look very familiar from our last notebook. We're just increasing the difficulty by training a model to distinguish among many classes (the dataset has 40) rather than just two. We also introduce a new preprocessing step: removing stopwords.

In [0]:
import pandas as pd
import numpy as np
import sklearn

In [0]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Unnamed: 0,text,category,label
0,لوكسمبورغ: كاميرون ارتكب خطأ تاريخيا بطرح الاس...,استفتاء_بريطانيا,1
1,روسيا بصدد تصنيع مركبة فضائية جديدة\n تبدأ عمل...,التقنية_والمعلومات,10
2,صادرات ألمانيا إلى روسيا عند أدنى مستوى منذ 1...,عقوبات_اقتصادية,25
3,الجيش السوري يصد هجوم جبهة النصرة في ريف حلب\n...,المعارضة_السورية,12
4,ردود أفعال وسائل إعلام غربية على عملية درع الف...,الأزمة_السورية,6


In [0]:
#Due to RAM issues on Google Colab, we'll keep only the articles with labels 0 through 20.
train = train[train.label <= 20]
test = test[test.label <= 20]

In the last notebook, we passed all of our text into our model. But what about simple words like prepositions, that appear very often and don't necessarily contain helpful information about the meaning of the text?

Those simple, frequently repeated words are called stopwords. A common preprocessing step is to remove them from the text.

We won't be exploring it here, but `TfidfVectorizer` can also help address this problem by essentially down-weighting words that appear in many documents, including words that might not be stopwords. However, even in this case you'd probably want to remove stopwords.
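
As a rough illustration of that down-weighting, here is a minimal sketch on a made-up three-sentence English corpus. The sentences and variable names are illustrative assumptions, not part of this dataset. With `TfidfVectorizer`'s default settings, a word that appears in every document should receive the lowest inverse document frequency (IDF) weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" appears in every document.
toy_docs = ["the cat sat", "the dog barked", "the bird sang"]

tfidf = TfidfVectorizer()
tfidf.fit(toy_docs)

# Map each vocabulary term to its learned IDF weight.
idf_by_term = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}

print(idf_by_term["the"], idf_by_term["cat"])
```

Here `the` gets a smaller weight than `cat`, which appears in only one document, so very common words contribute less to each document's vector even when they are not on a stopword list.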

In [0]:
#we'll use nltk to download and then load in a set of Arabic stopwords.
import nltk

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('arabic')

stop_words[0:5]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['إذ', 'إذا', 'إذما', 'إذن', 'أف']

In [0]:
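#Removing stopwords is easy - just an additional argument when we create our CountVectorizer!
#Import the class explicitly; `import sklearn` alone is not guaranteed to load its submodules.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,3), stop_words=stop_words)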

words_train = [text for text in train.text]
words_test = [text for text in test.text]

X_train = vectorizer.fit_transform(words_train)
X_test = vectorizer.transform(words_test)

Y_train = train.label
Y_test = test.label

Conveniently, most `sklearn` classifiers are set up to do multiclass classification by default. You just feed in your data as you would for binary classification and, for `LogisticRegression` here, add one argument to specify that this is a multiclass problem. If you're interested in details on this, and the full list of classifiers that are inherently multiclass, see: https://scikit-learn.org/stable/modules/multiclass.html.

The next two cells will take a while to run.

In [0]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(multi_class='multinomial').fit(X_train, Y_train)
classifier.score(X_test, Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7637735849056604

First, the "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT" message means that the model failed to converge within the default number of iterations (`max_iter=100`). We could address this by increasing the number of iterations, but for now we'll just move forward.
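
For reference, raising the limit could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the news data, and the value `1000` is an arbitrary assumption rather than a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (stand-in for the real features and labels).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# max_iter raises the cap on solver iterations (the default is 100).
toy_classifier = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = int(toy_classifier.n_iter_.max())
toy_accuracy = toy_classifier.score(X_toy, y_toy)
print(iterations_used, toy_accuracy)
```

If `iterations_used` comes back below the cap, the solver stopped because it converged rather than because it ran out of iterations. On the full n-gram matrix a higher cap will also lengthen training time.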

Unsurprisingly, our results aren't as good as they were for binary classification. But we're also not using the best metric here! Accuracy hides a lot of important information for multiclass problems. Instead, we'll look at the confusion matrix again.

In this case `pd.crosstab` works better visually than `sklearn`'s confusion matrix.
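
One optional refinement, sketched here on made-up labels: if both inputs are named `pandas` Series, `pd.crosstab` uses those names to label the rows and columns, which makes the table easier to read. The label values and the names `actual` and `predicted` are illustrative assumptions.

```python
import pandas as pd

# Toy labels (hypothetical): six examples across three classes.
actual = pd.Series([0, 0, 1, 1, 1, 2], name="actual")
predicted = pd.Series([0, 1, 1, 1, 2, 2], name="predicted")

# Rows are true labels, columns are predicted labels.
toy_table = pd.crosstab(actual, predicted)
print(toy_table)
```

Counts on the diagonal are correct predictions; everything off the diagonal is a confusion between two classes.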

In [0]:

preds = classifier.predict(X_test)
pd.crosstab(Y_test, preds)


label \ predicted,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,17,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,4,0,0,0
1,0,15,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,37,0,0,0,0,0,0,0,13,1,0,0,3,0,0,0,0,0,0
3,0,0,0,33,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,37,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,47,6,0,0,0,0,0,0,1,0,0,4,0,0,0,0
6,2,0,0,2,1,13,244,3,0,0,0,0,54,6,0,1,4,0,0,3,1
7,0,0,0,1,0,0,2,45,0,0,0,0,0,0,0,0,1,0,0,2,1
8,0,0,5,0,0,0,0,0,46,0,2,0,0,0,9,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,14,0,0,0,0,0,0,0,2,0,1,0


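Here, we can see visually that the model is performing fairly well, and identify a few areas where it's failing. For example, categories 6 and 12 are most often confused: 54 test articles with true label 6 were predicted as 12, the largest off-diagonal count in the rows shown.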
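
To put a number on that, the row for true label 6 can be turned into a per-class recall. This is a minimal sketch that simply copies the counts from the table above; the variable names are illustrative.

```python
# Row for true label 6, copied from the crosstab above (columns are predicted labels 0..20).
row_6 = [2, 0, 0, 2, 1, 13, 244, 3, 0, 0, 0, 0, 54, 6, 0, 1, 4, 0, 0, 3, 1]

total_6 = sum(row_6)                # all test articles whose true label is 6
recall_6 = row_6[6] / total_6       # share correctly predicted as 6
share_as_12 = row_6[12] / total_6   # share mistaken for label 12

print(total_6, round(recall_6, 3), round(share_as_12, 3))  # 334 0.731 0.162
```

So roughly 73% of label-6 articles are recovered, and about 16% are assigned to label 12, a detail the single overall accuracy figure does not reveal.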

In [0]:
print(train.category[train.label == 6].unique(), train.category[train.label == 12].unique())

['الأزمة_السورية'] ['المعارضة_السورية']


Great, this makes sense! Category 6 is "Syrian crisis" and category 12 is "Syrian opposition." Without knowing the guidelines provided to the labelers, even a human might have a hard time dividing articles accurately into these two categories!