Let's get started by exploring the very common `sklearn` package, which among many other things has some nice utilities for text classification.

In [0]:
import pandas as pd
import numpy as np
import sklearn

In [50]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Unnamed: 0,text,category,label
0,لوكسمبورغ: كاميرون ارتكب خطأ تاريخيا بطرح الاس...,استفتاء_بريطانيا,1
1,روسيا بصدد تصنيع مركبة فضائية جديدة\n تبدأ عمل...,التقنية_والمعلومات,10
2,صادرات ألمانيا إلى روسيا عند أدنى مستوى منذ 1...,عقوبات_اقتصادية,25
3,الجيش السوري يصد هجوم جبهة النصرة في ريف حلب\n...,المعارضة_السورية,12
4,ردود أفعال وسائل إعلام غربية على عملية درع الف...,الأزمة_السورية,6


For simplicity, we're going to reduce our problem from multiclass to binary classification for now.

We have a few categories of news about Yemen and a few others about Syria -- let's imagine that we're trying to differentiate between those.

In [0]:
#This is a nice way to create a new variable in a dataframe based on a condition
train['binary'] = np.where(train['category'].str.contains('اليمن|يمني') == True, 1,
                           np.where(train['category'].str.contains('السوريا|سوري') == True, 0, -1))

test['binary'] = np.where(test['category'].str.contains('اليمن|يمني') == True, 1,
                           np.where(test['category'].str.contains('السوريا|سوري') == True, 0, -1))

#Let's look at values for this new class to be sure our transformation worked
np.unique(test.binary, return_counts=True)

(array([-1,  0,  1]), array([2296,  418,   53]))
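To see what the nested `np.where` is doing, here is a minimal sketch on a made-up dataframe. The English category names and the `toy` variable are assumptions for illustration only. It also uses the `na=False` argument of `str.contains`, which is one alternative to the `== True` comparison above for treating missing categories as non-matches.

```python
import numpy as np
import pandas as pd

# Made-up categories for illustration (not from the news dataset)
toy = pd.DataFrame({"category": ["yemen_war", "syria_crisis", "sports", None]})

# na=False makes missing values count as "no match" instead of NaN
is_yemen = toy["category"].str.contains("yemen", na=False)
is_syria = toy["category"].str.contains("syria", na=False)

# 1 for Yemen, 0 for Syria, -1 for everything else
toy["binary"] = np.where(is_yemen, 1, np.where(is_syria, 0, -1))
print(toy)
```

Rows that match neither pattern, including the row with a missing category, end up as -1 and can then be filtered out.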

In [0]:
#Now we'll remove the items that don't belong to our new binary categories
train = train[train.binary >= 0]
test = test[test.binary >= 0]

len(train)

3781

For any machine learning approach, we first need a numeric representation of our text. Then we feed that numeric representation, and our labels, into a machine learning model. Sometimes a single technique or library provides both a tool for getting a good representation of a text (sometimes called feature extraction) and a machine learning approach. But that isn't always the case! For example, we might pair a complex feature extraction technique with a simpler (but still effective!) machine learning model when we're first testing that feature extraction technique, in order to iterate more quickly. Complex models take longer (and need more computing resources) to train.

`sklearn.feature_extraction.text` provides two simple utilities for text: `CountVectorizer` and `TfidfVectorizer`. Let's use `CountVectorizer` for now. It counts the frequency of each word in the training data and creates a matrix of words and counts. For the test data, it counts the frequency of the same words only: any word that never appeared in the training data is ignored, since the model wouldn't know what to do with a word it hadn't been trained on.
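Here is a small sketch of that behaviour on made-up English sentences (the `toy_` names and the sentences are assumptions for illustration, not part of the dataset). The word "bird" appears only in the test sentence, so it gets no column in the matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, made up for illustration
toy_train = ["the cat sat", "the dog sat"]
toy_test = ["the bird sat"]

toy_vectorizer = CountVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_X_test = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Columns are ordered alphabetically by word
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_X_train.toarray())
print(toy_X_test.toarray())
```

The test row has counts only for "sat" and "the"; "bird" is silently dropped.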

In [0]:
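#A bare `import sklearn` may not load the feature_extraction submodule, so we import the class explicitly
from sklearn.feature_extraction.text import CountVectorizer

#We create a CountVectorizer object and tell it to count n-grams in the range 1-3 (1 to 3 word phrases)
vectorizer = CountVectorizer(ngram_range=(1,3))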

#We want just our text as a list, rather than a row in a dataframe
words_train = [text for text in train.text]
words_test = [text for text in test.text]

#Now we use our vectorizer on the train and test data
X_train = vectorizer.fit_transform(words_train)
X_test = vectorizer.transform(words_test)

#We also need to separate out our labels
Y_train = train.binary
Y_test = test.binary

Note that `vectorizer.fit_transform()` is actually doing two things at once, and you could call `vectorizer.fit()` and then `vectorizer.transform()` instead. The vectorizer is first 'fit' to the training data only, and then we apply the same transformation to both the training and test data so that we can feed them into the same model.
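As a quick check of that equivalence, the sketch below compares the one-step and two-step versions on a few made-up documents (the `docs` list is an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = ["red apple", "green apple", "red red grape"]

# One step: fit and transform together
one_step = CountVectorizer().fit_transform(docs)

# Two steps: fit first, then transform
two_step_vectorizer = CountVectorizer()
two_step_vectorizer.fit(docs)
two_step = two_step_vectorizer.transform(docs)

# Both routes should give the same count matrix
same = np.array_equal(one_step.toarray(), two_step.toarray())
print(same)
```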

Next, we can use any of the many classification techniques available in sklearn. Let's look in depth at two approaches: a simple `LogisticRegression` and the similar, but slightly more complex, `LinearSVC`. This is short for Linear Support Vector Classifier, and here's a nice introduction if you're not familiar: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47. Very briefly, it's another linear classifier that does very well with these kinds of classification problems and is not computationally expensive.

`score()` returns the mean accuracy of the classifier on the data we pass it, which here is the test data. In other words, what fraction of the test labels did the classifier guess correctly?
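To make that concrete, this sketch checks on a tiny synthetic dataset (all values are made up for illustration) that `score()` matches the fraction of correct predictions computed by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic data, made up for illustration
X_toy_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy_train = np.array([0, 0, 0, 1, 1, 1])
X_toy_test = np.array([[0.5], [1.5], [3.5], [4.5]])
y_toy_test = np.array([0, 0, 1, 1])

toy_clf = LogisticRegression().fit(X_toy_train, y_toy_train)

# Accuracy by hand: fraction of predictions that equal the true labels
toy_preds = toy_clf.predict(X_toy_test)
manual_accuracy = np.mean(toy_preds == y_toy_test)

# Accuracy as reported by score()
reported_accuracy = toy_clf.score(X_toy_test, y_toy_test)
print(manual_accuracy, reported_accuracy)
```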

In [0]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(max_iter=5000).fit(X_train, Y_train)
classifier.score(X_test, Y_test)

0.970276008492569

In [0]:
from sklearn.svm import LinearSVC

classifier = LinearSVC(max_iter=10000).fit(X_train, Y_train)
classifier.score(X_test, Y_test)



0.9723991507430998

These results look good! As expected, the SVC performs slightly better, although the gap amounts to a single test article (458 versus 457 correct out of 471). However, remember that our classes are imbalanced: we have more articles about Syria than about Yemen. So let's use a confusion matrix to get a better sense of how the model is performing.

In [0]:
from sklearn.metrics import confusion_matrix

preds = classifier.predict(X_test)

confusion_matrix(y_true = Y_test,
                 y_pred = preds)

array([[412,   6],
       [  7,  46]])

So it seems like class imbalance isn't really an issue here. Rows are the true labels and columns are the predictions, so we have only 7 false negatives (bottom left value) and 6 false positives (top right value).
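For a per-class view, precision and recall for the Yemen class (label 1) can be computed directly from the matrix above. This is a minimal sketch that hard-codes the four values from the output; the variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the confusion matrix output above
cm = np.array([[412, 6],
               [7, 46]])

# For binary labels [0, 1], sklearn's layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # matches the LinearSVC score
precision = tp / (tp + fp)        # of articles predicted Yemen, how many were
recall = tp / (tp + fn)           # of true Yemen articles, how many were found

print(round(float(accuracy), 4), round(float(precision), 4), round(float(recall), 4))
```

Precision and recall on the smaller class come out a little lower than overall accuracy, which is typical when one class dominates the test set.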

Great! We've gotten good results for this simple binary classification problem. But remember that we're really dealing with 40 categories here. In future notebooks we'll address that more complex problem.