# Lab 11 Tasks

A common text classification task involves automatically determining the language in which a document is written, based on previously-labelled example documents.

In this notebook, we will look at automatically classifying the text from tweets as either English or non-English. The dataset we will use is a subset of the [UMass Global English on Twitter Dataset](https://www.kaggle.com/rtatman/the-umass-global-english-on-twitter-dataset).

## Task 1 - Preprocessing

Read the Twitter dataset from the tab-separated file 'tweet-language.tsv' into a Pandas DataFrame, where the row index is given by 'Tweet ID'.

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score, RepeatedKFold
%matplotlib inline

In [6]:
df = pd.read_csv("tweet-language.tsv", sep="\t").set_index("Tweet ID")
print("Read %d documents" % len(df))
df.head(10)

Read 6759 documents


Tweet ID,Tweet,English
285903159434563584,volkan konak adami tribe sokar yemin ederim :d,0
285965965118824448,i felt my first flash of violence at some fool...,1
286057979831275520,ladies drink and get in free till 10:30,1
286216100784521216,watching #miranda on bbc1!!! u r hilarious,1
286525170670243840,all over twitter because you and your friends ...,1
286916662836490241,"~ i'm falling apart,with a broken heart,barely...",1
286927073078018048,"oh my god, we go way back #lovethis #rahrahrah",1
286927433498759168,"the way you treat me. the way you accept me, a...",1
286941320851890177,i just wanna get pulled on the sled by the fou...,1
286949739071672321,"oh there's that fake door slam noise , ugh #cbb",1


Our target label for classification here is going to be the column 'English' -- a value of 1 indicates that a tweet is in English, while a value of 0 indicates it is written in another language.

From this column, check the number of tweets in the dataset for each class.

In [8]:
target_categories = ["Non-English", "English"]
documents = df["Tweet"]
target = df["English"]
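The per-class count asked for above can be obtained with `value_counts()`. The sketch below is a minimal illustration on a made-up DataFrame (`toy_df` and its rows are hypothetical, not taken from the dataset); on the real data the same call would be applied to `df["English"]`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "Tweet": [
            "good morning all",
            "bonjour tout le monde",
            "see you later",
            "hola amigos",
            "what a game",
        ],
        "English": [1, 0, 1, 0, 1],
    }
)

# Count the tweets per class, ordered by label (0 first, then 1).
class_counts = toy_df["English"].value_counts().sort_index()

# Replace the 0/1 labels with readable class names.
class_counts.index = ["Non-English", "English"]
print(class_counts)
```

A strongly imbalanced count is worth knowing before evaluation, since a classifier that always predicts the majority class can still reach a high accuracy.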


Using the DataFrame and functionality from scikit-learn, create a vector representation of the documents. For real applications we would want to use a custom tokenizer to handle the specifics of tweets (e.g. mentions, hashtags, etc.). However, for this example we can just use the standard scikit-learn tokenizer and a simple *CountVectorizer*.

Note that we should not filter out any stop words here. For language detection, common stop words might actually prove to be useful features.

In [10]:
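# Keep terms that occur in at least 10 tweets; do not remove stop words
vectorizer = CountVectorizer(min_df=10, stop_words=None)
# Learn the vocabulary and build the document-term count matrix
X = vectorizer.fit_transform(documents)
print(X.shape)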

(6759, 892)


In [12]:
terms = vectorizer.get_feature_names_out()
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 892 distinct terms


In [14]:
print(terms[150:170])

['central' 'centre' 'centro' 'change' 'che' 'check' 'chicago' 'chile'
 'christmas' 'city' 'cl' 'clerical' 'click' 'cloudy' 'club' 'co' 'coffee'
 'coisa' 'com' 'come']


## Task 2 - Classification and Train/Test Evaluation

Train a kNN classification model with 3 neighbours, and evaluate the accuracy of this model using a single train/test split, so that we have 70% of the tweets in the training set and 30% in the test set.

In [16]:
data_train, data_test, target_train, target_test = train_test_split(X, target, test_size=0.3)

Repeat the classification and evaluation process again using a different train/test split. Did the classifier achieve the same accuracy score as before?

In [20]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, target_train)
predicted = model.predict(data_test)
print("Accuracy = %.4f" % accuracy_score(target_test, predicted))

Accuracy = 0.8511
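The effect of re-splitting can be made explicit by looping over several values of `random_state`. This is a minimal sketch on synthetic data from `make_classification`, used here as an assumed stand-in for the tweet matrix `X` and labels `target`; the scores it prints say nothing about the tweet dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tweet features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

split_scores = []
for seed in range(5):
    # Each random_state produces a different 70/30 partition of the same data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.3, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, knn.predict(X_te)))

print(["%.4f" % s for s in split_scores])
```

Passing a fixed `random_state` also makes a single split reproducible, which is useful when comparing classifiers on identical training and test sets.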


## Task 3 - Classification and Cross-Validation

If we re-run the evaluation above several times, we will get different performance scores depending on the randomly-generated training/test split that we are using. A more robust strategy involves using *k-fold cross-validation* to evaluate a classifier.

Evaluate the kNN classifier from above, but this time using 5-fold cross-validation. The model in each fold should be evaluated using accuracy. Calculate the overall average accuracy across all 5 folds.


In [23]:
from sklearn.pipeline import Pipeline
# Unlike Task 2, this pipeline applies TF-IDF weighting to the counts before kNN
pipeline1 = Pipeline([
    ('vec', CountVectorizer(min_df=10, stop_words=None)),
    ('tfidf', TfidfTransformer()),
    ('clf', KNeighborsClassifier(n_neighbors=3))
])

In [29]:
acc_scores = cross_val_score(pipeline1, documents, target, cv=5, scoring="accuracy")
s_acc = pd.Series(acc_scores)
print("Mean accuracy: %.4f" % s_acc.mean())

Mean accuracy: 0.7029
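A single 5-fold run still depends on how the folds happen to be drawn. One option is `RepeatedKFold`, which repeats the k-fold procedure with different shuffles. The sketch below assumes a tiny hand-written corpus (`toy_docs` and `toy_labels` are hypothetical) so that it runs on its own; passing the raw text through a pipeline means the vectorizer is refitted on the training part of every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented examples: 10 English and 10 Spanish sentences.
english = [
    "i love this song", "see you at the game", "what a great day",
    "this is so funny", "going to the store now", "can not wait for the weekend",
    "you are the best", "i need more coffee", "the weather is nice today",
    "just got home from work",
]
other = [
    "me gusta esta cancion", "nos vemos en el partido", "que buen dia",
    "esto es muy divertido", "voy a la tienda ahora", "no puedo esperar el fin de semana",
    "eres el mejor", "necesito mas cafe", "el clima esta bonito hoy",
    "acabo de llegar del trabajo",
]
toy_docs = english + other
toy_labels = [1] * len(english) + [0] * len(other)

toy_pipeline = Pipeline([
    ("vec", CountVectorizer(stop_words=None)),
    ("tfidf", TfidfTransformer()),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

# 5 folds repeated 3 times with different shuffles gives 15 accuracy scores.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(toy_pipeline, toy_docs, toy_labels, cv=rkf, scoring="accuracy")
print("Mean accuracy: %.4f (std %.4f)" % (scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean gives a rough idea of how much the estimate varies between folds.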


In [31]:
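# Linear classifier trained with SGD. Note that this vectorizer removes English
# stop words and applies no min_df threshold, unlike pipeline1
pipeline2 = Pipeline([
    ('vec', CountVectorizer(stop_words="english")),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])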

In [33]:
acc_scores = cross_val_score(pipeline2, documents, target, cv=5, scoring="accuracy")
s_acc = pd.Series(acc_scores)
print("Mean accuracy: %.4f" % s_acc.mean())

Mean accuracy: 0.9072
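Accuracy alone does not show which class the errors fall in. The imports at the top of the notebook include `confusion_matrix` and `classification_report`, and the sketch below shows how they could be applied. The label lists `y_true` and `y_pred` are made up for illustration and are not predictions from either pipeline.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = English, 0 = Non-English).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Per-class precision, recall and F1, using the notebook's class names.
print(
    classification_report(
        y_true, y_pred, labels=[0, 1], target_names=["Non-English", "English"]
    )
)
```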
