# Logistic Regression using fastText embeddings

This notebook uses the [fastText library](https://github.com/facebookresearch/fastText) to perform binary text classification on Twitter data.

The [fastText classifier](https://github.com/facebookresearch/fastText) represents a text by averaging its word and n-gram embeddings into a single sentence/document vector and then applies multinomial logistic regression to that vector. fastText can additionally use a hierarchical softmax to speed up the computation of the probability distribution over the predefined classes.
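The idea of averaging embeddings and then applying logistic regression can be illustrated with a minimal sketch. It is not fastText's implementation: the embedding table `EMBEDDINGS` and the class weights `WEIGHTS` are hypothetical toy values, and a plain softmax is used instead of a hierarchical one.

```python
import math

# Hypothetical toy embedding table (dimension 3); real embeddings are learned.
EMBEDDINGS = {
    "good": [1.0, 0.0, 0.5],
    "day": [0.0, 1.0, 0.5],
}

# Hypothetical output weights, one row per class (class 0, class 1).
WEIGHTS = [
    [-1.0, -1.0, 0.0],
    [1.0, 1.0, 0.0],
]


def average_embedding(tokens):
    """Average the vectors of all known tokens into one document vector."""
    dim = len(WEIGHTS[0])
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens):
    """Multinomial logistic regression on the averaged document vector."""
    doc = average_embedding(tokens)
    scores = [sum(w * x for w, x in zip(row, doc)) for row in WEIGHTS]
    return softmax(scores)


probs = predict_proba("good day".split())
print(probs)
```

With these toy values the document vector is the mean of the two word vectors, and class 1 receives the higher probability.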

In [None]:
import logging
import torch
logging.basicConfig(level=logging.INFO)
print("done")

done


FastText installation

In [None]:
!wget https://github.com/facebookresearch/fastText/archive/0.2.0.zip
!unzip 0.2.0.zip
%cd fastText-0.2.0
!make
print("done installing")

# Loading the datasets

We load data from our predefined training and validation split, **train.csv** and **test.csv**, which can be found in the *train_test* folder.

To train the model on a subset of the data, pass the argument *size*. If it is set to **None**, the full split is used.

In [None]:
import pandas as pd

DF_TRAIN = pd.read_csv('train.csv')
DF_EVAL = pd.read_csv('test.csv')

def select_train(size=160_000):
  # Return the first `size` training rows (all rows if size is None)
  df = DF_TRAIN
  if size is not None:
    df = df.iloc[:size]
  return df.drop(['Unnamed: 0'], axis='columns')

def select_eval(size=40_000):
  # Return the first `size` validation rows (all rows if size is None)
  df = DF_EVAL
  if size is not None:
    df = df.iloc[:size]
  return df.drop(['Unnamed: 0'], axis='columns')


In [None]:
df_train = select_train(size=None)
df_val = select_eval(size=None)
df_train.rename(columns = {'text':'x'}, inplace = True)
df_val.rename(columns = {'text':'x'}, inplace = True)
#df = pd.concat([df_train, df_val])  # uncomment this to train on both the training and the validation set
print("done loading")
df_train

done loading


Unnamed: 0,index,x,label
0,157049,jennifer lawrence the queen of derp . <url>\n,1
1,2366208,airbake by wearever nonstick 15-1 / 2 by 20 - ...,0
2,1948945,apparently its main event time for #ufc145 . i...,0
3,1684769,<user> <user> i'll say it again ( about ko i l...,0
4,2262152,deep stall : the turbulent story of boeing com...,0
...,...,...,...
1249995,1500650,really wishing i could celebrate with <user> t...,0
1249996,1405608,"i love my mova lor rican ass . but , she don't...",0
1249997,839358,ahhh lovely night in with <user> love our one-...,1
1249998,254031,<user> if you were near-by i'd say came on ove...,1


In [None]:
df_val

Unnamed: 0,index,x,label
0,922648,sunny day with my bff <user> <url>\n,1
1,944379,"<user> also , that statement wasn't really dir...",1
2,2182552,thoughts are with former dons striker lee mill...,0
3,786886,- excitedd for my lil'mans party ! thanks to h...,1
4,1130778,shout out to <user> xoxo\n,1
...,...,...,...
1249995,1478680,"stylecraft 22 "" x 82 "" joined board and batten...",0
1249996,1972646,<user> haha don't ask i'm ardent fan of barca ...,0
1249997,1710597,<user> i know xx\n,0
1249998,1835784,does anyone have an extra pair of headphones i...,0


In [None]:
from loading import load_test #import module from the file loading.py

df_test = load_test()
df_test

Unnamed: 0,x
1,sea doo pro sea scooter ( sports with the port...
2,<user> shucks well i work all week so now i ca...
3,i cant stay away from bug thats my baby\n
4,<user> no ma'am ! ! ! lol im perfectly fine an...
5,"whenever i fall asleep watching the tv , i alw..."
...,...
9996,had a nice time w / my friend lastnite\n
9997,<user> no it's not ! please stop !\n
9998,not without my daughter ( dvd two-time oscar (...
9999,<user> have fun in class sweetcheeks\n


# Preprocessing the datasets
We apply three preprocessing steps to our data, since we observed that they improve performance:


*   Tag removal
*   Punctuation removal
*   Contraction expansion and fixing of misspelled words (runs of a repeated character are truncated to two)

Furthermore, we convert the training dataset into the format that fastText requires to recognize the labels, and write our cleaned and preprocessed data into separate files.
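As a rough illustration of what these steps do to a single tweet, here is a minimal sketch that uses only the standard library. The function names and the example tweet are made up for illustration, and the contraction expansion is left out because it relies on the external *contractions* package.

```python
import itertools
import re


def strip_tags(tweet):
    """Remove the <user> and <url> placeholders."""
    return tweet.replace('<user>', '').replace('<url>', '').strip()


def strip_punctuation(tweet):
    """Replace common punctuation with spaces and collapse whitespace."""
    return ' '.join(re.sub(r"[.,!?:;\-=]", " ", tweet).split())


def squeeze_repeats(tweet, max_run=2):
    """Truncate every run of the same character to at most max_run characters."""
    return ''.join(
        ''.join(group)[:max_run] for _, group in itertools.groupby(tweet)
    )


raw = "<user> soooo happy !!! see you tomorrow <url>"
step1 = strip_tags(raw)
step2 = strip_punctuation(step1)
step3 = squeeze_repeats(step2)
print(step3)  # soo happy see you tomorrow
```

Truncating runs to two characters keeps legitimate double letters such as the "pp" in "happy" while shortening elongated words such as "soooo".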



In [None]:
!pip install contractions
import itertools
import re

import contractions
import nltk
import pandas as pd

# nltk.download is a Python function, so it must not be prefixed with "!"
nltk.download('omw-1.4')


def remove_tags(df: pd.DataFrame):
  # Remove the <user> and <url> placeholders (modifies df in place)
  df['x'] = df['x'].apply(lambda x: x.replace('<user>', '').replace('<url>', '').strip())

def remove_punctuation(df: pd.DataFrame):
  # Replace punctuation with spaces and collapse repeated whitespace
  df['x'] = df['x'].apply(lambda tweet: ' '.join(re.sub(r"[.,!?:;\-=]", " ", tweet).split()))

def fix_contractions_and_misspelled_words(df: pd.DataFrame):
  # Expand contractions, e.g. "cant" becomes "cannot"
  df['x'] = df['x'].apply(lambda tweet: contractions.fix(tweet))
  # Truncate runs of the same character to at most two
  df['x'] = df['x'].apply(lambda tweet: ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet)))


# Apply the same preprocessing steps to all three datasets
for df in (df_train, df_val, df_test):
  remove_tags(df)
  remove_punctuation(df)
  fix_contractions_and_misspelled_words(df)

df_test

Unnamed: 0,x
1,sea doo pro sea scooter ( sports with the port...
2,shucks well i work all week so now i cannot co...
3,i cannot stay away from bug that is my baby
4,no madam lol i am perfectly fine and not conta...
5,whenever i fall asleep watching the tv i alway...
...,...
9996,had a nice time w / my friend lastnite
9997,no it is not please stop
9998,not without my daughter ( dvd two time oscar (...
9999,have fun in class sweetcheeks


In [None]:
# fastText expects every training line to start with the label prefix "__label__",
# e.g. "sentence with emotion" becomes "__label__0 sentence with emotion"
df_train['x'] = '__label__' + df_train['label'].astype(str) + ' ' + df_train['x']

In [None]:
from google.colab import files  # used later to download the submission file

# Write one tweet per line to the files that fastText reads
outputs = [
    ('CleanedTrain.txt', df_train),
    ('ValCleaned.txt', df_val),
    ('TestCleaned.txt', df_test),
]
for path, df in outputs:
    with open(path, 'w') as f:
        for text in df['x']:
            f.write(text + "\n")


# Training

We train our model for 20 epochs with a learning rate of 0.01 and a vector size (dim) of 20. Larger vectors can capture more information, but they also require more data to be learned well.
[Source](https://fasttext.cc/docs/en/supervised-tutorial.html)
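To get a feeling for why larger vectors need more data, the following sketch estimates the number of trainable parameters for different values of dim. It is only a rough estimate under stated assumptions: one embedding row per vocabulary word, one output row per label, and no hashed n-gram buckets. The function name `estimate_parameters` is hypothetical; the vocabulary size is the word count reported in this notebook's training output.

```python
def estimate_parameters(vocab_size, num_labels, dim):
    """Rough parameter count: input embeddings plus output weights."""
    input_params = vocab_size * dim   # one vector per vocabulary word
    output_params = num_labels * dim  # one weight vector per label
    return input_params + output_params


VOCAB_SIZE = 340_539  # "Number of words" in the training output
NUM_LABELS = 2

estimates = {
    dim: estimate_parameters(VOCAB_SIZE, NUM_LABELS, dim)
    for dim in (20, 100, 300)
}
for dim, count in estimates.items():
    print(f"dim={dim}: about {count:,} parameters")
```

The count grows linearly with dim, so with a fixed amount of training data each parameter receives fewer updates as the vectors get larger.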


In [None]:
!./fasttext supervised -input CleanedTrain.txt -output model -dim 20 -lr 0.01 -epoch 20 

Read 19M words
Number of words:  340539
Number of labels: 2
Progress: 100.0% words/sec/thread:  634662 lr:  0.000000 loss:  0.371888 ETA:   0h 0m


# Evaluation on Validation Set

In [None]:
# Preview the first lines only; printing the whole file would flood the output
!head -n 5 ValCleaned.txt
!./fasttext predict model.bin ValCleaned.txt > predictionsVal.txt

In [None]:
import numpy as np
from evaluation import evaluate

# Each prediction line has the form "__label__<digit>"; keep the final digit as an int
preds = pd.read_csv('predictionsVal.txt', names=['Labels'], header=None)
preds['Labels'] = preds['Labels'].str[-1].apply(np.int64)

y_pred = torch.tensor(preds['Labels'].to_numpy())

evaluate(df_val['label'], y_pred)


INFO:root:---
* accuracy: 0.8331816
* precision: 0.8212345587714189
* recall: 0.8521404945862169
* f1: 0.8364021223796826
* bce: 5.761777639267986
* auc: 0.8331654180547339
---


(0.8331816,
 0.8212345587714189,
 0.8521404945862169,
 0.8364021223796826,
 5.761777639267986,
 0.8331654180547339)
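For reference, the threshold-based metrics above can be computed from the confusion counts as in the following sketch. It is not the code of the *evaluation* module used in this notebook; the function name `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 2 true positives, 1 false positive, 1 false negative, 2 true negatives
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(metrics)

# Sanity check on the values reported above: F1 is the harmonic mean
# of precision and recall.
reported_precision = 0.8212345587714189
reported_recall = 0.8521404945862169
harmonic_mean = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(harmonic_mean)
```

The harmonic mean of the reported precision and recall agrees with the reported F1 score of about 0.8364.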

# Predictions on Test Set

Finally, we generate predictions on the test set and write them to a csv file.
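The conversion from fastText's output lines to the submission format can be sketched as follows, using only the standard library. The helper name `label_to_submission` and the sample lines are hypothetical; the sketch assumes that each output line has the form "__label__0" or "__label__1" and that the submission encodes the classes as -1 and 1.

```python
import csv
import io


def label_to_submission(line):
    """Map a fastText output line such as "__label__0" to -1 or 1."""
    label = int(line.strip().rsplit("__label__", 1)[-1])
    return -1 if label == 0 else label


# Hypothetical sample of fastText predictions, one per line
lines = ["__label__0", "__label__1", "__label__0\n"]

# Ids start at 1, matching the numbering of the test tweets
rows = [(i, label_to_submission(line)) for i, line in enumerate(lines, start=1)]

# Write to an in-memory buffer here; a real run would write to a file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "Prediction"])
writer.writerows(rows)
print(buffer.getvalue())
```

Parsing the text after the "__label__" prefix, rather than taking only the last character, would also work for labels with more than one digit.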

In [None]:
!./fasttext predict model.bin TestCleaned.txt > predictionsTest.txt

In [None]:
import numpy as np

# Each prediction line has the form "__label__<digit>"; keep the final digit as an int
preds = pd.read_csv('predictionsTest.txt', names=['Labels'], header=None)
preds['Labels'] = preds['Labels'].str[-1].apply(np.int64)


In [None]:
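# Build the submission dataframe with the columns Id and Prediction
preds.rename(columns = {'Labels':'Prediction'}, inplace = True)
preds['Prediction'] = preds['Prediction'].replace({0: -1})  # negative class is encoded as -1
preds['Id'] = range(1, len(preds)+1)
preds = preds[["Id", "Prediction"]]
# index=False keeps the dataframe index out of the file (assumes the submission needs only Id and Prediction)
preds.to_csv('submission.csv', index=False)
preds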

Unnamed: 0,Id,Prediction
0,1,-1
1,2,-1
2,3,-1
3,4,1
4,5,-1
...,...,...
9995,9996,1
9996,9997,-1
9997,9998,-1
9998,9999,1


In [None]:
files.download('submission.csv')
