# **Romanian sub-dialect identification**
<br>

#### Discriminate between the Moldavian and the Romanian dialects across different text genres (news versus tweets)

#### Author: Manolache Andrei - 244
<br>

---
<br>

  Two of the most important sub-tasks in pattern classification are feature extraction and feature selection. Before fitting a model with a machine learning algorithm, we need to decide how best to represent a text document as a feature vector. A commonly used model in Natural Language Processing is the so-called **TF-IDF** (term frequency-inverse document frequency) model. 
  <br>
  The TF-IDF weight is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the number of documents in the corpus that contain the word.
<br>
  TF-IDF for a word in a document is calculated by multiplying two different metrics:
  1. **TF: Term Frequency**, which measures how frequently a term occurs in a document. <br>
  $TF(t) = \frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$
  1. **IDF: Inverse Document Frequency**, which measures how common or rare a term is across the entire document set. <br>
  $IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$
  <br><br>
  $\text{TF-IDF}(t) = TF(t) \cdot IDF(t)$

In this way, TF-IDF associates each word in a document with a number that represents how relevant that word is to the document. 
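
The two formulas can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three documents and the helper names `tf`, `idf` and `tf_idf` are made up for this example, and it assumes the queried term occurs in at least one document.

```python
import math

# Toy corpus (hypothetical); each document is a list of already-tokenized words.
docs = [
    ["casa", "este", "mare"],
    ["casa", "este", "mica"],
    ["masina", "este", "rapida"],
]

def tf(term, doc):
    # Number of times the term appears, divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ["este", "casa", "mare"]:
    print(term, tf_idf(term, docs[0], docs))
```

A word that appears in every document ("este") gets a weight of zero, while a word unique to one document ("mare") gets the highest weight. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default and normalizes each row, so its values differ from this textbook form.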

<br>

Furthermore, to train on the data, we used two classifiers: **SVM** (Support Vector Machines) and **ComplementNB** (Complement Naive Bayes).

<br>

### **Complement Naive Bayes**
It is based on the Multinomial Naive Bayes classifier, but addresses one of its weaknesses by estimating the parameters for each class from the data in all the other classes (the complement of the class being evaluated). There is one important parameter to tune:

*   Alpha (additive smoothing applied to the feature counts)
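
As a rough illustration of the alpha parameter, the sketch below fits `ComplementNB` on a made-up count matrix with several alpha values. The data is hypothetical and far smaller than the real data set.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical toy count matrix: 4 documents, 3 vocabulary terms.
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 3, 1],
    [0, 2, 0],
])
y = np.array([0, 0, 1, 1])

for alpha in (0.1, 1.0, 10.0):
    clf = ComplementNB(alpha=alpha).fit(X, y)
    # feature_log_prob_ holds the per-class weights derived from the
    # smoothed complement counts.
    print(alpha, clf.predict([[1, 0, 0], [0, 1, 0]]), clf.feature_log_prob_.round(2))
```

On data this clean the predictions do not change with alpha; only the learned weights do. On real text, the smoothing strength can shift borderline predictions.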


### **Support Vector Machines**
A Support Vector Machine (SVM) is a classifier defined by a separating hyperplane. In other words, given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples. To separate two classes, there are many possible hyperplanes that could be chosen. The optimal hyperplane is the one with the maximum margin (the maximum distance to the nearest data points of both classes). To adjust the prediction, there are a few parameters that can be used: 

  1. Regularization parameter (penalizes misclassified points)
  1. Gamma parameter (defines how far the influence of a single training point reaches; used by non-linear kernels)
  1. Kernel parameter (used for transforming data so that it becomes linearly separable)
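
To make the three parameters concrete, here is a minimal sketch using scikit-learn's `SVC` on made-up 2-D points. The data and the chosen values of `C` and `gamma` are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# C: regularization strength (a smaller C tolerates more misclassified points).
# kernel: function used to compare points ('linear', 'rbf', ...).
# gamma: reach of a single training point; ignored by the linear kernel.
linear_clf = SVC(C=1.0, kernel="linear").fit(X, y)
rbf_clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

new_points = [[0.5, 0.5], [3.5, 3.5]]
print(linear_clf.predict(new_points))
print(rbf_clf.predict(new_points))
```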

<br>

First, we import the libraries needed for the classifiers and scores:

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import re
import numpy as np
import pandas as pd

my_token_pattern = r"(?u)\S\S+" # token pattern used for TF-IDF: any run of two or more non-whitespace characters
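
To see what this token pattern matches, it can be tried directly with `re.findall`. The sentence below is a made-up example, not a line from the data set.

```python
import re

my_token_pattern = r"(?u)\S\S+"  # same pattern as in the import cell

# Hypothetical sample text.
tokens = re.findall(my_token_pattern, "Ana are 2 mere, nu-i așa?")
print(tokens)
```

Single-character tokens such as "2" are dropped, while punctuation stays attached to the neighbouring word ("mere," and "așa?").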

Function that opens a file (path) and returns a list of strings, each string representing a single line


In [0]:
def readFile(path):
    # Open the file as UTF-8 and close it automatically once the lines are read
    with open(path, 'r', encoding="utf-8") as file:
        return file.readlines()

The files are opened and read, each one yielding a list of strings:

In [0]:
train_samples = readFile('train_samples.txt')
test_samples = readFile('test_samples.txt')
validation_samples = readFile('validation_samples.txt')
train_labels = readFile('train_labels.txt')
validation_labels = readFile('validation_labels.txt')



Gets the ids from the test samples:

In [0]:
# The id is the first tab-separated field of each line
test_ids = [int(x.split('\t')[0]) for x in test_samples]

Function that extracts the labels from the lines of a label file:

In [0]:
def get_labels(file):

    labels = [int(label.split('\t')[1]) for label in file]
    return labels

Gets the labels for the training and validation samples:

In [0]:
train_labels = get_labels(train_labels)
validation_labels = get_labels(validation_labels)

Function that strips the id from each sample, keeping only its text:

In [0]:
def convert_to_list_strings(dataSample):

    finalString = [row.split('\t')[1].replace('\n' , '') for row in dataSample]
    return finalString
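
The functions above rely on each line having the layout `<id>`, a tab, then the text. The sketch below splits one such line into both parts at once. The line itself is hypothetical, and `split('\t', 1)` is a small variation that keeps any further tabs inside the text.

```python
# Hypothetical line in the assumed "<id>\t<text>\n" layout; not taken from the data set.
line = "12345\tAcesta este un exemplu\n"

# Split only on the first tab so the text is kept whole.
sample_id, text = line.rstrip("\n").split("\t", 1)
print(int(sample_id))
print(text)
```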

Updates the arrays, keeping only the sample text, without the ids:

In [0]:
train_samples = convert_to_list_strings(train_samples)
test_samples = convert_to_list_strings(test_samples)
validation_samples = convert_to_list_strings(validation_samples)

Function that preprocesses the samples, splitting them into words and keeping only those with length > 2 (mainly to drop short function words such as prepositions):

In [0]:
def my_preprocessor(text):

    words = re.split("\\s+", text)
    words = [word.replace("\n", "") for word in words if len(word) > 2] # gets rid of the '\n' character and keeps only the words of length > 2
    return ' '.join(words)
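
A quick way to see the effect of the length filter is to run the preprocessor on a short sentence. The sentence is made up for this example, and the function is repeated so the snippet runs on its own.

```python
import re

def my_preprocessor(text):
    # Same logic as the function above, repeated so the example is self-contained.
    words = re.split("\\s+", text)
    words = [word.replace("\n", "") for word in words if len(word) > 2]
    return ' '.join(words)

# Hypothetical sentence, not taken from the data set.
cleaned = my_preprocessor("Eu am o carte  foarte mare\n")
print(cleaned)
```

Every word of one or two characters is removed, so the filter also drops short pronouns and verbs ("Eu", "am"), not only prepositions.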

Defines the TF-IDF vectorizer, using these parameters:
1. 'l2' norm
1. custom token pattern (for tokenizing the words)
1. lowercase = False (case is preserved, so the upper- and lowercase forms of a word are treated as distinct tokens)
1. custom preprocessor function (used for filtering out short words)

Then it splits each string into tokens, builds a vocabulary from the training samples and computes the TF-IDF matrix:


In [0]:
vectorizer = TfidfVectorizer(analyzer = 'word' , norm = 'l2' , token_pattern = my_token_pattern , lowercase = False , preprocessor = my_preprocessor )
vector = vectorizer.fit_transform(train_samples)
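
The effect of these settings is easier to see on a toy corpus. This sketch uses two made-up documents and leaves out the custom preprocessor. It inspects the fitted vocabulary through the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus.
toy_corpus = ["casa este mare", "Casa este mica"]

toy_vectorizer = TfidfVectorizer(analyzer="word", norm="l2",
                                 token_pattern=r"(?u)\S\S+", lowercase=False)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(sorted(toy_vectorizer.vocabulary_))  # "Casa" and "casa" are separate entries
print(toy_matrix.shape)                    # (number of documents, vocabulary size)
```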

Function that generates a dense document-term matrix for the samples, using the fitted vocabulary:

In [0]:
def transform(samples, vectorizer):
    return vectorizer.transform(samples).toarray()

Generates the document-term matrices:

In [0]:
train_samples = transform(train_samples, vectorizer)
test_samples = transform(test_samples, vectorizer)
validation_samples = transform(validation_samples, vectorizer)
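
As a possible alternative, not what this notebook does, the `.toarray()` call can be skipped: scikit-learn's `ComplementNB` and `LinearSVC` both accept SciPy sparse matrices, which usually need far less memory for large vocabularies. A minimal sketch on made-up data:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# Hypothetical toy data standing in for the real samples.
texts = ["casa este mare", "casa este mica", "masina este rapida", "masina este lenta"]
labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
X_sparse = toy_vectorizer.fit_transform(texts)  # SciPy sparse matrix, no .toarray()

clf = ComplementNB(alpha=0.3).fit(X_sparse, labels)
print(sparse.issparse(X_sparse), X_sparse.shape)
print(clf.predict(toy_vectorizer.transform(["casa mica"])))
```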

### Now, we start training the first classifier: **SVM**
<br>

We'll use the **LinearSVC** classifier, a linear SVM that is implemented with liblinear and scales better to large data sets than the SVC classifier with the **kernel = 'linear'** parameter, and try different values for the regularization parameter C:



In [96]:
data = [] # array for the accuracy and f1 score of the predictions

for c in [0.001 , 0.01 , 0.1 , 1, 10, 100, 1000]:
    svmClassifier = svm.LinearSVC(C = c)
    svmClassifier.fit(train_samples , train_labels)
    predictions = svmClassifier.predict(validation_samples)
    data.append([c , accuracy_score(validation_labels , predictions) , f1_score(validation_labels , predictions)])

dataFrame = pd.DataFrame(data , columns = ['C ' , ' Accuracy ' , ' F1 Score'])
print(dataFrame)

         C    Accuracy    F1 Score
0     0.001    0.564759   0.698172
1     0.010    0.650226   0.707586
2     0.100    0.680723   0.709788
3     1.000    0.668298   0.681835
4    10.000    0.650979   0.664009
5   100.000    0.648720   0.661097
6  1000.000    0.647590   0.659636


Both metrics rise and then fall as C grows, peaking at **C = 0.1** with an accuracy of **0.680723** and an F1 score of **0.709788**.
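
Instead of reading the table by eye, the best row can be picked programmatically. The sketch below copies the validation scores from the table above into a DataFrame; the column names are chosen for this example.

```python
import pandas as pd

# Validation scores copied from the table above.
svm_scores = pd.DataFrame(
    [[0.001, 0.564759, 0.698172],
     [0.01, 0.650226, 0.707586],
     [0.1, 0.680723, 0.709788],
     [1, 0.668298, 0.681835],
     [10, 0.650979, 0.664009],
     [100, 0.648720, 0.661097],
     [1000, 0.647590, 0.659636]],
    columns=["C", "accuracy", "f1"],
)

# idxmax returns the row label of the highest value in each column.
best_by_accuracy = svm_scores.loc[svm_scores["accuracy"].idxmax()]
best_by_f1 = svm_scores.loc[svm_scores["f1"].idxmax()]
print(best_by_accuracy["C"], best_by_f1["C"])
```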
<br>
<br>

### Next, we'll check the **ComplementNB** classifier, adjusting the alpha hyperparameter:

In [97]:
data = [] # array for the accuracy and f1 score of the predictions

for alph in [0.01 , 0.1 , 0.26 , 0.3 , 0.5 , 1 , 10 , 25 , 50]:
    complementNBClassifier = ComplementNB(alpha = alph)
    complementNBClassifier.fit(train_samples , train_labels)
    predictions = complementNBClassifier.predict(validation_samples)
    data.append([alph , accuracy_score(validation_labels , predictions) , f1_score(validation_labels , predictions)])
    
dataFrame = pd.DataFrame(data , columns = ['C ' , '   Accuracy ' , ' F1 Score'])
print(dataFrame)

      C      Accuracy    F1 Score
0   0.01      0.718750   0.729446
1   0.10      0.716867   0.732765
2   0.26      0.716114   0.732434
3   0.30      0.717244   0.732645
4   0.50      0.710467   0.723878
5   1.00      0.707078   0.719134
6  10.00      0.671687   0.662016
7  25.00      0.660392   0.637168
8  50.00      0.649473   0.614812


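Here the first column (labelled `C` in the printout) holds the alpha values. Accuracy is highest at **alpha = 0.01** (**0.718750**) and the F1 score peaks at **alpha = 0.1** (**0.732765**), but all values between 0.1 and 0.3 score almost identically: **alpha = 0.3**, for example, gives an accuracy of **0.717244** and an F1 score of **0.732645**. Beyond 0.3, both metrics drop steadily.
<br>
In conclusion, the best results were achieved with the **ComplementNB** classifier, which beats LinearSVC on both metrics. We look more closely at two of the strongest alpha values, **0.26** and **0.3**. 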

<br>

So, the F1 score and confusion matrix for the ComplementNB classifier with alpha = 0.26 are:


In [98]:
complementNBClassifier = ComplementNB(alpha = 0.26)
complementNBClassifier.fit(train_samples , train_labels)
predictions = complementNBClassifier.predict(validation_samples)
print('F1 score is ' , f1_score(validation_labels , predictions))
print('Confusion matrix for ComplementNB and alpha = 0.26: ')
print(confusion_matrix(validation_labels , predictions) , '\n\n')

F1 score is  0.7324343506032648
Confusion matrix for ComplementNB and alpha = 0.26: 
[[ 870  431]
 [ 323 1032]] 
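
The reported scores can be recomputed from the confusion matrix alone. This sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels) and that label 1 is the positive class.

```python
import numpy as np

# Confusion matrix printed above (rows: true label, columns: predicted label).
cm = np.array([[870, 431],
               [323, 1032]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

The resulting F1 score and accuracy match the values reported for alpha = 0.26.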




The F1 score and confusion matrix for the ComplementNB classifier with alpha = 0.3 are:

In [99]:
complementNBClassifier = ComplementNB(alpha = 0.3)
complementNBClassifier.fit(train_samples , train_labels)
predictions = complementNBClassifier.predict(validation_samples)
print('F1 score is ' , f1_score(validation_labels , predictions))
print('Confusion matrix for ComplementNB and alpha = 0.3: ')
print(confusion_matrix(validation_labels , predictions) , '\n\n')

F1 score is  0.7326450694197223
Confusion matrix for ComplementNB and alpha = 0.3: 
[[ 876  425]
 [ 326 1029]] 




So, for predicting the test samples, we'll use the **ComplementNB** classifier with **alpha = 0.3**, which scores slightly higher than alpha = 0.26 on both accuracy and F1:


In [0]:
complementNBClassifier = ComplementNB(alpha = 0.3)
complementNBClassifier.fit(train_samples , train_labels)
predictions_test = complementNBClassifier.predict(test_samples)

And generate the output file:

In [0]:
csvFile = pd.DataFrame({ "id" : test_ids , "label" : predictions_test})
csvFile.to_csv("results.csv" , index = False)
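
For reference, the vectorize, fit and predict steps can also be chained with scikit-learn's `make_pipeline`, so that raw text goes in and labels come out. This is an alternative sketch on made-up data, not the code used to produce `results.csv`, and it omits the custom preprocessor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real train / test samples.
toy_train_texts = ["casa este mare", "casa este mica",
                   "masina este rapida", "masina este lenta"]
toy_train_labels = [0, 0, 1, 1]
toy_test_texts = ["casa mare", "masina lenta"]

model = make_pipeline(
    TfidfVectorizer(analyzer="word", norm="l2",
                    token_pattern=r"(?u)\S\S+", lowercase=False),
    ComplementNB(alpha=0.3),
)
model.fit(toy_train_texts, toy_train_labels)
toy_predictions = model.predict(toy_test_texts)
print(toy_predictions)
```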

### Concluding Remarks

For this dialect classification project, I used TfidfVectorizer to assign each word a weight based on how often it appears in a sample and in how many samples it occurs. I compared two classifiers, LinearSVC (SVM) and ComplementNB (Naive Bayes), over a range of hyperparameter values, and concluded that ComplementNB performs best; alpha = 0.3 was selected for the final predictions.

<br>

### Bibliography and Research
For this project, alongside the ideas covered in the course and laboratory, I used some concepts from the following websites:

*   https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
*   https://scikit-learn.org/stable/modules/naive_bayes.html
*   https://blog.floydhub.com/naive-bayes-for-machine-learning/
*   https://sebastianraschka.com/Articles/2014_naive_bayes_1.html
*   https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a



