# Assignment #3: A simple language classifier

Author: Pierre Nugues

## Objectives

In this assignment, you will implement a language detector inspired by Google's _Compact Language Detector_, version 3 (CLD3) [https://github.com/google/cld3]. CLD3 is written in C++ and its code is available on GitHub. The objectives of the assignment are to:
* Write a program to classify languages
* Use neural networks
* Know what a classifier is
* Write a short report of 1 to 2 pages on the assignment

## Description

### System Overview

Read the GitHub description (_Model_ section). In your individual report you will:
1. Summarize the system in two or three sentences;
2. Outline the CLD3 overall architecture in a figure. Use building blocks only and do not specify the parameters.

### Dataset

As a dataset, we will use Tatoeba [https://tatoeba.org/eng/downloads]. It consists of more than 8 million short texts in 347 languages and is available in one file called `sentences.csv`.

The dataset is structured as follows: there is one text per line, and each line consists of the following three fields, separated by tabs and terminated by a carriage return:
```
sentence id [tab] language code [tab] text [cr]
```
Each text has a unique id and has a language code that follows the ISO 639-3 standard (see below). 
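As an illustration of this layout, here is a minimal sketch that parses two lines taken from the excerpt shown in the next section. The helper name `parse_line` is hypothetical, and the sketch assumes each line ends with a newline character.

```python
import io

# Two sample lines in the Tatoeba layout: id <tab> language code <tab> text
sample = io.StringIO(
    "1276\teng\tLet's try something.\n"
    "1279\tfra\tJe ne supporte pas ce type.\n"
)

def parse_line(line):
    """Split one line into (sentence_id, lang_code, text)."""
    # maxsplit=2 keeps any tab inside the text in the third field
    sentence_id, lang_code, text = line.rstrip('\r\n').split('\t', 2)
    return sentence_id, lang_code, text.strip()

records = [parse_line(line) for line in sample]
print(records[0])
```

Iterating over the file object line by line avoids holding a second copy of the whole file in memory, which may matter for a file of several million lines.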

### Scope of the lab

In this lab, you will consider three languages only: French (fra), English (eng), and Swedish (swe). Below is an excerpt of the Tatoeba dataset limited to these three languages: 

```
1276    eng     Let's try something.
1277    eng     I have to go to sleep.
1280    eng     Today is June 18th and it is Muiriel's birthday!
...
1115    fra     Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
1279    fra     Je ne supporte pas ce type.
1441    fra     Pour une fois dans ma vie je fais un bon geste... Et ça ne sert à rien.
...
337413  swe     Vi trodde att det var ett flygande tefat.
341910  swe     Detta är huset jag bodde i när jag var barn.
341938  swe     Vi hade roligt på stranden igår.
...
```

### Understanding the $\mathbf{X}$ matrix (feature matrix)

We will now investigate the CLD3 features:
 * What are the features CLD3 extracts from each text?
 * Manually create a simplified $\mathbf{X}$ matrix where you will represent the 9 texts with CLD3 features. You will use a restricted set of features: you will only consider the letters _a_, _b_, and _n_ and the bigrams _an_, _ba_, and _na_. You will ignore the rest of the letters and bigrams as well as the trigrams. Your matrix will have 9 rows and 6 columns, where the columns contain these counts: `[#a, #b, #n, #an, #ba, #na]`

The original CLD3 description uses relative frequencies (the count of a letter divided by the total count of letters in the text). Here, you will use the raw counts. To help you, your instructor filled in the fourth row of the matrix, corresponding to the first text in French. Fill in the rest.

$\mathbf{X} =
\begin{bmatrix}
0& 0& 1& 0& 0& 0\\
1& 0& 0& 0& 0& 0\\
3& 1& 2& 1& 0& 0\\
8& 0& 8& 1& 0& 0\\
1& 0& 1& 0& 0& 0\\
4& 1& 5& 1& 0& 0\\
4& 0& 1& 1& 0& 0\\
5& 2& 2& 0& 1& 0\\
2& 0& 2& 1& 0& 0\\
\end{bmatrix}$
; $\mathbf{y} =
\begin{bmatrix}
     \text{eng} \\
     \text{eng}\\
     \text{eng}\\
    \text{fra}\\
   \text{fra}  \\
     \text{fra}\\
    \text{swe}\\
 \text{swe}   \\
 \text{swe}   
\end{bmatrix}$
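To check a row of this matrix, the raw counts can be computed with a few lines of Python. This is a minimal sketch; the names `FEATURES` and `restricted_counts` are hypothetical, and the text is lowercased before counting, which is an assumption of this sketch.

```python
FEATURES = ['a', 'b', 'n', 'an', 'ba', 'na']

def restricted_counts(text, features=FEATURES):
    """Return the raw counts of the selected letters and bigrams."""
    text = text.lower()
    counts = []
    for feat in features:
        n = len(feat)
        # Slide a window of size n over the text and count exact matches
        counts.append(sum(1 for i in range(len(text) - n + 1)
                          if text[i:i + n] == feat))
    return counts

row = restricted_counts(
    "Lorsqu'il a demandé qui avait cassé la fenêtre, "
    "tous les garçons ont pris un air innocent."
)
print(row)  # [8, 0, 8, 1, 0, 0]
```

The result agrees with the fourth row of $\mathbf{X}$ above.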

## Programming: Extracting the features

Before you start programming, download the Tatoeba dataset.

### Loading and filtering the dataset

Run the code to read the dataset and split it into lines.

In [8]:
dataset = open('sentences.csv', encoding='utf8').read().strip()
dataset = dataset.split('\n')
dataset[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

Run the code to split the fields and strip any surrounding whitespace.

In [9]:
dataset = list(map(lambda x: tuple(x.split('\t')), dataset))
dataset = list(map(lambda x: tuple(map(str.strip, x)), dataset))
dataset[:3]

[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

Write the code to extract the French, English, and Swedish texts. You will call the resulting dataset `dataset_small`.

In [10]:
# Keep only the French, English, and Swedish texts
dataset_small = [x for x in dataset if x[1] in ('fra', 'eng', 'swe')]

In [11]:
dataset_small[:5]

[('1115',
  'fra',
  "Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent."),
 ('1276', 'eng', "Let's try something."),
 ('1277', 'eng', 'I have to go to sleep.'),
 ('1279', 'fra', 'Je ne supporte pas ce type.'),
 ('1280', 'eng', "Today is June 18th and it is Muiriel's birthday!")]

### Functions to Count Character N-grams

Write a function `count_chars(string, lc=True)` to count the characters (unigrams) of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the unigrams.

In [12]:
from collections import Counter

def get_ngrams(string, n):
    num_chars = len(string)
    ngrams = [string[i : i+n] for i in range(num_chars - n + 1)]
    return dict(Counter(ngrams))

def count_chars(string, lc=True):
    string = string.lower() if lc else string
    unigrams = get_ngrams(string, 1)
    tot = sum(unigrams.values())
    return {k: v/tot for k, v in unigrams.items()}

Write a function `count_bigrams(string, lc=True)` to count the character bigrams of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the bigrams.

In [13]:
def count_bigrams(string, lc=True):
    string = string.lower() if lc else string
    bigrams = get_ngrams(string, 2)
    tot = sum(bigrams.values())
    return {k: v/tot for k, v in bigrams.items()}

Write a function `count_trigrams(string, lc=True)` to count the character trigrams of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the trigrams.

In [14]:
def count_trigrams(string, lc=True):
    string = string.lower() if lc else string
    trigrams = get_ngrams(string, 3)
    tot = sum(trigrams.values())
    return {k: v/tot for k, v in trigrams.items()}

In [13]:
count_chars("Let's try something.")

{'l': 0.05,
 'e': 0.1,
 't': 0.15,
 "'": 0.05,
 's': 0.1,
 ' ': 0.1,
 'r': 0.05,
 'y': 0.05,
 'o': 0.05,
 'm': 0.05,
 'h': 0.05,
 'i': 0.05,
 'n': 0.05,
 'g': 0.05,
 '.': 0.05}

In [14]:
count_bigrams("Let's try something.")

{'le': 0.05263157894736842,
 'et': 0.10526315789473684,
 "t'": 0.05263157894736842,
 "'s": 0.05263157894736842,
 's ': 0.05263157894736842,
 ' t': 0.05263157894736842,
 'tr': 0.05263157894736842,
 'ry': 0.05263157894736842,
 'y ': 0.05263157894736842,
 ' s': 0.05263157894736842,
 'so': 0.05263157894736842,
 'om': 0.05263157894736842,
 'me': 0.05263157894736842,
 'th': 0.05263157894736842,
 'hi': 0.05263157894736842,
 'in': 0.05263157894736842,
 'ng': 0.05263157894736842,
 'g.': 0.05263157894736842}

In [15]:
count_trigrams("Let's try something.")

{'let': 0.05555555555555555,
 "et'": 0.05555555555555555,
 "t's": 0.05555555555555555,
 "'s ": 0.05555555555555555,
 's t': 0.05555555555555555,
 ' tr': 0.05555555555555555,
 'try': 0.05555555555555555,
 'ry ': 0.05555555555555555,
 'y s': 0.05555555555555555,
 ' so': 0.05555555555555555,
 'som': 0.05555555555555555,
 'ome': 0.05555555555555555,
 'met': 0.05555555555555555,
 'eth': 0.05555555555555555,
 'thi': 0.05555555555555555,
 'hin': 0.05555555555555555,
 'ing': 0.05555555555555555,
 'ng.': 0.05555555555555555}
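The values in these outputs follow from the number of sliding windows: a string of length $L$ contains $L - n + 1$ n-grams of size $n$, so an n-gram that occurs once has a relative frequency of $1 / (L - n + 1)$. A short sanity check on the same string (the variable names are only illustrative):

```python
text = "Let's try something."

single_freq = {}
for n in (1, 2, 3):
    num_ngrams = len(text) - n + 1      # number of windows of size n
    single_freq[n] = 1 / num_ngrams     # frequency of an n-gram seen once
    print(n, num_ngrams, single_freq[n])
```

With 20 characters, this gives 20 unigrams, 19 bigrams, and 18 trigrams, which matches the frequencies 0.05, 1/19, and 1/18 printed above.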

### Counting the ngrams in the dataset

You will now extract the features from each text. For this, add the character, bigram, and trigram relative frequencies to the texts using this format:
`(text_id, language_id, text, char_cnt, bigram_cnt, trigram_cnt)`. 

From the datapoint:
`('1276', 'eng', "Let's try something.")`,
you must return:

`('1276', 'eng', "Let's try something.", 
  {'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 'n': 0.05, 'g': 0.05, '.': 0.05},
  {'le': 0.05263157894736842, 'et': 0.10526315789473684, "t'": 0.05263157894736842, "'s": 0.05263157894736842, 's ': 0.05263157894736842, ' t': 0.05263157894736842, 'tr': 0.05263157894736842, 'ry': 0.05263157894736842, 'y ': 0.05263157894736842, ' s': 0.05263157894736842, 'so': 0.05263157894736842, 'om': 0.05263157894736842, 'me': 0.05263157894736842, 'th': 0.05263157894736842, 'hi': 0.05263157894736842, 'in': 0.05263157894736842, 'ng': 0.05263157894736842, 'g.': 0.05263157894736842},
  {'let': 0.05555555555555555, "et'": 0.05555555555555555, "t's": 0.05555555555555555, "'s ": 0.05555555555555555, 's t': 0.05555555555555555, ' tr': 0.05555555555555555, 'try': 0.05555555555555555, 'ry ': 0.05555555555555555, 'y s': 0.05555555555555555, ' so': 0.05555555555555555, 'som': 0.05555555555555555, 'ome': 0.05555555555555555, 'met': 0.05555555555555555, 'eth': 0.05555555555555555, 'thi': 0.05555555555555555, 'hin': 0.05555555555555555, 'ing': 0.05555555555555555, 'ng.': 0.05555555555555555})`

In [15]:


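#
# NOTE: the [:1000] slice below restricts the run to the first 1,000 texts.
# Remove it to extract the features from the whole of dataset_small.
#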


dataset_small_feat = [(_id, lang, txt, count_chars(txt), count_bigrams(txt), count_trigrams(txt)) 
                      for _id, lang, txt in dataset_small[:1000]]



In [16]:
dataset_small_feat[:2]

[('1115',
  'fra',
  "Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.",
  {'l': 0.044444444444444446,
   'o': 0.05555555555555555,
   'r': 0.05555555555555555,
   's': 0.07777777777777778,
   'q': 0.022222222222222223,
   'u': 0.044444444444444446,
   "'": 0.011111111111111112,
   'i': 0.06666666666666667,
   ' ': 0.16666666666666666,
   'a': 0.08888888888888889,
   'd': 0.022222222222222223,
   'e': 0.05555555555555555,
   'm': 0.011111111111111112,
   'n': 0.08888888888888889,
   'é': 0.022222222222222223,
   'v': 0.011111111111111112,
   't': 0.05555555555555555,
   'c': 0.022222222222222223,
   'f': 0.011111111111111112,
   'ê': 0.011111111111111112,
   ',': 0.011111111111111112,
   'g': 0.011111111111111112,
   'ç': 0.011111111111111112,
   'p': 0.011111111111111112,
   '.': 0.011111111111111112},
  {'lo': 0.011235955056179775,
   'or': 0.011235955056179775,
   'rs': 0.011235955056179775,
   'sq': 0.011235955056179775,
   'qu': 0.02247191

The unigram frequencies

In [17]:
dataset_small_feat[0][3].items()

dict_items([('l', 0.044444444444444446), ('o', 0.05555555555555555), ('r', 0.05555555555555555), ('s', 0.07777777777777778), ('q', 0.022222222222222223), ('u', 0.044444444444444446), ("'", 0.011111111111111112), ('i', 0.06666666666666667), (' ', 0.16666666666666666), ('a', 0.08888888888888889), ('d', 0.022222222222222223), ('e', 0.05555555555555555), ('m', 0.011111111111111112), ('n', 0.08888888888888889), ('é', 0.022222222222222223), ('v', 0.011111111111111112), ('t', 0.05555555555555555), ('c', 0.022222222222222223), ('f', 0.011111111111111112), ('ê', 0.011111111111111112), (',', 0.011111111111111112), ('g', 0.011111111111111112), ('ç', 0.011111111111111112), ('p', 0.011111111111111112), ('.', 0.011111111111111112)])

The bigram frequencies

In [18]:
dataset_small_feat[0][4].items()

dict_items([('lo', 0.011235955056179775), ('or', 0.011235955056179775), ('rs', 0.011235955056179775), ('sq', 0.011235955056179775), ('qu', 0.02247191011235955), ("u'", 0.011235955056179775), ("'i", 0.011235955056179775), ('il', 0.011235955056179775), ('l ', 0.011235955056179775), (' a', 0.033707865168539325), ('a ', 0.02247191011235955), (' d', 0.011235955056179775), ('de', 0.011235955056179775), ('em', 0.011235955056179775), ('ma', 0.011235955056179775), ('an', 0.011235955056179775), ('nd', 0.011235955056179775), ('dé', 0.011235955056179775), ('é ', 0.02247191011235955), (' q', 0.011235955056179775), ('ui', 0.011235955056179775), ('i ', 0.011235955056179775), ('av', 0.011235955056179775), ('va', 0.011235955056179775), ('ai', 0.02247191011235955), ('it', 0.011235955056179775), ('t ', 0.02247191011235955), (' c', 0.011235955056179775), ('ca', 0.011235955056179775), ('as', 0.011235955056179775), ('ss', 0.011235955056179775), ('sé', 0.011235955056179775), (' l', 0.02247191011235955), ('la

## Programming: Building $\mathbf{X}$

You will now build the $\mathbf{X}$ matrix. In this assignment, you will only consider unigrams to speed up the training step. This means that you will set aside the character bigrams and trigrams.

When you are done with the lab requirements, feel free to improve the program and include bigrams and trigrams. To add bigrams, a possible method is to add the bigram dictionary to the unigram one using `update()` and then to extract the resulting dictionary. You can easily extend this to trigrams. Feel free to use another method if you want.

In [15]:
INCLUDE_BIGRAMS = False
if INCLUDE_BIGRAMS:
    for i in range(len(dataset_small_feat)):
        dataset_small_feat[i][3].update(dataset_small_feat[i][4])
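Note that `update()` modifies the unigram dictionary in place, so the bigrams stay in it if the cell is run again with a different setting. As an alternative, here is a minimal sketch of a merge that leaves the original dictionaries untouched; `merge_features` is a hypothetical helper, not part of the lab code.

```python
def merge_features(*feature_dicts):
    """Combine several n-gram frequency dictionaries into a new one."""
    merged = {}
    for d in feature_dicts:
        merged.update(d)    # only the new dictionary is modified
    return merged

unigrams = {'a': 0.5, 'n': 0.5}
bigrams = {'an': 1.0}

combined = merge_features(unigrams, bigrams)
print(combined)   # {'a': 0.5, 'n': 0.5, 'an': 1.0}
print(unigrams)   # {'a': 0.5, 'n': 0.5}
```

Keys never collide here because unigrams, bigrams, and trigrams have different lengths.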

### Vectorizing the features

The CLD3 architecture uses embeddings. In this lab, we will simplify it and use a feature vector consisting of the character frequencies instead. For example, you will represent the text:

`"Let's try something."`

with:

`{'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 
 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 
 'n': 0.05, 'g': 0.05, '.': 0.05}`

To create the $\mathbf{X}$ matrix, we need to transform the dictionaries of `dataset_small_feat` into numerical vectors. The `DictVectorizer` class from the scikit-learn library [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html] has two methods, `fit()` and `transform()`, as well as `fit_transform()`, which combines both, to convert dictionaries into such vectors.

You will now write the code to:

1. Extract the character frequency dictionaries from `dataset_small_feat`, stored at index 3 of each datapoint, and collect them in a list;
2. Convert the list of dictionaries into an $\mathbf{X}$ matrix using `DictVectorizer`.
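To see what `DictVectorizer` does before applying it to the dataset, here is a minimal sketch on two toy dictionaries (the data and variable names are made up for illustration). Each distinct key becomes a column, and missing keys are filled with zeros.

```python
from sklearn.feature_extraction import DictVectorizer

toy = [{'a': 0.5, 'b': 0.5}, {'a': 1.0}]

vec = DictVectorizer(sparse=False)
X_toy = vec.fit_transform(toy)     # learn the columns, then convert
print(vec.feature_names_)          # the column order
print(X_toy)

# transform() reuses the learned columns; keys unseen in fit() are ignored
X_new = vec.transform([{'a': 0.25, 'z': 0.75}])
print(X_new)
```

This is why the same fitted vectorizer must be reused with `transform()` on new texts: it guarantees that the columns line up with those seen during training.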

#### Extracting the character frequencies

Produce a new list of datapoints with the unigrams only. Each item in this list will be a dictionary. You will call it `X_cat`.

In [52]:
X_cat = [element[3] for element in dataset_small_feat]

In [53]:
X_cat[:2]

[{'l': 0.044444444444444446,
  'o': 0.05555555555555555,
  'r': 0.05555555555555555,
  's': 0.07777777777777778,
  'q': 0.022222222222222223,
  'u': 0.044444444444444446,
  "'": 0.011111111111111112,
  'i': 0.06666666666666667,
  ' ': 0.16666666666666666,
  'a': 0.08888888888888889,
  'd': 0.022222222222222223,
  'e': 0.05555555555555555,
  'm': 0.011111111111111112,
  'n': 0.08888888888888889,
  'é': 0.022222222222222223,
  'v': 0.011111111111111112,
  't': 0.05555555555555555,
  'c': 0.022222222222222223,
  'f': 0.011111111111111112,
  'ê': 0.011111111111111112,
  ',': 0.011111111111111112,
  'g': 0.011111111111111112,
  'ç': 0.011111111111111112,
  'p': 0.011111111111111112,
  '.': 0.011111111111111112},
 {'l': 0.05,
  'e': 0.1,
  't': 0.15,
  "'": 0.05,
  's': 0.1,
  ' ': 0.1,
  'r': 0.05,
  'y': 0.05,
  'o': 0.05,
  'm': 0.05,
  'h': 0.05,
  'i': 0.05,
  'n': 0.05,
  'g': 0.05,
  '.': 0.05}]

#### Vectorize `X_cat`

Convert your `X_cat` list into a numerical matrix using `DictVectorizer`.

In [104]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = v.fit_transform(X_cat)

## Programming: Building $\mathbf{y}$

You will now convert the list of language symbols into a $\mathbf{y}$ vector.

Extract the language symbols from `dataset_small_feat` and call the resulting list `y_cat`.

In [55]:
y_cat = [element[1] for element in dataset_small_feat]

In [56]:
y_cat[:5]

['fra', 'eng', 'eng', 'fra', 'eng']

Extract the set of language symbols and build two indices mapping the symbols to integers and the integers to symbols. Both indices will be dictionaries that you will call `lang2inx` and `inx2lang`.

In [57]:
langs = set(y_cat)
lang2inx = {}
inx2lang = {}
for i,lang in enumerate(langs):
    lang2inx[lang] = i
    inx2lang[i] = lang

In [58]:
inx2lang

{0: 'fra', 1: 'eng'}

In [59]:
lang2inx

{'fra': 0, 'eng': 1}
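Because the iteration order of a Python set of strings can change from one run to the next, the integer assigned to each language may differ between sessions. A minimal sketch of a reproducible variant, assuming that sorting the symbols is acceptable (the `_demo` names and the toy list are only illustrative):

```python
y_cat_demo = ['fra', 'eng', 'eng', 'swe', 'fra']

# Sorting the set fixes the order, hence the indices, across runs
langs_sorted = sorted(set(y_cat_demo))
lang2inx_demo = {lang: i for i, lang in enumerate(langs_sorted)}
inx2lang_demo = {i: lang for lang, i in lang2inx_demo.items()}

print(lang2inx_demo)   # {'eng': 0, 'fra': 1, 'swe': 2}
print(inx2lang_demo)   # {0: 'eng', 1: 'fra', 2: 'swe'}
```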

Convert your `y_cat` vector into a numerical vector. Call this vector `y`.

In [60]:
y = [lang2inx[lang] for lang in y_cat]

In [61]:
y[:5]

[0, 1, 1, 0, 1]

## Programming: Building the Model

Create a neural network using sklearn with one hidden layer of 50 nodes and a ReLU activation function [https://scikit-learn.org/stable/modules/neural_networks_supervised.html]. Set the maximum number of iterations to 5 at first, and `verbose` to `True`. Use the default values for the rest.

In [79]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(activation='relu', hidden_layer_sizes=(50,), max_iter=5, verbose=1)

### Training and Validation Sets

#### We shuffle the indices

In [63]:
import numpy as np
indices = list(range(X.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X = X[indices, :]
y = np.array(y)[indices]

[777, 761, 293, 819, 268, 485, 219, 626, 497, 924]


#### We split the dataset

In [64]:
training_examples = int(X.shape[0] * 0.8)

X_train = X[:training_examples, :]
y_train = y[:training_examples]

X_val = X[training_examples:, :]
y_val = y[training_examples:]
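The manual shuffle and split above can also be done in one call with scikit-learn's `train_test_split`. The sketch below runs on made-up toy arrays; the `stratify` argument, which keeps the class proportions similar in both parts, is an optional addition and not something the lab requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples, 2 features, two balanced classes
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # same 80/20 ratio as above
    random_state=0,     # fixed seed for a reproducible shuffle
    stratify=y_demo,    # keep the class balance in both parts
)
print(X_tr.shape, X_va.shape)
print(sorted(y_va.tolist()))
```

Stratification may help when one language dominates the data, since a plain random split can leave a rare class with very few validation examples.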

### We fit the model

Fit the model

In [80]:
clf.fit(X_train, y_train)

Iteration 1, loss = 0.72166437
Iteration 2, loss = 0.69664535
Iteration 3, loss = 0.67242928
Iteration 4, loss = 0.64884763
Iteration 5, loss = 0.62557390




MLPClassifier(hidden_layer_sizes=(50,), max_iter=5, verbose=1)

In [72]:
clf

MLPClassifier(hidden_layer_sizes=(50,), max_iter=5, verbose=1)

## Predicting

Predict the `X_val` languages. You will call the result `y_pred`.

In [81]:
y_pred = clf.predict(X_val)

In [82]:
print(y_pred[:10])
print(y_val[:10])

[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]


#### Evaluating

In [83]:
# evaluate the model
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

0.995

In [88]:
y_symbols = list(lang2inx)
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val, y_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_pred, average='macro'))

              precision    recall  f1-score   support

         fra       0.00      0.00      0.00         1
         eng       0.99      1.00      1.00       199

    accuracy                           0.99       200
   macro avg       0.50      0.50      0.50       200
weighted avg       0.99      0.99      0.99       200

Micro F1: 0.995
Macro F1 0.49874686716791977




### Confusion Matrix

In [89]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_val, y_pred)

array([[  0,   1],
       [  0, 199]])
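The gap between the accuracy (0.995) and the macro F1 (about 0.499) can be traced back to this confusion matrix. The following sketch recomputes both figures from it with NumPy; the variable names are illustrative, and it uses the identity $F_1 = 2TP / (2TP + FP + FN)$ for each class.

```python
import numpy as np

# Rows: true classes (fra, eng); columns: predicted classes
cm = np.array([[0, 1],
               [0, 199]])

tp = np.diag(cm)                # correct predictions per class
fp = cm.sum(axis=0) - tp        # predicted as the class, but wrong
fn = cm.sum(axis=1) - tp        # members of the class that were missed

denom = 2 * tp + fp + fn
# Guard against classes that never appear at all
f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()
print(f1, accuracy, macro_f1)
```

The model never predicts `fra`, so that class gets an F1 of 0 while `eng` gets 398/399. The macro average weighs both classes equally, whereas accuracy is dominated by the 199 English texts.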

Increase the number of iterations to improve the score. You may also change the other parameters.

## Predict the language of a text

Now predict the languages of the strings below.

In [90]:
langs = ["Salut les gars !", "Hejsan grabbar!", "Hello guys!"]

Create feature vectors from this list. Call this matrix `X_test`.

Then run the prediction.

In [109]:
new_X_cat = [count_chars(text) for text in langs]
X_test = v.transform(new_X_cat)
prediction = clf.predict(X_test)
[inx2lang[pred] for pred in prediction]

['eng', 'eng', 'eng']
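The prediction steps can be wrapped in a small reusable function. The sketch below is self-contained and trains on nine sentences only, taken from the excerpt at the beginning of this lab, so its predictions are not meaningful; `char_freqs` and `predict_languages` are hypothetical helpers, and the settings `max_iter=500` and `random_state=0` are arbitrary choices for this toy run.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier

def char_freqs(text):
    """Relative frequencies of the lowercased characters of a text."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    return {ch: cnt / total for ch, cnt in counts.items()}

def predict_languages(texts, vectorizer, classifier, inx2lang):
    """Map raw texts to language codes with fitted vectorizer and classifier."""
    X_new = vectorizer.transform([char_freqs(t) for t in texts])
    return [inx2lang[int(i)] for i in classifier.predict(X_new)]

# Toy training set, far too small for a real model
train = [
    ("Let's try something.", 'eng'),
    ("I have to go to sleep.", 'eng'),
    ("Today is June 18th and it is Muiriel's birthday!", 'eng'),
    ("Je ne supporte pas ce type.", 'fra'),
    ("Pour une fois dans ma vie je fais un bon geste... Et ça ne sert à rien.", 'fra'),
    ("Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.", 'fra'),
    ("Vi trodde att det var ett flygande tefat.", 'swe'),
    ("Detta är huset jag bodde i när jag var barn.", 'swe'),
    ("Vi hade roligt på stranden igår.", 'swe'),
]
idx2lang = dict(enumerate(sorted({lang for _, lang in train})))
lang2idx = {lang: i for i, lang in idx2lang.items()}

vec = DictVectorizer(sparse=False)
X_toy = vec.fit_transform([char_freqs(text) for text, _ in train])
y_toy = [lang2idx[lang] for _, lang in train]

model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
model.fit(X_toy, y_toy)

preds = predict_languages(
    ["Salut les gars !", "Hejsan grabbar!", "Hello guys!"], vec, model, idx2lang)
print(preds)
```

The important point is that new texts go through the same `count_chars`-style function and the same fitted vectorizer as the training data.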

## Postscript from Pierre Nugues

I created this assignment from an examination I wrote last year for the course on applied machine learning. I simplified it from the `README.md` on GitHub [https://github.com/google/cld3]. I found the C++ code difficult to understand, so I reimplemented it in Keras/TensorFlow from this `README`. Should you be interested, you can find it here: [https://github.com/pnugues/language-detector]