# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided in both CSV and JSON formats for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


In [None]:
import pandas as pd
import numpy as np
import glob
import re, string, collections
from collections import Counter, defaultdict
from sklearn import preprocessing
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 53 (delta 23), reused 32 (delta 8), pack-reused 0
Unpacking objects: 100% (53/53), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:
!ls nlp-course/lm-languages-data-new

en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
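
To make the token definition concrete, here is a minimal sketch on two toy strings rather than the course files. The helper name `char_vocabulary` is hypothetical and only illustrates that every distinct character, accented ones included, becomes one vocabulary entry.

```python
def char_vocabulary(texts):
    """Return the sorted list of distinct characters seen in the given strings."""
    vocab = set()
    for text in texts:
        # Iterating over a string yields its characters one by one
        vocab.update(text)
    return sorted(vocab)

toy_tweets = ["hola", "olá"]
print(char_vocabulary(toy_tweets))  # ['a', 'h', 'l', 'o', 'á']
```

Sorting is optional; it only makes the result deterministic for inspection.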

In [None]:
def preprocess():
  # Collect every character that appears in the training files (test files are skipped)
  vocabulary = set()

  for name in glob.glob('nlp-course/lm-languages-data-new/*.csv'):
    if 'test' not in name:
      df = pd.read_csv(name)

      for tweet in df['tweet_text']:
        # A token is a single character, so add each character of the tweet
        vocabulary.update(str(tweet))

  return list(vocabulary)

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.
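
As a rough illustration of add-one smoothing, the sketch below estimates a smoothed character-bigram probability as (count(context, token) + 1) / (count(context) + V), where V is the vocabulary size. The function name `add_one_bigram_prob` and the toy corpus are assumptions made for this example, not part of the assignment.

```python
from collections import Counter

def add_one_bigram_prob(corpus, vocabulary, context, token):
    """P(token | context) for a character bigram model with add-one smoothing."""
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    # Count each character only where it is followed by another character
    context_counts = Counter(corpus[:-1])
    v = len(vocabulary)
    return (bigram_counts[(context, token)] + 1) / (context_counts[context] + v)

corpus = "abab"
vocabulary = ["a", "b", "c"]

print(add_one_bigram_prob(corpus, vocabulary, "a", "b"))  # 0.6
print(add_one_bigram_prob(corpus, vocabulary, "a", "c"))  # 0.2, although "ac" never occurs
```

Because every token in the vocabulary receives one extra count, the smoothed probabilities for a given context still sum to 1.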

In [None]:
def add_start_end(row, n):
  row = 'א'*(n-1) + row + 'ת'
  return row

def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  df = pd.read_csv(data_file_path)
  n_gram_dictionary = collections.Counter()
  n_minus_1_gram_dict = collections.Counter()

  # Add start and end chars
  df['tweet_text'] = df.apply(lambda data: add_start_end(data['tweet_text'], n),axis=1, result_type='expand')  
  
  # Split the data to n-grams
  for i in df.index:
    tokens = str(df.iloc[i, 1])
    line_length = len(tokens)
    if line_length > 0:
      manual_partial_ngram = []
      manual_partial_ngram_minus_1 = []
      for j in range(line_length-n):
        nword_split = tuple(list(tokens[j:j+n]))
        manual_partial_ngram.append(nword_split)
        nword_split_minus_1 = tuple(list(tokens[j:j+n-1]))
        manual_partial_ngram_minus_1.append(nword_split_minus_1)

    n_gram_dictionary.update(manual_partial_ngram)
    n_minus_1_gram_dict.update(manual_partial_ngram_minus_1)

  # Calculate the frequency (counts)
  model = defaultdict(lambda: defaultdict(lambda: 0))

  if n == 1:
    model = defaultdict(lambda: 0)
    total_count = float(sum(n_gram_dictionary.values()))
    for key, value in n_gram_dictionary.items():
      model[key] = value/total_count

  else:
    for key, value in n_gram_dictionary.items():
      n_minus_1_gram = tuple(list(key[0:(n-1)]))
      nth = key[-1]
      model[n_minus_1_gram][nth] = value

    # Create frequency distributions (probability)
    for w_n_minus_1 in model:
        total_count = float(sum(model[w_n_minus_1].values()))
        for w_n in model[w_n_minus_1]:
            model[w_n_minus_1][w_n] /= total_count 

  # Add one smoothing implementation 
  if add_one:
    v = len(vocabulary)
    model_smooth = defaultdict(lambda: defaultdict(lambda: 0))

    if n == 1:
      total_count = float(sum(n_gram_dictionary.values()))
      model_smooth = defaultdict(lambda: 1/(total_count+v))
      for keys, values in n_gram_dictionary.items():
        prob = (values+1)/(total_count+v)  
        model_smooth[keys] = prob
      
    else: 
      for key in n_minus_1_gram_dict:
        C_n_minus_1 = n_minus_1_gram_dict[key]
        val = 1/(C_n_minus_1+v) 
        model_smooth[key] = defaultdict(lambda: val)

      for keys, values in n_gram_dictionary.items():
        C_n = values 
        w_n_minus_1 = keys[0:(n-1)]
        nth = keys[-1]
        C_n_minus_1 = n_minus_1_gram_dict[w_n_minus_1]
        prob = (C_n+1)/(C_n_minus_1+v)  
        model_smooth[w_n_minus_1][nth] = prob
    return model_smooth
  
  else:
    return model

In [None]:
#Run vocab 
vocab = preprocess()

In [None]:
#For testing
model = lm(3, vocab, 'nlp-course/lm-languages-data-new/es.csv', True)
modelF = lm(3, vocab, 'nlp-course/lm-languages-data-new/es.csv', False)

# test for unigram
# print(model[('R',)]) 
# print(modelF[('R',)])

# # test for bigram
# print(model[('R', )]['T']) 
# print(modelF[('R', )]['T'])

# test for three-gram
print(model[('R', 'T')][' ']) 
print(modelF[('R', 'T')][' '])

# test for 4-gram
# print(model[('R', 'T', ' ')][' '])
# print(modelF[('R', 'T', ' ')][' '])

0.7005524861878453
0.9788219722038385


**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
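
For reference, perplexity can be written as 2 raised to the average negative log2 probability that the model assigns to each token. The sketch below applies that definition to a plain list of probabilities; the helper name `perplexity` is hypothetical.

```python
import math

def perplexity(probabilities):
    """Perplexity = 2 ** H, where H is the mean negative log2 probability per token."""
    entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** entropy

# A model that is uniform over 4 outcomes is "as confused" as a 4-way choice
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

A lower value means the model assigns higher probability, on average, to the observed tokens.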

In [None]:
def eval(n, model, data_file):
  prob = []
  data = pd.read_csv(data_file)
  
  for i in data.index:
    tweet = data['tweet_text'][i]    
    k = len(tweet)

    for j in range(0,k-n):
      n_minus_1_gram = tuple(list(tweet[j:j+n-1]))
      n_th_char = tweet[j+n-1]

      if n == 1:
        p_value = model[(n_th_char,)]
      else:  
        p_value = model[n_minus_1_gram][n_th_char]
      
      # According to instructor note in piazza, ignore p_value = 0, will change to p = 1 so that log(p) = 0
      if p_value == 0:
        p_value = 1

      prob.append(p_value)

  prob = np.array(prob)
  h_x = -1*(np.log2(prob).mean())

  return 2**h_x

As the output below shows, perplexity decreases as n increases when add-one smoothing is not used. Note that each model here is evaluated on the same file it was trained on (en.csv).


In [None]:
for i in range(1,5):
  model_to_eval = lm(i, vocab, 'nlp-course/lm-languages-data-new/en.csv', False)
  perp = eval(i,model_to_eval, 'nlp-course/lm-languages-data-new/en.csv')
  print(f'Perplexity of model based on en.csv with n={i}: {perp}')

Perplexity of model based on en.csv with n=1: 36.37175674135632
Perplexity of model based on en.csv with n=2: 17.812446482947372
Perplexity of model based on en.csv with n=3: 8.78191067907144
Perplexity of model based on en.csv with n=4: 4.473087603253531


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity values.

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  
  df = pd.DataFrame(columns =  ['en' ,'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'], index = ['en' ,'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'])
  vocab = preprocess()
  for ind in df.index:
    file_path_lang = 'nlp-course/lm-languages-data-new/' + ind + '.csv'
    model = lm(n, vocab, file_path_lang, add_one)
    for col in df.columns:
      file_path = 'nlp-course/lm-languages-data-new/' + col + '.csv'
      # Use .loc to avoid chained assignment, which may not update the DataFrame
      df.loc[ind, col] = eval(n, model, file_path)

  return df

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
for i in range(1,5):
  print(f'\nEvaluation table for {i}-gram model without add-one smoothing\n')
  print(match(i, False))
  print(f'\nEvaluation table for {i}-gram model with add-one smoothing\n')
  print(match(i, True))


Evaluation table for 1-gram model without add-one smoothing

         en       es       fr       in       it       nl       pt       tl
en  36.3718  36.1548  37.3242  38.4186   36.263   37.091  36.0055  41.5854
es  39.1031  34.0706  36.7792  40.3493  36.3388  38.5385  33.7447  43.8624
fr  38.5986  36.8581  35.5018  41.2117  37.0243  38.1238  35.7268  45.8036
in  39.6237  35.7185  42.5719   35.199  39.0492  39.0412  36.7088  39.6835
it  38.3622  36.2891  36.9337  40.0934  35.4928   38.239  35.1112  43.0466
nl  37.7169  37.3643   38.983  38.3525  37.7293  35.3959  37.4361  43.0685
pt  39.3343  34.9173  37.1023  39.6317    36.55  38.5397  34.7806  43.6094
tl  39.0536  39.7435  42.8051  36.1025  38.5084  39.9539  40.7483   38.383

Evaluation table for 1-gram model with add-one smoothing

         en       es       fr       in       it       nl       pt       tl
en  36.4241  38.5553  40.5718  39.1659  38.3813  37.4964  40.7872  42.2584
es  39.7203  34.1266  38.7646  41.2297  37.8494  39.19
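
One way to read such a table is to ask, for each data file (column), which model (row) yields the lowest perplexity. The sketch below does this for a three-language subset of the 1-gram table without add-one smoothing printed above; the helper `best_model_per_file` is a hypothetical name used only for illustration.

```python
# Subset of the 1-gram table without add-one smoothing (rows: model, columns: data file)
perplexity = {
    "en": {"en": 36.3718, "es": 36.1548, "pt": 36.0055},
    "es": {"en": 39.1031, "es": 34.0706, "pt": 33.7447},
    "pt": {"en": 39.3343, "es": 34.9173, "pt": 34.7806},
}

def best_model_per_file(table):
    """For every data file, return the model with the lowest perplexity on it."""
    files = next(iter(table.values())).keys()
    return {f: min(table, key=lambda m: table[m][f]) for f in files}

print(best_model_per_file(perplexity))  # {'en': 'en', 'es': 'es', 'pt': 'es'}
```

In this subset the unigram es model scores the pt data slightly better than the pt model does, which suggests that character unigrams alone separate closely related languages poorly.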

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
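
A simple non-trivial baseline, shown here only as a sketch and not as the approach taken in this notebook, is to score a sentence under every language model and pick the language with the lowest cross entropy. The toy corpora and the helpers `train_unigram`, `cross_entropy` and `classify` are hypothetical, and the sketch assumes that every character of the input appears in the vocabulary.

```python
import math
from collections import Counter

def train_unigram(text, vocabulary):
    """Character unigram model with add-one smoothing."""
    counts = Counter(text)
    total, v = len(text), len(vocabulary)
    return {ch: (counts[ch] + 1) / (total + v) for ch in vocabulary}

def cross_entropy(text, model):
    """Mean negative log2 probability per character (assumes known characters)."""
    return -sum(math.log2(model[ch]) for ch in text) / len(text)

def classify(text, models):
    """Return the language whose model gives the lowest cross entropy."""
    return min(models, key=lambda lang: cross_entropy(text, models[lang]))

# Toy training corpora, far smaller than the real tweet files
corpora = {"en": "the then that this", "es": "que el la los las"}
vocabulary = sorted(set("".join(corpora.values())))
models = {lang: train_unigram(text, vocabulary) for lang, text in corpora.items()}

print(classify("the", models))  # en
print(classify("la", models))   # es
```

The same idea carries over to higher-order n-gram models, where the context of each character is taken into account.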

Loading data:

In [None]:
data = pd.read_csv('nlp-course/lm-languages-data-new/test.csv', index_col = 'tweet_id')
data.head()

tweet_id,tweet_text,label
845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en
836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it
836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl
847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl
836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl


Creating features: <br>
We decided to use the cross entropy under different models as features. For each language and each n $ \in $ {1,2,3,4} we built a language model with add-one smoothing. With these 32 models we calculated the cross entropy of each tweet. <br>
In addition, we added 8 features that are not based on the language models but on the unique word vocabulary of each language. For each language corpus we built a set of unique words: all the words that appear in that corpus and in none of the other corpora. For each tweet and each language, we then calculate the fraction of the tweet's words that appear in that language's unique set. A word is defined as any run of characters not interrupted by a space character.
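
The unique-word features can be illustrated on two toy corpora. This is a minimal sketch with hypothetical helper names (`unique_words`, `unique_word_share`) and invented example sentences, not the course data.

```python
def unique_words(corpora):
    """Map each language to the words that appear only in its own corpus."""
    vocab = {lang: set(" ".join(tweets).split()) for lang, tweets in corpora.items()}
    unique = {}
    for lang, words in vocab.items():
        # Union of the vocabularies of all other languages
        others = set().union(*(w for other, w in vocab.items() if other != lang))
        unique[lang] = words - others
    return unique

def unique_word_share(tweet, lang_words):
    """Fraction of the tweet's words that belong to the language's unique set."""
    words = tweet.split()
    if not words:
        return 0.0
    return sum(word in lang_words for word in words) / len(words)

corpora = {"en": ["no problem at all"], "es": ["no hay problema"]}
unique = unique_words(corpora)

print(unique_word_share("no hay problema", unique["es"]))  # 2 of 3 words are unique to es
print(unique_word_share("no hay problema", unique["en"]))  # 0.0
```

The word "no" occurs in both corpora, so it is unique to neither language and does not count toward either feature.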




---



In [None]:
# Creating all 32 add-one models
def models(vocab):
  models = {}
  languages = ['en' ,'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']

  for l in languages:
    file_path_lang = 'nlp-course/lm-languages-data-new/' + l + '.csv'
    for i in range(1,5):
      add_one = True
      key_true = str(i) + '_' + l + '_True'
      model = lm(i, vocab, file_path_lang, add_one)
      model['name'] = key_true
      models[key_true] = model

  return models  

# Calculating entropy features
def eval_tweet(tweet, model):
  n = int(model['name'][0])
  tweet = 'א'*(n-1) + tweet + 'ת'
  prob = []
  k = len(tweet)

  for j in range(0,k-n):
    n_minus_1_gram = tuple(list(tweet[j:j+n-1]))
    n_th_char = tweet[j+n-1]

    if n == 1:
      p_value = model[(n_th_char,)]
    else:  
      p_value = model[n_minus_1_gram][n_th_char]

    # According to instructor note in piazza, ignore p_value = 0, will change to p = 1 so that log(p) = 0
    if p_value == 0:
      p_value = 1

    prob.append(p_value)

  prob = np.array(prob)
  prob = np.ma.masked_equal(prob,0)
  h_x = -1*(np.log2(prob).mean())
  return h_x  

In [None]:
models = models(vocab)

In [None]:
# Creating set of unique words per language
def create_vocab(data_file):
  id = pd.read_csv(data_file)
  current_set = set()
  for i in id.index:
      add = set(id.iloc[i,1].split())
      current_set =  current_set | add 
  return current_set

def find_unique_word_in_language(id):
  languages = ['en' ,'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  path = 'nlp-course/lm-languages-data-new/' + id + '.csv'
  corpus_language = create_vocab(path)
  diff_lang = set()
  for lang in languages:
    if lang != id:
      path2 =  'nlp-course/lm-languages-data-new/' + lang + '.csv'
      new = create_vocab(path2)
      diff_lang = diff_lang | new
  final = corpus_language - diff_lang  
  return final

# Calculates percent of unique words in tweet per language
def count_unique(tweet, lang_set):
  tweet = str(tweet).split()
  word_length = len(tweet)
  counter = 0
  for word in tweet:
    if word in lang_set:
      counter += 1
  # Guard against tweets with no words to avoid a division by zero
  return counter/word_length if word_length else 0.0

In [None]:
es_unique = find_unique_word_in_language('es')
en_unique = find_unique_word_in_language('en')
fr_unique = find_unique_word_in_language('fr')
in_unique = find_unique_word_in_language('in')
it_unique = find_unique_word_in_language('it')
nl_unique = find_unique_word_in_language('nl')
pt_unique = find_unique_word_in_language('pt')
tl_unique = find_unique_word_in_language('tl')

In [None]:
sets = {'es' : es_unique, 'en' :en_unique, 'fr': fr_unique, 'in':in_unique, 'it':it_unique, 'nl':nl_unique, 'pt':pt_unique,'tl': tl_unique}

In [None]:
def add_features(data):
  for key , value in models.items():
    name = value['name']
    n = int(value['name'][0])
    data[f"{name}"]=data.apply(lambda data: eval_tweet(data['tweet_text'], value ),axis=1, result_type='expand')

  for lang in sets:
    lang_set = sets[lang]
    data[f"{lang}"]=data.apply(lambda data: count_unique(data['tweet_text'], lang_set ),axis=1, result_type='expand')  
    
  return data

In [None]:
# Creating features columns in dataset
data = add_features(data)
print('40 features added to dataset')
data.head()

40 features added to dataset


tweet_id,tweet_text,label,1_en_True,2_en_True,3_en_True,4_en_True,1_es_True,2_es_True,3_es_True,4_es_True,1_fr_True,2_fr_True,3_fr_True,4_fr_True,1_in_True,2_in_True,3_in_True,4_in_True,1_it_True,2_it_True,3_it_True,4_it_True,1_nl_True,2_nl_True,3_nl_True,4_nl_True,1_pt_True,2_pt_True,3_pt_True,4_pt_True,1_tl_True,2_tl_True,3_tl_True,4_tl_True,es,en,fr,in,it,nl,pt,tl
845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,4.806124,4.128117,4.419504,5.949909,4.925381,4.659428,5.680545,7.773401,4.931331,4.696791,5.779017,7.919382,4.965101,4.608029,5.79756,7.831239,4.949518,4.755509,5.891226,8.001155,4.895818,4.601566,5.447773,7.640374,4.895037,4.734157,5.970497,7.784575,4.982964,4.547294,5.502566,7.346213,0.0,0.064516,0.0,0.0,0.0,0.0,0.0,0.0
836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,5.646338,5.043846,6.444367,8.265513,5.669175,4.921034,6.167898,7.585042,5.810842,5.141068,6.597034,7.635056,5.835198,5.257088,6.879072,8.61146,5.546321,4.392376,4.979507,6.449854,5.832572,5.558514,7.248568,8.13285,5.659642,4.964194,6.494578,7.958456,5.615826,5.138169,6.474877,8.438085,0.035714,0.0,0.0,0.0,0.25,0.0,0.0,0.0
836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,4.889307,4.291687,5.783332,8.052264,4.755791,4.349695,5.609305,7.708268,4.868811,4.444821,5.928465,8.050271,4.755671,4.190791,5.506231,7.674738,4.841224,4.428876,5.661525,7.93943,4.870653,4.502419,5.954939,7.853751,4.826941,4.390016,5.876185,8.035873,4.773894,4.127792,5.275387,7.656184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,5.224471,4.672308,5.72393,6.376911,5.224056,4.847826,6.094197,7.13718,5.24734,4.719948,5.874682,7.107212,5.352139,4.902325,6.394287,7.410632,5.201246,4.916412,6.080168,7.328718,5.184213,4.316553,4.925423,5.658421,5.243593,4.906557,6.191998,7.110749,5.341384,4.932243,6.213419,7.071552,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0
836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,5.314212,4.489866,5.454357,6.347071,5.429339,4.555982,5.424989,6.413107,5.436612,4.631663,5.649457,6.324208,5.30926,4.239137,4.839582,5.59961,5.363777,4.494646,5.533508,5.915919,5.350184,4.589027,5.56611,6.078988,5.43442,4.71795,5.672761,6.428938,5.28359,4.101586,4.449843,4.83456,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.416667


Function for classifying language of tweet: <br>
We fed the 40 features to two classifiers, a logistic regression model and a random forest, to predict the language label. The features were standardized before fitting the logistic regression model.
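
A common convention when standardizing is to estimate the mean and standard deviation on the training split only and to reuse them for the test split, so that both splits go through the same transformation. The sketch below shows this for a single feature in plain Python; `fit_scaler` and `transform` are hypothetical helpers, and the population standard deviation is used.

```python
import math
import statistics

def fit_scaler(values):
    """Estimate mean and population standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Fall back to 1.0 for constant features to avoid dividing by zero
    return mean, (std if std > 0 else 1.0)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train_feature = [1.0, 2.0, 3.0]
test_feature = [2.0, 4.0]

mean, std = fit_scaler(train_feature)           # statistics come from the training split
scaled_train = transform(train_feature, mean, std)
scaled_test = transform(test_feature, mean, std)  # the same statistics are reused

print(scaled_test)
```

The test value 2.0 maps to 0.0 because it equals the training mean, regardless of what the test split itself looks like.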

In [None]:
label = preprocessing.LabelEncoder()
label.fit(data['label'])
y=label.transform(data['label'])
X = data.iloc[:,2:].values

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scaling data for linear logistic regression
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the scaler fitted on the training set so both splits share the same transformation
X_test_scaled = scaler.transform(X_test)

In [None]:
# Linear Logistic Regression Model
classifiermodel = LogisticRegression(max_iter=1000,C=4)
classifiermodel.fit(X_train_scaled, y_train)
llr_predictions = classifiermodel.predict(X_test_scaled)

In [None]:
# Random Forest Model
rfclassifiermodel = RandomForestClassifier(criterion='entropy',random_state=0,n_estimators=50)
rfclassifiermodel.fit(X_train,y_train)
rf_predictions = rfclassifiermodel.predict(X_test)

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
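
To show what a weighted F1 score measures, the sketch below computes per-class F1 by hand and averages the results, weighting each class by its number of true instances. The function name `weighted_f1` and the toy labels are assumptions for this example.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

y_true = ["en", "en", "es", "es"]
y_pred = ["en", "es", "es", "es"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7333
```

Here the en class has F1 = 2/3 and the es class has F1 = 0.8; both classes have two true instances, so the weighted score is their plain average.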


In [None]:
f1_llr =f1_score(y_test, llr_predictions,average='weighted')
print(f'F1 score for Linear Logistic Regression Model:{np.round(f1_llr, 4)}')
f1_rf = f1_score(y_test, rf_predictions,average='weighted')
print(f'F1 score for Random Forest Model:{np.round(f1_rf, 4)}')

F1 score for Linear Logistic Regression Model:0.9225
F1 score for Random Forest Model:0.9093


# **Good luck!**