# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


In [None]:
%cd /content/nlp-course/lm-languages-data-new
!ls

/content/nlp-course/lm-languages-data-new
en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
import os
import glob
from collections import defaultdict

In [None]:
# define unique start ('<') and end ('>') symbols
start_symbol = '<'
end_symbol = '>'

In [None]:
# base directory of the data files (the current working directory);
# the csv files in it are listed later with os.listdir
path = os.getcwd() + '/'

In [None]:
# auxiliary method to get a file path
def get_file_path(data_file):
  return path + data_file

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
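
As a small illustration of the token definition, iterating over a Python `str` yields one Unicode code point at a time, so a character vocabulary can be collected directly from the text. The sketch below runs on two made-up tweets rather than the course data; `build_vocabulary` and `toy_tweets` are hypothetical names used only for this example.

```python
def build_vocabulary(tweets):
    """Collect every distinct character that appears in the given tweets."""
    chars = set()
    for tweet in tweets:
        chars.update(tweet)  # iterating a str yields single characters
    return sorted(chars)

# Made-up input, not taken from the assignment's data files.
toy_tweets = ["hola 😀", "hallo"]
toy_vocabulary = build_vocabulary(toy_tweets)
print(toy_vocabulary)
```

Sorting is optional; it only makes the toy result reproducible between runs.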

In [None]:
def preprocess():
  vocabulary = set(['', start_symbol, end_symbol])
  # loop over the list of csv files
  for f in os.listdir(path):
    if f.endswith('.csv'):
      file_path = get_file_path(f)
      # read the csv file
      df = pd.read_csv(file_path)
      # iterate over the 'tweet_text' column and add every character of each tweet
      for tweet in df['tweet_text'].values:
        vocabulary.update(list(tweet))
  vocabulary = list(vocabulary)
  vocabulary.append(start_symbol)
  vocabulary.append(end_symbol)
  # return a list of all the characters
  return vocabulary


In [None]:
vocabulary = preprocess()

In [None]:
print(vocabulary)

['', 'k', '♦', 'Ｏ', '😺', '⛪', '☁', '∀', '★', '로', '바', '嫌', '🐯', '動', '요', '─', 'À', 'F', '、', '╲', '²', '✴', '🐾', '║', '!', 'و', '👦', '😂', 'ү', '🈴', '🌃', '入', '👊', '☀', '🍌', '직', '෴', '🈷', 'Ｗ', '🏊', '☞', 'v', '⛈', '🇬', '\x97', 'と', 'ต', '画', '投', '👙', '🌷', '⛽', '🛳', '①', '服', '🆗', 'Ｂ', '🐽', '😇', '📷', '넷', 'É', '🤡', 'U', '＂', '역', 'ş', '🏴', 'ᴗ', '﹏', '🌫', '🐠', '🎁', '🥒', '💡', '🙂', '洲', '🌐', '⋭', '🤴', '儿', '1', "'", '影', '😞', 'Ｉ', '논', '🕵', '🔛', 'Ⅳ', '👠', '🚻', '🇯', 'б', 'ه', '🍁', '🌭', '슨', '🏫', '🎤', 'ㅜ', '🐬', '📏', '͜', 'ń', 'ナ', '러', '☆', '🐟', 'ì', '👡', '🐄', '화', '💿', '🗣', '┘', '동', '♥', '½', '🇦', '🕛', '💸', '❁', 'た', '∆', 'c', '\x9d', '봄', '☼', '業', '😮', '➡', '격', '🏈', 'P', '゜', '▔', 'ð', '▲', '↓', '👳', 'ⁿ', '위', '🇵', '🐙', '┗', '‿', 'บ', '🔊', '🤢', '🌇', '्', '◡', '🇾', 'M', '오', '환', 'А', 'Ｍ', 'Ｐ', '🍍', '🚗', 'س', '🍪', '🥄', 'な', 'ブ', '⌣', '🌪', '🇮', '♫', 'Ц', '込', 'q', '📬', '🦃', '🌀', '🏡', 'ᴬ', '¸', '😐', '🔜', 'ʳ', '엠', '신', 'm', 'ᵃ', '増', '링', 'ë', '🔵', '⒉', '해', '🏢', '송', '영', '울', 'ゴ', '\u2

**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant sequences of n-1 tokens, and the values are dictionaries mapping each possible n-th token to its probability of occurring next. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
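
The structure above can be sketched on a toy string. This is a minimal illustration under assumptions, not the required solution: `count_ngrams` is a hypothetical helper, and the padding with `<` and `>` is one possible convention for marking the start and end of a text.

```python
from collections import defaultdict

def count_ngrams(text, n, start="<", end=">"):
    """Map each (n-1)-character context to counts of the character that follows it."""
    # Assumed convention: n-1 start symbols in front, one end symbol at the back.
    padded = start * (n - 1) + text + end
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(padded) - n + 1):
        context = padded[i : i + n - 1]
        next_char = padded[i + n - 1]
        counts[context][next_char] += 1
    return counts

toy_counts = count_ngrams("abab", n=3)
# Normalise the counts of each context into probabilities.
toy_probs = {
    context: {ch: c / sum(followers.values()) for ch, c in followers.items()}
    for context, followers in toy_counts.items()
}
print(toy_probs)
```

In this toy string the context "ab" is followed once by "a" and once by the end symbol, so each gets probability 0.5.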

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
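
One possible way to carry the smoothing information, shown here only as a sketch: with add-one smoothing, a character seen `c` times after a context that occurred `total` times gets probability `(c + 1) / (total + V)`, where `V` is the vocabulary size, and every unseen character gets `1 / (total + V)`. The reserved `"<UNSEEN>"` key and the function name `add_one_probabilities` are assumptions made for this example.

```python
def add_one_probabilities(next_char_counts, vocabulary_size, unseen_key="<UNSEEN>"):
    """Apply add-one smoothing to the counts observed after a single context.

    The reserved `unseen_key` entry (a hypothetical convention) stores the
    probability of any vocabulary character never seen after this context.
    """
    total = sum(next_char_counts.values())
    denominator = total + vocabulary_size
    probs = {ch: (c + 1) / denominator for ch, c in next_char_counts.items()}
    probs[unseen_key] = 1 / denominator
    return probs

# Toy counts for one context, with an assumed vocabulary of 6 characters.
toy_smoothed = add_one_probabilities({"c": 2, "b": 1, "d": 1}, vocabulary_size=6)
print(toy_smoothed)
```

With three seen characters and three unseen ones, the probabilities 0.3 + 0.2 + 0.2 + 3 × 0.1 sum to 1, which is a quick sanity check for any smoothing scheme.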

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  # a dictionary that represents the model
  language_model = {}
  # read the csv file
  df = pd.read_csv(data_file_path)

  # prepend n-1 start symbols ('<') and append one end symbol ('>') to each tweet
  tweet_text = df['tweet_text'].apply(lambda tweet: start_symbol * (n-1) + tweet + end_symbol).values # .values gives a NumPy array of the padded tweets
  
  # count how often each character follows each substring of length n-1
  for tweet in tweet_text:
    for i in range(0, len(tweet) - n + 1):
      key = tweet[i : i + n - 1]
      value = tweet[i + n - 1]
      if key not in language_model:
        language_model[key] = {}
      if value not in language_model[key]:
        language_model[key][value] = 1
      else:
        language_model[key][value] += 1

  # convert counts to probabilities, optionally with add-one smoothing
  vocabulary_length = len(vocabulary)
  for x in language_model:  # iterate over the (n-1)-character prefixes
    total_repetitions = sum(language_model[x].values()) # total occurrences of this prefix
    # add_one is a bool, so it contributes 1 (True) or 0 (False) below;
    # only characters seen after the prefix get an entry
    language_model[x] = {k: (v + (1 * add_one)) / (total_repetitions + (vocabulary_length * add_one)) for k, v in language_model[x].items()}
  
  return language_model

In [None]:
# auxiliary params for code testing

# define add_one
add_one = False
# define the n-gram to use
n = 3
# define the file we test
data_file = 'en.csv'
# define the data file path
data_file_path = get_file_path(data_file)

In [None]:
language_model = lm(n, vocabulary, data_file_path, add_one)

In [None]:
# Iterate over key/value pairs in dict and print them
for key, value in language_model.items():
    print(key, ' : ', value)

Streaming output truncated to the last 5000 lines.
<n  :  {'a': 0.125, 'o': 0.375, 'e': 0.25, 'i': 0.25}
Rj  :  {'M': 0.07692307692307693, 'm': 0.07692307692307693, 'w': 0.07692307692307693, '>': 0.15384615384615385, 'Y': 0.07692307692307693, '3': 0.07692307692307693, '5': 0.15384615384615385, 'D': 0.07692307692307693, 'a': 0.07692307692307693, 'p': 0.07692307692307693, 'j': 0.07692307692307693}
jM  :  {'g': 0.08333333333333333, 'j': 0.08333333333333333, 'N': 0.08333333333333333, 'Z': 0.08333333333333333, 'B': 0.16666666666666666, 'F': 0.08333333333333333, 'v': 0.08333333333333333, 'z': 0.08333333333333333, 'e': 0.08333333333333333, 'X': 0.08333333333333333, ' ': 0.08333333333333333}
Mg  :  {'W': 0.05263157894736842, '4': 0.05263157894736842, '9': 0.05263157894736842, '3': 0.05263157894736842, '>': 0.21052631578947367, 'p': 0.05263157894736842, 'I': 0.05263157894736842, 'w': 0.05263157894736842, 'a': 0.10526315789473684, 'O': 0.05263157894736842, 'k': 0.05263157894736842,

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
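
For reference, the perplexity of a sequence whose N tokens receive probabilities p_1, ..., p_N is 2 raised to the average negative log2 probability. The sketch below applies that definition to a hand-written list of probabilities; `toy_perplexity` is a hypothetical name and the numbers are made up.

```python
import math

def toy_perplexity(probabilities):
    """Perplexity = 2 ** (average negative log2 probability)."""
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy

# A model that always assigns probability 0.25 behaves like a uniform
# choice among 4 options, so its perplexity is 4.
print(toy_perplexity([0.25, 0.25, 0.25, 0.25]))
print(toy_perplexity([0.5, 0.25, 0.25, 0.5]))
```

A lower value means the model is less "surprised" by the data, which is why perplexity can later be used to compare language models on the same text.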

In [None]:
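def perplexity(n, tweets, model):
  # returns the average negative log2 probability of the n-grams in tweets
  # (cross-entropy in bits); eval raises 2 to this value to get the perplexity
  probs = []
  lower_bound = 1e-8  # probability used for n-grams that are missing from the model
  for tweet in tweets:
    for index in range(len(tweet) - n + 1):
      key = tweet[index : index + n - 1]
      value = tweet[index + n - 1]
      if key in model and value in model[key]:
        probs.append(model[key][value])
      else:
        probs.append(lower_bound)
  return -np.log2(probs).mean()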

In [None]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
  data_file_path = get_file_path(data_file)
  df = pd.read_csv(data_file_path)
  tweets = df['tweet_text'].apply(lambda tweet: start_symbol * (n-1) + tweet + end_symbol).values
  return 2 ** perplexity(n, tweets, model)

In [None]:
eval(n, language_model, data_file)

8.895583908566534

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity scores.
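
The all-pairs loop can be sketched independently of the actual models. In the example below, `pairwise_scores`, `train`, and `score` are hypothetical stand-ins: the toy "model" is just the set of characters seen in training, and the toy "score" is the fraction of unseen characters, not a real perplexity.

```python
def pairwise_scores(corpora, train, score):
    """Score every trained model against every corpus.

    `train` and `score` are hypothetical callables standing in for
    model building and perplexity evaluation.
    """
    models = {lang: train(text) for lang, text in corpora.items()}
    return {
        model_lang: {data_lang: score(model, text) for data_lang, text in corpora.items()}
        for model_lang, model in models.items()
    }

# Toy stand-ins on made-up text: outer keys are models, inner keys are data sets.
toy_corpora = {"en": "the cat", "es": "el gato"}
toy_table = pairwise_scores(
    toy_corpora,
    train=set,
    score=lambda model, text: sum(ch not in model for ch in text) / len(text),
)
print(toy_table)
```

When such a nested dictionary is passed to `pd.DataFrame(...)`, the outer keys become the columns and the inner keys become the row labels; `pd.DataFrame.from_dict(..., orient="index")` gives the transposed layout. Which orientation to use is a design choice, as long as the rows and columns are labeled clearly.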

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  perplexities = {}
  for f in os.listdir(path):
    if f.endswith('.csv') and not f.startswith('test'):
      data_file_path = get_file_path(f)
      model = lm(n , vocabulary, data_file_path, add_one)
      model_name = os.path.splitext(f)[0]
      perplexities[model_name] = {}
      for appropriate_f in os.listdir(path):
         if appropriate_f.endswith(".csv") and not appropriate_f.startswith("test"):
           eval_name = os.path.splitext(appropriate_f)[0]
           perplexities[model_name][eval_name] = eval(n, model, appropriate_f)
  return pd.DataFrame(perplexities)

In [None]:
df_match = match(n, add_one)
print(f' n={n}, add_one={add_one}')
df_match

 n=3, add_one=False


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,8.056089,111.072863,61.400827,76.692405,114.440298,100.102785,88.62175,108.997435
in,191.794073,9.81674,143.616554,154.562765,93.288788,66.177434,112.671813,99.867108
es,57.208459,89.779883,8.549181,61.453559,89.896051,75.585739,67.723393,80.810279
it,71.052143,76.852118,53.955534,8.515423,79.881328,70.09503,58.276734,66.892344
nl,198.560322,94.23224,154.816467,155.104065,9.157069,116.795297,105.535129,93.469065
tl,126.658968,62.665526,107.315781,100.112807,87.672805,8.539539,102.91998,79.20967
fr,134.59593,151.644672,104.992635,101.039841,101.305602,160.811467,8.523336,114.914614
en,102.051602,60.773678,82.617457,76.658202,51.246569,51.797725,61.287845,8.895584


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
def run_match():
  for n in range(1, 5):
    for add_one in [False, True]:
        match_table = match(n, add_one)
        print(f' n={n}, add_one={add_one}')
        display(match_table)

In [None]:
run_match()

 n=1, add_one=False


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,36.275453,47.373639,40.195926,41.796982,43.092959,44.944095,41.504732,44.397999
in,42.467234,36.667112,43.128368,42.955011,41.044044,38.576607,43.914524,40.957939
es,36.802394,44.517863,35.407098,38.238282,40.279056,42.166579,39.332586,40.982248
it,40.281451,43.145736,39.802244,36.850432,40.456761,42.539031,39.267834,40.542709
nl,40.956112,41.13627,40.82319,40.389004,36.770588,42.030926,40.213942,38.991354
tl,46.540422,41.997008,46.614139,45.801752,45.801481,39.88148,48.503497,44.100097
fr,40.318366,47.316152,40.810352,40.188495,41.317912,47.660025,36.765854,43.132498
en,41.904343,41.8462,41.395017,40.902593,40.20417,41.560692,40.998799,37.789599


 n=1, add_one=True


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,36.350012,47.309668,40.245727,41.838944,43.103029,44.83566,41.433604,44.315177
in,42.551489,36.725532,43.196506,43.024646,41.103533,38.655254,43.970543,41.016328
es,36.865181,44.549989,35.464352,38.264239,40.273249,42.157104,39.155123,40.897606
it,40.361543,43.190309,39.84804,36.916213,40.50442,42.616724,39.323893,40.601691
nl,41.045399,41.202126,40.89336,40.459668,36.829995,42.118609,40.273936,39.047177
tl,46.626325,42.055756,46.682579,45.865418,45.857486,39.954949,48.556027,44.158283
fr,40.385496,47.03928,40.830589,40.253505,41.329519,47.606433,36.817601,43.13655
en,41.980004,41.910114,41.452573,40.969087,40.261977,41.64637,41.055286,37.843974


 n=2, add_one=False


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,16.610491,41.378707,26.364092,29.820778,37.180665,35.530901,31.934027,37.727715
in,32.499027,18.152606,31.10342,30.70605,28.215486,23.560148,30.282809,27.584487
es,21.576111,34.167414,16.261329,23.873942,31.856288,28.422947,27.773173,30.234148
it,24.846706,28.385365,23.490902,16.657866,29.020809,27.227342,24.708985,27.48386
nl,31.459103,27.16077,30.193777,30.386491,17.951972,28.285704,27.256254,25.493314
tl,31.907823,24.150752,30.669746,29.960255,29.372633,17.982552,31.614201,26.887441
fr,29.206695,41.34821,29.09161,30.725265,29.94778,41.498493,17.147552,34.014511
en,30.093111,26.820514,29.253189,28.956971,25.125833,25.016779,26.218623,18.281159


 n=2, add_one=True


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,20.273245,48.708924,30.825421,35.193293,43.907257,43.249393,36.724499,43.826823
in,39.785352,21.586516,37.361363,36.765905,32.938188,27.782462,35.816143,32.104621
es,25.896175,40.114614,19.175051,28.386233,37.584446,34.403909,31.993787,35.37353
it,30.075166,33.7753,27.756831,19.68842,34.130435,32.322013,28.688402,31.854929
nl,38.175769,32.107777,35.83704,35.888377,20.927884,33.55625,31.806512,29.515684
tl,39.972461,29.008267,37.450459,36.377111,35.083603,21.761562,38.173198,31.796062
fr,35.408411,49.367146,34.807851,36.162699,35.872105,49.358837,20.089377,39.617958
en,36.941273,31.946515,34.90214,34.289736,29.379591,29.732194,30.704251,21.378962


 n=3, add_one=False


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,8.056089,111.072863,61.400827,76.692405,114.440298,100.102785,88.62175,108.997435
in,191.794073,9.81674,143.616554,154.562765,93.288788,66.177434,112.671813,99.867108
es,57.208459,89.779883,8.549181,61.453559,89.896051,75.585739,67.723393,80.810279
it,71.052143,76.852118,53.955534,8.515423,79.881328,70.09503,58.276734,66.892344
nl,198.560322,94.23224,154.816467,155.104065,9.157069,116.795297,105.535129,93.469065
tl,126.658968,62.665526,107.315781,100.112807,87.672805,8.539539,102.91998,79.20967
fr,134.59593,151.644672,104.992635,101.039841,101.305602,160.811467,8.523336,114.914614
en,102.051602,60.773678,82.617457,76.658202,51.246569,51.797725,61.287845,8.895584


 n=3, add_one=True


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,26.513788,301.706576,147.718962,186.237268,301.781404,280.285566,208.210175,263.986673
in,560.545654,31.624118,401.48114,432.41158,253.807951,174.768283,312.296771,260.150813
es,142.714677,236.244864,25.01534,144.962886,235.678194,207.035217,153.618158,192.494931
it,190.420439,215.366473,134.691601,25.72047,219.77616,196.359667,146.167513,170.777165
nl,548.746205,256.223069,407.368484,410.29409,28.306024,321.878926,268.142612,225.972789
tl,417.118774,173.471034,332.543905,305.624006,256.547651,30.010008,315.327573,219.157982
fr,336.392167,394.86937,252.917404,249.044222,261.134718,422.250005,25.145476,263.508474
en,312.9489,177.489853,237.200515,221.675969,139.47534,149.235137,164.439225,27.176422


 n=4, add_one=False


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,4.358358,4007.239293,676.890158,880.944171,5047.74059,2128.008352,2211.893279,3698.729611
in,22787.694896,5.043424,12507.029748,12866.426253,6438.934037,1430.700734,9771.648734,8224.475952
es,637.813609,2531.767572,4.700578,590.959676,2790.430877,1224.712481,1107.934579,1854.990617
it,1387.829327,2326.645869,736.065661,4.594426,2906.544888,1244.431779,1269.363545,1950.301924
nl,16896.763696,4035.890125,8546.044479,9774.963069,4.589406,4866.809087,4710.252121,2908.716635
tl,5803.355102,943.345912,4199.121743,3605.21259,3823.107392,4.299478,5463.184295,2702.148464
fr,3951.131677,3937.678187,2224.680094,2055.703261,2097.809616,3673.713853,4.454898,1957.955205
en,2996.769552,737.052501,1966.169179,1474.784405,542.393744,376.391579,970.596805,4.450424


 n=4, add_one=True


Unnamed: 0,pt,in,es,it,nl,tl,fr,en
pt,59.178756,19615.543702,3177.885498,4464.409964,23698.949255,12122.113269,9746.487709,16490.167626
in,84724.60377,78.685294,48293.437897,50252.986367,28780.926546,7671.095201,41046.943779,34018.460579
es,3189.76021,13098.979408,56.423888,3039.448305,13746.666224,7507.808695,5007.96332,8964.682374
it,7340.446171,12521.31071,3790.544652,58.32831,15316.885475,7760.553128,6420.096242,9853.960481
nl,64129.183466,18814.693928,32709.740028,38175.772702,64.400172,22928.935651,18762.090561,12093.447204
tl,30019.817228,5312.178348,22239.178866,18351.493399,19264.521877,68.520696,27540.424904,13557.990856
fr,16928.137519,19383.145774,9353.589039,9531.245189,10000.486029,18530.035055,54.672454,8581.331506
en,16851.054811,5204.730235,10835.51098,8840.466254,3482.438472,2837.113711,5527.667001,61.392896


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
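
One non-trivial baseline is to score a sentence under every language model and pick the language with the lowest score. The sketch below is an illustration under assumptions: `classify_text` is a hypothetical helper, the two bigram models are hand-written toys, and `floor` is an assumed fallback probability for n-grams the model has never seen.

```python
import math

def classify_text(text, models, n, floor=1e-8, start="<", end=">"):
    """Return the language whose model gives the lowest average negative log2 probability."""
    padded = start * (n - 1) + text + end
    steps = len(padded) - n + 1  # always >= 1 because of the padding
    scores = {}
    for lang, model in models.items():
        total = 0.0
        for i in range(steps):
            context, next_char = padded[i : i + n - 1], padded[i + n - 1]
            # Unseen contexts or characters fall back to the assumed floor probability.
            total -= math.log2(model.get(context, {}).get(next_char, floor))
        scores[lang] = total / steps
    return min(scores, key=scores.get)

# Hand-written toy bigram models (n=2), not trained on real data.
toy_models = {
    "lang_a": {"<": {"a": 1.0}, "a": {"a": 0.5, ">": 0.5}},
    "lang_b": {"<": {"b": 1.0}, "b": {"b": 0.5, ">": 0.5}},
}
print(classify_text("aa", toy_models, n=2))
print(classify_text("bbb", toy_models, n=2))
```

Because 2 to the power of x grows with x, choosing the minimum average negative log2 probability selects the same language as choosing the minimum perplexity.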

In [None]:
def classify():
  data_file_path = get_file_path('test.csv')
  df = pd.read_csv(data_file_path)
  tweets = df['tweet_text']
  models = {}
  eval_results = {}
  preds = []

  # fit model
  for f in os.listdir(path):
    if f.endswith('.csv') and not f.startswith('test'):
      data_file_path = get_file_path(f)
      model_name = os.path.splitext(f)[0]
      models[model_name] = lm(n ,vocabulary, data_file_path, add_one) # save the result in dict

  # score each tweet under every model and predict the language whose model gives the lowest score
  for index, tweet in enumerate(tweets):
    eval_results[index] = {}
    for name, value in models.items():
      eval_results[index][name] = perplexity(n, [tweet], value)
    preds.append(min(eval_results[index], key=eval_results[index].get))

  # add the prediction to the dataframe for future visualisation and calculation of accuracy
  df["Prediction"] = preds
  
  return df

In [None]:
df_classification = classify()

In [None]:
def accuracy(df_predictions):
  return sum(df_predictions['Prediction'] == df_predictions['label']) / df_predictions.shape[0]

In [None]:
acc = accuracy(df_classification)
print("The prediction's accuracy for n = {} and ADD_ONE = {} is: {:.2f}%".format(n, add_one, acc * 100))
df_classification

The prediction's accuracy for n = 3 and ADD_ONE = False is: 84.00%


Unnamed: 0,tweet_id,tweet_text,label,Prediction
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
7994,836250659464761344,"La triste historia que inspiró ""Tu falta de qu...",es,es
7995,847676283089637380,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,in,in
7996,836319299279138816,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,it,it
7997,836258179847716865,Assistimos de camarote varias brigas ontem!,pt,pt


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
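
To make the metric concrete, the sketch below computes a support-weighted F1 score by hand on made-up labels: the F1 of each class is averaged with weights proportional to how often the class appears in the true labels. `weighted_f1` is a hypothetical helper written for illustration; it is not a replacement for the scikit-learn function mentioned in the hint.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, predictions))
        n_pred = sum(p == cls for p in predictions)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(labels)

# Made-up labels and predictions, not taken from test.csv.
toy_labels = ["en", "en", "es", "es"]
toy_predictions = ["en", "es", "es", "es"]
print(weighted_f1(toy_labels, toy_predictions))
```

Here "en" has precision 1 and recall 0.5 (F1 = 2/3), while "es" has precision 2/3 and recall 1 (F1 = 0.8), so the weighted average is about 0.733.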


In [None]:
from sklearn.metrics import f1_score

def calc_f1(result):
  ground_truth =  result["label"].tolist()
  pred = result["Prediction"].tolist()
  return f1_score(ground_truth, pred, average='weighted')

calc_f1(df_classification)

0.8399589347964393

# **Good luck!**