<a href="https://colab.research.google.com/github/gabibu/nlp/blob/master/Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 53, done.[K
remote: Counting objects: 100% (53/53), done.[K
remote: Compressing objects: 100% (42/42), done.[K
remote: Total 53 (delta 23), reused 32 (delta 8), pack-reused 0[K
Unpacking objects: 100% (53/53), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [2]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [3]:
import os
import io
import pathlib
import pandas as pd
pd.set_option("max_colwidth", 140)
import json 

folder = 'nlp-course/lm-languages-data-new'

def read_tweets(file_path):
  file_suffix = pathlib.Path(file_path).suffix
  if file_suffix == '.json':
    with open(file_path) as f:
      data = json.load(f)
      return [tweet.lower() for tweet in data['tweet_text'].values()]
      
        
  elif file_suffix == '.csv':
    df = pd.read_csv(file_path, header=0)
    return [tweet.lower() for tweet in df['tweet_text'].tolist()]

def preprocess():

  vocab = set()
  for file_name in os.listdir(folder):
    file_path = os.path.join(folder, file_name)
    tweets = read_tweets(file_path)

    vocab.update({word for tweet in tweets for word in tweet.lower()})

  return vocab
        

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [5]:
from collections import defaultdict
import numpy as np 

start = '<S>'
end = '<E>'

def build_probs_model(n, vocabulary, tweets, add_one):

    counts = {}

    num_of_tokens = 0.0
    for tweet in tweets:
      
        characters = list(tweet)

        tokens = [start] + characters + [end] if n > 1 else characters

        num_of_tokens += len(tokens)

        if len(tokens) < n:
            continue

        for i in range(n, len(tokens)):
            current_tokens = tokens[i - n: i]

            character = current_tokens[-1]

            if n > 1:
                n_key = ''.join(current_tokens[0: -1])

                if n_key not in counts:
                    counts[n_key] = {}

                if character not in counts[n_key]:
                    counts[n_key][character] = 0.0

                counts[n_key][character] += 1.0
            else:
                if character not in counts:
                    counts[character] = 0.0

                counts[character] += 1


    if n == 1:
        
        if add_one:
            model = defaultdict(lambda : 1.0/len(vocabulary), {word: (count + 1.0) / (num_of_tokens + len(vocabulary)) for (word, count) in counts.items()})
        else:
          model = defaultdict(lambda : 0.0, {word: count / num_of_tokens for (word, count) in counts.items()})
    else:
            
        default_model = defaultdict(lambda :0.0) if not add_one else defaultdict(lambda :1.0/len(vocabulary))
        model = defaultdict(lambda : default_model)
        for gram, next_character_counts in counts.items():
            n = np.sum(list(next_character_counts.values()))

            if add_one:
              model[gram] = defaultdict(lambda: 1.0/(n + len(vocabulary)), {word: (count +1.0) / (n + len(vocabulary)) for (word, count) in next_character_counts.items()})
            else:
              model[gram] = defaultdict(lambda : 0.0, {word: count / n for (word, count) in next_character_counts.items()})

    return model






def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  # TODO
  tweets = read_tweets(data_file_path)
  return build_probs_model(n, vocabulary, tweets, add_one)

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [9]:
def eval(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets file that you wish to claculate a perplexity score for

    # TODO
    tweets = read_tweets(data_file)

    tweets_perplexity = []
    for tweet in tweets:
        probs_n = 0.0
        entropy_sum = 0.0
        if n == 1:
            for token in tweet:
                prob = model[token]
                
                entropy_sum += 1 * np.log2(0.000000001) if prob == 0 else prob * np.log2(prob)
                probs_n += 1
        else:
            for index in range(0, len(tweet)):
                to_index = index + n
                tokens = tweet[index: to_index]
                if len(tokens) < n:
                    break
                prev = tokens[0: -1]
                to_token = tokens[-1]
                prob = model[prev][to_token]
                entropy_sum += 1 * np.log2(0.000000001) if prob == 0 else prob * np.log2(prob)
               
                probs_n += 1

        entropy = (-1 / probs_n) * entropy_sum
        tweet_perplexity = np.power(2, entropy)
        tweets_perplexity.append(tweet_perplexity)

    return np.mean(tweets_perplexity)

# vocab = preprocess()
# model = lm(2, vocab, 'nlp-course/lm-languages-data-new/en.csv', False)
# eval(2, model, 'nlp-course/lm-languages-data-new/en.csv')


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

In [12]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  vocabulary = preprocess()

  language_to_model = {}
  language_to_data_file = {}
  for file_name in os.listdir(folder):
    language = os.path.splitext(file_name)[0]
    if language in language_to_model or 'test' in language:
      continue

    file_path = os.path.join(folder, file_name)
    language_to_data_file[language] = file_path

    model = lm(n, vocabulary, file_path, add_one)
    language_to_model[language] = model

  languages = sorted(list(language_to_model.keys()))

  rows = []
  for index, language1 in enumerate(languages):
    current_res = []
    for index, language2 in enumerate(languages):
      perplexity = eval(n, language_to_model[language1], language_to_data_file[language2])
      current_res.append(perplexity)
    
    current_res.append(n)
    current_res.append(add_one)
    
    rows.append(current_res)
  
  return pd.DataFrame(rows, columns =languages + ["n", "add_one"], index=languages)

Unnamed: 0,en,es,fr,in,it,nl,pt,tl,n,add_one
en,1.21727,580.0845,4.599724,1315.648613,3.360552,760.082728,655.469301,1734.693461,2,False
es,289.056752,1.231016,4.056581,1326.930346,15.520342,27721.10514,2653.111915,114522.871187,2,False
fr,718.535807,3007.488901,1.220492,2394.099623,15.364691,55753.908597,3839.7093,114772.86976,2,False
in,140.344924,332.118673,4.24232,1.218189,44.289797,19593.936101,2873.163966,112102.410661,2,False
it,335.229626,3010.433071,4.076182,1208.983847,1.227589,3281.066055,3840.630578,112804.33758,2,False
nl,441.140692,1895.78182,3.286142,1739.885654,3.005452,1.218804,3814.072488,111546.301161,2,False
pt,143.96867,102.913132,3.887172,1557.970792,44.459766,55677.157295,1.224135,113528.685159,2,False
tl,298.47385,2999.141287,4.306378,3968.717602,10.19638,43.805729,3825.059178,1.220652,2,False


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [13]:
frames = []
for n in range(1,5):
  for add_one in [True, False]:
    frames.append(match(n, add_one))
    
pd.concat(frames)

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where a different solution will yield different accuracy scores. any solution that is not trivial (e.g. returning 'en' in all cases) will be excepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [30]:
def build_languages_models(n, add_one):

  vocabulary = preprocess()

  language_to_model = {}
  for file_name in os.listdir(folder):
    language = os.path.splitext(file_name)[0]
    if language in language_to_model or 'test' in language:
      continue
      
    
    file_path = os.path.join(folder, file_name)

    model = lm(n, vocabulary, file_path, add_one)
    language_to_model[language] = model
  return language_to_model

two_gram_language_to_model = build_languages_models(2, True)
unigram_gram_language_to_model = build_languages_models(1, True)

In [31]:
TEST_FILE = 'test.csv'
test_df = pd.read_csv(os.path.join(folder, TEST_FILE))[['tweet_text', 'label']]
test_df["tweet_text"] = test_df["tweet_text"].str.lower()

In [32]:
import string
import re
def clean_text(tweet):

  if tweet.startswith('rt'):
    tweet = tweet.replace('rt', '')

  seen = False
  
  xx = ''
  for x in tweet:
    if seen:
      if x == ' ':
        seen = False
      else:
        continue
    else:
     
      if x in ['😂','👌🏽','🕋','💎','😌', '🆘', '🔴','#', '😍','🔥','💙','😻','💔', '🙌🏾','👙','☀️','🍾','🎈','%', '😈','💦','!', '🤘🏼']:
        xx +=''
        continue

      if x == '@':
        seen = True
      else:
        xx += x

  g = re.findall(r'(https?://\S+)', xx)

  for x in g:
    xx = xx.replace(x, '')
  
  xx = xx.replace(' ', ' ')
  return xx.strip()


In [35]:

test_df['clean_tweet_text'] = test_df['tweet_text'].apply(clean_text)

In [50]:

def classify_language(tweet, n):


  best_language_prob = float('-inf')
  best_language = None

  for language, model in two_gram_language_to_model.items():
    unigram_model = unigram_gram_language_to_model[language]

    index = 0

    language_log_prob = 0.0
    while index + n < len(tweet):
      if n == 1:
        token = tweet[index: index + 1]
        prob = model[token]
        prob = prob if prob > 0 else 0.0000001
        language_log_prob += np.log(prob)
      else:
        tokens = tweet[index: index + n]
        prev = tokens[0:-1]
        next = tokens[-1]

        prob = model[prev][next]
        prob = prob if prob > 0 else 0.0000001
        uni_prob = unigram_model[next] if unigram_model[next] > 0 else 0.0000001
        language_log_prob += np.log((0.8 * prob)+ (uni_prob * 0.2))
      index += 1
    
    if language_log_prob > best_language_prob:
      best_language_prob = language_log_prob
      best_language = language
    

  return best_language




# ind = 484
# test_df['clean_tweet_text'] = test_df['tweet_text'].apply(clean_text)

# xx = classify_language(test_df.iloc[ind]['clean_tweet_text'], 2)
# xx, test_df.iloc[ind]['label'], test_df.iloc[ind]['clean_tweet_text']




In [51]:
test_df["predicted_label"] = test_df["clean_tweet_text"].apply(lambda t: classify_language(t, 2))

In [52]:
len(test_df[test_df['label'] == test_df['predicted_label']])/len(test_df)

0.8677334666833354

In [23]:
test_df[test_df['label'] != test_df['predicted_label']].sample(3)[['clean_tweet_text', 'tweet_text', 'predicted_label', 'label']]

Unnamed: 0,clean_tweet_text,tweet_text,predicted_label,label
7220,pordalab ❤❤❤❤ follow 👇 dtbyisitreal,rt @teacherrichards: pordalab ❤❤❤❤ follow 👇 @dtby_ninay @madamsexytary_ @legionadntarlac #dtbyisitreal https://t.co/gjs9nhrrp2,tl,in
2269,kyt derek,@derekmolina kyt derek,fr,nl
320,"""sorry bugsplat"" hahahahahhahahahahahah loftrriiiiip","@rjibarra_ ""sorry bugsplat"" hahahahahhahahahahahah loftrriiiiip 😂😂",tl,nl


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 


In [53]:
from sklearn.metrics import f1_score

f1_score(test_df['label'].tolist(), test_df['predicted_label'].tolist(), average='micro')

0.8677334666833354

# **Good luck!**