# Week 1
An introduction to the course and to the main applications of NLP.

# Week 2
Let's define some terminology:
1. Type: an element of the vocabulary, i.e. a distinct word. The vocabulary is predefined.
2. Token: an instance of a type in the text. For example, the sentence "the cat sat on the mat" contains 6 tokens but only 5 types, since *the* occurs twice.

* Heaps' Law relates the number of types in a vocabulary to the number of tokens.
Assuming $N$ tokens and a vocabulary $V$ of size $|V|$, then: 
$\begin{align} 
|V| = K\cdot N ^ {\beta}
\end{align}$ 
where $K$ is a positive constant and $0.67<\beta< 0.75$.
This is an empirical result. The idea is that the number of tokens is significantly larger than the number of types: each type is repeated many times, either as is or in different forms (a verb can be converted to a noun, an adjective, or even an adverb). Since $\beta < 1$, the vocabulary grows more slowly than the number of tokens.
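To make the type / token distinction and Heaps' Law concrete, here is a minimal sketch. The regular-expression tokenizer is a deliberately crude assumption, and the constants passed to `heaps_vocabulary_size` are hypothetical values chosen only for illustration, not fitted to any corpus.

```python
import re


def count_types_and_tokens(text):
    """Return (number of types, number of tokens) of a text."""
    # Crude tokenizer (an assumption): lowercase and keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)), len(tokens)


def heaps_vocabulary_size(n_tokens, k, beta):
    """Vocabulary size predicted by Heaps' Law: |V| = K * N^beta."""
    return k * n_tokens ** beta


sample = "the cat sat on the mat and the dog sat on the rug"
n_types, n_tokens = count_types_and_tokens(sample)
print(n_types, n_tokens)  # 8 types, 13 tokens

# Hypothetical constants: K = 30 and beta = 0.7 are illustrative only.
print(heaps_vocabulary_size(1_000_000, k=30, beta=0.7))
```

In practice $K$ and $\beta$ would be estimated by counting types and tokens on growing prefixes of a real corpus.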

# Week 3
* Word classes define the role certain words play in speech. They are also referred to as Part Of Speech tags (POS tags).
* Several word classes were defined centuries ago, including: noun, verb, pronoun, preposition...
* Modern linguistics defines many more tags: the tagsets of certain languages might contain more than 150 tags.
* Closed classes: classes to which no new words can be added. The ***preposition*** class is one example.
* Open classes: classes to which new words are added over time. Think of the **verb** ***google*** for example.
* There are two main approaches when categorizing / creating classes:
    1. syntactic / morphological distributional properties (a morpheme is the smallest meaning-bearing unit in a language)
    2. semantic function: words bearing similar meanings
* The syntactic rules / properties tend to be favored over semantic ones.
* Word classes are quite useful for a number of reasons: 
    1. a word class can significantly limit the possible affixes that a word can take
    2. a word class can facilitate parsing
    3. word classes are correlated with one another: they reflect context to a certain extent

* Considering the language as a set of unique words, it would be safe to state that around $85\%$ of words are "unambiguous" class-wise. Considering word repetitions, however, the remaining $15\%$ represent more than $60\%$ of the word tokens.

* Why is POS tagging challenging? For starters, there is no one-to-one map between words and POS tags: a word's POS tag varies depending on the sentence (*book* is a noun in "read the book" but a verb in "book a flight"), and it is not always obvious how to determine the tag from the context.


## Mathematical Break
Let's consider one of the most interesting and elegant mathematical models: Hidden Markov Models (HMM). Assume the following:
1. We have a hidden (not directly observable) random variable $H$, and we know the probabilities of transitioning from one of its values to another.
2. We have an observable random variable $R$, and we know the emissions matrix, i.e. the values of the conditional probabilities $P(R|H)$.
3. We have a sequence of values of the random variable $R$: $r_1, r_2, ... r_n$. The model introduces a mechanism to find another sequence $h_1, h_2, ... h_n$ for which the probability: 
$
\begin{align} 
P = P(r_1, r_2, ..., r_n, h_1, h_2, ... h_n)
\end{align}
$
is maximized.

* The matrix of conditional probabilities of a hidden state given the previous hidden state is called the transitions matrix.
* The matrix of conditional probabilities of the observable states given the hidden states is called the emissions matrix.

Well, give me a break for god's sake here. We literally have a joint probability over $2n$ values, right? Well, hold your horses a bit, dude. Let's break this expression down with the chain rule:
$
\begin{align}
P = P(r_n | r_1, r_2, ... r_{n - 1}, h_1, h_2, ... h_n) \cdot P(r_{n - 1} | r_1, r_2, ... r_{n - 2}, h_1, h_2, ... h_n) ... \cdot P(r_1 | h_1, h_2, ... h_n) \cdot P(h_n|h_1, h_2, ... h_{n - 1}) \cdot P(h_{n - 1}|h_1, h_2, ... h_{n - 2}) ... \cdot P(h_2 | h_1) \cdot P(h_1)
\end{align}
$
Well, the expression isn't getting any prettier, old man!! Care to explain? Well, the Markovian hypotheses come to the rescue: the observation $r_j$ depends only on the hidden state $h_j$, and the hidden state $h_j$ depends only on the previous hidden state $h_{j - 1}$. Thus, the expression simplifies to:
$\begin{align}
P = \prod_{j=1}^{n} P(r_j | h_j) \cdot \prod_{j=2}^{n} P(h_j | h_{j - 1}) \cdot P(h_1 | S)
\end{align}$
where $P(h_1 | S) = P(h_1)$ is the initial probability distribution of the hidden states ($S$ denotes the start of the sequence).
Each of the $2n$ terms in this simplified expression of $P$ is available from the emissions matrix, the transitions matrix or the initial distribution, and thus the combination of hidden states maximizing $P$ can be found efficiently, typically with the Viterbi algorithm.
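The maximization described above can be sketched with a small dynamic program (the Viterbi algorithm). Everything in this block is an illustrative assumption: the states `A` / `B`, the observations `x` / `y` / `z` and all the probabilities are made-up toy values, and the dictionaries `start_p`, `trans_p` and `emit_p` stand for $P(h_1|S)$, the transitions matrix and the emissions matrix respectively.

```python
def joint_probability(observations, hidden, start_p, trans_p, emit_p):
    """P(r_1..r_n, h_1..h_n) using the factorized expression of the notes."""
    p = start_p[hidden[0]]                       # P(h_1 | S)
    for j in range(1, len(hidden)):
        p *= trans_p[hidden[j - 1]][hidden[j]]   # P(h_j | h_{j-1})
    for r, h in zip(observations, hidden):
        p *= emit_p[h][r]                        # P(r_j | h_j)
    return p


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best joint probability, best hidden sequence)."""
    # best[t][s]: probability of the best path ending in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Follow the back-pointers starting from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy model: all values below are hypothetical.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
}
obs = ["x", "y", "z"]
best_p, best_path = viterbi(obs, states, start_p, trans_p, emit_p)
print(best_path, best_p)
```

The dynamic program only needs a number of steps proportional to $n$ times the square of the number of states, instead of enumerating every possible hidden sequence.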

## HMM in POS Tagging
HMM can be applied to POS tagging: the tags play the role of the hidden states and the words the role of the observations. Let's see how that turns out.

In [1]:
# let's try to implement it
# let's first download data with tagged speech
import os

import requests

url = "https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt"

r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

PATH = os.path.join(os.getcwd(), 'POS_corpus.txt')

# let's save the data to a text file
with open(PATH, 'w') as f:
    f.write(r.text)


In [2]:
# let's see what our text has for us
with open(PATH, 'r') as f:
    for line in f.readlines()[:20]:
        print(line.rstrip('\n'))
# great !! so each line represents a pair of word and tag
# we will take this into consideration when building the transitions and emissions matrices

Confidence NN
in IN
the DT
pound NN
is VBZ
widely RB
expected VBN
to TO
take VB
another DT
sharp JJ
dive NN
if IN
trade NN
figures NNS
for IN
September NNP
, ,
due JJ
for IN


In [3]:
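# first let's read the "word tag" lines from the file in question
# (a list instead of a bare generator, so that the file is properly closed)
with open(PATH, 'r') as f:
    pairs = [line.rstrip('\n') for line in f]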

In [4]:
# let's build the emission matrix: the probability of having a certain word given its tag
from collections import Counter
import pandas as pd
# each pair is mapped to a (tag, word) key in the counter
# the matrix can be rebuilt from the counter easily


def build_emissions_matrix(pairs) -> pd.DataFrame: 
    """Given an iterable of pairs (word and tag), this function returns the emissions matrix as used in the Viterbi algorithm.

    Args:
        pairs (iterable of str): each item is a line holding a word and its associated tag, separated by whitespace

    Returns:
        pd.DataFrame: the emissions matrix: rows are tags, columns are words and the cell (t, w) holds P(Word=w|Tag=t)
    """
    emissions_counter = Counter()

    for pair in pairs:
        p = pair.split()
        # keep well-formed lines whose word is alphanumeric, and lower case the word
        if len(p) == 2 and p[0].isalnum():
            word, tag = p
            emissions_counter[(tag, word.lower())] += 1

    # the next step is to build the matrix out of the counter:
    # a series indexed by (tag, word) is unstacked into a table with tags as rows and words as columns
    # (building the table in one go avoids adding the columns one by one, which is slow for a large vocabulary)
    index = pd.MultiIndex.from_tuples(list(emissions_counter.keys()), names=['tag', 'word'])
    counts = pd.Series(list(emissions_counter.values()), index=index)
    emissions_df = counts.unstack(level='word', fill_value=0)

    # we want each cell (t, w) to be the conditional probability P(Word=w|Tag=t),
    # in other words the count of the pair (t, w) divided by the total number of occurrences of the tag t

    # calculate the total occurrences of each tag (sum over the columns of each row)
    tag_sums = emissions_df.sum(axis=1)

    # divide each row by the total of its tag
    emissions = emissions_df.div(tag_sums, axis=0)

    return emissions

In [None]:
emissions_matrix = build_emissions_matrix(pairs)
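The cell comments above mention both the transitions and the emissions matrices, while only the emissions matrix is built. A possible sketch of the transition estimates follows, kept in plain Python. It assumes the corpus has already been split into sentences, each given as a list of tags; how sentences are delimited in the downloaded file is not visible in the preview above, so that step is left out. The `START` marker and the toy tag sequences are hypothetical.

```python
from collections import Counter, defaultdict

START = "<S>"  # hypothetical start-of-sentence marker


def build_transition_probs(tagged_sentences):
    """Estimate P(tag_j | tag_{j-1}) from sentences given as lists of tags."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        previous = START
        for tag in tags:
            counts[previous][tag] += 1
            previous = tag
    # Normalize the counts of each previous tag into a probability distribution.
    return {
        prev: {tag: c / sum(following.values()) for tag, c in following.items()}
        for prev, following in counts.items()
    }


# Toy tag sequences, for illustration only.
toy_sentences = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN"], ["NN", "VBZ"]]
transitions = build_transition_probs(toy_sentences)
print(transitions["DT"])  # the distribution of the tags following a determiner
```

The entries stored under `START` play the role of the initial distribution $P(h_1|S)$.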

Well, HMM, despite its elegance and simplicity, comes with a number of limitations:
1. It (over?)simplifies the actual dependencies between the elements of the sequences, as it solely considers the transitions between two consecutive elements.
2. Both transition and emission probabilities are static: they are invariant with respect to the elements' positions.

Unlike HMM, conditional random fields (CRF) are discriminative models: in other words, they model the conditional probability of the hidden sequence given the observations directly, rather than explaining how the observations are generated (HMM, more formally, is a generative model of the joint probability). A CRF can consider connections between any elements of the sequences. Considering a more extensive set of connections introduces computational overhead. Thus, the model generally used is the Linear Chain CRF, as a tradeoff between expressiveness and computational overhead.

# Week 5
* Introducing grammaticality: a sentence is grammatically correct if it follows a certain set of rules (a mechanism). This set of rules represents the grammar of the language. It is said that the grammar generates the language. Syntax makes language useful for communication: it limits the number of possible forms of a sentence, which in turn limits the possible meanings a single sentence can bear.

* Some Pragmatic points: 
1. $N$ is a set of ***non-terminal*** symbols
2. $\Sigma$ is a set of ***terminal*** symbols
3. $R$ represents a set of possible production rules of the form: 

* More concrete stuff: a Phrase Structure Grammar consists mainly of two components:
1. a lexicon: the words of the language, each assigned to a word class (a category of words)
2. phrase structure rules: determine how lexical items can be grouped together to build larger phrases

* To be verified: a group of at least 2 lexical items is referred to as a ***phrase*** ???
* Phrases belonging to the same category exhibit similar statistical distributions and thus similar behavior in terms of context and grammar.
* The distributions' similarity is tested by means of substitution: 
    * Consider the same sentence with two different strings (phrases) substituted in the same position. Both resulting sentences should have the same grammatical status: both grammatical or both ungrammatical.
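To illustrate how a lexicon and phrase structure rules generate a language, here is a toy sketch. It assumes rules of the context-free kind, where one non-terminal is rewritten into a sequence of symbols; the notes do not spell out the rule form, so this is only one possible reading. The grammar, the lexicon and the function name `expand` are all hypothetical.

```python
from itertools import product

# Hypothetical toy grammar: non-terminal -> list of possible right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["VB", "NP"], ["VB"]],
}
# Hypothetical lexicon: word class -> words belonging to it.
LEXICON = {
    "DT": ["the"],
    "NN": ["student", "book"],
    "VB": ["reads"],
}


def expand(symbol):
    """Return every word sequence (as a tuple) that `symbol` can generate."""
    if symbol in LEXICON:
        return [(word,) for word in LEXICON[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(w for part in parts for w in part))
    return results


sentences = [" ".join(words) for words in expand("S")]
for sentence in sentences:
    print(sentence)
```

The enumeration terminates only because this toy grammar has no recursive rule. Note also that the two noun phrases can be substituted for one another in any generated sentence without affecting its grammaticality, which mirrors the substitution test above.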

# Week 6: Probabilistic Models: N-grams
DISCLAIMER: My notes are mainly taken from [Stanford references](https://web.stanford.edu/~jurafsky/slp3/3.pdf).   
Why would anyone want to predict the upcoming words in a given sentence? That's a great question. Well, because natural language is inherently ambiguous, and the mathematical (objective) approach to handling ambiguity (and hence uncertainty) is probability. Such an approach is (currently) the most effective one for systems such as speech recognition, spelling correction... A "good" probabilistic model should understand that the user's statement is much more likely to be ***The student has been sick*** than ***The student husband sick***.
## Mathematical Break
So how can we model the probability of the following sentence: "Such happening did seem, in such moment of crippling despair, beyond the realms of possibility"? Good question. Well, no matter how large the corpus is, this very particular sentence (and generally any sentence of a certain complexity) wouldn't be repeated enough times for us to build a statistically significant estimate. One approach is based on the chain rule of probability theory: 
$
\begin{align}
P = P(w_1, w_2, ... w_n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2, w_1) ... \cdot P(w_n|w_{n-1}...w_1)
\end{align}
$
This rule doesn't seem to add much, right? Well, technically it doesn't, but that's what engineering is about: approximating phenomena well enough to achieve meaningful results. The main assumption of the $N$-gram model is that a word depends only on the $N - 1$ words preceding it:
$
\begin{align}
P(w_n|w_{n - 1}, w_{n - 2}, ... w_1) \approx P(w_n|w_{n - 1}, ... w_{n - N + 1})
\end{align}
$
For the case of a bigram model with additional start and end tokens, the following equalities hold:
$
\begin{align}
P(w_n | w_{n - 1}) = \frac{C(w_{n - 1}w_n)}{\sum_{w} C(w_{n - 1}w)} = \frac{C(w_{n - 1}w_n)}{C(w_{n - 1})}
\end{align}
$
The same equation can be extended for N-grams in general.
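The bigram estimate above boils down to counting. A minimal sketch, assuming whitespace tokenization, case-sensitive words and the markers `<s>` / `</s>` as start and end tokens (the marker spellings and the toy corpus are illustrative choices):

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Maximum-likelihood estimates P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            # Counting a word each time it starts a bigram gives the denominator sum_w C(w_{n-1} w).
            context_counts[prev] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = bigram_probabilities(corpus)
print(probs[("<s>", "I")])  # "I" starts 2 of the 3 sentences
print(probs[("I", "am")])   # "I" is followed by "am" in 2 of its 3 occurrences
```

Any bigram absent from the returned dictionary has an estimated probability of $0$ under this scheme.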

## Language Model Evaluation
There are two main approaches to evaluate a probabilistic model:
1. extrinsic: how well the model in question performs on the task at hand using a test dataset (unseen data): how good is the translation / classification...
2. intrinsic: perplexity. Assuming a sequence $W$ of $N$ words: $w_1, w_2, ... w_{N}$:
$
\begin{align}
PP(W) = P(w_1, w_2, ... w_N) ^ {-\frac{1}{N}}
\end{align}
$
The lower the perplexity, the better the model predicts the test sequence. It is important to note that comparing two models using intrinsic criteria requires the two models to use the same vocabulary, while having absolutely no prior knowledge of the test dataset. An improvement in the **intrinsic** metric **does not guarantee** an improvement in the **extrinsic** one.
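The perplexity formula can be evaluated as follows. The sketch assumes the model has already assigned a probability to each of the $N$ words of the sequence (for a bigram model these would be the $P(w_n|w_{n-1})$ terms) and that their product is the probability of the whole sequence; the function name and the sample values are illustrative.

```python
import math


def perplexity(token_probabilities):
    """Perplexity of a sequence, given the model probability of each of its N tokens."""
    n = len(token_probabilities)
    # Sum logarithms instead of multiplying probabilities to avoid numerical underflow.
    # Every probability must be strictly positive: math.log(0) raises a ValueError.
    log_prob = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_prob / n)


# A model that spreads its probability uniformly over 10 words at every position
# has a perplexity of 10, whatever the length of the sequence.
print(perplexity([1 / 10] * 5))
```

Perplexity can thus be read as the average number of equally likely choices the model hesitates between at each position.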

## Generalization and Zeroes
* Statistical models are quite sensitive to the training corpus. In order to have a concrete improvement, the test and training corpora must be of the same genre (distribution). This dependency is quite heavy in NLP in general.
* What about sparsity? Well, a perfectly reasonable N-gram might not be present in a certain training dataset but present in the test one (even with two splits of extremely similar distributions). Such a detail introduces two issues:
1. It reflects that the model is not general enough, in the sense that it underestimates the probability of certain N-grams, hurting the performance in general.
2. The perplexity cannot be computed mathematically, as the probability of a test set containing an unseen N-gram or an unknown word evaluates to $0$. Spoiler alert: we can't divide by $0$.

This issue has several manifestations that should probably be tackled separately:
### Completely unknown words
This specific issue can be tackled by a number of different approaches:
1. Build a vocabulary with a predetermined size $|V|$ in advance, converting any out-of-vocabulary word to a predetermined token, $UNK$ for example.
2. Either choose a small number $n$ and assign any word with a frequency lower than $n$ to the $UNK$ token, or pick the $|V|$ most frequent words and assign the rest to the $UNK$ token.
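Both strategies amount to a lookup against a vocabulary. A minimal sketch of the frequency-threshold variant, where the token spelling `<UNK>`, the function names and the threshold value are illustrative assumptions:

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical spelling of the unknown-word token


def replace_rare_words(tokens, min_count=2):
    """Build the vocabulary from the training tokens and map rare words to UNK."""
    counts = Counter(tokens)
    vocabulary = {word for word, count in counts.items() if count >= min_count}
    return [w if w in vocabulary else UNK for w in tokens], vocabulary


def map_to_vocabulary(tokens, vocabulary):
    """Map every out-of-vocabulary token (e.g. from the test set) to UNK."""
    return [w if w in vocabulary else UNK for w in tokens]


train = "the cat saw the dog and the cat".split()
train_unk, vocab = replace_rare_words(train, min_count=2)
print(train_unk)
print(map_to_vocabulary("the bird saw the cat".split(), vocab))
```

After this step the $UNK$ token is counted like any other word, so its probability is estimated from the training data.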

## Smoothing
### Laplace Smoothing
Well, let's first consider the simple case of a unigram model with $N$ tokens in the corpus, where $c_i$ is the count of the word $w_i$:
$
\begin{align}
P(w_i) = \frac{c_i}{N}
\end{align}
$
From a mathematical perspective, Laplace smoothing adds $1$ to every count (so every $0$ count becomes $1$) and renormalizes so that the probabilities still sum to $1$:
$\begin{align}
P(w_i) = \frac{c_i + 1}{N + V}
\end{align}$
where $V$ is the size of the vocabulary. The same technique can be extended to N-grams in general.  
Laplace smoothing is sometimes referred to as **add-one smoothing**. A natural extension would be **add-k smoothing**:
$
\begin{align} 
P(w_i) = \frac{c_i + k}{N + k\cdot V}
\end{align}
$
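A small sketch of add-k smoothing for unigrams, assuming every training token belongs to the given vocabulary (the function name and the toy data are illustrative):

```python
from collections import Counter


def add_k_unigram(tokens, vocabulary, k=1.0):
    """P(w) = (c_w + k) / (N + k * |V|) for every word w of the vocabulary."""
    counts = Counter(tokens)
    n = len(tokens)
    v = len(vocabulary)
    return {w: (counts[w] + k) / (n + k * v) for w in vocabulary}


tokens = "a a a b".split()
vocabulary = {"a", "b", "c"}

laplace = add_k_unigram(tokens, vocabulary, k=1.0)  # add-one smoothing
print(laplace["c"])  # the unseen word "c" now has a non-zero probability
print(sum(laplace.values()))  # the smoothed probabilities still sum to 1 (up to rounding)
```

With $k = 1$ this reduces to Laplace smoothing. Smaller values of $k$ move less probability mass from the seen words to the unseen ones.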

# Classification With Naive Bayes
Disclaimer: These notes are taken from the 4th chapter of [Stanford Notes](https://web.stanford.edu/~jurafsky/slp3/4.pdf)  
Classification in general, and text classification in particular, are quite popular and important tasks: sentiment analysis, spam detection and even topic (category) classification. The general process is simple:
* take a piece of text
* process it
* extract features out of it
* pass it through a classifier


# Recurrent Neural Networks
Recurrent Neural Networks (RNN) were introduced to handle sequential data such as time series. Assuming the input is a sequence of values in time, the model can use the hidden state computed at the previous time step to predict the current output.   
The introduction of time-variant input required several changes, reflected mathematically as follows:
$$h_t = g(U \cdot h_{t - 1} + W \cdot x_t)$$
$$y_t = f(V\cdot h_t)$$
Assuming the following architecture hyperparameters:
1. output dimensions: $n_o$
2. hidden state dimensions: $n_h$
3. input dimensions: $n_i$

Then the weight matrices have the following shapes:
1. $V$: ($n_o$, $n_h$)
2. $U$: ($n_h$, $n_h$)
3. $W$: ($n_h$, $n_i$)

Taking the bias units into account as well, we can determine the exact number of trainable parameters of an RNN model.

The nature of the input requires an incremental inference mechanism. Nevertheless, with a few mathematical manipulations, any intermediate output or hidden state can be written in terms of the previous input values $x_{i \leq t}$. Such an expression can be quite complex.
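The recurrence and the parameter count can be sketched in plain Python. Several choices here are assumptions made for illustration: $g$ is taken to be $\tanh$, $f$ the identity, the biases are left out of the forward pass, and the count assumes one bias per hidden unit plus one per output unit (libraries may organize their biases differently).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, h0):
    """Run h_t = tanh(U h_{t-1} + W x_t) and y_t = V h_t over a sequence of inputs."""
    h = h0
    outputs = []
    for x in inputs:
        pre_activation = [a + b for a, b in zip(matvec(U, h), matvec(W, x))]
        h = [math.tanh(p) for p in pre_activation]
        outputs.append(matvec(V, h))
    return outputs, h


def count_parameters(n_i, n_h, n_o, bias=True):
    """Number of trainable parameters: U (n_h x n_h), W (n_h x n_i), V (n_o x n_h) and biases."""
    weights = n_h * n_h + n_h * n_i + n_o * n_h
    return weights + (n_h + n_o if bias else 0)


print(count_parameters(n_i=3, n_h=4, n_o=2))  # 16 + 12 + 8 weights, plus 4 + 2 biases

# A tiny run with n_i = n_h = n_o = 1 and hypothetical weights.
outputs, last_hidden = rnn_forward([[1.0], [0.5]], U=[[0.5]], W=[[1.0]], V=[[2.0]], h0=[0.0])
print(outputs)
```

The loop makes the incremental nature of inference explicit: each hidden state is computed from the previous one, so the inputs have to be processed in order.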


## Training an RNN as a Language Model
RNNs are bound neither by the vocabulary and the word occurrences in the corpus, as is the case with statistical models (n-grams), nor by a fixed context length, as is the case with feedforward neural networks. Therefore, RNNs seem like perfect candidates for language modeling. 

Let's write some code please. The code below is taken from the amazing [blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Bidirectional RNNs are an improvement of the original RNNs. The last hidden state is generally associated with much more information from the last part of the sequence. To overcome such an impediment, it is possible to apply the same architecture in the opposite order as well, resulting in 2 outputs at each position in the sequence. These two outputs are concatenated into a single vector, saving much more of the context at each point. 

The hidden layers in the RNN architecture try to tackle 2 tasks at the same time:
1. extracting information useful for the current local part / output of the sequence
2. extracting useful / important information for the future parts of the sequence

Even though RNN models are theoretically Turing complete and capable of solving the complex problems they are expected to solve, they do not achieve the desired performance in practice. A simple reason is the ***vanishing gradient***, meaning that the signal generated from terms far away in the past is lost. The model thus might not be able to capture connections between elements far apart in the sequence.
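The vanishing gradient can be observed numerically on a scalar toy RNN. The sketch below assumes the recurrence $h_t = \tanh(u \cdot h_{t-1} + w \cdot x_t)$ with a single hidden unit; the weights and inputs are hypothetical. By the chain rule, the derivative of the last hidden state with respect to the initial one is a product of one factor $u \cdot (1 - h_t^2)$ per time step.

```python
import math


def gradient_through_time(u, w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(u * h_{t-1} + w * x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(u * h + w * x)
        grad *= u * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
    return abs(grad)


# Hypothetical weights: with |u| < 1 every factor has magnitude below |u|,
# so the gradient shrinks at least geometrically with the sequence length.
for length in (5, 10, 20):
    print(length, gradient_through_time(u=0.5, w=1.0, inputs=[0.1] * length))
```

The influence of the initial state on the final one becomes negligible after a few dozen steps, which is why long-range connections are hard to learn for this kind of model.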


In [None]:
# most of the code is already encapsulated in the built-in classes in PyTorch.