## 1. The BLEU score – Evaluating the MT systems

### 1.1 What is BLEU?

- **BLEU** stands for **Bilingual Evaluation Understudy** and is a way of automatically evaluating MT (machine translation) systems.

- *BLEU score measures the quality/accuracy of translations produced by the model. BLEU score looks at `n`-grams of tokens produced by the decoder to measure how "close" the predicted translation is to the actual ground truth sequence.*

- *The BLEU score is based on the fraction of `n`-grams (for example, unigrams and bigrams) in the candidate translation that also appear in the reference translation. So the higher the BLEU score, the better the MT system.*

Reference: [TensorFlow GitHub - Code Implementation](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py)

* **

Let's consider an example to see how the BLEU score is calculated.

- Say we have a **candidate sentence** (i.e., a sentence predicted by our MT system) and a **reference sentence** (i.e., the corresponding actual translation) for some given source sentence:
    - *Reference_1* $\rightarrow \textit{The cat sat on the mat} \leftarrow$ *Actual Translation*
    
    - *Candidate_1* $\rightarrow \textit{The cat is on the mat} \leftarrow$ *Predicted Translation*

* ** 

### 1.2 *Why precision fails in MT*

- To see how good the translation is, we can use one measure, **precision**. 

    - *Precision is a measure of how many words in the candidate are actually present in the reference.* 
    
    - In general, if you consider a classification problem with two classes (denoted by negative and positive), precision is given by the following formula: $$Precision = \frac{\text{No. of samples correctly classified as +ve}}{\text{all samples classified as +ve}}$$
    
    - Let’s now calculate the precision for candidate 1: $$Precision = \frac{\text{No. of candidate words that appear in the reference}}{\text{No. of words in the candidate}}$$
    
    - Mathematically, $$Precision = \frac{\sum_{\text{unigram} \in \text{Candidate}}\text{Is Found In Ref}(unigram)}{|\text{Candidate}|}$$ <br></br>$$\textit{Precision for candidate 1} = \frac{5}{6}$$
    
    - *This is also known as 1-gram precision, since we consider a single word at a time.*
    
Now let’s introduce a new candidate:

- *Candidate_2* $\rightarrow \textit{The the the cat cat cat}$

- *It is not hard for a human to see that candidate 1 is far better than candidate 2. Let’s calculate the precision, noting that every word of candidate 2 appears somewhere in the reference:* $$\textit{Precision for candidate 2} = \frac{6}{6} = 1$$

>🗝️ **As we can see, the precision score disagrees with the judgment we made. Therefore, precision alone cannot be trusted to be a good measure of the quality of a translation.**
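
The calculation above can be reproduced with a minimal sketch. The function name `unigram_precision` is hypothetical, and the sketch assumes lowercased, whitespace-split tokens (so that "The" matches "the"); real evaluation scripts may tokenize differently.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that occur anywhere in the reference."""
    # Assumption: lowercase + whitespace tokenization.
    cand_tokens = candidate.lower().split()
    ref_vocab = set(reference.lower().split())
    hits = sum(1 for tok in cand_tokens if tok in ref_vocab)
    return hits / len(cand_tokens)


reference = "The cat sat on the mat"
candidate_1 = "The cat is on the mat"
candidate_2 = "The the the cat cat cat"

p_cand_1 = unigram_precision(candidate_1, reference)  # 5 of 6 tokens found
p_cand_2 = unigram_precision(candidate_2, reference)  # 6 of 6 tokens found
print(round(p_cand_1, 3), p_cand_2)  # 0.833 1.0
```

The meaningless candidate 2 gets the higher score, which is exactly the failure described above.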

* **

### 1.3 *Modified Precision* 

To address the precision limitation, we can use a **modified 1-gram precision**. 

- ***The modified precision clips the number of occurrences of each unique word in the candidate by the number of times that word appeared in the reference:*** 

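$$p_1 = \frac{\sum_{\text{unigram} \in \text{unique}(\text{Candidate})}\min(\text{Occurrences}(unigram), {unigram}_{max})}{|\text{Candidate}|}$$ 

*Here, $\text{Occurrences}(unigram)$ is the number of times the word appears in the candidate, and ${unigram}_{max}$ is the number of times it appears in the reference.*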


- Therefore, for candidates 1 and 2, the modified precision would be as follows: 

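$$\textit{Mod-1-gram-Precision Candidate 1} = \frac{(2+1+0+1+1)}{6} = \frac{4+1}{6} = \frac{5}{6}$$

$$\textit{Mod-1-gram-Precision Candidate 2} = \frac{(2+1)}{6} = \frac{3}{6}$$

*The terms are the clipped counts of the unique words: (the, cat, is, on, mat) for candidate 1 and (the, cat) for candidate 2. The word "the" appears twice in the reference, so it contributes at most 2, and "is" does not appear in the reference, so it contributes 0. Matching is treated as case-insensitive here.*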


*We can already see that this is a good modification as the precision of candidate 2 is reduced. This can be extended to any n-gram by considering n words at a time instead of a single word.*
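
As a rough illustration of the clipping step and its extension to $n$-grams, here is a small sketch. The helper names `ngrams` and `modified_precision` are hypothetical, and lowercased whitespace tokenization with a single reference is assumed.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision against a single reference."""
    cand_grams = ngrams(candidate.lower().split(), n)
    cand_counts = Counter(cand_grams)
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / len(cand_grams) if cand_grams else 0.0


reference = "The cat sat on the mat"
candidate_1 = "The cat is on the mat"
candidate_2 = "The the the cat cat cat"

mp1_unigram = modified_precision(candidate_1, reference, n=1)  # 5/6
mp2_unigram = modified_precision(candidate_2, reference, n=1)  # 3/6
mp1_bigram = modified_precision(candidate_1, reference, n=2)   # 3/5
mp2_bigram = modified_precision(candidate_2, reference, n=2)   # 1/5
print(round(mp1_unigram, 3), mp2_unigram, mp1_bigram, mp2_bigram)  # 0.833 0.5 0.6 0.2
```

With bigrams the gap between the two candidates widens further, since only "the cat" from candidate 2 occurs in the reference.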

* **

### 1.4 *Brevity penalty*

- Precision naturally favors short sentences: a candidate with only a few words, all of which appear in the reference, scores highly.

- This is a problem for evaluation, because the MT system might generate short sentences for longer references and still achieve high precision. Therefore, a brevity penalty is introduced to avoid this. 

- The brevity penalty is calculated by the following: 

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{1 - \frac{r}{c}} & \text{if } c \leq r
\end{cases}
$$

*Here, $c$ is the candidate sentence length and $r$ is the reference sentence length.*

In our example, we calculate it as shown here:
- $\textit{BP for candidate-1} = e^{1 - \frac{6}{6}} = e^0 = 1$
- $\textit{BP for candidate-2} = e^{1 - \frac{6}{6}} = e^0 = 1$
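
The piecewise definition above translates almost directly into code. This is only a sketch: the function name `brevity_penalty` is hypothetical, and the four-word candidate length in the last call is an invented value used to show a case where the penalty actually applies.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


bp_same_length = brevity_penalty(6, 6)  # both candidates in the example: e^0 = 1
bp_short = brevity_penalty(4, 6)        # hypothetical 4-word candidate: e^(-0.5)
print(bp_same_length, round(bp_short, 3))  # 1.0 0.607
```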

### 1.5 *The final BLEU Score*

- Next, to calculate the BLEU score, we first calculate the modified $n$-gram precisions for $n=1,2, \ldots ,N$. We then take the weighted geometric mean of these $n$-gram precisions and multiply it by the brevity penalty:

$$BLEU = BP \times \exp\left(\sum_{n=1}^{N}w_n \log p_n\right)$$

- Here, $w_n$ is the weight for the modified $n$-gram precision $p_n$. 

- By default, equal weights ($w_n = 1/N$) are used for all values of $n$. 

In conclusion, BLEU calculates modified $n$-gram precisions and scales their geometric mean by a brevity penalty. The modified $n$-gram precision avoids the high precision values that plain precision gives to meaningless sentences (for example, candidate 2).
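
Putting the pieces together, a sentence-level BLEU could be sketched as follows. This is not the linked TensorFlow implementation; `sentence_bleu` and `ngram_counts` are hypothetical names, and the sketch assumes a single reference, equal weights $w_n = 1/N$, lowercased whitespace tokens, and a score of 0 whenever any $p_n$ is 0 (so that $\log 0$ is never evaluated).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of all contiguous n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()

    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # assumption: no smoothing, any zero precision gives 0
        log_sum += (1.0 / max_n) * math.log(clipped / total)  # w_n * log p_n

    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


reference = "The cat sat on the mat"
bleu_1 = sentence_bleu("The cat is on the mat", reference, max_n=2)
bleu_2 = sentence_bleu("The the the cat cat cat", reference, max_n=2)
print(round(bleu_1, 3), round(bleu_2, 3))  # 0.707 0.316
```

With `max_n=4` candidate 1 would score 0 in this sketch, because none of its 4-grams occur in the reference. Short sentences often hit this case, which is why practical implementations commonly add smoothing or compute BLEU over a whole corpus.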

## 2. Other applications of Seq2Seq models – ChatBots

- One other popular application of sequence-to-sequence models is in creating chatbots. *A chatbot is a computer program that is able to have a realistic conversation with a human.*

### 2.1 Training a ChatBot

- Training a chatbot is similar to training an NMT (neural machine translation) model. The only difference is how the source and target sentence pairs are formed.

- In the NMT system, the sentence pairs consist of a source sentence and the corresponding translation in a target language for that sentence. 

- *However, in training a chatbot, the data is extracted from the dialogue between two people. The source sentences would be the sentences/phrases uttered by person A, and the target sentences would be the replies to person A made by person B.* 

- Datasets that can be used for training a chatbot:

    - [Movie dialogues between people](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
    - [Reddit comments dataset](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)
    - [Maluuba dialogue dataset](https://datasets.maluuba.com/Frames)
    - [Ubuntu dialogue corpus](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/)
    - [NIPS conversational intelligence challenge](http://convai.io/)
    - [Microsoft Research social media text corpus](https://tinyurl.com/y7ha9rc5)
    
    
The figure below shows the similarity of a chatbot system to an NMT system. *For example, we train a chatbot with a dataset consisting of dialogues between two people. The encoder takes in the sentences/phrases spoken by one person, while the decoder is trained to predict the other person’s response. After training in such a way, we can use the chatbot to provide a response to a given question:*

<div align='center'>
    <img src='images/chatbots.png'/>
</div>
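
To make the pairing of source and target sentences concrete, here is a small sketch of how a dialogue might be turned into training pairs. The function name `make_training_pairs`, the speaker labels, and the toy dialogue are all invented for illustration; real datasets such as those listed above come in their own formats and need their own parsing.

```python
def make_training_pairs(dialogue, source_speaker="A", target_speaker="B"):
    """Pair each utterance by the source speaker with the reply that follows it."""
    pairs = []
    for (spk_1, text_1), (spk_2, text_2) in zip(dialogue, dialogue[1:]):
        if spk_1 == source_speaker and spk_2 == target_speaker:
            pairs.append((text_1, text_2))
    return pairs


# Toy dialogue (made up for this example): a list of (speaker, utterance) turns.
dialogue = [
    ("A", "Hi, how are you?"),
    ("B", "I'm fine, thanks. And you?"),
    ("A", "Doing well. Any plans today?"),
    ("B", "Just going for a walk."),
]

pairs = make_training_pairs(dialogue)
for source, target in pairs:
    print(f"encoder input: {source!r} -> decoder target: {target!r}")
```

Each `(source, target)` pair then plays the same role as a (source sentence, translation) pair in the NMT setting.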

### 2.2 Evaluating chatbots – the Turing Test

- After building a chatbot, one way to evaluate its effectiveness is using the **Turing test**. 

- **The Turing test** was proposed by Alan Turing in 1950 as a way of measuring the intelligence of a machine.

- The experimental setup is well suited for evaluating chatbots. The experiment is set up as follows:

    - There are three parties involved: 
        - **an evaluator** (i.e., a human) (A), 
        - **another human** (B), and 
        - **a machine** (C)<br></br> 

    - The three of them sit in three different rooms so that none of them can see the others. 

    - The only communication medium is text, which is typed into a computer by one party, and the receiver sees the text on a computer on their side. 
    
    - The evaluator communicates with both the human and the machine. 

    - At the end of the conversation, the evaluator must decide which of the two parties is the machine. 

    - If the evaluator cannot make the distinction, the machine is said to have passed the Turing test. 

The setup is illustrated below:
<div align='center'>
    <img src='images/turing_test.png'/>
</div>

* **