# NLP Lab 2 / Text classification

You may work in pairs. There is no grade bonus or compensation for working alone. 
Please first indicate whom you work with in the cell below:

Your answer here : worked with Matthieu Sousa Ferreira & Baptiste Rozan

In [None]:
! source .venv/bin/activate

# 0. Submission instructions

The due date of this lab is October 13th, 23:59. Late deliveries will be penalized 1pt/day. 

Please visit https://mvproxy.esiee.fr (disregard the security warning, the certificate is self-signed). When visiting for the first time, provide your ESIEE login, but leave the password field empty, and click on "Connexion". You should then receive an email containing your mvproxy password (check your SPAM folder if you don't).

Visit https://mvproxy.esiee.fr a second time, but now fill in both your ESIEE login and mvproxy password. You should be logged in.

Drop a ZIP archive containing :
- This notebook (lab2.ipynb), filled with answers to the questions ;
- and a *local* copy of the text articles or CSV files you are working with, which you should access through relative paths, not absolute ones. 

Please pay attention to the latter point. Code like
```
csv.reader('C:\Users\Yoyodyne\My Documents\AIC-5102B\Lab2 Text Classification/dataset.csv')
```

must be avoided, and replaced by
```
csv.reader('dataset.csv')
```
or
```
csv.reader('./data/class1.csv')
csv.reader('./data/class2.csv')
```

I should be able to run your code on *Linux* after sourcing `~/pynlp/bin/activate`, without modifying the 'C:\Users\Yoyodyne\My Documents\AIC-5102B' path, which I don't have access to, and without changing anything else in your notebook as I run it. You must also stick to NLTK and the packages I include myself in the following code snippets.
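
As a minimal sketch of the relative-path convention, the snippet below reads a CSV file located next to the notebook. The file name `relative_path_demo.csv` and its two rows are purely illustrative assumptions; the file is created and removed by the sketch itself so that it can run on its own.

```python
import csv
from pathlib import Path

# Hypothetical file name, resolved relative to the current working directory
csv_path = Path('relative_path_demo.csv')

# Write two illustrative rows so that the sketch is runnable on its own
with open(csv_path, 'w', newline='', encoding='utf-8') as fh:
    csv.writer(fh).writerows([['text', 'label'], ['hello world', '0']])

# csv.reader expects an open file object (or any iterable of lines), not a path string
with open(csv_path, newline='', encoding='utf-8') as fh:
    rows = list(csv.reader(fh))

is_absolute = csv_path.is_absolute()
print(is_absolute, rows)

# Remove the demo file again
csv_path.unlink()
```

Because the path carries no drive letter or home directory, the same cell should run unchanged on any machine where the data file travels with the notebook.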


## 1. Setup

This lab must be done on Linux Debian 12, under the Python virtual environment [described here](https://perso.esiee.fr/~hilairex/NLP/docker-fc39.html). You may also use the Docker container described in this document if you prefer, but the Python venv is enough.

It is all about text classification, and your first task will consist in selecting a dataset.

Please visit https://www.kaggle.com/datasets?search=text+classification in order to choose a dataset you like (or randomly, if you don't know which one to choose). We will attempt to separate samples from 2 classes only, so you may choose either a dataset that natively has only 2 classes (spam/non-spam email, positive/negative review, etc.) or one that natively has $n > 2$ classes, only two of which will be used (e.g. politics/cooking, or sport/computer articles).

Which dataset and classes did you choose? Please give your answer below with its related URL, and copy the related files to your working directory.


Dataset : SMS Spam collection
classes : spam / not spam
https://www.kaggle.com/datasets/thedevastator/sms-spam-collection-a-more-diverse-dataset


## 2. Text vectorization

The following functions :
- extract the vocabulary from a CSV file assuming the text is located in column number 5
- build the document-term matrix by reading again the same CSV file

Adapt them, so that they fit your dataset and produce a document-term matrix in the end.

Please note that:
- the tokenization method used is wordpunct_tokenize(), which may not be optimal. You may call something different in case you find too much garbage in your resulting vocabulary.
- there are two "if" tests in dtmat_from_csv which may appear unnecessary at first. They are in fact needed, because the test samples may include unseen words, which would produce out-of-bounds or mismatched indices. Unseen words are therefore simply ignored.
- you may also consider lemmatizing. 
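
The role of the two "if" tests can be illustrated on a toy vocabulary. This is only a sketch: the helper name `count_known_tokens` and the sample words are assumptions made for the illustration, not part of the lab code.

```python
import numpy as np

def count_known_tokens(voc, tokens):
    """Count occurrences of each vocabulary word in tokens, ignoring unseen words.

    voc must be a sorted list of unique strings.
    """
    counts = np.zeros(len(voc), dtype=int)
    # Index at which each token would have to be inserted to keep voc sorted
    positions = np.searchsorted(voc, tokens, side='left')
    for token, pos in zip(tokens, positions):
        # Test 1: pos == len(voc) means the token sorts after every known word
        if pos < len(voc):
            # Test 2: the slot may hold a different word, i.e. the token is unseen
            if voc[pos] == token:
                counts[pos] += 1
    return counts

voc = ['apple', 'banana', 'cherry']                  # toy vocabulary (illustrative)
tokens = ['banana', 'zebra', 'avocado', 'banana']    # 'zebra' and 'avocado' are unseen
counts = count_known_tokens(voc, tokens)
print(counts)
```

Here 'zebra' is rejected by the first test (its insertion index equals the vocabulary size), while 'avocado' is rejected by the second (its insertion index points at 'banana').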

In [12]:
import csv
import nltk
import numpy as np
import sys

from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=True, reduce_len=False)

def voc_from_csv(csvfile):
    nlines=0
    voc=[]
    with open(csvfile, errors='ignore') as file:
        reader=csv.reader(file, delimiter=',')
        for row in reader:
            nlines=nlines+1
            voc.extend(tweet_tokenizer.tokenize(row[0]))
    voc=sorted(set(voc))        
    return voc,nlines

def dtmat_from_csv(csvfile):
    voc,rows=voc_from_csv(csvfile)
    cols=len(voc)
    mat=np.zeros((rows,cols))
    d=0
    with open(csvfile,  errors='ignore') as file:
        reader=csv.reader(file, delimiter=',')
        for row in reader:
            # Use the same tokenizer as voc_from_csv so that tokens match the vocabulary
            w = tweet_tokenizer.tokenize(row[0])
            X=np.searchsorted(voc,w,side='left')            
            for i in range(0,len(w)):
                if (X[i] < cols):
                    if (w[i] == voc[X[i]]):
                        mat[d][X[i]]+=1
            d=d+1
    return mat


In [13]:
csv_path = 'sms-spam-collection-a-more-diverse-dataset/train.csv'
vocab = voc_from_csv(csv_path)
dtmat = dtmat_from_csv(csv_path)

### Question 1 

Run dtmat_from_csv on your dataset, or on a sample drawn from it if it is very large. Examine the resulting matrix. How many words occur only once (or twice) in your training set? Give a few lines of Python code below that show this.

In [14]:
# Compute the total frequency of each word in the corpus
total_counts = np.sum(dtmat, axis=0)

# Indices of the words seen only once
once_indices = np.where(total_counts == 1)[0]
# Indices of the words seen twice
twice_indices = np.where(total_counts == 2)[0]

# Show a few examples of words seen only once or twice
# (vocab is the (vocabulary, line count) tuple returned by voc_from_csv)
print("Words seen only once:", [vocab[0][i] for i in once_indices[:100]])
print("Words seen twice:", [vocab[0][i] for i in twice_indices[:100]])

# Total number of words seen only once
print("Number of words seen only once:", len(once_indices))
# Total number of words seen twice
print("Number of words seen twice:", len(twice_indices))

Words seen only once: ['0089', '01223585236', '02072069400', '02085076972', '07008009200', '07046744435', '07090201529', '07090298926', '07099833605', '07732584351', '07753741225', '077xxx', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '078498', '07880867867', '0789xxxxxxx', '07946746291', '0796XXXXXX', '07973788240', '07XXXXXXXXX', '08002988890', '08081263000', '08448350055', '08448714184', '08450542832', '08452810071', '08700469649', '08701213186', '08701237397', '08701752560', '08702490080', '08704050406', '08704439680', '08707500020', '08707808226', '08708800282', '08709501522', '08712103738', '08712400200', '08712400603', '08712402578', '08712402779', '08712402902', '08712402972', '08712404000', '08712466669', '08714342399', '08714712379', '08714712388', '08714712394', '08714712412', '08714714011', '08715203028', '08715203649', '08715203652', '08715203656', '08715203677', '08715203685', '08715203694', '08715205273', '08715500022', '08717111821', 

### your answer here

### Strategy Choice (Question 2)

We adopt **Strategy 2**, the kernelized LDA with the linear kernel $k(x, y)=\langle x, y\rangle$.
The dataset contains 5 574 SMS with ~8.9 k unique tokens, so a dense covariance
in the original vocabulary space would be singular ($d > n$) and expensive to invert.
Working in the sample space through the Gram matrix circumvents that issue while keeping
memory within reach (≈250 MB in float64, about half that in float32), and lets us rely on
iterative eigensolvers provided by `kfda`/`scikit-learn`.
We will therefore stick to strategy 2 for the implementation that follows.
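
The memory figure quoted above can be checked with a few lines of arithmetic, and the linear-kernel Gram matrix can be obtained directly from a sparse document-term matrix. This is only a sketch: the 3×3 matrix is a toy example, not the SMS data.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_samples = 5574   # number of SMS quoted above

# A dense n x n Gram matrix stores n^2 entries
bytes_f64 = n_samples ** 2 * np.dtype(np.float64).itemsize
bytes_f32 = n_samples ** 2 * np.dtype(np.float32).itemsize
print(f"float64 Gram matrix: {bytes_f64 / 1e6:.0f} MB")
print(f"float32 Gram matrix: {bytes_f32 / 1e6:.0f} MB")

# Toy sparse document-term matrix (3 documents, 3 terms), illustrative only
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 1, 0],
                         [1, 1, 0]], dtype=np.float64))

# Linear kernel: entry (i, j) is the dot product of documents i and j
gram = (X @ X.T).toarray()
print(gram)
```

The product `X @ X.T` only ever involves the non-zero counts, so the vocabulary size does not affect the size of the result, which is always samples × samples.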

### Your answers here

In [15]:
### Your answers here

In [None]:
from collections import Counter
from scipy.sparse import csr_matrix

# Helper utilities to get bag-of-words features and labels once for the whole notebook

def load_sms_dataset(csv_path):
    texts, labels = [], []
    with open(csv_path, newline='', encoding='utf-8', errors='ignore') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            texts.append(row['sms'])
            labels.append(int(row['label']))
    return texts, np.array(labels, dtype=np.int32)


def build_sparse_bow(texts, tokenizer=tweet_tokenizer):
    indptr = [0]
    indices, data = [], []
    vocab = {}
    for text in texts:
        token_counts = Counter(tokenizer.tokenize(text))
        for token, count in token_counts.items():
            idx = vocab.setdefault(token, len(vocab))
            indices.append(idx)
            data.append(float(count))
        indptr.append(len(indices))
    matrix = csr_matrix((data, indices, indptr), shape=(len(texts), len(vocab)), dtype=np.float32)
    return matrix, vocab


### Question 5

Implement the strategy you chose at question 2. You may either use the `LinearDiscriminantAnalysis` class from Scikit-Learn (module `sklearn.discriminant_analysis`, formerly `sklearn.lda.LDA`) (strategy 1), the Kfda class from PyPI (strategy 2), or your own implementation (strategy 3). In all cases, you should assume that the data of the two classes are normally distributed after they are projected.
Test your implementation on the classes you chose.
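
The normality assumption on the projected data can be turned into a decision rule as follows. This is a minimal sketch on synthetic 1-D projections; the function names and the sample values are assumptions made for the illustration, and the projection step itself (LDA or K-FDA) is not shown.

```python
import numpy as np

def fit_gaussian_1d(z):
    """Return the mean and variance of 1-D projected samples."""
    return float(np.mean(z)), float(np.var(z))

def log_gaussian(z, mean, var):
    """Log-density of N(mean, var) evaluated at z."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)

def classify_projected(z, params0, params1, prior0=0.5, prior1=0.5):
    """Assign class 1 where its log-prior plus log-likelihood exceeds that of class 0."""
    score0 = np.log(prior0) + log_gaussian(z, *params0)
    score1 = np.log(prior1) + log_gaussian(z, *params1)
    return (score1 > score0).astype(int)

# Synthetic projections (illustrative only, not taken from the SMS dataset)
rng = np.random.default_rng(0)
z0 = rng.normal(loc=-2.0, scale=1.0, size=200)   # projected samples of class 0
z1 = rng.normal(loc=3.0, scale=0.5, size=200)    # projected samples of class 1
params0, params1 = fit_gaussian_1d(z0), fit_gaussian_1d(z1)

predictions = classify_projected(np.array([-2.0, 3.0]), params0, params1)
print(predictions)
```

With unbalanced classes such as spam/non-spam, the priors would typically be replaced by the class frequencies observed in the training set.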

In [3]:
### Your answers here

## 4. Kernelized LDA

The kernelized version of LDA is implemented in the kfda package. Its homepage is here: https://pypi.org/project/kfda/
To install it, simply run `pip3 install kfda` from your Python virtual environment.

### Question 6

Let $\boldsymbol{x}$ and $\boldsymbol{y}$ be any two rows (documents) of your D-T matrix (which you may assume TF-IDF normalized or not; it does not change the problem). Consider the inhomogeneous polynomial kernel 
$$k(\boldsymbol{x},\boldsymbol{y})= (1+\langle\boldsymbol{x},\boldsymbol{y}\rangle)^d$$
where $d>0$ is an integer.

- Suppose that $d=2$, and that the above kernel is used in a kernelized LDA. What are the new axes created in the feature space, that didn't exist when $d=1$? Which of these could be useful, and change the solution computed by LDA in the feature space ?
- If the input data consists of $n$ samples, what is the time complexity of K-FDA ? 
- Try to classify using this setup, and report your results. Then increase $d$ from 2 to 4. You will very likely encounter some overflow issues. If this happens, explain what goes wrong, and add some code to preprocess the data to circumvent it.
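
As a starting point for the first and last items, the sketch below checks numerically that the degree-2 kernel equals a dot product in an explicit feature space, and shows one possible preprocessing (L2 normalization of the rows) that bounds the kernel values. The helper names and toy vectors are assumptions made for the illustration; whether large dot products are the actual cause of the overflow on your data remains to be verified.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Inhomogeneous polynomial kernel (1 + <x, y>)^d."""
    return (1.0 + np.dot(x, y)) ** d

def phi_degree2(x):
    """Explicit feature map for d = 2: a constant, the original axes
    scaled by sqrt(2), and all pairwise products x_i * x_j."""
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))

x = np.array([1.0, 0.0, 2.0])   # toy term counts (illustrative)
y = np.array([3.0, 1.0, 1.0])
k_direct = poly_kernel(x, y, d=2)
k_mapped = np.dot(phi_degree2(x), phi_degree2(y))
print(k_direct, k_mapped)

def l2_normalize_rows(X):
    """Scale every non-zero row to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)

# With non-negative unit-norm rows, 0 <= <x, y> <= 1 (Cauchy-Schwarz),
# so every kernel value lies between 1 and 2**d
X = l2_normalize_rows(np.array([[50.0, 0.0, 120.0],
                                [30.0, 10.0, 10.0]]))
K4 = (1.0 + X @ X.T) ** 4
print(K4)
```

The explicit map is only practical for tiny vocabularies, since the number of pairwise products grows quadratically with the vocabulary size; the kernel evaluates the same dot product without ever building it.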

In [5]:
### Your answer here

### Question 7

We will now slightly improve the above kernel by replacing the natural dot product 
$$\langle\boldsymbol{x},\boldsymbol{y}\rangle$$ 
by 
$$ \sum_i \min(\boldsymbol{x}_i, \boldsymbol{y}_i) $$
resulting in

$$f(\boldsymbol{x},\boldsymbol{y})= (1+\sum_i \min(\boldsymbol{x}_i, \boldsymbol{y}_i) )^n$$

Is $f$ a positive semidefinite kernel ? Either prove that it is, or give a counter-example.
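
A numerical experiment cannot replace the proof or counter-example requested above, but it can hint at which one to look for. The sketch below builds the Gram matrix of $f$ on synthetic non-negative counts (an assumption; the data is random, not the SMS corpus) and inspects its smallest eigenvalue. A clearly negative value, beyond rounding error, would provide a counter-example.

```python
import numpy as np

def gram_matrix(X, n=2):
    """Gram matrix of f(x, y) = (1 + sum_i min(x_i, y_i))^n over the rows of X."""
    # Broadcasting builds all pairwise element-wise minima at once
    pairwise_min = np.minimum(X[:, None, :], X[None, :, :]).sum(axis=2)
    return (1.0 + pairwise_min) ** n

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(20, 30)).astype(float)   # synthetic term counts
gram = gram_matrix(X, n=2)

# The Gram matrix is symmetric, so eigvalsh applies
eigenvalues = np.linalg.eigvalsh(gram)
print("smallest eigenvalue:", eigenvalues.min())
```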

### Question 8

Irrespective of your answer to question 7, try kfda with $f$ as its kernel. Looking at the source code https://github.com/concavegit/kfda/blob/master/kfda/kfda.py you will notice (line 92) that it relies on the pairwise_kernels function from sklearn to compute the Gram matrix. 

According to sklearn documentation https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_kernels.html the kernel parameter can be a callable, hence you can supply a function of your own as the kernel argument, possibly using the keywords field (kwds).
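
A minimal sketch of this mechanism, calling `pairwise_kernels` directly on a toy dense matrix: the function name `min_poly_kernel` and the sample values are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

def min_poly_kernel(x, y, n=2):
    """f(x, y) = (1 + sum_i min(x_i, y_i))^n for two dense 1-D rows."""
    return (1.0 + np.minimum(x, y).sum()) ** n

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 2.0, 0.0]])   # toy dense document-term matrix

# The callable is applied to every pair of rows; extra keyword
# arguments (here n) are forwarded to it
gram = pairwise_kernels(X, metric=min_poly_kernel, n=3)
print(gram)
```

Assuming Kfda forwards its kernel argument and extra keywords to `pairwise_kernels`, as the source code linked above suggests, the same callable could be supplied there. Note that a Python callable is evaluated pair by pair, which may be slow on several thousand samples.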

Report your classification results, possibly varying the exponent $n$ (keep its values reasonable: a high $n$ may cause floating-point exceptions, as in Question 6). You should obtain decent (say, around 75% accuracy) but not outstanding results. This, however, depends heavily on the dataset and classes you chose.

In [None]:
### Your answer here

### Question 9

One reason why the obtained accuracy is not fantastic is that the vector model we are using is blind to bigrams. For instance, we may encounter (normalized) words "donald", and "trump" separately in a document, but this is very different from "donald trump".

One way to fix this is to include bigrams in the vocabulary : for two consecutive words, like "donald trump", we would add a synthetic word "donald_trump" to the vocabulary. 

Add an extra "bigram" parameter to voc_from_csv() to do so, and compare your results to those of question 8. Bigrams can be generated very simply using code similar to this:

In [4]:
w=['I','think','traveling','to','Rio','next','winter','would','be','great']
[w[i]+'_'+w[i+1] for i in range(0,len(w)-1)]

['I_think',
 'think_traveling',
 'traveling_to',
 'to_Rio',
 'Rio_next',
 'next_winter',
 'winter_would',
 'would_be',
 'be_great']
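
One way to plug this into the vocabulary construction is a small helper applied to each tokenized document. This is only a sketch: the name `add_bigrams` and the sample tokens are assumptions, and the same helper would have to be applied when building the document-term matrix so that tokens and vocabulary stay consistent.

```python
def add_bigrams(tokens, bigram=True):
    """Return the tokens, followed by their 'w1_w2' bigrams when bigram is True."""
    if not bigram:
        return list(tokens)
    bigrams = [tokens[i] + '_' + tokens[i + 1] for i in range(len(tokens) - 1)]
    return list(tokens) + bigrams

tokens = ['donald', 'trump', 'speaks']
with_bigrams = add_bigrams(tokens)
print(with_bigrams)
# ['donald', 'trump', 'speaks', 'donald_trump', 'trump_speaks']
```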