<a href="https://colab.research.google.com/github/christopherdiamana/nlp/blob/main/Previous/Copy_of_NLP_projet_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Natural Language Processing 1 Lab03**

```
dorian: stemming + bag of words "BOW counter or list dict"
        next session: stopwords + binary naive Bayes

christopher: choice of a map with tokens and occurrence counts
             next session: look into logistic regression

thibaut: getting familiar with the dataset and the course + start of the naive Bayes model
```

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
import numpy as np
!pip install datasets
from datasets import load_dataset
from collections import Counter
!python -m spacy download en_core_web_sm
import spacy

In [None]:
nltk.download('punkt')

In [None]:
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")
lemmatizer = spacy.load("en_core_web_sm")

Here we set up the data preprocessing tools: a stemmer and a lemmatizer.
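
The `re_word` pattern defined above keeps only tokens made entirely of word characters, which is how the stemming and lemmatization variants drop punctuation. A minimal sketch of that filter on a hand-written token list (the list is an illustrative assumption, not actual NLTK or spaCy output):

```python
import re

re_word = re.compile(r"^\w+$")

# Hypothetical tokens, written by hand for illustration.
tokens = ["great", "movie", "!", "isn't", "10", "...", "far_fetched"]

# Keep tokens that consist only of letters, digits or underscores.
kept = [tok for tok in tokens if re_word.match(tok)]
print(kept)
```

Note that digits and underscores count as word characters, while a token containing an apostrophe is dropped.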

In [None]:
def tokenize(text):
  tokens = word_tokenize(text.lower())
  return tokens

In [None]:
def stemming_tokenize(text):
  stemmed = [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
  return stemmed

In [None]:
def lemmatization_tokenize(text):
  lemmas = [token.lemma_ for token in lemmatizer(text.lower()) if re_word.match(token.text)]
  return lemmas

In [None]:
imdb_hugging_train = load_dataset('imdb', ignore_verifications=True, split='train')
imdb_hugging_test = load_dataset('imdb', ignore_verifications=True, split='test')

In [None]:
imdb_hugging_train['text'][:5]

['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!',
 'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything fro

In [None]:
imdb_hugging_train['label'][:5]

[1, 1, 1, 1, 1]

In [None]:
def text_to_token(data):
  toks = []
  for line in data['text']:
    toks.append(tokenize(line))
  return toks

def text_to_stem_token(data):
  toks = []
  for line in data['text']:
    toks.append(stemming_tokenize(line))
  return toks

def text_to_lem_token(data):
  toks = []
  for line in data['text']:
    toks.append(lemmatization_tokenize(line))
  return toks

def get_labels(data):
  return data['label']

In [None]:
y_train = get_labels(imdb_hugging_train[:5])
tok_by_reviews = text_to_token(imdb_hugging_train)
tok_by_lem_reviews = text_to_lem_token(imdb_hugging_train)
tok_by_stem_reviews = text_to_stem_token(imdb_hugging_train)

Here we generated three lists of token lists, built by tokenizing the reviews with or without preprocessing.

In [None]:
y_train[0], tok_by_reviews[0]

In [None]:
sum([tok_by_reviews[0].count(x) for x in ['i', 'a']])

In [None]:
tok_counts = Counter(tok_by_reviews[0])
tok_counts

In [None]:
def full_vocab_histo(docs_post_treatment): ## builds a histogram that counts the occurrences of each word
  vocab = Counter()
  for doc in docs_post_treatment:
    vocab = vocab + (Counter(doc))
  return vocab

def full_vocab_no_histo(docs_post_treatment): ## faster variant: a set that only records whether each word is present
  vocab = set()
  for doc in docs_post_treatment:
    vocab = vocab.union(set(doc))
  return vocab

#full_vocab_histo(tok_by_reviews)
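
Building the histogram with `vocab = vocab + Counter(doc)` creates a new `Counter` on every iteration. A possible alternative, sketched here on a made-up toy corpus, is to update a single `Counter` in place with `Counter.update`, which yields the same counts:

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = [["a", "good", "film"], ["a", "bad", "film"], ["good", "good"]]

vocab_counts = Counter()
for doc in docs:
    vocab_counts.update(doc)  # adds the token counts of doc in place

# Presence-only vocabulary, built in one call.
vocab_set = set().union(*docs)

print(vocab_counts)
print(sorted(vocab_set))
```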

In [None]:
def train_bayes(doc, labels, classes): 
  '''
    doc: array of reviews split into tokens
    labels: array of labels where labels[i] is the ith review's label
    classes: array containing all unique label names 
  '''

  log_likelihood = {}
  logprior = {}

  ndoc = len(doc)
  vocab = full_vocab_no_histo(doc)
  len_vocab = len(vocab)

  for class_name in classes:
    nclass = labels.count(class_name)
    logprior[class_name] = np.log(nclass / ndoc)
    bigdoc = [doc[i] for i in range(ndoc) if labels[i] == class_name]

    #Necessary operations to reduce computation
    class_vocab = full_vocab_histo(bigdoc)
    full_count_class_vocab = sum(class_vocab.values())

    for word in vocab:
      count_word = class_vocab[word]
      log_likelihood[(word, class_name)] = np.log((count_word + 1) / (full_count_class_vocab + len_vocab))  # add-one smoothing: total class count + |V|

  return logprior, log_likelihood, vocab
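
The add-one (Laplace) smoothed estimate for multinomial naive Bayes is usually written P(w | c) = (count(w, c) + 1) / (Σ count(w', c) + |V|). The sketch below applies it to a made-up two-document corpus; `toy_docs`, `toy_labels` and `smoothed_log_likelihood` are illustrative names, not part of the lab code:

```python
import math
from collections import Counter

toy_docs = [["good", "good", "film"], ["bad", "film"]]
toy_labels = [1, 0]

vocab = set().union(*toy_docs)  # {"good", "film", "bad"}

def smoothed_log_likelihood(word, class_name):
    # Count every token in the documents of the given class.
    class_counts = Counter()
    for doc, label in zip(toy_docs, toy_labels):
        if label == class_name:
            class_counts.update(doc)
    total = sum(class_counts.values())
    return math.log((class_counts[word] + 1) / (total + len(vocab)))

# P(good | 1) = (2 + 1) / (3 + 3)
print(math.exp(smoothed_log_likelihood("good", 1)))
# P(good | 0) = (0 + 1) / (2 + 3)
print(math.exp(smoothed_log_likelihood("good", 0)))
```

With this denominator, the smoothed probabilities of all vocabulary words sum to 1 within each class.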

In [None]:
def train_bayes_naifs_binaires(doc, labels, classes):
  '''
    doc: array of reviews split into tokens
    labels: array of labels where labels[i] is the ith review's label
    classes: array containing all unique label names (here [neg (or 0), pos (or 1)])
  '''

  log_likelihood = {}
  logprior = {}

  ndoc = len(doc)
  vocab = full_vocab_no_histo(doc)
  len_vocab = len(vocab)
  for class_name in classes:
    nclass = labels.count(class_name)
    logprior[class_name] = np.log(nclass / ndoc)
    bigdoc = [doc[i] for i in range(ndoc) if labels[i] == class_name]

    #Necessary operations to reduce computation
    class_vocab = full_vocab_no_histo(bigdoc)

    for word in vocab:
      log_likelihood[(word, class_name)] = int(word in class_vocab)  # 1 if word exists in class_vocab, else 0 (a presence flag, not a log-probability)

  return logprior, log_likelihood, vocab
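
The function above stores a 0/1 presence flag per (word, class) pair. For comparison, the binarized multinomial naive Bayes found in common textbooks instead removes duplicate tokens inside each document and then applies add-one smoothing to the clipped counts. A minimal sketch under that assumption, on a made-up corpus (`train_binary_nb_sketch` is a hypothetical name):

```python
import math
from collections import Counter

def train_binary_nb_sketch(docs, labels, classes):
    # Vocabulary over all training documents.
    vocab = set().union(*docs)
    logprior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, label in zip(docs, labels) if label == c]
        logprior[c] = math.log(len(class_docs) / len(docs))
        # Each word counts at most once per document.
        counts = Counter()
        for doc in class_docs:
            counts.update(set(doc))
        total = sum(counts.values())
        for word in vocab:
            log_likelihood[(word, c)] = math.log((counts[word] + 1) / (total + len(vocab)))
    return logprior, log_likelihood, vocab

docs = [["good", "good", "good", "film"], ["bad", "film", "film"]]
logprior, loglik, vocab = train_binary_nb_sketch(docs, [1, 0], [0, 1])

# "good" appears three times but is counted once: (1 + 1) / (2 + 3)
print(math.exp(loglik[("good", 1)]))
```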

In [None]:
def test_naive_bayes(testdoc, logprior, log_likelihood, classes, V, pretreatment=tokenize):
    '''
      testdoc: string
      logprior: dict where logprior[c] is the log of the share of training documents labelled c
      log_likelihood: dict where log_likelihood[(word, c)] is the value stored for word given class c
      classes: array containing all unique label names (here [neg (or 0), pos (or 1)])
      V: vocabulary containing the words seen during training
      pretreatment: tokenization function matching the one used to train the chosen model
    '''
    token_list = pretreatment(testdoc)
    scores = [0 for k in range(len(classes))]  # named scores to avoid shadowing the built-in sum
    for i in range(len(classes)):
      scores[i] = logprior[classes[i]]
      for word in token_list:
        if word in V:
          scores[i] += log_likelihood[(word, classes[i])]
    max_indice = scores.index(max(scores))
    return classes[max_indice]
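
The class scores are accumulated as sums of logarithms rather than as products of probabilities. The short sketch below, with an arbitrary per-word probability of 0.01 chosen for illustration, shows why: multiplying many small probabilities underflows to zero in floating point, while the sum of their logs stays usable for comparing classes.

```python
import math

word_probability = 0.01   # arbitrary illustrative value
n_words = 1000

# Product of probabilities: underflows to 0.0.
product = 1.0
for _ in range(n_words):
    product *= word_probability

# Sum of log-probabilities: a large negative but finite number.
log_sum = sum(math.log(word_probability) for _ in range(n_words))

print(product)
print(log_sum)
```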

### Training the Bayes models
In this section, we train our models on the data under different preprocessing formats.
  * without pretreatment

In [None]:
logprior_class, log_likelihood_class, vocab_class = train_bayes(tok_by_reviews, imdb_hugging_train['label'], [0, 1])

In [None]:
logprior_binary_class, log_likelihood_binary_class, vocab_binary_class = train_bayes_naifs_binaires(tok_by_reviews, imdb_hugging_train['label'], [0, 1])

* with stemming


In [None]:
logprior_stem, log_likelihood_stem, vocab_stem = train_bayes(tok_by_stem_reviews, imdb_hugging_train['label'], [0, 1])

In [None]:
logprior_binary_stem, log_likelihood_binary_stem, vocab_binary_stem = train_bayes_naifs_binaires(tok_by_stem_reviews, imdb_hugging_train['label'], [0, 1])

* with lemmatization

In [None]:
logprior_lem, log_likelihood_lem, vocab_lem = train_bayes(tok_by_lem_reviews, imdb_hugging_train['label'], [0, 1])

In [None]:
logprior_binary_lem, log_likelihood_binary_lem, vocab_binary_lem = train_bayes_naifs_binaires(tok_by_lem_reviews, imdb_hugging_train['label'], [0, 1])

Here we trained 6 different models, which differ either in the algorithm used or in the preprocessing applied to the training set.
Each model is defined by 3 variables: the logprior, the log-likelihood, and the vocab.
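
Since each model is fully described by these three values plus the tokenizer it was trained with, one way to keep them together is a small record type. The sketch below uses a `NamedTuple` with placeholder contents; `NaiveBayesModel` and its field names are hypothetical, not part of the lab code:

```python
from typing import Callable, Dict, List, NamedTuple, Set, Tuple

class NaiveBayesModel(NamedTuple):
    logprior: Dict[int, float]
    log_likelihood: Dict[Tuple[str, int], float]
    vocab: Set[str]
    tokenizer: Callable[[str], List[str]]

# Placeholder values standing in for the output of a training function.
model = NaiveBayesModel(
    logprior={0: -0.69, 1: -0.69},
    log_likelihood={("good", 1): -1.0, ("good", 0): -2.0},
    vocab={"good"},
    tokenizer=str.split,
)

# Other variants (stemming, lemmatization) could be added under their own keys.
models = {"raw": model}
for name, m in models.items():
    print(name, sorted(m.vocab), m.tokenizer("a good film"))
```

Grouping the models this way would let the evaluation loops iterate over one dictionary instead of repeating each call by hand.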


### Results


Here we first shuffle the order of the test indices, which gives a random list of reviews on which to measure accuracy.

In [None]:
import random
list_random = [k for k in range(len(imdb_hugging_test['text']))]
random.shuffle(list_random)
test_list = list_random[:]
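
`random.shuffle` gives a different order on every run, so the example index printed further down will not be the same twice. If reproducibility matters, one option is to shuffle with a dedicated, seeded generator; the seed value `0` and the list size `10` are arbitrary choices for illustration:

```python
import random

indices = list(range(10))   # stand-in for the test set indices
rng = random.Random(0)      # seeded generator, independent of the global one
rng.shuffle(indices)

# Re-creating the generator with the same seed reproduces the same order.
indices_again = list(range(10))
random.Random(0).shuffle(indices_again)

print(indices == indices_again)
```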

Here we check, on one random example, whether the 3 models give the same answer as the correct label.

In [None]:
print(test_list[0])
print(imdb_hugging_test['text'][test_list[0]])
print(imdb_hugging_test['label'][test_list[0]])
print(test_naive_bayes(imdb_hugging_test['text'][test_list[0]], logprior_class, log_likelihood_class, [0, 1], vocab_class))
print(test_naive_bayes(imdb_hugging_test['text'][test_list[0]], logprior_stem, log_likelihood_stem, [0, 1], vocab_stem, pretreatment=stemming_tokenize))
test_naive_bayes(imdb_hugging_test['text'][test_list[0]], logprior_lem, log_likelihood_lem, [0, 1], vocab_lem, pretreatment=lemmatization_tokenize)

15739
This is probably my least favorite episode. I lived in Cape Girardeau for quite some time. I can tell you there is no ocean or shrimp boats, fresh crab or scallops anywhere near Missouri. Cape Girardeau is the only inland Cape, it's on the Mississippi River. It looked like the license plates were from Mississippi, which may explain why there was so much racial tension. Missouri and Mississippi are 2 completely different states that don't touch one another. There are many roads in and out of town and none of them are Route 6 or Route 666. This whole inaccuracy was very distracting. Also, Cassie did not seem like someone who would want to hang around Dean if she was well educated. I did not buy them as a couple and didn't enjoy the lengthy love scene. Jo was more Dean's style.
0
0
0


0

In [None]:
precision = 0
precision_stem = 0
precision_lem = 0
for i in test_list:
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_class, log_likelihood_class, [0, 1], vocab_class):
    precision += 1/len(test_list)
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_stem, log_likelihood_stem, [0, 1], vocab_stem, pretreatment=stemming_tokenize):
    precision_stem += 1/len(test_list)
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_lem, log_likelihood_lem, [0, 1], vocab_lem, pretreatment=lemmatization_tokenize):
    precision_lem += 1/len(test_list)
print(precision)
print(precision_stem)
print(precision_lem)

0.8095200000002465
0.8005600000002375
0.8104000000002474


The count-based model reaches an accuracy of at least 80% in all three configurations, and the highest accuracy (81.0%) is obtained with lemmatization.
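
The trailing digits in values such as `0.8095200000002465` come from adding `1/len(test_list)` thousands of times in floating point. A sketch of an alternative that counts correct predictions as integers and divides once; `accuracy` and the toy `predict` function are hypothetical names used only for illustration:

```python
def accuracy(texts, labels, predict):
    # Count exact matches, then divide a single time.
    correct = sum(1 for text, label in zip(texts, labels) if predict(text) == label)
    return correct / len(labels)

# Toy classifier: positive if the review contains "good".
def predict(text):
    return 1 if "good" in text else 0

texts = ["good film", "bad film", "not good", "awful"]
labels = [1, 0, 0, 0]

# Three of the four toy reviews are classified correctly.
print(accuracy(texts, labels, predict))
```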

In [None]:
precision_bin = 0
precision_stem_bin = 0
precision_lem_bin = 0
for i in test_list:
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_binary_class, log_likelihood_binary_class, [0, 1], vocab_binary_class):
    precision_bin += 1/len(test_list)
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_binary_stem, log_likelihood_binary_stem, [0, 1], vocab_binary_stem, pretreatment=stemming_tokenize):
    precision_stem_bin += 1/len(test_list)
  if imdb_hugging_test['label'][i] == test_naive_bayes(imdb_hugging_test['text'][i], logprior_binary_lem, log_likelihood_binary_lem, [0, 1], vocab_binary_lem, pretreatment=lemmatization_tokenize):
    precision_lem_bin += 1/len(test_list)
print(precision_bin)
print(precision_stem_bin)
print(precision_lem_bin)

0.5658800000000028
0.5393199999999763
0.547079999999984


The presence-based model reaches an accuracy of at least 53% in all three configurations. Unlike the count-based model, it scores highest without any preprocessing (56.6%); lemmatization (54.7%) still does better than stemming (53.9%).

From the results above, the model with the best accuracy is naive Bayes trained on lemmatized data. Accuracy drops from about 81% to about 55% when switching from the count-based model to the binary one.

## Precision, recall, and F1 score


Here we compute the per-class precision and recall, as well as the F1 scores, for the model with the best accuracy, namely the count-based naive Bayes trained on lemmatized data.

In [None]:
# split the cases into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN)
nb_pos = 0          # reviews whose true label is positive
nb_neg = 0          # reviews whose true label is negative
nb_obtain_pos = 0   # reviews predicted positive
nb_obtain_neg = 0   # reviews predicted negative
true_pos = 0
true_neg = 0
for i in test_list:
    label = imdb_hugging_test['label'][i]
    predicted = test_naive_bayes(imdb_hugging_test['text'][i], logprior_lem, log_likelihood_lem, [0, 1], vocab_lem, pretreatment=lemmatization_tokenize)
    if label == 0:
      nb_neg += 1
      if predicted == label: # TN
        true_neg += 1
        nb_obtain_neg += 1
      else: # FP
        nb_obtain_pos += 1
    else:
      nb_pos += 1
      if predicted == label: # TP
        true_pos += 1
        nb_obtain_pos += 1
      else: # FN
        nb_obtain_neg += 1

# precision: correct predictions of a class / all predictions of that class
precision_pos = true_pos / nb_obtain_pos
precision_neg = true_neg / nb_obtain_neg
# recall: correct predictions of a class / all true members of that class
recall_pos = true_pos / nb_pos
recall_neg = true_neg / nb_neg
f1_score_pos = (2 * precision_pos * recall_pos) / (precision_pos + recall_pos)
f1_score_neg = (2 * precision_neg * recall_neg) / (precision_neg + recall_neg)
list_pos = [precision_pos, recall_pos, f1_score_pos]
list_neg = [precision_neg, recall_neg, f1_score_neg]

print(list_pos)
print(list_neg)

[0.8556370302474794, 0.7468, 0.7975224263135413]
[0.7753726046841731, 0.874, 0.8217374952989847]


The results are printed in the format:
```
[precision_pos, recall_pos, f1_score_pos]
[precision_neg, recall_neg, f1_score_neg]
```
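
For reference, the three metrics for one class can be written directly from the confusion counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. A small sketch with made-up counts (`prf` is a hypothetical helper, not part of the lab code):

```python
def prf(tp, fp, fn):
    # precision: share of predicted members of the class that are correct
    precision = tp / (tp + fp)
    # recall: share of true members of the class that were found
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = prf(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```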

## **LOGISTIC REGRESSION**

### Features' treatment 


Download vader_lexicon.txt

In [None]:
!wget https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt

--2021-10-07 15:45:13--  https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 426786 (417K) [text/plain]
Saving to: ‘vader_lexicon.txt’


2021-10-07 15:45:13 (11.4 MB/s) - ‘vader_lexicon.txt’ saved [426786/426786]



In [None]:
import pandas as pd

df = pd.read_csv('/content/vader_lexicon.txt', 
                 delimiter = "\t", 
                 names = ('token', 'mean-sentiment-rating', 'standard deviation', 'raw-human-sentiment-ratings'))

In [None]:
df.head()

Unnamed: 0,token,mean-sentiment-rating,standard deviation,raw-human-sentiment-ratings
0,$:,-1.5,0.80623,"[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]"
1,%),-0.4,1.0198,"[-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]"
2,%-),-1.5,1.43178,"[-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]"
3,&-:,-0.4,1.42829,"[-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]"
4,&:,-0.7,0.64031,"[0, -1, -1, -1, 1, -1, -1, -1, -1, -1]"


**The threshold values**

According to the VADER sentiment documentation, the lexical features were rated on a scale from "[–4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". The authors kept every lexical feature that had a non-zero mean rating and whose standard deviation was less than 2.5, as determined by the aggregate of ten independent raters. This left them with just over 7,500 lexical features with validated valence scores that indicate both the sentiment polarity (positive/negative) and the sentiment intensity on a scale from –4 to +4.


Since every retained entry has a non-zero mean rating, 0 cleanly separates positive entries from negative ones, so the threshold used here is 0.

In [None]:
threshold = 0

In [None]:
df_positive = df[df['mean-sentiment-rating'] > threshold].token
positive_lexicon = np.array(df_positive)

df_negative = df[df['mean-sentiment-rating'] < threshold].token
negative_lexicon = np.array(df_negative)

In [None]:
len(df), len(positive_lexicon), len(negative_lexicon)

(7520, 3347, 4173)
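
The counts add up (3347 + 4173 = 7520), so no entry sits exactly at the threshold. The same split can be sketched with plain Python sets on a few hand-written entries; the tokens and ratings below are illustrative values, not quoted from `vader_lexicon.txt`:

```python
threshold = 0

# Hypothetical (token, mean rating) pairs in the style of the lexicon.
ratings = {"great": 3.1, "awful": -2.0, "fine": 0.8, "boring": -1.3}

positive_lexicon = {tok for tok, score in ratings.items() if score > threshold}
negative_lexicon = {tok for tok, score in ratings.items() if score < threshold}

print(sorted(positive_lexicon))
print(sorted(negative_lexicon))
# Every entry lands in exactly one of the two sets when no rating equals the threshold.
print(len(positive_lexicon) + len(negative_lexicon) == len(ratings))
```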

In [None]:
import time

In [None]:
start_time_lr = time.time()

In [None]:
start_time = time.time()
# df_positive is a Series of tokens, so compare it directly instead of indexing a 'token' column
print(len(df_positive[df_positive == 'accomplishes']))
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
start_time = time.time()
print(int(df_positive.isin(['accomplishes']).sum()))
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
start_time = time.time()
print(sum(np.in1d(positive_lexicon, 'accomplishes')))
print("--- %s seconds ---" % (time.time() - start_time))
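
The cells above time a lookup of one token in the positive lexicon. Since feature extraction performs such a lookup for every token of every review, a hash-based `set` may be worth considering: membership tests take constant time on average instead of scanning the whole lexicon. A sketch with a synthetic lexicon (the generated tokens are placeholders) using `timeit`:

```python
import timeit

# Synthetic lexicon of 5000 placeholder tokens.
lexicon_list = [f"tok{i}" for i in range(5000)]
lexicon_set = set(lexicon_list)

target = "tok4999"  # last element: worst case for the linear scan

t_list = timeit.timeit(lambda: target in lexicon_list, number=200)
t_set = timeit.timeit(lambda: target in lexicon_set, number=200)

print(f"list: {t_list:.6f}s  set: {t_set:.6f}s")
```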

In [None]:
imdb_hugging_train['text'][:5][0]

Each review is mapped to the following features:

* 1 if "no" appears in the document, 0 otherwise
* The count of first- and second-person pronouns in the document
* 1 if "!" is in the document, 0 otherwise
* log(word count in the document)
* Number of words in the document which are in the positive lexicon
* Number of words in the document which are in the negative lexicon

In [None]:
import math

def text_review_to_vector(string, tokenization_func):

  tokens = np.array(tokenization_func(string))

  no_appear = 0
  pronous_occu = 0
  exclamation_mark = 0
  log_words = 0
  nb_positive_word = 0
  nb_negative_word = 0

  for token in tokens:
    no_appear = 1 if token == 'no' else no_appear
    pronous_occu += 1 if token in ['i', 'you'] else 0
    exclamation_mark = 1 if token == '!' else exclamation_mark  # keep the previous flag, not no_appear
    if token in positive_lexicon:
      nb_positive_word += 1
    elif token in negative_lexicon:
      nb_negative_word += 1

  log_words = math.log2(len(tokens))

  return [no_appear, pronous_occu, exclamation_mark, log_words, nb_positive_word, nb_negative_word]
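
To make the six features concrete, here is a self-contained sketch on a single made-up sentence. It uses `str.split` as a stand-in tokenizer and tiny hand-written lexicons stored as sets; these are simplifying assumptions rather than the notebook's actual setup, and `features_sketch` is a hypothetical name:

```python
import math

# Tiny illustrative lexicons, not taken from VADER.
positive_lexicon = {"great", "love"}
negative_lexicon = {"boring", "bad"}
pronouns = {"i", "you"}

def features_sketch(text):
    tokens = text.lower().split()
    return [
        int("no" in tokens),                             # contains "no"
        sum(tok in pronouns for tok in tokens),          # pronoun count
        int("!" in tokens),                              # contains "!"
        math.log2(len(tokens)),                          # log2 of the token count
        sum(tok in positive_lexicon for tok in tokens),  # positive words
        sum(tok in negative_lexicon for tok in tokens),  # negative words
    ]

vec = features_sketch("I love it , no boring scene !")
print(vec)
```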

In [None]:
def build_review(data, tokenization_func):
  features_vect = []
  
  for line in data['text']:
    features_vect.append(text_review_to_vector(line, tokenization_func))

  return features_vect, data['label']

## Dataset

In [None]:
sub_imdb_hugging_train = imdb_hugging_train[12400:12600]
sub_imdb_hugging_test = imdb_hugging_test[12400:12600]

## With or without pretreatment ?

In this section, we are going to compare our model without pretreatment, with stemming and with lemmatization and see which is the best solution. 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

### Logistic regression without pretreatment

In [None]:
start_time = time.time()

X_train, y_train = build_review(imdb_hugging_train, tokenize)

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
X_test_without_pretreat, y_test_without_pretreat = build_review(sub_imdb_hugging_test, tokenize)

**Train**

In [None]:
logisticRegr_without_pretreat = LogisticRegression().fit(X_train, y_train)

**Prediction**

In [None]:
y_test_pred_without_pretreat = logisticRegr_without_pretreat.predict(X_test_without_pretreat)

In [None]:
score = logisticRegr_without_pretreat.score(X_test_without_pretreat, y_test_without_pretreat)
print(score)

### Logistic regression with pretreatment (stemming)

In [None]:
#X_train, y_train = build_review(imdb_hugging_train)
X_train, y_train = build_review(sub_imdb_hugging_train, stemming_tokenize)

In [None]:
X_test_with_stemming, y_test_with_stemming = build_review(sub_imdb_hugging_test, stemming_tokenize)

**Train**

In [None]:
logisticRegr_with_stemming = LogisticRegression().fit(X_train, y_train)

**Prediction**

In [None]:
y_test_pred_with_stemming = logisticRegr_with_stemming.predict(X_test_with_stemming)

In [None]:
score = logisticRegr_with_stemming.score(X_test_with_stemming, y_test_with_stemming)
print(score)

### Logistic regression with pretreatment (lemmatization)

In [None]:
#X_train, y_train = build_review(imdb_hugging_train)
X_train, y_train = build_review(sub_imdb_hugging_train, lemmatization_tokenize)

In [None]:
X_test_with_lemmatization, y_test_with_lemmatization = build_review(sub_imdb_hugging_test, lemmatization_tokenize)

**Train**

In [None]:
logisticRegr_with_lemmatization = LogisticRegression().fit(X_train, y_train)

**Prediction**

In [None]:
y_test_pred_with_lemmatization = logisticRegr_with_lemmatization.predict(X_test_with_lemmatization)

In [None]:
score = logisticRegr_with_lemmatization.score(X_test_with_lemmatization, y_test_with_lemmatization)
print(score)

### Evaluation measure logistic regression with and without pretreatment

In [None]:
precision_recall_fscore_support(y_test_without_pretreat, y_test_pred_without_pretreat, average=None, labels=[1, 0])

In [None]:
precision_recall_fscore_support(y_test_with_stemming, y_test_pred_with_stemming, average=None, labels=[1, 0])

In [None]:
precision_recall_fscore_support(y_test_with_lemmatization, y_test_pred_with_lemmatization, average=None, labels=[1, 0])

The best is ???

## More features

In this section, we are going to add at least 2 more features 

### Feature number 1

### Feature number 2

### Evaluation measure logistic regression with the added features

*justify your choices with observations*

As we can see, with these features the precision increases. This is due to ... 

## With or without regularization ?

In this section, we are going to compare our model with and without regularization.
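
As a reminder of what regularization changes, an L2-penalized logistic regression minimizes the usual cross-entropy loss plus a penalty on the weight norm. In the notation below, $\lambda \ge 0$ is the regularization strength and $\sigma$ is the sigmoid function (symbols introduced here for illustration); scikit-learn's `LogisticRegression` exposes the inverse of the strength as its `C` parameter, so a larger `C` means weaker regularization:

$$
J(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \sigma(w^\top x_i + b) + (1 - y_i)\log\big(1 - \sigma(w^\top x_i + b)\big) \Big] + \frac{\lambda}{2}\,\lVert w \rVert_2^2
$$

Setting $\lambda = 0$ recovers the unregularized model.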

## Wrongly classified samples - Oh no is it possible ??

In this section, we are going to provide examples of wrongly classified samples, as well as explanations on why these examples were attributed to the wrong class.

In [None]:
print("--- %s seconds ---" % (time.time() - start_time_lr))