
# Introduction to Natural Language Processing Catch-up 1

## The dataset

In [1]:
!pip install datasets


In [2]:
from datasets import load_dataset_builder

In [3]:
ds_builder = load_dataset_builder("imdb")


In [4]:
# Inspect dataset description
ds_builder.info.description

'Large Movie Review Dataset.\nThis is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.'

In [5]:
# Inspect dataset features
ds_builder.info.features

{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

In [6]:
from datasets import get_dataset_split_names

In [7]:
get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

In [8]:
from datasets import load_dataset

In [9]:
dataset = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [11]:
imdb_train = load_dataset('imdb', split='train')
imdb_test = load_dataset('imdb', split='test')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


In [12]:
from collections import Counter

In [13]:
Counter(imdb_train['label'])

Counter({0: 12500, 1: 12500})

In [14]:
Counter(imdb_test['label'])

Counter({0: 12500, 1: 12500})



---



## Naive Bayes classifier

In [15]:
import numpy as np

In [16]:
def compute_occurences(vocabulary, documents):
  
  # Create a count dictionary with every vocabulary word initialised to 0
  occurences = dict.fromkeys(vocabulary, 0)

  for text in documents:
    for current_word in text.split():
      occurences[current_word] +=1

  return occurences

In [17]:
import math


def compute_class_log_likelihood(log_likelihood, vocabulary, class_label, occurences):
  '''
  Computes the add-one (Laplace) smoothed log likelihood of each word for one class.

  Parameters
  ----------
  log_likelihood: the current dictionary mapping each word to its per-class log likelihoods
  vocabulary: the vocabulary of the full corpus
  class_label: the label of the class for which the likelihoods are calculated
  occurences: dictionary mapping each vocabulary word to its count in the documents of that class

  Returns
  -------
  the updated log_likelihood dictionary, with words as keys and per-class log likelihoods as values
  '''

  # Add-one smoothing: each word count is incremented by 1 so that no probability is zero.
  denominator = sum(occurences[word] + 1 for word in vocabulary)

  for word in occurences:
    log_likelihood[word][class_label] = math.log((occurences[word] + 1) / denominator)

  return log_likelihood
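
The smoothing can be checked by hand on a tiny example. The sketch below uses a made-up three-word vocabulary and made-up counts (they are not taken from the IMDB data) and repeats the same computation outside the function.

```python
import math

# Toy counts for one class over a three-word vocabulary (made-up numbers).
vocabulary = ["good", "bad", "film"]
occurences = {"good": 3, "bad": 1, "film": 0}

# Add-one smoothing: every word gets one extra pseudo-count.
denominator = sum(occurences[word] + 1 for word in vocabulary)  # 4 + 2 + 1 = 7

log_likelihood = {
    word: math.log((occurences[word] + 1) / denominator) for word in vocabulary
}

# The smoothed probabilities still sum to 1, and the unseen word "film"
# gets probability 1/7 instead of 0 (whose log would be undefined).
total_probability = sum(math.exp(value) for value in log_likelihood.values())
```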

In [18]:
import math


def train_naive_bayes(documents, classes):
  log_prior = {}
  vocabulary = set()

  # Build the vocabulary from every whitespace-separated token in the corpus.
  for text in documents['text']:
    vocabulary.update(text.split())
  num_documents = len(documents['text'])

  log_likelihood = { word : {} for word in vocabulary }

  for class_label in classes:
    # Log prior: fraction of the documents that belong to this class.
    num_documents_of_class = np.count_nonzero(np.array(documents['label']) == class_label)
    log_prior[class_label] = math.log(num_documents_of_class / num_documents)

    # Gather all the texts of this class, then count their words.
    big_document = documents.loc[documents['label'] == class_label, 'text']
    occurences = compute_occurences(vocabulary, big_document)

    log_likelihood = compute_class_log_likelihood(log_likelihood, vocabulary, class_label, occurences)

  return log_prior, log_likelihood, vocabulary

In [19]:
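def test_naive_bayes(test_document, logprior, loglikelihood, classes, vocabulary):

  # Score of each class: log prior plus the log likelihoods of the known words.
  # A dictionary keyed by class label avoids assuming that labels are 0, 1, 2, ...
  scores = {}

  for class_label in classes:
    scores[class_label] = logprior[class_label]

    for word in test_document.split():
      if word in vocabulary:
        scores[class_label] += loglikelihood[word][class_label]

  # Return the label with the highest score.
  return max(scores, key=scores.get)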
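
The classifier adds log probabilities instead of multiplying raw probabilities. A minimal sketch of why, using made-up numbers (100 words that each have probability 1e-5): the direct product underflows to zero in floating point, while the sum of logs stays usable for comparing classes.

```python
import math

# Hypothetical review of 100 words, each with likelihood 1e-5 under some class.
word_probabilities = [1e-5] * 100

# Multiplying the raw probabilities underflows to exactly 0.0.
product = 1.0
for probability in word_probabilities:
    product *= probability

# Summing the logs keeps a finite score.
log_sum = sum(math.log(probability) for probability in word_probabilities)

print(product)   # 0.0
print(log_sum)   # roughly -1151.29
```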

### Preprocessing

In [20]:
from string import punctuation
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [21]:
# For the Lemmatizer 
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [22]:
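def pre_treatment(text):
  # Lowercase, then strip all punctuation characters.
  new_text = text.lower()
  new_text = new_text.translate(str.maketrans('', '', punctuation))
  
  # Replace every run of non-letter characters with a space,
  # then drop one- and two-letter words.
  new_text = re.sub(r'[^a-z]+', ' ', new_text)
  new_text = re.sub(r'\b\w\b', ' ', new_text)
  new_text = re.sub(r'\b\w\w\b', ' ', new_text)

  # Tokenize and lemmatize each token (as nouns, the WordNetLemmatizer default).
  word_tokens = word_tokenize(new_text)
  lemmatizer = WordNetLemmatizer()

  return ' '.join([lemmatizer.lemmatize(w) for w in word_tokens])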

In [23]:
imdb_train['text'][0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [24]:
pre_treated_text = [pre_treatment(text) for text in imdb_train['text']]

In [25]:
pre_treated_text[0]

'rented curiousyellow from video store because all the controversy that surrounded when wa first released also heard that first wa seized custom ever tried enter this country therefore being fan film considered controversial really had see this for myselfbr the plot centered around young swedish drama student named lena who want learn everything she can about life particular she want focus her attention making some sort documentary what the average swede thought about certain political issue such the vietnam war and race issue the united state between asking politician and ordinary denizen stockholm about their opinion politics she ha sex with her drama teacher classmate and married menbr what kill about curiousyellow that year ago this wa considered pornographic really the sex and nudity scene are few and far between even then it not shot like some cheaply made porno while countryman mind find shocking reality sex and nudity are major staple swedish cinema even ingmar bergman arguably
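
Tokens such as `myselfbr` and `menbr` in the output above come from the `<br />` tags in the raw reviews: once punctuation is stripped, the leftover `br` is glued to the preceding word. A minimal sketch of one possible fix, assuming the tags should be dropped before `pre_treatment` runs. `strip_html_breaks` is a hypothetical helper that is not part of this notebook, and applying it would change the results reported in the rest of the notebook.

```python
import re

def strip_html_breaks(text):
    # Replace each run of <br>, <br/> or <br /> tags (any case) with one space.
    return re.sub(r'(?:<br\s*/?>\s*)+', ' ', text, flags=re.IGNORECASE)

raw = 'I really had to see this for myself.<br /><br />The plot is centered'
print(strip_html_breaks(raw))
# I really had to see this for myself. The plot is centered
```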

### Naive Bayes classifier on the training set

#### Convert to a dataframe

In [26]:
import pandas as pd

In [27]:
data_imdb_train = {'text': pre_treated_text, 'label': imdb_train['label']}
df_imdb_train = pd.DataFrame(data = data_imdb_train)

In [28]:
df_imdb_train.head()

Unnamed: 0,text,label
0,rented curiousyellow from video store because ...,0
1,curious yellow risible and pretentious steamin...,0
2,only avoid making this type film the future th...,0
3,this film wa probably inspired godard masculin...,0
4,brotherafter hearing about this ridiculous fil...,0


#### Train on the IMDB training set

In [29]:
classes = [0, 1]

In [30]:
log_prior, log_likelihood, vocabulary = train_naive_bayes(df_imdb_train, classes)

In [31]:
log_prior, log_likelihood

({0: -0.6931471805599453, 1: -0.6931471805599453},
 {'oval': {0: -13.983024267752858, 1: -13.604299283042069},
  'whalewatching': {0: -13.983024267752858, 1: -14.70291157171018},
  'restricted': {0: -12.373586355318757, 1: -11.707179298156188},
  'manned': {0: -13.577559159644695, 1: -13.093473659276079},
  'driveins': {0: -13.289877087192913, 1: -13.604299283042069},
  'harwood': {0: -13.577559159644695, 1: -14.70291157171018},
  'cogently': {0: -13.983024267752858, 1: -14.70291157171018},
  'garbage': {0: -8.792849059824524, 1: -10.498218952319213},
  'carrienot': {0: -13.983024267752858, 1: -14.70291157171018},
  'genocidal': {0: -13.289877087192913, 1: -14.009764391150235},
  'mendes': {0: -13.289877087192913, 1: -11.484035746841979},
  'unhealthy': {0: -12.373586355318757, 1: -12.623470030030344},
  'postmodern': {0: -11.631649010589381, 1: -12.063854242094921},
  'snapshotters': {0: -13.983024267752858, 1: -14.70291157171018},
  'burgermister': {0: -14.676171448312804, 1: -14.009

### Accuracy on both training and test set

#### Convert the IMDB test set to a dataframe

In [32]:
df_imdb_test = pd.DataFrame(data = imdb_test)

In [33]:
df_imdb_test.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


#### Preprocessing of the IMDB test set

In [34]:
df_imdb_test['text'] = df_imdb_test['text'].apply(pre_treatment)

In [35]:
df_imdb_test.head()

Unnamed: 0,text,label
0,love scifi and willing put with lot scifi movi...,0
1,worth the entertainment value rental especiall...,0
2,it totally average film with few semialright a...,0
3,star rating saturday night friday night friday...,0
4,first off let say you havent enjoyed van damme...,0


#### Prediction

In [36]:
df_imdb_train['prediction'] = df_imdb_train.apply(lambda row: test_naive_bayes(row['text'], log_prior, log_likelihood, classes, vocabulary), axis=1)

In [37]:
df_imdb_test['prediction'] = df_imdb_test.apply(lambda row: test_naive_bayes(row['text'], log_prior, log_likelihood, classes, vocabulary), axis=1)

In [38]:
df_imdb_train.head()

Unnamed: 0,text,label,prediction
0,rented curiousyellow from video store because ...,0,0
1,curious yellow risible and pretentious steamin...,0,0
2,only avoid making this type film the future th...,0,0
3,this film wa probably inspired godard masculin...,0,0
4,brotherafter hearing about this ridiculous fil...,0,0


In [39]:
df_imdb_test.head()

Unnamed: 0,text,label,prediction
0,love scifi and willing put with lot scifi movi...,0,0
1,worth the entertainment value rental especiall...,0,0
2,it totally average film with few semialright a...,0,0
3,star rating saturday night friday night friday...,0,0
4,first off let say you havent enjoyed van damme...,0,0


#### Accuracy

In [40]:
def compute_accuracy(dataframe, prediction_column):
  # Number of rows whose prediction matches the gold label.
  correct_answers = (dataframe['label'] == dataframe[prediction_column]).sum()

  accuracy = correct_answers / len(dataframe)

  return accuracy

Test set accuracy

In [41]:
test_set_accuracy = compute_accuracy(df_imdb_test, prediction_column="prediction")
print(f"Test set accuracy: {test_set_accuracy:.2%}")

Test set accuracy: 81.67%


Train set accuracy

In [42]:
train_set_accuracy = compute_accuracy(df_imdb_train, prediction_column="prediction")
print(f"Train set accuracy: {train_set_accuracy:.2%}")

Train set accuracy: 91.14%


### Question 2.4: Why is accuracy a sufficient measure of evaluation here?

...
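
One check that may help with this question, assuming the label counts shown earlier (12,500 reviews per class in both splits): with perfectly balanced classes, always predicting a single class scores 50%, so an accuracy well above that cannot come from favouring a majority class. The helper name `majority_baseline_accuracy` is illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy obtained by always predicting the most frequent label.
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Label distribution reported above for the IMDB test split.
balanced_labels = [0] * 12500 + [1] * 12500
# A made-up imbalanced distribution, for contrast.
imbalanced_labels = [0] * 950 + [1] * 50

print(majority_baseline_accuracy(balanced_labels))    # 0.5
print(majority_baseline_accuracy(imbalanced_labels))  # 0.95
```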


### What are the top 10 most important words (features) for each class?


In [43]:
type(log_likelihood["and"])

dict

In [44]:
def sort_log_likelihood(loglikelihood, class_label):
  # Keep only the log likelihood of the requested class for each word.
  loglikelihood_items = {}
  
  for word in loglikelihood.keys():
    loglikelihood_items[word] = loglikelihood[word][class_label]
  
  # Sort the (word, log likelihood) pairs from most to least likely.
  sorted_tuples = sorted(loglikelihood_items.items(), key=lambda item: item[1], reverse=True)
  
  return sorted_tuples

#### Look at the words with the highest likelihood in each class.


In [45]:
Top_10_negative = sort_log_likelihood(log_likelihood, 0)[:10]

In [46]:
Top_10_negative

[('the', -2.677978196365483),
 ('and', -3.471675128939145),
 ('this', -4.068201205552985),
 ('that', -4.204278280380475),
 ('movie', -4.443128381768861),
 ('wa', -4.50357320891033),
 ('for', -4.691380444712641),
 ('but', -4.708020382450953),
 ('film', -4.708301636847407),
 ('with', -4.7366895355533)]

In [47]:
Top_10_positive = sort_log_likelihood(log_likelihood, 1)[:10]

In [48]:
Top_10_positive

[('the', -2.645785670000755),
 ('and', -3.308623257048428),
 ('this', -4.247120615961752),
 ('that', -4.266915184576918),
 ('film', -4.622701440780578),
 ('with', -4.653507144273896),
 ('for', -4.69164642700062),
 ('wa', -4.711642105790518),
 ('movie', -4.713108717722131),
 ('but', -4.778053993437859)]

#### Remove stopwords (see NLTK stopwords corpus) and check again.

In [49]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [50]:
stop_words = set(stopwords.words('english'))

In [51]:
log_likelihood_without_stop = log_likelihood.copy()

In [52]:
for word in stop_words:
    log_likelihood_without_stop.pop(word, None)

In [53]:
print("Length with stop word: ", len(log_likelihood))
print("Length without stop word: ", len(log_likelihood_without_stop))

Length with stop word:  108227
Length without stop word:  108106


In [54]:
Top_10_negative_without_stop = sort_log_likelihood(log_likelihood_without_stop, 0)[:10]

In [55]:
Top_10_negative_without_stop

[('movie', -4.443128381768861),
 ('wa', -4.50357320891033),
 ('film', -4.708301636847407),
 ('one', -5.206162750340039),
 ('like', -5.357514932853162),
 ('even', -5.739084412136281),
 ('ha', -5.744355211003636),
 ('good', -5.797813407684981),
 ('time', -5.804245197195175),
 ('bad', -5.8094487811187046)]

In [56]:
Top_10_positive_without_stop = sort_log_likelihood(log_likelihood_without_stop, 1)[:10]

In [57]:
Top_10_positive_without_stop

[('film', -4.622701440780578),
 ('wa', -4.711642105790518),
 ('movie', -4.713108717722131),
 ('one', -5.177103741274279),
 ('ha', -5.581839667492178),
 ('like', -5.601716646994794),
 ('time', -5.746560631219308),
 ('good', -5.788688751006879),
 ('story', -5.820797527501359),
 ('character', -5.867555600588574)]
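
Even without stop words the two lists largely coincide, since a high likelihood mostly reflects how frequent a word is overall. A possible alternative, which is not the ranking used above, is to sort words by the difference between their log likelihoods in the two classes (a log likelihood ratio). The sketch below uses three entries copied from the `log_likelihood` output printed earlier; `rank_by_log_ratio` is a hypothetical helper.

```python
def rank_by_log_ratio(loglikelihood, class_label, other_label):
    # Score each word by log P(word | class_label) - log P(word | other_label).
    ratios = {
        word: values[class_label] - values[other_label]
        for word, values in loglikelihood.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Three entries taken from the log_likelihood output shown earlier.
sample = {
    'garbage': {0: -8.792849059824524, 1: -10.498218952319213},
    'mendes': {0: -13.289877087192913, 1: -11.484035746841979},
    'restricted': {0: -12.373586355318757, 1: -11.707179298156188},
}

print(rank_by_log_ratio(sample, 0, 1)[0][0])  # garbage
print(rank_by_log_ratio(sample, 1, 0)[0][0])  # mendes
```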

### Take at least 2 wrongly classified examples from the test set and try to explain why the model failed.

In [58]:
wrong_df_imdb_test = df_imdb_test[df_imdb_test.label != df_imdb_test.prediction]

In [59]:
two_wrongly = wrong_df_imdb_test.sample(2, random_state=3)
two_wrongly

Unnamed: 0,text,label,prediction
12235,james marsh the king film that mystifies cant ...,0,1
22458,hilarious comedy the best director ever scott ...,1,0


In [60]:
two_wrongly.iloc[0]['text']

'james marsh the king film that mystifies cant think what it meant for it story about young man called elvis played gael garcia bernal who get honourable discharge after year navy service and then go off find his biological father and behaves dishonourably with him and his family it all rather sick really elvis worm his way into the family seducing his year old sister malerie pell james it rather impossible identify with anyone this film from here middle england preacher father and bouncy joyful christian congregation couldnt work out whether the film meant deriding them for their mindless belief the target the happy family and are meant think thats unviable just saying that some people are lost and just hell bent destruction it shallow all know that bad thing happen the interesting bit learn why but this film just gratuitously depicts violence without ever unravelling the thinking that ha led the king such lost opportunity there are some really interesting question about honour the wa

In [61]:
two_wrongly.iloc[1]['text']

'hilarious comedy the best director ever scott the list eighty icon go and milano who the bos yothers family tie stone belvedere robinson night court jackee dabo wonder year walston hand one the funniest movie ever great line meaningless subplots cheesy bad acting about group high school kid who need pas driver mac from night court need them pas their final exam hell fired great performance brian bloom the jerkkinda cool guy riko conner but nothing compared wongs kiki pronounced keechee great movie for all age bad it good'

...


## FastText

In [62]:
!pip install fastText



### Dataset compatible with FastText

In [63]:
dataframe_imdb_train = pd.DataFrame(data = imdb_train)

In [64]:
dataframe_imdb_train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [65]:
dataframe_imdb_test = pd.DataFrame(data = imdb_test)

In [66]:
dataframe_imdb_test.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


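Shuffle the dataset: the rows appear grouped by label (the first rows shown above are all labelled 0), and training on them in that order could bias the model.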

In [67]:
dataframe_imdb_train = dataframe_imdb_train.sample(frac=1, random_state=1)
dataframe_imdb_train.head()

Unnamed: 0,text,label
21492,This is an amazing movie and all of the actors...,1
9488,"""Proximity"" tells of a convict (Lowe) who thin...",0
16933,When I heard this film was directed by Ang Lee...,1
12604,"OK, so the musical pieces were poorly written ...",1
8222,"I'm sorry, but this is such a bad movie it's h...",0


In [68]:
dataframe_imdb_test = dataframe_imdb_test.sample(frac=1, random_state=1)
dataframe_imdb_test.head()

Unnamed: 0,text,label
21492,"Man, is it great just to see Young and The Res...",1
9488,Seems like a pretty innocent choice at first- ...,0
16933,I was into the movie right away. I've seen the...,1
12604,Anyone who appreciates fine acting and ringing...,1
8222,High school female track star dies of a blood ...,0


#### Preprocessing

In [69]:
def simple_pre_treatment(text):
  new_text = text.lower()
  new_text = new_text.translate(str.maketrans('', '', punctuation))
  
  return new_text

In [70]:
dataframe_imdb_train['text'] = dataframe_imdb_train['text'].apply(simple_pre_treatment)

In [71]:
dataframe_imdb_train.head()

Unnamed: 0,text,label
21492,this is an amazing movie and all of the actors...,1
9488,proximity tells of a convict lowe who thinks t...,0
16933,when i heard this film was directed by ang lee...,1
12604,ok so the musical pieces were poorly written a...,1
8222,im sorry but this is such a bad movie its hila...,0


In [72]:
dataframe_imdb_test['text'] = dataframe_imdb_test['text'].apply(simple_pre_treatment)

In [73]:
dataframe_imdb_test.head()

Unnamed: 0,text,label
21492,man is it great just to see young and the rest...,1
9488,seems like a pretty innocent choice at first t...,0
16933,i was into the movie right away ive seen the o...,1
12604,anyone who appreciates fine acting and ringing...,1
8222,high school female track star dies of a blood ...,0


The FastText classifier takes a text file as input, and every line must have the following format:
```
__label__<your_label> <corresponding text>
```
So we need to format our dataset.
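
The format can be illustrated by splitting such a line back into its label and text. A minimal sketch, assuming a single label per line and the `__label__` prefix shown above; `parse_fasttext_line` is a hypothetical helper, not part of the FastText API.

```python
LABEL_PREFIX = "__label__"

def parse_fasttext_line(line):
    # Split "__label__<label> <text>" into (label, text).
    first_token, _, text = line.rstrip("\n").partition(" ")
    if not first_token.startswith(LABEL_PREFIX):
        raise ValueError("line does not start with a label: {!r}".format(line))
    return first_token[len(LABEL_PREFIX):], text

label, text = parse_fasttext_line("__label__positive this is an amazing movie\n")
print(label)  # positive
print(text)   # this is an amazing movie
```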

In [74]:
def data_to_formating_text_file(dataframe, file_name):

  # Write one "__label__<label> <text>" line per document.
  with open(file_name, "w") as text_file:
    for index, document in dataframe.iterrows():
      label = "positive" if document["label"] == 1 else "negative"
      line = "__label__{} {}\n".format(label, document["text"])
      text_file.write(line)

In [75]:
file_name = "imdb_train.txt"

In [76]:
data_to_formating_text_file(dataframe_imdb_train, file_name)

In [77]:
!head $file_name

__label__positive this is an amazing movie and all of the actors and actresses and very good even though some of the actors and actresses werent very popular in show business it seemed like they have been acting since they were 1 one year old it was funny gross and just all out a very good movie in most parts i just didnt know what was going to happen next i was like i think this is going to happen wait i think this is going to happen all age groups will love this movie in some parts i couldnt stop laughing it was so funny but in some parts i was totally grossed out and i couldnt believe what i was seeing i am definitely going to see this movie again it is one of those movies where it cant get boring every time you see it is so suspenseful i definitely recommend seeing this movie
__label__negative proximity tells of a convict lowe who thinks the prison staff is out to kill him this very ordinary film is an actiondrama with a weak plot stereotypical poorly developed characters and a one

### FastText classifier with default parameters

In [78]:
import fasttext

In [79]:
model = fasttext.train_supervised(input="imdb_train.txt")

In [80]:
dataframe_imdb_test.head()

Unnamed: 0,text,label
21492,man is it great just to see young and the rest...,1
9488,seems like a pretty innocent choice at first t...,0
16933,i was into the movie right away ive seen the o...,1
12604,anyone who appreciates fine acting and ringing...,1
8222,high school female track star dies of a blood ...,0


In [81]:
model.predict('text')[0][0]

'__label__negative'

`model.predict` returns a tuple such as `(('__label__<label>',), array([0.999]))`: a tuple of predicted labels followed by an array of their probabilities. I store the first label, i.e. item 0 of element 0, in the dataframe.
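
The unpacking can be illustrated without a trained model. In the sketch below, `prediction` is a hand-written stand-in with the same shape as the value described above (a plain list replaces the NumPy array of probabilities), and `strip_label_prefix` is a hypothetical helper.

```python
# Stand-in for the value returned by model.predict on a single text.
prediction = (('__label__positive',), [0.999])

labels, probabilities = prediction
top_label = labels[0]              # same as prediction[0][0]
top_probability = probabilities[0]

def strip_label_prefix(label, prefix='__label__'):
    # Turn '__label__positive' into 'positive'.
    return label[len(prefix):] if label.startswith(prefix) else label

print(top_label)                      # __label__positive
print(strip_label_prefix(top_label))  # positive
```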

In [82]:
dataframe_imdb_test['prediction'] = dataframe_imdb_test.apply(lambda row: model.predict(row['text'])[0][0], axis=1)

In [83]:
dataframe_imdb_test.head()

Unnamed: 0,text,label,prediction
21492,man is it great just to see young and the rest...,1,__label__positive
9488,seems like a pretty innocent choice at first t...,0,__label__negative
16933,i was into the movie right away ive seen the o...,1,__label__positive
12604,anyone who appreciates fine acting and ringing...,1,__label__positive
8222,high school female track star dies of a blood ...,0,__label__negative


In [84]:
dataframe_imdb_test['label'] = dataframe_imdb_test['label'].apply(lambda label: "__label__{}".format("positive" if label == 1 else "negative"))

In [85]:
dataframe_imdb_test.head()

Unnamed: 0,text,label,prediction
21492,man is it great just to see young and the rest...,__label__positive,__label__positive
9488,seems like a pretty innocent choice at first t...,__label__negative,__label__negative
16933,i was into the movie right away ive seen the o...,__label__positive,__label__positive
12604,anyone who appreciates fine acting and ringing...,__label__positive,__label__positive
8222,high school female track star dies of a blood ...,__label__negative,__label__negative


#### Accuracy

In [86]:
fasttext_test_set_accuracy = compute_accuracy(dataframe_imdb_test, prediction_column="prediction")
print(f"Test set accuracy: {fasttext_test_set_accuracy:.2%}")

Test set accuracy: 87.67%


### The hyperparameter search functionality of FastText

#### Split training set into a training and a validation set

In [87]:
from sklearn.model_selection import train_test_split

In [88]:
train_data, validation_data = train_test_split(dataframe_imdb_train, test_size=0.33, shuffle=True, random_state=1)

#### Convert to a format compatible with FastText

In [89]:
file_name = "reviews.train"
data_to_formating_text_file(train_data, file_name)
!head $file_name

__label__positive i enjoyed watching this well acted movie very muchit was well actedparticularly by actress helen hunt and actors steven weber and jeff faheyit was a very interesting moviefilled with drama and suspensefrom the beginning to the very endi reccomend that everyone take the time to watch this made for television movieit is excellent and has great acting
__label__positive this movie is still an all time favorite only a pretentious humorless moron would not enjoy this wonderful film this movie feels like a slice of warm apple pie topped with french vanilla ice cream i think this is chers best work ever and her most believable performance cher has always been blessed with charisma good looks and an enviably thin figure whether you like her singing or not  who else sounds like cher cher has definitely made her mark in the entertainment industry and will be remembered long after others have come and gone she is one of the most unique artists out there its funny because who woul

In [90]:
file_name = "reviews.valid"
data_to_formating_text_file(validation_data, file_name)
!head $file_name

__label__negative i saw this movie at the toronto film festival with fairly solid expectations the movie has a great cast and was closing at the festival so it must be good right how wrong i was br br i knew we were in trouble when before the film the director was talking about how when he was directing an episode of wiseguy he met an unknown actor named kevin spacey a directorwriter of wiseguy making his feature debut  blah well the directorwriter of edison must have some incriminating pictures of kevin spacey killing a homeless man because i cannot see how he along with the other actors in the film would ever agree to be in this disaster br br this movie is absolutely appalling its a mixture of every cop hard boiled cliché ever there is nothing new with edison the acting was bad and the direction was even worse it looked like that aforementioned episode of wiseguy this was the best casted direct to video movie ive ever seen br br some examples of just bad silly moments in edison morg

#### Activate hyperparameter optimization

In [91]:
model_hyper = fasttext.train_supervised(input='reviews.train', autotuneValidationFile='reviews.valid')

In [92]:
model.test("reviews.valid")

(8250, 0.904969696969697, 0.904969696969697)

#### Test of the model with tuned hyperparameters

In [93]:
dataframe_imdb_test['hyper_prediction'] = dataframe_imdb_test.apply(lambda row: model_hyper.predict(row['text'])[0][0], axis=1)

In [94]:
dataframe_imdb_test.head()

Unnamed: 0,text,label,prediction,hyper_prediction
21492,man is it great just to see young and the rest...,__label__positive,__label__positive,__label__positive
9488,seems like a pretty innocent choice at first t...,__label__negative,__label__negative,__label__negative
16933,i was into the movie right away ive seen the o...,__label__positive,__label__positive,__label__positive
12604,anyone who appreciates fine acting and ringing...,__label__positive,__label__positive,__label__positive
8222,high school female track star dies of a blood ...,__label__negative,__label__negative,__label__negative


##### Accuracy

In [95]:
fasttext_test_set_accuracy = compute_accuracy(dataframe_imdb_test, prediction_column="hyper_prediction")
print(f"Test set accuracy: {fasttext_test_set_accuracy:.2%}")

Test set accuracy: 89.08%


### Question 3.4: Look at their attributes. How do the two models differ?

Passing the `autotuneValidationFile` argument (`-autotune-validation` on the command line) activates FastText's hyperparameter optimization (autotune).
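
To answer the question concretely, the attributes of the two models can be compared side by side. The sketch below is generic: it only relies on `getattr`, and it is demonstrated on two stand-in objects with made-up values rather than on real FastText models. The attribute names in `ATTRIBUTES` are an assumption about what the trained model objects expose (`minn` and `maxn` are the ones named in question 3.6).

```python
from types import SimpleNamespace

# Assumed attribute names; adjust to whatever the model objects actually expose.
ATTRIBUTES = ['lr', 'dim', 'epoch', 'wordNgrams', 'minn', 'maxn']

def diff_attributes(first, second, names=ATTRIBUTES):
    # Return {name: (value in first, value in second)} for attributes that differ.
    differences = {}
    for name in names:
        a, b = getattr(first, name, None), getattr(second, name, None)
        if a != b:
            differences[name] = (a, b)
    return differences

# Stand-ins with made-up values, NOT the real hyperparameters of the notebook's models.
default_like = SimpleNamespace(lr=0.1, dim=100, epoch=5, wordNgrams=1, minn=0, maxn=0)
tuned_like = SimpleNamespace(lr=0.1, dim=50, epoch=20, wordNgrams=2, minn=0, maxn=0)

print(diff_attributes(default_like, tuned_like))
# {'dim': (100, 50), 'epoch': (5, 20), 'wordNgrams': (1, 2)}
```

With the notebook's objects, the call would be `diff_attributes(model, model_hyper)`, provided those attributes are available on them.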


### Take 2 wrongly classified examples from the test set using the tuned model.

In [96]:
wrong_dataframe_imdb_test = dataframe_imdb_test[dataframe_imdb_test.label != dataframe_imdb_test.hyper_prediction]

In [97]:
two_wrongly_tunned_model = wrong_dataframe_imdb_test.sample(2, random_state=3)
two_wrongly_tunned_model

Unnamed: 0,text,label,prediction,hyper_prediction
21350,this film had a distinct woody allen feel abou...,__label__positive,__label__positive,__label__negative
12038,this film did not excite me while on vacation ...,__label__negative,__label__negative,__label__positive


In [98]:
two_wrongly_tunned_model.iloc[0]['text']

'this film had a distinct woody allen feel about it so if youre not a fan of dry humor dark humor or backhanded humor you probably could find something else to do if you are however this is quirky with some nice twists and a flowing natural dialogbr br the story itself is quite engaging not quite like a train wreck from which you cannot disengage your eyes but close i mean that in the best way possible the intrigues are plenty the twists are enough to fully engage the senses and the characters are downright lovablebr br i had a great time with this moviebr br it rates an 8310 frombr br the fiend '

In [99]:
two_wrongly_tunned_model.iloc[1]['text']

'this film did not excite me while on vacation in turkey i noticed that it was playing at the local cinema and decided to take the chance to go see it the action sequences in the film are well choreographed however the story drags during the middle of the filmbr br one thing that i did not quite follow is how borte was able to get the money to bribe the guardsbr br the other thing that i did not particularly like was the missing segment of khans life it did not go into detail as to how the bad guys were able to find him and reimprison him after he had gone missing for so long br br overall i rate this movie a 4 because i enjoyed the action sequences but they were too few to justify a higher rating as the dialog left me less than impressed'

### Question 3.6: Why is it likely that the attributes `minn` and `maxn` are at 0 after a hyperparameter search on our data?
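
For context, `minn` and `maxn` bound the length of the character n-grams that FastText extracts from each word after wrapping it in `<` and `>`; when `maxn` is 0, no subword n-grams are used. The sketch below is a simplified illustration of that extraction, not FastText's actual implementation (which, among other things, hashes the n-grams into buckets).

```python
def char_ngrams(word, minn, maxn):
    # Character n-grams of length minn..maxn from the word wrapped in < and >.
    if maxn <= 0:
        return []  # subword features disabled
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(max(minn, 1), maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams('where', 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams('where', 0, 0))  # []
```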

## Theoretical questions