<a href="https://colab.research.google.com/github/christopherdiamana/Algo-for-big-data-practicals/blob/main/catch_up1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Natural Language Processing: Catch-up 1

## The dataset

In [17]:
!pip install datasets


In [18]:
from datasets import load_dataset_builder

In [19]:
ds_builder = load_dataset_builder("imdb")


In [20]:
# Inspect dataset description
ds_builder.info.description

'Large Movie Review Dataset.\nThis is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.'

In [21]:
# Inspect dataset features
ds_builder.info.features

{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

In [22]:
from datasets import get_dataset_split_names

In [23]:
get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

In [24]:
from datasets import load_dataset

In [25]:
dataset = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...



Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [27]:
imdb_train = load_dataset('imdb', split='train')
imdb_test = load_dataset('imdb', split='test')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


In [28]:
from collections import Counter

In [29]:
Counter(imdb_train['label'])

Counter({0: 12500, 1: 12500})

In [30]:
Counter(imdb_test['label'])

Counter({0: 12500, 1: 12500})



---



## Naive Bayes classifier
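
For reference, the standard multinomial Naive Bayes formulation with add-one (Laplace) smoothing, which the functions in this section appear to follow, can be written as below. The symbols are notation introduced here, not names from the notebook: $N$ is the number of training documents, $N_c$ the number of documents in class $c$, $V$ the vocabulary, and $d$ the document to classify.

$$
\log \hat{P}(c) = \log \frac{N_c}{N}, \qquad
\log \hat{P}(w \mid c) = \log \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

$$
\hat{c} = \arg\max_{c} \Big[ \log \hat{P}(c) + \sum_{w \in d,\ w \in V} \log \hat{P}(w \mid c) \Big]
$$

Words of $d$ that are not in $V$ are skipped, and working in log space avoids numerical underflow from multiplying many small probabilities.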

In [31]:
import math
from collections import Counter


def train_naive_bayes(documents, classes):
  """Train multinomial Naive Bayes with add-one (Laplace) smoothing.

  documents: DataFrame with a 'text' column (space-separated tokens)
  and a 'label' column.
  """
  log_prior = dict()
  log_likelihood = dict()

  # Vocabulary: every distinct token seen in the training documents
  vocabulary = set()
  for text in documents['text']:
    vocabulary.update(text.split())

  num_documents = len(documents)

  for class_name in classes:
    class_texts = documents.loc[documents['label'] == class_name, 'text']

    # log P(c) = log(N_c / N)
    log_prior[class_name] = math.log(len(class_texts) / num_documents)

    # Count token occurrences over all documents of this class
    # (the "big document" of the class)
    word_counts = Counter()
    for text in class_texts:
      word_counts.update(text.split())

    # Add-one smoothing: total token count of the class plus |V|,
    # computed once per class rather than once per word
    denominator = sum(word_counts.values()) + len(vocabulary)

    log_likelihood[class_name] = {}
    for current_word in vocabulary:
      numerator = word_counts[current_word] + 1
      log_likelihood[class_name][current_word] = math.log(numerator / denominator)

  return log_prior, log_likelihood, vocabulary

In [33]:
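import numpy as np


def test_naive_bayes(test_document, log_prior, log_likelihood, classes, vocabulary):
  """Return the most probable class for one pre-treated document (a string)."""
  scores = []

  for class_name in classes:
    # Start from the log prior, then add the log likelihood of each known word
    score = log_prior[class_name]
    for word in test_document.split():
      # Words unseen during training are ignored
      if word in vocabulary:
        score += log_likelihood[class_name][word]
    scores.append(score)

  # np.argmax returns a position in `scores`; map it back to the class name
  return classes[int(np.argmax(scores))]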
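
To check the training and prediction logic end to end without the full IMDB data, here is a minimal self-contained sketch of multinomial Naive Bayes with add-one smoothing on a four-document toy corpus. The corpus and the helper names `fit` and `predict` are hypothetical, chosen only for illustration; they are not the notebook's functions.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, label) with 0 = negative, 1 = positive
toy_docs = [
    ("boring slow film", 0),
    ("bad boring plot", 0),
    ("great fun film", 1),
    ("great acting great plot", 1),
]


def fit(docs, classes):
    """Estimate log priors and add-one smoothed log likelihoods."""
    vocab = {w for text, _ in docs for w in text.split()}
    log_prior, log_lik = {}, {}
    for c in classes:
        texts = [text for text, label in docs if label == c]
        log_prior[c] = math.log(len(texts) / len(docs))
        counts = Counter(w for text in texts for w in text.split())
        # Total token count of the class plus the vocabulary size
        total = sum(counts.values()) + len(vocab)
        log_lik[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab


def predict(text, log_prior, log_lik, classes, vocab):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in vocab)
        for c in classes
    }
    return max(scores, key=scores.get)


toy_prior, toy_lik, toy_vocab = fit(toy_docs, [0, 1])
print(predict("great film", toy_prior, toy_lik, [0, 1], toy_vocab))   # 1
print(predict("boring plot", toy_prior, toy_lik, [0, 1], toy_vocab))  # 0
```

Because the smoothed counts of each class are divided by the class token total plus the vocabulary size, the likelihoods of each class sum to 1 over the vocabulary, which is a quick sanity check for any implementation.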

### Preprocessing

In [34]:
from string import punctuation
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [35]:
# For the Lemmatizer 
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [36]:
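def pre_treatment(text):
  # Lowercase, then delete punctuation characters outright (no space is
  # inserted, so 'curious-yellow' becomes 'curiousyellow')
  new_text = text.lower()
  new_text = new_text.translate(str.maketrans('', '', punctuation))
  
  # Other pretreatments
  # Replace every run of characters outside a-z with a single space
  new_text = re.sub(r'[^a-z]+', ' ', new_text)
  # Drop one-letter and two-letter words
  new_text = re.sub(r'\b\w\b', ' ', new_text)
  new_text = re.sub(r'\b\w\w\b', ' ', new_text)
  word_tokens = word_tokenize(new_text)
  lemmatizer = WordNetLemmatizer()

  # Lemmatize each token (WordNetLemmatizer treats words as nouns by default)
  return ' '.join([lemmatizer.lemmatize(w) for w in word_tokens])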

In [37]:
imdb_train['text'][0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [38]:
pre_treated_text = [pre_treatment(text) for text in imdb_train['text']]

In [39]:
pre_treated_text[0]

'rented curiousyellow from video store because all the controversy that surrounded when wa first released also heard that first wa seized custom ever tried enter this country therefore being fan film considered controversial really had see this for myselfbr the plot centered around young swedish drama student named lena who want learn everything she can about life particular she want focus her attention making some sort documentary what the average swede thought about certain political issue such the vietnam war and race issue the united state between asking politician and ordinary denizen stockholm about their opinion politics she ha sex with her drama teacher classmate and married menbr what kill about curiousyellow that year ago this wa considered pornographic really the sex and nudity scene are few and far between even then it not shot like some cheaply made porno while countryman mind find shocking reality sex and nudity are major staple swedish cinema even ingmar bergman arguably
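
The output above contains fused tokens such as `myselfbr`, `menbr`, and `curiousyellow`: the `<br />` tags and hyphens are deleted without leaving a space, so neighbouring words run together. One possible way to avoid this is sketched below. The function `clean_text` is a hypothetical variant that uses only `re` and skips lemmatization; it is not the notebook's `pre_treatment`.

```python
import re


def clean_text(text):
    """Hypothetical cleaning step: lowercase, strip <br> tags, keep letters only."""
    text = text.lower()
    # Replace HTML line breaks such as <br /> with a space
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Replace every run of non-letters with one space instead of deleting it
    text = re.sub(r'[^a-z]+', ' ', text)
    # Keep only words of three letters or more
    return ' '.join(w for w in text.split() if len(w) > 2)


print(clean_text('I AM CURIOUS-YELLOW, see this for myself.<br /><br />The plot'))
# curious yellow see this for myself the plot
```

The trade-off is that contractions are split as well (`don't` becomes `don` after the short-word filter), so whether this is preferable depends on the downstream vocabulary.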

### Naive Bayes classifier on the training set

#### Convert to a DataFrame

In [40]:
import pandas as pd


data_imdb_train = {'text': pre_treated_text, 'label': imdb_train['label']}
df_imdb_train = pd.DataFrame(data = data_imdb_train)

In [41]:
df_imdb_train.head()

| | text | label |
|---|---|---|
| 0 | rented curiousyellow from video store because ... | 0 |
| 1 | curious yellow risible and pretentious steamin... | 0 |
| 2 | only avoid making this type film the future th... | 0 |
| 3 | this film wa probably inspired godard masculin... | 0 |
| 4 | brotherafter hearing about this ridiculous fil... | 0 |


In [42]:
type(df_imdb_train['label'][0])

numpy.int64

In [43]:
classes = [0, 1]

In [None]:
log_prior, log_likelihood, vocabulary = train_naive_bayes(df_imdb_train, [0, 1])

In [None]:
log_prior, log_likelihood, vocabulary

In [None]:
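from IPython.display import HTML

# Visualizing data: render the first rows of the DataFrame as an HTML table
# (a plain string such as pre_treated_text[0] has no to_html method)
HTML(df_imdb_train.head().to_html())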

In [None]:
from gensim.parsing.preprocessing import STOPWORDS

#Creating a list of custom stopwords
new_words = ["fig","figure","image","sample","using", 
             "show", "result", "large", 
             "also", "one", "two", "three", 
             "four", "five", "seven","eight","nine"]

stop_words = STOPWORDS.union(set(new_words))
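
The cell above builds `stop_words`, but the notebook does not yet apply it to the reviews. A minimal sketch of the filtering step is shown here, assuming a small hypothetical set `demo_stop_words` in place of gensim's `STOPWORDS` so that the example runs without extra dependencies. The helper name `remove_stop_words` is also an assumption.

```python
# Hypothetical stand-in for the real stop word set
demo_stop_words = frozenset({'the', 'and', 'this', 'that', 'also', 'one'})


def remove_stop_words(text, stop_words):
    """Drop every space-separated token that appears in stop_words."""
    return ' '.join(w for w in text.split() if w not in stop_words)


print(remove_stop_words('this film also had the one great scene', demo_stop_words))
# film had great scene
```

Applied before training, this shrinks the vocabulary and removes words that carry little sentiment information, at the cost of also discarding any listed word that happens to matter for a given task.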

In [None]:
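# Class names of the 'label' feature, i.e. ['neg', 'pos']
labels = imdb_train.features['label'].names
{idx:label for idx, label in enumerate(labels)}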

In [None]:
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}