# Classifying movie reviews from scratch

In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from pathlib import Path
import re
import string
import html
import unicodedata

import nltk
nltk.download('punkt')
from nltk import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

Load the data from a URL

In [2]:
from pathlib import Path
import os
DATA_PATH=Path('./data/')  # directory named "data" (e.g. on Colab)
DATA_PATH.mkdir(exist_ok=True)  # create the directory if it does not exist yet

if not os.path.exists('./data/aclImdb'):
    # The lines starting with "!" are shell commands, as on a Linux terminal
    # Download the dataset archive from the URL
    !curl -O http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    # Extract the archive into DATA_PATH
    !tar -xf aclImdb_v1.tar.gz -C {DATA_PATH}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  70.3M      0  0:00:01  0:00:01 --:--:-- 70.3M


The data is already split into train/test. The train split contains 3 class folders (the test split has only `pos` and `neg`):

- pos

- neg

- unsup (no specific label)

In [3]:
CLASS=['neg','pos']
PATH=Path('./data/aclImdb')
def get_text(path):
  text,labels=[],[]
  for idx,label in enumerate(CLASS):
    for fname in (path/label).glob('*.*'):
      text.append(fname.read_text(encoding='utf-8'))  # read_text opens and closes the file
      labels.append(idx)
  
  return text,labels


In [4]:
train_text,train_labels=get_text(PATH/'train')
test_text,test_labels=get_text(PATH/'test')

In [5]:
for i in train_text[:3]:
  print(i)
  print('\n')

What a disappointment! I hated the mummy but this one was even worse! It was very tiring and unbelievable and at a certain point I found myself sighing and yawning all the time. I can't believe that people actually liked this movie. The role of Nicholas Cage wasn't very convincing. The whole movie felt like a grand tour around America's most wanted buildings. The never stopping flow of hints and combinations wasn't very convincing either. I stopped paying attention around 30 minutes. What was supposed to be a happy night out became a total disappointment. What a drag... I guess I've just seen too many movies to enjoy National Treasure.


I shall not waste my time writing anything much further about how every aspect of this film is indescribably bad. That has been done in great detail already, many times over. The 'plot' started out as a very uninspiring cockney wide-boy/gangster-by-numbers bore and very quickly descended into an utter shambles. Anybody who pretends that they can see so

In [6]:
train_labels[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## So what does it take to go from the raw form to the prepared form?

__A - Text preprocessing__

1- Data sequencing: each sentence --> sequence (list) of words

2- Data cleaning: This step varies from task to task. For some tasks it's better to remove special characters and punctuation; for others they are critical (emoticons, for example). Done appropriately, cleaning is good for performance.

3- Text normalization: in general, text morphology is a big issue in NLP: upper and lower case, stemming and lemmatization, etc. Again, it's task dependent.

4- Padding (model dependent): needed for Dense and CNN models, which expect fixed-length input. An RNN can skip this step.

__B- Text preparation__

5- Binarization/vectorization/digitization: transform words into numbers according to a vocab index.
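Step 4 (padding) is the least self-explanatory item in the list above, so here is a minimal sketch of it. The helper name `pad_to_length`, the pad value `0`, and the toy integer sequences are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
# Minimal padding sketch: bring every sequence to the same length.
# `pad_to_length` is a hypothetical helper, not a library function.
def pad_to_length(sequences, max_len, pad_value=0):
    padded = []
    for seq in sequences:
        seq = seq[:max_len]                     # truncate sequences that are too long
        filler = [pad_value] * (max_len - len(seq))
        padded.append(seq + filler)             # pad short sequences on the right
    return padded

# Toy, already-digitized sequences of different lengths
toy_sequences = [[5, 2, 9], [7], [1, 2, 3, 4, 5]]
padded = pad_to_length(toy_sequences, max_len=4)
print(padded)
```

Padding on the right and truncating the tail are design choices of this sketch; padding on the left or truncating the head are equally possible.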

# Text preprocessing

### Manual (split on whitespace)

In [7]:
s=train_text[0].split()
s

['What',
 'a',
 'disappointment!',
 'I',
 'hated',
 'the',
 'mummy',
 'but',
 'this',
 'one',
 'was',
 'even',
 'worse!',
 'It',
 'was',
 'very',
 'tiring',
 'and',
 'unbelievable',
 'and',
 'at',
 'a',
 'certain',
 'point',
 'I',
 'found',
 'myself',
 'sighing',
 'and',
 'yawning',
 'all',
 'the',
 'time.',
 'I',
 "can't",
 'believe',
 'that',
 'people',
 'actually',
 'liked',
 'this',
 'movie.',
 'The',
 'role',
 'of',
 'Nicholas',
 'Cage',
 "wasn't",
 'very',
 'convincing.',
 'The',
 'whole',
 'movie',
 'felt',
 'like',
 'a',
 'grand',
 'tour',
 'around',
 "America's",
 'most',
 'wanted',
 'buildings.',
 'The',
 'never',
 'stopping',
 'flow',
 'of',
 'hints',
 'and',
 'combinations',
 "wasn't",
 'very',
 'convincing',
 'either.',
 'I',
 'stopped',
 'paying',
 'attention',
 'around',
 '30',
 'minutes.',
 'What',
 'was',
 'supposed',
 'to',
 'be',
 'a',
 'happy',
 'night',
 'out',
 'became',
 'a',
 'total',
 'disappointment.',
 'What',
 'a',
 'drag...',
 'I',
 'guess',
 "I've",
 'just

# NLTK 

So far, we have split the text into words manually (mainly on whitespace).

Is there a more mature method?

Actually there is: __tokenizers__

Even the most basic tokenizers take care of punctuation.

NLTK can be used for that.
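To make the difference concrete, here is a rough sketch comparing a whitespace split with a punctuation-aware split. The regular expression and the name `simple_tokenize` are assumptions for illustration only; this is not the algorithm NLTK uses, and NLTK's output differs in details (for example in how it handles contractions and `...`).

```python
import re

sample = "What a drag... I can't believe it!"

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = sample.split()
print(whitespace_tokens)

# Rough punctuation-aware tokenizer (hypothetical helper):
# either a word, optionally with internal apostrophes, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)

regex_tokens = simple_tokenize(sample)
print(regex_tokens)
```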

## Sentence tokenization

Before we dive into word splitting, let's talk a little about sentence tokenization. Sometimes the data comes as one long block of text, a document or long paragraphs for example.

In most NLP models, such long sequences are not desirable (the model tends to forget the earlier parts of the sequence).


`sent_tokenize` can be used to split the text into shorter sequences that correspond to sentences as we know them. This tokenization is mostly driven by punctuation such as the full stop.

In [9]:
s=train_text[0]
s


"What a disappointment! I hated the mummy but this one was even worse! It was very tiring and unbelievable and at a certain point I found myself sighing and yawning all the time. I can't believe that people actually liked this movie. The role of Nicholas Cage wasn't very convincing. The whole movie felt like a grand tour around America's most wanted buildings. The never stopping flow of hints and combinations wasn't very convincing either. I stopped paying attention around 30 minutes. What was supposed to be a happy night out became a total disappointment. What a drag... I guess I've just seen too many movies to enjoy National Treasure."

In [12]:
nltk.tokenize.sent_tokenize(s)

['What a disappointment!',
 'I hated the mummy but this one was even worse!',
 'It was very tiring and unbelievable and at a certain point I found myself sighing and yawning all the time.',
 "I can't believe that people actually liked this movie.",
 "The role of Nicholas Cage wasn't very convincing.",
 "The whole movie felt like a grand tour around America's most wanted buildings.",
 "The never stopping flow of hints and combinations wasn't very convincing either.",
 'I stopped paying attention around 30 minutes.',
 'What was supposed to be a happy night out became a total disappointment.',
 'What a drag...',
 "I guess I've just seen too many movies to enjoy National Treasure."]

## Word tokenization


In [15]:
from nltk import word_tokenize
word_tokenize(s)

['What',
 'a',
 'disappointment',
 '!',
 'I',
 'hated',
 'the',
 'mummy',
 'but',
 'this',
 'one',
 'was',
 'even',
 'worse',
 '!',
 'It',
 'was',
 'very',
 'tiring',
 'and',
 'unbelievable',
 'and',
 'at',
 'a',
 'certain',
 'point',
 'I',
 'found',
 'myself',
 'sighing',
 'and',
 'yawning',
 'all',
 'the',
 'time',
 '.',
 'I',
 'ca',
 "n't",
 'believe',
 'that',
 'people',
 'actually',
 'liked',
 'this',
 'movie',
 '.',
 'The',
 'role',
 'of',
 'Nicholas',
 'Cage',
 'was',
 "n't",
 'very',
 'convincing',
 '.',
 'The',
 'whole',
 'movie',
 'felt',
 'like',
 'a',
 'grand',
 'tour',
 'around',
 'America',
 "'s",
 'most',
 'wanted',
 'buildings',
 '.',
 'The',
 'never',
 'stopping',
 'flow',
 'of',
 'hints',
 'and',
 'combinations',
 'was',
 "n't",
 'very',
 'convincing',
 'either',
 '.',
 'I',
 'stopped',
 'paying',
 'attention',
 'around',
 '30',
 'minutes',
 '.',
 'What',
 'was',
 'supposed',
 'to',
 'be',
 'a',
 'happy',
 'night',
 'out',
 'became',
 'a',
 'total',
 'disappointment',

## Stop words

Not every word contributes to the semantics or meaning. Words like 'the', 'to', 'on', 'we', etc. are not important for many tasks, especially classification tasks.

Such words are called stop words.
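A small sketch of stop-word filtering follows. The tiny hand-written set `toy_stop_words` and the helper `drop_stop_words` are assumptions for illustration; the set is not NLTK's list. The second half shows that the order of steps matters: if punctuation is stripped before filtering, a contraction such as "wasn't" becomes "wasnt" and no longer matches the stop-word entry, which is one plausible reason tokens like `wasnt` survive in the normalized output later in this notebook.

```python
import string

# Tiny hand-written stop-word set, for illustration only (not NLTK's list)
toy_stop_words = {"the", "a", "was", "wasn't", "very"}

def drop_stop_words(words, stop_words):
    """Keep only the words that are not in the stop-word collection."""
    return [w for w in words if w not in stop_words]

words = "the role wasn't very convincing".split()

# Filtering the raw tokens removes the contraction as intended.
filtered = drop_stop_words(words, toy_stop_words)
print(filtered)

# Stripping punctuation first turns "wasn't" into "wasnt", which is not in the set.
strip_table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(strip_table) for w in words]
filtered_after_strip = drop_stop_words(stripped, toy_stop_words)
print(filtered_after_strip)
```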


In [62]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words=stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Putting the whole pipeline together:

In [92]:
def remove_special_chars(text):
    re1 = re.compile(r'  +')
    x1 = text.lower().replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x1))


def remove_non_ascii(text):
    """Remove non-ASCII characters from list of tokenized words"""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  
def to_lowercase(text):
  return text.lower()

def remove_punctuation(text):
  # Build a translation table that deletes every character in string.punctuation; translate() applies it.
  translator=str.maketrans('','',string.punctuation)
  return text.translate(translator)

def replace_numbers(text):
  """Replace all interger occurrences in list of tokenized words with textual representation"""
  return re.sub(r'\d+','',text)

def remove_whitespace(text):
  return text.strip()

def text2words(text):
  return word_tokenize(text)

def remove_stopwords(words,stop_words):
  return [word for word in words if word not in stop_words]

def stem_words(words):
  """Stem words; returns a list, like the lemmatize functions, so the steps stay interchangeable"""
  stemmer=PorterStemmer()
  return [stemmer.stem(word) for word in words]

def lemmatize_words(words):
    """Lemmatize words in text"""

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

def lemmatize_verbs(words):
    """Lemmatize verbs in text"""

    lemmatizer = WordNetLemmatizer()
    return  [lemmatizer.lemmatize(word, pos='v') for word in words]

def normalize_text(text):
    text = remove_special_chars(text)
    text = remove_non_ascii(text)
    text = remove_punctuation(text)
    text = to_lowercase(text)
    text = replace_numbers(text)
    words = text2words(text)
    words = remove_stopwords(words, stop_words)
    #words = stem_words(words)  # Either stem or lemmatize
    words = lemmatize_words(words)
    words = lemmatize_verbs(words)

    return ' '.join(words)


In [93]:
normalize_text(train_text[0])

'disappointment hat mummy one even worse tire unbelievable certain point find sigh yawn time cant believe people actually like movie role nicholas cage wasnt convince whole movie felt like grand tour around america want build never stop flow hint combination wasnt convince either stop pay attention around minute suppose happy night become total disappointment drag guess ive see many movie enjoy national treasure'

Now let's apply this to the whole corpus:

In [94]:
def normalize_corpus(corpus):
  return [normalize_text(t) for t in corpus]

In [99]:
train_data=normalize_corpus(train_text)
test_data=normalize_corpus(test_text)

In [98]:
train_data[1]

'shall waste time write anything much every aspect film indescribably bad do great detail already many time plot start uninspiring cockney wideboygangsterbynumbers bore quickly descend utter shamble anybody pretend see hide masterpiece inside awful mess kid year since watch week run cinema pull yet stick mind easily terrible film ever see make comment indeed reason go see film amuse fact brother eddie appear second heavy pub scene hand thrust zippo lighter towards rhys ifans face bar russia actually film former butlins holiday camp barry island brother absolutely act experience whatsoever recently join extra agency first part see film appear nobody require act experience whatsoever remember people whole cinema couple day release never hear film unpopular disappear fast rightly case think rent film dvd would advise instead put two pound coin fire redhot jam eye socket probably lot le painful watch film'

### Text preparation
The preparation phase transforms the text into a numeric (binary/integer) format.

For that we need a vocabulary:

## Vocab and inverse vocab
A vocabulary is a mapping (dict) from words to indices (integers). Ideally it would represent ALL the words in a language, but collecting ALL words is hard, so we count only the words that appear in our dataset/corpus.

Since we don't account for all words, we might encounter Out-Of-Vocab (OOV) words for which we don't have a mapping. So we usually reserve a special token index for UNKnown words.
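The idea of a reserved unknown index can be sketched on a toy corpus. The token string `"<unk>"`, the choice of index `0`, and the names `toy_str2idx` and `encode` are assumptions for illustration, not something this notebook defines.

```python
UNK_TOKEN = "<unk>"  # hypothetical placeholder for out-of-vocab words

toy_corpus = ["movie be great", "movie be bad"]
toy_words = sorted({w for text in toy_corpus for w in text.split()})

# Reserve index 0 for unknown words; known words start at 1.
toy_str2idx = {UNK_TOKEN: 0}
toy_str2idx.update({w: i for i, w in enumerate(toy_words, start=1)})

def encode(text, str2idx):
    """Map each word to its index, falling back to the UNK index."""
    unk_idx = str2idx[UNK_TOKEN]
    return [str2idx.get(w, unk_idx) for w in text.split()]

# "terrible" never appeared in the toy corpus, so it maps to the UNK index.
encoded = encode("movie be terrible", toy_str2idx)
print(encoded)
```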

### Manual

In [110]:
texts=train_data+test_data
# Flatten the corpus: loop over every text, then over every word in text.split()
words=[word for text in texts for word in text.split()]
corp=sorted(list(set(words)))
corp

['\x08\x08\x08\x08a',
 '\x10own',
 'aa',
 'aaa',
 'aaaaaaaaaaaahhhhhhhhhhhhhh',
 'aaaaaaaargh',
 'aaaaaaah',
 'aaaaaaahhhhhhggg',
 'aaaaagh',
 'aaaaah',
 'aaaaargh',
 'aaaaarrrrrrgggggghhhhhh',
 'aaaaatchkah',
 'aaaaaw',
 'aaaahhhhhh',
 'aaaahhhhhhh',
 'aaaand',
 'aaaarrgh',
 'aaaawwwwww',
 'aaaggghhhhhhh',
 'aaagh',
 'aaah',
 'aaahhhhhhh',
 'aaahthe',
 'aaall',
 'aaand',
 'aaargh',
 'aaarrrghim',
 'aaaugh',
 'aab',
 'aachen',
 'aada',
 'aadha',
 'aadmittedly',
 'aag',
 'aage',
 'aagh',
 'aaghh',
 'aah',
 'aahemy',
 'aahhh',
 'aahhhh',
 'aaila',
 'aailiyah',
 'aaip',
 'aaja',
 'aajala',
 'aak',
 'aakash',
 'aake',
 'aaker',
 'aakrosh',
 'aalcc',
 'aaliot',
 'aaliyah',
 'aaliyahs',
 'aalox',
 'aames',
 'aamess',
 'aamilne',
 'aamir',
 'aamirs',
 'aamirsalmanraveenakarishma',
 'aamr',
 'aan',
 'aanekoski',
 'aankh',
 'aankhen',
 'aaoon',
 'aap',
 'aapke',
 'aapkey',
 'aaran',
 'aardman',
 'aardmans',
 'aardvark',
 'aarf',
 'aargh',
 'aarghlets',
 'aarika',
 'aaron',
 'aaroncurb',
 'aaron

### Build the vocabulary manually

In [118]:
str2idx={w:i for i,w in enumerate(corp)}
idx2str={i:w for i,w in enumerate(corp)}
len(str2idx)

140832

We may also want `word_counts` and an ordered `str2idx` dict, as in the Keras `Tokenizer`.
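One possible way to get both is `collections.Counter`. This is a sketch on toy data with hypothetical names (`toy_texts`, `toy_word_counts`, `toy_str2idx`). Ordering by descending frequency and starting the indices at 1 follow what the Keras `Tokenizer` is understood to do for `word_index`; treat that as an assumption rather than a statement about its implementation.

```python
from collections import Counter

toy_texts = ["good movie good", "bad movie good"]

# Count how often each word occurs across the whole corpus.
toy_word_counts = Counter(w for text in toy_texts for w in text.split())

# most_common() returns (word, count) pairs sorted by descending count.
ordered_words = [w for w, _ in toy_word_counts.most_common()]

# Frequent words get small indices; index 0 is left free (e.g. for padding).
toy_str2idx = {w: i for i, w in enumerate(ordered_words, start=1)}
toy_idx2str = {i: w for w, i in toy_str2idx.items()}

print(toy_word_counts)
print(toy_str2idx)
```

Since Python 3.7 a plain `dict` keeps insertion order, so `toy_str2idx` stays ordered by frequency without needing `OrderedDict`.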