<a href="https://colab.research.google.com/github/finardi/IA376A/blob/master/CountVectorizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Name: Paulo Finardi


Functions to implement:
1. vocab = build_vocab(corpus)
2. corpus_tok = tokenizer(corpus, vocab)
3. doc_term = feature(corpus_tok)

While debugging your program, use a very small corpus with only a few examples; once it is debugged, run it on the 1000 examples in imdb_sample.

# Libraries used

In [None]:
import numpy as np
import pandas as pd
import re
import torch
from itertools import chain

# Downloading the IMDB_sample dataset (only 1000 examples)

The dataset is loaded from the datasets provided by the fast.ai course: https://course.fast.ai/datasets.html

The wget command fetches the file imdb_sample.tgz.
The tar command extracts the archive into the local directory.
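
Where `wget` and `tar` are not available, the same two steps can be sketched with Python's standard library. The helper name `fetch_and_extract` is hypothetical, and skipping the download when the archive already exists is an assumption meant to mirror the `-nc` flag.

```python
import os
import tarfile
import urllib.request

def fetch_and_extract(url, archive_path, dest_dir="."):
    """Hypothetical helper: download `url` unless the archive exists, then unpack it."""
    # Mirror `wget -nc`: do not download again if the file is already present
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    # Mirror `tar -xzf`: unpack the gzip-compressed archive into dest_dir
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir

# Usage (needs network access), with the URL used in this notebook:
# fetch_and_extract("http://files.fast.ai/data/examples/imdb_sample.tgz", "imdb_sample.tgz")
```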

In [None]:
!wget -nc http://files.fast.ai/data/examples/imdb_sample.tgz
!tar -xzf imdb_sample.tgz

--2020-03-08 21:17:13--  http://files.fast.ai/data/examples/imdb_sample.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 571827 (558K) [application/x-gtar-compressed]
Saving to: ‘imdb_sample.tgz’


2020-03-08 21:17:14 (1.66 MB/s) - ‘imdb_sample.tgz’ saved [571827/571827]



In [None]:
df = pd.read_csv('imdb_sample/texts.csv')
print(df.shape)
df.head()

(1000, 3)


Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


# My solution

In [None]:
# text with some random special chars %#$[] etc...
my_text = ["O valor-p, é a #prob. de se obter% uma estatística de teste igual ou \
           mais extrema que aquela observada em uma amostra, sob a hipótese nula.",
           "Um valor-p pequeno significa que a prob. de obter um $valor da \
           estatística[] de teste como o observado é muito improvável, levando \
           assim à rejeição da hipótese nula."
          ]     

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',
          ]

# Function build_vocab

In [None]:
def build_vocab(allsentences):
  # Collect every cleaned token in the corpus, then deduplicate and sort
  words = []
  for sentence in allsentences:
    words.extend(text_cleaner(sentence))
  return sorted(set(words))

def text_cleaner(sentence):
  # Replace anything other than letters, digits, ':' and '-' with a space
  words = re.sub(r"[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ:-]", " ", str(sentence))
  # split() with no arguments already collapses runs of whitespace
  return [w.lower() for w in words.split()]

##########
# Testing
##########
 
vocab = build_vocab(corpus)
print(f'My solution: {vocab}\n')

My solution: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
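
To see what `text_cleaner` does with the special characters in `my_text`, the sketch below runs it on a shortened version of the first sentence. The function is repeated so the block runs on its own; the shortened sentence is an assumption made for brevity.

```python
import re

def text_cleaner(sentence):
    # Same rule as in the notebook: keep letters, digits, ':' and '-';
    # every other character becomes a space
    words = re.sub(r"[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ:-]", " ", str(sentence))
    return [w.lower() for w in words.split()]

sample = "O valor-p, é a #prob. de se obter% uma estatística"
print(text_cleaner(sample))
# prints ['o', 'valor-p', 'é', 'a', 'prob', 'de', 'se', 'obter', 'uma', 'estatística']
```

Characters that are not listed in the character class, such as `à`, are treated as separators and disappear from the vocabulary.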



## Comparing with Sklearn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X  = cv.fit_transform(corpus)
names = cv.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(f'SKLearn: {names}\n')

SKLearn: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']



# My tokenizer class

In [None]:
# My tokenizer class
class Vocab:

  def __init__(self, name):
    self.name = name
    self.token2index   = {}
    self.token2count   = {}
    self.index2token   = {}
    self.num_tokens    = 0  
    self.num_chunks    = 0
    self.longest_chunk = 0
  
  def add_token(self, token):
    if token not in self.token2index:
      self.token2index[token] = self.num_tokens
      self.token2count[token] = 1
      self.index2token[self.num_tokens] = token
      self.num_tokens += 1
    else:
      self.token2count[token] += 1

  def add_chunk(self, chunk):
    chunk_len = 0
    for token in chunk.split(' '):
      chunk_len += 1
      self.add_token(token)
    if chunk_len > self.longest_chunk:
      self.longest_chunk = chunk_len
    self.num_chunks +=1
  
  def to_token(self, index):
    return self.index2token[index]
  
  def to_index(self, token):
    return self.token2index[token]

## Function tokenizer

In [None]:
def tokenizer(corpus, vocab):
  tokenizer_word, tokenizer_encoded =[],[]
  for chunk in corpus:
    word_list  = text_cleaner(chunk)
    token_list = [vocab.to_index(w) for w in word_list] 
    tokenizer_word.append(word_list)
    tokenizer_encoded.append(token_list)
  return tokenizer_word, tokenizer_encoded

##########
# Testing
########## 

voc = Vocab('corpus')  # Tokenizer class; the name is only a label
for token in vocab:    # register each vocabulary token, in sorted order
  voc.add_chunk(token)

tokenized_word, tokenized_encoded = tokenizer(corpus, voc)
print(f'My tokenizer solution: {tokenized_word}')
print(f'{tokenized_encoded}')

My tokenizer solution: [['this', 'is', 'the', 'first', 'document'], ['this', 'document', 'is', 'the', 'second', 'document'], ['and', 'this', 'is', 'the', 'third', 'one'], ['is', 'this', 'the', 'first', 'document']]
[[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]
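
`vocab.to_index` raises a `KeyError` for any token that was never added to the vocabulary. That is fine here, because the vocabulary is built from the same corpus that is tokenized. If unseen text had to be encoded, one option would be to map unknown tokens to a reserved index. The sketch below uses a plain dict; the names `encode_with_unk` and `unk_index` are hypothetical and not part of the solution above.

```python
def encode_with_unk(word_list, token2index, unk_index=-1):
    # Look up each token; fall back to unk_index when the token is unknown
    return [token2index.get(word, unk_index) for word in word_list]

# Vocabulary of the toy corpus, in the same sorted order produced by build_vocab
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
token2index = {token: i for i, token in enumerate(vocab)}

print(encode_with_unk(['this', 'is', 'the', 'fourth', 'document'], token2index))
# prints [8, 3, 6, -1, 1]
```

An index of -1 would have to be filtered out, or replaced by a dedicated extra column, before building the count matrix, because NumPy interprets -1 as the last position.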


## Comparing with Sklearn

In [None]:
skl_tknzd = [cv.build_tokenizer()(chunk) for chunk in corpus]
print(f'SKLearn: {skl_tknzd}')

for chunk in skl_tknzd:
    for i, token in enumerate(chunk):
        chunk[i] = names.index(token.lower())  # replace each token with its vocabulary index
print(f'{skl_tknzd}')

SKLearn: [['This', 'is', 'the', 'first', 'document'], ['This', 'document', 'is', 'the', 'second', 'document'], ['And', 'this', 'is', 'the', 'third', 'one'], ['Is', 'this', 'the', 'first', 'document']]
[[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]


# Function feature

In [None]:
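def feature(tokenized_encoded):
  # The number of columns is inferred from the largest token index seen;
  # default=-1 keeps an empty document from raising inside max()
  max_index = max(max(chunk, default=-1) for chunk in tokenized_encoded)
  allbags = []
  for chunk in tokenized_encoded:
    bag_vector = np.zeros(max_index + 1, dtype=int)
    for position in chunk:
      bag_vector[position] += 1  # count one occurrence of this token
    allbags.append(bag_vector)
  return np.asarray(allbags)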

##########
# Testing
##########

print(f'Corpus: {corpus}')
doc_term = feature(tokenized_encoded)
print(f'My doc_term solution:\n{doc_term}')

Corpus: ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
My doc_term solution:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
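
The counting loop in `feature` can also be written with `np.bincount`, which counts the occurrences of each non-negative integer. This is a sketch of an alternative, not the solution used in this notebook; the name `feature_bincount` and the explicit `vocab_size` argument are assumptions.

```python
import numpy as np

def feature_bincount(tokenized_encoded, vocab_size):
    # One row per document; minlength pads every row to vocab_size columns
    rows = [np.bincount(np.asarray(chunk, dtype=np.int64), minlength=vocab_size)
            for chunk in tokenized_encoded]
    return np.vstack(rows)

# Encoded toy corpus, as printed by the tokenizer test
encoded = [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]
print(feature_bincount(encoded, vocab_size=9))
```

Passing `vocab_size` explicitly keeps the number of columns fixed even when the highest index does not occur in the documents being counted.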


## Comparing with Sklearn

In [None]:
print(f'Corpus: {corpus}')
X = cv.fit_transform(corpus)
print(f'SKLearn:\n{X.toarray()}')

Corpus: ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
SKLearn:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [None]:
# Final comparison
(doc_term == X.toarray()).all()

True
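
Elementwise `==` assumes that both matrices have the same shape. `np.array_equal` checks the shape first and returns `False` on a mismatch, so it is a slightly safer way to write the same comparison. The small arrays below are made up for illustration.

```python
import numpy as np

a = np.array([[0, 1, 1], [1, 0, 2]])
b = a.copy()
c = np.zeros((2, 4), dtype=int)  # different number of columns

print(np.array_equal(a, b))  # True: same shape and same values
print(np.array_equal(a, c))  # False: the shapes differ
```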

## On IMDB with 1000 samples

In [None]:
texts = df['text'].tolist()  # all 1000 reviews

##################
# 1st: build vocab
##################
vocab = build_vocab(texts)

##############################
# 2nd: instantiate Vocab class
##############################
voc = Vocab('imdb_sample')
for token in vocab:
  voc.add_chunk(token)

###########################
# 3rd: tokenize the samples
###########################
tokenized_word, tokenized_encoded = tokenizer(texts, voc)

#####################
# 4th: build doc_term
#####################
doc_term = feature(tokenized_encoded)
print(f'My doc_term solution:\n{doc_term}')
print(f'Shape:\n{doc_term.shape}')

My doc_term solution:
[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [3 0 0 ... 0 0 0]]
Shape:
(1000, 20077)


### With Sklearn (IMDB, 1000 samples)



In [None]:
X = cv.fit_transform(df['text'].tolist())
print(f'SKLearn:\n{X.toarray()}')
print(f'Shape:\n{X.toarray().shape}')

SKLearn:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Shape:
(1000, 18668)


### Conclusion
- My text-processing function is much simpler than Sklearn's, so the two vocabularies differ (20077 columns here versus 18668 for Sklearn).
A more complete set of tokenization rules is needed for the shapes to match.
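
One concrete source of the difference: by default `CountVectorizer` lowercases the text and extracts tokens with the regular expression `(?u)\b\w\w+\b`, which drops one-character tokens and splits on hyphens and colons, whereas `text_cleaner` keeps both. The sketch below reproduces that default pattern with `re` only (it does not call scikit-learn) and compares the two on a sentence from the dataset preview. The helper name `sklearn_like_tokens` is hypothetical.

```python
import re

SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern

def text_cleaner(sentence):
    # Repeated from the notebook so this block runs on its own
    words = re.sub(r"[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ:-]", " ", str(sentence))
    return [w.lower() for w in words.split()]

def sklearn_like_tokens(sentence):
    # Lowercase first, then keep runs of two or more word characters
    return re.findall(SKLEARN_DEFAULT_PATTERN, sentence.lower())

text = "This is a extremely well-made film."
print(text_cleaner(text))         # ['this', 'is', 'a', 'extremely', 'well-made', 'film']
print(sklearn_like_tokens(text))  # ['this', 'is', 'extremely', 'well', 'made', 'film']
```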

### End of notebook
