# DM_09_04
### Import packages
We will use "codecs" to read the text files, "re" (short for "regular expressions") and "collections" to work with tokens, and "nltk" (the "Natural Language Toolkit") for several operations.

In [45]:
# Only needed on Python 2; a __future__ import must come before other statements.
from __future__ import division

%matplotlib inline

import codecs
import re
import copy
import collections

import numpy as np
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

We need some specialized NLTK resources that are not included by default. It is possible to download only the "stopwords" corpus (for example with `nltk.download('stopwords')`), but it may be easier to download everything in NLTK. Note that the full download is very slow; it took more than 30 minutes on my computer.

In [None]:
nltk.download('all')

Import the "stopwords" corpus from NLTK.

In [46]:
from nltk.corpus import stopwords

## Read data

In [47]:
with codecs.open("JaneEyre.txt", "r", encoding="utf-8") as f:
    jane_eyre = f.read()
with codecs.open("WutheringHeights.txt", "r", encoding="utf-8") as f:
    wuthering_heights = f.read()

## Process data
Load the English stop words and add "would" to the list.

In [48]:
esw = stopwords.words('english')
esw.append("would")

Filter tokens (using regular expressions).

In [49]:
# Raw string so that \w is passed to the regex engine unchanged.
word_pattern = re.compile(r"^\w+$")
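
To see what this pattern keeps, here is a small check on a hand-picked list of tokens (the samples are made up for illustration and are not taken from the novels). Anchored at both ends, `\w+` accepts tokens made only of letters, digits, or underscores and rejects punctuation, hyphenated forms, and empty strings.

```python
import re

word_pattern = re.compile(r"^\w+$")

# Hypothetical sample tokens, chosen only to illustrate the filter.
samples = ["jane", "Eyre", "1847", "--", "'", "well-known", ""]

# Keep only the tokens that consist entirely of word characters.
kept = [s for s in samples if word_pattern.match(s)]
print(kept)
```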

Create a function to count tokens.

In [50]:
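def get_text_counter(text):
    # PorterStemmer.stem() expects a single word, so passing it the whole text
    # would not stem individual tokens; the counts are for unstemmed words.
    # Lowercase, then split into runs of word characters and runs of punctuation.
    tokens = WordPunctTokenizer().tokenize(text.lower())
    # Keep purely alphanumeric tokens that are not stop words.
    tokens = [token for token in tokens if word_pattern.match(token) and token not in esw]
    return collections.Counter(tokens), len(tokens)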
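
As a rough illustration of what this function does, the sketch below runs the same steps (lowercase, tokenize, filter, count) on one made-up sentence using only the standard library. It assumes a tiny hand-written stop-word set in place of NLTK's list and uses the regular expression `\w+|[^\w\s]+` as a stand-in for `WordPunctTokenizer`; `toy_stopwords`, `token_pattern`, and `toy_text_counter` are hypothetical names.

```python
import collections
import re

toy_stopwords = {"i", "not", "would"}        # assumption: stand-in for NLTK's list
token_pattern = re.compile(r"\w+|[^\w\s]+")  # runs of word characters or of punctuation
word_pattern = re.compile(r"^\w+$")

def toy_text_counter(text):
    # Lowercase and split into word and punctuation tokens.
    tokens = token_pattern.findall(text.lower())
    # Drop punctuation tokens and stop words.
    tokens = [t for t in tokens if word_pattern.match(t) and t not in toy_stopwords]
    return collections.Counter(tokens), len(tokens)

counter, size = toy_text_counter(
    "Mr. Rochester said: 'Jane, I would not.' Jane said nothing."
)
print(counter.most_common(2), size)
```

The second return value is the number of tokens that survive filtering, which is the denominator used for relative frequencies later on.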

Create a function that computes the absolute and relative frequencies of the most common words.

In [51]:
def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index, columns=["Absolute frequency", "Relative frequency"])
    df.index.name = "Most common words"
    return df
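
The arithmetic behind the two columns is simple: the relative frequency is the absolute count divided by the number of tokens kept after filtering. A minimal standard-library sketch with a made-up counter (`toy_counter`, `toy_size`, and `frequency_rows` are hypothetical names, not part of the notebook):

```python
import collections

# Made-up counts for illustration only.
toy_counter = collections.Counter({"said": 4, "jane": 3, "mr": 1})
toy_size = sum(toy_counter.values())  # 8 tokens in total

def frequency_rows(most_common, size):
    # One (word, absolute frequency, relative frequency) row per entry.
    return [(word, count, count / size) for word, count in most_common]

rows = frequency_rows(toy_counter.most_common(2), toy_size)
for row in rows:
    print(row)
```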

## Analyze individual texts

Compute the most common words in _Jane Eyre_. This takes a while. Then display the 10 most common.

In [52]:
je_counter, je_size = get_text_counter(jane_eyre)
make_df(je_counter.most_common(10), je_size)

Most common words,Absolute frequency,Relative frequency
one,593.0,0.006789
said,584.0,0.006686
mr,543.0,0.006217
could,504.0,0.00577
like,397.0,0.004545
rochester,366.0,0.00419
well,348.0,0.003984
jane,342.0,0.003916
little,341.0,0.003904
sir,315.0,0.003607


Save the 1,000 most common words in _Jane Eyre_ as a CSV file.

In [53]:
je_df = make_df(je_counter.most_common(1000), je_size)
je_df.to_csv("JE_1000.csv")

Compute the most common words in _Wuthering Heights_. This also takes a while. Display the 10 most common.

In [54]:
wh_counter, wh_size = get_text_counter(wuthering_heights)
make_df(wh_counter.most_common(10), wh_size)

Most common words,Absolute frequency,Relative frequency
heathcliff,475.0,0.008735
linton,404.0,0.007429
catherine,379.0,0.00697
said,375.0,0.006896
mr,312.0,0.005738
one,290.0,0.005333
could,279.0,0.005131
master,205.0,0.00377
shall,191.0,0.003512
come,190.0,0.003494


Save the 1,000 most common words in _Wuthering Heights_ as a CSV file.

In [55]:
wh_df = make_df(wh_counter.most_common(1000), wh_size)
wh_df.to_csv("WH_1000.csv")

## Compare texts

Identify the most common words in the two documents.

In [56]:
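all_counter = wh_counter + je_counter
# NOTE: the vocabulary below is taken from wh_counter alone, so words that are
# frequent only in Jane Eyre are left out; pass all_counter.most_common(1000)
# instead to draw the vocabulary from both texts.
all_df = make_df(wh_counter.most_common(1000), 1)
most_common_words = all_df.index.values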
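
Adding two `Counter` objects sums the counts word by word, so a vocabulary drawn from both texts can be read off the combined counter. A small sketch with made-up counts (`toy_wh` and `toy_je` are hypothetical stand-ins for `wh_counter` and `je_counter`):

```python
import collections

# Made-up counts for illustration only.
toy_wh = collections.Counter({"heathcliff": 5, "said": 3, "master": 2})
toy_je = collections.Counter({"rochester": 4, "said": 4, "sir": 1})

# Counter addition sums the counts of words that appear in either text.
toy_all = toy_wh + toy_je

# The most common words across both texts, without their counts.
top_words = [word for word, _ in toy_all.most_common(3)]
print(top_words)
```

In this toy example "rochester" makes the list even though it never occurs in `toy_wh`.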

Create a data frame with the differences in word frequency.

In [57]:
df_data = []
for word in most_common_words:
    je_c = je_counter.get(word, 0) / je_size
    wh_c = wh_counter.get(word, 0) / wh_size
    d = abs(je_c - wh_c)
    df_data.append([je_c, wh_c, d])
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=["Jane Eyre relative frequency", "Wuthering Heights relative frequency",
                                "Relative frequency difference"])
dist_df.index.name = "Most common words"
dist_df.sort_values("Relative frequency difference", ascending=False, inplace=True)
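
The same comparison can be sketched without pandas: for each word, divide its count in each text by that text's token total, take the absolute difference, and sort in descending order. The counts and totals below are made up, and all names are hypothetical.

```python
import collections

# Made-up counts and token totals for illustration only.
toy_je = collections.Counter({"sir": 6, "said": 10})
toy_wh = collections.Counter({"heathcliff": 8, "said": 5, "sir": 1})
toy_je_size, toy_wh_size = 100, 50

rows = []
for word in ["said", "sir", "heathcliff"]:
    # A word missing from a text counts as zero occurrences there.
    je_rel = toy_je.get(word, 0) / toy_je_size
    wh_rel = toy_wh.get(word, 0) / toy_wh_size
    rows.append((word, je_rel, wh_rel, abs(je_rel - wh_rel)))

# Largest relative-frequency difference first.
rows.sort(key=lambda row: row[3], reverse=True)
for row in rows:
    print(row)
```

A word such as "said" that is equally frequent in both toy texts ends up at the bottom, while a name that occurs in only one text rises to the top.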

Display the most distinctive words.

In [58]:
dist_df.head(10)

Most common words,Jane Eyre relative frequency,Wuthering Heights relative frequency,Relative frequency difference
heathcliff,0.0,0.008735,0.008735
linton,0.0,0.007429,0.007429
catherine,1.1e-05,0.00697,0.006958
hareton,0.0,0.003292,0.003292
sir,0.003607,0.000791,0.002816
master,0.001133,0.00377,0.002636
joseph,0.0,0.002575,0.002575
earnshaw,0.0,0.002372,0.002372
cathy,0.0,0.00228,0.00228
edgar,0.0,0.002133,0.002133


Save the complete list of distinctive words as a CSV file named "bronte.csv".

In [59]:
dist_df.to_csv("bronte.csv")