# The goal of this task is to generate feature vectors from a set of documents:
 - TF (term frequency)
 - TF-IDF (term frequency and inverse document frequency)

By the end of the task, you should have generated these vectors for the datasets in the DATA folder.
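
As a rough illustration of the two representations, the sketch below computes raw term counts (TF) and an IDF weight by hand. The toy corpus and the helper names `term_frequency` and `inverse_document_frequency` are assumptions made for this example. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's `TfidfTransformer`; other definitions of IDF exist.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "profit rose and sales rose",
    "profit fell",
]

def term_frequency(doc):
    """Raw count of each whitespace-separated term in one document."""
    return Counter(doc.split())

def inverse_document_frequency(docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, with df = documents containing the term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

tf = [term_frequency(doc) for doc in docs]
idf = inverse_document_frequency(docs)

# TF-IDF weight of a term in a document = its TF times its IDF (before normalization).
tfidf = [{term: count * idf[term] for term, count in doc_tf.items()} for doc_tf in tf]

print(tf[0]["rose"])    # term frequency of "rose" in the first document
print(idf["profit"])    # "profit" occurs in every document
print(idf["sales"])     # "sales" occurs in only one document
```

A term that occurs in every document gets the lowest IDF, while rarer terms are weighted up.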

## Environment Setup

In [40]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/alexmarino/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/alexmarino/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/alexmarino/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from wordcloud import WordCloud

## Import the Dataset

In [17]:
df = pd.read_csv('../DATA/data.csv')
df.head(2)

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative


## Punctuation for NLP

In [8]:
df['clean_punct'] = df['Sentence'].str.replace('[,.:;!?]+', ' ', regex=True).copy()

In [18]:
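# Raw string avoids invalid-escape warnings; this starts again from 'Sentence', so commas and periods are still present in the output below.
df['clean_punct'] = df['Sentence'].str.replace(r'[/<>()|\+\-\$%&#@\'\"]+', ' ', regex=True).copy()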
df

Unnamed: 0,Sentence,Sentiment,clean_punct
0,The GeoSolutions technology will leverage Bene...,positive,The GeoSolutions technology will leverage Bene...
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative,"ESI on lows, down 1.50 to 2.50 BK a real po..."
2,"For the last quarter of 2010 , Componenta 's n...",positive,"For the last quarter of 2010 , Componenta s n..."
3,According to the Finnish-Russian Chamber of Co...,neutral,According to the Finnish Russian Chamber of Co...
4,The Swedish buyout firm has sold its remaining...,neutral,The Swedish buyout firm has sold its remaining...
...,...,...,...
5837,RISING costs have forced packaging producer Hu...,negative,RISING costs have forced packaging producer Hu...
5838,Nordic Walking was first used as a summer trai...,neutral,Nordic Walking was first used as a summer trai...
5839,"According shipping company Viking Line , the E...",neutral,"According shipping company Viking Line , the E..."
5840,"In the building and home improvement trade , s...",neutral,"In the building and home improvement trade , s..."


## Numbers for NLP

In [21]:
df['clean_punct'] = df['clean_punct'].str.replace('[0-9]+', '', regex=True).copy()
df['clean_punct'] = df['clean_punct'].str.replace(r'[/<>()|\+\-\$%&#@\'\"]+', ' ', regex=True).copy()
df['clean_punct'] = df['clean_punct'].str.replace(r'[,.:;!?]+', ' ', regex=True).copy()
df

Unnamed: 0,Sentence,Sentiment,clean_punct
0,The GeoSolutions technology will leverage Bene...,positive,The GeoSolutions technology will leverage Bene...
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative,ESI on lows down to BK a real possibility
2,"For the last quarter of 2010 , Componenta 's n...",positive,For the last quarter of Componenta s net s...
3,According to the Finnish-Russian Chamber of Co...,neutral,According to the Finnish Russian Chamber of Co...
4,The Swedish buyout firm has sold its remaining...,neutral,The Swedish buyout firm has sold its remaining...
...,...,...,...
5837,RISING costs have forced packaging producer Hu...,negative,RISING costs have forced packaging producer Hu...
5838,Nordic Walking was first used as a summer trai...,neutral,Nordic Walking was first used as a summer trai...
5839,"According shipping company Viking Line , the E...",neutral,According shipping company Viking Line the E...
5840,"In the building and home improvement trade , s...",neutral,In the building and home improvement trade s...
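
The three replacements above could also be bundled into one helper. This is a minimal sketch using the standard `re` module; the function name `clean_text` and the final whitespace-collapsing step are additions assumed for this example, not part of the cells above.

```python
import re

def clean_text(text):
    """Apply the three replacements used above, then collapse repeated spaces."""
    text = re.sub(r"[0-9]+", "", text)                  # drop digits
    text = re.sub(r"[/<>()|+\-$%&#@'\"]+", " ", text)   # symbols -> space
    text = re.sub(r"[,.:;!?]+", " ", text)              # punctuation -> space
    return " ".join(text.split())                       # assumed extra step: collapse whitespace

print(clean_text("$ESI on lows, down $1.50 to $2.50 BK a real possibility"))
# ESI on lows down to BK a real possibility
```

With pandas, the same helper could be applied as `df['Sentence'].map(clean_text)`.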


## Stop Word Removal

In [26]:
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

## Tokenization

In [41]:
sentence = df["clean_punct"].iloc[1]
nltk.word_tokenize(sentence)


['ESI', 'on', 'lows', 'down', 'to', 'BK', 'a', 'real', 'possibility']
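
To see what stop word removal does to these tokens, the sketch below filters them against a small, hard-coded subset of the NLTK English list shown earlier, so it runs without any downloads. The names `stop_words_subset` and `remove_stop_words` are assumptions for this example.

```python
# Subset of the NLTK English stop word list (hard-coded for illustration).
stop_words_subset = {"on", "down", "to", "a", "the", "of", "in"}

# Tokens produced by nltk.word_tokenize in the cell above.
tokens = ['ESI', 'on', 'lows', 'down', 'to', 'BK', 'a', 'real', 'possibility']

def remove_stop_words(tokens, stop_words):
    """Lowercase each token and keep only those not in the stop word set."""
    return [t.lower() for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(tokens, stop_words_subset)
print(filtered)  # ['esi', 'lows', 'bk', 'real', 'possibility']
```

In practice the full list from `stopwords.words('english')` would be passed in place of the subset.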

## CountVectorizer

In [43]:
countVectorizer = CountVectorizer(strip_accents='ascii', 
                      lowercase=True, 
                      stop_words=stop_words)
countVectorizer.fit(df['clean_punct'][:2])

In [45]:
countVectorizer.get_feature_names_out()

array(['based', 'benefon', 'bk', 'commercial', 'communities', 'content',
       'esi', 'geosolutions', 'gps', 'leverage', 'location', 'lows',
       'model', 'multimedia', 'new', 'platform', 'possibility',
       'powerful', 'providing', 'real', 'relevant', 'search', 'solutions',
       'technology'], dtype=object)

In [49]:
MTX = countVectorizer.transform(df['clean_punct'][:2])
MTX.toarray()

array([[1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
        1, 2],
       [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0]])
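
Each column of the matrix corresponds to one entry of `get_feature_names_out()`, in the same order. The sketch below pairs the two outputs shown above, copied in as plain lists, to list the terms that actually occur in each document; the helper name `nonzero_terms` is an assumption for this example.

```python
# Vocabulary and count matrix copied from the outputs above.
feature_names = [
    'based', 'benefon', 'bk', 'commercial', 'communities', 'content',
    'esi', 'geosolutions', 'gps', 'leverage', 'location', 'lows',
    'model', 'multimedia', 'new', 'platform', 'possibility',
    'powerful', 'providing', 'real', 'relevant', 'search', 'solutions',
    'technology',
]
counts = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 2],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
]

def nonzero_terms(row, names):
    """Map each term that occurs in the document to its count."""
    return {name: count for name, count in zip(names, row) if count > 0}

for doc_index, row in enumerate(counts):
    print(doc_index, nonzero_terms(row, feature_names))
```

The second document keeps only `bk`, `esi`, `lows`, `possibility` and `real`, since its other tokens are stop words. With the live objects, `pd.DataFrame(MTX.toarray(), columns=countVectorizer.get_feature_names_out())` gives the same labeled view.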

## TFIDF

In [52]:
tfidf = TfidfTransformer(use_idf=True, norm="l2")

# Using the term-frequency matrix to generate TF-IDF

In [55]:
tfVectorizer = tfidf.fit(MTX)
DTM = tfVectorizer.transform(MTX)
DTM.toarray()

array([[0.2      , 0.2      , 0.       , 0.2      , 0.2      , 0.2      ,
        0.       , 0.2      , 0.2      , 0.2      , 0.4      , 0.       ,
        0.2      , 0.2      , 0.2      , 0.2      , 0.       , 0.2      ,
        0.2      , 0.       , 0.2      , 0.2      , 0.2      , 0.4      ],
       [0.       , 0.       , 0.4472136, 0.       , 0.       , 0.       ,
        0.4472136, 0.       , 0.       , 0.       , 0.       , 0.4472136,
        0.       , 0.       , 0.       , 0.       , 0.4472136, 0.       ,
        0.       , 0.4472136, 0.       , 0.       , 0.       , 0.       ]])
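
The values above can be checked by hand. Because the two documents share no terms, every term has the same document frequency and therefore the same IDF, so after normalization each row is just the count vector divided by its norm. The sketch below recomputes this in plain Python, assuming scikit-learn's default smoothed IDF, ln((1 + n) / (1 + df)) + 1; the function name `tfidf_rows` is hypothetical, and the `norm` argument also allows a comparison with l1.

```python
import math

# Count matrix copied from the CountVectorizer output above.
counts = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 2],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
]

def tfidf_rows(counts, norm="l2"):
    """Weight counts by smoothed IDF, then scale each row to unit l1 or l2 norm."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in weighted))  # Euclidean length
        else:
            scale = sum(abs(v) for v in weighted)            # sum of absolute values
        result.append([v / scale for v in weighted])
    return result

l2 = tfidf_rows(counts, norm="l2")
l1 = tfidf_rows(counts, norm="l1")
print(round(l2[0][0], 7), round(l2[0][10], 7), round(l2[1][2], 7))
print(round(l1[0][0], 7), round(l1[1][2], 7))
```

The l2 rows reproduce the 0.2, 0.4 and 0.4472136 entries shown above. With l1 the entries of each row sum to 1, whereas with l2 their squares do.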

Everyone who reaches this point should complete the following tasks:


1 - Plot the most frequent terms for each of the following n-gram ranges:
    ngram(1,1)
    ngram(1,2)
    ngram(2,2)
    ngram(2,3)

2 - Build a word cloud for each of the n-gram configurations above.

3 - Investigate how removing the stop word list affects the size of the final vocabulary.

4 - Investigate the l1 and l2 normalization techniques.