## Reading the Data

In [151]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [152]:
df = pd.read_csv('../imdb_reviews.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


## Functions

In [154]:
def map_sentiments(x):
    if x == 'positive':
        return 1
    return 0

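def create_new_dfs(df_train, df_test, text_col, representation, vocab_size=1000):
    """Vectorize text_col and return dense train/test DataFrames with a 'target_val' column."""
    if representation == 'bow':
        vectorizer = CountVectorizer(max_features=vocab_size)
    elif representation == 'tfidf':
        # norm=None keeps the raw tf * idf values instead of L2-normalizing each row
        vectorizer = TfidfVectorizer(norm=None, max_features=vocab_size)
    else:
        raise ValueError(f"Unknown representation {representation!r}; expected 'bow' or 'tfidf'")

    # Fit the vocabulary on the training split only, then reuse it for the test split
    X_train = vectorizer.fit_transform(df_train[text_col].values)
    X_test = vectorizer.transform(df_test[text_col].values)

    feature_names = vectorizer.get_feature_names_out()
    new_df_train = pd.DataFrame(X_train.toarray(), columns=feature_names)
    new_df_test = pd.DataFrame(X_test.toarray(), columns=feature_names)

    new_df_train['target_val'] = df_train['sentiment'].values
    new_df_test['target_val'] = df_test['sentiment'].values

    return new_df_train, new_df_test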

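def cross_validation(df, model, representation, k=5):
    """Run k-fold cross-validation with disjoint, randomly drawn test folds and print the accuracies."""
    idx_list = np.arange(df.shape[0])
    # Indices not yet used as test data; this pool shrinks after every fold
    test_available_idx = np.arange(df.shape[0])
    # Integer division avoids float rounding issues in (1/k) * n
    test_size = len(idx_list) // k
    acc = 0
    
    print("Accuracy per fold:")
    for i in range(k):
        # Draw this fold's test set from the unused indices and train on all other rows
        test_idx = np.random.choice(test_available_idx, test_size, replace=False)
        train_idx = np.setdiff1d(idx_list, test_idx)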
        
        df_train = df.iloc[train_idx]
        df_test = df.iloc[test_idx]
        
        train, test = create_new_dfs(df_train, df_test, 'review', representation)
        
        X_train = train.drop('target_val',axis=1)
        y_train = train['target_val'].values
        
        X_test = test.drop('target_val', axis=1)
        y_test = test['target_val'].values
        
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        
        test_available_idx = np.setdiff1d(test_available_idx, test_idx)
        score = accuracy_score(y_test, preds)
        acc += score
        print(f"  Fold number {i+1}: {score:.3f}")
        
    print(f"\nAverage accuracy score is {acc/k:.3f}")
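For comparison, the manual loop above could also be expressed with scikit-learn's built-in tools. The following is only a sketch under assumptions: the toy `texts` and `labels` are hypothetical stand-ins for the review and sentiment columns, and `KFold` with `shuffle=True` is swapped in for the hand-written fold sampling. Wrapping the vectorizer in a pipeline keeps the property that the vocabulary is fitted on the training part of each fold only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the review and sentiment columns
texts = ["great movie", "loved it", "wonderful acting", "great fun", "really wonderful",
         "terrible movie", "hated it", "awful acting", "terrible bore", "really awful"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer on the training part of every fold
pipeline = make_pipeline(
    CountVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000),
)

# shuffle=True with a fixed seed gives disjoint, reproducible folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")

print("Accuracy per fold:")
for i, score in enumerate(scores, start=1):
    print(f"  Fold number {i}: {score:.3f}")
print(f"\nAverage accuracy score is {scores.mean():.3f}")
```

Unlike `np.random.choice` without a seed, the fixed `random_state` makes the folds reproducible between runs.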

## Bag of Words

- As a first step, we use Bag of Words to represent our data.
- In this representation, each review in the dataset becomes a row with vocab_size + 1 columns: one per vocabulary word, plus the target column.
- Each word column corresponds to a different word, specifically one of the vocab_size most frequent words.
- The value at position (i, j) of the new dataset is the number of times word j appears in review i.

In [156]:
df['sentiment'] = df['sentiment'].apply(map_sentiments)
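As an alternative to `apply` with a Python function, the same encoding could be sketched with `Series.map` and a dictionary. The `sentiment` series below is a hypothetical stand-in for the dataset column. One difference worth noting: a label missing from the dictionary becomes `NaN`, whereas `map_sentiments` silently turns anything other than 'positive' into 0 (including already-encoded values if the cell is run twice).

```python
import pandas as pd

# Hypothetical stand-in for df['sentiment']
sentiment = pd.Series(['positive', 'negative', 'positive'])

# Explicit mapping; labels missing from the dict would become NaN
encoded = sentiment.map({'positive': 1, 'negative': 0})
print(encoded.tolist())
```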

In [160]:
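# Small demo on the first rows. .loc slices include both endpoints, so row 10 lands in both splits.
bow_train, bow_val = create_new_dfs(df.loc[:10, :], df.loc[10:20, :], 'review', 'bow')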
bow_train

Unnamed: 0,10,15,1990,25,70,950,about,accustomed,acting,action,...,wrenching,writing,written,years,york,you,young,your,zombie,target_val
0,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,3,0,1,0,1
1,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,1
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,2,0
4,0,0,0,0,0,0,2,0,1,1,...,0,0,0,0,1,0,0,0,0,1
5,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
6,1,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,3,0,0,0,1
7,0,0,1,0,1,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
8,0,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,4,1,0,0,1


In [157]:
model_bow = LogisticRegression(max_iter=1000)
cross_validation(df, model_bow, 'bow')

Accuracy per fold:
  Fold number 1: 0.870
  Fold number 2: 0.863
  Fold number 3: 0.866
  Fold number 4: 0.860
  Fold number 5: 0.867

Average accuracy score is 0.865


## TF-IDF (Term Frequency-Inverse Document Frequency)

- TF-IDF is a way of representing the importance of a word $t$ in a document $d$.
- The TF-IDF of a word is directly proportional to the number of times the word appears in the document.
- It also grows with the logarithm of the inverse of the fraction ${df}_t / N$ of documents that contain $t$ in a corpus of size $N$, so words found in fewer documents receive a higher weight.

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(1 + \frac{N}{\mathrm{df}_t}\right)
$$

In [158]:
# Same demo rows as before. .loc slices include both endpoints, so row 10 lands in both splits.
tfidf_train, tfidf_val = create_new_dfs(df.loc[:10, :], df.loc[10:20, :], 'review', 'tfidf')
tfidf_train

Unnamed: 0,10,15,1990,25,70,950,about,accustomed,acting,action,...,wrenching,writing,written,years,york,you,young,your,zombie,target_val
0,0.0,0.0,0.0,0.0,0.0,0.0,1.693147,2.791759,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.079442,0.0,2.791759,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,1.693147,0.0,0.0,0.0,...,0.0,0.0,2.791759,0.0,0.0,1.693147,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.098612,0.0,0.0,2.386294,0.0,0.0,1
3,2.386294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.386294,0.0,0.0,5.583519,0
4,0.0,0.0,0.0,0.0,0.0,0.0,3.386294,0.0,2.386294,2.791759,...,0.0,0.0,0.0,0.0,2.791759,0.0,0.0,0.0,0.0,1
5,0.0,2.791759,0.0,2.791759,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.098612,0.0,0.0,0.0,0.0,0.0,1
6,2.386294,0.0,0.0,0.0,0.0,0.0,3.386294,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.079442,0.0,0.0,0.0,1
7,0.0,0.0,2.791759,0.0,2.791759,0.0,0.0,0.0,0.0,0.0,...,0.0,2.791759,0.0,2.098612,0.0,0.0,0.0,0.0,0.0,0
8,0.0,0.0,0.0,0.0,0.0,2.791759,1.693147,0.0,2.386294,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.791759,0.0,0.0,0.0,0.0,6.772589,2.386294,0.0,0.0,1


In [159]:
model_tfidf = LogisticRegression(max_iter=1000)
cross_validation(df, model_tfidf, 'tfidf')

Accuracy per fold:
  Fold number 1: 0.872
  Fold number 2: 0.868
  Fold number 3: 0.859
  Fold number 4: 0.872
  Fold number 5: 0.864

Average accuracy score is 0.867
