# Logistic Regression for classifying reviews as positive or negative

1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Data analysis</a>
3. <a href="#3">Text cleaning</a>
4. <a href="#4">Training and validation data</a>
5. <a href="#5">Processing the data using a pipeline</a>
6. <a href="#6">Training</a>
7. <a href="#7">Predictions on the validation data</a>
8. <a href="#8">How we could improve</a>

In [None]:
!pip install -q nltk

In [None]:
import os
import numpy as np
import pandas as pd
import nltk, re
import time
import torch
import torch.nn as nn

from os import path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from torch.nn import BCELoss
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

%matplotlib inline
import matplotlib.pyplot as plt

## 1. <a name="1">Reading the dataset</a>


In [None]:
df = pd.read_csv("./data/train.csv")

print(f"The shape of the dataset is: {df.shape}")

Let's look at the first 10 rows of the dataset. 

In [None]:
df.head(10)

## 2. <a name="2">Data analysis</a>

Let's look at the distribution of __isPositive__ field.

In [None]:
df["isPositive"].value_counts()

We can check the number of missing values for each column below.

In [None]:
print(df.isna().sum())

We have missing values in our text fields. We will use the __reviewText__ field, so we fill in its missing values with the empty string.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df["reviewText"] = df["reviewText"].fillna("")

## 3. <a name="3">Text cleaning</a>
(<a href="#0">Go to top</a>)

We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common English stop words. We will use that list, but remove some words from it, mostly negations such as "not" and "don't", because those words are useful for understanding the sentiment of a sentence.
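
To see why negations are worth keeping, here is a minimal sketch that uses a hypothetical four-word stop list (an assumption for illustration, not the NLTK list). When "not" is treated as a stop word, a positive and a negative review collapse to the same tokens.

```python
# Toy stop list for illustration only; the real list comes from NLTK.
toy_stop = {"this", "is", "a", "not"}

def remove_stop_words(sentence, stop_words):
    """Keep only the tokens that are not in the stop word collection."""
    return [w for w in sentence.lower().split() if w not in stop_words]

positive = "This is a good product"
negative = "This is not a good product"

# With "not" treated as a stop word, both reviews yield the same tokens.
naive_pos = remove_stop_words(positive, toy_stop)
naive_neg = remove_stop_words(negative, toy_stop)

# Excluding "not" from the stop list preserves the negation.
kept_stop = toy_stop - {"not"}
kept_neg = remove_stop_words(negative, kept_stop)

print(naive_pos, naive_neg, kept_neg)
```

The same idea motivates the longer exclusion list used in this notebook.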

In [None]:
# Let's get a list of stop words from the NLTK library
stop = stopwords.words("english")

# These words are important for our problem. We don't want to remove them.
excluding = [
    "against",
    "not",
    "don",
    "don't",
    "ain",
    "aren",
    "aren't",
    "couldn",
    "couldn't",
    "didn",
    "didn't",
    "doesn",
    "doesn't",
    "hadn",
    "hadn't",
    "hasn",
    "hasn't",
    "haven",
    "haven't",
    "isn",
    "isn't",
    "mightn",
    "mightn't",
    "mustn",
    "mustn't",
    "needn",
    "needn't",
    "shouldn",
    "shouldn't",
    "wasn",
    "wasn't",
    "weren",
    "weren't",
    "won",
    "won't",
    "wouldn",
    "wouldn't",
]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer("english")

def process_text(texts):
    final_text_list = []
    for sent in texts:

        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence = []
        
        # Lowercase
        sent = sent.lower()
        # Remove leading/trailing whitespace
        sent = sent.strip()
        # Remove extra space and tabs
        sent = re.sub(r"\s+", " ", sent)
        # Remove HTML tags/markups (raw strings avoid invalid escape warnings):
        sent = re.sub(r"<.*?>", "", sent)

        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)  # final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list

Let's inspect the effect of our preprocessing on some text from our dataset.

In [None]:
review = df.reviewText.iloc[0]
print(f"Raw text: \n{review}\n")
print(f"Preprocessed text:\n{process_text([review])}")

In [None]:
review = df.reviewText.iloc[20]
print(f"Raw text: \n{review}\n")
print(f"Preprocessed text:\n{process_text([review])}")
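
The regular expression steps can also be checked in isolation on a made-up string. The sketch below is not part of the pipeline; `basic_clean` and the sample text are hypothetical names and data used only to show what the whitespace and HTML tag patterns do.

```python
import re

def basic_clean(text):
    """Lowercase, trim, collapse whitespace, and drop HTML-like tags."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)   # runs of spaces/tabs/newlines -> one space
    text = re.sub(r"<.*?>", "", text)  # non-greedy: removes each tag separately
    return text

sample = "  Great   product!<br />Would\tbuy <b>again</b>.  "
cleaned = basic_clean(sample)
print(cleaned)
```

Note that replacing a tag with the empty string glues its neighbors together ("product!would" in this sample). Replacing tags with a space before collapsing whitespace is one possible variation if that matters for tokenization.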

## 4. <a name="4">Training and validation data</a>
(<a href="#0">Go to top</a>)

Let's split our dataset into training (90%) and validation (10%). 

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    df[["reviewText"]],
    df["isPositive"].values,
    test_size=0.10,
    shuffle=True,
    random_state=324,
)

In [None]:
print("Processing the reviewText fields...")
X_train["reviewText"] = process_text(X_train["reviewText"].tolist())
X_val["reviewText"] = process_text(X_val["reviewText"].tolist())

In [None]:
# Use TF-IDF to turn each review into a vector of length 750.
tf_idf_vectorizer = TfidfVectorizer(max_features=750)

# Fit the vectorizer to training data
# Don't use the fit() on validation or test datasets
tf_idf_vectorizer.fit(X_train["reviewText"].values)

# Transform text fields
X_train = tf_idf_vectorizer.transform(X_train["reviewText"].values).toarray()
X_val = tf_idf_vectorizer.transform(X_val["reviewText"].values).toarray()

print("Shapes of features: Training and Validation")
print(X_train.shape, X_val.shape)

We can inspect the tokens in the vocabulary (which we've set to be of size 750).

In [None]:
tf_idf_vectorizer.get_feature_names_out()[:10]
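
To make the fit/transform split concrete, here is a simplified pure-Python sketch of TF-IDF. It is not scikit-learn's implementation: tokenization is a plain `split()`, and the helper names `fit_tfidf` and `transform_tfidf` are hypothetical. The smoothed IDF formula and the L2 normalization are chosen to mirror what we believe are the `TfidfVectorizer` defaults.

```python
import math
from collections import Counter

def fit_tfidf(train_docs, max_features):
    """Learn the vocabulary and IDF weights from the training documents only."""
    tokenized = [doc.split() for doc in train_docs]
    term_counts = Counter(w for doc in tokenized for w in doc)
    # Keep the most frequent terms, then index them alphabetically
    top_terms = [w for w, _ in term_counts.most_common(max_features)]
    vocab = {w: i for i, w in enumerate(sorted(top_terms))}
    n_docs = len(tokenized)
    doc_freq = Counter(w for doc in tokenized for w in set(doc) if w in vocab)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Map documents to L2-normalized TF-IDF rows; unseen words are ignored."""
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for w, count in Counter(doc.split()).items():
            if w in vocab:
                row[vocab[w]] = count * idf[w]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return rows

train_docs = ["good product", "bad product", "good good value"]
val_docs = ["good unseen word"]

vocab, idf = fit_tfidf(train_docs, max_features=2)
val_vectors = transform_tfidf(val_docs, vocab, idf)
print(vocab)
print(val_vectors)
```

Words that appear only in the validation text ("unseen", "word") have no column, which is why the vocabulary must come from the training data alone.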

Our __process_text()__ function in section 3 replaces missing values with an empty string.

## 6. <a name="6">Training</a>
(<a href="#0">Go to top</a>)

## Logistic Regression

To handle this binary classification problem, let's first consider a logistic regression model, mapping the input $\mathbf{X}$ to the output $\mathbf{y}$ by

$$
\mathbf{\hat{y}} = \mathrm{sigmoid}(\mathbf{X}\mathbf{w} + \mathbf{b}),
$$

with some initial choices for the parameters: the weights $\mathbf{w}$ and the bias $\mathbf{b}$. We initialize the weights as random Gaussian noise with zero mean and standard deviation 1, and start with zero bias.
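
The formula can be sketched directly in NumPy before moving to PyTorch. The data below is random and stands in for the TF-IDF features; the sizes and the seed are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_samples, n_features = 4, 750
X = rng.random((n_samples, n_features))                    # stand-in features
w = rng.normal(loc=0.0, scale=1.0, size=(n_features, 1))   # Gaussian init, std 1
b = 0.0                                                    # zero bias

y_hat = sigmoid(X @ w + b)   # shape (n_samples, 1)
print(y_hat.ravel())

# With all-zero weights every score is 0, so every probability is 0.5
y_hat_zero = sigmoid(X @ np.zeros((n_features, 1)) + b)
```

Each row of `X` produces one probability, which is later rounded to obtain the predicted class.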


Let's introduce single-layer networks! We use the PyTorch library here. In this first part, we do the following:
* Select our hyper-parameters to use in this problem
* Build the network
* Define loss function and optimizer
* Create data loaders

In [None]:
# How many samples to use for each weight update
batch_size = 16
# Total number of epochs
# One epoch is one pass over all data in the training set
epochs = 15
# Learning rate
lr = 0.01

# Run the training in the GPU if supported by our instance, else in the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Let's build our single layer network (logistic regression here)
net = nn.Sequential(
    nn.Linear(in_features=750, # matches the size of vectorizer: 750
              out_features=1), 
    nn.Sigmoid()
)
net.to(device)

# Initialize the network
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=1)
        nn.init.zeros_(m.bias)
net.apply(init_weights)

# Define the loss. For binary classification the appropriate choice is Binary Cross Entropy.
# As we used sigmoid in the last layer, we use `nn.BCELoss`.
# Otherwise we could have made use of `nn.BCEWithLogitsLoss`.
loss = BCELoss(reduction="none")

# Define the optimizer, SGD (Stochastic Gradient Descent) with learning rate
optimizer = torch.optim.SGD(net.parameters(), lr=lr)

# Use PyTorch DataLoaders to load the data in batches
train_dataset = TensorDataset(
    torch.tensor(X_train, dtype=torch.float32),
    torch.tensor(y_train, dtype=torch.float32),
)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

# Move the validation dataset to the CPU/GPU device
X_val = torch.tensor(X_val, dtype=torch.float32).to(device)
y_val = torch.tensor(y_val, dtype=torch.float32).to(device)
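
The loss comment in the cell above mentions two options: binary cross entropy on probabilities (after the sigmoid) or on raw scores. The NumPy sketch below, with made-up scores and labels, illustrates that both forms give the same per-sample values. The function names are hypothetical, and the second formula is a commonly used numerically stable rewrite, not a copy of PyTorch's internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(p, y):
    """Per-sample binary cross entropy computed on probabilities."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_from_logits(z, y):
    """Same loss computed directly on raw scores, in a stable form."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-3.0, -0.5, 0.0, 2.0])   # raw scores (logits)
y = np.array([0.0, 1.0, 1.0, 1.0])     # binary labels

loss_probs = bce_from_probs(sigmoid(z), y)
loss_logits = bce_from_logits(z, y)
print(loss_probs)
print(loss_logits)
```

Working on raw scores avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1, which is the usual argument for dropping the final sigmoid layer and using the logits variant.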

After these steps, we are ready to train the network. See below for the training and validation code. At a high level, we have three main parts:

__1. Epochs loop:__ Loop over the dataset to learn and update the weights. Neural networks usually need to see the dataset multiple times to learn well, which can make training slow.

__2. Training loop:__ During training, we loop over the data loader, which loads the data in batches of the size we selected earlier. The weights are updated once per batch.

__3. Validation:__ After the training loop for an epoch is done, we go over the validation data and calculate the validation loss.

In [None]:
# Lists to store the losses as the training progresses
train_losses = []
val_losses = []

# Loop over epochs
for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    # Build a training loop to train the network
    for data, target in train_loader:
        # zero the parameter gradients
        optimizer.zero_grad()

        data = data.to(device)
        target = target.to(device).view(-1, 1)

        # Forward pass - compute the predictions of the NN on the batch
        output = net(data)  
        # Compute the loss and sum (error between the net's predictions and the actual labels)
        L = loss(output, target).sum()
        training_loss += L.item() 
        # Calculate gradients
        L.backward()  
        # Update weights with gradient descent
        optimizer.step()  

    # Get validation predictions (no gradients are needed for evaluation)
    with torch.no_grad():
        val_predictions = net(X_val)
        # Calculate the validation loss
        val_loss = torch.sum(loss(val_predictions, y_val.view(-1, 1))).item()

    # Take the average losses
    training_loss = training_loss / len(y_train)
    val_loss = val_loss / len(y_val)

    train_losses.append(training_loss)
    val_losses.append(val_loss)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss:.3f}, Validation_loss {val_loss:.3f}, Seconds {end-start:.3f}"
    )
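
The training loop sums the per-sample losses of each batch and divides by the dataset size once per epoch. A small sketch with made-up loss values shows why this differs from averaging the per-batch means when the last batch is smaller than the others. The numbers and the `batches` helper are hypothetical.

```python
def batches(values, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]

# Hypothetical per-sample losses for 5 samples, with batch_size = 2
per_sample_losses = [0.2, 0.4, 0.6, 0.8, 3.0]
batch_size = 2

# Approach used in the loop: sum within each batch, divide by dataset size once
summed = sum(sum(b) for b in batches(per_sample_losses, batch_size))
avg_from_sums = summed / len(per_sample_losses)

# Alternative: average the per-batch means (the last batch has 1 sample)
batch_means = [sum(b) / len(b) for b in batches(per_sample_losses, batch_size)]
avg_of_means = sum(batch_means) / len(batch_means)

print(avg_from_sums, avg_of_means)
```

The first value is the true mean loss per sample (1.0 here), while the second (about 1.33) gives the single sample in the last batch extra weight.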

In [None]:
plt.plot(train_losses, label="Training Loss")
plt.plot(val_losses, label="Validation Loss")
plt.title("Loss values")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

## 7. <a name="7">Predictions on the validation data</a>
(<a href="#0">Go to top</a>)

Let's get the validation predictions with `val_predictions = net(X_val)`. Then we look at the performance on the validation data. See the comments in the code for more details.

In [None]:
# Get predictions using the trained network: "net"
val_predictions = net(X_val)
# Round up (to 1) or down (to 0) the result (remember the sigmoid).
# Use np.rint() for that
val_predictions = np.rint(val_predictions.detach().cpu().numpy())
print(val_predictions)

Let's get rid of the one extra level in this array. We will simply use the np.squeeze() function. 

In [None]:
val_predictions = np.squeeze(val_predictions)
print(val_predictions)
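
The rounding and squeezing steps can be tried on a tiny array of made-up probabilities. The explicit threshold at the end is an alternative we add for comparison; it is not what the notebook uses.

```python
import numpy as np

# Hypothetical sigmoid outputs with the (n_samples, 1) shape the network returns
probs = np.array([[0.2], [0.5], [0.9]])

labels_2d = np.rint(probs)         # nearest integer; an exact 0.5 rounds to 0
labels_1d = np.squeeze(labels_2d)  # drops the length-1 axis -> shape (3,)

# Explicit threshold alternative: ties at 0.5 are assigned to class 1
labels_threshold = (probs >= 0.5).astype(np.float32).reshape(-1)

print(labels_2d.shape, labels_1d.shape)
print(labels_1d, labels_threshold)
```

`np.rint` rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0. Also, `np.squeeze` on a single prediction of shape `(1, 1)` returns a zero-dimensional array, so `reshape(-1)` may be a safer choice when the batch can contain one sample.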

Confusion matrix, classification report and accuracy score are printed below.

In [None]:
y_val = y_val.detach().cpu().numpy()
print(confusion_matrix(y_val, val_predictions))
print(classification_report(y_val, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val, val_predictions))
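
To read these reports, it helps to compute the same quantities by hand once. The sketch below uses made-up labels and a hypothetical `binary_confusion` helper, and arranges the counts in the layout scikit-learn uses for binary problems (rows are true classes, columns are predicted classes).

```python
def binary_confusion(y_true, y_pred):
    """Count (tn, fp, fn, tp) for labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up labels and predictions for six reviews
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = binary_confusion(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]        # rows = true class, columns = predicted

accuracy = (tp + tn) / len(y_true)   # share of correct predictions
precision = tp / (tp + fp)           # of predicted positives, how many are right
recall = tp / (tp + fn)              # of actual positives, how many are found

print(matrix, accuracy, precision, recall)
```

When the classes are imbalanced, precision and recall per class are usually more informative than accuracy alone.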

## 8. <a name="8">How we could improve</a>
(<a href="#0">Go to top</a>)

* Trying different hyperparameters, such as batch size and learning rate
* Adding more layers to the network
```python
# Example:
net = nn.Sequential(
    # Input size of 750 (the TF-IDF vector length). Linear layer-1 with 100 units
    nn.Linear(in_features=750, out_features=100),
    # Relu activation is applied
    nn.ReLU(),
    # Linear layer-2 with 10 units
    nn.Linear(100, 10),
    # Relu activation is applied
    nn.ReLU(),
    # Output layer with single unit
    nn.Linear(10, 1),
    # Add Sigmoid at the end to turn output to probabilities
    nn.Sigmoid(),
)
```
* Adding new features such as punctuation, capital letters, or keywords
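
As a rough sketch of the last idea, the function below computes a few handcrafted features from a review. The function name, the feature names, and the keyword list are all hypothetical choices for illustration. These features would need to be computed on the raw text, because `process_text()` lowercases the input and its length filter drops punctuation tokens.

```python
def handcrafted_features(text, keywords=("refund", "love")):
    """Return a few simple counts computed on the raw (uncleaned) review text."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    lowered = text.lower()
    return {
        "n_exclamation": text.count("!"),    # emphasis markers
        "n_question": text.count("?"),       # possible complaints or doubts
        "upper_ratio": upper_ratio,          # share of letters in upper case
        "n_keywords": sum(lowered.count(k) for k in keywords),
    }

features = handcrafted_features("I LOVE it!!! Why no refund?")
print(features)
```

Such values could be appended as extra columns next to the 750 TF-IDF features, with `in_features` of the first linear layer increased to match.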