# Data Analysis Portfolio

This notebook is part of my data analysis portfolio, where I explore **three** key areas:
1. Data Processing and Visualization
2. Traditional Machine Learning and Deep Learning
3. Text Sentiment and Topic Modeling

In [1]:
#Import general packages
from IPython.display import Image

## <span style="background-color: #FFE5B4 "> Section 2. Traditional Machine Learning and Deep Learning </span>

### General information
There are various important packages for *traditional machine learning and deep learning*. In the example code below, I will be focusing on:
- pytorch
- scikit-learn

### Import required packages

In [6]:
#Import packages/modules
import sklearn as sk
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

#Import specific objects
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

### <span style="background-color: #FFE5B4 ">2.1 Traditional Machine Learning</span>
scikit-learn is focused on traditional machine learning tasks, such as linear regression, clustering, and support vector machines (SVMs).
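
As a minimal sketch of two of the tasks mentioned above, the code below fits a linear regression and a linear support vector machine on toy data. The data values and variable names are made up for illustration; only `LinearRegression` and `SVC` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Regression: the (hypothetical) target follows y = 2x + 1 exactly
X_reg = np.arange(10, dtype=float).reshape(-1, 1)
y_reg = 2.0 * X_reg.ravel() + 1.0
reg = LinearRegression().fit(X_reg, y_reg)
slope, intercept = reg.coef_[0], reg.intercept_

# Classification: two well-separated clusters of points
X_clf = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_clf = np.array([0, 0, 0, 1, 1, 1])
svm = SVC(kernel="linear").fit(X_clf, y_clf)
predictions = svm.predict([[0.1, 0.1], [3.1, 3.0]])

print(f"slope={slope:.2f}, intercept={intercept:.2f}, predictions={predictions}")
```

Both estimators follow the same `fit` / `predict` pattern, which is what makes it easy to swap one scikit-learn model for another.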

### <span style="background-color: #FFE5B4 ">2.2 Deep Learning</span>
PyTorch is primarily designed for deep learning tasks, such as neural networks (CNNs, RNNs) and transformers (BERT, RoBERTa).

#### Important terminology: PyTorch

- **autograd**: Computes the gradients (slopes) of the loss function with respect to the model's weights.
  - Important: When a forward function is defined in PyTorch, the backward function is automatically generated by PyTorch's autograd system, so the backward function doesn't need to be explicitly defined.
- **backpropagation**: The process of propagating the error backward through the network to compute the gradient of the loss with respect to each weight; these gradients are then used to adjust the weights.
- **batch**: A group of samples that is processed together. The batch size is a hyperparameter that defines the number of samples processed before the internal model parameters are updated.
- **Dataset**: Data primitive that stores the samples and their corresponding labels.
- **DataLoader**: Data primitive that wraps an iterable around the Dataset to enable easy access to the samples.
- **gradient descent**: An iterative optimization method that minimizes the loss function in machine learning models.
- **epoch**: A hyperparameter that defines the number of complete passes through the training dataset.
- **hyperparameter**: A parameter that is set before the machine learning process begins.
- **learning rate**: A hyperparameter that controls the step size of each gradient descent update.
- **loss function**: A mathematical function that measures the difference between the model's predictions and the actual labels. In other words, it computes a value that estimates how far the output is from the target.
  - Common loss functions: nn.CrossEntropyLoss() | nn.MSELoss() | nn.BCELoss() | nn.L1Loss()
- **Model**: A neural network architecture that is designed to solve a specific problem.
- **module**: Base class for all neural network models (the building blocks).
- **neural network (NN)**: A machine learning program/model that makes decisions in a manner similar to the human brain.
  - **Feedforward Networks**: Data flows in only one direction, from input layer to output layer, with no feedback loops.
  - **Recurrent Neural Networks (RNNs)**: Data flows in a loop, processing sequential data (text, audio, time series).
  - **Convolutional Neural Networks (CNNs)**: Data flows through convolutional and pooling layers to extract features from grid-like data (images).
  - **Transformers**: Data flows through self-attention mechanisms and encoder-decoder structures to process sequential data. Where RNNs process sequential data one element at a time, transformers process entire sequences simultaneously (parallel processing).
- **optimizer**: An algorithm that updates the model's parameters, based on the computed gradients, in order to minimize the loss.
  - SGD (stochastic gradient descent) is an optimizer that updates model weights based on the gradient of the loss function.
  - torch.optim provides a wide range of optimizers.
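- **parameter**: A value that the model learns during training, such as a weight or a bias (in contrast to a hyperparameter, which is set beforehand).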
- **propagation**
    - **forward propagation**: NN makes best guess about the correct output. It runs the input through each of the functions to make this guess.
    - **backward propagation**: NN adjusts its parameters proportionate to the error in its guess, making bigger/smaller changes for bigger/smaller errors. Backpropagation relies on autograd to compute gradients, which are then used to update the model's weights.
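    - Calling loss.backward() runs the backward pass: autograd computes the gradients of the loss with respect to the parameters, so no backward function has to be written by hand.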
- **sample**: A single row of data.
- **tensor**: A multi-dimensional array of numerical values (a "container" for data) that can run on a GPU to accelerate computing.
    - A tensor can be created by running: torch.tensor(data) | torch.ones(r,c) | torch.zeros(r,c) | torch.rand(r,c).
    - A tensor can also be created from a numpy array by running: torch.from_numpy(np_array)
    - Tensors of similar shapes can be added, multiplied, etc.
- **ToTensor**: Transformation function that converts a PIL image or NumPy array into a PyTorch tensor representation.
- **training**: The process of adjusting the model's parameters to minimize the loss function.
- **validation**: The process of evaluating the model's parameters on a separate dataset to monitor overfitting.
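
To make the tensor and autograd entries above more concrete, here is a minimal sketch. The values and variable names are made up for illustration; the toy loss `(w - 1) ** 2` is an assumption chosen because its gradient, `2 * (w - 1)`, is easy to check by hand.

```python
import numpy as np
import torch

# Create tensors in a few different ways
t_data = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
t_ones = torch.ones(2, 2)
t_zeros = torch.zeros(2, 2)
t_numpy = torch.from_numpy(np.array([[1.0, 2.0], [3.0, 4.0]]))

# Tensors of the same shape can be combined element-wise
t_sum = t_data + t_ones     # [[2, 3], [4, 5]]
t_prod = t_data * t_data    # [[1, 4], [9, 16]]

# autograd: track operations on w, then compute d(loss)/dw automatically
w = torch.tensor(3.0, requires_grad=True)
loss = (w - 1.0) ** 2       # toy loss with its minimum at w = 1
loss.backward()             # fills w.grad with 2 * (w - 1) = 4

print(t_sum, t_prod, w.grad, sep="\n")
```

Note that `torch.from_numpy` keeps the NumPy dtype, so the tensor created from a default float array is `float64` rather than PyTorch's default `float32`.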


---
**Important metrics for evaluating the performance of a model**:
- **Accuracy**: The model's overall correctness.
  - (TP + TN) / (TP + TN + FP + FN)
- **Precision**: Accuracy of positive predictions.
  - TP / (TP + FP)
- **Recall**: Ability to identify all positive instances.
  - TP / (TP + FN)
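
The three metrics above can be sketched in plain Python from the confusion-matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives). The function name and the counts are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # assumes at least one positive prediction
    recall = tp / (tp + fn)      # assumes at least one actual positive
    return accuracy, precision, recall

# Hypothetical counts for 100 predictions
accuracy, precision, recall = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```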


![Confusion Matrix](https://newbiettn.github.io/images/confusion-matrix-noted.jpg)
<br>

**Source**: <a href="https://newbiettn.github.io/2016/08/30/precision-recall-sensitivity-specificity/">Ngoc Tran, 2016</a>

---
**Gradient-based Optimization** <br>
Gradient descent is an optimization algorithm used to minimize the loss function (error) between predicted and actual outputs.

- The graphs below show the loss value (y-axis) as a function of the weights (x-axis).
- Bottom of the U is where the loss is minimized (optimal weights).
- Gradient refers to the slope of the loss function at a given point on the graph. It measures the rate of change of the loss with respect to the weights.
- The goal is to reach a gradient that is as close to zero as possible.

![Learning rate choice](https://duchesnay.github.io/pystatsml/_images/learning_rate_choice.png)
<br>

**Source**: <a href="https://duchesnay.github.io/pystatsml/optimization/optim_gradient_descent.html"> Edouard Duchesnay</a>
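
The effect of the learning rate can be sketched in plain Python. This is an illustration under simple assumptions: the loss is `w ** 2` (so the gradient is `2 * w` and the minimum is at `w = 0`), and the starting weight, step count, and learning rates are made-up values.

```python
def gradient_descent(learning_rate, steps=20, w=5.0):
    """Minimize loss(w) = w**2, whose gradient is 2*w, and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w

w_small = gradient_descent(0.01)  # too small: moves only slowly toward 0
w_good = gradient_descent(0.1)    # reasonable: gets close to 0
w_large = gradient_descent(1.1)   # too large: overshoots and moves away from 0

print(f"small: {w_small:.3f}, good: {w_good:.3f}, large: {w_large:.3f}")
```

With a learning rate that is too large, every step jumps past the minimum and lands further away on the other side, which is why the weight grows instead of shrinking.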

---
**Basics of a Neural Network Model** <br>
- **Forward pass**: Get the model predictions by passing input data through the model.
- **Loss calculation**: Compute the loss between the predicted and true labels/values.
- **Backward pass**: Compute gradient of the loss function with respect to the model parameters.
- **Weights update**: Update the weights of the model using an optimizer.


![Learning process of a neural network](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ZXAOUqmlyECgfVa81Sr6Ew.png)
<br>

**Source**: <a href="https://medium.com/data-science-365/overview-of-a-neural-networks-learning-process-61690a502fa"> Rukshan Pramoditha, 2022</a>
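
The four steps listed above (forward pass, loss calculation, backward pass, weights update) can be sketched as a small, self-contained PyTorch loop. The toy data (`y = 2x`), the single linear layer, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data (hypothetical): learn y = 2x from 16 samples
x = torch.linspace(-1, 1, 16).unsqueeze(1)
y = 2.0 * x

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    optimizer.zero_grad()            # clear gradients from the previous step
    prediction = model(x)            # forward pass
    loss = criterion(prediction, y)  # loss calculation
    loss.backward()                  # backward pass
    optimizer.step()                 # weights update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(f"learned weight: {model.weight.item():.3f}")
```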

---

#### PyTorch dataset: YelpReviewFull
Find more information about this dataset: https://huggingface.co/datasets/Yelp/yelp_review_full

**Data Fields**
- *text*: The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
- *label*: Corresponds to the score associated with the review (between 1 and 5).

For personal reference: A similar dataset is ag_news (text and label)

In [None]:
from datasets import load_dataset

ds = load_dataset("fancyzhx/ag_news")

print(ds)

#Print the first training example
print(next(iter(ds["train"])))

In [7]:
#Import dataset through the Hugging Face datasets library
ds_yelp = load_dataset("Yelp/yelp_review_full")

#Include print statements to see data structure
print("Dataset information")
print("_" * 40 + "\n")

print(f"Type{(type(ds_yelp))}")
print()

print(f"Length: {len(ds_yelp)}")
print()

print(f"Dataset structure: {ds_yelp.column_names}")
print()

print(f"Dataset overview: {ds_yelp}")
print()

print(f"Structure of first dataset: {ds_yelp['train']}")
print(f"Structure of second dataset: {ds_yelp['test']}")

Dataset information
________________________________________

Type<class 'datasets.dataset_dict.DatasetDict'>

Length: 2

Dataset structure: {'train': ['label', 'text'], 'test': ['label', 'text']}

Dataset overview: DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

Structure of first dataset: Dataset({
    features: ['label', 'text'],
    num_rows: 650000
})
Structure of second dataset: Dataset({
    features: ['label', 'text'],
    num_rows: 50000
})


In [8]:
#Create iterators to step through the training and test items
train_iter = iter(ds_yelp["train"])
test_iter = iter(ds_yelp["test"])

for i in range(3):
    print(next(train_iter))
    print()

print("--" * 68)

for i in range(3):
    print(next(test_iter))
    print()

{'label': 4, 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

{'label': 1, 'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  

#### Step-by-step guide to define the neural network model

#### Step 1. Text preprocessing
- Text cleaning
- Tokenization
- Build vocabulary
- Create pipelines
- Split the data into training and validation sets

#### Step 2. Define the neural network
- Pick suitable architecture
- Define input layer (text embeddings), hidden layer, output layer (star prediction)

#### Step 3. Compile the model
- Choose a loss function 
- Select an optimizer
- Define the evaluation metrics

#### Step 4. Train the model
- Divide the training data into batches using DataLoader
- Use the training data to train the model
- Forward propagation: Pass the input data through the network to get predictions
- Calculate the loss between predictions and actual labels
- Backward propagation: Compute the gradients of the loss and update the weights using the optimizer

#### Step 5. Evaluate the model
- Use the validation data to evaluate the model's performance
- Calculate metrics: Accuracy, F1-score, and loss

#### Step 6. Fine-tune the model
- Adjust hyperparameters to improve performance
- Experiment with different architectures or techniques

#### Step 1. Text preprocessing

##### **Text cleaning**

In [None]:
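import re

#Remove characters that are not word characters, whitespace, periods, or apostrophes
def clean_text(example):
    example['text'] = re.sub(r"[^\w\s.']", '', example['text'])
    return example

#A Hugging Face dataset cannot be modified in place, so map() is used to create a cleaned copy
ds_yelp['train'] = ds_yelp['train'].map(clean_text)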
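
To see what this regular expression does, here is a minimal sketch applied to a single made-up sentence (the helper name `clean_sample` and the sentence are hypothetical). It also shows two side effects worth keeping in mind: removed characters can leave double spaces behind, and hyphenated words are merged.

```python
import re

def clean_sample(text):
    # Keep word characters, whitespace, periods, and apostrophes; drop everything else
    return re.sub(r"[^\w\s.']", '', text)

cleaned = clean_sample("Good doctor -- terrible staff! It's a 2-hour wait (really).")
print(cleaned)
```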

##### **Remove stopwords**

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

#Define a set of English stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(example):
    words = example['text'].split()
    #Remember: List comprehension [expression for variable in iterable if condition]
    #Expression is the value that is produced and added to the new words list
    words = [word for word in words if word.lower() not in stop_words]
    #Store the cleaned text back in the example
    example['text'] = ' '.join(words)
    return example

#A Hugging Face dataset cannot be modified in place, so map() is used to create a cleaned copy
ds_yelp['train'] = ds_yelp['train'].map(remove_stopwords)

##### **Tokenization**

In [None]:
from torchtext.data import get_tokenizer

#Load the pre-built tokenizer once, instead of on every function call
tokenizer = get_tokenizer("basic_english")

#Define a function for tokenization that stores the tokens in a new 'tokens' column
def tokenize_text(example):
    example['tokens'] = tokenizer(example['text'])
    return example

#Apply tokenization to the train and test data
tokenized_data = ds_yelp.map(tokenize_text)

#Print the structure of tokenized_data
print(tokenized_data, end='\n\n')

#Print the tokens of the first training example
print(f"Example of tokens in train dataset: {tokenized_data['train'][0]['tokens']}", end='\n\n')

##### **Padding**

In [None]:
from torch.nn.utils.rnn import pad_sequence

#Define a function for padding
#Note: pad_sequence expects tensors, so the input is a batch of vocabulary index sequences, not raw string tokens
def pad_tokens(index_sequences, padding_value=0):
    tensors = [torch.tensor(seq, dtype=torch.int64) for seq in index_sequences]
    #By default, the pad_sequence function pads the sequences to the length of the longest sequence in the batch
    return pad_sequence(tensors, batch_first=True, padding_value=padding_value)

#Padding is applied per batch, after the tokens have been converted to indices with the vocabulary
#Example: pad_tokens([[4, 9, 2], [7, 3]])
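
Because `pad_sequence` works on tensors rather than on lists of string tokens, a small stand-alone sketch may help. The index values below are made up, and using `0` as the padding index is an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two sequences of token indices with different lengths (hypothetical values)
seq_a = torch.tensor([5, 8, 2])
seq_b = torch.tensor([7, 3])

# Pad with index 0 so that both rows have the length of the longest sequence
padded = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=0)
print(padded)
```

With `batch_first=True` the result has the shape (batch size, longest sequence length), which is the layout most layers expect when working with batches of text.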


##### **Build vocabulary**

In [None]:
from torchtext.vocab import build_vocab_from_iterator

#Select the token lists of the training dataset
train_tokens = (example['tokens'] for example in tokenized_data['train'])

#Build a vocabulary from the training tokens, with special tokens for unknown words and padding
vocab = build_vocab_from_iterator(train_tokens, specials=["<unk>", "<PAD>"])

#Setting the default index to <unk>, so when we encounter an unknown word in new data it's replaced with the <unk> token
vocab.set_default_index(vocab["<unk>"])

#Look up the actual index of the <PAD> token
PAD_IDX = vocab['<PAD>']

#Print dictionary where the keys are the tokens and the values their indices
print(vocab.get_stoi())
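
To illustrate the idea behind a vocabulary without depending on torchtext, here is a plain-Python sketch. It is not torchtext's implementation: the helper names `build_vocab` and `encode` and the example tokens are hypothetical, and the ordering rule (special tokens first, then tokens by descending frequency) is an assumption made for this example.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<PAD>")):
    """Map each token to an index: special tokens first, then by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ordered = list(specials) + [tok for tok, _ in counts.most_common() if tok not in specials]
    return {tok: idx for idx, tok in enumerate(ordered)}

def encode(tokens, stoi):
    """Replace each token with its index; unknown tokens get the <unk> index."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

stoi = build_vocab([["good", "food"], ["good", "service"]])
ids = encode(["good", "pizza"], stoi)
print(stoi)
print(ids)
```

The word "pizza" never appeared in the training tokens, so it is mapped to the index of `<unk>`, which is the same behavior a default index provides.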

##### **Create pipelines**

In [None]:
#Remove characters that are not word characters, whitespace, or periods
cleaning_pipeline = lambda x: re.sub(r"[^\w\s.]", '', x)

#Load the tokenizer (get_tokenizer was imported in the tokenization step)
tokenizer = get_tokenizer("basic_english")

#Tokenize the incoming text data and look each token up in the vocab dictionary
tokenizer_pipeline = lambda x: vocab(tokenizer(x))

#Convert label data to zero-based numerical indices by subtracting 1
#(assumes the labels run from 1 to 5; skip this step if the labels are already zero-based)
label_pipeline = lambda x: int(x) - 1


#### Step 2. Define the neural network
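
One possible shape for this step, as a minimal sketch and not a tuned architecture: an embedding layer as input, one hidden layer, and an output layer with five scores (one per star rating). The class name `TextClassifier`, the layer sizes, and the choice to average the embeddings are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Hypothetical feedforward text classifier: embeddings -> hidden layer -> 5 classes."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=16, num_classes=5, pad_idx=0):
        super().__init__()
        # padding_idx keeps the embedding of the padding token fixed at zero
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.hidden = nn.Linear(embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)             # simple average, padding positions included
        hidden = torch.relu(self.hidden(pooled))  # (batch, hidden_dim)
        return self.output(hidden)                # raw scores (logits), one per class

model = TextClassifier(vocab_size=100)
logits = model(torch.tensor([[4, 9, 2], [7, 3, 0]]))
print(logits.shape)
```

Only `forward` is defined here; the backward function is derived by autograd.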

#### Step 3. Compile the model
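
PyTorch has no separate compile call for this step; in practice it comes down to creating a loss function and an optimizer. The sketch below assumes a stand-in `nn.Linear` model and made-up scores. It also shows that `nn.CrossEntropyLoss` takes raw scores and zero-based class indices.

```python
import torch
import torch.nn as nn

# Stand-in model (hypothetical): 4 input features, 5 output classes
model = nn.Linear(4, 5)

# Loss function for multi-class classification
criterion = nn.CrossEntropyLoss()

# Optimizer that will update the model's parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The loss is lower when the highest score belongs to the true class
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.1]])
loss_correct = criterion(logits, torch.tensor([0]))
loss_wrong = criterion(logits, torch.tensor([1]))
print(f"true class 0: {loss_correct.item():.3f}, true class 1: {loss_wrong.item():.3f}")
```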

#### Step 4. Train the model

##### **Divide the training data into batches using DataLoader**
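
A minimal sketch of batching with `DataLoader`, assuming the reviews have already been converted to lists of vocabulary indices. The class `ReviewDataset`, the function `collate_batch`, the sample values, and the padding index `0` are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_IDX = 0  # assumed index of the padding token

class ReviewDataset(Dataset):
    """Hypothetical dataset of (token indices, label) pairs."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def collate_batch(batch):
    # Convert the samples to tensors and pad the sequences to the same length
    sequences = [torch.tensor(seq, dtype=torch.int64) for seq, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.int64)
    padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
    return padded, labels

dataset = ReviewDataset([[4, 9, 2], [7, 3], [5], [6, 8, 1, 2]], [4, 0, 2, 3])
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate_batch)
batches = list(loader)

for inputs, labels in batches:
    print(inputs, labels)
```

Padding inside the collate function means each batch is only padded to the length of its own longest review, not to the longest review in the whole dataset.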

#### Step 5. Evaluate the model
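
A minimal sketch of an evaluation loop that reports the average loss and the accuracy. The function name `evaluate`, the stand-in `nn.Linear` model, and the random validation batch are assumptions for illustration, so the printed numbers carry no meaning.

```python
import torch
import torch.nn as nn

def evaluate(model, batches, criterion):
    """Return the average loss and accuracy over an iterable of (inputs, labels) batches."""
    model.eval()                      # switch off training-only behavior such as dropout
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():             # no gradients are needed for evaluation
        for inputs, labels in batches:
            logits = model(inputs)
            total_loss += criterion(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / total, correct / total

# Hypothetical stand-in model and validation batch
torch.manual_seed(0)
model = nn.Linear(4, 5)
val_batches = [(torch.randn(8, 4), torch.randint(0, 5, (8,)))]
val_loss, val_accuracy = evaluate(model, val_batches, nn.CrossEntropyLoss())
print(f"validation loss: {val_loss:.3f}, validation accuracy: {val_accuracy:.2f}")
```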

#### Step 6. Fine-tune the model

#### Other relevant pieces of code

In [None]:
#Move tensor to the GPU if available
tensor = torch.rand(3, 4) #Example tensor
if torch.cuda.is_available():
    tensor = tensor.to('cuda')
    print(f"Device tensor is stored on: {tensor.device}")

#Define a loss function
criterion = nn.CrossEntropyLoss()

#Create an optimizer (Stochastic Gradient Descent)
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)


for epoch in range(num_epochs):

    #Clear the gradients left over from the previous iteration
    optimizer.zero_grad()
    
    #Forward pass: Get model's predictions by passing input data through the model
    output = model(data)

    #Calculate loss: Compute the loss between the predicted labels and the true labels using the loss function
    loss = criterion(output, target) #output = model's predicted labels/values | target = tensor of true labels/values

    #Backward pass: Compute the gradients of the loss with respect to the model's parameters
    loss.backward()

    #Updating weights: Use an optimizer to update the model's weights
    optimizer.step()


#### Common structure for deep learning project
- **utils.py**: Utility functions for handling hyperparameters, logging, and storing the model.
- **model/net.py**: The neural network architecture, the loss function, and evaluation metrics.
- **model/data_loader.py**: Data loading, preprocessing, and batching for training and evaluation.
- **main.py**: Entry point for the project, includes training (train.py) and evaluation (evaluate.py) of the model.

<hr style="border: 0.8px solid black;">

## License and Copyright

© 2024 Noor de Bruijn. All rights reserved.