# Data Analysis Portfolio

This notebook is part of my data analysis portfolio, where I explore **three** key areas:
1. Data processing and visualization
2. Data science and machine learning
3. Text analysis and insights

## <span style="background-color: #FFE5B4 ">  Section 1. Data Processing and Visualization </span>

### General information
There are various important packages for *data processing and visualization*. In the example code below, I will be focusing on:
- pandas
- numpy
- statsmodels
- matplotlib
- seaborn

### Import required packages

In [None]:
#Import packages/modules
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

### <span style="background-color: #FFE5B4 ">1.1 Data Processing</span>

<a href="https://www.fullstory.com/blog/what-is-data-processing/">Data processing</a> is a series of operations performed on data to transform, analyze, and organize it into a useful format.
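As a minimal sketch of the three operations covered in the subsections below (cleaning, transforming, merging), the following cell uses pandas on two small made-up tables. The column names and values are hypothetical and serve only as illustration; a real dataset would need its own cleaning rules.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not taken from a real dataset)
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 12],
    "amount": [25.0, np.nan, np.nan, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "country": ["NL", "BE", "DE"],
})

# Cleaning: drop exact duplicate rows, then fill missing amounts with the median
orders_clean = orders.drop_duplicates().copy()
orders_clean["amount"] = orders_clean["amount"].fillna(orders_clean["amount"].median())

# Transforming: add a log-scaled version of the amount column
orders_clean["log_amount"] = np.log1p(orders_clean["amount"])

# Merging: attach customer information with a left join on customer_id
merged = orders_clean.merge(customers, on="customer_id", how="left")
print(merged)
```

The median is used for imputation here only because it is robust to outliers; other strategies (dropping rows, a constant, a model-based fill) may suit a given dataset better.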

#### 1.1.1 Cleaning

#### 1.1.2 Transforming

#### 1.1.3 Merging

### <span style="background-color: #FFE5B4 ">1.2 Data Visualization</span>

<a href="https://www.tableau.com/learn/articles/data-visualization">Data visualization</a> is the graphical representation of data through the use of visual elements such as charts, graphs, plots, and infographics.
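A minimal sketch of a static chart with matplotlib is shown below. The categories and values are hypothetical; the point is only to show how a figure, axes, labels, and a title fit together.

```python
import matplotlib.pyplot as plt

# Hypothetical example data
categories = ["A", "B", "C"]
values = [3, 7, 5]

# Create one figure with one set of axes and draw a bar chart on it
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(categories, values)
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.set_title("Static bar chart")
fig.tight_layout()
```

In a notebook the figure is typically rendered inline at the end of the cell; in a script, `fig.savefig(...)` or `plt.show()` would be needed to see it.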

#### 1.2.1 Static

#### 1.2.2 Dynamic

<hr style="border: 0.8px solid black;">

## <span style="background-color: #FFE5B4 "> Section 2. Traditional Machine Learning and Deep Learning </span>

### General information
There are various important packages for *traditional machine learning and deep learning*. In the example code below, I will be focusing on:
- pytorch
- scikit-learn

### Import required packages

In [2]:
#Import packages/modules
import sklearn as sk
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

#Import specific objects
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

### <span style="background-color: #FFE5B4 ">2.1 Traditional Machine Learning</span>
scikit-learn is focused on traditional machine learning tasks, such as linear regression, clustering, and support vector machines (SVMs).
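As a minimal sketch of the scikit-learn workflow (create an estimator, `fit`, then `predict`), the cell below fits a linear regression to hypothetical data that follows y = 2x + 1 exactly. The data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x + 1 without noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y = 2 * X.ravel() + 1

# Fit the model and inspect the learned slope and intercept
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)

# Predict the target for an unseen input
prediction = model.predict(np.array([[4.0]]))
print(prediction)
```

Most scikit-learn estimators share this `fit`/`predict` interface, so the same pattern should carry over to clustering and SVM models.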

### <span style="background-color: #FFE5B4 ">2.2 Deep Learning</span>
PyTorch is primarily designed for deep learning tasks, such as neural networks (CNNs, RNNs) and transformers (BERT, RoBERTa).

#### Important terminology: PyTorch
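- **backpropagation**: The algorithm that computes the gradient of the loss with respect to each weight of a neural network by propagating the error backward through the layers; the optimizer then uses these gradients to adjust the weights.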
- **batch**: A group of samples processed together. The batch size is a hyperparameter that defines the number of samples processed before the internal model parameters are updated.
- **Dataset**: Data primitive that stores the samples and their corresponding labels.
- **DataLoader**: Data primitive that wraps an iterable around the Dataset to enable easy access to the samples.
- **epoch**: One complete pass through the training dataset. The number of epochs is a hyperparameter.
- **gradient descent**: An iterative optimization method that minimizes the loss function by repeatedly moving the parameters in the direction of the negative gradient.
- **hyperparameter**: A parameter that is set before the machine learning process begins.
- **learning rate**: A hyperparameter that controls the step size of each gradient descent update.
- **loss function**: A mathematical function that measures the difference between the model's predictions and the actual labels.
- **Model**: A neural network architecture that is designed to solve a specific problem.
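- **Module** (`torch.nn.Module`): Base class for all neural network modules in PyTorch; individual layers and complete models both subclass it.
- **neural network**: A machine learning model built from layers of interconnected nodes (neurons) with learnable weights, loosely inspired by the human brain.
- **optimizer**: An algorithm that updates the model's parameters based on the computed gradients (for example SGD or Adam).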
- **sample**: A single row of data.
- **tensor**: A multi-dimensional array of numerical values (a "container" for data).
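    - A tensor can be created by running `torch.ones(r, c)` or `torch.rand(r, c)`, where `r` and `c` are the number of rows and columns.
    - Tensors with the same (or broadcast-compatible) shapes can be added, multiplied, etc.
- **ToTensor**: Transformation from torchvision that converts a PIL image or NumPy array into a PyTorch tensor and, for 8-bit image data, scales pixel values to the range [0, 1].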
- **training**: The process of adjusting the model's parameters to minimize the loss function.
- **validation**: The process of evaluating the model on a separate, held-out dataset during training to monitor overfitting and tune hyperparameters.
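The terms above can be tied together in a minimal training loop. The sketch below is not the Yelp model; it assumes a hypothetical toy regression problem (y = 3x + 2), and the names `ToyDataset` and `LinearModel`, along with the batch size, learning rate, and number of epochs, are chosen only for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

torch.manual_seed(0)  # make the sketch reproducible

class ToyDataset(Dataset):
    """Hypothetical dataset: samples x with labels y = 3x + 2."""
    def __init__(self, n=64):
        self.x = torch.linspace(-1, 1, n).unsqueeze(1)  # tensor of shape (n, 1)
        self.y = 3 * self.x + 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]  # one sample and its label

class LinearModel(nn.Module):
    """Hypothetical model: a single linear layer."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# DataLoader wraps the Dataset and yields shuffled batches of 16 samples
loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
model = LinearModel()
loss_fn = nn.MSELoss()                                    # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr = learning rate

losses = []
for epoch in range(50):                  # one epoch = one full pass over the data
    epoch_loss = 0.0
    for x_batch, y_batch in loader:
        optimizer.zero_grad()            # reset gradients from the previous step
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()                  # backpropagation: compute gradients
        optimizer.step()                 # gradient descent update
        epoch_loss += loss.item() * len(x_batch)
    losses.append(epoch_loss / len(loader.dataset))

print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```

A validation step would follow the same pattern on a held-out `DataLoader`, wrapped in `torch.no_grad()` and without the calls to `backward()` and `step()`.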


#### PyTorch dataset: YelpReviewFull
Find more information about this dataset: https://huggingface.co/datasets/Yelp/yelp_review_full

**Data Fields**
- *text*: The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
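- *label*: Corresponds to the score associated with the review (1 to 5 stars). In the Hugging Face version the label is stored as a class index from 0 to 4, so the label 4 in the output below corresponds to 5 stars.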
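The escaping rules for the *text* field can be undone with plain string replacement. The helper below is a hypothetical sketch (`unescape_review` is not part of any library), and whether the loaded text still contains these sequences depends on how the data was parsed, so it is worth inspecting a few rows first.

```python
def unescape_review(text: str) -> str:
    """Hypothetical helper: undo the escaping described for the text field."""
    # Turn the two-character sequence backslash + "n" into a real newline
    text = text.replace("\\n", "\n")
    # Collapse doubled double quotes into a single double quote
    text = text.replace('""', '"')
    return text

# Made-up example string that uses both escape conventions
raw = 'Great food.\\nThe waiter said ""enjoy"" twice.'
print(unescape_review(raw))
```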

In [12]:
#Import dataset through the Hugging Face datasets library
ds_yelp = load_dataset("Yelp/yelp_review_full")

#Include print statements to see data structure
print("Dataset information")
print("_" * 40 + "\n")

print(f"Type: {type(ds_yelp)}")
print()

print(f"Length: {len(ds_yelp)}")
print()

print(f"Dataset structure: {ds_yelp.column_names}")
print()

print(f"Dataset overview: {ds_yelp}")
print()

print("First row in train set:")
print(f"Label: {ds_yelp['train'][0]['label']}")
print(f"Text: {ds_yelp['train'][0]['text']}")
print()

print("First row in test set:")
print(f"Label: {ds_yelp['test'][0]['label']}")
print(f"Text: {ds_yelp['test'][0]['text']}")
print()

Dataset information
________________________________________

Type: <class 'datasets.dataset_dict.DatasetDict'>

Length: 2

Dataset structure: {'train': ['label', 'text'], 'test': ['label', 'text']}

Dataset overview: DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

First row in train set:
Label: 4
Text: dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.

First row in 

<hr style="border: 0.8px solid black;">

## <span style="background-color: #FFE5B4 "> Section 3. Text Sentiment and Topic Modeling </span>

### General information
There are various important packages for *text sentiment and topic modeling*. In the example code below, I will be focusing on:
- nltk
- spacy
- gensim
- transformers

### Import required packages

In [None]:
#Import packages/modules
import gensim
import nltk
import spacy
import transformers

#Import specific objects
from gensim.models import Word2Vec, TfidfModel
from gensim.corpora import Dictionary
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from spacy import displacy
from transformers import AutoModelForSequenceClassification, AutoTokenizer

### <span style="background-color: #FFE5B4 ">3.1 Text Sentiment</span>

### <span style="background-color: #FFE5B4 ">3.2 Topic Modeling</span>

## License and Copyright

© 2024 Noor de Bruijn. All rights reserved.