#   LAB 06 - Python version

Luca Catalano, Daniele Rege Cambrin, Eleonora Poeta

### Disclaimer

This material is intended for students who want to solve the laboratory exercises in Python. It was prepared because, in recent years, some students have chosen Python for their exam projects.

To solve these exercises using Python, you need to install Python (version 3.9.6 or later) and some libraries using pip or conda.

Here's a list of the libraries needed for this case:

- `os`: Provides operating system dependent functionality, commonly used for file operations such as reading and writing files, interacting with the filesystem, etc.
- `pandas`: A data manipulation and analysis library that offers data structures and functions to efficiently work with structured data.
- `numpy`: A numerical computing library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- `matplotlib.pyplot`: A plotting library for creating visualizations like charts, graphs, histograms, etc.
- `sklearn`: Machine learning algorithms and tools.
- `xlrd`: A Python library for reading data and formatting information from Excel files. Note that xlrd 2.0 and later reads only the legacy `.xls` format; to read `.xlsx` files, pandas relies on the `openpyxl` package, which can be installed in the same way as the other libraries.

You can download Python from [here](https://www.python.org/downloads/) and follow the installation instructions for your operating system.

For installing libraries using [pip](https://pip.pypa.io/en/stable/) or [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html), you can use the following commands:

- Using pip:
  ```
  pip install pandas numpy matplotlib scikit-learn xlrd openpyxl
  ```

- Using conda:
  ```
  conda install pandas numpy matplotlib scikit-learn xlrd openpyxl
  ```

Make sure to run these commands in your terminal or command prompt after installing Python. You can also execute them in a cell of a Jupyter Notebook file (`.ipynb`) by starting the command with '!'.
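
Once the installation has finished, a quick check like the sketch below can confirm that each library is importable. Note that the package installed as `scikit-learn` is imported as `sklearn`. The helper name `check_libraries` is illustrative and not part of any library.

```python
import importlib.util

# Import names (not pip package names): scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "sklearn", "xlrd"]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_libraries(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any library reported as missing can be installed with the pip or conda commands shown above.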

#   Exercise 1

Import some libraries

In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import confusion_matrix

## Read the Excel file "user.xlsx"

To read the Excel file, use the `pd.read_excel()` function provided by pandas, passing the path of the file to read as its argument.
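
As a minimal sketch, the call can be wrapped in a small helper that fails with a clear message when the path is wrong. The helper name `load_dataset` and the example path are assumptions made for illustration; the real location of the workbook depends on your setup.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read the first sheet of an Excel workbook into a DataFrame."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Excel file not found: {path}")
    # In recent pandas versions the reader engine is chosen from the file
    # extension; .xlsx files need the openpyxl package to be installed.
    return pd.read_excel(path)


# Hypothetical usage, assuming the workbook sits next to the notebook:
# dataset = load_dataset("user.xlsx")
```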

In [None]:
# Read the Excel file


In a Jupyter Notebook cell, you can display a DataFrame (truncated if it is large) by writing the name of the variable that holds it as the last line of the cell.

In [None]:
# print dataset


##  Define the label column in the dataset data frame

Rename the 'Response' column to 'Label' [use `dataset.rename(columns={'actual_col_name': 'new_col_name'})` and assign the result back to `dataset`, since `rename` returns a new DataFrame by default]
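
The following sketch shows the renaming on a toy DataFrame. The `Age` column and the sample values are assumptions that stand in for the real dataset.

```python
import pandas as pd

# Toy data standing in for the lab dataset (the values are made up).
dataset = pd.DataFrame(
    {"Age": [25, 40, 31], "Response": ["Positive", "Negative", "Positive"]}
)

# rename() returns a new DataFrame, so the result is assigned back.
dataset = dataset.rename(columns={"Response": "Label"})

print(dataset.columns.tolist())
```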


In [None]:
# rename column Response to Label


In [None]:
# print dataset to check if the column has been renamed


##  Separate the dataset into features, referred to as X, and labels, referred to as y. Afterwards, utilize Label Encoder to encode the categorical features.

[You can achieve this by selecting columns using the [] operator on the dataframe, then initializing the Label Encoder and applying its fit_transform method]
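
A minimal sketch of this step on a toy DataFrame is shown below. The column names `Gender` and `Region` and all values are hypothetical; only `Age` and `Label` are taken from the exercise text.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the Gender and Region columns are hypothetical placeholders.
dataset = pd.DataFrame(
    {
        "Age": [25, 40, 31, 52],
        "Gender": ["F", "M", "M", "F"],
        "Region": ["North", "South", "North", "East"],
        "Label": ["Positive", "Negative", "Positive", "Negative"],
    }
)

# Features: every column except the label, selected with the [] operator.
X = dataset[["Age", "Gender", "Region"]].copy()
# Target variable
y = dataset["Label"]

# Encode each categorical feature with its own encoder; Age is already numeric.
for column in X.columns:
    if column != "Age":
        X[column] = LabelEncoder().fit_transform(X[column])

# LabelEncoder sorts the classes, so Negative becomes 0 and Positive becomes 1.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

print(X)
print(y)
```

Using a separate encoder per column keeps the mappings independent, and the fitted `label_encoder` can later map predictions back to the original strings with `inverse_transform`.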

In [None]:
# Split the dataset into features (X) and target variable (y)
# Features

# Target variable


# Label encoding

# Apply label encoding to each column, except for the age column

# Transform Negative into the value 0 and Positive into the value 1 (use a label encoder with .fit_transform)


## Use the random forest classifier model.

To start, split the dataset `users.xlsx` into two parts: training and testing. This allows for training the model on the training portion and evaluating its performance using the test portion.

Please note that the test portion is not a real-world test set. It is a stand-in that lets you evaluate the model on a small set of samples whose correct labels are known.

Set these parameters:

- Max Depth: 100
- Number of trees: 20


[use train_test_split() to split the dataset]

[Use RandomForestClassifier() and its .fit and .predict methods]
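
The workflow can be sketched as follows on synthetic data generated with `make_classification`, which stands in for the encoded lab dataset. The `test_size` and `random_state` values are assumptions; the exercise only fixes the maximum depth and the number of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold out 30% of the samples for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 20 trees with a maximum depth of 100, as requested by the exercise.
model = RandomForestClassifier(n_estimators=20, max_depth=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}")
```

Fixing `random_state` makes both the split and the forest reproducible between runs.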

In [None]:
# Split the dataset into training set and test set


# Create a Random Forest Classifier


# Train the model using the training sets


# Predict the response for test dataset


# Evaluate the model: Accuracy, Precision, Recall


# Print the evaluation metrics


## Validation of Random Forest Classifier model using Cross Validation

Cross-validation is a technique used to assess the performance and generalization ability of machine learning models, particularly in the context of classification tasks. It involves partitioning the dataset into multiple subsets, known as folds.

1. **Partitioning the Dataset**: The dataset is divided into k equal-sized folds.

2. **Training and Testing**: The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.

3. **Evaluation**: The performance of the model is evaluated on each fold, and the results are averaged to obtain a robust estimate of the model's performance.

4. **Advantages**: Cross-validation provides a more reliable estimate of the model's performance compared to a single train-test split. It helps to detect overfitting and assesses the model's ability to generalize to unseen data.

[Use `cross_val_score` and `cross_val_predict` to perform cross-validation easily]
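
As a hedged sketch, `cross_val_predict` can produce one out-of-fold prediction per sample, from which the confusion matrix and accuracy follow. The synthetic data and the choice of 10 folds are assumptions, not requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

model = RandomForestClassifier(n_estimators=20, max_depth=100, random_state=42)

# Each sample is predicted by a model trained on the folds that exclude it.
y_pred = cross_val_predict(model, X, y, cv=10)

cm = confusion_matrix(y, y_pred)
accuracy = accuracy_score(y, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(cm)
```

Because every sample appears in exactly one test fold, the entries of the confusion matrix sum to the size of the dataset.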

In [None]:
# Initialize the Random Forest classifier


# Perform cross-validation predictions


# Calculate confusion matrix


# Evaluate accuracy


# Print accuracy


# Print confusion matrix



##  Implement Grid Search

Grid Search is a technique used to find the optimal hyperparameters for a machine learning model. It works by searching through a predefined set of hyperparameters and evaluating the model's performance for each combination using cross-validation.


Specifically, you need to:

1. Define a grid of hyperparameters to search through.
2. Use Grid Search to find the best combination of hyperparameters.
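
The two steps can be sketched with `GridSearchCV` as follows. To keep the example fast, it uses a deliberately smaller grid than the one in the cell below, synthetic data, and 3 folds; all three choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Step 1: define a (reduced) grid of hyperparameters.
small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
}

# Step 2: evaluate every combination with cross-validation.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated accuracy: {grid_search.best_score_:.3f}")
```

The number of model fits grows as the product of the list lengths times the number of folds, which is why larger grids take noticeably longer.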

In [None]:
# Grid search. It takes more or less 30 seconds to run
# Define the parameter grid
param_grid =    {
                    "n_estimators": [100, 250, 500],
                    "max_depth": [None, 10, 20, 30],
                }
# Initialize the grid search

# Perform the grid search [use the .fit() method]

# Print the best parameters and the best score


#   Exercise 2

Import some libraries

In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC as SupportVectorMachineClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import confusion_matrix

## Read the Excel file "user.xlsx"

To read the Excel file, use the `pd.read_excel()` function provided by pandas, passing the path of the file to read as its argument.

In [None]:
# Read the Excel file


In a Jupyter Notebook cell, you can display a DataFrame (truncated if it is large) by writing the name of the variable that holds it as the last line of the cell.

In [None]:
# print dataset


##  Define the label column in the dataset data frame

Rename the 'Response' column to 'Label' [use `dataset.rename(columns={'actual_col_name': 'new_col_name'})` and assign the result back to `dataset`, since `rename` returns a new DataFrame by default]


In [None]:
# rename column Response to Label


In [None]:
# print dataset to check if the column has been renamed


##  Separate the dataset into features, referred to as X, and labels, referred to as y. Afterwards, utilize Label Encoder to encode the categorical features.

[You can achieve this by selecting columns using the [] operator on the dataframe, then initializing the Label Encoder and applying its fit_transform method]

In [None]:
# Split the dataset into features (X) and target variable (y)
# Features

# Target variable


# Label encoding

# Apply label encoding to each column, except for the age column


# Transform Negative into the value 0 and Positive into the value 1 (use a label encoder with .fit_transform)



## Use the Support Vector Machine classifier model.

Split the dataset `users.xlsx` into training and test sets in the same way as in Exercise 1.

Set these parameters:

- C: 100
- gamma: 0.1
- kernel: 'rbf'


[Use SupportVectorMachineClassifier(), the alias under which sklearn's SVC was imported above, and its .fit and .predict methods]
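
A minimal sketch of this step on synthetic data is given below. It uses the `SVC` class under the alias imported at the top of this exercise; the data, `test_size`, and `random_state` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC as SupportVectorMachineClassifier

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Same kind of split as for the random forest (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# RBF-kernel SVM with the parameters requested by the exercise.
model = SupportVectorMachineClassifier(C=100, gamma=0.1, kernel="rbf")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}")
```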

In [None]:

# Split the dataset into training set and test set


# Create a SVM Classifier


# Train the model using the training sets


# Predict the response for test dataset


# Evaluate the model: Accuracy, Precision, Recall



# Print the evaluation metrics



## Validation of SVM Classifier model using Cross Validation

Cross-validation is a technique used to assess the performance and generalization ability of machine learning models, particularly in the context of classification tasks. It involves partitioning the dataset into multiple subsets, known as folds.

1. **Partitioning the Dataset**: The dataset is divided into k equal-sized folds.

2. **Training and Testing**: The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.

3. **Evaluation**: The performance of the model is evaluated on each fold, and the results are averaged to obtain a robust estimate of the model's performance.

4. **Advantages**: Cross-validation provides a more reliable estimate of the model's performance compared to a single train-test split. It helps to detect overfitting and assesses the model's ability to generalize to unseen data.

[Use `cross_val_score` and `cross_val_predict` to perform cross-validation easily]
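
To make the steps listed above concrete, the sketch below writes the k-fold loop by hand with `KFold`. The helper functions automate this loop and, for classifiers given an integer `cv`, use stratified folds. The synthetic data and `k = 5` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 1. Partition the dataset into k folds.
k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in folds.split(X):
    # 2. Train on k-1 folds and test on the remaining fold.
    model = SVC(C=100, gamma=0.1, kernel="rbf")
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

# 3. Average the per-fold results.
mean_accuracy = float(np.mean(fold_scores))
print("Per-fold accuracy:", np.round(fold_scores, 3))
print(f"Mean accuracy: {mean_accuracy:.3f}")
```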

In [None]:
# Initialize the SVM classifier


# Perform cross-validation predictions


# Calculate confusion matrix


# Evaluate accuracy


# Print accuracy


# Print confusion matrix



##  Implement Grid Search

Grid Search is a technique used to find the optimal hyperparameters for a machine learning model. It works by searching through a predefined set of hyperparameters and evaluating the model's performance for each combination using cross-validation.


Specifically, you need to:

1. Define a grid of hyperparameters to search through.
2. Use Grid Search to find the best combination of hyperparameters.
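
One possible refinement, sketched here on synthetic data, is to pass a list of grids. In scikit-learn's `SVC`, `gamma` is a coefficient of the `rbf`, `poly`, and `sigmoid` kernels and has no effect on the linear kernel, so a grid that pairs every `gamma` with `'linear'` refits identical models. The data and `cv=5` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# A list of grids: gamma is searched only where it matters (rbf kernel).
param_grid = [
    {"kernel": ["rbf"], "C": [1, 2, 5, 10], "gamma": [2, 1, 0.1, 0.01]},
    {"kernel": ["linear"], "C": [1, 2, 5, 10]},
]

grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated accuracy: {grid_search.best_score_:.3f}")
```

This evaluates 20 candidates (16 for `rbf` plus 4 for `linear`) instead of the 32 produced by a single combined grid.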

In [None]:
# Grid search. It takes more or less 30 seconds to run
# Define the parameter grid
param_grid =    {
                    "C": [1, 2, 5, 10],
                    "gamma": [2, 1, 0.1, 0.01],
                    "kernel": ['rbf', 'linear']
                }
# Initialize the grid search

# Perform the grid search [use the .fit() method]


# Print the best parameters and the best score


#   Exercise 3

Import some libraries

In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import confusion_matrix

## Read the Excel file "user.xlsx"

To read the Excel file, use the `pd.read_excel()` function provided by pandas, passing the path of the file to read as its argument.

In [None]:
# Read the Excel file


In a Jupyter Notebook cell, you can display a DataFrame (truncated if it is large) by writing the name of the variable that holds it as the last line of the cell.

In [None]:
# print dataset


##  Define the label column in the dataset data frame

Rename the 'Response' column to 'Label' [use `dataset.rename(columns={'actual_col_name': 'new_col_name'})` and assign the result back to `dataset`, since `rename` returns a new DataFrame by default]


In [None]:
# rename column Response to Label


In [None]:
# print dataset to check if the column has been renamed


##  Separate the dataset into features, referred to as X, and labels, referred to as y. Afterwards, utilize Label Encoder to encode the categorical features.

[You can achieve this by selecting columns using the [] operator on the dataframe, then initializing the Label Encoder and applying its fit_transform method]

In [None]:
# Split the dataset into features (X) and target variable (y)
# Features

# Target variable


# Label encoding

# Apply label encoding to each column, except for the age column


# Transform Negative into the value 0 and Positive into the value 1 (use a label encoder with .fit_transform)


## Use the MLP classifier model.

Split the dataset `users.xlsx` into training and test sets in the same way as in Exercise 1.

A Multi-Layer Perceptron (MLP) is a type of artificial neural network (ANN) that consists of multiple layers of nodes, or neurons, arranged in a feedforward manner. MLPs are widely used for various machine learning tasks, including classification and regression.

### Structure of an MLP:

1. **Input Layer**: The first layer of the MLP, which receives input features from the dataset.

2. **Hidden Layers**: Intermediate layers between the input and output layers. Each hidden layer consists of multiple neurons, and the number of hidden layers and neurons per layer can vary depending on the complexity of the task.

3. **Output Layer**: The final layer of the MLP, which produces the network's output. The number of neurons in the output layer depends on the number of classes in the classification task or the number of output values in the regression task.

### Activation Function:

Each neuron in the MLP applies an activation function to its input to introduce non-linearity into the model and enable the network to learn complex patterns. Common activation functions include:

- **ReLU (Rectified Linear Unit)**
- **Sigmoid**
- **Tanh (Hyperbolic Tangent)**
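
The three functions listed above can be written in a few lines of NumPy. This is an illustrative sketch of the mathematical definitions, not the code scikit-learn uses internally.

```python
import numpy as np


def relu(z):
    """Rectified Linear Unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def tanh(z):
    """Hyperbolic tangent: squashes any real value into the interval (-1, 1)."""
    return np.tanh(z)


z = np.array([-2.0, 0.0, 2.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
```

In `MLPClassifier`, the hidden-layer activation is selected with the `activation` parameter, whose accepted values include `'relu'`, `'logistic'` (the sigmoid), and `'tanh'`.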

### Training an MLP:

MLPs are trained using an optimization algorithm such as gradient descent to minimize a loss function, which measures the difference between the predicted output and the true labels in the training data. Common loss functions include cross-entropy loss for classification tasks and mean squared error for regression tasks.

Set these parameters:

- max_iter=500
- solver='sgd'
- learning_rate_init=0.001
- hidden_layer_sizes=(512, 256, 128)
- random_state=42


[Use MLPClassifier() and its .fit and .predict methods]
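
A minimal sketch with the parameters listed above is shown below. The synthetic data, `test_size`, and the split's `random_state` are assumptions. scikit-learn may emit a warning if the optimizer has not converged within `max_iter` iterations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the encoded lab dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Same kind of split as in the previous exercises (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three hidden layers trained with stochastic gradient descent.
model = MLPClassifier(
    max_iter=500,
    solver="sgd",
    learning_rate_init=0.001,
    hidden_layer_sizes=(512, 256, 128),
    random_state=42,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```

Neural networks are usually sensitive to feature scale, so on real data it may help to standardize the features (for example with `StandardScaler`) before training.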

In [None]:
# Split the dataset into training set and test set

# Create a MLP Classifier

# Train the model using the training sets

# Predict the response for test dataset

# Evaluate the model: Accuracy
