#   LAB 05 - Python version

Luca Catalano, Daniele Rege Cambrin, Eleonora Poeta

### Disclaimer

This material is intended for students who want to learn how to solve the problems presented in the laboratory classes using Python. It was prepared because, in recent years, some students have chosen Python for their exam projects.

To solve these exercises using Python, you need to install Python (version 3.9.6 or later) and some libraries using pip or conda.

Here's a list of the libraries needed for this case:

- `os`: Part of the Python standard library (no installation needed). Provides operating-system-dependent functionality, commonly used for file operations such as building paths and interacting with the filesystem.
- `pandas`: A data manipulation and analysis library that offers data structures and functions to efficiently work with structured data.
- `numpy`: A numerical computing library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- `matplotlib.pyplot`: A plotting library for creating visualizations like charts, graphs, histograms, etc.
- `sklearn`: Machine learning algorithms and tools (installed under the package name `scikit-learn`).
- `xlrd`: A Python library for reading data and formatting information from legacy Excel files (`.xls`). Since version 2.0 it no longer reads `.xlsx` files; for those, pandas relies on the `openpyxl` library, which you should install as well.

You can download Python from [here](https://www.python.org/downloads/) and follow the installation instructions for your operating system.

For installing libraries using [pip](https://pip.pypa.io/en/stable/) or [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html), you can use the following commands:

- Using pip:
  ```
  pip install pandas numpy matplotlib scikit-learn xlrd openpyxl
  ```

- Using conda:
  ```
  conda install pandas numpy matplotlib scikit-learn xlrd openpyxl
  ```

Run these commands in your terminal or command prompt after installing Python. You can also execute them in a cell of a Jupyter Notebook file (`.ipynb`) by prefixing the command with `!`.

#   Exercise 1

Import the required libraries.

In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import export_text
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

## Read the Excel file "user.xlsx"

To read the Excel file, use the `pd.read_excel()` function provided by the pandas library. Call it with the path of the file to read as its argument and store the returned DataFrame in a variable.

In [None]:
# Read the Excel file


In a Jupyter Notebook cell, you can display a (possibly truncated) view of a DataFrame by writing the name of the variable that holds it on the last line of the cell.

In [None]:
# print dataset


##  Define the label column in the dataset data frame

Rename the 'Response' column to 'Label' [use `dataset.rename(columns={'actual_col_name': 'new_col_name'})`; `rename` returns a new DataFrame, so assign the result back to `dataset`]
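
As a minimal sketch of the renaming step, the snippet below works on a small in-memory DataFrame rather than on "user.xlsx". The `Age` column and all values are invented for illustration; only the `Response` and `Label` names come from the exercise.

```python
import pandas as pd

# Toy stand-in for the lab dataset: the "Age" column and the values are assumptions.
dataset = pd.DataFrame({
    "Age": [25, 40, 31],
    "Response": ["Yes", "No", "Yes"],
})

# rename() returns a new DataFrame, so the result is assigned back to the same name.
dataset = dataset.rename(columns={"Response": "Label"})

print(list(dataset.columns))
```

Without the assignment, `dataset` would keep its original column names.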


In [None]:
# rename column Response to Label


In [None]:
# print the dataset to check that the column has been renamed


##  Separate the dataset into features (X) and labels (y), then use Label Encoder to encode the categorical features

[Select the columns with the `[]` operator on the DataFrame, then initialize a `LabelEncoder` and apply its `fit_transform` method to each categorical column]
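
The split and the per-column encoding could look roughly like the sketch below. It uses a toy DataFrame: the `Age` and `Job` columns and their values are hypothetical, so adapt the column names to the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: column names other than "Label" are assumptions made for this sketch.
dataset = pd.DataFrame({
    "Age": [25, 40, 31, 52],
    "Job": ["clerk", "engineer", "clerk", "teacher"],
    "Label": ["Yes", "No", "Yes", "No"],
})

# Features: every column except the label. copy() avoids modifying `dataset`.
X = dataset[["Age", "Job"]].copy()
# Target variable
y = dataset["Label"]

# Encode each categorical column with its own LabelEncoder; "Age" is already numeric.
for column in X.columns:
    if column != "Age":
        X[column] = LabelEncoder().fit_transform(X[column])

print(X)
```

`LabelEncoder` assigns integer codes following the sorted order of the distinct values in the column.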

In [None]:
# Split the dataset into features (X) and target variable (y)
# Features
# Target variable


# Label encoding

# Apply label encoding to each column, except for the age column

# print X


##  Use the decision tree classifier model.

Set these parameters:

- Criterion: 'entropy'
- Max Depth: 20
- Min Impurity Decrease: 0.001


[Use `DecisionTreeClassifier()` with the `criterion`, `max_depth`, and `min_impurity_decrease` arguments, then call its `.fit` method]
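
A possible way to set these parameters is sketched below on a tiny, already-encoded toy dataset. The data is invented, and `random_state=0` is an extra argument added here only to make the sketch reproducible; it is not part of the exercise.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded features (age, job code) and labels: invented for illustration.
X = [[25, 0], [40, 1], [31, 0], [52, 2], [23, 0], [47, 1]]
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

# Initialize the classifier with the parameters listed above.
# random_state is an assumption of this sketch, used for reproducibility.
clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=20,
    min_impurity_decrease=0.001,
    random_state=0,
)

# Train the classifier on the features and labels.
clf.fit(X, y)

print(clf.get_depth())
```

`max_depth` is only an upper bound: the tree stops growing earlier when no split reduces the impurity by at least `min_impurity_decrease`.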

In [None]:
# Initialize the Decision Tree Classifier

# Train the Decision Tree Classifier


##  Print the structure of the decision tree

[use `export_text(classifier_name, feature_names=list(X.columns))`]
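
For illustration, the sketch below trains a small tree on invented data and prints its rules. The `Age` and `Job` column names and the `tree_rules` variable are assumptions of this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy, already-encoded features: column names and values are invented.
X = pd.DataFrame({
    "Age": [25, 40, 31, 52, 23, 47],
    "Job": [0, 1, 0, 2, 0, 1],
})
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# export_text returns the tree as a string with one line per node;
# feature_names replaces the default feature_0, feature_1, ... labels.
tree_rules = export_text(clf, feature_names=list(X.columns))
print(tree_rules)
```

Each indented line is a test on a feature, and each leaf line reports the predicted class.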

In [None]:
# Print the structure of the decision tree



## Use the trained model on unseen data

Now that the model has been trained with the `fit` method, we can apply it to a dataset that it has not seen before. [Use the variable `clf` that was defined previously (without redefining it) and call its `predict` method to make predictions on the new dataset]

Instead of keeping the trained model in memory, you can also store it for later reuse with a serialization library such as `joblib` or `pickle`. These libraries save the trained model to a file, which can then be loaded and used whenever needed without retraining the model from scratch.
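
Both options can be sketched as follows: predicting with the model held in memory, and saving it with `joblib` and loading it back. The data, the file name `decision_tree.joblib`, and the variable names are hypothetical; a temporary directory is used so that the sketch leaves no files behind.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded training data (age, job code): invented for illustration.
X_train = [[25, 0], [40, 1], [31, 0], [52, 2], [23, 0], [47, 1]]
y_train = ["Yes", "No", "Yes", "No", "Yes", "No"]
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

# Unseen rows with the same columns, in the same order, as the training data.
X_new = [[28, 0], [50, 1]]
predictions = clf.predict(X_new)

# Save the trained model to a file and load it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "decision_tree.joblib")
    joblib.dump(clf, model_path)
    restored_clf = joblib.load(model_path)

# The restored model gives the same predictions without being retrained.
restored_predictions = restored_clf.predict(X_new)
print(predictions, restored_predictions)
```

Only load model files that come from a source you trust, since both `joblib` and `pickle` can execute arbitrary code when loading.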


### Load the new dataset "prospects.xlsx"

In [None]:
# load the new dataset. [Use pd.read_excel() function to load the dataset. Use the path of the file as an argument of the function.]


In [None]:
# print the new dataset


Note that this dataset has neither a "Label" nor a "Response" column: the outcomes are unknown, and we want to predict them with the model trained on the labeled data.

##  Utilize Label Encoder to encode the categorical features.

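[Assign the DataFrame to the variable `X`, then initialize the Label Encoder and apply its `fit_transform` method to each categorical column. Note that `fit_transform` learns the mapping from the values it sees, so the codes match those used for training only if each column contains the same set of categories in both files]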
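
The sketch below illustrates why the set of categories matters. It compares refitting an encoder on new data with an alternative that the exercise does not ask for: reusing an encoder fitted on the training column through `transform`. The `Job` column and its values are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns: the new data contains only a subset of the training categories.
train = pd.DataFrame({"Job": ["clerk", "engineer", "teacher", "clerk"]})
new = pd.DataFrame({"Job": ["teacher", "clerk"]})

# Alternative approach (an assumption of this sketch): fit on the training
# column once and reuse the same mapping on the new data.
encoder = LabelEncoder().fit(train["Job"])
reused_codes = encoder.transform(new["Job"])

# Approach asked for in the exercise: fit a fresh encoder on the new data.
refit_codes = LabelEncoder().fit_transform(new["Job"])

print(list(reused_codes), list(refit_codes))
```

Here "teacher" receives a different code in the two cases because "engineer" is missing from the new data. When both files contain the same categories in every column, the two approaches give identical codes. `transform` raises a `ValueError` if it meets a category that was not present during `fit`.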

In [None]:
X = new_dataset

# Label encoding for the new_dataset

# Apply label encoding to each column, except for the age column

# print X


##  Apply the pretrained Decision Tree model

In [None]:
# Predict the target variable of the new dataset

# print the prediction


# Exercise 2

## Read the Excel file "user.xlsx"

To read the Excel file, use the `pd.read_excel()` function provided by the pandas library. Call it with the path of the file to read as its argument and store the returned DataFrame in a variable.

In [None]:
# Read the Excel file


In a Jupyter Notebook cell, you can display a (possibly truncated) view of a DataFrame by writing the name of the variable that holds it on the last line of the cell.

In [None]:
# print dataset


##  Define the label column in the dataset data frame

Rename the 'Response' column to 'Label' [use `dataset.rename(columns={'actual_col_name': 'new_col_name'})`; `rename` returns a new DataFrame, so assign the result back to `dataset`]


In [None]:
# rename column Response to Label


In [None]:
# print the dataset to check that the column has been renamed


##  Separate the dataset into features (X) and labels (y), then use Label Encoder to encode the categorical features

[Select the columns with the `[]` operator on the DataFrame, then initialize a `LabelEncoder` and apply its `fit_transform` method to each categorical column]

In [None]:
# Split the dataset into features (X) and target variable (y)
# Features

# Target variable


# Label encoding

# Apply label encoding to each column, except for the age column

# print X


## Validation of Decision Tree classification model using Cross Validation

Cross-validation is a technique used to assess the performance and generalization ability of machine learning models, particularly in the context of classification tasks. It involves partitioning the dataset into multiple subsets, known as folds.

1. **Partitioning the Dataset**: The dataset is divided into k equal-sized folds.

2. **Training and Testing**: The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.

3. **Evaluation**: The performance of the model is evaluated on each fold, and the results are averaged to obtain a robust estimate of the model's performance.

4. **Advantages**: Cross-validation provides a more reliable estimate of the model's performance compared to a single train-test split. It helps to detect overfitting and assesses the model's ability to generalize to unseen data.
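
The steps above can be written out by hand, as in the sketch below. It uses scikit-learn's `KFold` splitter on invented data; the choice of 5 folds, the shuffling, and the `random_state` values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded data (age, job code): invented for illustration.
X = np.array([[25, 0], [40, 1], [31, 0], [52, 2], [23, 0],
              [47, 1], [29, 0], [55, 2], [35, 0], [44, 1]])
y = np.array(["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"])

# 1. Partition the dataset into k folds (k = 5 here).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in kfold.split(X):
    # 2. Train on k-1 folds and test on the remaining fold.
    model = DecisionTreeClassifier(criterion="entropy", random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_predictions = model.predict(X[test_idx])
    # 3. Evaluate the model on the held-out fold.
    fold_accuracies.append(accuracy_score(y[test_idx], fold_predictions))

# Average the per-fold results to obtain the overall estimate.
mean_accuracy = float(np.mean(fold_accuracies))
print(fold_accuracies, mean_accuracy)
```

A fresh model is created inside the loop so that no fold is evaluated by a model that has already seen it during training.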

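[Use `cross_val_predict`, imported at the beginning of Exercise 1, to obtain a cross-validated prediction for every row; `cross_val_score`, which must be imported from `sklearn.model_selection` before use, returns the score of each fold directly. Follow the same instructions as in Exercise 1 to initialize the model]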

Set these parameters for the Decision Tree Classifier model:

- Criterion: 'entropy'
- Max Depth: 25
- Min Impurity Decrease: 0.01
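
One possible shape for this step is sketched below on invented data. The number of folds (`cv=5`), the `random_state`, and the variable names `y_pred` and `conf_table` are assumptions of this example; the "No" and "Yes" label values are inferred from the confusion matrix headers used in this exercise.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded data (age, job code): invented for illustration.
X = np.array([[25, 0], [40, 1], [31, 0], [52, 2], [23, 0],
              [47, 1], [29, 0], [55, 2], [35, 0], [44, 1]])
y = np.array(["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"])

# Initialize the classifier with the parameters listed above.
clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=25,
    min_impurity_decrease=0.01,
    random_state=0,  # assumption: added for reproducibility
)

# Each row is predicted by a model trained on the folds that do not contain it.
y_pred = cross_val_predict(clf, X, y, cv=5)

# Rows are actual classes and columns are predicted classes, in the order given by labels.
conf_matrix = confusion_matrix(y, y_pred, labels=["No", "Yes"])
accuracy = accuracy_score(y, y_pred)

conf_table = pd.DataFrame(
    conf_matrix,
    columns=["Predicted No", "Predicted Yes"],
    index=["Actual No", "Actual Yes"],
)
print(accuracy)
print(conf_table)
```

Passing `labels` explicitly fixes the row and column order, so the headers of the table are guaranteed to match the counts.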


In [None]:
# Initialize the decision tree classifier

# Perform cross-validation predictions


# Calculate confusion matrix


# Evaluate accuracy

# Print accuracy


# Print confusion matrix
conf_matrix = pd.DataFrame(conf_matrix, columns=['Predicted No', 'Predicted Yes'], index=['Actual No', 'Actual Yes'])
conf_matrix
