## Predictive Analytics Template - Structured Data

* <ins>Author</ins>: Duncan Calvert
* <ins>Last Modified</ins>: 9/24/24

This article is part of my series Watch One, Do One, Teach One (WDT), a data science series that helps beginner data scientists learn AI concepts, practice implementing them, and then teach the concepts to others to cement their understanding.

This template is meant for predictive analytics/ML with structured data sets. It has the following sections:

## Table of Contents
1. [Package Imports and Installs](#1-package-imports-and-installs)
2. [Configurations](#2-configurations)
3. [Downloading the Data Set](#3-downloading-the-data-set)
4. [Exploratory Data Analysis](#4-exploratory-data-analysis-eda)
    - [4a. Pandas EDA](#4a-pandas-eda)
    - [4b. Checking for Target Leakage](#4b-checking-for-target-leakage)
    - [4c. AutoViz](#4c-autoviz)
5. [Data Cleaning](#5-data-cleaning)
6. [Feature Engineering](#6-feature-engineering)
7. [AutoML](#7-automl)
8. [Iterative Modeling and Hyperparameter Tuning](#8-iterative-modeling-and-hyperparameter-tuning)
9. [Explainability](#9-explainability)
    - [9a. Feature importance](#9a-feature-importance)
10. [Summary and Lessons Learned](#10-summary-and-lessons-learned)

## 1. Package Imports and Installs

In [1]:
# !pip install -r requirements.txt

In [2]:
# Data Imports
from palmerpenguins import load_penguins

# General packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

# AutoEDA Tools
from autoviz import AutoViz_Class

Imported v0.1.905. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)


## 2. Configurations

In [3]:
# removes the limit on displayed columns so wide dataframes are not truncated
pd.set_option('display.max_columns', None)

# prevents wide dataframes from wrapping across multiple "pages" of output
pd.set_option('display.expand_frame_repr', False)

# removes the limit on column width so long cell values are not truncated
pd.set_option('display.max_colwidth', None)

# formats floats with thousands separators and up to 7 significant digits
pd.options.display.float_format = '{:,.7}'.format
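
The settings above stay in effect for the rest of the session. When a change is only needed for one display, pandas' `option_context` context manager applies an option temporarily and restores the previous value on exit. A minimal sketch, assuming a small made-up DataFrame (`long_df` is hypothetical and not part of this template):

```python
import pandas as pd

# Hypothetical frame with many rows, used only to illustrate temporary options
long_df = pd.DataFrame({"value": range(30)})

before = pd.get_option("display.max_rows")

# Inside the block at most 4 rows are printed; the old setting returns on exit
with pd.option_context("display.max_rows", 4):
    inside = pd.get_option("display.max_rows")
    print(long_df)

after = pd.get_option("display.max_rows")
```

This keeps one-off display tweaks from leaking into later cells.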

## 3. Downloading the Data Set

#### Loading a toy data set

In [4]:
# Load the toy penguins data set from the palmerpenguins package as a pandas DataFrame
df = load_penguins()

#### Loading a CSV

In [None]:
# df = pd.read_csv('<insert_file_name.csv>')

#### Loading an Excel File

In [5]:
# df = pd.read_excel('<Insert Excel File Name>.xlsx', sheet_name='<Insert tab name>')

#### Loading a JSON File

In [None]:
"""
file_path = "/home/jupyter/data/news/news_some_company.json"
df = pd.read_json(file_path, orient='records', lines=True)
df.head()
"""

#### Loading a Google Drive File

In [6]:
"""
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
"""

#### Copy files to local FS from GCS

In [None]:
"""
import os
import requests

# GCS download function: fetches a public object over HTTPS and saves it locally
def get_gcs_data(bucket_name, folder_name, file_name, path_local):
    url = 'https://storage.googleapis.com/' + bucket_name + '/' + folder_name + '/' + file_name
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    with open(os.path.join(path_local, file_name), 'wb') as f:
        f.write(r.content)

# Receiving path
path_news = '/home/jupyter/data/news'

# GCS bucket details
bucket_name = 'msca-bdp-data-open'
folder_name = 'news'
file_names = ['news_some_company.json']
path_local = path_news

# Create the local folder if it does not exist yet
os.makedirs(path_local, exist_ok=True)

for file in file_names:
    get_gcs_data(bucket_name=bucket_name,
                 folder_name=folder_name,
                 file_name=file,
                 path_local=path_local)
    print('Downloaded: ' + file)
"""

## 4. Exploratory Data Analysis (EDA)

EDA is one of the most important parts of any data science workflow. Understanding the nuances of your data and identifying potential biases, missing values, or other issues in your data set is essential for every subsequent step.

* <ins>Note</ins>: Even before EDA, whenever possible, discuss potential data quality, availability, or bias issues with any data subject matter experts who are available. There are often nuances and pitfalls in the way data is collected and cataloged that are not apparent from code-based EDA.

### 4a. Pandas EDA

#### Shape
The "shape" attribute gives the axis dimensions of the object as a (rows, columns) tuple, consistent with NumPy's ndarray, allowing us to quickly gauge the size of our data set.

In [7]:
df.shape

(344, 8)

#### Head/Tail
The "head" and "tail" methods allow us to view a small sample of a Series or DataFrame object with the default length being 5. You can pass a parameter to the methods to increase/decrease their row count. These methods allow us to quickly get a sense of the columns/features available and what a few examples of the data set look like.

In [8]:
df.head(2)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |


#### Info
The "info" method provides column-level information on our DataFrame, specifically data types, non-null counts, and the index range.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
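
The non-null counts above imply how many values are missing, but it is often easier to compute the missing counts directly. A minimal sketch, assuming a small made-up frame (`sample` is a hypothetical stand-in for the penguins data, so the numbers differ from the output above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins frame, with a few missing values
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.5],
    "sex": ["male", None, None, "female"],
})

# Number of missing values per column
missing_count = sample.isna().sum()

# Share of rows that are missing per column, as a percentage
missing_pct = sample.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

Replacing `sample` with `df` should give the same kind of summary for the real data set.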


#### Describe
The "describe" method provides common summary statistics for your DataFrame's features. By default it summarizes only the numeric columns.

In [10]:
df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 | 344.0 |
| mean | 43.92193 | 17.15117 | 200.9152 | 4201.754 | 2008.029 |
| std | 5.459584 | 1.974793 | 14.06171 | 801.9545 | 0.8183559 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 | 2007.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 | 2007.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 | 2008.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 | 2009.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 | 2009.0 |
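
The output above covers only the numeric columns. For text columns such as `species`, `island`, and `sex`, "describe" can be pointed at the non-numeric columns instead, and "value_counts" gives the frequency of each category. A minimal sketch, assuming a small made-up frame (`sample` is hypothetical, not the real penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with two text columns and one numeric column
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 3700.0],
})

# Summarise only the non-numeric columns: count, unique, top, freq
categorical_summary = sample.describe(exclude=[np.number])
print(categorical_summary)

# Frequency of each category in a single column
species_counts = sample["species"].value_counts()
print(species_counts)
```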


### 4b. Checking for Target Leakage

### 4c. AutoViz

AutoViz is an automated visualization package that gives a quick interactive overview of your data.

In [1]:
"""
AV = AutoViz_Class()
%matplotlib inline 

# Run AutoViz on the dataframe
dft = AV.AutoViz(
    "",
    sep=",",
    depVar="",
    dfte=df,
    header=0,
    verbose=1,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
    save_plot_dir=None
)

"""


## 5. Data Cleaning

## 6. Feature Engineering

* Differencing

## 7. AutoML

## 8. Iterative Modeling and Hyperparameter Tuning

## 9. Explainability

### 9a. Feature importance

<ins>Permutation importance</ins>: Permutation importance is calculated after a model has been fitted. It involves randomly shuffling a single column of the data set, leaving the target and all other features in place, and then measuring how that affects the accuracy of predictions on the now-shuffled data. The benefits of permutation importance are that it is generally fast to calculate, widely used, and easy to understand.
* Shuffling a single feature/column at random should lead to less accurate predictions, since the resulting data set no longer corresponds to anything observed in the real world. Model accuracy especially suffers if we shuffle a column that the model relied on heavily for predictions.

The high-level process is as follows:
1. Get a trained model.
2. Shuffle the values in a single column and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
3. Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
4. Interpret your results.
    * The values towards the top (i.e. the bigger ones) are the most important features.
    * The first number in each row shows how much model performance decreased with a random shuffle.
    * The number after the +- measures the amount of randomness by repeating the process with multiple shuffles. It shows how performance varied from one reshuffling to the next.
    * In the case of negative values, the predictions on the shuffled/noisy data happened to be more accurate than the real data. This happens when the feature didn't matter (should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, like the one in this example, because there is more room for luck/chance.
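
The steps above can be sketched by hand in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the "trained model" is a hypothetical fixed rule (`predict`) rather than a fitted estimator, and accuracy stands in for the loss function. The helper name `permutation_importance_sketch` is made up for this example.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Return (mean, std) of the accuracy drop for each column of X."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)       # accuracy on unshuffled data
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_shuffled = X.copy()             # the original order stays intact
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            shuffled_score = np.mean(predict(X_shuffled) == y)
            drops[col, rep] = baseline - shuffled_score
    return drops.mean(axis=1), drops.std(axis=1)

# Hypothetical data: the label depends only on column 0; column 1 is noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X[:, 0] > 0

# Stand-in for a trained model: it thresholds column 0 and ignores column 1
def predict(features):
    return features[:, 0] > 0

means, stds = permutation_importance_sketch(predict, X, y)
for name, m, s in zip(["signal", "noise"], means, stds):
    print(f"{name}: {m:.3f} +- {s:.3f}")
```

Because the stand-in model never looks at the noise column, shuffling it leaves every prediction unchanged and its importance is exactly zero, while shuffling the signal column costs a large share of the accuracy.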


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
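
scikit-learn also ships its own implementation of this calculation, `sklearn.inspection.permutation_importance`, which can be used as an alternative to eli5. A minimal sketch, assuming synthetic data from `make_classification` rather than the FIFA file above (the feature names are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the FIFA file used above)
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(n_estimators=100,
                               random_state=0).fit(train_X, train_y)

# Shuffle each validation column n_repeats times and record the score drop
result = permutation_importance(model, val_X, val_y,
                                n_repeats=10, random_state=1)

# Print features from most to least important as "mean +- std"
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[idx]}: "
          f"{result.importances_mean[idx]:.3f} +- {result.importances_std[idx]:.3f}")
```

The printed mean and standard deviation correspond to the two numbers on either side of the +- in the eli5 table.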

## 10. Summary and Lessons Learned