# Machine Learning Workflow - Regression - Basic

A basic machine learning workflow for regression, from understanding the dataset through model training and evaluation.

Dataset: https://www.kaggle.com/mirichoi0218/insurance

## Machine Learning Workflow Components

1. **Data Description, Tasks & Observations**
2. **Import Libraries** & other settings
3. **Data Import**
4. **Exploratory Data Analysis (EDA)**
5. **Data Preparation**
6. **Model Training**
7. **Model Evaluation**
8. **Save Model**

## Data Description, Tasks & Observations

### Data Description
**Dataset Summary**: Data from an insurance company covering individuals aged [x-y]. It has 7 columns.  
**Index Column**: Does not exist in dataset  
**Target Column**: `charges`  
**Features**: All other columns  
**Data Source**: Public Domain  
**Other Comments**: ???

### Tasks:
1. Understand the data and select the columns to use as features
2. Perform exploratory data analysis
3. Prepare the data for the model
4. Train a basic model
5. Evaluate the model
6. Save the best-performing model

### Observations:
1. *Will be filled as we go through workflow*

## Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.__version__

## Data Import

### Read CSV File to Dataframe

In [None]:
# If running on Kaggle
path = "../input/insurance/insurance.csv"

# If running anywhere else
# To Add

df = pd.read_csv(path)
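
One way to handle the "anywhere else" case is to try several candidate locations and use the first one that exists. The sketch below is only an illustration: `data/insurance.csv` is a hypothetical local path and `resolve_data_path` is a helper invented for this example, not part of the original notebook.

```python
from pathlib import Path

# Kaggle location used in the cell above
kaggle_path = Path("../input/insurance/insurance.csv")
# Hypothetical local location (an assumption; adjust to your own download folder)
local_path = Path("data/insurance.csv")


def resolve_data_path(candidates):
    """Return the first candidate path that exists, or None if none do."""
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


resolved_path = resolve_data_path([kaggle_path, local_path])
print(resolved_path)  # None when neither file is present
```

If `resolved_path` is not `None`, it can be passed to `pd.read_csv` in place of `path`.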

### Basic Dataset Information

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe(include='all')

## Exploratory Data Analysis (EDA)

### Find Data Types

In [None]:
df.dtypes

### Find Null Counts

In [None]:
#Find count of nulls in each column
df.isna().sum()

In [None]:
#Find % of null values in each column
(df.isna().sum() / df.shape[0]) * 100
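
To make the two null checks concrete, here is a small sketch on a toy frame with made-up values (not taken from the insurance data). It also shows that the mean of the boolean mask gives the same null fraction as dividing the counts by the row count.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing age out of four rows
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "bmi": [27.9, 33.8, 33.0, 22.7],
})

null_counts = toy.isna().sum()           # number of nulls per column
null_percent = toy.isna().mean() * 100   # mean of a boolean mask is the null fraction

print(null_counts)
print(null_percent)
```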

#### Observations
1. Age column has around 20% null values

### Target Column

In [None]:
target_column = 'charges'

In [None]:
df[[target_column]].describe().transpose()

In [None]:
df[[target_column]].astype('object').describe().T

In [None]:
sns.displot(data=df, x=target_column)

In [None]:
sns.displot(data=df, x='charges', kind='kde', hue='smoker')

#### Observations
1. Most people pay the smallest prices
2. Most people pay around 2,000 to 10,000
3. Only a few pay more than 50,000
4. The 30,000 range has a small amount of data

### Numeric Columns

In [None]:
df.select_dtypes(include=np.number).columns

In [None]:
numeric_columns = ['age', 'bmi', 'children', 'charges']

In [None]:
df[numeric_columns].head()

In [None]:
df[numeric_columns].describe()

In [None]:
sns.displot(data=df, x='age')

In [None]:
sns.scatterplot(data=df, x='age', y='charges')

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='smoker')

In [None]:
sns.jointplot(data=df, x='bmi', y='charges')

In [None]:
sns.jointplot(data=df, x='bmi', y='charges', hue='smoker')

In [None]:
sns.jointplot(data=df, x='children', y='charges', hue='smoker')

In [None]:
sns.pairplot(df[numeric_columns])

In [None]:
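# Correlation matrix of the numeric columns (input for the heatmap);
# text columns are excluded because recent pandas versions raise an error on them
corr = df[numeric_columns].corr()
corr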

In [None]:
sns.heatmap(corr)

In [None]:
sns.heatmap(
    df[numeric_columns].corr(),
    cmap='Blues',
    annot=True,
)
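
Correlations are only defined for numeric columns, so it can help to select them explicitly before calling `corr`. The sketch below uses a toy frame with made-up values to show the idea; the column values are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical values) mixing numeric and text columns
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0],
    "smoker": ["no", "yes", "no", "yes"],
})

# Keep only numeric columns, then compute Pearson correlations
numeric_part = toy.select_dtypes(include="number")
toy_corr = numeric_part.corr()

print(toy_corr)
```

In this toy data `charges` is an exact linear function of `age`, so their correlation is 1.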

#### Observations
1. ...

### Categorical Columns

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.select_dtypes(include='object').columns

In [None]:
categorical_columns = ['sex', 'smoker', 'region', 'children']

In [None]:
df[categorical_columns].head()

In [None]:
df[categorical_columns].astype('object').describe()

In [None]:
sns.countplot(data=df, x='smoker')

In [None]:
sns.countplot(data=df, x='region')

In [None]:
sns.countplot(data=df, x='sex')

In [None]:
sns.countplot(data=df, x='smoker', hue='sex')

In [None]:
sns.countplot(data=df, x='smoker', hue='region')

In [None]:
sns.countplot(data=df, hue='children', x='region')

In [None]:
sns.boxplot(data=df, x='charges', y='smoker')

In [None]:
sns.boxplot(data=df, x='bmi', y='sex')

In [None]:
sns.boxplot(data=df, x='charges', y='smoker', hue='region')

In [None]:
sns.boxplot(data=df, x='charges', y='sex', hue='smoker')
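
The box plots compare distributions visually; a grouped summary gives the same comparison as numbers. This is a minimal sketch on a toy frame with made-up values, not results from the insurance data.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [30000.0, 4000.0, 6000.0, 20000.0],
})

# Count, median, and mean of charges within each smoker group
group_stats = toy.groupby("smoker")["charges"].agg(["count", "median", "mean"])

print(group_stats)
```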

#### Observations
1. ...

## Data Preparation

### Feature Selection

In [None]:
target_column = 'charges'

# Manually selected relevant features based on EDA
feature_columns = ['age', 'bmi', 'smoker']

keep_columns = [target_column] + feature_columns

In [None]:
# Copy so that later column assignments do not trigger SettingWithCopyWarning
df = df[keep_columns].copy()

In [None]:
df.head()

### Fill Null Values

In [None]:
df['age'].median()

In [None]:
df['age'] = df['age'].fillna(df['age'].median())
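
As a small illustration of median imputation, the sketch below fills missing values in a toy series (made-up ages). `median()` skips missing values by default, so the fill value comes only from the observed entries.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with two missing ages
ages = pd.Series([20.0, np.nan, 40.0, 60.0, np.nan], name="age")

median_age = ages.median()         # computed from 20, 40, 60 only
filled = ages.fillna(median_age)   # missing entries replaced by the median

print(median_age)
print(filled.tolist())
```

One design consideration: computing the median on the full dataset before the train/test split lets a little test-set information into training. Computing it on the training rows only is a stricter alternative.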

### One-Hot Encoding of Text Columns

In [None]:
df = pd.get_dummies(df)

In [None]:
df.head()
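
The sketch below shows what `pd.get_dummies` produces on a toy frame (made-up values). It also shows the optional `drop_first=True` argument, which drops one level per category. For a two-level column such as `smoker`, the two dummy columns are redundant, and some practitioners prefer to keep only one of them for linear models. This is offered as an option, not as what the notebook does.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
})

encoded = pd.get_dummies(toy)                        # one column per category level
encoded_drop = pd.get_dummies(toy, drop_first=True)  # first level of each category dropped

print(list(encoded.columns))
print(list(encoded_drop.columns))
```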

### Create X & y

In [None]:
df.columns

In [None]:
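# Use an ordered list: indexing a DataFrame with a set raises an error in recent pandas
feature_columns = [col for col in df.columns if col != target_column]

X = df[feature_columns]
y = df[target_column]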

In [None]:
feature_columns

In [None]:
X

In [None]:
y

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
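
To see what `test_size=0.2` and `random_state=42` do, here is a minimal sketch on toy arrays (made-up data): 10 rows are split into 8 training and 2 test rows, and repeating the call with the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Same seed, same split
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te_again))
```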

In [None]:
X_train

In [None]:
y_train

## Model Training & Evaluation

### Model Training

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

model.fit(X_train,y_train)
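
After fitting, a linear model can be inspected through its `coef_` and `intercept_` attributes. The sketch below fits a toy model on made-up data generated from `y = 3*x1 + 2*x2 + 5`, so the recovered values can be checked against known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical) generated from y = 3*x1 + 2*x2 + 5 with no noise
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 5.0]])
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

print(toy_model.coef_)       # one weight per feature, in column order
print(toy_model.intercept_)  # constant term
```

On the notebook's data, pairing `model.coef_` with the feature column names in the same way would show how much each feature moves the predicted charge.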

### Model Predictions on Test Data

In [None]:
y_pred = model.predict(X_test)

In [None]:
df_test = X_test.copy()
df_test['y_true'] = y_test
df_test['y_pred'] = y_pred

In [None]:
df_test.head()

### Model Evaluation

#### Actual VS Predicted Plot

In [None]:
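ax = sns.scatterplot(data=df_test, x='y_true', y='y_pred')

# Draw the y = x reference line in data coordinates;
# points on this line are predicted exactly
low = df_test[['y_true', 'y_pred']].min().min()
high = df_test[['y_true', 'y_pred']].max().max()
ax.plot([low, high], [low, high], ':y')

plt.show()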

#### Regression Evaluation Metrics

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
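# Store results under new names so the imported metric functions are not overwritten
r2 = r2_score(y_test, y_pred)
mae_score = mean_absolute_error(y_test, y_pred)
mse_score = mean_squared_error(y_test, y_pred)
# RMSE is the square root of MSE; computing it directly avoids the
# `squared` argument, which newer scikit-learn releases no longer accept
rmse_score = np.sqrt(mse_score)

In [None]:
print(f"R2: {r2}")
print(f"Mean Absolute Error: {mae_score}")
print(f"Mean Squared Error: {mse_score}")
print(f"Root Mean Squared Error: {rmse_score}")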
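
For reference, the four metrics can be written out by hand with NumPy. The sketch below uses toy arrays with made-up values; the variable names are illustrative only.

```python
import numpy as np

# Toy values (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                          # residuals: [1, 0, -1, -2]

mae_manual = np.mean(np.abs(errors))             # mean absolute error
mse_manual = np.mean(errors ** 2)                # mean squared error
rmse_manual = np.sqrt(mse_manual)                # root mean squared error

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot                  # coefficient of determination

print(mae_manual, mse_manual, rmse_manual, r2_manual)
```

MAE and RMSE are in the same units as the target (`charges`), which makes them easier to interpret than MSE.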

## Save Model

In [None]:
#TODO
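
One common way to persist a scikit-learn model is `joblib`. The sketch below is only an illustration under stated assumptions: it trains a small stand-in model on made-up data so that it runs on its own, and the file name `insurance_linear_regression.joblib` is hypothetical. In the notebook, the fitted `model` would take the place of `toy_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model (hypothetical data) so the example is self-contained
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# A temporary directory keeps the example self-cleaning; use a real path in practice
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "insurance_linear_regression.joblib")
    joblib.dump(toy_model, model_path)        # write the fitted model to disk
    restored_model = joblib.load(model_path)  # read it back

# The restored model should reproduce the original predictions
same_predictions = np.allclose(
    toy_model.predict(X_toy), restored_model.predict(X_toy)
)
print(same_predictions)
```

Saved models are generally best loaded with the same scikit-learn version that created them.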