<a href="https://colab.research.google.com/github/amanullahshah32/Machine-Learning/blob/main/Machine_Learning_Project_Approaces/Machine_Learning_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Command to start a local Jupyter runtime that Colab can connect to:
jupyter notebook --NotebookApp.allow_origin='https://colab.research.google.com' --port=8888 --NotebookApp.port_retries=0


Steps for approaching ML problems:

1. Understand the business requirements and the nature of the available data.
2. Classify the problem as supervised/unsupervised and regression/classification.
3. Download, clean & explore the data and create new features that may improve models.
4. Create training/test/validation sets and prepare the data for training ML models.
5. Create a quick & easy baseline model to evaluate and benchmark future models.
6. Pick a modeling strategy, train a model, and tune hyperparameters to achieve optimal fit.
7. Experiment and combine results from multiple strategies to get a better result.
8. Interpret models, study individual predictions, and present your findings.

In [None]:
import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## Step 1 - Understand Business Requirements & Nature of Data

<img src="https://i.imgur.com/63XEArk.png" width="640">


Most machine learning models are trained to serve a real-world use case. It's important to understand the business requirements, modeling objectives and the nature of the data available before you start building a machine learning model.

### Understanding the Big Picture

The first step in any machine learning problem is to read the given documentation, talk to various stakeholders and identify the following:

1. What is the business problem you're trying to solve using machine learning?
2. Why are we interested in solving this problem? What impact will it have on the business?
3. How is this problem solved currently, without any machine learning tools?
4. Who will use the results of this model, and how does it fit into other business processes?
5. How much historical data do we have, and how was it collected?
6. What features does the historical data contain? Does it contain the historical values for what we're trying to predict?
7. What are some known issues with the data (data entry errors, missing data, differences in units etc.)?
8. Can we look at some sample rows from the dataset? How representative are they of the entire dataset?
9. Where is the data stored and how will you get access to it?
10. ...


Gather as much information about the problem as possible, so that you have a clear understanding of the objective and feasibility of the project.

## Step 2 - Classify the problem as supervised/unsupervised & regression/classification

<img src="https://i.imgur.com/rqt2A7F.png" width="640">

Here's the landscape of machine learning ([source](https://medium.datadriveninvestor.com/machine-learning-in-10-minutes-354d83e5922e)):

<img src="https://miro.medium.com/max/842/1*tlQwBmbL6RkuuFq8OPJofw.png" width="640">



Here are the topics in machine learning that we're studying in this course ([source](https://vas3k.com/blog/machine_learning/)):

<img src="https://i.imgur.com/VbVFAsg.png" width="640">



### Loss Functions and Evaluation Metrics

Once you have identified the type of problem you're solving, you need to pick an appropriate evaluation metric. Depending on the kind of model you train, the model will also optimize a loss/cost function during training.

* **Evaluation metrics** - they're used by humans to evaluate the ML model

* **Loss functions** - they're used by computers to optimize the ML model

They are often the same (e.g. RMSE for regression problems), but they can be different (e.g. Cross entropy and Accuracy for classification problems).
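To make the distinction concrete, here is a minimal sketch using NumPy. The helper names (`rmse`, `accuracy`, `binary_cross_entropy`) and the sample numbers are illustrative assumptions, not part of the dataset. It shows RMSE serving as both loss and metric, and two classifiers with identical accuracy but different cross entropy.

```python
import numpy as np

def rmse(targets, predictions):
    """Root mean squared error: usable both as a loss and as a metric."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))

def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of correct class predictions (an evaluation metric)."""
    predicted = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predicted == np.asarray(labels)))

def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Average negative log-likelihood (a loss a model can optimize)."""
    labels = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(probabilities, dtype=float), eps, 1 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Made-up labels and predicted probabilities from two hypothetical classifiers
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]
hesitant = [0.6, 0.4, 0.55, 0.51]

acc_confident = accuracy(labels, confident)
acc_hesitant = accuracy(labels, hesitant)
bce_confident = binary_cross_entropy(labels, confident)
bce_hesitant = binary_cross_entropy(labels, hesitant)

# Made-up regression targets and predictions
rmse_example = rmse([100, 200, 300], [110, 190, 300])

print(acc_confident, acc_hesitant)
print(bce_confident, bce_hesitant)
print(rmse_example)
```

Both classifiers get every label right, so accuracy cannot tell them apart, while cross entropy rewards the one that assigns higher probability to the correct class. That smoother signal is one reason it is typically preferred as a training loss.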

See this article for a survey of common loss functions and evaluation metrics: https://towardsdatascience.com/11-evaluation-metrics-data-scientists-should-be-familiar-with-lessons-from-a-high-rank-kagglers-8596f75e58a7

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
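# low_memory=False reads the file in one pass, which avoids a mixed-type StateHoliday column
ross_df = pd.read_csv('/content/drive/MyDrive/Datasets/rossmann-store-sales/train.csv', low_memory=False)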

In [None]:
ross_df

In [None]:
store_df = pd.read_csv('/content/drive/MyDrive/Datasets/rossmann-store-sales/store.csv')

In [None]:
store_df


### We can merge the two data frames to get a richer set of features for each row of the training set

In [None]:
merged_df = ross_df.merge(store_df,  on='Store')
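As a hedged aside, a merge can silently drop or duplicate rows when keys do not line up. The sketch below uses two tiny made-up frames (the values are illustrative, not taken from the dataset) to show one way to guard against that: `how='left'` keeps every row of the left frame, and `validate='many_to_one'` raises an error if a store ID appears more than once on the right.

```python
import pandas as pd

# Made-up stand-ins for the daily sales table and the store metadata table
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 5020, 6064, 8314]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# Keep every sales row; fail loudly if a store appears twice in `stores`
merged = sales.merge(stores, how='left', on='Store', validate='many_to_one')

# Rows whose store had no metadata show up as missing values after a left merge
unmatched_rows = int(merged['StoreType'].isna().sum())

print(merged)
print('Rows without store metadata:', unmatched_rows)
```

With the default inner join, the row for store 3 would have been dropped without any warning, so comparing row counts before and after a merge is a cheap sanity check.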

In [None]:
merged_df

In [None]:
merged_df.shape

In [None]:
test_df = pd.read_csv('/content/drive/MyDrive/Datasets/rossmann-store-sales/test.csv')

In [None]:
merged_test_df = test_df.merge(store_df, on='Store')

In [None]:
merged_test_df

# Cleaning Data

### The first step is to check the column data types and identify any null values

In [None]:
merged_df.info()
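`info()` reports non-null counts, but it can be easier to read the missing values directly. The following is a small sketch on a made-up frame; the column names `CompetitionDistance` and `PromoInterval` are used only as illustrative assumptions. It counts missing values per column and expresses them as a share of all rows.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in a numeric and a text column
sample = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'CompetitionDistance': [1270.0, np.nan, 14130.0, np.nan],
    'PromoInterval': ['Jan,Apr,Jul,Oct', None, None, 'Feb,May,Aug,Nov'],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Fraction of rows that are missing, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)

print(missing_counts)
print(missing_share)
```

The same two lines can be applied to `merged_df` to decide which columns need imputation before training.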

In [None]:
round(merged_df.describe().T,2)

In [None]:
merged_df['Store'].unique()

In [None]:
#df= merged_df[merged_df['Store']>3]

In [None]:
# df

In [None]:
merged_df.duplicated().sum()

In [None]:
merged_df['Date'] = pd.to_datetime(merged_df.Date)

In [None]:
merged_test_df['Date']= pd.to_datetime(merged_test_df.Date)

In [None]:
merged_df

In [None]:
merged_df.Date.min(), merged_df['Date'].max()

In [None]:
merged_test_df.Date.min(), merged_test_df.Date.max()

# Exploratory Data Analysis and Visualization

Objectives of exploratory data analysis:

- Study the distributions of individual columns (uniform, normal, exponential)
- Detect anomalies or errors in the data (e.g. missing/incorrect values)
- Study the relationship of target column with other columns (linear, non-linear etc.)
- Gather insights about the problem and the dataset
- Come up with ideas for preprocessing and feature engineering



In [None]:
sns.set_theme(rc={'figure.figsize':(11.7,8.27)})

In [None]:

sns.histplot(data = merged_df, x='Sales')

In [None]:
# Store closure analysis:
# sales are zero on the days when the store is closed

merged_df.Open.value_counts()

In [None]:
(merged_df['Sales']==0).sum()

In [None]:
merged_df.Sales.value_counts()[0]

### delete rows where the store is closed

In [None]:
# exclude the dates when store was closed
merged_df = merged_df[merged_df.Open==1].copy()

In [None]:
merged_df


In [None]:
sns.histplot(merged_df, x='Sales')

In [None]:
merged_df.describe()

In [None]:
temp_df = merged_df.sample(40000)

In [None]:

sns.scatterplot(x= temp_df.Sales, y = temp_df.Customers, hue= temp_df.Date.dt.year, alpha = 0.8)
plt.title('Sales vs Customers')
plt.show()

In [None]:
# Plot the total customers per year

total_customers_per_year = temp_df.groupby(temp_df['Date'].dt.year)['Customers'].sum()


plt.figure(figsize=(10, 6))
sns.barplot(x=total_customers_per_year.index, y=total_customers_per_year.values, color='skyblue')
plt.title('Total Customers per Year')
plt.xlabel('Year')
plt.ylabel('Total Customers')
plt.show()

In [None]:
temp_df = merged_df.sample(10000)
sns.scatterplot(x = temp_df.Store, y= temp_df.Sales, hue=  temp_df.Date.dt.year, alpha =0.8)
plt.title("stores vs sales")
plt.show()

In [None]:
# hue expects a column name (or array), not a boolean, so it is omitted here
sns.barplot(data=merged_df, x='DayOfWeek', y='Sales')

In [None]:
sns.barplot(merged_df, x='Promo', y ='Sales', palette='husl')

In [None]:
merged_df

In [None]:
merged_df.Promo.unique()

In [None]:
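# numeric_only=True skips text and date columns, which recent pandas versions refuse to correlate
merged_df.corr(numeric_only=True)['Sales'].sort_values(ascending=False)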

In [None]:
merged_df.Sales.corr(merged_df.Promo)

In [None]:
merged_df.corr(numeric_only=True)

In [None]:
sns.heatmap(merged_df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

# **Feature Engineering**

Feature engineering is the process of creating new features (columns) by transforming/combining existing features or by incorporating data from external sources.


For example, here are some features that can be extracted from the "Date" column:

1. Day of week
2. Day of month
3. Month
4. Year
5. Weekend/Weekday
6. Month/Quarter End
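The features listed above can be sketched with the pandas `.dt` accessor. This is a minimal example on a few hand-picked dates. The column names `IsWeekend`, `IsMonthEnd` and `IsQuarterEnd` are hypothetical, and the 1 = Monday to 7 = Sunday numbering for `DayOfWeek` is an assumption about the dataset's convention.

```python
import pandas as pd

# A few hand-picked dates: a Monday, a quarter end, a Saturday and a Sunday
dates = pd.DataFrame({
    'Date': pd.to_datetime(['2015-03-30', '2015-03-31', '2015-04-04', '2015-04-05'])
})

# pandas numbers weekdays 0 (Monday) to 6 (Sunday); shift to 1..7
dates['DayOfWeek'] = dates['Date'].dt.dayofweek + 1
dates['Day'] = dates['Date'].dt.day
dates['Month'] = dates['Date'].dt.month
dates['Year'] = dates['Date'].dt.year

# Binary flags (hypothetical column names)
dates['IsWeekend'] = (dates['Date'].dt.dayofweek >= 5).astype(int)
dates['IsMonthEnd'] = dates['Date'].dt.is_month_end.astype(int)
dates['IsQuarterEnd'] = dates['Date'].dt.is_quarter_end.astype(int)

print(dates)
```

The same assignments work on `merged_df` and `merged_test_df` once their `Date` columns have been converted with `pd.to_datetime`.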


In [None]:
merged_df['Day'] = merged_df.Date.dt.day
merged_df['Month'] = merged_df.Date.dt.month
merged_df['Year'] = merged_df.Date.dt.year

In [None]:
merged_test_df['Day'] = merged_test_df.Date.dt.day
merged_test_df['Month'] = merged_test_df.Date.dt.month
merged_test_df['Year'] = merged_test_df.Date.dt.year

In [None]:
sns.barplot(merged_df, x='Year',y='Sales', palette='husl')

In [None]:
filtered_df= merged_df[merged_df['Year']==2015]
#sns.barplot(data= filtered_df, x='Month', y='Sales', palette='husl')

In [None]:
sns.barplot(data= merged_df, x='Month', y='Sales', palette='husl')

In [None]:
pd.options.display.max_columns = None # to show all the column names in the dataFrame

In [None]:
merged_df

In [None]:
#merged_df.Year.value_counts()



## Using date information, we can also create new columns like:

1. Weather on each day
2. Whether the date was a public holiday
3. Whether the store was running a promotion on that day.


> **EXERCISE**: Create new columns using the above ideas.

In [None]:
merged_df.columns

### comparison between holidays vs sales.

1. compare sales of state holiday
2. compare sales of school holiday

compare sales of state holiday

In [None]:
plt.figure(figsize=(12, 6))



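# Assumes StateHoliday is coded '0' (no holiday) or 'a'/'b'/'c' (holiday type), so compare as strings
is_state_holiday = merged_df['StateHoliday'].astype(str) != '0'
state_holiday = merged_df[is_state_holiday]
non_state_holiday = merged_df[~is_state_holiday]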

# Plot sales during state holidays
plt.subplot(1, 2, 1)
sns.barplot(data=state_holiday, x='Month', y='Sales', palette='husl')
plt.title('Sales during State Holidays')
plt.xlabel('Month')
plt.ylabel('Sales')

# Plot sales during non-state holidays
plt.subplot(1, 2, 2)
sns.barplot(data=non_state_holiday, x='Month', y='Sales', palette='husl')
plt.title('Sales during Non-State Holidays')
plt.xlabel('Month')
plt.ylabel('Sales')

plt.tight_layout()
plt.show()

In [None]:
# Calculate total sales during state holidays and non-holidays
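# Assumes StateHoliday is coded '0' (no holiday) or 'a'/'b'/'c' (holiday type)
holiday_mask = merged_df['StateHoliday'].astype(str) != '0'
total_state_holiday_sales = merged_df[holiday_mask]['Sales'].sum()
total_non_holiday_sales = merged_df[~holiday_mask]['Sales'].sum()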

# Create a DataFrame for plotting
sales_comparison_df = pd.DataFrame({
    'State Holiday': total_state_holiday_sales,
    'Non-Holiday': total_non_holiday_sales
}, index=['Total Sales'])

# Plot the bar plot to compare total sales during state holidays and non-holidays
plt.figure(figsize=(8, 6))
sns.barplot(data=sales_comparison_df, palette='husl')
plt.title('Total Sales Comparison: State Holiday vs. Non-Holiday')
plt.ylabel('Total Sales')
plt.show()

compare sales of school holiday

In [None]:
plt.figure(figsize=(12,6))

school_holiday = merged_df[merged_df['SchoolHoliday']==1]
non_school_holiday = merged_df[merged_df['SchoolHoliday']==0]

#plot sales during school holidays
plt.subplot(1,2,1)
sns.barplot(school_holiday, x="Month", y='Sales', palette='husl')
plt.title('sales during school holiday')
plt.xlabel('month')
plt.ylabel('sales')

#plot sales during non school holidays
plt.subplot(1,2,2)
sns.barplot(non_school_holiday, x="Month", y='Sales', palette='husl')
plt.title('sales during non school holiday')
plt.xlabel('month')
plt.ylabel('sales')

plt.tight_layout()
plt.show()

In [None]:
merged_df

HW

> **EXERCISE**: The features `Promo2`, `Promo2SinceWeek` etc. are not very useful in their current form, because they do not relate to the current date. How can you improve their representation?
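One possible direction for this exercise, offered as a rough sketch rather than the intended answer, is to convert the start year and week of `Promo2` into the number of months the promotion has been running as of each row's date. The column name `Promo2Months`, the use of `Promo2SinceYear`, the sample rows, and the 30.5 days-per-month approximation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: promo running, no promo, promo that starts after the row's date
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-31', '2015-07-31', '2013-01-15']),
    'Promo2': [1, 0, 1],
    'Promo2SinceYear': [2014.0, np.nan, 2014.0],
    'Promo2SinceWeek': [1.0, np.nan, 10.0],
})

# ISO week number of each row's date
iso_week = df['Date'].dt.isocalendar().week.astype(int)

# Approximate months elapsed since the promotion started
months_running = (
    12 * (df['Date'].dt.year - df['Promo2SinceYear'])
    + (iso_week - df['Promo2SinceWeek']) * 7 / 30.5
)

# Zero when the store does not participate or the promotion has not started yet
df['Promo2Months'] = (
    months_running.where(df['Promo2'] == 1, 0).fillna(0).clip(lower=0)
)

print(df[['Date', 'Promo2', 'Promo2Months']])
```

Unlike the raw year and week columns, this value changes from row to row, so a model can relate it directly to the sales on that date.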

## Step 4 - Create a training/test/validation split and prepare the data for training

<img src="https://i.imgur.com/XZ9aP10.png" width="640">

### Train/Test/Validation Split

The data already contains a test set, which covers over one month of data after the end of the training set. We can apply a similar strategy to create a validation set: we'll use the last 25% of rows, after ordering by date.

In [None]:
merged_df

In [None]:
# calculating the unique days in test dataset

merged_test_df['Date'] = pd.to_datetime(merged_test_df['Date'])
date = merged_test_df['Date'].dt.date
num_days = len(date.unique())
num_days

In [None]:
len(merged_df)

In [None]:
train_size = int(.75 * len(merged_df))
train_size

In [None]:
sorted_df  = merged_df.sort_values('Date')
train_df, val_df = sorted_df[:train_size], sorted_df[train_size:]

In [None]:
len(train_df) , len(val_df)

In [None]:
train_df

In [None]:
val_df

In [None]:
train_df.Date.min(), train_df.Date.max()

In [None]:
val_df.Date.min(), val_df.Date.max()

In [None]:
merged_test_df.Date.min(), merged_test_df.Date.max()

In [None]:
train_df.columns.values.tolist(), len(train_df.columns)

In [None]:
merged_test_df.columns.values.tolist(), len(merged_test_df.columns)

### Input and Target columns

Let's also identify input and target columns. Note that we can't use the no. of customers as an input, because this information isn't available beforehand. Also, we needn't use all the available columns, we can start out with just a small subset.

In [None]:
input_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'StoreType', 'Assortment', 'Day', 'Month', 'Year']

In [None]:
target_col = 'Sales'

In [None]:
merged_df[input_cols].nunique()

In [None]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()

In [None]:
val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()

In [None]:
test_inputs = merged_test_df[input_cols].copy()
# Test data does not have targets

Note that some columns can be treated as both numeric and categorical, and it's up to you to decide how you want to deal with them.

In [None]:
numeric_cols = ['Store', 'Day', 'Month', 'Year']
categorical_cols = ['DayOfWeek', 'Promo', 'StateHoliday', 'StoreType', 'Assortment']

In [None]:
# data types check: print the dtypes pandas actually inferred for each group of columns

print("Numeric Columns Data Types:")
print(train_inputs[numeric_cols].dtypes)

print("\nCategorical Columns Data Types:")
print(train_inputs[categorical_cols].dtypes)


### Imputation, Scaling and Encoding

Let's impute missing data in the numeric columns. The values can also be scaled to the $[0, 1]$ range, but the cells in this section only perform the imputation.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(strategy='mean').fit(train_inputs[numeric_cols])

In [None]:
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
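The scaling step is not carried out in the cells here, so the following is a hedged sketch of how it could look with scikit-learn's `MinMaxScaler`, using small made-up frames in place of `train_inputs` and `val_inputs`. The scaler is fitted on the training data only, so that no information from the validation set leaks into preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ['Store', 'Day', 'Month', 'Year']

# Made-up stand-ins for the training and validation inputs
train = pd.DataFrame(
    {'Store': [1, 2, 3], 'Day': [1, 15, 31], 'Month': [1, 6, 12], 'Year': [2013, 2014, 2015]},
    dtype=float,
)
val = pd.DataFrame(
    {'Store': [2, 4], 'Day': [10, 20], 'Month': [3, 9], 'Year': [2015, 2015]},
    dtype=float,
)

# Learn each column's min and max from the training data only
scaler = MinMaxScaler().fit(train[numeric_cols])

# Apply the same transformation to both sets
train[numeric_cols] = scaler.transform(train[numeric_cols])
val[numeric_cols] = scaler.transform(val[numeric_cols])

print(train)
print(val)
```

Values in the validation or test set that fall outside the range seen during training map outside $[0, 1]$ (store 4 becomes 1.5 here), which is expected behavior rather than a bug.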

Finally, let's encode categorical columns as one-hot vectors.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train_inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

In [None]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

Explore the `scikit-learn` preprocessing module: https://scikit-learn.org/stable/modules/preprocessing.html

Let's now select the numeric and one-hot encoded columns as the final model inputs.

In [None]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

## Step 5 - Create quick & easy baseline models to benchmark future models

<img src="https://i.imgur.com/1DLgiEz.png" width="640">

A quick baseline model helps establish the minimum score any ML model you train should achieve.


### Fixed/Random Guess

Let's define a model that always returns the mean value of Sales as the prediction.
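A minimal sketch of such a baseline follows, using small made-up arrays in place of the real inputs and targets. The function names `return_mean` and `rmse`, and the choice of RMSE as the score, are illustrative assumptions.

```python
import numpy as np

def return_mean(inputs, mean_value):
    """Ignore the inputs and predict the same mean value for every row."""
    return np.full(len(inputs), mean_value)

def rmse(targets, predictions):
    """Root mean squared error between targets and predictions."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))

# Made-up stand-ins for the training targets and validation data
train_targets = np.array([5000.0, 6000.0, 7000.0])
val_inputs = np.zeros((2, 3))          # feature values are ignored by this baseline
val_targets = np.array([5500.0, 6500.0])

# The mean is computed from the training targets only
mean_sales = train_targets.mean()
val_preds = return_mean(val_inputs, mean_sales)
baseline_rmse = rmse(val_targets, val_preds)

print('Baseline validation RMSE:', baseline_rmse)
```

Any trained model should score better than this number on the validation set. If it does not, the features or the training procedure deserve a closer look.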