## ML Model Building & Deployment
Author: Ejaz-ur-Rehman\
Date Created: 25-03-2025\
Email ID: ijazfinance@gmail.com

### Steps involved in Model Building:
1. Define the problem
2. Data Collection / Gathering
3. Data Preprocessing
4. Selection of a Model
5. Splitting the Data
6. Evaluating the Model
7. Hyperparameter Tuning
8. Cross-Validation - to estimate and optimize the model accuracy
9. Model Finalization
10. Model Deployment
11. Retesting, Refining, Updating the Model
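
The core of the steps above (splitting, model selection, tuning, cross-validation, and evaluation) can be sketched roughly as follows. This is only an illustrative sketch, not a prescribed workflow: it assumes scikit-learn is installed, uses a synthetic dataset in place of collected data, and the choice of `LogisticRegression` and the grid of `C` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Steps 2-3: a synthetic dataset stands in for collected, preprocessed data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Step 5: hold out 20% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 4, 7 and 8: choose a model and tune C with 5-fold cross-validation
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Steps 6 and 9: evaluate the refitted best model on the held-out test set
final_model = search.best_estimator_
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print("best C:", search.best_params_["C"])
print("test accuracy:", round(test_accuracy, 3))
```

The test set is touched only once, at the end, so the reported accuracy is not influenced by the tuning.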

### What is an Algorithm?
An algorithm is a set of instructions that is used to solve a problem or perform a specific task. It is a well-defined procedure that takes some input and produces a corresponding output. Algorithms are used in a wide range of fields, including computer science, mathematics, and engineering. They can be expressed in various forms, such as natural language, flowcharts, or programming languages. The key characteristics of an algorithm include:
1.  **Input**: An algorithm takes some input, which can be in the form of data, numbers, or other information.
2.  **Output**: An algorithm produces a corresponding output, which can be in the form of a solution, a result, or a transformed input.
3.  **Finiteness**: An algorithm must terminate after a finite number of steps.
4.  **Definiteness**: Each step of an algorithm must be well-defined and unambiguous.
5.  **Effectiveness**: An algorithm must be able to solve the problem it is designed to solve.
6.  **Feasibility**: An algorithm must be able to be implemented and executed using available resources.
7.  **Correctness**: An algorithm must produce the correct output for a given input.
8.  **Efficiency**: An algorithm should be able to solve the problem in a reasonable amount of time without consuming excessive resources.
9.  **Scalability**: An algorithm should be able to handle large inputs and produce the correct output without significant degradation in performance.
10.  **Maintainability**: An algorithm should be easy to understand, modify, and maintain.
11.  **Testability**: An algorithm should be easy to test and verify its correctness.
12.  **Reusability**: An algorithm should be able to be reused in different contexts and applications.
13.  **Flexibility**: An algorithm should be able to adapt to changing requirements and inputs.
14.  **Robustness**: An algorithm should be able to handle unexpected inputs and errors without crashing or producing incorrect results.
15.  **Security**: An algorithm should be able to protect sensitive data and prevent unauthorized access.
16.  **Transparency**: An algorithm should be transparent in its operation and decision-making process.
17.  **Accountability**: An algorithm should be accountable for its actions and decisions.
18.  **Auditability**: An algorithm should be able to provide a clear audit trail of its actions and decisions.
19.  **Compliance**: An algorithm should comply with relevant laws, regulations, and standards.
20.  **Sustainability**: An algorithm should be sustainable and environmentally friendly.
21.  **Inclusivity**: An algorithm should be inclusive and accessible to all users, regardless of their background, culture, or ability.
22.  **Accessibility**: An algorithm should be accessible and usable by people with disabilities.
23.  **Usability**: An algorithm should be easy to use and understand, with a user-friendly interface.
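
As a small illustration of the first few characteristics (input, output, finiteness, definiteness), here is Euclid's algorithm for the greatest common divisor. It is chosen only as a familiar example and is not otherwise used in this notebook.

```python
def gcd(a: int, b: int) -> int:
    """Return the greatest common divisor of two non-negative integers."""
    # Input: two non-negative integers a and b
    while b != 0:
        # Definiteness: each step is a single, unambiguous remainder operation
        # Finiteness: b strictly decreases, so the loop must terminate
        a, b = b, a % b
    # Output: the greatest common divisor
    return a


print(gcd(48, 18))  # 6
```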


### Testing Data
The testing data is used to evaluate the performance of the model. It is a subset of the entire dataset that is held out from training, and it is used to get an estimate of how the model will perform on unseen data.

### Training Data
The training data is used to train the model. It is a subset of the entire dataset, and the model learns from it so that it can make predictions on new, unseen data.
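
A minimal sketch of how a dataset might be divided into training and testing sets, assuming scikit-learn is available. The DataFrame, its column names, and the 80/20 ratio are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, two features and a binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

# Keep 20% of the rows aside as testing data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 80 20
# No row appears in both sets
print(train_df.index.intersection(test_df.index).empty)  # True
```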

### Features
Features are the input variables that a model uses to make predictions. Each feature is one column of the dataset, for example `age`, `sex`, or `fare` in the Titanic dataset used below.

### Labels
A label is the output or target variable that a model is trained to predict. Labels represent the correct answer or ground truth in a supervised learning problem.
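
In pandas, the label column is usually separated from the input columns before training. The following is a small sketch with hypothetical rows shaped like the Titanic data used later in this notebook; treating `survived` as the label is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic data used later
df = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "survived": [0, 1, 1, 1],
})

X = df.drop(columns="survived")  # input columns
y = df["survived"]               # label (target) column

print(list(X.columns))  # ['pclass', 'sex', 'age']
print(y.tolist())       # [0, 1, 1, 1]
```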

### Overfitting
Overfitting occurs when a model is too complex and fits the training data too closely. This can result in poor performance on unseen data. To prevent overfitting, we can use techniques such as regularization, early stopping, and cross-validation.

### Underfitting
Underfitting occurs when the model is too simple and cannot capture the underlying patterns in the data. This can lead to poor predictions and a high error rate.
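
One way to see both effects is to vary the complexity of a single model and compare its training and testing accuracy. The sketch below assumes scikit-learn and uses synthetic, deliberately noisy data; the decision tree and the depth values are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% of the labels randomly reassigned (noise)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for depth in (1, 4, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for this depth
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

A depth-1 tree usually scores modestly on both sets (underfitting), while the unrestricted tree memorizes the training set and scores noticeably lower on the testing set (overfitting).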

### Data Preprocessing
The data was preprocessed to ensure that it was in a suitable format for analysis. The data was cleaned by removing any missing values and handling any inconsistencies in the data. The data was then split into training and testing sets to evaluate the performance of the model. The training set was used to train the model, and the testing set was used to evaluate its performance.
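
The cleaning and splitting described above might look roughly like this in pandas. The DataFrame, its column names, and the 75/25 split are hypothetical, chosen only to keep the sketch small.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent text and missing values
raw = pd.DataFrame({
    "city": [" Lahore", "lahore", "Karachi", "Karachi ", None, "Multan"],
    "income": [50.0, 52.0, np.nan, 61.0, 58.0, 45.0],
    "bought": [1, 1, 0, 0, 1, 0],
})

clean = raw.copy()
# Handle inconsistencies: trim whitespace and normalise capitalisation
clean["city"] = clean["city"].str.strip().str.title()
# Remove rows that still contain missing values
clean = clean.dropna().reset_index(drop=True)

# Split: a random 75% of the rows for training, the rest for testing
train = clean.sample(frac=0.75, random_state=0)
test = clean.drop(train.index)

print(clean["city"].unique().tolist())  # ['Lahore', 'Karachi', 'Multan']
print(len(train), len(test))            # 3 1
```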

### Missing Values Handling
Missing values are a common problem in data analysis. There are several ways to handle missing values, including:
1.  **Listwise Deletion**: This method involves deleting all rows that contain missing values.
2.  **Pairwise Deletion**: This method involves deleting only the rows that contain missing values for the specific variables being analyzed.
3.  **Mean/Median Imputation**: This method involves replacing missing values with the mean or median of the variable.
4.  **Regression Imputation**: This method involves using a regression model to predict the missing values.
5.  **K-Nearest Neighbors (KNN) Imputation**: This method involves using the KNN algorithm to predict the missing values based on the values of similar observations.
6.  **Multiple Imputation**: This method involves creating multiple versions of the dataset with different imputed values for the missing data.
7.  **Last Observation Carried Forward (LOCF)**: This method involves carrying forward the last observed value for a variable to replace missing values.
8.  **Next Observation Carried Backward (NOCB)**: This method involves carrying the next observed value for a variable backward to replace missing values.
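9.  **Cold Deck Imputation**: This method involves replacing missing values with a value taken from an external source, such as an earlier survey or another dataset. (Taking the value from a similar observation in the same dataset is known as hot deck imputation.)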
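
Several of the simpler methods in this list map directly onto pandas calls. The sketch below uses a small hypothetical DataFrame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with gaps
df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
    "humidity": [55.0, 60.0, np.nan, 70.0, 75.0],
})

listwise = df.dropna()                   # listwise deletion: keep only complete rows
mean_imputed = df.fillna(df.mean())      # replace gaps with each column's mean
median_imputed = df.fillna(df.median())  # replace gaps with each column's median
last_filled = df.ffill()                 # carry the last observed value forward
next_filled = df.bfill()                 # carry the next observed value backward

print(len(listwise))                   # 2
print(median_imputed["temp"].tolist()) # [20.0, 24.0, 24.0, 24.0, 30.0]
print(last_filled["temp"].tolist())    # [20.0, 20.0, 24.0, 24.0, 30.0]
print(next_filled["temp"].tolist())    # [20.0, 24.0, 24.0, 30.0, 30.0]
```

Forward and backward filling only make sense when the rows have a meaningful order, such as time.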

### Seven Important Ways for Imputing Missing Values

1. Simple Imputation Technique:
   - This technique involves replacing missing values with a simple imputation value. For example, if a value is missing, it can be replaced with the mean, median, or mode of the respective feature. This technique is simple to implement but may not be effective for all types of data.
2. K-Nearest Neighbors (KNN) Imputation:
   - This technique involves finding the K-nearest neighbors to the data point with missing values and then imputing the missing values based on the values of these neighbors. This technique is more effective than simple imputation but can be computationally expensive for large datasets.
3. Regression Imputation:
   - This technique involves using a regression model to predict the missing values. This technique is more effective than simple imputation but requires a good understanding of the relationships between the features. 
4. Decision Tree Imputation:
   - This technique involves using a decision tree to predict the missing values. This technique is more effective than simple imputation but can be computationally expensive for large datasets. 
5. Multiple Imputation by Chained Equations (MICE):
   - This technique imputes each feature that has missing values using a regression model built on the other features, cycling through the features over several iterations. It is usually more accurate than simple imputation but can be computationally expensive for large datasets.
6. Deep Learning Imputation:
   - This technique involves using deep learning models such as neural networks to impute the missing values. This technique is more effective than simple imputation but requires a good understanding of deep learning concepts.
7. Time Series Specific Methods:
   - This technique involves using time series analysis, such as forward fill, backward fill, or interpolation between neighboring time points, to impute the missing values. It is more effective than simple imputation for ordered data but requires a good understanding of time series concepts.
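
Three of the techniques above have ready-made counterparts in scikit-learn, sketched below on a tiny hypothetical array. This assumes scikit-learn is installed. Note that `IterativeImputer` is still marked experimental there (hence the extra import) and returns a single imputed dataset, so it is only an approximation of full MICE.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical data: the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# 1. Simple imputation: fill each gap with the column median
simple = SimpleImputer(strategy="median").fit_transform(X)

# 2. KNN imputation: average the values of the 2 nearest rows
knn = KNNImputer(n_neighbors=2).fit_transform(X)

# 5. Chained-equations style imputation: model each column from the others
mice_like = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(simple)
print(knn)
print(mice_like)
```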
   

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
data = sns.load_dataset('titanic')
data.head()

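|  | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |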


In [22]:
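# count the missing values in each column, largest first
# Note: the execution count (In [22]) indicates this cell was re-run after the cells below,
# so the output already reflects the dropped deck column and the imputed age values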
data.isnull().sum().sort_values(ascending=False)

embark_town    2
embarked       2
sex            0
age            0
survived       0
pclass         0
parch          0
sibsp          0
class          0
fare           0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

In [13]:
# drop deck column
data.drop('deck', axis=1, inplace=True, errors='ignore')



In [14]:
data.head()

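|  | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | Southampton | no | True |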


In [16]:
# impute missing values in the age column with the column mean
data['age'] = data['age'].fillna(data['age'].mean())

# check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

embark_town    2
embarked       2
sex            0
age            0
survived       0
pclass         0
parch          0
sibsp          0
class          0
fare           0
who            0
adult_male     0
alive          0
alone          0
dtype: int64
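
The output above still shows two missing values each in `embarked` and `embark_town`. Because both columns are categorical, one common option is to fill them with the most frequent value (the mode). The sketch below is one possible next step, not part of the original notebook, and it uses a small hypothetical DataFrame instead of the Titanic data so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the two columns that still contain gaps
df = pd.DataFrame({
    "embarked": ["S", "C", None, "S", "Q", None],
    "embark_town": ["Southampton", "Cherbourg", None,
                    "Southampton", "Queenstown", None],
})

for col in ["embarked", "embark_town"]:
    # mode() returns a Series of the most frequent values; take the first one
    most_frequent = df[col].mode()[0]
    df[col] = df[col].fillna(most_frequent)

print(df.isnull().sum().sum())  # 0
print(df["embarked"].tolist())  # ['S', 'C', 'S', 'S', 'Q', 'S']
```

The same loop could be applied to `data` from the cells above by replacing `df` with `data`.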