<h1 align=center> Machine Learning Project Lifecycle </h1>

## Machine Learning Project Lifecycle

<img src="../resources/lifeCycle.png" height=500px width=500px>

<br/>

The **machine learning project lifecycle** refers to the stages involved in developing and deploying a machine learning model. The project lifecycle typically includes the following stages:
<br/>



- **Problem Definition:** The first step is to define the problem we want to solve. This includes identifying the business or research problem, defining the data requirements, and determining the objectives and metrics for success.
<br/>
- **Data Collection:** The second step is to collect the data that we will use to build our model. This may involve acquiring data from various sources, including internal and external databases, APIs, and web scraping.
<br/>

- **Data Cleaning and Preparation:** Once we have collected the data, we need to clean and prepare it for analysis. This involves tasks such as removing duplicates, handling missing data, and transforming the data into the appropriate format.
<br/>

- **Data Analysis and Visualization:** With the data cleaned and prepared, the next step is to analyze and visualize the data to gain insights into the problem we are trying to solve. This may include exploratory data analysis, data visualization, and feature selection.
<br/>

- **Model Selection and Training:** Once we have a clear understanding of the data, the next step is to select an appropriate machine learning model and train it on the data. Before training, we need to split the data into training and testing sets. This step involves evaluating different models and selecting the one that provides the best performance.
<br/>

- **Model Evaluation:** Once we have trained the model, we need to evaluate its performance on a holdout test set. This involves calculating metrics such as MSE, RMSE, and R² for regression problems, or accuracy, precision, recall, and F1 score for classification problems.
<br/>

- **Model Deployment:** If the model performs well, the next step is to deploy it in a production environment. This may involve integrating it with existing systems or building a custom application to provide access to the model.
<br/>

- **Model Maintenance and Monitoring:** After the model has been deployed, it is important to monitor its performance and ensure that it continues to provide accurate results. This involves tasks such as updating the model as new data becomes available, retraining the model periodically, and monitoring for drift or changes in the data distribution.
<br/>
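
The splitting, training, and evaluation stages above can be sketched with NumPy alone. This is only a minimal illustration: the synthetic data, the 80/20 split, and the straight-line model fitted with `np.polyfit` are assumptions made for the example, not part of the lecture material.

```python
import numpy as np

# Synthetic data (assumed for illustration): y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

# Shuffle the row indices, then split into training (80%) and testing (20%) sets.
indices = rng.permutation(len(x))
train_idx, test_idx = indices[:80], indices[80:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

# "Train" a very simple model: fit a straight line by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Evaluate on the held-out test set with regression metrics.
y_pred = slope * x_test + intercept
mse = np.mean((y_test - y_pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

In a real project the model would usually come from a dedicated library, but the order of the steps (split, fit on the training set, measure on the test set) stays the same.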



<h2 align='center'>Data Collection</h2>
<br>

<img src="../resources/acq.PNG">

<h2 align='center'> Data Preprocessing </h2>

<br>

<p align="center"><div class="alert alert-success" style="margin: 20px"> 
Data preprocessing is the third step in a machine learning project, where we clean and organize the data so that it can be used for analysis and model training. This involves tasks like handling missing values, outliers, and inconsistent data; transforming data to make it suitable for machine learning algorithms; and visualizing data to understand patterns. The aim of data preprocessing is to ensure that the data used for machine learning is accurate, reliable, and ready for analysis, and to improve the performance and accuracy of the machine learning models by addressing data quality issues.
</div></p>
    
    
<br/>

### Data Cleaning

- Handling missing values, if any
	- Imputing missing values using simple techniques like mean, median, or mode
	- Removing rows or columns with missing values, if appropriate

- Identifying and addressing outliers, if any
	- Using basic statistical techniques to detect outliers
	- Applying appropriate techniques like data transformation or removing outliers, if necessary

- Managing inconsistent data, if any
	- Standardizing data format and units
	- Resolving any data inconsistencies or conflicts
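
The cleaning steps above might look like the following in pandas. This is a sketch under stated assumptions: the small data frame, the choice of median imputation, and the 1.5 × IQR outlier rule are illustrative choices, not the only valid ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, an outlier, and inconsistent labels.
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 24.0, 23.0, 95.0],
    "city": ["Pune", "pune ", "Delhi", "DELHI", "Pune", "Delhi"],
})

# Missing values: impute with the median, which is robust to the outlier.
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: flag values further than 1.5 * IQR from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
df_clean = df[~is_outlier].copy()

# Inconsistent data: standardize whitespace and capitalization.
df_clean["city"] = df_clean["city"].str.strip().str.title()

print(df_clean)
```

Whether to drop, cap, or transform an outlier depends on the problem; dropping is used here only because it is the shortest to show.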

### Data Transformation

- Data normalization
	- Scaling data to a common range to ensure fair comparison
- Data encoding
	- Converting categorical data to numerical format for machine learning algorithms
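
As a rough sketch of both transformations, the example below applies min-max scaling and one-hot encoding with pandas. The column names and values are made up for illustration, and min-max scaling is just one of several normalization schemes.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical column.
df = pd.DataFrame({
    "salary": [30000, 45000, 60000, 90000],
    "department": ["HR", "IT", "IT", "Sales"],
})

# Normalization: min-max scaling maps the column onto the range [0, 1].
s = df["salary"]
df["salary_scaled"] = (s - s.min()) / (s.max() - s.min())

# Encoding: one-hot encode the categorical column into indicator columns.
encoded = pd.get_dummies(df, columns=["department"])

print(encoded)
```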

### Data Visualization

- Using simple data visualization techniques like scatter plots, bar charts, or histograms to explore and understand the data
- Identifying patterns, trends, and relationships in the data

## Recap

- **NumPy**
- **Pandas**
- **Matplotlib and Seaborn**

<h2 align='center'> NumPy </h2>



<p align="center"><div class="alert alert-success" style="margin: 20px"> 
NumPy is a library for numerical computing and data preparation in Python. It provides a large collection of high-level mathematical functions that operate on arrays, including trigonometric, statistical, and algebraic routines. Its core features include n-dimensional arrays, broadcasting, element-wise operations, and data generation, which make it the fundamental package for scientific computing with Python.

</div></p>

<img src="../resources/Uses-of-NumPy-1.webp" height=500px width=500px>

In [10]:
# Installation and Implementation

!pip3 install numpy

In [11]:
import numpy as np

In [12]:
# check path and version 

In [13]:
#Concept of scalar, vector , matrix
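
One possible way to fill in the two cells above is sketched below. The variable names are illustrative, and the printed path and version will differ from one installation to another.

```python
import numpy as np

# Check which NumPy installation is in use and its version.
print(np.__version__)
print(np.__file__)

# Scalar: a single value (0 dimensions).
scalar = np.array(5)
# Vector: a one-dimensional array.
vector = np.array([1, 2, 3])
# Matrix: a two-dimensional array.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

for name, arr in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```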

#### NumPy Array Creation

**1. Built-In Methods**

NumPy provides many built-in methods for generating arrays.
- `np.array()` – creates an array from a Python list or another sequence
- `np.arange()` – array of evenly spaced values from a start value up to, but not including, a stop value
- `np.zeros()` – array of zeros with the specified shape
- `np.ones()` – similar to zeros, array of ones with the specified shape
- `np.linspace()` – array of a specified number of evenly spaced values between two endpoints
- `np.eye()` – two dimensional array with ones on the diagonal, zeros elsewhere
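
A short sketch of each creation method listed above; the shapes and values are arbitrary choices for the example.

```python
import numpy as np

a = np.array([1, 2, 3])      # from a Python list
b = np.arange(0, 10, 2)      # start, stop (exclusive), step
c = np.zeros((2, 3))         # 2x3 array of zeros
d = np.ones((3, 2))          # 3x2 array of ones
e = np.linspace(0, 1, 5)     # 5 evenly spaced values, both ends included
f = np.eye(3)                # 3x3 identity matrix

print(b)   # [0 2 4 6 8]
print(e)   # [0.   0.25 0.5  0.75 1.  ]
print(c.shape, d.shape, f.shape)
```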


<br>

**2. Random**

NumPy provides various functions to produce arrays with random values. To access these functions, first we have to access the `random` module itself. This is done using `np.random`, after which we specify which function we need. Here is a list of the most used random functions and their purpose:
- `np.random.rand()` – produce random values in the given shape, uniformly distributed from 0 (inclusive) to 1 (exclusive)
- `np.random.randn()` – produce random values from the ‘standard normal’ distribution (mean 0, standard deviation 1); the values are not limited to a fixed range
- `np.random.randint()` – produce random integers from low (inclusive) to high (exclusive), specified as parameters
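
The three functions can be tried as follows. The seed value and the array sizes are arbitrary; the seed is set only so that repeated runs give the same numbers.

```python
import numpy as np

np.random.seed(42)  # fix the seed so the run is repeatable

uniform = np.random.rand(2, 3)               # uniform values in [0, 1)
normal = np.random.randn(1000)               # standard normal: mean 0, std 1
integers = np.random.randint(1, 7, size=10)  # integers from 1 to 6 (7 is excluded)

print(uniform.shape)
print(normal.min(), normal.max())  # values can fall well outside [-1, 1]
print(integers)
```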


<br>

**3. Array Attributes and Methods**

Now we will continue with more attributes and methods that can be used on arrays. Here `arr` stands for any NumPy array.

- `arr.reshape()` (or `np.reshape()`) – returns the array's data arranged in the desired shape
- `arr.shape` (or `np.shape()`) – returns a tuple with the shape of the array
- `arr.dtype` – returns the data type of the values in the array
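
A minimal sketch of reshaping an array and inspecting its shape and data type; the 12-element array is an arbitrary example.

```python
import numpy as np

arr = np.arange(12)        # 1-D array with 12 elements
grid = arr.reshape(3, 4)   # same data arranged as 3 rows x 4 columns

print(arr.shape)           # (12,)
print(grid.shape)          # (3, 4)
print(np.shape(grid))      # function form, same result as grid.shape
print(grid.dtype)          # the platform's default integer type
```

Note that the total number of elements must stay the same: 12 values can be reshaped to 3x4 or 2x6, but not to 5x2.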


<br>

**4. Numpy Indexing and Selection**

Here, we will discuss how to select an element or a group of elements from an array and change them. Two techniques are used to perform these operations:

- `Indexing` – pick one or more elements from an array
- `Broadcasting` – assign a single value to every element within an index range
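
The sketch below shows both ideas on small example arrays (the values are arbitrary).

```python
import numpy as np

arr = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]

print(arr[3])                # single element -> 3
print(arr[2:5])              # slice -> [2 3 4]

arr[0:3] = 100               # the scalar is broadcast across the slice
print(arr)                   # [100 100 100   3   4   5   6   7   8   9]

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[1, 2])          # row 1, column 2 -> 6
print(matrix[:, 0])          # first column -> [1 4 7]
print(matrix[matrix > 5])    # boolean selection -> [6 7 8 9]
```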

<br>

**5. Numpy Operations**

We can perform different types of operations on NumPy arrays. This means we can add, subtract, multiply, or divide the values inside our arrays, and even do things like take the square root. Below is a list of what we will discuss in this lecture.

- `Arithmetic Operations` – sum, subtract, multiply, divide on arrays
- `Universal Array Functions` – Mathematical operations provided by NumPy
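
Both kinds of operation work element by element, as the following sketch with two small example arrays shows.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

# Arithmetic operations are applied element by element.
print(a + b)   # [ 2.  6. 12.]
print(a - b)   # [0. 2. 6.]
print(a * b)   # [ 1.  8. 27.]
print(a / b)   # [1. 2. 3.]

# Universal functions (ufuncs) are also applied element by element.
print(np.sqrt(a))   # [1. 2. 3.]
print(np.exp(b))

# Aggregations reduce the whole array to a single value.
print(np.max(a))    # 9.0
```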

## Exercise

#### Exercise 1: 
Create a 4x2 integer array and print its attributes.    
**Note: The elements must be of type unsigned int16 (`uint16`). Print the following attributes:**

- The shape of the array.
- The number of array dimensions.
- The length of each element of the array in bytes.

<br>

#### Exercise 2: 
Create a 5X2 integer array from a range between 100 to 200 such that the difference between each element is 10.    
**Hint: Use np.arange() and reshape() function.**

<br>

#### Exercise 3: 
Given the NumPy array below, return an array of the items in the third column of every row.    
**sampleArray = numpy.array([[11 ,22, 33], [44, 55, 66], [77, 88, 99]])**

<h2 align='center'> Pandas </h2>

<br>


<img align="center" width="700" height="700"  src="../resources/pandas-apps.png"  >

<br>
<br>


> Pandas is an open-source Python library built on top of NumPy that provides easy-to-use data structures and data analysis tools. Its name is derived from "panel data", and its development was started by Wes McKinney in 2008.

> Data scientists use pandas for a range of data science tasks, starting with downloading, opening, reading, and writing files in formats such as CSV, Excel, JSON, and HTML. They load the dataset into a pandas data structure called a DataFrame.

> A pandas DataFrame is a 2-dimensional labeled data structure (like a SQL table) with heterogeneously typed columns, having both a row and a column index.

> After the data is loaded into a DataFrame, data scientists perform various data manipulation tasks, such as filtering and modifying data based on multiple conditions, and cutting, splitting, merging, sorting, scaling, pivoting, and aggregating data.

> Data cleaning enhances data accuracy and integrity by identifying and handling null values, duplicates, and outliers.

> Data wrangling transforms the data structurally into an appropriate format and makes it ready for machine learning engineers, who can then apply appropriate machine learning models or algorithms to the dataset for training, validation, and testing.


<img align="center" width="400" height="400"  src="../resources/pandas1.png"  >


### Anatomy of a Dataframe
<img align="center" width="800" height="500"  src="../resources/dataframe.webp"  >

### Anatomy of a Series

<img align="center" width="500" height="600"  src="../resources/series-anatomy.png"  >


In [23]:
#Installation and Implementation

In [24]:
#path and version

## Series Introduction

1. **Creating a Series**
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
    - Creating empty series object
2. **Attributes of a Pandas Series**
3. **Understanding Index in a Series and its usage**
    - Identification
    - Selection/Filtering/Subsetting


In [26]:
# creating a series
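
The outline above could be worked through along the following lines. The values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Creating a Series from different sources.
from_list = pd.Series([10, 20, 30])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])
from_dict = pd.Series({"math": 90, "physics": 85})
from_scalar = pd.Series(7, index=["x", "y", "z"])
empty = pd.Series(dtype="float64")

# Attributes of a Series.
print(from_array.index)
print(from_array.values)
print(from_array.dtype, from_array.shape, from_array.size)

# Using the index for identification and selection.
print(from_dict["math"])            # select by label
print(from_array[from_array > 2])   # filter with a boolean condition
```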

## DataFrame Introduction

1. **Creating Dataframe**
    - An empty dataframe
    - Two-Dimensional NumPy Array
    - Dictionary of Python Lists
    - Dictionary of Pandas Series
2. **Attributes of a Dataframe**

In [29]:
#create a dataframe
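
A possible sketch of the four creation routes and a few common attributes follows. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Creating a DataFrame from different sources.
empty_df = pd.DataFrame()
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
from_lists = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [28, 35]})
from_series = pd.DataFrame({
    "price": pd.Series([10.0, 12.5], index=["apple", "pear"]),
    "stock": pd.Series([100, 40], index=["apple", "pear"]),
})

# Attributes of a DataFrame.
print(from_lists.shape)
print(from_lists.columns.tolist())
print(from_lists.index.tolist())
print(from_lists.dtypes)
```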

## Data Handling with Pandas

- **Data Reading**: Pandas provides two functions, `read_csv()` and `read_excel()`, to read data from a CSV file and an Excel file respectively.

- **Viewing data** – Data in a data frame can be viewed in three ways:
 >- using the data frame’s name – displays the data frame; for a large data frame, only the top and bottom 5 rows are shown
 >- using the `dataframe.head()` function – returns the first 5 rows by default
 >- using the `dataframe.tail()` function – returns the last 5 rows by default

- **Data Overview**: To see more details on the data frame, the `info()` function can be used. `info()` lists each column in the data frame with its non-null count and data type.

- The following functions are used to find the unique entries within a series/column in a data frame.
 >- `dataframe['column'].unique()` – returns the unique values (this is a Series method, so a column must be selected first)
 >- `dataframe['column'].nunique()` – returns the count of unique values
 >- `dataframe['column'].value_counts()` – returns the frequency of each of the categories in the column

- In our example, the Titanic dataset contains a column called `Survived`, which tells whether a particular passenger survived the tragedy. Since this value can only be 0 or 1, it acts as a label rather than a quantity, so we can convert the data type from integer to object.
 >- `dataframe.astype()` is the function that lets us do the conversion
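
The functions above can be tried on a small stand-in for the Titanic data. The CSV text below is invented for the example and is read from memory with `io.StringIO`; with a real file, a path would be passed to `read_csv()` instead.

```python
import io
import pandas as pd

# A tiny made-up stand-in for a Titanic-style CSV file.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,1,1,female,35
5,0,3,male,
6,0,3,male,54
"""

df = pd.read_csv(io.StringIO(csv_text))

# Viewing data.
print(df.head(3))
print(df.tail(2))

# Data overview: column names, non-null counts, and data types.
df.info()

# Unique entries within a column.
print(df["Pclass"].unique())
print(df["Pclass"].nunique())
print(df["Sex"].value_counts())

# Convert the 0/1 label column from integer to object.
df["Survived"] = df["Survived"].astype("object")
print(df["Survived"].dtype)
```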

## Statistical functions
- `count()` : Returns the number of non-null values
- `sum()` : Returns the sum of all values
- `mean()` : Returns the average of all values
- `median()` : Returns the median of all values
- `mode()` : Returns the most frequent value(s)
- `std()` : Returns the standard deviation
- `min()` : Returns the minimum of all values
- `max()` : Returns the maximum of all values
- `abs()` : Returns the absolute value of each element
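
A quick sketch of these functions on a small example Series (the values are arbitrary and include one missing entry to show how nulls are skipped):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, -2, 7, 4, np.nan, 10])

print(s.count())           # 5 (the NaN is not counted)
print(s.sum())             # 23.0
print(s.mean())            # 4.6
print(s.median())          # 4.0
print(s.mode().tolist())   # [4.0]
print(s.std())             # sample standard deviation (ddof=1 by default)
print(s.min(), s.max())    # -2.0 10.0
print(s.abs().tolist())    # element-wise absolute values
```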

### Mean: 

- The mean, also known as the average, is the sum of all values in a dataset divided by the number of values. It is a measure of central tendency that represents the average value of the data.

**Example:** 
- Consider the following dataset of ages (in years) of a group of individuals: 20, 25, 30, 35, 40. The mean of this dataset would be (20 + 25 + 30 + 35 + 40) / 5 = 30 years.

**Usage:** 
- Mean is commonly used to calculate the central tendency of data and provide an overall measure of the average value. It is widely used in statistics, data analysis, and machine learning to understand the typical or average value of a dataset.

**Value:** 
- The mean can be affected by extreme values, also known as outliers, and may not be an appropriate measure of central tendency when dealing with skewed data or data with outliers. A smaller mean value indicates that the values in the dataset are on average smaller, while a greater mean value indicates that the values in the dataset are on average greater.



### Median:

- The median is the middle value in a dataset when it is arranged in ascending or descending order. It is a measure of central tendency that separates the data into two equal halves. When the dataset has an even number of values, the median is the average of the two middle values.

**Example:** 
    
- Consider the same dataset of ages (in years) as in the previous example: 20, 25, 30, 35, 40. The median of this dataset would be 30, as it is the middle value when the dataset is arranged in ascending order.

**Usage:**
    
- Median is commonly used when dealing with datasets that have extreme values or are skewed, as it is less affected by outliers compared to the mean. It is also used when the dataset is not normally distributed or when the data does not follow a symmetrical pattern.


**Value:**
- The median is largely unaffected by extreme values and may be more appropriate than the mean when dealing with skewed data or data with outliers. It represents the central value that separates the dataset into two equal halves. A smaller or greater median is not better or worse in itself; it simply locates the middle of the dataset.
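
The difference between the mean and the median can be checked on the ages dataset. The second Series, where the largest age is replaced by an extreme value, is an assumption added for illustration.

```python
import pandas as pd

ages = pd.Series([20, 25, 30, 35, 40])
print(ages.mean())     # 30.0
print(ages.median())   # 30.0

# Replace the largest value with an extreme one (assumed for illustration).
ages_with_outlier = pd.Series([20, 25, 30, 35, 400])
print(ages_with_outlier.mean())     # 102.0
print(ages_with_outlier.median())   # 30.0
```

The single extreme value pulls the mean from 30 to 102, while the median stays at 30.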




### Variance: 
- Variance is a measure of how much the values in a dataset deviate from the mean. It is the average of the squared differences between each value and the mean.

**Example:** 
- Consider the following dataset of exam scores: 80, 85, 90, 95, 100. The mean of this dataset is 90. The variance can be calculated as ((80 - 90)^2 + (85 - 90)^2 + (90 - 90)^2 + (95 - 90)^2 + (100 - 90)^2) / 5 = (100 + 25 + 0 + 25 + 100) / 5 = 250 / 5 = 50. This is the population variance, which divides by the number of values n. The sample variance divides by n - 1 instead, giving 250 / 4 = 62.5.

**Usage:** 
- Variance is commonly used to measure the spread or variability of data points around the mean. It provides a quantitative measure of how much the data points deviate from the mean, indicating the extent of variation in the dataset.


**Value:**

- A smaller variance value indicates that the data points are closer to the mean and have less variability, while a greater variance value indicates that the data points are further from the mean and have higher variability.

### Standard Deviation: 

- Standard deviation is the square root of the variance and provides a measure of the average amount of variation or dispersion of data points around the mean.

**Example:** 
- Using the same dataset of exam scores as in the previous example, the variance was calculated as 50. The standard deviation is the square root of the variance, which is the square root of 50, or approximately 7.07.

**Usage:** 
- Standard deviation is commonly used as a measure of data dispersion or variability, similar to variance. It provides a more interpretable measure compared to variance, and is often used in statistical analysis and machine learning to understand the spread or variability of data points within a dataset.

**Value:** 
- A smaller value of standard deviation indicates less variability or dispersion of data points around the mean, which can be considered "good" in some cases as it suggests that the data points are closer to the mean and the dataset is more consistent. On the other hand, a greater value of standard deviation indicates higher variability or dispersion of data points around the mean, which can be considered "bad" in some cases as it suggests that the dataset is more spread out and less consistent.
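
The variance and standard deviation of the exam scores can be checked with pandas. One detail worth knowing: pandas divides by n - 1 by default (`ddof=1`, the sample statistic), whereas NumPy divides by n by default (`ddof=0`, the population statistic), so the two libraries give different numbers unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

scores = pd.Series([80, 85, 90, 95, 100])

# Population statistics: divide the sum of squared deviations by n.
pop_var = scores.var(ddof=0)
pop_std = scores.std(ddof=0)

# Sample statistics: divide by n - 1 (the pandas default).
sample_var = scores.var()
sample_std = scores.std()

print(pop_var)      # 50.0
print(pop_std)      # about 7.07
print(sample_var)   # 62.5
print(sample_std)   # about 7.91

# NumPy uses ddof=0 by default, so it returns the population variance.
print(np.var(scores.to_numpy()))
```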

## Aggregation
- An aggregation function can be applied to one or more columns. You can either apply the same aggregate functions across all columns or apply different aggregate functions to different columns.
- Syntax : 
 >- DataFrame.aggregate(func, axis=0, *args, **kwargs)
 
<img src="../resources/pandas-agg-func.png" height=400px width=600px>
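
Both styles of aggregation can be sketched as follows. The data frame and its column names are hypothetical; `agg` is the short alias for `aggregate`.

```python
import pandas as pd

df = pd.DataFrame({
    "math": [80, 90, 70],
    "physics": [60, 75, 90],
})

# The same aggregate functions applied to every column.
same = df.agg(["min", "max", "mean"])

# Different aggregate functions for different columns.
different = df.agg({"math": ["min", "max"], "physics": ["mean"]})

print(same)
print(different)
```

In the second result, combinations that were not requested (for example the mean of `math`) appear as `NaN`.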