# Introduction to Machine Learning and Data Preprocessing
## What is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI) where systems learn from data to make predictions or decisions without being explicitly programmed.

---

## Types of Machine Learning
- **Supervised Learning:** The algorithm learns from labeled data (input-output pairs).
  **Example:** Predicting house prices based on size and location.
- **Unsupervised Learning:** The algorithm learns patterns in unlabeled data.
  **Example:** Grouping customers based on purchase behavior (clustering).
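
As a rough illustration of the two settings, the sketch below fits a regression model on labeled toy data and a clustering model on unlabeled toy data. All numbers, and the choice of `LinearRegression` and `KMeans`, are assumptions made for illustration and are not taken from a real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: made-up house sizes (m^2) paired with known prices (the labels).
sizes = np.array([[50.0], [80.0], [100.0], [120.0]])
prices = np.array([150000.0, 240000.0, 300000.0, 360000.0])

model = LinearRegression().fit(sizes, prices)
predicted_price = model.predict(np.array([[90.0]]))[0]
print(f"Predicted price for 90 m^2: {predicted_price:.0f}")

# Unsupervised: made-up customers (orders per year, average basket value), no labels.
customers = np.array([[2, 20.0], [3, 25.0], [40, 210.0], [42, 190.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
cluster_ids = kmeans.labels_
print("Cluster assignments:", cluster_ids)
```

The regression model needs the `prices` labels to learn from, while the clustering model only sees the customer features and groups similar rows together. The cluster numbers themselves are arbitrary identifiers.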

---

## Importance of Data Preprocessing
Raw data is often messy and contains inconsistencies such as missing values, outliers, or varying scales. Preprocessing ensures that:

- The data is clean and ready for analysis.
- Machine learning models train on consistent inputs and are more likely to perform well.

---

## Practical Steps
### 1. Handling Missing Values
Missing data can occur due to errors in data collection or storage. Common strategies for handling it:

- **Mean Imputation:** Replace missing values with the column's mean.
- **Median Imputation:** Replace missing values with the column's median (useful for skewed data).
- **Dropping Rows/Columns:** Remove rows or columns with too many missing values.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataset with one missing value in each column
data = {'Age': [25, 30, None, 35], 'Salary': [50000, None, 45000, 40000]}
df = pd.DataFrame(data)

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

print(df)
```

```
    Age   Salary
0  25.0  50000.0
1  30.0  45000.0
2  30.0  45000.0
3  35.0  40000.0
```
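
The other two strategies listed above can be sketched the same way. The salaries below are made-up values with one extreme entry, chosen only so that the mean and the median differ noticeably; `DataFrame.dropna()` is used here as one possible way to drop incomplete rows.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up salaries with one extreme value, so the mean and median differ.
skewed = pd.DataFrame({'Salary': [40000, 45000, None, 50000, 500000]})

# Fill the missing entry (row 2) with the column mean and with the column median.
mean_filled = SimpleImputer(strategy='mean').fit_transform(skewed)
median_filled = SimpleImputer(strategy='median').fit_transform(skewed)

print("Mean fill:  ", mean_filled[2, 0])    # 158750.0
print("Median fill:", median_filled[2, 0])  # 47500.0

# Dropping rows instead: keep only rows without any missing value.
dropped = skewed.dropna()
print(len(dropped), "rows remain after dropna()")  # 4 rows remain
```

The mean is pulled toward the single large salary, while the median stays close to the typical values, which is why median imputation tends to suit skewed columns.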


### 2. Normalizing/Scaling Data
Features in a dataset can have very different scales (e.g., age in tens of years vs. salary in tens of thousands). Many machine learning algorithms, particularly distance-based and gradient-based ones, perform better when features are on a similar scale.

- **Normalization:** Rescales each feature to the `[0, 1]` range.
- **Standardization:** Rescales each feature to zero mean and unit variance.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each column to [0, 1]
# (df is the imputed DataFrame from the previous step)
scaler = MinMaxScaler()
df_normalized = scaler.fit_transform(df)

# Standardization: zero mean and unit variance per column
scaler_standard = StandardScaler()
df_standardized = scaler_standard.fit_transform(df)

# Both results are NumPy arrays, not DataFrames
print("Normalized Data:\n", df_normalized)
print("Standardized Data:\n", df_standardized)
```

```
Normalized Data:
 [[0.  1. ]
 [0.5 0.5]
 [0.5 0.5]
 [1.  0. ]]
Standardized Data:
 [[-1.41421356  1.41421356]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 1.41421356 -1.41421356]]
```
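
To make the two transformations less of a black box, the sketch below recomputes them by hand with NumPy and compares the results against the scikit-learn scalers. The array `X` simply restates the imputed Age and Salary values from the earlier step; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The imputed Age and Salary columns from the missing-values example.
X = np.array([[25.0, 50000.0],
              [30.0, 45000.0],
              [30.0, 45000.0],
              [35.0, 40000.0]])

# Min-max normalization: (x - min) / (max - min), computed per column.
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, using the population std (ddof=0).
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Both should agree with the scikit-learn scalers.
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))      # True
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))  # True
```

Note that `StandardScaler` divides by the population standard deviation, which is why `X.std(axis=0)` with its default `ddof=0` reproduces the same numbers.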


### 3. Encoding Categorical Variables
Machine learning models typically work with numerical data, so categorical data must be converted into a numeric format.

- **One-Hot Encoding:** Creates one binary column per category.
- **Label Encoding:** Assigns an integer to each category (e.g., Female = 0, Male = 1). The integers imply an order, so this is best suited to ordinal categories or target labels.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

data = {'Gender': ['Male', 'Female', 'Male', 'Female']}
df = pd.DataFrame(data)

# One-hot encoding: one binary column per category (Female, Male)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Gender']])
print("One-Hot Encoded Data:\n", encoded)

# Label encoding: one integer per category (Female = 0, Male = 1)
label_encoder = LabelEncoder()
df['Gender_Label'] = label_encoder.fit_transform(df['Gender'])
print("Label Encoded Data:\n", df)
```

```
One-Hot Encoded Data:
 [[0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]]
Label Encoded Data:
    Gender  Gender_Label
0    Male             1
1  Female             0
2    Male             1
3  Female             0
```


---

## Tools Used
- **Pandas:** For data manipulation and cleaning.
- **Scikit-learn:** For preprocessing techniques like imputation, scaling, and encoding.

---

## Outcome
**By preprocessing data:**

- Missing values are addressed, avoiding errors in models that cannot handle them.
- Features are scaled, which can improve model convergence.
- Categorical variables are converted, making the data usable by algorithms.

This foundational step is critical to the success of any machine learning model! 🚀

---