<font color="Blue" size="6">Python Learning 6th Week</font>

# Week 6 – Machine Learning Advanced

 # 1. Data Preprocessing (Missing Values, Label Encoding)
# Data Preprocessing in Machine Learning

Before building a machine learning model, the data always has to be cleaned and transformed.
Two basic steps we use:

Handling missing values (mean/median/mode/forward fill, etc.)

Label Encoding (converting categorical data to numeric).

 # Missing Values Handling

🔹 Why?
Most machine learning models cannot handle NaN (missing values), so we have to fill or remove them.

Example Dataset

In [5]:

import pandas as pd

# Dummy data
data = {
    "Name": ["Amit", "Ravi", "Neha", "Priya", "Karan"],
    "Age": [25, None, 28, None, 22],
    "Marks": [85, 90, None, 75, None],
    "Gender": ["Male", "Male", "Female", "Female", None]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)


Original DataFrame:
     Name   Age  Marks  Gender
0   Amit  25.0   85.0    Male
1   Ravi   NaN   90.0    Male
2   Neha  28.0    NaN  Female
3  Priya   NaN   75.0  Female
4  Karan  22.0    NaN    None


# Solution 1: Numerical Columns → Fill with Mean / Median

In [6]:
# Age column -> mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Marks column -> median
df["Marks"] = df["Marks"].fillna(df["Marks"].median())


# Solution 2: Categorical Columns → Fill with Mode

In [7]:
# Gender column -> mode
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])


In [None]:
Updated DataFrame:

    Name   Age  Marks  Gender
0   Amit  25.0   85.0    Male
1   Ravi  25.0   90.0    Male
2   Neha  28.0   85.0  Female
3  Priya  25.0   75.0  Female
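4  Karan  22.0   85.0  Female

Male and Female each appear twice, so `mode()` returns both in sorted order and `mode()[0]` picks Female.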

In [9]:
import pandas as pd

data = {
    "Name": ["Amit", "Ravi", "Neha", "Priya", "Karan"],
    "Age": [25, None, 28, None, 22],
    "Marks": [85, 90, None, 75, None],
    "Gender": ["Male", "Male", "Female", "Female", None]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# 🔹 Numerical fill
df["Age"] = df["Age"].fillna(df["Age"].mean())       # mean
df["Marks"] = df["Marks"].fillna(df["Marks"].median())  # median

# 🔹 Categorical fill
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print("\nCleaned DataFrame:\n", df)


Original DataFrame:
     Name   Age  Marks  Gender
0   Amit  25.0   85.0    Male
1   Ravi   NaN   90.0    Male
2   Neha  28.0    NaN  Female
3  Priya   NaN   75.0  Female
4  Karan  22.0    NaN    None

Cleaned DataFrame:
     Name   Age  Marks  Gender
0   Amit  25.0   85.0    Male
1   Ravi  25.0   90.0    Male
2   Neha  28.0   85.0  Female
3  Priya  25.0   75.0  Female
4  Karan  22.0   85.0  Female
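
The list of strategies at the start of this section also mentions forward fill, which the cells above do not demonstrate. The following is a minimal sketch on the same dummy data; the variable names `missing_counts`, `filled` and `dropped` are illustrative only. It counts the gaps first, then compares forward fill with simply dropping incomplete rows.

```python
import pandas as pd

data = {
    "Name": ["Amit", "Ravi", "Neha", "Priya", "Karan"],
    "Age": [25, None, 28, None, 22],
    "Marks": [85, 90, None, 75, None],
    "Gender": ["Male", "Male", "Female", "Female", None],
}
df = pd.DataFrame(data)

# Count missing values per column before deciding how to handle them
missing_counts = df.isna().sum()
print(missing_counts)

# Forward fill: copy the last seen value downward into each gap
filled = df.ffill()
print(filled)

# Alternative: drop every row that has at least one missing value
dropped = df.dropna()
print(dropped)
```

On this tiny dataset dropping rows keeps only one complete row, which is why filling is usually preferred when data is scarce. Forward fill only makes sense when row order carries meaning (for example, time-ordered records).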


 # Label Encoding (Categorical → Numeric)

Machine learning algorithms work with numbers, not with strings.

Example: Gender Column

In [10]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])


In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

print(df.head())               # prints the first 5 rows of the DataFrame
print(df["Gender"].unique())   # unique encoded values in the Gender column


    Name   Age  Marks  Gender
0   Amit  25.0   85.0       1
1   Ravi  25.0   90.0       1
2   Neha  28.0   85.0       0
3  Priya  25.0   75.0       0
4  Karan  22.0   85.0       0
[1 0]
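
The output above shows the codes but not which label each code stands for. A small sketch of how the mapping could be inspected and reversed follows; the `gender` Series here is a stand-in for the cleaned Gender column, and `mapping` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the cleaned Gender column from the example above
gender = pd.Series(["Male", "Male", "Female", "Female", "Female"])

le = LabelEncoder()
encoded = le.fit_transform(gender)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts the codes back to the original strings
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Note that fitting the encoder again on an already-encoded column (as happens if both In [10] and In [11] are run) replaces `classes_` with the integers 0 and 1, so the string mapping can no longer be recovered from that encoder.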


# Train / Test Split
What is it?

When building a machine learning model, we divide the dataset into two parts:

Training set → used to train the model (to learn patterns).

Testing set → used to test the model (to check whether it has actually learned).

 If all the data goes into training, the model can simply memorize it (overfitting) and then fail on new data.
That is why evaluating on unseen data (the test set) is necessary.

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Test model
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)


Accuracy: 1.0
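
To make the overfitting point concrete, one option is to compare the score on the training set with the score on the test set. The sketch below is a variation on the cell above, not the original code: `stratify=y` and the fixed `random_state` on the tree are additions assumed here so that class proportions and results stay stable between runs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Score on data the model has already seen vs. data it has not
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

An unpruned decision tree can usually fit its training data almost perfectly, so the training score says little about quality; only the test score reflects behavior on unseen data.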


# Cross-Validation (CV)
What is it?

When the dataset is small or a single train-test split is not reliable, we use cross-validation.

Process:

Divide the dataset into k equal parts (folds).

Each time, use one part for testing and the remaining (k-1) parts for training.

Repeat this process k times, so every fold is the test set exactly once.

Final accuracy = the average over all runs.

This is called k-fold cross-validation.

Code Example:

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

model = DecisionTreeClassifier()

# 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)

print("Cross-Validation Scores:", scores)
print("Average Accuracy:", scores.mean())


Cross-Validation Scores: [0.96666667 0.96666667 0.9        0.96666667 1.        ]
Average Accuracy: 0.9600000000000002
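
The k-fold steps listed earlier can also be written out by hand, which shows what `cross_val_score` does internally. This is a sketch rather than a replacement: it uses `StratifiedKFold`, which is what scikit-learn picks for a classifier when `cv` is an integer, and the names `fold_scores` and `fold_sizes` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kfold = StratifiedKFold(n_splits=5)
fold_scores = []
fold_sizes = []

# Each iteration holds out one fold for testing and trains on the other four
for train_idx, test_idx in kfold.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    fold_sizes.append(len(test_idx))

fold_scores = np.array(fold_scores)
print("Fold sizes:", fold_sizes)
print("Fold scores:", fold_scores)
print("Average accuracy:", fold_scores.mean())
```

A fresh model is created inside the loop so that nothing learned on one fold leaks into the next.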


# Summary

Train/Test Split → Simple, fast evaluation (like a single exam).

Cross-Validation → More reliable evaluation (like the average of several exams).

# Accuracy, Confusion Matrix, Precision / Recall

# 1. Accuracy

Definition:
Accuracy = (Correct Predictions) ÷ (Total Predictions)

 In other words, how many answers the model got right out of all the answers it gave.

# Formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$


TP (True Positive) = a positive correctly predicted as positive

TN (True Negative) = a negative correctly predicted as negative

FP (False Positive) = a negative wrongly predicted as positive

FN (False Negative) = a positive wrongly predicted as negative

# Example:

100 patients → the model predicted 90 correctly (70 healthy + 20 sick) →
Accuracy = 90/100 = 90%

 Problem: if the data is imbalanced (e.g., 95% healthy, 5% sick), a model that labels everyone healthy still shows high accuracy (95% here) while detecting no sick patients at all.
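
The imbalance problem can be reproduced in a few lines. This sketch uses made-up labels that match the 95% / 5% example, and a "model" that always answers healthy; recall, which is defined later in this section, exposes what accuracy hides.

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 healthy (0) and 5 sick (1) patients, matching the 95% / 5% example
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and calls everyone healthy
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Accuracy:", accuracy)  # 0.95
print("Recall:", recall)      # 0.0, not a single sick patient was found
```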

# 2. Confusion Matrix

In [14]:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0]   # Actual
y_pred = [1, 0, 1, 0, 0, 1, 1]   # Predicted

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[2 1]
 [1 3]]


# Meaning (rows = actual class, columns = predicted class):

TN = 2

FP = 1

FN = 1

TP = 3
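
For a binary problem the four counts do not have to be read off by eye. A short sketch, using the same labels as the cell above, flattens the matrix with `ravel()`, which yields the values in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0]   # Actual
y_pred = [1, 0, 1, 0, 0, 1, 1]   # Predicted

# For labels 0 and 1, the flattened 2x2 matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```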

# 3. Precision

Definition:
Precision tells us, out of everything the model labeled positive, how many were actually positive.

# Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$


 “How many of the predicted positives are actually correct?”

# Example:

The model labeled 30 patients as sick → 25 were actually sick, 5 were not.
Precision = 25 / (25+5) = 83.3%

# 4. Recall (Sensitivity / True Positive Rate)

Definition:
Recall tells us, out of all the actual positive cases, how many the model managed to detect.

Formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$

 “Out of all actual positives, how many did the model correctly identify?”

Example:

There were 40 sick patients in total → the model correctly labeled 25 as sick and missed 15.
Recall = 25 / (25+15) = 62.5%
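
The two worked examples can be checked directly from the counts. The helper functions `precision` and `recall` below are hypothetical names written for this sketch, not scikit-learn functions.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Precision example: 30 predicted sick, 25 correct, 5 wrong
precision_example = precision(tp=25, fp=5)

# Recall example: 40 actually sick, 25 found, 15 missed
recall_example = recall(tp=25, fn=15)

print(f"Precision: {precision_example:.3f}")  # 0.833
print(f"Recall: {recall_example:.3f}")        # 0.625
```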

# Code Example (All Together)

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))


Accuracy: 0.7142857142857143
Confusion Matrix:
 [[2 1]
 [1 3]]
Precision: 0.75
Recall: 0.75


# Summary

Accuracy → overall correctness

Confusion Matrix → detailed view of errors (TP, TN, FP, FN)

Precision → quality of positive predictions

Recall → ability to catch all positives

# Save Model (Pickle)
What is it?

Once a machine learning model has been trained, retraining it every time is time-consuming.
So we save the model to a file → later it can be loaded directly and used.

In Python, two libraries are mainly used to save and load models:

Pickle (for general Python objects)

Joblib (optimized for large NumPy arrays)

# Example with Pickle

In [16]:
import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train model
iris = load_iris()
X, y = iris.data, iris.target
model = DecisionTreeClassifier()
model.fit(X, y)

# ---- Save model ----
with open("model.pkl", "wb") as file:
    pickle.dump(model, file)

print("Model saved successfully!")

# ---- Load model ----
with open("model.pkl", "rb") as file:
    loaded_model = pickle.load(file)

# Test loaded model
print("Predictions:", loaded_model.predict([[5.1, 3.5, 1.4, 0.2]]))


Model saved successfully!
Predictions: [0]


# Example with Joblib

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train model
iris = load_iris()
X, y = iris.data, iris.target
model = DecisionTreeClassifier()
model.fit(X, y)

# ---- Save ----
joblib.dump(model, "model_joblib.pkl")

# ---- Load ----
loaded_model = joblib.load("model_joblib.pkl")
print("Predictions:", loaded_model.predict([[6.2, 3.4, 5.4, 2.3]]))
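
A quick way to gain confidence in a saved model is to check that the loaded copy predicts exactly what the original does. The sketch below does this with joblib; writing to a temporary directory and the name `same_predictions` are choices made for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save and reload inside a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_joblib.pkl")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The reloaded model should give identical predictions on the same inputs
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print("Round trip OK:", same_predictions)
```

Both pickle and joblib files can execute arbitrary code when loaded, so only load model files from sources you trust, and preferably with the same scikit-learn version that saved them.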


# Where is it useful?

When you want to deploy the model (in a Flask/Django/Streamlit app).

When you do not want to retrain the model again in the future.

To save time when working with large datasets.

# Summary of Week 6 Topics

Data Preprocessing (Missing Values, Label Encoding)

Train/Test Split & Cross-Validation

Accuracy, Confusion Matrix, Precision, Recall

Save Model (Pickle/Joblib)

# Project 1 – Flight Delay Predictor
# Step 1: Problem Statement

Goal: predict whether a flight will be delayed, based on the given flight data.

Input: flight details (airline, departure time, distance, etc.)

Output: Delayed (1) or Not Delayed (0)

# Step 2: Dataset

We will use a flight delay dataset. An example dataset is available on Kaggle:
 "Airline Delay Dataset" (features: airline, origin, destination, distance, dep_time, arr_time, delay).

# Step 3: Project Pipeline

- Load Dataset
  - Read the CSV with pandas
- Data Preprocessing
  - Handle missing values
  - Categorical encoding (airline and airport names → Label Encoding / OneHotEncoding)
- Split Data
  - Train / Test Split (80/20)
- Model Training
  - Use RandomForestClassifier (or DecisionTree for simplicity)
- Evaluation
  - Accuracy
  - Confusion Matrix
  - Precision & Recall
- Save Model
  - Pickle / Joblib

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
import pickle

# 1. Load Dataset
df = pd.read_csv("flights.csv")
print(df.head())

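# 2. Data Preprocessing
# Fill missing values by carrying the previous row's value forward
# (fillna(method="ffill") is deprecated in recent pandas; ffill() replaces it)
df = df.ffill()

# Encode categorical columns, keeping one fitted encoder per column
# so each mapping can be reused or inverted later
encoders = {}
for column in ["Airline", "Origin", "Destination"]:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])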

# Features and Target
X = df[["Airline", "Origin", "Destination", "Distance", "DepTime"]]
y = df["Delayed"]   # 1 = Delayed, 0 = On-time

# 3. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train Model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# 5. Evaluation
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))

# 6. Save Model
with open("flight_delay_model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model Saved Successfully!")


FileNotFoundError: [Errno 2] No such file or directory: 'flights.csv'

In [19]:
df = pd.read_csv("C:/Users/Tushar/Downloads/airline_delay.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Tushar/Downloads/airline_delay.csv'

In [20]:
import os
print(os.getcwd())


C:\Users\DELL
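
The working directory printed above is under `C:\Users\DELL`, while the path tried in In [19] points under `C:/Users/Tushar`, which would explain the error if that user folder does not exist on this machine. One possible way to avoid hard-coded paths is sketched below. The helper `find_csv` is hypothetical, and the two search locations (current working directory and the user's Downloads folder) are assumptions about where the file might be.

```python
from pathlib import Path


def find_csv(filename, search_dirs):
    """Return the first existing path to filename, or raise FileNotFoundError."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(d) for d in search_dirs)
    raise FileNotFoundError(f"{filename} not found in: {searched}")


# Assumed locations: the current working directory and the user's Downloads
search_dirs = [Path.cwd(), Path.home() / "Downloads"]

try:
    csv_path = find_csv("airline_delay.csv", search_dirs)
    print("Found:", csv_path)
except FileNotFoundError as error:
    # Report where we looked instead of stopping the notebook
    print(error)
```

Once a path is found, it can be passed directly to `pd.read_csv(csv_path)`. `Path.home()` resolves to the folder of whichever user is running the notebook, so the same code works across machines.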
