# Preprocessing


| Step | Action |
| --- | --- |
| 1 | Import required libraries (`pandas`, `numpy`, `sklearn.preprocessing`) |
| 2 | Load dataset using `pd.read_csv()` |
| 3 | Check dataset info with `df.info()` |
| 4 | Check for missing values and handle them |
| 5 | Drop unnecessary columns (`Id`) |
| 6 | Encode categorical values using `LabelEncoder` |
| 7 | Scale numerical features using `StandardScaler` or `MinMaxScaler` |
| 8 | Verify preprocessed data |

### 📌 Step 1: Import Required Libraries

In [14]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

### 📌 Step 2: Load the Dataset

In [4]:
df = pd.read_csv("iris.csv") 


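### 📌 Step 3: Check Dataset Information

Note: the two outputs in this step carry execution counts 53 and 51, so they were captured after the later cells had run. They therefore show the frame after `Id` was dropped, `Species` was encoded, and the features were scaled, not the freshly loaded data.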

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [51]:
df.head()


|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | 0 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | 0 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | 0 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | 0 |
| 4 | -1.021849 | 1.263460 | -1.341272 | -1.312977 | 0 |


### 📌 Step 4: Check for Missing Values

In [9]:
print(df.isnull().sum())  # Check for missing values


Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


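### 📌 Step 5: Handle Missing Values, Encode, and Scale
#### 1. Check for and Handle Missing Values
If any column reports missing values, there are two common options:
 1. Drop the rows or columns that contain missing values.
 2. Fill the missing values with the column mean, median, or mode.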
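
The Iris data has no missing values, so neither option is exercised in this notebook. As a rough illustration, both can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for this example only and are not taken from `iris.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from iris.csv.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
    "Species": ["Iris-setosa", "Iris-setosa", None, "Iris-setosa"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column mean (median works the same way)
# and categorical gaps with the most frequent value (mode).
filled = toy.copy()
filled["SepalLengthCm"] = filled["SepalLengthCm"].fillna(filled["SepalLengthCm"].mean())
filled["Species"] = filled["Species"].fillna(filled["Species"].mode()[0])

print(len(dropped))                 # rows left after dropping
print(filled.isnull().sum().sum())  # missing values left after filling
```

Dropping is simplest but discards data. Filling keeps every row at the cost of introducing estimated values.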


In [29]:
print(df.isnull().sum())

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


In [32]:
df.drop(columns=["Id"], inplace=True)
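
Because `inplace=True` mutates `df`, running this cell a second time raises a `KeyError` once `Id` is gone. A minimal sketch of a re-runnable variant follows. The `sample` frame is a hypothetical stand-in for the loaded data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded Iris frame.
sample = pd.DataFrame({"Id": [1, 2], "SepalLengthCm": [5.1, 4.9]})

# errors="ignore" skips labels that are not present, so the cell can be
# executed more than once without raising KeyError.
sample = sample.drop(columns=["Id"], errors="ignore")
sample = sample.drop(columns=["Id"], errors="ignore")  # second run is a no-op

print(list(sample.columns))
```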

In [36]:
print(df.isnull().sum())

SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


#### 2. Convert Categorical Data into Numerical (Encoding)
`LabelEncoder` maps each distinct value in the `Species` column to an integer. Column names are case-sensitive: the column is `Species`, not `species`.


In [38]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["Species"] = encoder.fit_transform(df["Species"])
print(dict(enumerate(encoder.classes_)))  # See the mapping


{0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}
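
`LabelEncoder` sorts the class labels and numbers them from 0, which is why the mapping above is alphabetical. The sketch below shows how the integer codes could be turned back into names with `inverse_transform`. It uses a short hypothetical list of labels rather than the full `Species` column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the Species column.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {int(code): name for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original names from the integer codes.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted `encoder` around is what makes it possible to report predictions as species names later.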


In [42]:
df

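|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |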


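#### 3. Feature Scaling (Standardization or Normalization)
##### Option 1: Standardization (Z-score Scaling)
Rescales each feature to zero mean and unit variance with `StandardScaler`.
##### Option 2: Min-Max Normalization
Rescales each feature to the range [0, 1] with `MinMaxScaler`.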
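
The notebook's own cells apply Option 1. For comparison, here is a minimal sketch of Option 2 with `MinMaxScaler`. The `demo` frame and its values are hypothetical and only stand in for the Iris features.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values; the last column plays the role of the label.
demo = pd.DataFrame({
    "PetalLengthCm": [1.0, 3.0, 5.0],
    "PetalWidthCm": [0.2, 1.2, 2.2],
    "Species": [0, 1, 2],
})

feature_cols = ["PetalLengthCm", "PetalWidthCm"]

# Min-max normalization: x' = (x - min) / (max - min), computed per column.
scaler = MinMaxScaler()
demo[feature_cols] = scaler.fit_transform(demo[feature_cols])

print(demo)
```

Selecting the feature columns by name leaves the label column untouched, so the result does not depend on `Species` being the last column.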

In [45]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.iloc[:, :-1] = scaler.fit_transform(df.iloc[:, :-1])  # Scale all features except 'Species'


In [47]:
df

|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | 0 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | 0 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | 0 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | 0 |
| 4 | -1.021849 | 1.263460 | -1.341272 | -1.312977 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 1.038005 | -0.124958 | 0.819624 | 1.447956 | 2 |
| 146 | 0.553333 | -1.281972 | 0.705893 | 0.922064 | 2 |
| 147 | 0.795669 | -0.124958 | 0.819624 | 1.053537 | 2 |
| 148 | 0.432165 | 0.800654 | 0.933356 | 1.447956 | 2 |


### 📌 Step 6: Verify the Preprocessed Data

In [76]:
print(df.head())


   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0      -0.900681      1.032057      -1.341272     -1.312977        0
1      -1.143017     -0.124958      -1.341272     -1.312977        0
2      -1.385353      0.337848      -1.398138     -1.312977        0
3      -1.506521      0.106445      -1.284407     -1.312977        0
4      -1.021849      1.263460      -1.341272     -1.312977        0
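
Beyond inspecting `head()`, a few programmatic checks can confirm that the preprocessing behaved as intended. The sketch below uses a hypothetical `check` frame with made-up values in place of the real data. Note that `StandardScaler` uses the population standard deviation, so the comparison passes `ddof=0` (pandas defaults to `ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw values standing in for the Iris features.
check = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.7, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.0, 2.5],
    "Species": [0, 0, 2, 2],
})
features = check.columns[:-1]  # every column except the label
check[features] = StandardScaler().fit_transform(check[features])

# After standardization each feature should have mean ~0 and
# population standard deviation (ddof=0) ~1.
means = check[features].mean()
stds = check[features].std(ddof=0)

print(np.allclose(means, 0.0))
print(np.allclose(stds, 1.0))
print(int(check.isnull().sum().sum()))  # count of missing values
```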
