**DATA PREPROCESSING TECHNIQUES ON THE IRIS DATASET**

**Steps Performed in the Code**

**1. Dataset Loading**
*   Loaded the Iris dataset directly from the UCI Machine Learning Repository by URL. The raw file has no header row, so the five column names are supplied explicitly.
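
As a minimal offline sketch of the same loading call, the snippet below reads two headerless rows from an in-memory string instead of the URL. The `raw` string and the `sample` name are illustrative assumptions; the row values are copied from the dataset preview printed by the notebook.

```python
import io
import pandas as pd

# Two headerless rows in the same layout as iris.data; StringIO stands in
# for the remote URL so the example runs without network access.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# names= supplies the header; without it the first data row would be
# consumed as column names.
sample = pd.read_csv(io.StringIO(raw), names=columns)
print(sample.shape)
print(sample.dtypes)
```

Passing `names=` is what keeps the first observation in the data rather than in the header.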

**2. Handling Missing Values**

*   Introduced missing values in rows 5 to 10 of the sepal_width column for demonstration, then replaced them using mean imputation.
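
Mean imputation can be sketched by hand on a toy column. The values below are hypothetical and the example uses plain pandas rather than `SimpleImputer`; both fill gaps with the mean of the observed entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry.
sepal_width = pd.Series([3.5, 3.0, np.nan, 3.1])

# Series.mean() skips NaN by default, so the fill value is the mean of
# the three observed entries.
fill_value = sepal_width.mean()
imputed = sepal_width.fillna(fill_value)
print(imputed.tolist())
```

Filling with the mean leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when many values are missing.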

**3. Categorical Encoding**

*   Converted the class column (flower species) from strings into integer codes using Label Encoding (0 = Iris-setosa, 1 = Iris-versicolor, 2 = Iris-virginica, assigned in sorted order).
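
A small sketch of how the encoder assigns and reverses codes, assuming a short hypothetical list of species labels in place of the full column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list of labels; the real column has 150 entries.
species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original strings.
print(list(encoder.inverse_transform(codes)))
```

The integer codes carry no real ordering between species, so they suit a target column better than an input feature for distance-based models.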

**4. Standardization**

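*   Applied Z-score scaling to the four numerical features so that each has mean = 0 and standard deviation = 1. Step 5 then rescales the same columns, so the final table shows Min-Max values rather than Z-scores.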
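
The Z-score is z = (x - mean) / std. A hand-rolled sketch with NumPy, using the first five sepal_length values from the dataset preview as an illustrative sample:

```python
import numpy as np

# First five sepal_length values from the dataset preview.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Population standard deviation (ddof=0), which is the convention
# StandardScaler uses.
z = (x - x.mean()) / x.std(ddof=0)
print(z.mean(), z.std(ddof=0))
```

Using `ddof=1` instead would give slightly different values on small samples, so the choice matters when comparing against scikit-learn output.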

**5. Normalization**

*   Applied Min-Max Normalization to rescale numerical features into the range [0,1].
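
Min-Max normalization computes (x - min) / (max - min). The sketch below, with a hypothetical helper `min_max`, also checks that applying it to Z-scores gives the same result as applying it to the raw values, since a prior linear rescaling cancels out.

```python
import numpy as np

def min_max(values):
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())

# First five sepal_length values from the dataset preview.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
z = (x - x.mean()) / x.std(ddof=0)  # Z-scores of the same values

scaled_from_raw = min_max(x)
scaled_from_z = min_max(z)
print(np.allclose(scaled_from_raw, scaled_from_z))
```

In practice this means one of the two scalers is usually chosen per feature set rather than chaining both.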

**6. Feature Engineering**

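*   Created a new feature petal_area by multiplying petal_length and petal_width. The code does this after scaling, so petal_area is the product of the normalized values, not an area in cm².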
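
If a feature in physical units is wanted, one option is to compute the product before any scaling. A minimal sketch, assuming the five petal measurements from the dataset preview and a hypothetical column name `petal_area_cm2`:

```python
import pandas as pd

# Petal measurements (cm) of the first five rows from the dataset preview.
petals = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
})

# Computing the product before scaling keeps the feature in cm^2
# (a rough bounding-rectangle area of the petal).
petals["petal_area_cm2"] = petals["petal_length"] * petals["petal_width"]
print(petals)
```

The new column can then be scaled together with the other numeric features.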

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# 1. Load dataset directly from website (UCI repository - Iris dataset)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(url, names=columns)

print("Original Dataset (first 5 rows):")
print(df.head(), "\n")

# ---------------- PREPROCESSING STEPS ----------------

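# 2. Handle Missing Values (introduce NaN in rows 5-10 for demonstration)
df.loc[5:10, "sepal_width"] = np.nan  # .loc slicing is label-based and inclusive
imputer = SimpleImputer(strategy="mean")
# fit_transform returns a 2-D array; ravel() flattens it back to one column
df["sepal_width"] = imputer.fit_transform(df[["sepal_width"]]).ravel()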

# 3. Encode Categorical Column (class → numbers)
encoder = LabelEncoder()
df["class"] = encoder.fit_transform(df["class"])

# Numeric feature columns shared by the two scaling steps
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# 4. Standardization (Z-score scaling)
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# 5. Min-Max Normalization (scale values between 0 and 1)
# This overwrites the Z-scores from step 4; because Min-Max scaling is
# unaffected by a prior linear rescaling, the result equals Min-Max alone.
minmax = MinMaxScaler()
df[feature_cols] = minmax.fit_transform(df[feature_cols])

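# 6. Feature Engineering (create a new feature - petal_area)
# Computed from the Min-Max-scaled columns, so this is a product of
# scaled values rather than an area in cm^2.
df["petal_area"] = df["petal_length"] * df["petal_width"]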

print("Preprocessed Dataset (first 5 rows):")
print(df.head())


Original Dataset (first 5 rows):
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa 

Preprocessed Dataset (first 5 rows):
   sepal_length  sepal_width  petal_length  petal_width  class  petal_area
0      0.222222     0.625000      0.067797     0.041667      0    0.002825
1      0.166667     0.416667      0.067797     0.041667      0    0.002825
2      0.111111     0.500000      0.050847     0.041667      0    0.002119
3      0.083333     0.458333      0.084746     0.041667      0    0.003531
4      0.194444     0.666667      0.067797     0.041667      0    0.002825
