
**Author**: Aaryan Samanta

**Organization**: Legend College Preparatory

**Date**: 2025

**Title**: Iris Dataset - Data Processing

**Version**: 1.0

**Type**: Source Code

**Adaptation details**: Based on classroom exercises

**Description**: The goal of this exercise is to help students learn how to clean and preprocess datasets using Python.



---


Developed as part of the AI Internship at Legend College Preparatory.
Please note that it is a violation of school policy to copy and use this code without proper attribution and credit acknowledgement.
Failing to give credit can constitute plagiarism, even for small code snippets.

In [None]:
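import pandas as pd

# Load the Iris data from the UCI repository. The file has no header row,
# so the column names are supplied explicitly.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
data = pd.read_csv(url, header=None, names=columns)

# Count the missing values in each column.
print("Missing data in each column:")
print(data.isnull().sum())

# Fill gaps in the numeric columns with each column's mean.
# numeric_only=True skips the text "species" column, which has no mean.
data.fillna(data.mean(numeric_only=True), inplace=True)

# Drop any row that has no species label.
data.dropna(subset=["species"], inplace=True)

# Keep rows whose sepal_length is strictly greater than 5.
# This runs before the rename below, so filtered_data keeps the original column names.
filtered_data = data[data["sepal_length"] > 5]
print("\nRows where sepal_length > 5:")
print(filtered_data)

# Give two of the columns more descriptive names.
data.rename(columns={"sepal_length": "Sepal Length (cm)", "species": "Flower Species"}, inplace=True)

# Summarize the cleaned dataset.
print("\nCleaned dataset preview:")
print(data.head())
print(f"\nThe dataset has {data.shape[0]} rows and {data.shape[1]} columns.")
print("\nColumn names:", data.columns.tolist())

print(f"\nNumber of rows with Sepal Length > 5: {filtered_data.shape[0]}")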

Missing data in each column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Rows where sepal_length > 5:
     sepal_length  sepal_width  petal_length  petal_width         species
0             5.1          3.5           1.4          0.2     Iris-setosa
5             5.4          3.9           1.7          0.4     Iris-setosa
10            5.4          3.7           1.5          0.2     Iris-setosa
14            5.8          4.0           1.2          0.2     Iris-setosa
15            5.7          4.4           1.5          0.4     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9       

1.
There were 0 missing values in every column. This is because the Iris dataset is a well-known teaching dataset that has already been cleaned, so the fillna and dropna steps had nothing to change.

2.
After renaming, the columns are Sepal Length (cm), sepal_width, petal_length, petal_width, and Flower Species. Only sepal_length and species were renamed, to make them more descriptive; the other three columns keep their original names.

3.
There are 118 rows with a sepal length greater than 5 cm, which is about 79% of the 150 rows. This shows that most of the flowers in this dataset have sepals longer than 5 cm.
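
The count and the share can both be read off the boolean mask used for filtering. The sketch below is a minimal illustration on a hypothetical five-row sample (the `sample` frame and its values are made up for the example, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample of sepal lengths, not the real Iris data.
sample = pd.DataFrame({"sepal_length": [4.9, 5.0, 5.1, 5.8, 6.7]})

# Strict comparison: a value of exactly 5.0 is excluded.
mask = sample["sepal_length"] > 5

count = int(mask.sum())      # True counts as 1, so the sum is the number of matching rows
share = float(mask.mean())   # the mean of the mask is the fraction of matching rows

print(f"{count} of {len(sample)} rows ({share:.0%}) have sepal_length > 5")
```

Applied to the full dataset, the same two lines would give the row count and the percentage in one step, without building a separate filtered frame.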

4.
Yes. Calling data.fillna(data.mean()) raised an error because pandas tried to calculate a mean for the text values in the species column. The solution was to pass numeric_only=True to data.mean() so that only the numeric columns are averaged.
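
The fix can be sketched on a small frame. The example below is an illustration only: the `toy` frame and its missing value are hypothetical, added on purpose because the real dataset has no gaps.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Iris data, with one gap added on purpose.
toy = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Restrict the mean to numeric columns so the text column is skipped.
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column by matching label,
# so "species" is left untouched because it has no entry in numeric_means.
filled = toy.fillna(numeric_means)
print(filled)
```

The missing sepal length is replaced by the mean of the other two values, and the species column passes through unchanged.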

5.
Flowers with longer sepals are most often Iris-versicolor or Iris-virginica, which suggests that sepal length helps distinguish the species. Sepal length and petal length also appear to be correlated, which is interesting because it hints at a biological relationship between the two measurements.
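
One way to check both observations is to compare the mean sepal length per species and to compute the correlation between the two length columns. The sketch below uses six illustrative rows (the `toy` frame is an assumption for the example, far too small to draw conclusions from); on the real data the same calls would be run on the full frame before the columns are renamed.

```python
import pandas as pd

# Six illustrative rows, two per species; not the full dataset.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Average sepal length for each species.
mean_by_species = toy.groupby("species")["sepal_length"].mean()
print(mean_by_species)

# Pearson correlation between sepal length and petal length.
corr = toy["sepal_length"].corr(toy["petal_length"])
print(f"Correlation: {corr:.2f}")
```

A positive correlation and higher group means for Iris-versicolor and Iris-virginica would be consistent with the observations above, although neither result on its own establishes a biological cause.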