# 🧹 Task 1 — Data Cleaning & Preprocessing (Iris Dataset)

This notebook is part of my **Codveda Data Analytics Internship (Level 1 - Basic)**.  
The goal of this task is to **inspect, clean, and preprocess** the Iris dataset in preparation for further analysis.  

## ✅ Objectives
- Load and inspect the dataset  
- Identify and handle missing values and duplicates  
- Standardize column names and data formats  
- Encode categorical variables for ML readiness  
- Export the cleaned dataset for use in subsequent tasks  

---

## Step 1 — Setup & Imports
### Install Dependencies

In [1]:
%pip install -r ../requirements.txt









### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries imported successfully")
print("pandas:", pd.__version__)
print("numpy:", np.__version__)

Libraries imported successfully
pandas: 2.1.3
numpy: 1.26.2


## Step 2 — Load & Inspect Dataset

In [3]:
# Load the raw Iris CSV from the Raw-Dataset folder
df = pd.read_csv("../Raw-Dataset/1) iris.csv")

# Show the first 5 rows of the dataset
df.head()

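|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |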


## Step 3 — Dataset Overview
### 3.1 — Dataset Shape

In [4]:
# Number of rows and columns
df.shape

(150, 5)

### 3.2 — Column Info

In [5]:
# Column names, data types, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
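
The column names reported by `df.info()` are already lower-case snake_case, so the "standardize column names" objective needs no work for this file. For a CSV with untidy headers, a minimal sketch might look like the following. The `raw` frame and the `standardize_columns` helper are hypothetical illustrations, not part of this notebook.

```python
import pandas as pd

# Hypothetical frame with untidy headers (the real Iris columns are already clean)
raw = pd.DataFrame(
    {" Sepal Length ": [5.1], "Petal.Width": [0.2], "Species": ["setosa"]}
)


def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trimmed, lower-case, snake_case column names."""
    out = frame.copy()
    out.columns = (
        out.columns.str.strip()                       # drop surrounding whitespace
        .str.lower()                                  # normalise case
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # separators -> underscore
        .str.strip("_")                               # no leading/trailing underscore
    )
    return out


tidy = standardize_columns(raw)
print(list(tidy.columns))
```

Returning a copy keeps the raw frame untouched, which makes it easy to compare the headers before and after.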


### 3.3 — Summary Statistics

In [6]:
# Basic statistics for numeric columns
df.describe()

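|       | sepal_length | sepal_width | petal_length | petal_width |
|-------|--------------|-------------|--------------|-------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean  | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std   | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min   | 4.3 | 2.0 | 1.0 | 0.1 |
| 25%   | 5.1 | 2.8 | 1.6 | 0.3 |
| 50%   | 5.8 | 3.0 | 4.35 | 1.3 |
| 75%   | 6.4 | 3.3 | 5.1 | 1.8 |
| max   | 7.9 | 4.4 | 6.9 | 2.5 |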


### 3.4 — Missing Values

In [7]:
# Count of missing values per column
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
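
Every count is zero, so nothing has to be imputed or dropped for this dataset. If a later version of the file did contain gaps, one possible approach is sketched below. The `sample` frame is made up for illustration, and median imputation is only one assumption among several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps (the Iris file used in this notebook has none)
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, np.nan, 4.7, 4.6],
        "species": ["setosa", "setosa", None, "setosa"],
    }
)

cleaned = sample.copy()

# Numeric gap: fill with the column median, which is robust to outliers
median_length = cleaned["sepal_length"].median()
cleaned["sepal_length"] = cleaned["sepal_length"].fillna(median_length)

# Missing label: a row without a species cannot be used, so drop it
cleaned = cleaned.dropna(subset=["species"])

print(cleaned.isnull().sum())
```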

### 3.5 — Duplicate Rows

In [8]:
# Count of duplicate rows
df.duplicated().sum()

3
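
Before dropping anything, it can help to look at the duplicated rows themselves. The sketch below uses a small made-up frame (`toy`) rather than the Iris data. Passing `keep=False` to `duplicated()` flags every member of a duplicate group, whereas the default only flags the second and later occurrences.

```python
import pandas as pd

# Hypothetical frame containing two duplicate pairs
toy = pd.DataFrame(
    {
        "sepal_length": [4.9, 5.8, 4.9, 5.8, 5.0],
        "species": ["setosa", "virginica", "setosa", "virginica", "setosa"],
    }
)

# keep=False marks all copies, so each group can be inspected side by side
dupes = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))
print(dupes)

# The default (keep="first") counts only the rows that would be dropped
n_to_drop = int(toy.duplicated().sum())
print(n_to_drop)
```

On the notebook's `df`, the same expression would be `df[df.duplicated(keep=False)]`.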

## Step 4 — Data Cleaning
### 4.1 — Remove Duplicate Rows

In [9]:
# Remove duplicate rows
df = df.drop_duplicates()

# Verify that duplicates are gone
df.duplicated().sum()

0

### 4.2 — Check Categorical Consistency

In [10]:
# List the distinct species labels to catch typos or inconsistent casing
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)
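
With three consistent labels, the `species` column is ready to be encoded for machine learning, as listed in the objectives. The sketch below shows two common options on a small stand-in Series (`species`, hypothetical data): integer label encoding through the pandas `category` dtype, and one-hot encoding with `pd.get_dummies`. Which one fits depends on the downstream model, so treat this as an illustration rather than the notebook's chosen method.

```python
import pandas as pd

# Stand-in for df['species'] (hypothetical values)
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Label encoding: categories are sorted alphabetically, so the codes are reproducible
cat = species.astype("category")
codes = cat.cat.codes
mapping = dict(enumerate(cat.cat.categories))  # code -> label, useful for decoding

# One-hot encoding: one indicator column per species
one_hot = pd.get_dummies(species, prefix="species")

print(mapping)
print(list(one_hot.columns))
```

Label codes imply an ordering that the species do not have, so one-hot encoding is usually the safer choice for linear models, while tree-based models tend to handle integer codes without trouble.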

## Step 5 — Save the Cleaned Dataset

In [11]:
# Save the cleaned dataset for use in later tasks
output_path = "../Task 1 - Data Cleaning And Preprocessing/Cleaned-Dataset/iris_cleaned.csv"
df.to_csv(output_path, index=False)

print(f"Cleaned dataset saved to: {output_path}")

Cleaned dataset saved to: ../Task 1 - Data Cleaning And Preprocessing/Cleaned-Dataset/iris_cleaned.csv
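
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this in a temporary directory with a small made-up frame (`frame`), so it does not touch the project folders. It also shows why `index=False` matters: without it, the reloaded file would gain an extra `Unnamed: 0` column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
frame = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "iris_cleaned.csv"
    frame.to_csv(path, index=False)  # index=False avoids writing the row index
    reloaded = pd.read_csv(path)

# Same shape and columns, and no stray index column
print(reloaded.shape == frame.shape)
print(list(reloaded.columns) == list(frame.columns))
print("Unnamed: 0" in reloaded.columns)
```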
