# Task 2: Data Cleaning - TechnoHacks Internship

## 🎯 Objective:
Clean the dataset by checking for:

- **Missing values**
- **Duplicate rows**
- **Invalid entries**
- **Consistency in data types**

Then save the cleaned dataset for further analysis.

## 📌 Dataset:
**Iris Dataset** - Building upon the data collected in Task 1

## 🧠 Step-by-Step Implementation:
This notebook cleans the Iris dataset step by step so that it is ready for visualization and machine learning tasks.

## 🔹 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
print("Libraries imported successfully!")
print("Ready to start data cleaning process...")

Libraries imported successfully!
Ready to start data cleaning process...


## 🔹 2. Load the Dataset

In [2]:
# Load the dataset
df = pd.read_csv("iris_dataset.csv")
print("Dataset loaded successfully!")
print(f"Original dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Dataset loaded successfully!
Original dataset shape: (150, 5)
Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']


## 🔹 3. Initial Data Inspection

In [3]:
# View first few rows
print("First 5 rows of the dataset:")
print(df.head())
print("\n" + "="*50)
print("Dataset Information:")
print(df.info())

First 5 rows of the dataset:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


## 🔹 4. Check for Missing Values

In [4]:
# Check for missing values
print("Missing values per column:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

if missing_values.sum() == 0:
    print("✅ No missing values found in this dataset!")
else:
    print("❌ Missing values detected - need to handle them")

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64

Total missing values: 0
✅ No missing values found in this dataset!
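
The `else` branch above only reports that gaps exist. If a future version of the file did contain missing values, one possible way to handle them is sketched below. The toy frame `demo` and the choice of per-class median imputation are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (illustrative only, not the real Iris file)
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 6.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", None],
})

# A row without a label cannot be imputed sensibly, so drop it
demo = demo.dropna(subset=["class"]).reset_index(drop=True)

# Fill numeric gaps with the median of the same class
demo["sepal_length"] = demo.groupby("class")["sepal_length"].transform(
    lambda s: s.fillna(s.median())
)

print(demo)
print("Remaining missing values:", demo.isnull().sum().sum())
```

Imputing within each class keeps a filled value close to the other flowers of the same species. Dropping the affected rows instead would be just as defensible for a dataset this small.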


## 🔹 5. Identify and Handle Duplicate Rows

In [5]:
# Check for duplicate rows
duplicates = df.duplicated()
print("Number of duplicate rows:", duplicates.sum())
print(f"Shape before removing duplicates: {df.shape}")

# Remove duplicates if any
if duplicates.sum() > 0:
    df = df.drop_duplicates()
    print(f"✅ {duplicates.sum()} duplicate row(s) removed!")
    print(f"Shape after removing duplicates: {df.shape}")
else:
    print("✅ No duplicate rows found!")

# Reset index after removing duplicates
df = df.reset_index(drop=True)
print(f"Final shape: {df.shape}")

Number of duplicate rows: 3
Shape before removing duplicates: (150, 5)
✅ 3 duplicate row(s) removed!
Shape after removing duplicates: (147, 5)
Final shape: (147, 5)
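
Before dropping rows, it can help to look at every copy of a repeated row, since `duplicated()` with its default `keep="first"` only flags the later repeats. The sketch below shows the difference on a toy frame. `demo`, `all_copies`, and `deduped` are hypothetical names used for illustration only.

```python
import pandas as pd

# Toy data (illustrative only): rows 0 and 2 are identical
demo = pd.DataFrame({
    "sepal_length": [4.9, 5.8, 4.9],
    "sepal_width": [3.1, 2.7, 3.1],
    "class": ["Iris-setosa", "Iris-virginica", "Iris-setosa"],
})

# keep=False flags every copy of a repeated row, not just the later ones
all_copies = demo[demo.duplicated(keep=False)]
print(all_copies)

# The default keep="first" marks exactly the rows drop_duplicates() removes
n_removed = demo.duplicated().sum()
deduped = demo.drop_duplicates().reset_index(drop=True)
print(f"Rows flagged: {len(all_copies)}, rows removed: {n_removed}")
```

Whether identical measurements are true duplicates or simply two flowers that measured the same is a judgment call. Inspecting `all_copies` first makes that decision explicit.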


## 🔹 6. Validate Data Types

In [6]:
# Check data types
print("Data types:")
print(df.dtypes)

# Expected data types validation
expected_numeric = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
expected_categorical = ['class']

print("\n✅ Data Type Validation:")
print(f"Numeric columns (should be float64): {expected_numeric}")
print(f"Categorical columns (should be object): {expected_categorical}")

# Verify all numeric columns are float64
numeric_types_ok = all(df[col].dtype == 'float64' for col in expected_numeric)
categorical_types_ok = all(df[col].dtype == 'object' for col in expected_categorical)

if numeric_types_ok and categorical_types_ok:
    print("✅ All data types are correct!")
else:
    print("❌ Some data types need correction")

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class            object
dtype: object

✅ Data Type Validation:
Numeric columns (should be float64): ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns (should be object): ['class']
✅ All data types are correct!
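
The check above passes for this file, so no correction code is needed here. If a numeric column had instead been read as text, a conversion along the following lines could repair it and surface invalid entries at the same time. This is a minimal sketch on toy data; the frame `demo` and its bad value `"abc"` are assumptions for illustration.

```python
import pandas as pd

# Toy data (illustrative only): numbers stored as text, one invalid entry
demo = pd.DataFrame({
    "petal_length": ["1.4", "4.7", "abc"],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
demo["petal_length"] = pd.to_numeric(demo["petal_length"], errors="coerce")

print(demo.dtypes)
print("Entries that failed to parse:", demo["petal_length"].isna().sum())
```

Any entry that becomes NaN in this step can then be handled like an ordinary missing value.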


## 🔹 7. Check for Outliers

In [7]:
# Check for outliers using statistical summary
print("Statistical Summary:")
print(df.describe())

# Check for potential outliers using IQR method
print("\n🔍 Outlier Detection (using IQR method):")
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"{col}: {len(outliers)} potential outliers")

print("\n✅ Note: For the Iris dataset, these are natural variations and not true outliers to remove.")

Statistical Summary:
       sepal_length  sepal_width  petal_length  petal_width
count    147.000000   147.000000    147.000000   147.000000
mean       5.856463     3.055782      3.780272     1.208844
std        0.829100     0.437009      1.759111     0.757874
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.400000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

🔍 Outlier Detection (using IQR method):
sepal_length: 0 potential outliers
sepal_width: 4 potential outliers
petal_length: 0 potential outliers
petal_width: 0 potential outliers

✅ Note: For the Iris dataset, these are natural variations and not true outliers to remove.
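
To see where the `sepal_width` count comes from, take the quartiles in the summary table: Q1 = 2.8 and Q3 = 3.3, so IQR = 0.5 and the fences are 2.8 - 0.75 = 2.05 and 3.3 + 0.75 = 4.05. The column's minimum (2.0) and maximum (4.4) both fall outside that range. The same rule can be wrapped in a small helper, sketched below on toy values. `iqr_bounds` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values (illustrative only): 10.0 sits far from the rest
values = pd.Series([2.8, 3.0, 3.0, 3.1, 3.2, 3.3, 10.0])
lower, upper = iqr_bounds(values)

# Keep only the values outside the fences
flagged = values[(values < lower) | (values > upper)]
print(f"Fences: [{lower:.3f}, {upper:.3f}]")
print(flagged)
```

The multiplier `k=1.5` is the conventional default. A larger value such as 3.0 flags only more extreme points.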


## 🔹 8. Standardize Class Names

In [10]:
# Check current class names
print("Current unique class names:")
print(df['class'].unique())
print(f"\nClass value counts:")
print(df['class'].value_counts())

# Standardize class names (remove extra spaces if any)
df['class'] = df['class'].str.strip()

# Check again after standardization
print(f"\nAfter standardization:")
print("Unique class names:", df['class'].unique())
print("Class value counts:")
print(df['class'].value_counts())

# Verify expected classes
expected_classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
if set(df['class'].unique()) == set(expected_classes):
    print("✅ All class names are standardized and correct!")
else:
    print("❌ Some class names may need attention")

Current unique class names:
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Class value counts:
class
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        48
Name: count, dtype: int64

After standardization:
Unique class names: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Class value counts:
class
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        48
Name: count, dtype: int64
✅ All class names are standardized and correct!
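
`str.strip()` only removes surrounding whitespace. If labels also differed in letter case, a lookup against the expected names could normalize them and flag anything unrecognized. The sketch below assumes toy labels, including a deliberate misspelling. `canonical`, `cleaned`, and `unmatched` are hypothetical names for illustration.

```python
import pandas as pd

expected_classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Toy labels (illustrative only) with stray spaces, mixed case, and a typo
labels = pd.Series([" Iris-setosa", "iris-versicolor ", "IRIS-VIRGINICA", "Iris-setsa"])

# Map a stripped, lowercased label back to its canonical spelling
canonical = {name.lower(): name for name in expected_classes}
cleaned = labels.str.strip().str.lower().map(canonical)

# Labels that match no expected class become NaN and need manual review
unmatched = labels[cleaned.isna()]
print(cleaned.tolist())
print("Needs review:", unmatched.tolist())
```

Leaving unmatched labels as NaN, rather than guessing the intended class, keeps questionable rows visible for review.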


## 🔹 9. Save the Cleaned Dataset

In [11]:
# Save the cleaned dataset
df.to_csv("iris_cleaned.csv", index=False)
print(" Dataset saved successfully as 'iris_cleaned.csv'")
print(f"Final cleaned dataset shape: {df.shape}")

# Final verification
print("\n🎯 FINAL CLEANED DATASET SUMMARY:")
print(f"- Total samples: {len(df)}")
print(f"- Total features: {len(df.columns) - 1}")
print(f"- Classes: {df['class'].nunique()}")
print(f"- Missing values: {df.isnull().sum().sum()}")
print(f"- Duplicate rows: {df.duplicated().sum()}")
print(f"- Data types consistent: ✅")
print(f"- Class names standardized: ✅")

print("\n Data cleaning process completed successfully!")
print("The cleaned dataset is ready for visualization and modeling.")

 Dataset saved successfully as 'iris_cleaned.csv'
Final cleaned dataset shape: (147, 5)

🎯 FINAL CLEANED DATASET SUMMARY:
- Total samples: 147
- Total features: 4
- Classes: 3
- Missing values: 0
- Duplicate rows: 0
- Data types consistent: ✅
- Class names standardized: ✅

 Data cleaning process completed successfully!
The cleaned dataset is ready for visualization and modeling.
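
A quick way to confirm that a saved file matches the in-memory data is to read it back and compare. The sketch below does this with a toy frame and a temporary directory so that it runs on its own. In the notebook, the same comparison would apply to `df` and `iris_cleaned.csv`. The variable names here are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (illustrative only)
cleaned = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "class": ["Iris-setosa", "Iris-versicolor"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# equals() compares shape, column labels, dtypes, and values
round_trip_ok = reloaded.equals(cleaned)
print("Round trip preserved the data:", round_trip_ok)
```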


## 📁 Files Created:
- `iris_cleaned.csv`: Cleaned and ready-to-use dataset
- `Task2_DataCleaning.ipynb`: This notebook with all cleaning steps

## 📌 Summary:
✅ **Task 2: Data Cleaning Completed Successfully!**

In this task, we:
1. ✅ **Imported required libraries** (pandas, numpy)
2. ✅ **Loaded the dataset** from iris_dataset.csv (150 rows, 5 columns)
3. ✅ **Performed initial inspection** of the data structure
4. ✅ **Checked for missing values** (none found)
5. ✅ **Identified and removed duplicate rows** (3 removed, leaving 147 rows)
6. ✅ **Validated data types** (all correct)
7. ✅ **Checked for outliers** (4 rows flagged in sepal_width by the IQR method; treated as natural variation and kept)
8. ✅ **Standardized class names** (already consistent)
9. ✅ **Saved the cleaned dataset** as iris_cleaned.csv

## 🎯 Key Achievements:
- **Data Quality**: No missing values and no duplicate rows remain in the cleaned dataset
- **Consistency**: All four measurement columns are float64 and the class column holds three consistent labels
- **Reliability**: Removed 3 duplicate rows and reviewed the values flagged by the IQR method
- **Readiness**: Dataset is now ready for visualization and modeling

## 🚀 Next Steps:
The cleaned dataset is now ready for:
- **Task 3**: Data Visualization
- **Task 4**: Machine Learning Modeling

**Data Cleaning Process Complete!** 🎉