# Task 1: Data Collection - TechnoHacks Internship

## 🎯 Objective:
To collect a real-world dataset from an open data source (the UCI Machine Learning Repository), load it with Python, explore it briefly, and save it in a structured format (CSV).

## 📌 Dataset Chosen:
**Iris Dataset** from the UCI Machine Learning Repository

## 📋 Description:
The Iris dataset is a classic dataset in machine learning that contains measurements of 150 iris flowers from three species: Iris setosa, Iris versicolor, and Iris virginica. Each sample has four features:

- **Sepal Length**
- **Sepal Width**
- **Petal Length**
- **Petal Width**

## 🔹 1. Importing Required Libraries

In [11]:
import pandas as pd
import numpy as np
print("Libraries imported successfully!")

Libraries imported successfully!


## 🔹 2. Loading the Dataset

Because we are working offline, we load the dataset from a local CSV file, `iris_dataset.csv`, that was previously downloaded from the UCI repository.

Original URL: `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`
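
For readers who want to start from the raw file instead of the local CSV, here is a minimal sketch. The raw `iris.data` file has no header row, so column names must be supplied when reading it. The helper name `load_raw_iris` and the `IRIS_COLUMNS` list are assumptions made for illustration; the column names are copied from the output shown later in this notebook. The example reads two sample rows from an in-memory string, so it runs offline.

```python
import io

import pandas as pd

# Assumed column names, copied from the DataFrame output later in this notebook.
# The raw iris.data file itself does not contain a header row.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]


def load_raw_iris(source):
    """Read headerless iris.data-style text from a path, URL, or file-like object.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    return pd.read_csv(source, header=None, names=IRIS_COLUMNS)


# Two rows in the raw file's layout, used here instead of a network download.
sample_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
sample_df = load_raw_iris(io.StringIO(sample_text))
print(sample_df.shape)
print(list(sample_df.columns))
```

With network access, the same helper could in principle be pointed at the original URL above, since `pd.read_csv` accepts URLs as well as local paths.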

In [12]:
# Load the dataset
df = pd.read_csv('iris_dataset.csv')
print("Dataset loaded successfully!")
print(f"Shape of the dataset: {df.shape}")

Dataset loaded successfully!
Shape of the dataset: (150, 5)


## 🔹 3. Displaying First Few Rows

In [13]:
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


## 🔹 4. Dataset Information and Exploration

In [14]:
# Basic information about the dataset
print("Dataset Information:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [15]:
# Statistical summary
print("\nStatistical Summary:")
print(df.describe())


Statistical Summary:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [16]:
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64


In [17]:
# Check unique classes
print("\nUnique Classes:")
print(df['class'].value_counts())


Unique Classes:
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64


## 🔹 5. Data Validation and Quality Check

In [8]:
# Check data types
print("Data Types:")
print(df.dtypes)

# Check for any duplicate rows
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Display column names
print(f"\nColumn names: {list(df.columns)}")

Data Types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class            object
dtype: object

Duplicate rows: 3

Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
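
The duplicate count above says how many rows repeat an earlier row, but not which rows they are. The sketch below shows one way to list every copy of a repeated row by passing `keep=False` to `duplicated()`. The `demo` frame is a made-up toy example, not the real dataset, so the numbers it prints are not the notebook's results.

```python
import pandas as pd

# Toy frame invented for illustration; it is NOT the Iris data.
demo = pd.DataFrame(
    {
        "sepal_length": [4.9, 4.9, 5.8, 5.8, 5.1],
        "sepal_width": [3.1, 3.1, 2.7, 2.7, 3.5],
        "class": [
            "Iris-setosa",
            "Iris-setosa",
            "Iris-virginica",
            "Iris-virginica",
            "Iris-setosa",
        ],
    }
)

# Default keep="first": only the later repeats are flagged, as in the cell above.
repeat_count = int(demo.duplicated().sum())

# keep=False flags every member of a duplicated group, including the first one.
all_copies = demo[demo.duplicated(keep=False)]

print(f"Rows repeating an earlier row: {repeat_count}")
print(all_copies)
```

Whether duplicates should be dropped is a judgment call. Two flowers can legitimately share the same measurements, so a later cleaning step may choose to keep them.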


## 🔹 6. Final Dataset Overview

In [9]:
print("=== FINAL DATASET OVERVIEW ===")
print(f"Total samples: {len(df)}")
print(f"Total features: {len(df.columns) - 1}")
print(f"Classes: {df['class'].nunique()}")
print(f"Missing values: {df.isnull().sum().sum()}")
print("\nDataset is ready for further analysis!")

=== FINAL DATASET OVERVIEW ===
Total samples: 150
Total features: 4
Classes: 3
Missing values: 0

Dataset is ready for further analysis!
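
The feature count above relies on exactly one column being the label. A small sketch that makes this assumption explicit is shown below. The function name `overview` and its `label_column` parameter are hypothetical, and the `toy` frame is made-up data containing one deliberately missing value.

```python
import pandas as pd


def overview(frame, label_column="class"):
    """Return headline counts as a dict; label_column is excluded from the feature count.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    feature_columns = [c for c in frame.columns if c != label_column]
    return {
        "samples": len(frame),
        "features": len(feature_columns),
        "classes": int(frame[label_column].nunique()),
        "missing_values": int(frame.isnull().sum().sum()),
    }


# Toy frame invented for illustration, with one missing measurement.
toy = pd.DataFrame(
    {
        "petal_length": [1.4, 4.7, None],
        "petal_width": [0.2, 1.4, 2.5],
        "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)
summary = overview(toy)
print(summary)
```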


## 🔹 7. Saving Dataset as CSV

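The dataset was loaded from a CSV file, so this step writes it back to `iris_dataset.csv` unchanged (without the index column) to confirm it is ready for future use in data cleaning, visualization, or modeling tasks.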

In [10]:
# Save the dataset (confirmation)
df.to_csv("iris_dataset.csv", index=False)
print("✅ Dataset saved successfully as 'iris_dataset.csv'")
print("\nThe dataset is now ready for:")
print("- Data Cleaning (Task 2)")
print("- Data Visualization (Task 3)")
print("- Machine Learning Modeling (Task 4)")

✅ Dataset saved successfully as 'iris_dataset.csv'

The dataset is now ready for:
- Data Cleaning (Task 2)
- Data Visualization (Task 3)
- Machine Learning Modeling (Task 4)
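
One way to confirm that a save worked is to read the file back and compare it with the frame in memory. The sketch below does this with a made-up two-row frame and a temporary directory, so it does not touch `iris_dataset.csv`. The file name `toy_iris.csv` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame invented for illustration; it is NOT the Iris data.
toy = pd.DataFrame(
    {
        "petal_length": [1.4, 4.7],
        "class": ["Iris-setosa", "Iris-versicolor"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_iris.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_labels = reloaded["class"].tolist() == toy["class"].tolist()
print(same_shape, same_columns, same_labels)
```

Without `index=False`, the reloaded frame would gain an extra unnamed column holding the old index, and the shape check would fail.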


## 📁 Files Created:
- `iris_dataset.csv` – Complete Iris dataset ready for analysis
- `Task1_DataCollection.ipynb` – This notebook with all steps and outputs

## 📌 Summary:
✅ **Task 1 Completed Successfully!**

In this task, we:
1. ✅ Collected a real-world dataset from the UCI Machine Learning Repository
2. ✅ Loaded the dataset using Python and Pandas
3. ✅ Explored the dataset structure and characteristics
4. ✅ Checked data quality (no missing values; 3 duplicate rows found)
5. ✅ Saved the dataset in structured CSV format

The Iris dataset is now ready for the next phases of the data science pipeline!