# Iris Dataset - Exploratory Data Analysis

## Overview
The Iris dataset is one of the best-known datasets in machine learning. It contains measurements of 150 iris flowers, 50 from each of three species: Setosa, Versicolor, and Virginica. Each flower is described by four features, all measured in centimeters: sepal length, sepal width, petal length, and petal width.

## Table of Contents
1. [Data Loading and Initial Inspection](#1-data-loading-and-initial-inspection)
2. [Data Cleaning](#2-data-cleaning)
3. [Exploratory Data Analysis](#3-exploratory-data-analysis)
4. [Key Insights](#4-key-insights)
5. [Conclusions](#5-conclusions)

## 1. Data Loading and Initial Inspection

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('default')
sns.set_palette('husl')

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [None]:
# Load the Iris dataset
iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
df['species'] = iris_data.target
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("First 5 rows of the dataset:")
df.head()

In [None]:
# Basic information about the dataset
print(f"Dataset shape: {df.shape}")
print("\nColumn names and types:")
print(df.dtypes)
print("\nDataset info:")
df.info()

In [None]:
# Basic statistics
print("Basic statistics for numerical features:")
df.describe()

In [None]:
# Species distribution
print("Species distribution:")
species_counts = df['species_name'].value_counts()
print(species_counts)

# Visualize species distribution
plt.figure(figsize=(8, 6))
species_counts.plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Distribution of Iris Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

## 2. Data Cleaning

In [None]:
# Check for missing values
missing_data = df.isnull().sum()
print("Missing values per column:")
print(missing_data)
print(f"\nTotal missing values: {missing_data.sum()}")

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("\nDuplicate rows:")
    print(df[df.duplicated()])
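
The check above lists only the repeated occurrences of a duplicated row. A minimal sketch of how the full duplicate groups could be inspected and optionally dropped follows; it reloads the data with `load_iris(as_frame=True)` so that it runs on its own, and the names `iris_df`, `dup_mask`, and `deduped` are illustrative rather than part of the notebook above. Whether duplicates should be removed at all is a judgment call, since two flowers can legitimately share the same measurements.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Reload the data as a DataFrame: four feature columns plus an integer 'target' column
iris_df = load_iris(as_frame=True).frame

# keep=False marks every member of a duplicate group, not only the later repeats
dup_mask = iris_df.duplicated(keep=False)
print("Rows that belong to a duplicate group:")
print(iris_df[dup_mask].sort_values(list(iris_df.columns)))

# Optional step (an assumption, not something the analysis requires):
# drop exact repeats and renumber the index
deduped = iris_df.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(iris_df)}, after: {len(deduped)}")
```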

## 3. Exploratory Data Analysis

### 3.1 Univariate Analysis

In [None]:
# Separate numerical and categorical columns
numerical_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(f"Numerical columns: {numerical_cols}")

# Statistical summary by species
print("\nStatistical summary by species:")
df.groupby('species_name')[numerical_cols].describe()

In [None]:
# Distribution plots for all numerical features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, col in enumerate(numerical_cols):
    # Histogram
    axes[i].hist(df[col], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    
    # Add vertical line for mean
    axes[i].axvline(df[col].mean(), color='red', linestyle='--', label=f'Mean: {df[col].mean():.2f}')
    axes[i].legend()

plt.tight_layout()
plt.show()

In [None]:
# Box plots to show distribution by species
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, col in enumerate(numerical_cols):
    sns.boxplot(data=df, x='species_name', y=col, ax=axes[i])
    axes[i].set_title(f'{col} by Species')
    axes[i].set_xlabel('Species')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### 3.2 Bivariate Analysis

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Correlation Matrix of Iris Features')
plt.show()

# Print correlation insights
print("Strong correlations (|r| > 0.8):")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.8:
            print(f"{correlation_matrix.columns[i]} vs {correlation_matrix.columns[j]}: {corr_val:.3f}")
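
The matrix above pools all three species, which can hide or even reverse relationships that hold within each species. As a rough sketch, the pooled and per-species correlations between sepal length and sepal width can be compared as below. The block reloads the data so it is self-contained; `frame`, `pooled_r`, and `within_r` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.frame.copy()
# Map integer targets to readable species names
frame["species_name"] = frame["target"].map(
    {i: str(name) for i, name in enumerate(iris.target_names)}
)

x_col, y_col = "sepal length (cm)", "sepal width (cm)"

# Pearson correlation over all 150 flowers
pooled_r = frame[x_col].corr(frame[y_col])

# Pearson correlation computed separately inside each species
within_r = {
    name: group[x_col].corr(group[y_col])
    for name, group in frame.groupby("species_name")
}

print(f"Pooled: r = {pooled_r:.2f}")
for name, r in within_r.items():
    print(f"{name}: r = {r:.2f}")
```

On this dataset the pooled correlation is weakly negative while each within-species correlation is positive, so the pooled heatmap should be read alongside the per-species plots.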

In [None]:
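# Pairplot to visualize relationships between all features
# pairplot creates its own figure, so a preceding plt.figure() would only
# produce an empty extra figure. vars= restricts the grid to the measurements;
# otherwise the integer 'species' label would be plotted as a feature.
g = sns.pairplot(df, vars=numerical_cols, hue='species_name', diag_kind='hist',
                 palette=['#FF6B6B', '#4ECDC4', '#45B7D1'])
g.figure.suptitle('Pairplot of Iris Features by Species', y=1.02)
plt.show()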

In [None]:
# Detailed scatter plots for most interesting relationships
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Petal length vs Petal width
for species in df['species_name'].unique():
    species_data = df[df['species_name'] == species]
    axes[0].scatter(species_data['petal length (cm)'], species_data['petal width (cm)'], 
                   label=species, alpha=0.7, s=60)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('Petal Length vs Petal Width')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Sepal length vs Sepal width
for species in df['species_name'].unique():
    species_data = df[df['species_name'] == species]
    axes[1].scatter(species_data['sepal length (cm)'], species_data['sepal width (cm)'], 
                   label=species, alpha=0.7, s=60)
axes[1].set_xlabel('Sepal Length (cm)')
axes[1].set_ylabel('Sepal Width (cm)')
axes[1].set_title('Sepal Length vs Sepal Width')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 3.3 Advanced Visualizations

In [None]:
# Violin plots to show distribution shapes
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, col in enumerate(numerical_cols):
    sns.violinplot(data=df, x='species_name', y=col, ax=axes[i])
    axes[i].set_title(f'{col} Distribution by Species')
    axes[i].set_xlabel('Species')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Interactive 3D scatter plot using plotly
fig = px.scatter_3d(df, x='sepal length (cm)', y='sepal width (cm)', z='petal length (cm)',
                    color='species_name', size='petal width (cm)',
                    title='3D Scatter Plot of Iris Features',
                    labels={'species_name': 'Species'})
fig.show()

### 3.4 Statistical Analysis

In [None]:
# Calculate means and standard deviations by species
species_stats = df.groupby('species_name')[numerical_cols].agg(['mean', 'std'])
print("Means and Standard Deviations by Species:")
print(species_stats)

# Calculate coefficient of variation (CV) to compare variability
cv_stats = df.groupby('species_name')[numerical_cols].agg(lambda x: x.std() / x.mean() * 100)
print("\nCoefficient of Variation (%) by Species:")
print(cv_stats)

In [None]:
# Feature importance analysis - how well each feature separates species
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[numerical_cols])
y = df['species']

# Fit LDA to understand feature importance
lda = LinearDiscriminantAnalysis()
lda.fit(X_scaled, y)

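# Approximate feature importance from the LDA coefficients.
# lda.coef_ has shape (n_classes, n_features), so lda.coef_[0] alone would
# describe only the setosa discriminant; average the absolute values over
# all classes instead. This is a rough heuristic, not a formal measure.
feature_importance = pd.DataFrame({
    'Feature': numerical_cols,
    'Importance': np.abs(lda.coef_).mean(axis=0)
}).sort_values('Importance', ascending=False)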

print("Feature Importance for Species Classification:")
print(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.title('Feature Importance for Species Classification')
plt.xlabel('Importance Score')
plt.show()
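
LDA coefficients depend on the other features in the model, so they are only one view of importance. As a cross-check, a univariate alternative is sketched below using one-way ANOVA F-scores from scikit-learn's `f_classif`. This is a different technique from the LDA approach above, not a replacement for it, and `ranking` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()

# One-way ANOVA F-statistic per feature: ratio of between-species variance
# to within-species variance. Larger values mean better univariate separation.
f_scores, p_values = f_classif(iris.data, iris.target)

# Sort features from most to least discriminative
ranking = sorted(
    zip(iris.feature_names, f_scores),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(f"{name:<20} F = {score:.1f}")
```

Each feature is scored independently here, so the ranking ignores interactions between features.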

## 4. Key Insights

### Key Findings:

1. **Species Separability**: 
   - Setosa is cleanly separated from the other two species by its petal measurements, while its sepal measurements overlap with theirs
   - Versicolor and Virginica show more overlap, especially in sepal measurements

2. **Most Discriminative Features**:
   - Petal length and petal width are the most important features for species classification
   - Sepal width shows the least discriminative power

3. **Feature Correlations**:
   - Strong positive correlation between petal length and petal width (r ≈ 0.96)
   - Strong positive correlation between sepal length and petal length (r ≈ 0.87)
   - Sepal width shows weak to moderate negative correlation with the other features (r ≈ -0.12 to -0.43) when all species are pooled

4. **Species Characteristics**:
   - **Setosa**: Smallest petals, wider sepals relative to length
   - **Versicolor**: Medium-sized features across all measurements
   - **Virginica**: Largest petals and longest sepals

5. **Distribution Patterns**:
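   - Within each species the features are roughly unimodal, but pooled across species petal length and petal width are clearly bimodal because Setosa forms a separate cluster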
   - Petal measurements show clear species clustering
   - Some outliers exist but they don't significantly affect the overall patterns
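
The species separability finding can be checked directly from per-species value ranges. The sketch below computes, for each feature, the gap between the largest Setosa value and the smallest value among the other two species; a positive gap means the ranges do not overlap. It reloads the data to stay self-contained, and `frame`, `ranges`, and `gap` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.frame.copy()
frame["species_name"] = frame["target"].map(
    {i: str(name) for i, name in enumerate(iris.target_names)}
)
features = list(iris.feature_names)

# Minimum and maximum of every feature within each species
ranges = frame.groupby("species_name")[features].agg(["min", "max"])
print(ranges)

# Gap between Setosa's largest value and the smallest value of the other species
setosa_max = frame.loc[frame["species_name"] == "setosa", features].max()
others_min = frame.loc[frame["species_name"] != "setosa", features].min()
gap = others_min - setosa_max  # positive: no overlap; negative: ranges overlap

print("\nGap between Setosa and the other species (cm):")
print(gap)
```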

## 5. Conclusions

### Summary:
- The Iris dataset demonstrates clear morphological differences between the three species
- Petal measurements are substantially more informative than sepal measurements for species identification
- The dataset is balanced (50 samples per species) and has no missing values, which makes it convenient for introductory machine learning experiments

### Recommendations:
- For species classification, focus primarily on petal length and width measurements
- Setosa can be easily identified using simple rules (e.g., petal length < 2 cm)
- More sophisticated methods may be needed to distinguish between Versicolor and Virginica
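
The simple Setosa rule mentioned above can be tested in a few lines. This is a minimal sketch on scikit-learn's copy of the data; the 2 cm threshold comes from the recommendation, while `is_setosa_rule` and `rule_accuracy` are illustrative names.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Column 2 of iris.data is petal length (cm); target 0 is setosa
petal_length = iris.data[:, 2]
is_setosa_true = iris.target == 0

# Rule from the recommendation: call a flower Setosa if petal length < 2 cm
is_setosa_rule = petal_length < 2.0

# Fraction of flowers for which the rule agrees with the true label
rule_accuracy = (is_setosa_rule == is_setosa_true).mean()
print(f"Setosa-vs-rest accuracy of the rule: {rule_accuracy:.3f}")
```

The rule is evaluated on the same 150 flowers it was derived from, so the result says nothing about how it would generalize to new measurements.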

### Next Steps:
- Apply machine learning classification algorithms to predict species
- Investigate decision boundaries between Versicolor and Virginica
- Explore dimensionality reduction techniques (PCA, LDA) for visualization
- Compare different classification algorithms' performance on this dataset
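
As a starting point for the classifier comparison, a rough sketch using stratified 5-fold cross-validation is shown below. The two models, their hyperparameters, the fold count, and the random seed are illustrative assumptions, not choices made in this analysis.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical candidates; scaling lives inside each pipeline so it is
# refit on the training folds only and does not leak test-fold statistics
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn_5": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Stratified folds keep the 50/50/50 class balance in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {
    name: cross_val_score(model, X, y, cv=cv) for name, model in candidates.items()
}

for name, scores in cv_results.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With only 150 samples, differences of a point or two between models are likely within fold-to-fold noise.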