# EDA (Exploratory Data Analysis) of the Abalone Dataset

## Project Overview
This notebook explores the Abalone dataset to understand the relationship between physical measurements and age (rings) of abalone. The goal is to predict abalone age using physical measurements instead of the time-consuming method of counting shell rings under a microscope.

## Dataset Information
- **Target Variable**: Age in years (calculated as Rings + 1.5)
- **Features**: Physical measurements (Length, Diameter, Height, Weights) and Sex
- **Problem Type**: Regression (predicting continuous age values)
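
The target definition above can be sketched as a small helper. The function name `rings_to_age` and the sample ring counts are illustrative assumptions, not part of the dataset or of this notebook's code.

```python
import pandas as pd


def rings_to_age(rings: pd.Series) -> pd.Series:
    """Convert ring counts to estimated age in years (Rings + 1.5)."""
    return rings + 1.5


# Made-up ring counts, for illustration only
sample_rings = pd.Series([7, 10, 15], name="Rings")
sample_age = rings_to_age(sample_rings).rename("Age")
print(sample_age.tolist())
```

Because the conversion is a constant shift, a model trained on `Rings` and one trained on `Age` differ only by that offset.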

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

In [None]:
%load_ext autoreload
%autoreload 2

import kagglehub
# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

print("Path to dataset files:", path)


## 1. Data Loading and Basic Information

In [None]:
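# Load the dataset. kagglehub returns a directory, so locate the CSV file inside it
from pathlib import Path

csv_files = sorted(Path(path).rglob("*.csv"))
df = pd.read_csv(csv_files[0])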

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
print(df.info())

print("\nColumn names:")
print(df.columns.tolist())

## 2. Data Quality Analysis

In [None]:
# Check for missing values
print("Missing Values Analysis:")
print("=" * 40)
missing_values = df.isnull().sum()
print("Null values per column:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

# Check for duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Check data types
print("\nData Types:")
print(df.dtypes)

✅ **Data Quality Summary**: The dataset has no missing values and no duplicate rows, making it clean and ready for analysis.
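
Null and duplicate checks do not catch physically implausible values, such as a length or weight of zero. A minimal sketch of such a check follows, run on a made-up frame named `toy`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy measurements, for illustration only (not real dataset rows)
toy = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530],
    "Height": [0.095, 0.000, 0.150],
    "Whole weight": [0.514, 0.2255, 0.677],
})

# Count non-positive entries per column; a physical measurement should be > 0
non_positive = (toy <= 0).sum()
print(non_positive)
```

On the real data, the same expression could be applied to `df.select_dtypes("number")` to see whether any column contains such values.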

In [None]:
# Statistical summary of numerical features
print("Statistical Summary:")
print("=" * 50)
print(df.describe())

# Check unique values in categorical column
print(f"\nUnique values in 'Sex' column: {df['Sex'].unique()}")
print("Sex distribution:")
print(df['Sex'].value_counts())


## 3. Target Variable Analysis (Age/Rings)

In [None]:
numerical_features = [
    "Length",
    "Diameter",
    "Height",
    "Whole weight",
    "Shucked weight",
    "Viscera weight",
    "Shell weight",
    "Rings"
]

for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    # plot distribution of feature
    sns.histplot(df[feature], bins=30, edgecolor='black')
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()


In [None]:
# Age in years is estimated as Rings + 1.5
df_age = df["Rings"] + 1.5
plt.figure(figsize=(10, 6))
sns.histplot(df_age, bins=30, edgecolor='black')
plt.title('Distribution of Age')
plt.xlabel('Age (years)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Plot the correlation matrix (numeric columns only; 'Sex' is categorical)
corr_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
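
A heatmap shows every pairwise correlation, but for a regression target it can be easier to read a ranked list of correlations with `Rings` alone. The sketch below does this on a small made-up frame named `toy`; the values are illustrative only, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Toy data, for illustration only (not real dataset rows)
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "F"],
    "Length": [0.30, 0.45, 0.50, 0.60, 0.70],
    "Shell weight": [0.05, 0.15, 0.20, 0.30, 0.45],
    "Rings": [5, 9, 10, 12, 15],
})

# Correlation of each numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["Rings"]
    .drop("Rings")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Replacing `toy` with `df` would give the same ranking for the actual features. Pearson correlation only captures linear association, so a low value does not by itself rule out a useful nonlinear relationship.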