# EDA (Exploratory Data Analysis) of the Abalone Dataset

## Project Overview
This notebook explores the Abalone dataset to understand the relationship between physical measurements and age (rings) of abalone. The goal is to predict abalone age using physical measurements instead of the time-consuming method of counting shell rings under a microscope.

## Dataset Information
- **Target Variable**: Age (calculated as Rings + 1.5)
- **Features**: Physical measurements (Length, Diameter, Height, Weights) and Sex
- **Problem Type**: Regression (predicting continuous age values)
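
The Rings-to-Age conversion listed above is a fixed offset. A minimal sketch on a few made-up ring counts (illustrative values, not rows from the dataset):

```python
import pandas as pd

# Hypothetical ring counts, for illustration only (not taken from the dataset)
rings = pd.Series([1, 9, 29], name="Rings")

# Age in years is the ring count plus 1.5
age = (rings + 1.5).rename("Age")
print(age.tolist())
```

Because the offset is constant, predicting Rings and predicting Age are equivalent up to a shift of 1.5.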

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

In [1]:
%load_ext autoreload
%autoreload 2

import kagglehub
# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/rodolfomendes/abalone-dataset?dataset_version_number=3...


100%|██████████| 57.3k/57.3k [00:00<00:00, 29.3MB/s]

Extracting files...
Path to dataset files: /Users/gustavetriomphe/.cache/kagglehub/datasets/rodolfomendes/abalone-dataset/versions/3





## 1. Data Loading and Basic Information

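In [3]:
import glob
import os
import pandas as pd

# kagglehub returns the dataset *directory*, so read the CSV inside it
# (takes the first .csv found; assumes the dataset ships a single CSV file)
csv_path = glob.glob(os.path.join(path, '*.csv'))[0]
df = pd.read_csv(csv_path)
os.makedirs('../data', exist_ok=True)
df.to_csv('../data/abalone.csv', index=False)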

In [None]:
# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
print(df.info())

print("\nColumn names:")
print(df.columns.tolist())

## 2. Data Quality Analysis

In [None]:
# Check for missing values
print("Missing Values Analysis:")
print("=" * 40)
missing_values = df.isnull().sum()
print("Null values per column:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

# Check for duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Check data types
print("\nData Types:")
print(df.dtypes)

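✅ **Data Quality Summary**: The checks above are expected to report no missing values and no duplicate rows. That alone does not guarantee clean data, so the statistical summary in the next cell should still be scanned for implausible values such as zero-valued measurements.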
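
Missing-value and duplicate checks do not catch physically implausible entries. A minimal sketch of an additional check, assuming the column names used in this notebook and a tiny invented frame `toy` standing in for `df`:

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration
toy = pd.DataFrame({
    "Length": [0.455, 0.430, 0.315],
    "Height": [0.095, 0.000, 0.000],
    "Whole weight": [0.514, 0.428, 0.134],
})

# Count non-positive entries per measurement column
non_positive = (toy <= 0).sum()

# Show only the columns that contain suspicious values
print(non_positive[non_positive > 0])
```

On the real data the same expression would be applied to the numerical measurement columns of `df`.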

In [None]:
# Statistical summary of numerical features
print("Statistical Summary:")
print("=" * 50)
print(df.describe())

# Check unique values in categorical column
print(f"\nUnique values in 'Sex' column: {df['Sex'].unique()}")
print(f"Sex distribution:")
print(df['Sex'].value_counts())


## 3. Target Variable Analysis (Age/Rings)

In [None]:
# Create age variable (Rings + 1.5)
df['Age'] = df['Rings'] + 1.5

# Analyze target variable distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Rings distribution
axes[0,0].hist(df['Rings'], bins=30, edgecolor='black', alpha=0.7)
axes[0,0].set_title('Distribution of Rings')
axes[0,0].set_xlabel('Rings')
axes[0,0].set_ylabel('Frequency')

# Age distribution
axes[0,1].hist(df['Age'], bins=30, edgecolor='black', alpha=0.7, color='orange')
axes[0,1].set_title('Distribution of Age (Rings + 1.5)')
axes[0,1].set_xlabel('Age')
axes[0,1].set_ylabel('Frequency')

# Box plot of Rings by Sex
df.boxplot(column='Rings', by='Sex', ax=axes[1,0])
axes[1,0].set_title('Rings Distribution by Sex')
axes[1,0].set_xlabel('Sex')
axes[1,0].set_ylabel('Rings')

# Box plot of Age by Sex
df.boxplot(column='Age', by='Sex', ax=axes[1,1])
axes[1,1].set_title('Age Distribution by Sex')
axes[1,1].set_xlabel('Sex')
axes[1,1].set_ylabel('Age')

# pandas adds its own "Boxplot grouped by Sex" figure title; clear it
fig.suptitle('')
plt.tight_layout()
plt.show()

# Target variable statistics
print("Target Variable Statistics:")
print("=" * 40)
print(f"Rings - Min: {df['Rings'].min()}, Max: {df['Rings'].max()}, Mean: {df['Rings'].mean():.2f}")
print(f"Age - Min: {df['Age'].min()}, Max: {df['Age'].max()}, Mean: {df['Age'].mean():.2f}")


In [None]:
# Age group analysis
print("Age Group Analysis:")
print("=" * 30)

# Create age groups (Age = Rings + 1.5 can exceed 30, so the last bin extends to 35)
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 5, 10, 15, 20, 25, 30, 35],
                        labels=['0-5', '5-10', '10-15', '15-20', '20-25', '25-30', '30-35'])

age_group_counts = df['Age_Group'].value_counts().sort_index()
print(age_group_counts)

# Visualize age groups
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
age_group_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Distribution of Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
age_group_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Age Group Proportions')
plt.ylabel('')

plt.tight_layout()
plt.show()

## 4. Feature Analysis and Visualizations

In [None]:
# Individual feature distributions
numerical_features = [
    "Length", "Diameter", "Height", "Whole weight", 
    "Shucked weight", "Viscera weight", "Shell weight"
]

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.ravel()

for i, feature in enumerate(numerical_features):
    # Histogram
    axes[i].hist(df[feature], bins=30, edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    
    # Mark the mean with a dashed vertical line
    mean_val = df[feature].mean()
    axes[i].axvline(mean_val, color='red', linestyle='--', alpha=0.7, label=f'Mean: {mean_val:.2f}')
    axes[i].legend()

# Hide the unused panel (7 features on a 2x4 grid)
for ax in axes[len(numerical_features):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

# Feature statistics
print("Feature Statistics:")
print("=" * 50)
for feature in numerical_features:
    print(f"{feature}: Mean={df[feature].mean():.3f}, Std={df[feature].std():.3f}, Range=[{df[feature].min():.3f}, {df[feature].max():.3f}]")


In [None]:
# Categorical feature analysis (Sex)
print("Sex Distribution Analysis:")
print("=" * 40)
sex_counts = df['Sex'].value_counts()
print(sex_counts)
print(f"\nSex proportions:")
print(df['Sex'].value_counts(normalize=True))

# Visualize sex distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Bar plot
sex_counts.plot(kind='bar', ax=axes[0], color=['lightblue', 'lightpink', 'lightgreen'])
axes[0].set_title('Sex Distribution')
axes[0].set_xlabel('Sex')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Pie chart
sex_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%', startangle=90)
axes[1].set_title('Sex Proportions')
axes[1].set_ylabel('')

# Age distribution by sex
df.boxplot(column='Age', by='Sex', ax=axes[2])
axes[2].set_title('Age Distribution by Sex')
axes[2].set_xlabel('Sex')
axes[2].set_ylabel('Age')

# pandas adds its own "Boxplot grouped by Sex" figure title; clear it
fig.suptitle('')
plt.tight_layout()
plt.show()

# Age statistics by sex
print("\nAge Statistics by Sex:")
print("=" * 30)
for sex in df['Sex'].unique():
    sex_data = df[df['Sex'] == sex]['Age']
    print(f"{sex}: Mean={sex_data.mean():.2f}, Median={sex_data.median():.2f}, Std={sex_data.std():.2f}")


## 5. Correlation Analysis


In [None]:
# Correlation matrix
corr_matrix = df[numerical_features + ['Rings', 'Age']].corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of All Features')
plt.tight_layout()
plt.show()

# Focus on correlations with target variable
print("Correlations with Age (Target Variable):")
print("=" * 50)
# Drop Rings as well: Age = Rings + 1.5, so their correlation is trivially 1
age_correlations = corr_matrix['Age'].drop(['Age', 'Rings']).sort_values(key=abs, ascending=False)
for feature, corr in age_correlations.items():
    print(f"{feature}: {corr:.3f}")

# Scatter plots of top correlated features with Age
top_features = age_correlations.head(4).index.tolist()
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, feature in enumerate(top_features):
    axes[i].scatter(df[feature], df['Age'], alpha=0.6, s=20)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Age')
    axes[i].set_title(f'{feature} vs Age (r={age_correlations[feature]:.3f})')
    
    # Add trend line
    z = np.polyfit(df[feature], df['Age'], 1)
    p = np.poly1d(z)
    axes[i].plot(df[feature], p(df[feature]), "r--", alpha=0.8)

plt.tight_layout()
plt.show()


## 6. Feature Relationships and Insights


In [None]:
# Pairwise relationships between key features
key_features = ['Length', 'Diameter', 'Height', 'Whole weight', 'Shell weight', 'Age']

# Create pairplot for key features
g = sns.pairplot(df[key_features], diag_kind='hist', plot_kws={'alpha': 0.6})
g.fig.suptitle('Pairwise Relationships Between Key Features', y=1.02)
plt.show()

# Feature importance based on correlation
print("Feature Importance (based on correlation with Age):")
print("=" * 60)
for i, (feature, corr) in enumerate(age_correlations.items(), 1):
    print(f"{i}. {feature}: {corr:.3f}")

# Check for potential outliers
print("\nPotential Outliers Analysis:")
print("=" * 40)
for feature in numerical_features:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
    print(f"{feature}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")


## 7. Key Insights and Conclusions


In [None]:
# Summary statistics and insights
print("📊 DATASET SUMMARY")
print("=" * 50)
print(f"• Total samples: {len(df):,}")
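# Count only the input features; df also holds Rings, Age and Age_Group by now
print(f"• Features: {len(numerical_features) + 1} ({len(numerical_features)} numerical + Sex)")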
print(f"• Age range: {df['Age'].min():.1f} - {df['Age'].max():.1f} years")
print(f"• Average age: {df['Age'].mean():.1f} years")

print("\n🎯 TARGET VARIABLE INSIGHTS")
print("=" * 50)
print(f"• Age distribution is right-skewed (younger abalones are more common)")
print(f"• Most abalones ({(df['Age'] <= 10).sum()/len(df)*100:.1f}%) are 10 years or younger")
print(f"• Sex distribution: {dict(df['Sex'].value_counts())}")

print("\n🔍 FEATURE RELATIONSHIPS")
print("=" * 50)
print("• Strong positive correlations between physical measurements")
print("• Shell weight shows highest correlation with age")
print("• Length and diameter are highly correlated (r ≈ 0.99)")
print("• Height shows moderate correlation with age")

print("\n📈 MODELING IMPLICATIONS")
print("=" * 50)
print("• Shell weight, length, and diameter are key predictors")
print("• Sex may be important for age prediction")
print("• Feature engineering could help (ratios, combinations)")
print("• Consider handling outliers in preprocessing")

print("\n✅ DATA QUALITY")
print("=" * 50)
print("• No missing values")
print("• No duplicate records")
print("• Clean dataset ready for modeling")


### 🎯 **Key Findings for Model Development**

1. **Target Variable**: Age ranges from 2.5 to 30.5 years (Rings from 1 to 29) with a right-skewed distribution
2. **Best Predictors**: Shell weight, length, and diameter show strongest correlations with age
3. **Data Quality**: Clean dataset with no missing values or duplicates
4. **Feature Engineering Opportunities**: 
   - Create ratios between measurements
   - Consider polynomial features for non-linear relationships
   - Handle potential outliers in preprocessing
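
The ratio and polynomial ideas above could be sketched as follows. The helper `add_engineered_features` and the new column names are hypothetical, chosen for illustration, and the one-row frame uses invented values:

```python
import pandas as pd

def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with a few illustrative engineered columns."""
    out = frame.copy()
    # Share of the total weight contributed by the shell
    out["shell_ratio"] = out["Shell weight"] / out["Whole weight"]
    # Length-to-diameter ratio as a simple shape descriptor
    out["length_diameter_ratio"] = out["Length"] / out["Diameter"]
    # Product of the three dimensions as a rough volume proxy
    out["volume_proxy"] = out["Length"] * out["Diameter"] * out["Height"]
    # Squared term to capture a possible non-linear effect
    out["shell_weight_sq"] = out["Shell weight"] ** 2
    return out

# Invented single row, for illustration only
toy = pd.DataFrame({
    "Length": [0.4], "Diameter": [0.2], "Height": [0.1],
    "Whole weight": [0.5], "Shell weight": [0.1],
})
features = add_engineered_features(toy)
print(features.iloc[0])
```

Whether any of these columns helps is an empirical question to settle with validation scores, not something this EDA establishes.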

### 📋 **Next Steps for Modeling**
- Use shell weight, length, diameter as primary features
- Consider sex as a categorical feature
- Implement feature scaling for numerical features
- Test both linear and non-linear models
- Validate model performance with appropriate metrics
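
A minimal sketch of the preprocessing and validation steps listed above, using only pandas and NumPy on synthetic data. The frame `toy`, its coefficients, and the 150/50 holdout split are assumptions made for illustration; a real pipeline would more likely use a dedicated modeling library and cross-validation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the abalone frame (invented values, illustration only)
n = 200
toy = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=n),
    "Length": rng.uniform(0.1, 0.8, size=n),
    "Shell weight": rng.uniform(0.01, 1.0, size=n),
})
toy["Age"] = (5 + 10 * toy["Shell weight"] + 4 * toy["Length"]
              + rng.normal(0, 0.5, size=n))

numeric_cols = ["Length", "Shell weight"]

# Simple holdout split: first 150 rows for training, the rest for testing
train, test = toy.iloc[:150], toy.iloc[150:]

# Scale with training statistics only, to avoid leaking test information
mu = train[numeric_cols].mean()
sigma = train[numeric_cols].std(ddof=0)

def build_matrix(frame: pd.DataFrame) -> np.ndarray:
    """Standardize numeric columns and one-hot encode Sex."""
    scaled = (frame[numeric_cols] - mu) / sigma
    sex = pd.get_dummies(frame["Sex"], prefix="Sex", dtype=float)
    # Fix the column set so train and test matrices always line up
    sex = sex.reindex(columns=["Sex_F", "Sex_I", "Sex_M"], fill_value=0.0)
    return pd.concat([scaled, sex], axis=1).to_numpy()

X_train, X_test = build_matrix(train), build_matrix(test)
y_train, y_test = train["Age"].to_numpy(), test["Age"].to_numpy()

# Ordinary least squares baseline; the full set of one-hot columns
# acts as a per-sex intercept, so no separate bias column is added
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ coef

# Two common regression metrics on the held-out rows
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
mae = float(np.mean(np.abs(y_test - pred)))
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

The linear fit serves as a baseline; a non-linear model would be evaluated on the same split with the same metrics for a fair comparison.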
