# Assignment 4: Exploratory Data Analysis (EDA) - Cardiotocographic Dataset

## Objective
Conduct a thorough exploratory analysis to uncover insights, identify patterns, and understand the dataset's underlying structure.

## Dataset Columns
- **LB** - Baseline Fetal Heart Rate
- **AC** - Accelerations
- **FM** - Fetal Movements
- **UC** - Uterine Contractions
- **DL** - Light Decelerations
- **DS** - Severe Decelerations
- **DP** - Prolonged Decelerations
- **ASTV** - Percentage of Time with Abnormal Short Term Variability
- **MSTV** - Mean Value of Short Term Variability
- **ALTV** - Percentage of Time with Abnormal Long Term Variability
- **MLTV** - Mean Value of Long Term Variability
- **Width** - Histogram Width
- **Tendency** - Histogram Tendency
- **NSP** - Fetal State (1=Normal, 2=Suspect, 3=Pathological)

---
## Step 1: Import Libraries and Load Data

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)

# Load the dataset
df = pd.read_csv('Cardiotocographic.csv')

# Display basic info
print("Dataset loaded successfully!")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()
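
Before cleaning, it can help to confirm that the file really contains the columns listed under "Dataset Columns". The sketch below is one possible check, not part of the assignment itself: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the small `toy` frame only stands in for the real CSV.

```python
import pandas as pd

# Column names taken from the "Dataset Columns" list above.
EXPECTED_COLUMNS = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV',
                    'MSTV', 'ALTV', 'MLTV', 'Width', 'Tendency', 'NSP']

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]

# Toy frame standing in for the real file (illustrative values only).
toy = pd.DataFrame({'LB': [120.0, 132.0], 'AC': [0.0, 0.006], 'NSP': [2, 1]})
print("Missing columns:", missing_columns(toy))
```

On the real data the call would be `missing_columns(df)`; an empty list means every documented column is present.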

---
## Step 2: Data Cleaning and Preparation

In [None]:
# Check for missing values
print("=== Missing Values ===")
missing_values = df.isnull().sum()
print(missing_values)

# Check total missing
total_missing = missing_values.sum()
print("\nTotal missing values:", total_missing)

In [None]:
# Check data types
print("=== Data Types ===")
print(df.dtypes)

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Remove duplicates if any
if duplicates > 0:
    df = df.drop_duplicates()
    print("Duplicates removed. New shape:", df.shape)

In [None]:
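# Handle missing values - fill numeric columns with their median
print("=== Handling Missing Values ===")

# median() is only defined for numeric data, so skip any non-numeric columns
for column in df.select_dtypes(include='number').columns: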
    missing_count = df[column].isnull().sum()
    if missing_count > 0:
        median_value = df[column].median()
        df[column] = df[column].fillna(median_value)
        print("Filled", missing_count, "missing values in", column, "with median:", median_value)

# Verify no missing values remain
print("\nRemaining missing values:", df.isnull().sum().sum())
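
The median is used for filling because a single extreme reading shifts the mean noticeably but barely moves the median. The following is a minimal illustration on made-up numbers (not values from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up readings with one extreme value and one gap.
values = pd.Series([120.0, 122.0, 125.0, 128.0, 300.0, np.nan])

mean_fill = values.mean()      # pulled upward by the 300.0 reading
median_fill = values.median()  # middle of the five observed values

# Fill the gap with the median, as the cell above does per column.
filled = values.fillna(median_fill)
print("Mean:", mean_fill, "| Median:", median_fill)
print(filled.tolist())
```

Both `mean()` and `median()` skip the missing entry by default, so the two candidates are computed from the same five observed values.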

---
## Step 3: Statistical Summary

In [None]:
# Get statistical summary using describe()
print("=== Statistical Summary ===")
summary = df.describe()
summary

In [None]:
# Calculate additional statistics
print("=== Detailed Statistics for Each Column ===")

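# Restrict to numeric columns: mean, std and quantiles are undefined for text
for column in df.select_dtypes(include='number').columns: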
    print("\n" + column + ":")
    print("-" * 30)
    
    # Mean
    mean_val = df[column].mean()
    print("  Mean:", round(mean_val, 4))
    
    # Median
    median_val = df[column].median()
    print("  Median:", round(median_val, 4))
    
    # Standard Deviation
    std_val = df[column].std()
    print("  Std Dev:", round(std_val, 4))
    
    # IQR (Interquartile Range)
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    print("  IQR:", round(iqr, 4))
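
The same four statistics can also be collected into a single table instead of being printed column by column. This is a sketch on a toy frame (illustrative values only); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy frame standing in for `df` (illustrative values only).
toy = pd.DataFrame({'LB': [120.0, 130.0, 140.0, 150.0],
                    'ASTV': [20.0, 40.0, 60.0, 80.0]})

# Mean, median and standard deviation for every column in one call;
# transposing puts one row per variable.
stats = toy.agg(['mean', 'median', 'std']).T

# IQR = Q3 - Q1, computed column-wise and aligned on the variable name.
stats['IQR'] = toy.quantile(0.75) - toy.quantile(0.25)
print(stats.round(4))
```

A table with one row per variable is easier to scan and to sort, for example by `IQR`, than the printed blocks.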

---
## Step 4: Data Visualization

### 4.1 Histograms - Distribution of Numerical Variables

In [None]:
# Create histograms for all numerical columns
numerical_cols = ['LB', 'AC', 'FM', 'UC', 'ASTV', 'MSTV', 'ALTV', 'MLTV', 'Width']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for i, column in enumerate(numerical_cols):
    axes[i].hist(df[column], bins=30, color='steelblue', edgecolor='black')
    axes[i].set_title('Distribution of ' + column)
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('histograms.png')
plt.show()

### 4.2 Boxplots - Identifying Outliers

In [None]:
# Create boxplots
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for i, column in enumerate(numerical_cols):
    axes[i].boxplot(df[column].dropna())
    axes[i].set_title('Boxplot of ' + column)
    axes[i].set_ylabel(column)

plt.tight_layout()
plt.savefig('boxplots.png')
plt.show()

### 4.3 Bar Chart - Fetal State (NSP) Distribution

In [None]:
# Bar chart for NSP (Fetal State)
# NSP: 1 = Normal, 2 = Suspect, 3 = Pathological

nsp_counts = df['NSP'].value_counts().sort_index()

plt.figure(figsize=(8, 6))
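# Map each NSP code to a label and colour so bars stay aligned with the data,
# even if a class is absent or an unexpected code appears
state_names = {1: 'Normal', 2: 'Suspect', 3: 'Pathological'}
state_colors = {1: 'green', 2: 'orange', 3: 'red'}
labels = [state_names.get(code, str(code)) for code in nsp_counts.index]
colors = [state_colors.get(code, 'gray') for code in nsp_counts.index]
bars = plt.bar(labels, nsp_counts.values, color=colors, edgecolor='black')

# Add labels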
plt.xlabel('Fetal State')
plt.ylabel('Count')
plt.title('Distribution of Fetal State (NSP)')

# Add count on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             str(int(height)),
             ha='center', va='bottom')

plt.savefig('nsp_distribution.png')
plt.show()

# Print percentages
print("\nFetal State Distribution:")
total = len(df)
for state, count in zip(labels, nsp_counts.values):
    percentage = (count / total) * 100
    print(f"{state}: {count} ({percentage:.2f}%)")
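
The percentages printed above can also be obtained directly from `value_counts(normalize=True)`. The sketch below uses made-up labels in place of `df['NSP']`; `state_names` is a hypothetical helper mapping introduced here.

```python
import pandas as pd

# Toy labels standing in for df['NSP'] (illustrative values only).
nsp = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 1])

state_names = {1: 'Normal', 2: 'Suspect', 3: 'Pathological'}

# normalize=True returns fractions instead of counts; scale to percent.
shares = nsp.value_counts(normalize=True).sort_index() * 100

# Replace the numeric codes with readable names.
shares.index = shares.index.map(state_names)
print(shares.round(2))
```

Looking at shares rather than raw counts makes class imbalance easy to spot, which matters if the data is later used to train a classifier.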

### 4.4 Correlation Heatmap

In [None]:
# Calculate correlation matrix
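# numeric_only=True avoids an error in recent pandas if a non-numeric column is present
correlation_matrix = df.corr(numeric_only=True)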

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, 
            annot=True, 
            fmt='.2f', 
            cmap='coolwarm',
            center=0,
            square=True)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
plt.show()
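
A correlation matrix is symmetric, so every pair appears twice in the heatmap. One common way to reduce clutter is to hide the diagonal and the upper triangle with a boolean mask. The sketch below builds such a mask on a toy frame (illustrative values only); `lower` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df` (illustrative values only).
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 4.0, 6.0, 8.0],
                    'c': [4.0, 3.0, 2.0, 1.0]})
corr = toy.corr()

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# DataFrame.mask replaces the True cells with NaN, leaving the lower triangle.
lower = corr.mask(mask)
print(lower)
```

The same `mask` array can be passed to `sns.heatmap(correlation_matrix, mask=mask, ...)`, which leaves the masked cells blank.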

### 4.5 Scatter Plots - Relationships Between Variables

In [None]:
# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Scatter 1: LB vs ASTV
axes[0, 0].scatter(df['LB'], df['ASTV'], alpha=0.5, c='blue')
axes[0, 0].set_xlabel('LB (Baseline Heart Rate)')
axes[0, 0].set_ylabel('ASTV')
axes[0, 0].set_title('LB vs ASTV')

# Scatter 2: MSTV vs MLTV
axes[0, 1].scatter(df['MSTV'], df['MLTV'], alpha=0.5, c='green')
axes[0, 1].set_xlabel('MSTV')
axes[0, 1].set_ylabel('MLTV')
axes[0, 1].set_title('MSTV vs MLTV')

# Scatter 3: Width vs LB
axes[1, 0].scatter(df['Width'], df['LB'], alpha=0.5, c='red')
axes[1, 0].set_xlabel('Width')
axes[1, 0].set_ylabel('LB')
axes[1, 0].set_title('Width vs LB')

# Scatter 4: ASTV vs ALTV
axes[1, 1].scatter(df['ASTV'], df['ALTV'], alpha=0.5, c='purple')
axes[1, 1].set_xlabel('ASTV')
axes[1, 1].set_ylabel('ALTV')
axes[1, 1].set_title('ASTV vs ALTV')

plt.tight_layout()
plt.savefig('scatter_plots.png')
plt.show()

### 4.6 Violin Plots - Distribution by Fetal State

In [None]:
# Violin plots showing distribution of key variables by NSP
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Violin 1: LB by NSP
sns.violinplot(x='NSP', y='LB', data=df, ax=axes[0, 0], palette='Set2')
axes[0, 0].set_title('LB Distribution by Fetal State')

# Violin 2: ASTV by NSP
sns.violinplot(x='NSP', y='ASTV', data=df, ax=axes[0, 1], palette='Set2')
axes[0, 1].set_title('ASTV Distribution by Fetal State')

# Violin 3: MSTV by NSP
sns.violinplot(x='NSP', y='MSTV', data=df, ax=axes[1, 0], palette='Set2')
axes[1, 0].set_title('MSTV Distribution by Fetal State')

# Violin 4: Width by NSP
sns.violinplot(x='NSP', y='Width', data=df, ax=axes[1, 1], palette='Set2')
axes[1, 1].set_title('Width Distribution by Fetal State')

plt.tight_layout()
plt.savefig('violin_plots.png')
plt.show()

---
## Step 5: Pattern Recognition and Insights

In [None]:
# Find top correlations
print("=== Top Correlations ===")

# Get correlation values
corr_pairs = []
# Use the matrix's own columns so the lookup below cannot hit a missing label
columns = correlation_matrix.columns.tolist()

for i in range(len(columns)):
    for j in range(i+1, len(columns)):
        col1 = columns[i]
        col2 = columns[j]
        corr_value = correlation_matrix.loc[col1, col2]
        corr_pairs.append((col1, col2, corr_value))

# Sort by absolute correlation value
corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)

# Print top 10 correlations
print("\nTop 10 Strongest Correlations:")
for i in range(min(10, len(corr_pairs))):
    pair = corr_pairs[i]
    print(str(i+1) + ".", pair[0], "vs", pair[1], ":", round(pair[2], 4))
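
The nested loop above can be replaced by a vectorized version that keeps each pair once and sorts by absolute strength. This is a sketch on a toy frame (illustrative values only), and `upper`, `pairs` and `ranked` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df` (illustrative values only).
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [1.0, 3.0, 2.0, 4.0],
                    'c': [4.0, 3.0, 2.0, 1.0]})
corr = toy.corr()

# Keep only cells strictly above the diagonal (k=1), so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (col1, col2) and drop the blanked cells.
pairs = upper.stack().dropna()

# Strongest relationships first, regardless of sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(10))
```

Sorting on the absolute value matters because a strong negative correlation is as informative as a strong positive one.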

In [None]:
# Compare means across fetal states
print("=== Mean Values by Fetal State ===")
print("\n(NSP: 1=Normal, 2=Suspect, 3=Pathological)")

grouped_means = df.groupby('NSP').mean(numeric_only=True)
grouped_means

In [None]:
# Outlier Detection using IQR method
print("=== Outlier Detection ===")

for column in numerical_cols:
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    outlier_count = len(outliers)
    
    if outlier_count > 0:
        percentage = (outlier_count / len(df)) * 100
        print(column + ":", outlier_count, "outliers (", round(percentage, 2), "%)")
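
The IQR rule used above flags any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below wraps that rule in a reusable function and applies it to made-up readings; `iqr_outlier_mask` and its parameter `k` are hypothetical names introduced here.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up readings; 200.0 is deliberately extreme.
readings = pd.Series([118.0, 120.0, 122.0, 124.0, 126.0,
                      128.0, 130.0, 132.0, 200.0])

# Here Q1 = 122 and Q3 = 130, so IQR = 8 and the fences are 110 and 142.
mask = iqr_outlier_mask(readings)
print(readings[mask].tolist())
```

Returning a mask instead of a count keeps the flagged rows available for inspection, which fits the recommendation to review outliers with domain experts before removing them.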

---
## Step 6: Conclusion

### Key Insights:

1. **Dataset Overview:**
   - The dataset contains cardiotocographic measurements for fetal health monitoring
   - Most cases are classified as Normal (NSP=1)

2. **Correlations Found:**
   - Strong correlations exist between variability measures (ASTV, MSTV, ALTV, MLTV)
   - Width shows correlation with heart rate measures

3. **Differences by Fetal State:**
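   - Pathological cases (NSP=3) differ from Normal cases (NSP=1) in the variability measures; see the grouped means in Step 5 and the violin plots in 4.6
   - ASTV and ALTV values tend to be higher in Suspect and Pathological cases (NSP=2 and NSP=3)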

4. **Outliers:**
   - Several columns contain outliers that may need attention
   - These could represent extreme but valid medical cases

### Recommendations:
- The variability measures (ASTV, MSTV, ALTV, MLTV) appear to be good indicators for fetal health
- Machine learning models could use these features to predict fetal state
- Outliers should be investigated with domain experts before removal