# Notebook 03: Exploratory Data Analysis (EDA)

## Purpose
This notebook performs comprehensive exploratory data analysis to:
- Understand data distributions
- Discover patterns and relationships
- Identify correlations between variables
- Detect outliers
- Generate insights for modeling

## Learning Objectives
- Apply statistical analysis techniques
- Create meaningful visualizations
- Interpret data patterns
- Document insights and observations

---
## 1. Import Libraries and Load Clean Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Display and visualization settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

In [None]:
# Load cleaned dataset
df = pd.read_csv('../data/cleaned_dataset.csv')

print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")
print("\nFirst few rows:")
df.head()

---
## 2. Univariate Analysis

### 2.1 Numerical Features Distribution

In [None]:
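# Select numerical columns (all integer and float widths, not only int64/float64)
numerical_cols = df.select_dtypes(include='number').columns.tolist()

# Remove ID columns: match 'id' exactly or an '_id' suffix,
# so columns whose names merely contain the letters "id" are kept
numerical_cols = [col for col in numerical_cols
                  if col.lower() != 'id' and not col.lower().endswith('_id')]

print(f"Numerical columns for analysis: {numerical_cols}")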

In [None]:
# Create histograms for numerical features (at most 9, one per subplot)
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.ravel()
plot_cols = numerical_cols[:9]

for idx, col in enumerate(plot_cols):
    axes[idx].hist(df[col].dropna(), bins=50, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

# Hide subplots left empty when there are fewer than 9 numerical columns
for ax in axes[len(plot_cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.suptitle('Numerical Features Distributions', fontsize=14, fontweight='bold', y=1.001)
plt.show()

### 2.2 Price Distribution Analysis

Price is likely our target variable for prediction, so let's analyze it in detail.

In [None]:
# Detailed price analysis
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Histogram
axes[0].hist(df['price'], bins=100, color='coral', edgecolor='black', alpha=0.7)
axes[0].set_title('Price Distribution', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Price ($)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].axvline(df['price'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: ${df["price"].mean():.2f}')
axes[0].axvline(df['price'].median(), color='green', linestyle='--', linewidth=2, label=f'Median: ${df["price"].median():.2f}')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Box plot
axes[1].boxplot(df['price'], vert=True)
axes[1].set_title('Price Box Plot', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Price ($)', fontsize=11)
axes[1].grid(axis='y', alpha=0.3)

# Log-transformed price
log_price = np.log1p(df['price'])
axes[2].hist(log_price, bins=50, color='lightgreen', edgecolor='black', alpha=0.7)
axes[2].set_title('Log-Transformed Price Distribution', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Log(Price + 1)', fontsize=11)
axes[2].set_ylabel('Frequency', fontsize=11)
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Price statistics
print("\nPRICE STATISTICS:")
print("="*80)
print(df['price'].describe())
print(f"\nSkewness: {df['price'].skew():.2f}")
print(f"Kurtosis: {df['price'].kurtosis():.2f}")
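
The effect of the log transform on skewness can be checked numerically as well as visually. The sketch below uses synthetic log-normal values as a stand-in for prices (the data, seed, and variable names are assumptions for illustration, not values from this dataset) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal). Illustrative only, not the notebook's data.
rng = np.random.default_rng(42)
demo_price = pd.Series(rng.lognormal(mean=4.8, sigma=0.7, size=5_000), name="price")

# Skewness of the raw values versus the log1p-transformed values
raw_skew = demo_price.skew()
log_skew = np.log1p(demo_price).skew()

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

A skewness near 0 after the transform suggests the logged variable is roughly symmetric; running the same two lines on `df['price']` would show whether that holds for the real data.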

### 2.3 Categorical Features Analysis

In [None]:
# Analyze categorical features
categorical_cols = ['neighbourhood_group', 'room_type']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Neighbourhood group distribution
if 'neighbourhood_group' in df.columns:
    neighbourhood_counts = df['neighbourhood_group'].value_counts()
    axes[0].bar(neighbourhood_counts.index, neighbourhood_counts.values, color='steelblue', edgecolor='black')
    axes[0].set_title('Listings by Neighbourhood Group', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('Neighbourhood Group', fontsize=11)
    axes[0].set_ylabel('Count', fontsize=11)
    axes[0].tick_params(axis='x', rotation=45)
    axes[0].grid(axis='y', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(neighbourhood_counts.values):
        axes[0].text(i, v + 200, str(v), ha='center', fontweight='bold')

# Room type distribution
if 'room_type' in df.columns:
    room_counts = df['room_type'].value_counts()
    axes[1].bar(room_counts.index, room_counts.values, color='salmon', edgecolor='black')
    axes[1].set_title('Listings by Room Type', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Room Type', fontsize=11)
    axes[1].set_ylabel('Count', fontsize=11)
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].grid(axis='y', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(room_counts.values):
        axes[1].text(i, v + 200, str(v), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---
## 3. Bivariate Analysis

### 3.1 Price vs Location

In [None]:
# Price by neighbourhood group
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

if 'neighbourhood_group' in df.columns:
    # Box plot
    df.boxplot(column='price', by='neighbourhood_group', ax=axes[0])
    fig.suptitle('')  # remove the automatic "Boxplot grouped by ..." title that pandas adds
    axes[0].set_title('Price Distribution by Neighbourhood Group', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('Neighbourhood Group', fontsize=11)
    axes[0].set_ylabel('Price ($)', fontsize=11)
    axes[0].tick_params(axis='x', rotation=45)

    # Average price by neighbourhood group
    avg_price = df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)
    axes[1].barh(avg_price.index, avg_price.values, color='teal', edgecolor='black')
    axes[1].set_title('Average Price by Neighbourhood Group', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Average Price ($)', fontsize=11)
    axes[1].set_ylabel('Neighbourhood Group', fontsize=11)
    axes[1].grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(avg_price.values):
        axes[1].text(v + 2, i, f'${v:.2f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

### 3.2 Price vs Room Type

In [None]:
# Price by room type
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

if 'room_type' in df.columns:
    # Box plot
    df.boxplot(column='price', by='room_type', ax=axes[0])
    fig.suptitle('')  # remove the automatic "Boxplot grouped by ..." title that pandas adds
    axes[0].set_title('Price Distribution by Room Type', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('Room Type', fontsize=11)
    axes[0].set_ylabel('Price ($)', fontsize=11)
    axes[0].tick_params(axis='x', rotation=45)
    
    # Average price
    avg_price_room = df.groupby('room_type')['price'].mean().sort_values(ascending=False)
    axes[1].bar(avg_price_room.index, avg_price_room.values, color='orange', edgecolor='black')
    axes[1].set_title('Average Price by Room Type', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Room Type', fontsize=11)
    axes[1].set_ylabel('Average Price ($)', fontsize=11)
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].grid(axis='y', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(avg_price_room.values):
        axes[1].text(i, v + 5, f'${v:.2f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
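
Because price is right-skewed, a group mean can be pulled upward by a handful of expensive listings. A minimal sketch, using a small made-up frame (the values are assumptions chosen to make the effect obvious), shows how reporting the median next to the mean exposes that gap.

```python
import pandas as pd

# Made-up prices: one extreme listing in the first group. Illustrative only.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [120, 150, 180, 2000, 60, 70, 80, 90],
})

# Mean, median and group size side by side
summary = demo.groupby("room_type")["price"].agg(["mean", "median", "count"])
print(summary)
```

In this toy data the single 2000 listing lifts the "Entire home/apt" mean far above its median, while the "Private room" mean and median coincide. The same `agg` call on `df` would show how much the averages plotted above depend on the tail.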

### 3.3 Price vs Reviews

In [None]:
# Scatter plots: Price vs Review metrics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Price vs Number of Reviews
axes[0].scatter(df['number_of_reviews'], df['price'], alpha=0.3, s=10, color='purple')
axes[0].set_title('Price vs Number of Reviews', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Number of Reviews', fontsize=11)
axes[0].set_ylabel('Price ($)', fontsize=11)
axes[0].grid(alpha=0.3)

# Price vs Reviews per Month
if 'reviews_per_month' in df.columns:
    axes[1].scatter(df['reviews_per_month'], df['price'], alpha=0.3, s=10, color='brown')
    axes[1].set_title('Price vs Reviews per Month', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Reviews per Month', fontsize=11)
    axes[1].set_ylabel('Price ($)', fontsize=11)
    axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate correlations (Pearson, the pandas default)
corr_reviews = df['price'].corr(df['number_of_reviews'])
print(f"Correlation (Price vs Number of Reviews): {corr_reviews:.3f}")
if 'reviews_per_month' in df.columns:
    corr_reviews_month = df['price'].corr(df['reviews_per_month'])
    print(f"Correlation (Price vs Reviews per Month): {corr_reviews_month:.3f}")
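
Pearson correlation only measures linear association, so it can understate a relationship that is monotonic but curved. One common complement is Spearman rank correlation, which equals the Pearson correlation of the ranks. The sketch below demonstrates the difference on synthetic data (the variables `x` and `y` are assumptions for illustration and are unrelated to the listings).

```python
import numpy as np
import pandas as pd

# Synthetic example: y increases with x, but exponentially rather than linearly
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 5, size=1_000))
y = np.exp(x)

# Pearson on the raw values
pearson = x.corr(y)

# Spearman computed as Pearson on the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Applying the same rank-based calculation to `df['price']` and the review columns would indicate whether the weak Pearson values reflect a genuinely weak relationship or just a nonlinear one.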

---
## 4. Correlation Analysis

Pearson correlations (the default for `DataFrame.corr`) show which numerical features have the strongest linear relationship with price.

In [None]:
# Calculate correlation matrix for numerical features
correlation_matrix = df[numerical_cols].corr()

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numerical Features', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlations with price
if 'price' in df.columns:
    price_correlations = correlation_matrix['price'].sort_values(ascending=False)
    print("\nCORRELATIONS WITH PRICE:")
    print("="*80)
    print(price_correlations)
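
The sorted list above includes the trivial correlation of price with itself and places strong negative correlations at the bottom. A possible refinement, sketched here on synthetic data (the columns `a`, `b`, and `target` are hypothetical), is to drop the target and rank by absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic frame: "target" depends strongly on "a" and weakly (negatively) on "b"
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
demo["target"] = 3 * demo["a"] - 0.5 * demo["b"] + rng.normal(scale=0.5, size=n)

# Rank features by the strength of their correlation with the target, ignoring sign
ranked = (
    demo.corr()["target"]
    .drop("target")               # remove the self-correlation of 1.0
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

With the notebook's data, the equivalent would be `correlation_matrix['price'].drop('price').abs().sort_values(ascending=False)`.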

---
## 5. Outlier Detection

### 5.1 Statistical Outlier Detection (IQR Method)

In [None]:
# Detect outliers using IQR method for price
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]

print("OUTLIER DETECTION (IQR METHOD):")
print("="*80)
print(f"Q1 (25th percentile): ${Q1:.2f}")
print(f"Q3 (75th percentile): ${Q3:.2f}")
print(f"IQR: ${IQR:.2f}")
print(f"Lower Bound: ${lower_bound:.2f}")
print(f"Upper Bound: ${upper_bound:.2f}")
print(f"\nNumber of outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.2f}% of data)")
print(f"Price range of outliers: ${outliers['price'].min():.2f} - ${outliers['price'].max():.2f}")
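
The same IQR rule can be wrapped in a small helper so it is reusable for other columns. This is a minimal sketch; the function name `iqr_bounds` and the toy series are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data with one obviously extreme value
demo = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 1000])
low, high = iqr_bounds(demo)
demo_outliers = demo[(demo < low) | (demo > high)]

print(f"Fences: [{low}, {high}]")
print(f"Outliers: {demo_outliers.tolist()}")
```

Here Q1 is 70 and Q3 is 110, so the fences are 10 and 170 and only the value 1000 is flagged. For a non-negative, right-skewed variable such as price, the lower fence is often below zero, in which case only the upper fence ever flags anything.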

### 5.2 Visualize Outliers

In [None]:
# Visualize outliers
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot with outliers
axes[0].boxplot(df['price'], vert=True, showfliers=True)
axes[0].set_title('Price Box Plot (with outliers)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Price ($)', fontsize=11)
axes[0].axhline(upper_bound, color='r', linestyle='--', label=f'Upper Bound: ${upper_bound:.2f}')
axes[0].axhline(lower_bound, color='r', linestyle='--', label=f'Lower Bound: ${lower_bound:.2f}')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

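# Histogram highlighting outliers (shared bin edges so the two groups are comparable)
inliers = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]
bins = np.linspace(df['price'].min(), df['price'].max(), 101)
axes[1].hist(inliers['price'], bins=bins, color='lightblue', edgecolor='black', alpha=0.7, label='Within IQR bounds')
axes[1].hist(outliers['price'], bins=bins, color='red', edgecolor='black', alpha=0.7, label='Outliers')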
axes[1].set_title('Price Distribution (Outliers Highlighted)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Price ($)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 6. Geographic Analysis

Analyzing the spatial distribution of listings and prices.

In [None]:
# Geographic scatter plot
if 'latitude' in df.columns and 'longitude' in df.columns:
    plt.figure(figsize=(12, 10))
    
    # vmax=500 caps the color scale: listings priced above $500 all share the darkest color
    scatter = plt.scatter(df['longitude'], df['latitude'], 
                         c=df['price'], cmap='YlOrRd', 
                         alpha=0.5, s=10, vmin=0, vmax=500)
    
    plt.colorbar(scatter, label='Price ($)')
    plt.title('Geographic Distribution of Listings (colored by price)', 
              fontsize=14, fontweight='bold')
    plt.xlabel('Longitude', fontsize=11)
    plt.ylabel('Latitude', fontsize=11)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
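
A scatter plot with thousands of overlapping points can hide local price levels. One way to summarize them, sketched below, is to bin coordinates into a coarse grid and take the median price per cell. The four coordinates, the prices, and the 0.01-degree `cell_size` are all assumptions for illustration.

```python
import pandas as pd

# Four made-up listings in two clusters. Illustrative only.
demo = pd.DataFrame({
    "latitude":  [40.751, 40.752, 40.689, 40.688],
    "longitude": [-73.991, -73.992, -73.951, -73.952],
    "price":     [200, 300, 90, 110],
})

# Assign each listing to a grid cell by rounding its coordinates
cell_size = 0.01  # degrees; an arbitrary choice for this sketch
demo["lat_bin"] = (demo["latitude"] / cell_size).round().astype(int)
demo["lon_bin"] = (demo["longitude"] / cell_size).round().astype(int)

# Median price and listing count per grid cell
grid = (
    demo.groupby(["lat_bin", "lon_bin"])["price"]
    .agg(median_price="median", listings="count")
    .reset_index()
)
print(grid)
```

Cells with very few listings give noisy medians, so in practice it may be worth filtering on the `listings` count before interpreting the result.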

---
## 7. Key Insights and Observations

### Summary of Findings:

1. **Price Distribution**:
   - Price is right-skewed with most listings under $200
   - Significant outliers exist (very expensive listings)
   - Log transformation may improve model performance

2. **Location Impact**:
   - Manhattan has the highest average prices
   - Brooklyn and Queens are more affordable
   - Geographic location strongly influences price

3. **Room Type**:
   - Entire homes/apartments are most expensive
   - Private rooms are moderately priced
   - Shared rooms are least expensive

4. **Reviews**:
   - Weak correlation between price and number of reviews
   - Popular listings (many reviews) span all price ranges

5. **Outliers**:
   - Approximately 5-10% of listings are price outliers under the 1.5 * IQR rule (Section 5.1 prints the exact share)
   - May need special handling in modeling

### Implications for Modeling:

- **Target Variable**: Price prediction (regression problem)
- **Important Features**: Location, room type, availability
- **Preprocessing Needed**: 
  - Log transformation of price
  - Encoding categorical variables
  - Feature scaling
  - Outlier handling strategy
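
The first three preprocessing steps can be sketched with pandas and NumPy alone. This is a rough illustration on a tiny made-up frame, not the approach the next notebook necessarily takes; the column names mirror this dataset, but the values and the `_scaled` suffix are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame. Values are illustrative only.
demo = pd.DataFrame({
    "price": [80, 150, 45, 300],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "number_of_reviews": [12, 3, 40, 0],
})

# 1. Log-transform the target (np.expm1 inverts it when reporting predictions)
demo["log_price"] = np.log1p(demo["price"])

# 2. One-hot encode the categorical column
encoded = pd.get_dummies(demo, columns=["room_type"], dtype=int)

# 3. Standardize a numerical feature to zero mean and unit variance
col = "number_of_reviews"
encoded[col + "_scaled"] = (encoded[col] - encoded[col].mean()) / encoded[col].std(ddof=0)

print(encoded.head())
```

In a real modeling workflow the scaling statistics (mean and standard deviation) would be computed on the training split only and then reused on validation and test data, to avoid leaking information.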

---
**Next Notebook**: [04_feature_engineering.ipynb](04_feature_engineering.ipynb)