# House Price Dataset - Exploratory Data Analysis
This notebook explores the training set of the Kaggle House Prices (Advanced Regression Techniques) dataset with ten standard plot types, each followed by a short interpretation.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

In [None]:
# Load the dataset
train_df = pd.read_csv(r'd:\Downloads\Compressed\house-prices-advanced-regression-techniques\train.csv')
print(f"Dataset shape: {train_df.shape}")
print("\nFirst few rows:")
train_df.head()

In [None]:
# Get basic information about the dataset
print("Dataset Info:")
train_df.info()

In [None]:
# Statistical summary
train_df.describe()

In [None]:
# Identify numeric and categorical features
numeric_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

print(f"Number of numeric features: {len(numeric_features)}")
print(f"Number of categorical features: {len(categorical_features)}")
print(f"\nNumeric features: {numeric_features[:10]}...")
print(f"\nCategorical features: {categorical_features[:10]}...")

## 1. Histogram
Histograms show the distribution of numeric variables by dividing data into bins and counting observations in each bin.

In [None]:
# Histogram for key numeric features
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Histograms of Key Numeric Features', fontsize=16, fontweight='bold')

features_to_plot = ['SalePrice', 'GrLivArea', 'TotalBsmtSF', 'LotArea', 'YearBuilt', 'OverallQual']

for idx, feature in enumerate(features_to_plot):
    row = idx // 3
    col = idx % 3
    axes[row, col].hist(train_df[feature].dropna(), bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[row, col].set_xlabel(feature, fontsize=12)
    axes[row, col].set_ylabel('Frequency', fontsize=12)
    axes[row, col].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Histogram:")
print("=" * 80)
print("• SalePrice: Right-skewed distribution, most houses priced between $100k-$200k")
print("• GrLivArea: Slightly right-skewed, most homes between 1000-2000 sq ft")
print("• TotalBsmtSF: Right-skewed with many houses having smaller basements")
print("• LotArea: Highly right-skewed, indicating few very large lots")
print("• YearBuilt: Shows housing development trends over decades")
print("• OverallQual: Roughly normal distribution centered around quality rating 5-6")
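
The binning step behind a histogram can be inspected directly. The sketch below is only an illustration: it uses a synthetic right-skewed sample (not the Kaggle data) and assumes the same `bins=30` setting as the plots above. `np.histogram` returns the per-bin counts and the bin edges.

```python
import numpy as np

# Synthetic right-skewed sample standing in for a price column (not the Kaggle data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Split the value range into 30 equal-width bins and count observations per bin
counts, edges = np.histogram(prices, bins=30)

# 30 bins are delimited by 31 edges; every observation lands in exactly one bin
print(f"bin width: {edges[1] - edges[0]:,.0f}")
print(f"fullest bin starts at: {edges[np.argmax(counts)]:,.0f}")
print(f"total counted: {counts.sum()}")
```

Changing `bins` changes the bin width and therefore how smooth or jagged the histogram looks, so it is worth trying a few values before reading too much into small bumps.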

## 2. Density Plot (KDE)
Kernel Density Estimation plots show a smoothed version of the histogram, representing probability density.

In [None]:
# Density plots for key numeric features
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Density Plots of Key Numeric Features', fontsize=16, fontweight='bold')

for idx, feature in enumerate(features_to_plot):
    row = idx // 3
    col = idx % 3
    train_df[feature].dropna().plot(kind='density', ax=axes[row, col], color='coral', linewidth=2)
    axes[row, col].set_xlabel(feature, fontsize=12)
    axes[row, col].set_ylabel('Density', fontsize=12)
    axes[row, col].set_title(f'Density Plot of {feature}', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3)
    axes[row, col].fill_between(axes[row, col].lines[0].get_xdata(), 
                                  axes[row, col].lines[0].get_ydata(), 
                                  alpha=0.3, color='coral')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Density Plot:")
print("=" * 80)
print("• Density plots provide a smooth continuous view of data distribution")
print("• SalePrice: Peak around $150k with long right tail (expensive outliers)")
print("• GrLivArea: Single peak around 1500 sq ft, indicating typical home size")
print("• YearBuilt: Multiple peaks showing construction booms in different decades")
print("• The smooth curves help identify multimodal distributions and skewness")
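
To make the smoothing concrete, here is a minimal hand-rolled Gaussian KDE. It is a sketch under stated assumptions: the sample is synthetic, and the bandwidth uses Scott's rule of thumb, which is one common choice and may not match what a given plotting library does in every configuration.

```python
import numpy as np

# Synthetic "living area" values (not the Kaggle data)
rng = np.random.default_rng(1)
sample = rng.normal(loc=1500.0, scale=500.0, size=200)

# Scott's rule of thumb for the bandwidth (an assumption; other rules exist)
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

# Evaluate the estimate on a grid wide enough to cover both tails
grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 500)

# Each observation contributes one Gaussian bump; the KDE is their average
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to about 1 over the grid (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth = {bandwidth:.1f}, area under curve = {area:.3f}")
```

A larger bandwidth gives a smoother curve that can hide separate peaks; a smaller one gives a bumpier curve that can invent peaks from noise.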

## 3. Boxplot
Boxplots display the distribution through quartiles, showing median, outliers, and spread.

In [None]:
# Boxplots for numeric features
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Boxplots of Key Numeric Features', fontsize=16, fontweight='bold')

for idx, feature in enumerate(features_to_plot):
    row = idx // 3
    col = idx % 3
    bp = axes[row, col].boxplot(train_df[feature].dropna(), patch_artist=True, 
                                  boxprops=dict(facecolor='lightgreen', alpha=0.7),
                                  medianprops=dict(color='red', linewidth=2))
    axes[row, col].set_ylabel(feature, fontsize=12)
    axes[row, col].set_title(f'Boxplot of {feature}', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Boxplot:")
print("=" * 80)
print("• Boxplots reveal outliers (dots beyond whiskers) and data spread")
print("• SalePrice: Many high-value outliers indicating luxury properties")
print("• LotArea: Extreme outliers showing unusually large lots")
print("• The red line shows the median (50th percentile)")
print("• Box represents IQR (25th to 75th percentile) - middle 50% of data")
print("• Whiskers extend to the most extreme points within 1.5×IQR of the box edges; points beyond are potential outliers")
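
The quantities a boxplot draws can be computed by hand. The sketch below uses a small toy array (not the Kaggle data) and the usual 1.5×IQR convention for the fences.

```python
import numpy as np

# Toy data with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the upper whisker ends at 9, the largest value inside the fence, not at the fence itself.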

## 4. Violin Plot
Violin plots combine boxplot and KDE, showing distribution shape and quartiles together.

In [None]:
# Violin plots comparing SalePrice across categorical variables
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Violin Plots: SalePrice by Categorical Features', fontsize=16, fontweight='bold')

# Grouping variables: OverallQual is an ordinal 1-10 rating; Neighborhood is limited to its 8 most common values below
cat_vars = ['OverallQual', 'HouseStyle', 'Neighborhood', 'SaleCondition']

for idx, cat_var in enumerate(cat_vars):
    row = idx // 2
    col = idx % 2
    
    if cat_var == 'Neighborhood':
        # Top 8 neighborhoods by count
        top_neighborhoods = train_df['Neighborhood'].value_counts().head(8).index
        data = train_df[train_df['Neighborhood'].isin(top_neighborhoods)]
        sns.violinplot(data=data, x=cat_var, y='SalePrice', ax=axes[row, col], palette='muted')
        axes[row, col].tick_params(axis='x', rotation=45)
    else:
        sns.violinplot(data=train_df, x=cat_var, y='SalePrice', ax=axes[row, col], palette='muted')
        axes[row, col].tick_params(axis='x', rotation=45)
    
    axes[row, col].set_title(f'SalePrice Distribution by {cat_var}', fontsize=12, fontweight='bold')
    axes[row, col].set_xlabel(cat_var, fontsize=11)
    axes[row, col].set_ylabel('Sale Price ($)', fontsize=11)
    axes[row, col].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Violin Plot:")
print("=" * 80)
print("• Violin width shows density - wider means more data points at that price")
print("• OverallQual: Clear positive relationship - higher quality = higher price")
print("• HouseStyle: Different styles have distinct price distributions")
print("• Neighborhood: Significant price variation across locations (location matters!)")
print("• SaleCondition: Normal sales have different distribution than abnormal/partial")
print("• Violin plots reveal both central tendency and distribution shape simultaneously")
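
A numeric companion to the violin plot is a per-group summary of the same quartiles. The sketch below uses made-up values; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data; column names mirror the notebook but the values are made up
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 7, 7, 7, 9, 9, 9],
    "SalePrice": [120_000, 135_000, 150_000, 180_000, 200_000, 230_000,
                  310_000, 350_000, 420_000],
})

# Per-group count, median and quartiles: the numbers a violin's inner box encodes
summary = toy.groupby("OverallQual")["SalePrice"].describe()[["count", "25%", "50%", "75%"]]
print(summary)
```

Checking the `count` column first is useful: a violin drawn from a handful of observations can look smooth while saying very little.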

## 5. Scatter Plot
Scatter plots show relationships between two continuous variables.

In [None]:
# Scatter plots showing relationships with SalePrice
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Scatter Plots: Features vs Sale Price', fontsize=16, fontweight='bold')

scatter_features = ['GrLivArea', 'TotalBsmtSF', 'GarageArea', 'YearBuilt', 'OverallQual', '1stFlrSF']

for idx, feature in enumerate(scatter_features):
    row = idx // 3
    col = idx % 3
    axes[row, col].scatter(train_df[feature], train_df['SalePrice'], 
                           alpha=0.5, s=20, color='navy', edgecolors='white', linewidth=0.5)
    axes[row, col].set_xlabel(feature, fontsize=11)
    axes[row, col].set_ylabel('Sale Price ($)', fontsize=11)
    axes[row, col].set_title(f'{feature} vs SalePrice', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3)
    
    # Add trend line
    z = np.polyfit(train_df[feature].dropna(), train_df['SalePrice'][train_df[feature].notna()], 1)
    p = np.poly1d(z)
    axes[row, col].plot(train_df[feature], p(train_df[feature]), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Scatter Plot:")
print("=" * 80)
print("• GrLivArea: Strong positive correlation - larger living area = higher price")
print("• TotalBsmtSF: Positive correlation, basement size impacts price")
print("• GarageArea: Moderate positive correlation with price")
print("• YearBuilt: Newer homes tend to sell for higher prices")
print("• OverallQual: Strong positive relationship (ordinal 1-10 rating) - quality is a key predictor")
print("• Red dashed line shows trend - helps identify linear relationships")
print("• Scatter density indicates most common value combinations")
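
The trend line and the strength of a linear relationship can be quantified with a least-squares fit and the Pearson coefficient. The toy data below lies exactly on a line so the result is easy to verify; real features will give a noisier fit.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 least-squares fit; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

The slope depends on the units of both variables, while `r` does not, which is why correlations are easier to compare across features.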

## 6. Correlogram (Correlation Matrix)
Shows pairwise Pearson correlation coefficients among the numeric variables most strongly correlated with SalePrice.

In [None]:
# Select top correlated features with SalePrice
correlation_matrix = train_df[numeric_features].corr()
top_features = correlation_matrix['SalePrice'].abs().sort_values(ascending=False).head(15).index

# Create correlogram
plt.figure(figsize=(14, 12))
corr_data = train_df[top_features].corr()

# Create mask for upper triangle
mask = np.triu(np.ones_like(corr_data, dtype=bool))

sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, mask=mask)
plt.title('Correlogram: Top 15 Features Correlated with SalePrice', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Correlogram:")
print("=" * 80)
print("• Values range from -1 (perfect negative) to +1 (perfect positive correlation)")
print("• OverallQual (0.79): Strongest linear correlation with house price among numeric features")
print("• GrLivArea (0.71): Living area strongly correlates with price")
print("• GarageCars/GarageArea: High correlation (0.88) - measuring similar thing")
print("• TotalBsmtSF & 1stFlrSF: Strong correlation (0.82) - often similar size")
print("• Red colors: Positive correlation | Blue colors: Negative correlation")
print("• Multicollinearity detected: Some features highly correlated with each other")
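
Highly correlated feature pairs can also be listed programmatically rather than read off the heatmap. This is a sketch on toy columns; the names `a`, `b`, `c` and the 0.8 cutoff are illustrative assumptions, not values from the notebook.

```python
import pandas as pd

# Toy columns: b closely tracks a, c does not
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
threshold = 0.8  # illustrative cutoff

corr = toy.corr().abs()
cols = list(corr.columns)

# Walk the upper triangle only, so each pair is reported once and the diagonal is skipped
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

Applied to the real correlation matrix, the same loop would surface pairs such as GarageCars/GarageArea without scanning the plot by eye.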

## 7. Heatmap
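Heatmaps of the percentage of missing values per feature and of each numeric feature's absolute correlation with SalePrice, used here as a simple proxy for feature importance.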

In [None]:
# Heatmap 1: Missing Values Pattern
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Missing values heatmap
missing_data = train_df.isnull().sum().sort_values(ascending=False).head(20)
missing_percent = (missing_data / len(train_df) * 100).to_frame(name='Percentage')

sns.heatmap(missing_percent.T, annot=True, fmt='.1f', cmap='Reds', 
            cbar_kws={'label': '% Missing'}, ax=axes[0], linewidths=0.5)
axes[0].set_title('Heatmap: Missing Values (Top 20 Features)', fontsize=13, fontweight='bold')
axes[0].set_xlabel('')
axes[0].set_ylabel('')

# Feature importance heatmap (absolute correlation with SalePrice).
# Drop SalePrice itself, whose self-correlation of 1.0 would otherwise top the list.
feature_importance = (correlation_matrix['SalePrice'].drop('SalePrice').abs()
                      .sort_values(ascending=False).head(20).to_frame())
sns.heatmap(feature_importance.T, annot=True, fmt='.2f', cmap='YlGnBu',
            cbar_kws={'label': '|Correlation|'}, ax=axes[1], linewidths=0.5)
axes[1].set_title('Heatmap: Feature Importance (Correlation with SalePrice)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Heatmap:")
print("=" * 80)
print("• Missing Values: PoolQC, MiscFeature, Alley have >90% missing data")
print("• High missing values may indicate rare features (e.g., pools, alleys)")
print("• Feature Importance: OverallQual, GrLivArea most important for prediction")
print("• Heatmaps use color intensity to represent magnitude of values")
print("• Darker colors in importance map = stronger relationship with price")
print("• Missing data patterns help decide imputation vs feature removal strategies")
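
The missing-value percentages behind the heatmap reduce to one expression. The sketch below uses a tiny made-up frame whose column names mirror the dataset; the 50% cutoff is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; only the column names mirror the dataset
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isnull() gives booleans, so the column mean is the fraction missing
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Columns above an illustrative 50% cutoff are candidates for special handling
mostly_missing = missing_pct[missing_pct > 50].index.tolist()

print(missing_pct)
print(mostly_missing)
```

Whether a flagged column is dropped or imputed is a separate decision that depends on what a missing value means for that feature.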

## 8. Bar Chart
Bar charts display categorical data with rectangular bars showing frequencies or values.

In [None]:
# Bar charts for categorical features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Bar Charts: Categorical Features Distribution', fontsize=16, fontweight='bold')

# 1. Neighborhood frequency
neighborhood_counts = train_df['Neighborhood'].value_counts().head(15)
axes[0, 0].bar(range(len(neighborhood_counts)), neighborhood_counts.values, 
               color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_xticks(range(len(neighborhood_counts)))
axes[0, 0].set_xticklabels(neighborhood_counts.index, rotation=45, ha='right')
axes[0, 0].set_title('Top 15 Neighborhoods by Count', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Count', fontsize=11)
axes[0, 0].grid(True, alpha=0.3, axis='y')

# 2. Building Type
bldg_type_counts = train_df['BldgType'].value_counts()
axes[0, 1].bar(bldg_type_counts.index, bldg_type_counts.values, 
               color='coral', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Building Type Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Count', fontsize=11)
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# 3. Sale Condition
sale_cond_counts = train_df['SaleCondition'].value_counts()
axes[1, 0].bar(sale_cond_counts.index, sale_cond_counts.values, 
               color='mediumseagreen', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Sale Condition Distribution', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Count', fontsize=11)
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 4. Overall Quality
quality_counts = train_df['OverallQual'].value_counts().sort_index()
colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(quality_counts)))
axes[1, 1].bar(quality_counts.index, quality_counts.values, 
               color=colors, edgecolor='black', alpha=0.8)
axes[1, 1].set_title('Overall Quality Rating Distribution', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Quality Rating (1-10)', fontsize=11)
axes[1, 1].set_ylabel('Count', fontsize=11)
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Bar Chart:")
print("=" * 80)
print("• Neighborhoods: NAmes, CollgCr, OldTown are most common areas")
print("• Building Type: Single-family (1Fam) dominates the dataset (>80%)")
print("• Sale Condition: Most sales are 'Normal' condition")
print("• Overall Quality: Distribution peaks at 5-6 (average quality homes)")
print("• Bar height represents frequency/count of each category")
print("• Helps identify class imbalance and dominant categories in dataset")
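
Class imbalance is easier to state as shares than as raw counts. A minimal sketch on a made-up `BldgType` column:

```python
import pandas as pd

# Made-up building types: 8 single-family homes and 2 others
bldg = pd.Series(["1Fam"] * 8 + ["TwnhsE", "Duplex"], name="BldgType")

counts = bldg.value_counts()                          # bar heights
shares = bldg.value_counts(normalize=True).mul(100)   # same data as percentages
dominant = counts.idxmax()                            # most frequent category

print(counts)
print(shares.round(1))
print(f"dominant category: {dominant}")
```

Categories with very small shares may need to be merged or handled carefully when encoding, since a model sees few examples of them.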

## 9. Grouped Bar Chart
Grouped bar charts compare multiple categories side-by-side across different groups.

In [None]:
# Grouped bar charts: average SalePrice by OverallQual and CentralAir; house counts by Neighborhood and HouseStyle
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
fig.suptitle('Grouped Bar Charts: Comparative Analysis', fontsize=16, fontweight='bold')

# 1. Average SalePrice by OverallQual and CentralAir
grouped_data1 = train_df.groupby(['OverallQual', 'CentralAir'])['SalePrice'].mean().unstack()
grouped_data1.plot(kind='bar', ax=axes[0], color=['lightcoral', 'lightblue'], 
                   edgecolor='black', alpha=0.8, width=0.8)
axes[0].set_title('Average Sale Price by Quality & Central Air', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Overall Quality', fontsize=11)
axes[0].set_ylabel('Average Sale Price ($)', fontsize=11)
axes[0].legend(title='Central Air', labels=['No', 'Yes'])
axes[0].tick_params(axis='x', rotation=0)
axes[0].grid(True, alpha=0.3, axis='y')

# 2. Count of Houses by Neighborhood and House Style (top neighborhoods)
top_neighborhoods = train_df['Neighborhood'].value_counts().head(6).index
top_styles = train_df['HouseStyle'].value_counts().head(4).index
filtered_data = train_df[train_df['Neighborhood'].isin(top_neighborhoods) & 
                          train_df['HouseStyle'].isin(top_styles)]
grouped_data2 = filtered_data.groupby(['Neighborhood', 'HouseStyle']).size().unstack(fill_value=0)
grouped_data2.plot(kind='bar', ax=axes[1], colormap='Set3', 
                   edgecolor='black', alpha=0.8, width=0.8)
axes[1].set_title('House Count by Neighborhood & Style (Top 6 Areas)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Neighborhood', fontsize=11)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].legend(title='House Style', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].tick_params(axis='x', rotation=45)
plt.setp(axes[1].get_xticklabels(), ha='right')  # tick_params() does not accept 'ha'
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Grouped Bar Chart:")
print("=" * 80)
print("• Central Air Impact: Houses with central air tend to have higher average prices at a given quality level")
print("• Compare the bar pairs to see how the gap changes with quality; groups with few houses give unreliable averages")
print("• Neighborhood-Style: Different neighborhoods favor different house styles")
print("• 1Story and 2Story are the most common styles overall in the dataset")
print("• Grouped bars enable comparison of subcategories within each main category")
print("• Reveals interaction effects between two categorical variables")
print("• Side-by-side positioning makes relative comparisons intuitive")
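
The table behind a grouped bar chart is a two-key `groupby` reshaped with `unstack`. The sketch below uses made-up prices; only the column names follow the notebook.

```python
import pandas as pd

# Made-up prices; column names follow the notebook
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 7, 7, 7],
    "CentralAir": ["N", "Y", "Y", "N", "Y", "Y"],
    "SalePrice": [100_000, 130_000, 150_000, 160_000, 200_000, 220_000],
})

# One row per quality rating, one column per CentralAir value
wide = toy.groupby(["OverallQual", "CentralAir"])["SalePrice"].mean().unstack()

# Price gap between homes with and without central air at each rating
gap = wide["Y"] - wide["N"]

print(wide)
print(gap)
```

Combinations that never occur show up as NaN in the wide table, which is a reminder that the corresponding bar is simply missing rather than zero.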

## 10. Stacked Bar Chart
Stacked bar charts show composition and proportions by stacking bars on top of each other.

In [None]:
# Stacked bar charts
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
fig.suptitle('Stacked Bar Charts: Composition Analysis', fontsize=16, fontweight='bold')

# 1. House Style composition by Neighborhood (top 8 neighborhoods)
top_neighborhoods = train_df['Neighborhood'].value_counts().head(8).index
top_styles = train_df['HouseStyle'].value_counts().head(5).index
filtered_data = train_df[train_df['Neighborhood'].isin(top_neighborhoods) & 
                          train_df['HouseStyle'].isin(top_styles)]
stacked_data1 = filtered_data.groupby(['Neighborhood', 'HouseStyle']).size().unstack(fill_value=0)

stacked_data1.plot(kind='bar', stacked=True, ax=axes[0], colormap='Spectral', 
                   edgecolor='black', alpha=0.8, width=0.8)
axes[0].set_title('House Style Composition by Neighborhood', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Neighborhood', fontsize=11)
axes[0].set_ylabel('Total Count', fontsize=11)
axes[0].legend(title='House Style', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].tick_params(axis='x', rotation=45)
plt.setp(axes[0].get_xticklabels(), ha='right')  # tick_params() does not accept 'ha'
axes[0].grid(True, alpha=0.3, axis='y')

# 2. Sale Condition composition by Overall Quality
stacked_data2 = train_df.groupby(['OverallQual', 'SaleCondition']).size().unstack(fill_value=0)
stacked_data2.plot(kind='bar', stacked=True, ax=axes[1], colormap='tab10', 
                   edgecolor='black', alpha=0.8, width=0.8)
axes[1].set_title('Sale Condition Composition by Quality Rating', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Overall Quality', fontsize=11)
axes[1].set_ylabel('Total Count', fontsize=11)
axes[1].legend(title='Sale Condition', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].tick_params(axis='x', rotation=0)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Percentage stacked bar chart for better proportion visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
fig.suptitle('Percentage Stacked Bar Charts: Proportion Analysis', fontsize=16, fontweight='bold')

# Normalize to percentages
stacked_pct1 = stacked_data1.div(stacked_data1.sum(axis=1), axis=0) * 100
stacked_pct1.plot(kind='bar', stacked=True, ax=axes[0], colormap='Spectral', 
                  edgecolor='black', alpha=0.8, width=0.8)
axes[0].set_title('House Style Proportion by Neighborhood (%)', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Neighborhood', fontsize=11)
axes[0].set_ylabel('Percentage (%)', fontsize=11)
axes[0].legend(title='House Style', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].tick_params(axis='x', rotation=45)
plt.setp(axes[0].get_xticklabels(), ha='right')  # tick_params() does not accept 'ha'
axes[0].set_ylim(0, 100)
axes[0].grid(True, alpha=0.3, axis='y')

stacked_pct2 = stacked_data2.div(stacked_data2.sum(axis=1), axis=0) * 100
stacked_pct2.plot(kind='bar', stacked=True, ax=axes[1], colormap='tab10', 
                  edgecolor='black', alpha=0.8, width=0.8)
axes[1].set_title('Sale Condition Proportion by Quality (%)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Overall Quality', fontsize=11)
axes[1].set_ylabel('Percentage (%)', fontsize=11)
axes[1].legend(title='Sale Condition', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].tick_params(axis='x', rotation=0)
axes[1].set_ylim(0, 100)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION - Stacked Bar Chart:")
print("=" * 80)
print("• House Style Composition: Shows total volume AND breakdown per neighborhood")
print("• Different neighborhoods have distinct architectural preferences")
print("• Sale Condition: 'Normal' sales dominate across all quality levels")
print("• Higher quality homes show more diverse sale conditions")
print("• Percentage stacked charts normalize for total count differences")
print("• Easier to compare proportions when totals vary significantly")
print("• Each segment represents contribution to whole - shows part-to-whole relationships")
print("• Useful for understanding composition changes across categories")
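
The row normalization used for the percentage charts can be checked on a small example. This sketch builds the count table with `pd.crosstab` on made-up rows and then divides each row by its total, the same `div(..., axis=0)` step used above.

```python
import pandas as pd

# Made-up rows; column names follow the notebook
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "HouseStyle": ["1Story", "1Story", "1Story", "2Story", "1Story", "2Story"],
})

# Count table: rows are neighborhoods, columns are house styles
counts = pd.crosstab(toy["Neighborhood"], toy["HouseStyle"])

# Divide each row by its own total so every row sums to 100%
pct = counts.div(counts.sum(axis=1), axis=0) * 100

print(counts)
print(pct)
```

After normalization the two neighborhoods are directly comparable even though one has twice as many houses as the other.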

## Summary & Key Insights

### Overall Dataset Characteristics:
- **Target Variable (SalePrice)**: Right-skewed distribution with median around $163k
- **Most Important Features**: OverallQual, GrLivArea, GarageCars, TotalBsmtSF
- **Data Quality**: Several features are mostly missing (PoolQC, MiscFeature, Alley, Fence)

### Key Relationships Discovered:
1. **Quality Matters Most**: OverallQual has strongest correlation (0.79) with price
2. **Size Correlates**: Living area, basement, and garage size are all positively correlated with price
3. **Location Premium**: Significant price variation across neighborhoods
4. **Modern Advantage**: Newer homes (YearBuilt) tend to sell for higher prices
5. **Multicollinearity Present**: GarageCars/GarageArea, TotalBsmtSF/1stFlrSF highly correlated

### Visualization Insights:
- **Histograms & Density**: Revealed skewness requiring potential log transformation
- **Boxplots**: Identified outliers in LotArea, GrLivArea needing investigation
- **Violin Plots**: Showed distribution differences across categories
- **Scatter Plots**: Suggested roughly linear relationships that suit regression modeling
- **Heatmaps**: Highlighted multicollinearity and feature importance
- **Bar Charts**: Revealed class imbalance in categorical features

### Recommendations for Modeling:
1. Consider log transformation for skewed variables (SalePrice, LotArea)
2. Handle or remove features with >80% missing values
3. Address multicollinearity (drop one of correlated feature pairs)
4. Investigate and potentially remove outliers
5. Feature engineering: Combine related features (total square footage)
6. Encode categorical variables appropriately for modeling
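
As a sketch of the first recommendation, the effect of a log transformation on skewness can be measured before committing to it. The sample below is synthetic (a lognormal draw standing in for a price column), and `np.log1p` is one reasonable choice of transform, not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "price" sample (not the Kaggle data)
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name="price")

# Sample skewness before and after the transform
skew_raw = price.skew()
skew_log = np.log1p(price).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If the model is trained on the transformed target, predictions need to be mapped back with `np.expm1` before they are compared with actual prices.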