# Bangalore House Price Prediction - EDA

This notebook performs exploratory data analysis (EDA) on the Bangalore house price dataset. We'll load `banglore.csv`, inspect its structure and data quality, and look for patterns that matter for modeling.

## Dataset Overview
The dataset contains 13,320 property listings in Bangalore described by nine columns, including location, size, area type, and price (in Lakhs).

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load and Explore the `banglore.csv` Dataset

In [2]:
# Load the Bangalore house price dataset
df = pd.read_csv('banglore.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst 5 rows:")
df.head()

Dataset loaded successfully!
Shape: (13320, 9)
Columns: ['area_type', 'availability', 'location', 'size', 'society', 'total_sqft', 'bath', 'balcony', 'price']

First 5 rows:


|   | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|-----------|--------------|----------|------|---------|------------|------|---------|-------|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |
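
The `size` column in the preview mixes labels such as `2 BHK` and `4 Bedroom`. A minimal sketch of pulling out the leading bedroom count follows. It assumes every usable label starts with an integer, and the helper name `parse_bedrooms` is hypothetical rather than part of this notebook.

```python
import math


def parse_bedrooms(size):
    """Return the leading integer of a label like '2 BHK', or None if unavailable."""
    # Missing values arrive from pandas as float NaN, so treat them as None.
    if size is None or (isinstance(size, float) and math.isnan(size)):
        return None
    first_token = str(size).strip().split(" ")[0]
    # Anything that does not start with a plain integer is left unparsed.
    return int(first_token) if first_token.isdigit() else None


# Labels taken from the preview above, plus one missing value.
sample_sizes = ["2 BHK", "4 Bedroom", "3 BHK", float("nan")]
bedrooms = [parse_bedrooms(s) for s in sample_sizes]
print(bedrooms)  # [2, 4, 3, None]
```

On the real frame this could be applied with `df['size'].map(parse_bedrooms)`.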


In [3]:
# Basic information about the dataset
print("=== Dataset Info ===")
print(df.info())
print("\n=== Data Types ===")
print(df.dtypes)
print("\n=== Statistical Summary ===")
df.describe(include='all')

=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB
None

=== Data Types ===
area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

=== Statistical Summary ===


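|        | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|--------|-----------|--------------|----------|------|---------|------------|------|---------|-------|
| count  | 13320 | 13320 | 13319 | 13304 | 7818 | 13320 | 13247.0 | 12711.0 | 13320.0 |
| unique | 4 | 81 | 1305 | 31 | 2688 | 2117 | NaN | NaN | NaN |
| top    | Super built-up Area | Ready To Move | Whitefield | 2 BHK | GrrvaGr | 1200 | NaN | NaN | NaN |
| freq   | 8790 | 10581 | 540 | 5199 | 80 | 843 | NaN | NaN | NaN |
| mean   | NaN | NaN | NaN | NaN | NaN | NaN | 2.69261 | 1.584376 | 112.565627 |
| std    | NaN | NaN | NaN | NaN | NaN | NaN | 1.341458 | 0.817263 | 148.971674 |
| min    | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 8.0 |
| 25%    | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 1.0 | 50.0 |
| 50%    | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 2.0 | 72.0 |
| 75%    | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 2.0 | 120.0 |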
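
`total_sqft` is stored as `object` even though it holds areas, which suggests that some entries are not plain numbers. A hedged sketch of a converter follows. It assumes the non-numeric entries are ranges written like `1133 - 1384` (an assumption about this dataset, not something the output above confirms) and takes their midpoint. The name `sqft_to_float` is hypothetical.

```python
def sqft_to_float(value):
    """Convert a total_sqft entry to float; return None when it cannot be parsed."""
    text = str(value).strip()
    try:
        return float(text)
    except ValueError:
        pass
    # Assumed range format "low - high": use the midpoint of the two bounds.
    parts = text.split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    return None


# Illustrative inputs; only "1056" comes from the preview, the rest are made up.
samples = ["1056", "1133 - 1384", "unknown"]
converted = [sqft_to_float(s) for s in samples]
print(converted)  # [1056.0, 1258.5, None]
```

Entries that come back as `None` are worth listing before deciding how to handle them, for example with `df['total_sqft'].map(sqft_to_float).isnull().sum()`.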


## 3. Data Quality Assessment

In [4]:
# Check for missing values
print("=== Missing Values ===")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

# Check for duplicates
print("\n=== Duplicates ===")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Unique values in categorical columns
print("\n=== Unique Values in Categorical Columns ===")
categorical_cols = ['area_type', 'availability', 'location', 'size', 'society']
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
    if df[col].nunique() < 20:
        print(f"  Values: {df[col].unique()}")
    print()

=== Missing Values ===
          Missing Count  Missing Percentage
location              1            0.007508
size                 16            0.120120
society            5502           41.306306
bath                 73            0.548048
balcony             609            4.572072

=== Duplicates ===
Duplicate rows: 529

=== Unique Values in Categorical Columns ===
area_type: 4 unique values
  Values: ['Super built-up  Area' 'Plot  Area' 'Built-up  Area' 'Carpet  Area']

availability: 81 unique values

location: 1305 unique values

size: 31 unique values

society: 2688 unique values
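
The output above reports 529 duplicate rows, a `society` column that is about 41% empty, and a handful of gaps in other columns. One possible cleaning pass is sketched below on a tiny made-up frame. The choices (dropping `society`, dropping rows without `location` or `size`, median-filling `bath`) are assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of problems; the values are invented.
toy = pd.DataFrame({
    "location": ["Whitefield", "Whitefield", None, "Kothanur", "Uttarahalli"],
    "size": ["2 BHK", "2 BHK", "3 BHK", "2 BHK", "3 BHK"],
    "society": [None, None, "Abc", None, None],
    "bath": [2.0, 2.0, 3.0, np.nan, 4.0],
    "price": [50.0, 50.0, 90.0, 51.0, 62.0],
})

cleaned = (
    toy.drop_duplicates()                    # remove exact duplicate rows
       .drop(columns=["society"])            # mostly empty column
       .dropna(subset=["location", "size"])  # rows missing key fields
       .copy()
)
# Fill the remaining numeric gap with the column median.
cleaned["bath"] = cleaned["bath"].fillna(cleaned["bath"].median())

print(cleaned)
```

Whether median imputation is appropriate depends on how the model will use `bath` and `balcony`; dropping the affected rows is an equally simple alternative given how few there are.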



## 4. Data Visualization and Analysis

In [None]:
# Price distribution
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(df['price'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price (Lakhs)')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.boxplot(df['price'])
plt.title('Price Boxplot')
plt.ylabel('Price (Lakhs)')

plt.subplot(1, 3, 3)
df['price'].plot(kind='density', color='orange')
plt.title('Price Density Plot')
plt.xlabel('Price (Lakhs)')

plt.tight_layout()
plt.show()

print("Price Statistics:")
print(f"Min: {df['price'].min():.2f} Lakhs")
print(f"Max: {df['price'].max():.2f} Lakhs")
print(f"Mean: {df['price'].mean():.2f} Lakhs")
print(f"Median: {df['price'].median():.2f} Lakhs")
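
The statistical summary earlier lists price quartiles of 50 and 120 Lakhs, with a mean (112.57) well above the median (72), so the boxplot is likely to show a long upper tail. Below is a minimal sketch of the 1.5 × IQR rule, which matplotlib's `boxplot` uses by default to place its whiskers, applied to those two quartiles. The helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the describe() output earlier in the notebook.
lower, upper = iqr_fences(50.0, 120.0)
print(lower, upper)  # -55.0 225.0
```

With these inputs, prices above 225 Lakhs would be drawn as individual outlier points, and `(df['price'] > upper).sum()` would count them. The lower fence is negative, so no listing can fall below it.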

In [None]:
# Location analysis - Top 20 locations by count and average price
plt.figure(figsize=(15, 10))

# Top 20 locations by count
plt.subplot(2, 2, 1)
top_locations = df['location'].value_counts().head(20)
top_locations.plot(kind='bar')
plt.title('Top 20 Locations by Property Count')
plt.xticks(rotation=45)
plt.ylabel('Count')

# Average price by top 20 locations
plt.subplot(2, 2, 2)
top_loc_names = top_locations.index
avg_price_by_loc = (
    df[df['location'].isin(top_loc_names)]
    .groupby('location')['price']
    .mean()
    .sort_values(ascending=False)
)
avg_price_by_loc.plot(kind='bar', color='lightcoral')
plt.title('Average Price by Top Locations')
plt.xticks(rotation=45)
plt.ylabel('Average Price (Lakhs)')

# Size distribution
plt.subplot(2, 2, 3)
df['size'].value_counts().plot(kind='bar', color='lightgreen')
plt.title('Property Size Distribution')
plt.xticks(rotation=45)
plt.ylabel('Count')

# Area type distribution
plt.subplot(2, 2, 4)
df['area_type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Area Type Distribution')

plt.tight_layout()
plt.show()
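
With 1,305 distinct locations, the plots above only cover the 20 most frequent ones. If the long tail needs to be kept for later analysis, one common option is to collapse infrequent locations into a single bucket. A minimal sketch on invented data follows; the function name `bucket_rare`, the threshold `min_count`, and the label `"other"` are arbitrary assumptions.

```python
from collections import Counter


def bucket_rare(values, min_count=2, other_label="other"):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Invented sample; the names are borrowed from the preview for readability.
locations = ["Whitefield", "Whitefield", "Uttarahalli",
             "Kothanur", "Whitefield", "Uttarahalli"]
bucketed = bucket_rare(locations)
print(bucketed)
# ['Whitefield', 'Whitefield', 'Uttarahalli', 'other', 'Whitefield', 'Uttarahalli']
```

A suitable threshold for the real data would have to be chosen by looking at `df['location'].value_counts()`.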