<div style="background-color:#1e1e1e; color:#f1f1f1; padding:15px; border-left:5px solid #4caf50; font-family:Arial, sans-serif;">
  <h2 style="color:#4caf50;">🏠 Bengaluru House Price Predictor</h2>
  <p>
    This notebook presents a machine learning model to predict <strong>house prices in Bengaluru</strong> using the 
    <a href="https://www.kaggle.com/datasets/lovishbansal123/dataset-for-bengaluru-house-price-prediction" target="_blank" style="color:#64b5f6;">Bengaluru House Price dataset</a>.
  </p>
  <ul>
    <li>📄 Dataset includes features such as <code>location</code>, <code>total_sqft</code>, <code>bath</code>, <code>balcony</code>, <code>size</code>, and more.</li>
    <li>🛠️ Data preprocessing is performed to clean and engineer meaningful features.</li>
    <li>🌳 An <strong>XGBoost regressor</strong> (<code>XGBRegressor</code>) is trained to predict the house prices.</li>
  </ul>
  <p>
    Follow along to see how we preprocess the data, train the model, and evaluate its performance!
  </p>
</div>


In [139]:
import pandas as pd
import numpy as np
data = pd.read_csv('bengaluru_house_prices.csv')
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [140]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [141]:
data.isna().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

<style>
.markdown-body {
  background: #121212;
  color: #e0e0e0;
  font-family: Arial, sans-serif;
}
.important { color: #8ab4f8; font-weight: bold; }
</style>

# 📄 Dropping `society` Column

The **`society`** column has **5502 missing values** out of **13320 rows (~41%)**, making it unreliable.  
We drop it because a column with this many gaps is unlikely to add much value to the analysis.
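
As a rough illustration of how the missing-value share can be checked before deciding to drop a column, here is a minimal sketch. The helper name `missing_share` and the toy frame are hypothetical and not part of the notebook; only `isna`, `mean`, and `sort_values` from pandas are used.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame (made-up values) with gaps in two columns.
toy = pd.DataFrame({
    "society": ["Coomee", None, None, "Soiewre"],
    "bath": [2.0, 5.0, np.nan, 3.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})
shares = missing_share(toy)
print(shares)
```

Applied to the real `data` frame, the same call would rank `society` first, which is the basis for dropping it.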

---

### 🔷 Code & Steps

✅ Check missing: `data['society'].isna().sum()` → `5502`  
✅ Drop the column: `data.drop(columns=['society'], inplace=True)`  
✅ Verify: `data.head()`


In [143]:
data.drop(columns=['society'], inplace=True)
data.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200,2.0,1.0,51.0



# 🚀 Dropping `balcony` Column

The **`balcony`** column has **609 missing values** (~4.6% of rows), and its impact is likely minor because total square footage (`total_sqft`) already reflects available space.  
We remove it for a cleaner dataset.

---

### 🔷 Code & Explanation

✅ 609 `NaN` values in `balcony`.  
✅ `total_sqft` already accounts for space, so `balcony` adds little value.  
✅ Dropped with: `data.drop(columns=['balcony'], inplace=True)`  
✅ Checked result with: `data.head()`


In [145]:
# Similarly, drop balcony: it has missing values and matters little once total_sqft is taken into account
data.drop(columns=['balcony'], inplace=True)
data.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0


In [146]:
data.availability.value_counts()

availability
Ready To Move    10581
18-Dec             307
18-May             295
18-Apr             271
18-Aug             200
                 ...  
15-Aug               1
17-Jan               1
16-Nov               1
16-Jan               1
14-Jul               1
Name: count, Length: 81, dtype: int64

In [147]:
# Ready-to-move houses are generally worth more than houses still under construction, so convert this column to binary:
# 1 means ready to move, 0 means not yet ready
data['availability'] = (data['availability'] == "Ready To Move").astype(int)

In [148]:
data.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,price
0,Super built-up Area,0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Plot Area,1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Built-up Area,1,Uttarahalli,3 BHK,1440,2.0,62.0
3,Super built-up Area,1,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Super built-up Area,1,Kothanur,2 BHK,1200,2.0,51.0


In [149]:
data.area_type.value_counts()

area_type
Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: count, dtype: int64

# 🚫 Dropping `area_type` Feature

The **`area_type`** column describes the type of area (e.g., Super built-up, Carpet, etc.),  
but it does not clearly indicate the price and adds little predictive value.  
Therefore, we drop this column for a cleaner dataset.


In [151]:
data.drop(columns=['area_type'], inplace=True)
data.head()

Unnamed: 0,availability,location,size,total_sqft,bath,price
0,0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,1,Uttarahalli,3 BHK,1440,2.0,62.0
3,1,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,1,Kothanur,2 BHK,1200,2.0,51.0


In [152]:
data.rename(columns={'price':'price(lakhs)'})        # preview only: prices are in lakhs, but the result is not assigned, so data keeps the column name 'price'

Unnamed: 0,availability,location,size,total_sqft,bath,price(lakhs)
0,0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.00
2,1,Uttarahalli,3 BHK,1440,2.0,62.00
3,1,Lingadheeranahalli,3 BHK,1521,3.0,95.00
4,1,Kothanur,2 BHK,1200,2.0,51.00
...,...,...,...,...,...,...
13315,1,Whitefield,5 Bedroom,3453,4.0,231.00
13316,1,Richards Town,4 BHK,3600,5.0,400.00
13317,1,Raja Rajeshwari Nagar,2 BHK,1141,2.0,60.00
13318,0,Padmanabhanagar,4 BHK,4689,4.0,488.00


In [153]:
data.isna().sum()

availability     0
location         1
size            16
total_sqft       0
bath            73
price            0
dtype: int64

In [154]:
data.dropna(subset=['location'], inplace=True)               # drop the one row where location is missing (NaN)

In [155]:
data.isna().sum()

availability     0
location         0
size            16
total_sqft       0
bath            73
price            0
dtype: int64

<style>
.markdown-body {
  background: #121212;
  color: #e0e0e0;
  font-family: Arial, sans-serif;
  line-height: 1.5;
}
.important { color: #8ab4f8; font-weight: bold; }
</style>

# 🛁 Handling Missing `bath` Values

We observed that the **`bath`** (number of bathrooms) column has **73 missing values**.  
To handle this, we calculate the **median number of bathrooms** and fill all missing values with this median, ensuring the data remains consistent and reasonable.

---

### 🔷 Approach
✅ Find median of `bath`.  
✅ Replace all `NaN` values in `bath` with it: `data['bath'].fillna(median_bath)`
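
The two steps above can be sketched on a small stand-alone Series. The values are made up for illustration; `median_bath` is the same name used in the approach list.

```python
import numpy as np
import pandas as pd

# Toy bathroom counts with two gaps (illustrative values only).
bath = pd.Series([2.0, np.nan, 3.0, 2.0, np.nan, 5.0])

median_bath = bath.median()        # NaN entries are skipped by default
filled = bath.fillna(median_bath)  # every gap receives the same median
print(median_bath)
print(filled.tolist())
```

The median is less sensitive to extreme listings (such as the 40-bathroom row visible in `describe()` later on) than the mean would be, which is one reason to prefer it here.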


In [157]:
data['bath'] = data['bath'].fillna(data['bath'].median())    # fill missing values with the column median

In [158]:
data.isna().sum()

availability     0
location         0
size            16
total_sqft       0
bath             0
price            0
dtype: int64


# 🛏️ Converting Bedroom Strings to Numbers

We will convert the **string values** in the `size` column (e.g., `2 BHK`, `4 Bedroom`) into **numeric bedroom counts**,  
and fill any **null values** with the median to keep the data clean and usable.
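
A vectorized way to express the same idea is sketched below. It is an alternative to the `apply`-based cells, not the notebook's own code, and the sample strings (including `1 RK`) are illustrative assumptions.

```python
import pandas as pd

# Toy 'size' values; None stands in for a missing entry.
sizes = pd.Series(["2 BHK", "4 Bedroom", None, "1 RK"])

# Take the first whitespace-separated token and parse it as a number;
# missing entries stay NaN instead of raising.
bhk = pd.to_numeric(sizes.str.split().str[0], errors="coerce")

# Fill the remaining gaps with the median bedroom count.
bhk = bhk.fillna(bhk.median())
print(bhk.tolist())
```

Because the column temporarily holds `NaN`, the result is a float column, which matches the `2.0`, `3.0` values shown for `bhk` later in the notebook.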


In [160]:
data['size'] = data['size'].apply(lambda x: None if pd.isna(x) else int(x.split()[0]))        

In [161]:
# fill missing size values with the column median
data['size'] = data['size'].fillna(data['size'].median())

In [162]:
v = list(data['total_sqft'].unique())
for i in v:
    try:
        float(i)
    except ValueError:      # print every value that is not a plain number
        print(i)

2100 - 2850
3010 - 3410
2957 - 3450
3067 - 8156
1042 - 1105
1145 - 1340
1015 - 1540
1520 - 1740
34.46Sq. Meter
1195 - 1440
1200 - 2400
4125Perch
1120 - 1145
4400 - 6640
3090 - 5002
4400 - 6800
1160 - 1195
1000Sq. Meter
4000 - 5249
1115 - 1130
1100Sq. Yards
520 - 645
1000 - 1285
3606 - 5091
650 - 665
633 - 666
5.31Acres
30Acres
1445 - 1455
884 - 1116
850 - 1093
1440 - 1884
716Sq. Meter
547.34 - 827.31
580 - 650
3425 - 3435
1804 - 2273
3630 - 3800
660 - 670
1500Sq. Meter
620 - 933
142.61Sq. Meter
2695 - 2940
2000 - 5634
1574Sq. Yards
3450 - 3472
1250 - 1305
670 - 980
1005.03 - 1252.49
1004 - 1204
361.33Sq. Yards
645 - 936
2710 - 3360
2249.81 - 4112.19
3436 - 3643
2830 - 2882
596 - 804
1255 - 1863
1300 - 1405
1500 - 2400
117Sq. Yards
934 - 1437
980 - 1030
1564 - 1850
1446 - 1506
1070 - 1315
3040Sq. Meter
500Sq. Yards
2806 - 3019
613 - 648
1430 - 1630
704 - 730
1482 - 1846
2805 - 3565
3293 - 5314
1210 - 1477
3369 - 3464
1125 - 1500
167Sq. Meter
1076 - 1199
381 - 535
2215 - 2475
524 - 894
5

In [163]:
# many values are given in square meters or square yards, so we must handle those too, otherwise float() raises an error
'142.84Sq. Meter'.split('Sq')

['142.84', '. Meter']


# 📐 Handling `total_sqft` Variants

We process the `total_sqft` column to handle different formats:
- If value is in **Square Meters**, convert to square feet (`× 10.764`).
- If value is in **Square Yards**, convert to square feet (`× 9`).
- If value is a **range** (e.g., `1000 - 1200`), take the average.
- If none of these, return the original value.

Finally, we apply this function to `total_sqft` and store it as a NumPy array for further use:
```python
b = np.array(data['total_sqft'].apply(convertrange))
```


In [165]:
def convertrange(x):
    if 'Meter' in x:
        temp = x.split('Sq')
        return float(temp[0])*10.764
    elif 'Yard' in x:
        temp = x.split('Sq')
        return float(temp[0])*9
    else:
        temp = x.split(' - ')
        if len(temp)==2:
            mea = (float(temp[0]) + float(temp[1]))/2
            return mea
        else:
            return x

b = np.array(data['total_sqft'].apply(convertrange))

In [166]:
for i in b:
    try:
        float(i)
    except ValueError:      # print every value that still cannot be parsed as a number
        print(i)

4125Perch
5.31Acres
30Acres
3Cents
2.09Acres
24Guntha
1500Cents
2Acres
15Acres
1.26Acres
1Grounds
1.25Acres
38Guntha
6Acres


In [167]:
'4125Perch'.split('P')

['4125', 'erch']


# 📐 Enhanced `total_sqft` Conversion

We update the function to handle additional area units in `total_sqft`:
- Convert **Meter → Sqft** (`× 10.764`)
- Convert **Yard → Sqft** (`× 9`)
- Convert **Perch → Sqft** (`× 272.25`)
- Convert **Acre → Sqft** (`× 43,560`)
- Convert **Guntha → Sqft** (`× 1,089`)
- Convert **Ground → Sqft** (`× 2,400`)
- Convert **Cent → Sqft** (`× 435.6`)
- If value is a range (`1000 - 1200`), take the average.
- Else, convert directly to float.
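
The rules listed above can also be written in a table-driven form, which keeps every conversion factor in one place. This is a sketch under assumptions: the names `SQFT_PER_UNIT` and `to_sqft` are hypothetical, and the function is an alternative to, not a copy of, the notebook's converter.

```python
import re

# Square feet per unit, using the same factors as the list above.
SQFT_PER_UNIT = {
    "Meter": 10.764,
    "Yard": 9.0,
    "Perch": 272.25,
    "Acre": 43560.0,
    "Guntha": 1089.0,
    "Ground": 2400.0,
    "Cent": 435.6,
}

# Leading number such as "34.46" in "34.46Sq. Meter".
_LEADING_NUMBER = re.compile(r"([0-9]*\.?[0-9]+)")


def to_sqft(value: str) -> float:
    """Convert one raw total_sqft string to square feet."""
    text = value.strip()
    if " - " in text:                      # range: take the midpoint
        low, high = text.split(" - ")
        return (float(low) + float(high)) / 2
    for unit, factor in SQFT_PER_UNIT.items():
        if unit in text:                   # unit suffix: scale the leading number
            return float(_LEADING_NUMBER.match(text).group(1)) * factor
    return float(text)                     # plain number


for raw in ["1056", "1000 - 1200", "100Sq. Yards", "2Acres", "1Grounds"]:
    print(raw, "->", to_sqft(raw))
```

Adding a further unit then only needs a new dictionary entry rather than another `elif` branch.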

Finally, we apply the updated function to `total_sqft`:
```python
data['total_sqft'] = data['total_sqft'].apply(convertfinalrange)
```


In [169]:
# we still have to handle Guntha, Acres, Perch, Grounds and Cents values,
# so we extend convertrange(x) and call the new function convertfinalrange(x):
def convertfinalrange(x):
    if 'Meter' in x:
        temp = x.split('Sq')
        return float(temp[0])*10.764
    elif 'Yard' in x:
        temp = x.split('Sq')
        return float(temp[0])*9
    elif 'Perch' in x:
        temp = x.split('P')
        return float(temp[0])*272.25
    elif 'Acre' in x:
        temp = x.split('A')
        return float(temp[0])*43560
    elif 'Guntha' in x:
        temp = x.split('G')
        return float(temp[0])*1089
    elif 'Ground' in x:
        temp = x.split('G')
        return float(temp[0])*2400
    elif 'Cent' in x:
        temp = x.split('C')
        return float(temp[0])*435.6
    else:
        temp = x.split(' - ')
        if len(temp)==2:
            mea = (float(temp[0]) + float(temp[1]))/2
            return mea
        else:
            return float(x)

data['total_sqft'] = data['total_sqft'].apply(convertfinalrange)

In [170]:
data.rename(columns={'size':'bhk'},inplace=True)  #changing name of the column

In [171]:
data['price_per_sqfeet'] = (data['price']*100000)/data['total_sqft']    #creating new feature price_per_sqfeet

In [172]:
data

Unnamed: 0,availability,location,bhk,total_sqft,bath,price,price_per_sqfeet
0,0,Electronic City Phase II,2.0,1056.0,2.0,39.07,3699.810606
1,1,Chikka Tirupathi,4.0,2600.0,5.0,120.00,4615.384615
2,1,Uttarahalli,3.0,1440.0,2.0,62.00,4305.555556
3,1,Lingadheeranahalli,3.0,1521.0,3.0,95.00,6245.890861
4,1,Kothanur,2.0,1200.0,2.0,51.00,4250.000000
...,...,...,...,...,...,...,...
13315,1,Whitefield,5.0,3453.0,4.0,231.00,6689.834926
13316,1,Richards Town,4.0,3600.0,5.0,400.00,11111.111111
13317,1,Raja Rajeshwari Nagar,2.0,1141.0,2.0,60.00,5258.545136
13318,0,Padmanabhanagar,4.0,4689.0,4.0,488.00,10407.336319



# 📍 Handling Rare Locations

We clean and group the `location` feature:
- Remove extra spaces from location names.
- Count occurrences of each location.
- Identify locations with ≤10 occurrences.
- Replace those rare locations with **`other`** to reduce noise and sparsity in the data.

Finally, check the updated `location` distribution.
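
The grouping steps above can be sketched as a small reusable function. The name `group_rare` and its parameters are hypothetical; the default `min_count=11` mirrors the rule that locations with 10 or fewer rows become `other`.

```python
import pandas as pd


def group_rare(values: pd.Series, min_count: int = 11, label: str = "other") -> pd.Series:
    """Strip whitespace, then replace values seen fewer than min_count times."""
    cleaned = values.str.strip()
    # Map each row to how often its (cleaned) value occurs.
    counts = cleaned.map(cleaned.value_counts())
    # Keep frequent values; everything else becomes the catch-all label.
    return cleaned.where(counts >= min_count, label)


# Toy data with stray spaces; a threshold of 3 keeps only "Whitefield".
toy = pd.Series(["Whitefield ", "Whitefield", " Whitefield",
                 "Kothanur", "Uttarahalli ", "Uttarahalli"])
grouped = group_rare(toy, min_count=3)
print(grouped.value_counts())
```

Stripping before counting matters: without it, `"Whitefield "` and `"Whitefield"` would be counted as two different locations.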


In [174]:
data['location']=data['location'].apply(lambda x:x.strip())
locations_count = data['location'].value_counts()
locations_count_less_than_10 = locations_count[locations_count<=10]
data['location']=data['location'].apply(lambda x : 'other' if x in locations_count_less_than_10 else x)
data['location'].value_counts()

location
other                 2885
Whitefield             541
Sarjapur  Road         399
Electronic City        304
Kanakpura Road         273
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: count, Length: 242, dtype: int64

In [175]:
data.describe()

Unnamed: 0,availability,bhk,total_sqft,bath,price,price_per_sqfeet
count,13319.0,13319.0,13319.0,13319.0,13319.0,13319.0
mean,0.794354,2.803964,1911.621,2.68879,112.567621,7902.289
std,0.404188,1.294261,17277.55,1.338801,148.977089,106253.2
min,0.0,1.0,1.0,1.0,8.0,2.257423
25%,1.0,2.0,1100.0,2.0,50.0,4262.295
50%,1.0,3.0,1277.0,2.0,72.0,5428.571
75%,1.0,3.0,1680.0,3.0,120.0,7312.469
max,1.0,43.0,1306800.0,40.0,3600.0,12000000.0


In [176]:
data['total_sqftbybhk'] = data['total_sqft']/data['bhk']

In [177]:
data.describe()

Unnamed: 0,availability,bhk,total_sqft,bath,price,price_per_sqfeet,total_sqftbybhk
count,13319.0,13319.0,13319.0,13319.0,13319.0,13319.0,13319.0
mean,0.794354,2.803964,1911.621,2.68879,112.567621,7902.289,701.851693
std,0.404188,1.294261,17277.55,1.338801,148.977089,106253.2,6559.412812
min,0.0,1.0,1.0,1.0,8.0,2.257423,0.25
25%,1.0,2.0,1100.0,2.0,50.0,4262.295,473.333333
50%,1.0,3.0,1277.0,2.0,72.0,5428.571,552.5
75%,1.0,3.0,1680.0,3.0,120.0,7312.469,625.0
max,1.0,43.0,1306800.0,40.0,3600.0,12000000.0,653400.0


In [178]:
# The min value of total_sqftbybhk is 0.25, i.e. a flat with only 0.25 sqft per bedroom, which is not
# physically possible, so we drop all rows where total_sqftbybhk is less than 300 to keep the data realistic
data = data[data['total_sqftbybhk']>=300]

In [179]:
data.describe()

Unnamed: 0,availability,bhk,total_sqft,bath,price,price_per_sqfeet,total_sqftbybhk
count,12571.0,12571.0,12571.0,12571.0,12571.0,12571.0,12571.0
mean,0.784663,2.653011,1967.633,2.560735,111.485946,6294.519343,732.178531
std,0.411072,0.981306,17782.27,1.08172,151.971761,4163.403235,6750.529416
min,0.0,1.0,300.0,1.0,8.44,2.257423,300.0
25%,1.0,2.0,1116.0,2.0,49.0,4203.005984,492.0
50%,1.0,3.0,1300.0,2.0,70.0,5291.005291,562.5
75%,1.0,3.0,1703.5,3.0,115.0,6916.666667,631.0
max,1.0,16.0,1306800.0,16.0,3600.0,176470.588235,653400.0



# 📊 Removing `price_per_sqfeet` Outliers

We clean outliers in `price_per_sqfeet` by:
- Grouping the data by **location**.
- For each location’s group:
  - Compute **mean (m)** and **standard deviation (std)** of `price_per_sqfeet`.
  - Keep only rows where `price_per_sqfeet` ∈ (m - std, m + std].
- Concatenate all cleaned groups into a new DataFrame.

This ensures that for each location, only reasonable `price_per_sqfeet` values (within one standard deviation) are kept, reducing the impact of outliers.
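
The same per-location filter can be sketched without an explicit loop by using `groupby(...).transform`. This is an illustrative alternative with a hypothetical function name; it assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import pandas as pd


def keep_within_one_std(df: pd.DataFrame,
                        group_col: str = "location",
                        value_col: str = "price_per_sqfeet") -> pd.DataFrame:
    """Keep rows whose value lies within one std of their group's mean."""
    grouped = df.groupby(group_col)[value_col]
    mean = grouped.transform("mean")
    std = grouped.transform(lambda s: s.std(ddof=0))  # population std, like np.std
    mask = (df[value_col] > mean - std) & (df[value_col] <= mean + std)
    return df[mask].reset_index(drop=True)


# Toy data: one extreme listing in A, two mild outliers in B.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "price_per_sqfeet": [100.0, 110.0, 120.0, 500.0, 200.0, 210.0, 190.0, 205.0],
})
cleaned = keep_within_one_std(toy)
print(cleaned)
```

Note that a one-standard-deviation band is fairly strict: in location B it also removes 190 and 210, which are not extreme. That strictness is consistent with the drop from 12571 to 10329 rows reported below.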


In [181]:
# total_sqft still contains extreme values such as 1.306800e+06, so we remove outliers based on price_per_sqfeet
def removesqftoutliers(data):
    data_output = pd.DataFrame()
    for key,sub_df in data.groupby('location'):
        m = np.mean(sub_df.price_per_sqfeet)
        std = np.std(sub_df.price_per_sqfeet)
        gen_df = sub_df[(sub_df['price_per_sqfeet']> (m-std)) & (sub_df['price_per_sqfeet']<= (m+std))]
        data_output = pd.concat([data_output,gen_df], ignore_index=True)
    return data_output
data = removesqftoutliers(data)

In [182]:
data.describe()

Unnamed: 0,availability,bhk,total_sqft,bath,price,price_per_sqfeet,total_sqftbybhk
count,10329.0,10329.0,10329.0,10329.0,10329.0,10329.0,10329.0
mean,0.787104,2.575951,1577.351669,2.47265,91.42258,5659.116082,606.398904
std,0.409374,0.898977,6483.85009,0.980667,86.481773,2266.831229,2150.181414
min,0.0,1.0,300.0,1.0,10.0,33.210897,300.0
25%,1.0,2.0,1110.0,2.0,49.0,4242.424242,496.666667
50%,1.0,2.0,1286.0,2.0,67.0,5175.983437,562.5
75%,1.0,3.0,1650.0,3.0,100.0,6434.782609,625.5
max,1.0,16.0,653400.0,16.0,2200.0,24509.803922,217800.0



# 🏠 Exploring BHK Outlier Stats

We group the data by **location** and then by **BHK (number of bedrooms)**  
to compute summary statistics (`mean`, `std`, `count`) of `price_per_sqfeet`  
for each BHK within each location.  

This step is for inspection only — we print these stats to better understand the distribution  
before applying outlier removal in the next step.


In [184]:
def removebhkoutliers1(df):    # inspection only: prints per-location BHK stats, removes nothing
    for location,locationdfs in df.groupby('location'):
        bhk_stats={}
        for bhk,bhkdf in locationdfs.groupby('bhk'):
            bhk_stats[bhk] = {'mean' : np.mean(bhkdf.price_per_sqfeet), 'std' : np.std(bhkdf.price_per_sqfeet), 'count' : bhkdf.shape[0]}
        print(location,bhk_stats)
removebhkoutliers1(data)

1st Block Jayanagar {2.0: {'mean': 11983.805668016194, 'std': 0.0, 'count': 1}, 3.0: {'mean': 11756.16905248807, 'std': 701.6243657657865, 'count': 3}, 4.0: {'mean': 15018.711280365416, 'std': 1.2278182423353805, 'count': 3}}
1st Phase JP Nagar {1.0: {'mean': 6726.570336093429, 'std': 774.1893837124771, 'count': 2}, 2.0: {'mean': 7931.806799837383, 'std': 1534.1422783514054, 'count': 8}, 3.0: {'mean': 9151.192151725822, 'std': 1054.731726021645, 'count': 7}, 4.0: {'mean': 7537.92218148637, 'std': 1607.0591069513537, 'count': 3}, 5.0: {'mean': 5666.666666666667, 'std': 0.0, 'count': 1}}
2nd Phase Judicial Layout {2.0: {'mean': 3851.8518518518517, 'std': 497.593660834978, 'count': 3}, 3.0: {'mean': 3620.93991671624, 'std': 241.87983343248052, 'count': 5}}
2nd Stage Nagarbhavi {4.0: {'mean': 15891.203703703704, 'std': 1668.9846920398563, 'count': 4}, 6.0: {'mean': 16891.666666666668, 'std': 1858.333333333333, 'count': 2}}
5th Block Hbr Layout {2.0: {'mean': 4755.410708222867, 'std': 374.0


# 🏠 Removing BHK Outliers

We remove outliers in **price_per_sqfeet** based on BHK stats:
- For each **location**, compute the mean, standard deviation, and count of `price_per_sqfeet` per BHK.
- For each BHK group, look up the stats of the `(BHK - 1)` group; if that group has ≥ 5 data points, treat its mean as a baseline.
- Mark rows in the BHK group whose `price_per_sqfeet` is lower than that baseline as outliers.
- Drop these outlier rows to ensure consistent pricing per square foot with respect to BHK.

This ensures that, in the same location, a property with more bedrooms doesn’t illogically have a lower price per sq ft than a smaller one.
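
To make the rule concrete, here is a small merge-based sketch run on toy data. The function name `drop_bhk_outliers` and the sample values are hypothetical; the logic follows the steps listed above (baseline from the `(BHK - 1)` group, used only when that group has at least 5 rows).

```python
import pandas as pd


def drop_bhk_outliers(df: pd.DataFrame, min_count: int = 5) -> pd.DataFrame:
    """Drop rows priced per sqft below the mean of the next-smaller BHK group."""
    stats = (
        df.groupby(["location", "bhk"])["price_per_sqfeet"]
        .agg(["mean", "count"])
        .reset_index()
    )
    # Only groups with enough rows give a trustworthy baseline.
    stats = stats[stats["count"] >= min_count].copy()
    # Shift bhk up by one so the stats of (bhk - 1) line up with bhk rows.
    stats["bhk"] = stats["bhk"] + 1
    stats = stats.rename(columns={"mean": "baseline"})[["location", "bhk", "baseline"]]
    merged = df.merge(stats, on=["location", "bhk"], how="left")
    # Rows without a baseline are kept; others must reach the baseline.
    keep = merged["baseline"].isna() | (merged["price_per_sqfeet"] >= merged["baseline"])
    return df[keep.to_numpy()]


toy = pd.DataFrame({
    "location": ["X"] * 8,
    "bhk": [1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0],
    "price_per_sqfeet": [6000.0, 5000.0, 5200.0, 4800.0, 5100.0, 4900.0, 4500.0, 5500.0],
})
cleaned = drop_bhk_outliers(toy)
print(cleaned)
```

In this toy frame the five 2-BHK rows average 5000 per sq ft, so the 3-BHK row at 4500 is removed while the one at 5500 stays. The 2-BHK rows are untouched because the single 1-BHK row is too few to serve as a baseline.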


In [208]:
def removebhkoutliers(df):
    exclude_indexes = np.array([])
    for location,locationdfs in df.groupby('location'):
        bhk_stats={}
        for bhk,bhkdf in locationdfs.groupby('bhk'):
            bhk_stats[bhk] = {'mean' : np.mean(bhkdf.price_per_sqfeet), 'std' : np.std(bhkdf.price_per_sqfeet), 'count' : bhkdf.shape[0]}
        for bhk,bhkdf in locationdfs.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)                # .get returns None when this location has no (bhk-1) group
            if stats and stats['count'] >= 5:           # only trust the (bhk-1) mean when it is based on at least 5 rows
                exclude_indexes=np.append(exclude_indexes, bhkdf[bhkdf.price_per_sqfeet<(stats['mean'])].index.values)
    return df.drop(exclude_indexes, axis= 'index')
data= removebhkoutliers(data)

In [212]:
data.shape

(7165, 8)

In [220]:
data

Unnamed: 0,availability,location,bhk,total_sqft,bath,price,price_per_sqfeet,total_sqftbybhk
0,0,1st Block Jayanagar,4.0,2850.0,4.0,428.0,15017.543860,712.500000
1,0,1st Block Jayanagar,3.0,1630.0,3.0,194.0,11901.840491,543.333333
2,1,1st Block Jayanagar,3.0,1875.0,2.0,235.0,12533.333333,625.000000
3,0,1st Block Jayanagar,3.0,1200.0,2.0,130.0,10833.333333,400.000000
4,0,1st Block Jayanagar,2.0,1235.0,2.0,148.0,11983.805668,617.500000
...,...,...,...,...,...,...,...,...
10320,0,other,2.0,1200.0,2.0,70.0,5833.333333,600.000000
10321,1,other,1.0,1800.0,1.0,200.0,11111.111111,1800.000000
10324,1,other,2.0,1353.0,2.0,110.0,8130.081301,676.500000
10325,0,other,1.0,812.0,1.0,26.0,3201.970443,812.000000


In [224]:
cleandata = data.drop(columns=['price_per_sqfeet'])

In [225]:
cleandata.to_csv('cleandata.csv')


# 📦 Why These Libraries?

We import these scikit-learn modules, plus the XGBoost regressor, to build, preprocess, and evaluate a regression model pipeline efficiently.  
Here’s a quick explanation of each:

---

### 🔷 Imports & Purpose:
✅ <span class="important">train_test_split</span>  
→ Splits the dataset into training and testing sets for evaluation.

✅ <span class="important">OneHotEncoder</span>  
→ Converts categorical variables into a one-hot encoded numeric format.

✅ <span class="important">StandardScaler</span>  
→ Standardizes numerical features to have mean 0 and variance 1.

✅ <span class="important">make_column_transformer</span>  
→ Allows applying different preprocessing steps to specific columns.

✅ <span class="important">make_pipeline</span>  
→ Chains together preprocessing and modeling steps into a single workflow.

✅ <span class="important">r2_score</span>  
→ Evaluates the model by measuring how well it explains variance in the data.

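✅ <span class="important">RandomForestRegressor</span>  
→ An ensemble learning algorithm for regression using decision trees (imported here, but not used in the pipeline below).

✅ <span class="important">XGBRegressor</span>  
→ A gradient-boosted tree regressor from the `xgboost` package; this is the model trained below.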


In [227]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


# 🔷 Splitting Data: Train & Test Sets

We separate our dataset into **features (X)** and **target (y)**,  
and then split it into **training** and **testing** sets to evaluate model performance on unseen data.

---

### 📄 Code Explanation:
✅ <span class="important">`X = cleandata.drop(columns=['price'])`</span>  
→ Features: all columns except `price`.  

✅ <span class="important">`y = cleandata['price']`</span>  
→ Target: the `price` column.  

✅ <span class="important">`train_test_split(...)`</span>  
→ Splits `X` and `y` into training (80%) and testing (20%) sets,  
   with `random_state=0` to ensure reproducibility.


In [230]:
X = cleandata.drop(columns = ['price'])
y = cleandata['price']
X_train,X_test,y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)


# 🚀 Pipeline: Preprocessing + XGBoost

We build a **machine learning pipeline** to preprocess data and train an XGBoost regression model, then evaluate it with \( R^2 \) score.  

---

### 📄 Code Explanation:
✅ <span class="important">`scaler = StandardScaler()`</span>  
→ Standardizes numerical features to mean 0 and variance 1.  

✅ <span class="important">`column_trans = make_column_transformer(...)`</span>  
→ Applies **OneHotEncoder** to the `location` column while passing other columns as-is.  

✅ <span class="important">`xgb = XGBRegressor(...)`</span>  
→ Defines an XGBoost regression model with specified hyperparameters.  

✅ <span class="important">`pipe = make_pipeline(...)`</span>  
→ Chains preprocessing (`column_trans` + `scaler`) and the XGBoost model into one pipeline.  

✅ <span class="important">`pipe.fit(X_train, y_train)`</span>  
→ Trains the pipeline on the training data.  

✅ <span class="important">`r2_score(y_test, pipe.predict(X_test))`</span>  
→ Evaluates the trained model’s performance on test data using \( R^2 \) score.
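
For readers who want to try the pipeline pattern without the real data, the following is a self-contained sketch on synthetic data. It is not the notebook's model: it swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in so that only scikit-learn is required, scales only the numeric columns, and passes `handle_unknown="ignore"` so locations unseen during training do not raise at prediction time. The toy columns and the price formula are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings (made-up data, for illustration only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "location": rng.choice(["Whitefield", "Kothanur", "other"], size=n),
    "bhk": rng.integers(1, 5, size=n).astype(float),
    "total_sqft": rng.uniform(500, 3000, size=n),
})
# Price grows with area, plus a location premium and some noise.
premium = toy["location"].map({"Whitefield": 30.0, "Kothanur": 10.0, "other": 0.0})
toy["price"] = 0.05 * toy["total_sqft"] + premium + rng.normal(0, 2, size=n)

X = toy.drop(columns=["price"])
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One-hot encode the categorical column, standardize the numeric ones.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["location"]),
    (StandardScaler(), ["bhk", "total_sqft"]),
)
model = make_pipeline(preprocess, GradientBoostingRegressor(random_state=0))
model.fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(score)
```

Tree-based models are generally insensitive to feature scaling, so the scaler is included here mainly to mirror the structure of the pipeline described above.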


In [232]:
scaler=StandardScaler()
column_trans = make_column_transformer((OneHotEncoder(sparse_output=False), ['location']),remainder='passthrough')
xgb = XGBRegressor(n_estimators=100,learning_rate=0.1,max_depth=5,random_state=42)
pipe = make_pipeline(column_trans, scaler, xgb)
pipe.fit(X_train, y_train)
print(r2_score(y_test, pipe.predict(X_test)))

0.8536249573916868


In [233]:
import pickle
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)

In [234]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


<div style="background-color:#1e1e1e; color:#f1f1f1; padding:15px; border-left:5px solid #00bcd4; font-family:Arial, sans-serif;">
  <h2 style="color:#00bcd4;">🚀 Next Steps: Launch the App</h2>
  <p>
    🎉 The machine learning pipeline has been saved as <code>pipeline.pkl</code> and the required dependencies are listed in <code>requirements.txt</code>.<br>
    Now, proceed to run <strong>app.py</strong> to launch the interactive web application using <strong>Streamlit</strong>.
  </p>
  <p>
    Below are some screenshots of the deployed app in action:
  </p>

  <div style="display:flex; justify-content:space-around; margin-top:20px;">
    <img src="screenshot1.png" alt="Screenshot 1" style="width:45%; border:2px solid #333; border-radius:4px;">
    <img src="screenshot2.png" alt="Screenshot 2" style="width:45%; border:2px solid #333; border-radius:4px;">
  </div>

  <div style="display:flex; justify-content:space-around; margin-top:20px;">
    <img src="screenshot3.png" alt="Screenshot 3" style="width:45%; border:2px solid #333; border-radius:4px;">
    <img src="screenshot4.png" alt="Screenshot 4" style="width:45%; border:2px solid #333; border-radius:4px;">
  </div>

  <p style="margin-top:20px;">
    👉 Run <code>streamlit run app.py</code> in your terminal to get started!
  </p>
</div>
