# Initial Package Imports

The following essential Python packages are imported:

- **pandas**: For data manipulation and analysis
- **numpy**: For numerical computing and array operations
- **matplotlib.pyplot**: For data visualization

The cell also runs **%matplotlib inline**, an IPython magic command (not a package) that displays plots inline in the notebook.

These packages provide the core functionality needed for data analysis and machine learning tasks.


In [147]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


This cell loads the Bengaluru House Price dataset from a CSV file using pandas' read_csv() function and displays the first five rows using head(). Each row describes one property in Bengaluru city: its area type, availability, location, size, society, total square footage, number of bathrooms and balconies, and price.


In [148]:
hp_data = pd.read_csv('../dataset/Bengaluru_House_Data.csv')
hp_data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


This cell calls the DataFrame's info() method to display metadata about the DataFrame including:
- Total number of entries
- Column names and their data types
- Number of non-null values in each column
- Memory usage

This provides a quick overview of the dataset structure and helps identify any missing values.


In [149]:
hp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB
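
As a rough sketch, the non-null counts reported above can be turned into missing-value shares. The dictionary below copies those counts by hand, and the variable names are illustrative only. With the DataFrame loaded, hp_data.isnull().mean() should give the same fractions directly.

```python
import pandas as pd

# Non-null counts copied from the info() output above (13320 rows in total).
total_rows = 13320
non_null = pd.Series({
    "area_type": 13320, "availability": 13320, "location": 13319,
    "size": 13304, "society": 7818, "total_sqft": 13320,
    "bath": 13247, "balcony": 12711, "price": 13320,
})

missing = total_rows - non_null                      # missing cells per column
missing_pct = (missing / total_rows * 100).round(2)  # same, as a percentage

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

The society column stands out: well over a third of its entries are missing, while location lacks only a single value.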


The describe() method provides a statistical summary of the numerical columns in the DataFrame including:

- Count of values
- Mean
- Standard deviation 
- Minimum value
- 25th percentile (Q1)
- Median (50th percentile)
- 75th percentile (Q3) 
- Maximum value

This helps understand the distribution and range of numeric features in the house price dataset.


In [150]:
hp_data.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0
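
Note that total_sqft is absent from the summary: it is stored as object (strings), and describe() by default summarizes only the numeric columns of a mixed DataFrame. The sketch below illustrates this on a hand-made three-row frame whose values mirror the first rows shown earlier. The name sample is illustrative, and include="all" is used to request a summary of every column.

```python
import pandas as pd

# Tiny illustrative frame; the values mirror the first rows of the dataset.
sample = pd.DataFrame({
    "location": ["Electronic City Phase II", "Chikka Tirupathi", "Uttarahalli"],
    "total_sqft": ["1056", "2600", "1440"],   # strings, so not a numeric column
    "bath": [2.0, 5.0, 2.0],
    "price": [39.07, 120.0, 62.0],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # every column, numeric or not

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```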


# Inspecting Data Grouping by Area Type

This cell analyzes the distribution of properties across different area types using pandas groupby() and agg() functions:

- Groups the data by 'area_type' column
- Counts the number of properties in each area type category
- Provides quick insight into which area types are most common in the dataset


In [151]:
hp_data.groupby('area_type')['area_type'].agg('count')

area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64
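
For reference, value_counts() should give the same counts as the groupby() call above, sorted by frequency, and normalize=True turns them into shares. A minimal sketch on a few made-up rows (the labels follow the output above, the row mix does not):

```python
import pandas as pd

# Illustrative rows only; the real column has 13320 entries.
sample = pd.DataFrame({
    "area_type": ["Super built-up  Area", "Plot  Area", "Built-up  Area",
                  "Super built-up  Area", "Super built-up  Area"],
})

by_group = sample.groupby("area_type")["area_type"].agg("count")
by_value_counts = sample["area_type"].value_counts()          # most frequent first
shares = sample["area_type"].value_counts(normalize=True)     # fractions summing to 1

print(by_value_counts)
print(shares)
```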

# Data Cleaning

# Feature Selection and Dimensionality Reduction

We are removing less important features to simplify the model:

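1. Dropping features expected to have low predictive value:
- area_type: Assumed to add little beyond the area already given in total_sqft
- availability: Availability date is assumed not to be relevant for price
- society: Missing for many rows (only 7818 of 13320 are non-null) and has too many unique values, so likely not significant
- balcony: Has missing values (12711 of 13320 are non-null) and is assumed to have a minor impact on price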

2. Further dimensionality reduction can be done using PCA after:
- Handling remaining missing values
- Encoding categorical variables
- Scaling numerical features

This preprocessing will help create a simpler, more robust model.
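
The notebook does not show a PCA step. As a hedged sketch of the order described above (scale first, then project), the example below uses synthetic numbers and plain NumPy rather than any particular library's PCA class. The array X, the choice of two components, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for already-encoded, fully numeric features (not the real data).
X = rng.normal(size=(200, 4))
X[:, 3] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # a deliberately redundant column

# Scale each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the scaled matrix.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal directions.
n_components = 2
X_reduced = X_scaled @ Vt[:n_components].T

print(explained_variance_ratio.round(3))
print(X_reduced.shape)
```

Scaling matters here because PCA is driven by variance: without it, a column measured in large units (such as square feet) would dominate the components.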


In [152]:
df1 = hp_data.drop(['area_type', 'availability', 'balcony', 'society'], axis='columns')

In [153]:
df1.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


In [154]:
df1.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [155]:
df2 = df1.dropna()


In [156]:
df2.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

In [157]:
df2.shape

(13246, 5)
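
The shape shows that 74 rows were removed (13320 minus 13246), fewer than the 90 missing cells counted earlier (1 + 16 + 73), so some rows must have been missing more than one field. A small sketch with made-up rows shows how dropna() counts such a row only once:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the second row is missing both size and bath.
sample = pd.DataFrame({
    "location": ["Hebbal", "Whitefield", "Sarjapur", None],
    "size": ["2 BHK", None, "3 BHK", "2 BHK"],
    "bath": [2.0, np.nan, np.nan, 2.0],
})

missing_cells = int(sample.isnull().sum().sum())  # total number of missing cells
cleaned = sample.dropna()                         # drops rows with any missing value
rows_dropped = len(sample) - len(cleaned)

print(missing_cells, rows_dropped)
```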

In [158]:
df2['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

In [159]:
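def token_size(size):
    # Return the leading number of a label such as '2 BHK' or '4 Bedroom'
    tokens = size.split()  # renamed from 'str' to avoid shadowing the built-in
    return int(tokens[0])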


In [160]:
df2 = df2.assign(bhk=df2['size'].apply(token_size))
df2.head()


Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0,4
2,Uttarahalli,3 BHK,1440,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3
4,Kothanur,2 BHK,1200,2.0,51.0,2
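
As an alternative to applying token_size row by row, the same leading number can be extracted with pandas string methods. This is a minimal sketch on a few labels taken from the unique() output above; the names sizes and bhk are illustrative.

```python
import pandas as pd

# Labels taken from the unique() output above.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "11 BHK"])

# Take the first whitespace-separated token and convert it to an integer.
bhk = sizes.str.split().str[0].astype(int)

print(bhk.tolist())
```

Both approaches assume every label starts with a number, which holds for the labels listed above once missing values have been dropped.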


In [161]:
df2['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

In [162]:
def is_float(x):
    # True if x can be parsed as a float, False otherwise
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

In [163]:
df2[~df2.total_sqft.apply(is_float)]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,186.000,4
122,Hebbal,4 BHK,3067 - 8156,4.0,477.000,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,43.490,2
188,KR Puram,2 BHK,1015 - 1540,2.0,56.800,2
...,...,...,...,...,...,...
12975,Whitefield,2 BHK,850 - 1060,2.0,38.190,2
12990,Talaghattapura,3 BHK,1804 - 2273,3.0,122.000,3
13059,Harlur,2 BHK,1200 - 1470,2.0,72.760,2
13265,Hoodi,2 BHK,1133 - 1384,2.0,59.135,2
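
A similar check can be sketched with pd.to_numeric and errors="coerce", which turns anything it cannot parse into NaN. The values below are taken from the outputs above and the variable names are illustrative. For the entries shown here it flags the same rows as is_float, though the two may differ on edge cases such as the string "nan".

```python
import pandas as pd

# Values taken from the outputs above.
total_sqft = pd.Series(["1056", "2100 - 2850", "1133 - 1384", "774"])

as_number = pd.to_numeric(total_sqft, errors="coerce")  # unparseable entries become NaN
not_numeric = total_sqft[as_number.isna()]              # keep only the flagged entries

print(not_numeric.tolist())
```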


In [164]:
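def convert_range_value_to_int(area):
    # Despite the name, this returns a float: the midpoint of a range such as
    # '2100 - 2850', or the value itself when it is a single number
    try:
        if "-" in area:
            parts = area.split("-")
            if len(parts) == 2:
                return (float(parts[0].strip()) + float(parts[1].strip())) / 2
        return float(area.strip())
    except (AttributeError, TypeError, ValueError):
        # Return None if conversion fails
        return None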

In [165]:
df2 = df2.assign(area_sqft=df2['total_sqft'].apply(convert_range_value_to_int))


In [166]:
df2.loc[:, 'area_sqft'].unique()

array([1056. , 2600. , 1440. , ..., 1258.5,  774. , 4689. ])
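
The 1258.5 in the output above is the midpoint of the range '1133 - 1384'. The sketch below reproduces that arithmetic with a stand-alone helper. The name range_midpoint is hypothetical and only mirrors the conversion logic, and the last input is an invented example of a value that cannot be parsed.

```python
def range_midpoint(area):
    """Hypothetical helper: midpoint for 'a - b', the number itself otherwise, None on failure."""
    try:
        if "-" in area:
            low, high = area.split("-")  # raises ValueError unless exactly two parts
            return (float(low) + float(high)) / 2
        return float(area)
    except (AttributeError, TypeError, ValueError):
        return None

# The first three values appear in the outputs above; the last one is hypothetical.
examples = ["1056", "2100 - 2850", "1133 - 1384", "34.46Sq. Meter"]
converted = [range_midpoint(value) for value in examples]

print(converted)
```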

In [167]:
df2.area_sqft.count()


13200

In [168]:
df2['area_sqft'].isnull().sum()

46
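
The two numbers are consistent with the earlier shape: 13200 non-null values plus 46 nulls give the 13246 rows of df2. A minimal sketch of that relationship on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values; None and NaN stand in for entries that could not be converted.
area = pd.Series([1056.0, None, 1258.5, np.nan, 774.0], name="area_sqft")

non_null = int(area.count())       # counts only non-missing entries
nulls = int(area.isnull().sum())   # counts only missing entries

print(non_null, nulls, len(area))
```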

In [169]:
df3 = df2.copy()
df3.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,area_sqft
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2,1056.0
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0,4,2600.0
2,Uttarahalli,3 BHK,1440,2.0,62.0,3,1440.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3,1521.0
4,Kothanur,2 BHK,1200,2.0,51.0,2,1200.0


In [170]:
df3.count()

location      13246
size          13246
total_sqft    13246
bath          13246
price         13246
bhk           13246
area_sqft     13200
dtype: int64

In [171]:
df3.isnull().sum()

location       0
size           0
total_sqft     0
bath           0
price          0
bhk            0
area_sqft     46
dtype: int64

In [172]:
df4 = df3.dropna()

In [173]:
df4.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
bhk           0
area_sqft     0
dtype: int64

In [174]:
df4.count()

location      13200
size          13200
total_sqft    13200
bath          13200
price         13200
bhk           13200
area_sqft     13200
dtype: int64

In [175]:
df4 = df4.drop(columns='total_sqft')


In [177]:
df4

Unnamed: 0,location,size,bath,price,bhk,area_sqft
0,Electronic City Phase II,2 BHK,2.0,39.07,2,1056.0
1,Chikka Tirupathi,4 Bedroom,5.0,120.00,4,2600.0
2,Uttarahalli,3 BHK,2.0,62.00,3,1440.0
3,Lingadheeranahalli,3 BHK,3.0,95.00,3,1521.0
4,Kothanur,2 BHK,2.0,51.00,2,1200.0
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,4.0,231.00,5,3453.0
13316,Richards Town,4 BHK,5.0,400.00,4,3600.0
13317,Raja Rajeshwari Nagar,2 BHK,2.0,60.00,2,1141.0
13318,Padmanabhanagar,4 BHK,4.0,488.00,4,4689.0
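
As a hedged aside, drop() called without inplace=True returns a new DataFrame and leaves its input untouched, which can make each step easier to re-run. A small sketch on a made-up two-row frame (the names original and trimmed are illustrative):

```python
import pandas as pd

# Made-up two-row frame for illustration.
original = pd.DataFrame({
    "total_sqft": ["1056", "2100 - 2850"],
    "area_sqft": [1056.0, 2475.0],
})

trimmed = original.drop(columns="total_sqft")  # returns a new DataFrame

print(original.columns.tolist())
print(trimmed.columns.tolist())
```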


# Feature Engineering