In [68]:
#Set up the environment
import pandas as pd

import numpy as np

In [69]:
#Import data
df = pd.read_csv("documents/housemodifiedrent.csv")

In [70]:
df.head(10)

Unnamed: 0,Floor Level Updated,Floor Total,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathrooms,Point of Contact
0,0.0,2,5/18/2022,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,No Preference,2,Contact Owner
1,1.0,3,5/13/2022,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,No Preference,1,Contact Owner
2,1.0,3,5/16/2022,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,No Preference,1,Contact Owner
3,1.0,2,7/4/2022,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,No Preference,1,Contact Owner
4,1.0,2,5/9/2022,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
5,0.0,1,4/29/2022,2,7000,600,Ground out of 1,Super Area,Thakurpukur,Kolkata,Unfurnished,No Preference,2,Contact Owner
6,0.0,4,6/21/2022,2,10000,700,Ground out of 4,Super Area,Malancha,Kolkata,Unfurnished,Bachelors,2,Contact Agent
7,1.0,2,6/21/2022,1,5000,250,1 out of 2,Super Area,Malancha,Kolkata,Unfurnished,Bachelors,1,Contact Agent
8,1.0,2,6/7/2022,2,26000,800,1 out of 2,Carpet Area,"Palm Avenue Kolkata, Ballygunge",Kolkata,Unfurnished,Bachelors,2,Contact Agent
9,1.0,3,6/20/2022,2,10000,1000,1 out of 3,Carpet Area,Natunhat,Kolkata,Semi-Furnished,No Preference,2,Contact Owner


In [71]:
df.tail(10)

Unnamed: 0,Floor Level Updated,Floor Total,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathrooms,Point of Contact
4736,,2,6/28/2022,3,15000,1500,Lower Basement out of 2,Super Area,Almasguda,Hyderabad,Semi-Furnished,Family,3,Contact Owner
4737,,2,7/7/2022,3,15000,1500,Lower Basement out of 2,Super Area,Almasguda,Hyderabad,Semi-Furnished,No Preference,3,Contact Owner
4738,4.0,5,7/6/2022,2,17000,855,4 out of 5,Carpet Area,"Godavari Homes, Quthbullapur",Hyderabad,Unfurnished,Bachelors,2,Contact Agent
4739,2.0,4,7/6/2022,2,25000,1040,2 out of 4,Carpet Area,Gachibowli,Hyderabad,Unfurnished,Bachelors,2,Contact Owner
4740,2.0,2,6/2/2022,2,12000,1350,2 out of 2,Super Area,Old Alwal,Hyderabad,Unfurnished,No Preference,2,Contact Owner
4741,3.0,5,5/18/2022,2,15000,1000,3 out of 5,Carpet Area,Bandam Kommu,Hyderabad,Semi-Furnished,No Preference,2,Contact Owner
4742,1.0,4,5/15/2022,3,29000,2000,1 out of 4,Super Area,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,No Preference,3,Contact Owner
4743,3.0,5,7/10/2022,3,35000,1750,3 out of 5,Carpet Area,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,No Preference,3,Contact Agent
4744,23.0,34,7/6/2022,3,45000,1500,23 out of 34,Carpet Area,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent
4745,4.0,5,5/4/2022,2,15000,1000,4 out of 5,Carpet Area,Suchitra Circle,Hyderabad,Unfurnished,Bachelors,2,Contact Owner


The tail of the data shows missing values in the Floor Level Updated column. Let's check whether any other columns contain NaN values.

In [72]:
df.isna().sum()

Floor Level Updated    34
Floor Total             0
Posted On               0
BHK                     0
Rent                    0
Size                    0
Floor                   0
Area Type               0
Area Locality           0
City                    0
Furnishing Status       0
Tenant Preferred        0
Bathrooms               0
Point of Contact        0
dtype: int64

There are 34 missing values in the Floor Level Updated column, and none in any other column.

There are two ways we can deal with this problem:

1. Drop the rows that contain null values
2. Fill in the 34 missing values (the approach used here, taking the value from the Floor column)
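
As a rough illustration of the two options, the sketch below applies both to a small made-up frame. The `toy`, `dropped`, and `filled` names and the sample values are assumptions for demonstration only, not part of the rental data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the rental data (values are made up).
toy = pd.DataFrame({
    "Floor Level Updated": [0.0, 1.0, np.nan, 4.0],
    "Floor": ["Ground out of 2", "1 out of 3",
              "Lower Basement out of 2", "4 out of 5"],
})

# Option 1: drop every row that has a missing floor level.
dropped = toy.dropna(subset=["Floor Level Updated"])

# Option 2: keep every row and fill the gap from another column.
filled = toy.copy()
filled["Floor Level Updated"] = filled["Floor Level Updated"].fillna(filled["Floor"])

print(len(dropped), len(filled))
print(filled["Floor Level Updated"].isna().sum())
```

Dropping loses rows, while filling keeps them but leaves text in the filled cells, so the column still needs a type conversion afterwards.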

In [73]:
# Assign the result back instead of using inplace=True on a column selection
df['Floor Level Updated'] = df['Floor Level Updated'].fillna(df['Floor'])


In [74]:
df.isna().sum()

Floor Level Updated    0
Floor Total            0
Posted On              0
BHK                    0
Rent                   0
Size                   0
Floor                  0
Area Type              0
Area Locality          0
City                   0
Furnishing Status      0
Tenant Preferred       0
Bathrooms              0
Point of Contact       0
dtype: int64

As shown above, there are no null values left. Next, we check whether any of the columns need a different datatype.

In [75]:
df.dtypes

Floor Level Updated    object
Floor Total            object
Posted On              object
BHK                     int64
Rent                    int64
Size                    int64
Floor                  object
Area Type              object
Area Locality          object
City                   object
Furnishing Status      object
Tenant Preferred       object
Bathrooms               int64
Point of Contact       object
dtype: object

We see that Floor Level Updated, Floor Total, Posted On, and Floor are all stored as object. Posted On holds dates and the floor columns hold floor numbers, so they should be converted.

Let's start with the Posted On column and change it to datetime.

In [76]:
df['Posted On'] = pd.to_datetime(df['Posted On'])

Now let's convert the floor columns to numeric types.

In [77]:
df['Floor'] = pd.to_numeric(df['Floor'], errors='coerce').fillna(0.0)


In [78]:
df['Floor Total'] = pd.to_numeric(df['Floor Total'], errors='coerce').fillna(0.0)


In [79]:
df['Floor Level Updated'] = pd.to_numeric(df['Floor Level Updated'], errors='coerce').fillna(0.0)


In [80]:
df.dtypes

Floor Level Updated           float64
Floor Total                   float64
Posted On              datetime64[ns]
BHK                             int64
Rent                            int64
Size                            int64
Floor                         float64
Area Type                      object
Area Locality                  object
City                           object
Furnishing Status              object
Tenant Preferred               object
Bathrooms                       int64
Point of Contact               object
dtype: object

The code above did two things to the floor columns:
1. Converted the values to numeric, replacing anything that could not be parsed with NaN and then replacing every NaN with 0
2. Changed the column datatype from object to float64
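
To make the two steps concrete, here is a minimal sketch on a made-up Series (the `raw`, `as_numbers`, and `cleaned` names are illustrative assumptions). It shows that text entries are turned into NaN by `errors="coerce"` and then into 0 by `fillna`.

```python
import pandas as pd

# Made-up values mixing parseable numbers, text, and a missing entry.
raw = pd.Series(["3", "1 out of 3", "Ground out of 2", None])

# Step 1: anything that is not a plain number becomes NaN.
as_numbers = pd.to_numeric(raw, errors="coerce")

# Step 2: every NaN is replaced with 0.0, leaving a float64 column.
cleaned = as_numbers.fillna(0.0)

print(as_numbers.tolist())
print(cleaned.tolist())
print(cleaned.dtype)
```

Only the first entry keeps its value. A string such as "1 out of 3" ends up as 0, the same as a genuine ground floor.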

Now we move onto the basic descriptive analysis!

In [81]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Floor Level Updated,4746.0,3.445638,5.767071,0.0,1.0,2.0,3.0,76.0
Floor Total,4746.0,6.968605,9.467245,0.0,2.0,4.0,6.0,89.0
BHK,4746.0,2.08386,0.832256,1.0,2.0,2.0,3.0,6.0
Rent,4746.0,34993.451327,78106.412937,1200.0,10000.0,16000.0,33000.0,3500000.0
Size,4746.0,967.490729,634.202328,10.0,550.0,850.0,1200.0,8000.0
Floor,4746.0,0.001054,0.048136,0.0,0.0,0.0,0.0,3.0
Bathrooms,4746.0,1.965866,0.884532,1.0,1.0,2.0,2.0,10.0


- Every house has at least 1 BHK (bedroom, hall, and kitchen); the maximum is 6.
- Size ranges from 10 sq ft to 8,000 sq ft, with the middle 50% between 550 and 1,200 sq ft.
- Every house has at least 1 bathroom; the maximum is 10.
- Floor Level Updated ranges from 0 to 76. Basement levels were originally supposed to appear as -1 and -2, but the minimum here is 0.
- Floor Total ranges from 0 to 89.
- Rent ranges from 1,200 to 3,500,000.
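
One possible way to keep the basement levels negative is to parse the text in the Floor column directly. The sketch below is only an illustration: the `parse_floor_level` helper and the `NAMED_LEVELS` mapping (Ground as 0, Upper Basement as -1, Lower Basement as -2) are assumptions, not something the data set documents.

```python
import re
import pandas as pd

# Assumed coding of named levels; hypothetical, not confirmed by the data set.
NAMED_LEVELS = {"ground": 0, "upper basement": -1, "lower basement": -2}

def parse_floor_level(text):
    """Return the level at the start of a Floor string, or None if unknown."""
    head = str(text).split(" out of ")[0].strip().lower()
    if head in NAMED_LEVELS:
        return NAMED_LEVELS[head]
    if re.fullmatch(r"-?\d+", head):
        return int(head)
    return None

# Sample strings in the same shape as the Floor column.
floors = pd.Series(["Ground out of 2", "1 out of 3",
                    "Lower Basement out of 2", "23 out of 34"])
levels = floors.map(parse_floor_level)
print(levels.tolist())
```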

In [83]:
df.mean(numeric_only=True)


Floor Level Updated        3.445638
Floor Total                6.968605
BHK                        2.083860
Rent                   34993.451327
Size                     967.490729
Floor                      0.001054
Bathrooms                  1.965866
dtype: float64

In [84]:
df.median(numeric_only=True)


Floor Level Updated        2.0
Floor Total                4.0
BHK                        2.0
Rent                   16000.0
Size                     850.0
Floor                      0.0
Bathrooms                  2.0
dtype: float64

In [85]:
df.mode()

Unnamed: 0,Floor Level Updated,Floor Total,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathrooms,Point of Contact
0,1.0,4.0,2022-07-06,2,15000,1000,0.0,Super Area,Bandra West,Mumbai,Semi-Furnished,No Preference,2,Contact Owner


The output above gives the mean, median, and mode of the data set.

The mean Rent (34,993) is more than double the median Rent (16,000), which indicates a positive skew in Rent. Floor Total is also positively skewed, with a mean of about 7 against a median of 4.

The medians are 850 sq ft for Size, 16,000 for Rent, 4 for Floor Total, 2 for Floor Level Updated, 2 for Bathrooms, and 2 for BHK (bedroom, hall, and kitchen).

The mode gives the most frequent value of each column taken on its own: Mumbai for City, Bandra West for Area Locality, Super Area for Area Type, Semi-Furnished for Furnishing Status, No Preference for Tenant Preferred, 2 bathrooms, 4 floors total, floor level 1, rent of 15,000, and size of 1000 sq ft.

Since the means sit above the medians, the distributions are positively skewed, pulled upward by a small number of extreme values.
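
The mean-versus-median comparison can be checked directly. The sketch below uses a short made-up rent list with one extreme value (the `rent` Series is an assumption for illustration) and also calls `Series.skew()`, which returns a positive number for a right-skewed sample.

```python
import pandas as pd

# Made-up rents: six ordinary values and one extreme outlier.
rent = pd.Series([7000, 10000, 15000, 16000, 20000, 26000, 3500000])

mean_rent = rent.mean()
median_rent = rent.median()

# A mean far above the median points to a long right tail.
print(mean_rent > median_rent)

# Sample skewness; a value above 0 indicates positive skew.
print(rent.skew() > 0)
```

The same two checks could be run on `df['Rent']` to put a number on the skew described above.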

In [88]:
df.describe(include='float64').T  # 'd' is NumPy shorthand for float64

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Floor Level Updated,4746.0,3.445638,5.767071,0.0,1.0,2.0,3.0,76.0
Floor Total,4746.0,6.968605,9.467245,0.0,2.0,4.0,6.0,89.0
Floor,4746.0,0.001054,0.048136,0.0,0.0,0.0,0.0,3.0


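The floor columns have maximums of 89 for Floor Total, 76 for Floor Level Updated, and 3 for Floor. Floor is 0 in almost every row (its 75th percentile is 0) because its text values, such as "1 out of 3", could not be parsed as numbers and were replaced with 0. Floor Total and Floor Level Updated therefore give the clearer picture of how many floors the buildings have and which levels are most common.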

In [90]:
df.describe(include='datetime').T


Unnamed: 0,count,unique,top,freq,first,last
Posted On,4746,81,2022-07-06,311,2022-04-13,2022-07-11


The Posted On dates range from 04/13/2022 to 07/11/2022 (89 days), with 81 unique dates in that span. The most common posting date is 07/06/2022, with 311 listings.
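
The 89-day span can be verified with a small sketch. The `posted` Series below holds only the first, last, and most common dates from the summary above; the variable names are illustrative assumptions.

```python
import pandas as pd

# First date, most common date (twice), and last date from the summary.
posted = pd.to_datetime(
    pd.Series(["4/13/2022", "7/6/2022", "7/6/2022", "7/11/2022"]),
    format="%m/%d/%Y",
)

# Difference between the latest and earliest posting dates, in days.
span_days = (posted.max() - posted.min()).days

# Most frequent date in this small sample.
most_common = posted.mode().iloc[0]

print(span_days)
print(most_common.date())
```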

In [87]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Area Type,4746,3,Super Area,2446
Area Locality,4746,2235,Bandra West,37
City,4746,6,Mumbai,972
Furnishing Status,4746,3,Semi-Furnished,2251
Tenant Preferred,4746,3,No Preference,3444
Point of Contact,4746,3,Contact Owner,3216


For the non-numerical columns, every attribute has the full count of 4746. Area Locality stands out with 2235 unique values, while the other columns have between 3 and 6. The top values match the modes found earlier.
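
The unique and top/freq figures can also be read off with `nunique()` and `value_counts()`. This is a minimal sketch on a made-up four-row frame; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up categorical data in the same shape as the rental columns.
toy = pd.DataFrame({
    "City": ["Mumbai", "Kolkata", "Mumbai", "Hyderabad"],
    "Area Locality": ["Bandra West", "Bandel", "Bandra West", "Gachibowli"],
})

# Number of distinct values per column (the "unique" row of describe).
unique_counts = toy.nunique()

# Frequency of each city; the first entry is the "top" value and its "freq".
city_counts = toy["City"].value_counts()

print(unique_counts)
print(city_counts)
```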