# Basic Descriptive Statistics on Indian Housing Dataset

## Prompt:
Using any tool you wish, prepare a summary of basic descriptive statistics for each column of the dataset.

## Set up the environment

In [1]:
import pandas as pd
import numpy as np

## Import the data

In [2]:
data = pd.read_csv("C:\\Users\\d_bar\\Desktop\\Modified Indian Housing Data.csv")

In [3]:
data.head(10)

Unnamed: 0,Posted On,BHK,Rent,Size,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathrooms,Point of Contact,Floor,Total Floors
0,29-Apr-22,2,12000,900,Carpet Area,Adambakkam,Chennai,Semi-Furnished,Bachelors/Family,2,Contact Owner,1,2.0
1,29-May-22,2,15000,800,Carpet Area,Adambakkam,Chennai,Unfurnished,Bachelors/Family,2,Contact Owner,1,2.0
2,7-Jul-22,2,7500,700,Super Area,"Alapakkam, Porur",Chennai,Unfurnished,Bachelors/Family,2,Contact Owner,1,2.0
3,6-Jul-22,2,9500,900,Super Area,Ambattur,Chennai,Unfurnished,Bachelors/Family,2,Contact Owner,1,2.0
4,20-May-22,2,9000,745,Super Area,Ambattur,Chennai,Semi-Furnished,Bachelors/Family,2,Contact Owner,1,2.0
5,17-Jun-22,2,9500,800,Carpet Area,"Ask Nagar, Adambakkam",Chennai,Semi-Furnished,Bachelors/Family,2,Contact Owner,1,2.0
6,26-May-22,2,10000,1030,Carpet Area,Camp Road,Chennai,Semi-Furnished,Bachelors/Family,2,Contact Owner,1,2.0
7,10-Jun-22,2,9000,740,Super Area,Chitlapakkam,Chennai,Unfurnished,Family,2,Contact Owner,1,2.0
8,20-May-22,2,8000,850,Super Area,Chitlapakkam,Chennai,Unfurnished,Bachelors/Family,2,Contact Owner,1,2.0
9,14-May-22,2,23000,850,Carpet Area,Choolaimedu,Chennai,Furnished,Bachelors/Family,2,Contact Owner,1,2.0


In [4]:
data.tail(10)

Unnamed: 0,Posted On,BHK,Rent,Size,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathrooms,Point of Contact,Floor,Total Floors
4736,20-Jun-22,2,8000,623,Carpet Area,"Shreenath nagar, Morya Nagar",Mumbai,Unfurnished,Family,2,Contact Agent,6,7.0
4737,31-May-22,2,60000,720,Carpet Area,Sindhi Society Chembur,Mumbai,Unfurnished,Bachelors/Family,2,Contact Agent,4,7.0
4738,12-Jun-22,2,40000,575,Carpet Area,"Sunglow, Chandivali",Mumbai,Semi-Furnished,Bachelors/Family,2,Contact Agent,4,7.0
4739,17-May-22,2,35000,650,Carpet Area,"Thakur Village, Kandivali East",Mumbai,Semi-Furnished,Bachelors/Family,2,Contact Agent,5,7.0
4740,14-Jun-22,2,46000,680,Carpet Area,"Trans Residency, Andheri East",Mumbai,Semi-Furnished,Bachelors/Family,2,Contact Agent,4,7.0
4741,6-Jul-22,2,39000,630,Carpet Area,"Yashodham Complex, Goregaon East",Mumbai,Unfurnished,Family,2,Contact Agent,6,7.0
4742,13-May-22,2,33500,700,Carpet Area,"Yogi Hills, Mulund West",Mumbai,Semi-Furnished,Bachelors/Family,2,Contact Agent,4,7.0
4743,30-Jun-22,2,85000,850,Carpet Area,Andheri West,Mumbai,Unfurnished,Bachelors/Family,2,Contact Agent,4,6.0
4744,4-Jun-22,2,35000,650,Carpet Area,Thakur Complex,Mumbai,Semi-Furnished,Family,2,Contact Agent,5,6.0
4745,4-Jun-22,2,90000,800,Carpet Area,Vile Parle West,Mumbai,Furnished,Family,2,Contact Agent,5,6.0


## Deal with missing data

In [5]:
data.isna().sum()

Posted On            0
BHK                  0
Rent                 0
Size                 0
Area Type            0
Area Locality        0
City                 0
Furnishing Status    0
Tenant Preferred     0
Bathrooms            0
Point of Contact     0
Floor                0
Total Floors         4
dtype: int64

There are 4 missing values for `Total Floors`. This is likely because the string "out of" was not present in the original floor column. There are two options for dealing with this:
- Delete the 4 records with missing values for `Total Floors`.
- Impute the missing values with `Floor`, since each building must have at least that many floors.


We'll impute the missing data.

In [6]:
data['Total Floors'] = data['Total Floors'].fillna(data['Floor'])

In [7]:
data.isna().sum()

Posted On            0
BHK                  0
Rent                 0
Size                 0
Area Type            0
Area Locality        0
City                 0
Furnishing Status    0
Tenant Preferred     0
Bathrooms            0
Point of Contact     0
Floor                0
Total Floors         0
dtype: int64
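
The two options listed above can be compared on a small stand-in frame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset (hypothetical values, not real listings).
toy = pd.DataFrame({
    "Floor": [1, 3, 0, 5],
    "Total Floors": [2.0, np.nan, 4.0, np.nan],
})

# Option 1: drop the records with a missing `Total Floors`.
dropped = toy.dropna(subset=["Total Floors"])

# Option 2: impute with `Floor`, which is a lower bound on the true value.
imputed = toy.copy()
imputed["Total Floors"] = imputed["Total Floors"].fillna(imputed["Floor"])

print(len(dropped), "rows kept after dropping")
print(imputed)
```

Dropping loses rows, while imputing keeps every record at the cost of possibly understating `Total Floors` for the filled rows.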

## Change datatypes

In [8]:
data.dtypes

Posted On             object
BHK                    int64
Rent                   int64
Size                   int64
Area Type             object
Area Locality         object
City                  object
Furnishing Status     object
Tenant Preferred      object
Bathrooms              int64
Point of Contact      object
Floor                  int64
Total Floors         float64
dtype: object

The `Total Floors` column is a float; let's change it to an int.

In [9]:
data['Total Floors'] = data['Total Floors'].astype(np.int64)

In [10]:
data.dtypes

Posted On            object
BHK                   int64
Rent                  int64
Size                  int64
Area Type            object
Area Locality        object
City                 object
Furnishing Status    object
Tenant Preferred     object
Bathrooms             int64
Point of Contact     object
Floor                 int64
Total Floors          int64
dtype: object
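
A plausible reason the column was loaded as a float is that missing values force an otherwise whole-number column to `float64`. The sketch below uses a made-up series to show this, and adds a check (not part of the original notebook) that no fractional values would be truncated by the cast.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing entry forces the column to float64.
floors = pd.Series([2.0, np.nan, 7.0], name="Total Floors")
print(floors.dtype)  # float64

# After filling the gap, confirm every value is a whole number before casting.
filled = floors.fillna(3)
assert (filled % 1 == 0).all()

as_int = filled.astype(np.int64)
print(as_int.dtype)  # int64
```

If the gaps had been left in place, pandas' nullable `Int64` dtype would be an alternative, since it can hold integers alongside missing values.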

Let's change `Posted On` to be a datetime object.

In [11]:
data['Posted On'] = pd.to_datetime(data['Posted On'])

In [12]:
data.dtypes

Posted On            datetime64[ns]
BHK                           int64
Rent                          int64
Size                          int64
Area Type                    object
Area Locality                object
City                         object
Furnishing Status            object
Tenant Preferred             object
Bathrooms                     int64
Point of Contact             object
Floor                         int64
Total Floors                  int64
dtype: object
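
Without a `format` argument, `pd.to_datetime` infers the layout of the strings. As a hedged alternative, the format can be passed explicitly so that day, month, and year cannot be confused. The sample strings below are copied from the `Posted On` values shown earlier.

```python
import pandas as pd

# Sample strings in the same layout as the `Posted On` column.
raw = pd.Series(["29-Apr-22", "7-Jul-22", "6-Jul-22"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(raw, format="%d-%b-%y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```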

## Basic Descriptive Statistics

In [16]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BHK | 4746.0 | 2.08386 | 0.832256 | 1.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Rent | 4746.0 | 34993.451327 | 78106.412937 | 1200.0 | 10000.0 | 16000.0 | 33000.0 | 3500000.0 |
| Size | 4746.0 | 967.490729 | 634.202328 | 10.0 | 550.0 | 850.0 | 1200.0 | 8000.0 |
| Bathrooms | 4746.0 | 1.965866 | 0.884532 | 1.0 | 1.0 | 2.0 | 2.0 | 10.0 |
| Floor | 4746.0 | 3.436157 | 5.77395 | -2.0 | 1.0 | 2.0 | 3.0 | 76.0 |
| Total Floors | 4746.0 | 6.968605 | 9.467245 | 0.0 | 2.0 | 4.0 | 6.0 | 89.0 |


- `BHK` (bedrooms, hall, kitchen) ranges from 1 to 6.
- The cheapest `Rent` is 1,200 and the highest is 3,500,000.
- The smallest `Size` is 10 sqft. and the largest is 8000 sqft. A 10 sqft. listing is implausibly small and may be a data entry error.
- Every housing option has at least 1 bathroom and the most is 10 bathrooms.
- The `Floor` ranges from Lower Basement (-2) all the way up to 76.
- `Total Floors` ranges from the Ground Floor (0) all the way up to 89. 
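
Extremes like the ones listed above can be traced back to individual rows, which helps decide whether a minimum or maximum is a genuine listing or a likely entry error. The sketch below uses a made-up `toy` frame; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical listings; the column names match the dataset, the values do not.
toy = pd.DataFrame({
    "Rent": [12000, 1200, 3500000, 16000, 9500],
    "Size": [900, 10, 2500, 850, 700],
})

# Rows behind the smallest sizes and the largest rents.
print(toy.nsmallest(2, "Size"))
print(toy.nlargest(2, "Rent"))

# min/max pulled directly, matching the `min` and `max` columns of describe().T.
extremes = toy.agg(["min", "max"])
print(extremes)
```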

How do the medians compare with the means?

In [19]:
data.median(numeric_only=True)


BHK                 2.0
Rent            16000.0
Size              850.0
Bathrooms           2.0
Floor               2.0
Total Floors        4.0
dtype: float64

There are some discrepancies between the means and medians of the numerical columns.
- The mean `Rent` (about 34,993) is more than twice the median (16,000), which suggests a positive skew.
- The mean `Total Floors` (about 6.97) is roughly 1.7 times the median (4), which also suggests a positive skew.

These positive skews suggest that a small number of listings have values far above the typical one.
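
The mean-versus-median comparison can be backed up with a direct skewness measure. The sketch below uses made-up rents with one very large value. The outlier check uses Tukey's 1.5 × IQR rule, which is an added technique and not something the original analysis applied.

```python
import pandas as pd

# Hypothetical rents with one very large value, mimicking a long right tail.
rent = pd.Series([8000, 10000, 12000, 16000, 20000, 33000, 350000], name="Rent")

# A mean well above the median is the pattern described in the text.
print("mean/median ratio:", rent.mean() / rent.median())

# Sample skewness: positive values indicate a longer right tail.
print("skewness:", rent.skew())

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
q1, q3 = rent.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = rent[rent > upper]
print(outliers.tolist())
```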

What about the non-numerical attributes?

In [17]:
data.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Area Type | 4746 | 3 | Super Area | 2446 |
| Area Locality | 4746 | 2231 | Bandra West | 37 |
| City | 4746 | 6 | Mumbai | 972 |
| Furnishing Status | 4746 | 3 | Semi-Furnished | 2251 |
| Tenant Preferred | 4746 | 3 | Bachelors/Family | 3444 |
| Point of Contact | 4746 | 3 | Contact Owner | 3216 |


- The most common `Area Type` is `Super Area`, so square footage is most often reported that way.
- The most common `Area Locality` is Bandra West.
- The most frequent `City` is Mumbai.
- It's most common for housing to be `Semi-Furnished`.
- Most housing has no restriction in `Tenant Preferred`, since either bachelors or families can apply.
- The most common `Point of Contact` is the owner.

`Area Locality` has 2,231 unique values across 4,746 rows, an average of about two listings per locality.
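
For a high-cardinality column like this, `describe` shows only the single most frequent value. A fuller picture comes from `value_counts` and `nunique`. The sketch below runs on a made-up series; the locality names are borrowed from the rows shown earlier, but the counts are not real.

```python
import pandas as pd

# Hypothetical sample of localities (names from the rows above, counts invented).
locality = pd.Series(
    ["Ambattur", "Ambattur", "Chitlapakkam", "Chitlapakkam", "Camp Road", "Choolaimedu"],
    name="Area Locality",
)

counts = locality.value_counts()

# Share of distinct values relative to the number of rows.
unique_ratio = locality.nunique() / len(locality)
print("unique ratio:", round(unique_ratio, 2))

# Number of localities that occur exactly once.
singletons = int((counts == 1).sum())
print("singletons:", singletons)
```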

What about the dates the housing options were posted?

In [18]:
data.describe(include='datetime').T


| | count | unique | top | freq | first | last |
|---|---|---|---|---|---|---|
| Posted On | 4746 | 81 | 2022-07-06 | 311 | 2022-04-13 | 2022-07-11 |


- The `Posted On` dates range from April 13, 2022 to July 11, 2022 (89 days).
    - Listings were posted on 81 distinct dates, so at least one listing appeared on most days in that range.
- The most frequent date the housing options were listed on was July 6th, 2022.
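
The span and the share of days with at least one listing can be computed directly from the datetime column. The sketch below uses made-up dates rather than the real `Posted On` values.

```python
import pandas as pd

# Hypothetical posting dates (not the real column).
posted = pd.to_datetime(pd.Series(
    ["2022-04-13", "2022-04-13", "2022-04-15", "2022-04-16", "2022-04-18"]
))

# Days elapsed between the first and last posting.
span_days = (posted.max() - posted.min()).days

# Calendar days covered, counting both endpoints.
calendar_days = span_days + 1

# Fraction of those calendar days with at least one posting.
coverage = posted.dt.normalize().nunique() / calendar_days
print(span_days, calendar_days, round(coverage, 2))
```

Applied to the figures in the table above, the same arithmetic gives 81 posting dates over 90 calendar days, or 90% coverage.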
