<h1>Median Housing Price Prediction by Building a Machine Learning Model</h1>
<p>Your task is to build a
model of housing prices in California using the California census data. This data has metrics such as the
population, median income, median housing price, and so on for each block group in California. Block
groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block
group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.</p>
<p>Your model should learn from this data and be able to predict the median housing price in any district,
given all the other metrics.</p>

<h3>1. Fetching the dataset with a Python script</h3>

In [6]:
import os
import tarfile
import urllib.request  # Python 3 standard library; replaces six.moves.urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz from housing_url and extract it into housing_path."""
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
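
The function above is defined but needs a network connection to run. As a minimal sketch, the same download-and-extract steps can be tried offline by pointing the function at a local file:// URL. The stand-in archive, its contents, and the temporary directory below are assumptions made for illustration, not the real dataset.

```python
import os
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_housing_data(housing_url, housing_path):
    """Same steps as the function above: download the archive, then unpack it."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


# Build a tiny stand-in archive (hypothetical contents, not the real dataset).
workdir = Path(tempfile.mkdtemp())
source_dir = workdir / "source"
source_dir.mkdir()
(source_dir / "housing.csv").write_text("longitude,latitude\n-122.23,37.88\n")
archive = source_dir / "housing.tgz"
with tarfile.open(archive, "w:gz") as tgz:
    tgz.add(source_dir / "housing.csv", arcname="housing.csv")

# A file:// URL makes urlretrieve copy the local archive instead of downloading.
target_dir = workdir / "datasets" / "housing"
fetch_housing_data(archive.as_uri(), str(target_dir))

# The target directory should now hold the archive and the extracted CSV.
extracted = sorted(os.listdir(target_dir))
print(extracted)
```

With the real dataset, calling fetch_housing_data() with its default arguments should leave housing.tgz and housing.csv under datasets/housing in the same way.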

<h3>2. Loading the downloaded data using Pandas</h3>

In [7]:
import pandas as pd

# Load housing.csv from housing_path into a pandas DataFrame
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

<p>Take a quick look at the data structure by displaying the top five rows with the DataFrame's head() method:</p>


In [9]:
housing= load_housing_data()
housing.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Next, the info() method gives a quick description of the data, in particular the total number of rows, each attribute's type, and its number of non-null values.

In [10]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


<p>This output shows that the dataset has 20,640 instances, which is fairly small by ML standards.
Also, the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature.
This is something we will have to deal with later on.</p>
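<p>The missing-value count can also be read off directly instead of subtracting non-null counts by hand. The sketch below uses a small hypothetical sample (four made-up districts, one without total_bedrooms) rather than the real data; on the full dataset, housing.isnull().sum() would be expected to report 207 for total_bedrooms.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical four-district sample; one district lacks total_bedrooms.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```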
<p>All attributes are numerical except ocean_proximity, whose type is object. An object column can hold any kind of Python object, but because this data was loaded from a CSV file, it must contain text.</p>
<p>In the top five rows, the values in that column are repeated, which suggests that <b>it is probably a categorical attribute.</b></p>
<p>We can find out which categories exist and how many districts belong to each one with the value_counts() method:</p>

In [14]:
housing['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
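
Raw counts can be hard to compare at a glance, so it may help to look at each category's share of the districts. As a minimal sketch, the block below rebuilds the column from the counts printed above (so it runs without the dataset) and passes normalize=True to value_counts(); the variable names are illustrative.

```python
import pandas as pd

# Category counts copied from the value_counts() output above.
counts = {
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
}

# Rebuild a column with one entry per district.
ocean_proximity = pd.Series(
    [category for category, n in counts.items() for _ in range(n)],
    name="ocean_proximity",
)

# normalize=True returns fractions of the total instead of raw counts.
shares = ocean_proximity.value_counts(normalize=True)
print(shares.round(4))
```

On the loaded DataFrame, the equivalent call would be housing['ocean_proximity'].value_counts(normalize=True). The "<1H OCEAN" category accounts for roughly 44% of the districts, while ISLAND is a tiny fraction.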

The describe() method shows a summary of the numerical attributes:

In [16]:
housing.describe()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


Note that the null values are ignored (so, for
example, count of total_bedrooms is 20,433, not 20,640). The std row shows the standard deviation
(which measures how dispersed the values are). The 25%, 50%, and 75% rows show the corresponding
percentiles: a percentile indicates the value below which a given percentage of observations in a group
of observations falls. For example, 25% of the districts have a housing_median_age lower than 18,
while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or
1st quartile), the median, and the 75th percentile (or 3rd quartile).
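
To make the percentile rows concrete, here is a small sketch that computes the same three quartiles with Series.quantile() and checks that describe() reports identical values. The nine ages are hypothetical, chosen so the quartiles match the 18, 29, and 37 quoted above; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical housing_median_age values for nine districts.
ages = pd.Series([5, 12, 18, 24, 29, 33, 37, 45, 52], name="housing_median_age")

# quantile() takes fractions between 0 and 1 and interpolates linearly by default.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() reports the same values under the labels 25%, 50%, and 75%.
summary = ages.describe()
print(summary[["25%", "50%", "75%"]])
```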