# **1.** **Frame the Problem and Look at the Big Picture**

**Project Goal:**
* The model’s output (a prediction of a district’s `median housing price`) will determine whether it is worth investing in a given area. Getting this right is critical, as it directly affects revenue.

**Designing the System:**
* `Supervised Learning`, because the model can be trained with labeled examples.
* A typical `Regression` task, since the model will be asked to predict a value.
  * More specifically, this is a `multiple regression` problem, since the system will use multiple features to make a prediction.
  * It is also a `univariate regression` problem, since we are only trying to predict a single value for each district.
* There is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain ***batch learning*** should do just fine.

**Select a Performance Measure:**
* RMSE (Root Mean Square Error)
* MAE (Mean Absolute Error)<br>
  We will report both.<br><br>

**NOTES:**
* The RMSE is more sensitive to outliers than the MAE, because it squares each error before averaging.
* When outliers are rare, the RMSE performs very well and is generally preferred.
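To make the outlier note concrete, here is a minimal sketch of both measures in NumPy. The helper names `rmse` and `mae` and the toy prices are assumptions made for illustration; they are not taken from the housing data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(residuals)))


# Toy prices (hypothetical values, for illustration only).
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])

# Case 1: every prediction is off by the same 10,000.
y_pred_close = y_true + 10_000.0

# Case 2: same as case 1, except one prediction is off by 100,000.
y_pred_outlier = y_pred_close.copy()
y_pred_outlier[0] = y_true[0] + 100_000.0

print("no outlier :", rmse(y_true, y_pred_close), mae(y_true, y_pred_close))
print("one outlier:", rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

When all errors have the same size the two measures agree. A single large miss raises the RMSE more than the MAE, which is why comparing the two can hint at how much a few bad predictions dominate the error.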


**Libraries to Use In Project:**

In [3]:
# Import Libraries
import pandas as pd

# **2.** **Get the Data**

**2.1.** **Download and Load the Dataset**

In [10]:
# Load the Dataset
df = pd.read_csv('../data/housing.csv')

**2.2.** **Take a Quick Look at the Data Structure**

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [11]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


**Quick Observations:**
* We have 20640 entries and 10 columns: 9 features plus the target, `median_house_value`.
* We can quickly identify columns with missing data: only `total_bedrooms` has null values (20433 non-null out of 20640, so 207 are missing).
  * **TODO-1:** We need to take care of this.
* All attributes are numerical, except for `ocean_proximity`, whose dtype is `object`.
  * **TODO-2:** We need to prepare this for the model.
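The two observations above can also be checked programmatically rather than by reading the `info()` output. The sketch below uses a small toy frame with made-up values (an assumption, so the block runs on its own); the same calls should work on the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that mimics the two issues noted above:
# one numeric column with gaps and one text column.
toy = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
        "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    }
)

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# List the columns that are not numeric and will need encoding later.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("columns with missing values:", columns_with_missing)
print("non-numeric columns:", non_numeric_columns)
```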

**Feature Summary:**
<h6><b><u>Features</u></b></h6>

1. `longitude`: A measure of how far west a house is; the values in this dataset are negative, so a lower (more negative) value is farther west
2. `latitude`: A measure of how far north a house is; a higher value is farther north
3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building
4. `total_rooms`: Total number of rooms within a block
5. `total_bedrooms`: Total number of bedrooms within a block
6. `population`: Total number of people residing within a block
7. `households`: Total number of households (a household is a group of people residing within a home unit) for a block
8. `median_income`: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. `ocean_proximity`: Location of the house with respect to the ocean/sea
    
<h6><b><u>Target Variable</u></b></h6>

1. `median_house_value`: Median house value for households within a block (measured in US Dollars)

When we look at the top five rows, we notice that the `ocean_proximity` values are repetitive, which suggests that it is probably a categorical attribute.<br>
Let's look at what categories exist and how many districts belong to each category:

In [15]:
df['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
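The raw counts are easier to compare as shares of the whole. The sketch below rebuilds the counts shown above as a standalone Series (so it runs without the CSV file) and normalizes them; on the real frame, `df['ocean_proximity'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Category counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "<1H OCEAN": 9136,
        "INLAND": 6551,
        "NEAR OCEAN": 2658,
        "NEAR BAY": 2290,
        "ISLAND": 5,
    },
    name="count",
)

# Share of districts in each category.
shares = counts / counts.sum()
print(shares.round(4))
```

`ISLAND` covers only 5 of the 20640 districts, so any split or encoding step will see very few examples of that category.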

Let's look at the other fields with a summary of the numerical attributes:

In [18]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| longitude | 20640.0 | -119.569704 | 2.003532 | -124.35 | -121.8 | -118.49 | -118.01 | -114.31 |
| latitude | 20640.0 | 35.631861 | 2.135952 | 32.54 | 33.93 | 34.26 | 37.71 | 41.95 |
| housing_median_age | 20640.0 | 28.639486 | 12.585558 | 1.0 | 18.0 | 29.0 | 37.0 | 52.0 |
| total_rooms | 20640.0 | 2635.763081 | 2181.615252 | 2.0 | 1447.75 | 2127.0 | 3148.0 | 39320.0 |
| total_bedrooms | 20433.0 | 537.870553 | 421.38507 | 1.0 | 296.0 | 435.0 | 647.0 | 6445.0 |
| population | 20640.0 | 1425.476744 | 1132.462122 | 3.0 | 787.0 | 1166.0 | 1725.0 | 35682.0 |
| households | 20640.0 | 499.53968 | 382.329753 | 1.0 | 280.0 | 409.0 | 605.0 | 6082.0 |
| median_income | 20640.0 | 3.870671 | 1.899822 | 0.4999 | 2.5634 | 3.5348 | 4.74325 | 15.0001 |
| median_house_value | 20640.0 | 206855.816909 | 115395.615874 | 14999.0 | 119600.0 | 179700.0 | 264725.0 | 500001.0 |


**The summary statistics hint at possible skewness (a mean above the median, the 50% column, suggests a longer right tail):**
* `total_rooms`: mean 2635.76, median 2127 --- *right (positive) skewness.*
* `total_bedrooms`: mean 537.87, median 435 --- *right (positive) skewness.*
* `population`: mean 1425.48, median 1166 --- *right (positive) skewness.*
* `households`: mean 499.54, median 409 --- *right (positive) skewness.*
* `median_income`: mean 3.87, median 3.53 --- *right (positive) skewness.*
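The mean-versus-median comparison can be tabulated in a few lines. The sketch below copies the `mean` and `50%` values from the `describe()` output into a small frame (so it runs without the CSV file); the derived column names are assumptions chosen for this example. Note that this is only a heuristic: on the real frame, `df.skew(numeric_only=True)` would measure sample skewness directly.

```python
import pandas as pd

# Mean and median (50%) values copied from the describe() output above.
summary = pd.DataFrame(
    {
        "mean": [2635.763081, 537.870553, 1425.476744, 499.53968, 3.870671],
        "median": [2127.0, 435.0, 1166.0, 409.0, 3.5348],
    },
    index=["total_rooms", "total_bedrooms", "population", "households", "median_income"],
)

# A ratio above 1 means the mean is pulled above the median,
# which is what a long right tail tends to produce.
summary["mean_over_median"] = summary["mean"] / summary["median"]
summary["suggests_right_skew"] = summary["mean"] > summary["median"]

print(summary.round(3))
```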

# **3.** **Explore the Data**

# **4.** **Prepare the Data**

# **5.** **Shortlist Promising Models**

# **6.** **Fine-Tune the System**

# **7.** **Present Your Solution**

# **8.** **Launch!**