# Predicting Median Housing Prices in California

This notebook looks into various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of **predicting the median house price in any district of California**.

A quick breakdown of the task:
1. Supervised learning task, since each district comes with a known median house value
2. Regression task (predicting a numerical output, the median housing price)
3. Multiple regression, as the model uses multiple features
4. Univariate regression, as we predict a single value for each district
5. Batch learning, as the model is trained on this one fixed dataset rather than on a continuous stream of new data

We're going to take the following approach: 
1. Problem definition
2. Data 
3. Evaluation 
4. Features 
5. Modelling 
6. Experimentation 

## 1. Problem Definition

In a statement,
> Given a district in California and all of its other metrics, can we predict the median house price in that district?

## 2. Data 

The original data came from the StatLib repository; a copy is also available on Kaggle.

## 3. Evaluation 

We use the Root Mean Square Error (RMSE) as our performance measure. It tells us how much error the model typically makes in its predictions, with a higher weight given to large errors.

> The lower the RMSE, the better the model, so we aim for the lowest RMSE we can reach on this dataset.
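
As a minimal sketch of how RMSE is computed: square each prediction error, average the squares, then take the square root. The function name `rmse` and the example prices below are hypothetical and chosen only for illustration; they are not taken from the housing data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical prices: two errors of 10,000 and one error of 50,000
actual = [200_000, 300_000, 400_000]
predicted = [210_000, 290_000, 450_000]

print(rmse(actual, predicted))
```

The average absolute error in this example is about 23,333, while the RMSE comes out higher because the single 50,000 error is squared before averaging. That is the "higher weight for large errors" in practice.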

## 4. Features 

**Create data dictionary** 

### Heart Disease Data Dictionary

A data dictionary describes the data you're dealing with. Not all datasets come with them so this is where you may have to do your research or ask a **subject matter expert** (someone who knows about the data) for more.

The following are the features we'll use to predict our target variable (heart disease or no heart disease).

1. age - age in years 
2. sex - (1 = male; 0 = female) 
3. cp - chest pain type 
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to heart
    * 2: Non-anginal pain: typically esophageal spasms (non heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
    * anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl 
    * serum = LDL + HDL + 0.2 * triglycerides
    * above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
    * above 126 mg/dl signals diabetes
7. restecg - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T Wave abnormality
        - can range from mild symptoms to severe problems
        - signals an abnormal heartbeat
    * 2: Possible or definite left ventricular hypertrophy
        - enlargement of the heart's main pumping chamber
8. thalach - maximum heart rate achieved 
9. exang - exercise induced angina (1 = yes; 0 = no) 
10. oldpeak - ST depression induced by exercise relative to rest 
    * looks at stress of heart during exercise
    * unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy 
    * colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be defect but ok now
    * 7: reversible defect: no proper blood movement when exercising 
14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)

**Note:** No personally identifiable information (PII) can be found in the dataset.


It's a good idea to save these to a Python dictionary or in an external file, so we can look at them later without coming back here.
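
As a minimal sketch of that idea applied to the housing data in this notebook, the dictionary below maps each column of `housing.csv` to a short description. The column names come from the dataset itself; the descriptions are assumptions inferred from those names rather than an official data dictionary, and the names `housing_data_dictionary` and `describe_column` are hypothetical.

```python
# Column names come from housing.csv; the descriptions are assumptions
# inferred from those names, not an official data dictionary.
housing_data_dictionary = {
    "longitude": "Longitude of the district",
    "latitude": "Latitude of the district",
    "housing_median_age": "Median age of the houses in the district",
    "total_rooms": "Total number of rooms in the district",
    "total_bedrooms": "Total number of bedrooms in the district",
    "population": "Number of people living in the district",
    "households": "Number of households in the district",
    "median_income": "Median household income in the district",
    "median_house_value": "Median house value in the district (the target)",
    "ocean_proximity": "Location of the district relative to the ocean (e.g. NEAR BAY)",
}


def describe_column(name):
    """Look up a column description, with a fallback for unknown names."""
    return housing_data_dictionary.get(name, "No description available")


print(describe_column("median_house_value"))
```

Keeping the descriptions in one place like this makes it easy to check what a column means while exploring the data.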

# 1. Observing Data Structure

```python
import pandas as pd


def load_housing_dataset(housing_data_path):
    """
    Load the housing data from a CSV file.

    Parameters
    ----------
    housing_data_path : str
        Path to the housing CSV file.

    Returns
    -------
    pandas.DataFrame
        The housing data.
    """
    return pd.read_csv(housing_data_path)
```

```python
housing = load_housing_dataset(housing_data_path='Data/housing.csv')
housing.head()
```

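|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|-----------------|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |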
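
Beyond `head()`, a few other pandas calls help when observing the structure of the data. The sketch below rebuilds the five rows shown above by hand so that it runs without the CSV file; the variable name `sample` is hypothetical, and on the full dataset the same calls would be made on `housing` instead.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the sketch runs without the CSV
sample = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -122.25, -122.25],
    "latitude": [37.88, 37.86, 37.85, 37.85, 37.85],
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, 235.0, 280.0],
    "population": [322.0, 2401.0, 496.0, 558.0, 565.0],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

print(sample.shape)                               # (rows, columns)
print(sample.dtypes)                              # numeric vs. text columns
print(sample.isna().sum())                        # missing values per column
print(sample["ocean_proximity"].value_counts())   # categories and their counts
print(sample.describe())                          # summary statistics for numeric columns
```

These five rows only illustrate the calls; counts and statistics computed on them say nothing about the full dataset.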
