**Main steps**
1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare data for ML algorithms
5. Select a model and train it
6. Fine-tune the model
7. Present solution
8. Launch, monitor, and maintain your system

In [4]:
import os
import tarfile
import urllib.request

import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# The big picture

## Frame the problem

- What is the business objective? How does the company expect to benefit from the model? Depending on this, we can decide on an algorithm and a performance measure.
- What does the current solution look like? It can serve as a reference for performance and helps clarify the problem to solve.
- Frame the problem.
- Select a performance measure.
- Check assumptions.

**Example**
- Predict median district housing prices, which are fed into another component downstream.
- The current solution is an estimate made by a group of experts based on a set of rules. Estimates are usually off by 20%.
- Supervised learning task, as the data are labeled. Regression task, as a numerical value is to be predicted. More precisely, multiple regression, as multiple features go into the model, and univariate regression, as only one value is predicted per district. Batch learning, as there is no continuous flow of data, no need to adjust rapidly to incoming data, and all the data fit into memory.
- Root mean square error (RMSE), as it puts a larger weight on large errors.
- Check that the downstream component needs actual prices, not categories.
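
To illustrate why RMSE puts a larger weight on large errors, here is a minimal sketch that compares it with the mean absolute error (MAE). The prices and the helper names `rmse` and `mae` are made up for illustration and are not part of the housing dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical district prices (illustrative values only)
y_true = [200_000, 300_000, 400_000, 500_000]
even_errors = [220_000, 320_000, 420_000, 520_000]  # every prediction off by 20k
one_outlier = [200_000, 300_000, 400_000, 580_000]  # a single prediction off by 80k

for name, y_pred in [("even errors", even_errors), ("one outlier", one_outlier)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.0f}, RMSE={rmse(y_true, y_pred):.0f}")
```

Both prediction sets have the same total absolute error, so their MAE is identical, while the RMSE of the set with the single large miss is twice as high.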

## Get the data

- Load the data from source onto disk
- Load data from disk into memory
- Take a quick look at the data

### Load the data from source and store it on disk

In [6]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract it into housing_path."""
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

### Load the data into memory using Pandas & quick look

In [8]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

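|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|-----------------|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |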


### Quick look

In [22]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


total_bedrooms has fewer non-null values (20433) than there are instances (20640), indicating that some data are missing.
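
In the output above, 20640 - 20433 = 207 districts lack a total_bedrooms value. Missing values can also be counted directly instead of being read off `info()`. The sketch below uses a small made-up frame `df` as a stand-in for `housing`, so its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `housing`
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing entries per column
missing_per_column = df.isna().sum()
# Fraction of missing entries per column
missing_share = df.isna().mean()

print(missing_per_column)
print(missing_share)
```

Applied to the real data, the same two calls would be `housing.isna().sum()` and `housing.isna().mean()`.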

All data are numerical except ocean_p
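
The `info()` output lists one column with dtype `object`, ocean_proximity. A possible way to find non-numeric columns and inspect their values is sketched below. The frame `df` and its category values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `housing`
df = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "NEAR BAY"],
})

# Names of all columns that do not hold numeric data
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
# How often each category occurs in the text column
counts = df["ocean_proximity"].value_counts()

print(non_numeric)
print(counts)
```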

In [16]:
print(housing.describe())

 
 
--- Describe ---
          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   

# Terminology

**Pipeline**
A sequence of data processing components. Components typically run asynchronously. Each component pulls in data, processes it, and stores the output in a data store. The components interface with each other only via these data stores.
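
The idea can be sketched in a few lines. This is a deliberately simplified illustration: the `stores` dictionary and the two component functions are hypothetical, and the components run one after another here rather than asynchronously.

```python
# Hypothetical in-memory "data stores"; in practice these could be files or databases
stores = {"raw": [" 41.0", "21.0 ", "52.0"], "clean": [], "stats": {}}

def clean_component(stores):
    """Pull raw strings from the 'raw' store, parse them, write to 'clean'."""
    stores["clean"] = [float(value.strip()) for value in stores["raw"]]

def stats_component(stores):
    """Pull parsed numbers from the 'clean' store, write a summary to 'stats'."""
    values = stores["clean"]
    stores["stats"] = {"count": len(values), "mean": sum(values) / len(values)}

# The components never call each other; they communicate only through the stores
for component in (clean_component, stats_component):
    component(stores)

print(stores["stats"])
```

Because each component only reads from and writes to a store, one component can be replaced or rerun without touching the others.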