# Anne's notes on "Hands-on Machine Learning" (O'Reilly)

## NOTATIONS

### Machine Learning notations that we will use throughout this book:

$m$ is the number of instances in the dataset  
    For example, if you are evaluating the RMSE on a validation set of 2,000 districts, then m = 2,000.

$\pmb{x}^{(i)}$ is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and $y^{(i)}$ is its label (the desired output value for that instance).

$\pmb{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the ith row is equal to the transpose of $\pmb{x}^{(i)}$.

_h_ is your system’s prediction function, also called a hypothesis.  
When your system is given an instance’s feature vector $\pmb{x}^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(\pmb{x}^{(i)})$ for that instance
(ŷ is pronounced “y-hat”).
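
To check my reading of the notation, here is a minimal NumPy sketch. The numbers and the linear `h` are made up for illustration (they are not from the book), and the RMSE line uses the standard definition $\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\pmb{x}^{(i)}) - y^{(i)}\right)^2}$.

```python
import numpy as np

# Toy dataset (made-up numbers): 3 instances, 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # row i is the transpose of x^(i)
y = np.array([5.0, 12.0, 17.0])   # y[i] is the label y^(i)

m = X.shape[0]                    # m: number of instances

def h(x):
    """Hypothetical prediction function: a fixed linear model."""
    weights = np.array([1.0, 2.0])
    return x @ weights

# y_hat^(i) = h(x^(i)) for every instance
y_hat = np.array([h(X[i]) for i in range(m)])

# RMSE over the m instances
rmse = np.sqrt(np.mean((y_hat - y) ** 2))
print(m, y_hat, rmse)
```

Only the second prediction is off (by 1), so the RMSE here is $\sqrt{1/3}$.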

## SCIKIT-LEARN DESIGN

The consistent interface:
   - **Estimators** - any object that can estimate some parameters based on a dataset.   
    The estimation is performed by the **fit()** method, which takes one dataset (plus a second one containing the labels if supervised).  
    Any other parameter that guides the estimation is a hyperparameter and must be set as an instance variable (generally via a constructor parameter).
    
   - **Transformers** - some estimators can also transform a dataset, via the **transform()** method. The transformation usually depends on the learned parameters.
    Transformers also have the **fit_transform()** method, which fits and then transforms.
    
   - **Predictors** - some estimators can also make predictions, e.g. LinearRegression, using the **predict()** method. A predictor also has a **score()** method that measures prediction quality on a test set (and its labels if supervised).
   
An estimator's hyperparameters are accessible via public instance variables, e.g. ```imputer.strategy```, and its learned parameters via public instance variables with an underscore suffix, e.g. ```imputer.statistics_```.
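
A small sketch of the three roles, assuming `SimpleImputer` and `LinearRegression` as the example classes and a made-up four-row dataset. It is only meant to show which method belongs to which role, not a realistic pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Made-up data: one feature with a missing value, plus labels.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator: the hyperparameter is set in the constructor, fit() learns from the data.
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
print(imputer.strategy)      # hyperparameter (public instance variable)
print(imputer.statistics_)   # learned parameter (underscore suffix)

# Transformer: transform() applies what was learned; fit_transform() does both steps.
X_filled = imputer.transform(X)
X_filled_again = SimpleImputer(strategy="median").fit_transform(X)

# Predictor: fit() takes features and labels, predict() returns predictions,
# score() evaluates them (R^2 for LinearRegression).
model = LinearRegression()
model.fit(X_filled, y)
predictions = model.predict(X_filled)
r2 = model.score(X_filled, y)
```

Scoring on the same rows used for fitting is only done here to keep the sketch short; normally `score()` would get a held-out test set.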

## Chapter 2 examples applied to my data

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

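# Scikit-Learn ≥0.20 is required
# (compare the numeric parts: as plain strings, "0.9" would rank above "0.20")
import sklearn
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)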

# Common imports
import numpy as np
import pandas as pd
import os

from IPython.display import display, HTML

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [3]:
# import CPA data
CPA = pd.read_csv("../accelerator/data/processed/CPA_tokenized.csv")
CPA.drop('Unnamed: 0',axis=1, inplace=True)
CPA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5522 entries, 0 to 5521
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Order   5522 non-null   int64 
 1   Level   5522 non-null   int64 
 2   Code    5522 non-null   object
 3   Parent  5501 non-null   object
 4   Descr   5522 non-null   object
 5   tokens  5522 non-null   object
dtypes: int64(2), object(4)
memory usage: 259.0+ KB
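
The `Unnamed: 0` column dropped above is most likely the DataFrame index that was written out when the CSV was saved. A possible alternative, sketched on an in-memory stand-in rather than the real file, is to read that column back as the index with `index_col=0`:

```python
import io
import pandas as pd

# Stand-in for the real CSV (assumption: it was saved with its index,
# which leaves the first header cell empty).
csv_text = ",Order,Level,Code\n0,1208792,1,A\n1,1208793,2,01\n"

# Default read: the saved index comes back as a column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 uses that first column as the index, so nothing needs dropping.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```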


In [4]:
display(CPA["Level"].value_counts(),
        CPA.head())

6    3218
5    1357
4     576
3     262
2      88
1      21
Name: Level, dtype: int64

| | Order | Level | Code | Parent | Descr | tokens |
|---|---|---|---|---|---|---|
| 0 | 1208792 | 1 | A | | PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING | PRODUCTS OF AGRICULTURE FORESTRY AND FISHING |
| 1 | 1208793 | 2 | 01 | A | Products of agriculture, hunting and related s... | Products agriculture hunting related services |
| 2 | 1208794 | 3 | 01.1 | 01 | Non-perennial crops | Non-perennial crops |
| 3 | 1208795 | 4 | 01.11 | 01.1 | Cereals , leguminous crops and oil seeds | Cereals leguminous crops oil seeds |
| 4 | 1208796 | 5 | 01.11.1 | 01.11 | Wheat | Wheat |


In [5]:
CPA2 = CPA.copy()

# Category2: the part of Code before the first '.', for every row below Level 1
below_top = CPA2.Level != 1
CPA2.loc[below_top, 'Category2'] = CPA2.loc[below_top, 'Code'].str.split('.').str[0]

# Level 2 rows pair each Category2 with its parent letter;
# merge that lookup back so every row gets the letter as Category1
Code_parent = CPA2[CPA2.Level == 2][['Parent', 'Category2']].copy()
CPA2 = CPA2.merge(Code_parent.rename(columns={'Parent': 'Category1'}), on='Category2', how='left')
CPA2

| | Order | Level | Code | Parent | Descr | tokens | Category2 | Category1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1208792 | 1 | A | | PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING | PRODUCTS OF AGRICULTURE FORESTRY AND FISHING | | |
| 1 | 1208793 | 2 | 01 | A | Products of agriculture, hunting and related s... | Products agriculture hunting related services | 01 | A |
| 2 | 1208794 | 3 | 01.1 | 01 | Non-perennial crops | Non-perennial crops | 01 | A |
| 3 | 1208795 | 4 | 01.11 | 01.1 | Cereals , leguminous crops and oil seeds | Cereals leguminous crops oil seeds | 01 | A |
| 4 | 1208796 | 5 | 01.11.1 | 01.11 | Wheat | Wheat | 01 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5517 | 1214309 | 2 | 99 | U | Services provided by extraterritorial organisa... | Services provided extraterritorial organisatio... | 99 | U |
| 5518 | 1214310 | 3 | 99.0 | 99 | Services provided by extraterritorial organisa... | Services provided extraterritorial organisatio... | 99 | U |
| 5519 | 1214311 | 4 | 99.00 | 99.0 | Services provided by extraterritorial organisa... | Services provided extraterritorial organisatio... | 99 | U |
| 5520 | 1214312 | 5 | 99.00.1 | 99.00 | Services provided by extraterritorial organisa... | Services provided extraterritorial organisatio... | 99 | U |
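
To convince myself the split-then-merge logic does what I expect, here is the same idea on a four-row toy frame (hypothetical data shaped like the first rows of CPA). The `validate="many_to_one"` argument is my addition: it makes pandas raise an error if the lookup ever has duplicate `Category2` keys, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy frame shaped like the first rows of the CPA table (hypothetical data).
toy = pd.DataFrame({
    "Level":  [1, 2, 3, 4],
    "Code":   ["A", "01", "01.1", "01.11"],
    "Parent": [None, "A", "01", "01.1"],
})

# Category2: the part of Code before the first ".", skipped for Level 1 rows.
below_top = toy["Level"] != 1
toy.loc[below_top, "Category2"] = toy.loc[below_top, "Code"].str.split(".").str[0]

# Lookup built from Level 2 rows: Category2 -> parent letter (Category1).
lookup = (toy.loc[toy["Level"] == 2, ["Parent", "Category2"]]
             .rename(columns={"Parent": "Category1"}))

# Left merge keeps every original row; validate guards against duplicate keys.
toy = toy.merge(lookup, on="Category2", how="left", validate="many_to_one")
print(toy)
```

The Level 1 row keeps missing values in both new columns, matching row 0 of the output above.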


### Separate out training and test datasets

In [6]:
from sklearn.model_selection import train_test_split

#train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
train_set, test_set = train_test_split(CPA, test_size=0.2, random_state=42)
train_set.head()

| | Order | Level | Code | Parent | Descr | tokens |
|---|---|---|---|---|---|---|
| 2380 | 1211172 | 6 | 26.70.15 | 26.70.1 | Cinematographic cameras | Cinematographic cameras |
| 2117 | 1210909 | 3 | 25.5 | 25 | Forging, pressing, stamping and roll-forming s... | Forging pressing stamping roll-forming service... |
| 957 | 1209749 | 6 | 14.12.11 | 14.12.1 | Men's ensembles, jackets and blazers, industri... | Men 's ensembles jackets blazers industrial oc... |
| 4059 | 1212851 | 6 | 49.41.18 | 49.41.1 | Road transport services of letters and parcels | Road transport services letters parcels |
| 5396 | 1214188 | 6 | 93.29.19 | 93.29.1 | Miscellaneous recreational services n.e.c. | Miscellaneous recreational services n.e.c |
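
The Level counts are very uneven (3218 rows at Level 6 but only 21 at Level 1), so a purely random split may not preserve those proportions. One option, sketched here on a synthetic stand-in rather than the real CPA frame, is the `stratify` argument of `train_test_split`. Whether stratifying by Level is actually useful for this data is an open question for me.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with an imbalanced Level column (made-up counts).
toy = pd.DataFrame({"Level": [6] * 60 + [5] * 30 + [4] * 10})
toy["Descr"] = [f"item {i}" for i in range(len(toy))]

# stratify keeps the Level proportions (60/30/10) the same in both splits.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Level"])

print(toy["Level"].value_counts(normalize=True))
print(strat_test["Level"].value_counts(normalize=True))
```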


In [22]:
CPA2 = train_set.copy()
CPA2.plot