## **supervised machine learning**
**Supervised Machine Learning** involves helping an algorithm, called a **model**, learn to make **predictions**, based on data it has been "trained on".
- Making a prediction entails providing an output, called a **prediction**, based on various inputs called **features**.
- The correct prediction is called the **target**, since the goal of the model is to hit the target with a correct prediction.
- During training, the "correct answer" is called the **label**, since it is provided to the model as known information.
- Training a model entails providing it with **training data**, that is, inputs paired with their correct labels.
- The inputs (the "questions") are variables known as **features**.
- During training, the model is given the "correct answer" (target output) that goes with each set of input features.

**training scenario**
  - if a model is being trained to predict car prices based on certain known features of the car, the **features** (fuel efficiency, engine size, horsepower, etc.) would be a 2D array of numeric values, one row per car, and the corresponding **label** (correct answer) would be the **price** of the car. The model would be given many feature sets with correct answers in the hope that it learns to predict prices from the features alone (without being given the answers).

  - if a model is being trained to recognize pictures of dogs vs. cats, the **features** would be images of dogs or cats, and the **labels** (correct answers) would be **cat** or **dog**. If the model "sees" enough pictures of dogs vs. cats, it can eventually recognize enough unique features to distinguish between the two.

**the model only "thinks" in numbers**.
- All inputs to a model for machine learning training need to be numbers. So a "color photo" of a cat would actually be an array of pixel data, as R, G, B values.
- All outputs (predictions) made by a model are numbers as well. Again, in the case of cat vs. dog, the model's predictions would be one of two numbers, such as 1 for "cat" and 2 for "dog".
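
To make the "numbers only" idea concrete, here is a minimal sketch. The tiny 2x2 "photo" and the `label_to_number` mapping are made up for illustration; real images are far larger, and the 1/2 coding simply follows the example in the text.

```python
import numpy as np

# A hypothetical 2x2 "color photo": each pixel is an (R, G, B) triple from 0 to 255.
image = np.array(
    [
        [[255, 0, 0], [0, 255, 0]],
        [[0, 0, 255], [255, 255, 255]],
    ],
    dtype=np.uint8,
)

# Flatten the image into one row of numeric features (2 * 2 * 3 = 12 values).
features = image.reshape(-1)

# Map the string labels to numbers (an illustrative coding, as in the text).
label_to_number = {"cat": 1, "dog": 2}
labels = np.array([label_to_number[name] for name in ["cat", "dog", "dog"]])

print(image.shape)     # (2, 2, 3)
print(features.shape)  # (12,)
print(labels)          # [1 2 2]
```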

**testing data**.  
- Some of the data needs to be withheld from the model during training, so that the model can be tested later. This "withheld" data is known as **testing data**.
- **train_test_split()** is a scikit-learn function that takes a dataset and divides it into **training** and **testing** sets, each with two parts: "features" and "labels".
- 80% of the data is (typically) used during training
- 20% of the data is (typically) reserved for testing


**Predicting Car Sales using LinearRegression Machine Learning model**
- working with **scikit-learn (sklearn)** machine learning libraries
- **train_test_split()** function divides data into training and testing sets
- **df.corr(numeric_only=True)** returns a new df, the correlation matrix of the numeric columns
  - new df shape has equal number of rows and cols
  - values are correlations between row-col pairings
  - self-pairs have a correlation value of 1
- **sns.heatmap()** is a visualization of a correlation matrix as color-coded boxes
- **sns.pairplot(df)** is a grid of scatter plots comparing each pair of columns, with histograms on the diagonal

In [49]:
# install seaborn if necessary
%pip install seaborn



In [50]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
import pprint as pp

In [51]:
# from google.colab import drive
# drive.mount('/content/drive')

In [52]:
# install sklearn (Sci-Kit Learn)
# %pip install scikit-learn

In [53]:
# import machine learning libraries

# import StandardScaler so we can standardize all the data (make all the mean values 0)


In [54]:
# Load up a different dataset with more rows, so better for ML training
# the one we want is conveniently built in to sklearn


In [55]:
# load the California housing data into dataframe


In [56]:
# get shape of housing_df and print first 5 rows:
print() # (20640, 9)





In [57]:
# get the cols into a list

# pp.pprint()

In [58]:
# check for missing data


**Correlation Matrix** is a **DataFrame** made from numeric variables, where every column is compared with every other column.
- The dataframe contains all columns with one row per column.
- **index** (row names) equal column names -- not integers
- values are floats, ranging from **-1** to **1**.
- **positive** number means positive correlation
- **1.0** (the max) indicates a "self-comparison"
- **negative** number means negative correlation
- **California housing** has 9 columns, so its correlation matrix is a 9x9 grid.
- 1.0000 "self-comparisons" run diagonally, from upper left

**df.corr(numeric_only=True)** method called on a dataframe returns a correlation matrix, also as a df
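
As a small illustration of the points above, the sketch below builds a correlation matrix from a made-up four-column DataFrame (not the California housing data). The column names and values are assumptions chosen so that the correlations are easy to predict.

```python
import pandas as pd

# Tiny made-up dataset for illustration only.
df = pd.DataFrame(
    {
        "income": [1.0, 2.0, 3.0, 4.0, 5.0],
        "value": [2.0, 4.0, 6.0, 8.0, 10.0],    # rises in step with income
        "distance": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as income rises
        "city": ["a", "b", "c", "d", "e"],      # non-numeric, so it is skipped
    }
)

# numeric_only=True leaves the string column out of the calculation.
corr_matrix = df.corr(numeric_only=True)

print(corr_matrix.shape)        # (3, 3): one row and one column per numeric column
print(list(corr_matrix.index))  # row names are the column names, not integers
print(corr_matrix.loc["income", "value"])     # close to 1 (perfect positive)
print(corr_matrix.loc["income", "distance"])  # close to -1 (perfect negative)
```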



In [59]:
# make a correlation matrix dataframe from the 9-column df:


In [60]:
# Output the shape and datatype, along with the matrix df itself:
print() # (9,9)






A **Heat Map** is a color-coded correlation matrix.
- **seaborn** is the package of choice for making heat maps
- with seaborn's default colormap, positive correlations are shades of orange and tan
- the max 1 (self-correlation) is a beige color
- negative correlations are shades of red and purple
- high negative correlations tend toward black
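
A heat map along these lines might be drawn as in the sketch below. The data is randomly generated for illustration, and the `vmin`/`vmax` arguments are an optional choice made here to pin the color scale to the full -1 to 1 range; they are not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; this line is not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: "up" rises with x, "down" falls as x rises.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame(
    {
        "x": x,
        "up": x + rng.normal(scale=0.5, size=200),
        "down": -x + rng.normal(scale=0.5, size=200),
    }
)
corr_matrix = df.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True prints each correlation inside its box.
sns.heatmap(corr_matrix, annot=True, fmt=".2f", vmin=-1, vmax=1, ax=ax)
ax.set_title("Example correlation heatmap")
# fig.savefig("example-heatmap.png") would save the figure as a .png
```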

In [61]:
# Make a seaborn heat map from the 9x9 df correlation matrix
# a heatmap uses colors to show positive / negative correlation between pairs
# the maximum positive correlation is 1
# the maximum negative correlation is -1
# a value near 0 shows little to no correlation
# a "self-pair" with a correlation of 1 is by default beige

# a perfect negative correlation of -1 is black
# at around -0.7 the negative correlation color is already black

# "California Housing Prices Heatmap"

# save the heatmap as a .png
# "/california-housing-prices-heatmap.png"


A **Pair Plot** is a grid of scatter plots, one for each pair of columns
- **seaborn** is the package of choice for making pair plots
- strong positive correlations show as dots trending upward from left to right
- strong negative correlations show as dots trending downward from left to right
- self-correlations are **histograms**, which show data in frequency distribution **bins** (bars)

In [62]:
# make a df of just 6 selected columns
# 'MedInc', 'HouseAge', 'AveRooms', 'Latitude', 'Longitude', 'MedHouseVal'


In [63]:
# make a correlation matrix from the 6-col df:


In [64]:
# Make a seaborn pairplot from the 6 col df of selected features

# understanding the visualizations of the data:
# scatter plot showing trending up to the right
# indicates a strong positive correlation,
# such as engine size to horsepower
# scatter plot showing trending down to the right
# indicates a strong negative correlation,
# such as engine size to fuel efficiency
# self-pair histograms show frequency distribution


In [65]:
# make a scatterplot of just MedHouseVal vs MedInc
# this is the pair with the 0.68 (very high) Pos Correlation
# this will show a diag up to the right, through which we
# can draw a regression line

# "California Housing: Med Income vs. House Value"
# "Med Income in Tens of Thousands (USD)"
# "Price in Hundreds of Thousands (USD)"

# plot regression line through the dots


**training a linear regression machine learning model**

Using our **California housing prices**, we will train a model to
- **predict california housing prices** based on 6 **independent variables**.  

these independent variables will be "fed" to the model along with the corresponding price of the house
  - median income (MedInc)
  - latitude (Latitude)
  - longitude (Longitude)
  - average number of rooms (AveRooms)
  - house age (HouseAge)
  - average occupancy / number of people living there (AveOccup)
- Save those columns to a new df, called **X**
- 'Longitude' by itself correlates poorly with price (a magnitude of only about 0.046) BUT likely works together with 'Latitude', so the two should BOTH be included
- Save just the MedHouseVal -- the value we want to predict -- as **y**
- y is a vector of "answers" / labels; it is the dependent variable
- since this is just one column, it is a 1D vector, called a **Pandas Series**
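
The difference between the 2D **X** and the 1D **y** can be seen in a short sketch. The three-row `housing_df` below is a stand-in with illustrative values, and `feature_cols` is a name introduced here for the example.

```python
import pandas as pd

# Small stand-in for the housing DataFrame; the values are illustrative only.
housing_df = pd.DataFrame(
    {
        "MedInc": [8.3, 7.2, 5.6],
        "HouseAge": [41.0, 21.0, 52.0],
        "AveRooms": [6.9, 6.2, 8.3],
        "AveOccup": [2.6, 2.1, 2.8],
        "Latitude": [37.88, 37.86, 37.85],
        "Longitude": [-122.23, -122.22, -122.24],
        "MedHouseVal": [4.526, 3.585, 3.521],
    }
)

feature_cols = ["MedInc", "HouseAge", "AveRooms", "AveOccup", "Latitude", "Longitude"]
X = housing_df[feature_cols]   # a list of names selects a 2D DataFrame
y = housing_df["MedHouseVal"]  # a single name selects a 1D Series

print(X.shape, X.ndim, type(X).__name__)  # (3, 6) 2 DataFrame
print(y.shape, y.ndim, type(y).__name__)  # (3,) 1 Series
```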

In [66]:
# make a df of the 6 independent variables which will be used to train the model
# the goal is the model detects patterns in the 6 variables that help it to predict
# 'Longitude' by itself correlates poorly to price (0.046) BUT likely synergizes w 'Latitude' so the two should BOTH be included
# 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'AveOccup', 'Latitude', 'Longitude'
# first 5 rows
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
    #    'Latitude', 'Longitude',

In [67]:
# Feature Engineering: Make a NEW column
# ONE engineered feature: "RoomsPerPerson"
# replace any occurrence of 0 with 1 so that we never divide by 0, which would give us inf (or NaN for 0/0), neither of which we can feed to an ML model
# "RoomsPerPerson"
#  "AveRooms"
# "AveBedrms"
#  "AveOccup"
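
The divide-by-zero guard described in the cell above might look like the sketch below. The notebook does not spell out the formula for "RoomsPerPerson", so defining it as `AveRooms / AveOccup` is an assumption made purely for illustration, as is the tiny DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up rows; the middle row has an occupancy of 0 to trigger the problem.
df = pd.DataFrame({"AveRooms": [6.0, 5.0, 4.0], "AveOccup": [3.0, 0.0, 2.0]})

# Without a guard, pandas does not raise an error: the result contains inf.
unsafe = df["AveRooms"] / df["AveOccup"]
print(np.isinf(unsafe).any())  # True

# Replace 0 with 1 in the divisor before dividing.
safe_occup = df["AveOccup"].replace(0, 1)
# ASSUMPTION: RoomsPerPerson = AveRooms / AveOccup (hypothetical definition)
df["RoomsPerPerson"] = df["AveRooms"] / safe_occup
print(df["RoomsPerPerson"].tolist())  # [2.0, 5.0, 2.0]
```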

In [68]:
# Output the X df's shape, number of dimensions, data type, and first 5 rows
print() # (20640, 4) 2D DataFrame matrix




In [69]:
# y is the output / label / prediction / "answer"
# 'MedHouseVal'

In [70]:
# Output the y vector's shape, data type and first 5 values
print() # (20640,) 1D Series vector
print()





Supervised Machine Learning involves training an ML Model to recognize patterns in data in such a way that it can make predictions when it sees new data. The training process involves providing the model with inputs and correct answers. After seeing enough correct input-answer sets, the model may be able to then predict with some degree of reliability a future answer from just the inputs.

- **features** are the inputs to the model (**X**). Each **X** could consist of just one variable, or it could contain several (or many).

- **target** is the "answer" the model is trained on and will be trying to predict (**y**)  

- **X_train** is the conventional name given to the set of features set aside for training the model; this is typically 80% of the data  

- **y_train** is the conventional name given to the set of targets / "answers" set aside for training the model; this is typically 80% of the data

In [71]:
# output the train-test-split jpg
# "train-test-split.jpg"

- **train_test_split(X, y, test_size)** function takes 3 inputs here: X, y and test_size
- **test_size** is the fraction of the data that you want to hold in reserve for later testing.
- the model will not see testing data during training
- the test_size is typically 20%, so 0.2
- **train_test_split()** returns 4 sets of data:
  - **X_train**, a dataframe of X input features for *training* the model; consists of 80% of the data
  - **X_test**, a dataframe of the X input features for *testing* the model; consists of 20% of the data
  - **y_train**, a vector of y labels for training; consists of 80% of the data
  - **y_test**, a vector of y labels for testing; consists of 20% of the data
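
The four return values can be checked on a small example. The ten-row `X` and `y` below are made up, and `random_state=42` is an optional argument added here only so the shuffle is repeatable; it is not part of the notebook's instructions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: two feature columns and one target column.
X = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
y = pd.Series(np.arange(10) * 10, name="target")

# The return order is fixed: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
# Rows are shuffled, but X and y stay in sync: the indexes match.
print(X_train.index.equals(y_train.index))  # True
```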

In [72]:
# Pass X (features) and y (labels / "correct answers") to the train_test_split method.
# The method returns 4 arrays, so set the call equal to 4 variables, separated by commas

# The order does matter. The method returns:
# X_train as a dataframe,
# X_test as a dataframe
# y_train as a vector
# y_test as a vector
# Add a third argument test_size to specify the percent of data to be used for testing


In [73]:
# Output the X_train dataframe
# this is the data the model will be trained on
# it will be shown these inputs together with the matching prices
# The "prices" column is not here, as that is the y data
print() # (16512, 4) the df of 80% of rows, 4 cols, data shuffled
print() # <class 'pandas.core.frame.DataFrame'>

# L@@K: the rows are not in order because random 80% was selected for training





In [74]:
# get the mean age of house
# "HouseAge"
print()




In [75]:
# Output the y_train vector
# this data consists of the labels ("correct answers"), provided during training
# the model will be shown these labels so that it can learn how they relate to the inputs
# the model will figure out a relationship between X and y, and with that can predict prices
print() # (16512,) the vector of 80% of "answers" / labels shuffled in sync with X
print()





In [76]:
# output the first 5 X_test rows from the dataframe
# these are the inputs / features to be used for testing the model
# testing involves having the model predict prices to go with
# the X_test inputs
# accuracy is a measure of how closely the model predicts
# the actual prices of the test set
print() # (4128, 4)





In [77]:
# output the first 5 y_test values from the 1D vector
# these are "correct labels" that correspond to the X_test inputs
print() # (4128,)
print()





- **standard_scaler.fit_transform(X_train)** standardizes the input numbers
  - the mean of each column becomes 0 and its standard deviation becomes 1
  - each value is expressed as its number of standard deviations from the column mean
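
A minimal sketch of standardizing, using small made-up arrays in place of the housing features. It also shows the test set being scaled with `transform()`, which reuses the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std, then scales
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_train_scaled.std(axis=0))   # approximately [1, 1]
# The first test row equals the training means, so it scales to zeros.
print(X_test_scaled[0])
print(type(X_train_scaled).__name__)  # ndarray
```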

In [78]:
# instantiate the StandardScaler


In [79]:
# make a standard scaler version of X_train
# so in addition to setting the mean to 0
# and all other values to standard deviations from the mean
# fit_transform also strips the DataFrame down to a 2D numpy array


In [80]:
print() # (16512, 4)
print()






In [81]:
# make a standard scaler version of X_test
# do NOT use fit_transform, as that would re-compute the mean and std from the test data
# instead call transform(), which reuses the mean and std established by fit_transform(X_train)

print() # (4128, 4)





**classification** and **regression** are the two main kinds of supervised prediction
- which one applies depends on the values being predicted: **discrete** classes or **continuous** values
- **classification** refers to assigning one of a small number of **discrete** classes, which can be numeric or strings, such as:
  - 4, 6 or 8 cylinders of a car
  - mammal, reptile, fish, bird or amphibian
  - 'sold' or 'unsold'
- **regression** refers to predicting a **continuous** value from a range of numeric values, such as:
  - prices
  - distances
  - weights
  
- **LinearRegression** involves predicting *continuous* values

- **sklearn LinearRegression** is a class; instantiating it returns an untrained model, which can then be trained to predict continuous values
- linear regression prediction involves a regression line, which is the "best fit" line through a set of x-y data points, as in a scatter plot
- given a single input x, the model finds y using the slope-intercept line equation: **y = mx + b**
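
The link between the line equation and the fitted model can be sketched with made-up points that lie exactly on y = 2x + 1. After fitting, the slope m is available as `coef_` and the intercept b as `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on the line y = 2x + 1.
x = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2D: one column, one row per point
y = 2.0 * x[:, 0] + 1.0

model = LinearRegression()
model.fit(x, y)

m = float(model.coef_[0])    # slope
b = float(model.intercept_)  # y-intercept
print(m, b)  # approximately 2.0 and 1.0

# Predicting for x = 10 applies y = m*x + b, so the result is about 21.
print(model.predict(np.array([[10.0]])))
```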

In [82]:
# instantiate a LinearRegression model


### **training the model**.
**model.fit()**
- the linear regression model has a **fit()** method
- **model.fit()** takes the training data, **X_train** and **y_train**, as its inputs (use the scaled version of X_train if the features were standardized)
- **model.fit()** trains the model in place and returns the same, now *trained*, model
- the trained model can receive an input, X, and predict its y value (the answer)
- the y-value is the prediction
- keep in mind that "X" is not a single value, as in an ordinary plot, but rather a combination of the 4 variables in the feature set

In [83]:
# train the model


#### **testing the model: having the model predict y from X alone**
- **y_pred = model.predict(X_test)** takes the test features and returns one prediction per row of test data (pass the scaled test features if the model was trained on scaled data)
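
Training and predicting with several features at once can be sketched as follows. The data is synthetic: four random feature columns and a target built from hypothetical `true_weights` plus a little noise, so the learned coefficients can be compared with known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 4 features, target = weighted sum + 4.0 + small noise.
rng = np.random.default_rng(0)
true_weights = np.array([3.0, -2.0, 0.5, 1.0])  # hypothetical, for illustration
X_train = rng.normal(size=(80, 4))
y_train = X_train @ true_weights + 4.0 + rng.normal(scale=0.1, size=80)
X_test = rng.normal(size=(20, 4))

model = LinearRegression()
model.fit(X_train, y_train)     # learn one coefficient per feature
y_pred = model.predict(X_test)  # one prediction per test row

print(model.coef_.shape)  # (4,)
print(y_pred.shape)       # (20,)
print(model.coef_)        # close to true_weights
```

With more than one feature, the model generalizes y = mx + b to one slope per feature plus a single intercept.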

In [84]:
# call the predict method on the model and pass it the testing data matrix:


### **the model's accuracy score: comparing predictions to actual y values**
**model.score(X_test_scaled, y_test)** method tests the trained model
- **model.score()** method takes the testing data, **X_test_scaled** and **y_test**, as its inputs
- the trained model returned by **fit()** makes each prediction by placing the input on the fitted regression line
- for a linear regression model, the score is **R²** (the coefficient of determination), not a percentage of exactly correct answers
- a score of 1.0 would mean every prediction landed exactly on the actual value; real data is scattered around the fitted line (or plane, with 4 features), so the model is subject to error and the score is lower
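
What `model.score()` reports for a linear regression model can be checked by computing R² by hand. This is a sketch on synthetic data, with a plain slice standing in for `train_test_split`; the weights and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 3 features plus noise, so the fit cannot be perfect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

score = model.score(X_test, y_test)

# R^2 = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(np.isclose(score, manual_r2))                  # True
print(np.isclose(score, r2_score(y_test, y_pred)))   # True
```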

In [85]:

print()
# 0.6419198902452115
# the score above is R-squared, about 0.64
# this does not mean that the model
# predicted 64% of the house prices exactly right
# it means that the model's predictions account for
# about 64% of the variation in the actual prices




In [86]:
print() # (20640, 9)





In [87]:
# X the df differs from the original ca_df. We can save the df as csv:
# "cali-housing-new-cols.csv"

In [88]:
print() # (20640, 9)



