# Introduction to Machine Learning (ML) Models 

* ML = automated detection of patterns in data
  * (if we could detect those patterns ourselves, we wouldn't need ML!)


## Types of ML

* There is a bewildering array of ML algorithms out there!
  * Although simple ones often perform quite well
* most ML algorithms fall into three broad categories:
 - **Predictive algorithms**: analyze current and historical facts to make predictions about the future, such as
   * what will a customer enjoy watching?
   * "people who bought this also bought that"
   * how much will your home sell for?
 - **Classification algorithms**: learn from a body of labeled data (e.g., cancer scans), then use that knowledge to classify new observations
   * (isn't this what a medical resident does?)
 - **Time-series forecasting algorithms**: a specialized type of predictive algorithm for data ordered in time, distinct enough to be treated as a separate category
   * these are beyond the scope of this course; prediction and classification give us more than enough to work with here

## Prediction: linear regression

> **Learning goal:** familiarity with fitting linear regression models, and interpreting their output

* Arguably the simplest form of ML is to draw a line connecting two points and make predictions about where that trend might lead

> But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.

* Process: predict a quantitative *response* (the values on a Y axis) that is dependent on one or more *predictors* (values on one or more axes that are orthogonal to Y, commonly just thought of collectively as X)
  * working assumption is that the relationship between predictors and response is more or less linear
  * Goal: fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation

<img align="left" style="padding-right:10px;" src="Images/linear_regression2.png">


> Statistically, we can represent this relationship between response and predictors as:

$Y = B_0 + B_1X + E$

> Remember your geometry? $B_0$ is the intercept of our line and $B_1$ is its slope. We commonly refer to $B_0$ and $B_1$ as coefficients and to $E$ as the *error term*, which captures the variation in $Y$ that the line does not explain.
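
To make the equation concrete, here is a minimal sketch of fitting $B_0$ and $B_1$ by least squares for a single predictor. It uses plain Python and made-up toy data (the `xs` and `ys` values are hypothetical, not from the housing dataset); scikit-learn, used later in this section, does the equivalent work for many predictors at once.

```python
# Hypothetical toy data: five (x, y) points that do not line up exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = fit_line(xs, ys)

# The error term E shows up as the residuals: observed minus predicted.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(f"intercept B0 = {b0:.3f}, slope B1 = {b1:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

The residuals of a least-squares line with an intercept sum to zero, which is one quick sanity check on a fit.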

### Data exploration

* We'll begin by importing the libraries we'll need...

In [None]:
import pandas as pd # "Pandas"
import numpy as np # "Numerical Python"
import matplotlib.pyplot as plt # A plotting package
import seaborn as sns # Another plotting package
%matplotlib inline
# the above is a directive to Jupyter to ensure plots appear immediately

* now read in the data
* in this case, we’ll use a newer housing dataset

In [None]:
df = pd.read_csv('Data/Housing_Dataset_Sample.csv')
df.head()

## Exercise

In [None]:
# Do you remember the DataFrame method for looking at overall information
# about a DataFrame, such as number of columns and rows? Try it here.

> Let's also use the `describe` method to look at some statistics about the data

In [None]:
df.describe()

* in cases like this, where the column names are long, it can be helpful to view the transposition of the summary, like so:

In [None]:
df.describe().T

* let's look at the data in the **Price** column

In [None]:
# histogram of prices with a kernel density estimate overlaid
sns.histplot(df['Price'], kde=True, stat='density');

> As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.

* let's look at a simple relationship like that between house prices and the average income in a geographic area:

In [None]:
sns.jointplot(x='Avg. Area Income', y='Price', data=df);

* there is an intuitive, linear relationship between them
* Also good: the joint plot shows that the data in both columns is roughly normally distributed, so we don't have to worry about transforming the data for meaningful analysis
* let's take a quick look at all of the columns:

In [None]:
sns.pairplot(df);

> Some observations:
1. Not all combinations of columns provide strong linear relationships–some just look like blobs
  * That's nothing to worry about for our analysis
2. The plots that look like lines rather than clouds are the result of the average number of bedrooms being measured in discrete values rather than continuous ones

> It is now time to make a prediction!

### Fitting the model

* We feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well these factors can help us predict the price of a home

> To do this, we will make our first five columns the X (our predictors) and the **Price** column the Y (our response):

In [None]:
X = df.iloc[:, :5]
y = df['Price']

* We don't want to use ALL of our data!

> Data scientists divide their datasets into *training* data (the data used to fit or _create_ the model) and *test* data (data used to evaluate how accurate the model is)
* scikit-learn provides a function to do this for us–__`train_test_split`__
  * we'll use 70 percent of our data for training and reserve 30 percent of it for testing
  * note that we also supply a fourth argument, __`random_state`__
    * __`train_test_split`__ divides the data between the training and test sets at random; this number seeds the random-number generator so that you get the same split each time you run this code snippet
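
The idea behind a seeded split can be sketched in a few lines of plain Python. This is an illustration only: `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and its shuffling will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=54):
    """Shuffle row indices with a seeded generator, then slice off the test share."""
    rng = random.Random(seed)          # same seed -> same shuffle every run
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                 # stand-in for ten rows of data
train_a, test_a = simple_train_test_split(rows)
train_b, test_b = simple_train_test_split(rows)

print("train:", train_a)
print("test: ", test_a)
print("repeatable:", train_a == train_b and test_a == test_b)
```

Every row lands in exactly one of the two sets, and rerunning with the same seed gives the same division, which is what makes results reproducible.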

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=54)

> All that is left now is to import our linear regression algorithm and fit our model based on our training data:

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

In [None]:
reg.fit(X_train, y_train)

### Evaluating the model

Now, a moment of truth: let's see how our model does making predictions based on the test data:

In [None]:
predictions = reg.predict(X_test)

In [None]:
predictions

* our predictions are the house prices that our model predicts, one for every row in our test dataset.

> Remember how we mentioned that linear models have the mathematical form $Y = B_0 + B_1X + E$? With five predictors there is one coefficient per predictor. Let's look at the fitted values:

In [None]:
print(f'intercept = {reg.intercept_:,.2f}')
# pair each coefficient with the name of its predictor column
for name, coef in zip(X.columns, reg.coef_):
    print(f'{name}: {coef:,.2f}')

In algebraic terms, here is our model:

$Y = -2{,}646{,}401 + 21.59X_1 + 165{,}828.19X_2 + 121{,}323.5X_3 + 2{,}790X_4 + 15.17X_5$

where:
 - $Y=$ Price
 - $X_1=$ Average area income
 - $X_2=$ Average area house age
 - $X_3=$ Average area number of rooms
 - $X_4=$ Average area number of bedrooms
 - $X_5=$ Area population

> So, just how good is our model?
 * There are many ways to measure the accuracy of ML models (the details are beyond our scope here)
   * Linear models have a good one: the $R^2$ score (also known as the coefficient of determination)
   * An $R^2$ close to 1 indicates better prediction with less error
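
For intuition, $R^2$ can be computed by hand as one minus the ratio of the squared prediction errors to the squared deviations from the mean. The sketch below uses plain Python and hypothetical numbers; `r_squared` is an illustrative helper that follows the same definition as the scikit-learn function used in the next cell.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after using the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the naive model that always predicts the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]        # hypothetical observed responses
good = [2.8, 5.1, 7.2, 8.9]          # predictions close to the truth
baseline = [6.0, 6.0, 6.0, 6.0]      # always predicting the mean

print(f"good model R^2:     {r_squared(actual, good):.3f}")
print(f"mean-only model R^2: {r_squared(actual, baseline):.3f}")
```

A model that does no better than predicting the mean scores 0, and a perfect model scores 1.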

In [None]:
# Explained variation. A high R2 close to 1 indicates better prediction with less error.
from sklearn.metrics import r2_score

r2_score(y_test, predictions)

> The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used explain a little more than 92% of the variation in house prices in this dataset.

* We can also plot our errors to get a visual sense of how wrong our predictions were:

In [None]:
# plot errors
sns.histplot(y_test - predictions, kde=True, stat='density');

> Notice the numbers on the left axis
 * whereas a plain histogram counts how many values fall into each numeric bucket, this plot rescales the bars to a density and overlays a kernel density estimate (KDE), so the height of a bar reflects the proportion of results in that bucket rather than a raw count
 * the values on the axis are small because the total area under the curve (bar height times bar width, summed over all buckets) has to add up to 1
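
The normalization can be checked by hand. The sketch below, in plain Python with hypothetical error values, turns raw counts into densities and confirms that the bar areas sum to 1; `density_histogram` is an illustrative helper, not a Seaborn function.

```python
def density_histogram(values, n_bins):
    """Return (counts, densities, bin_width) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs in the last bin, not one past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Density = share of observations in the bin, divided by the bin width.
    densities = [c / (len(values) * width) for c in counts]
    return counts, densities, width

# Hypothetical prediction errors (actual minus predicted).
errors = [-3.0, -1.5, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 1.8, 3.0]
counts, densities, width = density_histogram(errors, n_bins=3)
area = sum(d * width for d in densities)

print("counts:   ", counts)
print("densities:", densities)
print("total area:", area)
```

Because the heights are divided by the bin width, their size depends on the units of the data: errors measured in dollars produce far smaller densities than errors measured in millions of dollars.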

> Maybe more gratifying, we can plot the predictions from our model:

In [None]:
# Plot outputs
plt.scatter(y_test, predictions, color='blue');

* Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?

## Exercise

In [None]:
# Hint: Remember to try the plt.scatter parameter alpha=.
# It takes values between 0 and 1.

> **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.

## Classification: logistic regression

> **Learning goal:** understand how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output

* Let's pivot to discussing classification
  * If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points

* While linear regression is used to predict quantitative responses (or continuous numeric values, such as home prices), *logistic* regression is used for classification problems
  * in this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function
  * a sigmoid function takes any value and transforms it to lie between 0 and 1, which can be read as the probability of belonging to a class; the goal is for the predictors to map close to 1 when an observation belongs to the class and close to 0 when it does not

<img align="left" style="padding-right:10px;" src="Images/logistic_regression.png?">
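
A minimal sketch of the sigmoid in plain Python follows. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, and the 0.5 cutoff is a common default rather than something fixed by the method.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical model: the linear part works just like linear regression,
# and the sigmoid converts its output into a probability.
b0, b1 = -4.0, 2.0
for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    z = b0 + b1 * x
    print(f"x={x:.1f}  z={z:+.1f}  p={sigmoid(z):.3f}  class={predict_class(z)}")
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of exactly 0 gives a probability of 0.5.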

> to demonstrate, let's do something a little different and try a historical dataset–the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning
  * the class we want to predict is whether a passenger survived 

The dataset has 12 variables:

 - **PassengerId**
 - **Survived:** 0 = No, 1 = Yes
 - **Pclass:** Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
 - **Name**
 - **Sex**
 - **Age**
 - **SibSp:** Number of siblings or spouses aboard the *Titanic*
 - **Parch:** Number of parents or children aboard the *Titanic*
 - **Ticket:** Passenger ticket number
 - **Fare:** Passenger fare
 - **Cabin:** Cabin number
 - **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
df = pd.read_csv('Data/train_data_titanic.csv')
df.head()

In [None]:
df.info()

> One reason that the *Titanic* dataset is popular for classification is that it provides opportunities to practice preparing data for analysis
 * To prepare this dataset for analysis, we need to perform a number of tasks:
  - Remove extraneous variables
  - Check for multicollinearity 
  - Handle missing values

### Remove extraneous variables

* names of passengers and their ticket numbers clearly won't help our model, so we can drop those columns

In [None]:
df.drop(['Name', 'Ticket'], axis=1, inplace=True)

* there may be additional variables that won't add classifying power to our model
  * to find them we will need to look for correlation between variables

### Check for multicollinearity

> If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of *multicollinearity* in our model
  * basically, we are exaggerating the effect of a variable by including it "twice"
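
As a rough illustration of what "correlated predictors" means, the sketch below computes the Pearson correlation coefficient by hand in plain Python. The six passengers are made-up values, not rows from the dataset, and `pearson` is an illustrative helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical passengers: fares fall as the class number rises,
# while age has little to do with class.
pclass = [1, 1, 2, 2, 3, 3]
fare = [80.0, 70.0, 30.0, 25.0, 8.0, 7.0]
age = [22.0, 45.0, 31.0, 60.0, 19.0, 40.0]

print(f"Pclass vs Fare: {pearson(pclass, fare):+.2f}")
print(f"Pclass vs Age:  {pearson(pclass, age):+.2f}")
```

A coefficient near +1 or -1 means one variable largely predicts the other, so keeping both adds little new information to the model.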

* Seaborn has a nice function called __`heatmap`__, which we can use to visualize the correlations between pairs of variables

In [None]:
# numeric_only=True skips text columns such as Sex and Embarked
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm');

* we can see a high correlation between Fare and Pclass...why?

In [None]:
# let's drop Fare as a result
df.drop(['Fare'], axis=1, inplace=True)
df.head()

### Handle missing values

* we now need to address missing values

In [None]:
# missing
df.isnull().sum()

> We could try to do something about those missing values
 * However, if any pattern does emerge in the data that involves **Cabin**, it will be highly correlated with both **Pclass** and **Fare**
 * And the vast majority of those values are missing, so it could be difficult to reconstruct

In [None]:
df.drop('Cabin', axis=1, inplace=True)

> Let's now run `info` to see if there are columns with just a few null values.

In [None]:
df.info()

> Note: given that 1,503 died in the *Titanic* tragedy (and that we know that some survived), this data set clearly does not include every passenger on the ship (and none of the crew)

* back to missing values
  * **Age** is missing several values, as is **Embarked**

In [None]:
df['Age'].isnull().value_counts() # another way to look at the above

> As we saw above, **Age** isn't really correlated with **Fare**, so it is a variable that we want to eventually use in our model
 * that means that we need to do something with those missing values
   * we could just fill in the missing ones with some known value, such as the mean or median
   * ...let's check to see if our median age is the same for both sexes

In [None]:
df.groupby('Sex')['Age'].median().plot(kind='bar');

In [None]:
# or a better way...
df.groupby(['Sex'])['Age'].describe().T

> The median ages are different for men and women sailing on the *Titanic*, so we should handle the missing values accordingly
* a sound strategy is to replace the missing ages for passengers with the median age, based on sex
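
The logic of group-wise imputation can be sketched in plain Python before applying it with pandas. The seven passengers below are hypothetical, and `None` stands in for a missing age.

```python
from statistics import median

# Hypothetical passengers; None marks a missing age.
passengers = [
    {"Sex": "male", "Age": 30.0},
    {"Sex": "male", "Age": None},
    {"Sex": "male", "Age": 40.0},
    {"Sex": "female", "Age": 20.0},
    {"Sex": "female", "Age": None},
    {"Sex": "female", "Age": 28.0},
    {"Sex": "female", "Age": 24.0},
]

# Step 1: collect the known ages for each sex and take the median.
known_ages = {}
for p in passengers:
    if p["Age"] is not None:
        known_ages.setdefault(p["Sex"], []).append(p["Age"])
group_median = {sex: median(ages) for sex, ages in known_ages.items()}

# Step 2: fill each missing age with the median of that passenger's group.
filled = [
    {**p, "Age": p["Age"] if p["Age"] is not None else group_median[p["Sex"]]}
    for p in passengers
]

print(group_median)
for p in filled:
    print(p)
```

The medians are computed only from the known ages, and the original records are left untouched so the filled values can be compared against them.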

In [None]:
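# transform returns a result aligned to df's original index, so it can be assigned back directly
df['Age'] = df.groupby('Sex')['Age'].transform(lambda x: x.fillna(x.median()))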

> Any other missing values?

In [None]:
df.isnull().sum()

> We are missing two values for **Embarked**. Check to see how that variable breaks down:

In [None]:
df['Embarked'].value_counts()

* the vast majority of passengers embarked on the *Titanic* from Southampton, so we will just fill in those two missing values with the most common value (the mode): Southampton

In [None]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())
df['Embarked'].value_counts()

In [None]:
df.isnull().sum()

> Now we need to turn the categorical values ('Sex' and 'Embarked') into numbers so we can perform ML on the data
 * numerical equivalents of categorical variables are called "dummy variables"
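
What dummy encoding does can be sketched in plain Python. `dummy_encode` is a hypothetical helper written for illustration; it imitates the sorted-category ordering and the drop-first behavior that the pandas call in the next cell requests.

```python
def dummy_encode(values, prefix, drop_first=True):
    """One 0/1 indicator per category; optionally drop the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

embarked = ["S", "C", "Q", "S"]        # hypothetical ports of embarkation
encoded = dummy_encode(embarked, prefix="Embarked")

for original, row in zip(embarked, encoded):
    print(original, row)
```

Dropping the first category loses no information: a passenger with 0 in both remaining columns must have embarked at Cherbourg. It also avoids adding a column that is fully predictable from the others.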

In [None]:
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'], drop_first=True)
df.head()

Let's do a final look at the correlation matrix to see if there is anything else we should remove.

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr(), cmap='coolwarm');

> Note: we need to remove **Survived** from our X DataFrame because it will be our response, y:

In [None]:
X = df.drop(['Survived'], axis=1)
y = df['Survived']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=67)

> Now we import and fit the logistic regression model:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='liblinear') # avoid deprecation warning

In [None]:
lr.fit(X_train, y_train)

In [None]:
predictions = lr.predict(X_test)

### Evaluate the model

In contrast to linear regression, logistic regression does not produce an $R^2$ score by which we can assess the accuracy of our model. In order to evaluate that, we will use a classification report, a confusion matrix, and the accuracy score.

#### Classification report

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

> The classification report summarizes performance for both survivors and non-survivors with four scores; for simplicity, we'll think of them in terms of document retrieval (such as a Google search)
 - **Precision:** the percentage of responses that are valid given the query
 - **Recall:** the percentage of documents we were supposed to return that we actually did return
 - **F1 score:** The harmonic mean (a kind of average) of precision and recall.
 - **Support:** The number of true instances for each label.
 
* Why so many ways of measuring accuracy for a model?
  * Well, success means different things in different contexts
  * Imagine that we had a model to diagnose cancer
   * such a system should maximize _recall_; that is, we want to be sure to identify every person who has cancer
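
Here is a minimal sketch of how precision, recall, and the F1 score are computed for one class, using plain Python and ten hypothetical labels. `precision_recall_f1` is an illustrative helper and assumes at least one actual and one predicted positive.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Scores for one class; assumes at least one actual and one predicted positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)   # of everything we flagged, how much was right?
    recall = tp / (tp + fn)      # of everything we should have flagged, how much did we find?
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```

In this toy case the model misses half of the actual positives, so recall is lower than precision; a cancer-screening system would want the opposite trade-off.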

In [None]:
print(classification_report(y_test, predictions))

#### Confusion matrix

> a confusion matrix is another way to present this same information, this time with raw scores
 * rows show the true condition, non-survivors (0) on the top, survivors (1) on the bottom
 * columns show the predicted condition, non-survivors on the left, survivors on the right
 * the matrix below shows that our model correctly predicted 146 non-survivors (true negatives) and incorrectly classified another 16 non-survivors as survivors (false positives)
 * on the other hand, our model correctly predicted 76 survivors (true positives) and incorrectly classified 30 more survivors as non-survivors (false negatives)
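
The layout is easier to remember after building one by hand. The sketch below uses plain Python and ten hypothetical labels, and follows the scikit-learn convention in which rows are true labels and columns are predicted labels, ordered 0 then 1; `confusion_counts` is an illustrative helper.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for labels 0/1: rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix

print(matrix)
print(f"true negatives={tn}, false positives={fp}, "
      f"false negatives={fn}, true positives={tp}")
```

The correct predictions sit on the diagonal (top left and bottom right); everything off the diagonal is a mistake.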

In [None]:
print(confusion_matrix(y_test, predictions))

* let's dress up the confusion matrix a bit to make it a little easier to read:

In [None]:
# scikit-learn puts true labels on the rows and predicted labels on the columns, ordered 0 then 1
pd.DataFrame(confusion_matrix(y_test, predictions), index=['True Not Survived', 'True Survived'], columns=['Predicted Not Survived', 'Predicted Survived'])

#### Accuracy score

* finally, our accuracy score tells us the fraction of correctly classified samples; in this case (146 + 76) / (146 + 76 + 30 + 16), or about 0.83.
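
The arithmetic can be checked directly. This small sketch uses the four counts quoted in the text; the numbers in your own run may differ if the data or the split changes.

```python
# Counts quoted in the text (the output of your own run may differ).
correct = 146 + 76       # predictions that matched the true label
incorrect = 30 + 16      # predictions that did not

accuracy = correct / (correct + incorrect)
error_rate = incorrect / (correct + incorrect)

print(f"accuracy   = {accuracy:.3f}")
print(f"error rate = {error_rate:.3f}")
```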

In [None]:
print(accuracy_score(y_test, predictions))

> Not bad for an off-the-shelf model with no tuning!

## Classification: decision trees

> **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.

If logistic regression uses observations about variables to swing a metaphorical needle between 0 and 1, classification based on decision trees programmatically builds a series of yes/no decisions to classify items.

<img align="left" style="padding-right:10px;" src="Images/decision_tree.png">
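
To give a feel for how a tree chooses its yes/no questions, the sketch below scores candidate splits by Gini impurity, which is the default splitting criterion of scikit-learn's `DecisionTreeClassifier`. The six rows are hypothetical, and `gini` and `split_impurity` are illustrative helpers rather than library functions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 means the group is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_impurity(rows, feature, threshold):
    """Weighted impurity of the two groups produced by 'feature <= threshold?'."""
    left = [r["Survived"] for r in rows if r[feature] <= threshold]
    right = [r["Survived"] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical passengers (not taken from the dataset).
rows = [
    {"Sex_male": 0, "Pclass": 1, "Survived": 1},
    {"Sex_male": 0, "Pclass": 2, "Survived": 1},
    {"Sex_male": 0, "Pclass": 3, "Survived": 0},
    {"Sex_male": 1, "Pclass": 1, "Survived": 1},
    {"Sex_male": 1, "Pclass": 2, "Survived": 0},
    {"Sex_male": 1, "Pclass": 3, "Survived": 0},
]

for feature, threshold in [("Sex_male", 0.5), ("Pclass", 1.5), ("Pclass", 2.5)]:
    score = split_impurity(rows, feature, threshold)
    print(f"{feature} <= {threshold}: weighted Gini = {score:.3f}")
```

The question with the lowest weighted impurity separates the classes most cleanly, so the tree asks it first and then repeats the process within each branch until it reaches its maximum depth.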

> Let's look at this in practice with the same *Titanic* dataset we used with logistic regression.

In [None]:
from sklearn import tree

In [None]:
tr = tree.DecisionTreeClassifier(max_depth=2)

## Exercise

In [None]:
# Using the same split data as with the logistic regression,
# can you fit the decision tree model?
# Hint: Refer to code snippet for fitting the logistic regression above.

In [None]:
tr.fit(X_train, y_train)

> Once fitted, we get our predictions just like we did in the logistic regression example above:

In [None]:
tr_predictions = tr.predict(X_test)

In [None]:
# rows are true labels, columns are predicted labels, ordered 0 then 1
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
             index=['True Not Survived', 'True Survived'],
             columns=['Predicted Not Survived', 'Predicted Survived'])

In [None]:
print(accuracy_score(y_test, tr_predictions))

> One of the great attractions of decision trees is that the models are readable by humans. Let's visualize to see it in action. (Note, the generated graphic can be quite large, so scroll to the right if the generated graphic just looks blank at first.)

In [None]:
from sklearn.tree import export_graphviz

# with out_file set, export_graphviz writes the file and returns None
export_graphviz(tr, out_file='titanic.dot',
                feature_names=list(X.columns),
                class_names=['Perished', 'Survived'],
                filled=True, rounded=True)

In [None]:
!dot -Tpng titanic.dot -o titanic.png
from IPython.display import Image
Image('titanic.png')

> **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.