<h1>Machine learning</h1>
<li>Creating programs that learn
<li>The "learned" knowledge is not explicitly contained in the program
<li>The program is designed to learn using <b><span style="color:darkred">real world data</span></b>

<h2>Basic ideas</h2>
<li>The program contains a learning algorithm
<li>The program is given data
<li>The program applies the learning algorithm to the data and figures stuff out!

<h2>Terminology</h2>
<li><span style="color:darkblue">Feature</span>: A (measurable) property of the learning domain
<li><span style="color:darkblue">Feature set</span>: The set of features that are useful for learning in a given domain and a given problem
<ul>
<li>gender, age, income, other demographic data for predicting credit risk
<li>position of pupil, size of nose, presence/absence of dimples in facial data for facial recognition
<li>color, intensity of pixels in image data for image recognition
<li>moving averages, departures, technical indicators, price in stock price prediction
</ul>
<li><span style="color:darkblue">Input features</span>: The observable (useful) features in the domain
<li><span style="color:darkblue">Output features</span>: A feature that is being learned or predicted
<ul>
<li>In stock price prediction: moving averages, departures, technical indicators may be input features and the future return the output feature
<li>In face recognition: various observable facial features are the input features and the person (name?) the output feature
</ul>
<li><span style="color:darkblue">Independent variables</span>: Synonymous with input feature, used in statistical learning
<li><span style="color:darkblue">Dependent variables</span>: Synonymous with output feature in prediction problems


<h2>Types of learning</h2>
<li><span style="color:darkblue">Supervised learning</span>: The data set contains paired input and output features and the machine learns how to get the output from the given input. In supervised learning, both input as well as output features are used in learning
<li><span style="color:darkblue">Unsupervised learning</span>: The data set contains features and the machine tries to induce concepts or knowledge from this feature set, typically by organizing the data into "like" clusters. In unsupervised learning, only input features are used in learning

<h1>Examples of machine learning algorithms</h1>
<li><span style="color:darkblue">Regression</span>: The machine learns a mathematical relationship between the input and output features. Regression is a supervised learning technique
<li><span style="color:darkblue">Classification and regression trees (CART)</span>: The machine learns a set of rules that relate the input and output features. CART is a supervised learning technique
<li><span style="color:darkblue">Clustering</span>: Clustering algorithms group the input feature set into "like" groups, usually using a distance metric. Clustering algorithms are unsupervised learning techniques
<li><span style="color:darkblue">Neural networks</span>: Used for "deep learning". Designed to mimic the brain, neural networks are directed, weighted, multi-layered graphs. The first layer is an input layer that corresponds to the input feature set and the final layer is an output layer that corresponds to the output feature set. The graph contains one or more hidden layers and uses an algorithm to compute the weights on the edges to learn the relationship between input and output features. Neural networks are most often used as supervised learning techniques

<h1>Machine learning using Regression</h1>
<li>The machine learns a mathematical relationship between the input feature set and output feature values
<li>All data must be numerical
<li>There is an implied ordering in both the independent variable values and the dependent variable values


<h3>Types of regression</h3>
<span style="color:darkblue">Linear regression</span>: Learns a mathematical linear relationship between input and output values
<li>Linear regression fits a line to the data
<li>Single output values
<li>Estimates a function of the form:$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon $$
<li>The x values are independent variables
<li>The y value is the dependent variable
<li>The alpha is a constant intercept term
<li>The betas are the feature weights
<li>The epsilon is an error term
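For a single input feature, the alpha and beta in the equation above have a simple closed form. The following is a minimal sketch in plain Python, not the method used in the rest of this notebook; `fit_line` is a hypothetical helper name and the toy data is made up so that the answer is known in advance.

```python
# Minimal sketch: ordinary least squares for ONE input feature, in closed form.
# fit_line is a hypothetical helper, not part of any library.

def fit_line(xs, ys):
    """Return (alpha, beta) minimising the squared error of y = alpha + beta*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # beta = covariance(x, y) / variance(x)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx
    # the fitted line always passes through the point (x_mean, y_mean)
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Toy data generated from y = 1 + 2x with no noise, so the fit is exact.
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print("intercept:", alpha, "slope:", beta)
```

With noisy real-world data the fitted line will not pass through every point; the epsilon term absorbs the difference.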


<h2>Linear regression example</h2>

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import statsmodels.api as sm

x=(1,2,3,4,5,6,6,7,9,10)
y=(2,4,3,5,6,7,8,9,10,11)
plt.scatter(x,y)
# add_constant prepends the intercept column, so params[0] is the intercept
# and params[1] is the slope
results = sm.OLS(y,sm.add_constant(x)).fit()
x_plot = np.linspace(0,11)
plt.plot(x_plot, results.params[0] + x_plot*results.params[1])
plt.show()

<h2>Logistic regression</h2>
<li>Predicts a categorical dependent variable value
<li>Binomial logistic regression predicts two category values (0 or 1)
<li>Typically, the function predicts:
<ul> <li>1 if $$ \alpha + \beta_1 x_1 + \epsilon > 0 $$
<li>0 otherwise
</ul>
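The decision rule above can be sketched in a few lines. This is an illustration only: the coefficients are made up rather than learned, the error term is left out, and `predict_class` is a hypothetical helper name. The sketch also shows that "linear score greater than 0" is the same test as "sigmoid probability greater than 0.5".

```python
import math

def sigmoid(z):
    """Map a linear score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x1, alpha, beta_1):
    """Predict 1 when the linear score alpha + beta_1*x1 is above 0, else 0."""
    z = alpha + beta_1 * x1
    return 1 if z > 0 else 0

# Hypothetical coefficients, chosen by hand for illustration (not fitted).
alpha, beta_1 = -3.0, 1.5

for x1 in (1.0, 2.0, 3.0):
    z = alpha + beta_1 * x1
    print("x1 =", x1, "score =", z, "probability =", round(sigmoid(z), 3),
          "class =", predict_class(x1, alpha, beta_1))
```

A score of exactly 0 corresponds to a probability of exactly 0.5 and falls into the "0 otherwise" case.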

<h1>Classification using regression</h1>
<li>We will use linear regression to predict binomial categories
<li>And examine some of the ways in which we evaluate our ML result


<h3>The data</h3>
<li>Rocks vs Mines (https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-dat)
<li>Though the data is about underwater mines, imagine:
<ol>
<li>You're looking out at a number of fields that have rock like objects strewn all over the place
<li>Some of those objects are actually mines
<li>You have special mine detecting equipment 
<li>The equipment sends sound waves at different frequencies; the waves hit the objects and the equipment reports back some sort of measurements
<li>Luckily (for you), you have prior sonar data along with whether a rock like object was a rock or a mine 
<li>You can use this to get your "machine" to learn how to identify rocks and mines
<li>And then test the results by sending your army across the field - and get an estimate of what it will cost you!

<h2>Read the data</h2>

<h3>Data set 1: Rocks vs. Mines</h3>
<li>Independent variables: sonar soundings at different frequencies
<li>Dependent variable (target): Rock or Mine

In [None]:
import pandas as pd
from pandas import DataFrame
url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url,header=None)
df.describe()

In [None]:
df.info()

<h4>The data</h4>
<li>60 float64 columns. These are the sonar readings and will form our <span style="color:blue">feature set</span>
<li>One object column. This will be our target/output/dependent variable

<h2>Generate a few summary statistics</h2>

<h4>See all columns</h4>

In [None]:
pd.options.display.max_columns=70
df.describe()

<h4>Examine the distribution of the data in column 4</h4>

<li>Quartile 1: from .0067 to .03805
<li>Quartile 2: from .03805 to .0625
<li>Quartile 3: from .0625 to .100275
<li>Quartile 4: from .100275 to .401

<h4>Quartile 4 spans a much wider range than the other quartiles. This raises the possibility of outliers</h4>
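One common rule of thumb for checking this (an added convention, not something the notebook itself relies on) flags values more than 1.5 times the interquartile range beyond the first or third quartile. A small worked example using only the quartile boundaries listed above:

```python
# Quartile boundaries for column 4, copied from the list above.
col_min, q1, q3, col_max = 0.0067, 0.03805, 0.100275, 0.401

# Interquartile range: the width of the middle half of the data.
iqr = q3 - q1

# 1.5 * IQR fences (a rule-of-thumb convention, assumed here for illustration).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("IQR:", round(iqr, 6))
print("fences:", round(lower_fence, 6), "to", round(upper_fence, 6))
print("maximum lies beyond the upper fence:", col_max > upper_fence)
print("minimum lies beyond the lower fence:", col_min < lower_fence)
```

By this rule the maximum would count as a potential outlier while the minimum would not, which is consistent with the long upper quartile.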

<h4> A Quantile - Quantile (qq) plot can help identify outliers</h4>
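<li>x-axis: theoretical quantiles of a standard normal distribution (roughly -3 to +3)
<li>y-axis: the observed values, ordered from lowest to highest
<li>a straight reference line shows where the points would fall if the data were normally distributed
<li>the closer the points are to the line, the more closely the data follows a normal distribution; points far from the line at either end suggest outliers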
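To make the construction concrete, here is a rough sketch of how the points of a qq plot can be computed by hand. It uses a simple (i + 0.5)/n plotting position, which is an assumption and may differ slightly from what a library routine uses; `qq_points` is a hypothetical helper and the sample values are made up.

```python
from statistics import NormalDist

def qq_points(values):
    """Pair each ordered value with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n spreads the n points evenly over the interval (0, 1)
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Made-up sample with one suspiciously large value at the end.
sample = [0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.40]
points = qq_points(sample)
for theoretical_q, observed in points:
    print(round(theoretical_q, 2), observed)
```

Plotting the second column against the first gives the qq plot; the last point would sit well above a line through the others.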

In [None]:
import numpy as np 
import pylab 
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('seaborn')
%matplotlib inline
   
stats.probplot(df[4], dist="norm", plot=pylab)
pylab.show()

<h4>Examine the dependent variable</h4>

In [None]:
df[60].unique()


<h4>Examine correlations</h4>

In [None]:
df.corr()


In [None]:
import matplotlib.pyplot as plot
plot.pcolor(df.corr(),cmap='coolwarm') #https://matplotlib.org/examples/color/colormaps_reference.html
plot.show()

In [None]:
df.corr()[0].plot()


<h4>Features highly correlated with each other = not good! (they carry redundant information)</h4>
<h4>Features with low correlation to each other = good </h4>
<h4>Features correlated with the target (dv) = good (high predictive power)</h4>
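As a rough illustration of the first point, the sketch below lists pairs of features whose correlation exceeds a cutoff. The cutoff of 0.9, the helper names `pearson` and `correlated_pairs`, and the three toy features are all assumptions made for this example; they are not taken from the sonar data.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlated_pairs(columns, cutoff=0.9):
    """Return (name, name, r) for every pair with |r| above the cutoff."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > cutoff:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Toy feature set: f1 is roughly 2 * f0, f2 is unrelated to both.
features = {
    "f0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f1": [2.1, 3.9, 6.2, 8.0, 9.8],
    "f2": [5.0, 1.0, 4.0, 2.0, 3.0],
}
redundant = correlated_pairs(features)
print(redundant)
```

Only the f0/f1 pair is reported, so one of those two features adds little that the other does not already provide.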

<h2>Training a classifier on Rocks vs Mines</h2>
<li><span style="color:blue">scikit-learn</span>: A Python machine learning library
<li>!pip install scikit-learn

In [None]:
import numpy as np
#import random
#
#from sklearn.metrics import roc_curve, auc
#import pylab as pl


In [None]:
import pandas as pd
from pandas import DataFrame
url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url,header=None)
df.describe()

<h4>Convert labels R and M to 0 and 1</h4>

In [None]:
df[60]=np.where(df[60]=='R',0,1)

<h2>Training and testing</h2>
<li><span style="color:blue">Training dataset</span>: The model is "fit" using a training sample
<li><span style="color:blue">Testing dataset</span>: The "fitted" model is evaluated on a testing sample
<li><span style="color:blue">Validation dataset</span>: Sometimes, a dataset is used to "fine tune" model parameters after training but before testing

<li>We'll use a training and testing dataset
<li>And separate out the feature set and target value for each dataset
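As an aside, when a validation dataset is wanted, a three-way split could be sketched as follows. This is only an illustration in plain Python: `three_way_split`, the 15% fractions, the seed, and the 100 stand-in row indices are all assumptions. The notebook itself continues with a two-way split.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the rows, then cut them into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_part = shuffled[:n_test]
    validation_part = shuffled[n_test:n_test + n_val]
    train_part = shuffled[n_test + n_val:]
    return train_part, validation_part, test_part

# Stand-in for row indices of a data set (illustrative size only).
rows = list(range(100))
train_rows, validation_rows, test_rows = three_way_split(rows)
print(len(train_rows), len(validation_rows), len(test_rows))
```

Every row lands in exactly one of the three parts, so the validation and testing samples never overlap with the training sample.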

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3)
x_train = train.iloc[0:,0:60]
y_train = train[60]
x_test = test.iloc[0:,0:60]
y_test = test[60]
y_train

<h2>Build the model and fit the training data</h2>
<li>The linear regression package is in sklearn's linear_model library
<li>We create a linear regression model object
<li>And give it our training data to "fit" the model

In [None]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x_train,y_train)


<h2>Evaluating the model</h2>

<h4>Generate mean square errors and R-Square values</h4>

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
y_pred_training = model.predict(x_train)
y_pred_testing = model.predict(x_test)
training_msq = mean_squared_error(y_train,y_pred_training) # arguments are (y_true, y_pred)
testing_msq = mean_squared_error(y_test,y_pred_testing)
print(training_msq,testing_msq)

In [None]:
print('Train R-Square:',r2_score(y_train,y_pred_training))
print('Test R-Square:',r2_score(y_test,y_pred_testing))


<h3>These are horrible!</h3>
<b>But do we really care?</b>
<li>Focus on the problem
<li>Regression is predicting continuous values (mostly between 0 and 1, though linear regression output is not bounded to that range)
<li>But all we need is a 0 (rock) or a 1 (mine)
<li>We may not care about mis-identifying rocks as mines as long as we identify mines correctly


<li>We want to predict categories: Rocks or Mines
<li>But we're actually getting a continuous value
<li>Not the same thing. So R-Square and msq probably don't mean a whole lot

<li>We need to convert the continuous values into categorical 1s and 0s. We can do this by fixing a threshold value between 0 and 1
<li>Values greater than the threshold are 1 (Mines). Values less than or equal to the threshold are 0 (Rocks)
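The thresholding step can be sketched as below. `to_labels` is a hypothetical helper and the scores are made-up regression outputs (note that a couple fall outside the 0 to 1 range, which linear regression allows).

```python
def to_labels(scores, threshold=0.5):
    """Values greater than the threshold become 1 (mine), all others 0 (rock)."""
    return [1 if s > threshold else 0 for s in scores]

# Made-up continuous predictions for six objects.
scores = [-0.1, 0.2, 0.5, 0.51, 0.9, 1.2]

labels_half = to_labels(scores, 0.5)
labels_quarter = to_labels(scores, 0.25)
print(labels_half)     # [0, 0, 0, 1, 1, 1]
print(labels_quarter)  # [0, 0, 1, 1, 1, 1]
```

Lowering the threshold turns more objects into predicted mines, which is the lever explored later in the notebook.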

<h3>Let's examine predictions versus actuals for the training and testing datasets</h3>
<li>Quick visual on how our model is doing
<li>Also a quick picture on how the model is discriminating
<li>Since the model is predicting continuous values
<ol>
<li>we'll create a list containing predicted values for mines
<li>and a list containing predicted values for rocks
<li>and graph the two lists

In [None]:
mines = list()
rocks = list()
actual = np.array(y_train)
for i in range(len(y_train)):
    # y_pred_training holds the model's predictions on the training set
    if actual[i]: #mine
        mines.append(y_pred_training[i])
    else:
        rocks.append(y_pred_training[i])


In [None]:
df_m = pd.DataFrame(mines)
df_r = pd.DataFrame(rocks)
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df_m)
b_heights, b_bins = np.histogram(df_r, bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')



<h3>Repeat for the testing dataset</h3>

In [None]:
mines = list()
rocks = list()
actual = np.array(y_test)
for i in range(len(y_test)):
    # y_pred_testing holds the model's predictions on the testing set
    if actual[i]:
        mines.append(y_pred_testing[i])
    else:
        rocks.append(y_pred_testing[i])
df_m = pd.DataFrame(mines)
df_r = pd.DataFrame(rocks)
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df_m)
b_heights, b_bins = np.histogram(df_r, bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')



<h3>Confusion matrix and the threshold</h3>
<li>A confusion matrix evaluates each data point in the testing dataset to see which of the following categories it falls into: 
<ol>
<li><span style="color:blue">true positive</span>: model predicts a mine (1) and it is a mine
<li><span style="color:blue">false positive</span>: model predicts a mine but it is a rock (0)
<li><span style="color:blue">true negative</span>: model predicts a rock and it is a rock
<li><span style="color:blue">false negative</span>: model predicts a rock and it is actually a mine
</ol>
<li>It then reports the number (or proportion) of cases in each category

In [None]:
def confusion_matrix(predicted, actual, threshold):
    """Count [tp, fp, tn, fn] after thresholding the continuous predictions."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = 0.0
    fp = 0.0
    tn = 0.0
    fn = 0.0
    for i in range(len(actual)):
        predicted_mine = predicted[i] > threshold #values above the threshold are mines (1)
        if actual[i] == 1: #actual labels are already 0/1, so they are not thresholded
            if predicted_mine:
                tp += 1.0 #correctly predicted positive
            else:
                fn += 1.0 #incorrectly predicted negative
        else:              #labels that are 0 (negative examples)
            if predicted_mine:
                fp += 1.0 #incorrectly predicted positive
            else:
                tn += 1.0 #correctly predicted negative
    return [tp, fp, tn, fn]



In [None]:
testing_predictions = model.predict(x_test)
tp,fp,tn,fn = confusion_matrix(testing_predictions,np.array(y_test),0.5)
print(tp,fp,tn,fn)

<h2>Evaluation metrics</h2>
Using the results of the confusion matrix, we can calculate a number of metrics that will help evaluate the model
<ol>
<li><span style="color:blue">true positive rate</span> or <span style="color:blue">sensitivity</span> or <span style="color:blue">recall</span>
<li><span style="color:blue">true negative rate</span> or <span style="color:blue">specificity</span>
<li><span style="color:blue">false positive rate</span> or <span style="color:blue">fall out</span>
<li><span style="color:blue">precision</span> or <span style="color:blue">positive predictive value</span>
<li><span style="color:blue">f-score</span>
<li><span style="color:blue">accuracy</span>
<li><span style="color:blue">misclassification rate</span>


</ol>

<h3>True Positive rate/sensitivity/recall</h3>
True Positive Rate is the proportion of positive cases that are correctly identified as positive
$$ tpr = \frac{tp}{(tp + fn)} $$
Sensitivity is a measure of how good our model is at identifying the positive condition. A value of 1, for example, means that every positive value (every mine) was correctly identified by the model. 
<li>Percentage of persons with a disease correctly identified as having that disease
<li>Percentage of "fake news" items correctly identified as fake news
<li>Percentage of consumers who will click on an ad correctly identified as clickers
<li>Percentage of customers who will move to a new cell phone carrier at the end of their contract correctly identified as movers

In [None]:
tpr = tp/(tp+fn)
print("Proportion of mines correctly identified as mines:",tpr)

<h3>True Negative Rate or Specificity</h3>
True Negative Rate is the proportion of negative cases that are correctly identified as negative
$$ tnr = \frac{tn}{(tn + fp)} $$
<li>Proportion of real news stories that are correctly identified as real news
<li>Proportion of healthy people that are correctly identified as healthy


In [None]:
tnr = tn/(tn+fp)
print(tnr)

<h3>false positive rate or "fall out"</h3>
The false positive rate is the proportion of rocks that have been identified as mines
$$ fpr = \frac{fp}{(fp + tn)} $$

<li>Proportion of true news items that are identified as fake news
<li>Proportion of consumers who won't use a discount but are identified as target discount users
<li>Proportion of rocks that have been identified as mines

In [None]:
fpr = fp/(fp+tn)
print(fpr)

<h3>Precision</h3>
Precision measures the proportion of cases identified as positive that are actually positive
$$ precision = \frac{tp}{(tp + fp)} $$
<li>Proportion of news items that are actually fake from amongst all the news items that are identified as fake
<li>Proportion of "churners" that are actual churners from amongst all customers identified as churners
<li>Proportion of actual mines amongst all things that are identified as mines

In [None]:
precision = tp/(tp+fp)
print(precision)

<h3>f-score</h3>
<li>Precision tells us how well our model discriminates amongst cases it identifies as positive. A precision of 1 would mean that if our model says something is positive, it is definitely a positive. 
<li>Recall tells us how good the model is at finding positives (a recall of 1 would mean it has found all positives).
<li>Precision does not tell us how good we are at finding positives while recall does not tell us how good our model is at discriminating
<li>The f-score combines the two into a single score
$$ F = 2\frac{precision * recall}{(precision + recall)} $$
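A small worked example shows why this form (the harmonic mean) is useful: it stays low unless both precision and recall are high. The numbers are made up and `f_score` is a hypothetical helper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f_score(0.9, 0.9)   # both high
lopsided = f_score(0.9, 0.1)   # high precision, poor recall
plain_average = (0.9 + 0.1) / 2

print("balanced:", round(balanced, 2))
print("lopsided:", round(lopsided, 2))
print("plain average of 0.9 and 0.1:", plain_average)
```

The plain average of 0.9 and 0.1 is 0.5, but the f-score is only 0.18, so a model cannot hide poor recall behind good precision.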


In [None]:
f = 2*precision*tpr/(precision+tpr)
print(f)

<h3>accuracy</h3>
Accuracy measures how accurately the model classifies things as positive or negative (mines or rocks)
$$accuracy = \frac{tp + tn}{(tp+tn+fp+fn)} $$
An accuracy of 1 would mean that our model has classified everything correctly

In [None]:
accuracy = (tp+tn)/(tp+tn+fp+fn)
print(accuracy)

<h3>misclassification rate</h3>
Misclassification rate is the complement of accuracy (1 - accuracy). What proportion of the cases are misclassified?
$$ misclassificationRate = \frac{fp + fn}{(tp+tn+fp+fn)} $$


In [None]:
misclassification_rate = (fp + fn)/(tp+fp+tn+fn)
print(misclassification_rate)

<h3>Examining our results</h3>


In [None]:
print("Precision:\t\t%1.2f identified as mines are mines"%precision)
print("Recall:\t\t\t%1.2f proportion of actual mines identified"%tpr)
print("Specificity:\t\t%1.2f proportion of rocks identified as rocks"%tnr)
print("False Positive Rate:\t%1.2f proportion of rocks identified as mines"%fpr)
print("f-score:\t\t%1.2f tradeoff between precision and recall"%f)
print("Accuracy:\t\t%1.2f how well the model has classified"%accuracy)

<h3>Threshold</h3>
<li>Our (regression) model is calculating continuous values (mostly between 0 and 1, though not strictly bounded to that range)
<li>We're using a threshold of 0.5 to decide whether something is a rock or a mine
<li>What happens if we use a different threshold value?

In [None]:
def print_results(threshold):
    tp,fp,tn,fn = confusion_matrix(testing_predictions,np.array(y_test),threshold)
    # Guard against 0/0 when nothing is predicted positive at this threshold
    precision = tp/(tp+fp) if (tp+fp) else float('nan')
    tpr = tp/(tp+fn)
    tnr = tn/(tn+fp)
    fpr = fp/(fp+tn)
    f = 2*precision*tpr/(precision+tpr) if (precision+tpr) else float('nan')
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    print("Precision:\t\t%1.2f identified as mines are mines"%precision)
    print("Recall/Sensitivity:\t%1.2f proportion of actual mines identified"%tpr)
    print("Specificity:\t\t%1.2f proportion of rocks identified as rocks"%tnr)
    print("False Positive Rate:\t%1.2f proportion of rocks identified as mines"%fpr)

    print("f-score:\t\t%1.2f tradeoff between precision and recall"%f)
    print("Accuracy:\t\t%1.2f how well the model has classified"%accuracy)
    

In [None]:
print("Results for .1")
print_results(0.1)
print("Results for .25")
print_results(0.25)
print("results for .5")
print_results(0.5)
print("results for .75")
print_results(0.75)
print("Results for .9")
print_results(0.9)

<h2>ROC: Receiver Operating Characteristic</h2>
<li>An ROC curve shows the performance of a binary classifier as the threshold varies 
<li>It contrasts
<ul>
<li>False positive rate (FPR) Fall out/false alarm on the x-axis
<li>True Positive rate (TPR) Sensitivity/recall on the y-axis
</ul>
<li>Each (fpr,tpr) coordinate is calculated for each threshold value and a curve plotted
<li>An <span style="color:blue">area under the curve (auc)</span> is calculated. 
<li>AUC summarizes performance across all threshold values in a single number: 1.0 is a perfect classifier and 0.5 is no better than random guessing (the diagonal line)
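The construction described above can be sketched by hand on a tiny made-up example. This is an illustration, not the library routine: `roc_points` and `area_under` are hypothetical helpers, and the sketch counts a score equal to the threshold as positive, which is a convention assumed here.

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) pairs, one per distinct score used as a threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under(points):
    """Trapezoid rule over consecutive (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels (1 = mine, 0 = rock) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
roc_area = area_under(points)
print(points)
print("area under the curve:", round(roc_area, 3))
```

Here 8 of the 9 possible (mine, rock) pairs are ranked correctly by the scores, and the area works out to the same fraction, 8/9.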



<h2>Drawing the ROC Curve</h2>
<li>sklearn has a function roc_curve that does this for us

In [None]:
from sklearn.metrics import roc_curve, auc

<h4>In-sample ROC Curve</h4>

In [None]:
import matplotlib.pyplot as plt
(fpr, tpr, thresholds) = roc_curve(y_train,y_pred_training)
area = auc(fpr,tpr)
plt.clf() #Clear the current figure
plt.plot(fpr,tpr,label="In-Sample ROC Curve with area = %1.2f"%area)

plt.plot([0, 1], [0, 1], 'k') #This plots the random (equal probability line)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('In sample ROC rocks versus mines')
plt.legend(loc="lower right")
plt.show()


<h4>Out-sample ROC curve</h4>

In [None]:
(fpr, tpr, thresholds) = roc_curve(y_test,y_pred_testing)
area = auc(fpr,tpr)
plt.clf() #Clear the current figure
plt.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)

plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Out sample ROC rocks versus mines')
plt.legend(loc="lower right")
plt.show()

<h2>Precision vs. Recall</h2>
<li>Precision tells us how well we're discriminating within the positively identified  cases
<li>Recall tells us what proportion of actual positive cases we've identified as positive
<li>Obviously, we'd like both numbers to be close to 1!
<li>The precision-recall curve tells us how well we're doing on both factors for different threshold values
<li>We can also compute an <span style="color:blue">average precision (AP) metric</span>
<ul>
<li>AP computes a score at each threshold point
<li>Each score is the precision at that point multiplied by the change in recall from the previous threshold point
<li>These are then summed up to give a weighted average
</ul>
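The average precision calculation described above can be sketched on a tiny made-up example. `average_precision` is a hypothetical helper written for illustration; it treats a score equal to the threshold as positive, which is an assumption of this sketch.

```python
def average_precision(actual, scores):
    """Sum of precision * (increase in recall) over thresholds, high to low."""
    positives = sum(actual)
    ap = 0.0
    previous_recall = 0.0
    for t in sorted(set(scores), reverse=True):
        # labels of everything the model calls positive at this threshold
        predicted_positive = [a for a, s in zip(actual, scores) if s >= t]
        tp = sum(predicted_positive)
        precision = tp / len(predicted_positive)
        recall = tp / positives
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

# Made-up labels (1 = mine, 0 = rock) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap_value = average_precision(actual, scores)
print("average precision:", round(ap_value, 3))
```

Only thresholds where recall increases contribute to the sum, so the three mines contribute precisions of 1, 1 and 3/4, each weighted by a recall step of 1/3.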

In [None]:
from sklearn.metrics import precision_recall_curve,average_precision_score
(p,r,thresholds) = precision_recall_curve(y_train,y_pred_training)
ap = average_precision_score(y_train,y_pred_training)
plt.clf() #Clear the current figure
plt.plot(r,p,label="In-sample precision recall curve, AP = %1.2f"%ap) #recall on the x-axis, precision on the y-axis

plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('recall')
plt.ylabel('precision')
plt.title('In sample P-R rocks versus mines')
plt.legend(loc="lower right")
plt.show()

In [None]:
from sklearn.metrics import precision_recall_curve
(p,r,thresholds) = precision_recall_curve(y_test,y_pred_testing)
ap = average_precision_score(y_test,y_pred_testing)

plt.clf() #Clear the current figure
plt.plot(r,p,label="Out-sample precision recall curve, AP = %1.2f"%ap) #recall on the x-axis, precision on the y-axis

plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('recall')
plt.ylabel('precision')
plt.title('Out sample P-R rocks versus mines')
plt.legend(loc="lower right")
plt.show()

<h2>So, what threshold should we actually use?</h2>
<li>ROC curves and precision-recall curves give you a sense for how good your classifier is and how sensitive it is to changes in threshold


<h3>Example: Let's say</h3>
<li>Everything classified as a rock needs to be checked with a hand scanner at \$200/scan</li> 
<li>Everything classified as a mine needs to be defused at \$1000 if it is a real mine or \$300 if it turns out to be a rock</li>



In [None]:
# Everything classified as a rock (tn + fn) is hand scanned; predicted mines are defused
tp,fp,tn,fn = confusion_matrix(y_pred_testing,np.array(y_test),.1)
cost1 = (tn + fn) * 200 + 1000 * tp + 300 * fp
tp,fp,tn,fn = confusion_matrix(y_pred_testing,np.array(y_test),.5)
cost2 = (tn + fn) * 200 + 1000 * tp + 300 * fp
tp,fp,tn,fn = confusion_matrix(y_pred_testing,np.array(y_test),.9)
cost3 = (tn + fn) * 200 + 1000 * tp + 300 * fp
print(cost1,cost2,cost3)

<h3>Example: Let's say</h3>
<li>Everything classified as a rock will be assumed a rock and if wrong, will cost \$5000 in injuries</li> 
<li>Everything classified as a mine will be left as is (no one will walk on it!)</li>


In [None]:
tp,fp,tn,fn = confusion_matrix(testing_predictions,np.array(y_test),.1)
cost1 = 5000 * fn
tp,fp,tn,fn = confusion_matrix(testing_predictions,np.array(y_test),.5)
cost2 = 5000 * fn
tp,fp,tn,fn = confusion_matrix(testing_predictions,np.array(y_test),.9)
cost3 = 5000 * fn
print(cost1,cost2,cost3)

<h2>Bottom line: it depends on factors from your domain</h2>
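One way to make that dependence concrete is to write the domain costs down and pick the threshold with the lowest total. The sketch below is hypothetical: it combines a \$300 cost for each rock needlessly treated as a mine with a \$5000 cost for each missed mine, uses made-up labels and scores, and introduces the helper names `count_outcomes` and `total_cost` for illustration.

```python
def count_outcomes(actual, scores, threshold):
    """Return (tp, fp, tn, fn); scores above the threshold count as mines."""
    tp = fp = tn = fn = 0
    for a, s in zip(actual, scores):
        predicted = 1 if s > threshold else 0
        if a == 1 and predicted == 1:
            tp += 1
        elif a == 0 and predicted == 1:
            fp += 1
        elif a == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def total_cost(actual, scores, threshold, fp_cost=300, fn_cost=5000):
    """Hypothetical cost model: false alarms are cheap, missed mines are not."""
    tp, fp, tn, fn = count_outcomes(actual, scores, threshold)
    return fp_cost * fp + fn_cost * fn

# Made-up labels (1 = mine, 0 = rock) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

costs = {t: total_cost(actual, scores, t) for t in (0.1, 0.25, 0.5, 0.75, 0.9)}
best_threshold = min(costs, key=costs.get)
print(costs)
print("cheapest threshold:", best_threshold)
```

With a different cost model (for example, one where false alarms are expensive) the cheapest threshold would move, which is the point of the bottom line above.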