# Classification

Often, instead of predicting a continuous variable, we want to predict a variable that takes discrete values. In other words, we need to classify a new data point into one of a finite number of categories. The archetypal example of this kind of problem is a spam filter. Large companies like Google and Microsoft that host email services need some sort of spam filter to keep unwanted emails away from their users. So, given a new email, we need to decide whether or not it is spam. As in the problems we have seen so far, emails need to have some *features*. Recall that features are just ways to describe an object. For our email example, given some features (word count, number of capitalized words, etc.), can we train a machine learning model that learns which features indicate whether an email is spam? This is the premise of classification.

# Logistic Regression

One of the most important machine learning models for classification is logistic regression. The problem setup for logistic regression goes like this. Given a data set whose dependent variable is binary (two categories, usually yes or no), can we fit a function that gives us the probability that a new observation falls into one of the categories? This is where logistic regression comes in. One thing you may have noticed right off the bat is that it is called logistic *regression*. Didn't we just finish regression? Well, yes. We are doing regression here, but it is only an intermediate step. Our regression curve represents the probability of falling into a given category. Let us make this concrete with an example from the Wikipedia page on logistic regression (https://en.wikipedia.org/wiki/Logistic_regression). Our data set consists of the number of hours each student spent studying and whether or not they passed the exam. We want to predict, based on the number of hours studied, whether a student passed. Notice that our dependent variable is binary (they either passed or they didn't). We are going to fit a curve that represents the probability that a student passed based on the number of hours they studied. This curve is shown below:
![log_reg](log_reg.jpg)
Again, note that the y-axis is the probability that a student passed given the number of hours they studied. Note that this is not yet a classification. To turn it into one, we simply choose a cutoff probability. 50% is a common cutoff. So if our regression curve gives us a value of 0.5 or higher, we say that the student passed. The choice of cutoff is up to the machine learning engineer. We would like to not get into too much math, but let us quickly go over it. We are more than familiar with the linear regression equation, $y = B_0 + B_1x$. To get logistic regression, we apply the sigmoid function to this linear expression. The sigmoid function has that *S* shape we saw in the example above, and in general the formula is $S(t) = \frac{1}{1+e^{-t}}$. Substituting $t = B_0 + B_1x$, we get $p(x) = \frac{1}{1+e^{-(B_0+B_1x)}}$, where *p* is the probability of being in the "yes" or "1" class. Note that the sigmoid function is bounded between 0 and 1, which is good because a probability must always be in the interval $[0,1]$. Solving this equation for the linear part, we get $\log\left(\frac{p}{1-p}\right) = B_0+B_1x$, which is the equation that is fit to get our curve.
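To make the formulas above concrete, here is a minimal sketch of the sigmoid, the cutoff rule, and the log-odds identity. The coefficients `B0` and `B1` and the helper names are hypothetical, chosen only for illustration; they are not fitted to the Wikipedia data.

```python
import math

def sigmoid(t):
    """Map any real number t to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def pass_probability(hours, b0, b1):
    """p(x) = sigmoid(B0 + B1 * x), the probability of the '1' class."""
    return sigmoid(b0 + b1 * hours)

def classify(probability, cutoff=0.5):
    """Turn a probability into a class label using a cutoff."""
    return 1 if probability >= cutoff else 0

# Hypothetical coefficients, made up for illustration (not fitted to any data)
B0, B1 = -4.0, 1.5

for hours in (1.0, 3.0, 5.0):
    p = pass_probability(hours, B0, B1)
    # The log-odds of p should recover the linear part B0 + B1 * hours
    log_odds = math.log(p / (1.0 - p))
    print(f"hours={hours}: p={p:.3f}, log-odds={log_odds:.2f}, class={classify(p)}")
```

Raising or lowering `cutoff` trades one kind of mistake for the other, which is why the choice is left to the engineer.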

Let us use all of this to make an actual logistic regression model. The data for this example is social media ad data. We are trying to decide, based on a user's age and salary, whether they bought the product shown in an advertisement. First we import our data, split it into a training set and a test set, scale the features, fit our logistic regression on the training set, and make predictions on the test set.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values


# Splitting the dataset into the Training set and Test set
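# (train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split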
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
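
The feature scaling step in the cell above learns its statistics from the training set only and then reuses them on the test set. The following is a rough hand-written sketch of that idea for a single feature column, assuming standardization means subtracting the mean and dividing by the population standard deviation. The helper names and the ages are made up for illustration and are not taken from `Social_Network_Ads.csv`.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize a column using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical ages, made up for illustration
train_ages = [19.0, 35.0, 26.0, 47.0, 58.0]
test_ages = [30.0, 52.0]

# Statistics come from the training data only ...
mean, std = fit_scaler(train_ages)
train_scaled = transform(train_ages, mean, std)
# ... and are reused unchanged on the test data
test_scaled = transform(test_ages, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into the model before evaluation.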



Now that we have our predictions, we need to see how accurately our model performed. Since we have our predictions for the test set, as well as the actual results from the test set, we can compare how well the two match up. This is generally done with what is called a confusion matrix. The following picture (from rasbt.github.io) is an excellent visualization of the confusion matrix.
![confusion_matrix](confusion_matrix.jpg)
This matrix breaks down not only how accurate our model was, but also where the model was accurate and where it made mistakes. This is why confusion matrices are much more descriptive than accuracy alone. In Python, computing the confusion matrix is simple.

In [3]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[65  3]
 [ 8 24]]


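We can match up this matrix with the image above to analyze our results, keeping in mind that scikit-learn puts the actual classes in the rows and the predicted classes in the columns, with class 0 first, so the positions may not line up one-to-one with the image. Assuming the label 1 means the user bought the product, our number of true negatives (the number of times our model guessed that a user would not respond to the ad, and they did not) was 65, and our number of true positives (the model guessed that a user would respond to the ad, and then they did) was 24. The off-diagonal entries are the mistakes: 3 false positives and 8 false negatives.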
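
As a small sketch of what can be read off the printed matrix, the snippet below computes accuracy, precision, and recall by hand. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class 0 first) and that class 1 means the user bought the product. The variable names are illustrative.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = [[65, 3],
      [8, 24]]

tn, fp = cm[0]  # actual 0: predicted 0 (true negatives), predicted 1 (false positives)
fn, tp = cm[1]  # actual 1: predicted 0 (false negatives), predicted 1 (true positives)
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that were correct
precision = tp / (tp + fp)     # of the predicted buyers, how many actually bought
recall = tp / (tp + fn)        # of the actual buyers, how many the model found

print(f"accuracy={accuracy:.2f}, precision={precision:.3f}, recall={recall:.2f}")
```

Here the accuracy is 0.89, but the recall of 0.75 shows that a quarter of the actual buyers were missed, a detail that the accuracy figure alone hides.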