# Classification

## Logistic Regression

Logistic regression models the probability of a class given an input. In its most basic form it makes a binary choice, but it also models uncertainty: it outputs a value between 0 and 1 rather than just 0 or 1. Here are some simple cases of image data where it may be difficult to distinguish the classes:

![](https://i.pinimg.com/originals/e3/bd/cb/e3bdcb19e8f72bf9392d935ba95092fa.jpg)

![](https://prods3.imgix.net/images/articles/2016_03/Facebook-Dog-or-Chicken-Labradoodle-or-fried-chicken-puppy-or-bagel-Karen-Zwack-teeny-biscuit-memes.jpg)

One simple form of the model is the following: let $p = P(\text{image is a dog} \mid \text{image data})$. Then a linear model for the log-odds is given by $l = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots$, where the $x_i$ may be the pixels of the image.
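
Solving the log-odds equation for $p$ gives $p = \frac{1}{1 + e^{-l}}$, the logistic (sigmoid) function. The sketch below shows the two directions of that mapping. The values of `beta_0`, `beta` and `x` are made up for illustration and are not fitted to any data.

```python
import numpy as np

def sigmoid(l):
    """Map log-odds l to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-l))

def logit(p):
    """Map a probability p back to log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical intercept, coefficients and inputs (illustration only).
beta_0 = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

log_odds = beta_0 + beta @ x   # the linear predictor l
p = sigmoid(log_odds)          # probability of the positive class
print(log_odds, p)
```

Log-odds of 0 correspond to $p = 0.5$; positive log-odds push $p$ towards 1 and negative log-odds push it towards 0.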

Unlike in linear regression, we are trying to predict the probability of an outcome rather than a continuous value. As such, each $\beta$ coefficient is a contribution to the log-odds $l$, not to the probability directly. A positive $\beta$ suggests that a higher intensity at that pixel makes the image more likely to be a dog, while a negative $\beta$ means that it makes the image more likely to be a croissant. The magnitudes indicate how strongly each input shifts the log-odds, and they depend on the scale of the inputs. Non-linear models could be used to characterize more complex relationships.

In [1]:
import pandas as pd, numpy as np

In [2]:
df = pd.read_csv("data.csv")
df = df.dropna()
df['DV'] = (df['DV'] > 3).astype(bool)  # True if DV > 3, otherwise False
df = df.drop(columns='ID')  # the positional axis argument was removed in pandas 2.0
df.head(10)

Unnamed: 0,IV1,IV2,IV3,IV4,IV5,IV6,DV,IV7,IV8,IV9,IV10,IV11,IV12,IV13,IV14,IV15
0,4.0,4.0,4.0,2.0,4.0,4.0,False,4.0,3.0,4.0,4.0,3.0,4.0,4.0,3.0,4.0
1,5.0,5.0,5.0,5.0,5.0,5.0,True,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
2,4.0,4.0,4.0,4.0,5.0,5.0,False,4.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0
3,4.0,5.0,3.0,3.0,4.0,4.0,False,4.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0
4,4.0,4.0,4.0,2.0,5.0,5.0,True,5.0,3.0,5.0,4.0,3.0,4.0,4.0,4.0,5.0
5,5.0,5.0,5.0,4.0,5.0,4.0,True,5.0,5.0,5.0,5.0,4.0,4.0,5.0,5.0,4.0
6,5.0,4.0,3.0,3.0,5.0,4.0,False,5.0,4.0,5.0,5.0,3.0,4.0,4.0,4.0,4.0
7,4.0,3.0,4.0,3.0,4.0,4.0,False,4.0,3.0,3.0,4.0,4.0,3.0,4.0,4.0,4.0
8,3.0,3.0,1.0,1.0,3.0,3.0,False,1.0,3.0,4.0,4.0,3.0,1.0,4.0,4.0,3.0
9,4.0,4.0,4.0,3.0,5.0,5.0,True,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0


In [3]:
y = df.pop('DV')
X = df

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape)
print(X_test.shape)

(441, 15)
(147, 15)
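
The 441/147 split above follows from the default `test_size` of 0.25 applied to 588 rows. The sketch below reproduces those shapes on stand-in arrays (`X_demo` and `y_demo` are placeholders, not the survey data) and also shows the optional `stratify` argument, which keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as above (441 + 147 = 588).
X_demo = np.zeros((588, 15))
y_demo = np.arange(588) % 2  # alternating 0/1 labels, balanced by construction

# With no test_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# stratify=y_demo asks for (approximately) the same class ratio in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
print(y_tr_s.mean(), y_te_s.mean())
```

Stratifying is worth considering when one class is much rarer than the other, since a purely random split can leave very few examples of it in the test set.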


In [5]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs')
clf.fit(X_train, y_train)

def output(model, _X_test, _y_test):
    print('Classes: {}'.format(model.classes_))
    print(model.predict_proba(_X_test)[:5,])
    print(".\n"*3)
    print('Score for test data: {:.2%}'.format(model.score(_X_test, _y_test)))
    
output(clf, X_test, y_test)

Classes: [False  True]
[[0.7805515  0.2194485 ]
 [0.11279386 0.88720614]
 [0.88625328 0.11374672]
 [0.95075765 0.04924235]
 [0.33190203 0.66809797]]
.
.
.

Score for test data: 79.59%
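
For a classifier, `score` is the mean accuracy: the fraction of test rows whose predicted class matches the true class. A single accuracy number is easier to judge next to a confusion matrix and a majority-class baseline. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, since `data.csv` is not reproduced here), so its numbers will differ from the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the survey data used above).
X_demo, y_demo = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(random_state=0, solver='lbfgs', max_iter=1000)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# score() and accuracy_score() report the same quantity.
acc = accuracy_score(y_te, y_pred)
print('score():        {:.2%}'.format(model.score(X_te, y_te)))
print('accuracy_score: {:.2%}'.format(acc))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Baseline: always predict the most frequent class seen in training.
majority = np.bincount(y_tr).argmax()
baseline = np.mean(y_te == majority)
print('Majority-class baseline: {:.2%}'.format(baseline))
```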


In [6]:
print(clf.coef_.shape)
print(clf.coef_) # from doc: coef_ corresponds to outcome 1 (True)
print(clf.intercept_)

(1, 15)
[[-0.07860011  0.1615707   0.03051194  0.12129066  0.08219454  0.47526988
   0.87527162  0.85679054  0.72459541  0.27177606  0.01464484  0.56554802
  -0.21152324 -0.05474793 -0.12654745]]
[-15.26115359]
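
One way to read these numbers is as odds ratios: increasing a feature by one unit multiplies the odds of the `True` class by $e^{\beta}$. The sketch below copies the coefficients and intercept from the output above, rounded to four decimals, and assumes they are in the column order IV1 to IV15. It then computes the predicted probability by hand for two hypothetical respondents.

```python
import numpy as np

# Copied from the printed coef_ and intercept_ above (rounded to 4 decimals).
coef = np.array([-0.0786, 0.1616, 0.0305, 0.1213, 0.0822, 0.4753, 0.8753, 0.8568,
                 0.7246, 0.2718, 0.0146, 0.5655, -0.2115, -0.0547, -0.1265])
intercept = -15.2612
names = ['IV{}'.format(i) for i in range(1, 16)]  # assumed column order

# exp(beta): multiplicative change in the odds per one-unit increase.
odds_ratios = np.exp(coef)
for name, ratio in sorted(zip(names, odds_ratios), key=lambda t: -t[1])[:3]:
    print('{}: odds x {:.2f} per unit'.format(name, ratio))

def predict_true_probability(x):
    """Probability of the True class for one row x, via the sigmoid of the log-odds."""
    log_odds = intercept + coef @ x
    return 1.0 / (1.0 + np.exp(-log_odds))

p_all5 = predict_true_probability(np.full(15, 5.0))  # answers 5 everywhere
p_all4 = predict_true_probability(np.full(15, 4.0))  # answers 4 everywhere
print(p_all5, p_all4)
```

The large negative intercept is offset by the features, which mostly take values between 3 and 5, so the intercept on its own says little about the typical prediction.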


## Decision Trees

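A decision tree is a structured way to make decisions: each internal node asks a question about the input, each branch corresponds to one possible answer, and each leaf holds the resulting decision, which for classification is a predicted class or a set of class probabilities.

We can imagine that every time we want to make a decision, several options are available to us, and the tree follows one branch per answer until it reaches a leaf.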

![](https://i1.wp.com/www.samtalksml.net/wp-content/uploads/2017/05/image_dt1-1.png?resize=450%2C368&ssl=1)

![](http://www.prognoz.com/blog/wp-content/uploads/2016/06/tree.png)

![](https://victorzhou.com/media/random-forest-post/decision-tree2-root.svg)
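
To make the structure concrete, here is a minimal sketch that fits a tree to a tiny toy dataset and prints the learned rules as text. The fruit measurements and the feature names `diameter_cm` and `is_red` are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, made up for illustration: [diameter_cm, is_red]
X_toy = [[7.0, 1], [7.5, 1], [8.0, 0], [1.5, 1], [1.8, 0], [2.0, 0]]
y_toy = ['apple', 'apple', 'apple', 'grape', 'grape', 'grape']

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_toy, y_toy)

# Print the tree as nested if/else rules.
print(export_text(tree, feature_names=['diameter_cm', 'is_red']))

# Classify two new fruits by following the rules from the root to a leaf.
toy_pred = tree.predict([[7.2, 0], [1.6, 1]])
print(toy_pred)
```

Because diameter alone separates the two classes here, a single split is enough and the colour feature is never used.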

## Random Forest

The problem with decision trees is their tendency to overfit. The fact that apples and grapes appear as leaf nodes multiple times is an example of this problem. Even for small decision trees, this harms interpretability. For classification tasks, it also means that the trees generalize poorly when new data is introduced.

Random Forest is a method of overcoming the overfitting problem. It builds many classification trees, each of which sees only a random sample of the training set (typically a bootstrap sample drawn with replacement), and which may even see only a random subset of the features. By combining these trees, the variance is reduced significantly at the cost of a slight increase in bias.
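
The idea can be sketched by hand: train each tree on a resampled copy of the training set, restrict the features considered at each split, and let the trees vote. This is a simplified illustration on synthetic data, not scikit-learn's implementation; `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features='sqrt': each split considers a random subset of the features.
    t = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(t.fit(X_tr[idx], y_tr[idx]))

# Majority vote over the trees' 0/1 predictions.
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_te) == y_te)
forest_acc = np.mean(forest_pred == y_te)
print('One tree: {:.2%}   Voting ensemble: {:.2%}'.format(single_acc, forest_acc))
```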

![](https://miro.medium.com/max/592/1*i0o8mjFfCn-uD79-F1Cqkw.png)

A random forest only gives the importance of each feature for classification; it has no analogue of the sign of the $\beta$ coefficients. An importance of small magnitude means that the feature has little impact. These importance scores are also invariant to feature scaling, because tree splits depend only on the ordering of a feature's values.
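
The difference from logistic regression can be seen on made-up data where the label is `True` only when the first feature is low. The logistic coefficient for that feature comes out negative, while the forest's importance for it is simply large and non-negative. The data and the names `X_demo`, `y_demo`, `lr` and `forest` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Made-up 1-to-5 ratings; the label is True only when the FIRST feature is low.
X_demo = rng.integers(1, 6, size=(500, 4)).astype(float)
y_demo = X_demo[:, 0] <= 2

lr = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# The coefficient carries a direction: negative for the first feature here.
print('coef_:               ', lr.coef_[0])
# Importances are non-negative and sum to 1; they carry no direction.
importances = forest.feature_importances_
print('feature_importances_:', importances)
```

To recover a direction from a forest, one has to look beyond `feature_importances_`, for example by plotting how the predicted probability changes as a feature varies.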

In [7]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)

output(rfc, X_test, y_test)

Classes: [False  True]
[[0.91       0.09      ]
 [0.05       0.95      ]
 [0.84       0.16      ]
 [1.         0.        ]
 [0.30101371 0.69898629]]
.
.
.

Score for test data: 78.23%
