# Intro to Data Science @ SzISz Part V.
## Classification and Regression

### Table of contents
- Classification
    - <a href="#What-is-Classification?">Classification Theory</a>
    - <a href="#Linear-models">Linear models</a>
    - <a href="#Decision-Trees">Decision Trees</a>
    - <a href="#Neural-Networks">Neural Networks</a>
    - <a href="#Support-Vector-Machines">Support Vector Machines</a>
    - <a href="#Nearest-Neighbours">Nearest Neighbours</a>
- Regression
    - <a href="#What-is-Regression?">Regression Theory</a>
    - <a href="#Linear-regression">Linear regression</a>
    - <a href="#Non-Linear-regression">Non-Linear regression</a>
    - <a href="#Bayesian-regression">Bayesian regression</a>
    - <a href="#Support-Vector-regression">Support Vector regression</a>
    

## What is Classification?
_"Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known."_ from: <a href="https://en.wikipedia.org/w/index.php?title=Classification_in_machine_learning">Wiki</a>

Classification is a <a href="http://scikit-learn.org/stable/supervised_learning.html">supervised machine learning</a> problem. The properties of an observed item are called __features__, the categorical target to be predicted is called the __label__, and the algorithm that performs the classification is the __classifier__.
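
As a minimal sketch of this terminology, assuming the iris dataset that is loaded later in this notebook (the variable names `features`, `labels` and `classifier` are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
features = iris.data    # one row per observation, one column per feature
labels = iris.target    # the known category of each observation

# The classifier learns the mapping from features to labels
classifier = KNeighborsClassifier().fit(features, labels)

# Predict the category of one observation
new_observation = features[:1]
predicted = classifier.predict(new_observation)

print(iris.feature_names)
print(iris.target_names[predicted])
```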

## Why is it important?
It is one of the main problems in Data Science; its applications range from detecting whether an email is <a href="https://www.youtube.com/watch?v=cFrtpT1mKy8">spam</a> or not (ham) to pedestrian detection in self-driving cars. It can be applied to many fields, including:
- computer vision
- medical analysis
- speech recognition
- biometric identification
- document classification
- pattern recognition
- etc.

## Tools
- Linear models
    - Logistic regression
    - Naive Bayes classifier
    - Perceptron
- Support Vector Machines
- Neural Networks
- Instance-based learning
    - k-Nearest Neighbours
- Decision trees
    - Random forests
- Ensemble methods
- etc.

In [None]:
%matplotlib inline
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
# sklearn.cross_validation was removed; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split

In [None]:
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Baseline model

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy = DummyClassifier(random_state=42)
# Fit on the training split only, so the test split stays unseen
dummy.fit(X_train, y_train)

In [None]:
y_dummy = dummy.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_dummy))

In [None]:
dummy.score(X_test, y_test)
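
To make the role of the baseline concrete, here is a hedged sketch that uses the `most_frequent` strategy (an assumption; the cells above rely on the default strategy) and checks it against a hand-computed majority-class accuracy. Any real classifier should beat this number.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the most frequent class of the training set
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# The same number computed by hand
majority_class = np.bincount(y_train).argmax()
manual_accuracy = np.mean(y_test == majority_class)

print(baseline_accuracy, manual_accuracy)
```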

## Linear models
### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
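
The heatmap and the score are two views of the same result. The sketch below, which assumes the same iris split and raises `max_iter` to 1000 (an assumption, chosen to avoid convergence warnings), derives accuracy and per-class recall directly from the confusion matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(accuracy, recall_per_class)
```

Passing `annot=True` to `sns.heatmap` prints these counts inside the cells of the plot.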

### Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
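
`MultinomialNB` is designed for count-like features such as word frequencies, while the iris measurements are continuous. As a hedged comparison (not part of the original notebook), `GaussianNB` models each feature with a per-class normal distribution instead; the sketch assumes the same split as above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit both Naive Bayes variants on the same data and record test accuracy
nb_scores = {}
for model in (MultinomialNB(), GaussianNB()):
    model.fit(X_train, y_train)
    nb_scores[type(model).__name__] = model.score(X_test, y_test)

print(nb_scores)
```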

### Perceptron

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
clf = Perceptron(verbose=0, random_state=42)
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
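
What these models have in common is that each class score is a weighted sum of the features plus an intercept. A minimal sketch, assuming the same iris split, reproduces the perceptron's decision scores by hand from `coef_` and `intercept_`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = Perceptron(random_state=42).fit(X_train, y_train)

# One row of weights and one intercept per class
manual_scores = X_test @ clf.coef_.T + clf.intercept_

# The predicted class is the one with the highest score
manual_pred = clf.classes_[manual_scores.argmax(axis=1)]

print(manual_scores[:3])
print(manual_pred[:3])
```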

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
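
A fitted tree is a set of nested threshold rules, and it can be printed as text. The sketch below assumes scikit-learn 0.21 or later, where `export_text` is available, and the same iris split as above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Render the learned splits as indented if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```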

## Neural Networks

In [None]:
XOR_X, XOR_y = np.array([[0,0], [0, 1], [1, 0], [1, 1]]), np.array([0, 1, 1, 0])

In [None]:
perceptron = Perceptron(verbose=2, random_state=42)
perceptron.fit(XOR_X, XOR_y)

In [None]:
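# Shade the perceptron's predicted regions, then overlay the XOR points
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100), np.linspace(-0.5, 1.5, 100))
plt.contourf(xx, yy, perceptron.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape), alpha=0.3)
plt.scatter(XOR_X[:, 0], XOR_X[:, 1], c=XOR_y)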

In [None]:
conf_mat = confusion_matrix(XOR_y, perceptron.predict(XOR_X))
conf_mat

In [None]:
sns.heatmap(conf_mat)
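
XOR is not linearly separable, so a single perceptron cannot classify all four points correctly. One hidden layer is enough, though. The sketch below wires a two-layer network by hand in NumPy; the weights are chosen manually for illustration (they are not learned), with one hidden unit computing OR and the other AND.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

XOR_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
XOR_y = np.array([0, 1, 1, 0])

# Hidden layer: column 0 fires for OR, column 1 fires for AND
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])

# Output layer: OR and not AND, which is XOR
w_out = np.array([1, -1])
b_out = -0.5

hidden = step(XOR_X @ W_hidden + b_hidden)
xor_pred = step(hidden @ w_out + b_out)

print(xor_pred)  # → [0 1 1 0]
```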

In [None]:
try:
    from sklearn.neural_network import MLPClassifier
except ImportError:
    # MLPClassifier was added in scikit-learn 0.18
    print("Maybe in 0.18. http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html")

You should try this out: https://github.com/jorgenkg/python-neural-network

## Support Vector Machines

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(random_state=42)
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
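
`SVC` uses an RBF kernel by default. As a hedged illustration, assuming the same iris split, the loop below tries three of the built-in kernels and records the test accuracy and the number of support vectors each one keeps.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm_results = {}
for kernel in ("linear", "rbf", "poly"):
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    # n_support_ holds the number of support vectors per class
    svm_results[kernel] = (svm.score(X_test, y_test), int(svm.n_support_.sum()))

print(svm_results)
```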

## Nearest Neighbours

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_hat))

In [None]:
clf.score(X_test, y_test)
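
A k-nearest-neighbours prediction is a majority vote among the closest training points. The sketch below, assuming the same iris split and the default of 5 neighbours with uniform weights, repeats the vote by hand using `kneighbors`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Indices of the 5 nearest training points for every test point
distances, indices = knn.kneighbors(X_test)
neighbour_labels = y_train[indices]

# Majority vote per row; ties go to the smallest class index
manual_vote = np.array(
    [np.bincount(row, minlength=3).argmax() for row in neighbour_labels]
)

print(manual_vote[:5])
print(knn.predict(X_test)[:5])
```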

---

## What is Regression?
Regression, just like classification, is a supervised machine learning problem; in regression, however, the target variable is continuous. It is also _"a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a __dependent variable__ and one or more __independent variable__s (or 'predictors')."_ from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>

It is important to note that, whereas statistical regression analysis is mostly descriptive, Data Science focuses on the predictive side of this method.

## Why is it important?
_"Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning."_ from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>

It is used to forecast any continuous variable:
- stock market
- salary prediction
- network traffic
- traffic
- etc.

## Tools
- Linear regression
- Ridge regression
- LASSO
- Bayesian regression
- Support Vector regression
- etc.

## Linear regression

In [None]:
# TODO
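
One possible sketch for this section, assuming synthetic data generated from the hypothetical line y = 3x + 2 with Gaussian noise: ordinary least squares should recover a slope and intercept close to those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3x + 2 + noise
rng = np.random.RandomState(42)
X_lin = rng.uniform(0, 10, size=(200, 1))
y_lin = 3.0 * X_lin.ravel() + 2.0 + rng.normal(scale=0.5, size=200)

lin_reg = LinearRegression().fit(X_lin, y_lin)

print(lin_reg.coef_, lin_reg.intercept_)
print(lin_reg.score(X_lin, y_lin))  # R^2 on the training data
```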

## Non-Linear regression

In [None]:
# TODO
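
One common way to fit a non-linear relationship is to expand the features with polynomial terms and keep the linear model. This is a hedged sketch on synthetic quadratic data (an assumption for illustration), compared with a plain linear fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = x^2 + noise, symmetric around zero
rng = np.random.RandomState(42)
X_quad = rng.uniform(-3, 3, size=(200, 1))
y_quad = X_quad.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# A straight line cannot follow the parabola
linear_score = LinearRegression().fit(X_quad, y_quad).score(X_quad, y_quad)

# Adding the x^2 column lets the same linear model fit it
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_score = poly_model.fit(X_quad, y_quad).score(X_quad, y_quad)

print(linear_score, poly_score)
```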

## Bayesian regression

In [None]:
# TODO
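
A possible sketch using `BayesianRidge`, again on synthetic linear data (an assumption): besides a point prediction, it can return a standard deviation for each prediction, which grows for inputs far from the training range.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data: y = 1.5x + 1 + noise, with x in [0, 10]
rng = np.random.RandomState(42)
X_bayes = rng.uniform(0, 10, size=(100, 1))
y_bayes = 1.5 * X_bayes.ravel() + 1.0 + rng.normal(scale=1.0, size=100)

bayes_reg = BayesianRidge().fit(X_bayes, y_bayes)

# Two points inside the training range and one far outside it
X_new = np.array([[2.0], [5.0], [20.0]])
pred_mean, pred_std = bayes_reg.predict(X_new, return_std=True)

print(pred_mean)
print(pred_std)
```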

## Support Vector regression

In [None]:
# TODO
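
A possible sketch using `SVR` with its default RBF kernel on a noisy sine curve (synthetic data, an assumption). Points that fall inside the `epsilon` tube around the prediction do not become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: y = sin(x) + noise
rng = np.random.RandomState(42)
X_sin = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y_sin = np.sin(X_sin).ravel() + rng.normal(scale=0.1, size=200)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_sin, y_sin)
svr_score = svr.score(X_sin, y_sin)  # R^2 on the training data

print(svr_score)
print(len(svr.support_), "support vectors out of", len(X_sin))
```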