**Disclaimer**: Some examples and some text in this lesson are adapted from Jake VanderPlas's [Python Data Science Handbook](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb). It is a superb starting point for anyone who wants to learn more about machine learning and data science in Python.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

import seaborn as sns

# Intro to machine learning

Machine learning (ML) is a huge area of research and application, and it is currently expanding very quickly. We will look at the fundamental principles of machine learning and try out some of the algorithms.

ML can be broadly defined as building models of data, usually for the purpose of transformation (e.g. dimensionality reduction) or prediction (e.g. classification or regression), but sometimes just to further our understanding of the data.

"Learning" appears when we give these models **tunable parameters** that can be **adapted** to observed data; in this way the algorithm can be considered to be "learning" from the data.
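
As a minimal sketch of what adapting tunable parameters to observed data can look like, the snippet below fits a straight line by gradient descent. The synthetic data, the learning rate, and the number of iterations are assumptions chosen for illustration, not part of the original lesson.

```python
import numpy as np

# Synthetic data from a known line (slope 2, intercept -1) plus noise.
rng = np.random.default_rng(0)
x = 10 * rng.random(50)
y = 2 * x - 1 + rng.normal(size=50)

# Tunable parameters of the model y_hat = slope * x + intercept.
slope, intercept = 0.0, 0.0
learning_rate = 0.01  # assumed step size, picked by hand for this data

for _ in range(5000):
    residual = slope * x + intercept - y
    # Gradients of the mean squared error with respect to each parameter.
    slope -= learning_rate * 2 * np.mean(residual * x)
    intercept -= learning_rate * 2 * np.mean(residual)

print(slope, intercept)
```

Each pass nudges the two parameters in the direction that reduces the error on the observed points, which is the sense in which the model "learns" from the data.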

# Categories of ML

### Supervised learning
Involves modeling the relationship between measured **features of data** and some **label** associated with the data; once this model is determined, it can be used to apply labels to new, unknown data. This is further subdivided into **classification** tasks and **regression** tasks: in classification, the labels are discrete categories, while in regression, the labels are continuous quantities.

### Unsupervised learning
Involves modeling the features of a dataset **without** reference to any label. These models include tasks such as clustering and dimensionality reduction. Clustering algorithms identify distinct groups of data, while dimensionality reduction algorithms search for new representations of the data in a space with fewer dimensions.

In addition, there are so-called *semi-supervised* learning methods, which fall somewhere between supervised learning and unsupervised learning. Semi-supervised learning methods are often useful when only incomplete labels are available.
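
To make the idea of incomplete labels concrete, here is a hedged sketch using scikit-learn's `LabelPropagation` on the iris data bundled with scikit-learn. Hiding roughly 70% of the labels and the choice of this particular estimator are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown; scikit-learn marks unlabeled samples with -1.
rng = np.random.default_rng(0)
hidden = rng.random(len(y)) < 0.7
y_partial = y.copy()
y_partial[hidden] = -1

# Fit on all features, but only the labels that were kept.
model = LabelPropagation()
model.fit(X, y_partial)

# transduction_ holds the labels inferred for every training sample.
inferred = model.transduction_[hidden]
accuracy = np.mean(inferred == y[hidden])
print(accuracy)
```

The model uses the structure of the unlabeled points together with the few known labels, which is what places it between the supervised and unsupervised settings.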

# Regression

In [None]:
x = 10 * np.random.rand(50)
y = 2 * x - 1 + np.random.randn(50)
plt.scatter(x, y);

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression(fit_intercept=True)
model

In [None]:
x.shape

In [None]:
X = x[:, np.newaxis]
X.shape
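
The reshaping step above is needed because scikit-learn estimators expect the features as a two-dimensional array of shape `[n_samples, n_features]`. The short sketch below, on a small made-up array, shows two equivalent ways to turn a 1D array into a single-feature matrix.

```python
import numpy as np

x_demo = np.arange(5.0)               # 1D array of 5 samples

X_newaxis = x_demo[:, np.newaxis]     # add a trailing axis
X_reshape = x_demo.reshape(-1, 1)     # same result; -1 infers the row count

print(x_demo.shape, X_newaxis.shape)  # (5,) (5, 1)
```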

In [None]:
model.fit(X, y)

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
xfit = np.linspace(-1, 11)

In [None]:
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

In [None]:
plt.scatter(x, y)
plt.plot(xfit, yfit);

# Classification

In [None]:
iris = pd.read_csv('data/iris.csv')
iris.head()

In [None]:
sns.pairplot(iris, hue='species', height=2.5);  # `size` was renamed to `height` in seaborn 0.9

In [None]:
X_iris = iris.drop('species', axis='columns')
X_iris.shape

In [None]:
y_iris = iris['species']
y_iris.shape

In [None]:
# 1. import a model class
from sklearn.naive_bayes import GaussianNB 

# 2. instantiate model
model = GaussianNB()                       

# 3. fit model to data
model.fit(X_iris, y_iris)                  

In [None]:
y_model = model.predict(X_iris)

In [None]:
pd.DataFrame({'true_labels':y_iris, 'predicted_labels':y_model})

In [None]:
np.mean(y_iris == y_model)

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris)
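
By default, `train_test_split` shuffles the samples and holds out 25% of them for testing. The sketch below imitates that behaviour with NumPy to show what the split amounts to; `holdout_split` is a hypothetical helper written for this illustration, not a scikit-learn function, and the demo arrays are made up.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle row indices, then hold out a fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10)                  # one label per sample

Xtr, Xte, ytr, yte = holdout_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape)
```

Features and labels are indexed with the same shuffled positions, so each row stays paired with its label, and no sample appears in both sets.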

In [None]:
# 2. instantiate model
model = GaussianNB()            

# 3. fit model to data
model.fit(Xtrain, ytrain)                  

# 4. predict on new data
y_model = model.predict(Xtest)             

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

# Dimensionality reduction / transformation

In [None]:
# 1. import the model class
from sklearn.decomposition import PCA  

# 2. instantiate the model with hyperparameters
model = PCA(n_components=2)            

# 3. fit to data. Notice y is not specified!
model.fit(X_iris)                      

# 4. transform the data to two dimensions
X_2D = model.transform(X_iris)         
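
To give a rough idea of what the transform step does, PCA can be sketched as centering the data and projecting it onto the leading singular vectors. The example below uses made-up four-feature data and plain NumPy; it is an illustration under those assumptions, and scikit-learn's implementation may differ in solver and sign conventions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 150 samples, 4 features, driven mostly by 2 hidden factors.
latent = rng.normal(size=(150, 2)) * [3.0, 1.0]
mixing = rng.normal(size=(2, 4))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(150, 4))

# Center each feature, then take the singular value decomposition.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; keep the first two and project.
components = Vt[:2]
X_demo_2D = X_centered @ components.T

# Fraction of total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)
print(X_demo_2D.shape, explained_ratio[:2].sum())
```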

In [None]:
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False);  # recent seaborn requires x/y as keywords

# Clustering

In [None]:
# 1. import the model class
from sklearn.mixture import GaussianMixture     

# 2. instantiate the model with hyperparameters
model = GaussianMixture(n_components=3,
                        covariance_type='full')

# 3. fit to data. Notice y is not specified!
model.fit(X_iris)                    

# 4. determine cluster labels
iris['cluster'] = model.predict(X_iris)

iris.head()

In [None]:
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species',
           col='cluster', fit_reg=False);

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
print(digits.DESCR)

In [None]:
digits.images.shape

In [None]:
digits.images[0]

In [None]:
plt.imshow(digits.images[0])

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i])
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

In [None]:
X = digits.data
X.shape

In [None]:
y = digits.target
y.shape

In [None]:
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape

In [None]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('nipy_spectral', 10))  # plt.cm.get_cmap is removed in recent matplotlib
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5);

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

In [None]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(ytest, y_model)

sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
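
A confusion matrix like the one above has true labels along the rows and predicted labels along the columns. The sketch below shows how to read overall accuracy, per-class recall, and the most frequent mistake from such a matrix. The three-class matrix is made up for illustration and is not the digits result.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
mat_demo = np.array([[50,  2,  0],
                     [ 3, 40,  7],
                     [ 0,  5, 45]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(mat_demo) / mat_demo.sum()

# Recall per class: correct predictions divided by the row total.
recall = np.diag(mat_demo) / mat_demo.sum(axis=1)

# Largest off-diagonal entry marks the most common confusion.
errors = mat_demo.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)

print(accuracy, recall, (true_cls, pred_cls))
```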

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

test_images = Xtest.reshape(-1, 8, 8)

for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')