## Machine Learning Cheat Sheet 
#### This notebook contains a collection of machine learning techniques and practices. It is a work in progress.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons, load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns; sns.set()

### 1. Supervised learning 
* Data is labeled, and the model is trained to predict those labels correctly
* Regression: used to predict real numerical values e.g. home sales prices, stock market prices 
* Classification: classify things into categories e.g. email spam filters, fraud detection, image classification 

##### K-Nearest Neighbor Classification
* Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7, random_state=42)
* model = KNeighborsClassifier(n_neighbors=3)
* y_model = model.fit(Xtrain, ytrain).predict(Xtest)
* accuracy_score(ytest, y_model)
* for i in range(len(ytest)):
    if ytest[i] != y_model[i]:
        plt.plot(Xtest[i,0], Xtest[i,1], 'sk', markersize=10)  # plot the misclassified points
* matrix = confusion_matrix(ytest, y_model)  # summarize per-class errors with a confusion matrix
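
The steps above can be put together as a minimal runnable sketch. The synthetic `make_blobs` dataset and its parameters are assumptions chosen for illustration; any labeled `X`, `y` would work the same way.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data with three classes (illustrative assumption)
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Hold out 30% of the samples for testing
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Fit a 3-nearest-neighbor classifier and predict the held-out labels
model = KNeighborsClassifier(n_neighbors=3)
y_model = model.fit(Xtrain, ytrain).predict(Xtest)

print("accuracy:", accuracy_score(ytest, y_model))

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(ytest, y_model)
print(matrix)

# Test points whose predicted label differs from the true label
misclassified = Xtest[ytest != y_model]
print("misclassified points:", len(misclassified))
```

The boolean mask `ytest != y_model` selects the same points as the explicit loop, so `misclassified` can be passed straight to `plt.plot` to mark them.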

##### Random Forest Classifier 
* forest = RandomForestClassifier(n_estimators=5, random_state=2)  # define parameters for model
* y_model = forest.fit(Xtrain, ytrain).predict(Xtest)
* sns.pairplot(df, hue='Item to color by', height=1.5)  # examine the features
* sns.regplot(), sns.lmplot()  # used to visualize linear relationships
* n_features = forest.n_features_in_  # get number of features seen during fit
* plt.barh(np.arange(n_features), forest.feature_importances_, align='center')  # plot feature importances
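
As a rough end-to-end sketch of the random forest snippet, the example below fits the same model and inspects its feature importances. Using `load_digits` as the dataset is an assumption for illustration; the original does not say which data was used.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 8x8 digit images flattened to 64 pixel features (illustrative dataset choice)
digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, train_size=0.7, random_state=42
)

forest = RandomForestClassifier(n_estimators=5, random_state=2)
y_model = forest.fit(Xtrain, ytrain).predict(Xtest)

# Number of features seen during fit, and one importance score per feature
n_features = forest.n_features_in_
importances = forest.feature_importances_

# Print the five most important pixels, largest importance first
top = np.argsort(importances)[::-1][:5]
for idx in top:
    print(f"pixel {idx}: importance {importances[idx]:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction across the trees.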

##### Other Classifiers
* logreg = LogisticRegression()
* y_pred = logreg.fit(Xtrain, ytrain).predict(Xtest)

* gaussian = GaussianNB()
* y_pred = gaussian.fit(Xtrain, ytrain).predict(Xtest)

* svc = SVC()
* y_pred = svc.fit(Xtrain, ytrain).predict(Xtest)

* perceptron = Perceptron(class_weight='balanced')
* y_pred = perceptron.fit(Xtrain, ytrain).predict(Xtest)

* gbk = GradientBoostingClassifier()
* y_pred = gbk.fit(Xtrain, ytrain).predict(Xtest)

* ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
* y_pred = ada.fit(Xtrain, ytrain).predict(Xtest)
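
Because all of these estimators share the same `fit`/`predict` interface, they can be compared in a single loop. The sketch below assumes a synthetic `make_moons` dataset; the dictionary keys and the `scores` name are illustrative, not from the original notebook.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Two interleaving half-circles with noise (illustrative assumption)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Same constructor arguments as in the list above
classifiers = {
    "LogisticRegression": LogisticRegression(),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
    "Perceptron": Perceptron(class_weight="balanced"),
    "GradientBoosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(n_estimators=400, learning_rate=0.1),
}

# Fit each model on the training split and score it on the test split
scores = {}
for name, clf in classifiers.items():
    y_pred = clf.fit(Xtrain, ytrain).predict(Xtest)
    scores[name] = accuracy_score(ytest, y_pred)
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on a single split is only a rough comparison; cross-validation would give a more stable ranking.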


### 2. Unsupervised learning
* Data is not labeled 
* Model tries to identify patterns without external help 
* Clustering: e.g. grouping similar customers to drive purchase recommendations on an e-commerce website
* Anomaly Detection: e.g. someone using your credit card 

##### Visualize dataset using the t-SNE manifold learning algorithm 
* tsne = TSNE()
* data_tsne = tsne.fit_transform(data.data)
* df_data = pd.DataFrame(data_tsne, columns=['TSNE1','TSNE2'])
* df_data["value"] = data.target
* sns.lmplot(x="TSNE1", y="TSNE2", hue='value', data=df_data, fit_reg=False);
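
A minimal sketch of the t-SNE steps, assuming `data` is the digits dataset from `load_digits` (the original does not say). Only the first 500 samples are embedded to keep the run short, and the plotting call is left out so the block runs without a display.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

data = load_digits()

# Use a subset to keep the embedding fast (illustrative choice)
X_subset = data.data[:500]
y_subset = data.target[:500]

# Project the 64-dimensional pixel vectors down to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
data_tsne = tsne.fit_transform(X_subset)

# One row per sample: the two embedding coordinates plus the true digit
df_data = pd.DataFrame(data_tsne, columns=["TSNE1", "TSNE2"])
df_data["value"] = y_subset
print(df_data.head())
```

`df_data` has the layout the `sns.lmplot` call expects, with `value` used only to color the points.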

##### K-Means Clustering 
* data_km = KMeans(n_clusters=10, random_state=0)
* data_clusters_km = data_km.fit_predict(X_data)
* The elbow method can help choose the number of clusters: pick the smallest k beyond which the SSE (inertia) stops dropping sharply
*   sse = []
    for i in range(1, 20):
        kmeans = KMeans(n_clusters=i, init='k-means++')
        kmeans.fit(X_data)
        sse.append(kmeans.inertia_)
    plt.plot(range(1, 20), sse)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('SSE');
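
To see the elbow on data where the answer is known, the sketch below runs the same loop on synthetic blobs. The dataset (four well-separated centers) and the explicit `n_init=10` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated clusters, so the elbow should appear near k=4
X_data, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# SSE (inertia) for k = 1..9
ks = range(1, 10)
sse = []
for k in ks:
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(X_data)
    sse.append(kmeans.inertia_)

for k, value in zip(ks, sse):
    print(f"k={k}: SSE={value:.1f}")

# Cluster assignments for the chosen k
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_data)
print("clusters found:", len(set(labels)))
```

SSE keeps shrinking as k grows, so the useful signal is where the improvement flattens out, not where the value is smallest.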


### 3. Reinforcement Learning
* Conceptually similar to human learning processes 
* Learns the best set of actions to take in a given environment in order to maximize reward over time, e.g. recommendations by Netflix
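
One of the simplest settings that shows this idea is a multi-armed bandit with an epsilon-greedy agent. The sketch below is only an illustration under assumed reward probabilities; it is not taken from the notebook and leaves out states, which full reinforcement learning problems also have.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden reward probability of each action (hypothetical values)
true_reward_prob = np.array([0.2, 0.5, 0.8])
n_actions = len(true_reward_prob)

estimates = np.zeros(n_actions)        # running mean reward per action
counts = np.zeros(n_actions, dtype=int)
epsilon = 0.1                          # fraction of steps spent exploring
total_reward = 0.0

for step in range(5000):
    # Explore a random action with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(estimates))

    # The environment returns a reward of 1 or 0
    reward = float(rng.random() < true_reward_prob[action])

    # Incremental update of the mean reward for the chosen action
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
    total_reward += reward

print("estimated rewards:", np.round(estimates, 2))
print("pulls per action:", counts)
print("average reward:", total_reward / 5000)
```

The agent is never told which action is best; it finds out from the rewards alone and ends up choosing the highest-paying action most of the time.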

### 4. Deep Learning 
* Tries to loosely emulate how the human brain works 
* Applications: natural language processing; image, audio, and video analysis; time series forecasting; etc.
* Typically requires very large labeled datasets and is computationally expensive

