___

# Machine Learning in Geosciences
Department of Applied Geoinformatics and Cartography, Charles University

Lukas Brodsky lukas.brodsky@natur.cuni.cz


## Fundamental algorithms in scikit-learn applied to different classification problems


Purpose: explore selected classification algorithms on simulated (synthetic) data sets.


# Setup

In [None]:
# Common imports for reading, visualizing
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# data preparation
from sklearn.datasets import make_classification # make_blobs
from sklearn.datasets import make_moons
from sklearn.datasets import make_circles

from sklearn.model_selection import train_test_split

# Classifiers
# linear
from sklearn.linear_model import LogisticRegression
# trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# SVM 
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
# TODO later on neural network 

# accuracy assessment
from sklearn import metrics
# tuning
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.model_selection import cross_val_score

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures (pyplot is already imported above as plt)
%matplotlib inline
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Project dir
PROJECT_DIR = "./"
if os.path.isdir(PROJECT_DIR):
    print('Ok continue.')
else:
    print('Nok, set correct path to your project directory!')


## Get the simulated data


### Linear / non-linear blobs classification problem

Scikit-learn's function `make_classification` generates a random n-class classification problem. It:

* creates **clusters of points**, normally distributed (std=1),
* about the vertices of an **`n_informative`-dimensional hypercube**
* with sides of length **`2*class_sep`**, and assigns an equal number of clusters to each class;
* introduces **interdependence between these features** and adds various types of further **noise** to the data.

Feature layout:

* **without shuffling** (`shuffle=False`), `X` horizontally stacks the features in the following order: `n_informative` informative features, `n_redundant` linear combinations of the informative features, and `n_repeated` duplicates drawn from the informative and redundant features;
* the remaining features are filled with random noise;
* so, without shuffling, all useful features are contained in the columns `X[:, :n_informative + n_redundant + n_repeated]`; with `shuffle=True` the columns (and samples) are randomly permuted.


`sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, shuffle=True, random_state=None)`

Similarly `make_blobs` generates isotropic Gaussian blobs for clustering.
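The column layout described above can be checked with a small sketch. It assumes `shuffle=False` and the default `shift` and `scale`, and it uses `numpy.linalg.matrix_rank` as an illustrative check that is not part of the original notebook: if the redundant columns are linear combinations of the informative ones, all ten columns together should span only `n_informative` dimensions.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumption: same feature counts as the data sets used later in this notebook,
# but with shuffle=False so that the column order is predictable.
n_informative, n_redundant = 5, 5
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=n_informative,
                           n_redundant=n_redundant,
                           n_clusters_per_class=1,
                           shuffle=False, random_state=42)

# Columns 0..4 are informative, columns 5..9 are linear combinations of them,
# so the numerical rank of the full matrix equals n_informative.
rank = np.linalg.matrix_rank(X)
print("shape:", X.shape, "rank:", rank)
```

With `shuffle=True`, as used for the data sets below, the same features are present but in a random column order, so plotting columns 0 and 1 shows an arbitrary pair of features.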

### Non-linear moons shape classification problem

The `make_moons()` function generates a binary classification problem in the shape of two interleaving half circles (two moons). You can control how noisy the moon shapes are and the number of samples to generate.

This test problem is suitable **for algorithms that are capable of learning nonlinear class boundaries**.
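As a rough illustration of the `noise` parameter, the sketch below scores the same non-linear classifier on two moons data sets that differ only in noise level. The choice of `SVC` with its default RBF kernel, 5-fold cross-validation, and `random_state=42` are assumptions made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = {}
for noise in (0.1, 0.3):
    # same sample size, different amount of Gaussian noise on the coordinates
    X, y = make_moons(n_samples=1000, noise=noise, random_state=42)
    # the default SVC uses the non-linear RBF kernel
    scores[noise] = cross_val_score(SVC(), X, y, cv=5).mean()
    print(f"noise={noise}: mean CV accuracy = {scores[noise]:.3f}")
```

The noisier data set overlaps more, so the achievable accuracy drops even for a classifier that can follow the curved boundary.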

### Non-linear circles shape classification problem

The `make_circles()` function generates a binary classification problem whose samples fall into two concentric circles. As with the moons test problem, you can control the amount of noise in the shapes.

This test problem is **suitable for algorithms that can learn complex non-linear manifolds**.
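To see why a non-linear model is needed here, the following sketch compares a linear classifier with an RBF-kernel SVM on concentric circles. The value `factor=0.5` (inner radius half the outer one) is an assumption chosen to make the gap obvious; the notebook's own data set 5 uses the default `factor`.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: factor=0.5 and random_state=42 are illustrative choices
X, y = make_circles(n_samples=1000, noise=0.09, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# A linear decision boundary cannot separate an inner circle from an outer one
lin_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
# The RBF kernel can learn a closed, roughly circular boundary
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"logistic regression: {lin_acc:.3f}")
print(f"SVC (RBF kernel):    {rbf_acc:.3f}")
```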


### Data set 1

In [None]:
# create synthetic data set 1
X1, y1 = make_classification(n_samples = 1000, n_features = 10, n_informative = 5, n_redundant = 5, 
                          class_sep = 3, n_clusters_per_class=1, 
                          random_state = 42, shuffle = True) 

In [None]:
plt.plot(X1[(y1==0),0], X1[(y1==0),1], 'r.')
plt.plot(X1[(y1==1),0], X1[(y1==1),1], 'b.')

In [None]:
# Numpy array to pandas dataframe
features = [f"Feature {ii+1}" for ii in range(X1.shape[1])]
X1 = pd.DataFrame(X1, columns = features)
y1 = pd.DataFrame(y1, columns = ["Target"])

In [None]:
X1.head()

### Data set 2

In [None]:
# create synthetic data set 2 (same as data set 1, but with smaller class separation)
X2, y2 = make_classification(n_samples = 1000, n_features = 10, n_informative = 5, n_redundant = 5, 
                          class_sep = 1, n_clusters_per_class=1, 
                          random_state = 42, shuffle = True) 

In [None]:
plt.plot(X2[(y2==0),0], X2[(y2==0),1], 'r.')
plt.plot(X2[(y2==1),0], X2[(y2==1),1], 'b.')

In [None]:
# Numpy array to pandas dataframe
features = [f"Feature {ii+1}" for ii in range(X2.shape[1])]
X2 = pd.DataFrame(X2, columns = features)
y2 = pd.DataFrame(y2, columns = ["Target"])

In [None]:
X2.head()

### Data set 3

In [None]:
X3, y3 = make_moons(n_samples=1000, noise=0.1, random_state=42)

In [None]:
plt.plot(X3[(y3==0),0], X3[(y3==0),1], 'r.')
plt.plot(X3[(y3==1),0], X3[(y3==1),1], 'b.')

In [None]:
# Numpy array to pandas dataframe
features = [f"Feature {ii+1}" for ii in range(X3.shape[1])]
X3 = pd.DataFrame(X3, columns = features)
y3 = pd.DataFrame(y3, columns = ["Target"])

In [None]:
X3.head()

### Data set 4

In [None]:
X4, y4 = make_moons(n_samples=1000, noise=0.3, random_state=42)

In [None]:
plt.plot(X4[(y4==0),0], X4[(y4==0),1], 'r.')
plt.plot(X4[(y4==1),0], X4[(y4==1),1], 'b.')

In [None]:
# Numpy array to pandas dataframe
features = [f"Feature {ii+1}" for ii in range(X4.shape[1])]
X4 = pd.DataFrame(X4, columns = features)
y4 = pd.DataFrame(y4, columns = ["Target"])

In [None]:
X4.head()

### Data set 5

In [None]:
X5, y5 = make_circles(n_samples=1000, noise=0.09, random_state=42)

In [None]:
plt.plot(X5[(y5==0),0], X5[(y5==0),1], 'r.')
plt.plot(X5[(y5==1),0], X5[(y5==1),1], 'b.')

## Create train and test set    

In [None]:
# test 20 % 
pass
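One possible way to fill in the split, shown as a self-contained sketch: it regenerates data set 1 and holds out 20 % of the samples with `train_test_split`. Using `stratify=y` and `random_state=42` is an assumption of this example, not something the notebook prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same settings as data set 1 above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, class_sep=3, n_clusters_per_class=1,
                           random_state=42, shuffle=True)

# Hold out 20 % of the samples for testing; stratify keeps the class
# proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("train:", X_train.shape, "test:", X_test.shape)
```

The same call can be repeated for each of the five data sets; the test set should then stay untouched until the final evaluation.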

## Select classifiers & instantiate

In [None]:
# instantiate classifiers 

# logistic regression
log_clf = LogisticRegression(random_state=42)

# SVM (support vector classifier)
lsvc_cls = LinearSVC(random_state=42)
nsvc_cls = SVC(random_state=42)

# trees
dt_clf = DecisionTreeClassifier(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)

In [None]:
# pass

In [None]:
# here is a set of hyper-parameters to select from

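# Logistic regression
# note: the default 'lbfgs' solver does not support the 'l1' penalty,
# so a solver that handles both penalties is set explicitly
log_param_grid = [
                  {'penalty': ["l1", "l2"], 'solver': ["liblinear"]}
                 ]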

# SVM
lsvc_param_grid = [
                   {"C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100]}
                  ]

nsvc_param_grid = [
                    {"kernel": ["linear", "rbf", "poly"],
                     "gamma": ["auto"],
                     "C": [0.1, 0.5, 1, 5, 10, 50, 100]
                    }
                  ]


# Tree classifiers
dt_param_grid = [
                    {
                     "max_depth" : [2, 5, 10],
                     # "criterion": ["gini", "entropy"], 
                     "min_samples_split": [0.01, 0.05, 0.10],
                     "min_samples_leaf": [0.01, 0.05, 0.10]
                    }
                 ]

rf_param_grid = [
                    {
                    "n_estimators": [3, 10, 50, 100],
                    "max_depth" : [2, 5, 10],
                    # "criterion": ["gini", "entropy"], 
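                    # max_features must not exceed the number of features:
                    # 10 in data sets 1 and 2; use e.g. [1, 2] for data sets 3-5
                    'max_features': [2, 4, 6, 8, 10],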
                    "min_samples_split": [0.01, 0.05, 0.10],
                    "min_samples_leaf": [0.01, 0.05, 0.10],
                    "n_jobs": [-1]
                    }
                 ]
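Before running a grid search it can help to know how many model fits a grid implies. The sketch below counts the candidates with `ParameterGrid` for two of the grids above; the 5-fold setting (`n_splits = 5`) is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the cell above
dt_param_grid = [{"max_depth": [2, 5, 10],
                  "min_samples_split": [0.01, 0.05, 0.10],
                  "min_samples_leaf": [0.01, 0.05, 0.10]}]
nsvc_param_grid = [{"kernel": ["linear", "rbf", "poly"],
                    "gamma": ["auto"],
                    "C": [0.1, 0.5, 1, 5, 10, 50, 100]}]

n_splits = 5  # assumption: 5-fold cross-validation
counts = {}
for name, grid in [("decision tree", dt_param_grid), ("SVC", nsvc_param_grid)]:
    # ParameterGrid enumerates every combination of the listed values
    counts[name] = len(ParameterGrid(grid))
    print(f"{name}: {counts[name]} candidates, "
          f"{counts[name] * n_splits} fits with {n_splits}-fold CV")
```

The random forest grid has more parameters and therefore many more combinations, which is worth keeping in mind when choosing the number of folds.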


## Tune the classifier 

In [None]:
pass
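A minimal sketch of one way to tune a classifier, using the `Pipeline`, `StandardScaler`, `GridSearchCV` and `KFold` tools imported in the setup. The data set (moons with `noise=0.1`), the reduced grid, and the step names `"scaler"` and `"svc"` are assumptions chosen to keep the example fast; the full grids above can be substituted by prefixing each parameter with the step name.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling inside the pipeline is refit on every training fold,
# so no information leaks from the validation fold
pipe = Pipeline([("scaler", StandardScaler()),
                 ("svc", SVC(random_state=42))])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"svc__kernel": ["linear", "rbf"],
              "svc__C": [0.1, 1, 10]}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the pipeline refit on the whole training set with the best parameters.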


## Evaluate the model 

In [None]:
pass
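The evaluation step could look like the sketch below, which scores a model on the held-out test set with functions from `sklearn.metrics`. The classifier (a random forest with default settings) and the data (data set 1 settings) are assumptions for illustration; in practice the tuned model from the previous step would be evaluated.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same settings as data set 1 above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, class_sep=3, n_clusters_per_class=1,
                           random_state=42, shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Overall accuracy and the confusion matrix (rows: true class, columns: predicted)
acc = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
print(f"test accuracy: {acc:.3f}")
print(cm)

# Per-class precision, recall and F1 score
print(metrics.classification_report(y_test, y_pred))
```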

In [None]:
# Which model works well for data sets 1, 2, 3, 4 and 5?