# Classification & Clustering

**Teamwork is not allowed.** Everyone implements their own code. Discussing issues with others is fine; sharing code with others is not.

## Part 1: Forest Cover Type Classification

In this exercise, we will **predict the forest cover type** (the predominant kind of tree cover) from strictly cartographic variables.
As in the regression assignment, Y stands for a column vector of "target" values, that is, the i-th row of Y contains the desired output for the i-th data point. In contrast to regression, the elements of Y in this classification task are integer class labels.

We will work with several popular classifiers provided by the scikit-learn package.

## Dataset: Forest cover data
This dataset contains 581012 tree observations from four areas of the Roosevelt National Forest in Colorado. All observations are cartographic variables (no remote sensing) from 30 meter x 30 meter sections of forest. 

This dataset includes information on tree type, shadow coverage, distance to nearby landmarks (roads etcetera), soil type, and local topography.

### Data Dictionary

|Variable Name | Description |
|-|-|
| Elevation | Elevation in meters.|
| Aspect | Aspect in degrees azimuth.|
| Slope | Slope in degrees.|
| Horizontal_Distance_To_Hydrology | Horizontal distance to nearest surface water features.|
| Vertical_Distance_To_Hydrology | Vertical distance to nearest surface water features.|
| Horizontal_Distance_To_Roadways | Horizontal distance to nearest roadway.|
| Hillshade_9am | Hill shade index at 9am, summer solstice. Value out of 255.|
| Hillshade_Noon | Hill shade index at noon, summer solstice. Value out of 255.|
| Hillshade_3pm | Hill shade index at 3pm, summer solstice. Value out of 255.|
| Horizontal_Distance_To_Fire_Points | Horizontal distance to nearest wildfire ignition points.|
| Wilderness_Area1 | Rawah Wilderness Area|
| Wilderness_Area2 | Neota Wilderness Area|
| Wilderness_Area3 | Comanche Peak Wilderness Area|
| Wilderness_Area4 | Cache la Poudre Wilderness Area|
| Soil_Type| Soil_Type1 to Soil_Type40 (Total 40 Types)|
| **Cover_Type** | Forest Cover Type designation. |

**Cover_Type** is an integer value between 1 and 7, with the following key:

    1. Spruce/Fir
    2. Lodgepole Pine
    3. Ponderosa Pine
    4. Cottonwood/Willow
    5. Aspen
    6. Douglas-fir
    7. Krummholz



## Objective: 

We will **predict the cover types** in the different wilderness areas of the Roosevelt National Forest of Northern Colorado as accurately as possible.


In [None]:
# import packages
import pandas as pd
import warnings
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

## 1) Load data

- Use ```pandas.read_csv()``` to load the data.

In [None]:
df = pd.read_csv('ForestCover.csv')

- Visualize the first and the last 5 rows of the data, using ```.head()``` and ```.tail()```.

## 2) Basic statistics
- Print overall info, using ```.info()```.

- Print dataframe statistics using ```.describe()```.

- Check if there are missing values, using ```.isnull().sum()```. If yes, drop them or fill them.
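The missing-value check and the two ways of handling it can be sketched on a tiny stand-in table. The frame `toy` and its values are made up for illustration, and filling with the column median is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the forest cover data.
toy = pd.DataFrame({
    "Elevation": [2596.0, 2590.0, np.nan, 2785.0],
    "Slope": [3, 2, 9, 18],
    "Cover_Type": [5, 5, 2, 2],
})

missing_per_column = toy.isnull().sum()   # number of NaNs in each column
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
toy_dropped = toy.dropna()

# Option 2: fill missing values, here with the median of each column.
toy_filled = toy.fillna(toy.median(numeric_only=True))
```

Dropping rows is the simpler option when only a few rows are affected; filling keeps the row count unchanged.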

## 3) Exploratory Data Analysis
- Show the category distribution, using ```.value_counts()```.

- Visualize this distribution, using ```sns.countplot()```.
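As a small illustration of what ```.value_counts()``` returns, the sketch below counts a made-up label column. The series `labels` is hypothetical; `normalize=True` is an optional argument that reports shares instead of counts.

```python
import pandas as pd

# Hypothetical label column with three of the seven cover types.
labels = pd.Series([1, 2, 2, 2, 1, 7, 2, 1], name="Cover_Type")

counts = labels.value_counts().sort_index()                 # absolute counts per class
shares = labels.value_counts(normalize=True).sort_index()   # fraction of rows per class

print(counts)
print(shares)
```

Looking at shares rather than raw counts makes the degree of class imbalance easier to judge.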

### Feature Histograms 
- Visualize the data distribution of the first four features via histograms, using ```sns.histplot()```. (Show four figures.)

### Correlation between Variables
- Show the correlation between variables, using ```sns.heatmap()```. (Since 55 columns are too many, please show a 10x10 heatmap for the first 10 features.)
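The 10x10 matrix behind such a heatmap can be computed with ```.iloc``` and ```.corr()```. The sketch below uses random stand-in data with hypothetical column names `f0` to `f11`; it only computes the matrix and leaves the plotting call as a comment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame with 12 numeric columns named f0..f11.
toy = pd.DataFrame(rng.normal(size=(200, 12)),
                   columns=[f"f{i}" for i in range(12)])
# Make f1 strongly correlated with f0 so the matrix has some structure.
toy["f1"] = 2 * toy["f0"] + rng.normal(scale=0.1, size=200)

corr = toy.iloc[:, :10].corr()   # Pearson correlation of the first 10 columns
print(corr.shape)

# The matrix can then be drawn with, for example:
# sns.heatmap(corr, annot=True, fmt=".2f")
```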

### Data Distribution w.r.t. Categories
- Show data distribution w.r.t. categories, using ```sns.boxplot()```. (x-axis: cover type, y-axis: feature variable, please show 10 figures for the first 10 variables.)

- Are there any features that show little variation across the classes? Which ones?

- Which features might do a good job in the prediction?

## 4) Training Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

- Define Feature (as X) and Target (as y)

In [None]:
X = 
y = 

- Split the data into a train set (70%) and a test set (30%), using a fixed random seed.
- Print the size of the train and test sets.

In [None]:
X_train, X_test, y_train, y_test = 

- Normalize the data using ```StandardScaler()```
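A minimal sketch of the split and the normalization on random stand-in data follows. The arrays `X_toy` and `y_toy` are hypothetical, and `stratify=` is an optional extra that keeps the class proportions similar in both sets. The scaler is fitted on the training set only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))   # hypothetical features
y_toy = rng.integers(1, 8, size=1000)                    # hypothetical labels 1..7

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean and std on the training set
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics
```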

### 4a) Logistic Regression
- Train the Logistic Regression model

- Predict on the test data

- Compute and print performance metrics, using ```accuracy_score()``` to compute the fraction of correctly classified samples.
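The train, predict, and score pattern can be sketched on synthetic data from `make_classification`. The data set, the variable names, and `max_iter=1000` are assumptions made for this illustration, not settings prescribed by the assignment. The last line shows that ```accuracy_score()``` is simply the fraction of matching labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000)   # a higher max_iter helps convergence
clf.fit(scaler.transform(X_tr), y_tr)

y_pred = clf.predict(scaler.transform(X_te))
acc = accuracy_score(y_te, y_pred)
manual_acc = (y_pred == y_te).mean()      # same quantity computed by hand
print(f"accuracy: {acc:.3f}")
```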

### 4b) Random Forest
- Train and test with the Random Forest classifier
- Print the accuracy
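A random forest follows the same pattern. The sketch below again uses synthetic stand-in data; `n_estimators=100` and `n_jobs=-1` (use all CPU cores) are illustrative choices. It also reads `feature_importances_`, which can be compared with the impressions from the exploratory analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the real data.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

rf = RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=-1)
rf.fit(X_tr, y_tr)

rf_acc = accuracy_score(y_te, rf.predict(X_te))
importances = rf.feature_importances_   # one non-negative weight per feature
print(f"accuracy: {rf_acc:.3f}")
print(importances)
```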

### 4c) K Nearest Neighbor
- Train and test with the KNN classifier
- Print the accuracy

(This may take a while: around an hour on a single CPU core.)
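Because KNN prediction is the slow step, it can help to parallelize the neighbor search and to score a small slice first to gauge the runtime. The sketch below shows both ideas on synthetic stand-in data; `n_neighbors=5`, `n_jobs=-1`, and the slice size of 100 are illustrative assumptions, and scaling is skipped only because the synthetic features already have comparable scales.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the real data.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5, random_state=2,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2
)

# n_jobs=-1 runs the neighbor search on all available CPU cores.
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn.fit(X_tr, y_tr)

# Score a small slice first to estimate how long the full test set will take.
knn_acc_subset = accuracy_score(y_te[:100], knn.predict(X_te[:100]))
knn_acc_full = accuracy_score(y_te, knn.predict(X_te))
print(f"subset: {knn_acc_subset:.3f}, full: {knn_acc_full:.3f}")
```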

### OPTIONAL 4d) Support Vector Machine
- Train and test with the SVM classifier
- Print the accuracy

(This may take a while: around 3 hours on a single CPU core.)

### Conclusion
- Please write your conclusion:

## Part 2: Clustering 

## 1) sklearn K-means

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, MeanShift
%matplotlib inline 

### 1a) Toy dataset: blobs
- Load toy dataset (blobs).

In [None]:
data_blobs = np.genfromtxt('toy_data.csv', delimiter=',')

- Use the sklearn ```KMeans``` class (with default parameters) to cluster the points.

- Plot clustering results using ```plt.scatter()``` and color the datapoints according to their cluster

- Choosing the value of ```n_clusters``` without extra information is not trivial. For the blobs data, we don't have any labels. Which configuration do you think is best for this dataset? How many clusters would you choose?
- Run the KMeans algorithm with your ```n_clusters``` value and plot your results.
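One common heuristic for choosing the number of clusters, offered here only as a possible aid, is the "elbow" in the within-cluster sum of squares (`inertia_`). The sketch below computes it for several values of k on blobs generated with `make_blobs`; this generated data is an assumption and is not the content of `toy_data.csv`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic blobs with 4 true centers, standing in for the toy dataset.
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6,
                        random_state=0)

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = km.inertia_   # sum of squared distances to the closest center

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia keeps falling as k grows, so the value of interest is where the decrease flattens out, not the minimum.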

### 1b) Spiral dataset
Now we try to use the KMeans algorithm to cluster the Spiral dataset.

In [None]:
# load data
spiral = np.load("spiral.npz")['x']


- Use sklearn ```KMeans``` to cluster the points and visualize the result as before with the blobs dataset.

- Does it work? Please explain your answer. Which assumptions does k-means make about the data?
- What limitations do you think K-means would have?
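To experiment with the question outside the provided file, the sketch below builds two interleaved spiral arms and measures how well two k-means clusters agree with the arms, using the adjusted Rand index (1 means perfect agreement, values near 0 mean chance level). The generated spiral is an assumption and may differ from the data in `spiral.npz`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two point-symmetric spiral arms with a little noise (hypothetical data).
theta = rng.uniform(0.5, 3 * np.pi, size=300)
arm = np.column_stack([theta * np.cos(theta), theta * np.sin(theta)])
X_spiral = np.vstack([arm, -arm]) + rng.normal(scale=0.1, size=(600, 2))
true_arm = np.repeat([0, 1], 300)   # which arm each point belongs to

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_spiral)
ari = adjusted_rand_score(true_arm, km_labels)
print(f"adjusted Rand index: {ari:.3f}")
```

With two centers, k-means separates the plane with a straight line, which is worth keeping in mind when interpreting the score.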

## 2) sklearn mean-shift

In this section we do the same task as before but with the mean shift algorithm instead of kmeans.

### 2a) Toy dataset: blobs

- Use the sklearn ```MeanShift``` class to cluster the points.

- Plot clustering results 

- Try different hyper-parameters (in particular the bandwidth, which is the most important parameter for mean shift) and plot the results.
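A possible starting point for the bandwidth is sklearn's `estimate_bandwidth` helper; scaling that estimate up and down shows how strongly the number of clusters depends on it. The blobs below are generated with `make_blobs` as a stand-in, and the `quantile` value and the scaling factors are arbitrary illustrative choices.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic blobs standing in for the toy dataset.
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                        random_state=0)

# Data-driven starting value for the bandwidth.
bw = estimate_bandwidth(X_blobs, quantile=0.2, random_state=0)

results = []
for factor in (0.5, 1.0, 5.0):
    ms = MeanShift(bandwidth=factor * bw).fit(X_blobs)
    results.append((factor * bw, len(ms.cluster_centers_)))

for bandwidth, n_found in results:
    print(f"bandwidth={bandwidth:.2f} -> {n_found} clusters")
```

Unlike k-means, mean shift is not told the number of clusters; it follows from the bandwidth.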

### 2b) Toy dataset: spiral

- Use mean shift to cluster the Spiral dataset.
- Plot the results.

- Does it work?
- What kinds of clusters is this approach better at discovering?
- (Optional) Brainstorm: do you have a solution for this dataset? 

As we have seen in step 3) of Part 1, the forest cover data is imbalanced.

### 5a) Training with under-sampled data

- Print the size of the smallest class.

- Undersample all the majority classes so that every class has the same size as the smallest class.

In [None]:
# subsets for each class, using .query()
# downsample each subset, using .sample()
# concatenate the seven subsets, using pd.concat(), and shuffle the data (using .sample() on the full set)


- Check the class distribution of the undersampled data.
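The three steps named in the cell comments (query, sample, concatenate and shuffle) can be sketched on a tiny made-up frame with three classes. The frame `toy` and its `feature` column are hypothetical; the real data has seven classes.

```python
import pandas as pd

# Hypothetical imbalanced toy frame: class sizes 6, 4 and 2.
toy = pd.DataFrame({
    "feature": range(12),
    "Cover_Type": [1] * 6 + [2] * 4 + [3] * 2,
})

n_min = toy["Cover_Type"].value_counts().min()   # size of the smallest class

subsets = []
for c in sorted(toy["Cover_Type"].unique()):
    subset = toy.query(f"Cover_Type == {c}")                 # rows of one class
    subsets.append(subset.sample(n=n_min, random_state=0))   # keep n_min of them

# Concatenate, then shuffle: frac=1 returns all rows in random order.
undersampled = (pd.concat(subsets)
                .sample(frac=1, random_state=0)
                .reset_index(drop=True))
print(undersampled["Cover_Type"].value_counts())
```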

- Preprocessing data (define X, y; train test split; normalize data)

- Train and test the random forest classifier on under-sampled data
- Print the accuracy

### 5b) Training with over-sampled data

- Print the size of the largest class.

- Oversample the minority classes so that every class has the same size as the largest class.

In [None]:
# subsets for each class, using .query()
# oversample each subset, using .sample() with replace=True
# concatenate the seven subsets, using pd.concat(), and shuffle the data (using .sample() on the full set)


- Check the class distribution of the oversampled data.
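Oversampling uses the same three steps, but draws with replacement so that a class can grow beyond its original size. The sketch below uses a made-up three-class frame; `toy` and its `feature` column are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced toy frame: class sizes 6, 4 and 2.
toy = pd.DataFrame({
    "feature": range(12),
    "Cover_Type": [1] * 6 + [2] * 4 + [3] * 2,
})

n_max = toy["Cover_Type"].value_counts().max()   # size of the largest class

subsets = []
for c in sorted(toy["Cover_Type"].unique()):
    subset = toy.query(f"Cover_Type == {c}")
    # replace=True allows drawing more rows than the class contains.
    subsets.append(subset.sample(n=n_max, replace=True, random_state=0))

oversampled = (pd.concat(subsets)
               .sample(frac=1, random_state=0)
               .reset_index(drop=True))
print(oversampled["Cover_Type"].value_counts())
```

The balanced frame necessarily contains repeated rows, since the small classes are drawn several times over.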

- Preprocessing data (define X, y; train test split; normalize data)

- Train and test the random forest classifier on over-sampled data
- Print the accuracy

### Conclusion on imbalanced data solution (with random forest classifier):

Accuracy:

- imbalanced data:

- undersampled data: 

- oversampled data: 

Note that over-sampling before the train/test split involves a bit of cheating: copies of the same row can end up in both the train and the test set, so the test accuracy is overly optimistic.
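One way to avoid this effect, sketched here as a suggestion and not as part of the required steps, is to split first and oversample only the training set. The example counts how many test rows also occur in the training set for both orders. The frame `toy`, the `row_id` column, and the helper `oversample` are hypothetical names introduced for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame; row_id identifies each original row.
toy = pd.DataFrame({"row_id": range(100),
                    "Cover_Type": [1] * 80 + [2] * 20})

def oversample(frame, label="Cover_Type", seed=0):
    """Draw every class with replacement up to the size of the largest class."""
    n_max = frame[label].value_counts().max()
    parts = [group.sample(n=n_max, replace=True, random_state=seed)
             for _, group in frame.groupby(label)]
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Order 1: oversample, then split. Copies land on both sides of the split.
leaky_train, leaky_test = train_test_split(
    oversample(toy), test_size=0.3, random_state=0
)
leaky_overlap = leaky_test["row_id"].isin(leaky_train["row_id"]).sum()

# Order 2: split, then oversample the training part only.
train, test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["Cover_Type"]
)
train_balanced = oversample(train)
clean_overlap = test["row_id"].isin(train_balanced["row_id"]).sum()

print("test rows also in train (oversample first):", leaky_overlap)
print("test rows also in train (split first):", clean_overlap)
```

In the second order the test set keeps the original class distribution, so its accuracy is not directly comparable to an accuracy measured on a balanced test set.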