# Practical Session: Random Forest and XGBoost

This practical session introduces two very popular ensemble classification methods:
1. Random Forest
2. XGBoost

It is structured as follows:
1. The challenge
2. Decision Tree
3. Random Forest
4. XGBoost
5. Analysis

## The challenge
### Introduction
ESA's new Earth Observation constellation **Sentinel-2** provides images with high spatial resolution (10 m), many spectral bands, and a short revisit time of 5 days. It produces time series (in other words, a video) of multispectral images over all continental surfaces in the world. One of the many practical applications of this imagery is land cover mapping, which involves classifying the various objects that can be seen from a satellite: roads, forests, cities, fields, etc. This is a real supervised classification problem, addressed by several companies as well as by public bodies (CNES, for instance). The idea is to use an entire year of data to classify each pixel. The temporal information is very useful for distinguishing between crop types and for mitigating the negative impact of clouds.

This problem is challenging for a few reasons. First, the multispectral time series are high-dimensional: each one is composed of 33 dates spread across the year, and each date is an image with 10 spectral bands, so the base feature space already has 33 × 10 = 330 dimensions. Second, there is a large amount of intra-class variation, due to the variety of agricultural practices and to climatic differences between areas. Finally, the class nomenclature itself is quite challenging. In this session you will be working with a reduced nomenclature, with only 8 classes, but the full target nomenclature contains 17 classes.

In this practical session, you will test a few basic classification methods to try to solve this problem.
### The data set
The data set covers two different areas, a training area and a test area. For each area, two kinds of files are provided:
- The time series: **train1.npy / test1.npy**
- The labels of some of the pixels in the images: **train1ref.npy / train1ref_s.npy / test1ref.npy**

**train1ref_s.npy** is a reduced version of **train1ref.npy**, with fewer samples per class for a faster training.

Link to the data : https://drive.google.com/drive/folders/1aEJD9QIM2hgN_0cqCQF-C3Ia7gH4cKSo?usp=sharing

Q1. Import the training image in a numpy array using *np.load*. Look at the dimension of the image.

The time series is organized first by date; within each date, the spectral bands are ordered as follows:
- Visible
    - B1 (490 nm) (Blue)
    - B2 (560 nm) (Green)
    - B3 (665 nm) (Red)
- Near Infra Red
    - B4 (842 nm) (NIR, 10m)
    - B5 (705 nm)
    - B6 (740 nm)
    - B7 (783 nm)
    - B8 (865 nm)
- Short wave IR
    - B9 (1610 nm)
    - B10 (2190 nm)
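
To make this layout concrete, the sketch below computes where a given band of a given date sits in the 330-feature vector. It assumes the features are stored date-major (the 10 bands of the first date, then the 10 bands of the second date, and so on), with zero-based dates and bands numbered B1 to B10 as in the list above. `feature_index` is a hypothetical helper, not part of the provided code, and the pixel is synthetic.

```python
import numpy as np

N_DATES, N_BANDS = 33, 10  # from the data set description

def feature_index(date, band):
    """Index of band `band` (1..10) at date `date` (0..32) in the 330-vector."""
    if not (0 <= date < N_DATES and 1 <= band <= N_BANDS):
        raise ValueError("date or band out of range")
    return date * N_BANDS + (band - 1)

# Synthetic stand-in for one pixel: feature k simply holds the value k.
pixel = np.arange(N_DATES * N_BANDS)

# Reshaping gives one row per date and one column per band.
per_date = pixel.reshape(N_DATES, N_BANDS)
red_first_date = per_date[0, 2]     # B3 (Red) at the first date
nir_20th_date = per_date[19, 3]     # B4 (NIR) at the 20th date
```

If the assumption on the ordering holds, the same reshape applied to the last axis of the image gives direct access to any (date, band) pair.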
    
Q2. Use the function *displayImage* to visualize the RGB bands of the first date. Try to understand the normalize function, and why it is necessary.

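```python
import numpy as np
import matplotlib.pyplot as plt

def normalize(i, vmax=50):
    # Linearly rescale [0, vmax] to [0, 255]; values above vmax saturate at 255.
    return i * 255 / vmax if i < vmax else 255

def displayImage(data):
    # data: (rows, cols, 3) array holding the three bands to show as R, G, B.
    normalize_v = np.vectorize(normalize)
    data_norm = normalize_v(data).astype(np.uint8)
    plt.imshow(data_norm)
    plt.show()
```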

Q3. Show the 20th date in false color, by replacing the red band by the infra-red band (B4)
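
One possible way to build such a composite is to pick three bands of one date and stack them along the last axis. The sketch below assumes the image has shape (rows, cols, 330) with date-major feature ordering, and uses a random array in place of **train1.npy**. `composite` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

N_BANDS = 10

def composite(image, date, bands):
    """Stack three bands (1-based, e.g. (3, 2, 1) for R, G, B) of one date."""
    idx = [date * N_BANDS + (b - 1) for b in bands]
    return image[:, :, idx]

rng = np.random.default_rng(0)
image = rng.integers(0, 100, size=(4, 5, 330))   # synthetic stand-in

true_color = composite(image, date=0, bands=(3, 2, 1))     # Red, Green, Blue
false_color = composite(image, date=19, bands=(4, 2, 1))   # B4 replaces Red
```

Either array has shape (rows, cols, 3) and can be passed to *displayImage*.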

Q4. Import the reduced set of training labels *train1ref_s.npy*, and visualize them using *displayMap*.

```python
import matplotlib.pyplot as plt
from matplotlib import colors

def displayMap(m):
    # One colour per class code; bounds[k] is the lower edge of the k-th class bin.
    cmap = colors.ListedColormap(['white','#f05824','#f8f381','#1b9c4a','#afd037','#53a97f','#a13a94','#b7529e','#dba0c8','#f0cee2','#3a54a3'])
    bounds = [0, 11, 12, 31, 34, 36, 41, 42, 43, 44, 51, 52]
    norm = colors.BoundaryNorm(bounds, cmap.N)
    im = plt.imshow(m, cmap=cmap, norm=norm)
    # The colorbar takes its colormap and norm from the image itself.
    plt.colorbar(im, boundaries=bounds, ticks=bounds)
    plt.show()
```

The class nomenclature is as follows:

- 0 : Unknown
- 11 : Summer crop
- 12 : Winter crop
- 31 : Forest
- 34 : Natural grassland
- 36 : Woody moorlands
- 41 : Continuous urban fabric
- 42 : Discontinuous urban fabric
- 43 : Industrial and commercial units
- 44 : Roads
- 51 : Water

The final objective is to label all of the pixels that are currently labeled "0", that is, the pixels whose class is not yet known. For this, we are going to learn a classification model, whose goal is to associate a label with an unlabeled sample (here, an unlabeled pixel). To train the model, we first have to adapt the data set to a format that is readable by scikit-learn, i.e. a list of samples (data) and a list of labels (target).

Q5. Create two identically sized lists, containing respectively the data samples and their associated labels. For this, you can loop over the X and Y indices of the image, check whether the point is labeled, and if it is, append it to your lists. How many training samples are available? How many are available per class (hint: use *plotPriors*)?
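
As an alternative to the explicit double loop, the same selection can be sketched with a boolean mask. The example below assumes the image has shape (rows, cols, 330) and the reference map has shape (rows, cols) with 0 meaning "unknown"; both arrays are synthetic stand-ins for the real files.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 7, 330))                      # synthetic time series
labels = rng.choice([0, 11, 12, 31], size=(6, 7))    # synthetic reference map

mask = labels != 0       # True where the pixel has a known class
data = image[mask]       # shape (n_labelled, 330)
target = labels[mask]    # shape (n_labelled,)

# Number of samples per class.
classes, counts = np.unique(target, return_counts=True)
```

Boolean indexing visits the pixels in the same row-major order as a loop over rows and then columns, so `data[k]` and `target[k]` always refer to the same pixel.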

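```python
import numpy as np
import matplotlib.pyplot as plt

def plotPriors(targetList):
    # Bin edges are the class codes; the final edge (52) closes the last bin.
    bins = [11, 12, 31, 34, 36, 41, 42, 43, 44, 51, 52]
    priors = np.histogram(targetList, bins)[0]
    positions = np.arange(1, len(priors) + 1)
    plt.bar(positions, priors)
    ax = plt.gca()
    # Set tick positions before tick labels so that both have the same length.
    ax.set_xticks(positions)
    ax.set_xticklabels(bins[:-1])
    plt.show()
```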

Q6. In the same way as in the previous questions, prepare a list of validation data samples and a list of their validation labels, using the validation data sets **test1.npy** and **test1ref.npy**. Why is the same image not used for both training and validation?

## Decision Tree
Q7. Train a decision tree on the training data, then:
1. Print the training error.
2. Print the confusion matrix, using the *confusion_matrix* function from sklearn.metrics.
3. Show the feature importance (use the feature\_importances\_ attribute of the classifier). You can reorganize them into a 33x10 matrix using *reshape*, and plot them with *imshow*. 
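
The three steps above can be sketched as follows on synthetic data. The random samples and labels only stand in for the lists built from the training image, and the default tree parameters are an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X_train = rng.random((200, 330))                  # synthetic samples
y_train = rng.choice([11, 12, 31], size=200)      # synthetic labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Training error: fraction of training samples that are misclassified.
train_error = 1.0 - clf.score(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_train, clf.predict(X_train))

# One importance per feature, viewed as a (date, band) matrix.
importances = clf.feature_importances_.reshape(33, 10)
```

The `importances` matrix can then be shown with *imshow*; the reshape is only meaningful if the features are ordered by date first and by band second.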

Q7b. Classify the whole image that was used for training. For this, you can use the *predict* method of the classifier, which classifies a batch of samples given as a np.array. You can loop over the columns of the image and classify each column with one call to *predict*.
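
A minimal sketch of this column-by-column classification is given below. It assumes an image of shape (rows, cols, 330); the image, the training samples, and the shallow tree are synthetic placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
image = rng.random((8, 6, 330))                       # synthetic image
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(rng.random((50, 330)), rng.choice([11, 12, 31], size=50))

rows, cols, n_features = image.shape
class_map = np.zeros((rows, cols), dtype=int)
for c in range(cols):
    # image[:, c, :] is a batch of `rows` samples with 330 features each.
    class_map[:, c] = clf.predict(image[:, c, :])
```

Working one column at a time keeps each batch small, which can matter for memory when the image is large. The resulting `class_map` can be displayed with *displayMap*.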

Q8. Now look at the generalization error and the confusion matrix, using the validation data set. What are the main sources of confusion? How does this classifier perform?

Q9. Classify the validation image, and visually analyze the result, by comparing it to the RGB bands of the first date of the image.

## Random Forest

Now, we are going to use a group of trees (often called an ensemble) to try to improve the accuracy of the classification. The main parameters are:
- n_estimators : The number of trees
- criterion : The split criterion
- max_depth : The maximal depth of the trees
- max_features : The maximum number of features tested at each split

The complete list can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Q10. Repeat the steps of questions 6 to 9, using the Random Forest classifier, with a max depth of 5, and 10 trees.
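
As an illustration of how the parameters listed above map onto the scikit-learn constructor, here is a sketch with 10 trees and a maximum depth of 5. The values of `criterion` and `max_features` are arbitrary choices for the example, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 330))                    # synthetic samples
y = rng.choice([11, 12, 31], size=200)        # synthetic labels

forest = RandomForestClassifier(
    n_estimators=10,       # number of trees
    criterion="gini",      # split criterion
    max_depth=5,           # maximal depth of each tree
    max_features="sqrt",   # number of features tested at each split
    random_state=0,
)
forest.fit(X, y)
```

The fitted forest exposes the same *predict*, *score* and feature\_importances\_ interface as a single decision tree.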

## XGBoost

Q11. Repeat the steps of questions 6-9 with the XGBoost classifier, with 10 trees.
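
One practical point when using the scikit-learn wrapper of XGBoost: recent versions of the library expect class labels to be consecutive integers starting at 0, whereas the codes used here are 11, 12, 31, and so on. The sketch below re-encodes the labels with `LabelEncoder` and maps the predictions back. It treats `xgboost` as an optional dependency, and `max_depth=5` is an arbitrary value chosen for the example; the data are synthetic.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
X = rng.random((200, 330))                       # synthetic samples
y = rng.choice([11, 12, 31, 51], size=200)       # synthetic class codes

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)             # 11, 12, 31, 51 -> 0, 1, 2, 3

try:
    from xgboost import XGBClassifier            # optional dependency
except ImportError:
    XGBClassifier = None

if XGBClassifier is not None:
    model = XGBClassifier(n_estimators=10, max_depth=5)
    model.fit(X, y_encoded)
    # Map the predicted indices back to the original class codes.
    y_pred = encoder.inverse_transform(model.predict(X))
```

Remember to apply `inverse_transform` before building the confusion matrix or displaying a map, so that the class codes match the nomenclature.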

## Questions
Q12. Analyze the effect of the depth of the trees used by the classifier, using both the training error and the generalization error, for the DT, RF and XGB. For the last two, use a fixed low number of estimators (10 for example). In which case is there overfitting? How do the ensemble classifiers counter overfitting?
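
A possible skeleton for this kind of analysis is a loop over candidate depths that records both errors. The sketch below uses a decision tree on a small synthetic problem (the label depends on the first feature plus noise); the list of depths is arbitrary, and the same loop can be reused with the other classifiers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins: the label depends on the first feature plus noise.
X_train, X_val = rng.random((300, 20)), rng.random((300, 20))
y_train = (X_train[:, 0] + 0.2 * rng.standard_normal(300) > 0.5).astype(int)
y_val = (X_val[:, 0] + 0.2 * rng.standard_normal(300) > 0.5).astype(int)

depths = [1, 2, 4, 8, 16]
train_errors, val_errors = [], []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_errors.append(1.0 - clf.score(X_train, y_train))
    val_errors.append(1.0 - clf.score(X_val, y_val))
```

Plotting `train_errors` and `val_errors` against `depths` on the same axes makes it easy to compare the two curves.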

Q13. For the Random Forest and XGBoost classifiers: analyze the impact of the number of trees on the training error and on the generalization error. In which case is there overfitting?

Q14. Add 100 features of noise to the image, using *np.random.rand* and *np.concatenate*. What is the impact on the different classification results? Compare a case with few trees and a case with many trees, and explain the results.
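
The mechanics of appending noise features can be sketched as follows, assuming an image of shape (rows, cols, 330); a small random array stands in for the real image.

```python
import numpy as np

rng_image = np.random.default_rng(0)
image = rng_image.random((4, 5, 330))          # synthetic stand-in

rows, cols, _ = image.shape
noise = np.random.rand(rows, cols, 100)        # uniform noise in [0, 1)

# Concatenate along the feature axis: 330 original + 100 noise features.
noisy_image = np.concatenate([image, noise], axis=2)
```

The same noise features have to be appended to the validation image as well, since the classifier expects the same number of features at training and prediction time.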

Q15. Copy the first feature of the image 100 times, and add the copies to the original image. What is the impact on the different classifiers? Compare a case with few trees and a case with many trees, and explain the results.

Q16. Multiply the first 100 features by a constant factor of 1000. What is the impact on the different classifiers?

Q17. The Normalized Difference Vegetation Index (NDVI) is a non-linear combination of the Red and Infrared bands that reacts strongly to vegetation, due to the Red-Edge effect. Create the stack of NDVI values, using the formula given below, and add it to the classification features. How do the classifiers react?
$$ NDVI=\frac{IR-R}{IR+R}$$
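
The formula can be applied to all dates at once. The sketch below assumes that R is band B3 and IR is band B4 of the band list, that the features are ordered date-major, and that the image has shape (rows, cols, 330). Setting the index to 0 where IR + R is zero is a choice made for this example only; the image is synthetic.

```python
import numpy as np

N_DATES, N_BANDS = 33, 10
RED, NIR = 2, 3          # zero-based positions of B3 (Red) and B4 (NIR)

rng = np.random.default_rng(0)
image = rng.random((4, 5, N_DATES * N_BANDS))   # synthetic stand-in

# View the feature axis as (date, band).
cube = image.reshape(image.shape[0], image.shape[1], N_DATES, N_BANDS)
red = cube[..., RED].astype(float)
nir = cube[..., NIR].astype(float)

denom = nir + red
# Where IR + R is zero the index is undefined; it is set to 0 here by choice.
ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)

# Append one NDVI value per date: 330 + 33 features.
with_ndvi = np.concatenate([image, ndvi], axis=2)
```

For non-negative reflectances the resulting values lie between -1 and 1.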

Q18. Use your best classifier, parameters, and features, to train using the full training set **train1ref.npy**, and to classify the validation image. 