# Training a new Classifier

The project aims to use machine learning methods to emulate a cloud classification scheme. The classifier can be trained on large amounts of data and later be used to predict cloud types from satellite data. These two steps can be run separately.

This notebook contains a short explanation of how to create a new cloud classifier project and train a new classifier.


## Imports

First we need to make sure Python can find the project folder. If necessary, its path can be added to the module search path, either as a relative path or as an absolute system path.
Then the module can be imported via the `import ctypytool` command.

In [1]:
import sys
import ctypytool

import warnings
warnings.filterwarnings("ignore")
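
If `ctypytool` is not installed as a package, Python has to be told where the project folder is before the import succeeds. A minimal sketch, assuming the project folder is the parent directory of this notebook (the path `..` is an assumption and needs to be adjusted to the actual checkout):

```python
import sys
from pathlib import Path

# Assumed location of the project folder relative to this notebook.
project_root = Path("..").resolve()

# Prepend the folder to the module search path, avoiding duplicates.
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
```

An absolute system path can be used in the same way by passing it to `Path(...)` instead of `".."`.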

## Project creation
Our first step is to create a classifier object:

In [2]:
cc = ctypytool.cloud_classifier()

Then we need to specify a location where the new classifier and all its preference and data files will be stored. We will tell the program to create a new classifier with the name `NewRandomForestClassifier` in the folder `../classifiers`.



In [3]:
cc.create_new_project(name="NewRandomForestClassifier", path="../classifiers")

Project folder created successfully!


### Project Parameters

The new classifier will automatically be initialized with default parameters. For this example we will leave the parameters as they are. If we want to train a different kind of classifier or change the training parameters, we need to apply those changes to the classifier before continuing. This is described in the notebook **Changing_Project_Parameters**.


## Training the classifier
### Manually adding Training Data

First we need to decide how to add training files to our classifier. We can do this manually by calling the method `cc.set_project_parameters()` with the parameter `training_sets`, which takes a list of pairs of training files. Each pair needs to contain a satellite file and the corresponding label file.

In [4]:
sat_file_1 = "../data/example_data/msevi-medi-20190317_1800.nc"
sat_file_2 = "../data/example_data/msevi-medi-20190318_1100.nc"
label_file_1 = "../data/example_data/nwcsaf_msevi-medi-20190317_1800.nc"
label_file_2 = "../data/example_data/nwcsaf_msevi-medi-20190318_1100.nc"

training_data = [[sat_file_1, label_file_1], [sat_file_2, label_file_2]]

cc.set_project_parameters(training_sets = training_data)
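
For more than a handful of files, writing out every pair by hand becomes tedious. The sketch below builds the same kind of list from timestamps. It assumes the naming scheme seen in the example data (`msevi-medi-TIMESTAMP.nc` for satellite files, `nwcsaf_msevi-medi-TIMESTAMP.nc` for label files); the helper `build_training_pairs` is hypothetical and not part of `ctypytool`.

```python
def build_training_pairs(timestamps, folder="../data/example_data"):
    """Return [satellite_file, label_file] pairs for the given timestamps.

    Hypothetical helper: it only assembles file names and does not
    check that the files exist.
    """
    folder = folder.rstrip("/")
    pairs = []
    for ts in timestamps:
        sat_file = f"{folder}/msevi-medi-{ts}.nc"
        label_file = f"{folder}/nwcsaf_msevi-medi-{ts}.nc"
        pairs.append([sat_file, label_file])
    return pairs


training_data = build_training_pairs(["20190317_1800", "20190318_1100"])
```

The resulting list has the same structure as the manually defined `training_data` and could be passed to `cc.set_project_parameters(training_sets=training_data)` in the same way.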

We now run the training pipeline with the `run_training_pipeline()` method, which
* samples our training data and creates training vectors
* uses those vectors to train the classifier
* stores the classifier in the project folder

The option `create_filelist` is set to `False` to use the user-defined training files.

In [5]:
cc.run_training_pipeline(create_filelist = False)

Masked indices set!
Sampling dataset 1/2

sampling took 0.2686769962310791 seconds
Training data created!
Training Forest Classifier
Classifier trained!
Classifier saved!
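
To give an idea of what the sampling step does conceptually, the sketch below draws random unmasked pixels from a set of channel grids and turns them into training vectors with matching labels. This is not the `ctypytool` implementation: the function name, the nested-list data layout, and the uniform random sampling are all illustrative assumptions.

```python
import random


def sample_training_vectors(channels, labels, masked, n_samples, seed=0):
    """Draw random unmasked pixels and return (vectors, targets).

    channels: list of 2D grids (one per input channel), all the same shape
    labels:   2D grid of cloud type labels
    masked:   2D grid of booleans, True where a pixel must be skipped
    """
    rows, cols = len(labels), len(labels[0])
    # Collect all pixel positions that are not masked out.
    candidates = [
        (r, c) for r in range(rows) for c in range(cols) if not masked[r][c]
    ]
    rng = random.Random(seed)
    picked = rng.sample(candidates, min(n_samples, len(candidates)))
    # One vector per pixel: the value of every channel at that position.
    vectors = [[grid[r][c] for grid in channels] for r, c in picked]
    targets = [labels[r][c] for r, c in picked]
    return vectors, targets


# Tiny made-up example: two channels on a 2x2 grid, one masked pixel.
channels = [
    [[250.0, 251.0], [252.0, 253.0]],
    [[260.0, 261.0], [262.0, 263.0]],
]
labels = [[1, 2], [3, 4]]
masked = [[False, True], [False, False]]

vectors, targets = sample_training_vectors(channels, labels, masked, n_samples=3)
```

Each vector holds one value per input channel, and the masked pixel never appears in the sample.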


### Using Automatically Generated Lists of Training Data

As an alternative to the manual definition, the list of training data files can be generated automatically.


The easiest way to do so is to put all satellite and label files into a training data folder (here it is set to `../data/example_data`) and just tell the classifier where to look via the `data_source_folder` option.

In [6]:
%%bash

ls -l ../data/example_data/

total 11663800
-rw-r--r--. 1 b380352 bm0834 891362175 Jun  2  2022 evaluation_data.zip
-rw-r--r--. 1 b380352 bm0834      6767 Jun  2  2022 hundred-files.lst
-rw-r--r--  1 b380352 bm0834    664130 Jun  4  2021 lsm_mask_medi.nc
-rw-r--r--  1 b380352 bm0834  15495201 Jun  4  2021 msevi-medi-20190101_0800.nc
-rw-r--r--  1 b380352 bm0834  15772473 Jun  4  2021 msevi-medi-20190101_1200.nc
-rw-r--r--  1 b380352 bm0834  15503297 Jun  4  2021 msevi-medi-20190101_1600.nc
-rw-r--r--  1 b380352 bm0834  15434497 Jun  4  2021 msevi-medi-20190101_1800.nc
-rw-r--r--  1 b380352 bm0834  15541594 Jun  4  2021 msevi-medi-20190102_2000.nc
-rw-r--r--  1 b380352 bm0834  15499935 Jun  4  2021 msevi-medi-20190103_0000.nc
-rw-r--r--  1 b380352 bm0834  15491318 Jun  4  2021 msevi-medi-20190103_0300.nc
-rw-r--r--  1 b380352 bm0834  15897997 Jun  4  2021 msevi-medi-20190103_1200.nc
-rw-r--r--  1 b380352 bm0834  15524305 Jun  4  2021 msevi-medi-20190103_2100.nc
-rw-r--r--  1 b380352 bm0834  15519656 Jun  4  2021 ms

In [7]:
cc.set_project_parameters(data_source_folder = "../data/example_data/")
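
Generating a file list from a folder essentially means matching each satellite file to the label file with the same timestamp. The sketch below illustrates that idea on a plain list of file names. It is an assumption about the general approach, not the `ctypytool` code: the regular expressions are derived from the file names in the example folder and from the `nwcsaf_msevi-medi-TIMESTAMP.nc` label naming scheme, and `pair_by_timestamp` is a hypothetical helper.

```python
import re

# Assumed naming schemes; TIMESTAMP has the form YYYYMMDD_HHMM.
SAT_PATTERN = re.compile(r"^msevi-medi-(\d{8}_\d{4})\.nc$")
LABEL_PATTERN = re.compile(r"^nwcsaf_msevi-medi-(\d{8}_\d{4})\.nc$")


def pair_by_timestamp(filenames):
    """Return [satellite_file, label_file] pairs that share a timestamp."""
    sat_files, label_files = {}, {}
    for name in filenames:
        sat_match = SAT_PATTERN.match(name)
        label_match = LABEL_PATTERN.match(name)
        if sat_match:
            sat_files[sat_match.group(1)] = name
        elif label_match:
            label_files[label_match.group(1)] = name
    # Keep only timestamps for which both files are present.
    return [
        [sat_files[ts], label_files[ts]]
        for ts in sorted(sat_files)
        if ts in label_files
    ]


names = [
    "lsm_mask_medi.nc",
    "msevi-medi-20190101_0800.nc",
    "nwcsaf_msevi-medi-20190101_0800.nc",
    "msevi-medi-20190101_1200.nc",  # no matching label file
]
pairs = pair_by_timestamp(names)
```

Files that match neither pattern, such as the mask file, are ignored, and satellite files without a label file are skipped.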

In [8]:
cc

CTYPYTOOL CLOUD PROJECT class

... project_path              : ../classifiers/NewRandomForestClassifier

CTYPYTOOL PARAMETER HANDLER class

=== Parameters ===

... ccp_alpha                 : 0
... classifier_type           : Forest
... cloudtype_channel         : ct
... data_source_folder        : ../data/example_data/
... difference_vectors        : True
... feature_preselection      : False
... georef_file               : ../data/auxilary_files/msevi-medi-georef.nc
... hours                     : [0]
... input_channels            : ['bt062', 'bt073', 'bt087', 'bt097', 'bt108', 'bt120', 'bt134']
... input_source_folder       : ../data/example_data
... label_file_structure      : nwcsaf_msevi-medi-TIMESTAMP.nc
... mask_file                 : ../data/auxilary_files/lsm_mask_medi.nc
... mask_key                  : land_sea_mask
... mask_sea_coding           : 0
... max_depth                 : 35
... max_features              : None
... merge_list                : []
... min_samples_spli

In the next step, we train the classifier on the files found in the source folder.
This is again done with the `run_training_pipeline()` method.

This time we want the classifier to automatically generate the list of training files from the designated source folder, so we set the option `create_filelist` to `True`.

In [9]:
cc.run_training_pipeline(create_filelist = True)

Filelist created!
Masked indices set!
Sampling dataset 3/692

sampling took 142.87269759178162 seconds
Removed 679 vectors for containig 'Nan' values
Training data created!
Training Forest Classifier
Classifier trained!
Classifier saved!
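
The output above reports that vectors containing NaN values were removed before training. A minimal sketch of such a filter is shown below; it only illustrates the idea and is not taken from `ctypytool` (the function name and the list-based data layout are assumptions).

```python
import math


def drop_nan_vectors(vectors, targets):
    """Remove every vector that contains at least one NaN value.

    Returns the cleaned vectors, the matching targets, and the
    number of vectors that were removed.
    """
    kept_vectors, kept_targets = [], []
    for vec, tgt in zip(vectors, targets):
        if any(math.isnan(value) for value in vec):
            continue  # skip incomplete vectors
        kept_vectors.append(vec)
        kept_targets.append(tgt)
    removed = len(vectors) - len(kept_vectors)
    return kept_vectors, kept_targets, removed


vectors = [[250.0, 260.0], [float("nan"), 261.0], [252.0, 262.0]]
targets = [1, 2, 3]
clean_vectors, clean_targets, removed = drop_nan_vectors(vectors, targets)
```

Dropping the label together with its vector keeps features and targets aligned.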


The classifier is now trained and saved. It can be used to predict labels for unknown satellite data files, as described in the notebook **Application_of_a_pretrained_classifier**.