#### This is only needed during development

The path to the classifier code is added to the module search path here, and the modules are reloaded in case they have changed.

In [95]:
# Make the cloud_classifier modules importable
import sys
sys.path.append('../cloud_classifier')

import cloud_trainer 
import data_handler
import importlib
importlib.reload(cloud_trainer)
importlib.reload(data_handler)

<module 'data_handler' from '../cloud_classifier/data_handler.py'>
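
The effect of `importlib.reload` can be seen in isolation. The following sketch is independent of the classifier code: it writes a throwaway module called `demo_module` (a hypothetical name used only for illustration) into a temporary directory, imports it, edits the file, and reloads it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the quick edit below is always read from source.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / "demo_module.py"  # hypothetical module
    module_file.write_text("VERSION = 'first'\n")

    sys.path.append(tmp_dir)
    importlib.invalidate_caches()
    import demo_module

    version_before = demo_module.VERSION

    # Simulate editing the source file during development.
    module_file.write_text("VERSION = 'second, edited'\n")
    importlib.reload(demo_module)
    version_after = demo_module.VERSION

    sys.path.remove(tmp_dir)

print(version_before, "->", version_after)
```

Without the reload, the running session would keep using the version of the module that was imported first.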

# General Application of the Cloud Classification Tools

This is a cloud classifier description text

# Data Extraction

The first step of any machine learning routine is obtaining data in a usable form.
The satellite and label data used in this project are stored in the `netcdf` data format.

The class `data_handler` helps to extract training vectors from this data, which can be used in later steps to train a classifier for automatic cloud type detection.

#### Adding data sets to the data handler

In a first step, the locations of a number of netcdf files containing satellite data and the corresponding label data are added to the `data_handler` via the `add_training_files` method.



In [96]:
import data_handler

data_1 = "../data/msevi-nawdex-20160920.nc"
labels_1 = "../data/nwcsaf_msevi-nawdex-20160920.nc"
data_2 = "../data/msevi-nawdex-20160925.nc"
labels_2 = "../data/nwcsaf_msevi-nawdex-20160925.nc"

dh = data_handler.data_handler()

dh.add_training_files(data_1, labels_1)
dh.add_training_files(data_2, labels_2)
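
In the file names above, each label file carries the name of its data file with an `nwcsaf_` prefix. Assuming this convention holds for other files as well, the pairs could be built programmatically. The helper `label_file_for` below is hypothetical and not part of `data_handler`.

```python
from pathlib import Path


def label_file_for(data_file):
    """Return the label path that sits next to a satellite data file.

    Assumes the 'nwcsaf_' prefix convention seen in the example file names.
    """
    data_path = Path(data_file)
    return data_path.with_name("nwcsaf_" + data_path.name)


data_files = [
    "../data/msevi-nawdex-20160920.nc",
    "../data/msevi-nawdex-20160925.nc",
]
file_pairs = [(Path(f), label_file_for(f)) for f in data_files]

for data_path, label_path in file_pairs:
    print(data_path.name, "<->", label_path.name)
```

Each pair could then be handed to `add_training_files` as strings, for example via `data_path.as_posix()` and `label_path.as_posix()`.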


#### Using a mask file to narrow down the data

Netcdf datasets may contain data for large swaths of land and ocean. To limit the scope of the data, we can use a mask from an h5 mask file. The `set_indices_from_mask` method instructs the `data_handler` to use only data from inside a specific region.

In [97]:
mask = "../data/region_masks_for_msevi_nawdex.h5"
dh.set_indices_from_mask(mask, "mediterranean")

(array([316, 316, 316, ..., 714, 714, 714]),
 array([1904, 1905, 1906, ..., 2169, 2170, 2171]))
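
The output above has the shape of a pair of index arrays, one for rows and one for columns. A minimal sketch of how such a pair can be derived from a region mask is shown here with plain lists and a made-up 4x5 mask; that `set_indices_from_mask` works this way internally is an assumption.

```python
# Hypothetical 4x5 region mask: 1 marks grid cells inside the region.
region_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

rows, cols = [], []
for row_index, mask_row in enumerate(region_mask):
    for col_index, inside in enumerate(mask_row):
        if inside:
            rows.append(row_index)
            cols.append(col_index)

print(rows)
print(cols)
```

For a NumPy array, `numpy.where(mask)` returns the same row and column layout in one call.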

#### Setting parameters for the data extraction

A number of parameters for the data extraction can be set. The method `set_extraction_parameters` allows specifying the number of samples per data set, the hours at which the data set is sampled, and whether difference vectors are created from the extracted data vectors.

The `set_nwcsaf_version` method specifies which nwcsaf standard for mapping cloud types is used, and the `set_input_channels` method names the data channels from the satellite data that are to be considered.

In [98]:
dh.set_extraction_parameters(samples = 1000, hours = range(24), difference_vectors = True, original_values = True)
dh.set_nwcsaf_version(in_version = "auto", out_version = "v2018")
dh.set_input_channels(input_channels = ['bt062', 'bt073', 'bt087', 'bt097', 'bt108', 'bt120', 'bt134'])
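
How the `data_handler` builds difference vectors is not spelled out here. One plausible reading, sketched below as an assumption, is that the pairwise differences between channel values are used as features, optionally together with the original values. The function `build_feature_vector` and the pixel values are hypothetical.

```python
from itertools import combinations

channels = ['bt062', 'bt073', 'bt087']
# Made-up brightness temperatures for a single pixel.
pixel = {'bt062': 230.0, 'bt073': 245.5, 'bt087': 270.0}


def build_feature_vector(values, names, difference_vectors=True, original_values=True):
    """Combine original channel values and pairwise channel differences."""
    features = [values[name] for name in names] if original_values else []
    if difference_vectors:
        # One difference per unordered channel pair (assumed definition).
        features += [values[a] - values[b] for a, b in combinations(names, 2)]
    return features


feature_vector = build_feature_vector(pixel, channels)
print(feature_vector)
```

Under this reading, the seven channels selected above would yield 7 original values plus 21 pairwise differences per sample.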

#### Extracting data

Based on the above specification a set of feature vectors and labels can now be extracted from the data.

In [99]:
feature_vectors, labels = dh.create_training_set()

The cloud type data is coded after the old (2013) standard
nwcsaf-in-version set accordingly


# Training the Classifier

The class `cloud_trainer` wraps a (small) selection of the machine learning functionality from `scikit-learn` for easier use in this specialized scenario. When other methods or a greater amount of fine-tuning are needed, the data extracted above can be used with `scikit-learn` directly.
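
As a rough illustration of the direct route, the sketch below trains a random forest on synthetic stand-in data. The array shapes, the number of trees, and the assumption that a random forest matches what `cloud_trainer` uses internally are illustrative choices, not taken from the project code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for extracted data: 200 samples, 7 channels, 3 classes.
rng = np.random.default_rng(seed=0)
demo_vectors = rng.normal(size=(200, 7))
demo_labels = rng.integers(low=0, high=3, size=200)

forest = RandomForestClassifier(n_estimators=50, max_depth=20, random_state=0)
forest.fit(demo_vectors, demo_labels)

demo_predictions = forest.predict(demo_vectors[:5])
print(demo_predictions)
```

Real feature vectors and labels returned by `create_training_set` would take the place of the synthetic arrays.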

#### Creating a cloud classifier 

After setting some parameters, such as the type of classifier and whether feature preselection is performed, the `cloud_trainer` can be trained with the training data created above.


In [100]:
import cloud_trainer

ct = cloud_trainer.cloud_trainer()

ct.set_training_paremeters(classifer_type = "Forest", feature_preselection = False, max_depth = 20)
ct.train_classifier(feature_vectors, labels)

#### Evaluating the classifier and predicting labels

The classifier can now be used to predict labels for a new set of satellite data.
The `data_handler` is used again to extract the feature vectors from this new data set; it also returns the indices of the created vectors. These are necessary to map the feature vectors, and later the predicted labels, to their specific positions.


In [101]:
data_3 = "../data/msevi-nawdex-20160930.nc"
test_vectors, test_indices = dh.create_test_vectors(data_3, hour = 0)

predicted_labels = ct.predict_labels(test_vectors)
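
The role of the indices can be sketched with a tiny grid. The example assumes that the indices come as a pair of row and column sequences, as in the mask output earlier; the grid size, label codes, and fill value are made up.

```python
# Hypothetical indices in a (rows, columns) layout and made-up label codes.
example_indices = ([0, 0, 1, 2], [1, 2, 2, 0])
example_labels = [5, 5, 8, 3]

FILL_VALUE = -1  # marks grid cells without a prediction
n_rows, n_cols = 3, 4
label_grid = [[FILL_VALUE] * n_cols for _ in range(n_rows)]

# Place each predicted label at the grid position it was extracted from.
for row, col, label in zip(*example_indices, example_labels):
    label_grid[row][col] = label

for grid_row in label_grid:
    print(grid_row)
```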

Alternatively, when label data for this new set of satellite data also exists, the performance of the classifier on this data can be evaluated.

In [102]:
labels_3 = "../data/nwcsaf_msevi-nawdex-20160930.nc"

test_labels = dh.extract_labels(labels_3, test_indices, hour = 0)
ct.evaluate_classifier(test_vectors, test_labels)

Correctly identified 153161 out of 171396 labels! 
Positve rate is: 0.893609


0.8936089523676165
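
The reported rate is consistent with the share of correctly predicted labels, here 153161 out of 171396. A minimal sketch of that computation follows; the function name `fraction_correct` is hypothetical and the short label lists are made up.

```python
def fraction_correct(predicted, reference):
    """Return the share of positions where both label sequences agree."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences differ in length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


demo_score = fraction_correct([1, 2, 2, 3], [1, 2, 3, 3])
print(demo_score)

# The counts printed by evaluate_classifier reproduce the returned value.
reported_score = 153161 / 171396
print(round(reported_score, 6))
```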

Finally, to evaluate only the selected training method and parameters, the method `evaluate_parameters` splits a set of feature vectors and labels for a quick evaluation without needing any new data.

In [103]:
ct.evaluate_parameters(feature_vectors, labels)

Correctly identified 10527 out of 12000 labels! 
Positve rate is: 0.877250


0.87725
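
The idea behind such a split can be sketched as a simple hold-out: part of the samples is set aside for testing and the rest is used for training. The helper `holdout_split` is hypothetical, and the 25% test fraction is an assumption, since the fraction used by `evaluate_parameters` is not stated here.

```python
import random


def holdout_split(vectors, labels, test_fraction=0.25, seed=0):
    """Shuffle sample positions and split them into a training and a test part."""
    positions = list(range(len(vectors)))
    random.Random(seed).shuffle(positions)
    n_test = int(len(positions) * test_fraction)
    test_positions, train_positions = positions[:n_test], positions[n_test:]

    def pick(items, chosen):
        return [items[i] for i in chosen]

    return (pick(vectors, train_positions), pick(vectors, test_positions),
            pick(labels, train_positions), pick(labels, test_positions))


demo_vectors = [[float(i)] for i in range(8)]
demo_labels = [i % 2 for i in range(8)]
train_x, test_x, train_y, test_y = holdout_split(demo_vectors, demo_labels)
print(len(train_x), len(test_x))
```

The classifier would be trained on the training part and scored on the held-out part, so no separate data file is required.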

# Saving and Loading

Some of the steps from extracting data to training the classifier are relatively resource-intensive and take some time to finish, and this time increases with the size of the data sets used.
Therefore, a few methods exist to save and load time-consuming intermediate results and the finished classifier itself.

#### Saving and Loading feature vectors
One of the more time-consuming steps is the creation of large sets of feature vectors and corresponding labels from the data sets. The methods `save_training_set` and `load_training_set` help to save computation time when testing different training parameters on the same set of data.

In [104]:
filename = "../training_set_1"

dh.save_training_set(feature_vectors, labels, filename)

feature_vectors, labels = dh.load_training_set(filename)
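
The on-disk format used by `save_training_set` is not described here. As a stand-in, the sketch below shows the same save-and-load round trip with Python's `pickle` module and a temporary directory; the file name and the small data lists are made up.

```python
import pickle
import tempfile
from pathlib import Path

demo_vectors = [[230.0, 245.5], [228.1, 240.2]]
demo_labels = [5, 8]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "training_set_demo"

    # Store vectors and labels together so they cannot get out of sync.
    with open(target, "wb") as handle:
        pickle.dump((demo_vectors, demo_labels), handle)

    with open(target, "rb") as handle:
        loaded_vectors, loaded_labels = pickle.load(handle)

print(loaded_vectors == demo_vectors, loaded_labels == demo_labels)
```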

#### Saving and Loading the classifier
To avoid retraining the classifier each time it is used, it can also be saved and loaded.

In [105]:
filename_cl = "../classifer_1"

ct.save_classifier(filename_cl)

ct.load_classifier(filename_cl)


#### Exporting labels into NETCDF format
The method `make_xrData` uses a reference NETCDF file to write the predicted labels to the correct coordinates of an xarray dataset. If an output file is specified, this method also saves the labels as a NETCDF file.

In [107]:
labels_xr = dh.make_xrData(predicted_labels, test_indices, NETCDF_out = "../predicted_labels_1.nc")

No refrence file given!
Using latest test file as reference
