# Example for semi-supervised SOMRegressor

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# --- for running the script without pip
import sys
sys.path.append("../")
# ---

import susi

## Read in data

We modify the hyperspectral soil-moisture dataset for this semi-supervised application.
To do so, we randomly replace the labels of datapoints in the training dataset (and only there) with the placeholder `-1`.
The same value has to be passed to the `SOMRegressor` through the hyperparameter `missing_label_placeholder=-1`.
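
The masking step described above can be sketched on a toy array. This is only an illustration: the label values, the `missing_rate` of 0.5, and the names `y_toy`, `y_masked`, and `placeholder` are assumptions made for this sketch and are not part of the dataset or of `susi`.

```python
import numpy as np

# Toy labels standing in for y_train (hypothetical values, not from the dataset).
y_toy = np.array([25.0, 30.5, 28.1, 33.3, 27.9, 31.2, 29.4, 26.7, 32.0, 30.1])

placeholder = -1    # must equal the value passed as missing_label_placeholder
missing_rate = 0.5  # assumed share of labels to hide in this sketch

# Draw one uniform number per datapoint; values below missing_rate lose their label.
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y_toy)) < missing_rate

# Work on a copy so the original labels stay available for comparison.
y_masked = y_toy.copy()
y_masked[unlabeled] = placeholder

n_labeled = int(np.sum(y_masked != placeholder))
print("labeled:", n_labeled, "of", len(y_masked))
```

The placeholder should be a value that cannot occur as a real label. Soil moisture is non-negative, so `-1` is a safe choice here.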

**Dataset:** Felix M. Riese and Sina Keller, "Hyperspectral benchmark dataset on soil moisture", Dataset, Zenodo, 2018. [DOI:10.5281/zenodo.1227836](http://doi.org/10.5281/zenodo.1227836) and [GitHub](https://github.com/felixriese/hyperspectral-soilmoisture-dataset)

**Introducing paper:** Felix M. Riese and Sina Keller, “Introducing a Framework of Self-Organizing Maps for Regression of Soil Moisture with Hyperspectral Data,” in IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 2018, pp. 6151-6154. [DOI:10.1109/IGARSS.2018.8517812](https://doi.org/10.1109/IGARSS.2018.8517812)

In [2]:
# define ratios (between 0 and 1)
test_size = 0.5
missing_rate = 0.9

# load and split data
df = pd.read_csv(("https://raw.githubusercontent.com/felixriese/"
                  "hyperspectral-soilmoisture-dataset/master/soilmoisture_dataset.csv"))
features = [col for col in df.columns if col.isdigit()]
X_train, X_test, y_train, y_test = train_test_split(
    df[features].values, df["soil_moisture"].values,
    test_size=test_size, random_state=42)

# replace a random share (missing_rate) of the training labels with the placeholder -1
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y_train)) < missing_rate
y_train[random_unlabeled_points] = -1

# preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Datapoints for training:\t\t", y_train.shape[0])
print("Datapoints for training with label:\t", np.sum(y_train != -1))
print("Datapoints for testing:\t\t\t", y_test.shape[0])

Datapoints for training:		 339
Datapoints for training with label:	 32
Datapoints for testing:			 340


## Semi-supervised Regression

In [3]:
# NBVAL_IGNORE_OUTPUT

som_semi = susi.SOMRegressor(
    n_rows=40,
    n_columns=40,
    n_iter_unsupervised=20000,
    n_iter_supervised=20000,
    missing_label_placeholder=-1,  # important for semi-supervised learning!
    random_state=42)

som_semi.fit(X_train, y_train)
y_pred = som_semi.predict(X_test)
r2_semi = metrics.r2_score(y_test, y_pred)

print("R2 = {0:.1f} %".format(r2_semi*100))

R2 = 72.6 %
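
For reference, the coefficient of determination reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that formula on toy values. The helper `r2` and the arrays `y_true_toy` and `y_pred_toy` are hypothetical names introduced here and are unrelated to the model output above.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical), only to exercise the formula.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.0, 2.0, 3.0, 5.0])

score = r2(y_true_toy, y_pred_toy)
print("R2 = {0:.1f} %".format(score * 100))
```

A perfect prediction gives a score of 1, and a model that always predicts the mean of `y_true` gives 0.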
