# Using TorchSharp to Generate Synthetic Data for a Regression Problem

This tutorial is based on a [PyTorch example](https://jamesmccaffrey.wordpress.com/2023/06/09/using-pytorch-to-generate-synthetic-data-for-a-regression-problem/) posted by James D. McCaffrey on his blog, ported to TorchSharp.

Synthetic data sets can be very useful when evaluating and choosing a model.

Note that we're taking a shortcut in this example: rather than writing the data set as a text file that can be loaded from any modeling framework, we're saving the data as serialized TorchSharp tensors. It should be straightforward to modify the tutorial to write the data sets as text instead.

In [None]:
#r "nuget: TorchSharp-cpu"

using TorchSharp;
using static TorchSharp.TensorExtensionMethods;

#### Generative Network
Neural networks can be used to generate data as well as to learn from it. The synthetic data can then be used to evaluate different models and see how well they reproduce the behavior of the network that produced the data.

First, we will create the model that will be used to generate the synthetic data. Later, we'll construct a second model that will be trained on the data the first model generates.

In [None]:
class Net : torch.nn.Module<torch.Tensor,torch.Tensor>
{
    private torch.nn.Module<torch.Tensor,torch.Tensor> hid1;
    private torch.nn.Module<torch.Tensor,torch.Tensor> oupt;

    public Net(int n_in) : base(nameof(Net))
    {
        var h = torch.nn.Linear(n_in, 10);
        var o =  torch.nn.Linear(10,1);

        var lim = 0.80;
        torch.nn.init.uniform_(h.weight, -lim, lim);
        torch.nn.init.uniform_(h.bias, -lim, lim);
        torch.nn.init.uniform_(o.weight, -lim, lim);
        torch.nn.init.uniform_(o.bias, -lim, lim);

        hid1 = h;
        oupt = o;

        RegisterComponents();
    }
    public override torch.Tensor forward(torch.Tensor input)
    {
        using var _ = torch.NewDisposeScope();
        var z = hid1.call(input).tanh_();
        z = oupt.call(z).sigmoid_();
        return z.MoveToOuterDisposeScope();
    }
}
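Written out, the network above computes the following function. This is only a restatement of the `Net` class in symbols; the names $W_1, b_1, W_2, b_2$ are introduced here for illustration and stand for the weights and biases of `hid1` and `oupt`.

$$
y = \sigma\bigl(W_2 \tanh(W_1 x + b_1) + b_2\bigr), \qquad
x \in \mathbb{R}^{n_{in}},\; W_1 \in \mathbb{R}^{10 \times n_{in}},\; b_1 \in \mathbb{R}^{10},\; W_2 \in \mathbb{R}^{1 \times 10},\; b_2 \in \mathbb{R}
$$

Every entry of $W_1, b_1, W_2, b_2$ is drawn uniformly from $[-0.8, 0.8]$, and because of the final sigmoid $\sigma$ the output always lies strictly between 0 and 1.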

Now that we have our generative network, we can define the method that creates the data set. If you compare this with the PyTorch code, you will notice that we're relying on TorchSharp to generate a whole batch of data at once, rather than looping over individual items. We're also using TorchSharp instead of NumPy for the noise generation.
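In symbols, the sampling done by the function below looks roughly like this, where $f$ denotes the generative network, $U$ is a matrix of uniform random numbers in $[0, 1)$ as returned by `torch.rand`, and $\varepsilon$ is the noise term (the symbols are introduced here only for illustration):

$$
X = (x_{hi} - x_{lo})\,U + x_{lo}, \qquad y = f(X) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\, 0.01^2)
$$

With $x_{lo} = -1$ and $x_{hi} = 1$, every input value is uniformly distributed in $[-1, 1)$.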

In [None]:
void CreateDataFile(Net net, int n_in, string fileName, int n_items)
{

    var x_lo = -1.0;
    var x_hi = 1.0;

    var X = (x_hi - x_lo) * torch.rand(new long[] {n_items, n_in}) + x_lo;

    torch.Tensor y;

    using (torch.no_grad()) {
        y = net.call(X);
    }

    // Add some noise in order not to make it too easy to train...
    y += torch.randn(y.shape) * 0.01;

    // Make sure that the output isn't negative: replace any negative value
    // with a small positive one drawn uniformly from [0.01, 0.02).
    y = torch.where(y < 0.0, 0.01 * torch.rand(y.shape) + 0.01, y);

    // Save the data in two separate, binary files.
    X.save(fileName + ".x");
    y.save(fileName + ".y");
}

(torch.Tensor X, torch.Tensor y) LoadDataFile(string fileName)
{
    return (torch.Tensor.load(fileName + ".x"), torch.Tensor.load(fileName + ".y"));
}
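As mentioned in the introduction, the data could be written as text instead of as serialized tensors. The following is a minimal sketch of one way to do that, not part of the original tutorial: `SaveAsText` is a hypothetical helper, and the comma-separated layout (inputs first, target last, six decimals) is an assumption chosen for illustration. It assumes the TorchSharp package reference from the first cell and tensors that live on the CPU.

```csharp
using System.Globalization;
using System.IO;
using System.Linq;
using TorchSharp;

// Hypothetical helper: writes one sample per line, the input values
// followed by the target value, separated by commas.
void SaveAsText(torch.Tensor X, torch.Tensor y, string path)
{
    var rows = (int)X.shape[0];
    var cols = (int)X.shape[1];

    // Copy the tensor contents into managed arrays (row-major order).
    var xs = X.to_type(torch.float32).contiguous().data<float>().ToArray();
    var ys = y.to_type(torch.float32).contiguous().data<float>().ToArray();

    using var writer = new StreamWriter(path);
    for (var r = 0; r < rows; r++) {
        var fields = Enumerable.Range(0, cols)
            .Select(c => xs[r * cols + c])
            .Append(ys[r])
            .Select(v => v.ToString("F6", CultureInfo.InvariantCulture));
        writer.WriteLine(string.Join(",", fields));
    }
}
```

Calling `SaveAsText(X, y, fileName + ".txt")` at the end of `CreateDataFile` would write the text file alongside the binary ones.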

In [None]:
var net = new Net(6);

Create the data files.

In [None]:
CreateDataFile(net, 6, "train.dat", 2000);
CreateDataFile(net, 6, "test.dat", 400);

#### Using the Data

Load the data from files again. This is just to demonstrate how to get the data from disk.

In [None]:
var (X_train, y_train) = LoadDataFile("train.dat");
var (X_test, y_test) = LoadDataFile("test.dat");

Create a second model class with slightly different logic (a smaller hidden layer of 5 units and a ReLU activation instead of tanh), and train it on the generated data set.

In [None]:
class Net2 : torch.nn.Module<torch.Tensor,torch.Tensor>
{
    private torch.nn.Module<torch.Tensor,torch.Tensor> hid1;
    private torch.nn.Module<torch.Tensor,torch.Tensor> oupt;

    public Net2(int n_in) : base(nameof(Net2))
    {
        hid1 = torch.nn.Linear(n_in, 5);
        oupt =  torch.nn.Linear(5,1);

        RegisterComponents();
    }
    public override torch.Tensor forward(torch.Tensor input)
    {
        using var _ = torch.NewDisposeScope();
        var z = hid1.call(input).relu_();
        z = oupt.call(z).sigmoid_();
        return z.MoveToOuterDisposeScope();
    }
}

Create an instance of the second network and choose a loss function. To train it, you also need an optimizer and, optionally, a learning-rate (LR) scheduler.

In [None]:
var model = new Net2(6);

var loss = torch.nn.MSELoss();

var learning_rate = 0.01f;
var optimizer = torch.optim.Rprop(model.parameters(), learning_rate);
var scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer);
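For reference, the quantities set up above can be written out as follows. With its default mean reduction, `MSELoss` computes the mean squared error over the $N$ predictions $\hat{y}_i$ and targets $y_i$:

$$
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2
$$

An exponential schedule multiplies the learning rate by a fixed factor $\gamma$ each time `scheduler.step()` is called, so after $k$ scheduler steps the rate is

$$
\eta_k = \eta_0\, \gamma^{k}, \qquad \eta_0 = 0.01
$$

No `gamma` is passed in the cell above, so the library default applies. That default is believed to be 0.9 in TorchSharp, but treat this as an assumption and check the signature in the version you are using.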

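What follows is a fairly standard training loop. The whole training set is passed through the model as a single batch, and the scheduler is stepped once every 100 iterations. The cell ends by evaluating the trained model on the training set.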

In [None]:
Console.WriteLine(" initial loss = " + loss.forward(model.forward(X_train), y_train).item<float>().ToString());

for (int i = 0; i < 10000; i++) {

    // Compute the loss
    using var output = loss.forward(model.forward(X_train), y_train);

    // Clear the gradients before doing the back-propagation
    model.zero_grad();

    // Run backpropagation, which computes all the gradients.
    output.backward();

    optimizer.step();
    
    if (i % 100 == 99) {
        scheduler.step();
    }
}

Console.WriteLine(" final loss   = " + loss.forward(model.forward(X_train), y_train).item<float>());

What we're really curious about is how the second model does on the test set, which it didn't see during training. If the test loss is significantly greater than the training loss, the model isn't generalizing well yet, and one option is to train for more epochs. If the test set loss doesn't get closer to the training set loss with more epochs, we may need more data.

In [None]:
loss.forward(model.forward(X_test), y_test).item<float>()
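Since the same evaluation is repeated several times in this tutorial, it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original tutorial: `Evaluate` is a hypothetical name, and it uses `torch.nn.functional.mse_loss` in place of the `loss` object so that it stands on its own. It disables gradient tracking while evaluating, which avoids building a computation graph that is never used.

```csharp
using TorchSharp;

// Hypothetical helper: returns the mean squared error of a model on (X, y)
// without tracking gradients.
float Evaluate(torch.nn.Module<torch.Tensor, torch.Tensor> m, torch.Tensor X, torch.Tensor y)
{
    m.eval();                                     // switch to evaluation mode
    try {
        using var noGrad = torch.no_grad();       // no graph is recorded in this scope
        using var scope = torch.NewDisposeScope(); // temporaries are freed on exit
        return torch.nn.functional.mse_loss(m.call(X), y).item<float>();
    }
    finally {
        m.train();                                // restore training mode
    }
}
```

With this in place, the cell above could be written as `Evaluate(model, X_test, y_test)`.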

#### Splitting the Data into Batches

If we want to be a little bit more advanced, we can split the training set into batches. Here, the 2000 training items are divided into 10 batches of 200 items each.

In [None]:
var N = X_train.shape[0]/10;
var X_batch = X_train.split(N);
var y_batch = y_train.split(N);

That means modifying the training loop, too. Each epoch takes longer when it runs multiple batches, but the model may converge in fewer epochs, so the total time before you have the desired model may still be shorter.

In [None]:
Console.WriteLine(" initial loss = " + loss.forward(model.forward(X_train), y_train).item<float>().ToString());

for (int i = 0; i < 5000; i++) {

    for (var j = 0; j < X_batch.Length; j++) {
        // Compute the loss
        using var output = loss.forward(model.forward(X_batch[j]), y_batch[j]);

        // Clear the gradients before doing the back-propagation
        model.zero_grad();

        // Run backpropagation, which computes all the gradients.
        output.backward();

        optimizer.step();
    }
    
    scheduler.step();
}

Console.WriteLine(" final loss   = " + loss.forward(model.forward(X_train), y_train).item<float>());

In [None]:
loss.forward(model.forward(X_test), y_test).item<float>()
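The batches created with `split` are fixed, so the model sees the same items grouped together in the same order in every epoch. A minimal sketch of reshuffling by hand is shown below; `ShuffledBatches` is a hypothetical helper, not part of the original tutorial. It draws one random permutation and applies it to both tensors so that inputs and targets stay aligned.

```csharp
using TorchSharp;

// Hypothetical helper: shuffles X and y with the same permutation and
// splits the result into mini-batches of (at most) batchSize rows.
(torch.Tensor[] xBatches, torch.Tensor[] yBatches) ShuffledBatches(
    torch.Tensor X, torch.Tensor y, long batchSize)
{
    // A random permutation of the row indices 0 .. n-1.
    using var perm = torch.randperm(X.shape[0]);

    // Reorder the rows of both tensors identically, then split along dimension 0.
    var xs = X.index_select(0, perm).split(batchSize);
    var ys = y.index_select(0, perm).split(batchSize);
    return (xs, ys);
}
```

Calling `ShuffledBatches(X_train, y_train, 200)` at the top of the outer loop would give a new set of batches for each epoch.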

#### Dataset and DataLoader

If we wanted to be really advanced, we would use TorchSharp data sets and data loaders, which allow us to shuffle the training data set between epochs (that is, on each pass of the outer training loop). Here's how we'd do that.

In [None]:
class SyntheticDataset : torch.utils.data.Dataset {

    public SyntheticDataset(string fileName) 
    {
        _data = torch.Tensor.load(fileName + ".x");
        _labels = torch.Tensor.load(fileName + ".y");
        if (_data.shape[0] != _labels.shape[0])
            throw new InvalidOperationException("Data and labels are not of the same lengths.");
    }

    public override Dictionary<string, torch.Tensor> GetTensor(long index)
    {
        var rdic = new Dictionary<string, torch.Tensor>();
        rdic.Add("data", _data[(int)index]);
        rdic.Add("label", _labels[(int)index]);
        return rdic;
    }

    public override long Count => _data.shape[0];

    private torch.Tensor _data;
    private torch.Tensor _labels;
}

The training loop gets slightly more complex with the data set.

In [None]:
var training_data = new SyntheticDataset("train.dat");
var train = new torch.utils.data.DataLoader(training_data, 200, shuffle: true);

In [None]:
Console.WriteLine(" initial loss = " + loss.forward(model.forward(X_train), y_train).item<float>().ToString());

for (int i = 0; i < 1000; i++) {

    foreach (var data in train)
    {
        // Compute the loss
        using var output = loss.forward(model.forward(data["data"]), data["label"]);

        // Clear the gradients before doing the back-propagation
        model.zero_grad();

        // Run backpropagation, which computes all the gradients.
        output.backward();

        optimizer.step();
    }
    
    scheduler.step();
}

Console.WriteLine(" final loss   = " + loss.forward(model.forward(X_train), y_train).item<float>());

It's slower, and the convergence isn't that much better, but that will depend on the model used. You simply have to experiment with different approaches.

In [None]:
loss.forward(model.forward(X_test), y_test).item<float>()