# UCLAIS Tutorial Series Challenge 2



In this challenge, you will explore training and inference with neural networks. You have already seen how to use classifiers in the previous challenge on Premier League prediction. Now we will look at a regression task. Simply put, instead of predicting one of a set of discrete classes, we predict a continuous value. In this case, we will predict the alcohol content of wine from a set of other chemical attributes.

If you do not already have a DOXA account, you will want to [sign up](https://doxaai.com/sign-up) first before proceeding and then make sure you are enrolled on the [DOXA challenge page](https://doxaai.com/competition/uclais-2023-2).

## Machine Learning Workflow Reminder

![title](https://miro.medium.com/max/1400/0*V0GyOt3LoDVfY7y5.png)

The overall machine learning process covers a wide sequence of steps, so as you go through this notebook, try to keep in mind which stage we are dealing with and what we are trying to achieve. There are a lot of helpful resources online you can use, such as the excellent [scikit-learn documentation](https://scikit-learn.org/stable/getting_started.html). You are also more than welcome to ask questions in the [DOXA Community Discord server](https://discord.gg/MUvbQ3UYcf)!

## Installing and Importing Useful Packages

To get started, we will install a number of common machine learning packages.

In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn ipympl torch livelossplot
%pip install -U doxa-cli

In [None]:
# Import relevant libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import random
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [None]:
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

In [None]:
# this gives live loss plots -- recommended
from livelossplot import PlotLosses

We now also make sure we are using our computer's GPU for best performance. Make sure it says "Using cuda device" below. If not, go to "Runtime" -> "Change runtime type" in Google Colab and change to GPU. This will make model training a lot faster!

In [None]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using {device} device")

## Data Loading

We now load the data as a pandas DataFrame. We use the [wine quality dataset](https://archive.ics.uci.edu/dataset/186/wine+quality). The goal of this challenge is to use a neural network to predict the alcohol content of a wine given its other properties. The properties are based on physicochemical tests, and there are 10 features in total. The target variable is the alcohol content, which is a continuous variable.

In [None]:
# Import the training dataset
train_df_original = pd.read_csv(
    "./data/train.csv"
)  # Change the path accordingly

# Import the testing dataset
test_df = pd.read_csv(
    "./data/test.csv"
)  # Change the path accordingly

In [None]:
# We can then make an in-memory copy of the training set to manipulate
# and process while leaving the original intact as we experiment
df = train_df_original.copy()

## Data Understanding
Before we start training our machine learning model, it is important to first look at and understand the dataset we will be using. This provides insight into which model, hyperparameters, and loss function suit the problem we are dealing with. The [first DOXA challenge](https://doxaai.com/competition/uclais-2023-1) has good content on data understanding. Check it out if you want to explore further.
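As a rough illustration of the kind of inspection meant here, the sketch below builds a tiny made-up DataFrame and looks at its shape, summary statistics, missing values, and how strongly each feature correlates with the target. The name `toy_df` and all of its values are placeholders for illustration, not the real training data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
toy_df = pd.DataFrame(
    {
        "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9950],
        "pH": [3.51, 3.20, 3.26, 3.16, 3.40],
        "alcohol": [9.4, 9.8, 9.8, 9.5, 11.2],
    }
)

n_rows, n_cols = toy_df.shape   # number of rows and columns
summary = toy_df.describe()     # count, mean, std, min, quartiles, max
missing = toy_df.isna().sum()   # number of missing values per column

# Pearson correlation of every feature with the target column
target_corr = toy_df.corr()["alcohol"].drop("alcohol")

print(n_rows, n_cols)
print(target_corr)
```

Features with very different scales in the summary statistics are a hint that normalization may help later on.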

In [None]:
# TODO: Print the first five rows of the data

# Hint: use the '.head()' method

In [None]:
# TODO: Print the number of rows and columns in the dataset 

# Hint: use '.shape'

In [None]:
# TODO: Print the summary statistics for the dataset

# Hint: use '.describe()'

## Data Preprocessing 

Here we preprocess the data to make it suitable for training. We will first split the data into training and validation sets. Feel free to add new cells as you see fit.
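One possible way to split and normalize the data is sketched below using scikit-learn. It runs on random stand-in arrays (`X_demo`, `y_demo`), and the 80/20 split ratio is an arbitrary assumption, so treat it as a starting point rather than the intended solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 100 samples with 10 features each
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 10))
y_demo = rng.normal(loc=10.0, scale=1.0, size=100)

# Hold out 20% of the rows for validation (the ratio is an arbitrary choice)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the validation rows
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_val_scaled = scaler.transform(X_val)

print(X_tr_scaled.shape, X_val_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information about the validation set from leaking into training. The same fitted scaler would also need to be applied to the test features before making predictions.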

In [None]:
# We split the data into X and y variables. X holds the features and y is the target variable we want to predict.
# We are trying to predict the alcohol content given the other variables.
X_train = df.drop('alcohol', axis=1)
y_train = df['alcohol']

# We then convert the matrix X and the vector y into NumPy arrays.
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()

In [None]:
# TODO: add your own data pre-processing steps here. (hint: it might be worth looking at normalizing the data to make training easier)

## Define our Neural Network Model

We now define the architecture of our model. Remember: the more complex your model architecture, the more complex the patterns in the data it will be able to fit. However, this also means that your model will be more prone to overfitting, so be careful! You can also look at other ways of reducing overfitting, such as regularization.
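To make the trade-off concrete, here is a minimal sketch of a small regression network with one hidden layer and an optional dropout layer as a form of regularization. The sizes (`n_in`, `n_hidden`), the dropout rate, and the name `sketch_model` are assumptions chosen for illustration. They are not recommended values.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 10 input features, 16 hidden units
n_in, n_hidden = 10, 16

sketch_model = nn.Sequential(
    nn.Linear(n_in, n_hidden),  # input features -> hidden layer
    nn.ReLU(),                  # non-linearity
    nn.Dropout(p=0.1),          # optional regularization, active only in training mode
    nn.Linear(n_hidden, 1),     # a single output neuron for regression
)

# Count the trainable parameters as a rough measure of model complexity
n_params = sum(p.numel() for p in sketch_model.parameters())

# A batch of 4 random samples yields one prediction per row
sketch_model.eval()
with torch.no_grad():
    out = sketch_model(torch.randn(4, n_in))

print(n_params, out.shape)
```

Adding more layers or more hidden units increases `n_params` and with it the capacity to fit (and to overfit) the training data.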

In [None]:
num_input_features, num_hidden_neurons = X_train.shape[1], 10
model = nn.Sequential(
    # TODO: add layers to our model

    # Note: remember that we are trying to predict a continuous variable.
    # Our output layer should have only one neuron, and our input layer should be the number of columns in X.

)

# Move model to GPU if available
model = model.to(device)
print(model)

## Training our Model

Now it's finally time to train our model! Train on the training set only, and keep the validation set aside so that you can check for overfitting. First, we define the hyperparameters. Feel free to experiment with them!

In [None]:
# TODO: replace 'None' with values you think are appropriate. Experiment with different values to see what works best!
learning_rate = None
batch_size = None
num_epochs = None

Define your loss function below. Options are given in the [documentation](https://pytorch.org/docs/stable/nn.functional.html#loss-functions). 
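As a small worked example, the sketch below evaluates two common regression losses from `torch.nn.functional` on hand-picked numbers. The values in `pred` and `target` are made up, and which loss works best for this challenge is left for you to explore.

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets
pred = torch.tensor([10.0, 9.0, 12.0])
target = torch.tensor([9.5, 9.0, 11.0])

# Mean squared error: mean of (0.5^2, 0^2, 1^2) = 1.25 / 3
mse = F.mse_loss(pred, target)

# Mean absolute error: mean of (0.5, 0, 1) = 0.5
mae = F.l1_loss(pred, target)

print(mse.item(), mae.item())
```

Note that if the model output has shape `(N, 1)` while the target has shape `(N,)`, the two tensors broadcast against each other and the loss is computed over an `(N, N)` grid. Squeezing the predictions first avoids this.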

In [None]:
# TODO: replace 'None' with your loss function.

# Hint: we are trying to predict a continuous variable.

def loss_fun(pred, target):
    return None

Let's also define our optimizer. Look at the [documentation](https://pytorch.org/docs/stable/optim.html#algorithms) for a list of optimization algorithms.
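For illustration, the sketch below creates an Adam optimizer for a placeholder linear model and performs a single update step. The learning rate and weight decay are assumed values, and `tiny_model` merely stands in for the network defined earlier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

tiny_model = nn.Linear(3, 1)  # placeholder model, not the challenge network

# Adam with a hypothetical learning rate and a small L2 penalty (weight decay)
optimizer = torch.optim.Adam(tiny_model.parameters(), lr=1e-3, weight_decay=1e-4)

# One optimization step on random data
x, y = torch.randn(8, 3), torch.randn(8)
before = tiny_model.weight.detach().clone()

optimizer.zero_grad()                                        # clear old gradients
loss = nn.functional.mse_loss(tiny_model(x).squeeze(1), y)   # forward pass
loss.backward()                                              # compute gradients
optimizer.step()                                             # update the weights

changed = not torch.equal(before, tiny_model.weight.detach())
print(changed)
```

Other optimizers from `torch.optim`, such as `SGD`, are constructed in the same way, taking the model parameters and a learning rate.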

In [None]:
# TODO: replace 'None' with your optimizer.
optim = None

Finally we set up our model for training and plotting.

In [None]:
# Keep track of losses
plotlosses = PlotLosses()

# Convert our training data to tensors
X_train_tensor = torch.from_numpy(X_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

# Change model to training mode
model.train();

Run the code cell below to train your model.
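As a hedged sketch of what such a loop can look like, the example below trains a small network on synthetic data using mini-batches. Everything in it (`X_demo`, `y_demo`, `demo_model`, the batch size, the learning rate, and the number of epochs) is an assumption for illustration, so adapt it to your own model, loss function, and optimizer.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: y is a noisy linear function of 10 features
X_demo = torch.randn(256, 10)
y_demo = X_demo @ torch.randn(10) + 0.1 * torch.randn(256)

demo_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
demo_optim = torch.optim.Adam(demo_model.parameters(), lr=1e-2)
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

epoch_losses = []
demo_model.train()
for epoch in range(20):
    running = 0.0
    for xb, yb in loader:
        demo_optim.zero_grad()                   # reset gradients from the last step
        pred = demo_model(xb).squeeze(1)         # shape (batch,) to match yb
        loss = nn.functional.mse_loss(pred, yb)
        loss.backward()                          # backpropagate
        demo_optim.step()                        # update the weights
        running += loss.item() * xb.size(0)      # accumulate the summed batch loss
    epoch_losses.append(running / len(X_demo))   # average loss over the epoch

print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```

In the notebook, the per-epoch value could be logged with `plotlosses.update({"loss": ...})` followed by `plotlosses.send()` to draw the live loss plot.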

In [None]:
for _ in range(num_epochs):
    # TODO: add the code to train your model. (hint: use plotlosses to see the live loss plot)
    

## Preparing your DOXA Submission

In [None]:
# Convert to numpy arrays
X_test = test_df.to_numpy()

# Pass our data through our neural network
model.eval()
with torch.no_grad(): 
    predictions = model(torch.from_numpy(X_test).float().to(device)).cpu().numpy().squeeze()

assert predictions.shape == (1300,) 

# Take a look at the first 20 predictions
predictions[:20]

In [None]:
os.makedirs("submission", exist_ok=True)

with open("submission/y.txt", "w") as f:
    f.writelines([f"{prediction}\n" for prediction in predictions])

with open("submission/doxa.yaml", "w") as f:
    f.write(
        "competition: uclais-2023-2\nenvironment: cpu\nlanguage: python\nentrypoint: run.py"
    )

with open("submission/run.py", "w") as f:
    f.write(
        """import os

with open('y.txt', 'r') as f:
    with open(os.environ["DOXA_STREAMS"] + "/out", "w") as g:
        g.write(f.read().strip())"""
    )

## Submitting to DOXA

Before you can submit to DOXA, you must first ensure that you are enrolled in the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/uclais-2023-2) and click "Enrol" in the top-right corner if you have not done so already.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

Finally, you can submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Wooo! 🥳 You have (probably) just uploaded your predictions to DOXA. Well done! Take a moment to see how you have done on the [scoreboard](https://doxaai.com/competition/uclais-2023-2/scoreboard).