### Required dependencies
You'll need recent versions of Jupyter (if you're reading this, you probably have it already), scikit-learn, NumPy, pandas, and matplotlib and/or seaborn. You are free to use any other package under the sun, but I suspect you will need at least these.
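If you want to check what is installed in your current environment, something like the sketch below could help. It only uses the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is my own, and the package list is simply the one mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (note: "scikit-learn", not "sklearn").
packages = ["scikit-learn", "numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(packages)
for name, ver in versions.items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Anything reported as "not installed" can be added with your environment manager of choice.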

I advise you to manage your Python projects with a virtual environment (e.g. pipenv, venv, or conda).

To get free GPU time, you can try Google Colab. It is a tool for running notebooks like this one on the fly, and it provides you with a VM and a GPU for free. Almost all machine learning packages come preinstalled, and I suspect you could do the entire project on Colab if you wanted to. Still, it is useful to learn how to set up an environment on your own PC as well, and importing your datasets is a bit more complicated on Colab (for speed, it is best to import them from Google Drive). Colab becomes useful if you intend to try deep learning approaches with TensorFlow or PyTorch and you don't have a GPU yourself.
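If you switch between your own machine and Colab, a small guard like the one below lets the same notebook run in both places. This is a sketch: checking `sys.modules` for `google.colab` is a common heuristic rather than an official API, the helper name `running_on_colab` is my own, and the Drive folder name is hypothetical, so point it at wherever you uploaded the CSV files.

```python
import sys


def running_on_colab():
    """Heuristic: the google.colab module is already loaded in Colab kernels."""
    return "google.colab" in sys.modules


if running_on_colab():
    # Only importable inside Colab; mounting asks you to authorize access.
    from google.colab import drive

    drive.mount("/content/drive")
    # Hypothetical folder name, adjust to your own Drive layout.
    data_folder = "/content/drive/MyDrive/my_project_data/"
else:
    # Local setup: the CSV files sit next to the notebook.
    data_folder = "./"

print(data_folder)
```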

In [1]:
# numerical library:
import numpy as np

# data manipulation library:
import pandas as pd

# standard packages used to handle files:
import sys
import os 
import glob
import time

# scikit-learn machine learning library:
import sklearn

# plotting:
import matplotlib.pyplot as plt

# tell matplotlib that we plot in a notebook:
%matplotlib notebook

Define the folder that contains your data:

In [2]:
data_folder = "./"

In [3]:
train_data = pd.read_csv(data_folder + "train.csv")
test_data = pd.read_csv(data_folder + "test.csv")
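By default, `pd.read_csv` loads the `date` column as plain strings. If you plan to use the timestamps (for plotting or for time-based features), you could ask pandas to parse them while loading. The sketch below uses a tiny in-memory CSV with a few values copied from the first rows of the training data, so it runs without the real files.

```python
import io

import pandas as pd

# A few rows and columns copied from the start of the training data.
sample_csv = io.StringIO(
    "date,lights,T1,Appliances\n"
    "2016-01-11 17:00:00,30,19.89,60\n"
    "2016-01-11 17:10:00,30,19.89,60\n"
    "2016-01-11 17:20:00,30,19.89,50\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(sample.dtypes)

# With real datetimes, the sampling interval is easy to inspect.
print(sample["date"].diff().iloc[1])
```

For the real data, the same `parse_dates=["date"]` argument can be passed to the `pd.read_csv` calls above.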

### Data exploration
Let's take a look at our train and test data:

In [4]:
train_data.head()

Unnamed: 0,date,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2,Appliances
0,2016-01-11 17:00:00,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,...,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433,60
1,2016-01-11 17:10:00,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,...,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195,60
2,2016-01-11 17:20:00,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,...,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668,50
3,2016-01-11 17:30:00,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,...,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389,50
4,2016-01-11 17:40:00,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,...,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097,60


Let's take a look at our first 1000 datapoints in the training set:

In [15]:
train_data[0:1000].plot(x="date", y="Appliances", figsize=(10, 7))

<AxesSubplot:xlabel='date'>
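The readings arrive every 10 minutes, so a plot of the raw values can look quite jagged. One possible way to get a smoother view is to average per hour before plotting. The sketch below does this on the five training rows shown earlier; it assumes the `date` column has been converted to datetimes.

```python
import pandas as pd

# The first five training rows (date and target only).
rows = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2016-01-11 17:00:00",
                "2016-01-11 17:10:00",
                "2016-01-11 17:20:00",
                "2016-01-11 17:30:00",
                "2016-01-11 17:40:00",
            ]
        ),
        "Appliances": [60, 60, 50, 50, 60],
    }
)

# Resampling needs a datetime index; "60min" groups the readings per hour.
hourly = rows.set_index("date")["Appliances"].resample("60min").mean()
print(hourly)
```

On the full training set, `hourly.plot(figsize=(10, 7))` would then draw the hourly averages.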

In [6]:
test_data.head()

Unnamed: 0,date,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-04-24 21:00:00,0,21.926667,35.5,19.29,37.5,22.39,34.0,21.39,32.225714,...,20.2,32.4,4.1,758.0,82.0,3.0,40.0,1.2,10.668196,10.668196
1,2016-04-24 21:10:00,0,21.89,35.4,19.2225,37.425,22.39,34.09,21.35,32.2,...,20.2,32.4,3.95,758.05,82.166667,3.0,40.0,1.1,48.467852,48.467852
2,2016-04-24 21:20:00,0,21.89,35.4,19.2,37.466667,22.39,33.963333,21.29,32.277143,...,20.2,32.29,3.8,758.1,82.333333,3.0,40.0,1.0,36.388536,36.388536
3,2016-04-24 21:30:00,0,21.89,35.4,19.1,37.59,22.39,33.9,21.29,32.334,...,20.175,32.29,3.65,758.15,82.5,3.0,40.0,0.9,17.198176,17.198176
4,2016-04-24 21:40:00,0,21.89,35.4,19.1,37.59,22.39,33.966667,21.29,32.29,...,20.166667,32.563333,3.5,758.2,82.666667,3.0,40.0,0.8,7.200588,7.200588


### Building a first submission

For a first submission, let's just take the average appliance consumption in the training set, and use this value for all test samples:

In [7]:
average_consumption = train_data["Appliances"].mean()
print(average_consumption)

98.75133333333333
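Before submitting, it can be useful to estimate locally how well such a constant guess does. The sketch below holds out the last 20% of a series as a validation set and scores the mean of the rest against it. Two assumptions to flag: I use RMSE as the error measure (check which metric the competition actually uses), and the numbers are toy values, not the real training data.

```python
import numpy as np

# Toy target values for illustration only; in this notebook you would use
# train_data["Appliances"].to_numpy() instead.
y = np.array([60, 60, 50, 50, 60, 40, 40, 40, 70, 30], dtype=float)

# Time-ordered split: first 80% to compute the mean, last 20% to validate.
split = len(y) * 4 // 5
y_fit, y_val = y[:split], y[split:]

# The "model" is just the mean of the fitting part.
baseline = y_fit.mean()

# Root mean squared error of the constant prediction on the held-out part.
rmse = np.sqrt(np.mean((y_val - baseline) ** 2))
print(baseline, rmse)
```

Any real model you build later should beat this number on the same validation split.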


Let's put this value in a NumPy array with the same length as our test dataset. Normally, `predictions` would be the output of your model, rather than a constant guess like this one:

In [8]:
predictions = np.full(test_data.shape[0], average_consumption)
len(predictions)

4735
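To see how this step looks once an actual model is involved, here is a sketch with scikit-learn's `DummyRegressor`, which reproduces the same mean baseline through the usual `fit`/`predict` interface. The arrays are toy stand-ins: in this notebook, `X` and `y` would come from `train_data` and `X_new` from `test_data`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy stand-ins for features and targets (the dummy model ignores the features).
X = np.zeros((4, 1))
y = np.array([60.0, 60.0, 50.0, 50.0])
X_new = np.zeros((3, 1))

# strategy="mean" always predicts the mean of the training targets.
model = DummyRegressor(strategy="mean")
model.fit(X, y)

# One prediction per row of X_new, just like the np.full array above.
predictions = model.predict(X_new)
print(predictions)
```

Swapping `DummyRegressor` for a real estimator leaves the rest of the submission code unchanged.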

Create a unique filename based on a timestamp:

In [9]:
def generate_unique_filename(basename, file_ext):
    """Adds a timestamp to filenames for easier tracking of submissions, models, etc."""
    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    return basename + '_' + timestamp + '.' + file_ext
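A quick usage sketch of this helper (repeated here so the block runs on its own). The exact name depends on when you run it, so the check only looks at the pattern.

```python
import re
import time


def generate_unique_filename(basename, file_ext):
    """Same helper as above: appends a timestamp to the base name."""
    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    return basename + '_' + timestamp + '.' + file_ext


filename = generate_unique_filename("average_submission", "csv")
print(filename)

# The result follows the pattern <basename>_YYYYMMDD-HHMMSS.<ext>.
pattern = r"average_submission_\d{8}-\d{6}\.csv"
matches_pattern = re.fullmatch(pattern, filename) is not None
print(matches_pattern)
```

Note that the timestamp has a resolution of one second, so two calls within the same second return the same name.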

Let's create our pandas DataFrame and write it to a CSV file. You can submit this file to Kaggle.

In [10]:
submission = pd.DataFrame(data=predictions, columns=["Appliances"])
submission.index.name = "Id"
submission.head()

Id,Appliances
0,98.751333
1,98.751333
2,98.751333
3,98.751333
4,98.751333


In [11]:
submission.to_csv(generate_unique_filename("average_submission", "csv"))
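Before uploading, it does not hurt to read the file back and check that it has the shape you expect. The sketch below does this with a small stand-in submission written to a temporary folder; the file name `check_submission.csv` is just a placeholder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the real submission (5 rows instead of 4735).
submission = pd.DataFrame(data=np.full(5, 98.75), columns=["Appliances"])
submission.index.name = "Id"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "check_submission.csv")
    submission.to_csv(path)
    # Read the file back the way a grader might: "Id" is the index column.
    reloaded = pd.read_csv(path, index_col="Id")

# Expect one "Appliances" column, one row per test sample, and no missing values.
print(reloaded.shape)
print(list(reloaded.columns))
print(reloaded.isna().any().any())
```

For the real file, the row count should match `len(test_data)`.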