In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load data

In [None]:
data_path = 'data/'

In [None]:
data = pd.read_csv(f'{data_path}train_phase_1.csv')
data.date = pd.to_datetime(data.date, format='%Y-%m-%d %H:%M:%S')

test = pd.read_csv(f'{data_path}test_phase_1.csv')
test.date = pd.to_datetime(test.date, format='%Y-%m-%d %H:%M:%S')

In [None]:
data.head()

In [None]:
data.shape, test.shape

# Checks

First, we check that the data has exactly the types we expect.

In [None]:
assert data.dtypes.equals(pd.Series({
    'date': 'datetime64[ns]', 
    'wp1': 'float64', 
    'u': 'float64', 
    'v': 'float64', 
    'ws': 'float64', 
    'wd': 'float64',
}))

assert test.dtypes.equals(pd.Series({
    'date': 'datetime64[ns]', 
    'u': 'float64', 
    'v': 'float64', 
    'ws': 'float64', 
    'wd': 'float64',
}))

Then, we check that we have no NA values in the dataframes.

In [None]:
assert not data.isnull().any(axis=None) and not test.isnull().any(axis=None)

Nice: the data has exactly the types we want and contains no NA values, so we can move on to an exploratory data analysis (EDA).

# EDA and first models

Let's start by looking at what we want to predict: the power measurement wp1 of the first farm.

## Wp1

In [None]:
plt.plot(data.head(300).date, data.head(300).wp1)

In [None]:
data.wp1.hist()

In [None]:
data.wp1.min(), data.wp1.max()

## Windspeed

The critical parameter for predicting wind power seems, unsurprisingly, to be the wind speed. Let us look at it.

In [None]:
plt.plot(data.head(300).date, data.head(300).ws)

Let's see whether power and speed are correlated by plotting one as a function of the other.

In [None]:
plt.scatter(data.head(300).ws, data.head(300).wp1)
plt.xlabel('windspeed')
plt.ylabel('windpower')


There clearly seems to be a correlation between the two! On average, wind power rises when wind speed rises, even if the relation between the two is not linear.

We can confirm this by calculating the correlation coefficient between the two. In fact, we can calculate the correlation coefficients between all pairs of variables in the dataset in one line of code. Let us do so.

In [None]:
data.corr()

The correlation coefficient between wind speed and wind power is 0.7: this is indeed high, so our first conclusion was right. Let us recall Pearson's definition of correlation, which is the one we used here.

For a sample of $n$ pairs $(x_i, y_i)$ with means $\bar x$ and $\bar y$, it is defined by:
    

$$
\frac{\sum \limits _{i=1} ^{n} (x_{i} - \bar x) (y_{i} - \bar y)}{\sqrt{\sum \limits _{i=1} ^{n}(x_{i} - \bar x)^{2}}\sqrt{\sum \limits _{i=1} ^{n}(y_{i} - \bar y)^{2}}} 
$$

What is important to recall is that it lies in the range $[-1, 1]$ and:
    - it is equal to 1 when the two variables are linked by an exact increasing linear relation (for instance, when they are exactly the same)
    - it is equal to -1 when they are linked by an exact decreasing linear relation (for instance, when one is the exact opposite of the other)
    - when it is equal to 0, the two variables have no linear relation. Independent variables have a correlation of 0: for example, the value of the bitcoin and the average windspeed in South Korea, which we expect to have nothing in common. The converse does not hold: two variables can be strongly related in a nonlinear way and still have a correlation of 0.
    - when it is > 0, the two variables are positively correlated: on average, when one goes up, the other goes up too.
    - when it is < 0, the two variables are negatively correlated: on average, when one goes up, the other goes down.
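
To make the formula concrete, here is a minimal sketch that computes the sample correlation term by term in plain Python. The function name `pearson_r` and the toy values are ours, chosen only for illustration; they do not come from the dataset.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation, following the formula term by term."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of the deviations from the means.
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations.
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * math.sqrt(
        sum((yi - mean_y) ** 2 for yi in y)
    )
    return num / den

# Toy values (not from the dataset).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_linear = pearson_r(x, [3.0 * xi + 1.0 for xi in x])   # exact increasing line
r_reversed = pearson_r(x, [-0.5 * xi for xi in x])      # exact decreasing line
r_parabola = pearson_r(x, [xi ** 2 for xi in x])        # y depends on x, but not linearly
print(r_linear, r_reversed, r_parabola)
```

The last case is worth a look: `y` is fully determined by `x`, yet the coefficient is zero up to rounding, because the relation has no linear component.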

Based on this first EDA, a very simple model we can try for predicting the wind power is a linear model:

$$ wp_{1} = \alpha \cdot ws + \beta $$
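
For a single input variable, the least-squares values of $\alpha$ and $\beta$ have a closed form: $\alpha$ is the covariance of the two variables divided by the variance of the input, and $\beta$ makes the line pass through the point of means. The sketch below spells this out in plain Python. The helper `fit_line` and the toy numbers are hypothetical and only serve as an illustration.

```python
def fit_line(x, y):
    """Least-squares estimates of alpha (slope) and beta (intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sum of squared deviations of x.
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    alpha = cov_xy / var_x
    # The fitted line passes through (mean_x, mean_y).
    beta = mean_y - alpha * mean_x
    return alpha, beta

# Made-up wind speeds and powers, for illustration only.
ws_toy = [2.0, 4.0, 6.0, 8.0]
wp_toy = [0.1, 0.2, 0.3, 0.4]
alpha, beta = fit_line(ws_toy, wp_toy)
print(alpha, beta)
```

In practice we let a library do this fit, which also handles several input variables at once.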

To fit this model, we import a few useful libraries.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

We now split our dataframe in two: one part will be used for training, and the other to estimate how good this first model is.
Why not use the test set directly for this? Because for the test set, we do not know the true value of the power measurement.

In [None]:
train, val = train_test_split(data, test_size=0.2, shuffle=True)
print(train.shape, val.shape)

In [None]:
X_train = train['ws'].values.reshape((-1, 1))
y_train = train['wp1'].values

X_val = val['ws'].values.reshape((-1, 1))
y_val = val['wp1'].values

lm = LinearRegression()
lm.fit(X_train, y_train)
mean_absolute_error(y_val, lm.predict(X_val))  # arguments are (y_true, y_pred)

Nice! We have our first model, and it gives a mean absolute error of 0.13!

But wait, what is this first model worth? How can we know whether 0.13 is actually a good error? A neat way to find out is to compare it to a naive model. A naive model can, for example, predict the same value every time, whatever the conditions. One such model we have at hand predicts the mean value of the wind power in the train set. Let's see what this model gives.

In [None]:
from sklearn.dummy import DummyRegressor

In [None]:
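# Baseline 1: always predict the mean of the training targets.
dm = DummyRegressor(strategy='mean')
dm.fit(X_train, y_train)
print(mean_absolute_error(y_val, dm.predict(X_val)))

# Baseline 2: always predict the median of the training targets.
dm = DummyRegressor(strategy='median')
dm.fit(X_train, y_train)
print(mean_absolute_error(y_val, dm.predict(X_val)))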

Good news: our model really did learn something! Based on this metric, it is a lot better than the 'mean' or 'median' models, by around 30%.
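
A percentage like this one is simply the relative reduction of the error with respect to the baseline. Here is a small sketch of the arithmetic. The helper `relative_improvement` is ours, and the numbers are placeholders rather than the errors measured above.

```python
def relative_improvement(model_error, baseline_error):
    """Fraction by which the model's error is lower than the baseline's."""
    return (baseline_error - model_error) / baseline_error

# Placeholder errors: a model at 0.7 against a baseline at 1.0.
gain = relative_improvement(model_error=0.7, baseline_error=1.0)
print(f"{gain:.0%} better than the baseline")
```

A value of 0 means the model does no better than the naive baseline, and a negative value means it does worse.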

## Wind direction, u & v

Now let's take a look at direction.

In [None]:
plt.plot(data.head(300).date, data.head(300).wd)

As expected for a direction, it lies between 0 and 360, and when it crosses 360 it wraps around to 0. But this makes it highly discontinuous at 360. How should we treat this in the models? That's a question for you to answer.
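
As a starting point for that question, one common option (among others, and not necessarily the best one here) is to replace the angle by its sine and cosine, which vary continuously across the wrap. The sketch below uses helper names of our own, `encode_direction` and `circular_gap`, to show that two directions 2 degrees apart end up equally close whether or not they straddle the 0/360 boundary.

```python
import math

def encode_direction(wd_degrees):
    """Map an angle in degrees to a (sin, cos) pair on the unit circle."""
    theta = math.radians(wd_degrees)
    return math.sin(theta), math.cos(theta)

def circular_gap(a, b):
    """Euclidean distance between the encodings of two directions."""
    sa, ca = encode_direction(a)
    sb, cb = encode_direction(b)
    return math.hypot(sa - sb, ca - cb)

gap_across_wrap = circular_gap(359.0, 1.0)  # 2 degrees apart, across the wrap
gap_no_wrap = circular_gap(10.0, 12.0)      # 2 degrees apart, no wrap
print(gap_across_wrap, gap_no_wrap)
```

On the raw angle, 359 and 1 look 358 units apart; after encoding, the two gaps match.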

One more thing about direction: the wind vector can be expressed either as:
    - wind speed and wind direction,
    - or u and v components.

These two representations are interchangeable (for the mathematically inclined, there is a bijection between them).
In our case, the convention used for the wind direction in our data is the wind vector azimuth. For more information, please see the following website, which explains these representations:
    http://tornado.sfsu.edu/geosciences/classes/m430/Wind/WindDirection.html
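
Here is a minimal sketch of the conversion in both directions. It assumes that u is the eastward component, v the northward component, and that the azimuth is measured clockwise from north and points where the wind is blowing to. These are our assumptions, to be checked against the data before relying on them; the function names are ours too.

```python
import math

def to_components(ws, wd_degrees):
    """(speed, azimuth) -> (u, v), with the azimuth clockwise from north."""
    theta = math.radians(wd_degrees)
    return ws * math.sin(theta), ws * math.cos(theta)

def to_speed_direction(u, v):
    """(u, v) -> (speed, azimuth in [0, 360))."""
    ws = math.hypot(u, v)
    # atan2(u, v) gives the angle from the north axis towards the east axis.
    wd = math.degrees(math.atan2(u, v)) % 360.0
    return ws, wd

u, v = to_components(5.0, 90.0)  # wind vector pointing due east
ws_back, wd_back = to_speed_direction(u, v)
print(u, v, ws_back, wd_back)
```

The round trip recovers the speed and the direction up to rounding. When the speed is zero the direction is undefined, so the correspondence only holds for non-zero wind.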


In [None]:
plt.plot(data.head(300).date, data.head(300).u)

In [None]:
plt.plot(data.head(300).date, data.head(300).v)

# Next steps: modeling

We have already seen our first models above: the linear model with one variable (windspeed) and two naive models (mean and median). From now on, it will be your job to find the best model, but let's already take a look at one classic model that data scientists try on nearly any problem: Random Forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
X_train = train[['ws', 'wd', 'u', 'v']].values
y_train = train['wp1'].values

X_val = val[['ws', 'wd', 'u', 'v']].values
y_val = val['wp1'].values

rf = RandomForestRegressor(n_jobs=-1)
rf.fit(X_train, y_train)
print(mean_absolute_error(y_val, rf.predict(X_val)))  # arguments are (y_true, y_pred)

Random Forest does only very slightly better than the linear model.

# Predictions on test set

Now that our model is fitted, we can move on to the predictions.

_Note: be careful when generating your submission file. It needs to be a CSV file with ";" as the separator._

In [None]:
X_test = test[['ws', 'wd', 'u', 'v']].values

df_predictions = pd.DataFrame({
    'date': test['date'],
    'wp1': rf.predict(X_test),
})

df_predictions.to_csv('predictions.csv', index=False, sep=';')
df_predictions.head()

Now it is your turn: what better model can you think of?