# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner workings of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

The Python libraries imported in the cell below provide the functionality needed to work with the datasets. Pandas and NumPy are used to load and manipulate the data at various stages of the data science life cycle. Once the data is loaded and manipulated, visuals can be made with Matplotlib, Seaborn, and Plotly to check distributions, outliers, and correlations, and so gather insights about the features and the relationships between them. The remaining libraries support data preparation and model building, and make it easy to scale features before modelling.

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from statsmodels.graphics.correlation import plot_corr

# Libraries for data preparation and model building
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
import math
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from scipy.stats import pearsonr
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

The Spain weather train and test CSV files are loaded into DataFrames and assigned to the variables `df_1` and `df_2` respectively.

In [None]:
df_1 = pd.read_csv('df_train.csv') # load the train data
df_2 = pd.read_csv('df_test.csv') # load the test data

#### Creating a copy of the data to avoid making permanent changes to the data

Copies of the train and test datasets are created in the cell below so that the cleaning operations do not alter the original DataFrames, and so that the originals remain available if a technical glitch results in the loss of data.

In [None]:
df_train = df_1.copy() # Copy of the train data
df_test = df_2.copy() # Copy of the test data
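
The reason for calling `.copy()` rather than assigning the DataFrame to a second name can be illustrated with a small sketch. The frame and its values below are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only)
raw = pd.DataFrame({"Madrid_temp": [280.0, 281.5, 279.9]})

alias = raw            # a second name for the same object
working = raw.copy()   # an independent copy (deep by default)

working.loc[0, "Madrid_temp"] = -1.0   # edit the copy only

print(raw.loc[0, "Madrid_temp"])       # the original value is untouched
print(alias is raw, working is raw)
```

Plain assignment would have made every cleaning step visible through both names, which is what the copy avoids.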

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


EDA is a process that can take considerable time, depending on the size and nature of the data one is dealing with. It involves investigating the data to discover patterns, spot anomalies, test hypotheses, and check assumptions, using summary statistics and graphical representations. This step is crucial: it is about making sense of the data before modifying it.

In [None]:
# look at data statistics

In [None]:
#checking the shape of the data
df_train.shape

In [None]:
df_test.shape

The shape of the train data set indicates 8763 rows and 49 columns.  
The shape of the test data set indicates 2920 rows and 48 columns.

**Note:** We append `.T` to transpose the output, swapping rows and columns, for a better view of the column names.

In [None]:
df_train.head().T

In [None]:
df_test.head().T

Observations:

1. There is an unnamed column whose values duplicate the index, as seen above; this column is insignificant to our use case
2. Column names have a mix of upper and lower case letters
3. All wind_deg and pressure columns are supposed to be expressed as a category according to the (features) data description on Kaggle
4. The weather id columns are insignificant to our use case

In [None]:
#checking the data type of each column in the data
print(df_train.info())

#checking if there are missing values in any column
print(df_train.isnull().sum())

In [None]:
#checking the data type of each column in the data
print(df_test.info())

#checking if there are missing values in any column
print(df_test.isnull().sum());

Observations:

1. Valencia_pressure has 2068 null values in the train dataset and 454 null values in the test dataset.
2. The `object` data type indicates that the columns `time`, `Valencia_wind_deg`, and `Seville_pressure` are not numerical.
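
The two checks behind these observations can be condensed into a short report. The following is a minimal sketch on a toy frame; the values are made up and only the column names mimic the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mimic the data set
toy = pd.DataFrame({
    "time": ["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2015-01-01 09:00:00"],
    "Valencia_wind_deg": ["level_5", "level_10", "level_9"],
    "Valencia_pressure": [1002.7, np.nan, 1005.3],
    "Madrid_temp": [265.9, 266.1, 272.5],
})

# Count and share of missing values, keeping only columns that have any
missing = toy.isnull().sum()
missing_report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
missing_report = missing_report[missing_report["n_missing"] > 0]

# Columns that pandas did not read as numbers
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_report)
print(non_numeric)
```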

In [None]:
# A look at the train data descriptive statistics
df_train.describe().T

In [None]:
# A look at the test data descriptive statistics
df_test.describe().T

Observations:

1. The Barcelona rainfall levels in the train dataset contain an anomaly: a high rainfall level is recorded for a day on which, according to online records, there was no rain.
2. The wind speed for Valencia contains an anomaly: the maximum of 52 is too high. Other cities had a maximum wind speed of 12, and normal wind speed levels fall between 10 and 19 on average.
3. There are no percentiles for rain_1h and snow_1h
4. There are no percentiles for rain_3h and snow_3h

In [None]:
df_train.kurtosis(numeric_only=True)

Observations:
The following columns have high kurtosis:

| Column | Kurtosis |
| :--- | ---: |
| Bilbao_rain_1h | 32.904656 |
| Valencia_wind_speed | 35.645426 |
| Bilbao_wind_speed | 3.631565 |
| Seville_rain_1h | 93.840746 |
| Bilbao_snow_3h | 806.128471 |
| Barcelona_pressure | 3687.564230 |
| Seville_rain_3h | 413.136592 |
| Madrid_rain_1h | 76.584491 |
| Barcelona_rain_3h | 187.800460 |
| Valencia_snow_3h | 4089.323165 |
| Madrid_weather_id | 9.259047 |
| Barcelona_weather_id | 5.701882 |
| Seville_weather_id | 10.710308 |
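
A list like the one above can be produced programmatically. The sketch below uses synthetic data, and the cut-off of 3 is an assumption rather than a rule from the brief. Note that pandas reports excess kurtosis, so a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one roughly normal column and one that is almost always zero
toy = pd.DataFrame({
    "temp_like": rng.normal(loc=290, scale=5, size=1000),
    "rain_like": np.where(rng.random(1000) < 0.02, 10.0, 0.0),
})

def high_kurtosis(df, threshold=3.0):
    """Return columns whose excess kurtosis exceeds `threshold`, largest first."""
    k = df.kurtosis(numeric_only=True)
    return k[k > threshold].sort_values(ascending=False)

print(high_kurtosis(toy))
```

Columns that are zero most of the time, such as rainfall and snowfall, tend to dominate such a list because a handful of non-zero readings form a very heavy tail.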

In [None]:
df_test.kurtosis(numeric_only=True)

Observations:
The following columns have high kurtosis:

| Column | Kurtosis |
| :--- | ---: |
| Bilbao_rain_1h | 16.905396 |
| Barcelona_rain_1h | 52.069367 |
| Seville_rain_1h | 48.243445 |
| Seville_rain_3h | 2920.000000 |
| Madrid_rain_1h | 41.250278 |
| Barcelona_rain_3h | 1642.238858 |
| Valencia_pressure | 4.966557 |
| Madrid_pressure | 14.027856 |

In [None]:
df_train.skew(numeric_only=True)

***Observations:***  
**The following columns have high positive skew:**

| Column | Skew |
| :--- | ---: |
| Madrid_wind_speed | 1.441144 |
| Bilbao_rain_1h | 5.222802 |
| Valencia_wind_speed | 3.499637 |
| Bilbao_wind_speed | 1.716914 |
| Seville_clouds_all | 1.814452 |
| Madrid_clouds_all | 1.246745 |
| Seville_wind_speed | 1.151006 |
| Barcelona_rain_1h | 8.726988 |
| Seville_rain_1h | 8.067341 |
| Bilbao_snow_3h | 26.177568 |
| Barcelona_pressure | 57.979664 |
| Seville_rain_3h | 19.342574 |
| Madrid_rain_1h | 7.074308 |
| Barcelona_rain_3h | 12.696605 |
| Valencia_snow_3h | 63.298084 |

**The following columns have high negative skew:**

| Column | Skew |
| :--- | ---: |
| Madrid_weather_id | -3.107722 |
| Barcelona_weather_id | -2.584011 |
| Seville_weather_id | -3.275574 |
| Valencia_pressure | -1.705162 |
| Madrid_pressure | -1.850768 |
| Bilbao_weather_id | -1.234844 |
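
One common way to reduce strong positive skew in a non-negative column is a `log1p` transform. The notebook does not prescribe this step; the sketch below, on synthetic values, only shows the effect such a transform would have.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed, non-negative values standing in for a rainfall column
rain_like = pd.Series(rng.exponential(scale=2.0, size=2000), name="rain_like")

skew_before = rain_like.skew()
skew_after = np.log1p(rain_like).skew()   # log(1 + x) is defined at x = 0

print(round(skew_before, 2), round(skew_after, 2))
```

The transform applies only to columns without negative values; negatively skewed columns would need a different treatment.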

In [None]:
df_test.skew(numeric_only=True)


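In [None]:
def outliers(df, column):
    """Summarise `df` after dropping rows where `column` lies outside the 1.5 * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    factor = iqr*1.5
    low_b = q1-factor
    upper_b = q3+factor
    # Keep only the rows whose value lies strictly inside the fences
    not_outliers = df[(df[column] > low_b) & (df[column] < upper_b)]
    return not_outliers.describe().T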
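
The interquartile-range rule used above can be checked by hand on a small series. In this sketch the helper name `iqr_bounds` and the wind-speed readings are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up wind speeds with one implausible reading
wind = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 52.0], name="wind_speed")

low, high = iqr_bounds(wind)
flagged = wind[(wind < low) | (wind > high)]

print(low, high)
print(flagged)
```

Here the quartiles are 2.25 and 3.75, so the fences sit at 0 and 6 and only the reading of 52 is flagged.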

In [None]:
# plot relevant feature interactions

In [None]:
plt.figure(figsize = [10,5])
df_train.skew(axis=0, skipna=True, numeric_only=True).plot();

In [None]:
plt.figure(figsize = [10,5])
df_train.kurtosis(axis=0, skipna=True, numeric_only=True).plot();

In [None]:
sns.set(rc={'figure.figsize':(10,10)})

#define plotting region (5 rows, 2 columns)
fig, axes = plt.subplots(5, 2)

#create scatter plot in each subplot
sns.scatterplot(data=df_train, y='Bilbao_rain_1h', x='load_shortfall_3h', ax=axes[0,0])
sns.scatterplot(data=df_train, y='Valencia_wind_speed', x='load_shortfall_3h', ax=axes[0,1])
sns.scatterplot(data=df_train, y='Bilbao_wind_speed', x='load_shortfall_3h', ax=axes[1,0])
sns.scatterplot(data=df_train, y='Seville_rain_1h', x='load_shortfall_3h', ax=axes[1,1])
sns.scatterplot(data=df_train, y='Bilbao_snow_3h', x='load_shortfall_3h', ax=axes[2,0])
sns.scatterplot(data=df_train, y='Barcelona_pressure', x='load_shortfall_3h', ax=axes[2,1])
sns.scatterplot(data=df_train, y='Seville_rain_3h', x='load_shortfall_3h', ax=axes[3,0])
sns.scatterplot(data=df_train, y='Madrid_rain_1h', x='load_shortfall_3h', ax=axes[3,1])
sns.scatterplot(data=df_train, y='Barcelona_rain_3h', x='load_shortfall_3h', ax=axes[4,0])
sns.scatterplot(data=df_train, y='Valencia_snow_3h', x='load_shortfall_3h', ax=axes[4,1]);

In [None]:
# Visualizing the correlation
fig = plt.figure(figsize=(10,8));
ax = fig.add_subplot(111);
corr = df_train.select_dtypes(include='number').corr()  # correlation is defined for numeric columns only
plot_corr(corr, xnames = corr.columns, ax = ax);

In [None]:
plt.figure(figsize=[35,15])
sns.heatmap(df_train.select_dtypes(include='number').corr(), vmin=-1, vmax=1 ,annot=True);
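
A large annotated heatmap is hard to read, so it can help to list the most strongly correlated pairs directly. The following is a minimal sketch on synthetic features; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500

# Synthetic features: temp_max is built from temp, humidity is independent
temp = rng.normal(290, 5, n)
toy = pd.DataFrame({
    "temp": temp,
    "temp_max": temp + rng.normal(0, 0.5, n),
    "humidity": rng.uniform(20, 100, n),
})

corr = toy.corr()

# Keep the upper triangle only so each pair appears once, then rank by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.unstack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs.head())
```

Pairs near the top of such a list, for example a temperature and its daily maximum, are candidates for dropping one of the two features.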

In [None]:
# evaluate correlation

In [None]:
# have a look at feature distributions

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# remove missing values/ features

In [None]:
df_train = df_train.reindex(sorted(df_train.columns), axis=1)

In [None]:
df_train.columns = df_train.columns.str.lower()

In [None]:
df_train.columns

In [None]:
df_train.drop(['unnamed: 0', 'barcelona_weather_id', 'bilbao_weather_id', 'madrid_weather_id', 'seville_weather_id'], axis = 1, inplace = True)

In [None]:
df_train.head()

In [None]:
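def sub_outliers_pressure(df, col):
    """Replace implausible pressure readings in `col` in place and return summary statistics."""
    upper_limit = df[col].mean() + 3*df[col].std()
    # Readings above 1052 are replaced with mean + 3 standard deviations;
    # readings below 950 are raised to 950
    df[col] = np.where(df[col] > 1052,
        upper_limit,
        np.where(
            df[col] < 950,
            950,
            df[col]
        )
    )
    return df.describe()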

In [None]:
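def sub_outliers_wind(df, col):
    """Replace implausible wind speeds in `col` in place and return summary statistics."""
    upper_limit = df[col].mean() + 3*df[col].std()
    # Readings above 19 are replaced with mean + 3 standard deviations
    df[col] = np.where(df[col] > 19,
        upper_limit,
            df[col]
    )
    
    return df.describe()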

In [None]:
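def sub_outliers_rain(df, col):
    """Cap values of `col` above mean + 3 standard deviations and return summary statistics."""
    upper_limit = df[col].mean() + 3*df[col].std()
    # Rainfall cannot be negative, so only the upper tail is capped
    df[col] = np.where(df[col] > upper_limit,
        upper_limit,
        df[col]
    )
    return df.describe()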

In [None]:
#sub_outliers(df_train, 'bilbao_rain_1h')
sub_outliers_wind(df_train, 'valencia_wind_speed')
sub_outliers_wind(df_train, 'bilbao_wind_speed')
#sub_outliers(df_train, 'seville_rain_1h')
#sub_outliers(df_train, 'bilbao_snow_3h')
sub_outliers_pressure(df_train, 'barcelona_pressure')
#sub_outliers(df_train, 'seville_rain_3h')
#sub_outliers(df_train, 'madrid_rain_1h')
#sub_outliers(df_train, 'barcelona_rain_3h')
#sub_outliers(df_train, 'valencia_snow_3h')
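
An alternative to substituting a mean-based limit is to cap readings at fixed plausible bounds with `Series.clip`. This is a different technique from the functions above. The sketch reuses the 950 and 1052 thresholds on made-up readings and assumes those thresholds are sensible limits for pressure in hPa.

```python
import pandas as pd

# Made-up pressure readings in hPa with two implausible values
pressure = pd.Series([1012.0, 1015.0, 1009.0, 1020.0, 3687.0, 670.0], name="pressure")

# Assumed plausible range; adjust to the data at hand
LOWER, UPPER = 950.0, 1052.0

capped = pressure.clip(lower=LOWER, upper=UPPER)
print(capped.tolist())
```

Fixed bounds have the advantage that they do not depend on the mean and standard deviation, both of which are themselves inflated by extreme outliers.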

In [None]:
df_train['barcelona_pressure'].value_counts()

In [None]:
df_train.kurtosis(numeric_only=True)

In [None]:
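# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df_train['valencia_pressure'] = df_train['valencia_pressure'].fillna(df_train['valencia_pressure'].mean())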
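
When the same imputation is later applied to the test set, the fill value is usually taken from the training data so that both sets are treated identically. The following is a minimal sketch with made-up values and hypothetical frame names.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with made-up values
train = pd.DataFrame({"valencia_pressure": [1000.0, np.nan, 1010.0, 1020.0]})
test = pd.DataFrame({"valencia_pressure": [np.nan, 1005.0]})

# Learn the fill value on the training data only, then reuse it on both sets
fill_value = train["valencia_pressure"].mean()
train["valencia_pressure"] = train["valencia_pressure"].fillna(fill_value)
test["valencia_pressure"] = test["valencia_pressure"].fillna(fill_value)

print(fill_value)
```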

In [None]:
df_train.skew(numeric_only=True)

In [None]:
plt.figure(figsize = [10,5])
df_train.skew(axis=0, skipna=True, numeric_only=True).plot(rot=90);

In [None]:
# create new features
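
Since the `time` column is read as text, one possible source of new features is the calendar information it contains. The sketch below assumes a timestamp format like the one shown; the derived column names are illustrative.

```python
import pandas as pd

# Toy frame; the timestamp format is assumed to look like this
toy = pd.DataFrame({"time": ["2015-01-01 03:00:00", "2015-06-15 18:00:00"]})

toy["time"] = pd.to_datetime(toy["time"])
toy["year"] = toy["time"].dt.year
toy["month"] = toy["time"].dt.month
toy["day_of_week"] = toy["time"].dt.dayofweek   # Monday = 0
toy["hour"] = toy["time"].dt.hour

print(toy)
```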

In [None]:
#creating a level category for the wind degree columns
classes = {
    "level_1" : [0,36],
    "level_2" : [36,72],
    "level_3" : [72,108],
    "level_4" : [108,144],
    "level_5" : [144,180],
    "level_6" : [180,216],
    "level_7" : [216,252],
    "level_8" : [252,288],
    "level_9" : [288,324],
    "level_10" : [324,360]
}

In [None]:
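#A function that changes numeric columns (wind degree, pressure) to categorical levels
def change_level(df, column_name, dictionary):
    # Store the column as object so string labels can replace the numbers
    df[column_name] = df[column_name].astype(object)
    for row, x in df[column_name].copy().items():
        for key, (low, high) in dictionary.items():
            # Levels are right-closed, (low, high]; a value of exactly 0 is
            # placed in the level that starts at 0
            if low < x <= high or x == low == 0:
                df.at[row, column_name] = key
                break
    
    return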

In [None]:
change_level(df_train, "barcelona_wind_deg", classes)

In [None]:
change_level(df_train, "bilbao_wind_deg", classes)

In [None]:
df_train['bilbao_wind_deg'].value_counts()
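
The same binning can also be done without an explicit loop by using `pd.cut`. This is a sketch of an alternative, not the method used above. It assumes that a direction of exactly 0 belongs in the first level.

```python
import numpy as np
import pandas as pd

# Made-up wind directions in degrees
deg = pd.Series([0.0, 10.0, 36.0, 36.1, 200.0, 360.0], name="wind_deg")

edges = np.arange(0, 361, 36)                        # 0, 36, ..., 360
labels = [f"level_{i}" for i in range(1, len(edges))]

# Right-closed bins (low, high]; include_lowest puts exactly 0 in the first bin
binned = pd.cut(deg, bins=edges, labels=labels, include_lowest=True)
print(binned.tolist())
```

The result is a categorical column, so each level is stored once and the dtype makes the intent explicit.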

In [None]:
#creating the level boundaries for the pressure columns
levels = np.arange(950, 1050 +1, 4)
levels

In [None]:
#create an empty dictionary for the pressure levels
pressure_dict = {}

In [None]:
#creating and adding the pressure levels to the dictionary
for i in range(len(levels) - 1):
    pressure_dict['sp' + str(i+1)] = [levels[i], levels[i+1]]
pressure_dict

In [None]:
#change the values for all the pressure columns in the train dataset to their respective categories
pressure_columns = [x for x in df_train.columns if 'pressure' in x and 'seville' not in x]
for column in pressure_columns:
    change_level(df_train, column, pressure_dict)

In [None]:
# barcelona_pressure was already converted by the loop above
df_train['barcelona_pressure'].value_counts()

In [None]:
# Visualizing the correlation
fig = plt.figure(figsize=(10,10));
ax = fig.add_subplot(111);
corr = df_train.select_dtypes(include='number').corr()  # correlation is defined for numeric columns only
plot_corr(corr, xnames = corr.columns, ax = ax);

In [None]:
plt.figure(figsize=[35,25])
sns.set(font_scale=1.4)
train_heatmap = sns.heatmap(df_train.select_dtypes(include='number').corr(), vmin=-1, vmax=1 ,annot=True)
#train_heatmap = sns.diverging_palette(h_neg=210, h_pos=350, s=90, l=30, as_cmap=True)
train_heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20}, pad=12);

In [None]:
#creating dummies for the pressure columns
seville_pressure_dummy = pd.get_dummies(df_train['seville_pressure'])
barcelona_pressure_dummy = pd.get_dummies(df_train['barcelona_pressure'])
bilbao_pressure_dummy = pd.get_dummies(df_train['bilbao_pressure'])
madrid_pressure_dummy = pd.get_dummies(df_train['madrid_pressure'])
valencia_pressure_dummy = pd.get_dummies(df_train['valencia_pressure'])
barcelona_pressure_dummy

In [None]:
#checking dummy df for duplicates
seville_pressure_dummy['check'] = seville_pressure_dummy.sum(axis=1)
barcelona_pressure_dummy['check'] = barcelona_pressure_dummy.sum(axis=1)
bilbao_pressure_dummy['check'] = bilbao_pressure_dummy.sum(axis=1)
madrid_pressure_dummy['check'] = madrid_pressure_dummy.sum(axis=1)
valencia_pressure_dummy['check'] = valencia_pressure_dummy.sum(axis=1)

In [None]:
seville_pressure_dummy['check'].unique()

In [None]:
seville_pressure_dummy = seville_pressure_dummy.drop(['check'],axis=1)
barcelona_pressure_dummy = barcelona_pressure_dummy.drop(['check'],axis=1)
bilbao_pressure_dummy = bilbao_pressure_dummy.drop(['check'],axis=1)
madrid_pressure_dummy = madrid_pressure_dummy.drop(['check'],axis=1)
valencia_pressure_dummy = valencia_pressure_dummy.drop(['check'],axis=1)
barcelona_pressure_dummy

In [None]:
# Preview df_train without the raw pressure columns. The result is not assigned,
# so df_train still holds these columns; they are dropped when X is built for modelling.
df_train.drop(['seville_pressure', 'barcelona_pressure', 'bilbao_pressure',
               'madrid_pressure', 'valencia_pressure'], axis = 1)

In [None]:
def pressure_band(dummies, first, last):
    """Return 1 where any of the dummy columns sp<first>..sp<last> is set, else 0."""
    # Select by explicit name: the dummy columns are ordered alphabetically
    # (sp1, sp10, ..., sp2, ...), so a label slice would pick the wrong columns
    cols = [f'sp{i}' for i in range(first, last + 1) if f'sp{i}' in dummies.columns]
    return dummies[cols].max(axis=1).fillna(0).astype(int)

# Three bands per city: sp1-sp17, sp18-sp24 and sp25.
# The same boundaries are assumed for every city.
bands = [(1, 17), (18, 24), (25, 25)]

svp_lvl1, svp_lvl2, svp_lvl3 = [pressure_band(seville_pressure_dummy, a, b) for a, b in bands]
bcp_lvl1, bcp_lvl2, bcp_lvl3 = [pressure_band(barcelona_pressure_dummy, a, b) for a, b in bands]
bbp_lvl1, bbp_lvl2, bbp_lvl3 = [pressure_band(bilbao_pressure_dummy, a, b) for a, b in bands]
mdp_lvl1, mdp_lvl2, mdp_lvl3 = [pressure_band(madrid_pressure_dummy, a, b) for a, b in bands]
vlp_lvl1, vlp_lvl2, vlp_lvl3 = [pressure_band(valencia_pressure_dummy, a, b) for a, b in bands]
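
When grouping dummy columns named `sp1` to `sp25` into bands, it is worth knowing that `pd.get_dummies` orders string categories alphabetically, so `sp10` comes before `sp2`. The following is a minimal sketch, on made-up labels, of selecting a band by explicit column names rather than by position.

```python
import pandas as pd

# Made-up categorical pressure labels
labels = pd.Series(["sp1", "sp2", "sp10", "sp18", "sp25"])
dummies = pd.get_dummies(labels)
print(dummies.columns.tolist())          # alphabetical, not numeric, order

# Select the band sp1..sp17 by explicit names
band = [f"sp{i}" for i in range(1, 18)]
present = [c for c in band if c in dummies.columns]
in_band = dummies[present].max(axis=1).astype(int)
print(in_band.tolist())
```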

In [None]:
df_train = pd.concat([df_train, svp_lvl1, svp_lvl2, svp_lvl3], axis=1)
df_train = pd.concat([df_train, bcp_lvl1, bcp_lvl2, bcp_lvl3], axis=1)
df_train = pd.concat([df_train, bbp_lvl1, bbp_lvl2, bbp_lvl3], axis=1)
df_train = pd.concat([df_train, mdp_lvl1, mdp_lvl2, mdp_lvl3], axis=1)
df_train = pd.concat([df_train, vlp_lvl1, vlp_lvl2, vlp_lvl3], axis=1)

In [None]:
df_train.columns

In [None]:
column_names = [  'barcelona_pressure',    'barcelona_rain_1h',    'barcelona_rain_3h',
             'barcelona_temp',   'barcelona_temp_max',   'barcelona_temp_min',
         'barcelona_wind_deg', 'barcelona_wind_speed',    'bilbao_clouds_all',
            'bilbao_pressure',       'bilbao_rain_1h',       'bilbao_snow_3h',
                'bilbao_temp',      'bilbao_temp_max',      'bilbao_temp_min',
            'bilbao_wind_deg',    'bilbao_wind_speed',    'madrid_clouds_all',
            'madrid_humidity',      'madrid_pressure',       'madrid_rain_1h',
                'madrid_temp',      'madrid_temp_max',      'madrid_temp_min',
          'madrid_wind_speed',   'seville_clouds_all',     'seville_humidity',
           'seville_pressure',      'seville_rain_1h',      'seville_rain_3h',
               'seville_temp',     'seville_temp_max',     'seville_temp_min',
         'seville_wind_speed',    'valencia_humidity',    'valencia_pressure',
           'valencia_snow_3h',        'valencia_temp',    'valencia_temp_max',
          'valencia_temp_min',    'valencia_wind_deg',  'valencia_wind_speed',
          'load_shortfall_3h',                 'time',               'sp_lvl1',
                            'sp_lvl2',                      'sp_lvl3', 'bcp_lvl1',
                'bcp_lvl2', 'bcp_lvl3', 'bbp_lvl1', 'bbp_lvl2', 'bbp_lvl3', 'mdp_lvl1', 'mdp_lvl2', 'mdp_lvl3',
               'vlp_lvl1', 'vlp_lvl2', 'vlp_lvl3']

In [None]:
df_train.columns = column_names

In [None]:
df_train.columns

In [None]:
# engineer existing features

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

In [None]:
# split data

In [None]:
# create targets and features dataset
X = df_train.drop(['time',
 'valencia_wind_deg',
 'bilbao_wind_deg',
 'barcelona_wind_deg',
 'seville_pressure',
 'barcelona_pressure',
 'bilbao_pressure',
 'valencia_pressure',
 'madrid_pressure', 'load_shortfall_3h'], axis=1)

In [None]:
y = df_train['load_shortfall_3h']

In [None]:
# create one or more ML models
# The target is continuous, so a linear regression model is fitted
# (logistic regression is a classifier and does not apply here)
lg = LinearRegression()

In [None]:
lg.fit(X, y)

In [None]:
lg.coef_

In [None]:
lg.intercept_
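
Fitting on all of the training rows leaves nothing to judge the model against, so a validation split is usually made first. The sketch below uses synthetic data and hypothetical variable names. It sets `shuffle=False` on the assumption that the rows are in time order, which keeps the holdout at the end of the period.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression data: y depends linearly on two features plus noise
X_demo = rng.normal(size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)

# shuffle=False keeps the last 20% of rows as the validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, pred))
r2 = r2_score(y_val, pred)
print(round(rmse, 3), round(r2, 3))
```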

In [None]:
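# df_test still has its original column names, so lower-case them before dropping.
# The engineered pressure-level columns must also be added before predicting with a model fitted on X.
df_test.columns = df_test.columns.str.lower()
X_test = df_test.drop(['unnamed: 0', 'barcelona_weather_id', 'bilbao_weather_id',
 'madrid_weather_id', 'seville_weather_id', 'time',
 'valencia_wind_deg',
 'bilbao_wind_deg',
 'barcelona_wind_deg',
 'seville_pressure',
 'barcelona_pressure',
 'bilbao_pressure',
 'valencia_pressure',
 'madrid_pressure'], axis=1)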
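
A fitted model expects the test features to have the same columns, in the same order, as the training features. One way to enforce this is `reindex`, sketched here on toy frames with hypothetical names. Filling absent columns with 0 is an assumption that suits indicator columns and should be reviewed for other features.

```python
import pandas as pd

# Toy feature frames with made-up values
X_train_demo = pd.DataFrame({"madrid_temp": [280.0, 285.0], "sp_lvl1": [1, 0], "sp_lvl2": [0, 1]})
X_test_demo = pd.DataFrame({"sp_lvl1": [1], "madrid_temp": [282.0], "extra_col": [5]})

# Same columns, same order as training; columns absent from test are filled with 0
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)
print(X_test_aligned)
```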

In [None]:
# evaluate one or more ML models

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice
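
A comparison of several models on a holdout set could be tabulated as in the sketch below. The data is synthetic, and the choice of candidate models and their parameters is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate on the same split and score it on the same holdout rows
rows = []
for name, est in candidates.items():
    pred = est.fit(X_tr, y_tr).predict(X_val)
    rows.append({"model": name,
                 "rmse": np.sqrt(mean_squared_error(y_val, pred)),
                 "r2": r2_score(y_val, pred)})

results = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(results)
```

Scoring every model on the same holdout rows keeps the comparison fair; the model with the lowest RMSE appears first in the table.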

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
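
For a linear model, one simple way to explain which features matter is to fit on standardised features and compare coefficient sizes. The sketch below uses synthetic data with illustrative feature names and is not a result from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features on very different scales (names are illustrative)
demo = pd.DataFrame({
    "temp": rng.normal(290, 5, 500),
    "wind_speed": rng.normal(3, 1, 500),
})
target = (100.0 * (demo["temp"] - 290) / 5
          - 20.0 * (demo["wind_speed"] - 3)
          + rng.normal(0, 5, 500))

# After standardising, each coefficient is the change in the target per
# one standard deviation of that feature, so magnitudes are comparable
X_scaled = StandardScaler().fit_transform(demo)
model = LinearRegression().fit(X_scaled, target)

importance = pd.Series(model.coef_, index=demo.columns).sort_values(key=np.abs, ascending=False)
print(importance.round(1))
```

For a non-technical audience, each coefficient reads as "a typical increase in this feature moves the predicted shortfall by this much", with the sign giving the direction.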