# Understanding Supervised Regression Models

This notebook uses supervised regression techniques to **predict daily bicycle rentals for a bike-sharing service**. Using historical data, we train and evaluate multiple regression models to forecast rental counts from features such as weather, season, and weekday.

## Regression

_Supervised_ machine learning techniques involve training a model to operate on a set of _features_ and predict a _label_, using a dataset that includes some already-known label values. The training process _fits_ the features to the known labels to define a general function that can then be applied to new features, for which the labels are unknown, to predict those labels. You can think of this function like this, in which **_y_** represents the label we want to predict and **_x_** represents the features the model uses to predict it.

$$y = f(x)$$

In most cases, _x_ is actually a _vector_ that consists of multiple feature values, so to be a little more precise, the function could be expressed like this:

$$y = f([x_1, x_2, x_3, ...])$$

The goal of training the model is to find a function that performs some kind of calculation on the _x_ values to produce the result _y_. We do this by applying a machine learning _algorithm_ that tries to fit the _x_ values to a calculation that produces _y_ reasonably accurately for all of the cases in the training dataset.
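To make the idea of fitting a function concrete, here is a minimal sketch that uses ordinary least squares from NumPy as the "algorithm". The toy data, the rule used to generate the labels, and the helper name `f` are all made up for illustration; they are not part of the bike-sharing dataset or of any particular model used later.

```python
import numpy as np

# Toy training data (hypothetical): two feature values per observation.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
# Known labels, generated from the rule y = 3*x1 + 2*x2 + 1 so the fit can be checked.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0

# Append a column of ones so the fitted function can include an intercept.
X_design = np.column_stack([X, np.ones(len(X))])

# "Training": find the coefficients that minimize squared error on the known labels.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def f(x):
    """Apply the fitted coefficients to a new feature vector [x1, x2]."""
    return float(np.dot(np.append(x, 1.0), coef))

# Predict the label for features that were not in the training data.
prediction = f([5.0, 2.0])
print(coef, prediction)
```

Because the labels here follow an exact linear rule, the fit recovers it almost perfectly; with real data the function would only approximate the labels.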

There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:

**_Regression_ algorithms**: Algorithms that predict a _y_ value that is a numeric value, such as the price of a house or the number of sales transactions.

**_Classification_ algorithms**: Algorithms that predict to which category, or _class_, an observation belongs. The _y_ value in a classification model is a vector of probability values between 0 and 1, one for each class, indicating the probability of the observation belonging to each class.
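As a rough illustration of what a vector of class probabilities looks like, the sketch below converts raw scores into probabilities with a softmax function. Softmax is only one common way to produce such a vector and is an assumption here, as are the class names and score values, which are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that are each in [0, 1] and sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

class_names = ["low", "medium", "high"]  # hypothetical classes
scores = np.array([0.5, 2.0, 1.0])       # hypothetical raw model scores

# One probability per class; the predicted class is the most probable one.
probabilities = softmax(scores)
predicted_class = class_names[int(np.argmax(probabilities))]
print(probabilities, predicted_class)
```

A regression model, by contrast, would return a single number for each observation rather than a vector of probabilities.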

In this notebook, we'll focus on _regression_, using an example based on a real study in which data for a bicycle sharing scheme was collected and used to predict the number of rentals based on seasonality and weather conditions. We'll use a simplified version of the dataset from that study.

**Citation**: The data used in this exercise is derived from [Capital Bikeshare](https://www.capitalbikeshare.com/system-data) and is used in accordance with the published [license agreement](https://www.capitalbikeshare.com/data-license-agreement).


# Explore the Data

The goal of data exploration is to understand the relationships between the dataset's attributes, in particular any apparent relationship between the _features_ and the _label_ (target) your model will try to predict. This may require work to:

- Detect and fix issues in the data (such as missing values, errors, or outlier values)
- Derive new feature columns by transforming or combining existing features (_feature engineering_)
- Normalize numeric features
- Encode categorical features
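The steps in this list can be sketched with pandas as follows. This is a minimal illustration on a handful of rows and columns copied from the sample of the dataset shown in this notebook; the derived column names `day` and `temp_scaled`, and the choice of min-max scaling and one-hot encoding, are assumptions made for the example rather than the approach the notebook necessarily takes.

```python
import pandas as pd

# A few rows and columns copied from the dataset sample shown in this notebook.
sample = pd.DataFrame({
    "dteday": ["1/1/2011", "1/2/2011", "1/3/2011", "1/4/2011", "1/5/2011"],
    "weathersit": [2, 2, 1, 1, 1],
    "temp": [0.344167, 0.363478, 0.196364, 0.2, 0.226957],
    "rentals": [331, 131, 120, 108, 82],
})

# 1. Detect issues: count missing values in each column.
missing_counts = sample.isnull().sum()

# 2. Feature engineering: derive the day of the month from the date string.
sample["day"] = pd.to_datetime(sample["dteday"], format="%m/%d/%Y").dt.day

# 3. Normalize a numeric feature: min-max scale to the range [0, 1].
temp_min, temp_max = sample["temp"].min(), sample["temp"].max()
sample["temp_scaled"] = (sample["temp"] - temp_min) / (temp_max - temp_min)

# 4. Encode a categorical feature: one indicator column per category value.
encoded = pd.get_dummies(sample, columns=["weathersit"], prefix="weathersit")

print(missing_counts)
print(encoded.head())
```

On the full dataset, the same calls apply unchanged; only the choice of which columns to scale or encode would need to be revisited.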

```python
# Load the training dataset
import pandas as pd

bike_data = pd.read_csv('./../../data/daily-bike-share.csv')

# Display the first few rows of the data
bike_data.head()
```


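| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | rentals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/1/2011 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 |
| 1 | 2 | 1/2/2011 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 |
| 2 | 3 | 1/3/2011 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 |
| 3 | 4 | 1/4/2011 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 |
| 4 | 5 | 1/5/2011 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 |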
