## **PREDICTING BIKE RENTALS**

In this project, we'll evaluate different machine learning algorithms on their ability to predict the number of bike rentals in a given hour for a bike sharing program in Washington, D.C. Each model will be trained on historical data from the program, as provided by the UCI (University of California, Irvine) Machine Learning Repository at [http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

The models we'll explore are Linear Regression, Decision Tree, and Random Forest. The inputs used to predict the number of rentals include, among others: (1) the season in which the rental occurred; (2) the hour of the day of the rental; (3) whether the day was a holiday; and (4) the temperature at the time of the rental.

We'll begin by reading the data into a dataframe and exploring its columns.




In [1]:
import pandas as pd
import matplotlib.pyplot as plt

bike_rentals = pd.read_csv("bike_rental_hour.csv")
bike_rentals.head()

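| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |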


Here are the descriptions for the relevant columns:

**instant** - A unique sequential ID number for each row

**dteday** - The date of the rentals

**season** - The season in which the rentals occurred

**yr** - The year the rentals occurred

**mnth** - The month the rentals occurred

**hr** - The hour the rentals occurred

**holiday** - Whether or not the day was a holiday

**weekday** - The day of the week (as a number, 0 to 6)

**workingday** - Whether or not the day was a working day

**weathersit** - The weather (as a categorical variable)

**temp** - The temperature, on a 0-1 scale

**atemp** - The adjusted ("feels-like") temperature, on a 0-1 scale

**hum** - The humidity, on a 0-1 scale

**windspeed** - The wind speed, on a 0-1 scale

**casual** - The number of casual riders (people who hadn't previously signed up with the bike sharing program)

**registered** - The number of registered riders (people who had already signed up)

**cnt** - The total number of bike rentals (casual + registered)
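The relationship between the last three columns can be checked directly. The following is a minimal sketch that rebuilds the rider counts from the five preview rows shown earlier as a small dataframe and confirms that `casual + registered` equals `cnt`. The names `sample` and `counts_add_up` are illustrative and not part of the original notebook; the same comparison could be run on the full `bike_rentals` dataframe.

```python
import pandas as pd

# Rider-count columns copied from the five preview rows of head().
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# True only if every row satisfies cnt == casual + registered.
counts_add_up = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())
print(counts_add_up)
```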

We'll try to predict the total number of bikes people rented in a given hour. In other words, we'll predict the "cnt" column from the other columns, excluding "casual" and "registered": those two columns add up to "cnt", so including them would give the answer away.
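One way to express this column selection in code is sketched below. The helper `select_predictors` and the small `example` dataframe are hypothetical stand-ins introduced for illustration; in the notebook, the helper would be applied to the full `bike_rentals` dataframe.

```python
import pandas as pd

def select_predictors(df, target="cnt", excluded=("casual", "registered")):
    """Return the column names usable as model inputs (hypothetical helper)."""
    skip = {target, *excluded}
    return [col for col in df.columns if col not in skip]

# Tiny stand-in frame holding a handful of the dataset's columns.
example = pd.DataFrame({
    "season": [1, 1],
    "hr": [0, 1],
    "temp": [0.24, 0.22],
    "casual": [3, 8],
    "registered": [13, 32],
    "cnt": [16, 40],
})

predictors = select_predictors(example)
print(predictors)
```

Note that "dteday" is stored as a date string, so it would likely need to be dropped or converted to a numeric form before being passed to a model.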

In [5]:
# Enable inline plotting before drawing any figures
%matplotlib inline

# Plot a histogram of the 'cnt' column to see how hourly rental counts are distributed
plt.hist(bike_rentals["cnt"])