# Data modelling on Bike Sharing data in London

Data acquired from: https://www.kaggle.com/edenau/london-bike-sharing-system-data  
Data project initiated: 25/01/2019  
Author: Sedar Olmez

Data modelling:  
    GDS1: Data Gathering, Preparation and Exploration.  
    GDS2: Data Representation and Transformation.  
    GDS3: Computing with Data.   
    GDS4: Data Visualisation and Presentation.   
    GDS5: Data Modelling.   
    GDS6: Science about Data Science. 

Assessment:
![Assessment](assessment.png)

In [47]:
# Libraries
from __future__ import print_function
import matplotlib.pyplot as plt
import seaborn as sea
import pandas as pd
import numpy as np
from datetime import date
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor as DTR

## Old dataset

In [48]:
%%time
# Load the Walking-Cycling.csv dataset into a DataFrame
dataframe_journeys = pd.read_csv('data/Walking-Cycling.csv')
dataframe_journeys.columns = ['LA code', 'Local Authority', 'Year', 'Frequency', 'Walking_%', 'Cycling_%']
# dropna() returns a new DataFrame, so assign the result to keep it
dataframe_journeys = dataframe_journeys.dropna()

CPU times: user 7.02 ms, sys: 2.5 ms, total: 9.52 ms
Wall time: 7.79 ms


In [49]:
dataframe_journeys.head()

| | LA code | Local Authority | Year | Frequency | Walking_% | Cycling_% |
|---|---|---|---|---|---|---|
| 0 | E09000001 | City of London | 2010/11 | 1x per month | 78 | 30 |
| 1 | E09000002 | Barking and Dagenham | 2010/11 | 1x per month | 60 | 8 |
| 2 | E09000003 | Barnet | 2010/11 | 1x per month | 65 | 10 |
| 3 | E09000004 | Bexley | 2010/11 | 1x per month | 65 | 11 |
| 4 | E09000005 | Brent | 2010/11 | 1x per month | 62 | 14 |


In [50]:
list(dataframe_journeys)

['LA code', 'Local Authority', 'Year', 'Frequency', 'Walking_%', 'Cycling_%']

In [51]:
# We will now convert the Local Authority column to string
dataframe_journeys['Local Authority'] = dataframe_journeys['Local Authority'].astype('|S')

In [52]:
dataframe_journeys.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204 entries, 0 to 1203
Data columns (total 6 columns):
LA code            1204 non-null object
Local Authority    1204 non-null object
Year               1204 non-null object
Frequency          1204 non-null object
Walking_%          1204 non-null int64
Cycling_%          1204 non-null object
dtypes: int64(1), object(5)
memory usage: 56.5+ KB


#### pandas stores Python strings under the `object` dtype, so `info()` still reports `object` for Local Authority. The `'|S'` type code requests a byte string sized to the longest value stored in the column; it was applied with

```.astype('|S')```
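To illustrate what the `'|S'` type code means, here is a minimal sketch using a plain NumPy array rather than the notebook's DataFrame. The two authority names are taken from the data; the variable names are only for illustration. NumPy picks the width from the longest value, and on Python 3 the elements become `bytes` rather than `str`.

```python
import numpy as np

# Two authority names from the dataset, used only as sample values
names = np.array(['Barnet', 'City of London'])

# '|S' asks for a byte-string dtype; NumPy sizes it to the longest value
as_bytes = names.astype('|S')

print(as_bytes.dtype)  # fixed-width byte-string dtype
print(as_bytes[0])     # a bytes value on Python 3
```

If ordinary text strings are wanted on Python 3, `.astype(str)` is the more usual choice.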

In [56]:
# Convert both Walking_% and Cycling_% to float so that they can be used as numeric inputs for regression.
# Cycling_% is stored as strings, so first try converting it to int.
dataframe_journeys['Cycling_%'].astype(int)

ValueError: invalid literal for long() with base 10: '-'
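One way to see which entries block the conversion, without overwriting the column, is to compare the original values against a coerced copy. The sketch below uses a small made-up Series in place of `dataframe_journeys['Cycling_%']`; the sample values are assumptions chosen to mimic the problem.

```python
import pandas as pd

# Made-up sample standing in for the Cycling_% column
cycling = pd.Series(['30', '8', '-', '11'], name='Cycling_%')

# Values that cannot be parsed become NaN in the coerced copy
converted = pd.to_numeric(cycling, errors='coerce')

# Keep the original entries that were present but failed to parse
bad = cycling[converted.isna() & cycling.notna()]
print(bad)
```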

In [57]:
# An invalid literal for long() error is thrown: at least one value in the column cannot be parsed as an integer.
# Coerce the column to numeric instead, so that unparseable values become NaN.
dataframe_journeys['Cycling_%'] = pd.to_numeric(dataframe_journeys['Cycling_%'], errors='coerce')

In [59]:
# Select the rows that are now NaN: row 326 held '-' in Cycling_%, which the previous cell replaced with NaN.
print(dataframe_journeys[dataframe_journeys['Cycling_%'].isnull()])

       LA code Local Authority     Year    Frequency  Walking_%  Cycling_%
326  E09000026       Redbridge  2011/12  5x per week         17        NaN
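Since the only unparseable entry was `'-'`, the same result could probably be obtained at load time by telling `read_csv` to treat that marker as missing. This is a sketch on an in-memory sample, not the original file; the two rows are copied from the outputs shown in this notebook, and it assumes `'-'` is the only placeholder used.

```python
import io
import pandas as pd

# Two rows copied from the notebook output, held in memory for the example
csv_text = """LA code,Local Authority,Year,Frequency,Walking_%,Cycling_%
E09000003,Barnet,2010/11,1x per month,65,10
E09000026,Redbridge,2011/12,5x per week,17,-
"""

# na_values makes read_csv parse '-' as NaN, so the column is numeric from the start
sample = pd.read_csv(io.StringIO(csv_text), na_values=['-'])
print(sample.dtypes)
```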


In [60]:
# The 'Cycling_%' column is now float64 rather than int, because NaN cannot be stored in an int64 column; only 'Walking_%' is left to convert.
dataframe_journeys.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204 entries, 0 to 1203
Data columns (total 6 columns):
LA code            1204 non-null object
Local Authority    1204 non-null object
Year               1204 non-null object
Frequency          1204 non-null object
Walking_%          1204 non-null int64
Cycling_%          1203 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 56.5+ KB
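The column ends up as float64 because NumPy's int64 has no representation for NaN. If whole-number storage were preferred, the nullable `Int64` extension dtype is one option. It is available in pandas 0.24 and later, which may be newer than the version used in this notebook. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up sample standing in for the Cycling_% column
raw = pd.Series(['30', '8', '-'], name='Cycling_%')

# Coercion yields floats because NaN forces a floating-point dtype
as_float = pd.to_numeric(raw, errors='coerce')

# The nullable integer dtype keeps whole numbers and marks the gap as missing
as_nullable_int = as_float.astype('Int64')

print(as_float.dtype)
print(as_nullable_int.dtype)
```

For the regression work planned here, float64 is likely the simpler choice, since scikit-learn estimators expect floating-point input.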


In [61]:
# Here we convert Walking_% to float from int
dataframe_journeys['Walking_%'] = dataframe_journeys['Walking_%'].astype(float)

In [62]:
# As can be seen, we converted the columns to floats.
dataframe_journeys.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204 entries, 0 to 1203
Data columns (total 6 columns):
LA code            1204 non-null object
Local Authority    1204 non-null object
Year               1204 non-null object
Frequency          1204 non-null object
Walking_%          1204 non-null float64
Cycling_%          1203 non-null float64
dtypes: float64(2), object(4)
memory usage: 56.5+ KB


In [63]:
dataframe_journeys.head()

| | LA code | Local Authority | Year | Frequency | Walking_% | Cycling_% |
|---|---|---|---|---|---|---|
| 0 | E09000001 | City of London | 2010/11 | 1x per month | 78.0 | 30.0 |
| 1 | E09000002 | Barking and Dagenham | 2010/11 | 1x per month | 60.0 | 8.0 |
| 2 | E09000003 | Barnet | 2010/11 | 1x per month | 65.0 | 10.0 |
| 3 | E09000004 | Bexley | 2010/11 | 1x per month | 65.0 | 11.0 |
| 4 | E09000005 | Brent | 2010/11 | 1x per month | 62.0 | 14.0 |
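The conversion steps above could be gathered into one helper and checked in a single pass. The function name `to_float_columns` and the two-row sample are hypothetical and serve only to illustrate the idea; this is a sketch, not the notebook's actual code.

```python
import pandas as pd

def to_float_columns(frame, columns):
    """Return a copy of frame with the given columns coerced to float64.

    Values that cannot be parsed as numbers become NaN.
    """
    cleaned = frame.copy()
    for column in columns:
        cleaned[column] = pd.to_numeric(cleaned[column], errors='coerce').astype(float)
    return cleaned

# Hypothetical two-row sample mimicking the dataset's layout
sample = pd.DataFrame({
    'Local Authority': ['Barnet', 'Redbridge'],
    'Walking_%': [65, 17],
    'Cycling_%': ['10', '-'],
})

cleaned = to_float_columns(sample, ['Walking_%', 'Cycling_%'])
print(cleaned.dtypes)        # dtypes after conversion
print(cleaned.isna().sum())  # missing values per column
```

Working on a copy leaves the input untouched, which makes it easier to compare the data before and after cleaning.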


## Dataset cleaned and optimised