## 1. Age

According to the Santander Cycles survey for Q2 2017/2018, their main target group consists of people **aged 16 to 54**.<br>
They divide their customers into the following age groups:

![](img/target.png)

As the plot shows, they have more male than female users, which is why we give males a higher weight than females.<br>
Moreover, we should weight the individual age groups differently.

Population data for 2015-2019 was found at the [London Datastore](https://data.london.gov.uk/dataset/projections/).

After a first round of data preparation, we have the age groups broken down by London district:


In [1]:
import pandas as pd
import numpy as np

In [2]:
age = pd.read_csv("age.csv")
age.head(100)

Unnamed: 0,District,Ages sum,16-24M,16-24F,25-34M,25-34F,35-44M,35-44F,45-54M,45-54F,55-64M,55-64F,>65M,>65F,Year,Sum Male
0,Camden,199263,15875,16521,26876,25634,19397,18384,14351,13960,10219,9862,12465,15719,2015,0
1,Camden,203057,16277,15983,27264,25554,20025,18558,15062,14466,10602,10203,12900,16163,2016,0
2,Camden,206309,16426,16255,27676,25880,20385,18586,15292,14733,10852,10484,13246,16494,2017,0
3,Camden,209229,16693,16564,28119,25886,20434,18603,15522,15033,11122,10786,13635,16832,2018,0
4,Camden,211902,16799,16948,28477,25915,20688,18411,15501,15356,11539,11096,13962,17210,2019,81465
5,City of London,8146,287,339,950,553,454,374,714,462,2405,382,618,608,2015,0
6,City of London,8427,334,384,967,576,481,346,724,465,2506,373,641,630,2016,0
7,City of London,8514,332,359,974,592,495,359,746,448,2547,376,635,651,2017,0
8,City of London,8870,363,352,1021,655,534,400,744,434,2662,389,650,666,2018,0
9,City of London,9177,375,367,1073,702,567,424,732,434,2747,413,665,678,2019,2747
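The weighting idea described above could be sketched as follows. This is only a minimal illustration: the two rows are the 2019 values from the table, restricted to the 16-54 columns, while the weights (1.2 for male columns, 1.0 for female columns) and the column name `weighted_target_pop` are placeholder assumptions, not values taken from the survey.

```python
import pandas as pd

# 2019 rows for two districts, copied from the table above (16-54 columns only).
age_sample = pd.DataFrame({
    "District": ["Camden", "City of London"],
    "16-24M": [16799, 375], "16-24F": [16948, 367],
    "25-34M": [28477, 1073], "25-34F": [25915, 702],
    "35-44M": [20688, 567], "35-44F": [18411, 424],
    "45-54M": [15501, 732], "45-54F": [15356, 434],
})

# Placeholder weights (assumption, NOT survey values): male columns count slightly more.
group_cols = [c for c in age_sample.columns if c != "District"]
weights = pd.Series({c: 1.2 if c.endswith("M") else 1.0 for c in group_cols})

# Multiply each age/gender column by its weight and sum per district.
age_sample["weighted_target_pop"] = age_sample[group_cols].mul(weights).sum(axis=1)
print(age_sample[["District", "weighted_target_pop"]])
```

Per-age-group weights could be added the same way by putting different values into `weights` once the survey shares are known.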


In [26]:
# Percentage population growth up to 2019, measured from the second recorded
# year (array[1], i.e. 2016 in the rows shown) and relative to the 2019 value

dico = {}
for index, row in age.iterrows():
    array = dico.get(row["District"],[])
    array.append(row["Ages sum"])
    dico[row["District"]] = array

for district,array in dico.items():
    print(district, "\t\t", int(100*(array[-1] - array[1])/array[-1]))

Camden 		 4
City of London 		 8
Hackney 		 4
Hammersmith and Fulham 		 3
Haringey 		 2
Islington 		 3
Kensington and Chelsea 		 2
Lambeth 		 3
Lewisham 		 3
Newham 		 5
Southwark 		 4
Tower Hamlets 		 5
Wandsworth 		 3
Westminster 		 4
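The cell above compares `array[1]` with `array[-1]` and divides by the last value. As an alternative, here is a hedged sketch that measures growth against the first year (2015) using `groupby`. The four rows are the 2015 and 2019 totals from the table above, and the names `pop` and `growth_pct` are illustrative only.

```python
import pandas as pd

# 2015 and 2019 totals for two districts, copied from the table above.
pop = pd.DataFrame({
    "District": ["Camden", "Camden", "City of London", "City of London"],
    "Year": [2015, 2019, 2015, 2019],
    "Ages sum": [199263, 211902, 8146, 9177],
})

# Sort by year so first()/last() pick the earliest and latest value per district.
by_district = pop.sort_values("Year").groupby("District")["Ages sum"]
growth_pct = 100 * (by_district.last() - by_district.first()) / by_district.first()
print(growth_pct.round(1))
```

Measured this way, the percentages are larger than the ones printed above, because the baseline is both earlier and smaller.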


In [32]:
# Get coordinate of each district:

from NominatimLibrary import Locator
locator = Locator()

district_coords = {}

def locate(locator, district):
    """Return the cached coordinates of a district, querying the locator on a cache miss."""
    elm = district_coords.get(district)
    if elm is None:
        try:
            district_coords[district] = locator.get_coordinates(f"{district}, London, UK")
            return district_coords[district]
        except Exception as e:
            print("Could not locate: ", e)
            return None
    else:
        return elm

# Called only for its side effect of filling district_coords
age["District"].map(lambda x: locate(locator, x))
    
for i,j in district_coords.items():
    print(i,j)

Camden (51.5423045, -0.1395604)
City of London (51.5156177, -0.0919983)
Hackney (51.5432402, -0.0493621)
Hammersmith and Fulham (51.4920377, -0.2236401)
Haringey (51.58792985, -0.10541010599099)
Islington (51.5384287, -0.0999051)
Kensington and Chelsea (51.4989948, -0.1991229)
Lambeth (51.5013012, -0.117287)
Lewisham (51.4624325, -0.0101331)
Newham (51.52999955, 0.0293179602938221)
Southwark (51.5029222, -0.103458)
Tower Hamlets (51.49595675, -0.011744492532098)
Wandsworth (51.4570271, -0.1932607)
Westminster (51.4973206, -0.137149)


In [10]:
stations = pd.read_csv("../raw/rental_stations_saved.csv")
stations.head()

Unnamed: 0,name,id,lat,lon,capacity
0,"River Street , Clerkenwell",1,51.529163,-0.109971,19
1,"Phillimore Gardens, Kensington",2,51.499607,-0.197574,37
2,"Christopher Street, Liverpool Street",3,51.521284,-0.084606,32
3,"St. Chad's Street, King's Cross",4,51.530059,-0.120974,23
4,"Sedding Street, Sloane Square",5,51.49313,-0.156876,27


In [40]:
# Get closest district for each station:

closest_col = []
for index, row in stations.iterrows():
    lat, lon = row.lat, row.lon
    closest_district = None
    closest_dist = float("inf")
    for district, coords in district_coords.items():
        dist = locator.distance_crow_coords(coords, (lat, lon))
        if closest_district is None or dist < closest_dist:
            closest_dist = dist
            closest_district = district
    closest_col.append(closest_district)

stations["closest_district"] = closest_col
stations.to_csv("../raw/rental_stations_district.csv")
stations.head()

Unnamed: 0,name,id,lat,lon,capacity,closest_district
0,"River Street , Clerkenwell",1,51.529163,-0.109971,19,Islington
1,"Phillimore Gardens, Kensington",2,51.499607,-0.197574,37,Kensington and Chelsea
2,"Christopher Street, Liverpool Street",3,51.521284,-0.084606,32,City of London
3,"St. Chad's Street, King's Cross",4,51.530059,-0.120974,23,Islington
4,"Sedding Street, Sloane Square",5,51.49313,-0.156876,27,Westminster
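`NominatimLibrary` is a project-local module, so its `distance_crow_coords` is not shown here. Assuming it returns a straight-line ("as the crow flies") distance, the nearest-district lookup can be reproduced with a plain haversine formula. The function `haversine_km` below is a stand-in written for illustration, not the library's implementation; the coordinates are copied from the outputs above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius

# Three district centres from the geocoding output above.
district_coords = {
    "Camden": (51.5423045, -0.1395604),
    "City of London": (51.5156177, -0.0919983),
    "Islington": (51.5384287, -0.0999051),
}

# Station "River Street , Clerkenwell" from the stations table.
station = (51.529163, -0.109971)

# Pick the district whose centre is closest to the station.
nearest = min(district_coords, key=lambda d: haversine_km(district_coords[d], station))
print(nearest)
```

Note that this assigns a station to the nearest district centre point, which is not necessarily the district the station actually lies in.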


The most used stations according to age data could be in: <br>
* Newham
* Southwark
* Tower Hamlets
* Wandsworth

### To Do:
* add datetime (maybe take every day of the year?)
* assign age data to stations

## 3. Earnings
According to the Santander Cycles survey for Q2 2017/2018, their main target group has an annual income between **20 k and more than 70 k**.<br>
On the one hand, there are **casual users** with an annual income of **20-40 k**.<br>
On the other hand, there are **members** with an average annual income of **40-75+ k**.

![](img/income.png)

<br><br>
Earnings for 2015-2018 were extracted from the [Office for National Statistics](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2016provisionalresults).


In [50]:
earnings = pd.read_csv('earnings.csv')
earnings

Unnamed: 0,District,2015,2016,2017,2018
0,Camden,42161.6,43165.2,42406.0,45084.0
1,City of London,60351.2,62696.4,63642.8,68151.2
2,Hackney,36550.8,38854.4,39249.6,39124.8
3,Hammersmith and Fulham,36831.6,38818.0,42692.0,43097.6
4,Haringey,34008.0,32656.0,34470.8,35058.4
5,Islington,45099.6,47762.0,48984.0,48656.4
6,Kensington and Chelsea,35952.8,34039.2,38303.2,39993.2
7,Lambeth,41574.0,39301.6,40211.6,41646.8
8,Lewisham,33649.2,34403.2,34637.2,33919.6
9,Newham,32541.6,34377.2,36878.4,36296.0


The most used stations according to **income** data could be in: 

* City of London (Members)
* Tower Hamlets (Members)
* Westminster (Members)
* Haringey (Casual Users)
* Lewisham (Casual Users)
* Newham (Casual Users) 
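One possible way to map districts onto the two customer segments is to bin the earnings with the income bands quoted from the survey (20-40 k for casual users, above 40 k for members). The sketch below uses four 2018 values from the table above; treating 40 k as a hard boundary is an assumption made for illustration.

```python
import pandas as pd

# 2018 earnings for four districts, copied from the table above.
earnings_2018 = pd.Series(
    {"Camden": 45084.0, "City of London": 68151.2, "Haringey": 35058.4, "Lewisham": 33919.6},
    name="2018",
)

# Bands taken from the survey text; the hard cut at 40 k is an assumption.
segments = pd.cut(
    earnings_2018,
    bins=[20_000, 40_000, float("inf")],
    labels=["Casual Users", "Members"],
)
print(segments)
```

This labels every district, whereas the list above names only a few of them, so it is a starting point for the selection rather than a replacement.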


### To Do:
* add datetime to data (maybe the date of each day for one year?)
* assign earnings data to stations

## 4. Political data
Election results from 2016 were extracted from the [London Datastore](https://data.london.gov.uk/elections/).


**Conservatives**:

* low-emission buses
* cars and vans to be zero-emission by 2050
* plant a million trees in towns and cities to improve air
* 25-year environment plan

**Labour**:

* Clean Air Act to deal with illegal air quality
* safeguard habitats and species in the blue belts of seas and oceans
* ban on fracking
* plant a million trees
* ensure that 60% of the UK’s energy comes from zero-carbon or renewable sources by 2030

**Liberal Democrats**:

* charge on disposable coffee cups to reduce waste
* diesel scrappage scheme, and a ban on the sale of diesel cars and small vans in the UK by 2025
* extend ultra-low emission zones to 10 more towns and cities
* Zero Carbon Britain Act to set new targets to reduce net greenhouse gas emissions by 80% by 2040 and to zero by 2050

**Green Party**:

* new Environmental Protection Act and a new environmental regulator and court
* end plastic waste by introducing a bottle deposit scheme
* a Clean Air Act
* end the reliance on fossil fuels with a ban on fracking and pledge to bring forward the coal phase-out by two years to 2023
* scrap plans for all new nuclear power stations

**UK Independence Party**:

* cancellation of any state financing of climate protection

***

-> Labour and Green Party voters probably place a higher focus on environmental protection<br>
-> Liberal Democrat and Conservative voters are probably interested in environmental protection among other things<br>
-> UK Independence Party voters probably do not care about environmental protection
<br><br>
-> Districts with **many** votes for the **Green Party** & **Labour Party** and **few** votes for the **UK Independence Party** probably have a higher use of rental bikes. 
<br><br>Reference: [The Guardian](https://www.theguardian.com/environment/2017/may/21/how-do-the-four-main-parties-compare-on-the-environment)

In [28]:
political = pd.read_csv('political.csv')
political

Unnamed: 0,District,% Con,% Lab,% Lib Dem,% Green,% UKIP
0,City of London,40.415105,37.647737,6.630153,7.004901,2.075526
1,Camden,27.28413,51.053337,5.12985,8.448074,2.03398
2,Hackney,12.093692,66.721942,2.896786,10.193162,1.382321
3,Hammersmith and Fulham,41.089339,39.768094,4.316411,5.916356,2.282932
4,Haringey,17.677652,60.070296,5.243635,8.783056,1.295669
5,Islington,16.990974,60.377152,4.623599,8.741961,2.371882
6,Kensington and Chelsea,55.599084,28.015772,4.075299,4.533198,1.836683
7,Lambeth,21.426381,56.114814,5.388028,9.172221,1.449483
8,Lewisham,19.395787,56.829223,4.828802,8.96861,2.947496
9,Newham,17.423339,65.307076,2.291541,4.47361,2.558457


The most used stations according to political data could be in: <br>
* Hackney
* Haringey
* Lambeth 
* Lewisham 
* Islington
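The rule of thumb above (many Green and Labour votes, few UKIP votes) could be turned into a simple ranking. The score below, Labour share plus Green share minus UKIP share, is an assumed heuristic chosen for illustration and not a method taken from the data source. The four rows are copied from the table above, and `env_score` is a hypothetical column name.

```python
import pandas as pd

# Vote shares for four districts, copied from the table above.
political_sample = pd.DataFrame({
    "District": ["Camden", "Hackney", "Kensington and Chelsea", "Lambeth"],
    "% Lab": [51.053337, 66.721942, 28.015772, 56.114814],
    "% Green": [8.448074, 10.193162, 4.533198, 9.172221],
    "% UKIP": [2.03398, 1.382321, 1.836683, 1.449483],
})

# Assumed heuristic: reward Labour and Green shares, penalise the UKIP share.
political_sample["env_score"] = (
    political_sample["% Lab"] + political_sample["% Green"] - political_sample["% UKIP"]
)

ranked = political_sample.sort_values("env_score", ascending=False)
print(ranked[["District", "env_score"]])
```

Because the Labour share dominates the sum, districts with a high Labour but low Green share also rank highly; the terms could be weighted differently if the Green vote should matter more.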


### To Do:
* add datetime to data (maybe the date of each day for the year?)
* assign political interests data to stations

# Adding extra features to base features dataset

In [None]:
BASE_FEATURES = "./all_stations_events.csv"
STATIONS_WITH_DISTRICT = "../raw/rental_stations_district.csv"
AGE_DATA = "./age.csv"

stations = pd.read_csv(STATIONS_WITH_DISTRICT)
features = pd.read_csv(BASE_FEATURES)
age = pd.read_csv(AGE_DATA)

features = features.merge(stations[["id","closest_district"]], how="left", left_on="Station ID", right_on="id")
features.head()

In [None]:
features = features.merge(age, how="left", left_on=["closest_district","Year"], right_on=["District","Year"])
features.head()
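A left merge silently leaves `NaN` values where a station's district or year has no counterpart in the age data. The sketch below shows one way to check this, using the `validate` and `indicator` parameters of `DataFrame.merge`. The frames `features_toy` and `age_toy` are small stand-ins invented for the example (the district "Elsewhere" is deliberately unmatched); only the two `Ages sum` values come from the 2017 rows shown earlier.

```python
import pandas as pd

# Toy stand-in for the features table; "Elsewhere" has no age data on purpose.
features_toy = pd.DataFrame({
    "Station ID": [1, 2, 3],
    "closest_district": ["Camden", "City of London", "Elsewhere"],
    "Year": [2017, 2017, 2017],
})

# Toy stand-in for the age table (2017 totals from the table shown earlier).
age_toy = pd.DataFrame({
    "District": ["Camden", "City of London"],
    "Year": [2017, 2017],
    "Ages sum": [206309, 8514],
})

merged = features_toy.merge(
    age_toy,
    how="left",
    left_on=["closest_district", "Year"],
    right_on=["District", "Year"],
    validate="many_to_one",  # raises if (District, Year) is duplicated in age_toy
    indicator=True,          # adds a "_merge" column describing each row's origin
)

# Rows that found no matching (District, Year) pair in the age data.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Station ID", "closest_district", "Year"]])
```

Rows marked `left_only` would point to a district name mismatch or to a year outside the 2015-2019 range covered by the age data.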