## 1. Age

According to the Santander Cycles Survey of Q2 2017/2018, their main target group consists of people **aged 16 to 54**.<br>
They divide their customers into the following age groups:

![](img/target.png)

As this plot shows, they have more male users than female users, which is why we give males a higher weight than females.<br>
Moreover, we should weight the individual age groups differently.

Population data for 2015-2019 comes from the [London Datastore](https://data.london.gov.uk/dataset/projections/).

After a first preparation pass, we have the age groups per London district and year:


In [3]:
import pandas as pd
import numpy as np

In [3]:
age = pd.read_csv("age.csv")
age.head(10)

Unnamed: 0,District,Ages sum,16-24M,16-24F,25-34M,25-34F,35-44M,35-44F,45-54M,45-54F,55-64M,55-64F,>65M,>65F,Year,Sum Male
0,Camden,199263,15875,16521,26876,25634,19397,18384,14351,13960,10219,9862,12465,15719,2015,0
1,Camden,203057,16277,15983,27264,25554,20025,18558,15062,14466,10602,10203,12900,16163,2016,0
2,Camden,206309,16426,16255,27676,25880,20385,18586,15292,14733,10852,10484,13246,16494,2017,0
3,Camden,209229,16693,16564,28119,25886,20434,18603,15522,15033,11122,10786,13635,16832,2018,0
4,Camden,211902,16799,16948,28477,25915,20688,18411,15501,15356,11539,11096,13962,17210,2019,81465
5,City of London,8146,287,339,950,553,454,374,714,462,2405,382,618,608,2015,0
6,City of London,8427,334,384,967,576,481,346,724,465,2506,373,641,630,2016,0
7,City of London,8514,332,359,974,592,495,359,746,448,2547,376,635,651,2017,0
8,City of London,8870,363,352,1021,655,534,400,744,434,2662,389,650,666,2018,0
9,City of London,9177,375,367,1073,702,567,424,732,434,2747,413,665,678,2019,2747


The most used stations according to age data could be in: <br>
* Newham
* Southwark
* Tower Hamlets
* Wandsworth
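
The weighting described above could be sketched as follows. This is only an illustration: the values in `AGE_WEIGHTS` and `GENDER_WEIGHTS` are placeholder assumptions, not figures from the survey, and `weighted_age_score` is a hypothetical helper. The two sample rows are the Camden figures for 2018 and 2019 from the table above.

```python
import pandas as pd

# Placeholder weights (assumptions, not survey values): the target ages 16-54
# count fully, older groups count less, and males count more than females.
AGE_WEIGHTS = {"16-24": 1.0, "25-34": 1.0, "35-44": 1.0, "45-54": 1.0, "55-64": 0.5, ">65": 0.25}
GENDER_WEIGHTS = {"M": 1.0, "F": 0.5}

def weighted_age_score(df):
    """Return one weighted population score per row of `df`."""
    score = pd.Series(0.0, index=df.index)
    for group, age_weight in AGE_WEIGHTS.items():
        for gender, gender_weight in GENDER_WEIGHTS.items():
            score += df[group + gender] * age_weight * gender_weight
    return score

# Camden rows for 2018 and 2019, copied from the table above
sample = pd.DataFrame({
    "District": ["Camden", "Camden"],
    "Year": [2018, 2019],
    "16-24M": [16693, 16799], "16-24F": [16564, 16948],
    "25-34M": [28119, 28477], "25-34F": [25886, 25915],
    "35-44M": [20434, 20688], "35-44F": [18603, 18411],
    "45-54M": [15522, 15501], "45-54F": [15033, 15356],
    "55-64M": [11122, 11539], "55-64F": [10786, 11096],
    ">65M": [13635, 13962], ">65F": [16832, 17210],
})
sample["score"] = weighted_age_score(sample)
print(sample[["District", "Year", "score"]])
```

Districts can then be ranked by this score per year. The ranking depends entirely on the chosen weights, so they should be tuned against the survey plot.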

### To Do:
* add datetime (maybe take every day of the year?)
* assign age data to stations
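
One possible approach to the first To Do item is to repeat each yearly row once per day of its year. This is a sketch under assumptions: `expand_to_days` is a hypothetical helper, the `Date` column name is made up, and the sample keeps only a few columns of the table above.

```python
import pandas as pd

def expand_to_days(df, year_col="Year"):
    """Repeat each yearly row once per calendar day of its year."""
    parts = []
    for _, row in df.iterrows():
        year = int(row[year_col])
        days = pd.date_range(start=f"{year}-01-01", end=f"{year}-12-31", freq="D")
        # One copy of the yearly values for every day of that year
        part = pd.DataFrame([row.to_dict()] * len(days))
        part["Date"] = days
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

yearly = pd.DataFrame({
    "District": ["Camden", "Camden"],
    "Ages sum": [199263, 203057],
    "Year": [2015, 2016],
})
daily = expand_to_days(yearly)
print(daily.head())
```

The daily table can then be joined with the rental data on the date.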

## 2. Events
Events for 2015-2019 were extracted from [London Events](http://www.londontown.com/London/London-Events-2019).<br>


**Assumption**: The district with the highest event density has the highest frequency of bike rentals.
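
The event records carry a postcode rather than a district name, so a rough proxy for event density is the number of events per outward postcode (the part before the space). The sketch below assumes this proxy. `events_per_area` is a hypothetical helper and the postcodes are a handful of examples from the event data.

```python
import pandas as pd

def events_per_area(events):
    """Count events per outward postcode (the part before the space)."""
    outward = events["postcode"].str.split().str[0]
    return outward.value_counts()

events = pd.DataFrame({
    "postcode": ["KT8 9AU", "SE1 8XX", "SW1E 5JA", "TW9 1QJ", "NW1 3BF", "SE1 6HZ"],
})
counts = events_per_area(events)
print(counts)
```

Mapping outward postcodes to districts would still need a lookup table, which is not part of this notebook.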


In [1]:
import json
import pandas as pd

def load_events(year):
    """Load the scraped events of one year and tag each row with that year."""
    with open(f'eventsdata/events-{year}.json', 'r') as f:
        data = json.load(f)
    df_raw = pd.DataFrame(data)
    # Year is stored as a string, and the unused columns are dropped
    return df_raw.assign(Year=str(year)).drop(['locality', 'url'], axis=1)

# Data from 2015 to 2019
df1, df2, df3, df4, df5 = (load_events(year) for year in range(2015, 2020))

In [2]:
from NominatimLibrary import Locator
locator = Locator()

In [3]:
# Merge data
frames = [df1, df2, df3,df4,df5]
final = pd.concat(frames)
f = final.assign(Month='0')
f.head()

Unnamed: 0,dates,postcode,street,title,Year,Month
0,05th - 06th June 2018,KT8 9AU,"East Molesey, Surrey",\nLionel Richie,2015,0
1,08th - 09th March 2019,SE1 8XX,Belvedere Road,\nWomen of the World Festival,2015,0
2,02nd - 14th March 2015,SW1E 5JA,12 Palace Street,\nRuby Wax: Sane New World,2015,0
3,03rd - 07th March 2015,TW9 1QJ,"The Green, Surrey",\nHarvey,2015,0
4,03rd - 14th March 2015,NW1 3BF,"15-16 Triton Street, Regents Place",\nCitizen Puppet,2015,0


In [11]:
geonames = pd.read_csv(
    "geoname_postcodes_world.txt",
    sep="\t",
    # Recent pandas versions reject duplicate column names, so the six
    # unused columns get distinct placeholder names.
    names=["country", "postcode", "city", "unused_1", "unused_2", "unused_3",
           "unused_4", "unused_5", "unused_6", "lat", "lon", "accuracy"],
)

dico = {}  # cache: postcode -> (lat, lon)

def locate(cache, locator, x):
    """Return (lat, lon) for postcode x, storing every lookup in `cache`."""
    if cache.get(x) is None:
        try:
            cache[x] = locator.get_coordinates(x + ", England, GB")
        except Exception:
            # If it was not found, we use only the first three characters:
            dpt = x[:3]
            geo_coords = geonames[(geonames["country"] == "GB") & (geonames["postcode"] == dpt)][["lat", "lon"]]
            # Try in geonames:
            if len(geo_coords) > 0:
                cache[x] = (geo_coords.iloc[0].lat, geo_coords.iloc[0].lon)
            else:
                # Finally, try Nominatim with the first three characters:
                try:
                    cache[x] = locator.get_coordinates(dpt + ", England, GB")
                except Exception as e:
                    print("Could not find :", e)
                    if dpt == "XX1":
                        print("Note: it is normal the postcode doesn't exist, it is 'secret cinema' at an unknown location.")
                    # If nothing is found at all, we take a point very far away:
                    cache[x] = (0, 0)
    return cache[x]

final["coords"] = final["postcode"].map(lambda x: locate(dico, locator, x))
final.head()

XX1, England, GB


Unnamed: 0,dates,postcode,street,title,Year,coords
0,05th - 06th June 2018,KT8 9AU,"East Molesey, Surrey",\nLionel Richie,2015,"(51.3143224, -0.231344251983845)"
1,08th - 09th March 2019,SE1 8XX,Belvedere Road,\nWomen of the World Festival,2015,"(51.4976029, -0.0812871)"
2,02nd - 14th March 2015,SW1E 5JA,12 Palace Street,\nRuby Wax: Sane New World,2015,"(51.51301285, -0.0434090325310393)"
3,03rd - 07th March 2015,TW9 1QJ,"The Green, Surrey",\nHarvey,2015,"(51.577252, -0.3714965)"
4,03rd - 14th March 2015,NW1 3BF,"15-16 Triton Street, Regents Place",\nCitizen Puppet,2015,"(51.51301285, -0.0434090325310393)"


Until now the parsing had only been run on a subset. Running it on the whole dataset shows that the `dates` column comes in many different formats:
- "day month year"
- "dow day month year"
- "day - day month year"
- "dow day - dow day month year"
- "day month - day month year"
- "day month to day month year"
- "Various Venues"
- "day and day month year"
- "day, day and day month year"
- "day, day, day and day month year"
- **In fact, almost any combination of these occurs!**

- An odd one: `15th May - 05th March 2016`

In [95]:
import re
from datetime import date, timedelta

SINGLE_DATE = "^[A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
TWO_DATES = "^[A-Za-z ]*([0-9]{1,2})[^ ]* and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
THREE_DATES = "^[A-Za-z ]*([0-9]{1,2})[^ ]*, [A-Za-z ]*([0-9]{1,2})[^ ]* and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
FOUR_DATES = "^[A-Za-z ]*([0-9]{1,2})[^ ]*, [A-Za-z ]*([0-9]{1,2})[^ ]*, [A-Za-z ]*([0-9]{1,2})[^ ]* and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"

TWO_MONTHS = "^[A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
THREE_MONTHS = "^[A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+), [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
FOUR_MONTHS = "^[A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+), [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+), [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) and [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"

# Ranges: "day - day month year", "day month - day month year" and "day month year - day month year"
DOUBLE_DAYS = "^([0-9]{1,2})[^ ]* (-|to) [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
DOUBLE_MONTHS = "^([0-9]{1,2})[^ ]* ([A-Za-z]+) (-|to) [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"
DOUBLE_YEARS = "^([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}) (-|to) [A-Za-z ]*([0-9]{1,2})[^ ]* ([A-Za-z]+) ([0-9]{4}).*"

MONTHS = [
    "January",
    "February",
    "March",
    "April",
    "May",
    "June",
    "July",
    "August",
    "September",
    "October",
    "November",
    "December",
]

def get_month_id(month):
    # Compare only the first three letters, so abbreviated month names match too.
    for (i, m) in enumerate(MONTHS):
        if m[:3] == month[:3]:
            return i + 1
    return -1  # unknown month name; date() will then raise a ValueError

def make_date(match, year_id, month_id, day_id):
    # Build a date from the regex groups that hold the year, the month name and the day.
    return date(int(match.group(year_id)), get_month_id(match.group(month_id)), int(match.group(day_id)))

# Range formats: (pattern, (year, month, day) group ids of the start, same for the end)
RANGE_PATTERNS = [
    (DOUBLE_YEARS, (3, 2, 1), (7, 6, 5)),
    (DOUBLE_MONTHS, (6, 2, 1), (6, 5, 4)),
    (DOUBLE_DAYS, (5, 4, 1), (5, 4, 3)),
]

# Enumerations of single days: (pattern, one (year, month, day) group id triple per day)
LIST_PATTERNS = [
    (SINGLE_DATE, [(3, 2, 1)]),
    (TWO_DATES, [(4, 3, 1), (4, 3, 2)]),
    (THREE_DATES, [(5, 4, 1), (5, 4, 2), (5, 4, 3)]),
    (FOUR_DATES, [(6, 5, 1), (6, 5, 2), (6, 5, 3), (6, 5, 4)]),
    (TWO_MONTHS, [(5, 2, 1), (5, 4, 3)]),
    (THREE_MONTHS, [(7, 2, 1), (7, 4, 3), (7, 6, 5)]),
    (FOUR_MONTHS, [(9, 2, 1), (9, 4, 3), (9, 6, 5), (9, 8, 7)]),
]

def format_date(date_range):
    """Expand a raw date string into the list of days it covers."""
    if date_range == "Various Venues":
        return []

    res = []
    start = end = None

    # Try the range formats first; a string can match at most one of them.
    for pattern, start_ids, end_ids in RANGE_PATTERNS:
        match = re.match(pattern, date_range)
        if match is not None:
            start = make_date(match, *start_ids)
            end = make_date(match, *end_ids)
            break

    if start is not None:
        # "day0 month0 - day1 month1 year" with month0 after month1: the start lies in the previous year.
        if start > end:
            start = start.replace(year=start.year - 1)
        while start <= end:
            res.append(start)
            start = start + timedelta(1)

    # If it is not a range, try the enumerations, from one day up to four:
    if not res:
        for pattern, groups in LIST_PATTERNS:
            match = re.match(pattern, date_range)
            if match is not None:
                res = [make_date(match, *ids) for ids in groups]
                break

    if not res:
        print("Couldn't match: " + date_range)

    return res

final["dates"].map(format_date)
#format_date("15th May - 05th March 2016")

Couldn't match: Saturday 31st August and Sunday 1st September 2019
Couldn't match: 31st May, 1st June and 2nd June 2019
Couldn't match: Saturday 31st August and Sunday 1st September 2019


0                               [2018-06-05, 2018-06-06]
1                               [2019-03-08, 2019-03-09]
2      [2015-03-02, 2015-03-03, 2015-03-04, 2015-03-0...
3      [2015-03-03, 2015-03-04, 2015-03-05, 2015-03-0...
4      [2015-03-03, 2015-03-04, 2015-03-05, 2015-03-0...
5      [2015-03-04, 2015-03-05, 2015-03-06, 2015-03-0...
6      [2018-09-06, 2018-09-07, 2018-09-08, 2018-09-0...
7      [2015-03-04, 2015-03-05, 2015-03-06, 2015-03-0...
8      [2015-03-04, 2015-03-05, 2015-03-06, 2015-03-0...
9      [2015-03-05, 2015-03-06, 2015-03-07, 2015-03-0...
10     [2019-03-06, 2019-03-07, 2019-03-08, 2019-03-0...
11     [2015-03-05, 2015-03-06, 2015-03-07, 2015-03-0...
12                              [2015-03-05, 2015-03-06]
13     [2015-03-05, 2015-03-06, 2015-03-07, 2015-03-0...
14                              [2016-03-24, 2016-03-25]
15     [2015-03-06, 2015-03-07, 2015-03-08, 2015-03-0...
16                                          [2016-03-06]
17                             

In [71]:
# Assign Month (if a string mentions several months, the latest one in the calendar wins)
month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
for month_id, month_name in enumerate(month_names, start=1):
    f.loc[f['dates'].str.contains(month_name), 'Month'] = month_id
# Assign Year (if a string mentions several years, the latest one wins)
for year in range(2015, 2020):
    f.loc[f['dates'].str.contains(str(year)), 'Year'] = year

In [7]:
# Cut days
dates = f['dates']  # not named `date`, which would shadow datetime.date
days = []
dash = '-'

for index, row in dates.items():  # Series.iteritems() was removed in pandas 2.0
    if dash in row and len(row) < 28:
        # "DDth - DDth month year": keep the two-digit start day and end day
        a = row[0:2]
        b = row[7:9]
        days.append(a + b)
    elif dash in row and len(row) > 28:
        # Longer ranges such as "DDth month - DDth month year": keep the first 16 characters
        days.append(row[:16])
    else:
        # Single date (or a string of exactly 28 characters): keep the two-digit day
        days.append(row[:2])

f = f.assign(Days=days)
len(days)

1817

In [21]:
events = pd.read_csv('eventsPrepared.csv')
events.head(75)

Unnamed: 0,dates,postcode,street,title,Year,Month,Days,Duration
0,05th - 06th June 2018,KT8 9AU,"East Molesey, Surrey",\nLionel Richie,2018,6,5,2.0
1,08th - 09th March 2019,SE1 8XX,Belvedere Road,\nWomen of the World Festival,2019,3,8,2.0
2,03rd - 07th March 2015,TW9 1QJ,"The Green, Surrey",\nHarvey,2015,3,3,1.0
3,03rd - 14th March 2015,NW1 3BF,"15-16 Triton Street, Regents Place",\nCitizen Puppet,2015,3,14,12.0
4,04th - 28th March 2015,EC2Y 8DS,Silk Street,\nAntigone,2015,3,28,26.0
5,06th - 15th September 2018,SW11 5TN,Lavender Hill,\nMissing,2018,9,15,10.0
6,04th March - 31st May 2015,WC2N 5DN,Trafalgar Square,\nInventing Impressionism,2015,3,4,89.0
7,04th March - 06th September 2015,SW3 4SQ,"Duke of York's Square, King's Road",\nPangaea II: New Art from Africa and Latin Am...,2015,3,4,188.0
8,05th March - 31st August 2015,SE1 6HZ,"Lambeth Road, Elephant and Castle",\nFashion on the Ration,2015,8,5,180.0
9,06th - 21st March 2019,EC2M 4WY,9 Devonshire Square,\nHouse of Holi,2019,3,6,6.0




### To Do:
* extract the days from the date strings
* calculate duration?
* find coordinates for the postcodes
* assign events to stations
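
For the last To Do item, one option is to assign each event to the station with the smallest great-circle distance. The sketch below uses the haversine formula. The station names and coordinates are made up for illustration, and `haversine_km` and `nearest_station` are hypothetical helpers, not part of this notebook.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius

def nearest_station(event_coords, stations):
    """Return the name of the station closest to the event."""
    return min(stations, key=lambda name: haversine_km(event_coords, stations[name]))

# Hypothetical stations with made-up coordinates, for illustration only
stations = {
    "Station A": (51.50, -0.11),
    "Station B": (51.53, -0.14),
}
# Coordinates found above for the postcode SE1 8XX
print(nearest_station((51.4976029, -0.0812871), stations))
```

Events that fell back to the placeholder coordinates `(0, 0)` should be filtered out first, otherwise they are assigned to an arbitrary station.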

## 3. Earnings
According to the Santander Cycles Survey of Q2 2017/2018, their main target group has an annual income between **20 k and more than 70 k**.<br>
On the one hand, they have **casual users** with an annual income of **20-40 k**.<br>
On the other hand, they have **members** with an average annual income of **40 - 75+ k**.

![](img/income.png)

<br><br>
Earnings for 2015-2018 were extracted from the [Office for National Statistics](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2016provisionalresults).<br>


In [50]:
earnings = pd.read_csv('earnings.csv')
earnings

Unnamed: 0,District,2015,2016,2017,2018
0,Camden,42161.6,43165.2,42406.0,45084.0
1,City of London,60351.2,62696.4,63642.8,68151.2
2,Hackney,36550.8,38854.4,39249.6,39124.8
3,Hammersmith and Fulham,36831.6,38818.0,42692.0,43097.6
4,Haringey,34008.0,32656.0,34470.8,35058.4
5,Islington,45099.6,47762.0,48984.0,48656.4
6,Kensington and Chelsea,35952.8,34039.2,38303.2,39993.2
7,Lambeth,41574.0,39301.6,40211.6,41646.8
8,Lewisham,33649.2,34403.2,34637.2,33919.6
9,Newham,32541.6,34377.2,36878.4,36296.0


The most used stations according to **income** data could be in: 

* City of London (Members)
* Tower Hamlets (Members)
* Westminster (Members)
* Haringey (Casual Users)
* Lewisham (Casual Users)
* Newham (Casual Users) 
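
The two income bands from the survey can be turned into a simple label per district. This is a minimal sketch. `income_segment` is a hypothetical helper, and placing the boundary at exactly 40 k is an assumption. The sample values are the 2018 earnings of five districts from the table above.

```python
import pandas as pd

def income_segment(income):
    """Map an annual income to the user segment of the survey."""
    if 20_000 <= income < 40_000:
        return "Casual Users"
    if income >= 40_000:
        return "Members"
    return "Below target range"

earnings = pd.DataFrame({
    "District": ["Camden", "City of London", "Haringey", "Lewisham", "Newham"],
    "2018": [45084.0, 68151.2, 35058.4, 33919.6, 36296.0],
})
earnings["Segment"] = earnings["2018"].map(income_segment)
print(earnings)
```

The label only says which band a district falls into. It does not rank districts within a band, which the list above does.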


### To Do:
* add datetime to data (maybe date of each day for one year?)
* assign earnings data to stations

## 4. Political data
Election results from 2016 were extracted from the [London Datastore](https://data.london.gov.uk/elections/).<br>


**Conservatives**:

* low-emission buses
* all cars and vans to be zero-emission by 2050
* plant a million trees in towns and cities to improve air quality
* 25-year environment plan

**Labour**:

* Clean Air Act to deal with illegal air quality
* safeguard habitats and species in the blue belts of seas and oceans
* ban on fracking
* plant a million trees
* ensure that 60% of the UK’s energy comes from zero-carbon or renewable sources by 2030

**Liberal Democrats**:

* charge on disposable coffee cups to reduce waste
* diesel scrappage scheme, and a ban on the sale of diesel cars and small vans in the UK by 2025
* extend ultra-low emission zones to 10 more towns and cities
* Zero Carbon Britain Act to set new targets to reduce net greenhouse gas emissions by 80% by 2040 and to zero by 2050

**Green Party**:

* new Environmental Protection Act and a new environmental regulator and court
* end plastic waste by introducing a bottle deposit scheme
* a Clean Air Act
* end the reliance on fossil fuels with a ban on fracking and a pledge to bring forward the coal phase-out by two years to 2023
* scrap plans for all new nuclear power stations

**UK Independence Party**:

* cancellation of any state financing of climate protection

***

-> Labour & Green Party probably have voters with a stronger focus on environmental protection<br>
-> Liberal Democrats & Conservatives probably have voters who are, among other things, interested in environmental protection<br>
-> UK Independence Party probably has voters who do not care about environmental protection
<br><br>
-> Districts with **many** votes for the **Green Party** & **Labour Party** and **few** votes for the **UK Independence Party** probably have a higher use of rental bikes.
<br><br>Reference: [The Guardian](https://www.theguardian.com/environment/2017/may/21/how-do-the-four-main-parties-compare-on-the-environment)

In [28]:
political = pd.read_csv('political.csv')
political

Unnamed: 0,District,% Con,% Lab,% Lib Dem,% Green,% UKIP
0,City of London,40.415105,37.647737,6.630153,7.004901,2.075526
1,Camden,27.28413,51.053337,5.12985,8.448074,2.03398
2,Hackney,12.093692,66.721942,2.896786,10.193162,1.382321
3,Hammersmith and Fulham,41.089339,39.768094,4.316411,5.916356,2.282932
4,Haringey,17.677652,60.070296,5.243635,8.783056,1.295669
5,Islington,16.990974,60.377152,4.623599,8.741961,2.371882
6,Kensington and Chelsea,55.599084,28.015772,4.075299,4.533198,1.836683
7,Lambeth,21.426381,56.114814,5.388028,9.172221,1.449483
8,Lewisham,19.395787,56.829223,4.828802,8.96861,2.947496
9,Newham,17.423339,65.307076,2.291541,4.47361,2.558457


The most used stations according to political data could be in: <br>
* Hackney
* Haringey
* Lambeth 
* Lewisham 
* Islington
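
The reasoning above can be condensed into one number per district, for example the Labour share plus the Green share minus the UKIP share. This is only one possible scoring and not necessarily how the list above was derived. `environment_score` is a hypothetical helper, and the sample rows are four districts from the table above.

```python
import pandas as pd

def environment_score(df):
    """Labour share plus Green share minus UKIP share, in percentage points."""
    return df["% Lab"] + df["% Green"] - df["% UKIP"]

political = pd.DataFrame({
    "District": ["City of London", "Hackney", "Kensington and Chelsea", "Lambeth"],
    "% Lab": [37.647737, 66.721942, 28.015772, 56.114814],
    "% Green": [7.004901, 10.193162, 4.533198, 9.172221],
    "% UKIP": [2.075526, 1.382321, 1.836683, 1.449483],
})
political["score"] = environment_score(political)
ranking = political.sort_values("score", ascending=False)
print(ranking[["District", "score"]])
```

Because the Labour share is far larger than the other two, it dominates this score. Weighting the three shares differently would change the ranking.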


### To Do:
* add datetime to data (maybe date of each day for the year?)
* assign political interests data to stations