<img src='../images/dsl-logo.png' width="40%" align="left" />
<img src='../images/hs-aalen-logo.png' width="40%" align="right" />

# Capital Bikeshare: Anlayse und Prognose der Ausleihvorgänge

### Ziel

Es sollen die Ausleihvoränge aus den Jahren 2015-2017 analysiert werden.

1. Auffälligkeiten: Explorative Datenanalyse und Visualisierung
2. Prognose der Ausleihvorgänge nach Wahl (eigenem Ermessen) insgesamt oder pro Station und pro Tag oder pro Stunde

## Herunterladen und Parsen der Orignaldaten (raw)

**Hinweis:** Die Notebooks sind so aufgebaut, dass sie zu einer Verarbeitungs-Pipeline gehören und in der Reihenfolge der Nummern (Prefixe) ausgeführt sollten, da spätere Notebooks (die mit einer größeren Anfangsnummer) Daten aus den vorherigen Notebooks verwenden. Nur Notebooks mit ganzen *10*er-Nummern gehören zur eigentlichen Verarbeitungs-Pipeline.

In [1]:
# Die Trip-Daten liegen in einem s3-Bucket - Verwendung einer speziellen Bibliothek (boto3)
# Sowie eine Bibliotheken (wget und requests) für das Herunteladen und Speichern von Dateien per www.
# Installieren falls nicht verfügbar!
!pip install boto3
!pip install wget
!pip install requests

Collecting boto3
  Downloading boto3-1.14.50-py2.py3-none-any.whl (129 kB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading s3transfer-0.3.3-py2.py3-none-any.whl (69 kB)
Collecting botocore<1.18.0,>=1.17.50
  Downloading botocore-1.17.50-py2.py3-none-any.whl (6.6 MB)
Collecting docutils<0.16,>=0.10
  Downloading docutils-0.15.2-py3-none-any.whl (547 kB)
Installing collected packages: jmespath, docutils, botocore, s3transfer, boto3
  Attempting uninstall: docutils
    Found existing installation: docutils 0.16
    Uninstalling docutils-0.16:
      Successfully uninstalled docutils-0.16
Successfully installed boto3-1.14.50 botocore-1.17.50 docutils-0.15.2 jmespath-0.10.0 s3transfer-0.3.3
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  C

In [2]:
import os
import re
import glob
import wget
import json
import requests
import datetime
import pandas as pd
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from zipfile import ZipFile

In [3]:
pd.__version__

'1.0.5'

### Herunterladen der Daten

Es wurden manuell Trip-Daten aus den Jahren 2015, 2016 und 2017 heruntergeladen!

In [4]:
RAW_DATA_PATH = '../data/raw/'
TRIP_DATA_PATH = RAW_DATA_PATH + 'tripdata/'
TRIP_ZIP_FILE_SUFFIX = 'tripdata.zip'
TRIP_DATA_CSV_FILE_PATTERN = '^[0-9]{4}.*tripdata\.csv$'
PARSED_DATA_PATH = '../data/'
RAW_TRIPS_FILE = 'trips_raw.pkl'
RAW_WEATHER_FILE = 'weather_raw.pkl'
ALT_WEATHER_FILE = 'weather_alt.pkl'
WEATHER_DATA_PATH = RAW_DATA_PATH + 'weather/'

In [5]:
URL_TEMPLATE_WEATHER_DATA = 'https://api.meteostat.net/v1/history/hourly?station=72405&start={}&end={}&time_zone=Europe/Berlin&time_format=Y-m-d%20H:i&key=xPVZEykm'

URL_ALT_WEATHER_DATA = 'https://open.meteostat.net/hourly/72405.csv.gz'
COLS_ALT_WEATHER_DATA = ['date', 'hour', 'temperature', 'dewpoint', 
                     'precipitation', 'precipitation_3', 'precipitation_6',
                     'snowdepth', 'windspeed', 'peakgust', 'winddirection', 'humidity', 'pressure', 'condition']

WEATHER_COLS_TO_DROP = ['precipitation_3', 'precipitation_6', 'snowdepth', 'peakgust', 'condition']


In [6]:
# Mapping auf praktische und einheitliche Namen der Merkmale
# Leerzeichen in Namen vermeiden, einheitliche Kleinschreibung wird hier verwendet

TRIP_COLS_NAME_MAP = {
        'Duration': 'duration',
        'Start date': 'start_ts',
        'End date': 'end_ts',
        'Start station number': 'start_station_id',
        'Start station': 'start_station_name',
        'End station number': 'end_station_id',
        'End station': 'end_station_name',
        'Bike number': 'bike_number',
        'Member': 'member'
    }


In [7]:
TRIP_DATA_PARSE_DATES = [1,2]

In [8]:
BUCKET_NAME = 'capitalbikeshare-data'

In [9]:
# Setze auf None, um alle verfügbaren Dateien herunterzuladen
TRIP_FILES_TO_LOAD = {
    '2015-capitalbikeshare-tripdata.zip',
    '2016-capitalbikeshare-tripdata.zip',
    '2017-capitalbikeshare-tripdata.zip'
}

In [10]:
# Lade alle (noch nicht geladenen) Trip-Data-Dateien aus dem s2 Bucket (BUCKET_NAME siehe oben)
def load_trip_data(delta_only=True, target_path=TRIP_DATA_PATH, trip_files_to_load=TRIP_FILES_TO_LOAD):
    # create target path if it does not exist
    if not os.path.exists(target_path):
        print('Creating dir', target_path, '...')
        os.makedirs(target_path)
    
    # init s3 bucket access
    s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
    bucket = s3.Bucket(BUCKET_NAME)
    

    # load files
    ignored = 0
    skipped = 0
    downloaded = 0    
    for obj in bucket.objects.all():
        source_file = obj.key
        target_file = target_path+source_file
        
        if (source_file.endswith(TRIP_ZIP_FILE_SUFFIX)):
            
            # process file only if no file_list specified or source file in file list
            if (trip_files_to_load is None or source_file in trip_files_to_load):            
                if (os.path.exists(target_file)):
                    #print('Skipping existing file', source_file, '...')
                    skipped += 1
                else:
                    print('Downloading', source_file, '...')
                    bucket.download_file(source_file, target_file)
                    downloaded += 1                    
                
            else: 
                ignored += 1
                
    print('tripdata-files ignored:', ignored)            
    print('tripdata-files skipped:', skipped)            
    print('tripdata-files downloaded:', downloaded)

In [15]:
# Lade die in den Trip-Data-Files enthaltenen csv-Dateien (teilweise mehrere) in eine Liste aus DataFrames
# konkateniere alle DataFrames zu einem und verwendete einheitliche Spaltennamen ohne Leerhzeichen
def read_raw_trip_data(target_path=TRIP_DATA_PATH, trip_files_to_load=TRIP_FILES_TO_LOAD, check_content=True):
    load_trip_data(target_path=target_path, trip_files_to_load=trip_files_to_load)
    df_list = []
    schema_index = 0
    # walk through all zip files
    for file in sorted(glob.glob(target_path+'*'+TRIP_ZIP_FILE_SUFFIX)):
        print('Unzipping', file)
        # a zip file may contain multiple csv files
        with ZipFile(file) as zfile:
            for name in zfile.namelist():
                if re.match(TRIP_DATA_CSV_FILE_PATTERN, name):
                    print('Reading file', name , '...')
                    
                    df_trips = pd.read_csv(zfile.open(name), parse_dates=TRIP_DATA_PARSE_DATES)
                    
                    df_trips.rename(columns=TRIP_COLS_NAME_MAP, inplace=True)
                                        
                    if check_content:
                        check_trip_data(df_trips)    
                                                
                    df_list.append(df_trips)
                                      

                else:
                    print('Skipping file', name, '!')
    # return list of trip-DataFrames (raw format)
    return df_list

In [16]:
def check_trip_data(df):
    print('trips:', df.shape[0], 
          '\tmin start date:', df['start_ts'].min(), 
          '\tmax start date: ', df['start_ts'].max())

In [17]:
def get_trip_data(target_path=TRIP_DATA_PATH, trip_files_to_load=TRIP_FILES_TO_LOAD):
    df_list = read_raw_trip_data(target_path=target_path, trip_files_to_load=trip_files_to_load)
    print('Concatenating', len(df_list), 'dataframes ...')
    # concat all individual dataframes            
    df_trips = pd.concat(df_list)
    print('Total trips combined:', df_trips.shape[0])
    print('Done.')
    return df_trips

In [None]:
df_trips_raw = get_trip_data()

Creating dir ../data/raw/tripdata/ ...
Downloading 2015-capitalbikeshare-tripdata.zip ...
Downloading 2016-capitalbikeshare-tripdata.zip ...
Downloading 2017-capitalbikeshare-tripdata.zip ...
tripdata-files ignored: 36
tripdata-files skipped: 0
tripdata-files downloaded: 3
Unzipping ../data/raw/tripdata\2015-capitalbikeshare-tripdata.zip
Reading file 2015Q1-capitalbikeshare-tripdata.csv ...
trips: 423719 	min start date: 2015-01-01 00:02:44 	max start date:  2015-03-31 23:59:52
Reading file 2015Q2-capitalbikeshare-tripdata.csv ...
trips: 999818 	min start date: 2015-04-01 00:02:23 	max start date:  2015-06-30 23:58:37
Reading file 2015Q3-capitalbikeshare-tripdata.csv ...
trips: 1056366 	min start date: 2015-07-01 00:00:25 	max start date:  2015-09-30 23:57:53
Reading file 2015Q4-capitalbikeshare-tripdata.csv ...
trips: 706003 	min start date: 2015-10-01 00:01:30 	max start date:  2015-12-31 23:57:57
Unzipping ../data/raw/tripdata\2016-capitalbikeshare-tripdata.zip
Reading file 2016Q1-c

In [19]:
df_trips_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10277677 entries, 0 to 815263
Data columns (total 9 columns):
 #   Column              Dtype         
---  ------              -----         
 0   duration            int64         
 1   start_ts            datetime64[ns]
 2   end_ts              datetime64[ns]
 3   start_station_id    int64         
 4   start_station_name  object        
 5   end_station_id      int64         
 6   end_station_name    object        
 7   bike_number         object        
 8   Member type         object        
dtypes: datetime64[ns](2), int64(3), object(4)
memory usage: 784.1+ MB


In [20]:
df_trips_raw.to_pickle(PARSED_DATA_PATH+RAW_TRIPS_FILE)

In [21]:
def load_weather_data_year(year, replace=False, target_path=WEATHER_DATA_PATH, url_template=URL_TEMPLATE_WEATHER_DATA):

    # create target path if it does not exist
    if not os.path.exists(target_path):
        print('Creating dir', target_path, '...')
        os.makedirs(target_path)

    target_file = target_path + 'weather_' + str(year) + '.json'
    
    if (replace & os.path.exists(target_file)):
        os.remove(target_file)
        
    if (not os.path.exists(target_file)):
        url = url_template.format(str(year)+'-01-01', str(year)+'-12-31')  
        print('Downloading',  url, '...')
        weather_json = requests.get(url).json()
        with open(target_path+'weather_'+str(year)+'.json', 'w') as outfile:
            json.dump(weather_json, outfile)
    else:
        print('File ', os.path.basename(target_file), 'already downloaded!')


    

In [22]:
def load_weather_data(year_start, year_end, replace=False, target_path=WEATHER_DATA_PATH, url_template=URL_TEMPLATE_WEATHER_DATA):
    for year in range(year_start, year_end+1):
        load_weather_data_year(year, replace=replace, target_path=target_path, url_template=url_template)


In [23]:
def parse_weather_data(file):
    data = json.load(open(file))
    df = pd.DataFrame(data['data'])
    df.drop(WEATHER_COLS_TO_DROP+['time'], axis=1, inplace=True)
    df.rename(columns={'time_local': 'time_ts'}, inplace=True)
    return df
    

In [24]:
def get_weather_data_for_trips(df_trips):
    
    target_path = WEATHER_DATA_PATH
    
    load_weather_data(
        year_start=df_trips['start_ts'].min().year, 
        year_end=df_trips['start_ts'].max().year,
        target_path=target_path)
    
    df_list = []
    # walk through all zip files
    for file in sorted(glob.glob(target_path+'weather*.json')):
        df_list.append(parse_weather_data(file))
        print('parsed', file, 'with', df_list[-1].shape[0], 'rows')
        
        
    df = pd.concat(df_list)    
    
    df['time_ts'] = pd.to_datetime(df['time_ts'])
    
    return df
    
    
    


In [25]:
df_weather = get_weather_data_for_trips(df_trips_raw)

Creating dir ../data/raw/weather/ ...
Downloading https://api.meteostat.net/v1/history/hourly?station=72405&start=2015-01-01&end=2015-12-31&time_zone=Europe/Berlin&time_format=Y-m-d%20H:i&key=xPVZEykm ...


ProxyError: HTTPSConnectionPool(host='api.meteostat.net', port=443): Max retries exceeded with url: /v1/history/hourly?station=72405&start=2015-01-01&end=2015-12-31&time_zone=Europe/Berlin&time_format=Y-m-d%20H:i&key=xPVZEykm (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))

In [22]:
df_weather.head()

Unnamed: 0,time_ts,temperature,dewpoint,humidity,precipitation,windspeed,winddirection,pressure
0,2015-01-01 00:00:00,2.2,-14.0,29.0,0.0,5.4,220.0,1026.7
1,2015-01-01 01:00:00,1.1,-12.3,36.0,,7.6,210.0,1026.5
2,2015-01-01 02:00:00,1.1,-11.0,40.0,0.0,5.4,230.0,1026.3
3,2015-01-01 03:00:00,0.6,-11.8,39.0,0.0,5.4,250.0,1025.6
4,2015-01-01 04:00:00,0.6,-11.2,41.0,0.0,9.4,170.0,1025.1


In [23]:
df_weather.to_pickle(PARSED_DATA_PATH+RAW_WEATHER_FILE)

In [24]:
df_weather.head()

Unnamed: 0,time_ts,temperature,dewpoint,humidity,precipitation,windspeed,winddirection,pressure
0,2015-01-01 00:00:00,2.2,-14.0,29.0,0.0,5.4,220.0,1026.7
1,2015-01-01 01:00:00,1.1,-12.3,36.0,,7.6,210.0,1026.5
2,2015-01-01 02:00:00,1.1,-11.0,40.0,0.0,5.4,230.0,1026.3
3,2015-01-01 03:00:00,0.6,-11.8,39.0,0.0,5.4,250.0,1025.6
4,2015-01-01 04:00:00,0.6,-11.2,41.0,0.0,9.4,170.0,1025.1


In [28]:
def get_alt_weather_data_for_trips(df_trips):
    df = pd.read_csv(URL_ALT_WEATHER_DATA, names=COLS_ALT_WEATHER_DATA, parse_dates=[0])
    df = df[(df.date >= df_trips['start_ts'].min()) & (df.date <= df_trips['start_ts'].max())]
    df.drop(WEATHER_COLS_TO_DROP, axis=1, inplace=True)
    df['hour'] = df['hour'].str[0:2].astype('int')
    return df.reset_index(drop=True)
    

In [29]:
df_weather_alt = get_alt_weather_data_for_trips(df_trips_raw)

URLError: <urlopen error Tunnel connection failed: 407 Proxy Authentication Required>

In [27]:
df_weather_alt.to_pickle(PARSED_DATA_PATH+ALT_WEATHER_FILE)

In [28]:
df_weather_alt.head()

Unnamed: 0,date,hour,temperature,dewpoint,precipitation,windspeed,winddirection,humidity,pressure
0,2015-01-02,0,5.0,-6.6,0.0,16.6,200.0,43.0,1018.7
1,2015-01-02,1,4.4,-6.8,0.0,16.6,200.0,44.0,1018.4
2,2015-01-02,2,3.3,-6.1,0.0,14.8,200.0,50.0,1018.5
3,2015-01-02,3,4.4,-5.7,0.0,14.8,210.0,48.0,1018.9
4,2015-01-02,4,4.4,-5.7,0.0,9.4,220.0,48.0,1019.0
