# Semester Project - Nextbike
## Task 1 - Exploration and Description

In [1]:
from nextbike.preprocessing import Preprocessing 
from nextbike.io import input

import numpy as np

In [2]:
df = input.__read_file()
df.head()

Unnamed: 0.1,Unnamed: 0,p_spot,p_place_type,datetime,b_number,trip,p_uid,p_bikes,p_lat,b_bike_type,p_name,p_number,p_lng,p_bike
0,0,True,0,2019-01-20 00:00:00,52073,first,7314560,1,51.506613,4,FH-Dortmund Sonnenstraße,7374,7.455587,False
1,1,True,0,2019-01-20 23:59:00,52073,last,7314560,1,51.506613,4,FH-Dortmund Sonnenstraße,7374,7.455587,False
2,2,True,0,2019-01-20 00:00:00,52331,first,113573,2,51.49269,4,Universität/S-Bahnhof,7362,7.417633,False
3,3,True,0,2019-01-20 23:59:00,52331,last,113573,4,51.49269,4,Universität/S-Bahnhof,7362,7.417633,False
4,4,True,0,2019-01-20 00:00:00,31346,first,113543,3,51.523351,4,Brackel Kirche,7337,7.546867,False


### a) The data set shows columns with prefixes p and b. What do you think do they represent? Also try to find good assumptions for the meanings of the columns.

The prefix "p" stands for the <i>position</i> and the prefix "b" describes features of the <i>bike</i> used.

###### Meanings of the columns

| Column      | Description          |
|-------------|----------------------|
|<i>p_spot</i>      |True if the position is an official station                   |
|<i>p_place_type</i>|                      |
|<i>datetime</i>    |Date and time of the start or end of a trip |
|<i>b_number</i>    |Bike ID                   |
|<i>trip</i>        |Values: ["first", "last", "start", "end"] <br> defines whether a row marks the start or the end of a trip|
|<i>p_uid</i>       |ID of the bike station / position                      |
|<i>p_bikes</i>     |Number of available bikes at the position                      |
|<i>p_lat</i>       |Latitude coordinate of the position                      |
|<i>b_bike_type</i>  |Type of the bike used                      |
|<i>p_name</i>      |Street or station name of the current position                      |
|<i>p_number</i>    |ID of the position / bike station                      |
|<i>p_lng</i>       |Longitude coordinate of the position                      |
|<i>p_bike</i>      |                      |




### b) The trip column in your data set shows different values. Explain why there are not only two. Are examples with certain values for trip more informative for the analysis of mobility patterns than others?


#### Analyse the trip column

In [3]:
df["trip"].unique()

array(['first', 'last', 'start', 'end'], dtype=object)

There are four different values in the trip column: [first, last, start, end]. 
At least two values are required to mark whether a row belongs to the starting point or the end point of a trip. This means that <b>one trip is represented by two successive rows</b> in the dataframe. One of the rows contains the values at the starting point (i.e. datetime, start position) and the other row contains the values at the end point of the trip. 

Let's take a closer look at the dataframe and the trip column.

In [4]:
# far more rows have the values "start" and "end" in the trip column than "first" and "last"
df["trip"].value_counts()

start    249536
end      242878
last      88710
first     88528
Name: trip, dtype: int64

In [5]:
df[(df["trip"] == "first") | (df["trip"] =="last")].head(50)

Unnamed: 0.1,Unnamed: 0,p_spot,p_place_type,datetime,b_number,trip,p_uid,p_bikes,p_lat,b_bike_type,p_name,p_number,p_lng,p_bike
0,0,True,0,2019-01-20 00:00:00,52073,first,7314560,1,51.506613,4,FH-Dortmund Sonnenstraße,7374,7.455587,False
1,1,True,0,2019-01-20 23:59:00,52073,last,7314560,1,51.506613,4,FH-Dortmund Sonnenstraße,7374,7.455587,False
2,2,True,0,2019-01-20 00:00:00,52331,first,113573,2,51.49269,4,Universität/S-Bahnhof,7362,7.417633,False
3,3,True,0,2019-01-20 23:59:00,52331,last,113573,4,51.49269,4,Universität/S-Bahnhof,7362,7.417633,False
4,4,True,0,2019-01-20 00:00:00,31346,first,113543,3,51.523351,4,Brackel Kirche,7337,7.546867,False
5,5,True,0,2019-01-20 23:59:00,31346,last,113543,4,51.523351,4,Brackel Kirche,7337,7.546867,False
6,6,True,0,2019-01-20 00:00:00,50641,first,113561,5,51.506312,4,Hainallee / Südbad,7351,7.470531,False
9,9,True,0,2019-01-20 23:59:00,50641,last,6260019,1,51.493966,4,TU Dortmund Emil-Figge-Straße 50,7367,7.418008,False
10,10,True,0,2019-01-20 00:00:00,53801,first,50383,3,51.513777,14,Westentor,7319,7.455849,False
11,11,True,0,2019-01-20 03:39:00,53801,last,50383,3,51.513777,14,Westentor,7319,7.455849,False


The filtered dataframe above shows that the rows with the values **first** and **last** in the trip column do not describe plausible trips. Read as trips, most of them would have an unlikely long duration: the "first" row is almost always timestamped 00:00 and the "last" row 23:59 of the same day. 
Furthermore, in most cases the start and end positions of such a "trip" are the same. 

These rows could be measurement errors or other data recording errors. <br> 
They can be disregarded in the next steps because they are not suitable for further analysis, especially for the prediction of trip durations. 
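Dropping these rows could look like the following minimal sketch. It runs on a small toy DataFrame that only mimics the raw columns `b_number`, `trip` and `datetime`; the variable names `raw` and `trips_only` are made up for illustration and are not part of the `nextbike` package.

```python
import pandas as pd

# Toy rows that mimic the raw data (only the columns needed here).
raw = pd.DataFrame({
    "b_number": [52073, 52073, 50641, 50641],
    "trip": ["first", "last", "start", "end"],
    "datetime": pd.to_datetime([
        "2019-01-20 00:00:00", "2019-01-20 23:59:00",
        "2019-01-20 16:22:00", "2019-01-20 17:00:00",
    ]),
})

# Keep only the rows that mark the start or the end of a trip.
trips_only = raw[raw["trip"].isin(["start", "end"])].reset_index(drop=True)
print(trips_only)
```

Using `isin` keeps the filter readable and makes it easy to extend the list of accepted values later.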

### c) Based on the given data, create a new DataFrame that stores (at least) the following trip information (“trip format”):
- Bike Number
- Start Time (Either as appropriate data type or as several columns from “Start Month” down to “Start Minute”)
- Weekend (binary)
- Start Position (Either as appropriate data type or as two columns for Longitude and Latitude),
- Duration
- End Time 
- End Position 

In [6]:
df = Preprocessing.get_trip_data()
df.head(5)

Unnamed: 0,Unnamed: 0_x,datetime_start,b_number,latitude_start,p_name_start,longitude_start,Unnamed: 0_y,datetime_end,latitude_end,p_name_end,longitude_end,trip_duration,coordinates_start,coordinates_end,distance,weekday,weekend,day,month,hour
0,7,2019-01-20 16:22:00,50641,51.506312,Hainallee / Südbad,7.470531,8,2019-01-20 17:00:00,51.493966,TU Dortmund Emil-Figge-Straße 50,7.418008,38,"(51.506311756219, 7.470531463623098)","(51.493965925874996, 7.4180084466934)",3.89729,6,1,20,1,16
1,13,2019-01-20 02:31:00,50425,51.517155,Hauptbahnhof/Bahnhofsvorplatz,7.459931,14,2019-01-20 02:43:00,51.513069,Unionstr.,7.448886,12,"(51.517155427985, 7.459931373596199)","(51.513069322724, 7.448886036872902)",0.891383,6,1,20,1,2
2,19,2019-01-20 11:32:00,53006,51.509557,Ritterhausstr.,7.446949,20,2019-01-20 13:33:00,51.517155,Hauptbahnhof/Bahnhofsvorplatz,7.459931,121,"(51.509557115819995, 7.4469494819641)","(51.517155427985, 7.459931373596199)",1.235649,6,1,20,1,11
3,21,2019-01-20 14:38:00,53006,51.517155,Hauptbahnhof/Bahnhofsvorplatz,7.459931,22,2019-01-20 14:53:00,51.500725,Polizeipräsidium,7.459819,15,"(51.517155427985, 7.459931373596199)","(51.500725323279, 7.4598187208176)",1.827997,6,1,20,1,14
4,23,2019-01-20 17:02:00,53006,51.500725,Polizeipräsidium,7.459819,24,2019-01-20 17:16:00,51.514029,Schwanenwall,7.47257,14,"(51.500725323279, 7.4598187208176)","(51.514028646499995, 7.472570284530001)",1.724677,6,1,20,1,17
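The implementation of `Preprocessing.get_trip_data` is not shown in this notebook. As a rough sketch of how such a trip format could be derived, the code below pairs every "start" row with the directly following row of the same bike, provided that row is an "end", and then computes duration, distance and the weekend flag. The helper names `haversine_km` and `build_trips` are hypothetical, and the haversine formula is an assumption: it may differ slightly from whatever method produced the `distance` column above.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(x) for x in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def build_trips(raw):
    """Hypothetical helper: pair a 'start' row with the next row of the
    same bike if that row is an 'end'."""
    rows = raw[raw["trip"].isin(["start", "end"])].copy()
    rows["datetime"] = pd.to_datetime(rows["datetime"])
    rows = rows.sort_values(["b_number", "datetime"])
    nxt = rows.groupby("b_number").shift(-1)  # following row of the same bike
    is_trip = (rows["trip"] == "start") & (nxt["trip"] == "end")
    start, end = rows[is_trip], nxt[is_trip]
    trips = pd.DataFrame({
        "b_number": start["b_number"],
        "datetime_start": start["datetime"],
        "datetime_end": end["datetime"],
        "latitude_start": start["p_lat"],
        "longitude_start": start["p_lng"],
        "latitude_end": end["p_lat"],
        "longitude_end": end["p_lng"],
    })
    delta = trips["datetime_end"] - trips["datetime_start"]
    trips["trip_duration"] = delta.dt.total_seconds() / 60  # minutes
    trips["distance"] = haversine_km(
        trips["latitude_start"], trips["longitude_start"],
        trips["latitude_end"], trips["longitude_end"],
    )
    trips["weekday"] = trips["datetime_start"].dt.weekday  # Monday=0 ... Sunday=6
    trips["weekend"] = (trips["weekday"] >= 5).astype(int)
    return trips.reset_index(drop=True)


# Toy input: one first/last pair, one complete trip, one start without an end.
raw = pd.DataFrame({
    "b_number": [52073, 52073, 50641, 50641, 53006],
    "trip": ["first", "last", "start", "end", "start"],
    "datetime": ["2019-01-20 00:00:00", "2019-01-20 23:59:00",
                 "2019-01-20 16:22:00", "2019-01-20 17:00:00",
                 "2019-01-20 11:32:00"],
    "p_lat": [51.506613, 51.506613, 51.506312, 51.493966, 51.509557],
    "p_lng": [7.455587, 7.455587, 7.470531, 7.418008, 7.446949],
})
trips = build_trips(raw)
print(trips)
```

In this sketch, a "start" row without a matching "end" row is dropped silently; whether that is desirable depends on how incomplete trips should be treated.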


#### Adding weather features

The following steps add three weather features to the final trip DataFrame. The source of the weather data is the "Deutscher Wetterdienst". [Here](https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/), you can download hourly weather data for several cities in Germany. 

We use the weather data for Waltrop because there is no official weather station directly in Dortmund, so no data for Dortmund itself is accessible. Waltrop is the closest city to Dortmund for which weather data can be accessed.
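One way to attach hourly weather values to trips is to floor each start time to the full hour and join on that key. The sketch below illustrates the idea on toy data. The weather column names (`hour_start`, `temperature`, `precipitation_mm`) are assumptions and do not match the raw DWD file layout, and the rule for the binary `precipitation` flag (1 if any precipitation was recorded) is likewise an assumption.

```python
import pandas as pd

# Hypothetical hourly weather table (not the raw DWD file layout).
weather = pd.DataFrame({
    "hour_start": pd.to_datetime(["2019-01-20 02:00", "2019-01-20 16:00"]),
    "temperature": [-5.8, 0.5],
    "precipitation_mm": [0.0, 0.0],
})

# Two toy trips with their start times.
trips = pd.DataFrame({
    "b_number": [50641, 50425],
    "datetime_start": pd.to_datetime(["2019-01-20 16:22", "2019-01-20 02:31"]),
})

# Floor each start time to the full hour and use it as the join key.
trips["hour_start"] = trips["datetime_start"].dt.floor("h")
trips_weather = trips.merge(weather, on="hour_start", how="left")

# Binary flag: 1 if any precipitation was recorded in that hour (assumed rule).
trips_weather["precipitation"] = (trips_weather["precipitation_mm"] > 0).astype(int)
print(trips_weather)
```

A left join keeps every trip even if no weather record exists for its hour; such trips would then carry missing weather values.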



In [7]:
df = Preprocessing.get_trip_data(withWeather=True)
df.head(10)

Unnamed: 0,Unnamed: 0_x,datetime_start,b_number,latitude_start,p_name_start,longitude_start,Unnamed: 0_y,datetime_end,latitude_end,p_name_end,...,coordinates_end,distance,weekday,weekend,day,month,hour,temperature °C,precipitation in mm,precipitation
0,7,2019-01-20 16:22:00,50641,51.506312,Hainallee / Südbad,7.470531,8,2019-01-20 17:00:00,51.493966,TU Dortmund Emil-Figge-Straße 50,...,"(51.493965925874996, 7.4180084466934)",3.89729,6,1,20,1,16,0.5,0.0,0
1,77,2019-01-20 16:42:00,53940,51.507457,Möllerbrücke,7.451364,78,2019-01-20 16:44:00,51.507457,Möllerbrücke,...,"(51.507457007245996, 7.4513643980026)",0.0,6,1,20,1,16,0.5,0.0,0
2,85,2019-01-20 16:53:00,50061,51.503293,Vinckeplatz,7.455822,86,2019-01-20 17:13:00,51.519332,Cinestar,...,"(51.519331863840996, 7.460124492645298)",1.809251,6,1,20,1,16,0.5,0.0,0
3,91,2019-01-20 16:35:00,51138,51.499039,Steigenberger Hotel / Berswordtstr.,7.451472,92,2019-01-20 16:37:00,51.499039,Steigenberger Hotel / Berswordtstr.,...,"(51.49903890740301, 7.4514716863632)",0.0,6,1,20,1,16,0.5,0.0,0
4,279,2019-01-20 16:43:00,53120,51.507457,Möllerbrücke,7.451364,280,2019-01-20 17:02:00,51.512836,Am Kaiserbrunnen,...,"(51.512835629153, 7.4822580814362)",2.226714,6,1,20,1,16,0.5,0.0,0
5,363,2019-01-20 16:24:00,53096,51.500675,Kuithanstr.,7.440834,364,2019-01-20 16:40:00,51.510311,Hiltropwall,...,"(51.5103107749, 7.46223704786)",1.832379,6,1,20,1,16,0.5,0.0,0
6,403,2019-01-20 16:05:00,52040,51.502318,Kreuzstraße,7.450029,404,2019-01-20 17:51:00,51.513069,Unionstr.,...,"(51.513069322724, 7.448886036872902)",1.198778,6,1,20,1,16,0.5,0.0,0
7,475,2019-01-20 16:05:00,51426,51.502318,Kreuzstraße,7.450029,476,2019-01-20 17:40:00,51.502318,Kreuzstraße,...,"(51.5023181776, 7.4500286579132)",0.0,6,1,20,1,16,0.5,0.0,0
8,13,2019-01-20 02:31:00,50425,51.517155,Hauptbahnhof/Bahnhofsvorplatz,7.459931,14,2019-01-20 02:43:00,51.513069,Unionstr.,...,"(51.513069322724, 7.448886036872902)",0.891383,6,1,20,1,2,-5.8,0.0,0
9,319,2019-01-20 02:00:00,53171,51.500675,Kuithanstr.,7.440834,320,2019-01-20 07:15:00,51.500675,Kuithanstr.,...,"(51.500675232617, 7.4408340454102015)",0.0,6,1,20,1,2,-5.8,0.0,0


In [8]:
Preprocessing.get_write_trip_data(df)

Transformed trip data for Dortmund successfully saved in a csv file!


### d) Calculate the aggregate statistics (i.e., mean and standard deviation) for the trip duration per month, per day of week, and per hour of day. Are there visible differences between weekdays and weekends?

(The differences between weekdays and weekends will be shown in Task 2 by visualizing the data)
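For a direct numerical comparison, mean and standard deviation can also be computed in a single call with `agg`. The sketch below uses a small made-up DataFrame with the column names `weekend` and `trip_duration` from the trip format; the numbers are illustrative only and are not results from the Nextbike data.

```python
import pandas as pd

# Made-up durations in minutes; weekend is 1 for Saturday/Sunday, else 0.
trips = pd.DataFrame({
    "weekend": [0, 0, 0, 1, 1, 1],
    "trip_duration": [10, 20, 30, 20, 40, 60],
})

# Mean, standard deviation and number of trips per group in one table.
stats = trips.groupby("weekend")["trip_duration"].agg(["mean", "std", "count"])
print(stats)
```

Including `count` helps to judge how much weight each mean deserves, since small groups produce noisy averages.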

#### Calculating aggregate statistics per month, per day of week and per hour of day

##### Statistics per month

In [9]:
# "July" is left out of this array because month 7 does not occur in the data
month_by_name = np.array(["January", "February", "March", "April", "May", "June", "August", "September", "October", "November", "December"])

# Means per month
# (select the column before aggregating so that non-numeric columns are not averaged)
df.groupby('month')[["trip_duration"]].mean().set_index(keys=month_by_name)

Unnamed: 0,trip_duration
January,65.135989
February,35.401047
March,41.186327
April,44.510447
May,52.121733
June,42.560769
August,34.758468
September,20.710711
October,18.898973
November,25.229027
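Relabelling the index with a hand-written array only works as long as the array lists exactly the months present in the data, in the same order. A less fragile variant, sketched here on made-up numbers, maps each month number to its name via `calendar.month_name` from the standard library. The names follow the active locale, which is English unless the locale has been changed.

```python
import calendar
import pandas as pd

# Made-up durations; month 7 is absent on purpose, as in the real data.
trips = pd.DataFrame({
    "month": [1, 1, 6, 8, 8],
    "trip_duration": [60, 70, 40, 30, 40],
})

monthly = trips.groupby("month")[["trip_duration"]].mean()
# Translate month numbers to names; missing months are simply skipped.
monthly.index = monthly.index.map(lambda m: calendar.month_name[m])
print(monthly)
```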


In [10]:
# Means per month
# distinguish between weekend and workday
df.groupby(['weekend', 'month'])[["trip_duration"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,trip_duration
weekend,month,Unnamed: 2_level_1
0,1,61.92291
0,2,30.756774
0,3,35.526872
0,4,42.583561
0,5,50.622423
0,6,41.404373
0,8,34.727259
0,9,22.114532
0,10,20.034604
0,11,28.723922


In [11]:
# Standard deviation per month
df.groupby('month')[["trip_duration"]].std().set_index(keys=month_by_name)

Unnamed: 0,trip_duration
January,131.724449
February,94.22861
March,98.509464
April,99.339453
May,110.840998
June,96.293009
August,79.853703
September,53.338906
October,48.72748
November,66.714969


In [12]:
# Standard deviation per month
# distinguish between weekend and workday
df.groupby(['weekend', 'month'])[["trip_duration"]].std()

Unnamed: 0_level_0,Unnamed: 1_level_0,trip_duration
weekend,month,Unnamed: 2_level_1
0,1,124.840611
0,2,83.261928
0,3,86.525411
0,4,94.782378
0,5,106.765659
0,6,90.561902
0,8,79.176275
0,9,56.939429
0,10,50.598669
0,11,73.673695


##### Statistics per day of week

In [13]:
# Means per day of week
weekday_by_name = np.array(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
df.groupby('weekday')[["trip_duration"]].mean().set_index(weekday_by_name)

Unnamed: 0,trip_duration
Monday,30.053161
Tuesday,36.014902
Wednesday,33.014809
Thursday,31.970065
Friday,33.90298
Saturday,37.342071
Sunday,32.080302


In [14]:
# Standard deviation 
df[["weekday", "trip_duration"]].groupby("weekday").std().set_index(weekday_by_name)

Unnamed: 0,trip_duration
Monday,73.128124
Tuesday,86.036889
Wednesday,81.896887
Thursday,80.431986
Friday,80.763024
Saturday,94.889149
Sunday,90.43124


##### Statistics per hour of day

In [15]:
# Means per hour
df.groupby('hour')[["trip_duration"]].mean()

Unnamed: 0_level_0,trip_duration
hour,Unnamed: 1_level_1
0,58.854846
1,55.360782
2,33.281008
3,23.742657
4,54.533423
5,65.690177
6,58.079338
7,57.050767
8,45.014704
9,42.288665


In [16]:
# Means per hour 
# distinguish between weekend and workday
df.groupby(['weekend', 'hour'])[["trip_duration"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,trip_duration
weekend,hour,Unnamed: 2_level_1
0,0,54.622517
0,1,59.419125
0,2,21.60593
0,3,20.695751
0,4,61.61264
0,5,73.246589
0,6,56.794803
0,7,57.252132
0,8,45.370666
0,9,39.186665


In [17]:
# Standard deviation per hour
df[["hour", "trip_duration"]].groupby("hour").std()

Unnamed: 0_level_0,trip_duration
hour,Unnamed: 1_level_1
0,166.089569
1,145.765061
2,122.470896
3,90.472204
4,148.854879
5,135.961196
6,125.960485
7,118.082078
8,97.729403
9,90.016248


In [18]:
# Standard deviation per hour
# distinguish between weekend and workday
df.groupby(['weekend', 'hour'])[["trip_duration"]].std()
