# Semester Project - Nextbike
## Task 1 - Exploration and Description

In [None]:
!pip3 install -e ..

In [None]:
from nextbike.preprocessing import Preprocessing 
from nextbike.io import input, output

import numpy as np

In [None]:
df = input.read_file()
df.head()

### a) The data set shows columns with prefixes p and b. What do you think they represent? Also try to find good assumptions for the meanings of the columns.

The prefix "p" stands for the <i>position</i>, and the prefix "b" marks the features of the <i>bike</i> used.

###### Meanings of the columns

| Column      | Description          |
|-------------|----------------------|
|<i>p_spot</i>       |True if the position is an official station |
|<i>p_place_type</i> |                      |
|<i>datetime</i>     |Date and time of the start or end of a trip |
|<i>b_number</i>     |Bike ID |
|<i>trip</i>         |One of ["first", "last", "start", "end"] <br> indicates whether the row marks the start or the end of a trip |
|<i>p_uid</i>        |ID of the bike station / position |
|<i>p_bikes</i>      |Number of available bikes at the position |
|<i>p_lat</i>        |Latitude coordinate of the position |
|<i>b_bike_type</i>  |Type of the bike used |
|<i>p_name</i>       |Street or station name of the current position |
|<i>p_number</i>     |ID of the position / bike station |
|<i>p_lng</i>        |Longitude coordinate of the position |
|<i>p_bike</i>       |                      |




### b) The trip column in your data set shows different values. Explain why there are not only two. Are examples with certain values for trip more informative for the analysis of mobility patterns than others?


#### Analyse the trip column

In [None]:
df["trip"].unique()

There are four different values in the trip column: first, last, start, and end.
At least two values are needed to mark whether a row belongs to the start or to the end of a trip. This means that <b>one trip is represented by two successive rows</b> in the DataFrame. One row contains the values at the starting point (e.g. datetime, start position), and the other row contains the values at the end point of the trip.

Let's take a closer look at the DataFrame and the trip column.

In [None]:
# far more rows have the values "start" and "end" in the trip column
df["trip"].value_counts()

In [None]:
df[(df["trip"] == "first") | (df["trip"] =="last")].head(50)

The filtered DataFrame above makes clear that the rows with the values **first** and **last** in the trip column are not meaningful as trips. Most of them imply an implausibly long trip duration: the start time is almost always 00:00 and the end time is 23:59.
Furthermore, the start and end positions of such a trip are the same.

These could be measurement errors or other data recording errors. <br>
These rows can be disregarded in the next steps because they are not suitable for further analysis, especially for the prediction of trip durations.
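
Dropping these rows can be sketched as follows. This is a minimal illustration on made-up toy data, not the code used inside the `nextbike` package; only the column names `b_number`, `trip` and `datetime` are taken from the table above, and `trips_only` is a hypothetical variable name.

```python
import pandas as pd

# Toy data (values are invented); column names follow the raw data set.
raw = pd.DataFrame({
    "b_number": [1, 1, 1, 1],
    "trip": ["first", "start", "end", "last"],
    "datetime": pd.to_datetime([
        "2019-06-01 00:00", "2019-06-01 08:00",
        "2019-06-01 08:15", "2019-06-01 23:59",
    ]),
})

# Keep only the rows that mark the start or the end of an actual trip.
trips_only = raw[raw["trip"].isin(["start", "end"])].reset_index(drop=True)
print(trips_only)
```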

### c) Based on the given data, create a new DataFrame that stores (at least) the following trip information (“trip format”):
- Bike Number
- Start Time (Either as appropriate data type or as several columns from “Start Month” down to “Start Minute”)
- Weekend (binary)
- Start Position (Either as appropriate data type or as two columns for Longitude and Latitude),
- Duration
- End Time 
- End Position 

In [None]:
df = Preprocessing.get_trip_data()
df.head(5)
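
The internals of `Preprocessing.get_trip_data()` are not shown in this notebook. As a rough sketch of how the trip format can be derived, the example below pairs each "start" row with the following "end" row of the same bike. It assumes that start and end rows strictly alternate per bike, uses invented toy data, and measures the duration in minutes (an assumption). The names `start_time`, `end_time`, `start_lat`, `start_lng`, `end_lat` and `end_lng` are hypothetical; `trip_duration` and `weekend` match the columns used later in this notebook.

```python
import pandas as pd

# Toy data (values are invented); column names follow the raw data set.
raw = pd.DataFrame({
    "b_number": [10, 10, 20, 20],
    "trip": ["start", "end", "start", "end"],
    "datetime": pd.to_datetime([
        "2019-06-01 08:00", "2019-06-01 08:15",   # a Saturday
        "2019-06-03 17:30", "2019-06-03 18:00",   # a Monday
    ]),
    "p_lat": [51.51, 51.52, 51.50, 51.49],
    "p_lng": [7.46, 7.47, 7.45, 7.44],
})

# Sort so that each start row is directly followed by its end row.
raw = raw.sort_values(["b_number", "datetime"]).reset_index(drop=True)
starts = raw[raw["trip"] == "start"].reset_index(drop=True)
ends = raw[raw["trip"] == "end"].reset_index(drop=True)

# Assumption: the i-th start row and the i-th end row belong to the same trip.
trips = pd.DataFrame({
    "b_number": starts["b_number"],
    "start_time": starts["datetime"],
    "end_time": ends["datetime"],
    "start_lat": starts["p_lat"],
    "start_lng": starts["p_lng"],
    "end_lat": ends["p_lat"],
    "end_lng": ends["p_lng"],
})

# Duration in minutes and a binary weekend flag (Saturday = 5, Sunday = 6).
trips["trip_duration"] = (
    (trips["end_time"] - trips["start_time"]).dt.total_seconds() / 60
)
trips["weekend"] = (trips["start_time"].dt.dayofweek >= 5).astype(int)
print(trips)
```

On real data the alternation assumption should be checked first, for example by comparing `starts["b_number"]` with `ends["b_number"]`.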

#### Adding weather features

The following steps add three weather features to the final trip DataFrame. The source of the weather data is the "Deutscher Wetterdienst". [Here](https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/), you can download hourly weather data for several cities in Germany.

We use the weather data for Waltrop because there is no official weather station directly in Dortmund, so no data for Dortmund is accessible. Waltrop is the closest city to Dortmund for which weather data can be accessed.
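
One way to attach hourly weather observations to trips is to floor each start time to the full hour and join on that value. The sketch below illustrates this idea on invented toy data; it is not the implementation behind `with_weather=True`, and the column names `start_time`, `hour_ts`, `temperature` and `precipitation` are hypothetical.

```python
import pandas as pd

# Toy trips and toy hourly weather observations (all values are invented).
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-06-01 08:10", "2019-06-01 09:45"]),
})
weather = pd.DataFrame({
    "hour_ts": pd.to_datetime(["2019-06-01 08:00", "2019-06-01 09:00"]),
    "temperature": [14.2, 16.0],
    "precipitation": [0.0, 0.3],
})

# Floor the start time to the full hour so it matches the hourly weather rows.
trips["hour_ts"] = trips["start_time"].dt.floor("h")

# A left join keeps every trip, even if no weather observation exists for its hour.
merged = trips.merge(weather, on="hour_ts", how="left")
print(merged)
```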



In [None]:
df = Preprocessing.get_trip_data(with_weather=True)
df.head(10)

In [None]:
output.write_trip_data(df)

### d) Calculate the aggregate statistics (i.e., mean and standard deviation) for the trip duration per month, per day of week, and per hour of day. Are there visible differences between weekdays and weekends?

(The differences between weekdays and weekends will be shown in Task 2 by visualizing the data)

#### Calculating aggregate statistics per month, per day of week and per hour of day

##### Statistics per month

In [None]:
# "July" is left out of this array; its length must match the number of months in the data
month_by_name = np.array(["January", "February", "March", "April", "May", "June", "August", "September", "October", "November", "December"])

# Means per month (select the column before aggregating so non-numeric columns are ignored)
df.groupby("month")[["trip_duration"]].mean().set_index(keys=month_by_name)

In [None]:
# Means per month
# distinguish between weekend and workday
df.groupby(["weekend", "month"])[["trip_duration"]].mean()

In [None]:
# Standard deviation per month
df.groupby("month")[["trip_duration"]].std().set_index(keys=month_by_name)

In [None]:
# Standard deviation per month
# distinguish between weekend and workday
df.groupby(["weekend", "month"])[["trip_duration"]].std()

##### Statistics per day of week

In [None]:
# Means per day of week
weekday_by_name = np.array(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
df.groupby("weekday")[["trip_duration"]].mean().set_index(weekday_by_name)

In [None]:
# Standard deviation per day of week
df[["weekday", "trip_duration"]].groupby("weekday").std().set_index(weekday_by_name)

##### Statistics per hour of day

In [None]:
# Means per hour
df.groupby("hour")[["trip_duration"]].mean()

In [None]:
# Means per hour 
# distinguish between weekend and workday
df.groupby(["weekend", "hour"])[["trip_duration"]].mean()

In [None]:
# Standard deviation per hour
df[["hour", "trip_duration"]].groupby("hour").std()

In [None]:
# Standard deviation per hour
# distinguish between weekend and workday
df.groupby(["weekend", "hour"])[["trip_duration"]].std()
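
The mean and the standard deviation can also be computed in one step with `agg`, which puts both statistics side by side for weekdays and weekends. The sketch below uses invented toy data with the column names `weekend`, `hour` and `trip_duration` from the trip DataFrame; `toy_trips` and `stats` are hypothetical names.

```python
import pandas as pd

# Toy data (values are invented); weekend is binary: 0 = weekday, 1 = weekend.
toy_trips = pd.DataFrame({
    "weekend": [0, 0, 1, 1],
    "hour": [8, 8, 8, 8],
    "trip_duration": [10.0, 20.0, 30.0, 50.0],
})

# One row per (weekend, hour) group, one column per statistic.
stats = toy_trips.groupby(["weekend", "hour"])["trip_duration"].agg(["mean", "std"])
print(stats)
```

The same call with `"month"` or `"weekday"` in place of `"hour"` gives the other two breakdowns.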