# Semester Project - Nextbike
## Task 1 - Exploration and Description

In [None]:
# import relevant libraries for data exploration
from vincenty import vincenty
import numpy as np 
import pandas as pd 
import datetime
from datetime import timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

In [None]:
# reading the csv
df = pd.read_csv("../../data/internal/dortmund.csv", index_col=0)
df.head(5)

### a) The data set shows columns with prefixes p and b. What do you think do they represent? Also try to find good assumptions for the meanings of the columns.

The prefix "p" stands for the <i>position</i>, and the prefix "b" marks the features of the <i>bike</i> used.

###### Meanings of the columns

| Column | Description |
|--------|-------------|
| <i>p_spot</i> | True if it is an official station |
| <i>p_place_type</i> | |
| <i>datetime</i> | Datetime of the start or end of a trip |
| <i>b_number</i> | Bike ID |
| <i>trip</i> | Values = ["first", "last", "start", "end"] <br> defines whether a trip starts or ends |
| <i>p_uid</i> | ID of the bike station / position |
| <i>p_bikes</i> | Number of available bikes at the position |
| <i>p_lat</i> | Latitude coordinate of the position |
| <i>b_bike_type</i> | Type of the bike used |
| <i>p_name</i> | Street or station name of the current position |
| <i>p_number</i> | ID of the position / bike station |
| <i>p_lng</i> | Longitude coordinate of the position |
| <i>p_bike</i> | |



### b) The trip column in your data set shows different values. Explain why there are not only two. Are examples with certain values for trip more informative for the analysis of mobility patterns than others?


#### Analyse the trip column

In [None]:
df["trip"].unique()

There are four different values in the trip column: first, last, start and end.
At least two values are required to mark whether a row belongs to the start or the end of a trip. This means that <b>one trip is represented by two successive rows</b> in the DataFrame. One of the rows contains the values at the starting point (e.g. datetime, start position) and the other row contains the values at the end point of the trip.

Let's take a closer look at the DataFrame and the trip column.

In [None]:
# far more rows have the values "start" and "end" in the trip column than "first" and "last"
df["trip"].value_counts()

In [None]:
df[(df["trip"] == "first") | (df["trip"] =="last")].head(50)

The filtered DataFrame above makes clear that the rows with the values **first** and **last** in the trip column don't describe plausible trips. Most of these trips have an unlikely long duration: the start time is almost always 00:00 and the end time is 23:59.
Furthermore, the start and end positions of such a trip are the same.

These could be measurement errors or other data recording errors. <br>
These rows can be disregarded in the next steps because they aren't suitable for further analysis, especially for the prediction of trip durations.

### c) Based on the given data, create a new DataFrame that stores (at least) the following trip information (“trip format”):
- Bike Number
- Start Time (Either as appropriate data type or as several columns from “Start Month” down to “Start Minute”)
- Weekend (binary)
- Start Position (Either as appropriate data type or as two columns for Longitude and Latitude),
- Duration
- End Time 
- End Position 

In [None]:
df = df[(df["trip"] == "start") | (df["trip"] == "end")]

In [None]:
# there are more "start" rows than "end" rows
df["trip"].value_counts()

In [None]:
# check whether the next row has a different trip type than the current row,
# i.e. whether the two rows can form a start/end pair for one trip
# rows that are followed by a row of the same trip type are deleted
deletionFilter = df["trip"] != df["trip"].shift(-1)
deletionFilter.value_counts()

There are 6659 rows that have the same trip type as the row that follows them. That is exactly the difference between the number of rows with trip type "start" and the number with trip type "end" (see above).
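
To see how this filter behaves, here is a minimal sketch on a made-up sequence of trip values (the toy Series is an assumption for illustration and is not taken from the Dortmund data). Comparing each row with its successor via `shift(-1)` drops the first of two consecutive rows of the same type:

```python
import pandas as pd

# hypothetical toy sequence: the third row is a "start" without a matching "end"
trip = pd.Series(["start", "end", "start", "start", "end", "start", "end"])

# True where the next row has a different trip type
# (the last row is compared against NaN and is therefore kept)
keep = trip != trip.shift(-1)

filtered = trip[keep]
counts_before = trip.value_counts()
counts_after = filtered.value_counts()
n_dropped = int((~keep).sum())
```

Note that the filter only compares neighbouring rows; it does not check that both rows belong to the same bike.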

In [None]:
# apply the filter 
df = df[deletionFilter]
df.groupby("trip").count()  # the number of rows for each trip type is equal now

In [None]:
# focus on rows with the values "start" and "end" in the trip column
# store the start and end events of a trip in two different DataFrames
df_start = df[(df["trip"] == "start")] 
df_end = df[(df["trip"] == "end")] 

In [None]:
df_start.reset_index(inplace=True)
df_end.reset_index(inplace=True)

In [None]:
# rename the columns to distinguish them after merging the DataFrames
df_start.rename(columns={"index":"index_start","datetime":"datetime_start", "p_lat":"latitude_start","p_lng":"longitude_start","p_name":"p_name_start","b_number":"b_number_start","p_number":"p_number_start"},inplace=True)
df_end.rename(columns={"index":"index_end","datetime":"datetime_end", "p_lat":"latitude_end","p_lng":"longitude_end","p_name":"p_name_end","b_number":"b_number_end","p_number":"p_number_end"},inplace=True)

In [None]:
# drop the columns, which aren't necessary for the final dataframe
df_start.drop(['p_spot', 'p_place_type',  'trip',
       'p_uid', 'p_bikes', 'b_bike_type',
       'p_bike'],inplace=True,axis=1)

df_end.drop(['p_spot', 'p_place_type', 'trip',
       'p_uid', 'p_bikes', 'b_bike_type',
       'p_bike'],inplace=True,axis=1)

In [None]:
# modify the index_end to merge the dataframes by index_start and index_end
df_end["index_end"] = df_end["index_end"]-1

In [None]:
# merge the two separate DataFrames into the final DataFrame
# each row of the final DataFrame describes one trip with features for its start and its end
df_final = pd.merge(df_start,df_end,left_on="index_start", right_on="index_end")
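
The merge above relies on every end row carrying an index exactly one higher than its start row. As a hedged alternative, the sketch below pairs the rows by position instead, assuming the start and end rows strictly alternate after filtering. The toy data and the names `events`, `starts`, `ends` and `trips` are made up for illustration:

```python
import pandas as pd

# hypothetical toy data: alternating start/end rows with gaps in the index
events = pd.DataFrame(
    {
        "trip": ["start", "end", "start", "end"],
        "b_number": [50641, 50641, 53006, 53006],
        "datetime": pd.to_datetime(
            ["2019-01-20 10:00", "2019-01-20 10:15",
             "2019-01-20 11:00", "2019-01-20 11:30"]
        ),
    },
    index=[0, 1, 4, 7],
)

# split by trip type and renumber both halves 0..n-1,
# so that row i of each half belongs to the same trip
starts = events[events["trip"] == "start"].reset_index(drop=True).add_suffix("_start")
ends = events[events["trip"] == "end"].reset_index(drop=True).add_suffix("_end")

# put the two halves side by side: one row per trip
trips = pd.concat([starts, ends], axis=1)

# sanity check: both halves of a pair must refer to the same bike
same_bike = bool((trips["b_number_start"] == trips["b_number_end"]).all())

# duration in minutes from the timedelta between end and start
trips["trip_duration"] = (
    trips["datetime_end"] - trips["datetime_start"]
).dt.total_seconds() / 60
```

With the index gap in the toy data, an index-based merge would only find the first pair, while the positional pairing keeps both trips.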

In [None]:
# check if there is a trip with different bike numbers at the start and the end of the trip
# if so, this wouldn't make sense
df_final[df_final["b_number_start"] != df_final["b_number_end"]]

In [None]:
# check if the start time is later than the end time
# if so this wouldn't make sense 
df_final[df_final["datetime_start"] > df_final["datetime_end"]]

In [None]:
# after merging, the number of rows matches the number of rows per trip type
# a trip with its start and end features is now represented in one row
df_final.info()

In [None]:
# p_number != 0 --> just focus on the trips from and to an official bike station 
df_final = df_final[(df_final["p_number_start"] != 0) & (df_final["p_number_end"] != 0)]

In [None]:
# drop the redundant columns
df_final.drop(["index_start","index_end","b_number_end","p_number_start","p_number_end"],inplace=True,axis=1)


In [None]:
df_final.rename(columns={"b_number_start":"b_number"},inplace=True)

In [None]:
df_final.columns

In [None]:
df_final.info()

In [None]:
# check for missing values 
df_final.isna().any(axis=0)

In [None]:
# converting objects to datetimes
df_final["datetime_start"] = pd.to_datetime(df_final["datetime_start"])
df_final["datetime_end"] = pd.to_datetime(df_final["datetime_end"])

# adding the trip duration as the difference between end and start time
df_final["trip_duration"] = df_final["datetime_end"] - df_final["datetime_start"]

# converting the timedelta to a numeric duration in minutes
df_final["trip_duration"] = df_final["trip_duration"].dt.total_seconds() / 60

df_final["coordinates_start"] = list(zip(df_final["latitude_start"],df_final["longitude_start"]))
df_final["coordinates_end"] = list(zip(df_final["latitude_end"],df_final["longitude_end"]))

# adding the distance between start and end position
df_final["distance"] = df_final.apply(
    lambda x: vincenty([x["latitude_start"], x["longitude_start"]],
                       [x["latitude_end"], x["longitude_end"]],),axis=1)

# adding further distances
df_final["distanceToUniversity"] = df_final.apply(lambda x: vincenty([x["latitude_start"], x["longitude_start"]],
                       [51.4928736,7.415647],),axis=1)
df_final["distanceToCentralStation"] = df_final.apply(lambda x: vincenty([x["latitude_start"], x["longitude_start"]],
                       [51.5175, 7.458889],),axis=1)

# adding the weekday of the start time of a trip; stored as integers (0: Monday, 6: Sunday)
df_final['weekday'] = df_final['datetime_start'].dt.dayofweek
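
The distances in the cell above come from the `vincenty` package. As a rough, dependency-free cross-check, the haversine formula on a sphere can be sketched as follows. `haversine_km` and `EARTH_RADIUS_KM` are names introduced here, and the spherical result only approximates an ellipsoidal distance:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(start, end):
    """Great-circle distance in kilometres between two (lat, lng) pairs."""
    lat1, lng1 = map(radians, start)
    lat2, lng2 = map(radians, end)
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# the two reference points used in the cell above
university = (51.4928736, 7.415647)
central_station = (51.5175, 7.458889)

d = haversine_km(university, central_station)
```

A value of roughly four kilometres between the two reference points is the order of magnitude to expect from this sketch.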

In [None]:
# function which returns 1 for Saturday and Sunday; otherwise it returns 0
def isWeekend(index_of_day):
    return 1 if index_of_day > 4 else 0

# adding new binary column "weekend"
df_final["weekend"] = df_final["weekday"].apply(isWeekend)

In [None]:
# transform column "datetime_start" into several columns
df_final["day"] = df_final["datetime_start"].apply(lambda x: x.day)
df_final["month"] = df_final["datetime_start"].apply(lambda x: x.month)
df_final["hour"] = df_final["datetime_start"].apply(lambda x: x.hour)
df_final["minute"] = df_final["datetime_start"].apply(lambda x : x.minute)
df_final["day_of_year"] = df_final["datetime_start"].apply(lambda x: x.timetuple().tm_yday)
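
The same columns can also be derived without row-wise `apply` calls. The following is a minimal sketch using the pandas `.dt` accessor on a made-up DataFrame `demo`; the two timestamps are hypothetical:

```python
import pandas as pd

# hypothetical start times: a Saturday and a Monday in January 2019
demo = pd.DataFrame(
    {"datetime_start": pd.to_datetime(["2019-01-05 08:30:00", "2019-01-07 17:45:00"])}
)

start = demo["datetime_start"].dt
demo["weekday"] = start.dayofweek  # 0: Monday, 6: Sunday
demo["weekend"] = (demo["weekday"] > 4).astype(int)
demo["day"] = start.day
demo["month"] = start.month
demo["hour"] = start.hour
demo["minute"] = start.minute
demo["day_of_year"] = start.dayofyear
```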

In [None]:
def __get_tripLabel(row):
    if (row['towardsUniversity'] == 1) and (row['awayFromUniversity'] == 0):
        return 'towardsUniversity'
    if (row['towardsUniversity'] == 0) and (row['awayFromUniversity'] == 1):
        return 'awayFromUniversity'
    # trips between two university stations are counted as 'towardsUniversity'
    if (row['towardsUniversity'] == 1) and (row['awayFromUniversity'] == 1):
        return 'towardsUniversity'
    if (row['towardsUniversity'] == 0) and (row['awayFromUniversity'] == 0):
        return 'noUniversityRide'

    # only reached if one of the flags is neither 0 nor 1
    warnings.warn("Unexpected values in 'towardsUniversity' / 'awayFromUniversity'")
    return None

# add the attribute whether a trip was done towards/away from university (for prediction in task 3b)
# array with university stations
university_stations = ["TU Dortmund Seminarraumgebäude 1", "TU Dortmund Hörsaalgebäude 2", "Universität/S-Bahnhof",
                        "TU Dortmund Emil-Figge-Straße 50", "FH-Dortmund Emil-Figge-Straße 42"]

df_final['towardsUniversity'] = df_final['p_name_end'].apply(lambda x: 1 if x in university_stations else 0)
df_final['awayFromUniversity'] = df_final['p_name_start'].apply(lambda x: 1 if x in university_stations else 0)

df_final['tripLabel'] = df_final.apply(lambda row: __get_tripLabel(row), axis=1)
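
The labelling rule can also be written without a row-wise function. This sketch uses `Series.isin` and `numpy.select`, where the first matching condition wins, so a trip between two university stations still counts as a trip towards the university. The DataFrame `rides`, the station names "Hauptbahnhof" and "Stadtgarten", and the label strings as spelled here are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# shortened, hypothetical station list for the example
university_stations = ["Universität/S-Bahnhof"]

rides = pd.DataFrame(
    {
        "p_name_start": ["Hauptbahnhof", "Universität/S-Bahnhof",
                         "Universität/S-Bahnhof", "Hauptbahnhof"],
        "p_name_end": ["Universität/S-Bahnhof", "Hauptbahnhof",
                       "Universität/S-Bahnhof", "Stadtgarten"],
    }
)

ends_at_uni = rides["p_name_end"].isin(university_stations)
starts_at_uni = rides["p_name_start"].isin(university_stations)

# conditions are checked in order; rows matching none get the default label
rides["tripLabel"] = np.select(
    [ends_at_uni, starts_at_uni],
    ["towardsUniversity", "awayFromUniversity"],
    default="noUniversityRide",
)
```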

#### Adding weather features

The following steps add three weather features to the final trip DataFrame. The source of the weather data is "Deutscher Wetterdienst". [Here](https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/), you can download the hourly weather data for several cities in Germany.

We use the weather data for Waltrop because no data from an official weather station in Dortmund is accessible. Waltrop is the closest city to Dortmund for which weather data can be accessed.


In [None]:
# temperature for each hour in 2019 
temp = pd.read_csv("../../data/external/WaltropTemp.txt", sep = ";")
temp.rename(columns = {"TT_TU":"temperature °C", "MESS_DATUM":"datetime"}, inplace=True)
temp.drop(labels=["STATIONS_ID", "QN_9", "eor","RF_TU"], axis=1, inplace=True)
temp = temp[(temp["datetime"] >= 2019010100) & (temp["datetime"] <= 2019123123)]
temp.reset_index(drop=True, inplace=True)
temp

In [None]:
# two features (precipitation in mm & precipitation y/n) for each hour in 2019
precipitation = pd.read_csv("../../data/external/WaltropPrecipitation.txt", sep = ";")
precipitation.rename(columns = {"  R1":"precipitation in mm", "MESS_DATUM":"datetime", "RS_IND":"precipitation"}, inplace=True)
precipitation = precipitation[(precipitation["datetime"] >= 2019010100) & (precipitation["datetime"] <= 2019123123)]
precipitation.drop(labels=["STATIONS_ID", "QN_8", "eor","WRTR"], axis=1, inplace=True)
precipitation.reset_index(drop=True, inplace=True)
precipitation

In [None]:
# merge the DataFrames for temperature and precipitation into one DataFrame
weather = pd.merge(temp,precipitation, on="datetime")
weather

In [None]:
def formatDatetimeForMerging(x):
    # return as integer for merging 
    return int(x[:13].replace('-','').replace(' ',''))

df_final["datetime_start_for_merge_with_weather"] = df_final["datetime_start"].apply(lambda x: formatDatetimeForMerging(str(x)))

# merge with weather data 
df_final = pd.merge(df_final, weather, left_on="datetime_start_for_merge_with_weather", right_on="datetime")

# drop redundant columns 
df_final.drop(labels=["datetime", "datetime_start_for_merge_with_weather"], axis=1,inplace=True)
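
The merge above uses pandas' default inner join, so trips whose start hour has no weather row are dropped. The sketch below builds the same `YYYYMMDDHH` integer key in one vectorized step and uses a left merge with `indicator=True` to count such trips. The toy frames and the names `weather_key` and `n_unmatched` are assumptions for illustration:

```python
import pandas as pd

# hypothetical trips and hourly weather rows; column names follow the notebook
trips = pd.DataFrame(
    {
        "datetime_start": pd.to_datetime(
            ["2019-01-05 08:30:00", "2019-01-05 09:05:00", "2019-01-05 11:10:00"]
        )
    }
)
weather = pd.DataFrame(
    {"datetime": [2019010508, 2019010509], "temperature °C": [1.5, 2.0]}
)

# build the YYYYMMDDHH integer key for all rows at once
trips["weather_key"] = (
    trips["datetime_start"].dt.strftime("%Y%m%d%H").astype("int64")
)

# a left merge keeps trips without a weather row; "_merge" marks them as "left_only"
merged = pd.merge(
    trips,
    weather,
    left_on="weather_key",
    right_on="datetime",
    how="left",
    indicator=True,
)
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

If `n_unmatched` is zero on the real data, the inner join used above loses no trips.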

In [None]:
df_final.to_csv('../../data/processed/dortmund_trips.csv')
df_final

### d) Calculate the aggregate statistics (i.e., mean and standard deviation) for the trip duration per month, per day of week, and per hour of day. Are there visible differences between weekdays and weekends?

(The differences between weekdays and weekends will be shown in Task 2 by visualizing the data)

#### Calculating aggregate statistic per month, per day of week and per hour of day

##### Statistic per month

In [None]:
# "July" is missing from this array because the data contains no trips in July
month_by_name = np.array(["January", "February", "March", "April", "May", "June", "August", "September", "October", "November", "December"])

# Means per month
df_final.groupby(['month'])[["trip_duration"]].mean().set_index(keys=month_by_name)

In [None]:
# Means per month
# distinguish between weekend and workday
df_final.groupby(['weekend', 'month'])[["trip_duration"]].mean()

There is no data for July.

In [None]:
# Standard deviation per month
df_final.groupby(['month'])[["trip_duration"]].std().set_index(keys=month_by_name)

In [None]:
# Standard deviation per month
# distinguish between weekend and workday
df_final.groupby(['weekend', 'month'])[["trip_duration"]].std()

##### Statistics per day of week

In [None]:
# Means 
weekday_by_name= np.array(["Monday", "Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"])
df_final.groupby(['weekday'])[["trip_duration"]].mean().set_index(weekday_by_name)

In [None]:
# Standard deviation 
df_final[["weekday", "trip_duration"]].groupby("weekday").std().set_index(weekday_by_name)

##### Statistics per hour of day

In [None]:
# Means per hour
df_final.groupby(['hour'])[["trip_duration"]].mean()

In [None]:
# Means per hour 
# distinguish between weekend and workday
df_final.groupby(['weekend', 'hour'])[["trip_duration"]].mean()

In [None]:
# Standard deviation per hour
df_final[["hour", "trip_duration"]].groupby("hour").std()

In [None]:
# Standard deviation per hour
# distinguish between weekend and workday
df_final.groupby(['weekend', 'hour'])[["trip_duration"]].std()
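
The separate mean and standard deviation cells above can also be combined into one table per grouping. A minimal sketch with `agg`, using a made-up DataFrame `demo` with hypothetical durations in minutes:

```python
import pandas as pd

# hypothetical trip durations in minutes
demo = pd.DataFrame(
    {
        "weekend": [0, 0, 1, 1],
        "trip_duration": [10.0, 20.0, 30.0, 50.0],
    }
)

# one row per group, one column per statistic (std uses the sample formula, ddof=1)
stats = demo.groupby("weekend")["trip_duration"].agg(["mean", "std"])
```

The same call works with `"month"`, `"weekday"`, `"hour"` or a list such as `["weekend", "hour"]` as the grouping key.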