# NYC Cabs

New York. The city that never sleeps. New York is one of the world's busiest cities. Many tourists, businesspeople, and New Yorkers use cabs as a central means of transportation. Although the yellow cabs are a central building block of the city's image and self-representation, many (digital) competitors have entered the competitive arena. Competitors such as Uber frequently offer not only a cheaper ride but also a new mobility experience through apps and other digital companion services.

Thus, the different NYC cab companies have teamed up and decided to build a new digital app with which potential customers can order a cab and get a prediction of the ride's price.

Now, it is up to you to build such a prediction system for the NYC cab companies. In this initial part, you will start with the data engineering and some exploratory data analysis. In a second part, after module 3, you will engage in the prediction part. For this assignment, we will use original historic data. Due to the size of the data set, we will work with 0.2% of the January 2016 data as well as weather and holiday data for that time.

# 1) Read in data

Read in the three datasets with pandas.

In [None]:
import pandas as pd
weather=pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/weather_assignment.csv")
holidays=pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/usHolidays.csv")
rides=pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/rides_jan2016_assignment.csv")

# 2) Prepare weather data

### 2.1) Inspect data

Inspect the weather data using the .head() and .info() methods.

In [None]:
weather.____

In [None]:
weather.____

What can we see from that?

1. Have a look at the different data types ( .info() ). Most variables are stored as integers and floats (can you remember the difference?). This is good - they reflect numeric measurements.
2. The features "conds" (condition) and "vis" (visibility) have no nice names - we need to fix that.
3. The features "conds" and "date" are objects, that is, they are recognized as strings (text). This is ok for "conds". But we need to transform "date" from an object to a "datetime" - we need to tell Python that this is a measurement of time.
4. Have a look at the "date" feature: weather is measured at an hourly rate.
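
As a reminder of the integer/float distinction mentioned in the list above, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical and are not taken from the weather file.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's weather file
toy = pd.DataFrame({
    "humidity": [80, 75, 90],                # whole numbers -> integer dtype
    "temp": [1.5, 2.0, -0.5],                # decimal numbers -> float dtype
    "conds": ["Clear", "Snow", "Overcast"],  # text (shown as "object" in many pandas versions)
})

# One dtype per column, the same information that .info() reports
print(toy.dtypes)
```

Integers hold whole numbers only, while floats can hold decimals (and missing values).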

###  2.2) Fix date object

Pandas provides a very easy approach to transforming dates into the "datetime" format - the correct format for dates and times. We apply the pd.to_datetime() function to our variable. Remember that this is a function and not a method. Methods are called directly on a dataframe (e.g., weather.head() ). Functions are called from the pandas module and receive the data as an argument (e.g., pd.to_datetime(weather["date"]) ).

In [None]:
weather["date"] = pd.to_datetime(weather["date"])

In [None]:
weather.info()

In [None]:
weather.head()

The new "date" feature contains a date part and a time part. We are only interested in the date part and create a new feature "DATE" using the .dt.date accessor.

In [None]:
weather["DATE"]=weather["date"].dt.date
weather.head()
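
It may help to see what .dt.date actually returns. The sketch below uses two made-up timestamps (the series `stamps` is hypothetical) and compares .dt.date with .dt.normalize(), an alternative that is not used in this assignment.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the assignment data
stamps = pd.Series(pd.to_datetime(["2016-01-01 00:51:00", "2016-01-01 13:51:00"]))

as_date = stamps.dt.date             # Python datetime.date objects, dtype becomes object
as_midnight = stamps.dt.normalize()  # time set to 00:00:00, dtype stays datetime64

print(as_date.dtype)
print(as_midnight.dtype)
```

Because .dt.date yields plain Python date objects, the resulting column is no longer a pandas datetime column. This matters later when date columns are used as merge keys.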

### 2.3) Please rename the features "vis" and "conds" into "visibility" and "conditions"

In [None]:
weather = weather.rename(____={____:____})
weather.____

In [None]:
weather = weather.rename(____={____:____})
weather.____
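
For reference, this is roughly how renaming works on a toy frame. The column names `tmp` and `hum` are hypothetical and were chosen so as not to coincide with the exercise.

```python
import pandas as pd

# Hypothetical column names, unrelated to the weather data
toy = pd.DataFrame({"tmp": [1.5, 2.0], "hum": [80, 75]})

# rename takes a mapping from old name to new name and returns a new frame
toy = toy.rename(columns={"tmp": "temperature", "hum": "humidity"})

print(toy.columns.tolist())  # ['temperature', 'humidity']
```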

# 3) Prepare rides data

### 3.1) Investigate the rides dataframe 

In [None]:
____

In [None]:
____

### 3.2) Fix the date object (again)

Have a look at the feature "pickup_datetime" - yet another format for time. Let's transform that to pandas' standard datetime with pd.to_datetime()

In [None]:
rides[____] = pd.____(____[____])

Extract the date part only

In [None]:
rides["pickup_datetime"] = rides["pickup_datetime"].____.____
rides["pickup_datetime"]

# 4) Prepare the holidays dataframe

### 4.1) Inspect the holidays dataframe

In [None]:
____

In [None]:
____

### 4.2) Remove the feature "Index" from the data frame. We don't need it.

In [None]:
del(holidays[____])

### 4.3) And again, a different format for dates. Fix it!

In [None]:
holidays["Date"]=pd.to_datetime(____)
holidays.head()

The command may generate a warning because pandas cannot infer the date format automatically. You can ignore it (not in general, but in this case).
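
In situations where such a warning should not be ignored, one option is to state the format explicitly. The sketch below assumes day/month/year strings; this format and the series `raw` are hypothetical and do not describe the holidays file.

```python
import pandas as pd

# Hypothetical date strings in day/month/year order
raw = pd.Series(["01/01/2016", "18/01/2016"])

# An explicit format removes any guessing about which part is the day
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed)
```

With an explicit format, pandas raises an error for strings that do not match instead of silently guessing.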

# 5) Merge rides and holidays

Let's start merging the three data frames on the date variable, beginning with rides and holidays. However, watch out, dates are special... The standard merge command leads to an error. We have to invest a bit in additional data preparation instead...

In [None]:
pd.merge(rides, 
         holidays, 
         how="inner", 
         left_on="pickup_datetime", 
         right_on="Date")
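
The failure can be reproduced on miniature data. In the sketch below, `toy_rides` and `toy_holidays` are hypothetical stand-ins: one key column holds Python date objects (as produced by .dt.date), the other holds pandas datetimes. The dtype alignment at the end is one possible alternative and is not the approach this assignment takes.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
toy_rides = pd.DataFrame({
    # .date on a DatetimeIndex gives Python date objects -> object dtype
    "pickup_datetime": pd.to_datetime(["2016-01-01 08:30", "2016-01-02 09:00"]).date,
    "total_amount": [12.5, 30.0],
})
toy_holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01"]),  # datetime64 dtype
    "Holiday": ["New Year's Day"],
})

try:
    pd.merge(toy_rides, toy_holidays, how="inner",
             left_on="pickup_datetime", right_on="Date")
    merge_failed = False
except ValueError as err:
    # pandas refuses to merge an object column with a datetime64 column
    merge_failed = True
    print("merge failed:", err)

# Alternative (an assumption, not the assignment's method): align the dtypes first
toy_rides["pickup_datetime"] = pd.to_datetime(toy_rides["pickup_datetime"])
aligned = pd.merge(toy_rides, toy_holidays, how="inner",
                   left_on="pickup_datetime", right_on="Date")
print(aligned)
```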

If we want to merge on date features, we have to work with index values, that is, the row names (0 to 4 on the left side in the following table).

In [None]:
rides.head()

We have to replace the index / row names. We can do that with the .set_index() method of pandas. In the following example, we change the index from a number, e.g., 0 to 4, to the pickup_datetime.

In [None]:
rides.set_index(rides["pickup_datetime"], inplace=True)
rides.head()

Let's repeat that for the holidays dataframe.

In [None]:
holidays.____(holidays["Date"], inplace=____)
holidays.____

Now we can use the merge command. But we merge on the new index values... Also, we need a left join because we want to add the holidays to the individual rides. That is, we want to know whether a given ride was undertaken on a holiday or on a regular day.

In [None]:
rides_merged = pd.merge(rides, 
                         holidays, 
                         how="left", 
                         left_index=True, 
                         right_index=True)
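
To see why a left join is the right choice here, the following sketch compares a left and an inner join on the index. The frames `rides_toy` and `holidays_toy` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical rides: three trips on two days, indexed by date
rides_toy = pd.DataFrame(
    {"total_amount": [12.5, 30.0, 8.0]},
    index=pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-02"]),
)
# Hypothetical holidays: only one of the two days is a holiday
holidays_toy = pd.DataFrame(
    {"Holiday": ["New Year's Day"]},
    index=pd.to_datetime(["2016-01-01"]),
)

left = pd.merge(rides_toy, holidays_toy, how="left",
                left_index=True, right_index=True)
inner = pd.merge(rides_toy, holidays_toy, how="inner",
                 left_index=True, right_index=True)

print(len(left), len(inner))  # 3 1
print(left)
```

The left join keeps every ride and leaves "Holiday" missing on non-holidays, whereas the inner join would drop all rides that did not take place on a holiday.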

Let's check whether the merge was successful.

In [None]:
rides_merged.____

In the dataframe "rides_merged", we can now delete the two date features "pickup_datetime" and "Date". The data is redundant because we already have the date information in the newly created index.

### 5.1) Remove old date variables "pickup_datetime" and "Date". They are not needed anymore. The new index carries the date information.

In [None]:
____(rides_merged["pickup_datetime"])
____

### 5.2) Impute missing values in the "Holiday" variable

The feature "Holiday" has a lot of missing values. Let's have a look at the first 10 instances.

In [None]:
rides_merged["Holiday"].head(10)

Let's fill the missing values with the expression "Regular Day"

In [None]:
rides_merged["Holiday"]=rides_merged["Holiday"].fillna(value=____)
rides_merged["Holiday"].head(10)
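
A quick way to confirm that an imputation worked is to count missing values before and after. The sketch below does this on a hypothetical series `labels` with a made-up placeholder text.

```python
import pandas as pd

# Hypothetical labels with two missing entries
labels = pd.Series(["New Year's Day", None, None])

missing_before = int(labels.isna().sum())
filled = labels.fillna(value="no holiday")  # placeholder text is an assumption
missing_after = int(filled.isna().sum())

print(missing_before, missing_after)  # 2 0
print(filled.value_counts())
```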

# 6) Merge rides_merged and weather data 

### 6.1) Aggregate weather data from hourly to daily level

Let's reinspect the weather data frame

In [None]:
weather.____

Ok. Weather data is measured on an hourly level. To facilitate things, we aggregate the weather data to the daily level, that is, we take the average for each calendar day.

In [None]:
weather_daily = weather.groupby("DATE").____
weather_daily.head()

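Now we have daily averages. Note that the features "date" and "conditions" have been removed (because the mean cannot be applied to these variables). Depending on your pandas version, this removal is not automatic: in pandas 2.0 and later, averaging a text column raises a TypeError unless you pass numeric_only=True. Also, the index values have already been set to the "DATE" values that we have been using for grouping!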
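
The aggregation from hourly to daily values can be sketched on a tiny frame as follows. The frame `hourly` and its values are hypothetical, and numeric_only=True is passed explicitly so that the text column is skipped regardless of the pandas version (assuming a version recent enough to support that argument).

```python
import pandas as pd

# Hypothetical hourly readings for two days
hourly = pd.DataFrame({
    "DATE": ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
    "temp": [1.0, 3.0, -2.0, 0.0],
    "conditions": ["Clear", "Overcast", "Snow", "Snow"],
})

# One row per day; only numeric columns are averaged
daily = hourly.groupby("DATE").mean(numeric_only=True)

print(daily)
```

The grouping column becomes the index of the result, which is what makes the subsequent index-based merge possible.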

### 6.2) Let's merge the aggregated dataframe "weather_daily" to "rides_merged". Please use the correct merge (inner, left, etc.) and explore whether the merge was successful.

In [None]:
rides_merged = pd.merge(____,
                        ____,
                        how=____,
                        left_index = True,
                        right_index = True)
rides_merged.info()

### Woohoo! We are done. Everything is merged together!

# 7) Exploratory data analysis

### 7.1) Get all taxi rides with a price of more than 100 USD!

In [None]:
rides_merged.___[rides_merged["total_amount"]>___,]

### 7.2) Get all instances with an average speed below 2 miles per hour and snowfall greater than 0 mm!

In [None]:
rides_merged.loc[(___) & (___),]
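
Combining two conditions in .loc follows a fixed pattern, sketched here on a toy frame. The columns `fare` and `passengers` are hypothetical and unrelated to the exercise.

```python
import pandas as pd

# Hypothetical columns, unrelated to the exercise
toy = pd.DataFrame({
    "fare": [5.0, 60.0, 75.0, 12.0],
    "passengers": [1, 4, 1, 3],
})

# Each condition needs its own parentheses because & binds tighter than >
mask = (toy["fare"] > 50) & (toy["passengers"] > 2)
selected = toy.loc[mask, :]

print(selected)
```

Only rows for which both conditions are True remain; use | instead of & to keep rows where at least one condition holds.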

### 7.3) Get the total trip_distance of the different vendors (use "VendorID")

In [None]:
rides_merged.groupby(____)[____].____

### 7.4) Check the correlation between "trip_duration" and "snow". Draw a scatterplot and compute the correlation.

In [None]:
rides_merged.plot(____, _____, ____)
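
A scatterplot gives a visual impression; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation on made-up values (the frame `toy` and its columns `x` and `y` are hypothetical).

```python
import pandas as pd

# Hypothetical values with a nearly linear relationship
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

r = toy["x"].corr(toy["y"])  # Pearson correlation is the default method
print(round(r, 3))

# Pairwise correlations of all numeric columns at once
print(toy.corr())
```

Values close to 1 or -1 indicate a strong linear relationship, values near 0 a weak one. The same toy frame could be drawn with toy.plot.scatter(x="x", y="y"), which requires matplotlib to be installed.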