# Data Transforms - Rolling up NYC taxi data
 
Let's work with some raw data from the NYC yellow cabs.  The goal is to engineer a dataset that allows one to see how various usage statistics changed during Covid.  We'll focus on just May 2019 and May 2020 for the comparison. 
 
You'll see that there are various issues we need to deal with before we can aggregate. 
 
- NaN values
- Fixing improper datatypes 
- Filtering data
- Removing columns we don't actually need
- Creating new metrics
 
After we deal with those issues we can roll up the data into daily bins and compare trends.

**Note:** What we're learning here is the foundation of data transforms.  If this were part of an ETL, you would write the cleaning and transformation steps as several functions that run after extracting the data but before loading it elsewhere. You would run this on a schedule: say, each month when new taxi data is uploaded, you would run the functions and then update the aggregated data. 

**Note 2:** This is a pretty lengthy process.  There are lots of steps and times where we think we're done but then have to go back.  For example, we will think we did all our filtering, but then after making a feature realize we needed to do more filtering or filter that new column.  We try to minimize the number of operations, but sometimes you catch something much later and just deal with it then.  If you were going to turn this into a function you would optimize the sequence of operations a bit more, but I didn't fully do that here so you can see what I mean.  


## Bringing in our data

In [None]:
# Bring in our libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Here I'm setting how many decimals are displayed for our floats.
# If you don't do this pandas will display 6 decimals by default, which makes the numbers significantly harder to make sense of.
pd.set_option('float_format', '{:.2f}'.format)

In [None]:
# We'll pull the May 2019 data direct from the website...  This will take a second as it's on the larger side (700mb).
# The perk to using colab is you're downloading from Amazon to Google's storage, both of which are probably faster than your local connection!
# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

rides_05_2019 = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-05.parquet")


In [None]:
# And pull the May 2020
rides_05_2020 = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-05.parquet")


In [None]:
# Let's first look and make sure it pulled in the right number of columns. 
# Go and print the shape of both data frames in one cell
# Question 1: What is the shape of rides_05_2020?


In [None]:
# So that looks good in that there are 18 in both. 
# But ooph, over 7.5 million rides in May 2019 and not even 350,000 in May 2020
 
# Let's look at the column names to make sure they didn't change 
# We also need to make sure they're the same, as we're going to combine these two dataframes
print(rides_05_2019.columns)
...

In [None]:
# It's a bit awkward to look at both long lists, so why don't we just compare them?
# This is useful as sometimes companies will change data specifications mid year or year-to-year. 
rides_05_2019.columns == rides_05_2020.columns

In [None]:
# OK, so still not seeing any issues.
# Let's check datatypes of each and see if they differ.
# Print out the datatypes of both datasets

print(rides_05_2020.dtypes)
#Question 2: What is the data type of trip_distance?

In [None]:
# So there are some differences... let's again make it more clear with a comparison.
# Fill in below for the comparison 
... == ...




We can see that the datatypes don't match up in four columns.  And it looks like it's because in 2019 those columns were imported as integers, as you would expect for something like an ID, but in 2020 they were imported as floats. 

Let's go and explore VendorID in 2020.  We'll look at three different ways to dig into a column.

- `describe()` to check summary stats and see if anything is off
- `unique()` to see which distinct values are present (use `nunique()` to count them)
- `isna().sum()` to count up if and how many NaN values there are
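
As a small sketch of those three checks, here they are on a made-up column (the values below are invented for illustration and are not taken from the taxi data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for an ID column with two missing entries (values are made up)
vendor = pd.Series([1.0, 2.0, 2.0, np.nan, 1.0, np.nan], name="VendorID")

summary = vendor.describe()        # count ignores NaN, so count < len(vendor)
distinct = vendor.unique()         # array of distinct values, NaN included
n_distinct = vendor.nunique()      # number of distinct non-NaN values
n_missing = vendor.isna().sum()    # number of NaN entries

print(summary)
print(distinct, n_distinct, n_missing)
```

Note that the `count` in `describe()` is lower than the length of the column whenever NaNs are present, which is often the first hint that something is missing.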

In [None]:
# Using describe
rides_05_2020['VendorID'].describe()

We can see how many NaNs are there in each column:

In [None]:
# We can use isna().sum() to find the number of NaNs in one column
# How many NaNs are in the passenger_count column of rides_05_2019?


In [None]:
# Now look at the describe output. How many rows were used to calculate the statistics?
rides_05_2020.describe()


In [None]:
# How many non-NaN rows do we have for passenger_count? Compare that with the non-NaN count for other columns such as VendorID
rides_05_2020.count()

OK, so it's clear we have missing data.  The NaN values in the column caused pandas to import the data as a float vs. an integer.  Let's just go do the same for another two columns and see what they look like. 
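
To see why a missing value changes the datatype, here is a minimal sketch on two made-up columns. NaN is itself a float, so an integer column with a gap in it gets stored as floats:

```python
import numpy as np
import pandas as pd

complete = pd.Series([1, 2, 1])         # no gaps, so it stays an integer column
with_gap = pd.Series([1, 2, np.nan])    # one NaN forces the whole column to float

print(complete.dtype)
print(with_gap.dtype)

# After dropping the missing rows the values can safely go back to integers
restored = with_gap.dropna().astype("int64")
print(restored.dtype)
```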

In [None]:
# Another way to get a sense of what is happening here is to subset the data to show all of the lines where passenger_count is NaN to see if we can detect a pattern. 
# This isn't an alternative to the previous steps, but rather additional information gathering to understand what the issue is.
rides_05_2020[rides_05_2020['passenger_count'].isna()]

What did this tell us?

- We have the same number of NaN values in the columns that imported as different datatypes across the years. This suggests some sort of data entry issues/errors in those rows and that the NA/NaNs are not randomly scattered across different rows. So, for some reason 50k+ rows were entered with some missing data. 

- Our `fare_amount` column, and seemingly all of the columns except for 'passenger_count', 'RatecodeID', 'congestion_surcharge', and 'airport_fee' have no missing data. Those entry issues were likely isolated to just those columns.  You'd want to do more checking (e.g. systematically looking for NAs in each column, making sure all of the NAs are on the same rows, etc.), but we're not going to do that here.

- We have values that are extreme/not based in reality in our fare_amount column (e.g. -240.00), which is an issue. And we're going to want to do a `describe` on the whole dataset. But let's deal with our NaN values first.
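
The extra checking mentioned above (are the NaNs all on the same rows?) could look roughly like the sketch below. The dataframe is a toy with invented values; only the column names are borrowed from the taxi data.

```python
import numpy as np
import pandas as pd

# Toy frame: two columns are missing on the same two rows (values are made up)
df = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, np.nan],
    "RatecodeID":      [1.0, np.nan, 1.0, np.nan],
    "fare_amount":     [7.5, 12.0, 5.0, 30.0],
})

na_per_column = df.isna().sum()
cols_with_na = na_per_column[na_per_column > 0].index

# Rows with at least one NaN vs. rows where every affected column is NaN
rows_any = df[cols_with_na].isna().any(axis=1).sum()
rows_all = df[cols_with_na].isna().all(axis=1).sum()

print(na_per_column)
print(rows_any == rows_all)  # True means the NaNs line up on the same rows
```

If the two counts differ, some rows are missing only part of the affected columns, and dropping NaNs would remove more rows than any single column's NaN count suggests.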



## Dropping NaNs, bringing dataframes together

We know we need to deal with extreme values, but such extreme values might also be present in 2020.  So best to wait until we bring the data together to filter those off rather than do them on both datasets.  *But* we will want to convert our improper 2019 datatypes first and drop NA values before joining.

### Checking and Dropping NaNs

In [None]:
# Now drop all the NA values. 
# Easy just by adding .dropna() function at the end of the dataset you want to drop from

rides_05_2020_dropped = rides_05_2020.dropna()
print(rides_05_2020_dropped.shape)



It looks like we lost all of the rows! Why? 
Let's look at the NaNs in the original dataframe again

In [None]:
rides_05_2020.isna().sum()

Do you see the problem? The `airport_fee` column has no value in any row, so every row contains at least one NaN and `dropna()` drops them all. To avoid that we have to drop that column first!
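
Here is the same situation reproduced on a tiny made-up dataframe, along with two ways around it. Dropping by name is what we do next; the `how="all"` variant is an alternative sketch that doesn't need the column name in advance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "passenger_count": [1.0, 2.0, np.nan],
    "airport_fee":     [np.nan, np.nan, np.nan],   # empty for every row
})

print(df.dropna().shape)   # no rows survive: every row has a NaN somewhere

# Option 1: drop the empty column by name, then drop incomplete rows
by_name = df.drop(columns=["airport_fee"]).dropna()

# Option 2: drop any column that is entirely NaN, then drop incomplete rows
by_rule = df.dropna(axis=1, how="all").dropna()

print(by_name.shape, by_rule.shape)
```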

In [None]:
rides_05_2020 = rides_05_2020.drop(columns=['airport_fee'])
rides_05_2020.isna().sum()

In [None]:
# Now we can drop the NaNs
# Let's store the existing number of rows in the 2020 data before we drop
rows_2020_predrop = rides_05_2020.shape[0]

rides_05_2020 = rides_05_2020.dropna()
# Do an internal check by getting the number of rides now and calculating the difference
rows_2020_postdrop = rides_05_2020.shape[0]
rows_2020_predrop - rows_2020_postdrop

Great!  So the number of rows we dropped matches the number of NaN values we were seeing in the individual columns.  
 
If we were seeing a number greater than that it would suggest that there were NA's in other columns but in different positions. Luckily we don't have to worry about that.  


In [None]:
# Let's see if any NaNs remain
rides_05_2020.isna().sum()


In [None]:
# Do the same for rides_05_2019, but first check whether that data has the same problems
....



### Concatenating our data frames

Yay!  OK, let's bring our data together finally

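The `concat()` function in pandas allows you to join one dataframe to the bottom of another. Note that the column names should match (which we know they do because we checked earlier), and the same goes for datatypes.  If they don't, pandas won't raise an error; it will fill the columns missing from one dataframe with NaN.  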

**Def - concatenate:** To join or link together
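
A quick sketch of `concat()` on toy dataframes (values invented). It also shows two things worth knowing: the original row labels are kept unless you pass `ignore_index=True`, and a column missing from one dataframe is filled with NaN.

```python
import pandas as pd

a = pd.DataFrame({"fare_amount": [5.0, 7.5], "tip_amount": [1.0, 0.0]})
b = pd.DataFrame({"fare_amount": [9.0], "tip_amount": [2.0]})

stacked = pd.concat([a, b])                         # keeps original labels 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)   # fresh labels 0, 1, 2

# A column missing from one frame is not an error; it is filled with NaN
c = pd.DataFrame({"fare_amount": [4.0]})
mismatched = pd.concat([a, c], ignore_index=True)

print(stacked.index.tolist(), renumbered.index.tolist())
print(mismatched["tip_amount"].isna().sum())
```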

In [None]:
# Call our full set of data 'rides'
rides = pd.concat([rides_05_2019, rides_05_2020])

In [None]:
# Now check the shape and make sure it makes sense.  Are the number of rows equal to the sum of rows in the individual data frames?
#Question 4: What is the shape of rides (after dropping NaNs and concatenating rides_05_2019 and rides_05_2020)?

## Making it useful

We did the above in a really manual way.  That's fine for a first pass, but ideally you would build in checks that take in the data, look for NAs, and, if there are any, drop them and convert the datatypes.  That's a bit much at this point, but I just want to highlight that the goal is to automate this process so it's done on a regular basis without intervention.  By the end of the class we'll be doing a good bit of that!
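
To give a flavor of what such an automated check might look like, here is a rough sketch. The function name `drop_na_and_fix_ids` and its `id_cols` argument are hypothetical, made up for this illustration, and the data is a toy.

```python
import numpy as np
import pandas as pd

def drop_na_and_fix_ids(df, id_cols):
    """Hypothetical helper: drop empty columns and incomplete rows, then cast IDs to int."""
    n_before = len(df)
    df = df.dropna(axis=1, how="all").dropna().copy()
    for col in id_cols:
        if col in df.columns:
            df[col] = df[col].astype("int64")
    print(f"dropped {n_before - len(df)} rows with missing values")
    return df

raw = pd.DataFrame({
    "VendorID":    [1.0, 2.0, np.nan],
    "fare_amount": [7.5, 12.0, 5.0],
    "airport_fee": [np.nan, np.nan, np.nan],
})
clean = drop_na_and_fix_ids(raw, id_cols=["VendorID"])
print(clean.dtypes)
```

A real pipeline would likely also log or alert when the number of dropped rows is unusually high, rather than silently dropping them.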

For now, we have our dataset of May rides in 2019 and 2020.  Let's now make it useful. To do so we want to do the following:

- Drop columns we don't need
- Filter out data that don't make sense
- Convert columns to datetimes
- Get metrics that are of interest.  Specifically, let's calculate the average speed, average passengers, and average tip so that we can compare these between years.  
- To get the above we'll need to aggregate our data, but before that we'll need to create new columns.  

### Dropping columns
Let's select only the columns we need.

To do this you can just put a list of column names inside of `rides[]` and then assign back to your dataframe.  

You can also drop columns individually with `df.drop(columns = ['col_to_drop'])`

In [None]:
rides = rides[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'fare_amount', 'tip_amount']]

Let's take another look at our simplified data and see if we can detect other problems.

In [None]:
rides.describe()

## Filtering out bad data



- Our min passenger count is zero and our min trip distance is also zero.  We should select only rows where there is at least one passenger. We'll also want to deal with the 0s in distance, but setting an appropriate minimum cutoff is harder than with setting a min of 1 passenger. 
  
  \* In reality, you'd want to check whether rides without passengers actually occur in some cases and think about whether these cases are relevant to the analyses the end user would want to be doing

- Our min fare and tip are also negative.  Let's set min fare to be at least $2.50 (the initial charge for NYC cabs https://www1.nyc.gov/site/tlc/passengers/taxi-fare.page).

- Our max trip distance is 10000+ miles!  And the max fare amount is over $400,000.  Obviously wrong, so we want to filter those out as well. However, there is the same issue here as with setting an objective min distance. 

  A good way of dealing with these is first handling the obvious cases with clear cutoffs (i.e. passengers > 0 and min fare > 2.50) and then checking to see if the other problem cases persist. Sometimes a lot of problematic data are part of the same row, so eliminating some rows based on very reasonable assumptions avoids having to make potentially questionable assumptions about other data.
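
As a sketch of that idea on a toy dataframe (values invented): apply the clear-cut rules first, then check whether the harder problem, here zero distances, is still present.

```python
import pandas as pd

trips = pd.DataFrame({
    "passenger_count": [1, 0, 2, 1],
    "fare_amount":     [8.0, 6.0, -3.0, 2.5],
    "trip_distance":   [1.2, 0.0, 0.0, 0.4],
})

has_passenger = trips["passenger_count"] >= 1
fare_ok = trips["fare_amount"] >= 2.5

# How many rows would each rule remove on its own?
print((~has_passenger).sum(), (~fare_ok).sum())

# Apply both rules at once; each condition needs its own parentheses if written inline
kept = trips[has_passenger & fare_ok]
print(kept)

# In this toy case the zero-distance rows went away with the bad passenger/fare rows
print((kept["trip_distance"] == 0).sum())
```

In the real data the obvious rules won't always clear out everything, which is why we re-check with `describe()` after each filter.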

In [None]:
# Filtering out bad values is pretty easy.  
# A basic version looks like this (shown for illustration only, not run here):
# rides = rides[rides['trip_distance'] >= 0.25]

# But you can string conditions together and knock out everything in one go 
rides = rides[(rides['passenger_count'] >= 1) &
              (rides['fare_amount'] >= 2.5)]
rides.describe()

In [None]:
# Unfortunately, that didn't solve all of our issues. We still have trip distances equal to 0, and unreasonably high distances and fare amounts. 
# Next, we can at least eliminate distances of 0
rides = rides[rides['trip_distance'] > 0]
rides.describe()



In [None]:
# At least we have dealt with the $430k fare, but we still have a trip distance > 10k miles
# So, let's look at that datapoint and see whether it seems reasonable or not
rides[rides['trip_distance'] > 10000]

In [None]:
# Looking at it, the fare amount is $15.99, which seems inexpensive for a 10k mile ride! In all likelihood, there was a data entry error in the distance, so we can remove that data point
rides = rides[rides['trip_distance'] < 10000]
rides.describe()

In [None]:
# Now things are starting to look more realistic. There's a 400 mile ride, but also a $4000 fare (which are likely the same ride) and that *could* happen
# Let's look at those data points:
rides[rides['trip_distance'] > 400]


The fare amount does not make sense for that distance, so we have to remove it!

In [None]:
rides = rides[rides['trip_distance'] < 400]

In [None]:
# How about the fare_amount larger than 4000?
# If the trip_distance does not make sense for the fare_amount, we should remove it.
#Question 5: What is the trip_distance of the ride with a fare_amount larger than $4000?


In [None]:
# Remove all the rows with fare_amount larger than 4000
...



Let's take another look at our data

In [None]:
rides.describe()

In [None]:
# Now the max distance and fares are ~300 and 800, so let's check again to see if that's the same entry
rides[rides['trip_distance'] > 300]

In [None]:
# Here it does seem to be a legitimate entry, so we'll leave it. 
# As a sort of 'gut check' to make sure we're not missing any issues with distances below 300 that *are* problematic, we can do this:
rides[rides['trip_distance'] > 100]

A quick scan of the data shows high fares associated with long distances, which suggests the data are *probably* ok. 

In reality, when building a real pipeline to extract and clean these data more automatically, we'd probably use an algorithm comparing distance and fare for each entry (using the official rates as an estimated min) as a way of removing data entry errors of this sort. But going through these manually as a first step allows you to understand the data and the types of errors that might occur and ultimately will help you figure out the best way of doing this automatically. 
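
A very rough sketch of such a distance-versus-fare check is below. The $2.50 initial charge comes from the fare page linked earlier; the per-mile rate and the tolerance are assumptions chosen only for illustration, so check the official rates before using anything like this.

```python
import pandas as pd

INITIAL_CHARGE = 2.50      # from the fare page cited earlier
ASSUMED_PER_MILE = 2.50    # assumption for illustration; check the official rate
TOLERANCE = 0.5            # hypothetical slack: flag fares below 50% of the estimate

trips = pd.DataFrame({
    "trip_distance": [2.0, 10000.0, 30.0],
    "fare_amount":   [9.0, 15.99, 80.0],
})

# Estimated minimum fare for the recorded distance
expected_min = INITIAL_CHARGE + ASSUMED_PER_MILE * trips["trip_distance"]

# Flag rows where the fare is far too low for the distance
trips["suspicious"] = trips["fare_amount"] < TOLERANCE * expected_min
print(trips)
```

Only the 10,000 mile trip with a $15.99 fare gets flagged here. Flat-rate trips and negotiated fares would need their own handling, which is part of why the manual first pass is useful.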

Generally, as a DE you're trying to balance 1) having accurate data with 2) removing too much data, 3) how much time you spend cleaning these up, and 4) what the end user needs. In all cases, the output produced will be a compromise along all of these axes.

### Convert to datetime and make triptime column

Now that we have only realistic values we can start making our columns


In [None]:
# Let's check the datatype of tpep_pickup_datetime and tpep_dropoff_datetime to see if they are datetimes 
# Check your datatypes 
rides.dtypes

It looks like we are OK with datatypes

One issue that will frequently arise is dates that are wrong.  These data should just be in May of 2019 and 2020.  But I bet if we ask what the min pickup date is we'll get something earlier.  And I'm sure the max is later.  So we need to filter out those values! 

**Note:** We want to keep two months.  This means keeping dates that are on or after May 1st, 2019 **and** on or before May 31st, 2019 **OR** on or after May 1st, 2020 **and** on or before May 31st, 2020.  The or operator is simply a pipe `|`.  Because `&` binds more tightly than `|`, each month's pair of conditions is evaluated together first. 
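
Here is a small sketch of that and/or logic on a few made-up timestamps. Building one mask per month and then combining them with `|` keeps the precedence easy to read. This sketch uses a "before June 1st" upper bound, which is one way to include all of May 31st.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2008-12-31 23:05:00",   # clearly a bad timestamp
    "2019-05-15 08:30:00",
    "2019-06-01 00:10:00",   # just outside May 2019
    "2020-05-31 23:59:00",
]))

in_2019 = (pickups >= "2019-05-01") & (pickups < "2019-06-01")
in_2020 = (pickups >= "2020-05-01") & (pickups < "2020-06-01")

keep = in_2019 | in_2020     # a row is kept if it falls in either May
print(pickups[keep])
```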

In [None]:
# Check
print(rides['tpep_pickup_datetime'].min())
print(rides['tpep_pickup_datetime'].max())

In [None]:
# Clearly out of range. So let's filter again
rides = rides[((rides['tpep_pickup_datetime'] >= '2019-05-01 00:00:00') &
               (rides['tpep_pickup_datetime'] <= '2019-05-31 23:59:59')) |  # note me using '|' for OR
              ((rides['tpep_pickup_datetime'] >= '2020-05-01 00:00:00') &
               (rides['tpep_pickup_datetime'] <= '2020-05-31 23:59:59'))]
print(rides['tpep_pickup_datetime'].min())
print(rides['tpep_pickup_datetime'].max())

In [None]:
# Do the same for the 'tpep_dropoff_datetime'
print(rides['tpep_dropoff_datetime'].min())
...

#Question 6: What is the maximum dropoff time after cleaning?

Datetime conversion can get a bit tricky.  Datetimes get stored in lots of different ways and the methods for each way can differ a bit.  [The pandas documentation is a good guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html).

Here we're going to subtract the pickup time from the dropoff time and store the result in a new column called `trip_time`.

That operation gives you a timedelta datatype such as `timedelta64[ns]`, meaning it's specific to time differences and stored in nanoseconds (ns).  We want a plain number instead, so we need to convert it to seconds.  On older pandas versions `astype('timedelta64[s]')` returned the number of seconds as a float; on pandas 2.0 and later it keeps the column as a timedelta, so the `.dt.total_seconds()` accessor is the more reliable way to get seconds. 
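
As a minimal sketch with two made-up rides, here is how a time difference turns into a plain number of minutes using `.dt.total_seconds()`:

```python
import pandas as pd

pickup = pd.Series(pd.to_datetime(["2019-05-01 08:00:00", "2019-05-01 09:15:30"]))
dropoff = pd.Series(pd.to_datetime(["2019-05-01 08:12:00", "2019-05-01 09:45:30"]))

delta = dropoff - pickup                    # timedelta dtype
minutes = delta.dt.total_seconds() / 60     # plain floats, in minutes

print(delta.dtype)
print(minutes.tolist())   # [12.0, 30.0]
```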

In [None]:
# First make a new column called trip_time.  That should be the dropoff minus the pickup time

rides['trip_time'] = ... - ...

# Check the new column.  Note the dtype
...

In [None]:
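# Now let's convert it to seconds and also divide by 60 to make it into minutes.
# .dt.total_seconds() gives plain floats regardless of pandas version
# (on pandas 2.0+, astype('timedelta64[s]') keeps the column as a timedelta instead).
rides['trip_time'] = rides['trip_time'].dt.total_seconds()/60
rides.describe() # Check it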

We clearly have to do one last filter to fix the bad trip times. Normally we'd go through the same steps as above of checking the data bit by bit, but to save time here we'll assume that a realistic trip is at least two minutes and at most 120 minutes (this will likely remove more data than necessary, but that's ok for this example).

In [None]:
# It never ends!
# Do one more filter where you filter your dataframe down such that it...
# Only has trip times greater than or equal to 2 and less than or equal to 120

... = ...[(...['...'] >= ...) & (...['...'] <= ...)]
# Verify 
rides.describe()

#Question 7: What is the average (mean) trip_time after filtering?

### Making other metrics

Let's now make our metrics.  The exact metrics depend on the use case of the dataset.  Here we want to see how rider properties change, so we'll look at things like overall ridership, usage, as well as how they behave via tipping. We'll make the following:

- A speed column.  We'll divide distance by time.  Will speed increase during Covid?  
- A tip per distance column.  We'll see if people tip more during Covid.
- We'll also make a binary column if they tipped that we'll count up later to figure out the percentage of people tipping. 

Note, it's important to think about the order in which you're doing these tasks.  For example, if we hadn't filtered out those negative or zero distance trips, it would cause problems when making these columns, since dividing by zero produces infinite values. 
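
A quick sketch of what goes wrong if the zeros are left in (toy values): pandas doesn't raise an error on division by zero, it quietly produces `inf`, which then distorts any mean you calculate later.

```python
import numpy as np
import pandas as pd

trips = pd.DataFrame({
    "trip_distance": [3.0, 0.0, 2.0],
    "trip_time":     [15.0, 10.0, 0.0],   # minutes
    "tip_amount":    [2.0, 1.0, 0.0],
})

speed = trips["trip_distance"] / (trips["trip_time"] / 60)     # miles per hour
tip_per_mile = trips["tip_amount"] / trips["trip_distance"]

print(speed.tolist())
print(tip_per_mile.tolist())
print(np.isinf(speed).sum(), np.isinf(tip_per_mile).sum())     # one inf in each
```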

In [None]:
# Speed column first.  
# This is relatively simple. 
# The only tricky thing is that our time is in minutes but we'll want speed in miles per hour

rides['trip_speed'] = rides['trip_distance']/(rides['trip_time']/60)
rides['trip_speed']

Now let's make a feature of how many dollars are tipped per mile traveled.  Are people being more or less generous with tipping? 

In [None]:
# Make a column 'tip_per_distance' where you divide the tip amount by trip distance
rides['...'] = ... / ...


OK, now let's look at some summary stats again!

In [None]:
#Question 8: What is the maximum tip per distance?
# use describe

Clearly we need to filter some more. Here again, we're going to make some very rough assumptions that normally we'd need to dig into the data more instead.

However, for this exercise we'll assume that trip speed should be at most 75 mph and tip per distance at most $10 per mile.

In [None]:
# Filter the data frame so that trip_speed is at most 75 and tip_per_distance is at most 10
rides = rides[(rides['trip_speed'] <= 75) & 
              (rides['tip_per_distance'] <= 10)]
rides.describe()

And finally let's just make a binary of if they tipped or not. 

Remember `np.where()` is nice for these if/else types of statements.  The first argument is the condition you're checking.  The second is the value to use if that condition is met.  The last is the value to use if it isn't.  
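
Here is `np.where()` on a few made-up tip amounts, next to an equivalent shortcut that works when the two outputs are just 1 and 0:

```python
import numpy as np
import pandas as pd

tips = pd.Series([0.0, 2.5, 0.0, 1.0])

tipped = np.where(tips > 0, 1, 0)       # condition, value if True, value if False
same_thing = (tips > 0).astype(int)     # equivalent when the outputs are just 1/0

print(tipped.tolist())       # [0, 1, 0, 1]
print(same_thing.tolist())   # [0, 1, 0, 1]
```

`np.where()` is the more general tool, since the two values can be anything (strings, other columns, and so on).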

In [None]:
# If tip_amount is greater than zero then 1 (they tipped).
# if it's not, then 0 (they didn't tip).
rides['tipped'] = np.where(rides['tip_amount'] > 0, 1, 0)

## Aggregating
 
Wow, so a lot of wrangling to get the data into shape.  Again, our goal is to make this usable in a dashboard or a report that answers business questions.  As a DE your job would be to build the pipeline that does all these transforms and then uploads the result somewhere.  
 
So now let's go and aggregate our data into daily averages as that's likely what they'd be interested in.  

There are two main parts to aggregating used here.

- The `groupby()`: This is saying what column you want to collapse down into groups and calculate summary stats for.  In this case I'm calling the pickup time, then using `.dt` which allows us to access properties of the datetime column. That's followed up by `.date` which means we want to group by individual dates.  You could do other levels if you wanted such as second, hour, or even quarter. 

- The `.agg()`: This is saying how you want to aggregate within the groups (in this case individual dates).  Inside `.agg()` you provide a dictionary where the key is the column in the original data you want to work with and the value is a list of functions you want to apply to that column.  So if you had just `{'passenger_count': ['mean']}` you'd be asking for it to calculate the mean passenger count for each distinct group (in this case days). 

This is a semi-complex groupby given we want a lot of things, but I think you can handle it!  I encourage you to go and play with it a bit to see what different levels you can aggregate by and what types of functions you can apply. 
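
If you want something small to experiment with first, here is the same pattern on three made-up rides, grouped once by date and once by hour:

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-05-01 08:00", "2019-05-01 18:30", "2019-05-02 09:00",
    ]),
    "passenger_count": [1, 3, 2],
    "fare_amount": [10.0, 20.0, 8.0],
})

# One row per calendar date; two functions applied to fare_amount
by_date = trips.groupby(trips["tpep_pickup_datetime"].dt.date).agg(
    {"passenger_count": ["mean"], "fare_amount": ["count", "sum"]}
)

# Same idea at a different level: one row per hour of the day
by_hour = trips.groupby(trips["tpep_pickup_datetime"].dt.hour).agg(
    {"fare_amount": ["mean"]}
)

print(by_date)
print(by_hour)
```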


In [None]:
rides_daily = rides.groupby(rides['tpep_pickup_datetime'].dt.date).agg({'passenger_count': ['mean'], 
                                                                        'trip_distance': ['mean'], 
                                                                        'trip_speed': ['mean'],
                                                                        'tip_amount': ['mean'],
                                                                        'tip_per_distance': ['mean'],
                                                                        'fare_amount':['count'],
                                                                        'tipped': ['sum']})
rides_daily # look at it!

Cool!  So we have 62 rows as we'd expect for two different May months.
 
**Note** this groupby took the pickup time and turned it into our index!
 
The only issue is now we have an annoying 'multi index' for the column names.  Essentially to access one we'd have to do something like `rides_daily['passenger_count']['mean']`.  That's annoying, so let's rename quick.

Renaming columns can be done if you call `.columns` on the left of the `=`.  What this does is then access the column names of the dataframe and replaces them with the list of new names that are right of the `=`. 
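
Typing the names by hand works fine here. As a sketch of two alternatives on toy data: you can build flat names from the two levels, or use pandas' named aggregation to get flat names straight out of `.agg()`.

```python
import pandas as pd

trips = pd.DataFrame({
    "day": ["a", "a", "b"],
    "passenger_count": [1, 3, 2],
    "fare_amount": [10.0, 20.0, 8.0],
})

daily = trips.groupby("day").agg({"passenger_count": ["mean"], "fare_amount": ["count"]})
print(daily.columns.tolist())   # list of (column, function) pairs

# Option 1: build flat names from the two levels instead of typing them by hand
flat = daily.copy()
flat.columns = ["_".join(pair) for pair in daily.columns]

# Option 2: named aggregation gives flat, custom names directly
named = trips.groupby("day").agg(
    mean_pass=("passenger_count", "mean"),
    total_rides=("fare_amount", "count"),
)
print(flat.columns.tolist(), named.columns.tolist())
```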

In [None]:
rides_daily.columns = ['mean_pass', 'mean_dist', 'mean_speed', 'mean_tip', 'mean_tip_dist', 'total_rides', 'number_tipped']
rides_daily

Now that we have our grouped data we want to make one last feature of the percent of riders that tipped.  Given we counted up a binary column in our `.agg`, we can just divide the `number_tipped` column by the `total_rides` column and multiply by 100 to get the units into percentages. 

In [None]:
# Make percent column.
rides_daily['percent_tipped'] = rides_daily['number_tipped']/rides_daily['total_rides']*100
rides_daily

In [None]:
#Question 9: What is the percent_tipped for 2020-05-30?


## Graphing and making some inference
 
Phew!  So, lots of data processing to get to this point.  Let's make a visualization like someone who was running operations would.  For example, how does speed change from day to day in both months? 
 
Since we'll want to graph both lines on one plot over the month, we'll first need to make a new column for the day of the month, and then split our data into 2019 and 2020.

OK, let's make our day column.  Remember that when we did our aggregation our time value became the index.  So we need to call the index with `rides_daily.index` inside of `to_datetime()`.  As it's a datetime we can just ask it for the actual day using `.day`

In [None]:
# Make day column.
# This is saying 'pandas, convert our index to a datetime and extract the day'.
rides_daily['day'] = pd.to_datetime(rides_daily.index).day
rides_daily

Now we'll slice into years using our index

This is pretty straightforward in that we're going to use `.loc[]` to slice by the index like we did last week. So the format is:
`new_df = df.loc[start_date : end_date]`

The only tricky part is that you can't just pass a date string inside the slice.  Our index holds Python `date` objects, so you need to build each endpoint with `pd.to_datetime(...).date()` so the types match!
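
A small sketch of that slice on a made-up daily table whose index holds `date` objects, just like the index our groupby produced:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2019-05-01 10:00", "2019-05-31 12:00", "2020-05-01 09:00", "2020-05-02 09:00",
]))
daily = pd.DataFrame({"total_rides": [4, 5, 1, 2]}, index=list(stamps.dt.date))

start = pd.to_datetime("2019-05-01").date()
end = pd.to_datetime("2019-05-31").date()

may_2019 = daily.loc[start:end]      # label slices include both endpoints
print(may_2019)
```

Unlike regular Python slicing, a `.loc[]` label slice includes the end value, which is why May 31st is kept.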

In [None]:
# OK let's do it!
rides_daily_2019 = rides_daily.loc[pd.to_datetime('2019-05-01').date():pd.to_datetime('2019-05-31').date()]
rides_daily_2020 = rides_daily.loc[pd.to_datetime('2020-05-01').date():pd.to_datetime('2020-05-31').date()]

# now check
rides_daily_2020.head()

OK, let's first see how trip distance changed in response to Covid

In [None]:
plt.plot(rides_daily_2019['day'], rides_daily_2019['mean_dist'], label = 2019)
plt.plot(rides_daily_2020['day'], rides_daily_2020['mean_dist'], label = 2020)
plt.xlabel('date of month')
plt.ylabel('average distance of trips')
plt.legend()
plt.show()

Interestingly, it seems like people were making shorter trips during Covid. We can imagine several plausible reasons for this, such as people simply moving around the city less and staying within their neighborhoods more, or not being comfortable spending longer periods of time in a cab and wanting to limit their risk of infection.

You can see there are a lot of spikes that don't align, but that's just because the weekends fall on different dates. 
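
To see the misalignment for yourself, this short sketch lists which days of the month fall on a weekend in each year:

```python
import pandas as pd

days_2019 = pd.date_range("2019-05-01", "2019-05-31")
days_2020 = pd.date_range("2020-05-01", "2020-05-31")

print(days_2019[0].day_name(), days_2020[0].day_name())   # Wednesday Friday

# Day-of-month numbers that fall on a Saturday or Sunday in each year
weekend_2019 = days_2019[days_2019.dayofweek >= 5].day.tolist()
weekend_2020 = days_2020[days_2020.dayofweek >= 5].day.tolist()
print(weekend_2019)
print(weekend_2020)
```

If you wanted the weekend dips to line up on the plot, one option would be to plot against the day of the week or week number rather than the day of the month.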

How does speed change? 



In [None]:
# Make a plot looking at average speed of trips over the dates of both years. 


Wow, a lot!  

Let's see how number of passengers differs

In [None]:
# Make a plot looking at how the average number of passengers differs
#Question 10: Is the daily average number of passengers in 2019 ALWAYS higher than on the same day in 2020?

Clearly lots more solo rides!

Are people more or less generous to taxi drivers in a pandemic? Let's look at the percentage of tips

In [None]:
# Make a plot looking at tip percentage over the month with each year as its own line. 
plt.plot(rides_daily_2019['day'], rides_daily_2019['percent_tipped'], label = 2019)
plt.plot(rides_daily_2020['day'], rides_daily_2020['percent_tipped'], label = 2020)
plt.xlabel('date of month')
plt.ylabel('% tipped')
plt.legend()
plt.show()

# Conclusion

This is a pretty big lesson that covers a whole bunch of operations that are used for data cleaning.  You might not need to do all this, or you might need to do more.  As you can see, your first pass through transforming the data involves a lot of back-and-forth as you identify issues. Once you've figured them out, it's much easier to streamline the steps and turn them into functions (see below).  

Next week we'll get into working with JSON data and strings.    

## Converting to functions
Here I just quickly went and converted the script above into several functions.  I then call in a new dataset and apply them.  

You should **definitely** play around with them and try different filtering options to see how it affects the output. You should see some significant changes in what the data say depending on the choices you make. 

**This is an essential point! Choosing the correct input data is the difference between having a reliable output or not**

In [None]:
def drop_na_and_cols(df):
  df = df[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'fare_amount', 'tip_amount']].dropna()
  return(df)

In [None]:
def filter_to_real(df):
  df = df[(df['passenger_count'] >= 1) &
              (df['trip_distance'] >= 0.25) &
              (df['trip_distance'] <= 15) &
              (df['fare_amount'] >= 3) &
              (df['fare_amount'] <= 100) &
              (df['tip_amount'] >= 0) &
              (df['tip_amount'] <= 20) ]
  return(df)

In [None]:
# Note I added start_date and end_date arguments here so you can specify the date range of the data
def make_features(df, start_date, end_date):
  df = df.copy()  # work on a copy so the caller's dataframe isn't modified
  df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
  df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
  # .copy() after filtering avoids pandas' SettingWithCopyWarning when adding columns below
  df = df[(df['tpep_pickup_datetime'] >= start_date) & (df['tpep_pickup_datetime'] <= end_date)].copy()
  df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
  # total_seconds() returns plain floats on any pandas version; divide by 60 for minutes
  df['trip_time'] = df['trip_time'].dt.total_seconds()/60
  df = df[(df['trip_time'] >= 2) & (df['trip_time'] <= 120)].copy()
  df['trip_speed'] = df['trip_distance']/(df['trip_time']/60)
  df['tip_per_distance'] = df['tip_amount'] / df['trip_distance']
  df = df[(df['trip_speed'] <= 75) & (df['tip_per_distance'] <= 10)].copy()
  df['tipped'] = np.where(df['tip_amount'] > 0, 1, 0)
  return(df)


In [None]:
def agg_and_rename(df):
  df_daily = df.groupby(df['tpep_pickup_datetime'].dt.date).agg({'passenger_count': ['mean'], 
                                                                        'trip_distance': ['mean'], 
                                                                        'trip_speed': ['mean'],
                                                                        'tip_amount': ['mean'],
                                                                        'tip_per_distance': ['mean'],
                                                                        'fare_amount':['count'],
                                                                        'tipped': ['sum']})
  df_daily.columns = ['mean_pass', 'mean_dist', 'mean_speed', 'mean_tip', 'mean_tip_dist', 'total_rides', 'number_tipped']
  df_daily['percent_tipped'] = df_daily['number_tipped']/df_daily['total_rides']*100
  return(df_daily)


In [None]:
rides_03_2020 = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-03.parquet")
n_ride = drop_na_and_cols(rides_03_2020)
n_ride = filter_to_real(n_ride)
n_ride = make_features(n_ride, start_date= '2020-03-01 00:00:00', end_date= '2020-03-31 23:59:59')
n_ride = agg_and_rename(n_ride)
n_ride