In [2]:
import pandas as pd

## How to Clean Data
A fundamental step towards a more detailed and precise analysis is cleaning the initial data. Among the millions of entries we found many implausible values, and we decided to exclude them from our analysis.

We dropped every entry that matches at least one of the following conditions:
* The trip distance is less than or equal to zero.
* The pickup time is not strictly earlier than the dropoff time.
* The total amount is below 3.30 dollars. This threshold comes from the [legend file](http://www.nyc.gov/html/tlc/downloads/pdf/taxi_information.pdf) for taxis in New York, which states that there is an initial charge of 2.50 dollars plus a 50-cent State surcharge and a 30-cent improvement surcharge.
* The pickup date is wrong. For each entry we checked that the year is 2018 and that the month matches the one being analyzed.
* The trip lasts more than 3 hours. We found some trips with a very long duration (i.e. more than 5 hours) and decided that a taxi trip cannot reasonably last that long.
* The trip distance is 200 miles or more. Some rides were recorded with hundreds of miles as distance.
* The price per mile is greater than 17.5 dollars.
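
As a rough illustration of how these rules combine, the sketch below applies them as named boolean masks to a tiny invented table. The rows in `toy` are made up for this example, and the names `rules`, `flags`, `per_rule` and `kept` are hypothetical helpers, not part of the analysis itself. Only the column names and thresholds are taken from the rules above.

```python
import pandas as pd

# Invented rows for illustration only; column names follow the taxi CSV.
toy = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2018-01-05 10:00:00",  # valid trip
        "2018-01-05 11:00:00",  # dropoff before pickup
        "2017-12-31 23:50:00",  # wrong year and month
        "2018-01-06 08:00:00",  # lasts 4.5 hours
        "2018-01-07 09:00:00",  # 520 dollars per mile
    ],
    "tpep_dropoff_datetime": [
        "2018-01-05 10:20:00",
        "2018-01-05 10:30:00",
        "2018-01-01 00:10:00",
        "2018-01-06 12:30:00",
        "2018-01-07 09:05:00",
    ],
    "trip_distance": [3.0, 2.0, 4.0, 30.0, 0.1],
    "total_amount": [14.0, 10.0, 18.0, 120.0, 52.0],
})

pickup = pd.to_datetime(toy["tpep_pickup_datetime"])
dropoff = pd.to_datetime(toy["tpep_dropoff_datetime"])
minutes = (dropoff - pickup).dt.total_seconds() / 60

# One boolean mask per rule: True means "this row violates the rule".
rules = {
    "non-positive distance": toy["trip_distance"] <= 0,
    "dropoff not after pickup": dropoff <= pickup,
    "total below 3.30": toy["total_amount"] < 3.3,
    "wrong year or month": (pickup.dt.year != 2018) | (pickup.dt.month != 1),
    "longer than 3 hours": minutes > 180,
    "200 miles or more": toy["trip_distance"] >= 200,
    "over 17.5 dollars per mile": toy["total_amount"] / toy["trip_distance"] > 17.5,
}
flags = pd.DataFrame(rules)

per_rule = flags.sum()            # how many rows each rule flags
kept = toy[~flags.any(axis=1)]    # rows that violate no rule

print(per_rule)
print(len(kept), "row(s) kept")
```

Keeping one mask per rule makes it easy to see which rule is responsible for most of the dropped rows before combining them into a single filter.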

The initial data had $53925735$ rows and the cleaned data has $52239189$.

In total, $1686546$ rows were dropped.
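
The row counts above can be cross-checked with a couple of lines of arithmetic. The variable names below are only illustrative, and the numbers are the ones quoted in the text.

```python
initial_rows = 53925735   # rows before cleaning (from the text above)
cleaned_rows = 52239189   # rows after cleaning (from the text above)

dropped_rows = initial_rows - cleaned_rows
dropped_share = dropped_rows / initial_rows

print(dropped_rows)             # 1686546
print(f"{dropped_share:.2%}")   # 3.13%
```

So the cleaning removes roughly 3% of the entries.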

In [3]:
# retrieve the data to be cleaned
raw_data = pd.read_csv("yellow_tripdata_2018-01.csv", nrows=10000)

# parse the timestamps once and compute the trip duration in minutes
# (total_seconds() is used because .astype('timedelta64[m]') is rejected by recent pandas versions)
pickup = pd.to_datetime(raw_data['tpep_pickup_datetime'])
dropoff = pd.to_datetime(raw_data['tpep_dropoff_datetime'])
raw_data['delta'] = (dropoff - pickup).dt.total_seconds() / 60

# flag every row that violates at least one cleaning rule
invalid = (
    (raw_data.trip_distance <= 0)
    | (raw_data.trip_distance >= 200)
    | (dropoff <= pickup)
    | (raw_data.delta > 180)
    | (pickup.dt.year != 2018)
    | (pickup.dt.month != 1)
    | (raw_data.total_amount < 3.3)
    | ((raw_data.total_amount / raw_data.trip_distance) > 17.5)
)

# keep only the valid rows
data = raw_data[~invalid]

# free memory by deleting raw_data
del raw_data