# Part 1: Data preprocessing


This project deals with a lot of data that can be too much for a personal computer to handle effectively. For instance, each month of yellow taxi data is about 2GB in size. Therefore, it’s imperative that you remove any unnecessary and invalid data, use appropriate data types when possible, and take samples.
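
As a rough illustration of how column selection and narrower data types reduce memory use, the sketch below reads a tiny in-memory CSV twice with pandas. The column names and the chosen dtypes are assumptions made for the example, not the actual Yellow Taxi schema.

```python
import io

import pandas as pd

# Tiny stand-in for a monthly CSV; the column names are hypothetical.
RAW_CSV = """pickup_datetime,passenger_count,fare_amount,notes
2015-01-01 00:10:00,1,7.5,a
2015-01-01 00:20:00,2,12.0,b
2015-01-01 00:30:00,1,5.3,c
"""


def load_trimmed(csv_text):
    """Read only the needed columns and store them with compact dtypes."""
    return pd.read_csv(
        io.StringIO(csv_text),
        usecols=["pickup_datetime", "passenger_count", "fare_amount"],
        dtype={"passenger_count": "int8", "fare_amount": "float32"},
        parse_dates=["pickup_datetime"],
    )


full = pd.read_csv(io.StringIO(RAW_CSV))
trimmed = load_trimmed(RAW_CSV)

# Compare the memory footprint of the naive and the trimmed frame (in bytes).
print(full.memory_usage(deep=True).sum(), trimmed.memory_usage(deep=True).sum())
```

On a real monthly file the same idea applies, with the column list taken from whatever the later parts of the project actually need.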

For this part, you will need to use the requests, BeautifulSoup, and pandas packages to help you programmatically download and clean every Yellow Taxi CSV file needed. Cleaning the data includes: removing unnecessary columns and invalid data points, normalizing column names, and removing trips that start and/or end outside of the following latitude/longitude coordinate box: (40.560445, -74.242330) and (40.908524, -73.717047). 

While you will not need to programmatically download the Uber data, you will need to load it in from your computer, and clean the dataset as you did with the Yellow Taxi datasets.

Each month of Yellow Taxi data contains millions of trips. However, the provided Uber dataset is only a sampling of all data. Therefore, you will need to generate a sampling of Yellow Taxi data that’s roughly equal to the sample size of the Uber dataset.

Also within this part, define a function that calculates the distance between two coordinates in kilometers that only uses the `math` module from the standard library. Write at least one unit test that tests this calculation function. 
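
A minimal sketch of such a function, using only `math`, might look like the following. The function name, the constant name, and the test cases are illustrative choices; the radius is an assumed mean Earth radius in kilometers.

```python
import math

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius in km


def haversine_km(from_coord, to_coord):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, from_coord)
    lat2, lon2 = map(math.radians, to_coord)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def test_haversine_km():
    # Identical points are zero kilometers apart.
    assert haversine_km((40.7, -74.0), (40.7, -74.0)) == 0.0
    # One degree of longitude along the equator spans R * pi / 180 km.
    expected = EARTH_RADIUS_KM * math.pi / 180
    assert math.isclose(haversine_km((0.0, 0.0), (0.0, 1.0)), expected, rel_tol=1e-9)


test_haversine_km()
```

Testing against values that follow directly from the formula keeps the unit test independent of any third-party package.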

Using that function that calculates the distance in kilometers between two coordinates, add a column to each dataset that contains the distance between the pickup and dropoff location.

Finally, load in the weather datasets from your computer, and clean each dataset, including only the dates & columns needed to answer the questions in the other parts of the project.

Tips:
Downloading Yellow Taxi data can take a while per file since each file is so large. Consider saving the sample data for each month to your computer in case you need to step away, and load it back in when you return.
Relatedly, make use of your .gitignore file to avoid committing the Yellow Taxi sample dataset CSV files to your repo.
Read ahead to figure out which columns are absolutely necessary for each dataset.
Be mindful of the data types for each column, which will make it easier for yourself when storing and filtering data later on.
Use the re module to help pull out the desired links for Yellow Taxi CSV files.
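
The last tip could be sketched as follows. This is only one possible approach: the helper name and the year-range parameters are hypothetical, and the green-taxi and 2016 links in the sample list are made up to show what gets filtered out.

```python
import re

# Matches file names such as yellow_tripdata_2015-01.csv and captures year and month.
YELLOW_CSV = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})\.csv$")


def filter_yellow_links(hrefs, first_year=2009, last_year=2015):
    """Return sorted (year, month, url) tuples for Yellow Taxi CSV links in the year range."""
    selected = []
    for href in hrefs:
        match = YELLOW_CSV.search(href)
        if match and first_year <= int(match.group(1)) <= last_year:
            selected.append((int(match.group(1)), int(match.group(2)), href))
    return sorted(selected)


links = [
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv",
    "https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-01.csv",   # made-up link
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv",  # made-up link
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2009-12.csv",
]
print(filter_yellow_links(links))
```

Capturing the year and month with a pattern avoids relying on fixed character offsets from the end of the URL.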


In [1]:
import requests
import csv
import numpy as np
import pandas as pd
import urllib.request
import datetime 
import io

import bs4

import math

import geopy.distance
from pathlib import Path
import matplotlib.pyplot as plt


# Defining function to calculate the distance

In [2]:


def calculate_distance(from_coord, to_coord):
    # Haversine distance, using only the standard-library math module
    def deg2rad(deg):
        return deg * (math.pi/180)
    def hav(theta):
        return math.sin(theta/2)**2

    from_coord = list(from_coord)
    to_coord = list(to_coord)
    d = []
    R = 6371.009 # Radius of the earth in km
    
    for i in range(len(from_coord)):
        x1 = from_coord[i][0] # start latitude
        y1 = from_coord[i][1] # start longitude
        x2 = to_coord[i][0]   # end latitude
        y2 = to_coord[i][1]   # end longitude
        dLat = deg2rad(x2-x1)
        dLon = deg2rad(y2-y1) 
        # cos(lat1)*cos(lat2) is written as 1 - hav(lat1 - lat2) - hav(lat1 + lat2)
        c = math.asin(math.sqrt(hav(dLat) + (1 - hav(deg2rad(x1 - x2)) - hav(deg2rad(x1 + x2)))*hav(dLon)))
        d.append(2 * R * c) #Distance in km
    return d

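def test_calculate_distance():
    coords_1 = (52.2296756, 21.0122287)
    coords_2 = (52.406374, 16.9251681)

    # Reference value from geopy's ellipsoidal (geodesic) distance
    expected = geopy.distance.geodesic(coords_1, coords_2).km
    print (expected)

    # The haversine formula assumes a sphere, so allow a 1% deviation
    actual = calculate_distance([coords_1], [coords_2])[0]
    assert abs(actual - expected) / expected < 0.01

test_calculate_distance()
from_coord = [[52.2296756, 21.0122287]]
to_coord = [[52.406374, 16.9251681]]
calculate_distance(from_coord, to_coord)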

279.35290160430094


[278.45856843965987]

# Getting the number of samples of each year and month of the uber rides.

In [3]:

#Getting the number of samples of each year and month of the uber rides.
file = "uber_rides_sample.csv"
df_uber = pd.read_csv(file)


initial_year = 2009
ending_year = 2015

initial_month = 1
ending_month = 12

counter_sample = {}

df_uber['Year'] = df_uber['key'].str.slice(0, 4)
df_uber['Month'] = df_uber['key'].str.slice(5, 7)

for year in range(initial_year, ending_year + 1):
    for month in range(initial_month, ending_month + 1):
        month_s = f"{month:02d}"  # zero-padded to match the "MM" slice of the key
        mask = (df_uber["Year"] == str(year)) & (df_uber["Month"] == month_s)
        counter_sample[f"{year}-{month_s}"] = int(mask.sum())

values = sum(counter_sample.values())
print("The total number of samples is: ",values)
#print(counter_sample)
df_uber

The total number of samples is:  200000


Unnamed: 0.1,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,Year,Month
0,24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1,2015,05
1,27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.994710,40.750325,1,2009,07
2,44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.740770,-73.962565,40.772647,1,2009,08
3,25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3,2009,06
4,17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5,2014,08
...,...,...,...,...,...,...,...,...,...,...,...
199995,42598914,2012-10-28 10:49:00.00000053,3.0,2012-10-28 10:49:00 UTC,-73.987042,40.739367,-73.986525,40.740297,1,2012,10
199996,16382965,2014-03-14 01:09:00.0000008,7.5,2014-03-14 01:09:00 UTC,-73.984722,40.736837,-74.006672,40.739620,1,2014,03
199997,27804658,2009-06-29 00:42:00.00000078,30.9,2009-06-29 00:42:00 UTC,-73.986017,40.756487,-73.858957,40.692588,2,2009,06
199998,20259894,2015-05-20 14:56:25.0000004,14.5,2015-05-20 14:56:25 UTC,-73.997124,40.725452,-73.983215,40.695415,1,2015,05


# Function to find the urls

In [4]:


#Function to find the urls
def find_taxi_csv_urls():
    years = [2009,2010,2011,2012,2013,2014,2015]
    TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
    content = requests.get(TAXI_URL)
    soup = bs4.BeautifulSoup(content.text, 'html.parser')
    
    anchors_yellow = []
    
    for anchor in soup.find_all("a"):
        # Keep Yellow Taxi links whose file name (..._YYYY-MM.csv) falls in the wanted years
        if anchor.get("title") == "Yellow Taxi Trip Records" and int(anchor["href"][-11:-7]) in years:
            anchors_yellow.append(anchor["href"])
    
    return anchors_yellow



urls = find_taxi_csv_urls()
urls


['https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-02.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-03.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-04.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-05.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-06.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-07.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-08.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-09.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-10.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv',
 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-01.csv',
 'https://s3.amazonaws.co

# Sampling the yellow taxi rides.


In [5]:
#Sampling the yellow taxi rides.


def sampling(urls):
    for url in urls:
        year = url[-11:-7]
        month = url[-6:-4]
        print(url)
        df = pd.read_csv(url, on_bad_lines='skip')
        print("The length of the original file is: ", len(df))

        df_taxi = df.sample(n = counter_sample[url[-11:-4]], random_state = 1)
        print("The length of the sample (year, month) = ({},{}) is {}".format(year,month,len(df_taxi)))   

        df_taxi.to_csv("taxi_rides_sample"+ year +"-"+month +".csv")
        print("ok")

        
urls = find_taxi_csv_urls()#[-23:-15]
print(urls)
#df = pd.read_csv(urls)
#Number 37 is missing. 2010 is missing
#sampling(urls)




['https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-02.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-03.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-04.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-05.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-06.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-07.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-08.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-09.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-10.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv', 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-01.csv', 'https://s3.amazonaws.com/nyc-tlc/tri

# Read csv

In [6]:
##Read csv

def read_csv_files(st):
    
    initial_year = 2009
    ending_year = 2015
    frames = []
    
    for year in range(initial_year, ending_year + 1):
        for month in range(1, 13):
            year_s = str(year)
            month_s = f"{month:02d}"
            file = "taxi_rides_sample"+ year_s +"-"+ month_s +".csv"
            # Skip months whose sample file has not been saved yet
            if Path(file).is_file():
                frames.append(pd.read_csv(file))
                if st:
                    print("(year, month) = ({}, {})".format(year_s,month_s))
    
    # Concatenate once at the end; this also works when the 2009-01 file is missing
    return pd.concat(frames)

df = read_csv_files(True)
df

(year, month) = (2009, 01)
(year, month) = (2009, 02)
(year, month) = (2009, 03)
(year, month) = (2009, 04)
(year, month) = (2009, 05)
(year, month) = (2009, 06)
(year, month) = (2009, 07)
(year, month) = (2009, 08)
(year, month) = (2009, 09)
(year, month) = (2009, 10)
(year, month) = (2009, 11)
(year, month) = (2009, 12)
(year, month) = (2010, 01)
(year, month) = (2010, 02)
(year, month) = (2010, 03)
(year, month) = (2010, 04)
(year, month) = (2010, 05)
(year, month) = (2010, 06)
(year, month) = (2010, 07)
(year, month) = (2010, 08)
(year, month) = (2010, 09)
(year, month) = (2010, 10)
(year, month) = (2010, 11)
(year, month) = (2010, 12)
(year, month) = (2011, 01)
(year, month) = (2011, 02)
(year, month) = (2011, 03)
(year, month) = (2011, 04)
(year, month) = (2011, 05)
(year, month) = (2011, 06)
(year, month) = (2011, 07)
(year, month) = (2011, 08)
(year, month) = (2011, 09)
(year, month) = (2011, 10)
(year, month) = (2011, 11)
(year, month) = (2011, 12)
(year, month) = (2012, 01)
(

Unnamed: 0.1,Unnamed: 0,vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,...,tip_amount,tolls_amount,total_amount,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,RateCodeID,extra,improvement_surcharge,RatecodeID
0,7561035,VTS,2009-01-18 03:05:00,2009-01-18 03:05:00,1.0,1.14,-73.988533,40.737097,,,...,,,,,,,,,,
1,12786419,CMT,2009-01-11 16:16:20,2009-01-11 16:24:44,1.0,1.90,-74.007916,40.725825,,,...,,,,,,,,,,
2,1633484,CMT,2009-01-27 22:31:35,2009-01-27 22:32:49,1.0,0.50,-73.957135,40.770662,,,...,,,,,,,,,,
3,10767030,CMT,2009-01-10 14:20:57,2009-01-10 14:31:58,2.0,2.20,-73.981873,40.748760,,,...,,,,,,,,,,
4,11727611,VTS,2009-01-18 00:48:00,2009-01-18 00:56:00,2.0,1.48,-73.993790,40.741527,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2156,587414,,,,,,,,,,...,,,,2.0,2015-06-01 12:19:07,2015-06-01 13:07:06,1.0,0.0,0.3,
2157,4963741,,,,,,,,,,...,,,,2.0,2015-06-17 08:56:36,2015-06-17 09:01:17,1.0,0.0,0.3,
2158,45385,,,,,,,,,,...,,,,1.0,2015-06-04 15:52:47,2015-06-04 15:58:48,1.0,1.0,0.3,
2159,5879609,,,,,,,,,,...,,,,2.0,2015-06-19 10:04:12,2015-06-19 10:22:39,1.0,0.0,0.3,


# Renaming the columns

In [None]:
###Renaming the columns

def df_rename(option, ref, df):
    # Fill gaps in each reference column with the values of its alternatively named column
    for opt_col, ref_col in zip(option, ref):
        df[ref_col] = df[ref_col].fillna(df[opt_col])
    return df


df = read_csv_files(False)
option = ["tolls_amount", "total_amount", "tpep_pickup_datetime", "tpep_dropoff_datetime", "tip_amount",  "passenger_count", "pickup_latitude", "pickup_longitude", "dropoff_longitude", "dropoff_latitude"]
ref = ["Tolls_Amt", "Total_Amt", "Trip_Pickup_DateTime", "Trip_Dropoff_DateTime", "Tip_Amt", "Passenger_Count", "Start_Lat", "Start_Lon", "End_Lon","End_Lat"]
df = df_rename(option, ref, df)

option = ["tolls_amount", "total_amount", "pickup_datetime", "dropoff_datetime", "tip_amount",  "passenger_count", "pickup_latitude", "pickup_longitude", "dropoff_longitude", "dropoff_latitude"]
ref = ["Tolls_Amt", "Total_Amt", "Trip_Pickup_DateTime", "Trip_Dropoff_DateTime", "Tip_Amt", "Passenger_Count", "Start_Lat", "Start_Lon", "End_Lon","End_Lat"]
df = df_rename(option, ref, df)
df = df[ref]



# Calculating distance and adding it as a column

In [None]:
#Calculating distance and adding it as a column
def limits_coord(df):
    lower_limit_x = 40.560445
    upper_limit_x = 40.908524
    lower_limit_y = -74.242330
    upper_limit_y = -73.717047
    df = df[df["Start_Lat"] > lower_limit_x]
    df = df[df["Start_Lat"] < upper_limit_x]
    df = df[df["End_Lat"] > lower_limit_x]
    df = df[df["End_Lat"] < upper_limit_x]    
    
    df = df[df["Start_Lon"] > lower_limit_y]
    df = df[df["Start_Lon"] < upper_limit_y]
    df = df[df["End_Lon"] > lower_limit_y]
    df = df[df["End_Lon"] < upper_limit_y]      
    
    return df

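def dist(df):
    # Copy first so that adding a column to a filtered frame does not raise SettingWithCopyWarning
    df = df.copy()
    from_coord, to_coord = df[["Start_Lat","Start_Lon"]].values, df[["End_Lat","End_Lon"]].values
    distance = calculate_distance(from_coord, to_coord)
    df["Distance"] = distance
    return df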

df = limits_coord(df)
df = dist(df)

df



# Renaming Uber data

In [None]:
#Renaming Uber data

def df_rename_uber(option, ref, df):
    for i in range(len(option)):
        print("(option, ref): ({},{})".format(option[i],ref[i]))
        df = df.rename(columns = {option[i]: ref[i]})
    return df

file = "uber_rides_sample.csv"
df_uber = pd.read_csv(file)

option = ["fare_amount", "pickup_datetime", "passenger_count", "pickup_latitude", "pickup_longitude", "dropoff_longitude", "dropoff_latitude"]
ref = ["Total_Amt", "Trip_Pickup_DateTime", "Passenger_Count", "Start_Lat", "Start_Lon", "End_Lon","End_Lat"]

df_uber = df_rename_uber(option, ref, df_uber)

df_uber

# Setting limits to the coordinates of Uber data

In [None]:
df_uber = limits_coord(df_uber)
df_uber = dist(df_uber)
df_uber

# Processing Weather Data



In [None]:
def clean_month_weather_data_hourly(csv_file):
    df = pd.read_csv(csv_file)
    lower_limit_x = 40.560445
    upper_limit_x = 40.908524
    lower_limit_y = -74.242330
    upper_limit_y = -73.717047
    df = df[df["LATITUDE"] > lower_limit_x]
    df = df[df["LATITUDE"] < upper_limit_x]
    df = df[df["LONGITUDE"] > lower_limit_y]
    df = df[df["LONGITUDE"] < upper_limit_y]    
    df = df[["DATE", "HourlyPrecipitation", "HourlyWindSpeed"]]
    df = df.fillna(0)
    return df

    
    
def clean_month_weather_data_daily(csv_file):
    df = pd.read_csv(csv_file)
    lower_limit_x = 40.560445
    upper_limit_x = 40.908524
    lower_limit_y = -74.242330
    upper_limit_y = -73.717047
    df = df[df["LATITUDE"] > lower_limit_x]
    df = df[df["LATITUDE"] < upper_limit_x]
    df = df[df["LONGITUDE"] > lower_limit_y]
    df = df[df["LONGITUDE"] < upper_limit_y]    
    df = df[["DATE", "DailyAverageWindSpeed", "DailyPrecipitation"]]
    df = df.fillna(0)
    return df
    

def load_and_clean_weather_data():
    hourly_dataframes = []
    daily_dataframes = []
    
    # add some way to find all weather CSV files
    # or just add the name/paths manually
    weather_csv_files = []
    for i in range(2009,2016):
        weather_csv_files.append(str(i)+"_weather.csv")
        
    
    for csv_file in weather_csv_files:
        hourly_dataframe = clean_month_weather_data_hourly(csv_file)
        daily_dataframe = clean_month_weather_data_daily(csv_file)
        hourly_dataframes.append(hourly_dataframe)
        daily_dataframes.append(daily_dataframe)
        
    # create two dataframes with hourly & daily data from every yearly file
    hourly_data = pd.concat(hourly_dataframes)
    daily_data = pd.concat(daily_dataframes)
    return hourly_data, daily_data

#clean_month_weather_data_daily("2009_weather.csv")
hourly_data, daily_data = load_and_clean_weather_data()

# Part 2: Storing Data
Using SQLAlchemy, create a SQLite database into which you’ll load your preprocessed datasets.

Create and populate four tables: one for your sampled datasets of Yellow Taxi trips, one for Uber trips, one for hourly weather information, and one for daily weather information. Use appropriate data types for each column. 

Create a schema.sql file that defines each table’s schema. You can use SQLAlchemy within the notebook to help generate this file, (added 2022-04-21) or another programmatic approach, or create this schema file by hand.

Tips (added 2022-04-21):
The first 48 lines of this gist are a good example of what makes up a schema file.
I should be able to run this schema file to create the tables in a database via the SQLite CLI tool. That is, I should be able to run the following command in a Jupyter notebook cell to create a database with the four required tables (it is not expected that you do this yourself for the project, but this is a good sanity check for it to succeed without error):

	!sqlite3 project.db < schema.sql
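
As a rough sketch of what such a schema could contain, the example below defines four tables and creates them in an in-memory database. It uses the standard-library `sqlite3` module rather than SQLAlchemy to stay self-contained, and every table and column name is a hypothetical placeholder to be adapted to the cleaned datasets.

```python
import sqlite3

# Hypothetical table and column names; adjust them to the cleaned datasets.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS taxi_trips (
    id INTEGER PRIMARY KEY,
    pickup_datetime TEXT,
    tip_amount REAL,
    distance_km REAL
);
CREATE TABLE IF NOT EXISTS uber_trips (
    id INTEGER PRIMARY KEY,
    pickup_datetime TEXT,
    distance_km REAL
);
CREATE TABLE IF NOT EXISTS hourly_weather (
    id INTEGER PRIMARY KEY,
    date TEXT,
    precipitation REAL,
    wind_speed REAL
);
CREATE TABLE IF NOT EXISTS daily_weather (
    id INTEGER PRIMARY KEY,
    date TEXT,
    precipitation REAL,
    wind_speed REAL
);
"""


def create_tables(connection, schema_sql=SCHEMA_SQL):
    """Run every CREATE TABLE statement and return the resulting table names."""
    connection.executescript(schema_sql)
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
tables = create_tables(conn)
conn.close()
print(tables)
```

Writing the same string to `schema.sql` would give a file that the SQLite CLI can execute directly.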


# Part 3: Understanding Data
For this part, define a SQL query for each of the following questions - one query per question. Save each query as a .sql file, naming it something illustrative of what the query is for, e.g. top_10_hottest_days.sql.

For 01-2009 through 06-2015, what hour of the day was the most popular to take a Yellow Taxi? The result should have 24 bins.
For the same time frame, what day of the week was the most popular to take an Uber? The result should have 7 bins.
What is the 95th percentile of distance traveled for all hired trips during July 2013?
What were the top 10 days with the highest number of hired rides for 2009, and what was the average distance for each day?
Which 10 days in 2014 were the windiest on average, and how many hired trips were made on those days?
During Hurricane Sandy in NYC (Oct 29-30, 2012), plus the week leading up and the week after, how many trips were taken each hour, and for each hour, how much precipitation did NYC receive and what was the sustained wind speed? There should be an entry for every single hour, even if no rides were taken, no precipitation was measured, or there was no wind.

For each query, be sure to execute it in the notebook so we can see your answers to the question.
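
SQLite has no built-in percentile function, so the percentile question needs a workaround. One possible sketch, assuming the nearest-rank definition and a single hypothetical `trips` table with a `distance` column, sorts the values and skips to the right row. The date filter and the combination of taxi and Uber trips are left out here.

```python
import sqlite3

# Nearest-rank percentile: sort the values and take the row at index ceil(0.95 * N) - 1,
# which equals (95 * N - 1) / 100 under integer division.
PERCENTILE_95_SQL = """
SELECT distance
FROM trips
ORDER BY distance
LIMIT 1
OFFSET (SELECT (COUNT(*) * 95 - 1) / 100 FROM trips)
"""


def percentile_95(distances):
    """Load distances into a throwaway table and return their 95th percentile.

    Assumes at least one distance is given.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (distance REAL)")
    conn.executemany("INSERT INTO trips VALUES (?)", [(d,) for d in distances])
    (value,) = conn.execute(PERCENTILE_95_SQL).fetchone()
    conn.close()
    return value


print(percentile_95(range(1, 101)))
```

Other percentile definitions (for example with interpolation) give slightly different values, so a cross-check against pandas should use a matching method.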

Tips:
You may wish to use SQLAlchemy within the notebook to help craft these queries and query files. You can also use pandas to help check the validity of your queries.
You may want to familiarize yourself with COALESCE, WITH, and WITH RECURSIVE expressions for help in answering some of the questions.
See the appendix of the lecture notes from module #10 for more tips and hints.
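
To illustrate how `WITH RECURSIVE` and `COALESCE` can produce an entry for every hour, the sketch below generates an hourly series and left-joins trip counts onto it. The `trips` table, its `pickup_datetime` column, and the helper name are assumptions made for the example; weather columns could be joined in the same way.

```python
import sqlite3

HOURLY_SQL = """
WITH RECURSIVE hours(hour) AS (
    SELECT :start
    UNION ALL
    SELECT datetime(hour, '+1 hour') FROM hours WHERE hour < :end
)
SELECT hours.hour, COALESCE(counts.n, 0) AS trips
FROM hours
LEFT JOIN (
    SELECT strftime('%Y-%m-%d %H:00:00', pickup_datetime) AS trip_hour, COUNT(*) AS n
    FROM trips
    GROUP BY trip_hour
) AS counts ON counts.trip_hour = hours.hour
ORDER BY hours.hour
"""


def trips_per_hour(pickups, start, end):
    """Return (hour, trip_count) rows for every hour from start to end inclusive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (pickup_datetime TEXT)")
    conn.executemany("INSERT INTO trips VALUES (?)", [(p,) for p in pickups])
    rows = conn.execute(HOURLY_SQL, {"start": start, "end": end}).fetchall()
    conn.close()
    return rows


sample_pickups = ["2012-10-29 00:15:00", "2012-10-29 00:45:00", "2012-10-29 02:05:00"]
for row in trips_per_hour(sample_pickups, "2012-10-29 00:00:00", "2012-10-29 03:00:00"):
    print(row)
```

The recursive part supplies hours without any trips, and `COALESCE` turns the resulting `NULL` counts into zeros.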


# Part 4: Visualizing Data
For this final part, you will be creating a bunch of visualizations embedded in your notebook using matplotlib and/or other visualization libraries of your choice. Be sure to define a function for each visualization, then call these functions in separate cells to render each visualization.

This is where you can get creative with the look and feel of each visual. All that is required is that each visualization is immediately understandable without necessarily needing to read its associated function (i.e. labeled axes, titles, appropriate plot/graph/visual type, etc). You’re welcome to use Markdown cells to introduce and/or explain each visualization. 

You can use pandas to help parse data before generating a visualization, but you must read the data from your SQLite database (so, do not read directly from CSV files). CLARIFICATION (2022-04-26): you are able to use pandas dataframes to help with your visualization. But you should be creating those dataframes from querying the SQL tables you need (not by reading from a CSV file).

Create an appropriate visualization for the first query/question in part 3.
Create a visualization that shows the average distance traveled per month (regardless of year - so group by each month) for both taxis and Ubers combined. Include the 90% confidence interval around the mean in the visualization.
Define three lat/long coordinate boxes around the three major New York airports: LGA, JFK, and EWR (you can use bboxfinder to help). Create a visualization that compares what day of the week was most popular for drop offs for each airport.
Create a heatmap of all hired trips over a map of the area. Consider using KeplerGL or another library that helps generate geospatial visualizations.
Create a scatter plot that compares tip amount versus distance for Yellow Taxi rides. You may remove any outliers how you see fit.
Create another scatter plot that compares tip amount versus precipitation amount for Yellow Taxi rides. You may remove any outliers how you see fit.
Come up with 3 questions on your own that can be answered based on the data in the 4 tables. Create at least one visualization to answer each question. At least one visualization should require data from at least 3 tables.
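
For the airport comparison, one way to prepare the data is sketched below. The three boxes are rough, hand-picked placeholders around each airport and should be refined with a tool such as bboxfinder; the function names and the tuple layout of `trips` are likewise assumptions.

```python
from collections import Counter
from datetime import datetime

# Rough placeholder boxes as (min_lat, min_lon, max_lat, max_lon); refine before use.
AIRPORT_BOXES = {
    "LGA": (40.765, -73.890, 40.787, -73.855),
    "JFK": (40.620, -73.825, 40.665, -73.745),
    "EWR": (40.665, -74.195, 40.710, -74.150),
}


def airport_for(lat, lon):
    """Return the airport code whose box contains the point, or None."""
    for name, (min_lat, min_lon, max_lat, max_lon) in AIRPORT_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


def dropoffs_by_weekday(trips):
    """Count drop-offs per (airport, weekday); weekday 0 is Monday.

    Each trip is a (timestamp, dropoff_lat, dropoff_lon) tuple with a
    "YYYY-MM-DD HH:MM:SS" timestamp.
    """
    counts = Counter()
    for timestamp, lat, lon in trips:
        airport = airport_for(lat, lon)
        if airport is not None:
            weekday = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").weekday()
            counts[(airport, weekday)] += 1
    return counts


sample_trips = [
    ("2015-05-07 19:52:06", 40.6413, -73.7781),  # inside the JFK box, a Thursday
    ("2015-05-08 08:00:00", 40.7769, -73.8740),  # inside the LGA box, a Friday
    ("2015-05-07 10:00:00", 40.7500, -73.9900),  # not near any airport
]
print(dropoffs_by_weekday(sample_trips))
```

The resulting counts can feed a grouped bar chart with one group per weekday and one bar per airport.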


# We get the df from SQL

# For 01-2009 through 06-2015, what hour of the day was the most popular to take a Yellow Taxi?

In [None]:
dates = df["Trip_Pickup_DateTime"].values
samples = []

for date in dates:
    # Timestamps look like "YYYY-MM-DD HH:MM:SS"; skip missing (NaN) entries
    if isinstance(date, str):
        samples.append(int(date[-8:-6]))

# One bin per hour of the day (bin edges 0, 1, ..., 24)
plt.hist(samples, density = False, bins = range(25))
plt.title("Yellow Taxi pickups by hour of day")
plt.xlabel("Hour")
plt.ylabel("Frequency")
plt.show()

# Create a visualization that shows the average distance traveled per month (regardless of year - so group by each month) for both taxis and Ubers combined. Include the 90% confidence interval around the mean in the visualization.

In [None]:
def month_distance(df):
    time = df["Trip_Pickup_DateTime"].values
    distance = df["Distance"].values
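
The function above is left unfinished in the notebook. As one possible way to compute the numbers behind this plot, the sketch below groups distances by calendar month and derives a 90% confidence interval for each mean. It assumes a normal approximation (z = 1.645) and plain `(timestamp, distance)` tuples; the helper name is hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

Z_90 = 1.645  # two-sided 90% critical value of the standard normal distribution


def monthly_mean_with_ci(records):
    """Map month (1-12) to (mean, half_width) for (timestamp, distance_km) records."""
    by_month = defaultdict(list)
    for timestamp, distance in records:
        by_month[int(timestamp[5:7])].append(distance)  # "YYYY-MM-..." -> month number

    result = {}
    for month, values in sorted(by_month.items()):
        # The standard error needs at least two observations; otherwise report no interval.
        if len(values) > 1:
            half_width = Z_90 * stdev(values) / math.sqrt(len(values))
        else:
            half_width = 0.0
        result[month] = (mean(values), half_width)
    return result


sample_records = [
    ("2009-01-18 03:05:00", 1.0),
    ("2010-01-11 16:16:20", 3.0),
    ("2015-06-01 12:19:07", 2.0),
]
print(monthly_mean_with_ci(sample_records))
```

The means and half-widths could then be passed to `plt.errorbar` as `y` and `yerr`, with the month numbers on the x-axis.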
    
    
    


In [None]:
df_uber