# 1. Data Collection
- In this section, we are going to extract relevant dataset from 4 different data sources:
  - 1-1. Big Query Google Cloud Platofrm (GCP): Chicago Taxi Trips
  - 1-2. Wikipedia - Chicago Community Areas (Available at: https://en.wikipedia.org/wiki/Community_areas_in_Chicago)
  - 1-3. Nominatim API - OpenStreetMap Data (Avaiable at: https://nominatim.org/)
  - 1-4. Flatfiles

## 1-1. Extracting Public Dataset from Google Cloud Platform
- Due to the large size of the dataset on GCP and the slow loading speed on our current local machine, this project will focus on trips that took place in 2015. It will specifically analyze the top 5 highest demand taxi companies: 'Yellow Cab', 'American United', 'Checker Taxi', 'Blue Diamond', and '5 Star Taxi'.

In [None]:
from google.cloud import bigquery

client = bigquery.Client()


QUERY = """
SELECT * 
FROM `chicago_taxi.chicago_taxi_main`
WHERE trip_year = 2015
AND company IN ('Yello Cab', 'American United', 'Checker Taxi', 'Blue Diamond', '5 Star Taxi')
"""

query_job = client.query(QUERY)

In [None]:
chicago_taxi = query_job.to_dataframe()

In [None]:
chicago_taxi.to_csv("chicago_taxi.csv", index=False)

In [1]:
import pandas as pd
chicago_taxi = pd.read_csv("chicago_taxi.csv")

In [2]:
chicago_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1778219 entries, 0 to 1778218
Data columns (total 24 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   unique_key              object 
 1   taxi_id                 object 
 2   trip_year               int64  
 3   trip_start_timestamp    object 
 4   trip_start_date         object 
 5   trip_start_time         object 
 6   trip_end_timestamp      object 
 7   trip_end_date           object 
 8   trip_end_time           object 
 9   trip_seconds            float64
 10  trip_miles              float64
 11  pickup_community_area   float64
 12  dropoff_community_area  float64
 13  fare                    float64
 14  tips                    float64
 15  tolls                   float64
 16  extras                  float64
 17  trip_total              float64
 18  payment_type            object 
 19  company                 object 
 20  pickup_latitude         float64
 21  pickup_longitude        float64

## 1-2. Webscraping Community Area Information

- In the Chicago Taxi Trip dataset from GCP, the community area is represented by numbers, suggesting that there could be additional valuable information to explore. This information might be particularly useful for implementing business strategies, such as when a taxi company targets specific areas and demographic groups. For future use, we will store this information in our database.


In [3]:
import requests
import re
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

tbody = soup.find('tbody')
if tbody:
    rows = tbody.find_all('tr')

    community_numbers = []
    community_names = []
    populations = []

    for row in rows[1:]:  
        cols = row.find_all('td')
        if len(cols) > 2:
            community_numbers.append(cols[0].text.strip())
            community_names.append(row.find('th').text.strip()) 
            populations.append(cols[1].text.strip()) 

    community = pd.DataFrame({
        'community_number': community_numbers,
        'community_name': community_names,
        'population': populations
    })
    community=community[0:77]
    print(community)
else:
    print("Table body not found on the page")


   community_number      community_name population
0                01         Rogers Park     55,628
1                02          West Ridge     77,122
2                03              Uptown     57,182
3                04      Lincoln Square     40,494
4                05        North Center     35,114
..              ...                 ...        ...
72               73  Washington Heights     25,065
73               74     Mount Greenwood     18,628
74               75         Morgan Park     21,186
75               76              O'Hare     13,418
76               77           Edgewater     56,296

[77 rows x 3 columns]


## 1-3. Nominatim API - OpenStreetMap

- We are going to use the Nominatim API, which provides detailed geographic information, including specific addresses and types of locations based on specific geographic coordinates (i.e., latitude, longitude). As we have geo-coordinates provided in the Chicago taxi trips dataset, we are able to identify the pick-up and drop-off information in more detail.

In [10]:
def get_location_details(lat, lon):
    url = f"https://nominatim.openstreetmap.org/reverse?lat={lat}&lon={lon}&format=json"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        location_name = data.get('display_name')
        location_type = data.get('type')
        return pd.Series([location_name, location_type])
    else:
        return pd.Series(["Error", "Error"])


- To intergrade the API data, we first need to manipulate the chicago taxi trip dataset by concatenate the latitude and longitude to get the unique single location. Thus, we are going to create addition columns called 'pickup_location' and 'dropoff_location'.

In [5]:
chicago_taxi['pickup_location'] = chicago_taxi.apply(lambda row: f"{row['pickup_latitude']}, {row['pickup_longitude']}", axis=1)
chicago_taxi['dropoff_location'] = chicago_taxi.apply(lambda row: f"{row['dropoff_latitude']}, {row['dropoff_longitude']}", axis=1)

- We are going to create two tables, named 'pickup_location_info' and 'dropoff_location_info', to integrate information from the Nominatim API for the extraction of additional geographic information, respectively. Each table will contain the following information: pickup_location, latitude, longitude, address, and location_type.

In [14]:

# creating pikup location table intergrating API
pickup_location = chicago_taxi['pickup_location'].unique()
pickup_location_info = pd.DataFrame(pickup_location)
pickup_location_info.rename(columns={0: 'pickup_location'}, inplace=True)
pickup_location_info[['latitude', 'longitude']] = pickup_location_info['pickup_location'].str.split(',', expand=True)
pickup_location_info[['address','type']] = pickup_location_info.apply(lambda row: get_location_details(row['latitude'], row['longitude']), axis=1)
pickup_location_info['latitude'] = pickup_location_info['latitude'].astype(float)
pickup_location_info['longitude'] = pickup_location_info['longitude'].astype(float)

# creating dropoff location table intergrating API
dropoff_location = chicago_taxi['dropoff_location'].unique()
dropoff_location_info = pd.DataFrame(dropoff_location)
dropoff_location_info.rename(columns={0: 'dropoff_location'}, inplace=True)
dropoff_location_info[['latitude', 'longitude']] = dropoff_location_info['dropoff_location'].str.split(',', expand=True)
dropoff_location_info[['address','type']] = dropoff_location_info.apply(lambda row: get_location_details(row['latitude'], row['longitude']), axis=1)
dropoff_location_info['latitude'] = dropoff_location_info['latitude'].astype(float)
dropoff_location_info['longitude'] = dropoff_location_info['longitude'].astype(float)

pickup_location_info, dropoff_location

(                 pickup_location   latitude  longitude  \
 0     41.94258518, -87.656644092  41.942585 -87.656644   
 1       41.88528132, -87.6572332  41.885281 -87.657233   
 2      41.89321636, -87.63784421  41.893216 -87.637844   
 3    41.936237179, -87.656411531  41.936237 -87.656412   
 4     41.89503345, -87.619710672  41.895033 -87.619711   
 ..                           ...        ...        ...   
 421  41.928464984, -87.695086675  41.928465 -87.695087   
 422  41.798041716, -87.594196627  41.798042 -87.594197   
 423  41.985916382, -87.768970241  41.985916 -87.768970   
 424  41.776163693, -87.579948248  41.776164 -87.579948   
 425  42.005559764, -87.901885838  42.005560 -87.901886   
 
                                                address         type  
 0    3331, North Seminary Avenue, Wrigleyville, Lak...          yes  
 1    Twelve01West, 1201, West Lake Street, Fulton M...   apartments  
 2    350-372, West Ontario Street, Near North Side,...   commercial  
 3    

# 2. Data Cleaning

In [15]:
chicago_taxi = chicago_taxi.copy()

In [17]:
chicago_taxi.head()

Unnamed: 0,unique_key,taxi_id,trip_year,trip_start_timestamp,trip_start_date,trip_start_time,trip_end_timestamp,trip_end_date,trip_end_time,trip_seconds,...,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,pickup_location,dropoff_location
0,3b838244692ab501427001b44af93f77c86204d4,6e40306a3a76d2e41f2530cc314ecfbd2520aae13202d7...,2015,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,2015-04-19 00:15:00+00:00,2015-04-19,00:15:00,147.0,...,1.0,7.05,Credit Card,Blue Diamond,41.942585,-87.656644,41.942692,-87.651771,"41.94258518, -87.656644092","41.942691844, -87.651770507"
1,261244b20b7cc4bb971b9d21bc9f756bd8ee07d9,6aeb4a88ff55ac575e3ef10ef32622967534bd48fb9ba6...,2015,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,552.0,...,1.5,10.95,Credit Card,American United,41.885281,-87.657233,41.912432,-87.670189,"41.88528132, -87.6572332","41.912431869, -87.670189148"
2,eac1e0162828f4ddd7ae137adb168c0831a3bf72,641c9356c873f4b5fb13d4b2f70d8b4d4b7b2c98057272...,2015,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,2015-04-19 00:15:00+00:00,2015-04-19,00:15:00,629.0,...,1.5,11.15,Credit Card,American United,41.893216,-87.637844,41.892042,-87.631864,"41.89321636, -87.63784421","41.892042136, -87.63186395"
3,5f54f3b346e021ac2549229752cb6f59f6065653,8307cf9433f0293eee99c6944aeab484521d9cd9b1fce5...,2015,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,182.0,...,1.0,5.65,Cash,American United,41.936237,-87.656412,41.941556,-87.666289,"41.936237179, -87.656411531","41.941555829, -87.666288887"
4,3fa6dc0f02a81ee04349f398fdbc0f583c1669ff,687e3ef9daf087b79188bf0fea27f22cd5786b0cda0c80...,2015,2015-04-19 00:00:00+00:00,2015-04-19,00:00:00,2015-04-19 00:15:00+00:00,2015-04-19,00:15:00,708.0,...,1.5,10.95,Cash,American United,41.895033,-87.619711,41.892493,-87.664746,"41.89503345, -87.619710672","41.892493167, -87.664745836"


In [16]:
chicago_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1778219 entries, 0 to 1778218
Data columns (total 26 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   unique_key              object 
 1   taxi_id                 object 
 2   trip_year               int64  
 3   trip_start_timestamp    object 
 4   trip_start_date         object 
 5   trip_start_time         object 
 6   trip_end_timestamp      object 
 7   trip_end_date           object 
 8   trip_end_time           object 
 9   trip_seconds            float64
 10  trip_miles              float64
 11  pickup_community_area   float64
 12  dropoff_community_area  float64
 13  fare                    float64
 14  tips                    float64
 15  tolls                   float64
 16  extras                  float64
 17  trip_total              float64
 18  payment_type            object 
 19  company                 object 
 20  pickup_latitude         float64
 21  pickup_longitude        float64

In [18]:
chicago_taxi.isnull().sum()

unique_key                      0
taxi_id                         0
trip_year                       0
trip_start_timestamp            0
trip_start_date                 0
trip_start_time                 0
trip_end_timestamp              0
trip_end_date                   0
trip_end_time                   0
trip_seconds                   85
trip_miles                      0
pickup_community_area       46967
dropoff_community_area      81578
fare                            0
tips                            0
tolls                     1778219
extras                          0
trip_total                      0
payment_type                    0
company                         0
pickup_latitude             46885
pickup_longitude            46885
dropoff_latitude            78108
dropoff_longitude           78108
pickup_location                 0
dropoff_location                0
dtype: int64

In [19]:

trip_seconds_null = chicago_taxi[chicago_taxi['trip_seconds'].isnull()]
trip_seconds_null

Unnamed: 0,unique_key,taxi_id,trip_year,trip_start_timestamp,trip_start_date,trip_start_time,trip_end_timestamp,trip_end_date,trip_end_time,trip_seconds,...,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,pickup_location,dropoff_location
20567,1ede641ffd6bbd7f0c8d53e39c550135866d347b,824726452b94cfd411500410b26d1d60e5e07ada21556d...,2015,2015-11-01 02:00:00+00:00,2015-11-01,02:00:00,2015-11-01 01:15:00+00:00,2015-11-01,01:15:00,,...,1.5,13.38,Credit Card,American United,41.900266,-87.632109,41.921778,-87.651062,"41.900265687, -87.63210922","41.921778188, -87.651061884"
20568,758f0bd0f3f37653a7457e0b04a512ef510ce523,b5e563475714c9be944c844b7bfa126481e9f9222ff26d...,2015,2015-11-01 02:00:00+00:00,2015-11-01,02:00:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,1.0,5.65,Cash,American United,41.893216,-87.637844,41.892042,-87.631864,"41.89321636, -87.63784421","41.892042136, -87.63186395"
20569,e3ba16789cbc48d77ef5b419a9998c0c90540972,50142176d90fb9c95a38ae58b5ecd2aa8f7c49b54ff90b...,2015,2015-11-01 02:00:00+00:00,2015-11-01,02:00:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,0.0,11.65,Credit Card,American United,41.902788,-87.626146,41.928967,-87.656157,"41.902788048, -87.62614559","41.928967266, -87.656156831"
20570,f2a338367669b35f24160563df182bdf509139f7,1344830e881c7eba092d0c19fc0d33fb004fce69324141...,2015,2015-11-01 02:00:00+00:00,2015-11-01,02:00:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,1.0,12.71,Credit Card,Checker Taxi,41.880994,-87.632746,41.871351,-87.688675,"41.880994471, -87.632746489","41.8713514, -87.6886749"
20571,86c41279bab5eaf5fe9a395422a1deee04f7b41f,d13c5aaa066f94b4927779ed24cd313b0c686f03407095...,2015,2015-11-01 02:00:00+00:00,2015-11-01,02:00:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,1.0,9.85,Cash,Blue Diamond,41.921778,-87.651062,41.899507,-87.679600,"41.921778188, -87.651061884","41.899506548, -87.679600287"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1621203,56fd1399598eea415cce4d4c705c6c492c124074,22c14626f6bfb534f0181e4584ad8bd04d7f988f254d9f...,2015,2015-11-01 01:45:00+00:00,2015-11-01,01:45:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,0.0,11.85,Credit Card,American United,41.885300,-87.642808,41.934659,-87.646730,"41.885300022, -87.642808466","41.934659157, -87.646729729"
1621204,38686acfd995543dbb35f1085e3009bcd030fbfa,60e9b32a85d0045d670d329891f51b9796543659769e9a...,2015,2015-11-01 01:45:00+00:00,2015-11-01,01:45:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,1.0,22.83,Credit Card,Checker Taxi,41.946295,-87.654298,41.885300,-87.642808,"41.946294536, -87.654298084","41.885300022, -87.642808466"
1621205,aa07ebd014898d47721c222f932d4fcc25ca7fd4,de42f8191f3b8079c680589098c5f3f6e2d202e98f8a81...,2015,2015-11-01 01:45:00+00:00,2015-11-01,01:45:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,1.0,19.38,Credit Card,Blue Diamond,41.922686,-87.649489,41.874005,-87.663518,"41.922686284, -87.649488729","41.874005383, -87.66351755"
1621206,f798670f2a0ee858eea6ea64657dce1899d3463e,8abc7af60f163d05f5202eacbab892f7b2c5d2d4b3ebba...,2015,2015-11-01 01:45:00+00:00,2015-11-01,01:45:00,2015-11-01 01:00:00+00:00,2015-11-01,01:00:00,,...,0.0,13.85,Cash,Blue Diamond,41.885300,-87.642808,41.926811,-87.642605,"41.885300022, -87.642808466","41.926811182, -87.642605247"
