# 1. Data Collection
- In this section, we extract the relevant datasets from 4 different data sources:
  - 1-1. BigQuery on Google Cloud Platform (GCP): Chicago Taxi Trips
  - 1-2. Wikipedia - Chicago Community Areas (Available at: https://en.wikipedia.org/wiki/Community_areas_in_Chicago)
  - 1-3. Nominatim API - OpenStreetMap Data (Available at: https://nominatim.org/)
  - 1-4. Flat file: taxi_vehicle.csv, sourced from the Chicago Data Portal (Available at: https://data.cityofchicago.org/Community-Economic-Development/Active-Taxis-Make-Model-Chart/6cak-z3a4).

## 1-1. Extracting Public Dataset from Google Cloud Platform
- Because the dataset on GCP is large and loads slowly on our local machine, this project focuses on trips that took place in 2015 and on the five taxi companies with the highest demand: 'Yellow Cab', 'American United', 'Checker Taxi', 'Blue Diamond', and '5 Star Taxi'.

In [1]:
from google.cloud import bigquery

client = bigquery.Client()


QUERY = """
SELECT * 
FROM `chicago_taxi.chicago_taxi_main`
WHERE trip_year = 2015
AND company IN ('Yellow Cab', 'American United', 'Checker Taxi', 'Blue Diamond', '5 Star Taxi')
"""

query_job = client.query(QUERY)
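
- The notebook does not show how the five companies in the `IN (...)` filter were selected. A minimal sketch of one plausible approach, ranking companies by trip count, is given below. The sample records and the helper name `top_companies` are hypothetical, and on the full dataset this aggregation would more sensibly run inside BigQuery.

```python
from collections import Counter

def top_companies(companies, n=5):
    """Return the n most frequent company names, most frequent first."""
    return [name for name, _ in Counter(companies).most_common(n)]

# Hypothetical sample: one entry per trip (the counts are made up).
sample_trips = (
    ["Yellow Cab"] * 6 + ["American United"] * 5 + ["Checker Taxi"] * 4
    + ["Blue Diamond"] * 3 + ["5 Star Taxi"] * 2 + ["24 Seven Taxi"] * 1
)

top5 = top_companies(sample_trips)

# Build the quoted list for the WHERE ... IN (...) clause.
# Simple quoting like this assumes the names contain no single quotes.
in_clause = ", ".join(f"'{name}'" for name in top5)
print(in_clause)
```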


In [None]:
chicago_taxi = query_job.to_dataframe()

In [None]:
# chicago_taxi.to_csv("chicago_taxi.csv", index=False)

In [None]:
import pandas as pd
# chicago_taxi = pd.read_csv("chicago_taxi.csv")

In [6]:
chicago_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972163 entries, 0 to 4972162
Data columns (total 24 columns):
 #   Column                  Dtype              
---  ------                  -----              
 0   unique_key              object             
 1   taxi_id                 object             
 2   trip_year               Int64              
 3   trip_start_timestamp    datetime64[us, UTC]
 4   trip_start_date         dbdate             
 5   trip_start_time         object             
 6   trip_end_timestamp      datetime64[us, UTC]
 7   trip_end_date           dbdate             
 8   trip_end_time           object             
 9   trip_seconds            Int64              
 10  trip_miles              float64            
 11  pickup_community_area   Int64              
 12  dropoff_community_area  Int64              
 13  fare                    float64            
 14  tips                    float64            
 15  tolls                   float64            
 16  

## 1-2. Webscraping Community Area Information

- In the Chicago Taxi Trips dataset from GCP, each community area is represented only by a number. Mapping each number to its community name and population adds context that could support business strategies, such as a taxi company targeting specific areas and demographic groups. For future use, we will store this information in our database.


In [7]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

tbody = soup.find('tbody')
if tbody:
    rows = tbody.find_all('tr')

    community_numbers = []
    community_names = []
    populations = []

    for row in rows[1:]:  
        cols = row.find_all('td')
        if len(cols) > 2:
            community_numbers.append(cols[0].text.strip())
            community_names.append(row.find('th').text.strip()) 
            populations.append(cols[1].text.strip()) 

    community = pd.DataFrame({
        'community_number': community_numbers,
        'community_name': community_names,
        'population': populations
    })
    community=community[0:77]
    print(community)
else:
    print("Table body not found on the page")


   community_number      community_name population
0                01         Rogers Park     55,628
1                02          West Ridge     77,122
2                03              Uptown     57,182
3                04      Lincoln Square     40,494
4                05        North Center     35,114
..              ...                 ...        ...
72               73  Washington Heights     25,065
73               74     Mount Greenwood     18,628
74               75         Morgan Park     21,186
75               76              O'Hare     13,418
76               77           Edgewater     56,296

[77 rows x 3 columns]


## 1-3. Nominatim API - OpenStreetMap

- We are going to use the Nominatim API, which provides detailed geographic information, including specific addresses and types of locations based on specific geographic coordinates (i.e., latitude, longitude). As we have geo-coordinates provided in the Chicago taxi trips dataset, we are able to identify the pick-up and drop-off information in more detail.

In [8]:
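def get_location_details(lat, lon):
    """Reverse-geocode one coordinate pair via Nominatim; returns (address, type)."""
    url = f"https://nominatim.openstreetmap.org/reverse?lat={lat}&lon={lon}&format=json"
    # Nominatim's usage policy asks for an identifying User-Agent; this value is a placeholder.
    headers = {"User-Agent": "chicago-taxi-notebook"}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        data = response.json()
        location_name = data.get('display_name')
        location_type = data.get('type')
        return pd.Series([location_name, location_type])
    else:
        return pd.Series(["Error", "Error"])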
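
- Nominatim's public server asks clients to stay at or below one request per second and to avoid repeating identical queries. A minimal sketch of a throttled, cached wrapper follows. `make_throttled_lookup`, `fake_lookup`, and the injected `sleep` and `clock` parameters are hypothetical names, introduced so the logic can be exercised without network access. In the notebook, the wrapped function would be `get_location_details`.

```python
import time

def make_throttled_lookup(lookup, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap lookup(lat, lon) so results are cached and real calls are spaced out."""
    cache = {}
    last_call = [None]  # clock value at the most recent real lookup

    def throttled(lat, lon):
        key = (lat, lon)
        if key in cache:                 # repeated coordinates never hit the API again
            return cache[key]
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        result = lookup(lat, lon)
        last_call[0] = clock()
        cache[key] = result
        return result

    return throttled

# Offline stand-in for get_location_details.
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return f"address for {lat}, {lon}"

# A frozen clock and a recording sleep make the behaviour observable without waiting.
slept = []
lookup = make_throttled_lookup(fake_lookup, min_interval=1.0,
                               sleep=slept.append, clock=lambda: 0.0)
lookup("41.879255084", "-87.642648998")
lookup("41.879255084", "-87.642648998")   # served from the cache
lookup("41.929046937", "-87.651310877")   # second real call, waits first
print(len(calls), slept)
```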


- To integrate the API data, we first concatenate the latitude and longitude in the Chicago taxi trip dataset into a single string that identifies each unique location. We therefore create two additional columns, 'pickup_location' and 'dropoff_location'.

In [9]:
chicago_taxi['pickup_location'] = chicago_taxi.apply(lambda row: f"{row['pickup_latitude']}, {row['pickup_longitude']}", axis=1)
chicago_taxi['dropoff_location'] = chicago_taxi.apply(lambda row: f"{row['dropoff_latitude']}, {row['dropoff_longitude']}", axis=1)
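
- The string key built here is split back into latitude and longitude further down. A minimal plain-Python sketch of that round trip follows; `make_key` and `parse_key` are hypothetical helpers, not part of the notebook. It shows that splitting on `','` leaves a leading space on the longitude, which `float()` tolerates but which is worth stripping before the value is placed in a URL.

```python
def make_key(lat, lon):
    """Build the 'lat, lon' string used as a location key."""
    return f"{lat}, {lon}"

def parse_key(key):
    """Split a 'lat, lon' key back into two floats."""
    lat_text, lon_text = key.split(",")
    return float(lat_text.strip()), float(lon_text.strip())

key = make_key(41.87866742, -87.671653621)
raw_parts = key.split(",")      # the second part keeps its leading space
lat, lon = parse_key(key)
print(repr(key), raw_parts, (lat, lon))
```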

- We create two tables, 'pickup_location_info' and 'dropoff_location_info', which hold the additional geographic information returned by the Nominatim API for the unique pickup and drop-off locations respectively. Each table contains the location key (pickup_location or dropoff_location), latitude, longitude, address, and location type (`type`).

In [10]:

# create the pickup location table and integrate the API data
pickup_location = chicago_taxi['pickup_location'].unique()
pickup_location_info = pd.DataFrame(pickup_location)
pickup_location_info.rename(columns={0: 'pickup_location'}, inplace=True)
pickup_location_info[['latitude', 'longitude']] = pickup_location_info['pickup_location'].str.split(',', expand=True)
pickup_location_info[['address','type']] = pickup_location_info.apply(lambda row: get_location_details(row['latitude'], row['longitude']), axis=1)
pickup_location_info['latitude'] = pickup_location_info['latitude'].astype(float)
pickup_location_info['longitude'] = pickup_location_info['longitude'].astype(float)

# create the dropoff location table and integrate the API data
dropoff_location = chicago_taxi['dropoff_location'].unique()
dropoff_location_info = pd.DataFrame(dropoff_location)
dropoff_location_info.rename(columns={0: 'dropoff_location'}, inplace=True)
dropoff_location_info[['latitude', 'longitude']] = dropoff_location_info['dropoff_location'].str.split(',', expand=True)
dropoff_location_info[['address','type']] = dropoff_location_info.apply(lambda row: get_location_details(row['latitude'], row['longitude']), axis=1)
dropoff_location_info['latitude'] = dropoff_location_info['latitude'].astype(float)
dropoff_location_info['longitude'] = dropoff_location_info['longitude'].astype(float)



In [None]:
# display(pickup_location_info.head())
# display(dropoff_location_info.head())

In [11]:
# merge dropoff and pickup location tables
dropoff_location_info.rename(columns={'dropoff_location': 'location_coordinates', 'type': 'dropoff_type'}, inplace=True)
pickup_location_info.rename(columns={'pickup_location': 'location_coordinates', 'type': 'pickup_type'}, inplace=True)

location_info = pd.merge(dropoff_location_info, pickup_location_info, on='location_coordinates', suffixes=('_dropoff', '_pickup'))

location_info = location_info[['location_coordinates', 'address_dropoff', 'dropoff_type']]
location_info.rename(columns={'dropoff_type': 'type'}, inplace=True)
location_info.rename(columns={'address_dropoff': 'address'}, inplace=True)
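
- `pd.merge` defaults to an inner join, so `location_info` keeps only the coordinates that occur in both the pickup and the drop-off table. The plain-Python sketch below illustrates the difference between that intersection and a union of both tables. The two small dictionaries are hypothetical stand-ins for the real tables, with made-up keys and addresses.

```python
# Hypothetical stand-ins: coordinate key -> address.
pickup_info = {
    "41.87, -87.67": "address A",
    "41.92, -87.64": "address B",
}
dropoff_info = {
    "41.92, -87.64": "address B",
    "41.95, -87.72": "address C",
}

# Inner join: only keys present in both tables survive.
inner = {k: dropoff_info[k] for k in dropoff_info if k in pickup_info}

# Union: every coordinate seen as a pickup or a drop-off is kept.
union = {**pickup_info, **dropoff_info}

print(len(inner), len(union))
```

- If trips can reference a coordinate that appears on only one side, the union may be the safer basis for a lookup table.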


In [2]:
import pandas as pd
location_info = pd.read_csv("location_info.csv")

In [5]:
location_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421 entries, 0 to 420
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   location_coordinates  421 non-null    object
 1   address               421 non-null    object
 2   type                  421 non-null    object
dtypes: object(3)
memory usage: 10.0+ KB


## 1-4. Flat file: Taxi Vehicle Type/Make, Taxi Company Info
- **Taxi vehicle type/make**: This contains the type and make of taxis operating in Chicago. The dataset presented here is not exactly the same as the original source: I modified it so that it relates to the datasets from the other sources and meets the RNCP criteria (i.e., taxi_id values were randomly assigned).
- **Taxi company**: This data was sourced as flat files from the Chicago Data Portal; only 2020 data is available. A company_id is assigned to serve as the primary key in the database.

In [74]:
taxi = pd.read_csv('taxi_vehicle.csv')
taxi.head()

Unnamed: 0,taxi_id,Public Vehicle Number,Vehicle Make,Vehicle Model Year,Vehicle Color,Vehicle Fuel Source
0,6e40306a3a76d2e41f2530cc314ecfbd2520aae13202d7...,1350,FORD,2014.0,BLUE,Hybrid
1,6aeb4a88ff55ac575e3ef10ef32622967534bd48fb9ba6...,4063,NISSAN,2011.0,WHITE,Hybrid
2,641c9356c873f4b5fb13d4b2f70d8b4d4b7b2c98057272...,5448,CHRYSLER,2013.0,YELLOW,Flex Fuel
3,8307cf9433f0293eee99c6944aeab484521d9cd9b1fce5...,266,TOYOTA,2013.0,GREEN,Hybrid
4,687e3ef9daf087b79188bf0fea27f22cd5786b0cda0c80...,5644,TOYOTA,2012.0,WHITE,Hybrid


In [13]:
taxi_comapny_id = pd.read_csv("taxi_company1.csv")
taxi_comapny_info = pd.read_csv("taxi_company2.csv")



In [14]:
company = pd.merge(taxi_comapny_info, taxi_comapny_id, on='company', how='left')
company=company[['company_id','company', 'taxi_exterior_color', 'business_phone', 'dispatch_phone',
       'address', 'city_state', 'zip', 'email']]

In [15]:
company.head()

Unnamed: 0,company_id,company,taxi_exterior_color,business_phone,dispatch_phone,address,city_state,zip,email
0,22,5 Star Taxi,White,773-561-4444,773-561-4444,9696 W. FOSTER AVE,"CHICAGO, IL",60656,info@flash.com
1,16,24 Seven Taxi,Blue,773-878-8294,773-944-0350,5606 N. WESTERN AV,"CHICAGO, IL",60659,chicago247taxi@gmail.com
2,4,American United,"White, Stars, Stripes",773-327-6161,773-248-7600,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
3,12,Blue Diamond,"Cream, Blue",312-881-3188,312-226-8880,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
4,21,Blue Ribbon Taxi Association Inc.,"White, Blue, Stripes",773-279-4100,773-878-5400,4020 W. GLENLAKE AVE,"CHICAGO, IL",60646,info@blueribbontaxi.com


# 2. Data Cleaning

- We perform basic data cleaning before storing the data in the database: we convert columns to appropriate data types and remove columns that are redundant. We also restructure the tables to fit the database design (e.g. primary and foreign keys).
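
- The notebook does not include the table definitions, so the following is only a sketch of how primary and foreign keys could relate two of the tables. It uses Python's built-in `sqlite3` module so that it runs without a MySQL server. The column types and constraints are assumptions, only a subset of the columns is shown, and the `taxi_id` value is shortened for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,
    company    TEXT NOT NULL
);
CREATE TABLE trips (
    unique_key TEXT PRIMARY KEY,
    taxi_id    TEXT NOT NULL,
    trip_total REAL,
    company_id INTEGER NOT NULL REFERENCES company (company_id)
);
""")

conn.execute("INSERT INTO company VALUES (?, ?)", (2, "Yellow Cab"))
conn.execute(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    ("7bc6538af09173ecde839aaf31974d0a26dd9c13", "f54b11bc", 10.25, 2),
)

# A trip that points at an unknown company is rejected by the foreign key.
try:
    conn.execute("INSERT INTO trips VALUES (?, ?, ?, ?)", ("k2", "t2", 6.65, 999))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```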

In [16]:
chicago_taxi.head()

Unnamed: 0,unique_key,taxi_id,trip_year,trip_start_timestamp,trip_start_date,trip_start_time,trip_end_timestamp,trip_end_date,trip_end_time,trip_seconds,...,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,pickup_location,dropoff_location
0,7bc6538af09173ecde839aaf31974d0a26dd9c13,f54b11bc86d1ab32945f60547e15cbdc14a3b017b2a0ee...,2015,2015-04-02 21:45:00+00:00,2015-04-02,21:45:00,2015-04-02 21:45:00+00:00,2015-04-02,21:45:00,302,...,1.0,10.25,Credit Card,Yellow Cab,41.878667,-87.671654,41.879255,-87.642649,"41.87866742, -87.671653621","41.879255084, -87.642648998"
1,0737d0a781fad8ffef310dc3b54b5351b4c65e47,2eda36427e0a5394e90d77488294cd75e2fd87f04acb02...,2015,2015-02-08 01:45:00+00:00,2015-02-08,01:45:00,2015-02-08 01:45:00+00:00,2015-02-08,01:45:00,166,...,0.0,6.65,Credit Card,Checker Taxi,41.926811,-87.642605,41.929047,-87.651311,"41.926811182, -87.642605247","41.929046937, -87.651310877"
2,1c3ec7ab3525f30f7924ea89dcc8f3e0ccfa1e6d,fdcc770e27c1fb9af9154cbab27fa4a1f830d1a3d6a839...,2015,2015-02-08 18:00:00+00:00,2015-02-08,18:00:00,2015-02-08 18:15:00+00:00,2015-02-08,18:15:00,878,...,0.0,9.65,Cash,Blue Diamond,41.938666,-87.711211,41.953582,-87.723452,"41.938666196, -87.711210593","41.953582125, -87.72345239"
3,efba433903024e63cf72fe1474c4d256db5b90b1,887b2728097b2d9f149774b8cc04fe9d80b9221506ef8c...,2015,2015-02-08 01:15:00+00:00,2015-02-08,01:15:00,2015-02-08 01:15:00+00:00,2015-02-08,01:15:00,397,...,2.0,9.25,Credit Card,American United,41.929047,-87.651311,41.943237,-87.643471,"41.929046937, -87.651310877","41.943237122, -87.643470956"
4,f89cdf530e724bcb3fef69ac40742fddc7515bdf,516e828a4ee5e460c2aa30c739641b46593e686d6bd271...,2015,2015-02-08 00:30:00+00:00,2015-02-08,00:30:00,2015-02-08 00:45:00+00:00,2015-02-08,00:45:00,1081,...,0.0,12.05,Cash,Yellow Cab,41.963185,-87.683855,41.921855,-87.646211,"41.963184966, -87.683854556","41.921854911, -87.646210977"


In [17]:
chicago_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972163 entries, 0 to 4972162
Data columns (total 26 columns):
 #   Column                  Dtype              
---  ------                  -----              
 0   unique_key              object             
 1   taxi_id                 object             
 2   trip_year               Int64              
 3   trip_start_timestamp    datetime64[us, UTC]
 4   trip_start_date         dbdate             
 5   trip_start_time         object             
 6   trip_end_timestamp      datetime64[us, UTC]
 7   trip_end_date           dbdate             
 8   trip_end_time           object             
 9   trip_seconds            Int64              
 10  trip_miles              float64            
 11  pickup_community_area   Int64              
 12  dropoff_community_area  Int64              
 13  fare                    float64            
 14  tips                    float64            
 15  tolls                   float64            
 16  

In [18]:
chicago_taxi.isnull().sum()

unique_key                      0
taxi_id                         0
trip_year                       0
trip_start_timestamp            0
trip_start_date                 0
trip_start_time                 0
trip_end_timestamp              0
trip_end_date                   0
trip_end_time                   0
trip_seconds                  254
trip_miles                      0
pickup_community_area       92590
dropoff_community_area     192761
fare                            0
tips                            0
tolls                     4972163
extras                          0
trip_total                      0
payment_type                    0
company                         0
pickup_latitude             92389
pickup_longitude            92389
dropoff_latitude           182885
dropoff_longitude          182885
pickup_location                 0
dropoff_location                0
dtype: int64

In [19]:
# merge with taxi_comapny_id to attach company_id to each trip
chicago_taxi = pd.merge(chicago_taxi, taxi_comapny_id, on='company', how='left')

# drop unnecessary columns ('tolls' is null in every row, so it must be removed before dropna)
chicago_taxi = chicago_taxi.drop(['trip_year','trip_start_date','trip_start_time','trip_end_date','trip_end_time', 'tolls', 'company'], axis=1)

# drop rows that still contain null values
chicago_taxi = chicago_taxi.dropna()

# convert datatype for MySQL format
chicago_taxi['trip_start_timestamp'] = pd.to_datetime(chicago_taxi['trip_start_timestamp']).dt.strftime('%Y-%m-%d %H:%M:%S')
chicago_taxi['trip_end_timestamp'] = pd.to_datetime(chicago_taxi['trip_end_timestamp']).dt.strftime('%Y-%m-%d %H:%M:%S')

chicago_taxi['pickup_community_area'] = chicago_taxi['pickup_community_area'].astype('Int64')
chicago_taxi['dropoff_community_area'] = chicago_taxi['dropoff_community_area'].astype('Int64')
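
- The `strftime` calls above turn timezone-aware UTC timestamps into the `YYYY-MM-DD HH:MM:SS` text form that a MySQL `DATETIME` column accepts. A minimal standard-library illustration of the same conversion on a single value follows; the helper name `to_mysql_datetime` is hypothetical.

```python
from datetime import datetime, timezone

def to_mysql_datetime(ts):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS', dropping the UTC offset."""
    return ts.strftime("%Y-%m-%d %H:%M:%S")

start = datetime(2015, 4, 2, 21, 45, tzinfo=timezone.utc)
original_text = str(start)              # text form including the +00:00 offset
mysql_text = to_mysql_datetime(start)   # text form without the offset
print(original_text, "->", mysql_text)
```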



In [20]:
trips=chicago_taxi.copy()

In [21]:
trips.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4768966 entries, 0 to 4972162
Data columns (total 20 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   unique_key              object 
 1   taxi_id                 object 
 2   trip_start_timestamp    object 
 3   trip_end_timestamp      object 
 4   trip_seconds            Int64  
 5   trip_miles              float64
 6   pickup_community_area   Int64  
 7   dropoff_community_area  Int64  
 8   fare                    float64
 9   tips                    float64
 10  extras                  float64
 11  trip_total              float64
 12  payment_type            object 
 13  pickup_latitude         float64
 14  pickup_longitude        float64
 15  dropoff_latitude        float64
 16  dropoff_longitude       float64
 17  pickup_location         object 
 18  dropoff_location        object 
 19  company_id              int64  
dtypes: Int64(3), float64(9), int64(1), object(7)
memory usage: 777.7+ MB


In [22]:
community.head()

Unnamed: 0,community_number,community_name,population
0,1,Rogers Park,55628
1,2,West Ridge,77122
2,3,Uptown,57182
3,4,Lincoln Square,40494
4,5,North Center,35114


In [23]:
community['community_number'] = community['community_number'].astype('Int64')
community['population'] = community['population'].str.replace(',','').astype('Int64')
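
- The scraped values arrive as text such as `'01'` and `'55,628'`, which is why the conversion above strips the thousands separator before casting. A plain-Python sketch of the same cleaning for a single row follows; `clean_community_row` is a hypothetical helper.

```python
def clean_community_row(number_text, name, population_text):
    """Convert one scraped row of strings into (int, str, int)."""
    number = int(number_text)                            # leading zeros are dropped
    population = int(population_text.replace(",", ""))   # remove the thousands separator
    return number, name.strip(), population

row = clean_community_row("01", "Rogers Park", "55,628")
print(row)
```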

In [24]:
community.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   community_number  77 non-null     Int64 
 1   community_name    77 non-null     object
 2   population        77 non-null     Int64 
dtypes: Int64(2), object(1)
memory usage: 2.1+ KB


In [25]:
community.isnull().sum()

community_number    0
community_name      0
population          0
dtype: int64

In [26]:
location_info.head()

Unnamed: 0,location_coordinates,address,type
0,"41.879255084, -87.642648998","Taco Lulú, 601, West Adams Street, West Loop G...",restaurant
1,"41.929046937, -87.651310877","863, West Wrightwood Avenue, Lincoln Park, Chi...",yes
2,"41.953582125, -87.72345239","3831, West Irving Park Road, Irving Park, Chic...",house
3,"41.943237122, -87.643470956","537-545, West Roscoe Street, Northalsted, Lake...",yes
4,"41.921854911, -87.646210977","658, West Webster Avenue, Mid-North District, ...",yes


In [27]:
location_info.isnull().sum()

location_coordinates    0
address                 0
type                    0
dtype: int64

The 'type' of an address is not informative: 298 of the 421 values are 'yes', which carries no meaning. We therefore drop this column.

In [28]:
location_info['type'].value_counts()

type
yes                 298
apartments           32
house                25
school               16
residential          15
parking              13
university            6
secondary             5
bridge                5
pitch                 5
industrial            5
motorway              4
golf_course           3
bus_stop              3
playground            3
nature_reserve        3
hospital              2
aerodrome             2
church                2
garden                2
terrace               2
restaurant            2
social_facility       2
detached              2
brewery               2
office                2
tertiary              1
dog_park              1
path                  1
religious             1
post_depot            1
police                1
museum                1
place_of_worship      1
surveillance          1
service               1
convenience           1
bar                   1
college               1
theatre               1
zoo                   1
Error      

In [29]:
location_info = location_info.drop(columns=['type'])

In [30]:
location = location_info.copy()

In [31]:
company.head()

Unnamed: 0,company_id,company,taxi_exterior_color,business_phone,dispatch_phone,address,city_state,zip,email
0,22,5 Star Taxi,White,773-561-4444,773-561-4444,9696 W. FOSTER AVE,"CHICAGO, IL",60656,info@flash.com
1,16,24 Seven Taxi,Blue,773-878-8294,773-944-0350,5606 N. WESTERN AV,"CHICAGO, IL",60659,chicago247taxi@gmail.com
2,4,American United,"White, Stars, Stripes",773-327-6161,773-248-7600,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
3,12,Blue Diamond,"Cream, Blue",312-881-3188,312-226-8880,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
4,21,Blue Ribbon Taxi Association Inc.,"White, Blue, Stripes",773-279-4100,773-878-5400,4020 W. GLENLAKE AVE,"CHICAGO, IL",60646,info@blueribbontaxi.com


In [32]:
company.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   company_id           19 non-null     int64 
 1   company              19 non-null     object
 2   taxi_exterior_color  19 non-null     object
 3   business_phone       19 non-null     object
 4   dispatch_phone       19 non-null     object
 5   address              19 non-null     object
 6   city_state           19 non-null     object
 7   zip                  19 non-null     int64 
 8   email                13 non-null     object
dtypes: int64(2), object(7)
memory usage: 1.5+ KB


In [75]:
taxi.head()

Unnamed: 0,taxi_id,Public Vehicle Number,Vehicle Make,Vehicle Model Year,Vehicle Color,Vehicle Fuel Source
0,6e40306a3a76d2e41f2530cc314ecfbd2520aae13202d7...,1350,FORD,2014.0,BLUE,Hybrid
1,6aeb4a88ff55ac575e3ef10ef32622967534bd48fb9ba6...,4063,NISSAN,2011.0,WHITE,Hybrid
2,641c9356c873f4b5fb13d4b2f70d8b4d4b7b2c98057272...,5448,CHRYSLER,2013.0,YELLOW,Flex Fuel
3,8307cf9433f0293eee99c6944aeab484521d9cd9b1fce5...,266,TOYOTA,2013.0,GREEN,Hybrid
4,687e3ef9daf087b79188bf0fea27f22cd5786b0cda0c80...,5644,TOYOTA,2012.0,WHITE,Hybrid


In [76]:
taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386 entries, 0 to 385
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   taxi_id                386 non-null    object 
 1   Public Vehicle Number  386 non-null    int64  
 2   Vehicle Make           386 non-null    object 
 3   Vehicle Model Year     386 non-null    float64
 4   Vehicle Color          386 non-null    object 
 5   Vehicle Fuel Source    386 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 18.2+ KB


In [77]:
taxi.columns = taxi.columns.str.lower()
taxi.columns = taxi.columns.str.replace(' ', '_')

In [78]:
taxi['vehicle_model_year'] = taxi['vehicle_model_year'].astype('Int64')

In [37]:
# trips_taxi_grouped = trips.groupby(['taxi_id', 'company_id']).size().reset_index(name='counts')

In [38]:
# taxi = pd.merge(taxi, trips_taxi_grouped, on='taxi_id', how='left')

In [39]:
taxi=taxi.dropna()

In [40]:
# taxi['company_id'] = taxi['company_id'].astype('Int64')
# taxi = taxi.drop('counts', axis=1)

# 3. Storing into Database

In [41]:
display(trips.head())
display(taxi.head())
display(location.head())
display(community.head())
display(company.head())

Unnamed: 0,unique_key,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,extras,trip_total,payment_type,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,pickup_location,dropoff_location,company_id
0,7bc6538af09173ecde839aaf31974d0a26dd9c13,f54b11bc86d1ab32945f60547e15cbdc14a3b017b2a0ee...,2015-04-02 21:45:00,2015-04-02 21:45:00,302,1.4,28,28,6.25,3.0,1.0,10.25,Credit Card,41.878667,-87.671654,41.879255,-87.642649,"41.87866742, -87.671653621","41.879255084, -87.642648998",2
1,0737d0a781fad8ffef310dc3b54b5351b4c65e47,2eda36427e0a5394e90d77488294cd75e2fd87f04acb02...,2015-02-08 01:45:00,2015-02-08 01:45:00,166,0.7,7,7,4.65,2.0,0.0,6.65,Credit Card,41.926811,-87.642605,41.929047,-87.651311,"41.926811182, -87.642605247","41.929046937, -87.651310877",9
2,1c3ec7ab3525f30f7924ea89dcc8f3e0ccfa1e6d,fdcc770e27c1fb9af9154cbab27fa4a1f830d1a3d6a839...,2015-02-08 18:00:00,2015-02-08 18:15:00,878,2.3,21,16,9.65,0.0,0.0,9.65,Cash,41.938666,-87.711211,41.953582,-87.723452,"41.938666196, -87.711210593","41.953582125, -87.72345239",12
3,efba433903024e63cf72fe1474c4d256db5b90b1,887b2728097b2d9f149774b8cc04fe9d80b9221506ef8c...,2015-02-08 01:15:00,2015-02-08 01:15:00,397,1.3,7,6,6.25,1.0,2.0,9.25,Credit Card,41.929047,-87.651311,41.943237,-87.643471,"41.929046937, -87.651310877","41.943237122, -87.643470956",4
4,f89cdf530e724bcb3fef69ac40742fddc7515bdf,516e828a4ee5e460c2aa30c739641b46593e686d6bd271...,2015-02-08 00:30:00,2015-02-08 00:45:00,1081,3.7,4,7,12.05,0.0,0.0,12.05,Cash,41.963185,-87.683855,41.921855,-87.646211,"41.963184966, -87.683854556","41.921854911, -87.646210977",2


Unnamed: 0,taxi_id,public_vehicle_number,vehicle_make,vehicle_model_year,vehicle_color,vehicle_fuel_source,company_id
0,6e40306a3a76d2e41f2530cc314ecfbd2520aae13202d7...,1350,FORD,2014,BLUE,Hybrid,12
1,6aeb4a88ff55ac575e3ef10ef32622967534bd48fb9ba6...,4063,NISSAN,2011,WHITE,Hybrid,4
2,641c9356c873f4b5fb13d4b2f70d8b4d4b7b2c98057272...,5448,CHRYSLER,2013,YELLOW,Flex Fuel,2
3,641c9356c873f4b5fb13d4b2f70d8b4d4b7b2c98057272...,5448,CHRYSLER,2013,YELLOW,Flex Fuel,4
4,8307cf9433f0293eee99c6944aeab484521d9cd9b1fce5...,266,TOYOTA,2013,GREEN,Hybrid,4


Unnamed: 0,location_coordinates,address
0,"41.879255084, -87.642648998","Taco Lulú, 601, West Adams Street, West Loop G..."
1,"41.929046937, -87.651310877","863, West Wrightwood Avenue, Lincoln Park, Chi..."
2,"41.953582125, -87.72345239","3831, West Irving Park Road, Irving Park, Chic..."
3,"41.943237122, -87.643470956","537-545, West Roscoe Street, Northalsted, Lake..."
4,"41.921854911, -87.646210977","658, West Webster Avenue, Mid-North District, ..."


Unnamed: 0,community_number,community_name,population
0,1,Rogers Park,55628
1,2,West Ridge,77122
2,3,Uptown,57182
3,4,Lincoln Square,40494
4,5,North Center,35114


Unnamed: 0,company_id,company,taxi_exterior_color,business_phone,dispatch_phone,address,city_state,zip,email
0,22,5 Star Taxi,White,773-561-4444,773-561-4444,9696 W. FOSTER AVE,"CHICAGO, IL",60656,info@flash.com
1,16,24 Seven Taxi,Blue,773-878-8294,773-944-0350,5606 N. WESTERN AV,"CHICAGO, IL",60659,chicago247taxi@gmail.com
2,4,American United,"White, Stars, Stripes",773-327-6161,773-248-7600,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
3,12,Blue Diamond,"Cream, Blue",312-881-3188,312-226-8880,"3800 N MILWAUKEE AVE, SUITE A","CHICAGO, IL",60641,
4,21,Blue Ribbon Taxi Association Inc.,"White, Blue, Stripes",773-279-4100,773-878-5400,4020 W. GLENLAKE AVE,"CHICAGO, IL",60646,info@blueribbontaxi.com


In [42]:
import mysql.connector
from mysql.connector import Error
import pandas as pd


hostname = '127.0.0.1'
port = 3306
dbname = 'chicago_taxi'
username = 'root'
password = 'password'

connection = None


In [80]:



try:

    connection = mysql.connector.connect(host=hostname, port=port, database=dbname, user=username, password=password)

    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()


       
        # Insert data into tables
        def insert_data(table_name, dataframe, insert_query):
            for i, row in dataframe.iterrows():
                cursor.execute(insert_query, tuple(row))
            connection.commit()
            print(f"Data inserted successfully into {table_name}")

        # taxi
        taxi_insert_query = """
            INSERT INTO taxi (taxi_id, public_vehicle_number, vehicle_make, vehicle_model_year, vehicle_color, vehicle_fuel_source)
            VALUES (%s, %s, %s, %s, %s, %s)
        """
        insert_data('taxi', taxi, taxi_insert_query)

        # community
        community_insert_query = """
            INSERT INTO community (community_number, community_name, population)
            VALUES (%s, %s, %s)
        """
        insert_data('community', community, community_insert_query)

        # company
        company_insert_query = """
            INSERT INTO company (company_id, company, taxi_exterior_color, business_phone, dispatch_phone, address, city_state, zip, email)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        insert_data('company', company, company_insert_query)

        # location
        location_insert_query = """
            INSERT INTO location (location_coordinates, address)
            VALUES (%s, %s)
        """
        insert_data('location', location, location_insert_query)

        # trips
        trips_insert_query = """
            INSERT INTO trips (unique_key, taxi_id, trip_start_timestamp, trip_end_timestamp, trip_seconds, trip_miles, pickup_community_area, dropoff_community_area, fare, tips, extras, trip_total, payment_type, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, pickup_location, dropoff_location, company_id)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        insert_data('trips', trips, trips_insert_query)
        

except Error as e:
    print("Error while connecting to MySQL", e)
finally:
    # Close the connection if it was established
    if connection and connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")


Connected to MySQL Server version  8.0.33
Data inserted successfully into taxi
Data inserted successfully into community
Data inserted successfully into company
Data inserted successfully into location
Data inserted successfully into trips
MySQL connection is closed
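
- Inserting several million rows with one `execute` call per row is slow. A sketch of a batched variant using `executemany` follows. It uses the built-in `sqlite3` module so that it runs without a server, which means the placeholders are `?` rather than the `%s` used with `mysql.connector`. The helper names `to_records` and `insert_in_batches`, the reduced `company` table, and the batch size are assumptions. Missing values are mapped to `None` so that they are stored as SQL `NULL`.

```python
import math
import sqlite3

def to_records(rows):
    """Replace float NaN with None so missing values become SQL NULL."""
    return [
        tuple(None if isinstance(v, float) and math.isnan(v) else v for v in row)
        for row in rows
    ]

def insert_in_batches(conn, query, rows, batch_size=1000):
    """Insert rows with executemany, committing once per batch."""
    records = to_records(rows)
    for start in range(0, len(records), batch_size):
        conn.executemany(query, records[start:start + batch_size])
        conn.commit()
    return len(records)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (company_id INTEGER PRIMARY KEY, company TEXT, email TEXT)")

rows = [
    (22, "5 Star Taxi", "info@flash.com"),
    (4, "American United", float("nan")),   # missing email, as in the company table
    (12, "Blue Diamond", float("nan")),
]
inserted = insert_in_batches(conn, "INSERT INTO company VALUES (?, ?, ?)", rows, batch_size=2)
null_emails = conn.execute("SELECT COUNT(*) FROM company WHERE email IS NULL").fetchone()[0]
print(inserted, null_emails)
```

- With a DataFrame, the rows could be produced with `dataframe.itertuples(index=False, name=None)` before being passed to a helper of this kind.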
