## 1 Project Overview

As an analyst for Zuber, a new ride-sharing company launching in Chicago, my primary objective is to identify patterns and insights within the available data to enhance the company's understanding of passenger preferences and the influence of external factors on ride frequency. By leveraging data from competitors and analyzing various metrics, I aim to provide actionable recommendations to optimize Zuber's services.

## 2 Initialization

### 2.1 Add imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 2.2 Parsing Data

In [2]:
URL = 'https://practicum-content.s3.us-west-1.amazonaws.com/data-analyst-eng/moved_chicago_weather_2017.html'
req = requests.get(URL)
req.raise_for_status()  # fail early if the page could not be downloaded
soup = BeautifulSoup(req.text, 'lxml')

Fetch the HTML content and parse it with BeautifulSoup.

In [3]:
table = soup.find('table', attrs={"id": "weather_records"})
heading_table = [th.text.strip() for th in table.find_all('th')]

Find the table and extract its headers.

In [4]:
content = []
for row in table.find_all('tr'): # type: ignore
    if not row.find_all('th'):
        content.append([td.text.strip() for td in row.find_all('td')])

weather_records = pd.DataFrame(content, columns=heading_table) # type: ignore
weather_records


Unnamed: 0,Date and time,Temperature,Description
0,2017-11-01 00:00:00,276.150,broken clouds
1,2017-11-01 01:00:00,275.700,scattered clouds
2,2017-11-01 02:00:00,275.610,overcast clouds
3,2017-11-01 03:00:00,275.350,broken clouds
4,2017-11-01 04:00:00,275.240,broken clouds
...,...,...,...
692,2017-11-29 20:00:00,281.340,few clouds
693,2017-11-29 21:00:00,281.690,sky is clear
694,2017-11-29 22:00:00,281.070,few clouds
695,2017-11-29 23:00:00,280.060,sky is clear


Extract the table rows and create the DataFrame.
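The parsing steps above can be sketched as one reusable function. This is a minimal illustration, not the notebook's own code: `SAMPLE_HTML` and `table_to_frame` are hypothetical names, the inline HTML is a two-row stand-in for the real page so the example runs offline, and it assumes the built-in `html.parser` is an acceptable substitute for `lxml`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the weather page, used so the example runs offline.
SAMPLE_HTML = """
<table id="weather_records">
  <tr><th>Date and time</th><th>Temperature</th><th>Description</th></tr>
  <tr><td>2017-11-01 00:00:00</td><td>276.150</td><td>broken clouds</td></tr>
  <tr><td>2017-11-01 01:00:00</td><td>275.700</td><td>scattered clouds</td></tr>
</table>
"""


def table_to_frame(html, table_id):
    """Return the table with the given id as a DataFrame of strings."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml needed
    table = soup.find('table', attrs={'id': table_id})
    if table is None:
        raise ValueError(f'No table with id {table_id!r} found')
    headers = [th.text.strip() for th in table.find_all('th')]
    rows = [
        [td.text.strip() for td in tr.find_all('td')]
        for tr in table.find_all('tr')
        if tr.find_all('td')  # skip the header row, which has only <th> cells
    ]
    return pd.DataFrame(rows, columns=headers)


sample_weather = table_to_frame(SAMPLE_HTML, 'weather_records')
print(sample_weather)
```

Every cell comes back as a string, which is why the type conversions in section 3.1 are needed.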

### 2.3 Set up CSV DataFrames

In [5]:
local = {
    'cabs': './dataset/moved_project_sql_result_01.csv', 
    'neighborhoods': './dataset/moved_project_sql_result_04.csv',
    'trips': './dataset/moved_project_sql_result_07.csv'
}
server_path = {
    'cabs': '/dataset/moved_project_sql_result_01.csv',
    'neighborhoods': '/dataset/moved_project_sql_result_04.csv',
    'trips': '/dataset/moved_project_sql_result_07.csv'
}
online = {}

Set the candidate paths for each CSV file. The `online` dictionary is left empty for now.

In [6]:
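def load_data(name):
    # `name` replaces `set`, which shadowed the Python built-in.
    # Try each known location in order: local copy, server path, online URL.
    for source in (local, server_path, online):
        try:
            return pd.read_csv(source[name])
        except (FileNotFoundError, KeyError):
            # KeyError covers sources with no entry, such as the empty `online`.
            continue
    raise FileNotFoundError(f'No readable source found for dataset {name!r}')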

Define a function that tries each location in turn until the file is found.
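The fallback idea can be tested without the project's files. The sketch below is an illustration under stated assumptions: `load_first_available` is a hypothetical helper, and the CSV is a one-row sample written to a temporary directory rather than one of the real datasets.

```python
import os
import tempfile

import pandas as pd


def load_first_available(candidates):
    """Return a DataFrame from the first candidate path that exists."""
    for path in candidates:
        try:
            return pd.read_csv(path)
        except FileNotFoundError:
            continue  # try the next location
    raise FileNotFoundError(f'None of the candidate paths exist: {candidates}')


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny sample file; the first candidate is deliberately missing.
    real_path = os.path.join(tmp, 'cabs_sample.csv')
    with open(real_path, 'w') as f:
        f.write('company_name,trips_amount\nFlash Cab,19558\n')
    missing_path = os.path.join(tmp, 'does_not_exist.csv')
    sample_cabs = load_first_available([missing_path, real_path])

print(sample_cabs)
```

Raising a single `FileNotFoundError` that lists every attempted path makes a failed load easier to diagnose than an error about the last path alone.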

In [7]:
cabs = load_data('cabs')
neighborhoods = load_data('neighborhoods')
trips = load_data('trips')

Load the three DataFrames.

## 3 Preparing the Data

### 3.1 Inspect `weather_records`

In [8]:
weather_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 697 entries, 0 to 696
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Date and time  697 non-null    object
 1   Temperature    697 non-null    object
 2   Description    697 non-null    object
dtypes: object(3)
memory usage: 16.5+ KB


`Date and time` and `Temperature` are stored as objects and need to be converted to datetime and numeric types.

In [9]:
weather_records['Date and time'] = pd.to_datetime(weather_records['Date and time'], format="%Y-%m-%d %H:%M:%S")
weather_records['Temperature'] = pd.to_numeric(weather_records['Temperature'])
weather_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 697 entries, 0 to 696
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date and time  697 non-null    datetime64[ns]
 1   Temperature    697 non-null    float64       
 2   Description    697 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 16.5+ KB


Both columns now have the correct types.
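The conversion above raises an error if any value is malformed. A more defensive variant, sketched here on hypothetical sample data (the third row is an invented bad record), passes `errors='coerce'` so that unparseable values become `NaT` or `NaN` and can be inspected afterwards.

```python
import pandas as pd

# Hypothetical sample: two valid rows and one deliberately malformed row.
raw = pd.DataFrame({
    'Date and time': ['2017-11-01 00:00:00', '2017-11-01 01:00:00', 'not a date'],
    'Temperature': ['276.150', '275.700', 'n/a'],
})

converted = raw.copy()
converted['Date and time'] = pd.to_datetime(
    converted['Date and time'], format='%Y-%m-%d %H:%M:%S', errors='coerce'
)
converted['Temperature'] = pd.to_numeric(converted['Temperature'], errors='coerce')

# Rows where at least one value could not be parsed.
bad_rows = converted[converted.isna().any(axis=1)]
print(converted.dtypes)
print(bad_rows)
```

Since the real table has no missing values after conversion, the strict version used in the notebook is sufficient here.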

In [10]:
weather_records_missing = weather_records.isna().sum()
weather_records_dupl = weather_records.duplicated().sum()
print(f'Number of missing values:\n{weather_records_missing}\n\nNumber of duplicated rows:\n{weather_records_dupl}')

Number of missing values:
Date and time    0
Temperature      0
Description      0
dtype: int64

Number of duplicated rows:
0


No missing values or duplicated rows.
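The same missing-value and duplicate check is repeated for every DataFrame, so it could be wrapped in a small helper. This is a sketch only: `quality_report` is a hypothetical name, and the sample frame is invented for illustration.

```python
import pandas as pd


def quality_report(df):
    """Summarize missing values per column and fully duplicated rows."""
    return {
        'missing': df.isna().sum().to_dict(),
        'duplicated_rows': int(df.duplicated().sum()),
    }


# Hypothetical sample with one missing name and one repeated row.
sample = pd.DataFrame({
    'company_name': ['Flash Cab', 'Yellow Cab', 'Yellow Cab', None],
    'trips_amount': [19558, 9888, 9888, 7],
})

report = quality_report(sample)
print(report)
```

Calling `quality_report(weather_records)`, `quality_report(cabs)`, and so on would give the same numbers as the printed checks in this section.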

In [11]:
# TODO: might keep this in the future
#weather_records['Date and time'] = weather_records['Date and time'].dt.date

Optional step: truncating the timestamp to a date is left commented out for now.
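If the timestamp is truncated later, the choice of method matters. The sketch below uses two hypothetical sample timestamps to compare `.dt.date`, which yields plain Python `date` objects, with `.dt.normalize()`, which resets the time to midnight and keeps the datetime dtype.

```python
import datetime

import pandas as pd

# Hypothetical sample: two hourly readings from the same day.
stamps = pd.Series(pd.to_datetime(['2017-11-01 00:00:00', '2017-11-01 13:00:00']))

# Python date objects; the .dt accessor is no longer available on the result.
as_dates = stamps.dt.date

# Still a datetime column, with every time set to 00:00:00.
as_midnight = stamps.dt.normalize()

print(as_dates)
print(as_midnight)
```

Keeping the datetime dtype may be preferable if later steps still rely on `.dt` accessors or date arithmetic.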

### 3.2 Inspect `cabs`

In [12]:
cabs

Unnamed: 0,company_name,trips_amount
0,Flash Cab,19558
1,Taxi Affiliation Services,11422
2,Medallion Leasin,10367
3,Yellow Cab,9888
4,Taxi Affiliation Service Yellow,9299
...,...,...
59,4053 - 40193 Adwar H. Nikola,7
60,2733 - 74600 Benny Jona,7
61,5874 - 73628 Sergey Cab Corp.,5
62,"2241 - 44667 - Felman Corp, Manuel Alonso",3


`company_name` should be an object column and `trips_amount` an integer column.

In [13]:
cabs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


Both column types are correct.

In [14]:
cabs_missing = cabs.isna().sum()
cabs_dupl = cabs.duplicated().sum()
print(f'Number of missing values:\n{cabs_missing}\n\nNumber of duplicated rows:\n{cabs_dupl}')

Number of missing values:
company_name    0
trips_amount    0
dtype: int64

Number of duplicated rows:
0


No missing values or duplicated rows.

### 3.3 Inspect `neighborhoods`

In [15]:
neighborhoods

Unnamed: 0,dropoff_location_name,average_trips
0,Loop,10727.466667
1,River North,9523.666667
2,Streeterville,6664.666667
3,West Loop,5163.666667
4,O'Hare,2546.900000
...,...,...
89,Mount Greenwood,3.137931
90,Hegewisch,3.117647
91,Burnside,2.333333
92,East Side,1.961538


`dropoff_location_name` should be an object column and `average_trips` a float column.

In [16]:
neighborhoods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


Both column types are correct.

In [17]:
neighborhoods_missing = neighborhoods.isna().sum()
neighborhoods_dupl = neighborhoods.duplicated().sum()
print(f'Number of missing values:\n{neighborhoods_missing}\n\nNumber of duplicated rows:\n{neighborhoods_dupl}')

Number of missing values:
dropoff_location_name    0
average_trips            0
dtype: int64

Number of duplicated rows:
0


No missing values or duplicated rows.

### 3.4 Inspect `trips`

In [18]:
trips

Unnamed: 0,start_ts,weather_conditions,duration_seconds
0,2017-11-25 16:00:00,Good,2410.0
1,2017-11-25 14:00:00,Good,1920.0
2,2017-11-25 12:00:00,Good,1543.0
3,2017-11-04 10:00:00,Good,2512.0
4,2017-11-11 07:00:00,Good,1440.0
...,...,...,...
1063,2017-11-25 11:00:00,Good,0.0
1064,2017-11-11 10:00:00,Good,1318.0
1065,2017-11-11 13:00:00,Good,2100.0
1066,2017-11-11 08:00:00,Good,1380.0


The columns should be a datetime, an object, and a float.

In [19]:
trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB


`start_ts` is stored as an object and needs to be converted to datetime.

In [20]:
trips['start_ts'] = pd.to_datetime(trips['start_ts'], format="%Y-%m-%d %H:%M:%S")
trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   start_ts            1068 non-null   datetime64[ns]
 1   weather_conditions  1068 non-null   object        
 2   duration_seconds    1068 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 25.2+ KB


`start_ts` is now a datetime column.

In [21]:
trip_missing = trips.isna().sum()
trip_dupl = trips.duplicated().sum()
print(f'Number of missing values:\n{trip_missing}\n\nNumber of duplicated rows:\n{trip_dupl}')

Number of missing values:
start_ts              0
weather_conditions    0
duration_seconds      0
dtype: int64

Number of duplicated rows:
197


There are no missing values, but 197 rows are flagged as duplicates, so they are inspected next.

In [28]:
trips_dupl_names = trips[trips.duplicated(keep=False)].sort_values(by='start_ts')
trips_dupl_names

Unnamed: 0,start_ts,weather_conditions,duration_seconds
541,2017-11-04 05:00:00,Good,1200.0
462,2017-11-04 05:00:00,Good,1200.0
682,2017-11-04 06:00:00,Good,1267.0
681,2017-11-04 06:00:00,Good,1267.0
363,2017-11-04 07:00:00,Good,1440.0
...,...,...,...
831,2017-11-25 11:00:00,Good,1680.0
1058,2017-11-25 12:00:00,Good,1440.0
255,2017-11-25 12:00:00,Good,1380.0
53,2017-11-25 12:00:00,Good,1380.0


The duplicates are not problematic. `start_ts` is recorded only to the hour, so it is reasonable that some rides between the same areas share a start hour, weather condition, and duration. The 197 flagged rows are about 18% of the 1,068 trips and are kept.
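The two duplicate counts used above measure different things: `duplicated()` counts only the extra copies, while `duplicated(keep=False)` counts every row that belongs to a repeated group. The sketch below shows the difference on hypothetical sample trips and also counts how often each combination occurs.

```python
import pandas as pd

# Hypothetical sample: three identical trips and one unique trip.
sample_trips = pd.DataFrame({
    'start_ts': pd.to_datetime([
        '2017-11-04 05:00:00',
        '2017-11-04 05:00:00',
        '2017-11-04 05:00:00',
        '2017-11-04 06:00:00',
    ]),
    'weather_conditions': ['Good'] * 4,
    'duration_seconds': [1200.0, 1200.0, 1200.0, 1267.0],
})

# Copies beyond the first occurrence of each repeated row.
extra_copies = int(sample_trips.duplicated().sum())

# Every row that has at least one identical twin, first occurrence included.
all_members = int(sample_trips.duplicated(keep=False).sum())

# Number of rows per unique combination of all columns.
group_sizes = (
    sample_trips.groupby(list(sample_trips.columns))
    .size()
    .reset_index(name='count')
)
repeated = group_sizes[group_sizes['count'] > 1]

print(extra_copies, all_members)
print(repeated)
```

Applied to `trips`, the `count` column would show how many rides share each hour, weather condition, and duration.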

## 4 Data Analysis

### 4.1 Task 1