----------
# Cab Consulting Questions
_______

## Author Information
Jessen Hobson, Ph.D. </br>
Associate Professor of Accountancy </br> 
R.C. Evans Data Analytics Fellow </br>
University of Illinois at Urbana-Champaign


### Adapted By
Jake Krupa, Ph.D., CPA </br>
Assistant Professor of Accounting </br> 
AB Freeman School of Business </br>
Tulane University

## Learning Objectives
* Retrieve data via API
* Store and access data via Pandas and CSV
* Clean data in Pandas
* Descriptive analytics: 
    * grouping data
    * visualization via box plots, scatter plots, histograms, and heat maps
* Predictive analytics using regression
* Storytelling from a business analytics point of view


## Case Background
Yellow Cab Chicago is a hypothetical cab company that has hired your small accounting and consulting firm to help it shave costs and increase revenues. Like many traditional cab companies, Yellow Cab is feeling significant pressure from ride-hailing companies like Lyft and Uber. Your assignment is to use an authentic, existing data set to better understand the situation Yellow Cab is facing and to investigate potential revenue opportunities.
 
-------

## Load Relevant Packages and Dependencies

-------

Set up the environment, load packages and dependencies.

In [40]:
# database tool
import pandas as pd 
import numpy as np 
# for use with the API to get data (Retrieve Data, Method 1)
# if you get an error that sodapy is not installed, 
# use this code to install in Jupyter: !pip install sodapy
# from sodapy import Socrata  # left commented out: this import was not working here
# You can also download the data from Canvas

# Also, import the linear regression library
import statsmodels.formula.api as smf

# to show graphs inline
%matplotlib inline 

# disable an unneeded warning
pd.options.mode.chained_assignment = None  # default='warn'

## Load Data and Save Data
There are two methods to obtain data for this case. Both methods use data from data.cityofchicago.org.

------

### Retrieve Data, Method 1
Load the data from data.cityofchicago.org using its API. An application token is provided for you in the code for this method. For background on the data set, see https://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

Adapt this one line of code below to download 2,000,000 rows of data from 2015:
`results = client.get("r2u4-wwk3", limit=200)`

In [None]:
# Import the Socrata client (requires the sodapy package)
from sodapy import Socrata

# App token - something needed for multiple requests from an API
MyAppToken="J1pdc11pinLHhRMnrhgItLRqO"
client = Socrata("data.cityofchicago.org/",
                  MyAppToken,
                  username="jkrupa@tulane.edu",
                  password="accn7290-3290SP2021")

# returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
# Getting a subset of 2,000,000
# r2u4-wwk3 is dataset for 2020
# 9arg-bn2i is dataset for 2015
# wrvz-psew is for all
# results = client.get("r2u4-wwk3", limit=200)
# ADAPT THIS LINE OF CODE 
results = client.get("9arg-bn2i", limit=2000000)


# Convert to pandas DataFrame
df = pd.DataFrame.from_records(results)
df.shape
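A single request for 2,000,000 rows can be slow or fail partway through. One possible alternative, sketched here under assumptions, is to request the rows in pages using the `limit` and `offset` arguments of `client.get`. The helper `fetch_in_pages` and the `FakeClient` class are hypothetical names introduced only for illustration; `FakeClient` stands in for the real Socrata client so the sketch runs without a network connection.

```python
def fetch_in_pages(client, dataset_id, total_rows, page_size=50_000):
    """Collect up to total_rows records by calling client.get repeatedly."""
    records = []
    offset = 0
    while len(records) < total_rows:
        remaining = total_rows - len(records)
        page = client.get(dataset_id, limit=min(page_size, remaining), offset=offset)
        if not page:  # the dataset ran out of rows
            break
        records.extend(page)
        offset += len(page)
    return records


class FakeClient:
    """Hypothetical stand-in for a Socrata client so the sketch runs offline."""

    def __init__(self, n_rows):
        self.rows = [{"trip_id": str(i)} for i in range(n_rows)]

    def get(self, dataset_id, limit, offset=0):
        return self.rows[offset:offset + limit]


# Ask for 100 rows in pages of 30 from a fake dataset that holds 120 rows
demo_records = fetch_in_pages(FakeClient(120), "9arg-bn2i", total_rows=100, page_size=30)
print(len(demo_records))  # 100
```

With the real client, the returned list could be passed to `pd.DataFrame.from_records` in the same way as `results`.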

Save data as a CSV file so that you can save and return to your work without having to use an application token in the API.

In [4]:
# Export file for later use
file = 'cab_data2015.csv' 
df.to_csv(file, index = False)

In [25]:
# Load data for use
df2015=pd.read_csv("cab_data2015.csv")
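The `index = False` argument matters for the round trip. The sketch below uses a small made-up DataFrame (not the taxi data) and in-memory buffers instead of files to show what each choice does to the columns when the CSV is read back.

```python
import io

import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"trip_id": ["a", "b"], "fare": [6.65, 7.25]})

# Default behavior: the row labels are written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'trip_id', 'fare']

# index=False: only the data columns are written
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_without_index = pd.read_csv(no_index).columns.tolist()
print(cols_without_index)  # ['trip_id', 'fare']
```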

## 0. Look at the Data
--------

Question: take a look at 10 random rows using `.sample()` of the DataFrame to get a sense of what you have downloaded.

In [26]:
df2015.sample(10)

| | trip_id | taxi_id | trip_start_timestamp | trip_end_timestamp | trip_seconds | trip_miles | fare | tips | extras | trip_total | ... | dropoff_census_tract | pickup_community_area | dropoff_community_area | pickup_centroid_latitude | pickup_centroid_longitude | pickup_centroid_location | dropoff_centroid_latitude | dropoff_centroid_longitude | dropoff_centroid_location | tolls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 781113 | 93e74fd6e45891db0fdf946ce349703d01c903dd | 365689b9f3107b807470fe16b781f7ae5e76e057e00a19... | 2015-03-10T20:45:00.000 | 2015-03-10T21:00:00.000 | 480.0 | 0.0 | 6.65 | 1.0 | 0.0 | 7.65 | ... | 17031080000.0 | 32.0 | 8.0 | 41.880994 | -87.632746 | {'type': 'Point', 'coordinates': [-87.63274648... | 41.892042 | -87.631864 | {'type': 'Point', 'coordinates': [-87.63186394... | 0.0 |
| 660261 | 3fb28e1a6aa53eaa0d754e8ae26ae2c1ec7c1a3c | c7a2b9554ad4efdac7ae4fc8ad99678ffefa21bccf6d23... | 2015-03-11T08:00:00.000 | 2015-03-11T08:15:00.000 | 480.0 | 1.7 | 7.25 | 0.0 | 0.0 | 7.25 | ... | 17031840000.0 | 8.0 | 32.0 | 41.895033 | -87.619711 | {'type': 'Point', 'coordinates': [-87.61971067... | 41.880994 | -87.632746 | {'type': 'Point', 'coordinates': [-87.63274648... | 0.0 |
| 880801 | fc96e3ea20fe7af2e1975e5d55941e59c301eb56 | a9bc5e55dd09e8687769a853fe4e67226f1e16b649e7bc... | 2015-04-17T19:00:00.000 | 2015-04-17T19:00:00.000 | 240.0 | 0.7 | 5.05 | 2.0 | 0.0 | 7.05 | ... | 17031840000.0 | 32.0 | 28.0 | 41.880994 | -87.632746 | {'type': 'Point', 'coordinates': [-87.63274648... | 41.867902 | -87.642959 | {'type': 'Point', 'coordinates': [-87.64295866... | 0.0 |
| 2930 | 9778151bcf26fdefa5c4692532be5bb9bfb96355 | 3b6f6ca381f164253171136e5ab4d752da5ec9223ea20f... | 2015-10-23T18:45:00.000 | 2015-10-23T19:15:00.000 | 1820.0 | 3.4 | 15.25 | 0.0 | 1.5 | 16.75 | ... | | | | | | | | | | |
| 1882538 | 499a2c2e48a65442c0c49a0ad6e2caf824560b4f | 2bda8e2c6eee0728581a0614b850f52cf79004426ff1df... | 2015-08-14T18:00:00.000 | 2015-08-14T18:00:00.000 | 0.0 | 0.0 | 44.25 | 8.85 | 0.0 | 53.1 | ... | | | | | | | | | | 0.0 |
| 1612737 | 7658ba1cd60f5dcbce7b6ade2a4784e075a80b53 | 1d6d1609c02f50bdd45ffcbe08fb36ce7908b2ff08b421... | 2015-07-16T11:30:00.000 | 2015-07-16T12:00:00.000 | 1320.0 | 4.3 | 13.85 | 0.0 | 1.0 | 14.85 | ... | | 20.0 | 6.0 | 41.924347 | -87.73474 | {'type': 'Point', 'coordinates': [-87.73473975... | 41.944227 | -87.655998 | {'type': 'Point', 'coordinates': [-87.65599818... | 0.0 |
| 1094946 | 769ec256afa9aff9a66b0ba325c7b801282deaa7 | ea7bd9f0e56d3a905d808f728cd0630e1c11d4f85a3d42... | 2015-05-19T08:15:00.000 | 2015-05-19T08:15:00.000 | 600.0 | 3.4 | 9.85 | 0.0 | 0.0 | 9.85 | ... | | 24.0 | 32.0 | 41.901207 | -87.676356 | {'type': 'Point', 'coordinates': [-87.67635598... | 41.878866 | -87.625192 | {'type': 'Point', 'coordinates': [-87.62519214... | 0.0 |
| 784252 | 69f72aeccd3cc7521ed4c9b94eeb73573d8f92b1 | dc5cee229b76c5da48c87dd388fa845941c54103cd1a63... | 2015-03-11T00:30:00.000 | 2015-03-11T00:45:00.000 | 660.0 | 0.3 | 13.05 | 3.9 | 0.0 | 16.95 | ... | 17031070000.0 | 32.0 | 7.0 | 41.880994 | -87.632746 | {'type': 'Point', 'coordinates': [-87.63274648... | 41.929078 | -87.646293 | {'type': 'Point', 'coordinates': [-87.64629347... | 0.0 |
| 899077 | a690e4dc1bde44a6dc0f7cd01ff0b6794f662464 | 96a3f43a4eb7400d581a682773569e457cdcba002adf40... | 2015-04-18T01:45:00.000 | 2015-04-18T01:45:00.000 | 120.0 | 0.5 | 4.05 | 2.0 | 1.5 | 7.55 | ... | 17031080000.0 | 8.0 | 8.0 | 41.899156 | -87.626211 | {'type': 'Point', 'coordinates': [-87.62621053... | 41.900266 | -87.632109 | {'type': 'Point', 'coordinates': [-87.63210921... | 0.0 |
| 26280 | 74b2e9b100a4e965b1630c6655aea5845538155c | 66e0beb5cdba833d46e74cf7d92603483d9d7237c3427d... | 2015-01-24T00:15:00.000 | 2015-01-24T00:30:00.000 | 420.0 | 3.7 | 10.05 | 2.0 | 2.0 | 14.05 | ... | | | | | | | | | | 0.0 |


Question: Use the `.info()` function in Pandas to view data types of each column.

In [27]:
df2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   trip_id                     object 
 1   taxi_id                     object 
 2   trip_start_timestamp        object 
 3   trip_end_timestamp          object 
 4   trip_seconds                float64
 5   trip_miles                  float64
 6   fare                        float64
 7   tips                        float64
 8   extras                      float64
 9   trip_total                  float64
 10  payment_type                object 
 11  company                     object 
 12  pickup_census_tract         float64
 13  dropoff_census_tract        float64
 14  pickup_community_area       float64
 15  dropoff_community_area      float64
 16  pickup_centroid_latitude    float64
 17  pickup_centroid_longitude   float64
 18  pickup_centroid_location    object 
 19  dropoff_centroid_lati

## 1. Clean Data
--------


### 1.1 Eliminate bad data

Question:

#### 1.1.1 Delete trips with any of the following trip characteristics:
- zero or missing fare 
- zero or missing trip_total
- *missing* tips

In [28]:
condition1 = (df2015['fare'].isnull()) | (df2015['fare']==0)
df2015 = df2015[~condition1]

df2015.shape

(1997093, 23)

In [29]:
condition2 = (df2015['trip_total'].isnull()) | (df2015['trip_total']==0)
df2015 = df2015[~condition2]

df2015.shape

(1997093, 23)

In [30]:
condition3 = (df2015['tips'].isnull())
df2015 = df2015[~condition3]

df2015.shape

(1997093, 23)
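The three filters above can also be combined into one boolean mask. The sketch below applies the same logic to a small made-up DataFrame (not the taxi data), so the effect of each condition can be checked by eye; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: only the first one passes all three checks
toy = pd.DataFrame({
    "fare":       [6.65, 0.0, np.nan, 9.85, 5.05],
    "trip_total": [7.65, 0.0, 3.00, 0.0, 7.05],
    "tips":       [1.0, 0.0, 0.0, 0.0, np.nan],
})

bad_fare = toy["fare"].isnull() | (toy["fare"] == 0)
bad_total = toy["trip_total"].isnull() | (toy["trip_total"] == 0)
missing_tips = toy["tips"].isnull()  # zero tips are kept; only missing tips are dropped

# A row is removed if it fails any one of the checks
cleaned = toy[~(bad_fare | bad_total | missing_tips)]
print(cleaned.shape)  # (1, 3)
```

Filtering step by step, as in the cells above, has the advantage of showing how many rows each condition removes.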

Question:

#### 1.1.2 Delete trips with any of the following characteristics: 
- NaN (missing) mileage
- Zero or NaN (missing) seconds.

In [31]:
condition4 = (df2015['trip_miles'].isnull())
df2015 = df2015[~condition4]

df2015.shape

(1997089, 23)

In [32]:
condition5 = (df2015['trip_seconds'].isnull()) | (df2015['trip_seconds']==0)
df2015 = df2015[~condition5]

df2015.shape

(1740739, 23)

Question:

#### 1.1.3 Delete trips with either of the following characteristics but starting and stopping in a different community, since these are likely mistakes:
- no mileage (equal to zero)
- no seconds (equal to zero)

In [33]:
condition6 = (df2015['trip_miles']==0) & (df2015["pickup_centroid_location"] != df2015["dropoff_centroid_location"])
df2015 = df2015[~condition6]

In [34]:
df2015.shape

(1389100, 23)

In [35]:
condition7 = (df2015['trip_seconds']==0) & (df2015["pickup_centroid_location"] != df2015["dropoff_centroid_location"])
df2015 = df2015[~condition7]
df2015.shape

(1389100, 23)
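One detail worth checking in the two filters above: in pandas, a missing value never compares equal to anything, including another missing value. As a result, `!=` returns `True` when both locations are missing, so zero-mile trips with unknown pickup and dropoff locations may also be removed by `condition6`. The sketch below illustrates this on made-up data; the stricter variant is only a diagnostic and is not the filter that produced the row counts reported above.

```python
import numpy as np
import pandas as pd

# Made-up rows: different places, same place, both unknown, nonzero miles
toy = pd.DataFrame({
    "trip_miles": [0.0, 0.0, 0.0, 1.2],
    "pickup_centroid_location":  ["A", "A", np.nan, "A"],
    "dropoff_centroid_location": ["B", "A", np.nan, "B"],
})

differs = toy["pickup_centroid_location"] != toy["dropoff_centroid_location"]
print(differs.tolist())  # [True, False, True, True]

# Same logic as condition6: the row with two missing locations is flagged too
flagged_original = (toy["trip_miles"] == 0) & differs
print(flagged_original.tolist())  # [True, False, True, False]

# Stricter variant: only flag rows where both locations are actually known
both_known = (toy["pickup_centroid_location"].notna()
              & toy["dropoff_centroid_location"].notna())
flagged_strict = (toy["trip_miles"] == 0) & differs & both_known
print(flagged_strict.tolist())  # [True, False, False, False]
```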

#### 1.1.4 Delete trips with missing pick up latitude or longitude.

In [36]:
condition8 = (df2015['pickup_centroid_latitude'].isnull()) | (df2015['pickup_centroid_longitude'].isnull())
df2015 = df2015[~condition8]
df2015.shape

(1226479, 23)

### 1.2 Fix datetime

Question:

#### 1.2.1 Create a new column that converts `trip_start_timestamp` and `trip_end_timestamp` to a datetime object

In [37]:
df2015['startDateTime'] = pd.to_datetime(df2015['trip_start_timestamp'])
df2015['endDateTime'] = pd.to_datetime(df2015['trip_end_timestamp'])

Question:

#### 1.2.2 Create four new columns - for the month, day, hour, and day of the week, respectively (use the start of the trip, since that is something the taxi driver can control)

In [38]:
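# Extract calendar parts from the trip start time
df2015['startYear'] = pd.DatetimeIndex(df2015['startDateTime']).year
df2015['startMonth'] = pd.DatetimeIndex(df2015['startDateTime']).month
df2015['startDay'] = pd.DatetimeIndex(df2015['startDateTime']).day
# Note: this column is spelled 'startHoue' and is used under that name in the regressions
df2015['startHoue'] = pd.DatetimeIndex(df2015['startDateTime']).hour

# Keep only trips that started in January through June
condition9 = df2015['startMonth'] < 7
df2015 = df2015[condition9]
df2015.shape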

(922773, 29)
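The same date parts, including the day of the week, can also be read through the `.dt` accessor of a datetime column. The sketch below uses two timestamps taken from the sample rows shown earlier; the DataFrame and its column names are illustrative, not part of the case data.

```python
import pandas as pd

toy = pd.DataFrame({"trip_start_timestamp": ["2015-03-10T20:45:00.000",
                                             "2015-07-16T11:30:00.000"]})
toy["startDateTime"] = pd.to_datetime(toy["trip_start_timestamp"])

toy["month"] = toy["startDateTime"].dt.month
toy["hour"] = toy["startDateTime"].dt.hour
toy["dayofweek"] = toy["startDateTime"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["day_name"] = toy["startDateTime"].dt.day_name()

print(toy[["month", "hour", "dayofweek", "day_name"]])
```

Because `dayofweek` is coded with Monday as 0, a value of 1 is a Tuesday and a value of 3 is a Thursday.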

### 1.3 Deal with outliers

### Using reasonable judgement to identify outliers

Most very high cost trips are probably errors or uninteresting, since they took very few seconds and went very few miles. For example, it does not seem plausible that an $8,000 cab fare went about 0 miles and took about 1 second. 

Let's correct this by eliminating rows with unusually high dollars per second (i.e., trips with unreasonably low seconds but a high fare). Create the appropriate fare-per-second ratio and eliminate trips with a ratio greater than $0.80 per second.

In [39]:
condition10 = (df2015['fare']/df2015['trip_seconds'] > .8)
df2015 = df2015[~condition10]

df2015.shape

(922733, 29)
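To put the cutoff in perspective, $0.80 per second is $48 per minute, or $2,880 per hour of driving. The sketch below applies the same rule to a few made-up trips (not the taxi data) with the ratio stored in its own column, which makes the flagged rows easy to inspect; the column name `fare_per_second` is an illustrative choice.

```python
import pandas as pd

# Made-up trips: the first and third have implausibly high fares for their duration
toy = pd.DataFrame({
    "fare":         [8000.0, 6.65, 44.25, 13.85],
    "trip_seconds": [1.0, 480.0, 30.0, 1320.0],
})

toy["fare_per_second"] = toy["fare"] / toy["trip_seconds"]
too_expensive = toy["fare_per_second"] > 0.8

kept = toy[~too_expensive]
print(kept.shape)  # (2, 3)
```

Trips with zero seconds were already removed in the cleaning step, so the division does not produce infinite ratios in the case data.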

## 2. Predictive Analytics - Regression
----------

### 2.1 Two Basic Regressions

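Question: Run two regression analyses below. The first regression predicts `fare` (the dependent variable, Y) using `trip_seconds`, `trip_miles`, `hour`, `month`, and `dayofweek` (the hour and month are stored here as `startHoue` and `startMonth`). The second regression uses the same predictors with `tips` as the dependent variable.

Run the first regression, using `fare` as the dependent variable, and display the results.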

In [44]:
df2015['dayofweek'] = pd.DatetimeIndex(df2015['startDateTime']).dayofweek
results_fare = smf.ols('fare ~ trip_seconds + trip_miles + startHoue + startMonth + dayofweek', data=df2015).fit()
results_fare.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | fare | R-squared: | 0.75 |
| Model: | OLS | Adj. R-squared: | 0.75 |
| Method: | Least Squares | F-statistic: | 554700.0 |
| Date: | Tue, 05 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:55:34 | Log-Likelihood: | -2819000.0 |
| No. Observations: | 922733 | AIC: | 5638000.0 |
| Df Residuals: | 922727 | BIC: | 5638000.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.2460 | 0.020 | 214.685 | 0.000 | 4.207 | 4.285 |
| trip_seconds | 0.0087 | 8.06e-06 | 1076.286 | 0.000 | 0.009 | 0.009 |
| trip_miles | 0.5532 | 0.001 | 552.683 | 0.000 | 0.551 | 0.555 |
| startHoue | -0.0450 | 0.001 | -57.058 | 0.000 | -0.047 | -0.043 |
| startMonth | 0.0117 | 0.003 | 3.669 | 0.000 | 0.005 | 0.018 |
| dayofweek | -0.1109 | 0.003 | -41.265 | 0.000 | -0.116 | -0.106 |

| | | | |
|---|---|---|---|
| Omnibus: | 2461992.415 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 601633796172.169 |
| Skew: | -31.248 | Prob(JB): | 0.0 |
| Kurtosis: | 3958.301 | Cond. No. | 4230.0 |
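As a worked reading of the fare coefficients, the sketch below plugs a hypothetical trip into the fitted equation by hand. The coefficients are copied from the fare regression output; the trip itself (600 seconds, 2 miles, starting at noon on a Wednesday in March) is made up for illustration.

```python
# Coefficients copied from the fare regression output
coef = {
    "Intercept": 4.2460,
    "trip_seconds": 0.0087,
    "trip_miles": 0.5532,
    "startHoue": -0.0450,
    "startMonth": 0.0117,
    "dayofweek": -0.1109,
}

# Hypothetical trip: 10 minutes, 2 miles, noon, March, Wednesday (dayofweek=2)
trip = {"trip_seconds": 600, "trip_miles": 2.0, "startHoue": 12,
        "startMonth": 3, "dayofweek": 2}

# Predicted fare = intercept + sum of (coefficient * value)
predicted_fare = coef["Intercept"] + sum(coef[name] * value for name, value in trip.items())
print(round(predicted_fare, 2))  # 9.85

# Per-unit reading: each extra minute adds about 60 * 0.0087 dollars to the fare
per_minute = 60 * coef["trip_seconds"]
print(round(per_minute, 3))  # 0.522
```

In business terms, holding the other variables fixed, each additional minute is associated with roughly $0.52 more in fare and each additional mile with roughly $0.55 more.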


Question: Run the regression discussed above using `tips` as the dependent variable and display the results.

In [45]:
results_tips = smf.ols('tips ~ trip_seconds + trip_miles + startHoue + startMonth + dayofweek', data=df2015).fit()
results_tips.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | tips | R-squared: | 0.184 |
| Model: | OLS | Adj. R-squared: | 0.184 |
| Method: | Least Squares | F-statistic: | 41620.0 |
| Date: | Tue, 05 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:56:06 | Log-Likelihood: | -2056000.0 |
| No. Observations: | 922733 | AIC: | 4112000.0 |
| Df Residuals: | 922727 | BIC: | 4112000.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.3295 | 0.009 | 38.086 | 0.000 | 0.313 | 0.346 |
| trip_seconds | 0.0009 | 3.53e-06 | 259.508 | 0.000 | 0.001 | 0.001 |
| trip_miles | 0.0817 | 0.000 | 186.638 | 0.000 | 0.081 | 0.083 |
| startHoue | 0.0006 | 0.000 | 1.694 | 0.090 | -9.17e-05 | 0.001 |
| startMonth | 0.0346 | 0.001 | 24.710 | 0.000 | 0.032 | 0.037 |
| dayofweek | -0.0506 | 0.001 | -43.055 | 0.000 | -0.053 | -0.048 |

| | | | |
|---|---|---|---|
| Omnibus: | 2275043.881 | Durbin-Watson: | 1.999 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1306382680811.042 |
| Skew: | 24.9 | Prob(JB): | 0.0 |
| Kurtosis: | 5831.905 | Cond. No. | 4230.0 |


### 2.2 Tell a story

Question: Finally, choose two coefficients from those modeled above and make recommendations to your supervisor for the Yellow Cab Company client about how drivers can maximize their expected trip_total and/or tips.

(In other words, choose two coefficients and interpret what they mean in the language of business. Use them to recommend an action for a cab driver to take, for example "Prioritize driving on Wednesdays" or "Start your trips at 10am".)
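A caution when turning these coefficients into a recommendation about a specific day or hour: `dayofweek` and the start hour enter both models as single numeric terms, so each coefficient describes a straight-line trend across days or hours, not the effect of one particular day. One simple cross-check, sketched here on made-up numbers (not the taxi data), is a grouped mean by day.

```python
import pandas as pd

# Made-up tips for three days of the week (Monday=0, Wednesday=2, Friday=4)
toy = pd.DataFrame({
    "dayofweek": [0, 0, 2, 2, 4, 4],
    "tips":      [1.0, 2.0, 3.0, 5.0, 0.0, 1.0],
})

mean_tips_by_day = toy.groupby("dayofweek")["tips"].mean()
print(mean_tips_by_day)

best_day = mean_tips_by_day.idxmax()
print(best_day)  # 2, i.e. Wednesday in this made-up example
```

A regression alternative would be to treat the day as a category, for example by writing `C(dayofweek)` in the model formula, which estimates a separate coefficient for each day relative to a baseline day.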

In [None]:
# To maximize tips and fares, you should try to increase your trip miles and also work later in the year, as trip month
# is positively associated with both tips and fares. For tips specifically, starting trips later in the day is only slightly
# positively associated with tips, and that coefficient is not statistically significant at the 5% level (p = 0.090).
