<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/330px-Uber_logo_2018.svg.png'/>
<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Lyft_logo.svg/330px-Lyft_logo.svg.png' />

## Introduction
<b>Uber</b> is an American multinational ridesharing company offering services that include peer-to-peer ridesharing, ride service hailing, food delivery, and a micromobility system with electric bikes and scooters. The company is based in San Francisco and has operations in over 785 metropolitan areas worldwide. Its platforms can be accessed via its websites and mobile apps.<br><br>
<b>Lyft</b> is a ridesharing company based in San Francisco, California and operating in 640 cities in the United States and 9 cities in Canada. It develops, markets, and operates the Lyft mobile app, offering car rides, scooters, a bicycle-sharing system, and a food delivery service. According to Second Measure, Lyft is the second-largest ridesharing company in the United States after Uber, with a 28% market share.<br><br>
<div style="text-align: right"> Source: <b>Wikipedia</b> </div>

---

The following notebook is an attempt at understanding the pricing model of Uber and Lyft cab rides. For this, we will be using a dataset which records bookings of Uber and Lyft cab rides in the city of Boston during November and December 2018. The dataset is publicly available on <a href='https://www.kaggle.com/brllrb/uber-and-lyft-dataset-boston-ma'>Kaggle</a>.<br>

We wish to derive insights about how factors such as distance, cab type, time of day and weather conditions affect the prices of cab rides, and to compare the two companies. We have also implemented a regression model to predict prices and a classification model to classify rides by company. These models expose feature importances and coefficients, which give us additional information about how individual features affect the price of a cab ride.<br>

Exploratory Data Analysis has been carried out using the Plotly visualization library, which produces interactive visualizations. You can hover over the plots for more information and also control the plot using the legend in the top right corner.  

**<i>Note: Plotly version 3.10.0 and cufflinks version 0.13.0 are required to run these interactive visualizations without any errors. If you do not have the required libraries, please install them and restart this notebook. Or, use the following link to view this notebook with interactive visualizations rendered - </i>**
https://nbviewer.jupyter.org/github/ap1495/EAS-503---Python-Data-Analysis-Project/blob/master/UberLyft.ipynb

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

from datetime import datetime
import sqlite3
import plotly.offline as py
import cufflinks as cf
from plotly.offline import download_plotlyjs, iplot, plot, init_notebook_mode
from plotly import tools
import plotly.graph_objs as go
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')



## Data Import and Exploration

In [2]:
try:
    conn=sqlite3.connect("CabData.db")
    print("Connection successful")
    
except sqlite3.Error as e:
    print(e)
    
#Loading two datasets into database.
df_sql = pd.read_csv('F:/DS/Datasets/rideshare_kaggle.csv')
geo_sql = pd.read_csv('F:/DS/Datasets/geolocations.csv')
df_sql.to_sql('Cabdetails',conn, index=False,if_exists='replace')
geo_sql.to_sql('geoLocations',conn, index=False,if_exists='replace')

#Working with selected features which are important to this project 
#and retrieving only those records whose target variable (price) is not null.
query = '''SELECT distance, cab_type, timestamp, source, destination, price, surge_multiplier, product_id,
         name, cloudCover, temperature, windSpeed, uvIndex, precipIntensityMax, precipIntensity,
         precipProbability, hour, day, loc.latitude as source_latitude,
         loc.longitude as source_longitude,loc1.latitude as destination_latitude, loc1.longitude as destination_longitude
         from  Cabdetails cab,geoLocations loc, geoLocations loc1 where loc.Place = cab.source 
         and loc1.Place = cab.destination and cab.price is not null'''
df = pd.read_sql_query(query, conn)

conn.close()

Connection successful
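
The query above joins `geoLocations` twice using comma-separated tables and a `where` clause. As a minimal sketch, the same double join can be written with explicit `JOIN ... ON` clauses. The table and column names follow the notebook, but the in-memory database and its rows are made up purely for illustration.

```python
import sqlite3

# Toy in-memory database; names mirror the notebook, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cabdetails (source TEXT, destination TEXT, price REAL);
    CREATE TABLE geoLocations (Place TEXT, latitude REAL, longitude REAL);
    INSERT INTO Cabdetails VALUES
        ('A', 'B', 5.0),
        ('B', 'A', NULL),
        ('A', 'C', 7.5);
    INSERT INTO geoLocations VALUES
        ('A', 1.0, -1.0),
        ('B', 2.0, -2.0);
""")

# geoLocations is joined twice: once for the pickup, once for the drop-off.
query = """
    SELECT cab.source, cab.destination, cab.price,
           src.latitude  AS source_latitude,
           src.longitude AS source_longitude,
           dst.latitude  AS destination_latitude,
           dst.longitude AS destination_longitude
    FROM Cabdetails AS cab
    JOIN geoLocations AS src ON src.Place = cab.source
    JOIN geoLocations AS dst ON dst.Place = cab.destination
    WHERE cab.price IS NOT NULL
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

In this toy data, the ride with a null price is removed by the `WHERE` filter, and the ride to `'C'` is removed because an inner join keeps only rides whose source and destination both have a matching `Place`.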


In [3]:
df.head()

| | distance | cab_type | timestamp | source | destination | price | surge_multiplier | product_id | name | cloudCover | ... | uvIndex | precipIntensityMax | precipIntensity | precipProbability | hour | day | source_latitude | source_longitude | destination_latitude | destination_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.44 | Lyft | 1544953000.0 | Haymarket Square | North Station | 5.0 | 1.0 | lyft_line | Shared | 0.72 | ... | 0 | 0.1276 | 0.0 | 0.0 | 9 | 16 | 42.3638 | -71.0585 | 42.3657 | -71.061 |
| 1 | 0.44 | Lyft | 1543284000.0 | Haymarket Square | North Station | 11.0 | 1.0 | lyft_premier | Lux | 1.0 | ... | 0 | 0.13 | 0.1299 | 1.0 | 2 | 27 | 42.3638 | -71.0585 | 42.3657 | -71.061 |
| 2 | 0.44 | Lyft | 1543367000.0 | Haymarket Square | North Station | 7.0 | 1.0 | lyft | Lyft | 0.03 | ... | 0 | 0.1064 | 0.0 | 0.0 | 1 | 28 | 42.3638 | -71.0585 | 42.3657 | -71.061 |
| 3 | 0.44 | Lyft | 1543554000.0 | Haymarket Square | North Station | 26.0 | 1.0 | lyft_luxsuv | Lux Black XL | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 4 | 30 | 42.3638 | -71.0585 | 42.3657 | -71.061 |
| 4 | 0.44 | Lyft | 1543463000.0 | Haymarket Square | North Station | 9.0 | 1.0 | lyft_plus | Lyft XL | 0.44 | ... | 0 | 0.0001 | 0.0 | 0.0 | 3 | 29 | 42.3638 | -71.0585 | 42.3657 | -71.061 |


In [4]:
#Rows, columns
df.shape

(637976, 22)

In [5]:
#Columns with number of records and their data types.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637976 entries, 0 to 637975
Data columns (total 22 columns):
distance                 637976 non-null float64
cab_type                 637976 non-null object
timestamp                637976 non-null float64
source                   637976 non-null object
destination              637976 non-null object
price                    637976 non-null float64
surge_multiplier         637976 non-null float64
product_id               637976 non-null object
name                     637976 non-null object
cloudCover               637976 non-null float64
temperature              637976 non-null float64
windSpeed                637976 non-null float64
uvIndex                  637976 non-null int64
precipIntensityMax       637976 non-null float64
precipIntensity          637976 non-null float64
precipProbability        637976 non-null float64
hour                     637976 non-null int64
day                      637976 non-null int64
source_latitude   

In [6]:
#Descriptive statistics of numerical features
df.describe()

| | distance | timestamp | price | surge_multiplier | cloudCover | temperature | windSpeed | uvIndex | precipIntensityMax | precipIntensity | precipProbability | hour | day | source_latitude | source_longitude | destination_latitude | destination_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 | 637976.0 |
| mean | 2.189261 | 1544046000.0 | 16.545125 | 1.015068 | 0.686291 | 39.582406 | 6.186795 | 0.249031 | 0.037369 | 0.008909 | 0.145941 | 11.618528 | 17.797674 | 42.355036 | -59.219575 | 42.355037 | -59.222248 |
| std | 1.135413 | 689202.8 | 9.324359 | 0.095422 | 0.358599 | 6.7255 | 3.147856 | 0.474306 | 0.055216 | 0.02688 | 0.328776 | 6.948776 | 9.982083 | 0.008302 | 39.296742 | 0.008302 | 39.292713 |
| min | 0.02 | 1543204000.0 | 2.5 | 1.0 | 0.0 | 18.91 | 0.45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 42.3398 | -71.1054 | 42.3398 | -71.1054 |
| 25% | 1.27 | 1543444000.0 | 9.0 | 1.0 | 0.37 | 36.45 | 3.41 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 42.3503 | -71.0892 | 42.3503 | -71.0892 |
| 50% | 2.16 | 1543737000.0 | 13.5 | 1.0 | 0.82 | 40.49 | 5.91 | 0.0 | 0.0004 | 0.0 | 0.0 | 12.0 | 17.0 | 42.3519 | -71.061 | 42.3519 | -71.061 |
| 75% | 2.93 | 1544828000.0 | 22.5 | 1.0 | 1.0 | 43.58 | 8.41 | 0.0 | 0.0916 | 0.0 | 0.0 | 18.0 | 28.0 | 42.3638 | -71.055 | 42.3638 | -71.055 |
| max | 7.86 | 1545161000.0 | 97.5 | 3.0 | 1.0 | 57.22 | 15.0 | 2.0 | 0.1459 | 0.1447 | 1.0 | 23.0 | 30.0 | 42.3657 | 71.0643 | 42.3657 | 71.0643 |


In [7]:
#Overview of categorical features.
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    print('Column Name: ', col, '\n')
    print('Number of unique values: ', df[col].nunique(), '\n')
    print('Unique values: ', df[col].unique())
    print('\n----------------------------------------\n')

Column Name:  cab_type 

Number of unique values:  2 

Unique values:  ['Lyft' 'Uber']

----------------------------------------

Column Name:  source 

Number of unique values:  12 

Unique values:  ['Haymarket Square' 'Back Bay' 'North End' 'North Station' 'Beacon Hill'
 'Boston University' 'Fenway' 'South Station' 'Theatre District'
 'West End' 'Financial District' 'Northeastern University']

----------------------------------------

Column Name:  destination 

Number of unique values:  12 

Unique values:  ['North Station' 'Northeastern University' 'West End' 'Haymarket Square'
 'South Station' 'Fenway' 'Theatre District' 'Beacon Hill' 'Back Bay'
 'North End' 'Financial District' 'Boston University']

----------------------------------------

Column Name:  product_id 

Number of unique values:  12 

Unique values:  ['lyft_line' 'lyft_premier' 'lyft' 'lyft_luxsuv' 'lyft_plus' 'lyft_lux'
 '6f72dfc5-27f1-42e8-84db-ccc7a75f6969'
 '6c84fd89-3f11-4782-9b50-97c468b19529'
 '55c66225-fbe7-4

In [8]:
#Columns with their missing values.
df.isnull().sum()

distance                 0
cab_type                 0
timestamp                0
source                   0
destination              0
price                    0
surge_multiplier         0
product_id               0
name                     0
cloudCover               0
temperature              0
windSpeed                0
uvIndex                  0
precipIntensityMax       0
precipIntensity          0
precipProbability        0
hour                     0
day                      0
source_latitude          0
source_longitude         0
destination_latitude     0
destination_longitude    0
dtype: int64

## Visualizations

In [9]:
def layout_details(title, xaxis_title=None, yaxis_title=None):
    '''Add layout details such as Title, X-axis title and Y-axis title.'''
    layout = go.Layout(title=title,
                       xaxis=dict(title=xaxis_title),
                       yaxis=dict(title=yaxis_title)
                      )
    
    return layout

def display_trace(data, layout):
    '''Display visualization with chosen data and layout.'''
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig)
    
def plot_bar_chart(x, y, colors=None, name=None, opacity=None):
    '''Plot bar chart.'''
    bar_trace = go.Bar(x=x,
                       y=y,
                       marker_color=colors,
                       opacity=opacity,
                       name=name
                      )
    return bar_trace

def plot_pie_chart(values, labels, domain, textinfo, hole, name=None):
    '''Plot pie/donut chart.'''
    pie_trace = go.Pie(values=values,
                       labels=labels,
                       domain=domain,
                       textinfo=textinfo,
                       hole=hole,
                       name=name
                      )
    
    return pie_trace

def plot_line_chart(x, y, colors=None, name=None, opacity=None):
    '''Plot line chart.'''
    line_trace = go.Scatter(x=x,
                            y=y,
                            marker_color=colors,
                            name=name,
                            opacity=opacity
                           )
    
    return line_trace

def plot_scatter_chart(x, y, colors=None, name=None, opacity=None):
    '''Plot scatter chart.'''
    scatter_trace = go.Scatter(x=x,
                            y=y,
                            marker_color=colors,
                            name=name,
                            opacity=opacity,
                            mode='markers',
                           )
    
    return scatter_trace

def plot_scatter_line_chart(x, y, colors=None, name=None, opacity=None):
    '''Plot line chart (scatter trace drawn in lines-only mode).'''
    scatter_line_trace = go.Scatter(x=x,
                            y=y,
                            marker_color=colors,
                            name=name,
                            opacity=opacity,
                            mode='lines',
                           )
    
    return scatter_line_trace

In [10]:
token = 'pk.eyJ1IjoiYXAxNDk1IiwiYSI6ImNrM2dwa3lzcjA0Y3AzaWx1aGxjb3VwNTIifQ.qnk9bZu8o0RWFRy-qSsEgA'
hs_count = len(df[df['source'] == 'Haymarket Square'])
bb_count = len(df[df['source'] == 'Back Bay'])
ne_count = len(df[df['source'] == 'North End'])
ns_count = len(df[df['source'] == 'North Station'])
bh_count = len(df[df['source'] == 'Beacon Hill'])
bu_count = len(df[df['source'] == 'Boston University'])
f_count = len(df[df['source'] == 'Fenway'])
ss_count = len(df[df['source'] == 'South Station'])
td_count = len(df[df['source'] == 'Theatre District'])
we_count = len(df[df['source'] == 'West End'])
fd_count = len(df[df['source'] == 'Financial District'])
nu_count = len(df[df['source'] == 'Northeastern University'])

trace = go.Scattermapbox(
        lat=df['source_latitude'].unique().tolist(),
        lon=df['source_longitude'].unique().tolist(),
        mode='markers',
        marker=go.scattermapbox.Marker(size=11,
                                       color='orange',
                                       opacity=0.8
                                      ),
        text=['Haymarket Square<br>'+str(hs_count), 'Back Bay<br>'+str(bb_count), 
              'North End<br>'+str(ne_count), 'North Station<br>'+str(ns_count), 
              'Beacon Hill<br>'+str(bh_count),'Boston University<br>'+str(bu_count), 
              'Fenway<br>'+str(f_count), 'South Station<br>'+str(ss_count), 
              'Theatre District<br>'+str(td_count), 'West End<br>'+str(we_count),
              'Financial District<br>'+str(fd_count), 'Northeastern University<br>'+str(nu_count)
             ]
    )

layout = go.Layout(autosize=True,
                   title='<b>Uber & Lyft Booking Locations in Boston</b>',
                   hovermode='closest',
                   mapbox=go.layout.Mapbox(accesstoken=token,
                                           bearing=0,
                                           center=go.layout.mapbox.Center(lat=42.3469,
                                                                          lon=-71.0661
                                                                         ),
                                            pitch=0,
                                            zoom=12
                                          ),
                )

fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig)
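
The map cell above pairs each hover label with a coordinate by list position, taking latitudes and longitudes from two separate `unique()` calls. That works only while every location has a distinct latitude and a distinct longitude. As a hedged sketch, the names, coordinates and ride counts can instead be derived from one grouped frame so they stay aligned. The toy frame below is an assumption for illustration, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

# Toy stand-in for the notebook's df; places A and B share a latitude on purpose.
toy = pd.DataFrame({
    'source': ['A', 'B', 'A', 'C'],
    'source_latitude': [42.36, 42.36, 42.36, 42.35],
    'source_longitude': [-71.06, -71.07, -71.06, -71.08],
})

# One row per pickup location, so coordinates and ride counts stay aligned.
places = (toy.groupby('source', sort=False)
             .agg(lat=('source_latitude', 'first'),
                  lon=('source_longitude', 'first'),
                  rides=('source_latitude', 'size'))
             .reset_index())

# Hover labels in the same 'Name<br>count' form used by the notebook.
hover_text = (places['source'] + '<br>' + places['rides'].astype(str)).tolist()

print(places)
print(hover_text)
# Separate unique() calls would give 2 latitudes for 3 places here.
print(len(toy['source_latitude'].unique()), 'unique latitudes for', len(places), 'places')
```

With the real data, `places['lat']`, `places['lon']` and `hover_text` could be passed to `go.Scattermapbox` as `lat`, `lon` and `text`.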

In [11]:
colors = ['black', 'deeppink']
x = df['cab_type'].value_counts().index.values
y = df['cab_type'].value_counts().values
opacity=0.8
trace1 = plot_bar_chart(x, y, colors, opacity=opacity)
layout = layout_details('<b>Count of Rides by Company</b>', '<b><i>Company</i></b>', '<b><i>Number of Rides</i></b>')
display_trace([trace1], layout)

The dataset contains **330,568** records for **Uber** bookings and **307,408** records for **Lyft** bookings, so Uber has **23,160** more bookings than Lyft.
<br><br>
*Hover over the visualization to get exact values.*<br><br>

In [12]:
uber_df = df[df['cab_type'] == 'Uber']
lyft_df = df[df['cab_type'] == 'Lyft']

textinfo='value+percent'
hole=0.4

values_uber=uber_df.groupby(by='name')['name'].count().values
labels_uber=uber_df.groupby(by='name')['name'].count().index.values
domain_uber=dict(x=[0, 0.5], y=[1, 0.3])
name='Uber'
trace2a = plot_pie_chart(values_uber, labels_uber, domain_uber, textinfo, hole, name)

values_lyft=lyft_df.groupby(by='name')['name'].count().values
labels_lyft=lyft_df.groupby(by='name')['name'].count().index.values
domain_lyft=dict(x=[0.55, 1], y=[1, 0.3])
name='Lyft'
trace2b = plot_pie_chart(values_lyft, labels_lyft, domain_lyft, textinfo, hole, name)

layout = layout_details('<b>Count of Rides by Cab Classes by Company<br>Uber | Lyft</b>')
display_trace([trace2a, trace2b], layout)

The above visualization shows the split of bookings by cab class. Since the bookings were made in a controlled environment, all classes have roughly the same number of bookings: **~55,000** in each cab class for Uber and **~51,000** in each cab class for Lyft.
<br><br>
*Hover over the visualization for more information and control the visualization using the legend in the right corner.*
<br><br>

In [13]:
fig = tools.make_subplots(rows=1, cols=2)

x=uber_df.groupby(by='name')['price'].sum().index.values
y=uber_df.groupby(by='name')['price'].sum().values
name='Uber'
color='black'
opacity=0.8
trace3a = plot_bar_chart(x, y, color, name, opacity)

x=lyft_df.groupby(by='name')['price'].sum().index.values
y=lyft_df.groupby(by='name')['price'].sum().values
name='Lyft'
color='deeppink'
opacity=0.8
trace3b = plot_bar_chart(x, y, color, name, opacity)

fig.append_trace(trace3a, 1,1)
fig.append_trace(trace3b, 1,2)
fig['layout'].update(title='<b>Total Sales by Company based on Cab Classes</b>',
                     xaxis=dict(automargin=True),
                     yaxis=dict(title='<b><i>Sales in USD</i></b>')
                    )
py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



Uber tends to make more money than Lyft on its economy rides, while Lyft tends to make more money than Uber on its premium rides. Although Uber has more bookings than Lyft, the total sales of Lyft's premium rides are greater than those of Uber's premium rides.
<br>
Each Uber cab class below is paired with its Lyft counterpart:
- Black is to Lux Black
- Black SUV is to Lux Black XL
- UberPool is to Shared
- UberX is to Lyft
- UberXL is to Lyft XL
- WAV is to Lux
<br>

*Hover over the bars to get exact sales values.*
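
The class correspondence listed above can be written as a lookup table, which makes like-for-like comparisons easier to compute. The following is a minimal sketch: the dictionary name `UBER_TO_LYFT` is hypothetical and the prices are invented, so only the class names come from the notebook.

```python
import pandas as pd

# Uber class -> Lyft class, as listed in the notebook text.
UBER_TO_LYFT = {
    'Black': 'Lux Black',
    'Black SUV': 'Lux Black XL',
    'UberPool': 'Shared',
    'UberX': 'Lyft',
    'UberXL': 'Lyft XL',
    'WAV': 'Lux',
}

# Toy rides; prices are invented for illustration only.
toy = pd.DataFrame({
    'cab_type': ['Uber', 'Uber', 'Lyft', 'Lyft', 'Uber', 'Lyft'],
    'name': ['UberX', 'UberX', 'Lyft', 'Lyft', 'Black', 'Lux Black'],
    'price': [8.0, 10.0, 7.0, 9.0, 20.0, 24.0],
})

# Total sales per class, then one row per Uber/Lyft class pair.
sales = toy.groupby('name')['price'].sum()
pairs = pd.DataFrame({
    'uber_class': list(UBER_TO_LYFT.keys()),
    'lyft_class': list(UBER_TO_LYFT.values()),
})
pairs['uber_sales'] = pairs['uber_class'].map(sales)
pairs['lyft_sales'] = pairs['lyft_class'].map(sales)

# Pairs with no rides in the toy data have NaN sales and are dropped.
pairs = pairs.dropna()
print(pairs)
```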

### Distance vs Price

In [14]:
x=uber_df.groupby(by='distance')['price'].mean().index.values
y=uber_df.groupby(by='distance')['price'].mean().values
name='Uber'
color='black'
opacity=0.8
trace4a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df.groupby(by='distance')['price'].mean().index.values
y=lyft_df.groupby(by='distance')['price'].mean().values
name='Lyft'
color='deeppink'
opacity=0.8
trace4b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance</b>', '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace4a, trace4b], layout)

For shorter distances, there is no significant difference in pricing between the two companies. However, as the distance increases, the difference in pricing also increases. Overall, Lyft tends to charge more than Uber.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*<br><br>

In [15]:
x=uber_df[uber_df['name'] == 'UberX'].groupby(by='distance')['price'].mean().index.values
y=uber_df[uber_df['name'] == 'UberX'].groupby(by='distance')['price'].mean().values
name='UberX'
color='black'
opacity=0.8
trace5a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df[lyft_df['name'] == 'Lyft'].groupby(by='distance')['price'].mean().index.values
y=lyft_df[lyft_df['name'] == 'Lyft'].groupby(by='distance')['price'].mean().values
name='Lyft(Economy)'
color='deeppink'
opacity=0.8
trace5b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance for Economy Rides(4 seater) </b>', 
                        '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace5a, trace5b], layout)

For economy rides, Uber tends to charge more than Lyft irrespective of the booking distance.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*<br><br>

In [16]:
x=uber_df[uber_df['name'] == 'UberXL'].groupby(by='distance')['price'].mean().index.values
y=uber_df[uber_df['name'] == 'UberXL'].groupby(by='distance')['price'].mean().values
name='Uber XL'
color='black'
opacity=0.8
trace7a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df[lyft_df['name'] == 'Lyft XL'].groupby(by='distance')['price'].mean().index.values
y=lyft_df[lyft_df['name'] == 'Lyft XL'].groupby(by='distance')['price'].mean().values
name='Lyft XL'
color='deeppink'
opacity=0.8
trace7b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance for Economy XL Rides(6 seater)</b>', 
                        '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace7a, trace7b], layout)

For economy XL rides, there is no significant difference in charges between the two companies for most distances.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*<br><br>

In [17]:
x=uber_df[uber_df['name'] == 'UberPool'].groupby(by='distance')['price'].mean().index.values
y=uber_df[uber_df['name'] == 'UberPool'].groupby(by='distance')['price'].mean().values
name='Uber Pool'
color='black'
opacity=0.8
trace6a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df[lyft_df['name'] == 'Shared'].groupby(by='distance')['price'].mean().index.values
y=lyft_df[lyft_df['name'] == 'Shared'].groupby(by='distance')['price'].mean().values
name='Lyft Shared'
color='deeppink'
opacity=0.8
trace6b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance for Shared Rides</b>', 
                        '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace6a, trace6b], layout)

Lyft charges significantly less than Uber for shared rides, irrespective of the distance.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*<br><br>

In [18]:
x=uber_df[uber_df['name'] == 'Black'].groupby(by='distance')['price'].mean().index.values
y=uber_df[uber_df['name'] == 'Black'].groupby(by='distance')['price'].mean().values
name='Uber Black'
color='black'
opacity=0.8
trace8a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df[lyft_df['name'] == 'Lux Black'].groupby(by='distance')['price'].mean().index.values
y=lyft_df[lyft_df['name'] == 'Lux Black'].groupby(by='distance')['price'].mean().values
name='Lyft Black'
color='deeppink'
opacity=0.8
trace8b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance for Premium Rides(4 seater)</b>', 
                        '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace8a, trace8b], layout)

Lyft's premium 4-seater rides cost significantly more than Uber's premium 4-seater rides across all distances.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*<br><br>

In [19]:
x=uber_df[uber_df['name'] == 'Black SUV'].groupby(by='distance')['price'].mean().index.values
y=uber_df[uber_df['name'] == 'Black SUV'].groupby(by='distance')['price'].mean().values
name='Uber Black XL'
color='black'
opacity=0.8
trace9a = plot_line_chart(x, y, color, name, opacity)

x=lyft_df[lyft_df['name'] == 'Lux Black XL'].groupby(by='distance')['price'].mean().index.values
y=lyft_df[lyft_df['name'] == 'Lux Black XL'].groupby(by='distance')['price'].mean().values
name='Lyft Black XL'
color='deeppink'
opacity=0.8
trace9b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Distance for Premium Rides(6 seater)</b>', 
                        '<b><i>Distance in miles</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace9a, trace9b], layout)

The same can be said for Lyft's premium 6-seater rides: they cost slightly more than Uber's premium 6-seater rides across all distances.
<br><br>
*Hover over the visualization to compare the prices between Uber and Lyft for a particular distance.*

**Overall, Lyft appears to cost more than Uber for similar distances. Digging further, however, we found that Uber is costlier than Lyft for economy rides, while Lyft charges more than Uber for similar distances on its premium rides.**

### Time vs Price

In [20]:
x=df.groupby(by='day')['price'].mean().index.values
y=df.groupby(by='day')['price'].mean().values
trace10 = plot_line_chart(x, y)

layout = layout_details('<b>Average Cost of Ride by Day</b>', 
                        '<b><i>Day</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace10], layout)

Although the graph appears to show major ups and downs, the average cost of a ride does not change drastically from day to day. The highest daily average is <b>\$16.79</b> while the lowest is <b>\$16.41</b>, a spread of only <b>\$0.38</b>. <br><br>
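
The highest and lowest daily averages quoted above can be read off the grouped series directly instead of from the plot. This is a small sketch on invented numbers, with `toy` standing in for the notebook's `df`.

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are invented).
toy = pd.DataFrame({
    'day': [1, 1, 2, 2, 3, 3],
    'price': [16.0, 17.0, 16.5, 16.5, 16.0, 18.0],
})

daily_mean = toy.groupby('day')['price'].mean()

# Spread between the most and least expensive day, in USD.
spread = daily_mean.max() - daily_mean.min()

print('highest daily average:', daily_mean.max(), 'on day', daily_mean.idxmax())
print('lowest daily average: ', daily_mean.min(), 'on day', daily_mean.idxmin())
print('spread:', spread)
```

Reporting the spread next to the mean helps judge whether swings that look large on a tightly scaled y-axis matter in practice.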

In [21]:
x=df.groupby(by='hour')['price'].mean().index.values
y=df.groupby(by='hour')['price'].mean().values
trace11 = plot_line_chart(x, y)

layout = layout_details('<b>Average Cost of Ride by Hour</b>', 
                        '<b><i>Hour</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace11], layout)

The same holds for this graph. The average cost of a ride does not change drastically with the hour of the day. The highest hourly average is <b>\$16.60</b> while the lowest is <b>\$16.48</b>, a spread of only <b>\$0.12</b>. <br><br>

In [22]:
df['distance'].describe()

count    637976.000000
mean          2.189261
std           1.135413
min           0.020000
25%           1.270000
50%           2.160000
75%           2.930000
max           7.860000
Name: distance, dtype: float64

In [23]:
less3_df = df[df['distance'] <= 3]
less6_df = df[(df['distance'] <= 6) & (df['distance'] > 3)]
less15_df = df[df['distance'] > 6]

x=less3_df.groupby(by='hour')['price'].mean().index.values
y=less3_df.groupby(by='hour')['price'].mean().values
name='Less than 3 miles'
trace12a = plot_line_chart(x, y, name=name)

x=less6_df.groupby(by='hour')['price'].mean().index.values
y=less6_df.groupby(by='hour')['price'].mean().values
name='Between 3 & 6 miles'
trace12b = plot_line_chart(x, y, name=name)

x=less15_df.groupby(by='hour')['price'].mean().index.values
y=less15_df.groupby(by='hour')['price'].mean().values
name='Greater than 6 miles'
trace12c = plot_line_chart(x, y, name=name)

layout = layout_details('<b>Average Cost of Ride by Hour on varying Distances</b>', 
                        '<b><i>Hour</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace12a, trace12b, trace12c], layout)

Looking further into how time of day affects prices, the above visualization shows that for distances greater than 6 miles, the average cost of a ride varies significantly throughout the day.<br>
**An interesting observation is that the largest increase in average price for distances greater than 6 miles occurs between 7am and 8am, which is often considered the peak hour when employees commute to their offices.**<br><br>
*Hover over the visualization to compare the cost across the different distances and times of day*. <br><br>
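
The three distance bands used above are built with separate boolean filters. As an alternative sketch, `pd.cut` can assign every ride to a band in one step, after which a single group-by gives the hourly averages for all bands. The band labels and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rides (invented values) covering all three distance bands.
toy = pd.DataFrame({
    'distance': [0.5, 3.0, 3.1, 6.0, 6.5, 7.8],
    'hour': [7, 7, 8, 8, 7, 8],
    'price': [6.0, 10.0, 14.0, 18.0, 30.0, 36.0],
})

# Right-closed bins reproduce the notebook's filters: <= 3, (3, 6], > 6.
toy['distance_band'] = pd.cut(
    toy['distance'],
    bins=[0, 3, 6, np.inf],
    labels=['Up to 3 miles', '3 to 6 miles', 'Over 6 miles'],
    include_lowest=True,
)

# Rows: hour of day; columns: distance band; cells: mean price.
hourly = (toy.groupby(['hour', 'distance_band'], observed=True)['price']
             .mean()
             .unstack())
print(hourly)
```

Each column of `hourly` then corresponds to one line in the chart, and hours with no rides in a band show up as `NaN`.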

In [24]:
x=less15_df[less15_df['cab_type'] == 'Uber'].groupby(by='hour')['price'].mean().index.values
y=less15_df[less15_df['cab_type'] == 'Uber'].groupby(by='hour')['price'].mean().values
name='Uber'
color='black'
opacity=0.8
trace13a = plot_line_chart(x, y, color, name, opacity)

x=less15_df[less15_df['cab_type'] == 'Lyft'].groupby(by='hour')['price'].mean().index.values
y=less15_df[less15_df['cab_type'] == 'Lyft'].groupby(by='hour')['price'].mean().values
name='Lyft'
color='deeppink'
opacity=0.8
trace13b = plot_line_chart(x, y, color, name, opacity)

layout = layout_details('<b>Average Cost of Ride by Hour for Distances > 6 miles based on Company</b>', 
                        '<b><i>Hour</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace13a, trace13b], layout)

The above visualization suggests that Lyft adjusts the price of a cab ride by time of day more than Uber does.

*Hover over the visualization to compare the cost between Uber and Lyft.* 
<br><br>

### External Factors vs Price
- Temperature
- Wind Speed
- Precipitation

In [25]:
x=less3_df[less3_df['distance'] == 2.32].groupby(by='temperature')['price'].mean().index.values
y=less3_df[less3_df['distance'] == 2.32].groupby(by='temperature')['price'].mean().values
name='2.32 miles'
trace14a = plot_scatter_chart(x, y, name=name)

x=less6_df[less6_df['distance'] == 3.05].groupby(by='temperature')['price'].mean().index.values
y=less6_df[less6_df['distance'] == 3.05].groupby(by='temperature')['price'].mean().values
name='3.05 miles'
trace14b = plot_scatter_chart(x, y, name=name)

x=less15_df[less15_df['distance'] == 7.46].groupby(by='temperature')['price'].mean().index.values
y=less15_df[less15_df['distance'] == 7.46].groupby(by='temperature')['price'].mean().values
name='7.46 miles'
trace14c = plot_scatter_chart(x, y, name=name)

layout = layout_details('<b>Average Cost of Ride based on Temperature</b>', 
                        '<b><i>Temperature in Fahrenheit</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace14a, trace14b, trace14c], layout)

The above visualization suggests that, for distances greater than 6 miles, the cost of a ride with either Uber or Lyft may depend on the temperature at the time of booking. While the data points for the shorter distances are clustered close together, the data points for the longer distance appear more dispersed, indicating that temperature could play a part in determining the cost of a cab ride.

*Hover over the visualization to compare costs and control the visualization using the legend in the right corner.*
<br><br>
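
One way to put a number on how clustered or dispersed the plotted points are is to compute the standard deviation of the per-temperature mean prices for each distance, together with how many rides sit behind each point. The sketch below uses invented data. It is worth checking the ride counts, since points backed by few rides can scatter more for reasons unrelated to temperature.

```python
import pandas as pd

# Toy rides (invented): many rides at the short distance, few at the long one.
toy = pd.DataFrame({
    'distance': [2.32] * 6 + [7.46] * 3,
    'temperature': [30, 30, 40, 40, 50, 50, 30, 40, 50],
    'price': [12.0, 13.0, 12.5, 12.5, 13.0, 12.0, 30.0, 45.0, 25.0],
})

# Mean price per (distance, temperature), as plotted in the scatter chart,
# plus the number of rides behind each plotted point.
points = (toy.groupby(['distance', 'temperature'])['price']
             .agg(['mean', 'count'])
             .reset_index())

# Per distance: spread of the plotted means and typical rides per point.
summary = points.groupby('distance').agg(
    spread_of_means=('mean', 'std'),
    median_rides_per_point=('count', 'median'),
)
print(summary)
```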

In [26]:
x=less3_df[less3_df['distance'] == 2.32].groupby(by='windSpeed')['price'].mean().index.values
y=less3_df[less3_df['distance'] == 2.32].groupby(by='windSpeed')['price'].mean().values
name='2.32 miles'
trace15a = plot_scatter_chart(x, y, name=name)

x=less6_df[less6_df['distance'] == 3.05].groupby(by='windSpeed')['price'].mean().index.values
y=less6_df[less6_df['distance'] == 3.05].groupby(by='windSpeed')['price'].mean().values
name='3.05 miles'
trace15b = plot_scatter_chart(x, y, name=name)

x=less15_df[less15_df['distance'] == 7.46].groupby(by='windSpeed')['price'].mean().index.values
y=less15_df[less15_df['distance'] == 7.46].groupby(by='windSpeed')['price'].mean().values
name='7.46 miles'
trace15c = plot_scatter_chart(x, y, name=name)

layout = layout_details('<b>Average Cost of Ride based on Wind Speed</b>', 
                        '<b><i>Wind Speed in MPH</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace15a, trace15b, trace15c], layout)

The same inference can be drawn from the above visualization: wind speed appears to affect the cost of longer cab rides, while the costs of shorter rides are closely clustered together.<br><br>
*Hover over the visualization to compare costs and control the visualization using the legend in the right corner.*
<br><br>

In [27]:
# Mean price per precipitation intensity for rides of a fixed distance (group once, reuse for x and y).
grouped = less3_df[less3_df['distance'] == 2.32].groupby(by='precipIntensity')['price'].mean()
x, y = grouped.index.values, grouped.values
name='2.32 miles'
trace16a = plot_scatter_chart(x, y, name=name)

grouped = less6_df[less6_df['distance'] == 3.05].groupby(by='precipIntensity')['price'].mean()
x, y = grouped.index.values, grouped.values
name='3.05 miles'
trace16b = plot_scatter_chart(x, y, name=name)

grouped = less15_df[less15_df['distance'] == 7.46].groupby(by='precipIntensity')['price'].mean()
x, y = grouped.index.values, grouped.values
name='7.46 miles'
trace16c = plot_scatter_chart(x, y, name=name)

layout = layout_details('<b>Average Cost of Ride based on Precipitation Intensity</b>', 
                        '<b><i>Intensity</i></b>', 
                        '<b><i>Cost in USD</i></b>')
display_trace([trace16a, trace16b, trace16c], layout)

Precipitation intensity also seems to affect costs for longer cab rides. However, very few data points support this conclusion, so it should be treated as tentative.<br><br>

*Hover over the visualization to compare costs and control the visualization using the legend in the right corner.*
<br><br>
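The caveat about the small number of data points can be made concrete by reporting how many rides sit behind each plotted mean. This is a minimal sketch under the assumption that a per-group count is an acceptable measure of support. `price_summary_by` and `toy_df` are hypothetical names, and the values are invented.

```python
import pandas as pd

def price_summary_by(frame, distance, factor):
    """Mean price and number of rides for each value of `factor`,
    restricted to rides of exactly the given distance."""
    subset = frame.loc[frame['distance'] == distance]
    return subset.groupby(factor)['price'].agg(['mean', 'count'])

# Toy data standing in for less15_df (hypothetical values).
toy_df = pd.DataFrame({
    'distance':        [7.46] * 5,
    'precipIntensity': [0.0, 0.0, 0.0, 0.0, 0.1],
    'price':           [20.0, 22.0, 21.0, 21.0, 30.0],
})

summary = price_summary_by(toy_df, 7.46, 'precipIntensity')
print(summary)
```

In the toy data, the higher mean at intensity 0.1 rests on a single ride, which is the kind of situation the caveat above warns about.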

## Machine Learning

In [28]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score, confusion_matrix, classification_report

In [29]:
def regression_stats(y_test, pred):
    '''Print regression metrics.'''
    print('r^2: ', round(r2_score(y_test, pred), 2))
    print('MAE: ', round(mean_absolute_error(y_test, pred), 2))
    print('RMSE: ', round(np.sqrt(mean_squared_error(y_test, pred)), 2), '\n')
    
def classification_stats(y_test, pred):
    '''Print classification metrics.'''
    print('Accuracy: ', round(accuracy_score(y_test, pred), 2), '\n')    
    print('Classification Report: \n', classification_report(y_test, pred), '\n')
    print('Confusion Matrix: \n', confusion_matrix(y_test, pred), '\n')
    
def fit_predict(model, X_train, y_train, X_test):
    '''Trains model and returns predictions on test set.'''
    print(model.fit(X_train, y_train), '\n')
    pred = model.predict(X_test)
    return pred

def create_pred_dataframe(y_test, pred):
    '''Returns dataframe with Actual value and Predicted value.'''
    return pd.DataFrame({'Actual': y_test, 'Predicted': pred})

def create_coeff_dataframe(model, X_train):
    '''Returns dataframe with coefficients of Linear Regression model.'''
    return pd.DataFrame(model.coef_, X_train.columns, columns=['Coefficients'])

def create_feat_importance_dataframe(model, X_train):
    '''Returns dataframe with feature importances.'''
    return pd.DataFrame(model.feature_importances_, X_train.columns, columns=['Importance'])
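To make the three numbers printed by `regression_stats` easier to interpret, here is a small worked example that computes them by hand with NumPy, following the standard definitions of MAE, RMSE and r^2. The arrays `actual` and `predicted` are made-up values, not data from the notebook.

```python
import numpy as np

# Hypothetical prices and predictions, in USD.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

residuals = actual - predicted

# Mean absolute error: average size of the miss, in USD.
mae = np.mean(np.abs(residuals))

# Root mean squared error: like MAE, but penalizes large misses more.
rmse = np.sqrt(np.mean(residuals ** 2))

# r^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('r^2: ', round(r2, 2))
print('MAE: ', round(mae, 2))
print('RMSE: ', round(rmse, 2))
```

Because MAE and RMSE are in the units of the target, an MAE of 2.5 here means the predictions miss by 2.50 USD on average.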

In [30]:
df.columns

Index(['distance', 'cab_type', 'timestamp', 'source', 'destination', 'price',
       'surge_multiplier', 'product_id', 'name', 'cloudCover', 'temperature',
       'windSpeed', 'uvIndex', 'precipIntensityMax', 'precipIntensity',
       'precipProbability', 'hour', 'day', 'source_latitude',
       'source_longitude', 'destination_latitude', 'destination_longitude'],
      dtype='object')

In [31]:
#Separate dataframe for regression task.
reg_df = pd.get_dummies(df)
X = reg_df.drop(columns=['price']) 
y = reg_df['price'] #Target variable which is the cost of a ride.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) #Train test split: 67:33
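Calling `pd.get_dummies` on the whole dataframe one-hot encodes every text column, including both `product_id` and `name`. If those two columns describe the same product, the encoding produces pairs of identical indicator columns. The sketch below illustrates this on toy data. The pairing of `lyft_luxsuv` with `Lux Black XL` is an assumption suggested by the identical coefficients in the Linear Regression output, and the remaining values are invented.

```python
import pandas as pd

# Toy rows (hypothetical values); product_id and name are assumed to be
# two labels for the same product.
toy_df = pd.DataFrame({
    'price':      [10.0, 30.0, 12.0],
    'distance':   [1.0, 2.0, 1.5],
    'product_id': ['lyft_luxsuv', 'lyft', 'lyft_luxsuv'],
    'name':       ['Lux Black XL', 'Lyft', 'Lux Black XL'],
})

encoded = pd.get_dummies(toy_df)
print(list(encoded.columns))

# The two indicator columns carry exactly the same information.
same = bool((encoded['product_id_lyft_luxsuv'] == encoded['name_Lux Black XL']).all())
print(same)
```

Duplicate columns like these do not necessarily hurt predictions, but they can make individual linear regression coefficients harder to interpret, since the effect of one product may be split across its two indicators.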

## Linear Regression

In [32]:
lin_reg = LinearRegression()
pred = fit_predict(lin_reg, X_train, y_train, X_test)
regression_stats(y_test, pred)
pred_df = create_pred_dataframe(y_test, pred)
coeff_df = create_coeff_dataframe(lin_reg, X_train)
print('Sample of Actual vs Predicted Dataframe:\n\n', pred_df.sample(10), '\n')
print('Top 10 Linear Regression Coefficients:\n\n',
      coeff_df.sort_values(by='Coefficients', ascending=False).head(10), '\n')

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False) 

r^2:  0.93
MAE:  1.75
RMSE:  2.49 

Sample of Actual vs Predicted Dataframe:

         Actual  Predicted
293868    11.5  11.582145
67166     30.5  30.704637
87585     38.5  33.773319
595436     7.0   7.797991
280733    15.5  17.268259
116397    22.5  20.165853
346159     5.0   2.792145
299147    16.5  16.270170
586913     5.0   1.393460
590166     9.0  12.179234 

Top 10 Linear Regression Coefficients:

                                                  Coefficients
surge_multiplier                                    18.448034
name_Lux Black XL                                    7.490721
product_id_lyft_luxsuv                               7.490721
product_id_6d318bcc-22a3-4af6-bddd-b409bfce1546      7.182275
name_Black SUV                                       7.182275
distance                                             2.883235
name_Lux Black                                       2.865610
produ

## Random Forest Regressor

In [33]:
rfr = RandomForestRegressor()
pred = fit_predict(rfr, X_train, y_train, X_test)
regression_stats(y_test, pred)
pred_df = create_pred_dataframe(y_test, pred)
feat_imp_df = create_feat_importance_dataframe(rfr, X_train)
print('Sample of Actual vs Predicted Dataframe:\n\n', pred_df.sample(10), '\n')
print('Top 10 Important Features:\n\n',
      feat_imp_df.sort_values(by='Importance', ascending=False).head(10), '\n')

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False) 

r^2:  0.96
MAE:  1.16
RMSE:  1.84 

Sample of Actual vs Predicted Dataframe:

         Actual  Predicted
152451    30.0     31.800
388002    14.0     15.050
259257    10.0     11.850
576961     7.5      7.400
553344    16.5     17.100
17988     16.5     18.000
186758    12.0     11.425
320296    15.0     13.600
198055     9.0      6.800
468530    31.5     31.150 

Top 10 Important Features:

                                                  Importance
distance                                           0.152638
name_Black SUV                                     0.150830
name_Lux Black XL                   

In [34]:
#Separate dataframe for classification task.
X = df.drop(columns=['cab_type', 'product_id', 'name', 'timestamp'])
y = df['cab_type'] #Target variable which is company: Uber or Lyft
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) #Train test split: 67:33
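The classification features are on very different scales (coordinates, temperature, price). The notebook fits its models on the raw values. As an optional variation, here is a minimal sketch of how standardization could be placed in front of logistic regression with a scikit-learn pipeline. This is an assumption about a possible preprocessing step, not something the notebook does, and whether it would change the results reported here is untested. The data and the names `X_toy`, `y_toy` and `scaled_log_reg` are synthetic and hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (hypothetical data).
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] / 100 > 0, 'Uber', 'Lyft')

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

# The scaler is fit on the training split only, then applied to the test split.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)
toy_pred = scaled_log_reg.predict(X_te)
print(toy_pred[:5])
```

Wrapping both steps in one pipeline means the object can be passed to `fit_predict` in place of a bare model.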

## Logistic Regression

In [35]:
log_reg = LogisticRegression()
pred = fit_predict(log_reg, X_train, y_train, X_test)
classification_stats(y_test, pred)
pred_df = create_pred_dataframe(y_test, pred)
print('Sample of Actual vs Predicted Dataframe:\n\n', pred_df.sample(10), '\n')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False) 

Accuracy:  0.55 

Classification Report: 
               precision    recall  f1-score   support

        Lyft       0.60      0.19      0.29    101543
        Uber       0.54      0.88      0.67    108990

   micro avg       0.55      0.55      0.55    210533
   macro avg       0.57      0.54      0.48    210533
weighted avg       0.57      0.55      0.49    210533
 

Confusion Matrix: 
 [[19186 82357]
 [12575 96415]] 

Sample of Actual vs Predicted Dataframe:

        Actual Predicted
105905   Lyft      Lyft
179606   Lyft      Lyft
460956   Lyft      Uber
367472   Lyft      Uber
27183    Uber      Lyft
61582    Uber      Uber
52223    Lyft      Lyft
415244   Lyft      Uber
336652   Lyft      Uber
516467   Uber      Uber 
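The figures in the classification report can be recovered from the confusion matrix above. In scikit-learn's convention, rows are actual classes and columns are predicted classes, here in the order Lyft, Uber. The following worked example recomputes accuracy, recall and precision from the four printed counts.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: actual [Lyft, Uber]; columns: predicted [Lyft, Uber].
cm = np.array([[19186, 82357],
               [12575, 96415]])

# Accuracy: correct predictions (diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over actual members of that class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over everything predicted as that class (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

print('Accuracy: ', round(accuracy, 2))
print('Recall [Lyft, Uber]: ', recall.round(2))
print('Precision [Lyft, Uber]: ', precision.round(2))
```

The Lyft recall of 0.19 shows that the model labels most Lyft rides as Uber, so the 0.55 accuracy is only slightly better than always predicting Uber (108990 / 210533, about 0.52).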



## Random Forest Classifier

In [36]:
rfc = RandomForestClassifier()
pred = fit_predict(rfc, X_train, y_train, X_test)
classification_stats(y_test, pred)
pred_df = create_pred_dataframe(y_test, pred)
feat_imp_df = create_feat_importance_dataframe(rfc, X_train)
print('Sample of Actual vs Predicted Dataframe:\n\n', pred_df.sample(10), '\n')
print('Top 10 Important Features:\n\n',
      feat_imp_df.sort_values(by='Importance', ascending=False).head(10), '\n')

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False) 

Accuracy:  0.95 

Classification Report: 
               precision    recall  f1-score   support

        Lyft       0.94      0.95      0.95    101543
        Uber       0.96      0.94      0.95    108990

   micro avg       0.95      0.95      0.95    210533
   macro avg       0.95      0.95      0.95    210533
weighted avg       0.95      0.95      0.95    210533
 

Confusion Matrix: 
 [[ 96806   4737]
 [  6527 102463]] 

Sample of Actual vs Predicted Dataframe:

        Actual Predicted
599960   Uber      Uber
313961   Uber      Uber
607142   Uber      Uber
295354

## Conclusion

- Total Revenue: 
    - Uber: \$5.2 million
    - Lyft: \$5.3 million
<br><br>
- Price based on distance:
    - Overall, Uber is more affordable than Lyft.
    - Lyft is more affordable than Uber for economy rides.
    - Uber is more affordable than Lyft for luxury rides.
<br><br>
- Price based on time of day:
    - Prices of cab rides longer than 6 miles tend to vary throughout the day, with the highest increase during the 7am to 8am rush hour.
    - Lyft appears to modify its prices more than Uber based on time of day.
<br><br>    
- Price based on temperature, wind speed and precipitation:
    - Prices of longer rides tend to vary more with the above-mentioned external factors, while prices of shorter rides appear largely unaffected by weather conditions.
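
The total revenue figures in the first bullet are per-company sums. A minimal sketch of that aggregation, assuming revenue is defined as the sum of the `price` column for each `cab_type`. The dataframe `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy rides (hypothetical values), using the notebook's column names.
toy_df = pd.DataFrame({
    'cab_type': ['Uber', 'Lyft', 'Uber', 'Lyft'],
    'price':    [10.0, 12.5, 7.5, 20.0],
})

# Total price collected per company.
revenue = toy_df.groupby('cab_type')['price'].sum()
print(revenue)
```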
