<a href="https://www.kaggle.com/code/pritam1202/taxi-revenue-analysis-and-predictions?scriptVersionId=189008251" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Optimizing Revenue and Driver Retention in the Taxi Industry: A Data-Driven Analysis of Payment Methods, Fare Pricing, Revenue Forecasting, and Predictive Modeling

## Introduction  
In the highly competitive rental taxi sector, optimizing revenue and ensuring driver satisfaction are crucial for sustained growth and long-term success. With the increasing reliance on data-driven decision-making, understanding the factors influencing fare pricing and revenue generation has become essential. This research aims to uncover actionable insights by analyzing the impact of different payment methods on fare amounts, performing a time series analysis to forecast future revenue, and developing predictive models to estimate fare prices and driver tips.

## Objective   
By leveraging data-driven insights and predictive analytics, this research seeks to provide a comprehensive understanding of revenue dynamics in the taxi industry. The findings will guide strategic decisions to maximize profits, improve driver satisfaction, and ensure a competitive edge in the rapidly evolving market.
- **Payment Methods and Fare Pricing:** Analyze whether the payment method (e.g., cash or credit/debit card) affects fare amounts. This includes A/B testing, hypothesis testing in Python, and descriptive statistics, with the aim of finding ways to generate more revenue for drivers. In particular, we validate any observed difference in fare between card and cash payments.
  
- **Revenue Forecasting:** Use time series analysis to project future revenue trends.

- **Predictive Modeling:** Develop models to predict fare prices and driver tips, aiding profit estimation and driver retention. Additionally, analyze how payment types correspond to fare estimates so that offers can be tailored accordingly, and suggest tip amounts to customers by predicting the tip from trip distance and duration.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import warnings

In [None]:
warnings.filterwarnings('ignore')

### Loading dataset

In [None]:
data = pd.read_csv('/kaggle/input/newyork-yellow-taxi-trip-data-2020-2019/yellow_tripdata_2020-01.csv')
data.head()

### Column Descriptions

| Column                  | Description                                                                                   | Values                                                                                                 |
|-------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| **VendorID**            | Identifier for the TPEP provider supplying the record.                                        | 1 = Creative Mobile Technologies, LLC<br>2 = VeriFone Inc.                                             |
| **tpep_pickup_datetime**| The date and time when the meter was activated.                                               | -                                                                                                      |
| **tpep_dropoff_datetime**| The date and time when the meter was turned off.                                              | -                                                                                                      |
| **Passenger_count**     | The number of passengers in the vehicle, as entered by the driver.                            | -                                                                                                      |
| **Trip_distance**       | The distance of the trip in miles, as recorded by the taximeter.                              | -                                                                                                      |
| **PULocationID**        | TLC Taxi Zone where the meter was engaged.                                                    | -                                                                                                      |
| **DOLocationID**        | TLC Taxi Zone where the meter was disengaged.                                                 | -                                                                                                      |
| **RateCodeID**          | The applicable rate code at the end of the trip.                                              | 1 = Standard rate<br>2 = JFK<br>3 = Newark<br>4 = Nassau or Westchester<br>5 = Negotiated fare<br>6 = Group ride |
| **Store_and_fwd_flag**  | Indicates if the trip record was stored in the vehicle's memory before transmission to the vendor due to lack of server connection. | Y = Store and forward trip<br>N = Not a store and forward trip                                         |
| **Payment_type**        | How the passenger paid for the trip, represented by a numeric code.                           | 1 = Credit card<br>2 = Cash<br>3 = No charge<br>4 = Dispute<br>5 = Unknown<br>6 = Voided trip          |
| **Fare_amount**         | The fare as calculated by the meter based on time and distance.                               | -                                                                                                      |
| **Extra**               | Additional charges, currently including only the $0.50 and $1 rush hour and overnight charges.| -                                                                                                      |
| **MTA_tax**             | A $0.50 tax automatically added based on the metered rate.                                    | -                                                                                                      |
| **Improvement_surcharge**| A $0.30 surcharge added at the start of the trip, implemented since 2015.                     | -                                                                                                      |
| **Tip_amount**          | Credit card tip amounts. (Note: Cash tips are not recorded here.)                             | -                                                                                                      |
| **Tolls_amount**        | Total tolls paid during the trip.                                                             | -                                                                                                      |
| **Total_amount**        | The total charge to passengers, excluding cash tips.                                          | -                                                                                                      |


# Exploratory Data Analysis

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data['tpep_pickup_datetime'] = pd.to_datetime(data['tpep_pickup_datetime'])
data['tpep_dropoff_datetime'] = pd.to_datetime(data['tpep_dropoff_datetime'])

In [None]:
data['trip_timings'] = data['tpep_dropoff_datetime'] - data['tpep_pickup_datetime']
data['trip_duration'] = data['trip_timings'].dt.total_seconds()/60

In [None]:
data.columns

## **Payment Methods and Fare Pricing:**
- Since our objective revolves around payment type, fare amount, and other factors influencing the fare amount, we keep only the columns required for the statistical analysis. 

In [None]:
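# .copy() gives an independent frame, so later column assignments do not touch a view of `data`
df = data[['passenger_count', 'trip_distance','payment_type','fare_amount','trip_duration']].copy()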
df.head()

In [None]:
df.shape

In [None]:
#checking for missing values
df.isnull().sum()

In [None]:
print('Missing data %:')
round(df.isnull().sum()*100/df.shape[0],2)

#### Since only 1% of the data is missing, we drop the rows with null values

In [None]:
df = df.dropna()
df.shape

In [None]:
df.info()

In [None]:
df['passenger_count'] = df['passenger_count'].astype('int64')
df['payment_type'] = df['payment_type'].astype('int64')

In [None]:
#checking for duplicate values
df[df.duplicated()]

In [None]:
# dropping duplicate values as they won't contribute in analysis
df = df.drop_duplicates()
df.shape

In [None]:
# passenger count distribution
df['passenger_count'].value_counts()

In [None]:
# dropping trips with zero passengers (no one to pay the fare) and keeping only counts up to 6
df = df[(df['passenger_count']!=0) & (df['passenger_count']<7)]
df['passenger_count'].value_counts()

In [None]:
# payment type distribution
df['payment_type'].value_counts()

#### As our analysis centers on card and cash payments, we keep only payment types 1 (card) and 2 (cash)

In [None]:
df = df[df['payment_type']<3]
df['payment_type'].value_counts()

In [None]:
# Final remaining data
df.shape

In [None]:
# Changing the encoded values of payment type to actual labels
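# assign the result back; an in-place replace on a selected column may not update df in recent pandas versions
df['payment_type'] = df['payment_type'].replace({1:'Card',2:'Cash'})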

In [None]:
# Descriptive statistics
df.describe()

#### Trip distance, fare amount, and trip duration cannot be negative, so the invalid values need to be removed. There are also many outliers in these columns, as shown by the huge gap between the 75th percentile and the maximum value. 

In [None]:
# removing zero and negative values
df = df[df['trip_distance']>0]
df = df[df['fare_amount']>0]
df = df[df['trip_duration']>0]
df.shape

In [None]:
# checking outliers
plt.figure(figsize = (8,12))
# passing the column as x= makes the orientation explicit across seaborn versions
plt.subplot(3,1,1)
sns.boxplot(x=df['trip_distance'])
plt.subplot(3,1,2)
sns.boxplot(x=df['fare_amount'])
plt.subplot(3,1,3)
sns.boxplot(x=df['trip_duration'])
plt.show()

In [None]:
# removing outliers using IQR method
def outlier_r(df,col):
    q1 = df[col].quantile(0.25)   # first quartile
    q3 = df[col].quantile(0.75)   # third quartile
    IQR = q3-q1
    lower_bound = q1-1.5*IQR
    upper_bound = q3+1.5*IQR
    # keep only the rows that fall within the bounds
    df = df[(df[col]>=lower_bound) & (df[col]<=upper_bound)]
    return df

for col in ['trip_distance','fare_amount','trip_duration']:
    df = outlier_r(df,col)
    
df.shape

In [None]:
df.describe()

#### The objective is to explore the relationship between payment type and the corresponding trip distance and fare amount.  
#### Does the payment type vary with the fare amount or the trip distance?  

In [None]:
plt.title('Payment type distribution')
plt.pie(df['payment_type'].value_counts(normalize=True),labels=df['payment_type'].value_counts(normalize=True).index,
       autopct='%1.1f%%')
plt.show()

#### Card payments are more common, accounting for 67.7% of trips

In [None]:
plt.figure(figsize=(12,7))
plt.subplot(1,2,1)
plt.title('Distribution of trip distance')
sns.histplot(df[df['payment_type']=='Card']['trip_distance'],bins = 15,kde=True,label='Card')
sns.histplot(df[df['payment_type']=='Cash']['trip_distance'],bins = 15,kde=True,label='Cash')
plt.legend()
plt.subplot(1,2,2)
plt.title('Distribution of fare amount')
sns.histplot(df[df['payment_type']=='Card']['fare_amount'],bins = 15,kde=True,label='Card')
sns.histplot(df[df['payment_type']=='Cash']['fare_amount'],bins = 15,kde=True,label='Cash')
plt.legend()
plt.show()

In [None]:
df.groupby(['payment_type']).agg({'trip_distance':['mean','std'],'fare_amount':['mean','std']})

#### The mean trip distance and the mean fare amount are both higher for card than for cash, which indicates that card payments are associated with longer and more expensive trips.  
#### Card is the preferred mode at any trip distance; however, when the fare amount is low there is less difference between the two preferences.  
#### As distance and fare amount increase, the preference for paying with card increases.
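
The claim that card usage grows with the fare can be checked directly by computing the share of card payments within fare bands. The sketch below uses a tiny synthetic frame with the same column names as `df`; the helper name `card_share_by_bin` and the bin edges are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def card_share_by_bin(frame, col, bins):
    """Return the percentage of card payments within each bin of `col`."""
    binned = pd.cut(frame[col], bins=bins)
    is_card = frame['payment_type'].eq('Card')
    # the mean of a boolean series is the fraction of True values
    return (is_card.groupby(binned, observed=False).mean() * 100).round(1)

# tiny synthetic example (not real taxi data)
demo = pd.DataFrame({
    'fare_amount': [4, 5, 6, 7, 12, 13, 14, 18],
    'payment_type': ['Cash', 'Card', 'Cash', 'Card', 'Card', 'Card', 'Cash', 'Card'],
})
share = card_share_by_bin(demo, 'fare_amount', bins=[0, 10, 20])
print(share)
```

On the real data, something like `card_share_by_bin(df, 'fare_amount', bins=range(0, 35, 5))` would show whether the card share actually rises from one fare band to the next.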

#### Next, we analyze the payment types in relation to the passenger count.  
#### The objective is to check whether payment preference changes with the number of passengers travelling in the cab.

In [None]:
# checking the payment type distribution in percentage based on passenger count
payment=df.groupby(['payment_type','passenger_count'])[['passenger_count']].count()
payment.rename(columns = {'passenger_count':'count'},inplace = True)
payment.reset_index(inplace=True)
payment['percentage'] = payment['count']*100/payment['count'].sum()
payment

In [None]:
# rearranging the data for horizontal stacked bar chart:
# one row per payment type, one column per passenger count
payment_dist = payment.pivot(index='payment_type',columns='passenger_count',values='percentage').reset_index()
payment_dist.columns.name = None
payment_dist

In [None]:
# plotting the horizontal bar chart
fig,ax = plt.subplots(figsize=(15,7))
payment_dist.plot(x='payment_type',kind='barh',stacked=True,ax=ax)

for i in ax.patches:
    w = i.get_width()
    h = i.get_height()
    x,y = i.get_xy()
    ax.text(x+w/2,y+h/2,'{:.0f}%'.format(w),horizontalalignment='center',verticalalignment='center')

## **Hypothesis testing**
- First we check whether the distribution of fare amount follows the normal distribution. Even though the histogram above suggests otherwise, we use a quantile-quantile (QQ) plot for further confirmation.

In [None]:
import statsmodels.api as sm

In [None]:
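# fit=True standardizes the data, so the 45-degree reference line is a meaningful comparison
sm.qqplot(df['fare_amount'], line = '45', fit = True)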
plt.show()

- The points do not lie along the diagonal line, not even approximately, which suggests that the fare amount does not follow a normal distribution.  
- We use a t-test rather than a z-test because the population variance is unknown and has to be estimated from the sample; the t-test accounts for the uncertainty that comes with estimating population parameters from sample data. Although the fares themselves are not normally distributed, the samples are large, so the sample means are approximately normal by the central limit theorem.  
  
#### *Null hypothesis*: There is no difference in average fare between customers who use a credit card and customers who use cash.  
#### *Alternate hypothesis*: There is a difference in average fare between customers who use a credit card and customers who use cash.

In [None]:
# sample 1
card_sample = df[df['payment_type']=="Card"]['fare_amount']
# sample 2
cash_sample = df[df['payment_type']=="Cash"]['fare_amount']

Performing an F-test to check whether the variances of the two samples are equal

In [None]:
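# F-test for equality of two variances (two-sided)
def F_test(data1,data2, alpha = 0.05):
    var1 = np.var(data1,ddof=1)
    var2 = np.var(data2,ddof=1)
    dof1 = len(data1)-1
    dof2 = len(data2)-1
    # larger variance goes in the numerator, together with its own degrees of freedom
    if var1 >= var2:
        F, dfn, dfd = var1/var2, dof1, dof2
    else:
        F, dfn, dfd = var2/var1, dof2, dof1
    # two-sided p-value from the upper tail, capped at 1
    p = min(1.0, 2*st.f.sf(F,dfn,dfd))
    if p < alpha:
        print('Variances are not same')
    else:
        print('Variances are same')


F_test(card_sample,cash_sample)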
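
The classical F-test assumes normally distributed samples, which the QQ plot above calls into question. As a cross-check, Levene's test (available as `scipy.stats.levene`) is less sensitive to departures from normality. The sketch below runs it on synthetic right-skewed samples; the gamma parameters and variable names are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# synthetic right-skewed "fares" with different spreads (not real data)
sample_a = rng.gamma(shape=2.0, scale=6.0, size=5000)
sample_b = rng.gamma(shape=2.0, scale=4.0, size=5000)

# center='median' is the Brown-Forsythe variant, which is more robust for skewed data
stat, p = st.levene(sample_a, sample_b, center='median')
equal_variances = p >= 0.05
print(f'Levene statistic = {stat:.2f}, p-value = {p:.4g}, equal variances: {equal_variances}')
```

Applied to the notebook's data, the call would be `st.levene(card_sample, cash_sample, center='median')`; if it also reports unequal variances, that supports using `equal_var=False` in the t-test.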

Performing a two sample t-test to check our hypothesis

In [None]:
# two-sample t-test (Welch's version, since the variances are not assumed equal)
t_stats, p_value = st.ttest_ind(a=card_sample,b=cash_sample,equal_var=False)
if p_value < 0.05:
    print(f'p_value = {p_value}, Reject null hypothesis')
else:
    print(f'p_value = {p_value}, Fail to reject null hypothesis')

#### Since the p-value is less than the 5% significance level, the null hypothesis is rejected.  
#### There is a statistically significant difference in average fare amount between passengers who pay with cash and passengers who pay with card.  
### *The business may be able to generate more revenue by encouraging customers to pay with card, and should target offers or advertisements accordingly*
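
With large samples, even a small difference in means tends to be statistically significant, so it helps to look at the size of the difference as well as the p-value. The sketch below computes the mean difference, Welch's t statistic by hand, and Cohen's d on synthetic data. The helper `welch_summary` and the sample parameters are illustrative assumptions, not results from this notebook.

```python
import numpy as np
import scipy.stats as st

def welch_summary(a, b):
    """Mean difference, Welch t statistic and Cohen's d for two samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    diff = a.mean() - b.mean()
    # Welch's t: difference in means over the unpooled standard error
    t = diff / np.sqrt(var_a / len(a) + var_b / len(b))
    # Cohen's d: difference in means over the pooled standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                        / (len(a) + len(b) - 2))
    return diff, t, diff / pooled_sd

rng = np.random.default_rng(1)
card_demo = rng.normal(13.0, 6.0, 4000)   # synthetic stand-in for card fares
cash_demo = rng.normal(12.0, 6.0, 2000)   # synthetic stand-in for cash fares

diff, t, d = welch_summary(card_demo, cash_demo)
t_scipy, _ = st.ttest_ind(card_demo, cash_demo, equal_var=False)
print(f'mean difference = {diff:.2f}, t = {t:.2f} (scipy: {t_scipy:.2f}), Cohen d = {d:.2f}')
```

Calling `welch_summary(card_sample, cash_sample)` on the real samples would show how many dollars separate the two averages, which is the quantity that matters for the revenue argument.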

## **Revenue Forecasting**  
- We will use time series analysis to detect trends in the overall fare amount.
- Additionally, the trends for the different payment types (cash and card) will be analysed.
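
This section has no code yet. As a starting point, the sketch below aggregates fares into daily revenue and produces a seasonal-naive forecast that simply repeats the last observed week. It runs on synthetic trips; the helper names `daily_revenue` and `seasonal_naive_forecast` and the 7-day season are assumptions, and a proper forecasting model would replace the naive baseline.

```python
import numpy as np
import pandas as pd

def daily_revenue(trips, time_col='tpep_pickup_datetime', value_col='fare_amount'):
    """Sum fares per calendar day of pickup."""
    return trips.set_index(time_col)[value_col].resample('D').sum()

def seasonal_naive_forecast(series, horizon=7, season=7):
    """Repeat the last observed `season` days to forecast the next `horizon` days."""
    last_season = series.iloc[-season:].to_numpy()
    values = np.resize(last_season, horizon)
    index = pd.date_range(series.index[-1] + pd.Timedelta(days=1),
                          periods=horizon, freq='D')
    return pd.Series(values, index=index, name='forecast')

# synthetic trips: one every 30 minutes for four weeks (not real data)
pickups = pd.date_range('2020-01-01', periods=28 * 48, freq='30min')
rng = np.random.default_rng(0)
trips = pd.DataFrame({'tpep_pickup_datetime': pickups,
                      'fare_amount': rng.uniform(5, 25, len(pickups))})

revenue = daily_revenue(trips)
forecast = seasonal_naive_forecast(revenue)
print(revenue.rolling(7).mean().tail(3))   # 7-day rolling mean smooths out the weekly pattern
print(forecast)
```

To split the trend by payment type, the same `daily_revenue` helper could be applied separately to the card and cash rows of `data`.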

# **Predictive Modeling**  
- The objective here is to use machine learning models, specifically regression and classification, to predict the fare amount and payment type from trip distance and trip duration. This will help the business plan and allocate resources accordingly.  
- Additionally, the models will be used to predict tip amounts from trip distance and trip duration. This will help the business retain and grow its driver base by suggesting tip amounts to passengers.  
  
*Regression* - Linear Regression  
*Classification* - Random Forest or Support Vector Classifier (SVC)
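
The classification part is not implemented below, so here is a minimal sketch of how a Random Forest could predict payment type from trip distance and trip duration. It uses synthetic trips in which longer trips are deliberately made more likely to be card payments; the data-generating rule and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
# synthetic trips (not real data): longer trips are more likely to be card payments
demo = pd.DataFrame({'trip_distance': rng.uniform(0.5, 10, n),
                     'trip_duration': rng.uniform(2, 40, n)})
p_card = 1 / (1 + np.exp(-(demo['trip_distance'] - 5)))
demo['payment_type'] = np.where(rng.uniform(size=n) < p_card, 'Card', 'Cash')

X = demo[['trip_distance', 'trip_duration']]
y = demo['payment_type']
# stratify keeps the Card/Cash proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f'hold-out accuracy = {accuracy:.3f}')
```

On the real data the classes are imbalanced (card is the majority), so accuracy should be compared against the majority-class baseline before drawing conclusions.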

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as le
from sklearn import metrics
from sklearn.model_selection import cross_validate

In [None]:
ML_data = data[['trip_distance','trip_duration','fare_amount']]
ML_data.head()

In [None]:
ML_data.shape

In [None]:
ML_data.info()

In [None]:
ML_data.isnull().sum()

In [None]:
ML_data.describe()

In [None]:
ML_data = ML_data[ML_data['trip_distance']>0]
ML_data = ML_data[ML_data['trip_duration']>0]
ML_data = ML_data[ML_data['fare_amount']>0]

In [None]:
# passing the column as x= makes the orientation explicit across seaborn versions
plt.subplot(3,1,1)
sns.boxplot(x=ML_data['trip_distance'])
plt.subplot(3,1,2)
sns.boxplot(x=ML_data['trip_duration'])
plt.subplot(3,1,3)
sns.boxplot(x=ML_data['fare_amount'])
plt.show()

In [None]:
for col in ['trip_distance','fare_amount','trip_duration']:
    ML_data = outlier_r(ML_data,col)
    
ML_data.shape

In [None]:
ML_data[ML_data.duplicated()]

In [None]:
ML_data = ML_data.drop_duplicates(keep='first')
ML_data.shape

In [None]:
x = ML_data[['trip_distance','trip_duration']]
y = ML_data['fare_amount']

In [None]:
# fixed random_state so the split (and the scores below) are reproducible
train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:
train_x.shape

In [None]:
train_y.shape

In [None]:
test_x.shape

In [None]:
test_y.shape

In [None]:
linear_model = le()
cross_validate(linear_model,train_x,train_y,cv=5,scoring=['neg_root_mean_squared_error','r2'])

In [None]:
linear_model.fit(train_x,train_y)
pred_y = linear_model.predict(test_x)
metrics.mean_squared_error(test_y,pred_y)

In [None]:
print(linear_model.coef_)
print(linear_model.intercept_)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
params = {
    'n_estimators' : [300, 600, 1000],
    }
gradboost_model = GradientBoostingRegressor(init = linear_model, random_state = 42, learning_rate = 0.01)
grid = GridSearchCV(estimator = gradboost_model, param_grid = params, scoring = 'neg_mean_squared_error', cv = 4,verbose = 3, refit = True)
grid.fit(train_x,train_y)

In [None]:
grid.best_params_

In [None]:
pred_y_boosted = grid.predict(test_x)
metrics.mean_squared_error(test_y,pred_y_boosted)

Next steps:  
- Build a classifier based on the payment type and fare.  
- Compute the fare-to-tip ratio.  
- Auto-suggest a tip amount.
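
As a rough sketch of the last two items, the tip-to-fare ratio can be computed on card trips only (cash tips are not recorded in `tip_amount`) and then used to suggest a tip for a given fare. The helpers `median_tip_ratio` and `suggest_tip`, and the choice of the median, are assumptions; a regression on trip distance and duration could replace the fixed ratio later.

```python
import pandas as pd

def median_tip_ratio(trips):
    """Median tip / fare ratio over card trips with a positive fare."""
    card = trips[(trips['payment_type'] == 1) & (trips['fare_amount'] > 0)]
    return (card['tip_amount'] / card['fare_amount']).median()

def suggest_tip(fare, ratio):
    """Suggested tip for a fare, rounded to the nearest cent."""
    return round(fare * ratio, 2)

# tiny synthetic example (not real data); payment_type 1 = credit card, 2 = cash
demo = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'fare_amount': [10.0, 20.0, 40.0, 15.0],
    'tip_amount': [2.0, 4.0, 10.0, 0.0],
})
ratio = median_tip_ratio(demo)
print(ratio, suggest_tip(30.0, ratio))
```

On the notebook's data this would be called as `median_tip_ratio(data)`, since `data` still carries the numeric `payment_type` codes and the `tip_amount` column.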