<div>
<h1>Maximizing Revenue for Taxi Cab Drivers through Payment Type Analysis</h1>

<h2>Problem statement</h2>
In the fast-paced taxi booking sector, maximizing revenue is essential for long-term success and driver satisfaction. Our goal is to use data-driven insights to help taxi drivers maximize their revenue. Our research examines the relationship between payment type and fare amount to determine whether the payment method has an impact on fares.

<h2>Objective </h2>

<span style="font-weight: bold; color: yellow;">This project's main goal is to run an A/B test to examine the relationship between the total fare and the method of payment.</span> We use Python hypothesis testing and descriptive statistics to extract information that can help taxi drivers increase their revenue. In particular, we want to find out whether fares differ significantly between passengers who pay by credit card and those who pay with cash.

<h2>Research Question</h2>

Is there a relationship between total fare amount and payment type, and can we nudge customers toward payment methods that generate higher revenue for drivers without negatively impacting the customer experience?
</div>

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
%matplotlib inline
sns.set_style(style="darkgrid")
warnings.filterwarnings(action="ignore")

## Loading Dataset

In [None]:
from pandas import DataFrame


df: DataFrame = pd.read_csv(filepath_or_buffer=r"D:\Python\data analysis projects\maximum revenue for the drivers\yellow_tripdata_2020-01.csv", low_memory=False)
df

## Exploratory Data Analysis

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# convert strings to date time
df["tpep_pickup_datetime"] = pd.to_datetime(arg=df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(arg=df["tpep_dropoff_datetime"])

In [None]:
missing_data = df.isna().sum()
missing_data

In [None]:
# Share of rows with missing values: 65,441 of 6,405,008 rows
(65441 / 6405008) * 100

## Drop the missing data
**Since rows with missing data make up only about 1% of the dataset, removing them keeps the analysis simple and has minimal impact on the results.**
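The share of affected rows can also be computed from the frame itself rather than typed in by hand. The sketch below uses a toy frame; the `toy` data and the `missing_share` helper are illustrative assumptions, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the trip data
toy = pd.DataFrame({
    "passenger_count": [1, 2, np.nan, 1],
    "fare_amount": [7.5, 12.0, 9.0, np.nan],
})


def missing_share(frame: pd.DataFrame) -> float:
    """Percentage of rows that contain at least one missing value."""
    return frame.isna().any(axis=1).mean() * 100


print(f"{missing_share(toy):.1f}% of rows have a missing value")

# Drop every row that has a missing value in any column
cleaned = toy.dropna(axis=0)
```

Checking the share first makes it easy to confirm that dropping rows is a reasonable choice before calling `dropna`.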

In [None]:
df_cleaned = df.dropna(axis=0)


In [None]:
df_cleaned

In [None]:
# Trip duration in minutes, inserted as the fourth column
duration = df_cleaned["tpep_dropoff_datetime"] - df_cleaned["tpep_pickup_datetime"]
cleaned_df = df_cleaned.copy()
cleaned_df.insert(loc=3, column="duration", value=duration.dt.total_seconds() / 60)
cleaned_df

In [None]:
cleaned_df.columns

In [None]:

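# Keep only the columns needed for the analysis; copy to avoid modifying a view
df: DataFrame = cleaned_df[["passenger_count", "payment_type", "fare_amount", "trip_distance", "duration"]].copy()

# passenger_count and payment_type hold whole-number codes, so store them as integers.
# fare_amount, trip_distance and duration stay as floats to keep their fractional parts.
for col in ["passenger_count", "payment_type"]:
    df[col] = df[col].astype(dtype=np.int64)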

In [None]:
df.info()

In [None]:
np.round(a=df.describe(), decimals=3)

In [None]:
import plotly.express as px

# Box plot of passenger counts to spot unusual values
fig = px.box(data_frame=cleaned_df, x="passenger_count", orientation="h")

# Update the color of the box plot
fig.update_traces(marker_color="forestgreen")
fig.show()

In [None]:
df = df.drop_duplicates(keep="first")
df = df[df["payment_type"] < 3]
df = df[(df["passenger_count"] > 0) & (df["passenger_count"] < 6)]
df

In [None]:
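# Payment type codes: 1 = credit card, 2 = cash.
# Assign the result back; in-place replace on a selected column may not update df.
df["payment_type"] = df["payment_type"].replace(to_replace=[1, 2], value=["card", "cash"])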

In [None]:
df.isna().sum()

In [None]:
df = df[df["trip_distance"] > 0]
df = df[df["fare_amount"] > 0]
df = df[df["duration"] > 0]
df

In [None]:
df.hist(bins=100, figsize=(20,10))

In [None]:
df["fare_amount"].value_counts().sort_index(ascending=False)

In [None]:
df[df["fare_amount"] > 4000]

## Dealing with outliers
**Use the interquartile range (IQR) to remove outliers: keep only values between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.**
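The IQR rule can be sketched on a small series before applying it to the full dataset. The `iqr_bounds` helper and the sample fares below are hypothetical and only illustrate the rule.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences located k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical fares with one extreme value
fares = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 4000.0])

lower, upper = iqr_bounds(fares)

# Keep only the values that fall inside the fences
kept = fares[(fares >= lower) & (fares <= upper)]
print(f"bounds: [{lower}, {upper}], kept {len(kept)} of {len(fares)} values")
```

Because the fences are built from quartiles rather than the mean, a single extreme fare barely moves them, which is why this rule works well for skewed data such as fares.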

In [None]:
for col in ["fare_amount", "trip_distance", "duration"]:
    Q1 = np.percentile(df[col], 25)
    Q3 = np.percentile(df[col], 75)

    IQR = Q3 - Q1
    
    upper_bound = Q3 + (1.5 * IQR)
    lower_bound = Q1 - (1.5 * IQR)

    # Applying filter to remove the outliers from the data
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

In [None]:
plt.figure(figsize=(20, 5))

plt.subplot(1,2,1)

# mean for payment type
mean_card = df[df["payment_type"] == "card"]["fare_amount"].mean()
mean_cash = df[df["payment_type"] == "cash"]["fare_amount"].mean()

plt.hist(df[df["payment_type"] == "card"]["fare_amount"], bins=30, edgecolor="k", color="forestgreen", histtype="barstacked")
plt.hist(df[df["payment_type"] == "cash"]["fare_amount"], bins=30, histtype="barstacked", color="blue")

# add mean line for payment type
plt.axvline(mean_card, color="red", linestyle="dashed", linewidth=2, label=f"card mean: {mean_card:.2f}")
plt.axvline(mean_cash, color="#E85C0D", linestyle="dashed", linewidth=2, label=f"cash mean: {mean_cash:.2f}")

plt.xlabel("Fare Amount")
plt.ylabel("Frequency")
plt.title("Histogram of Fare Amounts by Payment Type")
plt.tight_layout()
plt.xlim(0,)
plt.legend()

plt.subplot(1,2,2)

mean_card = df[df["payment_type"] == "card"]["trip_distance"].mean()
mean_cash = df[df["payment_type"] == "cash"]["trip_distance"].mean()

plt.hist(df[df["payment_type"] == "card"]["trip_distance"], bins=30, edgecolor="k", color="navy", histtype="barstacked")
plt.hist(df[df["payment_type"] == "cash"]["trip_distance"], bins=30, color="crimson", histtype="barstacked")

plt.axvline(mean_card, color="white", linestyle="dashed", linewidth=2, label=f"card mean: {mean_card:.2f}")
plt.axvline(mean_cash, color="#E85C0D", linestyle="dashed", linewidth=2, label=f"cash mean: {mean_cash:.2f}")

plt.xlabel("Trip Distance")
plt.ylabel("Frequency")
plt.title("Histogram of trip distance by Payment Type".title())
plt.tight_layout()
plt.xlim(0,)
plt.legend()

plt.show()

# descriptive statistics for the taxi driver payment methods [card, cash]
np.round(df.groupby("payment_type").agg({"fare_amount": ["mean", "std"], "trip_distance": ["mean", "std"]}), 2)

# Executive Summary

## Key Insights:
- **Card payments** are associated with higher average fare amounts compared to cash payments.
- Trips paid for by card tend to be slightly longer on average.

## Implications for Business Strategy:
- Consider promoting card payments, especially for longer trips, as they tend to yield higher fare amounts.
- Tailor marketing campaigns to encourage the use of cards for premium or long-distance services.

# Visual & Data-Driven Insights

## Fare Amounts:
- **Graphical Representation:** The left-hand histogram shows fare amounts for card and cash payments, with dashed vertical lines marking the mean fare for each payment type.
- **Textual Insight:** “As observed, trips paid for by card have a higher average fare amount (mean: $13.11) compared to cash payments (mean: $11.76). This suggests a tendency for customers to use cards for more expensive trips.”

## Trip Distances:
- **Graphical Representation:** The right-hand histogram shows trip distances for card and cash payments, with dashed vertical lines marking the mean trip distance for each payment type.
- **Textual Insight:** “The average trip distance is longer for card payments (mean: 2.99 miles) versus cash (mean: 2.60 miles). This could imply that customers prefer using cards for longer trips, possibly due to convenience or security concerns.”
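The gaps quoted above can be expressed in absolute and relative terms. The short calculation below uses only the four means reported in this section; the variable names are illustrative.

```python
# Means quoted in the text above
card_fare, cash_fare = 13.11, 11.76   # dollars
card_dist, cash_dist = 2.99, 2.60     # miles

# Absolute gaps between card and cash
fare_gap = card_fare - cash_fare
dist_gap = card_dist - cash_dist

# Relative gaps, expressed as a percentage of the cash mean
fare_gap_pct = fare_gap / cash_fare * 100
dist_gap_pct = dist_gap / cash_dist * 100

print(f"fare gap: ${fare_gap:.2f} ({fare_gap_pct:.1f}% above cash)")
print(f"distance gap: {dist_gap:.2f} mi ({dist_gap_pct:.1f}% above cash)")
```

With these figures, card trips are about 15% longer and card fares about 11.5% higher than cash, so trip length may account for much of the fare gap. Comparing fares within similar distance ranges would help separate the two effects.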

# Actionable Recommendations

## Promotional Campaigns:
- **Discounts on Card Payments:** Offer discounts or incentives for card payments on trips longer than a certain distance, leveraging the trend that card payments are already preferred for such trips.
- **Partnerships with Card Companies:** Partner with credit card companies to offer exclusive rewards or cashback for using cards on longer or premium services.

## Customer Segmentation:
- **Target High-Value Customers:** Identify frequent users who prefer card payments and offer them loyalty programs or premium services to increase engagement.
- **Encourage Card Usage:** Implement marketing strategies aimed at customers who predominantly use cash, encouraging them to switch to card payments for a more seamless experience.

# Operational Adjustments

- **Adjust Pricing Models:** Given the higher fare amounts associated with card payments, consider introducing dynamic pricing models that offer slight discounts for card payments on longer trips to increase usage.
- **Payment Method Optimization:** Ensure that the infrastructure supports a seamless card payment process, especially for long-distance trips, to cater to the trend observed.

# Further Analysis

- **Explore Outliers:** Investigate the high-end tails in both fare amount and trip distance histograms to understand if they represent outliers or a segment of high-value customers.
- **Correlation Analysis:** Conduct a deeper statistical analysis to explore correlations between payment type, trip distance, and fare amount to uncover more nuanced insights.
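One possible starting point for the correlation analysis is a point-biserial correlation, which measures the association between a binary variable (card vs. cash) and a continuous one (fare). The sketch below runs on synthetic data invented for illustration; the coefficients and variable names are assumptions, not results from this dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data: these numbers are invented for illustration only
is_card = rng.integers(0, 2, size=500)  # 1 = card, 0 = cash
distance = rng.gamma(shape=2.0, scale=1.4, size=500) + 0.3 * is_card
fare = 3.0 + 2.5 * distance + rng.normal(0, 1.0, size=500)

# Point-biserial correlation: a binary variable against a continuous one
r_fare, p_fare = stats.pointbiserialr(is_card, fare)

# Pearson correlation between the two continuous variables
r_dist, p_dist = stats.pearsonr(distance, fare)

print(f"payment type vs fare: r={r_fare:.3f} (p={p_fare:.3g})")
print(f"distance vs fare:     r={r_dist:.3f} (p={p_dist:.3g})")
```

The point-biserial coefficient is numerically the same as a Pearson correlation computed on a 0/1 indicator, so it is directly comparable with a heatmap built from dummy columns.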

# Presentation Approach

## Slide Deck:
- Start with a summary slide of the main findings.
- Include detailed slides with histograms, annotated means, and supporting textual insights.
- End with recommendations and a Q&A section for stakeholders.

## Report:
- Write a structured report with sections for introduction, methodology, results (with visuals), and conclusions. The actionable recommendations could be highlighted in a separate section or as an appendix.

# Business Context Application

- **Ride-Sharing Services:** Emphasize how these insights can be used to tailor services, such as premium ride options that cater to customers who prefer card payments.
- **Urban Transportation:** Discuss how urban transportation planners might leverage this data to encourage cashless payments, improving operational efficiency.



In [None]:
np.round(df.describe(), 3)

In [None]:
from copy import deepcopy

correlation_columns = ["passenger_count", "fare_amount", "trip_distance", "duration"]
payment_dummies = pd.get_dummies(df["payment_type"]).astype(np.int32)
correlation_df = deepcopy(df[correlation_columns])

correlation_df = pd.concat(objs=[payment_dummies, correlation_df], axis=1)

plt.figure(figsize=(9, 5))
sns.heatmap(data=correlation_df[["card", "cash","fare_amount", "trip_distance"]].corr(), annot=True, cmap="Greens")
plt.show()

In [None]:
payment_share = df["payment_type"].value_counts(normalize=True)
plt.title("Preference of payment type".title())
plt.pie(payment_share, labels=payment_share.index, autopct='%1.1f%%', startangle=90)
plt.show()

## Passenger Count Analysis

In [None]:
passenger_count = df.groupby(["payment_type", "passenger_count"])[["passenger_count"]].count()
passenger_count.rename(columns={"passenger_count" : "count"}, inplace=True)
passenger_count.reset_index(inplace=True)

In [None]:
passenger_count.rename(columns={"passenger_count" : "passengers"}, inplace=True)
passenger_count["percentage"] = passenger_count["count"] / passenger_count["count"].sum() * 100

In [None]:
passenger_count

In [None]:
# Reshape to one row per payment type and one column per passenger count.
# pivot matches values by label, so it does not depend on row order.
passenger_df = (
    passenger_count.pivot(index="payment_type", columns="passengers", values="percentage")
    .rename(index={"card": "Card", "cash": "Cash"})
    .reset_index()
)
passenger_df.columns.name = None
passenger_df.plot(x="payment_type", kind="barh", stacked=True, color=["#2E8B57", "#3CB371", "#228B22", "#006400", "#556B2F"])

passenger_df

**Null hypothesis:** There is no difference in average fare between customers who pay by credit card and customers who pay with cash.

**Alternative hypothesis:** There is a difference in average fare between customers who pay by credit card and customers who pay with cash.
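The decision rule for these hypotheses can be sketched with Welch's two-sample t-test, which does not assume equal variances in the two groups. The fares below are synthetic and the 0.05 significance level is an assumption, since the notebook does not state one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Synthetic fares, invented for illustration; not the notebook's data
card = rng.normal(loc=13.5, scale=6.0, size=2000)
cash = rng.normal(loc=11.5, scale=6.0, size=1500)

alpha = 0.05  # assumed significance level

# equal_var=False selects Welch's t-test
t_stat, p_value = stats.ttest_ind(card, cash, equal_var=False)

# Reject the null hypothesis when the p-value falls below alpha
reject_null = p_value < alpha
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3g}, reject null: {reject_null}")
```

A p-value below `alpha` means the observed difference in mean fares would be unlikely if the null hypothesis were true. It does not by itself show that the payment method causes the difference.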

In [None]:
import statsmodels.api as sm

# fit=True standardizes the sample so the 45-degree reference line is meaningful
sm.qqplot(df["fare_amount"], fit=True, line="45")
plt.show()
df.shape

In [None]:
card_sample = df[df["payment_type"] == "card"]["fare_amount"]
cash_sample = df[df["payment_type"] == "cash"]["fare_amount"]

In [None]:
t_stats, p_value = stats.ttest_ind(a=card_sample, b=cash_sample, equal_var=False)
print(f"t-statistic: {t_stats}\np-value: {p_value}")

