### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver, but what factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks respondents whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons -- less expensive restaurants (under \$20), coffee houses, carry out & take away, bars, and more expensive restaurants (\$20 - \$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [10]:
# Load the dataset
data = pd.read_csv('data/coupons.csv')

In [11]:
# Glance at the first few rows of the data
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [12]:
# Understand the dataset

# To clearly understand all the columns available in the dataset
# and what are the unique values in each column

for col in data.columns:
    print(f"{col}: ", data[col].unique())

# data.info() additionally reports each column's data type and
# how many non-null values it contains

data.info()


destination:  ['No Urgent Place' 'Home' 'Work']
passanger:  ['Alone' 'Friend(s)' 'Kid(s)' 'Partner']
weather:  ['Sunny' 'Rainy' 'Snowy']
temperature:  [55 80 30]
time:  ['2PM' '10AM' '6PM' '7AM' '10PM']
coupon:  ['Restaurant(<20)' 'Coffee House' 'Carry out & Take away' 'Bar'
 'Restaurant(20-50)']
expiration:  ['1d' '2h']
gender:  ['Female' 'Male']
age:  ['21' '46' '26' '31' '41' '50plus' '36' 'below21']
maritalStatus:  ['Unmarried partner' 'Single' 'Married partner' 'Divorced' 'Widowed']
has_children:  [1 0]
education:  ['Some college - no degree' 'Bachelors degree' 'Associates degree'
 'High School Graduate' 'Graduate degree (Masters or Doctorate)'
 'Some High School']
occupation:  ['Unemployed' 'Architecture & Engineering' 'Student'
 'Education&Training&Library' 'Healthcare Support'
 'Healthcare Practitioners & Technical' 'Sales & Related' 'Management'
 'Arts Design Entertainment Sports & Media' 'Computer & Mathematical'
 'Life Physical Social Science' 'Personal Care & Service'
 'Com

In [13]:
# Further looking at the data distribution gives a little more 
# understanding of the data

# data.describe(include='all')
data.describe()

Unnamed: 0,temperature,has_children,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
count,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0
mean,63.301798,0.414144,1.0,0.561495,0.119126,0.214759,0.785241,0.568433
std,19.154486,0.492593,0.0,0.496224,0.32395,0.410671,0.410671,0.495314
min,30.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,55.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
50%,80.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
75%,80.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0
max,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [14]:
# Finally check how many null values exist in each column

data.isnull().sum()

destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64
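
Raw null counts are easier to judge as a share of all rows. In the output above, `car` is missing in 12,576 of 12,684 rows (about 99%), while the five frequency columns are each missing in at most 217 rows (under 2%). The sketch below shows one way to compute that share; the `toy` frame and its values are illustrative assumptions, not the survey data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative only)
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "example"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "Y": [1, 0, 1, 0],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))

# Columns that are mostly empty are candidates for dropping
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(mostly_empty)
```

The 0.5 cutoff is an arbitrary choice for the sketch; any threshold should be justified against how the column would be used.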

3. Decide what to do about your missing data -- drop, replace, other...

In [15]:
# The 'car' column is missing in 12,576 of 12,684 rows (about 99%),
# so it carries almost no information and is dropped

df = data.drop(columns='car')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  Bar                   12577 non-null  object
 15  CoffeeHouse           12467 non-null
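
After `car` is dropped, the five frequency columns still contain a small number of missing answers. Two common options are to drop the affected rows or to fill each column with its most frequent answer. The sketch below contrasts both on a made-up `toy` frame; which option suits this dataset is a judgment call that the notebook does not settle.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two frequency columns (illustrative values)
toy = pd.DataFrame({
    "Bar": ["never", np.nan, "1~3", "never"],
    "CoffeeHouse": ["less1", "less1", np.nan, "4~8"],
    "Y": [1, 0, 1, 0],
})
freq_cols = ["Bar", "CoffeeHouse"]

# Option 1: drop any row that has a missing frequency answer
dropped = toy.dropna(subset=freq_cols)

# Option 2: fill each column with its most common answer (the mode)
filled = toy.copy()
for col in freq_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(len(toy), len(dropped), int(filled.isnull().sum().sum()))
```

Dropping keeps only observed answers at the cost of rows; mode filling keeps every row but nudges the distribution toward the most common category.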

4. What proportion of the total observations chose to accept the coupon?



In [16]:
# Calculate the Proportion
total_observations = len(df)
accepted_count = df['Y'].sum()
proportion_accepted = accepted_count / total_observations

# Prepare Data for Plotting
# Map the integer values to descriptive labels for the plot
plot_data = df['Y'].value_counts().reset_index()
plot_data.columns = ['Acceptance', 'Count']
plot_data['Acceptance'] = plot_data['Acceptance'].map({1: 'Accepted (Y=1)', 0: 'Rejected (Y=0)'})

fig = px.pie(
    plot_data,
    names='Acceptance',
    values='Count',
    title=f"Proportion of Coupon Acceptance (Total Observations: {total_observations})",
    color='Acceptance',
    color_discrete_map={'Accepted (Y=1)':'green', 'Rejected (Y=0)':'red'}
)

# Show the percentage and the label on each slice
fig.update_traces(textinfo='percent+label')
fig.update_layout(autosize=True)
fig.show()
fig.write_image("./images/observations_coupon_accepted.png")
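
Because `Y` is coded as 0 or 1, its mean is the acceptance proportion. The `describe()` output earlier in the notebook reports a mean of 0.568433 for `Y`, which corresponds to roughly 56.8% of observations accepting the coupon. A minimal sketch of the equivalence, using made-up labels rather than the survey data:

```python
import pandas as pd

# Toy 0/1 labels (illustrative only)
y = pd.Series([1, 0, 1, 1, 0], name="Y")

# Sum of ones divided by the number of rows ...
proportion = y.sum() / len(y)

# ... is the same quantity as the mean of the column
assert abs(proportion - y.mean()) < 1e-12

# value_counts(normalize=True) gives the share of each label
shares = y.value_counts(normalize=True)
print(f"{proportion:.2%}")
print(shares)
```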


5. Use a bar plot to visualize the `coupon` column.

In [17]:
# Calculate the counts of each unique value in the 'coupon' column
coupon_counts = df['coupon'].value_counts().reset_index()
coupon_counts.columns = ['Coupon_Type', 'Count']

# Print the unique values and their counts
print(df['coupon'].value_counts().sort_values(ascending=True))

# Sort the counts in ascending order (the axis order is also enforced below via categoryorder)
coupon_counts = coupon_counts.sort_values(by='Count')

# Create a bar plot using Plotly Express
fig = px.bar(
    coupon_counts,
    x='Coupon_Type',
    y='Count',
    title='Distribution of Coupon Types',
    labels={'Coupon_Type': 'Type of Coupon', 'Count': 'Number of Observations'},
    color='Coupon_Type'
)

# Customize layout for better readability and ensure sorted order
fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.update_layout(autosize=True)
fig.show()
fig.write_image("./images/coupon_bar.png")

coupon
Restaurant(20-50)        1492
Bar                      2017
Carry out & Take away    2393
Restaurant(<20)          2786
Coffee House             3996
Name: count, dtype: int64
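
The counts are easier to compare as shares of all observations. The sketch below re-enters the five printed counts by hand (an assumption made so the example is self-contained; in the notebook, `df['coupon'].value_counts(normalize=True)` would give the same shares directly).

```python
import pandas as pd

# Counts copied from the printed output above
coupon_counts = pd.Series(
    {
        "Restaurant(20-50)": 1492,
        "Bar": 2017,
        "Carry out & Take away": 2393,
        "Restaurant(<20)": 2786,
        "Coffee House": 3996,
    },
    name="count",
)

# The counts should add up to the number of rows in the dataset
total = int(coupon_counts.sum())

# Percentage share of each coupon type, rounded to one decimal
coupon_share = (coupon_counts / total * 100).round(1)
print(total)
print(coupon_share)
```

The total matches the 12,684 rows reported by `data.info()`, which confirms that every observation has a coupon type.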


6. Use a histogram to visualize the temperature column.

In [18]:
# Unique Temperature Values: [30, 55, 80]
temp_values = df['temperature'].unique()

# Print the number of observations for each temperature value
print(df['temperature'].value_counts().sort_values(ascending=True))

# Create a histogram using Plotly Express
fig = px.histogram(
    df,
    x='temperature',
    nbins=len(temp_values), # Use a bin for each unique temperature value
    title='Distribution of Temperature Observations',
    labels={'temperature': 'Temperature (Fahrenheit)', 'count': 'Frequency (Number of Observations)'}
)

# Customize layout for better appearance
fig.update_layout(bargap=0.05)
fig.update_xaxes(dtick=10) # Set tick marks every 10 degrees

# autosize the plot layout
fig.update_layout(autosize=True)
fig.show()
fig.write_image("./images/temperature_histogram.png")

temperature
30    2316
55    3840
80    6528
Name: count, dtype: int64
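
Since temperature takes only three values, the summary statistics from `describe()` can be checked by hand against these counts. The sketch below is plain arithmetic on the printed numbers and uses no survey data beyond them.

```python
# Counts copied from the printed output above
counts = {30: 2316, 55: 3840, 80: 6528}

# Total number of observations
n = sum(counts.values())

# Weighted mean: each temperature weighted by how often it occurs
mean_temp = sum(temp * count for temp, count in counts.items()) / n

# Median: walk through the sorted values until half the rows are covered
running = 0
median_temp = None
for temp in sorted(counts):
    running += counts[temp]
    if running >= n / 2:
        median_temp = temp
        break

print(n, round(mean_temp, 6), median_temp)
```

Both results agree with the `describe()` output shown earlier (mean 63.301798, 50% quantile 80), so more than half of all scenarios were presented at 80F.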


**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar-related coupons.

1. Create a new `DataFrame` that contains just the bar coupons.


In [19]:
# Create a boolean mask to select rows where 'coupon' is 'Bar'
bar_coupon_mask = df['coupon'] == 'Bar'

# Filter the original DataFrame to create the new DataFrame
bar_coupons_df = df[bar_coupon_mask].copy()

print(bar_coupons_df.shape)
bar_coupons_df.head()


(2017, 25)


Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
9,No Urgent Place,Kid(s),Sunny,80,10AM,Bar,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
13,Home,Alone,Sunny,55,6PM,Bar,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,1,0,1
17,Work,Alone,Sunny,55,7AM,Bar,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,1,0,1,0
24,No Urgent Place,Friend(s),Sunny,80,10AM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,1
35,Home,Alone,Sunny,55,6PM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,1,0,1


2. What proportion of bar coupons were accepted?


In [20]:
# Get the bar count 
total_barcount = len(bar_coupons_df)
accepted_barcount = bar_coupons_df['Y'].sum()
proportion_baraccepted = accepted_barcount / total_barcount

# Prepare Data for Plotting
# Map the integer values to descriptive labels for the plot
plot_data = bar_coupons_df['Y'].value_counts().reset_index()
plot_data.columns = ['Acceptance', 'Count']
plot_data['Acceptance'] = plot_data['Acceptance'].map({1: 'Accepted (Y=1)', 0: 'Rejected (Y=0)'})

fig = px.pie(
    plot_data,
    names='Acceptance',
    values='Count',
    title=f"Proportion of Bar Coupon Acceptance (Total Bar Coupons: {total_barcount})",
    color='Acceptance',
    color_discrete_map={'Accepted (Y=1)':'green', 'Rejected (Y=0)':'red'}
)

# Show the percentage and the label on each slice
fig.update_traces(textinfo='percent+label')
fig.update_layout(autosize=True)
fig.show()
fig.write_image("./images/bar_coupon_accepted.png")

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [21]:
# Check the unique values in Bar that determine the low frequency and high frequency visitors
print(bar_coupons_df['Bar'].value_counts())
# The values are - 'never', 'less1', '1~3', '4~8', 'gt8'

# Define the low and high frequency categories based on the standard coupon dataset values
low_frequency = ['never', 'less1', '1~3']  # 3 or fewer times a month
high_frequency = ['4~8', 'gt8']        # More than 3 times a month

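# Create a new categorical column for grouping
# Note: rows with a missing 'Bar' value (NaN) are not in low_frequency,
# so the else branch below assigns them to the High Frequency group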
bar_coupons_df['Bar_Frequency_Group'] = bar_coupons_df['Bar'].apply(
    lambda x: 'Low Frequency (<=3/mo)' if x in low_frequency else 'High Frequency (>3/mo)'
)

# Calculate acceptance rate (mean of 'Y') for each group
acceptance_comparison = bar_coupons_df.groupby('Bar_Frequency_Group')['Y'].mean().reset_index()
acceptance_comparison['Acceptance_Rate'] = (acceptance_comparison['Y'] * 100).round(2).astype(str) + '%'
acceptance_comparison = acceptance_comparison.rename(columns={'Y': 'Proportion_Accepted'})

# Also calculate the full data for the pie charts (counts)
pie_data = bar_coupons_df.groupby(['Bar_Frequency_Group', 'Y']).size().reset_index(name='Count')

print("\n--- Acceptance Rate Comparison ---")
print(acceptance_comparison)
print("\n--- Data for Pie Charts ---")
print(pie_data)

Bar
never    830
less1    570
1~3      397
4~8      150
gt8       49
Name: count, dtype: int64

--- Acceptance Rate Comparison ---
      Bar_Frequency_Group  Proportion_Accepted Acceptance_Rate
0  High Frequency (>3/mo)             0.731818          73.18%
1  Low Frequency (<=3/mo)             0.370618          37.06%

--- Data for Pie Charts ---
      Bar_Frequency_Group  Y  Count
0  High Frequency (>3/mo)  0     59
1  High Frequency (>3/mo)  1    161
2  Low Frequency (<=3/mo)  0   1131
3  Low Frequency (<=3/mo)  1    666
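
The `Bar` value counts above sum to 1,996, while the bar-coupon frame has 2,017 rows, so 21 rows have no `Bar` answer. The High Frequency group in the printed pie data has 220 rows (59 + 161) rather than the 199 (150 + 49) that answered `4~8` or `gt8`, which suggests the unanswered rows were grouped with it. One possible way to keep missing answers out of both groups is to map known values explicitly and leave everything else as NaN. The sketch below uses a made-up `toy` frame; `group_map` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing 'Bar' answers (illustrative values)
toy = pd.DataFrame({
    "Bar": ["never", "4~8", np.nan, "gt8", "1~3", np.nan],
    "Y": [0, 1, 0, 1, 1, 1],
})

low_label = "Low Frequency (<=3/mo)"
high_label = "High Frequency (>3/mo)"

# Explicit mapping: values not listed here (including NaN) stay NaN
group_map = {
    "never": low_label, "less1": low_label, "1~3": low_label,
    "4~8": high_label, "gt8": high_label,
}
toy["Bar_Frequency_Group"] = toy["Bar"].map(group_map)

# groupby drops NaN keys by default, so unanswered rows are excluded
rates = toy.groupby("Bar_Frequency_Group")["Y"].agg(["mean", "count"])
print(rates)
```

With this approach the group sizes add up to the number of answered rows only, and the unanswered rows can be reported separately if needed.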


In [22]:

# Plot 1: Low Frequency Visitors (Index 2 and 3)
low_freq_data = pie_data[pie_data['Bar_Frequency_Group'] == 'Low Frequency (<=3/mo)'].copy()
low_freq_data['Acceptance'] = low_freq_data['Y'].map({1: 'Accepted (Y=1)', 0: 'Rejected (Y=0)'})

fig1 = px.pie(
    low_freq_data,
    names='Acceptance',
    values='Count',
    title='Acceptance Rate for Low Frequency Bar Visitors (<=3/mo)',
    color='Acceptance',
    color_discrete_map={'Accepted (Y=1)':'green', 'Rejected (Y=0)':'red'}
)
fig1.update_traces(textinfo='percent+label')
fig1.update_layout(autosize=True)

# Plot 2: High Frequency Visitors (Index 0 and 1) ---
high_freq_data = pie_data[pie_data['Bar_Frequency_Group'] == 'High Frequency (>3/mo)'].copy()
high_freq_data['Acceptance'] = high_freq_data['Y'].map({1: 'Accepted (Y=1)', 0: 'Rejected (Y=0)'})

fig2 = px.pie(
    high_freq_data,
    names='Acceptance',
    values='Count',
    title='Acceptance Rate for High Frequency Bar Visitors (>3/mo)',
    color='Acceptance',
    color_discrete_map={'Accepted (Y=1)':'green', 'Rejected (Y=0)':'red'}
)
fig2.update_traces(textinfo='percent+label')
fig2.update_layout(autosize=True)

fig1.show()
fig1.write_image("./images/low_frequency_visitors.png")
fig2.show()
fig2.write_image("./images/high_frequency_visitors.png")

4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others. Is there a difference?


In [24]:
# Define the conditions for the Target Group:
# Condition A: Bar Frequency > 1 time a month ('1~3', '4~8', 'gt8')
bar_freq_high = ['1~3', '4~8', 'gt8']
cond_freq = bar_coupons_df['Bar'].isin(bar_freq_high)

# Condition B: Age > 25 ('26', '31', '36', '41', '46', '50plus')
age_over_25 = ['26', '31', '36', '41', '46', '50plus']
cond_age = bar_coupons_df['age'].isin(age_over_25)

# Create the 'Group' column
# Target Group: rows where both cond_freq and cond_age hold; everyone else is 'Others'
bar_coupons_df['Group'] = 'Others'
bar_coupons_df.loc[cond_freq & cond_age, 'Group'] = 'Target Group (>1/mo & >25yo)'

# Calculate acceptance rate (mean of 'Y') for each group
acceptance_comparison = bar_coupons_df.groupby('Group')['Y'].mean().reset_index()
acceptance_comparison['Acceptance_Rate'] = (acceptance_comparison['Y'] * 100).round(2).astype(str) + '%'
acceptance_comparison = acceptance_comparison.rename(columns={'Y': 'Proportion_Accepted'})

# Plot the comparison (Bar Chart)
fig = px.bar(
    acceptance_comparison,
    x='Group',
    y='Proportion_Accepted',
    title='Bar Coupon Acceptance Rate Comparison',
    labels={'Proportion_Accepted': 'Acceptance Rate', 'Group': 'Driver Group'},
    color='Group',
    color_discrete_map={
        'Target Group (>1/mo & >25yo)': '#2ca02c', # Green for target
        'Others': '#2f0fcf' # Blue for others
    },
    text=acceptance_comparison['Acceptance_Rate']
)

fig.update_layout(yaxis_tickformat='.0%', yaxis_range=[0, 1])
fig.update_traces(textposition='outside')
fig.update_xaxes(tickangle=0)

# Save the plot
# fig.write_json('bar_coupon_acceptance_comparison.json')

# Print the comparison
print("--- Bar Coupon Acceptance Rate Comparison ---")
print(acceptance_comparison)

fig.show()
fig.write_image("./images/bar_coupon_acceptance_rate.png")

--- Bar Coupon Acceptance Rate Comparison ---
                          Group  Proportion_Accepted Acceptance_Rate
0                        Others             0.335003           33.5%
1  Target Group (>1/mo & >25yo)             0.695238          69.52%
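
The same "flag a segment, then compare rates" pattern recurs in the next two questions, so it can be wrapped in a small helper. `compare_segment` below is a hypothetical function introduced for illustration, not part of the assignment, and the `toy` frame is made up.

```python
import pandas as pd

def compare_segment(frame, mask, target_label="Target", other_label="Others", outcome="Y"):
    """Return acceptance rate and row count for rows inside vs. outside `mask`."""
    # Turn the boolean mask into readable segment labels
    segment = mask.map({True: target_label, False: other_label})
    summary = frame.groupby(segment)[outcome].agg(["mean", "count"])
    return summary.rename(
        columns={"mean": "Acceptance_Rate", "count": "Total_Observations"}
    )

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "4~8", "less1"],
    "age": ["26", "21", "21", "46"],
    "Y": [1, 0, 1, 0],
})

# Same two conditions as in the cell above
mask = toy["Bar"].isin(["1~3", "4~8", "gt8"]) & toy["age"].isin(
    ["26", "31", "36", "41", "46", "50plus"]
)
summary = compare_segment(toy, mask)
print(summary)
```

Returning the row count alongside the rate makes it easier to see when a segment is too small for its rate to be trusted.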


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [25]:
# Define the criteria for the Target Segment
high_bar_frequency = ['1~3', '4~8', 'gt8']
not_kid_passenger = bar_coupons_df['passanger'] != 'Kid(s)'
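# Occupation labels in this dataset are spelled without commas (see the unique values printed earlier);
# confirm the exact label with df['occupation'].unique()
not_farm_fish_forest = bar_coupons_df['occupation'] != 'Farming Fishing & Forestry'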

# Create the combined boolean mask for the Target Segment
target_segment_mask = (bar_coupons_df['Bar'].isin(high_bar_frequency)) & \
                      (not_kid_passenger) & \
                      (not_farm_fish_forest)

# Create a new categorical column for grouping
bar_coupons_df['Segment'] = 'All Others'
bar_coupons_df.loc[target_segment_mask, 'Segment'] = 'Target Segment'

# Calculate acceptance rate (mean of 'Y') and total count for both groups
acceptance_comparison = bar_coupons_df.groupby('Segment')['Y'].agg(['mean', 'count']).reset_index()
acceptance_comparison = acceptance_comparison.rename(columns={'mean': 'Acceptance_Rate', 'count': 'Total_Observations'})
acceptance_comparison['Acceptance_Rate_Percent'] = (acceptance_comparison['Acceptance_Rate'] * 100).round(2)

# Generate Bar Plot
fig = px.bar(
    acceptance_comparison,
    x='Segment',
    y='Acceptance_Rate',
    color='Segment',
    title='Bar Coupon Acceptance Rate Comparison',
    labels={
        'Acceptance_Rate': 'Acceptance Rate (Proportion)',
        'Segment': 'Driver Group'
    },
    text=acceptance_comparison['Acceptance_Rate_Percent'].apply(lambda x: f'{x}%')
)

# Set y-axis range from 0 to 1 for better interpretation of proportion
fig.update_yaxes(range=[0, 1.0], tickformat=".0%")
fig.show()
fig.write_image("./images/bar_coupon_acceptance_rate_compare.png")

# The Target Segment (drivers who go to a bar more than once a month, had passengers that were not a kid, 
# and had occupations other than farming, fishing, or forestry) is more than twice as likely to accept the 
# coupon compared to the 'All Others' group. This segment represents a high-value customer group for bar promotions


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [27]:
# Filter for Bar coupons and drop NaNs in necessary frequency columns
relevant_cols = ['Bar', 'RestaurantLessThan20', 'Y', 'passanger', 'maritalStatus', 'age', 'income', 'coupon']
bar_df = df[df['coupon'] == 'Bar'][relevant_cols].copy()
bar_df.dropna(subset=['Bar', 'RestaurantLessThan20'], inplace=True)

# Define the categorical criteria lists
high_bar_frequency = ['1~3', '4~8', 'gt8']
# Note: 'below21' drivers are also under 30 but are not included in this list
under_30_age = ['21', '26']
high_cheap_restaurant_frequency = ['4~8', 'gt8']
low_income = ['Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999']

# Define the boolean masks for Group A, B, and C

# Group A: go to bars > 1/mo AND passengers not kid AND not widowed
mask_A = (bar_df['Bar'].isin(high_bar_frequency)) & \
         (bar_df['passanger'] != 'Kid(s)') & \
         (bar_df['maritalStatus'] != 'Widowed')

# Group B: go to bars > 1/mo AND under age 30
mask_B = (bar_df['Bar'].isin(high_bar_frequency)) & \
         (bar_df['age'].isin(under_30_age))

# Group C: go to cheap restaurants > 4/mo AND income < 50K
mask_C = (bar_df['RestaurantLessThan20'].isin(high_cheap_restaurant_frequency)) & \
         (bar_df['income'].isin(low_income))

# Combine the masks using OR logic
combined_target_mask = mask_A | mask_B | mask_C

# Create the Segment column
bar_df['Segment'] = 'All Others'
bar_df.loc[combined_target_mask, 'Segment'] = 'Combined Target Group'

# Calculate acceptance rate (mean of 'Y') and total count for both groups
acceptance_comparison = bar_df.groupby('Segment')['Y'].agg(['mean', 'count']).reset_index()
acceptance_comparison = acceptance_comparison.rename(columns={'mean': 'Acceptance_Rate', 'count': 'Total_Observations'})
acceptance_comparison['Acceptance_Rate_Percent'] = (acceptance_comparison['Acceptance_Rate'] * 100).round(2)

print(" Acceptance Rate Comparison ")
print(acceptance_comparison)

# Generate Bar Plot
fig = px.bar(
    acceptance_comparison,
    x='Segment',
    y='Acceptance_Rate',
    color='Segment',
    title='Bar Coupon Acceptance Rate Comparison for Combined Target Group',
    labels={
        'Acceptance_Rate': 'Acceptance Rate (Proportion)',
        'Segment': 'Driver Group'
    },
    text=acceptance_comparison['Acceptance_Rate_Percent'].apply(lambda x: f'{x}%')
)

# Set y-axis range from 0 to 1 for better interpretation of proportion
fig.update_yaxes(range=[0, 1.0], tickformat=".0%")

fig.show()
fig.write_image("./images/bar_coupon_acceptance_rate_targetgroup.png")

 Acceptance Rate Comparison 
                 Segment  Acceptance_Rate  Total_Observations  \
0             All Others         0.293972                1211   
1  Combined Target Group         0.591440                 771   

   Acceptance_Rate_Percent  
0                    29.40  
1                    59.14  
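
To judge whether a gap like this could be due to chance, one option is a two-proportion z-test. The sketch below is not part of the original analysis: it recovers accepted counts by multiplying the printed rates by the printed group sizes, and it assumes observations are independent, which may not hold if one respondent answered several scenarios. `two_proportion_z` is a hypothetical helper name.

```python
import math

def two_proportion_z(accepted_a, n_a, accepted_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = accepted_a / n_a
    p_b = accepted_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (accepted_a + accepted_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal approximation
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Accepted counts recovered from the printed rates and group sizes
accepted_target = round(0.591440 * 771)
accepted_others = round(0.293972 * 1211)

z, p_value = two_proportion_z(accepted_target, 771, accepted_others, 1211)
print(accepted_target, accepted_others, round(z, 2), p_value)
```

A large absolute z value with a very small p-value would indicate that the difference is unlikely under the equal-rates assumption, subject to the independence caveat above.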


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [28]:
# Filter for Bar coupons and drop NaNs in the 'Bar' frequency column (essential for this analysis)
bar_df = df[df['coupon'] == 'Bar'].copy()
bar_df.dropna(subset=['Bar'], inplace=True)

# Acceptance Rate vs. Bar Frequency
bar_freq_order = ['never', 'less1', '1~3', '4~8', 'gt8']
bar_acceptance = bar_df.groupby('Bar')['Y'].mean().reset_index()
bar_acceptance['Acceptance_Rate_Percent'] = (bar_acceptance['Y'] * 100).round(2)

# Ensure correct ordering
bar_acceptance['Bar'] = pd.Categorical(bar_acceptance['Bar'], categories=bar_freq_order, ordered=True)
bar_acceptance = bar_acceptance.sort_values('Bar')

fig1 = px.bar(
    bar_acceptance,
    x='Bar',
    y='Y',
    title='Hypothesis Support 1: Acceptance Rate vs. Bar Frequency',
    labels={'Y': 'Acceptance Rate (Proportion)', 'Bar': 'How often driver goes to a Bar'},
    text=bar_acceptance['Acceptance_Rate_Percent'].apply(lambda x: f'{x}%'),
    color_discrete_sequence=['#1f77b4']
)
fig1.update_yaxes(range=[0, 1.0], tickformat=".0%")
fig1.update_xaxes(title_text="Bar Frequency (per month)")
fig1.show()
fig1.write_image("./images/hypothesis_1.png")

# Acceptance Rate vs. Age
# Categories must match the labels in the 'age' column exactly; labels missing from this list become NaN
age_order = ['below21', '21', '26', '31', '36', '41', '46', '50plus']
age_acceptance = bar_df.groupby('age')['Y'].mean().reset_index()
age_acceptance['Acceptance_Rate_Percent'] = (age_acceptance['Y'] * 100).round(2)

# Ensure correct ordering
age_acceptance['age'] = pd.Categorical(age_acceptance['age'], categories=age_order, ordered=True)
age_acceptance = age_acceptance.sort_values('age')

fig2 = px.bar(
    age_acceptance,
    x='age',
    y='Y',
    title='Hypothesis Support 2: Acceptance Rate vs. Driver Age',
    labels={'Y': 'Acceptance Rate (Proportion)', 'age': 'Driver Age'},
    text=age_acceptance['Acceptance_Rate_Percent'].apply(lambda x: f'{x}%'),
    color_discrete_sequence=['#ff7f0e']
)
fig2.update_yaxes(range=[0, 1.0], tickformat=".0%")
fig2.show()
fig2.write_image("./images/hypothesis_2.png")

# Acceptance Rate vs. Passenger Type
passenger_acceptance = bar_df.groupby('passanger')['Y'].mean().reset_index()
passenger_acceptance = passenger_acceptance.sort_values('Y', ascending=False)
passenger_acceptance['Acceptance_Rate_Percent'] = (passenger_acceptance['Y'] * 100).round(2)

fig3 = px.bar(
    passenger_acceptance,
    x='passanger',
    y='Y',
    title='Hypothesis Support 3: Acceptance Rate vs. Passenger Type',
    labels={'Y': 'Acceptance Rate (Proportion)', 'passanger': 'Passenger Type'},
    text=passenger_acceptance['Acceptance_Rate_Percent'].apply(lambda x: f'{x}%'),
    color_discrete_sequence=['#2ca02c']
)
fig3.update_yaxes(range=[0, 1.0], tickformat=".0%")
fig3.update_xaxes(title_text="Passenger Type")
fig3.show()
fig3.write_image("./images/hypothesis_3.png")

# Print underlying data for verification
print("--- Bar Frequency Acceptance Rates ---")
print(bar_acceptance[['Bar', 'Acceptance_Rate_Percent']])
print("\n--- Age Acceptance Rates ---")
print(age_acceptance[['age', 'Acceptance_Rate_Percent']])
print("\n--- Passenger Acceptance Rates ---")
print(passenger_acceptance[['passanger', 'Acceptance_Rate_Percent']])

--- Bar Frequency Acceptance Rates ---
     Bar  Acceptance_Rate_Percent
4  never                    18.80
3  less1                    44.39
0    1~3                    64.74
1    4~8                    78.00
2    gt8                    73.47

--- Age Acceptance Rates ---
       age  Acceptance_Rate_Percent
7  below21                    41.38
0       21                    50.72
1       26                    48.47
2       31                    37.43
3       36                    29.61
4       41                    43.86
5       46                    34.86
6   50plus                    29.54

--- Passenger Acceptance Rates ---
   passanger  Acceptance_Rate_Percent
1  Friend(s)                    56.19
0      Alone                    40.86
3    Partner                    38.52
2     Kid(s)                    20.69


### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.
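
One possible starting point is a helper that reports the acceptance rate for any coupon type broken down by any column. `acceptance_rate_by` is a hypothetical name introduced here, and the `toy` frame is made up; only the column names `coupon`, `CoffeeHouse`, and `Y` come from the dataset.

```python
import pandas as pd

def acceptance_rate_by(frame, coupon_type, column, outcome="Y"):
    """Acceptance rate and row count per category of `column` for one coupon type."""
    subset = frame[frame["coupon"] == coupon_type]
    return (
        subset.groupby(column)[outcome]
        .agg(Acceptance_Rate="mean", Total_Observations="count")
        .sort_values("Acceptance_Rate", ascending=False)
    )

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "coupon": ["Coffee House"] * 4 + ["Bar"],
    "CoffeeHouse": ["never", "1~3", "1~3", "never", "gt8"],
    "Y": [0, 1, 1, 1, 1],
})

result = acceptance_rate_by(toy, "Coffee House", "CoffeeHouse")
print(result)
```

Calling the helper once per candidate column (for example `time`, `passanger`, or `expiration`) gives a quick first pass over which attributes separate acceptors from non-acceptors.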