### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons -- less expensive restaurants (under \$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\$20 - \$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data, you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \$12500, \$12500 - \$24999, \$25000 - \$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical location of the user, destination, and the venue, and we mark the distance between each two places with time of driving. The user can see whether the venue is in the same direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [None]:
data = pd.read_csv('data/coupons.csv')

In [None]:
data.head()

2. Investigate the dataset for missing or problematic data.

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.isnull().sum()
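
Raw null counts are easier to judge as a share of the rows. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from `coupons.csv`); the same expression should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the coupon data
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Scooter"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "Y": [1, 0, 1, 0],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this listing are candidates for dropping, while columns with only a small share of missing values may be better handled row by row.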

3. Decide what to do about your missing data -- drop, replace, other...

In [None]:
# Drop the 'car' column, which is missing for nearly every row (see the null counts above)
data = data.drop(columns=['car'])
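
Dropping a column is only one option. For categorical columns with a handful of gaps, two common alternatives are filling with the most frequent value or dropping the affected rows. The sketch below illustrates both on a hypothetical toy frame; the column names mirror the dataset, but the values and the choice of columns in `freq_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two categorical columns
toy = pd.DataFrame({
    "Bar": ["never", np.nan, "never", "1~3"],
    "CoffeeHouse": ["less1", "less1", np.nan, "gt8"],
    "Y": [1, 0, 1, 0],
})

freq_cols = ["Bar", "CoffeeHouse"]  # assumed list of columns to clean

# Option 1: replace missing values with each column's most frequent category
filled = toy.copy()
for col in freq_cols:
    most_common = filled[col].mode().iloc[0]
    filled[col] = filled[col].fillna(most_common)

# Option 2: drop any row that has a gap in one of these columns
dropped = toy.dropna(subset=freq_cols)

print(filled)
print(len(toy), "rows before,", len(dropped), "rows after dropping")
```

Filling keeps every row but can overstate the most common category; dropping keeps the data honest at the cost of sample size. Which trade-off is acceptable depends on how many rows are affected.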

4. What proportion of the total observations chose to accept the coupon?



In [None]:
# Share of all observations with Y == 1, expressed as a percentage
proportion_accepted = (len(data.query('Y == 1'))/len(data))*100
proportion_accepted
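
Because `Y` is coded as 0 or 1, its mean is the acceptance proportion directly. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical responses: three acceptances out of four
toy = pd.DataFrame({"Y": [1, 0, 1, 1]})

# The mean of a 0/1 column equals the fraction of ones
acceptance_rate = toy["Y"].mean()

# value_counts(normalize=True) gives the share of each outcome
shares = toy["Y"].value_counts(normalize=True)

print(f"{acceptance_rate:.2%}")
print(shares)
```

This shortcut assumes `Y` contains only 0 and 1 with no missing values.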

5. Use a bar plot to visualize the `coupon` column.

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=data, x='coupon', order=data['coupon'].value_counts().index)
plt.title('Distribution of Coupon Types')
plt.xlabel('Coupon Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.savefig('images/coupon_distribution.png')
plt.show()

6. Use a histogram to visualize the temperature column.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='temperature', bins=5, kde=True)
plt.title('Distribution of Temperature')
plt.xlabel('Temperature (Fahrenheit)')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('images/temperature_distribution.png')
plt.show()

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [None]:
barData = data.query('coupon == "Bar"')

2. What proportion of bar coupons were accepted?


In [None]:
proportion_bar_accepted = (len(barData.query('Y == 1')) / len(barData)) * 100
print(f'{proportion_bar_accepted:.2f}% of bar coupons were accepted.')

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [None]:
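# Define groups based on bar visitation frequency.
# The 'Bar' column is encoded as 'never', 'less1', '1~3', '4~8', 'gt8';
# confirm with barData['Bar'].unique() before relying on these labels.
barData_less_than_3 = barData[barData['Bar'].isin(['never', 'less1', '1~3'])]
barData_more_than_3 = barData[barData['Bar'].isin(['4~8', 'gt8'])]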

# Calculate acceptance rates for each group
acceptance_less_than_3 = (barData_less_than_3['Y'] == 1).mean() * 100
acceptance_more_than_3 = (barData_more_than_3['Y'] == 1).mean() * 100

print(f"Acceptance rate for drivers going to a bar 3 or fewer times a month: {acceptance_less_than_3:.2f}%")
print(f"Acceptance rate for drivers going to a bar more than 3 times a month: {acceptance_more_than_3:.2f}%")

4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 and all others. Is there a difference?


In [None]:
# Define the conditions for the first group.
# Labels follow the dataset's encoding ('gt8', '50plus'); confirm with .unique().
bar_more_than_once_a_month = barData['Bar'].isin(['1~3', '4~8', 'gt8'])
age_over_25 = barData['age'].isin(['26', '31', '36', '41', '46', '50plus'])

# Group 1: Drivers who go to a bar more than once a month AND are over 25
group1 = barData[bar_more_than_once_a_month & age_over_25]

# Group 2: All other drivers
group2 = barData[~(bar_more_than_once_a_month & age_over_25)]

# Calculate acceptance rates for each group
acceptance_group1 = (group1['Y'] == 1).mean() * 100
acceptance_group2 = (group2['Y'] == 1).mean() * 100

print(f"Acceptance rate for drivers going to a bar >1/month AND over 25: {acceptance_group1:.2f}%")
print(f"Acceptance rate for all other drivers: {acceptance_group2:.2f}%")

# Check for a difference
if acceptance_group1 > acceptance_group2:
    print("There is a higher acceptance rate for drivers who go to a bar >1/month and are over 25.")
elif acceptance_group1 < acceptance_group2:
    print("There is a lower acceptance rate for drivers who go to a bar >1/month and are over 25.")
else:
    print("Both groups have the same acceptance rate.")
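
The same split-and-compare pattern recurs in the following prompts, so it can be wrapped in a small helper. This is a sketch only: `compare_acceptance` is a hypothetical name, and the toy frame and its category labels are illustrative rather than taken from the data.

```python
import pandas as pd

def compare_acceptance(df, mask, target="Y"):
    """Return acceptance rates (in percent) and sizes for rows inside and outside the mask."""
    in_group = df.loc[mask, target]
    rest = df.loc[~mask, target]
    return in_group.mean() * 100, rest.mean() * 100, len(in_group), len(rest)

# Hypothetical toy frame
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1", "4~8", "never"],
    "age": ["26", "21", "31", "50plus", "21", "below21"],
    "Y": [1, 0, 1, 0, 1, 1],
})

# Frequent bar-goers who are over 25
mask = toy["Bar"].isin(["1~3", "4~8", "gt8"]) & ~toy["age"].isin(["below21", "21"])
rate_in, rate_out, n_in, n_out = compare_acceptance(toy, mask)

print(f"In group: {rate_in:.2f}% (n={n_in}), others: {rate_out:.2f}% (n={n_out})")
```

Returning the group sizes alongside the rates makes it easier to spot comparisons that rest on very few rows.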

5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [None]:
# Define the conditions for the first group.
# Labels follow the dataset's encoding ('gt8', 'Farming Fishing & Forestry'); confirm with .unique().
bar_more_than_once_a_month = barData['Bar'].isin(['1~3', '4~8', 'gt8'])
passengers_not_kid = ~barData['passanger'].isin(['Kid(s)'])
occupations_not_farming = ~barData['occupation'].isin(['Farming Fishing & Forestry'])

# Group 1: Drivers who meet all three conditions
group1_new = barData[bar_more_than_once_a_month & passengers_not_kid & occupations_not_farming]

# Group 2: All other drivers
group2_new = barData[~(bar_more_than_once_a_month & passengers_not_kid & occupations_not_farming)]

# Calculate acceptance rates for each group
acceptance_group1_new = (group1_new['Y'] == 1).mean() * 100
acceptance_group2_new = (group2_new['Y'] == 1).mean() * 100

print(f"Acceptance rate for drivers (bar >1/month, no kids, no farming/fishing/forestry): {acceptance_group1_new:.2f}%")
print(f"Acceptance rate for all other drivers: {acceptance_group2_new:.2f}%")

# Check for a difference
if acceptance_group1_new > acceptance_group2_new:
    print("There is a higher acceptance rate for drivers who meet these specific conditions.")
elif acceptance_group1_new < acceptance_group2_new:
    print("There is a lower acceptance rate for drivers who meet these specific conditions.")
else:
    print("Both groups have the same acceptance rate.")
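
The comparisons above only check which percentage is larger. Whether a gap is statistically meaningful also depends on the group sizes. One common check is a two-proportion z-test; the sketch below implements it with the standard library only. The function name and the counts passed to it are hypothetical, and the normal approximation assumes reasonably large groups.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both groups share one rate.
    # Note: this is zero (and the division below fails) if pooled is exactly 0 or 1.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical counts: 140 of 200 accepted in group A, 300 of 800 in group B
z, p = two_proportion_z_test(140, 200, 300, 800)
print(f"z = {z:.2f}, p = {p:.4f}")

# Identical rates give z = 0 and p = 1
z0, p0 = two_proportion_z_test(50, 100, 50, 100)
```

In the notebook, the inputs would be the number of rows with `Y == 1` and the total number of rows in each group.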

6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [None]:
# Condition 1: go to bars more than once a month, had passengers that were not a kid, and were not widowed
# Labels follow the dataset's encoding ('gt8', 'below21'); confirm with .unique().
condition1 = (
    barData['Bar'].isin(['1~3', '4~8', 'gt8']) &
    ~barData['passanger'].isin(['Kid(s)']) &
    (barData['maritalStatus'] != 'Widowed')
)

# Condition 2: go to bars more than once a month and are under the age of 30
condition2 = (
    barData['Bar'].isin(['1~3', '4~8', 'gt8']) &
    barData['age'].isin(['below21', '21', '26'])
)

# Condition 3: go to cheap restaurants more than 4 times a month and income is less than 50K.
condition3 = (
    barData['RestaurantLessThan20'].isin(['4~8', 'gt8']) &
    barData['income'].isin(['Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999'])
)

# Combine conditions using OR to define the target group
target_group = barData[condition1 | condition2 | condition3]

# Calculate acceptance rate for the target group
acceptance_target_group = (target_group['Y'] == 1).mean() * 100

# Calculate acceptance rate for all other drivers (complement of the target group)
other_drivers = barData[~(condition1 | condition2 | condition3)]
acceptance_other_drivers = (other_drivers['Y'] == 1).mean() * 100

print(f"Acceptance rate for drivers meeting complex criteria: {acceptance_target_group:.2f}%")
print(f"Acceptance rate for all other drivers: {acceptance_other_drivers:.2f}%")

# Check for a difference
if acceptance_target_group > acceptance_other_drivers:
    print("There is a higher acceptance rate for drivers meeting the complex criteria.")
elif acceptance_target_group < acceptance_other_drivers:
    print("There is a lower acceptance rate for drivers meeting the complex criteria.")
else:
    print("Both groups have the same acceptance rate.")

7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

**Based on the observations:**

* Regular bar-goers, especially those who visit bars more than once a month, appear considerably more likely to accept bar coupons.
* The trend is further amplified when combined with factors like being over the age of 25, having adult passengers, and having occupations outside of farming, fishing, or forestry.
* The conditions also suggest that a combination of frequent bar visits, fewer family responsibilities, and specific income/restaurant habits correlates with higher acceptance.

**Therefore, a hypothesis could be:**

Drivers who exhibit a lifestyle indicative of **regular social drinking**, characterized by **frequent bar visits**, **older age** (e.g., over 25), and **fewer immediate family obligations** (e.g., no kids as passengers, not widowed), are considerably **more likely to accept bar coupons**.


**Independent Investigation**

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

8. **Independent Investigation for Coffee House Coupons**

Create a new DataFrame containing only the Coffee House coupons and inspect the data.

In [None]:

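coffeeHouseData = data.query('coupon == "Coffee House"')

# A notebook cell displays only its last expression, so print each result explicitly
print(coffeeHouseData.head())
print(coffeeHouseData.shape)
print(coffeeHouseData.isnull().sum())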

9. Compute the acceptance rate for Coffee House Coupons

In [None]:
coffeeHouseTotal = len(coffeeHouseData)
print("Total Coffee House Coupons : ", coffeeHouseTotal)
totalAccepted = len(coffeeHouseData.query('Y == 1'))
print("Total Coffee House Coupons Accepted : ", totalAccepted)

print("Coffee House Coupons Acceptance % :", round((totalAccepted/coffeeHouseTotal)*100,2) ,"%")

overallCouponsAccepted = len(data.query('Y==1'))
print("Overall Coupons Accepted : ", overallCouponsAccepted)
overallCouponsAcceptedProp = round((overallCouponsAccepted/len(data))*100,2)
print("Overall Coupons Acceptance % :", overallCouponsAcceptedProp ,"%")

# Share of all accepted coupons (of any type) that were Coffee House coupons
print("Coffee House share of all accepted coupons % :", round((totalAccepted/overallCouponsAccepted)*100,2) ,"%")



10. Compute the Coffee House coupon acceptance rate by age

In [None]:
coffee_acceptance_by_age = coffeeHouseData.groupby('age')['Y'].mean().sort_values(ascending=False)
print("Coffee House Coupon Acceptance Rate by Age:")
print(coffee_acceptance_by_age * 100)
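
A rate on its own hides how many responses it is based on. The sketch below reports the group size next to each rate using named aggregation; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: age group and coupon outcome
toy = pd.DataFrame({
    "age": ["21", "21", "26", "26", "26", "50plus"],
    "Y": [1, 0, 1, 1, 0, 1],
})

# Mean of Y is the acceptance rate; count is the number of responses per group
summary = (
    toy.groupby("age")["Y"]
    .agg(acceptance_rate="mean", n="count")
    .assign(acceptance_pct=lambda d: d["acceptance_rate"] * 100)
    .sort_values("acceptance_pct", ascending=False)
)
print(summary)
```

In this toy example the top-ranked group has a single response, which is exactly the kind of case where the ranking should be read with caution.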

11. Show the above data in a bar plot

In [None]:
plt.figure(figsize=(12, 7))
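# Assign the x variable to hue so the palette applies without a deprecation warning
sns.barplot(x=coffee_acceptance_by_age.index, y=coffee_acceptance_by_age.values * 100, hue=coffee_acceptance_by_age.index, palette='viridis', legend=False)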
plt.title('Coffee House Coupon Acceptance Rate by Age')
plt.xlabel('Age Group')
plt.ylabel('Acceptance Rate (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
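
Sorting by acceptance rate scrambles the natural order of the age groups. To read the chart as a trend across ages, the series can be reindexed into age order before plotting. A minimal sketch follows; the labels in `age_order` are assumed to match the dataset's encoding and the rates are made up.

```python
import pandas as pd

# Assumed age labels in natural order; verify against the 'age' column's unique values
age_order = ["below21", "21", "26", "31", "36", "41", "46", "50plus"]

# Hypothetical acceptance rates, sorted by value rather than by age
rates = pd.Series({"below21": 70.0, "21": 52.0, "31": 48.0, "50plus": 42.0})
rates = rates.sort_values(ascending=False)

# Keep only the labels that are present, in natural age order
ordered = rates.reindex([a for a in age_order if a in rates.index])
print(ordered)
```

The reordered series can be passed to `sns.barplot` as before; alternatively, `sns.barplot` accepts an `order` argument that takes the same list of labels.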



12. Compute the Coffee House coupon acceptance rate by income and plot it



In [None]:
coffee_acceptance_by_income = coffeeHouseData.groupby('income')['Y'].mean().sort_values(ascending=False)
print("Coffee House Coupon Acceptance Rate by Income:")
print(coffee_acceptance_by_income * 100)

plt.figure(figsize=(12, 7))
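# Assign the x variable to hue so the palette applies without a deprecation warning
sns.barplot(x=coffee_acceptance_by_income.index, y=coffee_acceptance_by_income.values * 100, hue=coffee_acceptance_by_income.index, palette='viridis', legend=False)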
plt.title('Coffee House Coupon Acceptance Rate by Income')
plt.xlabel('Income Group')
plt.ylabel('Acceptance Rate (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

13. Compute Coffee House Coupon acceptance by Weather Conditions



In [None]:
coffee_acceptance_by_weather = coffeeHouseData.groupby('weather')['Y'].mean().sort_values(ascending=False)
print("Coffee House Coupon Acceptance Rate by Weather Condition:")
print(coffee_acceptance_by_weather * 100)

plt.figure(figsize=(8, 6))
sns.barplot(x=coffee_acceptance_by_weather.index, y=coffee_acceptance_by_weather.values * 100, hue=coffee_acceptance_by_weather.index, palette='coolwarm', legend=False)
plt.title('Coffee House Coupon Acceptance Rate by Weather')
plt.xlabel('Weather Condition')
plt.ylabel('Acceptance Rate (%)')
plt.tight_layout()
plt.show()

14. Compute Coffee House Coupon acceptance by time of day and plot it

In [None]:
coffee_acceptance_by_time = coffeeHouseData.groupby('time')['Y'].mean().sort_values(ascending=False)
print("Coffee House Coupon Acceptance Rate by Time of Day:")
print(coffee_acceptance_by_time * 100)

plt.figure(figsize=(10, 6))
sns.barplot(x=coffee_acceptance_by_time.index, y=coffee_acceptance_by_time.values * 100, hue=coffee_acceptance_by_time.index, palette='viridis', legend=False)
plt.title('Coffee House Coupon Acceptance Rate by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Acceptance Rate (%)')
plt.tight_layout()
plt.show()

15. **More EDA (Exploratory Data Analysis) and visualizations** for self-learning

In [None]:

data.describe()



16. Select only the numerical features:

In [None]:
df_numeric = data.select_dtypes(include=['float64', 'int64'])
df_numeric.describe()

Plot a correlation heatmap of the numerical features:

In [None]:


# Heatmap on numerical features
plt.figure(figsize=(5, 4))
sns.heatmap(df_numeric.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()

17. Histograms of Numeric features:

In [None]:
df_numeric.hist(bins=20, figsize=(10, 8))
plt.show()

18. Create a pairplot of all numerical variables, colored by the 'Y' (coupon accepted) variable.

In [None]:
sns.pairplot(data, hue='Y', palette='viridis')
plt.suptitle('Pairplot of Numerical Variables', y=1.02)  # lift the title above the top row of plots
plt.show()

19. Next step: convert the key categorical features to numeric features to enable a more granular analysis.
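
One way to approach this conversion is ordinal encoding: map each ordered category to an integer rank so it can enter correlations and histograms. The sketch below is illustrative only. The label lists are assumed to match the dataset's encoding, the ranks carry order but not true distances, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Assumed category labels in increasing order; verify with .unique() on the real data
freq_order = ["never", "less1", "1~3", "4~8", "gt8"]
age_order = ["below21", "21", "26", "31", "36", "41", "46", "50plus"]

# Map each label to its rank (0 = lowest)
freq_map = {label: rank for rank, label in enumerate(freq_order)}
age_map = {label: rank for rank, label in enumerate(age_order)}

# Hypothetical toy frame, including one missing value
toy = pd.DataFrame({
    "Bar": ["never", "gt8", "1~3", np.nan],
    "age": ["21", "50plus", "below21", "36"],
    "Y": [0, 1, 1, 0],
})

# Series.map leaves missing or unmapped values as NaN
encoded = toy.assign(
    Bar_num=toy["Bar"].map(freq_map),
    age_num=toy["age"].map(age_map),
)
print(encoded)

# The encoded columns can now be included in a correlation matrix
print(encoded[["Bar_num", "age_num", "Y"]].corr())
```

Because the ranks are ordinal, a rank-based measure such as `corr(method="spearman")` may be a more defensible choice than the default Pearson correlation.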