### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning Repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks each respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled ‘Y = 0’. There are five types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bars, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
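import plotly.express as px  # plotly.express ships with plotly itself; the standalone plotly_express package is a legacy wrapper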

In [2]:
pd.set_option('display.max_columns', None)

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [3]:
data = pd.read_csv('data/coupons.csv')

In [4]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,car,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

One immediate finding is that 99.1% of the `car` column is `NaN` (12576 of 12684 rows), so we can safely drop it. Some entries also lack data on how often the respondent goes to a bar, a coffee house, or a restaurant (both <$20 and $20-$50), or orders takeout. In addition, the passenger column is misspelled as `passanger`; I decided to leave the name unchanged for compatibility.
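The share of missing values per column can be computed directly rather than read off the raw counts. The block below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not rows from `coupons.csv`); the same expression should work on `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; not taken from coupons.csv.
toy = pd.DataFrame({
    "car": [None, None, None, "Mazda5"],
    "Bar": ["never", None, "1~3", "less1"],
    "Y": [1, 0, 1, 0],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True,
# so this is the percentage of missing values per column, largest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))
```

On the real data, the `car` entry should come out at 12576 / 12684, which is about 99.1%.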

In [49]:
data["age"].unique()

array(['21', '46', '26', '31', '41', '50plus', '36', 'below21'],
      dtype=object)

In [None]:
px.histogram(data["car"])

In [47]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  Bar                   12684 non-null  object
 15  CoffeeHouse           12684 non-null

In [3]:
data.isna().sum()

destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64

3. Decide what to do about your missing data -- drop, replace, other...

Since the `car` column is 99.1% `NaN` and no explanation is given for it, I think the best option is to drop it entirely, especially given that the remaining values are spread almost evenly (21-22 rows each) across the 5 non-`NaN` categories. All the other columns with missing data are described as "number of times user [does something]", so I think it is okay to replace their `NaN` values with `"never"`, the category for zero visits (and then to re-check all operations against a version of the data where such rows are dropped instead, to make sure the choice has no impact). As shown below, no data is missing after these two steps.
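The check described above (filling the frequency columns versus dropping the affected rows) could be sketched as follows. The helper name `fill_vs_drop` and the toy frame are assumptions made for illustration; only the column names and the `"never"` label come from this notebook.

```python
import pandas as pd

# Frequency columns that contain NaN in this notebook.
FREQ_COLS = ["Bar", "CoffeeHouse", "CarryAway",
             "RestaurantLessThan20", "Restaurant20To50"]

def fill_vs_drop(df, freq_cols=FREQ_COLS):
    """Overall acceptance rate when NaNs are filled with 'never' vs. dropped.

    Hypothetical helper: it only compares the mean of Y under both choices.
    """
    cols = [c for c in freq_cols if c in df.columns]
    filled = df.copy()
    filled[cols] = filled[cols].fillna("never")   # touch only the frequency columns
    dropped = df.dropna(subset=cols)              # discard rows with any gap instead
    return {"filled": filled["Y"].mean(), "dropped": dropped["Y"].mean()}

# Made-up rows, not taken from coupons.csv.
toy = pd.DataFrame({
    "Bar": ["never", None, "1~3", "gt8"],
    "CoffeeHouse": ["less1", "1~3", None, "never"],
    "Y": [1, 0, 1, 1],
})
rates = fill_vs_drop(toy)
print(rates)
```

Restricting `fillna` to the listed columns also avoids silently writing `"never"` into any other column that might contain `NaN`.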

In [5]:
data = data.drop(columns="car")

In [6]:
data = data.fillna("never")

In [34]:
data.isna().sum()

destination             0
passanger               0
weather                 0
temperature             0
time                    0
coupon                  0
expiration              0
gender                  0
age                     0
maritalStatus           0
has_children            0
education               0
occupation              0
income                  0
Bar                     0
CoffeeHouse             0
CarryAway               0
RestaurantLessThan20    0
Restaurant20To50        0
toCoupon_GEQ5min        0
toCoupon_GEQ15min       0
toCoupon_GEQ25min       0
direction_same          0
direction_opp           0
Y                       0
dtype: int64

4. What proportion of the total observations chose to accept the coupon? 



In [75]:
def acceptance(df):
    """Return the share of rows in df where the coupon was accepted (Y == 1)."""
    # Y only takes the values 0 and 1, so its mean is the acceptance rate.
    return df["Y"].mean()

In [76]:
acceptance(data)

0.5684326710816777

5. Use a bar plot to visualize the `coupon` column.

In [9]:
coupon_plot_helper = data["coupon"].value_counts()
coupon_plot_helper

Coffee House             3996
Restaurant(<20)          2786
Carry out & Take away    2393
Bar                      2017
Restaurant(20-50)        1492
Name: coupon, dtype: int64

In [15]:
px.bar(data_frame=coupon_plot_helper)

6. Use a histogram to visualize the temperature column.

In [21]:
px.histogram(data["temperature"], nbins=20)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar-related coupons.

1. Create a new `DataFrame` that contains just the bar coupons.


In [43]:
df_bar_coupon = data[ data["coupon"] == "Bar" ]

2. What proportion of bar coupons were accepted?


In [77]:
acceptance(df_bar_coupon)

0.41001487357461575

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [78]:
df_bar3less = df_bar_coupon.query("Bar in ('never', 'less1', '1~3')")
df_bar4more = df_bar_coupon.query("Bar in ('4~8', 'gt8')")
print("Acceptance of rare visitors: " + str(acceptance(df_bar3less)))
print("Acceptance of frequent visitors: " + str(acceptance(df_bar4more)))

Acceptance of rare visitors: 0.37073707370737075
Acceptance of frequent visitors: 0.7688442211055276
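The "3 or fewer" versus "more" split can also be expressed as an ordered comparison instead of listing the labels by hand. This is a sketch on made-up rows; it assumes the five frequency labels used in the queries above are the only ones present, ordered from least to most frequent.

```python
import pandas as pd

# Assumed ordering of the frequency labels, least to most frequent.
BAR_LEVELS = ["never", "less1", "1~3", "4~8", "gt8"]

# Made-up rows, not taken from coupons.csv.
toy = pd.DataFrame({
    "Bar": ["never", "1~3", "4~8", "gt8", "less1", "4~8"],
    "Y":   [0, 1, 1, 1, 0, 0],
})

# An ordered categorical supports <, <=, > and >= against one of its labels.
# Any value outside BAR_LEVELS would become NaN and compare as False.
bar_freq = toy["Bar"].astype(pd.CategoricalDtype(BAR_LEVELS, ordered=True))
rare = bar_freq <= "1~3"

labels = rare.map({True: "3 or fewer", False: "4 or more"})
rates = toy.groupby(labels)["Y"].mean()
print(rates)
```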


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others. Is there a difference?


In [79]:
df_group = df_bar_coupon.query("Bar not in ('never', 'less1') and age not in ('below21', '21')")
df_antigroup = df_bar_coupon.query("Bar in ('never', 'less1') or age in ('below21', '21')")

print("Acceptance of (>1 a month and age>25): " + str(acceptance(df_group)))
print("Acceptance of everyone else: " + str(acceptance(df_antigroup)))

Acceptance of (>1 a month and age>25): 0.6952380952380952
Acceptance of everyone else: 0.33500313087038197
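The "everyone else" group above is written as a hand-negated query. One alternative is to build a single boolean mask and negate it with `~`, which guarantees that the two groups are complementary. The sketch below uses made-up rows; the variable names are illustrative only.

```python
import pandas as pd

# Made-up rows, not taken from coupons.csv.
toy = pd.DataFrame({
    "Bar": ["never", "1~3", "4~8", "less1", "gt8"],
    "age": ["21", "26", "below21", "31", "50plus"],
    "Y":   [0, 1, 1, 0, 1],
})

# Goes to a bar more than once a month AND is older than 25.
in_group = (~toy["Bar"].isin(["never", "less1"])
            & ~toy["age"].isin(["below21", "21"]))

group = toy[in_group]
everyone_else = toy[~in_group]   # exact complement; no manual De Morgan rewrite

print(group["Y"].mean(), everyone_else["Y"].mean())
```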


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [80]:
df_group = df_bar_coupon.query("Bar not in ('never', 'less1') and passanger != 'Kid(s)' and occupation != 'Farming Fishing & Forestry'")
df_antigroup = df_bar_coupon.query("Bar in ('never', 'less1') or passanger == 'Kid(s)' or occupation == 'Farming Fishing & Forestry'")

print("Acceptance of those drivers: " + str(acceptance(df_group)))
print("Acceptance of everyone else: " + str(acceptance(df_antigroup)))

Acceptance of those drivers: 0.7132486388384754
Acceptance of everyone else: 0.296043656207367


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 



In [70]:
group1 = df_bar_coupon.query("Bar not in ('never', 'less1') and passanger != 'Kid(s)' and maritalStatus != 'Widowed'")

In [71]:
group2 = df_bar_coupon.query("Bar not in ('never', 'less1') and age in ('below21', '21', '26')")

In [72]:
group3 = df_bar_coupon.query("RestaurantLessThan20 in ('4~8', 'gt8') and income in ('Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999')")

In [81]:
print("Acceptance of drivers who go to bars more than once a month, had passengers that were not a kid, and were not widowed: " + str(acceptance(group1)))
print("Acceptance of drivers who go to bars more than once a month and are under the age of 30: " + str(acceptance(group2)))
print("Acceptance of drivers who go to cheap restaurants more than 4 times a month and income is less than 50K: " + str(acceptance(group3)))

Acceptance of drivers who go to bars more than once a month, had passengers that were not a kid, and were not widowed: 0.7132486388384754
Acceptance of drivers who go to bars more than once a month and are under the age of 30: 0.7217391304347827

Note: the third value is not shown because the recorded run printed `acceptance(group2)` a second time (0.7217391304347827) instead of `acceptance(group3)`.
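Because the three conditions in this question are joined by *OR*, they can also be combined into one mask and compared against everyone who meets none of them. This is a sketch on made-up rows; `LOW_INCOME` and the mask names are illustrative, and the income labels are the ones used in the query above.

```python
import pandas as pd

# Income brackets below 50K, as spelled in the notebook's queries.
LOW_INCOME = ["Less than $12500", "$12500 - $24999",
              "$25000 - $37499", "$37500 - $49999"]

# Made-up rows, not taken from coupons.csv.
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1"],
    "passanger": ["Alone", "Kid(s)", "Kid(s)", "Friend(s)"],
    "maritalStatus": ["Single", "Married partner", "Widowed", "Single"],
    "age": ["31", "26", "21", "46"],
    "RestaurantLessThan20": ["1~3", "never", "less1", "4~8"],
    "income": ["$50000 - $62499", "$100000 or More",
               "$62500 - $74999", "$12500 - $24999"],
    "Y": [1, 0, 1, 0],
})

goes_to_bars = ~toy["Bar"].isin(["never", "less1"])
cond1 = goes_to_bars & (toy["passanger"] != "Kid(s)") & (toy["maritalStatus"] != "Widowed")
cond2 = goes_to_bars & toy["age"].isin(["below21", "21", "26"])
cond3 = toy["RestaurantLessThan20"].isin(["4~8", "gt8"]) & toy["income"].isin(LOW_INCOME)

any_cond = cond1 | cond2 | cond3   # a driver matching at least one condition
print(toy[any_cond]["Y"].mean(), toy[~any_cond]["Y"].mean())
```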


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

They are regulars (go to a bar at least once a month, often more), have some disposable income, and sometimes (but not always) have some adult friends with them.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.

Let's take a look at the takeaway coupons.

In [74]:
df_take_coupon = data.query("coupon == 'Carry out & Take away'")

In [82]:
acceptance(df_take_coupon)

0.7354784788967823

In [85]:
# Acceptance over regulars (take away <1, >1, and >3 times a month)
group1 = df_take_coupon.query("CarryAway in ('never', 'less1')")
group2 = df_take_coupon.query("CarryAway not in ('never', 'less1')")
group3 = df_take_coupon.query("CarryAway in ('4~8', 'gt8')")

print(acceptance(group1))
print(acceptance(group2))
print(acceptance(group3))

0.6953316953316954
0.743705941591138
0.749554367201426


In [87]:
acceptance(df_take_coupon.query("CarryAway == 'never'"))

0.7962962962962963

In [91]:
# Short vs long term coupons
group1 = df_take_coupon.query("expiration == '2h'")
group2 = df_take_coupon.query("expiration == '1d'")

print(acceptance(group1))
print(acceptance(group2))

0.663820704375667
0.7815934065934066


In [92]:
# Income
group1 = df_take_coupon.query("income in ('Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999')")
group2 = df_take_coupon.query("income in ('$50000 - $62499', '$62500 - $74999', '$75000 - $87499', '$87500 - $99999', '$100000 or More')")

print(acceptance(group1))
print(acceptance(group2))

0.7482408131352619
0.7208258527827648


In [96]:
# Children?

group1 = df_take_coupon.query("has_children  == 0")
group2 = df_take_coupon.query("has_children == 1")

print(acceptance(group1))
print(acceptance(group2))

0.7322604242867593
0.7397660818713451


In [97]:
# Family?

group1 = df_take_coupon.query("maritalStatus == 'Single'")
group2 = df_take_coupon.query("maritalStatus == 'Married partner'")
group3 = df_take_coupon.query("maritalStatus == 'Unmarried partner'")
group4 = df_take_coupon.query("maritalStatus == 'Divorced'")
group5 = df_take_coupon.query("maritalStatus == 'Widowed'")

print(acceptance(group1))
print(acceptance(group2))
print(acceptance(group3))
print(acceptance(group4))
print(acceptance(group5))

0.7467672413793104
0.7317073170731707
0.7139175257731959
0.7222222222222222
0.8461538461538461


In [103]:
# Age?

group1 = df_take_coupon.query("age in ('below21', '21', '26')")
group2 = df_take_coupon.query("age in ('31', '36')")
group3 = df_take_coupon.query("age in ('41', '46', '50plus')")

print(acceptance(group1))
print(acceptance(group2))
print(acceptance(group3))

0.7355140186915888
0.7216174183514774
0.7485294117647059


Overall, takeout coupons are popular across the surveyed population, with acceptance holding steady in the 70-75% range for most of my queries. The somewhat notable trends are:

- a surprising preference among people who say they "never" order takeout (about 80% acceptance; note that this group also includes the rows whose missing `CarryAway` value was filled with "never");
- a clear preference for 1-day coupons (78.2%) over 2-hour ones (66.4%), which were still fairly popular;
- higher acceptance among widowed drivers (84.6%, roughly 10 to 13 points above the other marital-status groups), with single drivers next at 74.7%;
- a small edge for drivers with income under 50K (74.8% versus 72.1%, about 2.7 points);
- almost no difference by presence of children (73.2% without versus 74.0% with).

However, as noted above, takeout coupons are popular across the entire surveyed population, so apart from these trends it is hard to find differences larger than a couple of percentage points.
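The repeated query-and-print cells in this section could be condensed with a `groupby` that reports the acceptance rate together with the group size, which makes it easier to see when a striking rate comes from very few rows. The helper name `acceptance_by` and the toy rows below are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def acceptance_by(df, column):
    """Acceptance rate and row count for each value of `column` (hypothetical helper)."""
    return (
        df.groupby(column)["Y"]
          .agg(rate="mean", n="size")          # mean of 0/1 labels = acceptance rate
          .sort_values("rate", ascending=False)
    )

# Made-up rows, not taken from coupons.csv.
toy = pd.DataFrame({
    "expiration": ["2h", "1d", "1d", "2h", "1d"],
    "maritalStatus": ["Single", "Widowed", "Single", "Single", "Married partner"],
    "Y": [0, 1, 1, 1, 0],
})

summary = acceptance_by(toy, "expiration")
print(summary)
```

Applied to `df_take_coupon`, a loop such as `for col in ["expiration", "maritalStatus", "has_children"]: print(acceptance_by(df_take_coupon, col))` would cover several of the comparisons above in one cell.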