### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor on whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’.  There are five different types of coupons -- less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50). 

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons.  To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece. 





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 - \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical location of the user, destination, and the venue, and we mark the distance between each two places with time of driving. The user can see whether the venue is in the same direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [199]:
data = pd.read_csv('data/coupons.csv')

In [200]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


In [283]:
data.info()
data['coupon'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   12684 non-null  object
 15  Bar                   12684 non-null

Coffee House             3996
Restaurant(<20)          2786
Carry out & Take away    2393
Bar                      2017
Restaurant(20-50)        1492
Name: coupon, dtype: int64

2. Investigate the dataset for missing or problematic data.

In [202]:
## Check each column for NaN or Null
for i in data.columns:
    print(data[[i]].isna().value_counts())
    print("\n")




destination
False          12684
dtype: int64


passanger
False        12684
dtype: int64


weather
False      12684
dtype: int64


temperature
False          12684
dtype: int64


time 
False    12684
dtype: int64


coupon
False     12684
dtype: int64


expiration
False         12684
dtype: int64


gender
False     12684
dtype: int64


age  
False    12684
dtype: int64


maritalStatus
False            12684
dtype: int64


has_children
False           12684
dtype: int64


education
False        12684
dtype: int64


occupation
False         12684
dtype: int64


income
False     12684
dtype: int64


car  
True     12576
False      108
dtype: int64


Bar  
False    12577
True       107
dtype: int64


CoffeeHouse
False          12467
True             217
dtype: int64


CarryAway
False        12533
True           151
dtype: int64


RestaurantLessThan20
False                   12554
True                      130
dtype: int64


Restaurant20To50
False               12495
True                  1

In [203]:
## Looks like the following have null values. Let's check their values.
# car
# Bar
# CoffeeHouse
# CarryAway
# RestaurantLessThan20
# Restaurant20To50
columns = [
    'car',
    'Bar',
    'CoffeeHouse',
    'CarryAway',
    'RestaurantLessThan20',
    'Restaurant20To50'
]

for i in columns:
    print(data[[i]].value_counts())
    print("\n")

car                                     
Mazda5                                      22
Scooter and motorcycle                      22
do not drive                                22
Car that is too old to install Onstar :D    21
crossover                                   21
dtype: int64


Bar  
never    5197
less1    3482
1~3      2473
4~8      1076
gt8       349
dtype: int64


CoffeeHouse
less1          3385
1~3            3225
never          2962
4~8            1784
gt8            1111
dtype: int64


CarryAway
1~3          4672
4~8          4258
less1        1856
gt8          1594
never         153
dtype: int64


RestaurantLessThan20
1~3                     5376
4~8                     3580
less1                   2093
gt8                     1285
never                    220
dtype: int64


Restaurant20To50
less1               6077
1~3                 3290
never               2136
4~8                  728
gt8                  264
dtype: int64




3. Decide what to do about your missing data -- drop, replace, other...

In [204]:
## I think replacing all with "unknown" will allow us to process these values without skewing data.
data = data.fillna(value="unknown")
data

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12679,Home,Partner,Rainy,55,6PM,Carry out & Take away,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,0,0,1,0,1
12680,Work,Alone,Rainy,55,7AM,Carry out & Take away,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,0,0,0,1,1
12681,Work,Alone,Snowy,30,7AM,Coffee House,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,0,0,1,0,0
12682,Work,Alone,Snowy,30,7AM,Bar,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,1,1,0,1,0
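
Since `car` is missing in all but 108 rows, one alternative to filling everything is to drop columns that are mostly empty and label only the remaining gaps. The sketch below illustrates that idea on made-up rows; the function name `clean_missing` and the threshold parameter are assumptions for illustration, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern in the real data: `car` is mostly empty.
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Mazda5"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "temperature": [55, 80, 80, 30],
})

def clean_missing(df, max_missing_share=0.9, placeholder="unknown"):
    """Drop columns that are mostly empty, then label the remaining gaps."""
    share = df.isna().mean()                      # fraction missing per column
    keep = share[share <= max_missing_share].index
    out = df[keep].copy()
    # Only columns that still contain gaps are touched; in the coupon data
    # these are all text columns, so a string placeholder is safe there.
    gap_cols = out.columns[out.isna().any()]
    out[gap_cols] = out[gap_cols].fillna(placeholder)
    return out

cleaned = clean_missing(toy, max_missing_share=0.5)
print(cleaned)
```

Whether to drop or keep a sparse column is a judgment call; the threshold simply makes that choice explicit.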


4. What proportion of the total observations chose to accept the coupon? 



In [205]:
## The mean of a 0/1 column is the share of rows equal to 1.
data['Y'].mean()

## Approximately 57% of observations accepted the coupon.

0.5684326710816777
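
For a 0/1 outcome column, the acceptance proportion can be read off either as the column mean or from normalized value counts. A small sketch on made-up values (the frame `toy` is purely illustrative):

```python
import pandas as pd

# Made-up outcome column: three acceptances out of five.
toy = pd.DataFrame({"Y": [1, 0, 1, 1, 0]})

accept_rate = toy["Y"].mean()                     # share of rows with Y == 1
shares = toy["Y"].value_counts(normalize=True)    # share of each outcome

print(accept_rate)
print(shares)
```

The normalized counts have the advantage of showing both outcomes at once, indexed by the label rather than by position.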

5. Use a bar plot to visualize the `coupon` column.

In [206]:
import plotly.express as px
fig1 = px.bar(data.sort_values('temperature'), x='coupon', color='time')
fig1.update_traces(dict(marker_line_width=0))
fig1.show()

fig2 = px.bar(data.sort_values('temperature'), x='coupon', color='temperature')
fig2.update_traces(dict(marker_line_width=0))
fig2.show()
## let's check value counts for time 
data[['time']].value_counts()

time
6PM     3230
7AM     3164
10AM    2275
2PM     2009
10PM    2006
dtype: int64
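
Passing every row to `px.bar` draws one bar segment per observation. A possible alternative, sketched here on made-up rows, is to aggregate the counts first and plot the small summary table. The helper `plot_coupon_counts` is a hypothetical name, and it imports plotly only when called.

```python
import pandas as pd

# Made-up rows standing in for the coupon data.
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Coffee House", "Coffee House", "Coffee House"],
    "time": ["6PM", "10PM", "7AM", "7AM", "10AM"],
})

# One row per (coupon, time) pair with its number of observations.
counts = (
    toy.groupby(["coupon", "time"])
       .size()
       .reset_index(name="n")
)
print(counts)

def plot_coupon_counts(counts_df):
    """Stacked bar chart of pre-aggregated counts (requires plotly)."""
    import plotly.express as px
    return px.bar(counts_df, x="coupon", y="n", color="time")
```

With the real data, `plot_coupon_counts(counts).show()` would render a chart equivalent to the stacked one, built from a few dozen rows instead of thousands.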

6. Use a histogram to visualize the temperature column.

In [207]:
px.histogram(data, x='temperature', color='coupon')

In [208]:
px.histogram(data, x='temperature', color='time')

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [209]:
selector = (data['coupon'] == 'Bar')
df2 = data[selector]
df2

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
9,No Urgent Place,Kid(s),Sunny,80,10AM,Bar,1d,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,1,0,0,1,0
13,Home,Alone,Sunny,55,6PM,Bar,1d,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,0,0,1,0,1
17,Work,Alone,Sunny,55,7AM,Bar,1d,Female,21,Unmarried partner,...,never,unknown,4~8,1~3,1,1,1,0,1,0
24,No Urgent Place,Friend(s),Sunny,80,10AM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,1
35,Home,Alone,Sunny,55,6PM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12663,No Urgent Place,Friend(s),Sunny,80,10PM,Bar,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,1,0,0,1,0
12664,No Urgent Place,Friend(s),Sunny,55,10PM,Bar,2h,Male,26,Single,...,never,1~3,4~8,1~3,1,1,0,0,1,0
12667,No Urgent Place,Alone,Rainy,55,10AM,Bar,1d,Male,26,Single,...,never,1~3,4~8,1~3,1,1,0,0,1,0
12670,No Urgent Place,Partner,Rainy,55,6PM,Bar,2h,Male,26,Single,...,never,1~3,4~8,1~3,1,1,0,0,1,0


In [210]:
## Checking to verify
df2[['Bar']].value_counts()

Bar    
never      830
less1      570
1~3        397
4~8        150
gt8         49
unknown     21
dtype: int64

2. What proportion of bar coupons were accepted?


In [211]:
df2[['Y']].mean()
## The mean of a 0/1 column is the share of 1s, i.e. the acceptance rate.

# ~41%

Y    0.410015
dtype: float64

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [212]:
## Note: the first group covers only the 'less1' and '1~3' answers; rows with 'never' or 'unknown' fall in neither group.
acceptance_lessThan3 = df2[df2['Bar'].isin(['less1', '1~3'])]['Y'].mean()
acceptance_greaterThan4 = df2[df2['Bar'].isin(['4~8', 'gt8'])]['Y'].mean()
print(acceptance_lessThan3)
print(acceptance_greaterThan4)

0.5274043433298863
0.7688442211055276
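
A comparison like this can also be written as a mapping from answers to buckets, which makes it explicit where each answer lands and reports the group sizes alongside the rates. This is a sketch on made-up rows; the bucket labels and the choice to count `never` as "3 or fewer" are assumptions for illustration.

```python
import pandas as pd

# Made-up bar-coupon rows.
toy = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "4~8", "gt8", "unknown", "1~3", "4~8"],
    "Y":   [0,       0,       1,     1,     1,     0,         0,     1],
})

# Hypothetical mapping; 'unknown' is deliberately left unmapped.
bucket = {
    "never": "3 or fewer", "less1": "3 or fewer", "1~3": "3 or fewer",
    "4~8": "more than 3", "gt8": "more than 3",
}

rates = (
    toy.assign(bar_bucket=toy["Bar"].map(bucket))
       .dropna(subset=["bar_bucket"])       # drops the unmapped 'unknown' rows
       .groupby("bar_bucket")["Y"]
       .agg(rate="mean", n="size")          # acceptance rate and group size
)
print(rates)
```

Reporting `n` next to each rate helps judge how much weight a difference between groups can bear.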


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [234]:
## `age` is stored as text brackets ('21', '26', ...), so it is matched as strings; 'below21' is the assumed label for the youngest bracket.
bar_more_than_once = df2['Bar'].isin(['1~3', '4~8', 'gt8'])
over_25 = ~df2['age'].isin(['below21', '21'])

condition_a = df2[bar_more_than_once & over_25]
compare_a = condition_a['Y'].mean()

condition_b = df2.drop(condition_a.index)

compare_b = condition_b['Y'].mean()
print(compare_a)
print(compare_b)

## compare_a is the acceptance rate for drivers over 25 who go to a bar more than once a month; compare_b covers all other drivers.




5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [262]:
condition_bar_more_than_once = (df2['Bar'] != 'less1') & (df2['Bar'] != 'never') & (df2['Bar'] != 'unknown') 
condition_no_kid_passenger = (df2['passanger'] != 'Alone') & (df2['passanger'] != 'Kid(s)')
condition_a = df2[ condition_bar_more_than_once
                  & condition_no_kid_passenger
                  & (df2['occupation'] != 'Farming Fishing & Forestry ')]
compare_a = condition_a['Y'].mean()

condition_b = df2.drop(condition_a.index)


compare_b = condition_b['Y'].mean()
## Verify
print("\n")
print(condition_a[['Bar']].value_counts())
print("\n")
print(condition_a[['passanger']].value_counts())


## compare
print("\n")
print(compare_a)
print(compare_b)



Bar
1~3    133
4~8     42
gt8     20
dtype: int64


passanger
Friend(s)    120
Partner       75
dtype: int64


0.717948717948718
0.3770581778265642


In [None]:
## Drivers who go to bars more than once a month, had a friend or partner as a passenger, and had occupations other than farming, fishing, or forestry
## were more likely to accept the coupon (about 72% versus about 38% for everyone else).

6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 



In [282]:
condition_not_widowed = (df2['maritalStatus'] != 'Widowed')
## Age and income are stored as text brackets; 'below21' and 'Less than $12500' are the assumed labels for the lowest brackets.
condition_age_under_30 = df2['age'].isin(['below21', '21', '26'])
condition_income_less_than_50k = df2['income'].isin([
    'Less than $12500',
    '$12500 - $24999',
    '$25000 - $37499',
    '$37500 - $49999',
])

## "More than 4 times a month" corresponds to the '4~8' and 'gt8' answers.
condition_go_to_cheap_restaurant_more_than_4 = df2['RestaurantLessThan20'].isin(['4~8', 'gt8'])

condition_a = df2[
    (condition_bar_more_than_once 
    & condition_no_kid_passenger
    & condition_not_widowed) 
]

condition_b = df2[
    (condition_bar_more_than_once
    & condition_age_under_30)
]

condition_c = df2[
    (condition_income_less_than_50k
    & condition_go_to_cheap_restaurant_more_than_4)
]
print(condition_a['Y'].mean())
print(condition_b['Y'].mean())
print(condition_c['Y'].mean())
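
The prompt joins the three descriptions with *OR*, so one way to read it is as a single group (drivers matching any of the three) compared with everyone else. The sketch below shows that reading on made-up rows. The category labels are assumed to follow the dataset's bracket naming, and "not a kid" is interpreted here as any passenger value other than `Kid(s)`, which is a choice rather than the only valid one.

```python
import pandas as pd

# Made-up bar-coupon rows (column spelling `passanger` follows the dataset).
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1", "4~8", "never"],
    "passanger": ["Friend(s)", "Alone", "Kid(s)", "Partner", "Alone", "Kid(s)"],
    "maritalStatus": ["Single", "Widowed", "Single", "Married partner", "Single", "Single"],
    "age": ["26", "50plus", "21", "41", "36", "31"],
    "RestaurantLessThan20": ["1~3", "4~8", "less1", "gt8", "never", "1~3"],
    "income": ["$62500 - $74999", "$25000 - $37499", "$12500 - $24999",
               "$100000 or More", "$50000 - $62499", "$37500 - $49999"],
    "Y": [1, 1, 1, 0, 0, 0],
})

bar_gt1 = toy["Bar"].isin(["1~3", "4~8", "gt8"])
group1 = bar_gt1 & (toy["passanger"] != "Kid(s)") & (toy["maritalStatus"] != "Widowed")
group2 = bar_gt1 & toy["age"].isin(["below21", "21", "26"])
group3 = (
    toy["RestaurantLessThan20"].isin(["4~8", "gt8"])
    & toy["income"].isin(["Less than $12500", "$12500 - $24999",
                          "$25000 - $37499", "$37500 - $49999"])
)

in_any = group1 | group2 | group3          # union of the three descriptions
rate_any = toy.loc[in_any, "Y"].mean()
rate_rest = toy.loc[~in_any, "Y"].mean()
print(rate_any, rate_rest)
```

Building each mask separately keeps the individual rates available while still allowing the combined comparison.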


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [None]:
## Drivers who already visit bars regularly, are under the age of 30, and/or don't have any kids with them appear more likely to accept the bar coupon.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

In [299]:
selector = (data['coupon'] == 'Coffee House')
df3 = data[selector]
fig1 = px.bar(df3.sort_values('Y'), x='passanger', color='Y')
fig1.update_traces(dict(marker_line_width=0))
fig1.show()

fig2 = px.bar(df3.sort_values('Y'), x='income', color='Y')
fig2.update_traces(dict(marker_line_width=0))
fig2.show()

## Drivers who are alone or with friends appear MUCH more likely to use the Coffee House coupon.

## Contrary to my assumption, the lower the temperature, the less willing people were to go to the coffee house and use the coupon.
## Most coupon uses seem to occur at warmer temperatures.

In [301]:
fig3 = px.bar(df3.sort_values('Y'), x='occupation', color='Y')
fig3.update_traces(dict(marker_line_width=0))
fig3.show()

## Seems like Unemployed, Students, Sales, and Computer & Mathematical occupations both receive and use the coupons much more often. Why is this the case?

In [372]:
fig3 = px.bar(df3.sort_values('Y')[df3['occupation'] == 'Student'], x='income', color='Y')
fig3.update_traces(dict(marker_line_width=0))
fig3.show()
print(df3.sort_values('Y')[(df3['occupation'] == 'Student') & (df3['income'] == '87500--')]['Y'])


Boolean Series key will be reindexed to match DataFrame index.



Series([], Name: Y, dtype: int64)



Boolean Series key will be reindexed to match DataFrame index.
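
The "Boolean Series key will be reindexed" warning appears when a mask built from the original frame is applied to the re-ordered result of `sort_values`, because pandas has to realign the mask by index. Filtering first and sorting afterwards avoids it. Each comparison also needs its own parentheses, since `&` binds more tightly than `==`. A minimal sketch on made-up rows (the income labels are assumed):

```python
import pandas as pd

# Made-up coffee-house rows.
toy = pd.DataFrame({
    "occupation": ["Student", "Sales", "Student", "Unemployed"],
    "income": ["Less than $12500", "$25000 - $37499",
               "$12500 - $24999", "Less than $12500"],
    "Y": [1, 0, 0, 1],
})

# Build the mask on the unsorted frame, apply it with .loc, then sort the result.
mask = (toy["occupation"] == "Student") & (toy["income"] == "Less than $12500")
students_low_income = toy.loc[mask].sort_values("Y")
print(students_low_income)
```

Because the mask and the frame share the same index at the moment of filtering, no realignment is needed and no warning is raised.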



In [318]:

fig4 = px.bar(df3.sort_values('Y')[df3['occupation'] == 'Unemployed'], x='weather', color='Y')
fig4.update_traces(dict(marker_line_width=0))
fig4.show()

## Nothing stands out for weather among unemployed drivers. Let's limit our dataset to various occupations.


## Regardless, there are many more acceptances on Sunny days, which is the opposite of what I expected. One possible explanation is that these drivers are getting cold drinks or frozen treats from the coffee house, though sunny scenarios may also simply be more common in the survey.


Boolean Series key will be reindexed to match DataFrame index.



In [365]:
fig4 = px.histogram(df3.sort_values('Y'), barmode='group', x='destination', color='Y', facet_row='time', facet_col='temperature')
fig4.update_traces(dict(marker_line_width=0))
fig4.update_layout(barmode='group')
fig4.show()

## People on the way home don't seem to use the coupons often.
## People with no urgent place to be early in the morning are much more likely to get coffee and use the coupon. In general, people with no urgent destination are much more likely to use the coupon.

In [333]:
# Let's limit to only the home destination
df4 = df3[df3['destination'] == 'Home']
fig5 = px.histogram(df4.sort_values('Y'), x='maritalStatus', color='Y')
fig5.update_traces(dict(marker_line_width=0))
fig5.show()

In [346]:
import plotly.graph_objects as go 
fig = go.Figure(data=[
    go.Bar(name='5', x=df3['Y'], y=df3['toCoupon_GEQ5min']),
    go.Bar(name='15', x=df3['Y'], y=df3['toCoupon_GEQ15min']),
    go.Bar(name='25', x=df3['Y'], y=df3['toCoupon_GEQ25min'])
])
fig.update_traces(dict(marker_line_width=0))
fig.show()

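## The toCoupon_GEQ5min/15min/25min columns appear to flag whether the drive to the venue is at least 5, 15, or 25 minutes, so this chart relates acceptance to driving distance rather than to how soon the coupon is used after it arrives.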
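
If the `toCoupon_GEQ15min` and `toCoupon_GEQ25min` flags are read as "the drive to the venue takes at least 15 / 25 minutes" (an interpretation based on the column names), they can be combined into one distance bucket and the acceptance rate computed per bucket. A sketch on made-up rows, with hypothetical bucket labels:

```python
import pandas as pd

# Made-up rows; the flags are assumed to be nested (25 min implies 15 min).
toy = pd.DataFrame({
    "toCoupon_GEQ15min": [0, 0, 1, 1, 1, 1],
    "toCoupon_GEQ25min": [0, 0, 0, 0, 1, 1],
    "Y":                 [1, 1, 1, 0, 0, 0],
})

# Adding the nested flags gives 0, 1 or 2, which maps onto three buckets.
labels = {0: "under 15 min", 1: "15 to 25 min", 2: "25 min or more"}
level = toy["toCoupon_GEQ15min"] + toy["toCoupon_GEQ25min"]

by_distance = (
    toy.assign(distance=level.map(labels))
       .groupby("distance")["Y"]
       .agg(rate="mean", n="size")
)
print(by_distance)
```

A per-bucket rate answers the distance question more directly than plotting the raw flags against `Y`.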

It appears that when the coupon expires in 2 hours, people are less likely to accept it, whereas a 1-day expiration makes acceptance more likely.

In [359]:


fig5 = px.histogram(data.sort_values('Y'), x='expiration', color='Y')
    
fig5.show()

## Drivers accept the coupon much more often when it is valid for 1 day.
## But those who are headed home are much less likely to accept the coupon at all.
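
Stacked histograms show counts, which can hide differences in rates when the groups differ in size. A row-normalized cross-tabulation gives the acceptance rate per expiration directly. The rows below are made up for illustration only.

```python
import pandas as pd

# Made-up rows standing in for the coupon data.
toy = pd.DataFrame({
    "expiration":  ["2h", "2h", "2h", "1d", "1d", "1d", "1d", "2h"],
    "destination": ["Home", "Work", "No Urgent Place", "Home",
                    "No Urgent Place", "Work", "No Urgent Place", "Home"],
    "Y":           [0, 0, 1, 1, 1, 0, 1, 0],
})

# Each row sums to 1: the share of rejected (0) and accepted (1) coupons.
rates = pd.crosstab(toy["expiration"], toy["Y"], normalize="index")
print(rates)

# Acceptance rate per destination, for the "headed home" comparison.
by_dest = toy.groupby("destination")["Y"].mean()
print(by_dest)
```

The same two calls on `data` would put numbers behind the visual impression from the histograms.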

In [363]:
fig5 = px.histogram(data.sort_values('Y')[data['destination'] != 'No Urgent Place'], x='expiration', color='Y')
    
fig5.show()



Boolean Series key will be reindexed to match DataFrame index.

