### Part 1. Exploratory Data Analysis.

#### 1.1 Load the data.

In [42]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.float_format', '{:.2f}'.format)
sns.set()

In [2]:
#df = pd.read_csv('groupon.txt', sep=None, engine='python')

In [3]:
#df.to_parquet('groupon.parquet')

In [4]:
df = pd.read_parquet('groupon.parquet')

In [43]:
df.head()

| | refund_bucket | refund_sub_bucket | order_date | transaction_date | week_end_date | dmm_subcat_1 | category_1 | deal_supply_channel | buyer_name_1 | auth_bookings | capture_bookings | refunds | cancel_refunds | refunded_units | auth_refunds | capture_units |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Other | Other | 2016-08-04 | 2016-08-04 | 2016-08-07 | Inverse Normal | Probability distribution II | Goods Stores | Asher | ? | ? | 91.87 | ? | 3 | ? | ? |
| 1 | Returns | Change of mind | 2018-08-31 | 2018-09-21 | 2018-09-23 | Binomial Distribution. | Probability distribution I | Goods | Jesus | ? | ? | 20.98 | ? | 1 | ? | ? |
| 2 | Fraud | Fraud | 2017-04-19 | 2017-04-19 | 2017-04-23 | Power series | Calculus II | Goods | Tristan | ? | ? | ? | 79.94 | ? | ? | ? |
| 3 | Two-Hour Refunds | Two-Hour Refunds | 2016-02-05 | 2016-02-05 | 2016-02-07 | Prime Factorization Algorithms | ? | Goods | Jeremiah | ? | ? | 49.267469958 | ? | 1 | ? | ? |
| 4 | Shortage Cancellations | Vendor Shortage | 2018-07-21 | 2018-08-15 | 2018-08-19 | Transformations | Geometry | Goods | Jacob | ? | ? | 29.97 | ? | 2 | ? | ? |


Our dataframe consists of 16 columns and approximately 6.8 million rows. It is large, occupying almost 1 GB of memory. pandas loaded every column as `object`, so we need to convert the data to the right data types. The original data dictionary, with the type and description of each column, is provided below.

| Column_name | Type | Description |
| --- | --- | --- |
| refund_bucket | Varchar | Reason for refunding customer |
| refund_sub_bucket | Varchar | Sub reason for refunding customer |
| transaction_date | Date | date of refund |
| week_end_date | Date | weekend date of refund |
| dmm_subcat | Varchar | sub category of product |
| category | Varchar | category of product |
| deal_supply_channel | Varchar | channel of sale |
| buyer_name | Varchar | name of buyer who sourced the product |
| auth_bookings | Float | bookings authorized on card |
| capture_bookings | Float | bookings captured |
| refunds | Float | amount of refund |
| cancel_refunds | Float | refunds if the transaction was a cancellation |
| refunded_units | Integer | quantity of product for which refunds were issued |
| auth_refunds | Integer | |
| capture_units | Integer | |
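
The size and dtype observations above can be checked directly. The sketch below uses a small made-up frame (`toy` is an illustrative stand-in, not the real data) whose columns are all stored as `object`, the way the raw file was loaded.

```python
import pandas as pd

# Toy stand-in for the raw data: every column holds text, as in the raw load.
toy = pd.DataFrame({
    'refunds': ['91.87', '?', '20.98'],
    'refunded_units': ['3', '?', '1'],
    'order_date': ['2016-08-04', '2017-04-19', '2018-08-31'],
}, dtype=object)

n_rows, n_cols = toy.shape

# deep=True counts the string payloads themselves, not just the pointers.
mem_mb = toy.memory_usage(deep=True).sum() / 1024 ** 2

# Columns that still need a type conversion.
object_cols = toy.select_dtypes(include='object').columns.tolist()

print(n_rows, n_cols)
print(f'{mem_mb:.4f} MB')
print(object_cols)
```

On the real dataframe the same three calls would be applied to `df` instead of `toy`.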

To speed up processing and simplify data manipulation, let's subset the dataframe to about one year of transactions and refunds based on the transaction date. I will take the May 2018 - April 2019 data.

To do so, I need to convert all date columns to the pandas datetime type and then use a boolean mask to create the subset.

In [7]:
df['order_date'] = pd.to_datetime(df['order_date'], infer_datetime_format=True)
df['transaction_date'] = pd.to_datetime(df['transaction_date'], infer_datetime_format=True)
df['week_end_date'] = pd.to_datetime(df['week_end_date'], infer_datetime_format=True)

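Now that the date columns are converted, I can set the start date and build a boolean mask to filter the dataframe. Note that the mask in this cell only sets a lower bound (`start_date = '2018-04-01'`), so it keeps every row dated on or after 1 April 2018.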

In [8]:
start_date = '2018-04-01'
mask = df['transaction_date'] >= start_date
subset = df.loc[mask]
subset.shape

(2140082, 16)
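
If the window should be bounded on both sides, `Series.between` is one option. This is only a sketch on made-up rows, and it assumes the intended period is 1 May 2018 through 30 April 2019 (both inclusive); `toy`, `end_date`, and `one_year` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'transaction_date': ['2018-03-31', '2018-05-01', '2018-12-25',
                         '2019-04-30', '2019-05-01'],
    'refunds': [10.0, 20.0, 30.0, 40.0, 50.0],
})
toy['transaction_date'] = pd.to_datetime(toy['transaction_date'])

# Assumed window: May 2018 through April 2019, both ends inclusive.
start_date = pd.Timestamp('2018-05-01')
end_date = pd.Timestamp('2019-04-30')

mask = toy['transaction_date'].between(start_date, end_date)

# .copy() gives an independent frame, so later column assignments are safe.
one_year = toy.loc[mask].copy()
print(one_year.shape)
```

Only the three rows dated inside the window survive the filter.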

#### 1.2. Data Cleaning and Transformation

In [9]:
subset = subset.copy()

In [10]:
subset['refunded_units'] = subset['refunded_units'].str.replace('?', '0').astype(int)

Now that we have converted refunded units to the integer data type, let's handle auth_refunds and capture_units in the same way.

In [11]:
subset['auth_refunds'] = subset['auth_refunds'].str.replace('?', '0').astype(int)
subset['capture_units'] = subset['capture_units'].str.replace('?', '0').astype(int)

Now let's convert auth_bookings, capture_bookings, refunds, and cancel_refunds columns into the float data type.

In [12]:
subset['auth_bookings'] = subset['auth_bookings'].str.replace('?', '0').astype(float)
subset['capture_bookings'] = subset['capture_bookings'].str.replace('?', '0').astype(float)
subset['refunds'] = subset['refunds'].str.replace('?', '0').astype(float)
subset['cancel_refunds'] = subset['cancel_refunds'].str.replace('?', '0').astype(float)
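
The conversions above repeat one pattern, so they can also be written as a loop over a column-to-type mapping. The sketch below runs on a made-up frame; `toy` and `dtypes` are illustrative names. It also shows an alternative, `pd.to_numeric(..., errors='coerce')`, which keeps `?` as a missing value instead of turning it into 0. Which of the two is appropriate depends on whether `?` really means zero in this dataset, which I am only assuming here.

```python
import pandas as pd

toy = pd.DataFrame({
    'refunded_units': ['3', '?', '1'],
    'refunds': ['91.87', '?', '49.267469958'],
})

dtypes = {'refunded_units': int, 'refunds': float}
for col, dtype in dtypes.items():
    # regex=False treats '?' as a literal character, not a regex quantifier.
    toy[col] = toy[col].str.replace('?', '0', regex=False).astype(dtype)

print(toy.dtypes)

# Alternative: keep '?' as NaN rather than 0.
raw = pd.Series(['91.87', '?', '49.267469958'])
as_nan = pd.to_numeric(raw, errors='coerce')
print(as_nan)
```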

I am done with the data type transformations. Now let's replace the question marks in the categorical columns with an explicit "unknown" label and see which categories I have and how many there are.

I have 139 unique subcategories in the dummy subcategory column (`dmm_subcat_1`).

In [13]:
subset['dmm_subcat_1'] = subset['dmm_subcat_1'].str.replace('?', 'unknown')
print(subset['dmm_subcat_1'].nunique())
print(subset['dmm_subcat_1'].value_counts().head(10))

139
Power series                        81496
Surface of revolution               58082
Bernoulli Distribution              55141
Erlang Distribution.                48652
Maclaurin series                    44475
Cumulative Distribution Function    43233
Polynomial functions                41714
Degenerate Distribution.            40893
Bayesian linear regression          40665
Exponential Distribution.           40265
Name: dmm_subcat_1, dtype: int64


I have only 14 categories compared to 139 subcategories. In reality, these are the actual product categories that Groupon sells online, but for the purposes of this project they were replaced with the names of college math classes.

In [14]:
subset['category_1'] = subset['category_1'].str.replace('?', 'unknown')
print(subset['category_1'].nunique())
print(subset['category_1'].value_counts())

14
Probability distribution I     541125
Probability distribution II    414040
Calculus II                    271681
Geometry                       216795
Graph Theory                   148467
Linear Regression              140379
Algebra                        133526
unknown                        115445
Calculus I                      53262
Decision Tree                   36688
Combinatorics                   33903
Operations Research             28068
Clustering algorithms            5697
Ensemble methods                 1006
Name: category_1, dtype: int64


There are only two unique supply channels: Goods and Goods Stores.

In [15]:
subset['deal_supply_channel'].value_counts()

Goods           1893570
Goods Stores     246512
Name: deal_supply_channel, dtype: int64

Buyer names do not refer to the customers but to the employees who sell the particular products. There are 273 unique employees in our subset.

In [16]:
subset['buyer_name_1'] = subset['buyer_name_1'].str.replace('?', 'Unknown')
print(subset['buyer_name_1'].nunique())
print(subset['buyer_name_1'].value_counts().head(10))

273
Unknown    461877
Asher      188851
Harmony     44057
Alexis      39726
Tessa       36054
Nathan      34015
Mateo       30248
Cooper      29576
Leah        27759
Juliet      24664
Name: buyer_name_1, dtype: int64


Our refund bucket column consists of seven buckets, which are general reasons for return.

In [17]:
subset['refund_bucket'] = subset['refund_bucket'].str.replace('?', 'Unknown')
subset['refund_bucket'].value_counts()

Unknown                    928342
Returns                    657377
Logistics Cancellations    265471
Two-Hour Refunds           155765
Shortage Cancellations      62777
Other                       40697
Fraud                       29653
Name: refund_bucket, dtype: int64

In addition, the refund sub-bucket column has 18 unique values that specify the refund reasons in more detail.

In [18]:
subset['refund_sub_bucket'] = subset['refund_sub_bucket'].str.replace('?', 'Unknown')
print(subset['refund_sub_bucket'].value_counts())
print(subset['refund_sub_bucket'].nunique())

Unknown                     928342
Product Quality             197493
Change of mind              172610
Wrong Size                  155774
Two-Hour Refunds            155765
Wrong/Damaged Product       109297
Returned to Sender           87688
Tracking Shows Delivered     63705
Dead Tracking                61661
Purchase Issues              41079
Vendor Shortage              40806
Other                        40697
Fraud                        29653
Other Returns                22203
Groupon Error                12438
Shipping Issues              11338
Other Shortage                8788
Warehouse Shortage             745
Name: refund_sub_bucket, dtype: int64
18
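
The per-column cleaning in this section can be condensed into one loop. The sketch below is an alternative on a made-up frame, not the code used above: it relies on `Series.replace`, which swaps whole cell values (only cells equal to `?`), and it uses a single label, `Unknown`, for every column. `toy` and `summary` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'category_1': ['Geometry', '?', 'Geometry', 'Algebra'],
    'refund_bucket': ['Fraud', '?', '?', 'Returns'],
})

summary = {}
for col in ['category_1', 'refund_bucket']:
    # Whole-value replacement: only cells that are exactly '?' change.
    toy[col] = toy[col].replace('?', 'Unknown')
    summary[col] = toy[col].nunique()
    print(col, summary[col])
    print(toy[col].value_counts().head(10))
```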


In [19]:
# df['transaction_year'] = pd.DatetimeIndex(df['transaction_date']).year
# df['transaction_month'] = pd.DatetimeIndex(df['transaction_date']).month
# df['transaction_day'] = pd.DatetimeIndex(df['transaction_date']).dayofweek

#### 1.3 Data Analysis and Visualization.

As we can see, the company might not be able to decrease refunds in Other, Returns, or Two-Hour Refunds. However, it can definitely decrease refunds in buckets such as Logistics Cancellations, Shortage Cancellations, and Fraud.

In [21]:
for bucket, frame in subset.groupby('refund_bucket'):
    avg = np.round(np.average(frame['refunds']),2)
    print('The average amount of refund in ' + bucket + ' was ' + str(avg))

The average amount of refund in Fraud was 19.14
The average amount of refund in Logistics Cancellations was 42.98
The average amount of refund in Other was 36.17
The average amount of refund in Returns was 66.16
The average amount of refund in Shortage Cancellations was 61.36
The average amount of refund in Two-Hour Refunds was 44.57
The average amount of refund in Unknown was 0.0
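
The same per-bucket averages can be obtained without an explicit loop. The sketch below uses made-up numbers (`toy` is illustrative) and also shows how rows in the `Unknown` bucket could be excluded before averaging.

```python
import pandas as pd

toy = pd.DataFrame({
    'refund_bucket': ['Fraud', 'Fraud', 'Returns', 'Returns', 'Unknown'],
    'refunds': [10.0, 30.0, 50.0, 70.0, 0.0],
})

# One mean per bucket, rounded to cents.
avg_by_bucket = toy.groupby('refund_bucket')['refunds'].mean().round(2)
print(avg_by_bucket)

# Optionally drop rows with no recorded refund reason before averaging.
refunded_only = toy[toy['refund_bucket'] != 'Unknown']
avg_refunded = refunded_only.groupby('refund_bucket')['refunds'].mean().round(2)
print(avg_refunded)
```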


Let's see the average refund amounts by sub-bucket. The same reasoning applies to the sub-bucket column: there are definitely areas for improvement, especially in sub-buckets such as Groupon Error, Other Shortage, Shipping Issues, Warehouse Shortage, etc.

Grouping the data by refund bucket gives a very useful insight about the dataset: all valid transactions with no refund fall under 'Unknown' in the refund bucket column, which makes 'Unknown' the biggest refund bucket in the dataset. This makes the analysis harder, because I don't actually know which products were bought the most under each particular refund bucket.

In [44]:
subset.groupby(['refund_bucket', 'refund_sub_bucket']).agg(
    min_refund=pd.NamedAgg(column='refunds', aggfunc='min'),
    min_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='min'),
    max_refund=pd.NamedAgg(column='refunds', aggfunc='max'),
    max_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='max'),
    avg_refund=pd.NamedAgg(column='refunds', aggfunc='mean'),
    avg_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='mean'),
)

| refund_bucket | refund_sub_bucket | min_refund | min_transaction | max_refund | max_transaction | avg_refund | avg_transaction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Fraud | Fraud | 0.0 | 0.0 | 5699.97 | 0.0 | 19.14 | 0.0 |
| Logistics Cancellations | Dead Tracking | 0.0 | 0.0 | 16574.35 | 0.0 | 49.38 | 0.0 |
| Logistics Cancellations | Purchase Issues | 0.0 | 0.0 | 2999.9 | 0.0 | 20.69 | 0.0 |
| Logistics Cancellations | Returned to Sender | 0.0 | 0.0 | 2449.93 | 0.0 | 45.07 | 0.0 |
| Logistics Cancellations | Shipping Issues | 0.0 | 0.0 | 3199.99 | 0.0 | 46.77 | 0.0 |
| Logistics Cancellations | Tracking Shows Delivered | -1.63 | 0.0 | 9249.0 | 0.0 | 47.6 | 0.0 |
| Other | Other | -637.5 | 0.0 | 2899.0 | 0.0 | 36.17 | 0.0 |
| Returns | Change of mind | -4598.0 | 0.0 | 10369.37 | 0.0 | 76.72 | 0.0 |
| Returns | Other Returns | 0.0 | 0.0 | 5576.43 | 0.0 | 77.66 | 0.0 |
| Returns | Product Quality | -10.41 | 0.0 | 14169.14 | 0.0 | 77.11 | 0.0 |


As the Groupon team explained, they do not use the refund amount to analyze refunds. A better measure is the refund rate: the refund amount compared to the total transaction amount.

Let's add a refund rate column to the grouped summaries to generate some insights.

In [70]:
by_category = subset.groupby(['category_1']).agg(
    min_refund=pd.NamedAgg(column='refunds', aggfunc='min'),
    min_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='min'),
    max_refund=pd.NamedAgg(column='refunds', aggfunc='max'),
    max_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='max'),
    avg_refund=pd.NamedAgg(column='refunds', aggfunc='mean'),
    avg_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='mean'),
).sort_values(by='avg_transaction', ascending=False)
# Refund rate: average refund as a percentage of the average transaction
by_category['refund_rate'] = (by_category['avg_refund'] / by_category['avg_transaction']) * 100
by_category

| category_1 | min_refund | min_transaction | max_refund | max_transaction | avg_refund | avg_transaction | refund_rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Calculus II | -637.5 | 0.0 | 16574.35 | 1953473.05 | 74.46 | 923.89 | 8.06 |
| Graph Theory | -55.38 | 0.0 | 2879.9 | 195730.62 | 25.05 | 759.56 | 3.3 |
| Probability distribution II | -4598.0 | 0.0 | 5371.8 | 270776.06 | 34.14 | 703.41 | 4.85 |
| Linear Regression | -200.0 | 0.0 | 5276.6 | 165051.78 | 21.76 | 557.21 | 3.91 |
| Clustering algorithms | 0.0 | 0.0 | 1329.36 | 21153.0 | 45.93 | 457.18 | 10.05 |
| Geometry | -200.0 | 0.0 | 17187.12 | 162782.0 | 16.33 | 451.11 | 3.62 |
| Algebra | -348.25 | 0.0 | 9249.0 | 100926.43 | 23.66 | 439.31 | 5.38 |
| Probability distribution I | -205.0 | 0.0 | 7211.08 | 170148.28 | 28.67 | 424.26 | 6.76 |
| unknown | -170.0 | 0.0 | 9024.5 | 77712.6 | 19.44 | 385.77 | 5.04 |
| Calculus I | -26.8 | 0.0 | 2911.86 | 78630.9 | 13.71 | 313.47 | 4.37 |
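
The refund rate in this table is a ratio of two group means. Because both means are taken over the same rows, the group size cancels and the result equals total refunds divided by total captured bookings. The sketch below checks this on made-up numbers; `toy`, `totals`, and `rate_from_means` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'category_1': ['A', 'A', 'A', 'B', 'B'],
    'refunds': [0.0, 10.0, 0.0, 5.0, 0.0],
    'capture_bookings': [100.0, 0.0, 100.0, 0.0, 50.0],
})

# Refund rate from group totals.
totals = toy.groupby('category_1')[['refunds', 'capture_bookings']].sum()
totals['refund_rate'] = totals['refunds'] / totals['capture_bookings'] * 100

# Refund rate from group means, as in the cell above.
means = toy.groupby('category_1')[['refunds', 'capture_bookings']].mean()
rate_from_means = means['refunds'] / means['capture_bookings'] * 100

print(totals)
print(rate_from_means)
```

Both approaches give 5% for category A and 10% for category B in this toy example.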


In [71]:
by_subcategory = subset.groupby(['category_1', 'dmm_subcat_1']).agg(
    min_refund=pd.NamedAgg(column='refunds', aggfunc='min'),
    min_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='min'),
    max_refund=pd.NamedAgg(column='refunds', aggfunc='max'),
    max_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='max'),
    avg_refund=pd.NamedAgg(column='refunds', aggfunc='mean'),
    avg_transaction=pd.NamedAgg(column='capture_bookings', aggfunc='mean'),
)
# Refund rate: average refund as a percentage of the average transaction
by_subcategory['refund_rate'] = (by_subcategory['avg_refund'] / by_subcategory['avg_transaction']) * 100
by_subcategory.sort_values(by=['avg_transaction', 'avg_refund', 'refund_rate'], ascending=False).head(10)

| category_1 | dmm_subcat_1 | min_refund | min_transaction | max_refund | max_transaction | avg_refund | avg_transaction | refund_rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Calculus II | Maclaurin series | -637.5 | 0.0 | 16574.35 | 1953473.05 | 204.78 | 1842.99 | 11.11 |
| Probability distribution II | Generalized Error Distribution. | -344.5 | 0.0 | 2249.97 | 78423.72 | 61.83 | 1831.85 | 3.38 |
| Geometry | Proofs | 0.0 | 0.0 | 1129.75 | 84401.78 | 45.44 | 1516.47 | 3.0 |
| Probability distribution II | Mixture Distribution | -100.0 | 0.0 | 5170.82 | 238032.01 | 82.88 | 1465.63 | 5.65 |
| Graph Theory | Mengers theorem | -8.0 | 0.0 | 1806.77 | 192124.5 | 36.33 | 1383.72 | 2.63 |
| Probability distribution II | Kumaraswamy Distribution | -85.59 | 0.0 | 5371.8 | 270776.06 | 98.6 | 1311.57 | 7.52 |
| Probability distribution II | Lognormal Distribution. | -108.99 | 0.0 | 2099.26 | 175344.76 | 31.9 | 1030.32 | 3.1 |
| Calculus II | Surface of revolution | -42.72 | 0.0 | 10499.3 | 334352.4 | 62.87 | 1000.08 | 6.29 |
| Calculus II | Taylor series | -40.0 | 0.0 | 12423.99 | 381296.96 | 50.76 | 967.19 | 5.25 |
| Graph Theory | Hamiltonian Graphs | -15.0 | 0.0 | 2103.2 | 160479.92 | 28.4 | 928.72 | 3.06 |


In [68]:
subset.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| auth_bookings | 2140082.0 | 575.89 | 6152.39 | 0.0 | 0.0 | 0.0 | 0.0 | 3273685.58 |
| capture_bookings | 2140082.0 | 561.8 | 3955.84 | 0.0 | 0.0 | 0.0 | 96.66 | 1953473.05 |
| refunds | 2140082.0 | 31.65 | 106.0 | -4598.0 | 0.0 | 0.0 | 31.96 | 17187.12 |
| cancel_refunds | 2140082.0 | 13.8 | 139.43 | -10.6 | 0.0 | 0.0 | 0.0 | 56214.5 |
| refunded_units | 2140082.0 | 0.86 | 1.96 | 0.0 | 0.0 | 0.0 | 1.0 | 698.0 |
| auth_refunds | 2140082.0 | 20.77 | 195.46 | 0.0 | 0.0 | 0.0 | 0.0 | 33298.0 |
| capture_units | 2140082.0 | 20.37 | 117.48 | 0.0 | 0.0 | 0.0 | 3.0 | 16192.0 |


In [69]:
subset.pivot_table(values='refunds', index='buyer_name_1', columns='refund_bucket', aggfunc = np.mean)

| buyer_name_1 | Fraud | Logistics Cancellations | Other | Returns | Shortage Cancellations | Two-Hour Refunds | Unknown |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Aaron | 0.00 | 30.65 | | 37.50 | 58.30 | 13.82 | 0.00 |
| Abel | 3.69 | 41.19 | 12.85 | 38.88 | 46.18 | 10.77 | 0.00 |
| Abigail | | | | 189.99 | | | |
| Adalynn | | 16.74 | | | | | |
| Adam | 0.00 | 129.99 | | 171.11 | | 79.57 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| William | 1.13 | 25.72 | 13.49 | 33.42 | 29.18 | 16.25 | 0.00 |
| Wyatt | 0.00 | 57.98 | 27.61 | 71.53 | 97.98 | 18.30 | 0.00 |
| Ximena | 34.30 | 24.95 | 26.42 | 30.06 | 29.56 | 20.62 | 0.00 |
| Zoe | 1.99 | 9.61 | | 21.31 | 0.16 | 2.02 | 0.00 |


In [27]:
# doesn't work properly - gives nan, null, or infinity instead. I have to figure out why
# subset['refund_rate'] = subset['refunds']/subset['capture_bookings']
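
A likely reason for the NaN and infinity values noted in this cell: in the refund bucket table earlier in this section, `max_transaction` is 0.0 in every displayed row, so `capture_bookings` is 0 on those refund rows and a row-level division hits 0/0 or x/0. The sketch below reproduces this on made-up numbers and shows one possible workaround. The toy values and the choice to mark the rate as missing when bookings are zero are my assumptions, not something confirmed by the data owners.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'refunds': [0.0, 20.0, 5.0],
    'capture_bookings': [0.0, 0.0, 50.0],
})

# Naive row-level rate: 0/0 gives NaN, 20/0 gives inf, 5/50 gives 0.1.
naive = toy['refunds'] / toy['capture_bookings']
print(naive.tolist())

# Treat a zero denominator as "rate undefined" instead of dividing by it.
safe = toy['refunds'] / toy['capture_bookings'].replace(0, np.nan)
print(safe.tolist())
```

This is also why the grouped summaries compute the rate from aggregated values: refunds and bookings sit on different rows, so the ratio only becomes meaningful after aggregation.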

From the following plot, we can clearly see that refunds have spikes and valleys. The largest spikes occur around the Thanksgiving and Christmas season, when most people shop for gifts.
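
A weekly series like the one described could be built as follows. This is a sketch on made-up rows, assuming the plot shows total refunds per `week_end_date`; `toy` and `weekly_refunds` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'week_end_date': pd.to_datetime(
        ['2018-11-18', '2018-11-25', '2018-11-25', '2018-12-02']),
    'refunds': [10.0, 40.0, 25.0, 15.0],
})

# Total refund amount per week, in chronological order.
weekly_refunds = toy.groupby('week_end_date')['refunds'].sum().sort_index()
print(weekly_refunds)

# Week with the largest total refund amount.
peak_week = weekly_refunds.idxmax()
print(peak_week)
```

Calling `weekly_refunds.plot()` on the result would draw the series with matplotlib, which this notebook already imports.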

### Part 2. Machine Learning Modeling.