### Part 1. Exploratory Data Analysis.

#### 1.1 Load the data.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
df = pd.read_csv('groupon.txt', sep=None, engine='python')

In [3]:
df.head(10)

Unnamed: 0,refund_bucket,refund_sub_bucket,order_date,transaction_date,week_end_date,dmm_subcat_1,category_1,deal_supply_channel,buyer_name_1,auth_bookings,capture_bookings,refunds,cancel_refunds,refunded_units,auth_refunds,capture_units
0,Other,Other,8/4/2016,8/4/2016,8/7/2016,Inverse Normal,Probability distribution II,Goods Stores,Asher,?,?,91.87,?,3,?,?
1,Returns,Change of mind,8/31/2018,9/21/2018,9/23/2018,Binomial Distribution.,Probability distribution I,Goods,Jesus,?,?,20.98,?,1,?,?
2,Fraud,Fraud,4/19/2017,4/19/2017,4/23/2017,Power series,Calculus II,Goods,Tristan,?,?,?,79.94,?,?,?
3,Two-Hour Refunds,Two-Hour Refunds,2/5/2016,2/5/2016,2/7/2016,Prime Factorization Algorithms,?,Goods,Jeremiah,?,?,49.267469958,?,1,?,?
4,Shortage Cancellations,Vendor Shortage,7/21/2018,8/15/2018,8/19/2018,Transformations,Geometry,Goods,Jacob,?,?,29.97,?,2,?,?
5,Logistics Cancellations,Dead Tracking,11/17/2016,12/8/2016,12/11/2016,Surface of revolution,Calculus II,Goods,River,?,?,34.99,?,1,?,?
6,?,?,7/17/2018,7/19/2018,7/22/2018,Folded Normal / Half Normal Distribution.,Probability distribution I,Goods,Tessa,?,1597.29449477419,?,?,?,?,95
7,Other,Other,4/10/2017,7/14/2017,7/16/2017,Non Linear programming,Operations Research,Goods,Kaylee,?,?,4.99,?,1,?,?
8,Returns,Change of mind,11/9/2018,12/21/2018,12/23/2018,Power series,Calculus II,Goods,?,?,?,19.99,?,1,?,?
9,Fraud,Fraud,9/14/2018,9/14/2018,9/16/2018,Power series,Calculus II,Goods,?,?,?,?,5082.78,?,?,?


In [5]:
print(df.info())
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6831276 entries, 0 to 6831275
Data columns (total 16 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   refund_bucket        object
 1   refund_sub_bucket    object
 2   order_date           object
 3   transaction_date     object
 4   week_end_date        object
 5   dmm_subcat_1         object
 6   category_1           object
 7   deal_supply_channel  object
 8   buyer_name_1         object
 9   auth_bookings        object
 10  capture_bookings     object
 11  refunds              object
 12  cancel_refunds       object
 13  refunded_units       object
 14  auth_refunds         object
 15  capture_units        object
dtypes: object(16)
memory usage: 833.9+ MB
None
(6831276, 16)


Our dataframe consists of 16 columns and approximately 6.8 million rows. It is large: info() reports about 834 MB of memory, and the trailing "+" means the real footprint of the object columns is higher. pandas loaded every column as object (the numeric columns contain "?" placeholders for missing values), so we need to convert each column to the right data type. The original data dictionary with type descriptions is provided below.

| Column_name | Type | Description |
| --- | --- | --- |
| refund_bucket | Varchar | Reason for refunding customer |
| refund_sub_bucket | Varchar | Sub reason for refunding customer |
| transaction_date | Date | Date of refund |
| week_end_date | Date | Weekend date of refund |
| dmm_subcat | Varchar | Sub category of product |
| category | Varchar | Category of product |
| deal_supply_channel | Varchar | Channel of sale |
| buyer_name | Varchar | Name of buyer who sourced the product |
| auth_bookings | Float | Bookings authorized on card |
| capture_bookings | Float | Bookings captured |
| refunds | Float | Amount of refund |
| cancel_refunds | Float | Refunds if the transaction was a cancellation |
| refunded_units | Integer | Quantity of product for which refunds were issued |
| auth_refunds | Integer | |
| capture_units | Integer | |
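The type dictionary above could also be applied while reading the file, rather than afterwards. The following is a minimal sketch on a tiny in-memory sample, not the real groupon.txt: the sample rows, the comma delimiter, and the choice of the nullable Int64 and string dtypes are assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for groupon.txt; the real file's delimiter and layout may differ.
sample = io.StringIO(
    "refund_bucket,transaction_date,refunds,refunded_units\n"
    "Other,8/4/2016,91.87,3\n"
    "Fraud,4/19/2017,?,?\n"
)

sample_df = pd.read_csv(
    sample,
    na_values="?",                     # treat the '?' placeholder as missing
    parse_dates=["transaction_date"],  # parse dates while reading
    dtype={
        "refund_bucket": "string",
        "refunds": "float64",
        "refunded_units": "Int64",     # nullable integer, keeps missing values
    },
)
print(sample_df.dtypes)
```

With this approach missing values stay missing instead of becoming 0, which may or may not be what the later analysis wants.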

#### 1.2. Data Cleaning and Transformation

In [8]:
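# '?' marks a missing value; regex=False treats it as a literal character, not a regex quantifier
df['refunded_units'] = df['refunded_units'].str.replace('?', '0', regex=False)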
df['refunded_units'] = df['refunded_units'].astype(int)

Now that we have converted refunded_units to the integer data type, let's handle auth_refunds and capture_units in the same way.

In [9]:
df['auth_refunds'] = df['auth_refunds'].str.replace('?', '0', regex=False)
df['auth_refunds'] = df['auth_refunds'].astype(int)

In [10]:
df['capture_units'] = df['capture_units'].str.replace('?', '0', regex=False)
df['capture_units'] = df['capture_units'].astype(int)

Now let's convert auth_bookings, capture_bookings, refunds, and cancel_refunds columns into the float data type.

In [None]:
df['auth_bookings'] = df['auth_bookings'].str.replace('?', '0', regex=False)
df['auth_bookings'] = df['auth_bookings'].astype(float)

In [None]:
df['capture_bookings'] = df['capture_bookings'].str.replace('?', '0', regex=False)
df['capture_bookings'] = df['capture_bookings'].astype(float)

In [None]:
df['refunds'] = df['refunds'].str.replace('?', '0', regex=False)
df['refunds'] = df['refunds'].astype(float)

In [44]:
df['cancel_refunds'] = df['cancel_refunds'].str.replace('?', '0', regex=False)
df['cancel_refunds'] = df['cancel_refunds'].astype(float)
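The seven conversion cells above repeat the same two steps, so they could be folded into one helper. This is only a sketch on toy data: the function name replace_placeholder_and_cast and its parameters are hypothetical, not part of the original notebook.

```python
import pandas as pd

def replace_placeholder_and_cast(frame, columns, dtype, placeholder="?", fill="0"):
    """Replace a literal placeholder with `fill`, then cast each column to `dtype`."""
    for col in columns:
        # regex=False makes '?' a literal character rather than a regex quantifier
        frame[col] = frame[col].str.replace(placeholder, fill, regex=False).astype(dtype)
    return frame

# Toy data standing in for the real columns
toy = pd.DataFrame({
    "refunded_units": ["3", "?", "1"],
    "refunds": ["91.87", "20.98", "?"],
})
toy = replace_placeholder_and_cast(toy, ["refunded_units"], int)
toy = replace_placeholder_and_cast(toy, ["refunds"], float)
print(toy.dtypes)
```

Filling with 0 follows the cells above, but it makes a missing amount indistinguishable from a true zero. If that distinction matters, pd.to_numeric with errors='coerce' would keep the missing values as NaN instead.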

Now that the refund columns are floats, let's convert order_date, transaction_date, and week_end_date to the datetime data type.

In [45]:
df['order_date'] = pd.to_datetime(df['order_date'], infer_datetime_format=True)
df['transaction_date'] = pd.to_datetime(df['transaction_date'], infer_datetime_format=True)
df['week_end_date'] = pd.to_datetime(df['week_end_date'], infer_datetime_format=True)
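The infer_datetime_format argument is deprecated as of pandas 2.0, so on newer versions an explicit format string is a reasonable alternative. The sketch below assumes the month/day/year layout seen in the head() output and uses toy values; errors='coerce' is included only to show how a stray placeholder would be turned into NaT rather than raising.

```python
import pandas as pd

# Toy values in the same M/D/YYYY layout as the sample rows, plus one placeholder
raw_dates = pd.Series(["8/4/2016", "12/21/2018", "?"])

# An explicit format avoids per-value guessing; unparseable entries become NaT
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")
print(parsed)
```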

### DMM_SUBCAT_1 COLUMN

In [46]:
df['dmm_subcat_1'] = df['dmm_subcat_1'].str.replace('?', 'unknown', regex=False)

In [47]:
print(df['dmm_subcat_1'].nunique())
print(df['dmm_subcat_1'].value_counts().head(10))

140
Power series                      380191
Surface of revolution             202547
Maclaurin series                  180019
Polynomial functions              173272
Bernoulli Distribution            154714
Prime Factorization Algorithms    147979
Erlang Distribution.              145873
Electrical networks               128065
Degenerate Distribution.          124399
Exponential Distribution.         123513
Name: dmm_subcat_1, dtype: int64


### CATEGORY COLUMN

In [48]:
df['category_1'] = df['category_1'].str.replace('?', 'unknown', regex=False)
print(df['category_1'].nunique())
print(df['category_1'].value_counts())

14
Probability distribution I     1603719
Probability distribution II    1213601
Calculus II                    1070787
Geometry                        626388
Algebra                         567067
Graph Theory                    520646
Linear Regression               378167
unknown                         376896
Calculus I                      171269
Combinatorics                   117097
Decision Tree                    98516
Operations Research              69411
Clustering algorithms            13813
Ensemble methods                  3899
Name: category_1, dtype: int64


### DEAL SUPPLY CHANNEL COLUMN

In [49]:
df['deal_supply_channel'].value_counts()

Goods           6253600
Goods Stores     577676
Name: deal_supply_channel, dtype: int64

### BUYER NAME COLUMN

In [50]:
df['buyer_name_1'] = df['buyer_name_1'].str.replace('?', 'Unknown', regex=False)

In [51]:
print(df['buyer_name_1'].nunique())
print(df['buyer_name_1'].value_counts().head(10))

353
Unknown      965347
Asher        520015
Luis         100820
Max           95006
Ezra          88199
Sofia         84766
Giovanni      80643
Stella        79479
Camila        76727
Emmett        75189
Name: buyer_name_1, dtype: int64


### REFUND BUCKET COLUMN

In [52]:
df['refund_bucket'] = df['refund_bucket'].str.replace('?', 'Unknown', regex=False)

In [53]:
df['refund_bucket'].value_counts()

Returns                    2781522
Unknown                    1956060
Logistics Cancellations    1086118
Two-Hour Refunds            503519
Other                       207162
Shortage Cancellations      170881
Fraud                       126014
Name: refund_bucket, dtype: int64

### REFUND SUB BUCKET COLUMN

In [54]:
df['refund_sub_bucket'] = df['refund_sub_bucket'].str.replace('?', 'Unknown', regex=False)

In [55]:
df['refund_sub_bucket'].value_counts()

Unknown                     1956060
Product Quality              910743
Change of mind               837070
Two-Hour Refunds             503519
Wrong/Damaged Product        494392
Wrong Size                   472068
Returned to Sender           335514
Dead Tracking                291467
Tracking Shows Delivered     276718
Other                        207162
Purchase Issues              129031
Fraud                        126014
Vendor Shortage              107126
Other Returns                 67249
Shipping Issues               53388
Groupon Error                 29543
Other Shortage                26638
Warehouse Shortage             7574
Name: refund_sub_bucket, dtype: int64

In [56]:
df['refund_sub_bucket'].nunique()

18
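Both refund_bucket and refund_sub_bucket show the same Unknown count (1,956,060), which suggests the two fields are missing together. A cross-tabulation is one way to check that, and to see how sub-buckets nest under buckets. The sketch below uses a handful of made-up rows, not the real data.

```python
import pandas as pd

# Made-up rows that mimic the bucket / sub-bucket structure
toy = pd.DataFrame({
    "refund_bucket": ["Returns", "Returns", "Fraud", "Unknown", "Unknown"],
    "refund_sub_bucket": ["Change of mind", "Wrong Size", "Fraud", "Unknown", "Unknown"],
})

# Rows are buckets, columns are sub-buckets, cells are row counts
table = pd.crosstab(toy["refund_bucket"], toy["refund_sub_bucket"])
print(table)

# Number of distinct parent buckets for each sub-bucket (1 means cleanly nested)
parents_per_sub = (table > 0).sum(axis=0)
print(parents_per_sub)
```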

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6831276 entries, 0 to 6831275
Data columns (total 16 columns):
refund_bucket          object
refund_sub_bucket      object
order_date             datetime64[ns]
transaction_date       datetime64[ns]
week_end_date          datetime64[ns]
dmm_subcat_1           object
category_1             object
deal_supply_channel    object
buyer_name_1           object
auth_bookings          float64
capture_bookings       float64
refunds                float64
cancel_refunds         float64
refunded_units         int32
auth_refunds           int32
capture_units          int32
dtypes: datetime64[ns](3), float64(4), int32(3), object(6)
memory usage: 755.7+ MB
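Six columns are still stored as object, and several of them have very few distinct values (deal_supply_channel has 2, refund_bucket has 7). Converting such columns to the pandas category dtype is a common way to cut memory. The following is a sketch on synthetic data, so the savings on the real dataframe will differ.

```python
import pandas as pd

# Synthetic low-cardinality column, similar in spirit to deal_supply_channel
toy = pd.DataFrame({"deal_supply_channel": ["Goods", "Goods Stores"] * 50_000})

# deep=True measures the actual string storage, not just the pointers
before = toy["deal_supply_channel"].memory_usage(deep=True)
toy["deal_supply_channel"] = toy["deal_supply_channel"].astype("category")
after = toy["deal_supply_channel"].memory_usage(deep=True)

print(f"before: {before} bytes, after: {after} bytes")
```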


Now that the transaction_date column holds datetime values, let's derive three new columns from it: transaction_year, transaction_month, and transaction_day (the day of the week, where Monday is 0 and Sunday is 6).

In [58]:
df['transaction_year'] = pd.DatetimeIndex(df['transaction_date']).year
df['transaction_month'] = pd.DatetimeIndex(df['transaction_date']).month
df['transaction_day'] = pd.DatetimeIndex(df['transaction_date']).dayofweek
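The same three columns can also be derived with the .dt accessor, which works directly on a datetime column without building a DatetimeIndex. This is a sketch on two toy dates; the column name transaction_dayofweek is a hypothetical, more explicit alternative to transaction_day.

```python
import pandas as pd

toy = pd.DataFrame({"transaction_date": pd.to_datetime(["2016-08-04", "2018-09-21"])})

toy["transaction_year"] = toy["transaction_date"].dt.year
toy["transaction_month"] = toy["transaction_date"].dt.month
# dayofweek is an integer: Monday=0 ... Sunday=6
toy["transaction_dayofweek"] = toy["transaction_date"].dt.dayofweek

# Same values as the DatetimeIndex approach used in the cell above
via_index = pd.DatetimeIndex(toy["transaction_date"]).dayofweek
print(toy)
```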

### Part 2. Machine Learning Modeling.