This is a large data set. The decompressed files require about 22GB of space.

This data captures the process of offering incentives (a.k.a. coupons) to a large number of customers and forecasting those who will become loyal to the product. Let's say 100 customers are offered a discount to purchase two bottles of water. Of the 100 customers, 60 choose to redeem the offer. These 60 customers are the focus of this competition. You are asked to predict which of the 60 will return (during or after the promotional period) to purchase the same item again.

To create this prediction, you are given a minimum of a year of shopping history prior to each customer's incentive, as well as the purchase histories of many other shoppers (some of whom will have received the same offer). The transaction history contains all items purchased, not just items related to the offer. Only one offer per customer is included in the data. The training set consists of offers issued before 2013-05-01. The test set consists of offers issued on or after 2013-05-01.
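
The date-based split described above can be sketched in plain Python. The cutoff date comes from the description; the function name `split_by_offer_date` and the sample rows are made up for illustration and are not part of the competition files.

```python
from datetime import date

# Cutoff stated in the data description: offers issued before this date are training data.
CUTOFF = date(2013, 5, 1)

def split_by_offer_date(rows, cutoff=CUTOFF):
    """Split (customer_id, offer_date) rows into train and test lists."""
    train = [r for r in rows if r[1] < cutoff]
    test = [r for r in rows if r[1] >= cutoff]
    return train, test

# Hypothetical sample rows, not taken from the real files.
sample = [("a", date(2013, 4, 30)), ("b", date(2013, 5, 1)), ("c", date(2013, 6, 15))]
train_rows, test_rows = split_by_offer_date(sample)
```

An offer issued exactly on 2013-05-01 lands in the test set, since the training set only covers offers issued strictly before that date.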

Files
You are provided four relational files:

transactions.csv - contains transaction history for all customers for a period of at least 1 year prior to their offered incentive <br>
trainHistory.csv - contains the incentive offered to each customer and information about the behavioral response to the offer <br>
testHistory.csv - contains the incentive offered to each customer but does not include their response (you are predicting the repeater column for each id in this file) <br>
offers.csv - contains information about the offers <br>

Fields
All of the fields are anonymized and categorized to protect customer and sales information. The specific meanings of the fields will not be provided (so don't bother asking). Part of the challenge of this competition is learning the taxonomy of items in a data-driven way.

history
id - A unique id representing a customer <br>
chain - An integer representing a store chain <br>
offer - An id representing a certain offer <br>
market - An id representing a geographical region <br>
repeattrips - The number of times the customer made a repeat purchase <br>
repeater - A boolean, equal to repeattrips > 0 <br>
offerdate - The date a customer received the offer <br>
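
As a small sanity check on the label definition, here is a plain-Python sketch that recomputes `repeater` from `repeattrips`. It assumes the file stores the flag as the strings `t` and `f`, which is how it appears in the sample output further down in this notebook; the helper names are hypothetical.

```python
def parse_repeater(flag):
    """Map the 't'/'f' strings to booleans (assumed encoding)."""
    mapping = {"t": True, "f": False}
    return mapping[flag]  # raises KeyError on any other value

def repeater_from_trips(repeattrips):
    """Recompute the label from its definition: repeattrips > 0."""
    return repeattrips > 0

# (repeattrips, repeater) pairs chosen to mirror the sample output
rows = [(5, "t"), (16, "t"), (0, "f")]
consistent = all(
    parse_repeater(flag) == repeater_from_trips(trips) for trips, flag in rows
)
```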

transactions
id - see above <br>
chain - see above <br>
dept - An aggregate grouping of the Category (e.g. water) <br>
category - The product category (e.g. sparkling water) <br>
company - An id of the company that sells the item <br>
brand - An id of the brand to which the item belongs <br>
date - The date of purchase <br>
productsize - The amount of the product purchase (e.g. 16 oz of water) <br>
productmeasure - The units of the product purchase (e.g. ounces) <br>
purchasequantity - The number of units purchased <br>
purchaseamount - The dollar amount of the purchase <br>

offers
offer - see above <br>
category - see above <br>
quantity - The number of units one must purchase to get the discount <br>
company - see above <br>
offervalue - The dollar value of the offer <br>
brand - see above <br>

The transactions file can be joined to the history file by (id,chain). The history file can be joined to the offers file by (offer). The transactions file can be joined to the offers file by (category, brand, company). A negative value in purchasequantity and purchaseamount indicates a return.
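
To make the composite join key and the return convention concrete, here is a minimal plain-Python sketch on toy rows. The rows, the `key_of` helper and the `is_return` field are illustrative assumptions; the real join is done with Spark later in the notebook.

```python
# Toy tables; field names follow the description above, values are made up.
offers = [
    {"offer": 1, "category": 3203, "company": 106414464, "brand": 13474},
]
transactions = [
    {"id": "c1", "category": 3203, "company": 106414464, "brand": 13474,
     "purchasequantity": 1, "purchaseamount": 2.49},
    {"id": "c1", "category": 3203, "company": 106414464, "brand": 13474,
     "purchasequantity": -1, "purchaseamount": -2.49},   # a return
    {"id": "c2", "category": 9999, "company": 1, "brand": 2,
     "purchasequantity": 2, "purchaseamount": 5.00},     # unrelated item
]

OFFER_KEY = ("category", "brand", "company")

def key_of(row):
    """Build the composite join key for a row."""
    return tuple(row[k] for k in OFFER_KEY)

# Index offers by key (this toy table has one offer per key), then keep only
# transactions whose key matches an offer, i.e. an inner join.
offers_by_key = {key_of(o): o for o in offers}
joined = [
    {**t,
     "offer": offers_by_key[key_of(t)]["offer"],
     "is_return": t["purchasequantity"] < 0 or t["purchaseamount"] < 0}
    for t in transactions
    if key_of(t) in offers_by_key
]
```

If several offers share the same (category, brand, company) in the real data, a dictionary index like this would keep only one of them, so a proper join is the safer choice there.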

## Installing the lifetimes package

In [7]:
pip install lifetimes

Collecting lifetimes
  Downloading Lifetimes-0.11.3-py3-none-any.whl (584 kB)
Collecting autograd>=1.2.0
  Downloading autograd-1.3.tar.gz (38 kB)
Collecting dill>=0.2.6
  Downloading dill-0.3.2.zip (177 kB)
Building wheels for collected packages: autograd, dill
  Building wheel for autograd (setup.py): started
  Building wheel for autograd (setup.py): finished with status 'done'
  Created wheel for autograd: filename=autograd-1.3-py3-none-any.whl size=47994 sha256=b8049f8a9f28fdaa906c28d6d21adda0970f4d32693b26d31b9e71137476c907
  Stored in directory: c:\users\bharg\appdata\local\pip\cache\wheels\ef\32\31\0e87227cd0ca1d99ad51fbe4b54c6fa02afccf7e483d045e04
  Building wheel for dill (setup.py): started
  Building wheel for dill (setup.py): finished with status 'done'
  Created wheel for dill: filename=dill-0.3.2-py3-none-any.whl size=78977 sha256=3fe180f21474711fb77947a2994d4275a64932f94d10ec554a92e394752973f2
  Stored in directory: c:\users\bharg\appdata\local\pip\cache\wheels\72\6b\d5\

## Importing the necessary modules

In [1]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import *
import pyspark.sql.functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import *

#The lifetimes package in Python is designed to estimate customer lifetime value (CLTV)
import lifetimes
import sys

### Creating the Spark Session

In [2]:
spark=SparkSession.builder.appName("Spark Programming").getOrCreate()

In [6]:
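#SQLContext expects a SparkContext rather than a SparkSession; spark.sql() below covers the same use
sc = SQLContext(spark.sparkContext)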

In [7]:
type(spark)
spark

### Loading the transactions data into the Spark Dataframe

In [14]:
#Creating the Schema for the datasets
# The ID fields are declared with nullable=False because, by definition, they should never be null
offers_schema =StructType(fields=[StructField('offer',IntegerType(),False),      #offer - An id representing a certain offer
                 StructField('category',IntegerType(),True),     #category - The product category (e.g. sparkling water)
                StructField('quantity',IntegerType(),True),     #quantity - The number of units one must purchase to get the discount
                StructField('company',IntegerType(),False),       #company - An id of the company that sells the item
                StructField('offervalue',DoubleType(),True),   #offervalue - The dollar value of the offer
                StructField('brand',IntegerType(),False)])         #brand - An id of the brand to which the item belongs

transactions_schema = StructType(fields=[StructField('id',StringType(),False),   #id - A unique id representing a customer
                      StructField('chain',IntegerType(),True),  #chain - An integer representing a store chain
                      StructField('dept',IntegerType(),True),    #dept - An aggregate grouping of the Category (e.g. water)
                      StructField('category',IntegerType(),True), #category - The product category (e.g. sparkling water)
                      StructField('company',IntegerType(),False), #company - An id of the company that sells the item
                      StructField('brand',IntegerType(),False),  #brand - An id of the brand to which the item belongs
                      StructField('date',DateType(),True),        #date - The date of purchase
                      StructField('productsize',DoubleType(),True), #productsize - The amount of the product purchase (e.g. 16 oz of water)
                      StructField('productmeasure',StringType(),True), #productmeasure - The units of the product purchase (e.g. ounces)
                      StructField('purchasequantity',IntegerType(),True), #purchasequantity - The number of units purchased
                      StructField('purchaseamount',DoubleType(),True)]) #purchaseamount - The dollar amount of the purchase

trainHistory_schema =StructType(fields=[StructField('id',StringType(),False),     #id - A unique id representing a customer
                      StructField('chain',IntegerType(),True),    #chain - An integer representing a store chain
                      StructField('offer',IntegerType(),False),   #offer - An id representing a certain offer
                      StructField('market',IntegerType(),False),  #market - An id representing a geographical region
                      StructField('repeattrips',IntegerType(),True), #repeattrips - The number of times the customer made a repeat purchase
                      StructField('repeater',StringType(),True),  #repeater - A boolean, equal to repeattrips > 0
                      StructField('offerdate',DateType(),True)])  #offerdate - The date a customer received the offer
 

In [15]:
#Reading the data into Spark DataFrames (header and schema are passed once, in load)
transactions = spark.read.format('csv').\
                load("Data/X5 Retail Data/acquire-valued-shoppers-challenge/transactions.csv",header=True,schema=transactions_schema)

offers=spark.read.format('csv').\
                load("Data/X5 Retail Data/acquire-valued-shoppers-challenge/offers.csv",header=True,schema=offers_schema)

trainHistory=spark.read.format('csv').\
                load("Data/X5 Retail Data/acquire-valued-shoppers-challenge/trainHistory.csv",header=True,schema=trainHistory_schema)


In [None]:
transactions.select(countDistinct('id')).show()
#transactions.show(60)

In [11]:
#Merging the Datasets for the analysis purpose
#Join Transactions to TrainHistory dataset
transaction_history=transactions.join(trainHistory,on=['id','chain'],how='inner')
history_offer=trainHistory.join(offers,on='offer',how='left')

 The transactions file can be joined to the offers file by (category, brand, company).

In [None]:
print( "Transactions Dataset Shape: \n Number of Rows ->" , transactions.count() , " and Number of Columns -> ", len(transactions.columns))

In [20]:
#Extracting the month from the date (purchase date) column
transactions=transactions.withColumn("Transaction_Month", month(transactions['date']))

In [21]:
#Registering the DataFrames as queryable temp views. This allows us to use the power of SQL to query them
transactions.createOrReplaceTempView('Transactions') 
offers.createOrReplaceTempView('Offers')
trainHistory.createOrReplaceTempView('History')

In [None]:
#Note: this counts the distinct store chains seen in each month, not the number of transactions
result1=spark.sql("Select TRUNC(date, 'month') as PurchaseMonth,COUNT(DISTINCT chain) as DistinctChains FROM Transactions GROUP BY TRUNC(date, 'month') ORDER BY PurchaseMonth")
result1.show()
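
The query above groups rows by `TRUNC(date, 'month')` and counts distinct `chain` values per month. A plain-Python sketch of the same logic on a few made-up rows (the `month_start` helper is hypothetical) shows what the result looks like:

```python
from collections import defaultdict
from datetime import date

def month_start(d):
    """Equivalent of TRUNC(date, 'month'): the first day of d's month."""
    return d.replace(day=1)

# Hypothetical (date, chain) pairs standing in for the Transactions view.
rows = [
    (date(2012, 3, 2), 205),
    (date(2012, 3, 15), 205),
    (date(2012, 3, 20), 18),
    (date(2012, 4, 1), 205),
]

chains_per_month = defaultdict(set)
for d, chain in rows:
    chains_per_month[month_start(d)].add(chain)

# Sorted by month, like ORDER BY PurchaseMonth
result = sorted((m, len(chains)) for m, chains in chains_per_month.items())
```

March has three rows but only two distinct chains, so the count for that month is 2. Counting transactions per month would need a different aggregate.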


In [None]:
#Let us explore the transactions dataset first
transactions.groupBy(['id','chain','date']).count().show()



#Check number of distinct users
#spark.sql("Select count(distinct(id)) from Transactions").show()
#print("Number of customers: %i" %().collect()))

In [55]:
#Let us combine the three separate datasets to create one single dataset
merge1=trainHistory.join(offers,on='offer',how='inner')
final_df=transactions.join(merge1,on=['id','chain','company','brand','category'],how='inner')
       # transactions.join(trainHistory.join(offers,on='offer',how='inner'))

+-------+--------+-----+------+-----------+--------+----------+--------+--------+---------+----------+------+
|  offer|      id|chain|market|repeattrips|repeater| offerdate|category|quantity|  company|offervalue| brand|
+-------+--------+-----+------+-----------+--------+----------+--------+--------+---------+----------+------+
|1208251|   86246|  205|    34|          5|       t|2013-04-24|    2202|       1|104460040|       2.0|  3718|
|1197502|   86252|  205|    34|         16|       t|2013-03-27|    3203|       1|106414464|      0.75| 13474|
|1197502|12682470|   18|    11|          0|       f|2013-03-28|    3203|       1|106414464|      0.75| 13474|
|1197502|12996040|   15|     9|          0|       f|2013-03-25|    3203|       1|106414464|      0.75| 13474|
|1204821|13089312|   15|     9|          0|       f|2013-04-01|    5619|       1|107717272|       1.5|102504|
|1197502|13179265|   14|     8|          0|       f|2013-03-29|    3203|       1|106414464|      0.75| 13474|
|1200581|1

In [41]:
merge1.select('repeater').distinct().show()

+--------+
|repeater|
+--------+
|    null|
+--------+



In [None]:
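#Let us examine the transaction activity
#The original query stopped after the select list; the FROM and GROUP BY below are an assumed completion against the Transactions view
spark.sql("""
select date as Purchase_Date,
count(distinct(category)) as Categories
from Transactions
group by date
""").show()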

In [38]:
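#Adding a Sales column; the later cells work with three columns: Customer Id, Transaction Date and Sales amount
# Sales amount is calculated by multiplying the purchase quantity by the purchase amount
# df2 is defined in a cell not shown here; round is pyspark.sql.functions.round (from the star import), not the Python builtin
sales_df=df2.withColumn("Sales",round(df2.purchasequantity * df2.purchaseamount,2))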
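
One thing worth checking about this Sales formula is how it treats returns. The sketch below repeats the same arithmetic in plain Python, assuming a return carries a negative value in both fields, as the data description suggests. The `sales` function and the sample numbers are illustrative only.

```python
def sales(purchasequantity, purchaseamount):
    """Same arithmetic as the Sales column: quantity * amount, rounded to 2 places."""
    return round(purchasequantity * purchaseamount, 2)

purchase = sales(2, 3.50)     # a regular purchase
refund = sales(-1, -2.49)     # a return with both fields negative (assumed encoding)
```

Under that assumption the two negative signs cancel, so the return shows up as positive Sales. It may be worth inspecting a few real return rows before relying on this column.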


In [39]:
sales_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- chain: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- category: string (nullable = true)
 |-- company: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- date: date (nullable = true)
 |-- productsize: string (nullable = true)
 |-- productmeasure: string (nullable = true)
 |-- purchasequantity: integer (nullable = true)
 |-- purchaseamount: double (nullable = true)
 |-- Sales: double (nullable = true)



In [40]:
# Shape of Sales Dataset
print((sales_df.count(), len(sales_df.columns)))

(349655789, 12)


In [25]:
#How many customers are under our analysis?
sales_df.select('id').distinct().count()

311541

In [28]:
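# Getting summary statistics for the newly created sales dataset
# id is a string column here, so its min/max are lexicographic and its mean/stddev are not meaningful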
sales_df.describe().show()

+-------+--------------------+------------------+
|summary|                  id|             Sales|
+-------+--------------------+------------------+
|  count|           349655789|         349655789|
|   mean|1.8395699348116875E9|56.791282085746246|
| stddev|1.5394515594486134E9|48228.584970713746|
|    min|           100007447|      -6.4676264E7|
|    max|            99999754|    4.8358281744E8|
+-------+--------------------+------------------+
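
The min and max reported for id look odd (the "min" has more digits than the "max") because the column is a string and strings are compared character by character. A tiny plain-Python illustration, using a handful of ids for demonstration:

```python
ids = ["86246", "100007447", "99999754", "12682470"]

# String comparison goes character by character, so "1..." sorts before "8..." and "9...".
string_min, string_max = min(ids), max(ids)

# Comparing by integer value gives the numeric ordering instead.
numeric_min, numeric_max = min(ids, key=int), max(ids, key=int)
```

Here `string_min` is "100007447" and `string_max` is "99999754", matching the pattern in the table above, while the numeric ordering puts "86246" first and "100007447" last.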



In [29]:
#Finding the starting date (min_date) and the end date(max_date)
min_date, max_date = df2.select(min("date"), max("date")).first()
min_date, max_date

(datetime.date(2012, 3, 2), datetime.date(2013, 7, 28))

In [37]:
# calculate the difference in days between 2013-12-31 and the transaction date
sales_df=sales_df.withColumn("RecencyDays", expr("datediff('2013-12-31', date)"))
sales_df.show(20)

+-----+----------+-----+-----------+
|   id|      date|Sales|RecencyDays|
+-----+----------+-----+-----------+
|86246|2012-03-02| 7.59|        669|
|86246|2012-03-02| 1.59|        669|
|86246|2012-03-02| 5.99|        669|
|86246|2012-03-02| 1.99|        669|
|86246|2012-03-02|20.76|        669|
|86246|2012-03-02|  7.8|        669|
|86246|2012-03-02| 2.49|        669|
|86246|2012-03-02| 1.39|        669|
|86246|2012-03-02|  3.0|        669|
|86246|2012-03-02| 5.79|        669|
|86246|2012-03-02| 0.59|        669|
|86246|2012-03-02| 3.29|        669|
|86246|2012-03-02| 3.29|        669|
|86246|2012-03-02| 1.99|        669|
|86246|2012-03-02| 0.89|        669|
|86246|2012-03-02| 3.59|        669|
|86246|2012-03-02| 3.99|        669|
|86246|2012-03-02| 8.87|        669|
|86246|2012-03-02| 4.99|        669|
|86246|2012-03-02|  1.0|        669|
+-----+----------+-----+-----------+
only showing top 20 rows
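
The RecencyDays values can be cross-checked with plain Python date arithmetic. This is only a sketch of the same calculation; `recency_days` is a hypothetical helper, and the reference date is the one hard-coded in the cell above.

```python
from datetime import date

REFERENCE_DATE = date(2013, 12, 31)   # the fixed date used in the datediff call

def recency_days(purchase_date, reference=REFERENCE_DATE):
    """Days from purchase_date to the reference date, like datediff(reference, purchase_date)."""
    return (reference - purchase_date).days

first_row = recency_days(date(2012, 3, 2))      # earliest date in the data
latest_row = recency_days(date(2013, 7, 28))    # latest date in the data
```

`first_row` comes out as 669, matching the table. Since the latest transaction date is 2013-07-28, every row is at least 156 days old relative to this reference date.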

