/ / / / / / / / / / / / / / / / / / / / / / /

CALCULATING LIFETIME VALUE 

/ / / / / / / / / / / / / / / / / / / / / / / 

Now that our dataset is cleaned and explored, we want to develop a sense of reviewer "lifetime value" and, more importantly, of the factors that tend to determine it. That way, we can use the initial attributes of a reviewer to predict their long-term lifetime value; in a real-world business setting, this would help determine how much to invest in that customer.

In our case, we decided to use the total amount a user has spent over their history in this dataset (sum of the 'price' column for all products they've reviewed) as their lifetime value. We want to use attributes about the first review they write to get a sense of whether or not they are likely to have a high/low lifetime value. 

In order to get there, the first thing we'll do is some feature engineering to create the necessary attributes: 

- LTV: Total money each customer has spent on the products they reviewed
- Star rating for customers' first review
- Length of customers' first review
- Price of the first product customer reviewed
- Main category of the first product customer reviewed 

We'll run a regression model with the total money a customer spent (LTV) as the dependent variable and the star rating, length of first review, price of first product reviewed, and main category of the first product reviewed as the independent variables. Note that the model fitted below uses only the three numeric attributes; the main category would first need to be encoded numerically before it could be added.

From there, we hope to use the model to predict a customer's lifetime value from their first review (purchase).
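The feature definitions above can be sketched on a tiny in-memory dataset before moving to Spark. This is only an illustration: the three toy reviews, the `date` field, and the helper name `build_ltv_features` are assumptions made for the example and are not part of the notebook's data or code.

```python
from collections import defaultdict

# Toy review log (hypothetical values, not taken from the real dataset)
reviews = [
    {"reviewerID": "A1", "date": "2016-01-05", "overall": 5.0,
     "reviewText": "Works great", "main_cat": "Computers", "price": 20.0},
    {"reviewerID": "A1", "date": "2016-03-10", "overall": 3.0,
     "reviewText": "Just OK", "main_cat": "Camera & Photo", "price": 35.5},
    {"reviewerID": "B2", "date": "2016-02-01", "overall": 1.0,
     "reviewText": "Broke in a week", "main_cat": "Camera & Photo", "price": 8.0},
]

def build_ltv_features(rows):
    """Return one record per reviewer: first-review attributes plus total spend."""
    total_spend = defaultdict(float)
    first_review = {}
    for row in rows:
        rid = row["reviewerID"]
        # LTV: sum of prices over every product the reviewer reviewed
        total_spend[rid] += row["price"]
        # ISO-formatted dates sort correctly as strings, so "<" finds the earliest
        if rid not in first_review or row["date"] < first_review[rid]["date"]:
            first_review[rid] = row
    return {
        rid: {
            "first_purchase_rating": r["overall"],
            "first_review_length": len(r["reviewText"]),
            "first_purchase_category": r["main_cat"],
            "first_purchase_price": r["price"],
            "Total_Spend": total_spend[rid],
        }
        for rid, r in first_review.items()
    }

features = build_ltv_features(reviews)
print(features["A1"])
```

Reviewer `A1` ends up with the attributes of the January review and a `Total_Spend` covering both purchases, which is the shape of the table the Spark cells below produce.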

In [1]:
import os
os.chdir("c:\\Users\\submi\\Downloads")

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyspark
import altair as alt
from pyspark import sql
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
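# Note: this wildcard import shadows Python built-ins such as sum, min, and max;
# the cells below call Spark functions through the F alias instead
from pyspark.sql.functions import *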


In [3]:
# Create Spark session 
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('Milestone_I') \
    .getOrCreate()

sc = spark.sparkContext

In [4]:
df = spark.read.parquet("electronics_cleaned.parquet")
df.show()

+----------+-------+--------------+--------------+--------------------+-------------+--------------------+-----+--------------------+-----+----+
|      asin|overall|unixReviewTime|    reviewerID|          reviewText|        brand|            main_cat|price|               title|month|year|
+----------+-------+--------------+--------------+--------------------+-------------+--------------------+-----+--------------------+-----+----+
|1935009354|    5.0|    2016-09-26|A1I4L5LK9BFWO0|Works really grea...|Mighty Bright|         Amazon Home|11.99|Mighty Bright 426...|    9|2016|
|B006PHFGP6|    5.0|    2016-04-19|A1PLZ5PND29KNJ|Worked just great...|       Visico|      Camera & Photo| 8.95|2 x Visico 110v 2...|    4|2016|
|1935009354|    1.0|    2016-09-15|A3Q9HY0LU7MN6E|Right out the pac...|Mighty Bright|         Amazon Home|11.99|Mighty Bright 426...|    9|2016|
|B006QQKII6|    3.0|    2016-07-01|A2DJOAED6QFO88|OK, feels cheap a...|  Simply type|Health & Personal...| 2.54|Hebrew &amp; Engl.

In [5]:
import pyspark.sql.functions as F

review_data_user_spend = df.groupBy("reviewerID").agg(F.sum("price").alias("Total_Spend"))

review_data_user_spend.show()

+--------------+------------------+
|    reviewerID|       Total_Spend|
+--------------+------------------+
|A3VIGZNL8QGKAH|            121.77|
| AV55A16Y32PZM|            390.85|
|A2YO4SCWAWNYBI| 95.13000000000001|
|A1LYN3ZK230TGE|            1155.6|
|A2HYT1FVBQ3XDL|             55.19|
| AB5G8SRIV97L5|119.78999999999998|
|A1CATUFGPZP98O|            297.25|
|A3Q9AL26B3BZXK|             76.32|
|A2643K385LH6EM|             68.95|
|A1LUO9ZDG69BC4|204.45999999999998|
| AVPC2FJ6FXR5J|176.49999999999997|
|A37OJC78HSOXXN|160.00000000000003|
|A37B5C0DOO3PWR|            296.08|
|A2VTWAS18HZAKU|            756.04|
|A3MQLIDMWKLP6A|122.16999999999999|
|A3UHZUM0N424XM|107.36000000000001|
|A2TNDCBQ6HH9FZ|             75.19|
|A23P8ZK8J6Y5PW|53.510000000000005|
| AREHX0RYDJK96|             23.07|
| A7KT9OJ6ZBRM4|317.84999999999997|
+--------------+------------------+
only showing top 20 rows



In [6]:
# Create a dataframe to capture just the first review for each reviewer as well as additional features 

from pyspark.sql.window import Window

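# Rank each reviewer's reviews by date and keep the earliest one
# (row_number breaks same-day ties arbitrarily)
first_review_window = Window.partitionBy("reviewerID").orderBy(F.col("unixReviewTime"))
review_data_first_review = df.withColumn("row", F.row_number().over(first_review_window)) \
    .filter(F.col("row") == 1).drop("row")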

# Add a feature that shows the length of the text of each review
review_data_first_review = review_data_first_review.withColumn("review_length", F.length("reviewText"))

# Select only the columns that matter to us; Drop rows with any na's in them 
review_data_first_review = review_data_first_review.select("reviewerID", "overall", "review_length", "main_cat", "price")
review_data_first_review = review_data_first_review.na.drop()

In [7]:
# Combine first review dataset with the total spend dataset to develop the final dataframe we will use for our regression models 
review_final_LTV = review_data_first_review.join(review_data_user_spend, ['reviewerID'], how = "left_outer")
review_final_LTV = review_final_LTV.withColumnRenamed("overall", "first_purchase_rating") \
    .withColumnRenamed("price", "first_purchase_price") \
    .withColumnRenamed("main_cat", "first_purchase_category") \
    .withColumnRenamed("review_length", "first_review_length")
review_final_LTV.show()

+--------------------+---------------------+-------------------+-----------------------+--------------------+------------------+
|          reviewerID|first_purchase_rating|first_review_length|first_purchase_category|first_purchase_price|       Total_Spend|
+--------------------+---------------------+-------------------+-----------------------+--------------------+------------------+
|A0232309XTWAG6FEGW93|                  5.0|                266|         Camera & Photo|               12.01| 65.89999999999999|
|A0243759LWJA50LV06FT|                  5.0|               2828|              Computers|                8.49|            144.73|
| A0263994QBSZJDIHLWE|                  5.0|                 66|   Cell Phones & Acc...|                1.96|             22.79|
|A02836981FYG9912C66F|                  5.0|               2051|              Computers|                9.99|             580.0|
|A02946063P41YZHKAQV6|                  5.0|                 25|              Computers|         

/ / / / / / / / / / / / / / / / / / / / /

DEVELOPING OUR LTV MODEL

/ / / / / / / / / / / / / / / / / / / / /

In [8]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols= ['first_purchase_rating','first_review_length','first_purchase_price'],
                                  outputCol= 'features')
test_vec_df = vectorAssembler.transform(review_final_LTV)
test_vec_df = test_vec_df.select(['features', 'Total_Spend'])
test_vec_df.count()


721505
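The assembler above uses only the three numeric columns, so `first_purchase_category` does not enter the model. One common way to bring a categorical column into a numeric feature vector is one-hot encoding; in PySpark this is usually done with `StringIndexer` and `OneHotEncoder` from `pyspark.ml.feature`. The plain-Python sketch below only illustrates the idea on two rows copied from the table above; the helper name `one_hot_rows` is hypothetical, and this is not the encoding the notebook actually ran.

```python
def one_hot_rows(rows, numeric_cols, cat_col):
    """Append one 0/1 indicator per category to each row's numeric features."""
    # Fix a stable category order so every vector has the same layout
    categories = sorted({row[cat_col] for row in rows})
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for row in rows:
        indicators = [0.0] * len(categories)
        indicators[index[row[cat_col]]] = 1.0
        vectors.append([float(row[c]) for c in numeric_cols] + indicators)
    return categories, vectors

# Two rows taken from the review_final_LTV output shown earlier
toy_rows = [
    {"first_purchase_rating": 5.0, "first_review_length": 266,
     "first_purchase_price": 12.01, "first_purchase_category": "Camera & Photo"},
    {"first_purchase_rating": 5.0, "first_review_length": 2828,
     "first_purchase_price": 8.49, "first_purchase_category": "Computers"},
]

numeric_cols = ["first_purchase_rating", "first_review_length", "first_purchase_price"]
categories, vectors = one_hot_rows(toy_rows, numeric_cols, "first_purchase_category")
print(categories)
print(vectors)
```

With an intercept in the model, one indicator is normally dropped to avoid perfect collinearity; Spark's `OneHotEncoder` does this by default through its `dropLast` parameter.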

In [9]:
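# Split the dataset into training (70%) and test (30%) sets.
# No seed is passed, so the split (and the numbers below) will vary between runs;
# randomSplit accepts an optional seed argument for reproducibility.
splits = test_vec_df.randomSplit([0.7, 0.3])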
train_df = splits[0]
test_df = splits[1]

In [10]:
# Train the model on our training set 

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = 'features', labelCol='Total_Spend', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [0.9499642372899711,0.03995774898882538,1.0274694155656496]
Intercept: 98.67039618960274
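To make the fitted numbers concrete, the sketch below applies the printed coefficients and intercept by hand to one feature vector (a 5-star, 274-character first review of a $34.95 product). The function name `predict_ltv` is an assumption for illustration; the coefficients are copied from the output above, in the same order as the assembler's `inputCols`.

```python
# Coefficients and intercept copied from the fitted model output above,
# ordered as [first_purchase_rating, first_review_length, first_purchase_price]
coefficients = [0.9499642372899711, 0.03995774898882538, 1.0274694155656496]
intercept = 98.67039618960274

def predict_ltv(rating, review_length, price):
    """Linear prediction: intercept plus the coefficient-weighted feature sum."""
    features = [rating, review_length, price]
    return intercept + sum(c * x for c, x in zip(coefficients, features))

example = predict_ltv(5.0, 274.0, 34.95)
print(round(example, 2))  # 150.28
```

Read this way, each additional star on the first review adds about $0.95 to predicted lifetime spend, each additional character of review text about $0.04, and each additional dollar of first-purchase price about $1.03, on top of a baseline of roughly $98.67.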


In [11]:
# Check how well the model explains variability in the training data
# (these metrics come from the training summary, not the held-out test set)

trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 274.590376
r2: 0.384534
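The two metrics above are computed on the training set: the model explains about 38% of the variance in `Total_Spend`, with a typical error of roughly $275. The sketch below spells out the standard definitions of RMSE and R² in plain Python so the same check could be repeated on held-out predictions; in PySpark, `RegressionEvaluator` from `pyspark.ml.evaluation` computes these on a predictions DataFrame. The helper name `rmse_and_r2` and the toy numbers are assumptions for illustration only.

```python
import math

def rmse_and_r2(actual, predicted):
    """Return (RMSE, R^2) for paired lists of actual and predicted values."""
    n = len(actual)
    # Residual sum of squares: squared error of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared error of always predicting the mean
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot

# Toy values (hypothetical) chosen so the arithmetic is easy to follow
toy_actual = [1.0, 2.0, 3.0]
toy_predicted = [1.0, 2.0, 4.0]
toy_rmse, toy_r2 = rmse_and_r2(toy_actual, toy_predicted)
print(toy_rmse, toy_r2)
```

An R² near 0 means the model does no better than predicting the mean spend for everyone, and a test-set R² well below the training value would point to overfitting.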


In [12]:
predictions = lr_model.transform(test_df)
predictions = predictions.select("prediction","Total_Spend","features")

In [32]:
predictions_pd = predictions.toPandas()
predictions_pd.sample(15)

        prediction  Total_Spend              features
 67895  150.278697        64.34   [5.0, 274.0, 34.95]
162032  139.221925         40.3    [5.0, 176.0, 28.0]
 97223  132.683717       187.65   [4.0, 256.0, 19.45]
 18577  113.472368        18.89     [5.0, 123.0, 5.0]
 22090  187.223553       140.53  [5.0, 1069.0, 39.99]
196822  160.017879       150.43   [5.0, 131.0, 49.99]
207462  114.677903        51.79     [5.0, 75.0, 8.04]
133169  124.543921       185.27    [4.0, 330.0, 8.65]
102371  144.896974        82.96   [5.0, 164.0, 33.99]
 91592   119.90099        53.58    [5.0, 270.0, 5.54]


In [31]:
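# Scatter plot of predicted vs. actual LTV for reviewers with Total_Spend <= 1500
# (sampled down to 4,999 points to keep the chart small)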

scratch_data = predictions_pd[predictions_pd['Total_Spend'] <= 1500][['prediction', 'Total_Spend']]

prediction_chart = alt.Chart(scratch_data.sample(4999), title = "Predicted LTV vs. Actual LTV").mark_point().encode(
    alt.X('prediction:Q', scale = alt.Scale(domain=[0,1500]), title = 'Predicted LTV'),
    alt.Y('Total_Spend:Q', scale = alt.Scale(domain=[0,1500]), title = 'Actual LTV (Total Spend)')
).properties(width = 500, height = 500)

prediction_chart 