# Customer Purchase Value Prediction

This notebook predicts customer purchase value from multi-session behavior across digital touchpoints. We analyze user interactions, including browser type, traffic source, device details, and geographic indicators, to estimate purchase potential and inform marketing strategy.

## 1. Required Libraries

First, let's import all the necessary libraries for our analysis.

In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Load and Explore Dataset

Let's load the training and testing datasets and examine their basic properties.

In [46]:
# Load the datasets
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')

# Display basic information about the training data
print('Training Data Shape:', train_data.shape)
# print('\nBasic Information:')
# train_data.info()

# print('\nFirst few rows of the training data:')
# train_data.head()

Training Data Shape: (116023, 52)


In [47]:
# Separate the feature columns from the prediction target
X = train_data.drop(columns=['purchaseValue'])
y = train_data['purchaseValue']

In [48]:
feature_names = X.columns.values
label = y.name
print('\nFeature Names:', feature_names)
print('Label Name:', label)


Feature Names: ['trafficSource.isTrueDirect' 'browser' 'device.screenResolution'
 'trafficSource.adContent' 'trafficSource.keyword' 'screenSize'
 'geoCluster' 'trafficSource.adwordsClickInfo.slot'
 'device.mobileDeviceBranding' 'device.mobileInputSelector' 'userId'
 'trafficSource.campaign' 'device.mobileDeviceMarketingName'
 'geoNetwork.networkDomain' 'gclIdPresent' 'device.operatingSystemVersion'
 'sessionNumber' 'device.flashVersion' 'geoNetwork.region' 'trafficSource'
 'totals.visits' 'geoNetwork.networkLocation' 'sessionId' 'os'
 'geoNetwork.subContinent' 'trafficSource.medium'
 'trafficSource.adwordsClickInfo.isVideoAd' 'browserMajor'
 'locationCountry' 'device.browserSize'
 'trafficSource.adwordsClickInfo.adNetworkType' 'socialEngagementType'
 'geoNetwork.city' 'trafficSource.adwordsClickInfo.page'
 'geoNetwork.metro' 'pageViews' 'locationZone' 'device.mobileDeviceModel'
 'trafficSource.referralPath' 'totals.bounces' 'date' 'device.language'
 'deviceType' 'userChannel' 'device.

In [49]:
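# Keep sessions not flagged as a bounce; rows where totals.bounces is NaN are also kept, since NaN != 1 is True
no_bounce = train_data[train_data['totals.bounces'] != 1]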

In [50]:
# Calculate correlation between pageViews and purchaseValue for users who did not bounce
corr = no_bounce['pageViews'].corr(no_bounce['purchaseValue'])
print("Correlation between pageViews and purchaseValue (no bounce):", corr)

Correlation between pageViews and purchaseValue (no bounce): 0.22547226150716898


In [51]:
# from pandas.plotting import scatter_matrix
# attribute_list = ['pageViews', 'totalHits', 'totals.visits', 'purchaseValue']
# scatter_matrix(
#     no_bounce[attribute_list],
#     alpha=0.2,
#     figsize=(10, 10),
#     diagonal='hist',
#     color='blue'
# )
# plt.suptitle('Scatter Matrix of Selected Attributes (No Bounce)', fontsize=16, fontweight='bold')
# plt.tight_layout()
# plt.show()

## 3. Data Preprocessing

In this section, we'll:
1. Handle missing values
2. Encode categorical variables
3. Normalize numerical features
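
The three steps above can be sketched as a single helper. This is a minimal illustration, not the notebook's final pipeline: the function name `preprocess`, the `"Unknown"` sentinel, median imputation, integer category codes, and z-score scaling are all assumptions chosen for brevity, and the toy frame only borrows the column names `pageViews` and `browser` from the dataset.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, encode, and scale a feature frame (hypothetical helper)."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    categorical_cols = out.columns.difference(numeric_cols)

    # 1. Missing values: median for numeric columns, a sentinel for the rest.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[categorical_cols] = out[categorical_cols].fillna("Unknown")

    # 2. Categorical encoding: replace each category with an integer code.
    for col in categorical_cols:
        out[col] = out[col].astype("category").cat.codes

    # 3. Normalization: z-score; constant columns keep a divisor of 1.
    std = out[numeric_cols].std(ddof=0).replace(0, 1.0)
    out[numeric_cols] = (out[numeric_cols] - out[numeric_cols].mean()) / std
    return out


# Toy data standing in for a slice of the training features.
toy = pd.DataFrame({
    "pageViews": [1.0, 5.0, np.nan, 12.0],
    "browser": ["Chrome", None, "Safari", "Chrome"],
})
clean = preprocess(toy)
print(clean)
```

On the real data, the imputation values, category codes, and scaling statistics would normally be computed on the training set and reused for the test set, and identifier columns such as `userId` and `sessionId` would probably be excluded before encoding.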

## 4. Feature Engineering

Let's create new features based on the existing data to improve our model's predictive power.
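
As a starting point, here is a hedged sketch of a few derived features. The helper `add_features` and every new column name are hypothetical, and it assumes that `date` holds `YYYYMMDD` values, which the notebook has not confirmed. The toy frame reuses the dataset's `userId`, `sessionNumber`, `pageViews`, and `date` column names.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, recurrence, and per-user features (hypothetical helper)."""
    out = df.copy()

    # Assumption: 'date' is stored as YYYYMMDD (int or str).
    parsed = pd.to_datetime(out["date"].astype(str), format="%Y%m%d")
    out["dayOfWeek"] = parsed.dt.dayofweek  # Monday = 0, Sunday = 6
    out["isWeekend"] = (out["dayOfWeek"] >= 5).astype(int)

    # A session number above 1 marks a returning visitor.
    out["isReturning"] = (out["sessionNumber"] > 1).astype(int)

    # Per-user aggregates capture behavior across sessions.
    out["userMeanPageViews"] = out.groupby("userId")["pageViews"].transform("mean")
    out["userSessionCount"] = out["userId"].map(out["userId"].value_counts())
    return out


# Toy data: user "a" has two sessions, user "b" has one.
toy = pd.DataFrame({
    "userId": ["a", "a", "b"],
    "sessionNumber": [1, 2, 1],
    "pageViews": [2.0, 6.0, 3.0],
    "date": [20230107, 20230109, 20230110],
})
engineered = add_features(toy)
print(engineered)
```

Per-user aggregates computed this way use every row in the frame, so on the real data they should be built from training rows only to avoid leaking information from the validation split.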

## 5. Exploratory Data Analysis (EDA)

Let's visualize key patterns and relationships in our data.
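
One possible first look, sketched under assumptions: the helper `plot_purchase_overview` is hypothetical, the log transform is a guess that purchase values are heavily right-skewed, and `deviceType` is used only because it appears in the feature list. The toy frame is illustrative, not real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_purchase_overview(df, group_col, target="purchaseValue"):
    """Plot the target distribution and its mean per group (hypothetical helper)."""
    fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))

    # Left panel: distribution of non-zero purchases on a log scale.
    buyers = df.loc[df[target] > 0, target]
    ax_hist.hist(np.log1p(buyers), bins=30)
    ax_hist.set_xlabel(f"log1p({target})")
    ax_hist.set_ylabel("Sessions")
    ax_hist.set_title("Non-zero purchase values")

    # Right panel: mean purchase value per category of group_col.
    summary = df.groupby(group_col)[target].agg(
        mean_value="mean",
        conversion_rate=lambda s: (s > 0).mean(),
    )
    summary["mean_value"].sort_values(ascending=False).plot.bar(ax=ax_bar)
    ax_bar.set_ylabel(f"Mean {target}")
    ax_bar.set_title(f"Mean {target} by {group_col}")

    fig.tight_layout()
    return summary


# Toy data standing in for train_data.
toy = pd.DataFrame({
    "deviceType": ["desktop", "desktop", "mobile", "mobile"],
    "purchaseValue": [0.0, 100.0, 0.0, 0.0],
})
summary = plot_purchase_overview(toy, "deviceType")
print(summary)
```

The returned `summary` table gives the same information numerically, which helps when a category has too few sessions for the bar height to be trustworthy.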

## 6. Model Training

Let's prepare our data and train a Random Forest model.
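
A minimal training sketch follows. It assumes a regression setup (scikit-learn's `RandomForestRegressor`) because `purchaseValue` is numeric; the helper name `train_random_forest`, the 80/20 split, and the hyperparameters are illustrative choices, and the synthetic data merely stands in for the preprocessed features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def train_random_forest(X, y, seed=42):
    """Hold out 20% for validation and fit a forest (hypothetical helper)."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(
        n_estimators=50,      # illustrative, not tuned
        min_samples_leaf=5,   # mild regularization against noisy leaves
        random_state=seed,
    )
    model.fit(X_train, y_train)
    return model, X_val, y_val


# Synthetic stand-in for the preprocessed feature matrix and target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "pageViews": rng.integers(1, 30, size=200),
    "sessionNumber": rng.integers(1, 5, size=200),
})
y_demo = X_demo["pageViews"] * 2.0 + rng.normal(0.0, 1.0, size=200)

model, X_val, y_val = train_random_forest(X_demo, y_demo)
print("Validation rows:", len(X_val))
```

Because the same `userId` can appear in many sessions, a random row-level split may place one user's sessions in both sets; a group-aware split such as scikit-learn's `GroupShuffleSplit` is worth considering for the real data.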

## 7. Model Evaluation

Let's evaluate our model's performance using various metrics.
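
The metrics are not yet fixed in this notebook, so the sketch below assumes three common regression metrics (RMSE, MAE, and R²) and computes them directly with NumPy. The helper name `regression_report` and the example values are hypothetical.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
        # R^2 is undefined when the true values are constant.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up values for illustration only.
report = regression_report([0, 0, 100, 50], [0, 10, 90, 50])
print(report)
```

RMSE penalizes large misses more heavily than MAE, which matters here if a small number of sessions carry very large purchase values.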

## 8. Predictions and Submission

Let's generate predictions for the test data and create a submission file.
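
The required submission layout is not stated in this notebook, so the following is only a sketch: the `id` column name, the helper `make_submission`, and the output file name are assumptions to adapt to the actual format. A tiny model fitted on toy data stands in for the trained Random Forest.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_submission(model, test_features, ids, target="purchaseValue"):
    """Predict on the test features and build a two-column frame (hypothetical helper)."""
    preds = model.predict(test_features)
    preds = np.clip(preds, 0, None)  # a purchase value cannot be negative
    return pd.DataFrame({"id": ids, target: preds})


# Toy stand-ins for the trained model and the preprocessed test set.
train_X = pd.DataFrame({"pageViews": [1, 2, 3, 4, 5, 6]})
train_y = [0.0, 0.0, 10.0, 20.0, 30.0, 40.0]
test_X = pd.DataFrame({"pageViews": [2, 5]})

demo_model = RandomForestRegressor(n_estimators=10, random_state=0)
demo_model.fit(train_X, train_y)

submission = make_submission(demo_model, test_X, ids=np.arange(len(test_X)))

# Written to a temporary directory here; use the competition's path in practice.
out_path = Path(tempfile.gettempdir()) / "submission.csv"
submission.to_csv(out_path, index=False)
print(submission)
```

The test features must pass through exactly the same preprocessing and feature engineering as the training features, with the same column order, before `predict` is called.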