# Customer Purchase Value Prediction

This notebook predicts customer purchase value from multi-session behavior across digital touchpoints. We analyze session attributes such as browser type, traffic source, device details, and geographic indicators to estimate purchase potential and inform marketing strategy.

## 1. Required Libraries

First, let's import all the necessary libraries for our analysis.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
# plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in Matplotlib 3.6

## 2. Load and Explore Dataset

Let's load the training and testing datasets and examine their basic properties.

In [24]:
# Load the datasets
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')

# Display basic information about the training data
print('Training Data Shape:', train_data.shape)
print('\nBasic Information:')
train_data.info()

# print('\nFirst few rows of the training data:')
# train_data.head()

Training Data Shape: (116023, 52)

Basic Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116023 entries, 0 to 116022
Data columns (total 52 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   trafficSource.isTrueDirect                    42890 non-null   object 
 1   purchaseValue                                 116023 non-null  float64
 2   browser                                       116023 non-null  object 
 3   device.screenResolution                       116023 non-null  object 
 4   trafficSource.adContent                       2963 non-null    object 
 5   trafficSource.keyword                         44162 non-null   object 
 6   screenSize                                    116023 non-null  object 
 7   geoCluster                                    116023 non-null  object 
 8   trafficSource.adwordsClickInfo.slot           4281 non-null    object 

## 3. Data Preprocessing

In this section, we'll:
1. Handle missing values
2. Encode categorical variables
3. Standardize numerical features
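
The three steps above can be sketched as follows. This is a minimal illustration on a tiny invented frame, not the notebook's actual pipeline: the column names `trafficSource.isTrueDirect`, `browser`, and `trafficSource.keyword` come from the `info()` output, while `pageViews`, the `"missing"` placeholder, and the rule that a missing `isTrueDirect` means "not direct" are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for train_data; all values are invented for illustration.
df = pd.DataFrame({
    "trafficSource.isTrueDirect": [True, None, None, True],
    "browser": ["Chrome", "Safari", "Chrome", "Firefox"],
    "trafficSource.keyword": ["(not provided)", None, None, "shoes"],
    "pageViews": [3.0, np.nan, 10.0, 1.0],  # hypothetical numeric column
})

# 1. Handle missing values.
# Assumption: a missing isTrueDirect flag means the visit was not direct.
df["trafficSource.isTrueDirect"] = (
    df["trafficSource.isTrueDirect"].eq(True).astype(int)
)

# 2. Encode categorical variables, keeping missingness as its own category.
cat_cols = ["browser", "trafficSource.keyword"]
encoders = {}
for col in cat_cols:
    filled = df[col].fillna("missing")
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(filled)

# 3. Standardize numerical features after median imputation.
num_cols = ["pageViews"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df)
```

Keeping the fitted `encoders` and `scaler` around lets the same transformations be applied to the test set later. Tree-based models such as Random Forest do not depend on feature scale, so step 3 mainly matters if a scale-sensitive model is tried as well.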

## 4. Feature Engineering

Let's create new features based on the existing data to improve our model's predictive power.
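
As one possible starting point, the sketch below derives two kinds of features from columns that appear in the `info()` output: presence flags for the sparsely populated `trafficSource.adContent` and `trafficSource.keyword` columns, and a frequency encoding of `browser`. The new feature names (`has_adContent`, `has_keyword`, `browser_freq`) and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_data; values are invented for illustration.
toy = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome", "Firefox"],
    "trafficSource.adContent": [None, "Ad A", None, None],
    "trafficSource.keyword": ["shoes", None, None, "hats"],
})

features = toy.copy()

# Presence flags: mostly-empty columns can still carry signal
# through whether a value was recorded at all.
features["has_adContent"] = features["trafficSource.adContent"].notna().astype(int)
features["has_keyword"] = features["trafficSource.keyword"].notna().astype(int)

# Frequency encoding: replace each browser by its share of sessions.
browser_share = features["browser"].value_counts(normalize=True)
features["browser_freq"] = features["browser"].map(browser_share)

print(features[["has_adContent", "has_keyword", "browser_freq"]])
```

When applying this to the real data, the browser shares would be computed on the training set only and then mapped onto the test set, so that no test information leaks into the features.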

## 5. Exploratory Data Analysis (EDA)

Let's visualize key patterns and relationships in our data.
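
A minimal sketch of two plots that could open this section: the distribution of the target and its mean per browser. The data here is a small invented frame, so the plotted numbers say nothing about the real dataset. The `log1p` transform is an assumption that only helps if `purchaseValue` turns out to be heavily skewed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for train_data; values are invented for illustration.
toy = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome", "Firefox", "Safari", "Chrome"],
    "purchaseValue": [0.0, 0.0, 120.0, 0.0, 40.0, 0.0],
})

# Share of sessions without a purchase, and mean purchase value per browser.
zero_share = (toy["purchaseValue"] == 0).mean()
mean_by_browser = (
    toy.groupby("browser")["purchaseValue"].mean().sort_values(ascending=False)
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: target distribution on a log1p scale to compress large values.
axes[0].hist(np.log1p(toy["purchaseValue"]), bins=10)
axes[0].set_title("log1p(purchaseValue)")
axes[0].set_xlabel("log1p(purchaseValue)")
axes[0].set_ylabel("Sessions")

# Right: mean purchase value per browser.
axes[1].bar(mean_by_browser.index, mean_by_browser.values)
axes[1].set_title("Mean purchaseValue by browser")
axes[1].set_ylabel("Mean purchaseValue")

fig.tight_layout()
print(f"Share of zero-value sessions: {zero_share:.2%}")
```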

## 6. Model Training

Let's prepare our data and train a Random Forest model.
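
The training step might look like the sketch below, which uses the `train_test_split` and `RandomForestRegressor` imports from Section 1. The feature matrix is synthetic, and the feature names, the 80/20 split, and the hyperparameters are illustrative assumptions rather than tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features (hypothetical columns).
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "browser_code": rng.integers(0, 3, size=200),
    "has_keyword": rng.integers(0, 2, size=200),
    "pageViews": rng.integers(1, 20, size=200),
})
# Synthetic target with a known dependence on two of the features.
y = 5.0 * X["pageViews"] + 10.0 * X["has_keyword"] + rng.normal(0, 1, size=200)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the Random Forest; random_state makes the run reproducible.
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

print("Validation R^2:", model.score(X_val, y_val))
```

With the real data, `X` would be the encoded feature columns of `train_data` and `y` would be `train_data['purchaseValue']`.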

## 7. Model Evaluation

Let's evaluate our model's performance using various metrics.
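
The metrics imported in Section 1 can be computed as in this short sketch. The `y_true` and `y_pred` arrays are invented numbers standing in for the validation targets and the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values standing in for validation targets and predictions.
y_true = np.array([0.0, 10.0, 20.0, 30.0])
y_pred = np.array([2.0, 8.0, 22.0, 28.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as purchaseValue
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R^2:  {r2:.3f}")
```

RMSE penalizes large errors more strongly than MAE, so reporting both gives a sense of whether a few sessions dominate the error.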

## 8. Predictions and Submission

Let's generate predictions for the test data and create a submission file.
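
A minimal sketch of the prediction and export step is shown below. The submission layout (an `id` column plus `purchaseValue`), the file name `submission.csv`, and the toy features are assumptions; the required format depends on where the predictions are submitted.

```python
import os
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the preprocessed train and test sets (invented values).
feature_cols = ["browser_code", "has_keyword"]
toy_train = pd.DataFrame({
    "browser_code": [0, 1, 2, 0, 1, 2],
    "has_keyword": [1, 0, 1, 0, 1, 0],
    "purchaseValue": [50.0, 0.0, 30.0, 0.0, 20.0, 0.0],
})
toy_test = pd.DataFrame({
    "browser_code": [0, 2, 1],
    "has_keyword": [1, 0, 0],
})

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(toy_train[feature_cols], toy_train["purchaseValue"])

# The test set must use the same feature columns, in the same order.
submission = pd.DataFrame({
    "id": toy_test.index,  # hypothetical identifier column
    "purchaseValue": model.predict(toy_test[feature_cols]),
})

# Written to the temp directory here so the sketch leaves the project untouched.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path, index=False)
print(submission)
```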