## Section 1: Data Acquisition for Clustering and Propensity Modeling

### Overview
We begin by fetching the data needed for clustering and propensity modeling. Using Adobe's Customer Journey Analytics (CJA) API, we pull a dataset covering January 1, 2024, through January 31, 2024, broken down by person ID. The dataset includes the following metrics:
- Orders
- Revenue
- Visits
- Occurrences
- Time Spent

These metrics are key to understanding user behavior on the website and will serve as the foundation for our clustering and propensity modeling.

### Implementation Steps
1. **Configuration and Library Importation**: Import necessary libraries and configure the CJA API.
2. **Defining the Report Request**: Specify the data view ID, dimensions, metrics, and the global filter for the date range.
3. **Data Retrieval and DataFrame Creation**: Execute the report request and load the data into a DataFrame for analysis.
4. **Preliminary Data Review**: Display the first few rows of the DataFrame to ensure data has been successfully retrieved.

In [None]:
# Configurations and Imports
import cjapy
cjapy.importConfigFile("python_config.json")

# Initialize cjapy, data view, and date range
cja = cjapy.CJA()
data_view = "dv_62ba17d5a5d7845496f5fb4d"
# End at midnight on Feb 1 so that activity on Jan 31 is included
dateRange = "2024-01-01T00:00:00.000/2024-02-01T00:00:00.000"

# Define the report request with selected metrics and dimensions
myRequest = cjapy.RequestCreator()
myRequest.setDataViewId(data_view)
myRequest.setDimension("variables/adobe_personid")
myRequest.addMetric("metrics/orders")
myRequest.addMetric("metrics/revenue")
myRequest.addMetric("metrics/visits")
myRequest.addMetric("metrics/occurrences")
myRequest.addMetric("metrics/adobe_timespent")
myRequest.addGlobalFilter(dateRange)

# Execute the report request and load the data into a DataFrame
myReport = cja.getReport(myRequest)
df = myReport.dataframe

# Display the first few rows of the DataFrame to verify successful data retrieval
print(df.head())
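
The date range passed to `addGlobalFilter` is an ISO 8601 interval string. An interval whose end timestamp is midnight at the start of January 31 stops before that day's activity, so covering the whole month means ending at midnight on February 1. The sketch below builds such a string from two calendar dates. `build_date_range` is a hypothetical helper written for illustration and is not part of cjapy.

```python
from datetime import date, datetime, timedelta


def build_date_range(first_day: date, last_day: date) -> str:
    """Return an interval string that covers last_day in full.

    Hypothetical helper (not part of cjapy): the end bound is set to
    midnight of the day after last_day.
    """
    start = datetime.combine(first_day, datetime.min.time())
    end = datetime.combine(last_day + timedelta(days=1), datetime.min.time())
    fmt = "%Y-%m-%dT%H:%M:%S.000"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"


january = build_date_range(date(2024, 1, 1), date(2024, 1, 31))
print(january)
```

The resulting string can be assigned to `dateRange` in place of a hand-typed value.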

## Section 2: Clustering Model Preparation with t-SNE

### Introduction to Clustering Techniques

Clustering techniques uncover natural groupings in data by identifying similarities among data points and segmenting them into meaningful clusters. Before clustering, we apply a dimensionality reduction technique called t-SNE (t-Distributed Stochastic Neighbor Embedding), which is widely used for visualizing high-dimensional data in a lower-dimensional space.

For further reading on clustering techniques, this [Northwestern University resource](https://sites.northwestern.edu/researchcomputing/2022/03/14/online-learning-resources-clustering/) offers a comprehensive overview.

### Why t-SNE?

t-SNE excels in transforming high-dimensional data into a 2D or 3D space, making it an exceptional tool for visualizing complex datasets. By applying t-SNE, we can:
- **Visualize high-dimensional data** in a lower-dimensional space, enhancing interpretability.
- **Identify natural clusters** within the data, based on the similarity of data points.

### Implementing t-SNE for Website Visitor Data

Our analysis applies t-SNE to website visitor data, aiming to visualize user behaviors and interactions with the website in a two-dimensional space. This visualization assists in identifying clusters of similar visitor behaviors, facilitating a deeper understanding of user engagement patterns and interests on the website.

In [None]:
from sklearn.manifold import TSNE
import pandas as pd
import plotly.express as px

# Keep the person ID alongside the metrics in df, dropping only the internal 'itemId' column
df = myReport.dataframe.drop(columns=["itemId"])
# Build the feature matrix X from the numerical metric columns only
X = myReport.dataframe.drop(columns=['variables/adobe_personid', "itemId"])

# Create t-SNE instance (t-SNE is stochastic, so fix random_state to get the same layout on every run)
tsne = TSNE(n_components=2, random_state=42)

# Apply t-SNE
X_tsne = tsne.fit_transform(X)

# Convert the t-SNE results to a DataFrame
df_tsne = pd.DataFrame(X_tsne, columns=['tsne_1', 'tsne_2'])
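
t-SNE works from pairwise distances, so metrics on large scales (revenue, time spent) can dominate smaller ones (orders, visits) when raw columns are passed in. One common precaution, which the cell above does not take, is to compress and standardize the features first. The sketch below uses synthetic data with hypothetical column names. The log transform and `perplexity=10` are illustrative choices, not tuned settings.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the visitor metrics (hypothetical column names)
rng = np.random.default_rng(0)
n_users = 60
demo = pd.DataFrame({
    "orders": rng.poisson(0.3, n_users),
    "revenue": rng.gamma(2.0, 50.0, n_users),
    "visits": rng.poisson(4, n_users) + 1,
    "timespent": rng.gamma(2.0, 300.0, n_users),
})

# log1p tames the long right tails; StandardScaler then centers each
# column and scales it to unit variance so no single metric dominates.
scaled = StandardScaler().fit_transform(np.log1p(demo))

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(scaled)
print(embedding.shape)
```

With the real data, `X` would take the place of `demo`, and the DBSCAN `eps` value would likely need re-tuning because the embedding changes.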

## Section 3: Visualizing t-SNE Results

Having applied t-SNE to our website visitor data, we now visualize the transformed data. t-SNE has projected the dataset into two dimensions while aiming to keep points that were close neighbors in the original feature space close together. Because it prioritizes this local structure, the size of a cluster and the distance between well-separated clusters in the plot should not be read as exact measures. The visualization still lets us observe groupings of similar behaviors, offering insights into how users interact with the website.

### Visualization Objectives
Our primary goal with this visualization is to:
- Identify clusters of similar visitor behaviors.
- Understand the distribution and grouping of data points in the 2D space.
- Detect any outliers or unique patterns that emerge from the visualization.

This step is crucial for our exploratory data analysis, providing a foundation for more detailed cluster analysis and interpretation in the following sections.

In [None]:
# Create the scatter plot
fig = px.scatter(df_tsne, x='tsne_1', y='tsne_2', title='t-SNE Visualization')

# Equal aspect ratio ensures that one unit in the x-axis is equal to one unit in the y-axis
fig.update_xaxes(scaleanchor="y", scaleratio=1)
fig.update_yaxes(scaleanchor="x", scaleratio=1)

# Adjust the layout for better readability
fig.update_layout(height=600)

# Display the plot
fig.show()

## Section 4: Enhancing Clustering with DBSCAN

After mapping the website visitors' behaviors into a visual and understandable format using t-SNE, we further refine our analysis by applying DBSCAN (Density-Based Spatial Clustering of Applications with Noise). This clustering technique is particularly effective for identifying high-density clusters and distinguishing outliers, providing a nuanced understanding of visitor groupings.

### Why DBSCAN?

DBSCAN stands apart due to its ability to form clusters based on density, making it adept at handling:
- **Variably shaped clusters**: Unlike k-means, DBSCAN does not assume clusters to be spherical.
- **Noise and outliers**: DBSCAN can identify and separate outliers from core clusters.

By integrating DBSCAN with our t-SNE results, we aim to:
- **Identify core clusters** of similar visitor behaviors.
- **Detect outliers** that do not fit into any primary cluster.
- **Understand the spatial distribution** of clusters and outliers in the 2D t-SNE space.

This step is crucial for our clustering analysis, offering a comprehensive view of visitor behaviors and interactions on the website.

In [None]:
from sklearn.cluster import DBSCAN

# Create DBSCAN instance with customized parameters
dbscan = DBSCAN(eps=4.0, min_samples=20)

# Fit DBSCAN to the t-SNE results
dbscan.fit(df_tsne)

# Add the cluster labels to the original DataFrame
df['cluster'] = dbscan.labels_

# Convert the 'cluster' column to a categorical type for better visualization
df['cluster'] = df['cluster'].astype('category')
# Attach the t-SNE coordinates so each row carries its 2D position
df = pd.concat([df, df_tsne], axis=1)

# Create the scatter plot with DBSCAN clusters
fig = px.scatter(df, x='tsne_1', y='tsne_2', color='cluster', title='t-SNE Visualization with DBSCAN Clusters')

# Ensure equal aspect ratio for accurate representation
fig.update_xaxes(scaleanchor="y", scaleratio=1)
fig.update_yaxes(scaleanchor="x", scaleratio=1)

# Adjust the layout for clarity
fig.update_layout(height=600)

# Display the plot
fig.show()
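
Alongside the plot, it helps to check how many clusters DBSCAN found and how many points it set aside as noise. The sketch below does this on synthetic 2D points rather than the real embedding. The blob positions, `eps=1.5`, and `min_samples=5` are assumptions chosen for the toy data, not recommendations for the visitor dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus three isolated points
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.2, size=(100, 2))
strays = np.array([[5.0, -20.0], [-20.0, 5.0], [30.0, 30.0]])
points = np.vstack([blob_a, blob_b, strays])

labels = DBSCAN(eps=1.5, min_samples=5).fit(points).labels_

# DBSCAN marks noise with the label -1; every other label is a cluster ID.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Applied to the notebook's data, `dbscan.labels_` would replace `labels`. A large noise share or a single giant cluster usually suggests revisiting `eps` and `min_samples`.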

## Section 5: Enhancing Analysis with 3D Visualization

Following our exploration through t-SNE and DBSCAN clustering, we extend our analysis by incorporating a three-dimensional (3D) plot. This approach allows us to integrate an additional metric, such as revenue, to examine how these clusters perform concerning a specific metric of interest.

### Objectives of 3D Visualization
- **Depth of Insight**: By adding a third dimension, we can uncover patterns and distinctions not visible in 2D visualizations.
- **Metric Integration**: Incorporating metrics like revenue enables us to identify high-value clusters or behaviors.
- **Interactive Exploration**: A 3D plot allows users to rotate and zoom, providing a dynamic way to explore the data.

This visualization not only serves to enhance our understanding of the clusters formed but also invites us to investigate how specific metrics vary across different clusters, offering a comprehensive view of user behavior and its impact on revenue generation.

In [None]:
# Create the 3D scatter plot
fig = px.scatter_3d(df, x='tsne_1', y='tsne_2', z='metrics/revenue', color='cluster', title='3D t-SNE Visualization with DBSCAN Clusters')

# Adjust the layout for a better viewing experience
fig.update_layout(height=700)

# Display the plot
fig.show()
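
The 3D plot shows revenue per cluster visually, and a small summary table can make the same comparison numerically. The sketch below uses a hand-made DataFrame that mimics the `cluster` and `metrics/revenue` columns. The values are invented for illustration.

```python
import pandas as pd

# Toy data: cluster -1 is DBSCAN's noise label
demo = pd.DataFrame({
    "cluster": pd.Categorical([0, 0, 1, 1, 1, -1]),
    "metrics/revenue": [100.0, 300.0, 0.0, 0.0, 30.0, 5000.0],
})

# observed=True limits the result to categories that actually occur
summary = (
    demo.groupby("cluster", observed=True)["metrics/revenue"]
    .agg(users="size", mean_revenue="mean", total_revenue="sum")
    .sort_values("mean_revenue", ascending=False)
)
print(summary)
```

On the real data, swapping `demo` for `df` gives one row per cluster, which makes it easier to spot high-value groups before interpreting the plot.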

## Section 6: Building a Propensity Model

With the clustering analysis providing insights into user behaviors and groupings, our next step is to use this understanding to predict future actions. Specifically, we aim to build a propensity model to identify users who have not yet made a purchase but show a high likelihood of converting in the future.

### Objectives of the Propensity Model
- **Predictive Analysis**: Utilize historical data to predict the probability of users making a purchase.
- **Targeting Strategy**: Enable targeted marketing strategies by identifying users with high conversion potential.
- **Data Preparation**: Format and prepare our dataset, focusing on metrics that contribute to purchase propensity without directly revealing purchase information.

This model will help us not only in understanding current user behaviors but also in forecasting future actions, allowing for more informed decision-making and strategic planning.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Convert 'metrics/orders' into a binary target variable for whether a purchase was made
df['purchase_made'] = (df['metrics/orders'] > 0).astype(int)

# Build the feature matrix: drop the target and the columns that leak it (orders, revenue),
# plus the person ID and the cluster/t-SNE columns derived earlier
X_propensity = df.drop(columns=['metrics/orders', 'metrics/revenue', 'variables/adobe_personid', 'cluster', 'tsne_1', 'tsne_2', 'purchase_made'])
y_propensity = df['purchase_made']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_propensity, y_propensity, test_size=0.2, random_state=42)

# Training the logistic regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)

# Display the features and target variable to ensure proper data preparation
print(X_train.head())
print(y_train.head())
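
Purchasers are often a small minority of visitors, in which case a plain random split can leave the test set with very few positives. A possible variation, sketched below on synthetic data, stratifies the split on the target, standardizes the features, and weights the classes inversely to their frequency. The feature names, the synthetic purchase rule, and `class_weight="balanced"` are assumptions for illustration, not part of the original model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n_users = 500
visits = rng.poisson(3, n_users) + 1
timespent = rng.gamma(2.0, 200.0, n_users) * visits
X_demo = np.column_stack([visits, timespent])

# Synthetic target: buyers are rare and more likely among frequent visitors.
buy_prob = 1 / (1 + np.exp(-(visits - 8)))
y_demo = (rng.random(n_users) < buy_prob).astype(int)

# stratify keeps the share of purchasers the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_tr, y_tr)

print(f"positive rate, train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```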

## Section 7: Model Evaluation Metrics

After training our logistic regression model to predict the likelihood of a website visitor making a purchase, we evaluate the model's performance using several key metrics. These metrics provide insight into how well our model performs across different aspects such as accuracy, precision, recall, and the ability to distinguish between visitors who will and will not make a purchase.

### Key Evaluation Metrics
- **Accuracy**: Measures the proportion of total predictions that are correct.
- **ROC AUC**: Reflects the model's ability to discriminate between positive and negative classes.
- **Classification Report**: Includes precision, recall, and F1-score for a detailed performance analysis.

Understanding these metrics is crucial for interpreting the model's effectiveness and identifying areas for improvement.
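
To see why accuracy alone can mislead when purchasers are rare, consider a toy case computed by hand. The labels below are invented for illustration: 5 purchasers out of 100 visitors, and a "model" that always predicts no purchase.

```python
# 95 visitors who did not purchase, 5 who did
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts "no purchase"
y_hat = [0] * 100

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

accuracy_demo = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
recall_demo = true_pos / (true_pos + false_neg)

print(f"accuracy: {accuracy_demo:.2f}, recall for purchasers: {recall_demo:.2f}")
```

The trivial model scores 95% accuracy while finding none of the purchasers, which is why the classification report and ROC AUC are reviewed alongside accuracy.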

In [None]:
import numpy as np
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, roc_curve, auc, confusion_matrix

# Predict on the test set and calculate evaluation metrics
y_pred = logistic_model.predict(X_test)
y_proba = logistic_model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f"Accuracy: {accuracy}")
print(f"ROC AUC: {roc_auc}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Calculate the ROC curve points for plotting (roc_auc above is the area under this curve)
fpr, tpr, _ = roc_curve(y_test, y_proba)

# Calculate and normalize confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]

# Visualization setup
fig = make_subplots(rows=1, cols=2, subplot_titles=('ROC Curve', 'Normalized Confusion Matrix'), horizontal_spacing=0.2)

# ROC Curve
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'ROC curve (AUC = {roc_auc:.2f})', line=dict(color='darkorange')), row=1, col=1)
fig.add_shape(type='line', line=dict(dash='dash', color='navy'), x0=0, x1=1, y0=0, y1=1, row=1, col=1)

# Normalized Confusion Matrix
conf_mat_fig = ff.create_annotated_heatmap(z=conf_mat_normalized, x=['Not Purchased', 'Purchased'], y=['Not Purchased', 'Purchased'],
                                           colorscale='Blues', annotation_text=np.round(conf_mat_normalized, 2), showscale=True)
for trace in conf_mat_fig.data:
    fig.add_trace(trace, row=1, col=2)

# Final adjustments and display
fig.update_layout(title_text='Model Evaluation: ROC Curve and Normalized Confusion Matrix', width=1200)
fig.show()

## Section 8: Identifying High Propensity Users

With our propensity model in place, the next step is to apply this model to our dataset to identify users who have not yet made a purchase but show a high likelihood of doing so in the future. This step is crucial for targeting and engagement strategies, allowing us to focus our efforts on users most likely to convert.

### Objectives
- **Application of the Propensity Model**: Use the model to predict purchase probabilities for all users.
- **Filtering High Propensity Users**: Identify users with a high probability of making a purchase, focusing our marketing and engagement strategies on this group.
- **Visualization**: Create a histogram to visualize the distribution of purchase probabilities, helping us understand the overall propensity landscape of our user base.

In [None]:
# Predict the probabilities for the entire dataset
purchase_probabilities = logistic_model.predict_proba(X_propensity)

# Extract the probabilities of making a purchase
probabilities_of_purchase = purchase_probabilities[:, 1]

# Initialize the 'purchase_probability' column with NaNs
df['purchase_probability'] = np.nan

# Update 'purchase_probability' only for rows where no purchase has been made
no_purchase = (df['metrics/orders'] == 0).to_numpy()
df.loc[no_purchase, 'purchase_probability'] = probabilities_of_purchase[no_purchase]

# Filter out NaN values from 'purchase_probability' to focus on non-purchased users
filtered_probabilities = df['purchase_probability'].dropna()

# Create a histogram of purchase probabilities between 0 and 100%
fig = px.histogram(filtered_probabilities, range_x=[0, 1],
                   labels={'value': 'Purchase Probability'},
                   title='Distribution of Purchase Probabilities for Users Who Have Not Yet Purchased')

# Count the visitors in each of the four buckets (include_lowest keeps probabilities of exactly 0)
bin_counts = pd.cut(filtered_probabilities, bins=np.linspace(0, 1, 5), include_lowest=True).value_counts().sort_index()
print(bin_counts)

# Use four bins of width 0.25 in the histogram to match the buckets above
fig.update_traces(xbins=dict(
        start=0,
        end=1,
        size=0.25
    ))

fig.show()
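
Once the distribution looks reasonable, the high-propensity group can be pulled out with a simple threshold. The sketch below uses a small invented DataFrame with the same column names as `df`. The cutoff of 0.75 is a hypothetical value that would normally be chosen from the histogram and the campaign's capacity.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "variables/adobe_personid": ["a", "b", "c", "d", "e"],
    "metrics/orders": [0, 2, 0, 0, 0],
    # NaN marks a user who already purchased and therefore has no score
    "purchase_probability": [0.82, np.nan, 0.15, 0.91, 0.40],
})

THRESHOLD = 0.75  # hypothetical cutoff, not a recommended value

# Comparisons with NaN are False, so existing purchasers drop out here
high_propensity = (
    demo[demo["purchase_probability"] >= THRESHOLD]
    .sort_values("purchase_probability", ascending=False)
    .reset_index(drop=True)
)
print(high_propensity[["variables/adobe_personid", "purchase_probability"]])
```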

## Section 9: Exporting Data for Further Analysis and Targeting

The final step in our analysis exports the cluster assignments and propensity scores to a JSON Lines file (one JSON object per line). This file can then be uploaded to Adobe Experience Platform (AEP) for detailed analysis in Analysis Workspace, or used in Customer Data Platforms (CDP) for targeted audience engagement. Exporting this data allows us to operationalize our findings, applying the insights gained from clustering and propensity modeling to enhance marketing strategies and customer experiences.

### Objectives
- **Data Export**: Compile the cluster assignments and propensity scores into a structured JSON format.
- **Integration with AEP and CDP**: Enable seamless integration of our analytical outcomes with marketing platforms for actionable insights.
- **Operationalization of Insights**: Leverage the exported data for targeted marketing campaigns, personalized customer engagement, and strategic decision-making.

### Preparing for Data Export
Before exporting your data, it's essential to obtain your **tenant ID** from AEP, which serves as a prefix for loading data into the platform. Your tenant ID is readily available within the AEP Schemas UI when creating a new schema. Alternatively, for a detailed guide on locating your tenant ID and understanding the foundational elements of AEP, refer to Adobe's documentation: [Getting Started with Adobe Experience Platform](https://experienceleague.adobe.com/docs/experience-platform/xdm/api/getting-started.html?lang=en).

By completing this step, we bridge the gap between data analysis and practical application, ensuring that the valuable insights derived from our models are readily accessible for strategic initiatives.

In [None]:
import json

# Define the output file path
output_file_path = "cluster_and_propensities.json"

# Prepare data for export
data_to_export = []
for index, row in df.iterrows():
    # Initialize the JSON object structure
    # Make sure to replace "_tenantid" with your company's actual tenant ID!
    json_object = {
        "_tenantid": {
            "personID": row['variables/adobe_personid'],
            "clusterid": int(row['cluster'])  # Ensure cluster ID is integer
        }
    }

    # Add propensity score if available and not NaN
    propensity = row.get('purchase_probability')
    if pd.notnull(propensity):
        json_object["_tenantid"]["propensity"] = round(propensity, 2)
    
    # Append the JSON object to the list
    data_to_export.append(json_object)

# Write data to a JSON Lines file
with open(output_file_path, 'w') as file:
    for item in data_to_export:
        file.write(json.dumps(item) + '\n')

print(f"Data successfully exported to {output_file_path} - note that users who have previously made a purchase will not have a propensity score in the output.")
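
Before uploading, it can be worth reading the file back to confirm that every line parses as a standalone JSON object. The sketch below writes two invented records in the same shape as the export and then validates them. The temporary path and sample values are assumptions for illustration.

```python
import json
import os
import tempfile

# Two sample records shaped like the export ("_tenantid" is a placeholder)
records = [
    {"_tenantid": {"personID": "a", "clusterid": 0, "propensity": 0.82}},
    {"_tenantid": {"personID": "b", "clusterid": -1}},
]

path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

# Read the file back: each non-empty line must parse as one JSON object.
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

missing_scores = sum("propensity" not in rec["_tenantid"] for rec in loaded)
print(f"{len(loaded)} records, {missing_scores} without a propensity score")
```

Pointing `path` at `output_file_path` runs the same check on the real export.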