# Machine Learning - Reseller Classification

## Project Scope:

The purpose of this project is to distinguish the resellers from the customers who purchased on the US website. Resellers harm the brand reputation, product market price, and inventory management of the company. Therefore, the company wants to identify and block the resellers' shipments.

The project is a binary classification problem that may require feature engineering. The data source is a structured database table containing numerical and categorical features.

Models will be trained with the PySpark MLlib and Keras deep learning frameworks, and the resulting classifiers will then be evaluated against each other. This notebook covers three PySpark MLlib models: Logistic Regression, Random Forest, and Gradient Boosted Trees (GBT). The Keras deep learning models are covered in subsequent notebooks.

## Dataset column descriptions:

- 'sales\_channel\_id': sales channel ID (the US channel is 1), integer
- 'external\_customer\_id': customer ID, integer
- 'email': customer email address, string
- 'last\_shipping\_address\_address1': shipping address line 1 used in the last transaction, string
- 'last\_shipping\_address\_address2': shipping address line 2 used in the last transaction, string
- 'last\_shipping\_address\_city': shipping city used in the last transaction, string
- 'last\_shipping\_address\_zip': shipping zip code used in the last transaction, string
- 'last\_shipping\_address\_country\_code': shipping country code used in the last transaction, string
- 'total\_orders': total number of orders placed by the customer, integer
- 'total\_units': total number of item units purchased by the customer, integer
- 'total\_gross': total gross sales spent by the customer, float
- 'total\_discounts': total discounts used by the customer, float
- 'total\_returns': total returns refunded to the customer, float
- 'total\_shipping': total shipping paid by the customer, float
- 'total\_taxes': total taxes paid by the customer, float
- 'r\_score': recency score, how recently the customer made a purchase, 1-5, integer
- 'f\_score': frequency score, how often the customer makes a purchase, 1-5, integer
- 'm\_score': monetary value score, how much money the customer spends on purchases, 1-5, integer
- 'rfm\_score': r\_score + f\_score + m\_score, integer
- 'is\_reseller': 1 (reseller) or 0 (normal customer), the target, integer

## Preliminary Analysis:

The main features that are considered for the analysis are:

- Total orders, units, gross sales, and discounts: these features reflect the reseller behavior of buying large quantities of products during discount seasons.

- Total returns: this feature reflects the reseller tendency to return unsold products.

- Total shipping and taxes: these features carry some information about the reseller's location.

- R\_score, F\_score, M\_score, RFM\_score: these features are derived from the recency, frequency, and monetary value of each customer's purchases and may help in training a model. Only R\_score, F\_score, and RFM\_score are used as input features. Because RFM\_score is the sum of the three individual scores, M\_score is fully determined by the other three (M\_score = RFM\_score - R\_score - F\_score). Including it would add no new information and would make the input features perfectly collinear.
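
The redundancy of M\_score is plain arithmetic: since rfm\_score is defined as r\_score + f\_score + m\_score, the dropped score can always be rebuilt from the three retained ones. A minimal plain-Python check, where the helper name `recover_m_score` and the score triples are made up for illustration:

```python
def recover_m_score(r_score: int, f_score: int, rfm_score: int) -> int:
    """Rebuild m_score from the three features that are kept as model inputs."""
    return rfm_score - r_score - f_score


# Hypothetical (r_score, f_score, m_score) triples on the 1-5 scale.
examples = [(5, 3, 4), (1, 1, 1), (2, 5, 3)]

for r, f, m in examples:
    rfm = r + f + m  # definition of rfm_score from the column descriptions
    assert recover_m_score(r, f, rfm) == m
```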

One challenge in the reseller classification problem is to distinguish between loyal customers and resellers. Loyal customers spend a lot of money in total, but each transaction does not include many units. Resellers purchase multiple units in a single transaction. Therefore, two custom features are created to capture this difference:

- Average units per order: this feature is obtained by dividing total units by total orders.

- Average gross sales per order: this feature is obtained by dividing total gross sales by total orders.

The resellers may follow a different pattern in these two custom features compared to the loyal customers.
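
As a rough illustration of the two ratios, the sketch below computes them in plain Python for two invented customers with the same total gross sales. The function name `per_order_features`, the sample numbers, and the choice to return `None` for a customer with zero orders are assumptions of this sketch, not part of the notebook.

```python
from typing import Optional, Tuple


def per_order_features(
    total_units: int, total_gross: float, total_orders: int
) -> Optional[Tuple[float, float]]:
    """Return (units_per_order, gross_per_order), or None when there are no orders."""
    if total_orders == 0:
        return None  # assumption of this sketch: no orders means no ratio
    return total_units / total_orders, total_gross / total_orders


# Hypothetical customers: same total gross, very different order patterns.
loyal = per_order_features(total_units=24, total_gross=1800.0, total_orders=20)
reseller = per_order_features(total_units=40, total_gross=1800.0, total_orders=2)

print(loyal)     # (1.2, 90.0)
print(reseller)  # (20.0, 900.0)
```

Both customers spend the same amount overall, but the per-order ratios separate them clearly, which is the pattern these features are meant to expose.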

The preliminary training stage involves training a model with all the numerical features plus the two custom features. 

## Improvements 1:

Based on feedback from the team that identified the resellers, I analyzed three criteria: email address, shipping address, and IP address. Some resellers use multiple email accounts and vary their shipping addresses to avoid detection. However, these tactics can be exposed by examining the email domain name and an embedding of the order's shipping address. The IP address is not a reliable indicator because it can easily be changed with a VPN. Therefore, I propose a model that considers both the numerical and the categorical features of the orders (excluding the IP address) and uses an embedding space to measure the distance between shipping addresses. This should help detect resellers more accurately.
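
To make the two text features concrete, here is a plain-Python sketch of the normalization that the Spark expressions in the "Create Custom Features" cell appear to perform: the email domain is the lowercased text after the last `@`, cut at the first `.`, and the address is the lowercased address parts joined by spaces with missing parts treated as empty strings. The helper names and sample values are hypothetical.

```python
from typing import Optional, Sequence


def email_domain(email: str) -> str:
    """Lowercase the email, keep what follows the last '@', cut at the first '.'."""
    return email.lower().rsplit("@", 1)[-1].split(".", 1)[0]


def normalized_address(parts: Sequence[Optional[str]]) -> str:
    """Lowercase each address part, treat missing parts as '', join with spaces."""
    return " ".join((part or "").lower() for part in parts)


# Hypothetical accounts: different mailboxes, same domain and same address.
a = email_domain("Buyer.One@Shop-Example.com")
b = email_domain("buyer.two@shop-example.com")
addr = normalized_address(["12 Main St", None, "Springfield", "US", "12345"])

print(a, b)  # shop-example shop-example
print(addr)  # 12 main st  springfield us 12345
```

Two accounts that differ only in the mailbox name collapse to the same domain token, which is what lets the model link them.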

## Improvements 2:

Resellers tend to buy many units of each Stock Keeping Unit (SKU) and to cover a wide variety of SKUs, because they need both product diversity and ample stock in their warehouses. They often split their purchases into smaller orders, which results in a higher total unit count and a higher distinct SKU count than regular customers have. Regular customers usually choose SKUs according to their personal style, whereas resellers have no such preference because they serve the varied needs of their own customers. Two custom features capture this behavior. The first, the average distinct SKU count per order, is the number of unique SKUs across a customer's order history divided by the total number of orders. The second, units per SKU, is the units per order divided by the average distinct SKU count per order. Adding these two features to model training is expected to improve reseller classification.
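
The two SKU features can be worked through on a toy order history. The sketch below is plain Python, assumes each customer has at least one order, and uses a hypothetical representation in which every order maps a SKU to the units bought; the function name `sku_features` is also invented here.

```python
from typing import Dict, List, Tuple


def sku_features(orders: List[Dict[str, int]]) -> Tuple[float, float]:
    """Return (avg_sku_count, units_per_sku) for one customer's order history.

    Each order maps a SKU to the number of units bought in that order.
    """
    total_orders = len(orders)
    total_units = sum(sum(order.values()) for order in orders)
    distinct_skus = len({sku for order in orders for sku in order})

    avg_sku_count = distinct_skus / total_orders
    units_per_order = total_units / total_orders
    return avg_sku_count, units_per_order / avg_sku_count


# Hypothetical order histories.
reseller = sku_features([{"A": 10, "B": 10}, {"C": 10, "D": 10}])
regular = sku_features([{"A": 1}, {"A": 1, "B": 1}])

print(reseller)  # (2.0, 10.0)
print(regular)   # (1.0, 1.5)
```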

## Improvements 3:

The importance of each attribute was checked using the RandomForest model. Attributes with an importance of less than 1% were removed as they provided little information to the model. Removing these attributes reduces noise and helps the model focus on the important ones.

The results indicate that f\_score, units\_per\_sku, units\_per\_order, total\_taxes, and total\_shipping have an importance of less than 1%. Removing them from the dataset and retraining the model may be beneficial.
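
The 1% rule amounts to a simple threshold filter. The sketch below applies it in plain Python to four importances copied from the RandomForest output printed at the end of this notebook. The variable names are illustrative, and only features visible in that output are included.

```python
IMPORTANCE_THRESHOLD = 0.01  # the 1% cutoff described above

# A few importances copied from the notebook's printed RandomForest output.
importances = {
    "gross_per_order": 0.241971,
    "r_score": 0.120089,
    "total_returns": 0.010107,
    "f_score": 0.007412,
}

kept = [name for name, value in importances.items() if value >= IMPORTANCE_THRESHOLD]
dropped = [name for name, value in importances.items() if value < IMPORTANCE_THRESHOLD]

print(kept)     # ['gross_per_order', 'r_score', 'total_returns']
print(dropped)  # ['f_score']
```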

## Define input and output

In [None]:
model_name = "Reseller Classifier"
input_table_name = "customer"
output_table_name = "ml_resellers"

In [None]:
import os

# Define Azure SQL Database connection
jdbcHostname = os.getenv("SQLDB_HOST")
user = os.getenv("SQLDB_USER")
password = dbutils.secrets.get(scope="azure_key_vault", key='SQLDB-PW') # use Azure Key Vault to save this password. 
jdbcDatabase = os.getenv("SQLDB_DB")
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : user,
"password" : password,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

In [None]:
# load table as a Spark DataFrame
customers = (spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", input_table_name)
  .option("user", user)
  .option("password", password)
  .load()
)

# Keep only US customers (sales_channel_id = 1) whose last transaction is on or before the cutoff date
customers = customers.where("sales_channel_id = 1 AND CAST(last_transaction_date AS DATE) <= '2023-07-24'")

In [None]:
input_columns = ['sales_channel_id'
      ,'external_customer_id'
      ,'customer_type'
      ,'email'
      ,'first_name'
      ,'last_name'
      ,'last_shipping_address_address1'
      ,'last_shipping_address_address2'
      ,'last_shipping_address_city'
      ,'last_shipping_address_country'
      ,'last_shipping_address_phone'
      ,'last_shipping_address_province'
      ,'last_shipping_address_zip'
      ,'last_shipping_address_country_code'
      ,'last_shipping_address_province_code'
      ,'first_transaction_date'
      ,'first_transaction_id'
      ,'last_transaction_date'
      ,'last_transaction_id'
      ,'last_transaction_ip'
      ,'total_orders'
      ,'total_units'
      ,'total_gross'
      ,'total_discounts'
      ,'total_returns'
      ,'total_shipping'
      ,'total_taxes'
      ,'r_score'
      ,'f_score'
      ,'m_score'
      ,'rfm_score'
      ,'is_reseller'
      ,'avg_sku_count'
      ]

## Create Custom Features

In [None]:
import pyspark.sql.functions as F

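# Derive the custom columns: email_domain, address, email_domain_address,
# units_per_order, gross_per_order, and units_per_sku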
df = (
    customers.select(
        [customers[col] for col in input_columns]
        + [
            F.substring_index(
                F.substring_index(F.lower(customers["email"]), "@", -1), ".", 1
            ).alias("email_domain")
        ]
        + [
            F.concat(
                F.coalesce(F.lower(customers["last_shipping_address_address1"]), F.lit("")),
                F.lit(" "),
                F.coalesce(F.lower(customers["last_shipping_address_address2"]), F.lit("")),
                F.lit(" "),
                F.coalesce(F.lower(customers["last_shipping_address_city"]), F.lit("")),
                F.lit(" "),
                F.coalesce(F.lower(customers["last_shipping_address_country_code"]), F.lit("")),
                F.lit(" "),
                F.coalesce(F.lower(customers["last_shipping_address_zip"]), F.lit("")),
            ).alias("address")
        ]
    )
    .withColumn("email_domain_address", F.array("email_domain", "address"))
    .withColumn("units_per_order", F.col("total_units") * 1.0 / F.col("total_orders"))
    .withColumn("gross_per_order", F.col("total_gross") * 1.0 / F.col("total_orders"))
    .withColumn("units_per_sku", F.col('units_per_order') * 1.0 / F.col('avg_sku_count'))
)

In [None]:
IDENTIFIERS = ['external_customer_id', 'email']
CONTINUOUS_COLUMNS = [
  'total_orders',
  'total_units',
  'total_gross',
  'total_discounts',
  'total_returns',
  'total_shipping',
  'total_taxes',
  'r_score',
  'f_score',
  'rfm_score',
  'avg_sku_count',
  'units_per_order', 
  'gross_per_order', 
  'units_per_sku'
]
CATEGORICAL_COLUMN = 'email_domain_address'
TARGET_COLUMN = ['is_reseller']

In [None]:
# Drop rows with a null in any identifier, feature, or target column
customers = df.dropna(
  how='any',
  subset=IDENTIFIERS + CONTINUOUS_COLUMNS + TARGET_COLUMN + [CATEGORICAL_COLUMN]
)

# Remove duplicates
customers = customers.dropDuplicates(subset=['sales_channel_id', 'external_customer_id']) 

## Train Test Split

In [None]:
# Take all the US resellers
resellers = customers.where('sales_channel_id = 1 and is_reseller = 1 and CAST(last_transaction_date AS DATE) <= "2023-07-24"')
# Take US normal customers in 2023 only
normal_customers = customers.where('sales_channel_id = 1 and is_reseller = 0 and CAST(last_transaction_date AS DATE) between "2023-01-01" and "2023-07-24" ')

# Split the resellers and the normal customers separately, merge the matching parts, and then shuffle the rows.
# This keeps roughly the same proportion of resellers and normal customers in the train, test, and val datasets.
train_resellers, test_resellers, val_resellers = resellers.randomSplit([0.8, 0.1, 0.1], seed=42)
train_normal, test_normal, val_normal = normal_customers.randomSplit([0.8, 0.1, 0.1], seed=42)

train = train_resellers.union(train_normal).orderBy(F.rand(seed=42))
test = test_resellers.union(test_normal).orderBy(F.rand(seed=42))
val = val_resellers.union(val_normal).orderBy(F.rand(seed=42))

train.cache()
test.cache()
val.cache()

DataFrame[sales_channel_id: int, external_customer_id: bigint, customer_type: string, email: string, first_name: string, last_name: string, last_shipping_address_address1: string, last_shipping_address_address2: string, last_shipping_address_city: string, last_shipping_address_country: string, last_shipping_address_phone: string, last_shipping_address_province: string, last_shipping_address_zip: string, last_shipping_address_country_code: string, last_shipping_address_province_code: string, first_transaction_date: timestamp, first_transaction_id: bigint, last_transaction_date: timestamp, last_transaction_id: bigint, last_transaction_ip: string, total_orders: int, total_units: int, total_gross: decimal(19,4), total_discounts: decimal(19,4), total_returns: decimal(19,4), total_shipping: decimal(19,4), total_taxes: decimal(19,4), r_score: int, f_score: int, m_score: int, rfm_score: int, is_reseller: int, avg_sku_count: double, email_domain: string, address: string, email_domain_address: a
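
The idea behind the split (split each class on its own, then merge the matching parts) can be shown on toy data. The sketch below is a plain-Python analogue rather than the Spark code: it cuts exact sizes after a seeded shuffle, whereas `randomSplit` assigns rows at random, so its split sizes only approximate the weights. The helper `split_rows` and the class counts are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(
    rows: Sequence[int], weights: Tuple[float, float, float], seed: int
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle rows and cut them into three parts sized by the given weights."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = round(len(shuffled) * weights[0])
    second_cut = first_cut + round(len(shuffled) * weights[1])
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Hypothetical data: 1 marks a reseller, 0 a normal customer (10% resellers).
resellers = [1] * 100
normal_customers = [0] * 900

reseller_parts = split_rows(resellers, (0.8, 0.1, 0.1), seed=42)
normal_parts = split_rows(normal_customers, (0.8, 0.1, 0.1), seed=42)

# Merge the per-class parts, then check the reseller share of each subset.
train, test, val = (r + n for r, n in zip(reseller_parts, normal_parts))
for subset in (train, test, val):
    print(len(subset), sum(subset) / len(subset))
```

Each subset ends up with the same 10% reseller share as the full toy dataset, which a single split over the merged data would not guarantee for a rare class.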

## Create a Pipeline

In [None]:
imput_columns = [(x + '_i') for x in CONTINUOUS_COLUMNS if x not in ['units_per_order', 'gross_per_order', 'units_per_sku']]

In [None]:
from pyspark.ml import Pipeline
import pyspark.ml.feature as MF

embedding_size = 100

imputer = MF.Imputer(
  strategy='mean',
  inputCols=[x for x in CONTINUOUS_COLUMNS if x not in ['units_per_order', 'gross_per_order', 'units_per_sku']],
  outputCols=imput_columns
)

embedding = MF.Word2Vec(
    vectorSize=embedding_size,
    inputCol='email_domain_address',
    outputCol='embedded'
)

continuous_assembler = MF.VectorAssembler(
  inputCols=imput_columns + ['units_per_order', 'gross_per_order', 'units_per_sku', 'embedded'],
  outputCol='continuous'
)

continuous_scaler = MF.StandardScaler(
  inputCol='continuous',
  outputCol='features'
)

customers_pipeline = Pipeline(
  stages=[imputer, embedding, continuous_assembler, continuous_scaler]
)

customers_pipeline_model = customers_pipeline.fit(train)
customers_features = customers_pipeline_model.transform(train)

## RandomForest

In [None]:
from pyspark.ml.classification import RandomForestClassifier
import mlflow

clf = RandomForestClassifier(
  featuresCol='features',
  labelCol='is_reseller',
  predictionCol='prediction'
)

customers_pipeline.setStages(
  [
    imputer, 
    embedding,
    continuous_assembler, 
    continuous_scaler,
    clf
  ]
)

# Run training and evaluation inside an MLflow run; the context manager
# closes the run even if a step fails.
with mlflow.start_run():
    # Train a model
    customers_pipeline_model = customers_pipeline.fit(train)

    # Predictions
    results = customers_pipeline_model.transform(val)

    # Log metrics for the validation set (label 1 = reseller)
    model = customers_pipeline_model.stages[-1]
    metrics = model.evaluate(results.select('email', 'is_reseller', 'features'))

    mlflow.log_metric("val_accuracy", metrics.accuracy)
    mlflow.log_metric("val_precision", metrics.precisionByLabel[1])
    mlflow.log_metric("val_recall", metrics.recallByLabel[1])

    # Log the model
    mlflow.spark.log_model(customers_pipeline_model, "RandomForest_model")
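
For reference, the three logged metrics can be reproduced by hand from label and prediction pairs. The sketch below is plain Python with invented data and a hypothetical helper name. It computes precision and recall for label 1 (reseller) only, matching the index used in the logging calls above.

```python
from typing import Sequence, Tuple


def reseller_metrics(
    labels: Sequence[int], predictions: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall), with precision/recall for label 1."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)

    accuracy = correct / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall


# Hypothetical validation labels and predictions (1 = reseller).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]

print(reseller_metrics(labels, predictions))  # (0.75, 0.6666666666666666, 0.6666666666666666)
```

Because resellers are the minority class, precision and recall on label 1 say more about the classifier than accuracy alone.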


In [None]:
column_names = (
    [
        x
        for x in CONTINUOUS_COLUMNS
        if x not in ["units_per_order", "gross_per_order", "units_per_sku"]
    ]
    + ["units_per_order", "gross_per_order", "units_per_sku"]
    + [f"embed_{i}" for i in range(embedding_size)]
)

In [None]:
import pandas as pd

col_importances = pd.DataFrame(
    model.featureImportances.toArray(), columns=["importance"], index=column_names
)

In [None]:
col_importances.loc[CONTINUOUS_COLUMNS].sort_values(by='importance', ascending=False)
# f_score, units_per_sku, units_per_order, total_taxes, and total_shipping each have an importance below 1%.
# Consider removing them from the dataset and retraining the model.

| feature | importance |
|---|---|
| gross_per_order | 0.241971 |
| r_score | 0.120089 |
| total_gross | 0.058116 |
| avg_sku_count | 0.035167 |
| total_discounts | 0.034501 |
| total_orders | 0.023808 |
| rfm_score | 0.017863 |
| total_units | 0.014201 |
| total_returns | 0.010107 |
| f_score | 0.007412 |
