## **Data Science Modeling**
### **Ecommerce - Modeling a Sequential Recommendation System (SRS) of Products by Markov Chains**

Lee Eik Shern

15-Sept-2023

## **Table of Contents**:

1. Objective
2. Origin of Markov Chains
    - Modeling a Sequential Recommendation System (SRS) of Products as a DTMC
3. Data Retrieval from Relational Database
4. Data Structuring
    - Data structuring for Unsupervised Learning (e.g. KMeans or PCA)
    - Data structuring for Apriori Model application
    - Data structuring for Markov Chain Model application
5. Counting of Product Pairs for Markov Chain Model
6. Transition Matrix (Frequencies & Probabilities)
7. Making Predictions with Markov Chains Transition Matrix
8. Conclusion
    - Limitations and Recommendations for future research directions
9. Further Discussion
    - Modeling a Sequential Recommendation System (SRS) of Products with DTMC as Long Term Analysis with Steady State Transition Matrix
10. References

### **Objective:**

Use **`time series Markov chains`** to model and predict system dynamics over time, and to inform operational decisions and marketing policy.

Given a sample of **`~1.2m customer purchase paths`** over a **`2-month period`**, the aim is to analyze the data to help inform a new **`cross-sell sales strategy`**.

The goal is to explore the data and build a **simple Markov chain model** to **predict** which product an existing customer is **most likely to purchase next**.
 
1.	Collect sample data containing customer-level **purchase paths**
2.	Calculate a **frequency** and **probability matrix** for popular products
3.	For any **given product**, **predict** the **most likely future purchase**

### **Datasets:**

Initial dataset files, with a total file size of [14.7 GB], from [Kaggle's eCommerce behavior data from multi category store](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store):
- `"2019-Oct.csv"` [5.7GB]
- `"2019-Nov.csv"` [9.7GB]

`Property & Description`

- `event_time` = Time when the event happened (in UTC).
- `event_type` = 'purchase': a user purchased a product (the other event types, 'view' and 'cart', are filtered out for this project)
- `product_id` = ID of a product
- `category_id` = Product's category ID
- `category_code` = Product's category taxonomy (code name)
- `brand` = Lowercased brand name. May be missing.
- `price` = Float price of a product. Always present.
- `user_id` = Permanent user ID.
- `user_session` = Temporary session ID, shared by all events in one user session. It changes each time the user returns to the online store after a long pause.



### **Data Normalization:**
1. Perform Data Cleaning, Data Enrichment, and Data Aggregation on a large ecommerce dataset.
2. Reduce the combined file size of the datasets by normalizing the data into 3 tables in a relational database (SQLite, `.sqlite`), and query the database with INNER JOINs between the tables.

- **`Data Model`** built with **`PRIMARY KEY / FOREIGN KEY (fk)`** relationships​

![](kz_Database_Schema.png)

Each table is linked through **`'PRIMARY KEY/FOREIGN KEY'`** relationships in the relational database, which is stored as a single `.sqlite` file.

Stakeholders and collaborators (e.g. data engineers, data scientists, business analysts) can run SQL queries against the database to derive useful information through data mining, or to conduct advanced statistical modeling and machine learning.

With a **`-73%`** file size reduction from the initial [14.7 GB]:
- Database file (.sqlite): `db_connection_kz_ecommerce_2019-Oct-Nov.sqlite` - [3.9 GB]


## **Origin of Markov Chains**

*Markov chains were first introduced in 1906 by Andrey Markov, with the goal of showing that the Law of Large Numbers does not necessarily require the random variables to be independent.*

To see where the Markov model comes from, consider first an i.i.d. sequence of random variables $X_{0}, X_{1}, X_{2}, \dots, X_{n}, \dots$ where we think of $n$ as time. Independence is a very strong assumption: it means that the $X_{j}$'s provide no information about each other. At the other extreme, allowing general interactions between the $X_{j}$'s makes it very difficult to compute even basic things. Markov chains are a happy medium between complete independence and complete dependence.


**Stochastic Processes**

A stochastic process is a collection of random variables that represents the evolution of a system of random values over time.

It is written as $\{X_{n}, n \ge 0\}$ or $\{X_{0}, X_{1}, X_{2}, \dots\}$ and can include a finite or infinite number of observations.

The space on which a Markov process "lives" can be either discrete or continuous, and time can be either discrete or continuous.

**Discrete Time Markov Chains (`DTMCs`)**

A DTMC is a special stochastic process that satisfies two properties: it is "Markov/memoryless" (given the present state, the future is independent of the past) and "time-invariant" (its transition behavior does not change with time).

**Markov Property:**

Mathematically, a stochastic process $\{X_{n}, n \ge 0\}$ with state space $S$ is a DTMC if, for any $i, j, i_{0}, i_{1}, i_{2}, \dots \in S$ and any $m, n \ge 0$, we have

$$
\begin{aligned}
\Pr\{X_{n+1} = j \mid X_{n} = i, X_{n-1} = i_{0}, X_{n-2} = i_{1}, \dots\}
&= \Pr\{X_{n+1} = j \mid X_{n} = i\} && \text{(1. Markov property / memoryless)} \\
&= \Pr\{X_{m+1} = j \mid X_{m} = i\} && \text{(2. time invariant)} \\
&= P_{ij} && \text{(3. transition probability)}
\end{aligned}
$$

Define $P = [P_{ij}]$ as the transition probability matrix.

*This says that given the history $X_{0}, X_{1}, X_{2}...X_{n}$, only the most recent term, $X_{n}$, matters for predicting $X_{n+1}$. If we think of time n as the present, times before n as the past, and times after n as the future, the Markov property says that given the present, the past and future are conditionally independent.*

*The Markov assumption greatly simplifies computations of conditional probability: instead of having to condition on the entire past, we only need to condition on the most recent value.*

reference: [Markov Chains](https://projects.iq.harvard.edu/files/stat110/files/markov_chains_handout.pdf)
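
To make the transition probability matrix concrete, here is a minimal sketch on a made-up 3-state chain. The states `A`, `B`, `C` and all probabilities are assumptions chosen for illustration; they are not taken from the ecommerce data.

```python
import numpy as np

# Hypothetical 3-state chain; states and probabilities are made up for illustration.
states = ["A", "B", "C"]
P = np.array([
    [0.5, 0.3, 0.2],   # transitions out of A
    [0.1, 0.6, 0.3],   # transitions out of B
    [0.4, 0.4, 0.2],   # transitions out of C
])

# Each row of a transition matrix is a probability distribution, so it sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Start in state A with certainty.
pi0 = np.array([1.0, 0.0, 0.0])

# One-step distribution: pi1 = pi0 @ P, which here equals row A of P.
pi1 = pi0 @ P

# Two-step distribution: pi2 = pi0 @ P^2.
pi2 = pi0 @ np.linalg.matrix_power(P, 2)

# Under the Markov property, only the current state is needed to predict the next one.
most_likely_next = states[int(np.argmax(pi1))]
print(pi1, pi2, most_likely_next)
```

Time invariance is what allows the same matrix `P` to be reused at every step, so the n-step behavior is just a matrix power.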

### **Modeling a Sequential Recommendation System (SRS) of Products as a `DTMC`**

In an e-commerce product recommendation system, a Markov chain can be used to model the sequence of products purchased by a customer. 

The states in the Markov chain represent the products, and the transition probabilities represent the likelihood of a customer moving from one product to another. 

Hence, the sequence of products a user purchases within each 'user_session' can be modeled as a Markov chain with a transition matrix.

**DTMC as Transient Analysis with Transition Matrix**

A question that the Markov chain model can answer, with the DTMC applied as transient analysis on a transition matrix whose state transitions are the 'product_id' purchase sequences across all 'user_session':

- Which product is a user most likely to purchase next, after purchasing the current product?
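
As a rough sketch of how that question can be answered, the snippet below counts consecutive purchase pairs inside each session and returns the most frequent successor. The `sessions` data and the helper `most_likely_next` are hypothetical and only illustrate the idea; this is not the notebook's own implementation.

```python
from collections import Counter, defaultdict

# Hypothetical purchase sequences, one list per 'user_session' (made-up data).
sessions = [
    ["apple.smartphone", "apple.headphone"],
    ["apple.smartphone", "apple.headphone", "samsung.smartphone"],
    ["samsung.smartphone", "samsung.tv"],
    ["apple.smartphone", "samsung.smartphone"],
]

# Count transitions between consecutive purchases within each session.
pair_counts = defaultdict(Counter)
for seq in sessions:
    for current, nxt in zip(seq, seq[1:]):
        pair_counts[current][nxt] += 1


def most_likely_next(product):
    """Return (next_product, estimated_probability), or None when the
    product was never followed by another purchase in the sample."""
    counts = pair_counts.get(product)
    if not counts:
        return None
    nxt, n = counts.most_common(1)[0]
    return nxt, n / sum(counts.values())


print(most_likely_next("apple.smartphone"))
```

Dividing each count by its row total gives the empirical transition probability, which is the same normalization a transition matrix applies row by row.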

## **Data Retrieval** from **`Relational Database`**

In [6]:
import os 
import sqlite3
from sqlite3 import Error
import pandas as pd
import numpy as np

try:
    os.chdir(r'./Datasets/ecommerce')
except FileNotFoundError:
    print("Working directory is correct")

def create_connection(path): 
    '''
    path: path to connect SQLite database file (.sqlite)
    return: 'db_connection' object created to be used to 
    execute SQL commands/queries on SQLite database
    '''
    db_connection = None
    try:
        db_connection = sqlite3.connect(path)
        print("SQLite DB is connected")
    except Error as err:
        print(f"Error {err} occurred")

    return db_connection

def execute_read_query(db_connection, query):
    '''
    query: SQL commands/queries in str being pass to 
    cursor.execute(query) to execute on SQLite database  
    object 'db_connection'
    return: result of query from cursor.execute(query).fetchall()
    as list of tuples
    '''
    cursor = db_connection.cursor()
    result = None
    try:
        result = cursor.execute(query).fetchall()
        return result
    except Error as err:
        print(f"Error {err} occurred")

# Reload saved DB from disk
db_connection = create_connection(path="db_connection_kz_ecommerce_2019-Oct-Nov.sqlite")

# event_type='purchase'
select_kzEVENTS_kzPRODUCTS_kzUSERS ='''
        SELECT 
            event_time, 
            event_type,                                 
            kzPRODUCTS.product_id, 
            category_id, 
            brand, 
            price,
            user_id,
            user_session,
            category,
            product
            
        FROM 
            kzEVENTS
        INNER JOIN 
            kzPRODUCTS
        ON 
            kzEVENTS.prod_id = kzPRODUCTS.prod_id
        INNER JOIN 
            kzUSERS
        ON 
            kzEVENTS.event_id = kzUSERS.id

        WHERE 
            event_type='purchase'    
        '''

db_connection_df_oct_nov_purchase = pd.DataFrame(
    execute_read_query(db_connection, select_kzEVENTS_kzPRODUCTS_kzUSERS),
    columns = [
            'event_time',
            'event_type',
            'product_id',
            'category_id',
            'brand',
            'price',
            'user_id',
            'user_session',
            'category',
            'product']
            )

db_connection_df_oct_nov_purchase.head()

# Close DB Connection after retrieved 'df_oct_nov_purchase' datasets
db_connection.close()
del db_connection
print('SQLite DB is closed')

Working directory is correct
SQLite DB is connected


SQLite DB is closed
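
As an alternative to `cursor.execute(...).fetchall()` followed by `pd.DataFrame(...)`, pandas can read a query result directly. The sketch below runs against an in-memory stand-in database; the `events` table and its columns are hypothetical and do not reproduce the `kzEVENTS` schema.

```python
import sqlite3
import pandas as pd

# In-memory stand-in database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        event_time TEXT, event_type TEXT, product_id INTEGER, user_session TEXT
    );
    INSERT INTO events VALUES
        ('2019-10-01 00:02:14', 'purchase', 1004856, 's1'),
        ('2019-10-01 00:03:00', 'view',     1004856, 's1'),
        ('2019-10-01 00:04:37', 'purchase', 1002532, 's2');
""")

# The '?' placeholder keeps the filter value out of the SQL string.
query = "SELECT event_time, product_id, user_session FROM events WHERE event_type = ?"

# read_sql_query takes column names from the query and can parse dates on load.
purchases = pd.read_sql_query(query, conn, params=("purchase",), parse_dates=["event_time"])
conn.close()
print(purchases)
```

This avoids repeating the column list by hand, at the cost of relying on the column names returned by the query.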


In [7]:
# Combining brand and product together
# db_connection_df_oct_nov_purchase[['brand','product']]
db_connection_df_oct_nov_purchase['brand.product'] = db_connection_df_oct_nov_purchase['brand'] + '.' + db_connection_df_oct_nov_purchase['product']
db_connection_df_oct_nov_purchase_mba = db_connection_df_oct_nov_purchase[['event_time', 'event_type', 'product_id', 'price',
                                                                           'user_id', 'user_session', 'category', 'brand.product'
                                                                            ]]
# Assigning Data Types
db_connection_df_oct_nov_purchase_mba = db_connection_df_oct_nov_purchase_mba.astype({
                                                                                        'event_time' : 'datetime64[ns]', 
                                                                                        'event_type' : 'category', 
                                                                                        'product_id' : 'int64', 
                                                                                        'price' : 'float64',
                                                                                        'user_id' : 'int64',
                                                                                        'user_session' : 'object', 
                                                                                        'category' : 'category',
                                                                                        'brand.product' : 'object'                                                                                                    
                                                                                        })
db_connection_df_oct_nov_purchase_mba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1208697 entries, 0 to 1208696
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   event_time     1208697 non-null  datetime64[ns]
 1   event_type     1208697 non-null  category      
 2   product_id     1208697 non-null  int64         
 3   price          1208697 non-null  float64       
 4   user_id        1208697 non-null  int64         
 5   user_session   1208697 non-null  object        
 6   category       1208697 non-null  category      
 7   brand.product  1208697 non-null  object        
dtypes: category(2), datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 57.6+ MB


In [8]:
# Creating the kzPRODUCTS table, and drop duplicates
kzPRODUCTS_unique = db_connection_df_oct_nov_purchase_mba[['product_id','category','price', 'brand.product']].drop_duplicates()
# Creating the kzUSERS table, and drop duplicates
kzUSERS_unique = db_connection_df_oct_nov_purchase_mba[['user_session','user_id']].drop_duplicates()
# Creating the kzEVENTS table
kzEVENTS_unique = db_connection_df_oct_nov_purchase_mba[['event_time','event_type','product_id','user_session']]
# simplify the revenue calculation, by taking 'average' for each 'product_id' price, with .groupby().mean()
avg_price = kzPRODUCTS_unique.groupby('product_id')['price'].mean().round(2).reset_index()
# Drop the original kzPRODUCTS dataframe with fluctuating 'price' for each 'product_id'
kzPRODUCTS_unique = kzPRODUCTS_unique.drop('price', axis=1).drop_duplicates()
# Merge with original 'kzPRODUCTS_unique', with 'avg_price' to replace the 'price' column
kzPRODUCTS_unique_avg_price = pd.merge(
                                        kzPRODUCTS_unique, 
                                        avg_price, 
                                        on='product_id', 
                                        how='left'
                                        )

# ranked_cnt_user_session = kzUSERS_unique.groupby('user_id')['user_session'].count().sort_values(ascending=False)

In [9]:
# Drop duplicates 'brand.product' with same 'product_id' before merge with 'kzEVENTS_unique'
kzPRODUCTS_unique_avg_price = kzPRODUCTS_unique_avg_price.drop(index=[19702,18309,14193,2844,18850,14086,26157])
# Merge the kzEVENTS table with the kzPRODUCTS table 
kzEVENTS_on_kzPRODUCTS = pd.merge(  
                                        kzEVENTS_unique, 
                                        kzPRODUCTS_unique_avg_price, 
                                        on ='product_id', 
                                        how='left'
                                        )
# Merge the 'kzEVENTS_on_kzPRODUCTS' with the 'kzUSERS_unique' table
kzEVENTS_on_kzPRODUCTS_on_kzUSERS = pd.merge(   
                                        kzEVENTS_on_kzPRODUCTS, 
                                        kzUSERS_unique, 
                                        on ='user_session', 
                                        how='left'
                                        )

print(kzEVENTS_on_kzPRODUCTS_on_kzUSERS.shape)
kzEVENTS_on_kzPRODUCTS_on_kzUSERS

(1208697, 8)


Unnamed: 0,event_time,event_type,product_id,user_session,category,brand.product,price,user_id
0,2019-10-01 00:02:14,purchase,1004856,8187d148-3c41-46d4-b0c0-9c08cd9dc564,electronics,samsung.smartphone,128.36,543272936
1,2019-10-01 00:04:37,purchase,1002532,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics,apple.smartphone,594.62,551377651
2,2019-10-01 00:07:07,purchase,13800054,1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f,furniture.bathroom,santeri.toilet,55.08,555332717
3,2019-10-01 00:09:26,purchase,4804055,2af9b570-0942-4dcd-8f25-4d84fba82553,electronics.audio,apple.headphone,194.82,524601178
4,2019-10-01 00:09:54,purchase,4804056,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics.audio,apple.headphone,161.93,551377651
...,...,...,...,...,...,...,...,...
1208692,2019-11-30 23:58:08,purchase,1004767,878a1538-ebe3-4d7f-b773-1b057b1971eb,electronics,samsung.smartphone,247.68,574868869
1208693,2019-11-30 23:58:14,purchase,1004874,717566cf-ef93-4078-ba8f-169a3ac9f1a0,electronics,samsung.smartphone,365.35,547804983
1208694,2019-11-30 23:58:22,purchase,1005130,829c20b5-696e-4a8a-8a9f-171014a3ecbe,electronics,apple.smartphone,1553.56,515582054
1208695,2019-11-30 23:58:57,purchase,1004767,ca50e291-43f3-4ca2-9e13-20ee6b8b25f0,electronics,samsung.smartphone,247.68,579876821
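
The merges above keep the row count at 1,208,697 only because each `product_id` maps to a single product row. A minimal sketch of one way to enforce that without hard-coded index labels is shown below; the toy frames are made up, and keeping the first label per `product_id` is an assumption rather than the rule used in the notebook.

```python
import pandas as pd

# Made-up stand-ins for the events and products tables.
events = pd.DataFrame({
    "product_id": [1, 2, 1, 3],
    "user_session": ["s1", "s1", "s2", "s3"],
})
products = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "brand.product": ["apple.smartphone", "apple.phone", "samsung.tv", "lg.washer"],
})

# Keep one label per product_id (here simply the first one seen).
products_unique = products.drop_duplicates(subset="product_id", keep="first")

# validate="many_to_one" raises pandas.errors.MergeError if product_id
# is still duplicated on the right-hand side.
merged = events.merge(products_unique, on="product_id", how="left", validate="many_to_one")

# A left merge against a unique key preserves the number of event rows.
assert len(merged) == len(events)
print(merged)
```

The `validate` argument turns a silent row-count blow-up into an explicit error, which is useful on a table of this size.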


### Data structuring for Unsupervised Learning (e.g. KMeans or PCA)

In [10]:
# # Save in .csv file after replace price with avg_price
# kzEVENTS_on_kzPRODUCTS_on_kzUSERS.to_csv("kz_ecommerce_2019-Oct-Nov_purchase_avg-price.csv")

Convert dataframe into list-of-lists

In [11]:
# For all 'user_session' count 
df_products_list_of_lists = kzEVENTS_on_kzPRODUCTS_on_kzUSERS[[
                                                                'event_time', 
                                                                'user_session', 
                                                                'brand.product',
                                                                'product_id',
                                                                'price',
                                                                'user_id'
                                                                ]].copy()
df_products_list_of_lists = df_products_list_of_lists.sort_values('event_time', ascending=True)
print(df_products_list_of_lists.shape)
df_products_list_of_lists.head()

(1208697, 6)


Unnamed: 0,event_time,user_session,brand.product,product_id,price,user_id
0,2019-10-01 00:02:14,8187d148-3c41-46d4-b0c0-9c08cd9dc564,samsung.smartphone,1004856,128.36,543272936
1,2019-10-01 00:04:37,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,apple.smartphone,1002532,594.62,551377651
2,2019-10-01 00:07:07,1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f,santeri.toilet,13800054,55.08,555332717
3,2019-10-01 00:09:26,2af9b570-0942-4dcd-8f25-4d84fba82553,apple.headphone,4804055,194.82,524601178
4,2019-10-01 00:09:54,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,apple.headphone,4804056,161.93,551377651


### Data structuring for Apriori Model application

In [12]:
products_list_of_lists_seq_min_two_unique_non_repeat = df_products_list_of_lists                                \
                                                        .groupby('user_session')                                \
                                                        .filter(lambda x: x['brand.product'].nunique() > 1)     \
                                                        .groupby('user_session')['brand.product']               \
                                                        .apply(lambda x: list(set(x)))                          \
                                                        .tolist()
print(len(products_list_of_lists_seq_min_two_unique_non_repeat))
products_list_of_lists_seq_min_two_unique_non_repeat[0:10]

45627


[['apple.smartphone', 'samsung.smartphone'],
 ['kugoo.skates', 'vitek.kettle'],
 ['apple.smartphone', 'sho-me.videoregister'],
 ['midea.washer', 'lg.tv'],
 ['samsung.smartphone', 'lg.washer'],
 ['samsung.refrigerators', 'samsung.tv', 'samsung.washer'],
 ['apple.smartphone', 'apple.headphone'],
 ['apple.smartphone', 'xiaomi.smartphone'],
 ['lg.acoustic', 'bosch.vacuum', 'redmond.iron'],
 ['samsung.smartphone', 'redmond.coffee_grinder']]
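
The basket construction can be traced on a toy frame. The data below is made up, and `sorted(set(...))` is used instead of `list(set(...))` only so that the item order inside each basket is deterministic; set iteration order is not guaranteed.

```python
import pandas as pd

# Made-up purchases: s2 has one product, s3 repeats 'lg.tv'.
toy = pd.DataFrame({
    "user_session": ["s1", "s1", "s2", "s3", "s3", "s3"],
    "brand.product": [
        "apple.smartphone", "apple.headphone",
        "samsung.tv",
        "lg.tv", "lg.tv", "midea.washer",
    ],
})

# Collect the distinct products per session as a sorted list.
baskets = toy.groupby("user_session")["brand.product"].apply(lambda s: sorted(set(s)))

# Keep only baskets with at least two distinct products.
baskets = baskets[baskets.map(len) > 1].tolist()
print(baskets)
```

Because the baskets are de-duplicated sets, purchase order and repeat purchases are discarded, which suits Apriori but not the Markov chain model.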

In [13]:
# # Export products list of lists in sequence to .csv file
# products_list_of_lists_seq_min_two_unique_non_repeat_df = pd.DataFrame(products_list_of_lists_seq_min_two_unique_non_repeat)
# products_list_of_lists_seq_min_two_unique_non_repeat_df.to_csv('kz_ecommerce_transactions_products_lists_seq_min_two_unique_non_repeat_df.csv', index=False, header=False)

### Data structuring for Markov Chain Model application

### EDA: Explore on **`longest`** `'user_session'`


To ensure the sequential relationships between purchased products are representative of the user's online shopping behavior at a particular point in time, only sessions spanning less than 2 days are kept; longer sessions are treated as outliers and removed.

In [14]:
usr_sess_longest = kzEVENTS_on_kzPRODUCTS_on_kzUSERS.groupby('user_session')['event_time'].agg(['min', 'max'])
usr_sess_longest['time_delta'] = usr_sess_longest['max'] - usr_sess_longest['min']
usr_sess_longest = usr_sess_longest.sort_values('min')
usr_sess_longest

Unnamed: 0_level_0,min,max,time_delta
user_session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8187d148-3c41-46d4-b0c0-9c08cd9dc564,2019-10-01 00:02:14,2019-10-01 00:02:14,0 days 00:00:00
3c80f0d6-e9ec-4181-8c5c-837a30be2d68,2019-10-01 00:04:37,2019-10-01 00:09:54,0 days 00:05:17
1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f,2019-10-01 00:07:07,2019-10-01 00:07:07,0 days 00:00:00
2af9b570-0942-4dcd-8f25-4d84fba82553,2019-10-01 00:09:26,2019-10-01 00:09:26,0 days 00:00:00
0b74a829-f9d7-4654-b5b0-35bc9822c238,2019-10-01 00:10:08,2019-10-01 00:10:08,0 days 00:00:00
...,...,...,...
ca50e291-43f3-4ca2-9e13-20ee6b8b25f0,2019-11-30 23:57:07,2019-11-30 23:58:57,0 days 00:01:50
068b0939-1d19-4289-90d8-bb0ee2a3547a,2019-11-30 23:57:23,2019-11-30 23:57:23,0 days 00:00:00
878a1538-ebe3-4d7f-b773-1b057b1971eb,2019-11-30 23:58:08,2019-11-30 23:58:08,0 days 00:00:00
829c20b5-696e-4a8a-8a9f-171014a3ecbe,2019-11-30 23:58:22,2019-11-30 23:58:22,0 days 00:00:00


In [15]:
usr_sess_longest_ranked = usr_sess_longest.sort_values(by='time_delta', ascending=False).head(50)
usr_sess_longest_ranked

Unnamed: 0_level_0,min,max,time_delta
user_session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
b3b586ab-daf4-4f43-83f1-dc2692c0ef99,2019-10-16 07:03:10,2019-11-30 11:10:29,45 days 04:07:19
6c8443fb-a522-477c-9e69-ba5eeb059c6b,2019-10-26 11:12:33,2019-11-30 10:41:40,34 days 23:29:07
9a64d9f6-1410-40a2-87f9-ef7133ee01bc,2019-10-24 06:22:07,2019-11-26 11:18:37,33 days 04:56:30
c4e2255d-ec4e-4be0-86c0-4243aee0d97b,2019-10-06 13:53:47,2019-11-06 18:22:01,31 days 04:28:14
3e0db92b-9f2a-40c7-9b9f-44a08dd2e844,2019-10-21 06:01:53,2019-11-20 06:25:33,30 days 00:23:40
d3f2c130-620e-4ae3-a9d4-456348005fa7,2019-11-01 08:10:14,2019-11-30 05:57:51,28 days 21:47:37
985aa12c-d4ab-4e7d-a343-46b55da15e6d,2019-10-17 05:56:39,2019-11-14 11:00:08,28 days 05:03:29
842b1e3f-5d6f-46ff-96c0-0c8c81efa1a1,2019-10-18 06:47:08,2019-11-12 08:38:43,25 days 01:51:35
18b0cf58-5110-413f-87a0-e2d03b80592c,2019-10-21 09:17:30,2019-11-14 06:46:47,23 days 21:29:17
43145132-a7a4-4e53-878e-e75f503aed6a,2019-10-22 07:25:46,2019-11-14 07:48:50,23 days 00:23:04


In [16]:
usr_sess_longest[['time_delta']].describe(percentiles=[.25, .5, .75, .99995])

Unnamed: 0,time_delta
count,1025278
mean,0 days 00:01:39.873809834
std,0 days 02:38:53.456356199
min,0 days 00:00:00
25%,0 days 00:00:00
50%,0 days 00:00:00
75%,0 days 00:00:00
99.995%,1 days 13:10:11.242250092
max,45 days 04:07:19


In [17]:
usr_id_sess_longest = kzEVENTS_on_kzPRODUCTS_on_kzUSERS[kzEVENTS_on_kzPRODUCTS_on_kzUSERS['user_session'] == usr_sess_longest[['time_delta']].idxmax().item()]
usr_sess_longest_oths = kzEVENTS_on_kzPRODUCTS_on_kzUSERS[(kzEVENTS_on_kzPRODUCTS_on_kzUSERS['user_id'] == usr_id_sess_longest['user_id'][:1].values.item()) & (kzEVENTS_on_kzPRODUCTS_on_kzUSERS['user_session'] != usr_sess_longest[['time_delta']].idxmax().item())]

In [18]:
print(f"The longest single 'user_session' {usr_sess_longest[['time_delta']].idxmax().item()} lasted {usr_sess_longest[['time_delta']].max().item()}")
print(f"with active purchases span from {usr_sess_longest_ranked[:1]['min'].item()} to {usr_sess_longest_ranked[:1]['max'].item()}")
print(f"by 'user_id' {usr_id_sess_longest['user_id'][:1].values.item()}, with total {len(usr_id_sess_longest)} products purchased within the single longest 'user_session'")
print(f"and with total {len(usr_id_sess_longest) + len(usr_sess_longest_oths)} products purchased for all 'user_session', under the same 'user_id'")

The longest single 'user_session' b3b586ab-daf4-4f43-83f1-dc2692c0ef99 lasted 45 days 04:07:19
with active purchases span from 2019-10-16 07:03:10 to 2019-11-30 11:10:29
by 'user_id' 513230794, with total 31 products purchased within the single longest 'user_session'
and with total 198 products purchased for all 'user_session', under the same 'user_id'


In [19]:
# Defined 'user_session_outlier' as time_delta at >= 2 days to be removed 
user_session_outlier = usr_sess_longest[usr_sess_longest['time_delta'].dt.days > 1]
print(len(user_session_outlier))
user_session_outlier.index

46


Index(['0eedef43-b496-46a8-a027-70bdb0c1ed22',
       '38798848-b4c6-4a9e-9cbe-592e342518e0',
       '3eae5f06-bdd5-4671-ae2d-7cea7212740a',
       'ae547951-cde6-4132-be18-e5ee1e6ff32c',
       'd6fab6ed-6df5-4c8f-a331-8418d2523eb1',
       '7faa58a7-6f4d-4fd1-a45f-aa6fc70dd434',
       'd15e3ae0-f7cb-4728-ac20-f6f1a4cd5511',
       'dee5bc5f-31c4-4fcc-b263-11e61c0c6a0a',
       'c4e2255d-ec4e-4be0-86c0-4243aee0d97b',
       '888458a2-4614-458a-8bb0-fd4de5e14665',
       '18c112c2-5470-4318-965e-050913a2e028',
       '817d438b-82a2-4add-b17c-ef89cb2edf65',
       '7c94ef00-ba1c-44bb-ab49-74164ec0016a',
       'f97bd8e0-eeb9-4fdf-89c4-d89cda5a3afb',
       '6a5b2885-5b4d-4fa6-ad75-47ee2113745c',
       'b3b586ab-daf4-4f43-83f1-dc2692c0ef99',
       'e79bb9e0-83ac-4401-b2b9-f3a47d5635f1',
       '2769b5d8-9f34-4d52-a1c1-afebccf15c91',
       '9a21d989-f561-4aef-b8eb-5107be2e4a16',
       '985aa12c-d4ab-4e7d-a343-46b55da15e6d',
       '842b1e3f-5d6f-46ff-96c0-0c8c81efa1a1',
       '2fd4f

In [20]:
# Get idx of outlier rows: 'user_session' with time_delta >= 2 days (i.e. .dt.days > 1)
idx_to_drop = kzEVENTS_on_kzPRODUCTS_on_kzUSERS[kzEVENTS_on_kzPRODUCTS_on_kzUSERS['user_session'].isin(user_session_outlier.index)].index
idx_to_drop

Int64Index([   1994,    2053,    2727,    2768,    2794,    3840,    3902,
               4686,    4781,    4809,
            ...
            1195626, 1198028, 1198068, 1198121, 1198152, 1198716, 1198755,
            1198802, 1198838, 1198874],
           dtype='int64', length=404)

In [21]:
# Remove outlier rows: 'user_session' with time_delta >= 2 days (i.e. .dt.days > 1)
df = kzEVENTS_on_kzPRODUCTS_on_kzUSERS.drop(idx_to_drop)
df

Unnamed: 0,event_time,event_type,product_id,user_session,category,brand.product,price,user_id
0,2019-10-01 00:02:14,purchase,1004856,8187d148-3c41-46d4-b0c0-9c08cd9dc564,electronics,samsung.smartphone,128.36,543272936
1,2019-10-01 00:04:37,purchase,1002532,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics,apple.smartphone,594.62,551377651
2,2019-10-01 00:07:07,purchase,13800054,1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f,furniture.bathroom,santeri.toilet,55.08,555332717
3,2019-10-01 00:09:26,purchase,4804055,2af9b570-0942-4dcd-8f25-4d84fba82553,electronics.audio,apple.headphone,194.82,524601178
4,2019-10-01 00:09:54,purchase,4804056,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics.audio,apple.headphone,161.93,551377651
...,...,...,...,...,...,...,...,...
1208692,2019-11-30 23:58:08,purchase,1004767,878a1538-ebe3-4d7f-b773-1b057b1971eb,electronics,samsung.smartphone,247.68,574868869
1208693,2019-11-30 23:58:14,purchase,1004874,717566cf-ef93-4078-ba8f-169a3ac9f1a0,electronics,samsung.smartphone,365.35,547804983
1208694,2019-11-30 23:58:22,purchase,1005130,829c20b5-696e-4a8a-8a9f-171014a3ecbe,electronics,apple.smartphone,1553.56,515582054
1208695,2019-11-30 23:58:57,purchase,1004767,ca50e291-43f3-4ca2-9e13-20ee6b8b25f0,electronics,samsung.smartphone,247.68,579876821


In [22]:
chk_outlier_aft_rm_long_session_df = df.groupby('user_session')['event_time'].agg(['min', 'max'])
chk_outlier_aft_rm_long_session_df['time_delta'] = chk_outlier_aft_rm_long_session_df['max'] - chk_outlier_aft_rm_long_session_df['min']
chk_outlier_aft_rm_long_session_df = chk_outlier_aft_rm_long_session_df.sort_values('min')
# After removed 'user_session' outlier
chk_outlier_aft_rm_long_session_df['time_delta'].max()

Timedelta('1 days 23:52:36')
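
The same outlier rule can be expressed in one pass with `groupby(...).transform`, which avoids the intermediate index lookup. This is a sketch on made-up data, not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_session": ["s1", "s1", "s2", "s2", "s3"],
    "event_time": pd.to_datetime([
        "2019-10-01 00:00:00", "2019-10-01 00:05:00",   # s1 spans 5 minutes
        "2019-10-01 00:00:00", "2019-10-03 00:00:01",   # s2 spans just over 2 days
        "2019-10-02 12:00:00",                          # s3 has a single purchase
    ]),
})

# Session span = last purchase time minus first purchase time,
# broadcast back to every row of the session.
grouped = toy.groupby("user_session")["event_time"]
span = grouped.transform("max") - grouped.transform("min")

# Keep rows whose session spans strictly less than 2 days
# (the complement of `time_delta.dt.days > 1`).
kept = toy[span < pd.Timedelta(days=2)]
print(kept)
```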

### EDA: Explore on single **`'user_id'`** with **`largest multiple`** `'user_session'` count

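Find the `'user_id'` that returned to purchase most frequently, then count that user's unique `'user_session'` values. Note that the selection in the cell below ranks each `'user_id'` by number of purchase rows, not by number of distinct sessions.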

In [23]:
usr_id_mult_sess_max = df[['user_id']].value_counts().idxmax()[0]
usr_id_mult_sess_max_df = df[df['user_id']==usr_id_mult_sess_max].sort_values('event_time', ascending=True)
usr_id_mult_sess_max_df

Unnamed: 0,event_time,event_type,product_id,user_session,category,brand.product,price,user_id
450074,2019-10-25 15:17:14,purchase,1004836,e7c7cd8a-dff7-4808-9a5e-000e0ac824bc,electronics,samsung.smartphone,231.12,564068124
452035,2019-10-25 17:41:27,purchase,1004750,11a8d7b5-5139-44d2-a4a8-96cb532d5b96,electronics,samsung.smartphone,196.27,564068124
452051,2019-10-25 17:42:47,purchase,1004750,11a8d7b5-5139-44d2-a4a8-96cb532d5b96,electronics,samsung.smartphone,196.27,564068124
452089,2019-10-25 17:45:32,purchase,1004833,11a8d7b5-5139-44d2-a4a8-96cb532d5b96,electronics,samsung.smartphone,171.76,564068124
452137,2019-10-25 17:49:34,purchase,1004833,11a8d7b5-5139-44d2-a4a8-96cb532d5b96,electronics,samsung.smartphone,171.76,564068124
...,...,...,...,...,...,...,...,...
1192841,2019-11-30 07:12:28,purchase,1004856,e6955252-e324-4839-8f97-61ad64b1c824,electronics,samsung.smartphone,128.36,564068124
1192864,2019-11-30 07:13:33,purchase,1004856,e6955252-e324-4839-8f97-61ad64b1c824,electronics,samsung.smartphone,128.36,564068124
1195338,2019-11-30 08:51:04,purchase,1307310,9a478bd6-8135-426a-a97a-c1d5d06954fd,computers,acer.notebook,289.83,564068124
1197433,2019-11-30 10:11:20,purchase,1004767,7e411779-aabe-4f42-b4c2-24de619642fb,electronics,samsung.smartphone,247.68,564068124


In [24]:
# usr_id_mult_sess_max_df['user_session'].nunique()

In [25]:
usr_id_mult_sess_max_time = usr_id_mult_sess_max_df.groupby('user_session')['event_time'].agg(['min', 'max'])
usr_id_mult_sess_max_time['time_delta'] = usr_id_mult_sess_max_time['max'] - usr_id_mult_sess_max_time['min']
usr_id_mult_sess_max_time = usr_id_mult_sess_max_time.sort_values('min')
usr_id_mult_sess_max_time_ranked = usr_id_mult_sess_max_time.sort_values(by='time_delta', ascending=False)
usr_id_mult_sess_max_time_ranked

Unnamed: 0_level_0,min,max,time_delta
user_session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3b00665a-daff-4a2c-bba2-a152cc6e62c9,2019-11-02 15:03:26,2019-11-02 16:32:33,0 days 01:29:07
bb797c4b-d75a-434e-a81d-366a52e55513,2019-11-08 04:01:00,2019-11-08 04:47:08,0 days 00:46:08
f9cd8616-a5ed-484c-aed8-50c092004853,2019-11-02 19:12:52,2019-11-02 19:50:43,0 days 00:37:51
4f4d574b-76b9-4a49-9e0a-200a28662082,2019-11-02 04:10:18,2019-11-02 04:45:31,0 days 00:35:13
34b9eb67-4755-49c3-bd42-5384ac9b6325,2019-10-29 14:11:39,2019-10-29 14:44:23,0 days 00:32:44
...,...,...,...
77e8cbcc-7932-4488-b62a-4c61729c001f,2019-11-07 08:54:40,2019-11-07 08:54:40,0 days 00:00:00
12607fe6-7772-40ea-9bdb-e0343bf60195,2019-11-08 07:24:13,2019-11-08 07:24:13,0 days 00:00:00
71ad1682-4eb1-438d-bb72-34068649c7e6,2019-11-11 13:49:27,2019-11-11 13:49:27,0 days 00:00:00
2d24b95c-b906-48dc-9211-959944d1adab,2019-11-13 02:41:03,2019-11-13 02:41:03,0 days 00:00:00


In [26]:
# usr_id_mult_sess_max_time_ranked[0:1].index.item()

In [27]:
usr_id_mult_sess_max_qty_prod = usr_id_mult_sess_max_df[usr_id_mult_sess_max_df['user_session']==usr_id_mult_sess_max_time_ranked[0:1].index.item()]
print(f"User with 'user_id' {usr_id_mult_sess_max} with largest total {usr_id_mult_sess_max_df['user_session'].nunique()} unique 'user_session'")
print(f"with total {len(usr_id_mult_sess_max_df)} products purchased, between {usr_id_mult_sess_max_df['event_time'].min()} to {usr_id_mult_sess_max_df['event_time'].max()}")
print(f"where the longest 'user_session' only lasted {usr_id_mult_sess_max_time_ranked['time_delta'].max()} with total {len(usr_id_mult_sess_max_qty_prod)} products purchased within that longest 'user_session'")

User with 'user_id' 564068124 with largest total 125 unique 'user_session'
with total 637 products purchased, between 2019-10-25 15:17:14 to 2019-11-30 18:14:58
where the longest 'user_session' only lasted 0 days 01:29:07 with total 58 products purchased within that longest 'user_session'
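
Ranking users by purchase rows and ranking them by distinct sessions can give different answers. The toy example below (made-up data) shows the difference; whether the two rankings agree on the real dataset is not checked here.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2, 2],
    "user_session": ["a", "a", "a", "a", "b", "c", "d"],
})

# Rows per user: user 1 has the most purchase events (4 rows in one session).
most_purchases = toy["user_id"].value_counts().idxmax()

# Distinct sessions per user: user 2 has the most sessions (3 sessions).
sessions_per_user = toy.groupby("user_id")["user_session"].nunique()
most_sessions = sessions_per_user.idxmax()

print(most_purchases, most_sessions)
```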


## **Counting of Product Pairs for Markov Chain Model**

### Define a threshold with minimum `two different products purchased` in every single `'user_session'`

Many `'user_session'` values contain a single product bought only once, with no following purchase to form a product pair, so they contribute no transitions. When the transition matrix is built, a smoothing factor is therefore added so that every row has a non-zero denominator before the rows are normalized into transition probabilities. This ensures the transition matrix includes all unique products in the datasets.
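
A minimal sketch of smoothing followed by row normalization is shown below. The count matrix is made up, and simple additive smoothing with a hypothetical value `alpha` is assumed; the exact smoothing form used in the notebook may differ.

```python
import numpy as np
import pandas as pd

products = ["apple.smartphone", "apple.headphone", "samsung.tv"]

# Made-up transition counts: rows are the current product, columns the next one.
counts = pd.DataFrame(
    [[0, 3, 1],
     [1, 0, 0],
     [0, 0, 0]],   # samsung.tv was never followed by another purchase
    index=products, columns=products, dtype=float,
)

alpha = 1e-6  # hypothetical smoothing value, chosen small so observed counts dominate

# Add alpha to every cell so no row sums to zero, then divide each row by its total.
smoothed = counts + alpha
probs = smoothed.div(smoothed.sum(axis=1), axis=0)

# Every row is now a valid probability distribution.
assert np.allclose(probs.sum(axis=1), 1.0)
print(probs.round(4))
```

Without smoothing, the all-zero row would produce a division by zero; with it, that row becomes a uniform distribution over all products.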


One way to drop short sessions is the `filter` method of the groupby object, which filters groups based on a condition. 

The cells below take an equivalent route on the data frame with columns ['event_time', 'user_session', 'brand.product']: count the rows per 'user_session' with `value_counts()`, then keep only the sessions that have at least two rows, so that every remaining 'user_session' contributes a sequence of at least two elements:

In [28]:
session_counts = df['user_session'].value_counts()
session_counts

68b52b9a-97c8-4525-aba2-604ede028da8    85
1d34878d-1a42-401b-90a4-d44e2aa1e127    76
3b00665a-daff-4a2c-bba2-a152cc6e62c9    58
6860e076-5d5f-4a7e-ac61-6d85706a262f    48
145959cc-c588-0f7e-918a-dc14fdfd032c    34
                                        ..
815199a4-a4ec-11b4-3c21-d7a4ac33a504     1
1810d14c-d371-4f73-9508-f45c00f45186     1
07c9536c-73cc-4a9a-9798-a4114c8db5d0     1
2b3f45c3-44e9-47ea-90a6-6df530391e2e     1
a65116f4-ac53-4a41-ad68-6606788e674c     1
Name: user_session, Length: 1025232, dtype: int64

In [29]:
# Keep only 'user_session' values with at least two purchase rows
valid_sessions = session_counts[session_counts >= 2].index

df = df[df['user_session'].isin(valid_sessions)]
df.head()

Unnamed: 0,event_time,event_type,product_id,user_session,category,brand.product,price,user_id
1,2019-10-01 00:04:37,purchase,1002532,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics,apple.smartphone,594.62,551377651
4,2019-10-01 00:09:54,purchase,4804056,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,electronics.audio,apple.headphone,161.93,551377651
17,2019-10-01 02:22:11,purchase,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,electronics,samsung.smartphone,196.27,555110488
29,2019-10-01 02:25:04,purchase,1004856,ce885079-4d92-4fe6-92a3-377c5a2d8291,electronics,samsung.smartphone,128.36,555110488
34,2019-10-01 02:26:05,purchase,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,electronics,samsung.smartphone,196.27,555110488
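
Two common ways to keep only groups of a minimum size are `GroupBy.filter` and `value_counts()` combined with `isin()`. The sketch below compares them on a made-up frame (the session IDs `s1` to `s3` and the product IDs are hypothetical, not taken from the dataset). It is meant as an illustration under these assumptions rather than a benchmark; on large frames the `value_counts()` route is usually the cheaper one, because `filter` calls a Python function once per group.

```python
import pandas as pd

# Toy purchase log; session IDs and products are made up for illustration.
toy = pd.DataFrame({
    "user_session": ["s1", "s1", "s2", "s3", "s3", "s3"],
    "product_id": [10, 11, 10, 12, 12, 13],
})

# Approach 1: count rows per session, then keep sessions with >= 2 rows.
counts = toy["user_session"].value_counts()
kept_isin = toy[toy["user_session"].isin(counts[counts >= 2].index)]

# Approach 2: GroupBy.filter keeps every group for which the function is True.
kept_filter = toy.groupby("user_session").filter(lambda g: len(g) >= 2)

# Both approaches drop the single-purchase session "s2".
print(kept_isin.equals(kept_filter))
```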


In [30]:
df.sample(10)

Unnamed: 0,event_time,event_type,product_id,user_session,category,brand.product,price,user_id
1021951,2019-11-20 15:11:57,purchase,1004870,24375247-29e4-4775-8872-ba56684fd4d0,electronics,samsung.smartphone,283.61,556992835
112278,2019-10-07 16:41:06,purchase,1004856,08931f59-5917-47b2-94cc-fd715df1e977,electronics,samsung.smartphone,128.36,516069379
780951,2019-11-14 15:53:13,purchase,3801134,17b024b3-5e41-4509-91b0-fc116519f89a,appliances,elenberg.iron,15.07,561443414
992440,2019-11-19 05:04:56,purchase,1201231,ff8f9f2b-f610-4c9d-b7b4-7148fb7df6a8,electronics,huawei.tablet,98.89,514474375
539484,2019-10-31 07:58:46,purchase,1500258,c9187b61-f449-43d5-b8b7-f434ccc9a298,computers.peripherals,epson.printer,256.69,541727437
1139062,2019-11-27 13:04:04,purchase,1003312,0d4ed290-a79c-4b28-89ce-21e1726c26e1,electronics,apple.smartphone,710.12,548881415
303890,2019-10-17 18:12:41,purchase,1003306,fa914519-0bce-4152-bb46-50cce90d25b8,electronics,apple.smartphone,585.73,555199107
595600,2019-11-03 18:36:36,purchase,3600025,2f07378c-c793-470d-a0e3-218e56a126a8,appliances.kitchen,indesit.washer,213.87,515067894
636566,2019-11-06 07:19:23,purchase,21400513,0a08ff7b-911b-4c21-a6be-7275f7021c59,electronics,casio.clocks,80.75,524635764
973166,2019-11-18 05:43:03,purchase,1005239,2ea275a0-c648-437f-9c07-fedcdc84a897,electronics,xiaomi.smartphone,265.41,530821378


### Create a dictionary to refer to `'brand.product'` by `'product_id'`

This dictionary {`'product_id'` : `'brand.product'`} is used later by the product recommender to label its predictions.

In [31]:
product_id_brand = df.set_index('product_id')['brand.product'].to_dict()
print(f"There are total {len(product_id_brand.keys())} of unique 'product_id'")
print(f"There are total {len(set(product_id_brand.values()))} of unique 'brand.product'")

There are total 13206 of unique 'product_id'
There are total 1868 of unique 'brand.product'


In [32]:
# Select only relevant columns 
df = df[['event_time', 'product_id', 'user_session', 'brand.product', 'price', 'user_id']]
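# sort_values() returns a new frame, so assign it back; a stable sort keeps ties in their original order
df = df.sort_values(by='event_time', kind='stable')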
print(df.nunique(),'\n')
df.info()
df.head()

event_time       288348
product_id        13206
user_session     128747
brand.product      1868
price              8093
user_id           86483
dtype: int64 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 311808 entries, 1 to 1208695
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   event_time     311808 non-null  datetime64[ns]
 1   product_id     311808 non-null  int64         
 2   user_session   311808 non-null  object        
 3   brand.product  311808 non-null  object        
 4   price          311808 non-null  float64       
 5   user_id        311808 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 16.7+ MB


Unnamed: 0,event_time,product_id,user_session,brand.product,price,user_id
1,2019-10-01 00:04:37,1002532,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,apple.smartphone,594.62,551377651
4,2019-10-01 00:09:54,4804056,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,apple.headphone,161.93,551377651
17,2019-10-01 02:22:11,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,samsung.smartphone,196.27,555110488
29,2019-10-01 02:25:04,1004856,ce885079-4d92-4fe6-92a3-377c5a2d8291,samsung.smartphone,128.36,555110488
34,2019-10-01 02:26:05,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,samsung.smartphone,196.27,555110488


[GroupBy.cumcount(ascending=True)](https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.core.groupby.GroupBy.cumcount.html)

Number each item in each group from 0 to the length of that group - 1.

Essentially this is equivalent to
``` python
self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
```
**Parameters:**  *`ascending : bool, default True`* - If False, number in reverse, from length of group - 1 to 0.

**Returns:** *`Series`* - Sequence number of each element within each group.
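
As a small illustration of `cumcount()`, the sketch below numbers the rows of a made-up frame within each session (the column values are hypothetical). The numbering follows the current row order, so the frame is assumed to be sorted by time beforehand.

```python
import pandas as pd

# Toy frame with two interleaved sessions (values are made up).
toy = pd.DataFrame({
    "user_session": ["s1", "s2", "s1", "s1", "s2"],
    "product_id": [10, 20, 11, 12, 21],
})

# Position of each row within its own session, counted from 0 in row order.
toy["purchase_seq_cart"] = toy.groupby("user_session").cumcount()
print(toy)
```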

In [33]:
# Create a new column 'purchase_seq_cart' holding the 0-based position of each purchase
# within its 'user_session', i.e. the order in which products were bought in that session
loc = df.columns.get_loc('user_session') + 1
df.insert(loc, 'purchase_seq_cart', df.groupby('user_session').cumcount())
df.head()

Unnamed: 0,event_time,product_id,user_session,purchase_seq_cart,brand.product,price,user_id
1,2019-10-01 00:04:37,1002532,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,0,apple.smartphone,594.62,551377651
4,2019-10-01 00:09:54,4804056,3c80f0d6-e9ec-4181-8c5c-837a30be2d68,1,apple.headphone,161.93,551377651
17,2019-10-01 02:22:11,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,0,samsung.smartphone,196.27,555110488
29,2019-10-01 02:25:04,1004856,ce885079-4d92-4fe6-92a3-377c5a2d8291,1,samsung.smartphone,128.36,555110488
34,2019-10-01 02:26:05,1004750,ce885079-4d92-4fe6-92a3-377c5a2d8291,2,samsung.smartphone,196.27,555110488


In [34]:
df_grouped = df.groupby(['user_session', 'purchase_seq_cart'])['product_id'].first().unstack()
print(df_grouped.shape)
df_grouped.head()

(128747, 85)


user_session,0,1,2,3,4,5,6,7,8,9,...,75,76,77,78,79,80,81,82,83,84
00004ada-8f93-49a6-956d-4ed71ae94791,1005031.0,1005031.0,,,,,,,,,...,,,,,,,,,,
00005b76-13ba-4afe-b80d-2f2b337d3e92,1004806.0,1005066.0,,,,,,,,,...,,,,,,,,,,
0000c091-07d6-42b6-a7d8-75732b489429,1005238.0,1005238.0,,,,,,,,,...,,,,,,,,,,
0000de39-dc74-414d-8da6-83ad56135bf5,1004246.0,1002544.0,1004767.0,1004833.0,,,,,,,...,,,,,,,,,,
0000fa47-9577-480a-9fa4-be5c25e8dd59,1004767.0,1004836.0,,,,,,,,,...,,,,,,,,,,


[pandas.DataFrame.iterrows()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html)

Iterate over DataFrame rows as `(index, Series)` pairs.

*`Yields : index label or tuple of label`*

- The index of the row. A tuple for a MultiIndex.

*`data : Series`*

- The data of the row as a Series.

### Create a list of all `'product_id'` pairs that appear in the same `'user_session'` in purchase sequence

In [35]:
%%time

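product_pairs_tuples = []

# Each row holds one session's purchases in sequence, padded with NaN on the right
for (_, row) in df_grouped.iterrows():
    # Pair every purchase with the purchase that follows it
    for t in zip(row, row[1:]):
        if any(np.isnan(x) for x in t):
            break  # reached the NaN padding: no more pairs in this session
        product_pairs_tuples.append(t)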

print(len(product_pairs_tuples))            
product_pair_counts = pd.Series(product_pairs_tuples).value_counts(sort=False)
product_pair_counts

183061
CPU times: total: 12.8 s
Wall time: 12.8 s


(1005031.0, 1005031.0)    131
(1004806.0, 1005066.0)      1
(1005238.0, 1005238.0)    156
(1004246.0, 1002544.0)     26
(1002544.0, 1004767.0)    215
                         ... 
(1005116.0, 3600165.0)      1
(1004902.0, 1004873.0)      1
(3601448.0, 3600163.0)      1
(1005239.0, 1005234.0)      1
(5700753.0, 5700384.0)      1
Length: 56339, dtype: int64
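
The row-by-row loop above can also be expressed without `iterrows()`. The following is a minimal sketch on a made-up frame, assuming the rows are already in purchase order within each session; the column names `current` and `next` are chosen here for illustration. It pairs each purchase with the next one in the same session by shifting within groups.

```python
import pandas as pd

# Toy purchase log, assumed already sorted by time within each session.
toy = pd.DataFrame({
    "user_session": ["s1", "s1", "s1", "s2", "s2"],
    "product_id": [10, 11, 10, 20, 20],
})

# Next product bought in the same session (NaN for a session's last row).
nxt = toy.groupby("user_session")["product_id"].shift(-1)

# Drop the last purchase of each session, which has no successor.
pairs = (
    pd.DataFrame({"current": toy["product_id"], "next": nxt})
    .dropna()
    .astype({"next": "int64"})
)

# Count how often each (current, next) pair occurs.
pair_counts = pairs.value_counts(sort=False)
print(pair_counts)
```

Because the shift happens inside each group, no pair ever spans two sessions, which matches the intent of breaking at the first NaN in the loop version.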

In [36]:
pd.DataFrame(product_pair_counts, columns=['prod_pairs_cnt']).sort_values(by='prod_pairs_cnt', ascending=False)

Unnamed: 0,prod_pairs_cnt
"(1004856.0, 1004856.0)",4978
"(1004767.0, 1004767.0)",3745
"(4804056.0, 4804056.0)",3193
"(1004833.0, 1004833.0)",2504
"(1005115.0, 1005115.0)",2468
...,...
"(1005142.0, 1004767.0)",1
"(1002524.0, 12600013.0)",1
"(11600158.0, 11600158.0)",1
"(1004836.0, 5100700.0)",1


## **Transition Matrix (Frequencies & Probabilities)**

In [37]:
# Create a square matrix as 'transition_matrix' where each row and column represents a 'product_id'
product_id_unique = df['product_id'].unique()
transition_matrix_freq = pd.DataFrame(0, index=product_id_unique, columns=product_id_unique)
transition_matrix_freq

Unnamed: 0,1002532,4804056,1004750,1004856,1005104,1004833,1004958,1005003,1005109,1004834,...,43300207,5100699,2402889,100005807,12400516,4802669,6200711,21403811,3701249,3701365
1002532,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4804056,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1004750,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1004856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1005104,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4802669,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6200711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21403811,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3701249,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
# Fill in the 'transition_matrix_freq' with the product pair counts
for product_pair, count in product_pair_counts.items():
    # Check that the row label (first product) and the column label (second product) both exist in 'transition_matrix_freq'
    if product_pair[0] in transition_matrix_freq.index and product_pair[1] in transition_matrix_freq.columns:
        transition_matrix_freq.loc[product_pair] = count
transition_matrix_freq

Unnamed: 0,1002532,4804056,1004750,1004856,1005104,1004833,1004958,1005003,1005109,1004834,...,43300207,5100699,2402889,100005807,12400516,4802669,6200711,21403811,3701249,3701365
1002532,228,6,1,9,2,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4804056,1,3193,3,36,3,11,2,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1004750,0,7,747,147,0,167,1,1,0,23,...,0,0,0,0,0,0,0,0,0,0
1004856,2,32,63,4978,1,287,1,2,0,25,...,0,0,0,0,0,0,0,0,0,0
1005104,2,7,0,8,242,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4802669,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6200711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21403811,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3701249,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
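
As an alternative to filling the matrix cell by cell, the frequency table can be sketched with `pd.crosstab`. The pairs and the product list below are made up for illustration, and the column names `current` and `next` are assumptions; `reindex` pads the result so that products without any observed pair still get a row and a column.

```python
import pandas as pd

# Hypothetical (current, next) pairs; in the notebook these would come from
# consecutive purchases within a session.
pairs = pd.DataFrame({
    "current": [10, 10, 11, 20],
    "next":    [11, 11, 10, 20],
})
all_products = [10, 11, 20, 30]  # product 30 never appears in a pair

# crosstab counts each (current, next) combination; reindex pads the matrix
# to the full product list so every product has a row and a column.
freq = (
    pd.crosstab(pairs["current"], pairs["next"])
    .reindex(index=all_products, columns=all_products, fill_value=0)
)
print(freq)
```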


In [39]:
pd.DataFrame(transition_matrix_freq.sum(axis=1), columns=['sum_rows'])

Unnamed: 0,sum_rows
1002532,462
4804056,4240
1004750,1535
1004856,8457
1005104,535
...,...
4802669,0
6200711,0
21403811,1
3701249,1


In [40]:
# Add a small smoothing constant to every cell of 'transition_matrix_freq' so that every row sum is non-zero and can be divided by later when calculating probabilities
smoothing_factor = 0.0001
transition_matrix_freq += smoothing_factor
# To be used as Numerator of 'transition_matrix' probability calculation later
transition_matrix_freq.round(3)

Unnamed: 0,1002532,4804056,1004750,1004856,1005104,1004833,1004958,1005003,1005109,1004834,...,43300207,5100699,2402889,100005807,12400516,4802669,6200711,21403811,3701249,3701365
1002532,228.0,6.0,1.0,9.0,2.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4804056,1.0,3193.0,3.0,36.0,3.0,11.0,2.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1004750,0.0,7.0,747.0,147.0,0.0,167.0,1.0,1.0,0.0,23.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1004856,2.0,32.0,63.0,4978.0,1.0,287.0,1.0,2.0,0.0,25.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005104,2.0,7.0,0.0,8.0,242.0,0.0,0.0,0.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4802669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6200711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21403811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3701249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [41]:
# # Save in .csv file 'transition_matrix_freq' filled with 'product_pair' count 
# transition_matrix_freq.to_csv("kz_ecommerce_2019-Oct-Nov_transition-matrix-freq_limit-usr-sess_smoothed.csv", index=True)

In [42]:
# To be used as Denominator of 'transition_matrix' probability calculation later
pd.DataFrame(transition_matrix_freq.sum(axis=1), columns=['sum_rows'])

Unnamed: 0,sum_rows
1002532,463.3206
4804056,4241.3206
1004750,1536.3206
1004856,8458.3206
1005104,536.3206
...,...
4802669,1.3206
6200711,1.3206
21403811,2.3206
3701249,2.3206


In [43]:
# Example: the cell in the first row and first column, divided by the sum of the first row of 'transition_matrix_freq',
# gives the transition probability from 'product_id' 1002532 to 1002532
transition_matrix_freq.iloc[0, 0] / transition_matrix_freq.iloc[0].sum()

0.4921000706638126

In [44]:
# Similarly, result of transition probability of 'product_id' 1005104 to 4804056
transition_matrix_freq.iloc[4, 1] / transition_matrix_freq.iloc[4].sum()

0.01305208116190204
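
The two values above follow from additive smoothing. Writing $c_{ij}$ for the observed count of product $i$ followed by product $j$, $\varepsilon$ for the smoothing factor, and $n$ for the number of unique products (the symbols are introduced here for illustration), each transition probability is:

$$P_{ij} = \frac{c_{ij} + \varepsilon}{\sum_{k=1}^{n} c_{ik} + n\,\varepsilon}, \qquad \varepsilon = 0.0001,\quad n = 13206,\quad n\,\varepsilon = 1.3206$$

Substituting the counts and row sums shown in the earlier outputs:

$$P_{1002532 \to 1002532} = \frac{228 + 0.0001}{462 + 1.3206} = \frac{228.0001}{463.3206} \approx 0.4921$$

$$P_{1005104 \to 4804056} = \frac{7 + 0.0001}{535 + 1.3206} = \frac{7.0001}{536.3206} \approx 0.01305$$

For a product with a single observed transition, every unobserved cell in its row receives $0.0001 / 2.3206 \approx 4.309 \times 10^{-5}$, which is the floor value that fills the rows of such products.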

In [45]:
# Divide each row (numerator) by its row sum (denominator) to normalize the rows of the transition matrix into transition probabilities
transition_matrix = transition_matrix_freq.div(transition_matrix_freq.sum(axis=1), axis=0)
print(transition_matrix.shape)
transition_matrix.round(3)

(13206, 13206)


Unnamed: 0,1002532,4804056,1004750,1004856,1005104,1004833,1004958,1005003,1005109,1004834,...,43300207,5100699,2402889,100005807,12400516,4802669,6200711,21403811,3701249,3701365
1002532,0.492,0.013,0.002,0.019,0.004,0.009,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
4804056,0.000,0.753,0.001,0.008,0.001,0.003,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
1004750,0.000,0.005,0.486,0.096,0.000,0.109,0.001,0.001,0.000,0.015,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
1004856,0.000,0.004,0.007,0.589,0.000,0.034,0.000,0.000,0.000,0.003,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
1005104,0.004,0.013,0.000,0.015,0.451,0.000,0.000,0.000,0.011,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4802669,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
6200711,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
21403811,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000
3701249,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.431


In [46]:
# Sum across first row (axis=1)
transition_matrix.iloc[0, :].sum().round(14)

1.0

In [47]:
# Sum across all rows (axis=1)
pd.DataFrame(transition_matrix.sum(axis = 1), columns=['sum_rows'])

Unnamed: 0,sum_rows
1002532,1.0
4804056,1.0
1004750,1.0
1004856,1.0
1005104,1.0
...,...
4802669,1.0
6200711,1.0
21403811,1.0
3701249,1.0


In [48]:
# Check the sum across all rows (axis=1) are == 1.0
(transition_matrix.sum(axis = 1).round(14) == 1.0).all()  # Close to 1, up to 14 decimals

True
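
Rounding to 14 decimals and comparing with `==` works here, but a tolerance-based comparison is a common alternative for floating-point sums. The sketch below repeats the smoothing and normalization steps on a made-up 3 x 3 count matrix (labels and counts are hypothetical) and checks the row sums with `np.allclose`.

```python
import numpy as np
import pandas as pd

# Toy count matrix; product 11 has no observed outgoing transitions.
labels = [10, 11, 20]
freq = pd.DataFrame([[3, 1, 0], [0, 0, 0], [2, 0, 2]],
                    index=labels, columns=labels, dtype=float)

eps = 1e-4                      # plays the role of `smoothing_factor`
smoothed = freq + eps           # keeps the all-zero row divisible
probs = smoothed.div(smoothed.sum(axis=1), axis=0)

# Tolerance-based check instead of rounding and comparing with ==
print(np.allclose(probs.sum(axis=1), 1.0))

# The row without observations becomes a uniform distribution.
print(probs.loc[11].tolist())
```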

In [49]:
# Sum across first column
transition_matrix.iloc[:, 0].sum()

3.0249652310778474

In [50]:
# Check the sum across all columns (axis=0)
transition_matrix.sum(axis = 0)

1002532      3.024965
4804056     33.217069
1004750      8.490650
1004856     46.709288
1005104      2.837108
              ...    
4802669      0.524231
6200711      0.566117
21403811     0.526497
3701249      0.519214
3701365      0.950137
Length: 13206, dtype: float64

In [51]:
# # Save in .csv file 'transition_matrix' with transition probabilities of 'product_pair' 
# transition_matrix.to_csv("kz_ecommerce_2019-Oct-Nov_transition-matrix_limit-usr-sess_smoothed.csv", index=True)

In [52]:
# Convert the transition matrix to a numpy array
transition_matrix = transition_matrix.to_numpy()
print(transition_matrix.shape)
transition_matrix

(13206, 13206)


array([[4.92100071e-01, 1.29502120e-02, 2.15854853e-03, ...,
        2.15833270e-07, 2.15833270e-07, 2.15833270e-07],
       [2.35799199e-04, 7.52831583e-01, 7.07350442e-04, ...,
        2.35775621e-08, 2.35775621e-08, 2.35775621e-08],
       [6.50905807e-08, 4.55640574e-03, 4.86226703e-01, ...,
        6.50905807e-08, 6.50905807e-08, 6.50905807e-08],
       ...,
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30923037e-05],
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30966129e-01],
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30923037e-05]])

## **Making Predictions with Markov Chains Transition Matrix**

### [numpy.argsort(a, axis=-1, kind=None, order=None)](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)

Returns the indices that would sort an array.

Perform an indirect sort along the given axis using the algorithm specified by the `kind` keyword. It returns an array of indices of the same shape as `a` that index data along the given axis in sorted order.


### [numpy.argmax(a, axis=None, out=None, *, keepdims=<no value>)](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html)

Returns the indices of the maximum values along an axis.
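
To make the difference concrete, the sketch below applies both functions to a small made-up probability vector. It also shows `np.argpartition` as one possible way to get the top n indices without a full sort; that variant is an addition for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

# Toy probability row with distinct values (made up for illustration).
p = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

best = np.argmax(p)                     # index of the single largest value
top3_sorted = np.argsort(p)[::-1][:3]   # full sort, reversed, first 3

# argpartition only guarantees that the 3 largest values land in the last
# 3 slots (in arbitrary order), so those 3 are sorted afterwards.
part = np.argpartition(p, -3)[-3:]
top3_part = part[np.argsort(p[part])[::-1]]

print(best, top3_sorted, top3_part)
```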



In [53]:
# type(product_id_unique)

In [54]:
product_id_unique = df['product_id'].unique()
product_id_unique

array([ 1002532,  4804056,  1004750, ..., 21403811,  3701249,  3701365],
      dtype=int64)

In [55]:
current_product = 1002546
np.where(product_id_unique == current_product)

(array([1548], dtype=int64),)

In [56]:
np.where(product_id_unique == current_product)[0]

array([1548], dtype=int64)

In [57]:
current_index = np.where(product_id_unique == current_product)[0][0]
current_index

1548

In [58]:
transition_matrix

array([[4.92100071e-01, 1.29502120e-02, 2.15854853e-03, ...,
        2.15833270e-07, 2.15833270e-07, 2.15833270e-07],
       [2.35799199e-04, 7.52831583e-01, 7.07350442e-04, ...,
        2.35775621e-08, 2.35775621e-08, 2.35775621e-08],
       [6.50905807e-08, 4.55640574e-03, 4.86226703e-01, ...,
        6.50905807e-08, 6.50905807e-08, 6.50905807e-08],
       ...,
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30923037e-05],
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30966129e-01],
       [4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
        4.30923037e-05, 4.30923037e-05, 4.30923037e-05]])

In [59]:
transition_matrix[current_index]

array([4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
       4.30923037e-05, 4.30923037e-05, 4.30923037e-05])

In [60]:
transition_probabilities = transition_matrix[current_index]
transition_probabilities

array([4.30923037e-05, 4.30923037e-05, 4.30923037e-05, ...,
       4.30923037e-05, 4.30923037e-05, 4.30923037e-05])

In [61]:
# argmax() directly returns the index of the highest probability, with no sorting needed.
np.argmax(transition_probabilities)

1548

In [62]:
top_index = np.argmax(transition_probabilities)
top_index

1548

In [63]:
print(product_id_unique[top_index])
print(transition_probabilities[top_index])

1002546
0.4309661294492803


In [64]:
# argsort() returns the indices in ascending order of probability, so the result must be reversed before taking the top 5 ranked indices
np.argsort(transition_probabilities)

array([    0,  8798,  8799, ...,  4398, 13205,  1548], dtype=int64)

In [66]:
n = 5
top_indices = np.argsort(transition_probabilities, kind='quicksort')[::-1][:n]
top_indices

array([ 1548, 13205,  4398,  4408,  4407], dtype=int64)

In [67]:
# Convert 'top_indices' to 'product_id'
print(product_id_unique[top_indices])
print(transition_probabilities[top_indices])

[ 1002546  3701365 28718072 28719206  1801984]
[4.30966129e-01 4.30923037e-05 4.30923037e-05 4.30923037e-05
 4.30923037e-05]


### **Custom Function `recommend_next_product()`**

with **`np.argmax()`**

In [68]:
# argmax() gives the index of the top-ranked probability
def recommend_next_product(current_product):
    # Get the row index of the current product in 'product_id_unique'
    current_index = np.where(product_id_unique == current_product)[0][0]
    
    # Get the transition probabilities for the current product
    transition_probabilities = transition_matrix[current_index]
    
    # Get the index of the top recommended product;
    # argmax() returns the index of the highest probability directly, with no sorting needed
    top_index = np.argmax(transition_probabilities) 

    # Get the top recommended product and its probability
    top_product = product_id_unique[top_index]
    top_probability = transition_probabilities[top_index]
    
    return (top_product, top_probability)

In [69]:
# argmax() give top ranked probability index
current_product = 1002099  
print(f"Current product: {product_id_brand[current_product]} (Product ID: {current_product}) ")
next_product = recommend_next_product(current_product)
print(f"Recommended next product option : {product_id_brand[next_product[0]]} (Product ID: {next_product[0]}, Probability: {next_product[1]:.4f})")

Current product: samsung.smartphone (Product ID: 1002099) 
Recommended next product option : samsung.smartphone (Product ID: 1002099, Probability: 0.2874)


### **Custom Function `recommend_next_products()`**

with **`np.argsort()`**

In [70]:
# argsort() give top n ranking probabilities indices
def recommend_next_products(current_product, n=5):
    # Get the index of the current product
    current_index = np.where(product_id_unique == current_product)[0][0]
    
    # Get the transition probabilities for the current product
    transition_probabilities = transition_matrix[current_index]
    
    # Get the indices of the top n recommended products:
    # argsort() sorts ascending, so reverse with [::-1] for descending order, then take the first n indices
    top_indices = np.argsort(transition_probabilities, kind='quicksort')[::-1][:n] 
    
    # Get the top n recommended products and their probabilities
    top_products = product_id_unique[top_indices]
    top_probabilities = transition_probabilities[top_indices]
    
    return list(zip(top_products, top_probabilities))

In [71]:
# argsort() gives the indices of the top n ranked probabilities, here for a randomly chosen current product
current_product = np.random.choice(product_id_unique)  # huggies.diapers (Product ID: 16200282) ; pampers.diapers (Product ID: 16200299) 
print(f"Current product: {product_id_brand[current_product]} (Product ID: {current_product}) ")
next_products = recommend_next_products(current_product)
for i, (product, probability) in enumerate(next_products):
    print(f"Recommended next product option {i+1}: {product_id_brand[product]} (Product ID: {product}, Probability: {probability:.4f})")

Current product: huggies.diapers (Product ID: 16200282) 
Recommended next product option 1: huggies.diapers (Product ID: 16200277, Probability: 0.3759)
Recommended next product option 2: samsung.smartphone (Product ID: 1005100, Probability: 0.1880)
Recommended next product option 3: samsung.smartphone (Product ID: 1005099, Probability: 0.1880)
Recommended next product option 4: snowcap.vacuum (Product ID: 3701365, Probability: 0.0000)
Recommended next product option 5: samsung.refrigerators (Product ID: 2700978, Probability: 0.0000)
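
In the outputs above, the top slot is often the current product itself, and the remaining slots can be filled by products whose probability is only the smoothing floor. For a cross-selling use case, one possible variant is sketched below on a made-up 4 x 4 matrix. The function name `recommend_other_products`, the threshold `min_prob`, and the toy IDs are all hypothetical; the sketch skips the current product, drops weak transitions, and returns an empty list for an unknown product instead of raising an `IndexError`.

```python
import numpy as np

product_ids = np.array([10, 11, 20, 30])   # toy catalogue (hypothetical IDs)
P = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.20, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

def recommend_other_products(current_product, n=2, min_prob=0.05):
    """Top-n next products, skipping the current product and weak transitions."""
    matches = np.flatnonzero(product_ids == current_product)
    if matches.size == 0:
        return []                       # unknown product: nothing to recommend
    row = P[matches[0]].copy()
    row[matches[0]] = -np.inf           # never recommend the product itself
    order = np.argsort(row, kind="stable")[::-1][:n]
    return [(int(product_ids[i]), float(row[i])) for i in order if row[i] >= min_prob]

print(recommend_other_products(10))
print(recommend_other_products(99))
```

Whether self-transitions should be excluded depends on the business question: repeat purchases of the same product may be meaningful for consumables, while cross-selling typically targets a different product.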


### Opportunity to apply a cross-selling strategy in product offerings

Smartphones are the top-selling products for the kz ecommerce business, so the comparison below focuses on them as candidates for product innovation and improved marketing operations. 

In [72]:
# Create two lists for comparison between 'apple.smartphone' vs 'samsung.smartphone'
keys_apple = sorted([key for key, value in product_id_brand.items() if value == 'apple.smartphone'])
print(keys_apple)
apple_smartphone_list = keys_apple
keys_samsung = sorted([key for key, value in product_id_brand.items() if value == 'samsung.smartphone'])
print(keys_samsung)
samsung_smartphone_list = keys_samsung

[1001618, 1002522, 1002524, 1002525, 1002527, 1002528, 1002531, 1002532, 1002535, 1002536, 1002538, 1002540, 1002542, 1002544, 1002545, 1002546, 1002547, 1002548, 1002549, 1002628, 1002629, 1002633, 1002634, 1002786, 1002796, 1002995, 1003064, 1003141, 1003304, 1003305, 1003306, 1003307, 1003308, 1003309, 1003310, 1003311, 1003312, 1003313, 1003314, 1003315, 1003316, 1003317, 1003318, 1003319, 1003363, 1003800, 1003801, 1003802, 1003803, 1004225, 1004226, 1004227, 1004228, 1004229, 1004230, 1004231, 1004232, 1004233, 1004234, 1004235, 1004236, 1004237, 1004238, 1004239, 1004240, 1004241, 1004242, 1004243, 1004244, 1004245, 1004246, 1004247, 1004248, 1004249, 1004250, 1004251, 1004252, 1004253, 1004254, 1004255, 1004256, 1004258, 1004259, 1004260, 1004356, 1004357, 1004358, 1004359, 1004360, 1004361, 1004362, 1004363, 1004441, 1004916, 1004917, 1004919, 1004920, 1004925, 1004926, 1004929, 1004930, 1005104, 1005105, 1005106, 1005107, 1005108, 1005109, 1005110, 1005111, 1005112, 1005113, 

In [77]:
for current_product in apple_smartphone_list[0:20]: # Limit to first 20
    print(f"\nCurrent product: {product_id_brand[current_product]}  (Product ID: {current_product})\n")
    next_products = recommend_next_products(current_product)
    # Use a separate loop variable so the outer loop variable is not shadowed
    for rank, (product, probability) in enumerate(next_products, start=1):
        print(f"Recommended next product option #{rank}: {product_id_brand[product]} (Product ID: {product}, Probability: {probability:.4f})")


Current product: apple.smartphone  (Product ID: 1001618)

Recommended next product option #1: apple.smartphone (Product ID: 1001618, Probability: 0.3164)
Recommended next product option #2: apple.smartphone (Product ID: 1002796, Probability: 0.1582)
Recommended next product option #3: apple.smartphone (Product ID: 1003800, Probability: 0.1582)
Recommended next product option #4: apple.smartphone (Product ID: 1002544, Probability: 0.1582)
Recommended next product option #5: dyson.vacuum (Product ID: 3701357, Probability: 0.0000)

Current product: apple.smartphone  (Product ID: 1002522)

Recommended next product option #1: snowcap.vacuum (Product ID: 3701365, Probability: 0.0001)
Recommended next product option #2: samsung.refrigerators (Product ID: 2700978, Probability: 0.0001)
Recommended next product option #3: lg.tv (Product ID: 1801984, Probability: 0.0001)
Recommended next product option #4: alphard.subwoofer (Product ID: 5801632, Probability: 0.0001)
Recommended next product opti

In [76]:
for current_product in samsung_smartphone_list[0:20]: # Limit to first 20
    print(f"\nCurrent product: {product_id_brand[current_product]}  (Product ID: {current_product})\n")
    next_products = recommend_next_products(current_product)
    # Use a separate loop variable so the outer loop variable is not shadowed
    for rank, (product, probability) in enumerate(next_products, start=1):
        print(f"Recommended next product option #{rank}: {product_id_brand[product]} (Product ID: {product}, Probability: {probability:.4f})")


Current product: samsung.smartphone  (Product ID: 1000978)

Recommended next product option #1: samsung.smartphone (Product ID: 1000978, Probability: 0.0969)
Recommended next product option #2: samsung.smartphone (Product ID: 1004833, Probability: 0.0969)
Recommended next product option #3: oppo.smartphone (Product ID: 1004838, Probability: 0.0969)
Recommended next product option #4: samsung.smartphone (Product ID: 1003712, Probability: 0.0969)
Recommended next product option #5: huawei.headphone (Product ID: 4804137, Probability: 0.0969)

Current product: samsung.smartphone  (Product ID: 1002042)

Recommended next product option #1: samsung.smartphone (Product ID: 1004858, Probability: 0.4310)
Recommended next product option #2: snowcap.vacuum (Product ID: 3701365, Probability: 0.0000)
Recommended next product option #3: dauscher.dishwasher (Product ID: 4600534, Probability: 0.0000)
Recommended next product option #4: lg.tv (Product ID: 1801984, Probability: 0.0000)
Recommended next 

## **Conclusion:** 

Modeling a Sequential Recommendation System (SRS) of products with a Markov chain captures sequential patterns and temporal dependencies (across different points in time) in the products a customer purchases in sequence. 

Sequences of purchases that look random can therefore reveal opportunities to apply a cross-selling strategy in product offerings, product innovation, and marketing operations. 
  
### **Limitations and Recommendations for future research directions:**

1. The Simple Markov Chains model used here doesn't consider user-specific preferences or semantic (content-based) information about the products purchased. 

2. The transition matrix does not account for changes in customer behavior over time or for the introduction of new products, and it is only sampled from 2019-Oct to 2019-Nov. It is therefore crucial to update the transition probability matrix regularly with new data to keep it accurate and relevant.

3. Many product pairs have zero observed transitions, which indicates sparse data and can make predictions less reliable. Beyond adding a small constant (smoothing), more complex methods such as matrix factorization might handle the sparsity more effectively.

4. Include an Action Set (A), the set of all possible actions that can be performed by the user, which could include "viewing" a product, adding a product to the "cart", or making a "purchase". 

5. Evaluate and compare models with performance metrics for recommender systems, for example Precision, Recall, F1-score, MAP (Mean Average Precision), and Normalized Discounted Cumulative Gain (NDCG).
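
As a hedged illustration of the evaluation metrics in the last item, the sketch below computes recall at n on held-out (current, next) pairs. With a single true next product per pair, this is the share of pairs whose true next product appears in the top-n list. The function name `recall_at_n`, the stand-in recommender `toy_recommend`, and all IDs are hypothetical; a real evaluation would use a time-based train and test split of the purchase data.

```python
def recall_at_n(recommend, test_pairs, n=5):
    """Share of held-out (current, next) pairs whose true next product
    appears among the top-n recommendations for the current product."""
    if not test_pairs:
        return 0.0
    hits = sum(
        1 for current, actual in test_pairs
        if actual in [pid for pid, _ in recommend(current, n)]
    )
    return hits / len(test_pairs)

# Toy stand-in for a trained recommender: fixed ranked lists per product.
ranked = {10: [(11, 0.5), (20, 0.3)], 11: [(10, 0.9)]}

def toy_recommend(current, n):
    return ranked.get(current, [])[:n]

# Held-out pairs (made up): two are covered by the ranked lists, two are not.
held_out = [(10, 11), (10, 30), (11, 10), (20, 10)]
score = recall_at_n(toy_recommend, held_out, n=2)
print(score)
```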


## **Further Discussion:**

### **Modeling a Sequential Recommendation System (SRS) of Products with DTMC as Long Term Analysis with Steady State Transition Matrix**

The steady-state probabilities represent the long-term likelihood of each product being purchased. A question that the Markov chain model, applied as a DTMC long-term analysis with a steady-state transition matrix, could help answer is:

- What is the customer lifetime value (CLV), that is, the total revenue an ecommerce business can reasonably expect from a single customer account throughout the business relationship?

The **`steady-state probabilities`** are computed by finding the **`eigenvector`** of the **transition matrix** that corresponds to an **`eigenvalue of 1`**. For a row-stochastic transition matrix (each row sums to 1), this is the left eigenvector, that is, the eigenvector of the transposed matrix, rescaled so that its entries sum to 1. This can be useful in `predicting long-term trends` or `behaviors` in the `system`.
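
As a rough sketch of this computation, the snippet below finds the eigenvector for eigenvalue 1 with NumPy. The helper name `steady_state` and the 3-state matrix `toy_P` are assumptions made for illustration, not the transition matrix estimated earlier, and the approach assumes a row-stochastic matrix with a unique steady state.

```python
import numpy as np

def steady_state(P):
    """Return the stationary distribution pi with pi @ P == pi for a row-stochastic P."""
    P = np.asarray(P, dtype=float)
    # Left eigenvectors of P are the (right) eigenvectors of P transposed
    eigvals, eigvecs = np.linalg.eig(P.T)
    # Pick the eigenvector whose eigenvalue is closest to 1
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    # Rescale so the entries sum to 1 (this also fixes the arbitrary sign)
    return pi / pi.sum()

# Hypothetical 3-product transition matrix (each row sums to 1)
toy_P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])

pi = steady_state(toy_P)
print(pi.round(3))
print(np.allclose(pi @ toy_P, pi))
```

The final check confirms that applying one more transition leaves the distribution unchanged, which is the defining property of a steady state.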

However, **`not all Markov chains have a steady state`**.

For instance, if there are **`transient states`** (states that the chain may leave and never return to), or if there are **`periodic states`** (states that the chain can return to only at multiples of a **`fixed number of steps`** greater than 1), then the chain may not converge to a unique steady state. In such cases, a steady-state analysis would not be meaningful.
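
A small illustration of the periodic case, using a hypothetical two-product chain in which each purchase is always followed by the other product. The matrix `P_periodic` and the starting distribution are assumptions chosen only to show the behavior.

```python
import numpy as np

# Hypothetical 2-state chain with period 2: product A is always followed by B, and B by A
P_periodic = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
])

dist = np.array([1.0, 0.0])  # start with certainty in the first state
history = []
for step in range(4):
    dist = dist @ P_periodic  # distribution after one more transition
    history.append(dist.copy())
    print(step + 1, dist)

# A stationary vector still exists, but the chain never settles onto it from this start
pi = np.array([0.5, 0.5])
print(np.allclose(pi @ P_periodic, pi))
```

The distribution keeps alternating between the two states instead of converging, even though the vector `[0.5, 0.5]` satisfies the stationarity equation.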

- Theorem 1: A finite-state irreducible aperiodic DTMC has a unique $\pi$ (limiting distribution).
- Theorem 2: If a DTMC is irreducible, positive recurrent, and aperiodic, then it is ergodic. Ergodicity guarantees the existence of $\pi$.

One requirement for a Markov chain to be ergodic is irreducibility: it must be possible to go from `any state to every other state` in `finitely many moves`.


**What if the chain doesn’t reach a steady-state?**

1. Only regular Markov chains converge over time.
    - A chain is considered regular if `some power` of the `transition matrix` has `only positive`, `non-zero values`.
2. If the Markov chain does not converge, it has a `periodic pattern`.
    - The same transition probabilities recur from time to time.
3. Test whether the Markov chain will eventually converge.
    - For any *`n x n`* transition matrix, check whether `some power` of the `transition matrix` has only positive, non-zero values, testing powers up to: $\max_{power} = (n - 1)^{2} + 1$

reference: 
[Markov Models and Cost Effectiveness Analysis Applications in Medical Research, Springer](https://link.springer.com/chapter/10.1007/978-3-319-43742-2_24)
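
The regularity test described above can be sketched as follows. The function name `is_regular` and both example matrices are hypothetical. The sketch tracks only the zero/non-zero pattern of each power, which is enough for this test and avoids numerical underflow at high powers.

```python
import numpy as np

def is_regular(P):
    """Check whether some power of P, up to (n - 1)**2 + 1, has only positive entries.

    Returns (True, k) for the first such power k, or (False, None) if none is found.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    max_power = (n - 1) ** 2 + 1
    pattern = (P > 0).astype(int)  # 1 where a one-step transition is possible
    power = pattern.copy()         # zero/non-zero pattern of P**k, starting at k = 1
    for k in range(1, max_power + 1):
        if (power > 0).all():
            return True, k
        # Pattern of the next power: reachable in k + 1 steps
        power = ((power @ pattern) > 0).astype(int)
    return False, None

# Hypothetical regular chain: every product is reachable from every other within 2 steps
toy_P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])
# Hypothetical periodic chain: the two products always alternate
P_periodic = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
])

print(is_regular(toy_P))
print(is_regular(P_periodic))
```

For a large product catalogue the bound $(n - 1)^{2} + 1$ grows quickly, so in practice this check is likely only feasible on a reduced matrix of popular products.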

## **References:**

1. Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. "**Factorizing personalized Markov chains for next-basket recommendation**". In Proceedings of the 19th international conference on World wide web (WWW '10). Association for Computing Machinery, New York, NY, USA, 811–820. https://doi.org/10.1145/1772690.1772773 

2. Y. Yang, H. -J. Jang and B. Kim, "**A Hybrid Recommender System for Sequential Recommendation: Combining Similarity Models With Markov Chains**" in IEEE Access, vol. 8, pp. 190136-190146 (2020), https://doi.org/10.1109/ACCESS.2020.3027380  

3. Lonjarret, C., Auburtin, R., Robardet, C. et al. "**Sequential recommendation with metric models based on frequent sequences**". Data Min Knowl Disc 35, 1087–1133 (2021). https://doi.org/10.1007/s10618-021-00744-w 

4. Chen G, Li Z. "**A New Method Combining Pattern Prediction and Preference Prediction for Next Basket Recommendation**" Entropy (Basel); 23(11):1430. (2021) https://doi.org/10.3390/e23111430 

5. Chen, Xin, Alex Reibman, and Sanjay Arora. "**Sequential Recommendation Model for Next Purchase Prediction**" arXiv preprint arXiv:2207.06225 (2022). https://doi.org/10.48550/arXiv.2207.06225 
