## **Sprint 3: H&M Personalized Fashion Recommendations**

### Preparing For Matrix Factorization


___

Atoosa Rashid

[GitHub](https://github.com/atoosa-r/)

[LinkedIn](https://www.linkedin.com/in/atoosarashid/) 
____

### **Introduction**

In this data analysis, we explore H&M Group datasets, including transactions, customer information, and article details. H&M Group operates globally with 53 online markets and approximately 4850 stores. The objective is to uncover insights for developing effective product recommendations.

In this notebook we will be further processing our data and preparing it for Matrix Factorization. 

###  **Preprocessing**

#### Data Loading
Let's load the data and import the Python packages we will be using. 

In [1]:
#Importing libraries: 

import numpy as np
import pandas as pd

import os
import re
import string
import time

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')

from collections import Counter

from sklearn.metrics.pairwise import cosine_similarity

from scipy.spatial.distance import cosine as cosine_distance

[nltk_data] Downloading package punkt to /Users/Atoosa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
#Importing Dataframes:

articles_df=pd.read_csv("/Users/Atoosa/Desktop/data/hm/articles.csv")                #Clothing articles 

transactions_df=pd.read_csv("/Users/Atoosa/Desktop/data/hm/transactions_train.csv")  #Transaction information 

customers_df=pd.read_csv("/Users/Atoosa/Desktop/data/hm/customers.csv")              #Customer information 

**Data Dictionary:**

**articles df:**

Data on the articles (products) purchased in the transactions in transactions df by the customers in customers df.

- `article_id`: int64, ID for each article. 
- `product_code`: int64, Code representing the product. 
- `prod_name`: object, Name of the product. 
- `product_type_no`: int64, Number representing the product type. 
- `product_type_name` : object, Name of the product type. 
- `product_group_name`: object, Group name of the product. 
- `graphical_appearance_no`: int64, Number representing the graphical appearance. 
- `graphical_appearance_name`: object, Name representing the graphical appearance. 
- `colour_group_code`: int64, Code representing the colour group.
- `colour_group_name`: object, Name of the colour group.
- `perceived_colour_value_id`: int64, ID for perceived color value.
- `perceived_colour_value_name`: object, Name for perceived color value.
- `perceived_colour_master_id`: int64, ID for perceived color master.
- `perceived_colour_master_name`: object, Name for perceived color master.
- `department_no`: int64, Number representing the department.
- `department_name`:object, Name of the department.
- `index_code`: object, Code for the index.
- `index_name`: object, Name for the index
- `index_group_no`: int64, Number representing the index group
- `index_group_name`: object, Name of the index group. 
- `section_no`: int64, Number representing the section. 
- `section_name`: object, Name of the section 
- `garment_group_no`: int64, Number representing the garment group.
- `garment_group_name`: object, Name of the garment group.
- `detail_desc`: object, Detailed text description of the article.

**customers df** 

Data on the customers involved in making the transactions found in transactions df and articles of products found in articles df.

- `customer_id`: object, individual unique customer id #. This column is also present in transactions df 
- `FN`: float64
- `Active`: float64 
- `club_member_status`: object 
- `fashion_news_frequency`: object 
- `age`: float64
- `postal_code`: object 

**transactions df**

Data on the transactions being made to purchase the articles of products found in the articles df by the customers found in the customers df. Note: The prices are not real prices, they are altered.


- `t_dat`: object, transaction date
- `customer_id`: object, individual unique customer id #. This column is also present in customers df 
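- `article_id`: int64, ID of the purchased article. This column is also present in articles df
- `price`: float64, altered (not real) price of the purchase
- `sales_channel_id`: int64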


___
### **Cleaning**

In [3]:
# Finding all duplicate rows:

duplicates = transactions_df[transactions_df.duplicated(keep=False)]

# Viewing the duplicated transactions: 

duplicates

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
14,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,501820043,0.016932,2
15,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,501820043,0.016932,2
17,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,671505001,0.033881,2
18,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,671505001,0.033881,2
19,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,631848002,0.033881,2
...,...,...,...,...,...
31788282,2020-09-22,ff6f55a51af284b71dcd264396b299e548f968c1769e71...,919786002,0.042356,2
31788291,2020-09-22,ff94f31e864d9b655643ac4d2adab3611c7241adb5d34c...,901666001,0.084729,2
31788292,2020-09-22,ff94f31e864d9b655643ac4d2adab3611c7241adb5d34c...,901666001,0.084729,2
31788311,2020-09-22,ffd4cf2217de4a0a3f9f610cdec334c803692a18af08ac...,791587021,0.025407,2


The duplicate rows in our dataset most likely represent customers purchasing multiple units of the same article on the same day, at the same price, and through the same sales channel, since the dataset has no quantity column. Most of these duplicates occur in pairs, which suggests that customers bought two units of the item. Dropping these rows would lead to:

1. Inaccurate Sales Data: Each row reflects actual sales. Removing them would underreport sales figures and revenue.
2. Incomplete Customer Behavior Analysis: Retaining all rows ensures a comprehensive understanding of customer purchasing patterns, which is vital for the analysis.

**Therefore, to maintain data integrity and accuracy, we will not drop these duplicate rows.**
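
To make the reasoning above concrete, here is a minimal sketch on toy data (the customer IDs, article IDs, and prices are made up for illustration). It shows how `duplicated(keep=False)` flags both copies of a repeated purchase, and how counting rows per customer and article turns those repeats into a unit count instead of discarding them.

```python
import pandas as pd

# Toy transactions (hypothetical values): customer "c1" buys article 501 twice on the same day
toy_transactions = pd.DataFrame({
    "t_dat": ["2018-09-20", "2018-09-20", "2018-09-20"],
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [501, 501, 671],
    "price": [0.0169, 0.0169, 0.0339],
})

# keep=False marks every row that belongs to a duplicated group, not only the repeats
dup_mask = toy_transactions.duplicated(keep=False)

# Counting rows per (customer, article) keeps the information carried by the duplicates
units = (
    toy_transactions
    .groupby(["customer_id", "article_id"])
    .size()
    .reset_index(name="unit_number")
)

print(dup_mask.tolist())
print(units)
```

In this toy example the two identical rows for `c1` become a single row with a unit count of 2, which is the kind of signal that would be lost if duplicates were dropped.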

Next, the datasets will be examined for any null values present.

In [4]:
#Checking for null values: 

articles_df.isna().sum()

article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64

In [5]:
#Creating series to check null values:

nulls = pd.isnull(articles_df["detail_desc"])  
    
#Filtering data for rows with desc nulls:

articles_df[nulls] 

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
1467,351332007,351332,Marshall Lace up Top,252,Sweater,Garment Upper body,1010018,Treatment,7,Grey,...,Jersey Fancy DS,D,Divided,2,Divided,58,Divided Selected,1005,Jersey Fancy,
2644,420049002,420049,OL TAGE PQ,87,Boots,Shoes,1010016,Solid,13,Beige,...,Premium Quality,C,Ladies Accessories,1,Ladieswear,64,Womens Shoes,1020,Shoes,
2645,420049003,420049,OL TAGE PQ,87,Boots,Shoes,1010016,Solid,23,Dark Yellow,...,Premium Quality,C,Ladies Accessories,1,Ladieswear,64,Womens Shoes,1020,Shoes,
2742,426199002,426199,Ellen Shortie Daisy Low 3p,286,Underwear bottom,Underwear,1010016,Solid,9,Black,...,Casual Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear",
2743,426199010,426199,Ellen Shortie Daisy Low 3p,286,Underwear bottom,Underwear,1010017,Stripe,8,Dark Grey,...,Casual Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67838,752458001,752458,Poissy boho dress,265,Dress,Garment Full body,1010007,Embroidery,10,White,...,Dress,A,Ladieswear,1,Ladieswear,6,Womens Casual,1013,Dresses Ladies,
72720,768842001,768842,Andrews set,270,Garment Set,Garment Full body,1010017,Stripe,10,White,...,Baby Boy Woven,G,Baby Sizes 50-98,4,Baby/Children,41,Baby Boy,1006,Woven/Jersey/Knitted mix Baby,
72721,768842004,768842,Andrews set,270,Garment Set,Garment Full body,1010004,Check,73,Dark Blue,...,Baby Boy Woven,G,Baby Sizes 50-98,4,Baby/Children,41,Baby Boy,1006,Woven/Jersey/Knitted mix Baby,
93144,856985001,856985,Pogo rope,67,Belt,Accessories,1010016,Solid,12,Light Beige,...,Belts,C,Ladies Accessories,1,Ladieswear,65,Womens Big accessories,1019,Accessories,


There are 416 missing values in the `detail_desc` column. Given that the dataset contains 105,542 rows, and these missing values constitute less than 0.4% of the total entries, they will be dropped to maintain data quality.
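
As a quick check of the proportion quoted above, the arithmetic can be sketched as follows, using the two counts reported in this notebook.

```python
# Counts taken from the outputs above
n_missing = 416       # null values in detail_desc
n_rows = 105_542      # rows in articles_df

share_missing = n_missing / n_rows

print(f"Share of rows with a missing description: {share_missing:.2%}")
```

On the real dataframe, the same share could be obtained with `articles_df['detail_desc'].isna().mean()`.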


In [6]:
# Removing rows with null values in detail_desc:

articles_df.dropna(subset=['detail_desc'], inplace=True)

#Sanity Check:

nulls = articles_df['detail_desc'].isnull().sum()

print(f"Number of null values in detail_desc after dropping: {nulls}")

Number of null values in detail_desc after dropping: 0


In [7]:
#Sanity Check: 

articles_df.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


For the purposes of our matrix factorization system, we will not be using the majority of the columns in `articles_df`. Before dropping the unnecessary columns, we first check the remaining two dataframes for null values.

In [8]:
#Checking for null values: 

customers_df.isna().sum()

customer_id                    0
FN                        895050
Active                    907576
club_member_status          6062
fashion_news_frequency     16011
age                        15861
postal_code                    0
dtype: int64

The `customers_df` contains many null values. However, the columns with null values will not be analyzed here, because the recommendation systems only use the `customer_id` column, which does not contain any null values.

In [9]:
#Checking for null values: 

transactions_df.isna().sum()

t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64

Our transactions dataframe is free of any nulls.

For the purposes of this analysis, the transaction date `t_dat` column will be converted to a datetime format.

In [10]:
# Converting t_dat column to datetime:

transactions_df['t_dat'] = pd.to_datetime(transactions_df['t_dat'])

#Sanity Check: 

transactions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       object        
 2   article_id        int64         
 3   price             float64       
 4   sales_channel_id  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 1.2+ GB


For our matrix factorization system, we will not be utilizing the majority of the columns in the `articles_df`. Therefore, we can proceed by dropping the columns that are unnecessary for our recommendation systems and export the csv for later use. 

In [11]:
#List of columns to drop:
columns_to_drop = [
    'product_code',
    'product_type_no',
    'graphical_appearance_no',
    'colour_group_code',
    'department_no',
    'index_code',
    'index_group_no',
    'section_no',
    'garment_group_no',
    'perceived_colour_value_id',
    'perceived_colour_value_name',
    'perceived_colour_master_id',
    'perceived_colour_master_name',
    'index_name',
    'graphical_appearance_name'
]

#Drop the columns from articles_df
articles_df.drop(columns=columns_to_drop, inplace=True)

In [12]:
#Sanity Check:

articles_df.head(3)

Unnamed: 0,article_id,prod_name,product_type_name,product_group_name,colour_group_name,department_name,index_group_name,section_name,garment_group_name,detail_desc
0,108775015,Strap top,Vest top,Garment Upper body,Black,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,Strap top,Vest top,Garment Upper body,White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,Strap top (1),Vest top,Garment Upper body,Off White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.


Let's review the articles_df `detail_desc` column which we will be using for our future NLP models. 

In [13]:
#Initial review of the descriptions 

# Adjusting the display options:

pd.set_option('display.max_colwidth', None)

# Getting the unique descriptions:

unique_descriptions = articles_df['detail_desc'].unique()

# Printing the unique descriptions:

for desc in unique_descriptions:
    
    print(desc)

Jersey top with narrow shoulder straps.
Microfibre T-shirt bra with underwired, moulded, lightly padded cups that shape the bust and provide good support. Narrow adjustable shoulder straps and a narrow hook-and-eye fastening at the back. Without visible seams for greater comfort.
Semi shiny nylon stockings with a wide, reinforced trim at the top. Use with a suspender belt. 20 denier.
Tights with built-in support to lift the bottom. Black in 30 denier and light amber in 15 denier.
Semi shiny tights that shape the tummy, thighs and calves while also encouraging blood circulation in the legs. Elasticated waist.
Opaque matt tights. 200 denier.
Sweatshirt in soft organic cotton with a  press-stud on one shoulder (sizes 12-18 months and 18-24 months without a press-stud). Brushed inside.
Two soft bandeau bras in soft jersey with side support and a silicone trim at the top.
Fitted top in soft stretch jersey with a wide neckline and long sleeves.
Trousers in sweatshirt fabric with an elasticat

**To effectively analyze and work with our article descriptions, we'll start by preprocessing the descriptions. Our preprocessing involves several key steps to prepare the text data for word embedding:**

1. Lowercasing: Converts all characters to lowercase for uniformity and case insensitivity.
2. Removing Punctuation: Eliminates punctuation marks to reduce noise and simplify the text.
3. Tokenizing: Splits the text into individual words, facilitating further analysis.
4. Removing Stopwords: Removes common words that do not add significant meaning, enhancing focus on important words.

These preprocessing steps are crucial as they reduce noise, ensure data uniformity, and enhance the quality of word embeddings, leading to more accurate and efficient text analysis.
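
As a rough illustration of the four steps on a single description, the sketch below traces each stage separately. It assumes a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, hypothetical) in place of NLTK's English list, so that it runs without any downloads.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only; the notebook itself uses NLTK's English list
TOY_STOP_WORDS = {"with", "a", "an", "the", "and", "at", "in", "on"}

description = "Jersey top with narrow shoulder straps."

# 1. Lowercasing
lowered = description.lower()

# 2. Removing punctuation (each punctuation character is replaced by a space)
no_punct = re.sub(f"[{re.escape(string.punctuation)}]", " ", lowered)

# 3. Tokenizing on whitespace
tokens = no_punct.split()

# 4. Removing stop words
kept = [token for token in tokens if token not in TOY_STOP_WORDS]

print(kept)
```

Because punctuation is replaced by spaces, hyphenated terms such as "hook-and-eye" are split into separate words before the stop words are removed.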

We'll start by creating our custom tokenizer that will help with the preprocessing of the descriptions. 

In [14]:
# Initialize the stopwords list (requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords')):

ENGLISH_STOP_WORDS = set(stopwords.words('english'))

#Creating custom tokenizer function:

def my_custom_tokenizer(sentence):

    if not isinstance(sentence, str):

        return ""
    
    # Lowercasing
    sentence = sentence.lower()

    # Removing punctuation
    sentence = re.sub(f"[{re.escape(string.punctuation)}]", " ", sentence)

    # Splitting into words
    words = sentence.split()

    # Removing stopwords
    words = [word for word in words if word not in ENGLISH_STOP_WORDS and word]

    # Reconstructing the sentence
    cleaned_sentence = ' '.join(words)

    return cleaned_sentence

In [15]:
#Applying our preprocessing to the 'detail_desc' column:

articles_df['preprocessed_detail_desc'] = articles_df['detail_desc'].apply(my_custom_tokenizer)

In [16]:
#Sanity check: 

articles_df.head(3)

Unnamed: 0,article_id,prod_name,product_type_name,product_group_name,colour_group_name,department_name,index_group_name,section_name,garment_group_name,detail_desc,preprocessed_detail_desc
0,108775015,Strap top,Vest top,Garment Upper body,Black,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
1,108775044,Strap top,Vest top,Garment Upper body,White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
2,108775051,Strap top (1),Vest top,Garment Upper body,Off White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps


In [17]:
#Exporting our cleaned articles df:

#output_file = 'cleaned_articles_df.csv'

#articles_df.to_csv(output_file, index=False)

___
### **Preparing for Matrix Factorization**

#### **Collaborative Filtering (Item-Based)**


We'll use our existing Collaborative Filtering matrix and refine it further in preparation for Matrix Factorization. 

In [18]:
# Getting the maximum date in the 't_dat' column and calculating the start date as 2 weeks before the end date: 

end_date = transactions_df['t_dat'].max()

start_date = end_date - pd.DateOffset(weeks=2)

# Filtering the dataframe to include only the last 2 weeks of data:

transactions = transactions_df[transactions_df['t_dat'] >= start_date]

transactions.head()

# Printing the shape of the filtered dataframe:

print(transactions.shape)

(531967, 5)
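
One detail of the date filter is worth spelling out: because the comparison uses `>=`, the start date itself is included, so the window covers 15 calendar days rather than 14. A minimal sketch on toy dates (one hypothetical row per day) shows this.

```python
import pandas as pd

# Toy dataframe with one row per day (hypothetical dates)
toy = pd.DataFrame({"t_dat": pd.date_range("2020-09-01", "2020-09-22", freq="D")})

# Same logic as the cell above: latest date minus two weeks
end_date = toy["t_dat"].max()
start_date = end_date - pd.DateOffset(weeks=2)

# The >= comparison keeps the start date itself
recent = toy[toy["t_dat"] >= start_date]

print(start_date.date(), end_date.date(), recent["t_dat"].nunique())
```

If exactly 14 days were wanted, a strict `>` comparison would be one way to exclude the boundary day.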


In [19]:
#Checking the number of unique items(articles) present in the transactions dataframe:

transactions['article_id'].nunique()

23083

In [20]:
# Aggregating transactions to get the count of each article purchased by each customer:

R = transactions.groupby(by=['customer_id', 'article_id']).size().reset_index(name='unit_number')

#Sanity Check: 

R.head(3)

Unnamed: 0,customer_id,article_id,unit_number
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,794321007,1
1,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,448509014,1
2,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,719530003,1


In [43]:
#Verifying shape of our new df: 

print(f"There are {R.shape[0]} rows and {R.shape[1]} columns in our R dataframe.")


There are 470191 rows and 3 columns in our R dataframe.


In [21]:
#Sanity check to make sure aggregation includes all transactions for each customer by testing with one customer:

R[R['customer_id']=='ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264']  

Unnamed: 0,customer_id,article_id,unit_number
470187,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264,689365050,1
470188,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264,762846027,1
470189,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264,794819001,1
470190,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264,884081001,1


**Creating User-Item matrix**

In order to implement our recommendations we will need a User-Item matrix. We'll use the pivot method to create the matrix. The `customer_id` is set as the index, `article_id` as the columns, and `unit_number` as the values. This matrix represents the number of units purchased by each customer for each article.
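
To show what the pivot produces, here is a small sketch on a toy version of `R` (the IDs and counts are made up). Customer and article pairs that never occur in the data become NaN in the resulting matrix.

```python
import pandas as pd

# Toy long-format purchases (hypothetical IDs and counts)
toy_R = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [101, 102, 101],
    "unit_number": [2, 1, 1],
})

# One row per customer, one column per article, unit counts as values
toy_matrix = toy_R.pivot(index="customer_id", columns="article_id", values="unit_number")

print(toy_matrix)
```

Note that `pivot` requires each customer and article pair to appear at most once, which the earlier `groupby` aggregation guarantees.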

In [22]:
# Creating the user-item matrix:

filled_matrix = R.pivot(index='customer_id', columns='article_id', values='unit_number')



In [23]:
#Sanity check:

filled_matrix.head()

article_id,108775044,111565001,111586001,111593001,111609001,120129001,120129014,123173001,126589007,129085001,...,948152002,949198001,949551001,949551002,949594001,952267001,952938001,953450001,953763001,956217002
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,,,,,,,,,...,,,,,,,,,,
0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,,,,,,,,,,,...,,,,,,,,,,
0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37e011580a479e80aa94,,,,,,,,,,,...,,,,,,,,,,
0002cca4cc68601e894ab62839428e5f0696417fe0f9e84551c6827a7629d441,,,,,,,,,,,...,,,,,,,,,,
00039306476aaf41a07fed942884f16b30abfa83a2a8bea972019098d6406793,,,,,,,,,,,...,,,,,,,,,,


#### **Scaling and Creating Scoring**

With our Collaborative Filtering matrix prepared, the next step involves calculating a scoring column to initiate Matrix Factorization.

To create our scoring system, we first find, for each customer (row by row), the maximum number of units purchased of any single article within the two-week window. We then scale our data by dividing each purchase count by that customer's maximum, and multiply the result by 5. Because every recorded count is at least 1, the resulting scores are greater than 0 and at most 5, and each customer's most-purchased article always receives a score of 5.

Using a score within this range is more effective for our model, as it standardizes the input values and ensures consistency in the Matrix Factorization process. This approach helps in better capturing the customer's purchasing behavior, ultimately leading to more accurate recommendations.
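
The scoring rule can be sketched on a small toy array (values made up for illustration), where NaN stands for an article the customer did not buy.

```python
import numpy as np

# Toy unit counts: rows are customers, columns are articles, NaN means "not purchased"
toy_values = np.array([
    [2.0, 1.0, np.nan],
    [np.nan, 1.0, np.nan],
    [4.0, np.nan, 1.0],
])

# Per-customer maximum, ignoring NaN entries
row_max = np.nanmax(toy_values, axis=1)

# Divide each row by its own maximum (broadcast over columns) and rescale to 5
scaled = toy_values / row_max[:, None] * 5

print(scaled)
```

The second toy customer bought a single article once and still receives a score of 5, since the score is relative to each customer's own maximum rather than to the counts of other customers.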

In [24]:
#Calculating the maximum value per row and ignoring nans:

filled_matrix_values = filled_matrix.values

row_max = np.nanmax(filled_matrix_values, axis=1)

#Making sure we're not dividing by zero by checking no max value is zero:

row_max[row_max == 0] = np.nan

In [25]:
#Dividing each value in the row by the maximum value of that row and multiply by 5:

scaled_values = (filled_matrix_values.T / row_max).T * 5

#Creating a new DataFrame with the scaled values:

scaled_matrix = pd.DataFrame(scaled_values, index=filled_matrix.index, columns=filled_matrix.columns)

In [26]:
#Sanity check:

scaled_matrix.head()

article_id,108775044,111565001,111586001,111593001,111609001,120129001,120129014,123173001,126589007,129085001,...,948152002,949198001,949551001,949551002,949594001,952267001,952938001,953450001,953763001,956217002
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,,,,,,,,,...,,,,,,,,,,
0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,,,,,,,,,,,...,,,,,,,,,,
0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37e011580a479e80aa94,,,,,,,,,,,...,,,,,,,,,,
0002cca4cc68601e894ab62839428e5f0696417fe0f9e84551c6827a7629d441,,,,,,,,,,,...,,,,,,,,,,
00039306476aaf41a07fed942884f16b30abfa83a2a8bea972019098d6406793,,,,,,,,,,,...,,,,,,,,,,


In [27]:
#Sanity check:

scaled_matrix.value_counts(108775044)

108775044
2.5    2
5.0    2
Name: count, dtype: int64

In [28]:
#Sanity check:

scaled_matrix.value_counts(120129001)

120129001
5.0    2
Name: count, dtype: int64

Now that we have our new matrix with scores ranging from 0 to 5, the next step for our Matrix Factorization preparation is to transform this matrix into a dataframe with three columns: `customer_id`, `article_id`, and the `score` we created.

The melt function converts a wide-format dataframe into a long-format dataframe. It will reshape our matrix so that each row represents a single customer's score for a single article, with `customer_id`, `article_id`, and `score` as columns.
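
On a toy matrix (hypothetical IDs and scores), the reshape looks like the sketch below. It uses plain pandas, which is fine at this size; the full matrix is a different matter.

```python
import numpy as np
import pandas as pd

# Toy scaled matrix: two customers, two articles, one missing score
toy_scaled = pd.DataFrame(
    [[5.0, 2.5], [np.nan, 5.0]],
    index=pd.Index(["c1", "c2"], name="customer_id"),
    columns=pd.Index([101, 102], name="article_id"),
)

# Wide to long: one row per (customer, article) pair, then drop pairs with no purchase
toy_long = (
    toy_scaled
    .reset_index()
    .melt(id_vars="customer_id", var_name="article_id", value_name="score")
    .dropna(subset=["score"])
    .reset_index(drop=True)
)

print(toy_long)
```

The four cells of the toy matrix become three rows, because the one NaN cell is dropped.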

Normally, this can be done by calling the pandas `.melt` method directly. However, melting the full matrix in pandas exceeded our available memory, so we will use a Dask dataframe to handle this process in partitions.

In [29]:
#Creating an empty dataframe:

melted_matrix = pd.DataFrame(columns=['customer_id', 'article_id', 'score'])

In [30]:
import dask.dataframe as dd

#Converting the scaled_matrix df to a dask df:

dask_df = dd.from_pandas(scaled_matrix, npartitions=10)

In [31]:
#Melting the dask df to create a table:

melted_matrix = dask_df.reset_index().melt(id_vars='customer_id', var_name='article_id', value_name='score')

There will be many rows with nan values since customers are only purchasing select items. We will remove those from our `melted_matrix`.

In [32]:
#Dropping rows with nan values in the score column:

melted_matrix = melted_matrix.dropna(subset=['score'])

In [33]:
#Sanity Check: 

melted_matrix.head(3)

Unnamed: 0,customer_id,article_id,score
20901,0dbe2f2ceb2e205216589497f46228ab5b6eb8927032f43fba4945d6179b0aef,111565001,5.0
26786,18cbcb477a05ec64929809693fcac49edd404571f7e667f65dfc487bf2d99e0c,111565001,2.5
26928,190ce2b4681d15e1b4e752aaefdf07d73cf3265d33e5de764f313818b17f26d4,111565001,5.0


Our melted matrix is currently a Dask dataframe, which uses lazy evaluation, meaning the operations are delayed until needed. To proceed, we'll convert it to a regular Pandas dataframe in chunks due to computational constraints. With more resources, we could have melted the dataframe directly into a Pandas dataframe from the start, avoiding the need for Dask.

In [34]:
#Empty list to store the pandas chunks:

chunks = []

#Computing each Dask partition into a pandas df, one partition at a time:

for partition in melted_matrix.to_delayed():
    chunk = partition.compute()
    chunks.append(chunk)

#Concatenate all the chunks into a single pandas df:

regular_dataframe = pd.concat(chunks, ignore_index=True)

In [35]:
#Sanity Check:

regular_dataframe.head(3)

Unnamed: 0,customer_id,article_id,score
0,0dbe2f2ceb2e205216589497f46228ab5b6eb8927032f43fba4945d6179b0aef,111565001,5.0
1,18cbcb477a05ec64929809693fcac49edd404571f7e667f65dfc487bf2d99e0c,111565001,2.5
2,190ce2b4681d15e1b4e752aaefdf07d73cf3265d33e5de764f313818b17f26d4,111565001,5.0


In [36]:
#Verifying shape of our new df: 

print(f"There are {regular_dataframe.shape[0]} rows and {regular_dataframe.shape[1]} columns in our regular_dataframe dataframe.")


There are 470191 rows and 3 columns in our regular_dataframe dataframe.


The shape of our new dataframe matches the original R dataframe we created before generating the collaborative filtering matrix.
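
Since the result has one row per customer and article pair, just like `R`, the same scores could in principle be computed directly on the long-format data without building the wide matrix at all. The sketch below is an alternative under that assumption, not the approach used in this notebook, and is shown on toy data with made-up IDs.

```python
import pandas as pd

# Toy long-format purchases (hypothetical IDs and counts)
toy_R = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [101, 102, 102],
    "unit_number": [2, 1, 1],
})

# Per-customer maximum unit count, aligned back to every row of that customer
customer_max = toy_R.groupby("customer_id")["unit_number"].transform("max")

# Same rule as the matrix version: count / customer's maximum * 5
toy_scores = toy_R.assign(score=toy_R["unit_number"] / customer_max * 5)[
    ["customer_id", "article_id", "score"]
]

print(toy_scores)
```

This works because the per-row maximum over purchased articles in the wide matrix is the same quantity as the per-customer maximum of `unit_number` in the long table.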

For the next steps, we'll be performing Matrix Factorization. To improve efficiency, we can export our newly created dataframe to a csv file and use that for further processing.

In [37]:
#Code to export file to a csv: 

#output_file = 'melted_dataframe.csv'
#regular_dataframe.to_csv(output_file, index=False)

___

**Rough work**

In [38]:
# Kernel crashed

#melted_matrix = scaled_matrix.reset_index().melt(id_vars='customer_id', var_name='article_id', value_name='score')

In [39]:
#took over 500 minutes to run below:

# Iterating over each row and append the values to the melted_matrix DataFrame
#for customer_id, row in scaled_matrix.iterrows():
#    temp_df = pd.DataFrame({
#        'customer_id': customer_id,
#        'article_id': row.index,
#        'score': row.values
#    })
#    temp_df = temp_df.dropna(subset=['score'])  # Drop rows where 'score' is NaN
#    melted_matrix = pd.concat([melted_matrix, temp_df], ignore_index=True)

In [None]:
# Kernel crashed

#chunks = []
#chunk_size = 1000 

#for start in range(0, scaled_matrix.shape[0], chunk_size):
#    chunk = scaled_matrix.iloc[start:start+chunk_size]
#    melted_chunk = chunk.reset_index().melt(id_vars='customer_id', var_name='article_id', value_name='score')
#    chunks.append(melted_chunk)

#melted_matrix = pd.concat(chunks, ignore_index=True)

In [40]:
#Kernel crashed

#def chunk_generator(df, chunk_size):
#    for start in range(0, df.shape[0], chunk_size):
#        yield df.iloc[start:start + chunk_size]

In [41]:
#Kernel crashed

#chunk_size = 500  
#output_file = 'melted_matrix.csv'

#first_chunk = next(chunk_generator(scaled_matrix, chunk_size))
#melted_chunk = first_chunk.reset_index().melt(id_vars='customer_id', var_name='article_id', value_name='score')
#melted_chunk.to_csv(output_file, index=False)

# Append the rest of the chunks to the CSV file
#for chunk in chunk_generator(scaled_matrix, chunk_size):
#    melted_chunk = chunk.reset_index().melt(id_vars='customer_id', var_name='article_id', value_name='score')
#    melted_chunk.to_csv(output_file, mode='a', header=False, index=False)

#melted_matrix = pd.read_csv(output_file)
#print(melted_matrix.head())