# Aim:
- The aim is to build a machine-learning-based recommendation engine that delivers personalized product recommendations more effectively across the entire customer base.

# Data:
- It contains about 73k logged user interactions on more than 3k public articles shared on the platform.

### This dataset features some distinctive characteristics:

- Item attributes: Articles' original URL, title, and plain-text content are available in two languages (English and Portuguese).
- Contextual information: The context of each user visit, such as date/time, client (mobile native app / browser), and geolocation.
- Logged users: All users are required to log in to the platform, which provides long-term tracking of user preferences (without depending on device cookies).
- Rich implicit feedback: Different interaction types were logged, making it possible to infer a user's level of interest in an article (e.g. comments > likes > views).
- Multi-platform: User interactions were tracked across different platforms (web browsers and mobile native apps).

# Model Type:
- Collaborative and content-based filtering

# Data dictionary:



# Import Libraries

In [1]:
# Data manipulation
import pandas as pd 
import numpy as np

# Data visualization
from matplotlib import pyplot as plt
import seaborn as sns

# Building the recommender system
from tqdm import tqdm

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")



# Load Data

In [2]:
shared_article_dataset = pd.read_csv('/Users/eugene/Personal_Projects/Real_ML_Project/recommender_system/research/shared_articles.csv')
user_interaction_dataset = pd.read_csv('/Users/eugene/Personal_Projects/Real_ML_Project/recommender_system/research/users_interactions.csv')

In [3]:
# explore the dataset
shared_article_dataset.head()

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en


In [4]:
shared_article_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3122 entries, 0 to 3121
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   timestamp        3122 non-null   int64 
 1   eventType        3122 non-null   object
 2   contentId        3122 non-null   int64 
 3   authorPersonId   3122 non-null   int64 
 4   authorSessionId  3122 non-null   int64 
 5   authorUserAgent  680 non-null    object
 6   authorRegion     680 non-null    object
 7   authorCountry    680 non-null    object
 8   contentType      3122 non-null   object
 9   url              3122 non-null   object
 10  title            3122 non-null   object
 11  text             3122 non-null   object
 12  lang             3122 non-null   object
dtypes: int64(4), object(9)
memory usage: 317.2+ KB
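
The `timestamp` column holds integers such as 1459192779. Their magnitude suggests Unix epoch seconds, but the source does not state the unit, so treat that as an assumption. Under that assumption, a minimal sketch of converting them to readable datetimes (the `event_time` column name is made up for illustration):

```python
import pandas as pd

# Sample values copied from the `timestamp` column shown above.
sample = pd.DataFrame({"timestamp": [1459192779, 1459193988, 1459194146]})

# Assumption: the integers are seconds since the Unix epoch (UTC).
sample["event_time"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample)
```

The first value lands on 2016-03-28, which agrees with the date embedded in the first article URL.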


In [5]:
user_interaction_dataset.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,


In [6]:
user_interaction_dataset['eventType'].unique()

array(['VIEW', 'FOLLOW', 'BOOKMARK', 'LIKE', 'COMMENT CREATED'],
      dtype=object)

In [7]:
user_interaction_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72312 entries, 0 to 72311
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   timestamp    72312 non-null  int64 
 1   eventType    72312 non-null  object
 2   contentId    72312 non-null  int64 
 3   personId     72312 non-null  int64 
 4   sessionId    72312 non-null  int64 
 5   userAgent    56918 non-null  object
 6   userRegion   56907 non-null  object
 7   userCountry  56918 non-null  object
dtypes: int64(4), object(4)
memory usage: 4.4+ MB


# Data Preprocessing

In [8]:
# replace event type in our user interaction dataframe with numerical weights.
# create a map object

event_map = {
    'VIEW':1.0,
    'FOLLOW':2.0,
    'BOOKMARK':3.0,
    'LIKE':4.0,
    'COMMENT CREATED':5.0
}

user_interaction_dataset['eventTypeWeight'] =  user_interaction_dataset['eventType'].map(event_map)
user_interaction_dataset.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry,eventTypeWeight
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,,1.0
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US,1.0
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,,1.0
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,,2.0
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,,1.0
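
`Series.map` with a dictionary returns `NaN` for any value that has no entry in the dictionary, so a quick check can reveal event types that were left out of the mapping. A small sketch on toy data; the `SHARE` event is hypothetical and does not occur in the dataset:

```python
import pandas as pd

event_map = {"VIEW": 1.0, "FOLLOW": 2.0, "BOOKMARK": 3.0,
             "LIKE": 4.0, "COMMENT CREATED": 5.0}

# Toy events; "SHARE" is a hypothetical type that is not in the mapping.
events = pd.Series(["VIEW", "LIKE", "SHARE", "COMMENT CREATED"])
weights = events.map(event_map)

# Event types that produced NaN because the mapping has no entry for them.
unmapped = sorted(events[weights.isna()].unique())
print(unmapped)
```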


In [24]:
# Next, keep only users with a reasonable number of interactions; this mitigates the user cold-start problem.
# The loop below keeps users with more than 5 logged interactions, which is okay for a start.

user_interaction_dataset_new = pd.DataFrame()


for user in user_interaction_dataset['personId'].unique():
    if len(user_interaction_dataset[user_interaction_dataset['personId'] == user]) > 5:
        user_interaction_dataset_new = pd.concat([user_interaction_dataset_new,
            pd.DataFrame(user_interaction_dataset[user_interaction_dataset['personId'] == user].values)], ignore_index=True)
            


user_interaction_dataset_new.columns =  user_interaction_dataset.columns
user_interaction_dataset_new.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry,eventTypeWeight
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,,1.0
1,1465413046,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5...,SP,BR,1.0
2,1464190235,VIEW,6437568358552101410,-8845298781299428018,-1157447994463607871,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,SP,BR,1.0
3,1459429221,VIEW,-4760639635023250284,-8845298781299428018,-5149610736659242149,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,SP,BR,1.0
4,1459274156,VIEW,-6142462826726347616,-8845298781299428018,-6283148774987755959,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,SP,BR,1.0
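
The loop above scans the whole frame once per user and concatenates repeatedly. A vectorized alternative can be sketched with `groupby(...).transform`, shown here on toy data (the `toy`, `rows_per_user`, and `active` names are illustrative, not from the notebook):

```python
import pandas as pd

# Toy interaction log: user 1 has 6 rows, user 2 has 2 rows.
toy = pd.DataFrame({
    "personId": [1, 2, 1, 1, 1, 2, 1, 1],
    "contentId": [10, 10, 11, 12, 13, 14, 15, 16],
})

# Number of logged rows per user, broadcast back onto every row.
rows_per_user = toy.groupby("personId")["personId"].transform("size")

# Keep only users with more than 5 logged rows, matching the loop's condition.
active = toy[rows_per_user > 5].reset_index(drop=True)
print(active["personId"].unique())
```

Unlike the loop, this keeps the rows in their original order and preserves the column dtypes, so `head()` would show different rows than the output above.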


In [32]:
print('Users with more than 5 interactions: ' + str(len(user_interaction_dataset_new['personId'].unique())) )
print('All users: ' + str(len(user_interaction_dataset['personId'].unique())) )

Users with more than 5 interactions: 1220
All users: 1895


In [40]:
# To model a user's interest in a given article, aggregate all the interactions the user had with that item
# by taking the mean of their event-type weights.

user_interaction_df =  user_interaction_dataset_new.groupby(['personId', 'contentId'])['eventTypeWeight'].mean().reset_index()
user_interaction_df.head()

Unnamed: 0,personId,contentId,eventTypeWeight
0,-9223121837663643404,-8949113594875411859,1.0
1,-9223121837663643404,-8377626164558006982,1.0
2,-9223121837663643404,-8208801367848627943,1.0
3,-9223121837663643404,-8187220755213888616,1.0
4,-9223121837663643404,-7423191370472335463,1.0
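
To see what this aggregation does to a mixed history, the sketch below compares the mean with two alternatives (`sum` and `max`) on toy data. The alternatives are shown only for comparison and are not what the notebook uses:

```python
import pandas as pd

# User 1 viewed article 10 three times and commented once; user 2 commented once.
toy = pd.DataFrame({
    "personId": [1, 1, 1, 1, 2],
    "contentId": [10, 10, 10, 10, 10],
    "eventTypeWeight": [1.0, 1.0, 1.0, 5.0, 5.0],
})

# Aggregate each (user, article) pair three different ways.
agg = (toy.groupby(["personId", "contentId"])["eventTypeWeight"]
          .agg(["mean", "sum", "max"])
          .reset_index())
print(agg)
```

With the mean, repeated views pull user 1's score down to 2.0 even though they also commented, while user 2 scores 5.0 from a single comment. Whether that is desirable depends on how repeat views should count.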


In [39]:
type(user_interaction_df)

pandas.core.frame.DataFrame

In [None]:
# Assumption: `mf` is an alias for mlflow; the import is missing from the notebook.
import mlflow as mf

# Define the backend storage URI and the experiment
tracking_uri = 'sqlite:///mlflow.db'
mf.set_tracking_uri(tracking_uri)
mf.set_experiment(experiment_name='recommender_exp')

In [None]:
# Disable autologging, as it isn't compatible with our current TensorFlow version (2.16.1)
mf.autolog(disable=True)

In [None]:
# Retrieve the labels (i.e., targets/services): the service each user is currently subscribed to.
# idxmax() along the columns returns, for each row, the label of the column holding the maximum value.
raw_target = train_data.iloc[:, 22:].idxmax(axis=1)
raw_target

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit the label encoder on raw_target, then transform it to obtain a numerical code for each service name
le.fit(raw_target)

transformed_target = le.transform(raw_target)
# create a new column, service_opted_for, in the dataframe
train_data['service_opted_for'] =  transformed_target

# view the first few rows of the dataset
train_data.head()

In [None]:
transformed_target

In [None]:
keys = le.classes_
values =  le.transform(le.classes_)
result = dict(zip(keys, values))
result

In [None]:
with mf.start_run():
    # Log the label-to-code mapping as a run parameter in MLflow.
    # The with-block ends the run on exit, so no explicit end_run() is needed.
    mf.log_param('label encoder params during fit method', result)

In [None]:
# Checking the value count of the products
plt.figure(figsize=(12,8))

# Get the names and the occurrences
names = raw_target.value_counts().index
values = raw_target.value_counts().values

# Plot the plot
ax = sns.barplot(x=names, y=values)

# Set the title
ax.set_title("Number Of Services Opted In Millions")

# Set the xticklabels and rotate
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

# Label the bars
for p in ax.patches:
    ax.annotate("{:.1f}".format(p.get_height()), (p.get_x(), p.get_height()), rotation=25)

# Show the plot
plt.show()

## Creating a User-Item interaction matrix (count)

In [None]:
train_data.ncodpers

In [None]:
# Our collaborative recommender system requires a user id, a service id, and a rating-like value (a satisfaction rating, count, or score).
# The data has no ratings, so we engineer a new variable, the service selection ratio, to stand in for the rating.

# Create a user-item matrix in which each entry is the number of times the user opted for that service
user_item_matrix = pd.crosstab(index=train_data.ncodpers, columns=transformed_target, values=1, aggfunc='sum')

# Filling nan values as 0 as service is not opted
user_item_matrix.fillna(0, inplace=True)

# Print the user-item matrix(Represents Count)
user_item_matrix

In [None]:
# Having counted how many times each user opted for each service, divide each count by the
# total number of services the user has opted for throughout his/her banking journey.

# Compute each row (user) total once, then divide every count in that row by it.
# Dividing cell by cell in place would change the row total as cells are overwritten,
# so later columns would be divided by the wrong denominator.
row_totals = user_item_matrix.sum(axis=1)
user_item_ratio_matrix = user_item_matrix.div(row_totals, axis=0)

# Print the user_item_ratio_matrix (represents the ratio)
user_item_ratio_matrix
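
The per-user ratio described above can be checked on a toy matrix. The sketch below takes the row totals once, before any cell changes, so the denominators stay fixed; the `counts` and `ratios` names and the user labels are illustrative:

```python
import numpy as np
import pandas as pd

# Toy count matrix: rows are users, columns are encoded service ids.
counts = pd.DataFrame([[2.0, 1.0, 1.0], [0.0, 3.0, 0.0]],
                      index=["user_a", "user_b"], columns=[0, 1, 2])

# Divide every row by its own total so each user's ratios sum to 1.
ratios = counts.div(counts.sum(axis=1), axis=0)
print(ratios)

# Sanity check: every row of ratios should sum to 1.
print(np.allclose(ratios.sum(axis=1), 1.0))
```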

In [None]:


# Stack the user_item_ratio_matrix to get all values in single column
user_item_ratio_stacked = user_item_ratio_matrix.stack().to_frame()

# Create column for user id
user_item_ratio_stacked['ncodpers'] = [index[0] for index in user_item_ratio_stacked.index]

# Create column for service_opted
user_item_ratio_stacked['service_opted'] = [index[1] for index in user_item_ratio_stacked.index]

# Reset and drop the index
user_item_ratio_stacked.reset_index(drop=True, inplace=True)

# Print the dataframe
user_item_ratio_stacked

In [None]:
# Formatting our final dataset

# Rename the column 0 to service_selection_ratio
user_item_ratio_stacked.rename(columns={0:"service_selection_ratio"}, inplace=True)

# Arrange the columns in a consistent order for better viewing
user_item_ratio_stacked = user_item_ratio_stacked[['ncodpers','service_opted', 'service_selection_ratio']]

# Drop all the rows with 0 entries as it means the user has never opted for the service
user_item_ratio_stacked.drop(user_item_ratio_stacked[user_item_ratio_stacked['service_selection_ratio']==0].index, inplace=True)

# Reset the index
user_item_ratio_stacked.reset_index(drop=True, inplace=True)

# Display the final dataframe
user_item_ratio_stacked
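
The reshaping from a wide user-item matrix to a long (user, service, ratio) table can also be sketched in one chain by naming the index levels first. The toy values below are made up; the column names match the ones used above:

```python
import pandas as pd

# Toy ratio matrix with named index and column levels.
ratios = pd.DataFrame([[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]],
                      index=pd.Index([101, 102], name="ncodpers"),
                      columns=pd.Index([0, 1, 2], name="service_opted"))

# Stack to long format; the level names become column names on reset_index().
long_df = (ratios.stack()
                 .rename("service_selection_ratio")
                 .reset_index())

# Drop zero ratios: the user never opted for that service.
long_df = long_df[long_df["service_selection_ratio"] > 0].reset_index(drop=True)
print(long_df)
```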

### The final dataset above is in the form that collaborative filtering requires

In [None]:
# Encode user_id and item_id
user_encoder = LabelEncoder()
user_item_ratio_stacked['ncodpers'] = user_encoder.fit_transform(user_item_ratio_stacked['ncodpers'])

service_encoder = LabelEncoder()
user_item_ratio_stacked['service_opted'] = service_encoder.fit_transform(user_item_ratio_stacked['service_opted'])

user_item_ratio_stacked.head()

# Building our recommender system

In [None]:
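# Assumption: Reader and Dataset come from the `surprise` package (scikit-surprise); the import is missing from the notebook.
from surprise import Dataset, Reader

# Create a dataset that surprise can process
# Initialize a surprise Reader object (when loading from a dataframe, only rating_scale is used)
reader = Reader(line_format='user item rating', sep=',', rating_scale=(0,1), skip_lines=1)

# Load the data; the columns must be in the order user, item, rating
data = Dataset.load_from_df(user_item_ratio_stacked, reader=reader)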


In [None]:
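# Assumption: SVD and GridSearchCV come from the `surprise` package; the imports are missing from the notebook.
from surprise import SVD
from surprise.model_selection import GridSearchCV

# Use grid search to find the best parameters for our SVD model
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

# Best RMSE and MAE scores
print(gs.best_score["rmse"])
print(gs.best_score["mae"])
# Combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])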

In [None]:
# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator["rmse"]
trainset =  data.build_full_trainset()
algo.fit(trainset)
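
For reference, the model fitted here is the biased matrix factorization that Surprise's `SVD` class implements. As a sketch of what the tuned parameters control, with $r_{ui}$ standing for the `service_selection_ratio` of user $u$ and service $i$, the prediction is

```latex
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$
```

and the parameters are learned by stochastic gradient descent on the regularized squared error

```latex
$$\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2
  + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$
```

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$, $q_i$ are the latent factor vectors. In the grid above, `lr_all` sets the SGD learning rate, `reg_all` sets $\lambda$ for all parameters, and `n_epochs` is the number of passes over the training set.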

In [None]:
def get_recommendation(uid, model, service_range):    
    recommendations = [(uid, 
                        sid, 
                        le.inverse_transform([sid])[0], 
                        model.predict(uid, sid).est) for sid in range(service_range)]
    # Convert to pandas dataframe
    recommendations = pd.DataFrame(recommendations, columns=['uid', 'sid', 'service_name', 'pred'])
    # Sort by pred
    recommendations.sort_values("pred", ascending=False, inplace=True)
    # Reset index
    recommendations.reset_index(drop=True, inplace=True)

    print(recommendations.head())
    # Return
    return dict(services = list(recommendations.service_name))

In [None]:
get_recommendation(15890.0, algo, 10)
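
The function above ranks every service, including ones the user already holds. A possible refinement, sketched on made-up scores, is to filter those out before taking the top N. The `top_n_unseen` helper and its arguments are hypothetical and not part of the notebook:

```python
import pandas as pd

def top_n_unseen(scores, already_opted, n=3):
    """Return the n highest-scoring item ids the user has not opted for yet."""
    ranked = pd.Series(scores).sort_values(ascending=False)
    return ranked[~ranked.index.isin(already_opted)].head(n).index.tolist()

# Hypothetical predicted scores per service id for one user.
scores = {0: 0.10, 1: 0.90, 2: 0.40, 3: 0.75, 4: 0.05}

# Service 1 scores highest but is already held, so it is skipped.
print(top_n_unseen(scores, already_opted={1}, n=2))
```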

# The next phase aims to prepare our dataset for a content-based filtering algorithm

In [None]:
# Start by removing records with no details about the user. This is necessary because the
# content-based algorithm relies on user attributes.

# Drop rows where ind_empleado is missing
train_data.drop(train_data[train_data['ind_empleado'].isnull()].index, axis=0, inplace=True)

# Drop rows where ind_nomina_ult1 is missing
train_data.drop(train_data[train_data['ind_nomina_ult1'].isnull()].index, axis=0, inplace=True)

# Dropping one-hot encoded columns of services
train_data.drop(columns=train_data.iloc[:1,22:-1].columns, inplace=True)

# Print the dataframe
train_data.head()

In [None]:
# Checking the null value for all columns
(train_data.isnull().sum()/len(train_data))*100

In [None]:
# Filling renta with its mean (assign back rather than using inplace=True on a selected column)
train_data['renta'] = train_data['renta'].fillna(train_data['renta'].mean())

# Filling cod_prov with its mode
train_data['cod_prov'] = train_data['cod_prov'].fillna(train_data['cod_prov'].mode()[0])

# Filling indrel_1mes with its mode
train_data['indrel_1mes'] = train_data['indrel_1mes'].fillna(train_data['indrel_1mes'].mode()[0])
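
The imputation rule above (mean for a numeric column, mode for a categorical code) can be tried on a toy frame. The sketch assigns each result back to its column, since `fillna(inplace=True)` on a selected column can fail to update the parent frame under pandas' copy-on-write behaviour; the toy values are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "renta": [100.0, np.nan, 300.0],
    "cod_prov": [28.0, 28.0, np.nan],
})

# Numeric column: fill the gap with the mean of the observed values.
toy["renta"] = toy["renta"].fillna(toy["renta"].mean())

# Categorical code: fill the gap with the most frequent value.
toy["cod_prov"] = toy["cod_prov"].fillna(toy["cod_prov"].mode()[0])
print(toy)
```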

In [None]:
# Check unique category for all categorical variables
# Names of the columns of type object
obj_cols = train_data.select_dtypes('object').columns

# Iterate through each column
for col in obj_cols:
    print("*"*5,col,"*"*5)
    # Print its unique value
    print(train_data[col].unique(),"\n\n")

In [None]:
# Correcting the categories of column indrel_1mes: unify string and numeric codes,
# map 'P' to 5, and turn the string 'None' into a missing value
train_data['indrel_1mes'] = train_data['indrel_1mes'].replace({
    '1': 1, '1.0': 1,
    '2': 2, '2.0': 2,
    '3': 3, '3.0': 3,
    '4': 4, '4.0': 4,
    'P': 5,
    'None': np.nan,
})

# Print dataframe
train_data.head()

## Encoding categorical variables

In [None]:
# List of columns to encode
cols_to_encode = ['ind_empleado', 'pais_residencia', 'sexo', 'indrel', 'tiprel_1mes', 'indresi', 'indext', 'canal_entrada', 'indfall', 'segmento']

# List of label encoders which will be used for transformations later
label_encoders = []

# Label-encode these columns one at a time
for col in tqdm(cols_to_encode):
    # Initialize a label encoder object
    lab_enc = LabelEncoder()
    
    # Encode the column and overwrite the original values
    train_data[col] = lab_enc.fit_transform(train_data[col])
    
    # Append it in the label_encoders list to use it later
    label_encoders.append(lab_enc)
    
# Print the data
train_data.head()

In [None]:
# Deleting column 'nomprov' as we already have its encoded feature(cod_prov)
train_data.drop(columns=['nomprov'], inplace=True)

# Deleting column tipodom as all values are '1'
train_data.drop(columns=['tipodom'], inplace=True)

# Print the dataframe
train_data.head()

## Choosing the most recent transaction for each user

In [None]:
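# Keep one row per user: the latest transaction. duplicated(keep='last') flags every row
# except a user's last one, so indexing with that mask would select the rows we want to discard.
# drop_duplicates(keep='last') keeps exactly the last row per ncodpers.
user_data = train_data.drop_duplicates(subset='ncodpers', keep='last')

# Reset the index
user_data = user_data.reset_index(drop=True)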

# Print the head
user_data.head()
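
To make the meaning of `duplicated(keep='last')` concrete, here is a small sketch on toy data. It assumes the rows are already in chronological order, so "last" means "most recent"; the `month` column and the values are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "ncodpers": [1, 2, 1, 3, 2],
    "month": ["jan", "jan", "feb", "feb", "mar"],
})

# True for every row that has a later row with the same ncodpers.
has_later_row = toy["ncodpers"].duplicated(keep="last")

# Negating the mask keeps each user's last row.
latest = toy[~has_later_row].reset_index(drop=True)
print(latest)
```

Each user appears exactly once in `latest`, with the month of their final row.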

## TO BE CONTINUED