**INTRODUCTION**

A recommendation system is a type of information filtering system that predicts and suggests items or products that a user is likely to be interested in based on their past behavior, preferences, and interactions with the system.

The main goal of a recommendation system is to provide personalized recommendations to users, in order to enhance their experience, and ultimately increase user engagement and revenue for the business.

There are generally two main types of recommendation systems:

> **Content-based recommendation systems** - These systems recommend items that are similar to the ones a user has already shown interest in. They analyze the attributes or features of items and match them to user preferences.

> **Collaborative filtering recommendation systems** - These systems recommend items that are popular with, or preferred by, other users whose preferences resemble the current user's. They analyze user behavior, such as ratings or purchases, and find other users with similar patterns of behavior.

Both types typically work by collecting data about users' behavior, preferences, and interactions with the system, such as the items they have viewed, purchased, or rated.
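To make the collaborative filtering idea concrete, here is a minimal sketch of a user-based prediction on a toy matrix. The matrix, the function names (`cosine_similarity`, `predict_rating`) and the choice of cosine similarity are illustrative assumptions; this is not the method used later in this notebook.

```python
import numpy as np

# Toy ratings matrix (assumed data): rows are users, columns are items,
# and 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        # Skip the target user and anyone who has not rated the item
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den > 0 else 0.0

# User 0 has not rated item 2; estimate it from users 1 and 2
pred = predict_rating(ratings, user=0, item=2)
print(round(pred, 3))
```

User 0's tastes are much closer to user 1's than to user 2's, so the estimate is pulled towards user 1's low rating for that item.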

**AIM OF PROJECT**: To develop a recommendation system that predicts users' ratings of DataLab Store products, in order to recommend the items most likely to be purchased by consumers.

**PROCEDURE**

> **Data Description:** The dataset has 1975 observations (items) and 338 features (users), plus an `Item` identifier column.

> **Missing Values:** Missing values are represented by 0, meaning that the user did not rate the item.

The following steps were taken in developing the recommendation system for the project:
> **Step 1 Using a Machine Learning Algorithm to fill missing values:** The Iterative Imputer was used to fill in the items that had no rating. It estimates each missing rating with a regression model.

> **Step 2 Using Singular Value Decomposition:** This model was trained and tested on the fully imputed dataset and then used to predict the final ratings of the items.


In [2]:
# Importation of libraries
import numpy as np
import pandas as pd
import seaborn
import scipy
from matplotlib import pyplot as plt

**EXPLORATORY DATA ANALYSIS**

In [3]:
# Load the dataset
df = pd.read_csv('/kaggle/input/comptition-datalab/dataset.csv')
# If running in Google Colab, after uploading the dataset:
#df = pd.read_csv("dataset.csv")

In [4]:
# Checking for brief data description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1975 entries, 0 to 1974
Columns: 339 entries, Item to Rachid
dtypes: float64(338), object(1)
memory usage: 5.1+ MB


In [7]:
# Checking for the data type of the features
df.dtypes

Item        object
Lewis      float64
Eugene     float64
Lee        float64
Wiley      float64
            ...   
Isaac      float64
Leroy      float64
Maurice    float64
Rania      float64
Rachid     float64
Length: 339, dtype: object

In [13]:
# Number of missing values (unrated entries, stored as 0) per column
(df == 0).sum()

Item          0
Lewis      1953
Eugene     1954
Lee        1891
Wiley      1964
           ... 
Isaac      1929
Leroy      1901
Maurice    1751
Rania      1733
Rachid     1328
Length: 339, dtype: int64

In [12]:
# Number of items rated by each user
df.shape[0]-(df==0).sum()

Item       1975
Lewis        22
Eugene       21
Lee          84
Wiley        11
           ... 
Isaac        46
Leroy        74
Maurice     224
Rania       242
Rachid      647
Length: 339, dtype: int64

In [14]:
df.describe()

Unnamed: 0,Lewis,Eugene,Lee,Wiley,Samira,Jacques,Karima,George,Hakim,Tom,...,Arthur,Martin,Mohammed,Charles,Yusuf,Isaac,Leroy,Maurice,Rania,Rachid
count,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,...,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0,1975.0
mean,0.044557,0.035443,0.127342,0.021772,0.17038,0.030886,0.131139,0.037468,0.050633,0.445316,...,0.392405,0.026329,0.762278,0.244051,0.132152,0.071899,0.119241,0.402532,0.456203,1.189367
std,0.427096,0.36537,0.663005,0.303695,0.781925,0.349494,0.682698,0.383108,0.459305,1.214588,...,1.123094,0.340325,1.276024,0.880346,0.757271,0.508185,0.626043,1.154698,1.256888,1.765175
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.5,5.0,4.5,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


# **Preprocessing**

In [None]:
# Work on a copy so the raw data stays untouched
df2 = df.copy()

# Zeros mean "no rating", so mark them as missing (NaN)
df2.replace(0, np.nan, inplace=True)

# X holds the rating matrix (one column per user); y holds the item identifiers
X = df2.iloc[:, 1:].values
y = df2.iloc[:, 0].values



The first step in this work was to fill the missing values. Several machine learning techniques were tried:

> **Simple Imputer**: This technique replaces missing values with a specified statistic, such as the mean, median, or mode of the same column.

> **KNN Imputer**: This technique finds the K observations closest to the observation with the missing value(s) and uses the values from those K neighbors to impute the missing value.

> **Iterative Imputer**: This technique uses an iterative algorithm in which regression models estimate the missing values.

After exploring all the techniques listed, **the Iterative Imputer gave better results than the others**, so it was adopted for filling the missing values in the first step.
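One way such a comparison could be set up is sketched below: hide some known entries of a matrix, impute them with each technique, and measure the mean absolute error on the hidden entries. The synthetic matrix, the 30% masking rate, and the imputer settings other than `max_iter=30, random_state=0` are assumptions made for illustration; the scores it prints say nothing about the results obtained on the DataLab dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)

# Synthetic low-rank "ratings" matrix (20 items x 6 users), purely illustrative
true = rng.uniform(1, 5, size=(20, 2)) @ rng.uniform(0, 1, size=(2, 6))

# Hide roughly 30% of the entries; keep the first column fully observed
# so that every row retains at least one known value
mask = rng.random(true.shape) < 0.3
mask[:, 0] = False
observed = true.copy()
observed[mask] = np.nan

imputers = {
    "simple (mean)": SimpleImputer(strategy="mean"),
    "knn (k=3)": KNNImputer(n_neighbors=3),
    "iterative": IterativeImputer(max_iter=30, random_state=0),
}

scores = {}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(observed)
    # Mean absolute error on the entries that were hidden
    scores[name] = float(np.mean(np.abs(filled[mask] - true[mask])))
    print(f"{name}: MAE on hidden entries = {scores[name]:.3f}")
```

On real rating data only the observed ratings can be hidden and scored this way, since the true values of genuinely missing entries are unknown.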

In [None]:
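# IterativeImputer is experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Estimate each missing rating by regressing on the other columns
imp = IterativeImputer(max_iter=30, random_state=0)
X = imp.fit_transform(X)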

In [None]:
df3 = df2.copy()
# Write the imputed ratings back into a copy of the dataset
df3.iloc[:, 1:] = X

In [None]:
# Reshape the data from wide to long format using the melt method
df_long = df3.melt(id_vars="Item", var_name="customer_name", value_name="rating")


# Verify the shape of the long format data
print("Number of rows:", df_long.shape[0])
print("Number of columns:", df_long.shape[1])

In [None]:
# Work on a copy of the long-format data
df4 = df_long.copy()

# Set a new integer index named 'Index'
df4['Index'] = range(len(df4))
df4 = df4.set_index('Index')

df4.shape

In [None]:
# Our dataset is ready for use !!!
df4.head()

# **Modeling : Singular Value Decomposition(SVD)**

There are several models and techniques that can be used to predict missing values in a matrix, including matrix completion using SVD, matrix factorization, K-Nearest Neighbors (KNN), and deep learning models.
On our data, SVD was the most accurate.

Singular Value Decomposition is a mathematical technique that has many applications in data analysis, such as in principal component analysis, image compression, and text processing.

One of the applications of SVD is in matrix completion, where the goal is to predict the missing values of a partially observed matrix. Given a partially observed matrix A, SVD can be used to predict the missing values by approximating the original matrix with a low-rank matrix.
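The low-rank approximation idea can be illustrated with NumPy on a small, fully observed matrix. This is only a sketch of the underlying linear algebra on assumed toy data: the matrix `A` is synthetic, and the `SVD` algorithm in the `surprise` library is a related matrix factorization model fitted by stochastic gradient descent on the observed ratings rather than this exact decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic matrix of rank 2 (6 items x 5 users), purely illustrative
A = rng.uniform(1, 5, size=(6, 2)) @ rng.uniform(0, 1, size=(2, 5))

# Full decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

errors = {}
for k in (1, 2):
    # Keep only the k largest singular values and their vectors
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errors[k] = float(np.linalg.norm(A - A_k))
    print(f"rank-{k} approximation error: {errors[k]:.6f}")
```

Because `A` has rank 2 by construction, two singular values are enough to reproduce it up to floating-point error, while a single one leaves a visible residual.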



To implement this, we will use the `surprise` library.

In [None]:
pip install surprise

In [None]:
# Modeling
#SVD APPROACH
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import accuracy

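# df4 already holds the ratings in long format (Item, customer_name, rating)

# Convert the DataFrame into a surprise dataset.
# Note: load_from_df reads the three columns as (user, item, rating) in that
# order, so here the Item column is treated as the user id.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df4, reader)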

# Split the dataset into training and testing sets
train_set, test_set = train_test_split(data, test_size=0.3, random_state=5)

# Train the SVD model
algo = SVD()
algo.fit(train_set)

# Make predictions for the test set
predictions = algo.test(test_set)

# Evaluate the accuracy of the predictions
accuracy.mae(predictions)

# **Predictions to submit for the competition**

In [None]:
# Load the test data we are supposed to make predictions on
sub_test = pd.read_csv('/kaggle/input/comptition-datalab/test.csv')

sub_test.shape
#All the items are in the test dataset

In [None]:
sub_test.head()

In [None]:
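# Add a column with the ratings obtained from the imputer
# These are not the predictions yet
sub_test['Rating'] = -1.0
for i in range(len(sub_test)):
    # Find the row of df4 matching this (Item, User) pair
    match = (df4['Item'] == sub_test.loc[i, 'Item']) & (df4['customer_name'] == sub_test.loc[i, 'User'])
    # Use .loc instead of chained indexing so the assignment modifies sub_test itself
    sub_test.loc[i, 'Rating'] = df4.loc[match, 'rating'].values[0]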

sub_test.head()
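The row-by-row lookup above can also be expressed as a single merge, which tends to be much faster on large frames. The sketch below uses small stand-in frames (`long_ratings` and `pairs` are hypothetical names playing the roles of `df4` and `sub_test`), and it assumes each (Item, User) pair appears at most once in the long-format data.

```python
import pandas as pd

# Stand-in for the long-format ratings (same column names as df4)
long_ratings = pd.DataFrame({
    "Item": ["A", "A", "B", "B"],
    "customer_name": ["Lewis", "Lee", "Lewis", "Lee"],
    "rating": [3.5, 4.0, 2.0, 5.0],
})

# Stand-in for the (Item, User) pairs to look up
pairs = pd.DataFrame({"Item": ["B", "A"], "User": ["Lee", "Lewis"]})

# A left merge keeps every requested pair, in its original order
merged = (
    pairs.merge(
        long_ratings,
        how="left",
        left_on=["Item", "User"],
        right_on=["Item", "customer_name"],
    )
    .drop(columns="customer_name")
    .rename(columns={"rating": "Rating"})
)
print(merged)
```

Pairs with no match would get `NaN` in `Rating` instead of raising an error, so a quick `merged["Rating"].isna().sum()` check is worthwhile after the merge.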

In [None]:
# Transform the prediction dataset into a list of tuples that surprise can use
df_tuples = [tuple(row) for row in sub_test.values]
# Make predictions for the submission set
sub_pred = algo.test(df_tuples)
# MAE relative to the imputed ratings added above (not ground-truth ratings)
accuracy.mae(sub_pred)

In [None]:
# Description of the predictions
predic = pd.DataFrame(sub_pred)
predic.describe()

In [None]:
# Prepare our submission dataset
exp = predic.copy()
exp['Rating'] = exp['est']
exp.drop(columns = ['uid', 'iid', 'r_ui', 'details', 'est'], inplace = True)
exp.head()

In [None]:
# To export the prediction data and download it (Google Colab only):
#from google.colab import files
#exp.to_csv('prediction.csv', encoding = 'utf-8-sig') 
#files.download('prediction.csv')

Here we go !!!
Thank you !