<a href="https://colab.research.google.com/github/arsyaamalia/content-based-recommendation-systems/blob/main/Skripsi_Content_Based_Recommendation_System_Eventhings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. INTRODUCTION**

*   ## **What is a Content-Based Recommendation System?**
A content-based recommendation system recommends items to users based on the content or characteristics of the items. This type of recommendation system focuses on understanding the properties of items and learning user preferences from the items they have interacted with in the past.

*   ## **How Does it Work?**
The working principle of a content-based recommendation system can be summarized in a few steps:
1.   Feature Extraction: Extract relevant features from the items. For example, in a movie recommendation system, features could include genre, director, actors, and plot keywords.
2.   User Profile: Create a user profile based on their interactions with items. This profile is essentially a summary of the features of items the user has liked or interacted with in the past.
3.   Recommendation: Calculate the similarity between the user profile and each item's features. Items that are most similar to the user profile are recommended.
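
The three steps above can be sketched with a tiny, made-up example. The feature matrix, the feature names, and the list of liked items are all hypothetical. Averaging the liked item vectors is only one simple way to build a user profile, so treat this as an illustration rather than the method used later in this notebook.

```python
import numpy as np

# Step 1 (feature extraction): hypothetical item-feature matrix.
# Rows are items; columns are assumed features [action, comedy, drama].
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [1.0, 1.0, 0.0],  # item 1: action + comedy
    [0.0, 0.0, 1.0],  # item 2: drama
    [0.0, 1.0, 0.0],  # item 3: comedy
])

# Step 2 (user profile): average the vectors of the items the user liked.
liked = [0, 1]  # hypothetical interaction history
user_profile = item_features[liked].mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Step 3 (recommendation): score every item the user has not seen yet
# and rank the candidates from most to least similar to the profile.
scores = {
    i: cosine(user_profile, item_features[i])
    for i in range(len(item_features))
    if i not in liked
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup the comedy item ranks above the drama item, because the profile built from the liked items has some comedy weight and no drama weight.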

---

# **2. EXPLORATORY DATA ANALYSIS (EDA)**

## **Import Python Libraries**

The first step in a machine learning workflow with Python is to load and explore the data using the appropriate libraries. Here is the [link](https://docs.google.com/spreadsheets/d/1eilNwyFzBFAzO2Z3vOJoqV7UGN6xBc65aOkLJ5nSdmA/edit?usp=sharing) to the dataset.

Import all the libraries required for the analysis, covering data loading, data transformation, text processing, and similarity computation.

In [1]:
# Import needed modules
import numpy as np
import pandas as pd
import nltk
import re
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## **Reading Dataset**

The pandas library can load data into a DataFrame from many file formats, such as JSON, .csv, .xlsx, SQL, pickle, HTML, and .txt.

Most tabular data is distributed as CSV files, a format that is widely used and easy to access. The read_csv() function loads a CSV file into a pandas DataFrame.

I will use a dataset containing event services/vendor information, including names, categories, locations, and descriptions.

In [15]:
# Read data
df = pd.read_csv('/content/Indonesia_Event_Service_Businesses.csv')

### **Analyzing the Data**

Before we make any inferences, we listen to our data by examining all variables in the data.

The main goal of data understanding is to gain general insights about the data, including the number of rows and columns, the values in the data, the data types, and the missing values.

**shape** displays the number of observations (rows) and features (columns) in the dataset.

There are 3462 observations and 11 variables in our dataset before any columns are dropped.
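
As a small illustration of `shape`, here is a sketch on a toy DataFrame. The rows are hypothetical and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Toy DataFrame with hypothetical rows (not from the real dataset)
toy = pd.DataFrame({
    "nama": ["Vendor A", "Vendor B", "Vendor C"],
    "kategori": ["Media Partner", "Media Partner", "Equipment/Rental"],
})

# shape is an attribute, not a method: it returns (number_of_rows, number_of_columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} variables")
```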

In [16]:
# printing the first 5 rows of the dataframe
df.head()

Unnamed: 0,no,kategori,subkategori,location/city,nama,deskripsi,address,contact,url,lat,lng
0,1,Media Partner,Karir,-,Glints,Glints adalah platform karier yang menghubungk...,,,,,
1,2,Media Partner,Karir,-,Kata.ai,Kata.ai adalah perusahaan teknologi yang menge...,,,,,
2,3,Media Partner,Karir,-,Talent Alpha,Talent Alpha adalah platform yang membantu per...,,,,,
3,4,Media Partner,Karir,-,Ruangguru,Ruangguru adalah platform pendidikan yang meny...,,,,,
4,5,Media Partner,Karir,-,HarukaEdu,HarukaEdu adalah platform e-learning yang meny...,,,,,


In [17]:
# printing the last 5 rows of the dataframe
df.tail()

Unnamed: 0,no,kategori,subkategori,location/city,nama,deskripsi,address,contact,url,lat,lng
3457,3458,Equipment/Rental,Tent,semarang,Sewa Alat Camping Semarang WB OUTDOOR,,Jl. Taman Suryokusumo IV Pasar PKL Selter 1 Bl...,0877-8910-5550,https://maps.google.com/?cid=13343807366684347718,-6.978405,110.464499
3458,3459,Equipment/Rental,Tent,semarang,Kafe tenda,,"Jl. Pahlawan No.2, Mugassari, Kec. Semarang Se...",,https://maps.google.com/?cid=6253221563540260196,-6.996881,110.419691
3459,3460,Equipment/Rental,Tent,semarang,EnergyAdventure Rental Tenda Alat Outdoor,,"Jl. Brobudur Barat V/25 RT8/13, Kalipancur, Ke...",0819-0265-6457,https://maps.google.com/?cid=12486700668638141432,-6.999503,110.368743
3460,3461,Equipment/Rental,Tent,semarang,Warung tenda muda,,"XFC5+JVC, Jl. Sambiroto VII, Sambiroto, Kec. T...",0858-7620-0035,https://maps.google.com/?cid=10406616885502580166,-7.028391,110.459647
3461,3462,Equipment/Rental,Tent,semarang,"Grosir Tenda lipat Murah ""AMIRA TENT""",,"masjid baiturahim, Tawang Rajekwesi belakang N...",0882-2730-8407,https://maps.google.com/?cid=9508086783357841356,-6.972278,110.390687


**info()** summarizes the DataFrame: the data type of each column, the number of non-null records per column, and the memory usage of the dataset.

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3462 entries, 0 to 3461
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   no             3462 non-null   int64  
 1   kategori       3462 non-null   object 
 2   subkategori    3462 non-null   object 
 3   location/city  3462 non-null   object 
 4   nama           3462 non-null   object 
 5   deskripsi      662 non-null    object 
 6   address        2799 non-null   object 
 7   contact        2549 non-null   object 
 8   url            2799 non-null   object 
 9   lat            2799 non-null   float64
 10  lng            2799 non-null   float64
dtypes: float64(2), int64(1), object(8)
memory usage: 297.6+ KB


### **Check for Duplication**

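**nunique()** returns the number of unique values in each column. Together with the data description, these counts help identify which columns are continuous and which are categorical. A column with fewer unique values than rows (for example, nama has 3125 unique values across 3462 rows) contains repeated values, which can be handled or removed after further analysis.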

In [19]:
df.nunique()

no               3462
kategori            3
subkategori        29
location/city      13
nama             3125
deskripsi         658
address          2469
contact          2191
url              2484
lat              2476
lng              2475
dtype: int64
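
`nunique()` counts distinct values per column, but it does not say which rows are repeated. A minimal sketch of how fully duplicated rows could be flagged and dropped with `duplicated()` and `drop_duplicates()` follows. The toy rows are hypothetical, and whether duplicates should actually be removed from the real dataset is left to further analysis.

```python
import pandas as pd

# Toy DataFrame with one repeated row (hypothetical data)
toy = pd.DataFrame({
    "nama": ["Vendor A", "Vendor B", "Vendor A", "Vendor C"],
    "location/city": ["semarang", "semarang", "semarang", "semarang"],
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# True for every row that repeats an earlier row in all columns
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Keep only the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(unique_counts)
print("duplicated rows:", n_duplicates)
```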

### **Missing Values Calculation**

**isnull()** is widely used in pre-processing to identify null values in the data.

In our example, **df.isnull().sum()** is used to get the number of missing records in each column

In [20]:
df.isnull().sum()

no                  0
kategori            0
subkategori         0
location/city       0
nama                0
deskripsi        2800
address           663
contact           913
url               663
lat               663
lng               663
dtype: int64

## **Data Reduction**

Some columns or variables can be dropped if they do not add value to our analysis.

In our dataset, the columns address, contact, url, lat, and lng are dropped, on the assumption that they do not contribute useful content features for the recommendation.

In [21]:
# Remove address, contact, url, lat, lng columns from df
df = df.drop(['address','contact','url','lat','lng'], axis = 1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3462 entries, 0 to 3461
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   no             3462 non-null   int64 
 1   kategori       3462 non-null   object
 2   subkategori    3462 non-null   object
 3   location/city  3462 non-null   object
 4   nama           3462 non-null   object
 5   deskripsi      662 non-null    object
dtypes: int64(1), object(5)
memory usage: 162.4+ KB


We now move on to feature engineering to prepare the columns required for the analysis.

## **Feature Engineering**

Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling. The main goal of Feature engineering is to create meaningful data from raw data.

We will extract relevant features from the dataset, such as kategori, subkategori, location/city, nama, and deskripsi.

In [22]:
# Selecting the relevant features for recommendation
selected_features = ['kategori','subkategori','location/city','nama','deskripsi']
print(selected_features)

['kategori', 'subkategori', 'location/city', 'nama', 'deskripsi']


# **3. PRE-PROCESSING DATA**

Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.

## **Data Cleaning**

Data cleaning is the process of removing or correcting incorrect, incomplete, and inaccurate data in a dataset, including replacing missing values.

### **Handling Missing Values**

We will replace all missing values with an empty string (2800 missing values were found in the deskripsi column earlier).

In [23]:
# Replacing the null values with an empty string
for feature in selected_features:
    df[feature] = df[feature].fillna('')

### **Data Integration**

We will combine the 5 selected features into a single text field that serves as the input to the recommendation algorithm.

In [24]:
# combining all the 5 selected features
combined_features = df['kategori'] + ' ' + df['subkategori'] + ' ' + df['location/city'] + ' ' + df['nama'] + ' ' + df['deskripsi']
combined_features

0       Media Partner Karir - Glints Glints adalah pla...
1       Media Partner Karir - Kata.ai Kata.ai adalah p...
2       Media Partner Karir - Talent Alpha Talent Alph...
3       Media Partner Karir - Ruangguru Ruangguru adal...
4       Media Partner Karir - HarukaEdu HarukaEdu adal...
                              ...                        
3457    Equipment/Rental Tent semarang Sewa Alat Campi...
3458           Equipment/Rental Tent semarang Kafe tenda 
3459    Equipment/Rental Tent semarang EnergyAdventure...
3460    Equipment/Rental Tent semarang Warung tenda muda 
3461    Equipment/Rental Tent semarang Grosir Tenda li...
Length: 3462, dtype: object

In [25]:
# push to df
df = df.assign(combined_features=combined_features)
df.head()

Unnamed: 0,no,kategori,subkategori,location/city,nama,deskripsi,combined_features
0,1,Media Partner,Karir,-,Glints,Glints adalah platform karier yang menghubungk...,Media Partner Karir - Glints Glints adalah pla...
1,2,Media Partner,Karir,-,Kata.ai,Kata.ai adalah perusahaan teknologi yang menge...,Media Partner Karir - Kata.ai Kata.ai adalah p...
2,3,Media Partner,Karir,-,Talent Alpha,Talent Alpha adalah platform yang membantu per...,Media Partner Karir - Talent Alpha Talent Alph...
3,4,Media Partner,Karir,-,Ruangguru,Ruangguru adalah platform pendidikan yang meny...,Media Partner Karir - Ruangguru Ruangguru adal...
4,5,Media Partner,Karir,-,HarukaEdu,HarukaEdu adalah platform e-learning yang meny...,Media Partner Karir - HarukaEdu HarukaEdu adal...


### **Remove Special Characters**

In [27]:
def cleaning(Text):
    # Replace @mentions, URLs, and runs of non-alphanumeric characters with a single space
    Text = re.sub(r'@\S+|https?:\S+|[^A-Za-z0-9]+', ' ', Text)
    return Text

df['cleaning'] = df['combined_features'].apply(cleaning)
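
To make the effect of this step concrete, here is a small sketch on a made-up string. The sample text, the `clean_text` name, and the extra whitespace normalization at the end are assumptions for illustration. The pattern replaces URLs and runs of non-alphanumeric characters with a space, which can leave consecutive spaces behind.

```python
import re


def clean_text(text):
    # Replace @mentions, URLs, and runs of non-alphanumeric characters with a space
    return re.sub(r'@\S+|https?:\S+|[^A-Za-z0-9]+', ' ', text)


# Hypothetical sample, not taken from the dataset
sample = "Sewa Tenda & Sound-System! Info: https://example.com/promo"
cleaned = clean_text(sample)

# Optional extra step: collapse repeated spaces and trim the ends
normalized = " ".join(cleaned.split())
print(normalized)
```

Because every later step splits the text on whitespace, the leftover double spaces are harmless, but normalizing them makes the intermediate column easier to inspect.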

## **Case Folding**

In this step, all letters in the text are converted to a uniform case: uppercase letters are converted to lowercase.

In [28]:
df['case_folding'] = df['cleaning'].str.lower()

## **Tokenization**

In this step, the text is split into smaller units. We can use either sentence tokenization or word tokenization based on our problem statement.

In [29]:
def tokenization(text):
    # Split on runs of non-word characters and drop empty tokens
    tokens = [token for token in re.split(r'\W+', text) if token]
    return tokens

df['tokenization']= df['case_folding'].apply(tokenization)
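
The difference between word and sentence tokenization can be sketched on a made-up string. The sentence splitter below is deliberately naive: it assumes sentences end with '.', '!' or '?', which is not true of all real text.

```python
import re

# Hypothetical lowercase sample text
text = "sewa tenda semarang. harga murah dan cepat."

# Word tokenization: split on runs of non-word characters, dropping empty strings
word_tokens = [tok for tok in re.split(r'\W+', text) if tok]

# Naive sentence tokenization: split on sentence-ending punctuation
sentence_tokens = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

print(word_tokens)
print(sentence_tokens)
```

For a bag-of-words representation such as TF-IDF, word tokens are the relevant unit.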

## **Stopword Removal**

Stopwords are commonly used words that carry little or no meaning; they are removed from the text because they add little value to the analysis.

In [30]:
nltk.download('stopwords')
from nltk.corpus import stopwords

list_stopwords = stopwords.words('indonesian')
list_stopwords.extend(['yg', 'dg', 'rt', 'dgn', 'ny', 'd', 'klo',
                       'kalo', 'amp', 'biar', 'bikin', 'bilang',
                       'gak', 'ga', 'krn', 'nya', 'nih', 'sih',
                       'si', 'tau', 'tdk', 'tuh', 'utk', 'ya',
                       'jd', 'jgn', 'sdh', 'aja', 'n', 't',
                       'nyg', 'hehe', 'pen', 'u', 'nan', 'loh', 'rt',
                       '&', 'yah', 'no', 'je', 'om', 'pru', 'sch',
                       'injirrr', 'ah', 'oena', 'bu', 'eh', 'n', 'anjir', 'jd', 'anj'])

list_stopwords = set(list_stopwords)

def stopwords_removal(Text):
  words = Text.split()
  return [word for word in words if word not in list_stopwords]

df['stopword_removal'] = df['case_folding'].apply(stopwords_removal)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## **Stemming**

Stemming is a text standardization step in which words are reduced to their root/base form. For example, words like ‘programmer’, ‘programming’, and ‘program’ are stemmed to ‘program’.

In [31]:
!pip install swifter
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
import swifter

# create the stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

#stemmed wrapper
def stemmed_wrapper(term):
  return stemmer.stem(term)

term_dict = {}

for Text in df['stopword_removal']:
  for term in Text:
    if term not in term_dict:
      term_dict[term] = ' '

print(len(term_dict))
print("------------------------")

for term in term_dict:
    term_dict[term] = stemmed_wrapper(term)
    print(term,":" ,term_dict[term])

print(term_dict)
print("------------------------")

# apply stemming
def apply_stemmed_term(Text):
  return [term_dict[term] for term in Text]

df['stemming'] = df['stopword_removal'].swifter.apply(apply_stemmed_term)

4882


# **4. BUILDING CONTENT-BASED RECOMMENDATION SYSTEMS (CBRS)**

## **Term Frequency**

Term frequency (TF) measures how often a word w occurs in a document (text) d. It is equal to the number of instances of w in d divided by the total number of words in d, so it expresses a word's occurrence relative to the length of the document. Within a single document, the denominator is the same for every word.
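
As a worked example of this definition, the sketch below computes term frequencies for a short, made-up document. The `term_frequency` helper is a hypothetical name introduced here for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total_tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical document with 5 tokens; 'sewa' appears twice
doc = "sewa tenda murah sewa sound".split()
tf = term_frequency(doc)
print(tf)
```

Here 'sewa' gets 2/5 and every other word gets 1/5, so the frequencies of one document always sum to 1.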

### **TF-IDF**

We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text features (descriptions) into numerical vectors.
TF-IDF gives more weight to terms that are important in a specific document and less weight to common terms.
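
To show how the weighting behaves, here is a minimal pure-Python sketch that uses the textbook formula tf × log(N / df) on three made-up documents. This is an assumption-laden illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ. The helper names `tf_idf` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Three hypothetical tokenized documents
docs = [
    "sewa tenda semarang".split(),
    "sewa sound semarang".split(),
    "media partner karir".split(),
]


def tf_idf(documents):
    """Textbook TF-IDF: (count / doc_length) * log(n_docs / doc_frequency)."""
    n_docs = len(documents)
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_idf(docs)
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(vectors[0])
print(sim_01, sim_02)
```

In the first document, 'tenda' (found in one document) receives a higher weight than 'sewa' (found in two). The two documents that share terms have a positive cosine similarity, while the unrelated pair scores zero.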

In [32]:
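# converting the text data to feature vectors
vectorizer = TfidfVectorizer()

# 'stemming' holds a list of tokens per row, but TfidfVectorizer expects one string per document;
# passing the lists directly raises AttributeError: 'list' object has no attribute 'lower'.
stemmed_text = df['stemming'].apply(' '.join)
tfidf_vectors = vectorizer.fit_transform(stemmed_text)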

In [None]:
print(tfidf_vectors)