# **Setup**

### **Import Libraries**

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from sentence_transformers import SentenceTransformer

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to C:\Users\ASUS
[nltk_data]     VivoBook\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\ASUS
[nltk_data]     VivoBook\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### **Read Dataset**


### 1. Job Title and Job Description Dataset
This dataset contains a collection of job titles along with their full descriptions, covering responsibilities and qualifications.

Dataset columns:
- `Job Title`: The job title.
- `Job Description`: The full job description.


### 2. Coursera Courses Dataset 
This dataset contains data scraped from the Coursera website.

Dataset columns:
- `Title`: The course title.
- `Organization`: The university or industry partner offering the course.
- `Metadata`: The course difficulty level, course type, and duration.
- `Skills`: The set of skills taught in the course.
- `Ratings`: The course rating on a 5-point scale.
- `Review counts`: The number of reviews for each course.
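
As a rough illustration of what the `Metadata` column holds, the sketch below splits one value on the ` · ` separator seen in the sample rows. The helper name `split_metadata` and its three output keys are hypothetical, and the approach assumes every value follows the same three-part pattern, which has not been verified for the full dataset.

```python
def split_metadata(metadata: str) -> dict:
    """Split a 'level · type · duration' string into named parts."""
    parts = [part.strip() for part in metadata.split("·")]
    # Pad with None so that short or malformed values do not raise.
    parts += [None] * (3 - len(parts))
    level, course_type, duration = parts[:3]
    return {"level": level, "type": course_type, "duration": duration}


example = split_metadata("Beginner · Course · 1 - 4 Weeks")
print(example)
```

The notebook itself does not split this column; it uses the raw `Metadata` string as part of the combined course text.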

Dataset sources:
- [Job Title and Job Description Dataset](https://www.kaggle.com/datasets/kshitizregmi/jobs-and-job-description)
- Coursera Courses Dataset: scraped from the Coursera website

In [2]:
course_df = pd.read_csv('coursera_course_dataset_v2.csv').drop('Unnamed: 0', axis=1)
course_df

Unnamed: 0,Title,Organization,Skills,Ratings,Review counts,Metadata
0,Google Cybersecurity,Google,"Network Security, Python Programming, Linux, ...",4.8,4.8(20K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
1,Google Data Analytics,Google,"Data Analysis, R Programming, SQL, Business C...",4.8,4.8(137K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
2,Google Project Management:,Google,"Project Management, Strategy and Operations, ...",4.8,4.8(100K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
3,IBM Data Science,IBM,"Python Programming, Data Science, Machine Lea...",4.6,4.6(120K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
4,Google Digital Marketing & E-commerce,Google,"Digital Marketing, Marketing, Marketing Manag...",4.8,4.8(23K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
...,...,...,...,...,...,...
618,Google Workspace Security,Google Cloud,"Security Engineering, Security Strategy, Soft...",4.7,4.7(762 reviews),Beginner · Course · 1 - 4 Weeks
619,Cybersecurity for Tech Professionals,Campus BBVA,"Computer Security Incident Management, System...",4.7,4.7(106 reviews),Intermediate · Course · 1 - 3 Months
620,Cybersecurity in the Cloud,University of Minnesota,"Security Engineering, System Security, Networ...",4.3,4.3(127 reviews),Beginner · Specialization · 3 - 6 Months
621,Applied Cryptography,University of Colorado System,"Cryptography, Security Engineering, Algorithm...",4.5,4.5(744 reviews),Intermediate · Specialization · 3 - 6 Months


In [3]:
df_job = pd.read_csv('job_title_des.csv').drop('Unnamed: 0', axis=1)
df_job.head()

Unnamed: 0,Job Title,Job Description
0,Flutter Developer,We are looking for hire experts flutter develo...
1,Django Developer,PYTHON/DJANGO (Developer/Lead) - Job Code(PDJ ...
2,Machine Learning,"Data Scientist (Contractor)\n\nBangalore, IN\n..."
3,iOS Developer,JOB DESCRIPTION:\n\nStrong framework outside o...
4,Full Stack Developer,job responsibility full stack engineer – react...


To build the recommendation system, the `Job Description` column of the job dataset is used as the text representation of each job. For the course dataset, several relevant columns are combined into a single **`text`** column that represents the course information as a whole:

- `Title`
- `Skills`
- `Metadata`

Combining them ensures that all relevant information is captured in one text column, which is then preprocessed and used by the recommendation system.
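
One detail worth keeping in mind when concatenating string columns: in pandas, adding a string to a missing value yields a missing value, so a single empty `Skills` cell would blank out the whole combined text. The sketch below shows one possible way to guard against that. The toy rows and the titles `Course A` and `Course B` are made up for illustration, and whether the real dataset contains missing values is not stated in this notebook.

```python
import pandas as pd

# Toy rows standing in for the scraped course data; values are illustrative.
toy_courses = pd.DataFrame({
    "Title": ["Course A", "Course B"],
    "Skills": ["Python Programming, SQL", None],
    "Metadata": ["Beginner · Course · 1 - 4 Weeks",
                 "Intermediate · Course · 1 - 3 Months"],
})

columns = ["Title", "Skills", "Metadata"]
# fillna("") keeps a missing cell from turning the combined string into NaN.
combined = toy_courses[columns].fillna("").apply(lambda row: " ".join(row), axis=1)
# Collapse the double space left behind by an empty cell.
toy_courses["text"] = combined.str.replace(r"\s+", " ", regex=True).str.strip()

print(toy_courses["text"].tolist())
```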

In [4]:
course_df['text'] = course_df['Title'] + ' ' + course_df['Skills'] + ' ' + course_df['Metadata']
course_df = course_df[['Title', 'text']]
course_df.head()

Unnamed: 0,Title,text
0,Google Cybersecurity,"Google Cybersecurity Network Security, Python..."
1,Google Data Analytics,"Google Data Analytics Data Analysis, R Progra..."
2,Google Project Management:,Google Project Management: Project Management...
3,IBM Data Science,"IBM Data Science Python Programming, Data Sci..."
4,Google Digital Marketing & E-commerce,Google Digital Marketing & E-commerce Digital...


### **Text Data Preprocessing**

In [5]:
# Build the stopword set and the stemmer once instead of on every call
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_noise(text):
    text = re.sub(r'<.*?>', ' ', text)  # Remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # Remove URLs
    text = re.sub(r'#\w+', ' ', text)  # Remove hashtags
    text = re.sub(r'[^\w\s]', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d+', ' ', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOPWORDS]
    return ' '.join(filtered_words)

def stem_text(text):
    words = text.split()
    stemmed_words = [STEMMER.stem(word) for word in words]
    return ' '.join(stemmed_words)

def process(text):
    text = text.lower()
    text = clean_noise(text)
    text = remove_stopwords(text)
    text = stem_text(text)
    return text
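
To see what each cleaning step contributes, the sketch below applies the same regular expressions as `clean_noise` one at a time to a made-up sample string. The sample text and the `steps` list are illustrative only; stopword removal and stemming are left out so that the example needs nothing beyond the standard library.

```python
import re

# Same patterns as clean_noise, applied one at a time to a made-up sample.
steps = [
    ("HTML tags", r"<.*?>"),
    ("URLs", r"https?://\S+|www\.\S+"),
    ("hashtags", r"#\w+"),
    ("punctuation", r"[^\w\s]"),
    ("digits", r"\d+"),
]

sample = "Visit <b>https://example.com</b> #hiring: 3+ years of Python!"
for name, pattern in steps:
    sample = re.sub(pattern, " ", sample)
    print(f"after removing {name}: {sample!r}")

# Final step: collapse the runs of spaces left by the substitutions.
cleaned = re.sub(r"\s+", " ", sample).strip()
print(cleaned)  # Visit years of Python
```

The order matters: if URLs were stripped before HTML tags, `\S+` would also swallow the closing tag attached to the URL, and if punctuation were stripped first, the URL and hashtag patterns would no longer match.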

In [6]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
course_df = course_df.assign(text=course_df['text'].apply(process))
course_df.head()


Unnamed: 0,Title,text
0,Google Cybersecurity,googl cybersecur network secur python program ...
1,Google Data Analytics,googl data analyt data analysi r program sql b...
2,Google Project Management:,googl project manag project manag strategi ope...
3,IBM Data Science,ibm data scienc python program data scienc mac...
4,Google Digital Marketing & E-commerce,googl digit market e commerc digit market mark...


In [7]:
df_job['Job Description'] = df_job['Job Description'].apply(process)
df_job.head()

Unnamed: 0,Job Title,Job Description
0,Flutter Developer,look hire expert flutter develop elig post app...
1,Django Developer,python django develop lead job code pdj strong...
2,Machine Learning,data scientist contractor bangalor respons loo...
3,iOS Developer,job descript strong framework outsid io alway ...
4,Full Stack Developer,job respons full stack engin react role make i...


# **Text Vectorization**

Before building a text-based recommendation system, the first crucial step is to convert the text data into a numerical representation, that is, vectors. This is necessary because machine learning models and recommendation systems cannot process raw text directly; they need numerical input that can be operated on mathematically.

## **BERT**
In this recommendation system, vectorization uses a BERT-based sentence embedding model (`paraphrase-MiniLM-L6-v2`, loaded through Sentence-Transformers) because it can turn a whole sentence into a single vector that represents the meaning of that sentence. The model is pretrained, so we only need to initialize it and apply it directly to the text data to obtain contextual embeddings.
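
To make the idea of "one vector per sentence" more concrete, the sketch below shows masked mean pooling, the kind of step that Sentence-Transformers models like this one commonly use to collapse per-token vectors into a single sentence vector. The token embeddings are made-up numbers rather than real model output, and the function name `mean_pool` is hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two "sentences", three token slots each, 2-dimensional toy embeddings.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)  # one 2-dimensional vector per sentence
```

In the notebook, `SentenceTransformer.encode` handles tokenization, the transformer forward pass, and pooling internally, so none of this needs to be written by hand.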

In [8]:
# Initialize the pretrained Sentence-BERT model
model_bert = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Encode the course text into vectors
course_vectors = model_bert.encode(course_df['text'].tolist(), convert_to_tensor=True)

# Display the resulting embedding matrix
print("\nBERT matrix:")
print(course_vectors)


BERT matrix:
tensor([[-0.2468, -0.2206, -0.3310,  ..., -0.4538,  0.3236, -0.0677],
        [-0.4820, -0.2333, -0.2015,  ..., -0.3273,  0.1234,  0.3178],
        [-0.4355,  0.0794,  0.1399,  ..., -0.4650, -0.1113,  0.3009],
        ...,
        [-0.2864, -0.1840,  0.0262,  ..., -0.1523, -0.4079, -0.1371],
        [-0.5446, -0.0560, -0.3005,  ..., -0.0786, -0.0275,  0.3147],
        [-0.2043, -0.6041, -0.3488,  ...,  0.2797,  0.1543, -0.1042]])


In [9]:
# Encode the job descriptions into vectors
job_vectors = model_bert.encode(df_job['Job Description'].tolist(), convert_to_tensor=True)

# Display the resulting embedding matrix
print("\nBERT matrix:")
print(job_vectors)


BERT matrix:
tensor([[-0.1231,  0.1611,  0.1569,  ..., -0.1863, -0.0768,  0.6933],
        [-0.5362, -0.2623, -0.3761,  ..., -0.0608,  0.1965,  0.0766],
        [-0.4087,  0.0919, -0.2791,  ..., -0.6304, -0.0743,  0.4398],
        ...,
        [-0.3314,  0.1389,  0.0417,  ..., -0.4337, -0.2035,  0.3169],
        [-0.1817, -0.2760, -0.0478,  ...,  0.0094, -0.3610,  0.1994],
        [-0.2163,  0.2193,  0.3593,  ..., -0.3292, -0.4342,  0.5939]])


### **Save Vectors to CSV Files**


In [10]:
course_vectors_df = pd.DataFrame(course_vectors.cpu().numpy())
course_vectors_df.to_csv('course_vectors.csv', index=False)
job_vectors_df = pd.DataFrame(job_vectors.cpu().numpy())
job_vectors_df.to_csv('job_vectors.csv', index=False)
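
For completeness, here is a minimal sketch of reading such a file back into an array. It uses a small random matrix and a temporary directory instead of the real `course_vectors.csv`, so the file name `toy_vectors.csv` and the matrix shape are assumptions made only for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Random stand-in for an embedding matrix (4 texts, 8 dimensions).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.csv")
    pd.DataFrame(vectors).to_csv(path, index=False)
    # Column labels come back as strings ('0', '1', ...), so convert to a plain array.
    restored = pd.read_csv(path).to_numpy(dtype=np.float32)

print(restored.shape)
print(np.allclose(vectors, restored))
```

CSV is convenient to inspect, but it stores numbers as text. For large embedding matrices a binary format such as `np.save` is usually smaller and faster to load.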

# **Labelling & Similarity Mapping**

Once every text in both datasets has a vector representation, we can compute the similarity between each job description and each course text. Two implementations of cosine similarity are used here: `util.pytorch_cos_sim` from Sentence-Transformers (labelled *Sentence-BERT (SBERT)*) and `cosine_similarity` from scikit-learn (labelled *Cosine Similarity*). Both are applied to the same SBERT embeddings, so they are expected to produce matching scores.
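
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it for every job/course pair with plain NumPy. The function name `cosine_similarity_matrix` and the toy 2-dimensional vectors are hypothetical and stand in for the real embeddings.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

jobs = np.array([[1.0, 0.0], [1.0, 1.0]])                 # 2 toy "job" vectors
courses = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])  # 3 toy "course" vectors

similarity = cosine_similarity_matrix(jobs, courses)
print(similarity.round(4))
# [[1.     0.     0.7071]
#  [0.7071 0.7071 1.    ]]
```

The result has one row per job and one column per course, which is the same layout as the similarity matrices in this notebook.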

## **Sentence-BERT (SBERT)**
Why SBERT:
- SBERT is optimized for semantic similarity.
- It is generally more accurate than comparing vanilla BERT embeddings with cosine similarity.

In [11]:
from sentence_transformers import util

# Compute similarity
similarity_sbert = util.pytorch_cos_sim(job_vectors, course_vectors)

# Resulting similarity matrix
print("Matrix Similarity SBERT:")
print(similarity_sbert)

Matrix Similarity SBERT:
tensor([[0.1768, 0.3124, 0.5114,  ..., 0.1864, 0.3080, 0.2620],
        [0.3478, 0.3509, 0.3602,  ..., 0.1760, 0.2764, 0.4736],
        [0.4934, 0.5812, 0.5790,  ..., 0.3468, 0.5403, 0.3861],
        ...,
        [0.2863, 0.2733, 0.4388,  ..., 0.2926, 0.3677, 0.3015],
        [0.4152, 0.4328, 0.5844,  ..., 0.3814, 0.4315, 0.3034],
        [0.3536, 0.4205, 0.6181,  ..., 0.3733, 0.4177, 0.1747]])


In [12]:
# save similarity matrix to csv
similarity_sbert_df = pd.DataFrame(similarity_sbert.cpu().numpy())
similarity_sbert_df.to_csv('similarity_sbert.csv', index=False)

## **Cosine Similarity**

In [13]:
# Compute the similarity between every job vector and every course vector
similarity_cosine = cosine_similarity(job_vectors.cpu().numpy(), course_vectors.cpu().numpy())

# Resulting similarity matrix
print("Matrix Cosine Similarity:")
print(similarity_cosine)

Matrix Cosine Similarity:
[[0.17683974 0.31238782 0.5113713  ... 0.18636844 0.30800754 0.26201794]
 [0.3478121  0.35092637 0.36021355 ... 0.17597252 0.2763656  0.47361216]
 [0.49335116 0.5812128  0.5789698  ... 0.34677166 0.5403382  0.3861111 ]
 ...
 [0.28625622 0.27333137 0.43875718 ... 0.2925578  0.36766097 0.30151296]
 [0.41521713 0.43281984 0.5843527  ... 0.38144046 0.4314703  0.30340362]
 [0.35359737 0.42054495 0.61811835 ... 0.37334    0.41767067 0.17465535]]


In [16]:
# save similarity matrix to csv
similarity_cosine_df = pd.DataFrame(similarity_cosine)
similarity_cosine_df.to_csv('similarity_cosine.csv', index=False)

# **Recommendation Modelling**

In [17]:
def recommend(similarity, job_title, top_n=5):
    # Find the row position of the first job with this title
    matches = np.flatnonzero((df_job['Job Title'] == job_title).to_numpy())
    if matches.size == 0:
        raise ValueError(f"Job title not found: {job_title!r}")
    job_index = int(matches[0])

    # Take the similarity row for that job
    job_sim = similarity[job_index]

    # Convert a tensor to a flat NumPy array
    if hasattr(job_sim, 'cpu'):
        job_sim = job_sim.cpu().numpy().flatten()
    else:
        job_sim = job_sim.flatten()

    # Cap top_n at the number of courses
    top_n = min(top_n, len(course_df))

    # Indices of the top_n highest similarity scores
    top_course_indices = job_sim.argsort()[::-1][:top_n]

    # Look up the course titles
    recommended_courses = course_df.iloc[top_course_indices]['Title'].tolist()

    return recommended_courses
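
The ranking step inside `recommend` can be tried in isolation. The sketch below uses a made-up score row and hypothetical course titles, and it also keeps the score next to each title, which is one possible way to spot weak matches. The notebook's function returns titles only.

```python
import numpy as np

titles = ["Course A", "Course B", "Course C", "Course D"]  # hypothetical titles
scores = np.array([0.31, 0.58, 0.17, 0.49])                # one toy row of a similarity matrix

top_n = 2
# argsort is ascending, so reverse it and keep the first top_n positions.
top_indices = np.argsort(scores)[::-1][:top_n]
ranked = [(titles[i], float(scores[i])) for i in top_indices]
print(ranked)  # [('Course B', 0.58), ('Course D', 0.49)]
```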


Let's test the recommendation system on the first five jobs.

## **SBERT**

In [18]:
for job_name in df_job['Job Title'].tolist()[:5]:
    recommended_courses = recommend(similarity_sbert, job_name)

    print(f"Recommended courses for {job_name}:")
    for i, course_name in enumerate(recommended_courses, start=1):
        print(f"{i}. {course_name}")
    print()

Recommended courses for Flutter Developer:
1. Gestión de Proyectos de Google
2. Google Project Management:
3. Write Professional Emails in English
4. Human Resource Management: HR for People Managers
5. Google Project Management (PT)

Recommended courses for Django Developer:
1. Introductory C Programming
2. Python for Everybody
3. Meta Back-End Developer
4. Meta Database Engineer
5. SQL: A Practical Introduction for Querying Databases

Recommended courses for Machine Learning:
1. Data Science
2. IBM Data Science
3. Practical Data Science with MATLAB
4. Six Sigma Green Belt
5. Analytics for Decision Making

Recommended courses for iOS Developer:
1. Computer Communications
2. IBM Front-End Developer
3. IBM Back-End Development
4. IBM & Darden Digital Strategy
5. Digital Manufacturing & Design Technology

Recommended courses for Full Stack Developer:
1. Meta React Native
2. Gestión de Proyectos de Google
3. Meta Front-End Developer
4. Human Resource Management: HR for People Managers
5. 

## **Cosine Similarity**

In [19]:
for job_name in df_job['Job Title'].tolist()[:5]:
    recommended_courses = recommend(similarity_cosine, job_name)

    print(f"Recommended courses for {job_name}:")
    for i, course_name in enumerate(recommended_courses, start=1):
        print(f"{i}. {course_name}")
    print()

Recommended courses for Flutter Developer:
1. Gestión de Proyectos de Google
2. Google Project Management:
3. Write Professional Emails in English
4. Human Resource Management: HR for People Managers
5. Google Project Management (PT)

Recommended courses for Django Developer:
1. Introductory C Programming
2. Python for Everybody
3. Meta Back-End Developer
4. Meta Database Engineer
5. SQL: A Practical Introduction for Querying Databases

Recommended courses for Machine Learning:
1. Data Science
2. IBM Data Science
3. Practical Data Science with MATLAB
4. Six Sigma Green Belt
5. Analytics for Decision Making

Recommended courses for iOS Developer:
1. Computer Communications
2. IBM Front-End Developer
3. IBM Back-End Development
4. IBM & Darden Digital Strategy
5. Digital Manufacturing & Design Technology

Recommended courses for Full Stack Developer:
1. Meta React Native
2. Gestión de Proyectos de Google
3. Meta Front-End Developer
4. Human Resource Management: HR for People Managers
5. 