### **Coursera Dataset**

In [2]:
import pandas as pd

In [3]:
df_c = pd.read_csv('./Dataset/Coursera/Coursera_combined_data.csv')
df_c.head()

,Course Name,Provider,Skills Gained,Rating & Reviews,Level & Duration,Course Image,Provider Image,Course Link
0,Generative AI: Prompt Engineering Basics,IBM,"ChatGPT, Generative AI, IBM Cloud",4.8 stars,1 - 4 Weeks,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://www.coursera.org/learn/generative-ai-p...
1,Exam Prep AI-102: Microsoft Azure AI Engineer ...,Whizlabs,"Microsoft Azure, Data Ethics, Natural Language...",3.3 stars,1 - 3 Months,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://www.coursera.org/learn/ai-102-microsof...
2,"Python for Data Science, AI & Development",Python for Data Science,"Jupyter, Automation, Web Scraping",AI & Development Course by IBM,1 - 3 Months,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://www.coursera.org/learn/python-for-appl...
3,IBM AI Engineering,IBM,"PyTorch (Machine Learning Library), Supervised...",4.6 stars,3 - 6 Months,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://www.coursera.org/professional-certific...
4,IBM Generative AI Engineering,IBM,"Generative AI, Data Wrangling, Unit Testing",4.6 stars,3 - 6 Months,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://d3njjcbhbojbot.cloudfront.net/api/util...,https://www.coursera.org/professional-certific...


### Data Understanding

In [4]:
print("Shape: ", df_c.shape, "\n")
duplicated_rows = df_c.duplicated().sum()
print("Duplicated rows: ", duplicated_rows, "\n")
print("Missing values per column:", "\n", df_c.isna().sum())

Shape:  (800, 8) 

Duplicated rows:  215 

Missing values per column: 
 Course Name         0
Provider            0
Skills Gained       0
Rating & Reviews    0
Level & Duration    0
Course Image        0
Provider Image      0
Course Link         0
dtype: int64


### Data Preprocessing

In [5]:
df_c = df_c.drop_duplicates()
print("After removing duplicates: ", df_c.shape)

After removing duplicates:  (585, 8)
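
The drop from 800 to 585 rows matches the 215 duplicates counted earlier: `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates()` removes exactly those rows. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "Course Name": ["A", "B", "A", "C"],
    "Provider": ["IBM", "Google", "IBM", "IBM"],
})

n_dupes = int(toy.duplicated().sum())  # repeats after the first occurrence
deduped = toy.drop_duplicates()        # keeps the first occurrence of each row

print("Duplicates:", n_dupes)
print("Rows before/after:", len(toy), len(deduped))
```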


In [6]:
# Count the rows whose "Rating & Reviews" value does not start with a digit
non_number_rows = df_c[~df_c['Rating & Reviews'].str.match(r'^\d')].shape[0]
print("Number of rows: ", non_number_rows)

Number of rows:  88
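
To see what kind of value the `^\d` check flags, here is a small sketch on hypothetical strings modelled on the rows shown in `df_c.head()`. The `sample` series is illustrative only.

```python
import pandas as pd

# Hypothetical values: two well-formed ratings and one shifted text field,
# similar to row 2 of the preview above.
sample = pd.Series([
    "4.8 stars",
    "3.3 stars",
    "AI & Development Course by IBM",
])

starts_with_digit = sample.str.match(r"^\d")
flagged = sample[~starts_with_digit]

print("Flagged values:", flagged.tolist())
```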


In [7]:
# Extract the decimal rating (e.g. 4.8) from "Rating & Reviews" into a new "Rating Score" column;
# rows without a decimal number become NaN
df_c['Rating Score'] = df_c['Rating & Reviews'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
# Fill the missing scores with the column mean, then round to one decimal place
df_c['Rating Score'] = df_c['Rating Score'].fillna(df_c['Rating Score'].mean())
df_c['Rating Score'] = df_c['Rating Score'].round(1)
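
A sketch of the same extract-then-impute steps on a toy frame, assuming ratings always appear as a decimal such as `4.8`. Note that the pattern `\d+\.\d+` would not match a whole-number rating like `5 stars`; whether such values occur in this dataset is not shown here.

```python
import pandas as pd

# Hypothetical toy data: two parsable ratings and one non-numeric value.
toy = pd.DataFrame(
    {"Rating & Reviews": ["4.8 stars", "3.2 stars", "Course by IBM"]}
)

# expand=False returns a Series for a single capture group.
score = (
    toy["Rating & Reviews"]
    .str.extract(r"(\d+\.\d+)", expand=False)
    .astype(float)
)

# The non-numeric row is NaN here; replace it with the mean of the others.
toy["Rating Score"] = score.fillna(score.mean()).round(1)

print(toy)
```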

### Embed the "Skills Gained" column into a new "Embeddings skills" column

In [8]:
# Embed the "Skills Gained" column of the Coursera data using Sentence Transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings
skills = df_c['Skills Gained'].tolist()
embeddings2 = model.encode(skills, show_progress_bar=True)

df_c['Embeddings skills'] = list(embeddings2)
print("Embeddings for skills successfully added to the dataframe!")

Batches: 100%|██████████| 19/19 [00:01<00:00, 13.36it/s]

Embeddings for skills successfully added to the dataframe!
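
Each entry in `Embeddings skills` is a NumPy vector, so rows can later be compared by cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real model output (`all-MiniLM-L6-v2` produces 384-dimensional vectors). The `cosine_similarity_matrix` helper is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical stand-ins for real embeddings.
toy = pd.DataFrame({
    "Course Name": ["A", "B", "C"],
    "Embeddings skills": [
        np.array([1.0, 0.0, 0.0]),
        np.array([2.0, 0.0, 0.0]),  # same direction as A
        np.array([0.0, 1.0, 0.0]),  # orthogonal to A
    ],
})

# Stack the per-row vectors into one (n_rows, dim) matrix.
matrix = np.vstack(toy["Embeddings skills"].tolist())
sims = cosine_similarity_matrix(matrix)

print(sims.round(2))
```

Normalising each row first means the matrix product directly yields cosine similarities, which avoids an explicit double loop over course pairs.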





### Save the processed data to new files

In [9]:
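# Pickle preserves the NumPy arrays stored in "Embeddings skills"
df_c.to_pickle("./Dataset/Coursera/Coursera_after_embeddings.pkl")
# CSV stores each embedding as its string representation
df_c.to_csv("./Dataset/Coursera/Coursera_Completed_Data.csv", index=False)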
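
The two formats behave differently for the embedding column, which is likely why both are written. A minimal round-trip sketch on a hypothetical two-row frame, using a temporary directory rather than the notebook's `./Dataset` paths:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy frame with an array-valued column.
toy = pd.DataFrame({
    "Course Name": ["A", "B"],
    "Embeddings skills": [np.array([0.1, 0.2]), np.array([0.3, 0.4])],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps real arrays; the CSV yields plain strings.
print(type(from_pkl["Embeddings skills"].iloc[0]))
print(type(from_csv["Embeddings skills"].iloc[0]))
```

For any later step that needs the vectors themselves, loading the `.pkl` file avoids having to parse the strings back into arrays.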