<a href="https://colab.research.google.com/github/a22057916w/Analysis-on-Online-Course-Data/blob/main/Coursera_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download Dataset
Download the Coursera dataset from a shared Google Drive by its file ID.


In [None]:
!gdown 18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR

Downloading...
From: https://drive.google.com/uc?id=18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR
To: /content/CourseraDataset-Clean.csv
100% 5.41M/5.41M [00:00<00:00, 219MB/s]


## Data Preprocessing
*   Removing duplicate courses (rows) based on "Course Title"
*   Removing duplicate courses (rows) based on "Course Url", keeping only rows with an English "Course Title"
*   Combining the keywords of each course and performing one-hot encoding
*   Performing ordinal encoding on "Level"
  * 1->Beginner level
  * 2->Intermediate level
  * 3->Advanced level
  * 4->Not specified
*   Performing integer encoding on "Schedule", numbered by order of first appearance in the data
  * 1->Flexible schedule
  * 2->Hands-on learning
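
The keyword step in the list above can be sketched on a toy frame. This is a minimal illustration, assuming one row per (course, keyword) pair; the course titles and keywords below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): one row per (course, keyword) pair
toy = pd.DataFrame({
    "Course Title": ["A", "A", "B"],
    "Keyword": ["Data Science", "Business", "Business"],
})

# Integer-code each keyword in order of first appearance (1-based), as strings
toy["Keyword"] = (pd.factorize(toy["Keyword"])[0] + 1).astype(str)

# Collect all codes of a course into one comma-separated string
toy["Keyword"] = toy.groupby("Course Title")["Keyword"].transform(", ".join)
toy = toy.drop_duplicates(subset=["Course Title"])

# Expand the combined string into one 0/1 indicator column per keyword code
indicators = toy["Keyword"].str.get_dummies(sep=", ")
print(indicators)
```

Course "A" ends up with both indicator columns set, while course "B" only has the column for its single keyword.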



In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

In [None]:
df = pd.read_csv("CourseraDataset-Clean.csv")

df['Keyword'] = pd.factorize(df['Keyword'])[0] + 1 # map each keyword to an integer code (1-based, order of first appearance)
df['Keyword'] = df['Keyword'].astype(str)
df['Keyword'] = df.groupby('Course Title')['Keyword'].transform(', '.join) # combine all keyword codes of a course into one string
df = df.drop_duplicates(subset=["Course Title"]) # remove duplicate rows based on "Course Title"

# perform one-hot encoding on "Keyword": one 0/1 column per keyword code
one_hot = df['Keyword'].str.get_dummies(sep=", ")
# store the indicator vector of each course as a single "0, 1, ..." string (columns are sorted as strings)
one_hot_keyword = one_hot.apply(lambda row: ', '.join(row.astype(str)), axis=1)
df['Keyword'] = one_hot_keyword

# perform ordinal encoding on "Level"
# 1->Beginner level; 2->Intermediate level; 3->Advanced level; 4->Not specified
custom_order = ['Beginner level', 'Intermediate level', 'Advanced level', 'Not specified']
level_codes = OrdinalEncoder(categories=[custom_order]).fit_transform(df[['Level']])  # float codes starting at 0
df['Level'] = (level_codes.ravel().astype(int) + 1).astype(str)  # shift to 1-4 and store as strings, like the other encoded columns

# perform integer encoding on "Schedule" (codes follow the order of first appearance)
# 1->Flexible schedule; 2->Hands-on learning
df['Schedule'] = pd.factorize(df['Schedule'])[0] + 1
df['Schedule'] = df['Schedule'].astype(str)
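
The difference between a fixed category order and `pd.factorize` can be checked on a small example. The sketch below uses `pd.Categorical` as a pandas-only alternative to `OrdinalEncoder` (it is not the method used in this notebook), and the toy rows are hypothetical. It also shows that `pd.factorize` numbers values by first appearance, so the Schedule mapping depends on the row order of the data.

```python
import pandas as pd

# Toy data (hypothetical rows, not from the dataset)
toy = pd.DataFrame({
    "Level": ["Advanced level", "Beginner level", "Not specified", "Intermediate level"],
    "Schedule": ["Hands-on learning", "Flexible schedule", "Hands-on learning", "Flexible schedule"],
})

# Fixed order: codes are positions in level_order, shifted to start at 1.
# Values missing from level_order would get code 0 here.
level_order = ["Beginner level", "Intermediate level", "Advanced level", "Not specified"]
toy["Level"] = pd.Categorical(toy["Level"], categories=level_order, ordered=True).codes + 1

# factorize: codes follow the order in which values first appear
codes, uniques = pd.factorize(toy["Schedule"])
toy["Schedule"] = codes + 1
schedule_mapping = {label: i + 1 for i, label in enumerate(uniques)}

print(toy)
print(schedule_mapping)
```

Because "Hands-on learning" appears first in this toy frame, it receives code 1. Printing the mapping on the real data is a quick way to confirm which code each schedule type received.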


Use the langdetect package to detect the language of each course title.

In [None]:
!pip install langdetect

In [None]:
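from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

# Detect the language of a string; return "unknown" if detection fails
def detect_language(text):
    try:
        language = detect(text)
    except Exception:
        language = "unknown"
    return language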

# Collect rows whose "Course Url" occurs more than once, then keep only those with an English title
duplicates_grouped = df.groupby('Course Url').filter(lambda x: len(x) > 1)
english_duplicates = duplicates_grouped[duplicates_grouped['Course Title'].apply(lambda x: detect_language(x) == 'en')]

# Drop every row whose "Course Url" is duplicated (keep=False removes all copies)
cleaned_df = df.drop_duplicates(subset=['Course Url'], keep=False)

# Recombine the unique-URL rows with the English rows from the duplicated URLs
df = pd.concat([cleaned_df, english_duplicates])
df
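
The URL de-duplication logic can be tried without langdetect on a toy frame. In this sketch, `keep_english_among_duplicates` is a hypothetical helper name, the URLs are made up, and the ASCII check is only a stand-in predicate for the demo, not a real language detector.

```python
import pandas as pd

def keep_english_among_duplicates(frame, is_english):
    """For URLs that occur more than once, keep only rows whose title passes is_english."""
    dup_mask = frame.duplicated(subset=["Course Url"], keep=False)
    unique_rows = frame[~dup_mask]  # URLs that appear exactly once
    english_rows = frame[dup_mask & frame["Course Title"].apply(is_english)]
    return pd.concat([unique_rows, english_rows])

# Toy data (hypothetical): the same course listed under two titles
toy = pd.DataFrame({
    "Course Title": ["Machine Learning", "Aprendizaje automático", "Deep Learning"],
    "Course Url": ["url/ml", "url/ml", "url/dl"],
})

# Stand-in predicate for the demo only: treats pure-ASCII titles as English
result = keep_english_among_duplicates(toy, lambda title: title.isascii())
print(result)
```

Two edge cases follow from this design: a duplicated URL with no English title is dropped entirely, and a duplicated URL with several English titles keeps all of them.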

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df.to_csv("CourseraDataset-Preprocessed.csv", index=False, encoding='utf-8-sig') # write as UTF-8 with BOM

In [None]:
preprocessed_df = pd.read_csv("CourseraDataset-Preprocessed.csv", engine='python')
preprocessed_df