<a href="https://colab.research.google.com/github/a22057916w/Analysis-on-Online-Course-Data/blob/main/Coursera_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download Dataset
Download the Coursera dataset from a shared Google Drive file by its file ID.


In [None]:
!gdown 18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR

Downloading...
From: https://drive.google.com/uc?id=18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR
To: /content/CourseraDataset-Clean.csv
100% 5.41M/5.41M [00:00<00:00, 38.9MB/s]


## Data Preprocessing
*   Removing duplicate courses (rows) based on "Course Title"
*   Removing duplicate courses (rows) based on "Course Url", keeping only rows with an English "Course Title"
*   Combining the keywords and performing one-hot encoding
*   Encoding "Level" as integer codes (assigned in order of first appearance, not one-hot)
  * 1->beginner
  * 2->intermediate
  * 3->not specified
  * 4->advanced
*   Encoding "Schedule" as integer codes (assigned in order of first appearance, not one-hot)
  * 1->Flexible schedule
  * 2->Hands-on learning


In [None]:
import pandas as pd

In [6]:
df = pd.read_csv("CourseraDataset-Clean.csv")

df['Keyword'] = pd.factorize(df['Keyword'])[0] + 1 # perform ordinal encoding
df['Keyword'] = df['Keyword'].astype(str)
df['Keyword'] = df.groupby('Course Title')['Keyword'].transform(', '.join) # combine keyword
df = df.drop_duplicates(subset=["Course Title"]) # remove duplicate rows based on "Course Title"

# one-hot (multi-hot) encode "Keyword", then store each row's indicator vector as a string
# note: get_dummies sorts its columns as strings, so positions follow "1", "10", "2", ...
one_hot = df['Keyword'].str.get_dummies(sep=", ")
df['Keyword'] = one_hot.astype(str).agg(', '.join, axis=1)

# integer-encode "Level" (codes follow order of first appearance; this is not one-hot)
# 1->beginner; 2->intermediate; 3->not specified; 4->advanced
df['Level'] = pd.factorize(df['Level'])[0] + 1
df['Level'] = df['Level'].astype(str)

# integer-encode "Schedule" (codes follow order of first appearance; this is not one-hot)
# 1->Flexible schedule; 2->Hands-on learning
df['Schedule'] = pd.factorize(df['Schedule'])[0] + 1
df['Schedule'] = df['Schedule'].astype(str)


Use the langdetect package to detect the language of each course title.

In [7]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=3621f443486ab3b3112e6c9ab6895b734e172dd39aff0b4f9bf12b0aa806ccdd
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [11]:
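from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

# Detect the language of a string; return "unknown" if detection fails
def detect_language(text):
    try:
        return detect(text)
    except (LangDetectException, TypeError):  # e.g. no detectable text, or a non-string value
        return "unknown"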

# Among rows whose "Course Url" appears more than once, keep only those with an English title
duplicates_grouped = df.groupby('Course Url').filter(lambda x: len(x) > 1)
english_duplicates = duplicates_grouped[duplicates_grouped['Course Title'].apply(lambda x: detect_language(x) == 'en')]

# Rows with a unique "Course Url" (keep=False drops every copy of a duplicated URL)
cleaned_df = df.drop_duplicates(subset=['Course Url'], keep=False)

# Combine the unique-URL rows with the English rows from the duplicated URLs
df = pd.concat([cleaned_df, english_duplicates])
df

Unnamed: 0,Course Title,Rating,Level,Schedule,What you will learn,Skill gain,Modules,Instructor,Offered By,Keyword,Course Url,Duration to complete (Approx.),Number of Review
0,Fashion as Design,4.8,1,1,Not specified,"Art History, Art, History, Creativity","Introduction, Heroes, Silhouettes, Coutures, L...","Anna Burckhardt, Paola Antonelli, Michelle Mil...",The Museum of Modern Art,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/fashion-design,20.0,2813
1,Modern American Poetry,4.4,1,1,Not specified,Not specified,"Orientation, Module 1, Module 2, Module 3, Mod...",Cary Nelson,University of Illinois at Urbana-Champaign,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/modern-american...,34.0,100
2,Pixel Art for Video Games,4.5,1,1,Not specified,Not specified,"Week 1: Introduction to Pixel Art, Week 2: Pix...","Andrew Dennis, Ricardo Guimaraes",Michigan State University,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/pixel-art-video...,9.0,227
3,Distribución digital de la música independiente,0.0,1,1,Not specified,Not specified,"Semana 1, Semana 2, Semana 3, Semana 4",Eduardo de la Vara Brown.,SAE Institute México,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/distribucion-di...,8.0,0
4,The Blues: Understanding and Performing an Ame...,4.8,1,1,Students will be able to describe the blues as...,"Music, Chord, Jazz, Jazz Improvisation","Blues Progressions – Theory and Practice , Blu...",Dariusz Terefenko,University of Rochester,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/the-blues,11.0,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8365,Architecting with Google Kubernetes Engine: Pr...,4.9,2,1,Not specified,Not specified,"Introducción al curso, Control de acceso y seg...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-secur...,14.0,30
8366,Computational Thinking for K-12 Educators: Nes...,0.0,1,1,Not specified,"Education, want, Resource, Causality","Course Orientation, Nested If/Else Part 1, Nes...",Beth Simon,University of California San Diego,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/block-programmi...,11.0,0
8367,Cómo combinar y analizar datos complejos,0.0,3,1,Not specified,Not specified,"Estimación básica, Modelos, Vinculación de reg...","Richard Valliant, Ph.D.","University of Maryland, College Park","0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/data-collection...,9.0,0
8368,Architecting with Google Kubernetes Engine: Wo...,0.0,2,1,Not specified,Not specified,"Introdução ao curso, Operações do Kubernetes, ...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-workl...,19.0,0
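
The URL-based deduplication above can be illustrated on a toy table. This is a sketch under stated assumptions: the titles and URLs are made up, and `looks_english` is a hypothetical stand-in for the langdetect-based check (it simply treats pure-ASCII titles as English) so that the example runs without extra packages. It uses `DataFrame.duplicated(keep=False)` as an equivalent way to split rows into unique and duplicated URLs.

```python
import pandas as pd

def looks_english(title):
    # Hypothetical stand-in for a language detector:
    # treats pure-ASCII titles as English.
    return title.isascii()

# Hypothetical toy data: two URLs are shared by several rows.
toy = pd.DataFrame({
    "Course Title": [
        "Data Basics",
        "Fundamentos de análisis",
        "Poetry 101",
        "Diseño",
        "Diseño gráfico",
    ],
    "Course Url": ["u/data", "u/data", "u/poetry", "u/design", "u/design"],
})

# True for every row whose URL occurs more than once.
is_dup = toy.duplicated(subset="Course Url", keep=False)

# Rows with a unique URL are kept regardless of language.
unique_rows = toy[~is_dup]

# Among duplicated URLs, keep only rows whose title looks English.
english_rows = toy[is_dup & toy["Course Title"].apply(looks_english)]

result = pd.concat([unique_rows, english_rows])
print(result)
```

Two consequences of this approach are visible here: a duplicated URL with no English title (`u/design`) disappears entirely, and the kept duplicates are appended after the unique rows, so the index is no longer in its original order.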


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df.to_csv("CourseraDataset-Preprocessed.csv", index=False, encoding='utf-8-sig') # write as UTF-8 with a BOM

In [None]:
preprocessed_df = pd.read_csv("CourseraDataset-Preprocessed.csv", engine='python')
preprocessed_df

Unnamed: 0,Course Title,Rating,Level,Schedule,What you will learn,Skill gain,Modules,Instructor,Offered By,Keyword,Course Url,Duration to complete (Approx.),Number of Review
0,Fashion as Design,4.8,1,1,Not specified,"Art History, Art, History, Creativity","Introduction, Heroes, Silhouettes, Coutures, L...","Anna Burckhardt, Paola Antonelli, Michelle Mil...",The Museum of Modern Art,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/fashion-design,20.0,2813.0
1,Modern American Poetry,4.4,1,1,Not specified,Not specified,"Orientation, Module 1, Module 2, Module 3, Mod...",Cary Nelson,University of Illinois at Urbana-Champaign,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/modern-american...,34.0,100.0
2,Pixel Art for Video Games,4.5,1,1,Not specified,Not specified,"Week 1: Introduction to Pixel Art, Week 2: Pix...","Andrew Dennis, Ricardo Guimaraes",Michigan State University,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/pixel-art-video...,9.0,227.0
3,Distribución digital de la música independiente,0.0,1,1,Not specified,Not specified,"Semana 1, Semana 2, Semana 3, Semana 4",Eduardo de la Vara Brown.,SAE Institute México,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/distribucion-di...,8.0,0.0
4,The Blues: Understanding and Performing an Ame...,4.8,1,1,Students will be able to describe the blues as...,"Music, Chord, Jazz, Jazz Improvisation","Blues Progressions – Theory and Practice , Blu...",Dariusz Terefenko,University of Rochester,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/the-blues,11.0,582.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6094,Architecting with Google Kubernetes Engine: Pr...,4.9,2,1,Not specified,Not specified,"Introducción al curso, Control de acceso y seg...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-secur...,14.0,30.0
6095,Computational Thinking for K-12 Educators: Nes...,0.0,1,1,Not specified,"Education, want, Resource, Causality","Course Orientation, Nested If/Else Part 1, Nes...",Beth Simon,University of California San Diego,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/block-programmi...,11.0,0.0
6096,Cómo combinar y analizar datos complejos,0.0,3,1,Not specified,Not specified,"Estimación básica, Modelos, Vinculación de reg...","Richard Valliant, Ph.D.","University of Maryland, College Park","0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/data-collection...,9.0,0.0
6097,Architecting with Google Kubernetes Engine: Wo...,0.0,2,1,Not specified,Not specified,"Introdução ao curso, Operações do Kubernetes, ...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-workl...,19.0,0.0
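
As a small check of what the `utf-8-sig` export does, the sketch below writes a toy frame to a temporary file and inspects it. The file name and the one-row frame are hypothetical. A BOM is commonly added so that spreadsheet tools detect the file as UTF-8 and display accented titles correctly.

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with a non-ASCII title.
toy = pd.DataFrame({
    "Course Title": ["Cómo combinar y analizar datos complejos"],
    "Rating": [0.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False, encoding="utf-8-sig")

    # The file starts with the three-byte UTF-8 byte order mark.
    with open(path, "rb") as fh:
        has_bom = fh.read(3) == codecs.BOM_UTF8

    # Reading with the same codec strips the BOM from the first header.
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(has_bom)
print(restored.columns.tolist())
```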
