<a href="https://colab.research.google.com/github/a22057916w/Analysis-on-Online-Course-Data/blob/main/Coursera_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download Original Dataset
* Download "CourseraDataset-Clean.csv" from the shared Google Drive to your working directory.

* Alternatively, run the cell below to download the file directly with `gdown`.


In [None]:
!gdown 18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR

Downloading...
From: https://drive.google.com/uc?id=18oGZ87xBCx6YXjNRbytQe-dApQUSy3dR
To: /content/CourseraDataset-Clean.csv
100% 5.41M/5.41M [00:00<00:00, 40.1MB/s]


## Data Preprocessing
*   Removing duplicate courses (rows) based on "Course Title"
*   Removing duplicate courses (rows) based on "Course Url", keeping only rows with an English "Course Title"
*   Combining the keywords and performing one-hot encoding
*   Performing ordinal encoding on "Level"
  * 1->beginner
  * 2->intermediate
  * 3->advanced
  * 4->not specified
*   Performing label encoding on "Schedule"
  * 1->Flexible schedule
  * 2->Hands-on learning
* [Optional] Drop rows where both "Rating" and "Number of Review" are 0
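
The steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact pipeline: the rows and keyword names are made up, and the keywords are joined by name here (the notebook first replaces them with numeric IDs). The `level_order` dictionary is a hypothetical stand-in for the ordinal encoder.

```python
import pandas as pd

# Toy data (made up for illustration): one course listed under two keywords.
toy = pd.DataFrame({
    "Course Title": ["Intro to R", "Intro to R", "Jazz Basics"],
    "Keyword": ["Data Science", "Computer Science", "Arts and Humanities"],
    "Level": ["Beginner level", "Beginner level", "Not specified"],
})

# Combine the keywords of rows that share a title, then keep one row per title.
toy["Keyword"] = toy.groupby("Course Title")["Keyword"].transform(", ".join)
toy = toy.drop_duplicates(subset=["Course Title"]).reset_index(drop=True)

# One-hot encode the combined keywords: one 0/1 column per distinct keyword.
one_hot = toy["Keyword"].str.get_dummies(sep=", ")

# Ordinal-encode "Level" using the explicit order listed above.
level_order = {
    "Beginner level": 1,
    "Intermediate level": 2,
    "Advanced level": 3,
    "Not specified": 4,
}
toy["Level"] = toy["Level"].map(level_order)

print(one_hot)
print(toy[["Course Title", "Level"]])
```

The one-hot columns come out in sorted order of the keyword labels, so the position of each flag depends on how the labels sort.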




Use the `langdetect` package to detect the language.

In [None]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=ee290947917c3ee6d08ac9f83f083936fbeece60a267a272856a3c71f533d45a
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from langdetect import detect
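
`langdetect` raises an exception when a string has no detectable features (an empty title, for example), and its results can vary between runs unless `DetectorFactory.seed` is set. A defensive wrapper may therefore be useful. The sketch below is an assumption, not part of the notebook: `is_english` and `fake_detect` are hypothetical names, and the detector is passed in as an argument so the example runs without `langdetect` installed.

```python
from typing import Callable


def is_english(title: str, detect_fn: Callable[[str], str]) -> bool:
    """Return True when detect_fn labels the title as English ("en").

    Any error raised by the detector is treated as "not English" so that a
    single problematic row does not stop the whole preprocessing run.
    """
    try:
        return detect_fn(title) == "en"
    except Exception:
        return False


# Stand-in detector for demonstration only; in the notebook you would pass
# langdetect's `detect` function instead.
def fake_detect(text: str) -> str:
    if not text.strip():
        raise ValueError("no text to analyse")
    return "es" if ("ó" in text or "ú" in text) else "en"


print(is_english("Fashion as Design", fake_detect))
print(is_english("Distribución digital de la música independiente", fake_detect))
print(is_english("", fake_detect))
```

With the real package, the call would look like `is_english(title, detect)`.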

In [None]:
df = pd.read_csv("CourseraDataset-Clean.csv")

# Encode each keyword as a numeric ID (kept as a string) so the IDs can be joined per course
df['Keyword'] = pd.factorize(df['Keyword'])[0] + 1
df['Keyword'] = df['Keyword'].astype(str)

# Removing duplicate courses based on "Course Title"
df['Keyword'] = df.groupby('Course Title')['Keyword'].transform(', '.join)  # combine keyword by "Course Title"
df = df.drop_duplicates(subset=["Course Title"])  # remove duplicate rows based on "Course Title"


# Removing duplicate courses based on "Course Url",
# keeping only rows with an English "Course Title".
df["Keyword"] = df.groupby('Course Url')['Keyword'].transform(', '.join)  # combine keywords by "Course Url"
duplicates_url = df.groupby('Course Url').filter(lambda x: len(x) > 1)  # rows whose URL appears more than once
df_english_titles = duplicates_url[duplicates_url['Course Title'].apply(lambda x: detect(x) == 'en')]
df = df.drop_duplicates(subset=['Course Url'], keep=False)  # drop every row whose "Course Url" is duplicated
df = pd.concat([df, df_english_titles])  # add back the duplicated-URL rows that have an English "Course Title"
df["Course Url"] = df["Course Url"].astype(str)  # astype returns a new Series, so assign it back


# perform one-hot encoding on "Keyword"
one_hot = df['Keyword'].str.get_dummies(sep=", ")
one_hot_keyword = one_hot[list(one_hot.columns)].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
df['Keyword'] = one_hot_keyword

# perform ordinal encoding on "Level"
# 1->beginner; 2->intermediate; 3->advanced; 4->not specified
custom_order = ['Beginner level', 'Intermediate level', 'Advanced level', 'Not specified']
df['Level'] = OrdinalEncoder(categories=[custom_order], dtype=int).fit_transform(df[['Level']]).ravel() + 1  # codes start at 0, so shift to 1..4
df['Level'] = df['Level'].astype(str)

# perform label encoding on "Schedule"
# 1->Flexible schedule; 2->Hands-on learning
df['Schedule'] = pd.factorize(df['Schedule'])[0] + 1
df['Schedule'] = df['Schedule'].astype(str)

df

Unnamed: 0,Course Title,Rating,Level,Schedule,What you will learn,Skill gain,Modules,Instructor,Offered By,Keyword,Course Url,Duration to complete (Approx.),Number of Review
0,Fashion as Design,4.8,1,1,Not specified,"Art History, Art, History, Creativity","Introduction, Heroes, Silhouettes, Coutures, L...","Anna Burckhardt, Paola Antonelli, Michelle Mil...",The Museum of Modern Art,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/fashion-design,20.0,2813
1,Modern American Poetry,4.4,1,1,Not specified,Not specified,"Orientation, Module 1, Module 2, Module 3, Mod...",Cary Nelson,University of Illinois at Urbana-Champaign,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/modern-american...,34.0,100
2,Pixel Art for Video Games,4.5,1,1,Not specified,Not specified,"Week 1: Introduction to Pixel Art, Week 2: Pix...","Andrew Dennis, Ricardo Guimaraes",Michigan State University,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/pixel-art-video...,9.0,227
3,Distribución digital de la música independiente,0.0,1,1,Not specified,Not specified,"Semana 1, Semana 2, Semana 3, Semana 4",Eduardo de la Vara Brown.,SAE Institute México,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/distribucion-di...,8.0,0
4,The Blues: Understanding and Performing an Ame...,4.8,1,1,Students will be able to describe the blues as...,"Music, Chord, Jazz, Jazz Improvisation","Blues Progressions – Theory and Practice , Blu...",Dariusz Terefenko,University of Rochester,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/the-blues,11.0,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8368,Architecting with Google Kubernetes Engine: Wo...,0.0,2,1,Not specified,Not specified,"Introdução ao curso, Operações do Kubernetes, ...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-workl...,19.0,0
8369,Visualizing static networks with R,0.0,2,2,Learn to preprocess raw data to create nodes a...,"Network Analysis, igraph, R Programming, Graph...",Learn step-by-step,You (Lilian) Cheng,Coursera Project Network,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/projects/visualizing-...,2.0,0
2812,Renewable Energy Specialization,4.8,1,1,Understand and evaluate the operations and per...,"Wind Energy, Sustainability, Renewable Energy,...","Renewable Energy Technology Fundamentals, Rene...","Paul Komor, Stephen R. Lawrence",University of Colorado Boulder,"0, 0, 0, 0, 1, 1, 0, 1, 0, 1",https://www.coursera.org/specializations/renew...,40.0,472
3034,Software Engineering Specialization,4.5,2,1,The principal tasks of software project manage...,"Software Testing, Project Management, Software...",Software Engineering: Modeling Software System...,Kenneth W T Leung,The Hong Kong University of Science and Techno...,"0, 0, 0, 1, 1, 0, 0, 0, 0, 0",https://www.coursera.org/specializations/softw...,80.0,124


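**[Optional] Drop rows where both "Rating" and "Number of Review" are 0** <br>
Set `drop_unrated = True` to drop these rows.

In [None]:
drop_unrated = False  # set to True to drop courses with no rating and no reviews
if drop_unrated:
  # keep every row except those where both "Rating" and "Number of Review" are 0
  df = df[~((df['Rating'] == 0) & (df['Number of Review'] == 0))]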
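
The wording of this filter matters: "both are 0" and "either is 0" select different rows. The sketch below compares the two rules on made-up values. The frame and variable names are hypothetical.

```python
import pandas as pd

# Made-up rows covering all four combinations of zero / non-zero values.
toy = pd.DataFrame({
    "Rating": [4.8, 0.0, 4.5, 0.0],
    "Number of Review": [2813, 0, 0, 12],
})

# Rule A: drop a row only when BOTH columns are 0.
both_zero = (toy["Rating"] == 0) & (toy["Number of Review"] == 0)
kept_both_rule = toy[~both_zero]

# Rule B: drop a row when EITHER column is 0 (a stricter filter).
either_zero = (toy["Rating"] == 0) | (toy["Number of Review"] == 0)
kept_either_rule = toy[~either_zero]

print(len(kept_both_rule), len(kept_either_rule))
```

Rule A keeps courses that have a rating but no reviews yet (or the reverse), while rule B removes them.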

In [None]:
df.to_csv("CourseraDataset-Preprocessed.csv", index=False, encoding='utf-8-sig', lineterminator='\r\n') # UTF-8 with BOM encoded
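
As a quick illustration of what `utf-8-sig` changes, the sketch below encodes a sample header both ways using only the standard library. The header string is just an example. The only difference is a three-byte byte order mark (BOM) at the start, which some spreadsheet applications use to recognise the file as UTF-8.

```python
import codecs

header = "Course Title,Rating"

plain = header.encode("utf-8")         # no BOM
with_bom = header.encode("utf-8-sig")  # BOM + the same bytes

# The BOM is the three bytes EF BB BF placed at the start of the data.
print(with_bom.startswith(codecs.BOM_UTF8))
print(len(with_bom) - len(plain))

# Decoding with "utf-8-sig" strips the BOM again.
print(with_bom.decode("utf-8-sig") == header)
```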

In [None]:
preprocessed_df = pd.read_csv("CourseraDataset-Preprocessed.csv", engine='python')
preprocessed_df

Unnamed: 0,Course Title,Rating,Level,Schedule,What you will learn,Skill gain,Modules,Instructor,Offered By,Keyword,Course Url,Duration to complete (Approx.),Number of Review
0,Fashion as Design,4.8,1,1,Not specified,"Art History, Art, History, Creativity","Introduction, Heroes, Silhouettes, Coutures, L...","Anna Burckhardt, Paola Antonelli, Michelle Mil...",The Museum of Modern Art,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/fashion-design,20.0,2813
1,Modern American Poetry,4.4,1,1,Not specified,Not specified,"Orientation, Module 1, Module 2, Module 3, Mod...",Cary Nelson,University of Illinois at Urbana-Champaign,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/modern-american...,34.0,100
2,Pixel Art for Video Games,4.5,1,1,Not specified,Not specified,"Week 1: Introduction to Pixel Art, Week 2: Pix...","Andrew Dennis, Ricardo Guimaraes",Michigan State University,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/pixel-art-video...,9.0,227
3,Distribución digital de la música independiente,0.0,1,1,Not specified,Not specified,"Semana 1, Semana 2, Semana 3, Semana 4",Eduardo de la Vara Brown.,SAE Institute México,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/distribucion-di...,8.0,0
4,The Blues: Understanding and Performing an Ame...,4.8,1,1,Students will be able to describe the blues as...,"Music, Chord, Jazz, Jazz Improvisation","Blues Progressions – Theory and Practice , Blu...",Dariusz Terefenko,University of Rochester,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/the-blues,11.0,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6072,Architecting with Google Kubernetes Engine: Wo...,0.0,2,1,Not specified,Not specified,"Introdução ao curso, Operações do Kubernetes, ...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-workl...,19.0,0
6073,Visualizing static networks with R,0.0,2,2,Learn to preprocess raw data to create nodes a...,"Network Analysis, igraph, R Programming, Graph...",Learn step-by-step,You (Lilian) Cheng,Coursera Project Network,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/projects/visualizing-...,2.0,0
6074,Renewable Energy Specialization,4.8,1,1,Understand and evaluate the operations and per...,"Wind Energy, Sustainability, Renewable Energy,...","Renewable Energy Technology Fundamentals, Rene...","Paul Komor, Stephen R. Lawrence",University of Colorado Boulder,"0, 0, 0, 0, 1, 1, 0, 1, 0, 1",https://www.coursera.org/specializations/renew...,40.0,472
6075,Software Engineering Specialization,4.5,2,1,The principal tasks of software project manage...,"Software Testing, Project Management, Software...",Software Engineering: Modeling Software System...,Kenneth W T Leung,The Hong Kong University of Science and Techno...,"0, 0, 0, 1, 1, 0, 0, 0, 0, 0",https://www.coursera.org/specializations/softw...,80.0,124


## Download Preprocessed Dataset
Download the preprocessed dataset directly from Google Drive.

In [None]:
!gdown 1LQcrSyOa3y07UIX49_bUifccSls20hhY

Downloading...
From: https://drive.google.com/uc?id=1LQcrSyOa3y07UIX49_bUifccSls20hhY
To: /content/CourseraDataset-Preprocessed.csv
100% 3.72M/3.72M [00:00<00:00, 29.4MB/s]


In [None]:
import pandas as pd

df = pd.read_csv("CourseraDataset-Preprocessed.csv")
df

Unnamed: 0,Course Title,Rating,Level,Schedule,What you will learn,Skill gain,Modules,Instructor,Offered By,Keyword,Course Url,Duration to complete (Approx.),Number of Review
0,Fashion as Design,4.8,1,1,Not specified,"Art History, Art, History, Creativity","Introduction, Heroes, Silhouettes, Coutures, L...","Anna Burckhardt, Paola Antonelli, Michelle Mil...",The Museum of Modern Art,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/fashion-design,20.0,2813
1,Modern American Poetry,4.4,1,1,Not specified,Not specified,"Orientation, Module 1, Module 2, Module 3, Mod...",Cary Nelson,University of Illinois at Urbana-Champaign,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/modern-american...,34.0,100
2,Pixel Art for Video Games,4.5,1,1,Not specified,Not specified,"Week 1: Introduction to Pixel Art, Week 2: Pix...","Andrew Dennis, Ricardo Guimaraes",Michigan State University,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/pixel-art-video...,9.0,227
3,Distribución digital de la música independiente,0.0,1,1,Not specified,Not specified,"Semana 1, Semana 2, Semana 3, Semana 4",Eduardo de la Vara Brown.,SAE Institute México,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/distribucion-di...,8.0,0
4,The Blues: Understanding and Performing an Ame...,4.8,1,1,Students will be able to describe the blues as...,"Music, Chord, Jazz, Jazz Improvisation","Blues Progressions – Theory and Practice , Blu...",Dariusz Terefenko,University of Rochester,"1, 0, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/the-blues,11.0,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6072,Architecting with Google Kubernetes Engine: Wo...,0.0,2,1,Not specified,Not specified,"Introdução ao curso, Operações do Kubernetes, ...",Google Cloud Training,Google Cloud,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/learn/deploying-workl...,19.0,0
6073,Visualizing static networks with R,0.0,2,2,Learn to preprocess raw data to create nodes a...,"Network Analysis, igraph, R Programming, Graph...",Learn step-by-step,You (Lilian) Cheng,Coursera Project Network,"0, 1, 0, 0, 0, 0, 0, 0, 0, 0",https://www.coursera.org/projects/visualizing-...,2.0,0
6074,Renewable Energy Specialization,4.8,1,1,Understand and evaluate the operations and per...,"Wind Energy, Sustainability, Renewable Energy,...","Renewable Energy Technology Fundamentals, Rene...","Paul Komor, Stephen R. Lawrence",University of Colorado Boulder,"0, 0, 0, 0, 1, 1, 0, 1, 0, 1",https://www.coursera.org/specializations/renew...,40.0,472
6075,Software Engineering Specialization,4.5,2,1,The principal tasks of software project manage...,"Software Testing, Project Management, Software...",Software Engineering: Modeling Software System...,Kenneth W T Leung,The Hong Kong University of Science and Techno...,"0, 0, 0, 1, 1, 0, 0, 0, 0, 0",https://www.coursera.org/specializations/softw...,80.0,124
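
In the preprocessed file, "Keyword" is stored as a comma-separated string of 0/1 flags. For analysis it may be convenient to expand it into one integer column per flag. The sketch below uses two "Keyword" values copied from the table above. The `kw_0` ... `kw_9` column names are hypothetical, since the CSV does not record which keyword each position stands for.

```python
import pandas as pd

# Two rows taken from the preprocessed table shown above.
toy = pd.DataFrame({
    "Course Title": ["Fashion as Design", "Renewable Energy Specialization"],
    "Keyword": ["1, 0, 0, 0, 0, 0, 0, 0, 0, 0", "0, 0, 0, 0, 1, 1, 0, 1, 0, 1"],
})

# Split the string on ", " into separate columns and convert the flags to integers.
flags = toy["Keyword"].str.split(", ", expand=True).astype(int)
flags.columns = [f"kw_{i}" for i in range(flags.shape[1])]  # placeholder names

# Number of keywords attached to each course.
print(flags.sum(axis=1).tolist())
```

The resulting `flags` frame can be joined back with `pd.concat([toy, flags], axis=1)` if the flags are needed alongside the other columns.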
