# Exploratory Data Analysis of Coursera Course Dataset with Python

##### **_1._** Download the Coursera Course Dataset from Kaggle  

The dataset is available at: https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset/data

##### **_2._** Initialisation

Import dependencies and read the .csv file.

In [2]:
import pandas as pd
import plotly.express as px

#For plotly to work in Jupyter Notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

In [3]:
coursera_df = pd.read_csv("coursera_data.csv")
coursera_df.head()

Unnamed: 0.1,Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
0,134,(ISC)² Systems Security Certified Practitioner...,(ISC)²,SPECIALIZATION,4.7,Beginner,5.3k
1,743,A Crash Course in Causality: Inferring Causal...,University of Pennsylvania,COURSE,4.7,Intermediate,17k
2,874,A Crash Course in Data Science,Johns Hopkins University,COURSE,4.5,Mixed,130k
3,413,A Law Student's Toolkit,Yale University,COURSE,4.7,Mixed,91k
4,635,A Life of Happiness and Fulfillment,Indian School of Business,COURSE,4.8,Mixed,320k


According to the dataset description, column "Unnamed: 0" contains dataset identifiers, which have no significance for this analysis. The column is therefore removed.

In [4]:
del coursera_df["Unnamed: 0"]
coursera_df.head()

Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
0,(ISC)² Systems Security Certified Practitioner...,(ISC)²,SPECIALIZATION,4.7,Beginner,5.3k
1,A Crash Course in Causality: Inferring Causal...,University of Pennsylvania,COURSE,4.7,Intermediate,17k
2,A Crash Course in Data Science,Johns Hopkins University,COURSE,4.5,Mixed,130k
3,A Law Student's Toolkit,Yale University,COURSE,4.7,Mixed,91k
4,A Life of Happiness and Fulfillment,Indian School of Business,COURSE,4.8,Mixed,320k


##### **_3._** Data Cleaning

**3.1. Handle Missing Data**

In [5]:
coursera_df.isnull().any()

course_title                False
course_organization         False
course_Certificate_type     False
course_rating               False
course_difficulty           False
course_students_enrolled    False
dtype: bool

There are no missing values to be handled.

**3.2. Handle Duplicate Data**

In [6]:
len(coursera_df["course_title"].unique()) == len(coursera_df["course_title"])

False

In [7]:
coursera_df[coursera_df["course_title"].duplicated()]

Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
225,Developing Your Musicianship,Berklee College of Music,SPECIALIZATION,4.8,Beginner,54k
564,Machine Learning,Stanford University,COURSE,4.9,Mixed,3.2m
583,Marketing Digital,Universidad Austral,SPECIALIZATION,4.7,Beginner,39k


Three course titles appear more than once.  
Inspect the full records for these titles to decide how to handle the duplicates.

In [8]:
coursera_df.query("course_title == 'Developing Your Musicianship' \
                  or course_title == 'Machine Learning' \
                  or course_title == 'Marketing Digital' ")

Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
224,Developing Your Musicianship,Berklee College of Music,COURSE,4.8,Mixed,41k
225,Developing Your Musicianship,Berklee College of Music,SPECIALIZATION,4.8,Beginner,54k
563,Machine Learning,University of Washington,SPECIALIZATION,4.6,Intermediate,290k
564,Machine Learning,Stanford University,COURSE,4.9,Mixed,3.2m
582,Marketing Digital,Universidade de São Paulo,COURSE,4.8,Beginner,81k
583,Marketing Digital,Universidad Austral,SPECIALIZATION,4.7,Beginner,39k


The courses with duplicated titles differ in organization, certificate type or course difficulty, so they are distinct offerings rather than repeated records.  
They were neither removed nor otherwise treated, as each of them provides significant insights.
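
To back up this decision, duplicates can be checked on more than the title alone. The sketch below re-enters the six rows returned by the query (the frame name `dupes` is illustrative, not part of the notebook) and tests for full-row duplicates and for rows that share both title and organization.

```python
import pandas as pd

# The six rows returned by the query above, re-entered by hand for illustration.
dupes = pd.DataFrame(
    {
        "course_title": [
            "Developing Your Musicianship", "Developing Your Musicianship",
            "Machine Learning", "Machine Learning",
            "Marketing Digital", "Marketing Digital",
        ],
        "course_organization": [
            "Berklee College of Music", "Berklee College of Music",
            "University of Washington", "Stanford University",
            "Universidade de São Paulo", "Universidad Austral",
        ],
        "course_Certificate_type": [
            "COURSE", "SPECIALIZATION", "SPECIALIZATION",
            "COURSE", "COURSE", "SPECIALIZATION",
        ],
    }
)

# Rows that are identical in every column (true duplicates).
full_row_duplicates = int(dupes.duplicated().sum())

# Rows sharing both title and organization; keep=False marks every member of a group.
same_title_and_org = dupes[
    dupes.duplicated(subset=["course_title", "course_organization"], keep=False)
]

print(full_row_duplicates)
print(same_title_and_org)
```

On these rows no record is repeated in full; only the Berklee pair shares title and organization, and it differs in certificate type.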

**3.3. Treating the Outliers**

Check descriptive statistics first.

In [9]:
coursera_df.describe()

Unnamed: 0,course_rating
count,891.0
mean,4.677329
std,0.162225
min,3.3
25%,4.6
50%,4.7
75%,4.8
max,5.0


The descriptive statistics are incomplete: based on the dataset description, column "course_students_enrolled" should be numeric and appear here, but only "course_rating" is summarized.

In [10]:
coursera_df.dtypes

course_title                 object
course_organization          object
course_Certificate_type      object
course_rating               float64
course_difficulty            object
course_students_enrolled     object
dtype: object

As suspected, column "course_students_enrolled" has data type _object_ rather than _int64_ or _float64_, because its values are strings such as "5.3k" and "3.2m".  
Create a new column with these values converted to float64 so that enrollment can be analysed numerically.

In [11]:
#Define a function that converts data type to float64
def convert_enrollment(students_enrolled):
    students_enrolled = str(students_enrolled).strip().lower()
    if 'k' in students_enrolled:
        return float(students_enrolled.replace("k","")) * 1000
    elif 'm' in students_enrolled:
        return float(students_enrolled.replace("m","")) * 1000000
    else:
        return float(students_enrolled)
    
#Create a new column "course_students_enrolled_float"
coursera_df["course_students_enrolled_float"] = coursera_df["course_students_enrolled"].apply(convert_enrollment)
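
One limitation of the `in` checks above is that any string containing `k` or `m` is accepted, and a malformed value only fails inside `float()`. A stricter variant, sketched here under the hypothetical name `parse_enrollment`, matches the whole string against a pattern and raises a clear error otherwise. It assumes `k` and `m` are the only suffixes in the column, which matches the values shown in this notebook; the sample `"850"` is made up to exercise the no-suffix branch.

```python
import re

# Multipliers for the suffixes seen in the dataset ("5.3k", "3.2m").
_SUFFIX_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}
_ENROLLMENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)([km]?)")


def parse_enrollment(raw):
    """Convert strings such as '5.3k' or '3.2m' to a float; raise on anything else."""
    text = str(raw).strip().lower()
    match = _ENROLLMENT_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"Unrecognized enrollment value: {raw!r}")
    number, suffix = match.groups()
    return float(number) * _SUFFIX_MULTIPLIER[suffix]


for sample in ["5.3k", "17k", "3.2m", "850"]:
    print(sample, parse_enrollment(sample))
```

It could be applied in the same way as the original function, for example `coursera_df["course_students_enrolled"].apply(parse_enrollment)`.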

Check descriptive statistics again to see whether function "convert_enrollment" was successful.

In [12]:
coursera_df.describe()

Unnamed: 0,course_rating,course_students_enrolled_float
count,891.0,891.0
mean,4.677329,90552.08
std,0.162225,181936.5
min,3.3,1500.0
25%,4.6,17500.0
50%,4.7,42000.0
75%,4.8,99500.0
max,5.0,3200000.0


**Observation:**  
The maximum of "course_students_enrolled_float" (3,200,000) is far above the 3rd quartile (99,500). This gap indicates potential outliers in the feature.

In [13]:
#Plot box and whisker plot to confirm outliers
fig_1 = px.box(coursera_df, x="course_students_enrolled_float",
             title = "Boxplot of Course Students Enrolled",
             labels={"course_students_enrolled_float" : "Students Enrolled"},
             template="plotly_white")


fig_1.update_layout(font=dict(size=14), title=dict(x=0.5))
fig_1.show()

Mild and extreme outliers identified.  

Outliers must be retained in this analysis for the results to be interpreted correctly.  
*For instance, preprocessing (i.e. eliminating or adjusting) outlier values would make it impossible to tell accurately which course was the most popular.*
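
The distinction between mild and extreme outliers can be made explicit with Tukey's fences. The sketch below is a hedged illustration that plugs in the quartiles reported by `describe()`; the helper `classify_enrollment` is hypothetical, and the box plot may place its fences slightly differently if it computes quartiles with another interpolation method.

```python
# Quartiles taken from the describe() output above.
q1 = 17_500.0
q3 = 99_500.0
iqr = q3 - q1

# Tukey's convention: beyond 1.5 * IQR is a mild outlier, beyond 3 * IQR an extreme one.
mild_upper_fence = q3 + 1.5 * iqr
extreme_upper_fence = q3 + 3.0 * iqr


def classify_enrollment(value):
    """Label a single enrollment figure relative to the upper fences."""
    if value > extreme_upper_fence:
        return "extreme outlier"
    if value > mild_upper_fence:
        return "mild outlier"
    return "within fences"


print(mild_upper_fence, extreme_upper_fence)
print(classify_enrollment(3_200_000.0))  # the column maximum
```

Only upper fences are computed because the lower ones would be negative here and enrollment cannot fall below zero.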

**3.4. Data Categorization and Categorical Encoding**

Bin student enrollment and course rating into categories, and encode course difficulty as an ordinal number.

In [14]:
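#Define bins for student enrollment
#The upper edge is open-ended: with right=False each bin excludes its right edge,
#so using the column maximum as the last edge would leave the largest course uncategorized (NaN)
students_enrolled_bins = [0, 1000, 10000, 100000, float("inf")]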
labels = ["Low Enrollment", "Medium Enrollment", "High Enrollment", "Very High Enrollment"]

#Create new column for enrollment category
coursera_df["Enrollment_category"] = pd.cut(coursera_df["course_students_enrolled_float"],
                                            bins=students_enrolled_bins,
                                            labels=labels,
                                            right=False)

#Define bins for course rating
rating_bins = [0, 3, 3.5, 4, 4.5, coursera_df["course_rating"].max()]
rating_labels = ["Very Low", "Low", "Medium", "Good", "Excellent"]

#Create new column for rating category
coursera_df["Rating_category"] = pd.cut(coursera_df["course_rating"],
                                            bins=rating_bins,
                                            labels=rating_labels,
                                            right=True)

#Define dictionary for encoding
difficulty_dict = {"Beginner":1, "Mixed":2, "Intermediate":3, "Advanced":4}

#Create new column for encoded course difficulty
coursera_df["course_difficulty_numeric"] = coursera_df["course_difficulty"].map(difficulty_dict)

coursera_df.head()

Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled,course_students_enrolled_float,Enrollment_category,Rating_category,course_difficulty_numeric
0,(ISC)² Systems Security Certified Practitioner...,(ISC)²,SPECIALIZATION,4.7,Beginner,5.3k,5300.0,Medium Enrollment,Excellent,1
1,A Crash Course in Causality: Inferring Causal...,University of Pennsylvania,COURSE,4.7,Intermediate,17k,17000.0,High Enrollment,Excellent,3
2,A Crash Course in Data Science,Johns Hopkins University,COURSE,4.5,Mixed,130k,130000.0,Very High Enrollment,Good,2
3,A Law Student's Toolkit,Yale University,COURSE,4.7,Mixed,91k,91000.0,High Enrollment,Excellent,2
4,A Life of Happiness and Fulfillment,Indian School of Business,COURSE,4.8,Mixed,320k,320000.0,Very High Enrollment,Excellent,2
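
A detail worth checking whenever `pd.cut` is used with `right=False`: each bin is closed on the left and open on the right, so a value exactly equal to the last edge belongs to no bin and becomes `NaN`. The sketch below uses a few enrollment figures from this notebook to compare a finite last edge with an open-ended one (`float("inf")`); the variable names are illustrative.

```python
import pandas as pd

enrollment = pd.Series([5_300.0, 17_000.0, 130_000.0, 3_200_000.0])
labels = ["Low Enrollment", "Medium Enrollment", "High Enrollment", "Very High Enrollment"]

# Last edge equal to the maximum: [100000, 3200000) excludes 3,200,000 itself.
finite_edges = [0, 1_000, 10_000, 100_000, enrollment.max()]
with_finite_edge = pd.cut(enrollment, bins=finite_edges, labels=labels, right=False)

# Open-ended last edge: [100000, inf) includes every large value.
open_edges = [0, 1_000, 10_000, 100_000, float("inf")]
with_open_edge = pd.cut(enrollment, bins=open_edges, labels=labels, right=False)

print(with_finite_edge.isna().sum())
print(with_open_edge.isna().sum())
```

A quick `coursera_df["Enrollment_category"].isna().sum()` after binning is a cheap way to confirm that no course was left without a category.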


##### **_4._** Exploratory Data Analysis

> 4.1. Initial observations of the Dataset:

In [15]:
rows, columns = coursera_df.shape
print(f"After data pre-processing, current dataset has {rows} observations and {columns} features.")

After data pre-processing, current dataset has 891 observations and 10 features.


> 4.2. Which features are categorical and which are numerical?

In [16]:
coursera_df.dtypes

course_title                        object
course_organization                 object
course_Certificate_type             object
course_rating                      float64
course_difficulty                   object
course_students_enrolled            object
course_students_enrolled_float     float64
Enrollment_category               category
Rating_category                   category
course_difficulty_numeric            int64
dtype: object

**_Note_**: As described throughout the data cleaning process, "course_students_enrolled" is in fact a numerical feature stored as text rather than a categorical one.

> 4.3. Which courses are the most popular (by student enrollment)?

In [17]:
#Create a new dataframe which is sorted by the number of students enrolled
courses_by_students = coursera_df.sort_values(by="course_students_enrolled_float", ascending=False)

#Define most popular courses by student enrollment
top_10_courses_by_students = courses_by_students.head(10)

#Calculate mean enrollment across all courses
mean_students_enrolled = courses_by_students["course_students_enrolled_float"].mean()

#Plot a bar chart displaying top 10 courses by student enrollment
fig_2 = px.bar(top_10_courses_by_students, x="course_title", y="course_students_enrolled_float",
             title="Top 10 Courses by Number of Students Enrolled",
             labels={"course_students_enrolled_float":"Students Enrolled",\
                      "course_title" : "Course Title"},
             text="course_students_enrolled_float",
             template="plotly_white",
             hover_name="course_title")

#Put total students enrolled to a course above the bar
fig_2.update_traces(texttemplate="%{text:.2s}",
                  textposition="outside",
                  marker_color="rgb(136,204,238)")

fig_2.add_shape(type="line",
              x0=-0.5,
              y0=mean_students_enrolled,
              x1=9.5,
              y1=mean_students_enrolled,
              line=dict(color="purple", dash="dash"), name="Mean")

fig_2.add_annotation(x=9.2,
                   y=mean_students_enrolled,
                   text=f"Mean: {mean_students_enrolled:,.0f}",
                   showarrow=True,
                   yshift=2,
                   arrowhead=1,
                   arrowcolor="purple",
                   font=dict(color="purple"))

fig_2.update_layout(font=dict(size=14),
                  title=dict(x=0.4),
                  width=800,
                  height=700,
                  yaxis=dict(showgrid=False))

fig_2.show()

> 4.4. Which courses are the most popular (by course rating)?

In [18]:
#Create a new dataframe which is sorted by the course rating
courses_by_rating = coursera_df.sort_values(by="course_rating", ascending=False)

#Define most popular courses by rating
top_10_courses_by_rating = courses_by_rating.head(10)

#Calculate mean course rating of all courses
mean_rating = courses_by_rating["course_rating"].mean()

#Plot a bar chart displaying top 10 courses by course rating
fig_3 = px.bar(top_10_courses_by_rating, 
               x="course_title",
               y="course_rating",
               title="Top 10 Courses by Course Rating",
               labels={"course_rating":"Course Rating", "course_title" : "Course Title"},
               text="course_rating",
               template="plotly_white",
               hover_name="course_title")

fig_3.update_traces(texttemplate="%{text:.2s}",
                    textposition="outside",
                    marker_color="rgb(136,204,238)")

fig_3.add_shape(type="line",
              x0=-0.5,
              y0=mean_rating,
              x1=9.5,
              y1=mean_rating,
              line=dict(color="purple", dash="dash"),
              name="Mean")

fig_3.add_annotation(x=8.5,
                   y=4.5,
                   text=f"Mean: {mean_rating:.2f}",
                   showarrow=False,
                   font=dict(color="purple"))

fig_3.update_layout(font=dict(size=14),
                  title=dict(x=0.4),
                  width=800,
                  height=800,
                  yaxis=dict(showgrid=False))

fig_3.show()

> 4.5. How does student enrollment compare between the top courses by enrollment and the top courses by rating?

In [19]:

top_10_courses_by_students_temporary = top_10_courses_by_students\
    .rename(columns={"course_title":"Top Courses by Enrollment",\
                      "course_students_enrolled":"Students Enrolled"})
top_10_courses_by_rating_temporary = top_10_courses_by_rating\
    .rename(columns={"course_title":"Top Courses by Rating",\
                      "course_students_enrolled":"Students Enrolled"})

popularity_enrollment_vs_rating = pd.concat([top_10_courses_by_students_temporary[["Top Courses by Enrollment", "Students Enrolled"]]\
                                             .reset_index(drop=True),
                                            top_10_courses_by_rating_temporary[["Top Courses by Rating", "Students Enrolled"]]\
                                                .reset_index(drop=True)], axis=1)
popularity_enrollment_vs_rating

Unnamed: 0,Top Courses by Enrollment,Students Enrolled,Top Courses by Rating,Students Enrolled.1
0,Machine Learning,3.2m,El Abogado del Futuro: Legaltech y la Transfor...,1.5k
1,The Science of Well-Being,2.5m,Infectious Disease Modelling,1.6k
2,Python for Everybody,1.5m,Understanding the Brain: The Neurobiology of E...,130k
3,Programming for Everybody (Getting Started wit...,1.3m,Understanding Einstein: The Special Theory of ...,89k
4,Data Science,830k,Основы разработки на C++: белый пояс,41k
5,Career Success,790k,Excel Skills for Business: Intermediate I,76k
6,English for Career Development,760k,Excel Skills for Business: Essentials,200k
7,Successful Negotiation: Essential Strategies a...,750k,Excel Skills for Business,240k
8,Data Science: Foundations using R,740k,Everyday Parenting: The ABCs of Child Rearing,86k
9,Deep Learning,690k,Getting Started with SAS Programming,22k


The top courses by rating have far fewer students enrolled than the top courses by enrollment (at most 240k versus at least 690k).  
**Further evaluation focuses on top courses by student enrollment.**
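
A caveat on the rating-based list: ratings are recorded to one decimal place, so several courses can share the rating at the cut-off, and `sort_values(...).head(10)` then keeps an arbitrary subset of them. A minimal sketch with made-up rows (titles and ratings are hypothetical) shows how `nlargest` with `keep="all"` exposes such ties.

```python
import pandas as pd

# Hypothetical ratings, chosen so that three courses tie at the cut-off.
toy = pd.DataFrame(
    {
        "course_title": ["A", "B", "C", "D", "E"],
        "course_rating": [5.0, 4.9, 4.9, 4.9, 4.8],
    }
)

# head(2) after sorting returns exactly two rows, silently dropping tied courses.
top_2_head = toy.sort_values(by="course_rating", ascending=False).head(2)

# nlargest with keep="all" returns every row tied with the last selected value.
top_2_with_ties = toy.nlargest(2, "course_rating", keep="all")

print(len(top_2_head), len(top_2_with_ties))
```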

> 4.6. What are the ratings and difficulty levels of courses with highest enrollment?

In [20]:
discrete_colors = {"Mixed":"rgb(136,204,238)",
                   "Beginner":"rgb(169,169,169)",
                   "Intermediate":"rgb(144,103,167)",
                   "Advanced":"rgb(0,0,139)"}

#Plot a scatter plot displaying ratings and difficulty levels of courses with highest enrollment
fig_4=px.scatter(top_10_courses_by_students,
             x="course_title",
             y="course_rating",
             color="course_difficulty", 
             size="course_difficulty_numeric",
             title="Rating and Difficulty of Top 10 Courses by Enrollment",
             labels={"course_rating":"Course Rating", "course_title" : "Course Title",\
                      "course_difficulty":"Course Difficulty"},
             text="course_rating",
             template="plotly_white",
             color_discrete_map=discrete_colors)

fig_4.update_layout(font=dict(size=14),
                  title=dict(x=0.5),
                  width=800,
                  height=700)

fig_4.show()


> 4.7. Which courses are the least popular (by student enrollment)?

In [21]:
#Define least popular courses by student enrollment
top_10_worst_courses_by_students = courses_by_students.tail(10)\
    .sort_values(by="course_students_enrolled_float", ascending=True)

#Plot a bar chart displaying least popular courses by student enrollment
fig_5 = px.bar(top_10_worst_courses_by_students,
               x="course_title", y="course_students_enrolled_float",
               title="Top 10 Least Popular Courses by Number of Students Enrolled",
               labels={"course_students_enrolled_float":"Students Enrolled",\
                        "course_title" : "Course Title"},
               text="course_students_enrolled_float",
               template="plotly_white",
               hover_name="course_title")

fig_5.update_traces(texttemplate='%{text:.2s}',
                    textposition='outside',
                    marker_color="rgb(255,127,127)")

fig_5.update_layout(font=dict(size=14),
                  width=950,
                  height=800,
                  yaxis=dict(showgrid=False))

fig_5.show()

> 4.8. Which courses are the least popular (by course rating)?

In [22]:
#Define least popular courses by rating
top_10_worst_courses_by_rating = courses_by_rating.tail(10)\
    .sort_values(by="course_rating", ascending=True)

#Plot a bar chart displaying top 10 least popular courses by rating
fig_6 = px.bar(top_10_worst_courses_by_rating,
               x="course_title", y="course_rating",
               title="Top 10 Least Popular Courses by Course Rating",
               labels={"course_rating":"Course Rating",\
                        "course_title" : "Course Title"},
               text="course_rating",
               template="plotly_white",
               hover_name="course_title")

fig_6.update_traces(texttemplate="%{text:.2s}",
                    textposition="outside",
                    marker_color="rgb(255,127,127)")

fig_6.update_layout(font=dict(size=14),
                  title=dict(x=0.4),
                  width=950,
                  height=800,
                  yaxis=dict(showgrid=False))

fig_6.show()

**Insight:**  
Overall, the courses in this dataset are rated fairly well: not a single course is rated below 3.3.

> 4.9. How does student enrollment compare between the least popular courses by enrollment and the least popular courses by rating?

In [23]:
top_10_worst_courses_by_students_temporary = top_10_worst_courses_by_students\
    .rename(columns={"course_title":"Least Popular Courses by Enrollment",\
                      "course_students_enrolled":"Students Enrolled"})
top_10_worst_courses_by_rating_temporary = top_10_worst_courses_by_rating\
    .rename(columns={"course_title":"Least Popular Courses by Rating",\
                      "course_students_enrolled":"Students Enrolled"})

unpopularity_enrollment_vs_rating = pd.concat([top_10_worst_courses_by_students_temporary[["Least Popular Courses by Enrollment", "Students Enrolled"]]\
                                               .reset_index(drop=True),
                                            top_10_worst_courses_by_rating_temporary[["Least Popular Courses by Rating", "Students Enrolled"]]\
                                                .reset_index(drop=True)], axis=1)
unpopularity_enrollment_vs_rating

Unnamed: 0,Least Popular Courses by Enrollment,Students Enrolled,Least Popular Courses by Rating,Students Enrolled.1
0,El Abogado del Futuro: Legaltech y la Transfor...,1.5k,How To Create a Website in a Weekend! (Project...,140k
1,Blockchain Revolution in Financial Services,1.6k,Machine Learning and Reinforcement Learning in...,29k
2,Infectious Disease Modelling,1.6k,iOS App Development with Swift,76k
3,Healthcare Law,1.7k,Machine Learning for Trading,15k
4,The Pronunciation of American English,1.7k,"Introduction to Trading, Machine Learning & GCP",13k
5,"Identifying, Monitoring, and Analyzing Risk an...",1.7k,Mathematics for Machine Learning: PCA,33k
6,Esports,1.8k,How to Start Your Own Business,34k
7,Blended Language Learning: Design and Practice...,1.9k,Cybersecurity and Its Ten Domains,140k
8,Implementing RPA with Cognitive Automation and...,2.2k,Optical Engineering,6.2k
9,International Security Management,2.2k,Hardware Description Languages for FPGA Design,7.4k


> 4.10. How does rating affect enrollment? Is there a correlation?

In [24]:
#Plot a scatter plot displaying course rating vs students enrolled
fig_7 = px.scatter(coursera_df,
                   x="course_students_enrolled_float",
                   y="course_rating",
                   size="course_students_enrolled_float",
                   color="course_rating",
                   hover_name="course_title",
                   title="Course Rating vs. Number of Students Enrolled",
                   labels={"course_students_enrolled_float": "Students Enrolled", "course_rating": "Rating"},
                   template="plotly_white")

fig_7.update_layout(font=dict(size=14),
                  title=dict(x=0.5))

fig_7.show()


**Insight:**  
No clear trend showing that course rating affects enrollment was observed; at most, courses with higher ratings tend to have somewhat more students enrolled.
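
The scatter plot can be complemented with a correlation coefficient. The sketch below is one hedged way to do it: Pearson on the raw values, plus a rank-based coefficient (Pearson on ranks, equivalent to Spearman) that is less sensitive to the enrollment outliers. The helper name `rating_enrollment_correlation` is hypothetical, and the demo frame holds only the five rows shown by `head()`, so the printed numbers say nothing about the full dataset.

```python
import pandas as pd


def rating_enrollment_correlation(df):
    """Return Pearson and rank-based correlations between rating and enrollment."""
    pair = df[["course_rating", "course_students_enrolled_float"]]
    pearson = pair.corr().iloc[0, 1]
    # Pearson on ranks is equivalent to Spearman's rank correlation.
    spearman = pair.rank().corr().iloc[0, 1]
    return pearson, spearman


# The five rows shown by coursera_df.head() earlier in the notebook.
sample = pd.DataFrame(
    {
        "course_rating": [4.7, 4.7, 4.5, 4.7, 4.8],
        "course_students_enrolled_float": [5300.0, 17000.0, 130000.0, 91000.0, 320000.0],
    }
)

pearson, spearman = rating_enrollment_correlation(sample)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the notebook's data the call would be `rating_enrollment_correlation(coursera_df)`.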

> 4.11. Which organizations offer the most popular courses (by student enrollment)?  
How many courses do these organizations provide in total?  
How do the numbers compare to the mean and median?

In [33]:
#Define organizations which provide most popular courses based on student enrollment
top_organizations = top_10_courses_by_students["course_organization"].unique()

#Count how many courses each top organization provides
top_organizations_course_counts = coursera_df[coursera_df["course_organization"].isin(top_organizations)]["course_organization"].value_counts()

#Count how many courses each organization provides
organizations_course_counts = coursera_df["course_organization"].value_counts()

#Calculate mean number of courses per organization
mean_courses_by_organization = organizations_course_counts.mean()

#Calculate median number of courses per organization
median_courses_by_organization = organizations_course_counts.median()

#Plot a bar chart displaying number of courses provided by the most popular organizations
fig_8 = px.bar(top_organizations_course_counts,
               x="count",
               title="Courses Provided by Top Organizations by Enrollment",
               labels={"course_organization":"Organization", "count" : "Number of Courses"},
               text="count",
               template="plotly_white")

fig_8.update_traces(marker_color="rgb(136,204,238)")

#Add mean
fig_8.add_shape(type="line",
                x0=mean_courses_by_organization,
                y0=-0.5,
                x1=mean_courses_by_organization,
                y1=6.7,
                line=dict(color="purple", dash="dash"), name="Mean")

#Add median
fig_8.add_shape(type="line", x0=median_courses_by_organization, y0=-0.5,
              x1=median_courses_by_organization,
              y1=6.7,
              line=dict(color="blue", dash="dash"), name="Median")

#Annotation for the mean
fig_8.add_annotation(y=3,
                     x=mean_courses_by_organization,
                     text=f"Mean: {mean_courses_by_organization:,.0f}",
                     showarrow=True,
                     arrowhead=1,
                     arrowcolor="purple",
                     xshift=-2,
                     yshift=-3,
                     font=dict(color="purple"))

#Annotation for the median
fig_8.add_annotation(y=5,
                     x=median_courses_by_organization,
                     text=f"Median: {median_courses_by_organization:,.0f}",
                     showarrow=True,
                     arrowhead=1,
                     arrowcolor="blue",
                     xshift=-2,
                     yshift=-3,
                     font=dict(color="blue"))

fig_8.update_layout(font=dict(size=14),
                  title=dict(x=0.4),
                  xaxis=dict(showgrid=False))

fig_8.show()

**Insight:**  
Organizations behind the most popular courses provide far more courses than the mean and median across all organizations.
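
The `px.bar(..., x="count")` call above relies on `value_counts()` naming its result `count`, which is the pandas 2.x behaviour; older versions name the result after the source column. A hedged way to avoid depending on this is to build a frame with explicit column names, as sketched below with hypothetical organizations.

```python
import pandas as pd

# Hypothetical organizations, only to make the example self-contained.
toy = pd.DataFrame(
    {"course_organization": ["Org A", "Org A", "Org A", "Org B", "Org B", "Org C"]}
)

# Explicit column names, independent of how value_counts() labels its result.
course_counts = (
    toy["course_organization"]
    .value_counts()
    .rename_axis("course_organization")
    .reset_index(name="count")
)

# Mean and median number of courses per organization.
mean_courses = course_counts["count"].mean()
median_courses = course_counts["count"].median()

print(course_counts)
print(mean_courses, median_courses)
```

Such a frame can then be passed to `px.bar` with `x="count"` and `y="course_organization"`.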

> 4.12. Which certification types are most common?

In [28]:
#Count courses by each type of certification
n_certifications = coursera_df.groupby(["course_Certificate_type"]).size().reset_index(name="n")
n_certifications = n_certifications.rename(columns={"course_Certificate_type":"Certificate Type",\
                                                    "n":"Number of Courses"}\
                                                        ).sort_values(by="Number of Courses",
                                                                      ascending=False)
n_certifications["Number of Courses, %"] = round(((n_certifications["Number of Courses"] / n_certifications["Number of Courses"].sum()) * 100), 2)
n_certifications

Unnamed: 0,Certificate Type,Number of Courses,"Number of Courses, %"
0,COURSE,582,65.32
2,SPECIALIZATION,297,33.33
1,PROFESSIONAL CERTIFICATE,12,1.35
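
The percentage column can also be obtained directly with `value_counts(normalize=True)`. The sketch below reproduces the shares from the counts in the table; the Series is rebuilt from those counts purely for illustration.

```python
import pandas as pd

# Certificate-type counts from the table above, expanded to one label per course.
certificate_types = pd.Series(
    ["COURSE"] * 582 + ["SPECIALIZATION"] * 297 + ["PROFESSIONAL CERTIFICATE"] * 12
)

# normalize=True returns fractions that sum to 1; scale to percent and round.
share_percent = certificate_types.value_counts(normalize=True).mul(100).round(2)

print(share_percent)
```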


> 4.13. Which organizations provide professional certificates?

In [71]:
professional_certifications = coursera_df.query("course_Certificate_type == 'PROFESSIONAL CERTIFICATE' "\
                                                ).groupby(["course_organization"]\
                                                          ).size().reset_index(name="n")
professional_certifications = professional_certifications.rename(columns={"course_organization":"Organization",\
                                                                           "n":"Number of Certifications"}\
                                                                            ).sort_values(by="Number of Certifications",\
                                                                                          ascending=False)
professional_certifications

Unnamed: 0,Organization,Number of Certifications
4,IBM,3
3,Google Cloud,2
2,Google,2
5,SAS,2
0,Arizona State University,1
1,Crece con Google,1
6,"University of California, Irvine",1


> 4.14. How do certification types compare across top organizations by student enrollment?

In [72]:
#Count certification types provided by top organizations by student enrollment
n_certifications_by_top_orgs = coursera_df[coursera_df["course_organization"]\
                                           .isin(top_organizations)]\
                                            .groupby(["course_organization", "course_Certificate_type"])\
                                                .size()\
                                                    .reset_index(name="n")

discrete_colors_certification = {"COURSE":"rgb(136,204,238)",
                   "SPECIALIZATION":"rgb(169,169,169)",
                   "PROFESSIONAL CERTIFICATE":"rgb(144,103,167)"}

#Plot a bar chart displaying number of certificate types provided by top organizations by student enrollment
fig_9=px.bar(n_certifications_by_top_orgs,
             x="course_organization",
             y="n",
             color="course_Certificate_type", 
             barmode="group",
             title="Certificate Types by Top Organizations",
             labels={"n":"Number of Courses", "course_organization" : "Organization", "course_Certificate_type":"Certificate Type"},
             text="n",
             template="plotly_white",
             color_discrete_map=discrete_colors_certification)

fig_9.update_traces(textposition='outside')

fig_9.update_layout(font=dict(size=14),
                  title=dict(x=0.4),
                  width=800,
                  height=600,
                  yaxis=dict(showgrid=False))

fig_9.show()

> 4.15. Which organizations offer the highest-rated courses?

In [77]:

#Calculate mean rating per organization
mean_rating_by_organization = coursera_df.groupby("course_organization")["course_rating"].mean().reset_index(name="average_rating")

orgs_w_highest_rating_courses = mean_rating_by_organization.query("4.5 < average_rating <= 5")
orgs_w_highest_rating_courses = orgs_w_highest_rating_courses.rename(columns={"course_organization":"Top Organizations by Rating",\
                                               "average_rating":"Mean Rating"})
orgs_w_highest_rating_courses


Unnamed: 0,Top Organizations by Rating,Mean Rating
0,(ISC)²,4.733333
1,Amazon Web Services,4.550000
3,American Museum of Natural History,4.750000
4,Arizona State University,4.771429
5,Atlassian,4.750000
...,...,...
149,Yonsei University,4.750000
150,deeplearning.ai,4.743750
151,École Polytechnique,4.800000
152,École Polytechnique Fédérale de Lausanne,4.725000


> 4.16. How does average rating compare between top organizations by student enrollment and top organizations by rating?

In [79]:
#Define top organizations by mean rating
top_orgs_by_rating = mean_rating_by_organization.sort_values(by="average_rating", ascending=False).head(7)

#Count how many courses each top organization provides
top_organizations_by_rating_course_counts = coursera_df[coursera_df["course_organization"].isin(top_orgs_by_rating["course_organization"])] \
    .groupby("course_organization")["course_title"].count().reset_index(name="course_count")

top_orgs_by_rating = pd.merge(top_organizations_by_rating_course_counts, top_orgs_by_rating, on="course_organization")
top_orgs_by_rating = top_orgs_by_rating.rename(columns={"course_organization":"Top Organizations by Mean Rating",\
                                               "average_rating":"Mean Rating", "course_count":"Number of Courses"})



top_orgs_by_rating[["Top Organizations by Mean Rating", "Mean Rating", "Number of Courses"]]

Unnamed: 0,Top Organizations by Mean Rating,Mean Rating,Number of Courses
0,Crece con Google,4.9,1
1,Google - Spectrum Sharing,4.9,1
2,Hebrew University of Jerusalem,4.9,1
3,London Business School,4.9,1
4,"Nanyang Technological University, Singapore",4.9,1
5,ScrumTrek,4.9,1
6,Universidade Estadual de Campinas,4.9,1


In [81]:
# Calculate mean rating for top organizations by student enrollment
avg_rating_by_top_orgs = coursera_df[coursera_df["course_organization"].isin(top_organizations)] \
    .groupby("course_organization")["course_rating"] \
    .mean() \
    .reset_index(name="average_rating")

avg_rating_by_top_orgs = avg_rating_by_top_orgs.sort_values(by="average_rating",\
                                                             ascending=False)

# Create the bar chart displaying mean rating by top organizations by student enrollment
fig_10 = px.bar(avg_rating_by_top_orgs,
               x="course_organization",
               y="average_rating",
               title="Average Rating by Top Organizations by Student Enrollment",
               labels={"average_rating": "Average Rating",\
                        "course_organization": "Organization"},
               template="plotly_white",
               text="average_rating",
               hover_name="course_organization")

fig_10.update_traces(texttemplate="%{text:.2f}",
                     textposition="outside",
                     marker_color="rgb(136,204,238)")

fig_10.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=800,
                     height=550,
                     yaxis=dict(showgrid=False,\
                               range=[4, 5],\
                                dtick=0.1,\
                                    showticklabels=False,\
                                        title=None))

fig_10.show()


> 4.17. Do higher difficulty courses have fewer enrollments compared to easier courses?

In [83]:
#Count how many students are enrolled in each course difficulty
enrollment_by_difficulty = coursera_df.groupby("course_difficulty")["course_students_enrolled_float"].sum()\
    .reset_index(name="enrollment")
enrollment_by_difficulty = enrollment_by_difficulty.sort_values(by="enrollment",\
                                                                ascending=False)

#Plot a bar chart displaying how many students are enrolled in each course difficulty level
fig_11 = px.bar(enrollment_by_difficulty, x="course_difficulty",
               y="enrollment",
               title="Student Enrollment by Course Difficulty",
               labels={"course_difficulty":"Course Difficulty",\
                        "enrollment" : "Number of Students Enrolled"},
               text="enrollment",
               template="plotly_white",
               hover_name="course_difficulty")

fig_11.update_traces(texttemplate="%{text:.2s}",
                    textposition="outside",
                    marker_color="rgb(136,204,238)")

fig_11.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=800,
                     height=550,
                     yaxis=dict(showgrid=False))

fig_11.show()
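
Summed enrollment also depends on how many courses each difficulty level has, so a level with many courses can look popular even when its individual courses are small. A minimal sketch of comparing per-course figures instead, using a small made-up frame (`toy_df`) whose column names mirror `coursera_df`:

```python
import pandas as pd

# Toy data for illustration only; the column names mirror coursera_df.
toy_df = pd.DataFrame({
    "course_difficulty": ["Beginner", "Beginner", "Beginner", "Advanced"],
    "course_students_enrolled_float": [10_000.0, 20_000.0, 30_000.0, 50_000.0],
})

# Report the course count alongside total, mean and median enrollment per level
per_course_enrollment = (
    toy_df.groupby("course_difficulty")["course_students_enrolled_float"]
    .agg(course_count="count", total="sum", mean="mean", median="median")
    .reset_index()
)
print(per_course_enrollment)
```

In this toy example Beginner has the larger total, yet the single Advanced course has the higher mean enrollment, which is the kind of difference a sum alone would hide.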



> 4.18. How does student enrollment vary across different certification types?

In [86]:
#Sum student enrollment for each course certificate type
enrollment_by_certificate = coursera_df.groupby("course_Certificate_type")["course_students_enrolled_float"].sum().reset_index(name="enrollment")
enrollment_by_certificate = enrollment_by_certificate.sort_values(by="enrollment", ascending=False)

#Plot a bar chart displaying number of students enrolled in each course certificate type
fig_12 = px.bar(enrollment_by_certificate,
                x="course_Certificate_type",
                y="enrollment",
                title="Student Enrollment by Course Certificate Type",
                labels={"course_Certificate_type":"Course Certificate",\
                      "enrollment" : "Number of Students Enrolled"},
                text="enrollment",
                template="plotly_white",
                hover_name="course_Certificate_type")

fig_12.update_traces(texttemplate="%{text:.2s}",
                    textposition="outside",
                    marker_color="rgb(136,204,238)")

fig_12.update_layout(width=800,
                    height=550,
                    yaxis=dict(showgrid=False))

fig_12.show()


> 4.19. Which certification type tends to be associated with higher-rated courses?

In [87]:
#Calculate mean rating by certificate type
mean_rating_by_certification = coursera_df.groupby("course_Certificate_type")["course_rating"].mean()\
  .reset_index(name="average_rating")

#Plot a bar chart displaying mean rating by course certificate type
fig_13 = px.bar(mean_rating_by_certification,
                x="course_Certificate_type",
                y="average_rating",
                title="Mean Rating by Course Certificate Type",
                labels={"course_Certificate_type":"Course Certificate",\
                         "average_rating" : "Mean Rating"},
                text="average_rating",
                template="plotly_white",
                hover_name="course_Certificate_type")

fig_13.update_traces(texttemplate="%{text:.2f}",
                    textposition="outside",
                    marker_color="rgb(136,204,238)")

fig_13.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=800,
                     height=550,
                     yaxis=dict(showgrid=False))

fig_13.show()

> 4.20. Is there a correlation between the difficulty level of a course and its rating?

In [88]:
#Create a correlation matrix
corr_matrix = coursera_df[["course_rating", "course_difficulty_numeric"]].corr()

#Plot correlation matrix
fig_14 = px.imshow(corr_matrix,
                   color_continuous_scale="Blues",
                   labels={"color":"Correlation"},
                   x=["Course Rating", "Course Difficulty"],
                   y=["Course Rating", "Course Difficulty"],
                   title="Correlation Matrix: Course Difficulty and Rating",
                   zmin=-1,
                   zmax=1)

fig_14.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=600,
                     height=500)

fig_14.show()
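
`DataFrame.corr()` defaults to Pearson's coefficient, which treats the numeric difficulty codes as evenly spaced. Because difficulty is an ordered category, a rank-based coefficient may be a useful cross-check. A minimal sketch on made-up data, assuming `course_difficulty_numeric` encodes difficulty as increasing integers (the actual mapping is defined earlier in the notebook and is not reproduced here):

```python
import pandas as pd

# Toy data for illustration only; the integer codes are an assumed encoding.
toy_df = pd.DataFrame({
    "course_rating": [4.9, 4.8, 4.7, 4.6, 4.5, 4.4],
    "course_difficulty_numeric": [1, 1, 2, 2, 3, 3],
})

# Spearman correlates the ranks of the values rather than the values themselves
spearman_corr = toy_df[["course_rating", "course_difficulty_numeric"]].corr(method="spearman")
print(spearman_corr.round(2))
```

The resulting matrix has the same shape as `corr_matrix`, so it could be passed to `px.imshow` in the same way.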

> 4.21. How does average rating compare to course difficulty?

In [89]:
#Calculate mean course rating for each course difficulty level
mean_rating_by_difficulty = coursera_df.groupby("course_difficulty")["course_rating"].mean()\
    .reset_index(name="average rating")

#Plot a bar chart displaying mean course rating by course difficulty
fig_15=px.bar(mean_rating_by_difficulty,
             x="course_difficulty",
             y="average rating",
             title="Mean Rating by Course Difficulty",
             labels={"average rating":"Mean Rating",\
                      "course_difficulty":"Course Difficulty"},
             text="average rating",
             template="plotly_white",
             hover_name="course_difficulty")

fig_15.update_traces(texttemplate="%{text:.2f}", textposition="outside",
                  marker_color="rgb(136,204,238)")

fig_15.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=800,
                     height=700)

fig_15.show()

> 4.22. Are there any organizations that specialize in a particular difficulty level of courses?

In [90]:
#Group by organization and difficulty level and count number of courses 
difficulty_level_counts = coursera_df.groupby(["course_organization", "course_difficulty"]).size()\
    .reset_index(name="count")

#Pivot data to clearly see number of courses by their difficulty in each organization
difficulty_pivot = difficulty_level_counts.pivot(index="course_organization",
                                                 columns="course_difficulty",
                                                 values="count").fillna(0)

#Add percentage of each difficulty level
difficulty_pivot["Total Courses"] = difficulty_pivot.sum(axis=1)
for difficulty in coursera_df["course_difficulty"].unique():
    difficulty_pivot[difficulty + ", %"] = (difficulty_pivot[difficulty]/difficulty_pivot["Total Courses"]*100).round(2)

discrete_colors_difficullty_percent = {"Mixed, %":"rgb(136,204,238)",
                   "Beginner, %":"rgb(169,169,169)",
                   "Intermediate, %":"rgb(144,103,167)",
                   "Advanced, %":"rgb(0,0,139)"}

#Show distribution
fig_16 = px.box(difficulty_pivot[["Beginner, %", "Intermediate, %", "Mixed, %", "Advanced, %"]],
             title = "Distribution of Course Difficulty by Organization",
             labels={"value":"Course Difficulty %", "course_organization" : "Organization", "course_difficulty":"Course Difficulty"})

fig_16.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     plot_bgcolor="rgb(242, 242, 242)")
fig_16.show()

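#Plot a bar chart displaying % of courses at each difficulty level by organization
#Note: head(10) takes the first 10 rows of the pivot, which is sorted alphabetically by organization
fig_17 = px.bar(difficulty_pivot[["Beginner, %",
                                  "Intermediate, %",
                                  "Mixed, %",
                                  "Advanced, %"]].head(10),
                barmode="group",
                color_discrete_map=discrete_colors_difficullty_percent,
                title="Course Difficulty Share by Organization (First 10 Alphabetically)",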
                labels={"value":"Course Difficulty %",
                        "course_organization" : "Organization",
                        "course_difficulty":"Course Difficulty"},
                template="plotly_white")

fig_17.update_layout(font=dict(size=14),
                     title=dict(x=0.45),
                     width=800,
                     height=550)

fig_17.show()
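
`difficulty_pivot.head(10)` returns the first ten rows of the pivot, which is ordered alphabetically by organization rather than by enrollment. A minimal sketch, on made-up data, of one way to restrict the percentage table to the highest-enrollment organizations. `toy_df` and `top_n` are illustrative names, and `pd.crosstab(..., normalize="index")` is used here in place of the pivot-and-divide steps above:

```python
import pandas as pd

# Toy data for illustration only; the column names mirror coursera_df.
toy_df = pd.DataFrame({
    "course_organization": ["Org A", "Org A", "Org B", "Org B", "Org C"],
    "course_difficulty": ["Beginner", "Advanced", "Beginner", "Beginner", "Mixed"],
    "course_students_enrolled_float": [1_000.0, 2_000.0, 50_000.0, 40_000.0, 30_000.0],
})

# Pick the organizations with the highest total enrollment
top_n = 2
top_orgs = (
    toy_df.groupby("course_organization")["course_students_enrolled_float"]
    .sum()
    .nlargest(top_n)
    .index
)

# Share of each difficulty level (in %) within each selected organization
top_df = toy_df[toy_df["course_organization"].isin(top_orgs)]
difficulty_share = (
    pd.crosstab(top_df["course_organization"], top_df["course_difficulty"], normalize="index")
    .mul(100)
    .round(2)
)
print(difficulty_share)
```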


##### **_5._** Key Takeaways:  
- "Machine Learning" had the highest number of students enrolled (3.2M), whilst "El Abogado del Futuro: Legaltech y la Transformacion Digital del Derecho" had the fewest (1.5k). Enrollment in these two courses is far above and far below the mean (90,522), respectively.
- 6 out of 10 top-rated courses have student enrollment below the mean.  
- Most Coursera courses are rated well, with a mean rating of 4.68. The lowest observed rating was 3.3.   
- The top 10 courses by student enrollment have excellent ratings; only those at the _beginner_ difficulty level have slightly lower ratings, particularly "Career Success".   
- No clear relationship between course rating and enrollment was observed, although courses with higher ratings tended to have more students enrolled.  
- The majority of courses provide a "Course Certificate" (65.32%), whilst "Specialization Certificate" and "Professional Certificate" are provided by only 33.33% and 1.35% of courses, respectively.  
- Whilst the mean and median number of courses provided per organization are 6 and 3, respectively, the top organizations by student enrollment provide between 16 and 59 courses each. Notably, "University of Pennsylvania" provides 59 courses.
- "University of California, Irvine" is the only top organization by student enrollment that provides more courses with specialization certificates than with course certificates; it also provides 1 professional certificate. In comparison, the other top organizations mostly provide courses with course certificates.  
- No meaningful relationship between course difficulty and course rating was observed.  
- Most of the ten organizations charted concentrate on beginner and intermediate difficulty courses. "Autodesk" is the only one whose courses are 50% advanced and 50% intermediate, and "Arizona State University" is the only one that provides beginner, intermediate and mixed courses.

##### **_6._** Actionable insights:  
- Periodically reviewing and updating course content where necessary would help maintain top-performing courses and improve the worst-performing ones. Updates could include personalised learning paths, recognition awards and improved course accessibility.
- Organizations seeking higher student enrollment should consider providing more courses as well as a wider range of course difficulty levels and certificate types.

##### **_7._** Further improvement:  
- Categorize course titles using frequency analysis with *collections* and *langdetect* libraries or machine learning.  
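
As a rough starting point for the frequency analysis mentioned above, a minimal sketch using `collections.Counter` from the standard library. The sample titles are copied as displayed by `coursera_df.head()` (one of them is truncated in that output); the stop-word set and the `tokenize` helper are illustrative assumptions, and language detection is left out:

```python
import re
from collections import Counter

# Titles as displayed by coursera_df.head(); the first one is truncated there.
titles = [
    "A Crash Course in Causality: Inferring Causal...",
    "A Crash Course in Data Science",
    "A Law Student's Toolkit",
    "A Life of Happiness and Fulfillment",
]

# Hypothetical, deliberately tiny stop-word set for this illustration
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}

def tokenize(title):
    """Lower-case a title and return its words, minus stop words."""
    words = re.findall(r"[a-z']+", title.lower())
    return [word for word in words if word not in STOP_WORDS]

# Count how often each remaining word appears across all titles
word_counts = Counter(word for title in titles for word in tokenize(title))
print(word_counts.most_common(5))
```

On the full dataset, `titles` could be replaced with `coursera_df["course_title"]`, and the most frequent words could then seed a set of topic categories.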

**If more data were available, the success of courses could be investigated further by looking at the following:**  
- Course length
- Course lifespan
- Course content
- Feedback provided by students
- Course value - whether the course is paid or free
