# Udemy Courses Dataset

Udemy is one of the most popular e-learning platforms in the world. According to its website, the platform has over 75 000 instructors, 150 000 courses, 250 million enrollments and 33 million minutes of content.

The Udemy Courses dataset contains information about the courses available on Udemy that were published between 2011 and 2017.
The dataset is freely available on Kaggle (https://www.kaggle.com/andrewmvd/udemy-courses).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


data = pd.read_csv('datasets/udemy_courses.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


In [2]:
data.head()

| | course_id | course_title | url | is_paid | price | num_subscribers | num_reviews | num_lectures | level | content_duration | published_timestamp | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1070968 | Ultimate Investment Banking Course | https://www.udemy.com/ultimate-investment-bank... | True | 200 | 2147 | 23 | 51 | All Levels | 1.5 | 2017-01-18T20:58:58Z | Business Finance |
| 1 | 1113822 | Complete GST Course & Certification - Grow You... | https://www.udemy.com/goods-and-services-tax/ | True | 75 | 2792 | 923 | 274 | All Levels | 39.0 | 2017-03-09T16:34:20Z | Business Finance |
| 2 | 1006314 | Financial Modeling for Business Analysts and C... | https://www.udemy.com/financial-modeling-for-b... | True | 45 | 2174 | 74 | 51 | Intermediate Level | 2.5 | 2016-12-19T19:26:30Z | Business Finance |
| 3 | 1210588 | Beginner to Pro - Financial Analysis in Excel ... | https://www.udemy.com/complete-excel-finance-c... | True | 95 | 2451 | 11 | 36 | All Levels | 3.0 | 2017-05-30T20:07:24Z | Business Finance |
| 4 | 1011058 | How To Maximize Your Profits Trading Options | https://www.udemy.com/how-to-maximize-your-pro... | True | 200 | 1276 | 45 | 26 | Intermediate Level | 2.0 | 2016-12-13T14:57:18Z | Business Finance |


In [255]:
data.level.value_counts()

All Levels            1929
Beginner Level        1270
Intermediate Level     421
Expert Level            58
Name: level, dtype: int64

## Data cleaning

In [3]:
data.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64

In [4]:
data.published_timestamp = pd.to_datetime(data.published_timestamp).dt.date.astype('datetime64[ns]')
data.published_timestamp

0      2017-01-18
1      2017-03-09
2      2016-12-19
3      2017-05-30
4      2016-12-13
          ...    
3673   2016-06-14
3674   2017-03-10
3675   2015-12-30
3676   2016-08-11
3677   2014-09-28
Name: published_timestamp, Length: 3678, dtype: datetime64[ns]
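
The conversion above goes through Python `date` objects to drop both the time of day and the UTC marker. A minimal sketch of an equivalent route that stays in vectorised datetime operations, assuming the timestamps are ISO 8601 strings ending in `Z` as in the `data.head()` output; the three-row `raw` series is a stand-in built from those rows, not the full column:

```python
import pandas as pd

# Three timestamps copied from the data.head() output above (stand-in for the full column).
raw = pd.Series(['2017-01-18T20:58:58Z', '2017-03-09T16:34:20Z', '2016-12-19T19:26:30Z'])

# Parse as UTC, drop the timezone information, then truncate each value to midnight.
published = pd.to_datetime(raw, utc=True).dt.tz_localize(None).dt.normalize()
print(published)
```

Both routes should leave a timezone-naive datetime column holding only the publication date, which is what the `.dt.year` and `.dt.month` accessors used later rely on.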

# Exploratory Data Analysis

## Subjects EDA

### 1. What is the distribution of subjects?

In [7]:
data.subject.unique()

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)

In [236]:
# rename_axis + reset_index(name=...) yields the same 'name'/'count' columns on old and new pandas versions
subject = data.subject.value_counts().rename_axis('name').reset_index(name='count').sort_values(by='count')
subject

| | name | count |
|---|---|---|
| 3 | Graphic Design | 603 |
| 2 | Musical Instruments | 680 |
| 1 | Business Finance | 1195 |
| 0 | Web Development | 1200 |


In [242]:
fig = px.bar(subject,
            x='name',
            y='count',
            color='name',
            color_discrete_sequence=px.colors.sequential.Blues_r,
            opacity=0.8,
            title='Value Count of Subject Types'
            )
fig.update_traces(marker_line_color='black',
                  marker_line_width=1)                  
fig.update_layout(showlegend=False, width=650)
fig.show()

In [245]:
fig = px.pie(subject,
             values='count',
             names='name',
             hole=0.5,
             color_discrete_sequence=px.colors.sequential.Blues_r,
             title='Subject Types [%]')
fig.update_layout(width=600)
fig.show()
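
The pie chart shows the shares only graphically. As a rough sketch, the same percentages can be computed directly from the counts; the frame below is rebuilt by hand from the `subject` table above, and `share_pct` is an illustrative column name that does not appear in the notebook:

```python
import pandas as pd

# Counts copied from the `subject` table above.
subject = pd.DataFrame({
    'name': ['Graphic Design', 'Musical Instruments', 'Business Finance', 'Web Development'],
    'count': [603, 680, 1195, 1200],
})

# Share of each subject in the total number of courses, in percent.
subject['share_pct'] = (subject['count'] / subject['count'].sum() * 100).round(1)
print(subject)
```

On the full DataFrame, `data.subject.value_counts(normalize=True)` should give the same shares as fractions.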

### 2. What is the distribution of subjects per year?

In [192]:
# Year and month of publication, used as grouping keys (aligned on the index)
y = data.published_timestamp.dt.year
m = data.published_timestamp.dt.month

# All courses in chronological order
subject_year = data[['published_timestamp']].sort_values(by=['published_timestamp'])
# count_m: running sum of the size of each row's (year, month) group
subject_year['count_m'] = subject_year.groupby([y, m])['published_timestamp'].transform('size')
subject_year.count_m = subject_year.count_m.cumsum()
subject_year['type'] = 'total'

# Same computation per subject, stacked under the 'total' rows
for val in data.subject.unique():
    temp = data[data.subject == val]
    val_df = temp[['published_timestamp']].sort_values(by=['published_timestamp'])
    val_df['count_m'] = val_df.groupby([y, m])['published_timestamp'].transform('size')
    val_df.count_m = val_df.count_m.cumsum()
    val_df['type'] = val
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    subject_year = pd.concat([subject_year, val_df], ignore_index=True)

subject_year

| | published_timestamp | count_m | type |
|---|---|---|---|
| 0 | 2011-07-09 | 1 | total |
| 1 | 2011-09-09 | 2 | total |
| 2 | 2011-11-19 | 4 | total |
| 3 | 2011-11-29 | 6 | total |
| 4 | 2011-12-20 | 7 | total |
| ... | ... | ... | ... |
| 7351 | 2017-06-29 | 37406 | Web Development |
| 7352 | 2017-06-30 | 37447 | Web Development |
| 7353 | 2017-06-30 | 37488 | Web Development |
| 7354 | 2017-07-03 | 37490 | Web Development |
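
In the table above, `count_m` reaches 37490 for Web Development, although that subject has only 1200 courses. Every row carries the size of its whole month, so the running sum adds each month's size once per course in that month. If a plain cumulative course count is what the line plots are meant to show (an assumption about the intent), a minimal sketch on a made-up five-row frame could look like this; `toy`, `total` and `per_subject` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical miniature of the dataset: five courses in two subjects.
toy = pd.DataFrame({
    'published_timestamp': pd.to_datetime(
        ['2011-11-29', '2011-07-09', '2011-11-19', '2011-09-09', '2011-12-20']),
    'subject': ['Web Development', 'Web Development', 'Business Finance',
                'Web Development', 'Business Finance'],
})

# Chronological order with a fresh 0..n-1 index.
toy = toy.sort_values('published_timestamp', ignore_index=True)

# Overall running total: one step per published course.
toy['total'] = range(1, len(toy) + 1)

# Running total within each subject (cumcount starts at 0, hence the +1).
toy['per_subject'] = toy.groupby('subject').cumcount() + 1
print(toy)
```

With this definition the last value of each curve equals the number of courses in that group, so the curves can be checked against `data.subject.value_counts()`.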


In [225]:
fig = px.line(subject_year[subject_year.type == 'total'],
              x='published_timestamp',
              y='count_m',
              title='All Subjects Distribution')
fig.update_layout(width=600)
fig.show()

In [227]:
fig = px.line(subject_year[subject_year.type != 'total'],
              x='published_timestamp',
              y='count_m',
              color='type',
              title='Subject Types Distribution')
fig.update_layout(width=700)
fig.show()

### 3. How many subscribers does each subject have?

In [248]:
subject_subscribers = data.groupby('subject')['num_subscribers'].sum().to_frame('count').reset_index().rename(columns={'subject': 'name'}).sort_values(by='count')
subject_subscribers

| | name | count |
|---|---|---|
| 2 | Musical Instruments | 846689 |
| 1 | Graphic Design | 1063148 |
| 0 | Business Finance | 1868711 |
| 3 | Web Development | 7980572 |


In [252]:
fig = px.bar(subject_subscribers,
            x='name',
            y='count',
            color='name',
            color_discrete_sequence=px.colors.sequential.Blues_r,
            opacity=0.8,
            title='Subscriber Count vs Subject Type'
            )
fig.update_traces(marker_line_color='black',
                  marker_line_width=1)                  
fig.update_layout(showlegend=False, width=650)
fig.show()

In [253]:
fig = px.pie(subject_subscribers,
             values='count',
             names='name',
             hole=0.5,
             color_discrete_sequence=px.colors.sequential.Blues_r,
             title='Subscriber Count vs Subject Type [%]')
fig.update_layout(width=600)
fig.show()
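
Total subscribers depend on how many courses a subject has, and the subjects differ in size (603 to 1200 courses). A rough sketch of a per-course figure, rebuilt by hand from the two tables above rather than from the full data; `per_subject` and `subscribers_per_course` are illustrative names:

```python
import pandas as pd

# Course counts and subscriber totals copied from the `subject` and
# `subject_subscribers` tables above.
per_subject = pd.DataFrame({
    'name': ['Musical Instruments', 'Graphic Design', 'Business Finance', 'Web Development'],
    'courses': [680, 603, 1195, 1200],
    'subscribers': [846689, 1063148, 1868711, 7980572],
})

# Average number of subscribers per course in each subject.
per_subject['subscribers_per_course'] = (
    per_subject['subscribers'] / per_subject['courses']
).round(1)
print(per_subject.sort_values('subscribers_per_course'))
```

On the full DataFrame, `data.groupby('subject')['num_subscribers'].mean()` should give the same averages, since the column has no missing values.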

### 4. What is the mean content duration for each subject, for paid and free courses?

In [267]:
subject_duration = data.groupby(['subject', 'is_paid'])['content_duration'].mean().to_frame('mean_duration').reset_index().rename(columns={'subject': 'subject_name'})
subject_duration

| | subject_name | is_paid | mean_duration |
|---|---|---|---|
| 0 | Business Finance | False | 2.148611 |
| 1 | Business Finance | True | 3.675675 |
| 2 | Graphic Design | False | 1.917619 |
| 3 | Graphic Design | True | 3.683011 |
| 4 | Musical Instruments | False | 1.547101 |
| 5 | Musical Instruments | True | 2.949238 |
| 6 | Web Development | False | 2.562281 |
| 7 | Web Development | True | 5.97279 |


In [272]:
fig = px.bar(subject_duration,
            x='subject_name',
            y='mean_duration',
            color='is_paid',
            barmode='group',
            color_discrete_sequence=px.colors.sequential.Blues_r,
            opacity=0.8,
            title='Mean Duration vs Subject Type'
            )
fig.update_traces(marker_line_color='black',
                  marker_line_width=1)                  
fig.update_layout(width=650)
fig.show()

## Levels EDA

# TODO:
- Levels EDA
- Price EDA
- Titles EDA
- Correlation (?)