## -------------------------------- *Udemy_Course (Online Course Business)* --------------------------------

------------------------------------------------------------------------------------------------------------------------------------------------------

# (1-2) Definition, Problems, and Goals

------------------------------------------------------------------------------------------------------------------------------------------------------

### Definition

This machine learning project builds a model that predicts whether a course about to be opened by an educational institution (in this case, a MOOC platform) will succeed. The company or institution can use the model as one consideration when deciding whether to *release* a new course.

### Problems & Goals

2.1 Problems
- The quality of education declined when technology-based distance learning was introduced (mediaindonesia.com)
- Participation at tutoring institutions fell because of Covid-19 (sonora.id)
- Finding a job became harder because of the Covid-19 pandemic (kompasiana.com)
- It is difficult to know which courses are most needed by people who want to improve their skills (Briyando, Boby. 2020)

2.2 Goals
- Build a model that predicts whether a course to be opened by a MOOC will succeed in attracting subscribers or participants
- Identify which variables influence whether a course succeeds in attracting subscribers or participants

2.3 Limitations
- The model can be used by any company built on an online learning platform, a MOOC, a tutoring service (bimbel), or a similar business.
- The model can only predict the percentage likelihood that a course to be opened will succeed or not.


# Import Package

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import joblib
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns=999
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated

%matplotlib inline

------------------------------------------------------------------------------------------------------------------------------------------------------

# (3) Import Data

------------------------------------------------------------------------------------------------------------------------------------------------------

In [2]:
df = pd.read_csv('udemy.csv', parse_dates=['published_timestamp'])
df1 = df.copy()
df1['content_duration'] = round(df1['content_duration'],2)

In [3]:
df1.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-banking-course/,True,200,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance
1,1113822,Complete GST Course & Certification - Grow Your CA Practice,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance
2,1006314,Financial Modeling for Business Analysts and Consultants,https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel 2017,https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/,True,95,2451,11,36,All Levels,3.0,2017-05-30 20:07:24+00:00,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-profits-trading-options/,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13 14:57:18+00:00,Business Finance


------------------------------------------------------------------------------------------------------------------------------------------------------

# (7) Feature Engineering & Feature Selection

In [4]:
publish_dt = df1['published_timestamp'].dt

In [5]:
df1['year_p'] = publish_dt.year
df1['month_p'] = publish_dt.month
df1['date_p'] = publish_dt.day

In [6]:
df1.drop(columns=['published_timestamp'], inplace=True)
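
As a minimal sketch of cells 4 to 6, the snippet below applies the same `.dt` accessor to a two-row toy frame. The two timestamps are copied from the `df1.head()` output above; the name `toy` is illustrative only and is not part of the notebook.

```python
import pandas as pd

# Toy frame; the two timestamps are taken from the head() output above.
toy = pd.DataFrame({
    "published_timestamp": pd.to_datetime(
        ["2017-01-18 20:58:58+00:00", "2016-12-19 19:26:30+00:00"]
    )
})

# The .dt accessor exposes the calendar parts of a datetime column.
publish_dt = toy["published_timestamp"].dt
toy["year_p"] = publish_dt.year
toy["month_p"] = publish_dt.month
toy["date_p"] = publish_dt.day  # day of the month, despite the column name

# Drop the raw timestamp once its parts have been extracted.
toy = toy.drop(columns=["published_timestamp"])
print(toy)
```

Note that `date_p` holds the day of the month, not a full date.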

In [7]:
def success(x):
    # Label a course as successful (1) if it has more than 1500 subscribers, else 0.
    if x > 1500:
        return 1
    return 0
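
The labeling rule can also be written without a row-wise function. The sketch below is an equivalent vectorized form, assuming the same 1500-subscriber threshold; `SUCCESS_THRESHOLD` and `label_success` are hypothetical names, not part of the notebook.

```python
import pandas as pd

SUCCESS_THRESHOLD = 1500  # same cut-off as success() above

def label_success(num_subscribers: pd.Series) -> pd.Series:
    """Return 1 where a course has more than SUCCESS_THRESHOLD subscribers, else 0."""
    return (num_subscribers > SUCCESS_THRESHOLD).astype(int)

# Three subscriber counts from the head() output, plus the boundary value 1500.
subs = pd.Series([2147, 2792, 1276, 1500])
labels = label_success(subs)
print(labels.tolist())  # [1, 1, 0, 0]
```

A course with exactly 1500 subscribers is labeled 0, matching the `<=` branch of the original function.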

In [8]:
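# Keep only courses with a positive duration and at least one lecture.
# .copy() avoids SettingWithCopyWarning on the in-place edits below.
df_b = df1[(df1.content_duration>0) & (df1.num_lectures>0)].copy()
# Drop selected rows by index label.
df_b.drop([1473,1100,2561,787,894,788], axis=0, inplace=True)
df_b.set_index('course_id',inplace=True)
df_b.drop(columns=['url', 'course_title'], inplace=True)
# Binary target: 1 if the course has more than 1500 subscribers.
df_b['is_success'] = df_b['num_subscribers'].apply(success)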
df_b.head()

course_id,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,subject,year_p,month_p,date_p,is_success
1070968,True,200,2147,23,51,All Levels,1.5,Business Finance,2017,1,18,1
1113822,True,75,2792,923,274,All Levels,39.0,Business Finance,2017,3,9,1
1006314,True,45,2174,74,51,Intermediate Level,2.5,Business Finance,2016,12,19,1
1210588,True,95,2451,11,36,All Levels,3.0,Business Finance,2017,5,30,1
1011058,True,200,1276,45,26,Intermediate Level,2.0,Business Finance,2016,12,13,0


In [9]:
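# Same filtering as df_b, but with fewer features kept.
# .copy() avoids SettingWithCopyWarning on the in-place edits below.
dfix = df1[(df1.content_duration>0) & (df1.num_lectures>0)].copy()
dfix.drop([1473,1100,2561,787,894,788], axis=0, inplace=True)
dfix.set_index('course_id',inplace=True)
# Also drop lecture count, duration, and the month/day parts.
dfix.drop(columns=['url', 'course_title','num_lectures','content_duration','month_p','date_p'], inplace=True)
dfix['is_success'] = dfix['num_subscribers'].apply(success)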
dfix.tail()

course_id,is_paid,price,num_subscribers,num_reviews,level,subject,year_p,is_success
775618,True,100,1040,14,All Levels,Web Development,2016,0
1088178,True,25,306,3,Beginner Level,Web Development,2017,0
635248,True,40,513,169,All Levels,Web Development,2015,0
905096,True,50,300,31,All Levels,Web Development,2016,0
297602,True,45,901,36,Beginner Level,Web Development,2014,0
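
Cells 8 and 9 apply the same filtering and labeling steps and differ only in which columns they drop. A hedged sketch of how that shared logic could be factored out follows. The helper `prepare` and its parameters are hypothetical, and the toy frame only stands in for `df1`; its values are illustrative, not taken from `udemy.csv`.

```python
import pandas as pd

def prepare(df, drop_rows=(), drop_cols=()):
    """Filter, index, and label a course frame (hypothetical helper)."""
    # Keep only courses with a positive duration and at least one lecture.
    out = df[(df.content_duration > 0) & (df.num_lectures > 0)].copy()
    # Remove hand-picked rows by index label; skip labels already filtered out.
    out = out.drop(index=list(drop_rows), errors="ignore")
    out = out.set_index("course_id")
    out = out.drop(columns=list(drop_cols))
    # Same rule as success(): more than 1500 subscribers counts as a success.
    out["is_success"] = (out["num_subscribers"] > 1500).astype(int)
    return out

# Toy stand-in for df1 (illustrative values only).
toy = pd.DataFrame({
    "course_id": [11, 12, 13],
    "url": ["u1", "u2", "u3"],
    "num_subscribers": [2000, 900, 5000],
    "num_lectures": [10, 5, 0],
    "content_duration": [1.5, 2.0, 3.0],
})

wide = prepare(toy, drop_cols=["url"])
narrow = prepare(toy, drop_cols=["url", "num_lectures", "content_duration"])
print(wide)
print(narrow)
```

Unlike the notebook cells, this sketch passes `errors="ignore"` when dropping rows, so a missing index label does not raise an error. Whether that is desirable depends on how strictly the row list should be checked.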


In [10]:
df_b['level_enc'] = df_b.level.map({
    "All Levels":0,
    "Beginner Level":1,
    "Intermediate Level":2,
    "Expert Level":3,
})
df_b = pd.get_dummies(df_b, columns=['is_paid','subject'], prefix_sep='_')
df_b.head()

course_id,price,num_subscribers,num_reviews,num_lectures,level,content_duration,year_p,month_p,date_p,is_success,level_enc,is_paid_False,is_paid_True,subject_Business Finance,subject_Graphic Design,subject_Musical Instruments,subject_Web Development
1070968,200,2147,23,51,All Levels,1.5,2017,1,18,1,0,0,1,1,0,0,0
1113822,75,2792,923,274,All Levels,39.0,2017,3,9,1,0,0,1,1,0,0,0
1006314,45,2174,74,51,Intermediate Level,2.5,2016,12,19,1,2,0,1,1,0,0,0
1210588,95,2451,11,36,All Levels,3.0,2017,5,30,1,0,0,1,1,0,0,0
1011058,200,1276,45,26,Intermediate Level,2.0,2016,12,13,0,2,0,1,1,0,0,0
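
Cell 10 combines two encodings: an integer mapping for `level` and one-hot columns for `is_paid` and `subject`. The snippet below is a small sketch of both on a toy frame; the toy values and the names `toy`, `level_codes`, and `encoded` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same three categorical columns (illustrative values).
toy = pd.DataFrame({
    "level": ["All Levels", "Expert Level", "Beginner Level"],
    "is_paid": [True, False, True],
    "subject": ["Business Finance", "Web Development", "Graphic Design"],
})

# Ordinal encoding: each level maps to a fixed integer code.
level_codes = {
    "All Levels": 0,
    "Beginner Level": 1,
    "Intermediate Level": 2,
    "Expert Level": 3,
}
toy["level_enc"] = toy["level"].map(level_codes)

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(toy, columns=["is_paid", "subject"], prefix_sep="_")
print(encoded.columns.tolist())
```

Two caveats apply. Any `level` value missing from the mapping becomes `NaN`, so it is worth checking `level_enc` for nulls after mapping. The indicator columns are `uint8` in the output above, while pandas 2.0 and later return `bool` by default.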


In [11]:
dfix['level_enc'] = dfix.level.map({
    "All Levels":0,
    "Beginner Level":1,
    "Intermediate Level":2,
    "Expert Level":3,
})
dfix = pd.get_dummies(dfix, columns=['is_paid','subject'], prefix_sep='_')
dfix.head()

course_id,price,num_subscribers,num_reviews,level,year_p,is_success,level_enc,is_paid_False,is_paid_True,subject_Business Finance,subject_Graphic Design,subject_Musical Instruments,subject_Web Development
1070968,200,2147,23,All Levels,2017,1,0,0,1,1,0,0,0
1113822,75,2792,923,All Levels,2017,1,0,0,1,1,0,0,0
1006314,45,2174,74,Intermediate Level,2016,1,2,0,1,1,0,0,0
1210588,95,2451,11,All Levels,2017,1,0,0,1,1,0,0,0
1011058,200,1276,45,Intermediate Level,2016,0,2,0,1,1,0,0,0


In [12]:
df_b.drop(columns=['level','num_subscribers','num_reviews'], inplace=True)
df_b.rename(columns={
    "subject_Business Finance":"business_subject",
    "subject_Graphic Design":"graphic_subject",
    "subject_Musical Instruments":"music_subject",
    "subject_Web Development":"webdev_subject"
}, inplace=True)
df_b.head()

course_id,price,num_lectures,content_duration,year_p,month_p,date_p,is_success,level_enc,is_paid_False,is_paid_True,business_subject,graphic_subject,music_subject,webdev_subject
1070968,200,51,1.5,2017,1,18,1,0,0,1,1,0,0,0
1113822,75,274,39.0,2017,3,9,1,0,0,1,1,0,0,0
1006314,45,51,2.5,2016,12,19,1,2,0,1,1,0,0,0
1210588,95,36,3.0,2017,5,30,1,0,0,1,1,0,0,0
1011058,200,26,2.0,2016,12,13,0,2,0,1,1,0,0,0


In [13]:
dfix.drop(columns=['level','price','num_subscribers','num_reviews'], inplace=True)
dfix.rename(columns={
    "subject_Business Finance":"business_subject",
    "subject_Graphic Design":"graphic_subject",
    "subject_Musical Instruments":"music_subject",
    "subject_Web Development":"webdev_subject"
}, inplace=True)
dfix.head()

course_id,year_p,is_success,level_enc,is_paid_False,is_paid_True,business_subject,graphic_subject,music_subject,webdev_subject
1070968,2017,1,0,0,1,1,0,0,0
1113822,2017,1,0,0,1,1,0,0,0
1006314,2016,1,2,0,1,1,0,0,0
1210588,2017,1,0,0,1,1,0,0,0
1011058,2016,0,2,0,1,1,0,0,0


In [14]:
df_b.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671 entries, 1070968 to 297602
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             3671 non-null   int64  
 1   num_lectures      3671 non-null   int64  
 2   content_duration  3671 non-null   float64
 3   year_p            3671 non-null   int64  
 4   month_p           3671 non-null   int64  
 5   date_p            3671 non-null   int64  
 6   is_success        3671 non-null   int64  
 7   level_enc         3671 non-null   int64  
 8   is_paid_False     3671 non-null   uint8  
 9   is_paid_True      3671 non-null   uint8  
 10  business_subject  3671 non-null   uint8  
 11  graphic_subject   3671 non-null   uint8  
 12  music_subject     3671 non-null   uint8  
 13  webdev_subject    3671 non-null   uint8  
dtypes: float64(1), int64(7), uint8(6)
memory usage: 279.6 KB


In [15]:
dfix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671 entries, 1070968 to 297602
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year_p            3671 non-null   int64
 1   is_success        3671 non-null   int64
 2   level_enc         3671 non-null   int64
 3   is_paid_False     3671 non-null   uint8
 4   is_paid_True      3671 non-null   uint8
 5   business_subject  3671 non-null   uint8
 6   graphic_subject   3671 non-null   uint8
 7   music_subject     3671 non-null   uint8
 8   webdev_subject    3671 non-null   uint8
dtypes: int64(3), uint8(6)
memory usage: 136.2 KB


# (8) Exporting Clean File

In [16]:
# dfix.to_csv('dfix.csv')
# df_b.to_csv('df_b.csv')

# (9) Handling Imbalanced Data

In [17]:
dfix.is_success.value_counts(normalize=True)

0    0.628167
1    0.371833
Name: is_success, dtype: float64

- The classes are reasonably balanced (about 63% unsuccessful vs. 37% successful), so no imbalance handling is needed.
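
One hedged way to quantify this judgment is the ratio of the majority class to the minority class. The sketch below computes it on toy labels with roughly the same split; `imbalance_ratio` is a hypothetical helper, and no particular cut-off for "imbalanced" is implied by the notebook.

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of the most frequent class count to the least frequent one."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Toy labels with roughly the same 63/37 split as the output above.
toy_labels = pd.Series([0] * 63 + [1] * 37)
shares = toy_labels.value_counts(normalize=True)
ratio = imbalance_ratio(toy_labels)
print(shares.to_dict(), round(ratio, 2))
```

With the actual shares shown above (0.628167 and 0.371833), the ratio is about 1.69.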