# COGS 108 - Data Checkpoint

# Names

- Luming Jin
- Xiaoke Li
- Haowei Li
- Yunfei Yang
- Tengyue Wang

<a id='research_question'></a>
# Research Question

**How well can we predict student satisfaction with an online course using key metrics that represent course design, such as course difficulty, course price, and course duration?**

# Dataset(s)


- Dataset Name: Udemy Courses - Top 5000 Course 2022
- Link to the dataset: https://www.kaggle.com/datasets/mahmoudahmed6/udemy-top-5k-course-2022
- Number of observations: 5027

Dataset description:
The dataset we used for this project is from Kaggle. It is titled "Udemy Courses - Top 5000 Course 2022", was scraped on 9/11/2022, and covers the top 5000 courses on Udemy. The data is updated every 3 to 6 months.

The dataset includes the following columns:

- course_name: This is the name of the course.
- instructor: This denotes the instructor of the course.
- course_url: This is the URL of the course.
- course_image: This is the image of the course.
- course_description: This is the course subtitle and contains information about the course content.
- reviews_avg: This is the average review score of the course.
- reviews_count: This represents the number of reviews for each course.
- course_duration: This is the duration of the course in hours.
- lectures_count: This is the number of lectures in each course.
- level: This is the course level on Udemy.
- price_after_discount: This is the course price in Egyptian pounds (EGP) after discount.
- main_price: This is the original course price.
- course_flag: This is the course flag, such as bestseller, hot, or new.
- students_count: This is the number of students in each course.

The dataset provides a rich source of information for examining trends and patterns in online learning, especially in the field of development. With this data, we aim to gain insights into aspects such as the popularity of courses, the relationship between price and popularity, and the impact of course duration on student numbers. These insights could be valuable for educators, online learning platforms, and learners, helping them make informed decisions about offering or taking online courses.

# Setup

In [2]:
%matplotlib inline
import os
import pickle
import warnings
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import date, datetime, timedelta
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA  # current import path; the old arima_model module is deprecated
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
warnings.filterwarnings('ignore')
plt.style.use("fivethirtyeight")

In [3]:
online_course=pd.read_csv("data.csv")

In [4]:
online_course.head()

Unnamed: 0,course_name,instructor,course url,course image,course description,reviews_avg,reviews_count,course_duration,lectures_count,level,price_after_discount,main_price,course_flag,students_count,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,2022 Complete Python Bootcamp From Zero to Her...,Jose Portilla,https://www.udemy.com/course/complete-python-b...,https://img-b.udemycdn.com/course/240x135/5678...,Learn Python like a Professional Start from t...,Rating: 4.6 out of 5,440383 reviews,22 total hours,155 lectures,All Levels,Current price: E£319.99,"Original price: E£1,399.99",,"1,629,692 students",,,,
1,The Web Developer Bootcamp 2022,Colt Steele,https://www.udemy.com/course/the-web-developer...,https://img-b.udemycdn.com/course/240x135/6252...,COMPLETELY REDONE - The only course you need t...,Rating: 4.7 out of 5,248508 reviews,64 total hours,615 lectures,All Levels,Current price: E£269.99,"Original price: E£1,399.99",,"830,559 students",,,,
2,The Complete 2022 Web Development Bootcamp,Dr. Angela Yu,https://www.udemy.com/course/the-complete-web-...,https://img-b.udemycdn.com/course/240x135/1565...,Become a Full-Stack Web Developer with just ON...,Rating: 4.7 out of 5,234837 reviews,65.5 total hours,490 lectures,All Levels,Current price: E£349.99,"Original price: E£1,699.99",Bestseller,"794,897 students",,,,
3,Angular - The Complete Guide (2023 Edition),Maximilian Schwarzmüller,https://www.udemy.com/course/the-complete-guid...,https://img-b.udemycdn.com/course/240x135/7561...,"Master Angular 14 (formerly ""Angular 2"") and b...",Rating: 4.6 out of 5,174576 reviews,34.5 total hours,472 lectures,All Levels,Current price: E£319.99,"Original price: E£1,599.99",Bestseller,"634,196 students",,,,
4,Java Programming Masterclass covering Java 11 ...,"Tim Buchalka, Tim Buchalka's Learn Programming...",https://www.udemy.com/course/java-the-complete...,https://img-b.udemycdn.com/course/240x135/5336...,Learn Java In This Course And Become a Compute...,Rating: 4.5 out of 5,171838 reviews,80.5 total hours,401 lectures,All Levels,Current price: E£349.99,Original price: E£849.99,Bestseller,"727,934 students",,,,


# Data Cleaning

1. Get rid of columns with all NaN values

2. Drop unnecessary columns: course image, course_flag

3. Drop rows with NaN for main_price or price_after_discount

4. Convert strings to float/int for the following columns to prepare for further analysis:
   price_after_discount, main_price, students_count, reviews_avg, reviews_count, course_duration, lectures_count.

5. Add one more column for the discount as a fraction of the original price

**1. Get rid of columns with all NaN values**

We notice that four columns contain only NaN values and have no meaningful names. They do not contain information that can be used for further interpretation or analysis, so we drop them from the original dataset.


In [5]:
# drop the four unnamed, all-NaN columns in one call
online_course = online_course.drop(['Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17'], axis=1)
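
The same step can also be done without naming each column. The following is a minimal sketch on a toy frame (`toy` is hypothetical and only stands in for `online_course`), assuming the columns to remove are exactly the ones in which every value is NaN:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) standing in for online_course
toy = pd.DataFrame({
    "course_name": ["A", "B"],
    "Unnamed: 14": [np.nan, np.nan],
    "Unnamed: 15": [np.nan, np.nan],
})

# Drop every column whose values are all NaN
toy_clean = toy.dropna(axis=1, how="all")
print(list(toy_clean.columns))
```

This keeps any column that has at least one non-missing value, so partially filled columns such as course_flag are not affected.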

**2. Drop unnecessary columns: course image, course_flag**

The course image link and the course flag are not needed for our analysis, so we use the drop method to remove these two columns.

In [6]:

online_course.drop(['course image', 'course_flag'], axis=1, inplace=True)

**3. Drop rows with NaN for main_price or price_after_discount**

We find that there are some NaN values in main_price and price_after_discount. Compared to the total number of entries, the number of NaN values in each column is small, as we can see below. We assume these were probably caused by loss of information during data collection. Therefore, we drop the rows with NaN values to avoid inaccurate analysis.

In [7]:
online_course['main_price'].isna().sum()

228

In [8]:
online_course['price_after_discount'].isna().sum()

10
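
To put these counts in proportion, here is a small worked calculation. It assumes the 5027 observations listed in the dataset summary is the row count at this point in the notebook:

```python
# Counts reported above: 5027 observations, 228 and 10 missing values
n_rows = 5027
missing = {"main_price": 228, "price_after_discount": 10}

# Share of rows affected by missing values in each column
missing_share = {col: n / n_rows for col, n in missing.items()}
for col, share in missing_share.items():
    print(f"{col}: {share:.2%}")
```

Under that assumption, roughly 4.5% of rows are missing main_price and about 0.2% are missing price_after_discount.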

In [9]:

# keep only rows where both price columns are present
online_course=online_course.dropna(subset=['main_price', 'price_after_discount'])

In [10]:
online_course['price_after_discount'].value_counts()

Current price: E£269.99    2141
Current price: E£229.99    1284
Current price: E£199.99     838
Current price: E£319.99     491
Current price: E£349.99      42
40 lectures                   1
All Levels                    1
3449 reviews                  1
Name: price_after_discount, dtype: int64

Since "All Levels", "3449 reviews" and "40 lectures" are not valid entries for price_after_discount, and there are only 3 such entries in total, we suspect that they were caused by data entry mistakes. Therefore, we drop these three rows.

In [11]:
# remove the three rows whose price_after_discount holds a misplaced value
invalid_prices = ['All Levels', '3449 reviews', '40 lectures']
online_course=online_course[~online_course['price_after_discount'].isin(invalid_prices)]
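
An alternative to listing the invalid strings by hand is to keep only the entries that match the expected format. This is a sketch on a toy column (`prices` is hypothetical), and the regular expression is an assumption based on the values shown in the output above:

```python
import pandas as pd

# Toy column (hypothetical) mixing valid prices with misplaced values
prices = pd.Series([
    "Current price: E£269.99",
    "40 lectures",
    "All Levels",
    "Current price: E£319.99",
])

# True where the whole entry has the "Current price: E£<number>" shape
is_valid = prices.str.match(r"^Current price: E£[\d,]+(?:\.\d+)?$")
valid_prices = prices[is_valid]
print(valid_prices.tolist())
```

A pattern check of this kind would also catch malformed entries that do not appear in the value counts we happened to inspect.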

After checking main_price with the code below, we can see that the main_price column has no invalid entries.

In [12]:
online_course['main_price'].value_counts()

Original price: E£229.99      838
Original price: E£1,299.99    532
Original price: E£719.99      506
Original price: E£1,199.99    451
Original price: E£1,399.99    394
Original price: E£479.99      270
Original price: E£319.99      198
Original price: E£849.99      172
Original price: E£269.99      142
Original price: E£419.99      123
Original price: E£1,599.99    115
Original price: E£679.99       92
Original price: E£779.99       84
Original price: E£349.99       84
Original price: E£529.99       83
Original price: E£749.99       79
Original price: E£649.99       69
Original price: E£599.99       66
Original price: E£799.99       66
Original price: E£619.99       63
Original price: E£819.99       61
Original price: E£629.99       49
Original price: E£579.99       45
Original price: E£729.99       43
Original price: E£449.99       42
Original price: E£999.99       41
Original price: E£1,699.99     31
Original price: E£519.99       31
Original price: E£549.99       26
Name: main_pri

**4. Convert strings to float/int for the following columns to prepare for further analysis: price_after_discount, main_price, students_count, reviews_avg, reviews_count, course_duration, lectures_count.**

We need to extract the numeric values from the strings to allow further analysis.

Convert entries in price_after_discount to float.

In each case, the text around the number is removed and the numeric value is stored back in the original column.

In [13]:
online_course['price_after_discount']=online_course['price_after_discount'].str.replace('Current price: E£', '').astype(float)

Convert entries in main_price to float

In [14]:

online_course['main_price'] = online_course['main_price'].str.replace('Original price: E£', '')  # remove prefix
online_course['main_price'] = online_course['main_price'].str.replace(',', '')  # remove thousands separator
online_course['main_price'] = online_course['main_price'].astype(float)
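
Both price columns follow the same pattern, so the conversion could be wrapped in one helper. The sketch below is illustrative only: `parse_price` is a hypothetical name, and it assumes every entry contains a number right after `E£`:

```python
import pandas as pd

def parse_price(column: pd.Series) -> pd.Series:
    """Extract the numeric part of strings like 'Original price: E£1,399.99'."""
    # Capture the digits (with optional thousands separators) that follow E£
    number = column.str.extract(r"E£([\d,]+(?:\.\d+)?)", expand=False)
    # Remove the thousands separator before converting to float
    return number.str.replace(",", "", regex=False).astype(float)

raw = pd.Series(["Current price: E£319.99", "Original price: E£1,399.99"])
parsed = parse_price(raw)
print(parsed.tolist())
```

Entries that do not match the pattern come back as NaN instead of raising an error, which makes leftover invalid rows easy to spot.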

Convert entries in lectures_count to float

In [15]:
online_course['lectures_count'] = online_course['lectures_count'].str.replace('lectures', '').astype(float)

Convert entries in course_duration to float/int

In [16]:
# separate the number from the unit of time; strip the whitespace left by the split
online_course['course_duration_unit']=online_course['course_duration'].str.split('total').apply(lambda x: x[1]).str.strip()

In [17]:
# get the numeric value for course_duration and convert it to float
online_course['course_duration']=online_course['course_duration'].str.split('total').apply(lambda x: x[0]).str.strip().astype(float)

Some of the units are mins. We want to make sure all values are measured in hours.

In [18]:
# convert durations given in minutes to hours
mask = online_course['course_duration_unit'] == 'mins'
online_course.loc[mask, 'course_duration'] /= 60
# we will then drop the column for course_duration_unit
online_course.drop(['course_duration_unit'], axis=1, inplace=True)
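
Splitting on `'total'` leaves whitespace around both pieces, which matters when the unit is compared to a fixed string. The sketch below walks through the conversion on toy values (`raw_duration` is hypothetical; the `"1 total hour"` entry is an assumed singular form, not taken from the data):

```python
import pandas as pd

# Toy values (hypothetical) in the raw "<number> total <unit>" format
raw_duration = pd.Series(["22 total hours", "45 total mins", "1 total hour"])

# Split into the numeric value and the unit, stripping stray whitespace
parts = raw_duration.str.split("total", expand=True)
value = parts[0].str.strip().astype(float)
unit = parts[1].str.strip()

# Convert minute entries to hours; hour entries stay unchanged
hours = value.where(~unit.str.startswith("min"), value / 60)
print(hours.tolist())
```

Matching on the prefix `min` instead of the exact string `mins` is a design choice in this sketch so that a singular unit would still be recognized.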

Convert entries in reviews_count and students_count to float/int

In [19]:
online_course['reviews_count'] = online_course['reviews_count'].str.replace('reviews', '').astype(float)

y = online_course['students_count'].str.replace(',', '')
online_course['students_count'] = y.str.replace('students', '').astype(float)

Convert entries in reviews_avg to float/int

To get the average rating, we remove the surrounding text ("Rating: " and "out of 5") from each entry in reviews_avg, which leaves only the number. We then convert that value to float so that it can be used in the following analyses.

Similar to price_after_discount, some of the entries in reviews_avg have a different format from the rest of the data and don't provide useful information for our analysis. Therefore, we drop these rows.

In [20]:
online_course=online_course[online_course['reviews_avg']!='399.99"']
online_course=online_course[online_course['reviews_avg']!='https://img-b.udemycdn.com/course/240x135/368679_cd44_3.jpg']
online_course=online_course[online_course['reviews_avg']!='Beginner guide to Git, Github and Github Action. Learn to use git commands and create Github actions for DevOps CI CD']
x = online_course['reviews_avg'].str.replace('out of 5', '')
online_course['reviews_avg'] = x.str.replace('Rating: ', '').astype(float)
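
Filtering and conversion can also be combined by extracting the rating with a pattern, so that malformed entries turn into NaN and can be dropped together. This is a sketch on a toy column (`raw_rating` is hypothetical), assuming valid entries always look like `Rating: <number> out of 5`:

```python
import pandas as pd

# Toy column (hypothetical) with two valid ratings and one misplaced value
raw_rating = pd.Series(["Rating: 4.6 out of 5", '399.99"', "Rating: 4.7 out of 5"])

# Entries that do not match the pattern become NaN
rating = raw_rating.str.extract(
    r"Rating: (\d+(?:\.\d+)?) out of 5", expand=False
).astype(float)

# Drop the rows that could not be parsed
rating = rating.dropna()
print(rating.tolist())
```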

**5. Add a column that shows the discount each course offers, as a fraction of the original price**

In [21]:
online_course['price_discount']=(online_course['main_price']-\
                                 online_course['price_after_discount'])/online_course['main_price']
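
As a worked example of this formula, the prices of the first course in the dataset (E£1,399.99 original, E£319.99 after discount) give:

```python
main_price = 1399.99            # original price of the first course (EGP)
price_after_discount = 319.99   # discounted price of the same course (EGP)

# Discount as a fraction of the original price
price_discount = (main_price - price_after_discount) / main_price
print(round(price_discount, 6))
```

The result is about 0.7714, meaning the course is sold at roughly 77% off its original price.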

Below are the first rows of the cleaned data.

In [22]:
online_course.head()

Unnamed: 0,course_name,instructor,course url,course description,reviews_avg,reviews_count,course_duration,lectures_count,level,price_after_discount,main_price,students_count,price_discount
0,2022 Complete Python Bootcamp From Zero to Her...,Jose Portilla,https://www.udemy.com/course/complete-python-b...,Learn Python like a Professional Start from t...,4.6,440383.0,22.0,155.0,All Levels,319.99,1399.99,1629692.0,0.771434
1,The Web Developer Bootcamp 2022,Colt Steele,https://www.udemy.com/course/the-web-developer...,COMPLETELY REDONE - The only course you need t...,4.7,248508.0,64.0,615.0,All Levels,269.99,1399.99,830559.0,0.807149
2,The Complete 2022 Web Development Bootcamp,Dr. Angela Yu,https://www.udemy.com/course/the-complete-web-...,Become a Full-Stack Web Developer with just ON...,4.7,234837.0,65.5,490.0,All Levels,349.99,1699.99,794897.0,0.794122
3,Angular - The Complete Guide (2023 Edition),Maximilian Schwarzmüller,https://www.udemy.com/course/the-complete-guid...,"Master Angular 14 (formerly ""Angular 2"") and b...",4.6,174576.0,34.5,472.0,All Levels,319.99,1599.99,634196.0,0.800005
4,Java Programming Masterclass covering Java 11 ...,"Tim Buchalka, Tim Buchalka's Learn Programming...",https://www.udemy.com/course/java-the-complete...,Learn Java In This Course And Become a Compute...,4.5,171838.0,80.5,401.0,All Levels,349.99,849.99,727934.0,0.588242
