# Introduction

#### Udemy Courses dataset:

* Udemy is an online learning platform that offers courses on a wide range of topics, from programming and data analysis to business and personal development. The platform has gained immense popularity in recent years, with millions of learners around the world enrolling in courses to acquire new skills and knowledge. The Udemy Courses dataset, available on Kaggle, provides a comprehensive collection of data on over 13,000 courses offered on the platform, including information on course topics, instructors, pricing, enrollment, and reviews.

* This dataset presents an excellent opportunity for data analysts to explore the trends and patterns in online education and gain insights into the preferences and behaviors of learners on the Udemy platform. By analyzing the Udemy Courses dataset, analysts can uncover the most popular course topics, identify the most successful instructors, analyze pricing strategies, and gain insights into customer satisfaction and sentiment. The insights gained from this dataset can be used to inform business decisions, guide marketing strategies, and improve the overall user experience on the Udemy platform.

# Content
1. [Load the data and get some basic information](#1)
2. [EDA](#2)
   * [Distribution of variables](#3)
   * [Identifying outliers](#4)
   * [Investigating any patterns or relationships between variables](#5)
3. [Popular Course Topics](#6)
4. [Price Analysis](#7)
5. [Instructor Analysis](#8)
6. [Time Series Analysis](#9)
7. [Summary and Conclusion](#10)

<a id = '1'></a>
## Load the data and get some basic information

In [None]:
pip install opendatasets --upgrade --quiet

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns


from collections import Counter 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
data = pd.read_csv('/kaggle/input/udemy-courses/Course_info.csv')

In [None]:
data.head()

In [None]:
data.info()

dtypes: bool(1), float64(8), object(11)

As we can see, the dtype of published_time is object, but it should be datetime.

In [None]:
data.drop(["instructor_url","course_url"], axis = 1,inplace = True)

In [None]:
data.describe()

### Handle missing data

In [None]:
data.isnull().sum()

In [None]:
data[data.instructor_name.isnull()]

I will fill the NaN values with 'Unknown'.

In [None]:
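# Assign the result back: inplace fillna on a selected column is deprecated and may not modify data
data['instructor_name'] = data['instructor_name'].fillna('Unknown')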

### Change the dtype of date

In [None]:
data.published_time = pd.to_datetime(data.published_time)

In [None]:
data.last_update_date = pd.to_datetime(data.last_update_date)

<a id = '2'></a>

## EDA
* Distribution of variables
* Identifying outliers
* Investigating any patterns or relationships between variables

<a id = '3'></a>

#### Distribution of variables

In [None]:
variables = list(data.select_dtypes(include=['float64', "datetime64[ns, UTC]","datetime64[ns]"]).columns)

In [None]:
variables.pop(0)

In [None]:
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(20, 20))
axs = axs.flatten()

for i, col in enumerate(variables):
    sns.kdeplot(data[col], ax=axs[i])
    axs[i].set_title(col)

plt.tight_layout()
plt.show()


In [None]:
data.info()

In [None]:
fig, (ax1, ax2) = plt.subplots(2,1,figsize = (10, 10))


N, bins, patches = ax1.hist(data['content_length_min'], rwidth=0.85, bins=45,
         color='black')
ax1.grid(axis='y', color ='Grey',
        linestyle ='-.', linewidth = 0.1, which='both',
        alpha = 0.6)
ax1.margins(0.01)
patches[0].set_facecolor('#A435EF')
plt.sca(ax1)
plt.yscale('log')
plt.ylabel("Count")


ax2.hist(data[data['content_length_min']<1000]['content_length_min'], rwidth=0.85, bins=100,
         color='#A435EF')
ax2.grid(axis='y', color ='Grey',
        linestyle ='-.', linewidth = 0.1,
        alpha = 0.6)
ax2.margins(0.01)
plt.sca(ax2)
plt.xlabel("Content length (min)", labelpad=10)
plt.ylabel("Count")
plt.xticks(range(0,1001,100))
plt.subplots_adjust(hspace=0.1)
plt.suptitle('Distribution of course content length:',fontsize= 16)
plt.show()

<a id = '4'></a>

#### Identifying outliers

In [None]:
float_columns = list(data.select_dtypes(include=['float64']).columns)

In [None]:
float_columns.pop(0)

In [None]:
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(float_columns):
    # Drop NaNs first: a column containing NaN would otherwise produce an empty box
    axs[i].boxplot(data[col].dropna())
    axs[i].set_title(col)

axs[-1].remove()  # remove the unused last axis

plt.tight_layout()
plt.show()


In [None]:
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(float_columns):
    sns.kdeplot(data[col], ax=axs[i])
    axs[i].set_title(col)

axs[-1].remove() # remove the last axis

plt.tight_layout()
plt.show()
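
The box plots show outliers visually. To count them, one common convention is the 1.5 × IQR rule, the same rule matplotlib uses by default for box plot whiskers. The sketch below is only an illustration: the helper name `iqr_outlier_mask` and the toy values are my own assumptions, not part of the dataset.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (series < lower) | (series > upper)

# Toy stand-in for a column such as content_length_min (values are invented).
demo = pd.Series([30, 45, 60, 75, 90, 120, 5000], name="content_length_min")
mask = iqr_outlier_mask(demo)
print(demo[mask])
```

On the real frame, something like `{col: int(iqr_outlier_mask(data[col]).sum()) for col in float_columns}` would give an outlier count per column. For heavily right-skewed columns this rule can flag a large share of rows, so the counts are a starting point rather than a verdict.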


In [None]:
data.info()

<a id = '5'></a>

### Investigating any patterns or relationships between variables

In [None]:
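# Build the diverging colormap first so the heatmap can actually use it
cmap = sns.diverging_palette(145, 300, s=60, as_cmap=True)
# numeric_only=True skips the object and datetime columns (needed in pandas >= 2.0)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap=cmap)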
plt.title('Correlation Heatmap')
plt.show()


* Is there a relationship between price, rating (avg_rating), and enrollment (num_subscribers)? (The answer is in the Price Analysis section.)



<a id = '6'></a>

## Popular Course Topics

In [None]:
data.info()

In [None]:
topic = data.topic.value_counts()
topic_values = topic.values
topic_names = topic.index

In [None]:
# Share of courses whose topic is Python, among courses with a non-missing topic
python_rate = (data.topic == "Python").sum() / data.topic.notna().sum()
others = 1 - python_rate

In [None]:
python_rate*=100
others*=100

In [None]:
data.topic.value_counts().values.sum()

In [None]:
fig, ((ax1,ax2)) = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6))
sns.barplot(x = topic_values[:10], y = topic_names[:10], color="#A435EF", ax = ax1)
ax1.set_title("Top 10 Topic Counts")
ax1.set_xlabel("Count")
ax1.set_ylabel("Topic Name")

sns.scatterplot(x=[1,2,3], y=[4,5,6], color = "White",ax = ax2)
ax2.text(-1.7,1.26,"Most Popular",fontsize=15)
ax2.text(-1.6,1.13,"TOPIC",color = "#A435EF", fontsize=15)
ax2.text(-1.6,1.07,"is")
ax2.text(-1.5,0.94,"Python", color = "#A435EF", fontsize=15)


ax2.text(-1.7,-0.9,"Least Popular",fontsize=15)
ax2.text(-1.6,-1.02,"TOPIC",color = "#A435EF", fontsize=15)
ax2.text(-1.6,-1.09,"is")
ax2.text(-1.5,-1.2,"Security Communication", color = "#A435EF", fontsize=15)

labels = ['Python', 'Others']
sizes = [python_rate,others]
colors = ['black', '#A435EF']
# plot pie chart
ax2.pie(sizes, labels=labels, shadow = True, colors=colors, autopct='%1.1f%%', startangle=90)

# Hide x and y axes
ax2.set_xticks([])
ax2.set_yticks([])


plt.gca().xaxis.set_visible(False)
plt.gca().yaxis.set_visible(False)
plt.axis('off')

plt.show()

<a id = '7'></a>

## Price Analysis

In [None]:
num_subscribers = data.num_subscribers
price = data.price
rating = data.avg_rating
plt.figure(figsize = (15,10))
sns.scatterplot(x=rating,y=num_subscribers, size = price, alpha = 0.5,sizes=(20, 1100), color = "#A435EF")
plt.show()

The graph looks busy at first, but it shows a visible pattern linking course price, number of subscribers, and average rating.

Bubble size encodes price, so small bubbles are the less expensive courses. In this distribution, the cheaper courses tend to have higher average ratings and more subscribers.
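
A scatter plot with overlapping bubbles is hard to read precisely, so a rank correlation can back up the visual impression. The sketch below uses an invented toy frame with the same three column names. Spearman is my choice here, on the assumption that the real columns are heavily skewed.

```python
import pandas as pd

# Toy frame mimicking the three columns in the scatter plot (values are invented).
toy = pd.DataFrame({
    "price": [0.0, 19.99, 49.99, 99.99, 199.99],
    "avg_rating": [4.6, 4.5, 4.2, 4.0, 3.8],
    "num_subscribers": [9000, 5000, 2500, 800, 300],
})

# Spearman works on ranks, so a few extreme values do not dominate the result.
rank_corr = toy.corr(method="spearman")
print(rank_corr.round(2))
```

On the notebook's frame the equivalent call would be `data[["price", "avg_rating", "num_subscribers"]].corr(method="spearman")`. A coefficient near zero would mean the pattern in the plot is weaker than it looks.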

In [None]:
data.language[data.price == data.price.max()].value_counts()

The maximum price, data.price.max(), is 999.99.

This price surprised me at first, but I had forgotten that the currency in the dataset varies by country. It makes sense that almost all of the most expensive courses are Turkish, because the dollar exchange rate is hovering around 19 TL these days.

In [None]:
# Distinct (language, price) pairs, most expensive first
data[["language","price"]].drop_duplicates().sort_values(by = "price", ascending = False).iloc[:9,:]

<a id = '8'></a>

## Instructor Analysis

In [None]:
instructors = data.instructor_name.value_counts()[:10]
instructor_name = instructors.index
instructor_count = instructors.values

plt.figure(figsize = (12,6))
sns.barplot(x =instructor_count ,y = instructor_name, orient = 'h', color = "#A435EF")
plt.title("Most popular instructors by number of courses published")
plt.show()

In [None]:
instructors = data.instructor_name[data.avg_rating >=4].value_counts()[:10]
instructor_name = instructors.index
instructor_count = instructors.values
plt.figure(figsize = (12,6))
sns.barplot(x = instructor_count, y=instructor_name, orient = 'h', color = "#A435EF")

plt.title("Most popular instructors by number of courses rated 4 or higher")
plt.show()

In [None]:
instructors = data.instructor_name[data.num_subscribers >=data.num_subscribers.mean()].value_counts()[:10]
instructor_name = instructors.index
instructor_count = instructors.values
plt.figure(figsize = (12,6))
sns.barplot(x = instructor_count, y=instructor_name, orient = 'h', color = "#A435EF")

plt.title("Most popular instructors by number of courses with above-average subscribers")
plt.show()

Comparing these three graphs, three instructors appear in all three lists, which makes them the most successful instructors:
* Laurence Svekis
* Bluelime Learning Solutions
* Infinite Skills
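
Reading the overlap off three bar charts by eye is error-prone, so it can be checked with a set intersection. This is a minimal sketch on invented toy data: the helper `top_instructors` is hypothetical, and only the column names are taken from the cells above.

```python
import pandas as pd

def top_instructors(frame: pd.DataFrame, mask: pd.Series, n: int = 10) -> set:
    """Names of the n instructors with the most courses among rows selected by mask."""
    return set(frame.loc[mask, "instructor_name"].value_counts().head(n).index)

# Invented toy data: A and B have 3 courses each, C has 2, D has 1.
toy = pd.DataFrame({
    "instructor_name": ["A", "A", "A", "B", "B", "B", "C", "C", "D"],
    "avg_rating": [4.5, 4.2, 4.1, 3.9, 3.5, 4.8, 4.6, 4.4, 4.9],
    "num_subscribers": [900, 600, 10, 700, 650, 20, 30, 40, 5],
})

everyone = pd.Series(True, index=toy.index)
by_count = top_instructors(toy, everyone, n=2)
by_rating = top_instructors(toy, toy["avg_rating"] >= 4, n=2)
by_subs = top_instructors(
    toy, toy["num_subscribers"] >= toy["num_subscribers"].mean(), n=2
)

# Instructors present in all three top lists.
in_all_three = by_count & by_rating & by_subs
print(sorted(in_all_three))
```

With `n=10` and `data` in place of `toy`, the same three masks reproduce the filters used for the bar charts, and the intersection should match the names listed above if they were read off correctly.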

<a id = '9'></a>

## Time Series Analysis

In [None]:
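# Sort by year: value_counts orders by frequency, which would make the line zig-zag
courses_per_year = data.published_time.dt.year.value_counts().sort_index()
year = courses_per_year.index
count = courses_per_year.values

plt.plot(year,count)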

In recent years, the number of available courses has seen a significant increase, driven by technological advancements and the widespread adoption of online education. The onset of the pandemic in 2020 accelerated this trend even further. With more time available to dedicate to content creation, experts created a significant number of new courses between 2020 and 2021. As remote learning continues to grow in popularity, it is likely that this upward trend will continue into the foreseeable future.
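
To put a number on how fast the count grows, the yearly totals can be turned into year-over-year percentage changes. The sketch below uses invented publication dates in place of `data.published_time`, so the figures it prints say nothing about the real dataset.

```python
import pandas as pd

# Invented publication dates standing in for data.published_time.
published = pd.to_datetime(pd.Series([
    "2018-03-01", "2019-06-15", "2019-11-02",
    "2020-01-20", "2020-05-05", "2020-09-30", "2020-12-12",
]))

# Courses per calendar year, in chronological order.
per_year = published.dt.year.value_counts().sort_index()

# Percentage change relative to the previous year (first year has no predecessor).
growth_pct = per_year.pct_change().mul(100).round(1)

summary = pd.DataFrame({"courses": per_year, "growth_pct": growth_pct})
print(summary)
```

If the dataset was collected partway through a year, the final year is incomplete and its growth figure will look artificially low, so it is worth checking `data.published_time.max()` before interpreting the last row.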

<a id = '10'></a>

## Summary and Conclusion 

Insights: 

1. Most course prices fall between 0 and 200.

2. The most popular topic is Python; the least popular topic is Security Communication.

3. There is a clear relationship between course price, number of subscribers, and average rating. Courses with a small bubble size are less expensive, and courses with lower prices have higher average ratings and more subscribers according to the distribution.

4. The maximum price in the dataset is 999.99, but this likely reflects currency differences: the currency varies by country, and Turkish courses are the most expensive because of the exchange rate (around 19 TL to the dollar).

5. The most successful instructors in the dataset, appearing in all three lists (number of courses, rating, and subscribers), are Laurence Svekis, Bluelime Learning Solutions, and Infinite Skills.

6. The number of available courses has significantly increased in recent years due to technological advancements and the widespread adoption of online education. The pandemic in 2020 accelerated this trend further, with more time available for content creation. This trend is likely to continue as remote learning becomes more popular.