
Coursera

The goal of this notebook is to scrape course information from [Coursera](https://www.coursera.org/courses). In my use case, I want to extract the skills that each course teaches, although more information can be extracted from the Coursera website. The idea and code in this notebook are based on [Siddharth1698](https://github.com/Siddharth1698/Coursera-Course-Dataset)'s GitHub repository, modified for this purpose.<br />
If you have looked at my other [web scraping notebook](https://www.kaggle.com/code/azraimohamad/jobstreet-job-scraping), here we are **unlucky**: Coursera does not provide a usable API, so we need to put our *Python* skills to work to get the job done.

## Importing libraries

In [1]:
#Author:Azrai#
######################
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Define scraping function

In [2]:
def auto_Scrapper_Class(html_tag,course_case,tag_class, div_class=None):
    """
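    Scrape one field from every listing page and append the text of each match to the list course_case.
    Elements are selected by html_tag and tag_class; if div_class is given, <div> elements with that
    class are selected instead and html_tag/tag_class are ignored.
    A page that does not yield exactly 12 matches contributes 12 None placeholders so that all lists keep the same length.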
    """
    for i in range(1,50): # scrapes pages 1-49; adjust as needed (at the time of writing, the Coursera listing had 83 pages)
        url = "https://www.coursera.org/courses?page=" +str(i)
        
        #Use below url to gain more customization on the result with different query
        #url = "https://www.coursera.org/search?query=data%20science&page=" +str(i)
        
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')

        if div_class:
            elements = soup.find_all('div',  class_ = div_class)

            if (len(elements)) != 12:  
                for j in range(0,12):    # There are 12 courses per page
                    course_case.append(None)
                continue
            for name in elements:
                x = name.get_text()
                if x:
                    course_case.append(x)
                else:
                    course_case.append(None)

        else:
            element = soup.find_all(html_tag,  class_ = tag_class)
            if (len(element)) != 12:
                for j in range(0,12):
                    course_case.append(None)
                continue

            for name in element:
                x = name.get_text()
                if x:
                    course_case.append(x)
                else:
                    course_case.append("")
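
The function above fetches every page again for each field and pads a whole page with `None` whenever the match count is not exactly 12. A possible alternative, shown here only as a minimal sketch, is to parse each page once and read all fields card by card, so that a missing field affects a single row. The class names `course-card`, `card-title`, `card-partner` and `card-body` are hypothetical placeholders, not Coursera's real class names, and `parse_cards` is a helper introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical class names for illustration only; Coursera's generated
# class names differ and change over time.
FIELD_SELECTORS = {
    "Title": ("h3", "card-title"),
    "Organization": ("p", "card-partner"),
    "Skills": ("div", "card-body"),
}


def parse_cards(html, card_class="course-card"):
    """Return one dict per course card; fields missing from a card become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_=card_class):
        row = {}
        for field, (tag, css_class) in FIELD_SELECTORS.items():
            node = card.find(tag, class_=css_class)
            text = node.get_text(strip=True) if node else ""
            row[field] = text or None  # empty or missing text is stored as None
        rows.append(row)
    return rows
```

The list returned for each page could be extended into one long list and passed to `pd.DataFrame(...)`, which keeps the columns aligned without relying on a fixed count of 12 courses per page.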


In [3]:
course_title = []
course_organization = []
course_Certificate_type = []
course_rating = []
course_difficulty = []
course_review_counts = []
course_skills = []

In [4]:
#scrape the course titles
auto_Scrapper_Class('h3',course_title, tag_class='cds-119 cds-CommonCard-title css-e7lgfl cds-121')

In [5]:
#scrape the other fields, using the class names found in Coursera's page HTML
auto_Scrapper_Class('p',course_organization,'cds-119 cds-ProductCard-partnerNames css-dmxkm1 cds-121')
#auto_Scrapper_Class('div',course_Certificate_type,'_jen3vs _1d8rgfy3')
auto_Scrapper_Class('p',course_rating,'cds-119 css-11uuo4b cds-121')
auto_Scrapper_Class('p',course_difficulty,'cds-119 cds-Typography-base css-dmxkm1 cds-121', 'cds-CommonCard-metadata')
auto_Scrapper_Class('p',course_review_counts,'cds-119 cds-Typography-base css-dmxkm1 cds-121', 'product-reviews css-pn23ng')
auto_Scrapper_Class('p',course_skills,'cds-119 cds-Typography-base css-dmxkm1 cds-121', 'cds-CommonCard-bodyContent' )

## Clean the scraped data

In [6]:
data = {
    'Title': course_title,
    'Organization': course_organization,
    'Skills': course_skills,
    'Ratings':course_rating,
    'Review counts':course_review_counts,
    'Metadata': course_difficulty    
}
df = pd.DataFrame(data)
df['Skills'] = df['Skills'].str.replace("Skills you'll gain:", '', regex=False)
df

Unnamed: 0,Title,Organization,Skills,Ratings,Review counts,Metadata
0,Google Cybersecurity,Google,"Network Security, Python Programming, Linux, ...",4.8,4.8(21K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
1,Google Data Analytics,Google,"Data Analysis, R Programming, SQL, Business C...",4.8,4.8(138K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
2,Google Project Management:,Google,"Project Management, Strategy and Operations, ...",4.8,4.8(101K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
3,IBM Data Science,IBM,"Python Programming, Data Science, Machine Lea...",4.6,4.6(121K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
4,Google Digital Marketing & E-commerce,Google,"Digital Marketing, Marketing, Marketing Manag...",4.8,4.8(23K reviews),Beginner · Professional Certificate · 3 - 6 Mo...
...,...,...,...,...,...,...
583,Cryptography I,Stanford University,"Algorithms, Cryptography, Mathematics, Securi...",4.8,4.8(4.2K reviews),Mixed · Course · 1 - 3 Months
584,Natural Language Processing in TensorFlow,DeepLearning.AI,"Machine Learning, Natural Language Processing...",4.6,4.6(6.4K reviews),Intermediate · Course · 1 - 4 Weeks
585,AI Applications in Marketing and Finance,University of Pennsylvania,"Business Analysis, Machine Learning, Customer...",4.7,4.7(169 reviews),Mixed · Course · 1 - 4 Weeks
586,Artificial Intelligence: an Overview,Politecnico di Milano,"Machine Learning, Leadership and Management, ...",4.5,4.5(207 reviews),Beginner · Specialization · 3 - 6 Months
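
In the output above, `Review counts` repeats the rating and stores the count as text (for example `4.8(21K reviews)`), and `Metadata` packs three values into one string. The following is a minimal sketch of how these two columns might be split further. It assumes every value follows the patterns visible in the rows shown; `parse_review_counts` and `split_metadata` are helper names introduced here for illustration. Because the site abbreviates counts such as `21K`, the resulting numbers are approximate.

```python
import re

# Pattern based on the values visible in the table, e.g. "4.8(21K reviews)".
_REVIEW_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\(\s*(\d+(?:\.\d+)?)\s*([KkMm]?)\s+reviews?\s*\)\s*$"
)
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_review_counts(text):
    """Turn '4.8(21K reviews)' into (rating, count); (None, None) if it does not match."""
    if not isinstance(text, str):
        return None, None
    match = _REVIEW_RE.match(text)
    if not match:
        return None, None
    rating, number, suffix = match.groups()
    return float(rating), round(float(number) * _MULTIPLIER[suffix.lower()])


def split_metadata(text):
    """Split 'Level · Type · Duration' into a 3-tuple, padding missing parts with None."""
    if not isinstance(text, str):
        return None, None, None
    parts = [part.strip() for part in text.split("·")]
    parts = [part for part in parts if part]
    parts += [None] * (3 - len(parts))
    return tuple(parts[:3])
```

For example, `df['Review counts'].map(parse_review_counts)` would yield one `(rating, count)` tuple per row, which can then be assigned to separate columns.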


In [7]:
#df.to_csv("coursera_course_dataset.csv")

## Extras
#### In my case, I only want the names of the skills that can be gained from courses on Coursera

In [8]:
skills_column = df['Skills']

skills_column = [str(skill) if skill is not None else '' for skill in skills_column]

# Concatenate all skills into a single string
all_skills_text = ', '.join(skills_column)

# Split the string into a list of skills
all_skills_list = [skill.strip() for skill in all_skills_text.split(',')]

# Get unique skills
distinct_skills = list(set(all_skills_list))

distinct_skills_df = pd.DataFrame({'Distinct Skills': distinct_skills,'source':"coursera"})
len(distinct_skills)

311

In [9]:
distinct_skills_df

Unnamed: 0,Distinct Skills,source
0,,coursera
1,Statistical Programming,coursera
2,Software Visualization,coursera
3,General Accounting,coursera
4,Supply Chain Systems,coursera
...,...,...
306,Django (Web Framework),coursera
307,Computational Thinking,coursera
308,Business Development,coursera
309,Business Research,coursera
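
Row 0 of the table above is an empty string, which comes from courses or pages where no skills were scraped. A minimal sketch of a pandas-based variant that drops missing and blank entries before deduplicating is shown below. It assumes the `Skills` column holds comma-separated strings, as in the output earlier; `distinct_skills_frame` is a name introduced here for illustration.

```python
import pandas as pd


def distinct_skills_frame(skills: pd.Series) -> pd.DataFrame:
    """Build a sorted table of unique, non-empty skill names."""
    cleaned = (
        skills.dropna()      # skip courses with no scraped skills
        .astype(str)
        .str.split(",")      # one list of skills per course
        .explode()           # one row per skill
        .str.strip()
    )
    cleaned = cleaned[cleaned != ""]  # drop blanks left by trailing commas
    return pd.DataFrame(
        {"Distinct Skills": sorted(cleaned.unique()), "source": "coursera"}
    )
```

Calling `distinct_skills_frame(df['Skills'])` would give a table in the same shape as `distinct_skills_df`, without the blank row and in alphabetical order.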


In [10]:
#distinct_skills_df.to_csv("skills_coursera.csv")

### Limitation
Unfortunately, due to Coursera's website design, this code can retrieve at most 84 pages (around 1,000 courses). However, Coursera actually offers more than 13k courses.
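
One idea to explore is the search URL that appears as a commented-out line in the scraping function: running several different queries and merging the results might reach courses that the main listing does not show. Whether this actually gets past the page cap is an untested assumption. The sketch below only builds the URLs; `build_search_url` and `iter_search_urls` are hypothetical helpers, and the example queries are placeholders.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the commented-out search URL in the scraping function.
SEARCH_URL = "https://www.coursera.org/search"


def build_search_url(query: str, page: int) -> str:
    """Return a search URL such as .../search?query=data%20science&page=1."""
    params = urlencode({"query": query, "page": page}, quote_via=quote)
    return f"{SEARCH_URL}?{params}"


def iter_search_urls(queries, pages_per_query):
    """Yield the URL of every page for every query, in order."""
    for query in queries:
        for page in range(1, pages_per_query + 1):
            yield build_search_url(query, page)


# Placeholder queries; any list of topics could be used here.
urls = list(iter_search_urls(["data science", "marketing"], pages_per_query=2))
```

Results collected from several queries would overlap, so duplicates would need to be dropped afterwards, for instance with `df.drop_duplicates(subset='Title')`.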

Feel free to improve the code <br />
Thank you!