# Getting Coursera free courses data

This notebook is not part of the workshop exercises, but I wanted to share it for new Pythonistas who want to know how I got the data for `Notebook 4`.

## Installing dependencies

First, you will want to install BeautifulSoup and pandas. Depending on your installation, do:

```bash
# pip
pip install pandas
pip install beautifulsoup4

# conda
conda activate spacy_test
conda install pandas
conda install beautifulsoup4
```


## Get data from freeCodeCamp.org

We will be using a very cool blogpost from freeCodeCamp.org titled `I uncovered 1,600 Coursera courses that are still completely free` and written by Dhawal Shah. The post is freely available here: [https://www.freecodecamp.org/news/coursera-free-online-courses-6d84cdb30da/](https://www.freecodecamp.org/news/coursera-free-online-courses-6d84cdb30da/)

In [134]:
import re  # standard library (no need to install)
import requests  # third-party package (run `pip install requests` if it is missing)

from bs4 import BeautifulSoup
import pandas as pd

Now we need to put the URL in a string. You can actually do this for whatever website you want.

In [135]:
URL = "https://www.freecodecamp.org/news/coursera-free-online-courses-6d84cdb30da/"

We will then write the response text to a file with an `.html` extension. We will save it under the `data/raw/` directory for archival purposes (for instance, in case the post gets taken down).

In [136]:
response = requests.get(URL) # Fetch html

out_dir = '../data/raw/'
file_name = 'freecodecamp.org_news_coursera-free-online-courses.html'

with open(out_dir + file_name, 'w', encoding='utf-8') as whtml:  # Explicit encoding for non-ASCII course names
    whtml.write(response.text)
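
The save step above assumes that `../data/raw/` already exists. A minimal sketch of a more defensive version, using only the standard library, could look like this. The helper name `save_html` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def save_html(text, out_dir, file_name):
    """Write `text` to out_dir/file_name as UTF-8 and return the resulting path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # Create the directory if it is missing
    target = out_path / file_name
    target.write_text(text, encoding="utf-8")  # Explicit encoding avoids platform defaults
    return target


# Hypothetical usage with the notebook's variables:
# response.raise_for_status()  # Stop early if the server replied with an error status
# save_html(response.text, "../data/raw/", file_name)
```

Calling `response.raise_for_status()` first is optional, but it keeps an error page from being archived as if it were the blog post.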

## Process the html content with BeautifulSoup

[BeautifulSoup](https://pypi.org/project/beautifulsoup4/) is a popular package for processing HTML and XML data. We will parse the `response` text and search it for the `<div class="post-content">` tag, which contains the information we're interested in. You can take a look at the contents yourself by opening the raw file in your text editor.

**Our objective is the following:** create a pandas dataframe that contains a unique `id` for each course, the `name` of the course, its `url`, and the `institution` that teaches it. We also keep track of each course's `category` (Computer Science, Business, Humanities, etc.) while parsing, although the loop in this notebook does not store it in the dataframe. We will then write the dataframe to a `.csv` file under the `data/processed/` directory, which will later allow us to easily retrieve it and work on it during the workshop.

In [137]:
soup = BeautifulSoup(response.text, 'html.parser') # Parse html

# Now we'll search for the tag that interests us. Only one tag has the attribute class="post-content"
post_content = soup.find(attrs={"class": "post-content"})
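
As a small illustration of the lookup above, the sketch below runs `find` on a made-up HTML snippet. The snippet and the names `toy_soup`, `by_attrs`, and `by_class` are assumptions for illustration and are not taken from the scraped page. It also shows the `class_` keyword, which is another way to filter on the class attribute.

```python
from bs4 import BeautifulSoup

# Toy HTML that imitates the page layout (made up for illustration)
html = """
<div class="sidebar"><h2>Ignore me</h2></div>
<div class="post-content"><h2>Computer Science (2)</h2></div>
"""

toy_soup = BeautifulSoup(html, "html.parser")

# Both calls return the first tag whose class is "post-content"
by_attrs = toy_soup.find(attrs={"class": "post-content"})
by_class = toy_soup.find("div", class_="post-content")

print(by_attrs.h2.text)
print(by_attrs is by_class)
```

If no tag matches, `find` returns `None`, so it is worth checking the result before looping over it.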

Now we will start to loop over the parsed html. There are a few important elements in the html that we need to take into account. Web scraping and parsing is still more of an art than a science, and I got this information by inspecting the html file.

**Important elements:**

- The `<h2>` tags indicate the `category` of each course. If we inspect the `<h2>` elements, we will notice that they contain some extra information indicating how many courses are listed under each category, for instance `Computer Science (97)`. We will want to strip that trailing count, which is a space followed by digits in parentheses: ` \([0-9]+\)`
- The `<ul>` tags follow each `<h2>` category and they contain course elements tagged as `<li>`
- The `<li>` tags contain `<a>` tags that hold both the text and the href for each course. We will take the text as the `name` of the course and the href as its `url`.
- The `<em>` tags are also nested under each `<li>` element. These `<em>` tags contain the name of each teaching `institution` and there is only one per course.
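
To see what stripping the trailing count looks like in practice, here is a short sketch with the standard `re` module. Only `Computer Science (97)` comes from the post; the other headings and their counts are made up to show that the pattern handles any number of digits.

```python
import re

pattern = r' \([0-9]+\)'  # A space, then one or more digits in parentheses

# The last two headings are invented examples with one- and three-digit counts
headings = ["Computer Science (97)", "Business (348)", "Art & Design (9)"]
cleaned = [re.sub(pattern, '', heading) for heading in headings]

print(cleaned)  # ['Computer Science', 'Business', 'Art & Design']
```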

Because of the nested nature of the html data, we will be traversing it with nested loops and we will use a list of lists to keep track of it all.

In [138]:
course_id = 0  # Unique course id numbers. The number will increase as we find <li> tags.
h2_seen = 0  # Will help avoid classifying <li> elements without having seen a category first

pattern = r' \([0-9]+\)'  # Strip the trailing count from `CATEGORY (N)` strings
category = ''  # Keep track of each seen category (h2.text)

data = []  # List of lists to create a pandas dataframe

Now, we will loop over the contents of `post_content`.

In [139]:
for i in post_content.contents:
    if i.name == 'h2':
        h2_seen = 1
        category = re.sub(pattern, '', i.string)  # Strip the trailing ` (N)` count
    elif h2_seen == 1 and i.name == 'ul':  # The h2 check is not strictly necessary in our data, but it's good practice
        for li in i.contents:  # Loop through the children of the <ul> tag
            if li.name != 'li':  # Keep only list elements
                continue
            if li.a is None:  # If empty list elements exist, skip them
                continue
            # Do stuff with our found <li> elements
            course_name = li.a.text
            course_URL = li.a['href']
            course_institution = li.em.text
            course_id += 1
            # Store the above variables for later
            data.append([course_id, course_name, course_URL, course_institution])
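
The loop above tracks `category` but does not add it to `data`. As a hedged sketch of how the category could be stored as well, the example below runs the same traversal on a made-up HTML snippet. The HTML, the `example.com` URLs, and the names `toy_html`, `toy_content`, `toy_category`, and `rows` are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Toy HTML imitating the structure described above (made up for illustration)
toy_html = (
    '<div class="post-content">'
    '<h2>Computer Science (2)</h2>'
    '<ul>'
    '<li><a href="https://example.com/ml">Machine Learning</a> from <em>Stanford University</em></li>'
    '<li></li>'
    '<li><a href="https://example.com/algs">Algorithms, Part I</a> from <em>Princeton University</em></li>'
    '</ul>'
    '</div>'
)

toy_content = BeautifulSoup(toy_html, "html.parser").find(attrs={"class": "post-content"})

rows = []
toy_category = None  # No category seen yet
for node in toy_content.contents:
    if node.name == "h2":
        toy_category = re.sub(r" \([0-9]+\)", "", node.get_text())
    elif node.name == "ul" and toy_category is not None:
        for li in node.find_all("li", recursive=False):  # Direct <li> children only
            if li.a is None:  # Skip empty list elements
                continue
            # Same fields as before, plus the current category
            rows.append([len(rows) + 1, li.a.text, toy_category, li.a["href"], li.em.text])

print(rows)
```

With this variant, the dataframe would need a matching `category` entry in its `columns` list.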

## Create and query our Pandas dataframe

In [140]:
df = pd.DataFrame(data, columns=['id', 'name', 'url', 'institution'])

Now let's take a closer look at our dataframe. We will repeat these steps during the workshop (when we load this data), so feel free to skip to the end.

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1632 entries, 0 to 1631
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           1632 non-null   int64 
 1   name         1632 non-null   object
 2   url          1632 non-null   object
 3   institution  1632 non-null   object
dtypes: int64(1), object(3)
memory usage: 51.1+ KB


Let's now look at the top three items in our dataframe:

In [142]:
df.head(3)

|   | id | name | url | institution |
|---|----|------|-----|-------------|
| 0 | 1 | Machine Learning | https://www.classcentral.com/course/machine-le... | Stanford University |
| 1 | 2 | Information Systems Auditing, Controls and Ass... | https://www.classcentral.com/course/informatio... | The Hong Kong University of Science and Techno... |
| 2 | 3 | Algorithms, Part I | https://www.classcentral.com/course/algs4partI... | Princeton University |


Let's look at the bottom three entries:

In [143]:
df.tail(3)

|   | id | name | url | institution |
|---|----|------|-----|-------------|
| 1629 | 1630 | Ensino Híbrido: Personalização e Tecnologia na... | https://www.classcentral.com/course/ensino-hib... | Fundação Lemann |
| 1630 | 1631 | Aprenda a ensinar programação com o Programaê! | https://www.classcentral.com/course/programae-... | Fundação Lemann |
| 1631 | 1632 | Preparing for and Passing Technical Certificat... | https://www.classcentral.com/course/preparing-... | ROI Training |


Great. Now we can see that there are more than the advertised 1,600 courses. There are, in fact, 1,632 courses! Let's see how many institutions there are in our data.

In [144]:
df['institution'].nunique()

177
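
`nunique()` only tells us how many distinct institutions there are. If you also want to know how many courses each one offers, `value_counts()` is one option. The sketch below uses a tiny made-up dataframe (`toy` and its values are assumptions for illustration) rather than the scraped data.

```python
import pandas as pd

# Made-up data: institution "A" appears three times, "B" and "C" once each
toy = pd.DataFrame({"institution": ["A", "B", "A", "C", "A"]})

counts = toy["institution"].value_counts()  # Sorted from most to least frequent
print(counts.head(2))
print(toy["institution"].nunique())
```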

Now let's take a look at the sorted list of institutions:

In [145]:
sorted(df['institution'].unique())

['(ISC)²',
 'Amazon Web Services',
 'American Museum of Natural History',
 'Arizona State University',
 'Atlassian',
 'Berklee College of Music',
 'Brightline Initiative',
 'California Institute of Technology',
 'California Institute of the Arts',
 'Carnegie Mellon University',
 'Case Western Reserve University',
 'Checkpoint',
 'Columbia University',
 'Commonwealth Education Trust',
 'Copenhagen Business School',
 'Curtis Institute of Music',
 'Duke University',
 'E-Learning Development Fund',
 'EIT Digital',
 'EMLYON Business School',
 'ESADE Business and Law School',
 'ESCP Europe',
 'ESSEC Business School',
 'Eindhoven University of Technology',
 'Emory University',
 'Erasmus University Rotterdam',
 'Exploratorium',
 'Fudan University',
 'Fundação Instituto de Administração',
 'Fundação Lemann',
 'George Washington University',
 'Georgia Institute of Technology',
 'Ghent University',
 'GitLab',
 'Goldman Sachs',
 'Goldsmiths, University of London',
 'Google Cloud',
 'Google Daydrea

How many and which courses from our list are taught by the University of Arizona?

In [146]:
uaz_courses = df['institution'] == 'University of Arizona'  # Boolean mask
uaz_df = df[uaz_courses]  # Keep only the matching rows

print(f"Number of courses taught by the University of Arizona: {uaz_df['institution'].count()}\n")

# Walk through the three columns in parallel
for name, url, institution in zip(uaz_df['name'], uaz_df['url'], uaz_df['institution']):
    print(f"Course name: {name}\nCourse url: {url}\nInstitution: {institution}\n")

Number of courses taught by the University of Arizona: 5

Course name: Astronomy: Exploring Time and Space
Course url: https://www.classcentral.com/course/astro-3007
Institution: University of Arizona

Course name: Biosphere 2 Science for the Future of Our Planet
Course url: https://www.classcentral.com/course/biosphere-science-future-10470
Institution: University of Arizona

Course name: Astrobiology: Exploring Other Worlds
Course url: https://www.classcentral.com/course/astrobiology-exploring-other-worlds-13556
Institution: University of Arizona

Course name: Roman Art and Archaeology
Course url: https://www.classcentral.com/course/roman-art-archaeology-5796
Institution: University of Arizona

Course name: Introduction to the Orbital Perspective
Course url: https://www.classcentral.com/course/orbitalperspective-8291
Institution: University of Arizona



Let's see how many empty values our dataframe has:

In [147]:
print(f"Number of empty ID fields: {df['id'].isnull().sum()}\n\
Number of empty name fields: {df['name'].isnull().sum()}\n\
Number of empty url fields: {df['url'].isnull().sum()}\n\
Number of empty institution fields: {df['institution'].isnull().sum()}")

Number of empty ID fields: 0
Number of empty name fields: 0
Number of empty url fields: 0
Number of empty institution fields: 0
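
The same check can be written more compactly by calling `isnull().sum()` on the whole dataframe, which returns one count per column. A minimal sketch on a made-up dataframe follows (`toy` and its values are assumptions for illustration). Note that `isnull()` flags missing values such as `None` or `NaN`, not empty strings.

```python
import pandas as pd

# Made-up data with one missing name and one empty-string url
toy = pd.DataFrame({
    "id": [1, 2],
    "name": ["Machine Learning", None],
    "url": ["https://example.com/ml", ""],
})

missing = toy.isnull().sum()  # One count of missing values per column
print(missing)
```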


## Save our data as csv

Let's finish this notebook by saving `df` to our `data/processed/` directory so that we can retrieve it during the workshop. No need to worry about fields with commas: by default, pandas quotes any field that contains the delimiter.

In [148]:
out_dir = '../data/processed/'
file_name = 'freecodecamp.org_news_coursera-free-online-courses.csv'

df.to_csv(out_dir + file_name, index=False)  # index=False leaves out the row index column
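
To check that course names containing commas survive the round trip, here is a small sketch that writes a made-up one-row dataframe to an in-memory buffer instead of a file and reads it back with `pd.read_csv`. The names `toy`, `buffer`, `csv_text`, and `restored` are assumptions for illustration.

```python
import io

import pandas as pd

# One made-up row whose name contains a comma
toy = pd.DataFrame(
    [[1, "Algorithms, Part I", "Princeton University"]],
    columns=["id", "name", "institution"],
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # Write to an in-memory buffer instead of a file
csv_text = buffer.getvalue()
print(csv_text)  # The name field is wrapped in double quotes

buffer.seek(0)  # Rewind before reading the data back
restored = pd.read_csv(buffer)
print(restored)
```

During the workshop, the same `pd.read_csv` call can be pointed at the saved `.csv` file to load the data.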