
# Cleaning Quiz: Udacity's Course Catalog
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

**Note: All solution notebooks can be found by clicking on the Jupyter icon on the top left of this workspace.**

# Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.

Printing all of the JavaScript, CSS, and text may overload the space available to load this notebook, so we omit a print statement here.

In [4]:
# import statements
import requests
from bs4 import BeautifulSoup

In [5]:
# fetch web page
r = requests.get("https://www.udacity.com/courses/all")
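
A slightly more defensive fetch might look like the sketch below. It adds a timeout and raises on HTTP errors, so a failed request does not quietly hand an error page to the parser. The helper name `fetch_html` and the 10-second default are illustrative assumptions, not part of the quiz.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its default timeout are hypothetical, chosen for illustration.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.udacity.com/courses/all")` would then return the same text as `r.text` above, provided the request succeeds.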

# Step 2: Use BeautifulSoup to remove HTML tags
Pass `"lxml"` as the parser rather than `"html5lib"`.

Again, printing this entire result may overload the space available to load this notebook, so we omit a print statement here.

In [6]:
soup = BeautifulSoup(r.text, "lxml")
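
If `lxml` is not installed, `BeautifulSoup` raises `FeatureNotFound` when asked for that parser. The sketch below falls back to Python's built-in `"html.parser"` in that case. The helper name `make_soup` and the tiny sample document are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, else use the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is missing; html.parser ships with Python itself
        return BeautifulSoup(html, "html.parser")


# Hypothetical one-heading document, just to show the call
sample = "<html><body><h2>Data Analyst</h2></body></html>"
soup_sample = make_soup(sample)
print(soup_sample.h2.get_text())  # prints: Data Analyst
```

The two parsers can build slightly different trees from malformed HTML, so results on a real page may differ between them.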

# Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.

In [7]:
# Find all course summaries
summaries = soup.find_all("div", {"class":"catalog-component__card"})
print('Number of Courses:', len(summaries))

Number of Courses: 247
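
The same selection can be written in a few equivalent ways. The sketch below compares them on a small hand-written page that mimics the card markup. The sample HTML is an assumption, and `"html.parser"` is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two course cards plus one unrelated div
html = """
<div class="catalog-component__card"><h3 class="card__title__school greyed">School of Data Science</h3><h2 class="card__title__nd-name">Data Engineer</h2></div>
<div class="catalog-component__card"><h3 class="card__title__school greyed">School of Data Science</h3><h2 class="card__title__nd-name">Data Analyst</h2></div>
<div class="footer">Not a course</div>
"""
page = BeautifulSoup(html, "html.parser")

# Attribute dictionary, as used in the cell above
by_dict = page.find_all("div", {"class": "catalog-component__card"})
# class_ keyword (trailing underscore because `class` is a Python keyword)
by_keyword = page.find_all("div", class_="catalog-component__card")
# CSS selector syntax
by_css = page.select("div.catalog-component__card")

print(len(by_dict), len(by_keyword), len(by_css))  # prints: 2 2 2
```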


In [8]:
summaries[1]

<div class="catalog-component__card"><a aria-label="Data Analyst" class="card__top" href="/course/data-analyst-nanodegree--nd002"><div class="card__image-container"><div class="card__image-wrapper"><div class="card__image-overlay" data-catalogtype="nanodegree"></div><div class="card__image" style="background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEVMaXFNx9g6AAAAAXRSTlMAQObYZgAAAApJREFUeNpjYAAAAAIAAeUn3vwAAAAASUVORK5CYII)"></div></div></div><div class="card__title-container"><h3 class="card__title__school greyed">School of Data Science</h3><h2 class="card__title__nd-name">Data Analyst</h2></div><div class="card__text-content"><section><h4 class="text-content__text greyed">Skills Covered</h4><p class="text-content__text">Data Wrangling, Matplotlib, Bootstrapping, Pandas &amp; NumPy, Statistics</p></section><section><h4 class="text-content__text greyed">In Collaboration With</h4><p class="text-content__text">Kaggle</p></section></div></a><div 

# Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a helpful BeautifulSoup method that outputs HTML in a nicely indented form. Wrap it in `print()` so the whitespace is displayed properly.

In [9]:
# print the first summary in summaries
print(summaries[0].prettify())

<div class="catalog-component__card">
 <span class="catalog-card-tag--mobile">
  New Program!
 </span>
 <a aria-label="Data Engineer" class="card__top" href="/course/data-engineer-nanodegree--nd027">
  <div class="card__image-container">
   <div class="card__image-wrapper">
    <div class="card__image-overlay" data-catalogtype="nanodegree">
    </div>
    <div class="card__image" style="background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEVMaXFNx9g6AAAAAXRSTlMAQObYZgAAAApJREFUeNpjYAAAAAIAAeUn3vwAAAAASUVORK5CYII)">
    </div>
   </div>
  </div>
  <div class="card__title-container">
   <span class="catalog-card-tag--desktop">
    New
   </span>
   <h3 class="card__title__school greyed">
    School of Data Science
   </h3>
   <h2 class="card__title__nd-name">
    Data Engineer
   </h2>
  </div>
  <div class="card__text-content">
   <section>
    <h4 class="text-content__text greyed">
     Skills Covered
    </h4>
    <p class="text-content__text

Look for selectors that contain the course title and school name text you want to extract. Then, use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary HTML), as you saw in the last video.

In [10]:
# Extract course title
summaries[0].select_one("h2").get_text().strip()

'Data Engineer'

In [11]:
# Extract school
summaries[0].select_one("h3").get_text().strip()

'School of Data Science'
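
Selecting on the bare tag names `h2` and `h3` works for the card shown above, but the class names from the prettified output are more specific. The sketch below uses them on a hand-written card and also guards against `select_one` returning `None` when nothing matches. The helper `text_or_none` and the sample card are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical card, reduced from the prettified output above
card_html = """
<div class="catalog-component__card">
  <h3 class="card__title__school greyed">
    School of Data Science
  </h3>
  <h2 class="card__title__nd-name">
    Data Engineer
  </h2>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").select_one(
    "div.catalog-component__card"
)


def text_or_none(tag, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = tag.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


print(text_or_none(card, "h2.card__title__nd-name"))  # prints: Data Engineer
print(text_or_none(card, "h3.card__title__school"))   # prints: School of Data Science
print(text_or_none(card, "h5.missing"))               # prints: None
```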

# Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [12]:
courses = []
for summary in summaries:
    # append name and school of each summary to courses list
    title = summary.select_one("h2").get_text().strip()
    school = summary.select_one("h3").get_text().strip()
    courses.append([title, school])
courses

[['Data Engineer', 'School of Data Science'],
 ['Data Analyst', 'School of Data Science'],
 ['Introduction to Programming', 'School of Programming & Development'],
 ['Deep Learning', 'School of Artificial Intelligence'],
 ['Full Stack Web Developer', 'School of Programming & Development'],
 ['UX Designer', 'School of Business'],
 ['Data Scientist', 'School of Data Science'],
 ['Business Analytics', 'School of Business'],
 ['Self Driving Car Engineer', 'School of Autonomous Systems'],
 ['Programming for Data Science with Python', 'School of Data Science'],
 ['Machine Learning Engineer', 'School of Artificial Intelligence'],
 ['C++', 'School of Autonomous Systems'],
 ['Digital Marketing', 'School of Business'],
 ['SQL', 'School of Data Science'],
 ['AI Programming with Python', 'School of Artificial Intelligence'],
 ['Front End Web Developer', 'School of Programming & Development'],
 ['AI Product Manager', 'School of Artificial Intelligence'],
 ['Cloud DevOps Engineer', 'School of Cloud 
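
The loop above assumes every card contains both an `h2` and an `h3`. If the page layout changes and a card lacks one of them, `select_one` returns `None` and `.get_text()` raises an `AttributeError`. A minimal sketch of a loop that skips such cards, run on hand-written sample HTML (an assumption, not the real page), could look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the middle card has no school heading
html = """
<div class="catalog-component__card"><h3>School of Data Science</h3><h2>Data Engineer</h2></div>
<div class="catalog-component__card"><h2>Card without a school</h2></div>
<div class="catalog-component__card"><h3>School of Business</h3><h2>UX Designer</h2></div>
"""
sample_summaries = BeautifulSoup(html, "html.parser").find_all(
    "div", {"class": "catalog-component__card"}
)

sample_courses = []
skipped = 0
for summary in sample_summaries:
    title_tag = summary.select_one("h2")
    school_tag = summary.select_one("h3")
    if title_tag is None or school_tag is None:
        # Incomplete card: count it and move on instead of crashing
        skipped += 1
        continue
    sample_courses.append(
        (title_tag.get_text().strip(), school_tag.get_text().strip())
    )

print(sample_courses)
print("skipped:", skipped)  # prints: skipped: 1
```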

In [13]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

247 course summaries found. Sample:


[['Data Engineer', 'School of Data Science'],
 ['Data Analyst', 'School of Data Science'],
 ['Introduction to Programming', 'School of Programming & Development'],
 ['Deep Learning', 'School of Artificial Intelligence'],
 ['Full Stack Web Developer', 'School of Programming & Development'],
 ['UX Designer', 'School of Business'],
 ['Data Scientist', 'School of Data Science'],
 ['Business Analytics', 'School of Business'],
 ['Self Driving Car Engineer', 'School of Autonomous Systems'],
 ['Programming for Data Science with Python', 'School of Data Science'],
 ['Machine Learning Engineer', 'School of Artificial Intelligence'],
 ['C++', 'School of Autonomous Systems'],
 ['Digital Marketing', 'School of Business'],
 ['SQL', 'School of Data Science'],
 ['AI Programming with Python', 'School of Artificial Intelligence'],
 ['Front End Web Developer', 'School of Programming & Development'],
 ['AI Product Manager', 'School of Artificial Intelligence'],
 ['Cloud DevOps Engineer', 'School of Cloud 
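
As a quick sanity check on the collected pairs, one option is to count how many courses fall under each school. The sketch below does this with `collections.Counter` on the first six rows of the output above. The variable names are illustrative, and the full `courses` list could be passed in the same way.

```python
from collections import Counter

# First six rows copied from the output above
sample_courses = [
    ["Data Engineer", "School of Data Science"],
    ["Data Analyst", "School of Data Science"],
    ["Introduction to Programming", "School of Programming & Development"],
    ["Deep Learning", "School of Artificial Intelligence"],
    ["Full Stack Web Developer", "School of Programming & Development"],
    ["UX Designer", "School of Business"],
]

# Count courses per school; each row is [title, school]
school_counts = Counter(school for _title, school in sample_courses)

for school, count in school_counts.most_common():
    print(count, school)
```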
