# Cleaning Udacity Catalog

## Step 1: Getting text from Udacity's course catalog page

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
# fetch the web page and fail early on an HTTP error status
r = requests.get('https://in.udacity.com/courses/all')
r.raise_for_status()

## Step 2: Use BeautifulSoup to parse the HTML

In [4]:
soup = BeautifulSoup(r.text, "lxml")

## Step 3: Find all course summaries

In [5]:
summaries = soup.find_all("div", {"class": "course-summary-card"})
print("Number of courses: {}".format(len(summaries)))

Number of courses: 226
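
The cards on the catalog page carry several classes at once (the first summary printed in Step 4 has `course-summary-card row row-gap-medium ...`). Filtering on a single class still matches such tags. The following is a minimal sketch on hand-written, hypothetical markup (`SAMPLE_HTML` and `demo_soup` are illustrative names, not part of the notebook) showing three equivalent ways to select them. It uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the multi-class cards on the catalog page
SAMPLE_HTML = """
<div class="course-summary-card row catalog-card">A</div>
<div class="course-summary-card row nanodegree-card">B</div>
<div class="row catalog-card">not a course card</div>
"""

demo_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute dict, as used in the cell above: matches any tag whose
# class list contains "course-summary-card"
by_find_all = demo_soup.find_all("div", {"class": "course-summary-card"})

# Same filter written with the class_ keyword argument
by_keyword = demo_soup.find_all("div", class_="course-summary-card")

# Same filter written as a CSS selector
by_select = demo_soup.select("div.course-summary-card")

print(len(by_find_all), len(by_keyword), len(by_select))
```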


## Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a helpful BeautifulSoup method that outputs HTML in a nicely indented form. Wrap the call in `print()` so the whitespace is displayed properly.

In [7]:
print(summaries[0].prettify())

<div _ngcontent-sc268="" class="course-summary-card row row-gap-medium catalog-card nanodegree-card ng-star-inserted">
 <ir-catalog-card _ngcontent-sc268="" _nghost-sc271="">
  <div _ngcontent-sc271="" class="card-wrapper is-collapsed">
   <div _ngcontent-sc271="" class="card__inner card mb-0">
    <div _ngcontent-sc271="" class="card__inner--upper">
     <div _ngcontent-sc271="" class="image_wrapper hidden-md-down">
      <a _ngcontent-sc271="" href="/course/data-engineer-nanodegree--nd027">
       <!-- -->
       <div _ngcontent-sc271="" class="image-container ng-star-inserted" style="background-image:url(https://d20vrrgs8k4bvw.cloudfront.net/images/degrees/nd027/nd-card.jpg);">
        <div _ngcontent-sc271="" class="image-overlay">
        </div>
       </div>
      </a>
      <!-- -->
     </div>
     <div _ngcontent-sc271="" class="card-content">
      <!-- -->
      <!-- -->
      <div _ngcontent-sc271="" class="category-wrapper">
       <span _ngcontent-sc271="" class="mobile-i

Look for selectors that contain the course title and school name text you want to extract. Then use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, do some extra cleaning to isolate the names (strip the unnecessary HTML and whitespace), as you saw in the last video.

In [8]:
# title
summaries[0].select_one("h3").get_text().strip()

'Data Engineer'

In [10]:
# school
summaries[0].select_one("h4").get_text().strip()

'School of Data Science'
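
`select_one` returns `None` when nothing matches, so chaining `.get_text()` directly raises an `AttributeError` on a card that lacks an `h3` or `h4`. The following is a small sketch of a guard for that case, run on hypothetical hand-written markup. `text_or_none`, `SAMPLE_HTML`, and `sample_soup` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical cards: the second one has no <h4> school element
SAMPLE_HTML = """
<div class="course-summary-card">
  <h3> Data Engineer </h3>
  <h4> School of Data Science </h4>
</div>
<div class="course-summary-card">
  <h3> Untitled card </h3>
</div>
"""

def text_or_none(tag, selector):
    """Return the stripped text of the first match for selector, or None."""
    match = tag.select_one(selector)
    return match.get_text().strip() if match is not None else None

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = sample_soup.find_all("div", {"class": "course-summary-card"})

# Build (title, school) pairs without failing on the incomplete card
extracted = [(text_or_none(c, "h3"), text_or_none(c, "h4")) for c in cards]
print(extracted)
```

Whether to keep, skip, or log cards with a missing field is a design choice. Returning `None` simply makes the gap visible in the collected data.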

## Step 5: Collect names and schools of ALL course listings

In [19]:
courses = []
for summary in summaries:
    title = summary.select_one("h3").get_text().strip()
    school = summary.select_one("h4").get_text().strip()
    courses.append((title, school))
    
# printing the courses on udacity
courses

[('Data Engineer', 'School of Data Science'),
 ('VR Foundations', 'School of Programming'),
 ('VR Mobile 360', 'School of Programming'),
 ('VR High-Immersion', 'School of Programming'),
 ('Google Analytics', 'School Of Business'),
 ('Artificial Intelligence for Trading', 'School of AI'),
 ('Python Foundation', 'School of Programming'),
 ('Data Analyst', 'School of Data Science'),
 ('Machine Learning Foundation', 'School of AI'),
 ('AI Programming with Python', 'School of AI'),
 ('Become an Android Developer', 'School of Programming'),
 ('Become a Professional Full Stack Developer', 'School of Programming'),
 ('Become a Data Scientist', 'School of Data Science'),
 ('Android Basics by Google', 'School of Programming'),
 ('Artificial Intelligence', 'School of AI'),
 ('IoT Software Foundation', 'School of Programming'),
 ('Become a Front End Developer', 'School of Programming'),
 ('Learn to Code', 'School of Programming'),
 ('Digital Marketing', 'School Of Business'),
 ('Robotics Software 

In [20]:
# list only School of Data Science courses
for course in courses:
    if course[1] == 'School of Data Science':
        print(course[0])

Data Engineer
Data Analyst
Become a Data Scientist
Intro to Data Analysis
Intro to Data Science
Intro to Inferential Statistics
Data Analysis and Visualization
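
Filtering the list once per school repeats the same loop each time. As an alternative sketch, the pairs can be grouped by school in a single pass. `sample_courses` below holds a few pairs copied from the output above, and `group_by_school` is a hypothetical helper name.

```python
from collections import defaultdict

# A few (title, school) pairs copied from the collected output
sample_courses = [
    ('Data Engineer', 'School of Data Science'),
    ('VR Foundations', 'School of Programming'),
    ('Artificial Intelligence for Trading', 'School of AI'),
    ('Data Analyst', 'School of Data Science'),
    ('Machine Learning Foundation', 'School of AI'),
]

def group_by_school(pairs):
    """Map each school name to the list of its course titles, in input order."""
    by_school = defaultdict(list)
    for title, school in pairs:
        by_school[school].append(title)
    return dict(by_school)

grouped = group_by_school(sample_courses)
for title in grouped['School of Data Science']:
    print(title)
```

With the full `courses` list from Step 5, `group_by_school(courses)` would give one lookup table for every school.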


In [21]:
# list only School of AI courses
for course in courses:
    if course[1] == "School of AI":
        print(course[0])

Artificial Intelligence for Trading
Machine Learning Foundation
AI Programming with Python
Artificial Intelligence
Computer Vision
Natural Language Processing
Machine Learning Engineer
Deep Reinforcement Learning
Deep Learning
Deep Learning
Intro to Machine Learning
Intro to Artificial Intelligence
Intro to Descriptive Statistics
Introduction to Computer Vision
Machine Learning
Intro to Deep Learning with PyTorch
Intro to TensorFlow for Deep Learning
Machine Learning for Trading
Data Analysis with R
Reinforcement Learning
Intro to Hadoop and MapReduce
Artificial Intelligence for Robotics
Linear Algebra Refresher Course
A/B Testing
Data Visualization and D3.js
Artificial Intelligence
Machine Learning: Unsupervised Learning
Model Building and Validation
Big Data Analytics in Healthcare
Knowledge-Based AI: Cognitive Systems
Real-Time Analytics with Apache Storm
Eigenvectors and Eigenvalues
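
The output above lists 'Deep Learning' and 'Artificial Intelligence' twice under School of AI, and the collected pairs spell 'School Of Business' with a capital 'Of', unlike the other schools. It is not clear from the page whether the repeated titles are true duplicates or distinct courses that share a name, so the following is only a sketch of one possible cleanup, assuming exact repeats should be dropped. `normalize_school` and `dedupe` are hypothetical helper names.

```python
def normalize_school(name):
    """Lowercase a standalone 'Of' so the name matches the 'School of ...' style."""
    return " ".join("of" if word == "Of" else word for word in name.split())

def dedupe(pairs):
    """Drop exact repeats while keeping the first occurrence of each pair."""
    return list(dict.fromkeys(pairs))

# Pairs copied from the output above, including one repeated title
sample = [
    ('Google Analytics', 'School Of Business'),
    ('Deep Learning', 'School of AI'),
    ('Deep Learning', 'School of AI'),
    ('Digital Marketing', 'School Of Business'),
]

cleaned = dedupe([(title, normalize_school(school)) for title, school in sample])
print(cleaned)
```

If the repeated titles turn out to be different courses, keep them and distinguish them by another field, such as the course link.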
