## How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

- **Text Processing**: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
- **Feature Extraction**: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
- **Modeling**: Design a statistical or machine learning model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.
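
As a rough illustration of how the three stages hand off to one another, here is a minimal sketch in plain Python. The function names, the toy training sentences, and the word-count "model" are all assumptions made for this example; a real project would normally use a proper learning algorithm with an optimization step.

```python
import re
from collections import Counter


def process_text(raw):
    """Stage 1: lowercase, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split()


def extract_features(tokens):
    """Stage 2: represent the text as bag-of-words counts."""
    return Counter(tokens)


def fit(training_data):
    """Stage 3 (training): add up the word counts seen for each label."""
    totals = {}
    for text, label in training_data:
        features = extract_features(process_text(text))
        totals.setdefault(label, Counter()).update(features)
    return totals


def predict(model, text):
    """Stage 3 (prediction): choose the label whose counts overlap most."""
    features = extract_features(process_text(text))
    scores = {
        label: sum(counts[word] * n for word, n in features.items())
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


# Toy, made-up training data for illustration only.
training_data = [
    ("I loved this course, it was great!", "positive"),
    ("Great examples and a great instructor.", "positive"),
    ("This was boring and confusing.", "negative"),
    ("Confusing slides, boring videos.", "negative"),
]

model = fit(training_data)
prediction = predict(model, "What a great course")
print(prediction)
```

Each stage only depends on the output of the one before it, which is why you can later swap in a different feature representation or model without rewriting the text processing code.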

## Stage 1: Text Processing
The first chunk of this lesson will explore the steps involved in text processing, the first stage of the NLP pipeline.

## Why Do We Need to Process Text?
- Extracting plain text: Textual data can come from a wide variety of sources: the web, PDFs, Word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to your task.
- Reducing complexity: Some features of our language, like capitalization, punctuation, and common words such as *a*, *of*, and *the*, often help provide structure but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

You'll prepare text data from different sources with the following text processing steps:

1. Cleaning to remove irrelevant items, such as HTML tags
2. Normalizing by converting to all lowercase and removing punctuation
3. Splitting text into words or tokens
4. Removing words that are too common, also known as stop words
5. Identifying different parts of speech and named entities
6. Reducing words to their root or dictionary forms, using stemming and lemmatization

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.
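
Steps 1 through 4 above can be sketched with nothing but the standard library. This is a simplified illustration, not the lesson's solution: the `preprocess` function name and the tiny `STOP_WORDS` set are assumptions for this example, and steps 5 and 6 are left out because they usually rely on a dedicated NLP library.

```python
import re

# A tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(html):
    # 1. Cleaning: replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]+>", " ", html)
    # 2. Normalizing: lowercase, then replace punctuation with spaces.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Tokenizing: split on whitespace.
    tokens = text.split()
    # 4. Stop word removal: drop words that are too common to be useful.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("<p>The first step of the pipeline is <b>Text Processing</b>!</p>")
print(tokens)
# ['first', 'step', 'pipeline', 'text', 'processing']
```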

## Cleaning
Let's walk through an example of cleaning text data from a popular source: the web. You'll be introduced to helpful tools for working with this data, including the requests library, regular expressions, and Beautiful Soup.

## Documentation for Python Libraries:
- [Requests](https://requests.readthedocs.io/)
- [Regular Expressions](https://docs.python.org/3/library/re.html)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
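
To give a feel for what cleaning with regular expressions alone involves, here is a small sketch on a made-up HTML string. The `strip_markup` helper and the sample page are assumptions for illustration. Patterns like these are fragile on real-world pages, which is one reason a parser such as Beautiful Soup is usually the more comfortable choice.

```python
import html
import re


def strip_markup(page):
    # Remove <script> and <style> blocks together with their contents.
    page = re.sub(
        r"<(script|style)\b.*?>.*?</\1>", " ", page,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Replace every remaining tag with a space.
    page = re.sub(r"<[^>]+>", " ", page)
    # Decode HTML entities such as &amp;.
    page = html.unescape(page)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", page).strip()


# Hypothetical page content, not taken from any real site.
sample = (
    "<html><head><style>h1 {color: red;}</style></head>"
    "<body><h1>Courses &amp; Schools</h1>"
    "<script>var x = 1;</script><p>Learn NLP.</p></body></html>"
)

cleaned = strip_markup(sample)
print(cleaned)
# Courses & Schools Learn NLP.
```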

# Cleaning Quiz: Udacity's Course Catalog
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

**Note: All solution notebooks can be found by clicking on the Jupyter icon on the top left of this workspace.**

### Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.

Outputting all the JavaScript, CSS, and text may overload the space available to load this notebook, so we omit a print statement here.

In [9]:
# import statements
import requests
from bs4 import BeautifulSoup

In [11]:
# fetch web page and raise an error if the request failed
r = requests.get('https://www.udacity.com/courses/all')
r.raise_for_status()

### Step 2: Use BeautifulSoup to remove HTML tags
Use the `"lxml"` parser rather than `"html5lib"`.

Again, outputting all the results may overload the space available to load this notebook, so we omit a print statement here.

In [14]:
soup = BeautifulSoup(r.text,'lxml')
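
To see what parsing and tag removal look like without fetching the live page, here is a small sketch on an inline HTML string. The `sample_html` snippet is a made-up stand-in for `r.text`, and the example uses Python's built-in `"html.parser"` only so that it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text, so the example needs no network call.
sample_html = """
<div class="card-content">
  <h4 class="category">School of Data Science</h4>
  <h3 class="card-heading"><a href="#">Data Analyst</a></h3>
</div>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() drops the tags. strip=True trims each text fragment and skips
# empty ones; separator is placed between the fragments that remain.
plain_text = soup.get_text(separator=" | ", strip=True)
print(plain_text)
# School of Data Science | Data Analyst
```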

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.

In [16]:
# Find all course summaries
summaries = soup.find_all('div',class_ = 'card-content')
print('Number of Courses:', len(summaries))

Number of Courses: 236
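
The following sketch shows how `find_all` filters on both tag type and class name, using a made-up snippet rather than the live page. Note that `class_` matches an element if any one of its classes equals the given name. The sample markup and the use of `"html.parser"` are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two course cards plus two elements that should not match.
sample_html = """
<div class="card-content"><h3>Course A</h3></div>
<div class="card-content featured"><h3>Course B</h3></div>
<div class="footer"><h3>Not a course</h3></div>
<span class="card-content">Wrong tag</span>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only <div> elements that carry the class "card-content" are returned.
cards = soup.find_all("div", class_="card-content")
card_titles = [card.h3.get_text() for card in cards]
print(card_titles)
# ['Course A', 'Course B']
```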


### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output HTML in a nicely indented form! Make sure to use `print()` to ensure whitespace is displayed properly.

In [17]:
# print the first summary in summaries
print(summaries[0].prettify())

<div _ngcontent-sc216="" class="card-content">
 <!-- -->
 <span _ngcontent-sc216="" class="tag tag--new card ng-star-inserted">
  New
 </span>
 <!-- -->
 <div _ngcontent-sc216="" class="category-wrapper">
  <span _ngcontent-sc216="" class="mobile-icon">
  </span>
  <!-- -->
  <h4 _ngcontent-sc216="" class="category ng-star-inserted">
   School of Business
  </h4>
 </div>
 <h3 _ngcontent-sc216="" class="card-heading">
  <a _ngcontent-sc216="" class="capitalize" href="/course/ai-for-business-leaders--nd054">
   AI for Business Leaders
  </a>
 </h3>
 <div _ngcontent-sc216="" class="right-sub">
  <!-- -->
  <div _ngcontent-sc216="" class="skills ng-star-inserted">
   <h4 _ngcontent-sc216="">
    Skills Covered
   </h4>
   <span _ngcontent-sc216="" class="truncate-content">
    <!-- -->
    <span _ngcontent-sc216="" class="ng-star-inserted">
     Artificial Intelligence,
    </span>
    <span _ngcontent-sc216="" class="ng-star-inserted">
     Machine Learning,
    </span>
    <span _ngconte

Look for selectors that contain the course title and school name text you want to extract. Then, use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary HTML), as you saw in the last video.

In [18]:
# Extract course title
summaries[0].select_one('h3 a').get_text().strip()


'AI for Business Leaders'

In [19]:
# Extract school
summaries[0].select('h4')[0].get_text().strip()


'School of Business'
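
The prettified summary above contains more than one `h4` (the school and a "Skills Covered" heading), so `select('h4')[0]` works only because the school happens to come first. As a possible alternative, the sketch below uses the class names visible in that output to make the selectors more specific. The `sample_html` string is a trimmed, hand-written approximation of the summary, not the page's full markup.

```python
from bs4 import BeautifulSoup

# Hand-trimmed approximation of one course summary from the output above.
sample_html = """
<div class="card-content">
  <div class="category-wrapper">
    <h4 class="category ng-star-inserted">School of Business</h4>
  </div>
  <h3 class="card-heading">
    <a class="capitalize" href="/course/ai-for-business-leaders--nd054">
      AI for Business Leaders
    </a>
  </h3>
  <div class="skills ng-star-inserted"><h4>Skills Covered</h4></div>
</div>
"""

summary = BeautifulSoup(sample_html, "html.parser").select_one("div.card-content")

# select_one returns the first element matching the CSS selector.
title = summary.select_one("h3.card-heading a").get_text().strip()
school = summary.select_one("h4.category").get_text().strip()

# select returns every match, which shows why a bare 'h4' is ambiguous.
all_h4 = [h4.get_text().strip() for h4 in summary.select("h4")]

print(title)
print(school)
print(all_h4)
```

Class names on a live site can change at any time, so selectors like these are worth re-checking with "Inspect" whenever the scraper stops returning results.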

### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [20]:
courses = []
for summary in summaries:
    title = summary.select_one('h3 a').get_text().strip()
    school = summary.select('h4')[0].get_text().strip()
    
    # append name and school of each summary to courses list
    courses.append((title,school))
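
One thing to watch for in a loop like this: `select_one` returns `None` when nothing matches, so calling `.get_text()` on the result raises an `AttributeError` if a listing lacks a title or school. Below is a hedged sketch of one way to guard against that. The `text_or_none` helper and the sample markup are hypothetical and not part of the lesson's solution.

```python
from bs4 import BeautifulSoup


def text_or_none(element, selector):
    """Return the stripped text of the first match, or None if there is none."""
    match = element.select_one(selector)
    return match.get_text().strip() if match is not None else None


# Hypothetical listings: the second card has no school heading.
sample_html = """
<div class="card-content">
  <h4 class="category">School of Data Science</h4>
  <h3><a href="#">Data Analyst</a></h3>
</div>
<div class="card-content">
  <h3><a href="#">Free Course</a></h3>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

courses = []
for summary in soup.find_all("div", class_="card-content"):
    title = text_or_none(summary, "h3 a")
    school = text_or_none(summary, "h4")
    courses.append((title, school))

print(courses)
# [('Data Analyst', 'School of Data Science'), ('Free Course', None)]
```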

In [22]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:40]

236 course summaries found. Sample:


[('AI for Business Leaders', 'School of Business'),
 ('Intro to Machine Learning with TensorFlow',
  'School of Artificial Intelligence'),
 ('UX Designer', 'School of Business'),
 ('Data Streaming', 'School of Data Science'),
 ('Front End Web Developer', 'School of Programming'),
 ('Full Stack Web Developer', 'School of Programming'),
 ('Java Developer', 'School of Programming'),
 ('AI Product Manager', 'School of Artificial Intelligence'),
 ('Sensor Fusion Engineer', 'School of Autonomous Systems'),
 ('Data Visualization', 'School of Data Science'),
 ('Cloud Developer', 'School of Cloud Computing'),
 ('Cloud DevOps Engineer', 'School of Cloud Computing'),
 ('Intro to Machine Learning with PyTorch',
  'School of Artificial Intelligence'),
 ('C++', 'School of Autonomous Systems'),
 ('Data Structures and Algorithms', 'School of Programming'),
 ('Programming for Data Science with R', 'School of Data Science'),
 ('Data Engineer', 'School of Data Science'),
 ('Marketing Analytics', 'School 