# Cleaning

This section mainly uses the following three libraries:
- [Requests](https://2.python-requests.org//en/master/user/quickstart/#make-a-request)
- [Regular Expressions](https://docs.python.org/3/library/re.html)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

As a running example, this section scrapes the course names and the schools they belong to from [Udacity's course catalog page](https://www.udacity.com/courses/all).

### Step 1: Fetch the text of the web page

In [1]:
# import statements
import requests
from bs4 import BeautifulSoup

In [2]:
# fetch web page
r = requests.get('https://www.udacity.com/courses/all')

In [3]:
# display text from web page
print(r.text)

<!DOCTYPE html><html><head>
  <meta charset="utf-8">
  <script type="text/javascript" class="ng-star-inserted">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,n,e){r(e.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,n){return t}).join(", ")))},{}],2:[function(t,n,e){function r(t,n,e,r,s){try{p?p-=1:o(s||new UncaughtException(t,n,e
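
The fetch in `In [2]` assumes that the request succeeds. A slightly more defensive variant is sketched below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

With this helper, `r.text` in the cells above could be replaced by `fetch_html('https://www.udacity.com/courses/all')`, and any failure would surface as a `requests.RequestException` instead of silently returning an error page.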

### Step 2: Remove HTML tags with BeautifulSoup

Use `"lxml"` as the parser and call `get_text()` to extract the text.

The following two lines remove JavaScript and CSS:
```python
for script in soup(["script", "style"]):
    script.decompose()
```
For details, see [this Stack Overflow question](https://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript).
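
To see the effect on a small input, the sketch below applies the same `decompose()` loop to a hand-written HTML snippet (hypothetical, not taken from the Udacity page). It assumes Python's built-in `"html.parser"` so that it runs without `lxml`, and it passes `separator` and `strip` to `get_text()` so that adjacent text nodes do not run together.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a real page.
html = """
<html><head><style>p { color: red; }</style>
<script>console.log("hi");</script></head>
<body><h1>Catalog</h1><p>Java Developer</p><p>Cloud Developer</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop <script> and <style> elements together with their contents.
for tag in soup(["script", "style"]):
    tag.decompose()

# Join the remaining text nodes with single spaces and skip whitespace-only nodes.
clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)
```

Using a separator is a design choice: a plain `get_text()` concatenates neighbouring nodes directly, which is how strings such as `ProgramsBack to Menu` can appear in scraped output.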

In [4]:
soup = BeautifulSoup(r.text, 'lxml')
for script in soup(["script", "style"]):
    script.decompose()
print(soup.get_text())




Udacity































ProgramsBack to Menu Programming and Development Back to MenuNanodegree ProgramsJava DeveloperCloud DeveloperCloud DevOps EngineerC++Data Structures and AlgorithmsData EngineerIntroduction to ProgrammingFront End Web DeveloperiOS DeveloperFull Stack Web DeveloperReactBlockchain DeveloperAndroid DeveloperAndroid Basics Artificial Intelligence Back to MenuNanodegree ProgramsAI Product ManagerIntroduction to Machine LearningData Structures and AlgorithmsMachine Learning EngineerAI Programming with PythonDeep LearningArtificial Intelligence for TradingComputer VisionNatural Language ProcessingDeep Reinforcement LearningArtificial Intelligence Cloud Computing Back to MenuNanodegree ProgramsCloud DeveloperCloud DevOps Engineer Data Science Back to MenuNanodegree ProgramsData VisualizationData Structures and AlgorithmsProgramming for Data ScienceData EngineerMarketing AnalyticsData AnalystPredictive Analytics for BusinessData ScientistBusiness Analyti

### Step 3: Find all course summaries

Use the `find_all` method to select elements by tag type and class name.

In [5]:
summaries = soup.find_all('div', {'class': 'course-summary-card'})
print('Number of Courses:', len(summaries))

Number of Courses: 236
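
The class-based selection can also be tried offline. The sketch below uses a hand-written snippet (an assumption for illustration) and shows three equivalent ways to select the cards. Note that a class filter matches an element when any one of its space-separated classes equals the given name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two course cards and one unrelated <div>.
html = """
<div class="course-summary-card row catalog-card"><h3>Java Developer</h3></div>
<div class="course-summary-card row"><h3>AI Product Manager</h3></div>
<div class="footer row"><h3>About</h3></div>
"""

soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("div", {"class": "course-summary-card"})    # attrs dict
by_keyword = soup.find_all("div", class_="course-summary-card")     # class_ keyword
by_css = soup.select("div.course-summary-card")                     # CSS selector

print(len(by_dict), len(by_keyword), len(by_css))
```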


### Step 4: Extract the name and school from the first course summary

In [6]:
# print the first summary in summaries
print(summaries[0].prettify())

<div _ngcontent-sc189="" class="course-summary-card row row-gap-medium catalog-card nanodegree-card ng-star-inserted">
 <ir-catalog-card _ngcontent-sc189="" _nghost-sc192="">
  <div _ngcontent-sc192="" class="card-wrapper is-collapsed">
   <div _ngcontent-sc192="" class="card__inner card mb-0">
    <div _ngcontent-sc192="" class="card__inner--upper">
     <div _ngcontent-sc192="" class="image_wrapper hidden-md-down">
      <a _ngcontent-sc192="" href="/course/java-developer-nanodegree--nd035">
       <!-- -->
       <div _ngcontent-sc192="" class="image-container ng-star-inserted" style="background-image:url(https://d20vrrgs8k4bvw.cloudfront.net/images/degrees/nd035/nd-card.png);">
        <div _ngcontent-sc192="" class="image-overlay">
        </div>
       </div>
      </a>
      <!-- -->
     </div>
     <div _ngcontent-sc192="" class="card-content">
      <!-- -->
      <span _ngcontent-sc192="" class="tag tag--new card ng-star-inserted">
       New
      </span>
      <!-- -->
   

In [7]:
# Extract course title
summary = summaries[0]
summary.select_one('h3').get_text().strip()

'Java Developer'

In [8]:
# Extract school
summary.select_one('h4').get_text().strip()

'School of Programming'

### Step 5: Collect the name and school of every course

All of the steps above are collectively called **scraping**.

In [9]:
courses = []
for summary in summaries:
    # append name and school of each summary to courses list
    school = summary.select_one('h4').get_text().strip()
    name = summary.select_one('h3').get_text().strip()
    courses.append((school, name))

In [10]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:5]

236 course summaries found. Sample:


[('School of Programming', 'Java Developer'),
 ('School of Artificial Intelligence', 'AI Product Manager'),
 ('School of Autonomous Systems', 'Sensor Fusion Engineer'),
 ('School of Data Science', 'Data Visualization'),
 ('School of Cloud Computing', 'Cloud Developer')]
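
The loop in `In [9]` assumes that every card contains both an `h3` and an `h4`. `select_one` returns `None` when nothing matches, so a card that lacks one of them would raise an `AttributeError`. A more tolerant variant is sketched below on hand-written markup. The helper `text_or_none` is a hypothetical name introduced for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card has no <h4> (school).
html = """
<div class="course-summary-card"><h3> Java Developer </h3><h4>School of Programming</h4></div>
<div class="course-summary-card"><h3>Mystery Course</h3></div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text().strip() if node is not None else None


soup = BeautifulSoup(html, "html.parser")
courses = []
for card in soup.find_all("div", class_="course-summary-card"):
    # Same (school, name) ordering as in the notebook.
    courses.append((text_or_none(card, "h4"), text_or_none(card, "h3")))

print(courses)
```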

# Normalization

The most common normalization steps are converting case and removing punctuation. The short passage below serves as the example.

In [11]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


### Case Normalization

In [12]:
# Convert to lowercase
text = text.lower()
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


### Punctuation Removal

For more on regular expressions (regex), see the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html).

In [13]:
import re
# Remove punctuation characters
text = re.sub(r'[^a-zA-Z0-9_]', ' ', text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  


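> The regular expression in the code replaces every character that is not a-z, A-Z, 0-9, or an underscore with `" "`. Replacing with a space, rather than deleting the character, ensures that words do not get glued together, because some text has no whitespace before or after its punctuation.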
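
The difference between deleting and replacing with a space can be checked on a short string. The sample text below is made up for illustration, and the final whitespace-collapsing step is an optional extra that is not part of the original notebook.

```python
import re

# Made-up sample with no space after the period.
sample = "boring.Look at it , twice"

# Deleting non-word characters also deletes the spaces, so everything is glued.
glued = re.sub(r"[^a-zA-Z0-9_]", "", sample)

# Replacing with a space keeps the words apart but can leave runs of spaces.
spaced = re.sub(r"[^a-zA-Z0-9_]", " ", sample)

# Optional clean-up: collapse consecutive whitespace into single spaces.
collapsed = " ".join(spaced.split())

print(repr(glued))
print(repr(spaced))
print(repr(collapsed))
```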

# Tokenization
This section mainly uses the [nltk.tokenize package](http://www.nltk.org/api/nltk.tokenize.html). For installation instructions, see the [NLTK data installation guide](https://www.nltk.org/data.html).

In [14]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ChihYing\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
# import statements
from nltk.tokenize import word_tokenize, sent_tokenize

In [16]:
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


### Word Tokenization

In [17]:
# Split text into words using NLTK
word = word_tokenize(text)
print(word)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


### Sentence Tokenization

In [18]:
## Split text into sentences
sentence = sent_tokenize(text)
print(sentence)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
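
For contrast, the sketch below tokenizes the same text with the standard library only. It is a deliberately naive baseline, not a substitute for NLTK: whitespace splitting leaves punctuation attached to words, and splitting sentences on every period breaks after the abbreviation `Dr.`.

```python
import re

text = ("Dr. Smith graduated from the University of Washington. "
        "He later started an analytics firm called Lux, which catered to enterprise customers.")

# Naive word tokenization: punctuation stays attached ("Washington.", "Lux,").
naive_words = text.split()

# Naive sentence tokenization: split on whitespace that follows a period.
naive_sentences = re.split(r"(?<=\.)\s+", text)

print(naive_words[:8])
print(len(naive_sentences), naive_sentences)
```

The naive splitter yields three pieces where `sent_tokenize` yields two, which is the main reason to prefer a trained tokenizer for text that contains abbreviations.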


# Stop Words
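Stop words are words that contribute little to the meaning of a text, such as "the" and "are". Removing them reduces the complexity of later text processing.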

In [19]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ChihYing\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
# import statements
from nltk.corpus import stopwords

In [21]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)
# Normalize text
text = text.lower()
text = re.sub(r'[^0-9a-zA-Z_]', ' ', text)
# Tokenize text
words = word_tokenize(text)
print()
print(words)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [22]:
# Remove stop words
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']
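
The filtering step itself does not depend on NLTK. The sketch below uses a small hand-picked set of stop words (an assumption for illustration, much shorter than NLTK's English list) to show the same list comprehension on a few tokens. Storing the stop words in a `set` keeps each membership test cheap.

```python
# A small hand-picked subset of stop words, for illustration only.
stop_words = {"the", "you", "it", "at", "and", "will", "your", "of", "are", "who", "is", "a"}

tokens = ["the", "first", "time", "you", "see", "the", "second", "renaissance"]

# Keep only the tokens that are not stop words, preserving their order.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```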


The English stop word list used above is defined as follows:

In [23]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '