# Data Science Recap

Example from the Coursera Data Science course: labs/DP0701EN/Webscraping postal codes of Canada-Part 1 2 and 3.ipynb

  * https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857
  * https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319

  * https://www.dataquest.io/blog/web-scraping-beautifulsoup/
  * https://www.datacamp.com/community/tutorials/web-scraping-python-nlp
  * https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
  * https://www.datacamp.com/community/tutorials/web-scraping-using-python


## Introduction to Web Scraping with BeautifulSoup
*How to use web scraping to get information from a Wikipedia page*

https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857

  * Requests
  * **Beautiful Soup**
  * Scrapy
  * Selenium

### Install BeautifulSoup4 Python package

In [1]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.
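
After the install (and a kernel restart, if needed), a quick check can confirm that the package is importable. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical and not part of the original notebook. Note that the `beautifulsoup4` package is imported under the name `bs4`.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None

# the beautifulsoup4 package is imported as "bs4"
print("bs4 available:", is_installed("bs4"))
```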


### Inspect the website using developer tools (F12)


<img src = "wiki devtools.png" width = 400 align = 'left'> <img src = "wiki devtools toclevel.png" width = 400 align = 'right'>


<hr>


### Import Libraries and Parse HTML

In [103]:
# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import re

In [104]:
# define the URL to scrape
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# connect to the website; urlopen raises URLError (or its subclass HTTPError) on failure
try:
    page = urllib.request.urlopen(url)
    print("Connection to", url, "successful")
except urllib.error.URLError as e:
    print("An error occurred:", e)

Connection to https://en.wikipedia.org/wiki/Artificial_intelligence successful
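
The connection step can also be wrapped in a small reusable function. The sketch below is one possible shape, not the notebook's own code: the name `fetch_html`, the `timeout` default, and the User-Agent string are all assumptions made for illustration. It separates HTTP error statuses from lower-level connection failures and returns `None` in both cases.

```python
import urllib.error
import urllib.request

def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    # hypothetical User-Agent string, chosen for illustration only
    request = urllib.request.Request(url, headers={"User-Agent": "toc-scraper-example/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # the server answered, but with an error status such as 404 or 503
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # no usable answer, e.g. DNS failure, refused connection, unknown scheme
        print("Connection failed:", e.reason)
    return None

# an unsupported URL scheme fails without touching the network
print(fetch_html("nosuchscheme://example"))
```

Calling `fetch_html(url)` with the Wikipedia URL above would return the raw page bytes, which can be passed to `BeautifulSoup` in the same way as the `page` object.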


In [105]:
# pass the page object to BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Artificial intelligence - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Artificial_intelligence","wgTitle":"Artificial intelligence","wgCurRevisionId":916327886,"wgRevisionId":916018922,"wgArticleId":1164,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia articles needing page number citations from December 2016","CS1 Chinese-language sources (zh)","Wikipedia articles needing page number citations from February 2011","Webarchive template wayback links","CS1 errors: missing periodical","Wikipedia pending changes protected pages","Articles with short description","All articles with unsourced statements","Articles with unsourced statements from June 2019","Articles with unsource

In [107]:
# find specific elements: all <li> items with a class starting with "tocsection-" (TOC level 1 and sublevels)
# regex testers: https://regexr.com/ or https://regex101.com/r/cO8lqs/1

regex = re.compile('^tocsection-')
#regex = r"toclevel-1"
content_lis = soup.find_all('li', attrs={'class': regex})

'''
[<li class="toclevel-1 tocsection-1"><a href="#History"><span class="tocnumber">1</span> <span class="toctext">History</span></a></li>, <li class="toclevel-1 tocsection-2"><a href="#Definitions"><span class="tocnumber">2</span> <span class="toctext">Definitions</span></a></li>, <li class="toclevel-1 tocsection-3"><a href="#Basics"><span class="tocnumber">3</span> <span class="toctext">Basics</span></a></li>, <li class="toclevel-1 tocsection-4"><a href="#Problems"><span class="tocnumber">4</span> <span class="toctext">Problems</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#Reasoning,_problem_solving"><span class="tocnumber">4.1</span> <span class="toctext">Reasoning, problem solving</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Knowledge_representation"><span class="tocnumber">4.2</span> <span class="toctext">Knowledge representation</span></a></li>
<li class="toclevel-2 tocsection-7"><a href="#Planning"><span class="tocnumber">4.3</span> <span class="toctext">Planning</span></a></li>
'''

'''
# alternative: find only the tocnumber spans (TOC level 1 and sublevels), numbers only, no text content
content_lis = soup.find_all('span', attrs={'class': 'tocnumber'})
'''

'''
[<span class="tocnumber">1</span>, <span class="tocnumber">2</span>, <span class="tocnumber">3</span>, <span class="tocnumber">4</span>, <span class="tocnumber">4.1</span>, <span class="tocnumber">4.2</span>, <span class="tocnumber">4.3</span>, <span class="tocnumber">4.4</span>, <span class="tocnumber">4.5</span>, <span class="tocnumber">4.6</span>, <span class="tocnumber">4.7</span>, <span class="tocnumber">4.8</span>,
'''

print(content_lis)

[<li class="toclevel-1 tocsection-1"><a href="#History"><span class="tocnumber">1</span> <span class="toctext">History</span></a></li>, <li class="toclevel-1 tocsection-2"><a href="#Definitions"><span class="tocnumber">2</span> <span class="toctext">Definitions</span></a></li>, <li class="toclevel-1 tocsection-3"><a href="#Basics"><span class="tocnumber">3</span> <span class="toctext">Basics</span></a></li>, <li class="toclevel-1 tocsection-4"><a href="#Problems"><span class="tocnumber">4</span> <span class="toctext">Problems</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#Reasoning,_problem_solving"><span class="tocnumber">4.1</span> <span class="toctext">Reasoning, problem solving</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Knowledge_representation"><span class="tocnumber">4.2</span> <span class="toctext">Knowledge representation</span></a></li>
<li class="toclevel-2 tocsection-7"><a href="#Planning"><span class="tocnumber">4.3</span> <span class="toct
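
Every `class` string in this output starts with `toclevel-`, yet the anchored pattern `^tocsection-` still matches. This works because BeautifulSoup treats `class` as a multi-valued attribute and tests each class name separately. The sketch below reproduces that per-name check with plain `re`. The helper `matching_classes` is hypothetical and only mimics the behaviour for illustration.

```python
import re

regex = re.compile('^tocsection-')

def matching_classes(class_attr):
    """Return the class names in a class attribute string that match the regex."""
    return [name for name in class_attr.split() if regex.search(name)]

# tested name by name, the second class matches the anchored pattern
print(matching_classes("toclevel-1 tocsection-1"))

# tested as one string, the anchored pattern does not match
print(bool(regex.search("toclevel-1 tocsection-1")))
```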

In [108]:
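# get the raw text of each entry; a parent <li> also contains the text of its
# nested sub-entries on later lines, so keep only the first line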

content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])
print(content)

['1 History', '2 Definitions', '3 Basics', '4 Problems', '4.1 Reasoning, problem solving', '4.2 Knowledge representation', '4.3 Planning', '4.4 Learning', '4.5 Natural language processing', '4.6 Perception', '4.7 Motion and manipulation', '4.8 Social intelligence', '4.9 General intelligence', '5 Approaches', '5.1 Cybernetics and brain simulation', '5.2 Symbolic', '5.2.1 Cognitive simulation', '5.2.2 Logic-based', '5.2.3 Anti-logic or scruffy', '5.2.4 Knowledge-based', '5.3 Sub-symbolic', '5.3.1 Embodied intelligence', '5.3.2 Computational intelligence and soft computing', '5.4 Statistical learning', '5.5 Integrating the approaches', '6 Tools', '6.1 Search and optimization', '6.2 Logic', '6.3 Probabilistic methods for uncertain reasoning', '6.4 Classifiers and statistical learning methods', '6.5 Artificial neural networks', '6.5.1 Deep feedforward neural networks', '6.5.2 Deep recurrent neural networks', '6.6 Evaluating progress', '7 Applications', '7.1 Healthcare', '7.2 Automotive', '7
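
The `split('\n')[0]` step matters for entries that have sub-entries. The sketch below shows this on a small fragment copied from the output above, so it runs without a network connection. The fragment is closed off by hand to make it valid on its own, and `get_text()` is the snake_case name of the `getText()` call used in the notebook.

```python
import re
from bs4 import BeautifulSoup

# TOC fragment copied from the output above, closed off so it is valid on its own
html = """
<ul>
<li class="toclevel-1 tocsection-4"><a href="#Problems"><span class="tocnumber">4</span> <span class="toctext">Problems</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#Reasoning,_problem_solving"><span class="tocnumber">4.1</span> <span class="toctext">Reasoning, problem solving</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Knowledge_representation"><span class="tocnumber">4.2</span> <span class="toctext">Knowledge representation</span></a></li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('li', attrs={'class': re.compile('^tocsection-')})

# the full text of the parent entry also includes its nested sub-entries
parent_text = items[0].get_text()

# keeping only the first line drops the nested sub-entries
titles = [li.get_text().split('\n')[0] for li in items]
print(titles)
```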

### Saving data


In [109]:
# write one TOC entry per line
with open('content.txt', 'w', encoding='utf-8') as f:
    for entry in content:
        f.write(entry + "\n")
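
To check that the file holds what was written, it can be read back and compared with the list. This is a minimal sketch under stated assumptions: `content` is a short stand-in for the real list, and the file goes into a temporary directory (a hypothetical location) so that no existing `content.txt` is overwritten.

```python
import os
import tempfile

# stand-in for the `content` list built above (first entries of the real output)
content = ['1 History', '2 Definitions', '3 Basics']

# hypothetical location: a temporary directory, so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'content.txt')

with open(path, 'w', encoding='utf-8') as f:
    for entry in content:
        f.write(entry + "\n")

# read the file back and strip the trailing newline from each line
with open(path, encoding='utf-8') as f:
    restored = [line.rstrip("\n") for line in f]

print(restored == content)
```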