# Scraping the web for text data

---

This file introduces some basic tools for collecting text data from the web. 

## Loading the necessary libraries



---


We will need `requests` to fetch pages from the website and `BeautifulSoup` to parse the HTML and extract the text.

In [None]:
from bs4 import BeautifulSoup as bs
import requests

## Establishing a connection to the website



---


First, we save the URL of the website.

Next, we request the page using `requests.get()`.

Finally, we can access the raw content of the page, as bytes, using the `.content` attribute.

In [None]:
url = "https://en.wikipedia.org/wiki/Epistemology"

link = requests.get(url)

html = link.content  # raw bytes of the response body (an attribute, not a method)

print(html)  # prints the raw, unparsed HTML; cleaning it up is the next step

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Epistemology - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"84593ce3-bb6b-4082-9611-99e3780bb38f","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Epistemology","wgTitle":"Epistemology","wgCurRevisionId":1122139928,"wgRevisionId":1122139928,"wgArticleId":9247,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Scots-language text","Articles containing French-language text","Articles containing Portuguese-language text","Articles containing 
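The `b'...'` prefix and the literal `\n` sequences in the output above appear because `.content` is a `bytes` object rather than a string. The sketch below uses a short hand-written stand-in for the page (an assumption for illustration, not the real response) to show what decoding does. With `requests`, the `.text` attribute of the response gives an already decoded string, and `raise_for_status()` can be called first to stop on HTTP errors.

```python
# A short stand-in for the bytes that `.content` returns (assumed sample, not the real page).
raw = (
    b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8"/>\n'
    b'<title>Epistemology - Wikipedia</title>\n</head>\n</html>'
)

# Decoding turns the bytes into an ordinary string, so "\n" becomes a real line break.
decoded = raw.decode("utf-8")

print(type(raw).__name__)       # bytes
print(type(decoded).__name__)   # str
print(decoded.splitlines()[4])  # <title>Epistemology - Wikipedia</title>
```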

## Applying `BeautifulSoup` to clean up the data



---



We create a "soup" object by parsing the raw web text (which contains HTML). 

The parsed structure can be printed in indented form using `.prettify()`.

In [None]:
soup = bs(html,'html.parser')

print(soup.prettify()[0:200])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Epistemology - Wikipedia
  </title>
  <script>
   document.documentElement.className="clie


In [None]:
# Same parse as above, printing a longer slice of the prettified HTML
soup = bs(html, "html.parser")
print(soup.prettify()[0:2000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Epistemology - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"84593ce3-bb6b-4082-9611-99e3780bb38f","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Epistemology","wgTitle":"Epistemology","wgCurRevisionId":1122139928,"wgRevisionId":1122139928,"wgArticleId":9247,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Scots-language text","Articles containing French-language text","Articles containing Portuguese-language text","Articles 
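Once the HTML is parsed, individual elements can be pulled out of the soup instead of printing everything. The sketch below runs on a small hand-written HTML sample (an assumption for illustration, not the real Wikipedia page) and uses `find_all()` to collect paragraphs and links. The names `sample_html`, `sample_soup`, `title_text`, `paragraphs`, and `links` are made up for this example.

```python
from bs4 import BeautifulSoup

# Small hand-written HTML sample (assumed for illustration).
sample_html = """
<html>
 <head><title>Epistemology - Wikipedia</title></head>
 <body>
  <h1>Epistemology</h1>
  <p>First paragraph.</p>
  <p>Second paragraph with a <a href="/wiki/Knowledge">link</a>.</p>
 </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> element
title_text = sample_soup.title.get_text()

# Text of every <p> element, in document order
paragraphs = [p.get_text() for p in sample_soup.find_all("p")]

# Target of every <a> element that has an href attribute
links = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(title_text)
print(paragraphs)
print(links)
```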

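If we want to see only the text without the HTML tags, we can use `.get_text()`. Note that the contents of `<script>` and `<style>` tags are kept, which is why the output below still contains JavaScript.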

In [None]:
text = soup.get_text()  # tags are gone, but script contents remain; further cleanup is needed

In [None]:
print(text[:200])





Epistemology - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy
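One common way to get rid of the leftover JavaScript is to delete the `<script>` and `<style>` elements from the soup before calling `.get_text()`. The following is a minimal sketch on a hand-written sample page. `visible_text` is a hypothetical helper name, and real pages may need further site-specific cleanup.

```python
from bs4 import BeautifulSoup

# Hand-written sample page (assumed for illustration).
sample_html = """<html><head><title>Demo page</title>
<script>document.documentElement.className="client-js";</script>
<style>body { margin-left: 20%; }</style>
</head><body><p>Visible text.</p></body></html>"""


def visible_text(html):
    """Return the page text with script/style contents and blank lines removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code or styling rather than prose.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Strip each line and drop the empty ones.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


print(visible_text(sample_html))
```

The same helper could be applied to the `html` variable from the Wikipedia request above.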


## Scraping using customized tools


---

For some websites, tools have already been developed to assist you in cleanly and efficiently scraping text. 

For example, the `wikipedia` library allows for easy scraping of any Wikipedia article. (There are better wiki tools, but this one is simple to use!)

In [None]:
!pip3 install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=b90e94e1a72a2caa320695f4b644fc735f60662af27d2968508a714a845624c1
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [None]:
import wikipedia

If you know the name of a Wikipedia article, you can pass it directly to the `wikipedia.page()` function.

To extract the text from any page you access, use its `.content` attribute.

In [None]:
wiki = wikipedia.page("Epistemology")
text = wiki.content

In [None]:
print(text[:200])

Epistemology ( (listen); from Ancient Greek  ἐπιστήμη (epistḗmē) 'knowledge', and  -logy), or the theory of knowledge, is the branch of philosophy concerned with knowledge. Epistemology is considered 


You can also search for possible articles. The result is a list of Wikipedia article titles that you can then use with `.page` to retrieve the content.

In [None]:
homer_search = wikipedia.search("Homer")

print(homer_search)

['Homer', 'Homer Simpson', 'Home run', 'Homer Hickam', 'Winslow Homer', 'Homer, Alaska', 'Odyssey', 'Homer (unit)', 'Home', 'Mark Homer']


In [None]:
lets_search=wikipedia.search("Mahesh Babu")
print(lets_search)

['Mahesh Babu', 'Mahesh Babu filmography', 'Ramesh Babu', 'G. Mahesh Babu Entertainment', 'Sudheer Babu', 'Sarkaru Vaari Paata', 'Maharshi (2019 film)', 'Sarileru Neekevvaru', 'List of awards and nominations received by Mahesh Babu', 'Srimanthudu']


We're interested in the best Homer, so we go with the second element of the list. 

In [None]:
page_title = homer_search[1]

homer_page = wikipedia.page(page_title).content

print(homer_page[:200])

Homer Jay Simpson is a fictional character and one of the two main protagonists, alongside his son Bart, in the American animated sitcom The Simpsons. He is voiced by Dan Castellaneta and first appear


In [None]:
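hero_title = lets_search[1]  # 'Mahesh Babu filmography'
# TODO: check why passing the title string in directly does not work.
# (page() defaults to auto_suggest=True, which can alter the title that is looked up;
# passing auto_suggest=False is one thing to try.)
extract_page = wikipedia.page(hero_title).content
print(extract_page[:200])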

Mahesh Babu is an Indian actor, producer, narrator known for his work in Telugu cinema. He first appeared in the 1979 film Needa when he was four years old. He continued to perform as a child actor in
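Picking a result by its position in the list is fragile, because the order of search results can change over time. A possible alternative, sketched below, is to look for an exact title match. `pick_title` is a hypothetical helper written for this example, and the list is copied from the search output shown earlier.

```python
def pick_title(results, wanted):
    """Return the search result equal to `wanted` (case-insensitive), or None."""
    wanted_lower = wanted.lower()
    for title in results:
        if title.lower() == wanted_lower:
            return title
    return None


# Search results copied from the "Homer" search output in this notebook.
homer_results = ['Homer', 'Homer Simpson', 'Home run', 'Homer Hickam', 'Winslow Homer',
                 'Homer, Alaska', 'Odyssey', 'Homer (unit)', 'Home', 'Mark Homer']

chosen = pick_title(homer_results, "homer simpson")
print(chosen)

# No exact match gives None, which the caller can handle explicitly.
print(pick_title(homer_results, "Homer the poet"))
```

The chosen title can then be passed on, for example as `wikipedia.page(chosen, auto_suggest=False)`, so that the library looks up exactly that title.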


## More practice with other kinds of websites

---



In [None]:
url = "https://www.gutenberg.org/cache/epub/51355/pg51355.txt"
url_html = "https://www.gutenberg.org/files/51355/51355-h/51355-h.htm"

# Compare raw-text and HTML pages for success of BeautifulSoup parsing
page = requests.get(url)
page_html = requests.get(url_html)

homer_html = page_html.content
homer_plain = page.content

homer_html_soup = bs(homer_html, 'html.parser')

homer_html_text = homer_html_soup.get_text()




In [None]:
print(homer_html_text[:1000])







The Project Gutenberg eBook of The Iliads of Homer, by Homer



body { margin-left: 20%;
       margin-right: 20%;
       text-align: justify }

h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:
normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}

h1 {font-size: 300%;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    letter-spacing: 0.12em;
    word-spacing: 0.2em;
    text-indent: 0em;}
h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em;}
h3 {font-size: 150%; margin-top: 2em;}
h4 {font-size: 120%;}
h5 {font-size: 110%;}

hr {width: 80%; margin-top: 2em; margin-bottom: 2em;}

div.chapter {page-break-before: always; margin-top: 4em;}

p {text-indent: 1em;
   margin-top: 0.25em;
   margin-bottom: 0.25em; }

.p2 {margin-top: 2em;}

p.poem {text-indent: 0%;
        margin-left: 10%;
        font-size: 90%;
        margin-top: 1em;
        margin-bottom: 1em; }

p.letter {text-indent: 0%;
          margi

In [None]:
print(homer_plain[:1000])

b'\xef\xbb\xbfThe Project Gutenberg eBook of The Iliads of Homer, by Homer\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.\r\n\r\nTitle: The Iliads of Homer\r\n\r\nAuthor: Homer\r\n\r\nTranslator: George Chapman\r\n\r\nRelease Date: March 4, 2016 [eBook #51355]\r\n[Most recently updated: March 8, 2021]\r\n\r\nLanguage: English\r\n\r\n\r\nProduced by: Phil Schempf\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ILIADS OF HOMER ***\r\n\r\ncover\r\n\r\n\r\n\r\n\r\nThe Iliads of Homer\r\nby Homer\r\n\r\nTranslated from the Greek\r\nby George Chapman\r\n\r\nLondon: Published
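The plain-text file is not HTML, so it does not need an HTML parser: the output above is simply undecoded bytes. The leading `\xef\xbb\xbf` is a UTF-8 byte-order mark and the `\r\n` pairs are Windows-style line endings. A minimal sketch of decoding it directly, using the first bytes copied from the output above:

```python
# First bytes of the plain-text file, copied from the output above.
raw = (
    b'\xef\xbb\xbfThe Project Gutenberg eBook of The Iliads of Homer, by Homer\r\n\r\n'
    b'This eBook is for the use of anyone anywhere in the United States and\r\n'
)

# "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if one is present.
decoded = raw.decode("utf-8-sig")

# Normalise Windows-style line endings to plain "\n".
clean = decoded.replace("\r\n", "\n")

print(clean.splitlines()[0])
```

In the notebook, the same two steps could be applied to `homer_plain`.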

In [None]:
homer_plain_clean = bs(homer_plain, "html.parser").get_text()
print(homer_plain_clean[:1000])

The Project Gutenberg eBook of The Iliads of Homer, by Homer

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Iliads of Homer

Author: Homer

Translator: George Chapman

Release Date: March 4, 2016 [eBook #51355]
[Most recently updated: March 8, 2021]

Language: English


Produced by: Phil Schempf

*** START OF THE PROJECT GUTENBERG EBOOK THE ILIADS OF HOMER ***

cover




The Iliads of Homer
by Homer

Translated from the Greek
by George Chapman

London: Published by
Simpkin, Marshall, Hamilton, Kent & Co. Ltd.

New York: published by
Charles Scr
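The cleaned text still begins with Project Gutenberg's header. One way to keep only the book itself is to cut at the marker lines. The sketch below is built on assumptions: the `*** START OF ...` marker appears in the output above, but the matching `*** END OF ...` marker is assumed rather than shown in this notebook, and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if they are present."""
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"  # assumed, not shown above

    start = text.find(start_marker)
    if start != -1:
        # Skip the rest of the marker line itself.
        newline = text.find("\n", start)
        text = text[newline + 1:] if newline != -1 else ""

    end = text.find(end_marker)
    if end != -1:
        text = text[:end]

    return text.strip()


# Hand-written miniature of the file layout (assumed for illustration).
sample = """Title: The Iliads of Homer

*** START OF THE PROJECT GUTENBERG EBOOK THE ILIADS OF HOMER ***

The Iliads of Homer
by Homer

*** END OF THE PROJECT GUTENBERG EBOOK THE ILIADS OF HOMER ***
License text follows here.
"""

body = strip_gutenberg_boilerplate(sample)
print(body)
```

If neither marker is found, the function returns the input unchanged apart from surrounding whitespace.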