<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawling with Beautifulsoup4 and  Wikipedia Python APIs to create a document collection

<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Intestazione con loghi istituzionali" width="525"/>

| Docente      | Insegnamento | Anno Accademico    |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2023/2024   |

### Installing additional packages

In [1]:
%pip install wikipedia-api
%pip install beautifulsoup4



### Importing some basic required packages

In [2]:
import gzip
import string
import numpy as np
import requests
import regex as re

### Crawling content with Beautifulsoup4
#### Select a webpage, download its content, parse the HTML to extract the text

In [26]:
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/New_York_City')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

# Pull text from all instances of <p> tag within BodyText div
all_p_items = soup.find(class_='mw-body').find_all('p')
print(len(all_p_items))
print(all_p_items[0])
print(all_p_items[0].get_text())
print('    ----    ')
print(all_p_items[1])
print(all_p_items[1].get_text())

161
<p class="mw-empty-elt">
</p>


    ----    
<p><b>New York</b>, often called <b>New York City</b><sup class="reference" id="cite_ref-12"><a href="#cite_note-12">[b]</a></sup> or simply <b>NYC</b>, is the <a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">most populous city</a> in the <a href="/wiki/United_States" title="United States">United States</a>. With a population of 8,804,190 distributed over 300.46 square miles (778.2 km<sup>2</sup>) in 2020, the city is the <a href="/wiki/List_of_United_States_cities_by_population_density" title="List of United States cities by population density">most densely populated</a> major city in the United States. NYC is more than twice as populous as <a href="/wiki/Los_Angeles" title="Los Angeles">Los Angeles</a>, the nation's second-most populous city. New York City is at the southern tip of <a href="/wiki/New_York_(state)" title="New York (state)">New York State</a> and is situated on

In [25]:
punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation))) # Regex matching any punctuation
space_regex = re.compile(' +') # Regex matching whitespace

text = punct_regex.sub(' ', soup.find(class_='mw-body').get_text())
text = space_regex.sub(' ', text).lower()  # convert to lowercase
lines = [
    line.strip()
    for line in text.split("\n")
    if line.strip() != "" # Skip empty lines
]
# Store lines
print(len(lines))
print(lines[0])
print(lines[1])
print(lines[1290])

2940
toggle the table of contents
new york city
['new york often called new york city b or simply nyc is the most populous city in the united states with a population of 8 804 190 distributed over 300 46 square miles 778 2\xa0km2 in 2020 the city is the most densely populated major city in the united states nyc is more than twice as populous as los angeles the nation s second most populous city new york city is at the southern tip of new york state and is situated on one of the world s largest natural harbors the city comprises five boroughs each of which is coextensive with a respective county the five boroughs which were created in 1898 when local governments were consolidated into a single municipality are brooklyn kings county queens queens county manhattan new york county the bronx bronx county and staten island richmond county 11 new york city is a global city and a cultural financial high tech 12 entertainment and media center with a significant influence on commerce health care

In [None]:
######
# TODO: Open the wikipedia page for New York, select a sentence, can you find at which line it appears?
######







In [86]:
######
# TODO: Complete the code,
#   a) split a line in single words, compute word frequency
#   b) compute word frequency of all words across all lines
#
# Try out: https://docs.python.org/3/library/collections.html#collections.Counter
#
######

print(len(lines[1290].split(' ')))
words = set( w for w in lines[1290].split(' '))
print(len(words))
print(words)


25
19
{'culture', 'general', 'on', '2013', 'the', '23', 'original', 'consulate', 'iceland', 'february', 'archived', 'new', 'retrieved', '5', 'july', 'york', 'from', 'of', '2023'}


#### Accessing Links in the page

In [None]:
all_a_items = soup.find(class_='mw-body').find_all('a')
print(len(all_a_items))
for a in all_a_items:
  href = a.get('href')
  if href is not None and href.startswith('/wiki/') :
    print(href)

In [None]:
######
# TODO: Create a dictionary of /wiki/ links, and count how many times they appear in the page, which are the top-5 most frequent links?
######





In [None]:
######
# TODO: Pick the most frequent /wiki/ link from the above dictionary,
# download its page content and extract all links,
# do you find links in common ?
######




### Extract content from Wikipedia with the Wikipedia APIs

In [29]:
import wikipediaapi
## EDIT Down There: put your name and email for the Wikipedia logs
wapi_text = wikipediaapi.Wikipedia('MyProjectName (name@studenti.univr.it)',
                                   'en',
                                   extract_format=wikipediaapi.ExtractFormat.WIKI)

In [30]:
page_py = wapi_text.page('New York City')
print("Page - Exists: {}".format( page_py.exists()))
print(len(page_py.summary))
print(len(page_py.text))
print(len(page_py.langlinks))
print(len(page_py.links))

Page - Exists: True
5150
92129
247
2648


In [33]:
print(page_py.summary[:140])
print("   ---   ")
print(page_py.text[-140:])
print("   ---   ")
print(sorted(page_py.langlinks.keys()))
print("   ---   ")
page_py_it = page_py.langlinks['it']
print(page_py_it.summary[:140])

New York, often called New York City or simply NYC, is the most populous city in the United States. With a population of 8,804,190 distribut
   ---   
 145,000 NYC photographs at the Museum of the City of New York
"The New New York Skyline (interactive)". National Geographic. November 2015.
   ---   
['af', 'als', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'awa', 'ay', 'az', 'azb', 'ba', 'ban', 'bar', 'bat-smg', 'bcl', 'be', 'be-x-old', 'bg', 'bh', 'bi', 'bjn', 'bm', 'bn', 'bo', 'br', 'bs', 'bug', 'bxr', 'ca', 'cbk-zam', 'cdo', 'ce', 'ceb', 'ch', 'ckb', 'co', 'crh', 'cs', 'cu', 'cv', 'cy', 'da', 'dag', 'de', 'diq', 'dsb', 'dty', 'ee', 'el', 'eml', 'eo', 'es', 'et', 'eu', 'ext', 'fa', 'fi', 'fiu-vro', 'fj', 'fo', 'fr', 'frp', 'frr', 'fur', 'fy', 'ga', 'gag', 'gan', 'gcr', 'gd', 'gl', 'glk', 'gn', 'got', 'gu', 'gv', 'ha', 'hak', 'he', 'hi', 'hif', 'hr', 'hsb', 'ht', 'hu', 'hy', 'hyw', 'ia', 'id', 'ie', 'ig', 'ik', 'ilo', 'inh', 'io', 'is', 'it', 'ja', 'jam', 'jbo', 'jv', 

In [42]:
links = page_py.links
for title in sorted(links.keys()):
    if len(title) > 4 : # filter on title length to reduce output
      continue
    print("{}".format(title))

2000
AOL
CNBC
CNN
GDP
Gang
Jazz
Jews
Kyiv
Lima
Logo
Lyon
MSN
NOAA
NPR
NYC
Oslo
PBS
Port
Rome
Sic
WNET
WNYC


In [35]:
test_pages = ['Addis Ababa',  'Tom Sawyer', 'Johannes Gutenberg']

In [87]:
from urllib.parse import quote

page_queue = [ wapi_text.page(tp) for  tp in test_pages ]
page_stored = {}
page_visited = set()
max_iterations = 50

while len(page_queue) > 0 and max_iterations > 0:
  _page = page_queue.pop()


    page_stored[_page.fullurl] = _page.summary

    page_visited.add(_page.fullurl)

    print(max_iterations, _page.title, _page.fullurl)
    max_iterations = max_iterations - 1

    for next_page in _page.links.values():
      try:
        if len(next_page.title) < 6 and len(next_page.title) > 13:
          continue # skip this page

        if ':' in next_page.title :
          continue

        if _page.fullurl in page_visited:
          continue # skip this page

        # otherwise
        page_queue.append(next_page)
      except:
        print("Error retrieving", _page.title)


print(len(page_stored))
print(page_stored.keys())

50 Johannes Gutenberg https://en.wikipedia.org/wiki/Johannes_Gutenberg
49 Tom Sawyer https://en.wikipedia.org/wiki/Tom_Sawyer
48 Addis Ababa https://en.wikipedia.org/wiki/Addis_Ababa
3
dict_keys(['https://en.wikipedia.org/wiki/Johannes_Gutenberg', 'https://en.wikipedia.org/wiki/Tom_Sawyer', 'https://en.wikipedia.org/wiki/Addis_Ababa'])


In [83]:
page_stored['https://en.wikipedia.org/wiki/Victoria_Falls_Airport']

'Victoria Falls Airport (IATA: VFA, ICAO: FVFA) is an international airport serving the Victoria Falls tourism industry, and is 18 kilometres (11 mi) south of the town of Victoria Falls, Zimbabwe.'

#### The following declaration extract unparsed HTML instead of already parsed text

In [28]:
wapi_html = wikipediaapi.Wikipedia('MyProjectName (name@studenti.univr.it)',
                              'en',
                              extract_format=wikipediaapi.ExtractFormat.HTML)
page_py = wapi_text.page('New York City')
print("Page - Exists: {}".format( page_py.exists()))
print(len(page_py.summary))


NameError: ignored