<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture01-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawling with BeautifulSoup4 and the Wikipedia Python APIs to create a document collection

<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor      | Course | Academic Year    |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2023/2024   |

### Installing additional packages

In [12]:
%pip install wikipedia-api
%pip install beautifulsoup4
%pip install nltk



### Importing some basic required packages

In [13]:
import gzip
import string
import numpy as np
import requests
import regex as re

### Crawling content with BeautifulSoup4
#### Select a webpage, download its content, parse the HTML to extract the text

In [14]:
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/New_York_City')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

# Collect all <p> tags inside the element with class 'mw-body' (the article body)
all_p_items = soup.find(class_='mw-body').find_all('p')
print(len(all_p_items))
print(all_p_items[0])
print(all_p_items[0].get_text())
print('    ----    ')
print(all_p_items[1])
print(all_p_items[1].get_text())

161
<p class="mw-empty-elt">
</p>


    ----    
<p><b>New York</b>, often called <b>New York City</b><sup class="reference" id="cite_ref-12"><a href="#cite_note-12">[b]</a></sup> or simply <b>NYC</b>, is the <a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">most populous city</a> in the <a href="/wiki/United_States" title="United States">United States</a>. With a population of 8,804,190 distributed over 300.46 square miles (778.2 km<sup>2</sup>) in 2020, the city is the <a href="/wiki/List_of_United_States_cities_by_population_density" title="List of United States cities by population density">most densely populated</a> major city in the United States. NYC is more than twice as populous as <a href="/wiki/Los_Angeles" title="Los Angeles">Los Angeles</a>, the nation's second-most populous city. New York City is at the southern tip of <a href="/wiki/New_York_(state)" title="New York (state)">New York State</a> and is situated on
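The first `<p>` in the output above is an empty placeholder (`mw-empty-elt`), so indexing the paragraphs blindly can return empty text. The following is a minimal sketch of the same `find` / `find_all` / `get_text` calls that also skips empty paragraphs. It runs on a small hypothetical snippet (`sample_html`, not the real page) so that it works without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded Wikipedia page
sample_html = """
<html><body>
<div class="mw-body">
  <p class="mw-empty-elt">
  </p>
  <p><b>New York</b> is a <a href="/wiki/City">city</a>.</p>
  <p>Second paragraph.</p>
</div>
<div class="footer"><p>Not part of the article body.</p></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Restrict the search to the article body, as in the cell above
body = sample_soup.find(class_='mw-body')
paragraphs = [p.get_text().strip() for p in body.find_all('p')]

# Drop paragraphs that contain only whitespace
non_empty = [text for text in paragraphs if text]

print(len(paragraphs), len(non_empty))
print(non_empty[0])
```

On the real page the same filter can be applied to `all_p_items` before picking the first paragraph.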

In [15]:
punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation))) # Regex matching any punctuation
space_regex = re.compile(' +') # Regex matching runs of one or more spaces

text = punct_regex.sub(' ', soup.find(class_='mw-body').get_text())
text = space_regex.sub(' ', text).lower()  # convert to lowercase
lines = [
    line.strip()
    for line in text.split("\n")
    if line.strip() != "" # Skip empty lines
]
# Inspect the result: number of lines and a few samples
print(len(lines))
print(lines[0])
print(lines[1])
print(lines[1290])

2940
toggle the table of contents
new york city
consulate general of iceland new york culture consulate general of iceland new york archived from the original on february 5 2013 retrieved july 23 2023
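To see what the normalization above does to individual characters, here is a small sketch applied to a short made-up string. The helper name `normalize` and the sample text are illustrative assumptions. The sketch uses the standard-library `re` module, which provides the same `compile`, `escape` and `sub` calls used in the cell above.

```python
import re
import string

# Same two patterns as in the notebook cell
punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation)))
space_regex = re.compile(' +')

def normalize(text):
    """Replace punctuation with spaces, collapse spaces, lowercase, split into non-empty lines."""
    text = punct_regex.sub(' ', text)
    text = space_regex.sub(' ', text).lower()
    return [line.strip() for line in text.split("\n") if line.strip() != ""]

sample = "New York City's land\n\nhas been altered,  substantially!"
sample_lines = normalize(sample)
print(sample_lines)
```

Because the apostrophe is replaced by a space, `City's` becomes the two tokens `city` and `s`. This is one reason a bare `s` can appear among frequent tokens when counting words on text cleaned this way.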


In [23]:
######
# TODO: Open the Wikipedia page for New York and select a sentence. Can you find the line at which it appears?
######

f = 'substantially by human intervention'

for pos, line in enumerate(lines):
  if f in line:
    print(pos, line)


106 the city s land has been altered substantially by human intervention with considerable land reclamation along the waterfronts since dutch colonial times reclamation is most prominent in lower manhattan with developments such as battery park city in the 1970s and 1980s 140 some of the natural relief in topography has been evened out especially in manhattan 141


In [28]:
######
# TODO: Complete the code:
#   a) split a line into single words and compute the word frequency
#   b) compute the frequency of all words across all lines
#
# Try out: https://docs.python.org/3/library/collections.html#collections.Counter
#
######

from collections import Counter

print(len(lines[1290].split(' ')))
words = set( w for w in lines[1290].split(' '))
print(len(words))
print(words)

word_count = Counter(lines[1290].split(' '))
print(word_count)

#word_count.most_common(2)

word_count = Counter()
for line in lines:
  word_count.update(line.split(' '))

word_count.most_common(10)


25
19
{'consulate', 'culture', 'of', 'york', 'february', 'archived', 'new', 'retrieved', '5', '2013', 'on', 'general', 'iceland', 'original', '2023', 'july', 'from', '23', 'the'}
Counter({'consulate': 2, 'general': 2, 'of': 2, 'iceland': 2, 'new': 2, 'york': 2, 'culture': 1, 'archived': 1, 'from': 1, 'the': 1, 'original': 1, 'on': 1, 'february': 1, '5': 1, '2013': 1, 'retrieved': 1, 'july': 1, '23': 1, '2023': 1})


[('the', 2045),
 ('new', 1028),
 ('of', 977),
 ('york', 895),
 ('in', 800),
 ('and', 741),
 ('city', 695),
 ('retrieved', 457),
 ('s', 366),
 ('to', 347)]

#### Accessing Links in the page

In [None]:
all_a_items = soup.find(class_='mw-body').find_all('a')
print(len(all_a_items))
for a in all_a_items:
  href = a.get('href')
  if href is not None and href.startswith('/wiki/') :
    print(href)

In [None]:
######
# TODO: Create a dictionary of /wiki/ links, and count how many times they appear in the page, which are the top-5 most frequent links?
######
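One possible way to approach this exercise is sketched below with `collections.Counter`. The list `sample_hrefs` is a hypothetical stand-in for the values returned by `a.get('href')`, and the helper name `count_wiki_links` is an assumption made for illustration.

```python
from collections import Counter

def count_wiki_links(hrefs):
    """Count internal /wiki/ links, ignoring missing hrefs and any other kind of URL."""
    return Counter(
        href for href in hrefs
        if href is not None and href.startswith('/wiki/')
    )

# Hypothetical href values, including a missing one and non-/wiki/ links
sample_hrefs = [
    '/wiki/Manhattan', '/wiki/Brooklyn', '#cite_note-12', None,
    '/wiki/Manhattan', 'https://example.org/', '/wiki/Queens',
    '/wiki/Manhattan', '/wiki/Brooklyn',
]

link_count = count_wiki_links(sample_hrefs)
top_links = link_count.most_common(5)  # at most 5 (link, count) pairs
print(top_links)
```

In the notebook, the same helper could be called as `count_wiki_links(a.get('href') for a in all_a_items)`.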





In [None]:
######
# TODO: Pick the most frequent /wiki/ link from the above dictionary,
# download its page content and extract all links,
# do you find links in common ?
######
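A sketch of the two building blocks this exercise needs: turning a relative `/wiki/` link into a full URL that `requests.get` can download, and comparing the links of two pages with a set intersection. The link lists are hypothetical samples and the helper name `wiki_link_set` is an assumption; no page is downloaded here.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/wiki/New_York_City'

def wiki_link_set(hrefs):
    """Return the distinct internal /wiki/ links found in an iterable of href values."""
    return {h for h in hrefs if h is not None and h.startswith('/wiki/')}

# Resolve a relative link against the URL of the page it was found on
top_link = '/wiki/Manhattan'
next_url = urljoin(BASE_URL, top_link)
print(next_url)

# Hypothetical links extracted from the first and the second page
links_first_page = wiki_link_set(['/wiki/Manhattan', '/wiki/Brooklyn', '/wiki/Queens', None])
links_second_page = wiki_link_set(['/wiki/Brooklyn', '/wiki/Harlem', '/wiki/Queens', '#top'])

# Links that appear on both pages
common_links = links_first_page & links_second_page
print(sorted(common_links))
```

With the real data, `next_url` would be passed to `requests.get` and parsed with `BeautifulSoup` exactly as in the first crawling cell.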




### Extract content from Wikipedia with the Wikipedia APIs

In [None]:
import wikipediaapi
## EDIT below: put your project name and email (sent as the user agent and recorded in the Wikipedia logs)
wapi_text = wikipediaapi.Wikipedia('MyProjectName (name@studenti.univr.it)',
                                   'en',
                                   extract_format=wikipediaapi.ExtractFormat.WIKI)

In [None]:
page_py = wapi_text.page('New York City')
print("Page - Exists: {}".format( page_py.exists()))
print(len(page_py.summary))
print(len(page_py.text))
print(len(page_py.langlinks))
print(len(page_py.links))

In [None]:
print(page_py.summary[:140])
print("   ---   ")
print(page_py.text[-140:])
print("   ---   ")
print(sorted(page_py.langlinks.keys()))
print("   ---   ")
page_py_it = page_py.langlinks['it']
print(page_py_it.summary[:140])

In [None]:
links = page_py.links
for title in sorted(links.keys()):
    if len(title) > 4 : # filter on title length to reduce output
      continue
    print("{}".format(title))

In [None]:
test_pages = ['Addis Ababa',  'Tom Sawyer', 'Johannes Gutenberg']

In [None]:
from urllib.parse import quote

page_queue = [ wapi_text.page(tp) for  tp in test_pages ]
page_stored = {}
page_visited = set()
max_iterations = 50

while len(page_queue) > 0 and max_iterations > 0:
  _page = page_queue.pop()

  page_stored[_page.fullurl] = _page.summary

  page_visited.add(_page.fullurl)

  print(max_iterations, _page.title, _page.fullurl)
  max_iterations = max_iterations - 1

  for next_page in _page.links.values():
    try:
      # Keep only pages whose title is between 6 and 13 characters long
      if len(next_page.title) < 6 or len(next_page.title) > 13:
        continue # skip this page

      # Skip namespaced pages such as "Category:..." or "Help:..."
      if ':' in next_page.title :
        continue

      if next_page.fullurl in page_visited:
        continue # skip this page

      # otherwise
      page_queue.append(next_page)
    except Exception:
      print("Error retrieving", next_page.title)


print(len(page_stored))
print(page_stored.keys())

In [None]:
# Inspect one stored summary; the available keys depend on which pages the crawl reached
url, summary = next(iter(page_stored.items()))
print(url)
print(summary)

In [None]:
######
# TODO: Create the bag of words for all page summaries. Remember to convert the text to lowercase and remove punctuation
######
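One possible shape for this exercise is sketched below. The dictionary `sample_summaries` is a hypothetical stand-in for `page_stored` (URL mapped to summary text), and the helper name `bag_of_words` is an assumption. Punctuation is replaced with spaces, mirroring the regex-based cleaning used earlier in the notebook.

```python
import string
from collections import Counter

# Translation table that maps every punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def bag_of_words(text):
    """Lowercase the text, replace punctuation with spaces, and count the words."""
    return Counter(text.lower().translate(PUNCT_TABLE).split())

# Hypothetical summaries standing in for page_stored
sample_summaries = {
    'https://example.org/wiki/Page_A': "The river is wide. The river's banks are green.",
    'https://example.org/wiki/Page_B': "A wide, green valley; the valley is quiet!",
}

# One bag of words per page
bags = {url: bag_of_words(summary) for url, summary in sample_summaries.items()}

# Bag of words of the whole collection: the sum of the per-page counters
collection_bag = sum(bags.values(), Counter())
print(collection_bag.most_common(3))
```

Keeping one `Counter` per page as well as the collection-wide total makes it possible to compare how often a word occurs in a single document versus the whole collection.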

#### The following declaration extracts unparsed HTML instead of already parsed text

In [None]:
wapi_html = wikipediaapi.Wikipedia('MyProjectName (name@studenti.univr.it)',
                              'en',
                              extract_format=wikipediaapi.ExtractFormat.HTML)
page_py = wapi_html.page('New York City')  # use the HTML client declared above
print("Page - Exists: {}".format( page_py.exists()))
print(len(page_py.summary))


### Stemming and lemmatization

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("punkt_tab")  # newer NLTK releases load the tokenizer data from "punkt_tab"
nltk.download("wordnet")
nltk.download("omw-1.4")

In [None]:
# Initialize the Porter and Snowball (English) stemmers
ps = PorterStemmer()
sn = SnowballStemmer("english")

example_sentence = "Programming is an art and a job. Python programmers often tend to like programming in python because it's like english. This is a better language than many others an incredibly useful property that makes things easier. We call people who program in python pythonistas."

# Remove punctuation
example_sentence_no_punct = example_sentence.lower().translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)

# Perform stemming with both stemmers and print the results side by side
print("{0:20}{1:20}{2:20}".format("--Word--","--Porter--","--Snowball--"))
for word in word_tokens:
    print ("{0:20}{1:20}{2:20}".format(word, ps.stem(word), sn.stem(word)))


In [None]:

# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

# Other part-of-speech constants that can be passed as `pos`:
# wordnet.VERB
# wordnet.ADV
# wordnet.NOUN

# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in word_tokens:
   print ("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos=wordnet.ADJ))) # <- lemmatize as if they are all adjectives


In [None]:
######
# TODO: Text stemming and lemmatization with a wikipedia page summary
######
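A possible starting point for the stemming half of this exercise is sketched below. The string `sample_summary` is a made-up stand-in for a page summary such as `page_py.summary`, and tokens are produced with `str.split` so that the sketch needs no downloaded NLTK data. Lemmatization is left out here because it requires the WordNet data downloaded above; it can be applied to the same tokens with `wnl.lemmatize` as in the previous cell.

```python
import string
from nltk.stem import PorterStemmer, SnowballStemmer

# Hypothetical text standing in for a Wikipedia page summary
sample_summary = "Printing presses changed how cities shared books."

# Lowercase, strip punctuation, and split on whitespace
tokens = sample_summary.lower().translate(
    str.maketrans("", "", string.punctuation)
).split()

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# One (word, Porter stem, Snowball stem) triple per token
stems = [(word, porter.stem(word), snowball.stem(word)) for word in tokens]

print("{0:20}{1:20}{2:20}".format("--Word--", "--Porter--", "--Snowball--"))
for word, porter_stem, snowball_stem in stems:
    print("{0:20}{1:20}{2:20}".format(word, porter_stem, snowball_stem))
```

Stems such as `citi` for `cities` are not dictionary words. This is the main practical difference from lemmatization, which maps words to dictionary forms.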

