# HTML Site Parsing (using BeautifulSoup module) with UMedia
- This tutorial walks through parsing HyperText Markup Language (HTML) pages with the BeautifulSoup Python module, using BTAA's UMedia portal site as the example.
- It covers how to install the BeautifulSoup module, scan and list web pages, return titles, descriptions, and dates, and write the results to CSV format.

## 1. Install the BeautifulSoup Module and Import Necessary Modules

- **Use this section if you don't have the BeautifulSoup Module installed**

In [19]:
# Use one of the following (the package is named beautifulsoup4 and is imported as bs4):

# pip install beautifulsoup4

# or

# conda install beautifulsoup4

# Note: `pip install BeautifulSoup` collects the legacy BeautifulSoup-3.2.2 release,
# which fails during `python setup.py egg_info` under Python 3 with this message:
#   "You're trying to run a very old release of Beautiful Soup under Python 3. This will not work."
#   "Please use Beautiful Soup 4, available through the pip package 'beautifulsoup4'."


- **List of modules required to parse (scrape) websites, populate a JSON list, return the required fields, and print the results to a CSV**

In [7]:
import collections # Alternatives to the built-in data types, such as named tuples, ordered dictionaries, and deques.
import requests # Sends HTTP/1.1 requests and lets you add headers, form data, and multipart files.
from urllib import request # Opens URLs using only the standard library.
import csv # Classes for reading and writing CSV files.
import json # Encodes Python objects as JSON and decodes JSON into Python objects.
import os # Operating-system-dependent functionality, such as creating, moving, and deleting files and directories.
from bs4 import BeautifulSoup # Parses HTML and XML documents and lets you navigate, search, and modify the parse tree.

## 2. Set the Folder Path and the Name for the CSV

- **Example parameters for setting the folder path and the name of CSV**

In [2]:
HTML_path = r'/Users/Thenewsguy/Documents/GitHub/harvesting-guide/docs/1-Tutorials/T-03_parsing-html-beautiful-soup' # point to the folder path
csv_name = "HTMLCSV" # name for the csv to be created

print("drive files")

drive files
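
The two parameters above can be combined into a full output path. The following is a minimal sketch, assuming the CSV should live in a subfolder named after `csv_name` (the paths used in Section 4 suggest this layout). The helper name `build_csv_path` and the file name `results.csv` are hypothetical and not part of the original tutorial.

```python
import os


def build_csv_path(folder, subfolder, file_name):
    """Return folder/subfolder/file_name, creating the subfolder if needed."""
    target_dir = os.path.join(folder, subfolder)
    os.makedirs(target_dir, exist_ok=True)  # No error if the folder already exists
    return os.path.join(target_dir, file_name)
```

With the parameters from this section, a call such as `build_csv_path(HTML_path, csv_name, "results.csv")` would return the path to pass to `open()`.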


## 3. Return Title, Descriptions and Dates 

- **Each cell below returns one element (the title, the HTML breakdown, or a date) to illustrate how the BeautifulSoup module and web scraping work, using the UMedia website https://umedia.lib.umn.edu/**

In [6]:
# Title Section
from bs4 import BeautifulSoup

url = "https://umedia.lib.umn.edu/"
html = request.urlopen(url).read().decode('utf8') # Download the page and decode the bytes to text

soup = BeautifulSoup(html, 'html.parser') # Parse with Python's built-in HTML parser
title = soup.find('title') # First <title> tag in the document

print(title) # Prints the tag
print(title.string) # Prints the tag string content

<title>UMedia</title>
UMedia
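
The same `find` pattern can return a description as well as a title. The sketch below assumes the description is stored in a `<meta name="description">` tag, which is a common convention but is not confirmed for UMedia pages. The sample HTML is hypothetical so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real UMedia markup may differ.
sample_html = """
<html>
  <head>
    <title>Sample Item</title>
    <meta name="description" content="A short description of the item.">
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.find("title").string  # Text inside the <title> tag

# find() returns None when no tag matches, so check before reading attributes
meta_tag = soup.find("meta", attrs={"name": "description"})
page_description = meta_tag.get("content") if meta_tag is not None else None

print(page_title)
print(page_description)
```

Checking for `None` keeps the cell from raising an `AttributeError` on pages that have no description tag.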


In [7]:
# HTML Breakdown
url = "https://umedia.lib.umn.edu/"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the HTML content (the "lxml" parser requires the separate lxml package;
# "html.parser" works without it)
soup = BeautifulSoup(html_content, "lxml")

print(soup.prettify()) # Print the parsed HTML with one tag per line, indented by depth

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   UMedia
  </title>
  <meta content="authenticity_token" name="csrf-param"/>
  <meta content="qSwwoKx2Tl9w13cFZQWYYeASNxnNOx9hhiDbcUzGfdJHHgn5UG4zNc2DyIkUZr3Dpo0VRwlYy9yLMVnEB8unpQ==" name="csrf-token"/>
  <meta name="csp-nonce"/>
  <link data-turbolinks-track="reload" href="/assets/application-34a020d3ce0cc8e2448ab330e9fb076350a54f043e099b4874c32968e9a028e9.css" media="all" rel="stylesheet"/>
  <script data-turbolinks-track="reload" src="/assets/application-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.js">
  </script>
  <script src="/packs/js/application-66740e8fbe3d6198f60f.js">
  </script>
  <script src="/packs/js/thumbnailer-895235c547f3aa7c1823.js">
  </script>
  <script src="/packs/js/details-toggle-582a99a7dc7aa4ff769a.js">
  </script>
  <script src="/packs/js/google_events_tracking-2c296b143c1c7ebd734b.js">
  </script>
  <!--script src="https://cdnapisec.kaltura.com/p/1369852/sp/136985200/embedIframeJs/uico
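
Once a page is parsed, its links can be listed, which is one way to scan for further pages to visit. This is a minimal sketch on a hypothetical HTML fragment. The `/item/example-1` path is invented for illustration and does not describe the real UMedia URL scheme.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://umedia.lib.umn.edu/"

# Hypothetical fragment standing in for a downloaded page
sample_html = """
<ul>
  <li><a href="/item/example-1">First item</a></li>
  <li><a href="https://example.org/elsewhere">External page</a></li>
  <li><a>No href here</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = []
for anchor in soup.find_all("a", href=True):  # Skip <a> tags without an href
    absolute_url = urljoin(base_url, anchor["href"])  # Resolve relative paths
    links.append((anchor.get_text(strip=True), absolute_url))

for text, link in links:
    print(text, link)
```

`urljoin` leaves absolute URLs unchanged and resolves relative ones against `base_url`, so every entry in `links` can be requested directly.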

In [8]:
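# Date Section
# Sample HTML snippet containing a <time> tag with a datetime attribute
s = '''<time class="jlist_date_image" datetime="2015-04-02 14:30:12">Idag <span class="list_date">14:30</span></time>'''
soup = BeautifulSoup(s, 'html.parser') # Name the parser explicitly to avoid a warning
for i in soup.find_all('time'): # find_all is the current name for findAll
    if i.has_attr('datetime'):
        print(i['datetime'])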

2015-04-02 14:30:12
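
The `datetime` attribute is returned as plain text. If the date needs to be sorted, compared, or reformatted, it can be converted with the standard library. This is a minimal sketch, assuming every value follows the `YYYY-MM-DD HH:MM:SS` layout printed above; other layouts would need a different format string.

```python
from datetime import datetime

raw_value = "2015-04-02 14:30:12"  # The datetime attribute printed above

# Convert the text into a datetime object using the matching format codes
parsed = datetime.strptime(raw_value, "%Y-%m-%d %H:%M:%S")

print(parsed.date().isoformat())  # Date only, in ISO 8601 form
print(parsed.year)
```

`strptime` raises a `ValueError` when the text does not match the format, which makes malformed dates easy to spot.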


## 4. Print Results to CSV and Inspect

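- Read the results CSV, count how often each first-column value occurs, and write the rows whose value occurs at least four times to a new file for inspection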

In [9]:
# Read every row of the results CSV into a list
with open('/Users/Thenewsguy/Documents/GitHub/harvesting-guide/docs/1-Tutorials/T-03_parsing-html-beautiful-soup/HTMLCSV/JSONresults.csv', 'r', newline='') as f:
    data = list(csv.reader(f))

counter = collections.defaultdict(int) # Counts how many times each first-column value occurs
for row in data:
    counter[row[0]] += 1 # Add one to the tally for this row's first-column value

# Write only the rows whose first-column value occurs at least 4 times.
# The rows are written in CSV format even though the file name ends in .json.
with open('/Users/Thenewsguy/Documents/GitHub/harvesting-guide/docs/1-Tutorials/T-03_parsing-html-beautiful-soup/HTMLCSV/DCAT01b-18163_20230302.json', 'w', newline='') as out_file:
    writer = csv.writer(out_file) # The with block closes the file when writing is done
    for row in data:
        if counter[row[0]] >= 4:
            writer.writerow(row)
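
The cell above filters an existing CSV. To write freshly parsed titles, descriptions, and dates to CSV in the first place, `csv.DictWriter` is one option. This is a minimal sketch: the records, the column names, and the helper `write_records` are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical records; in practice these would come from the parsing steps
records = [
    {"title": "Sample Item", "description": "A short description.", "date": "2015-04-02"},
    {"title": "Another Item", "description": "Contains, a comma.", "date": "2016-01-15"},
]


def write_records(rows, file_obj):
    """Write a header row followed by one CSV row per record."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "description", "date"])
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # In-memory stand-in for a file opened with newline=''
write_records(records, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass the file object to `write_records`. The writer quotes any field that contains a comma, so descriptions stay in a single column.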