# HTML Site Parsing (using the BeautifulSoup module) with UMedia
- This tutorial guides users through parsing HyperText Markup Language (HTML) sites using the BeautifulSoup Python module, with BTAA's UMedia portal as the example site.
- It covers how to install the BeautifulSoup module, scan and list web pages, return titles, descriptions, and dates, and write these to CSV format.

## 1. Install the BeautifulSoup Module and Import Necessary Modules

- **Use this section if you don't have the BeautifulSoup Module installed**

In [4]:
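# Use one of the following. The package is published as "beautifulsoup4";
# the bare name "BeautifulSoup" is not found, as the conda output below shows.

# pip install beautifulsoup4

# or

# conda install beautifulsoup4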

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - beautifulsoup

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://conda.anaconda.org/conda/osx-64
  - https://conda.anaconda.org/conda/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.



Note: you may need to restart the kernel to use updated packages.
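
Before importing, it can help to confirm that the packages are visible to the current kernel. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical. Note that BeautifulSoup is imported under the name `bs4`, even though it is installed as `beautifulsoup4`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is BeautifulSoup's import name; "lxml" is the optional parser used
# in section 4; "requests" is used to fetch pages.
for name in ("bs4", "lxml", "requests"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any module reported as missing can be installed with `pip` or `conda` before continuing.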

- **Modules required to parse (scrape) websites, populate a list from JSON, return the required fields, and write the results to a CSV**

In [10]:
import collections
import csv
import json
import os

import requests
from urllib import request  # provides request.urlopen, used in section 4
from bs4 import BeautifulSoup

## 2. Set the Folder Path and the Name for the CSV

- **Example parameters for setting the folder path and the name of the CSV**

In [7]:
HTML_path = r'/Users/Thenewsguy/Documents/GitHub/harvesting-guide/docs/1-Tutorials/HTMLCSV/jsons' # point to the folder path
csv_name = "HTMLCSV" # name for the csv to be created

print("drive files")

drive files
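
The folder path and CSV name are kept separate above, so they need to be joined before a file can be written. One possible way to do that is sketched below with `os.path.join`; the helper name `build_csv_path` and the folder `output_folder` are placeholders, not part of the original notebook.

```python
import os


def build_csv_path(folder, name):
    """Join a folder and a base name into a full path ending in .csv."""
    return os.path.join(folder, name + ".csv")


# Placeholder values for illustration; substitute your own folder and name.
example_path = build_csv_path("output_folder", "HTMLCSV")
print(example_path)
```

Using `os.path.join` rather than string concatenation keeps the path separator correct on both Windows and macOS/Linux.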


## 3. Open JSON File(s) and Populate an Empty List

- **Create an empty list, then fill it with the contents of JSON files that were scraped from open web portals**

In [8]:
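# Create an empty list to store the JSON data
JSONMetadata = []

# Build the path to the JSON file; this assumes the file sits in the HTML_path
# folder set in section 2. A FileNotFoundError means it is not at this path.
json_path = os.path.join(HTML_path, 'dateAdded_201712.json')

# Open the JSON file
with open(json_path) as f:
    # Load the JSON data into a Python dictionary
    data_dict = json.load(f)

    # Iterate over the dictionary's (key, value) pairs and append them to the list
    for item in data_dict.items():
        JSONMetadata.append(item)

# Print the populated list
print(JSONMetadata)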

FileNotFoundError: [Errno 2] No such file or directory: 'dateAdded_201712.json'
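
When a folder holds several scraped JSON files, the same idea can be extended to load all of them. The sketch below is one possible approach, not the notebook's own method: the helper `load_json_folder` is hypothetical, and the demo writes a made-up JSON file to a temporary folder so that it runs without any existing data.

```python
import glob
import json
import os
import tempfile


def load_json_folder(folder):
    """Load every *.json file in `folder` and return the parsed contents as a list."""
    records = []
    # sorted() keeps the file order stable between runs
    for path in sorted(glob.glob(os.path.join(folder, "*.json"))):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records


# Demo with a temporary folder and made-up content (not real UMedia data)
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "sample.json"), "w", encoding="utf-8") as f:
        json.dump({"title": "Sample Item"}, f)
    demo_records = load_json_folder(demo_folder)

print(demo_records)
```

A folder with no matching files yields an empty list rather than an error, which makes a missing or misnamed folder easy to spot by checking the list length.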

## 4. Return Titles, Descriptions, and Dates

- **Each cell below returns the title, the HTML breakdown, or the dates for the UMedia website https://umedia.lib.umn.edu/ to illustrate how the BeautifulSoup module and web scraping work**

In [11]:
url = "https://umedia.lib.umn.edu/"
# Fetch the page and decode the bytes to a string
html = request.urlopen(url).read().decode('utf8')

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title') # First <title> tag, or None if absent

print(title) # Prints the tag
print(title.string) # Prints the tag string content

<title>UMedia</title>
UMedia
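
The same `find` call can also look up a page description when it is stored in a `<meta name="description">` tag. The sketch below runs on made-up HTML, because whether the UMedia pages carry such a tag is an assumption that should be checked against the real markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; this is not the markup of the UMedia site.
sample_html = """
<html>
  <head>
    <title>Sample Item</title>
    <meta name="description" content="A short description of the item.">
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None when the tag is missing, so guard before reading it
title_tag = soup.find("title")
title = title_tag.get_text(strip=True) if title_tag else None

meta_tag = soup.find("meta", attrs={"name": "description"})
description = meta_tag.get("content") if meta_tag else None

print(title)
print(description)
```

Guarding against `None` avoids an `AttributeError` on pages where one of the tags is absent.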


In [12]:
url = "https://umedia.lib.umn.edu/"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the HTML content (the "lxml" parser requires the lxml package to be installed)
soup = BeautifulSoup(html_content, "lxml")

print(soup.prettify()) # Print the parsed HTML, indented with each tag on its own line

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   UMedia
  </title>
  <meta content="authenticity_token" name="csrf-param"/>
  <meta content="2m+6Ut7bqnAJ6NqDN/OCF4tCJdfKAd02k4sKgvXuhqTGyeXdd/+otuhmRo0Vw0WwoBr3xYLwBX3/58MBDdhEpg==" name="csrf-token"/>
  <meta name="csp-nonce"/>
  <link data-turbolinks-track="reload" href="/assets/application-fdd2b06ca1616f0045241f4a13c3cba966fcfeb9182621ff356cc1edcd8b4810.css" media="all" rel="stylesheet"/>
  <script data-turbolinks-track="reload" src="/assets/application-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.js">
  </script>
  <script src="/packs/js/application-a167eb968db527fdd849.js">
  </script>
  <script src="/packs/js/thumbnailer-895235c547f3aa7c1823.js">
  </script>
  <script src="/packs/js/details-toggle-582a99a7dc7aa4ff769a.js">
  </script>
  <script src="/packs/js/google_events_tracking-2c296b143c1c7ebd734b.js">
  </script>
  <script src="https://cdnapisec.kaltura.com/p/1369852/sp/136985200/embedIframeJs/uiconf_

In [13]:
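# Sample HTML containing a <time> tag with a datetime attribute
s = '''<time class="jlist_date_image" datetime="2015-04-02 14:30:12">Idag <span class="list_date">14:30</span></time>'''
soup = BeautifulSoup(s, 'html.parser')

# Print the datetime attribute of every <time> tag that has one
for i in soup.find_all('time'):
    if i.has_attr('datetime'):
        print(i['datetime'])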

2015-04-02 14:30:12
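
The attribute value comes back as a plain string. If the dates need to be sorted or reformatted before they go into a CSV, one option is to parse them with the standard library's `datetime.strptime`, as in this sketch. The format string is an assumption that matches the sample value above; other sites may use a different layout.

```python
from datetime import datetime

# Value copied from the `datetime` attribute printed in the cell above
raw_value = "2015-04-02 14:30:12"

# strptime raises ValueError if the string does not match the format
parsed = datetime.strptime(raw_value, "%Y-%m-%d %H:%M:%S")

print(parsed.date().isoformat())  # date portion only
print(parsed.year)
```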


## 5. Print Results to CSV and Inspect

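- Write the results to a CSV so they can be inspected. The cell below reads an existing CSV, counts how often each first-column value occurs, and writes out only the rows whose first-column value appears at least four times.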

In [None]:
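# Read the input CSV; in Python 3 the csv module needs text mode with newline=''
with open('thefile.csv', newline='') as f:
    data = list(csv.reader(f))

# Count how many rows share each first-column value
counter = collections.Counter(row[0] for row in data)

# Write to a file inside HTML_path (set in section 2); the original target was
# the folder itself, which cannot be opened for writing
csv_path = os.path.join(HTML_path, csv_name + '.csv')
with open(csv_path, 'w', newline='') as out:
    writer = csv.writer(out)
    # Keep only rows whose first-column value appears at least four times
    for row in data:
        if counter[row[0]] >= 4:
            writer.writerow(row)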
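
To write scraped titles, descriptions, and dates with a header row, `csv.DictWriter` is one convenient option. The sketch below uses made-up records and assumed field names (`title`, `description`, `date`), and writes to an in-memory buffer so it runs without touching the file system; the helper `write_records` is hypothetical.

```python
import csv
import io

# Made-up records for illustration; the field names are assumptions
records = [
    {"title": "Sample Item A", "description": "First example", "date": "2015-04-02"},
    {"title": "Sample Item B", "description": "Second example", "date": "2016-01-15"},
]


def write_records(records, fieldnames, out_file):
    """Write a header row followed by one row per record."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


# io.StringIO stands in for a real file here; for a file on disk, use
# open(path, "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_records(records, ["title", "description", "date"], buffer)
print(buffer.getvalue())
```

Opening the resulting file in a spreadsheet program or printing its first few lines is a quick way to inspect the output.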