kinds.or.kr_bs4

notes on scraping dynamic news db

For dynamic sites that rely on a lot of busy JavaScript libraries and frameworks, scraping is sometimes easier if you download each URL as an HTML file in a local directory. Assuming you have a directory of HTML files for pages that won't scrape by connecting to the live site, saved in a subdirectory of your ~/Downloads directory (e.g. /Users/username/Downloads/html-to-scrape), you can use Beautiful Soup 4 to extract the content into something more structured, like a CSV.

So, for example: save the difficult pages manually, or come up with an automated way to save these hard-to-scrape pages in bulk (one possible automated approach is sketched below). For the following page, I just loaded it in Chrome and saved it as a webpage: https://www.kinds.or.kr/v2/news/newsDetailView.do?newsId=01101001.19960316000000306 and put it in a subdirectory of /Users/username/Downloads named 'html-to-scrape'.
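
If you want to automate that bulk-saving step, one option (not part of the original notes) is to drive a real browser so the page's JavaScript has a chance to run before you save it. Here is a minimal sketch using Selenium with headless Chrome, assuming Selenium 4 and chromedriver are installed and that you keep the URLs in a text file, one per line, here called urls.txt (the filename below is just a suggestion):

snip: filename: save-pages-selenium.py

import os
import time
from selenium import webdriver

# Where the rendered pages will be saved
save_dir = '/Users/username/Downloads/html-to-scrape'
os.makedirs(save_dir, exist_ok=True)

# Start headless Chrome so the page's JavaScript actually runs
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)

# Read the list of URLs, one per line
with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

# Load each URL, give the scripts a moment to render, then save the page source
for i, url in enumerate(urls):
    driver.get(url)
    time.sleep(3)  # crude wait; adjust if pages render slowly
    with open(os.path.join(save_dir, f'page_{i}.html'), 'w', encoding='utf-8') as out:
        out.write(driver.page_source)

driver.quit()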

Once you've reviewed the files and picked out the HTML tags you want to organize into a CSV, you can pull the content of the tags used on this particular page; the same pattern also makes it easy to add more tags as needed.
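
Before running the full loop, it can help to confirm the selectors against one saved file. A quick check along these lines (example.html below is just a placeholder for one of your saved files) prints what each lookup finds:

snip: filename: check-one-file.py

from bs4 import BeautifulSoup

# Open a single saved page and parse it
with open('/Users/username/Downloads/html-to-scrape/example.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# <title> tag
print(soup.title.text if soup.title else 'no <title> found')

# Date: <li class="info_date">
date_tag = soup.find('li', {'class': 'info_date'})
print(date_tag.text.strip() if date_tag else 'no li.info_date found')

# Body: <div class="doc_desc" id="content">
body_tag = soup.find('div', {'class': 'doc_desc', 'id': 'content'})
print(body_tag.text.strip()[:200] if body_tag else 'no div#content.doc_desc found')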

snip: filename: kinds.or.kr_bs4.py

import os
import csv
from bs4 import BeautifulSoup

##### SET THIS PATH TO YOUR PARTICULAR DIRECTORY OF SAVED HTML FILES
html_dir_path = '/Users/username/Downloads/html-to-scrape'

# SET THE NAME OF THE OUTPUT CSV FILE
csv_file_name = 'output.csv'

##### ADD HEADER INFO HERE BASED ON TAGS YOU WANT TO EXTRACT
#### CREATE LISTS TO HOLD THE EXTRACTED TITLES, DATES, BODIES, AND FILENAMES
titles = []
dates = []
bodies = []
filenames = []

# LOOP THROUGH EACH FILE IN THE DIRECTORY
for filename in os.listdir(html_dir_path):
    # CHECK IF THE FILE IS AN HTML FILE
    if filename.endswith('.html'):
        # OPEN THE HTML FILE WITH UTF-8 ENCODING
        with open(os.path.join(html_dir_path, filename), 'r', encoding='utf-8') as file:
            # PARSE THE HTML USING BEAUTIFUL SOUP 4
            soup = BeautifulSoup(file.read(), 'html.parser')
            ##### BELOW YOU WILL ADD OR REMOVE TAGS YOU NEED FROM YOUR DIFFICULT HTML
            # FIND THE TITLE TAG AND EXTRACT ITS TEXT
            title = soup.title.text if soup.title else ''
            ##### AND ADD WHAT GETS FOUND TO THE LIST OF HEADERS ABOVE
            # ADD THE TITLE TO THE LIST OF TITLES
            titles.append(title)
            # FIND THE DATE TAG AND EXTRACT TEXT
            date = soup.find('li', {'class': 'info_date'}).text.strip() if soup.find('li', {'class': 'info_date'}) else ''
            # ADD THE TEXT FOUND IN THE DATE-RELATED TAG TO THE DATES LIST
            dates.append(date)
            # FIND THE DIV TAG WITH CLASS 'DOC_DESC' AND ID 'CONTENT', AND EXTRACT ITS TEXT
            body = soup.find('div', {'class': 'doc_desc', 'id': 'content'}).text.strip() if soup.find('div', {'class': 'doc_desc', 'id': 'content'}) else ''
            # ADD THE BODY TO THE LIST OF BODIES
            bodies.append(body)
            # ADD THE FILENAME TO THE LIST OF FILENAMES
            filenames.append(filename)
            
# WRITE THE EXTRACTED FILENAMES, TITLES, DATES, AND BODIES TO A CSV FILE WITH UTF-8 ENCODING
with open(csv_file_name, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['file', 'title', 'date', 'body'])
    for filename, title, date, body in zip(filenames, titles, dates, bodies):
        writer.writerow([filename, title, date, body])
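
Once the script has run, it's worth spot-checking output.csv before doing anything else with it. One quick way (pandas isn't used elsewhere in these notes, so this is optional) is:

snip: filename: check-output.py

import pandas as pd

df = pd.read_csv('output.csv')
print(df.head())         # first few rows
print(df.isna().sum())   # empty cells per column, i.e. pages where a tag wasn't found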

If you collect all the difficult URLs in a text file (urls.txt) with one URL per line, the script below downloads them first with requests and then scrapes the saved files with Beautiful Soup 4. (If a page only renders its content through JavaScript, requests will only see the raw HTML the server sends, so spot-check the saved files.)

snip: filename: auto-add-urls-then-scrape.py

import os
import csv
import requests
from bs4 import BeautifulSoup

# Set the path of the directory to save the downloaded HTML files
html_dir_path = '/Users/username/Downloads/html-to-scrape/'

# Set the name of the output CSV file
csv_file_name = 'output.csv'

# Create lists to hold the extracted titles, dates, bodies, and filenames
titles = []
dates = []
bodies = []
filenames = []

# Read the list of URLs from the urls.txt file
with open('urls.txt', 'r') as f:
    urls = [url.strip() for url in f.readlines()]


# Download each URL and save it into the directory as an HTML file
os.makedirs(html_dir_path, exist_ok=True)
for i, url in enumerate(urls):
    response = requests.get(url)
    response.raise_for_status()
    response.encoding = response.apparent_encoding  # help requests decode the Korean text correctly
    with open(os.path.join(html_dir_path, f'page_{i}.html'), 'w', encoding='utf-8') as out_file:
        out_file.write(response.text)

# LOOP THROUGH EACH FILE IN THE DIRECTORY
for filename in os.listdir(html_dir_path):
    # CHECK IF THE FILE IS AN HTML FILE
    if filename.endswith('.html'):
        # OPEN THE HTML FILE WITH UTF-8 ENCODING
        with open(os.path.join(html_dir_path, filename), 'r', encoding='utf-8') as file:
            # PARSE THE HTML USING BEAUTIFUL SOUP 4
            soup = BeautifulSoup(file.read(), 'html.parser')
            ##### BELOW YOU WILL ADD OR REMOVE TAGS YOU NEED FROM YOUR DIFFICULT HTML
            # FIND THE TITLE TAG AND EXTRACT ITS TEXT
            title = soup.title.text if soup.title else ''
            ##### AND ADD WHAT GETS FOUND TO THE LIST OF HEADERS ABOVE
            # ADD THE TITLE TO THE LIST OF TITLES
            titles.append(title)
            # FIND THE DATE TAG AND EXTRACT TEXT
            date = soup.find('li', {'class': 'info_date'}).text.strip() if soup.find('li', {'class': 'info_date'}) else ''
            # ADD THE TEXT FOUND IN THE DATE-RELATED TAG TO THE DATES LIST
            dates.append(date)
            # FIND THE DIV TAG WITH CLASS 'DOC_DESC' AND ID 'CONTENT', AND EXTRACT ITS TEXT
            body = soup.find('div', {'class': 'doc_desc', 'id': 'content'}).text.strip() if soup.find('div', {'class': 'doc_desc', 'id': 'content'}) else ''
            # ADD THE BODY TO THE LIST OF BODIES
            bodies.append(body)
            # ADD THE FILENAME TO THE LIST OF FILENAMES
            filenames.append(filename)
            
# WRITE THE EXTRACTED FILENAMES, TITLES, DATES, AND BODIES TO A CSV FILE WITH UTF-8 ENCODING
with open(csv_file_name, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['file', 'title', 'date', 'body'])
    for filename, title, date, body in zip(filenames, titles, dates, bodies):
        writer.writerow([filename, title, date, body])
       
