### PDF Scraper
By Parker Whitehead<br/><br/>

This Jupyter notebook repurposes the 'NEW HPC' notebook that has commonly been used to scrape websites for ConfliBERT. It is currently set up for a niche use case (mining the online library of PDFs at https://www.corteidh.or.cr/), but it can be modified to meet other PDF scraping needs.<br/><br/>

The notebook is currently broken into **4 main parts.** The **first** sets your universal parameters. This is the same as usual, except for a few new parameters specific to PDF mining. The **second** part is a page-based scrape for article links from a given endpoint. In this instance, it collects links to all PDF descriptions under the category of 'human rights,' or 'derechos humanos.' On a typical website, this would be sufficient to find the links to PDF downloads. <br/><br/>


**However**, this website uses a PDF viewer and requires a login. As a workaround, I collect the links to the PDF descriptions, which contain the title of the PDF as well as the ID of the PDF file. After some digging in the JavaScript code, I found that the PDF viewer was hitting an endpoint on the server that served the raw PDF file. Conveniently, it was not protected, meaning that you didn't have to log in or get past the paywall you would meet if you tried to access the PDF normally through the viewer. The endpoint for the PDFs was indexable as long as you had the PDF ID.<br/><br/>
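To make the workaround concrete, here is a minimal sketch of how a raw-PDF URL can be assembled from an ID. The endpoint value and the `build_pdf_url` helper are hypothetical placeholders, not the site's real endpoint. The `<endpoint><id>.pdf` pattern mirrors what `extract_pdf_text()` does later in this notebook.

```python
def build_pdf_url(pdf_url_endpoint, pdf_id):
    """Return the raw-PDF URL for a numeric ID, or the value unchanged otherwise."""
    try:
        # IDs read back from a CSV checkpoint may arrive as floats (e.g. 1234.0)
        return f"{pdf_url_endpoint}{int(pdf_id)}.pdf"
    except (TypeError, ValueError):
        # Not a numeric ID: assume the value is already a full link
        return pdf_id


example_endpoint = "https://example.org/files/"  # hypothetical endpoint
print(build_pdf_url(example_endpoint, 1234.0))
print(build_pdf_url(example_endpoint, "https://example.org/files/report.pdf"))
```

Falling back to the original value keeps the same function usable when a site links to its PDFs directly and no ID lookup is needed.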

This is the purpose behind the **third part,** which scrapes the PDF descriptions not for links to the PDF or for raw text, but for the PDF title and ID. There is also an **intermediate step** where a library is used to ensure that the title is in Spanish, as there are a fair number of English articles. The **fourth** is the actual PDF scraping.<br/><br/>
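The language check can be sketched roughly as follows. The notebook imports `langdetect`, so this sketch assumes that is the library meant. The `is_spanish` helper and the stub detector are hypothetical and exist only so the snippet runs without `langdetect` installed.

```python
def is_spanish(title, detect=None):
    """Return True if the detector classifies the title as Spanish ('es')."""
    if detect is None:
        # Imported lazily so the helper can also be used with a stub detector
        from langdetect import detect
    try:
        return detect(title) == "es"
    except Exception:
        # Detection can fail on empty or non-linguistic input; treat as "not Spanish"
        return False


# Demo with a stub detector (an assumption for illustration, not real detection)
def stub_detect(text):
    return "es" if "derechos" in text.lower() else "en"


titles = ["Derechos humanos y justicia", "Human rights report"]
spanish_titles = [t for t in titles if is_spanish(t, detect=stub_detect)]
print(spanish_titles)
```

In the notebook itself you would call `is_spanish(title)` with no second argument, which uses `langdetect.detect`. Short titles give language detectors little to work with, so it is worth spot-checking the rows that get filtered out.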

Every website is going to be different, so it is almost certain that you will have to modify this script in some way. However, it is generally as simple as finding a way to hit a website's endpoint that contains the raw PDFs. The best way that I've found to customize these notebooks is by creating new extract functions, like extract_text() and extract_text_from_df(), which are parallelized with the parallelize_dataframe function. Reading through this scraper should hopefully provide context as to how they work and how you can create your own.
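The extract-function pattern has three pieces: a per-item extractor, a per-chunk wrapper, and a helper that splits the work, maps the wrapper across workers, and concatenates the results. The sketch below shows that shape with plain lists and a thread pool instead of DataFrames and `multiprocessing.Pool`, so it runs with the standard library alone. All function names in it are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_length(url):
    """Hypothetical per-item extractor: replace the body with real scraping logic."""
    return len(url)


def extract_length_from_chunk(chunk):
    """Per-chunk wrapper, the counterpart of the *_from_df functions."""
    return [(url, extract_length(url)) for url in chunk]


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks order-preserving chunks, none of them empty."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallelize_list(items, func, n_workers):
    """Split the items, map the chunk function across workers, then concatenate."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(func, chunks))
    return [row for chunk_result in results for row in chunk_result]


rows = parallelize_list(["a", "bb", "ccc"], extract_length_from_chunk, 2)
print(rows)
```

Capping the number of chunks at the number of items avoids handing empty chunks to the workers, which is what happens when a batch is split across more workers than it has rows.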

In [None]:
# Install dependencies (needed when running in Colab; comment these out if they are already installed)
!pip install wget
!pip install newspaper3k
!pip install xmltodict
!pip install pandarallel
!pip install datefinder
!pip install pydrive
!pip install selenium
!pip install PyPDF2
!pip install langdetect
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip3 install pycryptodome

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import sys
import wget
import newspaper
from newspaper import Article
import json
import pandas as pd
pd.set_option("max_colwidth", 600)
import ast
from bs4 import BeautifulSoup
import re
import requests
import time
import numpy as np
import zipfile
import os
import html
import itertools
import lxml
import xmltodict
import collections
from urllib import request
from collections import OrderedDict
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from pandarallel import pandarallel
import datefinder
from datetime import datetime
from tqdm import tqdm
import psutil
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from multiprocessing import Pool
import random
from urllib.parse import quote
import pytz
from PyPDF2 import PdfReader
import io
import langdetect




In [None]:
t0 = time.time()

#PLEASE EDIT THIS CELL!

links_file_path = ""
checkpoints_file_path = ""
csv_file_path = ""

#Universal parameters. Always needs to be changed
news_outlet = ""
country = ""
max_click_SHOW_MORE = 500
host = ""
date_format = "%Y-%m-%d"
# Allow http and https links, whether or not `host` already includes a scheme.
# (Slicing with host[8:] only worked for hosts starting with "https://".)
bare_host = re.sub(r'^https?://', '', host)
relevant_path = f'^https?://{bare_host}.*$'
urls = ['/']
urls = [host + url for url in urls]
csv_name = country+'_'+news_outlet+'.csv'
types = ['page', 'click_more','scroll_down']
type_of_page = types[0]
extract_text_sleep = random.randint(2,4)
extract_urls_sleep = random.randint(2,4)

# If page type is click_more or scroll_down, i.e. more articles load when you click a button (e.g. 'click more') or simply scroll down
# Be sure to use xpath functions to describe the exact button you want to press.
# Sometimes, 'previous' and 'next' for click more can have the same class.
# You may need to do something like this:
# "//a[contains(@class,\"BlogList-pagination-link\") and .//span[text()=\"Más antiguos\"]]"

xpath_for_link = ""
xpath_for_click_more = ""





# If page type is page
target_tag = ''
target_tag_class = ''

page_identifier = ''

if type_of_page == 'page':
    urls = [url + page_identifier for url in urls]
pages_each_category = [1] #The total number of pages for each section. So if policiaca is 1..92, write 92, not 93.
if pages_each_category:
    total_pages = pages_each_category
else:
    total_pages = 500
#-----------------------------------------------------------------------------------------------------------------------------------------

#Get amount of pages per category not having text code
#-----------------------------------------------------------------------------------------------------------------------------------------
retrieve_amount_of_pages_no_text = False
#-----------------------------------------------------------------------------------------------------------------------------------------


#Get amount of pages per category with text code
#-----------------------------------------------------------------------------------------------------------------------------------------
retrieve_amount_of_pages_w_text = False
xpath_for_amount_of_pages = ""
index_for_amount_of_pages_location = 0
#-----------------------------------------------------------------------------------------------------------------------------------------

# in case multi tags
target_tags = []
target_tag_classes = []

# New, pdf relevant parameters:
# Used in this specific instance to index raw pdfs once their id was obtained.
# Might not be applicable for other usecases.
pdf_url_endpoint = ''
# Allow http and https links, whether or not the endpoint already includes a scheme.
bare_pdf_endpoint = re.sub(r'^https?://', '', pdf_url_endpoint)
relevant_pdf_path = f'^https?://{bare_pdf_endpoint}.*$'

intermediate_site = False # IMPORTANT, CHANGE

title_tag = ''
title_class = ''

# Used if there is an intermediate site:

pdf_id_tag = ''
pdf_id_class = ''
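The `relevant_path` and `relevant_pdf_path` patterns decide which scraped links are kept. As a quick sanity check before a long run, the sketch below builds such a pattern and tries it on a few URLs. The host is a made-up example and `build_relevant_path` is a hypothetical helper. Unlike the parameters above, it also escapes the host so that the dots match literally.

```python
import re


def build_relevant_path(host):
    """Build a regex matching http and https URLs that start with `host`."""
    bare_host = re.sub(r"^https?://", "", host)  # drop the scheme if present
    return rf"^https?://{re.escape(bare_host)}.*$"


pattern = build_relevant_path("https://www.example.org")  # hypothetical host
print(bool(re.match(pattern, "http://www.example.org/biblioteca/1")))
print(bool(re.match(pattern, "https://other.example.com/")))
```

The pattern comes out the same whether the host is given with `http://`, with `https://`, or with no scheme at all.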

In [None]:
def parallelize_dataframe(df, func, n_cores):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

In [None]:
# Functions for automatically identifying the number of pages

def check_it_has_links(url):
  options = FirefoxOptions()
  options.add_argument("--headless")
  driver = webdriver.Firefox(options=options)
  driver.get(url)

  has_links= True
  try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, xpath_for_link))
    )
  except TimeoutException as e:
    print('Could not find any links on this page')
    has_links = False
  driver.quit()
  return has_links

def retrieve_pages_per_category_no_text(url):
  pages = 1
  pages_defined = False

  while check_it_has_links(url + f'{pages}'):
    pages = pages * 10

  add_pages = pages // 2
  if pages != 1:
    pages = pages // 2
    while not pages_defined:
      add_pages = add_pages // 2
      if check_it_has_links(url + f'{pages}'):
        pages = add_pages + pages
      else:
        pages = pages - add_pages
      if add_pages == 1:
        pages_defined = True
      print(pages)

  return pages

def retrieve_pages_per_category_w_text(url):
  options = FirefoxOptions()
  options.add_argument("--headless")
  driver = webdriver.Firefox(options=options)
  driver.get(url)
  try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath_for_amount_of_pages))
    )
  except TimeoutException as e:
    print('Could not find pages count')
    driver.quit()  # release the browser before returning early
    return ''
  pages = driver.find_element(By.XPATH, xpath_for_amount_of_pages).text
  if not pages:
    print('No text found')
  driver.quit()
  return pages
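The idea behind `retrieve_pages_per_category_no_text` is to grow a page number until a page without links is found, then narrow down to the last page that still has links. The sketch below is one way to write that search, with a predicate standing in for the Selenium check. It is an illustration rather than a drop-in replacement: `find_last_page` is a hypothetical name, and it assumes pages are contiguous, so that if page n has links, every earlier page does too.

```python
def find_last_page(has_links):
    """Return the largest page number for which has_links(page) is True.

    Returns 0 if even page 1 has no links.
    """
    if not has_links(1):
        return 0
    low, high = 1, 2
    # Double the upper bound until a page without links is found
    while has_links(high):
        low, high = high, high * 2
    # Invariant: has_links(low) is True and has_links(high) is False
    while high - low > 1:
        mid = (low + high) // 2
        if has_links(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in for a category with 92 pages
print(find_last_page(lambda page: page <= 92))
```

In the notebook, the predicate would be something like `lambda page: check_it_has_links(url + str(page))`. Each call opens a browser, so the logarithmic number of checks matters.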

In [None]:
if retrieve_amount_of_pages_no_text:
  total_pages = []
  for url in urls:
    amount_of_pages = retrieve_pages_per_category_no_text(url)
    total_pages.append(amount_of_pages)
  print(total_pages)

if retrieve_amount_of_pages_w_text:
  total_pages = []
  for url in urls:
    text_with_amount_of_pages = retrieve_pages_per_category_w_text(url)

    amount_of_pages =[]

    for char in text_with_amount_of_pages[index_for_amount_of_pages_location:]:
      if not char.isnumeric():
        if char != ',' and char != '.':
          break
      else:
        amount_of_pages.append(char)
    amount_of_pages = ''.join(amount_of_pages)
    print(amount_of_pages + ' ' + url)
    if amount_of_pages:
      total_pages.append(int(amount_of_pages))
    else:
      total_pages.append(1)
  print(total_pages)
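The character loop above pulls a number out of text such as a results counter, skipping thousands separators. A compact sketch of the same idea with a regular expression is shown below. `parse_page_count` and the sample strings are hypothetical. As in the loop, the digits must begin exactly at the given index, otherwise the default of 1 is returned.

```python
import re


def parse_page_count(text, start=0, default=1):
    """Read the run of digits (with ',' or '.' separators) that begins at text[start]."""
    match = re.match(r"[\d.,]+", text[start:])
    digits = re.sub(r"\D", "", match.group()) if match else ""
    return int(digits) if digits else default


print(parse_page_count("1,234 results"))
print(parse_page_count("Page 1 of 92", start=10))
print(parse_page_count("no digits here"))
```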

In [None]:

output_path = 'links_'+csv_name

def retrieve_links_from_list(url,scroll_down=False):
  options = FirefoxOptions()
  options.add_argument("--headless")
  driver = webdriver.Firefox(options=options)
  driver.get(url)

  count = 0

  all_links = []

  news_urls = set()

  prev_length = 0
  while count <= max_click_SHOW_MORE:
      if not scroll_down:
          try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, xpath_for_click_more))
            )
          except TimeoutException as e:
            print('Could not find click more button')
            break


      try:
          all_links = driver.find_elements(By.XPATH, xpath_for_link)
          for link in all_links:
            if isinstance(link.get_attribute('href'), str):
              if re.match(relevant_pdf_path, link.get_attribute('href')):
                news_urls.add(link.get_attribute('href'))


          if not scroll_down:
            click_more = driver.find_element(By.XPATH, xpath_for_click_more)
          curr_length = len(all_links)
          print(f'{curr_length} links in current page')
          print(f'{prev_length} links in previous page')
          if count > 0:
            # When scrolling, stop once no new links have loaded
            if scroll_down and curr_length == prev_length:
              break

          current_links = [l.get_attribute("href") for l in all_links[prev_length:]]
          df_link = pd.DataFrame(current_links, columns = ['link'])
          df_link.to_csv(output_path, mode='a', header=not os.path.exists(output_path))

          prev_length = curr_length

          if not scroll_down:

            try:
                click_more.click();
            except Exception as e:
                driver.execute_script("arguments[0].click();", click_more) #If click does not work because of overlapping elements, this executes

            print(f"Button clicked {count} times", )
          else:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            print(f"Scrolled down {count+1} times")

          time.sleep(3)

          count += 1



      except TimeoutException:
          break
      except NoSuchElementException:
          break

  time.sleep(2)

  all_links = driver.find_elements(By.XPATH, xpath_for_link)

  for link in all_links:
        if isinstance(link.get_attribute('href'), str):
          if re.match(relevant_pdf_path, link.get_attribute('href')):
            news_urls.add(link.get_attribute('href'))

  driver.quit()
  print('*' * 20)
  return list(news_urls)


def extract_urls(url):
    rows = set()
    soup = getSoup(url)
    if soup:
      all_elems = None
      if target_tag_class != '':
        all_elems = soup.find_all(target_tag, {'class':target_tag_class})
      else:
        all_elems = soup.find_all(target_tag)

      for d in all_elems:
        all_links = d.find_all('a', href=True)
        # If there are no links, try to grab href from found element
        if not all_links:
          try:
            # MADE CHANGE FOR SPECIFIC WEBSITE
            if not re.match('^http.*', d['href']):
              # d['href'] = host + d['href']
              d['href'] = "https://" + d['href'][2:]
            rows.add(d['href'])
          except KeyError:  # the element has no href attribute
            continue

        for l in all_links:
            if not re.match('^http.*', l['href']):
              # l['href'] = host + l['href']
              l['href'] = pdf_url_endpoint + l['href']
            if re.match(relevant_pdf_path, l['href']):
              rows.add(l['href'])


    if rows:
      return list(rows)
    else:
      return None

def extract_urls_from_df(df):
    links = df['page'].map(lambda x: extract_urls(x))
    df['link'] = links
    return df

In [None]:
pages = []

def get_zipped_urls(url, total_pages):
    # Page numbers run from 1 to total_pages - 1 (the upper bound is exclusive)
    return [url + str(i) for i in range(1, total_pages)]

if type_of_page == 'page':
    for ind, url in enumerate(urls):
        if isinstance(total_pages, int):
            pages += get_zipped_urls(url, total_pages)
        else:
            pages_for_category = total_pages[ind] + 1
            pages += get_zipped_urls(url, pages_for_category)

elif type_of_page == 'click_more' or type_of_page == 'scroll_down':
    for url in urls:
        pages += retrieve_links_from_list(url, True if type_of_page == 'scroll_down' else False)


df = pd.DataFrame(pages, columns = ['page'])
df
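For reference, the page URLs are simply the category URL with a page number appended. The sketch below uses an inclusive upper bound, which is an assumption that differs from `get_zipped_urls` (there the caller adds 1 for per-category counts). The URL and `build_page_urls` are hypothetical.

```python
def build_page_urls(base_url, last_page):
    """Return base_url followed by each page number from 1 to last_page inclusive."""
    return [f"{base_url}{page}" for page in range(1, last_page + 1)]


page_urls = build_page_urls("https://www.example.org/category?page=", 3)  # hypothetical URL
print(page_urls)
```

Printing the first and last URL and opening them in a browser is a cheap way to confirm that `page_identifier` and the page counts are right before scraping.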

In [None]:
def getSoup(url):
    """
    Return a soup object of the URL
    """
    try:
        req = request.Request(url, headers={'User-Agent' : "Chrome"})
        con = request.urlopen(req)
        time.sleep(extract_urls_sleep)
        html = con.read()

    except HTTPError as e:
        print(e)
        return None

    except URLError as e:
        print('The server could not be found')
        return None

    except Exception as e:
      print(e)
      return None


    soup = BeautifulSoup(html, 'html.parser')
    return soup

In [None]:
# Contains new extracting methods

def extract_text(url):
    text, title, date = None, None, None
    try:
        result = [country, news_outlet] + [''] * 3
        article = Article(url, keep_article_html=False)
        article.download()
        time.sleep(extract_text_sleep)
        article.parse()
        text = article.text

        if text:
            text_copy = text
            title = article.title
            date = article.publish_date
            if date:
                date = date.strftime(date_format)
            else:
                matches = datefinder.find_dates(text_copy)
                most_recent_datetime = sorted(matches)[-1]
                date = most_recent_datetime.strftime(date_format)
    finally:
      if title:
        result[2] = title
      if date:
        result[3] = date
      if text:
        result[4] = text
      return result


def extract_text_from_df(df):
    content = df['link'].map(lambda x: extract_text(x))
    df['content'] = content
    return df


def extract_id_and_title(url):
    title_text, id_tag = None, None
    try:
      time.sleep(random.randint(4,6))
      current_soup = getSoup(url)
      # CHANGE FOR INTERMEDIATE WEBSITE
      try:
        title_text = current_soup.find(title_tag, class_=title_class).get_text()
      except:
        title_text = url[url.rindex('/')+1:]
      id_tags = current_soup.find_all(pdf_id_tag, class_=pdf_id_class)
      id_tag = None
      for possible_id_tag in id_tags:
        if re.match(relevant_pdf_path, possible_id_tag.get('href')):
          id_tag = possible_id_tag.get('href')
        else:
          id_tag = pdf_url_endpoint + possible_id_tag.get('href')
      if not id_tag:
        title_text = url[url.rindex('/')+1:]
        id_tag = url
      # id_tag = id_tag[id_tag.rindex('/')+1:]
    except Exception as e:
        print(f'Error in extract_id_and_title: {e}')

    finally:
        results = (title_text, id_tag)
        return results

def extract_id_and_title_from_df(df):
    try:
        content = df['link'].map(lambda x: extract_id_and_title(x))
        if content.tolist(): # Skip empty chunks, which np.array_split produces when n_cores exceeds the number of rows
          df[['title','pdf_id']] = pd.DataFrame(content.tolist(),index=content.index)
    except Exception as e:
        print(f'Error in extract_id_and_title_from_df: {e}')
    return df

def extract_pdf_text(x, url):
    texts = []
    try:
      # Numeric IDs are appended to the endpoint; anything else is treated as a full link
      url = url + str(int(x)) + '.pdf'
    except (TypeError, ValueError):
      url = x
    try:
        response = requests.get(url)

        with io.BytesIO(response.content) as open_pdf_file:
            reader = PdfReader(open_pdf_file)
            if reader.is_encrypted:
              reader.decrypt('')
            for pdfPage in reader.pages:
                texts.append(pdfPage.extract_text())

    except Exception as e:
        print(f'Error in extract_pdf_text: {e}')

    if len(texts) == 0:
        texts.append('')

    return texts


def extract_pdf_text_from_df(df):
    if intermediate_site:
      try:
          content = df['pdf_id'].map(lambda x: extract_pdf_text(x, pdf_url_endpoint))
          df['text'] = content
      except Exception as e:
          print(f'Error in extract_pdf_text_from_df: {e}')
    else:
      try:
          content = df['link'].map(lambda x: extract_pdf_text(x, pdf_url_endpoint))
          df['text'] = content
      except Exception as e:
          print(f'Error in extract_pdf_text_from_df: {e}')
    return df
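If an endpoint turns out to be protected, a request may return an HTML login or error page rather than a PDF. One optional guard, not part of this notebook, is to check for the `%PDF-` marker that PDF files begin with before handing the bytes to `PdfReader`. `looks_like_pdf` is a hypothetical helper.

```python
def looks_like_pdf(content):
    """Return True if the bytes begin with the PDF header marker '%PDF-'."""
    # Tolerate leading whitespace, which some servers prepend
    return content.lstrip()[:5] == b"%PDF-"


print(looks_like_pdf(b"%PDF-1.7\n%..."))
print(looks_like_pdf(b"<!DOCTYPE html><html>login</html>"))
```

Inside `extract_pdf_text`, such a check could run on `response.content` and skip parsing when it fails, which makes a blocked endpoint easier to tell apart from a PDF that simply has no extractable text.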

In [None]:
df

In [None]:
def parallel_work(df, method_to_run, target, source):
  global workers
  while workers>0:
    try:
      df[target] = df[source].parallel_apply(method_to_run)
    except Exception as e:
      raise e
    break

  if workers == 0:
    print('Error during parallel operation. Could not extract text')

  return df

def get_parallel_operation_results(divided_dfs, method_to_run, target, source):
  res = []
  for df in divided_dfs:
    try:
      temp_df = parallel_work(df, method_to_run, target, source)
      if temp_df[target].isnull().all():
        print('Could not retrieve any URLS')
        print('Something is wrong with the target_tag and target_tag_class variables. Please modify')
        return []
      res.append(temp_df)
    except Exception:
      continue
  df_result = pd.concat(res)
  return df_result

def partition_df(df):
  global articles_per_parallel_operation
  divided_dfs = []
  start = 0
  while start < len(df):
    divided_dfs.append(df[start:start+articles_per_parallel_operation])
    start += articles_per_parallel_operation
  return divided_dfs

In [None]:
if type_of_page == 'page':
    start = 0
    limit = 40
    total_time_start = time.time()
    results = []

    while start < len(df):
        start_time = time.time()
        results.append(parallelize_dataframe(df[start:start+limit], extract_urls_from_df, 24))
        end_time = time.time()
        print(f'Batch of data of row range {start}-{start+limit} complete in {round(end_time-start_time, 2)} seconds')
        print(f'{round(min((((start+limit) / len(df)) * 100), 100), 2)}% complete')
        start+=limit

    df = pd.concat(results)
    total_time_end = time.time()
    print(f'Total time taken: {round(total_time_end - total_time_start,2)} seconds')

else:
    df = df.rename(columns={"page": "link"})
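The batch loops in this notebook all follow the same shape: take `limit` rows at a time, process them, and report progress. A small sketch of that bookkeeping is shown below. `iter_batches` and `percent_complete` are hypothetical helpers. A list stands in for the DataFrame here, and the same slicing works on a DataFrame's rows.

```python
def iter_batches(rows, batch_size):
    """Yield (start, batch) pairs that cover rows in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield start, rows[start:start + batch_size]


def percent_complete(start, batch_size, total):
    """Progress after finishing the batch that begins at `start`, capped at 100."""
    return round(min((start + batch_size) / total * 100, 100), 2)


links = [f"link-{i}" for i in range(5)]
for start, batch in iter_batches(links, 2):
    print(start, batch, f"{percent_complete(start, 2, len(links))}% complete")
```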


In [None]:
df

In [None]:
df.to_csv(links_file_path+'links_'+csv_name)

In [None]:
print(len(df))
df = df[df['link'].notna()]
df = df[(df['link'].str.len() != 0)]
print(len(df))

In [None]:
if type_of_page == 'page':
    df = df.explode('link')
df

In [None]:
print(len(df))
df = df.drop_duplicates(subset='link')
print(len(df))

In [None]:
#Create a links checkpoint csv
df.to_csv(links_file_path+'links_'+csv_name)
#df.to_csv('/xdisk/josorio1/salsarra/links/'+'links_'+csv_name, encoding = 'utf-8-sig')

In [None]:
if 'df' not in locals() or df is None or 'link' not in df:
    if(os.path.exists(links_file_path+'links_'+csv_name)):
        df = pd.read_csv(links_file_path+'links_'+csv_name, index_col=0)

In [None]:
df

In [None]:
# SKIP IF THE PDFS DON'T REQUIRE AN ENDPOINT OR AN INTERMEDIATE
output_path= 'unprocessed_' + csv_name
start = 0
limit = 40

res = []
while start < len(df):
    start_time = time.time()
    demo_df = df[start:start+limit].copy()
    test_df = parallelize_dataframe(demo_df, extract_id_and_title_from_df, 94)
    res.append(test_df)
    test_df.to_csv(output_path, mode='a', header=not os.path.exists(output_path))

    test_df['id_found'] = test_df['pdf_id'].notna()  # True when an ID was retrieved (handles both None and NaN)
    no_of_id_retrieved = test_df.id_found.sum()
    print(f'{no_of_id_retrieved} / {len(test_df)} ids retrieved')

    end_time = time.time()
    print(f'Batch of data of row range {start}-{start+limit} complete in {round(end_time-start_time, 2)} seconds')
    print(f'{round(min((((start+limit) / len(df)) * 100), 100), 2)}% complete')
    start+=limit

df = pd.concat(res)

In [None]:
# SKIP IF THE PDFS DON'T REQUIRE AN ENDPOINT OR AN INTERMEDIATE

# Data cleaning

print(df.shape)
df = df[df['id_found'] == True]
print(df.shape)

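# Keep only the columns that exist ('page' is absent for click_more / scroll_down runs)
df = df[[c for c in ['page','link','title','pdf_id','id_found'] if c in df.columns]]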
print(df.shape)

In [None]:
# Create a secondary checkpoint for ids and titles
df.to_csv(checkpoints_file_path+'checkpoint_'+csv_name)

In [None]:
# Run to load from csv checkpoint
df = pd.read_csv(checkpoints_file_path+'checkpoint_'+csv_name)
df

In [None]:
# Run parallel text scraping from pdf
output_path= 'unprocessed_' + csv_name
start = 0
limit = 40

res = []
while start < len(df):
    start_time = time.time()
    demo_df = df[start:start+limit].copy()
    test_df = parallelize_dataframe(demo_df, extract_pdf_text_from_df, 24)
    res.append(test_df)
    test_df.to_csv(output_path, mode='a', header=not os.path.exists(output_path))

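    # A PDF counts as found if any page produced non-empty text (len(x) > 1 wrongly rejected one-page PDFs)
    test_df['text_found'] = test_df['text'].map(lambda x: any(page and page.strip() for page in x))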
    no_of_id_retrieved = test_df.text_found.sum()
    print(f'{no_of_id_retrieved} / {len(test_df)} text retrieved')

    end_time = time.time()
    print(f'Batch of data of row range {start}-{start+limit} complete in {round(end_time-start_time, 2)} seconds')
    print(f'{round(min((((start+limit) / len(df)) * 100), 100), 2)}% complete')
    start+=limit

df = pd.concat(res)

In [None]:
# df[['country', 'news_outlet', 'title', 'date', 'text']] = pd.DataFrame(df.content.tolist(), index= df.index)
df


In [None]:
print(len(df))
df = df[df['text_found'] == True]
print(len(df))
df

In [None]:
# Drop helper columns that are no longer needed (missing ones are ignored)
df = df.drop(columns=['page', 'content', 'Unnamed: 0', 'Unnamed: 0.1',
                      'text_found', 'id_found', 'pdf_id'], errors='ignore')

In [None]:
df['date'] = ''
df['news_outlet'] = news_outlet
df['country'] = country
df['text_set'] = 'TRUE'

In [None]:
correct_column_order = ['link','text_set','country','news_outlet','date','text']

In [None]:
df = df[correct_column_order]
df

In [None]:
seconds = time.time() - t0
duration = time.strftime("(%H:%M:%S)",time.gmtime(seconds))

df.to_csv(csv_file_path+duration+'-'+csv_name, encoding = 'utf-8-sig')

print('Time to complete:',duration)

#tz_DFW = pytz.timezone('US/Central')
#current_time = datetime.now(tz_DFW)
#time = current_time.strftime("(%H:%M:%S)")



#df.to_csv(time+csv_name)
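Two caveats apply to the duration string above. `time.gmtime` wraps around after 24 hours, and colons are not allowed in Windows file names. If either matters for your run, a formatter along these lines could be used. `format_duration` is a hypothetical helper, not something the notebook defines.

```python
def format_duration(seconds):
    """Format elapsed seconds as HH-MM-SS, with hours that keep counting past 24."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}-{minutes:02d}-{secs:02d}"


print(format_duration(3661))
print(format_duration(90000))
```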

In [None]:
if(os.path.exists(output_path)):
    os.remove(output_path)
    print('unprocessed file deleted')