# Gather Missing PDFs

Although Zotero generally saves PDF files reliably, many entries still had no PDF in the File Attachments column. Before the data analysis, we therefore collected as many of the missing PDFs as we could. Running the script below recovered **around 90%** of them.

In [None]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from doi2pdf import doi2pdf

1. Import the dataframe

In [None]:
# Load the exported Zotero library
df = pd.read_csv('.../OID_library.csv')

# Keep only the entries that have no PDF attached
df_nopdf = df[df['File Attachments'].isna()]

print(f'Out of the {len(df_nopdf)} entries without an attached PDF, '
      f'{df_nopdf.DOI.isna().sum()} have no DOI and {df_nopdf.Url.isna().sum()} have no URL.')
print()
print(f'In theory, we can download {df_nopdf.DOI.notna().sum()} PDFs through the DOI '
      f'and {df_nopdf.Url.notna().sum()} through the URL.')

df_nopdf[['Title','Author','Url','DOI']].head()
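
The counts above treat the DOI and the URL separately. As a minimal sketch, the entries could also be grouped by which retrieval route is available to them. The `retrieval_route` helper and the toy data are hypothetical; only the column names `DOI`, `Url`, and `File Attachments` come from the library export.

```python
import pandas as pd

def retrieval_route(row):
    """Hypothetical helper: label an entry by the identifiers it has."""
    has_doi = pd.notna(row["DOI"])
    has_url = pd.notna(row["Url"])
    if has_doi and has_url:
        return "both"
    if has_doi:
        return "doi only"
    if has_url:
        return "url only"
    return "neither"

# Toy stand-in for the real library (assumed values, for illustration only)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "DOI": ["10.1000/a", None, "10.1000/c", None],
    "Url": ["https://example.org/a", "https://example.org/b", None, None],
    "File Attachments": [None, None, None, None],
})

missing = toy[toy["File Attachments"].isna()]
route_counts = missing.apply(retrieval_route, axis=1).value_counts()
print(route_counts)
```

Entries labelled "neither" cannot be recovered by either of the two steps below and would have to be collected by hand.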

2. Retrieve PDFs with requests and bs4

In [None]:
def get_pdf_from_url(url, title):
    """Save the first PDF linked from the page at `url` as '<title>.pdf'.

    Returns True if a PDF was saved, False otherwise.
    """
    output = f'.../pdf_gathering/additional_pdfs/{title}.pdf'
    try:
        response = requests.get(url, timeout=15)  # Give up after 15 seconds
        response.raise_for_status()  # Raise an error for HTTP request issues

        soup = BeautifulSoup(response.text, "html.parser")

        # Every link would be written to the same file, so stop at the first
        # successful download instead of overwriting it repeatedly.
        for link in soup.select("a[href$='.pdf']"):
            pdf_url = urljoin(url, link['href'])  # Resolve relative links
            pdf_response = requests.get(pdf_url, timeout=15)  # Timeout for the PDF request
            pdf_response.raise_for_status()

            with open(output, 'wb') as f:
                f.write(pdf_response.content)
            return True

    except requests.exceptions.Timeout:
        print(f"Timeout reached for URL: {url}. Moving to the next URL.")
    except requests.exceptions.RequestException as e:
        print(f"Error processing URL: {url} - {e}")
    except Exception as e:
        print(f"Unexpected error for {url}: {e}")
    return False
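
Because the output filename is built directly from the article title, a title containing characters such as `/` or `:` can make `open` fail or write to an unintended location. A minimal sketch of one way to guard against this, assuming a hypothetical `safe_filename` helper that is not part of the original script:

```python
import re

def safe_filename(title, max_length=150):
    """Hypothetical helper: turn an article title into a filesystem-safe name."""
    # Replace characters that are invalid on Windows or act as path separators
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", str(title))
    # Collapse runs of whitespace and trim the ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Truncate very long titles (the limit of 150 is an arbitrary assumption)
    return cleaned[:max_length]

print(safe_filename("Open/Closed: what is   innovation?"))
# → Open_Closed_ what is innovation_
```

If such a helper were used, the same cleaned name would have to be applied wherever the files are later matched back to their titles.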

In [None]:
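# Keep only the entries that have a URL
urls = list(df_nopdf[~df_nopdf.Url.isna()]['Url'])
url_titles = list(df_nopdf[~df_nopdf.Url.isna()]['Title'])

# i counts the number of processed entries. get_pdf_from_url handles its own
# request errors, so reaching the print only means the entry was processed,
# not that a PDF was actually saved.
for i, (url, title) in enumerate(zip(urls, url_titles), start=1):
    try:
        get_pdf_from_url(url, title)
        print(i, 'Processed', title)
    except Exception as e:
        print('Error downloading', title, e)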
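
A link ending in `.pdf` can still return something else, for example an HTML login or error page, which would then be saved under a `.pdf` name. One possible check, sketched here under the assumption that it is run on the downloaded bytes before they are written, is to look for the `%PDF-` header that PDF files begin with. The `looks_like_pdf` helper is hypothetical and not part of the original script.

```python
def looks_like_pdf(content):
    """Hypothetical helper: True if the bytes start with the PDF header."""
    # Tolerate leading whitespace; only inspect the first kilobyte
    return content[:1024].lstrip().startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n%..."))           # a PDF-like payload
print(looks_like_pdf(b"<!DOCTYPE html><html>"))    # an HTML page
```

Inside a download function, the check could be applied to `pdf_response.content` before opening the output file.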

3. Retrieve PDFs with doi2pdf (if a DOI duplicates an already downloaded paper, the paper is kept only once, because the title and hence the filename are the same)

In [None]:
dois=list(df_nopdf[~df_nopdf.DOI.isna()]['DOI'])
titles=list(df_nopdf[~df_nopdf.DOI.isna()]['Title'])

for doi, title in zip(dois, titles):
    try:
        doi2pdf(doi, output=f'.../pdf_gathering/additional_pdfs/{title}.pdf')
    except Exception as e:
        print(title, e)
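
The notebook reports that around 90% of the missing PDFs were recovered but does not show how that figure was obtained. One way it could be estimated, as a rough sketch, is to count the PDF files in the output folder and divide by the number of entries that were missing a PDF. The `recovery_rate` function is hypothetical, and the temporary folder below merely stands in for the real output directory.

```python
from pathlib import Path
import tempfile

def recovery_rate(folder, n_missing):
    """Hypothetical helper: share of missing entries with a PDF in `folder`."""
    n_found = sum(1 for _ in Path(folder).glob("*.pdf"))
    return n_found / n_missing if n_missing else 0.0

# Demonstration on a temporary folder with two PDFs and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.pdf", "b.pdf", "notes.txt"]:
        (Path(tmp) / name).write_bytes(b"")
    rate = recovery_rate(tmp, n_missing=4)

print(rate)
# → 0.5
```

With the real data, `n_missing` would be `len(df_nopdf)`. Since entries with identical titles share one file, this count can understate the number of entries actually covered.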