Below is the code needed to run the web scrapers for each body. Remember:

(1) Do not run all of the code at once. The GA and SC web scrapers should be run separately.
(2) The UN website is sometimes unresponsive. If a request fails, try again at a different time.
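
Point (2) can be partly automated by retrying a failed request a few times before giving up. The sketch below is an assumption and not part of the original scripts: `call_with_retries`, `attempts`, and `delay` are hypothetical names, and the helper wraps any zero-argument callable so that it can be tried without network access.

```python
import time

def call_with_retries(action, attempts=3, delay=5.0):
    """Call `action()` until it succeeds or `attempts` tries have failed.

    Hypothetical helper: waits `delay` seconds between tries and re-raises
    the last exception if every try fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # Give up on the final try; otherwise wait and try again.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

A single download could then be wrapped as `call_with_retries(lambda: requests.get(url, timeout=60))`. Note that `requests` only raises on connection problems, so an HTTP error page would still need a `raise_for_status()` call inside the lambda to trigger a retry.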

In [None]:
# IMPORTANT: pdfminer must be installed separately.
# If you do not have pdfminer, run this cell. Otherwise, skip it.

!pip install pdfminer


In [None]:
# IMPORTANT: fuzzywuzzy must be installed separately.
# If you do not have fuzzywuzzy, run this cell. Otherwise, skip it.
# fuzzywuzzy runs faster when python-levenshtein is installed.

!pip install fuzzywuzzy

!pip install python-levenshtein




In [None]:
# Packages needed to run the program

from io import BytesIO, StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

In [None]:
# Functions used by the web scrapers and the fuzzywuzzy count generator.

def pdfurl_to_text(url):
    """Download the PDF at `url` and return its text with newlines replaced by spaces."""
    output_string = StringIO()
    # Download the PDF into memory and hand it to pdfminer's parser.
    parser = PDFParser(BytesIO(requests.get(url).content))
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Extract the text page by page into output_string.
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    return output_string.getvalue().replace('\n', ' ')

def search_text(text, query, strictness):
    """Count the sentences in `text` that fuzzily match `query`.

    A sentence counts as a match when fuzz.token_set_ratio (a similarity
    score from 0 to 100) is at least `strictness`."""
    count = 0
    # Sentences are approximated by splitting on periods.
    for sentence in text.split('.'):
        score = fuzz.token_set_ratio(query, sentence.strip())
        if score >= strictness:
            count += 1
    return count

def search_pdf(url, queries, strictness):
    """Convert the PDF at `url` to text and return one match count per query,
    in the same order as `queries`."""
    text = pdfurl_to_text(url)
    return [search_text(text, query, strictness) for query in queries]
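
The counting logic in `search_text` can be tried on a small string without downloading anything. The sketch below is a stand-in, not the code above: `simple_ratio` is a hypothetical scorer built on the standard library's `difflib`, used only so the example runs without fuzzywuzzy, and it does not reproduce `fuzz.token_set_ratio`. The splitting and thresholding are the same.

```python
from difflib import SequenceMatcher

def simple_ratio(query, sentence):
    """Stand-in 0-100 similarity score (NOT fuzzywuzzy's token_set_ratio)."""
    matcher = SequenceMatcher(None, query.lower(), sentence.lower())
    return round(100 * matcher.ratio())

def count_matches(text, query, strictness, scorer=simple_ratio):
    """Count period-delimited sentences whose score reaches `strictness`."""
    return sum(
        1
        for sentence in text.split('.')
        if scorer(query, sentence.strip()) >= strictness
    )

sample = "The agenda was adopted. The meeting rose at 1 p.m. The agenda was adopted."
print(count_matches(sample, "The agenda was adopted", 95))  # 2
```

Because sentences are approximated by splitting on periods, abbreviations such as "p.m." are broken into fragments ("The meeting rose at 1 p" and "m"). The fragments simply fail to match here, but the effect is worth keeping in mind for queries that contain periods.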

**Below is the script for General Assembly meeting documents. It includes the web scraper and the fuzzywuzzy results.**

In [None]:
# Web scraper and fuzzywuzzy count generator.
# DO NOT RUN THIS IF YOU ONLY WANT SC DATA.
# Note: Unlike the SC code, the GA code runs in a single cell. This is because its steps depend on
# one another, and the GA files have to be pulled from search results rather than from a central page.

# Fetch up to five search-result pages (200 records each) for every year from 1994 to 2019.
meeting_pages = []
for y in range(1994, 2020):
    for x in range(0, 5):
        url = "https://digitallibrary.un.org/search?ln=en&rg=200&jrec={}&fct__2=General+Assembly&fct__2=General+Assembly&cc=Meeting+Records&fct__1=Meeting+Records&fct__1=Meeting+Records&fct__3={}".format(1 + (x*200), y)
        meeting_pages.append(requests.get(url))

# Collect the record number from every "/record/..." link on the result pages.
record_nums = []
for meet in meeting_pages:
    soup = BeautifulSoup(meet.text, 'html.parser')
    for link in soup.find_all('a', href=True):
        if link['href'].startswith("/record/"):
            record_num = link['href'].split("/")[2].split("?")[0]
            record_nums.append(record_num)
           
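# Look up each record page and keep the URL of its English PDF.
urls = []
for record_num in record_nums:
    try:
        soup = BeautifulSoup(requests.get("https://digitallibrary.un.org/record/{}?ln=en".format(record_num)).text, 'html.parser')
        # The document symbol and action note are read here but not used later in this script.
        for x in soup.find_all(attrs={"class": "metadata-row"}):
            z = [y.contents[0] for y in x.find_all('span')]
            if len(z) > 1:
                if z[0] == 'Symbol':
                    ga_id = z[1]
                elif z[0] == 'Action note':
                    ga_date = z[1]
        # If the record has no English PDF, the [0] lookup raises IndexError,
        # which is printed below and the record is skipped.
        url = [x['content'] for x in soup.find_all(attrs={'name':'citation_pdf_url'}) if 'EN' in x['content']][0]
        urls.append(url)
    except Exception as e:
        print(e)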
       
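# Append the PDF URLs to urls.txt, then reload the full list from the file.
# Because the file is opened in append mode, URLs from earlier runs are kept.
with open("urls.txt", "a") as f:
    for url in urls:
        f.write("{}\n".format(url))

with open("urls.txt", "r") as f:
    urls = [line.strip() for line in f]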

queries = [x.strip() for x in """May I take it that the assembly wishes to take note of those items that remain open for consideration

The agenda was adopted

Amend the agenda to

I have been authorized to make the following statement on behalf of the assembly

Refer the matter

Report of the Committee

Point of Order

Right of reply

To adjourn the debate

May I take it that it is the wish of the General Assembly to conclude its consideration

To suspend the meeting

To adjourn the meeting

To introduce a draft amendment to the draft resolution

The Assembly will now take a decision on draft resolution

Objection to consideration of the question

To withdraw

Reconsideration of the

Appoint a Committee

A recorded vote has been requested

A paragraph-by-paragraph vote on the draft resolution

The draft resolution was adopted

peacekeeping

peacemaking

sanctions

The draft resolution has not been adopted

""".split("\n") if x.strip()]

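# Run every query against every PDF. URLs whose PDFs fail to download or parse are
# reported and skipped, so each row of counts stays paired with its own URL.
searched_urls = []
results = []
for url in urls:
    try:
        counts = search_pdf(url, queries, 95)
    except Exception as e:
        print(e)
        continue
    searched_urls.append(url)
    results.append(counts)

# One column per query plus a "meeting" column holding the PDF URL.
d = {k: [] for k in queries}
d["meeting"] = []
for url, result in zip(searched_urls, results):
    d["meeting"].append(url)
    for query, count in zip(queries, result):
        d[query].append(count)
pd.DataFrame.from_dict(d).to_csv("GA_Queries.csv")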
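
Two small pieces of the cell above can be checked without any network access: the `jrec` paging offsets and the record-number extraction. The sketch below restates them as functions. `page_offsets` and `record_id_from_href` are hypothetical names introduced only for illustration, and the sample links are made up.

```python
def page_offsets(pages=5, page_size=200):
    """Return the 1-based `jrec` start offset for each result page."""
    return [1 + page * page_size for page in range(pages)]

def record_id_from_href(href):
    """Return the record number from a '/record/<id>...' link, else None."""
    if not href.startswith("/record/"):
        return None
    # '/record/123456?ln=en' -> ['', 'record', '123456?ln=en'] -> '123456'
    return href.split("/")[2].split("?")[0]

print(page_offsets())                               # [1, 201, 401, 601, 801]
print(record_id_from_href("/record/123456?ln=en"))  # 123456
print(record_id_from_href("/search?ln=en"))         # None
```

With 200 records per page and five pages, at most 1,000 records per year are requested. The same record can be linked more than once on a result page, so passing the collected numbers through `set()` before fetching record pages would avoid repeated downloads.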


**Below is the script for Security Council meeting documents. It includes the web scraper and the fuzzywuzzy results.**

In [None]:
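# Web scraper

import requests
from bs4 import BeautifulSoup

d = {}  # maps each PDF link to the year of the table it was found in

links = []
for date in range(1994, 2020):
    soup = BeautifulSoup(requests.get("https://www.un.org/depts/dhl/resguide/scact{}_table_en.htm".format(date)).text, 'html.parser')
    for row in soup.find('table').find_all('tr'):
        cols = row.find_all('td')
        if len(cols) > 0:
            # Build the PDF link from the document symbol in the first column.
            # NOTE: as written, the link has no scheme or host, so requests.get
            # cannot fetch it until the document server's base URL is prepended.
            s = "pdf?symbol=en%2F{}".format(cols[0].find("a").text).replace("/", "%2F")
            d[s] = date
            links.append(s)
links = list(set(links))  # drop duplicate links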

queries = [x.strip() for x in """May I take it that the assembly wishes to take note of those items that remain open for consideration

The agenda was adopted

appoint a committee

I have been authorized to make the following statement on behalf of the committee 

May I take it that the draft report as corrected is adopted by the council

point of order

amendment

A paragraph-by-paragraph vote on the draft resolution

refer the matter

postpone discussion to

postpone discussion indefinitely 

The meeting was suspended

the meeting is adjourned

To withdraw the draft resolution

In conformity with the usual practice, I propose, with consent of the council, to invite those representatives to participate in discussion

It is my understanding that the security council is ready to vote on the draft resolution before it

peacekeeping

peacemaking

sanctions

The draft resolution has been adopted

The draft resolution has not been adopted

""".split("\n") if x.strip()]

# b holds one column per query plus the meeting URL and its year.
b = {"meetings": [], "dates": []}
for x in queries:
    b[x] = []

for url in links:
    try:
        text = pdfurl_to_text(url)
        for query in queries:
            b[query].append(search_text(text, query, 95))
        # Override the count for this query using a stricter threshold of 100.
        b["The draft resolution has not been adopted"][-1] = search_text(text, "The draft resolution has not been adopted", 100)
        b["meetings"].append(url)
        b["dates"].append(d[url])
    except Exception as e:
        print(url)
        print(e)

# Check that every column has the same length before building the table.
for k, v in b.items():
    print("{} : {}".format(k, len(v)))

pd.DataFrame.from_dict(b).to_csv("sc_meeting_dates.csv")

No /Root object! - Is this really a PDF?
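
pdfminer reports "No /Root object!" when it cannot find a document catalog in the data it was given, which typically means the downloaded bytes are not a PDF at all (for example an HTML error or redirect page). A minimal sketch of a pre-check, assuming that a real PDF starts with the `%PDF` marker; `looks_like_pdf` is a hypothetical helper and the sample byte strings are made up:

```python
def looks_like_pdf(content):
    """Return True if `content` (bytes) starts with the '%PDF' marker.

    Leading whitespace is ignored. This is a heuristic check only; it does
    not validate the rest of the file.
    """
    return content.lstrip().startswith(b"%PDF")

html_error_page = b"<!DOCTYPE html><html><body>Not found</body></html>"
pdf_start = b"%PDF-1.4\n"

print(looks_like_pdf(html_error_page))  # False
print(looks_like_pdf(pdf_start))        # True
```

In `pdfurl_to_text`, a check like this could run on `requests.get(url).content` before the bytes are passed to `PDFParser`, so that non-PDF responses are reported with their URL instead of failing inside the parser.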
