# Python Web Crawler for Country Inference (COP2019)

## 1. Introduction

### 1.1. The Context

Each row in `COP2019_participation.csv` contains:

* the organization name
* the kind of organization (e.g. IGO, NGO, etc.)
* the total number of participants at COP2019 (for that specific organization)
* the total number of male participants at COP2019 (for that specific organization)
* the total number of female participants at COP2019 (for that specific organization)

There are six kinds of organizations (found in the column `entity_type`):
* Parties
* Observer States
* United Nations Secretariat units and bodies
* Specialized agencies and related organizations
* Intergovernmental organizations
* Non-governmental organizations


Right now, we are mostly interested in Intergovernmental organizations (**IGOs**) and Non-governmental organizations (**NGOs**).

For each NGO or IGO, we want to run a Google search on the name of the organization, scrape the top 10 results, and store the raw text from these websites in a dataframe.


### 1.2. Main Approach to the Problem (so far)

We perform an **automated Google search** for each organization in our dataset.
The steps to follow (for each organization) are:

0. creating a Python dictionary with the countries as keys and all values initially set to 0
1. querying the organization name on Google
2. extracting the links that appear on the first page (excluding certain links, see below)
3. for each link, extracting the HTML code and stripping it of tags to obtain the raw text
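
Step 0 and a possible use of the dictionary could be sketched as follows. This is only an illustration under assumptions: the three-country list, the helper names `init_country_counts` and `count_country_mentions`, and the idea of scoring a country by counting mentions of its name in the scraped text are all hypothetical and not taken from the project's code.

```python
import re


def init_country_counts(countries):
    """Step 0: one counter per country, all starting at 0."""
    return {country: 0 for country in countries}


def count_country_mentions(text, counts):
    """Add whole-word, case-insensitive mentions of each country (assumed scoring rule)."""
    for country in counts:
        pattern = r"\b" + re.escape(country) + r"\b"
        counts[country] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# hypothetical country list and text, for illustration only
counts = init_country_counts(["India", "France", "Kenya"])
sample_text = "Location: India. Projects in India and Kenya."
count_country_mentions(sample_text, counts)
print(counts)  # {'India': 2, 'France': 0, 'Kenya': 1}
```

A real country list would also need to handle alternative spellings and multi-word names, which this sketch does not attempt.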

## 2. Solving the Problem

### 2.1. Preamble

We will be using the following Python modules:
* `requests`, `requests_html`, `urllib.parse`, `urllib.request` for establishing web access and accessing web content
* `BeautifulSoup` (from `bs4`) for parsing the HTML of the extracted links and retrieving its text
* `pandas` and `numpy` for working with dataframes

### 2.2. Querying Google and extracting the first 10 links that pop up

Note: When querying something on Google, the number of results on the first page is usually 10. __The first page of results usually contains the most relevant results.__

In [10]:
# importing required modules
import requests
import urllib.parse   # explicit submodule import, needed for quote_plus
import pandas as pd
import numpy as np
from requests_html import HTML
from requests_html import HTMLSession


def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request failed.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
        return None
        

        
def scrape_google(query):
    """
    Return up to 10 non-Google links found on the first page of Google results for a given query.

    Args:
        query (string): the query to search

    Returns:
        links (list): list of links (string) that were extracted from Google;
            empty if the request failed
    """
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.com/search?gl=US&q=" + query)
    if response is None:
        return []

    # absolute_links is a set, so the order of the links is arbitrary
    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.',
                      'https://translate.google.')

    # drop the links that point back to Google itself
    links = [url for url in links if not url.startswith(google_domains)]

    return links[0:10]

In [13]:
# TEST:
scrape_google("United Nations")

['https://careers.un.org/',
 'https://www.un.org/en/about-us',
 'https://www.facebook.com/unitednations/',
 'https://unfoundation.org/',
 'https://news.un.org/en/story/2021/07/1096052',
 'https://www.unep.org/',
 'https://news.un.org/en/story/2021/07/1096022',
 'https://en.wikipedia.org/wiki/United_Nations#Structure',
 'http://www.un.org/',
 'https://news.un.org/en/story/2021/07/1096012']
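
The two pure steps inside `scrape_google`, building the search URL and dropping Google's own links, can be checked without network access. The sketch below is a minimal illustration: the helper names `build_search_url` and `filter_links`, the shortened prefix tuple, and the sample links are assumptions made for the example, not part of the notebook.

```python
from urllib.parse import urlencode

# shortened, illustrative list of Google-internal prefixes
BLOCKED_PREFIXES = ("https://www.google.", "https://support.google.")


def build_search_url(query):
    """Build the search URL; urlencode escapes the query the same way quote_plus does."""
    return "https://www.google.com/search?" + urlencode({"gl": "US", "q": query})


def filter_links(links, blocked_prefixes=BLOCKED_PREFIXES, limit=10):
    """Keep at most `limit` links that do not start with a blocked prefix."""
    kept = [url for url in links if not url.startswith(blocked_prefixes)]
    return kept[:limit]


print(build_search_url("World Green Building Council, Inc."))
# https://www.google.com/search?gl=US&q=World+Green+Building+Council%2C+Inc.

# made-up sample of links, for illustration only
sample_links = [
    "https://www.google.com/preferences",
    "https://www.un.org/en/about-us",
    "https://support.google.com/websearch",
    "https://unfoundation.org/",
]
print(filter_links(sample_links))
# ['https://www.un.org/en/about-us', 'https://unfoundation.org/']
```

`str.startswith` accepts a tuple of prefixes, which is why a single call can test a link against every blocked domain.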

### 2.3. Getting the texts from the URLs

In [38]:
# importing the relevant modules
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup


def get_text_from_URL(url):
    """
    Get the raw text from a given url, stripped of all script and style elements.

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        text (string): raw text extracted from the webpage passed as argument.
    """
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # split each line into phrases wherever two consecutive spaces occur
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks and join the rest into a single space-separated string
    text = ' '.join(chunk for chunk in chunks if chunk)
    return text
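
The whitespace clean-up at the end of `get_text_from_URL` can be tried on a plain string, without downloading anything. The sketch below repeats those three steps in a stand-alone helper; the name `normalize_whitespace` and the sample string are assumptions made for the illustration.

```python
def normalize_whitespace(text):
    """Collapse line breaks and runs of spaces the same way get_text_from_URL does."""
    # strip leading and trailing space on every line
    lines = (line.strip() for line in text.splitlines())
    # split each line into phrases wherever two consecutive spaces occur
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks and join the rest with single spaces
    return " ".join(chunk for chunk in chunks if chunk)


# made-up sample resembling the output of soup.get_text()
sample = "  Title  \n\n  First part  Second part \n   \nLast line"
print(normalize_whitespace(sample))  # Title First part Second part Last line
```

Because everything is joined with single spaces, the result is one long line per page, which matches the shape of the test output in this section.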


In [39]:
# TEST:
url = "https://wiser.directory/organization/academy-for-mountain-environics-ame/"
print(repr(get_text_from_URL(url)))

"Academy for Mountain Environics AME - Wiser.Directory Issue Areas Democracy, Nonviolence and Peace Ecological Integrity Social and Economic Justice Add Select Page Academy for Mountain Environics AME Visit WebsiteSearch on Google Location: India Issue Areas: Biodiversity Conservation, Conservation Area Creation, Evolutionary Ecology, Practical Conservation, Sustainable Agriculture, Water Supply and Conservation, Women's Economic Development The Academy for Mountain Environics, the research arm of the BCIL, was founded over a decade ago with the basic objective of guiding people on simple, user-friendly systems for harnessing energy and water. The focus is on maintaining the tenuous harmony between the 5 ja’s that bind the planet — jan (people), jal (water) jameen (land), jungle (forests) and janwar (fauna). With projects in districts of Uttaranchal and Madhya Pradesh, the Academy consists of professionals who have come together from diverse fields. Some are geologists and geophysicist

### 2.4. Reading in the COP2019 participation statistics

In [16]:
COP2019 = pd.read_csv("COP2019_participation.csv", index_col=False)

In [17]:
# inspect the first 5 elements
COP2019.head()

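| Unnamed: 0 | name | entity_type | TotalMembers | MaleMembers | FemaleMembers |
|---|---|---|---|---|---|
| 0 | Afghanistan | Parties | 15 | 13 | 2 |
| 1 | Albania | Parties | 2 | 1 | 1 |
| 2 | Algeria | Parties | 26 | 16 | 10 |
| 3 | Andorra | Parties | 8 | 5 | 3 |
| 4 | Angola | Parties | 23 | 16 | 7 |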


In [18]:
# ensure integrity of entity types 
COP2019.entity_type.unique()

array(['Parties', 'Observer States',
       'United Nations Secretariat units and bodies',
       'Specialized agencies and related organizations',
       'Intergovernmental organizations',
       'Non-governmental organizations'], dtype=object)

In [19]:
# create organizations dataframe, containing only NGOs and IGOs
organizations = COP2019[
    COP2019.entity_type.isin(['Intergovernmental organizations',
                              'Non-governmental organizations'])
].copy()

In [20]:
# inspect the last 5 elements; all should be NGOs
organizations.tail()

| Unnamed: 0 | name | entity_type | TotalMembers | MaleMembers | FemaleMembers |
|---|---|---|---|---|---|
| 1361 | York University | Non-governmental organizations | 13 | 8 | 5 |
| 1362 | Young Energy Specialists - Development Coopera... | Non-governmental organizations | 8 | 3 | 5 |
| 1363 | Young Power in Social Action | Non-governmental organizations | 1 | 1 | 0 |
| 1364 | Zhenjiang Green Sanshan Environmental Public W... | Non-governmental organizations | 3 | 1 | 2 |
| 1365 | ZOI Environment Network | Non-governmental organizations | 2 | 2 | 0 |


In [21]:
# subsetting the organizations dataset for testing: the last 20 rows except the final one (19 rows, all NGOs)
organizations1 = organizations.iloc[-20:-1].copy()

In [22]:
# inspecting the newly generated dataframe
organizations1.head()

| Unnamed: 0 | name | entity_type | TotalMembers | MaleMembers | FemaleMembers |
|---|---|---|---|---|---|
| 1346 | World Green Building Council, Inc. | Non-governmental organizations | 3 | 1 | 2 |
| 1347 | World Medical Association | Non-governmental organizations | 6 | 4 | 2 |
| 1348 | World Ocean Council | Non-governmental organizations | 1 | 1 | 0 |
| 1349 | World Ocean Network | Non-governmental organizations | 3 | 2 | 1 |
| 1350 | World Organization of the Scout Movement | Non-governmental organizations | 1 | 1 | 0 |


### 2.5. Web Scraping

In [23]:
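def scrape(organizations):
    """
    Search Google for each organization name and collect the raw text of every result page.

    Args:
        organizations (iterable): organization names to search for

    Returns:
        scraped_texts (list): raw texts (string) of all pages that could be retrieved
    """
    scraped_texts = []
    for organization in organizations:
        list_of_links = scrape_google(str(organization))
        for link in list_of_links:
            try:
                scraped_texts.append(str(get_text_from_URL(link)))
            except Exception:
                # skip pages that cannot be downloaded or parsed
                pass
    return scraped_texts

scraping_result = scrape(organizations1.name)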

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
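
`scrape` returns a flat list of texts, so it is no longer possible to tell which organization or URL a given text came from. One possible variant, sketched below, keeps that information in one record per page. The function name `scrape_with_provenance`, its `search_fn` and `fetch_fn` parameters, and the two stand-in functions are assumptions made for this illustration; they are not part of the notebook.

```python
def scrape_with_provenance(names, search_fn, fetch_fn):
    """Return one record per retrieved page, remembering its organization and URL."""
    records = []
    for name in names:
        for url in search_fn(name):
            try:
                text = fetch_fn(url)
            except Exception as error:
                # report the failure instead of silently discarding it
                print(f"skipping {url}: {error}")
                continue
            records.append({"organization": name, "url": url, "text": text})
    return records


# stand-in functions so the sketch runs without network access
def fake_search(name):
    return ["https://example.org/a", "https://example.org/broken"]


def fake_fetch(url):
    if url.endswith("broken"):
        raise ValueError("could not download")
    return "page text for " + url


records = scrape_with_provenance(["World Ocean Council"], fake_search, fake_fetch)
print(records)
```

In the notebook, `scrape_google` and `get_text_from_URL` could be passed as `search_fn` and `fetch_fn`, and `pd.DataFrame(records)` would then yield one column each for the organization, the URL and the text.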


In [30]:
# keep only the texts longer than 1000 characters
scraping_result = [text for text in scraping_result if len(text) > 1000]

In [31]:
len(scraping_result)

136

In [33]:
# store the remaining texts in a single-column dataframe and save it to disk
df = pd.DataFrame(scraping_result, columns=['Text'])
df.to_csv("scraped_websites.csv", encoding='utf-8-sig')