# Polish portal of non-governmental organizations

The aim of this project is to collect information about non-governmental organizations in a selected category from the Polish portal of non-governmental organizations, [https://spis.ngo.pl/](https://spis.ngo.pl/). To extract most of the information I used the `requests` and `bs4` libraries; to reveal the email addresses, which are hidden behind JavaScript, I used `selenium`.

## Analyzing the category page

I started by analyzing the category page, looking for the element that holds the entire content of a single item on the list.
That element is `<div class="pv3 bb b--light-gray">`. Its parent, which represents the entire list, is the `<div>` one level higher: `<div class="relative mb5">`.

## Data extraction from the category page

Next, I wrote a script that extracts the links to the pages of all foundations listed on the category page.

Using the `requests` library, I retrieve the content of the category page, and then with `bs4` I find the element of the HTML document that contains all the foundations:

In [1]:
from ipywidgets import widgets

input_dropdown = widgets.Dropdown(
    options=[('Sport, turystyka, rekreacja, hobby', 2753), ('Kultura, sztuka, tradycja', 2759), ('Edukacja, wychowanie', 2771), ('Ochrona zdrowia', 2777), ('Usługi socjalne, pomoc społeczna', 2783), ('Ratownictwo, bezpieczeństwo', 2791), ('Rozwój lokalny', 2796), ('Ekologia', 2803), ('Prawa człowieka, demokracja, prawo', 2810), ('Wsparcie dla organizacji pozarządowych', 2816), ('Rynek pracy, aktywizacja zawodowa', 2838), ('Działalność międzynarodowa', 2844), ('Nauka, technika', 2850), ('Sprawy zawodowe, pracownicze, branżowe', 2853), ('Religia', 2854), ('Inne', 2855), ('Pomoc dla Ukrainy', 4291)], 
    value=2753,
    description='Category:',
    layout={'width': 'max-content'})
input_dropdown

Dropdown(description='Category:', layout=Layout(width='max-content'), options=(('Sport, turystyka, rekreacja, …

In [2]:
import requests
from bs4 import BeautifulSoup

URL = f'https://spis.ngo.pl/?cat[2384]={input_dropdown.value}'

response = requests.get(URL)
category = BeautifulSoup(response.text, 'html.parser')

list_of_organizations = category.select_one('div.relative.mb5')
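
The query string above is written by hand inside an f-string. As a minimal sketch, the same URL can be assembled with the standard library, which also percent-encodes the square brackets. The helper name `build_category_url` is hypothetical, and it is an untested assumption that the portal treats the encoded form `cat%5B2384%5D` the same way as the literal `cat[2384]`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://spis.ngo.pl/'


def build_category_url(category_id):
    # Hypothetical helper: 'cat[2384]' is the parameter name used in the
    # notebook; urlencode percent-encodes the brackets as %5B and %5D.
    query = urlencode({'cat[2384]': category_id})
    return f'{BASE_URL}?{query}'


# 2753 is the id of the first category in the dropdown.
print(build_category_url(2753))
```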

Next, I look for the elements representing individual organizations, selected here with `li.pv3.bb`. I search for them only within `list_of_organizations`, not in the whole `category` document:

In [3]:
organizations = list_of_organizations.select('li.pv3.bb')
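
To illustrate why the search is limited to the list container, here is a small sketch on made-up HTML. The markup below only imitates the class names mentioned in this notebook and is not copied from the portal; the extra item outside the container is an assumption added to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the structure described in the text.
SAMPLE_HTML = '''
<ul><li class="pv3 bb"><a href="/menu">Menu entry</a></li></ul>
<div class="relative mb5">
  <ul>
    <li class="pv3 bb b--light-gray"><a href="/org-a">Organization A</a></li>
    <li class="pv3 bb b--light-gray"><a href="/org-b">Organization B</a></li>
  </ul>
</div>
'''

page = BeautifulSoup(SAMPLE_HTML, 'html.parser')
container = page.select_one('div.relative.mb5')

# Searching the whole page also matches the unrelated menu entry...
whole_page_hits = page.select('li.pv3.bb')
# ...while searching inside the container returns only the list items.
scoped_hits = container.select('li.pv3.bb')
scoped_names = [item.get_text(strip=True) for item in scoped_hits]

print(len(whole_page_hits), scoped_names)
```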

The next step is to create a list of URLs of the organizations that are on this list:

In [4]:
urls = []

for organization in organizations:
    urls.append(
        organization.select_one('a')['href']
    )
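
Whether the `href` values on the portal are absolute or relative is not stated here. As a hedged sketch, the hypothetical helper `normalize_links` below makes every link absolute against the portal address and drops repeated entries while keeping their order; the sample paths are invented.

```python
from urllib.parse import urljoin


def normalize_links(hrefs, base_url='https://spis.ngo.pl/'):
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against base_url.
    absolute = [urljoin(base_url, href) for href in hrefs]
    # dict.fromkeys removes duplicates and preserves the original order.
    return list(dict.fromkeys(absolute))


# Invented example paths, not real organization pages.
sample = ['/123/org-a', 'https://spis.ngo.pl/123/org-a', '/456/org-b']
print(normalize_links(sample))
```

With the notebook's variable this could be used as `urls = normalize_links(urls)` before visiting the pages.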

## Analyzing the organization page

The next page to analyze is the page of any single foundation. All the data we are looking for is collected in a table. Most of it is shown openly; only the email address is hidden until we click **Show** (**Pokaż**).

![Devtools](images/ngo-4.png)

The elements representing each "label-value" row are `<div class="f7-f6-xl bb b--light-gray flex flex-wrap justify-between pv2">`. Inside each of them there are two more elements: the first has the class `"pr2"` and contains the label ("Phone" ("Telefon"), "WWW", "Street" ("Ulica"), "Zip code" ("Kod pocztowy"), etc.), and the second has the classes `"tr"` and `"grow-1"` and contains the value (phone number, website address, etc.).

The parent of all rows is `<div class="ba b--light-gray pa3 pt0">`. We find it first and then look for the "label-value" rows inside it. In a loop, we look at each row and, based on its label, decide what to do with its value.

The row with the e-mail address gets special treatment: we click its **Show** link and then read the revealed content.
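
The label-based decision described above can be sketched without a browser. This is only an illustration, not the function used in this notebook: the helper `store_row` and the two lookup tables are hypothetical, the field names follow the dictionary keys used later in the notebook, and the sample rows are invented.

```python
# Labels whose value is stored once, under the same name.
SINGLE_VALUE_LABELS = {'KRS', 'REGON', 'NIP', 'Rok powstania'}
# Labels that may repeat; each maps to the list field that collects them.
MULTI_VALUE_LABELS = {'Telefon': 'Telefony', 'WWW': 'Adresy www'}


def store_row(data, label, value):
    if label in MULTI_VALUE_LABELS:
        # Create the list on first use, then append the new value.
        data.setdefault(MULTI_VALUE_LABELS[label], []).append(value)
    elif label in SINGLE_VALUE_LABELS:
        data[label] = value
    # Rows with any other label (for example "Ulica") are ignored.
    return data


# Invented sample rows standing in for the (label, value) pairs of a page.
data = {}
for label, value in [('Telefon', '111'), ('Telefon', '222'),
                     ('KRS', '0000000000'), ('Ulica', 'Example 1')]:
    store_row(data, label, value)
print(data)
```

A lookup table keeps the list of handled labels in one place, at the cost of hiding the special case for the e-mail row, which still needs its own branch.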


## Organization data extraction

I put all of the code inside a function that receives the URL of an organization's page as an argument and returns a dictionary with the data we are interested in.

In [5]:
from selenium.webdriver import Firefox
from selenium.common.exceptions import NoSuchElementException

def get_organization_data(url):
    data = {
        'Name' : '',
        'Adres email': '',
        'Adresy www': [],
        'Telefony': [],
        'KRS': '',
        'REGON': '',
        'NIP': '',
        'Rok powstania': '',
    }

    # `browser` is a global Firefox instance created once, before the loop
    # that calls this function; creating it here on every call was about
    # 3 times slower.
    browser.get(url)
    
    table = browser.find_element("css selector",'div.ba.b--light-gray.pa3.pt0')
    rows = table.find_elements("css selector",'div.f7-f6-xl')

    # looking for each row in the table
    for row in rows:
        try:
            # in each row we try to find the label and value
            label = row.find_element("css selector",'.pr2')
            value = row.find_element("css selector",'.tr.grow-1')
        except NoSuchElementException:
            # skipping lines where the label or value could not be found
            # "continue" jumps to the beginning of the for loop, with the next row in the "row" variable
            continue

        # The unique line with email address
        if label.text == 'E-mail':
            try:
                value.find_element("tag name", 'span').click()
                data['Adres email'] = value.text
            except Exception:
                continue

        # Next lines - appending data to dictionary
        elif label.text == 'Telefon':
            data['Telefony'].append(value.text)

        elif label.text == 'WWW':
            data['Adresy www'].append(value.text)

        elif label.text == 'KRS':
            data['KRS'] = value.text

        elif label.text == 'REGON':
            data['REGON'] = value.text

        elif label.text == 'NIP':
            data['NIP'] = value.text

        elif label.text == 'Rok powstania':
            data['Rok powstania'] = value.text
#     browser.quit()
    return data
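
The function reads `value.text` immediately after the click. If the address turns out to appear only after a short delay (an assumption, not something verified here), the read could be repeated until the text looks like an email address. The sketch below shows the idea with a plain polling helper; `wait_for_value` is a hypothetical name, and Selenium's own `WebDriverWait` offers a comparable mechanism.

```python
import time


def wait_for_value(read_value, is_ready, timeout=5.0, interval=0.1):
    # Call read_value() repeatedly until is_ready(value) is true.
    # Returns the accepted value, or None once the timeout has passed.
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Simulated element text: the link label first, then the revealed address.
texts = iter(['Pokaż', 'Pokaż', 'biuro@example.org'])
email = wait_for_value(lambda: next(texts), lambda text: '@' in text,
                       timeout=1.0, interval=0.01)
print(email)
```

Inside `get_organization_data` the call might look like `wait_for_value(lambda: value.text, lambda text: '@' in text)`, placed right after the click.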
    

## Generating a report

Let's summarize: we have a `urls` variable with a list of foundation website addresses, and a `get_organization_data` function that can accept such an address and return a dictionary with data. We can now download the details of each of them in a loop and save the downloaded data in a CSV file.

In [6]:
import csv

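# UTF-8 is set explicitly so that Polish characters are written the same way on every system
with open('report.csv', 'w', newline='', encoding='utf-8') as csvfile: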
    writer = csv.writer(csvfile)
    writer.writerow([
        'Adres email',
        'Adresy www',
        'Telefony',
        'KRS',
        'REGON',
        'NIP',
        'Rok powstania'])
    # geckodriver is expected to be on PATH; recent Selenium versions
    # no longer accept the `executable_path` argument
    browser = Firefox()
    for organization_url in urls:
        data = get_organization_data(organization_url)
        writer.writerow([
            data['Adres email'],
            ' '.join(data['Adresy www']),
            ', '.join(data['Telefony']),
            data['KRS'],
            data['REGON'],
            data['NIP'],
            data['Rok powstania'],
        ])
    browser.quit()
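
As an alternative sketch, `csv.DictWriter` can take the column names once and write each dictionary directly, so the header and the rows cannot drift apart. The helpers `flatten` and `write_report` are hypothetical, the sample record is invented, and unlike the cell above this version joins both list fields with a comma. It writes to an in-memory buffer so that it runs without a browser or a file.

```python
import csv
import io

FIELDS = ['Adres email', 'Adresy www', 'Telefony',
          'KRS', 'REGON', 'NIP', 'Rok powstania']


def flatten(record):
    # Keep only the report columns and turn list values into one string.
    row = {}
    for field in FIELDS:
        value = record.get(field, '')
        if isinstance(value, list):
            value = ', '.join(value)
        row[field] = value
    return row


def write_report(records, stream):
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))


# Invented record shaped like the dictionary returned by get_organization_data.
sample_record = {'Name': '', 'Adres email': 'biuro@example.org',
                 'Adresy www': ['example.org'], 'Telefony': ['111', '222'],
                 'KRS': '0000000000'}

buffer = io.StringIO(newline='')
write_report([sample_record], buffer)
report_text = buffer.getvalue()
print(report_text)
```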
