# Getting familiar with the homepage

On [this page](https://yougov.co.uk/ratings/politics/popularity/charities-organisations/all) you will find a list of UK-based charities. The goal of the workshop is to retrieve information about each charity.

In this part we are going to focus on **getting familiar** with the website under analysis. Without using Selenium yet, we will:
- go to the website address given earlier in the exercise,
- check how we could get the list of the organizations displayed on the page,
- analyze the section on any single organization,
- consider how we can get the following information from the table:
   - organization name,
   - website link.
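
Before turning to Selenium, the inspection steps above can be tried on a small piece of markup. The sketch below uses Python's standard `html.parser` on a hand-written fragment. The tag layout is an assumption modelled on the `rankings-entities-list-item` selector used later in this notebook, and the live page may look different. `OrgListParser`, `SAMPLE_HTML`, and `BASE_URL` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://yougov.co.uk"

# Hand-written fragment for illustration only; the live page's markup may differ.
SAMPLE_HTML = """
<a class="rankings-entities-list-item" href="/topics/health/explore/not-for-profit/NSPCC">
  <span>1</span><span><img alt=""></span><span>NSPCC</span>
</a>
<a class="rankings-entities-list-item" href="/topics/health/explore/not-for-profit/Prostate_Cancer_UK">
  <span>2</span><span><img alt=""></span><span>Prostate Cancer UK</span>
</a>
"""


class OrgListParser(HTMLParser):
    """Collect a name and an absolute link for every list-item anchor."""

    def __init__(self):
        super().__init__()
        self.orgs = []
        self._current = None  # entry being filled while inside a list item

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "rankings-entities-list-item" in classes:
            link = urljoin(BASE_URL, attr_map.get("href") or "")
            self._current = {"name": "", "link": link}

    def handle_data(self, data):
        # The last non-empty text node inside the item is taken as the name.
        text = data.strip()
        if self._current is not None and text:
            self._current["name"] = text

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.orgs.append(self._current)
            self._current = None


parser = OrgListParser()
parser.feed(SAMPLE_HTML)
parser.close()
orgs = parser.orgs
print(orgs)
```

The same two pieces of information (the link's `href` and the text of the name element) are what the Selenium version needs to locate on the real page.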

# Exercise description

Building on the conclusions of the previous exercise, this part focuses on retrieving the first 20 organizations visible on the page. The script should produce a list of dictionaries, each holding an organization's name and URL, as in the example below:

```python
[
 {'name': 'NSPCC',
  'link': 'https://yougov.co.uk/topics/health/explore/not-for-profit/NSPCC'},
 {'name': 'Prostate Cancer UK',
  'link': 'https://yougov.co.uk/topics/health/explore/not-for-profit/Prostate_Cancer_UK'},
 {'name': 'British Red Cross', ...},
 ...
]
```

Building on the conclusions from the previous exercise, this notebook focuses on retrieving the data available for [Macmillan Cancer Support](https://yougov.co.uk/topics/health/explore/not-for-profit/Macmillan_Cancer_Support?content=all). To do this, complete the implementation of the `get_organization_detailed_data(url)` function so that it retrieves the following data:

- from page header (next to the logo):
    - fame
    - popularity
    - disliked by
    - neutral

The data should be returned as a dictionary with the following structure:

```python
{
    'Fame': '0%',
    'Popularity': '0%',
    'Disliked by': '0%',
    'Neutral': '0%'
}
```
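
A selector that matches nothing returns an empty list rather than raising an error, so it can help to check the resulting dictionary against the expected keys. The following is a minimal sketch: `EXPECTED_KEYS`, `build_stats`, and `missing_keys` are hypothetical helper names, and plain strings stand in for the text of the page elements.

```python
EXPECTED_KEYS = ("Fame", "Popularity", "Disliked by", "Neutral")


def build_stats(labels, values):
    """Pair label texts with value texts, trimming surrounding whitespace."""
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


def missing_keys(stats):
    """Return the expected keys that are absent from `stats`, in order."""
    return [key for key in EXPECTED_KEYS if key not in stats]


# Plain strings stand in for the text scraped from the page header.
stats = build_stats(
    [" Fame ", "Popularity", "Disliked by", "Neutral"],
    ["0%", "0%", "0%", " 0% "],
)
print(stats)
print(missing_keys(stats))  # empty list when all four keys are present
print(missing_keys({}))     # every expected key when nothing was scraped
```

A non-empty result from `missing_keys` is a hint that a selector no longer matches the page.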

# Introduction

Copy the function definitions from the previous exercises into this notebook. Then retrieve the data for each organization and save it to the `yougov.json` file.

### Hints
- scraping the website takes some time, so we recommend using the `tqdm` library to track progress,
- until now we worked in separate notebooks and had to accept cookies in each one; now that the functions run together, the cookie handling may need to be modified,
- for reading and writing JSON, see `Day 4 - API -> JSON`.
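
One possible way to act on the cookie hint is to reuse a single browser session and handle the banner only on the first page. The sketch below assumes that the site remembers the consent for the rest of the session, which has not been verified here. `scrape_with_shared_driver` and `FakeDriver` are hypothetical names; the fake driver only exists so the example runs without a browser.

```python
def scrape_with_shared_driver(driver, urls, accept_cookies, extract):
    """Visit every URL with one driver, handling the cookie banner only once.

    Assumes the site remembers the consent for the rest of the browser session.
    """
    results = []
    for index, url in enumerate(urls):
        driver.get(url)
        if index == 0:
            accept_cookies(driver)  # banner expected on the first page only
        results.append(extract(driver))
    return results


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)


cookie_calls = []
fake = FakeDriver()
results = scrape_with_shared_driver(
    fake,
    ["https://example.org/a", "https://example.org/b"],
    accept_cookies=cookie_calls.append,         # records each consent attempt
    extract=lambda driver: driver.visited[-1],  # pretend the "data" is the URL
)
print(results, len(cookie_calls))
```

With a real driver, `extract` would read the labels and values from the current page, and the caller would be responsible for calling `driver.quit()` once all pages are done.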

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.alert import Alert
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from time import sleep
from tqdm import tqdm
import json

In [2]:
def accept_cookies(driver):
    try:
        cookies_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))
        )
        cookies_button.click()
    except Exception as e:
        print("No cookies button found or other error:", e)

def get_organization_list(url):
    # Set up the Selenium WebDriver (Chrome in this example)
    driver = webdriver.Chrome()
    # Implicit wait: element lookups retry for up to 10 seconds before failing
    driver.implicitly_wait(10)

    try:
        # Navigate to the page and handle the cookie banner
        driver.get(url)
        accept_cookies(driver)

        sleep(1)  # short extra pause so the list can finish rendering

        # Find all list items; each one carries the organization link and name
        org_elements = driver.find_elements(By.CSS_SELECTOR, ".rankings-entities-list-item")

        # Collect the organization data as a list of dictionaries
        org_data = []
        for element in org_elements[:20]:  # keep only the first 20 organizations
            name = element.find_element(By.CSS_SELECTOR, "span:nth-child(3)").text
            link = element.get_attribute("href")  # href is read from the item itself
            org_data.append({'name': name, 'link': link})
    finally:
        # Always close the browser, even if scraping raised an exception
        driver.quit()

    return org_data

In [3]:
def get_organization_detailed_data(url):
    # Configure Chrome to run in headless mode (no visible browser window)
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    driver.maximize_window()

    try:
        # Open the organization page and dismiss the cookie banner
        driver.get(url)
        accept_cookies(driver)

        # Labels and values come back as two parallel lists of elements
        labels = driver.find_elements(By.CSS_SELECTOR, ".entity-header-title .label")
        values = driver.find_elements(By.CSS_SELECTOR, ".entity-header-title + .value")

        # Pair each label (Fame, Popularity, Disliked by, Neutral) with its value
        data = {}
        for label, value in zip(labels, values):
            data[label.text.strip()] = value.text.strip()
    finally:
        # Always close the browser, even if scraping raised an exception
        driver.quit()

    return data


In [4]:
# get the list of organizations here

# Define the URL for the organizations list
url = 'https://yougov.co.uk/ratings/politics/popularity/charities-organisations/all'

# Retrieve the list of organizations
org_list = get_organization_list(url)

In [5]:
# get all the required data here
# Initialize an empty list to store detailed organization data
detailed_org_data = []

# Iterate through the organizations and retrieve detailed data
for org in tqdm(org_list, desc="Fetching Detailed Data"):
    org_name = org['name']
    org_link = org['link']
    detailed_data = get_organization_detailed_data(org_link)
    detailed_data['name'] = org_name
    detailed_org_data.append(detailed_data)


Fetching Detailed Data:   0%|          | 0/20 [00:00<?, ?it/s]

Fetching Detailed Data: 100%|██████████| 20/20 [04:57<00:00, 14.86s/it]


In [17]:
detailed_org_data

[{'Fame': '99%',
  'Popularity': '87%',
  'Disliked by': '3%',
  'Neutral': '10%',
  'name': 'Cancer Research UK'},
 {'Fame': '98%',
  'Popularity': '83%',
  'Disliked by': '3%',
  'Neutral': '12%',
  'name': 'British Heart Foundation'},
 {'Fame': '96%',
  'Popularity': '82%',
  'Disliked by': '2%',
  'Neutral': '12%',
  'name': 'Macmillan Cancer Support'},
 {'Fame': '98%',
  'Popularity': '79%',
  'Disliked by': '2%',
  'Neutral': '16%',
  'name': "Alzheimer's Society"},
 {'Fame': '95%',
  'Popularity': '79%',
  'Disliked by': '3%',
  'Neutral': '13%',
  'name': 'Great Ormond Street Hospital'},
 {'Fame': '95%',
  'Popularity': '78%',
  'Disliked by': '4%',
  'Neutral': '13%',
  'name': 'RSPCA'},
 {'Fame': '98%',
  'Popularity': '78%',
  'Disliked by': '3%',
  'Neutral': '16%',
  'name': 'WWF'},
 {'Fame': '95%',
  'Popularity': '77%',
  'Disliked by': '3%',
  'Neutral': '15%',
  'name': 'British Red Cross'},
 {'Fame': '95%',
  'Popularity': '76%',
  'Disliked by': '1%',
  'Neutral': '1

In [None]:
# save the data to yougov.json file here

# Save the detailed organization data to a JSON file
with open('yougov.json', 'w', encoding='utf-8') as json_file:
    json.dump(detailed_org_data, json_file, ensure_ascii=False, indent=4)

print("Data saved to yougov.json")
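
To confirm that the file was written as intended, it can be read back with `json.load`. The sketch below is self-contained: it writes placeholder records to a temporary directory rather than touching the real `yougov.json`, and the record values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Placeholder records; real values come from the scraping step.
records = [{"Fame": "0%", "Popularity": "0%", "Disliked by": "0%",
            "Neutral": "0%", "name": "Example Charity"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "yougov.json"

    # Write with the same options as the cell above
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(records, json_file, ensure_ascii=False, indent=4)

    # Read the file back and keep the parsed result for comparison
    with open(path, encoding="utf-8") as json_file:
        loaded = json.load(json_file)

print(loaded == records)
```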
