# Scraping the World Press Freedom Index Report

### Source: [Reporters Without Borders](https://rsf.org/en/ranking_table)

![image.png](https://i.imgur.com/p3n7MZ9.png)

[Web Scraping](https://en.wikipedia.org/wiki/Web_scraping) is an automated process of gathering data from a server. It is usually accomplished by writing a program that queries a web server, requests data (usually HTML) and parses it to extract the required information. There are a number of ways to achieve this, but we are going to use the requests, Beautiful Soup and pandas libraries.

The [Press Freedom Index](https://rsf.org/en/ranking) is an annual ranking of countries compiled and published by [Reporters Without Borders](https://rsf.org/en) since 2002 based upon the organisation's own assessment of the countries' press freedom records in the previous year. The Index ranks 180 countries and regions according to the level of freedom available to journalists. Reporters Without Borders is an international non-profit and non-governmental organization with the stated aim of safeguarding the right to freedom of information. It describes its advocacy as founded on the belief that everyone requires access to the news and information, in line with Article 19 of the Universal Declaration of Human Rights that recognizes the right to receive and share information regardless of frontiers, along with other international rights charters. RSF has consultative status at the United Nations, UNESCO, the Council of Europe, and the International Organisation of the Francophonie.

#### Outline:
* Using the `requests` library, fetch the HTML of the `https://rsf.org/en/ranking_table` page.
* Parse the DOM tree of the HTML page using the `BeautifulSoup()` constructor provided by the `Beautiful Soup` library.
* Identify patterns and attributes such as ids and classes, and use them to fetch the elements containing the required data.
* Compile the extracted information into Python lists and dictionaries.
* Save the extracted information to a CSV file.
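
The outline above can be compressed into a minimal offline sketch. It parses a small hand-written HTML table instead of calling `requests.get`, so the sample markup, the three-column layout and the `rows_to_frame` helper are assumptions made for illustration; the real page has more columns and different markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the ranking table (assumption, not the live markup).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td><a href="/en/norway">Norway</a></td><td>6.72</td></tr>
    <tr><td>2</td><td><a href="/en/finland">Finland</a></td><td>6.99</td></tr>
  </tbody>
</table>
"""

def rows_to_frame(html):
    """Parse every <tr> inside the first <tbody> into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find("tbody").find_all("tr"):
        rank_cell, country_cell, score_cell = row.find_all("td")
        records.append({
            "Rank": int(rank_cell.text.strip()),
            "Country": country_cell.find("a").text.strip(),
            "Global Score": float(score_cell.text.strip()),
        })
    return pd.DataFrame(records)

frame = rows_to_frame(SAMPLE_HTML)
csv_text = frame.to_csv(index=False)  # with no path, to_csv returns a string
print(csv_text)
```

The rest of the notebook follows the same shape, only with a live request in front and an extra page fetch per country.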

In the end, here's what the CSV will look like:
```
Rank,Country,Abuse Score,Global Score,Detail Url,Situation Score,Journalist Killings,Citizen Journalist Killings,Media Assistants Killings,2020,2019,2018,2017,2016,2015,2014,2013
1,Norway,0,6.72,https://rsf.org//en/norway,6.72,0,0,0,1,1,1,1,3,2,3,3
2,Finland,0,6.99,https://rsf.org//en/finland,6.99,0,0,0,2,2,4,3,1,1,1,1
...
```

Libraries Used: `requests, Beautiful Soup, Pandas`

You can use the `Run` button at the top of the page to execute the code.


In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="webscrapping-pressfreedom-report", git_commit=True)


# 1.Using requests library to get the HTML page

- We use the `get` method provided by the `requests` library to fetch the page from the web server.
- We pass the site URL to `get` and check the status code to confirm that the call succeeded.

In [None]:
site_url = 'https://rsf.org/en/ranking_table'
base_url = 'https://rsf.org/'

Installing the `requests` library from pip and importing it.

In [None]:
!pip install requests --upgrade --quiet

In [None]:
import requests

In [None]:
site_response = requests.get(site_url)

In [None]:
site_response.status_code

Checking the status code of the response to confirm that the page loaded successfully. Status codes in the range `200-299` indicate a successful response.
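
The 2xx rule can be written down as a tiny helper. This is a sketch using only the standard library; `is_success` and `describe` are hypothetical names introduced here. The `requests` library also offers `Response.ok` and `Response.raise_for_status()`, but note that `ok` is true for any status below 400, so it is looser than a strict 2xx check.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for any 2xx status code."""
    return 200 <= status_code <= 299

def describe(status_code):
    """Return a short label such as '404 Not Found' for a known status code."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"

for code in (200, 204, 301, 404, 500):
    verdict = "success" if is_success(code) else "not a success"
    print(describe(code), "->", verdict)
```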

In [None]:
response_text = site_response.text

In [None]:
response_text[:1000]

In [None]:
with open('press_freedom_rankings.html', 'w', encoding="utf-8") as file:
    file.write(response_text)

Writing the HTML response text to an HTML file and saving it.

# 2.Using Beautiful Soup library to extract data from the webpage.

- We create a parse tree for the HTML page (`response_text`) using the `BeautifulSoup()` constructor; the tree can then be used to extract data from the HTML.
- We can now traverse the DOM tree of the HTML page to extract the required data.
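
As a rough illustration of that traversal, the sketch below parses a hand-written table (an assumption standing in for the real page) and walks from `tbody` to rows to cells with `find` and `find_all`.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the live ranking page has different, larger markup.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rank</th><th>Country</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Norway</td></tr>
    <tr><td>2</td><td>Finland</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag (or None); find_all() returns a list.
tbody = soup.find("tbody")
body_rows = tbody.find_all("tr")

# Each row is itself searchable, so cells can be pulled out per row.
table = [[cell.text.strip() for cell in row.find_all("td")] for row in body_rows]
print(table)
```

Searching inside `tbody` rather than the whole document keeps the header row out of the result.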

In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

Installing the `Beautiful Soup` library from pip and importing it.

In [None]:
parsed_web_content = BeautifulSoup(response_text, 'html.parser')

`BeautifulSoup()` creates a parse tree for the HTML page so that we can extract the required data.

In [None]:
tbody_element = parsed_web_content.find('tbody')

Getting the HTML element with `tag: 'tbody'` 
![image1](https://i.imgur.com/cUYadZu.png)

In [None]:
all_entity_rows = tbody_element.find_all('tr')

Each `tr` tag holds the HTML content of one table row.

In [None]:
all_entity_rows[:5]

`all_entity_rows` is a list of HTML snippets, each containing the data of one row.

# 3.Consider the example of scraping data of the country `Norway`.

- Here we fetch the data of the country Norway to understand the strategy behind fetching the required data by identifying patterns (like classes, ids and other attributes) in the HTML code.

In [None]:
example_index = 0
data_cells = all_entity_rows[example_index].find_all('td')

In [None]:
retained_cells = data_cells[2: len(data_cells) - 2] #Slicing the array to retain the required cells

In [None]:
#Function to format and get the required values in a row
def get_row_values(row):
    return_array = []
    for index, cell in enumerate(row):
        if index == 0:
            country = cell.find('a')
            return_array.append(country.text.strip())
            return_array.append(base_url+country['href'])
        else:
            return_array.append(cell.text.strip())
    return return_array

* Calling the `get_row_values()` function with the retained cells of the table row that holds Norway's data.
* Assigning the values of `Country, Details Url, Abuse Score, Underlying Situation Score, Global Score` to their respective variables.
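
One side effect of building the Details Url by string concatenation is the double slash visible in the sample CSV (`https://rsf.org//en/norway`). The sketch below shows `urllib.parse.urljoin` from the standard library as an optional alternative; the `href` value is an assumption inferred from that CSV output.

```python
from urllib.parse import urljoin

base_url = "https://rsf.org/"
href = "/en/norway"  # assumed shape of the link's href attribute

concatenated = base_url + href     # what get_row_values() does
joined = urljoin(base_url, href)   # resolves the path against the base

print(concatenated)  # https://rsf.org//en/norway
print(joined)        # https://rsf.org/en/norway
```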

In [None]:
country, details_url, abuse_score, situation_score, global_score = get_row_values(retained_cells)

In [None]:
print('Country: ', country)
print('Details Url: ', details_url)
print('Abuse Score: ', abuse_score)
print('Underlying Situation Score: ', situation_score)
print('Global Score: ', global_score)

Fetching the country-specific page (Norway, in our case) from the Details Url to parse the `rankings of the country since 2013, journalist deaths, citizen journalist deaths and media assistant deaths`.

![image2](https://i.imgur.com/YMIjkvT.png)

In [None]:
# Function to get a parsed tree of a html page
def get_html_page(page_url):
    html_page = requests.get(page_url)
    if html_page.status_code == 200:
        response_text = html_page.text
        parsed_web_content = BeautifulSoup(response_text, 'html.parser')
        return parsed_web_content
    else:
        raise Exception('Failed to load the page')

In [None]:
page_content_details = get_html_page(details_url)

In [None]:
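#Function to format and get the details from an individual country page
#Note: the `url` argument is not used; the function reads the global `page_content_details` fetched above
def get_other_info_from_each_url(url):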
    elements = page_content_details.find_all('div', class_='js-animated-number')
    def return_values(value):
        return value.text.strip()
    formatted_elems = map(return_values, elements)
    return list(formatted_elems)

journalist_deaths, citizen_journalist_deaths, media_assist_deaths = get_other_info_from_each_url(details_url)
rank_element = page_content_details.find('div', class_='white-b__ranking-score')
rank = rank_element.text.strip()

print('Journalist Killings:', journalist_deaths)
print('Citizen Journalist Killings:', citizen_journalist_deaths)
print('Media Assistant Killings:', media_assist_deaths)
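
The class-based lookups in the cell above can be tried offline. In this sketch the class names come from the notebook, while the surrounding markup and the values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a country page; the real markup is an assumption.
SAMPLE_HTML = """
<div class="white-b__ranking-score"> 1 </div>
<div class="js-animated-number counter">0</div>
<div class="js-animated-number">2</div>
<div class="other">ignored</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches any tag whose class list contains the given name,
# so the first counter matches even though it carries a second class.
counters = [div.text.strip()
            for div in soup.find_all("div", class_="js-animated-number")]
rank = soup.find("div", class_="white-b__ranking-score").text.strip()

print(counters, rank)
```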

In [None]:
#Function to get previous years' rankings for the country
#Reads the global `page_content_details` and appends to the global `rankings` dict
rankings = {}
def get_previous_year_rankings():
    table_body_elements = page_content_details.find_all('tbody')
    for element in table_body_elements:
        rows = element.find_all('tr')[1:]  # skipping the first row of each table
        for row in rows:
            cell_items = row.find_all('td')
            year = cell_items[0].text.strip()
            rank = cell_items[1].text.split('/')[0].strip()
            if year not in rankings:
                rankings[year] = []
            rankings[year].append(rank)

* Using the `get_previous_year_rankings()` method to fetch the previous years ranking of the country and adding them to the rankings dictionary.
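
The bookkeeping behind the rankings dictionary can be shown without any HTML. In this sketch `parse_rank` and `add_rankings` are hypothetical helpers, and the `'1 / 180'` cell format is an assumption inferred from the `split('/')` call in the function above.

```python
def parse_rank(cell_text):
    """Turn a cell such as '1 / 180' into the rank '1'."""
    return cell_text.split("/")[0].strip()

def add_rankings(rankings, year_rank_pairs):
    """Append one country's (year, cell text) pairs to a dict of lists keyed by year."""
    for year, cell_text in year_rank_pairs:
        rankings.setdefault(str(year), []).append(parse_rank(cell_text))
    return rankings

rankings = {}
add_rankings(rankings, [(2020, "1 / 180"), (2019, "1 / 180")])  # made-up sample cells
add_rankings(rankings, [(2020, "2 / 180"), (2019, "2 / 180")])
print(rankings)
```

Each key ends up holding one entry per country, in the order the countries were processed, which is what later lets every year become a column.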

In [None]:
get_previous_year_rankings()

* Final data scraped for the country `Norway`:

In [None]:
print('Current Rank: ', rank)
print('Country: ', country)
print('Details Url: ', details_url)
print('Abuse Score: ', abuse_score)
print('Underlying Situation Score: ', situation_score)
print('Global Score: ', global_score)
print('Journalist Killings:', journalist_deaths)
print('Citizen Journalist Killings:', citizen_journalist_deaths)
print('Media Assistant Killings:', media_assist_deaths)
for year in rankings:
    print('Ranking for the year {year}: {rank}'.format(year=year, rank=rankings[year][0]))


# 4.Scraping the Data for Multiple Countries.

* Here we try to scrape the data of the first 30 countries.
* We use the methods above and run them in a for loop specifying the number of countries we want to fetch the data for.

In [None]:
ranks = []
countries = []
abuse_scores = []
global_scores = []
details_urls = []
situation_scores = []
list_journalist_deaths = []
list_citizen_journalist_deaths = []
list_media_assist_deaths = []
rankings = {}

for i in range(30): # rows 0-29, i.e. the first 30 countries (starting at 1 would skip the first row)
    data_cells = all_entity_rows[i].find_all('td') # Getting the rows, each row corresponding to each country
    retained_cells = data_cells[2: len(data_cells) - 2]  # retaining the required data cells
    country, details_url, abuse_score, situation_score, global_score = get_row_values(retained_cells)
    print('Scraping data for the country: ', country)
    page_content_details = get_html_page(details_url) # Fetching the individual country page to get additional data
    journalist_deaths, citizen_journalist_deaths, media_assist_deaths = get_other_info_from_each_url(details_url)
    rank_element = page_content_details.find('div', class_='white-b__ranking-score')
    rank = rank_element.text.strip()
    get_previous_year_rankings()
    
    # appending the values to the output arrays
    ranks.append(rank)
    countries.append(country)
    details_urls.append(details_url)
    abuse_scores.append(abuse_score)
    global_scores.append(global_score)
    situation_scores.append(situation_score)
    list_journalist_deaths.append(journalist_deaths)
    list_citizen_journalist_deaths.append(citizen_journalist_deaths)
    list_media_assist_deaths.append(media_assist_deaths)
    
scrappedData_final = {
    'Rank': ranks,
    'Country': countries,
    'Abuse Score': abuse_scores,
    'Global Score': global_scores,
    'Detail Url': details_urls,
    'Situation Score': situation_scores,
    # using the accumulated lists, not the scalar values left over from the last loop iteration
    'Journalist Killings': list_journalist_deaths,
    'Citizen Journalist Killings': list_citizen_journalist_deaths,
    'Media Assistants Killings': list_media_assist_deaths
}

scrappedData_final.update(rankings) # adding year-wise rankings to the data dictionary


# 5.Using Pandas to convert the dictionary of scraped data to a DataFrame

The dictionary of values is converted to a DataFrame using the `pd.DataFrame` constructor provided by Pandas.

* A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns.
* Pandas offers data structures and operations for manipulating numerical tables and time series.
* The DataFrame is one of the primary data structures in pandas.
* Here each list of values in the dict `scrappedData_final` is converted into a column.
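
A small sketch of that conversion, using values copied from the sample CSV rows near the top of the notebook. It also shows why every list must have the same length: pandas refuses to build a table from ragged columns.

```python
import pandas as pd

# Values taken from the sample CSV rows shown earlier.
columns = {
    "Rank": [1, 2],
    "Country": ["Norway", "Finland"],
    "Global Score": [6.72, 6.99],
}

frame = pd.DataFrame(columns)  # each dict key becomes a column
print(frame.shape)             # (2, 3)
print(list(frame.columns))     # ['Rank', 'Country', 'Global Score']

# Lists of unequal length cannot form a table, so pandas raises ValueError.
try:
    pd.DataFrame({"Rank": [1, 2], "Country": ["Norway"]})
    ragged_ok = True
except ValueError as error:
    ragged_ok = False
    print("ValueError:", error)
```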

In [None]:
!pip install pandas --upgrade --quiet

In [None]:
import pandas as pd

Installing the `pandas` library from pip and importing it.


In [None]:
ranking_total_Info = pd.DataFrame(scrappedData_final)

In [None]:
ranking_total_Info

In [None]:
ranking_total_Info.to_csv('press-freedom-index.csv', index=False)

Using the `DataFrame.to_csv` method provided by pandas to save the DataFrame to a CSV file (`press-freedom-index.csv`).
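
To see what the `index` argument changes, the sketch below writes a tiny made-up DataFrame to in-memory buffers instead of a file. Leaving the index on adds an unnamed first column of row numbers, which is why the notebook turns it off.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Rank": [1, 2], "Country": ["Norway", "Finland"]})

# Writing to an in-memory buffer keeps the example free of side effects.
with_index = io.StringIO()
frame.to_csv(with_index)

without_index = io.StringIO()
frame.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,Rank,Country
print(without_index.getvalue().splitlines()[0])  # Rank,Country
```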

# 6.Putting Everything Together

### Scraping the data for all 180 countries and saving it to a CSV file (`press-freedom-index.csv`).

#### Data we are scraping:
* Name of the Country
* Current rank
* Details page url
* Abuse Score
* Underlying situation score
* Global score
* Previous rankings (2013-2020)
* Journalists killed (2021)
* Citizen Journalists killed (2021)
* Media assistants killed (2021)

#### Strategy:

* Using the [site url](https://rsf.org/en/ranking_table), we fetch the HTML page with the requests library.
* The HTML code is converted to a parse tree using the BeautifulSoup library.
* Using the methods provided by the BeautifulSoup library, we locate specific HTML snippets through identifiers such as class names, ids and other attributes.
* In this project, we parsed a `tbody` HTML snippet whose `<tr>` rows hold the data cells (`<td>`) of each country.
* The attributes fetched here are `Name, Details page url, Abuse score, Underlying situation score and Global score`.
* Next, we loop through the list of urls and fetch the HTML page of each individual country using the same methodology.
* From each country page, we fetch `Previous rankings (2013-2020), Current ranking, Journalists killed, Citizen Journalists killed and Media assistants killed for the year 2021`.
* The lists of data are put together into a dict (`scrappedData_final`) whose keys are the column names.
* Using the pandas library, we convert the data into a DataFrame, which provides various methods to manipulate and analyze the data.
* Using the `to_csv` method provided by pandas, we save the DataFrame to a CSV file (`press-freedom-index.csv`).
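
One detail worth spelling out: every column list must end up the same length before the dict can become a DataFrame, which is why `get_previous_rankings` in the script below pads missing years with `np.nan`. The sketch here shows the same idea with the standard library only; `record_country` is a hypothetical helper, the input histories are made up, and unlike the script it ignores years outside the expected range.

```python
import math

EXPECTED_YEARS = [str(year) for year in range(2013, 2021)]  # '2013' .. '2020'

def record_country(rankings, ranks_by_year):
    """Append one value per expected year, using NaN where the country has no rank."""
    for year in EXPECTED_YEARS:
        rankings.setdefault(year, []).append(ranks_by_year.get(year, math.nan))
    return rankings

rankings = {}
record_country(rankings, {"2019": "1", "2020": "1"})  # made-up partial history
record_country(rankings, {"2020": "2"})

# Every year now holds exactly one entry per country processed.
lengths = {len(values) for values in rankings.values()}
print(lengths)  # {2}
```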


In [None]:
import requests # used to download content from the web
from bs4 import BeautifulSoup # a Python library for pulling data out of HTML and XML files
import pandas as pd # used to work with DataFrames
import numpy as np # provides np.nan for missing values
import jovian
from IPython.display import display # renders DataFrames in notebook output

In [None]:
site_url = 'https://rsf.org/en/ranking_table'
base_url = 'https://rsf.org/'

list_countries = []
list_ranking = []
list_details_urls = []
list_abuse_scores = [] 
list_global_scores = []
list_situation_scores = []
list_journalist_killings = []
list_citizen_journalist_killings = []
list_media_assistants_killings = []
rankings = {}

#Function to fetch individual cell values
def fetch_data_cell_values(row):
    #gathering the data of each data cell and appending it to a list
    cell = []
    for index, i in enumerate(row):
        if index == 0:
            country = i.find('a')
            cell.append(country.text.strip()) #country name 
            cell.append(base_url + country['href']) # details url of the country
        else:
            cell.append(i.text.strip()) #other values 
    return cell
    
def get_list_of_rows(parsed_web_content):
    # parsing the tbody snippet which has all the data(tr tags)
    tbody_element = parsed_web_content.find('tbody')
    # parsing the rows inside the tbody tag(each row has multiple data cells, each data cell corresponds to a certain attribute)
    data_rows = tbody_element.find_all('tr')
    return data_rows

def get_list_of_data_cells(row):
    #parsing data cell for each row
    data_cells = row.find_all('td')
    #slicing the list to retain the required cells
    required_data_cells = data_cells[2: len(data_cells) - 2]
    return required_data_cells

#Function to download and parse the html of the page
def get_parsed_html_page(page_url):
    #using requests to get the html content
    html_page = requests.get(page_url)
    #checking for the status of the call
    if html_page.status_code == 200:
        response_text = html_page.text
        #using beautiful soup library fetching the parsing tree of the html page
        parsed_web_content = BeautifulSoup(response_text, 'html.parser')
        return parsed_web_content
    else:
        raise Exception('Failed to load the page')

def get_country_specific_data(url):
    #Fetching the html content of the single country page
    parsed_html_page = get_parsed_html_page(url)
    #getting the archived rankings of the country ( 2013-2020) and appending them to a dict of lists
    get_previous_rankings(parsed_html_page)
    #Gathering other info from the page
    get_other_info_from_each_url(parsed_html_page)

def get_other_info_from_each_url(page_content):
    #parsing div with class 'js-animated-number' which returns 3 elements each with a value
    elements = page_content.find_all('div', class_='js-animated-number')
    def return_values(value):
        return value.text.strip()
    formatted_elems = map(return_values, elements)
    #parsing the element with rank value and appending it the list
    rank_element = page_content.find('div', class_='white-b__ranking-score')
    list_ranking.append(rank_element.text.strip())
    #Assigning the values to respective variables and appending them to the arrays
    journalist_deaths, citizen_journalist_deaths, media_assist_deaths = formatted_elems
    list_journalist_killings.append(journalist_deaths)
    list_citizen_journalist_killings.append(citizen_journalist_deaths)
    list_media_assistants_killings.append(media_assist_deaths)

def get_previous_rankings(page_content):
    #parsing 2 tbody snippets
    table_body_elements = page_content.find_all('tbody')
    expected_array_cells = [str(num) for num in range(2013, 2021)]
    row_index = []
    for element in table_body_elements:
        #getting tr(row) which had year and rank data in each cell
        rows = element.find_all('tr')[1: len(element)]
        for cell in rows:
            #obtaining data cell with each containing year and rank 
            cell_items = cell.find_all('td')
            year = str(cell_items[0].text.strip()) #year value
            rank = cell_items[1].text.split('/')[0].strip() #rank value
            row_index.append(year)
            if year not in rankings.keys():
                rankings[year] = []
            rankings[year].append(rank)
    # Filling the missing years with NaN so that every list has the same length
    for year in expected_array_cells:
        if year not in row_index:
            if year not in rankings.keys():
                rankings[year] = []
            rankings[year].append(np.nan)
            
def arrange_and_display_dataframe():
    #Putting all the scraped data into a dict
    scrappedData_final = {
        'Rank': list_ranking,
        'Country': list_countries,
        'Abuse Score': list_abuse_scores,
        'Global Score': list_global_scores,
        'Detail Url': list_details_urls,
        'Situation Score': list_situation_scores,
        'Journalist Killings': list_journalist_killings,
        'Citizen Journalist Killings': list_citizen_journalist_killings,
        'Media Assistants Killings': list_media_assistants_killings
    }

    #merging the dict with the year-wise rankings
    scrappedData_final.update(rankings)

    #Creating a DataFrame from the dict using the pandas library
    ranking_total_Info = pd.DataFrame(scrappedData_final)

    return ranking_total_Info
 
            
def scrape_press_freedom_index():
    #parsing the html data
    parsed_html_page = get_parsed_html_page(site_url)
    #getting the rows(each row consists of the data of a certain country)
    data_rows = get_list_of_rows(parsed_html_page)
    for index, row in enumerate(data_rows):
        #getting the data cells/column(each cell consists of the data of a certain parameter of the report)
        data_cells = get_list_of_data_cells(row)
                                           
        #appending the values to respective arrays
        country, details_url, abuse_score, situation_score, global_score = fetch_data_cell_values(data_cells)
        list_countries.append(country)
        list_details_urls.append(details_url)
        list_abuse_scores.append(abuse_score)
        list_situation_scores.append(situation_score)
        list_global_scores.append(global_score)
        print('Fetching details for {country} '.format(country=country))
                                           
        #Fetching the html page of each country and scraping the required data from it
        get_country_specific_data(details_url)

    #Arranging the scraped data into a DataFrame (saving to a csv file happens in a later cell)
    data_frame = arrange_and_display_dataframe()
    
    return data_frame

Calling the function `scrape_press_freedom_index()` to scrape the press freedom report data for all 180 countries.


In [None]:
data_frame = scrape_press_freedom_index()

Saving the dataframe to a csv file with name 'press-freedom-index.csv'


In [None]:
data_frame.to_csv('press-freedom-index.csv', index=False)

In [None]:
with pd.option_context('display.max_rows', 200, 'display.max_columns', 20):
    display(data_frame)

# 7.Summary

In this project, we scraped the data of the `Press freedom ranking 2021`. Published every year since 2002 by `Reporters Without Borders (RSF)`, the World Press Freedom Index is an important advocacy tool based on the principle of emulation between states. The Index ranks `180 countries and regions` according to the level of freedom available to journalists.


# 8.References

* Requests library source: [Requests Documentation](https://docs.python-requests.org/en/master/)
* Beautiful Soup library source: [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* Pandas library source: [Documentation](https://pandas.pydata.org/docs/)
* Python programming source: [Documentation](https://www.python.org/doc/)
* Course lecture source: [lecture](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis)



# Fetching data from a REST API

In [None]:
import json

In [None]:
json_data = requests.get('https://api.covidtracking.com/v1/us/daily.json')

In [None]:
#Status code
json_data.status_code

In [None]:
json_text = json_data.text

In [None]:
json_text[:1000]

In [None]:
# Using the loads method to convert the JSON string to Python objects (here, a list of dicts)
formatted_data = json.loads(json_text)

In [None]:
formatted_data[:5] # first five records of the parsed JSON
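
The parsing step can be reproduced without a network call. The payload below is made up and its field names are assumptions; it only mirrors the overall shape of a JSON array of records. With `requests`, calling `Response.json()` performs the same decoding in one step.

```python
import json

# Made-up payload shaped like a JSON array of daily records (field names are assumptions).
sample_text = '[{"date": 20210307, "positive": 100}, {"date": 20210306, "positive": 90}]'

records = json.loads(sample_text)

# A JSON array becomes a Python list, and each JSON object becomes a dict.
print(type(records).__name__)     # list
print(type(records[0]).__name__)  # dict
print(records[0]["date"])         # 20210307
```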

In [None]:
# jovian.commit(project="webscrapping-pressfreedom-report", files=['press-freedom-index.csv'])