This notebook scrapes the text of each individual parliamentary speech from its link and saves it in a new column, next to the respective speaker.

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
import time
from urllib.error import HTTPError
from tqdm import tqdm, notebook

In [None]:
# register progress_apply on pandas objects (call on the class, not on an instance)
tqdm.pandas()

In [None]:
df = pd.read_csv('/congress_2021.tsv', sep='\t')

We continue to build on the code that the newspaper <a href="https://github.com/estadao/bolsonaro-e-ditadura-no-congresso/tree/master/code">O Estado de S.Paulo</a> published on GitHub. There, however, the text of each speech was extracted and saved in .txt format. In this notebook, we add the text to a new column of our data frame instead.

In [None]:
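def make_request(url):
    '''Takes a URL, makes a GET request, and returns the response body as a text string.'''
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions

    # requests raises its own exception types (not urllib's HTTPError);
    # report the failure and return an empty string so the caller still gets text
    except requests.exceptions.RequestException as e:
        print(f'Request failed for {url}: {e}')
        return ''

    finally:
        time.sleep(2)  # pause between requests to avoid overloading the server

    r.encoding = 'utf-8'
    return r.text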
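
Since the server sometimes answers with transient errors (the "504 Gateway Time-out" mentioned further down), retrying a failed request a few times may recover some pages. The block below is only a minimal sketch of that idea, not part of the original pipeline: `fetch_with_retries`, its parameters, and its defaults are hypothetical names chosen for illustration, and it assumes the wrapped function raises an exception when it fails.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times, sleeping `delay` seconds between tries.

    `fetch` is any callable that returns text or raises an exception.
    The name and the default values are illustrative assumptions.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:  # broad on purpose: this is only a sketch
            last_error = e
            print(f"attempt {attempt} failed for {url}: {e}")
            if attempt < attempts:
                time.sleep(delay)
    # every attempt failed: re-raise the last error seen
    raise last_error
```

Note that a function which swallows its own errors and returns a value instead would never trigger a retry here, so the wrapped callable has to let the exception propagate.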

In [None]:
def make_soup(data):
    '''Takes a text string and returns a BeautifulSoup object.'''
    # remove the <b> and </b> tags: they are out of order in the Chamber's HTML,
    # which confuses the parser
    data = data.replace("<b>", "").replace("</b>", "")
    soup = BeautifulSoup(data, 'html.parser')
    return soup

In [None]:
def scrape_content(soup):

    '''
    Uses BeautifulSoup to extract the speech text from the parsed page.
    Returns a text string.

    Some links are broken, and we need to skip those errors; otherwise the
    scraper would stop. The failure surfaces as an AttributeError on the
    "find_all" call (no matching paragraph was found), but we found that the
    underlying cause is actually a "504 Gateway Time-out" response. In any
    case, the error message is returned so it can be inspected later.
    '''

    try:
        paragraph = soup.find("p", attrs={"align": "justify"})

        # This extra step is necessary because in some cases there is more than
        # one 'font' tag inside the 'p'; without it, part of the text was lost.
        fonts = paragraph.find_all("font")
        text = ' '.join(font.text.strip() for font in fonts)

    # If the paragraph is missing, return the error message as a string
    # (not the exception object) so that the column holds only strings
    except AttributeError as e:
        return str(e)

    return text
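
To make the extraction logic easier to follow, here is a small self-contained sketch that applies the same `find` / `find_all` steps to a made-up HTML snippet. The HTML strings are invented for illustration and are not taken from the Chamber's site.

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking a speech page with two <font> tags inside the paragraph.
html = '<p align="justify"><font>O SR. EXEMPLO - </font><font> Sr. Presidente, obrigado. </font></p>'
soup = BeautifulSoup(html, "html.parser")

# Same steps as in scrape_content: locate the paragraph, then join all <font> texts.
paragraph = soup.find("p", attrs={"align": "justify"})
text = " ".join(font.text.strip() for font in paragraph.find_all("font"))
print(text)

# A page without the expected paragraph: find() returns None, and calling
# find_all on None is what raises the AttributeError.
missing = BeautifulSoup("<html><body>504 Gateway Time-out</body></html>", "html.parser")
print(missing.find("p", attrs={"align": "justify"}))
```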

In [None]:
def scrape_row(row):

    '''
    Run via df.apply(): takes one row of the data frame, requests the page
    behind its discourse_link, and returns the scraped speech text.
    '''

    txt = None  # returned as-is when the row has no link
    if pd.notna(row.discourse_link) and row.discourse_link:
        data = make_request(row.discourse_link)
        soup = make_soup(data)
        txt = scrape_content(soup)
    return txt

In [None]:
# it was thanks to Marilene that I got the code working today (22/09)!
df['original_discourse'] = df.progress_apply(scrape_row, axis=1)

100%|█████████████████████████████████| 20654/20654 [18:40:45<00:00,  3.26s/it]


In [None]:
df

Unnamed: 0,date,session,phase,discourse_link,speaker,party,coalition,state,region,original_discourse
0,2021-12-21,35.2021.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,hugo leal,psd,0,rj,southeast,O SR. HUGO LEAL (PSD - RJ. Como Relator. Sem r...
1,2021-12-21,35.2021.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,domingos sávio,psdb,0,mg,southeast,O SR. DOMINGOS SÁVIO (PSDB - MG. Pela ordem. S...
2,2021-12-21,35.2021.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,moses rodrigues,mdb,0,ce,northeast,O SR. MOSES RODRIGUES (MDB - CE. Pela ordem. S...
3,2021-12-21,35.2021.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,dra. soraya manato,psl,0,es,southeast,A SRA. DRA. SORAYA MANATO (PSL - ES. Pela orde...
4,2021-12-21,35.2021.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,delegado marcelo freitas,psl,0,mg,southeast,O SR. DELEGADO MARCELO FREITAS (PSL - MG. Pela...
...,...,...,...,...,...,...,...,...,...,...
20649,2021-09-27,017.3.56.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,perpétua almeida,pcdob,0,ac,north,A SRA. PERPÉTUA ALMEIDA (Bloco/PCdoB - AC. Par...
20650,2021-09-27,017.3.56.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,marcelo freixo,psol,0,rj,southeast,O SR. MARCELO FREIXO (Bloco/PSB - RJ. Para dis...
20651,2021-09-27,017.3.56.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,arlindo chinaglia,pt,0,sp,southeast,O SR. ARLINDO CHINAGLIA (Bloco/PT - SP. Para d...
20652,2021-07-15,015.3.56.N,agenda,https://www.camara.leg.br/internet/sitaqweb/Te...,juscelino filho,dem,0,ma,northeast,O SR. JUSCELINO FILHO (DEM - MA. Como Relator....


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20654 entries, 0 to 20653
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date                20654 non-null  object
 1   session             20654 non-null  object
 2   phase               20654 non-null  object
 3   discourse_link      20654 non-null  object
 4   speaker             20654 non-null  object
 5   party               20654 non-null  object
 6   coalition           20654 non-null  int64 
 7   state               20654 non-null  object
 8   region              20654 non-null  object
 9   original_discourse  20654 non-null  object
dtypes: int64(1), object(9)
memory usage: 1.6+ MB


In [None]:
# the dates are stored as YYYY-MM-DD, so the format string uses hyphens
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

In [None]:
# checking potential scraping errors in the discourse column
df[df["original_discourse"].str.contains("NoneType", na=False)]

Unnamed: 0,date,session,phase,discourse_link,speaker,party,coalition,state,region,original_discourse


In [None]:
# checking null entries
df[df['original_discourse'].isnull()]

Unnamed: 0,date,session,phase,discourse_link,speaker,party,coalition,state,region,original_discourse


We need to manually check the entries for which the scraper didn't work, link by link. Some links lead to blank pages; for the ones that do have content, the text was manually inserted into the data frame.

In [None]:
# checking potential duplicates
pd.concat(g for _, g in df.groupby('original_discourse') if len(g) > 1)

TypeError: '<' not supported between instances of 'AttributeError' and 'AttributeError'
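
The TypeError above indicates that some cells of `original_discourse` hold `AttributeError` objects instead of strings, and `groupby` cannot sort such keys. One possible workaround, sketched below on a toy data frame, is to compare the string form of each cell and use `duplicated(keep=False)` to flag every member of a duplicate group. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for df; the second row holds an exception object.
toy = pd.DataFrame({
    "speaker": ["a", "b", "c", "d"],
    "original_discourse": ["same text", AttributeError("no paragraph"), "same text", "unique"],
})

# Cast to str so every key is comparable, then flag all rows whose text repeats.
as_text = toy["original_discourse"].astype(str)
duplicates = toy[as_text.duplicated(keep=False)]
print(duplicates)
```

On the real data, the same two lines could be applied to `df` in place of `toy`.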

In [None]:
# some speeches contain carriage returns (\r), newlines (\n), tabs (\t), and non-breaking spaces (\xa0). Let's clean them up:
df["original_discourse"] = df["original_discourse"].str.replace(r"\r+|\n+|\t+", " ", regex=True).str.split().str.join(" ")
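
To see what this normalization does to a single string, the same two steps can be reproduced with plain Python. This is only an illustrative sketch, and the sample sentence is made up.

```python
import re

sample = "O SR. PRESIDENTE\r\n\tDeclaro\xa0aberta   a sessão."

# Step 1: replace runs of \r, \n, or \t with a space (the same regex as above).
step1 = re.sub(r"\r+|\n+|\t+", " ", sample)

# Step 2: split on any whitespace (this also covers \xa0) and rejoin with single spaces.
cleaned = " ".join(step1.split())
print(cleaned)
```

The split-and-join step is what collapses repeated spaces and removes the `\xa0` characters, since the regex itself does not mention them.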

In [None]:
df.to_csv('congress_speeches_2021.tsv', sep='\t', index=False)