## Project: A historical analysis of the Big Brother Brasil show
(Portuguese below)

This script is part of a project that gathers historical data about the Brazilian version of the Big Brother reality show and analyses how it has changed over its 25 years of existence in terms of demographics, audience participation, and other factors. The show is the most-watched programme in Brazil, and the goal is to make all this data available to other analysts who wish to explore it. At the time of writing, this is the only one-stop source of BBB data online.

(Portuguese) Projeto: Uma análise histórica do Big Brother Brasil

Esse script faz parte de um projeto que coleta dados históricos sobre o Big Brother Brasil de diferentes fontes e os disponibiliza de forma limpa, normalizada e com as devidas conexões. O objetivo desse projeto é tornar os dados acessíveis para outras pessoas que desejarem analisá-los e, no momento da sua publicação, essa é a única fonte de dados consolidada sobre o Big Brother Brasil.

## Script: pulling eviction information from Wikipedia
(Portuguese below)

This script scrapes the Nominations table from the Wikipedia pages. On Wikipedia, this table is divided into three parts with a shared header. The script therefore extracts the headers, normalises them, and then splits the table into three.

Script: extraindo dados dos paredões da Wikipedia
Este script extrai a tabela de Paredões das páginas da Wikipedia. Na Wikipedia, essa tabela é dividida em 3 partes com um header compartilhado. Portanto, o script captura os headers, os normaliza e, em seguida, divide essa tabela em três.
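
As a rough illustration of the header normalisation step described above, the sketch below flattens a small hand-made multi-level header into single strings. The sample column labels and the helper name `flatten_headers` are made up for illustration; the real Wikipedia table will have different labels and possibly more header levels.

```python
import pandas as pd


def flatten_headers(columns):
    """Join the levels of each column label with ' - ', skipping consecutive repeats."""
    levels = [columns.get_level_values(i).tolist() for i in range(columns.nlevels)]
    flat = []
    for parts in zip(*levels):
        # Keep a level only if it differs from the level directly above it
        kept = [p for i, p in enumerate(parts) if i == 0 or p != parts[i - 1]]
        flat.append(" - ".join(str(p) for p in kept if p))
    # pandas labels empty header cells "Unnamed: ..."; blank those out
    return ["" if "Unnamed" in name else name for name in flat]


# Hypothetical two-level header, similar in shape to what pd.read_html can return
sample_columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Unnamed: 0_level_1"),
    ("Semana 1", "Semana 1"),
    ("Semana 2", "Dia 10"),
    ("Semana 2", "Dia 12"),
])

flat_headers = flatten_headers(sample_columns)
print(flat_headers)
# ['', 'Semana 1', 'Semana 2 - Dia 10', 'Semana 2 - Dia 12']
```

Repeated labels across levels collapse into one, while distinct sub-headers are joined to their parent, so each column ends up with a single readable name.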

# Libraries

import argparse
from io import StringIO

import html5lib  # not referenced directly; BeautifulSoup uses it as the 'html5lib' parser
import pandas as pd
import requests
from bs4 import BeautifulSoup
def nominations_scrape(url):
    """Scrape the nominations table from a Wikipedia page and split it into three DataFrames."""

    # Scraping the Wikipedia page

    response = requests.get(url)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, 'html5lib')

    h2_header = soup.find('h2', {'id': 'Histórico'})
    desired_table = None

    if h2_header:
        parent_div = h2_header.find_parent('div')
        next_div = parent_div.find_next_sibling() if parent_div else None

        if next_div:
            if next_div.name == "table":
                desired_table = next_div
            else:
                desired_table = next_div.find("table", recursive=False)

    if desired_table is None:
        raise ValueError(f"Could not find the table after the 'Histórico' heading at {url}")

    # Wrap the HTML in StringIO: passing literal HTML strings to
    # pd.read_html is deprecated in recent pandas versions
    html_to_table = pd.read_html(StringIO(str(desired_table)))
    Nominations_raw = html_to_table[0]

    # Normalising headers: merge the header levels of each column into one string

    all_levels = [
        Nominations_raw.columns.get_level_values(level).tolist()
        for level in range(Nominations_raw.columns.nlevels)
    ]

    headers = []

    for items in zip(*all_levels):
        merged_items = []
        for i in range(len(items)):
            # Skip a level that only repeats the level above it
            if i == 0 or items[i] != items[i - 1]:
                merged_items.append(items[i])
            else:
                merged_items.append("")
        headers.append(" - ".join(filter(None, merged_items)))

    headers = ["" if "Unnamed" in item else item for item in headers]
    Nominations_raw.columns = headers

    # Edition identifier: the text after the last underscore in the URL.
    # It is added to each table after the split, so that the divider
    # rows of the raw table stay fully empty.

    year = url.rsplit('_', 1)[-1]

    # Splitting the 3 tables: fully empty rows act as dividers

    Nominations_raw = Nominations_raw.replace(r'^\s*$', float('nan'), regex=True)

    divider_index = Nominations_raw[Nominations_raw.isnull().all(axis=1)].index
    if len(divider_index) < 2:
        raise ValueError("Expected at least two empty divider rows in the table")

    Nominations = Nominations_raw.iloc[:divider_index[0]].copy()
    Individual_nominations = Nominations_raw.iloc[divider_index[0] + 1:divider_index[1]].copy()
    Eviction_results = Nominations_raw.iloc[divider_index[1] + 1:].copy()

    # Light manipulations to the 3 tables: drop empty and duplicated columns,
    # transpose, and promote the first row to the header

    Nominations = Nominations.dropna(axis=1, how='all')
    Nominations = Nominations.loc[:, ~Nominations.T.duplicated()]

    Nominations = Nominations.T
    Nominations.columns = Nominations.iloc[0]
    Nominations = Nominations[1:].copy()

    Nominations['Edicao'] = year

    Individual_nominations = Individual_nominations.dropna(axis=1, how='all')

    Individual_nominations = Individual_nominations.T
    Individual_nominations.columns = Individual_nominations.iloc[0]
    Individual_nominations = Individual_nominations[1:].copy()

    Individual_nominations['Edicao'] = year

    Eviction_results = Eviction_results.dropna(axis=1, how='all')
    Eviction_results = Eviction_results.loc[:, ~Eviction_results.T.duplicated()]

    Eviction_results_t = Eviction_results.T
    Eviction_results_t.columns = Eviction_results_t.iloc[0]
    Eviction_results = Eviction_results_t[1:]

    Eviction_results = Eviction_results.loc[:, ~Eviction_results.T.duplicated()].copy()

    Eviction_results['Edicao'] = year

    # Save to csv
    Nominations.to_csv(f'Nominations_{year}.csv')
    Individual_nominations.to_csv(f'Individual_nominations_{year}.csv')
    Eviction_results.to_csv(f'Eviction_results_{year}.csv')

    return Nominations, Individual_nominations, Eviction_results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process a Wikipedia URL.")
    parser.add_argument("url", type=str, help="The URL to process")
    args = parser.parse_args()

    # Run the scraper on the provided URL
    nominations_scrape(args.url)
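
The three-way split relies on fully empty rows acting as dividers between the parts of the table. A minimal sketch of that idea on a toy DataFrame is shown here; the sample values and the helper name `split_on_blank_rows` are assumptions made for illustration and are not part of the script.

```python
import pandas as pd


def split_on_blank_rows(df):
    """Split a DataFrame into chunks at rows where every cell is missing."""
    # Treat empty or whitespace-only strings as missing values
    cleaned = df.replace(r"^\s*$", float("nan"), regex=True)
    # Positions (not labels) of the rows that are entirely missing
    blank_positions = [i for i, blank in enumerate(cleaned.isnull().all(axis=1)) if blank]
    chunks, start = [], 0
    for pos in blank_positions:
        chunks.append(cleaned.iloc[start:pos])
        start = pos + 1  # skip the divider row itself
    chunks.append(cleaned.iloc[start:])
    return chunks


# Toy data: rows 2 and 4 are dividers (empty or whitespace-only cells)
toy = pd.DataFrame({
    "col_a": ["Líder", "Indicado", "", "Ana", " ", "Eliminado"],
    "col_b": ["x", "y", "", "z", "", "w"],
})

parts = split_on_blank_rows(toy)
print([len(p) for p in parts])
# [2, 1, 1]
```

Working with positions rather than index labels keeps the slicing correct even if the DataFrame does not have a default integer index.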