<a href="https://colab.research.google.com/github/danilotpnta/UN-General-Debate-Analysis-SDGs/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UN-General-Debate-Analysis-SDGs

This project analyzes the UN General Debate Corpus from 1970 to 2023. It covers exploratory data analysis (EDA), predictive modeling, and data visualization, with the aim of uncovering insights from political speeches and how they relate to global challenges.

The project is divided into the following sections:
1. Data Collection
2. Data Preprocessing
3. Exploratory Data Analysis
4. Predictive Modeling
5. Data Visualization

## If running from Google Colab

In [19]:
# Clone the project's GitHub repository
!git clone https://github.com/danilotpnta/UN-General-Debate-Analysis-SDGs.git

# Navigate to the repository folder
%cd /content/UN-General-Debate-Analysis-SDGs

Cloning into 'UN-General-Debate-Analysis-SDGs'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 34 (delta 12), reused 22 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (34/34), 77.71 KiB | 5.55 MiB/s, done.
Resolving deltas: 100% (12/12), done.
[Errno 2] No such file or directory: '/content/UN-General-Debate-Analysis-SDGs'
/Users/datoapanta/code/UN-General-Debate-Analysis-SDGs
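
The output above shows the `%cd` step failing when the notebook runs outside Colab, since `/content` only exists there. A minimal sketch of a guard that clones and changes directory only on Colab follows. It assumes that the `google.colab` module is already loaded in a Colab kernel, which is a common detection idiom rather than anything this repository defines. The helper name `in_colab` and the constants are illustrative.

```python
import os
import subprocess
import sys

REPO_URL = "https://github.com/danilotpnta/UN-General-Debate-Analysis-SDGs.git"
REPO_PATH = "/content/UN-General-Debate-Analysis-SDGs"  # Colab's default working area


def in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed to mean Colab)."""
    return "google.colab" in sys.modules


if in_colab():
    # Clone only once, so re-running the cell does not fail on an existing folder
    if not os.path.isdir(REPO_PATH):
        subprocess.run(["git", "clone", REPO_URL, REPO_PATH], check=True)
    os.chdir(REPO_PATH)
else:
    print(f"Not on Colab, staying in {os.getcwd()}")
```

When run locally, the cell leaves the working directory untouched, so the later relative paths such as `data/raw` resolve against the local checkout.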


## 1. Data Collection

#### Downloading the dataset

In [21]:
from importlib import reload
import utils.dataverse_downloader as dataverse_downloader

# Reload the module
reload(dataverse_downloader)

# Call the function to download all files
dataverse_downloader.download_dataset()

Downloading dataset...
File already downloaded: Raw_PDFs_1946-1969.tgz
File already downloaded: Raw_PDFs_1970-1990.tgz
File already downloaded: Raw_PDFs_1991-2022.tgz
File already downloaded: README.txt
File already downloaded: Speakers_by_session.xlsx
File already downloaded: UNGDC_1946-2023.tgz
Uncompressing data...
Data already uncompressed: data/raw/Raw_PDFs_1946-1969
Data already uncompressed: data/raw/Raw_PDFs_1970-1990
Data already uncompressed: data/raw/Raw_PDFs_1991-2022
Data already uncompressed: data/raw/UNGDC_1946-2023
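
The log above suggests that the downloader skips files and archives it has already handled. The source of `utils.dataverse_downloader` is not shown here, so the following is only a sketch of how such skip-if-present behavior could be written with the standard library. The function names `fetch_if_missing` and `extract_if_missing` are hypothetical and are not the module's actual API.

```python
import os
import tarfile
import urllib.request


def fetch_if_missing(url: str, dest_path: str) -> bool:
    """Download url to dest_path unless it already exists. Returns True if downloaded."""
    if os.path.exists(dest_path):
        print(f"File already downloaded: {os.path.basename(dest_path)}")
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return True


def extract_if_missing(archive_path: str, target_dir: str) -> bool:
    """Unpack a .tgz archive into target_dir unless that folder already exists."""
    if os.path.isdir(target_dir):
        print(f"Data already uncompressed: {target_dir}")
        return False
    os.makedirs(target_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return True
```

Checking for the output path first makes both steps safe to re-run, which matters in a notebook where cells are often executed more than once.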


## 2. Data Preprocessing

In [29]:
import os
import pandas as pd

# Define file paths
raw_file_path = os.path.join('data', 'raw', 'Speakers_by_session.xlsx')  
processed_file_path = os.path.join('data', 'processed', 'Speakers_by_session_processed.parquet')  

# Check if the processed file exists
if os.path.exists(processed_file_path):
    print(f"Loading processed data from {processed_file_path}...")
    df_speakers = pd.read_parquet(processed_file_path)
else:
    print(f"Processing raw data from {raw_file_path}...")
    df_speakers = pd.read_excel(raw_file_path)
    
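    print(f"Saving processed data to {processed_file_path}...")
    # Make sure the output directory exists before writing the Parquet file
    os.makedirs(os.path.dirname(processed_file_path), exist_ok=True)
    df_speakers.to_parquet(processed_file_path)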

# Display the first 15 rows of the DataFrame
df_speakers.head(15)

Loading processed data from data/processed/Speakers_by_session_processed.parquet...


| | Year | Session | ISO Code | Country | Name of Person Speaking | Post | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 2023 | 78 | BRA | Brazil | Luiz Inacio Lula da Silva | President | |
| 1 | 2023 | 78 | USA | United States of America | Joseph R. Biden | President | |
| 2 | 2023 | 78 | COL | Colombia | Gustavo Petro Urrego | President | |
| 3 | 2023 | 78 | JOR | Jordan | Abdullah II ibn Al Hussein | King | |
| 4 | 2023 | 78 | POL | Poland | Andrzej Duda | President | |
| 5 | 2023 | 78 | CUB | Cuba | Miguel Diaz-Canel Bermudez | President | |
| 6 | 2023 | 78 | TUR | Turkey | Recep Tayyip Erdogan | President | |
| 7 | 2023 | 78 | PRT | Portugal | Marcelo Rebelo de Sousa | President | |
| 8 | 2023 | 78 | QAT | Qatar | Tamim bin Hamad Al Thani | Amir | |
| 9 | 2023 | 78 | ZAF | South Africa | Matamela Cyril Ramaphosa | President | |
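
The cell above follows a load-or-build pattern: read the processed file if it exists, otherwise build the frame from the raw source and cache it. The same pattern can be factored into a small helper, sketched below. The name `load_or_build` and its parameters are hypothetical and not part of this repository. The default reader and writer assume a Parquet engine such as `pyarrow` is installed.

```python
import os
from typing import Callable

import pandas as pd


def load_or_build(
    processed_path: str,
    build: Callable[[], pd.DataFrame],
    read: Callable[[str], pd.DataFrame] = pd.read_parquet,
    write: Callable[[pd.DataFrame, str], None] = lambda df, path: df.to_parquet(path),
) -> pd.DataFrame:
    """Return the cached frame at processed_path, building and caching it if absent."""
    if os.path.exists(processed_path):
        print(f"Loading processed data from {processed_path}...")
        return read(processed_path)

    df = build()
    # Create the output directory on first use so the write cannot fail on it
    os.makedirs(os.path.dirname(processed_path) or ".", exist_ok=True)
    print(f"Saving processed data to {processed_path}...")
    write(df, processed_path)
    return df


# Possible usage with the paths from the cell above:
# df_speakers = load_or_build(
#     os.path.join("data", "processed", "Speakers_by_session_processed.parquet"),
#     build=lambda: pd.read_excel(os.path.join("data", "raw", "Speakers_by_session.xlsx")),
# )
```

Passing the reader and writer as arguments keeps the helper independent of the file format, so the same function could cache to CSV where no Parquet engine is available.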


In [30]:
import os
import zipfile
import pandas as pd

# Define file paths
base_path = os.path.join('data', 'raw', 'UNGDC_1946-2023')
processed_file_path = os.path.join('data', 'processed', 'UNGDC_1946-2023_processed.parquet')
zip_file_path = os.path.join('data', 'processed', 'UNGDC_1946-2023_processed.zip')

# Check if the Parquet file exists
if os.path.exists(processed_file_path):

    print(f"Loading processed data from {processed_file_path}...")
    df_ungdc = pd.read_parquet(processed_file_path)

elif os.path.exists(zip_file_path):

    print(f"Unzipping {zip_file_path}...")
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        #  Extract to the directory of the Parquet file
        zip_ref.extractall(os.path.dirname(processed_file_path))  

    # Now load the unzipped Parquet file
    print(f"Loading processed data from {processed_file_path}...")
    df_ungdc = pd.read_parquet(processed_file_path)
else:
    data = []

    # Walk through the base directory and process each session
    for session_folder in os.listdir(base_path):
        session_path = os.path.join(base_path, session_folder)

        # Ensure that it is a directory
        if os.path.isdir(session_path):
            session_number = session_folder.split(' ')[1]  # Extract session number (e.g., '01')
            year = session_folder.split(' ')[-1]          # Extract the year (e.g., '1946')

            # Loop through the text files in each session directory
            for txt_file in os.listdir(session_path):
                if txt_file.endswith('.txt'):
                    file_path = os.path.join(session_path, txt_file)

                    # Read the content of the text file
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()

                    # Extract country code from the file name
                    country_code = txt_file.split('_')[0]  # Extract country code (e.g., 'CAN')

                    # Append the data to the list
                    data.append({
                        'Session': session_number,
                        'Year': year,
                        'Country Code': country_code,
                        'Content': content
                    })

    # Convert the data into a DataFrame for easy analysis
    df_ungdc = pd.DataFrame(data)

    # Sort the DataFrame by Year, Session, and Country Code
    df_ungdc = df_ungdc.sort_values(by=['Year', 'Session', 'Country Code'], ascending=[True, True, True])

    # Reset the index for a cleaner DataFrame
    df_ungdc.reset_index(drop=True, inplace=True)

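    # Save the processed DataFrame as a Parquet file,
    # creating the output directory first if it does not exist
    os.makedirs(os.path.dirname(processed_file_path), exist_ok=True)
    print(f"Saving processed data to {processed_file_path}...")
    df_ungdc.to_parquet(processed_file_path)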

    # Zip the Parquet file
    print(f"Compressing {processed_file_path} into {zip_file_path}...")
    with zipfile.ZipFile(zip_file_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(processed_file_path, os.path.basename(processed_file_path))

# Display the first few rows of the DataFrame
df_ungdc.head(15)

Loading processed data from data/processed/UNGDC_1946-2023_processed.parquet...


| | Session | Year | Country Code | Content |
|---|---|---|---|---|
| 0 | 1 | 1946 | ARG | At the resumption of the first session of the ... |
| 1 | 1 | 1946 | AUS | The General Assembly of the United Nations is ... |
| 2 | 1 | 1946 | BEL | The\tprincipal organs of the United Nations ha... |
| 3 | 1 | 1946 | BLR | As more than a year has elapsed since the Unit... |
| 4 | 1 | 1946 | BOL | Coming to this platform where so many distingu... |
| 5 | 1 | 1946 | BRA | I would first like to express to the city of N... |
| 6 | 1 | 1946 | CAN | If were not anxious, like all my colleagues, ... |
| 7 | 1 | 1946 | CHL | I shall occupy this rostrum for a few minutes ... |
| 8 | 1 | 1946 | CHN | My first words must be to express to the Gover... |
| 9 | 1 | 1946 | COL | The Colombian delegation does not consider it ... |


## 3. Exploratory Data Analysis
