<a href="https://colab.research.google.com/github/alexandrastna/AI-for-ESG/blob/main/Notebooks/2_Thesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Thesis 2 – Data Checks & Overview
In this notebook, I run consistency checks on the merged metadata file: I count the documents per company, per year, and per document type, and I check that no expected (company, year) combination is missing. The resulting tables also give a global, descriptive overview of the dataset.

1 - Load the data

In [1]:
# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Import pandas
import pandas as pd

# 3. Load the merged metadata file
csv_path = "/content/drive/MyDrive/Thèse Master/Data/df_merged_clean.csv"
df_merged = pd.read_csv(csv_path)

# 4. Display settings for better visibility
import IPython.display as display
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)
display.display(df_merged.head(10))  # Show the first 10 rows (one row per PDF document)


Mounted at /content/drive


,Company,Year,Document Type,Document Title,Path,Ticker SMI,Ticker Seeking Alpha (US),SASB Industry
0,Zurich Insurance Group AG,2023,Annual Report,Zurich_Annual_Report_2023.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Annual_Report_2023.pdf,ZURN,ZURVY,Financials – Insurance
1,Zurich Insurance Group AG,2023,Half-Year Report,Zurich_Half_Year_Report_2023.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Half_Year_Report_2023.pdf,ZURN,ZURVY,Financials – Insurance
2,Zurich Insurance Group AG,2022,Annual Report,Zurich_Annual_Report_2022.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Annual_Report_2022.pdf,ZURN,ZURVY,Financials – Insurance
3,Zurich Insurance Group AG,2022,Half-Year Report,Zurich_Half_Year_Report_2022.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Half_Year_Report_2022.pdf,ZURN,ZURVY,Financials – Insurance
4,Zurich Insurance Group AG,2021,Annual Report,Zurich_Annual_Report_2021.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Annual_Report_2021.pdf,ZURN,ZURVY,Financials – Insurance
5,Zurich Insurance Group AG,2021,Half-Year Report,Zurich_Half_Year_Report_2021.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/Zurich_Half_Year_Report_2021.pdf,ZURN,ZURVY,Financials – Insurance
6,Zurich Insurance Group AG,2023,Earnings Call Transcript,Zurich_2023_Q1 Earnings Call Transcript.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/2023/Zurich_2023_Q1 Earnings Call Transcript.pdf,ZURN,ZURVY,Financials – Insurance
7,Zurich Insurance Group AG,2023,Earnings Call Transcript,Zurich_2023_Q2 Earnings Call Transcript.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/2023/Zurich_2023_Q2 Earnings Call Transcript.pdf,ZURN,ZURVY,Financials – Insurance
8,Zurich Insurance Group AG,2023,Earnings Call Transcript,Zurich_2023_Q3 Earnings Call Transcript.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/2023/Zurich_2023_Q3 Earnings Call Transcript.pdf,ZURN,ZURVY,Financials – Insurance
9,Zurich Insurance Group AG,2023,Earnings Call Transcript,Zurich_2023_Q4 Earnings Call Transcript.pdf,/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/2023/Zurich_2023_Q4 Earnings Call Transcript.pdf,ZURN,ZURVY,Financials – Insurance
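
Before counting anything, it can help to confirm that the loaded file has the columns and values the later cells rely on. The block below is a minimal sketch of such a check, not part of the original pipeline: the helper name `check_metadata` and the choice of required columns are my assumptions, and the small frame reuses two rows from the output above, with the first one duplicated on purpose to trigger the duplicate check.

```python
import pandas as pd

# Columns the later cells rely on; treating exactly these as required
# is an assumption of this sketch.
REQUIRED_COLUMNS = ["Company", "Year", "Document Type", "Document Title", "Path"]


def check_metadata(df, required=REQUIRED_COLUMNS):
    """Return a small report of basic integrity problems in the metadata."""
    missing_columns = [c for c in required if c not in df.columns]
    present = [c for c in required if c in df.columns]
    null_counts = df[present].isna().sum()
    duplicate_paths = int(df["Path"].duplicated().sum()) if "Path" in df.columns else 0
    return {
        "missing_columns": missing_columns,
        "null_counts": null_counts[null_counts > 0].to_dict(),
        "duplicate_paths": duplicate_paths,
    }


base = "/content/drive/MyDrive/Thèse Master/Data/Zurich Insurance Group AG/"
rows = [
    ("Zurich Insurance Group AG", 2023, "Annual Report", "Zurich_Annual_Report_2023.pdf"),
    ("Zurich Insurance Group AG", 2023, "Half-Year Report", "Zurich_Half_Year_Report_2023.pdf"),
    # Deliberate duplicate of the first row, for illustration only
    ("Zurich Insurance Group AG", 2023, "Annual Report", "Zurich_Annual_Report_2023.pdf"),
]
sample = pd.DataFrame(rows, columns=["Company", "Year", "Document Type", "Document Title"])
sample["Path"] = base + sample["Document Title"]

report = check_metadata(sample)
print(report)
```

On the real data, the same call would be `check_metadata(df_merged)`; an empty `missing_columns` list, an empty `null_counts` dict, and zero duplicate paths would suggest the file is safe to aggregate.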


2 - Overview: Number of Documents per Company and Year

To get a quick overview of the dataset, I count the number of documents available for each company and year combination.

In [2]:
# Total number of documents per company and year
docs_per_company_year = df_merged.groupby(["Company", "Year"]).size().reset_index(name="Nb_Documents")
display.display(docs_per_company_year)


,Company,Year,Nb_Documents
0,ABB Ltd,2021,6
1,ABB Ltd,2022,7
2,ABB Ltd,2023,7
3,Compagnie Financière Richemont,2021,5
4,Compagnie Financière Richemont,2022,5
5,Compagnie Financière Richemont,2023,5
6,Holcim Ltd,2021,4
7,Holcim Ltd,2022,6
8,Holcim Ltd,2023,7
9,Lonza Group AG,2021,5
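
The per-year counts can be condensed into one line per company to make uneven coverage easier to spot. This is a small sketch under stated assumptions: it only uses the ten rows displayed above (so the totals are partial, and Lonza Group AG has a single row here), and the names `docs_sample` and `summary` are mine.

```python
import pandas as pd

# The ten rows shown in the output above; the full table is longer,
# so these totals are partial.
docs_sample = pd.DataFrame(
    [
        ("ABB Ltd", 2021, 6),
        ("ABB Ltd", 2022, 7),
        ("ABB Ltd", 2023, 7),
        ("Compagnie Financière Richemont", 2021, 5),
        ("Compagnie Financière Richemont", 2022, 5),
        ("Compagnie Financière Richemont", 2023, 5),
        ("Holcim Ltd", 2021, 4),
        ("Holcim Ltd", 2022, 6),
        ("Holcim Ltd", 2023, 7),
        ("Lonza Group AG", 2021, 5),
    ],
    columns=["Company", "Year", "Nb_Documents"],
)

# Total, minimum and maximum number of documents per company across years
summary = docs_sample.groupby("Company")["Nb_Documents"].agg(["sum", "min", "max"])

# A large range means the yearly coverage of that company is uneven
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

In this sample, Holcim Ltd has the widest range (4 to 7 documents), which is the kind of row worth inspecting in the type breakdown.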


3 - Document Types Breakdown

Next, I count the number of documents by type for each company and year. This helps me verify whether I have a consistent mix of Annual Reports, Sustainability Reports, Half-Year Reports, etc.

In [3]:
# Count document types
docs_by_type = df_merged.groupby(["Company", "Year", "Document Type"]).size().reset_index(name="Count")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
display.display(docs_by_type)



,Company,Year,Document Type,Count
0,ABB Ltd,2021,Annual Report,1
1,ABB Ltd,2021,Earnings Call Transcript,4
2,ABB Ltd,2021,Sustainability Report,1
3,ABB Ltd,2022,Earnings Call Transcript,4
4,ABB Ltd,2022,Governance Report,1
5,ABB Ltd,2022,Integrated Report,1
6,ABB Ltd,2022,Sustainability Report,1
7,ABB Ltd,2023,Earnings Call Transcript,4
8,ABB Ltd,2023,Governance Report,1
9,ABB Ltd,2023,Integrated Report,1
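
One way to test the "consistent mix" idea is to compare the set of document types a company has from one year to the next. The sketch below is illustrative only: `type_changes` is a hypothetical helper, and the sample frame contains just the ABB Ltd rows for 2021 and 2022 from the output above.

```python
import pandas as pd

# ABB Ltd rows for 2021 and 2022, copied from the output above
docs_by_type_sample = pd.DataFrame(
    [
        ("ABB Ltd", 2021, "Annual Report", 1),
        ("ABB Ltd", 2021, "Earnings Call Transcript", 4),
        ("ABB Ltd", 2021, "Sustainability Report", 1),
        ("ABB Ltd", 2022, "Earnings Call Transcript", 4),
        ("ABB Ltd", 2022, "Governance Report", 1),
        ("ABB Ltd", 2022, "Integrated Report", 1),
        ("ABB Ltd", 2022, "Sustainability Report", 1),
    ],
    columns=["Company", "Year", "Document Type", "Count"],
)


def type_changes(df):
    """List document types added or dropped between consecutive years, per company."""
    # One set of document types per (Company, Year)
    type_sets = df.groupby(["Company", "Year"])["Document Type"].apply(set)
    rows = []
    for company, per_year in type_sets.groupby(level="Company"):
        per_year = per_year.droplevel("Company").sort_index()
        years = list(per_year.index)
        for prev, curr in zip(years, years[1:]):
            added = sorted(per_year.loc[curr] - per_year.loc[prev])
            dropped = sorted(per_year.loc[prev] - per_year.loc[curr])
            if added or dropped:
                rows.append(
                    {"Company": company, "From": prev, "To": curr,
                     "Added": added, "Dropped": dropped}
                )
    return pd.DataFrame(rows, columns=["Company", "From", "To", "Added", "Dropped"])


changes = type_changes(docs_by_type_sample)
print(changes)
```

For ABB Ltd, this reports that the Annual Report disappears in 2022 while a Governance Report and an Integrated Report appear. A change like this is not necessarily a gap in the data, but it is worth confirming against the company's actual publications.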


4 - Pivot Table: Document Types per Company and Year

To get a clearer overview of the distribution of document types, I create a pivot table that shows how many of each type (Annual Report, Half-Year Report, etc.) exist per company and year.

In [4]:
# Pivot table with document types
pivot_doc_types = df_merged.pivot_table(
    index=["Company", "Year"],
    columns="Document Type",
    values="Document Title",
    aggfunc="count",
    fill_value=0
)

display.display(pivot_doc_types)


Company,Year,Annual Report,Earnings Call Transcript,Governance Report,Half-Year Report,Integrated Report,Sustainability Report
ABB Ltd,2021,1,4,0,0,0,1
ABB Ltd,2022,0,4,1,0,1,1
ABB Ltd,2023,0,4,1,0,1,1
Compagnie Financière Richemont,2021,1,2,0,1,0,1
Compagnie Financière Richemont,2022,1,2,0,1,0,1
Compagnie Financière Richemont,2023,1,2,0,1,0,1
Holcim Ltd,2021,0,1,0,1,1,1
Holcim Ltd,2022,0,3,0,1,1,1
Holcim Ltd,2023,0,4,0,1,1,1
Lonza Group AG,2021,1,2,0,1,0,1
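
The pivot table also makes rule-based checks straightforward. As a sketch, the block below flags company-years that have neither an Annual Report nor an Integrated Report. Treating these two types as substitutes is my assumption, suggested by the ABB Ltd rows above; `missing_core_report` is a hypothetical helper, and the "Hypothetical Co" row is invented purely to show a flagged case.

```python
import pandas as pd

# Two ABB Ltd rows from the pivot above (reduced to three columns)
# plus one invented row that has no core report.
pivot_sample = pd.DataFrame(
    {
        "Annual Report": [1, 0, 0],
        "Earnings Call Transcript": [4, 4, 2],
        "Integrated Report": [0, 1, 0],
    },
    index=pd.MultiIndex.from_tuples(
        [("ABB Ltd", 2021), ("ABB Ltd", 2022), ("Hypothetical Co", 2022)],
        names=["Company", "Year"],
    ),
)


def missing_core_report(pivot, core_types=("Annual Report", "Integrated Report")):
    """Return the (Company, Year) pairs that have none of the core report types."""
    # Only use core columns that actually exist in the pivot
    present = [c for c in core_types if c in pivot.columns]
    has_core = pivot[present].sum(axis=1) > 0
    return pivot.index[~has_core].tolist()


flagged = missing_core_report(pivot_sample)
print(flagged)
```

On the real table, the call would be `missing_core_report(pivot_doc_types)`; the list of core types can be adjusted if a different rule fits the dataset better.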


5 - Detect Missing Entries (Document Gaps)

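To identify missing entries, I build all possible (Company, Year) combinations and check which ones are absent from my dataset. Because the companies and years are taken from the dataset itself, this check detects a company that has no documents for a year covered by other companies; it does not detect a missing document type within a year. An empty result means every company has at least one document in every year.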

In [5]:
import itertools

# All possible (Company, Year) combinations, built from the values present in the data
companies = df_merged["Company"].unique()
years = df_merged["Year"].unique()
all_possible = pd.DataFrame(itertools.product(companies, years), columns=["Company", "Year"])

# Merge with existing dataset
merged_check = pd.merge(all_possible, docs_per_company_year, on=["Company", "Year"], how="left")

# Filter combinations with no documents
missing = merged_check[merged_check["Nb_Documents"].isna()]
display.display(missing)


,Company,Year,Nb_Documents
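
A stricter variant of this check compares the data against an explicit list of expected companies and years, so that a company or a year missing from the file altogether is also caught. The block below is a sketch of that idea, not the method used in this notebook: `find_missing_pairs` is a hypothetical helper, and the toy frame with "Company A" and "Company B" has a deliberate gap.

```python
import pandas as pd


def find_missing_pairs(df, expected_companies, expected_years):
    """Return the expected (Company, Year) pairs that have no row in df."""
    expected = pd.MultiIndex.from_product(
        [expected_companies, expected_years], names=["Company", "Year"]
    )
    observed = pd.MultiIndex.from_frame(df[["Company", "Year"]].drop_duplicates())
    missing = expected.difference(observed)
    return pd.DataFrame(list(missing), columns=["Company", "Year"])


# Toy data: "Company B" has no document for 2022 (deliberate gap)
toy = pd.DataFrame(
    {
        "Company": ["Company A", "Company A", "Company A", "Company B", "Company B"],
        "Year": [2021, 2022, 2023, 2021, 2023],
    }
)

missing_pairs = find_missing_pairs(
    toy,
    expected_companies=["Company A", "Company B"],
    expected_years=[2021, 2022, 2023],
)
print(missing_pairs)
```

With the real data, the expected lists would have to come from an external source, such as the list of companies selected for the study, rather than from `df_merged` itself.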
