# Setup

To load a PDF and extract its text, we use helper functions from the text_transformation_tools module.

In [1]:
import text_transformation_tools as ttf

# Import Sustainability Report

We load an example sustainability report, read some of its metadata, and extract its text as a string.

In [2]:
path_to_pdf = 'example_data/example_report.pdf'

last_modified = ttf.get_pdf_last_modified(path_to_pdf)
pdf_text = ttf.pdf_to_text(path_to_pdf)
language = ttf.detect_language(pdf_text)

print('Report was last modified in {} and has language \"{}\"'.format(last_modified, language))
print()
print('First 500 characters of the report\'s content are: \n {}'.format(pdf_text[:500]))

Report was last modified in 2021-03-09 00:00:00 and has language "('en', -396506.95990228653)"

First 500 characters of the report's content are: 
 SUSTAINABILITY REPORT FY20

We want to make sustainability pervasive across all our activities and a reflection of our culture. We are reimagining how we source, manufacture, distribute and recycle, to positively improve the carbon, toxicity, circularity and social impact of our operations.
Introduction
Products and the environment
People and society
About this report
CONTENTS
INTRODUCTION
06 Statement from Bracken Darrell 08 FY20 Highlights 10 Company Structure 12 Logitech in Figures 14 Sustain
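
The printed language above is a tuple rather than a plain language code. A minimal sketch of unpacking it is shown here; the value is copied from the output above, and the reading of the second element as some kind of score is an assumption based on that output, not on the module's documentation.

```python
# Value copied from the printed output above. The layout
# (language code, score) is inferred from that output only.
language = ('en', -396506.95990228653)

# Unpack the tuple so the plain code can be reused,
# for example when a language code such as 'en' is needed later.
lang_code, score = language
print('Detected language code: {}'.format(lang_code))
```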


Finally, we split the text into sections and store them in a pandas DataFrame.

In [3]:
df_sections = ttf.pdf_text_to_sections(pdf_text)
print('Number of sections: {}'.format(len(df_sections)))
print()
print('First 10 sections:')
print(df_sections.head(10))

Number of sections: 3083

First 10 sections:
   page  section_index                                       section_text
0     1              0                         SUSTAINABILITY REPORT FY20
1     2              1  We want to make sustainability pervasive acros...
2     2              2                                       Introduction
3     2              3                       Products and the environment
4     2              4                                 People and society
5     2              5                                  About this report
6     2              6                                           CONTENTS
7     2              7                                       INTRODUCTION
8     2              8  06 Statement from Bracken Darrell 08 FY20 High...
9     2              9                         PRODUCTS & THE ENVIRONMENT


# Load Topics

To detect topics, we use the KeywordDetector class. On instantiation, we specify which language the instance should support; the language is used for text processing and cleansing.

> To provide the class with topics to detect, we use the load_topics function. It requires a list of Wikipedia articles, each of which is treated as one topic. The content of each article is downloaded and processed, and representative keywords are extracted using tf-idf.

In this example we load five articles, which are specified in the file wiki_topics_reduced.csv. For the project businessresponsibility.ch we loaded 66 topics, which are specified in the file wiki_topics_prototype_fund.csv.
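
To illustrate the idea behind tf-idf keyword extraction, here is a minimal, standard-library-only sketch. It is not the implementation used by KeywordDetector: the function names `bigrams` and `top_keywords`, the toy article texts, and the restriction to two-word terms are all assumptions made for illustration, and it applies none of the stop-word removal, lemmatization, or thresholds (such as min_tf_idf_ or max_df_) that load_topics accepts.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [' '.join(pair) for pair in zip(words, words[1:])]


def top_keywords(documents, max_keywords=3):
    """Return the highest-scoring tf-idf bigrams for each document."""
    term_counts = {name: Counter(bigrams(text)) for name, text in documents.items()}
    num_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    keywords = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        # Drop terms that occur in every document (tf-idf of zero).
        keywords[name] = [t for t in ranked if scores[t] > 0][:max_keywords]
    return keywords


# Toy stand-ins for downloaded article texts (hypothetical content).
articles = {
    'Climate change': 'climate change is driven by greenhouse gas emissions '
                      'and climate change raises sea levels',
    'Human rights': 'human rights are set out in the universal declaration '
                    'and human rights apply to all',
}

for topic, kws in top_keywords(articles).items():
    print('{}: {}'.format(topic, ', '.join(kws)))
```

Terms that are frequent in one article but rare across the others receive the highest scores, which is why each topic's own name tends to rank first in this toy example.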

In [1]:
from keyword_detector import KeywordDetector
import pandas as pd

kw_detector = KeywordDetector(lang='en')
df_topics = pd.read_csv('example_data/wiki_topics_reduced.csv', sep=';')

kw_detector.load_topics(
    df_topics['topic'], df_topics['topic'],
    min_words=500, min_tf_idf_=0.08,
    min_keywords_=2, max_keywords_=10, max_df_=0.8,
)

We can now look at the topics and the keywords generated for these topics.

In [None]:
topics = kw_detector.get_topics()
keyword_list = kw_detector.get_topic_keywords()

# Walk through topics and their keyword lists in parallel
for topic, keywords in zip(topics, keyword_list):
    print('Topic: {}'.format(topic))
    print('Keywords: {}'.format(', '.join(keywords)))
    print('******************')

Topic: Human rights
Keywords: human rights, natural law, civil political, economic social, political right, universal declaration, cultural right, social cultural, declaration human, natural right
******************
Topic: Climate change
Keywords: climate change, greenhouse gas, global warming, fossil fuel, level rise, co2 emission, sea level, gas emission
******************
Topic: Social inequality
Keywords: social inequality, health care, social status, social class, income inequality, income wealth, doi 10, economic growth, health inequality, gini coefficient
******************
Topic: Labor rights
Keywords: labor right, child labor, labor movement, working condition, labor union, worker right, undocumented worker, minimum wage, core labor
******************


# Detect Topics in Report

To process a report we now use the detect_keywords function.

> The function detect_keywords requires a pandas DataFrame with the texts to process. For each topic, a new column is generated that indicates how many unique keywords of that topic the respective text contains.

In [None]:
df_section_topics = kw_detector.detect_keywords(df_sections, 'section_text', 120)
df_section_topics.head(10)

|   | page | section_index | section_text | cleansed_text | Human rights | Climate change | Social inequality | Labor rights |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | SUSTAINABILITY REPORT FY20 |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | 1 | We want to make sustainability pervasive acros... | want sustainability pervasive activity reflect... | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | 2 | Introduction |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2 | 3 | Products and the environment |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2 | 4 | People and society |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2 | 5 | About this report |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 2 | 6 | CONTENTS |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 2 | 7 | INTRODUCTION |  | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 2 | 8 | 06 Statement from Bracken Darrell 08 FY20 High... | statement bracken darrell fy20 highlights comp... | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 2 | 9 | PRODUCTS & THE ENVIRONMENT |  | 0.0 | 0.0 | 0.0 | 0.0 |
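
The per-topic counts can be pictured with the following rough sketch, which counts how many distinct keywords of each topic occur in a single text. The helper `count_unique_keywords` is hypothetical and uses plain substring matching on the lowercased raw text; the actual detect_keywords function appears to work on a cleansed version of the text (see the cleansed_text column) and may match differently.

```python
def count_unique_keywords(text, keywords):
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for keyword in set(keywords) if keyword in lowered)


# Keyword lists shortened from the topic output shown earlier.
topic_keywords = {
    'Human rights': ['human rights', 'universal declaration', 'natural law'],
    'Labor rights': ['child labor', 'working condition', 'minimum wage'],
}

# Illustrative section text (made up for this sketch).
section = ('The RBA Code is aligned with the Universal Declaration of '
           'Human Rights. Human rights are managed at our facility.')

# One count per topic, analogous to one new column per topic.
counts = {topic: count_unique_keywords(section, kws)
          for topic, kws in topic_keywords.items()}
print(counts)
```

Note that "human rights" occurs twice in the sample text but contributes only once, because the count is over unique keywords rather than occurrences.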


# Analysis

In this example we look at all sections of the report that contain at least two unique keywords for the topic "Human rights".

For the project businessresponsibility.ch we used 66 topics in 5 categories (Human Rights, Environment, Corruption, Social Issues, Employee Concerns). If a sustainability report contained at least one topic of a category, we considered the report to cover that category. In this way we processed over 1,000 reports and created indicators of what companies report on sustainability issues.
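
The category rule described above can be sketched as follows. The topic-to-category mapping, the function `report_categories`, and the hit counts are all hypothetical and chosen only for illustration; the project's real mapping of its 66 topics and its exact criterion for when a report "contains" a topic are not shown in this notebook.

```python
# Hypothetical topic-to-category mapping (not the project's real mapping).
topic_to_category = {
    'Human rights': 'Human Rights',
    'Labor rights': 'Employee Concerns',
    'Climate change': 'Environment',
    'Social inequality': 'Social Issues',
}


def report_categories(topic_hits, min_hits=1):
    """Return the categories for which at least one topic was detected."""
    covered = set()
    for topic, hits in topic_hits.items():
        if hits >= min_hits:
            covered.add(topic_to_category[topic])
    return covered


# Made-up counts for one report: number of sections in which each topic was detected.
topic_hits = {
    'Human rights': 3,
    'Labor rights': 0,
    'Climate change': 12,
    'Social inequality': 0,
}

print(sorted(report_categories(topic_hits)))
```

A category counts as covered as soon as any one of its topics is detected, so the result is a set of categories rather than a score.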

If you want to know more about the project or the code, visit businessresponsibility.ch or bizres.ch, or contact us through GitHub or other means.

In [None]:
num_unique_keywords = 2
sections = df_section_topics[df_section_topics['Human rights'] >= num_unique_keywords]

for idx, row in sections.iterrows():
    page = row['page']
    text = row['section_text']

    print('(P. {}) {}'.format(page, text))
    print('******************')

(P. 22) Universal Declaration of Human Rights, ILO International Labour Standards, OECD Guidelines for Multinational Enterprises, OHSAS 18001, ISO 14001 and SA8000.
******************
(P. 81) The RBA Code of Conduct is our framework for the management of human rights and labor at our production facility. The RBA Code is aligned with international norms and standards including the Universal Declaration of Human Rights, ILO International Labor Standards, OECD Guidelines for Multinational Enterprises, ISO and SA standards.
******************
(P. 84) As a small company, playing in a global market, we recognize the value of collaboration. We joined the Responsible Business Alliance (RBA) in 2007, to collaborate with industry peers and competitors alike and develop tools and programs addressing the sustainability challenges facing our sector today. The RBA has an established Code of Conduct (“the RBA Code”), which is reflective of international norms and good practice, including the Universa