## Extracting text from CSR reports

This notebook contains the code for extracting sentences from CSR reports.

In [3]:
import pandas as pd
import fitz
import os
import spacy


In [38]:
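# empty dataframe; passing a dict to `columns` uses only its keys, so this creates a
# single empty 'Company' column (the company rows are added by the reading loop below)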
df = pd.DataFrame(columns = {'Company':['3M','Abbott','Accenture','Adobe','AEScorporation','AlaskaAir','Altria',
'Amazon','AmericanAirlines','Apple','AppliedMaterials','BD','BestBuy','Boeing','BostonProperties','Caesars',
'Campbells','CardinalHealth','Carnival','Caterpillar','Cigna','Cisco','Colgate','Comerica','Conagra','ConocoPhillips',
'CSX','CVS','DentsplySirona','Disney','Duke','EastmanChemicals','Equinix','EsteeLauder','FMC','Goldman','Hasbro',
'Healthpeak','HenrySchein','HewlettPackard','HomeDepot','HormelFood','Howmet','IBM','Intel','InternationalPaper',
'InterpublicGroup','Invesco','IronMountain','J.M.Smucker','Johnson_Johnson','JPMorgan','KeyCorp','Keysight','Kimco',
'Leidos','Lincoln','Lockheed','Lowes','Micron','Microsoft','Mohawk','Mondelez','NiSource','Northrop','NortonLifeLock',
'NorweigeanCruiseLine','NRG','Nvidia','Omnicon','Pentair','Pepsi','PNC','Prologis','PVH','P_G','Qualcomm','QuestDiagnostics',
'RoyalCaribbean','Seagate','SealedAir','SouthwestAirlines','Stanley','Starbucks','S_P','Tapestry','TexasInstruments',
'TJX','UnionPacific','UnitedRentals','UPS','Verizon','Visa','Vornado','Walmart','WasteManagemen','WellsFargo',
'Whirlpool','Wynn','Xylem']})

In [39]:
# reading all reports into the dataframe
# using PyMuPDF (fitz)

directory = 'reports'
years = ['reports_2016','reports_2017','reports_2018','reports_2019','reports_2020']

for folder in years:
    for filename in os.listdir(directory + '/' + folder):
        f = os.path.join(directory + '/' + folder, filename)
        # checking if it is a file
        if os.path.isfile(f):
            pdf = fitz.open(f)
            text = ''
            for page in pdf:
                text += page.get_text('text')
            # cleaning the name for the dataframe: drop the 21-character
            # 'reports/reports_YYYY/' prefix and the last 9 characters of the file name
            company_name = f[21:-9]
            df.loc[company_name, folder] = text
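
The slice `f[21:-9]` only works while the folder prefix and the file-name suffix keep exactly the same length. Below is a minimal sketch of a less position-dependent alternative. It assumes the files are named `<Company>_<year>.pdf`, which is inferred from the slice lengths and is not confirmed by the notebook. The helper name `company_from_path` is hypothetical.

```python
from pathlib import Path


def company_from_path(path):
    """Return the company part of a report path.

    Assumption (not confirmed by the notebook): files are named
    '<Company>_<year>.pdf', e.g. 'reports/reports_2016/3M_2016.pdf'.
    """
    stem = Path(path).stem                    # file name without '.pdf'
    company, sep, _year = stem.rpartition('_')  # split at the LAST underscore
    # If there is no underscore at all, fall back to the whole stem.
    return company if sep else stem


example = 'reports/reports_2016/3M_2016.pdf'
print(company_from_path(example))
print(company_from_path('reports/reports_2020/Johnson_Johnson_2020.pdf'))
```

Splitting at the last underscore keeps names such as `Johnson_Johnson` intact, and under the naming assumption above the result matches what `f[21:-9]` returns for the same path.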


In [40]:
df.head()

Unnamed: 0,Company,reports_2016,reports_2017,reports_2018,reports_2019,reports_2020
3M,,2016 Sustainability Report\n “Starting with te...,"We are\n#improvinglives\n90,000 employees help...",2018 Sustainability Report\nWe tapped youth de...,Improving \n every\n life\n2019...,Improving\nevery\nlife\n2020 Sustainability Re...
Abbott,,GLOBAL\nSUSTAINABILITY \nREPORT 2016\nFROM OU...,2017 \nGLOBAL \nSUSTAINABILITY \nREPORT\nABOUT...,GLOBAL \nSUSTAINABILITY \nREPORT 2018\nCHANG...,2019 GLOBAL SUSTAINABILITY REPORT\nCONTENTS\nS...,2020 GLOBAL SUSTAINABILIT Y REPORT\nCOVID-19: ...
Accenture,,DIFFERENCE\nCORPORATE\nCITIZENSHIP\nREPORT 201...,CORPORATE\nCITIZENSHIP\nREPORT 2017 \nSee how ...,\n \n \nIMPROVING \nTHE WAY THE \nWORLD WORK...,BUILDING \nA FUTURE \nOF SHARED \nSUCCESS\nC...,Building a Future \n of Shared Success\nCorpo...
Adobe,,SUSTAINABILITY & \nSOCIAL IMPACT REPORT\n2016...,SUSTAINABILITY & \nSOCIAL IMPACT REPORT\n2017...,ADOBE CORPORATE SOCIAL \nRESPONSIBILITY REPORT...,1\n2019 Adobe Corporate Sustainability Report\...,2\n2020 Adobe Corporate Social Responsibility ...
AEScorporation,,2016 AES \nSUSTAINABILITY \nREPORT\nASPECT: Pu...,AES SUSTAINABILITY REPORT\n2017\nPreliminary v...,2018 SUSTAINABILITY REPORT\n2018 SUSTAINABILIT...,2019\nSustainability Report\n2019 Sustainabili...,2020\nPerformance \nindicators\nAccelerating t...


In [41]:
# using an English pipeline for sentence detection
nlp = spacy.load('en_core_web_sm')

# increasing max length (spaCy's default is 1,000,000 characters) to avoid errors with very long reports
nlp.max_length = 20000000

In [42]:
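# function returns a list with the sentences detected by the spaCy pipeline
def get_list(text):
    doc = nlp(text)
    # Span.string was removed in spaCy v3; text_with_ws returns the same text including trailing whitespace
    sentences = [sent.text_with_ws.replace('\n', '') for sent in doc.sents]
    return sentences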

In [43]:
# getting sentences from all reports
df['sentences_2016'] = df['reports_2016'].apply(lambda x: get_list(x))
df['sentences_2017'] = df['reports_2017'].apply(lambda x: get_list(x))
df['sentences_2018'] = df['reports_2018'].apply(lambda x: get_list(x))
df['sentences_2019'] = df['reports_2019'].apply(lambda x: get_list(x))
df['sentences_2020'] = df['reports_2020'].apply(lambda x: get_list(x))
df.head()

Unnamed: 0,Company,reports_2016,reports_2017,reports_2018,reports_2019,reports_2020,sentences_2016,sentences_2017,sentences_2018,sentences_2019,sentences_2020
3M,,2016 Sustainability Report\n “Starting with te...,"We are\n#improvinglives\n90,000 employees help...",2018 Sustainability Report\nWe tapped youth de...,Improving \n every\n life\n2019...,Improving\nevery\nlife\n2020 Sustainability Re...,[2016 Sustainability Report “Starting with tec...,"[We are#improvinglives, 90,000 employees helpi...","[2018 , Sustainability Report, We tapped youth...","[Improving every life2019 , S...","[Improvingeverylife2020 , Sustainability Repor..."
Abbott,,GLOBAL\nSUSTAINABILITY \nREPORT 2016\nFROM OU...,2017 \nGLOBAL \nSUSTAINABILITY \nREPORT\nABOUT...,GLOBAL \nSUSTAINABILITY \nREPORT 2018\nCHANG...,2019 GLOBAL SUSTAINABILITY REPORT\nCONTENTS\nS...,2020 GLOBAL SUSTAINABILIT Y REPORT\nCOVID-19: ...,"[GLOBALSUSTAINABILITY REPORT 2016, FROM OUR C...","[2017 , GLOBAL , SUSTAINABILITY , REPORT, ABOU...","[GLOBAL SUSTAINABILITY REPORT 2018, CHANGING...","[2019 GLOBAL SUSTAINABILITY REPORTCONTENTS, Su...","[2020 GLOBAL SUSTAINABILIT Y REPORT, COVID-19:..."
Accenture,,DIFFERENCE\nCORPORATE\nCITIZENSHIP\nREPORT 201...,CORPORATE\nCITIZENSHIP\nREPORT 2017 \nSee how ...,\n \n \nIMPROVING \nTHE WAY THE \nWORLD WORK...,BUILDING \nA FUTURE \nOF SHARED \nSUCCESS\nC...,Building a Future \n of Shared Success\nCorpo...,"[DIFFERENCE, CORPORATE, CITIZENSHIP, REPORT 20...","[CORPORATE, CITIZENSHIPREPORT 2017 , See how S...","[ IMPROVING , THE WAY , THE , WORLD WORKS A...","[BUILDING A FUTURE OF SHARED , SUCCESS, Corp...",[Building a Future of Shared SuccessCorporat...
Adobe,,SUSTAINABILITY & \nSOCIAL IMPACT REPORT\n2016...,SUSTAINABILITY & \nSOCIAL IMPACT REPORT\n2017...,ADOBE CORPORATE SOCIAL \nRESPONSIBILITY REPORT...,1\n2019 Adobe Corporate Sustainability Report\...,2\n2020 Adobe Corporate Social Responsibility ...,"[SUSTAINABILITY & SOCIAL IMPACT REPORT2016, L...","[SUSTAINABILITY & SOCIAL IMPACT REPORT2017, D...","[ADOBE CORPORATE SOCIAL , RESPONSIBILITY REPOR...","[12019 , Adobe Corporate Sustainability Report...","[2, 2020 , Adobe Corporate Social Responsibili..."
AEScorporation,,2016 AES \nSUSTAINABILITY \nREPORT\nASPECT: Pu...,AES SUSTAINABILITY REPORT\n2017\nPreliminary v...,2018 SUSTAINABILITY REPORT\n2018 SUSTAINABILIT...,2019\nSustainability Report\n2019 Sustainabili...,2020\nPerformance \nindicators\nAccelerating t...,"[2016 AES SUSTAINABILITY , REPORT, ASPECT: , P...","[AES SUSTAINABILITY REPORT2017, Preliminary ve...","[2018 , SUSTAINABILITY REPORT, 2018 SUSTAINABI...","[2019Sustainability Report2019 , Sustainabilit...","[2020, Performance indicatorsAccelerating the ..."
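
Several entries in the output above, such as `Improvingeverylife2020` and `GLOBALSUSTAINABILITY`, show words glued together. This happens because `replace('\n', '')` deletes line breaks that were the only separator between two words. The following is a minimal sketch of one possible alternative, not the notebook's method: it turns every run of whitespace, including line breaks, into a single space. The helper name `join_lines` is hypothetical.

```python
import re


def join_lines(text):
    """Replace each run of whitespace (including line breaks) with one space."""
    return re.sub(r'\s+', ' ', text).strip()


raw = 'Improving\nevery\nlife\n2020 Sustainability Report'
print(raw.replace('\n', ''))  # approach used in get_list: words are glued together
print(join_lines(raw))        # words stay separated
```

One trade-off of this sketch is that a word hyphenated across a line break would come out with a space after the hyphen, so such cases would need separate handling.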


In [44]:
# removing report columns to reduce file size 
df = df.drop(['reports_2016','reports_2017','reports_2018','reports_2019','reports_2020'], axis=1)

In [45]:
# saving data in pickle format
df.to_pickle('all_reports_df.pickle', protocol=4)

Testing and inspecting the quality of the sentence detection.
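
Besides reading sample sentences by eye, a rough numeric check can help compare reports. The sketch below is only a heuristic under stated assumptions: it treats an item as a complete sentence if it has at least four words and ends with terminal punctuation. The function names and the threshold are hypothetical, and the sample strings are illustrative.

```python
def looks_complete(sentence, min_words=4):
    """Heuristic: enough words and terminal punctuation (threshold is an assumption)."""
    stripped = sentence.strip()
    return len(stripped.split()) >= min_words and stripped.endswith(('.', '!', '?'))


def fragment_share(sentences):
    """Fraction of detected items that fail the heuristic (0.0 for an empty list)."""
    if not sentences:
        return 0.0
    fragments = sum(1 for s in sentences if not looks_complete(s))
    return fragments / len(sentences)


sample = [
    'Early in the pandemic, our scientists created a series of tests.',
    'Pratham Education ',
    'Foundation: ',
    'term ',
]
print(fragment_share(sample))
```

A high share for a given report suggests that its layout (headings, columns, bullet lists) is confusing the sentence detector. Legitimate short sentences are also counted as fragments, so the number is a rough signal rather than a measurement.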

In [8]:
df_read = pd.read_pickle('all_reports_df.pickle')

In [9]:
df_read.head()

Unnamed: 0,Company,sentences_2016,sentences_2017,sentences_2018,sentences_2019,sentences_2020
3M,,[2016 Sustainability Report “Starting with tec...,"[We are#improvinglives, 90,000 employees helpi...","[2018 , Sustainability Report, We tapped youth...","[Improving every life2019 , S...","[Improvingeverylife2020 , Sustainability Repor..."
Abbott,,"[GLOBALSUSTAINABILITY REPORT 2016, FROM OUR C...","[2017 , GLOBAL , SUSTAINABILITY , REPORT, ABOU...","[GLOBAL SUSTAINABILITY REPORT 2018, CHANGING...","[2019 GLOBAL SUSTAINABILITY REPORTCONTENTS, Su...","[2020 GLOBAL SUSTAINABILIT Y REPORT, COVID-19:..."
Accenture,,"[DIFFERENCE, CORPORATE, CITIZENSHIP, REPORT 20...","[CORPORATE, CITIZENSHIPREPORT 2017 , See how S...","[ IMPROVING , THE WAY , THE , WORLD WORKS A...","[BUILDING A FUTURE OF SHARED , SUCCESS, Corp...",[Building a Future of Shared SuccessCorporat...
Adobe,,"[SUSTAINABILITY & SOCIAL IMPACT REPORT2016, L...","[SUSTAINABILITY & SOCIAL IMPACT REPORT2017, D...","[ADOBE CORPORATE SOCIAL , RESPONSIBILITY REPOR...","[12019 , Adobe Corporate Sustainability Report...","[2, 2020 , Adobe Corporate Social Responsibili..."
AEScorporation,,"[2016 AES SUSTAINABILITY , REPORT, ASPECT: , P...","[AES SUSTAINABILITY REPORT2017, Preliminary ve...","[2018 , SUSTAINABILITY REPORT, 2018 SUSTAINABI...","[2019Sustainability Report2019 , Sustainabilit...","[2020, Performance indicatorsAccelerating the ..."


In [10]:
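# printing a few sample sentences: sentences 200-204 of the reports in rows 1-4

for x in range(1,5):
    for y in range(200,205):
        # .iloc selects rows by position; the index holds company names
        print(df_read.sentences_2020.iloc[x][y])
        print('\n')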

To date, their efforts have resulted in 12 new tests for use in a broad range of applications, from high-throughput instruments capable of handling large volumes of tests at once, to rapid point-of-care testing delivering reliable, on-the-spot results, fast.


Early in the pandemic, with only limited information available about the virus, our scientists leveraged years of assay- development expertise to create a series of tests for both high-volume laboratories and doctors’ offices.


These included molecular tests, which help identify active infections, to run on our Alinity


® m, m2000 RealTime® and  ID NOWTM systems; and immunoassay tests for our Alinity 


i  and ARCHITECT® platforms to detect IgG and IgM antibodies to the COVID-19 virus, which help identify late-stage and previous infections. 


Pratham Education 


Foundation: 


Delivering employability skills training to youth in hospitality, 


electrical and health care trades, helping them find job placement; short-


term 