# Extracting text from PDF and Word files

In [1]:
from os import listdir
from os.path import isfile, join
import PyPDF2
import textract

In [2]:
PATH_TO_FILES = '../data/raw/examples/'

In [3]:
example_names = [f for f in listdir(PATH_TO_FILES) if isfile(join(PATH_TO_FILES, f))]; example_names

['Constitution.pdf',
 'FONAconstitution.pdf',
 'iso-8859-1GoverningDocumentBalesFarm.pdf',
 'FoundationGoverningDocument.docx',
 'GenerosityConstitutionscanned.pdf',
 'InstumentofGovernance.pdf',
 'YCBPTAconstitution2.pdf']
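The listing mixes PDFs with one `.docx` file, so it may help to separate the files by extension before choosing an extractor. The sketch below is one possible way to do that with the standard library. `group_by_extension` is a hypothetical helper, not something the notebook defines, and the filenames are copied from the output above.

```python
from collections import defaultdict
from os.path import splitext


def group_by_extension(filenames):
    """Map each lower-cased extension (e.g. '.pdf') to the filenames that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        _, extension = splitext(name)
        groups[extension.lower()].append(name)
    return dict(groups)


# Filenames copied from the listing above.
listed = [
    'Constitution.pdf',
    'FONAconstitution.pdf',
    'iso-8859-1GoverningDocumentBalesFarm.pdf',
    'FoundationGoverningDocument.docx',
    'GenerosityConstitutionscanned.pdf',
    'InstumentofGovernance.pdf',
    'YCBPTAconstitution2.pdf',
]
by_extension = group_by_extension(listed)
```

Here `by_extension['.pdf']` holds the six PDFs and `by_extension['.docx']` holds the single Word file.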

In [4]:
def extract_text_from_pdf(filename):
    """Return the text of a PDF in PATH_TO_FILES, falling back to OCR for scans."""
    path_to_file = join(PATH_TO_FILES, filename)

    with open(path_to_file, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Concatenate the text layer of every page (empty for image-only scans).
        text = "".join(page.extract_text() or "" for page in pdf_reader.pages)

    if text == "":
        # No text layer found, so run Tesseract OCR on the file instead.
        # Note: textract.process returns bytes, not str.
        text = textract.process(path_to_file, method='tesseract', language='eng')

    return text
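The function above only handles PDFs, while the listing also contains `FoundationGoverningDocument.docx`. A `.docx` file is a zip archive whose main text lives in `word/document.xml`, so a rough extraction is possible with the standard library alone. The sketch below assumes that layout. `extract_text_from_docx` is a hypothetical name, not part of the notebook, and the approach ignores headers, footers, footnotes and most formatting. The demo at the end builds a tiny in-memory stand-in with made-up content rather than reading the real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def extract_text_from_docx(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    paragraphs = []
    for paragraph in root.iter(W_NS + 'p'):
        # A paragraph's visible text is spread across its <w:t> elements.
        runs = [node.text or '' for node in paragraph.iter(W_NS + 't')]
        paragraphs.append(''.join(runs))
    return '\n'.join(paragraphs)


# Build a tiny in-memory .docx stand-in (illustrative content only) and read it back.
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Governing </w:t></w:r><w:r><w:t>document</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>1. Name</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', document_xml)
buffer.seek(0)
demo_text = extract_text_from_docx(buffer)
```

For a real file, the call would take a path instead, for example `extract_text_from_docx(join(PATH_TO_FILES, 'FoundationGoverningDocument.docx'))`. Unlike the OCR branch above, this returns `str` rather than bytes.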

#### Text-based

In [5]:
constitution = extract_text_from_pdf('Constitution.pdf'); constitution

' \n 1\n\nConstitution \n \n \n1  Name \nThe charity’s name is Wragby ChEF. \n \n2  The purposes of the charity are \n  To provide free healthy meals to local children during school holidays. \n \n3  Committee \nThe charity shall be managed by a committee of members who are appointed at the Annual \ngeneral meeting of the charity. \n \n4  Carrying out the purposes \nIn order to carry out the charitable purposes, the committee have the power to: \n \n(1)  raise funds, receive grants and donations \n \n(2)  apply funds to carry out the work of the charity \n \n(3)  co-operate with and support other charities with similar purposes \n \n(4)  do anything which is lawful and necessary to achieve the purposes. \n \n5\n \nMembership \nThe charity shall have a membership. People who support the work of the charity and are \naged 18 or over, can apply to the committee to become a member. Once accepted by the \ncommittee, membership lasts for 1 year and may be renewed. The committee will keep an 

#### Scanned - dodgy

In [6]:
constitution = extract_text_from_pdf('FONAconstitution.pdf'); constitution

b"SCHEDULE 2\nCOMPANY NOT HAVING A SHARE CAPITAL\n\nMemorandum of Association of:\nFRIENDS OF NEITHROP ALLOTMENTS\n\nEach Subscriber to this memorandum of association wishes to form a company under the\nCompanies Act 2006 and agrees to become a member of the company.\n\nName of each subscriber:\n\nSTEPHEN WILLIAM KILSBY\n\nDATED: 30/09/2017\n\nARTICLES OF ASSOCIATION\nPRIVATE COMPANY LIMITED BY GUARANTEE\nAND NOT HAVING A SHARE CAPITAL\n\nThe Company will assume the Statutory Model Articles of Association for a Limited by\nGuarantee Company (not having a share capital) subject to the following amendments.\nThe provisions made herein and the Model Articles of Association will combine to form\nthe constitution of the company.\n\n1. The objects for which the Company is established are:\n\n1.1 To manage the Neithrop allotment site which is in Boxhedge Road, Banbury OX16\nOBP in conjunction with the Boxhedge Allotments Association for the benefit of the\nallotment holders and the local comm

#### Scanned - not dodgy

In [7]:
constitution = extract_text_from_pdf('GenerosityConstitutionscanned.pdf'); constitution

b"Generosity\n\n1.\n\nGENEROSITY\n\nTHE CONSTITUTION\n\nNAME\n\nThe name of the Organisation shall be GENEROSITY (hereinafter referred to as the \xe2\x80\x98Unincorporated\nCommunity Association\xe2\x80\x99). The registered address of the unincorporated association shall be at 19\nKempton Park Road, Birmingham, B36 8RE.\n\n2.\n\nOBJECTS\n\nThe objectives of the Organisation will be:\n\na)\n\nb)\n\na)\nb)\nc)\n\nd)\n\ne)\n\nTo work with vulnerable people within the minority ethnic (Black African) in Birmingham by\nassisting them to improve their physical, mental, and emotional wellbeing.\n\nTo bring members of the community together by organising social events to improve social\nparticipation, promote communities\xe2\x80\x99 cohesion, and reduce loneliness and isolation.\n\nTo run campaigns for children\xe2\x80\x99s right, social justice, and the environment in which they live\nto prevent health issues.\n\nTo create alliances and partnerships between migrant community organisations to i

#### Scanned - rotated

In [11]:
constitution = extract_text_from_pdf('InstumentofGovernance.pdf'); constitution


b'STATUTORY INSTRUMENTS\n\n2013 No. 1624\nEDUCATION, ENGLAND\n\nThe School Governance (Roles, Procedures and Allowances)\n(England) Regulations 2013\n\nMade - eee Ist July 2013\nLaid before Parliament Sth July 2013\nComing into force -\xe2\x80\x94 - Ist September 2013\n\nThe Secretary of State for Education, in exercise of the powers conferred by sections 19(3), 19(8),\n21(3), 23, 24, 34(5) and 210(7) of the Education Act 2002(a) and sections 519 and 569(4) of, and\nparagraphs 3, 15(1)(b), 15(2)(c), 15(2)(e), 15(2) (0 and 15(2)(h) of Schedule 1 to the Education\nAct 1996(b) makes the following Regulations:\n\nPART 1\nIntroduction\n\nCitation, commencement and application\n\n1.\xe2\x80\x94(]) These Regulations may be cited as the School Governance (Roles, Procedures and\nAllowances) (England) Regulations 2013 and come into force on 1st September 2013.\n\n(2) These Regulations apply in relation to England only.\n\n(3) Parts 2 to 6 of and Schedules 1 and 2 to these Regulations apply only 

#### Scanned - half-dodgy

In [9]:
constitution = extract_text_from_pdf('iso-8859-1GoverningDocumentBalesFarm.pdf'); constitution

b'FILE COPY\n\nCERTIFICATE OF INCORPORATION\nOFA\nCOMMUNITY INTEREST COMPANY\n\nCompany Number 12984299\n\nThe Registrar of Companies for England and Wales, hereby certifies\nthat\n\nBALE\'S FARM CIC\n\nis this day incorporated under the Companies Act 2006 as a\nCommunity Interest Company; is a private company, that the company\nis limited by guarantee, and the situation of its registered office is in\nEngland and Wales\n\nGiven at Companies House, Cardiff, on 29th October 2020\n\n*N12984299V *\n\n\xe2\x80\x98THE OFFICIAL SEAL OF THE\n\nCom pan ies House REGISTRAR OF COMPANIES\n\nThe above information was communicated by electronic means and authenticated\nby the Registrar of Companies under section 1115 of the Companies Act 2006\nCompanies House INO1Ls\n\nReceived for filing in Electronic Format on the: 26/10/2020 X9GGV40G\nCompany Name in BALE\'S FARM CIC\nfull:\nCompany Type: Private company limited by guarantee\nSituation of England and Wales\nRegistered Office:\nProposed Registere

#### Scanned - half-dodgy

In [10]:
constitution = extract_text_from_pdf('YCBPTAconstitution2.pdf'); constitution

b'Ysgol Cwm Brombil Parent Teacher Association\n\nConstitution\n1, Title\nThe Association shall be known as the Ysgol Cwm Brombil Parent Teacher Association (often simply\nreferred to as the PTA)\n\n2. Aims\nThe aims of the Association are to advance the education and wellbeing of the pupils of the school\n\nby providing or assisting in the provision of facilities for education at the school (not normally\nprovided by the Local Authority). This includes:-\n\na} promoting close co-operation and communication between parents and teachers\nb) studying and discussing matters of mutual interest relating to the education and welfare of pupils\n\nc) engaging in activities which support and advance the education of the pupils attending the\nschool, including fund raising and after school activities\n\nd) Applying for funding for wellbeing/enrichment activities for families and pupils of the school.\n\ne) considering applications for funds put to the PTA from parents, teachers, Pupil Council an
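Every OCR result above is a `bytes` object (note the `b'...'` prefix and escapes such as `\xe2\x80\x99`), whereas the text-based PDF produced a `str`. If a uniform `str` is wanted downstream, one option is to decode the OCR output. The sketch below assumes the bytes are UTF-8, which matches the escape sequences visible in the outputs. `ocr_bytes_to_str` is a hypothetical helper, and the sample is a fragment copied from the `GenerosityConstitutionscanned.pdf` output.

```python
import re


def ocr_bytes_to_str(raw, encoding='utf-8'):
    """Decode OCR output to str and collapse runs of blank lines."""
    # errors='replace' keeps going if a byte sequence is not valid in the encoding.
    text = raw.decode(encoding, errors='replace')
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r'\n{3,}', '\n\n', text).strip()


# Fragment copied from the GenerosityConstitutionscanned.pdf output above.
sample = b"promote communities\xe2\x80\x99 cohesion, and reduce loneliness"
decoded = ocr_bytes_to_str(sample)
```

After decoding, the three escaped bytes become a single right single quotation mark (U+2019), so `decoded` reads as ordinary text.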