<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

To run this notebook in Google Colab from GitHub, follow these steps:
1. Upload the data to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem
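
Before going further, it can help to confirm that the working directory really contains the data. The following is a minimal sketch, not part of the original workflow: the helper `missing_subfolders` is a hypothetical name, and the list of subfolders is an assumption taken from the folder names used in the last cell of this notebook.

```python
from pathlib import Path

# Assumption: these are the data subfolders used later in this notebook.
EXPECTED_SUBFOLDERS = ["test-theses", "test-proposals", "supervisors", "train-papers"]


def missing_subfolders(base_dir, expected=EXPECTED_SUBFOLDERS):
    """Return the expected subfolder names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subfolders(Path.cwd())
    if missing:
        print("Missing data folders: " + ", ".join(missing))
    else:
        print("All expected data folders found.")
```

If any folder is reported as missing, the upload or the `%cd` step above most likely needs another look.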


# Convert PDF to TXT

Convert all PDF files in the current working directory to TXT files.

In [3]:
!pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/96/07/244fbb9c74c0de8a3745cc9f3f496077a29f6418c7cbd90d68fd799574cb/tika-1.24.tar.gz
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-1.24-cp37-none-any.whl size=32885 sha256=cc19645580ff36e464f84922a86916086bf428f2fe13a977d4714316c4736883
  Stored in directory: /root/.cache/pip/wheels/73/9c/f5/0b1b738442fc2a2862bef95b908b374f8e80215550fb2a8975
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [7]:
from tika import parser
import os

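def extract_text_from_pdfs_recursively(directory):
    """Write a .txt file next to every PDF found under `directory`."""
    for root, dirs, files in os.walk(directory):
        for file in files:
            path_to_pdf = os.path.join(root, file)
            stem, ext = os.path.splitext(path_to_pdf)
            # Compare case-insensitively so that '.PDF' files are not skipped.
            if ext.lower() != '.pdf':
                continue
            print("Processing " + path_to_pdf)
            pdf_contents = parser.from_file(path_to_pdf)
            path_to_txt = stem + '.txt'
            with open(path_to_txt, 'w', encoding='utf-8') as txt_file:
                print("Writing contents to " + path_to_txt)
                # The content can be None when no text is extracted
                # (e.g. image-only PDFs), so fall back to an empty string.
                txt_file.write(pdf_contents['content'] or '')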


if __name__ == "__main__":
    extract_text_from_pdfs_recursively(os.getcwd())

Processing /content/drive/My Drive/ThesisAllocationSystem/supervisors/Roemmele_MIA_MPP_Supervision plan_AY 2020-2021.pdf
Writing contents to /content/drive/My Drive/ThesisAllocationSystem/supervisors/Roemmele_MIA_MPP_Supervision plan_AY 2020-2021.txt
Processing /content/drive/My Drive/ThesisAllocationSystem/supervisors/Hammerschmid_MIA_MPP_Supervision plan_AY 2020-2021.pdf
Writing contents to /content/drive/My Drive/ThesisAllocationSystem/supervisors/Hammerschmid_MIA_MPP_Supervision plan_AY 2020-2021.txt
Processing /content/drive/My Drive/ThesisAllocationSystem/supervisors/Anheier_MIA_MPP_Supervision plan_AY 2020-2021.pdf
Writing contents to /content/drive/My Drive/ThesisAllocationSystem/supervisors/Anheier_MIA_MPP_Supervision plan_AY 2020-2021.txt
Processing /content/drive/My Drive/ThesisAllocationSystem/supervisors/Hassel_MIA_MPP_Supervision plan_AY 2020-2021 (002).pdf
Writing contents to /content/drive/My Drive/ThesisAllocationSystem/supervisors/Hassel_MIA_MPP_Supervision plan_AY 20
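
After the conversion, it may be worth checking that every PDF actually produced usable text, since scanned or image-only PDFs can yield an empty file. The sketch below is an illustration under assumptions: `pdfs_without_text` is a hypothetical helper, and it assumes the TXT files sit next to their PDFs with the same stem, as in the cell above.

```python
from pathlib import Path


def pdfs_without_text(base_dir):
    """Return PDFs under base_dir whose sibling .txt file is missing or blank."""
    problems = []
    for pdf_path in sorted(Path(base_dir).rglob("*")):
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        txt_path = pdf_path.with_suffix(".txt")
        if not txt_path.exists():
            problems.append(pdf_path)
            continue
        # Treat whitespace-only files as empty.
        text = txt_path.read_text(encoding="utf-8", errors="ignore")
        if not text.strip():
            problems.append(pdf_path)
    return problems


if __name__ == "__main__":
    for pdf_path in pdfs_without_text(Path.cwd()):
        print("No text extracted for " + str(pdf_path))
```

Files flagged here would probably need OCR or manual handling before they can be used downstream.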

# Put TXT files into CSV

After importing the packages, pass the name of a data subfolder to the function below to create a CSV file that contains all TXT files in that subfolder, one row per file, in the following structure:

FileName | Content
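
To make the two-column layout concrete, here is a small in-memory sketch. The helper `rows_to_csv_text` and the sample row are made up for illustration; only the column names come from this notebook. It also shows that the `csv` module quotes a field that contains a comma, so commas inside the text do not break the structure.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Render (file name, content) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["FileName", "Content"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv_text([("example.txt", "First line, second line")]), end="")
    # FileName,Content
    # example.txt,"First line, second line"
```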

In [8]:
import csv
from pathlib import Path

In [9]:
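def pdf_to_txt(x):
    """Collect every TXT file in subfolder `x` into `x`.csv.

    Despite its name, this function reads TXT files, not PDFs.
    """
    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)

    # newline='' lets the csv module control line endings itself.
    with open(x + '.csv', 'w', encoding='Latin-1', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for file_name in sorted(Path('.').glob('*.txt')):
            # Latin-1 maps every byte to a character, so decoding never fails.
            with open(file_name, 'rb') as one_text:
                lines = [line.decode('Latin-1').strip() for line in one_text]
            # Join all lines so that each file occupies a single CSV row.
            csv_out.writerow([str(file_name), ' '.join(lines)])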

In [10]:
pdf_to_txt('test-theses')
pdf_to_txt('test-proposals')
pdf_to_txt('supervisors')
pdf_to_txt('train-papers')
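
The resulting CSV files can be sanity-checked by reading them back. The sketch below is a hedged example, not part of the original notebook: `summarize_csv` is a hypothetical helper, and it assumes the `FileName`/`Content` header and Latin-1 encoding used above. Because a whole document ends up in one field, the `csv` module's default field size limit (131072 characters) is raised first; otherwise long texts raise an error on read.

```python
import csv


def summarize_csv(csv_path, encoding="Latin-1"):
    """Return (file name, content length) pairs for a FileName/Content CSV."""
    # Full-text fields can exceed the default limit of 131072 characters.
    csv.field_size_limit(10**9)
    with open(csv_path, newline="", encoding=encoding) as in_file:
        reader = csv.DictReader(in_file)
        return [(row["FileName"], len(row["Content"])) for row in reader]
```

For example, `summarize_csv('/content/drive/MyDrive/ThesisAllocationSystem/supervisors/supervisors.csv')` should list one entry per supervision plan; a content length of 0 points to a file for which no text was extracted.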