<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

As part of the workflow between GitHub and Google Colab, please follow these steps: 
1. Upload the data to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

# **Convert PDF to TXT**

Convert all PDF files in the current working directory to TXT files.

In [None]:
!pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/96/07/244fbb9c74c0de8a3745cc9f3f496077a29f6418c7cbd90d68fd799574cb/tika-1.24.tar.gz
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-1.24-cp37-none-any.whl size=32885 sha256=e2d890fa4d5c2bd3511658ea74a74e80ba9883b49e6db85ae6e42c0509bd1d8f
  Stored in directory: /root/.cache/pip/wheels/73/9c/f5/0b1b738442fc2a2862bef95b908b374f8e80215550fb2a8975
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [None]:
import os
from tika import parser 
import re

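def read_pdf(pdf_file):
    # Extract the text with Tika; 'content' is None when no text can be extracted
    text = parser.from_file(pdf_file)['content'] or ''
    # Trim leading and trailing whitespace, including newlines
    final = text.strip()
    # Return bytes; characters that Latin-1 cannot represent are dropped
    return final.encode("latin-1", "ignore")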

def pdf_to_txt(folder_with_pdf, dest_folder):
    pdf_files = []

    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            if f.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, f))
    #print(pdf_files)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0]+'.txt'
        with open(os.path.join(dest_folder,text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None
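
`read_pdf` returns bytes rather than a string, and the `"ignore"` error handler silently drops every character that Latin-1 cannot represent. The following is a minimal sketch of that behaviour; the sample string is an assumption chosen for illustration and is not taken from the data.

```python
# Toy input (hypothetical): "ï" exists in Latin-1, the en dash (U+2013) does not
sample = 'na\u00efve \u2013 test'

# Same call as in read_pdf: unencodable characters are dropped, not replaced
encoded = sample.encode('latin-1', 'ignore')

# Decoding with the same codec recovers the characters that survived
decoded = encoded.decode('latin-1')

print(encoded)
print(decoded)
```

Typographic characters such as dashes and curly quotes are therefore missing from the resulting TXT files.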

In [None]:
# `check` is not defined in this notebook, so the call below would raise a NameError
# check('./train-papers', './train-papers-txt')

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [None]:
# Warning: this takes a couple of minutes to run
pdf_to_txt('./train-papers', './train-papers-txt') 

In [None]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [None]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# **Put TXT files into CSV**

After importing the packages, define the directory of interest and run the function below to create a single CSV file that contains all TXT files, with the string columns `FileName` and `Content`.

In [None]:
from glob import glob
import pandas as pd

In [None]:
def txt_to_csv(input_dir, output_dir, new_filename): 
  
  files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
  data = [[i, open(i, 'rb').read()] for i in files]
  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['FileName'] = df['FileName'].str.replace('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/', '')
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Truncate to 32,767 characters, Excel's limit per cell
  df.to_csv(output_dir + '/' + new_filename + '.csv', index = False)
  if not df.empty: 
    print('Successfully converted txt files in directory ' + input_dir + ' to a single csv file.')
  else: 
    print('File empty.') 
  return None
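
The outputs further down show `Content` values that start with `b'`, which suggests that the bytes read in `txt_to_csv` end up in the CSV as their printed representation. A minimal sketch of a text-mode variant follows. The function `txt_dir_to_frame`, its `max_chars` parameter, and the demo file names are hypothetical and not part of the notebook, and switching to such a variant would change the outputs shown in later cells.

```python
import os
import tempfile

import pandas as pd

def txt_dir_to_frame(input_dir, max_chars=32767):
    """Collect all .txt files in input_dir into a FileName/Content dataframe."""
    rows = []
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith('.txt'):
            continue
        # Text mode with an explicit codec yields str objects, not bytes
        with open(os.path.join(input_dir, name), encoding='latin-1') as handle:
            rows.append([name, handle.read()[:max_chars]])
    return pd.DataFrame(rows, columns=['FileName', 'Content'])

# Demo on a throwaway directory with two hypothetical files
with tempfile.TemporaryDirectory() as tmp:
    samples = [('bryson1.txt', 'Agent-based modelling'),
               ('mair2.txt', 'Social entrepreneurship')]
    for name, text in samples:
        with open(os.path.join(tmp, name), 'w', encoding='latin-1') as handle:
            handle.write(text)
    demo_df = txt_dir_to_frame(tmp)

print(demo_df)
```

Reading in text mode also closes each file as soon as it has been read, because the `with` block handles the file handle.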

In [None]:
# Warning: this takes a couple of minutes to run
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

In [None]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

In [None]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

In [None]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [None]:
import numpy as np

# Map each professor to a categorical research-domain label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'flachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication'                         
}

In [None]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Strip the .txt extension and any digits from FileName, then lowercase it
data["FileName"] = data["FileName"].str.replace(r'\.txt$', '', regex=True).str.replace(r'\d+', '', regex=True).str.lower()

print(data.sample(10))

       FileName                                            Content
179  flachsland  b"How to avoid history repeating itself: the c...
91       bryson  b'Agent-based Modelling\n\nElizabeth M. Gallag...
587   cingolani  b'0003958859 154..185\n\n\nSee discussions, st...
215  hallerberg  b'CESifo Working Paper no. 6228\n\n\neconstor\...
355  kreyenfeld  b'Anticipatory analysis and its alternatives i...
126        cali  b'3\n\nThe International Court of Justice as a...
59       jankin  b'Big data to the rescue? Challenges in analys...
95       bryson  b"Citation for published version:\nTheodorou, ...
34     roemmele  b"Populism in the era of Twitter: How social m...
350  kreyenfeld  b'Demographic Research   a free, expedited, on...


In [None]:
# Look up each FileName in the dictionary and store the matching label in a new domain column
data["domain"] = data["FileName"].map(domain_dict)

print(data)

         FileName  ...                domain
0    hammerschmid  ...     public_management
1    hammerschmid  ...     public_management
2    hammerschmid  ...     public_management
3    hammerschmid  ...     public_management
4    hammerschmid  ...     public_management
..            ...  ...                   ...
631          mair  ...  strategic_management
632          mair  ...  strategic_management
633          mair  ...  strategic_management
634    hallerberg  ...     fiscal_governance
635      costello  ...         migration_law

[636 rows x 3 columns]
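
`Series.map` returns `NaN` for every `FileName` that has no matching key in the dictionary, so a misspelled key goes unnoticed unless it is checked explicitly. A minimal sketch of such a check on toy data follows; the two-entry dictionary and the name `smith` are assumptions for illustration only.

```python
import pandas as pd

# Toy subset of the label dictionary (hypothetical)
toy_dict = {'bryson': 'technology_governance', 'mair': 'strategic_management'}

# 'smith' has no entry in toy_dict, so map() yields NaN for that row
toy = pd.DataFrame({'FileName': ['bryson', 'mair', 'smith', 'mair']})
toy['domain'] = toy['FileName'].map(toy_dict)

# List every name that did not receive a label
unmapped = sorted(toy.loc[toy['domain'].isna(), 'FileName'].unique())
print(unmapped)
```

Running the same check on `data` after the mapping step would list any professors whose papers received no label.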


In [None]:
# One-hot encode the research domain labels (one binary column per label)
dum_df = pd.get_dummies(data, columns=["domain"])
dum_df

type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8
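
The one-hot step can be illustrated on a small example. This is a sketch on toy data, not the notebook's dataframe; the column values are assumptions. It also shows that a row without a label (`NaN`) gets an all-zero vector.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the last row has no domain label
toy = pd.DataFrame({
    'Content': ['text a', 'text b', 'text c', 'text d'],
    'domain': ['education', 'data_science', 'education', np.nan],
})

# One binary column per label, sorted alphabetically by label name;
# astype(int) gives 0/1 regardless of the pandas default dtype
dummies = pd.get_dummies(toy['domain'], prefix='domain').astype(int)
label_names = list(dummies.columns)

# Collapse the binary columns into one list per row
toy_labels = dummies.values.tolist()

print(label_names)
print(toy_labels)
```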

In [None]:
# Concatenate FileName and Content with the one-hot label columns
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)
data

Unnamed: 0,FileName,Content,domain_comparative_politics,domain_data_science,domain_democracy_studies,domain_development_economics,domain_development_studies,domain_education,domain_european_governance,domain_fiscal_governance,domain_international_law,domain_international_organizations,domain_international_security,domain_labour_policy,domain_migration_law,domain_non_profit,domain_organisational_management,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_strategic_management,domain_technology_governance
0,hammerschmid,"b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,hammerschmid,b'2007 EGPA_paper1109.doc\n\n\nSee discussions...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,hammerschmid,b'The Governance of Infrastructure \n\n \n\nEd...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,hammerschmid,b'Administrative tradition and management refo...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,hammerschmid,"b""COCOPS Working Paper no. 1\n\n\nCoordinating...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
631,mair,"b'Social Entrepreneurship\n\nJohanna Mair, Jef...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
632,mair,b'Going global: how middle managers approach t...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
633,mair,b'Microsoft Word - DI-593-E.doc\n\n\n \n1 \n\n...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
634,hallerberg,"b""1002 Initial Cover.pdf\n\n\neconstor\nMake Y...",0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
data.drop(['FileName'], inplace=True, axis=1)

In [None]:
train_df = pd.DataFrame()
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

print(train_df.sample(10))

#type(train_df['labels'].iloc[1])

#label = train_df['labels'].iloc[1]

#type(label[1])

                                               content                                             labels
198  b'2008\n\n\n \n \n \n \n \n \n \n \n \n \n \n ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
301  b'Insuring individualsand politicians: financi...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
491  b"World Bank Document\n\n\nPolicy Research Wor...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
286  b'Philanthropic Foundations in Cross-National ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
338  b'Civil Servants in Advisory Domains: Between ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
124  b'RSCAS 2019/43rev.3 Legal Trajectories of Neo...  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
57   b"Reading The Tea Leaves\n\n\n \n\nReading The...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
494  b'Microsoft Word - Iacovone.sg1 (003).docx\n\n...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
169  b"ISBN 978-94-6138-478-2 \n\nAvailable fo

In [None]:
# Save the labelled dataframe as CSV
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)
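
Because CSV stores every cell as text, the lists in the `labels` column come back as strings when the file is read again. A minimal sketch of the round trip on toy data follows; it uses an in-memory buffer instead of the GDrive path, and the two example rows are assumptions.

```python
import ast
import io

import pandas as pd

# Toy frame with the same column layout as train_df (values are hypothetical)
original = pd.DataFrame({
    'content': ['doc one', 'doc two'],
    'labels': [[0, 1, 0], [1, 0, 0]],
})

# Write to and read from an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# After reading, each label cell is a string such as "[0, 1, 0]"
raw_type = type(restored['labels'].iloc[0])

# Parse the strings back into Python lists
restored['labels'] = restored['labels'].apply(ast.literal_eval)
print(raw_type, restored['labels'].tolist())
```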

# **Data Labelling: Test**

In this section, we assign the newly created labels to student thesis proposals, according to the first or second preference given for each proposal. The finished dataset will serve as a validation/test set.
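
The labelling cells in this section are still empty, so the following is only a sketch of one way to build such labels, not the notebook's method. It assumes that each proposal comes with the names of a first and a second preferred supervisor; the function `preference_labels` and the toy dictionary are hypothetical.

```python
def preference_labels(first, second, domain_of, label_order):
    """Return a binary vector that marks the domains of both preferred supervisors."""
    # Unknown supervisor names map to None, which matches no label
    wanted = {domain_of.get(first), domain_of.get(second)}
    return [1 if label in wanted else 0 for label in label_order]

# Toy supervisor-to-domain mapping (hypothetical subset)
toy_domains = {
    'bryson': 'technology_governance',
    'graf': 'education',
    'jankin': 'data_science',
}

# Fix the label order once so that every vector uses the same positions
toy_order = sorted(set(toy_domains.values()))

both_known = preference_labels('graf', 'jankin', toy_domains, toy_order)
one_unknown = preference_labels('bryson', 'nobody', toy_domains, toy_order)
print(toy_order)
print(both_known, one_unknown)
```

For the vectors to be comparable with the training labels, the label order would have to match the column order of the one-hot encoding used for the train data.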

In [None]:
# Load test data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-theses-final.csv', encoding = 'latin1')

In [None]:
# Add labels column with supervisor 

In [None]:
# Save df