<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

To run this notebook from GitHub in Google Colab, follow these steps:
1. Upload the data to a folder in your Google Drive.
2. Mount your Google Drive.
3. Set the data folder as the current working directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem
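
Before converting anything, it can help to confirm that the working directory really contains the data folders. The following is a minimal sketch, not part of the original workflow; the helper name `missing_folders` is hypothetical, and the folder names are taken from the conversion calls later in this notebook.

```python
from pathlib import Path

# Folder names as used by the pdf_to_txt calls in this notebook.
EXPECTED_FOLDERS = ["supervisors", "train-papers", "test-theses", "test-proposals"]

def missing_folders(base_dir, expected=EXPECTED_FOLDERS):
    """Return the expected sub-folders that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]

# In Colab, after the %cd above, an empty list means all folders were found:
# print(missing_folders("."))
```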


# Convert PDF to TXT

Convert all PDF files in a source folder (including its sub-folders) to TXT files in a destination folder.

In [3]:
!pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/96/07/244fbb9c74c0de8a3745cc9f3f496077a29f6418c7cbd90d68fd799574cb/tika-1.24.tar.gz
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-1.24-cp37-none-any.whl size=32885 sha256=1ad996386c8e8b909a453488449ecbf0a4322518f8aa4831b158547219001133
  Stored in directory: /root/.cache/pip/wheels/73/9c/f5/0b1b738442fc2a2862bef95b908b374f8e80215550fb2a8975
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [72]:
import os
from tika import parser 
import re
import time
from tqdm import tqdm

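def read_pdf(pdf_file):
    """Extract the text of a PDF with Tika and return it as Latin-1 bytes."""
    # 'content' can be None when Tika finds no extractable text.
    text = parser.from_file(pdf_file)['content'] or ''
    # Trim surrounding whitespace and drop characters outside Latin-1.
    return text.strip().encode('latin-1', 'ignore')

def pdf_to_txt(folder_with_pdf, dest_folder):
    """Convert every PDF under folder_with_pdf to a TXT file in dest_folder."""
    pdf_files = []

    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            # Match on the extension only, not on '.pdf' anywhere in the name.
            if f.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, f))

    # Create the destination folder if it does not exist yet.
    os.makedirs(dest_folder, exist_ok=True)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0] + '.txt'
        with open(os.path.join(dest_folder, text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None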
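
To make the effect of the Latin-1 encoding step visible, here is a small stand-alone sketch that does not require Tika. The helper name `clean_extracted_text` and the sample string are hypothetical; the `None` branch assumes that the parser may return no content for some files.

```python
def clean_extracted_text(text):
    """Strip surrounding whitespace and encode to Latin-1, dropping other characters."""
    if text is None:
        # Assumption: no extractable text was found for this file.
        return b""
    return text.strip().encode("latin-1", "ignore")

sample = "\n\n  Café – “quoted” text \n"
print(clean_extracted_text(sample))  # b'Caf\xe9  quoted text'
print(clean_extracted_text(None))    # b''
```

The accented letter survives because it exists in Latin-1, while the en dash and the curly quotes are silently dropped. This is worth keeping in mind when inspecting the TXT files.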

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [None]:
# Warning: this function has no progress bar yet and takes a few minutes to run.
pdf_to_txt('./train-papers', './train-papers-txt') 

In [193]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [None]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 
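
Since the conversion gives no feedback, a quick count of PDFs against TXT files per folder can confirm that nothing was skipped. This is only a sketch with hypothetical helper names (`count_by_suffix`, `conversion_report`); the folder pairs mirror the four calls above.

```python
import os

def count_by_suffix(folder, suffix):
    """Count files under folder (recursively) whose name ends with suffix, ignoring case."""
    total = 0
    for _root, _dirs, files in os.walk(folder):
        total += sum(1 for name in files if name.lower().endswith(suffix))
    return total

def conversion_report(pairs):
    """Map each source folder to (number of PDFs, number of TXT files in its destination)."""
    return {src: (count_by_suffix(src, ".pdf"), count_by_suffix(dst, ".txt"))
            for src, dst in pairs}

PAIRS = [
    ("./supervisors", "./supervisors-txt"),
    ("./train-papers", "./train-papers-txt"),
    ("./test-theses", "./test-theses-txt"),
    ("./test-proposals", "./test-proposals-txt"),
]

# for src, (n_pdf, n_txt) in conversion_report(PAIRS).items():
#     print(src, n_pdf, n_txt)
```

The two counts can legitimately differ: `pdf_to_txt` names each TXT file after the PDF's base name, so PDFs with the same name in different sub-folders end up in a single TXT file.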

# Put TXT files into CSV

After importing the packages, define the directory of interest and run the function below to create a single CSV file that contains all TXT files in the following structure:

Filename | Content 
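
As an illustration of this two-column layout, the following plain-Python sketch builds the same structure with the standard `csv` module. It is a variant, not the notebook's function: it decodes the files as Latin-1 text instead of reading raw bytes, and the helper names `txt_rows` and `write_rows` are hypothetical.

```python
import csv
import os
from glob import glob

MAX_CELL_CHARS = 32767  # same cut-off as used in this notebook

def txt_rows(input_dir, max_chars=MAX_CELL_CHARS):
    """Return (file name, truncated content) for every TXT file in input_dir."""
    rows = []
    for path in sorted(glob(os.path.join(input_dir, "*.txt"))):
        with open(path, encoding="latin-1") as handle:
            rows.append((os.path.basename(path), handle.read()[:max_chars]))
    return rows

def write_rows(rows, csv_path):
    """Write the rows to csv_path under the header FileName, Content."""
    with open(csv_path, "w", newline="", encoding="latin-1") as handle:
        writer = csv.writer(handle)
        writer.writerow(["FileName", "Content"])
        writer.writerows(rows)
```

Decoding to text keeps the `b'...'` wrapper out of the `Content` column, at the cost of deviating from the byte-based files produced in the previous section.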

In [194]:
from glob import glob
import pandas as pd

In [195]:
def txt_to_csv(input_dir, output_dir, new_filename):
    """Collect all TXT files in input_dir into one CSV with columns FileName and Content."""
    base_dir = '/content/drive/MyDrive/ThesisAllocationSystem/'
    files = glob(base_dir + input_dir + '/*.txt')

    data = []
    for path in files:
        # Files are read as raw bytes, so the CSV stores their Python repr (b'...').
        with open(path, 'rb') as f:
            data.append([os.path.basename(path), f.read()])

    df = pd.DataFrame(data, columns=['FileName', 'Content'])
    # Truncate to 32,767 characters, the per-cell limit in Excel.
    df['Content'] = df['Content'].str.slice(start=0, stop=32767)
    df.to_csv(output_dir + '/' + new_filename + '.csv', index=False)
    if not df.empty:
        print('Successfully converted txt files in directory ' + input_dir + ' to single csv file.')
    else:
        print('File empty.')
    return None

In [192]:
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

Successfully converted txt files in directory train-papers-txt to single csv file.


In [196]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

Successfully converted txt files in directory test-theses-txt to single csv file.


In [190]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

Successfully converted txt files in directory test-proposals-txt to single csv file.


In [191]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

Successfully converted txt files in directory supervisors-txt to single csv file.


# Data Labelling

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. 

In [197]:
import pandas as pd
import numpy as np

# Map each professor (dictionary key) to a research-domain label
domain_dict = {'Anheier': 'non_profit',
              'Bryson': 'technology_governance',
              'CIS': 'international_security',
              'Cali': 'international_law',
              'Cingolani': 'development_studies',              
              'Costello': 'migration_law',
              'Flachsland': 'climate_sustainability',
              'Graf': 'education',
              'Hallerberg': 'fiscal_governance',
              'Hammerschmid': 'public_management',
              'Hassel': 'labour_policy',
              'Hirth': 'energy_economics',
              'Hustedt': 'public_administration',
              'Iacovone': 'development_economics',
              'Jachtenfuchs': 'european_governance',
              'Jankin': 'data_science',
              'Kayser': 'comparative_politics',
              'Kreyenfeld': 'social_policy',
              'Mair': 'strategic_management',
              'Mena': 'organisational_management',              
              'Mungiu-Pippidi': 'democracy_studies',
              'Munzert': 'political_behaviour',
              'Patz': 'international_organizations',
              'Reh': 'european_politics',
              'Roemmele': 'political_communication'                         
}


In [198]:
# Load train data
train_df = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding='latin1')

# Strip the .txt extension and any digits from FileName
train_df["FileName"] = train_df["FileName"].str.replace(r'\.txt$', '', regex=True).str.replace(r'\d+', '', regex=True)

print(train_df)

         FileName                                            Content
0    Hammerschmid  b'2007 EGPA_paper1109.doc\n\n\nSee discussions...
1    Hammerschmid  b'The Governance of Infrastructure \n\n \n\nEd...
2    Hammerschmid  b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...
3    Hammerschmid  b"COCOPS Working Paper no. 1\n\n\nCoordinating...
4    Hammerschmid  b'Administrative tradition and management refo...
..            ...                                                ...
406  jachtenfuchs  b'Balancing sub-unit autonomy and collective p...
407  jachtenfuchs  b'From Market Integration to Core State Powers...
408  jachtenfuchs  b'More integration, less federation: the Europ...
409  jachtenfuchs  b'From Market Integration to Core State Powers...
410  jachtenfuchs  b'Deepening and widening integration theory\n\...

[411 rows x 2 columns]
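
The file-name clean-up can be checked in isolation with the standard `re` module. This is a minimal sketch; the helper name `professor_key` and the example file names are hypothetical.

```python
import re

def professor_key(file_name):
    """Drop a trailing '.txt' and all digits from a file name."""
    stem = re.sub(r"\.txt$", "", file_name)  # escaped dot: only a literal '.txt' ending
    return re.sub(r"\d+", "", stem)

print(professor_key("Hammerschmid12.txt"))   # Hammerschmid
print(professor_key("Mungiu-Pippidi3.txt"))  # Mungiu-Pippidi
print(professor_key("Atxt"))                 # Atxt
```

The last call shows why the dot is escaped: with an unescaped `.txt$`, the dot matches any character and the whole name `Atxt` would be removed.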


In [199]:
# Map each FileName to its label in domain_dict; names without a matching key (the lookup is case-sensitive) become NaN
train_df["domain"] = train_df["FileName"].map(domain_dict)

print(train_df)

         FileName  ...             domain
0    Hammerschmid  ...  public_management
1    Hammerschmid  ...  public_management
2    Hammerschmid  ...  public_management
3    Hammerschmid  ...  public_management
4    Hammerschmid  ...  public_management
..            ...  ...                ...
406  jachtenfuchs  ...                NaN
407  jachtenfuchs  ...                NaN
408  jachtenfuchs  ...                NaN
409  jachtenfuchs  ...                NaN
410  jachtenfuchs  ...                NaN

[411 rows x 3 columns]
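
The `NaN` rows above arise because the dictionary lookup is case-sensitive: `jachtenfuchs` does not match the key `Jachtenfuchs`. One possible remedy, sketched here in plain Python under the assumption that letter case is the only difference, is to compare lower-cased names. The two labels are a subset of `domain_dict`; `lookup` and the name list are hypothetical.

```python
# Subset of domain_dict, for illustration only.
labels = {"Hammerschmid": "public_management", "Jachtenfuchs": "european_governance"}

# Index the labels by lower-cased key so that letter case no longer matters.
labels_ci = {key.lower(): value for key, value in labels.items()}

def lookup(name):
    """Return the label for name regardless of letter case, or None if unknown."""
    return labels_ci.get(name.lower())

names = ["Hammerschmid", "jachtenfuchs", "unknown"]
unmapped = [n for n in names if lookup(n) is None]
print(lookup("jachtenfuchs"))  # european_governance
print(unmapped)                # ['unknown']
```

Listing the unmapped names, as in the last line, is a cheap way to spot file names that still need a dictionary entry.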


In [200]:
# One-hot encode the research-domain label (one binary column per domain)
dum_df = pd.get_dummies(train_df, columns=["domain"])
dum_df

Unnamed: 0,FileName,Content,domain_climate_sustainability,domain_data_science,domain_education,domain_fiscal_governance,domain_international_law,domain_labour_policy,domain_migration_law,domain_non_profit,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_technology_governance
0,Hammerschmid,b'2007 EGPA_paper1109.doc\n\n\nSee discussions...,0,0,0,0,0,0,0,0,0,0,1,0,0
1,Hammerschmid,b'The Governance of Infrastructure \n\n \n\nEd...,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Hammerschmid,"b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...",0,0,0,0,0,0,0,0,0,0,1,0,0
3,Hammerschmid,"b""COCOPS Working Paper no. 1\n\n\nCoordinating...",0,0,0,0,0,0,0,0,0,0,1,0,0
4,Hammerschmid,b'Administrative tradition and management refo...,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,jachtenfuchs,b'Balancing sub-unit autonomy and collective p...,0,0,0,0,0,0,0,0,0,0,0,0,0
407,jachtenfuchs,b'From Market Integration to Core State Powers...,0,0,0,0,0,0,0,0,0,0,0,0,0
408,jachtenfuchs,"b'More integration, less federation: the Europ...",0,0,0,0,0,0,0,0,0,0,0,0,0
409,jachtenfuchs,b'From Market Integration to Core State Powers...,0,0,0,0,0,0,0,0,0,0,0,0,0
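
Rows whose `domain` is missing end up with zeros in every one-hot column, as in the `jachtenfuchs` rows above. The following toy example, with made-up rows, sketches how such rows could be found; `dtype=int` is passed so that the columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "FileName": ["Hammerschmid", "Jankin", "jachtenfuchs"],
    "domain": ["public_management", "data_science", None],
})

# One column per observed label; a missing label gets no column of its own.
encoded = pd.get_dummies(toy, columns=["domain"], dtype=int)
print(encoded.columns.tolist())
# ['FileName', 'domain_data_science', 'domain_public_management']

# Rows that carry no label at all have a row sum of zero.
label_sum = encoded.filter(like="domain_").sum(axis=1)
unlabelled = encoded.loc[label_sum == 0, "FileName"].tolist()
print(unlabelled)  # ['jachtenfuchs']
```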


In [201]:
# Concatenate FileName and Content with the one-hot columns
train_df = pd.concat([train_df.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)
train_df

Unnamed: 0,FileName,Content,domain_climate_sustainability,domain_data_science,domain_education,domain_fiscal_governance,domain_international_law,domain_labour_policy,domain_migration_law,domain_non_profit,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_technology_governance
0,Hammerschmid,b'2007 EGPA_paper1109.doc\n\n\nSee discussions...,0,0,0,0,0,0,0,0,0,0,1,0,0
1,Hammerschmid,b'The Governance of Infrastructure \n\n \n\nEd...,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Hammerschmid,"b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...",0,0,0,0,0,0,0,0,0,0,1,0,0
3,Hammerschmid,"b""COCOPS Working Paper no. 1\n\n\nCoordinating...",0,0,0,0,0,0,0,0,0,0,1,0,0
4,Hammerschmid,b'Administrative tradition and management refo...,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,jachtenfuchs,b'Balancing sub-unit autonomy and collective p...,0,0,0,0,0,0,0,0,0,0,0,0,0
407,jachtenfuchs,b'From Market Integration to Core State Powers...,0,0,0,0,0,0,0,0,0,0,0,0,0
408,jachtenfuchs,"b'More integration, less federation: the Europ...",0,0,0,0,0,0,0,0,0,0,0,0,0
409,jachtenfuchs,b'From Market Integration to Core State Powers...,0,0,0,0,0,0,0,0,0,0,0,0,0


In [203]:
# Save the labelled dataframe as CSV (overwrites the unlabelled file loaded above)
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv')
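
When the file is reloaded later, it matters whether the row index was written. The sketch below uses an in-memory buffer and a made-up one-row frame to show the difference; whether to pass `index=False` in this notebook depends on what the downstream code expects.

```python
import io
import pandas as pd

df = pd.DataFrame({"FileName": ["Hammerschmid"], "domain_public_management": [1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)  # ['Unnamed: 0', 'FileName', 'domain_public_management']

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)  # ['FileName', 'domain_public_management']
```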