<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

To run this notebook from GitHub in Google Colab, follow these steps:
1. Upload the data to a folder in your Google Drive.
2. Mount your Google Drive.
3. Set the data folder as your working directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem


# Convert PDF to TXT

Convert all PDF files in the current working directory to TXT files.
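
The file discovery step can be sketched on its own with `pathlib`. The helper name `find_pdfs` is hypothetical and not part of this notebook; it matches on the file extension (case-insensitively) rather than on a substring of the file name.

```python
from pathlib import Path

def find_pdfs(folder):
    """Return all PDF files below `folder`, sorted for a reproducible order."""
    return sorted(
        p for p in Path(folder).rglob('*')
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
```

Calling `find_pdfs('./train-papers')` would list the papers before any conversion starts, which makes it easy to check the count first.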

In [3]:
!pip install tika



In [4]:
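import os
from tika import parser
import re

def read_pdf(pdf_file):
    """Extract the text of a PDF with Tika and return it as Latin-1 bytes."""
    # 'content' can be None when Tika extracts no text (e.g. scanned PDFs)
    text = parser.from_file(pdf_file)['content'] or ''
    # Remove leading and trailing whitespace, including newlines
    final = text.strip()
    # Characters that cannot be represented in Latin-1 are dropped
    return final.encode("latin-1", "ignore")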

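def pdf_to_txt(folder_with_pdf, dest_folder):
    """Convert every PDF under folder_with_pdf to a TXT file in dest_folder."""
    pdf_files = []

    # Collect PDF paths recursively
    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            # Match on the extension only, not on '.pdf' anywhere in the name
            if f.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, f))

    # Create the destination folder if it does not exist yet
    os.makedirs(dest_folder, exist_ok=True)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0] + '.txt'
        with open(os.path.join(dest_folder, text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None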

In [32]:
# This is a final version
# pdf_to_txt('./supervisors', './supervisors-txt') 

In [5]:
# Warning: this function has no progress bar yet and runs for a couple of minutes.
pdf_to_txt('./train-papers', './train-papers-txt') 

In [6]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [7]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# Put TXT files into CSV

After importing the packages, run the function below on the directory of interest to create a CSV file that contains all of its TXT files, one row per file, in the following structure:

Filename | Content 
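
A quick way to confirm that a generated file has this structure is to read it back with the standard `csv` module. This is only a sketch: `check_csv` is a hypothetical helper, the Latin-1 encoding is assumed to match the one used for writing, and the header spelling `FileName` follows the code in this notebook.

```python
import csv

# Full-text fields can exceed the csv module's default field size limit
csv.field_size_limit(10**9)

def check_csv(csv_path, encoding='Latin-1'):
    """Return the number of data rows after verifying the two-column header."""
    with open(csv_path, newline='', encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)
        if header != ['FileName', 'Content']:
            raise ValueError(f'unexpected header: {header}')
        return sum(1 for _ in reader)
```

The returned row count should equal the number of TXT files in the folder.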

In [8]:
import csv
from pathlib import Path

In [9]:
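def txt_to_csv(x):
    """Write all TXT files in folder x to x/x.csv with columns FileName and Content."""
    # Note: this changes the working directory for the rest of the session
    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)

    # newline='' lets the csv module control line endings
    with open(x + '.csv', 'w', encoding='Latin-1', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            lines = []
            with open(fileName, 'rb') as one_text:
                for line in one_text:
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            # One row per file: all lines joined into a single string
            csv_out.writerow([str(fileName), ' '.join(lines)])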

In [10]:
txt_to_csv('train-papers-txt')

In [None]:
txt_to_csv('test-theses-txt')

In [None]:
txt_to_csv('test-proposals-txt')

In [35]:
# This is a final version
# txt_to_csv('supervisors-txt')

# Data Labelling

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. 
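
A dictionary lookup is case-sensitive, so a file name that is capitalized differently from its key finds no label. The sketch below shows one possible way to make the lookup independent of capitalization. The names `labels`, `labels_ci` and `label_for` are hypothetical, and the file name pattern (professor name, running number, `.txt`) is an assumption.

```python
import re

# Two entries copied from the notebook's dictionary, for illustration only
labels = {'Hammerschmid': 'public_management',
          'Jachtenfuchs': 'european_governance'}

# Lower-cased keys make the lookup independent of capitalization
labels_ci = {name.lower(): label for name, label in labels.items()}

def label_for(file_name):
    """Return the label for a file name such as 'Hammerschmid12.txt', or None."""
    name = re.sub(r'\.txt$', '', file_name)   # drop the extension
    name = re.sub(r'\d+', '', name)           # drop the running number
    return labels_ci.get(name.lower())
```

With this helper, `label_for('jachtenfuchs3.txt')` finds the entry stored under `'Jachtenfuchs'`, whereas a plain `labels.get('jachtenfuchs')` returns `None`.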

In [12]:
import pandas as pd
import numpy as np

# Map each professor to a categorical label for their research area
domain_dict = {'Anheier': 'non_profit',
              'Bryson': 'technology_governance',
              'CIS': 'international_security',
              'Cali': 'international_law',
              'Cingolani': 'development_studies',              
              'Costello': 'migration_law',
              'Flachsland': 'climate_sustainability',
              'Graf': 'education',
              'Hallerberg': 'fiscal_governance',
              'Hammerschmid': 'public_management',
              'Hassel': 'labour_policy',
              'Hirth': 'energy_economics',
              'Hustedt': 'public_administration',
              'Iacovone': 'development_economics',
              'Jachtenfuchs': 'european_governance',
              'Jankin': 'data_science',
              'Kayser': 'comparative_politics',
              'Kreyenfeld': 'social_policy',
              'Mair': 'strategic_management',
              'Mena': 'organisational_management',              
              'Mungiu-Pippidi': 'democracy_studies',
              'Munzert': 'political_behaviour',
              'Patz': 'international_organizations',
              'Reh': 'european_politics',
              'Roemmele': 'political_communication'                         
}


In [13]:
# Load train data
train_df = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/train-papers-txt/train-papers-txt.csv', encoding='latin1')

# Strip the .txt extension and any digits so that only the professor's name remains
train_df["FileName"] = train_df["FileName"].str.replace(r'\.txt$', '', regex=True).str.replace(r'\d+', '', regex=True)

print(train_df)

         FileName                                            Content
0    Hammerschmid  The Governance of Infrastructure    Edited by ...
1    Hammerschmid  PUBLIC PRIVATE PARTNERSHIP BETWEEN EUPHORIA AN...
2    Hammerschmid  More delegation, more political control? Polit...
3    Hammerschmid  1   Curry, D., Hammerschmid, G., Jilke, S., Va...
4    Hammerschmid  2007 EGPA_paper1109.doc   See discussions, sta...
..            ...                                                ...
406  jachtenfuchs  Balancing sub-unit autonomy and collective pro...
407  jachtenfuchs  From Market Integration to Core State Powers: ...
408  jachtenfuchs  More integration, less federation: the Europea...
409  jachtenfuchs  From Market Integration to Core State Powers: ...
410  jachtenfuchs  Deepening and widening integration theory   De...

[411 rows x 2 columns]


In [17]:
# Add a domain column by looking up each FileName in domain_dict (names without a matching key become NaN)
train_df["domain"] = train_df["FileName"].map(domain_dict)

print(train_df)

         FileName  ...             domain
0    Hammerschmid  ...  public_management
1    Hammerschmid  ...  public_management
2    Hammerschmid  ...  public_management
3    Hammerschmid  ...  public_management
4    Hammerschmid  ...  public_management
..            ...  ...                ...
406  jachtenfuchs  ...                NaN
407  jachtenfuchs  ...                NaN
408  jachtenfuchs  ...                NaN
409  jachtenfuchs  ...                NaN
410  jachtenfuchs  ...                NaN

[411 rows x 3 columns]


In [18]:
# One-hot encode the domain column: one binary dummy column per research domain label
dum_df = pd.get_dummies(train_df, columns=["domain"])
dum_df

Unnamed: 0,FileName,Content,domain_climate_sustainability,domain_data_science,domain_education,domain_fiscal_governance,domain_international_law,domain_labour_policy,domain_migration_law,domain_non_profit,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_technology_governance
0,Hammerschmid,The Governance of Infrastructure Edited by ...,0,0,0,0,0,0,0,0,0,0,1,0,0
1,Hammerschmid,PUBLIC PRIVATE PARTNERSHIP BETWEEN EUPHORIA AN...,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Hammerschmid,"More delegation, more political control? Polit...",0,0,0,0,0,0,0,0,0,0,1,0,0
3,Hammerschmid,"1 Curry, D., Hammerschmid, G., Jilke, S., Va...",0,0,0,0,0,0,0,0,0,0,1,0,0
4,Hammerschmid,"2007 EGPA_paper1109.doc See discussions, sta...",0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,jachtenfuchs,Balancing sub-unit autonomy and collective pro...,0,0,0,0,0,0,0,0,0,0,0,0,0
407,jachtenfuchs,From Market Integration to Core State Powers: ...,0,0,0,0,0,0,0,0,0,0,0,0,0
408,jachtenfuchs,"More integration, less federation: the Europea...",0,0,0,0,0,0,0,0,0,0,0,0,0
409,jachtenfuchs,From Market Integration to Core State Powers: ...,0,0,0,0,0,0,0,0,0,0,0,0,0
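
The all-zero rows above correspond to rows whose `domain` is NaN: `pd.get_dummies` creates no indicator column for missing values unless `dummy_na=True` is passed. The sketch below uses made-up rows to show this. Passing `dtype=int` is an assumption made here to get 0/1 columns, since recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last one has no label
toy = pd.DataFrame({'FileName': ['A', 'B', 'c'],
                    'domain': ['education', 'data_science', np.nan]})

# One indicator column per observed label; NaN rows get zeros everywhere
dummies = pd.get_dummies(toy, columns=['domain'], dtype=int)

# Rows without any label can be listed before training
unlabeled = dummies[dummies.filter(like='domain_').sum(axis=1) == 0]
```

Inspecting `unlabeled['FileName'].unique()` on the real data would show which file names did not match a dictionary key.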


In [19]:
# Concatenate the FileName and Content columns with the dummy columns
train_df = pd.concat([train_df.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)
train_df

Unnamed: 0,FileName,Content,domain_climate_sustainability,domain_data_science,domain_education,domain_fiscal_governance,domain_international_law,domain_labour_policy,domain_migration_law,domain_non_profit,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_technology_governance
0,Hammerschmid,The Governance of Infrastructure Edited by ...,0,0,0,0,0,0,0,0,0,0,1,0,0
1,Hammerschmid,PUBLIC PRIVATE PARTNERSHIP BETWEEN EUPHORIA AN...,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Hammerschmid,"More delegation, more political control? Polit...",0,0,0,0,0,0,0,0,0,0,1,0,0
3,Hammerschmid,"1 Curry, D., Hammerschmid, G., Jilke, S., Va...",0,0,0,0,0,0,0,0,0,0,1,0,0
4,Hammerschmid,"2007 EGPA_paper1109.doc See discussions, sta...",0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,jachtenfuchs,Balancing sub-unit autonomy and collective pro...,0,0,0,0,0,0,0,0,0,0,0,0,0
407,jachtenfuchs,From Market Integration to Core State Powers: ...,0,0,0,0,0,0,0,0,0,0,0,0,0
408,jachtenfuchs,"More integration, less federation: the Europea...",0,0,0,0,0,0,0,0,0,0,0,0,0
409,jachtenfuchs,From Market Integration to Core State Powers: ...,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
# Save the labeled dataframe as CSV
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/train-papers-txt/train-papers-labeled.csv')