<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

To run this notebook from GitHub in Google Colab, follow these steps:
1. Upload the [data](https://drive.google.com/drive/folders/1ExS7M2OOkbYS5Z5O9pbPbaCpSa0rhGet?usp=sharing) to a folder in your Google Drive.
2. Mount your Google Drive.
3. Set the data folder as your working directory.
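
Before converting anything, it can help to confirm that the uploaded folder has the expected layout. The following is a minimal sketch, assuming the four input folder names used later in this notebook; the helper `missing_folders` is hypothetical and not part of the original workflow.

```python
from pathlib import Path

# Input folder names as used later in this notebook.
EXPECTED_FOLDERS = ["supervisors", "train-papers", "test-theses", "test-proposals"]

def missing_folders(base_dir, expected=EXPECTED_FOLDERS):
    """Return the expected subfolders that do not exist under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]

# Example: missing_folders('/content/drive/MyDrive/ThesisAllocationSystem')
```

An empty list means all input folders were found; any name returned still needs to be uploaded.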

In [133]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [134]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content/drive/My Drive/ThesisAllocationSystem
/content/drive/MyDrive/ThesisAllocationSystem


# **Convert PDF to TXT**

Convert all PDF files in the current working directory to TXT files.

In [135]:
!pip install tika



In [136]:
import os
from tika import parser 
import re

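def read_pdf(pdf_file):
    # Tika returns a dict; 'content' can be None when no text could be extracted
    text = parser.from_file(pdf_file)['content'] or ''
    # Trim leading and trailing whitespace, including newlines
    final = text.strip()
    # Encode as Latin-1 and silently drop characters outside that charset
    return final.encode("latin-1", "ignore")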

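def pdf_to_txt(folder_with_pdf, dest_folder):
    # Collect every PDF below folder_with_pdf (case-insensitive extension match)
    pdf_files = []
    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            if f.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, f))

    # Create the destination folder if it does not exist yet
    os.makedirs(dest_folder, exist_ok=True)

    # Write one TXT file per PDF, keeping the base name
    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0] + '.txt'
        with open(os.path.join(dest_folder, text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None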
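
Encoding with `"latin-1"` and the `"ignore"` error handler is lossy: any character outside Latin-1 is dropped without warning. The sketch below mimics the cleanup step on a plain string, without Tika; `clean_text` is a hypothetical stand-in for the text handling in `read_pdf`, and the `None` guard is an assumption about what the parser may return for PDFs without extractable text.

```python
def clean_text(text):
    """Trim whitespace, then encode as Latin-1, dropping unsupported characters."""
    if text is None:  # assumption: no extractable text in the PDF
        return b""
    return text.strip().encode("latin-1", "ignore")

# 'ö' exists in Latin-1 and survives; the check mark does not and is dropped.
sample = "  Römmele ✓ Governance\n"
cleaned = clean_text(sample)
print(cleaned)  # b'R\xf6mmele  Governance'
```

If the dropped characters matter for later text analysis, writing the files as UTF-8 would be an alternative, at the cost of changing how the CSV files are read back in.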

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [138]:
# Warning: This will take a couple of minutes
pdf_to_txt('./train-papers', './train-papers-txt') 

In [None]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [156]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# **Put TXT files into CSV**

After importing the packages, pass the input directory to the function below to create a single CSV file that contains all TXT files, one row per file, with the columns `FileName` and `Content`.
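
The target layout can be sketched with the standard library alone. This is an illustration of the two-column structure and the length cut-off, not the pandas-based function used in this notebook; the helper `rows_to_csv` and the example file names are hypothetical.

```python
import csv
import io

MAX_CELL_CHARS = 32767  # same cut-off the notebook applies to the Content column

def rows_to_csv(rows, max_chars=MAX_CELL_CHARS):
    """Write (file name, content) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["FileName", "Content"])
    for name, content in rows:
        # Truncate overly long documents so each cell stays within the limit
        writer.writerow([name, content[:max_chars]])
    return buffer.getvalue()

# Hypothetical example: one short document and one that exceeds the limit
example = rows_to_csv([("bryson1.txt", "Learning a strategy"),
                       ("mair2.txt", "x" * 40000)])
```

The second row ends up with exactly 32,767 characters of content, while the first is written unchanged.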

In [139]:
from glob import glob
import pandas as pd

In [140]:
def txt_to_csv(input_dir, output_dir, new_filename):
  base_dir = '/content/drive/MyDrive/ThesisAllocationSystem'
  files = glob(os.path.join(base_dir, input_dir, '*.txt'))

  # Read each file as raw bytes; the context manager closes the file handle again
  data = []
  for path in files:
    with open(path, 'rb') as f:
      data.append([os.path.basename(path), f.read()])

  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Excel's per-cell character limit; CSV itself has none
  df.to_csv(os.path.join(output_dir, new_filename + '.csv'), index = False)
  if not df.empty:
    print('Successfully converted txt files in directory ' + input_dir + ' to a single csv file.')
  else:
    print('File empty.')
  return None

In [141]:
# Warning: This will take a couple of minutes
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

Successfully converted txt files in directory train-papers-txt to a single csv file.


In [142]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

Successfully converted txt files in directory test-theses-txt to a single csv file.


In [157]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

Successfully converted txt files in directory test-proposals-txt to a single csv file.


In [144]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

Successfully converted txt files in directory supervisors-txt to a single csv file.


# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [145]:
import numpy as np

# Map each professor (lower-case surname, as used in the file names) to a research domain label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication'                         
}

In [146]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Strip the .txt extension and any digits from FileName, then lower-case it
data["FileName"] = data["FileName"].str.replace(r'\.txt$', '', regex=True).str.replace(r'\d+', '', regex=True).str.lower()

print(data.sample(10))

      FileName                                            Content
117     bryson  b'Learning a Real Time Grasping Strategy\n\nBi...
620       mair  b'Persistent Category Ambiguity: The case of s...
585  cingolani  b'#2013-052 \n\xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \x...
623       mair  b'No Job Name\n\n\nSOURCE AND PATTERNS OF ORGA...
132       cali  b'MergedFile\n\n\n \n\n \n\nArticles\t\r \xa0\...
630       mair  b'361-Emerald_RSO-V057-3611629_CH005 113..135\...
613       mair  b'Entrepreneurship as a Platform for Pursuing ...
675      hirth  b'Notes\n\n\nIntegration Costs Revisited   \nA...
334    hustedt  b'Microsoft Word - 000_dms1_2016_inhalt\n\n\nS...
507   iacovone  b"Microsoft Word - Robinson_paper.doc\n\n\nSee...


In [147]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)
print(data)

         FileName  ...             domain
0    hammerschmid  ...  public_management
1    hammerschmid  ...  public_management
2    hammerschmid  ...  public_management
3    hammerschmid  ...  public_management
4    hammerschmid  ...  public_management
..            ...  ...                ...
702           reh  ...  european_politics
703           reh  ...  european_politics
704           reh  ...  european_politics
705           reh  ...  european_politics
706           reh  ...  european_politics

[707 rows x 3 columns]
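
The file-name normalisation and dictionary lookup can also be sketched in plain Python, which makes it easy to list files whose name matches no dictionary key (these would otherwise end up with a missing `domain`). This is a minimal sketch using a two-entry excerpt of `domain_dict`; the helpers `professor_key` and `label_files` and the example file names are hypothetical.

```python
import re

# Two-entry excerpt of the notebook's domain_dict, for illustration only.
domain_excerpt = {"bryson": "technology_governance", "mair": "strategic_management"}

def professor_key(file_name):
    """Strip a trailing '.txt', remove digits, and lower-case the rest."""
    stem = re.sub(r"\.txt$", "", file_name)
    return re.sub(r"\d+", "", stem).lower()

def label_files(file_names, mapping):
    """Return (labelled, unmatched): labels per file, and files without a matching key."""
    labelled, unmatched = {}, []
    for name in file_names:
        key = professor_key(name)
        if key in mapping:
            labelled[name] = mapping[key]
        else:
            unmatched.append(name)
    return labelled, unmatched

# Hypothetical file names; 'unknown7.txt' has no entry in the excerpt
labelled, unmatched = label_files(["Bryson12.txt", "mair3.txt", "unknown7.txt"],
                                  domain_excerpt)
```

A non-empty `unmatched` list usually points to a spelling difference between a file name and a dictionary key.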


In [148]:
# One-hot encode the research domain: one binary column per label
dum_df = pd.get_dummies(data, columns=["domain"])
type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8
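
To make the encoding concrete, here is a minimal pure-Python sketch of one-hot encoding, independent of pandas. The function `one_hot` is a hypothetical helper; it orders the categories alphabetically, which to our knowledge is also how `pd.get_dummies` orders its output columns.

```python
def one_hot(labels):
    """Return (categories, vectors): sorted unique labels and one 0/1 list per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[index[label]] = 1  # exactly one position is set per label
        vectors.append(vector)
    return categories, vectors

categories, vectors = one_hot(["education", "data_science", "education"])
# categories == ['data_science', 'education']
# vectors == [[0, 1], [1, 0], [0, 1]]
```

Each vector has the same length as the number of distinct labels, so the position of the single `1` identifies the research domain.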

In [149]:
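# Concatenate FileName and Content with the one-hot columns
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)

# Keep one row per professor for the label lookup (copy to avoid modifying a slice of data)
dat_label = data.drop_duplicates('FileName').copy()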

In [150]:
data.drop(['FileName'], inplace=True, axis=1)

In [151]:
train_df = pd.DataFrame(data)
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

print(train_df.sample(10))

# type(train_df['labels'].iloc[1])
# label = train_df['labels'].iloc[1]
# type(label[1])

                                               Content  ...                                             labels
152  b'\xa9 The Author(s) (2019). Published by Oxfo...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
24   b'Svenja\xa0Falk\xa0\xb7 Andrea\xa0R\xf6mmele\...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
231  b"Policy Contribution\n\n\nISSUE 2015/20 \nDEC...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
120  b"The role for simulations in theory construct...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
339  b'1\xa0\n\n \n\nPaper\xa0for\xa0the\xa0PSA\xa0...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
480  b'Decomposing firm-level productivity growth a...  ...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
244  b'VETNET ECER PROCEEDINGS 2018 \n\n \n \nBathm...  ...  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
146  b'205\n\n23\nSimulation and the Evolution of \...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

In [154]:
# Save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)

# **Extract labelled data**

In [None]:
dat_label = dat_label.drop(columns = ['Content'])

In [None]:
# Same procedure for label data 
label_df = pd.DataFrame(dat_label)
label_df['labels'] = label_df.iloc[:, 1:].values.tolist()
label_df

# From df to dict
domain_label = label_df.set_index('FileName').T.to_dict('list')

# **Data Labelling: Test**



In this section, we assign the newly created labels to the student thesis proposals, based on each student's first or second supervisor preference. The resulting dataset serves as the validation/test set.

In [162]:
# Load test data
data_test = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-final.csv', encoding = 'latin1')

In [170]:
# Map each thesis proposal to the preferred supervisor (lower-case surname)
domain_dict2 = {'thesisproposal1': 'munzert',
                'thesisproposal2': 'traxler',
                'thesisproposal3': 'bryson',
                'thesisproposal4': 'shaikh',
                'thesisproposal5': 'munzert',
                'thesisproposal6': 'iacovone'
}

In [178]:
# Strip the .txt extension, then map proposal -> supervisor -> research domain
data_test["FileName"] = data_test["FileName"].str.replace(r'\.txt$', '', regex=True).str.lower()
data_test["domain"] = data_test["FileName"].map(domain_dict2)
data_test["labels"] = data_test["domain"].map(domain_dict)

Unnamed: 0,FileName,Content,domain,labels
0,thesisproposal2,b'Anabel Berj\xf3n S\xe1nchez \n\n \n\nPROPOSA...,traxler,
1,thesisproposal5,b'Master_Thesis_Proposal\n\n\nMaster Thesis Pr...,munzert,political_behaviour
2,thesisproposal6,"b""New Thesis Proposal Form \n\nAY 2019-2020 \n...",iacovone,development_economics
3,thesisproposal1,b'Master_Thesis_Proposal\n\n\nMaster Thesis Pr...,munzert,political_behaviour
4,thesisproposal3,b'Thesis Proposal \n\nCitizen Perceptions and ...,bryson,technology_governance
5,thesisproposal4,b'New Thesis Proposal Form \n\nAY 2020-2021 \n...,shaikh,


In [None]:
# Save df
data_test.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-label.csv', index = False)
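
The test labels come from a two-step lookup: proposal to supervisor, then supervisor to research domain. A supervisor who has no entry in the domain dictionary leaves the label empty. The sketch below shows this with short excerpts of the two dictionaries; `chain_lookup` is a hypothetical helper, not part of the notebook.

```python
def chain_lookup(proposal_to_supervisor, supervisor_to_domain):
    """Map each proposal to a domain via its supervisor; None marks a missing link."""
    result = {}
    for proposal, supervisor in proposal_to_supervisor.items():
        result[proposal] = supervisor_to_domain.get(supervisor)
    return result

# Excerpts of the notebook's dictionaries, for illustration only.
proposals = {"thesisproposal1": "munzert",
             "thesisproposal2": "traxler",
             "thesisproposal3": "bryson"}
domains = {"munzert": "political_behaviour", "bryson": "technology_governance"}

labels = chain_lookup(proposals, domains)
# Proposals whose supervisor has no domain entry
unlabelled = sorted(name for name, domain in labels.items() if domain is None)
```

Checking `unlabelled` before saving shows which proposals would enter the test set without a label.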