<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

As part of the workflow between GitHub and Google Colab, please follow these steps: 
1. Upload the data to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem


# **Convert PDF to TXT**

Convert all PDF files in the current working directory to TXT files.

In [17]:
!pip install tika



In [18]:
import os
from tika import parser 
import re

def read_pdf(pdf_file):

    text = parser.from_file(pdf_file)['content']
    non_bytes = text.encode().decode()
    no_space = non_bytes.strip()
    final = no_space.strip('\n')
    return final.encode("latin-1","ignore")

def pdf_to_txt(folder_with_pdf, dest_folder):
    pdf_files = []

    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            if '.pdf' in f:
                pdf_files.append(os.path.join(root, f))
    #print(pdf_files)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0]+'.txt'
        with open(os.path.join(dest_folder,text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None

In [None]:
check('./train-papers', './train-papers-txt')

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [None]:
# Warning: This will run a couple minutes
pdf_to_txt('./train-papers', './train-papers-txt') 

In [None]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [20]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# **Put TXT files into CSV**

After importing the packages, define the directory of interest and run the function below to create a CSV files that entails all TXT files in the following structure: character values in columns `FileName` and `Content`.

In [7]:
from glob import glob
import pandas as pd

In [None]:
def txt_to_csv(input_dir, output_dir, new_filename): 
  
  files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
  data = [[i, open(i, 'rb').read()] for i in files]
  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['FileName'] = df['FileName'].str.replace('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/', '')
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Upper limit of strings per cell in csv
  df.to_csv(output_dir + '/' + new_filename + '.csv', index = False)
  if not df.empty: 
    print('Succesfully converted txt files in directory' + os.path.basename('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + ' to single csv file.'))
  else: 
    print('File empty.') 
  return None

In [None]:
# Warning: This will take a couple minutes
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

In [None]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

In [None]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

In [None]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [128]:
import numpy as np

# creating prof/research categorical label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication'                         
}

{'anheier': 'non_profit',
 'bryson': 'technology_governance',
 'cali': 'international_law',
 'cingolani': 'development_studies',
 'cis': 'international_security',
 'clachsland': 'climate_sustainability',
 'costello': 'migration_law',
 'graf': 'education',
 'hallerberg': 'fiscal_governance',
 'hammerschmid': 'public_management',
 'hassel': 'labour_policy',
 'hirth': 'energy_economics',
 'hustedt': 'public_administration',
 'iacovone': 'development_economics',
 'jachtenfuchs': 'european_governance',
 'jankin': 'data_science',
 'kayser': 'comparative_politics',
 'kreyenfeld': 'social_policy',
 'mair': 'strategic_management',
 'mena': 'organisational_management',
 'mungiu-pippidi': 'democracy_studies',
 'munzert': 'political_behaviour',
 'patz': 'international_organizations',
 'reh': 'european_politics',
 'roemmele': 'political_communication'}

In [117]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Remove FileNames from txt ending
data["FileName"] = data["FileName"].str.replace(r'.txt$', '').str.replace(r'\d+', '').str.lower()

print(data.sample(10))

       FileName                                            Content
239        graf  b'The governance of decentralised cooperation ...
498    iacovone  b'Testing the Core&#8208;competency Model of M...
269     anheier  b'Special Feature \n\nThe Civil Society Sector...
197  flachsland  b'From climate finance toward sustainable deve...
133        cali  b'ICON (2018), Vol. 16 No. 1, 214234 doi:10.10...
340     hustedt  b'See discussions, stats, and author profiles ...
310      hassel  b"The Problem of Political Exchange in Complex...
603        mair  b"Microsoft Word - DI-0888-E.doc\n\n\n \n\n \n...
178  flachsland  b'SUMMARY \n\n \n \n\n \n\n \n\n \n\n \n\n \n\...
93       bryson  b'Of, for, and by the people: the legal lacuna...


In [118]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)

print(data)

         FileName  ...                domain
0    hammerschmid  ...     public_management
1    hammerschmid  ...     public_management
2    hammerschmid  ...     public_management
3    hammerschmid  ...     public_management
4    hammerschmid  ...     public_management
..            ...  ...                   ...
631          mair  ...  strategic_management
632          mair  ...  strategic_management
633          mair  ...  strategic_management
634    hallerberg  ...     fiscal_governance
635      costello  ...         migration_law

[636 rows x 3 columns]


In [119]:
# Create binary dummy one-hot encoder for each research domain label
dum_df = pd.get_dummies(data, columns=["domain"])
dum_df

type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8

In [120]:
# concate the two dataframes 
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)

# Extract label
dat_label = data.drop_duplicates('FileName')

In [121]:
data.drop(['FileName'], inplace=True, axis=1)

In [122]:
train_df = pd.DataFrame(data)
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

print(train_df.sample(10))

# type(train_df['labels'].iloc[1])
# label = train_df['labels'].iloc[1]
# type(label[1])

                                               Content  ...                                             labels
232  b'Munich Personal RePEc Archive\n\nRule Bendin...  ...  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
57   b"Reading The Tea Leaves\n\n\n \n\nReading The...  ...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
113  b'Robio_final2_Joanna.pdf\n\n\nLearning Motion...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
406  b'No Job Name\n\n\nPerformance pressure: Patte...  ...  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
119  b"The role for simulations in theory construct...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
531  b'Resourcing International Organizations: Reso...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
524  b"Madagascar\n\n\n \n\n \n\n \n\nAn Analysis o...  ...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
472  b'Inequalities Between Ethnic Groups, Conflict...  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
4

In [None]:
dat_label.drop(['Content'], inplace = True, axis = 1)

In [None]:
# Same procedure for label data 
label_df = pd.DataFrame(dat_label)
label_df['labels'] = label_df.iloc[:, 1:].values.tolist()
label_df

# From df to dict
domain_label = label_df.set_index('FileName').T.to_dict('list')

In [None]:
# Save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)

# **Data Labelling: Test**



In this section, we assign the newly created labels to student thesis proposals, either referring to their first or second preference. The finished data set will serve as a validation/test dataset.

In [16]:
# Load test data
data_test = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-final.csv', encoding = 'latin1')

In [None]:
# creating prof/research categorical label
domain_dict2 = {'ThesisProposal1': ''





# Add labels column with supervisor 

1 - 
1xt Simon Munzert 
2nd:  Prof. Slava Jankin 
3rd:  Prof. Mark Kayser

2 -
1st: Cristian Traxler 
2nd: Dennis Snower 
3rd: Simon Munzert

4
- Prof. Mujaheed Shaikh  
- Prof. Michaela Kreyenfeld  
- Prof. Helmut K. Anheier

5
1st:  Prof. Simon Munzert 
2nd:  Prof. Slava Jankin 
3rd:  Prof. Mark Kayser 


In [None]:
# Save df