<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

As part of the workflow between GitHub and Google Colab, please follow these steps: 
1. Upload the [data](https://drive.google.com/drive/folders/1ExS7M2OOkbYS5Z5O9pbPbaCpSa0rhGet?usp=sharing) to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [78]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [79]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content/drive/My Drive/ThesisAllocationSystem
/content/drive/MyDrive/ThesisAllocationSystem


# **Convert PDF to TXT**

Convert all PDF files in the current working directory to TXT files.

In [80]:
!pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/96/07/244fbb9c74c0de8a3745cc9f3f496077a29f6418c7cbd90d68fd799574cb/tika-1.24.tar.gz
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... [?25l[?25hdone
  Created wheel for tika: filename=tika-1.24-cp37-none-any.whl size=32885 sha256=2d87280173725727507f9b2613fca55f2c992896dd4545a12145c8bed2da8442
  Stored in directory: /root/.cache/pip/wheels/73/9c/f5/0b1b738442fc2a2862bef95b908b374f8e80215550fb2a8975
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [81]:
import os
from tika import parser 
import re

def read_pdf(pdf_file):

    text = parser.from_file(pdf_file)['content']
    non_bytes = text.encode().decode()
    no_space = non_bytes.strip()
    final = no_space.strip('\n')
    return final.encode("latin-1","ignore")

def pdf_to_txt(folder_with_pdf, dest_folder):
    pdf_files = []

    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            if '.pdf' in f:
                pdf_files.append(os.path.join(root, f))
    #print(pdf_files)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0]+'.txt'
        with open(os.path.join(dest_folder,text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [None]:
# Warning: This will run a couple minutes
pdf_to_txt('./train-papers', './train-papers-txt') 

In [None]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [None]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# **Put TXT files into CSV**

After importing the packages, define the directory of interest and run the function below to create a CSV files that entails all TXT files in the following structure: character values in columns `FileName` and `Content`.

In [82]:
from glob import glob
import pandas as pd

In [None]:
def txt_to_csv(input_dir, output_dir, new_filename): 
  
  files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
  data = [[i, open(i, 'rb').read()] for i in files]
  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['FileName'] = df['FileName'].str.replace('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/', '')
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Upper limit of strings per cell in csv
  df.to_csv(output_dir + '/' + new_filename + '.csv', index = False)
  if not df.empty: 
    print('Succesfully converted txt files in directory ' + os.path.basename('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + ' to single csv file.'))
  else: 
    print('File empty.') 
  return None

In [None]:
# Warning: This will take a couple minutes
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

Succesfully converted txt files in directory train-papers-txt to single csv file.


In [None]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

Succesfully converted txt files in directory test-theses-txt to single csv file.


In [None]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

Succesfully converted txt files in directory test-proposals-txt to single csv file.


In [None]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

Succesfully converted txt files in directory supervisors-txt to single csv file.


# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [83]:
import numpy as np

# creating prof/research categorical label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication',
               'shaikh': 'health_economics',
               'snower': 'macroeconomics',
               'stockmann': 'digital_governance',
               'traxler': 'taxation',
               'wegrich': 'policy_process'

}

In [148]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Remove FileNames from txt ending
data["FileName"] = data["FileName"].str.replace('.txt$', '').str.replace('\d+', '').str.lower().str.replace('\W+', '')

print(data.sample(10))

          FileName                                            Content
445            cis  b'Information, Connectivity, and Strategic Sta...
557  mungiupippidi  b"Twenty Years of Postcommunism: The Other Tra...
73          jankin  b"Improving Public Services by Mining Citizen ...
662          hirth  b'Reforming the electric power industry in dev...
526           patz  b"1 \n \n\n \n\nBureaucratic politics and the ...
176     flachsland  b'gle00549 128..156\n\n\nPolitical Economy Det...
97          bryson  b'Standardizing Ethical Design for Artificial ...
765        traxler  b'Discussion Papers of the\nMax Planck Institu...
264        anheier  b'Managing non-profit organisations\n\n\nManag...
206     hallerberg  b'City, University of London Institutional Rep...


In [149]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)
print(data)

         FileName  ...             domain
0    hammerschmid  ...  public_management
1    hammerschmid  ...  public_management
2    hammerschmid  ...  public_management
3    hammerschmid  ...  public_management
4    hammerschmid  ...  public_management
..            ...  ...                ...
806       wegrich  ...     policy_process
807       wegrich  ...     policy_process
808       wegrich  ...     policy_process
809       wegrich  ...     policy_process
810       wegrich  ...     policy_process

[811 rows x 3 columns]


In [150]:
# Create binary dummy one-hot encoder for each research domain label
dum_df = pd.get_dummies(data, columns=["domain"])
type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8

In [151]:
# concate the two dataframes 
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)

# Extract label
dat_label = data.drop_duplicates('FileName')

In [88]:
data.drop(['FileName'], inplace=True, axis=1)

In [125]:
train_df = pd.DataFrame()
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

# Check type and content
print(train_df.sample(10))
print(type(train_df['labels'].iloc[1]))
print(type(train_df['labels'].iloc[1][1]))
print(train_df.shape)

                                               content                                             labels
11   b'Written by Nick Thijs \nGerhard Hammerschmid...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
442  b'Understanding journalist killings\n\nSabine ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
64   b'Transfer Topic Labeling with Domain-Specific...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
192  b'Energy transition on the rise: discourses on...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
806  b'S0017257X20000160jra 1..21\n\n\nARTICLE\n\nT...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
246  b'Work-based higher education programmes in Ge...  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
743  b"E-Journal Article\n\n\neconstor\nMake Your P...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
520  b"policy-brief_crossovers-VOK_ENG\n\n\nWomen E...  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
608  b'DI-0521-E\n\n\nWorking Paper\n\n* Profe

In [126]:
# Save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)

# **Extract labelled data**

In [None]:
dat_label.drop(['Content'], inplace = True, axis = 1)

In [None]:
dat_label['labels'] = dat_label.iloc[:, 1:].values.tolist()

In [161]:
label_df = pd.DataFrame()
label_df['FileName'] = dat_label['FileName']
label_df['labels'] = dat_label['labels']

# Check type
print(type(label_df))
print(type(label_df['labels'].iloc[1]))
print(type(label_df['labels'].iloc[1][1]))

<class 'int'>


# **Data Labelling: Test**



In this section, we assign the newly created labels to student thesis proposals, either referring to their first or second preference. The finished data set will serve as a validation/test dataset.

In [185]:
# Load test data
data_test = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-final.csv', encoding = 'latin1')

In [186]:
# creating prof/research categorical label
domain_dict2 = {'thesisproposal1': 'munzert',
                'thesisproposal2': 'traxler',
                'thesisproposal3': 'bryson',
                'thesisproposal4': 'shaikh',
                'thesisproposal5': 'munzert',
                'thesisproposal6': 'iacovone'
}

In [192]:
# Clean file names
data_test["FileName"] = data_test["FileName"].str.replace(r'.txt$', '').str.lower()

# Add new column: domain
data_test["FileName"] = data_test["FileName"].map(domain_dict2)

# Merge with data label
test_df = pd.merge(data_test, label_df, on='FileName')
test_df['content'] = data_test['Content']

# Remove non-necessary col
test_df.drop(['FileName', 'Content'], inplace = True, axis = 1)

# Swap content and labels
cols = list(test_df.columns)
a, b = cols.index('labels'), cols.index('content')
cols[b], cols[a] = cols[a], cols[b]
test_df = test_df[cols]

# Check type and content
test_df.shape
print(type(test_df))
print(type(test_df['labels'].iloc[1]))
print(type(test_df['labels'].iloc[1][1]))

Unnamed: 0,content,labels
0,b'Anabel Berj\xf3n S\xe1nchez \n\n \n\nPROPOSA...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,b'Master_Thesis_Proposal\n\n\nMaster Thesis Pr...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,"b""New Thesis Proposal Form \n\nAY 2019-2020 \n...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,b'Master_Thesis_Proposal\n\n\nMaster Thesis Pr...,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,b'Thesis Proposal \n\nCitizen Perceptions and ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
5,b'New Thesis Proposal Form \n\nAY 2020-2021 \n...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ..."


In [195]:
# Save df
test_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-label.csv', index = False)