<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

As part of the workflow between GitHub and Google Colab, please follow these steps: 
1. Upload the data to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/.shortcut-targets-by-id/1ExS7M2OOkbYS5Z5O9pbPbaCpSa0rhGet/ThesisAllocationSystem


# Convert PDF to TXT

Convert all PDF files in the current working directory to TXT files.

In [None]:
!pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/96/07/244fbb9c74c0de8a3745cc9f3f496077a29f6418c7cbd90d68fd799574cb/tika-1.24.tar.gz
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... [?25l[?25hdone
  Created wheel for tika: filename=tika-1.24-cp37-none-any.whl size=32885 sha256=a9282d359a878bb4137fba227398437c84d485d7330e0446b64bb12788fa8b22
  Stored in directory: /root/.cache/pip/wheels/73/9c/f5/0b1b738442fc2a2862bef95b908b374f8e80215550fb2a8975
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [None]:
import os
from tika import parser 
import re

def read_pdf(pdf_file):

    text = parser.from_file(pdf_file)['content']
    non_bytes = text.encode().decode()
    no_space = non_bytes.strip()
    final = no_space.strip('\n')
    return final.encode("latin-1","ignore")

def pdf_to_txt(folder_with_pdf, dest_folder):
    pdf_files = []

    for root, dirs, files in os.walk(folder_with_pdf):
        for f in files:
            if '.pdf' in f:
                pdf_files.append(os.path.join(root, f))
    #print(pdf_files)

    for file_ in pdf_files:
        text_file = os.path.splitext(os.path.basename(file_))[0]+'.txt'
        with open(os.path.join(dest_folder,text_file), 'wb') as text_f:
            text_f.write(read_pdf(file_))

    return None

In [None]:
pdf_to_txt('./supervisors', './supervisors-txt') 

In [None]:
# Warning: This will run a couple minutes
pdf_to_txt('./train-papers', './train-papers-txt') 

In [None]:
pdf_to_txt('./test-theses', './test-theses-txt') 

In [None]:
pdf_to_txt('./test-proposals', './test-proposals-txt') 

# Put TXT files into CSV

After importing the packages, define the directory of interest and run the function below to create a CSV files that entails all TXT files in the following structure: 

Filename | Content 

In [3]:
from glob import glob
import pandas as pd

In [None]:
def txt_to_csv(input_dir, output_dir, new_filename): 
  
  files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
  data = [[i, open(i, 'rb').read()] for i in files]
  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['FileName'] = df['FileName'].str.replace('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/', '')
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Upper limit of strings per cell in csv
  df.to_csv(output_dir + '/' + new_filename + '.csv', index = False)
  if not df.empty: 
    print('Succesfully converted txt files in directory' + os.path.basename('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + ' to single csv file.'))
  else: 
    print('File empty.') 
  return None

In [None]:
txt_to_csv('train-papers-txt', 'data_final', 'train-papers-final')

Succesfully converted txt files in directorytrain-papers-txt to single csv file.


In [None]:
txt_to_csv('test-theses-txt', 'data_final', 'test-theses-final')

Succesfully converted txt files in directorytest-theses-txt to single csv file.


In [None]:
txt_to_csv('test-proposals-txt', 'data_final', 'test-proposals-final')

Succesfully converted txt files in directorytest-proposals-txt to single csv file.


In [None]:
txt_to_csv('supervisors-txt', 'data_final', 'supervisors-final')

Succesfully converted txt files in directorysupervisors-txt to single csv file.


## **Data Labelling**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. 

In [15]:
import numpy as np

# creating prof/research categorical label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication'                         
}

In [16]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Remove FileNames from txt ending
data["FileName"] = data["FileName"].str.replace(r'.txt$', '').str.replace(r'\d+', '').str.lower()

print(data.sample(10))

           FileName                                            Content
190      flachsland  b'Decarbonization and EU ETS Reform: Introduci...
211      hallerberg  b'The Role of  Fiscal Coordination and Partisa...
91           bryson  b"Citation for published version:\nWortham, RH...
262            graf  b'The Hybridization of Vocational Training and...
73           jankin  b'Deep_Learning_for_Political_Science\n\n\nDee...
381            mena  b"Mena & Rintamaki, 2019, Managing the past re...
611            mair  b'FRONT-STAGE AND BACKSTAGE CONVENING: THE TRA...
200      flachsland  b"Advocates or cartographers? Scientific advis...
561  mungiu-pippidi  b"Twenty Years of Postcommunism: The Other Tra...
105          bryson  b'See discussions, stats, and author profiles ...


In [17]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)

print(data)

         FileName  ...                domain
0    hammerschmid  ...     public_management
1    hammerschmid  ...     public_management
2    hammerschmid  ...     public_management
3    hammerschmid  ...     public_management
4    hammerschmid  ...     public_management
..            ...  ...                   ...
633          mair  ...  strategic_management
634          mair  ...  strategic_management
635          mair  ...  strategic_management
636          mair  ...  strategic_management
637          mair  ...  strategic_management

[638 rows x 3 columns]


In [20]:
# Create binary dummy one-hot encoder for each research domain label
dum_df = pd.get_dummies(data, columns=["domain"])
dum_df

type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8

In [21]:
# concate the two dataframes 
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)
data

Unnamed: 0,FileName,Content,domain_comparative_politics,domain_data_science,domain_democracy_studies,domain_development_economics,domain_development_studies,domain_education,domain_european_governance,domain_fiscal_governance,domain_international_law,domain_international_organizations,domain_international_security,domain_labour_policy,domain_migration_law,domain_non_profit,domain_organisational_management,domain_political_communication,domain_public_administration,domain_public_management,domain_social_policy,domain_strategic_management,domain_technology_governance
0,hammerschmid,b'2007 EGPA_paper1109.doc\n\n\nSee discussions...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,hammerschmid,b'The Governance of Infrastructure \n\n \n\nEd...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,hammerschmid,"b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,hammerschmid,"b""COCOPS Working Paper no. 1\n\n\nCoordinating...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,hammerschmid,b'Administrative tradition and management refo...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
633,mair,b'Microsoft Word - DI-0610-E.doc\n\n\n \n1 \n\...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
634,mair,b'Adapting for Innovation: Including Divestitu...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
635,mair,"b'Social Entrepreneurship\n\nJohanna Mair, Jef...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
636,mair,b'Going global: how middle managers approach t...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [22]:
data.drop(['FileName'], inplace=True, axis=1)

In [27]:
train_df = pd.DataFrame()
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

print(train_df.sample(10))

#type(train_df['labels'].iloc[1])

#label = train_df['labels'].iloc[1]

#type(label[1])

                                               content                                             labels
474  b'Does formal or informal power-sharing produc...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
523  b'swp0000.dvi\n\n\nInnovation When The Market ...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
611  b'FRONT-STAGE AND BACKSTAGE CONVENING: THE TRA...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
41   b"NimbusRomNo9L-Regu\n\n\nThis is an Open Acce...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
26   b'truth and consequence in web\ncampaigning: i...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
620  b'r Academy of Management Journal\n2019, Vol. ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
451  b'Voting in the Shadow of Violence: Electoral ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
306  b'Pathways of change in CMEs\n\n\njakob\nTextf...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
530  b"1 \n \n\n \n\nBureaucratic politics and

In [29]:
#save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)