<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/label.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

As part of the workflow between GitHub and Google Colab, please follow these steps: 
1. Upload the [data](https://drive.google.com/drive/folders/1ExS7M2OOkbYS5Z5O9pbPbaCpSa0rhGet?usp=sharing) to a folder in your GDrive. 
2. Mount your GDrive.
3. Set the data folder as your present working directory. 

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem


# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [209]:
import numpy as np
import pandas as pd

# creating prof/research categorical label
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication',
               'shaikh': 'health_economics',
               'snower': 'macroeconomics',
               'stockmann': 'digital_governance',
               'traxler': 'taxation',
               'wegrich': 'policy_process'

}

In [210]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

# Remove FileNames from txt ending
data["FileName"] = data["FileName"].str.replace('.txt$', '').str.replace('\d+', '').str.lower().str.replace('\W+','')

print(data.sample(10))

       FileName                                            Content
112      bryson  b'AFRL-AFOSR-UK-TR-2012-0023 \n \n \n \n \n \n...
145      bryson  b'205\n\n23\nSimulation and the Evolution of \...
601        mair  b'9780521518550c04 92..119\n\n\nSee discussion...
586   cingolani  b'Contents.indd\n\n\n \n\n \nBreaking the Cycl...
643     munzert  b'Measuring the Importance of Political Elites...
226  hallerberg  b'See discussions, stats, and author profiles ...
387        mena  b"Mansell book review OS accepted version\n\n\...
174  flachsland  b'Is the Paris Agreement effective? A systemat...
791     wegrich  b'Urban governance innovations in Rio de Janei...
180  flachsland  b"OPTIONS FOR A CARBON PRICING REFORM - EXECUT...


In [217]:
# Create & save new shape
data_new = pd.DataFrame()
data_new = data[['Content','FileName']]
data_new = data_new.rename(columns = {'FileName': 'label',
                           'Content':'text'})
data_new.sample(10)

data_new.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data2.csv', index = False)


In [180]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)
print(data)

         FileName  ...             domain
0    hammerschmid  ...  public_management
1    hammerschmid  ...  public_management
2    hammerschmid  ...  public_management
3    hammerschmid  ...  public_management
4    hammerschmid  ...  public_management
..            ...  ...                ...
806       wegrich  ...     policy_process
807       wegrich  ...     policy_process
808       wegrich  ...     policy_process
809       wegrich  ...     policy_process
810       wegrich  ...     policy_process

[811 rows x 3 columns]


In [181]:
# Create binary dummy one-hot encoder for each research domain label
dum_df = pd.get_dummies(data, columns=["domain"])
type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8

In [182]:
# concate the two dataframes 
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)

# Save for extraction of labels
data.drop_duplicates(subset = ['FileName']).to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', index = False)

In [183]:
data.drop(['FileName'], inplace=True, axis=1)

In [184]:
train_df = pd.DataFrame()
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

# Check type and content
print(train_df.sample(10))
print(type(train_df['labels'].iloc[1]))
print(type(train_df['labels'].iloc[1][1]))
print(train_df.shape)
print(len(train_df['labels'].iloc[1]))

                                               content                                             labels
252  b'17Wissenschaft Januar 2015\n\nFor researcher...  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
365  b'Anticipatory analysis and its alternatives i...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
703  b'untitled\n\n\nEU legitimacy revisited: the n...  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
504  b'causes1.qxp\n\n\nSee discussions, stats, and...  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
586  b'Contents.indd\n\n\n \n\n \nBreaking the Cycl...  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
195  b'1 \n \n\n \n\n \n \n\n \n\nClimate Policy an...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
188  b'Decarbonization and EU ETS Reform: Introduci...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
575  b'Microsoft Word - Polsci 2, 2007.doc\n\n\n   ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
587  b'0003958859 154..185\n\n\nSee discussion

In [185]:
# Save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)

# **Extract labelled data**

In [198]:
dat_label = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', encoding = 'latin1')

In [199]:
dat_label.drop(['Content'], inplace = True, axis = 1)

In [200]:
dat_label['labels'] = dat_label.iloc[:, 1:].values.tolist()

In [201]:
label_df = pd.DataFrame()
label_df['FileName'] = dat_label['FileName']
label_df['labels'] = dat_label['labels']

# Check type
print(type(label_df))
print(type(label_df['labels'].iloc[1]))
print(type(label_df['labels'].iloc[1][1]))
print(len(label_df['labels'].iloc[1]))

<class 'pandas.core.frame.DataFrame'>
<class 'list'>
<class 'int'>
28


In [202]:
label_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', index = False)

# **Data Labelling: Test**



In this section, we assign the newly created labels to student thesis proposals, either referring to their first or second preference. The finished data set will serve as a validation/test dataset.

In [203]:
# Load test data
data_test = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-final.csv', encoding = 'latin1')

In [204]:
# Requirements for function: dict & label
domain_dict2 = {'thesisproposal1': 'munzert',
                'thesisproposal2': 'traxler',
                'thesisproposal3': 'bryson',
                'thesisproposal4': 'shaikh',
                'thesisproposal5': 'munzert',
                'thesisproposal6': 'iacovone'
}

label_df = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', converters={'labels': eval})

In [205]:
# Required class list & int
print(type(label_df['labels'].iloc[1]))
print(type(label_df['labels'].iloc[1][1]))

<class 'list'>
<class 'int'>


In [206]:
def append_labels(data, dict, label):
    data["FileName"] = data["FileName"].str.replace(r'.txt$', '').str.lower().str.replace('\W+', '')
    data["FileName"] = data["FileName"].map(dict)
    df_new = pd.merge(data, label, on = 'FileName')
    df_new_new = df_new.rename(columns = {'Content': 'content'})
    df_new_new.drop(['FileName'], inplace = True, axis = 1)

    return df_new_new

In [207]:
test_df = append_labels(data_test, domain_dict2, label_df)

print(type(test_df['labels'].iloc[1]))
print(type(test_df['labels'].iloc[1][1]))
print(len(test_df['labels'].iloc[1]))

<class 'list'>
<class 'int'>
28


In [208]:
# Save df
test_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-label.csv', index = False)