<a href="https://colab.research.google.com/github/blue-create/Dataquest/blob/main/label.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preparation**

As part of the workflow between GitHub and Google Colab, complete these steps before running the notebook:
1. Upload the [data](https://drive.google.com/drive/folders/1ExS7M2OOkbYS5Z5O9pbPbaCpSa0rhGet?usp=sharing) to a folder in your GDrive.
2. Mount your GDrive.
3. Set the data folder as your working directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pwd
%cd /content/drive/MyDrive/ThesisAllocationSystem

/content
/content/drive/MyDrive/ThesisAllocationSystem


In [4]:
import pandas as pd
import numpy as np

# **Data Labelling: Train**

We manually define a dictionary containing a categorical label for each professor, broadly describing their area of research. Thereafter, we integrate these labels into the existing train dataset.

In [2]:
# Map each professor (key) to a categorical research-domain label (value)
domain_dict = {'anheier': 'non_profit',
              'bryson': 'technology_governance',
              'cis': 'international_security',
              'cali': 'international_law',
              'cingolani': 'development_studies',              
              'costello': 'migration_law',
              'clachsland': 'climate_sustainability',
              'graf': 'education',
              'hallerberg': 'fiscal_governance',
              'hammerschmid': 'public_management',
              'hassel': 'labour_policy',
              'hirth': 'energy_economics',
              'hustedt': 'public_administration',
              'iacovone': 'development_economics',
              'jachtenfuchs': 'european_governance',
              'jankin': 'data_science',
              'kayser': 'comparative_politics',
              'kreyenfeld': 'social_policy',
              'mair': 'strategic_management',
              'mena': 'organisational_management',              
              'mungiu-pippidi': 'democracy_studies',
              'munzert': 'political_behaviour',
              'patz': 'international_organizations',
              'reh': 'european_politics',
              'roemmele': 'political_communication',
              'shaikh': 'health_economics',
              'snower': 'macroeconomics',
              'stockmann': 'digital_governance',
              'traxler': 'taxation',
              'wegrich': 'policy_process'
}

In [39]:
# Load train data
data = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv', encoding = 'latin1')

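# Clean file names: strip the .txt extension and digits, lowercase, then drop non-word characters
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
data["FileName"] = data["FileName"].str.replace(r'\.txt$', '', regex=True).str.replace(r'\d+', '', regex=True).str.lower().str.replace(r'\W+', '', regex=True)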

# Labels
print('There are', len(data.FileName.unique()), 'unique labels.\n')
print(data.sample(10))

There are 30 unique labels.

       FileName                                            Content
328     hustedt  b'Political Control of Coordination? The Roles...
349  kreyenfeld  b"Changes in union status during the transitio...
314      hassel  b"Microsoft Word - 16 16 Social policy in the ...
206  hallerberg  b'City, University of London Institutional Rep...
653     munzert  b'LOYAL TO THE GAME? STRATEGIC POLICY REPRESEN...
118      bryson  b'Citation for published version:\nWhitehouse,...
363  kreyenfeld  b'Microsoft Word - zff002_inhalt_editorial.doc...
656     munzert  b's\no\nu\nr\nc\ne\n:\n \nh\nt\nt\np\ns\n:\n/\...
84       bryson  b'Slam the Brakes: Perceptions of Moral Decisi...
756   stockmann  b'1 \n \n\n \n\n \n\n \n\nWe Dont Know What We...
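
The cleaning chain above can be tried in isolation on a few made-up file names. This is a minimal sketch: the names in `names` are hypothetical and only shaped like the real ones. It also shows one side effect worth knowing about: `\W+` removes hyphens, so a hyphenated raw name ends up without its hyphen.

```python
import pandas as pd

# Hypothetical file names (assumption: the real ones look roughly like this)
names = pd.Series(["Hustedt12.txt", "mungiu-pippidi3.txt", "Kreyenfeld 07.txt"])

cleaned = (
    names
    .str.replace(r"\.txt$", "", regex=True)  # drop the .txt extension
    .str.replace(r"\d+", "", regex=True)     # drop running numbers
    .str.lower()                             # normalise case
    .str.replace(r"\W+", "", regex=True)     # drop spaces, hyphens, punctuation
)
print(cleaned.tolist())
# ['hustedt', 'mungiupippidi', 'kreyenfeld']
```

If a raw name contains a hyphen, the cleaned name will not, so a dictionary key written with a hyphen would no longer match it.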


In [40]:
# Create a domain column to facilitate mapping on dictionary keys and pass labels as value
data["domain"] = data["FileName"].map(domain_dict)
print(data)

         FileName  ...             domain
0    hammerschmid  ...  public_management
1    hammerschmid  ...  public_management
2    hammerschmid  ...  public_management
3    hammerschmid  ...  public_management
4    hammerschmid  ...  public_management
..            ...  ...                ...
806       wegrich  ...     policy_process
807       wegrich  ...     policy_process
808       wegrich  ...     policy_process
809       wegrich  ...     policy_process
810       wegrich  ...     policy_process

[811 rows x 3 columns]
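
`Series.map` returns `NaN` for any file name that has no key in the dictionary, and the printed frame above would not make that obvious. The following sketch, using a toy dictionary and toy frame (both hypothetical), shows one way to list unmapped names before moving on.

```python
import pandas as pd

# Toy stand-ins for domain_dict and the cleaned train data (assumptions)
toy_dict = {"bryson": "technology_governance", "mungiu-pippidi": "democracy_studies"}
toy = pd.DataFrame({"FileName": ["bryson", "mungiupippidi", "bryson"]})

toy["domain"] = toy["FileName"].map(toy_dict)

# File names without a dictionary entry come back as NaN
unmapped = sorted(toy.loc[toy["domain"].isna(), "FileName"].unique())
print(unmapped)
# ['mungiupippidi']
```

An empty list means every paper received a domain label.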


In [26]:
# One-hot encode the research domain: one binary dummy column per domain label
dum_df = pd.get_dummies(data, columns=["domain"])
type(dum_df['domain_comparative_politics'].iloc[1])

numpy.uint8
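
The `numpy.uint8` result above depends on the pandas version: from pandas 2.0 onwards, `get_dummies` returns boolean columns by default. A small sketch on toy data (the frame `toy` is hypothetical) shows how passing `dtype=int` keeps the dummies numeric either way, and what the resulting label vectors look like.

```python
import pandas as pd

# Toy frame with the same column layout as the train data (assumption)
toy = pd.DataFrame({
    "Content": ["paper a", "paper b", "paper c"],
    "domain": ["taxation", "education", "taxation"],
})

# dtype=int gives 0/1 columns regardless of the pandas default
dummies = pd.get_dummies(toy, columns=["domain"], dtype=int)

# Dummy columns are created in sorted order of the category names
label_cols = [c for c in dummies.columns if c.startswith("domain_")]
labels = dummies[label_cols].values.tolist()

print(label_cols)  # ['domain_education', 'domain_taxation']
print(labels)      # [[0, 1], [1, 0], [0, 1]]
```

The position of each `1` follows the column order in `label_cols`, so that list is worth keeping if label vectors have to be decoded later.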

In [27]:
# Concatenate the two dataframes
data = pd.concat([data.iloc[:,:2], dum_df.iloc[:,2:]], axis = 1)

# Save for label extraction
data.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', index = False)

In [28]:
data.drop(['FileName'], inplace=True, axis=1)

In [29]:
train_df = pd.DataFrame()
train_df['content'] = data['Content']
train_df['labels'] = data.iloc[:, 1:].values.tolist()

# Check type and content
print(train_df.sample(10))
print(type(train_df['labels'].iloc[1]))
print(type(train_df['labels'].iloc[1][1]))
print(train_df.shape)

                                               content                                             labels
78   b'This is a preprint version of our chapter in...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
295  b'The Politics of Social Pacts\n\n\nThe Politi...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
171  b'Actors, objectives, context_ A framework of ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
75   b"Microsoft Word - DGHJHW_ClimateChangeConjoin...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
53   b'A Conservative Revolution final submission\n...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
26   b'AUFS\xc4TZE\n\nDOI 10.1007/s41358-016-0070-z...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
215  b'CESifo Working Paper no. 6228\n\n\neconstor\...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
570  b'See discussions, stats, and author profiles ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
221  b"City, University of London Institutiona

In [30]:
# Save labeled dataframe as csv 
train_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', index = False)
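
One point to keep in mind when this CSV is loaded again: a column of Python lists is written as text, so `read_csv` returns strings such as `"[0, 1]"` rather than lists. The sketch below uses an in-memory buffer instead of a GDrive path and a toy frame (both assumptions) to show the round trip and one way to parse the labels back with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy frame shaped like train_df (assumption)
toy = pd.DataFrame({"content": ["paper a", "paper b"], "labels": [[0, 1], [1, 0]]})

# Write to an in-memory buffer instead of a file on GDrive
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(type(reloaded["labels"].iloc[0]))  # <class 'str'>

# Parse the string representation back into a list of ints
reloaded["labels"] = reloaded["labels"].apply(ast.literal_eval)
print(type(reloaded["labels"].iloc[0]))  # <class 'list'>
```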

# **Extract labelled data**

In [49]:
# Load the label blueprint (file names, content, and one-hot domain columns) saved above
df_label = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', encoding = 'latin1')

In [None]:
# Keep one row per professor; copy so later edits do not trigger a SettingWithCopyWarning
dat_label = df_label.drop_duplicates('FileName').copy()

# Drop content col
dat_label = dat_label.drop(columns=['Content'])

# Convert encoded domains to list
dat_label['labels'] = dat_label.iloc[:, 1:].values.tolist()

In [54]:
# Create df with values 
label_df = pd.DataFrame()
label_df['FileName'] = dat_label['FileName']
label_df['labels'] = dat_label['labels']

# Check type
print(type(label_df))
print(type(label_df['labels'].iloc[1]))
print(type(label_df['labels'].iloc[1][1]))

<class 'pandas.core.frame.DataFrame'>
<class 'list'>
<class 'int'>


In [55]:
# Save & overwrite old file
label_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/label_data.csv', index = False)

# **Data Labelling: Test**



In this section, we assign the newly created labels to student thesis proposals. Each proposal receives the label of the professor the student named as either their first or second preference. The finished dataset will serve as a validation/test dataset.

In [69]:
# Load test data
data_test = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-final.csv', encoding = 'latin1')

In [70]:
# Map each thesis proposal (key) to the preferred professor (value)
domain_dict2 = {'thesisproposal1': 'munzert',
                'thesisproposal2': 'traxler',
                'thesisproposal3': 'bryson',
                'thesisproposal4': 'shaikh',
                'thesisproposal5': 'munzert',
                'thesisproposal6': 'iacovone'
}

In [71]:
# Clean file names: strip the .txt extension, lowercase, then drop non-word characters
data_test["FileName"] = data_test["FileName"].str.replace(r'\.txt$', '', regex=True).str.lower().str.replace(r'\W+', '', regex=True)

# Replace file names with dict key to merge with label df
data_test["FileName"] = data_test["FileName"].map(domain_dict2)

# Merge with data label & rename col
test_df = pd.merge(data_test, label_df, on='FileName')
test_df.rename(columns = {'Content': 'text'}, inplace = True)

# Remove non-necessary col
test_df.drop(['FileName'], inplace = True, axis = 1)

# Check type and content
test_df.shape
print(type(test_df))
print(type(test_df['labels'].iloc[1]))
print(type(test_df['labels'].iloc[1][1]))

<class 'pandas.core.frame.DataFrame'>
<class 'list'>
<class 'int'>
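
`pd.merge` performs an inner join by default, so a proposal whose file name has no entry in the dictionary, or no matching professor in the label frame, is dropped without any message. The sketch below uses two toy frames (hypothetical data) with a left join and `indicator=True` to make such rows visible.

```python
import pandas as pd

# Toy proposals after mapping file names to professors (assumption);
# None stands for a proposal that had no dictionary entry
proposals = pd.DataFrame({
    "FileName": ["munzert", "traxler", None],
    "Content": ["proposal 1", "proposal 2", "proposal 7"],
})
labels = pd.DataFrame({
    "FileName": ["munzert", "traxler"],
    "labels": [[1, 0], [0, 1]],
})

# A left join keeps every proposal; _merge records where each row came from
merged = proposals.merge(labels, on="FileName", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged), len(unmatched))  # 3 1
print(unmatched["Content"].tolist())  # ['proposal 7']
```

If `unmatched` is empty, the inner join used above loses nothing.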


In [72]:
# Save df
test_df.to_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/test-proposals-label.csv', index = False)