## from_full_data_to_full_data_all_vars.ipynb

In this notebook, I extract all remaining variables for an analysis with covariates. It is modeled on `find_remaining_variables.ipynb`, except that the death date has to be added first. After that, the script follows that notebook more or less exactly. The approach is:

   - Add the sterfdatum (date of death)
   - Redo the entire script as in `find_remaining_variables.ipynb`
   

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

import re
from itertools import compress

from matplotlib import style
from matplotlib import pyplot as plt
import seaborn as sns

from rdrobust import rdrobust,rdbwselect,rdplot



In [2]:
data = pd.read_csv("../Data/analysis/full_sample_analysis_novars.csv", dtype={'b1-nummer':str}).iloc[:, 1:]
data.shape

(8459, 47)

In [3]:
## Add the sterfdatum
sterfdatum_politici = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", 
              dtype={'b1-nummer':str},
              sheet_name=1)
sterfdatum_politici = sterfdatum_politici[sterfdatum_politici['rubriek'] == 3020][['b1-nummer', 'datum']]

data = pd.merge(data, sterfdatum_politici, how='left',
         left_on='b1-nummer',
         right_on='b1-nummer').rename(columns={'datum':'sterfdatum'})

In [4]:
nonpols = pd.read_csv('../Data/analysis/unmatched_sample_analysis.csv').iloc[:,1:]
nonpols = nonpols[nonpols['b1-nummer'].isna()][['Naam', 'Sterfdatum']].drop_duplicates()
nonpols = nonpols[~nonpols['Sterfdatum'].isna()]

In [5]:
data = pd.merge(data, nonpols,
        how='left',
        left_on='Naam',
         right_on='Naam').rename(columns={'Sterfdatum':'Dod'})

data['Sterfdatum'] = np.where(data['sterfdatum'].isna(), data['Dod'], data['sterfdatum'])

data = data.drop(columns=['sterfdatum', 'Dod'])

data.shape

(8467, 48)

In [6]:
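def fix_sterfdatum(sterfdat):
    'Parse a day-month-year string (separated by "-" or "/") into a Timestamp; return None on failure'
    
    try:
        parts = re.split("-|/", sterfdat)
        parts = [int(x) for x in parts]
        out = pd.Timestamp(day = parts[0], month = parts[1], year = parts[2])
    except Exception:
        # Missing values, non-string input and malformed dates all end up here
        out = None
        
    return out

# NOTE: this writes to a new column named 'Stefdatum' (sic); the 'Sterfdatum' column used below is left unchanged
data['Stefdatum'] = data['Sterfdatum'].apply(fix_sterfdatum)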


## Load some datasets I need

From here, proceed as in `find_remaining_variables.ipynb`.

Now we load several datasets, which are merged into the dataframe `data` in a few steps.

We also define a function that cleans up the birthplaces of politicians. Birthplaces of non-politicians have already been cleaned.

In [7]:
def cleanup(x):
    "Helper to clean up a politician's birthplace"
    step1 = re.sub(r'\((.+)\)', '', x)  # drop anything in parentheses
    step2 = re.sub("'s-Gravenhage", 'Den Haag', step1)
    step3 = re.sub("'s-Hertogenbosch", "Den Bosch", step2)
    step4 = step3.strip()
    
    return step4

In [8]:
# Election and election history data
## Some datasets which I need
electoral_data = pd.read_csv("../Data/elections/election_results_details.csv").iloc[:,1:]
electoral_data.iloc[:,[2,7,8,9,10,11,12,13]] = electoral_data.iloc[:,[2,7,8,9,10,11,12,13]].apply(lambda x: pd.to_numeric(x, errors='coerce'))
electoral_data['Verkiezingdatum'] = electoral_data['Verkiezingdatum'].apply(lambda x: pd.Timestamp(x))

## Seats data
zetels = electoral_data.groupby(['District', 'Verkiezingdatum']).agg({'Aantal zetels': 'mean'})

## Politician metadata
politician_metadata = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", dtype={'b1-nummer':str})
politician_metadata2 = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", sheet_name = 1, dtype={'b1-nummer':str})

### Clean up the variable 'waarde' (birthplace from politician metadata)
politician_metadata2['waarde'] = politician_metadata2['waarde'].apply(lambda x: cleanup(x))

## Nonpolitician metadata
nonpolitician_metadata = pd.read_csv("../Data/nonpolitician_data/nonpoliticians_birthplace_birthdates.csv").iloc[:,1:]

## Taxes and population (district)
taxes_pop = pd.read_csv('../Data/district_data/taxes_and_population.csv').iloc[:,1:]

## Religious composition over time (district)
religious_comp = pd.read_csv("../Data/district_data/religion_over_time.csv").iloc[:,1:]

## Professional composition 1889 (district)
prof_comp = pd.read_csv("../Data/district_data/professional_composition.csv").iloc[:,1:]

## Clean the religious composition dataset 

- Sum the denominations into Protestant, Catholic and other (`overig`), and combine the results in one dataframe to be used later.

In [9]:
prot = ['Doopsgezinden', 'Evangelisch Luthers', 'Nederlands Hervormden', 
        'overige kerkelijke gezindte', 'Remonstranten',
        'Anglikaans Episcopalen', 'Christelijk Afgescheidenen',
        'Engelse Presbyterianen', 'Hernhutters',
       'Hersteld Evangelisch Luthersen', 'Schotse Gemeente', 'Waals Hervormden',
        'Gereformeerde Kerken', 'Christelijk Gereformeerden']

kath = ['Oud Katholieken', 'Rooms-Katholieken']

In [10]:
def sum_catholic(groups):
    groups = groups[groups['information'].isin(kath)]
    return groups['aantal'].sum()
    
def sum_protestant(groups):
    groups = groups[groups['information'].isin(prot)]
    return groups['aantal'].sum()

def sum_overig(groups):
    groups = groups[~(groups['information'].isin(prot)) & ~(groups['information'].isin(kath))]
    return groups['aantal'].sum()

In [11]:
protestant = religious_comp.groupby(['name','year']).apply(lambda x: sum_protestant(x)).reset_index().rename(columns={0:'protestant'})
catholic = religious_comp.groupby(['name','year']).apply(lambda x: sum_catholic(x)).reset_index().rename(columns={0:'catholic'})
overig = religious_comp.groupby(['name','year']).apply(lambda x: sum_overig(x)).reset_index().rename(columns={0:'overig'})

In [12]:
religious_comp = pd.merge(protestant, catholic, on=['name', 'year']).merge(overig, on = ['name','year'])
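
The three `groupby(...).apply(...)` passes and the two merges above can also be written as a single pivot. A minimal sketch on made-up rows, assuming the same column names (`name`, `year`, `information`, `aantal`); `label_group`, the shortened denomination lists and the toy numbers are mine and only illustrate the idea:

```python
import pandas as pd

# Toy stand-in for religious_comp: column names as in the cells above, rows made up
toy = pd.DataFrame({
    "name": ["Assen", "Assen", "Assen", "Breda", "Breda"],
    "year": [1809, 1809, 1809, 1809, 1809],
    "information": ["Nederlands Hervormden", "Rooms-Katholieken", "Israelieten",
                    "Rooms-Katholieken", "Remonstranten"],
    "aantal": [900, 80, 20, 700, 50],
})

protestant_groups = ["Nederlands Hervormden", "Remonstranten"]  # shortened list
catholic_groups = ["Oud Katholieken", "Rooms-Katholieken"]

def label_group(denomination):
    "Map a denomination to protestant / catholic / overig (hypothetical helper)"
    if denomination in protestant_groups:
        return "protestant"
    if denomination in catholic_groups:
        return "catholic"
    return "overig"

toy["group"] = toy["information"].map(label_group)

# One row per (name, year), one column per group, counts summed
wide = (toy.pivot_table(index=["name", "year"], columns="group",
                        values="aantal", aggfunc="sum", fill_value=0)
           .reset_index())
```

The result has the same layout as the merged `religious_comp` (one row per place and year, with `protestant`, `catholic` and `overig` columns), only the column order differs.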

## Function to add the first batch of variables

This function adds, for every row:

- Lifespan after the election (in days)

- Begin and end of the parliamentary period, and tenure (politicians only)

- Date of birth

- Place of birth

In [13]:
def get_variables(data):
    
    out = pd.DataFrame()
    
    for i in tqdm(range(len(data))):
        
        # For everyone:

        ## How long did you live after this election? (in days)
        try:
            lifespan = (pd.Timestamp(data.iloc[i]['Sterfdatum']) - pd.Timestamp(data.iloc[i]['Verkiezingdatum']))/ pd.Timedelta(1, unit='d')
        
        except:
            lifespan = None
        
        ### Variables specific to the election
 
        # Non-politicians have no b1-nummer; politicians are handled in the else branch
        if pd.isnull(data.iloc[i]['b1-nummer']):
            begin_period = None
            end_period = None
            tenure = None
            
            try:
                date_of_birth = nonpolitician_metadata[nonpolitician_metadata['Naam'] == data.iloc[i]['Naam']]['Birthdate'].item()
                place_of_birth = nonpolitician_metadata[nonpolitician_metadata['Naam'] == data.iloc[i]['Naam']]['Birthplace'].item()
            except:
                date_of_birth = None
                place_of_birth = None
        else: 
            begin_period = politician_metadata[politician_metadata['b1-nummer'] == data.iloc[i]['b1-nummer']]['begin periode'].values[0]
            end_period = politician_metadata[politician_metadata['b1-nummer'] == data.iloc[i]['b1-nummer']]['einde periode'].values[0]
            try:
                tenure = pd.Timestamp(end_period) - pd.Timestamp(begin_period)
            except:
                tenure = None
                
            date_of_birth = politician_metadata2[(politician_metadata2['b1-nummer'] == data.iloc[i]['b1-nummer']) & (politician_metadata2['rubriek'] == 3010)]['datum'].values[0]
            place_of_birth = politician_metadata2[(politician_metadata2['b1-nummer'] == data.iloc[i]['b1-nummer']) & (politician_metadata2['rubriek'] == 3010)]['waarde'].values[0]
        ## Now, we add all data points to the dataframe
        interim = pd.DataFrame([data.iloc[i]])
        
        # Make all variables (missing values are already None at this point)
        interim['lifespan'] = lifespan
        interim['begin_period'] = begin_period
        interim['end_period'] = end_period
        interim['tenure'] = tenure
        interim['date_of_birth'] = date_of_birth
        interim['place_of_birth'] = place_of_birth
        interim = interim.reset_index()
        
        ## Finally, add the interim dataframe to the output dataframe that returns the input plus the appended variables
        ## (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        out = pd.concat([out, interim])
        
    return out

In [14]:
data_with_vars = get_variables(data)

100%|██████████| 8467/8467 [03:12<00:00, 43.97it/s]


In [15]:
data_with_vars.shape

(8467, 56)
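
For a column like `lifespan`, the row loop is not strictly needed. A minimal sketch on made-up dates, assuming ISO-formatted date strings; unparseable or missing dates become `NaT` and the lifespan `NaN`, which mirrors the `try`/`except` in the loop above:

```python
import pandas as pd

# Toy rows (made up); the column names follow the notebook
toy = pd.DataFrame({
    "Verkiezingdatum": ["1888-03-06", "1891-06-09", "1897-06-15"],
    "Sterfdatum": ["1900-03-06", None, "not a date"],
})

# errors="coerce" turns anything unparseable into NaT instead of raising
election = pd.to_datetime(toy["Verkiezingdatum"], errors="coerce")
death = pd.to_datetime(toy["Sterfdatum"], errors="coerce")

# Days lived after the election; NaN where either date is missing
toy["lifespan"] = (death - election) / pd.Timedelta(1, unit="d")
```

This runs in one pass over the column instead of once per row, which matters for the roughly 8,500 rows here.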

## Function to derive all data w.r.t. district and birthplace and death place


- What do I still want from this dataframe?

- Current and other election info (Done)
    - Before/after general ("algemene") elections
    - Age at time of election
    - Age at death
    - Turnout previous election in district
    - Increase in turnout w.r.t. previous election
    - No. of candidates in election
    

- Party info for politicians (Done)
    - Indicator whether party already existed at time of election:
        - ARP: 3 April 1879
        - Catholic: 15 March 1892
        - Liberale Unie: 4 March 1885
        
    
- Place of birth / Place of death / District characteristics (Done)
    - Taxes
    - School money (Not done yet, maybe implement later)
    - Religion
    - Pop. size
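
When the founding dates above are turned into indicators, it helps to write them as explicit year, month and day: a string such as `'03/04/1879'` is read month-first by default by pandas and dateutil, that is, as 4 March rather than 3 April. A minimal sketch with the standard library; `PARTY_FOUNDED` and `election_after` are hypothetical names:

```python
from datetime import date

# Founding dates as listed above, written as explicit (year, month, day)
PARTY_FOUNDED = {
    "arp": date(1879, 4, 3),   # 3 April 1879
    "rk": date(1892, 3, 15),   # 15 March 1892
    "lib": date(1885, 3, 4),   # 4 March 1885
}

def election_after(election_date, party):
    "Return 1 if the election took place after the party was founded, else 0"
    return int(election_date > PARTY_FOUNDED[party])
```

For an election on 1 January 1880, `election_after(date(1880, 1, 1), "arp")` gives 1, while the `"rk"` and `"lib"` indicators are still 0.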
    

In [16]:
# Helper functions

def nearest(items, pivot):
    'Find the nearest date on or before a particular pivot date'
    return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))

# other helper function to parse district
def parse_district(x):
    'Parse the district name without Roman numerals'
    if ' X' in x:
        x = re.sub(' X', '', x)
    if ' IX' in x:
        x = re.sub(' IX', '', x)
    if ' VIII' in x:
        x = re.sub(' VIII', '', x)
    if ' VII' in x:
        x = re.sub(' VII', '', x)
    if ' VI' in x:
        x = re.sub(' VI', '', x)
    if ' V' in x:
        x = re.sub(' V', '', x)
    if ' IV' in x:
        x = re.sub(' IV', '', x)
    if ' III' in x:
        x = re.sub(' III', '', x)
    if ' II' in x:
        x = re.sub(' II', '', x)
    if ' I' in x:
        x = re.sub(' I', '', x)
    return x
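
The chain of `if` statements can be condensed into one regular expression. A minimal sketch, assuming the Roman numeral always comes last in the district name; unlike `parse_district`, it only strips a trailing numeral, so a name that merely contains ` V` or ` I` somewhere else is left alone. `strip_roman_suffix` is a hypothetical name:

```python
import re

# Whitespace followed by a run of I, V, X at the very end of the name
ROMAN_SUFFIX = re.compile(r"\s+[IVX]+$")

def strip_roman_suffix(district):
    "Return the district name without a trailing Roman numeral"
    return ROMAN_SUFFIX.sub("", district)
```

For example, `strip_roman_suffix("Amsterdam IX")` gives `"Amsterdam"`, and a name without a numeral such as `"Bergen op Zoom"` comes back unchanged.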

In [17]:
# Actual function

# Write this function here

def get_more_variables(data):
    
    out = pd.DataFrame()
    
    for i in tqdm(range(len(data))):
        
        # Party (politicians only): a missing b1-nummer matches nothing in the metadata,
        # so for non-politicians `party` is an empty Series, which is handled further down
        party = politician_metadata[politician_metadata['b1-nummer'] == data.iloc[i]['b1-nummer']]['partij(en)/fractie(s)']

        # For all:
        # Founding dates in ISO format (year-month-day), so that day and month cannot be swapped when parsing
        election_after_arp = np.where(pd.Timestamp(data.iloc[i]['Verkiezingdatum']) > pd.Timestamp('1879-04-03'), 1, 0).item()
        election_after_rk = np.where(pd.Timestamp(data.iloc[i]['Verkiezingdatum']) > pd.Timestamp('1892-03-15'), 1, 0).item()
        election_after_lib = np.where(pd.Timestamp(data.iloc[i]['Verkiezingdatum']) > pd.Timestamp('1885-03-04'), 1, 0).item()
        
        # Current and other election info
        electoral_data_before = electoral_data[electoral_data['Verkiezingdatum'] < pd.Timestamp(data.iloc[i]['Verkiezingdatum'])]
        before = electoral_data_before[electoral_data_before['Type'] == 'algemeen']
        howmany_before_algemeen = before[before['Naam'].str.contains(data.iloc[i]['Naam'])].shape[0]
        
        electoral_data_after = electoral_data[electoral_data['Verkiezingdatum'] > pd.Timestamp(data.iloc[i]['Verkiezingdatum'])]
        after = electoral_data_after[electoral_data_after['Type'] == 'algemeen']   
        howmany_after_algemeen = after[after['Naam'].str.contains(data.iloc[i]['Naam'])].shape[0]
        
        age_at_election = pd.Timestamp(data.iloc[i]['Verkiezingdatum']) - pd.Timestamp(data.iloc[i]['date_of_birth'])

        try:
            age_of_death = pd.Timestamp(data.iloc[i]['Sterfdatum']) - pd.Timestamp(data.iloc[i]['date_of_birth'])
        except:
            age_of_death = None
        
        # find the nearest election before the actual election
        verk_dat = data.iloc[i]['Verkiezingdatum']
        distr = data.iloc[i]['District']
        
        try:
            nearest_el = nearest(electoral_data[(electoral_data['District'] == distr) & (electoral_data['Verkiezingdatum'] != verk_dat)]['Verkiezingdatum'], pd.Timestamp(verk_dat))
            turnout_previous_el = electoral_data[(electoral_data['District'] == distr)&(electoral_data['Verkiezingdatum'] == nearest_el)]['Aantal stemmen geldig'].unique().item()
            diff_turn = data.iloc[i]['turnout'] - turnout_previous_el
        except:
            nearest_el = None
            turnout_previous_el = None
            diff_turn = None
        
        no_of_candidates = electoral_data[(electoral_data['District'] == data.iloc[i]['District']) & (electoral_data['Verkiezingdatum'] == data.iloc[i]['Verkiezingdatum'])].shape[0]

        
        ## place of birth, district characteristics
        ### taxes birthplace 1859, taxes birthplace 1889, difference between the two
        birthplace = data.iloc[i]['place_of_birth']
        
        taxespercap_1859 = taxes_pop[(taxes_pop['name'] == birthplace) & (taxes_pop['year'] == 1859)]['taxes_percap']
        taxespercap_1889 = taxes_pop[(taxes_pop['name'] == birthplace) & (taxes_pop['year'] == 1889)]['taxes_percap']
        
        try: 
            taxespercap_diff = taxespercap_1889.item() - taxespercap_1859.item()
        except:
            taxespercap_diff = None
        
        ### population birthplace 1859
        population_birthplace_1859 = taxes_pop[(taxes_pop['name'] == birthplace) & (taxes_pop['year'] == 1859)]['total_inhabitants']
        
        ### religious composition birthplace 1809
        birthplace_cath = religious_comp[(religious_comp['name'] == birthplace) & (religious_comp['year'] == 1809)]['catholic']
        birthplace_prot = religious_comp[(religious_comp['name'] == birthplace) & (religious_comp['year'] == 1809)]['protestant']
        birthplace_ov = religious_comp[(religious_comp['name'] == birthplace) & (religious_comp['year'] == 1809)]['overig']
        
        try:
            share_cath = birthplace_cath.item() / (birthplace_cath.item() + birthplace_prot.item() + birthplace_ov.item())
            share_prot = birthplace_prot.item() / (birthplace_cath.item() + birthplace_prot.item() + birthplace_ov.item())
        except:
            share_cath = None
            share_prot = None
            
        ### religious composition district 1809 
        
        district = parse_district(distr) # Two district variables are needed to match them to election data (above)
                                        # and municipality data (here)
        
        district_cath = religious_comp[(religious_comp['name'] == district) & (religious_comp['year'] == 1809)]['catholic']
        district_prot = religious_comp[(religious_comp['name'] == district) & (religious_comp['year'] == 1809)]['protestant']
        district_ov = religious_comp[(religious_comp['name'] == district) & (religious_comp['year'] == 1809)]['overig']
        
        ### pop count district 
        district_pop_1859 = taxes_pop[(taxes_pop['name'] == district) & (taxes_pop['year'] == 1859)]['total_inhabitants']
        district_pop_1889 = taxes_pop[(taxes_pop['name'] == district) & (taxes_pop['year'] == 1889)]['total_inhabitants']
        
        ### profcount per cap birth place
        birthplace_agri = prof_comp[(prof_comp['name'] == birthplace) & (prof_comp['category'] == 'agriculture')]['prof_count_per_cap']
        birthplace_indus = prof_comp[(prof_comp['name'] == birthplace) & (prof_comp['category'] == 'industry')]['prof_count_per_cap']
        birthplace_serv = prof_comp[(prof_comp['name'] == birthplace) & (prof_comp['category'] == 'services')]['prof_count_per_cap']
        
        
        ### profcount per cap district
        district_agri = prof_comp[(prof_comp['name'] == district) & (prof_comp['category'] == 'agriculture')]['prof_count_per_cap']
        district_indus = prof_comp[(prof_comp['name'] == district) & (prof_comp['category'] == 'industry')]['prof_count_per_cap']
        district_serv = prof_comp[(prof_comp['name'] == district) & (prof_comp['category'] == 'services')]['prof_count_per_cap']
        
        # write all variables to interim
        
        interim = pd.DataFrame([data.iloc[i]])
        interim['party']  = party.values[0] if len(party.values) else None
        interim['election_after_arp'] = election_after_arp
        interim['election_after_rk'] = election_after_rk
        interim['election_after_lib'] = election_after_lib
        interim['howmany_before_alg'] = howmany_before_algemeen
        interim['howmany_after_alg'] = howmany_after_algemeen
        interim['age_at_election'] = age_at_election #if len(age_at_election) else None
        interim['age_of_death'] = age_of_death
        interim['turnout_previous_el'] = turnout_previous_el
        interim['diff_turn'] = diff_turn
        interim['no_candidates'] = no_of_candidates
        interim['taxespercap_1859'] = taxespercap_1859.item() if len(taxespercap_1859) else None
        interim['taxespercap_1889'] = taxespercap_1889.item() if len(taxespercap_1889) else None
        interim['taxespercap_diff'] = taxespercap_diff
        interim['birthplace_pop_1859'] = population_birthplace_1859.item() if len(population_birthplace_1859) else None
        interim['birthplace_cath'] = birthplace_cath.item() if len(birthplace_cath) else None
        interim['birthplace_prot'] = birthplace_prot.item() if len(birthplace_prot) else None
        interim['birthplace_ov'] = birthplace_ov.item() if len(birthplace_ov) else None
        interim['birthplace_share_cath'] = share_cath if birthplace != None else None
        interim['birthplace_share_prot'] = share_prot if birthplace != None else None
        interim['district_cath'] = district_cath.item() if len(district_cath) else None
        interim['district_prot'] = district_prot.item() if len(district_prot) else None
        interim['district_ov'] = district_ov.item() if len(district_ov) else None
        interim['district_pop_1859'] = district_pop_1859.item() if len(district_pop_1859) else None
        interim['district_pop_1889'] = district_pop_1889.item() if len(district_pop_1889) else None
        interim['birthplace_agri'] = birthplace_agri.item() if len(birthplace_agri) else None
        interim['birthplace_indus'] = birthplace_indus.item() if len(birthplace_indus) else None
        interim['birthplace_serv'] = birthplace_serv.item() if len(birthplace_serv) else None
        interim['district_agri'] = district_agri.item() if len(district_agri) else None
        interim['district_indus'] = district_indus.item() if len(district_indus) else None
        interim['district_serv'] = district_serv.item() if len(district_serv) else None
        
        interim = interim.reset_index()        
        out = pd.concat([out, interim])  # DataFrame.append was removed in pandas 2.0
    
    #Clean the indices
    out = out.iloc[:,2:]
    
    return out

In [18]:
data_with_vars = get_more_variables(data_with_vars)

100%|██████████| 8467/8467 [10:48<00:00, 13.05it/s]


In [19]:
#data_with_vars.to_csv("test.csv", sep = "\t")
# Interim file - seems all good!

## Make a party key to aggregate party

- From very heterogeneous party information

In [20]:
orientations = data_with_vars['party'].unique()
orientations = [i for i in orientations if i]

lib = re.compile("(.+)lib(.+)|(.*)Lib(.*)|Thor(.+)|Putt|Pytt|Takk|Kappey|(.*)VDB(.*)|Radical|liberaal")
liberalen = list(filter(lib.match, orientations))

kat = re.compile("(.*)kath(.*)|(.+)RK(.+)|(.+)Rooms(.+)|Rooms|Katholiek|(.*)Schaep(.*)|(.*)Bahl(.*)")
katholieken = list(filter(kat.match, orientations))

# Named prot_re so it does not overwrite the `prot` list of denominations defined above
prot_re = re.compile("(.*)prot(.*)|(.+)CHU|(.+)AR(.+)|ARP|antirev|(.+)antirev(.+)|conservatief|(.+)AR(.+)|CHP|CHU|(.+)CHP|c.h.")
protestanten = list(filter(prot_re.match, orientations))

soc = re.compile("SDAP|SDP|Socialist|(.+)Socialist|socialist|SD|(.+)SDAP(.+)|(.+)vrijzin")
socialisten = list(filter(soc.match, orientations))


In [21]:
def make_party_key(dataset):
    
    out = pd.DataFrame()
    
    for i in tqdm(range(len(dataset))):
        
        if dataset.iloc[i]['b1-nummer'] is not None:
            
            if dataset.iloc[i]['party'] in protestanten:
                party_category = 'protestant'
                
            if dataset.iloc[i]['party'] in liberalen:
                party_category = 'liberal'
                
            if dataset.iloc[i]['party'] in katholieken:
                party_category = 'catholic'
                
            if dataset.iloc[i]['party'] in socialisten:
                party_category = 'socialist'
                
            if dataset.iloc[i]['party'] not in katholieken + protestanten + liberalen + socialisten:
                party_category = 'none'
                
        else:
            party_category = 'none'
        
        interim = pd.DataFrame([dataset.iloc[i]])
        interim['party_category'] = party_category
        
        out = pd.concat([out, interim])  # DataFrame.append was removed in pandas 2.0
    
    return out

In [22]:
data_with_vars = make_party_key(data_with_vars)

100%|██████████| 8467/8467 [04:25<00:00, 31.84it/s]
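
The same categorisation can be written as one function that is mapped over the `party` column. A minimal sketch with simplified, hypothetical patterns (the regexes in the notebook are longer); the order of `PATTERNS` mirrors the loop above, where a later `if` overrides an earlier one, so socialist wins over catholic, catholic over liberal, and liberal over protestant:

```python
import re

# Simplified patterns for illustration only; first match wins
PATTERNS = [
    ("socialist", re.compile(r"SDAP|SDP|[Ss]ocialist")),
    ("catholic", re.compile(r"RK|[Kk]ath|Rooms")),
    ("liberal", re.compile(r"[Ll]ib|VDB")),
    ("protestant", re.compile(r"ARP|CHU|antirev")),
]

def party_category(party):
    "Return the party category for a raw party string, or 'none'"
    if not isinstance(party, str):
        return "none"  # missing party (None or NaN), e.g. non-politicians
    for label, pattern in PATTERNS:
        if pattern.search(party):
            return label
    return "none"
```

Under these assumptions, something like `data_with_vars['party'].map(party_category)` would produce the `party_category` column without building the dataframe row by row.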


## Add the career information


- Create a dummy variable indicating a career in business

- Also indicate one for law, and for finance 

- All on the basis of Regex

- Maybe also one for colonial activities / Raad van Commissarissen 


- **To do**: add the career info for the non-politicians to a data file
- Make an algorithm that distinguishes between time after career for the politicians
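
A minimal sketch of the keyword-based dummies described above, using plain substring checks on a list of career descriptions. The keywords follow the ones used in the cells below (`bank` but not `rechtbank`, `handelaar`, `directeur`, `burgem`, `koloni`, `Ind`); the function name `career_flags` and the example descriptions are made up:

```python
def career_flags(descriptions):
    "Return 0/1 dummies for business, politics and colonial careers"
    business = politics = colonial = 0
    for text in descriptions:
        if not isinstance(text, str):
            continue  # skip missing descriptions
        is_bank = ("bank" in text
                   and "rechtbank" not in text
                   and "Rechtbank" not in text)
        if is_bank or "handelaar" in text or "directeur" in text:
            business = 1
        if "burgem" in text:
            politics = 1
        if "koloni" in text or "Ind" in text:
            colonial = 1
    return {"prof_business": business,
            "prof_politics": politics,
            "prof_colonial": colonial}

example = career_flags(["kassier bij een bank", "burgemeester van Assen"])
```

The matching is case-sensitive, as in the cells below, so a capitalised `Bank` is only caught if another keyword such as `directeur` appears in the same description.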


In [40]:
nonpols_career = pd.read_csv("../Data/nonpolitician_data/nonpoliticians_careerinfo.csv")

In [38]:
# Look up career data inside the loop, for politicians and for non-politicians
# (politician_careerdata is defined in cell In [69] below)
def add_career_info(data_with_vars):
    
    out = pd.DataFrame()

    for i in range(len(data_with_vars)):
    
        # test for nan or not, a.k.a., politician or not
        if data_with_vars['b1-nummer'].iloc[i] == data_with_vars['b1-nummer'].iloc[i]:
        
        # filtering to the right observations
            cd = politician_careerdata[politician_careerdata['b1-nummer'] == data_with_vars['b1-nummer'].iloc[i]]
            cd = cd[cd['begin'] > data_with_vars['Verkiezingdatum'].iloc[i]]
        
            #business
            bankers = cd[(cd['waarde'].str.contains('bank')) & ~(cd['waarde'].str.contains('rechtbank')) & ~(cd['waarde'].str.contains('Rechtbank'))]
            handelaren = cd[cd['waarde'].str.contains('handelaar')]
            directeuren = cd[cd['waarde'].str.contains('directeur')]
        
            business_ind = np.where(bankers.shape[0] + handelaren.shape[0] + directeuren.shape[0] > 0, 1, 0).item()
    

            # Politics
            burgemeester = cd[cd['waarde'].str.contains('burgem')]
            politics_ind = np.where(burgemeester.shape[0] > 0, 1, 0).item()
                
            #colonial
            colonial = cd[(cd['waarde'].str.contains('koloni')) | (cd['waarde'].str.contains('Ind'))]
            colonial_ind = np.where(colonial.shape[0] > 0, 1, 0).item()
        
        else:
        # if its a nonpolitician 
        
            try:
                business_ind = nonpols_career[nonpols_career['Naam'] == data_with_vars['Naam'].iloc[i]]['prof_business'].item()
            except:
                business_ind = None
            
            try:
                politics_ind = nonpols_career[nonpols_career['Naam'] == data_with_vars['Naam'].iloc[i]]['prof_politics'].item()
            except:
                politics_ind = None
            
            try:
                colonial_ind = nonpols_career[nonpols_career['Naam'] == data_with_vars['Naam'].iloc[i]]['prof_colonial'].item() 
            except:
                colonial_ind = None
        
        interim = pd.DataFrame([data_with_vars.iloc[i]])
        interim['prof_business'] = business_ind
        interim['prof_politics'] = politics_ind
        interim['prof_colonial'] = colonial_ind
    
        out = pd.concat([out, interim])  # DataFrame.append was removed in pandas 2.0
    
    return out
        


In [82]:
nonpols_career[nonpols_career['Naam'] == data_with_vars['Naam'].iloc[0]]['prof_business']

Series([], Name: prof_business, dtype: int64)

In [69]:
politician_careerdata = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", sheet_name = 1, dtype={'b1-nummer':str})
politician_careerdata[['begin', 'eind', 'bla1', 'bla2']] = politician_careerdata['datum'].str.split('/',expand=True)

politician_careerdata['begin'] = politician_careerdata['begin'].apply(fix_sterfdatum)
politician_careerdata['eind'] = politician_careerdata['eind'].apply(fix_sterfdatum)

# Business
bankers = politician_careerdata[(politician_careerdata['waarde'].str.contains('bank')) & ~(politician_careerdata['waarde'].str.contains('rechtbank')) & ~(politician_careerdata['waarde'].str.contains('Rechtbank'))]
handelaren = politician_careerdata[politician_careerdata['waarde'].str.contains('handelaar')]
directeuren = politician_careerdata[politician_careerdata['waarde'].str.contains('directeur')]
colonial = politician_careerdata[(politician_careerdata['waarde'].str.contains('koloni')) | (politician_careerdata['waarde'].str.contains('Ind'))]

# Politics
burgemeester = politician_careerdata[politician_careerdata['waarde'].str.contains('burgem')]

#business = bankers['b1-nummer'].unique() + handelaren['b1-nummer'].unique()# + directeuren['b1-nummer'].unique()
colonial = colonial['b1-nummer'].unique().tolist()
politics = burgemeester['b1-nummer'].unique().tolist()
business = bankers['b1-nummer'].unique().tolist() + handelaren['b1-nummer'].unique().tolist() + directeuren['b1-nummer'].unique().tolist()


In [152]:
def add_career_info(data):
    
    out = pd.DataFrame()
    
    for i in tqdm(range(len(data))):
                
        if pd.notnull(data.iloc[i]['b1-nummer']):  # NaN (non-politician) is not None, so test for missing values explicitly
            
            business_ind = np.where(data.iloc[i]['b1-nummer'] in business, 1, 0).item()
            politics_ind = np.where(data.iloc[i]['b1-nummer'] in politics, 1, 0).item()
            colonial_ind = np.where(data.iloc[i]['b1-nummer'] in colonial, 1, 0).item()
            
            # Implement the stuff for the nonpoliticians here - from file nonpoliticians_careerinfo.csv
            
        else:
            business_ind = None
            politics_ind = None
            colonial_ind = None
        
        interim = pd.DataFrame([data.iloc[i]])
        interim['prof_business'] = business_ind
        interim['prof_politics'] = politics_ind
        interim['prof_colonial'] = colonial_ind
        
        out = pd.concat([out, interim])  # DataFrame.append was removed in pandas 2.0
    
    return out
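
The row-by-row loop above can also be written with vectorized membership tests. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: `add_career_info_vectorized` is a hypothetical name, and the toy `data` frame and the three id lists are made-up stand-ins for the objects built earlier.

```python
import pandas as pd

# Made-up stand-ins (assumption) for the id lists built in the previous cell
business = ['00001', '00002']
politics = ['00002']
colonial = ['00003']

def add_career_info_vectorized(data):
    """Add prof_* indicators: 1/0 for known ids, missing where b1-nummer is missing."""
    out = data.copy()
    known = out['b1-nummer'].notna()
    for col, ids in [('prof_business', business),
                     ('prof_politics', politics),
                     ('prof_colonial', colonial)]:
        # isin gives True/False per row; where() blanks rows without an id
        out[col] = out['b1-nummer'].isin(ids).astype(float).where(known)
    return out

# Toy data (assumption): three politicians and one candidate without an id
data = pd.DataFrame({'Naam': ['A', 'B', 'C', 'D'],
                     'b1-nummer': ['00001', '00002', '00003', None]})
result = add_career_info_vectorized(data)
```

On a frame of a few thousand rows this avoids building one small DataFrame per row, which is where most of the loop's time goes.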

In [153]:
#data_with_vars.to_csv("test.csv", sep="\t", index=False)

## Calculate Distance to The Hague (Birthplace)

Steps:

- Find all unique birthplaces.
- Look up the distance to The Hague for each unique birthplace, using the function below.
- Merge the resulting dataset with `data_with_vars`.
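
The steps above can be sketched without network access as follows. This is an illustration under stated assumptions, not the notebook's method: the coordinates in `coords` are approximate hand-entered values standing in for geocoder results, `distance_to_hague` is a hypothetical helper, and the haversine formula is swapped in for the geodesic distance that `geopy` computes.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Approximate coordinates (assumption), standing in for geocoder output
coords = {
    'Den Haag': (52.07, 4.30),
    'Amsterdam': (52.37, 4.90),
    'Paris': (48.86, 2.35),
}

def distance_to_hague(place, cap=250):
    """Distance from place to The Hague in km, capped at `cap`; None if unknown."""
    if place not in coords:
        return None
    d = haversine_km(*coords['Den Haag'], *coords[place])
    return min(d, cap)

# Step 1: unique birthplaces; step 2: one lookup per place; step 3: map back to rows
births = ['Amsterdam', 'Paris', 'Amsterdam', 'Onbekend']
lookup = {p: distance_to_hague(p) for p in set(births)}
distances = [lookup[p] for p in births]
```

Looking up each unique place once keeps the number of geocoding requests equal to the number of distinct birthplaces rather than the number of rows.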

In [154]:
import pandas as pd
import json
from opencage.geocoder import OpenCageGeocode
from geopy import distance

In [172]:
# Missing birthplaces are NaN rather than None, so drop them with dropna()
birthplaces = data_with_vars['place_of_birth'].dropna().unique().tolist()
listwithplaces = pd.DataFrame(birthplaces, columns=['place_of_birth'])

In [173]:
import os
# API key read from the environment; the variable name OPENCAGE_API_KEY is an assumption
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)

In [176]:
def find_distance(listwithplaces):
    
    rows = []
    
    # Geocode The Hague once; it is the reference point for every distance
    the_hague = geocoder.geocode('Den Haag')
    lat_hag = the_hague[0]['geometry']['lat']
    lng_hag = the_hague[0]['geometry']['lng']
    
    for i in tqdm(range(len(listwithplaces))):
        
        try:
            result_A = geocoder.geocode(listwithplaces.iloc[i]['place_of_birth'])
            lat_A = result_A[0]['geometry']['lat']
            lng_A = result_A[0]['geometry']['lng']
            
            afstand = distance.distance((lat_hag, lng_hag), (lat_A, lng_A)).kilometers
            
            # Cap distances at 250 km
            afstand = min(afstand, 250)
            
        except Exception:
            # Place not found or geocoding failed: leave the distance missing
            afstand = None
        
        interim = pd.DataFrame([listwithplaces.iloc[i]])
        interim['distance_bp_hag'] = afstand
        
        rows.append(interim)
    
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(rows) if rows else pd.DataFrame()
    

In [178]:
# find the distances
distances = find_distance(listwithplaces)

100%|██████████| 278/278 [02:00<00:00,  2.30it/s]


In [181]:
# merge out with data_with_vars
data_with_vars = pd.merge(data_with_vars, 
        distances,
        how='left',
        left_on='place_of_birth',
        right_on='place_of_birth')

data_with_vars

Unnamed: 0,Naam,name_in_all_elections,name_in_elected_people,Aanbevolen door,Aantal stemmen,Procentueel,District,Verkiezingdatum,Type,Omvang electoraat,...,district_pop_1859,district_pop_1889,birthplace_agri,birthplace_indus,birthplace_serv,district_agri,district_indus,district_serv,party_category,distance_bp_hag
0,S.A. de Moraaz,S.A. de Moraaz,S.A. de Moraaz,,503.0,52.84%,Alkmaar,1848-11-30,algemeen,1107,...,6964.666667,7853.5,0.018379,0.35495,0.282929,0.03674,0.236253,0.410706,liberal,17.462269
1,G. van Leeuwen,,,,438.0,46.01%,Alkmaar,1848-11-30,algemeen,1107,...,6964.666667,7853.5,0.03674,0.236253,0.410706,0.03674,0.236253,0.410706,none,67.434394
2,mr. H.J. Smit,mr. H.J. Smit,H.J. Smit,,1566.0,79.86%,Alkmaar,1850-08-27,algemeen,2833,...,6964.666667,7853.5,0.16846,0.182014,0.351549,0.03674,0.236253,0.410706,liberal,182.765326
3,S.A. de Moraaz,S.A. de Moraaz,S.A. de Moraaz,,1275.0,65.02%,Alkmaar,1850-08-27,algemeen,2833,...,6964.666667,7853.5,0.018379,0.35495,0.282929,0.03674,0.236253,0.410706,liberal,17.462269
4,jhr.mr. C. van Foreest,jhr.mr. C. van Foreest,C. van Foreest,,685.0,34.93%,Alkmaar,1850-08-27,algemeen,2833,...,6964.666667,7853.5,0.03674,0.236253,0.410706,0.03674,0.236253,0.410706,protestant,67.434394
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8462,B. Luteraan,,,SDP,32.0,0.41%,Zwolle,1913-06-17,algemeen,8889,...,12768.666667,13192.0,,,,0.046737,0.271531,0.402851,none,
8463,F.M. Knobel,F.M. Knobel,F.M. Knobel,VL(Lib/VD/SDAP),4249.0,50.15%,Zwolle,1913-06-25,herstemming,8889,...,12768.666667,13192.0,0.002355,0.240244,0.464383,0.046737,0.271531,0.402851,liberal,51.422032
8464,A. baron van Dedem,A. baron van Dedem,A. baron van Dedem,CHU(Ka/AR),4223.0,49.85%,Zwolle,1913-06-25,herstemming,8889,...,12768.666667,13192.0,0.624912,0.088215,0.087509,0.046737,0.271531,0.402851,protestant,141.140276
8465,F.M. Knobel,F.M. Knobel,F.M. Knobel,VL,3236.0,86.22%,Zwolle,1917-06-15,algemeen,9645,...,12768.666667,13192.0,0.002355,0.240244,0.464383,0.046737,0.271531,0.402851,liberal,51.422032


## Final cleanup

Clean up some variables before exporting to .csv.

In [None]:
# Clean up some variables to make them numeric (years) instead of timedeltas
def convert_to_num(x):
    # pd.notna also catches NaN and NaT, which `x != None` lets through
    if pd.notna(x):
        out = x.days/365
    else:
        out = None
    return out

def convert_to_num2(x):
    # Returns None for anything without a .days attribute (e.g. None or NaN)
    try:
        out = x.days/365
    except AttributeError:
        out = None
    return out

data_with_vars['age_of_death'] = data_with_vars['age_of_death'].apply(convert_to_num2)
data_with_vars['age_at_election'] = data_with_vars['age_at_election'].apply(convert_to_num)
data_with_vars['tenure'] = data_with_vars['tenure'].apply(convert_to_num)
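
As a quick check of the conversion, the sketch below applies the same `days/365` rule to plain `datetime` values. The dates are invented for illustration and `timedelta_to_years` is a hypothetical name. Because dividing by 365 ignores leap days, the result slightly overstates the age in calendar years.

```python
from datetime import date

def timedelta_to_years(x):
    """Convert a timedelta to years using days/365; return None for missing values."""
    try:
        return x.days / 365
    except AttributeError:
        return None

# Invented dates (assumption), not taken from the data
born = date(1800, 1, 1)
died = date(1870, 1, 1)
age_of_death = timedelta_to_years(died - born)
missing = timedelta_to_years(None)
```

Here the 70 calendar years contain 17 leap days, so `age_of_death` comes out a little above 70.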

## Export to csv