## from_full_data_to_full_data_all_vars.ipynb

In this notebook, I extract all remaining variables for an analysis with covariates. It is modeled after `find_remaining_variables.ipynb`, except that the death date (`sterfdatum`) has to be added at the beginning. After that step, the script follows the aforementioned notebook more or less exactly. So, the approach is:

   - Add the sterfdatum (death date)
   - Redo the entire script as in `find_remaining_variables.ipynb`
   

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

import re
from itertools import compress

from matplotlib import style
from matplotlib import pyplot as plt
import seaborn as sns

from rdrobust import rdrobust,rdbwselect,rdplot



In [12]:
data = pd.read_csv("../Data/analysis/full_sample_analysis_novars.csv", dtype={'b1-nummer':str}).iloc[:, 1:]
data.head(5)

| Unnamed: 0 | Naam | name_in_all_elections | name_in_elected_people | Aanbevolen door | Aantal stemmen | Procentueel | District | Verkiezingdatum | Type | Omvang electoraat | ... | deflated_wealth | verk_2_gewonnen | verk_3_gewonnen | verk_4_gewonnen | verk_5_gewonnen | verk_6_gewonnen | verk_7_gewonnen | verk_8_gewonnen | verk_9_gewonnen | verk_10_gewonnen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S.A. de Moraaz | S.A. de Moraaz | S.A. de Moraaz | | 503.0 | 52.84% | Alkmaar | 1848-11-30 | algemeen | 1107 | ... | | 1.0 | 0.0 | 0.0 | | | | | | |
| 1 | G. van Leeuwen | | | | 438.0 | 46.01% | Alkmaar | 1848-11-30 | algemeen | 1107 | ... | 12355.95389 | | | | | | | | | |
| 2 | mr. H.J. Smit | mr. H.J. Smit | H.J. Smit | | 1566.0 | 79.86% | Alkmaar | 1850-08-27 | algemeen | 2833 | ... | 306967.902828 | 1.0 | | | | | | | | |
| 3 | S.A. de Moraaz | S.A. de Moraaz | S.A. de Moraaz | | 1275.0 | 65.02% | Alkmaar | 1850-08-27 | algemeen | 2833 | ... | | 0.0 | 0.0 | | | | | | | |
| 4 | jhr.mr. C. van Foreest | jhr.mr. C. van Foreest | C. van Foreest | | 685.0 | 34.93% | Alkmaar | 1850-08-27 | algemeen | 2833 | ... | | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | |


In [13]:
## Add the sterfdatum
sterfdatum_politici = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", 
              dtype={'b1-nummer':str},
              sheet_name=1)
sterfdatum_politici = sterfdatum_politici[sterfdatum_politici['rubriek'] == 3020][['b1-nummer', 'datum']]

data = pd.merge(data, sterfdatum_politici, how='left',
         left_on='b1-nummer',
         right_on='b1-nummer').rename(columns={'datum':'sterfdatum'})

In [14]:
nonpols = pd.read_csv('../Data/analysis/unmatched_sample_analysis.csv').iloc[:,1:]
nonpols = nonpols[nonpols['b1-nummer'].isna()][['Naam', 'Sterfdatum']].drop_duplicates()
nonpols = nonpols[~nonpols['Sterfdatum'].isna()]

In [15]:
data = pd.merge(data, nonpols,
        how='left',
        left_on='Naam',
         right_on='Naam').rename(columns={'Sterfdatum':'Dod'})

data['Sterfdatum'] = np.where(data['sterfdatum'].isna(), data['Dod'], data['sterfdatum'])

data = data.drop(columns=['sterfdatum', 'Dod'])
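
The fallback rule in the cell above (take the politician's death date where it exists, otherwise the non-politician's) can be checked on a small toy frame. This is only a sketch: the names and dates are invented, and `combine_first` is shown as an alternative to `np.where` that should give the same result for this pattern.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one row with a politician death date,
# one with only a non-politician death date, one with neither.
toy = pd.DataFrame({
    'Naam': ['A', 'B', 'C'],
    'sterfdatum': ['1870-01-01', np.nan, np.nan],
    'Dod': [np.nan, '1880-02-02', np.nan],
})

# Same rule as in the notebook cell: fall back to 'Dod' when 'sterfdatum' is missing
via_where = pd.Series(np.where(toy['sterfdatum'].isna(), toy['Dod'], toy['sterfdatum']))

# Alternative spelling of the same rule
via_combine = toy['sterfdatum'].combine_first(toy['Dod'])
```

Rows with neither date stay missing in both versions.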

## Load some datasets I need

From here, proceed as in `find_remaining_variables.ipynb`.

Now, we load a couple of datasets, which will be merged into the dataframe called `data` in a couple of steps.

We also define a function that cleans up the birthplaces of politicians. The birthplaces of non-politicians have already been cleaned up.

In [16]:
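def cleanup(x):
    """Helper to clean up a politician's birthplace."""
    # Drop any parenthesised addition (raw string avoids invalid-escape warnings)
    step1 = re.sub(r'\((.+)\)', '', x)
    # Harmonise the spelling of two city names
    step2 = re.sub("'s-Gravenhage", 'Den Haag', step1)
    step3 = re.sub("'s-Hertogenbosch", "Den Bosch", step2)
    # Remove leading and trailing whitespace
    step4 = step3.strip()
    
    return step4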
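
To illustrate what the helper does, here is a small sketch with made-up birthplace strings. The inputs are assumptions for illustration, not values taken from the data, and the helper is repeated so that the example is self-contained.

```python
import re

def cleanup(x):
    """Same helper as above, repeated so the example runs on its own."""
    step1 = re.sub(r'\((.+)\)', '', x)
    step2 = re.sub("'s-Gravenhage", 'Den Haag', step1)
    step3 = re.sub("'s-Hertogenbosch", 'Den Bosch', step2)
    return step3.strip()

# Invented example inputs
examples = ["'s-Gravenhage", "Bergen (NH)", " 's-Hertogenbosch "]
cleaned = [cleanup(s) for s in examples]
```

Note that the pattern `\((.+)\)` is greedy: if a string contained two parenthesised parts, everything from the first opening parenthesis to the last closing one would be removed.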

In [17]:
# Election and election history data
## Some datasets which I need
electoral_data = pd.read_csv("../Data/elections/election_results_details.csv").iloc[:,1:]
electoral_data.iloc[:,[2,7,8,9,10,11,12,13]] = electoral_data.iloc[:,[2,7,8,9,10,11,12,13]].apply(lambda x: pd.to_numeric(x, errors='coerce'))
electoral_data['Verkiezingdatum'] = electoral_data['Verkiezingdatum'].apply(lambda x: pd.Timestamp(x))

## Seats data
zetels = electoral_data.groupby(['District', 'Verkiezingdatum']).agg({'Aantal zetels': 'mean'})

## Politician metadata
politician_metadata = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", dtype={'b1-nummer':str})
politician_metadata2 = pd.read_excel("../Data/politician_data/tk_1815tot1950uu.xlsx", sheet_name = 1, dtype={'b1-nummer':str})

### Clean up the variable 'waarde' (birthplace from politician metadata)
politician_metadata2['waarde'] = politician_metadata2['waarde'].apply(lambda x: cleanup(x))

## Nonpolitician metadata
nonpolitician_metadata = pd.read_csv("../Data/nonpolitician_data/nonpoliticians_birthplace_birthdates.csv").iloc[:,1:]

## Taxes and population (district)
taxes_pop = pd.read_csv('../Data/district_data/taxes_and_population.csv').iloc[:,1:]

## Religious composition over time (district)
religious_comp = pd.read_csv("../Data/district_data/religion_over_time.csv").iloc[:,1:]

## Professional composition 1889 (district)
prof_comp = pd.read_csv("../Data/district_data/professional_composition.csv").iloc[:,1:]

## Clean the religious composition dataset 

- Sum the denominations into protestant, catholic and other (overig), and combine the results in one dataframe, to be used later.

In [18]:
prot = ['Doopsgezinden', 'Evangelisch Luthers', 'Nederlands Hervormden', 
        'overige kerkelijke gezindte', 'Remonstranten',
        'Anglikaans Episcopalen', 'Christelijk Afgescheidenen',
        'Engelse Presbyterianen', 'Hernhutters',
       'Hersteld Evangelisch Luthersen', 'Schotse Gemeente', 'Waals Hervormden',
        'Gereformeerde Kerken', 'Christelijk Gereformeerden']

kath = ['Oud Katholieken', 'Rooms-Katholieken']

In [19]:
def sum_catholic(groups):
    """Number of Catholics within one (name, year) group."""
    groups = groups[groups['information'].isin(kath)]
    return groups['aantal'].sum()
    
def sum_protestant(groups):
    """Number of Protestants within one (name, year) group."""
    groups = groups[groups['information'].isin(prot)]
    return groups['aantal'].sum()

def sum_overig(groups):
    """Number of people in neither the Protestant nor the Catholic lists."""
    groups = groups[~(groups['information'].isin(prot)) & ~(groups['information'].isin(kath))]
    return groups['aantal'].sum()

In [20]:
protestant = religious_comp.groupby(['name','year']).apply(lambda x: sum_protestant(x)).reset_index().rename(columns={0:'protestant'})
catholic = religious_comp.groupby(['name','year']).apply(lambda x: sum_catholic(x)).reset_index().rename(columns={0:'catholic'})
overig = religious_comp.groupby(['name','year']).apply(lambda x: sum_overig(x)).reset_index().rename(columns={0:'overig'})

In [21]:
religious_comp = pd.merge(protestant, catholic, on=['name', 'year']).merge(overig, on = ['name','year'])
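
The three group-wise sums and the two merges above can also be written as one pass over the data. The following is a sketch on invented toy numbers, not the notebook's actual method: it assumes the same column names (`name`, `year`, `information`, `aantal`), uses shortened denomination lists, and the helper `classify` and the label `'Israelieten'` are hypothetical.

```python
import pandas as pd

# Shortened lists, for the example only
prot = ['Nederlands Hervormden', 'Remonstranten']
kath = ['Oud Katholieken', 'Rooms-Katholieken']

# Toy data (invented numbers)
toy = pd.DataFrame({
    'name': ['Alkmaar'] * 4,
    'year': [1849] * 4,
    'information': ['Nederlands Hervormden', 'Remonstranten',
                    'Rooms-Katholieken', 'Israelieten'],
    'aantal': [100, 10, 50, 5],
})

def classify(denomination):
    """Map a denomination label to one of the three groups (hypothetical helper)."""
    if denomination in prot:
        return 'protestant'
    if denomination in kath:
        return 'catholic'
    return 'overig'

# One row per (name, year), one column per group
wide = (toy.assign(group=toy['information'].map(classify))
           .pivot_table(index=['name', 'year'], columns='group',
                        values='aantal', aggfunc='sum', fill_value=0)
           .reset_index())
```

The result has the same shape as `religious_comp` after the merges: one row per district and year, with columns `protestant`, `catholic` and `overig`.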

## Function to add the first batch of variables

This function adds:

- Nearest competitor margin in the present election

- Percentage won by liberal candidates

- Percentage won by socialist candidates

- Percentage won by confessional candidates

- HDNG data on birthplace and district characteristics (in folder `district_data`)