## elections_to_full_data

This notebook starts from the list of elections and carries out several steps. The first objective is to match the names in the election results with the names of the politicians, and also with the names of the non-politicians whose data we have collected. 

We first load the packages, then clean up the dataset and compute the margin of victory of all the politicians. 

In [1]:
# Load the libraries
import re
import statistics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_ods_reader import read_ods
from tqdm import tqdm

# Project-specific helper functions
from functions import *

from thefuzz import fuzz
from thefuzz import process

In [2]:
# Import the election data
elected_people = pd.read_csv("../Data/elections/allelected.csv", encoding='latin-1')

In [3]:
# Full name of each elected person
elected_people['naam'] = elected_people['voornaam'] + ' ' + elected_people['achternaam']

# Election date as a Timestamp, built from the separate day, month and year columns
elected_people['verkiezingdatum'] = elected_people.apply(
    lambda x: pd.Timestamp(year=x['jaar'], month=x['maand'], day=x['dag']), axis=1
)

In [4]:
election_results_details = pd.read_csv("../Data/elections/election_results_details.csv").iloc[:,1:]
# Dates in this file are stored as day/month/year, so parse them day-first
election_results_details['Verkiezingdatum'] = pd.to_datetime(
    election_results_details['Verkiezingdatum'], dayfirst=True
)

def get_zetels(df):
    """Mean number of seats within one district-election; non-numeric entries become NaN and are skipped."""
    return pd.to_numeric(df['Aantal zetels'], errors='coerce').mean()

# Number of seats at stake per district and election date
aantal_zetels = (election_results_details.groupby(['District', 'Verkiezingdatum']).
                 apply(get_zetels).reset_index().rename(columns={0:'Aantal zetels'})
                )

In [5]:
#pd.merge(elected_people, aantal_zetels, 
#         left_on=['districtsnaam', 'verkiezingdatum'],
#         right_on=['District', 'Verkiezingdatum']).drop(columns=['District', 'Verkiezingdatum'])


In [6]:
all_candidates = pd.read_csv("../Data/elections/election_results_details.csv").iloc[:,1:]


# Dates are stored as day/month/year strings; parse them into Timestamps
all_candidates['Verkiezingdatum'] = pd.to_datetime(all_candidates['Verkiezingdatum'], format='%d/%m/%Y')

# Seats and votes: non-numeric entries become NaN
all_candidates['Aantal zetels'] = pd.to_numeric(all_candidates['Aantal zetels'], errors='coerce')
all_candidates['Aantal stemmen'] = pd.to_numeric(all_candidates['Aantal stemmen'], errors='coerce')

# Total number of votes cast in each district-election
aantal_stemmen = (all_candidates.groupby(['District', 'Verkiezingdatum']).
                  apply(lambda x: sum(x['Aantal stemmen'])).
                  reset_index().
                  rename(columns={0:'totaal aantal stemmen'})
                 )

# Attach the total to every candidate row of that election
all_candidates = pd.merge(all_candidates, aantal_stemmen,
                          on=['District', 'Verkiezingdatum'])

# Within each district-election, order candidates from most to fewest votes
all_candidates = (all_candidates.groupby(['District', 'Verkiezingdatum']).
                  apply(lambda x: x.sort_values(['Aantal stemmen'], ascending=False))
                 ).reset_index(drop=True)

# Rank within the election (1 = most votes)
all_candidates['hoeveelste_in_verkiezing'] = (all_candidates.groupby(['District', 'Verkiezingdatum']).
                                              cumcount() + 1)

# A candidate wins if their rank does not exceed the number of seats
all_candidates['gewonnen'] = np.where(
    all_candidates['hoeveelste_in_verkiezing'] <= all_candidates['Aantal zetels'], 1, 0)
# Marginal winner: takes the last available seat; marginal loser: first rank without a seat
all_candidates['marginal_winner'] = np.where(
    all_candidates['Aantal zetels'] - all_candidates['hoeveelste_in_verkiezing'] == 0, 1, 0)
all_candidates['marginal_loser'] = np.where(
    all_candidates['Aantal zetels'] - all_candidates['hoeveelste_in_verkiezing'] == -1, 1, 0)

In [7]:
all_candidates = get_margin(all_candidates)

100%|██████████| 8238/8238 [01:17<00:00, 105.74it/s]
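
`get_margin` is imported from `functions` and its definition is not shown in this notebook. Purely as an illustration, the sketch below computes one common definition of the margin of victory on toy data: for winners, the vote-share gap to the marginal loser; for losers, the gap to the marginal winner. The helper name `sketch_margin`, the `margin` column, the toy rows and this definition are all assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy single-seat election (hypothetical data)
toy = pd.DataFrame({
    'District': ['A', 'A', 'A'],
    'Verkiezingdatum': [pd.Timestamp('1888-03-06')] * 3,
    'Naam': ['Jansen', 'De Vries', 'Bakker'],
    'Aantal stemmen': [600, 300, 100],
    'totaal aantal stemmen': [1000, 1000, 1000],
    'gewonnen': [1, 0, 0],
    'marginal_winner': [1, 0, 0],
    'marginal_loser': [0, 1, 0],
})

def sketch_margin(df):
    """Hypothetical margin: vote-share distance to the marginal opponent in the same election."""
    out = df.copy()
    keys = [out['District'], out['Verkiezingdatum']]
    share = out['Aantal stemmen'] / out['totaal aantal stemmen']
    # Broadcast the marginal winner's and marginal loser's share to every row of the election
    winner_share = share.where(out['marginal_winner'] == 1).groupby(keys).transform('max')
    loser_share = share.where(out['marginal_loser'] == 1).groupby(keys).transform('max')
    # Winners are compared with the marginal loser, losers with the marginal winner
    out['margin'] = np.where(out['gewonnen'] == 1, share - loser_share, share - winner_share)
    return out

result = sketch_margin(toy)
print(result[['Naam', 'margin']])
```

Under this definition winners get a positive margin and losers a negative one, which is the form a close-election comparison would need.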


In [8]:
#get_match(elected_people).to_csv("../Data/politician_data/key_allelected_to_all_candidates.csv", index = False)
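
`get_match` also lives in `functions` and is not shown here; the notebook imports `thefuzz` for fuzzy string matching. The sketch below illustrates the general idea of building such a name key, but uses `difflib` from the standard library in place of `thefuzz` so that it runs without extra dependencies. The function name `sketch_match`, the 0.8 cutoff and the example names are assumptions; only the two output field names mirror the columns of the key file.

```python
import difflib

def sketch_match(elected_names, candidate_names, cutoff=0.8):
    """Hypothetical key builder: pair each elected name with its closest candidate name, if any."""
    rows = []
    for name in elected_names:
        # Best candidate whose similarity ratio is at least `cutoff`, else an empty list
        best = difflib.get_close_matches(name, candidate_names, n=1, cutoff=cutoff)
        rows.append({
            'name_in_elected_people': name,
            'name_in_all_elections': best[0] if best else None,
        })
    return rows

key_rows = sketch_match(
    ['Abraham Kuyper', 'Jan Jansen'],          # names as spelled in the elected list
    ['Abraham Kuijper', 'Pieter de Vries'],    # names as spelled in the candidate list
)
for row in key_rows:
    print(row)
```

Names without a sufficiently close counterpart are kept with `None`, so they can be reviewed by hand rather than silently dropped.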

In [9]:
# Key linking candidate names to the names used in elected_people
key = pd.read_csv("../Data/politician_data/key_allelected_to_all_candidates.csv")

# Left merge: candidates without a match in the key are kept
all_candidates = pd.merge(all_candidates, key, 
         how='left',
         left_on='Naam',
         right_on='name_in_all_elections')

# Move the three name columns to the front
cols_to_order = ['Naam', 'name_in_all_elections', 'name_in_elected_people']
new_columns = cols_to_order + (all_candidates.columns.drop(cols_to_order).tolist())

all_candidates = all_candidates[new_columns]

In [10]:
# (election date, district) pair for every row in elected_people
consequential_elections = list(zip(elected_people['verkiezingdatum'], elected_people['districtsnaam']))


In [61]:
complete_elections_dataset = get_elec_stats(all_candidates)

100%|██████████| 8238/8238 [05:50<00:00, 23.48it/s]


In [None]:
complete_elections_dataset.head(30)

## To Do:

Next, I have to match the names of the (ever-)elected people to the names in the PDC dataset. This way, I can merge the PDC data for two categories of people:

   - Politicians
   - Unsuccessful future or past politicians
    
After that, the only candidates left to merge are those who were never successful, whose names are supposed to match exactly. 

   - Check whether this is in fact true, i.e. whether all of these observations can be found in complete_elections_dataset.
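
The check in the last item could be sketched with a left merge and `indicator=True`, which flags rows that have no exact counterpart. The toy frames and the names `never_elected`, `complete` and `unmatched` are assumptions; in this notebook the right-hand side would be `complete_elections_dataset`.

```python
import pandas as pd

# Hypothetical stand-ins for the real data
never_elected = pd.DataFrame({'Naam': ['J. de Vries', 'P. Bakker', 'K. Smit']})
complete = pd.DataFrame({'Naam': ['J. de Vries', 'P. Bakker', 'A. Kuijper']})

# Left merge on the exact name; `_merge` records whether a counterpart was found
check = never_elected.merge(
    complete[['Naam']].drop_duplicates(),
    on='Naam',
    how='left',
    indicator=True,
)

# Names that do not appear in the complete dataset
unmatched = check.loc[check['_merge'] == 'left_only', 'Naam'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the exact-match assumption holds.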