The purpose of this notebook is to combine data from our 2 data sources.

Specifically look at the following 2 webpages, with different data on the same election:
- https://www.electionsireland.org/result.cfm?election=1977&cons=219
- https://www.irelandelection.com/election.php?elecid=11&electype=1&constitid=48

Note: The quotas are different.

For every elections we want
- Number of Constituencies
- How many consituency do we have vote data on?
- What was the quota?
- What was the votes/quota in first count?
- What was the lowest votes/quota?
- What was the highest votes/quota?
- Who transfered to who (if you have transfer data)

What parties do well in what areas in the last 25 years? 
eg total town council, dail seats, and other have been elected for a given area. eg sf:3 ff:16 fg: 14 

In [1]:
import pandas as pd
import numpy as np

### Introducing the first data set

The first dataset comes from https://www.electionsireland.org/ 

This data has date 12 columns:

- date: _str_, This is sometimes the exact date of the election, othertimes its just the year, other times its the year and the month: needs to be combined with the other data set which has the year.

- election_type: _str_, Is the type of election the candidate was running in, eg Local, Seanad, Dail, Bi-Election, This data needs to cleaned because it also includes intra-party elections and resignations. (Interesting data but not needed for this analysis)

- party: _str_, The party a candidate ran with, or for independents, it lists them as belonging to a party called independent.

- status: _str_, If they were elected or not. This data needs to be cleaned as it includes other information such as whether the candidate made the cutoff for expenses. 

- constituency_name: _str_, The name of the constituency the candidate ran in.

- seat: _int_, Contains the order in which the candidate was elected, if the candidate wasnt elected then its None.

- count_eliminated: _int_, Which count the candidate was either **elected** or **lost**. 

- first_pref_count _int_, The count of first preference votes recieved. 

- first_pref_pct: _float_, The % of all first preference votes a candidate recieved. 

- pct_of_quota_reached_with_first_pref: _float_, The % of the quota that a candidate reached with first preference votes.

- ran_unopposed: _bool_, True if the candidate ran unopposed (more common in earlier elections)

- candidate: _str_, The candidate's name

- candidate_ID: _int_, A unique ID for each candidate 


In [2]:
df1 = pd.read_parquet('electionsireland_data/ElectionsIreland_candidate.parquet')
df1 = df1.rename(columns={'ID':'candidate_ID'})
print(df1.columns)
print(df1.shape)
df1.head()

Index(['date', 'election_type', 'party', 'status', 'constituency_name', 'seat',
       'count_eliminated', 'first_pref_count', 'first_pref_pct',
       'pct_of_quota_reached_with_first_pref', 'ran_unopposed', 'candidate',
       'candidate_ID'],
      dtype='object')
(30070, 13)


Unnamed: 0,date,election_type,party,status,constituency_name,seat,count_eliminated,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,candidate,candidate_ID
0,23 June 1960,By Election,Labour,Not Elected,Carlow Kilkenny,,,7678.0,0.2016,,False,Seamus Pattison,1
1,1961,17th Dail,Labour,Elected,Carlow Kilkenny,4.0,,4116.0,0.0954,,False,Seamus Pattison,1
2,1965,18th Dail,Labour,Elected,Carlow Kilkenny,3.0,,6299.0,0.1408,,False,Seamus Pattison,1
3,1969,19th Dail,Labour,Elected,Carlow Kilkenny,4.0,,6041.0,0.1311,,False,Seamus Pattison,1
4,1973,20th Dail,Labour,Elected,Carlow Kilkenny,4.0,,5300.0,0.1134,,False,Seamus Pattison,1


### Introducing the second dataset

The first dataset comes from https://www.irelandelection.com/ 

This data has date 10 columns:

- election: _str_, Contains a string describing the election usually just the consitituency and the date of the election

- elected: _bool_, If they were elected or not. This data needs to be combined with Status from the first dataframe. 

- party: _str_, The party a candidate ran with

- first_pref_pct: _float_, The % of all first preference votes a candidate recieved. 

- first_pref_count _int_, The count of first preference votes recieved. 

- first_pref_quota_ratio: _float_, The % of the quota that a candidate reached with first preference votes.

- year: _int_, The year of the election. *NB:* Sometimes there is two elections of the same kind in the same year, (eg 2 dail elections in 1982).

- election_type: _str_, Is the type of election the candidate was running in, eg Local, Seanad, Dail, Bi-Election.

- candidate: _str_, The candidate's name. Unfortunatly there is 2 candidates have the same, they are essentially grouped as the same person, we fix this by linking records with the first dataset. 

- constituency: _str_, The name of the constituency the candidate ran in.

- election_type: _str_, Is the type of election the candidate was running in, eg Local, Dail, Bi-Election.

In [3]:
df2 = pd.read_parquet('irelandelection/ALL_CANDIDATES.parquet')
print(df2.columns)
print(df2.shape)

df2.head()

Index(['election', 'elected', 'party', 'first_pref_pct', 'first_pref_count',
       'first_pref_quota_ratio', 'year', 'candidate', 'constituency',
       'election_type'],
      dtype='object')
(36243, 10)


Unnamed: 0,election,elected,party,first_pref_pct,first_pref_count,first_pref_quota_ratio,year,candidate,constituency,election_type
0,2004 Local Election - Thomastown,True,Labour Party,0.085,641,0.51,2004,Ann Phelan,Thomastown,LOCAL
1,2009 Local Election - Thomastown,True,Labour Party,0.156,1183,0.78,2009,Ann Phelan,Thomastown,LOCAL
2,2011 general election - Carlow–Kilkenny,True,Labour Party,0.109,8072,0.66,2011,Ann Phelan,Carlow–Kilkenny,GENERAL
3,2016 general election - Carlow–Kilkenny,False,Labour Party,0.063,4391,0.38,2016,Ann Phelan,Carlow–Kilkenny,GENERAL
0,1982 (Feb) general election - Carlow–Kilkenny,False,Fianna Fáil,0.017,907,0.1,1982,John McGuinness,Carlow–Kilkenny,GENERAL


### Cleaning Dataframe 1

So we need to clean the following columns in dataframe 1:
- seat and count_eliminated can be dropped as for this analysis we dont really care about the order in which candidates were elected. 
- date, this needs to be converted to a year columns.
- election_type, this should read the same as dataframe 2 with options like GENERAL, LOCAL and SEANAD
- status, will be kept, but we want a column that is a simple bool TRUE or FALSE was the candidate elected.


In [4]:
# dropping seat and count_eliminated

df1 = df1
df1.head(3)

Unnamed: 0,date,election_type,party,status,constituency_name,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,candidate,candidate_ID
0,23 June 1960,By Election,Labour,Not Elected,Carlow Kilkenny,7678.0,0.2016,,False,Seamus Pattison,1
1,1961,17th Dail,Labour,Elected,Carlow Kilkenny,4116.0,0.0954,,False,Seamus Pattison,1
2,1965,18th Dail,Labour,Elected,Carlow Kilkenny,6299.0,0.1408,,False,Seamus Pattison,1


In [5]:
#cleaning date

def get_year_from_date_string(date_str):
    if date_str == None:
        return 0
    elif len(date_str) > 4:
        try:
            return int(date_str[-4:])
        except:
            if isinstance(date_str[-1],int):#last letter is an int
                return date_str[-4:]
    else:
        return int(date_str)

df1 = df1.reset_index().drop(columns=['index'])
df1['year'] = df1.date.apply(get_year_from_date_string)
df1 = df1.drop(columns=['date'])

df1.head(3)

Unnamed: 0,election_type,party,status,constituency_name,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,candidate,candidate_ID,year
0,By Election,Labour,Not Elected,Carlow Kilkenny,7678.0,0.2016,,False,Seamus Pattison,1,1960.0
1,17th Dail,Labour,Elected,Carlow Kilkenny,4116.0,0.0954,,False,Seamus Pattison,1,1961.0
2,18th Dail,Labour,Elected,Carlow Kilkenny,6299.0,0.1408,,False,Seamus Pattison,1,1965.0


In [6]:
#cleaning election_type

def get_election_type_from_string(election_type_str):
    if 'Town' in election_type_str or 'Local' in election_type_str :
        return 'LOCAL'
    elif 'Dail' in election_type_str:
        return 'GENERAL'
    elif 'Seanad' in election_type_str:
        return 'SEANAD'
    elif 'Westminster' in election_type_str:
        return 'Westminster'.upper()
    elif 'European' in election_type_str:
        return 'EUROPEAN'
    elif 'By Election' in election_type_str:
        return 'BI-ELECTION'
    else: # for the rows that represent resignations or appointments or some other event in a politicans career
        return None

df1['election_type'] = df1.election_type.apply(get_election_type_from_string)
print(df1.election_type.unique())

['BI-ELECTION' 'GENERAL' 'EUROPEAN' None 'LOCAL' 'SEANAD' 'WESTMINSTER']


In [7]:
df1.status.unique()

array(['Not Elected', 'Elected', 'Appointed', 'Resigned', 'Disqualified',
       None, 'Co-opted', 'Candidate', 'Dublin', 'Lucan',
       'awaiting update', 'changed\xa0to',
       'Dublin University (Trinity College)', 'Died in office:',
       'Agricultural Panel', 'Gorey\xa0\xa0\xa0-\xa0\xa0\xa0Resigned',
       'Nominated by Taoiseach', 'Inishowen', '(Replaced  David McKenna)',
       '(Replaced  Michael Flynn)', 'Administrative Panel', 'Drumcliff',
       'Ballybrack\xa0\xa0\xa0-\xa0\xa0\xa0Resigned\n            \n              (ill health)',
       'Kilkenny\xa0\xa0\xa0-\xa0\xa0\xa0Resigned\n            \n              (dual mandate TD)'],
      dtype=object)

In [8]:
# cleaning status
def was_elected(status):
    status = str(status) # some of the status are None 

    if status == 'Elected':
        return True
    elif status == 'Not Elected':
        return False

    else: # for the rows that represent resignations or appointments or some other event in a politicans career
        return None

df1['elected'] = df1.status.apply(was_elected)
df1.head()

Unnamed: 0,election_type,party,status,constituency_name,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,candidate,candidate_ID,year,elected
0,BI-ELECTION,Labour,Not Elected,Carlow Kilkenny,7678.0,0.2016,,False,Seamus Pattison,1,1960.0,False
1,GENERAL,Labour,Elected,Carlow Kilkenny,4116.0,0.0954,,False,Seamus Pattison,1,1961.0,True
2,GENERAL,Labour,Elected,Carlow Kilkenny,6299.0,0.1408,,False,Seamus Pattison,1,1965.0,True
3,GENERAL,Labour,Elected,Carlow Kilkenny,6041.0,0.1311,,False,Seamus Pattison,1,1969.0,True
4,GENERAL,Labour,Elected,Carlow Kilkenny,5300.0,0.1134,,False,Seamus Pattison,1,1973.0,True


### Cleaing dataframe 2

Renaming:
Dataframe 2 has a column called ```first_pref_quota_ratio``` which is the same information as ```pct_of_quota_reached_with_first_pref``` in dataframe 1

In [9]:
df2 = df2.rename(columns={'first_pref_quota_ratio':'pct_of_quota_reached_with_first_pref','constituency':'constituency_name'}).reset_index().drop(columns=['index'])

print(df2.columns)
print(df2.shape)

Index(['election', 'elected', 'party', 'first_pref_pct', 'first_pref_count',
       'pct_of_quota_reached_with_first_pref', 'year', 'candidate',
       'constituency_name', 'election_type'],
      dtype='object')
(36243, 10)


### Fixing Constitenuecy names.

I noticed that the constituency names are slightly different

In [39]:
set(df1[df1['election_type'] =='LOCAL'].constituency_name.unique()).difference(set(df2[df2['election_type'] =='LOCAL'].constituency_name.unique()))
# I dont bother to fix this but we probably should get a list of all constituency names and the map them onto whatever the website calls them. 
# this is easy with dail elections but not so easy with local elections as i cant find anything about it

{'Adare Rathkeale',
 'Artane Whitehall',
 'Athenry Oranmore',
 'Bailieborough Cootehill',
 'Ballaghadereen',
 'Ballincollig Carrigaline',
 'Ballybay',
 'Ballybay Clones',
 'Ballyfermot Drimnagh',
 'Ballymote Tobercurry',
 'Ballymun Finglas',
 'Ballymun Whitehall',
 'Ballyshannon',
 'Bandon Kinsale',
 'Bantry West Cork',
 'Beaumont Donaghmede',
 'Belturbet TC',
 'Blanchardstown Mulhuddart',
 'Blarney Macroom',
 'Borris-in-Ossory Mountmellick',
 'Bray No 1',
 'Bray No 2',
 'Bray No 3',
 'Bray South',
 'Bundoran',
 'Cabra Finglas',
 'Cabra Glasnevin',
 'Callan Thomastown',
 'Cappaghmore Kilmallock',
 'Carlow No 1',
 'Carlow No 2',
 'Carrick on Suir',
 'Carrick-On-Shannon',
 'Carrickmacross Castleblaney',
 'Cashel Tipperary',
 'Cavan Belturbet',
 'Celbridge Leixlip',
 'Clare West',
 'Clonakilty',
 'Conamara',
 'Cootehill',
 'Cork East',
 'Cork West',
 'Crumlin Kimmage',
 'Drogheda No 1 Laurence Gate',
 'Drogheda No 2 West Gate',
 'Drogheda No 3 Duleek Gate',
 'Droichead Nua',
 'Dromahaire'

In [38]:
set(df1[df1['election_type'] =='GENERAL'].constituency_name.unique()).difference(set(df2[df2['election_type'] =='GENERAL'].constituency_name.unique()))

{'Antrim East',
 'Antrim Mid',
 'Antrim North',
 'Antrim South',
 'Armagh Mid',
 'Armagh North',
 'Armagh South',
 'Athlone Longford',
 'Belfast Cromac',
 'Belfast Duncairn',
 'Belfast Falls',
 'Belfast Ormeau',
 'Belfast Pottinger',
 'Belfast Shankill',
 "Belfast St Anne's",
 'Belfast Victoria',
 'Belfast Woodvale',
 'Carlow',
 'Carlow Kildare',
 'Carlow Kilkenny',
 'Cavan East',
 'Cavan Monaghan',
 'Cavan West',
 'Clare East',
 'Clare Galway South',
 'Clare West',
 'Cork',
 'Cork City North',
 'Cork City South',
 'Cork East and North East',
 'Cork Mid/North/South/South East and West',
 'Cork North Central',
 'Cork North East',
 'Cork North West',
 'Cork South Central',
 'Cork South East',
 'Cork South West',
 'Donegal Leitrim',
 'Donegal North',
 'Donegal North East',
 'Donegal South',
 'Donegal South West',
 'Down East',
 'Down Mid',
 'Down North',
 'Down South',
 'Down West',
 'Dublin College Green',
 'Dublin Harbour',
 'Dublin Mid',
 'Dublin Mid West',
 'Dublin North Central',
 'D

In [40]:
current_constituencies = pd.read_html('https://en.wikipedia.org/wiki/D%C3%A1il_constituencies',flavor='bs4')[3]
historic_constituencies = pd.read_html('https://en.wikipedia.org/wiki/Historic_D%C3%A1il_constituencies',flavor='bs4')[1]
historic_constituencies

Unnamed: 0,Constituency,County or city,Created,Abolished,Seats
0,Antrim,Antrim,1921,1922.0,7
1,Antrim East,Antrim,1918,1921.0,1
2,Antrim Mid,Antrim,1918,1921.0,1
3,Antrim North,Antrim,1918,1921.0,1
4,Antrim South,Antrim,1918,1921.0,1
...,...,...,...,...,...
255,Wexford North,Wexford,1918,1921.0,1
256,Wexford South,Wexford,1918,1921.0,1
257,Wicklow[aq],Wicklow,1923,,345
258,Wicklow East,Wicklow,1918,1921.0,1


In [41]:
current_constituencies

Unnamed: 0,Constituency,Seats
0,Carlow–Kilkenny,5
1,Cavan–Monaghan,5
2,Clare,4
3,Cork East,4
4,Cork North-Central,4
5,Cork North-West,3
6,Cork South-Central,4
7,Cork South-West,3
8,Donegal,5
9,Dublin Bay North,5


### Joining the two datasets:

Some of the data in Dataframe 1 is missings, hence why we are combining these 2 dataframes.
I remove entries from dataframe 1 that arent elected. 

Recall that some rows in dataframe 1 just represent a politician being appointed as minister or resigning, not always an election.

In [10]:
df1 = df1[~df1.elected.isnull()]
df1

Unnamed: 0,election_type,party,status,constituency_name,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,candidate,candidate_ID,year,elected
0,BI-ELECTION,Labour,Not Elected,Carlow Kilkenny,7678.0,0.2016,,False,Seamus Pattison,1,1960.0,False
1,GENERAL,Labour,Elected,Carlow Kilkenny,4116.0,0.0954,,False,Seamus Pattison,1,1961.0,True
2,GENERAL,Labour,Elected,Carlow Kilkenny,6299.0,0.1408,,False,Seamus Pattison,1,1965.0,True
3,GENERAL,Labour,Elected,Carlow Kilkenny,6041.0,0.1311,,False,Seamus Pattison,1,1969.0,True
4,GENERAL,Labour,Elected,Carlow Kilkenny,5300.0,0.1134,,False,Seamus Pattison,1,1973.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
30065,WESTMINSTER,Democratic Unionist,Not Elected,Newry and Armagh,5764.0,0.1284,,False,William Irwin,4203,2010.0,False
30066,WESTMINSTER,Democratic Unionist,Not Elected,Newry and Armagh,13177.0,0.2459,,False,William Irwin,4203,2017.0,False
30067,WESTMINSTER,Democratic Unionist,Not Elected,Newry and Armagh,11000.0,0.2166,,False,William Irwin,4203,2019.0,False
30068,BI-ELECTION,Non party/Independent,Not Elected,Cork South Central,219.0,0.0052,0.01,False,Brian McEnery,4211,1994.0,False


In [11]:
print(
    df1.shape,
    df2.shape
    )

print(sorted(set(df1.columns)))
print(sorted(set(df2.columns)))

(28590, 12) (36243, 10)
['candidate', 'candidate_ID', 'constituency_name', 'elected', 'election_type', 'first_pref_count', 'first_pref_pct', 'party', 'pct_of_quota_reached_with_first_pref', 'ran_unopposed', 'status', 'year']
['candidate', 'constituency_name', 'elected', 'election', 'election_type', 'first_pref_count', 'first_pref_pct', 'party', 'pct_of_quota_reached_with_first_pref', 'year']


In [12]:
#df1.year.unique()
df1 = df1.reindex(columns=[
    'year',
    'candidate',
    'candidate_ID',
    'constituency_name',
    'party',
    'elected',
    'election_type',
    'first_pref_count',
    'first_pref_pct',
    'pct_of_quota_reached_with_first_pref',
    'ran_unopposed',
    'status',
    ])
df1.head(3)

Unnamed: 0,year,candidate,candidate_ID,constituency_name,party,elected,election_type,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,status
0,1960.0,Seamus Pattison,1,Carlow Kilkenny,Labour,False,BI-ELECTION,7678.0,0.2016,,False,Not Elected
1,1961.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,4116.0,0.0954,,False,Elected
2,1965.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,6299.0,0.1408,,False,Elected


In [13]:
df2.year = df2.year.astype(float)
df2.year.unique()

array([2004., 2009., 2011., 2016., 1982., 1985., 1991., 1997., 1999.,
       2002., 2007., 2020., 2014., 2015., 1987., 1989., 1992., 1960.,
       1961., 1965., 1967., 1969., 1973., 1974., 1977., 1979., 1981.,
       1994., 1934., 1950., 1955., 1948., 1951., 1954., 1957., 1956.,
       1925., 1932., 1933., 1937., 1938., 1943., 1944., 1923., 1927.,
       1921., 1922., 1984., 2019., 1968., 1959., 1942., 1945., 1928.,
       1980., 1953., 1924., 1998., 1996., 1983., 1995., 2013., 1972.,
       2010., 1970., 1949., 1958., 1946., 1929., 1926., 1963., 1952.,
       1976., 1947., 1964., 1975., 1935., 1966., 2005., 1920., 1930.,
       2001., 2000., 1936., 1931., 1939., 1940.])

In [14]:
pd.concat([df1,df2],axis=1).sort_values('year')

ValueError: The column label 'year' is not unique.

In [None]:
combined_dataframe = pd.DataFrame()

In [15]:
import jellyfish

def are_names_the_similar(name_1,name_2):
    if name_1==name_2:
        return True
    elif jellyfish.levenshtein_distance(name_1,name_2) <=2:
        return True
    else:
        return False

get the similar names
then find a match where both tables have a record, eg mary b and marie b lost a general election in the same constitency in 1969. 
then we can map mary b and marie b back to the same ID
then concat the two tables and anywhere where we see 

CASE 1: We the same rows in each table
- Case 1A: Data is the same
- Case 1B: Name is similar but one bit of data is different, eg pct_of_quota_reached_with_first_pref is nan in one table and 0.38 in another.

CASE 2: We have data in one table that is not in the other, eg election results for the 70s


What do we want from this:
- I want every name to be linked with an ID
- We want a table that has all the election results, with no duplication, and with the most accurate information. 

In [None]:
We know that one table already has an ID. 

In [16]:
all_names_without_ID = set(df2.candidate.unique())
name_set = {}
for name_1 in df1.candidate.unique():
    name_set[name_1] = []
    for name_2 in list(all_names_without_ID):
        if are_names_the_similar(name_1,name_2):
            name_set[name_1].append(name_2)
            print(name_1,name_2)

    #print(similar_names)   
    # temp_df = df1[df1.year == year]
    # temp_df2 = df2[df2.year == year]
    # for name in temp_df.candidate:
    #     mp_df2.candidate.apply(lambda name_2: jellyfish.levenstein_distance)

    # print(temp_df.head())
    # print(temp_df2.head())
    # break

Seamus Pattison Séamus Pattison
John McGuinness John McGuinness
Phil Hogan Paul Hogan
Phil Hogan Phil Hogan
M J Nolan M. J. Nolan
M J Nolan M.J. Nolan
Jim Townsend Jim Townsend
Mary White Mary White
Fergal Browne Fergal Browne
Eddie Collins Hughes Eddie Collins Hughes
Billy Nolan Billy Nolan
Billy Nolan Bill Nolan
Mary M White Mary White
Tommy Kinsella Tommy Kinsella
Jimmy Brennan Tommy Brennan
Jimmy Brennan Jimmy Brennan
Walter Lacey Walter Lacey
Geraldine Callinan O'Dea Geraldine Callinan-O'Dea
Annie Parker Byrne Annie Parker-Byrne
Arthur McDonald Arthur McDonald
Enda Nolan Enda Nolan
Francis Deane Francis Deane
Francis Deane Francis Dunne
Noel Kennedy Ned Kennedy
Noel Kennedy Noel Kennedy
Clifford Kelly Clifford Kelly
Michael McCarey Michael Carey
Michael McCarey Michael McCarey
Michael McCarey Michael McCartney
Michael McCarey Michael McCarthy
Matt McPhillips Matt McPhillips
Joe O'Reilly Joe O'Reilly
Joe O'Reilly Joe O'Neill
Joe O'Reilly Tom O'Reilly
Joe O'Reilly Joe Reilly
Joe O'R

NameError: name 'similar_names' is not defined

loop through and see if the name is the same or similar, are there records in both tables.

In [20]:
for name_from_df1,list_of_names_from_df2 in name_set.items():
    t1 = df1[df1.candidate.isin([name_from_df1])]
    t2 = df2[df2.candidate.isin(list_of_names_from_df2)]
    break
t1

Unnamed: 0,year,candidate,candidate_ID,constituency_name,party,elected,election_type,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,status
0,1960.0,Seamus Pattison,1,Carlow Kilkenny,Labour,False,BI-ELECTION,7678.0,0.2016,,False,Not Elected
1,1961.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,4116.0,0.0954,,False,Elected
2,1965.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,6299.0,0.1408,,False,Elected
3,1969.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,6041.0,0.1311,,False,Elected
4,1973.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,5300.0,0.1134,,False,Elected
5,1977.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,6276.0,0.1132,,False,Elected
6,1981.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,6104.0,0.1082,,False,Elected
7,1982.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,5565.0,0.1025,,False,Elected
8,1982.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,5642.0,0.1011,,False,Elected
9,1987.0,Seamus Pattison,1,Carlow Kilkenny,Labour,True,GENERAL,7358.0,0.128,,False,Elected


In [21]:
t2

Unnamed: 0,election,elected,party,first_pref_pct,first_pref_count,pct_of_quota_reached_with_first_pref,year,candidate,constituency_name,election_type
72,1960 by-election - Carlow–Kilkenny,False,Labour Party,0.202,7678,0.4,1960.0,Séamus Pattison,Carlow–Kilkenny,BI-ELECTION
73,1961 general election - Carlow–Kilkenny,True,Labour Party,0.095,4116,0.57,1961.0,Séamus Pattison,Carlow–Kilkenny,GENERAL
74,1965 general election - Carlow–Kilkenny,True,Labour Party,0.141,6299,0.84,1965.0,Séamus Pattison,Carlow–Kilkenny,GENERAL
75,1967 Local Election - Kilkenny,True,Labour Party,0.141,1317,1.27,1967.0,Séamus Pattison,Kilkenny,LOCAL
76,1969 general election - Carlow–Kilkenny,True,Labour Party,0.131,6041,0.79,1969.0,Séamus Pattison,Carlow–Kilkenny,GENERAL
77,1973 general election - Carlow–Kilkenny,True,Labour Party,0.113,5300,0.68,1973.0,Séamus Pattison,Carlow–Kilkenny,GENERAL
78,1974 Local Election - Kilkenny,True,Labour Party,0.116,1166,1.03,1974.0,Séamus Pattison,Kilkenny,LOCAL
79,1977 general election - Carlow–Kilkenny,True,Labour Party,0.113,6276,0.68,1977.0,Séamus Pattison,Carlow–Kilkenny,GENERAL
80,1979 Local Election - Kilkenny,True,Labour Party,0.107,1213,0.96,1979.0,Séamus Pattison,Kilkenny,LOCAL
81,1981 general election - Carlow–Kilkenny,True,Labour Party,0.108,6095,0.65,1981.0,Séamus Pattison,Carlow–Kilkenny,GENERAL


In [35]:
# we have two data frames which contain people with similar names. 
# next step is to check if constituencies are similar names 

0.8857142857142858

In [None]:
for row_df1 in t1.itertuples():
    print(row)
    for row_df2 in t2.itertuples():
        if row_df2.election_type == row_df1.election_type and row_df2.year == row_df1.year and jellyfish.jaro_winkler(row_df1.constituency_name,constituency_name) > 0.8 and jellyfish.jaro_winkler(row_df1.party,row_df2.party) > 0.8:
            # theyre the same if they are the same type of election, in the same year, and the constituency names are similar. 
            # The only exception i can think of this would be two men called tomas o brien and thomas o brien who run in local elections, one in kildare east and another in kildare west. 
            
        break

In [None]:
# is the year the same
# and is the data the same

for row in t1.fillna('-').itertuples():
    print('row',row.constituency_name)
    similar_name_same_year_df = t2[(t2.year == row.year)]
    if similar_name_same_year_df.shape[0] > 1: # if there are multiple people with similar names running in the same year
        # check constituency name similarity
        most_similar = ''
        highest_jaro_winler = 0

        for constituency in similar_name_same_year_df.constituency_name:
            if jellyfish.jaro_winkler(constituency,row.constituency_name) >= highest_jaro_winler:
                most_similar = constituency
                highest_jaro_winler = jellyfish.jaro_winkler(constituency,row.constituency_name)

        print('m:',most_similar)
        #similar_name_same_year_df
        matched_row = similar_name_same_year_df[similar_name_same_year_df['constituency_name'] == most_similar]

    else: # we have 1 row matched with 1 row in another dataframe. so combine them
        #print('beep')
        matched_row = similar_name_same_year_df[0]
        duno = row._asdict()
        ddos = list(matched_row.itertuples())._asdict()
        d =  create_new_dictionary(duno,ddos)
        print(d)
        print('-------------------------')
        

row Cork South West


KeyError: 0

In [None]:
t1

Unnamed: 0,year,candidate,candidate_ID,constituency_name,party,elected,election_type,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,status
8437,1982.0,D F O'Sullivan,3433,Cork South West,Fianna Fail,False,GENERAL,2640.0,0.0804,,False,Not Elected
8438,1991.0,D F O'Sullivan,3433,Skibbereen,Fianna Fail,True,LOCAL,1688.0,0.1103,0.88,False,Elected
29676,1991.0,Dan O'Sullivan,8959,Naas,Non party/Independent,False,LOCAL,175.0,0.0168,0.13,False,Not Elected


In [None]:
similar_name_same_year_df

Unnamed: 0,election,elected,party,first_pref_pct,first_pref_count,pct_of_quota_reached_with_first_pref,year,candidate,constituency_name,election_type
22320,1991 Local Election - Naas,False,Independent,0.017,175,0.13,1991.0,Dan O'Sullivan,Naas,LOCAL


In [None]:
def create_new_dictionary(dict1,dict2):
    new_dict = {}
    for key,val in dict1.items():
        #print(key,val)
        if key=='Index':
            continue
        elif len(str(dict2.get(key))) > len(str(dict1.get(key))): # we pick the longest values because on floats that means higher precision and on constituency names it means more detail
            new_dict[key] = dict2.get(key)
            if key =='candidate':
                new_dict['AKA'] = dict1.get('candidate')
        else:
            new_dict[key] = val
            new_dict['AKA'] = dict2.get('candidate')
    return new_dict

{'year': 1969.0,
 'candidate': 'Kevin Hurley',
 'candidate_ID': '2837',
 'constituency_name': 'Cork City South–East',
 'party': 'Labour Party',
 'elected': False,
 'election_type': 'GENERAL',
 'first_pref_count': 2020.0,
 'first_pref_pct': 0.0748,
 'pct_of_quota_reached_with_first_pref': 0.3,
 'ran_unopposed': False,
 'status': 'Not Elected'}

In [None]:
similar_name_same_year_df.shape

(1, 10)

In [None]:
temp_df.head()

Unnamed: 0,year,candidate,candidate_ID,constituency_name,party,elected,election_type,first_pref_count,first_pref_pct,pct_of_quota_reached_with_first_pref,ran_unopposed,status
18838,1880.0,T P O'Connor,10028,Galway,,True,WESTMINSTER,487.0,0.3315,,False,Elected
18840,1880.0,John Lever,10029,Galway,,True,WESTMINSTER,501.0,0.341,,False,Elected
18841,1880.0,Hugh Tarpey,10030,Galway,Liberal,False,WESTMINSTER,481.0,0.3274,0.98,False,Not Elected


In [None]:
temp_df2.head()

Unnamed: 0,election,elected,party,first_pref_pct,first_pref_count,first_pref_quota_ratio,year,candidate,constituency,election_type


Run through and say per name in 