# Cleaning data in Python

In this notebook we continue working with the two data files that resulted from scraping Wikipedia and merging data from the US Census Bureau. Before performing any exploratory data analysis or machine learning we have to clean the data: although we created the datasets ourselves, they are still somewhat messy. The workflow consists of going through each dataset column by column and cleaning or changing the data. Finally, we will add some derived values to the dataset as new columns.

In [2]:
import pandas as pd
import os
import numpy as np

We will start with setting up the paths to both files.

In [3]:
current_folder = os.getcwd()
district_data = current_folder + '/data/resultingData/congress_merged.csv'
member_data = current_folder + '/data/resultingData/congress_members.csv'
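
Building the paths by string concatenation works, but it hard-codes the `/` separator. A minimal sketch of the same setup using `pathlib` from the standard library is shown below. The names `data_dir`, `district_path` and `member_path` are illustrative and not part of the notebook, and the sketch assumes the notebook is run from the project root.

```python
from pathlib import Path

# Assumption: the working directory is the project root, as in the cell above.
data_dir = Path.cwd() / "data" / "resultingData"

# The / operator joins path components with the separator of the current OS.
district_path = data_dir / "congress_merged.csv"
member_path = data_dir / "congress_members.csv"

print(district_path.name)  # congress_merged.csv
print(member_path.name)    # congress_members.csv
```

`pd.read_csv` accepts `Path` objects directly, so these could be passed in place of the strings above.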

In [4]:
member_frame = pd.read_csv(member_data)
member_frame

Unnamed: 0.1,Unnamed: 0,District,Member,Party,Party.1,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
0,0,Alabama 1,Bradley Byrne,,Republican,Alabama SenateAlabama State Board of Education,Duke University (BA)University of Alabama (JD),2014 (special),Fairhope,1955.0,{{marriage|Rebecca Dukes|1982}},4
1,1,Alabama 2,Martha Roby,,Republican,"Montgomery, Alabama City Council",New York University (BM)Samford University (JD),2011,Montgomery,1976.0,Riley Roby,2
2,2,Alabama 3,Mike Rogers,,Republican,"Calhoun County, Alabama CommissionerAlabama Ho...","Jacksonville State University (BA, MPA)Birming...",2003,Saks,1958.0,Beth,
3,3,Alabama 4,Robert Aderholt,,Republican,"Haleyville, Alabama Municipal Judge",University of North AlabamaBirmingham–Southern...,1997,Haleyville,1965.0,Caroline McDonald,
4,4,Alabama 5,Mo Brooks,,Republican,Alabama House of RepresentativesMadison County...,Duke University (BA)University of Alabama (JD),2011,Huntsville,1954.0,{{marriage|Martha Jenkins|1976}},4
...,...,...,...,...,...,...,...,...,...,...,...,...
456,456,,Jim Langevin,,,,,,,,none,
457,457,,Mark Green,,,,,,,,Camie,2
458,458,,Mike McCaul,,,,,,,,Linda Mays,5
459,459,,Michael Clifton Burgess,,,,,,,,Laura Burgess,3


We will start by dropping the columns 'Unnamed: 0' and 'Party'.

In [5]:
member_frame = member_frame.drop(columns = ['Unnamed: 0','Party'])

Next we'll change the name 'Party.1' to simply 'Party'

In [6]:
member_frame = member_frame.rename(columns = {'Party.1':'Party'})
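
Before dropping a column it can be worth confirming that it really is empty. The sketch below does this on a small toy frame whose layout mimics the output above (the values are illustrative) and then chains the drop and the rename in one expression.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of member_frame; not the real data.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Member": ["Bradley Byrne", "Martha Roby"],
    "Party": [np.nan, np.nan],
    "Party.1": ["Republican", "Republican"],
})

# Confirm the column carries no information before dropping it.
party_is_empty = bool(toy["Party"].isna().all())
print(party_is_empty)  # True

# drop() and rename() both return new frames, so they can be chained.
toy = (
    toy.drop(columns=["Unnamed: 0", "Party"])
       .rename(columns={"Party.1": "Party"})
)
print(list(toy.columns))  # ['Member', 'Party']
```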

Next we'll split the DataFrame, putting the rows where District is NaN into their own frame. This is the data scraped from the wiki infoboxes, which due to naming differences couldn't be joined directly onto the main DataFrame.

In [7]:
extra_frame = member_frame[(member_frame['District'].isnull())]
extra_frame

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
435,,John Raymond Garamendi,,,,,,,{{marriage|Patricia Wilkinson|1965}},6
436,,,,,,,,,Janine Bera,1
437,,,,,,,,,{{marriage|Vera Davis|1974}},2
438,,,,,,,,,none,
439,,,,,,,,,none,
440,,,,,,,,,none,
441,,,,,,,,,Rosel Cloud,3
442,,John Larson,,,,,,,Leslie Best,3
443,,Rick Allen,,,,,,,{{marriage|Robin Allen|1975}},4
444,,Russell Mark Fulcher,,,,,,,{{marriage|Kara||2018|end=div}},3


As we can see, we have some things to work with. We will start by trying to get the names to match, which can be done for all but five members. To do this we have to compare the names in extra_frame with those on Wikipedia and make them equal.

For John Raymond Garamendi it's enough to drop the middle name. Luckily there's just one congress member with the surname Larson, namely John B. Larson, and so on until we can match all of these.

In [8]:
extra_frame['Member'] = extra_frame['Member'].replace({'John Raymond Garamendi':'John Garamendi', 'John Larson':'John B. Larson','Rick Allen': 'Rick W. Allen',
                             'Russell Mark Fulcher': 'Russ Fulcher','Chuy García':'Jesús "Chuy" García','Anthony Brown': 'Anthony G. Brown',
                             'Stephen Francis Lynch': 'Stephen F. Lynch', 'Annie Kuster':'Ann McLane Kuster', 
                             'Donald Milford Payne Jr.': 'Donald Payne Jr.', 'Peter Thomas King': 'Peter T. King',
                             'Jerrold Lewis Nadler': 'Jerry Nadler', 'José Serrano':'José E. Serrano', 'Sean P. Maloney':'Sean Patrick Maloney',
                             'Joe Morelle': 'Joseph Morelle', 'Dave Joyce': 'David Joyce', 'Jim Langevin':'James Langevin',
                             'Mark Green': 'Mark E. Green', 'Mike McCaul': 'Michael McCaul', 'Michael Clifton Burgess': 'Michael C. Burgess',
                             'Filemon Vela': 'Filemon Vela Jr.'})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


We will ignore the warning in this case. Next we'll try to fill in the blanks by matching the names of the spouses with congress members.

In [9]:
extra_frame.loc[436, 'Member'] = 'Ami Bera'
extra_frame.loc[437, 'Member'] = 'Danny K. Davis'
extra_frame.loc[441, 'Member'] = 'Michael Cloud'
extra_frame

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
435,,John Garamendi,,,,,,,{{marriage|Patricia Wilkinson|1965}},6
436,,Ami Bera,,,,,,,Janine Bera,1
437,,Danny K. Davis,,,,,,,{{marriage|Vera Davis|1974}},2
438,,,,,,,,,none,
439,,,,,,,,,none,
440,,,,,,,,,none,
441,,Michael Cloud,,,,,,,Rosel Cloud,3
442,,John B. Larson,,,,,,,Leslie Best,3
443,,Rick W. Allen,,,,,,,{{marriage|Robin Allen|1975}},4
444,,Russ Fulcher,,,,,,,{{marriage|Kara||2018|end=div}},3
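
The SettingWithCopyWarning seen above appears because the filtered frame may be a view of the original. One common way to avoid it, sketched here on a toy frame with illustrative values, is to take an explicit copy when filtering; the name `extra` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: one matched row and two rows without a district.
toy = pd.DataFrame({
    "District": ["Alabama 1", np.nan, np.nan],
    "Member": ["Bradley Byrne", "John Raymond Garamendi", np.nan],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and do not trigger the warning.
extra = toy[toy["District"].isna()].copy()

extra["Member"] = extra["Member"].replace(
    {"John Raymond Garamendi": "John Garamendi"}
)
extra.loc[2, "Member"] = "Ami Bera"

print(extra["Member"].tolist())  # ['John Garamendi', 'Ami Bera']
print(toy.loc[1, "Member"])      # John Raymond Garamendi
```

The original frame is left untouched, which is usually what is wanted when the subset is cleaned separately.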


Now we will place the missing values in a complete frame. Since we are not interested in the name of the spouse, we will simply put Yes or No instead.

In [10]:
member_complete = member_frame.iloc[:435, :].copy()  # explicit copy, so the assignments below act on an independent frame
member_complete.loc[23, 'Spouse'] = 'Yes'
member_complete.loc[23, 'Childrens'] = 6
member_complete.loc[27, 'Spouse'] = 'Yes'
member_complete.loc[27, 'Childrens'] = 1
member_complete.loc[141, 'Spouse'] = 'Yes'
member_complete.loc[141, 'Childrens'] = 2
member_complete.loc[387, 'Spouse'] = 'Yes'
member_complete.loc[387, 'Childrens'] = 3
member_complete.loc[84, 'Spouse'] = 'Yes'
member_complete.loc[84, 'Childrens'] = 3
member_complete.loc[128, 'Spouse'] = 'Yes'
member_complete.loc[128, 'Childrens'] = 4
member_complete.loc[133, 'Spouse'] = 'No'
member_complete.loc[133, 'Childrens'] = 3
member_complete.loc[138, 'Spouse'] = 'Yes'
member_complete.loc[138, 'Childrens'] = 3
member_complete.loc[187, 'Spouse'] = 'Yes'
member_complete.loc[187, 'Childrens'] = 3
member_complete.loc[199, 'Spouse'] = 'Yes'
member_complete.loc[199, 'Childrens'] = 1
member_complete.loc[244, 'Spouse'] = 'Yes'
member_complete.loc[244, 'Childrens'] = 2
member_complete.loc[254, 'Spouse'] = 'Yes'
member_complete.loc[254, 'Childrens'] = 3
member_complete.loc[261, 'Spouse'] = 'Yes'
member_complete.loc[261, 'Childrens'] = 2
member_complete.loc[269, 'Spouse'] = 'Yes'
member_complete.loc[269, 'Childrens'] = 1
member_complete.loc[274, 'Spouse'] = 'No'
member_complete.loc[274, 'Childrens'] = 5
member_complete.loc[284, 'Spouse'] = 'Yes'
member_complete.loc[284, 'Childrens'] = 3
member_complete.loc[277, 'Spouse'] = 'Yes'
member_complete.loc[277, 'Childrens'] = 3
member_complete.loc[312, 'Spouse'] = 'Yes'
member_complete.loc[312, 'Childrens'] = 0
member_complete.loc[344, 'Spouse'] = 'No'
member_complete.loc[344, 'Childrens'] = 0
member_complete.loc[359, 'Spouse'] = 'Yes'
member_complete.loc[359, 'Childrens'] = 2
member_complete.loc[370, 'Spouse'] = 'Yes'
member_complete.loc[370, 'Childrens'] = 5
member_complete.loc[386, 'Spouse'] = 'Yes'
member_complete.loc[386, 'Childrens'] = 3
member_complete.loc[394, 'Spouse'] = 'Yes'
member_complete.loc[394, 'Childrens'] = 0
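
The long list of assignments above could also be driven by a single dictionary, which keeps each row's values together and makes typos easier to spot. The following is only a sketch on a toy frame: `manual_fixes` is a hypothetical name, and just three of the rows from the cell above are included.

```python
import pandas as pd

# Toy frame with the same two columns and a few of the row labels used above.
toy = pd.DataFrame(
    {"Spouse": ["none", "none", "none"], "Childrens": [None, None, None]},
    index=[23, 27, 133],
)

# Row label -> (spouse flag, number of children), copied from the cell above.
manual_fixes = {
    23: ("Yes", 6),
    27: ("Yes", 1),
    133: ("No", 3),
}

for row, (spouse, children) in manual_fixes.items():
    toy.loc[row, "Spouse"] = spouse
    toy.loc[row, "Childrens"] = children

print(toy["Spouse"].tolist())     # ['Yes', 'Yes', 'No']
print(toy["Childrens"].tolist())  # [6, 1, 3]
```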

In [11]:
member_complete

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
0,Alabama 1,Bradley Byrne,Republican,Alabama SenateAlabama State Board of Education,Duke University (BA)University of Alabama (JD),2014 (special),Fairhope,1955.0,{{marriage|Rebecca Dukes|1982}},4
1,Alabama 2,Martha Roby,Republican,"Montgomery, Alabama City Council",New York University (BM)Samford University (JD),2011,Montgomery,1976.0,Riley Roby,2
2,Alabama 3,Mike Rogers,Republican,"Calhoun County, Alabama CommissionerAlabama Ho...","Jacksonville State University (BA, MPA)Birming...",2003,Saks,1958.0,Beth,
3,Alabama 4,Robert Aderholt,Republican,"Haleyville, Alabama Municipal Judge",University of North AlabamaBirmingham–Southern...,1997,Haleyville,1965.0,Caroline McDonald,
4,Alabama 5,Mo Brooks,Republican,Alabama House of RepresentativesMadison County...,Duke University (BA)University of Alabama (JD),2011,Huntsville,1954.0,{{marriage|Martha Jenkins|1976}},4
...,...,...,...,...,...,...,...,...,...,...
430,Wisconsin 5,Jim Sensenbrenner,Republican,Wisconsin State SenateWisconsin State Assembly,Stanford University (BA)University of Wisconsi...,1979,Menomonee Falls,1943.0,{{marriage|Cheryl Warren|1977|2020|end=d.}},2
431,Wisconsin 6,Glenn Grothman,Republican,Wisconsin SenateWisconsin State Assembly,"University of Wisconsin–Madison (BA, JD)",2015,Campbellsport,1955.0,none,
432,Wisconsin 7,Tom Tiffany,Republican,Wisconsin SenateWisconsin State Assembly,University of Wisconsin–River Falls (BS),2020 (special),Minocqua,1957.0,Christine Sully,3
433,Wisconsin 8,Mike Gallagher,Republican,Political advisorU.S. Marine Corps,Princeton University (BA)National Intelligence...,2017,Green Bay,1984.0,{{marriage|[[Anne Horak Gallagher]]|2019}},


Next we'll clean up the Education column. Since some members have more than one degree listed, e.g. a B.A. and then a J.D., we have to rank the degrees and search and replace in ranking order. We start with J.D. degrees, since they are directly related to law. After that we look for PhDs, then professional doctorates (MD, DDS, DVM), followed by master's degrees and finally bachelor's degrees. However, the first thing we will do is drop the four vacant rows.

In [12]:
member_complete = member_complete[member_complete.Member != 'Vacant']
member_complete = member_complete.reset_index(drop=True)
member_complete.loc[member_complete['Education'].str.contains('JD'), 'Education'] = 'JD'
member_complete.loc[member_complete['Education'].str.contains('PhD'), 'Education'] = 'PhD'
member_complete.loc[member_complete['Education'].str.contains('MD|DDS|DVM'), 'Education'] = 'Profesional Doctorate'
member_complete.loc[member_complete['Education'].str.contains('MBA|MS|MPA|MPP|MFA|MA|MPhil|MUP|MSN|MPH|LLM|MHS|THM'), 'Education'] = 'Master'
member_complete.loc[member_complete['Education'].str.contains('BA|BS'), 'Education'] = 'Bachelor'
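
The ranking idea can also be written as an ordered list of patterns where the first match wins. The sketch below is an alternative formulation, not the code used in this notebook: `DEGREE_RANKING` and `highest_degree` are hypothetical names, and the master's pattern is a shortened version of the one above.

```python
import re

# Ordered from highest to lowest priority; the first matching pattern wins.
DEGREE_RANKING = [
    (r"JD", "JD"),
    (r"PhD", "PhD"),
    (r"MD|DDS|DVM", "Professional Doctorate"),
    (r"MBA|MS|MPA|MA", "Master"),
    (r"BA|BS", "Bachelor"),
]

def highest_degree(text):
    """Return the label of the first pattern found in text, or None."""
    for pattern, label in DEGREE_RANKING:
        if re.search(pattern, text):
            return label
    return None

print(highest_degree("Duke University (BA)University of Alabama (JD)"))  # JD
print(highest_degree("Jacksonville State University (BA, MPA)"))         # Master
print(highest_degree("University of North Alabama"))                     # None
```

Applied to a column, this would be something like `member_complete['Education'].map(highest_degree)`, with the `None` results marking the rows where no degree is stated.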

Congress members whose level of education is not stated on Wikipedia are assumed to have reached a master's degree, so we replace the name of the institution with that level.

In [13]:
member_complete.loc[member_complete['Education'].str.contains('Bachelor|Master|PhD|Profesional Doctorate|JD') == False, 
                   'Education'] = 'Master'
member_complete

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
0,Alabama 1,Bradley Byrne,Republican,Alabama SenateAlabama State Board of Education,JD,2014 (special),Fairhope,1955.0,{{marriage|Rebecca Dukes|1982}},4
1,Alabama 2,Martha Roby,Republican,"Montgomery, Alabama City Council",JD,2011,Montgomery,1976.0,Riley Roby,2
2,Alabama 3,Mike Rogers,Republican,"Calhoun County, Alabama CommissionerAlabama Ho...",JD,2003,Saks,1958.0,Beth,
3,Alabama 4,Robert Aderholt,Republican,"Haleyville, Alabama Municipal Judge",JD,1997,Haleyville,1965.0,Caroline McDonald,
4,Alabama 5,Mo Brooks,Republican,Alabama House of RepresentativesMadison County...,JD,2011,Huntsville,1954.0,{{marriage|Martha Jenkins|1976}},4
...,...,...,...,...,...,...,...,...,...,...
426,Wisconsin 5,Jim Sensenbrenner,Republican,Wisconsin State SenateWisconsin State Assembly,JD,1979,Menomonee Falls,1943.0,{{marriage|Cheryl Warren|1977|2020|end=d.}},2
427,Wisconsin 6,Glenn Grothman,Republican,Wisconsin SenateWisconsin State Assembly,JD,2015,Campbellsport,1955.0,none,
428,Wisconsin 7,Tom Tiffany,Republican,Wisconsin SenateWisconsin State Assembly,Bachelor,2020 (special),Minocqua,1957.0,Christine Sully,3
429,Wisconsin 8,Mike Gallagher,Republican,Political advisorU.S. Marine Corps,PhD,2017,Green Bay,1984.0,{{marriage|[[Anne Horak Gallagher]]|2019}},


Next we'll go through the "Prior experience" column. For this we will separate those who have prior public experience from those who do not. In this case public experience will be 'Board of...', City Council, House of Representatives, Judge, Mayor, Senate, State Assembly, Sheriff, State Director, Commissioner, Deputy Secretary of the Interior, '... Police Department', Ambassador, Secretary of Health and Human Services, Treasurer, City-County Council, Judge-Executive, '... Restoration Authority', County Executive, Governor, Secretary of Defense for International Security Affairs, Presidential Task Force ..., Municipal Council, General Assembly, Chair of the New Mexico Democratic Party, Town Council, Puerto Rican Community Affairs, Supreme Court, Republican Committee, Secretary of State, Commission, Texas Justice of the Peace, House of Delegates. Finally we'll change the name of the column to 'Prior Public Experience'.

In [14]:
member_complete.loc[member_complete['Prior experience'].str.contains('Board of|City Council|House of Representatives|Judge'
                  '|Mayor|Senate|Assembly|Sheriff|State Director|Commissioner|Deputy Secretary of the Interior'
                  '|Police Department|Ambassador|Secretary of Health and Human Services|Treasurer|City-County Council'
                  '|Judge-Executive|Restoration Authority|County Executive|Governor|Secretary of Defense for International Security Affairs'
                  '|Presidential Task Force|Municipal Council|New Mexico Democratic Party|Town Council'
                  '|Puerto Rican Community Affairs|Supreme Court|Republican Committee|Secretary of State|Commission'
                  '|Texas Justice of the Peace|House of Delegates'),'Prior experience'] = 'Yes'

member_complete.loc[member_complete['Prior experience'].str.contains('Yes') == False,'Prior experience'] = 'No'
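# Note: rename() returns a new DataFrame; the result is not assigned here,
# so the column keeps the name 'Prior experience' in the outputs below.
member_complete.rename(columns = {'Prior experience':'Prior Public Experience'})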
member_complete

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens
0,Alabama 1,Bradley Byrne,Republican,Yes,JD,2014 (special),Fairhope,1955.0,{{marriage|Rebecca Dukes|1982}},4
1,Alabama 2,Martha Roby,Republican,Yes,JD,2011,Montgomery,1976.0,Riley Roby,2
2,Alabama 3,Mike Rogers,Republican,Yes,JD,2003,Saks,1958.0,Beth,
3,Alabama 4,Robert Aderholt,Republican,Yes,JD,1997,Haleyville,1965.0,Caroline McDonald,
4,Alabama 5,Mo Brooks,Republican,Yes,JD,2011,Huntsville,1954.0,{{marriage|Martha Jenkins|1976}},4
...,...,...,...,...,...,...,...,...,...,...
426,Wisconsin 5,Jim Sensenbrenner,Republican,Yes,JD,1979,Menomonee Falls,1943.0,{{marriage|Cheryl Warren|1977|2020|end=d.}},2
427,Wisconsin 6,Glenn Grothman,Republican,Yes,JD,2015,Campbellsport,1955.0,none,
428,Wisconsin 7,Tom Tiffany,Republican,Yes,Bachelor,2020 (special),Minocqua,1957.0,Christine Sully,3
429,Wisconsin 8,Mike Gallagher,Republican,No,PhD,2017,Green Bay,1984.0,{{marriage|[[Anne Horak Gallagher]]|2019}},
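
The keyword matching above can be sketched more compactly by building the pattern from a list. The example below uses a toy frame and only a small, illustrative subset of the keywords; `public_keywords` and `has_public` are hypothetical names. It also shows that the result of `rename` has to be assigned back for the new column name to stick.

```python
import re
import pandas as pd

# A shortened, illustrative subset of the keywords listed above.
public_keywords = ["City Council", "Senate", "Judge", "Mayor"]
# re.escape guards against keywords that contain regex metacharacters.
pattern = "|".join(re.escape(k) for k in public_keywords)

toy = pd.DataFrame({
    "Prior experience": [
        "Alabama SenateAlabama State Board of Education",
        "Political advisorU.S. Marine Corps",
        None,
    ]
})

# na=False treats missing values as "no match" instead of propagating NaN.
has_public = toy["Prior experience"].str.contains(pattern, na=False)
toy["Prior experience"] = has_public.map({True: "Yes", False: "No"})

# rename() returns a new frame, so the result is assigned back.
toy = toy.rename(columns={"Prior experience": "Prior Public Experience"})
print(toy["Prior Public Experience"].tolist())  # ['Yes', 'No', 'No']
```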


Moving on to Assumed office, the only cleaning needed is to remove the '(special)' string.

In [15]:
member_complete['Assumed office'] = member_complete['Assumed office'].apply(lambda x: x.split('(special)')[0])
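
An alternative to splitting on '(special)' is to extract the four-digit year directly, which also removes the trailing space left by the split. A small sketch on illustrative values:

```python
import pandas as pd

assumed = pd.Series(["2014 (special)", "2011", "2020 (special)"])

# Pull out the first run of four digits and convert it to an integer.
years = assumed.str.extract(r"(\d{4})", expand=False).astype(int)
print(years.tolist())  # [2014, 2011, 2020]
```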

Next we'll create two derived values, age when they assumed office and how long they have been in office

In [16]:
member_complete = member_complete.astype({'Assumed office': int, 'Born':int})
years_in_office = []
age_when_elected = []
for i in range(len(member_complete)):  # one iteration per row; the index was reset above
    years = 2020-member_complete.loc[i,'Assumed office']
    age = member_complete.loc[i,'Assumed office']-member_complete.loc[i,'Born']
    years_in_office.append(years)
    age_when_elected.append(age)
member_complete['Years in office'] = years_in_office
member_complete['Age when elected'] = age_when_elected
member_complete

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens,Years in office,Age when elected
0,Alabama 1,Bradley Byrne,Republican,Yes,JD,2014,Fairhope,1955,{{marriage|Rebecca Dukes|1982}},4,6,59
1,Alabama 2,Martha Roby,Republican,Yes,JD,2011,Montgomery,1976,Riley Roby,2,9,35
2,Alabama 3,Mike Rogers,Republican,Yes,JD,2003,Saks,1958,Beth,,17,45
3,Alabama 4,Robert Aderholt,Republican,Yes,JD,1997,Haleyville,1965,Caroline McDonald,,23,32
4,Alabama 5,Mo Brooks,Republican,Yes,JD,2011,Huntsville,1954,{{marriage|Martha Jenkins|1976}},4,9,57
...,...,...,...,...,...,...,...,...,...,...,...,...
426,Wisconsin 5,Jim Sensenbrenner,Republican,Yes,JD,1979,Menomonee Falls,1943,{{marriage|Cheryl Warren|1977|2020|end=d.}},2,41,36
427,Wisconsin 6,Glenn Grothman,Republican,Yes,JD,2015,Campbellsport,1955,none,,5,60
428,Wisconsin 7,Tom Tiffany,Republican,Yes,Bachelor,2020,Minocqua,1957,Christine Sully,3,0,63
429,Wisconsin 8,Mike Gallagher,Republican,No,PhD,2017,Green Bay,1984,{{marriage|[[Anne Horak Gallagher]]|2019}},,3,33
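
The same two derived columns can be computed without an explicit loop, since pandas subtracts whole columns element by element. A minimal sketch on three rows taken from the output above; `REFERENCE_YEAR` is a hypothetical name for the year 2020 used in the loop.

```python
import pandas as pd

toy = pd.DataFrame({
    "Assumed office": [2014, 2011, 1979],
    "Born": [1955, 1976, 1943],
})

REFERENCE_YEAR = 2020  # same reference year as in the loop above

# Column arithmetic is applied row by row without a Python loop.
toy["Years in office"] = REFERENCE_YEAR - toy["Assumed office"]
toy["Age when elected"] = toy["Assumed office"] - toy["Born"]

print(toy["Years in office"].tolist())   # [6, 9, 41]
print(toy["Age when elected"].tolist())  # [59, 35, 36]
```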


Now we'll clean the last two messy columns. First up is Childrens. This column isn't that messy, so we might as well go through it manually, adding a line for each row to replace a non-numerical string with a purely numeric one. After that we'll replace the empty values and NaN with 0s.

In [17]:
member_complete.loc[26, 'Childrens'] = '1'
member_complete.loc[32, 'Childrens'] = '5'
member_complete.loc[57, 'Childrens'] = '4'
member_complete.loc[110, 'Childrens'] = '1'
member_complete.loc[117, 'Childrens'] = '1'
member_complete.loc[138, 'Childrens'] = '2'
member_complete.loc[139, 'Childrens'] = '2'
member_complete.loc[154, 'Childrens'] = '4'
member_complete.loc[152, 'Childrens'] = '0'
member_complete.loc[153, 'Childrens'] = '4'
member_complete.loc[168, 'Childrens'] = '1'
member_complete.loc[168, 'Childrens'] = '1'
member_complete.loc[177, 'Childrens'] = '3'
member_complete.loc[178, 'Childrens'] = '3'
member_complete.loc[193, 'Childrens'] = '1'
member_complete.loc[194, 'Childrens'] = '1'
member_complete.loc[203, 'Childrens'] = '2'
member_complete.loc[211, 'Childrens'] = '4'
member_complete.loc[214, 'Childrens'] = '3'
member_complete.loc[215, 'Childrens'] = '3'
member_complete.loc[225, 'Childrens'] = '1'
member_complete.loc[226, 'Childrens'] = '1'
member_complete.loc[284, 'Childrens'] = '3'
member_complete.loc[314, 'Childrens'] = '1'
member_complete.loc[317, 'Childrens'] = '1'
member_complete.loc[338, 'Childrens'] = '0'
member_complete.loc[342, 'Childrens'] = '4'
member_complete.loc[346, 'Childrens'] = '3'
member_complete.loc[363, 'Childrens'] = '0'
member_complete.loc[373, 'Childrens'] = '2'
member_complete.loc[380, 'Childrens'] = '4'
member_complete.loc[381, 'Childrens'] = '2'
member_complete.loc[392, 'Childrens'] = '5'
member_complete.loc[424, 'Childrens'] = '2'
member_complete.loc[2, 'Childrens'] = '0'
member_complete = member_complete.replace(r'^\s*$', np.nan, regex=True).fillna('0') # Replace empty strings with NaN first, then NaN with '0'
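
To find the rows that need manual attention in the first place, `pd.to_numeric` with `errors="coerce"` can be used: anything that cannot be parsed becomes NaN. The sketch below uses illustrative values; `needs_review` is a hypothetical name.

```python
import numpy as np
import pandas as pd

children = pd.Series(["4", "", np.nan, "2", "three"])

# errors="coerce" turns anything non-numeric (including "") into NaN.
numeric = pd.to_numeric(children, errors="coerce")

# Rows that were not empty but still failed to parse need manual review.
not_empty = children.fillna("").str.strip().ne("")
needs_review = children[numeric.isna() & not_empty]
print(needs_review.tolist())  # ['three']

# After the manual fixes, the remaining NaN values can be set to 0.
cleaned = numeric.fillna(0).astype(int)
print(cleaned.tolist())  # [4, 0, 0, 2, 0]
```

In this toy example the unparsed value 'three' also ends up as 0, so in practice the rows in `needs_review` would be fixed before the final `fillna`.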

In [19]:
pd.set_option('display.max_rows', 450)
member_complete['Childrens']

0                                           4
1                                           2
2                                           0
3                                           0
4                                           4
5                                           3
6                                           0
7                                           2
8                                           0
9                                           0
10                                          3
11                                          3
12                                          6
13                                          0
14                                          1
15                                          0
16                                          2
17                                          2
18                                          2
19                                          3
20                                          4
21                                

Now we have one last column to clean, Spouse. This is probably the messiest of the columns, due to the different formats and the fact that we can't simply look up a single keyword to decide the marital status, since someone can divorce and then remarry. Therefore we will do this in four steps. First we'll look for people who have had a divorce or whose spouses have died. Then we'll look for the keyword marriage, after which we'll look for '{' signs. Finally we should have a frame with either Yes, No, or just names. The result should be a column with either Yes or No for each member.

In [17]:
member_complete[member_complete["Spouse"].str.contains("divorce|die|div")]

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens,Years in office,Age when elected
7,Alaska at-large,Don Young,Republican,Yes,Bachelor,1973,Fort Yukon,1933,{{marriage|Lu Fredson|1963|2009|reason=died}}<...,2,47,40
14,Arizona 7,Ruben Gallego,Democratic,Yes,Bachelor,2015,Phoenix,1979,{{marriage|[[Kate Gallego|Kate Widland]]|2010|...,1,5,36
31,California 11,Mark DeSaulnier,Democratic,Yes,Bachelor,2015,Concord,1952,Melinda Clune (divorced)<ref>https://www.eastb...,2,5,63
34,California 14,Jackie Speier,Democratic,Yes,JD,2008,Hillsborough,1950,{{marriage|Steve Sierra|1987|1994|end=died}}<b...,2,12,58
38,California 18,Anna Eshoo,Democratic,Yes,Master,1993,Atherton,1942,George Eshoo (divorced),2,27,51
52,California 32,Grace Napolitano,Democratic,Yes,Master,1999,Norwalk,1936,"{{marriage|Frank Napolitano|1980s|December 15,...",5,21,63
57,California 37,Karen Bass,Democratic,Yes,Bachelor,2011,Baldwin Hills,1953,{{Marriage|Jesus Lechuga|1980|1986|end=divorced}},4,9,58
63,California 43,Maxine Waters,Democratic,Yes,Bachelor,1991,Inglewood,1938,{{marriage|Edward Waters|1956|1972|end=div}}<b...,2,29,53
65,California 45,Katie Porter,Democratic,No,JD,2019,Irvine,1974,Matthew Hoffman <br>(m. 2003; div. 2013)<ref n...,3,1,45
76,Colorado 4,Ken Buck,Republican,No,JD,2015,Greeley,1959,{{marriage|Dayna Roane|1984|1994|end=div}}<ref...,2,5,56


In [18]:
member_complete.loc[7, 'Spouse'] = 'Yes'
member_complete.loc[14, 'Spouse'] = 'No'
member_complete.loc[31, 'Spouse'] = 'No'
member_complete.loc[34, 'Spouse'] = 'Yes'
member_complete.loc[38, 'Spouse'] = 'No'
member_complete.loc[52, 'Spouse'] = 'No'
member_complete.loc[63, 'Spouse'] = 'Yes'
member_complete.loc[63, 'Spouse'] = 'No'
member_complete.loc[76, 'Spouse'] = 'No'
member_complete.loc[79, 'Spouse'] = 'Yes'
member_complete.loc[85, 'Spouse'] = 'No'
member_complete.loc[98, 'Spouse'] = 'No'
member_complete.loc[109, 'Spouse'] = 'No'
member_complete.loc[117, 'Spouse'] = 'No'
member_complete.loc[128, 'Spouse'] = 'Yes'
member_complete.loc[131, 'Spouse'] = 'Yes'
member_complete.loc[139, 'Spouse'] = 'Yes'
member_complete.loc[170, 'Spouse'] = 'Yes'
member_complete.loc[174, 'Spouse'] = 'Yes'
member_complete.loc[178, 'Spouse'] = 'No'
member_complete.loc[184, 'Spouse'] = 'No'
member_complete.loc[208, 'Spouse'] = 'No'
member_complete.loc[209, 'Spouse'] = 'No'
member_complete.loc[222, 'Spouse'] = 'No'
member_complete.loc[223, 'Spouse'] = 'No'
member_complete.loc[228, 'Spouse'] = 'No'
member_complete.loc[267, 'Spouse'] = 'No'
member_complete.loc[282, 'Spouse'] = 'No'
member_complete.loc[306, 'Spouse'] = 'No'
member_complete.loc[320, 'Spouse'] = 'Yes'
member_complete.loc[346, 'Spouse'] = 'No'
member_complete.loc[349, 'Spouse'] = 'Yes'
member_complete.loc[350, 'Spouse'] = 'Yes'
member_complete.loc[352, 'Spouse'] = 'Yes'
member_complete.loc[397, 'Spouse'] = 'Yes'
member_complete.loc[412, 'Spouse'] = 'Yes'

In [19]:
member_complete[member_complete["Spouse"].str.contains("marriage")]

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens,Years in office,Age when elected
0,Alabama 1,Bradley Byrne,Republican,Yes,JD,2014,Fairhope,1955,{{marriage|Rebecca Dukes|1982}},4,6,59
4,Alabama 5,Mo Brooks,Republican,Yes,JD,2011,Huntsville,1954,{{marriage|Martha Jenkins|1976}},4,9,57
13,Arizona 6,David Schweikert,Republican,Yes,Master,2011,Scottsdale,1962,{{marriage|Joyce Schweikert|2006}},0,9,49
16,Arizona 9,Greg Stanton,Democratic,Yes,JD,2019,Phoenix,1970,{{marriage|Nicole|2006|}},2,1,49
18,Arkansas 2,French Hill,Republican,No,Bachelor,2015,Little Rock,1956,{{marriage|Martha McKenzie|1988}},2,5,59
...,...,...,...,...,...,...,...,...,...,...,...,...
423,Wisconsin 2,Mark Pocan,Democratic,Yes,Bachelor,2013,Madison,1964,{{marriage|Philip Frank|2006}},0,7,49
424,Wisconsin 3,Ron Kind,Democratic,No,JD,1997,La Crosse,1963,{{marriage|Tawni Kind|1994}},2,23,34
426,Wisconsin 5,Jim Sensenbrenner,Republican,Yes,JD,1979,Menomonee Falls,1943,{{marriage|Cheryl Warren|1977|2020|end=d.}},2,41,36
429,Wisconsin 8,Mike Gallagher,Republican,No,PhD,2017,Green Bay,1984,{{marriage|[[Anne Horak Gallagher]]|2019}},0,3,33


As we can see, the list of rows whose Spouse column contains the word marriage is large, so we don't want to set each of these to Yes or No manually. Instead we'll look for the ones we're unsure about and set only those.

In [20]:
member_complete.loc[26, 'Spouse'] = 'Yes'
member_complete.loc[141, 'Spouse'] = 'Yes'
member_complete.loc[171, 'Spouse'] = 'No'
member_complete.loc[187, 'Spouse'] = 'Yes'
member_complete.loc[337, 'Spouse'] = 'Yes'
member_complete.loc[7, 'Spouse'] = 'Yes'
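
The manual review could be narrowed down by parsing the `{{marriage|...}}` strings. The sketch below assumes, based only on the examples in the outputs above, that the positional fields are spouse, start year and end year, optionally followed by named fields such as `end=div`. The function name `marriage_has_ended` and the names in the last example are hypothetical. An ended marriage does not rule out a later one, so flagged rows would still need a manual look.

```python
import re

# [[target|label]] or [[target]] -> keep only the visible text.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
TEMPLATE = re.compile(r"\{\{\s*marriage\s*\|(.*?)\}\}", re.IGNORECASE)

def marriage_has_ended(text):
    """Return True/False for a {{marriage|...}} string, None if none is found."""
    match = TEMPLATE.search(WIKILINK.sub(r"\1", text))
    if match is None:
        return None
    parts = [p.strip() for p in match.group(1).split("|")]
    named = [p for p in parts if "=" in p]
    positional = [p for p in parts if "=" not in p]
    # Assumed positional order: spouse, start date, end date.
    has_end_date = len(positional) >= 3 and positional[2] != ""
    return has_end_date or bool(named)

print(marriage_has_ended("{{marriage|Rebecca Dukes|1982}}"))                # False
print(marriage_has_ended("{{marriage|Nicole|2006|}}"))                      # False
print(marriage_has_ended("{{marriage|Edward Waters|1956|1972|end=div}}"))   # True
print(marriage_has_ended("{{marriage|[[Jane Doe|Jane Roe]]|2010|2017}}"))   # True
print(marriage_has_ended("Riley Roby"))                                     # None
```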

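Now we'll replace any 'none' or '0' with 'No' and the rest with 'Yes'. Note that str.contains('none|0') matches any value containing the character 0, so entries such as '{{marriage|[[Anne Horak Gallagher]]|2019}}' are also set to 'No', as the final output shows.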

In [21]:
member_complete.loc[member_complete['Spouse'].str.contains('none|0'), 'Spouse'] = 'No'
member_complete.loc[member_complete['Spouse'].str.contains('Yes|No') == False, 'Spouse'] = 'Yes'
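
The difference between a substring match and a whole-value match can be seen on a few illustrative values. The sketch below compares `str.contains` with `isin`, which only matches values that are exactly 'none' or '0'; the spouse name is made up.

```python
import pandas as pd

spouse = pd.Series(["none", "0", "{{marriage|Jane Doe|2006}}", "Yes"])

# Substring match: the year 2006 contains a "0", so the third row matches too.
loose = spouse.str.contains("none|0")

# Whole-value match: only entries that are exactly "none" or "0".
strict = spouse.isin(["none", "0"])

print(loose.tolist())   # [True, True, True, False]
print(strict.tolist())  # [True, True, False, False]
```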

Before saving the data we have to remove the Unicode non-breaking spaces (\xa0) from the District column.

In [22]:
member_complete['District'] = member_complete['District'].str.split().str.join(' ')
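
The split-and-join idiom works because Python's `str.split()` without arguments splits on any Unicode whitespace, including `\xa0`. A plain-Python sketch of what happens to a single value (the string is illustrative):

```python
raw = "New\xa0York\xa012"

# split() with no argument treats the non-breaking space as whitespace.
parts = raw.split()
print(parts)  # ['New', 'York', '12']

# Joining with a regular space gives a string that compares equal to 'New York 12'.
normalized = " ".join(parts)
print(normalized == "New York 12")  # True
print(raw == "New York 12")         # False
```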

In [23]:
member_complete.to_csv(current_folder + '/data/resultingData/clean_member.csv') # Save the results to a CSV file.
member_complete

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born,Spouse,Childrens,Years in office,Age when elected
0,Alabama 1,Bradley Byrne,Republican,Yes,JD,2014,Fairhope,1955,Yes,4,6,59
1,Alabama 2,Martha Roby,Republican,Yes,JD,2011,Montgomery,1976,Yes,2,9,35
2,Alabama 3,Mike Rogers,Republican,Yes,JD,2003,Saks,1958,Yes,0,17,45
3,Alabama 4,Robert Aderholt,Republican,Yes,JD,1997,Haleyville,1965,Yes,0,23,32
4,Alabama 5,Mo Brooks,Republican,Yes,JD,2011,Huntsville,1954,Yes,4,9,57
...,...,...,...,...,...,...,...,...,...,...,...,...
426,Wisconsin 5,Jim Sensenbrenner,Republican,Yes,JD,1979,Menomonee Falls,1943,No,2,41,36
427,Wisconsin 6,Glenn Grothman,Republican,Yes,JD,2015,Campbellsport,1955,No,0,5,60
428,Wisconsin 7,Tom Tiffany,Republican,Yes,Bachelor,2020,Minocqua,1957,Yes,3,0,63
429,Wisconsin 8,Mike Gallagher,Republican,No,PhD,2017,Green Bay,1984,No,0,3,33


Now we'll look at the other data set with data about each district.

In [24]:
district_frame = pd.read_csv(district_data)
district_frame

Unnamed: 0.1,Unnamed: 0,District,People: Sex and Age Total population,People: Sex and Age Male,People: Sex and Age Female,People: Sex and Age Median age (years),People: Race Total population,People: Race White,People: Race Black or African American,People: Race American Indian and Alaska Native,...,Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Total households,"Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Less than $10,000","Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) $200,000 or more",Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Median household income (dollars),Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Mean household income (dollars),Socioeconomic: Percentage of Families and People Whose Income in the Past 12 Months is Below the Poverty Level All people,Education: Educational Attainment Percent high school graduate or higher,Most Employees,Largest Payroll,Most Establishments
0,0,Alabama District 01,715346,344543,370803,40.3,715346,479949,196891,6992,...,272626,23918,9860,46445,65685,17.0,87.0,Retail trade,Manufacturing,Retail trade
1,1,Alabama District 02,678122,326346,351776,38.8,678122,426674,217977,2838,...,254315,25282,7590,48290,64712,18.0,85.7,Health care and social assistance,Health care and social assistance,Retail trade
2,2,Alabama District 03,708409,345941,362468,39.2,708409,490987,183949,1300,...,264595,29780,8563,45832,64442,18.1,85.5,Manufacturing,Manufacturing,Retail trade
3,3,Alabama District 04,686297,336349,349948,40.5,686297,591084,51171,4391,...,256963,21203,7590,45387,63553,17.6,82.9,Manufacturing,Manufacturing,Retail trade
4,4,Alabama District 05,725634,359028,366606,39.5,725634,539903,130913,3123,...,290207,18096,14954,57174,77865,12.9,88.8,"Professional, scientific, and technical services","Professional, scientific, and technical services",Retail trade
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432,4,Wisconsin District 05,731341,358554,372787,41.3,731341,659846,21218,2076,...,297562,11468,23293,70271,95385,6.8,94.8,Manufacturing,Manufacturing,Health care and social assistance
433,5,Wisconsin District 06,714886,361506,353380,41.9,714886,660186,15080,3448,...,294811,12834,12744,59868,78267,9.0,92.4,Manufacturing,Manufacturing,Retail trade
434,6,Wisconsin District 07,710420,357492,352928,44.7,710420,660634,5661,13419,...,305764,14231,10296,57200,73533,9.6,92.0,Manufacturing,Manufacturing,Retail trade
435,7,Wisconsin District 08,735997,367530,368467,40.7,735997,662077,12052,19016,...,301309,11134,12544,61423,78602,8.3,92.8,Manufacturing,Manufacturing,Retail trade


We start with dropping the 'Unnamed: 0' column

In [25]:
district_frame = district_frame.drop(columns = ['Unnamed: 0'])

First we have to change the values in the 'District' column to match those found in the 'congress_members' data, e.g. 'Alabama District 01' should be 'Alabama 1'. This is done in two steps: first we remove 'District' by replacing it with '', and then we change 01->1, 02->2, etc. up to 09->9. Before that, however, we'll extract a list of states to add to the dataset.

In [26]:
states = district_frame['District'].apply(lambda x: x.split('District')[0]) # Extract the state name (the text before 'District')

district_frame['District'] = district_frame['District'].replace(' District', '', regex = True)
district_frame['District'] = district_frame['District'].replace(r'\(At Large\)', 'at-large', regex = True)
district_frame['District'] = district_frame['District'].replace('01', '1', regex = True)
district_frame['District'] = district_frame['District'].replace('02', '2', regex = True)
district_frame['District'] = district_frame['District'].replace('03', '3', regex = True)
district_frame['District'] = district_frame['District'].replace('04', '4', regex = True)
district_frame['District'] = district_frame['District'].replace('05', '5', regex = True)
district_frame['District'] = district_frame['District'].replace('06', '6', regex = True)
district_frame['District'] = district_frame['District'].replace('07', '7', regex = True)
district_frame['District'] = district_frame['District'].replace('08', '8', regex = True)
district_frame['District'] = district_frame['District'].replace('09', '9', regex = True)
district_frame['District'] = district_frame['District'].str.strip() #Remove any trailing whitespace

district_frame

Unnamed: 0,District,People: Sex and Age Total population,People: Sex and Age Male,People: Sex and Age Female,People: Sex and Age Median age (years),People: Race Total population,People: Race White,People: Race Black or African American,People: Race American Indian and Alaska Native,People: Race Asian,...,Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Total households,"Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Less than $10,000","Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) $200,000 or more",Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Median household income (dollars),Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Mean household income (dollars),Socioeconomic: Percentage of Families and People Whose Income in the Past 12 Months is Below the Poverty Level All people,Education: Educational Attainment Percent high school graduate or higher,Most Employees,Largest Payroll,Most Establishments
0,Alabama 1,715346,344543,370803,40.3,715346,479949,196891,6992,10627,...,272626,23918,9860,46445,65685,17.0,87.0,Retail trade,Manufacturing,Retail trade
1,Alabama 2,678122,326346,351776,38.8,678122,426674,217977,2838,7442,...,254315,25282,7590,48290,64712,18.0,85.7,Health care and social assistance,Health care and social assistance,Retail trade
2,Alabama 3,708409,345941,362468,39.2,708409,490987,183949,1300,13049,...,264595,29780,8563,45832,64442,18.1,85.5,Manufacturing,Manufacturing,Retail trade
3,Alabama 4,686297,336349,349948,40.5,686297,591084,51171,4391,3072,...,256963,21203,7590,45387,63553,17.6,82.9,Manufacturing,Manufacturing,Retail trade
4,Alabama 5,725634,359028,366606,39.5,725634,539903,130913,3123,12215,...,290207,18096,14954,57174,77865,12.9,88.8,"Professional, scientific, and technical services","Professional, scientific, and technical services",Retail trade
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432,Wisconsin 5,731341,358554,372787,41.3,731341,659846,21218,2076,24913,...,297562,11468,23293,70271,95385,6.8,94.8,Manufacturing,Manufacturing,Health care and social assistance
433,Wisconsin 6,714886,361506,353380,41.9,714886,660186,15080,3448,17053,...,294811,12834,12744,59868,78267,9.0,92.4,Manufacturing,Manufacturing,Retail trade
434,Wisconsin 7,710420,357492,352928,44.7,710420,660634,5661,13419,12134,...,305764,14231,10296,57200,73533,9.6,92.0,Manufacturing,Manufacturing,Retail trade
435,Wisconsin 8,735997,367530,368467,40.7,735997,662077,12052,19016,15780,...,301309,11134,12544,61423,78602,8.3,92.8,Manufacturing,Manufacturing,Retail trade
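
The nine zero-padding replacements above can also be expressed with a single pattern. The following is a minimal sketch on a small made-up Series; the helper name `normalize_district` and the sample labels are illustrative and not part of the notebook, and the sketch assumes every label has the form 'State District NN' or 'State District (At Large)'.

```python
import pandas as pd

def normalize_district(labels: pd.Series) -> pd.Series:
    """Turn 'Alabama District 01' into 'Alabama 1' (hypothetical helper)."""
    return (
        labels
        .str.replace(' District', '', regex=False)           # drop the word 'District'
        .str.replace('(At Large)', 'at-large', regex=False)  # literal match, no escaping needed
        .str.replace(r'\b0(\d)\b', r'\1', regex=True)        # 01 -> 1, ..., 09 -> 9; 10 is untouched
        .str.strip()                                         # remove leading/trailing whitespace
    )

sample = pd.Series(['Alabama District 01', 'Texas District 10', 'Alaska District (At Large)'])
normalized = normalize_district(sample)
print(normalized.tolist())
```

The word boundaries in the pattern keep it from touching two-digit district numbers such as 10 or 20.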


Now we can start on the bulk of the cleaning of this dataset, which mainly consists of replacing absolute counts with proportions. Starting with the population statistics, we want the number of males and females to be given as a fraction of the total population (a value between 0 and 1) rather than as absolute numbers.

In [27]:
district_frame['People: Sex and Age Male'] = district_frame['People: Sex and Age Male'] / district_frame['People: Sex and Age Total population']
district_frame['People: Sex and Age Female'] = district_frame['People: Sex and Age Female']/district_frame['People: Sex and Age Total population']

Next we'll do the same for the race and ethnicity columns.

In [28]:
district_frame['People: Race White'] = district_frame['People: Race White'] / district_frame['People: Race Total population']
district_frame['People: Race Black or African American'] = district_frame['People: Race Black or African American'] / district_frame['People: Race Total population']
district_frame['People: Race American Indian and Alaska Native'] = district_frame['People: Race American Indian and Alaska Native'] / district_frame['People: Race Total population']
district_frame['People: Race Asian'] = district_frame['People: Race Asian'] / district_frame['People: Race Total population']
district_frame['People: Race Native Hawaiian and Other Pacific Islander'] = district_frame['People: Race Native Hawaiian and Other Pacific Islander'] / district_frame['People: Race Total population']
district_frame['People: Race Some other race'] = district_frame['People: Race Some other race'] / district_frame['People: Race Total population']
district_frame['People: Hispanic or Latino and Race Hispanic or Latino (of any race)'] = district_frame['People: Hispanic or Latino and Race Hispanic or Latino (of any race)'] / district_frame['People: Race Total population']
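
The repeated divisions above all follow the same pattern, so they could also be written as one vectorized step. This is a minimal sketch on a toy frame with made-up counts; only the column names are taken from the notebook, and `count_columns` is an illustrative variable name.

```python
import pandas as pd

# Toy frame with two hypothetical districts (the counts are made up).
toy = pd.DataFrame({
    'People: Race Total population': [1000, 2000],
    'People: Race White': [600, 500],
    'People: Race Asian': [100, 300],
})

count_columns = ['People: Race White', 'People: Race Asian']
total = toy['People: Race Total population']

# Divide every count column by the row's total population in one call.
toy[count_columns] = toy[count_columns].div(total, axis=0)
print(toy)
```

Passing `axis=0` aligns the totals with the rows, so each district is divided by its own population.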

Now we'll calculate the fraction of native-born and foreign-born residents.

In [29]:
district_frame['People: Place of Birth Native'] = district_frame['People: Place of Birth Native'] / district_frame['People: Race Total population']
district_frame['People: Place of Birth Foreign born'] = district_frame['People: Place of Birth Foreign born'] / district_frame['People: Race Total population']

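Next we compute the proportion of workers in each occupation group for every district. Since the data has no ready-made total for this, we first sum the five occupation columns into a new 'Total Occupation' column.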

In [30]:
district_frame['Total Occupation'] = district_frame['Workers: Occupation Management, business, science, and arts occupations'] + district_frame['Workers: Occupation Service occupations'] + district_frame['Workers: Occupation Sales and office occupations'] + district_frame['Workers: Occupation Natural resources, construction, and maintenance occupations'] + district_frame['Workers: Occupation Production, transportation, and material moving occupations']
district_frame['Workers: Occupation Management, business, science, and arts occupations'] = district_frame['Workers: Occupation Management, business, science, and arts occupations'] / district_frame['Total Occupation']
district_frame['Workers: Occupation Sales and office occupations'] = district_frame['Workers: Occupation Sales and office occupations'] / district_frame['Total Occupation']
district_frame['Workers: Occupation Service occupations'] = district_frame['Workers: Occupation Service occupations'] / district_frame['Total Occupation']
district_frame['Workers: Occupation Natural resources, construction, and maintenance occupations'] = district_frame['Workers: Occupation Natural resources, construction, and maintenance occupations'] / district_frame['Total Occupation']
district_frame['Workers: Occupation Production, transportation, and material moving occupations'] = district_frame['Workers: Occupation Production, transportation, and material moving occupations'] / district_frame['Total Occupation']
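
Because the total is built from the occupation columns themselves, the resulting fractions should add up to 1 in every row, which makes for a cheap sanity check. The sketch below shows the idea on a toy frame with made-up counts and only three of the five occupation columns; `occupation_columns` and `shares` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with two hypothetical districts (the counts are made up).
toy = pd.DataFrame({
    'Workers: Occupation Service occupations': [30, 10],
    'Workers: Occupation Sales and office occupations': [50, 30],
    'Workers: Occupation Production, transportation, and material moving occupations': [20, 60],
})
occupation_columns = list(toy.columns)

# Row-wise total over the occupation columns, then each column's share of it.
total_occupation = toy[occupation_columns].sum(axis=1)
shares = toy[occupation_columns].div(total_occupation, axis=0)

# Every row of shares should sum to 1 (up to floating-point error).
shares_sum_to_one = np.allclose(shares.sum(axis=1), 1.0)
print(shares_sum_to_one)
```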

Next we compute the fraction of households belonging to the lowest and the highest income brackets, respectively.

In [31]:
district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) $200,000 or more'] = district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) $200,000 or more'] / district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Total households']
district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Less than $10,000'] = district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Less than $10,000'] / district_frame['Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Total households']

The last two columns to change are the percentage of the population living below the poverty level and the percentage who are high school graduates or higher; here we divide the values by 100 to turn them into fractions.

In [32]:
district_frame['Socioeconomic: Percentage of Families and People Whose Income in the Past 12 Months is Below the Poverty Level All people'] = district_frame['Socioeconomic: Percentage of Families and People Whose Income in the Past 12 Months is Below the Poverty Level All people'] / 100
district_frame['Education: Educational Attainment Percent high school graduate or higher'] = district_frame['Education: Educational Attainment Percent high school graduate or higher'] / 100

While the current column names are specific about what they refer to, they are a bit unwieldy, so we will now shorten them.

In [33]:
district_frame = district_frame.rename(columns = {'People: Sex and Age Total population':'Total Population',
                                             'People: Sex and Age Male': 'Percentage Male', 'People: Sex and Age Female': 'Percentage Female',
                                             'People: Sex and Age Median age (years)':'Median age(years)',
                                             'People: Race Total population': 'Race Total population',
                                             'People: Race White':'Fraction White',
                                             'People: Race Black or African American':'Fraction Black or A.American',
                                             'People: Race American Indian and Alaska Native':'Fraction Indian or Alaskan Native',
                                             'People: Race Asian':'Fraction Asian',
                                             'People: Race Native Hawaiian and Other Pacific Islander': 'Fraction Hawaiian and Pacific Islander',
                                             'People: Race Some other race': 'Fraction Other',
                                             'People: Hispanic or Latino and Race Hispanic or Latino (of any race)':'Fraction Hispanic or Latino',
                                             'People: Place of Birth Total population':'Birth Total Population',
                                             'People: Place of Birth Native':'Fraction Native',
                                             'People: Place of Birth Foreign born':'Fraction Foreigner',
                                             'Workers: Occupation Management, business, science, and arts occupations':'Fraction Management, business, science, and arts occupation',
                                             'Workers: Occupation Service occupations':'Fraction Service occupations',
                                             'Workers: Occupation Sales and office occupations':'Fraction Sales and office occupations',
                                             'Workers: Occupation Natural resources, construction, and maintenance occupations':'Fraction Natural resources, construction, and maintenance occupations',
                                             'Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Total households':'Total households',
                                             'Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Less than $10,000':'Fraction of households with income below $10,000',
                                             'Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) $200,000 or more':'Fraction of households with income at $200,000 or more',
                                             'Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Median household income (dollars)':'Median household income',
                                             'Socioeconomic: Income and Benefits (In 2018 inflation-adjusted dollars) Mean household income (dollars)':'Mean household income',
                                             'Socioeconomic: Percentage of Families and People Whose Income in the Past 12 Months is Below the Poverty Level All people':'Fraction of all people below poverty level',
                                             'Education: Educational Attainment Percent high school graduate or higher': 'Fraction attaining at least high school graduation',
                                             'Workers: Occupation Production, transportation, and material moving occupations':'Fraction Production, transportation, and material moving occupations'})
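
With a mapping this long it is easy to mistype a key, and by default `DataFrame.rename` ignores keys that do not match an existing column, so the affected column simply keeps its old name. The sketch below, on a toy frame with a deliberately misspelled key, shows how the `errors='raise'` argument of `rename` turns such a typo into a `KeyError`; the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame; the values are made up.
toy = pd.DataFrame({'People: Race Asian': [0.01], 'People: Race White': [0.67]})

# Default behaviour: the misspelled key is ignored and the column keeps its name.
silent = toy.rename(columns={'People: Race Asia': 'Fraction Asian'})

# With errors='raise' the same typo surfaces as a KeyError.
try:
    toy.rename(columns={'People: Race Asia': 'Fraction Asian'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(silent.columns), typo_detected)
```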

Finally we add the state column and then save the file.

In [34]:
district_frame['State'] = states
district_frame.to_csv(current_folder + '/data/resultingData/clean_district.csv') #Print the results to a csv file.   
district_frame

Unnamed: 0,District,Total Population,Percentage Male,Percentage Female,Median age(years),Race Total population,Fraction White,Fraction Black or A.American,Fraction Indian or Alaskan Native,Fraction Asian,...,"Fraction of households with income at $200,000 or more",Median household income,Mean household income,Fraction of all people below poverty level,Fraction attaining at least high school graduation,Most Employees,Largest Payroll,Most Establishments,Total Occupation,State
0,Alabama 1,715346,0.481645,0.518355,40.3,715346,0.670933,0.275239,0.009774,0.014856,...,0.036167,46445,65685,0.170,0.870,Retail trade,Manufacturing,Retail trade,302309,Alabama
1,Alabama 2,678122,0.481250,0.518750,38.8,678122,0.629199,0.321442,0.004185,0.010974,...,0.029845,48290,64712,0.180,0.857,Health care and social assistance,Health care and social assistance,Retail trade,277754,Alabama
2,Alabama 3,708409,0.488335,0.511665,39.2,708409,0.693084,0.259665,0.001835,0.018420,...,0.032363,45832,64442,0.181,0.855,Manufacturing,Manufacturing,Retail trade,294766,Alabama
3,Alabama 4,686297,0.490092,0.509908,40.5,686297,0.861266,0.074561,0.006398,0.004476,...,0.029537,45387,63553,0.176,0.829,Manufacturing,Manufacturing,Retail trade,286219,Alabama
4,Alabama 5,725634,0.494778,0.505222,39.5,725634,0.744043,0.180412,0.004304,0.016834,...,0.051529,57174,77865,0.129,0.888,"Professional, scientific, and technical services","Professional, scientific, and technical services",Retail trade,335078,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432,Wisconsin 5,731341,0.490269,0.509731,41.3,731341,0.902241,0.029012,0.002839,0.034065,...,0.078279,70271,95385,0.068,0.948,Manufacturing,Manufacturing,Health care and social assistance,388665,Wisconsin
433,Wisconsin 6,714886,0.505683,0.494317,41.9,714886,0.923484,0.021094,0.004823,0.023854,...,0.043228,59868,78267,0.090,0.924,Manufacturing,Manufacturing,Retail trade,367724,Wisconsin
434,Wisconsin 7,710420,0.503212,0.496788,44.7,710420,0.929920,0.007969,0.018889,0.017080,...,0.033673,57200,73533,0.096,0.920,Manufacturing,Manufacturing,Retail trade,355290,Wisconsin
435,Wisconsin 8,735997,0.499363,0.500637,40.7,735997,0.899565,0.016375,0.025837,0.021440,...,0.041632,61423,78602,0.083,0.928,Manufacturing,Manufacturing,Retail trade,380200,Wisconsin
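
The 'Unnamed: 0' columns that appear when these files are read back come from the DataFrame index, which `to_csv` writes as an unlabeled first column by default. The sketch below shows the effect on a toy frame using an in-memory buffer instead of a file. This notebook keeps the index and drops the extra column after loading, so `index=False` is only an alternative here; switching to it would mean adjusting the later cells that drop those columns.

```python
import io
import pandas as pd

toy = pd.DataFrame({'District': ['Alabama 1', 'Alabama 2'],
                    'State': ['Alabama', 'Alabama']})

# Round trip with the default settings: the index comes back as 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Round trip without the index: only the real columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```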


We have now finished cleaning the two data files and can move on to the next part, data exploration, which is done in two separate notebooks (one for each data file). As a final step, however, we merge the two data files. The code can also be found in the Python file "data_cleaning_congress.py".

In [35]:
current_folder = os.getcwd()
dataM = current_folder + '/data/resultingData/clean_member.csv'
dataD = current_folder + '/data/resultingData/clean_district.csv' #Same folder the file was written to above (fixed capitalisation)
member_data = pd.read_csv(dataM)
member_data = member_data.astype({'District': str}) #Make sure that column is a string
district_data = pd.read_csv(dataD)
district_data = district_data.astype({'District':str}) #Make sure that column is a string
data = pd.merge(member_data, district_data, on = 'District')
data = data.drop(columns = ['Unnamed: 0_x', 'Unnamed: 0_y'])
data['State'] = data['State'].str.strip()
data.to_csv(current_folder + '/data/resultingData/merged_data.csv')
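
`pd.merge` performs an inner join by default, so any district label that appears in only one of the two files is dropped without notice. One way to check for such rows is an outer merge with `indicator=True`, sketched below on two toy frames. All rows and values are made up for illustration, and `checked` and `unmatched` are illustrative names.

```python
import pandas as pd

# Hypothetical rows; only the 'District' key column matters for this check.
members = pd.DataFrame({'District': ['Alabama 1', 'Alaska at-large', 'Guam at-large'],
                        'Member': ['A', 'B', 'C']})
districts = pd.DataFrame({'District': ['Alabama 1', 'Alaska at-large', 'Wyoming at-large'],
                          'Total Population': [1000, 2000, 3000]})

# The '_merge' column records whether each key was found in the left frame, the right frame, or both.
checked = pd.merge(members, districts, on='District', how='outer', indicator=True)
unmatched = checked.loc[checked['_merge'] != 'both', ['District', '_merge']]
print(unmatched)
```

An empty `unmatched` frame would indicate that every district label lines up between the two files.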