# Data Manipulation with `Python` Exercises

Welcome to one of your first exercise notebooks. What should you expect from these notebooks? We will revisit the concepts and code that we ran through in the previous labs and practices, except that you will now do most of the coding. The questions will feel familiar, although the output might throw a few more errors. We have not seen some of these issues yet; they are meant to challenge you.

## Read in the Data

We will be using a different data set for this exercise (don't worry, if you like the Baby Names data set, we will be seeing it again). This data set lists every member of the U.S. Congress from January 1947 to February 2014, along with some information about each of them.

Go ahead and read in `congress-terms.csv` from the `datasets/` directory. Pay particular attention to the encoding. Run the following cell:

In [142]:
import pandas as pd

with open('../../../datasets/congress-terms.csv', 'r', encoding = 'ISO-8859-1' ) as file:
    data = file.read()
    
    data_lists = data.split("\n")

    list_of_lists = []
    for line in data_lists:
        row = line.split(',')
        list_of_lists.append(row)

    # print the first 11 lists (the header plus 10 rows) to get an idea of what the data looks like
    for row in list_of_lists[0:11]:
        print(' ,'.join(row))

congress ,chamber ,bioguide ,firstname ,middlename ,lastname ,suffix ,birthday ,state ,party ,incumbent ,termstart ,age
80 ,house ,M000112 ,Joseph ,Jefferson ,Mansfield , ,1861-02-09 ,TX ,D ,Yes ,1/3/47 ,85.9
80 ,house ,D000448 ,Robert ,Lee ,Doughton , ,1863-11-07 ,NC ,D ,Yes ,1/3/47 ,83.2
80 ,house ,S000001 ,Adolph ,Joachim ,Sabath , ,1866-04-04 ,IL ,D ,Yes ,1/3/47 ,80.7
80 ,house ,E000023 ,Charles ,Aubrey ,Eaton , ,1868-03-29 ,NJ ,R ,Yes ,1/3/47 ,78.8
80 ,house ,L000296 ,William , ,Lewis , ,1868-09-22 ,KY ,R ,No ,1/3/47 ,78.3
80 ,house ,G000017 ,James ,A. ,Gallagher , ,1869-01-16 ,PA ,R ,No ,1/3/47 ,78
80 ,house ,W000265 ,Richard ,Joseph ,Welch , ,1869-02-13 ,CA ,R ,Yes ,1/3/47 ,77.9
80 ,house ,B000565 ,Sol , ,Bloom , ,1870-03-09 ,NY ,D ,Yes ,1/3/47 ,76.8
80 ,house ,H000943 ,Merlin , ,Hull , ,1870-12-18 ,WI ,R ,Yes ,1/3/47 ,76
80 ,house ,G000169 ,Charles ,Laceille ,Gifford , ,1871-03-15 ,MA ,R ,Yes ,1/3/47 ,75.8


**Question 1**: You will notice something a little different about reading in this file, particularly the `encoding` parameter. Do a bit of research on what an encoding is. What happens when you remove this parameter altogether? Do your best to describe any errors that are thrown.
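
As a starting point for the research in Question 1, here is a minimal sketch of why the `encoding` argument matters. The sample string is made up for illustration and is not taken from `congress-terms.csv`. The sketch also assumes that, without the argument, Python would try UTF-8, which is a common default but depends on your platform.

```python
# Hypothetical sample text (an assumption, not a value from the data set)
sample = "Peña"

# Encode the text to bytes the way an ISO-8859-1 file would store it
raw = sample.encode("ISO-8859-1")

# Decoding with the matching encoding recovers the original text
decoded = raw.decode("ISO-8859-1")
print(decoded)

# Decoding the same bytes as UTF-8 fails, because the ISO-8859-1 byte
# for "ñ" does not form a valid UTF-8 sequence here
error_message = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    error_message = str(err)
    print("UnicodeDecodeError:", error_message)
```

If the real file contains characters like this, removing `encoding` from `open` may raise a similar error when `file.read()` runs.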

**Question 2**: In the `list_of_lists` variable, the last item of each list is the `age` of the member of Congress. This is currently a string. Without using any packages, convert all of the values for `age` into floats.

In [73]:
# Execute your code for question 2 here
# -------------------------------------
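# Skip the header row (index 0); the end index 18637 is hard-coded for this file
subset = list_of_lists[1:18637]

age_floats = []

for row in subset:
    # age is the 13th field of each row, so it sits at index 12
    age_flo = float(row[12])
    age_floats.append(age_flo)
    
# Show the first five converted values
age_floats[0:5]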

[85.9, 83.2, 80.7, 78.8, 78.3]
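
Question 2 asks for the values inside `list_of_lists` to be converted, whereas the cell above collects them in a separate list. One possible variation, sketched here on a small stand-in list, overwrites the last field of each row in place. The `sample_rows` data is hypothetical, and the trailing `['']` row assumes the file ends with a newline, which `split("\n")` would turn into an empty final row.

```python
# Hypothetical stand-in for list_of_lists: a header, two data rows,
# and the empty row that a trailing newline would produce (an assumption)
sample_rows = [
    ["congress", "chamber", "age"],
    ["80", "house", "85.9"],
    ["80", "house", "78"],
    [""],
]

# Skip the header, then replace the last field of each row with a float
for row in sample_rows[1:]:
    try:
        row[-1] = float(row[-1])
    except ValueError:
        # Rows whose last field is not numeric (such as the empty row) are left alone
        continue

print(sample_rows)
```

Using `row[-1]` and a `try`/`except` avoids hard-coding both the column position and the number of rows.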

**Question 3**: Once you have converted the `age` values for every member, go ahead and read in the file with `pandas` and save the data frame to a variable called `df`.

In [166]:
# Execute your code for question 3 here
# -------------------------------------
import pandas as pd

with open('../../../datasets/congress-terms.csv', 'r', encoding = 'ISO-8859-1' ) as file:
    df = pd.read_csv(file)

df.head()

|   | congress | chamber | bioguide | firstname | middlename | lastname | suffix | birthday | state | party | incumbent | termstart | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | house | M000112 | Joseph | Jefferson | Mansfield |  | 1861-02-09 | TX | D | Yes | 1/3/47 | 85.9 |
| 1 | 80 | house | D000448 | Robert | Lee | Doughton |  | 1863-11-07 | NC | D | Yes | 1/3/47 | 83.2 |
| 2 | 80 | house | S000001 | Adolph | Joachim | Sabath |  | 1866-04-04 | IL | D | Yes | 1/3/47 | 80.7 |
| 3 | 80 | house | E000023 | Charles | Aubrey | Eaton |  | 1868-03-29 | NJ | R | Yes | 1/3/47 | 78.8 |
| 4 | 80 | house | L000296 | William |  | Lewis |  | 1868-09-22 | KY | R | No | 1/3/47 | 78.3 |
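
`pd.read_csv` also accepts an `encoding` argument directly, so the explicit `open` call is optional. The sketch below uses a made-up, in-memory sample instead of the real file (the names and values are assumptions) and shows that pandas infers a float type for `age` without the manual conversion from Question 2.

```python
import io

import pandas as pd

# Hypothetical two-row sample stored as ISO-8859-1 bytes; not the real file
csv_text = "firstname,lastname,age\nJosé,Peña,51.2\nAnn,Lee,78\n"
csv_bytes = csv_text.encode("ISO-8859-1")

# Pass the encoding to read_csv itself instead of to open()
sample_df = pd.read_csv(io.BytesIO(csv_bytes), encoding="ISO-8859-1")

# Inspect the inferred column types
print(sample_df.dtypes)
```

With the real data, the equivalent call would pass the path to `congress-terms.csv` in place of the in-memory buffer.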


**Question 4**: Find a method to print the column headers of the data frame `df`.

In [84]:
# Execute your code for question 4 here
# -------------------------------------
list(df.columns)


['congress',
 'chamber',
 'bioguide',
 'firstname',
 'middlename',
 'lastname',
 'suffix',
 'birthday',
 'state',
 'party',
 'incumbent',
 'termstart',
 'age']

**Question 5**: Congresses are numbered. Notice that there is a column devoted to the Congress number, conveniently called `congress`. Create a subset of the data frame containing only the 80th Congress and call this subset `congress80`.

In [164]:
# Execute your code for question 5 here
# -------------------------------------
congress80 = df[(df['congress'] == 80)]

congress80.head()

|   | congress | chamber | bioguide | firstname | middlename | lastname | suffix | birthday | state | party | incumbent | termstart | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | house | M000112 | Joseph | Jefferson | Mansfield |  | 1861-02-09 | TX | D | Yes | 1/3/47 | 85.9 |
| 1 | 80 | house | D000448 | Robert | Lee | Doughton |  | 1863-11-07 | NC | D | Yes | 1/3/47 | 83.2 |
| 2 | 80 | house | S000001 | Adolph | Joachim | Sabath |  | 1866-04-04 | IL | D | Yes | 1/3/47 | 80.7 |
| 3 | 80 | house | E000023 | Charles | Aubrey | Eaton |  | 1868-03-29 | NJ | R | Yes | 1/3/47 | 78.8 |
| 4 | 80 | house | L000296 | William |  | Lewis |  | 1868-09-22 | KY | R | No | 1/3/47 | 78.3 |
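
The filter in Question 5 works in two steps: the comparison builds a Series of `True`/`False` values, and indexing the data frame with that Series keeps only the `True` rows. A minimal sketch on a made-up three-row frame (the data and the names `sample_df`, `mask`, and `subset80` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature version of df
sample_df = pd.DataFrame({
    "congress": [80, 80, 81],
    "chamber": ["house", "senate", "house"],
})

# Step 1: the comparison yields one boolean per row
mask = sample_df["congress"] == 80
print(mask)

# Step 2: indexing with the boolean Series keeps rows where it is True
subset80 = sample_df[mask]
print(subset80)
```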


**Question 6**: Now, from this `congress80` subset, use a method that counts the rows for House members, and then do the same for Senate members.

In [167]:
# Execute your code for question 6 here
# -------------------------------------
congress80 = df[(df['congress'] == 80)]
congress80_h = congress80[(congress80['chamber'] == 'house')]

len(congress80_h)

453

In [168]:
congress80 = df[(df['congress'] == 80)]
congress80_s = congress80[(congress80['chamber'] == 'senate')]

len(congress80_s)


102
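
As an alternative to filtering once per chamber, `value_counts` can tally every value in a column in a single call. This sketch uses a made-up frame, `sample80`, as a stand-in for `congress80`; the rows are assumptions and do not reproduce the counts above.

```python
import pandas as pd

# Hypothetical stand-in for congress80
sample80 = pd.DataFrame({
    "chamber": ["house", "house", "senate"],
    "lastname": ["A", "B", "C"],
})

# Count how many rows fall under each distinct chamber value
chamber_counts = sample80["chamber"].value_counts()
print(chamber_counts)
```

Individual counts can then be looked up by label, for example `chamber_counts["house"]`.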