# Census Variables

You have decided to volunteer for your local community by offering to clean their recently collected census data. The description of this dataset is as follows:

| column         | description                                                                                                             |
|----------------|-------------------------------------------------------------------------------------------------------------------------|
| first_name     | The respondent’s first name.                                                                                            |
| last_name      | The respondent’s last name.                                                                                             |
| birth_year     | The respondent’s year of birth.                                                                                         |
| voted          | Whether the respondent participated in the current voting cycle.                                                        |
| num_children   | The number of children the respondent has.                                                                              |
| income_year    | The average yearly income the respondent earns.                                                                         |
| higher_tax     | The respondent’s answer to the question: “Rate your agreement with the statement: the wealthy should pay higher taxes.” |
| marital_status | The respondent’s current marital status.                                                                                |


In [96]:
import pandas as pd

# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

The `census` dataframe contains simulated census data representing the demographics of a small community in the U.S. Call the `.head()` method on the `census` dataframe to view the first five rows, then inspect the summary statistics, the column data types, and the unique values of `birth_year`.

In [97]:
display(census.head())
display(census.describe())
print(census.dtypes)
print(census['birth_year'].unique())

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced
2,Salomon,Orn,1992,True,2,166313.45,agree,single
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,Gust,Abernathy,1945,False,2,143316.08,agree,married


Unnamed: 0,num_children,income_year
count,100.0,100.0
mean,1.81,111380.7897
std,1.433333,49015.171775
min,0.0,35635.14
25%,0.75,71246.52
50%,2.0,104990.805
75%,3.0,153492.09
max,4.0,198123.77


first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 'missing' '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
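
The output above shows `birth_year` stored as `object` because one entry is the string `'missing'`. As a minimal sketch of how such entries can be located, the snippet below uses a small made-up series (a stand-in for `census['birth_year']`, not the real data) and `pd.to_numeric` with `errors='coerce'`:

```python
import pandas as pd

# Hypothetical stand-in for census['birth_year']; values are illustrative only
birth_year = pd.Series(['2005', '1987', 'missing', '1965'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(birth_year, errors='coerce')

# The rows that failed to parse are the ones blocking a numeric dtype
bad_values = birth_year[parsed.isna()]
print(bad_values.tolist())
```

Listing the offending values first makes it easier to decide how each one should be replaced.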


In [98]:
# Replace the 'missing' placeholder with 1967; the column keeps its object dtype for now
census['birth_year'] = census['birth_year'].replace(['missing'], 1967)
# Print the unique values to confirm the placeholder is gone
print(census['birth_year'].unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 1967 '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
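
In the output above, `1967` appears without quotes while every other year is still a quoted string, so the column temporarily holds mixed types. The following sketch, using a made-up series rather than the real data, shows that state and how a cast to integer resolves it:

```python
import pandas as pd

# Hypothetical stand-in for census['birth_year']; dtype=object mirrors the notebook
birth_year = pd.Series(['2005', 'missing', '1965'], dtype=object)

# After the replacement the column mixes strings with one integer
replaced = birth_year.replace(['missing'], 1967)
print(replaced.map(type).unique())

# Casting parses each string and leaves the integer as it is
converted = replaced.astype('int')
print(converted.tolist())
```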


In [99]:
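# Every entry is now numeric, so cast birth_year to an integer dtype
census['birth_year'] = census['birth_year'].astype('int')
print(census.dtypes)
# The mean is only available once the column is numeric
print(census['birth_year'].mean())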

first_name         object
last_name          object
birth_year          int64
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
1973.4


In [100]:
# Encode higher_tax as an ordered categorical, from least to most agreement
census['higher_tax'] = pd.Categorical(
    census['higher_tax'],
    ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'],
    ordered=True,
)
print(census['higher_tax'].unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
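
Declaring the order is what lets pandas compare and rank the responses. A small sketch with made-up answers (not taken from the census data) illustrates this:

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Illustrative responses only
answers = pd.Series(
    pd.Categorical(['agree', 'disagree', 'strongly agree', 'neutral'],
                   categories=levels, ordered=True)
)

# Ordered categoricals can be compared against a category label
at_least_agree = answers >= 'agree'
print(at_least_agree.tolist())

# min() and max() follow the declared order, not alphabetical order
print(answers.min(), answers.max())
```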


In [101]:
# Use cat.codes to label encode the higher_tax variable
census['higher_tax'] = census['higher_tax'].cat.codes

# print out the median of the higher_tax variable
print(census['higher_tax'].median())

2.0
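
A median code is easier to read once it is mapped back to its label. The sketch below assumes the same five levels as above and uses made-up responses; note that with an even number of responses the median can fall between two codes, in which case `int()` would truncate it.

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Illustrative responses only
answers = pd.Categorical(
    ['agree', 'disagree', 'neutral', 'neutral', 'strongly agree'],
    categories=levels, ordered=True,
)

# Each code is the position of the response in `levels`
codes = pd.Series(answers.codes)
median_code = codes.median()

# Look the median code up in the level list to recover its label
median_label = levels[int(median_code)]
print(median_code, median_label)
```

Keeping the list of levels around is useful because overwriting the column with its codes discards the original labels.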


In [102]:
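# Create a new variable called marital_codes by label encoding the marital_status variable.
# Label encoding is not the same as one-hot encoding, and it must happen BEFORE one-hot
# encoding because pd.get_dummies() below replaces marital_status with indicator columns.
# Caveat: values absent from the category list are encoded as -1. 'single' is not listed
# here, so single respondents receive -1 in the output.
census['marital_codes'] = pd.Categorical(census['marital_status'], ['married', 'widowed', 'divorced', 'separated', 'never married'], ordered=True).codes
display(census.head())

# One-hot encode marital_status into one indicator column per observed status
census = pd.get_dummies(data=census, columns=['marital_status'])
display(census.head())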

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status,marital_codes
0,Denise,Ratke,2005,False,0,92129.41,1,single,-1
1,Hali,Cummerata,1987,False,0,75649.17,2,divorced,2
2,Salomon,Orn,1992,True,2,166313.45,3,single,-1
3,Sarina,Schiller,1965,False,2,71704.81,4,married,0
4,Gust,Abernathy,1945,False,2,143316.08,3,married,0


Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_codes,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed
0,Denise,Ratke,2005,False,0,92129.41,1,-1,0,0,1,0
1,Hali,Cummerata,1987,False,0,75649.17,2,2,1,0,0,0
2,Salomon,Orn,1992,True,2,166313.45,3,-1,0,0,1,0
3,Sarina,Schiller,1965,False,2,71704.81,4,0,0,1,0,0
4,Gust,Abernathy,1945,False,2,143316.08,3,0,0,1,0,0
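
The `-1` codes for `single` respondents in the output above come from `single` not appearing in the category list. The following sketch, using a made-up series, shows that behavior and one possible way to avoid it by deriving the categories from the data; it also shows the columns that one-hot encoding produces.

```python
import pandas as pd

# Illustrative values only
status = pd.Series(['single', 'divorced', 'married', 'widowed'])

# 'single' is missing from this category list, so it is coded -1
incomplete = pd.Categorical(status, categories=['married', 'widowed', 'divorced'])
print(incomplete.codes.tolist())

# Deriving the categories from the data guarantees every value gets a code
complete = pd.Categorical(status, categories=sorted(status.unique()))
print(complete.codes.tolist())

# One-hot encoding creates one indicator column per observed value
dummies = pd.get_dummies(status, prefix='marital_status')
print(list(dummies.columns))
```

Because marital status has no natural order, the numeric codes are arbitrary identifiers, which is why one-hot encoding is often preferred for such variables.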


In [103]:
# Earliest and latest birth years, used to set the bin range below
print(census['birth_year'].min())
print(census['birth_year'].max())

1940
2007


In [104]:
# Create a new variable called age_group, which groups respondents by birth year
# in five-year bins.
min_year = census['birth_year'].min()
max_year = census['birth_year'].max()
bins_list = list(range(min_year, max_year + 5, 5))

# Build one label per bin (one fewer than the number of edges). Labels are age ranges
# measured relative to the last bin edge, max(bins_list), not the current year.
labels_list = []
for i in range(len(bins_list) - 1):
    upper_age = max(bins_list) - bins_list[i]
    lower_age = max(bins_list) - bins_list[i + 1]
    labels_list.append(f'{lower_age}-{upper_age}')

# include_lowest=True keeps the earliest birth year, which equals the first edge, in the first bin
census['age_group'] = pd.cut(census['birth_year'], bins=bins_list, labels=labels_list, include_lowest=True)
display(census.head())

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_codes,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed,age_group
0,Denise,Ratke,2005,False,0,92129.41,1,-1,0,0,1,0,5-10
1,Hali,Cummerata,1987,False,0,75649.17,2,2,1,0,0,0,20-25
2,Salomon,Orn,1992,True,2,166313.45,3,-1,0,0,1,0,15-20
3,Sarina,Schiller,1965,False,2,71704.81,4,0,0,1,0,0,45-50
4,Gust,Abernathy,1945,False,2,143316.08,3,0,0,1,0,0,65-70
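
By default `pd.cut` builds right-closed intervals, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. A minimal sketch with made-up years and edges:

```python
import pandas as pd

# Illustrative birth years and bin edges only
years = pd.Series([1940, 1945, 1946, 1950])
edges = [1940, 1945, 1950]

# Default bins are (1940, 1945] and (1945, 1950], so 1940 is left unassigned
default_bins = pd.cut(years, bins=edges)
print(default_bins.isna().tolist())

# include_lowest=True closes the first interval on the left as well
closed_bins = pd.cut(years, bins=edges, include_lowest=True)
print(closed_bins.cat.codes.tolist())
```

This matters whenever the first edge is taken from the minimum of the data, since at least one row then sits exactly on that edge.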


In [105]:
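# Label encode the age_group variable. Codes follow the order of labels_list,
# so 0 is the earliest birth-year bin (the oldest respondents).
census['age_group'] = pd.Categorical(census['age_group'], labels_list, ordered=True).codes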
display(census.head())

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_codes,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed,age_group
0,Denise,Ratke,2005,False,0,92129.41,1,-1,0,0,1,0,12
1,Hali,Cummerata,1987,False,0,75649.17,2,2,1,0,0,0,9
2,Salomon,Orn,1992,True,2,166313.45,3,-1,0,0,1,0,10
3,Sarina,Schiller,1965,False,2,71704.81,4,0,0,1,0,0,4
4,Gust,Abernathy,1945,False,2,143316.08,3,0,0,1,0,0,0
