***Census Variables***

You have decided to volunteer for your local community by offering to clean their recently collected census data. The description of this dataset is as follows:
* column: description
* first_name: The respondent’s first name.
* last_name: The respondent’s last name.
* birth_year: The respondent’s year of birth.
* voted: Whether the respondent participated in the current voting cycle.
* num_children: The number of children the respondent has.
* income_year: The average yearly income the respondent earns.
* higher_tax: The respondent’s answer to the question: “Rate your agreement with the statement: the wealthy should pay higher taxes.”
* marital_status: The respondent’s current marital status.

In [13]:
# Import pandas with alias
import pandas as pd

#1 Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

In [14]:
#2 The census data represents the demographics of a small community in the U.S. Preview the first rows with .head().
print(census.head())

  first_name  last_name birth_year  voted  num_children  income_year  \
0     Denise      Ratke       2005  False             0     92129.41   
1       Hali  Cummerata       1987  False             0     75649.17   
2    Salomon        Orn       1992   True             2    166313.45   
3     Sarina   Schiller       1965  False             2     71704.81   
4       Gust  Abernathy       1945  False             2    143316.08   

       higher_tax marital_status  
0        disagree         single  
1         neutral       divorced  
2           agree         single  
3  strongly agree        married  
4           agree        married  


In [15]:
#3 Compare the values returned from the .head() method with the data types of each variable by calling .dtypes on the census dataframe and print the result.
print(census.dtypes)

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object


In [16]:
#4 The manager of the census would like to know the average birth year of the respondents.
# .dtypes shows that birth_year is stored as object (strings), whereas it should be an integer.
# Print the unique values of the variable with the .unique() method to check for missing or non-numeric entries.
print(census.birth_year.unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 'missing' '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
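Scanning the printed array by eye works for a small column, but a programmatic check may be easier on a larger one. The sketch below runs on a small made-up Series (a stand-in, not the census file) and uses pd.to_numeric with errors='coerce', which turns anything that cannot be parsed as a number into NaN, to list the offending entries.

```python
import pandas as pd

# Toy stand-in for census.birth_year (values are illustrative, not the real data)
birth_year = pd.Series(['2005', '1987', 'missing', '1965'])

# Unparseable strings become NaN instead of raising an error
as_numbers = pd.to_numeric(birth_year, errors='coerce')

# Keep the original strings wherever the conversion failed
non_numeric = birth_year[as_numbers.isna()]
print(non_numeric.tolist())
```

Any value that shows up in non_numeric would need to be replaced or dropped before the column can be cast to an integer type.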


In [17]:
#5 There appears to be a missing value in the birth_year column. You find that the respondent’s birth year is 1967.
# Use the .replace() method to replace the missing value with 1967, so that the data type can be changed to int. 
# Then recheck the values in birth_year by calling the .unique() method and printing the results.
census.birth_year = census.birth_year.replace('missing', '1967')
print(census.birth_year.unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' '1967' '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']


In [18]:
#6 Now that we have adjusted the values in the birth_year variable,
# change the datatype from object (strings) to int and print the datatypes of the census dataframe with .dtypes.
census.birth_year = census.birth_year.astype('int64')
print(census.dtypes)

first_name         object
last_name          object
birth_year          int64
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object


In [19]:
#7 Having assigned birth_year to the appropriate data type, 
# print the average birth year of the respondents to the census using the pandas .mean() method.
print(census.birth_year.mean())

1973.4


In [20]:
#8 Your manager would like to set an order to the higher_tax variable so that: 
# strongly disagree < disagree < neutral < agree < strongly agree.
# Convert the higher_tax variable to the category data type with the appropriate order, 
# then print the new order using the .unique() method.
census.higher_tax = pd.Categorical(census.higher_tax, ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)
print(census.higher_tax.unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']


In [21]:
#9 Your manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. 
# Label encode the higher_tax variable and print the median using the pandas .median() method.
census.higher_tax = census.higher_tax.cat.codes
print("Median sentiment is " + str(census.higher_tax.median()) + " which represents neutral")

Median sentiment is 2.0 which represents neutral
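The label for the median code can also be looked up from the category list rather than typed by hand. The following is a minimal sketch on made-up responses (not the census data); the names levels, answers and median_label are illustrative. It assumes the codes follow the category order, which is how pandas assigns them for a Categorical.

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy responses (illustrative only, not the census data)
answers = pd.Series(['agree', 'neutral', 'disagree', 'strongly agree', 'neutral'])
ordered = pd.Categorical(answers, categories=levels, ordered=True)

# Integer codes follow the category order: 0 = 'strongly disagree', ..., 4 = 'strongly agree'
codes = pd.Series(ordered.codes)
median_code = codes.median()

# Look up the label only when the median is a whole number;
# with an even number of responses it can fall halfway between two levels
if float(median_code).is_integer():
    median_label = levels[int(median_code)]
else:
    median_label = None
print(median_code, median_label)
```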


In [22]:
#10 Your manager is interested in using machine learning models on the census data in the future.
# To help, let’s one-hot encode marital_status to create a binary variable for each category.
# Use the pandas get_dummies() function to one-hot encode the marital_status variable.
census1 = pd.get_dummies(data=census, columns=['marital_status'])
print(census1.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

   higher_tax  marital_status_divorced  marital_status_married  \
0           1                        0                       0   
1           2                        1                       0   
2           3                        0                       0   
3           4                        0                       1   
4           3                        0                       1   

   marital_status_single  marital_status_widowed  
0                      1                       0  
1                      0                       0  
2                      1                       0  
3                      0                       0  
4                      0                       0  
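To see what one-hot encoding produces in isolation, here is a small sketch on a made-up frame with one row per marital status (illustrative data, not the census file). Passing dtype=int is an assumption made here to get 0/1 columns, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Toy frame with one row per category (illustrative only)
toy = pd.DataFrame({'marital_status': ['single', 'divorced', 'married', 'widowed']})

# dtype=int asks for 0/1 columns; recent pandas versions default to True/False
encoded = pd.get_dummies(toy, columns=['marital_status'], dtype=int)

# One new column per category, named <original column>_<category>
print(encoded.columns.tolist())

# Each row has exactly one column set to 1
print(encoded.sum(axis=1).tolist())
```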

In [23]:
#11 Create a new variable called marital_codes by Label Encoding the marital_status variable. 
# This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay 
# higher taxes based on their marital status.
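# marital_status is nominal, so the categories are left unordered; the codes simply follow the order listed here
census.marital_status = pd.Categorical(census.marital_status, categories=['single', 'married', 'widowed', 'divorced'])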
census['marital_codes'] = census.marital_status.cat.codes
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

   higher_tax marital_status  marital_codes  
0           1         single              0  
1           2       divorced              3  
2           3         single              0  
3           4        married              1  
4           3        married              1  
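One detail worth checking when label encoding with an explicit category list: any value that is not in the list is treated as missing and receives the code -1. The sketch below shows this on made-up values; 'separated' is a hypothetical status used only to trigger that case.

```python
import pandas as pd

# 'separated' is deliberately absent from the category list below
statuses = pd.Series(['single', 'divorced', 'married', 'separated'])
cat = pd.Categorical(statuses, categories=['single', 'married', 'widowed', 'divorced'])

# Codes follow the position in the category list; unknown values get -1
print(cat.codes.tolist())
```

A quick check such as (cat.codes == -1).any() can flag unexpected categories before the codes are passed to a model.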


In [24]:
#12 Create a new variable called age_group, which groups respondents based on their birth year.
# The task suggests five-year increments, e.g., 25-30, 31-35, etc.; this solution uses ten-year bins instead,
# where each bin includes its lower bound and excludes its upper bound (15-25 means 15 <= age < 25).
# Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning
# to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.
census['age'] = 2022 - census.birth_year  # 2022 is the reference year used for the age calculation
age_bins = [15, 25, 35, 45, 55, 65, 75, 85, 95, 105, float('inf')]
age_labels = ['15-25', '25-35', '35-45', '45-55', '55-65', '65-75', '75-85', '85-95', '95-105', 'above 105']
# right=False keeps the lower edge inclusive; pd.cut returns an ordered categorical, and ages under 15 become NaN (code -1)
census['age_group'] = pd.cut(census.age, bins=age_bins, right=False, labels=age_labels)
census['age_group_codes'] = census.age_group.cat.codes
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

   higher_tax marital_status  marital_codes  age age_group  age_group_codes  
0           1         single              0   17     15-25                0  
1           2       divorced              3   35     35-45                2  
2           3         single              0   30     25-35                1  
3           4        married              1   57     55-65                4  
4           3        married              1   77     75-85                6  
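Bin boundaries are easy to get off by one, so a quick check on made-up ages can help. This sketch uses pd.cut with right=False on three illustrative bins; the ages and the shortened bin list are assumptions chosen to hit the edges, not values from the census data.

```python
import pandas as pd

# Ages chosen to sit on or next to the bin edges (illustrative only)
ages = pd.Series([15, 24, 25, 34, 35])
labels = ['15-25', '25-35', '35-45']

# right=False makes each bin include its lower edge and exclude its upper edge
groups = pd.cut(ages, bins=[15, 25, 35, 45], right=False, labels=labels)
print(groups.tolist())

# Codes follow the order of the labels
print(groups.cat.codes.tolist())
```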
