## Cleaning demographic census data to perform statistical calculations
In this project, we will clean a dataframe of simulated census data representing the demographics of a small community in the U.S., then run some statistical calculations on the cleaned data.


In [1]:
# Let's start by importing necessary libraries
import pandas as pd
import numpy as np

In [3]:
# let's load the data from our system.
census = pd.read_csv(r"C:\Users\amanp\OneDrive\Desktop\census_data.csv")


In [5]:
# Let's review the dataframe description and values returned by .head() to assess the variable types of each of the variables. 
# This is an important step to understand what preprocessing will be necessary to work with the data.

census.head()


Unnamed: 0.1,Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced
2,2,Salomon,Orn,1992,True,2,166313.45,agree,single
3,3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,4,Gust,Abernathy,1945,False,2,143316.08,agree,married
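
The `Unnamed: 0` column shown above is typically what pandas produces when a CSV was saved with its row index and then read back without naming that column. Assuming that is what happened here, a minimal sketch of reading the first column as the index instead looks like this. The two-row CSV is a hypothetical stand-in, not the real census file.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for census_data.csv
sample_csv = io.StringIO(
    ",first_name,birth_year\n"
    "0,Denise,2005\n"
    "1,Hali,1987\n"
)

# index_col=0 tells pandas to use the first CSV column as the row index
# instead of loading it as a regular column named "Unnamed: 0"
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```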


So, the data contains the following columns:

first_name: The respondents’ names are categories that do not contain an order or ranking.

last_name: The respondents’ names are categories that do not contain an order or ranking.

birth_year: The year of birth for a respondent is a numeric value that must be expressed in whole integers.

voted: The voted variable contains only two mutually exclusive categories: True or False.

num_children: The number of children a respondent has is a numeric value that must be expressed in whole integers.

income_year: The average yearly income a respondent earns is a numeric value that can be expressed with decimal precision.

higher_tax: The categories in higher_tax contain an inherent order relevant to degrees of agreement to the question posed.

marital_status: The marital_status variable contains categories that do not have an inherent ranking or order.

In [6]:
# Let's check the dtypes of all the columns

census.dtypes


Unnamed: 0          int64
first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

In [7]:
# We would like to know the average birth year of the respondents. The .dtypes output shows that birth_year is stored as
# object (strings), whereas it should be an integer. Let's first print the unique values of the variable using the .unique()
# method.

census.birth_year.unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

In [8]:
# There is a 'missing' entry in the birth_year column. Let's assume that, after some research, we have found that the
# respondent's birth year is 1967.
# Let's use the .replace() method to replace 'missing' with 1967, so that the data type can be changed to int. The
# values in birth_year can then be rechecked by calling the .unique() method again.

census.birth_year = census.birth_year.replace(['missing'], 1967)

In [9]:
# Now that we have adjusted the values in the birth_year variable, let's change the data type from string to int and print
# the data types of the census dataframe.

census.birth_year = census.birth_year.astype(int)
print(census.dtypes)

Unnamed: 0          int64
first_name         object
last_name          object
birth_year          int32
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
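
When the true value behind a placeholder such as 'missing' is not known, hard-coding a replacement is not an option. One possible alternative, sketched below on a hypothetical sample rather than the census file, is to coerce unparseable entries to missing values with `pd.to_numeric` and keep whole numbers using the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical sample mimicking birth_year stored as strings
birth_year = pd.Series(['2005', '1987', 'missing', '1965'])

# errors='coerce' turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(birth_year, errors='coerce')

# Count the entries that failed to parse
n_missing = int(numeric.isna().sum())

# The nullable Int64 dtype keeps whole numbers while allowing missing values
birth_year_int = numeric.astype('Int64')

print(n_missing)
print(birth_year_int.mean())  # mean() skips missing values by default
```

This treats the unknown year as missing instead of guessing it, so the mean is computed over the known values only.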


In [10]:
# With birth_year assigned the appropriate data type, let's print the average birth year of the census respondents.

census.birth_year.mean()

1973.4

In [11]:
# We also want to set an order on the higher_tax variable so that: strongly disagree < disagree < neutral < agree < strongly agree.
# Let's convert the higher_tax variable to the category data type with the appropriate order, then print the new order using
# the .unique() method.

census['higher_tax'] = pd.Categorical(
    census['higher_tax'],
    ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'],
    ordered=True,
)
print(census['higher_tax'].unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
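
Once a categorical variable is ordered, pandas allows comparisons against a category label and order-aware reductions such as `min()` and `max()`. The sketch below illustrates this on a small hypothetical set of responses, not on the census data itself.

```python
import pandas as pd

order = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Hypothetical responses used only for illustration
opinions = pd.Series(
    pd.Categorical(['disagree', 'agree', 'strongly agree', 'neutral'],
                   categories=order, ordered=True)
)

# Comparison uses the declared category order, not alphabetical order
at_least_agree = opinions >= 'agree'
print(at_least_agree.tolist())

# min() and max() also respect the declared order
print(opinions.min(), opinions.max())
```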


In [12]:
# We would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Let's label
# encode the higher_tax variable and print the median of the resulting codes.

census['higher_tax'] = census['higher_tax'].cat.codes
print(census['higher_tax'].median())

2.0
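
A median code is easier to read once it is mapped back to its label. The sketch below uses hypothetical responses and keeps the codes in a separate variable so the original labels stay available. Two caveats apply: missing entries receive the code -1, and with an even number of rows the median can fall between two codes.

```python
import pandas as pd

order = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Hypothetical responses, not taken from the census file
responses = pd.Series(
    pd.Categorical(['disagree', 'neutral', 'agree', 'strongly agree', 'neutral'],
                   categories=order, ordered=True)
)

# .cat.codes gives each category's position in `order` (0 to 4)
codes = responses.cat.codes
median_code = codes.median()

# Only map the median back to a label when it is a whole number
if float(median_code).is_integer():
    median_label = responses.cat.categories[int(median_code)]
else:
    median_label = None

print(median_code, median_label)
```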


In [13]:
# To prepare this data for machine learning models later on, let's one-hot encode marital_status to create a binary variable
# for each category.

census = pd.get_dummies(census, columns=['marital_status'])
census.head()

Unnamed: 0.1,Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed
0,0,Denise,Ratke,2005,False,0,92129.41,1,0,0,1,0
1,1,Hali,Cummerata,1987,False,0,75649.17,2,1,0,0,0
2,2,Salomon,Orn,1992,True,2,166313.45,3,0,0,1,0
3,3,Sarina,Schiller,1965,False,2,71704.81,4,0,1,0,0
4,4,Gust,Abernathy,1945,False,2,143316.08,3,0,1,0,0
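
Depending on the pandas version, `get_dummies` may return the new columns as booleans (True/False) rather than the 0/1 integers shown above. A small sketch, on a hypothetical three-row sample, of requesting integer output explicitly with the `dtype` argument:

```python
import pandas as pd

# Hypothetical sample with one row per marital status
sample = pd.DataFrame({'marital_status': ['single', 'divorced', 'married']})

# dtype=int requests 0/1 integer columns regardless of the version default
encoded = pd.get_dummies(sample, columns=['marital_status'], dtype=int)

print(encoded.columns.tolist())
print(encoded['marital_status_single'].tolist())
```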


#### Conclusion and next steps:
In this project, we cleaned the census data of a small community in the U.S. (including label encoding and one-hot encoding) and performed some statistical calculations, namely the mean and median of selected columns. We did the following:
1. Calculated the mean of birth_year (after filling the missing value and converting the column to the appropriate data type).
2. Set an order on the higher_tax variable (by converting it to the categorical data type).
3. Label encoded the higher_tax variable to calculate the median.
4. One-hot encoded marital_status in preparation for machine learning models.

Next, we can also do the following:
1. Create a new variable called marital_codes by label encoding the marital_status variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.

2. Create a new variable called age_group, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., 25-30, 31-35, etc. Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.
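
The two next steps could be approached roughly as follows. This is only a sketch on a hypothetical five-row sample. The reference year used to derive ages (2023) and the helper column `age` are assumptions, and the bins produced by `pd.cut` are half-open intervals such as (25, 30] and (30, 35]. In the notebook itself, marital_status would need to be label encoded before the one-hot encoding cell, since `get_dummies` replaces that column.

```python
import pandas as pd

# Hypothetical sample; values chosen only for illustration
df = pd.DataFrame({
    'birth_year': [2005, 1987, 1992, 1965, 1945],
    'marital_status': ['single', 'divorced', 'single', 'married', 'married'],
})

# Label encode marital_status. Categories are sorted alphabetically,
# so the codes carry no ranking.
df['marital_codes'] = df['marital_status'].astype('category').cat.codes

# Derive age from birth year; the reference year is an assumption
reference_year = 2023
df['age'] = reference_year - df['birth_year']

# Bin ages into five-year groups: (0, 5], (5, 10], ..., (100, 105]
bin_edges = list(range(0, 106, 5))
df['age_group'] = pd.cut(df['age'], bins=bin_edges, right=True)

# Label encode the age groups; codes follow the bin order
df['age_group_codes'] = df['age_group'].cat.codes

print(df[['marital_codes', 'age', 'age_group', 'age_group_codes']])
```

Unlike marital_codes, the age group codes do carry a meaningful order, because `pd.cut` returns an ordered categorical.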
