# Make Sense of Census 


## Problem Statement

Hello!

You have been hired by 'CACT' (Census Analysis and Collection Team) for your numpy programming skills. Your main work today involves census record management and data analysis.



## About the Dataset


A snapshot of the data you will be working on:


![census data](CensusData.png)


The dataset has details of 1000 people with the following 8 features:


| Features | Description |
|:--------------:|:-------------------------------------------------------------------------------------------------------------------------------------:|
| age | Age of the person |
| education-num | No. of years of education they had |
| race | Person's race   <br> KEY==>  0 : Amer-Indian-Eskimo <br>  1 : Asian-Pac-Islander <br> 2 : Black <br>  3 : Other                    <br>  4 : White |
| sex | Person's gender  <br> KEY==>  0 : Female <br>  1 : Male |
| capital-gain | Income from investment sources, apart from wages/salary |
| capital-loss | Losses from investment sources, apart from wages/salary |
| hours-per-week | No. of hours per week the person works |
| income | Annual Income of the person    <br> KEY==> 0 : Less than or equal to 50K  <br> 1 : More than 50K  |



## Why solve this project
After completing this project, you will have a better grip on working with numpy.
In this project, you will apply the following concepts:

- Array Appending
- Array Slicing
- Array Filtering
- Array Aggregation




## Data Reading 

In this first task, we will load the data into a numpy array and add a new record to it.

## Instructions :

* Load the file `'data_file'`(given) and store it in a variable called `'data'` using `"np.genfromtxt()"`

***

Example of the `genfromtxt` function:

![genfromtxt example](genfromtxt.png)

***

**Note:**

The parameter `delimiter=","` is set because the file we are opening is a CSV (Comma Separated Values) file.

The parameter `skip_header=1` is set because the first row of the data (the header) contains string values, but our numpy array needs only numeric values (remember that a numpy array can store data of only a single data type).


* Append `'new_record'` (given) to `'data'` using `"np.concatenate()"` and store the new array in a variable called `census`
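
As a minimal sketch of the loading step, the snippet below parses a tiny in-memory CSV instead of the real file. The contents of `csv_text` and the name `sample` are made up for illustration, and `io.StringIO` stands in for the file path.

```python
import io
import numpy as np

# Hypothetical two-row CSV; the real project reads the file named in data_file
csv_text = "age,education-num\n39,13\n50,9\n"

# skip_header=1 drops the text header row; delimiter="," splits on commas
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)   # (2, 2)
print(sample.dtype)   # float64
```

Note that `np.genfromtxt()` produces floating point values by default, which is why ages print as `90.0` rather than `90` later on.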


## Hint

Concatenate along the row by putting `axis=0` in `np.concatenate()`
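
To see what `axis=0` does, here is a small sketch with a placeholder array. The names `sample_data`, `sample_record` and `sample_census` are hypothetical, and the zeros simply stand in for real records.

```python
import numpy as np

# Hypothetical stand-in for the loaded data: 3 records, 8 features each
sample_data = np.zeros((3, 8))

# The new record is a nested list, so it already has shape (1, 8)
sample_record = [[50, 9, 4, 1, 0, 0, 40, 0]]

# axis=0 stacks along rows, so the row count grows by one
sample_census = np.concatenate((sample_data, sample_record), axis=0)

print(sample_census.shape)  # (4, 8)
```

The record has to be two-dimensional: passing a flat list such as `[50, 9, 4, 1, 0, 0, 40, 0]` would raise a `ValueError`, because the inputs would not have the same number of dimensions.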


## Test case

#data

variable declaration check


variable type check - numpy.ndarray


data.shape==(1000, 8)


#census

variable declaration check


variable type check - numpy.ndarray


census.shape==(1001, 8)



# Importing header files
import numpy as np

#File path
data_file='subset_1000.csv'

#Code starts here

#Loading data file and saving it into a new numpy array 
data = np.genfromtxt(data_file, delimiter=",", skip_header=1)
print(data.shape)

#New record
new_record=[[50,  9,  4,  1,  0,  0, 40,  0]]

#Concatenating the new record to the existing numpy array
census=np.concatenate((data, new_record),axis = 0)

print(type(census)==np.ndarray)
print(census.shape)

#Code ends here

(1000, 8)
True
(1001, 8)


## Success Message

Congrats! You have successfully loaded the data

# Young Country? Old Country?


We often judge the potential of a country by the age distribution of the people residing there. We too want to do a simple analysis of the age distribution.

## Instructions :

* Create a new array called `'age'` by taking only the age column (age is the column with index 0) of the `'census'` array.


* Find the max age and store it in a variable called `'max_age'`.


* Find the min age and store it in a variable called `'min_age'`. 


* Find the mean of the age and store it in a variable called `'age_mean'`.


* Find the standard deviation of the age and store it in a variable called `'age_std'`.


#### Ponder: based on the above statistics, would you classify the country as 'young' or 'old'?


## Hint 

You can subset the 'Age' column from 'census' array by writing code similar to `age=census[:,0]`
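
The sketch below shows column slicing and the four aggregations on a made-up array. The values and the names `sample` and `sample_age` are illustrative only.

```python
import numpy as np

# Hypothetical 3x2 array: column 0 holds ages, column 1 holds another feature
sample = np.array([[20.0, 9.0],
                   [30.0, 13.0],
                   [40.0, 10.0]])

# [:, 0] selects every row of column 0 and returns a 1-D array
sample_age = sample[:, 0]

print(sample_age.max())   # 40.0
print(sample_age.min())   # 20.0
print(sample_age.mean())  # 30.0
print(sample_age.std())   # population standard deviation (ddof=0 by default)
```

By default `std()` divides by the number of elements (`ddof=0`), so it reports the population standard deviation rather than the sample one.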


## Test Case

#age

Variable declaration check

Variable type check - numpy.ndarray

len(age)==1001

age[43]==49

#max_age


Variable declaration check


max_age==90


#min_age


Variable declaration check

min_age==17

#age_mean

Variable declaration check

round(age_mean,2)==round(38.06293706293706,2)

#age_std

Variable declaration check

round(age_std,2)==round(13.341478176165857,2)




#Code starts here

#Subsetting the array to include only 'Age' column
age=census[:,0]

#Finding the max value of age
max_age=age.max()
print("Max Age= ",max_age)

#Find the min value of age
min_age=age.min()
print("Min Age= ",min_age)

#Find the mean of age
age_mean=age.mean()
print("Age Average= ", age_mean)

#Find the standard deviation of age
age_std=age.std()
print("Age Standard Deviation= ",age_std)


#Code ends here

Max Age=  90.0
Min Age=  17.0
Age Average=  38.06293706293706
Age Standard Deviation=  13.341478176165857


## Success Message

Congrats! You have successfully done the required numpy operations.

# Minority Report


The constitution of the country tries its best to ensure that people of all races are able to live harmoniously. Let's check the country's race distribution to identify the minorities so that the government can help them.

## Instructions :

* Create five different arrays by subsetting the `'census'` array by the race column (race is the column with index 2) and save them in `'race_0'`, `'race_1'`, `'race_2'`, `'race_3'` and `'race_4'` respectively (meaning: store the rows where the 'race' column has value `0` in `'race_0'`, and so on).


* Store the length of the above created arrays in `'len_0'`, `'len_1'`,`'len_2'`, `'len_3'` and `'len_4'` respectively


* Find out which is the race with the minimum no. of citizens


* Store the number associated with the minority race in a variable called `'minority_race'` (e.g. if `"len(race_0)"` is the minimum, store `0` in `'minority_race'`)



## Hint

You can subset the array based on column value by writing code similar to:


`race_0=census[census[:,2]==0]` (to keep the rows whose column with index 2 has the value `0`)
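
As a rough illustration of boolean filtering, the sketch below splits a made-up array into groups and finds the smallest one. Using `np.argmin` to pick the minimum is an optional shortcut and not part of the given instructions; all names and values here are hypothetical.

```python
import numpy as np

# Hypothetical records; column 2 plays the role of the race code
sample = np.array([[25, 10, 0],
                   [40, 12, 1],
                   [31,  9, 1],
                   [58, 14, 1],
                   [22, 11, 0],
                   [47, 13, 2]])

# The comparison yields one boolean per row; indexing keeps the True rows
group_0 = sample[sample[:, 2] == 0]
group_1 = sample[sample[:, 2] == 1]
group_2 = sample[sample[:, 2] == 2]

sizes = [len(group_0), len(group_1), len(group_2)]
print(sizes)  # [2, 3, 1]

# np.argmin returns the position of the smallest count, which here is also the code
smallest_group = int(np.argmin(sizes))
print(smallest_group)  # 2
```

This works because the list position of each count matches its group code, so the index of the minimum can be stored directly.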




## Test Case

#race_0

Variable declaration check

type(race_0)==numpy.ndarray

np.all(race_0[:,2] == 0)

#race_1

Variable declaration check

type(race_1)==numpy.ndarray

np.all(race_1[:,2] == 1)

#race_2

Variable declaration check

type(race_2)==numpy.ndarray

np.all(race_2[:,2] == 2)

#race_3

Variable declaration check

type(race_3)==numpy.ndarray

np.all(race_3[:,2] == 3)

#race_4

Variable declaration check

type(race_4)==numpy.ndarray

np.all(race_4[:,2] == 4)


#len_0

Variable declaration check

len_0==10

#len_1

Variable declaration check

len_1==27


#len_2

Variable declaration check

len_2==110


#len_3

Variable declaration check

len_3==6


#len_4

Variable declaration check

len_4==848


#minority_race

Variable declaration check

minority_race==3



#Code starts here

#Creating new subsets based on 'Race'
race_0=census[census[:,2]==0]
race_1=census[census[:,2]==1]
race_2=census[census[:,2]==2]
race_3=census[census[:,2]==3]
race_4=census[census[:,2]==4]


#Finding the length of the above created subsets
len_0=len(race_0)
len_1=len(race_1)
len_2=len(race_2)
len_3=len(race_3)
len_4=len(race_4)

#Printing the length of the above created subsets
print('Race_0: ', len_0)
print('Race_1: ', len_1)
print('Race_2: ', len_2)
print('Race_3: ', len_3)
print('Race_4: ', len_4)

#Storing the race with minimum length into a variable 
minority_race=3

#Code ends here

Race_0:  10
Race_1:  27
Race_2:  110
Race_3:  6
Race_4:  848


## Success Message

Congrats! You have successfully identified the minority race

## Senior Welfare

As per the new govt. policy, citizens above age 60 should not be made to work more than 25 hours per week.
Let us look at the data and see if that policy is followed.

## Instructions:

* Create a new subset array called `'senior_citizens'` by filtering `'census'` according to age>60 (age is the column with index 0)


* Add all the working hours(working hours is the column with index 6) of `'senior_citizens'` and store it in a variable called `'working_hours_sum'`


* Find the length of `'senior_citizens'` and store it in a variable called `'senior_citizens_len'` 


* Finally, find the average working hours of the senior citizens by dividing `'working_hours_sum'` by `'senior_citizens_len'` and store it in a variable called `'avg_working_hours'`.


* Print `'avg_working_hours'` and see if the govt. policy is followed.

## Hint
To add all the working hours, you can write code similar to:

`working_hours_sum=senior_citizens.sum(axis=0)[6]`
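
The sketch below walks through the same filter, sum and divide steps on a made-up two-column array. Here the hours sit in column 1 rather than column 6, and every name and value is hypothetical.

```python
import numpy as np

# Hypothetical records: column 0 is age, column 1 is hours worked per week
sample = np.array([[65.0, 20.0],
                   [45.0, 40.0],
                   [70.0, 30.0],
                   [61.0, 10.0]])

# Keep only the rows whose age is above 60
seniors = sample[sample[:, 0] > 60]

# Summing along axis=0 gives one total per column; [1] picks the hours column
hours_sum = seniors.sum(axis=0)[1]

# Slicing the column first and summing gives the same total
assert hours_sum == seniors[:, 1].sum()

avg_hours = hours_sum / len(seniors)
print(hours_sum, avg_hours)  # 60.0 20.0
```

Slicing the column before summing avoids computing totals for columns that are never used, though either form gives the same result.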

## Test Case

#senior_citizens

Variable declaration check

type(senior_citizens)==numpy.ndarray

np.all(senior_citizens[:,0] > 60)

#working_hours_sum

Variable declaration check

working_hours_sum==1917

#senior_citizens_len

Variable declaration check

senior_citizens_len==61

#avg_working_hours

Variable declaration check

round(avg_working_hours,2)==round(31.42622950819672,2)



#Code starts here

#Subsetting the array based on the age 
senior_citizens=census[census[:,0]>60]

#Calculating the total working hours (column with index 6)
working_hours_sum=senior_citizens.sum(axis=0)[6]

#Finding the length of the array
senior_citizens_len=len(senior_citizens)

#Finding the average working hours
avg_working_hours=working_hours_sum/senior_citizens_len

#Printing the average working hours
print(avg_working_hours)

#Code ends here

31.42622950819672


## Success Message

Congrats! You have successfully calculated the average working hours for senior citizens.

# Education Matters

Our parents have repeatedly told us that we need to study well in order to get a good (read: higher paying) job. Let's see whether more highly educated people are better paid in general.

## Instructions : 

* Create two new subset arrays called `'high'` and `'low'` by filtering `'census'` according to education-num>10 and education-num<=10 (education-num is the column with index 1) respectively.


* Find the mean of the income column (income is the column with index 7) of the `'high'` array and store it in `'avg_pay_high'`. Do the same for the `'low'` array and store its mean in `'avg_pay_low'`.

**Note:**

- Since income is a binary variable, `mean()` here gives the fraction of people with an annual income higher than 50K.
- You could also have used `mean()` to compute the average working hours in the 'Senior Welfare' task.

* Compare `'avg_pay_high'` and `'avg_pay_low'` and see whether there is truth in the claim that better education leads to better pay.


## Hint
You can create subset `'high'` by writing code similar to:

`high=census[census[:,1]>10]`

Similarly for `'low'`
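
To make the point about the mean of a binary column concrete, here is a small sketch on made-up data. The column positions, names and values are illustrative and differ from the real `'census'` layout.

```python
import numpy as np

# Hypothetical records: column 0 is years of education, column 1 is the 0/1 income flag
sample = np.array([[13, 1],
                   [ 9, 0],
                   [14, 0],
                   [10, 0],
                   [16, 1],
                   [ 7, 1]])

# Split the rows on the education threshold
high_edu = sample[sample[:, 0] > 10]
low_edu = sample[sample[:, 0] <= 10]

# The mean of a 0/1 column equals the share of rows whose flag is 1
share_high = high_edu[:, 1].mean()
share_low = low_edu[:, 1].mean()

print(share_high, share_low)
```

In this toy data two of the three rows with more than 10 years of education have the flag set, so `share_high` is two thirds, while `share_low` is one third.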

## Test Case

#high

Variable declaration check

type(high)==numpy.ndarray

np.all(high[:,1] > 10)

#avg_pay_high

Variable declaration check

round(avg_pay_high,2)==round(0.42813455657492355,2)

#low

Variable declaration check

type(low)==numpy.ndarray

np.all(low[:,1] <= 10)

#avg_pay_low

Variable declaration check

round(avg_pay_low,2)==round(0.13649851632047477,2)



#Code starts here

#Creating an array based on 'education' column
high=census[census[:,1]>10]

#Finding the average pay
avg_pay_high=high[:,7].mean()

#Printing the average pay
print(avg_pay_high)

#Creating an array based on 'education' column
low=census[census[:,1]<=10]

#Finding the average pay
avg_pay_low=low[:,7].mean()

#Printing the average pay
print(avg_pay_low)

#Code ends here

0.42813455657492355
0.13649851632047477


## Success Message

Congrats! You have successfully found out the average pay of people based on their education level.