In [2]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
data = pd.read_csv("../datasets/CPS2016_CSV.csv", index_col=None)
data.columns = ['age', 'sex', 'state', 'citizen', 'race', 'marital',
                'num_in_house', 'num_child', 'educ', 'worker_class', 'industry',
                'occupation', 'weekly_hrs', 'fam_income']
data.head()

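|   | age | sex | state | citizen | race | marital | num_in_house | num_child | educ | worker_class | industry | occupation | weekly_hrs | fam_income |
|---|-----|-----|-------|---------|------|---------|--------------|-----------|------|--------------|----------|------------|------------|------------|
| 0 | 50 | 1 | 16 | 5 | 1 | 1 | 5 | 2 | 35 | 3 | 51 | 22 | 6 | 14 |
| 1 | 27 | 1 | 6 | 1 | 1 | 6 | 4 | 0 | 39 | 3 | 51 | 12 | 6 | 15 |
| 2 | 35 | 2 | 6 | 1 | 2 | 6 | 2 | 1 | 39 | 2 | 51 | 15 | 6 | 9 |
| 3 | 29 | 1 | 51 | 1 | 1 | 6 | 2 | 0 | 39 | 3 | 51 | 21 | 6 | 12 |
| 4 | 57 | 1 | 53 | 1 | 1 | 6 | 5 | 0 | 39 | 3 | 51 | 12 | 6 | 14 |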


# Looking at the data

The Current Population Survey (CPS) collects data used by a variety of other studies that keep the nation informed of the economic and social well-being of its people. This dataset contains 55,253 entries, each a response from an American holding a single job, collected in August 2016.

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55253 entries, 0 to 55252
Data columns (total 14 columns):
age             55253 non-null int64
sex             55253 non-null int64
state           55253 non-null int64
citizen         55253 non-null int64
race            55253 non-null int64
marital         55253 non-null int64
num_in_house    55253 non-null int64
num_child       55253 non-null int64
educ            55253 non-null int64
worker_class    55253 non-null int64
industry        55253 non-null int64
occupation      55253 non-null int64
weekly_hrs      55253 non-null int64
fam_income      55253 non-null int64
dtypes: int64(14)
memory usage: 5.9 MB


# Describing the data

This dataset has a mix of numerical and categorical variables. All of the categorical variables are stored as integer codes, one per category. You can see which number represents which category in the CPS Dataset Description PDF. Some of the numerical data is also stored as increasing integer codes, each of which stands for a successively higher range of values. For instance, the valid entries for 'fam_income' are:

- 1	LESS THAN 5,000
- 2	5,000 TO 7,499
- 3	7,500 TO 9,999
- 4	10,000 TO 12,499
- 5	12,500 TO 14,999
- 6	15,000 TO 19,999
- 7	20,000 TO 24,999
- 8	25,000 TO 29,999
- 9	30,000 TO 34,999
- 10	35,000 TO 39,999
- 11	40,000 TO 49,999
- 12	50,000 TO 59,999
- 13	60,000 TO 74,999
- 14	75,000 TO 99,999
- 15	100,000 TO 149,999
- 16	150,000 OR MORE		
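
As a minimal sketch of how these codes could be turned back into readable labels, the lookup below copies the list above into a dictionary and applies it with `Series.map`. The name `FAM_INCOME_LABELS` is hypothetical (it is not part of the notebook), and the small `codes` series is a stand-in for `data['fam_income']` built from the five rows shown by `data.head()`.

```python
import pandas as pd

# Hypothetical lookup table; the labels are copied from the list above.
FAM_INCOME_LABELS = {
    1: "LESS THAN 5,000",
    2: "5,000 TO 7,499",
    3: "7,500 TO 9,999",
    4: "10,000 TO 12,499",
    5: "12,500 TO 14,999",
    6: "15,000 TO 19,999",
    7: "20,000 TO 24,999",
    8: "25,000 TO 29,999",
    9: "30,000 TO 34,999",
    10: "35,000 TO 39,999",
    11: "40,000 TO 49,999",
    12: "50,000 TO 59,999",
    13: "60,000 TO 74,999",
    14: "75,000 TO 99,999",
    15: "100,000 TO 149,999",
    16: "150,000 OR MORE",
}

# Stand-in for data['fam_income']: the five values shown by data.head().
codes = pd.Series([14, 15, 9, 12, 14], name="fam_income")

# Replace each integer code with its income bracket.
labels = codes.map(FAM_INCOME_LABELS)
print(labels.tolist())
```

With the full dataframe, the same call would be `data['fam_income'].map(FAM_INCOME_LABELS)`.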

## Please refer to the CPS Dataset Description PDF for a description of each variable and its valid entries.

The following variables are strictly numerical:

age, num_in_house (number of people living in current household), num_child (number of children <18)

The following variables CAN be used numerically, but will be used categorically when using classification methods:

educ, weekly_hrs, fam_income
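
One possible way to keep both readings available is an ordered categorical, which treats the codes as categories while still remembering their order. The sketch below assumes the 'educ' codes run from 31 to 46, as the summary statistics in this notebook indicate. The names `educ_dtype` and `educ_cat` are hypothetical, and the short `educ` series is a stand-in for `data['educ']` built from the rows shown by `data.head()`.

```python
import pandas as pd

# Stand-in for data['educ']: the five values shown by data.head().
educ = pd.Series([35, 39, 39, 39, 39], name="educ")

# Assumption: valid codes are 31..46, ordered from lowest to highest level.
educ_dtype = pd.CategoricalDtype(categories=list(range(31, 47)), ordered=True)
educ_cat = educ.astype(educ_dtype)

# Ordered categoricals still support comparisons and min/max.
print(educ_cat.dtype.ordered)
print(educ_cat.min(), educ_cat.max())
print((educ_cat > 35).sum())
```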

- As you can see below, the mean for 'fam_income' is 12.181. Referring to the PDF (or the example above), we can see that 12 represents '50,000 to 59,999' and 13 represents '60,000 to 74,999'. We can therefore estimate that the average family income in this dataset is between 50,000 and 59,999.

- The mean for 'educ' is 40.789. Referring to the PDF, we can see that the integer entries for education represent increasing levels of education, from 31: 'before first grade' to 46: 'doctorate degree'. 40.789 lies between 40: 'some college but no degree' and 41: 'associates degree'. This is not a precise measure, but the numeric coding gives us an estimate of what level of education someone has achieved.

- The mean for 'weekly_hrs' is 3.845. Referring to the PDF, we can see that 3 represents '35-39 hrs' and 4 represents '40 hrs'. Again, we cannot say exactly how many hours someone works per week based on 3.845, but we can estimate that the average survey respondent worked around or just under 40 hours a week. I believe the true average for Americans (found online) is around 38.1 hours per week.
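
Because these codes stand for brackets of unequal width, the mean of the codes is only a rough guide, and the median and mode are often easier to interpret. The sketch below recomputes all three for 'fam_income' directly from the counts printed by `data.fam_income.value_counts()` later in this notebook. The variable names are hypothetical.

```python
import pandas as pd

# fam_income code -> count, copied from data.fam_income.value_counts().
counts = pd.Series({
    15: 9414, 14: 8167, 16: 7906, 13: 6446, 12: 4847, 11: 4345,
    9: 2756, 10: 2735, 8: 2140, 7: 1907, 6: 1439, 5: 808,
    4: 722, 1: 687, 3: 513, 2: 421,
}).sort_index()

total = counts.sum()

# Mean of the codes: weight each code by how often it occurs.
mean_code = (counts.index.to_numpy() * counts.to_numpy()).sum() / total

# Median code: first code whose cumulative count reaches half the sample.
cumulative = counts.cumsum()
median_code = cumulative[cumulative >= total / 2].index[0]

# Mode: the most frequent code.
mode_code = counts.idxmax()

print(total, round(mean_code, 3), median_code, mode_code)
```

Here the mean code falls in bracket 12, while the median is code 13 ('60,000 TO 74,999') and the most common answer is code 15 ('100,000 TO 149,999'), so the three summaries point to different brackets.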

In [25]:
data.describe(include='all')

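|   | age | sex | state | citizen | race | marital | num_in_house | num_child | educ | worker_class | industry | occupation | weekly_hrs | fam_income |
|---|-----|-----|-------|---------|------|---------|--------------|-----------|------|--------------|----------|------------|------------|------------|
| count | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 | 55253.0 |
| mean | 42.655421 | 1.471739 | 28.345375 | 1.534722 | 1.429081 | 2.922212 | 3.074494 | 0.600999 | 40.789079 | 3.811829 | 30.935316 | 12.193311 | 3.845076 | 12.181311 |
| std | 14.542007 | 0.499205 | 16.178664 | 1.264564 | 1.304305 | 2.242702 | 1.552056 | 1.015603 | 2.56594 | 0.745296 | 14.492841 | 6.80535 | 1.321362 | 3.495008 |
| min | 15.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 31.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 31.0 | 1.0 | 13.0 | 1.0 | 1.0 | 1.0 | 2.0 | 0.0 | 39.0 | 4.0 | 22.0 | 7.0 | 4.0 | 10.0 |
| 50% | 42.0 | 1.0 | 29.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 40.0 | 4.0 | 36.0 | 14.0 | 4.0 | 13.0 |
| 75% | 54.0 | 2.0 | 42.0 | 1.0 | 1.0 | 6.0 | 4.0 | 1.0 | 43.0 | 4.0 | 42.0 | 17.0 | 4.0 | 15.0 |
| max | 85.0 | 2.0 | 56.0 | 5.0 | 26.0 | 6.0 | 14.0 | 9.0 | 46.0 | 6.0 | 51.0 | 22.0 | 6.0 | 16.0 |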


Check dataframe for NaNs

In [26]:
data.isnull().values.any()

False
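
The single `any()` check only says whether a missing value exists somewhere. If it ever returned `True`, a per-column count would show where the gaps are. The sketch below illustrates this on a small toy frame (not the CPS data) with one deliberately missing value. The names `toy`, `has_any_nan`, and `nan_per_column` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (not the CPS data) with one missing value in 'age'.
toy = pd.DataFrame({"age": [50, 27, np.nan], "sex": [1, 1, 2]})

# Same check as in the cell above: is there any NaN at all?
has_any_nan = toy.isnull().values.any()

# Per-column count of missing values.
nan_per_column = toy.isnull().sum()

print(has_any_nan)
print(nan_per_column)
```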

Let's take a look at the value counts of some of the variables.
(Again, refer to the PDF to see which category each number represents.)

In [27]:
data.state.value_counts()

6     4711
48    3140
36    2248
12    2213
17    1697
42    1601
39    1386
26    1273
25    1213
34    1176
13    1117
37    1098
11    1081
22    1076
30    1061
51    1050
47    1016
53    1015
54    1013
38     978
1      958
56     953
28     940
35     926
5      916
18     897
55     895
16     865
49     854
15     848
41     837
4      828
33     804
50     802
45     792
40     782
27     775
29     764
24     748
8      735
32     730
46     715
31     707
20     696
21     689
2      675
19     670
10     663
9      578
23     545
44     503
Name: state, dtype: int64

In [28]:
data.citizen.value_counts()

1    46483
5     4261
4     3746
3      500
2      263
Name: citizen, dtype: int64
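
Raw counts can be easier to compare as shares of the sample, which `value_counts(normalize=True)` returns directly. The sketch below rebuilds a series with the same 'citizen' counts as the output above (a stand-in for `data.citizen`, built with `np.repeat`) and computes the proportions. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for data.citizen, rebuilt from the counts printed above.
citizen = pd.Series(
    np.repeat([1, 5, 4, 3, 2], [46483, 4261, 3746, 500, 263]),
    name="citizen",
)

# Proportion of respondents in each category instead of raw counts.
shares = citizen.value_counts(normalize=True)
print(shares.round(3))
```

On the real dataframe this is simply `data.citizen.value_counts(normalize=True)`.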

In [29]:
data.race.value_counts()

1     44957
2      5471
4      3049
3       609
5       313
7       271
6       197
8       154
9        55
10       41
15       39
21       36
16       31
11       13
12        8
19        3
26        2
22        1
13        1
18        1
20        1
Name: race, dtype: int64

In [30]:
data.marital.value_counts()

1    30157
6    16310
4     5817
3     1103
5     1045
2      821
Name: marital, dtype: int64

In [31]:
data.num_child.value_counts()

0    37144
1     7872
2     6735
3     2506
4      756
5      161
6       52
7       15
8        7
9        5
Name: num_child, dtype: int64

In [32]:
data.educ.value_counts()

39    15090
43    12575
40     9749
44     5289
42     3436
41     2509
37     1142
46     1037
45      956
36      858
38      644
35      618
34      530
33      518
32      185
31      117
Name: educ, dtype: int64

In [33]:
data.industry.value_counts()

22    6077
40    4621
36    4051
4     3835
42    3685
46    3190
51    2829
41    2689
23    2297
38    2224
32    1649
21    1341
43    1278
44    1229
1     1052
33     974
10     887
48     878
34     874
49     778
47     776
45     726
14     694
6      578
13     514
24     488
19     466
8      448
7      434
3      378
29     327
17     325
50     242
16     218
39     211
20     205
27     186
5      180
25     175
11     171
12     154
35     149
26     141
9      138
15     112
2       98
18      84
31      76
37      63
30      35
28      23
Name: industry, dtype: int64

In [34]:
data.occupation.value_counts()

17    6592
1     6383
16    5545
22    3348
10    3315
21    3063
19    3015
13    2989
8     2977
2     2749
14    2263
15    2088
20    1870
3     1617
11    1284
4     1264
12    1178
9     1020
6      933
7      737
5      535
18     488
Name: occupation, dtype: int64

In [35]:
data.weekly_hrs.value_counts()

4    31614
6     7514
2     4717
1     4683
3     3415
5     3310
Name: weekly_hrs, dtype: int64

In [36]:
data.fam_income.value_counts()

15    9414
14    8167
16    7906
13    6446
12    4847
11    4345
9     2756
10    2735
8     2140
7     1907
6     1439
5      808
4      722
1      687
3      513
2      421
Name: fam_income, dtype: int64

## Graphing 