# Data Download and Exploration

The following code makes the notebook re-import your source code in `src` whenever it is edited. By default, modules are not re-imported, because most are assumed not to change during a session. It's a good idea to include it in any exploratory notebook that uses `src` code.

In [2]:
%load_ext autoreload
%autoreload 2

The snippet after the directory tree allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [3]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
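
Hard-coding two `os.pardir` steps works only while the notebook stays two levels below the project root. As a minimal sketch (the helper names `find_project_root` and `add_to_sys_path` are hypothetical, not part of this project), you could instead walk upward until a directory containing `src` is found:

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Append `root` to sys.path once, as the snippet above does."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.append(root_str)


# Assumed usage from inside a notebook:
# add_to_sys_path(find_project_root(Path.cwd()))
```

This keeps working if a notebook is later moved between `exploratory` and `report`, or into a deeper subfolder.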

The code to download all of the data and load it into a SQL database lives in the `data` module within `src`.  You only need to run `download_data_and_load_into_sql` once for the duration of the project; if the database already exists, the call stops with a `DuplicateDatabase` error like the one in the output further down.

In [4]:
from src.data import data_collection

The next cell may take 10-20 minutes to run, depending on your network connection and computer specs.

In [5]:
data_collection.download_data_and_load_into_sql()

DuplicateDatabase: database "opportunity_youth" already exists
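
The `DuplicateDatabase` message above is what a second run of the loader produces. If you want the notebook to run top to bottom without stopping there, one option is a small wrapper that treats that specific exception as "already loaded". This is a sketch: `run_once` is a hypothetical helper, and it assumes the loader raises before changing anything when the database exists.

```python
from typing import Callable, Type


def run_once(loader: Callable[[], None], already_loaded: Type[Exception]) -> bool:
    """Run `loader`; return False instead of raising if the data is already there."""
    try:
        loader()
    except already_loaded:
        return False
    return True


# Stand-ins so the sketch runs without a database.
class FakeDuplicateDatabase(Exception):
    pass


def fake_loader() -> None:
    raise FakeDuplicateDatabase('database "opportunity_youth" already exists')


loaded_now = run_once(fake_loader, FakeDuplicateDatabase)
print("loaded now" if loaded_now else "already loaded, skipping")

# Assumed usage in this notebook (psycopg2 2.8+ exposes the exception class):
# from psycopg2 import errors
# run_once(data_collection.download_data_and_load_into_sql, errors.DuplicateDatabase)
```

Catching only the one exception type means any other failure, such as a network error during download, still surfaces.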


Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd
import numpy as np
import seaborn as sns
import requests
import matplotlib.pyplot as plt
%matplotlib inline

DBNAME = "opportunity_youth"

conn = psycopg2.connect(dbname=DBNAME)
cur = conn.cursor()

In [7]:
pd.read_sql("""SELECT *
            FROM pums_2017
            LIMIT 10;""", conn)

Unnamed: 0,rt,serialno,division,sporder,puma,region,st,adjinc,pwgtp,agep,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2013000055538,9,1,10501,4,53,1061971,36.0,64.0,...,34.0,35.0,38.0,34.0,57.0,38.0,40.0,12.0,12.0,35.0
1,P,2013000055685,9,1,11701,4,53,1061971,12.0,61.0,...,3.0,4.0,12.0,19.0,22.0,13.0,3.0,20.0,3.0,12.0
2,P,2013000055685,9,2,11701,4,53,1061971,12.0,61.0,...,4.0,3.0,11.0,20.0,23.0,12.0,4.0,20.0,4.0,11.0
3,P,2013000055685,9,3,11701,4,53,1061971,17.0,60.0,...,5.0,5.0,17.0,25.0,30.0,18.0,5.0,27.0,6.0,16.0
4,P,2013000055702,9,1,11101,4,53,1061971,24.0,32.0,...,7.0,6.0,37.0,21.0,8.0,39.0,38.0,20.0,21.0,24.0
5,P,2013000055702,9,2,11101,4,53,1061971,35.0,39.0,...,9.0,10.0,57.0,34.0,12.0,61.0,56.0,41.0,35.0,37.0
6,P,2013000055702,9,3,11101,4,53,1061971,41.0,11.0,...,12.0,13.0,67.0,31.0,16.0,72.0,57.0,42.0,45.0,40.0
7,P,2013000055702,9,4,11101,4,53,1061971,41.0,6.0,...,12.0,13.0,66.0,31.0,15.0,72.0,56.0,43.0,44.0,40.0
8,P,2013000055702,9,5,11101,4,53,1061971,47.0,0.0,...,11.0,17.0,77.0,51.0,14.0,87.0,83.0,41.0,54.0,46.0
9,P,2013000055825,9,1,11505,4,53,1061971,5.0,62.0,...,5.0,2.0,5.0,1.0,4.0,5.0,9.0,5.0,5.0,4.0


Notice the `LIMIT 10` above.  These tables hold a large amount of data, and **your goal is to build your main query in SQL, not Pandas**.  Pandas can technically do everything you need, but it will be much slower and use far more memory, because every row has to be loaded into the notebook first.  Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
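
To make the "SQL first" advice concrete, here is a small sketch that filters and aggregates inside the database and returns only the summary rows. It uses an in-memory SQLite table with made-up rows as a stand-in for the Postgres `pums_2017` table. The table name `pums_toy` and its values are invented for illustration; the column names mirror the real ones.

```python
import sqlite3

# In-memory stand-in for the real database; all rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE pums_toy (puma TEXT, agep REAL, esr TEXT, sch TEXT, pwgtp REAL)"
)
rows = [
    ("11610", 17.0, "6", "1", 20.0),
    ("11610", 22.0, "3", "1", 15.0),
    ("11613", 19.0, "6", "1", 30.0),
    ("11613", 45.0, "6", "1", 25.0),  # outside the 16-24 age range
    ("11613", 20.0, "1", "1", 10.0),  # employed
    ("11614", 18.0, "6", "2", 12.0),  # enrolled in school
]
conn_demo.executemany("INSERT INTO pums_toy VALUES (?, ?, ?, ?, ?)", rows)

# Filtering and aggregation both happen in SQL; only the summary comes back.
query = """
    SELECT puma,
           COUNT(*)   AS sampled_people,
           SUM(pwgtp) AS weighted_people
    FROM pums_toy
    WHERE agep BETWEEN 16 AND 24
      AND esr IN ('3', '6')
      AND sch = '1'
    GROUP BY puma
    ORDER BY puma;
"""
summary = conn_demo.execute(query).fetchall()
conn_demo.close()
print(summary)
```

Against the real database, the same `SELECT` could be passed to `pd.read_sql`, so Pandas only ever receives a few summary rows rather than the whole table.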

### df_pums_2017_og

In [13]:
df_pums_2017_og = pd.read_sql("""
    SELECT * 
    FROM pums_2017;
    """, conn)
df_pums_2017_og.shape

(359075, 286)

#### Checking for missing values: narrow down which columns you're interested in, then run this

In [19]:
df_pums_2017_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359075 entries, 0 to 359074
Columns: 286 entries, rt to pwgtp80
dtypes: float64(100), object(186)
memory usage: 783.5+ MB


In [16]:
df_pums_2017_og.isna().sum()

rt          0
serialno    0
division    0
sporder     0
puma        0
           ..
pwgtp76     0
pwgtp77     0
pwgtp78     0
pwgtp79     0
pwgtp80     0
Length: 286, dtype: int64
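
With 286 columns, the truncated output above shows only the first and last five, so it says little about the rest. A sketch of a narrower check follows. It runs on a toy frame with invented values, and `columns_of_interest` is a placeholder for whichever columns you settle on.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror a few PUMS columns.
toy_missing = pd.DataFrame({
    "agep": [16.0, 17.0, 18.0, 19.0],
    "cow": ["1", None, None, "6"],
    "citwp": [np.nan, np.nan, np.nan, 2010.0],
    "esr": ["6", "6", "3", "6"],
})

# Restrict the check to the columns you care about, then keep only
# the ones that actually have missing values.
columns_of_interest = ["agep", "cow", "esr"]
missing = toy_missing[columns_of_interest].isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```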

In [20]:
df_pums_2017_og.pwgtp.value_counts()

14.0     24365
13.0     23960
15.0     20619
12.0     18225
16.0     15655
         ...  
164.0        1
175.0        1
181.0        1
179.0        1
155.0        1
Name: pwgtp, Length: 190, dtype: int64

In [9]:
df_pums_2017 = pd.read_sql("""
    SELECT * 
    FROM pums_2017 
    WHERE agep BETWEEN 16.0 AND 24.0
    AND puma IN ('11610', '11611', '11612', '11613', '11614', '11615')
    AND esr IN ('3', '6')
    AND sch = '1'
    ORDER BY agep ASC;
    """, conn)

In [10]:
df_pums_2017.head(20)

Unnamed: 0,rt,serialno,division,sporder,puma,region,st,adjinc,pwgtp,agep,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2014000732829,9,4,11613,4,53,1045195,21.0,16.0,...,6.0,15.0,17.0,22.0,8.0,28.0,32.0,25.0,30.0,16.0
1,P,2017000468464,9,4,11612,4,53,1011189,22.0,16.0,...,29.0,42.0,34.0,21.0,9.0,30.0,20.0,19.0,22.0,20.0
2,P,2015001329594,9,3,11612,4,53,1035988,28.0,16.0,...,11.0,41.0,58.0,26.0,22.0,30.0,25.0,10.0,56.0,38.0
3,P,2016001052457,9,2,11610,4,53,1029257,28.0,16.0,...,24.0,8.0,24.0,29.0,48.0,44.0,10.0,9.0,48.0,47.0
4,P,2015001313106,9,4,11614,4,53,1035988,22.0,16.0,...,27.0,23.0,20.0,6.0,23.0,25.0,28.0,28.0,7.0,23.0
5,P,2017000648807,9,2,11614,4,53,1011189,4.0,16.0,...,4.0,5.0,4.0,4.0,1.0,4.0,7.0,1.0,8.0,1.0
6,P,2013000303198,9,4,11613,4,53,1061971,10.0,16.0,...,12.0,20.0,3.0,10.0,10.0,16.0,9.0,21.0,20.0,11.0
7,P,2015000639047,9,5,11614,4,53,1035988,12.0,16.0,...,12.0,10.0,20.0,11.0,20.0,11.0,13.0,12.0,4.0,18.0
8,P,2015001455204,9,3,11614,4,53,1035988,8.0,16.0,...,14.0,2.0,6.0,14.0,3.0,7.0,11.0,3.0,7.0,2.0
9,P,2014000440342,9,4,11614,4,53,1045195,26.0,16.0,...,25.0,22.0,24.0,24.0,6.0,7.0,26.0,24.0,25.0,23.0


In [11]:
df_pums_2017.shape

(391, 286)
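
The filtered query that produced `df_pums_2017` hard-codes its PUMA, `esr`, and `sch` values in the SQL string. If you expect to vary them, one possible approach is to keep the values in Python and pass them as query parameters. The sketch below only builds the query text and the parameter dictionary. `build_oy_query` and `TARGET_PUMAS` are hypothetical names, and the `%(name)s` placeholders assume the psycopg2 driver, which adapts Python tuples for use with `IN`.

```python
from typing import Dict, Sequence, Tuple

TARGET_PUMAS = ("11610", "11611", "11612", "11613", "11614", "11615")


def build_oy_query(
    pumas: Sequence[str],
    min_age: float = 16.0,
    max_age: float = 24.0,
) -> Tuple[str, Dict[str, object]]:
    """Return (sql, params) for the filter used above, with values kept out of the SQL text."""
    if not pumas:
        # An empty tuple would render as `IN ()`, which is invalid SQL.
        raise ValueError("pumas must contain at least one code")
    sql = """
        SELECT *
        FROM pums_2017
        WHERE agep BETWEEN %(min_age)s AND %(max_age)s
          AND puma IN %(pumas)s
          AND esr IN %(esr)s
          AND sch = %(sch)s
        ORDER BY agep ASC;
    """
    params = {
        "min_age": min_age,
        "max_age": max_age,
        "pumas": tuple(pumas),
        "esr": ("3", "6"),  # unemployed, not in labor force
        "sch": "1",         # has not attended school in the last 3 months
    }
    return sql, params


sql, params = build_oy_query(TARGET_PUMAS)

# Assumed usage with the notebook's connection:
# df_pums_2017 = pd.read_sql(sql, conn, params=params)
```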

In [37]:
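# Compare the column, not its label: `'dis' == 1` evaluates to False, which raised `KeyError: False`.
# pd.to_numeric covers the case where the column is stored as text.
df_pums_2017.loc[pd.to_numeric(df_pums_2017['dis']) == 1, ['ddrs','dear','deye','dout','dphy','drem']]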

<br>**Check the unique values in the `division`, `region`, and `st` columns to confirm we're dealing with a single region.**

In [147]:
print(df_pums_2017.division.value_counts(), '\n', df_pums_2017.region.value_counts(), '\n', df_pums_2017.st.value_counts())

9    285
Name: division, dtype: int64 
 4    285
Name: region, dtype: int64 
 53    285
Name: st, dtype: int64


In [148]:
del df_pums_2017['division']
del df_pums_2017['region']
del df_pums_2017['st']
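
The check-then-delete steps above can also be written generically, so that any column with a single value in the filtered data is found and dropped in one pass. A sketch on a toy frame (the values are copied from the output above, and the `toy_geo` name is made up):

```python
import pandas as pd

toy_geo = pd.DataFrame({
    "division": ["9", "9", "9"],
    "region": ["4", "4", "4"],
    "st": ["53", "53", "53"],
    "puma": ["11610", "11613", "11614"],
})

# A column is constant if it has exactly one distinct value (NaN counted as a value).
constant_cols = [col for col in toy_geo.columns if toy_geo[col].nunique(dropna=False) == 1]
trimmed = toy_geo.drop(columns=constant_cols)
print(constant_cols, list(trimmed.columns))
```

On the real frame, review `constant_cols` before dropping, since a column can be constant and still be needed later (for example `sch` after the filter).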

In [149]:
df_pums_2017.head()

Unnamed: 0,rt,serialno,sporder,puma,adjinc,pwgtp,agep,cit,citwp,cow,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2017000106768,4,11609,1011189,24.0,16.0,1.0,,,...,7.0,40.0,22.0,47.0,45.0,27.0,20.0,21.0,25.0,7.0
1,P,2017000468464,4,11612,1011189,22.0,16.0,1.0,,,...,29.0,42.0,34.0,21.0,9.0,30.0,20.0,19.0,22.0,20.0
2,P,2014000440342,4,11614,1045195,26.0,16.0,1.0,,,...,25.0,22.0,24.0,24.0,6.0,7.0,26.0,24.0,25.0,23.0
3,P,2015001329594,3,11612,1035988,28.0,16.0,1.0,,,...,11.0,41.0,58.0,26.0,22.0,30.0,25.0,10.0,56.0,38.0
4,P,2014000732829,4,11613,1045195,21.0,16.0,1.0,,,...,6.0,15.0,17.0,22.0,8.0,28.0,32.0,25.0,30.0,16.0


In [150]:
df_pums_2017.cow.value_counts()

1    109
6      8
3      6
2      5
4      5
5      1
7      1
8      1
Name: cow, dtype: int64

cow = Class of worker

b .N/A (less than 16 years old/NILF who last worked more than 5 years ago or never worked)

1 .Employee of a private for-profit company or business, or of an individual, for wages, salary, or commissions

2 .Employee of a private not-for-profit, tax-exempt, or charitable organization

3 .Local government employee (city, county, etc.)

4 .State government employee

5 .Federal government employee

6 .Self-employed in own not incorporated business, professional practice, or farm

7 .Self-employed in own incorporated business, professional practice or farm

8 .Working without pay in family business or farm

9 .Unemployed and last worked 5 years ago or earlier or never worked <-- This is what we're looking for, although `esr` (below) looks like a better fit.

In [151]:
df_pums_2017.sch.value_counts()

1    285
Name: sch, dtype: int64

sch = School enrollment

b .N/A (less than 3 years old)

1 .No, has not attended in the last 3 months <-- this is what we're looking for

2 .Yes, public school or public college

3 .Yes, private school or college or home school

In [152]:
df_pums_2017.esr.value_counts()

6    285
Name: esr, dtype: int64

esr = Employment status recode

b .N/A (less than 16 years old)\
1 .Civilian employed, at work\
2 .Civilian employed, with a job but not at work\
3 .Unemployed <-- this is what we're interested in\
4 .Armed forces, at work\
5 .Armed forces, with a job but not at work\
6 .Not in labor force <-- this is what we're interested in
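
Putting the `esr` and `sch` codes together with the age range from the query gives the working filter for this notebook. As a sketch, the same rule can be written as a small Python predicate. `is_opportunity_youth` is a hypothetical helper, and it assumes the codes are stored as text, as the SQL filter does.

```python
# Labels copied from the esr code list above.
ESR_LABELS = {
    "1": "Civilian employed, at work",
    "2": "Civilian employed, with a job but not at work",
    "3": "Unemployed",
    "4": "Armed forces, at work",
    "5": "Armed forces, with a job but not at work",
    "6": "Not in labor force",
}

TARGET_ESR = {"3", "6"}  # unemployed, or not in the labor force
NOT_ENROLLED = "1"       # sch: has not attended school in the last 3 months


def is_opportunity_youth(agep: float, esr: str, sch: str) -> bool:
    """Mirror the SQL filter: aged 16-24, not working, not in school."""
    return 16 <= agep <= 24 and esr in TARGET_ESR and sch == NOT_ENROLLED


print(is_opportunity_youth(19.0, "6", "1"), ESR_LABELS["6"])
```

This is meant for spot-checking individual rows; the filtering itself is still best done in SQL.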

In [153]:
df_pums_2017.puma.value_counts()

11613    55
11611    52
11614    51
11610    36
11612    35
11615    32
11609    24
Name: puma, dtype: int64

puma = Public use microdata area code

53 11610	King County (Central)--Renton City, Fairwood, Bryn Mawr & Skyway PUMA\
53 11611	King County (West Central)--Burien, SeaTac, Tukwila Cities & White Center PUMA\
53 11612	King County (Far Southwest)--Federal Way, Des Moines Cities & Vashon Island PUMA\
53 11613	King County (Southwest Central)--Kent City PUMA\
53 11614	King County (Southwest)--Auburn City & Lakeland PUMA\
53 11615	King County (Southeast)--Maple Valley, Covington & Enumclaw Cities PUMA

In [162]:
df_pums_2017.cow.value_counts()

1    209
9     25
6     11
2     10
3     10
4      7
5      2
7      1
8      1
Name: cow, dtype: int64

Make sure you close the DB connection when you are done using it.

In [163]:
conn.close()
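
An explicit `close()` is easy to forget if a cell fails partway through. One alternative is `contextlib.closing`, which closes the connection when the `with` block exits, even on an error. The sketch below uses an in-memory SQLite connection as a stand-in so it runs without Postgres. Note that with psycopg2, a plain `with conn:` block manages a transaction and does not close the connection, so `closing` is not redundant there.

```python
import sqlite3
from contextlib import closing

# The connection is closed automatically when the block exits.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE t (x INTEGER)")
    demo_conn.execute("INSERT INTO t VALUES (1), (2), (3)")
    total = demo_conn.execute("SELECT SUM(x) FROM t").fetchone()[0]

# Any further use of the closed connection raises an error.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(total, still_open)

# Assumed equivalent for this notebook:
# with closing(psycopg2.connect(dbname=DBNAME)) as conn:
#     df = pd.read_sql("SELECT * FROM pums_2017 LIMIT 10;", conn)
```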