# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited. (By default modules are not re-imported, because most are assumed not to change over time.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

The snippet after the directory listing allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
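
The hard-coded two `pardir`s only work when the notebook sits exactly two levels below the project root. A minimal sketch of a more forgiving alternative is to walk upward until a directory containing `src` is found. The helper names `find_project_root` and `add_project_root_to_path` are hypothetical and not part of this project's `src` code.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="src"):
    """Return the nearest ancestor of `start` (or `start` itself) that contains `marker`.

    Hypothetical helper for illustration; not part of the project's `src` code.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} above {start}")


def add_project_root_to_path(start="."):
    """Append the project root to sys.path once, mirroring the cell above."""
    root = str(find_project_root(start))
    if root not in sys.path:
        sys.path.append(root)
    return root
```

Calling `add_project_root_to_path()` at the top of a notebook would then stand in for the cell above, regardless of how deeply the notebook is nested.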

The code to download all of the data and load it into a SQL database is in `src/data/data_collection.py`.  You only need to run `download_data_and_load_into_sql` once for the whole project.

In [3]:
from src.data import data_collection

The next cell may take 10 to 20 minutes, depending on your network connection and computer specs.

In [4]:
data_collection.download_data_and_load_into_sql()

Successfully created database and all tables

Successfully downloaded ZIP file
    https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pwa.zip
    
Successfully downloaded GZIP file
    https://lehd.ces.census.gov/data/lodes/LODES7/wa/wac/wa_wac_S000_JT00_2017.csv.gz
    
Successfully downloaded GZIP file
    https://lehd.ces.census.gov/data/lodes/LODES7/wa/wa_xwalk.csv.gz
    
Successfully downloaded CSV file
    https://www2.census.gov/geo/docs/maps-data/data/rel/2010_Census_Tract_to_2010_PUMA.txt
    
Successfully loaded CSV file into `pums_2017` table
        
Successfully loaded CSV file into `puma_names_2010` table
        
Successfully loaded CSV file into `wa_jobs_2017` table
        
Successfully loaded CSV file into `wa_geo_xwalk` table
        
Successfully loaded CSV file into `ct_puma_xwalk` table
        


Now it's time to explore the data!

In [13]:
import psycopg2
import pandas as pd
import sqlite3

In [14]:
DBNAME = "opportunity_youth"

In [23]:
conn = psycopg2.connect(dbname=DBNAME)
cur = conn.cursor()

In [30]:
df = pd.read_sql("SELECT * FROM pums_2017 LIMIT 10;", conn)

Notice the `LIMIT 10` above.  These tables hold a large amount of data, and **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but pulling whole tables into memory is much slower and less efficient than letting the database filter and aggregate.  Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
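
To make the point concrete, here is a minimal sketch of letting the database aggregate instead of pulling every row into Pandas. It runs against an in-memory SQLite table with made-up rows so that it is self-contained. The table name `pums_toy` and its values are assumptions for illustration; only the `puma`, `agep` and `pwgtp` column names come from the real `pums_2017` table. The same `SELECT ... GROUP BY` pattern should carry over to the `psycopg2` connection used in this notebook.

```python
import sqlite3

import pandas as pd

# Toy stand-in for pums_2017 (values are invented for illustration).
toy = pd.DataFrame(
    {
        "puma": ["11610", "11610", "11611", "11611", "11611"],
        "agep": [17.0, 40.0, 19.0, 23.0, 65.0],
        "pwgtp": [10.0, 20.0, 5.0, 15.0, 30.0],
    }
)

conn_toy = sqlite3.connect(":memory:")
toy.to_sql("pums_toy", conn_toy, index=False)

# The database filters and aggregates; Pandas only receives one row per PUMA.
summary = pd.read_sql(
    """
    SELECT puma,
           COUNT(*)   AS n_records,
           SUM(pwgtp) AS weighted_total
    FROM pums_toy
    WHERE agep BETWEEN 16 AND 24
    GROUP BY puma
    ORDER BY puma;
    """,
    conn_toy,
)
conn_toy.close()
```

Here `summary` has one row per PUMA rather than one row per person, which is the shape you usually want to hand to Pandas.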

In [35]:
df.head()

Unnamed: 0,rt,serialno,division,sporder,puma,region,st,adjinc,pwgtp,agep,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2013000000006,9,1,11606,4,53,1061971,27.0,68.0,...,53.0,24.0,39.0,24.0,7.0,27.0,8.0,46.0,25.0,50.0
1,P,2013000000006,9,2,11606,4,53,1061971,22.0,66.0,...,49.0,21.0,38.0,20.0,7.0,25.0,8.0,41.0,22.0,47.0
2,P,2013000000012,9,1,10100,4,53,1061971,22.0,72.0,...,24.0,22.0,25.0,7.0,21.0,35.0,6.0,22.0,6.0,37.0
3,P,2013000000012,9,2,10100,4,53,1061971,19.0,64.0,...,21.0,18.0,19.0,7.0,17.0,29.0,6.0,19.0,6.0,29.0
4,P,2013000000038,9,1,11505,4,53,1061971,4.0,52.0,...,4.0,1.0,2.0,8.0,8.0,1.0,4.0,6.0,1.0,4.0


In [44]:
pd.read_sql(
"""
SELECT *
FROM pums_2017
WHERE agep BETWEEN 16.0 AND 24.0
AND puma IN ('11610','11611','11613','11614','11615')
AND esr IN ('3', '6')
AND sch = '1'
ORDER BY agep ASC
LIMIT 10;
""", conn)

Unnamed: 0,rt,serialno,division,sporder,puma,region,st,adjinc,pwgtp,agep,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2014000707906,9,3,11614,4,53,1045195,44.0,16.0,...,14.0,42.0,40.0,51.0,44.0,48.0,57.0,31.0,14.0,12.0
1,P,2013000303198,9,4,11613,4,53,1061971,10.0,16.0,...,12.0,20.0,3.0,10.0,10.0,16.0,9.0,21.0,20.0,11.0
2,P,2017001063337,9,3,11611,4,53,1011189,12.0,16.0,...,3.0,12.0,10.0,4.0,11.0,4.0,21.0,21.0,11.0,11.0
3,P,2014000440342,9,4,11614,4,53,1045195,26.0,16.0,...,25.0,22.0,24.0,24.0,6.0,7.0,26.0,24.0,25.0,23.0
4,P,2015000639047,9,5,11614,4,53,1035988,12.0,16.0,...,12.0,10.0,20.0,11.0,20.0,11.0,13.0,12.0,4.0,18.0
5,P,2017000648807,9,2,11614,4,53,1011189,4.0,16.0,...,4.0,5.0,4.0,4.0,1.0,4.0,7.0,1.0,8.0,1.0
6,P,2015001455204,9,3,11614,4,53,1035988,8.0,16.0,...,14.0,2.0,6.0,14.0,3.0,7.0,11.0,3.0,7.0,2.0
7,P,2015001313106,9,4,11614,4,53,1035988,22.0,16.0,...,27.0,23.0,20.0,6.0,23.0,25.0,28.0,28.0,7.0,23.0
8,P,2014000062718,9,3,11614,4,53,1045195,45.0,16.0,...,35.0,18.0,75.0,54.0,69.0,60.0,15.0,43.0,36.0,46.0
9,P,2014000732829,9,4,11613,4,53,1045195,21.0,16.0,...,6.0,15.0,17.0,22.0,8.0,28.0,32.0,25.0,30.0,16.0


agep - age
  * we want youths aged 16 to 24

esr - employment status recode
  * 3 = unemployed
  * 6 = not in labor force

puma - Public Use Microdata Area (region)
  * 11610 Central King County
  * 11611 West Central King County
  * 11612 Far Southwest King County (not in the `IN` list of the query above)
  * 11613 Southwest Central King County
  * 11614 Southwest King County
  * 11615 Southeast King County

sch - school enrollment
  * sch = 1 means they aren't enrolled in school
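
The codes above can be kept in one place and passed to the query as parameters rather than typed into the SQL string. The sketch below is an illustration only: the constant names and the `build_oy_query` helper are hypothetical, and the `%s` placeholder style is the one `psycopg2` uses.

```python
# Hypothetical constants collecting the codes described above.
AGE_RANGE = (16, 24)
PUMAS = ("11610", "11611", "11613", "11614", "11615")  # same list as the query above
ESR_CODES = ("3", "6")  # unemployed, not in labor force
SCH_NOT_ENROLLED = "1"


def build_oy_query(table="pums_2017", limit=None):
    """Return (sql, params) for the filter above, using psycopg2-style %s placeholders.

    The table name is interpolated directly, so it must come from trusted code,
    never from user input.
    """
    puma_slots = ", ".join(["%s"] * len(PUMAS))
    esr_slots = ", ".join(["%s"] * len(ESR_CODES))
    sql = (
        f"SELECT * FROM {table} "
        f"WHERE agep BETWEEN %s AND %s "
        f"AND puma IN ({puma_slots}) "
        f"AND esr IN ({esr_slots}) "
        f"AND sch = %s "
        f"ORDER BY agep ASC"
    )
    params = [*AGE_RANGE, *PUMAS, *ESR_CODES, SCH_NOT_ENROLLED]
    if limit is not None:
        sql += " LIMIT %s"
        params.append(limit)
    return sql + ";", params


sql, params = build_oy_query(limit=10)
```

With the notebook's connection open, `pd.read_sql(sql, conn, params=params)` would then run the query, and changing the PUMA list means editing one tuple instead of every query string.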

In [168]:
listc = df.columns[104:207]


In [458]:
counter = 0

In [421]:
if (listc[counter] not in keep):
    keep.append(listc[counter])

In [424]:
counter+=1
counter

102

In [459]:
listc[counter]

'pobp'

In [426]:
len(keep)

71

In [427]:
keep

['povpip',
 'privcov',
 'pubcov',
 'rac1p',
 'rac2p',
 'rac3p',
 'racaian',
 'racasn',
 'racblk',
 'racnh',
 'racnum',
 'racpi',
 'racsor',
 'racwht',
 'sciengp',
 'sciengrlp',
 'sfn',
 'sfr',
 'pobp',
 'socp',
 'fddrsp',
 'fdearp',
 'fdeyep',
 'fdisp',
 'fdoutp',
 'fdphyp',
 'fdratp',
 'fdratxp',
 'fdremp',
 'fengp',
 'fesrp',
 'ffodp',
 'fgclp',
 'fgcmp',
 'fgcrp',
 'fhins1p',
 'fhins2p',
 'fhins3c',
 'fhins3p',
 'fhins4c',
 'fhins4p',
 'fhins5c',
 'fhins5p',
 'fhins6p',
 'fhins7p',
 'fhisp',
 'fintp',
 'flanp',
 'flanxp',
 'fmigp',
 'foccp',
 'foip',
 'fpap',
 'fpernp',
 'fpincp',
 'fpubcovp',
 'fracp',
 'frelp',
 'fretp',
 'fschgp',
 'fschlp',
 'fschp',
 'fsemp',
 'fsexp',
 'fssip',
 'fssp',
 'fwagp',
 'fwkhp',
 'fwklp',
 'fwkwp',
 'fwrkp']
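
The cells above step through `listc` one column at a time and append a name to `keep` only if it is not already there. A minimal sketch of that order-preserving, duplicate-free append without a manual counter is shown below. The `extend_unique` helper and the `candidates` list are made up for illustration; which columns actually belong in `keep` is still a judgment call for whoever reviews them.

```python
def extend_unique(keep, names):
    """Append each name to `keep` once, preserving first-seen order."""
    seen = set(keep)
    for name in names:
        if name not in seen:
            keep.append(name)
            seen.add(name)
    return keep


# Made-up sample with a repeat, standing in for a slice such as df.columns[104:207].
candidates = ["povpip", "privcov", "pubcov", "privcov", "rac1p"]
keep_demo = extend_unique([], candidates)
```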

Make sure you close the DB connection when you are done using it; the final cell of this notebook calls `conn.close()`.

In [431]:
df[keep].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359075 entries, 0 to 359074
Data columns (total 71 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   povpip     347475 non-null  float64
 1   privcov    359075 non-null  object 
 2   pubcov     359075 non-null  object 
 3   rac1p      359075 non-null  object 
 4   rac2p      359075 non-null  object 
 5   rac3p      359075 non-null  object 
 6   racaian    359075 non-null  object 
 7   racasn     359075 non-null  object 
 8   racblk     359075 non-null  object 
 9   racnh      359075 non-null  object 
 10  racnum     359075 non-null  object 
 11  racpi      359075 non-null  object 
 12  racsor     359075 non-null  object 
 13  racwht     359075 non-null  object 
 14  sciengp    91236 non-null   object 
 15  sciengrlp  91236 non-null   object 
 16  sfn        9411 non-null    object 
 17  sfr        9411 non-null    object 
 18  pobp       359075 non-null  object 
 19  socp       215545 non-n

In [457]:
df.loc[df['fwkwp'] == '1', ['fwkwp', 'fwrkp']]

Unnamed: 0,fwkwp,fwrkp
38,1,0
39,1,0
99,1,0
111,1,0
148,1,0
...,...,...
358965,1,0
358967,1,0
359016,1,0
359029,1,0


In [441]:
df['fwrkp']

0         0
1         0
2         0
3         0
4         0
         ..
359070    0
359071    0
359072    0
359073    0
359074    0
Name: fwrkp, Length: 359075, dtype: object

In [462]:
pd.read_sql(
"""
SELECT *
FROM pums_2017
WHERE agep BETWEEN 16.0 AND 24.0
""", conn).shape

(38170, 286)

In [None]:
conn.close()
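
If a cell fails before `conn.close()` runs, the connection stays open. One way to guard against that is `contextlib.closing`, sketched here with an in-memory SQLite database so the example runs on its own. The table `t` and its rows are invented for illustration. The same wrapper works for any object with a `close()` method, which includes a `psycopg2` connection. Note that using a `psycopg2` connection directly in a `with` statement ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls .close() on exit, even if the query raises.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE t (agep REAL)")
    demo_conn.executemany("INSERT INTO t VALUES (?)", [(16.0,), (30.0,), (24.0,)])
    # Count in SQL instead of fetching every row and reading .shape.
    n_youth = pd.read_sql(
        "SELECT COUNT(*) AS n FROM t WHERE agep BETWEEN 16 AND 24;", demo_conn
    )["n"].iloc[0]
```

The count is computed by the database, so only a single number crosses into Pandas.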