# Immigrants in CA - JTF
#### A repo containing the code and data to reproduce the numbers in PPIC's Immigrants in CA JTF

<details>
    <summary><strong>Goal</strong></summary>
    The goal of this notebook is to reproduce the numbers in bullet 4 of the Immigrants in CA JTF.
</details>

<details>
    <summary><strong>Context</strong></summary>
    We've downloaded raw data from <strong><i>ipums.org</i></strong>.
    It includes ACS 2004-2018 and the variables:
    <ul>
        <li>statefip</li>
        <li>bpld</li>
        <li>citizen</li>
        <li>yrsusa2</li>
        <li>migplac1</li>
    </ul>
    plus the typical variables included by IPUMS (perwt, hhwt, gq, etc.).
</details>

***
### Set up working environment

In [1]:
import pandas as pd
import gzip
from pathlib import Path
from tools import tree
from datetime import datetime as dt
today = dt.today().strftime("%d-%b-%y")

today

'09-Jan-20'

In [2]:
RAW_DATA = Path("../data/raw/")
INTERIM_DATA = Path("../data/interim/")
PROCESSED_DATA = Path("../data/processed/")
FINAL_DATA = Path("../data/final/")

In [3]:
tree(RAW_DATA)

+ ..\data\raw
    + usa_00072.dta.gz


***
### Loading the data into `pandas`

**The gzipped Stata file is opened with `gzip` before it is passed to `pd.read_stata`. Gzipped CSVs would not need this step, but we use Stata files because they preserve _categoricals_.**

In [4]:
with gzip.open(RAW_DATA / 'usa_00072.dta.gz', 'r') as file:
    data = pd.read_stata(file)
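
To illustrate why the Stata format is worth the extra `gzip` step, here is a minimal sketch that round-trips a toy table through a gzipped `.dta` file and checks that the categorical column survives. The table, its values, and the file names are made up for illustration; they are not part of the IPUMS extract.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the real extract.
toy = pd.DataFrame({
    "statefip": pd.Categorical(["california", "texas", "california"]),
    "perwt": [220, 222, 221],
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "toy.dta"
    gz_path = Path(tmp) / "toy.dta.gz"

    # Write an ordinary Stata file, then gzip its bytes.
    toy.to_stata(plain_path, write_index=False)
    gz_path.write_bytes(gzip.compress(plain_path.read_bytes()))

    # Same pattern as the cell above: open with gzip, hand the file object to pandas.
    with gzip.open(gz_path, "rb") as file:
        roundtrip = pd.read_stata(file)

# The column comes back as a categorical, not as plain strings.
print(roundtrip.dtypes)
```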

In [5]:
data.head()

| | year | sample | serial | cbserial | hhwt | cluster | statefip | strata | gq | pernum | perwt | age | bpl | bpld | citizen | yrsusa2 | migplac1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004 | 2004 acs | 23712 | | 208 | 2004000000000.0 | california | 6 | households under 1970 definition | 1 | 220 | 37 | nebraska | nebraska | | | |
| 1 | 2004 | 2004 acs | 23712 | | 208 | 2004000000000.0 | california | 6 | households under 1970 definition | 2 | 222 | 14 | california | california | | | |
| 2 | 2004 | 2004 acs | 23712 | | 208 | 2004000000000.0 | california | 6 | households under 1970 definition | 3 | 221 | 12 | california | california | | | |
| 3 | 2004 | 2004 acs | 23712 | | 208 | 2004000000000.0 | california | 6 | households under 1970 definition | 4 | 220 | 47 | indiana | indiana | | | |
| 4 | 2004 | 2004 acs | 23713 | | 244 | 2004000000000.0 | california | 6 | households under 1970 definition | 1 | 258 | 29 | florida | florida | | | |


***
### Subsetting to the year of interest

**Note: `year` is still a categorical variable so we must use strings to filter on it.**

In [6]:
data['year'].cat.categories

Index(['2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018'],
      dtype='object')

In [7]:
working_data = data[data['year'] == '2016'].copy()
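
The note about filtering with strings can be seen on a toy series. This is a sketch with made-up values: comparing a string-labelled categorical against the integer `2016` silently matches nothing, while the string `'2016'` matches as expected.

```python
import pandas as pd

# Hypothetical stand-in for the `year` column, whose categories are strings.
year = pd.Series(["2015", "2016", "2016"], dtype="category")

# An integer is not one of the categories, so every comparison is False.
int_matches = (year == 2016).sum()

# The string label matches the two 2016 rows.
str_matches = (year == "2016").sum()

# If numeric years are needed later, convert explicitly.
year_as_int = year.astype(str).astype(int)

print(int_matches, str_matches, year_as_int.tolist())
```

Because the integer comparison does not raise an error, an empty subset is the only symptom of using the wrong type.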

The list of categories is too long to inspect with `data['bpld'].cat.categories`, whose display is abbreviated, so we print every category, in order, to choose the cut-off points for the filters later on.

In [8]:
for index,category in enumerate(working_data['bpld'].cat.categories, start = 1):
    print(f"{index}.\t{category}")

1.	alabama
2.	alaska
3.	arizona
4.	arkansas
5.	california
6.	colorado
7.	connecticut
8.	delaware
9.	district of columbia
10.	florida
11.	georgia
12.	hawaii
13.	idaho
14.	illinois
15.	indiana
16.	iowa
17.	kansas
18.	kentucky
19.	louisiana
20.	maine
21.	maryland
22.	massachusetts
23.	michigan
24.	minnesota
25.	mississippi
26.	missouri
27.	montana
28.	nebraska
29.	nevada
30.	new hampshire
31.	new jersey
32.	new mexico
33.	new york
34.	north carolina
35.	north dakota
36.	ohio
37.	oklahoma
38.	oregon
39.	pennsylvania
40.	rhode island
41.	south carolina
42.	south dakota
43.	tennessee
44.	texas
45.	utah
46.	vermont
47.	virginia
48.	washington
49.	west virginia
50.	wisconsin
51.	wyoming
52.	american samoa
53.	samoa, 1940-1950
54.	guam
55.	puerto rico
56.	u.s. virgin islands
57.	canada
58.	bermuda
59.	cape verde
60.	mexico
61.	belize/british honduras
62.	costa rica
63.	el salvador
64.	guatemala
65.	honduras
66.	nicaragua
67.	panama
68.	cuba
69.	dominican republic
70.	haiti
71.	jamaica
72.	antig
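
As a complement to reading the printed list, the position of a category can also be looked up programmatically. The sketch below uses a short, made-up category list (the real `bpld` list is much longer) to show which labels a pair of cut-off points would include.

```python
import pandas as pd

# Hypothetical, shortened category list; order matters, as in the printout above.
bpld = pd.Series(pd.Categorical(
    ["mexico", "canada", "cuba"],
    categories=["canada", "bermuda", "mexico", "costa rica", "cuba"],
    ordered=True,
))

categories = bpld.cat.categories

# Positions of the two cut-off labels within the category order.
start = categories.get_loc("mexico")
stop = categories.get_loc("cuba")

# Every label that a "mexico to cuba" range would cover, inclusive.
in_range = list(categories[start:stop + 1])
print(in_range)
```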

### Filters

In [9]:
mask_state = working_data['statefip'] == 'california'
mask_5years = working_data['yrsusa2'] == "0-5 years"
mask_latam = (working_data['bpld'] >= 'mexico') & (working_data['bpld'] <= 'south america, ns')
mask_asia = (working_data['bpld'] >= 'china') & (working_data['bpld'] <= 'asia, nec/ns')
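
The range filters above rely on `bpld` being an ordered categorical, so `>=` and `<=` follow the category order rather than alphabetical order. The following sketch, with a made-up six-label ordering, illustrates that behaviour and shows that the same comparison fails on an unordered categorical.

```python
import pandas as pd

# Hypothetical category order: one "other" label, three Latin American, two Asian.
order = ["canada", "mexico", "belize", "brazil", "china", "japan"]
bpld = pd.Series(pd.Categorical(
    ["japan", "belize", "canada", "brazil", "mexico"],
    categories=order,
    ordered=True,
))

# Comparisons use the position in `order`, so "belize" falls between
# "mexico" and "brazil" even though it sorts before both alphabetically.
mask_latam = (bpld >= "mexico") & (bpld <= "brazil")
mask_asia = (bpld >= "china") & (bpld <= "japan")

# Without an order, range comparisons are rejected.
unordered = bpld.cat.as_unordered()
try:
    unordered >= "mexico"
    unordered_comparison_failed = False
except TypeError:
    unordered_comparison_failed = True

print(mask_latam.tolist(), mask_asia.tolist(), unordered_comparison_failed)
```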

Subsetting the dataset to only California and those _recently arrived_ (0-5 years).

In [10]:
california = working_data[mask_state & mask_5years].copy()

New column: `place of origin` - aggregation of `bpld`.

In [11]:
california.loc[mask_latam, 'place of origin'] = 'Latin America'
california.loc[mask_asia, 'place of origin'] = 'Asia'
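# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
california['place of origin'] = california['place of origin'].fillna('Other')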
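
An alternative way to build the same kind of aggregated column in one step is `numpy.select`, sketched here on made-up rows. This is not the notebook's method. The first matching condition wins, which makes no difference as long as the masks do not overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the `isin` masks stand in for the range masks used above.
toy = pd.DataFrame({"bpld": ["mexico", "china", "canada", "guatemala"]})
is_latam = toy["bpld"].isin(["mexico", "guatemala"])
is_asia = toy["bpld"].isin(["china"])

# Label each row by the first condition it satisfies; everything else is 'Other'.
toy["place of origin"] = np.select(
    [is_latam, is_asia],
    ["Latin America", "Asia"],
    default="Other",
)

print(toy)
```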

In [12]:
place_of_origin_shares = (
    california.groupby(['place of origin'])
    .agg({'perwt': 'sum'})
    .apply(lambda x: x / x.sum())  # each group's share of the total person weight
    .style.format("{:.0%}")
)

In [13]:
place_of_origin_shares

| place of origin | perwt |
|---|---|
| Asia | 55% |
| Latin America | 29% |
| Other | 16% |
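
The shares above are weighted: each row counts for its `perwt` rather than for one person. The sketch below uses made-up weights (unrelated to the results above) to show the calculation and how it can differ from a simple row count.

```python
import pandas as pd

# Hypothetical sample: four respondents with different person weights.
toy = pd.DataFrame({
    "place of origin": ["Asia", "Asia", "Latin America", "Other"],
    "perwt": [100, 300, 400, 200],
})

# Weighted share: sum of person weights per group over the total weight.
totals = toy.groupby("place of origin")["perwt"].sum()
weighted_shares = totals / totals.sum()

# Unweighted share: fraction of rows per group, ignoring the weights.
unweighted_shares = toy["place of origin"].value_counts(normalize=True)

print(weighted_shares)
print(unweighted_shares)
```

In this toy sample Asia makes up half of the rows but 40% of the weighted population, which is why the notebook sums `perwt` instead of counting rows.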


[**Other notebook**](00_newly_arrivals_chart.ipynb)