# Scraping latest DACA data from USCIS

The source PDFs are linked at https://www.uscis.gov/tools/reports-studies/immigration-forms-data

In [1]:
from tabula import read_pdf
import pandas as pd

### DACA Expiration Data

This report reflects the most up-to-date data available at the time the report was generated. It gives the number of individuals with a DACA expiration date on or after Mar. 31, 2018, as of Mar. 31, 2018. Individuals who have obtained Lawful Permanent Resident Status or U.S. Citizenship are excluded. Totals may not sum due to rounding.

In [2]:
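# Note: tabula-py 2.0 and later return a list of DataFrames by default, so newer installs may need the first element of the result.
df = read_pdf("https://www.uscis.gov/sites/default/files/USCIS/Resources/Reports%20and%20Studies/Immigration%20Forms%20Data/All%20Form%20Types/DACA/DACA_Expiration_Data_Mar_31_2018.pdf")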

In [3]:
df.head()

Unnamed: 0,Approximate Active DACA Recipients:
0,"As of March 31, 2018"
1,Month/Year Current DACA Number Number with Ren...
2,Expires (Rounded) Pending (Rounded)
3,Mar-18 50 30
4,"Apr-18 2,200 1,020"


In [4]:
df.tail()

Unnamed: 0,Approximate Active DACA Recipients:
24,"Dec-19 9,780 30"
25,"Jan-20 8,120 20"
26,"Feb-20 21,660 40"
27,"Mar-20 25,070 30"
28,"Grand Total 693,850 26,350"


In [5]:
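# Give the single text column a workable name.
df.rename(columns={'Approximate Active DACA Recipients:': 'CurrentExpiration'}, inplace=True)

# Join "Grand Total" into one token so every data row splits into exactly three fields.
df.CurrentExpiration = df.CurrentExpiration.replace(to_replace='Grand Total', value='GrandTotal', regex=True)
# Split on the first two spaces only: month/year, rounded number, rounded pending count.
df = pd.DataFrame(df.CurrentExpiration.str.split(' ', n=2).tolist(),
                  columns=['CurrentExpiration', 'Number', 'Pending'])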

In [6]:
# Drop the three title/header rows at the top and the Grand Total row at the bottom.
df = df[3:-1]

In [7]:
df.to_csv('data/DACA_expiration_data_20180331.csv', index=False)
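The split step can be tried on its own with a few rows copied from the `df.head()` and `df.tail()` output above. This is a minimal sketch, not the notebook's exact code: it uses `expand=True` instead of building a new DataFrame from a list, and the final integer conversion is an added assumption, since the notebook saves `Number` and `Pending` as comma-formatted strings.

```python
import pandas as pd

# Three raw rows copied from the notebook output above.
raw = pd.Series(["Mar-18 50 30", "Apr-18 2,200 1,020", "Grand Total 693,850 26,350"])

# Join "Grand Total" into one token so each row contains exactly two separating spaces.
raw = raw.str.replace("Grand Total", "GrandTotal", regex=False)

# expand=True returns the three pieces as separate columns.
parts = raw.str.split(" ", n=2, expand=True)
parts.columns = ["CurrentExpiration", "Number", "Pending"]

# Assumed extra step: strip thousands separators and convert the counts to integers.
for col in ["Number", "Pending"]:
    parts[col] = parts[col].str.replace(",", "", regex=False).astype(int)

print(parts)
```

Converting to integers before saving would let downstream tools sum the columns without re-parsing the commas.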

### DACA Population Data - Country of Birth

1. The report reflects the most up-to-date data available at the time the report is generated.
2. The active DACA population are individuals who have an approved I-821D with validity as of Mar. 31, 2018. 
3. Individuals who have obtained Lawful Permanent Resident Status or U.S. Citizenship are excluded.
4. Totals may not sum due to rounding.
5. Countries with fewer than 10 active DACA recipients are notated with the letter "D."
6. Not available means the data is not available in the electronic systems.
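Note 5 means the count column can contain the letter "D" instead of a number, which matters if the values are later treated as numeric. The sketch below shows one possible way to handle that with `pd.to_numeric`; the `Suppressed` column and the "Exampleland" row are hypothetical and exist only for illustration.

```python
import pandas as pd

# "Mexico" and its count come from the report; "Exampleland" is a made-up
# row standing in for a country whose count is suppressed as "D".
sample = pd.DataFrame({
    "Country": ["Mexico", "Exampleland"],
    "Number": ["553200", "D"],
})

# Record which rows were suppressed before the marker is lost.
sample["Suppressed"] = sample["Number"].eq("D")

# errors="coerce" turns "D" into NaN instead of raising an error.
sample["Number"] = pd.to_numeric(sample["Number"], errors="coerce")

print(sample)
```

Keeping a separate flag column preserves the difference between "fewer than 10" and a value that is simply missing.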

In [8]:
df = read_pdf("https://www.uscis.gov/sites/default/files/USCIS/Resources/Reports%20and%20Studies/Immigration%20Forms%20Data/All%20Form%20Types/DACA/DACA_Population_Data_Mar_31_2018.pdf", pages="1-5")

In [9]:
df

Unnamed: 0.1,Unnamed: 0,Approximate Active DACA Recipients:,Unnamed: 2,Unnamed: 3
0,,Country of Birth,,
1,,"As of March 31, 2018",,
2,,,Number,
3,,Country of Birth,,
4,,,(rounded),
5,Grand Total,,,693850
6,Mexico,,,553200
7,El Salvador,,,26160
8,Guatemala,,,17920
9,Honduras,,,16420


In [10]:
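# Assign the result back: calling fillna(inplace=True) on a selected column may not update df in recent pandas.
df['Unnamed: 3'] = df['Unnamed: 3'].fillna(df['Approximate Active DACA Recipients:'])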
df.rename(columns={'Unnamed: 0': 'Country','Unnamed: 3': 'Number'},inplace=True)
df.drop(['Approximate Active DACA Recipients:','Unnamed: 2'], axis=1, inplace=True)
df = df[6:]
df

Unnamed: 0,Country,Number
6,Mexico,553200
7,El Salvador,26160
8,Guatemala,17920
9,Honduras,16420
10,Peru,7220
11,"Korea, South",7150
12,Brazil,5730
13,Ecuador,5360
14,Colombia,4910
15,Argentina,3880


In [11]:
df.to_csv('data/DACA_population_data_country_of_birth_20180331.csv', index=False)

### DACA Population Data - State/Territory of Residence

1. The report reflects the most up-to-date data available at the time the report is generated.
2. The active DACA population are individuals who have an approved I-821D with validity as of Mar. 31, 2018. 
3. Individuals who have obtained Lawful Permanent Resident Status or U.S. Citizenship are excluded.
4. Totals may not sum due to rounding.
5. States/Territories with fewer than 10 active DACA recipients are notated with the letter "D."

In [12]:
df = read_pdf("https://www.uscis.gov/sites/default/files/USCIS/Resources/Reports%20and%20Studies/Immigration%20Forms%20Data/All%20Form%20Types/DACA/DACA_Population_Data_Mar_31_2018.pdf", pages="6-7")

In [13]:
df

Unnamed: 0.1,Unnamed: 0,Approximate Active DACA Recipients:,Unnamed: 2,Unnamed: 3
0,,State or Territory of Residence,,
1,,"As of March 31, 2018",,
2,,,Number,
3,,State or Territory of Residence,,
4,,,(rounded),
5,Grand Total,,,693850
6,California,,,199230
7,Texas,,,113960
8,Illinois,,,36740
9,New York,,,31880


In [14]:
# Assign the result back rather than relying on inplace=True on a selected column.
df['Unnamed: 3'] = df['Unnamed: 3'].fillna(df['Approximate Active DACA Recipients:'])
df.rename(columns={'Unnamed: 0': 'State/Territory','Unnamed: 3': 'Number'},inplace=True)
df.drop(['Approximate Active DACA Recipients:','Unnamed: 2'], axis=1, inplace=True)
df = df[6:]

df

Unnamed: 0,State/Territory,Number
6,California,199230
7,Texas,113960
8,Illinois,36740
9,New York,31880
10,Florida,26900
11,Arizona,25970
12,North Carolina,25380
13,Georgia,21880
14,New Jersey,17890
15,Washington,16880


In [15]:
df.to_csv('data/DACA_population_data_state_of_residence_20180331.csv', index=False)

### DACA Population Data - Core Based Statistical Area

1. The report reflects the most up-to-date data available at the time the report is generated.
2. The Active DACA population are individuals who have an approved I-821D with validity as of Mar. 31, 2018.
3. Individuals who have obtained Lawful Permanent Resident Status or U.S. Citizenship are excluded.
4. Core Based Statistical Areas (CBSA) at the time of most recent application. CBSAs are defined by the Office of Management and Budget.
5. CBSAs with fewer than 1,000 individuals are included in Other CBSA.
6. Not available means the data is not available in the electronic systems. 
7. Totals may not sum due to rounding.

In [16]:
df = read_pdf("https://www.uscis.gov/sites/default/files/USCIS/Resources/Reports%20and%20Studies/Immigration%20Forms%20Data/All%20Form%20Types/DACA/DACA_Population_Data_Mar_31_2018.pdf", pages="8-10")

In [17]:
df

Unnamed: 0.1,Unnamed: 0,Approximate Active DACA Recipients:,Unnamed: 2,Unnamed: 3
0,,Core Based Statistical Area,,
1,,"As of March 31, 2018",,
2,,,Number,
3,,Core Based Statistical Area,,
4,,,(rounded),
5,Grand Total,,,693850
6,"Los Angeles-Long Beach-Anaheim, CA",,,88180
7,"New York-Newark-Jersey City, NY-NJ-PA",,,46370
8,"Dallas-Fort Worth-Arlington, TX",,,37290
9,"Chicago-Naperville-Elgin, IL-IN-WI",,,35030


In [18]:
# Assign the result back rather than relying on inplace=True on a selected column.
df['Unnamed: 3'] = df['Unnamed: 3'].fillna(df['Approximate Active DACA Recipients:'])
df.rename(columns={'Unnamed: 0': 'CBSA','Unnamed: 3': 'Number'},inplace=True)
df.drop(['Approximate Active DACA Recipients:','Unnamed: 2'], axis=1, inplace=True)
df = df[6:]
df

Unnamed: 0,CBSA,Number
6,"Los Angeles-Long Beach-Anaheim, CA",88180
7,"New York-Newark-Jersey City, NY-NJ-PA",46370
8,"Dallas-Fort Worth-Arlington, TX",37290
9,"Chicago-Naperville-Elgin, IL-IN-WI",35030
10,"Houston-The Woodlands-Sugar Land, TX",34920
11,"Riverside-San Bernardino-Ontario, CA",23950
12,"Phoenix-Mesa-Scottsdale, AZ",22370
13,"Atlanta-Sandy Springs-Roswell, GA",15430
14,"San Francisco-Oakland-Hayward, CA",14930
15,"Washington-Arlington-Alexandria, DC-VA-MD-WV",13320


In [19]:
df.to_csv('data/DACA_population_CBSA_20180331.csv', index=False)
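The country, state, and CBSA tables go through the same four cleanup steps, differing only in the name given to the label column. A minimal sketch of how they could share one function follows; `clean_population_table` is a hypothetical helper, and the small `raw` frame imitates the state table's shape as shown in the output above, with counts assumed to be stored as strings.

```python
import pandas as pd

def clean_population_table(raw, label):
    """Hypothetical helper mirroring the cleanup cells in this notebook."""
    out = raw.copy()
    # Fill gaps in the count column from the middle column.
    out["Unnamed: 3"] = out["Unnamed: 3"].fillna(out["Approximate Active DACA Recipients:"])
    out = out.rename(columns={"Unnamed: 0": label, "Unnamed: 3": "Number"})
    out = out.drop(columns=["Approximate Active DACA Recipients:", "Unnamed: 2"])
    # Rows 0-5 hold the title lines and the Grand Total row.
    return out[6:]

# Stand-in for the tabula output, built from the first rows of the state table.
raw = pd.DataFrame({
    "Unnamed: 0": [None] * 5 + ["Grand Total", "California", "Texas"],
    "Approximate Active DACA Recipients:": [
        "State or Territory of Residence", "As of March 31, 2018", None,
        "State or Territory of Residence", None, None, None, None,
    ],
    "Unnamed: 2": [None, None, "Number", None, "(rounded)", None, None, None],
    "Unnamed: 3": [None] * 5 + ["693850", "199230", "113960"],
})

states = clean_population_table(raw, "State/Territory")
print(states)
```

Because the helper copies its input, the raw frame stays available for inspection if a later PDF release changes the layout.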

### DACA Population Data - Gender, Age, Marital Status

1. The report reflects the most up-to-date data available at the time the report is generated.
2. The Active DACA population are individuals who have an approved I-821D with validity as of Mar. 31, 2018. 
3. Individuals who have obtained Lawful Permanent Resident Status or U.S. Citizenship are excluded.
4. Totals may not sum due to rounding.
5. Not available means the data is not available in the electronic systems.
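Note 4 can be checked directly against the gender figures in this report (Female 365,380; Male 328,410; Not available 70; Grand Total 693,850). The short sketch below adds the rounded components and compares them with the published total; the variable names are illustrative only.

```python
# Rounded gender counts and grand total as published in the Mar. 31, 2018 report.
counts = {"Female": 365380, "Male": 328410, "Not available": 70}
grand_total = 693850

component_sum = sum(counts.values())
difference = component_sum - grand_total

print(component_sum, difference)  # 693860 10
```

The components exceed the grand total by 10, which is the kind of small discrepancy the rounding note describes.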

In [20]:
df = read_pdf("https://www.uscis.gov/sites/default/files/USCIS/Resources/Reports%20and%20Studies/Immigration%20Forms%20Data/All%20Form%20Types/DACA/DACA_Population_Data_Mar_31_2018.pdf", pages=11)

In [21]:
df

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Approximate Active DACA Recipients:,Unnamed: 3
0,,,Gender,
1,,,"As of March 31, 2018",
2,,Sex,Number (rounded),
3,Grand Total,,,693850
4,Female,,,365380
5,Male,,,328410
6,Not available,,,70
7,,Approximate Active DACA Recipients:,,
8,,Age Group,,
9,,"As of March 31, 2018",,


In [22]:
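# Drop the two columns that hold table titles and header text rather than data.
df.drop(['Approximate Active DACA Recipients:','Unnamed: 1'], axis=1, inplace=True)

# The row positions below are hard-coded for the layout of this PDF page.
# Gender rows (Female, Male, Not available).
df1 = df[4:7].copy()
df1.rename(columns={'Unnamed: 0': 'Gender','Unnamed: 3': 'Number'},inplace=True)

# Age group rows.
df2 = df[12:18].copy()
df2.rename(columns={'Unnamed: 0': 'Age','Unnamed: 3': 'Number'},inplace=True)

# Marital status rows.
df3 = df[26:31].copy()
df3.rename(columns={'Unnamed: 0': 'MaritalStatus','Unnamed: 3': 'Number'},inplace=True)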

df1

Unnamed: 0,Gender,Number
4,Female,365380
5,Male,328410
6,Not available,70


In [23]:
df1.to_csv('data/DACA_population_gender_20180331.csv', index=False)
df2.to_csv('data/DACA_population_age_20180331.csv', index=False)
df3.to_csv('data/DACA_population_maritalstatus_20180331.csv', index=False)