# Data Extraction

This notebook downloads the IPUMS data to your local machine. The last cell
in this notebook can be pasted into other notebooks to load the dataset from the
exported pickle (.pkl) file.

## Load Dependencies

In [1]:
# Load Dependencies
from pathlib import Path
import pandas as pd

In [2]:
# Load Custom Scripts
from src.utils.ipums_extract import (
    get_ipums_data,
    load_ipums_from_pkl,
)

## Define Parameters and Variables

You can obtain your IPUMS API key here:
[https://account.ipums.org/api_keys](https://account.ipums.org/api_keys).

**Caution:** Avoid pushing your IPUMS API key to the repo!

In [None]:
# Define API Key
API_KEY = "key"
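One way to keep the key out of the repo is to read it from an environment variable instead of typing it into the cell. The sketch below is a minimal example of that idea; the helper `read_api_key` and the variable name `IPUMS_API_KEY` are assumptions made for illustration and are not part of this project.

```python
import os


def read_api_key(env_var: str = "IPUMS_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical;
    use whatever name you exported in your shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your IPUMS API key."
        )
    return key
```

With the variable set in your shell, `API_KEY = read_api_key()` could replace the hard-coded placeholder above.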

In [4]:
# Define Parameters
DOWNLOAD_DIR = Path(r"data")
PKL_EXPORT = True
PKL_PATH = Path(r"data/mozambique.pkl")

collection = "ipumsi"
description = "data mining mozambique project"
samples = ["mz1997a", "mz2007a", "mz2017a"]
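Whether `get_ipums_data` creates missing directories itself is not shown in this notebook, so it may be worth creating them up front. The helper `ensure_dirs` below is a hypothetical sketch that uses only `pathlib`.

```python
from pathlib import Path


def ensure_dirs(download_dir: Path, pkl_path: Path) -> None:
    """Create the download directory and the pickle's parent directory if missing.

    Hypothetical helper, not part of src.utils.ipums_extract.
    """
    Path(download_dir).mkdir(parents=True, exist_ok=True)
    Path(pkl_path).parent.mkdir(parents=True, exist_ok=True)
```

Calling `ensure_dirs(DOWNLOAD_DIR, PKL_PATH)` before submitting the extract is harmless when the directories already exist, because `exist_ok=True` suppresses the error in that case.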

In [5]:
# Define Variables
variables = [
    'PERSONS',                                  # Tech Households
    'GQ',                                       # Group Quarters
    'URBAN',                                    # Global Geography
    'GEO1_MZ', 'GEO2_MZ',                       # National Geography
    'OWNERSHIP',                                # Household Economic
    'PHONE',                                    # Utilities
    'AUTOS',                                    # Appliances
    'ROOMS',                                    # Dwelling Characteristics
    'HHTYPE',                                   # Constructed Household
    'FAMSIZE', 'NCHILD',                        # Constructed Family
    'RESIDENT',                                 # Presence Indicator
    'AGE', 'SEX', 'MARST',                      # Demographics
    'MORTMOT', 'MORTFAT',                       # Mortality
    'NATIVITY', 'CITIZEN', 'BPL1_MZ',           # Nativity and Birthplace
    'SCHOOL', 'LIT', 'EDATTAIN',                # Education
    'EMPSTAT', 'LABFORCE',                      # Work
    'MIGRATE1', 'MIGRATE5']                     # Migration (RESPONSE)
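A duplicated name is easy to introduce when editing a long list like this by hand. As an optional sanity check before submitting, the sketch below reports any name that appears more than once, ignoring case. `find_duplicates` is a hypothetical helper, and how IPUMS treats duplicate variable names is not covered here.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted, upper-cased names that occur more than once."""
    counts = Counter(name.upper() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)
```

For the list defined above, `find_duplicates(variables)` returns an empty list.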

## Submit Extract

The cell below submits the extract request to the IPUMS servers, waits for it to
finish processing, downloads it, and preprocesses the result. Note that this may
take several minutes.

In [6]:
# Get IPUMS Data
mig1_data, mig5_data = get_ipums_data(
    collection=collection,
    description=description,
    samples=samples,
    variables=variables,
    api_key=API_KEY,
    download_dir=DOWNLOAD_DIR,
    pkl_export=PKL_EXPORT,
    pkl_path=PKL_PATH
)

Extract submitted to IPUMS. Extract ID: 12.
Waiting for extract to finish processing on IPUMS server... [complete]
Downloading extract to data ... [complete]
Extracting data from extract to DataFrame...

See the `ipums_conditions` attribute of this codebook for terms of use.
See the `ipums_citation` attribute of this codebook for the appropriate citation.


 [complete]
Updating DataFrame with labels... [complete]
Transforming data to fix NIU/unknown values, other issues... [complete]
Processing migration response variables (MIGRATE1, MIGRATE5)... [complete]
Removing metadata columns unnecessary for analyses... [complete]
Removing detailed columns unnecessary for analyses... [complete]
Standardizing binary variables... [complete]
Binarizing categorical variables... [complete]
Binning continuous variables... [complete]
Saving IPUMS DataFrame to data/mozambique.pkl ... [complete]

**** IPUMS dataset extraction and processing complete. ****


## Load Exported Extract

As mentioned previously, the cell below can be pasted into other notebooks and
scripts to load the exported, preprocessed extract. When used elsewhere, it also
needs the `load_ipums_from_pkl` import and the `PKL_PATH` definition from the
cells above.

In [7]:
# Load from PKL
mig1_data, mig5_data = load_ipums_from_pkl(PKL_PATH)

print(mig1_data.shape)
print(mig5_data.shape)

(5929529, 66)
(4974569, 66)
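When the load cell is pasted into a notebook on a machine where the extract has never been run, the pickle file will not exist yet. The sketch below is one way to fail early with a clearer message; `require_pickle` is a hypothetical helper and not part of `src.utils.ipums_extract`.

```python
from pathlib import Path


def require_pickle(pkl_path) -> Path:
    """Return pkl_path as a Path, or raise if the file does not exist."""
    pkl_path = Path(pkl_path)
    if not pkl_path.is_file():
        raise FileNotFoundError(
            f"{pkl_path} not found; run the extract notebook first to create it."
        )
    return pkl_path
```

It could wrap the path in the load call, for example `load_ipums_from_pkl(require_pickle(PKL_PATH))`.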
