# Travel.State.Gov Visa Issuances
### [Output Dataset(s)](../data/extracted_data/state-dept)

## Purpose 
This notebook "scrapes" (extracts) the data from the PDF files published on the [Monthly Immigrant Visa Issuance Statistics](https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html) page. The State Department releases monthly data on visa issuances for both immigrant and nonimmigrant visas; this notebook works with the immigrant visa files.

The PDFs come in two forms:
  * Posts: counts of visas issued by post and visa class.
  * FSC (Foreign State of Chargeability, or Place of Birth): counts of visas issued by FSC and visa class.


<img src="../misc/images/monthly_visa_stats_pdf.png" width=500/>

  

This notebook provides specific functionality to: 
1. Download all PDF files to a local directory (could be applied to another site)
2. Extract structured data from all PDFs and recode the visa types to narrower categories. 

We also provide an example summarizing this data. 

## Approach

Using Python, we will programmatically download the PDFs and then extract the information from them. Finally, we will combine the extracted tables to create a more comprehensive dataset.

## Code 

**Imports**

In [15]:
import logging
import logging.config
from pathlib import Path
import requests

from bs4 import BeautifulSoup
import pandas as pd
from PyPDF2 import PdfFileReader
import tabula
import time

from urllib.parse import urljoin, urlparse

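# Use the fully qualified option name; the short form "max_rows" is ambiguous in recent pandas versions
pd.set_option("display.max_rows", 400)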
today_date = time.strftime("%Y-%m-%d")

**Source Data URL**

## 1. Download PDFs

**Functions**

In [8]:
def download_pdf(url: str, name: str, output_directory: str):
    """
    Function to download a single PDF file from a provided link.

    Parameters:
        url: Direct URL of the PDF file you want to download
        name: Label to use as the file name (".pdf" is appended)
        output_directory: Folder path to save the file in (must already exist)

    Returns:
        None. The file is saved to the output directory.

    Example:
        download_pdf(
            'https://travel.state.gov/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202020%20-%20IV%20Issuances%20by%20Post%20and%20Visa%20Class.pdf',
            'July 2020 - IV Issuances by Post and Visa Class',
            'state-dept/'
        )
    """
    output_directory = Path(output_directory)
    response = requests.get(url)
    if response.status_code == 200:
        # Write the response content to a PDF file;
        # the context manager closes the file even if the write fails
        outpath = output_directory / f"{name}.pdf"
        with open(outpath, "wb") as pdf:
            pdf.write(response.content)
        print("File ", f"{name}.pdf", " downloaded")
    else:
        print("File ", f"{name}.pdf", " not found.")


def download_all_pdf_links(url: str, output_directory: str):
    """
    Download all PDFs on a webpage where the PDFs
    are presented as links. Uses the download_pdf function
    defined above.

    Parameters:
      url (str): URL of a web page that links to many PDF documents. Each PDF link
           must be a direct download URL, not a link to another page of PDF links.
      output_directory: Folder path to save the files in

    Returns:
      None, but saves many files to the output directory.

    Examples:
      download_all_pdf_links(
        'https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html',
        'state-dept')
    """

    output_directory = Path(output_directory)
    output_directory.mkdir(exist_ok=True, parents=True)

    parse_url = urlparse(url)
    base_url = f"{parse_url.scheme}://{parse_url.netloc}"

    # Request URL and get response object
    response = requests.get(url)

    # Parse text obtained
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all hyperlinks present on webpage
    links = soup.find_all("a")

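    # Iterate through the links we found;
    # if a link points to a PDF, download it and save it in output_directory
    for link in links:
        href = link.get("href", "")
        if ".pdf" in href:
            name = link.text
            # urljoin resolves relative hrefs against the page URL and
            # leaves absolute hrefs unchanged (avoids a doubled "/")
            pdf_url = urljoin(url, href)
            download_pdf(pdf_url, name, output_directory)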
    print("All PDF files downloaded")
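The link handling in `download_all_pdf_links` can be illustrated without any network access. The sketch below shows how `urljoin` from the standard library resolves an `href` against the page URL, and how link text could be tidied before it is used as a file name. The `clean_link_text` helper and the `example.pdf` path are hypothetical and are not part of the notebook's functions.

```python
import re
from urllib.parse import urljoin

PAGE_URL = (
    "https://travel.state.gov/content/travel/en/legal/visa-law0/"
    "visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html"
)


def resolve_pdf_link(page_url: str, href: str) -> str:
    """Turn a (possibly relative) href into an absolute URL."""
    return urljoin(page_url, href)


def clean_link_text(text: str) -> str:
    """Hypothetical helper: collapse whitespace and replace characters
    that are commonly not allowed in file names."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r'[\\/:*?"<>|]', "_", text)


# "example.pdf" is a made-up path used only for illustration
absolute = resolve_pdf_link(PAGE_URL, "/content/dam/visas/Statistics/example.pdf")
print(absolute)

cleaned = clean_link_text("  March 2017 - IV Issuances by\n Post and Visa Class ")
print(cleaned)
```

An `href` that starts with `/` is resolved against the site root, while one that is already absolute is returned unchanged, so the same call covers both cases.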

### Download Single Example File 

Here we have the url for a single pdf and then pass that url (`example_pdf`) to the `download_pdf` function. 

In [9]:
# July 2020 Post file https://travel.state.gov/content/dam/visas/Statistics/Immigrant-Statistics/MonthlyIVIssuances/JULY%202020%20-%20IV%20Issuances%20by%20Post%20and%20Visa%20Class.pdf
example_pdf = (
    "https://travel.state.gov/content/dam/visas/Statistics/"
    "Immigrant-Statistics/MonthlyIVIssuances/"
    "JULY%202020%20-%20IV%20Issuances%20by%20Post%20and%20Visa%20Class.pdf"
)

download_pdf(
    example_pdf,
    "July 2020 - IV Issuances by Post and Visa Class",
    "../data/raw_source_files/state-dept/",
)

### Download all files 

Now let's download all PDFs on the State Department Visa Statistics page. We will pass the base url for that page to the `download_all_pdf_links` function, and then save them out to our `"../data/raw_source_files/state-dept"` folder. 

In [1]:
url = "https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/immigrant-visa-statistics/monthly-immigrant-visa-issuances.html"

In [7]:
download_all_pdf_links(url, "../data/raw_source_files/state-dept")

File  March 2017 - IV Issuances by FSC or Place of Birth and Visa Class.pdf  downloaded
File  March 2017 - IV Issuances by Post and Visa Class.pdf  downloaded


KeyboardInterrupt: 
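The file names printed above appear to follow a `<Month> <Year> - IV Issuances by <report> ...` pattern. A minimal sketch for parsing that pattern is shown below; it assumes every file follows the naming seen in these two examples, which may not hold for all files on the page. `ReportFile` and `parse_report_filename` are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple, Optional


class ReportFile(NamedTuple):
    month: str
    year: int
    report: str  # "post" or "fsc"


# Pattern inferred from the file names shown in the download output (an assumption)
FILENAME_PATTERN = re.compile(
    r"^(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})\s+-\s+IV Issuances by (?P<kind>Post|FSC)\b",
    re.IGNORECASE,
)


def parse_report_filename(filename: str) -> Optional[ReportFile]:
    """Return the month, year and report type encoded in a file name,
    or None if the name does not match the assumed pattern."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return ReportFile(
        month=match["month"].capitalize(),
        year=int(match["year"]),
        report=match["kind"].lower(),
    )


fsc_file = parse_report_filename(
    "March 2017 - IV Issuances by FSC or Place of Birth and Visa Class.pdf"
)
post_file = parse_report_filename("March 2017 - IV Issuances by Post and Visa Class.pdf")
print(fsc_file)
print(post_file)
```

Parsing the name once into structured fields is stricter than substring matching, at the cost of silently skipping files whose names deviate from the pattern.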

----------------

## 2. Extract Data from PDFs

To extract structured data (in tabular format) from the PDFs we use a Python package called [tabula-py](https://tabula-py.readthedocs.io/en/latest/). This package is a wrapper for Tabula, a library written in the Java programming language that extracts tables from PDFs. We also use the `PdfFileReader` class from the PyPDF2 library to count the number of pages we need to process.

In [59]:
# Note: the function below is not fully generalizable because it relies on hard-coded column names
def get_table_data(path: str, data_cols: list = ["Post", "Visa Class", "Issuances"]):
    """
    Parameters:
      path: path to the specific PDF file to extract data from
      data_cols: names of the output data columns.
        If processing the Post tables, most likely
        ["Post", "Visa Class", "Issuances"];
        if processing the FSC tables, most likely
        ["FSC", "Visa Class", "Issuances"]

    Returns:
      Pandas dataframe of structured (tabular) data extracted from the PDF
      path provided.

    Example:
      get_table_data(
        'data-repo-mvp/state-dept/April 2018 - IV Issuances by FSC or Place of Birth and Visa Class.pdf',
        data_cols = ["FSC", "Visa Class", "Issuances"]
        )

    """
    # Read the PDF to get basic info on it
    pdf = PdfFileReader(path)

    # Data Holders
    full_table = pd.DataFrame(columns=data_cols)  # Will hold the combined data

    # Processing PDF - we start with the first page (start)
    # and go to the last page (stop)
    start = 1
    stop = pdf.getNumPages() + 1
    for i in range(start, stop):
        # Extract data from the specific PDF page using Tabula
        df = tabula.read_pdf(
            path,
            pages=f"{i}",
            lattice=True,
            pandas_options={
                "header": None
            },  # none because some have headers and some dont
        )[0]

        # Edge case error correction - sometimes fully null extra columns
        # are produced by Tabula; drop them so three columns remain
        if df.shape[1] > 3:
            full_null = df.isnull().all()
            null_cols = full_null[full_null].index
            # Check the length rather than truthiness: a column labelled 0 is falsy
            if len(null_cols) > 0:
                df = df.drop(null_cols, axis=1)
            else:
                print(f"ERROR on portion of table: {path}")

        df.columns = data_cols

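        # Check whether this page has header rows; if so, drop the top 2 rows.
        # The third column of a data row should hold a count such as "1,234".
        if not str(df.iloc[1, 2]).replace(",", "").isdigit():
            df = df.loc[2:, :]

        # Append this page of data to the full table
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        full_table = pd.concat([full_table, df])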

    # Clean up and validate the full table
    # We validate by comparing the "Grand Total" row in the PDF
    # to the sum of visas in the extracted table
    full_table = full_table.reset_index(drop=True)

    grand_total = full_table[
        full_table[data_cols[0]].str.upper().str.contains("GRAND TOTAL")
    ]
    full_table = full_table.drop(grand_total.index, axis=0)

    full_table.loc[:, "Issuances"] = (
        full_table.Issuances.astype(str).str.replace(",", "").astype(int)
    )

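    table_grand_total = full_table.Issuances.sum()
    # Convert each Grand Total value to int before summing; summing the raw
    # strings would concatenate them if more than one such row were present
    row_grand_total = (
        grand_total.Issuances.astype(str).str.replace(",", "").astype(int).sum()
    )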

    assert (
        table_grand_total == row_grand_total
    ), f"Warning - Grand Total Row Does Not Equal Sum of Rows {row_grand_total} vs {table_grand_total}"
    print("Data successfully extracted.")

    return full_table


def extract_data_for_specific_year_month(
    pdf_folder_path: str, year: int, month: str, report: str
):
    """
    Helper function that extracts data from a SINGLE PDF. Pass the folder
    path where the PDF files are located; the PDF is selected by matching the
    year, the named month (for example April or May) and the report type
    (fsc or post) against the PDF file name.

    Parameters:
        pdf_folder_path: path to folder holding PDFs
        year: year of data to extract
        month: month of data to extract
        report: (options) -->  post | fsc

    Returns:
        Pandas dataframe of structured (tabular) data extracted from the
        matching PDF, or None if no matching file is found.

    Example:
        extract_data_for_specific_year_month('state-dept', 2019, 'August', 'fsc')
    """
    pdf_folder = Path(pdf_folder_path)
    report = report.lower()
    target_filepath = None
    data_cols = (
        ["Post", "Visa Class", "Issuances"]
        if report == "post"
        else ["FSC", "Visa Class", "Issuances"]
    )
    for file in pdf_folder.iterdir():
        fn = file.name.lower()
        if str(year).lower() in fn and str(month).lower() in fn and report in fn:
            target_filepath = file
            break
    if target_filepath and target_filepath.exists():
        return get_table_data(str(target_filepath), data_cols=data_cols)


def extract_data_from_many_pdfs(pdf_folder_path, start_year, stop_year, report):
    """
    Helper function that extracts data from MANY PDFs of a single
    report type (FSC, POST). Pass the folder path where the PDF files are located
    and the function retrieves data from all PDFs within a time range
    (start year to stop year) for that report type.

    Parameters:
        pdf_folder_path (str): path to folder holding PDFs
        start_year (int): start year of data to extract
        stop_year (int): stop year of data to extract (inclusive)
        report (str): (options) -->  post | fsc

    Returns:
        Pandas dataframe of structured (tabular) data combined from all
        matching PDFs, with added `source` and `year_month` columns.

    Example:
        extract_data_from_many_pdfs('state-dept', 2019, 2021, 'fsc')
    """

    months = [
        "January",
        "February",
        "March",
        "April",
        "May",
        "June",
        "July",
        "August",
        "September",
        "October",
        "November",
        "December",
    ]

    visa_raw_data = []
    for year in range(start_year, stop_year + 1):
        for month in months:
            data = extract_data_for_specific_year_month(
                pdf_folder_path, year, month, report
            )
            if data is not None:
                data["source"] = f"{year}-{month}"
                visa_raw_data.append(data)
                print(year, month, "- Processed")
            else:
                print(year, month, "- Not Available")
    out_df = pd.concat(visa_raw_data, axis=0).reset_index(drop=True)
    out_df["year_month"] = pd.to_datetime(out_df.source)
    return out_df
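The validation idea used in `get_table_data` (compare the "Grand Total" row with the sum of the detail rows) can be sketched in plain Python on a few hand-made rows. The numbers below are invented for illustration and are not real issuance counts; `parse_count` and `split_and_check` are hypothetical helper names.

```python
def parse_count(value) -> int:
    """Convert a count such as "1,234" (or 1234) to an int."""
    return int(str(value).replace(",", ""))


def split_and_check(rows):
    """Separate the Grand Total row from the detail rows and compare sums.

    `rows` is a list of (label, visa_class, issuances) tuples.
    Returns the detail rows; raises ValueError on a mismatch.
    """
    detail = [r for r in rows if "GRAND TOTAL" not in str(r[0]).upper()]
    totals = [r for r in rows if "GRAND TOTAL" in str(r[0]).upper()]
    reported = sum(parse_count(r[2]) for r in totals)
    computed = sum(parse_count(r[2]) for r in detail)
    if reported != computed:
        raise ValueError(f"Grand total {reported} != sum of rows {computed}")
    return detail


# Made-up rows for illustration only
rows = [
    ("Post A", "IR1", "1,200"),
    ("Post A", "CR1", "34"),
    ("Post B", "IR1", "6"),
    ("Grand Total", None, "1,240"),
]
detail_rows = split_and_check(rows)
print(len(detail_rows))  # 3
```

A check like this catches pages where a table was only partially extracted, because a missing row makes the computed sum fall short of the reported total.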

### Extract data for years 

Below we assign our paths to variables instead of writing them out in the function call; this just makes the code more readable. We also wrap each path in `Path("../path")` (from `pathlib`), which provides convenient functionality for handling paths and folders.
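As a small illustration of what `Path` objects offer, the sketch below builds a nested folder, joins path segments with `/`, and lists the folder contents. It runs inside a temporary directory so nothing is written to the project folders; the file name and date are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Join path segments with "/" instead of string concatenation
    extracted = base / "extracted_data" / "state-dept"

    # Create the folder (and any missing parents) if it does not exist yet
    extracted.mkdir(parents=True, exist_ok=True)

    # Hypothetical output file name, for illustration only
    out_file = extracted / "raw_posts_extract-2021-10-01.csv"
    out_file.write_text("Post,Visa Class,Issuances\n")

    names = sorted(p.name for p in extracted.iterdir())
    print(out_file.suffix, out_file.stem, names)
```

Creating the output folder with `mkdir(parents=True, exist_ok=True)` before writing avoids a `FileNotFoundError` on a fresh checkout where the folder does not exist yet.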

In [60]:
downloaded_data_path = Path("../data/raw_source_files/state-dept/")
extracted_data_path = Path("../data/extracted_data/state-dept")

Below we call `extract_data_from_many_pdfs`, which was defined a few cells above. This function uses the other helper functions to process each PDF, pull out the table data, and combine the results.

We will first extract the data from the "Post and Visa Class" PDFs. Although the variable name refers to 2019-2021, the call below processes 2021 only; to cover 2019-2021, pass 2019 as the start year.

### Getting Posts 

**Note this may take about 20 minutes to run**

Also, if processing 2017 -> 2021, then it may take even longer. 

In [61]:
posts_data_2019_2021 = extract_data_from_many_pdfs(
    downloaded_data_path,
    2021,  # start year
    2021,  # end year
    "post",  # pdf type
)



Data successfully extracted.
2021 January - Processed
Data successfully extracted.
2021 February - Processed
Data successfully extracted.
2021 March - Processed
Data successfully extracted.
2021 April - Processed
Data successfully extracted.
2021 May - Processed
Data successfully extracted.
2021 June - Processed
Data successfully extracted.
2021 July - Processed
Data successfully extracted.
2021 August - Processed
Data successfully extracted.
2021 September - Processed
2021 October - Not Available
2021 November - Not Available
2021 December - Not Available


**Now let's take a look at the data output**

We end up with a large table that combines every row (Post, Visa Class, Issuances) from the PDFs. Each row is tagged with a `source` value indicating the year and month of the data, and with a date field derived from it, `year_month`, which we can use to summarize the data.
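The conversion from a `source` label such as `2021-January` to a date can be shown with the standard library alone. The notebook lets `pd.to_datetime` infer the format; the sketch below spells the format out explicitly and assumes English month names (the `%B` directive is locale dependent). `source_to_date` is a hypothetical helper name.

```python
from datetime import datetime


def source_to_date(source: str) -> datetime:
    """Parse a "<year>-<month name>" label; the day defaults to the 1st."""
    return datetime.strptime(source, "%Y-%B")


january = source_to_date("2021-January")
september = source_to_date("2021-September")
print(january.date(), september.date())
```

The same format string could be passed to `pd.to_datetime` through its `format` argument if explicit parsing is preferred over inference.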

In [70]:
posts_data_2019_2021.head()

,Post,Visa Class,Issuances,source,year_month
0,Abidjan,CR1,11,2021-January,2021-01-01
1,Abidjan,CR2,2,2021-January,2021-01-01
2,Abidjan,IR1,16,2021-January,2021-01-01
3,Abidjan,IR2,16,2021-January,2021-01-01
4,Abidjan,IR3,2,2021-January,2021-01-01


### Getting FSC 

**Note this may take about 20 minutes to run**

In [68]:
fsc_data_2019_2021 = extract_data_from_many_pdfs(
    downloaded_data_path, 2021, 2021, "fsc"
)



Data successfully extracted.
2021 January - Processed
Data successfully extracted.
2021 February - Processed
Data successfully extracted.
2021 March - Processed
Data successfully extracted.
2021 April - Processed
Data successfully extracted.
2021 May - Processed
Data successfully extracted.
2021 June - Processed
Data successfully extracted.
2021 July - Processed
Data successfully extracted.
2021 August - Processed
Data successfully extracted.
2021 September - Processed
2021 October - Not Available
2021 November - Not Available
2021 December - Not Available


**Now take a look at the output data**

This looks very much like the post data above, but instead of a consular post, the first column holds the foreign state of chargeability.

In [69]:
fsc_data_2019_2021.head()

,FSC,Visa Class,Issuances,source,year_month
0,Afghanistan,CR1,2,2021-January,2021-01-01
1,Afghanistan,IR1,23,2021-January,2021-01-01
2,Afghanistan,IR2,2,2021-January,2021-01-01
3,Afghanistan,SB1,3,2021-January,2021-01-01
4,Afghanistan,SI1,1,2021-January,2021-01-01


### Export this data to csv

We can now call `to_csv` on each dataframe to save it out.

In [19]:
posts_data_2019_2021.to_csv(extracted_data_path / f"raw_posts_extract-{today_date}.csv")

In [23]:
fsc_data_2019_2021.to_csv(extracted_data_path / f"raw_fsc_extract-{today_date}.csv")

------------

## 3. Analyze / Summarize Data 

Now that we have this data in a structured format, we will provide some examples of reformatting and summarizing it to make it more useful.

### Example 1: Get total visas by visa class per month for the Post data

In [65]:
# Select the Issuances column before summing so text columns are not aggregated
summed_by_yearmonth_and_class_post = (
    posts_data_2019_2021.groupby(["year_month", "Visa Class"])["Issuances"]
    .sum()
    .reset_index()
)

summed_by_yearmonth_and_class_post.pivot(
    index="Visa Class", columns="year_month", values="Issuances"
).fillna(0)

Visa Class,2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01
AM,0.0,0.0,0.0,17.0,5.0,25.0,13.0,6.0,5.0
B2A,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
BC,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
BC1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
BX,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,5.0
C2A,0.0,0.0,0.0,7.0,0.0,0.0,0.0,1.0,0.0
C5,0.0,0.0,0.0,0.0,3.0,0.0,45.0,8.0,5.0
CQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,948.0,407.0
CR1,2106.0,2171.0,2869.0,1431.0,929.0,1423.0,1465.0,1353.0,1173.0
CR2,207.0,200.0,220.0,159.0,110.0,157.0,161.0,173.0,184.0


### Example 2:  Get total visas by visa class per month for the FSC data

In [71]:
# Select the Issuances column before summing so text columns are not aggregated
summed_by_yearmonth_and_class_fsc = (
    fsc_data_2019_2021.groupby(["year_month", "Visa Class"])["Issuances"]
    .sum()
    .reset_index()
)

summed_by_yearmonth_and_class_fsc.pivot(
    index="Visa Class", columns="year_month", values="Issuances"
).fillna(0)

Visa Class,2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01
AM,0.0,0.0,0.0,17.0,5.0,25.0,13.0,6.0,5.0
B2A,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
BC,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
BC1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
BX,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,5.0
C2A,0.0,0.0,0.0,7.0,0.0,0.0,0.0,1.0,0.0
C5,0.0,0.0,0.0,0.0,3.0,0.0,45.0,8.0,5.0
CQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,948.0,407.0
CR1,2106.0,2171.0,2869.0,1431.0,929.0,1423.0,1465.0,1353.0,1173.0
CR2,207.0,200.0,220.0,159.0,110.0,157.0,161.0,173.0,184.0


### Example 3: Get total visas by visa class per month with simplified coding

The State Department uses many different visa class codes. From talking to experts in the field, we understand that codes often change: new ones are added and old ones are removed. That said, many of these codes can be combined into general families of visas, which is helpful for analysis.

Below we have created an initial recoding of visas into a smaller number of classes. We are using a Python dictionary to recode the different classes.

An example of the recoding is:

```
    "IR": {
        "1a": ["IR1", "CR1", "IB1", "IW1", "VI5", "IW"],
        "1b": ["IR2", "CR2", "IB2", "IB3", "IW2"],
        "1c": ["IR5"],
        "1d": ["IR3", "IR4", "IH3", "IH4"],
    },
```

Here we are saying that `["IR1", "CR1", "IB1", "IW1", "VI5", "IW"]` can all be recoded to a higher class of `1a` or an even higher level of `IR`.

We created this recode dictionary with some help from experts in the field, but we may have made mistakes or assumptions, so treat this recode as an example only.
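To make the two-level structure concrete, the sketch below flattens a small excerpt of the recoding into a plain lookup from detail code to `(higher level, sub level)`. The excerpt, the `build_lookup` name and the `XX9` code are illustrative assumptions; the full dictionary follows in the next cell.

```python
# Small excerpt in the same shape as the full recode dictionary
recodes_excerpt = {
    "IR": {"1a": ["IR1", "CR1"], "1b": ["IR2", "CR2"]},
    "DI": ["DV1", "DV2"],  # no sub level: the top-level key is reused
}


def build_lookup(recodes):
    """Map each detail code to a (top level, sub level) pair."""
    lookup = {}
    for base, value in recodes.items():
        # A plain list has no sub level, so treat the base code as its own sub level
        groups = value if isinstance(value, dict) else {base: value}
        for sub, codes in groups.items():
            for code in codes:
                lookup[code] = (base, sub)
    return lookup


lookup = build_lookup(recodes_excerpt)
print(lookup["CR1"])                    # ('IR', '1a')
print(lookup["DV1"])                    # ('DI', 'DI')
print(lookup.get("XX9", ("NA", "NA")))  # made-up code, falls back to ('NA', 'NA')
```

Codes that are absent from the dictionary fall through to the default, which makes gaps in the recoding easy to spot.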

In [72]:
recodes = {
    "IR": {
        "1a": ["IR1", "CR1", "IB1", "IW1", "VI5", "IW"],
        "1b": ["IR2", "CR2", "IB2", "IB3", "IW2"],
        "1c": ["IR5"],
        "1d": ["IR3", "IR4", "IH3", "IH4"],
    },
    "FSP": {
        "2a": ["F11", "F12", "B11", "B12", "F1"],
        "2b": [
            "F21",
            "F22",
            "F23",
            "F24",
            "F25",
            "C21",
            "C22",
            "C23",
            "C24",
            "C25",
            "B21",
            "B22",
            "B23",
            "B24",
            "B25",
            "FX",
            "FX1",
            "FX2",
            "FX3",
            "CX",
            "CX1",
            "CX2",
            "CX3",
            "BX1",
            "BX2",
            "BX3",
        ],
        "2c": ["F31", "F32", "F33", "C31", "C32", "C33", "B31", "B32", "B33", "F3"],
        "2d": ["F41", "F42", "F43", "F4"],
    },
    "EB": {
        "3a": ["E11", "E12", "E13", "E14", "E15", "E1"],
        "3b": ["E21", "E22", "E23", "E2"],
        "3c": ["E31", "E32", "E34", "E35", "EW3", "EW4", "EW5", "E3", "EW"],
        "3d": [
            "BC1",
            "BC2",
            "BC3",
            "SD1",
            "SD2",
            "SD3",
            "SE1",
            "SE2",
            "SE3",
            "SF1",
            "SF2",
            "SG1",
            "SG2",
            "SH1",
            "SH2",
            "SJ1",
            "SJ2",
            "SK1",
            "SK2",
            "SK3",
            "SK4",
            "SL1",
            "SN1",
            "SN2",
            "SN3",
            "SN4",
            "SR1",
            "SR2",
            "SR3",
            "BC",
            "E4",
            "SD",
            "SE",
            "SF",
            "SG",
            "SH",
            "SJ",
            "SK",
            "SN",
            "SR",
        ],
        "3e": [
            "C51",
            "C52",
            "C53",
            "T51",
            "T52",
            "T53",
            "R51",
            "R52",
            "R53",
            "I51",
            "I52",
            "I53",
            "C5",
            "T5",
            "R5",
            "I5",
        ],
    },
    "DI": ["DV1", "DV2", "DV3", "DV"],
    "Other": [
        "AM",
        "AM1",
        "AM2",
        "AM3",
        "SC2",
        "SI1",
        "SI2",
        "SI3",
        "SM1",
        "SM2",
        "SM3",
        "SQ1",
        "SQ2",
        "SQ3",
        "SU2",
        "SU3",
        "SU5",
        "SB1",
        "SC",
        "SI",
        "SM",
        "SQ",
        "SU",
    ],
}

**Create a coding lookup based on the `recodes` dictionary above**

Now let's use some code to unpack these different recodings into a table format.

In [73]:
unpack_codes = []
# iterate over the keys in the recode dictionary
for k in recodes:
    next_level = recodes[k]
    # if the value (next_level) is a dictionary then iterate over that as well
    # this means that there is a sub level code such as `1a`
    if isinstance(next_level, dict):
        for sub_k in next_level:
            unpack_codes += [[k, sub_k, val] for val in next_level[sub_k]]
    else:
        # if there are just detail values then we assign the `base_code`
        # as the `sublevel code` as well
        unpack_codes += [[k, k, val] for val in next_level]

coding_map = pd.DataFrame(
    unpack_codes, columns=["base_code", "base_2_code", "detail_code"]
)

Below we see that the information has been unpacked into a table with one row per recode.

The highest level is called the `base_code`, the sub level is called the `base_2_code`, and the original code is called the `detail_code`.

In [81]:
coding_map

,base_code,base_2_code,detail_code
0,IR,1a,IR1
1,IR,1a,CR1
2,IR,1a,IB1
3,IR,1a,IW1
4,IR,1a,VI5
5,IR,1a,IW
6,IR,1b,IR2
7,IR,1b,CR2
8,IR,1b,IB2
9,IR,1b,IB3


**Assign simplified codes to the dataframe**

We can merge the visa issuance data to the coding map to create different summaries

**Using the FSC data**

In [82]:
summary_data = coding_map.merge(
    fsc_data_2019_2021, left_on="detail_code", right_on="Visa Class", how="right"
)

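summary_data.base_code = summary_data.base_code.fillna("NA")
# Also fill the sub-level code so unmapped rows are not dropped by a later groupby on it
summary_data.base_2_code = summary_data.base_2_code.fillna("NA")
summary_data.detail_code = summary_data.detail_code.fillna("NA")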
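Because the merge uses `how="right"`, every issuance row is kept even when its visa class has no entry in the coding map. One way to list such unmapped classes is the `indicator` option of `merge`, sketched below on tiny made-up frames; `ZZ9` is an invented code and the counts are not real data.

```python
import pandas as pd

# Tiny stand-ins for coding_map and the issuance data (made-up values)
coding_map_demo = pd.DataFrame(
    {
        "base_code": ["IR", "IR"],
        "base_2_code": ["1a", "1b"],
        "detail_code": ["IR1", "IR2"],
    }
)
issuances_demo = pd.DataFrame(
    {"Visa Class": ["IR1", "IR2", "ZZ9"], "Issuances": [5, 3, 2]}
)

merged = coding_map_demo.merge(
    issuances_demo,
    left_on="detail_code",
    right_on="Visa Class",
    how="right",
    indicator=True,  # adds a "_merge" column describing where each row came from
)

# Rows present only in the issuance data had no match in the coding map
unmapped = merged.loc[merged["_merge"] == "right_only", "Visa Class"].unique().tolist()
print(unmapped)  # ['ZZ9']
```

Reviewing this list after each new extract is a cheap way to notice visa classes that still need a place in the recode dictionary.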

In [83]:
fsc_data_2019_2021.shape

(11750, 5)

In [80]:
summary_data

,base_code,base_2_code,detail_code,FSC,Visa Class,Issuances,source,year_month
0,IR,1a,CR1,Afghanistan,CR1,2,2021-January,2021-01-01
1,IR,1a,IR1,Afghanistan,IR1,23,2021-January,2021-01-01
2,IR,1b,IR2,Afghanistan,IR2,2,2021-January,2021-01-01
3,Other,Other,SB1,Afghanistan,SB1,3,2021-January,2021-01-01
4,Other,Other,SI1,Afghanistan,SI1,1,2021-January,2021-01-01
...,...,...,...,...,...,...,...,...
11745,IR,1a,IB1,Zimbabwe,IB1,1,2021-September,2021-09-01
11746,IR,1a,IR1,Zimbabwe,IR1,5,2021-September,2021-09-01
11747,IR,1b,IR2,Zimbabwe,IR2,1,2021-September,2021-09-01
11748,IR,1c,IR5,Zimbabwe,IR5,15,2021-September,2021-09-01


**Create a pivot table of simplified visa classes over time - using least granular coding**

We'll first summarize with the base code. After running the cell below, you can see the most general visa class coding along with sums by year and month.

In [85]:
base_code_summary_long = (
    summary_data.groupby(["base_code", "year_month"]).Issuances.sum().reset_index()
)
print(base_code_summary_long.head())

base_code_summary_long.pivot(
    index="base_code", columns="year_month", values="Issuances"
)

  base_code year_month  Issuances
0        DI 2021-01-01          1
1        DI 2021-02-01          1
2        DI 2021-03-01         20
3        DI 2021-04-01        534
4        DI 2021-05-01        965


base_code,2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01
DI,1.0,1.0,20.0,534.0,965.0,1552.0,1824.0,4155.0,8770.0
EB,1406.0,1334.0,2624.0,1692.0,2105.0,3020.0,2226.0,1923.0,1970.0
FSP,296.0,296.0,2336.0,6317.0,6242.0,8566.0,10601.0,12039.0,13124.0
IR,9689.0,11252.0,15720.0,14937.0,15250.0,21602.0,21569.0,22239.0,21076.0
NA,,19.0,294.0,678.0,765.0,898.0,1152.0,2292.0,1911.0
Other,488.0,300.0,676.0,683.0,877.0,2294.0,2633.0,2843.0,603.0


**Same as above but using the second level of coding as well**

In [86]:
base_code_summary_long = (
    summary_data.groupby(["base_code", "base_2_code", "year_month"])
    .Issuances.sum()
    .reset_index()
)
print(base_code_summary_long.head())

base_code_summary_long_pivot = base_code_summary_long.pivot(
    index=["base_code", "base_2_code"], columns="year_month", values="Issuances"
)

  base_code base_2_code year_month  Issuances
0        DI          DI 2021-01-01          1
1        DI          DI 2021-02-01          1
2        DI          DI 2021-03-01         20
3        DI          DI 2021-04-01        534
4        DI          DI 2021-05-01        965


base_code,base_2_code,2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01
DI,DI,1,1,20,534,965,1552,1824,4155,8770
EB,3a,6,6,28,85,149,206,251,271,243
EB,3b,12,23,84,176,133,246,213,262,293
EB,3c,1320,1020,1896,1052,937,1428,1147,989,1113
EB,3d,36,127,235,257,339,597,457,359,297
EB,3e,32,158,381,122,547,543,158,42,24
FSP,2a,20,36,275,584,829,932,1107,1313,1565
FSP,2b,155,101,1267,3650,2762,3916,4078,5378,5984
FSP,2c,46,46,350,848,961,1036,1156,1285,1582
FSP,2d,75,113,444,1235,1690,2682,4260,4063,3993


These summaries can then be exported to CSV or Excel using the dataframe's `to_csv()` or `to_excel()` methods and used in additional analysis.

In [None]:
base_code_summary_long_pivot.to_csv("../data/misc/state_dept_base_code_long_pivot.csv")

# End