In [None]:
%load_ext nb_black

# CBP Encounters Scraping 

## Purpose 

This notebook "scrapes" (extracts) all data from the Tableau dashboards on the [CBP Southwest Land Border Encounters](https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters) page. CBP does not provide this data in a spreadsheet format, nor does it enable download of the data through the embedded Tableau chart. This code was therefore developed to pull down all the data, including every data point that exists for every possible combination of filters.

**Please note:** When using the data output by this notebook for analysis or exploration, you will likely need to filter the data based on the available columns. The columns are based on the filter options in the Tableau dashboards.

## Approach

We use the Python programming language along with some Python libraries that simplify the process of extracting data from the dashboard.

The [TableauScraper](https://github.com/bertrandmartel/tableau-scraping) library (an open source project) provides the primary functionality for extracting data from Tableau.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
%cd drive/Shareddrives/Data\ Products\ Team/Products/Immigration\ Data\ Hub/DataRepo/data-repo-mvp

/content/drive/Shareddrives/Data Products Team/Products/Immigration Data Hub/DataRepo/data-repo-mvp


## The Code

In [4]:
!pip install TableauScraper

Collecting TableauScraper
  Downloading TableauScraper-0.1.29-py3-none-any.whl (22 kB)
Installing collected packages: TableauScraper
Successfully installed TableauScraper-0.1.29


In [5]:
# Import helper python libraries
import itertools
import logging
import logging.config
import pprint
import time

import pandas as pd
from tableauscraper import TableauScraper as TS

pp = pprint.PrettyPrinter(indent=4)

# Disable any loggers that already exist so library logging stays quiet
logging.config.dictConfig(
    {
        "version": 1,
        "disable_existing_loggers": True,
    }
)


### First We Grab the CBP Tableau Data

The CBP dashboard URLs are defined below. TODO: add a [link] explaining how to find the URL.

TODO: Maybe see if there is an automated way to grab this.

In [6]:
dashboard1_url = (
    "https://publicstats.cbp.gov/t/PublicFacing/views/"
    "CBPSBOEnforcementActionsDashboardsAUGFY21/"
    "SBOEncounters8076?:isGuestRedirectFromVizportal=y&:embed=y"
)

dashboard2_url =  (
    "https://publicstats.cbp.gov/t/PublicFacing/views/"
    "CBPSBOEnforcementActionsDashboardsAUGFY21/"
    "SBObyMonthDemo8076?:isGuestRedirectFromVizportal=y&:embed=y"
)




Here we activate (instantiate is the technical term) the TableauScraper library and then load data from the dashboard url. 

**Dashboard to load**

In [7]:
# The first url will get the first dashboard and the second the second.
current_dashboard = dashboard2_url

In [8]:
# Create a tableau scraper object
ts = TS()

# We then pass the url to that object and it grabs data from the CBP dashboard
ts.loads(current_dashboard)

The tableau dashboard has the filters linked to a specific visualization. On the CBP page we have a line chart, table and a bar chart in the first embedded dashboard. We want to extract the table or line data as they contain the same information. We need to find the visualization that has the filters that will update all three charts. 

TODO - Explain more about filters, worksheets and workbooks ...

In [22]:

def find_filters_worksheet(ts):
    """
    Search the dashboard for the worksheet that manages the filters
    available on the dashboard.

    Returns the name of the last worksheet that has filters (or None)
    and the list of all worksheet names.
    """
    workbook = ts.getWorkbook()
    filters_ws = None
    wb_names = []
    for t in workbook.worksheets:
        filters = t.getFilters()
        if len(filters) > 0:
            print("-"*90)
            print(
                f"Filters Presesnt on worksheet name --> {t.name}"
            )  # show worksheet name
            pp.pprint(filters)  # show the filters defined on this worksheet
            filters_ws = t.name
            print("-"*90)
        else:
            print("\nno filters", t.name)

        wb_names.append(t.name)
    return filters_ws, wb_names



In [23]:
filters_ws, wb_names = find_filters_worksheet(ts)

------------------------------------------------------------------------------------------
Filters Presesnt on worksheet name --> All MoM Change Podium
[   {   'column': 'Citizenship Grouping',
        'globalFieldName': '[federated.1xhccc00nlacbx14ajs101w1uee1].[none:Citizenship '
                           'Grouping:nk]',
        'ordinal': 0,
        'selection': [   'El Salvador',
                         'Guatemala',
                         'Honduras',
                         'Mexico',
                         'Other',
                         'all'],
        'selectionAlt': [   {   'columnFullNames': ['[Citizenship Grouping]'],
                                'domainTables': [   {   'isSelected': True,
                                                        'label': 'El '
                                                                 'Salvador'}],
                                'fn': '[federated.1xhccc00nlacbx14ajs101w1uee1].[none:Citizenship '
                              

Above we can see which worksheet has the filters. If we are using the first URL, the filters are on the `SBO Line Graph` worksheet; if using the second URL, the filters are on the `All MoM Change Podium` worksheet.

The `filters_ws` variable holds the name of the worksheet that has the filters.

In [24]:
print(filters_ws)

All MoM Change Podium


**NOTE**: Based on the names above you can select which element of the dashboard you want to extract data from. 

In [25]:
data_element_target1 = "SBO Table"
data_element_target2 = "Demo FYTD by Month (2)"

### Next, we build out all possible combinations of filters 

In [26]:

def unpack_filter_information(ts, filters_ws, skip_filter=()):

    # Ask for the worksheet that holds the filters
    ws = ts.getWorksheet(filters_ws)

    # Get the different filter names and their values
    filter_master_list = ws.getFilters()
    filter_possible_values_list = {i["column"]: i["values"] for i in filter_master_list}

    # We hold this data in a dictionary and where the filter names are keys
    # and the values are lists of possible values
    pp.pprint(filter_possible_values_list)

    # Grab just the column names
    all_filter_columns = [i["column"] for i in filter_master_list]
    # Drop any columns we were asked to skip (for example the fiscal year,
    # because the chart automatically includes all fiscal years)
    for col in skip_filter:
        try:
            all_filter_columns.remove(col)
        except ValueError:
            print(f"{col} not present")

    # This is somewhat complicated, but it creates an exhaustive list of all
    # possible filter combinations. It does not control for combinations that
    # don't exist, meaning that some filter combinations may return no data.
    all_f_values = []
    for f in filter_master_list:
        if f["column"] not in skip_filter:
            vals = f["values"]
            all_f_values.append([None] + vals)
    all_filter_combinations = list(itertools.product(*all_f_values))

    filter_data = {
        "master_list": filter_master_list,
        "filter_columns": all_filter_columns,
        "filter_combinations": all_filter_combinations,
    }
    print("\nFilter on:")
    pp.pprint(filter_data['filter_columns'])
    return filter_data


**Run this cell if you are extracting data from the first dashboard**

In [None]:
# Dashboard 1- Note we skip fiscal year because it is broken out in the chart already
filter_data = unpack_filter_information(
    ts,
    filters_ws,
    skip_filter=["Fiscal Year"],
)


{   'Citizenship Grouping': [   'El Salvador',
                                'Guatemala',
                                'Honduras',
                                'Mexico',
                                'Other'],
    'Component': ['Office of Field Operations', 'U.S. Border Patrol'],
    'Demographic': [   'Accompanied Minors',
                       'FMUA',
                       'Single Adults',
                       'UC / Single Minors'],
    'Fiscal Year': ['2018', '2019', '2020', '2021 (FYTD)'],
    'Title of Authority': ['Title 8', 'Title 42']}

Filter on:
['Citizenship Grouping', 'Component', 'Demographic', 'Title of Authority']


**Run this cell if you are extracting data from the second dashboard**

In [27]:

# Dashboard 2 - note we skip demographics because they are broken out in the chart already
filter_data = unpack_filter_information(
    ts,
    filters_ws,
    skip_filter=["Demographic"],
)


{   'Citizenship Grouping': [   'El Salvador',
                                'Guatemala',
                                'Honduras',
                                'Mexico',
                                'Other'],
    'Demographic': [   'Accompanied Minors',
                       'FMUA',
                       'Single Adults',
                       'UC / Single Minors'],
    'Title of Authority': ['Title 8', 'Title 42']}

Filter on:
['Citizenship Grouping', 'Title of Authority']


In [28]:
# See the various combinations
pp.pprint(filter_data["filter_combinations"])

[   (None, None),
    (None, 'Title 8'),
    (None, 'Title 42'),
    ('El Salvador', None),
    ('El Salvador', 'Title 8'),
    ('El Salvador', 'Title 42'),
    ('Guatemala', None),
    ('Guatemala', 'Title 8'),
    ('Guatemala', 'Title 42'),
    ('Honduras', None),
    ('Honduras', 'Title 8'),
    ('Honduras', 'Title 42'),
    ('Mexico', None),
    ('Mexico', 'Title 8'),
    ('Mexico', 'Title 42'),
    ('Other', None),
    ('Other', 'Title 8'),
    ('Other', 'Title 42')]


### Now let's pull down the data

**Data Extraction Function**

We will create a function to pull down the data and parameterize some of the arguments.

In [29]:
def get_dashboard_data(
    url, all_filter_columns, all_filter_combinations, filter_worksheet, data_worksheet
):
    failed_combination = []
    frames = []  # one labeled dataframe per successful filter combination
    for filter_combination in all_filter_combinations:
        print("Attempting Fitler Combination", filter_combination)
        # Reload the dashboard so every combination starts from a clean state
        ts = TS()
        ts.loads(url)
        workbook = ts  # in case all filters are null
        worksheet = ts.getWorksheet(filter_worksheet)
        try:
            for idx, col in enumerate(all_filter_columns):
                # None means we do not apply any option for this dropdown filter
                if filter_combination[idx] is None:
                    continue
                # apply the individual filter and continue iterating
                worksheet = workbook.getWorksheet(filter_worksheet)
                workbook = worksheet.setFilter(
                    col, filter_combination[idx], filterDelta=True
                )
            subset_worksheet = workbook.getWorksheet(data_worksheet)
            subset_data = subset_worksheet.data
            if len(subset_data) > 0:  # Only do this if we have data
                # Label the data with the filters that were applied
                for col, val in zip(all_filter_columns, filter_combination):
                    if val is None:
                        val = "all"
                    subset_data.loc[:, col] = val

                # keep the data for our master dataframe
                frames.append(subset_data)
            else:
                print(f"WARNING No Length on {filter_combination}")
                failed_combination.append(filter_combination)
        except Exception as e:
            print(f"WARNING on {filter_combination} \n {e}")
            failed_combination.append(filter_combination)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    tableau_dataframe = pd.concat(frames) if frames else pd.DataFrame()
    return tableau_dataframe, failed_combination

**Dashboard 1**

Now we will run it ... This may take about 20 minutes

In [None]:
dataset, failed_combination = get_dashboard_data(
    current_dashboard,
    filter_data["filter_columns"],
    filter_data["filter_combinations"],
    filters_ws,
    data_element_target1,
)

Attempting Fitler Combination (None, None, None, None)
Attempting Fitler Combination (None, None, None, 'Title 8')
Attempting Fitler Combination (None, None, None, 'Title 42')
Attempting Fitler Combination (None, None, 'Accompanied Minors', None)
Attempting Fitler Combination (None, None, 'Accompanied Minors', 'Title 8')
Attempting Fitler Combination (None, None, 'Accompanied Minors', 'Title 42')
Attempting Fitler Combination (None, None, 'FMUA', None)
Attempting Fitler Combination (None, None, 'FMUA', 'Title 8')
Attempting Fitler Combination (None, None, 'FMUA', 'Title 42')
Attempting Fitler Combination (None, None, 'Single Adults', None)
Attempting Fitler Combination (None, None, 'Single Adults', 'Title 8')
Attempting Fitler Combination (None, None, 'Single Adults', 'Title 42')
Attempting Fitler Combination (None, None, 'UC / Single Minors', None)
Attempting Fitler Combination (None, None, 'UC / Single Minors', 'Title 8')
Attempting Fitler Combination (None, None, 'UC / Single Minors

Attempting Fitler Combination ('El Salvador', 'U.S. Border Patrol', 'Single Adults', 'Title 42')
Attempting Fitler Combination ('El Salvador', 'U.S. Border Patrol', 'UC / Single Minors', None)
Attempting Fitler Combination ('El Salvador', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 8')
Attempting Fitler Combination ('El Salvador', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 42')
Attempting Fitler Combination ('Guatemala', None, None, None)
Attempting Fitler Combination ('Guatemala', None, None, 'Title 8')
Attempting Fitler Combination ('Guatemala', None, None, 'Title 42')
Attempting Fitler Combination ('Guatemala', None, 'Accompanied Minors', None)
Attempting Fitler Combination ('Guatemala', None, 'Accompanied Minors', 'Title 8')
Attempting Fitler Combination ('Guatemala', None, 'Accompanied Minors', 'Title 42')
Attempting Fitler Combination ('Guatemala', None, 'FMUA', None)
Attempting Fitler Combination ('Guatemala', None, 'FMUA', 'Title 8')
Attempting Fitler Combination

 'NoneType' object is not subscriptable
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'FMUA', None)
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'FMUA', 'Title 8')
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'FMUA', 'Title 42')
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'Single Adults', None)
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'Single Adults', 'Title 8')
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'Single Adults', 'Title 42')
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'UC / Single Minors', None)
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 8')
Attempting Fitler Combination ('Honduras', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 42')
Attempting Fitler Combination ('Mexico', None, None, None)
Attempting Fitler Combination ('Mexico', None, None, 'Title 8')
Attempting Fitler Combin

 'NoneType' object is not subscriptable
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'Accompanied Minors', 'Title 42')
 'NoneType' object is not subscriptable
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'FMUA', None)
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'FMUA', 'Title 8')
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'FMUA', 'Title 42')
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'Single Adults', None)
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'Single Adults', 'Title 8')
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'Single Adults', 'Title 42')
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'UC / Single Minors', None)
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 8')
Attempting Fitler Combination ('Other', 'U.S. Border Patrol', 'UC / Single Minors', 'Title 42')


**Dashboard 2**: Takes about 5 min

In [30]:
dataset, failed_combination = get_dashboard_data(
    current_dashboard,
    filter_data["filter_columns"],
    filter_data["filter_combinations"],
    filters_ws,
    data_element_target2,
)

Attempting Fitler Combination (None, None)
Attempting Fitler Combination (None, 'Title 8')
Attempting Fitler Combination (None, 'Title 42')
Attempting Fitler Combination ('El Salvador', None)
Attempting Fitler Combination ('El Salvador', 'Title 8')
Attempting Fitler Combination ('El Salvador', 'Title 42')
Attempting Fitler Combination ('Guatemala', None)
Attempting Fitler Combination ('Guatemala', 'Title 8')
Attempting Fitler Combination ('Guatemala', 'Title 42')
Attempting Fitler Combination ('Honduras', None)
Attempting Fitler Combination ('Honduras', 'Title 8')
Attempting Fitler Combination ('Honduras', 'Title 42')
Attempting Fitler Combination ('Mexico', None)
Attempting Fitler Combination ('Mexico', 'Title 8')
Attempting Fitler Combination ('Mexico', 'Title 42')
Attempting Fitler Combination ('Other', None)
Attempting Fitler Combination ('Other', 'Title 8')
Attempting Fitler Combination ('Other', 'Title 42')


**Check if anything failed**

In [31]:
print(len(failed_combination))
pp.pprint(failed_combination)

0
[]


Note: if there are failures, some or all of them may be because the combination results in no data. You can verify this manually by attempting these combinations with the actual Tableau dashboard.
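
Failures caused by a transient error, rather than by a combination that truly has no data, might succeed on a second attempt. The following is a minimal sketch of one way to retry the entries in `failed_combination`. The helper `retry_failed` and its `fetch` argument are hypothetical and are not part of this notebook or of TableauScraper; `fetch` stands in for whatever function pulls the data for a single combination. The example uses a fake `fetch` so it runs without network access.

```python
import time


def retry_failed(failed, fetch, attempts=3, wait_seconds=1.0):
    """Retry each failed combination; return (recovered, still_failed).

    `fetch` is a hypothetical callable that takes one combination and
    returns a (possibly empty) sequence or DataFrame, or raises on error.
    """
    recovered, still_failed = {}, []
    for combo in failed:
        for _ in range(attempts):
            try:
                result = fetch(combo)
            except Exception:
                result = None
            if result is not None and len(result) > 0:
                recovered[combo] = result
                break
            time.sleep(wait_seconds)  # pause before the next attempt
        else:
            # every attempt failed or came back empty
            still_failed.append(combo)
    return recovered, still_failed


# Fake fetch for demonstration: one combination errors once and then works,
# the other always comes back empty (as a no-data combination would).
calls = {}


def fake_fetch(combo):
    calls[combo] = calls.get(combo, 0) + 1
    if combo == ("Other", "Title 8"):
        return []
    if calls[combo] < 2:
        raise RuntimeError("transient error")
    return [1, 2, 3]


recovered, still_failed = retry_failed(
    [("Mexico", None), ("Other", "Title 8")], fake_fetch, wait_seconds=0
)
print(recovered)
print(still_failed)
```

Combinations that remain in `still_failed` after several attempts are the ones worth checking by hand against the dashboard.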

## Review Data

In [None]:
dataset.to_csv(
    "../data/extracted_data/cbp-tableau/cbp-encounters-dashboard-1-nov-2021.csv"
)

In [None]:
dataset.to_csv(
    "../data/extracted_data/cbp-tableau/cbp-encounters-dashboard-2-nov-2021.csv"
)

In [32]:
dataset

Unnamed: 0,Component-value,Component-alias,Demographic-value,Demographic-alias,Month (abbv)-value,Month (abbv)-alias,SUM(Encounter Count)-alias,ATTR(Demographic (copy))-alias,Citizenship Grouping,Title of Authority
0,Office of Field Operations,Office of Field Operations,%all%,%all%,%all%,%all%,68996,%many-values%,all,all
1,Office of Field Operations,Office of Field Operations,%all%,%all%,AUG,AUG,13329,%many-values%,all,all
2,Office of Field Operations,Office of Field Operations,%all%,%all%,JUL,JUL,12935,%many-values%,all,all
3,Office of Field Operations,Office of Field Operations,%all%,%all%,JUN,JUN,10385,%many-values%,all,all
4,Office of Field Operations,Office of Field Operations,%all%,%all%,MAY,MAY,7943,%many-values%,all,all
...,...,...,...,...,...,...,...,...,...,...
84,%all%,%all%,%all%,%all%,FEB,FEB,4673,%many-values%,Other,Title 42
85,%all%,%all%,%all%,%all%,JAN,JAN,5634,%many-values%,Other,Title 42
86,%all%,%all%,%all%,%all%,DEC,DEC,4763,%many-values%,Other,Title 42
87,%all%,%all%,%all%,%all%,NOV,NOV,3492,%many-values%,Other,Title 42


In [33]:
dataset[
    (dataset["Citizenship Grouping"] == "El Salvador")
    & (dataset["Title of Authority"] == "all")
    & (dataset["Component-value"] == "Office of Field Operations")
]

Unnamed: 0,Component-value,Component-alias,Demographic-value,Demographic-alias,Month (abbv)-value,Month (abbv)-alias,SUM(Encounter Count)-alias,ATTR(Demographic (copy))-alias,Citizenship Grouping,Title of Authority
0,Office of Field Operations,Office of Field Operations,%all%,%all%,%all%,%all%,2665,%many-values%,El Salvador,all
1,Office of Field Operations,Office of Field Operations,%all%,%all%,AUG,AUG,718,%many-values%,El Salvador,all
2,Office of Field Operations,Office of Field Operations,%all%,%all%,JUL,JUL,562,%many-values%,El Salvador,all
3,Office of Field Operations,Office of Field Operations,%all%,%all%,JUN,JUN,527,%many-values%,El Salvador,all
4,Office of Field Operations,Office of Field Operations,%all%,%all%,MAY,MAY,411,%many-values%,El Salvador,all
5,Office of Field Operations,Office of Field Operations,%all%,%all%,APR,APR,200,%many-values%,El Salvador,all
6,Office of Field Operations,Office of Field Operations,%all%,%all%,MAR,MAR,52,%many-values%,El Salvador,all
7,Office of Field Operations,Office of Field Operations,%all%,%all%,FEB,FEB,37,%many-values%,El Salvador,all
8,Office of Field Operations,Office of Field Operations,%all%,%all%,JAN,JAN,47,%many-values%,El Salvador,all
9,Office of Field Operations,Office of Field Operations,%all%,%all%,DEC,DEC,39,%many-values%,El Salvador,all


# End