Skip to content

chapmanjacobd/us_visa_statistics

Repository files navigation

us_visa_statistics

Monthly Immigrant and Nonimmigrant Visa Issuances Data extracted from multi-page PDF tables

Depending on what you are looking for the Office of Immigration Statistics data might be more useful. However, this is still a good example of how to parse PDF table data in an automated way.

Graphs

Code setup
require(data.table)
require(ggplot2)

D = data.table::fread('immigrant_data.csv')
D = D[country!='Other']
D[country == 'Bahamas', country := 'Bahamas, The']
D[country == 'Bosnia-Herzegovina', country := 'Bosnia and Herzegovina']
D[country == 'Burkina-Faso', country := 'Burkina Faso']
D[country == 'China - Mainland born', country := 'China - mainland born']
D[country == 'China – mainland born', country := 'China - mainland born']
D[country == 'China-Mainland born', country := 'China - mainland born']
D[country == 'China', country := 'China - mainland born']
D[country == 'China-Taiwan born', country := 'China - Taiwan born']
D[country == 'Taiwan', country := 'China - Taiwan born']
D[country == 'Hong Kong S.A.R', country := 'Hong Kong S.A.R.']
D[country == 'Hong Kong-BNO', country := 'Hong Kong S.A.R.']
D[country == 'Cocos (Keeling) Islands', country := 'Cocos Islands']
D[country == 'Czec Republic', country := 'Czech Republic']
D[country == 'Eswatini', country := 'eSwatini']
D[country == 'Swaziland', country := 'eSwatini']
D[country == 'Eswatini*', country := 'eSwatini']
D[country == 'Kyrgystan', country := 'Kyrgyzstan']
D[country == 'North Korea', country := 'Korea, North']
D[country == 'South Korea', country := 'Korea, South']
D[country == 'Northern Ireland (DV only)', country := 'Great Britain and Northern Ireland']
D[country == 'Saint Maarten', country := 'Sint Maarten']

library(grDevices)

extract_nth_wraparound <- function(x, n) {
  index <- (seq_along(x) - 1) %% n + 1
  ordered_indices <- order(index)
  return(x[ordered_indices])
}

create_divergent_palette <- function(factor_levels, pal="Zissou 1", repeat_n=8) {
  num_levels <- length(factor_levels)
  palette <- hcl.colors(num_levels, pal)
  palette <- extract_nth_wraparound(palette, num_levels / repeat_n)
  color_mapping <- setNames(palette, factor_levels)

  return(color_mapping)
}

# require('scales')
# show_col(create_divergent_palette(type_factors, pal="Zissou 1", repeat_n=10))

Quantity of visas issued by visa type over time

Bar chart: Visa Type over Time, showing COVID-19 Pandemic visa issuance impact

I think this shows COVID-19 Pandemic impact on visa issuance pretty well

type_counts <- aggregate(count ~ visa_type, data = D, FUN = sum)
type_factors = type_counts$visa_type[order(type_counts$count, decreasing = TRUE)]
color_palette = create_divergent_palette(type_factors)
D$visa_type <- factor(D$visa_type, levels = type_factors)

ggplot(D) +
  aes(x = date, fill = visa_type, weight = count) + geom_bar() +
  scale_fill_manual(values = create_divergent_palette(type_factors, pal="Zissou 1", repeat_n=7))

Visa Type over Time

Tile chart: Visa Type over Time, showing deprecation of certain visa types

I think most of these abrupt start and stops are actually due to differences in reporting after 2020. For example: DV1,DV2,DV3 become DV in later PDFs

ggplot(D) +
  aes(x = date, y = visa_type) + geom_tile()

Facets of visas issued over time by type

Bar chart for each visa type

p = ggplot(D[count > 1]) +
  aes(x = date, weight = count) + geom_bar(fill = "#000") +
  theme_minimal() + theme(strip.text.x = element_text(size = 5), axis.text.y = element_text(size = 5), axis.text.x = element_blank()) +
  facet_wrap(vars(visa_type), scales = "free_y")

ggsave(plot=p, filename = "images/visa_types.png", width = 4000, height = 2500, units='px')

Facets of visas issued over time by Foreign Service Center (FSC)

Bar chart for each country

I'm just guessing with that acronym--can't find it documented anywhere but Foreign Service Officer (FSO) is a pretty well-known acronym so FSOs must work inside of FSCs or something like that...

p = ggplot(D[count > 1]) +
  aes(x = date, weight = count) + geom_bar(fill = "#000") +
  theme_minimal() + theme(strip.text.x = element_text(size = 5), axis.text.y = element_text(size = 5), axis.text.x = element_blank()) +
  facet_wrap(vars(country), scales = "free_y")

ggsave(plot=p, filename = "images/countries.png", width = 4000, height = 2500, units='px')

Notes

If you are using this data you should be aware that there are some data quality issues. Some of those issues have been identified by others here.

Visa Symbol key

The Visa Office has changed its methodology for calculating visa data beginning with the Fiscal Year (FY) 2019 annual Report of the Visa Office and continuing with FY 2020 data, to reflect the greater access to application-level data attained during FY 2019. Our previous methodology was based on a count of workload actions, which were not linked by application. The new methodology more accurately reflects final outcomes from the visa application process during a specified reporting period. The new methodology follows visa applications, including updates to their status (i.e., issued or refused), which could change as the fiscal year progresses, or result in slight changes in data for earlier years. Therefore, beginning with FY 2020, individual monthly issuance reports should not be aggregated, as this will not provide an accurate issuance total for the fiscal year to date. Instead, refer to our annual Report of the Visa Office for final full Fiscal Year statistics. While the new methodology is more accurate, it does not mean that our prior methodology was flawed. The two are simply not comparable. However, based on our analysis, the discrepancies between the methodologies are minor. For example, the difference between reported issuances of NIVs and IVs in FY 2018 (legacy methodology) and those in FY 2019 (new methodology) is less than one percent worldwide. U.S. Department of State

immigrant_data.csv

Shape

(153458, 4)

Sample of rows

country visa_type count date
0 Afghanistan CR1 11 2017-03-01
1 Afghanistan DV1 2 2017-03-01
2 Afghanistan DV2 1 2017-03-01
153455 Zimbabwe IR1 1 2024-03-01
153456 Zimbabwe IR2 3 2024-03-01
153457 Zimbabwe IR5 4 2024-03-01

Summary statistics

count
count 153458
mean 21.2988
std 92.6385
min 1
25% 1
50% 3
75% 11
max 5009

Pandas columns with 'converted' dtypes

column original_dtype converted_dtype
country object string
visa_type object string
count int64 Int64
date object string

Numerical columns

Bins

count
(-4.008, 835.667] 153121
(835.667, 1670.333] 239
(1670.333, 2505.0] 73
(2505.0, 3339.667] 18
(3339.667, 4174.333] 5
(4174.333, 5009.0] 2

Categorical columns

common values of country column

Count Percentage
India 2384 1.55352
China - mainland born 2332 1.51963
Philippines 2191 1.42775
Vietnam 2114 1.37758
Mexico 2088 1.36063
Korea, South 2054 1.33848
Pakistan 1960 1.27722
Brazil 1885 1.22835
Venezuela 1874 1.22118
Colombia 1851 1.20619
Nigeria 1814 1.18208
China - Taiwan born 1804 1.17557
Ukraine 1792 1.16775
Russia 1781 1.16058
Jamaica 1769 1.15276
Egypt 1694 1.10389
Ecuador 1683 1.09672
Dominican Republic 1651 1.07586
El Salvador 1649 1.07456
Great Britain and Northern Ireland 1640 1.0687
Iran 1605 1.04589
Bangladesh 1560 1.01656
Peru 1552 1.01135
Ghana 1541 1.00418
Jordan 1527 0.995061
Honduras 1519 0.989847
Nepal 1516 0.987892
Guatemala 1514 0.986589
Haiti 1496 0.97486
Turkey 1455 0.948142

common values of visa_type column

Count Percentage
IR1 12096 7.88229
CR1 10912 7.11074
IR5 10679 6.95891
IR2 10381 6.76472
SB1 4133 2.69325
FX 3904 2.54402
CR2 3892 2.5362
DV1 3730 2.43063
FX1 3675 2.39479
F11 3519 2.29314
FX2 3321 2.16411
DV 3311 2.15759
F41 3172 2.06702
F1 3154 2.05529
F4 3053 1.98947
DV2 2959 1.92821
F43 2946 1.91974
DV3 2687 1.75097
F24 2661 1.73402
F42 2606 1.69818
E3 2500 1.62911
F3 2409 1.56981
F2B 2341 1.5255
F31 2325 1.51507
F33 2298 1.49748
F32 2228 1.45186
F12 2189 1.42645
FX3 1925 1.25441
F21 1923 1.25311
E2 1764 1.1495

common values of date column

Count Percentage
2017-10-01 2687 1.75097
2018-05-01 2686 1.75032
2018-06-01 2633 1.71578
2018-04-01 2621 1.70796
2019-07-01 2613 1.70275
2018-10-01 2597 1.69232
2019-04-01 2596 1.69167
2019-06-01 2578 1.67994
2018-07-01 2574 1.67733
2019-05-01 2570 1.67473
2019-10-01 2567 1.67277
2018-08-01 2560 1.66821
2017-05-01 2552 1.663
2017-11-01 2547 1.65974
2017-12-01 2538 1.65387
2018-02-01 2538 1.65387
2018-03-01 2536 1.65257
2017-06-01 2514 1.63823
2019-12-01 2502 1.63041
2018-12-01 2490 1.62259
2018-11-01 2484 1.61868
2019-02-01 2466 1.60695
2019-01-01 2461 1.6037
2019-11-01 2452 1.59783
2020-01-01 2448 1.59522
2018-01-01 2441 1.59066
2017-03-01 2438 1.58871
2019-03-01 2406 1.56786
2017-04-01 2396 1.56134
2020-02-01 2355 1.53462

Low cardinality (many similar values)

  • country
  • date
  • visa_type

Missing values

0 nulls/NaNs (0.0% dataset values missing)

About

Monthly Immigrant and Nonimmigrant Visa Issuances Data

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages