---
title: "Easy Visa"
date: 2025-08-05
---

<center><p float="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/>
</p></center>

<center><font size=10>Artificial Intelligence and Machine Learning</font></center>
<center><font size=6>Advanced Machine Learning - Project Debrief</font></center>

<center><img src="https://images.pexels.com/photos/7235894/pexels-photo-7235894.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2" width="800" height="500"></center>

<center><font size=6>Visa Approval Facilitation</font></center>

## Problem Statement

In [4]:
from google.colab import drive
drive.mount('/content/drive')

### Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

### Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a machine-learning-based solution that can help shortlist the candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:

* Facilitate the process of visa approvals.
* Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

### Data Description

The data contains attributes of the employee and the employer. The detailed data dictionary is given below.

* case_id: ID of each visa application
* continent: Continent of the employee
* education_of_employee: Education level of the employee
* has_job_experience: Does the employee have any job experience? Y = Yes; N = No
* requires_job_training: Does the employee require any job training? Y = Yes; N = No
* no_of_employees: Number of employees in the employer's company
* yr_of_estab: Year in which the employer's company was established
* region_of_employment: Foreign worker's intended region of employment in the US
* prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
* unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
* full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
* case_status: Flag indicating if the visa was certified or denied

## Importing necessary libraries

In [71]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 scikit-learn==1.5.2 matplotlib==3.7.1 seaborn==0.13.1 xgboost==2.0.3 -q --user

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-expr 1.1.13 requires pandas>=2, but you have pandas 1.5.3 which is incompatible.


**Note**: *After running the above cell, restart the notebook kernel and run all cells sequentially from the next cell onward.*

In [None]:
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Library to split data and run cross-validation
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries for different ensemble classifiers
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier
)

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

# Libraries to get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# To tune different models
from sklearn.model_selection import RandomizedSearchCV

## Import Dataset

In [None]:
# loading data into a pandas dataframe
easy_visa_data = pd.read_csv("/content/drive/MyDrive/EasyVisa.csv")

# copy the original data set.
data = easy_visa_data.copy()

## Overview of the Dataset

#### View the first and last 5 rows of the dataset

In [None]:
data.head()

In [None]:
data.tail()

#### Understand the shape of the dataset

In [None]:
data.shape

There are 25480 rows and 12 columns

#### Check the data types of the columns for the dataset

In [None]:
data.info()

There are no missing values

In [None]:
data.duplicated().sum()

There are no duplicate rows.

In [None]:
data.isnull().sum()

No NULL values in columns

In [None]:
data.case_id.nunique()

case_id is unique per row and can safely be dropped.

In [None]:
# case_id contains only unique values, so we drop it
data.drop(columns=["case_id"], inplace=True)
data.head()

## Exploratory Data Analysis (EDA)

#### Let's check the statistical summary of the data

In [None]:
data.describe().T

**no_of_employees:**
* Mean: 5,667 employees per company on average.
* Median (50%): 2,109, far below the mean -> positively skewed (long tail to the right).
* Max: 602,069 - extreme outlier (very large company).

**yr_of_estab (Year of Establishment)**
* Mean: 1979; Median: 1997 -> suggests many newer companies, but some very old ones.
* Min: 1800, which is unusually early and may be a data entry error; it needs further investigation.
* Max: 2016 - recent establishment years.
* Most companies are relatively modern.

#### Fixing the negative values in the number of employees column

In [None]:
(data['no_of_employees'] < 0).sum()

In [None]:
data['no_of_employees'] = data['no_of_employees'].abs()
(data['no_of_employees'] < 0).sum()

Correct the negative values by taking the absolute value, then re-check for any remaining negatives.

#### Let's check the count of each unique category in each of the categorical variables

In [None]:
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)

# Printing the count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)

### Univariate Analysis

In [None]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

#### Observations on education of employee

In [None]:
labeled_barplot(data, "education_of_employee", perc=True)

* The majority of employees **(~78%)** have either a **Bachelor's or Master's** degree, making this the dominant education range.
* **Doctorate holders** are the smallest group (8.6%), which may reflect the specialized nature of such roles or fewer applicants.
* **High School education** accounts for 13.4%, well below the Bachelor's and Master's categories.

#### Observations on region of employment

In [None]:
labeled_barplot(data, "region_of_employment", perc=True)

* The Northeast has the highest share of visa applicants, closely followed by the South and West.
* The Island region has a very small applicant base.
* Midwest lags behind in volume, possibly due to fewer employer sponsorships or tech hubs compared to coasts.

#### Observations on job experience

In [None]:
labeled_barplot(data, "has_job_experience", perc=True)

* A majority of applicants (58.1%) have prior job experience, which could positively influence their visa approval chances.
* Nearly 42% are inexperienced, possibly recent graduates or fresh entrants.

#### Observations on case status

In [None]:
labeled_barplot(data, "case_status", perc=True)

* Two-thirds of applications are approved, indicating a favorable environment for most applicants.
* Still, 1 in 3 applications gets denied, suggesting room for improvement in eligibility, documentation, or employer sponsorship strength.

### Bivariate Analysis

**Creating functions that will help us with further analysis.**

In [None]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [None]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # A single legend call: the earlier call was overridden by this one
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [None]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

## find the correlation between the variables
plt.figure(figsize=(10, 5))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

There appears to be no strong linear correlation among the numeric features.

#### Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any impact on visa certification

In [None]:
stacked_barplot(data, "education_of_employee", "case_status")

* The higher the education level, the greater the visa approval rate.
* Applicants with only a High School education face a much higher rejection rate (~65%).
* Doctorate holders have the highest approval rate (~87%), followed closely by Master’s degree applicants.
* This chart clearly shows that education is a strong predictor of visa certification success.

#### Let's similarly check how the visa status varies across different continents.

In [None]:
stacked_barplot(data, "continent", "case_status")

* Europe and Africa show the highest approval rates, making them favorable continents in terms of visa outcomes.
* Asia and Oceania are more middle-ground, with balanced approval/denial rates.
* South America has the lowest certification rate (~58%).

#### Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Let's see if having work experience has any influence over visa certification

In [None]:
stacked_barplot(data, "has_job_experience", "case_status")

* Applicants with job experience are much more likely to be certified (~75%) compared to those without.
* Lack of experience increases denial risk, with nearly half of inexperienced applicants getting denied.

#### Checking if the prevailing wage is similar across all the regions of the US

In [None]:
sns.boxplot(data=data, x="region_of_employment", y="prevailing_wage")

* Midwest and Island regions offer the highest median prevailing wages, despite having fewer applicants.
* West and Northeast, despite being tech hubs, show lower median wages, possibly due to a higher number of entry-level or lower-paying positions.
* All regions show a large number of outliers, indicating presence of high-paying roles (e.g., tech, specialized roles).

#### The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze the data and see if the visa status changes with the prevailing wage

In [None]:
distribution_plot_wrt_target(data, "prevailing_wage", "case_status")

* Most denied cases cluster at lower wages.
* A visible right-skewed distribution — higher wages are rare for denials.
* Certified applications tend to have higher prevailing wages overall.
* The density peaks around $70,000–$90,000, compared to denials peaking closer to $30,000–$50,000.

**With Outliers**
* Certified applications have a higher median wage than denied ones.
* Certified wages also show a broader upper range, with more high-wage outliers.

**Without Outliers**
* Even after removing outliers, Certified cases still have slightly higher median and upper quartile wages, reinforcing the positive relationship between wage and approval likelihood.

**Conclusion:**
* Higher prevailing wage is positively correlated with visa certification.
* Applicants with lower wages are more likely to be denied.
* Wage can be a strong predictive feature for classification models.

#### The prevailing wage has different units (Hourly, Weekly, etc). Let's find out if it has any impact on visa applications getting certified.

In [None]:
stacked_barplot(data, "unit_of_wage", "case_status")

* Yearly wage entries have the highest certification rates — likely linked to full-time, salaried positions.
* Hourly wages show the lowest approval rate (~35%), suggesting that jobs paid by the hour (e.g., part-time or lower-skill roles) may be viewed as less favorable.
* Monthly and weekly wages fall in the middle, with moderate certification rates.
* Unit of wage is a meaningful indicator in visa decision outcomes.
* Yearly wage offers the strongest signal for visa success.
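
Because `prevailing_wage` is quoted in different units, raw values are not directly comparable across rows. One possible way to put them on a common yearly scale is sketched below. The helper name `annualize_wage` and the conversion factors (a 40-hour week and 52 paid weeks per year) are assumptions made for illustration, not part of the original analysis. The keys follow the data dictionary above; the actual values in the column may be spelled differently.

```python
# Assumed conversion factors: a 40-hour week and 52 paid weeks per year.
# These are illustrative assumptions, not values taken from the dataset.
PERIODS_PER_YEAR = {
    "Hourly": 40 * 52,
    "Weekly": 52,
    "Monthly": 12,
    "Yearly": 1,
}


def annualize_wage(wage, unit):
    """Return the yearly equivalent of `wage` quoted per `unit` (hypothetical helper)."""
    try:
        return wage * PERIODS_PER_YEAR[unit]
    except KeyError:
        raise ValueError(f"Unknown unit_of_wage: {unit!r}") from None


print(annualize_wage(25.0, "Hourly"))    # 25 * 2080
print(annualize_wage(4000.0, "Monthly"))  # 4000 * 12
```

Applied row by row, a function like this would give a single comparable wage column, which could make the wage comparison across `case_status` easier to interpret.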

## Data Pre-processing

### Outlier Check

In [None]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

* All three features contain significant outliers.
* Consider using robust scaling, log transformations, or capping/flooring outliers for preprocessing.
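
The capping and log-transform options mentioned above can be sketched as follows. This is a minimal illustration on made-up numbers, assuming the same 1.5 x IQR whisker rule used in the boxplots; the function name `cap_outliers_iqr` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def cap_outliers_iqr(series, whis=1.5):
    """Clip values outside [Q1 - whis*IQR, Q3 + whis*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - whis * iqr, upper=q3 + whis * iqr)


# Hypothetical toy values standing in for a skewed column such as no_of_employees
toy = pd.Series([10, 12, 15, 18, 20, 22, 25, 5000], name="no_of_employees")

capped = cap_outliers_iqr(toy)  # the extreme value is pulled down to the upper whisker
logged = np.log1p(toy)          # log(1 + x) compresses the long right tail

print(capped.max(), round(logged.max(), 2))
```

Tree-based models split on thresholds, so they are generally less sensitive to extreme values than distance-based models; whether to treat the outliers at all is a judgment call.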


### Data Preparation for modeling

In [None]:
data["case_status"] = data["case_status"].apply(lambda x: 1 if x == "Certified" else 0)

X = data.drop("case_status", axis=1)
y = data["case_status"]

X = pd.get_dummies(X)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_val, y_val, test_size=0.1, random_state=1, stratify=y_val
)


In [None]:
print("Shape of Training set : ", X_train.shape)
print("Shape of Validation set : ", X_val.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
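
The two-stage split above does not produce equal validation and test sets: the second call takes 10% of the held-out 30%, not 10% of the full data. A quick arithmetic check, assuming the 25480 rows reported earlier and assuming that the held-out size is rounded up (which is how `train_test_split` is understood to behave):

```python
import math

n_rows = 25480  # row count reported by data.shape above

# First split: test_size=0.3 holds out 30% of the rows
n_holdout = math.ceil(0.3 * n_rows)
n_train = n_rows - n_holdout

# Second split: test_size=0.1 takes 10% of the held-out rows as the test set
n_test = math.ceil(0.1 * n_holdout)
n_val = n_holdout - n_test

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.1%})")
```

The result is roughly a 70% / 27% / 3% split, so the test set is small relative to the validation set.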

## Model Building

### Model Evaluation Criterion

* The model can make wrong predictions in two ways:
1. The model predicts that a visa application will be certified when it should be denied (false positive).
2. The model predicts that a visa application will be denied when it should be certified (false negative).
* Both cases are important, so we use the F1 score as the evaluation metric. The higher the F1 score, the better the model balances false negatives and false positives.
* We will use balanced class weights.
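
To make the metric concrete, here is a small sketch of how F1 follows from the error counts, together with a trivial baseline. The counts are hypothetical: they assume 1,000 validation cases of which about two-thirds are certified, in line with the class balance seen in the EDA. The function name `f1_from_counts` is introduced only for this illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical validation set: 1,000 cases, 668 of them certified (class 1).
# A baseline that predicts "certified" for every case gets all 668 right
# and mislabels the 332 denied cases as certified.
baseline_f1 = f1_from_counts(tp=668, fp=332, fn=0)
print(round(baseline_f1, 3))  # 0.801
```

Under these assumed counts, a model needs an F1 above roughly 0.80 on the certified class to beat the always-certified baseline, which is worth keeping in mind when reading the scores below.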


In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

#### Defining scorer to be used for cross-validation and hyperparameter tuning

In [None]:
scorer = metrics.make_scorer(metrics.f1_score) ## define the metric


**We are now done with pre-processing and evaluation criterion, so let's start building the model.**

### Model building with original data

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
print("\nCross-Validation performance on training dataset:\n")

for name, model in models:
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    cv_result = cross_val_score(estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold)
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\nValidation Performance:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = f1_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

* **GBM outperforms** all other models in terms of both median performance and consistency (tight spread).
* **Boosting methods** (GBM, AdaBoost, XGBoost) clearly outperform the single decision tree and Bagging.
* **Decision Tree** not only underperforms but also has inconsistent results.


### Model Building with oversampled data

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=1) ## set the k-nearest neighbors and sampling strategy
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))


print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
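
For intuition about what SMOTE does, the core idea is to create new minority-class rows by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea in NumPy, not imblearn's implementation; the points and the helper name `smote_like_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1 : k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(3)])
print(synthetic)
```

Note that plain SMOTE interpolates every column, so one-hot dummy columns created by `pd.get_dummies` can take fractional values in the synthetic rows.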

In [None]:
models = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []
names = []

print("\nCross-Validation performance on training dataset:\n")

for name, model in models:
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\nValidation Performance:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = f1_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))


In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

* GBM is the top performer, reliable and stable across folds.
* AdaBoost is a close second, great if you care more about minimizing false negatives.
* XGBoost is also strong.

### Model Building with undersampled data

In [None]:
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)


print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  ## set the number of splits
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un,scoring = scorer, cv=kfold,n_jobs =-1
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un) ## fit the model on the undersampled data.
    scores = f1_score(y_val, model.predict(X_val)) ## define the metric function name.
    print("{}: {}".format(name, scores))

In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

* GBM remains the most robust across evaluation metrics.
* AdaBoost is consistently strong.
* XGBoost maintains good performance.
* Decision Tree underperforms across the board.
* Bagging struggles with consistency and recall/precision trade-off.

## Hyperparameter Tuning

### Tuning AdaBoost using oversampled data

In [None]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [50, 100, 150], ## set the number of estimators
    "learning_rate": [0.01, 0.05, 0.1], ## set the learning rate.
    "estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs = -1,
    scoring=scorer,
    cv=5,
    random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over) ## fit the model on over sampled data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
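
A note on the search budget: every entry in the AdaBoost grid above is a list, so the number of distinct settings is finite. A short check of the arithmetic, assuming the three-value lists shown in the cell:

```python
from math import prod

# Number of candidate values per hyperparameter in the AdaBoost grid above
grid_sizes = {"n_estimators": 3, "learning_rate": 3, "estimator": 3}

n_combinations = prod(grid_sizes.values())
n_iter = 50

# With list-valued grids, the search cannot try more distinct settings than exist
n_evaluated = min(n_iter, n_combinations)
print(n_combinations, n_evaluated)  # 27 27
```

With only 27 combinations, `n_iter=50` exceeds the grid, so the randomized search should end up evaluating every combination, which makes it equivalent to an exhaustive grid search here.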

In [None]:
## set the best parameters.
tuned_ada = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.1,
    estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
    random_state=1,  # fixed seed for reproducibility, matching the search above
)

tuned_ada.fit(X_train_over, y_train_over)

In [None]:
ada_train_perf = model_performance_classification_sklearn(tuned_ada, X_train_over, y_train_over)
ada_train_perf

In [None]:
ada_val_perf = model_performance_classification_sklearn(tuned_ada, X_val, y_val)
ada_val_perf

### Tuning Random forest using undersampled data

In [None]:
%%time

# defining model
Model = RandomForestClassifier(random_state=1)

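# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 200],
    "min_samples_leaf": np.arange(1, 5),
    # Flat list of candidates: integer feature counts plus the "sqrt" rule.
    # Nesting the array inside the list would pass a whole array as one value.
    "max_features": [1, 3, 5, 7, 9, "sqrt"],
    "max_samples": np.arange(0.5, 1.0, 0.1),
}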

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1
)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

In [None]:
tuned_rf2 = RandomForestClassifier(
    max_features='sqrt',
    random_state=1,
    max_samples=0.5,
    n_estimators=200,
    min_samples_leaf=4,
)

tuned_rf2.fit(X_train_un, y_train_un)

In [None]:
rf2_train_perf = model_performance_classification_sklearn(
    tuned_rf2, X_train_un, y_train_un
)
rf2_train_perf

In [None]:
rf2_val_perf = model_performance_classification_sklearn(
    tuned_rf2, X_val, y_val
)
rf2_val_perf

### Tuning Gradient Boosting using oversampled data

In [None]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 200, 50),
    "learning_rate": [0.01, 0.1],
    "subsample": [0.7, 1.0],
    "max_features": ['sqrt', 'log2']
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    scoring=scorer,
    n_iter=50,
    n_jobs=-1,
    cv=5,
    random_state=1
)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
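
A note on the search budget: when every entry in `param_distributions` is a list, `RandomizedSearchCV` samples without replacement, so it can never evaluate more candidates than the grid contains. The sketch below counts the combinations in the Gradient Boosting grid above and in the XGBoost grid defined later in this notebook, with the candidate values written out as plain lists. The helper `grid_size` is a hypothetical name introduced here for illustration.

```python
from math import prod

# Candidate values per hyperparameter, written out as plain lists
# (np.arange(50, 200, 50) in the cell above yields 50, 100, 150)
gbm_grid = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.7, 1.0],
    "max_features": ["sqrt", "log2"],
}
xgb_grid = {
    "n_estimators": [100, 200, 300],
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.05, 0.1],
    "gamma": [0, 1, 5],
    "subsample": [0.7, 0.8, 1.0],
}


def grid_size(grid):
    """Number of distinct combinations in a grid made only of lists."""
    return prod(len(values) for values in grid.values())


n_iter = 50
for name, grid in [("Gradient Boosting", gbm_grid), ("XGBoost", xgb_grid)]:
    total = grid_size(grid)
    # The search evaluates at most `total` candidates, even if n_iter is larger
    print(f"{name}: {total} combinations, {min(n_iter, total)} evaluated")
```

With `n_iter=50`, the Gradient Boosting search is therefore effectively exhaustive over its 24 combinations (scikit-learn emits a warning to that effect), while the XGBoost search covers roughly a fifth of its 243.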

In [None]:
tuned_gbm = GradientBoostingClassifier(
    max_features='sqrt',
    random_state=1,
    learning_rate=0.1,
    n_estimators=150,
    subsample=1.0
)

tuned_gbm.fit(X_train_over, y_train_over)

In [None]:
gbm_train_perf = model_performance_classification_sklearn(
    tuned_gbm, X_train_over, y_train_over
)
gbm_train_perf

In [None]:
## print the model performance on the validation data.
gbm_val_perf = model_performance_classification_sklearn(tuned_gbm, X_val, y_val)
gbm_val_perf

### Tuning XGBoost using oversampled data

In [None]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

## define the hyperparameters
param_grid={
    'n_estimators':[100, 200, 300],
    'scale_pos_weight':[1, 2, 5],
    'learning_rate':[0.01, 0.05, 0.1],
    'gamma':[0, 1, 5],
    'subsample':[0.7, 0.8, 1.0]
    }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs = -1,
    scoring=scorer,
    cv=5,
    random_state=1
    )

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
## Code to define the best model
xgb2 = XGBClassifier(
    random_state=1,
    eval_metric='logloss',
    subsample=1.0,
    scale_pos_weight=2,
    n_estimators=200,
    learning_rate=0.05,
    gamma=1,
)

xgb2.fit(X_train_over, y_train_over)

In [None]:
xgb2_train_perf = model_performance_classification_sklearn(
    xgb2, X_train_over, y_train_over
)
xgb2_train_perf

In [None]:
## Model performance on the validation data.
xgb2_val_perf = model_performance_classification_sklearn(xgb2, X_val, y_val)
xgb2_val_perf

**We have now tuned all the models. Let's compare their performance and see which one is best.**

## Model performance comparison and choosing the final model

In [None]:
models_train_comp_df = pd.concat(
    [
        gbm_train_perf.T,
        xgb2_train_perf.T,
        ada_train_perf.T,
        rf2_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "XGBoost tuned with oversampled data",
    "AdaBoost tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df

In [None]:
# validation performance comparison

models_val_comp_df = pd.concat(
    [
        gbm_val_perf.T,
        xgb2_val_perf.T,
        ada_val_perf.T,
        rf2_val_perf.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "XGBoost tuned with oversampled data",
    "AdaBoost tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Model performance data
data = {
    "Metric": ["Accuracy", "Recall", "Precision", "F1"],
    "Gradient Boosting (Over)": [0.740369, 0.845233, 0.783179, 0.813023],
    "XGBoost (Over)": [0.725105, 0.948629, 0.724763, 0.821722],
    "AdaBoost (Over)": [0.739642, 0.841315, 0.784453, 0.811890],
    "Random Forest (Under)": [0.708533, 0.722682, 0.819551, 0.768074]
}

# Create DataFrame
df = pd.DataFrame(data)

# Set 'Metric' as index
df.set_index("Metric", inplace=True)

# Plot grouped bar chart
df.plot(kind="bar", figsize=(12, 6))
plt.title("Model Comparison Across Metrics")
plt.ylabel("Score")
plt.ylim(0.65, 1.0)
plt.xticks(rotation=0)
plt.legend(title="Model", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.grid(axis="y")

plt.show()

**Best Overall (F1 Score):**
**XGBoost (Oversampled)** has the highest F1 score (0.822) and recall (0.949), making it the best model when the goal is to catch as many positives as possible (e.g. identifying applications that will be certified). Its precision (0.725) is the lowest of the four, so it also produces the most false positives.

**Best Precision:**
**Random Forest (Undersampled)** has the highest precision (0.820), which is useful if you want to avoid false positives (e.g. predicting certification for cases that are likely to be denied).

**Most Balanced:**
**Gradient Boosting (Oversampled)** is a solid all-around choice, with the highest accuracy (0.740) and good precision (0.783) and recall (0.845). It is a safe default if you want balanced performance without extremes.

**Final Pick**
Use **XGBoost (Oversampled)**, since recall is the priority.
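
As a quick sanity check on the comparison above, F1 is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$, so the F1 values hard-coded in the plotting cell can be recomputed from the precision and recall listed next to them. The sketch below does this in plain Python; the helper name `f1_from` is an assumption introduced for illustration.

```python
# Scores as hard-coded in the plotting cell: (recall, precision, reported F1)
reported = {
    "Gradient Boosting (Over)": (0.845233, 0.783179, 0.813023),
    "XGBoost (Over)": (0.948629, 0.724763, 0.821722),
    "AdaBoost (Over)": (0.841315, 0.784453, 0.811890),
    "Random Forest (Under)": (0.722682, 0.819551, 0.768074),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: f1_from(precision, recall)
    for name, (recall, precision, _) in reported.items()
}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported F1={f1:.6f}, recomputed F1={recomputed[name]:.6f}")

# Model with the highest recomputed F1
best_by_f1 = max(recomputed, key=recomputed.get)
print("Highest F1:", best_by_f1)
```

The recomputed values agree with the reported ones to within rounding, and XGBoost leads on F1 because its large recall advantage outweighs its lower precision.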

In [None]:
## print the model performance on the test data by the best model.
test = model_performance_classification_sklearn(xgb2, X_test, y_test)
test

In [None]:
## print the feature importances from the best model.
feature_names = X_train.columns
importances = xgb2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
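
The importances plotted above are reported per one-hot column, so a single original variable such as `education_of_employee` is spread across several bars. One way to read them at the level of the original variable is to sum the importances that share a column prefix. The sketch below assumes the original column names are known; the importance values are made-up numbers for illustration only, not results from this notebook, and `aggregate_importances` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical importances for a few one-hot columns (made-up numbers)
demo_importances = {
    "education_of_employee_Bachelor's": 0.20,
    "education_of_employee_Doctorate": 0.10,
    "education_of_employee_Master's": 0.10,
    "has_job_experience_N": 0.15,
    "has_job_experience_Y": 0.05,
}

# Original (pre-encoding) column names, assumed known from the raw data
source_columns = ["education_of_employee", "has_job_experience"]


def aggregate_importances(importances, columns):
    """Sum one-hot importances back onto their original column."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        # Prefer the longest matching prefix so similarly named columns don't collide
        matches = [c for c in columns if feature == c or feature.startswith(c + "_")]
        key = max(matches, key=len) if matches else feature
        totals[key] += value
    # Sort from most to least important
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


grouped = aggregate_importances(demo_importances, source_columns)
print(grouped)
```

In the notebook, the same helper could be fed `dict(zip(feature_names, importances))`; features that match no listed column are kept under their own name.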

## Actionable Insights and Recommendations

**Top Predictive Features**
* education_of_employee_Bachelor's
* has_job_experience_N
* education_of_employee_Doctorate
* education_of_employee_Master's
* has_job_experience_Y

**Education level and job experience are the most important predictors of visa approval.**

* Higher education (Master’s) and prior job experience increase certification chances.
* Region matters: South > Northeast for approvals.
* Higher wage offers correlate with approvals (~$7K higher median).

1. Prioritize Experienced Applicants
   * Insight: Lack of job experience is a key differentiator in denied cases.
   * Recommendation: Encourage applicants to gain relevant work experience before applying or emphasize their existing experience in the application.

2. Target Candidates with Higher Education
   * Insight: Master’s degree holders have a higher success rate than Bachelor’s degree holders.
   * Recommendation: Prioritize or recommend applicants with Master’s or higher degrees, especially in STEM or in-demand fields.

<font size=6 color='blue'>Power Ahead</font>
___