<a href="https://colab.research.google.com/github/alex-jk/GTA_apartment_rentals/blob/main/LMIA_approved_applications_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import LMIA data**

In [2]:
from getpass import getpass
TOKEN = getpass('GitHub token: ')

!git clone https://{TOKEN}@github.com/alex-jk/LMIA-analysis.git
%cd LMIA-analysis

GitHub token: ··········
Cloning into 'LMIA-analysis'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 18 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (18/18), 405.43 KiB | 1.98 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/LMIA-analysis


In [10]:
import pandas as pd
import re

In [9]:
df = pd.read_csv("data/LMIA_by_company_2025Q1.csv")

df_ontario = df[df['Province/Territory']=='Ontario'].copy().reset_index(drop=True)

print(f"\nDF shape: {df_ontario.shape}")
print(f"\nDF columns: {df_ontario.columns}")

print(f"\nNumber of companies: {len(df_ontario['Employer'].unique())}")
print(f"\nNumber of addresses: {len(df_ontario['Address'].unique())}")


DF shape: (5073, 8)

DF columns: Index(['Province/Territory', 'Program Stream', 'Employer', 'Address',
       'Occupation', 'Incorporate Status', 'Approved LMIAs',
       'Approved Positions'],
      dtype='object')

Number of companies: 4266

Number of addresses: 3590


**Function to normalize employer names and addresses**

In [11]:
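def normalize_simple(s):
    """Return a lowercase, punctuation-free key for s, or None if s is not a string (e.g. NaN)."""
    if not isinstance(s, str):
        return None
    s = s.lower()
    # Replace each run of characters outside a-z and 0-9 with a single space (digits are kept)
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return s.strip()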
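
To make the effect of `normalize_simple` concrete, here is a minimal sketch that runs it on a few inputs. The sample strings are hypothetical and are not taken from the LMIA file; the function body is repeated only so the snippet runs on its own.

```python
import re

def normalize_simple(s):
    # Same logic as the notebook's function, repeated so this snippet is self-contained
    if not isinstance(s, str):
        return None
    s = s.lower()
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return s.strip()

# Hypothetical inputs (assumptions for illustration, not rows from the data)
samples = ["ABC Corp., Inc.", "  abc corp inc ", "A.B.C. Corp", float("nan"), "Café Ltd."]
keys = [normalize_simple(x) for x in samples]

for raw, key in zip(samples, keys):
    print(repr(raw), "->", repr(key))
```

Two behaviours are worth noting. Dotted abbreviations are split into separate tokens, so "A.B.C. Corp" does not get the same key as "ABC Corp". Letters outside a-z are treated as separators, so accented names lose characters.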

**Normalize Employer names and addresses**

In [12]:
df_ontario["Employer_key"] = df_ontario["Employer"].map(normalize_simple)
df_ontario["Address_key"]  = df_ontario["Address"].map(normalize_simple)

print(f"\nNumber of unique Employer keys: {len(df_ontario['Employer_key'].unique())}")
print(f"\nNumber of unique Address keys: {len(df_ontario['Address_key'].unique())}")


Number of unique Employer keys: 4265

Number of unique Address keys: 3585
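
The drop from 4266 raw employer names to 4265 keys means at least two spellings collapsed into one key. One way to see which raw spellings were merged is to group by the key and collect the distinct raw values. The sketch below uses a toy frame with hypothetical employer names, since the actual rows are not shown here; the same grouping could be applied to `df_ontario`.

```python
import re
import pandas as pd

def normalize_simple(s):
    # Same logic as the notebook's function, repeated so this snippet is self-contained
    if not isinstance(s, str):
        return None
    s = s.lower()
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return s.strip()

# Toy data with hypothetical names (assumption, not from the LMIA file)
toy = pd.DataFrame(
    {"Employer": ["Maple Foods Inc.", "MAPLE FOODS INC", "Birch Logistics Ltd."]}
)
toy["Employer_key"] = toy["Employer"].map(normalize_simple)

# For each key, collect the distinct raw spellings that map to it
variants = toy.groupby("Employer_key")["Employer"].unique()

# Keep only keys that absorbed more than one raw spelling
merged = variants[variants.map(len) > 1]
print(merged)
```

Reviewing the merged spellings by eye is a cheap check that the normalization joined true duplicates rather than distinct companies.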


**Check employer and address key combinations**

<font color="blue">Employer key - Address key combinations</font>: count how many distinct address keys each employer key maps to.

In [15]:
# 1) How many unique addresses per employer?
addr_per_emp = (
    df_ontario.groupby("Employer_key", dropna=False)["Address_key"]
      .nunique()
      .rename("n_addresses")
      .reset_index()
)

# 2) Employers with >1 address
multi_addr = addr_per_emp[addr_per_emp["n_addresses"] > 1].sort_values("n_addresses", ascending=False)

# 3) Quick summary
total_employers = len(addr_per_emp)
single_addr = (addr_per_emp["n_addresses"] == 1).sum()
multi_count = (addr_per_emp["n_addresses"] > 1).sum()
share_multi = multi_count / total_employers if total_employers else 0

print(f"Total employers: {total_employers}")
print(f"Employers with exactly 1 address: {single_addr}")
print(f"Employers with > 1 address: {multi_count} ({share_multi:.1%})")

Total employers: 4265
Employers with exactly 1 address: 4264
Employers with > 1 address: 1 (0.0%)
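
The key counts reported earlier (4265 employer keys against 3585 address keys) suggest that some addresses are shared by more than one employer. A reverse grouping, employers per address, would be one way to check this. The sketch below mirrors the cell above on a toy frame; the rows and the names `emp_per_addr` and `shared_addr` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data with hypothetical keys (assumption, not from the LMIA file)
toy = pd.DataFrame({
    "Employer_key": ["maple foods inc", "birch logistics ltd",
                     "cedar tech corp", "cedar tech corp"],
    "Address_key": ["1 king st toronto on", "1 king st toronto on",
                    "9 bay st toronto on", "9 bay st toronto on"],
})

# How many unique employers per address?
emp_per_addr = (
    toy.groupby("Address_key", dropna=False)["Employer_key"]
       .nunique()
       .rename("n_employers")
       .reset_index()
)

# Addresses used by more than one employer
shared_addr = emp_per_addr[emp_per_addr["n_employers"] > 1]
print(shared_addr)
```

Here the first address is shared by two employers, while the second appears twice but belongs to a single employer, so only the first is flagged.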
