<a href="https://colab.research.google.com/github/alexgaaranes/malaia-group-2/blob/main/MALAIA_Liyab_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MALAIA - Group 2

---
##### Predicting Starting Salaries of Filipino Graduates Using Academic Background and Industry Placement: A Machine Learning Approach Based on the Liyab First Pay Survey

<br>

Cleaning data from [**Liyab First Pay Survey dataset**](https://docs.google.com/spreadsheets/d/1gnA91Tjr_3UCNV8x1_LoE0oC56r-pXXRdJcgTfOLlm0/edit?gid=549575995#gid=549575995)

### Data Prep and Initial Exploration

In [1]:
# Mounting Google Drive. If running locally, ensure 'liyab.csv' is in the same directory or provide the correct path.
try:
    from google.colab import drive
    drive.mount('/content/drive')
    # Update this path if your file is located elsewhere in Google Drive
    csv_path = "/content/drive/Shareddrives/MALAIA Group 2/liyab_data/liyab.csv"
except ModuleNotFoundError:
    print("Not running in Colab. Assuming 'liyab.csv' is in the current directory or accessible via a local path.")
    csv_path = "liyab.csv" # Adjust if your local path is different

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import re

# Read data
try:
    liyab = pd.read_csv(csv_path)
except FileNotFoundError:
    print(f"Error: The file {csv_path} was not found. Please check the path.")
    # In a real scenario, you might stop execution here or try a fallback path
    liyab = pd.DataFrame() # Create an empty DataFrame to prevent further errors if file not found

print(f"Successfully loaded data. Shape: {liyab.shape}")

Successfully loaded data. Shape: (2933, 9)


In [3]:
print("Initial Data Information:")
if not liyab.empty:
    liyab.info()
    print("\nMissing Values per Column:")
    print(liyab.isnull().sum())
    print("\nFirst 5 Rows:")
    print(liyab.head())
    print("\nColumn Names:")
    print(liyab.columns.tolist())
else:
    print("DataFrame is empty. Cannot display info.")

Initial Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2933 entries, 0 to 2932
Data columns (total 9 columns):
 #   Column                                                                                   Non-Null Count  Dtype  
---  ------                                                                                   --------------  -----  
 0   Timestamp                                                                                2933 non-null   object 
 1   What year did you start your first job?                                                  2933 non-null   int64  
 2   In what industry was this job?                                                           2933 non-null   object 
 3   What was your role?                                                                      2933 non-null   object 
 4   What was your monthly salary (in PHP)?                                                   2933 non-null   float64
 5   What school did you graduate from?   

### Data Cleaning

#### 1. Year Started First Job

In [4]:
year_col = 'What year did you start your first job?'
if not liyab.empty and year_col in liyab.columns:
    print(f"Original value counts for '{year_col}':")
    print(liyab[year_col].value_counts().sort_index().head(10)) # Show some problematic ones if any

    # Convert to numeric, coercing errors. This handles non-numeric strings.
    liyab[year_col] = pd.to_numeric(liyab[year_col], errors='coerce')

    # Filter rows with years outside the plausible range (1987-2025)
    # Also drops rows where year became NaN due to non-numeric original values
    original_rows = len(liyab)
    liyab.dropna(subset=[year_col], inplace=True) # Remove NaNs from coerce
    liyab = liyab[liyab[year_col].between(1987, 2025)]
    liyab[year_col] = liyab[year_col].astype(int)
    print(f"\nRows removed due to invalid/outside range year: {original_rows - len(liyab)}")
    print(f"Cleaned value counts for '{year_col}':")
    print(liyab[year_col].value_counts().sort_index())
else:
    print(f"Column '{year_col}' not found or DataFrame is empty.")

Original value counts for 'What year did you start your first job?':
What year did you start your first job?
2       1
18      1
19      1
20      2
21      1
208     1
209     1
1987    1
1992    1
1997    1
Name: count, dtype: int64

Rows removed due to invalid/outside range year: 11
Cleaned value counts for 'What year did you start your first job?':
What year did you start your first job?
1987      1
1992      1
1997      1
1998      2
1999      3
2000      3
2001      2
2002      5
2003      7
2004      6
2005     18
2006     18
2007     18
2008     21
2009     31
2010     42
2011     65
2012     89
2013    118
2014    142
2015    193
2016    291
2017    441
2018    574
2019    649
2020    112
2021     24
2022     27
2023      6
2024     10
2025      2
Name: count, dtype: int64


#### 2. Gender Cleaning

In [5]:
gender_col = 'What is your gender?'
if not liyab.empty and gender_col in liyab.columns:
    print(f"Original value counts for '{gender_col}':")
    print(liyab[gender_col].value_counts().head(10))

    liyab['Cleaned Gender'] = liyab[gender_col].astype(str).str.lower().str.strip()

    gender_map = {
        # FEMALE variants
        'female': 'Female',
        'f': 'Female',
        'femaile': 'Female',
        'femail': 'Female',
        'femali': 'Female',
        'femalen': 'Female',
        'femal': 'Female',
        'femalr': 'Female',
        'femalw': 'Female',
        'femaled': 'Female',
        'femae': 'Female',
        'feme': 'Female',
        'babae': 'Female',
        'cisgender female': 'Female',
        'cis female': 'Female',
        'women': 'Female',
        'woman': 'Female',
        'female (cishet)': 'Female',
        'biological female': 'Female',
        'heterosexual female': 'Female',
        'female (queer)': 'Female',
        'cisgender-female': 'Female',
        'female, cisgender': 'Female',
        'cis woman/female': 'Female',
        'frmale': 'Female',
        '*sex = female': 'Female',

        # MALE variants
        'male': 'Male',
        'm': 'Male',
        'make': 'Male',
        'man': 'Male',
        'cisgender male': 'Male',
        'cis male': 'Male',
        'male cisgender': 'Male',
        'heterosexual male': 'Male',
        'homosexual man': 'Male', # Categorizing by gender identity primarily
        'males': 'Male',
        'mqle': 'Male',
        'norzagaray collegemale': 'Male', # This appeared in earlier exploration, likely a data entry error
        'homosexual male': 'Male',

        # LGBTQ+
        'lgbtq': 'LGBTQ+',
        'gay': 'LGBTQ+',
        'lesbian': 'LGBTQ+',
        'queer': 'LGBTQ+',
        'bisexual': 'LGBTQ+',
        'bisexual woman': 'LGBTQ+',
        'bisexual female': 'LGBTQ+',
        'cis-gender, pansexual, masculine': 'LGBTQ+',
        'nonbinary': 'LGBTQ+',
        'non-binary': 'LGBTQ+',
        'nb': 'LGBTQ+',
        'gender fluid': 'LGBTQ+',
        'non-conforming': 'LGBTQ+',
        'non-conforming male': 'LGBTQ+',
        'non-binary, presenting mainly as male': 'LGBTQ+',
        'homosexual': 'LGBTQ+', # General homosexual if not specified as man/woman for gender

        # PREFER NOT TO SAY
        'prefer not to say': 'Prefer not to say',
        'prefer not to mention': 'Prefer not to say',

        # OTHERs (explicitly mapped, rest will become 'Other')
        'tired potato': 'Other',
        'pogi': 'Other' # Humorous entry
    }

    liyab['Cleaned Gender'] = liyab['Cleaned Gender'].map(gender_map).fillna(liyab['Cleaned Gender'])

    # Consolidate remaining unmapped values to 'Other'
    allowed_genders = ['Female', 'Male', 'LGBTQ+', 'Prefer not to say', 'Other']
    liyab['Cleaned Gender'] = liyab['Cleaned Gender'].apply(lambda x: x if x in allowed_genders else 'Other')

    # Optional: Drop original gender column and rename
    # liyab.drop(columns=[gender_col], inplace=True)
    # liyab.rename(columns={'Cleaned Gender': gender_col}, inplace=True)

    print("\nCleaned value counts for 'Cleaned Gender':")
    print(liyab['Cleaned Gender'].value_counts())
else:
    print(f"Column '{gender_col}' not found or DataFrame is empty.")

Original value counts for 'What is your gender?':
What is your gender?
Female     1404
Male        862
F           246
M            92
Female       42
female       29
male         11
FEMALE       11
Woman         6
MALE          5
Name: count, dtype: int64

Cleaned value counts for 'Cleaned Gender':
Cleaned Gender
Female               1769
Male                  991
Other                 135
LGBTQ+                 25
Prefer not to say       2
Name: count, dtype: int64


#### 3. University Cleaning

In [6]:
uni_col = 'What school did you graduate from?'
if not liyab.empty and uni_col in liyab.columns:
    liyab['Cleaned University'] = liyab[uni_col].astype(str).str.lower().str.strip()
    # Remove content in parentheses (e.g., (BS), (Manila Campus))
    liyab['Cleaned University'] = liyab['Cleaned University'].apply(lambda x: re.sub(r'\s*\([^)]*\)\s*', '', x).strip())
    # Remove punctuation except spaces, then normalize spaces
    liyab['Cleaned University'] = liyab['Cleaned University'].apply(lambda x: re.sub(r'[^a-z0-9\s]', '', x))
    liyab['Cleaned University'] = liyab['Cleaned University'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

    # Specific string replacements for common terms BEFORE regex mapping
    replacements = {
        'univ ': 'university ',
        ' univ': ' university',
        'st ': 'saint ',
        'sta ': 'santa ',
        ' de ': ' ',
        ' la ': ' ',
        ' los ': ' ',
        ' baños ': ' banos ',
        ' and ': ' ',
        ' & ': ' '
    }
    for old, new in replacements.items():
        liyab['Cleaned University'] = liyab['Cleaned University'].str.replace(old, new, regex=False)
    # Re-apply space normalization
    liyab['Cleaned University'] = liyab['Cleaned University'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

    university_map = {
        # UP System (Order matters: more specific regex first)
        r'.*university philippines diliman.*': 'University of the Philippines Diliman',
        r'.*up diliman.*': 'University of the Philippines Diliman',
        r'^upd$': 'University of the Philippines Diliman',
        r'.*university philippines los banos.*': 'University of the Philippines Los Baños',
        r'.*up los banos.*': 'University of the Philippines Los Baños',
        r'^uplb$': 'University of the Philippines Los Baños',
        r'.*university philippines manila.*': 'University of the Philippines Manila',
        r'.*up manila.*': 'University of the Philippines Manila',
        r'^upm$': 'University of the Philippines Manila',
        r'.*university philippines visayas.*': 'University of the Philippines Visayas',
        r'.*up visayas.*': 'University of the Philippines Visayas',
        r'^upv$': 'University of the Philippines Visayas',
        r'.*university philippines cebu.*': 'University of the Philippines Cebu',
        r'.*up cebu.*': 'University of the Philippines Cebu',
        r'.*university philippines baguio.*': 'University of the Philippines Baguio',
        r'.*up baguio.*': 'University of the Philippines Baguio',
        r'^upb$': 'University of the Philippines Baguio',
        r'.*university philippines mindanao.*': 'University of the Philippines Mindanao',
        r'.*up mindanao.*': 'University of the Philippines Mindanao',
        r'.*university philippines open university.*': 'University of the Philippines Open University',
        r'.*up open univ.*': 'University of the Philippines Open University',
        r'^upou$': 'University of the Philippines Open University',
        r'.*university philippines.*': 'University of the Philippines (Unspecified Campus)',
        r'^up$': 'University of the Philippines (Unspecified Campus)',

        # Ateneo System
        r'.*ateneo manila university.*': 'Ateneo de Manila University',
        r'^admu$': 'Ateneo de Manila University',
        r'.*ateneo davao university.*': 'Ateneo de Davao University',
        r'^addu$': 'Ateneo de Davao University',
        r'.*ateneo zamboanga university.*': 'Ateneo de Zamboanga University',
        r'^adzu$': 'Ateneo de Zamboanga University',
        r'.*ateneo naga university.*': 'Ateneo de Naga University',
        r'^adnu$': 'Ateneo de Naga University',
        r'.*xavier university ateneo cagayan.*': 'Xavier University - Ateneo de Cagayan',
        r'.*xavier university.*': 'Xavier University - Ateneo de Cagayan',
        r'.*ateneo cagayan.*': 'Xavier University - Ateneo de Cagayan',
        r'.*ateneo.*': 'Ateneo de Manila University', # General Ateneo, default to Manila

        # De La Salle System
        r'.*salle university manila.*': 'De La Salle University Manila',
        r'.*salle manila.*': 'De La Salle University Manila',
        r'^dlsum$': 'De La Salle University Manila',
        r'^dlsu$': 'De La Salle University Manila',
        r'.*salle college saint benilde.*': 'De La Salle-College of Saint Benilde',
        r'.*salle csb.*': 'De La Salle-College of Saint Benilde',
        r'^csb$': 'De La Salle-College of Saint Benilde',
        r'^benilde$': 'De La Salle-College of Saint Benilde',
        r'.*salle lipa.*': 'De La Salle Lipa',
        r'^dlsl$': 'De La Salle Lipa',
        r'.*salle university dasmarinas.*': 'De La Salle University Dasmariñas',
        r'^dlsud$': 'De La Salle University Dasmariñas',
        r'.*salle university.*': 'De La Salle University Manila', # Default DLSU to Manila
        r'.*salle medical health sciences institute.*' : 'De La Salle Medical and Health Sciences Institute',
        r'.*salle.*': 'De La Salle University Manila', # General La Salle

        # UST
        r'.*university santo tomas.*': 'University of Santo Tomas',
        r'^ust$': 'University of Santo Tomas',

        # Mapua
        r'.*mapua institute technology.*': 'Mapúa University',
        r'.*mapua university.*': 'Mapúa University',
        r'^mapua$': 'Mapúa University',

        # PUP
        r'.*polytechnic university philippines.*': 'Polytechnic University of the Philippines',
        r'^pup$': 'Polytechnic University of the Philippines',

        # Other common schools
        r'.*adamson university.*': 'Adamson University',
        r'.*far eastern university.*': 'Far Eastern University',
        r'^feu$': 'Far Eastern University',
        r'.*lyceum philippines university.*': 'Lyceum of the Philippines University',
        r'^lpu$': 'Lyceum of the Philippines University',
        r'.*miriam college.*': 'Miriam College',
        r'.*national university.*': 'National University',
        r'^nu$': 'National University',
        r'.*pamantasan lungsod maynila.*': 'Pamantasan ng Lungsod ng Maynila',
        r'^plm$': 'Pamantasan ng Lungsod ng Maynila',
        r'.*san beda university.*': 'San Beda University',
        r'.*san beda college.*': 'San Beda University',
        r'^sbu$': 'San Beda University',
        r'^sbc$': 'San Beda University',
        r'.*silliman university.*': 'Silliman University',
        r'.*technological institute philippines.*': 'Technological Institute of the Philippines',
        r'^tip$': 'Technological Institute of the Philippines',
        r'.*technological university philippines.*': 'Technological University of the Philippines',
        r'^tup$': 'Technological University of the Philippines',
        r'.*university east.*': 'University of the East',
        r'^ue$': 'University of the East',
        r'.*university san carlos.*': 'University of San Carlos',
        r'^usc$': 'University of San Carlos',
        r'.*saint louis university.*': 'Saint Louis University Baguio',
        r'^slu$': 'Saint Louis University Baguio',
        r'.*central philippine university.*': 'Central Philippine University',
        r'^cpu$': 'Central Philippine University',
        r'.*mindanao state university iligan institute technology.*': 'Mindanao State University - Iligan Institute of Technology',
        r'.*msu iit.*': 'Mindanao State University - Iligan Institute of Technology',
        r'.*mindanao state university.*': 'Mindanao State University (Unspecified Campus)',
        r'^msu$': 'Mindanao State University (Unspecified Campus)',
        r'.*holy angel university.*': 'Holy Angel University',
        r'^hau$': 'Holy Angel University',
        r'.*university baguio.*': 'University of Baguio',
        r'.*ub.*': 'University of Baguio',
        r'.*university makati.*': 'University of Makati',
        r'^umak$': 'University of Makati',
        r'.*cebu institute technology.*': 'Cebu Institute of Technology - University',
        r'^cit u.*': 'Cebu Institute of Technology - University',
        r'.*university cebu.*': 'University of Cebu',
        r'.*university perpetual help system dalta.*': 'University of Perpetual Help System DALTA',
        r'.*uphsd.*': 'University of Perpetual Help System DALTA',
        r'.*asia pacific college.*': 'Asia Pacific College',
        r'^apc$': 'Asia Pacific College',
        r'.*enderun colleges.*': 'Enderun Colleges',
        r'.*iacademy.*': 'iACADEMY',
        r'.*sti college.*': 'STI College',
        r'^sti$': 'STI College',
        r'.*ama computer university.*': 'AMA Computer University',
        r'.*ama computer college.*': 'AMA Computer University',
        r'^ama$': 'AMA Computer University',

        # Non-university / Special Cases
        r'.*still in school.*': 'Still Enrolled',
        r'.*not yet a graduate.*': 'Still Enrolled',
        r'.*not yet graduated.*': 'Still Enrolled',
        r'.*undergrad.*': 'Still Enrolled / Did Not Graduate',
        r'.*didnt graduate.*': 'Did Not Graduate',
        r'.*college dropout.*': 'Did Not Graduate',
        r'.*high school.*': 'High School Graduate',
        r'.*hs grad.*': 'High School Graduate',
        r'^na$': 'Not Applicable',
        r'^n a$': 'Not Applicable',
        r'nan': 'Not Specified', # for the string 'nan'
        r'.*prefer not to say.*': 'Prefer Not to Say',
        r'.*secret.*': 'Prefer Not to Say',
        r'^\s*$': 'Not Specified', # Empty strings after strip
        r'^\d{4}$': 'Invalid Entry (Year)',
        r'^\d{1,2}$': 'Invalid Entry (Number)',
        r'.*overseas.*': 'Overseas University'
    }

    # Apply the mapping using regex
    # Create a temporary column to avoid issues with chained assignment warnings
    temp_uni_values = liyab['Cleaned University'].copy()
    for pattern, standard_name in university_map.items():
        # Apply regex replacement where the current value matches the pattern
        mask = temp_uni_values.str.contains(pattern, regex=True, case=False, na=False)
        temp_uni_values[mask] = standard_name
    liyab['Cleaned University'] = temp_uni_values

    # Final catch-all for unmapped values that are not special categories
    known_categories = set(university_map.values())
    liyab['Cleaned University'] = liyab['Cleaned University'].apply(
        lambda x: x if x in known_categories else ('Other University' if len(x) > 3 else 'Not Specified')
        # len(x) > 3 is a heuristic to avoid classifying short, possibly invalid entries as 'Other University'
    )
    liyab['Cleaned University'].fillna('Not Specified', inplace=True)

    print("\nCleaned value counts for 'Cleaned University':")
    print(liyab['Cleaned University'].value_counts(dropna=False))

    # Example: Remove rows where university is not suitable for graduate salary prediction
    # original_rows_before_uni_drop = len(liyab)
    # categories_to_drop = ['Still Enrolled / Did Not Graduate', 'Did Not Graduate', 'High School Graduate',
    #                       'Not Applicable', 'Prefer Not to Say', 'Not Specified', 'Invalid Entry (Year)', 'Invalid Entry (Number)']
    # liyab = liyab[~liyab['Cleaned University'].isin(categories_to_drop)]
    # print(f"\nRows removed based on cleaned university category: {original_rows_before_uni_drop - len(liyab)}")
else:
    print(f"Column '{uni_col}' not found or DataFrame is empty.")


Cleaned value counts for 'Cleaned University':
Cleaned University
Other University                                      1004
Ateneo de Manila University                            407
University of the Philippines Diliman                  389
De La Salle University Manila                          235
Not Specified                                          187
University of Santo Tomas                              151
University of the Philippines Los Baños                 83
University of the Philippines (Unspecified Campus)      75
Polytechnic University of the Philippines               44
University of the Philippines Manila                    42
Far Eastern University                                  42
University of Baguio                                    28
Miriam College                                          25
Mapúa University                                        22
Asia Pacific College                                    19
San Beda University                             

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  liyab['Cleaned University'].fillna('Not Specified', inplace=True)


#### 4. Industry Cleaning

In [7]:
!pip install sentence-transformers scikit-learn

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence-transformers)
 

In [8]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict

industry_col = 'In what industry was this job?'
if not liyab.empty and industry_col in liyab.columns:
    # Normalize function
    def normalize_industry(s):
        if not isinstance(s, str): return ''
        s = s.lower().strip()
        s = re.sub(r'[^a-z0-9\s]', '', s)  # remove punctuation
        s = re.sub(r'\s+', ' ', s).strip() # remove extra spaces and trim
        return s

    # Define master categories
    master_categories = [
        "Accountancy, Banking and Finance",
        "Business Process Outsourcing (BPO)", # Added BPO
        "Business, Consulting and Management",
        "Charity and Voluntary Work (NGO)",
        "Creative Arts, Design and Media", # Combined Media
        "Energy and Utilities",
        "Engineering and Manufacturing",
        "Environment and Agriculture",
        "Healthcare and Pharmaceuticals", # Combined Science/Pharma
        "Hospitality, Events, Leisure, Sport and Tourism", # Combined related fields
        "Information Technology (IT)",
        "Law and Legal Services",
        "Law Enforcement and Security",
        "Marketing, Advertising and Public Relations (PR)",
        "Property and Construction",
        "Public Services and Administration (Government)",
        "Recruitment and Human Resources (HR)",
        "Retail and E-commerce", # Added E-commerce
        "Sales",
        "Education and Training", # Renamed for clarity
        "Transport and Logistics",
        "Other"
    ]

    # Load SentenceTransformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    category_embeddings = model.encode(master_categories)

    # Prepare unique normalized entries for embedding
    liyab[industry_col].fillna('', inplace=True)
    unique_original_entries = liyab[industry_col].unique()
    normalized_to_original_map = defaultdict(list)
    for entry in unique_original_entries:
        normalized = normalize_industry(entry)
        if normalized: # Only consider non-empty normalized strings
             normalized_to_original_map[normalized].append(entry)

    unique_normalized_entries = [n for n in normalized_to_original_map.keys() if n] # Ensure no empty strings

    normalized_entry_to_category = {}
    if unique_normalized_entries:
        entry_embeddings = model.encode(unique_normalized_entries)
        # Assign each unique normalized entry to the closest master category
        for i, entry_vector in enumerate(entry_embeddings):
            similarities = cosine_similarity([entry_vector], category_embeddings)[0]
            best_category_idx = np.argmax(similarities)
            best_category = master_categories[best_category_idx]
            normalized_entry_to_category[unique_normalized_entries[i]] = best_category

    # Manual overrides for normalized keys (applied after similarity mapping)
    manual_overrides = {
        normalize_industry("bpo"): "Business Process Outsourcing (BPO)",
        normalize_industry("call center"): "Business Process Outsourcing (BPO)",
        normalize_industry("kpo"): "Business Process Outsourcing (BPO)",
        normalize_industry("shared services"): "Business Process Outsourcing (BPO)",
        normalize_industry("itbpo"): "Information Technology (IT)", # Or BPO, depends on definition
        normalize_industry("software engineering"): "Information Technology (IT)",
        normalize_industry("fintech"): "Accountancy, Banking and Finance",
        normalize_industry("ecommerce"): "Retail and E-commerce",
        normalize_industry("government"): "Public Services and Administration (Government)",
        normalize_industry("ngo"): "Charity and Voluntary Work (NGO)",
        normalize_industry("real estate"): "Property and Construction",
        normalize_industry("academe"): "Education and Training",
        normalize_industry("education"): "Education and Training",
        normalize_industry("teaching"): "Education and Training",
        normalize_industry("research"): "Other", # Could be IT, Science, etc. Needs context or map to specific if clear
        normalize_industry("architecture"): "Property and Construction",
        normalize_industry("construction"): "Property and Construction",
        normalize_industry("advertising"): "Marketing, Advertising and Public Relations (PR)",
        normalize_industry("media"): "Creative Arts, Design and Media",
        normalize_industry("telecommunications"): "Information Technology (IT)", # Often grouped with IT
        normalize_industry("pharmaceutical"): "Healthcare and Pharmaceuticals",
        normalize_industry("aviation"): "Transport and Logistics",
        normalize_industry("automotive"): "Engineering and Manufacturing", # Or Sales if dealer
        normalize_industry("food and beverage"): "Hospitality, Events, Leisure, Sport and Tourism", # Or Manufacturing if production
        normalize_industry("fmcg"): "Retail and E-commerce", # Or Sales/Manufacturing
        # Entries that are clearly not industries
        normalize_industry("2020"): "Other",
        normalize_industry("na"): "Other",
        normalize_industry("none"): "Other"
    }
    for norm_key, override_cat in manual_overrides.items():
        if norm_key: # Ensure key is not empty
            normalized_entry_to_category[norm_key] = override_cat

    # Map back to the original DataFrame
    def map_to_final_industry_category(original_value):
        if pd.isna(original_value) or original_value.strip() == '':
            return 'Not Specified'
        normalized_val = normalize_industry(original_value)
        if not normalized_val:
            return 'Not Specified'
        return normalized_entry_to_category.get(normalized_val, 'Other') # Default for unmapped

    liyab['Cleaned Industry'] = liyab[industry_col].apply(map_to_final_industry_category)

    print("\nCleaned value counts for 'Cleaned Industry':")
    print(liyab['Cleaned Industry'].value_counts(dropna=False))

    # Display original entries for a specific cleaned category for review
    # print("\nOriginal entries for 'Other' category:")
    # for norm_val, orig_vals in normalized_to_original_map.items():
    #    if normalized_entry_to_category.get(norm_val) == 'Other':
    #        print(f"  Normalized: '{norm_val}' -> Original(s): {orig_vals}")
else:
    print(f"Column '{industry_col}' not found or DataFrame is empty.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  liyab[industry_col].fillna('', inplace=True)



Cleaned value counts for 'Cleaned Industry':
Cleaned Industry
Accountancy, Banking and Finance                    330
Business Process Outsourcing (BPO)                  247
Engineering and Manufacturing                       227
Retail and E-commerce                               205
Other                                               199
Education and Training                              198
Marketing, Advertising and Public Relations (PR)    178
Information Technology (IT)                         169
Healthcare and Pharmaceuticals                      169
Public Services and Administration (Government)     166
Property and Construction                           163
Creative Arts, Design and Media                     152
Sales                                               141
Business, Consulting and Management                  64
Charity and Voluntary Work (NGO)                     60
Transport and Logistics                              54
Hospitality, Events, Leisure, Sport and T

#### 5. Other Date Column Validation (Placeholder)

In [9]:
# Placeholder for other date validations
# Example: If a 'Timestamp' or 'Graduation Date' column exists
if not liyab.empty:
    if 'Timestamp' in liyab.columns: # Common in Google Form exports
        liyab['Timestamp'] = pd.to_datetime(liyab['Timestamp'], errors='coerce')
        print(f"'Timestamp' column converted to datetime. NaT count: {liyab['Timestamp'].isnull().sum()}")

    grad_year_col = 'What year did you graduate?' # Assuming this column name might exist
    if grad_year_col in liyab.columns:
        liyab[grad_year_col] = pd.to_numeric(liyab[grad_year_col], errors='coerce')
        # Add similar filtering as 'What year did you start your first job?' if needed
        # liyab.dropna(subset=[grad_year_col], inplace=True)
        # liyab = liyab[liyab[grad_year_col].between(1980, 2025)] # Example range
        print(f"'{grad_year_col}' column converted to numeric. NaN count: {liyab[grad_year_col].isnull().sum()}")
else:
    print("DataFrame is empty.")

'Timestamp' column converted to datetime. NaT count: 0


#### 6. Handling Missing Values (Post Cleaning)

In [10]:
if not liyab.empty:
    print("Missing values after initial cleaning steps:")
    print(liyab.isnull().sum())

    # Define key columns that are essential for the analysis
    # The target variable (salary) is crucial. Its name needs to be identified.
    # Let's assume it's 'What is your current MONTHLY salary in PHP?' or similar
    # For this example, I'll use a placeholder name 'Monthly Salary PHP'
    # You'll need to replace 'Monthly Salary PHP' with the actual salary column name.

    salary_column_guess = next((col for col in liyab.columns if 'salary' in col.lower() and 'monthly' in col.lower()), None)
    if salary_column_guess:
        print(f"Guessed salary column: {salary_column_guess}")
        # Convert salary to numeric, removing non-numeric characters like commas, 'PHP'
        if liyab[salary_column_guess].dtype == 'object':
            liyab[salary_column_guess] = liyab[salary_column_guess].astype(str).str.replace(r'[^\d.]', '', regex=True)
            liyab[salary_column_guess] = pd.to_numeric(liyab[salary_column_guess], errors='coerce')

        key_columns_for_analysis = [salary_column_guess, 'Cleaned University', 'Cleaned Industry', 'What year did you start your first job?']
        # Ensure all key columns actually exist in the DataFrame before trying to drop NaNs
        key_columns_present = [col for col in key_columns_for_analysis if col in liyab.columns]

        if key_columns_present:
            original_rows = len(liyab)
            liyab.dropna(subset=key_columns_present, inplace=True)
            print(f"\nRows removed due to missing values in key columns ({', '.join(key_columns_present)}): {original_rows - len(liyab)}")
        else:
            print("\nCould not find all key columns for NaN removal.")

        # Further, remove rows where Cleaned University indicates non-graduates or invalid entries
        if 'Cleaned University' in liyab.columns:
            original_rows = len(liyab)
            uni_categories_to_drop = ['Still Enrolled', 'Still Enrolled / Did Not Graduate', 'Did Not Graduate',
                                      'High School Graduate', 'Not Applicable', 'Prefer Not to Say',
                                      'Not Specified', 'Invalid Entry (Year)', 'Invalid Entry (Number)']
            liyab = liyab[~liyab['Cleaned University'].isin(uni_categories_to_drop)]
            print(f"Rows removed due to unsuitable university categories: {original_rows - len(liyab)}")
    else:
        print("\nCould not identify the salary column automatically. Skipping NaN removal based on salary.")

    print("\nMissing values after NaN strategy:")
    print(liyab.isnull().sum())
    print(f"\nFinal shape of the cleaned DataFrame: {liyab.shape}")
else:
    print("DataFrame is empty.")

Missing values after initial cleaning steps:
Timestamp                                                                                     0
What year did you start your first job?                                                       0
In what industry was this job?                                                                0
What was your role?                                                                           0
What was your monthly salary (in PHP)?                                                        0
What school did you graduate from?                                                          167
What is your gender?                                                                        130
Did you negotiate your job offer?                                                            54
If you can provide additional context to any of your answers above, you can do so here.    1761
Cleaned Gender                                                                             

#### 7. Final Review and Column Selection (Example)

In [11]:
if not liyab.empty:
    print("Cleaned DataFrame Head:")
    print(liyab.head())

    # List of columns to keep for analysis/modeling
    # This would depend on the features you intend to use.
    # Original columns that were cleaned should be replaced by their 'Cleaned' versions.
    columns_to_keep = []
    if salary_column_guess and salary_column_guess in liyab.columns: columns_to_keep.append(salary_column_guess)
    if 'Cleaned University' in liyab.columns: columns_to_keep.append('Cleaned University')
    if 'Cleaned Industry' in liyab.columns: columns_to_keep.append('Cleaned Industry')
    if 'Cleaned Gender' in liyab.columns: columns_to_keep.append('Cleaned Gender')
    if 'What year did you start your first job?' in liyab.columns: columns_to_keep.append('What year did you start your first job?')
    # Add other relevant columns like 'Timestamp', 'What course did you take in college?', etc.
    # For example, if 'What course did you take in college?' is also cleaned into 'Cleaned Course'
    # columns_to_keep.append('Cleaned Course')

    # Ensure all columns in columns_to_keep actually exist
    final_columns = [col for col in columns_to_keep if col in liyab.columns]

    if final_columns:
        liyab_final = liyab[final_columns].copy()
        print("\nFinal selected DataFrame for modeling (liyab_final):")
        print(liyab_final.head())
        print(f"Shape of liyab_final: {liyab_final.shape}")
    else:
        print("\nNo columns selected for the final DataFrame. Check column names and cleaning steps.")
        liyab_final = pd.DataFrame() # Empty df
else:
    print("DataFrame is empty, no final review possible.")
    liyab_final = pd.DataFrame() # Empty df

Cleaned DataFrame Head:
            Timestamp  What year did you start your first job?  \
0 2020-01-21 12:57:30                                     2014   
1 2020-01-21 12:59:09                                     2017   
2 2020-01-21 13:31:10                                     2016   
3 2020-01-21 13:34:33                                     2011   
4 2020-01-21 14:34:42                                     2008   

  In what industry was this job?      What was your role?  \
0                        Banking             HR Associate   
1                        Fintech     Business Development   
2                        Academe           Junior Partner   
3                        Banking  Resourcing & Compliance   
4                Market Research       Research Associate   

   What was your monthly salary (in PHP)? What school did you graduate from?  \
0                                 18000.0                         UP Diliman   
1                                 15000.0           

### **Original Notebook Cells (For Reference - May be outdated or partially integrated above)**

In [12]:
# !pip install rapidfuzz scikit-learn # Moved to industry cleaning cell

In [13]:
# This cell for university clustering was for exploration.
# The implemented cleaning uses a more direct mapping approach.
# from rapidfuzz.distance import Levenshtein
# from sklearn.cluster import AgglomerativeClustering
# import numpy as np
# strings = liyab['What school did you graduate from?'].unique() # Example, use original unique names
# # ... rest of the clustering code ...
# print("University clustering output (for reference only, not used in final cleaning pipeline):")

In [14]:
# Original value counts for universities (for reference)
# universities_count = liyab_original_backup['What school did you graduate from?'].value_counts().sort_index()
# print(universities_count)

In [15]:
# Original formatted university counts (for reference)
# universities_formatted = liyab_original_backup['What school did you graduate from?'].astype(str).str.lower().str.strip()
# universities_formatted = universities_formatted.apply(lambda x: re.sub(r'[^a-z0-9\s]', '', x))
# print(universities_formatted.value_counts())