**Feature Engineering Pipeline**

**Null Handling Strategy:**
- Text columns: Replace nulls with empty string `""`
- Categorical columns: Replace nulls with `"unknown"`
- Numerical columns (salary): Missing or unparseable ranges become `0` for both `salary_min` and `salary_max`
- Embeddings: Missing text is embedded as the empty string rather than replaced with a zero vector
- Validation: Check for nulls after each major transformation stage

**Pipeline Steps:**
1. **Null Handling** (Upfront standardization)
2. Deduplication
3. **Pattern Features** (On raw text - URLs, emails, phone numbers, symbols)
4. Standardize & Clean Text
5. Split Joined Words (CamelCase and wordninja)
6. **Embeddings** (Before stopword removal - sentence-transformers)
7. Stopword Removal
8. Lemmatization
9. **TF-IDF** (Word + Char n-grams - fit on train, transform on test)
10. **TruncatedSVD** (Dimensionality reduction: 25K → 500 features, ~62% variance retained)
11. Parse Salary (min/max)
12. Parse Location (country/state/city)
13. Structured Features (categorical, binary)
14. **Expand Embeddings** (Convert arrays to individual columns)
15. Append TF-IDF SVD Features & Final Validation
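
The ordering of steps 3 and 4 matters: normalization strips URLs, digits, and symbols, so pattern counts have to be taken on the raw text first. The sketch below reuses two regexes from `addWordPatterns` and the cleaning steps from `normalize_text`; the sample sentence is made up for illustration.

```python
import re

# Patterns copied from addWordPatterns; the sample text is invented.
URL_PATTERN = r'http[s]?://\S+'
MONEY_PATTERN = r'[$€£¥]'

def normalize_text(text):
    """Same cleaning steps as the notebook's normalize_text."""
    text = text.lower().strip()
    text = re.sub(r"http\S+|www\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s']", " ", text)        # drop digits, symbols, punctuation
    return re.sub(r"\s+", " ", text).strip()

raw = "Earn $500 a day! Apply at http://example.com/jobs today"
cleaned = normalize_text(raw)

# Count each pattern before and after normalization.
raw_counts = (len(re.findall(URL_PATTERN, raw)), len(re.findall(MONEY_PATTERN, raw)))
clean_counts = (len(re.findall(URL_PATTERN, cleaned)), len(re.findall(MONEY_PATTERN, cleaned)))

print(raw_counts)    # (1, 1)
print(clean_counts)  # (0, 0)
print(cleaned)       # earn a day apply at today
```

Counting after normalization would give all-zero pattern features, which is why the pipeline extracts them at step 3.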

**Train/Test Split Handling:**
- **TF-IDF Vectorizers**: Fitted on training data, reused for test data (no leakage)
- **TruncatedSVD Model**: Fitted on training data, reused for test data (no leakage)
- Ensures consistent feature space between train and test sets
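
A minimal sketch of the fit-on-train, transform-on-test pattern, assuming scikit-learn is installed. The toy corpora and the two-component SVD are illustrative only; the notebook itself uses separate word and character vectorizers per column and 500 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for illustration.
train_texts = [
    "earn money fast from home",
    "software engineer python backend",
    "work from home earn cash",
    "senior python developer backend team",
]
test_texts = ["earn cash fast", "python engineer wanted"]

# Fit both transformers on training data only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
svd = TruncatedSVD(n_components=2, random_state=42)
X_train_svd = svd.fit_transform(X_train)

# Reuse the fitted objects on test data: transform only, never fit.
X_test = vectorizer.transform(test_texts)
X_test_svd = svd.transform(X_test)

print(X_train_svd.shape, X_test_svd.shape)   # (4, 2) (2, 2)
print("wanted" in vectorizer.vocabulary_)    # False: unseen test words are ignored
```

Because the vocabulary and the SVD projection come from the training set alone, test-only words such as `wanted` contribute nothing, and both sets end up with the same columns.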

**Output Features:**
- Pattern Features: ~20 features (URL/email/phone/symbol counts)
- Salary Features: 2 features (min, max)
- Binary Features: 3 features (telecommuting, has_company_logo, has_questions)
- Categorical Features: 9 features (employment_type, experience, education, industry, function, department, location × 3)
- Embedding Features: 1,536 features (384 dims × 4 text columns)
- TF-IDF SVD Features: 500 features (reduced from 25,000)
- **Total: ~2,070 features** (down from ~26,500 before SVD reduction)
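
The totals above can be checked with a few lines of arithmetic. The counts come from this notebook (5 regex patterns over 4 text columns, `max_features` of 2,000 word and 3,000 character n-grams for each of 5 text columns); the variable names are illustrative only.

```python
# Feature-count arithmetic for the summary above (names are illustrative).
pattern_features = 5 * 4        # 5 regex patterns x 4 text columns
salary_features = 2             # salary_min, salary_max
binary_features = 3             # telecommuting, has_company_logo, has_questions
categorical_features = 9
embedding_features = 384 * 4    # embedding dims x 4 embedded text columns
tfidf_raw = (2000 + 3000) * 5   # word + char max_features x 5 text columns
tfidf_svd = 500                 # TruncatedSVD components

structured = pattern_features + salary_features + binary_features + categorical_features
total_after_svd = structured + embedding_features + tfidf_svd
total_before_svd = structured + embedding_features + tfidf_raw

print(total_after_svd)    # 2070
print(total_before_svd)   # 26570
```

The final DataFrame reports 2,072 columns; the two extra columns are consistent with `job_id` and the `fraudulent` label being carried through alongside the features.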

In [1]:
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack
from transformers import AutoTokenizer, AutoModel

# NLP preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, wordpunct_tokenize
nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
import wordninja

lemmatizer = WordNetLemmatizer()
STOPWORDS = set(stopwords.words('english'))

TEXT_COLS = ['title', 'description', 'requirements', 'benefits', 'company_profile']
CATEGORICAL_COLS = ['employment_type', 'required_experience', 'required_education', 'industry', 'function', 'location', 'department']
BINARY_COLS = ['telecommuting', 'has_company_logo', 'has_questions', 'fraudulent']

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alden\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\alden\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alden\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\alden\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\alden\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
def handle_nulls_upfront(df):
    """
    Handle all null values upfront before any processing.
    This ensures consistent null handling throughout the pipeline.
    
    Strategy:
    - Text columns → empty string ""
    - Categorical columns → "unknown"
    - salary_range → "unknown"
    - Binary columns → keep as-is (will be handled during model training)
    """
    df_clean = df.copy()
    
    # Text columns: replace with empty string
    for col in TEXT_COLS:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].fillna('').astype(str)
            # Also handle explicit null-like strings
            df_clean[col] = df_clean[col].replace(['nan', 'NaN', 'None', 'none'], '')
    
    # Categorical columns: replace with 'unknown'
    for col in CATEGORICAL_COLS:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].fillna('unknown').astype(str)
            # Handle various null-like values
            null_like = ['Not Applicable', 'NaN', 'not applicable', 'Unspecified', 
                        'Other', 'Others', 'none', 'na', 'n/a', '', ' ', 'nan', 'None']
            df_clean[col] = df_clean[col].replace(null_like, 'unknown')
    
    # Salary range: replace with 'unknown'
    if 'salary_range' in df_clean.columns:
        df_clean['salary_range'] = df_clean['salary_range'].fillna('unknown').astype(str)
    
    return df_clean


def validate_nulls(df, stage_name=""):
    """
    Validate that no unexpected nulls exist after processing.
    Prints per-column null counts and returns True if no nulls were found, False otherwise.
    """
    null_counts = df.isnull().sum()
    has_nulls = null_counts.any()
    
    if has_nulls:
        print(f"Nulls found after {stage_name}:")
        print(null_counts[null_counts > 0])
        return False
    else:
        print(f"No nulls found after {stage_name}")
        return True

In [3]:
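def cleanAndDeduplicate(df):
    """
    Drop duplicate postings that share location, title, description,
    requirements, and (simplified) employment type.
    """
    df_cleaning = df.copy()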

    def simplify_employment_type(x):
        if not isinstance(x, str) or x == 'unknown':
            return 'unknown'
        
        x = x.strip().lower()
        if x in ['full-time', 'part-time']:
            return x
        elif x in ['contract', 'temporary']:
            return 'non-permanent'
        else:
            return 'unknown'
    
    df_cleaning['employment_type_clean'] = df_cleaning['employment_type'].apply(simplify_employment_type)

    def comparison_key(row):
        emp = None if row['employment_type_clean'] == 'unknown' else row['employment_type_clean']
        return (row['location'], row['title'], row['description'], row['requirements'], emp)

    df_cleaning['dedup_key'] = df_cleaning.apply(comparison_key, axis=1)
    df_deduped = df_cleaning.drop_duplicates(subset=['dedup_key'])
    
    print(f"Removed {len(df_cleaning) - len(df_deduped)} duplicate rows")
    return df_deduped

In [4]:
def check_corpus(df, text_cols):
    corpus_stats = {}

    for col in text_cols:
        texts = df[col].fillna("").astype(str).str.lower().tolist()

        tokens = []
        for t in texts:
            tokens.extend([w for w in word_tokenize(t) if len(w) > 2])

        corpus_stats[col] = len(set(tokens))

    print(corpus_stats)
    return corpus_stats

In [5]:
def apply_text_normalization(df, text_cols):
    """
    Normalize text: lowercase, remove URLs, punctuation, extra whitespace.
    
    Note: Assumes nulls have been handled - all text should be strings.
    Empty strings remain empty and are handled properly.
    """
    def normalize_text(text: str) -> str:
        # Handle empty strings (from nulls)
        if not text or not text.strip():
            return ""

        text = text.lower().strip()
        text = re.sub(r"http\S+|www\S+", " ", text)  # remove URLs
        text = re.sub(r"[^a-z\s']", " ", text)  # remove punctuation/numbers except apostrophes
        text = re.sub(r"\s+", " ", text).strip()

        return text
    
    for col in text_cols:
        tqdm.pandas(desc=f"Normalizing {col}")
        df[col] = df[col].progress_apply(normalize_text)
    
    print("Text normalization complete")
    return df


def apply_split_df(df, text_cols):
    """
    Split CamelCase and joined words using wordninja.
    Examples: 'SmartContract' -> 'Smart Contract', 'makemoney' -> 'make money'
    """
    def split_camel_case(token):
        """Splits CamelCase tokens: 'SmartContract' -> ['Smart', 'Contract']"""
        return re.sub('([a-z])([A-Z])', r'\1 \2', token).split()

    def split_joined_words(text, min_len=10):
        # Handle empty strings
        if not text or not text.strip():
            return ""
        
        tokens = text.split()
        new_tokens = []

        for token in tokens:
            # Skip short tokens
            if len(token) < min_len:
                new_tokens.append(token)
                continue

            # 1. Try CamelCase split
            camel_split = split_camel_case(token)

            if len(camel_split) > 1:
                # After splitting CamelCase, apply wordninja to each part
                final_parts = []
                for part in camel_split:
                    wn = wordninja.split(part)
                    final_parts.extend(wn)
                new_tokens.extend(final_parts)
                continue

            # 2. If no CamelCase, try wordninja directly
            wn = wordninja.split(token)
            if len(wn) > 1:
                new_tokens.extend(wn)
            else:
                new_tokens.append(token)

        return " ".join(new_tokens)

    for col in text_cols:
        tqdm.pandas(desc=f"Splitting joined words in {col}")
        df[col] = df[col].progress_apply(split_joined_words)
    
    print("Word splitting complete")
    return df
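
One detail worth checking is how the CamelCase branch interacts with the pipeline order. In `preprocess_df`, text is lowercased by `apply_text_normalization` before `apply_split_df` runs, so the lower-to-upper boundary that the regex looks for is no longer present. A small sketch using the same regex as the helper above:

```python
import re

def split_camel_case(token):
    """Same regex as the notebook helper: split at a lower-to-upper boundary."""
    return re.sub('([a-z])([A-Z])', r'\1 \2', token).split()

print(split_camel_case("SmartContract"))           # ['Smart', 'Contract']
print(split_camel_case("SmartContract".lower()))   # ['smartcontract']
```

On already-lowercased text only the wordninja branch can split a joined token; the CamelCase branch would take effect if splitting ran on the raw text instead.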

In [6]:
def remove_stopwords_df(df, text_cols):
    """
    Remove English stopwords from multiple text columns.
    Handles empty strings gracefully.
    """
    def remove_stopwords_text(text):
        # Handle empty strings
        if not text or not text.strip():
            return ""
        
        tokens = word_tokenize(text.lower())
        clean_tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
        return " ".join(clean_tokens)
    
    for col in text_cols:
        tqdm.pandas(desc=f"Removing stopwords in {col}")
        df[col] = df[col].progress_apply(remove_stopwords_text)
    
    print("Stopword removal complete")
    return df

In [7]:
def lemmatize_df(df, text_cols):
    """
    Lemmatize multiple text columns.
    Handles empty strings gracefully.
    """
    def lemmatize_text(text):
        # Handle empty strings
        if not text or not text.strip():
            return ""
        
        tokens = word_tokenize(text.lower())
        lemmas = [lemmatizer.lemmatize(t) for t in tokens if len(t) > 2]
        return " ".join(lemmas)
    
    for col in text_cols:
        tqdm.pandas(desc=f"Lemmatizing {col}")
        df[col] = df[col].progress_apply(lemmatize_text)
    
    print("Lemmatization complete")
    return df

In [8]:
def addWordPatterns(df):
    patterns = {
        'urls': r'http[s]?://\S+',
        'emails': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone_numbers': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'money_symbols': r'[$€£¥]',
        'other_symbols': r'[©®™]',
    }
    
    def count_patterns(text, patterns):
        if not text or not text.strip():
            return {k: 0 for k in patterns.keys()}
        return {k: len(re.findall(pat, str(text))) for k, pat in patterns.items()}

    for col in ['description', 'company_profile', 'requirements', 'benefits']:
        if col in df.columns:
            feat_df = df[col].apply(lambda x: count_patterns(x, patterns))
            feat_df = pd.DataFrame(list(feat_df), index=df.index).add_prefix(f'{col}_')
            df = pd.concat([df, feat_df], axis=1)

    print("Pattern features added")
    return df

def parse_salary_range(df):
    MONTH_MAP = {
        'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
        'jul': 7, 'aug': 8, 'sep': 9, 'sept': 9, 'oct': 10, 'nov': 11, 'dec': 12
    }   
    
    def parse_salary(s):
        if not s or str(s).lower() == 'unknown':
            return (0, 0)

        s = str(s).strip()
        
        if '-' not in s:
            try: 
                value = int(s)
                return (value, value)
            except ValueError: 
                return (0, 0)

        left, right = [v.strip() for v in s.split('-', 1)]

        def to_number(v):
            # Digits parse directly; month abbreviations map to 1-12; anything else is None
            return int(v) if v.isdigit() else MONTH_MAP.get(v.lower())

        l_val, r_val = to_number(left), to_number(right)
        
        if l_val is not None and r_val is not None:
            return (l_val, r_val)
        else:
            return (0, 0)

    df[['salary_min', 'salary_max']] = df['salary_range'].apply(
        lambda x: pd.Series(parse_salary(x))
    )
    
    print("Salary parsing complete")
    return df
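
The salary rules are easiest to see on a few sample strings. The sketch below is a slightly simplified standalone copy of the parsing logic (it uses `isdigit()` in place of the `try`/`except`), and the inputs are made up. The month map suggests that some ranges in the data look like dates, for example `Oct-20`; that reading is an assumption, not something the notebook states.

```python
MONTH_MAP = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'sept': 9, 'oct': 10, 'nov': 11, 'dec': 12,
}

def parse_salary(s):
    """Simplified standalone copy of the notebook's parsing rules."""
    s = str(s).strip()
    if not s or s.lower() == 'unknown':
        return (0, 0)
    if '-' not in s:
        # A single number is used for both min and max.
        return (int(s), int(s)) if s.isdigit() else (0, 0)
    left, right = [v.strip() for v in s.split('-', 1)]

    def to_number(v):
        # Digits parse directly; month abbreviations map to 1-12; else None.
        return int(v) if v.isdigit() else MONTH_MAP.get(v.lower())

    low, high = to_number(left), to_number(right)
    return (low, high) if low is not None and high is not None else (0, 0)

for raw in ["20000-28000", "50000", "unknown", "Oct-20", "negotiable"]:
    print(raw, "->", parse_salary(raw))
```

Note that a date-like value such as `Oct-20` parses to `(10, 20)`, which is far smaller than a real salary, and that `0` stands for both "missing" and "unparseable".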

In [9]:
from sentence_transformers import SentenceTransformer

def sentenceEmbedding(df, text_cols, model_name='all-MiniLM-L6-v2', device=None):
    model = SentenceTransformer(model_name, device=device)
    
    for col in text_cols:
        tqdm.pandas(desc=f"Embedding {col}")
        # Generate embeddings as numpy arrays
        embeddings = df[col].progress_apply(lambda x: model.encode(str(x)) if pd.notna(x) else np.zeros(model.get_sentence_embedding_dimension()))
        df[f"{col}_embedding"] = embeddings
    
    return df

In [None]:
def expand_embeddings(df, embedding_cols):
    """
    Expand embedding arrays (stored as numpy arrays in cells) into separate columns.
    """
    print("\n--- Expanding Embeddings ---")
    
    for col in embedding_cols:
        if col not in df.columns:
            print(f"Warning: {col} not found in dataframe")
            continue

        embedding_matrix = np.vstack(df[col].values)
        n_dims = embedding_matrix.shape[1]
        col_names = [f"{col}_dim_{i}" for i in range(n_dims)]
        
        # df from embeddings
        embedding_df = pd.DataFrame(embedding_matrix, columns=col_names, index=df.index)
        df = pd.concat([df, embedding_df], axis=1)
        df = df.drop(columns=[col])
        
        print(f"Expanded {col}: {n_dims} dimensions")
    
    print("Embedding expansion complete")
    return df

In [11]:
def build_tfidf(
    df, 
    text_cols, 
    word_ngrams=(1, 2), 
    char_ngrams=(3, 5),
    min_df=3,            # ignore terms appearing in <3 docs
    max_df=0.8,          # ignore terms appearing in >80% of docs
    vectorizers=None     # Pre-fitted vectorizers for test data
):
    """
    Build TF-IDF features for text columns.
    Creates both word-level and character-level n-grams.
    """
    tfidf_results = {}
    vectorizers_out = {} if vectorizers is None else vectorizers
    is_training = vectorizers is None

    for col in text_cols:
        print(f'Processing {col}...')
        
        # Ensure all values are strings (should already be from null handling)
        text_data = df[col].astype(str)

        if is_training:
            # Training: fit new vectorizers
            word_vectorizer = TfidfVectorizer(
                tokenizer=word_tokenize,
                token_pattern=None,
                ngram_range=word_ngrams,
                max_features=2000,
                min_df=min_df,
                max_df=max_df
            )
            word_vec = word_vectorizer.fit_transform(text_data)

            char_vectorizer = TfidfVectorizer(
                analyzer='char_wb',
                ngram_range=char_ngrams,
                max_features=3000,
                min_df=min_df,
                max_df=max_df
            )
            char_vec = char_vectorizer.fit_transform(text_data)
            
            # Store vectorizers for reuse
            vectorizers_out[f'{col}_word'] = word_vectorizer
            vectorizers_out[f'{col}_char'] = char_vectorizer
        else:
            # Test: use pre-fitted vectorizers
            word_vectorizer = vectorizers_out[f'{col}_word']
            word_vec = word_vectorizer.transform(text_data)
            
            char_vectorizer = vectorizers_out[f'{col}_char']
            char_vec = char_vectorizer.transform(text_data)

        tfidf_results[col] = {
            "word_tfidf": word_vec,
            "char_tfidf": char_vec,
            "word_features": word_vectorizer.get_feature_names_out(),
            "char_features": char_vectorizer.get_feature_names_out()
        }

    if is_training:
        print("TF-IDF feature extraction complete (fitted new vectorizers)")
    else:
        print("TF-IDF feature extraction complete (used pre-fitted vectorizers)")
    
    return tfidf_results, vectorizers_out

In [12]:
def merge_tfidf_results(tfidf_results):
    """
    Merge all TF-IDF matrices (word and char) into a single sparse matrix.
    Returns both the sparse matrix and a DataFrame representation.
    """
    all_matrices = []
    all_feature_names = []

    for col, result in tfidf_results.items():
        word_features = [f"{col}_word_{f}" for f in result["word_features"]]
        all_feature_names.extend(word_features)
        all_matrices.append(result["word_tfidf"])
        
        char_features = [f"{col}_char_{f}" for f in result["char_features"]]
        all_feature_names.extend(char_features)
        all_matrices.append(result["char_tfidf"])

    # Combine all sparse matrices horizontally
    combined_matrix = hstack(all_matrices).tocsr()

    # Create a sparse DataFrame (efficient for large feature sets)
    tfidf_df = pd.DataFrame.sparse.from_spmatrix(combined_matrix, columns=all_feature_names)

    print(f"TF-IDF merge complete: {combined_matrix.shape[0]} samples × {combined_matrix.shape[1]} features")
    return combined_matrix, tfidf_df, all_feature_names

In [None]:
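def apply_svd_reduction(tfidf_matrix, n_components=500, random_state=42):
    """
    Apply TruncatedSVD to reduce TF-IDF dimensionality.
    Returns the reduced matrix and the fitted SVD model (reused for test data).
    """
    print(f"Applying TruncatedSVD: {tfidf_matrix.shape[1]} features -> {n_components} components")
    
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    svd_matrix = svd.fit_transform(tfidf_matrix)

    # Report how much variance the retained components capture
    explained_variance = svd.explained_variance_ratio_.sum()
    print(f"Explained variance: {explained_variance:.2%}")
    print(f"SVD output shape: {svd_matrix.shape}")
    return svd_matrix, svd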

In [14]:
def parse_location(df):
    def parse_location_parts(loc):
        if not isinstance(loc, str) or loc == 'unknown':
            return ("unknown", "unknown", "unknown")
        
        # Empty parts (e.g. "GB, , London") become "unknown", matching the null strategy
        parts = [p.strip() or "unknown" for p in loc.split(',')]
        while len(parts) < 3:
            parts.append("unknown")
        return (parts[0], parts[1], parts[2])
    
    df_loc = df['location'].apply(parse_location_parts)
    df['location_country'] = df_loc.apply(lambda x: x[0])
    df['location_state'] = df_loc.apply(lambda x: x[1])
    df['location_city'] = df_loc.apply(lambda x: x[2])
    
    print("Location parsing complete")
    return df
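
A standalone sketch of the splitting rule, with made-up location strings, shows how short and long values are handled: missing parts are padded with `"unknown"` and anything past the third part is dropped.

```python
def parse_location_parts(loc):
    """Standalone sketch of the notebook's country/state/city split."""
    if not isinstance(loc, str) or loc == 'unknown':
        return ("unknown", "unknown", "unknown")
    parts = [p.strip() for p in loc.split(',')]
    while len(parts) < 3:
        parts.append("unknown")   # pad missing state/city
    return (parts[0], parts[1], parts[2])

print(parse_location_parts("US, NY, New York"))
# ('US', 'NY', 'New York')
print(parse_location_parts("US"))
# ('US', 'unknown', 'unknown')
print(parse_location_parts("US, CA, San Francisco, Bay Area"))
# ('US', 'CA', 'San Francisco')
```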

In [15]:
dft = pd.read_csv('../data/fake_job_postings.csv')
binary_cols = [col for col in dft.columns if dft[col].nunique() == 2]
categorical_cols = [col for col in dft.columns if 2 < dft[col].nunique() < 150]
text_cols = [col for col in dft.columns if dft[col].dtype == 'object' and col not in categorical_cols + ['job_id']]

In [16]:
def preprocess_df(initial_df, n_svd_components=500, svd_model=None, tfidf_vectorizers=None):
    df = initial_df.copy()
    
    print(f"Starting with {len(df)} job postings")
    
    # STEP 1: Handle nulls upfront
    df = handle_nulls_upfront(df)
    validate_nulls(df, "initial null handling")
    
    # STEP 2: Deduplicate
    df = cleanAndDeduplicate(df)
    validate_nulls(df, "deduplication")
    
    # STEP 3: Add pattern features (on raw text)
    df = addWordPatterns(df)
    validate_nulls(df, "pattern features")
    
    # STEP 4: Normalize text
    print("\n--- Text Processing ---")
    df = apply_text_normalization(df, TEXT_COLS)
    
    # STEP 5: Split joined words
    df = apply_split_df(df, TEXT_COLS)
    
    # STEP 6: Generate embeddings (before stopword removal)
    print("\n--- Generating Embeddings ---")
    sentence_cols = ['description', 'requirements', 'benefits', 'company_profile']
    df = sentenceEmbedding(df, text_cols=sentence_cols)
    validate_nulls(df, "sentence embeddings")
    
    # STEP 7: Remove stopwords
    df = remove_stopwords_df(df, TEXT_COLS)
    
    # STEP 8: Lemmatize
    df = lemmatize_df(df, TEXT_COLS)
    
    # STEP 9: Build TF-IDF for all text columns
    print("\n--- Building TF-IDF Features ---")
    tfidf_results, tfidf_vectorizers = build_tfidf(df, TEXT_COLS, vectorizers=tfidf_vectorizers)
    tfidf_matrix, tfidf_df, tfidf_feature_names = merge_tfidf_results(tfidf_results)
    
    # STEP 10: Apply TruncatedSVD to reduce TF-IDF dimensions
    print("\n--- Applying TruncatedSVD Dimensionality Reduction ---")
    if svd_model is None:
        # Training data: fit new SVD model
        svd_matrix, svd_model = apply_svd_reduction(tfidf_matrix, n_components=n_svd_components)
        print("✓ Fitted new SVD model")
    else:
        # Test data: use pre-fitted SVD model
        svd_matrix = svd_model.transform(tfidf_matrix)
        explained_variance = svd_model.explained_variance_ratio_.sum()
        print("✓ Used pre-fitted SVD model")
        print(f"Explained variance: {explained_variance:.2%}")
        print(f"SVD output shape: {svd_matrix.shape}")
    
    # Convert SVD matrix to DataFrame
    svd_col_names = [f"tfidf_svd_{i}" for i in range(svd_matrix.shape[1])]
    svd_df = pd.DataFrame(svd_matrix, columns=svd_col_names, index=df.index)
    
    # STEP 11: Parse salary range
    df = parse_salary_range(df)
    validate_nulls(df, "salary parsing")
    
    # STEP 12: Parse location
    df = parse_location(df)
    validate_nulls(df, "location parsing")
    
    # STEP 13: Prepare final features
    categorical_cols = [
        'employment_type_clean', 
        'required_experience', 
        'required_education', 
        'industry', 
        'function',
        'department',
        'location_country', 
        'location_state', 
        'location_city'
    ]
    
    for col in categorical_cols:
        if col in df.columns:
            df[col] = df[col].astype(str)
    
    # Drop intermediate columns (processed text, original location/employment/salary)
    df_features = df.drop(
        columns=TEXT_COLS + ['dedup_key', 'location', 'employment_type', 'salary_range'], 
        errors='ignore'
    )
    
    # STEP 14: Expand embeddings into separate columns
    embedding_cols = [f'{col}_embedding' for col in sentence_cols]
    df_features = expand_embeddings(df_features, embedding_cols)
    
    # STEP 15: Add SVD-reduced TF-IDF features
    print("\n--- Adding TF-IDF SVD Features ---")
    df_features = pd.concat([df_features, svd_df], axis=1)
    print(f"Added {svd_matrix.shape[1]} TF-IDF SVD features")
    
    validate_nulls(df_features, "final processing")
    
    print("\n=== Final Feature Summary ===")
    print(f"Total features: {df_features.shape[1]}")
    print(f"Total samples: {df_features.shape[0]}")
    
    return df_features, svd_model, tfidf_vectorizers

In [17]:
## For downstream notebooks, we will create train and test datasets

df_raw = pd.read_csv('../data/fake_job_postings.csv')

from sklearn.model_selection import train_test_split
train_raw, test_raw = train_test_split(df_raw, test_size=0.2, random_state=42)

# training data 
df_train, svd_model, tfidf_vectorizers = preprocess_df(train_raw, n_svd_components=500)

# test data (reuses the fitted TF-IDF vectorizers and SVD model from training)
df_test, _, _ = preprocess_df(test_raw, svd_model=svd_model, tfidf_vectorizers=tfidf_vectorizers)

print("\n" + "=" * 80)
print("FINAL SHAPES")
print("=" * 80)
print(f"Training set: {df_train.shape}")
print(f"Test set: {df_test.shape}")

Starting with 14304 job postings
No nulls found after initial null handling
Removed 259 duplicate rows
No nulls found after deduplication
Pattern features added
No nulls found after pattern features

--- Text Processing ---


Normalizing title: 100%|██████████| 14045/14045 [00:00<00:00, 506731.01it/s]
Normalizing description: 100%|██████████| 14045/14045 [00:00<00:00, 26263.87it/s]
Normalizing requirements: 100%|██████████| 14045/14045 [00:00<00:00, 53580.28it/s]
Normalizing benefits: 100%|██████████| 14045/14045 [00:00<00:00, 131189.05it/s]
Normalizing company_profile: 100%|██████████| 14045/14045 [00:00<00:00, 49747.96it/s]


Text normalization complete


Splitting joined words in title: 100%|██████████| 14045/14045 [00:00<00:00, 89873.19it/s]
Splitting joined words in description: 100%|██████████| 14045/14045 [00:06<00:00, 2122.37it/s]
Splitting joined words in requirements: 100%|██████████| 14045/14045 [00:04<00:00, 3141.94it/s]
Splitting joined words in benefits: 100%|██████████| 14045/14045 [00:01<00:00, 12976.33it/s]
Splitting joined words in company_profile: 100%|██████████| 14045/14045 [00:02<00:00, 5121.75it/s]


Word splitting complete

--- Generating Embeddings ---


Embedding description: 100%|██████████| 14045/14045 [02:56<00:00, 79.69it/s]
Embedding requirements: 100%|██████████| 14045/14045 [02:03<00:00, 113.78it/s]
Embedding benefits: 100%|██████████| 14045/14045 [01:23<00:00, 167.67it/s]
Embedding company_profile: 100%|██████████| 14045/14045 [02:11<00:00, 106.50it/s]


No nulls found after sentence embeddings


Removing stopwords in title: 100%|██████████| 14045/14045 [00:00<00:00, 70395.50it/s]
Removing stopwords in description: 100%|██████████| 14045/14045 [00:03<00:00, 4425.88it/s]
Removing stopwords in requirements: 100%|██████████| 14045/14045 [00:01<00:00, 8883.41it/s]
Removing stopwords in benefits: 100%|██████████| 14045/14045 [00:00<00:00, 23911.37it/s]
Removing stopwords in company_profile: 100%|██████████| 14045/14045 [00:01<00:00, 8365.89it/s]


Stopword removal complete


Lemmatizing title: 100%|██████████| 14045/14045 [00:01<00:00, 8488.60it/s] 
Lemmatizing description: 100%|██████████| 14045/14045 [00:04<00:00, 3045.10it/s]
Lemmatizing requirements: 100%|██████████| 14045/14045 [00:02<00:00, 5902.99it/s]
Lemmatizing benefits: 100%|██████████| 14045/14045 [00:00<00:00, 16362.78it/s]
Lemmatizing company_profile: 100%|██████████| 14045/14045 [00:02<00:00, 6043.97it/s]


Lemmatization complete

--- Building TF-IDF Features ---
Processing title...
Processing description...
Processing requirements...
Processing benefits...
Processing company_profile...
TF-IDF feature extraction complete (fitted new vectorizers)
TF-IDF merge complete: 14045 samples × 25000 features

--- Applying TruncatedSVD Dimensionality Reduction ---
Applying TruncatedSVD: 25000 features -> 500 components
Explained variance: 62.30%
SVD output shape: (14045, 500)
✓ Fitted new SVD model
Salary parsing complete
No nulls found after salary parsing
Location parsing complete
No nulls found after location parsing

--- Expanding Embeddings ---
Expanded description_embedding: 384 dimensions
Expanded requirements_embedding: 384 dimensions
Expanded benefits_embedding: 384 dimensions
Expanded company_profile_embedding: 384 dimensions
Embedding expansion complete

--- Adding TF-IDF SVD Features ---
Added 500 TF-IDF SVD features
No nulls found after final processing

=== Final Feature Summary ===
To

Normalizing title: 100%|██████████| 3561/3561 [00:00<00:00, 446661.58it/s]
Normalizing description: 100%|██████████| 3561/3561 [00:00<00:00, 27599.51it/s]
Normalizing requirements: 100%|██████████| 3561/3561 [00:00<00:00, 54722.95it/s]
Normalizing benefits: 100%|██████████| 3561/3561 [00:00<00:00, 133781.63it/s]
Normalizing company_profile: 100%|██████████| 3561/3561 [00:00<00:00, 51895.96it/s]


Text normalization complete


Splitting joined words in title: 100%|██████████| 3561/3561 [00:00<00:00, 91994.29it/s]
Splitting joined words in description: 100%|██████████| 3561/3561 [00:01<00:00, 2298.53it/s]
Splitting joined words in requirements: 100%|██████████| 3561/3561 [00:01<00:00, 3331.45it/s]
Splitting joined words in benefits: 100%|██████████| 3561/3561 [00:00<00:00, 13460.04it/s]
Splitting joined words in company_profile: 100%|██████████| 3561/3561 [00:00<00:00, 5926.47it/s]


Word splitting complete

--- Generating Embeddings ---


Embedding description: 100%|██████████| 3561/3561 [00:41<00:00, 85.85it/s] 
Embedding requirements: 100%|██████████| 3561/3561 [00:26<00:00, 132.08it/s]
Embedding benefits: 100%|██████████| 3561/3561 [00:16<00:00, 209.87it/s]
Embedding company_profile: 100%|██████████| 3561/3561 [00:31<00:00, 112.89it/s]


No nulls found after sentence embeddings


Removing stopwords in title: 100%|██████████| 3561/3561 [00:00<00:00, 55976.66it/s]
Removing stopwords in description: 100%|██████████| 3561/3561 [00:00<00:00, 4197.27it/s]
Removing stopwords in requirements: 100%|██████████| 3561/3561 [00:00<00:00, 8604.58it/s]
Removing stopwords in benefits: 100%|██████████| 3561/3561 [00:00<00:00, 22313.29it/s]
Removing stopwords in company_profile: 100%|██████████| 3561/3561 [00:00<00:00, 8250.98it/s]


Stopword removal complete


Lemmatizing title: 100%|██████████| 3561/3561 [00:00<00:00, 50129.27it/s]
Lemmatizing description: 100%|██████████| 3561/3561 [00:01<00:00, 2926.27it/s]
Lemmatizing requirements: 100%|██████████| 3561/3561 [00:00<00:00, 5577.96it/s]
Lemmatizing benefits: 100%|██████████| 3561/3561 [00:00<00:00, 15065.95it/s]
Lemmatizing company_profile: 100%|██████████| 3561/3561 [00:00<00:00, 5569.70it/s]


Lemmatization complete

--- Building TF-IDF Features ---
Processing title...
Processing description...
Processing requirements...
Processing benefits...
Processing company_profile...
TF-IDF feature extraction complete (used pre-fitted vectorizers)
TF-IDF merge complete: 3561 samples × 25000 features

--- Applying TruncatedSVD Dimensionality Reduction ---
✓ Used pre-fitted SVD model
Explained variance: 62.30%
SVD output shape: (3561, 500)
Salary parsing complete
No nulls found after salary parsing
Location parsing complete
No nulls found after location parsing

--- Expanding Embeddings ---
Expanded description_embedding: 384 dimensions
Expanded requirements_embedding: 384 dimensions
Expanded benefits_embedding: 384 dimensions
Expanded company_profile_embedding: 384 dimensions
Embedding expansion complete

--- Adding TF-IDF SVD Features ---
Added 500 TF-IDF SVD features
No nulls found after final processing

=== Final Feature Summary ===
Total features: 2072
Total samples: 3561

FINAL SH

In [20]:
df_train.to_csv('../data/processed_train_features.csv', index=False)
df_test.to_csv('../data/processed_test_features.csv', index=False)

In [29]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)