## Silver Layer: Data Enrichment and Feature Engineering

The **Silver Layer** takes the raw, validated data from the Bronze Layer (`bronze_resumes`) and transforms it into a structured, analysis-ready format. This layer performs **cleansing, standardization, and feature extraction**, using PySpark UDFs to derive new columns such as `extracted_skills`. The output is the `silver_features` Delta table.

In [0]:
# Imports required for UDFs and DataFrame operations
from pyspark.sql import functions as F
from pyspark.sql import types as T
import re
from pyspark.sql.types import ArrayType, StringType

### Feature Extraction: Defining the Skill Extractor (UDF)

To analyze skill demand and cluster candidates, we must convert the raw resume text into structured features. The cell below defines a Python **user-defined function (UDF)** that uses regular expressions (`re`) to identify and extract key technical skills (Python, SQL, AWS, etc.) from the resume text. PySpark then applies this UDF to every resume, distributing the work across the cluster.

In [0]:
# --- List of Common Data Science/Tech Skills ---
SKILLS_LIST = [
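    r'\bPython\b', r'\bSQL\b', r'\bSpark\b', r'\bDatabricks\b', r'\bR\b',
    # C++ and C# end in non-word characters, so a trailing \b does not match
    # before a space, punctuation, or end of text; use a negative lookahead instead.
    r'\bJava\b', r'\bC\+\+(?!\w)', r'\bC#(?!\w)', r'\bReact\b', r'\bAngular\b',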
    r'\bAWS\b', r'\bAzure\b', r'\bGCP\b', r'\bMachine Learning\b', 
    r'\bDeep Learning\b', r'\bTensorFlow\b', r'\bPyTorch\b', r'\bKeras\b',
    r'\bTableau\b', r'\bPower BI\b', r'\bExcel\b', r'\bPandas\b', r'\bNumPy\b',
    r'\bDocker\b', r'\bKubernetes\b', r'\bScikit-learn\b', r'\bScala\b'
]
# Combine skills into a single, case-insensitive regex pattern
SKILL_PATTERN = r'(?i)' + '|'.join(SKILLS_LIST) 


# --- Python Function to Extract Skills ---
def extract_skills_from_text(text):
    """Uses regex to find and return a sorted list of unique, lower-cased skills."""
    if text is None:
        return []

    # Find all matches, then strip, lower-case, and de-duplicate them
    matches = re.findall(SKILL_PATTERN, text)
    return sorted({m.strip().lower() for m in matches})


# --- Register the Python function as a PySpark UDF ---
extract_skills_udf = F.udf(extract_skills_from_text, ArrayType(StringType()))

print("✅ Skill extraction function and PySpark UDF registered.")

✅ Skill extraction function and PySpark UDF registered.
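
Before running the UDF on the cluster, the extraction logic can be checked locally in plain Python. The sketch below is illustrative only: it uses a shortened, hypothetical pattern list (`sample_patterns`) and a made-up sentence, not the full `SKILLS_LIST` or real resume data. It also shows why skills that end in a non-word character, such as `C++` and `C#`, need a lookahead rather than a trailing `\b`.

```python
import re

# Shortened, illustrative pattern list (an assumption, not the full SKILLS_LIST)
sample_patterns = [
    r'\bPython\b', r'\bSQL\b', r'\bJava\b',
    r'\bC\+\+(?!\w)', r'\bC#(?!\w)',  # lookahead instead of a trailing \b
]
sample_regex = re.compile('|'.join(sample_patterns), re.IGNORECASE)


def extract_sample_skills(text):
    """Return the sorted, unique, lower-cased skills found in text."""
    if text is None:
        return []
    return sorted({m.strip().lower() for m in sample_regex.findall(text)})


# Made-up sentence used purely for the check
resume = "Built ETL jobs in python and SQL; ported C++ services to C#. Some JavaScript."
skills = extract_sample_skills(resume)
print(skills)  # ['c#', 'c++', 'python', 'sql']

# A trailing \b after '+' needs a word character next, so this finds nothing
trailing_boundary_match = re.search(r'\bC\+\+\b', "C++ services")
print(trailing_boundary_match)  # None
```

Note that "JavaScript" is not reported as "java", because `\bJava\b` requires a word boundary after the final "a".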


### Silver Table Creation: Applying UDF and Saving

This section reads the `bronze_resumes` table, applies `extract_skills_udf` to create the new `extracted_skills` column, and saves the resulting feature-rich DataFrame as the `silver_features` Delta table. Because the write uses `mode("overwrite")`, each run replaces the table's previous contents. This table serves as the single source of truth for all subsequent machine learning models and analytics.

In [0]:
# Read the Bronze table created by the Bronze_Layer.ipynb notebook
# This is the INPUT for the Silver Layer
bronze_df = spark.table("bronze_resumes")

# Apply the UDF to the 'Resume_Text' column and create a new column 'extracted_skills'
silver_df = bronze_df.withColumn(
    "extracted_skills", 
    extract_skills_udf(F.col("Resume_Text"))
)

print("\n✅ Created Silver DataFrame with extracted skills.")

# Show the result to verify the extraction
silver_df.select("Resume_ID", "extracted_skills").show(5, truncate=False)

# 💾 Save the result as the SILVER Delta Table (OUTPUT of the Silver Layer)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_features")

print("\n💾 Silver table 'silver_features' created successfully!")


✅ Created Silver DataFrame with extracted skills.
+---------+----------------------+
|Resume_ID|extracted_skills      |
+---------+----------------------+
|0        |[]                    |
|1        |[python, sql, tableau]|
|2        |[]                    |
|3        |[]                    |
|4        |[]                    |
+---------+----------------------+
only showing top 5 rows

💾 Silver table 'silver_features' created successfully!
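
Four of the five preview rows above have an empty `extracted_skills` list, so it may be worth measuring how often the extractor finds nothing before relying on the table downstream. The following is a minimal sketch in plain Python that uses only the five preview rows, copied by hand. The helper name `empty_share` is hypothetical, and the figure says nothing about the full table, where the equivalent check would be a Spark aggregation over the array size.

```python
# The five preview rows printed above, copied by hand: (Resume_ID, extracted_skills)
preview_rows = [
    (0, []),
    (1, ["python", "sql", "tableau"]),
    (2, []),
    (3, []),
    (4, []),
]


def empty_share(rows):
    """Return the fraction of rows whose skill list is empty."""
    if not rows:
        return 0.0
    empty = sum(1 for _, skills in rows if not skills)
    return empty / len(rows)


share = empty_share(preview_rows)
print(f"{share:.0%} of preview rows have no extracted skills")  # 80%
```

A high share on the full table could point to resumes that mention skills outside the pattern list, or to text formatting that the patterns do not anticipate.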
