## 05_Jobs_Analysis_and_Predictions.ipynb

**Purpose:**  
Analyze job descriptions and HRMS data to generate job analytics and predictive features.

**Input:**  
- `data/processed/Jobs_cleaned.csv`  
- `data/processed/HRMS_cleaned.csv`

**Output:**  
- `data/processed/jobs_analytics_features.csv`

**Notes:**  
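- Generates job-market insights and feature vectors for downstream dashboards.
- `HRMS_cleaned.csv` is loaded only for a shape and column check; every output feature is derived from `Jobs_cleaned.csv`.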

In [None]:
import pandas as pd

# Load datasets
jobs = pd.read_csv(r"C:\Users\abanu\Documents\t_iq_hr\data\processed\Jobs_cleaned.csv")
hrms = pd.read_csv(r"C:\Users\abanu\Documents\t_iq_hr\data\processed\HRMS_cleaned.csv")

# Basic checks
print("Jobs shape:", jobs.shape)
print("HRMS shape:", hrms.shape)

print("\nJobs columns:\n", jobs.columns.tolist())
print("\nHRMS columns:\n", hrms.columns.tolist())

# Ensure text columns are strings (VERY IMPORTANT).
# Fill missing values first so they become "" rather than the literal string "nan".
text_cols = ['skills', 'job_text']

for col in text_cols:
    jobs[col] = jobs[col].fillna('').astype(str)

print("\n✅ Notebook 5 Cell 1 completed successfully")


Jobs shape: (10000, 13)
HRMS shape: (10000, 11)

Jobs columns:
 ['job_id', 'job_title', 'company', 'location', 'industry', 'seniority', 'employment_type', 'salary_min', 'salary_max', 'years_experience', 'skills', 'job_text', 'job_len']

HRMS columns:
 ['employee_id', 'name', 'department', 'job_role', 'location', 'current_salary', 'satisfaction_score', 'engagement_score', 'num_skills', 'years_at_company', 'trainings_count']

✅ Notebook 5 – Cell 1 completed successfully
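
The later cells rely on specific columns being present and on the text columns holding real strings. The following is a minimal sketch of how both could be checked up front. The helper `missing_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for Jobs_cleaned.csv; the values are made up for illustration.
toy_jobs = pd.DataFrame({
    "job_id": ["JOB000001", "JOB000002"],
    "skills": ["Spark;Azure", None],
    "job_text": ["Maintain documentation", None],
})

# Report any column the downstream cells would need but the frame lacks.
required = ["job_id", "skills", "job_text", "salary_min"]
absent = missing_columns(toy_jobs, required)
print("Missing columns:", absent)

# Fill missing text with "" before casting, so no placeholder text
# such as "nan" can leak into later string operations.
for col in ["skills", "job_text"]:
    toy_jobs[col] = toy_jobs[col].fillna("").astype(str)

print(toy_jobs["skills"].tolist())
```

Failing early on a non-empty `absent` list is usually easier to debug than a `KeyError` several cells later.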


In [11]:
# ============================
# Cell 2: Job Feature Engineering
# ============================

# Ensure the source numeric columns have a numeric dtype before deriving from them
numeric_cols = ['salary_min', 'salary_max', 'years_experience', 'job_len']

jobs[numeric_cols] = jobs[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Salary midpoint (important for matching & analytics)
jobs['salary_mid'] = (jobs['salary_min'] + jobs['salary_max']) / 2

# Basic sanity check
print("Salary mid stats:")
print(jobs['salary_mid'].describe())

print("\nYears of experience distribution:")
print(jobs['years_experience'].value_counts().head())

jobs.head()


Salary mid stats:
count     10000.000000
mean     108987.285050
std       36396.915685
min       32841.000000
25%       78043.375000
50%      108893.250000
75%      140211.250000
max      184538.500000
Name: salary_mid, dtype: float64

Years of experience distribution:
years_experience
1     960
10    951
2     927
3     925
4     913
Name: count, dtype: int64


| | job_id | job_title | company | location | industry | seniority | employment_type | salary_min | salary_max | years_experience | skills | job_text | job_len | salary_mid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB000001 | 3 | 122 | 0 | 2 | 3 | 3 | 107397 | 140048 | 0 | Quantitative Analysis;Azure;Feature Engineerin... | Design, develop and maintain scalable Kubernet... | 59 | 123722.5 |
| 1 | JOB000002 | 13 | 111 | 3 | 6 | 1 | 0 | 42676 | 71202 | 5 | Spark;Node.js;spaCy;Azure;Kubernetes;GANs | Maintain documentation and operate within Agil... | 56 | 56939.0 |
| 2 | JOB000003 | 26 | 155 | 3 | 2 | 0 | 3 | 117841 | 140337 | 10 | AWS;GANs;Kotlin;RESTful Design;Tableau;RoBERTa... | Collaborate with cross-functional teams to def... | 64 | 129089.0 |
| 3 | JOB000004 | 6 | 52 | 10 | 3 | 3 | 2 | 103579 | 143901 | 4 | Looker;RESTful Design;Git;Jira;Linux;CI/CD;Fin... | Maintain documentation and operate within Agil... | 63 | 123740.0 |
| 4 | JOB000005 | 7 | 199 | 11 | 7 | 1 | 4 | 128419 | 150905 | 10 | Leadership;Java;Hyperparameter Tuning;Customer... | Develop proof-of-concepts for new ML models an... | 64 | 139662.0 |
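
The midpoint is only meaningful when both salary bounds are numeric and the band is not inverted. As a rough sketch on made-up rows (the `"n/a"` entry and the inverted band are invented for illustration, not taken from the dataset), coercion and a simple band check might look like this:

```python
import pandas as pd

# Toy salary data; one unparseable entry and one inverted band are included on purpose.
toy = pd.DataFrame({
    "salary_min": ["42676", "117841", "n/a", "90000"],
    "salary_max": [71202, 140337, 120000, 80000],
})

# Coerce first: unparseable entries become NaN instead of raising an error.
for col in ["salary_min", "salary_max"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Midpoint of the band; NaN propagates wherever a bound is missing.
toy["salary_mid"] = (toy["salary_min"] + toy["salary_max"]) / 2

# Flag rows whose lower bound exceeds the upper bound.
# Comparisons involving NaN evaluate to False, so missing rows are not flagged here.
inverted = toy["salary_min"] > toy["salary_max"]

print(toy)
print("Inverted bands:", int(inverted.sum()))
print("Rows with missing midpoint:", int(toy["salary_mid"].isna().sum()))
```

Counting the NaN midpoints after coercion gives a quick measure of how many rows `errors='coerce'` silently blanked out.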


In [12]:
# -------------------------------
# Cell 3: Job Feature Engineering
# -------------------------------

# 1. Salary spread (how wide the salary band is)
jobs['salary_range'] = jobs['salary_max'] - jobs['salary_min']

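# 2. Experience bucket (for analytics / matching)
# Intervals are right-closed: 0-2 Entry, 3-5 Mid, 6-10 Senior, 11-20 Expert.
# Values outside (-1, 20] become NaN.
jobs['exp_bucket'] = pd.cut(
    jobs['years_experience'],
    bins=[-1, 2, 5, 10, 20],
    labels=['Entry', 'Mid', 'Senior', 'Expert']
)

# 3. Skill count (number of skills required); empty tokens are ignored
jobs['num_skills_required'] = jobs['skills'].apply(
    lambda x: sum(1 for s in str(x).split(';') if s.strip())
)

# 4. Job description length category
# Intervals are right-closed: 1-40 Short, 41-60 Medium, 61-100 Long.
# Values outside (0, 100] become NaN.
jobs['job_len_bucket'] = pd.cut(
    jobs['job_len'],
    bins=[0, 40, 60, 100],
    labels=['Short', 'Medium', 'Long']
)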

# Quick check
jobs[['salary_mid', 'salary_range', 'years_experience',
      'exp_bucket', 'num_skills_required', 'job_len_bucket']].head()


| | salary_mid | salary_range | years_experience | exp_bucket | num_skills_required | job_len_bucket |
|---|---|---|---|---|---|---|
| 0 | 123722.5 | 32651 | 0 | Entry | 5 | Medium |
| 1 | 56939.0 | 28526 | 5 | Mid | 6 | Medium |
| 2 | 129089.0 | 22496 | 10 | Senior | 10 | Long |
| 3 | 123740.0 | 40322 | 4 | Mid | 7 | Long |
| 4 | 139662.0 | 22486 | 10 | Senior | 6 | Long |


In [None]:
# -------------------------------
# Cell 4: Job Feature Table
# -------------------------------

job_features = jobs[[
    'job_id',
    'industry',
    'seniority',
    'employment_type',
    'salary_mid',
    'salary_range',
    'years_experience',
    'exp_bucket',
    'num_skills_required',
    'job_len_bucket'
]].copy()

job_features.head()


| | job_id | industry | seniority | employment_type | salary_mid | salary_range | years_experience | exp_bucket | num_skills_required | job_len_bucket |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB000001 | 2 | 3 | 3 | 123722.5 | 32651 | 0 | Entry | 5 | Medium |
| 1 | JOB000002 | 6 | 1 | 0 | 56939.0 | 28526 | 5 | Mid | 6 | Medium |
| 2 | JOB000003 | 2 | 0 | 3 | 129089.0 | 22496 | 10 | Senior | 10 | Long |
| 3 | JOB000004 | 3 | 3 | 2 | 123740.0 | 40322 | 4 | Mid | 7 | Long |
| 4 | JOB000005 | 7 | 1 | 4 | 139662.0 | 22486 | 10 | Senior | 6 | Long |
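
The bucket columns in this table are categorical labels, which most models cannot consume directly. One possible way to turn them into numeric inputs, shown on a toy frame, is ordinal codes or one-hot columns. This is a sketch of a common approach, not a step the notebook itself performs, and the `exp_bucket_code` column name is an assumption.

```python
import pandas as pd

labels = ["Entry", "Mid", "Senior", "Expert"]

# Toy feature rows using the same bucket definition as the notebook.
toy_features = pd.DataFrame({
    "job_id": ["JOB000001", "JOB000002", "JOB000003"],
    "years_experience": [0, 5, 10],
})
toy_features["exp_bucket"] = pd.cut(
    toy_features["years_experience"],
    bins=[-1, 2, 5, 10, 20],
    labels=labels,
)

# Option 1: ordinal codes that follow the label order (missing buckets get -1).
toy_features["exp_bucket_code"] = toy_features["exp_bucket"].cat.codes

# Option 2: one indicator column per bucket.
one_hot = pd.get_dummies(toy_features["exp_bucket"], prefix="exp")

print(toy_features)
print(one_hot)
```

Ordinal codes keep the Entry-to-Expert ordering in a single column, while one-hot columns avoid implying equal spacing between buckets.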


In [14]:
# Save final job analytics features
output_path = r"C:\Users\abanu\Documents\t_iq_hr\data\processed\jobs_analytics_features.csv"
job_features.to_csv(output_path, index=False)

print("✅ Job analytics feature table saved successfully!")
print("📁 Path:", output_path)


✅ Job analytics feature table saved successfully!
📁 Path: C:\Users\abanu\Documents\t_iq_hr\data\processed\jobs_analytics_features.csv
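
CSV stores only text, so the ordered categorical dtype of `exp_bucket` and `job_len_bucket` does not survive a save and reload. The sketch below shows how a downstream consumer might restore it. It uses an in-memory buffer and toy rows instead of the real output file.

```python
import io

import pandas as pd

labels = ["Entry", "Mid", "Senior", "Expert"]

# Toy frame with an ordered categorical column, mirroring exp_bucket.
toy = pd.DataFrame({
    "job_id": ["JOB000001", "JOB000002", "JOB000003"],
    "exp_bucket": pd.Categorical(["Mid", "Entry", "Senior"], categories=labels, ordered=True),
})

# Round-trip through CSV using an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded column is plain text; the category order is gone.
dtype_after_reload = reloaded["exp_bucket"].dtype
print("dtype after reload:", dtype_after_reload)

# Restore the ordered categorical so that comparisons and sorting follow bucket order.
reloaded["exp_bucket"] = pd.Categorical(
    reloaded["exp_bucket"], categories=labels, ordered=True
)
below_senior = reloaded["exp_bucket"] < "Senior"
print(below_senior.tolist())
```

A format that preserves dtypes, such as Parquet, would avoid this step, at the cost of an extra dependency.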
