<a href="https://colab.research.google.com/github/diyanali/Glassdoor_salaryPrediction/blob/main/Glassdoor_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Glassdoor Salary Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

In today’s rapidly evolving job market, understanding and estimating fair compensation has become increasingly important for both job seekers and employers. Salary transparency not only helps candidates make informed decisions but also assists companies in offering competitive pay structures. This project focuses on leveraging machine learning techniques to build a model capable of predicting salary estimates for job postings using data collected from Glassdoor, one of the leading platforms for job listings and company reviews.

The primary objective of this project is to develop a supervised regression model that predicts a salary estimate for a given job posting. The model uses job and company-related features such as job title, company rating, location, size, industry, and required skills. Because the target variable is a continuous numerical value representing the salary estimate, this is a regression task.

The dataset for this project includes detailed job postings from Glassdoor, consisting of features like company name, job title, location, headquarters, rating, size, industry, sector, revenue, and more. Additionally, certain binary indicators such as whether the job description mentions Python, Excel, AWS, or Spark are included to capture technical skill requirements. Some fields, such as salary estimates and job descriptions, require parsing and preprocessing to extract meaningful features like seniority level or job function.

The real-world utility of this project lies in its wide range of applications. Job seekers can use the model’s predictions to gauge whether a posted salary aligns with market standards, reducing the risk of underpayment or missed opportunities. Employers and HR professionals can use the insights to set competitive salaries for attracting top talent. Moreover, educational counselors and career platforms can integrate such models to guide students and professionals with data-driven career planning.

To build the predictive model, the first step involves thorough data cleaning and preprocessing. This includes handling missing or inconsistent values, encoding categorical variables into a machine-readable format, and scaling numerical features where necessary. Exploratory data analysis (EDA) is then performed to identify patterns, correlations, and outliers that could affect model performance.

Several regression algorithms will be evaluated for this task, including Linear Regression as a baseline model, and more sophisticated approaches like Decision Trees, Random Forest Regressors, and XGBoost. These models will be assessed using common regression metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the R² Score to determine how well the model captures the variability in salary predictions. Hyperparameter tuning using GridSearchCV or RandomizedSearchCV may also be performed to enhance model performance.
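
To make the four metrics concrete, here is a small worked sketch in plain Python. The salary values are made up for illustration (in thousands of dollars) and are not taken from the Glassdoor dataset; `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # average absolute error
    mse = sum(e * e for e in errors) / n           # average squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical salaries in thousands of dollars (illustrative only)
actual = [50, 60, 70, 80]
predicted = [55, 58, 75, 78]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the salary, which makes them easier to interpret than MSE; RMSE penalizes large misses more heavily than MAE.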

Optionally, the project can be extended to include a user interface using frameworks like Streamlit or Flask, allowing users to input job features and receive instant salary predictions. This makes the model accessible and interactive, increasing its usability in real-world scenarios.

In conclusion, the Glassdoor Salary Prediction project is a practical and impactful application of machine learning in the domain of career analytics and HR tech. It combines data preprocessing, feature engineering, model building, and evaluation into a comprehensive pipeline that can empower individuals and organizations with actionable salary insights.



# **GitHub Link -**

https://github.com/diyanali/Glassdoor_salaryPrediction

# **Problem Statement**



Develop a machine learning model to predict job salary estimates based on company and job-related features using data from Glassdoor.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')
sns.set_palette('Set2')

### Dataset Loading

In [None]:
df=pd.read_csv('glassdoor_jobs.csv')

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
duplicate_count =df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

In [None]:
df.apply(lambda col: col.duplicated().sum())  # number of duplicate values in each column


#### Missing Values/Null Values

In [None]:
df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [None]:
df['Company Name'] = df['Company Name'].str.split('\n').str[0]  # keep only the text before the first '\n'

In [None]:
df['Min Salary']=df.apply(lambda row: row['Salary Estimate'][:row['Salary Estimate'].find('-')], axis=1)#Min Salary Column

In [None]:
df['Min Salary'] = df.apply(
    lambda row: row['Min Salary'][row['Min Salary'].find(':') + 1:].strip()
    if row['Min Salary'].startswith('Employer Provided Salary:')
    else row['Min Salary'],
    axis=1
)  # strip the 'Employer Provided Salary:' prefix from Min Salary where present

In [None]:
# Build 'Max Salary' by handling each 'Salary Estimate' format separately.
# Note: iloc[i] is positional and at[i, ...] is label-based, so this relies
# on the default RangeIndex produced by read_csv.
for i in range(len(df)):
    salary_est = df.iloc[i]['Salary Estimate']
    start = salary_est.find('-') + 1  # position just after the dash

    if salary_est.endswith('(Glassdoor est.)'):
        # The maximum runs from the dash to the first space
        df.at[i, 'Max Salary'] = salary_est[start:salary_est.find(' ')]

    elif salary_est.endswith('(Employer est.)'):
        # The maximum runs from the dash to the opening parenthesis
        df.at[i, 'Max Salary'] = salary_est[start:salary_est.find('(')]

    elif salary_est.startswith('Employer Provided Salary:') and salary_est.endswith('Per Hour'):
        # Remove the prefix and suffix, then keep the part after the dash
        salary_range = salary_est.replace('Employer Provided Salary:', '').replace('Per Hour', '').strip()
        _, max_sal = salary_range.split('-')
        df.at[i, 'Max Salary'] = max_sal.strip()

    elif salary_est.startswith('Employer Provided Salary:'):
        # Everything after the dash is the maximum
        df.at[i, 'Max Salary'] = salary_est[start:]

In [None]:
df['Max Salary']=df['Max Salary'].replace(np.nan,'')
# Replace nulls with '' so that Max Salary uses the same empty marker as Min Salary

In [None]:
def parse_salary(s):
    s = s.strip().replace('$', '')
    if s.endswith('K'):
        return int(float(s.replace('K', '')))  # Annual salary
    else:
        hourly = float(s)
        return int((hourly * 40 * 52)/1000)  # Convert hourly to annual assuming 40 hrs/week


In [None]:
df['Max Salary'] = df[df['Max Salary']!='']['Max Salary'].apply(parse_salary)

In [None]:
df['Min Salary'] = df[df['Min Salary']!='']['Min Salary'].apply(parse_salary)

In [None]:
df['Avg Salary']=np.round((df['Max Salary']+df['Min Salary'])/2,decimals=2)

The Min Salary, Max Salary, and Avg Salary columns are now created, all expressed in thousands of dollars per year.
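
As an alternative to the branch-by-branch string slicing above, the same extraction can be sketched with a single regular expression. This is only an illustration: `parse_salary_estimate` and `to_thousands` are hypothetical helpers, and the sample strings are assumed formats inferred from the branches in the loop, not rows copied from the dataset.

```python
import re

HOURS_PER_YEAR = 40 * 52  # same 40 hrs/week assumption as parse_salary

# Two dollar amounts separated by a dash, each with an optional K suffix
SALARY_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)(K?)\s*-\s*\$(\d+(?:\.\d+)?)(K?)")

def to_thousands(value, k_suffix):
    """Convert one amount to annual thousands of dollars."""
    amount = float(value)
    if k_suffix:
        return int(amount)  # already annual, in thousands
    # No K suffix: treat the amount as an hourly rate
    return int(amount * HOURS_PER_YEAR / 1000)

def parse_salary_estimate(text):
    """Return (min, max) in annual thousands, or None if nothing matches."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None  # e.g. a '-1' placeholder
    low, low_k, high, high_k = match.groups()
    return to_thousands(low, low_k), to_thousands(high, high_k)

print(parse_salary_estimate("$53K-$91K (Glassdoor est.)"))
print(parse_salary_estimate("$17-$24 Per Hour(Glassdoor est.)"))
```

Returning `None` for unparseable text keeps the missing-salary rows explicit instead of relying on empty strings.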

### What did you know about your dataset?

In [None]:
df.drop(['Salary Estimate'],axis=1,inplace=True)

In [None]:
df.drop(['Job Description'],axis=1,inplace=True)

The Salary Estimate and Job Description columns are dropped.

In [None]:
df['Rating']=df['Rating'].replace(-1,np.nan)

In [None]:
df['Company Name']=df['Company Name'].replace('<intent>',np.nan)

Placeholder values that stand for missing data are replaced with NaN.

In [None]:
df['Location_State'] = df['Location'].apply(
    lambda x: x[x.find(',')+2:].strip() if ',' in x else x.strip()
)

In [None]:
df.drop(['Location'],axis=1,inplace=True)

A new column, Location_State, is created to store only the state name.
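
One detail worth checking: slicing from the first comma keeps everything after it, so a location with more than one comma would retain extra text. A minimal sketch of a variant that splits on the last comma instead is shown here; `extract_state` is a hypothetical helper and the sample locations are made up for illustration.

```python
def extract_state(location):
    """Return the text after the last comma, or the whole string if there is no comma."""
    return location.rsplit(",", 1)[-1].strip()

# Hypothetical location strings (illustrative only)
print(extract_state("Chicago, IL"))
print(extract_state("Santa Fe Springs, Los Angeles, CA"))
print(extract_state("Remote"))
```

With `df['Location'].apply(extract_state)` the result would match the lambda above for single-comma locations and differ only when a second comma is present.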

In [None]:
df['Headquarters'] = df['Headquarters'].apply(
    lambda x: x[x.find(',')+2:].strip() if ',' in x else x.strip()
)

In [None]:
df['Headquarters']=df['Headquarters'].replace('-1',np.nan)

In [None]:
df['Size']=df['Size'].replace('-1',np.nan)

In [None]:
df['Founded']=df['Founded'].replace(-1,np.nan)

In [None]:
df['Type of ownership']=df['Type of ownership'].replace('-1',np.nan)

In [None]:
df['Industry']=df['Industry'].replace('-1',np.nan)

In [None]:
df['Sector']=df['Sector'].replace('-1',np.nan)

In [None]:
df['Revenue']=df['Revenue'].replace('-1',np.nan)

In [None]:
df['Competitors']=df['Competitors'].replace('-1',np.nan)
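
The column-by-column replacements above could also be written as a single call. The sketch below uses a small made-up frame rather than the real data, and assumes the placeholder is always the number -1 or the string '-1'.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with both numeric and string placeholders
demo = pd.DataFrame({
    "Rating": [3.8, -1.0, 4.1],
    "Founded": [1999, -1, 2010],
    "Size": ["51 to 200 employees", "-1", "10000+ employees"],
})

# Replace both placeholder forms with NaN across every column at once
cleaned = demo.replace([-1, "-1"], np.nan)
print(cleaned.isna().sum())
```

A frame-wide replace is shorter, but it also touches columns where -1 might be a legitimate value, so the per-column version is the safer choice when that is a possibility.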

In [None]:
df.drop(['Industry'],axis=1,inplace=True)

In [None]:
df.isnull().sum()

In [None]:
missing_percent = df.isnull().mean() * 100
print(missing_percent.sort_values(ascending=False))

In [None]:
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.barplot(x=missing.values, y=missing.index, palette='mako')
plt.xlabel("Number of Missing Values")
plt.ylabel("Columns")
plt.title("Missing Values Per Column")
plt.show()


In [None]:
df.drop(['Competitors'],axis=1,inplace=True)

In [None]:
df['Rating']=df['Rating'].replace(np.nan,df['Rating'].mean())

In [None]:
categorical_cols = ['Company Name', 'Headquarters', 'Size', 'Type of ownership', 'Sector', 'Revenue']
df[categorical_cols] = df[categorical_cols].fillna('Unknown')

In [None]:
df['Founded'] = df['Founded'].fillna(df['Founded'].median())

In [None]:
def clean_job_title(title):
    title = title.lower()
    if 'data scientist' in title:
        return 'Data Scientist'
    elif 'data analyst' in title:
        return 'Data Analyst'
    elif 'data engineer' in title:
        return 'Data Engineer'
    elif 'machine learning' in title or 'ml engineer' in title:
        return 'ML Engineer'
    elif 'research scientist' in title or 'researcher' in title:
        return 'Research Scientist'
    elif 'manager' in title or 'director' in title or 'lead' in title or 'head' in title:
        return 'Manager/Director'
    elif 'intern' in title or 'junior' in title or 'jr.' in title or 'college' in title:
        return 'Intern/Junior'
    elif 'analyst' in title:
        return 'Other Analyst'
    elif 'scientist' in title:
        return 'Other Scientist'
    else:
        return 'Other'
df['Cleaned Job Title'] = df['Job Title'].apply(clean_job_title)

In [None]:
def categorize_other_roles(title):
    title = title.lower()

    if any(k in title for k in ['chief', 'vp', 'head', 'director', 'principal']):
        return 'Manager/Director'

    elif 'consultant' in title:  # also covers 'analytics consultant'
        return 'Other Analyst'

    elif 'architect' in title or 'data modeler' in title:
        return 'Other Scientist'

    elif 'engineer' in title and any(k in title for k in ['product', 'platform', 'spark', 'systems']):
        return 'Data Engineer'

    elif 'analytics' in title or 'data systems specialist' in title or 'data & analytics' in title:
        return 'Other Analyst'

    elif 'data science engineer' in title or 'ml' in title:
        return 'ML Engineer'

    elif any(k in title for k in ['account exec', 'business development']):
        return 'Manager/Director'

    elif 'environmental' in title:
        return 'Other Scientist'

    elif 'intern' in title or 'junior' in title:
        return 'Intern/Junior'

    elif 'software engineer' in title and 'visualization' in title:
        return 'Data Engineer'

    elif 'product engineer' in title and 'data science' in title:
        return 'Data Engineer'

    elif 'data management specialist' in title:
        return 'Data Analyst'

    else:
        return 'Other'
df.loc[df['Cleaned Job Title'] == 'Other', 'Cleaned Job Title'] = \
    df.loc[df['Cleaned Job Title'] == 'Other', 'Job Title'].apply(categorize_other_roles)
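
The two functions above are ordered keyword rules: the first matching branch wins. The same idea can be sketched as a data table, which makes the rule order easier to read and change. `TITLE_RULES` and `classify_title` are hypothetical names, and only a subset of the rules is reproduced.

```python
# Ordered (keywords, label) rules: the first rule with a matching keyword wins
TITLE_RULES = [
    (("data scientist",), "Data Scientist"),
    (("data analyst",), "Data Analyst"),
    (("data engineer",), "Data Engineer"),
    (("machine learning", "ml engineer"), "ML Engineer"),
    (("manager", "director", "lead", "head"), "Manager/Director"),
]

def classify_title(title, rules=TITLE_RULES, default="Other"):
    """Return the label of the first rule whose keyword occurs in the title."""
    lowered = title.lower()
    for keywords, label in rules:
        if any(keyword in lowered for keyword in keywords):
            return label
    return default

print(classify_title("Senior Data Scientist"))
print(classify_title("Director of Data Engineering"))
print(classify_title("Accountant"))
```

Because matching is by substring and in order, "Director of Data Engineering" is labeled Data Engineer rather than Manager/Director, and short keywords such as 'lead' or 'intern' also match inside longer words like "leader" or "internal".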


In [None]:
df[df['Cleaned Job Title']=='Other']['Job Title']

In [None]:
df.at[496, 'Cleaned Job Title'] = 'Data Scientist'
df.at[746, 'Cleaned Job Title'] = 'Data Scientist'
df.at[821, 'Cleaned Job Title'] = 'Data Scientist'

In [None]:
df.drop(['Job Title'],axis=1,inplace=True)

In [None]:
df.head()

## ***2. Understanding Your Variables***

In [None]:
list(df.columns)

In [None]:
df.describe()

In [None]:
df.dtypes

### Check Unique Values for each variable.

In [None]:
for column in df.columns:
    print(f"\n{column} - {df[column].nunique()} unique values")
    print(df[column].unique()[:10])  # Show only first 10 unique values


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Univariate Histogram for Salary distribution

In [None]:
sns.histplot(df['Avg Salary'], kde=True)
plt.title("Salary Distribution")
plt.show()


#### Box plot of Average Salary to spot outliers.

In [None]:
sns.boxplot(x=df['Avg Salary'])
plt.title("Boxplot of Average Salary")
plt.show()
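
A box plot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers by default. The sketch below applies the same rule numerically using only the standard library; `iqr_outliers` is a hypothetical helper and the salaries are made-up values in thousands.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences themselves."""
    # 'inclusive' interpolates between data points, like the default of numpy.percentile
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical average salaries in thousands of dollars (illustrative only)
salaries = [40, 55, 60, 65, 70, 75, 80, 90, 250]
outliers, fences = iqr_outliers(salaries)
print(outliers, fences)
```

On the notebook's data, the same function could be called with `df['Avg Salary'].dropna().tolist()` to list the points that the box plot shows beyond its whiskers.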


#### Box plot of salary distribution by job title.

In [None]:
sns.boxplot(data=df, x='Cleaned Job Title', y='Avg Salary')
plt.xticks(rotation=45)
plt.title("Salary Distribution by Job Title (In Thousands)")
plt.tight_layout()
plt.show()

In [None]:
df_filtered = df[df["Size"] != 'Unknown']
g = sns.FacetGrid(df_filtered, col='Size', col_wrap=3, height=4, sharex=False)
g.map(sns.histplot, 'Avg Salary', kde=True, color='skyblue')
g.set_titles(col_template="{col_name}")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Histogram of Avg Salary by Company Size", fontsize=16)
plt.tight_layout()
plt.show()

#### Data Preprocessing

In [None]:
train_data=df[df['Min Salary'].notna()]

In [None]:
test_data=df[df['Min Salary'].isna()]  # rows without a parsed salary, which have no label to train on

In [None]:
train_data.head()

In [None]:
test_data.head()


In [None]:
# Define target variable
y = train_data['Min Salary']  # 'train' already has no NaNs in min_salary

# Define feature set by dropping target and irrelevant columns
X = train_data.drop(['Min Salary', 'Max Salary', 'Avg Salary', 'Salary Estimate',
                'Job Title', 'Company Name', 'Job Description'],
               axis=1, errors='ignore')


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd

# List to store one result row per model
results = []

# List of regression models to test
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42, objective='reg:squarederror') # corrected objective
}

# One-hot encode the training split only, then align the test split to the same
# columns: categories unseen in training are dropped and absent ones are filled with 0.
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test).reindex(columns=X_train_encoded.columns, fill_value=0)

# Train each model on the training split and evaluate it on the held-out split
for name, model in models.items():
    model.fit(X_train_encoded, y_train)
    y_pred = model.predict(X_test_encoded)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    results.append({
        'Model': name,
        'R² Score': round(r2, 4),
        'MAE': round(mae, 2),
        'MSE': round(mse, 2),
        'RMSE': round(rmse, 2)
    })

# Convert results to a DataFrame for easy comparison
results_df = pd.DataFrame(results)
results_df.sort_values(by='R² Score', ascending=False, inplace=True)

# Display the results
print(results_df)
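
Another way to keep the encoding consistent between splits is to let scikit-learn's `OneHotEncoder` learn the categories from the training data inside a `Pipeline`. The sketch below is a minimal illustration on made-up frames, not the notebook's actual features; the column names, the demo values, and the choice of Ridge are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frames standing in for X_train / X_test (illustrative only)
train = pd.DataFrame({
    "Sector": ["Tech", "Finance", "Tech", "Health"],
    "Rating": [4.0, 3.5, 4.2, 3.8],
})
y_train_demo = [90, 70, 95, 80]
test = pd.DataFrame({"Sector": ["Tech", "Retail"], "Rating": [4.1, 3.0]})

preprocess = ColumnTransformer(
    # handle_unknown="ignore" encodes unseen categories (here 'Retail') as all zeros
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Sector"])],
    remainder="passthrough",  # numeric columns pass through unchanged
)
pipeline = Pipeline([("preprocess", preprocess), ("model", Ridge(alpha=1.0))])

# The encoder learns its categories from the training frame only
pipeline.fit(train, y_train_demo)
predictions = pipeline.predict(test)
print(predictions)
```

Because the encoder is fitted as part of the pipeline, the same object could be passed to `GridSearchCV` so that each cross-validation fold learns its own encoding.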

#### Conclusion
The project built and evaluated several machine learning models to predict job salary estimates from the Glassdoor dataset. Data cleaning, preprocessing, and exploratory analysis prepared the dataset for modeling. In the evaluation of the regression algorithms, the Decision Tree model gave the most accurate predictions of the minimum salary for the features used, which suggests that a tree-based approach can capture the relationships between job and company features and salary in this dataset. This ranking is only meaningful when each model is scored on rows it was not trained on, because an unconstrained decision tree can reproduce its training rows almost exactly. With that caveat, the model can help job seekers and employers understand and estimate salary ranges.