## 📖 Background
You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further, building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

## 💾 The data

The data comes from a survey hosted by an HR consultancy, available in `'salaries.csv'`.

#### Each row represents a single employee's salary record for a given year:
- **`work_year`** - The year the salary was paid.  
- **`experience_level`** - Employee experience level:  
  - **`EN`**: Entry-level / Junior  
  - **`MI`**: Mid-level / Intermediate  
  - **`SE`**: Senior / Expert  
  - **`EX`**: Executive / Director  
- **`employment_type`** - Employment type:  
  - **`PT`**: Part-time  
  - **`FT`**: Full-time  
  - **`CT`**: Contract  
  - **`FL`**: Freelance  
- **`job_title`** - The job title during the year.  
- **`salary`** - Gross salary paid (in local currency).  
- **`salary_currency`** - Salary currency (ISO 4217 code).  
- **`salary_in_usd`** - Salary converted to USD using average yearly FX rate.  
- **`employee_residence`** - Employee's primary country of residence (ISO 3166 code).  
- **`remote_ratio`** - Percentage of remote work:  
  - **`0`**: No remote work (<20%)  
  - **`50`**: Hybrid (50%)  
  - **`100`**: Fully remote (>80%)  
- **`company_location`** - Employer's main office location (ISO 3166 code).  
- **`company_size`** - Company size:  
  - **`S`**: Small (<50 employees)  
  - **`M`**: Medium (50–250 employees)  
  - **`L`**: Large (>250 employees)  

In [72]:
import pandas as pd

import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

salaries = pd.read_csv('salaries.csv')
salaries.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2024,MI,FT,Developer,168276,USD,168276,US,0,US,M
1,2024,MI,FT,Developer,112184,USD,112184,US,0,US,M
2,2024,EN,FT,Developer,180000,USD,180000,US,0,US,M
3,2024,EN,FT,Developer,133500,USD,133500,US,0,US,M
4,2024,EN,FT,Developer,122000,USD,122000,US,0,US,M


### Question 1: Analyse how factors such as country, experience level, and remote ratio impact salaries for Data Analysts, Data Scientists, and Machine Learning Engineers. In which conditions do professionals achieve the highest salaries?

In [73]:
q1 = salaries[salaries['job_title'].str.contains("^Data Analyst$|^Data Scientist$|^Machine Learning Engineer$",case=False)]
q1.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
34,2024,MI,FT,Machine Learning Engineer,149000,USD,149000,US,0,US,M
35,2024,MI,FT,Machine Learning Engineer,93000,USD,93000,US,0,US,M
38,2024,MI,FT,Machine Learning Engineer,296300,USD,296300,US,0,US,M
39,2024,MI,FT,Machine Learning Engineer,166600,USD,166600,US,0,US,M
40,2024,SE,FT,Machine Learning Engineer,378700,USD,378700,US,0,US,M


In [74]:
# Salary by Job Title
salary_by_job_title = q1.groupby('job_title')['salary_in_usd'].mean().round(2).reset_index()
salary_by_job_title

Unnamed: 0,job_title,salary_in_usd
0,Data Analyst,108528.63
1,Data Scientist,159397.07
2,Machine Learning Engineer,196891.01


On average, Machine Learning Engineers earn the most (about 196,891 USD), followed by Data Scientists and then Data Analysts.

In [75]:
# Salary by Country (Employee Residence)
salary_by_country = q1.groupby(['job_title', 'employee_residence'])['salary_in_usd'].mean().round(2).reset_index().sort_values('salary_in_usd', ascending=False)
salary_by_country.head(10)

Unnamed: 0,job_title,employee_residence,salary_in_usd
129,Machine Learning Engineer,UA,230000.0
31,Data Analyst,MX,205825.0
131,Machine Learning Engineer,US,201848.73
100,Machine Learning Engineer,AU,200922.87
104,Machine Learning Engineer,CA,168350.53
125,Machine Learning Engineer,PR,167500.0
93,Data Scientist,US,165208.37
76,Data Scientist,MX,163457.14
107,Machine Learning Engineer,DE,154959.87
80,Data Scientist,NZ,154475.75


Machine Learning Engineers have the highest average salaries in 'UA', 'US', 'AU', and 'CA'. The second-highest average overall belongs to Data Analysts in 'MX'.

In [76]:
# Salary by Experience Level
salary_by_experience = q1.groupby(['job_title', 'experience_level'])['salary_in_usd'].mean().round(2).reset_index()
salary_by_experience

Unnamed: 0,job_title,experience_level,salary_in_usd
0,Data Analyst,EN,91320.36
1,Data Analyst,EX,112925.67
2,Data Analyst,MI,99731.82
3,Data Analyst,SE,125629.33
4,Data Scientist,EN,102297.2
5,Data Scientist,EX,205171.07
6,Data Scientist,MI,140777.74
7,Data Scientist,SE,171341.36
8,Machine Learning Engineer,EN,148796.34
9,Machine Learning Engineer,EX,222596.43


Among the rows shown, executive-level ('EX') Machine Learning Engineers earn the most on average (about 222,596 USD).

In [77]:
# Salary by Remote Ratio
salary_by_remote_ratio = q1.groupby(['job_title', 'remote_ratio'])['salary_in_usd'].mean().round(2).reset_index()
salary_by_remote_ratio

Unnamed: 0,job_title,remote_ratio,salary_in_usd
0,Data Analyst,0,107243.89
1,Data Analyst,50,49858.57
2,Data Analyst,100,112683.21
3,Data Scientist,0,161558.41
4,Data Scientist,50,80231.45
5,Data Scientist,100,155269.63
6,Machine Learning Engineer,0,199900.9
7,Machine Learning Engineer,50,96883.74
8,Machine Learning Engineer,100,186902.22


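Machine Learning Engineers who work on-site (remote ratio 0) earn the most on average (about 199,901 USD). For all three job titles, the hybrid group (remote ratio 50) has a much lower average than the on-site and fully remote groups.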
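Averages can be pulled around by a few extreme salaries or by very small groups, so it may be worth looking at medians and group sizes next to the means. The sketch below shows one way to do that with `pivot_table`. The `toy` DataFrame and its values are hypothetical stand-ins for `q1`.

```python
import pandas as pd

# Hypothetical toy records; the real analysis would use the q1 DataFrame.
toy = pd.DataFrame({
    "job_title": ["Data Analyst"] * 4 + ["Data Scientist"] * 4,
    "remote_ratio": [0, 0, 50, 100, 0, 50, 50, 100],
    "salary_in_usd": [100_000, 120_000, 50_000, 110_000,
                      160_000, 70_000, 90_000, 150_000],
})

# Median salary for each job title / remote ratio combination.
median_by_remote = toy.pivot_table(
    index="job_title", columns="remote_ratio",
    values="salary_in_usd", aggfunc="median",
)

# Number of records behind each cell, to judge how reliable it is.
count_by_remote = toy.pivot_table(
    index="job_title", columns="remote_ratio",
    values="salary_in_usd", aggfunc="count",
)

print(median_by_remote)
print(count_by_remote)
```

If a cell in the count table is very small, the matching mean or median should be read with caution.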

In [78]:
# Combine all factors
highest_salary_conditions = q1.groupby(['job_title', 'experience_level', 'remote_ratio', 'employee_residence'])['salary_in_usd'].mean().round(2).reset_index().sort_values('salary_in_usd', ascending=False)
highest_salary_conditions.head(10)

Unnamed: 0,job_title,experience_level,remote_ratio,employee_residence,salary_in_usd
13,Data Analyst,EN,0,MX,429950.0
183,Data Scientist,MI,0,MX,352500.0
242,Data Scientist,SE,50,CH,323295.0
160,Data Scientist,EX,50,US,260000.0
290,Machine Learning Engineer,MI,0,AU,258333.33
287,Machine Learning Engineer,EX,0,US,239405.36
360,Machine Learning Engineer,SE,100,UA,230000.0
165,Data Scientist,EX,100,US,221078.98
335,Machine Learning Engineer,SE,0,US,212365.85
330,Machine Learning Engineer,SE,0,FR,212000.0


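Entry-level Data Analysts working on-site in 'MX' have the highest average salary (429,950 USD), which is quite surprising. A figure like this may rest on very few records, so the group sizes are worth checking before drawing conclusions.

Data Scientists take second, third, and fourth place in different countries, while Machine Learning Engineers take fifth, sixth, and seventh place across different experience levels and countries.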
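When four factors are combined, many groups can end up with only one or two records, and a single unusual salary can then top the ranking. A minimal sketch of reporting the record count next to the mean and filtering out thin groups is shown below. The `toy` data and the `MIN_RECORDS` threshold are assumptions for illustration, not values from the survey.

```python
import pandas as pd

# Hypothetical toy records standing in for q1.
toy = pd.DataFrame({
    "job_title": ["Data Analyst"] * 4 + ["Data Scientist"] * 3,
    "employee_residence": ["MX", "US", "US", "US", "US", "US", "US"],
    "salary_in_usd": [430_000, 100_000, 110_000, 120_000,
                      150_000, 160_000, 170_000],
})

# Mean salary and number of records for every group.
summary = (
    toy.groupby(["job_title", "employee_residence"])["salary_in_usd"]
    .agg(mean_salary="mean", n_records="count")
    .reset_index()
)

# Keep only groups with enough records to trust the average.
MIN_RECORDS = 3  # hypothetical threshold, tune to the data
reliable = (
    summary[summary["n_records"] >= MIN_RECORDS]
    .sort_values("mean_salary", ascending=False)
)

print(summary)
print(reliable)
```

In this toy example the single 'MX' record has the highest mean but drops out once the threshold is applied.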

### Question 2: Develop a predictive model to estimate an employee’s salary (in USD) using experience level, company location, and remote ratio. Which features are the strongest predictors of salary?

In [79]:
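# Location is represented here by employee_residence rather than company_location.
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned below.
q2 = q1[['experience_level','remote_ratio','employee_residence','salary_in_usd']].copy()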
q2.head()

Unnamed: 0,experience_level,remote_ratio,employee_residence,salary_in_usd
34,MI,0,US,149000
35,MI,0,US,93000
38,MI,0,US,296300
39,MI,0,US,166600
40,SE,0,US,378700


In [80]:
# Preprocessing data for predictive models

q2['experience_level']=pd.factorize(q2['experience_level'])[0]
q2['remote_ratio']=pd.factorize(q2['remote_ratio'])[0]
q2['employee_residence']=pd.factorize(q2['employee_residence'])[0]

# Scale the target to the [0, 1] range; the error metrics below are therefore in scaled units, not USD
scaler = MinMaxScaler()

q2['salary_in_usd'] = scaler.fit_transform(q2[['salary_in_usd']])
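`pd.factorize` numbers categories in order of first appearance. Since the first rows of `q1` are 'MI', that level gets code 0, so the codes do not follow seniority, and country codes get an arbitrary numeric order. The sketch below shows two common alternatives: an explicit ordinal encoding for experience level and one-hot encoding for country. The `toy` DataFrame is hypothetical, and this is an alternative to consider, not the encoding used for the results in this notebook.

```python
import pandas as pd

# Hypothetical toy records; the notebook would use q2 instead.
toy = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "EX", "MI"],
    "employee_residence": ["US", "CA", "US", "DE", "CA"],
})

# factorize assigns codes by first appearance: MI=0, SE=1, EN=2, EX=3.
appearance_codes = pd.factorize(toy["experience_level"])[0]

# Ordinal encoding with an explicit seniority order: EN=0, MI=1, SE=2, EX=3.
level_order = ["EN", "MI", "SE", "EX"]
ordinal_codes = pd.Categorical(
    toy["experience_level"], categories=level_order, ordered=True
).codes

# One-hot encoding for a nominal feature such as country.
residence_dummies = pd.get_dummies(toy["employee_residence"], prefix="residence")

print(list(appearance_codes))
print(list(ordinal_codes))
print(residence_dummies.columns.tolist())
```

Tree-based models can often cope with arbitrary integer codes, but a linear model treats them as a numeric scale, which may partly explain a weak linear fit.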

In [81]:
X = q2.drop(['salary_in_usd'],axis=1) 
y = q2['salary_in_usd']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=1)

In [82]:
# Random Forest Regressor model
rfr = RandomForestRegressor(n_estimators=100, random_state=42)

rfr.fit(X_train, y_train)

y_pred = rfr.predict(X_test) # Predicting model

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the `squared` argument, which newer scikit-learn versions deprecate
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06199522490107808
Mean Squared Error: 0.006826109668338132
Root Mean Square Error: 0.08262027395463012
R2 score: 0.20140726063232695
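Because the target was min-max scaled, these errors are fractions of the salary range rather than dollar amounts. MAE and RMSE can be converted back by multiplying by the range (max minus min) that the scaler saw, and MSE by the square of that range. The scaled values in the `q3` preview further down (149,000 USD maps to 0.176548 and 93,000 USD to 0.102767) imply a range of roughly 759,000 USD, so an MAE of about 0.062 would be roughly 47,000 USD; treat that as an estimate. The sketch below shows the conversion on hypothetical salaries, using the scaler's `data_range_` attribute.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries in USD; the notebook would fit on q2[['salary_in_usd']].
salaries_usd = np.array([[20_000.0], [90_000.0], [150_000.0], [420_000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(salaries_usd)

# data_range_ holds (max - min) for each column seen during fit.
usd_range = scaler.data_range_[0]

mae_scaled = 0.062   # illustrative scaled MAE
rmse_scaled = 0.083  # illustrative scaled RMSE

# MAE and RMSE scale linearly with the range; MSE scales with its square.
mae_usd = mae_scaled * usd_range
rmse_usd = rmse_scaled * usd_range

print(f"MAE ~ {mae_usd:,.0f} USD, RMSE ~ {rmse_usd:,.0f} USD")
```

R2 is unaffected by this scaling, so it can be compared directly.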


In [83]:
# Linear Regression model
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

lm.fit(X_train, y_train)

y_pred = lm.predict(X_test) # Predicting model

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the deprecated `squared` argument
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06981738717121636
Mean Squared Error: 0.00814199611359689
Root Mean Square Error: 0.09023301011047392
R2 score: 0.04746051613593805


In [84]:
# Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

y_pred = gb_model.predict(X_test)

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the deprecated `squared` argument
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06189334594524895
Mean Squared Error: 0.006751633401801228
Root Mean Square Error: 0.08216832359127955
R2 score: 0.2101203063642838


In [85]:
# XGBoost Regressor model
xg_model = xgb.XGBRegressor(n_estimators=100, random_state=42)

# Train the model
xg_model.fit(X_train, y_train)

y_pred = xg_model.predict(X_test)

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the deprecated `squared` argument
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06203786158658907
Mean Squared Error: 0.006900015825996513
Root Mean Square Error: 0.08306633389043068
R2 score: 0.19276091245331872


### Gradient Boosting is the best of the four models (highest R2, lowest errors), although the tree-based models are close.
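The four evaluation cells above repeat the same metric code. A small helper that fits a model and returns its metrics keeps the comparison in one place. The sketch below uses a hypothetical `evaluate` function and synthetic data in place of the notebook's `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return MAE, MSE, RMSE and R2 on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mse,
        "rmse": mse ** 0.5,
        "r2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data: three integer-coded features and a noisy target.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 3)).astype(float)
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, metrics in results.items():
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

An XGBoost regressor could be added to the `models` dictionary in the same way.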

In [86]:
# Get the feature importances from the trained model
feature_importances = gb_model.feature_importances_

feature_names = X_train.columns

# Create a DataFrame to combine feature names with their importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the importance of each feature
print(importance_df)

              Feature  Importance
0    experience_level    0.638020
2  employee_residence    0.316741
1        remote_ratio    0.045239


### The strongest predictor of salary is 'Experience Level' (importance of about 0.64), followed by 'Employee Residence' (about 0.32).
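The `feature_importances_` values of a tree ensemble are impurity-based and computed from the training data, and they can favour features with many distinct values. Permutation importance on the test split is a common cross-check. The sketch below uses scikit-learn's `permutation_importance` on synthetic data, so the numbers are illustrative only and not results from the survey.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded q2 features (hypothetical values).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "experience_level": rng.integers(0, 4, size=300),
    "remote_ratio": rng.integers(0, 3, size=300),
    "employee_residence": rng.integers(0, 20, size=300),
})
# By construction, only experience_level drives the synthetic target.
y = 0.2 * X["experience_level"] + rng.normal(0, 0.05, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in the test score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

perm_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": result.importances_mean,
    "Std": result.importances_std,
}).sort_values(by="Importance", ascending=False)

print(perm_df)
```

If both methods rank the features the same way on the real data, the conclusion about 'Experience Level' is on firmer ground.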

### Question 3: Expand your model by incorporating additional features, such as company size and employment type. Evaluate its performance, what improves, and what doesn’t? Finally, propose new features to make future salary predictions even more accurate.

In [87]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned below
q3 = q1[['experience_level','remote_ratio','employee_residence','salary_in_usd','company_size','employment_type']].copy()

q3['experience_level']=pd.factorize(q3['experience_level'])[0]
q3['remote_ratio']=pd.factorize(q3['remote_ratio'])[0]
q3['employee_residence']=pd.factorize(q3['employee_residence'])[0]
q3['company_size']=pd.factorize(q3['company_size'])[0]
q3['employment_type']=pd.factorize(q3['employment_type'])[0]

# Applying scaling
scaler = MinMaxScaler()

q3['salary_in_usd'] = scaler.fit_transform(q3[['salary_in_usd']])

q3.head()

Unnamed: 0,experience_level,remote_ratio,employee_residence,salary_in_usd,company_size,employment_type
34,0,0,0,0.176548,0,0
35,0,0,0,0.102767,0,0
38,0,0,0,0.370619,0,0
39,0,0,0,0.199736,0,0
40,1,0,0,0.479183,0,0


In [88]:
X = q3.drop(['salary_in_usd'],axis=1) # Independent variables
y = q3['salary_in_usd'] # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=1)

In [89]:
# Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

y_pred = gb_model.predict(X_test)

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the deprecated `squared` argument
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06188522274508126
Mean Squared Error: 0.006754106669492138
Root Mean Square Error: 0.08218337222024986
R2 score: 0.20983095654184614


In [90]:
# Get the feature importances from the trained model
feature_importances = gb_model.feature_importances_

feature_names = X_train.columns

# Create a DataFrame to combine feature names with their importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the importance of each feature
print(importance_df)

              Feature  Importance
0    experience_level    0.634390
2  employee_residence    0.312861
1        remote_ratio    0.043134
3        company_size    0.005226
4     employment_type    0.004389


#### 'Experience Level' is the most important feature, 'Employee Residence' is the second most important, and 'Employment Type' and 'Company Size' are the least important.
The importances of 'Experience Level', 'Employee Residence', and 'Remote Ratio' change only slightly when 'Company Size' and 'Employment Type' are added, and the test metrics are essentially unchanged (R2 of 0.2098 versus 0.2101 before).

'Company Size' and 'Employment Type' have negligible importance: combined, they account for about 0.0096 of the total, or roughly 1%.

### I have also added 'Work Year' and 'Company Location'. Let's see if they make any difference.

In [91]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned below
q4 = q1[['work_year', 'experience_level', 'employment_type', 'salary_in_usd', 'remote_ratio', 'employee_residence','company_location', 'company_size']].copy()

In [92]:
q4['experience_level']=pd.factorize(q4['experience_level'])[0]
q4['remote_ratio']=pd.factorize(q4['remote_ratio'])[0]
q4['employee_residence']=pd.factorize(q4['employee_residence'])[0]
q4['company_size']=pd.factorize(q4['company_size'])[0]
q4['employment_type']=pd.factorize(q4['employment_type'])[0]
q4['work_year']=pd.factorize(q4['work_year'])[0]
q4['company_location']=pd.factorize(q4['company_location'])[0]

# Applying scaling
scaler = MinMaxScaler()

q4['salary_in_usd'] = scaler.fit_transform(q4[['salary_in_usd']])

In [93]:
X = q4.drop(['salary_in_usd'],axis=1) # Independent variables
y = q4['salary_in_usd'] # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=1)

In [94]:
# Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

y_pred = gb_model.predict(X_test)

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: {}".format(mae))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {}".format(mse))
rmse = mse ** 0.5  # square root of the MSE; avoids the deprecated `squared` argument
print("Root Mean Square Error: {}".format(rmse))
r2 = r2_score(y_test, y_pred)
print("R2 score: {}".format(r2))

Mean Absolute Error: 0.06178183042930189
Mean Squared Error: 0.006706899772555146
Root Mean Square Error: 0.08189566394233058
R2 score: 0.21535373407892866


## The output is almost the same as before, but this model is slightly better (R2 of 0.2154 versus 0.2098).
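A difference this small on a single train/test split could be due to chance. Cross-validation gives a rough idea of how much the score varies between splits. The sketch below compares a smaller and a larger feature set with `cross_val_score`. The data is synthetic, and the helper `cv_r2` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: three base features plus two extra noise features.
rng = np.random.default_rng(0)
X_small = rng.integers(0, 4, size=(300, 3)).astype(float)
X_extra = rng.integers(0, 4, size=(300, 2)).astype(float)
X_large = np.hstack([X_small, X_extra])
y = 0.2 * X_small[:, 0] + rng.normal(0, 0.2, size=300)

# The same folds are used for both feature sets so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=1)


def cv_r2(X):
    """Return the R2 score of a gradient boosting model on each fold."""
    model = GradientBoostingRegressor(n_estimators=100, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="r2")


scores_small = cv_r2(X_small)
scores_large = cv_r2(X_large)

print(f"small: {scores_small.mean():.3f} +/- {scores_small.std():.3f}")
print(f"large: {scores_large.mean():.3f} +/- {scores_large.std():.3f}")
```

If the gap between the two means is smaller than the spread across folds, the extra features are probably not adding much.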

In [95]:
# Get the feature importances from the trained model
feature_importances = gb_model.feature_importances_

feature_names = X_train.columns

# Create a DataFrame to combine feature names with their importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the importance of each feature
print(importance_df)

              Feature  Importance
1    experience_level    0.621066
4  employee_residence    0.281024
3        remote_ratio    0.034977
0           work_year    0.030481
5    company_location    0.025103
2     employment_type    0.004265
6        company_size    0.003083


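'Experience Level' remains the most important feature. The importance of 'Employee Residence' has decreased (from about 0.313 to 0.281).

'Employment Type' and 'Company Size' together account for about 0.7% of the total importance (0.0073), while 'Work Year' and 'Company Location' together account for about 5.6% (0.0556).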
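Feature importances sum to 1, so multiplying by 100 gives a percentage share. A quick check of the combined shares, using the values printed in the two importance tables above (the helper name `share_percent` is introduced here for illustration):

```python
# Importances copied from the printed tables above.
importances_5_features = {
    "experience_level": 0.634390,
    "employee_residence": 0.312861,
    "remote_ratio": 0.043134,
    "company_size": 0.005226,
    "employment_type": 0.004389,
}
importances_7_features = {
    "experience_level": 0.621066,
    "employee_residence": 0.281024,
    "remote_ratio": 0.034977,
    "work_year": 0.030481,
    "company_location": 0.025103,
    "employment_type": 0.004265,
    "company_size": 0.003083,
}


def share_percent(importances, features):
    """Combined importance of the given features, as a percentage."""
    return 100 * sum(importances[name] for name in features)


size_and_type = ["company_size", "employment_type"]
year_and_location = ["work_year", "company_location"]

print(round(share_percent(importances_5_features, size_and_type), 2))      # 0.96
print(round(share_percent(importances_7_features, size_and_type), 2))      # 0.73
print(round(share_percent(importances_7_features, year_and_location), 2))  # 5.56
```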

# Conclusion

## 'Experience Level' is the most important feature for predicting the salary of a Data Analyst, Data Scientist, or Machine Learning Engineer.