# **Salary Prediction of Software Developers**

This is a salary prediction project using the ***Stack Overflow 2018 Developer Survey***
In the IT sector various features play a vital role in defining a good career as a developer. 
For developers seeking jobs, it would be beneficial if they had a model to predict the salary range based on various features like 
country, years of coding, developer type, degree, skillset, and programming languages

**Team Members :**
* Surya M N - PES1UG19CS525
* Avanish V Patil - PES1UG19CS096
* Kedar U Shet - PES1UG19CS217
* Tushar Y S - PES1UG19CS545

In [51]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
import pandas as pd 
import copy
import datetime as dt
import os
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
import cufflinks as cf
cf.go_offline()
import missingno as msn
from matplotlib import cm
color = sns.color_palette()
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_absolute_error
from keras import backend as K

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from IPython.display import display

def display_all(df):
    with pd.option_context("display.max_rows", 500, "display.max_columns", 300,'display.max_colwidth', -1):
        display(df)
        
# Error Functions

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

def rmse(y_true, y_pred):
        return K.sqrt(K.mean(K.square(y_pred - y_true))) 

def root_mean_squared_error(y_true, y_pred):
    return ((y_test - y_pred) ** 2).mean() ** .5

# **Import Dataset**

In [52]:
df = pd.read_csv("../input/stack-overflow-2018-developer-survey/survey_results_public.csv",usecols=["Country","Gender","Age","IDE","JobSatisfaction","Employment","FormalEducation","UndergradMajor","DevType","YearsCoding","EducationTypes","ConvertedSalary","LanguageWorkedWith","DatabaseWorkedWith","FrameworkWorkedWith","HoursComputer"])

In [53]:
display_all(df.head())

# **Data Preprocessing**

In [54]:
df.drop_duplicates(keep=False,inplace=True)

In [55]:
#Removing all NULL values
df = df.dropna()

#Remove salary lesser than 1000 and greater than 300000
df = df[(df['ConvertedSalary']>1000) & (df['ConvertedSalary']<300000)]

# Considering only Male and Female in the dataset
df = df[(df['Gender'] == 'Male') | (df['Gender'] == 'Female')]

# Remove age outliers
df = df[df['Age'] != 'Under 18 years old']
df = df[df['Age'] != '65 years or older']

# One hot encode LanguagesWorkedWith

In [56]:
languages = ['JavaScript','HTML','CSS','SQL','Bash/Shell','Java','Python','C#','PHP','TypeScript','C++','C']

temp = df['LanguageWorkedWith'].str.split(';', expand=True)

# Get all the possible values in this column
new_columns = pd.unique(temp.values.ravel())
for new_c in new_columns:
    if new_c and new_c is not np.nan and new_c in languages:

        # Create new column for each unique column
        idx = df['LanguageWorkedWith'].str.contains(new_c, regex=False).fillna(False)
        df.loc[idx, f"{new_c}"] = 1


# One hot encode DevType

In [57]:
temp = df['DevType'].str.split(';', expand=True)

# Get all the possible values in this column
new_columns = pd.unique(temp.values.ravel())
for new_c in new_columns:
    if new_c and new_c is not np.nan:
        
        # Create new column for each unique column
        idx = df['DevType'].str.contains(new_c, regex=False).fillna(False)
        df.loc[idx, f"{new_c}"] = 1



# One hot encode FrameWorkedWith

In [58]:
temp = df['FrameworkWorkedWith'].str.split(';', expand=True)

# Get all the possible values in this column
new_columns = pd.unique(temp.values.ravel())
for new_c in new_columns:
    if new_c and new_c is not np.nan:
        
        # Create new column for each unique column
        idx = df['FrameworkWorkedWith'].str.contains(new_c, regex=False).fillna(False)
        df.loc[idx, f"{new_c}"] = 1


In [59]:
# Filling null values in encoded columns with 0
df = df.fillna(0)

In [60]:
display_all(df.head())

# **Exploratory Data Analysis**

# Gender Distribution

In [61]:
gen = pd.DataFrame(df['Gender'].dropna().str.split(';').tolist()).stack()
gen=  gen.value_counts().sort_values(ascending=False)
labels = gen.index
labels= 'Male', 'Female', 'Non-binary or Transgender'
f, ax1 = plt.subplots(figsize=(15,7))

sizes = gen/gen.sum() * 100
sizes = [85.594640,12.897822 , 1.5075379]
explode = (0.05,0.05,0.05)
colors= ['#66b3ff','#c2c2f0', '#ff9999']
ax1.pie(sizes, colors = colors, labels=labels, autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = explode)

centre_circle = plt.Circle((0,0),0.50,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax1.axis('equal')
plt.tight_layout()
plt.show()

# Bar graph for Formal Education of the respondants

In [62]:
edu = df['FormalEducation'].value_counts()
edu = pd.DataFrame({'type':edu.index,'percent':(edu.values)*100/sum(edu.values)})
fig = plt.figure()
sns.barplot(edu['percent'],edu['type'])
plt.show()

# Types of Developers

In [63]:
plt.figure(figsize=(15,7))
temp_devtype = pd.DataFrame(df['DevType'].dropna().str.split(';').tolist()).stack()
temp_devtype_counts = temp_devtype.value_counts().sort_values()
temp_devtype_counts.plot.barh(color=sns.color_palette('pastel',15))
plt.title('DevType', fontsize=15)
plt.yticks(fontsize=18)
plt.show()

# Most Popular Languages

In [64]:
plt.figure(figsize=(15,10))
temp_language = pd.DataFrame(df['LanguageWorkedWith'].str.split(';').tolist()).stack()
temp_language_counts = temp_language.value_counts().sort_values()
temp_language_counts.plot.barh(color=sns.color_palette('pastel',15))
plt.title('Most Popular Languages', fontsize=15)
plt.yticks(fontsize=12)
plt.show()

# Most Popular Framework

In [65]:
plt.figure(figsize=(15,7))
temp_framework = pd.DataFrame(df['FrameworkWorkedWith'].dropna().str.split(';').tolist()).stack()
temp_framework_counts = temp_framework.value_counts().sort_values()
temp_framework_counts.plot.barh(color=sns.color_palette('pastel',15))
plt.title('Most Popular Framework', fontsize=15)
plt.yticks(fontsize=12)
plt.show()

# Most Popular IDE

In [66]:
plt.figure(figsize=(15,10))
temp_ide = pd.DataFrame(df['IDE'].str.split(';').tolist()).stack()
temp_ide_counts = temp_ide.value_counts().sort_values()
temp_ide_counts.plot.barh(color=sns.color_palette("pastel", 15))
plt.title('Most Popular IDE', fontsize=15)
plt.yticks(fontsize=12)
plt.show()

# Job Satisfaction in different countries

In [67]:
df['JobSatisfaction'].value_counts()
sat = df[np.logical_or(np.logical_or(df['JobSatisfaction'] == 'Moderately satisfied', df['JobSatisfaction'] == 'Extremely satisfied'), df['JobSatisfaction'] == 'Slightly satisfied')]

plt.figure(figsize=(14, 8))
sns.countplot(data=sat, x='Country', hue='JobSatisfaction', palette='Paired', order=sat['Country'].value_counts()[:10].index)
sns.despine(left=True)
plt.xticks(rotation='vertical')

# Graph plotting the number of males and females in various age groups

In [68]:
male_female = df[df["Gender"].isin(['Male', 'Female'])]
plt.figure(figsize=(15,8))
g=sns.countplot(x=male_female['Age'],hue=male_female['Gender'], order=male_female['Age'].dropna().sort_values().unique())
g.set_xlabel("Age")
g.set_xticklabels(g.get_xticklabels(),rotation=90)
g.legend(bbox_to_anchor=(1.1, 1.05))
plt.title("Age Vs Gender")
plt.show()

# Salary Plot

In [69]:
f, ax = plt.subplots(figsize=(18, 7))
plt.xticks(rotation='45')
sns.distplot(df['ConvertedSalary']);
plt.xlabel('Annual salary', fontsize=15)

# Salary vs Country

In [70]:
df1 = df.copy()
plt.figure(figsize=(16,5))
df1 = df1[df1["Country"].isin(['India', 'United States', 'Germany', 'United Kingdom','France','Canada','Spain','Australia','Israel'])]
df1 = df1.groupby('Country', as_index=False)['ConvertedSalary'].mean()
df1 = df1.sort_values('ConvertedSalary')
plt.bar(df1['Country'],df1['ConvertedSalary'],color=['r', 'g', 'b', 'k', 'y', 'm', 'c'])
plt.xlabel("Countries")
plt.ylabel("Salary per annum (USD)")
plt.title("Country vs Average Salary",fontdict={'weight': 'bold', 'size': 24})
plt.show()

# Salary v/s Coding Experience

In [71]:
df1 = df.copy()
plt.figure(figsize=(16,5))
df1['YearsCoding'].replace({'0-2 years':1,'3-5 years':4, '6-8 years':7, '9-11 years':10, '12-14 years':13, '15-17 years':16, '18-20 years':19,'21-23 years':22,'24-26 years':25,'27-29 years':28,'30 or more years':31},inplace=True)
df1 = df1.groupby('YearsCoding', as_index=False)['ConvertedSalary'].mean()
plt.plot(df1['YearsCoding'],df1['ConvertedSalary'])
plt.title("Salary vs Coding Experience",fontdict={'weight': 'bold', 'size': 24})
plt.xlabel("Progression in coding years")
plt.ylabel("Salary per annum (USD)")
plt.show()
print("Correlation : " ,df1['YearsCoding']. corr(df1['ConvertedSalary']))

# Salary v/s Country

In [72]:
df1 = df.copy()
plt.figure(figsize=(18,5))
df1 = df1[df1["Country"].isin(['India', 'United States', 'Germany', 'United Kingdom','France','Canada','Spain','Australia','Israel'])]
df1['Country'].replace({'India':1,'Spain':2, 'France':3,'Germany':4,'Canada':5,'United Kingdom':6,'Australia':7,'Israel':8,'United States':10},inplace=True)
df1 = df1.groupby('Country', as_index=False)['ConvertedSalary'].mean()
df1 = df1.sort_values('ConvertedSalary')
# plt.bar(df1['Country'],df1['ConvertedSalary'],color="rgbkymc")
plt.plot(df1['Country'],df1['ConvertedSalary'])
plt.xlabel("Countries")
plt.ylabel("Salary per annum (USD)")
plt.title("Country vs Average Salary",fontdict={'weight': 'bold', 'size': 24})
plt.show()
print("Correlation : " ,df1['Country']. corr(df1['ConvertedSalary']))

# Salary v/s Age

In [73]:
df_age = df.copy()
plt.figure(figsize=(16,5))
df_age['Age'].replace({'18 - 24 years old':21, '25 - 34 years old':29.5, '35 - 44 years old':39.5, '45 - 54 years old':49.5, '55 - 64 years old':59.5},inplace=True)
df_age = df_age.groupby('Age', as_index=False)['ConvertedSalary'].mean()
plt.plot(df_age['Age'],df_age['ConvertedSalary'])
plt.title("Salary vs Age",fontdict={'weight': 'bold', 'size': 24})
plt.xlabel("Progression in Age")
plt.ylabel("Salary per annum (USD)")
plt.show()
print("Correlation : " ,df_age['Age']. corr(df1['ConvertedSalary']))

# Salary v/s Hours of using Computer

In [74]:
df_comp = df.copy()
plt.figure(figsize=(16,5))
df_comp['HoursComputer'].replace({'Less than 1 hour':1,'1 - 4 hours':2.5, '5 - 8 hours':6.5, '9 - 12 hours':10.5, 'Over 12 hours':13.5},inplace=True)
df_comp = df_comp.groupby('HoursComputer', as_index=False)['ConvertedSalary'].mean()
plt.plot(df_age['Age'],df_age['ConvertedSalary'])
plt.title("Salary vs HoursComputer",fontdict={'weight': 'bold', 'size': 24})
plt.xlabel("Hours on Computer")
plt.ylabel("Salary per annum (USD)")
plt.show()
print("Correlation : " , df_comp['HoursComputer']. corr(df1['ConvertedSalary']))

# Transforming Categorical Data to Numerical Data

In [75]:
df = df[df["Country"].isin(['India', 'United States', 'Germany', 'United Kingdom','France','Canada','Spain','Australia','Israel'])]
df['Age'] = df['Age'].map({'18 - 24 years old':21, '25 - 34 years old':29.5, '35 - 44 years old':39.5, '45 - 54 years old':49.5, '55 - 64 years old':59.5}).astype(int)
df['YearsCoding'] = df['YearsCoding'].map({'0-2 years':1,'3-5 years':4, '6-8 years':7, '9-11 years':10, '12-14 years':13, '15-17 years':16, '18-20 years':19,'21-23 years':22,'24-26 years':25,'27-29 years':28,'30 or more years':31}).astype(int)
df['Country'] = df['Country'].map({'India':1,'Spain':2, 'France':3,'Germany':4,'Canada':5,'United Kingdom':6,'Australia':7,'Israel':8,'United States':9}).astype(int)
df['HoursComputer'] = df['HoursComputer'].map({'Less than 1 hour':1,'1 - 4 hours':2.5, '5 - 8 hours':6.5, '9 - 12 hours':10.5, 'Over 12 hours':13.5}).astype(int)
df['Gender'] = df['Gender'].map({'Female':0,'Male':1}).astype(int)

In [76]:
# After all the preprocessing
df.shape

# **Salary Prediction**
> ***Note : The metric chosen for validating the models are accuracy 
> -> We have assumed that  if the predicted salary is within the range of +/- 20,000 USD is a correct prediction***

***The attributes chosen to predict salary are :***
* Age
* Country
* Gender
* Years of Coding Experience
* Hours working on Computer per day
* Languages worked with
* Developer Type
* Framework worked with

## **Multiple Linear Regression**

In [77]:
# Linear Regression Model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix,accuracy_score
regressor=LinearRegression()

In [78]:
temp_devtype = pd.DataFrame(df['DevType'].dropna().str.split(';').tolist()).stack()
temp_devtype_counts = temp_devtype.value_counts().sort_values()
devTypes = list(temp_devtype_counts.index)
temp_framework = pd.DataFrame(df['FrameworkWorkedWith'].dropna().str.split(';').tolist()).stack()
temp_framework_counts = temp_framework.value_counts().sort_values()
frameworkTypes = list(temp_framework_counts.index)
attributes = ['Age','Gender','YearsCoding','Country','HoursComputer','JavaScript','HTML','CSS','SQL','Bash/Shell','Java','Python','C#','PHP','TypeScript','C++','C'] + devTypes + frameworkTypes
x = df[attributes]
y = df['ConvertedSalary']

In [79]:
# Splitting data into training and testing

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)

In [80]:
#Traning the model
regressor.fit(x_train,y_train)

In [81]:
predicted=regressor.predict(x_test)


In [82]:
dframe=pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':predicted})
dframe.head()

In [83]:
# first_bracket = "From 1k to 25k"
# second_bracket = "From 25k to 50k"
# third_bracket = "From 50k to 75k"
# forth_bracket = "From 75k to 100k"
# fifth_bracket = "From 100k to 150k"
# sixth_bracket = "From 150k to 200k"
# seventh_bracket = "From 200k to 300k"

In [84]:
# dframe['ActualSalaryRange'] = pd.cut(dframe['ActualSalary'], bins=[1000,25000,50000,75000,100000,150000,200000,300000], labels=[first_bracket, second_bracket, third_bracket,forth_bracket,fifth_bracket,sixth_bracket,seventh_bracket])
# dframe['PredictedSalaryRange'] = pd.cut(dframe['PredictedSalary'], bins=[1000,25000,50000,75000,100000,150000,200000,300000], labels=[first_bracket, second_bracket, third_bracket,forth_bracket,fifth_bracket,sixth_bracket,seventh_bracket])

In [85]:
seriesObj = dframe.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
num = len(seriesObj[seriesObj == True].index)

print("Accuracy: ", num/dframe.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,predicted))


# **Support Vector Regression**

In [86]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
model_svr_regr = make_pipeline(StandardScaler(),  SVR(kernel="poly", C=100, gamma="auto", degree=3, epsilon=0.1, coef0=1))
model_svr_regr.fit(x_train, y_train)
price_svr=model_svr_regr.predict(x_test)

In [87]:
dframe_svr=pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':price_svr})
seriesObj = dframe_svr.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
num = len(seriesObj[seriesObj == True].index)


print("Accuracy: ", num/dframe_svr.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,price_svr))

# **Ridge Regression**

In [88]:
from sklearn import linear_model
model_r = linear_model.Ridge(normalize= True, alpha= 0.001)
model_r.fit(x_train,y_train)
priceridge = model_r.predict(x_test)

In [89]:
dframe_ridge=pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':priceridge})
seriesObj = dframe_ridge.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
num = len(seriesObj[seriesObj == True].index)

print("Accuracy: ", num/dframe_ridge.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,priceridge))

# **Adaboost**

In [90]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
adaboost = AdaBoostRegressor(random_state=0, n_estimators=100)
adaboost.fit(x_train, y_train)

In [91]:
price_ada=model_svr_regr.predict(x_test)

dframe_ada=pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':price_ada})
seriesObj = dframe_ada.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
num = len(seriesObj[seriesObj == True].index)

print("Accuracy: ", num/dframe_ada.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,price_ada))

# **Decision Tree Classifier**

In [92]:
from sklearn.tree import DecisionTreeClassifier
dtc_model = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5, min_samples_split=3, min_samples_leaf=1)
dtc_model.fit(x_train, y_train)

In [93]:
dtc_predict = dtc_model.predict(x_test)

In [94]:
dframe_dec=pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':dtc_predict})
seriesObj = dframe_dec.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
num = len(seriesObj[seriesObj == True].index)
print("Accuracy: ", num/dframe_dec.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,predicted))

# **Artificial Neural Network**

In [95]:
#Dependencies
import keras
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from keras.layers import Dense, Dropout


# Neural network
model = Sequential()
model.add(Dense(64, input_dim=x.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation = 'linear'))



In [96]:
model.compile(loss=rmse, optimizer='adam')

In [97]:
history = model.fit(x_train, y_train, epochs=100, batch_size=128)

In [98]:
y_pred_ann = model.evaluate(x_test, y_test)

In [99]:
y_prediction = model.predict(x_test)
y_prediction_df = y_prediction.reshape((-1,))


In [100]:
dframe_ann_reg = pd.DataFrame({'ActualSalary':y_test,'PredictedSalary':y_prediction_df})
# dframe_ann_reg['Match'] = 0
seriesObj = dframe_ann_reg.apply(lambda x: True if abs(x['ActualSalary'] -  x['PredictedSalary'])<=20000 else False , axis=1)
# dframe_ann_reg['Match'] = dframe_ann_reg.apply(f, axis=1)
num = len(seriesObj[seriesObj == True].index)
print(dframe_ann_reg.head(20),'\n')
print("Accuracy: ", num/dframe_ann_reg.shape[0]*100)
print("RMSE:" , root_mean_squared_error(y_test,y_prediction_df))
# dframe_ann_reg.head(50)

# **Conclusion**

The following models were implemented :
* Multiple Linear Regression
* Support Vector Regression
* Adaboost
* Decision Tree Classifier
* Ridge Regression
* Artificial Neural Network

The models are validated on the basis of accuracy (Predicted salary with a margin of 20,000 USD is considered a right prediction)
and RMSE

ANN performed the best with an accuracy of 61.40% and an RMSE of 32005

Predictions are only as good as the dataset. The prediction did not perform extremely well for the dataset although we got pretty good results for lots of records.