## Predict The Data Scientists Salary In India

Data scientist is the sexiest job in the world. How many times have you heard that? The Analytics India Annual Salary Study, which aims to understand a wide range of trends in data science, reports that the median analytics salary in India for 2017 was INR 12.7 lakhs across all experience levels and skill sets. So, given the job description and other key information, can you predict the salary range of a job posting? What factors influence the salary of a data scientist?
The study also says that, in the world of analytics, Mumbai is the highest paymaster at almost 13.3 lakhs per annum, followed by Bengaluru at 12.5 lakhs. The industry a data scientist works in can also influence salary: the telecom industry pays the highest median salary to its analytics professionals, at 18.6 lakhs. What are you waiting for? Solve the problem by analysing the given data to predict how much a data scientist or analytics professional will be paid.

Bonus Tip: You can analyse the data and get key insights for your own career as well. The best data scientists and machine learning engineers will be given awesome prizes at the end of the hackathon.

`Data`: The dataset is based on salary and job postings in India collected from across the internet. The train and test data consist of the attributes listed below. Each row of the training set carries rich information about the job posting, such as the designation and the key skills required. The training and test sets comprise 19,802 and 6,601 samples respectively. The dataset was collected over time to gather relevant analytics job postings across the years.

`Features`:
- Name of the company (Encoded) 
- Years of experience 
- Job description 
- Job designation 
- Job Type 
- Key skills 
- Location 
- Salary in Rupees Lakhs (to be predicted)


`Problem Statement`: Based on the given attributes and salary information, build a robust machine learning model that predicts the salary range of a job posting.

##### Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import warnings
warnings.filterwarnings('ignore')
from scipy.sparse import hstack, csr_matrix

##### Import datasets

In [2]:
train = pd.read_csv(os.path.join('data', 'Final_Train_Dataset.csv'))
test = pd.read_csv(os.path.join('data', 'Final_Test_Dataset.csv'))

In [3]:
len_train = len(train)
len_test = len(test)
data = pd.concat((train, test))

In [4]:
data.head()

| | Unnamed: 0 | experience | job_description | job_desig | job_type | key_skills | location | salary | company_name_encoded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 5-7 yrs | Exp: Minimum 5 years;Good understanding of IOC... | Senior Exploit and Vulnerability Researcher | NaN | team skills, communication skills, analytical ... | Delhi NCR(Vikas Puri) | 6to10 | 3687 |
| 1 | 1.0 | 10-17 yrs | He should have handled a team of atleast 5-6 d... | Head SCM | NaN | ppc, logistics, inventory management, supply c... | Sonepat | 10to15 | 458 |
| 2 | 2.0 | 5-9 yrs | Must be an effective communicator (written & s... | Deputy Manager - Talent Management & Leadershi... | Analytics | HR Analytics, Employee Engagement, Training, S... | Delhi NCR | 15to25 | 4195 |
| 3 | 3.0 | 7-10 yrs | 7  -  10 years of overall experience in data e... | Associate Manager Data Engineering | Analytics | SQL, Javascript, Automation, Python, Ruby, Ana... | Bengaluru | 10to15 | 313 |
| 4 | 4.0 | 1-3 yrs | Chartered Accountancy degree or MBA in Finance... | TS- GSA- Senior Analyst | NaN | accounting, finance, cash flow, financial plan... | Gurgaon | 3to6 | 1305 |


In [5]:
data = data[['experience', 'job_description', 'job_desig', 'job_type', 'key_skills', 'location', 'salary', 'company_name_encoded']]

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26403 entries, 0 to 6600
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   experience            26403 non-null  object
 1   job_description       20463 non-null  object
 2   job_desig             26403 non-null  object
 3   job_type              6434 non-null   object
 4   key_skills            26402 non-null  object
 5   location              26403 non-null  object
 6   salary                19802 non-null  object
 7   company_name_encoded  26403 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 1.8+ MB


In [7]:
data.isna().sum()

experience                  0
job_description          5940
job_desig                   0
job_type                19969
key_skills                  1
location                    0
salary                   6601
company_name_encoded        0
dtype: int64

In [8]:
len(data)

26403

##### Data Preprocessing: Skills

In [9]:
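# Lowercase, strip a literal '...' and commas, then collapse whitespace.
# The dots are escaped so that only a real three-dot ellipsis is removed.
data['key_skills'] = data['key_skills'] \
    .apply(lambda x: str(x).lower()) \
    .apply(lambda x: re.sub(r'\.\.\.', '', x)) \
    .apply(lambda x: re.sub(r',', '', x)) \
    .apply(lambda x: re.sub(r'\s+', ' ', x))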
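
To see what this cleaning chain does to a single value, the steps can be wrapped in a small function. This is a minimal sketch: the function name `clean_skills` and the sample strings are made up for illustration, and it assumes the intent is to strip a literal three-dot ellipsis. The last two lines show why the escaping matters: an unescaped `\...` matches a dot followed by any two characters.

```python
import re

def clean_skills(text):
    """Lowercase, drop a literal '...' and commas, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r'\.\.\.', '', text)  # remove a literal three-dot ellipsis
    text = re.sub(r',', '', text)       # drop the commas between skills
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text

print(repr(clean_skills('SQL, Python,  Machine Learning ...')))
# 'sql python machine learning '

# Unescaped pattern: a dot plus ANY two characters is removed.
print(re.sub(r'\...', '', 'node.js'))   # node
print(clean_skills('node.js'))          # node.js
```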

##### Data Preprocessing: job_desig

In [10]:
# Lowercase, replace every non-letter with a space, then collapse whitespace.
data['job_desig'] = data['job_desig'] \
    .apply(lambda x: str(x).lower()) \
    .apply(lambda x: re.sub(r'[^a-z]', ' ', x)) \
    .apply(lambda x: re.sub(r'\s+', ' ', x))

##### Data Preprocessing: job_description

In [11]:
data['job_description'] = data['job_description'].fillna('missing')
# Lowercase, replace every non-letter with a space, then collapse whitespace.
data['job_description'] = data['job_description'] \
    .apply(lambda x: str(x).lower()) \
    .apply(lambda x: re.sub(r'[^a-z]', ' ', x)) \
    .apply(lambda x: re.sub(r'\s+', ' ', x))

##### Data Preprocessing: location

In [12]:
# Lowercase, replace every non-letter with a space, then collapse whitespace.
data['location'] = data['location'] \
    .apply(lambda x: str(x).lower()) \
    .apply(lambda x: re.sub(r'[^a-z]', ' ', x)) \
    .apply(lambda x: re.sub(r'\s+', ' ', x))

##### Data Preprocessing: job_type -> cleaning data

In [13]:
train['job_type'].unique()

array([nan, 'Analytics', 'analytics', 'Analytic', 'ANALYTICS', 'analytic'],
      dtype=object)

In [14]:
# Fill missing values, then map the spelling and case variants to one label.
# Assigning the result back avoids relying on chained in-place modification.
data['job_type'] = data['job_type'].fillna('missingjobtype').replace({
    'Analytics': 'analytics',
    'Analytic': 'analytics',
    'ANALYTICS': 'analytics',
    'analytic': 'analytics',
})
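
The variants listed by `unique()` differ only in letter case and a trailing "s", so the same normalisation can also be sketched with pandas string methods. The Series below is made up to mirror those values, and `norm` is an illustrative name. Note that, unlike an explicit mapping, lowercasing would also alter any other job type present in the data.

```python
import pandas as pd

# Hypothetical sample mirroring the values returned by unique()
s = pd.Series([None, 'Analytics', 'analytics', 'Analytic', 'ANALYTICS', 'analytic'])

norm = (
    s.fillna('missingjobtype')            # placeholder for missing job types
     .str.lower()                         # fold case variants together
     .replace({'analytic': 'analytics'})  # fix the singular spelling
)

print(norm.unique().tolist())  # ['missingjobtype', 'analytics']
```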

##### Data Preprocessing: experience

In [15]:
data['experience'].sample(10)

14857     4-6 yrs
5814      3-6 yrs
5272      5-8 yrs
5826      1-5 yrs
889       5-8 yrs
19076     3-5 yrs
7142      2-4 yrs
531      8-13 yrs
18492    6-10 yrs
12795    9-12 yrs
Name: experience, dtype: object

In [16]:
data['min_exp'] = data['experience'].apply(lambda x: x.split('-')[0]).astype('int')
data['max_exp'] = data['experience'].apply(lambda x: x.split('-')[1].split(' ')[0]).astype('int')
data.drop(columns = ['experience'], inplace = True)
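
The `split`-based parsing assumes every value looks like `5-7 yrs`. As a hedged alternative, a regular expression makes that assumption explicit. The sample values are taken from the outputs above; the name `parts` is illustrative. A value that does not match the pattern would produce NaN, and the integer cast would then fail, which makes malformed rows visible.

```python
import pandas as pd

exp = pd.Series(['5-7 yrs', '10-17 yrs', '1-3 yrs'], name='experience')

# Named groups become the column names of the resulting DataFrame.
parts = exp.str.extract(r'(?P<min_exp>\d+)\s*-\s*(?P<max_exp>\d+)')
parts = parts.astype('int64')

print(parts['min_exp'].tolist())  # [5, 10, 1]
print(parts['max_exp'].tolist())  # [7, 17, 3]
```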

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26403 entries, 0 to 6600
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   job_description       26403 non-null  object
 1   job_desig             26403 non-null  object
 2   job_type              26403 non-null  object
 3   key_skills            26403 non-null  object
 4   location              26403 non-null  object
 5   salary                19802 non-null  object
 6   company_name_encoded  26403 non-null  int64 
 7   min_exp               26403 non-null  int32 
 8   max_exp               26403 non-null  int32 
dtypes: int32(2), int64(1), object(6)
memory usage: 1.8+ MB


In [18]:
data['merged'] = (data['job_desig'] + ' ' + data['job_description'] + ' ' + data['key_skills'] + ' ' + data['job_type'])

In [19]:
data.drop(columns = ['job_desig', 'job_description', 'key_skills', 'job_type'], inplace = True)

In [20]:
data

| | location | salary | company_name_encoded | min_exp | max_exp | merged |
|---|---|---|---|---|---|---|
| 0 | delhi ncr vikas puri | 6to10 | 3687 | 5 | 7 | senior exploit and vulnerability researcher ex... |
| 1 | sonepat | 10to15 | 458 | 10 | 17 | head scm he should have handled a team of atle... |
| 2 | delhi ncr | 15to25 | 4195 | 5 | 9 | deputy manager talent management leadership de... |
| 3 | bengaluru | 10to15 | 313 | 7 | 10 | associate manager data engineering  years of o... |
| 4 | gurgaon | 3to6 | 1305 | 1 | 3 | ts gsa senior analyst chartered accountancy de... |
| ... | ... | ... | ... | ... | ... | ... |
| 6596 | mumbai | NaN | 2692 | 4 | 7 | business analyst implementation p p s p erp sc... |
| 6597 | gurgaon | NaN | 104 | 1 | 5 | sap basis administration missing crm scm srm c... |
| 6598 | mumbai | NaN | 2025 | 5 | 10 | apps store developer lead android ios ovi stor... |
| 6599 | hyderabad | NaN | 2512 | 7 | 12 | associate scientific liasion scientific liasio... |


##### LabelEncoder on salary

In [21]:
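from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
# Test rows have no salary (NaN). Depending on the scikit-learn version, NaN is
# encoded as one extra class; only the labels of the training rows are used later.
data['salary'] = LE.fit_transform(data['salary'])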

In [22]:
y_data = data.pop('salary').values
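
A small sketch of how `LabelEncoder` maps salary bands to integers may help when reading the predictions later. The band strings are ones that appear in the outputs above; the variable names are illustrative. The codes follow the lexicographic order of the strings, not the order of the salary ranges, so they should be read as class identifiers only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['6to10', '10to15', '15to25', '3to6', '6to10'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))  # ['10to15', '15to25', '3to6', '6to10']
print(encoded.tolist())   # [3, 0, 1, 2, 3]

# inverse_transform recovers the original band names from the codes
print(le.inverse_transform(encoded).tolist())
```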

##### StandardScaler

In [23]:
from sklearn.preprocessing import StandardScaler

In [24]:
for col in ['min_exp', 'max_exp']:
    data[col] = StandardScaler().fit_transform(data[col].values.reshape(-1, 1))
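
The loop above fits the scaler on the training and test rows together. A common alternative, sketched here under the assumption that test statistics should not influence preprocessing, is to fit on the training rows only and reuse the fitted scaler for the test rows. The small arrays are illustrative values, not the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: min_exp, max_exp (illustrative values)
train_exp = np.array([[5, 7], [10, 17], [1, 3]], dtype=float)
test_exp = np.array([[4, 7]], dtype=float)

scaler = StandardScaler().fit(train_exp)    # statistics from training rows only
train_scaled = scaler.transform(train_exp)
test_scaled = scaler.transform(test_exp)    # same mean and scale reused

print(train_scaled.mean(axis=0).round(6))   # approximately zero per column
print(test_scaled)
```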

##### TF-IDF

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy import sparse

In [26]:
tf1 = TfidfVectorizer(min_df=3, token_pattern=r'\w{3,}', ngram_range=(1,3), max_df=0.9)
data_merged = tf1.fit_transform(data['merged'])

In [27]:
tf2 = TfidfVectorizer(min_df=2, token_pattern=r'\w{3,}')
data_loc = tf2.fit_transform(data['location'])
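
The effect of `token_pattern` and `ngram_range` is easiest to see on a toy corpus. The three documents below are made up, and `min_df=1` with `ngram_range=(1, 2)` is chosen only so that the tiny example keeps its terms. With `\w{3,}`, tokens shorter than three characters (such as "hr") are dropped before n-grams are built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'senior data analyst sql python',
    'data engineer sql spark',
    'hr manager talent management',
]

vec = TfidfVectorizer(token_pattern=r'\w{3,}', ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(docs)  # sparse matrix, one row per document

print(X.shape[0])                      # 3
print('sql' in vec.vocabulary_)        # True
print('hr' in vec.vocabulary_)         # False: shorter than 3 characters
print('data analyst' in vec.vocabulary_)  # True: a bigram
```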

In [28]:
data = sparse.csr_matrix(data[['company_name_encoded', 'min_exp', 'max_exp']].values)

In [29]:
data = hstack((data, data_merged, data_loc)).toarray()
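
Calling `.toarray()` turns the stacked matrix into a dense array, which can use a lot of memory when the TF-IDF vocabulary is large. A possible alternative, shown here as a sketch on made-up values, is to keep the result in CSR format, which still supports row slicing. One difference to keep in mind: `len()` is not defined for SciPy sparse matrices, so row counts come from `.shape[0]`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Illustrative numeric block: company code, min_exp, max_exp
numeric = csr_matrix(np.array([[3687, 5, 7], [458, 10, 17]], dtype=float))
# Illustrative TF-IDF block with four terms
text = csr_matrix(np.array([[0.0, 0.7, 0.0, 0.7], [0.5, 0.0, 0.0, 0.0]]))

combined = hstack((numeric, text)).tocsr()  # stays sparse, sliceable by row

print(combined.shape)      # (2, 7)
print(combined[:1].shape)  # (1, 7)
print(combined.shape[0])   # 2
```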

##### Splitting data

In [30]:
X_train = data[:len(train)]
X_test = data[len(train):]

In [31]:
y_train = y_data[:len(train)]

In [32]:
len(X_train), len(X_test), len(y_train)

(19802, 6601, 19802)
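
This split only separates labelled rows from unlabelled ones. To estimate how a model generalises, part of the labelled rows can be held out for validation. The sketch below uses placeholder arrays rather than the real features; the names `X_tr`, `X_val`, `y_tr`, `y_val` and the 80/20 ratio are assumptions. `stratify` keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))      # placeholder features
y = np.repeat(np.arange(6), 100)   # six balanced placeholder classes

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_val.shape)       # (480, 5) (120, 5)
print(np.bincount(y_val).tolist())   # [20, 20, 20, 20, 20, 20]
```

A model fitted on `X_tr` can then be scored on `X_val`, which it has not seen during training.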

In [33]:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)

In [34]:
param = {'objective': 'multiclass',
         'num_iterations': 80,
         'learning_rate': 0.04,  
         'num_leaves': 23,
         'max_depth': 7, 
         'min_data_in_leaf': 28, 
         'max_bin': 10, 
         'min_data_in_bin': 3,   
         'num_class': 6,
         'metric': 'multi_logloss'
         }

In [35]:
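# The number of boosting rounds is taken from 'num_iterations' in param (80);
# LightGBM uses that value in preference to a num_boost_round argument.
lgbm = lgb.train(params=param,
                 train_set=train_data)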

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.076686 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 37843
[LightGBM] [Info] Number of data points in the train set: 19802, number of used features: 3849
[LightGBM] [Info] Start training from score -1.808668
[LightGBM] [Info] Start training from score -1.481706
[LightGBM] [Info] Start training from score -1.568717
[LightGBM] [Info] Start training from score -2.531528
[LightGBM] [Info] Start training from score -1.947629
[LightGBM] [Info] Start training from score -1.723636


##### Accuracy on train

In [36]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [37]:
y_pred = lgbm.predict(X_train)
# Each row of y_pred holds one probability per class; take the most likely class.
predictions = np.argmax(y_pred, axis=1)

print('accuracy:', accuracy_score(y_train, predictions))

accuracy: 0.5292394707605292
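
Accuracy alone does not show which salary bands get confused with each other. `confusion_matrix`, imported above, gives a per-class breakdown. The sketch below uses small made-up label lists rather than the model's output; rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for three classes
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 2, 2, 2, 1, 1]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print(round(acc, 4))  # 0.6667
print(cm.tolist())    # [[1, 1, 0], [0, 1, 1], [0, 0, 2]]
```

The same call applied to `y_train` and `predictions` would show which bands the model mixes up on the training rows.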


##### Predictions on test

In [38]:
y_pred_test = lgbm.predict(X_test)
# Pick the most probable class for each test row.
predictions = np.argmax(y_pred_test, axis=1)

In [39]:
op = pd.DataFrame()
op['salary'] = LE.inverse_transform(predictions)
op

| | salary |
|---|---|
| 0 | 15to25 |
| 1 | 0to3 |
| 2 | 6to10 |
| 3 | 0to3 |
| 4 | 0to3 |
| ... | ... |
| 6596 | 10to15 |
| 6597 | 6to10 |
| 6598 | 15to25 |
| 6599 | 15to25 |


In [40]:
op.to_csv('submissions.csv', index = False)