####  ADMN5016 Assignment 
####  Proof of Concept for Machine Learning Application  


####  Business Analytics, St Lawrence College - Kingston  

####  Saranya Rajasekhar Nair 
####  Elvis  Ramirez - Student ID: 4354150                    

## I. About the Problem Statement ✔️

This study aims to provide precise insights into current salary trends in data science by examining the relationships among several factors. We use a dataset containing work year, experience level, employment type, job title, salary details, employee residence, remote work ratio, company location, and company size to build a machine learning model. Given an employee's data profile, the model is intended to predict salaries for different job positions.

Furthermore, such a model can support a company's compensation process. Using the model's predictions, the company can improve its salary structuring and decision-making: determining competitive compensation packages, optimizing resource allocation, and aligning salaries with industry standards.

## II. Domain Knowledge ✔️ 

#### work_year [categorical] : 
This represents the specific year in which the salary was disbursed. Different years may have different economic conditions which can impact the salary level.

#### experience_level [categorical] : 
The level of experience a person holds in a particular job. This is a key determinant in salary calculation as typically, more experienced individuals receive higher pay due to their advanced skills and knowledge.

#### employment_type [categorical] : 
The nature of the employment contract such as full-time, part-time, or contractual can greatly influence the salary. Full-time employees often have higher annual salaries compared to their part-time or contractual counterparts.

####  job_title [categorical] : 
The role an individual holds within a company. Different roles have different salary scales based on the responsibilities and skills required. For example, managerial roles typically pay more than entry-level positions.

#### salary [numerical] : 
The total gross salary paid to the individual. This is directly influenced by factors such as experience level, job title, and employment type.

#### salary_currency [categorical] : 
The specific currency in which the salary is paid, denoted by an ISO 4217 code. Exchange rates could affect the value of the salary when converted into different currencies.

#### salary_in_usd [numerical] : 
The total gross salary amount converted to US dollars. This allows for a uniform comparison of salaries across different countries and currencies.
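
As a small worked example of the conversion described above, the exchange rate implied by a single record can be recovered by dividing the USD amount by the local-currency amount. The figures below are copied from the first row of the dataset preview (80,000 EUR reported as 85,847 USD); the variable names are illustrative only, and the exact rate source used by the dataset is not known.

```python
# Values taken from the first row (index 0) of the dataset preview.
salary_local = 80_000    # `salary`, paid in EUR
salary_in_usd = 85_847   # `salary_in_usd` for the same row

# Exchange rate implied by this one record (USD per EUR).
implied_rate = salary_in_usd / salary_local
print(f"Implied EUR->USD rate: {implied_rate:.4f}")

# For a row already paid in USD (index 1 in the preview) the two
# columns coincide, so the implied rate is exactly 1.
usd_row_rate = 30_000 / 30_000
print(f"Implied USD->USD rate: {usd_row_rate:.1f}")
```

Because each row carries its own implied rate, `salary_in_usd` is the safer column for comparing salaries across countries.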

#### employee_residence [categorical]: 
The primary country of residence of the employee, denoted by an ISO 3166 code. The cost of living and prevailing wage rates in the employee's country of residence can impact salary levels.

#### remote_ratio [ratio]: 
The proportion of work done remotely. With the rise of remote work, companies may adjust salaries based on the cost of living in the employee's location and the proportion of remote work.

#### company_location [categorical]: 
The location of the employer's main office or the branch that holds the contract. Companies in different locations may offer different salary scales due to varying economic conditions and cost of living.

#### company_size [categorical]: 
The size of the company, recorded as a category (S, M, or L in the data) based on the median number of employees during the work year. Larger companies often have structured salary scales and may offer higher salaries due to economies of scale and larger revenue streams.

✔️ These variables, in combination with appropriate statistical and machine learning techniques, can help predict an individual's salary.
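
Before modelling, the categorical fields described above have to be turned into numbers. The following is a minimal sketch of one possible approach, not necessarily the encoding used later in this notebook: it maps the ordered categories to integer ranks with pandas. The orderings are assumptions (EN < MI < SE < EX for experience and S < M < L for company size), and the `EX` code is assumed even though it does not appear in the data preview.

```python
import pandas as pd

# Toy rows using the category codes visible in the dataset preview.
sample = pd.DataFrame({
    "experience_level": ["SE", "MI", "EN"],
    "company_size": ["L", "S", "M"],
})

# Assumed orderings (not stated in the source data):
# entry < mid < senior < executive, and small < medium < large.
experience_order = {"EN": 0, "MI": 1, "SE": 2, "EX": 3}
size_order = {"S": 0, "M": 1, "L": 2}

# Add integer rank columns next to the original codes.
encoded = sample.assign(
    experience_rank=sample["experience_level"].map(experience_order),
    company_size_rank=sample["company_size"].map(size_order),
)
print(encoded)
```

Unordered fields such as `job_title` or `company_location` have no natural ranking, so they would more likely need one-hot or label encoding instead.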

## III. Import Dataset and the Required Libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

# Import necessary libraries
import numpy as np 
import pandas as pd

# Import Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Import Statistics libraries
from scipy import stats
from scipy.stats import norm

# Import Scikit-learn for Machine Learning libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE

# Import country code libraries
!pip install pycountry -q
import pycountry

# Import Plotly I/O utilities
import plotly.io as pio

## IV. Input Data

In [3]:
mySalariesDF = pd.read_csv('C:/datasets2/ds_salaries.csv')

In [4]:
mySalariesDF.shape

(3755, 11)

In [5]:
mySalariesDF

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
| 3751 | 2021 | MI | FT | Principal Data Scientist | 151000 | USD | 151000 | US | 100 | US | L |
| 3752 | 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S |
| 3753 | 2020 | EN | CT | Business Data Analyst | 100000 | USD | 100000 | US | 100 | US | L |
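
A quick sanity check after loading is often useful before any modelling. The sketch below is a suggestion rather than part of the original analysis: it rebuilds the first five rows shown in the preview as an in-memory CSV, so that it runs without the local file, and counts missing values, duplicate rows, and distinct values per column. The names `preview_df`, `missing_per_column`, `duplicate_rows`, and `unique_counts` are illustrative; the same calls could be applied to `mySalariesDF`.

```python
import io
import pandas as pd

# First five rows of the preview, rebuilt as an in-memory CSV so the
# example does not depend on the local file path.
preview_csv = io.StringIO(
    "work_year,experience_level,employment_type,job_title,salary,"
    "salary_currency,salary_in_usd,employee_residence,remote_ratio,"
    "company_location,company_size\n"
    "2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L\n"
    "2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S\n"
    "2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S\n"
    "2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M\n"
    "2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M\n"
)
preview_df = pd.read_csv(preview_csv)

# Missing values per column.
missing_per_column = preview_df.isna().sum()

# Number of fully duplicated rows.
duplicate_rows = int(preview_df.duplicated().sum())

# Distinct values per column, a rough guide to categorical cardinality.
unique_counts = preview_df.nunique()

print(preview_df.shape)
print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(unique_counts)
```

One detail worth checking on the full file: by default `pd.read_csv` treats the string `NA` as a missing value, which would affect any row whose ISO 3166 country code is `NA` (Namibia). Passing `keep_default_na=False` avoids that if such rows exist.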
