<h1 align="center">Insights in Data: Understanding Your Company's Salaries </h1>
<center>Ali Shamsi</center>



The primary objective of this project is to use data science methods to extract meaningful insights from our company's salary data. As a Data Scientist, I am tasked with exploring relationships and patterns among key variables such as age, education level, job title, and years of experience to uncover the factors that influence salary distributions. Through exploratory data analysis, statistical calculations, and visualizations, I aim to provide a comprehensive understanding of who receives the highest salaries within the organization. The goal is not only to identify top earners but also to offer actionable insights that can inform decision-making. The dataset used for this project is available on Kaggle (https://www.kaggle.com/datasets/sinhasatwik/salary-base-data?select=Salary_Data.csv).

I'm a Data Scientist working for a company. My boss asks me: "Gather the data on the salaries we pay and give me useful insights. I also want to know which employees are paid the most in our company." What interests me is how the variables (Age, Education Level, etc.) relate to who is paid the most in the company, and what insights we can draw from those relationships.


## 🗃️Data Processing
- The CSV file is loaded from a local directory with the pandas library and stored in a DataFrame named df.

- The "Education Level" column needs cleaning: the same category appears under inconsistent labels (for example, "Bachelor's" and "Bachelor's Degree"), so we replace each variant with the correct category.
    - .unique() and .replace() help us in this process.

- Two other libraries, seaborn and matplotlib, are imported because they are widely used for data visualization.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



# CSV file is loaded and stored in df
# Raw string so the backslash in the Windows path is not read as an escape character
file_path = r"D:\Salary_Data.csv"
df = pd.read_csv(file_path)


# Display unique values in the "Education Level" column
print("Education Levels Before Cleaning:")
print(df["Education Level"].unique())

df["Education Level"] = df["Education Level"].replace({
    "Bachelor's": "Bachelor's Degree",
    "Master's": "Master's Degree",
    "phD": "PhD"
})

# Display unique values again after cleaning
print("\nUnique Education Levels After Cleaning:")
print(df["Education Level"].unique())


# Check the unique Gender values; they look consistent, so no cleaning is needed
print("\nChecking Whether Gender Needs Cleaning:")
print(df["Gender"].unique())
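
The cleaning step above can also be wrapped in a small helper that reports missing values, which `.unique()` shows only as a single `nan` entry. This is a minimal sketch on made-up rows, not the project's data: the helper name `clean_education` and the sample values are assumptions for illustration, and the whitespace stripping is an extra precaution that the original cell does not perform.

```python
import pandas as pd

# Mapping of inconsistent labels to one canonical label per category
# (the same replacements used in the cleaning cell).
EDUCATION_MAP = {
    "Bachelor's": "Bachelor's Degree",
    "Master's": "Master's Degree",
    "phD": "PhD",
}


def clean_education(series: pd.Series) -> pd.Series:
    """Strip stray whitespace, then map label variants to canonical names."""
    return series.str.strip().replace(EDUCATION_MAP)


# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    "Education Level": ["Bachelor's", "Bachelor's Degree", "phD", "PhD",
                        " Master's ", None],
})

sample["Education Level"] = clean_education(sample["Education Level"])

# dropna=False keeps missing values visible in the tally.
education_counts = sample["Education Level"].value_counts(dropna=False)
print(education_counts)

missing_count = int(sample["Education Level"].isna().sum())
print("Missing education levels:", missing_count)
```

Counting the values, rather than only listing them, also shows how many rows each category holds, which matters later when group averages are compared.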

## 📐Statistics and Evaluation 
- I've selected 4 relationships to look at:
    - Average Salary of Employees by Age.
    - Average Salary of Employees by Education.
    - Average Salary of Employees by Years of Experience.
    - Average Salary of Employees by Gender.
- The salary distribution doesn't need a separate calculation; the Salary column can be plotted directly as a histogram.
- The most important statistic is the mean (average), which is used in all of the calculations.
- The pandas groupby function is very useful here: it groups the rows by a variable so that we can calculate the mean salary of each group.

In [None]:

# Average Salary by Age
mean_salary_by_age = df.groupby("Age")["Salary"].mean().reset_index()

# Average Salary by Education
mean_salary_by_education = df.groupby("Education Level")["Salary"].mean().reset_index()

# Average Salary by Years of Experience
average_salary_by_experience = df.groupby("Years of Experience")["Salary"].mean().reset_index()


# Average Salary by Gender
# The result is printed to show the exact values, which makes the gap between groups easier to see and refer back to.
Gender_pay = df.groupby("Gender")["Salary"].mean().reset_index()
print(Gender_pay)
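
Group means are easier to interpret next to the number of rows behind them. The sketch below uses pandas named aggregation to return the mean, median, and row count in one table. The helper `summarize_salary` and the sample rows are hypothetical and only illustrate the pattern; they are not part of the original analysis.

```python
import pandas as pd


def summarize_salary(data: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return mean, median, and row count of Salary for each group in `by`."""
    return (
        data.groupby(by)["Salary"]
        .agg(mean_salary="mean", median_salary="median", employees="count")
        .reset_index()
    )


# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Other"],
    "Salary": [90000, 110000, 100000, 120000, 140000, 125000],
})

gender_summary = summarize_salary(sample, "Gender")
print(gender_summary)
```

On the real data the same helper could be called as `summarize_salary(df, "Gender")` or with any of the other grouping columns.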


## 📊Data Visualization

- For data visualization, we use two libraries, seaborn and matplotlib.
- We generate a plot for each relationship and variable mentioned in our evaluation.
    * For Average Salary by Education, we had to reorder the education levels because their default order is alphabetical.

In [None]:


#Average Salary By Age
plt.figure(figsize=(12, 8))
# Line graph of Average Salary by Age
sns.lineplot(data=mean_salary_by_age, x="Age", y="Salary", marker="o")
plt.title("Average Salary by Age")
plt.xlabel("Age")
plt.ylabel("Average Salary")
plt.show()



# Average Salary by Education
# The levels are reordered explicitly; without an order they are plotted alphabetically.
education_order = ["High School", "Bachelor's Degree", "Master's Degree","PhD"]
sns.set_theme(style="ticks", font_scale=1.25)
#bar plot of average salaries for each education level
plt.figure(figsize=(12, 8))
sns.barplot(x="Education Level", y="Salary", data=mean_salary_by_education, order=education_order, palette="crest")
plt.title("Average Salary by Education")
plt.xlabel("Education Level")
plt.ylabel("Mean Salary")
plt.show()



#Average Salary by Years of Experience
plt.figure(figsize=(12, 8))
#line plot of average salary by Experience
sns.lineplot(data=average_salary_by_experience, x="Years of Experience", y="Salary", marker="o")
plt.title("Average Salary by Years of Experience")
plt.xlabel("Years of Experience")
plt.ylabel("Average Salary")
plt.show()



# Average Salary by Gender
plt.figure(figsize=(10, 6))
# Bar plot of Average Salary by Gender
sns.barplot(data=Gender_pay, x="Gender", y="Salary")
plt.title("Average Salary by Gender")
plt.xlabel("Gender")
plt.ylabel("Average Salary")
plt.show()



# Distribution of Salaries
# Histogram of the Salary column with a KDE curve; keep the returned Axes so the bars can be annotated
ax = sns.histplot(df['Salary'], bins=20, kde=True, color='skyblue')
plt.title("Distribution of Salaries")
plt.xlabel("Salary")
plt.ylabel("Frequency")
# Label each bar with its count
for rect in ax.patches:
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height, f'{int(height)}', ha='center', va='bottom')
plt.show()
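
The education-level reordering used for the bar plot can also be stored in the data itself with an ordered categorical, so that sorting follows the intended order rather than the alphabet. This is a sketch of that alternative, not what the notebook does; the group means below are made-up values.

```python
import pandas as pd

# Desired display order, lowest to highest qualification.
EDUCATION_ORDER = ["High School", "Bachelor's Degree", "Master's Degree", "PhD"]

# Hypothetical group means for illustration only.
means = pd.DataFrame({
    "Education Level": ["Bachelor's Degree", "High School", "Master's Degree", "PhD"],
    "Salary": [95000.0, 40000.0, 130000.0, 165000.0],
})

# An ordered categorical remembers the intended order of its categories.
means["Education Level"] = pd.Categorical(
    means["Education Level"], categories=EDUCATION_ORDER, ordered=True
)

# Sorting now follows EDUCATION_ORDER instead of alphabetical order.
ordered_means = means.sort_values("Education Level").reset_index(drop=True)
print(ordered_means)
```

seaborn generally follows the category order of a categorical column on a categorical axis, so with this approach the `order=` argument may become unnecessary; passing `order=` explicitly, as in the cell above, remains the more visible choice.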












# 📝Insights and Discussion
Based on our analysis, we examined several aspects of employee salaries and drew a number of insights that can help our company.

#### Average Salary by Age
- Between the ages of 21 and 53, the average salary grows steadily with age. Past the age of 53, the graph levels off and fluctuates, showing neither a clear decline nor clear growth. We can conclude from this chart that between the ages of 21 and 53, older employees tend to earn more, but past the age of 53 it is not certain that salary keeps increasing.
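
One way to put a number on the levelling-off described above is to fit a straight line to the mean salaries on each side of the cutoff and compare the slopes. The sketch below does this with `numpy.polyfit` on made-up values. The helper `salary_slope` and the toy data are assumptions for illustration, and the cutoff of 53 is taken from the reading of the chart, not computed.

```python
import numpy as np
import pandas as pd


def salary_slope(means: pd.DataFrame, x: str, low: float, high: float) -> float:
    """Least-squares slope of mean Salary against `x` for low <= x <= high."""
    part = means[(means[x] >= low) & (means[x] <= high)]
    slope, _intercept = np.polyfit(part[x], part["Salary"], deg=1)
    return float(slope)


# Hypothetical mean salaries: rising until age 53, then flat.
ages = np.arange(21, 63)
salaries = np.where(ages <= 53, 30000 + 4000 * (ages - 21), 30000 + 4000 * (53 - 21))
toy_means = pd.DataFrame({"Age": ages, "Salary": salaries})

slope_before = salary_slope(toy_means, "Age", 21, 53)
slope_after = salary_slope(toy_means, "Age", 54, 62)
print(f"Slope for ages 21 to 53: {slope_before:.1f} per year")
print(f"Slope for ages 54 to 62: {slope_after:.1f} per year")
```

On the real data the same function could be applied to `mean_salary_by_age`; a slope near zero after the cutoff would support the plateau reading, while a noisy slope would suggest that too few employees fall in that age range to tell.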

#### Average Salary by Education
- We have categorized education levels into 4 groups: High School, Bachelor's Degree, Master's Degree, and PhD. The chart displays the average salary of each group as a bar graph. The lowest average salary belongs to the High School group, which may reflect that higher-paid roles usually call for skills that come with some sort of degree. The highest average salary belongs to the PhD group, as expected, since PhD holders are highly qualified and specialized in their field, which enables them to earn a higher average salary.


#### Average Salary by Years of Experience
- In this chart, we analyze the average salary by years of experience. The more years of experience an employee has, the more they are paid, until about 25 years of experience, where salary growth suddenly declines. This is surprising, since more years of experience would normally be expected to bring a higher salary.

#### Average Salary by Gender
* The data contains 3 gender groups: Male, Female, and Other. The lowest average salary is for Female, while Male and Other are closer to each other, with Other having the highest average. The exact values are printed in the Statistics and Evaluation section. We will return to this later.

#### Salary Distribution
* The salary distribution covers a wide range of salaries. This is not necessarily bad: if everyone were paid the same salary, some employees could be overpaid, which would be a loss for the company. That is why it is good to have a range of salaries.
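
The width of the distribution can be summarized with quartiles alongside the histogram. The sketch below computes the quartiles, the interquartile range, and the full range for a made-up salary series; the helper `salary_spread` and the values are hypothetical.

```python
import pandas as pd


def salary_spread(salaries: pd.Series) -> pd.Series:
    """Return the quartiles, the interquartile range, and the full range."""
    q1, median, q3 = salaries.quantile([0.25, 0.5, 0.75])
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        "range": salaries.max() - salaries.min(),
    })


# Hypothetical salaries for illustration only.
toy_salaries = pd.Series(
    [40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000, 200000]
)

spread = salary_spread(toy_salaries)
print(spread)
```

On the real data this would be called as `salary_spread(df["Salary"])`; pandas skips missing values in `quantile`, `max`, and `min` by default.
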
***
### Correlations between Average Salary by Age and Average Salary by Years of Experience
Both of these graphs are similar, which makes sense. The more years of experience you have, the more you get paid, and the same holds for age: the chart shows that the older you are, the more you get paid. An older person will most likely have experience from elsewhere as well, which raises their pay in the same way that years of experience gained within our company do. So age and years of experience correlate with each other, because either way the employee has experience.
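
The similarity between the two charts can be checked numerically with a correlation matrix. The sketch below computes Pearson correlations with `DataFrame.corr` on made-up rows; the column names match the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Age": [23, 27, 31, 36, 42, 48, 55],
    "Years of Experience": [1, 3, 6, 10, 15, 20, 28],
    "Salary": [40000, 55000, 75000, 95000, 120000, 150000, 170000],
})

# Pearson correlation between every pair of the three numeric columns.
corr_matrix = toy[["Age", "Years of Experience", "Salary"]].corr(method="pearson")
print(corr_matrix.round(3))

# The single value that links the two charts discussed above.
age_experience_corr = corr_matrix.loc["Age", "Years of Experience"]
```

If the same call on `df` gives a value close to 1 for Age and Years of Experience, the two charts are largely showing one effect rather than two independent ones.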

### Average Salary Gap Between Genders
In our graph, there is a significant gap between Female and the other two groups in the category.
Below is the output produced in our Statistics and Evaluation section.
```python
   Gender         Salary
0  Female  107888.998672
1    Male  121389.870915
2   Other  125869.857143
```

In this output, we can see a significant difference between Female and the other two groups. This points to a possible salary gap, which is a problem both socially and morally. While Male and Other are close to each other in average salary, both are considerably higher than Female. If no action is taken, this difference could lead to discontent and disputes over pay. It could be addressed by increasing salaries for female employees, by placing more women in higher-paid positions, and by greater inclusion at board level, all of which would raise the average salary.
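
The size of the gap can be worked out directly from the three means printed above. The sketch below copies those values into a Series and computes each group's difference from the Female mean, both in absolute terms and as a percentage of the comparison group's own mean. Choosing Female as the baseline and the comparison group's mean as the denominator is one convention among several.

```python
import pandas as pd

# Mean salaries copied from the output printed above.
mean_salary = pd.Series({
    "Female": 107888.998672,
    "Male": 121389.870915,
    "Other": 125869.857143,
})

# Absolute gap: how much more each group earns on average than Female.
absolute_gap = mean_salary - mean_salary["Female"]

# Relative gap: that difference as a percentage of each group's own mean.
relative_gap_pct = absolute_gap / mean_salary * 100

gap_table = pd.DataFrame({
    "mean_salary": mean_salary,
    "gap_vs_female": absolute_gap,
    "gap_pct": relative_gap_pct,
}).round(2)
print(gap_table)
```

With these inputs, the Female mean is about 13,500 below the Male mean (roughly 11%) and about 18,000 below the Other mean (roughly 14%). These are raw differences in group means and do not account for job title, education, or experience.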

### Decline In Growth of Salary Past The 25 Years of Experience
Past a certain point, salary growth declines once employees reach 25 years of experience. Is there a salary cap for these employees? What other ways are there to compensate them apart from raising their salaries? We should applaud these employees' loyalty in staying with the company for so long. A choice needs to be made, because in a competitive labor market these long-serving employees may leave if their pay stops growing.
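
Before acting on the slowdown, it helps to check how many employees sit beyond 25 years of experience, since averages over a handful of people are unstable. The sketch below bins experience with `pd.cut` and reports the mean salary and head count per band. The helper `experience_band_summary`, the band edges, and the rows are all hypothetical.

```python
import pandas as pd


def experience_band_summary(data: pd.DataFrame, edges: list) -> pd.DataFrame:
    """Mean salary and head count per years-of-experience band."""
    # right=False makes each band include its lower edge: [low, high).
    bands = pd.cut(data["Years of Experience"], bins=edges, right=False)
    return (
        data.groupby(bands, observed=False)["Salary"]
        .agg(mean_salary="mean", employees="count")
        .reset_index()
    )


# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Years of Experience": [2, 4, 8, 12, 16, 18, 22, 24, 27, 30],
    "Salary": [50000, 60000, 80000, 100000, 130000, 140000, 170000,
               180000, 185000, 175000],
})

# Bands are [0, 10), [10, 25), [25, 40); the edges are an arbitrary choice.
band_summary = experience_band_summary(toy, edges=[0, 10, 25, 40])
print(band_summary)
```

If the last band holds only a few employees in the real data, the flattening in the chart may partly reflect the small sample, not a deliberate cap.
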
***
In conclusion, we examined several relationships that help explain how salaries are distributed, including who is paid more and who is paid less. We drew insights to help interpret these results, raised questions prompted by the data, and identified the correlation between age and years of experience.

# References
* Packages used: pandas, seaborn, matplotlib

* Dataset (CSV file) downloaded from Kaggle (file provided in the submission):
https://www.kaggle.com/datasets/sinhasatwik/salary-base-data?select=Salary_Data.csv



