In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Loading the Dataset
Dataset<-read.csv("/kaggle/input/data-science-salaries-2023/ds_salaries.csv")
#Head - First 6 Rows of the Dataset
head(Dataset)
#Tail - Last 6 Rows of the Dataset
tail(Dataset)

In [None]:
#Getting the Dimensions of the Dataset
dim(Dataset)

In [None]:
#Structure of the Dataset
str(Dataset)

In [None]:
#Checking for missing values (NA) in the Dataset
colSums(is.na(Dataset)) #Every count is zero, so the Dataset has no missing values.

In [None]:
#Summary of the Dataset
summary(Dataset)

In [None]:
#Operations
#Subsetting Work_year
work_year<-Dataset$work_year
head(work_year)

In [None]:
#typeof work_year
typeof(work_year)

In [None]:
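#Measures of Central Tendency
median(work_year)
#mode() only reports the storage mode, so take the most frequent value from table()
as.numeric(names(which.max(table(work_year))))
quantile(work_year)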
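
For reuse across columns, the most-frequent-value calculation can be wrapped in a small helper that also returns ties. This is a minimal sketch: the name `stat_mode` and the toy vector are illustrative assumptions, and base R's `mode()` is called alongside it only to contrast the storage mode with the statistical mode.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector.
stat_mode <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # keep every value tied for first
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

years <- c(2021, 2022, 2023, 2023, 2022, 2023)  # toy data, not from the dataset
mode(years)       # storage mode of the object: "numeric"
stat_mode(years)  # most frequent value: 2023
```

Returning all tied values avoids silently picking one of them, which `which.max()` would do.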

In [None]:
#Subsetting Numeric Columns
num_cols=Dataset[, c(1, 5, 7, 9)]
head(num_cols)

In [None]:
# Calculate summary statistics for each numeric column
#1. Mean
print("Mean")
print(apply(num_cols, 2, mean))

#2. Median
print("Median")
print(apply(num_cols, 2, median))

#3. Mode (most frequent value; base mode() only reports the storage mode)
print("Mode")
print(apply(num_cols, 2, function(x) as.numeric(names(which.max(table(x))))))

#4. Quantile
print("Quantile")
print(apply(num_cols, 2, quantile))

In [None]:
#unique values of work_year
work_year_unq=unique(work_year)
work_year_unq

In [None]:
#Value Counts of work_year
work_yr_freq=table(work_year)
work_yr_freq

In [None]:
#plotting the chart
# Define a color palette
color_palette <- rainbow(length(work_yr_freq))
pie(work_yr_freq, col=color_palette, main="Work Year")
legend("topright", as.character(sort(work_year_unq)), cex=0.8, fill=color_palette)

In [None]:
barplot(work_yr_freq, xlab='Year', ylab='Count', main='Work Year', col=color_palette)
legend("topleft", as.character(sort(work_year_unq)), fill=color_palette)

In [None]:
#Selecting cols
selected_cols<-Dataset[, c(2, 3, 4, 6, 8, 10, 11)]

#finding uniques
unq_vals<-sapply(selected_cols, function(col) unique(col))
print(unq_vals)

In [None]:
#finding value counts
value_counts<-lapply(selected_cols, table)
#sort first, then keep the six most frequent categories of each column
sorted_val_cnts<-lapply(value_counts, function(v_c) head(sort(v_c, decreasing = TRUE)))

print(sorted_val_cnts)

In [None]:
#colnames
labels<-colnames(selected_cols)
#Barplot
for(i in seq_along(sorted_val_cnts))
{
  barplot(sorted_val_cnts[[i]], main=labels[i], xlab='Categories', ylab='Count')
}

**Trends of Average Salaries in each Year**

In [None]:
library(dplyr)
#group data by work_year and calculating the average Salary
yearly_salary_avg<-Dataset%>%
  group_by(work_year)%>%
  summarise(avg_salary=mean(salary_in_usd))
yearly_salary_avg

**Visualisation**

In [None]:
#Trends of Average Salaries in each year
library(ggplot2)

#Creating the Plot
ggplot(yearly_salary_avg, aes(x=work_year, y=avg_salary)) +
  geom_line() +
  labs(title='Trends of Average Salaries in each Year', x='Work Year', y='Average Salaries')

Average salaries in USD show an upward trend from 2020 to 2023, with a notable increase in 2022.

**Top Job Titles by Average Salary**

In [None]:
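#salary is recorded in each row's salary_currency, so these averages mix currencies
top_5_job_salaries<-Dataset%>%
  group_by(job_title)%>%
  summarise(Avg_Sal=mean(salary))%>%
  arrange(desc(Avg_Sal))%>%
  head() #head() keeps the first 6 rows by default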
top_5_job_salaries

**Visualisation**

In [None]:
ggplot(top_5_job_salaries, aes(x=job_title, y=Avg_Sal)) +
  geom_col() +
  labs(title="Top 5 Job Title Salaries", x="Job Title", y="Salary") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The analysis of top job titles by average `salary` gives the following ranking. "Head of Machine Learning" has the highest average, at 6,000,000, followed by "Principal Data Architect" at 3,000,000. "Lead Machine Learning Engineer" and "Lead Data Scientist" average approximately 2,548,667 and 928,485, respectively, and "Data Analytics Lead" and "BI Data Analyst" average 922,500 and 836,644.8. Note that `salary` is recorded in each row's `salary_currency`, so these averages mix currencies and are not dollar amounts; `salary_in_usd` is the column to use for a ranking that is comparable across rows.
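
Because `salary` is recorded in each row's `salary_currency`, a ranking that is comparable across rows would average `salary_in_usd` instead. The sketch below shows the sort-then-truncate pattern on a small made-up data frame (the values are invented for illustration and are not taken from the dataset). It uses base R only so that it runs without extra packages; `toy`, `avg_by_title`, and `top_n_titles` are hypothetical names.

```r
# Toy data frame; the values are made up for illustration only.
toy <- data.frame(
  job_title     = c("Data Scientist", "Data Scientist", "Data Engineer",
                    "Data Engineer", "ML Engineer", "Data Analyst"),
  salary_in_usd = c(120000, 140000, 110000, 130000, 150000, 90000)
)

# Mean USD salary per job title.
avg_by_title <- aggregate(salary_in_usd ~ job_title, data = toy, FUN = mean)

# Sort in descending order, then keep exactly n rows.
n <- 3
top_n_titles <- head(avg_by_title[order(-avg_by_title$salary_in_usd), ], n)
top_n_titles
```

Passing `n` explicitly to `head()` keeps the number of rows in line with the heading, since `head()` returns six rows when `n` is omitted.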

**Top 5 Job Titles**

In [None]:
top_5_job_titles<-head(sort(table(Dataset$job_title), decreasing = TRUE), 5)
top_5_job_titles

**Salary Based on Currency**

In [None]:
salary_basingon_currency<-Dataset%>%
  group_by(salary_currency)%>%
  summarise(Avg_Sal=mean(salary))%>%
  arrange(desc(Avg_Sal))%>%
  head()
salary_basingon_currency

In [None]:
ggplot(salary_basingon_currency, aes(x = salary_currency, y = Avg_Sal)) +
  geom_col() +
  labs(title = "Salary vs. Currency", x = "Currency", y = "Average Salary")

The analysis of average salaries by currency shows large differences. "CLP" (Chilean peso) has the highest average, followed by "HUF" (Hungarian forint), "JPY" (Japanese yen), and others. Because `salary` is not converted to a common currency, this ranking largely reflects how each currency is denominated; economic conditions and living costs in each region may also influence compensation, but comparing `salary_in_usd` across currencies would be needed to examine those effects.

**Finding Average Salaries Based on the Company Size**

In [None]:
#Average Salaries Based on the Company Size
sal_trnd_by_firm_size<-Dataset%>%
  group_by(company_size)%>%
  summarise(Avg_sal=mean(salary))
sal_trnd_by_firm_size

**Visualisation**

In [None]:
#visualisation
#barplot
ggplot(sal_trnd_by_firm_size, aes(x=company_size, y=Avg_sal)) +
  geom_col() +
  labs(title='Average Salaries based on the Company Size', x='Company Size', y='Average Salary')

The data shows a clear difference in average `salary` by company size. Companies categorized as "Large" (L) have the highest average, 438,794.4, followed by "Small" (S) companies at 281,430.1, while "Medium" (M) companies have the lowest, 150,712.8. Compensation therefore varies considerably with company size, but not monotonically: large companies lead, yet small companies average more than medium ones. These figures average the `salary` column, which mixes currencies, so they should not be read as dollar amounts.
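
Company size codes sort alphabetically (L, M, S) by default, which hides the size ordering in tables and plots. A minimal sketch, on made-up values rather than the dataset, of setting the factor levels to S, M, L before summarising; it also computes the median, which is less sensitive to extreme values than the mean. The names `toy`, `size_mean`, and `size_median` are hypothetical.

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  company_size  = c("L", "M", "S", "M", "L", "S"),
  salary_in_usd = c(150000, 120000, 80000, 140000, 130000, 90000)
)

# Explicit level order so summaries and plots run small -> medium -> large.
toy$company_size <- factor(toy$company_size, levels = c("S", "M", "L"))

# Mean and median USD salary per company size, in level order.
size_mean   <- tapply(toy$salary_in_usd, toy$company_size, mean)
size_median <- tapply(toy$salary_in_usd, toy$company_size, median)
size_mean
size_median
```

ggplot2 also follows factor level order on a discrete axis, so the same conversion would reorder the bars in the chart above.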

**Average Salaries Based on the Location**

In [None]:
avg_sal_by_loc<-Dataset%>%
  group_by(company_location)%>%
  summarise(Average_Salary=mean(salary))%>%
  arrange(desc(Average_Salary))%>%
  head()
avg_sal_by_loc

**Visualisation**

In [None]:
#Visualisation
ggplot(avg_sal_by_loc, aes(x=company_location, y=Average_Salary)) +
  geom_col() +
  labs(title='Average Salary based on Company Location', x='Company Location', y='Average Salary')

The analysis shows that companies with the location code "CL" (the ISO country code for Chile) have the highest average `salary` among the company locations in the dataset. Because `salary` is recorded in local currency, this may reflect the denomination of the currency as much as real pay levels. Confirming a genuine difference would require comparing `salary_in_usd` and considering factors such as local economic conditions, industry specialization, and company size.

**Distribution of Salary by Experience Level**

In [None]:
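#scale() returns a one-column matrix, so convert it to a plain numeric vector
ggplot(Dataset, aes(x=experience_level, y=as.numeric(scale(salary)))) +
  geom_boxplot() +
  labs(title="Boxplot of Standardized Salary by Experience Level", x='Experience Level', y='Standardized Salary (z-score)') +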
  theme_minimal()

In [None]:
ggplot(Dataset, aes(x=experience_level, y=salary_in_usd)) +
  geom_boxplot(fill="#1380A1") +
  labs(title="Distribution of Salary by Experience Level", x="Experience Level", y="Salary (USD)") +
  theme_minimal()

**Distribution of Salary by Employment Type**

In [None]:
ggplot(Dataset, aes(x=employment_type, y=salary_in_usd)) +
  geom_boxplot(fill="#ADD8E6") +
  labs(title="Distribution of Salary by Employment Type", x="Employment Type", y="Salary (USD)")

**Distribution of Salary**

In [None]:
hist_salary<-ggplot(Dataset, aes(x=salary)) +
  geom_histogram(binwidth = 10000, fill="#FF5733", color="#1380A1") +
  labs(title="Distribution of Salary", x="Salary", y="Frequency")
hist_salary

**Distribution of Salary in USD**

In [None]:
hist_salary_in_usd<-ggplot(Dataset, aes(x=salary_in_usd)) +
  geom_histogram(binwidth=10000, fill = "#FF5733", color = "#1380A1") +
  labs(title="Distribution of Salary in USD", x="Salary (USD)", y="Frequency")
hist_salary_in_usd

**Distribution of Remote Ratio**

In [None]:
hist_remote_ratio<-ggplot(Dataset, aes(x=remote_ratio)) +
  geom_histogram(binwidth=10, fill="#FF5733", color="#1380A1") +
  labs(title="Distribution of Remote Ratio", x="Remote Ratio", y="Frequency")
hist_remote_ratio

**Finding Relations**

In [None]:
#Correlation
norm_num_cols<-scale(num_cols)
correl_mat<-cor(norm_num_cols)
correl_mat
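
Pearson correlation is unchanged by centering and rescaling each column, so `cor(scale(x))` and `cor(x)` agree up to floating-point error. The check below uses simulated data, not the dataset; the column names `a`, `b`, and `c` are illustrative assumptions.

```r
set.seed(42)  # reproducible toy data
x <- data.frame(a = rnorm(50), b = runif(50, 0, 100))
x$c <- 3 * x$a + rnorm(50)  # a column correlated with `a`

cor_raw    <- cor(x)         # correlation of the raw columns
cor_scaled <- cor(scale(x))  # correlation after z-scoring each column

# Compare the two matrices within numerical tolerance
all.equal(cor_raw, cor_scaled)
```

Standardizing first is therefore harmless but optional when the goal is only a correlation matrix.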

In [None]:
plot(x=Dataset$work_year, y=Dataset$salary, xlab="Work Year", ylab="Salary",
     main="Relation Between Work Year & Salary")

In [None]:
#Creating Scatter Plot
pairs(~work_year+salary+salary_in_usd+remote_ratio, data=num_cols,
main='Relations between the Data Science Salaries')

In [None]:
ggplot(Dataset, aes(x = salary, y = remote_ratio)) +
  geom_point(color = "#1380A1") +
  labs(title = "Salary vs. Remote Ratio", x = "Salary", y = "Remote Ratio") +
  theme_minimal()

**Salary vs. Salary in USD**

In [None]:
ggplot(Dataset, aes(x = salary, y = salary_in_usd)) +
  geom_point(color = "#1380A1") +
  labs(title = "Salary vs. Salary in USD", x = "Salary", y = "Salary in USD") +
  theme_minimal()

The scatter plot "Salary vs. Salary in USD" shows each record's `salary` on the horizontal axis against its `salary_in_usd` on the vertical axis. Within a single currency the relationship is positive, since a higher salary converts to a higher USD value, and rows paid in USD lie on the line y = x because the two columns are equal there. Rows paid in other currencies fall off that line according to the exchange rate, so the correlation matrix computed earlier is the better guide to how strongly the two columns are related across the whole dataset.