## Exploratory Data Analysis of Jobs in the various fields of Data
### **Intro**
Welcome to this exploratory data analysis (EDA) notebook, which looks at jobs across the field of data. The dataset records, for each job, the work year, job title and category, salary, experience level, employment type, work setting, and company location and size. Let's analyze the data through the following guiding questions.
1. What is the distribution of work experience levels?
2. How are salaries distributed across different experience levels?
3. What are the most common job titles in the dataset?
4. What does the job category distribution look like?
5. What is the most common employment type (full-time, part-time, contract)?
6. How are salaries distributed across different job categories?
7. Which work settings are most common among data professionals?
8. What is the geographic distribution of companies?
9. How does company size impact salaries and job titles?
10. What is the overall salary distribution in the dataset?
11. What does the distribution of employee residence locations look like?
12. Are there outliers in the salary data?

In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

In [2]:
df = pd.read_csv("/kaggle/input/jobs-in-data/jobs_in_data.csv")
df.head(5)

Unnamed: 0,work_year,job_title,job_category,salary_currency,salary,salary_in_usd,employee_residence,experience_level,employment_type,work_setting,company_location,company_size
0,2023,Data DevOps Engineer,Data Engineering,EUR,88000,95012,Germany,Mid-level,Full-time,Hybrid,Germany,L
1,2023,Data Architect,Data Architecture and Modeling,USD,186000,186000,United States,Senior,Full-time,In-person,United States,M
2,2023,Data Architect,Data Architecture and Modeling,USD,81800,81800,United States,Senior,Full-time,In-person,United States,M
3,2023,Data Scientist,Data Science and Research,USD,212000,212000,United States,Senior,Full-time,In-person,United States,M
4,2023,Data Scientist,Data Science and Research,USD,93300,93300,United States,Senior,Full-time,In-person,United States,M


In [3]:
df.columns

Index(['work_year', 'job_title', 'job_category', 'salary_currency', 'salary',
       'salary_in_usd', 'employee_residence', 'experience_level',
       'employment_type', 'work_setting', 'company_location', 'company_size'],
      dtype='object')

In [4]:
df.shape

(9355, 12)

### 1. What is the distribution of work experience levels?

In [5]:
fig = go.Figure()
# experience_level is categorical, so the histogram draws one bar per level
fig.add_trace(go.Histogram(x=df['experience_level'], marker_color='darkblue'))
fig.update_layout(
    title='Distribution of Experience Levels',
    xaxis_title='Experience Level',
    yaxis_title='Frequency'
)
fig.show()

### 2. How are salaries distributed across different experience levels?

In [6]:
fig = px.box(df, x='experience_level', y='salary_in_usd', title='Salary Distribution Across Experience Levels')
fig.update_layout(
    xaxis_title='Experience Level',
    yaxis_title='Salary in USD',
    boxmode='group'
)
fig.show()
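
The box plot shows the spread visually; a small table of group statistics makes the comparison easier to quote. The following is a minimal sketch, not part of the original notebook: `salary_summary` is a hypothetical helper name, and the choice of count, median, and mean as the statistics is an assumption.

```python
import pandas as pd


def salary_summary(df: pd.DataFrame, by: str, value: str = "salary_in_usd") -> pd.DataFrame:
    """Return count, median and mean of `value` per group in `by`, highest median first."""
    return (
        df.groupby(by)[value]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
    )
```

In this notebook it could be called as `salary_summary(df, 'experience_level')`. The same helper should also work for other grouping columns such as `'job_category'` or `'company_size'`.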

### 3. What are the most common job titles in the dataset?

In [7]:
common_jobs = df['job_title'].value_counts().nlargest(3).index.tolist()
common_jobs

['Data Engineer', 'Data Scientist', 'Data Analyst']
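
The list above gives the names only. To see how dominant each title is, the counts and their share of all rows can be shown together. This is a rough sketch under stated assumptions: `top_values` is a hypothetical helper, and the share is computed against the total number of rows (including any rows where the column is missing).

```python
import pandas as pd


def top_values(df: pd.DataFrame, column: str, n: int = 3) -> pd.DataFrame:
    """Return the n most frequent values of `column` with their count and percentage share."""
    counts = df[column].value_counts().head(n)  # value_counts sorts by frequency, descending
    out = counts.to_frame(name="count")
    out["share_pct"] = (out["count"] / len(df) * 100).round(1)
    return out
```

For example, `top_values(df, 'job_title')` would tabulate the three titles listed above, and `top_values(df, 'employment_type')` or `top_values(df, 'work_setting')` would do the same for the other categorical columns.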

### 4. What does the job category distribution look like?

In [8]:
fig = go.Figure()
# job_category is categorical, so the histogram draws one bar per category
fig.add_trace(go.Histogram(x=df['job_category'], marker_color='darkred'))
fig.update_layout(
    title='Distribution of Job Categories',
    xaxis_title='Job Category',
    yaxis_title='Frequency'
)
fig.show()

### 5. What is the most common employment type (full-time, part-time, contract)?

In [9]:
df['employment_type'].value_counts().idxmax()

'Full-time'

### 6. How are salaries distributed across different job categories?

In [10]:
fig = px.box(df, x='job_category', y='salary_in_usd', title='Salary Distribution Across Job Categories')
fig.update_layout(
    xaxis_title='Job Categories',
    yaxis_title='Salary in USD',
    boxmode='group'
)
fig.show()

### 7. Which work settings are most common among data professionals?

In [11]:
df.work_setting.value_counts().idxmax()

'In-person'

### 8. What is the geographic distribution of companies?

In [12]:
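# Count records per company location so the map color reflects how many
# jobs each country contributes, rather than the work year of a single row.
company_counts = (
    df.groupby('company_location')
    .size()
    .reset_index(name='count')
)

fig = px.choropleth(
    company_counts,
    locations='company_location',
    locationmode='country names',
    title='Company Distribution Across Countries',
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'count': 'Number of Records', 'company_location': 'Company Location'},
)

# Show the plot
fig.show()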

### 9. How does company size impact salaries and job titles?

In [13]:
fig = px.scatter(
    df,
    x='company_size',
    y='salary_in_usd',
    color='job_title',
    title='Impact of Company Size on Salaries and Job Titles',
    labels={'salary_in_usd': 'Salary in USD'},
    # company_size is coded with single letters in the data (e.g. 'L', 'M'), not full words
    category_orders={'company_size': ['S', 'M', 'L']}
)

fig.update_layout(
    xaxis_title='Company Size',
    yaxis_title='Salary in USD',
)

fig.show()
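
With many job titles sharing three x positions, the scatter plot can be hard to read. A cross-tabulation of the most common titles by company size is one way to look at the "job titles" half of the question. This is only a sketch: `title_mix_by_size` is a hypothetical helper, and restricting to the top few titles and normalizing each row to shares are assumptions made for readability.

```python
import pandas as pd


def title_mix_by_size(df: pd.DataFrame, top_n: int = 3) -> pd.DataFrame:
    """Share of each of the top_n job titles within every company size (rows sum to 1)."""
    top_titles = df["job_title"].value_counts().head(top_n).index
    subset = df[df["job_title"].isin(top_titles)]
    # normalize="index" turns raw counts into per-row proportions
    return pd.crosstab(
        subset["company_size"], subset["job_title"], normalize="index"
    ).round(3)
```

Calling `title_mix_by_size(df)` would return one row per company size and one column per top title.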

### 10. What is the overall salary distribution in the dataset?

In [14]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['salary_in_usd'], nbinsx=5, marker_color='darkgreen'))
fig.update_layout(
    title='Overall Distribution of Salary',
    xaxis_title='Salary In USD',
    yaxis_title='Frequency'  
)
fig.show()

### 11. What does the distribution of employee residence locations look like?

In [15]:
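# Count records per country of residence so the map color reflects how many
# employees live in each country, rather than the work year of a single row.
residence_counts = (
    df.groupby('employee_residence')
    .size()
    .reset_index(name='count')
)

fig = px.choropleth(
    residence_counts,
    locations='employee_residence',
    locationmode='country names',
    title='Employee Residence Location Distribution Across Countries',
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'count': 'Number of Records', 'employee_residence': 'Employee Residence'},
)

# Show the plot
fig.show()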

### 12. Are there outliers in the salary data?

In [16]:
# Per-title 25th and 75th percentiles, broadcast back to every row
salary_by_title = df.groupby('job_title')['salary_in_usd']
q1 = salary_by_title.transform(lambda s: s.quantile(0.25))
q3 = salary_by_title.transform(lambda s: s.quantile(0.75))

# Label each salary by where it falls relative to its own job title's quartiles
df['quartile'] = np.select(
    [df['salary_in_usd'] <= q1, df['salary_in_usd'] >= q3],
    ['First Quartile', 'Last Quartile'],
    default='Intermediate'
)


fig = px.scatter(
    df,
    x='job_title',
    y='salary_in_usd',
    color='quartile',
    title='Outliers in the Salary Data using Quartiles',
    labels={'salary_in_usd': 'Salary in USD'},
    color_discrete_map={'First Quartile': 'brown', 'Intermediate': 'gray', 'Last Quartile': 'red'}
)
fig.update_layout(
    xaxis_title='Job Titles',
    yaxis_title='Salary in USD',
)
fig.show()
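
The plot above colors every salary in the bottom or top quartile of its job title, which by construction covers roughly half of the rows. A stricter and commonly used convention is the 1.5 × IQR rule, sketched below. This is not the notebook's original method: `flag_iqr_outliers` is a hypothetical helper, and the multiplier `k=1.5` and the choice to compute the fences over the whole column rather than per job title are assumptions.

```python
import pandas as pd


def flag_iqr_outliers(df: pd.DataFrame, value: str = "salary_in_usd", k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with a boolean `is_outlier` column based on the k * IQR rule."""
    q1 = df[value].quantile(0.25)
    q3 = df[value].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr  # fences beyond which a value counts as an outlier
    out = df.copy()  # leave the caller's DataFrame untouched
    out["is_outlier"] = (out[value] < lower) | (out[value] > upper)
    return out
```

For instance, `flag_iqr_outliers(df)['is_outlier'].sum()` would give the number of salaries falling outside the fences.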