# EDA Exercise: Working with JSON Data in Python
In this exercise, you will load a JSON file containing employee data and perform exploratory data analysis (EDA).
The goal is to apply your skills with `pandas` to analyze, describe, and understand a small dataset.

## Step 1: Load the JSON File

In [3]:
import pandas as pd
# Load the JSON file into a pandas DataFrame
# Hint: Use pd.read_json with the correct path
df = pd.read_json('data_samples/sample_employees.json')  # Update path if needed
df.head()

|   | id | name | age | salary | department | years_at_company |
|---|---|---|---|---|---|---|
| 0 | 1 | Dennis Vazquez | 32 | 58893.885591 | Sales | 8 |
| 1 | 2 | Mr. Wayne Rodriguez | 24 | 62551.217478 | Sales | 7 |
| 2 | 3 | Lindsey Rodriguez | 37 | 66139.477833 | Marketing | 5 |
| 3 | 4 | Mark Martinez | 27 | 55798.641543 | Engineering | 1 |
| 4 | 5 | Linda Brown | 31 | 44693.758043 | HR | 9 |
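
`pd.read_json` accepts a file path or a file-like object. The following is a minimal sketch, assuming the file stores employees as a list of records; the actual layout of `sample_employees.json` is not shown here. The `raw` string is a hypothetical stand-in built from the first two rows of the preview.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file contents (assumed list-of-records layout).
raw = """[
  {"id": 1, "name": "Dennis Vazquez", "age": 32, "salary": 58893.885591,
   "department": "Sales", "years_at_company": 8},
  {"id": 2, "name": "Mr. Wayne Rodriguez", "age": 24, "salary": 62551.217478,
   "department": "Sales", "years_at_company": 7}
]"""

# Wrap the text in StringIO so read_json treats it like an open file.
sample = pd.read_json(io.StringIO(raw))
print(sample.head())
```

If the real file uses a different layout, the `orient` parameter of `pd.read_json` may need to be set accordingly.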


## Step 2: Understand the Structure

In [4]:
# Print the shape and column names of the dataset
print(df.shape)
print(df.columns.tolist())

(30, 6)
['id', 'name', 'age', 'salary', 'department', 'years_at_company']
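
Beyond shape and column names, the data type of each column is often worth a look before computing statistics. The sketch below rebuilds the five preview rows by hand (the variable name `preview` is an illustration, not part of the exercise) and lists which columns pandas treats as numeric.

```python
import pandas as pd

# The five rows shown by df.head(), typed in by hand for illustration.
preview = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Dennis Vazquez", "Mr. Wayne Rodriguez", "Lindsey Rodriguez",
             "Mark Martinez", "Linda Brown"],
    "age": [32, 24, 37, 27, 31],
    "salary": [58893.885591, 62551.217478, 66139.477833,
               55798.641543, 44693.758043],
    "department": ["Sales", "Sales", "Marketing", "Engineering", "HR"],
    "years_at_company": [8, 7, 5, 1, 9],
})

# One dtype per column
print(preview.dtypes)

# Names of the columns with a numeric dtype
numeric_cols = preview.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```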


## Step 3: Check for Missing Values

In [5]:
# Count the missing values in each column
df.isnull().sum()

id                  0
name                0
age                 0
salary              0
department          0
years_at_company    0
dtype: int64
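
This dataset has no gaps, so every count is zero. To see what the same check reports when values are missing, here is a small sketch on a hypothetical frame (`toy`) with gaps introduced on purpose; none of these values come from the employee file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately missing entries.
toy = pd.DataFrame({
    "age": [32, np.nan, 37],
    "salary": [58893.89, 62551.22, np.nan],
    "department": ["Sales", "Sales", None],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of rows missing per column (mean of a boolean mask)
missing_share = toy.isnull().mean()
print(missing_share.round(2))
```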

## Step 4: Descriptive Statistics

In [6]:
# Display summary statistics for all numerical columns
df.describe()

|   | id | age | salary | years_at_company |
|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 15.5 | 30.0 | 59106.279019 | 5.966667 |
| std | 8.803408 | 4.927054 | 8860.124477 | 2.266447 |
| min | 1.0 | 19.0 | 44693.758043 | 1.0 |
| 25% | 8.25 | 27.0 | 50772.578478 | 5.0 |
| 50% | 15.5 | 31.0 | 60205.08515 | 7.0 |
| 75% | 22.75 | 32.0 | 65960.143194 | 7.75 |
| max | 30.0 | 43.0 | 81248.08073 | 9.0 |
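
By default `describe()` summarizes only numeric columns, which is why `name` and `department` do not appear in the output. A possible way to summarize text columns as well is sketched below, using only the five preview rows, so the numbers will differ from the full dataset.

```python
import pandas as pd

# The five rows from the df.head() preview, reduced to two columns.
preview = pd.DataFrame({
    "department": ["Sales", "Sales", "Marketing", "Engineering", "HR"],
    "salary": [58893.885591, 62551.217478, 66139.477833,
               55798.641543, 44693.758043],
})

# How often each category occurs in a text column
dept_counts = preview["department"].value_counts()
print(dept_counts)

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = preview.describe(include="all")
print(full_summary)
```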


## Step 5: Grouping and Aggregation

In [10]:
# Find the average salary per department
df.groupby('department')['salary'].mean()

department
Engineering    59374.221553
HR             55978.507288
Marketing      63975.985549
Sales          57726.159139
Name: salary, dtype: float64
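
A mean on its own hides how many employees each group contains. As a sketch, several statistics can be requested at once with `agg`; the frame below holds only the five preview rows, so its group means are not the full-dataset values shown above.

```python
import pandas as pd

# The five rows from the df.head() preview, reduced to two columns.
preview = pd.DataFrame({
    "department": ["Sales", "Sales", "Marketing", "Engineering", "HR"],
    "salary": [58893.885591, 62551.217478, 66139.477833,
               55798.641543, 44693.758043],
})

# One row per department, one column per statistic
summary = preview.groupby("department")["salary"].agg(["mean", "median", "count"])
print(summary)
```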

## Step 6: Filtering Data

In [9]:
# Filter the dataset to keep only employees with more than 5 years at the company
df[df['years_at_company'] > 5]

|   | id | name | age | salary | department | years_at_company |
|---|---|---|---|---|---|---|
| 0 | 1 | Dennis Vazquez | 32 | 58893.885591 | Sales | 8 |
| 1 | 2 | Mr. Wayne Rodriguez | 24 | 62551.217478 | Sales | 7 |
| 4 | 5 | Linda Brown | 31 | 44693.758043 | HR | 9 |
| 7 | 8 | Brooke Vasquez | 32 | 66659.694857 | Engineering | 7 |
| 9 | 10 | Nathan Hamilton | 26 | 53772.943943 | HR | 7 |
| 11 | 12 | Brian Nolan | 34 | 50233.2508 | Marketing | 6 |
| 13 | 14 | Francisco Kirk | 31 | 67054.088312 | HR | 8 |
| 14 | 15 | Joanna Dougherty | 33 | 81248.08073 | Marketing | 8 |
| 16 | 17 | Bryan Walker | 32 | 60687.387773 | HR | 9 |
| 17 | 18 | Kerri Hill | 25 | 55893.396757 | Engineering | 7 |
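
Filters can also be combined. The sketch below, again limited to the five preview rows, keeps employees with more than 5 years at the company who work in Sales. Each condition needs its own parentheses when joined with `&`; `DataFrame.query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# The five rows from the df.head() preview, reduced to three columns.
preview = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "department": ["Sales", "Sales", "Marketing", "Engineering", "HR"],
    "years_at_company": [8, 7, 5, 1, 9],
})

# Boolean masks combined with & (element-wise AND)
result = preview[(preview["years_at_company"] > 5)
                 & (preview["department"] == "Sales")]
print(result)

# The same filter written as a query string
same = preview.query("years_at_company > 5 and department == 'Sales'")
print(same)
```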
