## “Working Hard, Earning Less? A Look at Hospitality Wages in Ireland”


###  Project Goals


This project is intended for my portfolio, to demonstrate data cleaning, exploratory data analysis (EDA) and storytelling with real government data. 
As a hospitality worker myself, I also have a personal interest in it: I am curious how far our industry's earnings have been from the national average over the years. 

These are the main questions this project aims to answer: 

- How have average weekly earnings in the hospitality sector changed over the last 10 years?

- Do wage increases keep up with inflation (e.g. Consumer Price Index)?

And two bonus questions I'll try to investigate: 

- How do hospitality wages compare to other sectors (e.g. finance, education)? (Only if I can find the data.)

- Has the wage gap between sectors increased or decreased?

### Data sources: 

The main source of data used for this project is the CSO (Central Statistics Office).

- **CSO – EHQ03 / EHQ12 / EHA04** (selected statistics): Average Weekly Earnings, Average Hourly Earnings, Average Weekly Paid Hours (Quarterly / Annual).  
  *Access via CSO StatBank – search table `EHQ03` (Average Earnings, Hours Worked, Employment and Labour Costs).*
- **CSO – CPI** (for inflation adjustment): table `CPM12` or equivalent.
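Inflation adjustment comes down to dividing nominal earnings by a price index rebased to a chosen period. The sketch below shows only that arithmetic. The CPI figures are made-up placeholders (not CSO values), and `to_real` is a hypothetical helper name used for illustration.

```python
import pandas as pd


def to_real(nominal: pd.Series, cpi: pd.Series, base_period: str) -> pd.Series:
    """Express nominal earnings in the prices of `base_period`."""
    rebased = cpi / cpi.loc[base_period]  # price level relative to the base period
    return nominal / rebased


# Hospitality weekly earnings from EHQ03 (values shown later in this notebook)
nominal = pd.Series({"2008Q1": 347.53, "2009Q1": 332.98})

# PLACEHOLDER index values for illustration only, not real CPI figures
cpi = pd.Series({"2008Q1": 100.0, "2009Q1": 98.0})

real = to_real(nominal, cpi, base_period="2008Q1")
print(real.round(2))
```

In the base period, real and nominal earnings are equal by construction; in every other period the real figure moves with the ratio of the two price levels.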

**Filters used for analysis**
- Metrics: `Average Weekly Earnings (Euro)`, `Average Hourly Earnings (Euro)`, `Average Weekly Paid Hours (Hours)`
- Sectors: `Accommodation and food service activities (I)` and `All NACE economic sectors`
- Period: 2008Q1 – 2025Q1 (quarterly) / 2008 – 2024 (annual where applicable)


### Version control - GitHub 
This project will also be uploaded to my portfolio!

https://github.com/clarissa-sc/hospitality_wages



### 📡 Importing libraries


In [1]:

import pandas as pd #dataframes 
import numpy as np #linear algebra
##import seaborn as sns #visualisation
##sns.set(color_codes=True)


## 1. Introduction 

In this section, I’ll prepare the CSO dataset for analysis by checking for missing values, formatting dates, and standardizing column names.


### 1.1 Loading data from CSO

The data comes from the CSO table *Average Earnings, Hours Worked, Employment and Labour Costs* (EHQ03).

Quarterly, 2008Q1 to 2025Q1.

Broken down by economic sector and filtered to **Accommodation and food service activities (I)** and **All NACE economic sectors**.

Statistics selected: 
 - Average Weekly Earnings
 - Average Hourly Earnings
 - Average Weekly Paid Hours

The goal is to compare Accommodation and food service activities (I) (referred to here as hospitality) with the national average across all sectors, to see how far hospitality wages are from that average.

Wage gap analysis: the weekly and hourly pay gap between hospitality and all sectors.

In [8]:

# Load the dataset
df = pd.read_csv("/Users/clarissacardoso/Desktop/hospitality_wages/hospitality_wages_project/data/raw/EHQ03.20250810T200815.csv")  # or .read_excel("cso_wages.xlsx")

# Look at the first 10 rows
df.head(10)

| | Statistic Label | Quarter | Economic Sector NACE Rev 2 | Type of Employee | UNIT | VALUE |
|---|---|---|---|---|---|---|
| 0 | Average Weekly Earnings | 2008Q1 | All NACE economic sectors | All employees | Euro | 704.6 |
| 1 | Average Weekly Earnings | 2008Q1 | Accommodation and food service activities (I) | All employees | Euro | 347.53 |
| 2 | Average Weekly Earnings | 2008Q2 | All NACE economic sectors | All employees | Euro | 705.28 |
| 3 | Average Weekly Earnings | 2008Q2 | Accommodation and food service activities (I) | All employees | Euro | 346.15 |
| 4 | Average Weekly Earnings | 2008Q3 | All NACE economic sectors | All employees | Euro | 696.11 |
| 5 | Average Weekly Earnings | 2008Q3 | Accommodation and food service activities (I) | All employees | Euro | 350.46 |
| 6 | Average Weekly Earnings | 2008Q4 | All NACE economic sectors | All employees | Euro | 721.89 |
| 7 | Average Weekly Earnings | 2008Q4 | Accommodation and food service activities (I) | All employees | Euro | 348.21 |
| 8 | Average Weekly Earnings | 2009Q1 | All NACE economic sectors | All employees | Euro | 709.55 |
| 9 | Average Weekly Earnings | 2009Q1 | Accommodation and food service activities (I) | All employees | Euro | 332.98 |
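As a rough illustration of the wage gap calculation, the sketch below takes the two 2008Q1 rows from the output above, pivots them into one row per quarter, and computes the absolute gap and hospitality's share of the all-sector average. The short column names `all_sectors` and `hospitality`, and the derived columns `gap_eur` and `hospitality_share`, are my own labels for this example.

```python
import pandas as pd

# The two 2008Q1 rows from the EHQ03 output above
rows = pd.DataFrame({
    "Quarter": ["2008Q1", "2008Q1"],
    "Economic Sector NACE Rev 2": [
        "All NACE economic sectors",
        "Accommodation and food service activities (I)",
    ],
    "VALUE": [704.60, 347.53],
})

# Long -> wide: one row per quarter, one column per sector
wide = rows.pivot(index="Quarter", columns="Economic Sector NACE Rev 2", values="VALUE")
wide = wide.rename(columns={
    "All NACE economic sectors": "all_sectors",
    "Accommodation and food service activities (I)": "hospitality",
})

wide["gap_eur"] = wide["all_sectors"] - wide["hospitality"]            # euro per week
wide["hospitality_share"] = wide["hospitality"] / wide["all_sectors"]  # fraction of the average

print(wide.round(3))
```

For 2008Q1 this gives a gap of about €357 per week, with hospitality earning roughly 49% of the all-sector average.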


### 1.2 First glance at the data

Checking the basics of this dataset: its shape, data types and column names.


In [4]:
df.info() # Data types and null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Statistic Label             414 non-null    object 
 1   Quarter                     414 non-null    object 
 2   Economic Sector NACE Rev 2  414 non-null    object 
 3   Type of Employee            414 non-null    object 
 4   UNIT                        414 non-null    object 
 5   VALUE                       414 non-null    float64
dtypes: float64(1), object(5)
memory usage: 19.5+ KB


- 6 columns
- 414 rows, with no null values in any column
- Data types are object (text data) for five columns and float for `VALUE`




In [5]:
# Dimensions of dataset
print("Shape:", df.shape) # how many rows/columns

# Column names
print("Columns:", df.columns)


# Quick statistics for numeric columns
print(df.describe())

Shape: (414, 6)
Columns: Index(['Statistic Label', 'Quarter', 'Economic Sector NACE Rev 2',
       'Type of Employee', 'UNIT', 'VALUE'],
      dtype='object')
             VALUE
count   414.000000
mean    203.271208
std     283.393851
min      11.880000
25%      22.387500
50%      30.950000
75%     342.242500
max    1026.200000
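One caveat about the summary above: `VALUE` holds three different statistics (weekly earnings in euro, hourly earnings in euro, and weekly paid hours), so the overall mean and quartiles blend different units and are hard to interpret. A minimal sketch of summarising each statistic separately follows. The weekly figures are the 2008Q1 values shown earlier, while the hourly figures are placeholders for illustration only, not CSO values.

```python
import pandas as pd

sample = pd.DataFrame({
    "Statistic Label": ["Average Weekly Earnings"] * 2 + ["Average Hourly Earnings"] * 2,
    # First two values: 2008Q1 rows from EHQ03. Last two: PLACEHOLDERS, not real data.
    "VALUE": [704.60, 347.53, 20.0, 12.0],
})

# One summary row per statistic, so euro-per-week and euro-per-hour are never mixed
summary = sample.groupby("Statistic Label")["VALUE"].describe()
print(summary[["count", "mean", "min", "max"]])
```

On the full dataset the same call would be `df.groupby("Statistic Label")["VALUE"].describe()`; adding the sector column to the `groupby` list would split the summary further.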



### 2. Checking for missing values


In [6]:
df.isnull().sum()


Statistic Label               0
Quarter                       0
Economic Sector NACE Rev 2    0
Type of Employee              0
UNIT                          0
VALUE                         0
dtype: int64

At first glance, no data appears to be missing from any column. To double-check:

In [7]:
# check missing values

print(df.isna().sum())

Statistic Label               0
Quarter                       0
Economic Sector NACE Rev 2    0
Type of Employee              0
UNIT                          0
VALUE                         0
dtype: int64
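A null check only flags empty cells; it cannot reveal a quarter that is absent from the file altogether. One possible way to test for that is to compare the rows against the full grid of statistic × quarter × sector. The sketch below assumes `Quarter` is still a string such as `2008Q1`; `missing_combinations` is a hypothetical helper, and the toy frame reuses values from the output in section 1.1 with one row deliberately left out.

```python
import pandas as pd

KEYS = ["Statistic Label", "Quarter", "Economic Sector NACE Rev 2"]


def missing_combinations(frame, start, end):
    """Return the (statistic, quarter, sector) combinations absent from `frame`."""
    quarters = pd.period_range(start=start, end=end, freq="Q").astype(str)
    expected = pd.MultiIndex.from_product(
        [
            frame["Statistic Label"].unique(),
            quarters,
            frame["Economic Sector NACE Rev 2"].unique(),
        ],
        names=KEYS,
    )
    actual = pd.MultiIndex.from_frame(frame[KEYS])
    return expected.difference(actual)


# Full grid for this extract: 69 quarters x 2 sectors x 3 statistics
n_quarters = len(pd.period_range(start="2008Q1", end="2025Q1", freq="Q"))
print(n_quarters * 2 * 3)  # 414, matching the row count reported by df.info()

# Toy frame: the 2008Q2 hospitality row is deliberately omitted
toy = pd.DataFrame({
    "Statistic Label": ["Average Weekly Earnings"] * 3,
    "Quarter": ["2008Q1", "2008Q1", "2008Q2"],
    "Economic Sector NACE Rev 2": [
        "All NACE economic sectors",
        "Accommodation and food service activities (I)",
        "All NACE economic sectors",
    ],
    "VALUE": [704.60, 347.53, 705.28],
})
missing = missing_combinations(toy, start="2008Q1", end="2008Q2")
print(list(missing))
```

Since 69 × 2 × 3 equals the 414 rows in the dataset, the extract is consistent with a complete grid, provided there are no duplicate rows.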


### 3. Fix Data Types

In [12]:
# 1. Strip whitespace from text columns
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# 2. Convert 'Quarter' to datetime (first day of the quarter)
df['Quarter'] = pd.PeriodIndex(df['Quarter'], freq='Q').to_timestamp()

# 3. Ensure 'VALUE' is numeric
df['VALUE'] = pd.to_numeric(df['VALUE'], errors='coerce')

# 4. (Optional) Rename columns to be easier to type
df.rename(columns={
    'Statistic Label': 'Statistic_Label',
    'Economic Sector NACE Rev 2': 'Economic_Sector',
    'Type of Employee': 'Employee_Type'
}, inplace=True)


print("\nUpdated column names and types:")
print(df.dtypes)

# Check updated dtypes
#print(df.dtypes)

# Quick look
df.head()


Updated column names and types:
Statistic_Label            object
Quarter            datetime64[ns]
Economic_Sector            object
Employee_Type              object
UNIT                       object
VALUE                     float64
dtype: object


| | Statistic_Label | Quarter | Economic_Sector | Employee_Type | UNIT | VALUE |
|---|---|---|---|---|---|---|
| 0 | Average Weekly Earnings | 2008-01-01 | All NACE economic sectors | All employees | Euro | 704.6 |
| 1 | Average Weekly Earnings | 2008-01-01 | Accommodation and food service activities (I) | All employees | Euro | 347.53 |
| 2 | Average Weekly Earnings | 2008-04-01 | All NACE economic sectors | All employees | Euro | 705.28 |
| 3 | Average Weekly Earnings | 2008-04-01 | Accommodation and food service activities (I) | All employees | Euro | 346.15 |
| 4 | Average Weekly Earnings | 2008-07-01 | All NACE economic sectors | All employees | Euro | 696.11 |

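To make the quarter conversion concrete: `to_timestamp()` maps each quarterly period to its first day by default, which is why 2008Q2 appears as 2008-04-01 above. The sketch below shows this on two labels, along with the `how="end"` variant in case quarter-end dates are preferred, for example on a plot axis.

```python
import pandas as pd

quarters = pd.PeriodIndex(["2008Q1", "2008Q2"], freq="Q")

# Default (how="start"): first day of each quarter
starts = quarters.to_timestamp()

# how="end" returns the last instant of the quarter; normalize() trims it to midnight
ends = quarters.to_timestamp(how="end").normalize()

print(starts)
print(ends)
```

Either convention works for a time series, as long as the same one is used for every dataset that is later joined on the date column.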

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Statistic_Label  414 non-null    object        
 1   Quarter          414 non-null    datetime64[ns]
 2   Economic_Sector  414 non-null    object        
 3   Employee_Type    414 non-null    object        
 4   UNIT             414 non-null    object        
 5   VALUE            414 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 19.5+ KB


In [13]:
df.describe()

| | VALUE |
|---|---|
| count | 414.0 |
| mean | 203.271208 |
| std | 283.393851 |
| min | 11.88 |
| 25% | 22.3875 |
| 50% | 30.95 |
| 75% | 342.2425 |
| max | 1026.2 |
