## Data Cleaning and Merging

This section walks through the data ingestion and transformation process to prepare the dataset for analysis. Key steps include:

- Load multiple yearly datasets from CSVs
- Clean the job postings dataset
- Integrate exchange rate data to convert salaries to EUR
- Merge country metadata for regional classification
- Export the cleaned dataset for further exploration

### Import Libraries

In [1]:
from pathlib import Path
import ast

import requests
import pandas as pd
from datetime import datetime
from tqdm import tqdm

### Load multiple yearly datasets from CSVs

In [2]:
raw_data_dir = Path.cwd().parents[1] / 'Raw_Data'
csv_files = list(raw_data_dir.glob('*data_jobs*.csv'))

dfs = [pd.read_csv(f) for f in csv_files]
df = pd.concat(dfs, ignore_index=True)

df.head(3)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Data Analyst,"Summer Internship -Data Analyst Intern, Risk M...","Marlborough, MA",via Boatingrevealed.com,"Full-time, Part-time, and Internship",False,"New York, United States",2024-01-01 00:00:01,False,True,United States,,,,BJ's Wholesale Club,['excel'],{'analyst_tools': ['excel']}
1,Data Analyst,"Staff Data Analyst Operations, Infrastructure ...","Fremont, CA",via ClimateTechList,Full-time,False,"California, United States",2024-01-01 00:00:11,True,False,United States,,,,Tesla,"['tableau', 'flow']","{'analyst_tools': ['tableau'], 'other': ['flow']}"
2,Data Analyst,Junior Data Analyst - Entry Level,"Waco, TX",via ZipRecruiter,Full-time and Part-time,False,"Texas, United States",2024-01-01 00:00:15,True,False,United States,,,,Next Recruiting,,
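
The loading cell above assumes the glob finds at least one file; with an empty list, `pd.concat` raises a `ValueError`. A minimal sketch of a guarded loader follows. The helper name `load_yearly_csvs` and the extra `source_file` column are hypothetical additions, not part of the original notebook, and the demo writes two tiny made-up CSVs to a temporary folder.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_yearly_csvs(data_dir, pattern='*data_jobs*.csv'):
    """Read every CSV matching `pattern` in `data_dir` into one DataFrame."""
    csv_files = sorted(Path(data_dir).glob(pattern))  # sorted for a stable row order
    if not csv_files:
        raise FileNotFoundError(f'No files matching {pattern!r} in {data_dir}')
    # Hypothetical extra column: remember which file each row came from
    frames = [pd.read_csv(f).assign(source_file=f.name) for f in csv_files]
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up yearly files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({'job_title_short': ['Data Analyst']}).to_csv(
        tmp_dir / 'data_jobs_2023.csv', index=False)
    pd.DataFrame({'job_title_short': ['Data Engineer']}).to_csv(
        tmp_dir / 'data_jobs_2024.csv', index=False)
    demo_df = load_yearly_csvs(tmp_dir)
```

Sorting the file list keeps the concatenated row order reproducible between runs, which `glob` alone does not guarantee.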


### Clean the job postings dataset

In [3]:
df = df.dropna(subset=['salary_year_avg'])
df = df.drop_duplicates()

df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
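
The `job_skills` column arrives as the string form of a Python list, so `ast.literal_eval` turns it back into a real list. Deduplication has to run before this step, because `drop_duplicates` cannot hash list values. The sketch below shows the parsing on made-up rows; the helper `parse_skills` is a hypothetical variation that checks `isinstance(value, str)` so the cell can be re-run safely on already-parsed data.

```python
import ast

import pandas as pd

# Made-up rows mirroring the raw string format of job_skills
demo = pd.DataFrame({'job_skills': ["['tableau', 'flow']", "['excel']", None]})


def parse_skills(value):
    """Parse a stringified list; leave missing or already-parsed values untouched."""
    return ast.literal_eval(value) if isinstance(value, str) else value


demo['job_skills'] = demo['job_skills'].apply(parse_skills)

# With real lists, one row per skill is a single explode() away
skill_counts = demo['job_skills'].explode().value_counts()
```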

### Integrate exchange rate data to convert salaries to EUR

To convert salaries to a more relevant local currency (EUR), the next cell collects a USD→EUR exchange rate for every posting date from 2023 onward using the public Frankfurter API. These rates are later used to calculate both annual and monthly salaries in EUR.

In [4]:
current_year = datetime.now().year

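# Fetch the USD-to-EUR rate for one date (YYYY-MM-DD) from the Frankfurter API
def get_usd_to_eur_rate_frankfurter(date_str):
    url = f"https://api.frankfurter.app/{date_str}?from=USD&to=EUR"
    # A timeout keeps one stalled request from hanging the whole loop
    response = requests.get(url, timeout=10)
    data = response.json()
    # Return None when the response carries no EUR rate
    if 'rates' in data and 'EUR' in data['rates']:
        return data['rates']['EUR']
    return None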

# Keep unique posting dates from 2023 through the current year
unique_dates = sorted(
    date for date in df['job_posted_date'].dt.date.unique()
    if 2023 <= date.year <= current_year
)

# Fetch exchange rates only for filtered dates
ex_rates = []
for date_obj in tqdm(unique_dates, desc='Fetching exchange rates'):
    date_str = date_obj.strftime('%Y-%m-%d')
    rate = get_usd_to_eur_rate_frankfurter(date_str)
    ex_rates.append({'job_posted_date': date_obj, 'usd_to_eur': rate})

# Save results
df_ex_rate = pd.DataFrame(ex_rates).round(4)
df_ex_rate.to_csv(raw_data_dir / 'ex_rate_daily.csv', index=False)

df_ex_rate.head(3)

Fetching exchange rates: 100%|██████████| 731/731 [05:02<00:00,  2.41it/s]


Unnamed: 0,job_posted_date,usd_to_eur
0,2023-01-01,0.9376
1,2023-01-02,0.9361
2,2023-01-03,0.9483
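
The run above made 731 requests and took about five minutes, and the results are already written to `ex_rate_daily.csv`. A possible refinement is to reuse that file as a cache and fetch only the dates it does not yet contain. The sketch below is one way to do that under stated assumptions: `load_or_fetch_rates` is a hypothetical helper, and `fake_fetch` stands in for the real API call so the example runs offline.

```python
from datetime import date
from pathlib import Path
import tempfile

import pandas as pd


def load_or_fetch_rates(dates, cache_path, fetch_rate):
    """Return one rate per date, calling `fetch_rate` only for uncached dates."""
    cache_path = Path(cache_path)
    rates = {}
    if cache_path.exists():
        # Ignore cached rows whose fetch failed earlier, so they are retried
        cached = pd.read_csv(cache_path).dropna(subset=['usd_to_eur'])
        rates = dict(zip(cached['job_posted_date'], cached['usd_to_eur']))
    for date_obj in dates:
        date_str = date_obj.strftime('%Y-%m-%d')
        if date_str not in rates:
            rates[date_str] = fetch_rate(date_str)
    df_rates = pd.DataFrame(sorted(rates.items()),
                            columns=['job_posted_date', 'usd_to_eur'])
    df_rates.to_csv(cache_path, index=False)
    return df_rates


calls = []  # records which dates actually triggered a fetch


def fake_fetch(date_str):
    """Offline stand-in for the API call; always returns a made-up rate."""
    calls.append(date_str)
    return 0.9


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'ex_rate_daily.csv'
    first = load_or_fetch_rates([date(2023, 1, 2), date(2023, 1, 3)], cache, fake_fetch)
    second = load_or_fetch_rates(
        [date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 4)], cache, fake_fetch)
```

In the notebook, `get_usd_to_eur_rate_frankfurter` would be passed as `fetch_rate`. The returned dates are strings, which the later `pd.to_datetime` call on `df_ex_rate` would convert.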


In [33]:
# Merge exchange rates and convert salary_year_avg to EUR
df_Final = df.rename(columns={'salary_year_avg': 'salary_year_avg_usd'})

# Normalize both merge keys to plain dates so rows match on calendar day
df_Final['job_posted_date'] = pd.to_datetime(df_Final['job_posted_date']).dt.date
df_ex_rate['job_posted_date'] = pd.to_datetime(df_ex_rate['job_posted_date']).dt.date

df_Final = df_Final.merge(df_ex_rate, on='job_posted_date', how='left')

df_Final['salary_year_avg_eur'] = (df_Final['salary_year_avg_usd'] * df_Final['usd_to_eur']).round(2)
df_Final['salary_month_avg_eur'] = (df_Final['salary_year_avg_eur'] / 12).round(2)

df_Final.loc[1:3, ['job_title_short', 'job_posted_date', 'usd_to_eur', 'salary_year_avg_usd', 'salary_year_avg_eur', 'salary_month_avg_eur']]

Unnamed: 0,job_title_short,job_posted_date,usd_to_eur,salary_year_avg_usd,salary_year_avg_eur,salary_month_avg_eur
1,Data Scientist,2024-01-01,0.905,112500.0,101812.5,8484.38
2,Data Scientist,2024-01-01,0.905,162623.5,147174.27,12264.52
3,Data Analyst,2024-01-01,0.905,42500.0,38462.5,3205.21
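
Because the merge is a left join, any posting whose date has no exchange rate (for example, a date before 2023) keeps its row but ends up with missing EUR salaries. The sketch below reproduces the conversion on a few rows and flags such gaps. The 2024-01-01 figures come from the output above; the 2022-12-31 row is made up for illustration, and the `validate` and `indicator` arguments are optional additions, not part of the original cell.

```python
import datetime as dt

import pandas as pd

postings = pd.DataFrame({
    'job_posted_date': [dt.date(2024, 1, 1), dt.date(2024, 1, 1), dt.date(2022, 12, 31)],
    'salary_year_avg_usd': [112500.0, 42500.0, 90000.0],
})
rates = pd.DataFrame({'job_posted_date': [dt.date(2024, 1, 1)], 'usd_to_eur': [0.905]})

merged = postings.merge(
    rates,
    on='job_posted_date',
    how='left',
    validate='many_to_one',  # raises if a date appears twice in the rate table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
merged['salary_year_avg_eur'] = (merged['salary_year_avg_usd'] * merged['usd_to_eur']).round(2)
merged['salary_month_avg_eur'] = (merged['salary_year_avg_eur'] / 12).round(2)

# Rows that found no exchange rate and therefore have no EUR salary
unmatched = merged[merged['_merge'] == 'left_only']
```

For the first row this gives 112500 × 0.905 = 101812.5 EUR per year, matching the notebook output.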


### Merge country metadata for regional classification

In [34]:
df_EU = pd.read_csv(raw_data_dir / 'EU_Countries_dict.csv', delimiter=';')

df_Final = df_Final.merge(df_EU, how='left', left_on='job_country', right_on='country')
pd.set_option('future.no_silent_downcasting', True)
df_Final['is_eu'] = df_Final['is_eu'].fillna(False).astype(bool)

df_Final.loc[1001:1003, ['job_title_short', 'job_country', 'is_eu']]

Unnamed: 0,job_title_short,job_country,is_eu
1001,Data Scientist,United States,False
1002,Data Engineer,United States,False
1003,Data Analyst,Germany,True
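
The country merge matches on exact names, so a spelling difference between `job_country` and the lookup table would silently classify an EU country as non-EU. A minimal sketch of a check for unmatched names is shown below. The lookup rows are made up (the real contents of `EU_Countries_dict.csv` are not shown here), and `.eq(True)` is used as one alternative way to turn missing values into `False`.

```python
import pandas as pd

jobs = pd.DataFrame({'job_country': ['Germany', 'United States', 'Czechia']})

# Hypothetical lookup rows; only the column names follow the notebook
eu_lookup = pd.DataFrame({
    'country': ['Germany', 'Czech Republic'],
    'ISO': ['DE', 'CZ'],
    'is_eu': [True, True],
})

merged = jobs.merge(eu_lookup, how='left', left_on='job_country', right_on='country')

# Missing values compare unequal to True, so unmatched rows become False
merged['is_eu'] = merged['is_eu'].eq(True)

# Countries with no row in the lookup table; worth a manual review
unmatched_countries = sorted(merged.loc[merged['country'].isna(), 'job_country'].unique())
```

Here `United States` is correctly unmatched, while `Czechia` would be a naming mismatch that needs fixing in the lookup table.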


In [35]:
# Add Region grouping column
def group_country(row):
    if row['job_country'] == 'United States':
        return 'US'
    elif row['is_eu']:
        return 'EU'
    else:
        return 'Other'

df_Final['region_group'] = df_Final.apply(group_country, axis=1)

df_Final.loc[1001:1003, ['region_group', 'job_country', 'salary_month_avg_eur']]

Unnamed: 0,region_group,job_country,salary_month_avg_eur
1001,US,United States,5746.25
1002,US,United States,9960.17
1003,EU,Germany,5516.4
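
The row-wise `apply` above is easy to read but calls a Python function once per row. On larger datasets, the same three-way grouping could be expressed as a vectorized operation with `numpy.select`, sketched here on made-up rows. Conditions are checked in order, so the US test takes priority just as in `group_country`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'job_country': ['United States', 'Germany', 'India'],
    'is_eu': [False, True, False],
})

# The first matching condition wins; rows matching none get the default
conditions = [demo['job_country'] == 'United States', demo['is_eu']]
demo['region_group'] = np.select(conditions, ['US', 'EU'], default='Other')
```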


### Export the cleaned dataset for further exploration

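After cleaning and merging, the final dataset is saved as a pickle file (`df_Final_2.pkl`) for downstream analysis. Unlike CSV, pickle preserves the parsed `job_skills` lists. Columns that are no longer needed (`usd_to_eur`, `ISO`, `salary_year_avg_usd`, `salary_hour_avg`) are dropped first to reduce noise.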

In [36]:
df_Final.drop(columns=['usd_to_eur', 'ISO', 'salary_year_avg_usd', 'salary_hour_avg'], inplace=True)
df_Final.to_pickle(raw_data_dir / 'df_Final_2.pkl')

df_Final.head(3)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,company_name,job_skills,job_type_skills,salary_year_avg_eur,salary_month_avg_eur,country,is_eu,region_group
0,Data Engineer,"Snowflake Data Engineer (Dallas, TX)",Texas,via Built In,Full-time,False,"Illinois, United States",2024-01-01,False,True,United States,year,Confie,"[python, sql, snowflake, azure, databricks, bi...","{'cloud': ['snowflake', 'azure', 'databricks',...",108600.0,9050.0,,False,US
1,Data Scientist,Data Scientist III (50% REMOTE) Jobs,"Huntsville, AL",via Clearance Jobs,Contractor,False,Georgia,2024-01-01,False,False,United States,year,Advantex Consulting,"[java, scala, python, aws, tensorflow, pytorch...","{'cloud': ['aws'], 'libraries': ['tensorflow',...",101812.5,8484.38,,False,US
2,Data Scientist,Principal Data Scientist,"Palmyra, PA",via Ladders,Full-time,False,"New York, United States",2024-01-01,False,False,United States,year,Unisys Corporation,[express],{'webframeworks': ['express']},147174.27,12264.52,,False,US
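
The choice of pickle over CSV matters for the `job_skills` column: a CSV round trip turns each list back into a string, which would require another `ast.literal_eval` pass after loading. The sketch below demonstrates the difference on made-up data written to a temporary folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

demo = pd.DataFrame({'job_skills': [['python', 'sql'], ['excel']]})

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo.to_pickle(tmp_dir / 'demo.pkl')
    demo.to_csv(tmp_dir / 'demo.csv', index=False)
    from_pickle = pd.read_pickle(tmp_dir / 'demo.pkl')
    from_csv = pd.read_csv(tmp_dir / 'demo.csv')

# Pickle keeps the Python list; CSV hands back its text representation
pickle_type = type(from_pickle.loc[0, 'job_skills']).__name__
csv_type = type(from_csv.loc[0, 'job_skills']).__name__
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.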
