# Preliminary EDA - Data Quality Hell

This notebook covers the initial inspection and cleaning of the **Model Case** dataset (January 1st - 15th, 2026). 

**Objective:** Prepare the data for deeper Exploratory Data Analysis (EDA) and future transformations using a stable, reproducible snapshot.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)



## 1. Load Data

We start with the merged **Model Case** dataset containing 4,483 records from 6 countries extracted for the Jan 1-15 period.

In [1]:
input_csv = Path("../data/interim/all_jobs_merged.csv")
df = pd.read_csv(input_csv)
print(f"Initial Shape: {df.shape}")
df.head()

Initial Shape: (4483, 8)


## 2. Drop Unnecessary Columns

Columns `description` (too inconsistent) and `adref` (no analytical value) are dropped to simplify the analysis.

> **Note:** We use `errors='ignore'` so the cell can be re-run without errors if the columns were already removed.

In [1]:
cols_to_drop = ['description', 'adref']
df = df.drop(columns=cols_to_drop, errors='ignore')
df.head()
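
To illustrate why `errors='ignore'` makes the cell safe to re-run, here is a minimal sketch on a toy DataFrame. The row values are made up; only the column names `description` and `adref` come from this notebook.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "title": ["Data Analyst"],
    "description": ["some free text"],
    "adref": ["abc"],
})
cols_to_drop = ["description", "adref"]

# First run removes the columns; the second run finds nothing to drop and is a no-op.
once = toy.drop(columns=cols_to_drop, errors="ignore")
twice = once.drop(columns=cols_to_drop, errors="ignore")
print(list(twice.columns))

# Without errors="ignore", dropping already-removed columns raises KeyError.
try:
    once.drop(columns=cols_to_drop)
    raised = False
except KeyError:
    raised = True
print(f"KeyError without errors='ignore': {raised}")
```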



## 3. Date Conversion

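We convert the `created` column to `datetime`, using `errors='coerce'` so that malformed strings become `NaT` instead of raising an exception. The printed count therefore shows how many values are missing after parsing, and `df.info()` confirms the resulting timezone-aware dtype (`datetime64[ns, UTC]`).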

In [1]:
df['created'] = pd.to_datetime(df['created'], errors='coerce')
print(f"Missing dates after conversion: {df['created'].isnull().sum()}")
df.info()

Missing dates after conversion: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4483 entries, 0 to 4482
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   country_code  4483 non-null   object             
 1   title         4483 non-null   object             
 2   id            4483 non-null   int64              
 3   company       4065 non-null   object             
 4   location      4483 non-null   object             
 5   created       4483 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), int64(1), object(4)
memory usage: 210.3+ KB
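
One caveat with `errors='coerce'` is that a null count taken after conversion mixes values that were already missing with values that failed to parse. The sketch below separates the two on a made-up Series. Passing `utc=True` is an assumption used here to guarantee a timezone-aware result; the notebook cell itself does not pass it.

```python
import pandas as pd

# Made-up inputs: one valid ISO timestamp, one malformed string, one missing value.
raw = pd.Series(["2026-01-03T10:15:00Z", "not a date", None])

# Malformed strings become NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Total NaT after conversion (missing + unparseable).
n_missing_after = int(parsed.isna().sum())

# Values that were present in the input but could not be parsed.
n_unparseable = int((parsed.isna() & raw.notna()).sum())

print(f"NaT after conversion: {n_missing_after}")
print(f"Present but unparseable: {n_unparseable}")
```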


## 4. Null Analysis

Identifying columns with missing values and inspecting the problematic rows.

In [1]:
null_counts = df.isnull().sum()
print("--- Null Counts per Column ---")
print(null_counts)

print("\n--- Rows with Null Title ---")
if 'title' in df.columns:
    display(df[df['title'].isnull()])

print("\n--- Sample of Rows with Null Company (First 10) ---")
if 'company' in df.columns:
    display(df[df['company'].isnull()].head(10))

--- Null Counts per Column ---
country_code      0
title             0
id                0
company         418
location          0
created           0
dtype: int64

--- Rows with Null Title ---
Empty DataFrame
Columns: [country_code, title, id, company, location, created]
Index: []

--- Sample of Rows with Null Company (First 10) ---
     country_code  \
2764           nl   
3652           sg   
3653           sg   
3654           sg   
3655           sg   
3656           sg   
3657           sg   
3658           sg   
3659           sg   
3660           sg   

                                                                                                   title  \
2764                                                                         Growth Marketeer | Skincare   
3652                                                                      Business Development Executive   
3653                                            Sales & Marketing Executive (Japanese Speaking, Support)   


**Findings:**
- **General Scope:** The dataset contains **4,483 jobs** across 6 countries for the Jan 1-15 period.
- **Country Distribution:** Switzerland (1,163), Belgium (920), and Singapore (845) are the most represented.
- **Quality Metrics:**
    - **Titles:** Perfect coverage (0 nulls).
    - **Companies:** 418 missing values (9.3%), concentrated in Singapore and the Netherlands.
    - **Dates:** All dates fall within the target range (Jan 1st - 15th).

**Proposed Strategy:** Fill missing companies with "Unknown". The data is remarkably clean regarding titles and dates, allowing us to move forward without dropping rows at this stage.
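
The figures in the findings above (records per country, where the missing companies sit, and whether dates fall in the target window) could be reproduced along these lines. This is a sketch on made-up rows, not the notebook's actual check: the column names follow the dataset, and interpreting the Jan 1-15 window in UTC is an assumption.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "country_code": ["sg", "sg", "nl", "ch"],
    "company": [None, "Acme", None, "Globex"],
    "created": pd.to_datetime(
        ["2026-01-02T08:00:00Z", "2026-01-05T09:30:00Z",
         "2026-01-15T23:59:00Z", "2026-01-16T00:10:00Z"],
        utc=True,
    ),
})

# Records per country.
per_country = toy["country_code"].value_counts()

# Share of missing companies within each country.
null_share = toy["company"].isna().groupby(toy["country_code"]).mean()

# Rows outside the target window. The end bound is exclusive so that
# every timestamp on Jan 15 still counts as in range (window assumed in UTC).
start = pd.Timestamp("2026-01-01", tz="UTC")
end = pd.Timestamp("2026-01-16", tz="UTC")
out_of_range = toy[(toy["created"] < start) | (toy["created"] >= end)]

print(per_country)
print(null_share)
print(f"Rows outside target range: {len(out_of_range)}")
```

On the real `df`, an empty `out_of_range` would back the "all dates fall within the target range" finding directly.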

## 5. Duplicate Analysis

Checking for exact row duplicates across all six remaining columns (`df.duplicated()` with no `subset` compares every column, including `id`).

In [1]:
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

if duplicates > 0:
    print("\n--- Sample of Duplicated Rows ---")
    display(df[df.duplicated()].head())

Duplicate rows: 0


**Findings:** No exact row duplicates were found in this snapshot. 

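**Interpretation:** This suggests that the Adzuna API retrieval and the subsequent flattening/merging scripts are handling pagination without overlapping records. Because `id` is part of the comparison, this check would not catch the same posting listed under two different IDs.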
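
An exact-row check can be complemented by narrower ones. The sketch below, on made-up rows, contrasts three views: exact duplicates, repeated `id` values, and rows that match on everything except `id`. The `subset` choices are illustrative assumptions, not checks the notebook runs.

```python
import pandas as pd

# Made-up rows: ids 101/102 share identical content, id 103 appears twice
# with slightly different titles.
toy = pd.DataFrame({
    "id": [101, 102, 103, 103],
    "title": ["Data Analyst", "Data Analyst", "Engineer", "Engineer (m/f)"],
    "company": ["Acme", "Acme", "Globex", "Globex"],
    "location": ["Zurich", "Zurich", "Geneva", "Geneva"],
})

# Exact duplicates: every column must match.
exact_dupes = int(toy.duplicated().sum())

# Same id seen more than once, regardless of the other columns.
id_dupes = int(toy.duplicated(subset=["id"]).sum())

# Same content under different ids.
content_cols = [c for c in toy.columns if c != "id"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(f"Exact: {exact_dupes}, by id: {id_dupes}, by content: {content_dupes}")
```

Here the exact check reports nothing while the other two each flag one row, which is the kind of case a full-row comparison alone would miss.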

## 6. Initial Cleaning (Execution)

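Applying the decisions made above. Dropping duplicates and null-title rows changes nothing on this snapshot (both counts are 0), which is why the final shape still has 4,483 rows.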

In [1]:
# Remove duplicates
df = df.drop_duplicates()

# Drop rows with a null title
df = df.dropna(subset=['title'])

# Fill null companies
df['company'] = df['company'].fillna('Unknown')

print(f"Final Shape after cleaning: {df.shape}")
df.isnull().sum()

Final Shape after cleaning: (4483, 6)
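
For reproducibility, the same three steps could be wrapped in a function and followed by a few checks. This is a sketch under assumptions: the function name `clean` and the toy rows are hypothetical, and only the `title` and `company` columns are modelled.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps from the cell above (hypothetical wrapper)."""
    out = df.drop_duplicates()
    out = out.dropna(subset=["title"])
    # assign returns a new frame, so the caller's DataFrame is left untouched.
    return out.assign(company=out["company"].fillna("Unknown"))

# Made-up rows: one exact duplicate, one null title, one null company.
toy = pd.DataFrame({
    "title": ["Data Analyst", "Data Analyst", None, "Engineer"],
    "company": ["Acme", "Acme", "Globex", None],
})
cleaned = clean(toy)

# Post-cleaning checks: no nulls, no duplicates, and a second pass changes nothing.
no_nulls = int(cleaned.isna().sum().sum()) == 0
no_dupes = int(cleaned.duplicated().sum()) == 0
idempotent = clean(cleaned).equals(cleaned)

print(cleaned)
print(no_nulls, no_dupes, idempotent)
```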


## Next Steps

1. **Job Content Analysis:** Perform a frequency analysis of job titles to identify common roles.
2. **Geographical Distribution:** Visualize job density across the 6 represented countries.
3. **Time Series Exploration:** Analyze daily job posting counts to identify trends in early January.
4. **Company Profiling:** Identify the top recruiters in this dataset (once the "Unknown" entries are handled).
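
As a starting point for steps 1 and 3, title frequencies and daily posting counts could be computed roughly as follows. The rows are made up, and normalizing titles with `strip`/`lower` before counting is an assumption about how near-identical titles should be grouped.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataset.
toy = pd.DataFrame({
    "title": ["Data Analyst", "data analyst ", "Engineer"],
    "created": pd.to_datetime(
        ["2026-01-02T08:00:00Z", "2026-01-02T17:45:00Z", "2026-01-03T09:00:00Z"],
        utc=True,
    ),
})

# Step 1: title frequency after light normalization (assumed: trim + lowercase).
title_counts = toy["title"].str.strip().str.lower().value_counts()

# Step 3: postings per calendar day (dates taken in UTC).
daily_counts = toy.groupby(toy["created"].dt.date).size()

print(title_counts)
print(daily_counts)
```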