# Preliminary EDA - Data Quality Hell

This notebook covers the initial inspection and cleaning of the **Model Case** dataset (January 1st - 15th, 2026). 

**Objective:** Prepare the data for deeper Exploratory Data Analysis (EDA) and future transformations using a stable, reproducible snapshot.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load Data

We start with the merged **Model Case** dataset containing 39,844 records from 19 countries extracted for the Jan 1-15 period.

In [None]:
input_csv = Path("../data/interim/all_jobs_merged.csv")
df = pd.read_csv(input_csv)
print(f"Initial Shape: {df.shape}")
df.head()

## 2. Drop Unnecessary Columns

Columns `description` (too inconsistent) and `adref` (no analytical value) are dropped to simplify the analysis.

> **Note:** We use `errors='ignore'` so the cell can be re-run without errors if the columns were already removed.

In [None]:
cols_to_drop = ['description', 'adref']
df = df.drop(columns=cols_to_drop, errors='ignore')
df.head()
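
The re-run behaviour described in the note can be illustrated on a toy frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "id": [1, 2],
    "title": ["Data Engineer", "Data Analyst"],
    "description": ["...", "..."],
    "adref": ["a", "b"],
})
cols_to_drop = ["description", "adref"]

# First run removes the two columns.
once = toy.drop(columns=cols_to_drop, errors="ignore")

# Second run finds nothing left to drop and returns the frame unchanged
# instead of raising KeyError.
twice = once.drop(columns=cols_to_drop, errors="ignore")

print(list(twice.columns))
```

Without `errors='ignore'`, the second call would raise a `KeyError` because the columns no longer exist.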

## 3. Date Conversion

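We convert the `created` column to the pandas `datetime` type. Passing `errors='coerce'` turns any malformed string into `NaT` instead of raising, so the count printed below covers both values that were missing originally and values that could not be parsed.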

In [None]:
df['created'] = pd.to_datetime(df['created'], errors='coerce')
print(f"Missing dates after conversion: {df['created'].isnull().sum()}")
df.info()
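
To tell originally missing dates apart from strings that failed to parse, the raw column can be compared with the converted one. A minimal sketch on made-up values (the `raw` series is hypothetical and does not reflect the actual format of `created`):

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp, one malformed string, one missing.
raw = pd.Series(["2026-01-03 10:15:00", "not a date", None])

# errors="coerce" maps anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present before conversion but are NaT afterwards.
newly_invalid = parsed.isna() & raw.notna()

print(f"NaT after conversion: {parsed.isna().sum()}")
print(f"Of which malformed strings: {newly_invalid.sum()}")
```

On the real data, this would require keeping a copy of the unconverted column before overwriting `df['created']`.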

**Findings:**
- **General Scope:** The dataset contains **39,844 job records** across 19 countries for the Jan 1-15 period.
- **Technical Depth:** Captures specialized roles (Data Engineer, Scientist, Analyst, MLOps, Architect).
- **Quality Metrics:**
    - **Titles:** Perfect coverage (0 nulls).
    - **Companies:** 1,333 missing values (3.3%), appearing across multiple territories.
    - **Dates:** All dates fall within the target range (Jan 1st - 15th).

**Proposed Strategy:** Fill missing companies with "Unknown". The `search_term` column is populated for 35,361 records (the remaining 4,483 have no search term), which enables role-based segment analysis.
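
The date-range finding can be re-checked in a few lines. The sketch below runs on a toy frame and assumes that `created` is already parsed and timezone-naive; the window bounds are assumptions derived from the stated Jan 1-15, 2026 period.

```python
import pandas as pd

# Toy stand-in for the parsed `created` column (illustrative values only).
toy = pd.DataFrame({
    "created": pd.to_datetime([
        "2026-01-01 00:05:00",
        "2026-01-15 23:59:00",
        "2026-01-16 00:00:00",  # deliberately just outside the window
    ])
})

# Assumed window: Jan 1 inclusive up to, but not including, Jan 16.
start = pd.Timestamp("2026-01-01")
end_exclusive = pd.Timestamp("2026-01-16")

in_window = (toy["created"] >= start) & (toy["created"] < end_exclusive)

# Rows with NaT compare as False, so they would also show up here.
out_of_range = toy.loc[~in_window]

print(f"Rows outside the target window: {len(out_of_range)}")
```

If `created` turns out to be timezone-aware, the bounds would need a matching timezone, for example `pd.Timestamp("2026-01-01", tz="UTC")`.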

In [None]:
null_counts = df.isnull().sum()
print("--- Null Counts per Column ---")
print(null_counts)

print("\n--- Sample of Rows with Null Company (First 10) ---")
if 'company' in df.columns:
    display(df[df['company'].isnull()].head(10))
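
Percentages such as the 3.3% quoted for `company` are easier to read next to the raw counts. A minimal sketch on a toy frame (the `null_summary` name and the toy values are illustrative):

```python
import pandas as pd

# Toy frame with two missing companies out of four rows (illustrative only).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "company": ["X", None, "Y", None],
})

# isnull().mean() gives the fraction of nulls per column.
null_summary = pd.DataFrame({
    "nulls": toy.isnull().sum(),
    "pct": (toy.isnull().mean() * 100).round(1),
})

print(null_summary)
```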

**Status Check:** The data is mostly clean regarding mandatory fields (Title, Date, Location). The next step is to address internal consistency and redundant records.

## 4. Multi-Role & Record Redundancy Analysis

We check for two kinds of redundancy: rows that are identical in every column, and rows that share the same job identifier (`id`).

In [None]:
exact_duplicates = df.duplicated().sum()
id_duplicates = df.duplicated(subset=['id']).sum()

print(f"Exact row duplicates: {exact_duplicates}")
print(f"Duplicate job IDs: {id_duplicates}")

if id_duplicates > 0:
    print("\n--- Sample of Rows with Duplicate IDs ---")
    # Show some examples of duplicated IDs to understand why they exist
    duplicate_ids = df[df.duplicated(subset=['id'])]['id'].head(3)
    display(df[df['id'].isin(duplicate_ids)].sort_values(by='id').head(10))

**Findings:** While there are no exact row duplicates, we found **11,819 duplicate job IDs**, that is, rows whose `id` already appears in an earlier row.

**Revised Strategy:** Instead of treating these as 'junk' duplicates, we recognize that the same job can match multiple search terms (e.g., 'Data Engineer' and 'Big Data'). 

To preserve the richness of the classification, we will **not** drop these records yet. This allows us to perform a more accurate 'Market Demand' analysis by role in the next notebook.
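
Keeping the repeated IDs supports two different views of the same data. The sketch below uses a toy frame with the `id` and `search_term` columns; the values and the variable names (`demand_by_role`, `terms_per_job`) are hypothetical, and the next notebook may take a different approach.

```python
import pandas as pd

# Toy frame: job 101 matched two search terms, job 102 matched one.
toy = pd.DataFrame({
    "id": [101, 101, 102],
    "title": ["Senior Data Engineer", "Senior Data Engineer", "Data Analyst"],
    "search_term": ["Data Engineer", "Big Data", "Data Analyst"],
})

# Demand by role: each job counts once for every search term it matched.
demand_by_role = toy.groupby("search_term")["id"].nunique()

# Per-job view: the sorted list of search terms that matched each job ID.
terms_per_job = toy.groupby("id")["search_term"].agg(
    lambda s: sorted(s.dropna().unique())
)

# Market size: each job counts once regardless of how many terms matched.
unique_jobs = toy["id"].nunique()

print(demand_by_role)
print(terms_per_job)
print(f"Unique jobs: {unique_jobs}")
```

Dropping duplicate IDs up front would keep only one arbitrary search term per job and undercount the other roles.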

## 5. Initial Cleaning (Execution)

Applying the decisions made above.

In [None]:
# Note: We are KEEPING duplicate IDs to preserve multi-role classification
# as decided in the Data Quality strategy update.

# Drop rows with a null title (safety check; the null counts showed none)
df = df.dropna(subset=['title'])

# Fill null companies
df['company'] = df['company'].fillna('Unknown')

print(f"Final Shape after cleaning: {df.shape}")
print(f"Total Unique Jobs (by ID): {df['id'].nunique()}")
print(f"\nRemaining missing values:\n{df.isnull().sum()}")
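
A few assertions can guard the cleaned frame against regressions when the notebook is re-run. This is a sketch on a toy frame, with made-up company names, that mirrors the cleaning steps above; it is not a definitive validation suite.

```python
import pandas as pd

# Toy frame: one repeated ID, one null company, one null title (illustrative).
toy = pd.DataFrame({
    "id": [101, 101, 102, 103],
    "title": ["Data Engineer", "Data Engineer", "Data Analyst", None],
    "company": ["Acme", "Acme", None, "Initech"],
})

# Same steps as the cleaning cell: drop null titles, fill null companies.
cleaned = toy.dropna(subset=["title"]).copy()
cleaned["company"] = cleaned["company"].fillna("Unknown")

# No nulls should remain in the cleaned columns.
assert cleaned["title"].notna().all()
assert cleaned["company"].notna().all()

# With no null IDs, rows = unique IDs + rows that repeat an earlier ID.
repeated_rows = cleaned.duplicated(subset=["id"]).sum()
assert len(cleaned) == cleaned["id"].nunique() + repeated_rows

print(f"Rows: {len(cleaned)}, repeated-ID rows: {repeated_rows}")
```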

## Next Steps

1. **Job Content Analysis:** Perform a frequency analysis of job titles to identify common roles.
2. **Geographical Distribution:** Visualize job density across the different countries.
3. **Time Series Exploration:** Analyze daily job posting counts to identify trends in early January.
4. **Company Profiling:** Identify the top recruiters in this dataset.