# Preliminary EDA - Data Quality Hell

This notebook covers the initial inspection and cleaning of the consolidated jobs dataset. 

**Objective:** Prepare the data for deeper Exploratory Data Analysis (EDA) and future transformations.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load Data

We start with the merged dataset containing ~9,500 records from 19 countries.

In [None]:
input_csv = Path("../data/interim/all_jobs_merged.csv")
df = pd.read_csv(input_csv)
print(f"Initial Shape: {df.shape}")
df.head()

## 2. Drop Unnecessary Columns

We drop two columns to simplify the analysis: `description` (too inconsistent) and `adref` (no analytical value).

> **Note:** We use `errors='ignore'` so the cell can be re-run without errors if the columns were already removed.

In [None]:
cols_to_drop = ['description', 'adref']
df = df.drop(columns=cols_to_drop, errors='ignore')
df.head()
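
The re-run behaviour described in the note can be checked on a small stand-in frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the jobs dataset.

```python
import pandas as pd

# Toy frame standing in for the jobs data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Data Analyst", "ML Engineer"],
    "description": ["text a", "text b"],
    "adref": ["abc", "def"],
})

cols_to_drop = ["description", "adref"]

# First call removes both columns
first_pass = toy.drop(columns=cols_to_drop, errors="ignore")

# Second call is a no-op: the columns are already gone and no KeyError is raised
second_pass = first_pass.drop(columns=cols_to_drop, errors="ignore")

print(list(second_pass.columns))
```

Without `errors='ignore'`, the second call would raise a `KeyError` because the labels no longer exist.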

## 3. Date Conversion

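Convert the `created` column to a standard `datetime` dtype. Passing `errors='coerce'` turns any malformed strings into `NaT` instead of raising an exception, so the count printed below covers both values that were already missing and values that failed to parse.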

In [None]:
df['created'] = pd.to_datetime(df['created'], errors='coerce')
print(f"Missing dates after conversion: {df['created'].isnull().sum()}")
df.info()
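
To see what `errors='coerce'` does in isolation, here is a small sketch on a hand-made Series. The sample strings are assumptions chosen for illustration; the real `created` values may use a different format.

```python
import pandas as pd

# One valid timestamp, one malformed string, one missing value (illustrative data)
raw = pd.Series(["2024-01-15 08:30:00", "not a date", None])

# Malformed and missing entries both become NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(f"Missing after conversion: {parsed.isna().sum()}")
```

Comparing `raw.isna().sum()` with `parsed.isna().sum()` separates values that were missing from the start from values that could not be parsed.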

## 4. Null Analysis

Identifying columns with missing values and inspecting the problematic rows.

In [None]:
null_counts = df.isnull().sum()
print("--- Null Counts per Column ---")
print(null_counts)

print("\n--- Rows with Null Title ---")
display(df[df['title'].isnull()])

print("\n--- Sample of Rows with Null Company (First 10) ---")
display(df[df['company'].isnull()].head(10))

**Findings:**
- **Title:** 1 missing value. This row is missing critical information and should be removed.
- **Company:** 319 missing values. Many of these seem to be valid jobs where the company name was simply not provided by the API.

**Proposed Strategy:** Fill missing companies with "Unknown" and drop the single row missing a title.
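
Raw null counts are easier to judge next to the share of rows they represent (319 missing companies out of roughly 9,500 rows is a little over 3%). The sketch below builds such a summary on a toy frame; the `null_summary` name and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with one null title and two null companies (illustrative data)
toy = pd.DataFrame({
    "title": ["Analyst", None, "Engineer", "Scientist"],
    "company": ["Acme", "Globex", None, None],
})

# isnull().mean() gives the fraction of nulls per column
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(2),
})

print(null_summary)
```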

## 5. Duplicate Analysis

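Checking for exact row duplicates. Called without a `subset`, `df.duplicated()` compares all remaining columns, so a row is flagged only if it matches an earlier row in every column.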

In [None]:
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

if duplicates > 0:
    print("\n--- Sample of Duplicated Rows ---")
    display(df[df.duplicated()].head())

**Proposed Strategy:** Drop the 33 duplicate rows, keeping the first occurrence of each, to ensure analysis integrity.
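
It helps to be clear about what the duplicate count measures. With the default `keep='first'`, `duplicated()` flags only the repeated copies, not the first occurrence, and `drop_duplicates()` removes exactly those flagged rows. A minimal sketch on made-up data:

```python
import pandas as pd

# Three identical rows plus one unique row (illustrative data)
toy = pd.DataFrame({
    "title": ["Analyst", "Analyst", "Analyst", "Engineer"],
    "company": ["Acme", "Acme", "Acme", "Globex"],
})

# Default keep='first': counts only the repeated copies
extra_copies = toy.duplicated().sum()

# keep=False: counts every row that belongs to a duplicate group
all_involved = toy.duplicated(keep=False).sum()

# Removes the repeated copies and keeps the first occurrence
deduped = toy.drop_duplicates()

print(extra_copies, all_involved, len(deduped))
```

Passing `keep=False` to the inspection cell is a convenient way to view each duplicated row side by side with the original it matches.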

## 6. Initial Cleaning (Execution)

Applying the decisions made above.

In [None]:
# Remove duplicates
df = df.drop_duplicates()

# Remove null title
df = df.dropna(subset=['title'])

# Fill null companies
df['company'] = df['company'].fillna('Unknown')

print(f"Final Shape after cleaning: {df.shape}")
df.isnull().sum()
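
A few assertions can confirm that the cleaning decisions held. The sketch below applies the same three steps to a toy frame and then validates the result; `check_cleaned` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if any cleaning decision did not hold."""
    assert frame["title"].notna().all(), "null titles remain"
    assert frame["company"].notna().all(), "null companies remain"
    assert not frame.duplicated().any(), "duplicate rows remain"


# Toy frame: one duplicate row, one null title, one null company (illustrative data)
toy = pd.DataFrame({
    "title": ["Analyst", "Analyst", None, "Engineer"],
    "company": ["Acme", "Acme", "Globex", None],
})

# Same steps as the cleaning cell: dedupe, drop null titles, fill companies
cleaned = toy.drop_duplicates().dropna(subset=["title"]).copy()
cleaned["company"] = cleaned["company"].fillna("Unknown")

check_cleaned(cleaned)
print(cleaned.shape)
```

On the real data the same helper could be called as `check_cleaned(df)` right after the cleaning cell.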

## Next Steps
1. Save this baseline cleaned version for EDA.
2. Start analyzing job counts by country and time trends.
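
As a rough sketch of these two steps, the example below saves a toy frame, reloads it, and computes counts by country and by month. The `country` column name and the `jobs_cleaned_baseline.csv` file name are assumptions made for illustration; the real dataset may use different names, and a temporary directory stands in for the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy cleaned frame (illustrative data; `country` is a hypothetical column name)
toy = pd.DataFrame({
    "title": ["Analyst", "Engineer", "Scientist"],
    "country": ["gb", "us", "gb"],
    "created": pd.to_datetime(["2024-01-03", "2024-01-20", "2024-02-11"]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_csv = Path(tmp) / "jobs_cleaned_baseline.csv"  # hypothetical file name
    toy.to_csv(out_csv, index=False)
    # CSV does not preserve dtypes, so parse the date column again on load
    reloaded = pd.read_csv(out_csv, parse_dates=["created"])

# Job counts by country
jobs_per_country = reloaded["country"].value_counts()

# Job counts by calendar month of the `created` date
jobs_per_month = reloaded.groupby(reloaded["created"].dt.to_period("M")).size()

print(jobs_per_country)
print(jobs_per_month)
```

Because CSV stores dates as plain text, a format such as Parquet may be worth considering if preserving dtypes between notebooks matters.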