# # Table of Contents
# 1. [Importing Libraries](#import-libraries)
# 2. [Advanced Exploratory Analysis (EDA)](#eda-advanced)
# 3. [Correlations, Boxplots, and Histograms](#plots)
# 4. [Categorical vs. Salary](#cat-vs-salary)
# 5. [Scatter Experience vs. Age vs. Salary](#scatter)

# # Importing Libraries <a id="import-libraries"></a>

In [1]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from typing import Optional, List
from project_pwc.config import FIGURES_DIR

sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
num_cols = ["Age", "Years of Experience", "Salary"]

2025-01-19 20:52:39.417 | INFO     | project_pwc.config:<module>:11 - PROJ_ROOT path is: C:\Users\Usuario\Documents\prueba_pwc\predictive_salary_model


# # Advanced Exploratory Analysis (EDA) <a id="eda-advanced"></a>

In [4]:
df_clean = pd.read_csv('C:/Users/Usuario/Documents/prueba_pwc/predictive_salary_model/data/interim/dataset_cleaned.csv')

In [3]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   373 non-null    int64  
 1   Age                  373 non-null    int64  
 2   Gender               373 non-null    object 
 3   Education Level      373 non-null    object 
 4   Job Title            373 non-null    object 
 5   Years of Experience  373 non-null    int64  
 6   Salary               373 non-null    float64
 7   Description          373 non-null    object 
dtypes: float64(1), int64(3), object(4)
memory usage: 23.4+ KB


# # Correlations, Boxplots, and Histograms <a id="plots"></a>

In [31]:
def plot_correlation_heatmap(df: pd.DataFrame, numerical_cols: List[str], filename: str = "correlation_heatmap.png") -> None:
    """Plot the correlation matrix of the numerical columns and save it to FIGURES_DIR."""
    corr_matrix = df[numerical_cols].corr()

    plt.figure(figsize=(8, 6))
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
    plt.title("Correlation Heatmap")

    output_path = os.path.join(FIGURES_DIR, filename)
    plt.savefig(output_path)
    print(f"Heatmap saved to: {output_path}")

plot_correlation_heatmap(df_clean, num_cols)

Description

The heatmap shows the pairwise correlations among Age, Years of Experience, and Salary.
Age and Years of Experience are very highly correlated (~0.98), and both are strongly correlated with Salary (~0.92 and ~0.93, respectively).

Conclusions

This level of correlation indicates multicollinearity, which can make the coefficients of a linear model unstable.
Both Age and Years of Experience are highly relevant for explaining the variation in Salary, but in linear models one of them may be sufficient.
Tree-based algorithms (RandomForest, XGBoost) are less affected by this redundancy, but it is still worth keeping in mind.
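
One way to quantify the multicollinearity described above is the variance inflation factor (VIF). For two predictors with correlation r, VIF = 1 / (1 - r²), so r ≈ 0.98 gives a VIF of roughly 25, well above the rule-of-thumb thresholds of 5 or 10 that are often quoted. The sketch below is not part of the original notebook: the helper `vif_from_correlation` is a hypothetical name, and the data is synthetic, generated only to mimic a strong Age/Experience relationship.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(df: pd.DataFrame, cols: list) -> pd.Series:
    """Return the VIF of each column.

    The VIFs are the diagonal entries of the inverse of the
    predictors' correlation matrix.
    """
    corr = df[cols].corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=cols, name="VIF")


# Synthetic stand-in for df_clean (illustrative values, not the real data).
rng = np.random.default_rng(0)
age = rng.integers(23, 54, size=200).astype(float)
experience = age - 22 + rng.normal(0, 1.5, size=200)
demo = pd.DataFrame({"Age": age, "Years of Experience": experience})

# Only the predictors go in; Salary is the target and is left out.
vif = vif_from_correlation(demo, ["Age", "Years of Experience"])
print(vif)
```

On the real data the call would presumably be `vif_from_correlation(df_clean, ["Age", "Years of Experience"])`.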

In [None]:
def plot_boxplots(df: pd.DataFrame, numerical_cols: List[str], prefix: str = "boxplot") -> None:
    """Draw one boxplot per numerical column and save each figure to FIGURES_DIR."""
    for col in numerical_cols:
        plt.figure(figsize=(5, 4))
        sns.boxplot(x=df[col])
        plt.title(f"Boxplot for {col}")

        filename = f"{prefix}_{col}.png"
        output_path = os.path.join(FIGURES_DIR, filename)
        plt.savefig(output_path)
        print(f"Boxplot of {col} saved to: {output_path}")

plot_boxplots(df_clean, num_cols)

Description

Each boxplot shows the distribution of a variable and its possible outliers:
Age: Range 23–53, median near 36, no extreme outliers.
Years of Experience: Range 0–25, concentrated at 5–10 years, with occasional cases of 20+ years.
Salary: Range ~350–250,000, with a median of ~95,000 and high values exceeding 200,000.

Conclusions

Age shows no anomalous values.
Experience may show a marked gap between newcomers and very experienced employees.
Salary is skewed to the right; a few very high values could act as outliers in linear models.
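
The outliers mentioned above can be listed explicitly with the 1.5 × IQR rule, the same convention boxplot whiskers use by default. This is a minimal sketch, not taken from the original notebook: `iqr_outlier_mask` is a hypothetical helper, and the salary values are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up salaries (illustrative only, not the real dataset).
salaries = pd.Series(
    [350, 40000, 55000, 70000, 90000, 95000, 100000, 120000, 150000, 250000],
    name="Salary",
)

mask = iqr_outlier_mask(salaries)
print(salaries[mask])
```

On the real data, `df_clean[iqr_outlier_mask(df_clean["Salary"])]` would return the flagged rows. Note that with a right-skewed variable the lower fence can fall below zero, so an implausibly low value is not necessarily flagged by this rule.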

In [None]:
def plot_histograms(df: pd.DataFrame, numerical_cols: List[str], prefix: str = "histogram") -> None:
    """Draw one histogram with a KDE overlay per numerical column and save each figure to FIGURES_DIR."""
    for col in numerical_cols:
        plt.figure(figsize=(5, 4))
        sns.histplot(df[col], kde=True, color="teal", bins=20)
        plt.title(f"Distribution of {col}")

        filename = f"{prefix}_{col}.png"
        output_path = os.path.join(FIGURES_DIR, filename)
        plt.savefig(output_path)
        print(f"Histogram of {col} saved to: {output_path}")

plot_histograms(df_clean, num_cols)

Description

Age: Approximately normal, with the highest density between 30 and 40 years.
Years of Experience: More irregular distribution, with one peak in the first years (0–5) and another around ~15.
Salary: Marked right skew (long tail); the main group lies between 50,000 and 120,000, but salaries reach up to 250,000.

Conclusions

Because the Salary distribution is so asymmetric, a log transformation could help methods such as Linear Regression.
The non-uniform distribution of Experience supports the idea of creating “bins” (junior, semi-senior, senior, etc.).
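
Both ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's actual preprocessing: the data is made up, the column names `Salary_log` and `Experience_bin` are hypothetical, and the bin edges and the extra "expert" label are arbitrary choices that would need to be tuned against the real distribution.

```python
import numpy as np
import pandas as pd

# Made-up rows (illustrative only, not the real dataset).
demo = pd.DataFrame({
    "Years of Experience": [0, 2, 5, 9, 15, 25],
    "Salary": [35000.0, 50000.0, 80000.0, 110000.0, 160000.0, 250000.0],
})

# Log transform: log1p compresses the long right tail;
# np.expm1 inverts it when predictions need to be mapped back.
demo["Salary_log"] = np.log1p(demo["Salary"])

# Hypothetical bin edges, left-closed: [0, 3), [3, 8), [8, 15), [15, inf).
bins = [0, 3, 8, 15, np.inf]
labels = ["junior", "semi-senior", "senior", "expert"]
demo["Experience_bin"] = pd.cut(
    demo["Years of Experience"], bins=bins, labels=labels, right=False
)

print(demo)
```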

# # Categorical vs. Salary <a id="cat-vs-salary"></a>

In [None]:
def analyze_categorical_vs_salary(df: pd.DataFrame, cat_col: str, salary_col: str = "Salary") -> None:
    """Print the mean salary per category and save a violin plot of salary by category."""
    mean_salary_by_cat = df.groupby(cat_col)[salary_col].mean().sort_values(ascending=False)
    print(f"=== Mean {salary_col} by {cat_col} ===\n{mean_salary_by_cat}\n")

    plt.figure(figsize=(8, 4))
    sns.violinplot(data=df, x=cat_col, y=salary_col, palette="viridis")
    plt.title(f"{salary_col} distribution by {cat_col}")

    output_filename = f"violin_{cat_col}.png"
    output_path = os.path.join(FIGURES_DIR, output_filename)
    plt.savefig(output_path)
    print(f"Violin plot saved to: {output_path}")

analyze_categorical_vs_salary(df_clean, "Gender")
analyze_categorical_vs_salary(df_clean, "Education Level")

* Gender: The violin plot suggests differences in the median and extreme salary values (male vs. female vs. missing).
* Education Level: Salary tends to rise with academic level (PhD > Master’s > Bachelor’s), although the categories overlap.

Implications:
Both categorical variables (Gender, Education Level) could have predictive power.
The treatment of “Missing” matters for both variables; it is advisable to keep it as an additional category so that no records are lost.
One-Hot Encoding, or another encoding strategy that captures educational level and gender, is recommended for modeling.
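
The encoding recommendation above could look like the following sketch. It assumes missing categories arrive as nulls and relabels them as a literal "Missing" category before calling `pd.get_dummies`; the rows are made up, and the real `df_clean` may already carry such a label.

```python
import pandas as pd

# Made-up rows (illustrative only); None stands in for a missing category.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Education Level": ["PhD", "Master's", "Bachelor's", None],
    "Salary": [150000.0, 90000.0, 60000.0, 75000.0],
})

cat_cols = ["Gender", "Education Level"]

# Keep missing values as their own category instead of dropping the rows.
demo[cat_cols] = demo[cat_cols].fillna("Missing")

# One indicator column per category, e.g. "Gender_Missing".
encoded = pd.get_dummies(demo, columns=cat_cols, dtype=int)
print(encoded.columns.tolist())
```

For linear models, passing `drop_first=True` to `pd.get_dummies` avoids the redundant indicator column per variable; tree-based models generally do not need it.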

# # Scatter Experience vs. Age vs. Salary <a id="scatter"></a>

In [None]:
def scatter_experience_age_salary(df: pd.DataFrame,
                                  x_col: str = "Age",
                                  y_col: str = "Years of Experience",
                                  hue_col: Optional[str] = "Gender",
                                  size_col: Optional[str] = "Salary",
                                  filename: str = "scatter_experience_age_salary.png") -> None:
    """Scatter y_col against x_col, colored by hue_col and sized by size_col, and save the figure."""
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=df, x=x_col, y=y_col, hue=hue_col, size=size_col, sizes=(20, 200), alpha=0.7)
    plt.title(f"{y_col} vs {x_col} (hue={hue_col}, size={size_col})")

    output_path = os.path.join(FIGURES_DIR, filename)
    plt.savefig(output_path)
    print(f"Scatter plot saved to: {output_path}")

scatter_experience_age_salary(df_clean)

Description

The relationship is almost linear: the older the person, the more experience they have, which confirms the strong numerical correlation.
The large points (high salaries) usually lie in the region of greater experience and/or age, although there are exceptions.
The color shows that both men and women are spread along the entire diagonal, with notable variation in point size (Salary).

Conclusions

This is a clear indication of multicollinearity between Age and Years of Experience.
High salaries usually coincide with greater experience and age, but medium and high salaries also appear in the intermediate ranges.
For linear models, you could drop one of the two variables or derive a third one (for example, Age - Years of Experience) to capture the idea of “career start age”.
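
The derived feature suggested above is a one-line subtraction. In this sketch the column name `Career Start Age` is a hypothetical choice and the rows are made up. Because the new column is an exact linear combination of the other two, keeping all three in a linear model would make them perfectly collinear, so one of the originals is dropped.

```python
import pandas as pd

# Made-up rows (illustrative only, not the real dataset).
demo = pd.DataFrame({
    "Age": [25, 32, 45, 53],
    "Years of Experience": [1, 8, 20, 25],
})

# Approximate age at which each person started working.
demo["Career Start Age"] = demo["Age"] - demo["Years of Experience"]

# For a linear model, keep Experience and the derived feature and drop Age.
linear_features = demo.drop(columns=["Age"])
print(linear_features)
```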