## <b>4.3 DATA PREPROCESSING AND ANALYSIS</b>

### <b>4.3.2 Transformations and basic feature engineering</b>

#### <b>4.3.2.1 Initial preparation and handling of missing values</b>

In [4]:
# ============================================================
# Initial preparation and handling of missing values
# ============================================================
# Purpose of this script:
#   - Load the clean dataset (clean_motor_insurance.csv) as the starting
#     point for the basic feature-engineering phase.
#   - Check the integrity of the dataset:
#        * dimensions (rows, columns)
#        * data types
#        * duplicates
#        * missing values
#   - Handle missing values according to their nature:
#        * Structural nulls (Date_lapse):
#             - Create Has_lapse (1 if the policy has lapsed, 0 otherwise)
#             - Create Is_active (1 if active, 0 otherwise)
#             - Derive Policy_duration (lapsed policies only)
#        * Moderate nulls (Length, ~9.79%):
#             - Create Length_missing_flag (imputation indicator)
#             - Impute with the median per Type_risk segment
#               (falling back to the global median if a segment has no values)
#        * Low nulls (Type_fuel, ~1.67%):
#             - Impute with the category 'Unknown' (no one-hot encoding here)
#
# Note:
#   - This subsection does NOT yet create the final temporal features
#     (Driver_age, Vehicle_age, etc.) or apply advanced encoding.
#   - The result is a dataframe `df_prep` free of non-structural nulls,
#     with basic temporal-control and data-quality variables.
# ============================================================

import pandas as pd
import numpy as np

# ------------------------------------------------------------
# 1) Load the clean dataset
# ------------------------------------------------------------
CLEAN_PATH = "clean_motor_insurance.csv"  # Path to the previously cleaned file
# sep=None with engine="python" lets pandas sniff the delimiter
# (',' or ';', for example) instead of assuming a comma.
df_prep = pd.read_csv(CLEAN_PATH, sep=None, engine="python")
print("clean_motor_insurance.csv carregat correctament")
print("Dimensions:", df_prep.shape)  # (n_rows, n_columns)

# ------------------------------------------------------------
# 2) Basic integrity checks
# ------------------------------------------------------------
# 2.1 Exact duplicate rows
# .duplicated() marks a row True when it is identical to an earlier one;
# .sum() counts them.
n_dup = df_prep.duplicated().sum()
print("\nDuplicats exactes:", n_dup)
# 2.2 Initial missing-value table
# isna().sum() → number of nulls per column
missing_count = df_prep.isna().sum()
# Percentage of nulls per column
missing_pct = (missing_count / len(df_prep) * 100).round(2)
# Table with the count and percentage of nulls
missing_table = (
    pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})
      .sort_values("missing_count", ascending=False)  # most nulls first
)
print("\nTaula de nuls inicial (top 10):")
# display() renders DataFrames as formatted tables in Jupyter notebooks
display(missing_table.head(10))

# ------------------------------------------------------------
# 3) Handling structural nulls (Date_lapse)
# ------------------------------------------------------------
# Context: Date_lapse is NaN for policies that are still active.
# The NaN here is therefore not "missing data" but information:
# "no lapse date" → active policy.
# 3.1 Has_lapse: 1 if there is a lapse date (not NaN), 0 if NaN
df_prep["Has_lapse"] = np.where(df_prep["Date_lapse"].isna(), 0, 1)
# 3.2 Is_active: complement of Has_lapse
# Has_lapse = 1 (cancelled) → Is_active = 0
# Has_lapse = 0 (no lapse)  → Is_active = 1
df_prep["Is_active"] = 1 - df_prep["Has_lapse"]
# 3.3 Parse Date_lapse and Date_start_contract as datetime.
#     read_csv loads them as strings; errors="coerce" turns any value
#     that does not match the ISO format into NaT.
df_prep["Date_lapse"] = pd.to_datetime(
    df_prep["Date_lapse"], format="%Y-%m-%d", errors="coerce"
)
df_prep["Date_start_contract"] = pd.to_datetime(
    df_prep["Date_start_contract"], format="%Y-%m-%d", errors="coerce"
)
# 3.4 Policy_duration: only for ended contracts (Has_lapse = 1)
# Duration in years between the contract start and the lapse date.
# Without a lapse it stays NaN (structural null).
df_prep["Policy_duration"] = np.where(
    df_prep["Has_lapse"] == 1,
    (df_prep["Date_lapse"] - df_prep["Date_start_contract"]).dt.days / 365.25,
    np.nan
).round(2)  # round to 2 decimals
# Quick check of the lapse indicator
print("\nDistribució Has_lapse:")
display(df_prep["Has_lapse"].value_counts())

print(
    "Percentatge pòlisses cancel·lades:",
    round(df_prep["Has_lapse"].mean() * 100, 2), "%"
)

# ------------------------------------------------------------
# 4) Handling low nulls (Type_fuel)
# ------------------------------------------------------------
# By design: impute nulls with the category "Unknown".
# No one-hot encoding here; that is left to the later encoding phases.
if "Type_fuel" in df_prep.columns:
    # Count nulls before imputing
    n_fuel_nulls_before = df_prep["Type_fuel"].isna().sum()
    # Replace NaN with the category "Unknown"
    df_prep["Type_fuel"] = df_prep["Type_fuel"].fillna("Unknown")
    # Check that no nulls remain
    n_fuel_nulls_after = df_prep["Type_fuel"].isna().sum()
    print("\nType_fuel nuls abans:", n_fuel_nulls_before)
    print("Type_fuel nuls després:", n_fuel_nulls_after)
    # Share of each category, to see the weight of "Unknown"
    print("Distribució Type_fuel:")
    display(df_prep["Type_fuel"].value_counts(normalize=True).rename("pct"))

# ------------------------------------------------------------
# 5) Handling moderate nulls (Length)
# ------------------------------------------------------------
# Strategy:
#   - Create Length_missing_flag (1 if imputed, 0 if original).
#   - Impute Length nulls with the median of the row's Type_risk.
#   - If a Type_risk has no non-null values, use the global median.
if "Length" in df_prep.columns:
    # 5.1 Control flag: current Length nulls
    df_prep["Length_missing_flag"] = np.where(df_prep["Length"].isna(), 1, 0)
    n_len_nulls_before = df_prep["Length"].isna().sum()
    print("\nLength nuls abans:", n_len_nulls_before)
    # 5.2 Median per Type_risk segment
    # groupby("Type_risk") computes the median Length of each risk group
    median_by_risk = df_prep.groupby("Type_risk")["Length"].median()
    # Global median, used as fallback when a Type_risk has no values
    global_median = df_prep["Length"].median()
    # 5.3 Row-by-row segmented imputation
    def impute_length(row):
        # Impute only when the original value is NaN
        if pd.isna(row["Length"]):
            # look up the median of the matching segment
            seg_med = median_by_risk.get(row["Type_risk"], np.nan)
            # no segment median available: use the global one
            return seg_med if not pd.isna(seg_med) else global_median
        # value present: return it unchanged
        return row["Length"]
    # Apply the function to every row
    df_prep["Length"] = df_prep.apply(impute_length, axis=1)
    n_len_nulls_after = df_prep["Length"].isna().sum()
    print("Length nuls després:", n_len_nulls_after)
    # Descriptive statistics after imputation
    print("\nDescriptives Length després de la imputació:")
    display(df_prep["Length"].describe())
    print(
        "Percentatge imputat (Length_missing_flag=1):",
        round(df_prep["Length_missing_flag"].mean() * 100, 2), "%"
    )

# ------------------------------------------------------------
# 6) Final check of non-structural nulls
# ------------------------------------------------------------
# Recompute the nulls to verify that only the expected (structural) ones remain
missing_count_after = df_prep.isna().sum()
missing_pct_after = (missing_count_after / len(df_prep) * 100).round(2)
missing_table_after = (
    pd.DataFrame({
        "missing_count": missing_count_after,
        "missing_pct": missing_pct_after
    })
    .sort_values("missing_count", ascending=False)
)
print("\nTaula de nuls FINAL (top 10):")
display(missing_table_after.head(10))
# Variables where nulls are accepted because they are structural
structural_allowed = ["Date_lapse", "Policy_duration"]
# Drop the rows of the structural variables from the final null table
# so the check covers only the remaining variables
non_struct_nulls = missing_table_after.drop(index=structural_allowed, errors="ignore")
# Final validation:
# no nulls may remain in any variable that is NOT structural.
# If any do, raise an error so the problem is caught early.
assert non_struct_nulls["missing_count"].sum() == 0, \
       "Hi ha nuls no estructurals pendents després del tractament!"
print("\nValidació correcta: només queden nuls estructurals esperats.")
print("Dimensions finals df_prep:", df_prep.shape)


clean_motor_insurance.csv carregat correctament
Dimensions: (105555, 34)

Duplicats exactes: 0

Taula de nuls inicial (top 10):


| | missing_count | missing_pct |
|---|---|---|
| Date_lapse | 70408 | 66.7 |
| Length | 10329 | 9.79 |
| Type_fuel | 1764 | 1.67 |
| ID | 0 | 0.0 |
| Cylinder_capacity | 0 | 0.0 |
| Area | 0 | 0.0 |
| Second_driver | 0 | 0.0 |
| Year_matriculation | 0 | 0.0 |
| Power | 0 | 0.0 |
| N_doors | 0 | 0.0 |



Distribució Has_lapse:


Has_lapse
0    70408
1    35147
Name: count, dtype: int64

Percentatge pòlisses cancel·lades: 33.3 %

Type_fuel nuls abans: 1764
Type_fuel nuls després: 0
Distribució Type_fuel:


Type_fuel
D          0.615774
P          0.367515
Unknown    0.016712
Name: pct, dtype: float64


Length nuls abans: 10329
Length nuls després: 0

Descriptives Length després de la imputació:


count    105555.000000
mean          4.185750
std           0.455467
min           1.978000
25%           3.941000
50%           4.202000
75%           4.433000
max           8.218000
Name: Length, dtype: float64

Percentatge imputat (Length_missing_flag=1): 9.79 %

Taula de nuls FINAL (top 10):


| | missing_count | missing_pct |
|---|---|---|
| Policy_duration | 70408 | 66.7 |
| Date_lapse | 70408 | 66.7 |
| ID | 0 | 0.0 |
| Length | 0 | 0.0 |
| Year_matriculation | 0 | 0.0 |
| Power | 0 | 0.0 |
| Cylinder_capacity | 0 | 0.0 |
| Value_vehicle | 0 | 0.0 |
| N_doors | 0 | 0.0 |
| Type_fuel | 0 | 0.0 |



Validació correcta: només queden nuls estructurals esperats.
Dimensions finals df_prep: (105555, 38)
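
The segmented imputation in step 5 can also be written without a row-wise `apply`, which tends to be slow on a dataset of this size. The sketch below is an alternative formulation, not the code that produced the output above: it uses `groupby(...).transform("median")` to build a per-row segment median and chains two `fillna` calls for the global fallback. The `toy` dataframe and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame({
    "Type_risk": [1, 1, 1, 2, 2, 3],
    "Length": [4.0, np.nan, 5.0, 3.0, np.nan, np.nan],
})

# Record which rows are about to be imputed, before touching Length.
toy["Length_missing_flag"] = toy["Length"].isna().astype(int)

# Median of each row's own Type_risk segment, aligned to the original index.
# Segment 3 has no observed value, so its median is NaN.
segment_median = toy.groupby("Type_risk")["Length"].transform("median")

# Global median of the observed values, used as the fallback.
global_median = toy["Length"].median()

# First fill from the segment, then fill whatever is left from the global median.
toy["Length"] = toy["Length"].fillna(segment_median).fillna(global_median)

print(toy)
```

In this toy case the missing value in segment 1 becomes 4.5, the one in segment 2 becomes 3.0, and the row in segment 3 falls back to the global median, 4.0.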


#### <b>4.3.2.2 Derivation of temporal variables and chronological consistency</b>

In [6]:
# ============================================================
# Derivation of temporal variables and chronological consistency
# ============================================================
# This script derives and validates key temporal quantities for the
# motor insurance policies already preprocessed and loaded in df_prep.
# It applies no corrections: it only computes the derived variables
# and reports anomaly counts.
# ============================================================

import pandas as pd  # Tabular data manipulation and date handling
import numpy as np   # Numerical library, used for conditional columns

# Work on a copy so df_prep is left untouched
df_time = df_prep.copy()

# ------------------------------------------------------------
# 0) FORCE ALL DATE COLUMNS TO DATETIME
# ------------------------------------------------------------
# Ensure every date column is datetime64, even if an earlier step
# already converted it.

date_cols = [
    "Date_start_contract",   # Contract start
    "Date_last_renewal",     # Last renewal
    "Date_next_renewal",     # Next renewal
    "Date_birth",            # Driver's date of birth
    "Date_driving_licence",  # Date the driving licence was obtained
    "Date_lapse"             # Lapse/cancellation date (if any)
]

for c in date_cols:
    # The dates are ISO (YYYY-MM-DD), so we pass an explicit format
    # to avoid parsing warnings and guarantee consistent parsing.
    df_time[c] = pd.to_datetime(
        df_time[c],
        format="%Y-%m-%d",   # explicit ISO format
        errors="coerce"      # unparseable values become NaT
    )

print("Tipus després de la conversió forçada:")
print(df_time[date_cols].dtypes)

# ------------------------------------------------------------
# 1) DERIVED TEMPORAL VARIABLES
# ------------------------------------------------------------

# Driver age (years) at contract start
df_time["Driver_age"] = (
    df_time["Date_start_contract"] - df_time["Date_birth"]
).dt.days / 365.25

# Licence seniority (years) at the last renewal
df_time["Licence_age"] = (
    df_time["Date_last_renewal"] - df_time["Date_driving_licence"]
).dt.days / 365.25

# Policy seniority (years) up to the last renewal
df_time["Policy_age"] = (
    df_time["Date_last_renewal"] - df_time["Date_start_contract"]
).dt.days / 365.25

# Days from the last renewal to the next one
df_time["Days_to_next"] = (
    df_time["Date_next_renewal"] - df_time["Date_last_renewal"]
).dt.days

# Vehicle age in calendar years at contract start
# (contract start year minus registration year)
df_time["Vehicle_age"] = (
    df_time["Date_start_contract"].dt.year - df_time["Year_matriculation"]
)

# ------------------------------------------------------------
# 2) DETECT TEMPORAL ANOMALIES
# ------------------------------------------------------------
# Check for chronological inconsistencies and implausible cases.

anomalies = {
    "birth_after_licence": (df_time["Date_birth"] > df_time["Date_driving_licence"]).sum(),
    # Birth after the licence was obtained → impossible

    "licence_negative": (df_time["Licence_age"] < 0).sum(),
    # Negative licence seniority → renewal earlier than the licence date

    "policy_negative": (df_time["Policy_age"] < 0).sum(),
    # Renewal earlier than the contract start → serious inconsistency

    "next_before_last": (df_time["Days_to_next"] < 0).sum(),
    # Next renewal earlier than the last one → chronology error

    "lapse_before_start": (
        df_time["Date_lapse"].notna() &
        (df_time["Date_lapse"] < df_time["Date_start_contract"])
    ).sum(),
    # Lapse before the contract start → inconsistent for a real cancellation

    "lapse_before_last": (
        df_time["Date_lapse"].notna() &
        (df_time["Date_lapse"] < df_time["Date_last_renewal"])
    ).sum()
    # Lapse before the last renewal → to be reviewed against the business definition
}

print("\n--- ANOMALIES DETECTADES ---")
for k, v in anomalies.items():
    print(f"{k}: {v}")

# ------------------------------------------------------------
# 3) DESCRIPTIVE STATISTICS OF THE TEMPORAL VARIABLES
# ------------------------------------------------------------
# Review the basic distributions to inspect ranges and spot extreme values.

temp_vars = ["Driver_age", "Licence_age", "Policy_age", "Vehicle_age", "Days_to_next"]

print("\n--- DESCRIPTIVES VARIABLES TEMPORALS ---")
display(df_time[temp_vars].describe().T)

# ------------------------------------------------------------
# 4) FINAL PREVIEW
# ------------------------------------------------------------
# Visual inspection of the first rows with all dates and derived variables.

print("\nMostra de registres amb variables temporals derivades:")
display(df_time[[
    "Date_start_contract", "Date_last_renewal", "Date_next_renewal",
    "Date_birth", "Date_driving_licence", "Date_lapse",
    "Driver_age", "Licence_age", "Policy_age", "Vehicle_age", "Days_to_next"
]].head())

Tipus després de la conversió forçada:
Date_start_contract     datetime64[ns]
Date_last_renewal       datetime64[ns]
Date_next_renewal       datetime64[ns]
Date_birth              datetime64[ns]
Date_driving_licence    datetime64[ns]
Date_lapse              datetime64[ns]
dtype: object

--- ANOMALIES DETECTADES ---
birth_after_licence: 0
licence_negative: 31
policy_negative: 68
next_before_last: 0
lapse_before_start: 0
lapse_before_last: 404

--- DESCRIPTIVES VARIABLES TEMPORALS ---


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Driver_age | 105555.0 | 44.34442 | 12.765212 | 18.015058 | 34.234086 | 43.972621 | 53.73306 | 91.52909 |
| Licence_age | 105555.0 | 24.765484 | 12.486053 | -2.2423 | 14.410678 | 23.854894 | 34.171116 | 74.028747 |
| Policy_age | 105555.0 | 2.981346 | 3.923427 | -0.911704 | 0.0 | 1.002053 | 4.0 | 37.998631 |
| Vehicle_age | 105555.0 | 9.244887 | 7.148871 | 0.0 | 4.0 | 9.0 | 13.0 | 68.0 |
| Days_to_next | 105555.0 | 365.08802 | 0.283326 | 365.0 | 365.0 | 365.0 | 365.0 | 366.0 |



Mostra de registres amb variables temporals derivades:


| | Date_start_contract | Date_last_renewal | Date_next_renewal | Date_birth | Date_driving_licence | Date_lapse | Driver_age | Licence_age | Policy_age | Vehicle_age | Days_to_next |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-11-05 | 2015-11-05 | 2016-11-05 | 1956-04-15 | 1976-03-20 | NaT | 59.556468 | 39.627652 | 0.0 | 11 | 366 |
| 1 | 2015-11-05 | 2016-11-05 | 2017-11-05 | 1956-04-15 | 1976-03-20 | NaT | 59.556468 | 40.629706 | 1.002053 | 11 | 365 |
| 2 | 2015-11-05 | 2017-11-05 | 2018-11-05 | 1956-04-15 | 1976-03-20 | NaT | 59.556468 | 41.629021 | 2.001369 | 11 | 365 |
| 3 | 2015-11-05 | 2018-11-05 | 2019-11-05 | 1956-04-15 | 1976-03-20 | NaT | 59.556468 | 42.628337 | 3.000684 | 11 | 365 |
| 4 | 2017-09-26 | 2017-09-26 | 2018-09-26 | 1956-04-15 | 1976-03-20 | NaT | 61.448323 | 41.519507 | 0.0 | 13 | 365 |
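
The anomaly checks in step 2 all share one pattern: a date that should come earlier must not fall after a date that should come later. As a rough sketch of that pattern, the hypothetical helper `count_order_violations` below takes a list of (earlier, later) column pairs and counts the rows that break each ordering. The helper name and the toy rows are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def count_order_violations(df, ordered_pairs):
    """Count rows where the 'earlier' date falls after the 'later' date.

    Comparisons involving NaT evaluate to False, so rows with a missing
    date (for example, active policies without Date_lapse) are not counted.
    """
    return {
        f"{earlier}_after_{later}": int((df[earlier] > df[later]).sum())
        for earlier, later in ordered_pairs
    }

# Toy rows (invented): row 1 starts after its last renewal,
# row 2 lapses before its last renewal.
toy = pd.DataFrame({
    "Date_start_contract": pd.to_datetime(["2015-11-05", "2018-01-15", "2016-04-13"]),
    "Date_last_renewal": pd.to_datetime(["2016-11-05", "2017-06-01", "2017-04-13"]),
    "Date_lapse": pd.to_datetime([None, None, "2017-01-10"]),
})

pairs = [
    ("Date_start_contract", "Date_last_renewal"),
    ("Date_last_renewal", "Date_lapse"),
    ("Date_start_contract", "Date_lapse"),
]

violations = count_order_violations(toy, pairs)
for name, count in violations.items():
    print(f"{name}: {count}")
```

Because NaT never compares as greater or smaller, the explicit `notna()` guard used in the notebook is not strictly needed for the count, although it makes the intent clearer to a reader.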


#### <b>4.3.2.3 Handling temporal anomalies and inconsistencies</b>

In [8]:
# ============================================================
# Handling temporal anomalies and inconsistencies
# ============================================================
# Purpose:
#   - Detect and clean up the residual temporal inconsistencies identified
#     in the previous phase (df_time).
#   - Create a traceability flag for each type of anomaly, so that the
#     corrected records can be identified later.
#   - Apply conservative corrections (without inventing new information),
#     only "clipping" dates so they no longer produce impossible values:
#        * Licence_age < 0:
#             -> set Date_driving_licence = Date_last_renewal
#        * Policy_age  < 0:
#             -> set Date_start_contract = Date_last_renewal
#        * Date_lapse < Date_last_renewal (lapsed policies):
#             -> set Date_lapse = Date_last_renewal
#   - Recompute the derived temporal variables affected by the date changes.
#   - Validate that no negative seniorities or lapse inconsistencies remain.
#
# Input:
#   df_time = dataset with derived temporal variables and detected anomalies
#
# Output:
#   df_time_corr = temporally consistent dataset + control flags
# ============================================================

# Work on a copy so df_time is left untouched
df_time_corr = df_time.copy()

# ------------------------------------------------------------
# 1) Traceability flags (before correcting)
# ------------------------------------------------------------
# Here we only mark which records are inconsistent, without touching data.
# This makes it possible to:
#   - Analyse how many and which cases were problematic.
#   - Keep a trace of what is corrected afterwards.
# 1.1 Inconsistent licence: Licence_age < 0
#     (equivalent to Date_driving_licence later than Date_last_renewal)
df_time_corr["Licence_incoherent_flag"] = np.where(
    df_time_corr["Licence_age"] < 0,  # inconsistency condition
    1,                                 # value if the condition holds
    0                                  # value otherwise
)
# 1.2 Inconsistent policy: Policy_age < 0
#     (equivalent to Date_start_contract later than Date_last_renewal)
df_time_corr["Policy_incoherent_flag"] = np.where(
    df_time_corr["Policy_age"] < 0,
    1,
    0
)
# 1.3 Inconsistent lapse: Date_lapse before Date_last_renewal, only for
#     records that have a lapse (Date_lapse is not NaT).
df_time_corr["Lapse_incoherent_flag"] = np.where(
    (df_time_corr["Date_lapse"].notna()) &  # there is a lapse
    (df_time_corr["Date_lapse"] < df_time_corr["Date_last_renewal"]),  # before last_renewal
    1,
    0
)
print("Flags d’incoherència (abans de corregir):")
# Sum each flag to see how many cases of each type there are
display(df_time_corr[
    ["Licence_incoherent_flag", "Policy_incoherent_flag", "Lapse_incoherent_flag"]
].sum())

# ------------------------------------------------------------
# 2) Conservative date corrections
# ------------------------------------------------------------
# With the flags in place, apply minimal corrections that remove
# impossible temporal values without inventing new data.
# Inconsistent dates are always moved to Date_last_renewal.
# 2.1 Negative Licence_age:
#     Licence_incoherent_flag = 1 means Licence_age < 0.
#     Set Date_driving_licence = Date_last_renewal
#     ⇒ after the correction, Licence_age = 0 years.
mask_lic = df_time_corr["Licence_incoherent_flag"] == 1
df_time_corr.loc[mask_lic, "Date_driving_licence"] = df_time_corr.loc[mask_lic, "Date_last_renewal"]
# 2.2 Negative Policy_age:
#     Policy_incoherent_flag = 1 means Policy_age < 0.
#     Set Date_start_contract = Date_last_renewal
#     ⇒ Policy_age becomes 0 years.
mask_pol = df_time_corr["Policy_incoherent_flag"] == 1
df_time_corr.loc[mask_pol, "Date_start_contract"] = df_time_corr.loc[mask_pol, "Date_last_renewal"]
# 2.3 Inconsistent lapse:
#     Lapse_incoherent_flag = 1 means Date_lapse < Date_last_renewal.
#     Set Date_lapse = Date_last_renewal so that the lapse date
#     is never earlier than the last renewal.
mask_lap = df_time_corr["Lapse_incoherent_flag"] == 1
df_time_corr.loc[mask_lap, "Date_lapse"] = df_time_corr.loc[mask_lap, "Date_last_renewal"]

# ------------------------------------------------------------
# 3) Recompute the derived temporal variables after correcting
# ------------------------------------------------------------
# Once dates have changed, every derived temporal quantity that
# depends on them must be recomputed.

# Driver age at contract start
df_time_corr["Driver_age"] = (
    (df_time_corr["Date_start_contract"] - df_time_corr["Date_birth"]).dt.days / 365.25
)
# Licence seniority at the last renewal
df_time_corr["Licence_age"] = (
    (df_time_corr["Date_last_renewal"] - df_time_corr["Date_driving_licence"]).dt.days / 365.25
)
# Policy seniority at the last renewal
df_time_corr["Policy_age"] = (
    (df_time_corr["Date_last_renewal"] - df_time_corr["Date_start_contract"]).dt.days / 365.25
)
# Days from the last renewal to the next one
df_time_corr["Days_to_next"] = (
    (df_time_corr["Date_next_renewal"] - df_time_corr["Date_last_renewal"]).dt.days
)
# Vehicle age at contract start
df_time_corr["Vehicle_age"] = (
    df_time_corr["Date_start_contract"].dt.year - df_time_corr["Year_matriculation"]
)
# Policy_duration depends on Date_start_contract and Date_lapse,
# so it is also recomputed from the corrected dates:
df_time_corr["Policy_duration"] = np.where(
    df_time_corr["Has_lapse"] == 1,  # only meaningful for cancelled policies
    (df_time_corr["Date_lapse"] - df_time_corr["Date_start_contract"]).dt.days / 365.25,
    np.nan  # otherwise keep NaN as a structural null
).round(2)

# ------------------------------------------------------------
# 4) Post-correction validation
# ------------------------------------------------------------
# Recompute the indicators to verify that the corrections removed
# the critical temporal inconsistencies.
# Records still showing Licence_age < 0
lic_neg_after = (df_time_corr["Licence_age"] < 0).sum()
# Records still showing Policy_age < 0
pol_neg_after = (df_time_corr["Policy_age"] < 0).sum()
# Lapses still inconsistent (earlier than Date_last_renewal)
lap_incoh_after = (
    (df_time_corr["Date_lapse"].notna()) &
    (df_time_corr["Date_lapse"] < df_time_corr["Date_last_renewal"])
).sum()
print("\n--- VALIDACIÓ POST-CORRECCIÓ ---")
print("Licence_age < 0 :", lic_neg_after)
print("Policy_age  < 0 :", pol_neg_after)
print("Lapse incoherent :", lap_incoh_after)

# ------------------------------------------------------------
# 5) Sample of corrected records (optional)
# ------------------------------------------------------------
# To review manually what was changed, show a sample of rows
# that had some inconsistency (mask_lic, mask_pol or mask_lap).
print("\nMostra de registres corregits (Licence/Policy/Lapse):")
display(df_time_corr.loc[
    mask_lic | mask_pol | mask_lap,
    [
        "ID",                     # policy/record identifier
        "Date_start_contract",    # start date (may have been corrected)
        "Date_last_renewal",      # last renewal (reference for the corrections)
        "Date_lapse",             # lapse date (may have been corrected)
        "Date_driving_licence",   # licence date (may have been corrected)
        "Driver_age",             # derived variables after correction
        "Licence_age",
        "Policy_age",
        "Licence_incoherent_flag",  # flags showing what was wrong
        "Policy_incoherent_flag",
        "Lapse_incoherent_flag"
    ]
].head())

# df_time_corr is now ready for:
#   - basic categorical encoding
#   - derived business variables
#   - final export (e.g. transformed_motor_insurance.csv)


Flags d’incoherència (abans de corregir):


Licence_incoherent_flag     31
Policy_incoherent_flag      68
Lapse_incoherent_flag      404
dtype: int64


--- VALIDACIÓ POST-CORRECCIÓ ---
Licence_age < 0 : 0
Policy_age  < 0 : 0
Lapse incoherent : 0

Mostra de registres corregits (Licence/Policy/Lapse):


| | ID | Date_start_contract | Date_last_renewal | Date_lapse | Date_driving_licence | Driver_age | Licence_age | Policy_age | Licence_incoherent_flag | Policy_incoherent_flag | Lapse_incoherent_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 268 | 136 | 2016-04-13 | 2017-04-13 | 2017-04-13 | 1990-04-26 | 45.147159 | 26.965092 | 0.999316 | 0 | 0 | 1 |
| 350 | 180 | 2015-09-15 | 2016-09-15 | 2016-09-15 | 1986-05-30 | 48.227242 | 30.297057 | 1.002053 | 0 | 0 | 1 |
| 371 | 192 | 2011-06-17 | 2016-06-17 | 2016-06-17 | 1994-12-01 | 53.127995 | 21.544148 | 5.002053 | 0 | 0 | 1 |
| 1454 | 726 | 2007-06-30 | 2017-06-30 | 2017-06-30 | 1990-01-01 | 53.801506 | 27.493498 | 10.001369 | 0 | 0 | 1 |
| 1729 | 861 | 2018-01-15 | 2018-01-15 | NaT | 2007-07-11 | 47.047228 | 10.516085 | 0.0 | 0 | 1 | 0 |
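
The three corrections follow the same flag-then-clip sequence: mark the inconsistent rows first, overwrite only those rows with the reference date, then re-check. A minimal sketch of that sequence for the lapse case is shown below on invented rows. It uses `Series.mask` rather than the `.loc` assignment in the notebook, which is a stylistic alternative and not the method used above.

```python
import pandas as pd

# Toy rows (invented): row 0 lapses before its last renewal,
# row 1 is consistent, row 2 has no lapse at all.
toy = pd.DataFrame({
    "Date_last_renewal": pd.to_datetime(["2017-04-13", "2016-09-15", "2018-01-15"]),
    "Date_lapse": pd.to_datetime(["2017-01-10", "2016-12-01", None]),
})

# 1) Flag first, so the original problem stays traceable after the fix.
incoherent = toy["Date_lapse"].notna() & (
    toy["Date_lapse"] < toy["Date_last_renewal"]
)
toy["Lapse_incoherent_flag"] = incoherent.astype(int)

# 2) Correct only the flagged rows: mask() replaces values where the
#    condition is True and leaves every other row unchanged.
toy["Date_lapse"] = toy["Date_lapse"].mask(incoherent, toy["Date_last_renewal"])

# 3) Re-check with the same condition; nothing should remain.
remaining = (
    toy["Date_lapse"].notna() & (toy["Date_lapse"] < toy["Date_last_renewal"])
).sum()

print(toy)
print("Remaining inconsistent lapses:", remaining)
```

Creating the flag before the overwrite matters: once the date is clipped, the condition no longer holds and the affected rows could not be recovered from the data alone.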


#### <b>4.3.2.4 Derived business variables</b>

In [10]:
# ============================================================
# Derived business variables
# ============================================================
# Purpose:
#   - Create attributes oriented to the actuarial and economic perspective.
#   - Add binary indicators of annual and historical claims.
#   - Compute the cost/premium ratio, a key measure of technical profitability.
#   - Validate the distributions and basic consistency of the new variables.
#
# Expected input:
#   df_time_corr → dataset after correcting temporal inconsistencies.
#
# Output:
#   df_business → dataset extended with derived business variables.
# ============================================================

# Work on a copy so df_time_corr is not modified directly
df_business = df_time_corr.copy()

# ------------------------------------------------------------
# 1) Annual claim indicator
# ------------------------------------------------------------
# From the number of claims in the year (`N_claims_year`),
# create a binary variable:
#   - 1 if there is at least one claim (N_claims_year > 0)
#   - 0 otherwise (N_claims_year = 0; a NaN also yields 0,
#     because NaN > 0 evaluates to False)
df_business["Has_claims_year"] = (df_business["N_claims_year"] > 0).astype(int)

# ------------------------------------------------------------
# 2) Historical claims indicator
# ------------------------------------------------------------
# Analogously, create a historical indicator from `N_claims_history`:
#   - 1 if there has been any claim over the policy history
#   - 0 if no claims are recorded in the history
df_business["Has_claims_history"] = (df_business["N_claims_history"] > 0).astype(int)

# ------------------------------------------------------------
# 3) Cost/premium ratio
# ------------------------------------------------------------
# A key economic ratio: annual claims cost / annual premium.
#   - > 1: the claims cost exceeds the premium charged (loss-making policy).
#   - < 1: positive technical margin (the premium covers the claims).
#   - If Premium = 0 the division yields inf (or NaN for 0/0);
#     inf is converted to NaN here.
df_business["Claims_to_premium_ratio"] = (
    df_business["Cost_claims_year"] / df_business["Premium"]
).replace([np.inf, -np.inf], np.nan)  # avoid infinities when Premium = 0

# ------------------------------------------------------------
# 4) Basic validation of the new variables
# ------------------------------------------------------------

print("Distribució Has_claims_year:")
# value_counts() shows how many 0s and 1s there are (not normalised)
print(df_business["Has_claims_year"].value_counts())
print("\nDistribució Has_claims_history:")
print(df_business["Has_claims_history"].value_counts())
print("\nDescriptives Claims_to_premium_ratio:")
# .describe() gives count, mean, std, min, quartiles and max
print(df_business["Claims_to_premium_ratio"].describe())

# ------------------------------------------------------------
# 5) Sample of records with the derived business variables
# ------------------------------------------------------------
# Show a few rows with the original claims variables, the premium
# and the new derived variables for a visual inspection.
print("\nMostra de registres amb variables de negoci derivades:")
print(df_business[[
    "N_claims_year", "Has_claims_year",
    "N_claims_history", "Has_claims_history",
    "Premium", "Cost_claims_year", "Claims_to_premium_ratio"
]].head())


Distribució Has_claims_year:
Has_claims_year
0    85909
1    19646
Name: count, dtype: int64

Distribució Has_claims_history:
Has_claims_history
1    68959
0    36596
Name: count, dtype: int64

Descriptives Claims_to_premium_ratio:
count    105555.000000
mean          0.476172
std           4.720622
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max        1071.248039
Name: Claims_to_premium_ratio, dtype: float64

Mostra de registres amb variables de negoci derivades:
   N_claims_year  Has_claims_year  N_claims_history  Has_claims_history  \
0              0                0                 0                   0   
1              0                0                 0                   0   
2              0                0                 0                   0   
3              0                0                 0                   0   
4              0                0                 0                   0   

   Premium  Cost_claims_year 
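
The cost/premium ratio has two edge cases when `Premium` is zero: a positive cost divided by zero gives `inf`, while zero divided by zero gives `NaN` directly. The sketch below, on invented values, shows both cases and adds a hypothetical `Zero_premium_flag` column (not part of the notebook) so that such rows stay identifiable after the conversion to `NaN`.

```python
import numpy as np
import pandas as pd

# Toy rows (invented): the last two have a zero premium.
toy = pd.DataFrame({
    "Premium": [300.0, 250.0, 0.0, 0.0],
    "Cost_claims_year": [0.0, 500.0, 120.0, 0.0],
})

# Raw ratio: 120 / 0 gives inf, 0 / 0 gives NaN.
raw_ratio = toy["Cost_claims_year"] / toy["Premium"]

# Same treatment as in the notebook: infinities become NaN.
toy["Claims_to_premium_ratio"] = raw_ratio.replace([np.inf, -np.inf], np.nan)

# Hypothetical control flag, kept so the undefined ratios can be traced.
toy["Zero_premium_flag"] = (toy["Premium"] == 0).astype(int)

print(toy)
```

In the run shown above, `describe()` reports a count of 105555 for `Claims_to_premium_ratio`, so no row produced a `NaN`. If zero premiums did appear, the resulting `NaN` values would count as non-structural nulls in the final validation unless the column were handled explicitly.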

#### <b>4.3.2.5 Final validation, export and feature map</b>

In [12]:
# ============================================================
# Final validation, export and feature map
# ============================================================
# Input dataset:
#   df_business → final dataset with:
#       - structural nulls only in Date_lapse / Policy_duration
#       - derived temporal variables
#       - corrected inconsistencies + control flags
#       - derived business variables
#
# Outputs:
#   * transformed_motor_insurance.csv
#   * schema_after_4_3_2.csv
#   * Internal feature-map dictionary (feature_map)
# ============================================================

# Work on a copy so df_business is not modified directly
df_final = df_business.copy()

# ------------------------------------------------------------
# 1) FINAL NULL VALIDATION
# ------------------------------------------------------------
# Count the nulls per variable in the final dataset.
missing_table_final = df_final.isna().sum().to_frame("missing_count")
missing_table_final["missing_pct"] = (
    missing_table_final["missing_count"] / len(df_final) * 100
).round(2)
print("\n--- Nuls finals (top 10) ---")
# Show the 10 variables with the most nulls (for visual inspection)
display(missing_table_final.sort_values("missing_count", ascending=False).head(10))
# Structural nulls allowed by business definition:
#   - Date_lapse: only has a value when there is a cancellation (Has_lapse = 1).
#   - Policy_duration: can only be computed when Has_lapse = 1.
structural_allowed = ["Date_lapse", "Policy_duration"]
# Drop the structural variables from the table to analyse only the rest
non_structural_nulls = missing_table_final.drop(index=structural_allowed, errors="ignore")
# Strict validation:
# the sum of nulls over all other variables must be 0.
# Otherwise the assert raises an AssertionError and stops the pipeline.
assert non_structural_nulls["missing_count"].sum() == 0, \
       "Encara hi ha nuls no estructurals!"
print("\nValidació correcta: només queden nuls estructurals esperats.")

# ------------------------------------------------------------
# 2) LOGICAL ORDERING OF VARIABLES BY FAMILY
# ------------------------------------------------------------
# Define the column order by conceptual family
# to ease reading, debugging, and later modelling.
# Unique record/policy identifier
cols_id = ["ID"]
# "Raw" / original business dates
cols_dates = [
    "Date_start_contract", "Date_last_renewal", "Date_next_renewal",
    "Date_birth", "Date_driving_licence", "Date_lapse"
]
# Derived temporal variables (ages, durations, etc.)
cols_temporal_der = [
    "Driver_age", "Licence_age", "Vehicle_age", "Policy_age",
    "Days_to_next", "Policy_duration"
]
# Quality flags and flags for corrected inconsistencies
cols_flags = [
    "Has_lapse",
    "Licence_incoherent_flag", "Policy_incoherent_flag", "Lapse_incoherent_flag",
    "Length_missing_flag"
]
# Physical / technical characteristics of the vehicle
cols_vehicle = [
    "Year_matriculation", "Power", "Cylinder_capacity",
    "Value_vehicle", "N_doors", "Length", "Weight"
]
# Risk and commercial-profile variables
cols_risk = [
    "Distribution_channel", "Seniority", "Policies_in_force",
    "Max_policies", "Max_products", "Type_risk", "Area", "Second_driver"
]
# Business and claims-experience variables
cols_business = [
    "Premium", "N_claims_year", "Cost_claims_year",
    "N_claims_history", "R_Claims_history",
    "Has_claims_year", "Has_claims_history", "Claims_to_premium_ratio",
    "Payment", "Lapse", "Type_fuel"
]
# Concatenate all columns in the desired order
all_cols = (
    cols_id
    + cols_dates
    + cols_temporal_der
    + cols_flags
    + cols_vehicle
    + cols_risk
    + cols_business
)
# Before reordering, check that all of these columns exist in df_final.
# This helps catch misspelled names or earlier steps that failed.
missing_in_final = [c for c in all_cols if c not in df_final.columns]
if missing_in_final:
    # Stop with an explicit error listing the missing columns
    raise ValueError(f"Columns expected but missing from df_final: {missing_in_final}")
# Reorder the columns following the structure defined above.
# Note: selecting with all_cols also drops any column that is not listed there.
df_final = df_final[all_cols]
print("\nVariable reordering completed.")
print("Dimensions df_final:", df_final.shape)

# ------------------------------------------------------------
# 3) GENERATING THE FINAL SCHEMA (schema_after_4_3_2.csv)
# ------------------------------------------------------------
# Build a simple schema (data dictionary) with:
#   - the name of each variable
#   - its pandas data type (dtype)
schema = pd.DataFrame({
    "variable": df_final.columns,
    "dtype": df_final.dtypes.astype(str)
})
# Save the schema to CSV for documentation and traceability
schema.to_csv("schema_after_4_3_2.csv", index=False, encoding="utf-8")
print("\n📄 schema_after_4_3_2.csv generated successfully.")
display(schema.head(10))

# ------------------------------------------------------------
# 4) FEATURE MAP BY MODEL FAMILY
# ------------------------------------------------------------
# Explicitly define which variables each model will use
# (frequency, severity, etc.).
# This map serves as an internal dictionary or configuration.
feature_map = {
    # FREQUENCY (classification of Has_claims_year):
    # Variables that explain the probability of at least one claim in the year.
    # Note: Policy_duration is null for policies without a lapse (66.7% of
    # rows in the final null table), so those missing values must be handled
    # before or inside the model.
    "freq_model_features": [
        "Driver_age", "Licence_age", "Vehicle_age", "Has_lapse",
        "Policy_duration", "Second_driver", "Area", "Type_risk",
        "Type_fuel", "Has_claims_history", "Value_vehicle", "Power",
        "Premium", "Seniority", "Policies_in_force", "Distribution_channel"
    ],
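    # SEVERITY (regression on Cost_claims_year, conditional on having a claim):
    # Variables geared towards quantifying the cost of claims.
    # Caution: if Claims_to_premium_ratio is computed from Cost_claims_year,
    # using it as a predictor here would leak the target.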
    "sev_model_features": [
        "Vehicle_age", "Value_vehicle", "Power", "Type_risk",
        "Area", "Policy_duration", "Claims_to_premium_ratio",
        "Weight", "Cylinder_capacity", "Length"
    ],
    # Quality/traceability flags (they do not necessarily enter the model,
    # but are useful for control and auditing).
    "flags_features": cols_flags,
    # Target variables for the models:
    #   - Frequency: Has_claims_year (binary)
    #   - Severity: Cost_claims_year (continuous)
    "targets": ["Has_claims_year", "Cost_claims_year"]
}
print("\n--- Feature map ---")
for k, v in feature_map.items():
    print(f"{k}: {v}")
# Convert the map to a DataFrame for easier viewing
mapa_features_df = pd.DataFrame(
    [(grp, var) for grp, vars_ in feature_map.items() for var in vars_],
    columns=["feature_group", "variable"]
)
display(mapa_features_df.head(20))

# ------------------------------------------------------------
# 5) EXPORTING THE FINAL DATASET
# ------------------------------------------------------------
# Export the final dataset, ready for modelling or for use by other tools.
df_final.to_csv("transformed_motor_insurance.csv", index=False, encoding="utf-8")
print("\n============================================================")
print("EXPORT COMPLETED")
print("Files generated:")
print(" - transformed_motor_insurance.csv")
print(" - schema_after_4_3_2.csv")
print("============================================================")



--- Final nulls (top 10) ---


                 missing_count  missing_pct
Date_lapse               70408         66.7
Policy_duration          70408         66.7
ID                           0          0.0
Value_vehicle                0          0.0
N_doors                      0          0.0
Type_fuel                    0          0.0
Length                       0          0.0
Weight                       0          0.0
Driver_age                   0          0.0
Licence_age                  0          0.0



Validation passed: only the expected structural nulls remain.

Variable reordering completed.
Dimensions df_final: (105555, 44)

📄 schema_after_4_3_2.csv generated successfully.


                                  variable           dtype
ID                                      ID           int64
Date_start_contract    Date_start_contract  datetime64[ns]
Date_last_renewal        Date_last_renewal  datetime64[ns]
Date_next_renewal        Date_next_renewal  datetime64[ns]
Date_birth                      Date_birth  datetime64[ns]
Date_driving_licence  Date_driving_licence  datetime64[ns]
Date_lapse                      Date_lapse  datetime64[ns]
Driver_age                      Driver_age         float64
Licence_age                    Licence_age         float64
Vehicle_age                    Vehicle_age           int64



--- Feature map ---
freq_model_features: ['Driver_age', 'Licence_age', 'Vehicle_age', 'Has_lapse', 'Policy_duration', 'Second_driver', 'Area', 'Type_risk', 'Type_fuel', 'Has_claims_history', 'Value_vehicle', 'Power', 'Premium', 'Seniority', 'Policies_in_force', 'Distribution_channel']
sev_model_features: ['Vehicle_age', 'Value_vehicle', 'Power', 'Type_risk', 'Area', 'Policy_duration', 'Claims_to_premium_ratio', 'Weight', 'Cylinder_capacity', 'Length']
flags_features: ['Has_lapse', 'Licence_incoherent_flag', 'Policy_incoherent_flag', 'Lapse_incoherent_flag', 'Length_missing_flag']
targets: ['Has_claims_year', 'Cost_claims_year']


         feature_group            variable
0  freq_model_features          Driver_age
1  freq_model_features         Licence_age
2  freq_model_features         Vehicle_age
3  freq_model_features           Has_lapse
4  freq_model_features     Policy_duration
5  freq_model_features       Second_driver
6  freq_model_features                Area
7  freq_model_features           Type_risk
8  freq_model_features           Type_fuel
9  freq_model_features  Has_claims_history



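============================================================
EXPORT COMPLETED
Files generated:
 - transformed_motor_insurance.csv
 - schema_after_4_3_2.csv
============================================================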

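CSV files do not store dtypes, so the datetime columns listed in the schema come back as plain text when the exported dataset is read again. The following is a minimal sketch of how the exported schema could be used to restore them on reload. It uses a small made-up frame and an in-memory buffer instead of the real files; the toy values and the `date_cols` name are illustrative assumptions.

```python
import io

import pandas as pd

# Toy stand-in for df_final (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [1, 2],
    "Date_start_contract": pd.to_datetime(["2015-11-05", "2017-09-26"]),
    "Date_lapse": pd.to_datetime([None, "2018-05-01"]),
    "Premium": [222.52, 213.78],
})

# Same structure as the exported schema: one row per variable with its dtype
schema = pd.DataFrame({
    "variable": toy.columns,
    "dtype": toy.dtypes.astype(str).to_numpy(),
})

# Write to an in-memory buffer (a stand-in for the CSV on disk)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Columns the schema records as datetimes are parsed as dates on reload
is_date = schema["dtype"].str.startswith("datetime64")
date_cols = schema.loc[is_date, "variable"].tolist()
reloaded = pd.read_csv(buffer, parse_dates=date_cols)

print(reloaded.dtypes)
```

Without `parse_dates`, the two date columns would be loaded as text, and any later date arithmetic on them would fail or need an explicit `pd.to_datetime` call.
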

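The feature map is a plain dictionary, so nothing prevents a list from naming a column that does not exist or one that is computed from a target. The sketch below shows one possible audit. The `audit_feature_map` helper, the shortened toy map, and the `derived_from_target` mapping are hypothetical, and the statement that `Claims_to_premium_ratio` is built from `Cost_claims_year` is an assumption based on its name.

```python
def audit_feature_map(feature_map, columns, derived_from_target):
    """Report features missing from `columns` and features tied to a target."""
    columns = set(columns)
    targets = set(feature_map["targets"])
    report = {"missing": [], "leaky": []}
    for group, features in feature_map.items():
        if group == "targets":
            continue
        for feature in features:
            if feature not in columns:
                report["missing"].append((group, feature))
            # A feature is suspicious if it is a target or is derived from one
            if feature in targets or derived_from_target.get(feature) in targets:
                report["leaky"].append((group, feature))
    return report


# Shortened toy version of the feature map (illustrative only)
toy_feature_map = {
    "freq_model_features": ["Driver_age", "Premium", "Has_claims_history"],
    "sev_model_features": ["Vehicle_age", "Claims_to_premium_ratio"],
    "targets": ["Has_claims_year", "Cost_claims_year"],
}
toy_columns = [
    "Driver_age", "Premium", "Has_claims_history", "Vehicle_age",
    "Claims_to_premium_ratio", "Has_claims_year", "Cost_claims_year",
]
# Assumption: the ratio is computed as Cost_claims_year / Premium
derived_from_target = {"Claims_to_premium_ratio": "Cost_claims_year"}

report = audit_feature_map(toy_feature_map, toy_columns, derived_from_target)
print(report)
```

A flagged feature is not automatically wrong, but it deserves a look before training: a predictor built from the target can make a model appear far more accurate than it would be on new policies.
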
#### <b>Rounding the dataset to 4 decimals and setting decimal and thousands separators</b>

In [14]:
# -------------------------------
# 1) Load the final dataset
# -------------------------------
df = pd.read_csv("transformed_motor_insurance.csv")
# -------------------------------
# 2) Identify the float columns
# -------------------------------
float_cols = df.select_dtypes(include="float").columns.tolist()
# -------------------------------
# 3) Round the float columns to 4 decimals
# -------------------------------
df[float_cols] = df[float_cols].round(4)
# -------------------------------
# 4) Convert the numeric format:
#    - no thousands separator
#    - decimal comma ','
# -------------------------------
for col in float_cols:
    null_mask = df[col].isna()  # Remember the nulls before converting to text
    df[col] = (
        df[col]
        .astype(str)             # Convert to text
        .str.replace('.', ',', regex=False)  # Replace the decimal point with a comma
        .mask(null_mask, "")     # Keep nulls as empty fields instead of the text "nan"
    )
# -------------------------------
# 5) Export the final clean CSV
# -------------------------------
output_path = "transformed_motor_insurance_rounded_european.csv"
df.to_csv(output_path, index=False, encoding="utf-8")
output_path


'transformed_motor_insurance_rounded_european.csv'
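
The cell above turns the float columns into text before exporting, and because the file is still comma-delimited, every decimal-comma value has to be quoted. As a hedged alternative, pandas can apply the formatting at write time through the `sep`, `decimal`, and `float_format` parameters of `to_csv`, which leaves the DataFrame numeric. The sketch below uses a made-up toy frame and an in-memory buffer; the semicolon delimiter is an assumption about what the consuming tool expects.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the exported dataset (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Premium": [222.52, 1234.5, 213.78],
    "Claims_to_premium_ratio": [0.0, 0.476172, 1071.248039],
    "Policy_duration": [np.nan, 2.5, np.nan],
})

# Format at write time: semicolon delimiter, decimal comma, 4 decimals.
# Nulls are written as empty fields (the default na_rep).
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep=";", decimal=",", float_format="%.4f")
csv_text = buffer.getvalue()
print(csv_text)

# Reading back with the same conventions recovers numeric columns
roundtrip = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(roundtrip.dtypes)
```

With this approach the in-memory columns keep their float dtype, so the same DataFrame can still be used for further calculations after the export.
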