## Phase 1: Exploration and Cleaning

#### Initial Exploration:

Carry out an initial exploration of the data to identify potential problems, such as null
values, outliers, or missing data in the relevant columns.
Use Pandas functions to obtain information about the structure of the data, the
presence of null values, and basic statistics for the columns involved.
Join the two datasets in the most efficient way.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None) 

In [2]:
df1 = pd.read_csv("Customer Flight Activity.csv")
df2 = pd.read_csv("Customer Loyalty History.csv")

An exploratory data analysis (EDA) is run on each dataset separately before joining them, in order to understand its structure, quality, and internal consistency.

This step identifies key variables and possible nulls, duplicates, or inconsistencies, and confirms that the common column needed for the join exists. Analyzing each dataset independently makes it easier to detect problems before combining them and avoids carrying errors into the final dataset.
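As a minimal sketch of how the common column and key coverage could be confirmed, the snippet below uses two small toy frames. The names `flights` and `customers` and their values are illustrative assumptions standing in for `df1` and `df2`; they are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only).
flights = pd.DataFrame({
    "Loyalty Number": [100018, 100018, 100102],
    "Year": [2017, 2017, 2017],
    "Month": [1, 2, 1],
})
customers = pd.DataFrame({
    "Loyalty Number": [100018, 100102],
    "Province": ["Alberta", "Ontario"],
})

# Columns present in both tables are the candidate join keys.
common_cols = sorted(set(flights.columns) & set(customers.columns))

# Keys found in the activity table but not in the customer table
# would end up as unmatched rows in a left join.
missing_keys = set(flights["Loyalty Number"]) - set(customers["Loyalty Number"])

print(common_cols)
print(len(missing_keys))
```

Running the same two checks on the real DataFrames would show whether `Loyalty Number` is the only shared column and whether every flight record has a matching customer.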

In [3]:
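def eda_basico(df):
    """Print a quick overview of a DataFrame: head, info, columns, nulls and duplicates.

    Assumes the DataFrame has a "Loyalty Number" column, as both datasets here do.
    """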
    print("Primeras filas:")
    display(df.head())
    print("\nInformación general:")
    df.info()
    print("\nColumnas:")
    display(df.columns)
    print("\nValores nulos (conteo):")
    display(df.isnull().sum())
    print("\nPorcentaje de nulos (%):")
    display(((df.isnull().sum() / len(df)) * 100).round(2))
    print("\nFilas duplicadas completas:")
    print(df.duplicated().sum())
    print("\nDuplicados en Loyalty Number:")
    print(df["Loyalty Number"].duplicated().sum())
    print("\n")

In [4]:
eda_basico(df1)

Primeras filas:


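|   | Loyalty Number | Year | Month | Flights Booked | Flights with Companions | Total Flights | Distance | Points Accumulated | Points Redeemed | Dollar Cost Points Redeemed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100018 | 2017 | 1 | 3 | 0 | 3 | 1521 | 152.0 | 0 | 0 |
| 1 | 100102 | 2017 | 1 | 10 | 4 | 14 | 2030 | 203.0 | 0 | 0 |
| 2 | 100140 | 2017 | 1 | 6 | 0 | 6 | 1200 | 120.0 | 0 | 0 |
| 3 | 100214 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 4 | 100272 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |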



Información general:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 405624 entries, 0 to 405623
Data columns (total 10 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Loyalty Number               405624 non-null  int64  
 1   Year                         405624 non-null  int64  
 2   Month                        405624 non-null  int64  
 3   Flights Booked               405624 non-null  int64  
 4   Flights with Companions      405624 non-null  int64  
 5   Total Flights                405624 non-null  int64  
 6   Distance                     405624 non-null  int64  
 7   Points Accumulated           405624 non-null  float64
 8   Points Redeemed              405624 non-null  int64  
 9   Dollar Cost Points Redeemed  405624 non-null  int64  
dtypes: float64(1), int64(9)
memory usage: 30.9 MB

Columnas:


Index(['Loyalty Number', 'Year', 'Month', 'Flights Booked',
       'Flights with Companions', 'Total Flights', 'Distance',
       'Points Accumulated', 'Points Redeemed', 'Dollar Cost Points Redeemed'],
      dtype='object')


Valores nulos (conteo):


Loyalty Number                 0
Year                           0
Month                          0
Flights Booked                 0
Flights with Companions        0
Total Flights                  0
Distance                       0
Points Accumulated             0
Points Redeemed                0
Dollar Cost Points Redeemed    0
dtype: int64


Porcentaje de nulos (%):


Loyalty Number                 0.0
Year                           0.0
Month                          0.0
Flights Booked                 0.0
Flights with Companions        0.0
Total Flights                  0.0
Distance                       0.0
Points Accumulated             0.0
Points Redeemed                0.0
Dollar Cost Points Redeemed    0.0
dtype: float64


Filas duplicadas completas:
1864

Duplicados en Loyalty Number:
388887




In [5]:
eda_basico(df2)

Primeras filas:


|   | Loyalty Number | Country | Province | City | Postal Code | Gender | Education | Salary | Marital Status | Loyalty Card | CLV | Enrollment Type | Enrollment Year | Enrollment Month | Cancellation Year | Cancellation Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 480934 | Canada | Ontario | Toronto | M2Z 4K1 | Female | Bachelor | 83236.0 | Married | Star | 3839.14 | Standard | 2016 | 2 | NaN | NaN |
| 1 | 549612 | Canada | Alberta | Edmonton | T3G 6Y6 | Male | College | NaN | Divorced | Star | 3839.61 | Standard | 2016 | 3 | NaN | NaN |
| 2 | 429460 | Canada | British Columbia | Vancouver | V6E 3D9 | Male | College | NaN | Single | Star | 3839.75 | Standard | 2014 | 7 | 2018.0 | 1.0 |
| 3 | 608370 | Canada | Ontario | Toronto | P1W 1K4 | Male | College | NaN | Single | Star | 3839.75 | Standard | 2013 | 2 | NaN | NaN |
| 4 | 530508 | Canada | Quebec | Hull | J8Y 3Z5 | Male | Bachelor | 103495.0 | Married | Star | 3842.79 | Standard | 2014 | 10 | NaN | NaN |



Información general:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16737 entries, 0 to 16736
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Loyalty Number      16737 non-null  int64  
 1   Country             16737 non-null  object 
 2   Province            16737 non-null  object 
 3   City                16737 non-null  object 
 4   Postal Code         16737 non-null  object 
 5   Gender              16737 non-null  object 
 6   Education           16737 non-null  object 
 7   Salary              12499 non-null  float64
 8   Marital Status      16737 non-null  object 
 9   Loyalty Card        16737 non-null  object 
 10  CLV                 16737 non-null  float64
 11  Enrollment Type     16737 non-null  object 
 12  Enrollment Year     16737 non-null  int64  
 13  Enrollment Month    16737 non-null  int64  
 14  Cancellation Year   2067 non-null   float64
 15  Cancellation Month  2067 non-null   float64
[output truncated]

Columnas:

Index(['Loyalty Number', 'Country', 'Province', 'City', 'Postal Code',
       'Gender', 'Education', 'Salary', 'Marital Status', 'Loyalty Card',
       'CLV', 'Enrollment Type', 'Enrollment Year', 'Enrollment Month',
       'Cancellation Year', 'Cancellation Month'],
      dtype='object')


Valores nulos (conteo):


Loyalty Number            0
Country                   0
Province                  0
City                      0
Postal Code               0
Gender                    0
Education                 0
Salary                 4238
Marital Status            0
Loyalty Card              0
CLV                       0
Enrollment Type           0
Enrollment Year           0
Enrollment Month          0
Cancellation Year     14670
Cancellation Month    14670
dtype: int64


Porcentaje de nulos (%):


Loyalty Number         0.00
Country                0.00
Province               0.00
City                   0.00
Postal Code            0.00
Gender                 0.00
Education              0.00
Salary                25.32
Marital Status         0.00
Loyalty Card           0.00
CLV                    0.00
Enrollment Type        0.00
Enrollment Year        0.00
Enrollment Month       0.00
Cancellation Year     87.65
Cancellation Month    87.65
dtype: float64


Filas duplicadas completas:
0

Duplicados en Loyalty Number:
0




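Handling duplicates before joining the DataFrames:

Duplicates are handled before the join so that repeated records are not propagated or multiplied in the final dataset. The exploration above found 1,864 fully duplicated rows in the flight activity dataset and none in the loyalty history dataset. Repeated `Loyalty Number` values in the activity dataset are expected, since each customer has many monthly records.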
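To illustrate why exact duplicates are removed first, here is a small sketch on toy data. The frames and values are hypothetical, not taken from the real files. A duplicated activity row survives a left merge unchanged, so it would still be duplicated in the joined result.

```python
import pandas as pd

# Toy activity table: the first two rows are exact duplicates.
flights = pd.DataFrame({
    "Loyalty Number": [1, 1, 2],
    "Year": [2017, 2017, 2017],
    "Month": [1, 1, 1],
    "Total Flights": [0, 0, 3],
})
customers = pd.DataFrame({
    "Loyalty Number": [1, 2],
    "Province": ["Ontario", "Quebec"],
})

# Merging without cleaning carries the duplicated row into the result.
merged_raw = flights.merge(customers, on="Loyalty Number", how="left")

# Dropping exact duplicates first yields a clean merged table.
merged_clean = flights.drop_duplicates().merge(customers, on="Loyalty Number", how="left")

dups_raw = int(merged_raw.duplicated().sum())
dups_clean = int(merged_clean.duplicated().sum())
print(dups_raw, dups_clean)
```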

In [6]:
df1[df1.duplicated()]

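|   | Loyalty Number | Year | Month | Flights Booked | Flights with Companions | Total Flights | Distance | Points Accumulated | Points Redeemed | Dollar Cost Points Redeemed |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 101902 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 227 | 112142 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 478 | 126100 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 567 | 130331 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 660 | 135421 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 404668 | 949628 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 404884 | 960050 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 405111 | 971370 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 405410 | 988392 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |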


In [9]:
# Drop the 1,864 fully duplicated rows (the first occurrence of each is kept)
df1 = df1.drop_duplicates()

In [None]:
df1.duplicated().sum()

np.int64(0)

With duplicates handled, the two DataFrames are joined for the subsequent full EDA and data cleaning.

A `left merge` was performed using *Loyalty Number* as the common key between the two datasets.

This type of join keeps every record from the main dataset (flight activity) and adds the sociodemographic information where it is available. No customer activity is lost, and the analysis is based on the full set of operational records.

In [12]:
df_merged = pd.merge(df1, df2, on="Loyalty Number", how="left")
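A hedged sketch of how the same merge could be made self-checking, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames below are assumptions for illustration, and key `999999` is a hypothetical unmatched customer added only to show what `left_only` looks like.

```python
import pandas as pd

flights = pd.DataFrame({
    "Loyalty Number": [100018, 100018, 100102, 999999],
    "Total Flights": [3, 0, 14, 2],
})
customers = pd.DataFrame({
    "Loyalty Number": [100018, 100102],
    "Loyalty Card": ["Aurora", "Nova"],
})

checked = pd.merge(
    flights,
    customers,
    on="Loyalty Number",
    how="left",
    validate="many_to_one",  # raises MergeError if a customer key repeats on the right
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Rows from the activity table that found no customer record.
unmatched = int((checked["_merge"] == "left_only").sum())
print(len(checked), unmatched)
```

With `many_to_one` validation, a left merge cannot change the number of activity rows, which is consistent with the 403,760 rows reported after the merge in this notebook.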

In [19]:
def eda_completo(df):

    """
    Run a full exploratory analysis of the DataFrame,
    covering structure, descriptive statistics, nulls,
    duplicates, and the distribution of categorical variables.
    """
    print("PRIMER VISTAZO A LOS DATOS")
    print("Primeras filas:")
    display(df.head())
    
    print("Ultimas filas:")
    display(df.tail())
    
    print("Muestra aleatoria:")
    display(df.sample(5))

    print("ESTRUCTURA DEL DATAFRAME")
    print(f"Dimensiones (filas, columnas): {df.shape}\n")
    
    print("Columnas:")
    display(df.columns)
    
    print("Informacion general:")
    df.info()
    
    print("Tipos de datos por columna:")
    display(df.dtypes)

    print("ESTADISTICAS DESCRIPTIVAS")
    print("Variables numericas:")
    display(df.describe().T)
    
    print("Variables categoricas:")
    display(df.describe(include='O').T)

    print("CARDINALIDAD")
    print(f"Total de valores unicos en el DataFrame: {df.nunique().sum()}\n")
    
    print("Numero de valores unicos por columna:")
    display(df.nunique())

    print("Valores unicos por columna:")
    for col in df.columns:
        print(f"{col}: {df[col].unique()}")

    print("VALORES NULOS")
    display(df.isnull().sum())
    
    print("Porcentaje de valores nulos (%)")
    display(((df.isnull().sum() / df.shape[0]) * 100).round(2))

    print("DUPLICADOS")
    dup_count = df.duplicated().sum()
    print(f"Duplicados: {dup_count}")
    
    if dup_count > 0:
        display(df[df.duplicated()])
    else:
        print("No hay filas duplicadas.")

    print("VALUE COUNTS (solo categoricas)")
    cat_cols = df.select_dtypes(include='object').columns
    for col in cat_cols:
        print(f"\n{col}")
        display(df[col].value_counts())


In [20]:
eda_completo(df_merged)

PRIMER VISTAZO A LOS DATOS
Primeras filas:


|   | Loyalty Number | Year | Month | Flights Booked | Flights with Companions | Total Flights | Distance | Points Accumulated | Points Redeemed | Dollar Cost Points Redeemed | Country | Province | City | Postal Code | Gender | Education | Salary | Marital Status | Loyalty Card | CLV | Enrollment Type | Enrollment Year | Enrollment Month | Cancellation Year | Cancellation Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100018 | 2017 | 1 | 3 | 0 | 3 | 1521 | 152.0 | 0 | 0 | Canada | Alberta | Edmonton | T9G 1W3 | Female | Bachelor | 92552.0 | Married | Aurora | 7919.2 | Standard | 2016 | 8 | NaN | NaN |
| 1 | 100102 | 2017 | 1 | 10 | 4 | 14 | 2030 | 203.0 | 0 | 0 | Canada | Ontario | Toronto | M1R 4K3 | Male | College | NaN | Single | Nova | 2887.74 | Standard | 2013 | 3 | NaN | NaN |
| 2 | 100140 | 2017 | 1 | 6 | 0 | 6 | 1200 | 120.0 | 0 | 0 | Canada | British Columbia | Dawson Creek | U5I 4F1 | Female | College | NaN | Divorced | Nova | 2838.07 | Standard | 2016 | 7 | NaN | NaN |
| 3 | 100214 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | British Columbia | Vancouver | V5R 1W3 | Male | Bachelor | 63253.0 | Married | Star | 4170.57 | Standard | 2015 | 8 | NaN | NaN |
| 4 | 100272 | 2017 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Ontario | Toronto | P1L 8X8 | Female | Bachelor | 91163.0 | Divorced | Star | 6622.05 | Standard | 2014 | 1 | NaN | NaN |


Ultimas filas:


|   | Loyalty Number | Year | Month | Flights Booked | Flights with Companions | Total Flights | Distance | Points Accumulated | Points Redeemed | Dollar Cost Points Redeemed | Country | Province | City | Postal Code | Gender | Education | Salary | Marital Status | Loyalty Card | CLV | Enrollment Type | Enrollment Year | Enrollment Month | Cancellation Year | Cancellation Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 403755 | 999902 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Ontario | Toronto | M1R 4K3 | Male | College | NaN | Married | Aurora | 7290.07 | Standard | 2014 | 5 | NaN | NaN |
| 403756 | 999911 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Newfoundland | St. John's | A1C 6H9 | Male | Doctor | 217943.0 | Single | Nova | 8564.77 | Standard | 2012 | 8 | NaN | NaN |
| 403757 | 999940 | 2018 | 12 | 3 | 0 | 3 | 1233 | 123.0 | 0 | 0 | Canada | Quebec | Quebec City | G1B 3L5 | Female | Bachelor | 47670.0 | Married | Nova | 20266.5 | Standard | 2017 | 7 | NaN | NaN |
| 403758 | 999982 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | British Columbia | Victoria | V10 6T5 | Male | College | NaN | Married | Star | 2631.56 | Standard | 2018 | 7 | NaN | NaN |
| 403759 | 999986 | 2018 | 12 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Ontario | Ottawa | K1F 2R2 | Female | Bachelor | 46594.0 | Married | Nova | 8257.01 | 2018 Promotion | 2018 | 2 | NaN | NaN |


Muestra aleatoria:


|   | Loyalty Number | Year | Month | Flights Booked | Flights with Companions | Total Flights | Distance | Points Accumulated | Points Redeemed | Dollar Cost Points Redeemed | Country | Province | City | Postal Code | Gender | Education | Salary | Marital Status | Loyalty Card | CLV | Enrollment Type | Enrollment Year | Enrollment Month | Cancellation Year | Cancellation Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 296848 | 753306 | 2018 | 9 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Quebec | Montreal | H2Y 2W2 | Male | College | NaN | Single | Nova | 3011.27 | Standard | 2017 | 11 | 2018.0 | 5.0 |
| 159100 | 516337 | 2017 | 10 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Ontario | Toronto | P1L 8X8 | Male | Bachelor | 74770.0 | Married | Nova | 4428.03 | Standard | 2014 | 7 | NaN | NaN |
| 58551 | 537722 | 2017 | 4 | 10 | 0 | 10 | 820 | 82.0 | 0 | 0 | Canada | British Columbia | West Vancouver | V6V 8Z3 | Male | Bachelor | 61649.0 | Married | Star | 8208.93 | Standard | 2015 | 10 | NaN | NaN |
| 193562 | 562084 | 2017 | 12 | 7 | 0 | 7 | 2968 | 296.0 | 0 | 0 | Canada | Ontario | Toronto | M8Y 4K8 | Male | Bachelor | 96153.0 | Married | Star | 2882.35 | Standard | 2014 | 11 | NaN | NaN |
| 48535 | 899734 | 2017 | 3 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | Canada | Ontario | Trenton | K8V 4B2 | Male | Bachelor | 63225.0 | Married | Nova | 6102.31 | Standard | 2016 | 3 | 2016.0 | 11.0 |


ESTRUCTURA DEL DATAFRAME
Dimensiones (filas, columnas): (403760, 25)

Columnas:


Index(['Loyalty Number', 'Year', 'Month', 'Flights Booked',
       'Flights with Companions', 'Total Flights', 'Distance',
       'Points Accumulated', 'Points Redeemed', 'Dollar Cost Points Redeemed',
       'Country', 'Province', 'City', 'Postal Code', 'Gender', 'Education',
       'Salary', 'Marital Status', 'Loyalty Card', 'CLV', 'Enrollment Type',
       'Enrollment Year', 'Enrollment Month', 'Cancellation Year',
       'Cancellation Month'],
      dtype='object')

Informacion general:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403760 entries, 0 to 403759
Data columns (total 25 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Loyalty Number               403760 non-null  int64  
 1   Year                         403760 non-null  int64  
 2   Month                        403760 non-null  int64  
 3   Flights Booked               403760 non-null  int64  
 4   Flights with Companions      403760 non-null  int64  
 5   Total Flights                403760 non-null  int64  
 6   Distance                     403760 non-null  int64  
 7   Points Accumulated           403760 non-null  float64
 8   Points Redeemed              403760 non-null  int64  
 9   Dollar Cost Points Redeemed  403760 non-null  int64  
 10  Country                      403760 non-null  object 
 11  Province                     403760 non-null  object 
 12  City                         403760 non-null  object 
[output truncated]

Tipos de datos por columna:

Loyalty Number                   int64
Year                             int64
Month                            int64
Flights Booked                   int64
Flights with Companions          int64
Total Flights                    int64
Distance                         int64
Points Accumulated             float64
Points Redeemed                  int64
Dollar Cost Points Redeemed      int64
Country                         object
Province                        object
City                            object
Postal Code                     object
Gender                          object
Education                       object
Salary                         float64
Marital Status                  object
Loyalty Card                    object
CLV                            float64
Enrollment Type                 object
Enrollment Year                  int64
Enrollment Month                 int64
Cancellation Year              float64
Cancellation Month             float64
dtype: object

ESTADISTICAS DESCRIPTIVAS
Variables numericas:


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Loyalty Number | 403760.0 | 549875.383713 | 258961.514684 | 100018.0 | 326699.0 | 550598.0 | 772152.0 | 999986.0 |
| Year | 403760.0 | 2017.500352 | 0.5 | 2017.0 | 2017.0 | 2018.0 | 2018.0 | 2018.0 |
| Month | 403760.0 | 6.501335 | 3.451982 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 |
| Flights Booked | 403760.0 | 4.13405 | 5.230064 | 0.0 | 0.0 | 1.0 | 8.0 | 21.0 |
| Flights with Companions | 403760.0 | 1.036569 | 2.080472 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 |
| Total Flights | 403760.0 | 5.170619 | 6.526858 | 0.0 | 0.0 | 1.0 | 10.0 | 32.0 |
| Distance | 403760.0 | 1214.460979 | 1434.098521 | 0.0 | 0.0 | 525.0 | 2342.0 | 6293.0 |
| Points Accumulated | 403760.0 | 124.263761 | 146.696179 | 0.0 | 0.0 | 53.0 | 240.0 | 676.5 |
| Points Redeemed | 403760.0 | 30.838587 | 125.758002 | 0.0 | 0.0 | 0.0 | 0.0 | 876.0 |
| Dollar Cost Points Redeemed | 403760.0 | 2.495973 | 10.172033 | 0.0 | 0.0 | 0.0 | 0.0 | 71.0 |


Variables categoricas:


|   | count | unique | top | freq |
|---|---|---|---|---|
| Country | 403760 | 1 | Canada | 403760 |
| Province | 403760 | 11 | Ontario | 130258 |
| City | 403760 | 29 | Toronto | 80775 |
| Postal Code | 403760 | 55 | V6E 3D9 | 21944 |
| Gender | 403760 | 2 | Female | 202757 |
| Education | 403760 | 5 | Bachelor | 252567 |
| Marital Status | 403760 | 3 | Married | 234845 |
| Loyalty Card | 403760 | 3 | Star | 183745 |
| Enrollment Type | 403760 | 2 | Standard | 380419 |


CARDINALIDAD
Total de valores unicos en el DataFrame: 37771

Numero de valores unicos por columna:


Loyalty Number                 16737
Year                               2
Month                             12
Flights Booked                    22
Flights with Companions           12
Total Flights                     33
Distance                        4746
Points Accumulated              1549
Points Redeemed                  587
Dollar Cost Points Redeemed       49
Country                            1
Province                          11
City                              29
Postal Code                       55
Gender                             2
Education                          5
Salary                          5890
Marital Status                     3
Loyalty Card                       3
CLV                             7984
Enrollment Type                    2
Enrollment Year                    7
Enrollment Month                  12
Cancellation Year                  6
Cancellation Month                12
dtype: int64

Valores unicos por columna:
Loyalty Number: [100018 100102 100140 ... 999731 999788 999891]
Year: [2017 2018]
Month: [ 1  9  2  3 11  4  5  7  6  8 10 12]
Flights Booked: [ 3 10  6  0  8 11  9  4  7  5  2  1 12 13 14 16 15 17 18 19 20 21]
Flights with Companions: [ 0  4  7  1  6  3  5  2 10  8  9 11]
Total Flights: [ 3 14  6  0 15 11 12 10  8  9  7  5 16  2  1 17 13 22  4 19 18 21 26 20
 23 25 27 24 28 30 29 31 32]
Distance: [1521 2030 1200 ... 1217  617 4135]
Points Accumulated: [152.   203.   120.   ...  18.75 601.   626.  ]
Points Redeemed: [  0 341 364 310 445 312 343 366 389 292 447 324 456 409 436 327 322 291
 323 300 290 309 325 386 321 363 340 670 443 517 444 328 344 367 313 333
 293 449 297 455 372 356 405 381 466 419 369 352 482 335 329 305 415 396
 317 348 314 334 350 330 318 298 420 336 471 680 441 353 484 301 374 417
 501 299 398 307 368 306 347 439 395 481 337 382 426 373 399 424 326 392
 438 467 480 448 308 400 376 375 460 339 385 611 431 320 362 404 442 410
 361 319 435 ...
[output truncated]

VALORES NULOS

Loyalty Number                      0
Year                                0
Month                               0
Flights Booked                      0
Flights with Companions             0
Total Flights                       0
Distance                            0
Points Accumulated                  0
Points Redeemed                     0
Dollar Cost Points Redeemed         0
Country                             0
Province                            0
City                                0
Postal Code                         0
Gender                              0
Education                           0
Salary                         102260
Marital Status                      0
Loyalty Card                        0
CLV                                 0
Enrollment Type                     0
Enrollment Year                     0
Enrollment Month                    0
Cancellation Year              354110
Cancellation Month             354110
dtype: int64

Porcentaje de valores nulos (%)


Loyalty Number                  0.00
Year                            0.00
Month                           0.00
Flights Booked                  0.00
Flights with Companions         0.00
Total Flights                   0.00
Distance                        0.00
Points Accumulated              0.00
Points Redeemed                 0.00
Dollar Cost Points Redeemed     0.00
Country                         0.00
Province                        0.00
City                            0.00
Postal Code                     0.00
Gender                          0.00
Education                       0.00
Salary                         25.33
Marital Status                  0.00
Loyalty Card                    0.00
CLV                             0.00
Enrollment Type                 0.00
Enrollment Year                 0.00
Enrollment Month                0.00
Cancellation Year              87.70
Cancellation Month             87.70
dtype: float64

DUPLICADOS
Duplicados: 0
No hay filas duplicadas.
VALUE COUNTS (solo categoricas)

Country


Country
Canada    403760
Name: count, dtype: int64


Province


Province
Ontario                 130258
British Columbia        106442
Quebec                   79573
Alberta                  23360
Manitoba                 15900
New Brunswick            15352
Nova Scotia              12507
Saskatchewan              9861
Newfoundland              6244
Yukon                     2679
Prince Edward Island      1584
Name: count, dtype: int64


City


City
Toronto           80775
Vancouver         62314
Montreal          49687
Winnipeg          15900
Whistler          13994
Halifax           12507
Ottawa            12262
Edmonton          11768
Trenton           11710
Quebec City       11698
Dawson Creek      10725
Fredericton       10266
Regina             9861
Kingston           9652
Tremblant          9576
Victoria           9444
Hull               8612
West Vancouver     7831
St. John's         6244
Thunder Bay        6171
Sudbury            5493
Moncton            5086
Calgary            4584
Banff              4296
London             4195
Peace River        2712
Whitehorse         2679
Kelowna            2134
Charlottetown      1584
Name: count, dtype: int64


Postal Code


Postal Code
V6E 3D9    21944
V5R 1W3    16529
V6T 1Y8    13994
V6E 3Z3    13128
M2M 7K8    12855
P1J 8T7    12093
H2T 9K8    12000
K8V 4B2    11710
G1B 3L5    11698
H2T 2J6    10747
U5I 4F1    10725
V1E 4R6    10713
E3B 2H2    10266
R2C 0M5    10059
M9K 2P4     9652
H5Y 2S9     9576
V10 6T5     9444
K1F 2R2     9375
H2Y 2W2     8824
J8Y 3Z5     8612
M8Y 4K8     8207
H4G 3T4     8184
B3J 9S2     7952
V6V 8Z3     7831
P2T 6G3     7790
H2Y 4R4     7623
M1R 4K3     7533
P1L 8X8     6790
P1W 1K4     6604
T9G 1W3     6430
A1C 6H9     6244
K8T 5M5     6171
M2Z 4K1     6168
P5S 6R4     5925
M5V 1G5     5493
S6J 3G0     5437
T3G 6Y6     5338
E1A 2A7     5086
T3E 2V9     4584
B3C 2M8     4555
S1J 3C5     4424
T4V 1D4     4296
M5B 3E4     4195
M2M 6J7     3693
R6Y 4T5     3441
M2P 4F6     3045
K1G 4Z0     2887
T9O 2W2     2712
Y2K 6R0     2679
R3R 3T4     2400
H3T 8L4     2141
V09 2E9     2134
C1A 6E8     1584
H3J 5I6      168
M3R 4K8       72
Name: count, dtype: int64


Gender


Gender
Female    202757
Male      201003
Name: count, dtype: int64


Education


Education
Bachelor                252567
College                 102260
High School or Below     18915
Doctor                   17731
Master                   12287
Name: count, dtype: int64


Marital Status


Marital Status
Married     234845
Single      108153
Divorced     60762
Name: count, dtype: int64


Loyalty Card


Loyalty Card
Star      183745
Nova      136883
Aurora     83132
Name: count, dtype: int64


Enrollment Type


Enrollment Type
Standard          380419
2018 Promotion     23341
Name: count, dtype: int64

In [16]:
df_merged["Cancellation Year"].unique()

array([  nan, 2018., 2015., 2016., 2014., 2013., 2017.])

In [18]:
df_merged["Cancellation Month"].unique()

array([nan,  3.,  9.,  2.,  7.,  6.,  8.,  4.,  1.,  5., 11., 12., 10.])

A full exploratory analysis of the dataset was carried out to understand its structure, quality, and general characteristics before applying any significant transformation.

#### General structure

The dataset has a coherent structure: 403,760 rows and 25 columns, combining numeric and categorical variables. Column names are consistent and clearly identify the type of information each one holds.

Data types generally match the nature of the variables, although the time-related variables need a later review: `Cancellation Year` and `Cancellation Month` are stored as `float64` because they contain nulls.
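One possible way to address the `float64` cancellation columns later is pandas' nullable integer dtype `Int64`, which keeps missing values as `<NA>` while storing the rest as integers. This is only a sketch on toy data; the frame `loyalty` and its values are assumptions, and the notebook itself does not apply this conversion at this stage.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two cancellation columns (values are illustrative).
loyalty = pd.DataFrame({
    "Cancellation Year": [np.nan, 2018.0, 2015.0],
    "Cancellation Month": [np.nan, 1.0, 11.0],
})

cancel_cols = ["Cancellation Year", "Cancellation Month"]

# Nullable Int64 stores whole numbers and keeps missing values as <NA>.
converted = loyalty.astype({col: "Int64" for col in cancel_cols})

print(converted.dtypes)
print(converted)
```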

#### Descriptive statistics

The initial descriptive analysis shows right-skewed distributions in several flight-activity variables. Means sit well above medians (for example, `Flights Booked` has a mean of 4.13 and a median of 1, and `Distance` has a mean of 1,214 and a median of 525), which indicates that high values pull the averages upward.
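The mean-versus-median comparison can be made explicit with a short sketch. The toy column below is an assumption chosen to mimic a zero-heavy activity variable; it is not a sample of the real data.

```python
import pandas as pd

# Toy activity column: many zeros and a few large values (illustrative only).
activity = pd.DataFrame({"Flights Booked": [0, 0, 0, 1, 1, 2, 8, 21]})

# One row per variable, with mean, median and sample skewness side by side.
summary = activity.agg(["mean", "median", "skew"]).T
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary)
```

A positive `skew` together with a mean above the median points to a right-skewed distribution, which is the pattern described for the flight-activity variables.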

The categorical variables show a distribution that is coherent and consistent with the context of the loyalty program.

#### Null values

Null values appear in `Salary` (25.33%) and in `Cancellation Year` and `Cancellation Month` (87.70% each). No changes are made at this stage, since their meaning and coherence within the dataset are evaluated first.
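One way to start evaluating what the nulls mean is to check whether they concentrate in a particular category. In the output above, the `Salary` null count (102,260) equals the number of rows with `Education` equal to `College` (102,260), which may suggest a link, though that would need to be verified on the real data. The sketch below shows the check on a toy frame whose values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy customer table (values are illustrative only).
customers = pd.DataFrame({
    "Education": ["Bachelor", "College", "College", "Doctor"],
    "Salary": [83236.0, np.nan, np.nan, 217943.0],
})

# Share of missing salaries within each education level (0.0 to 1.0).
null_share = customers["Salary"].isna().groupby(customers["Education"]).mean()

print(null_share)
```

If a single category shows a share close to 1.0 and the others close to 0.0, the missing values are unlikely to be random, which would affect how they are later imputed or flagged.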

In [17]:
df_merged.to_csv("df_merged.csv", index=False)