<div style="text-align: center;">
  <img src="./imagens/logo_novaims.png" alt="Logo" style="width: 150px; height: auto; margin-bottom: 10px;">
  <h1 style="margin: 0;"><strong>Machine Learning Project: Amazing International Airlines Inc.</strong></h1>
  <h2 style="margin: 0;"><strong>Part 1/2: Exploratory Data Analysis</strong></h2>
</div>

<div style="text-align: left; margin-top: 15px;">
  <p style="margin: 0;"><strong>Group 51:</strong></p>
  <ul style="margin: 0; padding-left: 20px;">
    <li>André Ferreira | 20250398</li>
    <li>Fausto Gomes | 20221915</li>
    <li>Maria Francisca Gonçalves | 20221942</li>
    <li>Miguel Matos | 20221925</li>
  </ul>
</div>

# <span style="color:#0097b2">0. Context</span>

## 0.1 Business Understanding

Amazing International Airlines Inc. (AIAI) is a commercial airline that operates in an increasingly competitive market, where customer loyalty and personalized experiences are key differentiators. Despite having a well-established loyalty program, AIAI faces challenges in understanding the diversity of its customer base and designing targeted marketing strategies.

The main business goal of this project is to **develop data-driven customer segmentation** that supports personalized marketing initiatives, improves customer retention, and increases overall profitability. By identifying distinct customer groups, AIAI aims to tailor services, loyalty benefits, and communication strategies to the specific needs and behaviors of each segment.

As data mining consultants for AIAI, we will analyze three years of loyalty and flight activity data to extract insights about customer behavior and value patterns.

The analysis will follow the **CRISP-DM methodology**, beginning with business and data understanding (current phase), followed by data preparation, modeling, evaluation, and deployment.  
In this EDA stage, the focus is on:
- Exploring and assessing the quality of the datasets provided by AIAI.
- Identifying relevant variables and potential data issues.
- Engineering new features that capture customer value and travel behavior.
- Formulating hypotheses for the subsequent clustering phase.

Ultimately, the insights generated in this phase will form the foundation for creating meaningful and actionable customer segments that align with AIAI’s strategic objectives.




# <span style="color:#0097b2">1. Importing Packages and Libraries</span>

In [2]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# <span style="color:#0097b2">2. Reading the Data</span>

In [10]:
# Load data
customer_df = pd.read_csv('../data/raw/DM_AIAI_CustomerDB.csv')
flights_df = pd.read_csv('../data/raw/DM_AIAI_FlightsDB.csv')

# Basic info
print("CustomerDB shape:", customer_df.shape)
print("FlightsDB shape:", flights_df.shape)

CustomerDB shape: (16921, 21)
FlightsDB shape: (608436, 10)


- The **Customer Database** (`CustomerDB`) contains **16,921 records** and **21 columns**, each representing a unique customer enrolled in the loyalty program.  

- The **Flights Database** (`FlightsDB`) contains **608,436 records** and **10 columns**, each representing a customer’s monthly flight activity.  

# <span style="color:#0097b2">3. Metadata</span>

**Customer Database (`DM_AIAI_CustomerDB.csv`)**
- `Unnamed: 0`: Imported row index (redundant; to be dropped)
- `Loyalty#`: Unique customer identifier
- `First Name`: Customer first name
- `Last Name`: Customer last name
- `Customer Name`: Full name
- `Country`: Country of residence
- `Province or State`: State/Province
- `City`: City
- `Latitude`: Latitude of residence
- `Longitude`: Longitude of residence
- `Postal code`: Postal/ZIP code
- `Gender`: Customer gender
- `Education`: Education level
- `Location Code`: Urban/Suburban/Rural classification
- `Income`: Annual income (USD)
- `Marital Status`: Marital status
- `LoyaltyStatus`: Loyalty program tier (Aurora, Nova, Star)
- `EnrollmentDateOpening`: Program enrollment date
- `CancellationDate`: Program cancellation date (if any)
- `Customer Lifetime Value`: Estimated lifetime value (USD)
- `EnrollmentType`: Enrollment channel/type (e.g., Standard, 2021 Promotion)

**Flights Database (`DM_AIAI_FlightsDB.csv`)**
- `Loyalty#`: Customer identifier (foreign key)
- `Year`: Activity year
- `Month`: Activity month (1–12)
- `YearMonthDate`: First day of the activity month
- `NumFlights`: Number of flights in the month
- `NumFlightsWithCompanions`: Flights with companions
- `DistanceKM`: Total distance flown (km)
- `PointsAccumulated`: Loyalty points earned
- `PointsRedeemed`: Loyalty points redeemed
- `DollarCostPointsRedeemed`: Dollar value of redeemed points


**Relationship Between Datasets**

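The **Customer Database (CustomerDB)** and the **Flights Database (FlightsDB)** are related through the common key **`Loyalty#`**, which is meant to identify each customer uniquely. The unique-value counts in Section 4.5 show 16,757 distinct `Loyalty#` values across the 16,921 CustomerDB rows, so this uniqueness should be verified before merging.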

- In **CustomerDB**, each row represents a **unique customer**, including demographic details, loyalty status, and lifetime value information.  
- In **FlightsDB**, each row corresponds to a **monthly flight activity record** for a given customer, containing information about flights, distance, and loyalty points earned or redeemed.

This establishes a **one-to-many (1 → N) relationship**, where:  
> **One customer** can have **multiple flight activity records** across different months.

This relationship allows the integration of both datasets into a single, richer analytical table through a merge operation on the `Loyalty#` key.  
Such integration will enable a comprehensive customer view that combines **demographic, behavioral, and value-based attributes**, essential for the upcoming clustering analysis.
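
As a minimal sketch of that merge, the example below uses two tiny hypothetical frames (`customers`, `flights`) in place of the real tables. The `validate` and `indicator` arguments are our suggestion rather than part of the original workflow: `validate="one_to_many"` raises an error if `Loyalty#` repeats on the customer side, and `indicator=True` flags customers without any flight rows.

```python
import pandas as pd

# Toy stand-ins for CustomerDB and FlightsDB (values are illustrative only)
customers = pd.DataFrame({
    "Loyalty#": [101, 102, 103],
    "LoyaltyStatus": ["Star", "Nova", "Aurora"],
})
flights = pd.DataFrame({
    "Loyalty#": [101, 101, 102],
    "Year": [2021, 2021, 2021],
    "Month": [1, 2, 1],
    "NumFlights": [2.0, 0.0, 5.0],
})

# Left join keeps every customer; validate enforces the 1 -> N assumption
merged = customers.merge(
    flights,
    on="Loyalty#",
    how="left",
    validate="one_to_many",
    indicator=True,
)

# Customers found only on the left side have no flight activity rows
no_activity = merged.loc[merged["_merge"] == "left_only", "Loyalty#"].tolist()
print(merged.shape)
print("Customers without flight rows:", no_activity)
```

On the real data, a failing `validate` check would be a useful early warning that `Loyalty#` is not unique in CustomerDB.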

In [13]:
customer_df


Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.490930,...,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.282730,-123.120740,...,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,...,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.428730,-75.713364,...,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16916,15,100012,Ethan,Thompson,Ethan Thompson,Canada,Quebec,Quebec City,46.759733,-71.141009,...,male,Bachelor,Suburban,,Single,Star,2/27/2019,2/27/2019,,Standard
16917,16,100013,Layla,Young,Layla Young,Canada,Alberta,Edmonton,53.524829,-113.546357,...,female,Bachelor,Rural,,Married,Star,9/20/2017,9/20/2017,,Standard
16918,17,100014,Amelia,Bennett,Amelia Bennett,Canada,New Brunswick,Moncton,46.051866,-64.825428,...,male,Bachelor,Rural,,Married,Star,11/28/2020,11/28/2020,,Standard
16919,18,100015,Benjamin,Wilson,Benjamin Wilson,Canada,Quebec,Quebec City,46.862970,-71.133444,...,female,College,Urban,,Married,Star,4/9/2020,4/9/2020,,Standard


In [14]:
flights_df

Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
0,413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.00,0.0,0.0
1,464105,2021,12,12/1/2021,0.0,0.0,0.0,0.00,0.0,0.0
2,681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.00,0.0,0.0
3,185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.00,3213.0,32.0
4,216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
608431,999902,2019,12,12/1/2019,7.2,0.0,30766.5,3076.65,0.0,0.0
608432,999911,2019,12,12/1/2019,0.0,0.0,0.0,0.00,0.0,0.0
608433,999940,2019,12,12/1/2019,14.4,0.9,18261.0,1826.10,0.0,0.0
608434,999982,2019,12,12/1/2019,0.0,0.0,0.0,0.00,0.0,0.0


### Columns Overview and Dataset Preview

Before proceeding with deeper exploration, it is useful to examine the list of columns in each dataset and visually inspect some sample records.


This helps confirm that the data has been imported correctly, that column names are consistent with the metadata, and that the content follows the expected structure.

In [17]:
# Display all columns for each dataset
print("CustomerDB columns:\n", customer_df.columns.tolist())
print("\nFlightsDB columns:\n", flights_df.columns.tolist())

CustomerDB columns:
 ['Unnamed: 0', 'Loyalty#', 'First Name', 'Last Name', 'Customer Name', 'Country', 'Province or State', 'City', 'Latitude', 'Longitude', 'Postal code', 'Gender', 'Education', 'Location Code', 'Income', 'Marital Status', 'LoyaltyStatus', 'EnrollmentDateOpening', 'CancellationDate', 'Customer Lifetime Value', 'EnrollmentType']

FlightsDB columns:
 ['Loyalty#', 'Year', 'Month', 'YearMonthDate', 'NumFlights', 'NumFlightsWithCompanions', 'DistanceKM', 'PointsAccumulated', 'PointsRedeemed', 'DollarCostPointsRedeemed']


In [18]:
# Preview dataset in different positions for CustomerDB
print("\nFirst 5 rows (CustomerDB):")
display(customer_df.head())

print("Last 5 rows (CustomerDB):")
display(customer_df.tail())

# Preview dataset in different positions for FlightsDB
print("\nFirst 5 rows (FlightsDB):")
display(flights_df.head())

print("Last 5 rows (FlightsDB):")
display(flights_df.tail())


First 5 rows (CustomerDB):


Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,...,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,...,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,...,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,...,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


Last 5 rows (CustomerDB):


Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
16916,15,100012,Ethan,Thompson,Ethan Thompson,Canada,Quebec,Quebec City,46.759733,-71.141009,...,male,Bachelor,Suburban,,Single,Star,2/27/2019,2/27/2019,,Standard
16917,16,100013,Layla,Young,Layla Young,Canada,Alberta,Edmonton,53.524829,-113.546357,...,female,Bachelor,Rural,,Married,Star,9/20/2017,9/20/2017,,Standard
16918,17,100014,Amelia,Bennett,Amelia Bennett,Canada,New Brunswick,Moncton,46.051866,-64.825428,...,male,Bachelor,Rural,,Married,Star,11/28/2020,11/28/2020,,Standard
16919,18,100015,Benjamin,Wilson,Benjamin Wilson,Canada,Quebec,Quebec City,46.86297,-71.133444,...,female,College,Urban,,Married,Star,4/9/2020,4/9/2020,,Standard
16920,19,100016,Emma,Martin,Emma Martin,Canada,British Columbia,Dawson Creek,55.720562,-120.16009,...,female,Master,Suburban,,Single,Star,7/21/2020,7/21/2020,,Standard



First 5 rows (FlightsDB):


Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
0,413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.0,0.0,0.0
1,464105,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
2,681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.0,0.0,0.0
3,185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.0,3213.0,32.0
4,216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.0,0.0,0.0


Last 5 rows (FlightsDB):


Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
608431,999902,2019,12,12/1/2019,7.2,0.0,30766.5,3076.65,0.0,0.0
608432,999911,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
608433,999940,2019,12,12/1/2019,14.4,0.9,18261.0,1826.1,0.0,0.0
608434,999982,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
608435,999986,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0


# <span style="color:#0097b2">4. Data Exploration</span>

### 4.1 Data Types and Structure

In this section, we inspect the overall structure of both datasets using the `info()` method.  
This provides an overview of:
- The number of records and columns  
- The data types of each variable (`int`, `float`, `object`, etc.)  
- The presence of missing values (non-null count differences)

This analysis helps to identify which variables may need data type conversions and highlights potential columns requiring cleaning or imputation.

In [23]:
print("CustomerDB Info:")
customer_df.info()

CustomerDB Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16921 entries, 0 to 16920
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               16921 non-null  int64  
 1   Loyalty#                 16921 non-null  int64  
 2   First Name               16921 non-null  object 
 3   Last Name                16921 non-null  object 
 4   Customer Name            16921 non-null  object 
 5   Country                  16921 non-null  object 
 6   Province or State        16921 non-null  object 
 7   City                     16921 non-null  object 
 8   Latitude                 16921 non-null  float64
 9   Longitude                16921 non-null  float64
 10  Postal code              16921 non-null  object 
 11  Gender                   16921 non-null  object 
 12  Education                16921 non-null  object 
 13  Location Code            16921 non-null  object 
 14  Incom

#### Observations – Customer Database

- The **CustomerDB** dataset contains **16,921 records** and **21 columns**, each representing a unique customer.  
- Most variables are of type `object`, corresponding to categorical or text fields such as names, location, gender, and loyalty status.  
- Numeric variables (`Income`, `Customer Lifetime Value`, coordinates) are stored as `float64` or `int64`.  
- The date variables (`EnrollmentDateOpening`, `CancellationDate`) are currently stored as text (`object`) and will require conversion to `datetime` format.  
- **Missing values** were detected in:
  - `Income` → 20 missing entries  
  - `Customer Lifetime Value` → 20 missing entries  
  - `CancellationDate` → large proportion of missing values (≈86%), which is expected since most customers remain active.  
- The column `Unnamed: 0` is an index artifact and should be removed.  
- Overall, the dataset is well structured, with consistent data types and only minor cleaning required.
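
The two cleaning steps noted above (dropping the index artifact and converting the date columns) could look roughly like this. The sketch runs on a small hypothetical frame named `sample`, and the `%m/%d/%Y` format is an assumption based on the values visible in the preview (for example `2/15/2019`).

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the CustomerDB layout (illustrative subset of columns)
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Loyalty#": [480934, 549612, 429460],
    "EnrollmentDateOpening": ["2/15/2019", "3/9/2019", "7/14/2017"],
    "CancellationDate": [np.nan, np.nan, "1/8/2021"],
})

# Drop the exported index column; errors="ignore" keeps the cell re-runnable
cleaned = sample.drop(columns=["Unnamed: 0"], errors="ignore")

# Parse month/day/year strings; missing or unparseable values become NaT
for col in ["EnrollmentDateOpening", "CancellationDate"]:
    cleaned[col] = pd.to_datetime(cleaned[col], format="%m/%d/%Y", errors="coerce")

print(cleaned.dtypes)
```

With `errors="coerce"`, any malformed date silently becomes `NaT`, so it is worth comparing the missing-value count before and after the conversion.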



In [22]:
print("\nFlightsDB Info:")
flights_df.info()


FlightsDB Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608436 entries, 0 to 608435
Data columns (total 10 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Loyalty#                  608436 non-null  int64  
 1   Year                      608436 non-null  int64  
 2   Month                     608436 non-null  int64  
 3   YearMonthDate             608436 non-null  object 
 4   NumFlights                608436 non-null  float64
 5   NumFlightsWithCompanions  608436 non-null  float64
 6   DistanceKM                608436 non-null  float64
 7   PointsAccumulated         608436 non-null  float64
 8   PointsRedeemed            608436 non-null  float64
 9   DollarCostPointsRedeemed  608436 non-null  float64
dtypes: float64(6), int64(3), object(1)
memory usage: 46.4+ MB


#### Observations – Flights Database

- The **FlightsDB** dataset contains **608,436 records** and **10 columns**, representing monthly flight activity for each customer.  
- The common identifier `Loyalty#` is present, enabling a future merge with the CustomerDB.  
- No variable has **missing values**, so the dataset is complete at the cell level.  
- The column `YearMonthDate` is stored as `object` but should be converted to a `datetime` type for easier temporal analysis.  
- The dataset is mostly numeric, making it suitable for aggregation and feature engineering (e.g., total distance, total points).  
- Given its size, this dataset likely includes multiple records per customer.
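
To illustrate the kind of aggregation mentioned above, here is a minimal sketch that collapses monthly rows into one row per customer. The frame `monthly` and the output column names (`total_flights`, `active_months`, and so on) are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy monthly activity rows (illustrative values only)
monthly = pd.DataFrame({
    "Loyalty#": [101, 101, 102, 102],
    "NumFlights": [2.0, 0.0, 10.0, 6.0],
    "DistanceKM": [9384.0, 0.0, 14745.0, 8000.0],
    "PointsAccumulated": [938.0, 0.0, 1474.0, 800.0],
})

# One row per customer: totals plus the number of months with any flight
per_customer = monthly.groupby("Loyalty#").agg(
    total_flights=("NumFlights", "sum"),
    total_distance_km=("DistanceKM", "sum"),
    total_points=("PointsAccumulated", "sum"),
    active_months=("NumFlights", lambda s: int((s > 0).sum())),
)
print(per_customer)
```

Duplicate rows would inflate these totals, so any deduplication should happen before aggregating.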


### 4.2 Descriptive Statistics

We use the `describe()` method with `include="all"` to summarize every column: mean, standard deviation, minimum, maximum, and quartiles for numerical variables, and the count, number of unique values, most frequent value, and its frequency for categorical ones.


This provides a general understanding of the data distribution and allows the identification of potential outliers or inconsistent values.

In [24]:
print("CustomerDB - Full Summary:")
display(customer_df.describe(include="all").T)

print("\nFlightsDB - Full Summary:")
display(flights_df.describe(include="all").T)

CustomerDB - Full Summary:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Unnamed: 0,16921.0,,,,8440.023639,4884.775439,0.0,4210.0,8440.0,12670.0,16900.0
Loyalty#,16921.0,,,,550197.393771,259251.503597,100011.0,326823.0,550896.0,772438.0,999999.0
First Name,16921.0,4941.0,Deon,13.0,,,,,,,
Last Name,16921.0,15404.0,Salberg,4.0,,,,,,,
Customer Name,16921.0,16921.0,Cecilia Householder,1.0,,,,,,,
Country,16921.0,1.0,Canada,16921.0,,,,,,,
Province or State,16921.0,11.0,Ontario,5468.0,,,,,,,
City,16921.0,29.0,Toronto,3390.0,,,,,,,
Latitude,16921.0,,,,47.1745,3.307971,42.984924,44.231171,46.087818,49.28273,60.721188
Longitude,16921.0,,,,-91.814768,22.242429,-135.05684,-120.23766,-79.383186,-74.596184,-52.712578



FlightsDB - Full Summary:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Loyalty#,608436.0,,,,550037.873084,258935.180575,100018.0,326961.0,550834.0,772194.0,999986.0
Year,608436.0,,,,2020.0,0.816497,2019.0,2019.0,2020.0,2021.0,2021.0
Month,608436.0,,,,6.5,3.452055,1.0,3.75,6.5,9.25,12.0
YearMonthDate,608436.0,36.0,12/1/2021,16901.0,,,,,,,
NumFlights,608436.0,,,,3.908107,5.057889,0.0,0.0,0.0,7.2,21.0
NumFlightsWithCompanions,608436.0,,,,0.983944,2.003785,0.0,0.0,0.0,0.9,11.0
DistanceKM,608436.0,,,,7939.341419,10260.421873,0.0,0.0,856.4,15338.175,42040.0
PointsAccumulated,608436.0,,,,793.777781,1025.918521,0.0,0.0,85.275,1533.7125,4204.0
PointsRedeemed,608436.0,,,,235.251678,983.233374,0.0,0.0,0.0,0.0,7496.0
DollarCostPointsRedeemed,608436.0,,,,2.324835,9.725168,0.0,0.0,0.0,0.0,74.0


#### Observations – Descriptive Statistics

##### Customer Database (`CustomerDB`)
- The dataset includes **16,921 customer records**, all located in **Canada**, across **11 provinces and territories** and **29 cities**, with **Toronto** being the most represented city (3,390 records).  
- The gender distribution is balanced, with approximately **8.5k females** and **8.4k males**.  
- **Education:** most customers hold a **Bachelor’s degree** (10,586), followed by College and Master levels.  
- **Location Code:** the largest group lives in **Suburban areas** (5,716, about 34%), with fewer in each of the Urban and Rural zones.  
- **Income:** mean ≈ **37,758 USD**, median ≈ 34,000 USD. The column includes **zero values**, which may represent missing or undeclared income rather than a true income of zero.  
- **Customer Lifetime Value (CLV):** average ≈ **7,990 USD**, with a large standard deviation (≈6,800), suggesting high variability in customer value.  
- **LoyaltyStatus:** the **Star tier** is the largest (7,761 customers, about 46%), followed by Nova and Aurora, showing an unbalanced distribution.  
- **CancellationDate:** only **2,310 customers** have a cancellation recorded (~14%), indicating most are still active.  
- **EnrollmentType:** mainly **Standard enrollments** (≈93%), with a smaller fraction under promotions.  

Overall, the Customer Database is consistent, with minor missing data and some skewness in economic variables (Income and CLV).
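
The zero-income share and the skewness mentioned above can be quantified directly. The sketch below uses a small hypothetical frame (`sample`) with made-up values; only the column names match CustomerDB.

```python
import numpy as np
import pandas as pd

# Toy income and CLV values (illustrative only)
sample = pd.DataFrame({
    "Income": [0.0, 0.0, 34000.0, 70146.0, 97832.0, np.nan],
    "Customer Lifetime Value": [3839.14, 3839.61, 3842.79, 7990.0, 25000.0, np.nan],
})

# Share of zero incomes among rows where income is recorded
zero_share = (sample["Income"] == 0).sum() / sample["Income"].notna().sum()

# Sample skewness per column; positive values indicate a long right tail
skewness = sample[["Income", "Customer Lifetime Value"]].skew()

print(f"Zero-income share: {zero_share:.1%}")
print(skewness)
```

A large zero share would argue for treating zeros as a separate category or as missing, rather than as genuine income values.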

---

##### Flights Database (`FlightsDB`)
- The dataset records **608,436 monthly activity rows** over **3 years (2019–2021)**, with each record tied to a specific customer (`Loyalty#`).  
- **Flights per month:** average ≈ 3.9, but median = 0, indicating that at least half of the customer-month records have no flights. `NumFlights` also contains non-integer values (for example 7.2), which is unexpected for a count and should be investigated.  
- **Flights with companions:** low mean (≈ 1) against a mean of ≈ 3.9 flights, suggesting that roughly one in four flights involves companions.  
- **DistanceKM:** highly variable, average ≈ **7,939 km** (standard deviation ≈ 10,260 km), ranging up to 42,040 km in a single month, reflecting a mix of inactive months, short-haul, and long-haul travel.  
- **PointsAccumulated vs PointsRedeemed:** average earned points ≈ 794, redeemed ≈ 235, so on average customers accumulate more points than they redeem.  
- **DollarCostPointsRedeemed:** average ≈ 2.3 USD, with a 75th percentile of 0, confirming that most monthly records involve no redemption.  

These statistics highlight that flight behavior is **uneven**, with many inactive customer-months and a smaller share of high-activity records. This variability will be crucial for identifying behavioral clusters in the next phase.

### 4.3 Check for duplicates

In [25]:
# Check for duplicates in both datasets
duplicates_customer = customer_df.duplicated().sum()
duplicates_flights = flights_df.duplicated().sum()

print(f"Number of duplicate rows in CustomerDB: {duplicates_customer}")
print(f"Number of duplicate rows in FlightsDB: {duplicates_flights}")

Number of duplicate rows in CustomerDB: 0
Number of duplicate rows in FlightsDB: 2903


### 4.4 Missing Values Analysis

Missing values can arise from incomplete data collection, optional fields, or customers with no recorded activity.  
Before handling them, it is important to identify which variables contain missing data and their respective counts.

Here, all blank or empty strings are replaced with `NaN` to ensure consistent treatment of missing values.  
We then calculate the total number of missing entries per column in both datasets.


In [26]:
# Replace blank or empty strings with NaN
customer_df.replace(["", " "], np.nan, inplace=True)
flights_df.replace(["", " "], np.nan, inplace=True)

# Check missing values
missing_customer = customer_df.isna().sum()
missing_flights = flights_df.isna().sum()

print("Missing values in CustomerDB:\n", missing_customer[missing_customer > 0].sort_values(ascending=False))
print("\nMissing values in FlightsDB:\n", missing_flights[missing_flights > 0].sort_values(ascending=False))


Missing values in CustomerDB:
 CancellationDate           14611
Income                        20
Customer Lifetime Value       20
dtype: int64

Missing values in FlightsDB:
 Series([], dtype: int64)


- **CustomerDB:**  
  - The column `CancellationDate` has **14,611 missing values** (~86% of the dataset). This is expected, as most customers remain active and therefore have no cancellation record.  
  - Both `Income` and `Customer Lifetime Value` have **20 missing values** each, a negligible proportion that can later be imputed (e.g., using median values).  
  - Overall, the dataset has few missing entries aside from the cancellation field, which will not negatively impact the analysis.

- **FlightsDB:**  
  - No missing values were detected, indicating that the flight activity data is complete and consistent.

These results confirm that the data quality is generally high, with only minor imputation needed for economic variables in the Customer Database.
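
A minimal sketch of the median imputation suggested above, applied to a hypothetical frame named `sample`. Whether zero incomes should first be treated as missing is an open choice that this example does not make.

```python
import numpy as np
import pandas as pd

# Toy rows with one missing value per column (illustrative only)
sample = pd.DataFrame({
    "Income": [70146.0, 0.0, 97832.0, np.nan],
    "Customer Lifetime Value": [3839.14, 3839.61, 3842.79, np.nan],
})

cols = ["Income", "Customer Lifetime Value"]

# Column medians; NaN values are skipped by default
medians = sample[cols].median()

# Fill each column with its own median, leaving the original frame untouched
imputed = sample.copy()
imputed[cols] = imputed[cols].fillna(medians)

print(imputed)
```

Working on a copy keeps the raw values available for comparison with the imputed ones.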

### 4.5 Unique Values Analysis

Examining the number of unique values per column helps identify:

- Variables with low variability (potential categorical features or constants);  
- High-cardinality columns (e.g., unique IDs or names);  
- Possible redundant fields.

This analysis also provides insights into which variables are suitable for grouping, encoding, or feature selection.
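
The three groups listed above can be separated programmatically. The sketch below is one possible approach on a hypothetical frame (`sample`); the threshold `LOW_CARD_MAX` is an arbitrary value picked for the toy data, not a recommendation.

```python
import pandas as pd

# Toy frame with one constant, one ID-like and two low-cardinality columns
sample = pd.DataFrame({
    "Loyalty#": [1, 2, 3, 4],
    "Country": ["Canada"] * 4,
    "Gender": ["female", "male", "male", "female"],
    "City": ["Toronto", "Edmonton", "Toronto", "Hull"],
})

LOW_CARD_MAX = 3  # hypothetical cut-off for "low variability"

n_unique = sample.nunique()
n_rows = len(sample)

constant_cols = n_unique[n_unique == 1].index.tolist()       # no information
id_like_cols = n_unique[n_unique == n_rows].index.tolist()   # one value per row
low_card_cols = n_unique[(n_unique > 1) & (n_unique <= LOW_CARD_MAX)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
print("Low cardinality:", low_card_cols)
```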

In [27]:
# Count unique values in each column
unique_customer = customer_df.nunique().sort_values(ascending=False)
unique_flights = flights_df.nunique().sort_values(ascending=False)

print("Unique values in CustomerDB:\n", unique_customer)
print("\nUnique values in FlightsDB:\n", unique_flights)

Unique values in CustomerDB:
 Customer Name              16921
Unnamed: 0                 16901
Loyalty#                   16757
Last Name                  15404
Customer Lifetime Value     7996
Income                      5694
First Name                  4941
EnrollmentDateOpening       2449
CancellationDate            1260
Postal code                   75
Longitude                     49
Latitude                      49
City                          29
Province or State             11
Education                      5
Location Code                  3
Marital Status                 3
LoyaltyStatus                  3
Gender                         2
EnrollmentType                 2
Country                        1
dtype: int64

Unique values in FlightsDB:
 DistanceKM                  66762
PointsAccumulated           37064
Loyalty#                    16737
PointsRedeemed               8146
DollarCostPointsRedeemed      104
NumFlights                     41
YearMonthDate                 

In [28]:
# Identify categorical columns automatically (dtype = object)
categorical_cols_customer = customer_df.select_dtypes(include='object').columns
categorical_cols_flights = flights_df.select_dtypes(include='object').columns

print("Categorical columns in CustomerDB:")
print(categorical_cols_customer.tolist())

print("\nCategorical columns in FlightsDB:")
print(categorical_cols_flights.tolist())


Categorical columns in CustomerDB:
['First Name', 'Last Name', 'Customer Name', 'Country', 'Province or State', 'City', 'Postal code', 'Gender', 'Education', 'Location Code', 'Marital Status', 'LoyaltyStatus', 'EnrollmentDateOpening', 'CancellationDate', 'EnrollmentType']

Categorical columns in FlightsDB:
['YearMonthDate']


In [30]:
# Inspect unique values for all categorical columns in CustomerDB
for col in categorical_cols_customer:
    print(f"\nUnique values in {col} ({customer_df[col].nunique()}):")
    print(customer_df[col].unique())

# And for FlightsDB
for col in categorical_cols_flights:
    print(f"\nUnique values in {col} ({flights_df[col].nunique()}):")
    print(flights_df[col].unique())



Unique values in First Name (4941):
['Cecilia' 'Dayle' 'Necole' ... 'Juliann' 'Olivia' 'Liam']

Unique values in Last Name (15404):
['Householder' 'Menez' 'Hannon' ... 'Bennett' 'Wilson' 'Martin']

Unique values in Customer Name (16921):
['Cecilia Householder' 'Dayle Menez' 'Necole Hannon' ... 'Amelia Bennett'
 'Benjamin Wilson' 'Emma Martin']

Unique values in Country (1):
['Canada']

Unique values in Province or State (11):
['Ontario' 'Alberta' 'British Columbia' 'Quebec' 'Yukon' 'New Brunswick'
 'Manitoba' 'Nova Scotia' 'Saskatchewan' 'Newfoundland'
 'Prince Edward Island']

Unique values in City (29):
['Toronto' 'Edmonton' 'Vancouver' 'Hull' 'Whitehorse' 'Trenton' 'Montreal'
 'Dawson Creek' 'Quebec City' 'Moncton' 'Fredericton' 'Ottawa' 'Tremblant'
 'Calgary' 'Whistler' 'Thunder Bay' 'Peace River' 'Winnipeg' 'Sudbury'
 'West Vancouver' 'Halifax' 'London' 'Victoria' 'Regina' 'Kelowna'
 "St. John's" 'Kingston' 'Banff' 'Charlottetown']

Unique values in Postal code (75):
['M2Z 4K1' '

## <span style="color:#0097b2">4.X FlightsDB — Deep Exploration</span>

The goal of this section is to investigate the quality, consistency, and behavioral patterns in `FlightsDB`, raising hypotheses and potential anomalies for discussion with the team before any cleanup.


### 4.X.1 Temporal coverage & parsing

We validate the time period, ensure that `YearMonthDate` is in datetime format, and evaluate coverage by month.


In [33]:
# Parse YearMonthDate and create YearMonth column
flights_df['YearMonthDate'] = pd.to_datetime(flights_df['YearMonthDate'])
flights_df['YearMonth'] = flights_df['Year'].astype(str) + '-' + flights_df['Month'].astype(str).str.zfill(2)

# Coverage by Year/Month
print("Years:", sorted(flights_df['Year'].unique()))
print("Months:", sorted(flights_df['Month'].unique()))
coverage = flights_df['YearMonth'].value_counts().sort_index()
display(coverage.head(40))

Years: [np.int64(2019), np.int64(2020), np.int64(2021)]
Months: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12)]


YearMonth
2019-01    16901
2019-02    16901
2019-03    16901
2019-04    16901
2019-05    16901
2019-06    16901
2019-07    16901
2019-08    16901
2019-09    16901
2019-10    16901
2019-11    16901
2019-12    16901
2020-01    16901
2020-02    16901
2020-03    16901
2020-04    16901
2020-05    16901
2020-06    16901
2020-07    16901
2020-08    16901
2020-09    16901
2020-10    16901
2020-11    16901
2020-12    16901
2021-01    16901
2021-02    16901
2021-03    16901
2021-04    16901
2021-05    16901
2021-06    16901
2021-07    16901
2021-08    16901
2021-09    16901
2021-10    16901
2021-11    16901
2021-12    16901
Name: count, dtype: int64

#### Observations – Temporal Coverage

- The `FlightsDB` dataset spans a **three-year period (2019–2021)**, with data for all 12 months of each year (36 months in total).  
- Each month contains exactly **16,901 records** (36 × 16,901 = 608,436 rows), 20 fewer than the **16,921 rows** in the `CustomerDB`.  
  The unique-value counts in Section 4.5 point the same way: 16,737 distinct `Loyalty#` values in `FlightsDB` versus 16,757 in `CustomerDB`, so roughly **20 customers appear to have no flight records** in the period. This should be confirmed by comparing the keys directly.  
- The constant monthly record count suggests that the dataset was **systematically generated**, with rows for every customer in every month, even when the customer had no flights (represented by zeros in the activity columns).  
- This structure makes it possible to measure both **activity** and **inactivity patterns** across months and years.  
- No missing months were detected. However, each month has more rows (16,901) than there are distinct `Loyalty#` values (16,737), so some customer-month keys must occur more than once; these duplicates are examined separately.

Overall, the temporal coverage is **complete and balanced**, which supports longitudinal analyses of customer flight activity once the repeated customer-month records are resolved.
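
To check coverage at the customer level, one option is to compare the key sets of both tables and to count rows against distinct months per customer. The sketch below uses small hypothetical frames (`customers`, `flights`) and a helper column `period` introduced only for this example.

```python
import pandas as pd

# Toy stand-ins for CustomerDB and FlightsDB (illustrative only)
customers = pd.DataFrame({"Loyalty#": [101, 102, 103]})
flights = pd.DataFrame({
    "Loyalty#": [101, 101, 102, 102, 102],
    "Year": [2021, 2021, 2021, 2021, 2021],
    "Month": [1, 2, 1, 2, 2],  # customer 102 has February twice
})

# Customers with no flight rows at all
missing_ids = sorted(set(customers["Loyalty#"]) - set(flights["Loyalty#"]))

# Rows vs distinct months per customer; a mismatch signals repeated keys
per_customer = (
    flights.assign(period=flights["Year"] * 100 + flights["Month"])
    .groupby("Loyalty#")["period"]
    .agg(rows="count", distinct_months="nunique")
)
repeated = per_customer[per_customer["rows"] > per_customer["distinct_months"]].index.tolist()

print("Customers without flight rows:", missing_ids)
print("Customers with repeated months:", repeated)
```

On the full data, `distinct_months` would also reveal any customer with fewer than 36 months of records.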

### 4.X.2 Duplicate Record Analysis

To ensure data integrity, two complementary checks were performed to identify potential duplicate records in the `FlightsDB`:

1. **Exact Duplicates:**  
   Rows that are fully identical across all columns. These typically arise from accidental data replication or export issues.  

2. **Logical Duplicates:**  
   Multiple entries for the same customer (`Loyalty#`) in the same month and year (`Year`, `Month`).  
   Since each customer should have only one record per month, these cases may indicate repeated or inconsistent data entries.


In [None]:
# --- Exact duplicates ---
exact_duplicates = flights_df.duplicated().sum()
print(f"Number of exact duplicate rows: {exact_duplicates}")

# --- Logical duplicates (by customer-month) ---
key_counts = flights_df.groupby(['Loyalty#','Year','Month']).size().reset_index(name='n')
dups_key = key_counts[key_counts['n'] > 1].sort_values('n', ascending=False)
print(f"Customer-month keys with duplicates: {len(dups_key)}")
display(dups_key.head(10))


Number of exact duplicate rows: 2903
Customer-month keys with duplicates: 5868


Unnamed: 0,Loyalty#,Year,Month,n
387643,678205,2021,8,3
387633,678205,2020,10,3
387623,678205,2019,12,3
387624,678205,2020,1,3
387625,678205,2020,2,3
387626,678205,2020,3,3
387627,678205,2020,4,3
387628,678205,2020,5,3
387629,678205,2020,6,3
387630,678205,2020,7,3


- Are these true duplicates?
- Or are they several trips in the same month that the system recorded as separate rows?

In [40]:
# Check all records for this specific customer and month
flights_df[(flights_df['Loyalty#'] == 678205) & 
            (flights_df['Year'] == 2020) & 
            (flights_df['Month'] == 1)]


Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed,YearMonth
75663,678205,2020,1,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,2020-01
80590,678205,2020,1,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,2020-01
82412,678205,2020,1,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,2020-01


#### Observations – Detailed Inspection of Duplicated Records

- The table above shows all entries for **customer 678205** in **January 2020** (`Year = 2020`, `Month = 1`).  
- All three rows are **identical across every column**, with zero values for flights, distance, and points.  
- This confirms that these are **true duplicate records**, not separate flight activities.  
- Such cases likely originate from **data replication during dataset generation or export**, rather than genuine multiple transactions.  
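- Not every duplicated customer–month key can be an exact copy: only 2,903 rows are exact duplicates, yet 5,868 keys are duplicated, and a key whose rows are all identical contributes at least one exact duplicate. At least 2,965 keys (5,868 − 2,903) must therefore contain rows that differ in some column, so the remaining combinations should be inspected in the same way.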

These verified duplicates should be addressed during the data cleaning phase to ensure accurate aggregation and analysis of customer flight activity.
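To extend the single-customer inspection to every duplicated key, one option is to compare the number of rows per key with the number of distinct rows per key. This is a sketch on toy data; the helper `classify_duplicate_keys` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for flights_df (hypothetical values)
toy = pd.DataFrame({
    "Loyalty#":   [10, 10, 10, 20, 20, 30],
    "Year":       [2020, 2020, 2020, 2020, 2020, 2020],
    "Month":      [1, 1, 1, 1, 1, 1],
    "NumFlights": [0.0, 0.0, 0.0, 2.0, 5.0, 1.0],
})
key = ["Loyalty#", "Year", "Month"]

def classify_duplicate_keys(df, key):
    """For each duplicated key, report whether all of its rows are identical."""
    rows_per_key = df.groupby(key).size()
    # After dropping exact duplicates, a key with identical rows keeps exactly one row
    distinct_per_key = df.drop_duplicates().groupby(key).size()
    summary = pd.DataFrame({"rows": rows_per_key, "distinct_rows": distinct_per_key})
    summary = summary[summary["rows"] > 1].copy()
    summary["all_identical"] = summary["distinct_rows"] == 1
    return summary

summary = classify_duplicate_keys(toy, key)
print(summary)
```

Keys with `all_identical == True` can be deduplicated safely; the others need a decision (for example aggregation) during data cleaning.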


### 4.X.3 Consistency rules

We checked simple rules that **should not** be violated:
- `NumFlightsWithCompanions <= NumFlights`
- No negative values in `NumFlights`, `NumFlightsWithCompanions`, `DistanceKM`, `Points*`, `DollarCostPointsRedeemed`
- Consistency between points accumulated and distance, and between points redeemed and their cost in dollars

In [43]:
# Negatives?
neg_checks = {
    'NumFlights': (flights_df['NumFlights'] < 0).sum(),
    'NumFlightsWithCompanions': (flights_df['NumFlightsWithCompanions'] < 0).sum(),
    'DistanceKM': (flights_df['DistanceKM'] < 0).sum(),
    'PointsAccumulated': (flights_df['PointsAccumulated'] < 0).sum(),
    'PointsRedeemed': (flights_df['PointsRedeemed'] < 0).sum(),
    'DollarCostPointsRedeemed': (flights_df['DollarCostPointsRedeemed'] < 0).sum(),
}
print("Negative values count:", neg_checks)

# Flights with companions cannot exceed total flights
viol_comp = (flights_df['NumFlightsWithCompanions'] > flights_df['NumFlights']).sum()
print("Rows where companions > flights:", viol_comp)

# Points accumulated per km ratio (expected ~0.1 p/km)
ratio_pts_km = flights_df.loc[flights_df['DistanceKM'] > 0, 'PointsAccumulated'] / flights_df.loc[flights_df['DistanceKM'] > 0, 'DistanceKM']
print("Points per KM - quantiles:\n", ratio_pts_km.quantile([0.01,0.25,0.5,0.75,0.99]))

# Ratio $ per point redeemed (expected ~0.01 $/point if 100 pts = $1)
ratio_usd_pt = flights_df.loc[flights_df['PointsRedeemed'] > 0, 'DollarCostPointsRedeemed'] / flights_df.loc[flights_df['PointsRedeemed'] > 0, 'PointsRedeemed']
print("USD per redeemed point - quantiles:\n", ratio_usd_pt.quantile([0.01,0.25,0.5,0.75,0.99]))

# Incoherent cases: $>0 with PointsRedeemed=0, or PointsRedeemed>0 with $=0
# usd_no_points: there is a cost in dollars but no points redeemed → impossible.
# points_no_usd: there are points redeemed but zero cost → suspicious (could be a registration error or special promotion).
usd_no_points = ((flights_df['DollarCostPointsRedeemed'] > 0) & (flights_df['PointsRedeemed'] == 0)).sum()
points_no_usd = ((flights_df['PointsRedeemed'] > 0) & (flights_df['DollarCostPointsRedeemed'] == 0)).sum()
print("DollarCost>0 but PointsRedeemed=0:", usd_no_points)
print("PointsRedeemed>0 but DollarCost=0:", points_no_usd)

Negative values count: {'NumFlights': np.int64(0), 'NumFlightsWithCompanions': np.int64(0), 'DistanceKM': np.int64(0), 'PointsAccumulated': np.int64(0), 'PointsRedeemed': np.int64(0), 'DollarCostPointsRedeemed': np.int64(0)}
Rows where companions > flights: 0
Points per KM - quantiles:
 0.01    0.099383
0.25    0.099961
0.50    0.099987
0.75    0.100000
0.99    0.100000
dtype: float64
USD per redeemed point - quantiles:
 0.01    0.009644
0.25    0.009819
0.50    0.009883
0.75    0.009943
0.99    0.010000
dtype: float64
DollarCost>0 but PointsRedeemed=0: 0
PointsRedeemed>0 but DollarCost=0: 0


#### Observations - Consistency and Ratio Checks

- **No negative values** were found in any numerical variable (`NumFlights`, `DistanceKM`, `PointsAccumulated`, etc.), confirming that all recorded measures are logically valid.  
- **No cases** were found where `NumFlightsWithCompanions` exceeded `NumFlights`, indicating consistent reporting of companion flights.  
- The **points-per-kilometer ratio** is extremely stable across the dataset, with a median of **≈0.10 points per km** and minimal variation.  
  This confirms that the loyalty system applies a **fixed rule of roughly 0.1 point earned per kilometer flown**.  
- The **USD-per-point ratio** is also highly consistent, with a median of **≈0.0099 USD/point**.  
- No incoherent cases were detected:
  - No entries with a dollar cost but zero points redeemed.  
  - No entries with points redeemed but zero dollar value.  

Overall, the `FlightsDB` dataset shows **internal consistency**.
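Quantiles summarize the ratio but do not point to individual rows. As a complement, the sketch below flags rows whose points-per-km ratio deviates from the expected 0.1 by more than a relative tolerance. The function name `flag_ratio_deviations`, the 2% tolerance, and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for flights_df (hypothetical values)
toy = pd.DataFrame({
    "DistanceKM":        [1000.0, 2500.0, 0.0, 4000.0],
    "PointsAccumulated": [100.0, 249.0, 0.0, 520.0],
})

def flag_ratio_deviations(df, expected=0.1, rel_tol=0.02):
    """Return rows with distance > 0 whose points/km ratio is off by more than rel_tol."""
    flown = df["DistanceKM"] > 0
    ratio = df.loc[flown, "PointsAccumulated"] / df.loc[flown, "DistanceKM"]
    deviating = (ratio - expected).abs() > rel_tol * expected
    return df.loc[deviating[deviating].index]

flagged = flag_ratio_deviations(toy)
print(flagged)
```

The same pattern applies to the USD-per-point ratio by swapping the two columns and the expected value.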


In [44]:
zero_share = {
    'NumFlights_zero': (flights_df['NumFlights'] == 0).mean(),
    'DistanceKM_zero': (flights_df['DistanceKM'] == 0).mean(),
    'PointsAccumulated_zero': (flights_df['PointsAccumulated'] == 0).mean(),
    'PointsRedeemed_zero': (flights_df['PointsRedeemed'] == 0).mean(),
}
print("Share of zeros:", {k: round(v,3) for k,v in zero_share.items()})

# Redeeming points without flying in the same month is possible; quantify it
redeem_no_flights = ((flights_df['PointsRedeemed'] > 0) & (flights_df['NumFlights'] == 0)).mean()
print("Share of months with redemption but no flights:", round(redeem_no_flights,3))


Share of zeros: {'NumFlights_zero': np.float64(0.501), 'DistanceKM_zero': np.float64(0.491), 'PointsAccumulated_zero': np.float64(0.491), 'PointsRedeemed_zero': np.float64(0.942)}
Share of months with redemption but no flights: 0.0


### 4.X.5 Fractional flights?

We check whether `NumFlights` and `NumFlightsWithCompanions` contain **non-integer (decimal)** values.


In [46]:
# Calculate the percentage of fractional values in 'NumFlights'
# Checks how many records have non integer flight counts
frac_flights = (flights_df['NumFlights'] % 1 != 0).mean()

# Calculate the share of fractional values in 'NumFlightsWithCompanions'
# Verifies if there are non integer values in the number of flights with companions
frac_comp = (flights_df['NumFlightsWithCompanions'] % 1 != 0).mean()

print("Percentage of fractional NumFlights:", round(frac_flights,3))
print("Percentage of fractional NumFlightsWithCompanions:", round(frac_comp,3))


Percentage of fractional NumFlights: 0.147
Percentage of fractional NumFlightsWithCompanions: 0.083


#### Observations - Share of fractional values for NumFlights and NumFlightsWithCompanions

- 14.7% of `NumFlights` values and 8.3% of `NumFlightsWithCompanions` values are fractional, although flight counts should normally be integers.
- This suggests the dataset may include **averaged or normalized values**, possibly concentrated in the early years.
- These values are not necessarily errors.



In [None]:
# 1) What fractional steps are used? (e.g., 0.1, 0.2, 0.5)
frac_parts = (flights_df['NumFlights'] % 1).round(1)
print(frac_parts.value_counts(normalize=True).head(10))

# 2) Are fractions concentrated in specific years/months?
by_month_frac = (flights_df['NumFlights'] % 1 != 0).groupby([flights_df['Year'], flights_df['Month']]).mean()
print(by_month_frac.unstack().round(3))

# 3) Are fractions tied to specific customers (data generation artifact)?
cust_frac = (flights_df['NumFlights'] % 1 != 0).groupby(flights_df['Loyalty#']).mean()
print(cust_frac.describe().round(3))
print("Top customers with fractional counts:")
print(cust_frac.sort_values(ascending=False).head(10))


NumFlights
0.0    0.852818
0.9    0.022385
0.7    0.019849
0.8    0.019239
0.6    0.016054
0.5    0.015846
0.4    0.014534
0.3    0.014117
0.2    0.013287
0.1    0.011871
Name: proportion, dtype: float64
Month    1      2     3      4      5      6      7     8      9      10  \
Year                                                                       
2019   0.42  0.403  0.43  0.415  0.436  0.427  0.435  0.46  0.442  0.473   
2020   0.00  0.000  0.00  0.000  0.000  0.000  0.000  0.00  0.000  0.000   
2021   0.00  0.000  0.00  0.000  0.000  0.000  0.000  0.00  0.000  0.000   

Month     11     12  
Year                 
2019   0.474  0.485  
2020   0.000  0.000  
2021   0.000  0.000  
count    16737.000
mean         0.148
std          0.099
min          0.000
25%          0.028
50%          0.167
75%          0.222
max          0.333
Name: NumFlights, dtype: float64
Top customers with fractional counts:
Loyalty#
241144    0.333333
851901    0.333333
709747    0.333333
956861    0.3333

#### Observations – Fractional Values (`NumFlights`)

##### Fractional parts
- After rounding to one decimal, the fractional parts spread over all steps from 0.1 to 0.9, with shares between about 1.2% and 2.2% each.  
- About **85%** of all `NumFlights` values are integers (0.0), while **15%** are fractional.  

##### Fractional parts by year and month
- Fractional `NumFlights` occur **only in 2019**, with roughly **40–48%** of monthly records affected.  
- From **2020 onward**, all values are integers.
- This strongly suggests that **2019 was generated differently**, possibly using **averaged or normalized flight data**.

##### Fractional parts by customer
- On average, customers have **15%** of their records as fractional, matching the global proportion.  
- The top customers have a fractional share of `0.333`, which corresponds to **12 fractional months** out of 36 (one full year).  
- This is consistent with fractional values being confined to **2019**, while later data are integer-based.  
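Because the fractional parts were rounded to one decimal before counting, the 0.1 steps could partly be an effect of the rounding itself. A minimal sketch of a check that avoids rounding is shown below; the helper `on_tenth_grid`, the tolerance, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): 1.25 is the only one not on a 0.1 grid
values = pd.Series([0.0, 3.0, 2.7, 5.4, 1.25, 7.9])

def on_tenth_grid(s, atol=1e-9):
    """True where a value is a multiple of 0.1, up to floating-point tolerance."""
    scaled = s * 10
    return pd.Series(np.isclose(scaled, scaled.round(), rtol=0, atol=atol), index=s.index)

# Values that would be hidden by .round(1) if they existed in the data
off_grid = values[~on_tenth_grid(values)]
print(off_grid)
```

If `off_grid` is empty on the real column, the one-decimal steps are a property of the data rather than of the rounding.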




In [51]:
# 1) Fractional steps for NumFlightsWithCompanions
frac_parts_comp = (flights_df['NumFlightsWithCompanions'] % 1).round(1)
print("Fractional parts (NumFlightsWithCompanions):")
print(frac_parts_comp.value_counts(normalize=True).head(10))

# 2) Are fractions concentrated in specific years/months?
by_month_frac_comp = (flights_df['NumFlightsWithCompanions'] % 1 != 0).groupby([flights_df['Year'], flights_df['Month']]).mean()
print("\nShare of fractional values by Year/Month (NumFlightsWithCompanions):")
print(by_month_frac_comp.unstack().round(3))

# 3) Are fractions tied to specific customers?
cust_frac_comp = (flights_df['NumFlightsWithCompanions'] % 1 != 0).groupby(flights_df['Loyalty#']).mean()
print("\nSummary by customer (NumFlightsWithCompanions):")
print(cust_frac_comp.describe().round(3))
print("Top customers with fractional companion flight counts:")
print(cust_frac_comp.sort_values(ascending=False).head(10))


Fractional parts (NumFlightsWithCompanions):
NumFlightsWithCompanions
0.0    0.916504
0.8    0.014194
0.9    0.014090
0.7    0.013985
0.6    0.011732
0.5    0.009759
0.4    0.008553
0.3    0.005766
0.2    0.003189
0.1    0.002229
Name: proportion, dtype: float64

Share of fractional values by Year/Month (NumFlightsWithCompanions):
Month     1      2      3      4      5      6      7     8      9      10  \
Year                                                                         
2019   0.238  0.236  0.238  0.247  0.243  0.241  0.238  0.26  0.265  0.265   
2020   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.00  0.000  0.000   
2021   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.00  0.000  0.000   

Month     11    12  
Year                
2019   0.265  0.27  
2020   0.000  0.00  
2021   0.000  0.00  

Summary by customer (NumFlightsWithCompanions):
count    16737.000
mean         0.084
std          0.064
min          0.000
25%          0.000
50%          0.083
75%     

#### Observations – Fractional Values (`NumFlightsWithCompanions`)

##### Fractional parts
- About **91.6%** of `NumFlightsWithCompanions` values are integers, and about **8.3%** are fractional (again spread over one-decimal steps after rounding).  
- Fractional steps most likely indicate **averaged companion activity** for 2019, consistent with `NumFlights`.

##### Fractional parts by year and month
- Fractional values for `NumFlightsWithCompanions` also occur **only in 2019**, with **24–27%** of monthly records affected.  
- Years **2020 and 2021** contain only integer counts.  

##### Fractional parts by customer
- On average, **8.4%** of each customer’s monthly records show fractional companion counts.  
- Top customers (≈0.28–0.31, roughly 10–11 of 36 monthly records) have fractional values for almost all 12 months of 2019. 




### 4.X.6 Monthly outliers

Identify months with extreme values in `NumFlights`, `DistanceKM` and `Points*` (top 0.1%).


In [54]:
# Find the threshold for extreme values in a numeric column and count the rows at or above it
def top_outliers(s, p=0.999):
    # Threshold at the p-th quantile (default: 99.9th percentile, i.e. the top 0.1%)
    thr = s.quantile(p)
    # Count values at or above the threshold; rows tied with the threshold are included
    return thr, int((s >= thr).sum())

# Apply the function to key numeric columns
for col in ['NumFlights', 'DistanceKM', 'PointsAccumulated', 'PointsRedeemed', 'DollarCostPointsRedeemed']:
    # For each column, get the threshold and the number of extreme values
    thr, n = top_outliers(flights_df[col])
    # Print results in a clean, readable format
    print(f"{col}: threshold p99.9={thr:.2f} | rows >= thr: {n}")

NumFlights: threshold p99.9=20.00 | rows >= thr: 1684
DistanceKM: threshold p99.9=40080.70 | rows >= thr: 609
PointsAccumulated: threshold p99.9=4007.57 | rows >= thr: 609
PointsRedeemed: threshold p99.9=6662.00 | rows >= thr: 610
DollarCostPointsRedeemed: threshold p99.9=66.00 | rows >= thr: 703


#### Observations – Extreme Value Detection (Top 0.1%)

To detect potential outliers, the 99.9th percentile (`p = 0.999`) was computed for key numeric variables.  
This percentile defines a **threshold** above which at most **0.1% of records** lie; these are the most extreme values in the dataset. The counts use `>=`, so rows tied with the threshold are included.

| Variable | 99.9th Percentile (Threshold) | Records at or Above Threshold | Interpretation |
|-----------|-------------------------------|--------------------------|----------------|
| **NumFlights** | 20.00 | 1,684 | Very frequent flyers (≥ 20 flights in a month). The count is well above the ≈610 rows of the other variables, which points to many rows tied at exactly 20. |
| **DistanceKM** | 40,080.7 | 609 | Top 0.1% of monthly distances are above ~40,000 km. |
| **PointsAccumulated** | 4,007.57 | 609 | Consistent with distance (≈0.1 points per km). |
| **PointsRedeemed** | 6,662.00 | 610 | High redemption volumes, possibly elite-tier customers. |
| **DollarCostPointsRedeemed** | 66.00 | 703 | Monetary equivalent of top redemptions (~$66). |

**Interpretation:**
- These thresholds help to **identify extreme but legitimate cases**: heavy travelers, big spenders, or top members.  
- For `DistanceKM`, `PointsAccumulated`, and `PointsRedeemed`, the counts (≈610 rows) are in line with the expected **0.1% of the dataset**. `NumFlights` (1,684 rows) and `DollarCostPointsRedeemed` (703 rows) exceed that share, most likely because many rows tie at the threshold value.  
- No anomalies or unrealistic magnitudes were detected, suggesting that **the data are scaled realistically**, even at the tails.
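The effect of ties on a `>=` count at a quantile threshold can be illustrated on synthetic data. The sketch below uses a hypothetical helper `tail_shares` and two toy series (one discrete with a heavy tie, one with all-distinct values); it is an illustration of the mechanism, not a reproduction of the project figures.

```python
import pandas as pd

def tail_shares(s, p=0.999):
    """Return the p-quantile threshold and the shares of values >= and > it."""
    thr = s.quantile(p)
    return {
        "threshold": thr,
        "share_at_or_above": (s >= thr).mean(),
        "share_strictly_above": (s > thr).mean(),
    }

# Discrete toy series: 10 of 1,000 values tie at the maximum of 20
discrete = pd.Series([0] * 990 + [20] * 10)
# Toy series with all-distinct values 0..999
distinct = pd.Series(range(1000))

res_discrete = tail_shares(discrete)
res_distinct = tail_shares(distinct)
print(res_discrete)
print(res_distinct)
```

With distinct values the `>=` share stays close to `1 - p`, while a tie at the threshold inflates it; comparing `share_at_or_above` with `share_strictly_above` makes the difference visible.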


# <span style="color:#0097b2">4. Data Preprocessing</span>

## Duplicates

In [4]:
# Check for duplicates in both datasets
duplicates_customer = customer_df.duplicated().sum()
duplicates_flights = flights_df.duplicated().sum()

print(f"Number of duplicate rows in CustomerDB: {duplicates_customer}")
print(f"Number of duplicate rows in FlightsDB: {duplicates_flights}")

Number of duplicate rows in CustomerDB: 0
Number of duplicate rows in FlightsDB: 2903


## Set Index

## Changing Values

## Metric and Non Metric Features

## Missing Values

## Data Type Correction

## Feature Engineering

## New Features Analysis

## Outliers

## Variable Selection

## Redundancy (Perspectives)

## Scaling

## Encoding (extra??)