# Data Mining Project

## Index
1. [Introduction](#introduction)
2. [Data Collection](#data-collection)
3. [Data Preprocessing](#data-preprocessing)
4. [Exploratory Data Analysis (EDA)](#eda)
5. [Feature Engineering](#feature-engineering)
6. [Model Selection and Training](#model-selection)
7. [Model Evaluation](#model-evaluation)
8. [Conclusion](#conclusion)  

1. <a name="introduction"></a>Introduction
   - Overview of the project
   - Objectives and goals

2. <a name="data-collection"></a>Data Collection
   - Description of data sources
   - Methods of data collection

3. <a name="data-preprocessing"></a>Data Preprocessing
   - Data cleaning techniques
   - Handling missing values
   - Data transformation and normalization

4. <a name="eda"></a>Exploratory Data Analysis (EDA)
   - Summary statistics
   - Data visualization techniques
   - Identifying patterns and trends

5. <a name="feature-engineering"></a>Feature Engineering
   - Feature selection methods
   - Creating new features
   - Dimensionality reduction techniques

6. <a name="model-selection"></a>Model Selection and Training
   - Overview of algorithms considered
   - Training procedures
   - Hyperparameter tuning

7. <a name="model-evaluation"></a>Model Evaluation
   - Evaluation metrics
   - Cross-validation results
   - Comparison of model performance

8. <a name="conclusion"></a>Conclusion
   - Summary of findings
   - Future work and improvements

In [2]:
import sqlite3
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

from itertools import product
from scipy.stats import skewnorm

from datetime import datetime
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

## DATA WRANGLING COSTUMERS

In [3]:
customers = pd.read_csv("https://raw.githubusercontent.com/catamina07/datamining-group/main/data/DM_AIAI_CustomerDB.csv")
customers.head()

Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,...,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,...,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,...,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,...,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


In [None]:
# Drop the first column which is an unnecessary index column
customers.drop(columns=customers.columns[0], inplace=True)

In [6]:
print(customers.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16921 entries, 0 to 16920
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Loyalty#                 16921 non-null  int64  
 1   First Name               16921 non-null  object 
 2   Last Name                16921 non-null  object 
 3   Customer Name            16921 non-null  object 
 4   Country                  16921 non-null  object 
 5   Province or State        16921 non-null  object 
 6   City                     16921 non-null  object 
 7   Latitude                 16921 non-null  float64
 8   Longitude                16921 non-null  float64
 9   Postal code              16921 non-null  object 
 10  Gender                   16921 non-null  object 
 11  Education                16921 non-null  object 
 12  Location Code            16921 non-null  object 
 13  Income                   16901 non-null  float64
 14  Marital Status        

In [7]:
customers.isna().sum()

Loyalty#                       0
First Name                     0
Last Name                      0
Customer Name                  0
Country                        0
Province or State              0
City                           0
Latitude                       0
Longitude                      0
Postal code                    0
Gender                         0
Education                      0
Location Code                  0
Income                        20
Marital Status                 0
LoyaltyStatus                  0
EnrollmentDateOpening          0
CancellationDate           14611
Customer Lifetime Value       20
EnrollmentType                 0
dtype: int64

In [7]:
# Total number of rows
total_rows = len(costumers)

# Columns to check with NaN values that are critical
columns_to_check = ['Income', 'Customer Lifetime Value']

# Count how many rows have NaN in any of these columns
rows_with_nulls = costumers[columns_to_check].isnull().any(axis=1).sum()

# Percentage of rows that would be removed
percent_rows = rows_with_nulls / total_rows * 100
print(f"Rows to remove: {rows_with_nulls} ({percent_rows:.2f}%)")

# Drop rows only if less than 5% of data
if percent_rows < 5:
    costumers = costumers.dropna(subset=columns_to_check)
    print("Rows removed.")
else:
    print("Not removing rows, they represent more than 5% of total.")

Rows to remove: 0 (0.00%)
Rows removed.


## DATA WRANGLING FLIGHTS

In [5]:
flights = pd.read_csv("https://raw.githubusercontent.com/catamina07/datamining-group/main/data/DM_AIAI_FlightsDB.csv")
flights.head()

Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
0,413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.0,0.0,0.0
1,464105,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
2,681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.0,0.0,0.0
3,185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.0,3213.0,32.0
4,216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.0,0.0,0.0
