## Data Mining Project - Deliverable 1

In this project we will follow the CRISP-DM methodology. We will start with:

### **Business Understanding:**
We want to segment the customers in the Loyalty Program of AIAI Airlines. We want to explore value-based segmentation, behavioral segmentation and demographic segmentation. Ultimately, we want to combine the three perspectives into a final segmentation framework.

For this we have two available datasets:
- DM_AIAI_CustomerDB.csv: information regarding the customers
- DM_AIAI_FlightsDB.csv: information regarding the customers' flying activity with AIAI Airlines

### **Data Understanding:**


In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import openpyxl 

In [None]:
# read the available files
df_customerDB = pd.read_csv('DM_AIAI_CustomerDB.csv')
df_flightsDB = pd.read_csv('DM_AIAI_FlightsDB.csv')
df_metadata = pd.read_csv('DM_AIAI_Metadata.csv', sep=';') # this file is not comma separated but semicolon separated

In [3]:
# DESCRIPTIVE STATISTICS AND VISUALIZATIONS
# check head(), info() and describe() of each dataframe
# plot histograms and boxplots of the variables

# ASSESS DATA QUALITY ISSUES AND CLUSTERING RELIABILITY ??
# check for missing values, duplicates, inconsistencies, outliers

# IDENTIFY PRELIMINARY BEHAVIORAL SIGNALS
# check correlations and scatter plots between the features

#DEVELOP AND JUSTIFY ENGINEERED FEATURES
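
For the outlier check listed in the plan above, one common option is the 1.5 × IQR rule. The sketch below is only an illustration of that rule on an invented series (`toy_feature` and its values are not from the datasets); whether this rule suits the real variables still has to be decided.

```python
import pandas as pd

# Invented values for illustration only; 100.0 is placed far from the rest on purpose.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="toy_feature")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```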

In [4]:
display(df_metadata)

Unnamed: 0,CustomerDB,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,Variable,Description,
1,,Loyalty#,Unique customer identifier for loyalty program...,
2,,First Name,Customer's first name,
3,,Last Name,Customer's last name,
4,,Customer Name,Customer's full name (concatenated),
5,,Country,Customer's country of residence,
6,,Province or State,Customer's province or state,
7,,City,Customer's city of residence,
8,,Latitude,Geographic latitude coordinate of customer loc...,
9,,Longitude,Geographic longitude coordinate of customer lo...,


In [5]:
metadata_customerDB = df_metadata.loc[1:20,['Unnamed: 1', 'Unnamed: 2']]
metadata_customerDB.columns = ['Variable', 'Description']
display(metadata_customerDB)

Unnamed: 0,Variable,Description
1,Loyalty#,Unique customer identifier for loyalty program...
2,First Name,Customer's first name
3,Last Name,Customer's last name
4,Customer Name,Customer's full name (concatenated)
5,Country,Customer's country of residence
6,Province or State,Customer's province or state
7,City,Customer's city of residence
8,Latitude,Geographic latitude coordinate of customer loc...
9,Longitude,Geographic longitude coordinate of customer lo...
10,Postal code,Customer's postal/ZIP code


We can already suppose that the First Name and Last Name variables won't be informative, since the Customer Name variable is the combination of the two. We should check whether there are any errors in Customer Name. We should probably also compute a feature for how long the customer has been in the loyalty program.
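
A minimal sketch of both ideas, using two rows copied from the customer table preview rather than the full file. The `TenureDays` column name and the `REFERENCE_DATE` cut-off used for customers without a cancellation date are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the customer table preview; the real data is in DM_AIAI_CustomerDB.csv.
sample = pd.DataFrame({
    "First Name": ["Cecilia", "Necole"],
    "Last Name": ["Householder", "Hannon"],
    "Customer Name": ["Cecilia Householder", "Necole Hannon"],
    "EnrollmentDateOpening": ["2/15/2019", "7/14/2017"],
    "CancellationDate": [None, "1/8/2021"],
})

# 1) Rows where Customer Name is not "<First Name> <Last Name>".
expected_name = sample["First Name"].str.strip() + " " + sample["Last Name"].str.strip()
name_mismatch = sample[sample["Customer Name"].str.strip() != expected_name]

# 2) Tenure in days. REFERENCE_DATE is a hypothetical end date for active customers (assumption).
REFERENCE_DATE = pd.Timestamp("2021-12-31")
enrolled = pd.to_datetime(sample["EnrollmentDateOpening"], format="%m/%d/%Y")
cancelled = pd.to_datetime(sample["CancellationDate"], format="%m/%d/%Y")
end_date = cancelled.fillna(REFERENCE_DATE)
sample["TenureDays"] = (end_date - enrolled).dt.days

print(len(name_mismatch), "name mismatches")
print(sample[["Customer Name", "TenureDays"]])
```

The reference date matters: it should be chosen from the actual time span of the data (for example the last month in the flights dataset) before this feature is used.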

"Customer segmentation enables AIAI to identify distinct groups within their loyalty program." We are interested in segmenting the customers in the loyalty program, so should we remove the customers that have already left the program? Question for the teacher.

In [6]:
metadata_flightsDB = df_metadata.loc[24:,['Unnamed: 1', 'Unnamed: 2']]
metadata_flightsDB.columns = ['Variable', 'Description']
display(metadata_flightsDB)

Unnamed: 0,Variable,Description
24,Loyalty#,Unique customer identifier linking to CustomerDB
25,Year,Year of flight activity record
26,Month,Month of flight activity record (1-12)
27,YearMonthDate,First day of the month for the activity period
28,NumFlights,Total number of flights taken by customer in t...
29,NumFlightsWithCompanions,Number of flights where customer traveled with...
30,DistanceKM,Total distance traveled in kilometers for the ...
31,PointsAccumulated,Loyalty points earned by customer during the m...
32,PointsRedeemed,Loyalty points spent/redeemed by customer duri...
33,DollarCostPointsRedeemed,Dollar value of points redeemed during the month


We have 3 variables dedicated to time (Year, Month and YearMonthDate), which triples its weight; we should probably merge the 3 and create a single datetime variable.
Check whether there is a relationship between the distance and the points accumulated, or the number of flights. The points redeemed and the dollar cost of the points redeemed are probably redundant.
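
A rough sketch of how these two checks could look. All values in the toy table are invented, so the correlations it produces are perfect by construction and say nothing about the real flights data; the `Period` column name is also an assumption.

```python
import pandas as pd

# Toy monthly records; every value here is invented for illustration only.
flights = pd.DataFrame({
    "Year": [2021, 2021, 2021, 2021],
    "Month": [1, 2, 3, 4],
    "DistanceKM": [1000.0, 2500.0, 0.0, 4000.0],
    "PointsAccumulated": [100.0, 250.0, 0.0, 400.0],
    "PointsRedeemed": [0.0, 50.0, 0.0, 200.0],
    "DollarCostPointsRedeemed": [0.0, 0.5, 0.0, 2.0],
})

# Build one datetime column (first day of the month) from Year and Month.
parts = flights[["Year", "Month"]].rename(columns=str.lower).assign(day=1)
flights["Period"] = pd.to_datetime(parts)

# Pearson correlations; values close to 1 would suggest the pair is redundant.
corr_distance_points = flights["DistanceKM"].corr(flights["PointsAccumulated"])
corr_redeemed_cost = flights["PointsRedeemed"].corr(flights["DollarCostPointsRedeemed"])

print(flights["Period"].tolist())
print(corr_distance_points, corr_redeemed_cost)
```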

Loyalty# is the customer identifier, so it should be the index of the customers dataset.

In [7]:
df_customerDB.head()

Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,...,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,...,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,...,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,...,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


In [8]:
df_customerDB.columns

Index(['Unnamed: 0', 'Loyalty#', 'First Name', 'Last Name', 'Customer Name',
       'Country', 'Province or State', 'City', 'Latitude', 'Longitude',
       'Postal code', 'Gender', 'Education', 'Location Code', 'Income',
       'Marital Status', 'LoyaltyStatus', 'EnrollmentDateOpening',
       'CancellationDate', 'Customer Lifetime Value', 'EnrollmentType'],
      dtype='object')

We have a column called 'Unnamed: 0' that only holds the row index, which is not relevant, so we should drop it.
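
As an alternative, the extra column can be avoided at read time. The sketch below uses a small inline CSV (an assumption standing in for DM_AIAI_CustomerDB.csv, with only three columns) to show the difference between the default read and `index_col=0`.

```python
import io
import pandas as pd

# Small inline CSV standing in for the real file; the first header cell is empty.
csv_text = ",Loyalty#,City\n0,480934,Toronto\n1,549612,Edmonton\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```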

In [9]:
df_customerDB.drop('Unnamed: 0', axis=1, inplace=True)
df_customerDB.head()

Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


This is a dataset about the customers in the loyalty program, so the index should be the customer ID, Loyalty#.
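
Before using Loyalty# as the index it is worth confirming that it is really unique. A minimal sketch on a toy table, where the third row repeats an ID on purpose (the duplicate is invented, not observed in the data):

```python
import pandas as pd

# Toy table; the repeated 480934 is invented to show what a failed check looks like.
toy = pd.DataFrame({
    "Loyalty#": [480934, 549612, 480934],
    "City": ["Toronto", "Edmonton", "Toronto"],
})

# True only if every identifier appears exactly once.
ids_unique = toy["Loyalty#"].is_unique

# All rows that share an identifier with another row.
duplicated_rows = toy[toy["Loyalty#"].duplicated(keep=False)]

print(ids_unique)
print(duplicated_rows)
```

If the check fails on the real data, the duplicated rows need to be inspected before the index is set.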

In [10]:
df_customerDB.set_index('Loyalty#', inplace=True)
df_customerDB.head()

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


In [11]:
df_customerDB.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16921 entries, 480934 to 100016
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   First Name               16921 non-null  object 
 1   Last Name                16921 non-null  object 
 2   Customer Name            16921 non-null  object 
 3   Country                  16921 non-null  object 
 4   Province or State        16921 non-null  object 
 5   City                     16921 non-null  object 
 6   Latitude                 16921 non-null  float64
 7   Longitude                16921 non-null  float64
 8   Postal code              16921 non-null  object 
 9   Gender                   16921 non-null  object 
 10  Education                16921 non-null  object 
 11  Location Code            16921 non-null  object 
 12  Income                   16901 non-null  float64
 13  Marital Status           16921 non-null  object 
 14  LoyaltyStatus        

Check the cancellation date, if we need to 

From here we can see that Income has missing values (16901 non-null out of 16921 entries) and that, apparently, so do the Cancellation Date and the Customer Lifetime Value. For the cancellation date this is expected, because a missing value probably means the customer is still in the program. For the Customer Lifetime Value it doesn't make as much sense, and the missing Income values also need to be checked.
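
A quick way to get the missing-value counts per column directly, instead of reading them off `info()`. The toy table below is an assumption for illustration (the missing Income value is invented); on the real data the same calls would be applied to `df_customerDB`.

```python
import numpy as np
import pandas as pd

# Toy table; the NaN in Income is invented for illustration.
toy = pd.DataFrame({
    "Income": [70146.0, 0.0, np.nan],
    "CancellationDate": [None, None, "1/8/2021"],
    "Customer Lifetime Value": [3839.14, 3839.61, 3839.75],
})

# Count and percentage of missing values per column.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values.
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
summary = summary[summary["missing"] > 0]
print(summary)
```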

Let's split the variables into numerical and categorical. The ones whose Dtype is object will be categorical.
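
The two lists can also be derived from the dtypes instead of being typed by hand. This is only a sketch on a toy table with a few of the customer columns; note that date columns stored as strings end up in the non-numeric group and need separate handling.

```python
import pandas as pd

# Toy table with a few of the customer columns (values taken from the preview).
toy = pd.DataFrame({
    "Latitude": [43.653225, 53.544388],
    "Income": [70146.0, 0.0],
    "Gender": ["female", "male"],
    "EnrollmentDateOpening": ["2/15/2019", "3/9/2019"],
})

# Numeric dtypes on one side, everything else on the other.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```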

In [12]:
costumers_num_var = ['Latitude','Longitude','Income','Customer Lifetime Value']
costumers_cat_var = df_customerDB.columns.drop(costumers_num_var)
print(costumers_cat_var)

Index(['First Name', 'Last Name', 'Customer Name', 'Country',
       'Province or State', 'City', 'Postal code', 'Gender', 'Education',
       'Location Code', 'Marital Status', 'LoyaltyStatus',
       'EnrollmentDateOpening', 'CancellationDate', 'EnrollmentType'],
      dtype='object')


In [13]:
costumers_categories = ['Country', 'Province or State','City', 'Gender', 'Education', 'Marital Status', 'LoyaltyStatus', 'EnrollmentType']

for feat in costumers_categories:
    print(f'{feat}:')
    print(df_customerDB[feat].unique())



Country:
['Canada']
Province or State:
['Ontario' 'Alberta' 'British Columbia' 'Quebec' 'Yukon' 'New Brunswick'
 'Manitoba' 'Nova Scotia' 'Saskatchewan' 'Newfoundland'
 'Prince Edward Island']
City:
['Toronto' 'Edmonton' 'Vancouver' 'Hull' 'Whitehorse' 'Trenton' 'Montreal'
 'Dawson Creek' 'Quebec City' 'Moncton' 'Fredericton' 'Ottawa' 'Tremblant'
 'Calgary' 'Whistler' 'Thunder Bay' 'Peace River' 'Winnipeg' 'Sudbury'
 'West Vancouver' 'Halifax' 'London' 'Victoria' 'Regina' 'Kelowna'
 "St. John's" 'Kingston' 'Banff' 'Charlottetown']
Gender:
['female' 'male']
Education:
['Bachelor' 'College' 'Master' 'High School or Below' 'Doctor']
Marital Status:
['Married' 'Divorced' 'Single']
LoyaltyStatus:
['Star' 'Aurora' 'Nova']
EnrollmentType:
['Standard' '2021 Promotion']


From this we can see that the only value in the Country column is Canada, which makes this feature irrelevant for segmentation. We can also see the possible values of the other features. There doesn't seem to be a problem with the data. Bachelor and Master both mean College; should we do something about that?

We will remove the Country feature, but only in the Feature Engineering part of the project.
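
Constant features like Country can also be detected programmatically, which helps if other columns turn out to have a single value. A minimal sketch on a toy table (the rows are illustrative, not the full dataset):

```python
import pandas as pd

# Toy table: Country has a single value, Gender does not.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada"],
    "Gender": ["female", "male", "male"],
})

# Columns with exactly one distinct value (missing values counted as a value).
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```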