# Shinkansen Travel Experience

## Introduction
The purpose of this report is to:
* Identify the factors that drive passenger satisfaction.
* Build a model that predicts whether a passenger was satisfied with the overall experience of traveling on the Shinkansen Bullet Train.

**Dataset**

The problem consists of two separate datasets: travel data and survey data. The travel data contains information about each passenger and about the Shinkansen train on which they traveled. The survey data aggregates survey responses describing the post-service experience. The data has been split into two groups:

* Train_Data
* Test_Data

**Travel Data Dictionary:**

* **ID** - The unique ID of the passenger.
* **Gender** - The gender of the passenger.
* **Customer_Type** - Loyalty type of the passenger.
* **Age** - The age of the passenger.
* **Type_Travel** - Purpose of travel for the passenger.
* **Travel_Class** - The train class that the passenger traveled in.
* **Travel_Distance** - The distance traveled by the passenger.
* **Departure_Delay_in_Mins** - The delay (in minutes) in train departure.
* **Arrival_Delay_in_Mins** - The delay (in minutes) in train arrival.

**Survey Data Dictionary:**

* **ID** - The unique ID of the passenger.
* **Platform_Location** - How convenient the location of the platform is for the passenger.
* **Seat_Class** - The type of seat class on the train. Green Car seats are usually more spacious and comfortable than ordinary seats. On the Shinkansen train, there are only four seats per row in the Green Car, versus five in the ordinary car.
* **Overall_Experience** - The overall experience of the passenger. Target variable.
* **Seat_Comfort** - The comfort level of the seat for the passenger.
* **Arrival_Time_Convenient** - How convenient the arrival time of the train is for the passenger.
* **Catering** - How convenient the catering service is for the passenger.
* **Onboard_Wifi_Service** - The quality of the onboard Wi-Fi service for the passenger.
* **Onboard_Entertainment** - The quality of the onboard entertainment for the passenger.
* **Online_Support** - The quality of the online support for the passenger.
* **Ease_of_Online_Booking** - The ease of online booking for the passenger.
* **Onboard_Service** - The quality of the onboard service for the passenger.
* **Legroom** - Legroom is the general term used in place of the more accurate “seat pitch”, which is the distance between a point on one seat and the same point on the seat in front of it. This variable describes the convenience of the legroom provided for the passenger.
* **Baggage_Handling** - The convenience of baggage handling for the passenger.
* **CheckIn_Service** - The convenience of the check-in service for the passenger.
* **Cleanliness** - The passenger's view of the cleanliness of the service.
* **Online_Boarding** - The convenience of the online boarding process for the passenger.

## Data Validation 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Train Data

In [7]:
travel_train = pd.read_csv('../data/Traveldata_train.csv')

In [9]:
travel_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 6.5+ MB


In [10]:
survey_train = pd.read_csv('../data/Surveydata_train.csv')

In [11]:
survey_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       94379 non-null  int64 
 1   Overall_Experience       94379 non-null  int64 
 2   Seat_Comfort             94318 non-null  object
 3   Seat_Class               94379 non-null  object
 4   Arrival_Time_Convenient  85449 non-null  object
 5   Catering                 85638 non-null  object
 6   Platform_Location        94349 non-null  object
 7   Onboard_Wifi_Service     94349 non-null  object
 8   Onboard_Entertainment    94361 non-null  object
 9   Online_Support           94288 non-null  object
 10  Ease_of_Online_Booking   94306 non-null  object
 11  Onboard_Service          86778 non-null  object
 12  Legroom                  94289 non-null  object
 13  Baggage_Handling         94237 non-null  object
 14  CheckIn_Service          94302 non-nul

We can use ID as the key column to join the two train datasets.

In [12]:
data_train = pd.merge(travel_train, survey_train, on = 'ID', how = 'inner')
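
Because an inner join silently drops any ID that is missing from one side, it can be worth confirming that the join is truly one-to-one. The following is a minimal sketch on toy frames (`travel_demo` and `survey_demo` are hypothetical stand-ins, not the real files); it relies on the `validate` argument of `pd.merge`, which raises an error when the key repeats on either side.

```python
import pandas as pd

# Toy stand-ins for travel_train and survey_train (illustrative values only).
travel_demo = pd.DataFrame({"ID": [1, 2, 3], "Age": [34.0, 51.0, None]})
survey_demo = pd.DataFrame({"ID": [1, 2, 3], "Overall_Experience": [1, 0, 1]})

# validate="one_to_one" raises pandas.errors.MergeError if ID repeats on either side.
merged_demo = pd.merge(
    travel_demo, survey_demo, on="ID", how="inner", validate="one_to_one"
)

# An inner join drops unmatched IDs without warning, so compare row counts afterwards.
rows_preserved = len(merged_demo) == len(travel_demo) == len(survey_demo)
print(merged_demo.shape, rows_preserved)
```

If the row counts differ after the join, some passengers appear in only one of the two files and deserve a closer look before modeling.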

In [13]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
 9   Overall_Experience       94379 non-null  int64  
 10  Seat_Comfort             94318 non-null  object 
 11  Seat_Class               94379 non-null  object 
 12  Arrival_Time_Convenient  85449 non-null  object 
 13  Catering                 85638 non-null  object 
 14  Platform_Location     

**Observations**:
* There are 94,379 observations in the train dataset.
* There are missing values in several columns of the dataset.
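
To see how much is missing per column rather than reading it off the `info()` output, one option is to tabulate missing counts and percentages. This is a sketch on a small hypothetical frame named `demo`; with the real data, the same expressions would be applied to `data_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few gaps, mimicking the structure of the train data.
demo = pd.DataFrame({
    "Gender": ["Female", "Male", np.nan, "Male"],
    "Age": [34.0, np.nan, 51.0, 27.0],
    "Travel_Class": ["Eco", "Business", "Eco", "Eco"],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
summary = pd.DataFrame({
    "missing": demo.isna().sum(),
    "pct": (demo.isna().mean() * 100).round(2),
})

# Keep only the columns that actually have gaps, largest first.
missing_summary = summary[summary["missing"] > 0].sort_values(
    "missing", ascending=False
)
print(missing_summary)
```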

Let's check the number of unique values in each column and whether there are any duplicate rows.

In [14]:
data_train.nunique()

ID                         94379
Gender                         2
Customer_Type                  2
Age                           75
Type_Travel                    2
Travel_Class                   2
Travel_Distance             5210
Departure_Delay_in_Mins      437
Arrival_Delay_in_Mins        434
Overall_Experience             2
Seat_Comfort                   6
Seat_Class                     2
Arrival_Time_Convenient        6
Catering                       6
Platform_Location              6
Onboard_Wifi_Service           6
Onboard_Entertainment          6
Online_Support                 6
Ease_of_Online_Booking         6
Onboard_Service                6
Legroom                        6
Baggage_Handling               5
CheckIn_Service                6
Cleanliness                    6
Online_Boarding                6
dtype: int64

In [17]:
data_train.duplicated().any()

False

**Observations**:
* ID is a unique identifier for each passenger. We can drop this column because it adds no value to the analysis.
* There are no duplicated rows in the dataset.
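
One caveat: while ID is present, every row is distinct by construction, so `duplicated()` cannot return `True`. A possible follow-up, sketched here on a hypothetical toy frame `demo`, is to confirm that ID is unique and then count rows that repeat once ID is set aside.

```python
import pandas as pd

# Hypothetical sample: IDs differ, but two passengers gave identical answers.
demo = pd.DataFrame({
    "ID": [101, 102, 103],
    "Gender": ["Female", "Male", "Male"],
    "Travel_Class": ["Eco", "Business", "Business"],
})

# is_unique is True when no ID value is repeated.
id_is_unique = demo["ID"].is_unique

# Count rows that duplicate an earlier row once the ID column is ignored.
repeated_without_id = int(demo.drop(columns="ID").duplicated().sum())
print(id_is_unique, repeated_without_id)
```

Rows that repeat without ID are not necessarily errors, since two passengers can legitimately share the same profile and ratings.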

Let's drop the ID column and define lists of numerical and categorical columns so that we can explore each group separately.

In [18]:
data_train = data_train.drop('ID', axis=1)

In [19]:
object_att = data_train.select_dtypes(include = 'object').columns.values.tolist()
object_att

['Gender',
 'Customer_Type',
 'Type_Travel',
 'Travel_Class',
 'Seat_Comfort',
 'Seat_Class',
 'Arrival_Time_Convenient',
 'Catering',
 'Platform_Location',
 'Onboard_Wifi_Service',
 'Onboard_Entertainment',
 'Online_Support',
 'Ease_of_Online_Booking',
 'Onboard_Service',
 'Legroom',
 'Baggage_Handling',
 'CheckIn_Service',
 'Cleanliness',
 'Online_Boarding']

In [21]:
num_att = data_train.select_dtypes(exclude = 'object').columns.values.tolist()
num_att

['Age',
 'Travel_Distance',
 'Departure_Delay_in_Mins',
 'Arrival_Delay_in_Mins',
 'Overall_Experience']

Let's check for strange values in numerical and categorical columns.

In [38]:
data_train[num_att].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 94346.0 | 39.419647 | 15.116632 | 7.0 | 27.0 | 40.0 | 51.0 | 85.0 |
| Travel_Distance | 94379.0 | 1978.888185 | 1027.961019 | 50.0 | 1359.0 | 1923.0 | 2538.0 | 6951.0 |
| Departure_Delay_in_Mins | 94322.0 | 14.647092 | 38.138781 | 0.0 | 0.0 | 0.0 | 12.0 | 1592.0 |
| Arrival_Delay_in_Mins | 94022.0 | 15.005222 | 38.439409 | 0.0 | 0.0 | 0.0 | 13.0 | 1584.0 |
| Overall_Experience | 94379.0 | 0.546658 | 0.497821 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [34]:
# Print the distinct values (including NaN) of every categorical column.
for col in object_att:
    print(col)
    print(data_train[col].unique())

Gender
['Female' 'Male' nan]
Customer_Type
['Loyal Customer' 'Disloyal Customer' nan]
Type_Travel
[nan 'Personal Travel' 'Business Travel']
Travel_Class
['Business' 'Eco']
Seat_Comfort
['Needs Improvement' 'Poor' 'Acceptable' 'Good' 'Excellent'
 'Extremely Poor' nan]
Seat_Class
['Green Car' 'Ordinary']
Arrival_Time_Convenient
['Excellent' 'Needs Improvement' 'Acceptable' nan 'Good' 'Poor'
 'Extremely Poor']
Catering
['Excellent' 'Poor' 'Needs Improvement' nan 'Acceptable' 'Good'
 'Extremely Poor']
Platform_Location
['Very Convenient' 'Needs Improvement' 'Manageable' 'Inconvenient'
 'Convenient' nan 'Very Inconvenient']
Onboard_Wifi_Service
['Good' 'Needs Improvement' 'Acceptable' 'Excellent' 'Poor'
 'Extremely Poor' nan]
Onboard_Entertainment
['Needs Improvement' 'Poor' 'Good' 'Excellent' 'Acceptable'
 'Extremely Poor' nan]
Online_Support
['Acceptable' 'Good' 'Excellent' 'Poor' nan 'Needs Improvement'
 'Extremely Poor']
Ease_of_Online_Booking
['Needs Improvement' 'Good' 'Excellent' 'Ac

**Observations**:
* There are no clear outliers in both Departure_Delay_in_Mins and Arrival_Delay_in_Mins columns, for 
* There are no strange values in any categorical column.
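
The visual check of the categorical values could also be automated by comparing each column against an expected set of labels. The sketch below is illustrative only: `RATING_LEVELS` is an assumed scale taken from the unique values printed earlier, `unexpected_values` is a hypothetical helper, and `demo` is toy data with a deliberately misspelled entry. Note that Platform_Location uses a different scale (for example 'Very Convenient') and would need its own allowed set.

```python
import numpy as np
import pandas as pd

# Assumed six-level rating scale, as seen in the printed unique values.
RATING_LEVELS = {
    "Extremely Poor", "Poor", "Needs Improvement",
    "Acceptable", "Good", "Excellent",
}

# Toy data; "good " (lowercase, trailing space) is a deliberate bad value.
demo = pd.DataFrame({
    "Seat_Comfort": ["Good", "Poor", np.nan, "Excellent"],
    "Catering": ["Acceptable", "good ", "Poor", np.nan],
})


def unexpected_values(series, allowed):
    """Return the non-missing values of `series` that are not in `allowed`."""
    observed = set(series.dropna().unique())
    return observed - set(allowed)


# Map each column to the set of labels that fall outside the expected scale.
report = {col: unexpected_values(demo[col], RATING_LEVELS) for col in demo.columns}
print(report)
```

An empty set for every column would support the observation that no categorical column contains unexpected labels.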