# Assumption Validation

This notebook validates basic assumptions about the data.

In [1]:
from src import data_io

# Prerequisites

## Load data

In [2]:
raw = data_io.load_raw_data()

# Data dictionary

In [3]:
raw.head()

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
0,1006220,2016-04-09,2016-04-11,31114,desktop,384,Gondal,Gondal,1006220_1
1,1006220,2016-04-11,2016-04-12,39641,desktop,384,Gondal,Gondal,1006220_1
2,1006220,2016-04-12,2016-04-16,20232,desktop,384,Gondal,Glubbdubdrib,1006220_1
3,1006220,2016-04-16,2016-04-17,24144,desktop,384,Gondal,Gondal,1006220_1
4,1010293,2016-07-09,2016-07-10,5325,mobile,359,The Devilfire Empire,Cobra Island,1010293_1


 - `user_id` - User ID
 - `checkin` - Reservation check-in date
 - `checkout` - Reservation check-out date
 - `affiliate_id` - An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
 - `device_class` - desktop/mobile
 - `booker_country` - Country from which the reservation was made (anonymized)
 - `hotel_country` - Country of the hotel (anonymized)
 - `city_id` - ID of the hotel’s city (anonymized)
 - `utrip_id` - Unique identifier of the user’s trip (a group of multi-destination bookings within the same trip)

# Nullability ✅

No column should contain null values.

In [4]:
assert not raw.isnull().sum().sum()
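
If this assertion ever failed, a per-column breakdown would be more informative than a bare `AssertionError`. A minimal sketch, assuming a hypothetical toy frame `sample` in place of `raw` (its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for `raw`; the values are illustrative only.
sample = pd.DataFrame({
    "user_id": [1, 2, 3],
    "device_class": ["desktop", None, "mobile"],
})

# Count nulls per column and keep only the columns that contain any.
null_counts = sample.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

On the real data, `sample` would be replaced by `raw`; an empty result corresponds to the passing assertion.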

# `utrip_id` splitability ✅

`utrip_id` appears to consist of two numbers separated by an underscore. Validate that this holds throughout the data.

In [5]:
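# Check each distinct trip ID once.
all_utrip_ids = set(raw.utrip_id)
for utrip_id in all_utrip_ids:
    # Unpacking raises ValueError unless there is exactly one underscore.
    part1, part2 = utrip_id.split('_')
    assert part1.isnumeric() and part2.isnumeric()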
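
The same check can be expressed without a Python-level loop, which also makes it easy to list the offending IDs. A sketch, assuming a hypothetical series `sample_ids` in place of `raw.utrip_id`; note that `\d` is slightly stricter than `str.isnumeric`, which also accepts characters such as vulgar fractions:

```python
import pandas as pd

# Hypothetical IDs; the last two are deliberately malformed.
sample_ids = pd.Series(["1006220_1", "1010293_1", "12_a", "7_8_9"])

# True where the whole string is <digits>_<digits>.
is_well_formed = sample_ids.str.fullmatch(r"\d+_\d+")

# Collect the IDs that break the pattern.
malformed = sample_ids[~is_well_formed].tolist()
print(malformed)
```

An empty `malformed` list on the real data would be equivalent to the loop above passing.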

# `user_id` is the first element of `utrip_id` ✅

`user_id` appears to be the first number in `utrip_id`. Verify that this holds throughout the entire data.

In [6]:
user_ids_from_utrip_id = raw['utrip_id'].apply(lambda x: int(x.split('_')[0]))
assert (user_ids_from_utrip_id == raw.user_id).all()
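
A vectorized variant of the same comparison, which also surfaces any mismatching rows instead of only failing. This is a sketch on a hypothetical toy frame `sample`; the third row is deliberately inconsistent:

```python
import pandas as pd

# Hypothetical stand-in for `raw`; the last row violates the assumption on purpose.
sample = pd.DataFrame({
    "user_id": [1006220, 1010293, 42],
    "utrip_id": ["1006220_1", "1010293_1", "99_1"],
})

# Take the part before the first underscore and convert it to an integer.
prefix = sample["utrip_id"].str.split("_", n=1).str[0].astype(int)

# Rows where the prefix disagrees with user_id.
mismatches = sample[prefix != sample["user_id"]]
print(mismatches)
```

On the real data, `mismatches` is expected to be empty, matching the assertion above.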

# Every `city_id` has only one `hotel_country` ✅

Each `city_id` should map to exactly one `hotel_country`.

In [7]:
assert raw.groupby('city_id')['hotel_country'].nunique().max() == 1
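
Should this check fail on other data, the ambiguous cities can be listed directly. A sketch, assuming a hypothetical toy frame `sample` whose city IDs and country names are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for `raw`; city 2 is deliberately mapped to two countries.
sample = pd.DataFrame({
    "city_id": [1, 1, 2, 2],
    "hotel_country": ["A", "A", "B", "C"],
})

# Number of distinct countries seen for each city.
countries_per_city = sample.groupby("city_id")["hotel_country"].nunique()

# Cities that map to more than one country.
ambiguous_cities = countries_per_city[countries_per_city > 1].index.tolist()
print(ambiguous_cities)
```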

# `device_class` is unique per `utrip_id` ❌

Assumption: the same device class is used to book every reservation in a trip.

In [8]:
assert raw.groupby('utrip_id')['device_class'].nunique().max() == 1

AssertionError: 

Show counterexamples:

In [9]:
raw[raw.groupby('utrip_id')['device_class'].transform('nunique') != 1]

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
4,1010293,2016-07-09,2016-07-10,5325,mobile,359,The Devilfire Empire,Cobra Island,1010293_1
5,1010293,2016-07-10,2016-07-11,55,mobile,359,The Devilfire Empire,Cobra Island,1010293_1
6,1010293,2016-07-12,2016-07-13,23921,mobile,359,The Devilfire Empire,Cobra Island,1010293_1
7,1010293,2016-07-13,2016-07-15,65322,desktop,9924,The Devilfire Empire,Cobra Island,1010293_1
8,1010293,2016-07-15,2016-07-16,23921,desktop,9924,The Devilfire Empire,Cobra Island,1010293_1
...,...,...,...,...,...,...,...,...,...
1166826,987787,2016-08-24,2016-08-27,63650,tablet,9924,Gondal,Glubbdubdrib,987787_1
1166827,999261,2016-09-04,2016-09-07,11473,desktop,2322,Gondal,Fook Island,999261_1
1166828,999261,2016-09-07,2016-09-10,6306,desktop,10332,Gondal,Fook Island,999261_1
1166829,999261,2016-09-10,2016-09-13,44024,desktop,2526,Gondal,Fook Island,999261_1
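
To judge how widespread the violation is, it can help to count the affected trips rather than list their rows. A sketch on a hypothetical toy frame `sample` (the trip IDs and device classes are invented, and no figure here reflects the real data):

```python
import pandas as pd

# Hypothetical stand-in for `raw`; only trip "a_1" mixes device classes.
sample = pd.DataFrame({
    "utrip_id": ["a_1", "a_1", "b_1", "b_1", "c_1"],
    "device_class": ["mobile", "desktop", "desktop", "desktop", "tablet"],
})

# Number of distinct device classes per trip.
devices_per_trip = sample.groupby("utrip_id")["device_class"].nunique()

# Trips booked with more than one device class, and their share of all trips.
mixed_trips = devices_per_trip[devices_per_trip > 1]
share_mixed = len(mixed_trips) / len(devices_per_trip)
print(len(mixed_trips), share_mixed)
```

Replacing `sample` with `raw` would give the number and share of trips for which `device_class` cannot be treated as a trip-level attribute.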
