## Drift in hotel booking dataset

In the previous chapter, you calculated the business value and ROC AUC performance for a model that predicts booking cancellations. You noticed a few alerts in the resulting plots, which is why you need to investigate the presence of drift in the analysis data.

In this exercise, you will initialize the multivariate drift detection method and compare its results with the performance results calculated in the previous chapter.

StandardDeviationThreshold is already imported, the business value and ROC AUC results are stored in the perf_results variable, and feature_column_names is already defined.

### Instructions
    - Initialize the StandardDeviationThreshold method, setting std_lower_multiplier to 2 and std_upper_multiplier to 1.
    - Add the following feature names: country, lead_time, parking_spaces, and hotel. Retain their order.
    - Pass the previously defined threshold and feature names to the DataReconstructionDriftCalculator.
    - Show the comparison plot featuring both the multivariate drift detection results (mv_results) and the performance results (perf_results).

In [None]:
# Create standard deviation thresholds
stdt = StandardDeviationThreshold(std_lower_multiplier=2, std_upper_multiplier=1)

# Define feature columns
feature_column_names = ['country', 'lead_time', 'parking_spaces', 'hotel']

# Initialize, fit, and show results of multivariate drift calculator
mv_calc = nannyml.DataReconstructionDriftCalculator(
    column_names=feature_column_names,
    threshold=stdt,
    timestamp_column_name='timestamp',
    chunk_period='m')
mv_calc.fit(reference)
mv_results = mv_calc.calculate(analysis)
mv_results.filter(period='analysis').compare(perf_results).plot().show()
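
To make the threshold settings concrete, here is a minimal sketch of how a standard deviation threshold could be derived from per-chunk metric values in the reference period. It is not NannyML's implementation: the helper name `std_thresholds` and the sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def std_thresholds(reference_values, std_lower_multiplier=2, std_upper_multiplier=1):
    """Return (lower, upper) alert thresholds around the reference mean."""
    center = mean(reference_values)
    spread = pstdev(reference_values)  # assumption: population standard deviation
    lower = center - std_lower_multiplier * spread
    upper = center + std_upper_multiplier * spread
    return lower, upper


# Hypothetical reconstruction-error values, one per reference chunk
reference_errors = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = std_thresholds(reference_errors)
print(f"lower={lower:.3f}, upper={upper:.3f}")
```

With multipliers of 2 and 1, the band is asymmetric: the upper bound sits closer to the reference mean than the lower bound, so an increase in the metric raises an alert sooner than a decrease of the same size.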

## Univariate drift detection for hotel booking dataset

In the previous exercise, the multivariate drift detection method showed that the data shift in January is responsible for the alert in the ROC AUC metric and the negative business value of the model.

In this exercise, you will use a univariate drift detection method to find the feature and explanation behind the drift.

The reference and analysis sets are already pre-loaded for you.

### Instructions
    - Specify the Wasserstein and Jensen-Shannon methods for continuous features and the L-infinity and Chi2 methods for categorical features.
    - Fit the reference and calculate results on the analysis set.
    - Plot the results.

In [None]:
# Initialize the univariate drift calculator
uv_calc = nannyml.UnivariateDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_period='m',
    continuous_methods=['wasserstein', 'jensen_shannon'],
    categorical_methods=['l_infinity', 'chi2'],
)

# Fit, calculate, and plot the results
uv_calc.fit(reference)
uv_results = uv_calc.calculate(analysis)
uv_results.plot().show()
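
As a rough illustration of what two of these methods measure, the sketch below computes an L-infinity distance for a categorical feature and a Wasserstein distance for a continuous one. The function names and sample values are hypothetical, and the Wasserstein helper only handles the simplified case of equally sized samples; it is not how NannyML computes these metrics internally.

```python
from collections import Counter


def l_infinity(reference, analysis):
    """Largest absolute difference in category frequency between two samples."""
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    categories = set(ref_counts) | set(ana_counts)
    return max(
        abs(ref_counts[c] / len(reference) - ana_counts[c] / len(analysis))
        for c in categories
    )


def wasserstein_equal_size(reference, analysis):
    """1D Wasserstein distance for two samples of the same size."""
    if len(reference) != len(analysis):
        raise ValueError("this simplified sketch needs equally sized samples")
    pairs = zip(sorted(reference), sorted(analysis))
    return sum(abs(r - a) for r, a in pairs) / len(reference)


# Hypothetical samples for a categorical and a continuous feature
ref_hotel = ["City Hotel"] * 6 + ["Resort Hotel"] * 4
ana_hotel = ["City Hotel"] * 9 + ["Resort Hotel"] * 1
ref_lead_time = [10, 20, 30, 40]
ana_lead_time = [20, 30, 40, 50]

print(l_infinity(ref_hotel, ana_hotel))
print(wasserstein_equal_size(ref_lead_time, ana_lead_time))
```

L-infinity reacts to the single category whose share changed the most, while the Wasserstein distance grows with how far the values have moved, expressed in the feature's own units.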

## Ranking the univariate results

In the previous exercise, you ended up with eight plots. In this exercise, your task is to rank them based on the number of alerts and the correlation with the ROC AUC performance.

The univariate results are pre-loaded and stored in the uv_results variable, and the performance results are stored in the perf_results variable.

### Instructions 1/2
    - Initialize AlertCountRanker without any initial parameters.
    - Call the .rank() method and pass uv_results filtered for the Wasserstein and L-infinity methods.

In [None]:
# Initialize the alert count ranker
alert_count_ranker = nannyml.AlertCountRanker()
alert_count_ranked_features = alert_count_ranker.rank(
    uv_results.filter(methods=['wasserstein', 'l_infinity']))

display(alert_count_ranked_features)
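
The idea behind ranking by alert count can be sketched in a few lines: count the chunks in which each feature raised a drift alert, then sort the features by that count. The helper name `rank_by_alert_count` and the alert flags below are hypothetical and only illustrate the principle.

```python
def rank_by_alert_count(alerts_by_feature):
    """Sort features by the number of chunks that raised a drift alert."""
    counts = {feature: sum(flags) for feature, flags in alerts_by_feature.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-chunk alert flags for three features
alerts_by_feature = {
    "lead_time": [False, False, False],
    "hotel": [False, True, True],
    "country": [False, False, True],
}

ranking = rank_by_alert_count(alerts_by_feature)
print(ranking)
```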

### Instructions 2/2
    - Initialize CorrelationRanker without any initial parameters.
    - Fit the correlation ranker on perf_results filtered for the reference period.
    - Call the .rank() method and pass uv_results filtered for the Wasserstein and L-infinity methods, along with perf_results.

In [None]:
# Initialize the alert count ranker
alert_count_ranker = nannyml.AlertCountRanker()
alert_count_ranked_features = alert_count_ranker.rank(
    uv_results.filter(methods=['wasserstein', 'l_infinity']))

display(alert_count_ranked_features)

# Initialize the correlation ranker
correlation_ranker = nannyml.CorrelationRanker()
correlation_ranker.fit(perf_results.filter(period='reference'))

correlation_ranked_features = correlation_ranker.rank(
    uv_results.filter(methods=['wasserstein', 'l_infinity']),
    perf_results)
display(correlation_ranked_features)
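
The principle behind correlation ranking can be sketched as follows: take the mean reference performance as a baseline, measure how far each analysis chunk's performance deviates from it, and correlate that deviation with each feature's drift values. This is a simplified sketch, not NannyML's code; the function names, the sample numbers, and the choice to sort by absolute Pearson correlation are assumptions.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm


def rank_by_correlation(drift_by_feature, reference_perf, analysis_perf):
    """Rank features by how strongly drift tracks absolute performance change."""
    baseline = mean(reference_perf)  # learned from the reference period
    abs_change = [abs(p - baseline) for p in analysis_perf]
    scores = {f: pearson(v, abs_change) for f, v in drift_by_feature.items()}
    # Assumption: rank by the magnitude of the correlation
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical ROC AUC values per chunk and drift values per feature
reference_perf = [0.80, 0.82, 0.78]
analysis_perf = [0.80, 0.75, 0.60]
drift_by_feature = {
    "lead_time": [0.05, 0.04, 0.05],
    "hotel": [0.01, 0.10, 0.40],
}

ranking = rank_by_correlation(drift_by_feature, reference_perf, analysis_perf)
for feature, score in ranking:
    print(f"{feature}: {score:.2f}")
```

Fitting on the reference period first matters because the baseline defines what counts as a performance change; without it, there is nothing to compare the analysis chunks against.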

## Visualizing drifting features

After ranking the univariate results, you know that drift in the hotel and country features impacts the model's performance the most. In this exercise, you will look at their drift results and distribution plots to determine the root cause of the problem.

The results from the univariate drift calculator are stored in the uv_results variable.

### Instructions
    - Set the period argument to analysis for drift_results.
    - Pass hotel and country to column_names for drift_results.
    - Set the kind argument in the .plot() method to "drift".
    - Do the same for distribution_results, but set the kind argument in the .plot() method to "distribution".

In [None]:
# Filter and create drift plots
drift_results = uv_results.filter(
    period='analysis',
    column_names=['hotel', 'country']
    ).plot(kind='drift')

# Filter and create distribution plots
distribution_results = uv_results.filter(
    period='analysis',
    column_names=['hotel', 'country']
    ).plot(kind='distribution')

# Show the plots
drift_results.show()
distribution_results.show()

## Data quality checks

As you learned in the previous video, missing values can result in a loss of valuable information and potentially lead to incorrect interpretations. Similarly, the presence of unseen values can also affect your model's confidence.

In this exercise, your goal is to explore whether the hotel booking dataset contains missing values and identify any unseen values. The reference and analysis datasets are already loaded, along with the nannyml library.

As a quick reminder: if you can't recall the column types, you can explore the data using the .head() method.

### Instructions 1/2
    - Initialize the missing value calculator, passing the selected columns to column_names and setting the chunk_period to monthly.

In [None]:
# Define analyzed columns
selected_columns = ['country', 'lead_time', 'parking_spaces', 'hotel']

# Initialize missing values calculator
ms_calc = nannyml.MissingValuesCalculator(
    column_names=selected_columns,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
ms_calc.fit(reference)
ms_results = ms_calc.calculate(analysis)
ms_results.plot().show()

### Instructions 2/2
    - Add the two categorical column names country and hotel, initialize the unseen values calculator, and pass categorical_columns to column_names.

In [None]:
# Define analyzed categorical columns
categorical_columns = ['country', 'hotel']

# Initialize unseen values calculator
us_calc = nannyml.UnseenValuesCalculator(
    column_names=categorical_columns,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
us_calc.fit(reference)
us_results = us_calc.calculate(analysis)
us_results.filter(period='analysis').plot().show()
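
Both checks reduce to simple set and counting logic, which the sketch below illustrates for a single chunk. The function names and sample values are hypothetical, and missing entries are represented by `None` here as a simplifying assumption; a pandas DataFrame would typically hold NaN instead.

```python
def missing_rate(values):
    """Fraction of entries in a chunk that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def unseen_values(reference, analysis):
    """Categories that appear in the analysis chunk but never in reference."""
    seen = set(reference)
    return sorted({v for v in analysis if v is not None and v not in seen})


# Hypothetical country codes for a reference set and one analysis chunk
reference_country = ["PRT", "GBR", "FRA", "PRT"]
analysis_country = ["PRT", None, "BRA", "GBR"]

print(missing_rate(analysis_country))
print(unseen_values(reference_country, analysis_country))
```

Unseen values are always defined relative to the reference set, which is why the calculator has to be fitted on reference before it can evaluate the analysis data.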