# Thesis on data analysis. Research on the Megafon customer satisfaction survey

## Problem statement

Like any business, Megafon wants to increase customer satisfaction with service quality. This is an important task for retaining users, both long-standing and newly acquired. After all, marketing and promotion costs will not be justified if the customer leaves due to poor connection quality. However, in the real world, resources are always limited, and the technical department can solve a finite number of tasks per unit of time.

To use these limited resources most effectively, it is important to determine which technical indicators of connection quality have the greatest impact on customer satisfaction, and to direct resources primarily at them. 
To this end, Megafon surveyed its customers, asking them to rate their level of satisfaction with connection quality. Technical indicators were collected for each customer who completed the survey.
The task is to prepare a study for Megafon that analyzes how (and whether) customer satisfaction depends on the collected data.

**More details about the survey**  

During the survey, Megafon asked its customers to assess their satisfaction with the quality of communication on a 10-point scale (where 10 is “excellent” and 1 is “terrible”). 
If the customer assessed the quality of communication at 9 or 10 points, the survey ended. 
If the customer assessed it below 9, a second question was asked about the reasons for dissatisfaction with the quality of communication, with numbered answer options provided. 
The answer could be given in a free format or by listing the answer numbers separated by commas.
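
Because the second answer may arrive either as free text or as a comma-separated list of option numbers, it has to be normalized before analysis. The following is only a minimal sketch of one way to do that. The helper name `parse_q2` and the sample strings are hypothetical, and it assumes that every run of digits in an answer is an option number (the valid range of options is not stated here, so none is enforced).

```python
import re


def parse_q2(answer):
    """Extract answer-option numbers from a raw Q2 string (hypothetical helper)."""
    if answer is None:
        return []
    # Assumption: every run of digits in the answer is an option number.
    return [int(token) for token in re.findall(r'\d+', str(answer))]


print(parse_q2('1, 3,4'))         # a list of option numbers
print(parse_q2('slow internet'))  # free text without numbers gives an empty list
```

Free-text answers without digits produce an empty list, so they would need separate handling if they are to be kept.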

## Provided survey data

`megafon.csv` contains the survey data with the following fields: <br><br>
&nbsp;&nbsp;&nbsp;&nbsp; `user_id` — user id;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Q1` — answer to the 1st question;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Q2` — answer to the 2nd question;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Total Traffic(MB)` — total traffic volume <sup>1 </sup>; <br>
&nbsp;&nbsp;&nbsp;&nbsp; `Downlink Throughput(Kbps)` — average downlink speed <sup>2 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Uplink Throughput(Kbps)` — average uplink speed <sup>3 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Downlink TCP Retransmission Rate(%)` — downlink packet retransmission rate <sup>4 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Video Streaming Download Throughput(Kbps)` — streaming video download speed <sup>5 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Video Streaming xKB Start Delay(ms)` — video playback start delay <sup>6 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Web Page Download Throughput(Kbps)` — web page loading speed via browser <sup>7 </sup>;<br>
&nbsp;&nbsp;&nbsp;&nbsp; `Web Average TCP RTT(ms)` — ping when browsing web pages<sup>8 </sup>.<br>


<sup>1 </sup> — Indicates how actively the subscriber uses the mobile Internet.<br>
<sup>2 </sup> — Calculated over all traffic.<br>
<sup>3 </sup> — Calculated over all traffic.<br>
<sup>4 </sup> — Higher is worse (lower effective speed).<br>
<sup>5 </sup> — Higher is better (less lag and better picture quality).<br>
<sup>6 </sup> — The time between pressing the Play button and the start of video playback. The shorter this time, the faster the playback starts.<br>
<sup>7 </sup> — Higher is better.<br>
<sup>8 </sup> — Lower is better (web pages load faster).<br>

The first metric is the total for the week before the survey. The other metrics are average values over that week.

## Auxiliary functions

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import plotly.express as px
from plotly.subplots import make_subplots
import textwrap

In [None]:
def wrap_text(text, length=50):
    '''
    Splits the text into lines of a given length and replaces line breaks with the HTML <br> element.
    
        Parameters:
        ----------
        text : string
            Text being processed.
                    
        length : int
            Maximum line length.
        
        Returns:
        -----------------------
            String object. 
    '''
    return textwrap.fill(text, length).replace('\n', '<br>')
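
A short illustration of what this wrapping does to a long metric name. It calls `textwrap.fill` directly with the same replacement step as `wrap_text`; the width of 20 is an arbitrary value chosen for the example, and the use for chart labels is an assumption based on the Plotly imports above.

```python
import textwrap

label = 'Video Streaming Download Throughput(Kbps)'

# Wrap at 20 characters, then swap newlines for <br>,
# which HTML-aware renderers treat as a line break.
wrapped = textwrap.fill(label, 20).replace('\n', '<br>')
print(wrapped)
```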

In [None]:
def trimean_mod(data, axis=0):
    '''
    Returns the modified trimean:
    the weighted average of the 10th, 50th and 90th percentiles in a 1:8:1 proportion.
    The calculation is performed along the given axis of the data sample.
    
        Parameters:
        ----------
        data : pandas.Series, pandas.DataFrame or numpy.ndarray
            The given data sample.
                    
        axis : {0, 1, 'index', 'columns'}, default - 0
            If 0 or 'index' the calculation is performed on rows. 
            If 1 or 'columns' the calculation is performed on columns.
            Is used if data is pandas.DataFrame or numpy.array
        
        Returns:
        -----------------------
            Float type value if data is pandas.Series
            Pandas.Series with index of opposite axis of data if data is pandas.DataFrame .
            1d numpy.ndarray if data is numpy.ndarray.
    
    '''
    if type(data) == pd.Series:
        p10 = data.quantile(0.1)
        p50 = data.median()
        p90 = data.quantile(0.9)
    elif type(data) == pd.DataFrame:
        p10 = data.quantile(0.1, axis=axis)
        p50 = data.median(axis=axis)
        p90 = data.quantile(0.9, axis=axis)
    else:
        p10 = np.quantile(data, 0.1, axis=axis)
        p50 = np.median(data, axis=axis)
        p90 = np.quantile(data, 0.9, axis=axis)
        
    return (p10 + p50*8 + p90) / 10
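
To see why a percentile-based center can be preferable to the mean for skewed network metrics, the sketch below applies the same 1:8:1 weighting to a toy array with and without an extreme outlier. The function name `weighted_percentile_center` and the data are hypothetical; this only illustrates the formula and is not a replacement for `trimean_mod`.

```python
import numpy as np


def weighted_percentile_center(x):
    # Same 1:8:1 weighting of the 10th, 50th and 90th percentiles as trimean_mod.
    p10, p50, p90 = np.quantile(x, [0.1, 0.5, 0.9])
    return (p10 + 8 * p50 + p90) / 10


clean = np.arange(1, 12)   # 1, 2, ..., 11
dirty = clean.copy()
dirty[-1] = 1000           # replace the largest value with an extreme outlier

print(weighted_percentile_center(clean), clean.mean())
print(weighted_percentile_center(dirty), dirty.mean())
```

The outlier moves the mean far away from the bulk of the data, while the weighted percentile center stays the same, because none of the three percentiles touches the last element.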

In [None]:
def trimean_mod_diff(a, b, axis=0):
    '''
    Returns the difference between the modified trimeans of the two given data samples.
    
        Parameters:
        ----------
        a, b : pandas.Series, pandas.DataFrame or numpy.ndarray
            The given data samples.
                    
        axis  : {0, 1, 'index', 'columns'}, default - 0
            If 0 or 'index' the calculation is performed on rows. 
            If 1 or 'columns' the calculation is performed on columns.
            Is used if data is pandas.DataFrame or numpy.array
        
        Returns:
        -----------------------
            Float type value if data is pandas.Series
            Pandas.Series with index of opposite axis of data if data is pandas.DataFrame .
            1d numpy.ndarray if data is numpy.ndarray.
    '''
    return trimean_mod(a, axis=axis) - trimean_mod(b, axis=axis)

In [None]:
def kde(data, n_points=100, special_points=None):
    '''
    Generates a Kernel Density Estimate (KDE) representation for a data sample of one or more parameters.
    A Series object can be used to pass a sample of values of a single parameter
    for which the KDE is generated.
    To generate a KDE representation for several parameters,
    pass the samples of their values as a DataFrame object.
    In this case, the samples for the parameters must be of the same length and arranged in columns.
    
        Parameters:
        ----------
        data : DataFrame or Series
            The given data sample.
                    
        n_points : int
            The number of points in the returned KDE representation.
        
        special_points : DataFrame or Series
            Additional points (e.g. mean, median, and confidence interval bounds) 
            that should be included in the returned KDE representation.
                    
        Returns:
        -----------------------
            A DataFrame object describing the KDE representation.
            If the KDE is generated for one parameter and the data sample is passed as a Series object,
            the resulting data set contains 2 columns:
                value - the values of the points from the range;
                pdf - the KDE values at those points.
    '''
    # Forming a list of columns
    columns = ['value', 'pdf']

    if type(data) is pd.Series:
        # The KDE representation is generated for one parameter
        # We divide the range of parameter values in the sample into (n_points-1) equal segments
        values = pd.Series(np.linspace(data.min(), data.max(), n_points))
        # We prepare the returned dataset
        result = pd.DataFrame(columns=columns)
    else:
        # The KDE representation is generated for several parameters
        # We divide the range of values of each parameter in the sample into (n_points-1) equal segments
        values = pd.DataFrame(np.linspace(data.min(), data.max(), n_points),
                              columns=data.columns)
        # We prepare the returned dataset
        result = pd.DataFrame(
            columns=pd.MultiIndex.from_product([columns, data.columns]))

    # Add "special" values to the set (once, for both the Series and the DataFrame case)
    if special_points is not None:
        values = pd.concat([values, special_points])

    # Find the KDE values for the generated set of parameter values
    if type(data) is pd.Series:
        # The KDE representation is generated for one parameter
        kde = stats.gaussian_kde(data)
        pdf = kde.pdf(values)
    else:
        # The KDE representation is generated for several parameters
        kde = data.apply(lambda s: stats.gaussian_kde(s))
        pdf = data.apply(lambda s: kde[s.name].pdf(values[s.name]))
        pdf.index = values.index

    # Fill the resulting dataset
    result.index.name = 'point'
    result['value'] = values # Endpoints of segments
    result['pdf'] = pdf # KDE values at the endpoints of the segments

    return result
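
The core of the single-parameter case can be illustrated on synthetic data. The sketch below is not the function above: it builds the same kind of `value`/`pdf` table for one made-up normal sample, and the sample itself and the names `grid` and `curve` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=50, scale=10, size=500))  # synthetic data

# Evenly spaced evaluation points covering the observed range
grid = np.linspace(sample.min(), sample.max(), 100)

# Fit a Gaussian KDE to the sample and evaluate it on the grid
density = stats.gaussian_kde(sample).pdf(grid)

curve = pd.DataFrame({'value': grid, 'pdf': density})
curve.index.name = 'point'
print(curve.head())
```

The resulting table can be passed directly to a line plot, with `value` on the x axis and `pdf` on the y axis.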

In [None]:
def my_bootstrap(data, statistic, n_resamples=9999, axis=0):
    '''
    Returns the distribution of the given statistic for a population,
    represented by an observed sample with one or more metrics,
    using the bootstrap method.
    
        Parameters:
        ----------
        data : pandas.Series, pandas.DataFrame or numpy.ndarray
            Observed data sample.
            
        statistic : function
            A function that implements the calculation of statistics for one metric
                    
        n_resamples : int
            Number of resamples. Default is 9999
                    
        Returns:
        -----------------------
            A pandas.Series object of length n_resamples if data is a pandas.Series.
            A pandas.DataFrame object of length n_resamples along the selected axis if data is a pandas.DataFrame.
                The dimensions and indices of the opposite axis are the same as in `data`.
            A numpy.ndarray of length n_resamples along the selected axis if data is a numpy.ndarray.
                The dimensions of the opposite axis are the same as in `data`.
    '''
    
    def _my_bootstrap_1d(arr_1d, statistic, n_resamples=9999):
        '''
        Returns the distribution of a given statistic for a population represented by an observed sample with a single metric,
        using the bootstrap method.
        
        Parameters:
        ----------
        arr_1d : 1d numpy.ndarray
            A one-dimensional array of metric values in the observed sample.
            
        statistic : function
            Function implementing the calculation of statistics
                    
        n_resamples : int
            Number of resamples. Default is 9999
                    
        Returns:
        -----------------------
            A one-dimensional numpy.ndarray array containing n_resamples statistics values.
        '''
        # Draw n_resamples resamples with replacement, each of the same size as the observed sample
        return np.array([statistic(np.random.choice(arr_1d, arr_1d.size)) for _ in range(n_resamples)])
        
    if type(data) == np.ndarray:
        # Sample - ndarray (one or more metrics)
        # Apply _my_bootstrap_1d for each metric
        return np.apply_along_axis(_my_bootstrap_1d, axis, data, statistic, n_resamples)
    elif type(data) == pd.Series:
        # Sample - Series (one metric)
        # Apply _my_bootstrap_1d to it
        return pd.Series(_my_bootstrap_1d(data.values, statistic, n_resamples), name=data.name)
    else:
        # Sample - DataFrame (multiple metrics)
        # Apply _my_bootstrap_1d to each metric's values
        arr = np.apply_along_axis(_my_bootstrap_1d, axis, data.values, statistic, n_resamples)
        # We transform the obtained result into a dataframe
        if axis == 0:
            return pd.DataFrame(arr, columns=data.columns)
        else:
            return pd.DataFrame(arr, index=data.index)
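
As an illustration of the resampling idea, the sketch below bootstraps the median of a synthetic, skewed sample and reads a 95% percentile interval from the resulting distribution. The data, the seed and the smaller number of resamples are assumptions made for the example. The percentile interval shown here is also a different method from the "basic" interval used further below.

```python
import numpy as np

rng = np.random.default_rng(42)
observed = rng.exponential(scale=100, size=200)  # synthetic, skewed sample

n_resamples = 2000
# Each resample has the same size as the observed sample and is drawn with replacement
boot_medians = np.array([
    np.median(rng.choice(observed, size=observed.size, replace=True))
    for _ in range(n_resamples)
])

# 95% percentile interval of the bootstrap distribution
low, high = np.quantile(boot_medians, [0.025, 0.975])
print(f'median = {np.median(observed):.1f}, 95% interval = ({low:.1f}, {high:.1f})')
```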

In [None]:
def permutation_test(data, functions, alternatives=None, n_resamples=9999, random_state=0):
    '''
    Implements a "permutation test" for two independent groups on one or more metrics.
    It is a wrapper for the permutation_test function from the scipy.stats library.
     
        Parameters:
        ----------
        data : pandas.Series or pandas.DataFrame
            A set of observed samples. Group names should be used as indices.
            In a DataFrame, metrics must be arranged in columns.
                    
        functions : callable or pandas.Series of callable
            Function that computes the test statistic.
            callable if samples are pandas.Series.
            pandas.Series of callable if samples are pandas.DataFrame. Indexes should be metric names,
            i.e. match the column names in the samples.
            
        alternatives : {'two-sided', 'less', 'greater'} or Series of {'two-sided', 'less', 'greater'} or None. Default None
            Test type: 'two-sided' or None - two-sided, 'less' - left-sided, 'greater' - right-sided
            string if samples are pandas.Series.
            pandas.Series if samples are pandas.DataFrame. Indexes should be metric names,
            i.e. match the column names in the samples.
            
        n_resamples : int
            Number of resamples. Default is 9999
            
        Returns:
        -----------------------
        pvalue: float or pandas.Series
            The p-value of the test.
            float if samples are pandas.Series
            pandas.Series of float if samples are pandas.DataFrame. Indexes are names of metrics (columns)
            in the observed samples.
        null_distribution : pandas.Series or pandas.DataFrame
            Null distribution of test statistics.
            pandas.Series of float if samples are pandas.Series. Number of elements is n_resamples.
            pandas.DataFrame of float if samples are a pandas.DataFrame. Number of rows is n_resamples.
            Columns are the names of the metrics (columns) in the observed samples.
        statistic : float or pandas.Series
            The observed value of the test statistic.
            float if samples are pandas.Series.
            pandas.Series of float if samples are pandas.DataFrame. Indexes are names of metrics (columns)
            in the observed samples.
            
    '''

    def _permutation_test_for_1_metric(data, function, alternative=None, n_resamples=9999):
        '''
        Auxiliary function,
        which implements the "permutation test" for one metric.

        Parameters:
        ----------
        data : pandas.Series
            A set of observed samples. Group names should be used as indices.
                    
        function : callable
            Function that computes the test statistic.
            
        alternative : {'two-sided', 'less', 'greater'} or None. Defaults to None
            Test type: 'two-sided' or None - two-sided, 'less' - left-sided, 'greater' - right-sided
            
        n_resamples : int
            Number of resamples. Default is 9999
            
        Returns:
        -----------------------
        pvalue: float
            The p-value of the test.
        null_distribution : pandas.Series
            Null distribution of test statistics.
        statistic : float
            The observed value of the test statistic.
        '''
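        # Apply the stats.permutation_test function
        # Test type - independent ('independent'), for one metric (vectorized=False)
        # None is mapped to 'two-sided', because scipy expects a string here.
        # random_state comes from the enclosing function, so that results are reproducible.
        result = stats.permutation_test([data.loc[group] for group in data.index.unique()],
                                        statistic=function,
                                        permutation_type='independent',
                                        alternative=alternative or 'two-sided',
                                        vectorized=False,
                                        n_resamples=n_resamples,
                                        random_state=random_state)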
        # Return the result
        return result.pvalue, pd.Series(result.null_distribution), result.statistic

    # If the sample contains data for only one metric,
    # call _permutation_test_for_1_metric and return the result of its execution
    if type(data) == pd.Series:
        return _permutation_test_for_1_metric(data, functions, alternatives, n_resamples)

    # The sample contains data for several metrics
    # Create result templates
    pvalues = pd.Series(name='pvalue', index=data.columns, dtype='float')
    null_distributions = pd.DataFrame(columns=data.columns, dtype='float')
    statistics = pd.Series(name='statistic', index=data.columns, dtype='float')
    # We run tests for each metric in the sample:
    # call _permutation_test_for_1_metric and save the result of its execution
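    for metric in data.columns:
        # alternatives may be None, which means a two-sided test for every metric
        alternative = None if alternatives is None else alternatives[metric]
        pvalues[metric], null_distributions[metric], statistics[metric] = \
        _permutation_test_for_1_metric(data[metric], functions[metric], alternative, n_resamples)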
    # Return the result
    return pvalues, null_distributions, statistics
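
For reference, here is a minimal sketch of the underlying `scipy.stats.permutation_test` call on two synthetic groups, using the difference of medians as the test statistic. The groups, their parameters, the helper name `median_diff` and the reduced number of resamples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=15, size=80)  # synthetic group 1
group_b = rng.normal(loc=115, scale=15, size=80)  # synthetic group 2, shifted upwards


def median_diff(x, y):
    # Test statistic: difference between the group medians
    return np.median(x) - np.median(y)


result = stats.permutation_test(
    (group_a, group_b),
    statistic=median_diff,
    permutation_type='independent',  # group labels are shuffled between the samples
    alternative='two-sided',
    vectorized=False,
    n_resamples=999,
)
print(result.statistic, result.pvalue)
```

With a shift this large relative to the spread, the p-value should fall well below the usual 0.05 threshold.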

In [None]:
def confidence_interval(data, statistic, confidence_level=0.95, n_resamples=9999):
    '''
    Returns the confidence interval of the specified statistic for one or more metrics of
    the population represented by the observed sample, using the bootstrap method.
    It is a "wrapper" for the scipy.stats.bootstrap function.
    
        Parameters:
        ----------
        data : pandas.Series, pandas.DataFrame
            Observed data sample. Group names should be used as indices.
            
        statistic : callable
            A function that implements the calculation of statistics for one metric.
                    
        confidence_level : float
            Confidence level of the interval. Default is 0.95
                    
        n_resamples : int
            Number of resamples. Default is 9999
                    
        Returns:
        -----------------------
            A pandas.Series object named 'ci' and indexed by group names if data is a pandas.Series.
            A pandas.DataFrame object with group names as indices and metrics as columns
                if data is a pandas.DataFrame.
            Each element is a tuple of CI bounds.
    '''

    def _confidence_interval(data, statistic, confidence_level=0.95, n_resamples=9999):
        '''
        Returns the confidence interval of the given statistic for a single metric of
        the population represented by the observed sample, using the bootstrap method.
        It is a "wrapper" for the scipy.stats.bootstrap function.

            Parameters:
            ----------
            data : pandas.Series or 1d numpy.ndarray
                Observed data sample.

            statistic : callable
                A function that implements the calculation of statistics.

            confidence_level : float
                Confidence level of the interval. Default is 0.95

            n_resamples : int
                Number of resamples. Default is 9999

            Returns:
            -----------------------
                A tuple of the lower and upper CI bounds.
        '''
        # np.asarray accepts both a Series and the plain arrays passed by np.apply_along_axis
        return tuple(
            stats.bootstrap((np.asarray(data), ), statistic=statistic,
                            confidence_level=confidence_level,
                            n_resamples=n_resamples, vectorized=False,
                            method='basic').confidence_interval
        )

    # Group names in the order of their first appearance in the index
    groups = data.index.unique()

    if type(data) == pd.Series:
        # The sample contains data for only one metric
        # Return a Series of the CIs of this metric for all groups
        return pd.Series(
            [_confidence_interval(data.loc[group], statistic, confidence_level, n_resamples) 
             for group in groups],
            name='ci', index=groups, dtype='object')
    
    # The sample contains data for several metrics
    # Return a DataFrame of CIs. Metrics by columns.
    result = [np.apply_along_axis(_confidence_interval, 0, data.loc[group], 
                                 statistic, confidence_level, n_resamples).tolist()
              for group in groups]
    return pd.DataFrame([list(zip(group_result[0], group_result[1])) for group_result in result],
                        index=groups, columns=data.columns, dtype='object')
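
The call that the wrapper relies on can be tried in isolation. The sketch below computes a "basic" bootstrap confidence interval for the median of one synthetic sample; the data, the seed and the reduced number of resamples are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=3.0, sigma=0.5, size=300)  # synthetic, right-skewed sample

res = stats.bootstrap(
    (sample,),               # data is passed as a sequence of samples
    statistic=np.median,
    confidence_level=0.95,
    n_resamples=999,
    vectorized=False,
    method='basic',
)
low, high = res.confidence_interval
print(f'median = {np.median(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})')
```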

In [None]:
def confidence_interval_overlapping(confidence_interval_1, confidence_interval_2, metrics):
    '''
    The function checks for the intersection of two confidence intervals of one or more metrics.
     
        Parameters:
        ----------
        confidence_interval_1, confidence_interval_2 : Series of tuple of 2 float
            Confidence intervals. Names are names of populations.
                    
        metrics : DataFrame
            Information about metrics. Indexes - names of metrics in confidence_interval_1, confidence_interval_2.
        
        Returns:
        -----------------------
            An object of type Series of Boolean.
            Name - names of populations separated by commas and spaces
            Indexes are the names of metrics (row indices in metrics).
            Elements - result of intersection check
    '''
    def _confidence_interval_overlapping(confidence_interval_1, confidence_interval_2):
        '''
        The helper function checks for the intersection of confidence intervals of one metric.

            Parameters:
            ----------
            confidence_interval_1, confidence_interval_2 : tuple of 2 float
                Confidence intervals as (lower bound, upper bound).

            Returns:
            -----------------------
                False - do not overlap, True - overlap.
        '''
        # Two intervals do not overlap only if one of them ends before the other begins
        return not ((confidence_interval_1[1] < confidence_interval_2[0]) or 
                    (confidence_interval_2[1] < confidence_interval_1[0]))
    
    # Return a Series of Boolean, where the indices are the names of the metrics.
    # Series name - population names separated by commas and spaces
    return pd.Series(
        [_confidence_interval_overlapping(confidence_interval_1[metric], confidence_interval_2[metric])
        for metric in metrics.index], 
        index = metrics.index, name = f'{confidence_interval_1.name}, {confidence_interval_2.name}'
    )

In [None]:
def confidence_interval_center_diffs(confidence_interval_1, confidence_interval_2, metrics):
    '''
    Calculates the distance between the centers of two confidence intervals for one or more metrics.
     
        Parameters:
        ----------
        confidence_interval_1, confidence_interval_2 : Series of tuple of 2 float
            Confidence intervals. Names are names of populations.
                    
        metrics : DataFrame
            Information about metrics. Indexes - names of metrics in confidence_interval_1, confidence_interval_2.
        
        Returns:
        -----------------------
            An object of type Series of Float.
            Name - names of populations separated by commas and spaces
            Indexes are the names of metrics (row indices in metrics).
            Elements - distance between centers of confidence intervals
    '''
    def _confidence_interval_center_diff(confidence_interval_1, confidence_interval_2):
        '''
        The helper function calculates the distance between the centers
        of the confidence intervals of one metric.

            Parameters:
            ----------
            confidence_interval_1, confidence_interval_2 : tuple of 2 float
                Confidence intervals as (lower bound, upper bound).

            Returns:
            -----------------------
                Signed distance between the centers of the CIs (first minus second).
        '''
        return (confidence_interval_1[0] + confidence_interval_1[1] \
                - confidence_interval_2[0] - confidence_interval_2[1])/2

    # Return a Series of Float, where the indices are the names of the metrics.
    # Series name - population names separated by commas and spaces
    return pd.Series(
        [_confidence_interval_center_diff(confidence_interval_1[metric], confidence_interval_2[metric])
         for metric in metrics.index], 
        index = metrics.index, name = f'{confidence_interval_1.name}, {confidence_interval_2.name}'
    )
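
Both interval comparisons reduce to simple arithmetic on the interval bounds. The sketch below repeats the two formulas on hand-made intervals; the function names `intervals_overlap` and `center_distance` and the numbers are hypothetical and serve only to make the logic easy to check by hand.

```python
def intervals_overlap(ci_1, ci_2):
    """True if two closed intervals (low, high) share at least one point."""
    return not (ci_1[1] < ci_2[0] or ci_2[1] < ci_1[0])


def center_distance(ci_1, ci_2):
    """Signed distance between interval centers (first minus second)."""
    return (ci_1[0] + ci_1[1] - ci_2[0] - ci_2[1]) / 2


a = (10.0, 14.0)
b = (13.0, 18.0)
c = (15.0, 20.0)

print(intervals_overlap(a, b))  # a and b share the segment from 13 to 14
print(intervals_overlap(a, c))  # a ends at 14, before c starts at 15
print(center_distance(a, c))    # center of a is 12.0, center of c is 17.5
```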

In [None]:
def confidence_interval_info(data, metrics, group_pairs):
    '''
    Function:
    - calculates confidence intervals (ci);
    - checks for overlapping confidence intervals (ci_overlapping) of the specified pairs of groups (group_pairs);
    - calculates confidence interval centers (ci_center);
    - calculates the distance between the centers of the confidence intervals (ci_center) of the given pairs of groups (group_pairs).
    
        Parameters:
        ----------
        data : DataFrame or Series
            Observed data sample.
            When using Series, only a single metric dataset can be passed.
            When it is necessary to calculate CI for sets of several metrics of the same size
            DataFrame should be used. In this case, the metrics data sets should be located
            in separate columns.
                    
        metrics : DataFrame
            Information about metrics. Indexes - names of metrics. Column 'statistic' - statistical function.
            
        group_pairs : list or tuple
            Pairs of groups for which the overlap of confidence intervals is tested and
            the distances between the centers of confidence intervals are calculated.
        
        Returns:
        -----------------------
        ci : Series
            Confidence intervals of groups. Indexes - names of groups (indices from data)
            
        ci_overlapping : DataFrame
            The presence of intersections of confidence intervals of given pairs of groups.
            
        ci_center : Series
            Centers of confidence intervals of groups. Indexes - names of groups (indices from data)
            
        ci_center_diffs : DataFrame
            Distances between the centers of confidence intervals of given pairs of groups.
    '''
    # Calculating confidence intervals of statistics
    ci = data.apply(lambda s: confidence_interval(s, statistic=metrics.loc[s.name, 'statistic']))
    # Checking for overlapping confidence intervals of statistics of given pairs of groups
    ci_overlapping = pd.DataFrame([
        confidence_interval_overlapping(ci.loc[group_pair[0]], ci.loc[group_pair[1]], metrics)
        for group_pair in group_pairs])
    # Calculation of confidence interval centers of statistics
    ci_center = ci.map(lambda x: (x[0] + x[1])/2)
    # Calculation of distances between centers of confidence intervals of statistics of given pairs of groups
    ci_center_diffs = pd.DataFrame([
        confidence_interval_center_diffs(ci.loc[group_pair[0]], ci.loc[group_pair[1]], metrics)
        for group_pair in group_pairs])
    return ci, ci_overlapping, ci_center, ci_center_diffs

In [None]:
def display_cat_info(data):
    '''
    Generates a stylized tabular representation of the customer distribution map
    by categories of mobile Internet service quality assessment and reasons for such assessment.
    
        Parameters:
        ----------
        data : DataFrame
            Dataset of information about customer categories. Should contain the fields 'Internet score' and 'Dissatisfaction reasons'.
        
    
        Returns:
        -----------------------
        pandas.io.formats.style.Styler
            A stylized tabular representation of a customer distribution map.
    '''
    df = data.groupby(['Internet score', 'Dissatisfaction reasons'], sort=False).size().unstack(
        level=1, fill_value=0)
    df.index.name = None
    df.columns.name = ''
    df = df.map(lambda x: x if x > 0 else '-')
    style = df.style.set_table_styles(
        [{'selector': 'th, td', 'props': 'width: 80px; text-align: center; border: 1px solid lightgray;'}, 
         {'selector': 'th.index_name', 'props': 'border: none'}],
        overwrite=False
    )
    return style
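
The table behind this view is a plain count of customers per pair of categories. The sketch below shows the grouping and unstacking step on a toy data set; the category labels ('bad', 'good', 'speed', 'video', 'none') are made up for the example and are not the labels used in the study.

```python
import pandas as pd

toy = pd.DataFrame({
    'Internet score': ['bad', 'bad', 'good', 'good', 'good'],
    'Dissatisfaction reasons': ['speed', 'video', 'none', 'none', 'speed'],
})

# Count customers per (score, reason) pair, then move the reasons into columns;
# pairs that never occur are filled with 0
counts = (toy.groupby(['Internet score', 'Dissatisfaction reasons'], sort=False)
             .size()
             .unstack(level=1, fill_value=0))
print(counts)
```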

In [None]:
def display_statistics(data, axis=0, metrics=None, precision=1, caption=None, caption_font_size=12, 
                       opacity=1.0, index_width=120, col_width=130):
    '''
    Outputs statistics values for one or more metrics from one or more populations
    in the form of a stylized table with a heading.
    The best and worst values for each metric are highlighted in green and red font colors, respectively.
    The background color of the population name is set from the px.colors.DEFAULT_PLOTLY_COLORS palette
    in the order in which they appear in the data set.
    
        Parameters:
        ----------
        data : DataFrame
            The set of statistics values to display.
            The names of the metrics should be located on one axis, and the names of the populations on the other.
            
        axis: {0, 1}. Default - 0
            Shows what is located in the rows and columns of a data set.
            0 - indices are the names of populations, metric data are distributed across columns
            1 - columns are the names of populations, metric data are distributed across rows
                    
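        metrics : DataFrame
            Information about metrics. Must contain the columns 'name' (displayed metric name)
            and 'impact' ('+' if higher values are better, '-' if lower values are better).
                    
        precision : int. Default - 1
            The number of decimal places for the output statistics values.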
            
        caption : string or None. Default is None
            Table Header
            
        caption_font_size : int. Default - 12
            Table Header Font Size
            
        opacity : float. Default is 1.0
            Opacity level (from 0.0 to 1.0) of the population name background
            
        index_width : int. Default - 120
            Index column width
            
        col_width : int. Default - 130
            Width of value columns
                    
        Returns:
        -----------------------
            None. The stylized table is rendered with display().
    '''

    df = data.copy()
    if axis==0:
        df.columns = metrics['name']
        df.columns.name = 'Metric'
        df.index.name = 'Group'
        positive_subset = pd.IndexSlice[:, metrics.loc[metrics.impact=='+', 'name'].to_list()]
        negative_subset = pd.IndexSlice[:, metrics.loc[metrics.impact=='-', 'name'].to_list()]
    else:
        df.index = metrics['name']
        df.index.name = 'Metric'
        df.columns.name = 'Group'
        positive_subset = pd.IndexSlice[metrics.loc[metrics.impact=='+', 'name'].to_list(), :]
        negative_subset = pd.IndexSlice[metrics.loc[metrics.impact=='-', 'name'].to_list(), :]
    
    style = df.style\
    .map_index(lambda group: f'color: white; background-color: \
        {px.colors.DEFAULT_PLOTLY_COLORS[df.axes[axis].get_loc(group)]}; opacity: {opacity}', axis=axis)\
    .set_caption(caption)\
    .set_table_styles([
        {'selector': 'caption', 'props': f'font-size:{caption_font_size}pt; text-align: center; color: black'},
        {'selector': '.row_heading, td', 'props': f'width: {index_width}px; text-align: center;'},
        {'selector': '.col_heading, td', 'props': f'width: {col_width}px; text-align: center;'}
    ], overwrite=False)\
    .format(precision=precision)
    
    # The list of metric names is the second element of the slice for axis=0 and the first for axis=1
    if len(positive_subset[1 - axis]) > 0:
        style = style\
            .highlight_min(props='color: red; font-weight: bold', subset=positive_subset, axis=axis)\
            .highlight_max(props='color: green; font-weight: bold', subset=positive_subset, axis=axis)
    
    if len(negative_subset[1 - axis]) > 0:
        style = style\
            .highlight_min(props='color: green; font-weight: bold', subset=negative_subset, axis=axis)\
            .highlight_max(props='color: red; font-weight: bold', subset=negative_subset, axis=axis)
        
    if df.axes[axis].size == 1:
        style = style.hide(axis='index')
   
    display(style)
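
The red and green highlighting above follows the `impact` sign of each metric. A minimal sketch of that rule, separate from the styling code, is given below. The frames `stats` and `metrics` and the helper `best_and_worst` are hypothetical and only illustrate the convention that `'+'` marks a metric where higher is better and `'-'` a metric where lower is better.

```python
import pandas as pd

# Hypothetical group statistics: groups in rows, metrics in columns (the axis=0 layout)
stats = pd.DataFrame(
    {'Download speed': [2100.0, 3400.0, 2900.0],
     'Retransmission rate': [1.8, 0.9, 1.2]},
    index=['Group A', 'Group B', 'Group C'],
)
# Hypothetical metric descriptions with the impact sign of each metric
metrics = pd.DataFrame({
    'name': ['Download speed', 'Retransmission rate'],
    'impact': ['+', '-'],
})


def best_and_worst(stats, metrics):
    '''Return the best and the worst group for each metric according to its impact sign.'''
    rows = {}
    for name, impact in zip(metrics['name'], metrics['impact']):
        column = stats[name]
        if impact == '+':
            # Higher is better: the maximum is the best value, the minimum is the worst
            rows[name] = {'best': column.idxmax(), 'worst': column.idxmin()}
        else:
            # Lower is better: the minimum is the best value, the maximum is the worst
            rows[name] = {'best': column.idxmin(), 'worst': column.idxmax()}
    return pd.DataFrame.from_dict(rows, orient='index')


summary = best_and_worst(stats, metrics)
print(summary)
```

The styled table marks the same cells: the "best" group in green and the "worst" group in red.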

In [None]:
def display_pvalues(data, axis=0, metrics=None, precision=4, alpha=0.05, caption=None, caption_font_size=12, 
                    opacity=1.0, index_width=120, col_width=130):
    '''
    Outputs p-values for one or more metrics from one or more tests
    in the form of a stylized table with a heading.
    Values at or below the significance level are highlighted in red font.
    The background color of the population name is set from the px.colors.DEFAULT_PLOTLY_COLORS palette
    in the order in which they appear in the data set.
    
        Parameters:
        ----------
        data : DataFrame
            The set of displayed p-values.
            The names of the metrics should be located on one axis, and the names of the pairs of populations, separated by a comma and a space, should be located on the other.
            
        axis: {0, 1}. Default - 0
            Shows what is located in the rows and columns of a data set.
            0 - indices are the names of population pairs separated by commas and spaces, metric data are distributed across columns
            1 - columns are the names of population pairs separated by commas and spaces, metric data are distributed across rows
            
        metrics : DataFrame
            Information about metrics. The 'name' column provides the displayed metric names.
            Must be passed even though the default is None.
                    
        precision : int. Default - 4
            The number of decimal places for the output statistics values.
            
        alpha : float. Default is 0.05
            Level of significance.
            
        caption : string or None. Default is None
            Table Header
            
        caption_font_size : int. Default - 12
            Table Header Font Size
            
        opacity : float. Default is 1.0
            Opacity level (from 0.0 to 1.0) of the population name background
            
        index_width : int. Default - 120
            Index column width
            
        col_width : int. Default - 130
            Width of value columns
                    
        Returns:
        -----------------------
            None.
    '''

    df = data.copy()
    if axis==0:
        df.index = pd.MultiIndex.from_tuples(df.index.str.split(', ').map(lambda x: tuple(x)), name=[None, None])
        df.columns = pd.Index(metrics['name'].to_list(), name=None)
        groups = pd.Index(df.index.get_level_values(0).to_list() + df.index.get_level_values(1).to_list()
                         ).drop_duplicates()
    else:
        df.columns = pd.MultiIndex.from_tuples(df.columns.str.split(', ').map(lambda x: tuple(x)), name=[None, None])
        df.index = pd.Index(metrics['name'].to_list(), name=None)
        groups = pd.Index(df.columns.get_level_values(0).to_list() + df.columns.get_level_values(1).to_list()
                         ).drop_duplicates()
    
    style = df.style\
    .map_index(lambda group: f'color: white; background-color: \
        {px.colors.DEFAULT_PLOTLY_COLORS[groups.get_loc(group)]}; opacity: {opacity}', axis=axis, level=0)\
    .map_index(lambda group: f'color: white; background-color: \
        {px.colors.DEFAULT_PLOTLY_COLORS[groups.get_loc(group)]}; opacity: {opacity}', axis=axis, level=1)\
    .set_caption(caption)\
    .set_table_styles([
        {'selector': 'caption', 'props': f'font-size:{caption_font_size}pt; text-align:center; color:black'},
        {'selector': 'td', 'props': 'text-align: center; border: 1px solid lightgray; border-collapse: collapse;'},
        {'selector': '.row_heading, td', 'props': f'width: {index_width}px; text-align: center;'},
        {'selector': '.col_heading, td', 'props': f'width: {col_width}px; text-align: center;'}
    ], overwrite=False)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=0)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=1)\
    .format(precision=precision)\
    .highlight_between(right=alpha, inclusive='right', props='color: red; font-weight: bold')
    
    if df.axes[axis].size == 1:
        style = style.hide(axis='index')
   
    display(style)
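
The index handling in `display_pvalues` can be tried without any styling. The sketch below uses a hypothetical `pvalues` frame with made-up numbers. It splits the `'A, B'` pair labels into a two-level index, collects the group names in order of first appearance (that order selects the palette color), and flags the values that would be highlighted at `alpha = 0.05`.

```python
import pandas as pd

# Hypothetical p-values: one row per pair of populations, one column per metric
pvalues = pd.DataFrame(
    {'Download speed': [0.0031, 0.2100, 0.0500]},
    index=['Group A, Group B', 'Group A, Group C', 'Group B, Group C'],
)
alpha = 0.05

# Split 'Group A, Group B' into the tuple ('Group A', 'Group B') and build a two-level index
pairs = pvalues.index.str.split(', ')
pvalues.index = pd.MultiIndex.from_tuples([tuple(pair) for pair in pairs])

# Unique group names in order of first appearance: first level, then second level
groups = pd.Index(
    pvalues.index.get_level_values(0).to_list() + pvalues.index.get_level_values(1).to_list()
).drop_duplicates()

# Values at or below alpha are the ones shown in red
significant = pvalues <= alpha
print(groups.to_list())
print(significant)
```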

In [None]:
def display_confidence_interval(values, axis=0, metrics=None, precision=1, caption=None, caption_font_size=12, 
                                opacity=1.0, index_width=120, col_width=80):
    '''
    Outputs confidence interval values for one or more metrics in one or more populations
    in the form of a stylized table with a heading.
    For each confidence interval, the lower bound, the midpoint, and the upper bound are displayed in a separate column or row.
    The best and worst values of the CI center for each metric are highlighted in green and red font colors, respectively.
    The background color of the population name is set from the px.colors.DEFAULT_PLOTLY_COLORS palette
    in the order in which they appear in the data set.
    
        Parameters:
        ----------
        values : Series or DataFrame
            The confidence intervals to display, each given as a (lower bound, upper bound) pair.
            Population names should be used as the index.
            A Series holds the intervals of one metric, a DataFrame holds one column per metric.
            
        axis: {0, 1}. Default - 0
            Shows what is located in the rows and columns of the displayed table.
            0 - rows are the names of populations, interval bounds are distributed across columns
            1 - columns are the names of populations, interval bounds are distributed across rows
            
        metrics : DataFrame or Series
            Information about metrics (the 'name' and 'impact' fields are used).
            A DataFrame describes several metrics, a Series describes a single metric.
                    
        precision : int. Default - 1
            The number of decimal places for the output statistics values.
            
        caption : string or None. Default is None
            Table Header
            
        caption_font_size : int. Default - 12
            Table Header Font Size
            
        opacity : float. Default is 1.0
            Opacity level (from 0.0 to 1.0) of the population name background
            
        index_width : int. Default - 120
            Index column width
            
        col_width : int. Default - 80
            Width of value columns
                    
        Returns:
        -----------------------
            None.
    '''
    df = pd.DataFrame()
    if axis==0:
        if type(metrics) == pd.DataFrame:
            df = pd.DataFrame(
                columns=pd.MultiIndex.from_product(
                    [metrics['name'], ['Low bound', 'Midpoint', 'Hi bound']],
                    names = ['', '']),
                index=pd.Index(values.index.to_list(), name=None)
            )
            positive_subsets = [pd.IndexSlice[:, (description, 'Midpoint')] 
                                for description in metrics.loc[metrics.impact=='+', 'name']]
            negative_subsets = [pd.IndexSlice[:, (description, 'Midpoint')] 
                                for description in metrics.loc[metrics.impact=='-', 'name']]
        else:
            df = pd.DataFrame(
                columns=pd.Index(
                    ['Low bound', 'Midpoint', 'Hi bound'],
                    name=''),
                index=pd.Index(values.index.to_list(), name=None)
            )
            positive_subsets = [pd.IndexSlice[:, 'Midpoint']] if metrics.impact == '+' else []
            negative_subsets = [pd.IndexSlice[:, 'Midpoint']] if metrics.impact == '-' else []
    else:
        if type(metrics) == pd.DataFrame:
            df = pd.DataFrame(
                index=pd.MultiIndex.from_product(
                    [metrics['name'], ['Low bound', 'Midpoint', 'Hi bound']],
                    names = ['', '']),
                columns=pd.Index(values.index.to_list(), name=None)
            )
            positive_subsets = [pd.IndexSlice[(description, 'Midpoint'), :] 
                                for description in metrics.loc[metrics.impact=='+', 'name']]
            negative_subsets = [pd.IndexSlice[(description, 'Midpoint'), :] 
                                for description in metrics.loc[metrics.impact=='-', 'name']]
        else:
            df = pd.DataFrame(
                index=pd.Index(
                    ['Low bound', 'Midpoint', 'Hi bound'],
                    name=''),
                columns=pd.Index(values.index.to_list(), name=None)
            )
            # With axis=1 the bounds are located in rows, so the 'Midpoint' row is selected
            positive_subsets = [pd.IndexSlice['Midpoint', :]] if metrics.impact == '+' else []
            negative_subsets = [pd.IndexSlice['Midpoint', :]] if metrics.impact == '-' else []
            
    # The bounds are filled in the axis=0 layout (populations in rows),
    # so for axis=1 the table is transposed first and transposed back afterwards
    if axis == 1:
        df = df.T

    if df.columns.nlevels == 2:
        df = df.swaplevel(axis=1)
    
        df.loc[:, 'Low bound'] = values.map(lambda x: x[0]).to_numpy()
        df.loc[:, 'Hi bound'] = values.map(lambda x: x[1]).to_numpy()
        df.loc[:, 'Midpoint'] = (df.loc[:, 'Low bound'] + df.loc[:, 'Hi bound']).to_numpy() / 2
    
        df = df.swaplevel(axis=1)
    else:
        df.loc[:, 'Low bound'] = values.apply(lambda x: x[0]).to_numpy()
        df.loc[:, 'Hi bound'] = values.apply(lambda x: x[1]).to_numpy()
        df.loc[:, 'Midpoint'] = (df.loc[:, 'Low bound'] + df.loc[:, 'Hi bound']).to_numpy() / 2

    if axis == 1:
        df = df.T

    style = df.style\
    .map_index(lambda group: f'''color: white; background-color: 
                            {px.colors.DEFAULT_PLOTLY_COLORS[df.axes[axis].get_loc(group)]}; 
                            opacity: {opacity}''', axis=axis)\
    .set_caption(caption)\
    .set_table_styles([
        {'selector': 'caption', 'props': f'font-size:{caption_font_size}pt; text-align:center; color:black'},
        {'selector': 'td', 'props': 'text-align: center; border: 1px solid lightgray; border-collapse: collapse;'},
        {'selector': '.row_heading, td', 'props': f'width: {index_width}px; text-align: center;'},
        {'selector': '.col_heading, td', 'props': f'width: {col_width}px; text-align: center;'}
    ], overwrite=False)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=0)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=1)\
    .format(precision=precision)
    
    for positive_subset in positive_subsets:
        style = style\
            .highlight_min(props=f'color: red; font-weight: bold;', 
                           subset=positive_subset, axis=axis)\
            .highlight_max(props=f'color: green; font-weight: bold;', 
                           subset=positive_subset, axis=axis)

    for negative_subset in negative_subsets:
        style = style\
            .highlight_max(props=f'color: red; font-weight: bold;', 
                           subset=negative_subset, axis=axis)\
            .highlight_min(props=f'color: green; font-weight: bold;', 
                           subset=negative_subset, axis=axis)

    if df.axes[axis].size == 1:
        style = style.hide(axis='index')
   
    display(style)
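
The core of `display_confidence_interval` for a single metric is the conversion of `(low, high)` pairs into three columns. A minimal sketch with a hypothetical `intervals` Series and made-up bounds:

```python
import pandas as pd

# Hypothetical confidence intervals of one metric: one (low, high) pair per population
intervals = pd.Series([(10.0, 14.0), (12.5, 15.5)], index=['Group A', 'Group B'])

# Unpack the pairs into separate columns
table = pd.DataFrame({
    'Low bound': intervals.map(lambda ci: ci[0]),
    'Hi bound': intervals.map(lambda ci: ci[1]),
})
# The midpoint is the arithmetic mean of the two bounds
table['Midpoint'] = (table['Low bound'] + table['Hi bound']) / 2
# Same column order as in the displayed table
table = table[['Low bound', 'Midpoint', 'Hi bound']]
print(table)
```

The midpoint column is the one compared between populations when the best and worst values are highlighted.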

In [None]:
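def display_confidence_interval_overlapping(values, axis=0, metrics=None, caption='', caption_font_size=12, 
                                            opacity=1.0, index_width=120, col_width=130):
    '''
    Outputs whether the confidence intervals of population pairs overlap ('Yes' or 'No')
    for one or more metrics in the form of a stylized table with a heading.
    'No' values are highlighted in red font.
    
        Parameters:
        ----------
        values : DataFrame
            Overlap flags: 1 means that the confidence intervals of the pair overlap.
            The names of the metrics should be located on one axis, and the names of the pairs
            of populations, separated by a comma and a space, should be located on the other.
            
        The remaining parameters have the same meaning as in display_pvalues.
    '''
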
    df = values.map(lambda x: 'Yes' if x==1 else 'No')
    if axis==0:
        df.index = pd.MultiIndex.from_tuples(df.index.str.split(', ').map(lambda x: tuple(x)), name=[None, None])
        df.columns = pd.Index(metrics['name'].to_list(), name=None)
        groups = pd.Index(df.index.get_level_values(0).to_list() + df.index.get_level_values(1).to_list()
                         ).drop_duplicates()
    else:
        df.columns = pd.MultiIndex.from_tuples(df.columns.str.split(', ').map(lambda x: tuple(x)), name=[None, None])
        df.index = pd.Index(metrics['name'].to_list(), name=None)
        groups = pd.Index(df.columns.get_level_values(0).to_list() + df.columns.get_level_values(1).to_list()
                         ).drop_duplicates()
    
    style = df.style\
    .map_index(lambda group: f'color: white; background-color: {px.colors.DEFAULT_PLOTLY_COLORS[groups.get_loc(group)]}; opacity: {opacity}', axis=axis, level=0)\
    .map_index(lambda group: f'color: white; background-color: {px.colors.DEFAULT_PLOTLY_COLORS[groups.get_loc(group)]}; opacity: {opacity}', axis=axis, level=1)\
    .set_caption(caption)\
    .set_table_styles([
        {'selector': 'caption', 'props': f'font-size:{caption_font_size}pt; text-align:center; color:black'},
        {'selector': 'td', 'props': 'text-align: center; border: 1px solid lightgray; border-collapse: collapse;'},
        {'selector': '.row_heading, td', 'props': f'width: {index_width}px; text-align: center;'},
        {'selector': '.col_heading, td', 'props': f'width: {col_width}px; text-align: center;'}
    ], overwrite=False)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=0)\
    .map_index(lambda s: 'border: 1px solid lightgray; border-collapse: collapse;', axis=1)\
    .map(lambda x: f'color: red; font-weight: bold;' if x=='No' else None)
    
    if df.axes[axis].size == 1:
        style = style.hide(axis='index')
        
    display(style)
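
`display_confidence_interval_overlapping` expects ready-made overlap flags. How they are computed is not shown in this part of the notebook, so the sketch below is only an assumption about one straightforward way to obtain them: two closed intervals overlap when each one starts no later than the other one ends. The helper `intervals_overlap` and the sample bounds are hypothetical.

```python
from itertools import combinations

import pandas as pd


def intervals_overlap(a, b):
    '''Return True if the closed intervals a=(low, high) and b=(low, high) share at least one point.'''
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical confidence intervals of one metric
intervals = pd.Series(
    [(10.0, 14.0), (12.5, 15.5), (16.0, 18.0)],
    index=['Group A', 'Group B', 'Group C'],
)

# One flag per pair of populations; the label format 'A, B' matches what the display function splits on
overlap = pd.Series({
    f'{first}, {second}': int(intervals_overlap(intervals[first], intervals[second]))
    for first, second in combinations(intervals.index, 2)
})
# The same 1 -> 'Yes', otherwise 'No' mapping as in the display function
labels = overlap.map(lambda x: 'Yes' if x == 1 else 'No')
print(labels)
```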

In [None]:
def plot_metric_histograms(data, metrics, title=None, title_y=None, yaxis_title=None, 
                           title_font_size=14, labels_font_size=12, units_font_size=12, axes_tickfont_size=12,
                           height=None, width=None, horizontal_spacing=None, vertical_spacing=None,
                           n_cols=1, opacity=0.5, histnorm='percent', 
                           add_boxplot=False, boxplot_height_fraq=0.25, add_mean=False, add_kde=False,
                           mark_confidence_interval=False, confidence_level=0.95,
                           add_statistic=False, mark_statistic=None, statistic=None):
    '''
    The function builds histograms for several metrics.
    A separate canvas is created for each metric. The canvases can be arranged in several columns.
    On each canvas, a histogram of the metric values is drawn.
    Additionally, a boxplot can be placed above each histogram,
    similar to what the px.histogram function does with marginal='box'.
    Another option is to draw a kernel density estimate (KDE)
    on the same canvas as the histogram. The KDE is only drawn for
    a probability density histogram (histnorm='probability density').

    Parameters:
    ----------
    data : DataFrame
        The data sample for which histograms are plotted.
        The values of each metric should be located in a separate column
        named after the corresponding index entry of metrics.

    metrics : DataFrame
        Information about metrics. Indexes are names of metrics.
        
    title : string or None. Default is None
        Chart Title
        
    title_y : float or None
        Relative position of the chart title by height (from 0.0 (bottom) to 1.0 (top))
        
    yaxis_title : string or None. Defaults to None
        Y-axis title
        
    title_font_size : int. Default - 14
        Chart Title Font Size
        
    labels_font_size : int. Default - 12
        Font size of inscriptions
        
    units_font_size : int. Default - 12
        Font size of unit names
        
    axes_tickfont_size : int. Default - 12
        Font size of axes labels
        
    height : int or None. Default is None
        Height of the chart

    width : int or None. Default is None
        Diagram width
                
    horizontal_spacing : float or None. Defaults to None
        Distance between columns of canvases in fractions of width (from 0.0 to 1.0)

    vertical_spacing : float or None. Defaults to None
        Distance between the rows of canvases in fractions of the height (from 0.0 to 1.0)
        
    n_cols : int. Default - 1
        Number of columns of canvases

    opacity : float. Default is 0.5
        Opacity level (0.0 to 1.0) of the column color
        
    histnorm : {'percent', 'probability', 'density' or 'probability density'} or None. Defaults to 'percent'
        Histogram type (see plotly.express.histogram)
    
    boxplot_height_fraq : float. Default is 0.25
        Fraction of boxplot height. Only used if add_boxplot=True
    
    add_boxplot : boolean. Default is False
        Add a boxplot above each histogram.
        
    add_mean : boolean. Default is False
        Add a mean value marker to the boxplot. Only used if add_boxplot=True
        
    add_kde : boolean. Default is False
        Add KDE curve to histogram. Only used if histnorm='probability density'
    
    mark_confidence_interval : boolean. Default is False
        Color KDE regions outside the confidence interval with the histogram color at half transparency.
        Only used if add_kde=True
    
    confidence_level : float. Default is 0.95
        Confidence level. Only used if mark_confidence_interval=True
        
    add_statistic : boolean. Default is False
        Mark the statistics on the histogram as a vertical dashed line.
        
    mark_statistic : Series or dict with values {'tomin', 'tomax', 'min', 'max'}, or None. Default is None
        For each metric, color the KDE area to the left ('tomin') or to the right ('tomax') of the statistic,
        or the side with fewer ('min') or more ('max') observations, in the histogram color with half transparency.
        Indexes are the names of the metrics.
    
    statistic : Series
        The value of the statistics displayed on the histogram. Indexes are the names of the metrics.
        Used only if add_statistic=True and/or mark_statistic is set
    
    Returns:
    -----------------------
        None.
    '''

    def _confidence_interval(data, confidence_level=0.95):
        '''
        Calculates the confidence interval bounds of a data set

        Parameters:
        ----------
        data : Series or DataFrame
            The data sample for which confidence interval bounds are calculated.
            When using Series, only a single metric dataset can be passed.
            When it is necessary to calculate CI for sets of several metrics of the same size
            DataFrame should be used. In this case, the metrics data sets should be located
            in separate columns.
            
        confidence_level : float. Default is 0.95
            Confidence level.
            
        Returns:
        -----------------------
            If data is a Series, then a Series with two elements: 'low' is the lower bound, 'high' is the upper bound.
            If data is a DataFrame, then a DataFrame with two rows: 'low' is the lower bound, 'high' is the upper bound.
        '''
        alpha = 1 - confidence_level
        result = data.quantile([alpha/2, 1 - alpha/2])
        result = result.rename({alpha/2: 'low', 1 - alpha/2: 'high'})
        return result
    
    # The list of metrics is the names of the columns in the dataset
    n_metrics = metrics.shape[0]
    # Calculate the number of rows and their heights
    n_rows = int(np.ceil(n_metrics / n_cols))
    if add_boxplot:
        row_heights = [boxplot_height_fraq / n_rows, (1 - boxplot_height_fraq) / n_rows] * n_rows
        n_rows *= 2
    else:
        row_heights = [1 / n_rows] * n_rows
    titles = []
    specs = []
    # Generate a list of titles and chart specifications
    for index in range(0, n_metrics, n_cols):
        titles += metrics['label'].iloc[index:index + n_cols].to_list()
        if add_boxplot:
            titles += [''] * n_cols
            specs.append([{'b': 0.004}] * n_cols)
        # The bottom padding must be a number, so None falls back to 0.0
        specs.append([{'b': vertical_spacing or 0.0}] * n_cols)
    # Create a canvas with n_row*n_cols graphs
    fig = make_subplots(cols=n_cols, rows=n_rows, row_heights=row_heights, subplot_titles=titles,
                        horizontal_spacing=horizontal_spacing, vertical_spacing=0,
                        specs=specs)
    # Display metrics histograms with boxes and whiskers above them
    for index, metric in enumerate(metrics.index):
        # We go by metrics
        col = index % n_cols + 1
        row = (index // n_cols) * (2 if add_boxplot else 1) + 1
        # Add a histogram
        fig.add_histogram(x=data[metric], row=row + (1 if add_boxplot else 0), col=col, histnorm=histnorm,
                          bingroup=index + 1,
                          marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index], 
                          marker_line_color='white', marker_line_width=1,
                          opacity=opacity, showlegend=False, name=metrics.loc[metric, 'name'])
        # Add KDE to the histogram
        if add_kde and histnorm == 'probability density':
            special_points = None
            if mark_confidence_interval:
                confidence_interval = _confidence_interval(data[metric], confidence_level)
                special_points = confidence_interval
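            if mark_statistic is not None and statistic is not None:
                statistic_point = pd.Series(statistic[metric])
                if special_points is None:
                    special_points = statistic_point
                else:
                    # Series.append was removed in pandas 2.0, pd.concat is used instead
                    special_points = pd.concat([special_points, statistic_point])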
            metric_kde = kde(data[metric],
                             special_points=special_points)
            metric_kde.sort_values(['value'], inplace=True)
            fig.add_scatter(x=metric_kde['value'], y=metric_kde['pdf'], row=row + (1 if add_boxplot else 0), col=col,
                            mode='lines', marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index], marker_line_width=1,
                            opacity=opacity, showlegend=False, name=metrics.loc[metric, 'name'])
            if mark_confidence_interval:
                df = metric_kde[metric_kde['value'] <= confidence_interval['low']]
                fig.add_scatter(x=df['value'], y=df['pdf'], row=row + (1 if add_boxplot else 0), col=col, mode='lines',
                                marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index], marker_line_width=1,
                                opacity=opacity, name=metrics.loc[metric, 'name'],
                                showlegend=False, fill='tozeroy')
                df = metric_kde[metric_kde['value'] >= confidence_interval['high']]
                fig.add_scatter(x=df['value'], y=df['pdf'], row=row + (1 if add_boxplot else 0), col=col, mode='lines',
                                marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index], marker_line_width=1,
                                opacity=opacity, name=metrics.loc[metric, 'name'], 
                                showlegend=False, fill='tozeroy')
            if mark_statistic is not None and statistic is not None:
                if mark_statistic[metric] == 'tomin':
                    df = metric_kde[metric_kde['value'] <= statistic[metric]]
                elif mark_statistic[metric] == 'tomax':
                    df = metric_kde[metric_kde['value'] >= statistic[metric]]
                elif mark_statistic[metric] == 'min':
                    if sum(data[metric] <= statistic[metric]) <= sum(data[metric] >= statistic[metric]):
                        df = metric_kde[metric_kde['value'] <= statistic[metric]]
                    else:
                        df = metric_kde[metric_kde['value'] >= statistic[metric]]
                elif mark_statistic[metric] == 'max':
                    if sum(data[metric] <= statistic[metric]) >= sum(data[metric] >= statistic[metric]):
                        df = metric_kde[metric_kde['value'] >= statistic[metric]]
                    else:
                        df = metric_kde[metric_kde['value'] <= statistic[metric]]
                fig.add_scatter(x=df['value'], y=df['pdf'], row=row + (1 if add_boxplot else 0), col=col, mode='lines',
                                marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index], marker_line_width=1,
                                opacity=opacity, name=metrics.loc[metric, 'name'],
                                showlegend=False, fill='tozeroy')
            if add_statistic and statistic is not None:
                # Add statistics
                fig.add_vline(x=statistic[metric], row=row + (1 if add_boxplot else 0), col=col,
                              line_color=px.colors.DEFAULT_PLOTLY_COLORS[index], line_width=2, line_dash='dash',
                              opacity=opacity)
        if add_boxplot:
            # Add a "box with whiskers" above the histogram
            fig.add_box(x=data[metric], row=row, col=col, marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index],
                        line_width=1, name=metrics.loc[metric, 'name'],
                        boxmean=add_mean, showlegend=False)
            # For "boxes with whiskers" we set the same range of values on the x-axis as for the histograms,
            # show the grid on the x axis, but hide the labels on it
            fig.update_xaxes(matches=list(fig.select_traces(row=row + 1, col=col))[0].xaxis, 
                             showgrid=True, showticklabels=False, row=row, col=col)
            # For "boxes with whiskers" we hide the y-axis title and labels on it
            fig.update_yaxes(title='', row=row, col=col, showticklabels=False)
        fig.update_xaxes(title=metrics['units'].iloc[index], title_font_size=units_font_size, 
                         row=row + (1 if add_boxplot else 0), col=col)
        fig.update_yaxes(title=yaxis_title, title_font_size=units_font_size, 
                         row=row + (1 if add_boxplot else 0), col=col)

    fig.update_xaxes(tickfont_size=axes_tickfont_size)
    fig.update_yaxes(tickfont_size=axes_tickfont_size)
    fig.update_annotations(font_size=labels_font_size)
    fig.update_layout(barmode='overlay', title=title, title_font_size=title_font_size,
                      title_x=0.5, title_y=title_y,
                      width=width, height=height,
                      margin_b=0)
    fig.show()
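
The nested `_confidence_interval` helper takes the empirical quantiles of the values passed to it, so the bounds it returns delimit the central `confidence_level` share of those values. A minimal standalone sketch of the same calculation follows. The function name `percentile_interval` and the sample are hypothetical, and the index is relabeled directly instead of with `rename`.

```python
import numpy as np
import pandas as pd


def percentile_interval(data, confidence_level=0.95):
    '''Return the alpha/2 and 1 - alpha/2 quantiles of data labeled 'low' and 'high'.'''
    alpha = 1 - confidence_level
    result = data.quantile([alpha / 2, 1 - alpha / 2])
    # Works for a Series (two elements) and for a DataFrame (two rows)
    result.index = ['low', 'high']
    return result


# Hypothetical sample: the numbers 1, 2, ..., 101
sample = pd.Series(np.arange(1, 102, dtype=float))
bounds = percentile_interval(sample, confidence_level=0.90)
print(bounds)
```

With the default linear interpolation of `quantile`, the 5% and 95% quantiles of this sample fall on the 6th and 96th values, so the interval is approximately 6 to 96.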

In [None]:
def plot_metric_confidence_interval(data, metrics, title=None, title_y=None, yaxis_title=None,
                                    title_font_size=14, labels_font_size=12, units_font_size=12, axes_tickfont_size=12,
                                    height=None, width=None, horizontal_spacing=None, vertical_spacing=None,
                                    n_cols=1, opacity=0.5):
    '''
    The function is designed to construct confidence intervals for several metrics of several groups.
    A separate canvas is created for each metric. The canvas can be divided into several columns.
    The confidence interval is indicated as a horizontal segment with vertical cutoffs at the ends.
    The center of the confidence interval is highlighted by a dot.
    
        Parameters:
        ----------
        data : pandas.Series or pandas.DataFrame
            Confidence intervals of populations, each given as a (lower bound, upper bound) pair.
            pandas.Series - for constructing CI for one metric
            pandas.DataFrame - for constructing CI for several metrics. Metric data are arranged in columns.
            Population names should be used as the dataset index.
            
        metrics : DataFrame
            Information about metrics. Indexes are names of metrics.

        title : string or None. Default is None
            Chart Title
            
        title_y : float or None
            Relative position of the chart title by height (from 0.0 (bottom) to 1.0 (top))
            
        title_font_size : int. Default - 14
            Chart Title Font Size
            
        labels_font_size : int. Default - 12
            Font size of inscriptions
            
        units_font_size : int. Default - 12
            Font size of unit names
            
        axes_tickfont_size : int. Default - 12
            Font size of axes labels
            
        height : int or None. Default is None
            Height of the chart

        width : int or None. Default is None
            Diagram width
                    
        horizontal_spacing : float or None. Defaults to None
            Distance between columns of canvases in fractions of width (from 0.0 to 1.0)

        vertical_spacing : float or None. Defaults to None
            Distance between the rows of canvases in fractions of the height (from 0.0 to 1.0)
            
        n_cols : int. Default - 1
            Number of columns of canvases

        opacity : float. Default is 0.5
            Opacity level (0.0 to 1.0) of the column color
            
        Returns:
        -----------------------
            None.
    '''
    # Calculate the number of rows and their heights
    n_rows = int(np.ceil(metrics.shape[0] / n_cols))
    row_heights = [1 / n_rows] * n_rows
    titles = []
    specs = []
    # Generate a list of titles and chart specifications
    for index in range(0, metrics.shape[0], n_cols):
        titles += (metrics.iloc[index:index + n_cols, :]['label'].to_list())
        if vertical_spacing:
            specs.append([{'b': vertical_spacing}] * n_cols)
        else:
            specs.append([{}] * n_cols)
    # Create a canvas with n_row*n_cols graphs
    fig = make_subplots(cols=n_cols, rows=n_rows, row_heights=row_heights,
                        subplot_titles=titles, specs=specs,
                        horizontal_spacing=horizontal_spacing, vertical_spacing=0.004)
    # Plot the confidence interval of each metric as a dot with horizontal error bars
    for index, metric in enumerate(metrics.index):
        # We go by metrics
        col = index % n_cols + 1
        row = index // n_cols + 1
        if type(data) == pd.Series:
            # Add a dot plot to the canvas
            fig.add_scatter(x=[(data[metric][0] + data[metric][1])/2], y=[0],
                            error_x={'type': 'constant', 'value': abs(data[metric][0] - data[metric][1])/2},
                            row=row, col=col, name='',
                            marker_color=px.colors.DEFAULT_PLOTLY_COLORS[index],
                            marker_line_color='white', marker_line_width=1,
                            opacity=opacity, showlegend=False)
        else:
            # Add a dot for each group
            for group_index, group in enumerate(data.index.unique()):
                fig.add_scatter(x=[(data.loc[group, metric][0] + data.loc[group, metric][1]) / 2], y=[-group_index],
                                error_x={'type': 'constant', 'value': abs(data.loc[group, metric][0] - data.loc[group, metric][1])/2},
                                customdata=[data.loc[group, metric]],
                                row=row, col=col, name=group,
                                marker_color=px.colors.DEFAULT_PLOTLY_COLORS[group_index],
                                marker_line_color='white', marker_line_width=1, marker_size=10, opacity=opacity,
                                showlegend=index == 0, legendgroup=group,
                                hovertemplate='%{x:.1f}, (%{customdata[0]:.1f}, %{customdata[1]:.1f})')
        fig.update_xaxes(title=metrics.loc[metric, 'units'], title_font_size=units_font_size, row=row, col=col)
        fig.update_yaxes(title=yaxis_title, title_font_size=units_font_size, row=row, col=col)
    fig.update_xaxes(tickfont_size=axes_tickfont_size)
    fig.update_yaxes(visible=False, tickfont_size=axes_tickfont_size)
    fig.update_annotations(font_size=labels_font_size)
    fig.update_layout(title=title, title_y=title_y, title_font_size=title_font_size, title_x=0.5,
                      width=width, height=height, margin_b=0)
    fig.show()

In [None]:
def plot_group_size_barchart(data, title=None, title_y=None, title_font_size=14, opacity=0.5, orientation='h', 
                             labels_font_size=12, xaxis_title=None, yaxis_title=None, 
                             axes_title_font_size=12, axes_tickfont_size=12, 
                             height=None, width=None):
    '''
    The function plots a bar chart that displays the sizes of the groups present in the data set.
    
        Parameters:
        ----------
        data : DataFrame
            Dataset. Group names should be used as the dataset index.
            
        title : string or None. Default is None
            Chart Title
            
        title_y : float or None
            Relative position of the chart title by height (from 0.0 (bottom) to 1.0 (top))
            
        title_font_size : int. Default - 14
            Chart Title Font Size
            
        opacity : float. Default is 0.5
            Opacity level (0.0 to 1.0) of the column color
            
        orientation : {'h', 'v'}. Default is 'h'
            Chart orientation: 'h'-horizontal, 'v'-vertical
            
        labels_font_size : int. Default - 12
            Font size of the text labels
            
        xaxis_title : string or None. Defaults to None
            x-axis title
            
        yaxis_title : string or None. Defaults to None
            Y-axis title
            
        axes_title_font_size : int. Default - 12
            Axis Title Font Size
            
        axes_tickfont_size : int. Default - 12
            Font size of axes labels
            
        height : int or None. Default is None
            Height of the chart

        width : int or None. Default is None
            Width of the chart
                    
        Returns:
        -----------------------
            None
    '''
    
    # Build a bar chart
    # For a horizontal chart, reverse the order of the index (and, below, of the colors)
    df = data.index if orientation == 'v' else data.index[::-1]
    colors = px.colors.DEFAULT_PLOTLY_COLORS[:df.nunique()] 
    if orientation == 'h':
        colors.reverse()
    fig = px.histogram(df, title=title, opacity=opacity, orientation=orientation, height=height, width=width)
    fig.update_traces(texttemplate="%{x}", hovertemplate='%{y} - %{x:} clients',
                      marker_color=colors, showlegend=False)
    fig.update_layout(bargap=0.2, boxgroupgap=0.2,
                      font_size=title_font_size,
                      title_x=0.5, title_y=title_y,
                      margin_t=60, margin_b=40)
    fig.update_xaxes(title=xaxis_title, title_font_size=axes_title_font_size, tickfont_size=axes_tickfont_size)
    fig.update_yaxes(title=yaxis_title, title_font_size=axes_title_font_size, tickfont_size=axes_tickfont_size)
    fig.update_annotations(font_size=labels_font_size)
    fig.show()

## 1. Loading and viewing data

We load the user survey data from the `megafon.csv` file into the `data` dataset and display information about the content.

In [None]:
data = pd.read_csv('megafon.csv')
data.info(memory_usage=False)

The dataset summary shows the following:
- the dataset contains the results of answers and metrics of mobile internet usage for `3112` customers;
- the answer to the 1st question was not given by `2` customers (the `Q1` field contains `3110` non-null values);
- some customers, when answering the 1st question, gave answers that were not numeric (the `Q1` field has the `object` type, and if all the answers were integers, the type would be `int64`).

## 2. Data preparation and exploratory analysis

### 2.1. Removing unnecessary data

The dataset contains a user identifier field `user_id`. This information is not needed for analysis - it can be removed.

In [None]:
data.drop(columns='user_id', inplace=True)

### 2.2. Data cleaning and Exploratory analysis

#### 2.2.1. Cleaning and Analyzing answers to the 1st question

Earlier, we established that not all customers answered the 1st question as a number. It is impossible to analyze such answers, so we need to exclude records with such answers from the dataset and convert the `Q1` field to an integer type.

To do this, we first leave only those records whose values in the `Q1` field are a text representation of an integer. And then we convert the `Q1` field to an integer type.

In [None]:
data = data[data.Q1.notna()]
data = data[data.Q1.str.isdecimal()]
data.Q1 = data.Q1.astype(int)

In addition, users could answer the 1st question with a number that is outside the rating scale. Let's correct this by keeping in the dataset only those records in which the value of the `Q1` field is in the range from 1 to 10.

In [None]:
data = data[(data.Q1 >= 1) & (data.Q1 <= 10)]
data.info()

After removing incorrect answers to the 1st question, there are `3058` records left in the data set.
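
The cleaning rules for the 1st question can be illustrated on a toy example. This is a minimal sketch in plain Python: the sample answers and the `clean_scores` helper are hypothetical and are not part of the notebook, but the checks follow the same order as the cells above (missing value, integer string, range from 1 to 10).

```python
# Toy answers to the 1st question (hypothetical values, not taken from megafon.csv)
raw_answers = ['10', '7', None, 'no comment', '0', '11', '3', '5.5']

def clean_scores(answers, low=1, high=10):
    """Keep only answers that are integer strings within [low, high]."""
    cleaned = []
    for answer in answers:
        # Skip missing answers and anything that is not a plain non-negative integer
        if answer is None or not answer.isdecimal():
            continue
        score = int(answer)
        # Skip integers that fall outside the rating scale
        if low <= score <= high:
            cleaned.append(score)
    return cleaned

scores = clean_scores(raw_answers)
print(scores)
```

Only `'10'`, `'7'` and `'3'` pass all three checks here; `'5.5'` is rejected because `str.isdecimal()` returns `False` for strings that contain a decimal point.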

Let's look at the structure of the answers to the 1st question. To do this, let's build a histogram of the percentage distribution of answers to this question.

In [None]:
fig = px.histogram(data, x='Q1', histnorm='percent',
                   title='<b>Distribution of customers by quality satisfaction score</b>', opacity=0.5)
fig.update_traces(texttemplate="%{y:.1f}%", hovertemplate='%{x} - %{y:.1f}%',
                  marker_color=px.colors.DEFAULT_PLOTLY_COLORS)
fig.update_layout(title_x=0.5, title_y=0.88, bargap=0.2, width=800, height=400, margin_l=0, margin_b=0)
fig.update_xaxes(title='', tickvals=data['Q1'].sort_values().unique())
fig.update_yaxes(title='')
fig.show()

As we can see, more than a quarter of customers (`27.7%`) are completely satisfied with the quality of communication. But it is also worth noting that a significant part (`17.4%`) gave an extremely negative assessment.

The main thing to note is that the shares of adjacent scores often differ several-fold. For example, only `3.3%` of customers gave a score of `6`, while the shares of customers who gave scores of `5` and `7` are each approximately 2 times greater. This may indicate that the surveyed customers found it difficult to choose a score on a 10-point scale. Indeed, it is hard to tell the difference between scores of `6` and `7`. Even the survey organizers, as if anticipating this, treated scores of `9` and `10` alike and did not ask the 2nd question in either case. Let's try to convert the `10`-point scale to a `5`-point scale and look at the distribution again.

In [None]:
s = (data['Q1'] + 1) // 2
fig = px.histogram(s, x='Q1', histnorm='percent',
    title='<b>Distribution of customers<br>by quality satisfaction score</b><br>5-point scale',
    opacity=0.5)
fig.update_traces(texttemplate="%{y:.1f}%", hovertemplate='%{x} - %{y:.1f}%', 
                  marker_color=px.colors.DEFAULT_PLOTLY_COLORS)
fig.update_layout(title_x=0.5, title_y=0.93, bargap=0.5, width=700, height=400, margin_l=0, margin_b=0)
fig.update_xaxes(title='')
fig.update_yaxes(title='')
fig.show()

This distribution looks **more uniform**. Let's take this circumstance into account in our further work.
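
The conversion used above, `(Q1 + 1) // 2`, merges each pair of adjacent scores into one. A small sketch (the helper name `to_five_point` is an illustrative assumption) makes the resulting mapping explicit:

```python
def to_five_point(score):
    """Map a 10-point score to a 5-point score: 1-2 -> 1, 3-4 -> 2, ..., 9-10 -> 5."""
    return (score + 1) // 2

# Build the full mapping for all scores on the 10-point scale
mapping = {score: to_five_point(score) for score in range(1, 11)}
print(mapping)
```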

#### 2.2.2. Cleaning and analyzing answers to the 2nd question

Now let's move on to processing the answers to the 2nd question, which are stored in the `Q2` field. A correct answer to the 2nd question is a string of numbers from `1` to `7`, separated by a comma and a space.

We should also take into account that a customer could have answered only the 1st question and left the 2nd one unanswered. Such missing answers carry no information about the reasons for dissatisfaction, so in the code below they are replaced with `0` ("Unknown").

Among the answers to the 2nd question, we will keep only the correct ones: numbers from `1` to `7` (plus the `0` placeholder). In addition, for customers who gave several answers that include any of `1`...`5` or `7`, we will delete the answers `0` ("Unknown") and `6` ("Difficult to answer"), since `0` and `6` are correct only if the customer gave no other answers.
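
Before the full pandas pipeline, the rules can be sketched for a single answer string. The `normalize_reasons` helper below is hypothetical and written in plain Python; it follows the rules described in the text and is not guaranteed to match the notebook cell in every edge case (for example, an answer consisting only of `0, 6`).

```python
def normalize_reasons(answer):
    """Return a cleaned, sorted 'a, b, c' string of reason numbers.

    Missing answers become '0' ("Unknown"); non-numeric or out-of-range
    items are dropped; 0 and 6 are dropped when any other reason is present.
    Returns None when nothing valid is left (such a record would be dropped).
    """
    if answer is None:
        return '0'
    # Keep only items that are integers from 0 to 7
    items = {int(part) for part in answer.split(', ')
             if part.isdecimal() and 0 <= int(part) <= 7}
    if not items:
        return None
    # 0 ("Unknown") and 6 ("Difficult to answer") are valid only on their own
    specific = items - {0, 6}
    if specific:
        items = specific
    return ', '.join(str(item) for item in sorted(items))

print(normalize_reasons('4, 1, 6'))
```

Parsing each answer into a set also removes repeated numbers, which the string-based approach does not do on its own.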

In [None]:
# Fill in missing answers to the 2nd question with '0' ("Unknown")
data['Q2'] = data['Q2'].fillna('0')
# Convert the strings with answers to the 2nd question into lists of answer numbers
data['Q2'] = data['Q2'].str.split(', ')
# Expand the lists of answers to the 2nd question.
# As a result, we get a dataset with a separate row for each answer number
data = data.explode('Q2')
# Keep only those records in which the answers to the 2nd question are numbers
data = data[data['Q2'].astype(str).str.isdecimal()]
# Convert the answers to the 2nd question to an integer type
data['Q2'] = data['Q2'].astype(int)
# Keep only those records in which the answers to the 2nd question are numbers from 0 to 7
data = data[(data['Q2'] >= 0) & (data['Q2'] <= 7)]
# Sort the data by the number of the answer to the 2nd question
data.sort_values('Q2', inplace=True)
# Collapse the answers to the 2nd question back into a string with a comma and space as a separator
data['Q2'] = data['Q2'].astype(str)
data['Q2'] = data['Q2'].groupby(level=0).apply(', '.join)
# Remove the duplicate rows created by explode
data.drop_duplicates(inplace=True)
# Remove answers "0" or "6" if there are other answers
data['Q2'] = data['Q2'].str.replace('([06], )|(, [06])', '', regex=True)
data.info(memory_usage=False)
data.info(memory_usage=False)

After removing incorrect answers, there are `3057` records left in the dataset. Now let's look at the structure of the answers to this question. To do this, let's construct a histogram of the percentage distribution of answers.

In [None]:
s = pd.Series(index = [
    "0 - Unknown", "1 - Missed calls, disconnected calls",
    "2 - Waiting time for ringtones",
    "3 - Poor connection quality in buildings, shopping centers, etc.",
    "4 - Slow mobile Internet", "5 - Slow video loading",
    "6 - Difficult to answer", "7 - Your own option"
], dtype=float)
s.index = s.index.map(lambda x: wrap_text(x, 40))
for index in range(s.size):
    s.iloc[index] = data[data['Q1'] <= 8]['Q2'].str.contains(str(index)).sum()
s = s / s.sum() * 100

fig = px.bar(s, title='<b>Distribution of reasons for quality dissatisfaction</b>',
             orientation='h', opacity=0.5)
fig.update_traces(texttemplate="%{x:.1f}%", hovertemplate='%{y} - %{x:.1f}%',
                  marker_color=px.colors.DEFAULT_PLOTLY_COLORS)
fig.update_layout(title_x=0.5, title_y=0.88,
                  title_font_size=14,
                  width=700, height=500,
                  showlegend=False,
                  bargap=0.2, boxgroupgap=0.2,
                  margin_l=0, margin_b=0)
fig.update_xaxes(title='', title_font_size=12, tickfont_size=12)
fig.update_yaxes(title='', title_font_size=12, tickfont_size=12)
fig.show()

The diagram shows that almost 1/5 (`19.9%`) of the responses point to dissatisfaction with mobile Internet. A smaller, but still significant, part (`7.1%`) points to dissatisfaction with the video loading speed.

Let's look at the structure of the responses, grouping them by service. Responses `1` and `2` refer to the voice communication service, and `4` and `5` to the mobile Internet service. Response `3` shows that the customer is unsatisfied with the quality of coverage. Responses `6` and `7`, as well as the absence of a response, give no idea of the reason for the customer's lower score, so they can also be grouped together.

Let's build a diagram that shows the structure of the responses in the context of the above groups.

In [None]:
s = pd.Series(dtype=float)
s.loc["0, 6, 7 - Unknown"] = (data['Q2'].str.contains('[067]', regex=True) & (data['Q1'] <= 8)).sum()
s.loc["1, 2 - Voice communication"] = data['Q2'].str.contains('[12]', regex=True).sum()
s.loc["3 - Coverage"] = data['Q2'].str.contains('3').sum()
s.loc["4, 5 - Mobile Internet"] = data['Q2'].str.contains('[45]', regex=True).sum()
s = s / s.sum() * 100

fig = px.bar(s, title='<b>Distribution of reasons for rating decrease</b>',
             orientation='h', opacity=0.5)
fig.update_traces(texttemplate="%{x:.1f}%",
                  hovertemplate='%{y} - %{x:.1f}%',
                  marker_color=px.colors.DEFAULT_PLOTLY_COLORS)
fig.update_layout(title_x=0.5, title_y=0.82,
                  title_font_size=14,
                  showlegend=False,
                  bargap=0.2, boxgroupgap=0.2,
                  width=650, height=300,
                  margin_l=0, margin_b=0)
fig.update_xaxes(title='', title_font_size=12, tickfont_size=12)
fig.update_yaxes(title='', title_font_size=12, tickfont_size=12)
fig.show()

As we can see, the responses are split into approximately equal parts: voice communication, mobile Internet and coverage each account for about `1/4` of them.
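
How such shares are formed can be sketched in plain Python. The `SERVICE_GROUPS` mapping and the sample answers are illustrative assumptions; as in the cell above, a customer is counted once in every group that their answer touches, and the counts are then normalized by their sum, so the percentages are shares of group mentions rather than shares of customers.

```python
from collections import Counter

# Answer numbers that belong to each service group (mirrors the grouping in the text)
SERVICE_GROUPS = {
    'Unknown': {0, 6, 7},
    'Voice communication': {1, 2},
    'Coverage': {3},
    'Mobile Internet': {4, 5},
}

def group_shares(answers):
    """Percentage of group mentions for each service group."""
    counts = Counter()
    for answer in answers:
        reasons = {int(part) for part in answer.split(', ')}
        for group, members in SERVICE_GROUPS.items():
            # Count the customer once per group, however many reasons fall into it
            if reasons & members:
                counts[group] += 1
    total = sum(counts.values())
    return {group: 100 * counts[group] / total for group in SERVICE_GROUPS}

shares = group_shares(['1, 4', '3', '4, 5', '0'])
print(shares)
```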

#### 2.2.3. Cleaning and analyzing mobile internet metrics

First, let's describe the metrics that are present in the dataset. This description will be useful later.

Let's collect information about the names (`name`) and units of measurement (`units`) of the metrics. This information will be needed when outputting the reporting text and graphic information.

For each metric, let's indicate what influence (`impact`) it has on the quality of the mobile internet service and, accordingly, on customer satisfaction. This information will be useful to us in further research.

Metrics with a "positive" (`+`) impact include metrics whose value is "the higher, the better":
- `Downlink Throughput(Kbps)`;
- `Uplink Throughput(Kbps)`;
- `Video Streaming Download Throughput(Kbps)`;
- `Web Page Download Throughput(Kbps)`.

Metrics with "negative" (`-`) influence include metrics whose value is "the lower, the better":
- `Downlink TCP Retransmission Rate(%)`;
- `Video Streaming xKB Start Delay(ms)`;
- `Web Average TCP RTT(ms)`.

Since the `Total Traffic(MB)` metric is only an indicator of how intensively the customer uses mobile Internet, let's mark its influence as absent (`0`).

In [None]:
metrics = pd.DataFrame(
    columns = ['name', 'units', 'impact'],
    index=pd.Index(data.columns.drop(['Q1', 'Q2']), name='metric')
)

for metric in metrics.index:
    name, units = metric[:-1].split('(')
    label = '<b>' + wrap_text(name, 30) + '</b>'
    metrics.loc[metric, ['name', 'units', 'label']] = name, units, label

metrics.loc['Total Traffic(MB)', 'impact'] = '0'

metrics.loc[
    ['Downlink Throughput(Kbps)', 'Uplink Throughput(Kbps)', 
     'Video Streaming Download Throughput(Kbps)', 
     'Web Page Download Throughput(Kbps)'], 'impact'
] = '+'

metrics.loc[
    ['Downlink TCP Retransmission Rate(%)', 
     'Video Streaming xKB Start Delay(ms)', 
     'Web Average TCP RTT(ms)'], 'impact'
] = '-'

The resulting set of metrics information is as follows:

In [None]:
metrics.style.hide(axis='index').hide('label', axis='columns')

Next, let's analyze the distributions of the metric values. This is needed first of all to correctly choose the statistics for assessing the central position of the distributions and the criteria that we can use to test statistical hypotheses. To assess the distributions, let's use standard tools: a histogram and a box-and-whisker plot. Additionally, to make the shape of each distribution easier to assess, let's overlay kernel density estimate (KDE) curves on the histograms.

In [None]:
plot_metric_histograms(
    data[metrics.index], metrics,
    title='<b>Probability density function of metrics in the observed sample</b>', title_y=0.95,
    height=900, boxplot_height_fraq=0.15, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_boxplot=True, add_kde=True, add_mean=True,
    horizontal_spacing=0.07, vertical_spacing=0.12)

Based on the histogram analysis, we can make the following observations:
- the distributions of all metrics except `Total traffic` are strongly skewed to the right and have a very long thin "tail";
- the distribution of all metrics is far from "normal";
- a lot of values on the right are located far from the rest, which raises questions about their reliability (these may be the so-called "outliers").

As we noted, the sample contains many values that are suspected of being "outliers". They can distort the estimate of the central position of the distributions: the mean may be far from the metric value that an ordinary user "sees". Let's check this by finding the percentiles that correspond to the mean values of the metrics.

In [None]:
(metrics['name'].to_frame()
    .merge(data[metrics.index].apply(lambda s: stats.percentileofscore(s, s.mean())).to_frame(),
           left_index=True, right_index=True)
    .rename(columns={'name': 'Metric', 0: 'Percentile'})
    .style.hide(axis='index').format(precision=0))

There are two options for further action in this situation:
- either try to clean the data from outliers;
- or use statistics that are robust (less susceptible) to "outliers" to estimate the central positions of the distributions.

There are no universal and reliable methods for detecting "outliers". Since the sample distributions are strongly skewed to the right and far from normal, traditional "simple" methods, such as the "3-sigma rule" or treating values beyond the boxplot "whiskers" as outliers, are not suitable. Of course, we could use more "advanced" methods such as `Local Outlier Factor (LOF)` or `Isolation Forest`, but they all require fine-tuning. Therefore, let's use the second option. **We will not remove "outliers", but will use statistics that are robust to them.**
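
The effect described above is easy to reproduce on a small right-skewed sample. The numbers below are hypothetical, and the percentile rank is computed with a simple "share of values not exceeding the mean" definition, which is an assumption and may treat ties differently from `stats.percentileofscore`.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 200]

mean = statistics.mean(sample)
median = statistics.median(sample)
# Share of observations that do not exceed the mean (percentile rank of the mean)
percentile_of_mean = 100 * sum(value <= mean for value in sample) / len(sample)

print(mean, median, percentile_of_mean)
```

A single extreme value pulls the mean far above what nine out of ten observations show, while the median stays with the bulk of the data. This is the property that motivates the choice of robust statistics.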

## 3. Setting the objective

The metrics present in the dataset are related exclusively to the quality of mobile Internet service. In addition, we have customer scores regarding the quality of communication and the reasons that determined the scores. Based on this, we can try to recognize how to classify users in terms of their assessment of the quality of the mobile Internet service. Having this information will make it possible to build a classifier for use in predicting customer churn and ways to retain them.

We can only initially divide users into classes (categories) based on the answers to questions. And only then will we be able to see which categories of users have similar metric values, i.e. which categories of customers belong to the same population and can be combined into one class.

### 3.1. Selecting customers for the research

In the context of the research, we are interested only in those customers whose assessment of the mobile Internet service can be determined.<br>
Most obviously, customers who gave scores of `9` and `10` in response to the 1st question also rate the mobile Internet service highly.<br>
It is also appropriate to analyze the customers who, in answer to the 2nd question, specified slow mobile Internet (`4`) and/or slow video loading (`5`).
The exact assessment by the remaining customers, who gave a score from `1` to `8` in answer to the 1st question but specified neither slow mobile Internet (`4`) nor slow video loading (`5`) in answer to the 2nd question, is unknown. We can only assume that their score for the mobile Internet service is higher than the score they gave for the connection quality.

In [None]:
data_clean = data.loc[(data['Q1'] >= 9) | data['Q2'].str.contains('[45]', regex=True)].copy()

### 3.2. Dividing customers into categories depending on the answer to the 1st question

In the 1st question, users were asked to score the quality of the service on a 10-point scale (where 10 is "Excellent" and 1 is "Terrible"). But as established earlier, users found it quite difficult to choose a score on such a scale, so it is better to convert the scores to a 5-point scale. Therefore, let's divide the selected customers into `5` categories of mobile Internet service ratings depending on the answer to the 1st question:

| Category | Scores |
|:-----------------|:-------|
| Very unsatisfied |1, 2 |
| Unsatisfied |3, 4 |
| Neutral |5, 6 |
| Satisfied |7, 8 |
| Very satisfied |9, 10 |
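
The table can be expressed as a small lookup. This is a sketch only: the `score_category` helper and the use of `bisect` are illustrative assumptions, offered as a flatter alternative to a chain of nested conditions.

```python
import bisect

# Upper bounds of the first four score bands and the labels of all five bands
UPPER_BOUNDS = [2, 4, 6, 8]
CATEGORIES = ['Very unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very satisfied']

def score_category(score):
    """Return the satisfaction category for a score from 1 to 10."""
    if not 1 <= score <= 10:
        raise ValueError(f'score must be between 1 and 10, got {score}')
    # bisect_left finds the first band whose upper bound is >= score
    return CATEGORIES[bisect.bisect_left(UPPER_BOUNDS, score)]

print([score_category(score) for score in (1, 4, 5, 8, 10)])
```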

Based on the above, let's append to the dataset a column `Internet score` indicating the category to which the customer belongs.

In [None]:
data_clean['Internet score'] = data_clean['Q1'].apply(
    lambda q1: 'Very unsatisfied' if q1 <= 2 else ('Unsatisfied' if q1 <= 4 else (
        'Neutral' if q1 <= 6 else ('Satisfied' if q1 <= 8 else 'Very satisfied'))))

### 3.3. Dividing customers into categories based on answers to the 2nd question

In answers to the 2nd question, the selected customers could specify either slow internet (`4`) or slow video loading (`5`) separately, or both reasons together.<br>

Only for customers who rated the quality of the service as `Very satisfied` can we assume with reasonable certainty that they are fully satisfied with the mobile Internet service and, accordingly, have no reasons for dissatisfaction with it.

Based on all of the above, it is appropriate to split the selected customers into the following categories based on the reasons for dissatisfaction with the mobile Internet service:

|Category |Description |Answers |
|:----------------|:----------------------------------------------------------|:-------------------------|
| Internet and Video|Unsatisfied with Mobile Internet and Video loading in the same way|Contain 4 and 5 |
| Internet |More unsatisfied with Mobile Internet than Video loading|Contain 4, do not contain 5 |
| Video |More unsatisfied with Video loading than Mobile Internet|Contain 5, do not contain 4 |
| No |Satisfied with mobile Internet and Video loading |- |
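
The rules in the table can be sketched as a function of a single cleaned `Q2` string. The `dissatisfaction_reason` helper is a hypothetical illustration; it parses the answer into a set, so it does not rely on `4` and `5` being adjacent in the string.

```python
def dissatisfaction_reason(q2):
    """Classify a cleaned 'a, b, c' answer string by Internet-related reasons."""
    reasons = set(q2.split(', '))
    has_internet = '4' in reasons   # slow mobile Internet
    has_video = '5' in reasons      # slow video loading
    if has_internet and has_video:
        return 'Internet and Video'
    if has_internet:
        return 'Internet'
    if has_video:
        return 'Video'
    return 'No'

print([dissatisfaction_reason(q2) for q2 in ('1, 4, 5', '4', '3, 5', '0')])
```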

Based on the above, let's append to the dataset a column `Dissatisfaction reasons` indicating the category to which the customer belongs.

In [None]:
data_clean['Dissatisfaction reasons'] = data_clean['Q2'].apply(
    lambda q2: 'Internet and Video' if q2.find('4, 5') >= 0 else (
        'Internet' if q2.find('4') >= 0 else (
            'Video' if q2.find('5') >= 0 else 'No')))
data_clean.sort_values(
    ['Q1', 'Q2'], 
    key=lambda x: x if x.name=='Q1' else (x.str.contains('5') - x.str.contains('4, 5')*2), 
    inplace=True)

### 3.4. Customer distribution map

All the customers under consideration belong to one of the categories of assessing the quality of the Internet service and one of the categories of reasons for dissatisfaction with it. Based on this information, we can create a map of the distribution of customers:

In [None]:
display_cat_info(data_clean)

## 4. Selecting metrics, their evaluation statistics and criteria for testing hypotheses

### 4.1. Selecting key metrics

We have information on 8 metrics of mobile Internet service. All of these metrics, except `Total Traffic`, affect the convenience of using this service to one degree or another. We need to select from these metrics those that have the greatest impact on the reasons for dissatisfaction indicated by users:
- Slow mobile Internet;
- Slow video loading.

Both of these reasons are related to the assessment of data transfer speed. Let's turn to the opinion of experts and see which metrics are used specifically for Internet speed.

One such expert is `Ookla`, a company that assesses the quality of services of mobile and fixed Internet operators around the world. The company publishes the methods behind some of its assessments. Let's use this information.

The company's experts evaluate the speed of the Internet provided by the operator based on two metrics: `Average speed "to the subscriber"` (download) and `Average speed "from the subscriber"` (upload). Moreover, the ratio of the influence of these metrics is estimated as `9:1`: *"...Speed Score which incorporates a measure of each provider’s download and upload speed to rank network speed performance (90% of the final Speed Score is attributed to download speed and the remaining 10% to upload speed)"*.

Therefore, let's consider the metrics that characterize data transfer speed to the subscriber:
- `Downlink Throughput`
- `Video Streaming Download Throughput`
- `Web Page Download Throughput`.

In [None]:
research_metrics = metrics.loc[[
    'Downlink Throughput(Kbps)',
    'Video Streaming Download Throughput(Kbps)',
    'Web Page Download Throughput(Kbps)'
]]

The final list of metrics we will use to evaluate user groups is as follows:

In [None]:
research_metrics.style.hide(axis='index').hide('label', axis='columns')

### 4.2. Statistic selection

As we found earlier, using the mean to estimate the central position of the metric distributions is not quite correct because the distributions are asymmetric and contain many suspected anomalies. Accordingly, we need to select statistics that are more applicable in these conditions. The standard options are:
- median;
- trimmed mean;
- trimean (TM), a weighted average of the median and the first and third quartiles.

In practice, various modifications of these statistics are also used. For example, `Ookla` uses a modified version of the trimean in its methodology for assessing the speed of providers' Internet connections. The company's experts estimate the central value of speed as a weighted average of the 10th, 50th (median) and 90th percentiles in the ratio `1:8:1`, which is described by the following formula:

$$\hat{TM}=\frac{P_{10}+8\cdot P_{50}+P_{90}}{10}\ .$$

Let's rely on the work of experts in this field and estimate the central position of the metrics under research in the same way.
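
The formula can be sketched with the standard library. The notebook's own `trimean_mod` helper is defined elsewhere; the `modified_trimean` function below is a hypothetical stand-in that assumes linear interpolation of percentiles within the sample range, and the sample values are made up.

```python
import statistics

def modified_trimean(values):
    """Weighted average of the 10th, 50th and 90th percentiles (weights 1:8:1)."""
    # n=10 yields the nine deciles; 'inclusive' interpolates linearly within the sample range
    deciles = statistics.quantiles(values, n=10, method='inclusive')
    p10, p50, p90 = deciles[0], deciles[4], deciles[8]
    return (p10 + 8 * p50 + p90) / 10

clean = list(range(1, 12))            # 1, 2, ..., 11
with_outlier = clean[:-1] + [1000]    # the largest value replaced by an extreme one

print(modified_trimean(clean), modified_trimean(with_outlier))
print(statistics.mean(clean), statistics.mean(with_outlier))
```

In this example the extreme value changes the mean considerably but leaves the modified trimean unchanged, because none of the three percentiles involved reaches the largest observation.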

In [None]:
research_metrics['statistic'] = trimean_mod

Now let's look at the distribution of this statistic for the metrics under research. To build the distribution, let's use the **bootstrap** method. Let's visualize the distributions using histograms, additionally marking the 95% confidence intervals by coloring the areas of the distributions outside them, and the median values with dashed lines.

In [None]:
statistic_distributions = data[research_metrics.index]\
    .apply(lambda s: my_bootstrap(s, research_metrics.loc[s.name, 'statistic'], n_resamples=9999))
plot_metric_histograms(statistic_distributions, statistic=statistic_distributions.median(),
                       metrics=research_metrics,
                       title='<b>Probability density function of statistics</b>', title_y=0.9,
                       height=300, n_cols=3, opacity=0.5,
                       histnorm='probability density',
                       add_kde=True, add_statistic=True, mark_confidence_interval=True,
                       horizontal_spacing=0.06, vertical_spacing=0.07)

As we can see, the distributions of the statistics of all metrics are almost symmetrical, so the median values of the statistics coincide with the modes and lie approximately in the middle of the confidence intervals. This means that, knowing the confidence interval, we can estimate the central position of a statistic as the midpoint of the interval.
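
The idea of a bootstrap distribution and a percentile confidence interval can be sketched with the standard library. The notebook's `my_bootstrap` helper is defined elsewhere; the functions, the sample and the choice of the median as the statistic below are illustrative assumptions, not the author's implementation.

```python
import random
import statistics

def bootstrap_distribution(values, statistic, n_resamples=999, seed=0):
    """Sorted values of the statistic over resamples drawn with replacement."""
    rng = random.Random(seed)
    size = len(values)
    return sorted(statistic(rng.choices(values, k=size)) for _ in range(n_resamples))

def percentile_interval(distribution, confidence=0.95):
    """Percentile confidence interval taken from a sorted bootstrap distribution."""
    tail = (1 - confidence) / 2
    lower_index = int(tail * (len(distribution) - 1))
    upper_index = int((1 - tail) * (len(distribution) - 1))
    return distribution[lower_index], distribution[upper_index]

# Hypothetical throughput-like sample with one extreme value
sample = [12, 15, 9, 30, 14, 11, 95, 13, 16, 10, 18, 17]
distribution = bootstrap_distribution(sample, statistics.median)
low, high = percentile_interval(distribution)
midpoint = (low + high) / 2
print(low, high, midpoint)
```

Using the midpoint as the central estimate is reasonable only when the bootstrap distribution is close to symmetric, as observed in the histograms above.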

### 4.3. Selecting a statistical criterion for testing hypotheses

Let's select a statistical test based on the following points:
1. The metrics being studied are quantitative, so the criterion must be suitable for comparing quantitative data.
2. As we established earlier, the distribution of the metrics being studied is far from "normal", so parametric criteria are not very suitable, and we must use a non-parametric criterion.
3. Since we will be investigating the difference between groups that include different customers, we can assume that the groups are "independent". Therefore, we must select a criterion that is suitable for independent samples.
4. In addition, the criterion must allow us to compare groups using the statistic we have chosen, the modified trimean.

Based on the above conditions, resampling-based tests can help to verify statistical hypotheses:
- Bootstrap test;
- Permutation test.

According to many experts, the `Permutation Test` is more suitable for testing hypotheses about whether groups belong to the same population, while the `Bootstrap Test` is better suited to testing a specific difference between group statistics. Since our task is to identify statistically distinguishable classes among customers, we should use the `Permutation Test` as the criterion. We will use the `Bootstrap` to find confidence intervals and central positions of metric statistics in the groups under research.

As the test statistic for the `Permutation test`, let's use the difference in the statistics of the test groups:
$$\Delta \hat{TM} = \hat{TM}_1 - \hat{TM}_2,$$
where $\hat{TM}_1$ and $\hat{TM}_2$ are the values of the statistics of the first and second test groups, respectively.
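
A two-sided permutation test on such a difference can be sketched as follows. This is a minimal illustration under stated assumptions: the `permutation_test` function, the toy groups and the use of the median as the statistic are hypothetical, and the notebook's own test helper may differ in details such as the p-value correction.

```python
import random
import statistics

def permutation_test(group_1, group_2, statistic, n_resamples=999, seed=0):
    """Two-sided permutation p-value for statistic(group_1) - statistic(group_2)."""
    rng = random.Random(seed)
    observed = statistic(group_1) - statistic(group_2)
    pooled = list(group_1) + list(group_2)
    size_1 = len(group_1)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled observations to two groups of the original sizes
        rng.shuffle(pooled)
        diff = statistic(pooled[:size_1]) - statistic(pooled[size_1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Adding 1 to both counts keeps the estimate away from an impossible p-value of 0
    return observed, (extreme + 1) / (n_resamples + 1)

slow = list(range(1, 21))       # hypothetical group with low values
fast = list(range(101, 121))    # hypothetical group with high values
observed, p_value = permutation_test(slow, fast, statistics.median)
print(observed, p_value)
```

For two clearly separated groups almost no random reassignment reproduces a difference as large as the observed one, so the p-value is small; when both groups are identical, every reassignment is at least as extreme and the p-value equals 1.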

In [None]:
research_metrics['test statistic'] = trimean_mod_diff

### 4.4. Selecting the significance level for testing statistical hypotheses and the confidence level for interval estimation of statistics

The cost of an error in this research is not as high as, for example, in a study of drug effectiveness, so we can choose the standard significance level $\alpha$ equal to `0.05`. The confidence level $\beta$, accordingly, is chosen equal to `0.95` ($1-\alpha$).

In [None]:
alpha = 0.05
betta = 1 - alpha

### 4.5. Decision rule for customer groups belonging to the same or different populations

Let's decide whether two test groups belong to the same population based on the results of the statistical test, using the following rule:
- if the `p-value` for every metric is not below the significance level, i.e. if we cannot reject the null hypothesis (that the groups belong to the same population) for any metric, then we consider that the customers of the test groups belong to the same population;
- otherwise, we consider that the customers of the test groups belong to different populations.

## 5. Research of reasons for dissatisfaction with mobile Internet service

### 5.1. Purpose of the research

Depending on the reasons for dissatisfaction with the mobile Internet service, we divided customers into three categories. To research these customers, let's allocate them to groups of the same name. For the convenience of analyzing the research data, let's assign each group a color, which we will use when displaying graphic and tabular data:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;Internet and Video&nbsp;</span> - unsatisfied with the speed of Mobile Internet and Video loading;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;Internet&nbsp;</span> - unsatisfied primarily with the speed of Mobile Internet;
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;Video&nbsp;</span> - unsatisfied primarily with the speed of Video loading.

Let's illustrate these groups on a customer distribution map:

In [None]:
(display_cat_info(data_clean)
    .set_properties(pd.IndexSlice['Very unsatisfied':'Satisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5)
    .set_properties(pd.IndexSlice['Very unsatisfied':'Satisfied', 'Internet'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
    .set_properties(pd.IndexSlice['Very unsatisfied':'Satisfied', 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5))

Perhaps this grouping is not statistically correct. For example, customers who are unsatisfied with video loading may simply consume more video content, which is why they only reported slow video loading in their responses, although their page loading speed is also low.

In this research, let's try to understand whether this is true by answering **the following questions**:

1. Do customers in the groups belong to different populations?
2. If customers do not belong to the same population, what metrics do they differ in?

### 5.2. Data Preparation

Let's build a dataset for the research by selecting information about the customers of the above groups using the name of the groups as an index.

In [None]:
research_data = data_clean\
    .loc[data_clean['Internet score'] != 'Very satisfied']\
    .set_index('Dissatisfaction reasons')[research_metrics.index]
research_data.index.rename('Group', inplace=True)
groups = research_data.index.unique().to_list()

### 5.3. Exploratory analysis

First, let's look at the number of customers in groups utilizing a bar chart.

In [None]:
plot_group_size_barchart(research_data,
                         title='<b>Number of customers in the considered groups</b>', title_y=0.85,
                         width=600, height=210)

As we can see, the largest group is `Internet` (`440` customers), whose members are unsatisfied primarily with the speed of Mobile Internet. The `Internet and Video` group, whose members are unsatisfied with both the speed of Mobile Internet and the speed of Video loading, is considerably smaller (`185` customers). The `Video` group, whose members are unsatisfied primarily with the speed of Video loading, is very small (`37` customers), several times smaller than either of the other groups.

Then let's find the confidence intervals of the metric statistics in the groups. This information will help to estimate the central values of the statistics and to make assumptions about whether the groups differ significantly in their metrics and in which direction (up or down).

Let's calculate the confidence intervals using the bootstrap method. The results will be displayed graphically (the confidence intervals will be marked using horizontal segments, and their midpoints will be marked using dots).

In [None]:
# Build a list of tested pairs of groups
group_pairs = [[groups[0], groups[1]], [groups[1], groups[2]], [groups[2], groups[0]]]
# Calculate confidence intervals and their midpoints, check for "overlapping" between confidence intervals
ci, ci_overlapping, ci_center, _ = confidence_interval_info(research_data, research_metrics, group_pairs)
# Visualize confidence intervals
plot_metric_confidence_interval(ci, metrics=research_metrics,
                                title='<b>Confidence intervals of metric statistics</b>',
                                height=230, n_cols=3,
                                horizontal_spacing=0.04, vertical_spacing=0.07)

For clarity, let's also present the obtained results in a tabular form (the "worst" and "best" values of the confidence interval centers will be highlighted in "<span style="color: red; font-weight: bold;">red</span>" and "<span style="color: green; font-weight: bold;">green</span>" colors, respectively).

In [None]:
display_confidence_interval(ci, metrics=research_metrics, caption='<b>Confidence intervals of metric statistics</b>', caption_font_size=12, opacity=0.5, precision=1)

Based on the analysis of confidence intervals, we can make the following **conclusions**:
1. Customers of the `Internet and Video` group have significantly "worse" values of the metrics than customers of the `Internet` and `Video` groups.
2. Customers of the `Internet` and `Video` groups have similar values of the `Video Streaming Download Throughput` and `Web Page Download Throughput` metrics.
3. Customers of the `Video` group have a higher central value of the `Downlink Throughput` metric statistic than of the `Web Page Download Throughput` metric statistic. This may indicate that customers of this group consume video content to a greater extent.
4. The confidence intervals of the metrics of the `Video` group are significantly wider than those of the `Internet and Video` and `Internet` groups. This most likely does not indicate a greater spread of metric values in the population to which this group belongs. It is more probable that the effect arises because this group is much smaller, so repeated samples from it are more likely to produce extreme values of the statistic than samples from the larger groups.
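
The effect described in point 4 can be illustrated with a minimal sketch of a percentile bootstrap confidence interval. This is not the notebook's `confidence_interval_info` helper: the function `bootstrap_ci`, the choice of the mean as the statistic and the synthetic data are all assumptions made for illustration. Both synthetic groups are drawn from the same population and differ only in size (`37` and `440`, as in the `Video` and `Internet` groups).

```python
import random
import statistics


def bootstrap_ci(sample, statistic=statistics.fmean, n_resamples=2000,
                 confidence_level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the statistic on resamples drawn with replacement
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Cut off equal tails on both sides of the sorted estimates
    tail = (1 - confidence_level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Synthetic data (an assumption): both groups come from the SAME population
population_rng = random.Random(42)
small_group = [population_rng.gauss(2000, 800) for _ in range(37)]
large_group = [population_rng.gauss(2000, 800) for _ in range(440)]

small_ci = bootstrap_ci(small_group)
large_ci = bootstrap_ci(large_group)
small_width = small_ci[1] - small_ci[0]
large_width = large_ci[1] - large_ci[0]
print(f'n=37:  interval width = {small_width:.0f}')
print(f'n=440: interval width = {large_width:.0f}')
```

Although the spread of the underlying population is identical, the interval of the smaller group comes out noticeably wider, which is the behavior attributed to the `Video` group above.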

Additionally, let's display whether the confidence intervals of the group statistics "overlap" for each pair of groups. Let's highlight "<span style="color: red; font-weight: bold;">negative</span>" results in red because they are the most important and informative.

In [None]:
display_confidence_interval_overlapping(ci_overlapping, metrics=research_metrics, caption='<b>Overlapping confidence intervals of the statistics</b>', opacity=0.5)

Based on the obtained results on the presence of "overlapping" confidence intervals of statistics, the following conclusions can be made:
1. Confidence intervals of the `Video Streaming Download Throughput` and `Web Page Download Throughput` metrics of the `Internet and Video` and `Internet` groups **do not overlap**, which indicates a **significant difference** in the statistics of these metrics of this pair of groups. Thus, it is possible not to perform a test for this pair of groups, but to immediately conclude that these groups belong to different populations.
2. Confidence intervals of the `Downlink Throughput` metric statistics of the `Internet and Video` and `Video` groups also **do not overlap**. This indicates a significant difference in the statistics of this metric for this pair of groups. Thus, it is possible not to perform a test for this pair of groups, but to immediately conclude that these groups belong to different populations.
3. Confidence intervals of the statistics of all metrics of the `Internet` and `Video` groups "overlap". Therefore, we cannot draw a conclusion about the significance of the difference in the statistics of the researched metrics of these groups based on exploratory analysis - a statistical test must be performed.
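
The "overlap" check behind these conclusions can be sketched as follows. The function `intervals_overlap` and the interval bounds are hypothetical and chosen only to mirror the pattern described above; they are not values from the survey data, and the notebook's own check lives in `confidence_interval_info`.

```python
def intervals_overlap(first, second):
    """Return True if two closed intervals (low, high) share at least one point."""
    return max(first[0], second[0]) <= min(first[1], second[1])


# Hypothetical confidence intervals of one metric (not real survey values)
ci_by_group = {
    'Internet and Video': (1200.0, 1600.0),
    'Internet': (1900.0, 2300.0),
    'Video': (1700.0, 3100.0),
}
pairs = [('Internet and Video', 'Internet'),
         ('Internet', 'Video'),
         ('Video', 'Internet and Video')]
for first, second in pairs:
    overlap = intervals_overlap(ci_by_group[first], ci_by_group[second])
    print(f'{first} vs {second}: overlap = {overlap}')
```

Note the asymmetry used in the conclusions: intervals that do not overlap point to a significant difference, while overlapping intervals leave the question open and call for a test.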

### 5.4. Statistical tests

Based on exploratory analysis, we have established that the `Internet and Video` group differs significantly in its metrics from the `Internet` and `Video` groups. That is, even without testing, we can already conclude that the `Internet and Video` group belongs to a separate population.<br>
To answer the 1st question, it is enough to perform a test only for the `Internet` and `Video` groups. However, we still need to find out which metrics contribute most strongly to the differences between the groups, so we will perform tests for all pairs of groups.<br>
In this case, the direction of the difference (larger or smaller) does not matter to us, so we can perform a `two-sided` test.<br>
However, the `p-value` of a two-sided `Permutation test` is twice the smaller of the left-sided and right-sided `p-values`. Therefore, before comparing with the selected significance level, we first halve the obtained `p-values`.
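
The relationship between the one-sided and two-sided `p-values` can be checked on a small sketch. It assumes the difference of group means as the test statistic and uses synthetic data; `permutation_pvalues` is a hypothetical stand-in, not the notebook's `permutation_test` helper.

```python
import random
import statistics


def permutation_pvalues(first, second, n_resamples=5000, seed=0):
    """Permutation test for the statistic mean(first) - mean(second).

    Returns the left-sided, right-sided and two-sided p-values.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(first) - statistics.fmean(second)
    pooled = list(first) + list(second)
    n_first = len(first)
    n_less = n_greater = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[:n_first])
                - statistics.fmean(pooled[n_first:]))
        n_less += diff <= observed
        n_greater += diff >= observed
    # Adding 1 keeps the estimates away from an impossible p-value of 0
    p_less = (n_less + 1) / (n_resamples + 1)
    p_greater = (n_greater + 1) / (n_resamples + 1)
    p_two_sided = min(1.0, 2 * min(p_less, p_greater))
    return p_less, p_greater, p_two_sided


# Synthetic groups (an assumption) with clearly different means
data_rng = random.Random(1)
group_a = [data_rng.gauss(1500, 500) for _ in range(60)]
group_b = [data_rng.gauss(2200, 500) for _ in range(60)]

p_less, p_greater, p_two_sided = permutation_pvalues(group_a, group_b)
print(f'left-sided: {p_less:.4f}, right-sided: {p_greater:.4f}, '
      f'two-sided: {p_two_sided:.4f}')
# Halving the two-sided p-value recovers the smaller one-sided p-value
print(p_two_sided / 2 == min(p_less, p_greater))
```

With this convention a halved `p-value` close to `0.5` means that the observed statistic lies near the middle of the null distribution, which is how such values are read in the conclusions of this section.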

In [None]:
alternatives = research_metrics['impact'].apply(lambda impact: 'two-sided')
pvalues = pd.DataFrame(index=[', '.join(group_pair) for group_pair in group_pairs], columns=research_metrics.index)
mark_statistic = alternatives.apply(lambda impact: 'min')

#### 5.4.1. Statistical test for the groups "Internet and Video" and "Internet"

For all metrics, let's take as the `null hypothesis` the assumption that there are **no significant differences** between the values of the statistics of the tested groups, and as the `alternative hypothesis` the opposite statement that **there are differences**. In mathematical form, the test statistics and formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Internet \, and \, Video}-\hat{TM}_{Internet}\\
H_0:\Delta \hat{TM}=0\\
H_1:\Delta \hat{TM}≠0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (to the right or left of the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 0
test_data = research_data.loc[group_pairs[group_pair_index]]
# Testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Dividing obtained `p-values` by 2
pvalues.loc[', '.join(group_pairs[group_pair_index])] /= 2
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for all metrics are **less than** the significance level, we **can reject the null hypothesis** and, accordingly, assume that there are **significant differences** between these groups in all metrics.
2. For all metrics, the area of the null distribution **to the left** of the observed test statistic is **smaller** than the area **to the right**. This indicates that the values of all metrics of the `Internet and Video` group are **less** than those of the `Internet` group, which confirms the preliminary conclusions of the exploratory analysis.

#### 5.4.2. Statistical test for the groups "Internet" and "Video"

For all metrics, let's take as the `null hypothesis` the assumption that there are **no significant differences** between the values of the statistics of the tested groups, and as the `alternative hypothesis` the opposite statement that **there are differences**. In mathematical form, the test statistics and formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Internet}-\hat{TM}_{Video}\\
H_0:\Delta \hat{TM}=0\\
H_1:\Delta \hat{TM}≠0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (to the right or left of the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 1
test_data = research_data.loc[group_pairs[group_pair_index]]
# Testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Dividing obtained `p-values` by 2
pvalues.loc[', '.join(group_pairs[group_pair_index])] /= 2
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**<br>
1. Since the `p-values` obtained for all metrics are **greater** than the significance level, we **cannot reject the null hypothesis** and, accordingly, will assume that there are **no significant differences** between these groups in any of the metrics.
2. For the `Downlink Throughput` and `Web Page Download Throughput` metrics, the area of the null distribution **to the left** of the observed test statistic is **smaller** than the area **to the right**. This suggests that the values of these metrics for the `Internet` group are **smaller** than for the `Video` group, which confirms the preliminary conclusions of the exploratory analysis.
3. For the `Video Streaming Download Throughput` metric, the area of the null distribution **to the left** of the observed test statistic is **larger** than the area **to the right**. This suggests that the values of this metric for the `Internet` group are **larger** than for the `Video` group. However, this difference is not significant, since the `p-value` is close to `0.5`. This also confirms the conclusions of the exploratory analysis.

#### 5.4.3. Statistical test for the groups "Video" and "Internet and Video"

For all metrics, let's take as the `null hypothesis` the assumption that there are **no significant differences** between the values of the statistics of the tested groups, and as the `alternative hypothesis` the opposite statement that **there are differences**. In mathematical form, the test statistics and formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Video}-\hat{TM}_{Internet \, and \, Video}\\
H_0:\Delta \hat{TM}=0\\
H_1:\Delta \hat{TM}≠0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (to the right or left of the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 2
test_data = research_data.loc[group_pairs[group_pair_index]]
# Testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Dividing obtained `p-values` by 2
pvalues.loc[', '.join(group_pairs[group_pair_index])] /= 2
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**<br>
1. Since the `p-values` obtained for all metrics are **less than** the significance level, we **can reject the null hypothesis** and, accordingly, will assume that there are **significant differences** between the tested groups.
2. For all metrics, the area of the null distribution **to the left** of the observed test statistic is **larger** than the area **to the right**. This suggests that the values of all metrics in the `Video` group are **larger** than those in the `Internet and Video` group, which confirms the preliminary conclusions of the exploratory analysis.
3. **The most significant difference** between the groups is observed for the `Downlink Throughput` metric, since the `p-value` for this metric is significantly lower than for the other metrics.

### 5.5. Conclusion

To form the conclusions of the research, it is necessary to analyze the results of all the tests performed. To do this, let's display the obtained `p-values` of all the tests in tabular form:

In [None]:
display_pvalues(pvalues, metrics=research_metrics, alpha=alpha, caption=f'<b>Tests p-values</b>', opacity=0.5, col_width=160)

Based on this information, the following answers can be given to the questions posed:
1. Customers who expressed dissatisfaction primarily with the Internet speed (the `Internet` group) or primarily with video loading (the `Video` group) do not differ statistically significantly in the studied metrics and belong to the same population. A noticeable difference is observed only for the `Downlink Throughput` metric, whose `p-value` is only slightly higher than the significance level.
2. The studied metrics of the `Internet and Video` group differ statistically significantly from those of the other groups. That is, this group belongs to a separate population. The strongest differences were found in the `Downlink Throughput` metric.

Since it was established that customers of the `Internet` and `Video` groups can be attributed to the same population, from now on let's treat them as a single category of reasons for dissatisfaction with the mobile Internet service. Let's call this category `Internet or Video`. Thus, we are left with two categories of customers based on their dissatisfaction with the mobile Internet service:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;Internet and Video&nbsp;</span> - Dissatisfied with the speed of mobile Internet and video loading;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;Internet or Video&nbsp;</span> - Dissatisfied with the speed of mobile Internet or video loading.
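
The merge of the two groups into one category can be sketched with a plain mapping. The dictionary `merged_category` is a hypothetical illustration, and the labels are rebuilt from the group sizes reported in section 5.3 rather than read from `data_clean`.

```python
from collections import Counter

# Mapping from the original groups to the merged categories (illustrative)
merged_category = {
    'Internet and Video': 'Internet and Video',
    'Internet': 'Internet or Video',
    'Video': 'Internet or Video',
}

# Group labels rebuilt from the group sizes reported in section 5.3
group_sizes = {'Internet and Video': 185, 'Internet': 440, 'Video': 37}
labels = [group for group, size in group_sizes.items() for _ in range(size)]

merged_labels = [merged_category[label] for label in labels]
merged_sizes = Counter(merged_labels)
print(dict(merged_sizes))
# → {'Internet and Video': 185, 'Internet or Video': 477}
```

In the notebook itself the same mapping could presumably be applied to the `Dissatisfaction reasons` column, for example with pandas `Series.replace`, which leaves values absent from the mapping unchanged.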

Let's illustrate these categories on a customer distribution map:

In [None]:
(
    display_cat_info(data_clean)
    .set_properties(pd.IndexSlice['Very unsatisfied':'Satisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5)
    .set_properties(pd.IndexSlice['Very unsatisfied':'Satisfied', 'Internet': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
)

## 6. Research of mobile internet service quality assessments

### 6.1. Research objective

Depending on the assessment of the mobile Internet service, we divided the customers into five categories. To study these customers, let's allocate them into groups named after the categories. For ease of analysis, let's assign each group a color, which we will use when displaying graphical and tabular data:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;Very unsatisfied&nbsp;</span>
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;Unsatisfied&nbsp;</span>
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;Neutral&nbsp;</span>
- <span style="color:white;background-color:rgb(214, 39, 40);opacity:0.5">&nbsp;Satisfied&nbsp;</span>
- <span style="color:white;background-color:rgb(148, 103, 189);opacity:0.5">&nbsp;Very satisfied&nbsp;</span>

Let's illustrate these groups on a customer distribution map:

In [None]:
(
    display_cat_info(data_clean)
    .set_properties(pd.IndexSlice['Very unsatisfied', 'Internet and Video': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5)
    .set_properties(pd.IndexSlice['Unsatisfied', 'Internet and Video': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
    .set_properties(pd.IndexSlice['Neutral', 'Internet and Video': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5)
    .set_properties(pd.IndexSlice['Satisfied', 'Internet and Video': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[3], opacity=0.5)
    .set_properties(pd.IndexSlice['Very satisfied', 'No'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[4], opacity=0.5)
)

Perhaps this division is not statistically correct. For example, the metrics of customers who rated the mobile Internet service as `Very unsatisfied` may be close to those of customers who rated it as `Unsatisfied`.

In this research, let's try to understand whether this is true by answering the **following questions**:

1. Do customers of the above groups belong to different populations?
2. If customers do not belong to the same population, in which metrics do they differ most strongly?

### 6.2. Data Preparation

Let's build a dataset for the research by selecting information about the customers of the above groups from the general set of the researched data. Let's use the name of the groups as an index.

In [None]:
research_data = data_clean\
    .set_index('Internet score')[research_metrics.index]
research_data.index.rename('Group', inplace=True)
groups = research_data.index.unique().to_list()

### 6.3. Exploratory analysis

First, let's look at the number of customers in groups utilizing a bar chart.

In [None]:
plot_group_size_barchart(research_data,
                         title='<b>Number of customers in the groups</b>', title_y=0.9,
                         width=550, height=300)

As we can see, the largest group is `Very satisfied` (`1084` customers) - this is a group of customers who are completely satisfied with the mobile Internet service. The other groups are several times smaller and the difference in numbers between them is not so significant. The smallest group is `Neutral` (`118` customers), which includes customers who find the quality of the mobile Internet service satisfactory.

Then let's find the confidence intervals of the metric statistics in the groups. This information will help to estimate the central values of the statistics and to make assumptions about whether the groups differ significantly in their metrics and in which direction ("larger" or "smaller").

Let's calculate the confidence intervals using the bootstrap method. The results will be displayed graphically (the confidence intervals will be drawn as horizontal segments, and their midpoints will be marked with dots).

In [None]:
# Build a list of tested pairs of groups
group_pairs = [[groups[0], groups[1]], [groups[1], groups[2]], [groups[2], groups[3]], [groups[3], groups[4]]]
# Calculate confidence intervals and their midpoints, check for "overlaps" among them
ci, ci_overlapping, ci_center, _ = confidence_interval_info(research_data, research_metrics, group_pairs)
# Visualize confidence intervals
plot_metric_confidence_interval(ci, metrics=research_metrics,
                                title='<b>Confidence intervals of the statistics</b>',
                                height=300, n_cols=3,
                                horizontal_spacing=0.04, vertical_spacing=0.07)

For clarity, let's also present the obtained results in a tabular form (the "worst" and "best" values of the confidence interval centers will be highlighted in "<span style="color: red; font-weight: bold;">red</span>" and "<span style="color: green; font-weight: bold;">green</span>" colors, respectively).

In [None]:
display_confidence_interval(ci, metrics=research_metrics, caption='<b>Confidence intervals of the statistics</b>', caption_font_size=12, opacity=0.5, precision=1)

Based on the analysis of confidence intervals, we can make the following **conclusions**:
1. For all metrics, the central values of the statistics tend to increase, i.e. to change for the "better", as the level of satisfaction grows. That is, the dynamics of the metric values is consistent with how customers rate the quality of the mobile Internet service. Therefore, statistically significant differences may be absent primarily between neighboring groups.
2. The `Video Streaming Download Throughput` metric stands out from the overall picture: customers of the `Very unsatisfied` group have a slightly higher central value of the statistic than the more satisfied `Unsatisfied` group. But this difference does not seem to be significant.
3. The confidence intervals of the `Very satisfied` group are significantly narrower than those of the other groups. This most likely does not indicate a smaller spread of metric values in the population to which this group belongs. It is quite possible that the effect arises because this group is significantly larger, so repeated samples from it are less likely to produce extreme values of the statistic than samples from the smaller groups.
4. However, the confidence intervals of the `Satisfied` group are wider than those of the groups of comparable size (`Very unsatisfied`, `Unsatisfied` and `Neutral`), which may indicate a greater dispersion of the metrics in this group.

It is visually noticeable that the confidence intervals of the metric statistics intersect in all pairs of neighboring groups. To be sure of this, let's additionally display whether the confidence intervals "overlap" (share common areas) for each pair of groups. Let's highlight "<span style="color: red; font-weight: bold;">negative</span>" results in red since they are the most important and informative.

In [None]:
display_confidence_interval_overlapping(ci_overlapping, metrics=research_metrics, caption='<b>Overlapping confidence intervals of the statistics</b>', opacity=0.5)

Based on the obtained results on the presence of "overlapping" confidence intervals of the studied metrics, the following **conclusion** can be made:
since the confidence intervals of the statistics of all the studied metrics **overlap** in neighboring groups, we cannot draw a conclusion about the significance of the difference in the statistics of these groups based on exploratory analysis - statistical tests must be carried out.

### 6.4. Statistical tests

Based on exploratory analysis, we have established that there is a clear tendency for the metric values to increase with the ratings given by customers to the mobile Internet service. Therefore, to answer the 1st question, we should perform "one-sided" tests comparing the statistics of neighboring groups only. The first group in each tested pair is the group of customers with the lower rating, so let's perform "left-sided" tests.
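
The idea of testing only neighboring groups with a left-sided alternative can be sketched as follows. The function `left_sided_pvalue`, the difference of means as the test statistic and the synthetic data are assumptions made for illustration; the notebook's own `permutation_test` helper may work differently.

```python
import random
import statistics

# Groups ordered from the lowest rating to the highest
ordered_groups = ['Very unsatisfied', 'Unsatisfied', 'Neutral',
                  'Satisfied', 'Very satisfied']
# Pair every group with its right-hand neighbor
neighbor_pairs = list(zip(ordered_groups, ordered_groups[1:]))


def left_sided_pvalue(lower_group, higher_group, n_resamples=5000, seed=0):
    """Permutation p-value for H1: mean(lower_group) < mean(higher_group)."""
    rng = random.Random(seed)
    observed = statistics.fmean(lower_group) - statistics.fmean(higher_group)
    pooled = list(lower_group) + list(higher_group)
    n_lower = len(lower_group)
    n_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[:n_lower])
                - statistics.fmean(pooled[n_lower:]))
        # Only differences at least as negative as the observed one count
        n_extreme += diff <= observed
    return (n_extreme + 1) / (n_resamples + 1)


# Synthetic metric values (an assumption) for two neighboring groups
data_rng = random.Random(7)
lower = [data_rng.gauss(1800, 500) for _ in range(120)]
higher = [data_rng.gauss(2500, 500) for _ in range(120)]

print(neighbor_pairs)
print(f'lower vs higher: {left_sided_pvalue(lower, higher):.4f}')
print(f'higher vs lower: {left_sided_pvalue(higher, lower):.4f}')
```

Swapping the order of the groups turns a small `p-value` into one close to `1`, which is why the group with the lower rating has to come first in every pair.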

In [None]:
# Use a "left-sided" test for all metrics
alternatives = research_metrics['impact'].apply(lambda impact: 'less')
# Create a dataframe for test results
pvalues = pd.DataFrame(index=[', '.join(group_pair) for group_pair in group_pairs], columns=research_metrics.index)
# On the null distribution, mark the region located to the left of the observed value of the statistic
mark_statistic = alternatives.apply(lambda impact: 'tomin')

#### 6.4.1. Statistical test for the groups "Very unsatisfied" and "Unsatisfied"

For all metrics, let's take as the `null hypothesis` the assumption that the values of the `Very unsatisfied` group statistics are **not less than** the values of the `Unsatisfied` group statistics, and as the `alternative hypothesis` the opposite statement that the values of the `Very unsatisfied` group statistics are **less than** the values of the `Unsatisfied` group statistics. In mathematical form, the test statistics and formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Very \, unsatisfied}-\hat{TM}_{Unsatisfied}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the studied metrics. On the histograms, let's mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (to the left of the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 0
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for all metrics are **greater** than the significance level, we **cannot reject the null hypothesis** for any of the studied metrics and, accordingly, will assume that the values of all metrics in the `Very unsatisfied` group are not less than those in the `Unsatisfied` group.

2. The tested groups have especially **close** values for the `Video Streaming Download Throughput` metric, since the `p-value` for this metric is closest to `0.5`. This also confirms the conclusions of the exploratory analysis.

#### 6.4.2. Statistical test for the groups "Unsatisfied" and "Neutral"

For all metrics, let's take as the `null hypothesis` the assumption that the values of the `Unsatisfied` group statistics are **not less than** the values of the `Neutral` group statistics, and as the `alternative hypothesis` the opposite statement that the values of the `Unsatisfied` group statistics are **less than** the values of the `Neutral` group statistics. In mathematical form, the test statistics and formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Unsatisfied}-\hat{TM}_{Neutral}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the studied metrics. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (to the left of the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 1
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-value` obtained for the `Video Streaming Download Throughput` metric is **less** than the significance level, we **can reject the null hypothesis** for this metric and, accordingly, we assume that its values for the `Unsatisfied` group are **less** than for the `Neutral` group.

2. The values of the remaining metrics for the groups under research are quite **close**, since the `p-values` for these metrics are well **above** the significance level.

#### 6.4.3. Statistical test for the groups "Neutral" and "Satisfied"

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of the `Neutral` group are **not less than** those of the `Satisfied` group. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of the `Neutral` group are **less than** those of the `Satisfied` group. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Neutral}-\hat{TM}_{Satisfied}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's run the test and visualize the results by plotting histograms of the null distribution of the test statistic for each metric under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and shade the areas of the distributions that are used to calculate the `p-values` (the areas bounded by the lines of the observed values).

In [None]:
# Test data preparation
group_pair_index = 2
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for all metrics are **greater** than the significance level, we **cannot reject the null hypothesis** for any of the metrics under research and, accordingly, we assume that the values of all metrics in the `Neutral` group are **not less** than those in the `Satisfied` group.

2. It should be noted that the values of the `Downlink Throughput` and `Video Streaming Download Throughput` metrics in the groups under research show **noticeable**, although not statistically significant, differences, since the `p-values` of these metrics are **close** to the significance level.

#### 6.4.4. Statistical test for the groups "Satisfied" and "Very satisfied"

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of the `Satisfied` group are **not less than** those of the `Very satisfied` group. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of the `Satisfied` group are **less than** those of the `Very satisfied` group. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{Satisfied}-\hat{TM}_{\text{Very satisfied}}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's run the test and visualize the results by plotting histograms of the null distribution of the test statistic for each metric under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and shade the areas of the distributions that are used to calculate the `p-values` (the areas bounded by the lines of the observed values).

In [None]:
# Test data preparation
group_pair_index = 3
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for the `Video Streaming Download Throughput` and `Web Page Download Throughput` metrics are **less** than the significance level, we **can reject the null hypothesis** for these metrics and, accordingly, we assume that their values in the `Satisfied` group are **less** than in the `Very satisfied` group.

2. The values of the `Downlink Throughput` metric in the `Satisfied` group are also noticeably, although not statistically significantly, **less**, since the `p-value` of this metric is close to the significance level.

### 6.5. Conclusion

To draw the conclusions of the research, we need to analyze the results of all the tests performed. To do this, let's display the obtained `p-values` of all the tests in tabular form:

In [None]:
display_pvalues(pvalues, metrics=research_metrics, alpha=alpha, caption=f'<b>Tests p-values</b>', opacity=0.5, col_width=160)

Based on this information, we can give the following answers to the questions posed:
1. Since no metric of the `Very unsatisfied` group was found to be less than in the neighboring `Unsatisfied` group, we can assume that the customers of these groups **belong to the same population**.
2. Since no metric of the `Neutral` group was found to be less than in the neighboring `Satisfied` group, we can assume that the customers of these groups **belong to the same population**.
3. The `Video Streaming Download Throughput` metric has the **strongest** influence on the division of customers into populations depending on the assessment of the mobile Internet service, since differences in the value of this metric were found in two pairs of the groups under research. The `Downlink Throughput` metric has the **weakest** influence, since no pair of neighboring groups differs significantly in the values of this metric.
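
The way such answers are read off a table of `p-values` can be sketched as follows. The numbers, the metric names and the pair labels below are made up purely for illustration and are **not** the results of this research; the significance level `0.05` is likewise an assumption of the sketch.

```python
import pandas as pd

alpha_demo = 0.05  # assumed significance level, for illustration only

# Made-up p-values (NOT results of this research): rows are pairs of
# neighboring groups, columns are metrics
pvalues_demo = pd.DataFrame(
    {'Metric A': [0.40, 0.03, 0.20, 0.01],
     'Metric B': [0.60, 0.30, 0.07, 0.20]},
    index=['g1, g2', 'g2, g3', 'g3, g4', 'g4, g5'])

# True where the null hypothesis is rejected
significant = pvalues_demo < alpha_demo
# Number of pairs of neighboring groups that differ, per metric
per_metric = significant.sum(axis=0)
# Number of metrics that differ, per pair of neighboring groups
per_pair = significant.sum(axis=1)
print(per_metric)
print(per_pair)
```

A metric with the largest count in `per_metric` separates the most pairs of neighboring groups, and a pair with a zero in `per_pair` is a candidate for merging. If the table of `p-values` has `object` dtype, converting it with `.astype(float)` before the comparison is a safe precaution.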

Since the groups `Very unsatisfied` and `Unsatisfied` can be attributed to the same population, from now on we consider the customers of these groups as belonging to one category of customers with the same assessment of the quality of mobile Internet service. Let's call this category `Unsatisfied`.

The same applies to the groups `Neutral` and `Satisfied`. Let's call the combined category `Satisfied`.

Thus, we have the following categories of customers depending on the assessment of the quality of mobile Internet service:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;Unsatisfied&nbsp;</span>
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;Satisfied&nbsp;</span>
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;Very satisfied&nbsp;</span>
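
As a minimal sketch, the merging of the five score groups into these three categories can be expressed as a simple mapping. The toy series below is hypothetical; only the group names and the column name `Internet score` are taken from this research.

```python
import pandas as pd

# Mapping from the original score groups to the merged categories
merge_map = {
    'Very unsatisfied': 'Unsatisfied',
    'Unsatisfied': 'Unsatisfied',
    'Neutral': 'Satisfied',
    'Satisfied': 'Satisfied',
    'Very satisfied': 'Very satisfied',
}

# Toy example with one customer per original group (not survey data)
scores = pd.Series(
    ['Very unsatisfied', 'Neutral', 'Very satisfied', 'Unsatisfied', 'Satisfied'],
    name='Internet score')
categories = scores.map(merge_map)
print(categories.value_counts())
```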

Let's illustrate these categories on a customer distribution map:

In [None]:
display_cat_info(data_clean).set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet and Video':'Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5).set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet and Video':'Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5).set_properties(pd.IndexSlice['Very satisfied', 'No'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5)

## 7. Research of satisfaction levels with mobile internet service

### 7.1. Purpose of the research

In the previous studies, we examined how customers are divided into categories by their scores and by their reasons for dissatisfaction with the mobile internet service reported in the survey. Based on the results of these studies, we can divide all customers into five categories depending on the degree of satisfaction with the quality of the mobile internet service. The degree of customer satisfaction is usually called `CSAT` (`Customer Satisfaction Score`). It is usually expressed as the ordinal number (starting with 1) of the degree of satisfaction, in ascending order. Let's list the categories of customers in ascending order of `CSAT`:

1. Customers who rated the quality of the mobile internet service negatively (category `Unsatisfied`) and are dissatisfied with both the mobile internet speed and the video loading speed (category `Internet and Video`) are classified as `Completely dissatisfied` with the quality of the mobile internet service.
2. Customers who rated the quality of the mobile internet service negatively (category `Unsatisfied`) but named only slow mobile internet or only slow video loading as the reason (category `Internet or video`) are classified as `Partly dissatisfied` with the quality of the mobile internet service.
3. Customers who rated the quality of the mobile internet service positively overall (category `Satisfied`) but still have complaints about both the mobile internet speed and the video loading speed (category `Internet and Video`) are classified as `Neutral` (neither satisfied nor dissatisfied) with respect to the quality of the mobile internet service.
4. Customers who rated the quality of the mobile internet service positively overall (category `Satisfied`) but still have complaints about either the mobile internet speed or the video loading speed (category `Internet or video`) are classified as `Partly satisfied` with the quality of the mobile internet service.
5. The remaining customers, who rated the service highly and have no complaints about the mobile internet service (category `Very satisfied`), are classified as `Completely satisfied`.
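
The five rules above can be sketched with `numpy.select`. This is an illustrative, vectorized variant on a toy table (one hypothetical customer per category), not the code used for the actual data preparation; the column names and category labels follow those used in this research.

```python
import numpy as np
import pandas as pd

# Toy table: one hypothetical customer per CSAT category
demo = pd.DataFrame({
    'Internet score': ['Very unsatisfied', 'Unsatisfied', 'Neutral',
                       'Satisfied', 'Very satisfied'],
    'Dissatisfaction reasons': ['Internet and Video', 'Internet',
                                'Internet and Video', 'Video', 'No'],
})

negative = demo['Internet score'].isin(['Very unsatisfied', 'Unsatisfied'])
positive = demo['Internet score'].isin(['Neutral', 'Satisfied'])
both = demo['Dissatisfaction reasons'] == 'Internet and Video'

# Conditions are checked in order; rows matching none of them get CSAT 5
demo['CSAT'] = np.select(
    [negative & both, negative & ~both, positive & both, positive & ~both],
    [1, 2, 3, 4],
    default=5)
print(demo)
```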

It is possible that such a division of customers by satisfaction levels is not statistically justified, i.e. some groups of customers with different `CSAT` do not have statistically significant differences in metrics.

In this research, let's try to find out whether this is the case by answering the **following questions**:

1. Do customers with different CSAT belong to different populations?

2. If customers with different CSAT do not belong to the same population, then by which metrics do they differ especially strongly?

For this purpose, in this research we divide customers into groups by the `CSAT` value. When displaying graphic and tabular data, we assign to each group a designation corresponding to its `CSAT` value and a color:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;1&nbsp;</span> - Completely dissatisfied;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;2&nbsp;</span> - Partly dissatisfied;
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;3&nbsp;</span> - Neutral;
- <span style="color:white;background-color:rgb(214, 39, 40);opacity:0.5">&nbsp;4&nbsp;</span> - Partly satisfied;
- <span style="color:white;background-color:rgb(148, 103, 189);opacity:0.5">&nbsp;5&nbsp;</span> - Completely satisfied.

Let's illustrate these groups on a customer distribution map:<br>

In [None]:
display_cat_info(data_clean).set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet and Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5).set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet': 'Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5).set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet and Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5).set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet': 'Video'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[3], opacity=0.5).set_properties(pd.IndexSlice['Very satisfied', 'No'], color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[4], opacity=0.5)

### 7.2. Data Preparation

Let's build a dataset for the research by selecting information about the customers of the above groups from the general set of the researched data. Let's use the name of the groups as an index.

In [None]:
research_data = data_clean.copy()

def csat_group(row):
    # Map the score category and the dissatisfaction reasons to a CSAT label
    both = row['Dissatisfaction reasons'] == 'Internet and Video'
    if row['Internet score'] in ['Very unsatisfied', 'Unsatisfied']:
        return '1' if both else '2'
    if row['Internet score'] in ['Neutral', 'Satisfied']:
        return '3' if both else '4'
    return '5'

research_data.index = research_data[['Internet score', 'Dissatisfaction reasons']].apply(csat_group, axis=1)
research_data.index.rename('Group', inplace=True)
research_data = research_data[research_metrics.index]
# Sort the labels so that neighboring list items are neighboring CSAT levels
groups = sorted(research_data.index.unique())

### 7.3. Exploratory analysis

First, let's look at the number of customers in groups utilizing a bar chart.

In [None]:
plot_group_size_barchart(research_data,
                         title='<b>Number of customers in the groups</b>', title_y=0.9,
                         width=550, height=300)

As expected, the largest group is the group with satisfaction index `5` (`1084` customers), whose customers are completely satisfied with the mobile Internet service. The other groups are several times smaller. At the opposite extreme is the group of neutral customers with index `3`: it contains only `54` customers.

Next, let's look at the confidence intervals of the metric statistics for each group. This information will help us evaluate the central values of the statistics and form an assumption about the presence of significant differences between the metrics of the customer groups and about their direction ("greater" or "less").

The confidence intervals are calculated using the bootstrap method. The results will be displayed graphically (we draw the confidence intervals as horizontal segments and mark their centers with dots).
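
As an illustration of the bootstrap step, here is a minimal sketch of a percentile bootstrap confidence interval. The function name `bootstrap_ci`, the choice of the median as the statistic, the percentile method and the synthetic samples are assumptions of this sketch and do not reproduce the `confidence_interval_info` helper used in this notebook.

```python
import numpy as np

def bootstrap_ci(values, statistic=np.median, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: each row is one bootstrap sample
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    estimates = np.apply_along_axis(statistic, 1, values[idx])
    # Cut off equal tails of the bootstrap distribution
    tail = (1 - confidence) / 2
    low, high = np.quantile(estimates, [tail, 1 - tail])
    return low, high

# Synthetic samples from the same distribution but of different sizes
rng = np.random.default_rng(1)
small = rng.normal(100, 20, size=54)
large = rng.normal(100, 20, size=1084)
ci_small = bootstrap_ci(small)
ci_large = bootstrap_ci(large)
print('small group:', ci_small)
print('large group:', ci_large)
```

Both samples are drawn from the same distribution, yet the interval for the larger sample is narrower. This is worth keeping in mind when comparing the interval widths of groups of very different sizes.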

In [None]:
# Build a list of test group pairs
group_pairs = [[groups[0], groups[1]], [groups[1], groups[2]], [groups[2], groups[3]], [groups[3], groups[4]]]
# Calculate confidence intervals, their midpoints and check the presence of "overlaps" among them
ci, ci_overlapping, ci_center, _ = confidence_interval_info(research_data, research_metrics, group_pairs)
# Visualize confidence intervals
plot_metric_confidence_interval(ci, metrics=research_metrics, 
                                title='<b>Confidence intervals of the statistics</b>',
                                height=300, n_cols=3,
                                horizontal_spacing=0.04, vertical_spacing=0.07)

For clarity, let's also present the obtained results in a tabular form (the "worst" and "best" values of the confidence interval centers will be highlighted in "<span style="color: red; font-weight: bold;">red</span>" and "<span style="color: green; font-weight: bold;">green</span>" colors, respectively).

In [None]:
display_confidence_interval(ci, metrics=research_metrics, caption='<b>Confidence intervals of the statistics</b>', caption_font_size=12, opacity=0.5, precision=1)

Based on the confidence intervals, we can make the following **conclusions**:
1. All metrics show a tendency for the central values of the statistics to increase with the growth of the customer satisfaction index. Therefore, statistically significant differences are most likely to be absent between neighboring groups.
2. Group `3` stands out from the general picture. For this group, the metrics `Downlink Throughput` and `Web Page Download Throughput` are **worse** than for group `2`. This difference looks **substantial**, especially for the metric `Downlink Throughput`: its value for group `3` is almost the same as for group `1`.
3. In accordance with the indicated tendency, the worst metric values are observed for group `1`, and the best ones for group `5`.
4. The confidence intervals of group `5` are much **narrower** than those of the other groups. This most likely does not mean that the spread of metric values in the population to which this group belongs is smaller than in the populations of the other groups. More probably, the effect is due to the much larger size of this group: with repeated samples from it, the probability of obtaining extreme values of the statistics is **lower** than for smaller groups. The opposite situation is observed for the smallest group `3`.
5. However, the confidence intervals of the metrics in group `4` are **wider** than in groups `1`, `2` and `3`, which are comparable to it in size, and this may indicate a **larger** spread of metric values in this group.

It is visually noticeable that only one pair of neighboring groups, `4` and `5`, has confidence intervals that do not intersect for the statistics of all metrics. To make sure of this, let's additionally display information about the presence of "overlaps" (common areas) of the confidence intervals of the statistics of the metrics under research in pairs of neighboring groups. <span style="color: red; font-weight: bold;">We highlight negative results in red</span> because they are the most important and informative.<br>

In [None]:
display_confidence_interval_overlapping(ci_overlapping, metrics=research_metrics, caption='<b>Overlapping confidence intervals of the statistics</b>', opacity=0.5, index_width=30)

Based on the obtained results on the presence of "overlapping" confidence intervals, the following conclusion can be made:
Indeed, the confidence intervals of the statistics of all metrics **do not overlap** only for the pair of groups `4` and `5`. That is, only for this pair of groups can we conclude from the confidence intervals alone that there is a statistically significant difference in the metrics under research (the metrics of group `4` are smaller than those of group `5`). For the remaining pairs of groups, the exploratory analysis does not allow us to judge the significance of the difference in the metrics, so statistical tests must be carried out.
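
The "overlap" check itself is simple and can be sketched as follows. The function `intervals_overlap` and the example intervals are hypothetical and serve only to illustrate the idea behind the table above.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two closed intervals (low, high) share at least one point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    # The intervals overlap when the larger lower bound
    # does not exceed the smaller upper bound
    return max(low_a, low_b) <= min(high_a, high_b)

# Hypothetical intervals, not taken from the research data
overlap_example = intervals_overlap((1.0, 3.0), (2.5, 4.0))
disjoint_example = intervals_overlap((1.0, 2.0), (2.5, 4.0))
print(overlap_example, disjoint_example)
```

Note the asymmetry of this check: intervals that do not overlap point to a significant difference, while overlapping intervals do not by themselves prove that the difference is absent, which is why the tests in the next section are needed.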

### 7.4. Statistical tests

Based on the exploratory analysis, we have established that there is a clear tendency for the metric values to increase with increasing customer satisfaction. Therefore, to answer the 1st question, we should perform "one-sided" tests comparing the statistics of adjacent groups, and we use the permutation test for this. The first group in each tested pair will be the group of customers with the lower level of satisfaction, so we perform "left-sided" tests.

In [None]:
# Use the "left-sided" test for all metrics
alternatives = research_metrics['impact'].apply(lambda impact: 'less')
# Create a dataframe for test results
pvalues = pd.DataFrame(index=[', '.join(group_pair) for group_pair in group_pairs], columns=research_metrics.index)
# On the null distribution, mark the region located to the left of the observed value of the statistic
mark_statistic = alternatives.apply(lambda impact: 'tomin')

#### 7.4.1. Statistical test for the groups 1 and 2

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of group `1` are **not less than** those of group `2`. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of group `1` are **less than** those of group `2`. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{1}-\hat{TM}_{2}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's run the test and visualize the results by plotting histograms of the null distribution of the test statistic for each metric under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and shade the areas of the distributions that are used to calculate the `p-values` (the areas bounded by the lines of the observed values).

In [None]:
# Test data preparation
group_pair_index = 0
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for all the metrics under research are **less** than the significance level, we **can reject the null hypothesis** for them and, accordingly, we assume that the values of all the metrics under research for group `1` are **less** than for group `2`.

2. It should be noted that this result is highly reliable: the `p-values` are so low that the null hypothesis would also be rejected at a significance level of about `0.01` (a **confidence level** of about `99%`).

#### 7.4.2. Statistical test for the groups 2 and 3

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of group `2` are **not less than** those of group `3`. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of group `2` are **less than** those of group `3`. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{2}-\hat{TM}_{3}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's run the test and visualize the results by plotting histograms of the null distribution of the test statistic for each metric under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and shade the areas of the distributions that are used to calculate the `p-values` (the areas bounded by the lines of the observed values).

In [None]:
# Test data preparation
group_pair_index = 1
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained for all the metrics under research are **greater** than the significance level, we **cannot reject the null hypothesis** for them and, accordingly, we assume that the values of all the metrics under research for group `2` are **not less** than those of group `3`.

2. It should be noted that the `p-values` for the `Downlink Throughput` and `Web Page Download Throughput` metrics are well above `0.5`, which indicates that the observed values of these metrics for group `2` are noticeably **"better"** than those of group `3`, i.e. the difference runs opposite to the direction of the alternative hypothesis.
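
Why a `p-value` above `0.5` in a left-sided test points to a difference in the opposite direction can be seen from a small sketch. The null distribution here is a synthetic stand-in (a standard normal sample) and the observed value is hypothetical; neither comes from the research data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a permutation null distribution centered at zero
null = rng.normal(0.0, 1.0, size=5000)
# Hypothetical observed difference lying to the right of the center
observed = 1.2

# p-value for the alternative "less": share of the null at or below the observation
p_left = np.mean(null <= observed)
# p-value for the alternative "greater": share of the null at or above the observation
p_right = np.mean(null >= observed)
print(f'left-sided: {p_left:.3f}, right-sided: {p_right:.3f}')
```

For a continuous statistic the two one-sided `p-values` add up to about `1`, so a left-sided `p-value` well above `0.5` corresponds to a small right-sided one.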

#### 7.4.3. Statistical test for the groups 3 and 4

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of group `3` are **not less than** those of group `4`. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of group `3` are **less than** those of group `4`. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{3}-\hat{TM}_{4}\\
H_0:\Delta \hat{TM}≥0\\
H_1:\Delta \hat{TM}<0
$$

Let's run the test and visualize the results by plotting histograms of the null distribution of the test statistic for each metric under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and shade the areas of the distributions that are used to calculate the `p-values` (the areas bounded by the lines of the observed values).

In [None]:
# Test data preparation
group_pair_index = 2
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-value` obtained for the `Downlink Throughput` metric is **less than** the significance level, we **can reject the null hypothesis** for this metric and, accordingly, we assume that its values for group `3` are **less than** those of group `4`.

2. It should be noted that the `p-value` of the `Web Page Download Throughput` metric is only **slightly higher** than the significance level, which indicates a **noticeable**, although **not statistically significant**, difference in the values of this metric between the tested groups.

#### 7.4.4. Statistical test for the groups 4 and 5

For all metrics, we take as the `null hypothesis` the assumption that the values of the statistics of group `4` are **not less than** those of group `5`. As the `alternative hypothesis`, we take the opposite statement: the values of the statistics of group `4` are **less than** those of group `5`. In mathematical form, the test statistic and the hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{4}-\hat{TM}_{5}\\
H_0:\Delta \hat{TM}\geq 0\\
H_1:\Delta \hat{TM}<0
$$
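The left-sided hypotheses above can be illustrated with a minimal permutation test sketch. This is not the notebook's own `permutation_test` helper (which is defined elsewhere); the function name `one_sided_permutation_pvalue`, the use of the mean as the statistic, the number of resamples, and the synthetic samples are all assumptions made for illustration.

```python
import numpy as np


def one_sided_permutation_pvalue(sample_a, sample_b, statistic=np.mean,
                                 n_resamples=2000, seed=0):
    """Left-sided permutation test for H1: statistic(a) - statistic(b) < 0."""
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    observed = statistic(sample_a) - statistic(sample_b)
    pooled = np.concatenate([sample_a, sample_b])
    n_a = sample_a.size
    null_distribution = np.empty(n_resamples)
    for i in range(n_resamples):
        # Under H0 the group labels are exchangeable, so shuffle and re-split
        shuffled = rng.permutation(pooled)
        null_distribution[i] = statistic(shuffled[:n_a]) - statistic(shuffled[n_a:])
    # Share of permuted differences that are at most the observed one
    # (+1 in numerator and denominator counts the observed split itself)
    p_value = (np.sum(null_distribution <= observed) + 1) / (n_resamples + 1)
    return p_value, null_distribution, observed


# Synthetic samples for illustration only (not the survey data)
demo_rng = np.random.default_rng(42)
group_low = demo_rng.normal(loc=1000, scale=300, size=200)
group_high = demo_rng.normal(loc=1300, scale=300, size=200)
p_value, null_distribution, observed = one_sided_permutation_pvalue(group_low, group_high)
print(f"observed difference = {observed:.1f}, p-value = {p_value:.4f}")
```

The shaded area on the histograms corresponds to the part of `null_distribution` lying to the left of `observed`.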

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (from the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 3
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained as a result of the test for all the metrics under research are **less** than the significance level, we **can reject the null hypothesis** in relation to them and, accordingly, assume that the values of all the metrics under research for group `4` are **less** than for group `5`.

2. It should be noted that the **confidence level** for the result obtained is very **high** - more than `99%`.

### 7.5. Conclusions

To form the conclusions of the research, it is necessary to analyze the results of all the tests performed. To do this, let's display the obtained `p-values` of all the tests in the form of a table:

In [None]:
display_pvalues(pvalues, metrics=research_metrics, alpha=alpha, caption=f'<b>Tests p-values</b>', opacity=0.5, col_width=160, index_width=30)

Based on this information, we can give the following answers to the questions posed:
1. Since all the metrics of customers in group `1` are **less** than those of the neighboring group `2`, we can assume that the customers of these groups **belong to different populations**. On the same basis, customers of neighboring groups `4` and `5` should be assigned to different populations.
2. Since the `Downlink Throughput` metric of customers in group `3` is **less** than that of the neighboring group `4`, we can assume that the customers of these groups **belong to different populations**.
3. Since all the metrics of customers in group `2` are **not less** than those of the neighboring group `3`, we can assume that the customers of these groups **belong to the same population**.
4. The metric `Video Streaming Download Throughput` has the **strongest** impact on dividing customers into populations depending on their mobile internet service assessment, since the average `p-value` of this metric (`0.1472`) is noticeably lower than that of the other metrics: `Downlink Throughput` (`0.2066`) and `Web Page Download Throughput` (`0.2088`).
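The comparison of average `p-values` per metric can be reproduced with a short pandas sketch. The table `pvalues_demo` and its numbers are purely illustrative placeholders, not the results of this research; the conversion with `astype(float)` is included on the assumption that a table created empty and filled row by row may have the `object` dtype.

```python
import pandas as pd

# Illustrative p-values only (hypothetical metrics and group pairs)
pvalues_demo = pd.DataFrame(
    {"metric_a": [0.01, 0.40, 0.03], "metric_b": [0.02, 0.10, 0.04]},
    index=["pair 1", "pair 2", "pair 3"],
).astype(float)

# Average p-value of each metric over all tested pairs, smallest first
mean_pvalues = pvalues_demo.mean().sort_values()
strongest_metric = mean_pvalues.index[0]
print(mean_pvalues)
print("Lowest average p-value:", strongest_metric)
```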

Since it was established that customers of groups `2` and `3` should be classified as belonging to the same population, let's further consider customers of these groups as belonging to the same category of customers with the same level of satisfaction with the quality of mobile internet service. Thus, the CSAT scale is narrowed to 4 points, i.e. the following customer categories remain:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;1&nbsp;</span> - Completely dissatisfied;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;2&nbsp;</span> - Partially dissatisfied;
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;3&nbsp;</span> - Partially satisfied;
- <span style="color:white;background-color:rgb(214, 39, 40);opacity:0.5">&nbsp;4&nbsp;</span> - Completely satisfied.

Let's illustrate these categories on a customer distribution map depending on their CSAT:

In [None]:
(
    display_cat_info(data_clean)
    .set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5)
    .set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
    .set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
    .set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5)
    .set_properties(pd.IndexSlice['Very satisfied', 'No'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[3], opacity=0.5)
)

## 8. Research of the influence of metrics on the customer satisfaction level with Mobile Internet service

### 8.1. Purpose of the research

As a result of the previous research, it was established that customers should be divided into four categories by the level of `CSAT`.

As part of this research, let's check that such a division is statistically justified, i.e. that customers of different categories belong to different populations whose metrics have statistically different values. Moreover, the higher the customer satisfaction category, the higher the metric values should be.

In addition, let's try to determine which of the metrics under research has the greatest impact on `CSAT`.

To perform the research, let's divide customers into groups by the value of `CSAT`. Each group is assigned a designation corresponding to its `CSAT` value and a color used when displaying graphic and tabular data:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;1&nbsp;</span> - Completely dissatisfied;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;2&nbsp;</span> - Partially dissatisfied;
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;3&nbsp;</span> - Partially satisfied;
- <span style="color:white;background-color:rgb(214, 39, 40);opacity:0.5">&nbsp;4&nbsp;</span> - Completely satisfied.

Let's illustrate these groups on a customer distribution map:

In [None]:
(
    display_cat_info(data_clean)
    .set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[0], opacity=0.5)
    .set_properties(pd.IndexSlice['Very unsatisfied': 'Unsatisfied', 'Internet': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[1], opacity=0.5)
    .set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet and Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5)
    .set_properties(pd.IndexSlice['Neutral': 'Satisfied', 'Internet': 'Video'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[2], opacity=0.5)
    .set_properties(pd.IndexSlice['Very satisfied', 'No'],
                    color='white', background=px.colors.DEFAULT_PLOTLY_COLORS[3], opacity=0.5)
)

### 8.2. Data Preparation

Let's build a dataset for the research by dividing it into the above groups. Let's use the names of the groups as an index.
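As a side note, the same kind of group assignment can be written in vectorized form with `numpy.select`. The sketch below is only an illustration on a tiny hypothetical frame `demo`; it assumes the rule used in this section (low scores are split by dissatisfaction reason, all `Neutral` and `Satisfied` customers form group `3`, everyone else forms group `4`) and is not the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with the two columns used for grouping
demo = pd.DataFrame({
    "Internet score": ["Very unsatisfied", "Unsatisfied", "Neutral",
                       "Satisfied", "Very satisfied"],
    "Dissatisfaction reasons": ["Internet and Video", "Internet",
                                "Internet and Video", "Video", "No"],
})

low_score = demo["Internet score"].isin(["Very unsatisfied", "Unsatisfied"])
mid_score = demo["Internet score"].isin(["Neutral", "Satisfied"])
both_reasons = demo["Dissatisfaction reasons"].eq("Internet and Video")

# Conditions are checked in order; the first match wins
demo["Group"] = np.select(
    [low_score & both_reasons, low_score & ~both_reasons, mid_score],
    ["1", "2", "3"],
    default="4",
)
print(demo)
```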

In [None]:
research_data = data_clean.copy()
research_data.index = research_data[['Internet score', 'Dissatisfaction reasons']].apply(
    lambda x:
        '1' if (x['Internet score'] in ['Very unsatisfied', 'Unsatisfied']) 
             & (x['Dissatisfaction reasons'] == 'Internet and Video') else (
        '2' if (x['Internet score'] in ['Very unsatisfied', 'Unsatisfied']) 
             & (x['Dissatisfaction reasons'] != 'Internet and Video') else (
        '3' if (x['Internet score'] in ['Neutral', 'Satisfied']) 
             & (x['Dissatisfaction reasons'] == 'Internet and Video') else (
        '3' if (x['Internet score'] in ['Neutral', 'Satisfied']) 
             & (x['Dissatisfaction reasons'] != 'Internet and Video') else
        '4')))
    , axis=1)
research_data.index.rename('Group', inplace=True)
research_data = research_data[research_metrics.index]
groups = research_data.index.unique().to_list()

### 8.3. Exploratory Analysis

First, let's look at the number of customers in the groups. To do this, we'll build a bar chart.

In [None]:
plot_group_size_barchart(research_data,
                         title='<b>Number of clients in the study groups</b>', title_y=0.9,
                         width=700, height=260)

As we can see, after changing the `CSAT` scale, the number of customers who can be classified as partially satisfied customers (group `3`) became equal to the number of partially dissatisfied customers (group `2`).

Then let's find the confidence intervals of the statistics of the metrics of these groups. This information will help to estimate the central values of the statistics and to check whether the metrics tend to grow as customer satisfaction increases.

Let's calculate the confidence intervals using the bootstrap method. The results will be displayed graphically (the confidence intervals are drawn as horizontal segments, and their midpoints are marked with dots).
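For reference, a percentile bootstrap interval can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the notebook's `confidence_interval_info` helper: the function name `bootstrap_ci`, the median as the statistic, the 95% level, the number of resamples, and the synthetic sample are all chosen here for the example.

```python
import numpy as np


def bootstrap_ci(sample, statistic=np.median, n_resamples=2000,
                 confidence_level=0.95, seed=0):
    """Percentile bootstrap interval; `statistic` must accept an `axis` argument."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
    replicates = statistic(sample[idx], axis=1)
    tail = (1 - confidence_level) / 2
    lower, upper = np.quantile(replicates, [tail, 1 - tail])
    midpoint = (lower + upper) / 2
    return lower, upper, midpoint


# Synthetic sample for illustration only (not the survey data)
demo_sample = np.random.default_rng(1).normal(loc=1000, scale=300, size=300)
lower, upper, midpoint = bootstrap_ci(demo_sample)
print(f"CI: [{lower:.1f}, {upper:.1f}], midpoint: {midpoint:.1f}")
```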

In [None]:
# Build a list of test group pairs 
group_pairs = [[groups[0], groups[1]], [groups[1], groups[2]], [groups[2], groups[3]]]
# Calculate confidence intervals, their midpoints and check for the presence of "overlaps" among them
ci, ci_overlapping, ci_center, _ = confidence_interval_info(research_data, research_metrics, group_pairs)
# Visualize confidence intervals
plot_metric_confidence_interval(ci, metrics=research_metrics, 
                                title='<b>Confidence intervals of the statistics</b>',
                                height=270, n_cols=3,
                                horizontal_spacing=0.04, vertical_spacing=0.07)

For clarity, we will also present the obtained results in a tabular form (the "worst" and "best" values of the midpoints of the confidence intervals will be highlighted in "<span style="color: red; font-weight: bold;">red</span>" and "<span style="color: green; font-weight: bold;">green</span>" colors, respectively).

In [None]:
display_confidence_interval(ci, metrics=research_metrics, caption='<b>Confidence intervals of the statistics</b>', caption_font_size=12, opacity=0.5, precision=1, index_width=30)

Based on the confidence intervals, we can make the following **conclusions**:
1. All metrics show a tendency for the central values of statistics to increase with the growth of the customer satisfaction index.

2. In accordance with the indicated tendency, the "worst" values of the metrics are observed in the first group `1`, and the "best" in the last group `4`.

It is visually noticeable that only in one pair of neighboring groups, `3` and `4`, the confidence intervals of the statistics of all metrics do not intersect. But, to be sure of this, let's additionally display information about the presence of "overlaps" (common areas) of the confidence intervals of the statistics of the researched metrics for each pair of groups. Negative results will be highlighted in "<span style="color: red; font-weight: bold;">red</span>", since they are the most important and informative.

In [None]:
display_confidence_interval_overlapping(ci_overlapping, metrics=research_metrics, caption='<b>Overlapping confidence intervals of the statistics</b>', opacity=0.5, index_width=30)

Based on the obtained results on the presence of "overlapping" confidence intervals, the following conclusions can be made:
Indeed, the confidence intervals of the statistics of all metrics **do not overlap** only for the pair of groups `3` and `4`. That is, only with respect to this pair of groups can a conclusion be made about a statistically significant difference in the metrics under research (the metrics of group `3` are smaller than those of group `4`). With respect to the remaining pairs of groups, it is impossible to make a conclusion about the significance of the difference in the metrics of these groups based on exploratory analysis alone, so statistical tests must be carried out.
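The "overlap" check itself reduces to a comparison of interval bounds. A minimal sketch follows; the function name `intervals_overlap` and the example intervals are hypothetical and are not taken from the notebook's helpers or results.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two closed intervals (low, high) share at least one point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return max(low_a, low_b) <= min(high_a, high_b)


# Hypothetical intervals for illustration
overlapping = intervals_overlap((1.0, 2.0), (1.5, 3.0))
separated = intervals_overlap((1.0, 2.0), (2.5, 3.0))
print(overlapping, separated)
```

Non-overlapping intervals point to a significant difference, while overlapping intervals leave the question open, which is why the remaining pairs are tested separately.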

### 8.4. Statistical tests

Based on exploratory analysis, we have established that there is a clear tendency for the metric values to increase with increasing customer satisfaction. Therefore, to answer the first question, we should perform "one-sided" tests comparing the statistics of adjacent groups, using the permutation test. The first group in each tested pair will be the group of customers with the lower level of satisfaction, so let's perform "left-sided" tests.
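For comparison, SciPy ships a generic `scipy.stats.permutation_test` that supports the same left-sided alternative via `alternative='less'`. The sketch below uses synthetic data and a difference of means as the statistic; both are assumptions for illustration, and the notebook's own `permutation_test` helper may differ in its statistic and interface.

```python
import numpy as np
from scipy import stats


def mean_difference(x, y, axis):
    """Test statistic: mean of the first sample minus mean of the second."""
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)


# Synthetic samples for illustration only (not the survey data)
demo_rng = np.random.default_rng(7)
less_satisfied = demo_rng.normal(loc=1000, scale=300, size=200)
more_satisfied = demo_rng.normal(loc=1300, scale=300, size=200)

result = stats.permutation_test(
    (less_satisfied, more_satisfied),
    mean_difference,
    vectorized=True,
    n_resamples=2000,
    alternative='less',  # H1: the statistic is less than under the null
)
print(f"statistic = {result.statistic:.1f}, p-value = {result.pvalue:.4f}")
```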

In [None]:
# Use the "left-sided" test for all metrics
alternatives = research_metrics['impact'].apply(lambda impact: 'less')
# Create a dataframe for test results
pvalues = pd.DataFrame(index=[', '.join(group_pair) for group_pair in group_pairs], columns=research_metrics.index)
# On the null distribution, mark the region located to the left of the observed value of the statistic
mark_statistic = alternatives.apply(lambda impact: 'tomin')

#### 8.4.1. Statistical test for the groups 1 and 2

For all metrics, let's take as the `null hypothesis` the assumption that the values of the statistics of group `1` are **not less than** the values of the statistics of group `2`, and as the `alternative hypothesis` the opposite statement, that the values of the statistics of group `1` are **less than** the values of the statistics of group `2`. In mathematical form, the test statistics and the formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{1}-\hat{TM}_{2}\\
H_0:\Delta \hat{TM}\geq 0\\
H_1:\Delta \hat{TM}<0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (from the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 0
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained as a result of the test for all the metrics under research are **less** than the significance level, we **can reject the null hypothesis** in relation to them and, accordingly, assume that the values of all the metrics under research for group `1` are **less** than for group `2`.

2. It should be noted that the **confidence level** for the result obtained is very **high** - about `99%`.

#### 8.4.2. Statistical test for the groups 2 and 3

For all metrics, let's take as the `null hypothesis` the assumption that the values of the statistics of group `2` are **not less than** the values of the statistics of group `3`, and as the `alternative hypothesis` the opposite statement, that the values of the statistics of group `2` are **less than** the values of the statistics of group `3`. In mathematical form, the test statistics and the formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{2}-\hat{TM}_{3}\\
H_0:\Delta \hat{TM}\geq 0\\
H_1:\Delta \hat{TM}<0
$$

Let's perform testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (from the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 1
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-value` obtained as a result of the test for the `Video Streaming Download Throughput` metric is **less than** the significance level, we **can reject the null hypothesis** with respect to this metric and, accordingly, assume that the values of this metric in group `2` are **less than** those of group `3`.

2. It should be noted that the `p-values` for the `Downlink Throughput` and `Web Page Download Throughput` metrics are considerably greater than the significance level, so the null hypothesis cannot be rejected for them; this suggests that the values of these metrics in group `2` are close to those of group `3`.

#### 8.4.3. Statistical test for the groups 3 and 4

For all metrics, let's take as the `null hypothesis` the assumption that the values of the statistics of group `3` are **not less than** the values of the statistics of group `4`, and as the `alternative hypothesis` the opposite statement, that the values of the statistics of group `3` are **less than** the values of the statistics of group `4`. In mathematical form, the test statistics and the formulated hypotheses can be written as follows:

$$
\Delta \hat{TM}=\hat{TM}_{3}-\hat{TM}_{4}\\
H_0:\Delta \hat{TM}\geq 0\\
H_1:\Delta \hat{TM}<0
$$

Let's perform the testing and visualize the results by constructing histograms of the null distribution of the test statistics of the metrics under research. On the histograms, we mark the observed value of the test statistic with a vertical dashed line and color the areas of the distributions that are used to calculate the `p-values` (from the lines of the observed values of the test statistic).

In [None]:
# Test data preparation
group_pair_index = 2
test_data = research_data.loc[group_pairs[group_pair_index]]
# Perform testing
pvalues.loc[', '.join(group_pairs[group_pair_index])], null_distributions, statistics = permutation_test(
    test_data, research_metrics['test statistic'], alternatives)
# Output of the density histogram of the null distribution
plot_metric_histograms(
    null_distributions, statistic=statistics, metrics=research_metrics,
    title=f'<b>Density of the null probability distribution of the test statistic</b>', title_y=0.9,
    height=300, n_cols=3, opacity=0.5,
    histnorm='probability density',
    add_kde=True, add_statistic=True, mark_statistic=mark_statistic,
    horizontal_spacing=0.08, vertical_spacing=0.07)

Let's present the obtained `p-values` in a tabular form, marking in "<span style="color: red; font-weight: bold;">red</span>" those values that are below the significance level.

In [None]:
display_pvalues(pvalues.loc[', '.join(group_pairs[group_pair_index])].to_frame().T, metrics=research_metrics, alpha=alpha, caption=f'<b>Test p-values</b>', col_width=160)

**Conclusions:**

1. Since the `p-values` obtained as a result of the test for all the metrics under research are **less** than the significance level, we **can reject the null hypothesis** in relation to them and, accordingly, assume that the values of all the metrics under research for group `3` are **less** than for group `4`.

2. It should be noted that the **confidence level** for the result obtained is very **high** - about `99%`.

### 8.5. Conclusions

To form the conclusions of the research, it is necessary to analyze the results of all the tests performed. To do this, let's display the obtained `p-values` of all the tests in the form of a table:

In [None]:
display_pvalues(pvalues, metrics=research_metrics, alpha=alpha, caption=f'<b>Tests p-values</b>', opacity=0.5, col_width=160, index_width=30)

Based on this information, we can give the following answers to the questions posed:
1. Since in every pair of neighboring groups at least one metric of the group with the lower `CSAT` is **less** than that of the group with the higher `CSAT`, we can assume that customers of all groups **belong to different populations**. Thus, the division of customers by the `CSAT` value is performed **correctly**.
2. Since only the `Video Streaming Download Throughput` metric has all `p-values` **less** than the significance level, this metric has the **strongest influence** on `CSAT`.
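Both checks in the list above (at least one significant metric in every pair, and a metric that is significant in all pairs) can be expressed as boolean reductions over the table of `p-values`. The sketch uses a hypothetical table `pvalues_demo` and an assumed significance level `alpha_demo`; the numbers are placeholders, not the results of this research.

```python
import pandas as pd

alpha_demo = 0.05  # assumed significance level for the illustration

# Illustrative p-values only (hypothetical metrics)
pvalues_demo = pd.DataFrame(
    {"metric_a": [0.01, 0.20, 0.01], "metric_b": [0.02, 0.03, 0.01]},
    index=["1, 2", "2, 3", "3, 4"],
).astype(float)

significant = pvalues_demo < alpha_demo

# Does every pair of groups have at least one significant metric?
every_pair_differs = bool(significant.any(axis=1).all())

# Which metrics are significant for all pairs of groups?
significant_everywhere = significant.all(axis=0)
metrics_all_pairs = significant_everywhere[significant_everywhere].index.tolist()

print(every_pair_differs, metrics_all_pairs)
```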

Additionally, let's look at how the value of the `Video Streaming Download Throughput` metric depends on `CSAT`. To do this, let's construct a scatter plot of the statistics values of this metric and a trend line (dashed line) describing the linear dependence of the statistics values on the `CSAT` level.

In [None]:
df = ci_center['Video Streaming Download Throughput(Kbps)'].rename('value').to_frame()

fig = px.scatter(
    df, x=df.index, y='value', 
    title=f"<b>Trend of \"{research_metrics.loc['Video Streaming Download Throughput(Kbps)', 'name']}\"</b>",
    labels={'x': '', 'value':'Kbps', 'index': 'CSAT'}, trendline="ols")
fig.update_layout(title_x=0.5, title_y=0.95, title_font_size=14,
                  width=600, height=350,
                  margin_t=40, margin_b=0)
fig.update_traces(hovertemplate='%{x}<br>%{y}<extra></extra>', selector={'mode': 'markers'})
fig.update_traces(line_dash='dash', selector={'mode': 'lines'})
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4])
fig.show()
results = px.get_trendline_results(fig).iloc[0, 0]

As we can see, the trend of the metric is well described by a straight line. This is confirmed by the fact that the `determination coefficient` of the linear regression is very **close to 1.0** (R²=`0.9981`). Each one-point increase in `CSAT` corresponds to a growth of `Video Streaming Download Throughput(Kbps)` by about `847` Kbps.
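The slope and the determination coefficient of such a trend line can also be computed directly. The sketch below uses `numpy.polyfit` instead of the plotly trend line, and the values in `values_demo` are illustrative placeholders rather than the midpoints obtained in this research.

```python
import numpy as np

csat_levels = np.array([1, 2, 3, 4], dtype=float)
# Illustrative values only (not the notebook's interval midpoints)
values_demo = np.array([100.0, 210.0, 290.0, 400.0])

# Least-squares straight line: value = slope * CSAT + intercept
slope, intercept = np.polyfit(csat_levels, values_demo, deg=1)

# Determination coefficient R^2 = 1 - SS_res / SS_tot
predicted = slope * csat_levels + intercept
ss_res = np.sum((values_demo - predicted) ** 2)
ss_tot = np.sum((values_demo - values_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}, R^2 = {r_squared:.4f}")
```

The slope is the average change of the statistic per one-point step of `CSAT`, which is how the step between neighboring categories is interpreted in the text.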

## Summary

Within this work, a study of the Megafon customer survey was carried out. As a result, it was found that the **level of customer satisfaction** `CSAT` with the Mobile Internet service should be determined on a **4-point scale**:
- <span style="color:white;background-color:rgb(31, 119, 180);opacity:0.5">&nbsp;1&nbsp;</span> - Completely dissatisfied;
- <span style="color:white;background-color:rgb(255, 127, 14);opacity:0.5">&nbsp;2&nbsp;</span> - Partially dissatisfied;
- <span style="color:white;background-color:rgb(44, 160, 44);opacity:0.5">&nbsp;3&nbsp;</span> - Partially satisfied;
- <span style="color:white;background-color:rgb(214, 39, 40);opacity:0.5">&nbsp;4&nbsp;</span> - Completely satisfied.

The `CSAT` is primarily affected by `Video Streaming Download Throughput(Kbps)`. The dependence between `CSAT` and the statistics of this metric is clearly linear: the difference in the central values of the statistics for customers with neighboring `CSAT` levels is approximately `847` Kbps.

For each category of customers with a given `CSAT`, the confidence interval of this statistic was determined:

In [None]:
display_confidence_interval(ci['Video Streaming Download Throughput(Kbps)'], metrics=research_metrics.loc['Video Streaming Download Throughput(Kbps)'], caption='', caption_font_size=12, opacity=0.5, precision=1, index_width=30)