## Part 7: Predictive Modeling
---
In this section, we use predictive analytics tools to:
  - measure the effect of the marketing campaign on user metrics
  - forecast future user activity to plan the server workload

### Table of Contents
- [Imports, global parameters, functions](#section-1)
- [Marketing campaign: analysis of metrics with CausalImpact](#section-2)
- [Forecasting: user activity within 1 month with Orbit](#section-3)


<a id="section-1"></a>
### Imports, global parameters, functions
---

In [1]:
# start with the imports we would need
import warnings # suppress palette-related matplotlib warnings
from hashlib import md5 # do the splitting with this hash
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as axes
import statsmodels.api as sm
import scipy.stats as stats
import pandas as pd
import swifter # applies any function to a pandas df or series faster
import pandahouse as ph # connect to the ClickHouse DB
import seaborn as sns


In [2]:
# set the connection with the db
CONNECTION = {
    'host': 'https://clickhouse.lab.karpov.courses',
    'database': 'simulator_20230720',
    'user': 'student',
    'password': 'dpo_python_2020',
}


In [60]:
# set quality of the plots to be built
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300

# Set the scaling factors for the plots
SCALE = 1 / 6
sns.set(font_scale=0.25)

# suppress palette-related matplotlib warnings
warnings.filterwarnings('ignore', category=UserWarning)

In [4]:
# a class that plots overlaid histograms (one per hue group) on the same axes,
# given an axis from a grid and font sizes

class HistPlotter:
    
    # initialize a class object with data and annotation parameters
    def __init__(
        self, 
        data,
        x,
        hue,
        xlabel=None,
        ylabel=None,
        title=None,
    ):
        self.data = data
        self.x = x
        self.hue = hue
        self.xlabel = xlabel
        self.ylabel = ylabel
        self.title = title

    # plot two histograms on the same plot given a grid axis
    def plot(
        self,
        ax,
        SCALE=1 / 6,
        bins=100,
        fontsize=18,
        labelsize=15
    ):
    
        # set fonts
        fontsize = fontsize * SCALE
        labelsize = labelsize * SCALE
        sns.set(font_scale=0.25)
        
        # adjust fonts and ticks
        ax.set_xlabel(self.xlabel, fontsize=fontsize)
        ax.set_ylabel(self.ylabel, fontsize=fontsize)
        ax.set_title(self.title, fontsize=fontsize)

        ax.tick_params(axis='y', labelsize=labelsize)
        ax.tick_params(axis='x', rotation=45, labelsize=labelsize)

        # create a histogram for each group
        sns.histplot(
            data=self.data,
            x=self.x,
            hue=self.hue,
            palette=sns.color_palette('bright', as_cmap=True),
            alpha=0.4,
            kde=False,
            ax=ax,
            bins=bins,
        )


<a id="section-2"></a>
### Marketing campaign: analysis of metrics with CausalImpact
---
Our team of marketers decided to organize a campaign in the news feed: participants must make a post where they share an interesting fact about themselves and publish it with a hashtag. The three posts that receive the most likes will win prizes.

The flash mob took place from 2023-07-14 to 2023-07-20. Your task as an analyst is to evaluate the effectiveness of this event.

Questions:
  - Which metrics did the campaign affect, and how?
  - Did the analyzed metrics really change?
  - Did the campaign have any long-term effect?
  
To do:
  - Propose which metrics might have changed and suggest the direction of the change.
  - Visualize the metrics and test the hypotheses with CausalImpact.
  - Examine the data after the campaign for long-term effects.
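
The idea behind the hypothesis test can be sketched without the library itself. The block below is a simplified stand-in, not CausalImpact: it replaces the Bayesian structural time-series model with ordinary least squares fitted on the pre-campaign period, then compares the observed metric with the predicted counterfactual during the campaign. The data are synthetic, and the `control` series (a metric the campaign cannot influence) and the injected lift of 40 are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily data (assumption): `control` is a series the campaign
# cannot influence, `target` is the metric under study.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-06-01", "2023-07-20", freq="D")
n = len(dates)
control = 100 + 10 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 1, n)
target = 50 + 2.0 * control + rng.normal(0, 2, n)

# Campaign window from the text: 2023-07-14 to 2023-07-20.
post = np.asarray(dates >= pd.Timestamp("2023-07-14"))
pre = ~post
target[post] += 40  # artificial lift, injected so the sketch has an effect to find

# Fit target ~ intercept + slope * control on the pre-period only.
X_pre = np.column_stack([np.ones(pre.sum()), control[pre]])
coef, *_ = np.linalg.lstsq(X_pre, target[pre], rcond=None)

# Counterfactual: the expected target during the campaign had nothing changed.
counterfactual = coef[0] + coef[1] * control[post]
effect = target[post] - counterfactual

# The pre-period residual spread gives a rough scale for judging the effect.
resid_std = np.std(target[pre] - X_pre @ coef, ddof=2)

print(f"mean daily effect: {effect.mean():.1f}")
print(f"pre-period residual std: {resid_std:.1f}")
```

An effect that is several times larger than the pre-period residual spread suggests a real change. CausalImpact formalizes this comparison with posterior credible intervals instead of a rule of thumb.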

#### 1. Metrics to be analyzed

Let's hypothesize which metrics could have changed during the campaign, and in which direction.
- Number of posts created: increase (people post more during the campaign) or no change (users who do not usually post stay out of the campaign).
- Number of views or likes: increase (more posts lead to more views, or the posts are displayed or promoted more prominently).
- Likes or views per post: decrease (more posts with the same viewer activity) or increase (higher quality or availability of posts).
- Click-through rate (CTR): decrease (more "artificial" posts might lower the quality and the conversion rate).
- Daily active users (DAU): increase (existing users become more active, and new users join because of the campaign).
- Retention: increase initially, then possibly decrease after the campaign (some people might participate solely for the prizes).
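
Most of the metrics listed above can be derived from a single event log. The following is a minimal sketch on a toy table. The column names (`time`, `user_id`, `post_id`, `action`) are hypothetical and may not match the real schema, CTR is assumed to mean likes divided by views, and `active_posts` counts posts that received any action that day, which is only a proxy for the number of posts created.

```python
import pandas as pd

# Hypothetical event log: one row per view or like of a post.
events = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-07-13 10:00", "2023-07-13 10:05", "2023-07-13 11:00",
        "2023-07-14 09:00", "2023-07-14 09:01", "2023-07-14 09:30",
        "2023-07-14 12:00", "2023-07-14 12:10",
    ]),
    "user_id": [1, 1, 2, 1, 2, 3, 3, 3],
    "post_id": [10, 10, 10, 11, 11, 11, 12, 12],
    "action": ["view", "like", "view", "view", "view", "view", "view", "like"],
})


def daily_metrics(df):
    """Aggregate an event log into one row of metrics per calendar day."""
    df = df.assign(
        date=df["time"].dt.normalize(),
        is_view=df["action"].eq("view"),
        is_like=df["action"].eq("like"),
    )
    daily = df.groupby("date").agg(
        dau=("user_id", "nunique"),            # daily active users
        active_posts=("post_id", "nunique"),   # posts with at least one action
        views=("is_view", "sum"),
        likes=("is_like", "sum"),
    )
    daily["ctr"] = daily["likes"] / daily["views"]
    daily["likes_per_post"] = daily["likes"] / daily["active_posts"]
    return daily


daily = daily_metrics(events)
print(daily)
```

A daily table of this shape is also a convenient input for plotting each metric before, during, and after the campaign window.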

To visualize and further have a deep dive into the data, we would need 

<a id="section-3"></a>
### Forecasting: user activity within 1 month with Orbit
---


Task 2

The more active our users are, the higher the load on the servers. Lately we have been receiving more and more complaints that the app freezes. Sounds like a job for DevOps and engineers!

You have also been asked to contribute: forecast how user activity will change over the coming month. Let's give it a try!

  - Choose the main metric you plan to forecast and justify the choice. What time resolution will you use? Will you include any additional regressors in the model?
  - Build a model and validate it. Do we have enough data for backtesting on this task? If not, determine the forecast horizon for which we do have enough data.
  - Choose the model that seems best to you (justify the choice) and interpret its results. If you see any important limitations on the conclusions, be sure to state them.

Bonus: describe how convenient Orbit was to use. If you have experience with other forecasting tools (Prophet, Darts, etc.), compare Orbit with them on the criteria you consider most important.

Submission format: a merge request in GitLab with the notebook containing the computations (.ipynb format).
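
The validation step and the question of whether there is enough data can be illustrated with a rolling-origin backtest. This is a sketch under stated assumptions, not Orbit's own tooling: the forecaster is a seasonal-naive baseline (repeat the last observed week), the error measure is sMAPE, and the series is synthetic. The function names `seasonal_naive`, `smape`, and `backtest` are hypothetical helpers introduced for this example.

```python
import numpy as np


def seasonal_naive(history, horizon, season=7):
    """Forecast by repeating the last full season of the history."""
    last_season = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, as a fraction."""
    return np.mean(2 * np.abs(forecast - actual) / (np.abs(forecast) + np.abs(actual)))


def backtest(series, horizon, min_train, step=7, season=7):
    """Expanding-window backtest: returns one sMAPE value per split."""
    errors = []
    for end in range(min_train, len(series) - horizon + 1, step):
        train = series[:end]
        actual = series[end:end + horizon]
        errors.append(smape(actual, seasonal_naive(train, horizon, season)))
    return np.array(errors)


# Synthetic daily activity (assumption): a weekly pattern plus noise, 70 days.
rng = np.random.default_rng(1)
weekly = np.array([100, 110, 120, 130, 125, 90, 80], dtype=float)
series = np.tile(weekly, 10) + rng.normal(0, 3, 70)

errors = backtest(series, horizon=14, min_train=28)
print(f"splits: {len(errors)}, mean sMAPE: {errors.mean():.3f}")
```

If `backtest` returns an empty array, the series is too short for the requested horizon: at least `min_train + horizon` observations are needed for a single split. In that case the horizon has to be reduced until several splits fit.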