In [1]:
import pandas as pd

### Task 1
The "Audience Data" tab contains information about users who visited our app in November. What is the MAU of the product?
*MAU (Monthly Active Users) is a metric used to measure user activity for one month. It shows the number of unique users who have interacted with a product, service, or application at least once in the last month.

In [18]:
df = pd.read_excel('data.xlsx', sheet_name='Данные об аудитории')

mau = df['user_id'].nunique()
print(f'Monthly Active Users (MAU) for November 2023: {mau}')

Monthly Active Users (MAU) for November 2023: 7639


**Answer:** 7639

### Task 2
Using the "Audience Data" tab, calculate what the DAU will be.
*DAU (Daily Active Users) is a metric that shows the number of unique users who have interacted with a product, application, or service at least once during the day. DAU helps to understand how many users actively use the product every day.

In [57]:
# Count unique users for each day
dau = df.groupby('date')['user_id'].nunique()

dau_df = dau.reset_index(name='DAU')
dau_df.columns = ['Date', 'DAU']
# Average DAU across all days in the data
dau_df['DAU'].mean()

560.4666666666667

**Answer:** 560

### Task 3
Using the "Audience Data" tab, calculate what the first-day retention rate will be for users who joined the product on November 1st.
*Retention is a metric that shows how many users continue to use a product after a certain period following their initial interaction. Retention can be calculated as the percentage of users who returned to the product after a specific time (e.g., after 1 day, 1 week, 1 month) out of the total number of new users.

In [60]:
# Step 1: Identify new users on November 1st
new_users_nov_1 = df[df['date'] == '2023-11-01']['user_id'].unique()
num_new_users = len(new_users_nov_1)

# Step 2: Track returning users on November 2nd
returning_users_nov_2 = df[(df['date'] == '2023-11-02') & (df['user_id'].isin(new_users_nov_1))]['user_id'].unique()
num_returning_users = len(returning_users_nov_2)

# Step 3: Calculate retention rate
if num_new_users > 0:
    retention_rate = (num_returning_users / num_new_users) * 100
else:
    retention_rate = 0

print(f'First-Day Retention Rate for users who joined on November 1st: {retention_rate:.2f}%')

First-Day Retention Rate for users who joined on November 1st: 26.65%


**Answer:** 26.65%
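The calculation above treats every user who was active on November 1st as a new user. If the data could also include users whose first visit was on another day, the cohort could instead be defined by each user's first observed date. A minimal sketch under that assumption follows; the `day_n_retention` helper and the `sample` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def day_n_retention(df, cohort_date, n=1):
    """Share (in %) of users first seen on cohort_date who are active n days later."""
    dates = pd.to_datetime(df['date'])
    # First observed date for every user
    first_seen = dates.groupby(df['user_id']).min()
    cohort_day = pd.Timestamp(cohort_date)
    cohort = first_seen[first_seen == cohort_day].index
    if len(cohort) == 0:
        return 0.0
    # Users active exactly n days after the cohort date
    target_day = cohort_day + pd.Timedelta(days=n)
    active = df.loc[dates == target_day, 'user_id']
    returned = active[active.isin(cohort)].nunique()
    return returned / len(cohort) * 100

# Made-up data: users 1-4 first appear on Nov 1, user 5 first appears on Nov 2
sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 5],
    'date': ['2023-11-01'] * 4 + ['2023-11-02'] * 3,
})
rate = day_n_retention(sample, '2023-11-01', n=1)
print(f'{rate:.1f}%')
```

On the real data this could be called as `day_n_retention(df, '2023-11-01')`, assuming the `date` column parses as dates.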

### Task 4
On the graph, there are retention curves for two products. What conclusions can be drawn by looking at them?

**Answer:** Product 1 retains users better: its retention curve declines more slowly than that of Product 2. Product 2's retention reached zero on day 5, while Product 1 still retained 50% of users on day 7.

### Task 5
In the "Audience Data" tab, there is information about how many ads each user has viewed (view_adverts). Calculate the user conversion to ad views for November (in users).
* User conversion is a metric that shows what percentage of users completed a target action relative to the total number of users. In the context of websites, this can be an action such as viewing an ad or clicking on an advertising banner.

In [84]:
total_users = df['user_id'].nunique()
conversions = df[df['view_adverts'] > 0]['user_id'].nunique()
total_ad_views = df['view_adverts'].sum()

if total_users > 0:
    conversion_rate = conversions / total_users * 100
else:
    conversion_rate = 0

if total_ad_views > 0:
    conversion_per_ad_view = conversions / total_ad_views
else:
    conversion_per_ad_view = 0

print(f'User conversion per ad view: {conversion_per_ad_view:.4f} conv/view')
print(f'User conversion rate: {conversion_rate:.1f} %')

User conversion per ad view: 0.1614 conv/view
User conversion rate: 46.3 %


**Answer:** 46.3%

### Task 6
Using the information from the Audience Data tab, calculate the average number of ads viewed per user in November.

In [88]:
total_ad_views = df['view_adverts'].sum()
unique_users = df['user_id'].nunique()

if unique_users > 0:
    average_ads_per_user = total_ad_views / unique_users
else:
    average_ads_per_user = 0

print(f'Total Ad Views: {total_ad_views}')
print(f'Unique Users: {unique_users}')
print(f'Average Ads Viewed per User in November: {average_ads_per_user:.2f}')

Total Ad Views: 21914
Unique Users: 7639
Average Ads Viewed per User in November: 2.87


**Answer:** 2.9

### Task 7
We conducted a survey among 2,000 users. Of them, 500 are "critics" (detractors), 1,200 are "promoters," and 300 are "neutrals" (passives). Calculate the NPS.
* NPS (Net Promoter Score) is a metric that measures user loyalty towards a company or product and categorizes users into three groups: Promoters, Passives, and Detractors. NPS is calculated as (% of promoters - % of detractors).

The formula is NPS = % Promoters − % Detractors.
* Promoters: 1,200 users (60%)
* Detractors: 500 users (25%)
* Neutrals: 300 users (15%); these are not included in the NPS calculation.
* NPS = 60% − 25% = 35%

**Answer:** 35
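The same arithmetic can be written as a small function. This is only an illustrative sketch; the `net_promoter_score` name is hypothetical.

```python
def net_promoter_score(promoters, passives, detractors):
    """NPS = % promoters - % detractors; passives count only towards the total."""
    total = promoters + passives + detractors
    if total == 0:
        raise ValueError('survey has no respondents')
    return (promoters - detractors) * 100 / total

nps = net_promoter_score(promoters=1200, passives=300, detractors=500)
print(nps)
```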

### Task 8
In the "AB Test Data" tab, there are results of three unrelated AB tests for ARPU (total revenue/total number of users).


Look at the test results and interpret them.

Write down the p-values you obtained.
Prepare conclusions and recommendations.

Data columns:
* experiment_num: Experiment number
* experiment_group: Group the user was assigned to
* user_id: User ID
* revenue: Revenue generated by the user through purchasing a paid promotion service.

In [125]:
df = pd.read_excel('data.xlsx', sheet_name='Данные АБ тестов')

arpu_summary = df.groupby(['experiment_num', 'experiment_group']).agg(
    total_revenue=('revenue', 'sum'),
    total_users=('user_id', 'nunique')
).reset_index()

arpu_summary['ARPU'] = arpu_summary['total_revenue'] / arpu_summary['total_users']
arpu_summary

| | experiment_num | experiment_group | total_revenue | total_users | ARPU |
|---|---|---|---|---|---|
| 0 | 1 | control | 335944 | 465 | 722.460215 |
| 1 | 1 | test | 319555 | 480 | 665.739583 |
| 2 | 2 | control | 327664 | 465 | 704.653763 |
| 3 | 2 | test | 159806 | 480 | 332.929167 |
| 4 | 3 | control | 308391 | 465 | 663.206452 |
| 5 | 3 | test | 479361 | 480 | 998.66875 |


In [127]:
from scipy import stats

results = {}

for experiment in df['experiment_num'].unique():
    test_group = df[(df['experiment_num'] == experiment) & (df['experiment_group'] == 'test')]
    control_group = df[(df['experiment_num'] == experiment) & (df['experiment_group'] == 'control')]
    
    if len(test_group) > 0 and len(control_group) > 0:
        t_stat, p_value = stats.ttest_ind(test_group['revenue'], control_group['revenue'], equal_var=False)
        results[experiment] = p_value

results

{1: 0.688966190624939, 2: 0.0011282266247294027, 3: 0.06031548002841902}

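**Answer:** At the conventional 5% significance level, only experiment 2 shows a statistically significant difference (p ≈ 0.0011). In that experiment the test group's ARPU (332.93) is lower than the control group's (704.65), so the tested change should not be rolled out. Experiment 1 gives no evidence of a difference in ARPU between the groups (p ≈ 0.689). In experiment 3 the test group's ARPU is higher (998.67 vs. 663.21), but the difference is not significant at the 5% level (p ≈ 0.060), so further testing with more data is needed before drawing a conclusion.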
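One way to make the reading of the p-values explicit is to compare each one against a significance threshold and then look at the direction of the ARPU difference. The sketch below assumes a threshold of 0.05, which the task does not state, and uses a hypothetical `interpret` helper with values copied (rounded) from the outputs above.

```python
ALPHA = 0.05  # assumed significance level; not given in the task

# p-values and group ARPUs copied from the notebook outputs (rounded)
p_values = {1: 0.6890, 2: 0.0011, 3: 0.0603}
arpu = {
    1: {'control': 722.46, 'test': 665.74},
    2: {'control': 704.65, 'test': 332.93},
    3: {'control': 663.21, 'test': 998.67},
}

def interpret(p_value, control_arpu, test_arpu, alpha=ALPHA):
    """Return a short verdict for one experiment."""
    if p_value >= alpha:
        return 'no significant difference'
    return 'test is higher' if test_arpu > control_arpu else 'test is lower'

verdicts = {
    exp: interpret(p_values[exp], arpu[exp]['control'], arpu[exp]['test'])
    for exp in p_values
}
for exp, verdict in verdicts.items():
    print(f'Experiment {exp}: p = {p_values[exp]:.4f} -> {verdict}')
```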

### Task 9
Calculate the average income per user based on the dataset with the listers.

In [130]:
df = pd.read_excel('data.xlsx', sheet_name='Листеры')
df.head()

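| | user_id | date | cnt_adverts | age | cnt_contacts | revenue |
|---|---|---|---|---|---|---|
| 0 | 100 | 2022-01-01 | 6 | 21 | 119 | 53 |
| 1 | 100 | 2022-01-02 | 2 | 21 | 200 | 18 |
| 2 | 100 | 2022-01-03 | 6 | 21 | 193 | 42 |
| 3 | 100 | 2022-01-04 | 2 | 21 | 143 | 38 |
| 4 | 100 | 2022-01-05 | 2 | 21 | 190 | 40 |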


In [137]:
df = pd.read_excel('data.xlsx', sheet_name='Листеры')

total_revenue = df['revenue'].sum()
unique_users = df['user_id'].nunique()

if unique_users > 0:
    average_income_per_user = total_revenue / unique_users
else:
    average_income_per_user = 0

print(f'Total Revenue: {total_revenue}')
print(f'Unique Users: {unique_users}')
print(f'Average Income per User: {average_income_per_user:.2f}')

Total Revenue: 4851
Unique Users: 31
Average Income per User: 156.48


**Answer:** 156.48

### Task 10
Based on the dataset with the listers, calculate the median age of the users.

In [140]:
median_age = df['age'].median()

print(f'Median Age of Users: {median_age}')

Median Age of Users: 28.0


**Answer:** 28
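Note that `df['age'].median()` takes the median over rows, and the lister data appears to hold several rows per user (one per date), so users with more rows carry more weight. If the question is about the median across users, one option is to keep a single row per user first. A minimal sketch on made-up data follows; the `sample` frame is hypothetical and only the column names follow the `df.head()` output above.

```python
import pandas as pd

# Made-up data: user 100 has three rows, users 101 and 102 have one each
sample = pd.DataFrame({
    'user_id': [100, 100, 100, 101, 102],
    'age': [21, 21, 21, 30, 40],
})

# Median over all rows: repeated users are counted several times
row_median = sample['age'].median()
# Median over users: one row per user_id
user_median = sample.drop_duplicates('user_id')['age'].median()
print(row_median, user_median)
```

The two values can differ, as they do on this sample; whether they differ on the real dataset is not checked here.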

### Task 11
Which chart is best suited to display the spread of prices for goods in different stores?

To display the spread of prices for goods in different stores, the best-suited chart types are the box plot and the histogram. A box plot summarizes each price distribution, showing the median, quartiles, and outliers. A histogram visualizes the frequency distribution of prices.

**Answer:** histogram and box plot.

### Task 12
What is the bimodal distribution graph?

**Answer:** 3rd picture, because a distribution with two peaks is called bimodal.

### Task 13
Which random variable has the largest data variance according to the following distribution density graph?

**Answer:** 3rd picture, because its values are spread over the widest range.

### Task 14
On which graph can the correlation be calculated?

**Answer:** Scatter plot and correlation heatmap. A scatter plot shows paired observations of two variables as points, so the correlation between them can be calculated from the plotted data. A correlation heatmap is a matrix that displays the correlation coefficient for each pair of variables.

### Task 15
What does it mean if, when testing hypotheses, we got p-value = 0.05?

a) This means that there is no statistically significant difference between the groups

b) There is a 5% chance of accidentally getting this or an even more extreme result if the null hypothesis is correct

c) This means that the results of the experiment are 95% accurate

d) This suggests that the alternative hypothesis is correct with a probability of 95%


**Answer:** b
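Option b can be illustrated with a small simulation: when the null hypothesis is true, a p-value of 0.05 or smaller should turn up in roughly 5% of repeated experiments. The sketch below is only an illustration under simplifying assumptions (normal data with a known standard deviation of 1, a z-test for the difference in means); the helper name and all parameters are hypothetical.

```python
import math
import random

def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """z-test for a difference in means with a known standard deviation sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n_experiments = 2000
false_alarms = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null hypothesis is true
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if two_sided_p_value(a, b) <= 0.05:
        false_alarms += 1

share = false_alarms / n_experiments
print(f'Share of experiments with p <= 0.05 under the null: {share:.3f}')
```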

### Task 16
Which method is most suitable for testing the hypothesis of equality of the average of two samples from a normal distribution?

a) t-test

b) Chi-square test

c) Analysis of Variance (ANOVA)

d) Pearson Correlation

**Answer:** a) t-test. The t-test is specifically designed to determine whether there is a significant difference between the means of two groups. It is appropriate when the samples are drawn from normally distributed populations, especially when the population variances are unknown.

### Task 17
How to interpret quartiles in the distribution of user income?

a) They show the maximum and minimum income

b) Divide the data into four equal parts

c) Indicate the most common income

d) Graph of the density distribution of matter in the universe


**Answer:** b) divide the data into four equal parts. Quartiles are statistical values that split a dataset into four equal parts, each containing 25% of the data points.
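As a small illustration, the three quartile cut points can be computed with Python's standard library. The income values below are made up, and the `'inclusive'` method is one of several conventions for interpolating quartiles.

```python
import statistics

incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # made-up incomes

# n=4 returns the three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(incomes, n=4, method='inclusive')
print(q1, q2, q3)
```

The second quartile equals the median, and the distance between the first and third quartiles is the interquartile range.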

### Task 18
The following results were obtained. Colleagues ask you to confirm them and make a final conclusion on the experiment.
Option A (control group) — 100,047,501 visitors, 1003 payments.
Option B (test group) — 100,001,055 visitors, 1099 payments.
What recommendations would you make based on this data?


Control Group (Option A):
* Visitors: 100,047,501
* Payments: 1,003
Test Group (Option B):
* Visitors: 100,001,055
* Payments: 1,099

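$$\text{Conversion Rate} = \frac{\text{Number of Payments}}{\text{Number of Visitors}} \times 100$$

$$\text{Conversion Rate A} = \frac{1{,}003}{100{,}047{,}501} \times 100 \approx 0.001003\%$$

$$\text{Conversion Rate B} = \frac{1{,}099}{100{,}001{,}055} \times 100 \approx 0.001099\%$$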

**Answer:** The test group (Option B) has a higher conversion rate than the control group (Option A): about 0.001099% versus 0.001003%. This suggests that the changes made in the test group may have positively influenced user behavior, leading to more payments. My recommendation is to continue with the changes implemented in Option B, while running further tests to confirm that the improvement is consistent over time and across different user segments.
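To confirm the result rather than only compare the two rates, the difference can be checked for statistical significance. A minimal sketch of a two-proportion z-test with pooled variance follows. The choice of this test and the `two_proportion_z_test` helper are assumptions, not something the task specifies.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Visitors and payments for Option A (control) and Option B (test)
z, p_value = two_proportion_z_test(1003, 100_047_501, 1099, 100_001_055)
print(f'z = {z:.3f}, p-value = {p_value:.4f}')
```

If the resulting p-value is below the chosen significance level (commonly 0.05), the difference is unlikely to be due to chance alone. Given how small the absolute conversion rates are, the practical size of the effect still deserves a separate look.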