## Task

Analyze a data set using Python (additional libraries are allowed and encouraged): compute the mean, maximum/minimum, median, and mode of the numeric values, both for the data set as a whole and for each content type (the Type column) separately. Find the most popular object in the sample and explain why. Submit the solution as a .py/.ipynb file on GitHub.

## Contents <a class="anchor" id="zero-bullet"></a>  
* [Technical part](#import-bullet) (importing modules and loading data)
* [EDA](#eda-bullet)
* [Computing and printing descriptive statistics](#descriptives-bullet)  
     a) [For all data](#descriptives_all-bullet)  
     b) [For data by type](#descriptives_type-bullet)     
     c) ["The most popular object"](#popularest-bullet)

>> ## Importing modules <a class="anchor" id="import-bullet"></a>

In [6]:
from tqdm import tqdm_notebook

# import numpy as np # redundant for task
import pandas as pd
from scipy import stats

import pandas_profiling
from IPython.display import HTML  # needed for the HTML(...) call below

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

**Module overview:**

**Main:**
* pandas -- to work with a pandas.DataFrame object and its methods for the task; part of the PyData stack;
* scipy.stats -- to avoid writing a custom function for computing the mode of a series; part of the PyData stack;
* pandas_profiling -- to run a quick out-of-the-box EDA;

**Auxiliary:**
* tqdm -- to show progress and estimate the run time of iterative operations;

[Back to contents](#zero-bullet)

>> ## Loading the data

In [2]:
FILE_PATH = '../src/'

In [3]:
FILE_NAME = 'dataset_Facebook.csv'

In [4]:
%%time
df = pd.read_csv(FILE_PATH + FILE_NAME, sep=';')

Wall time: 20 ms


In [5]:
f'The data is a table of size {df.shape[0]} x {df.shape[1]} (objects x features)'

'The data is a table of size 500 x 19 (objects x features)'

**Print the column types**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes                                                       500 non-null int64
Type                                                                   500 non-null object
Category                                                               500 non-null int64
Post Month                                                             500 non-null int64
Post Weekday                                                           500 non-null int64
Post Hour                                                              500 non-null int64
Paid                                                                   499 non-null float64
Lifetime Post Total Reach                                              500 non-null int64
Lifetime Post Total Impressions                                        500 non-null int64
Lifetime Engaged Users                                                 500 non-nul

Note that all features except **Type** are numeric: $3$ are floating-point and $15$ are integer.

There are _NaN_ values in the features **Paid** ($1$), **like** ($1$), and **share** ($4$).  
The nature of the missing values is unknown and their number does not exceed $1\%$ of the rows, so we will replace them with zeros. They are probably random errors in the data collection process. They can be examined later if necessary.
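
The missing-value check and the zero fill described above can be sketched as follows. This is a minimal illustration on a made-up toy frame (the name `toy` and its values are assumptions, not the real data set); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Paid": [0.0, 1.0, np.nan, 0.0],
    "like": [79.0, np.nan, 66.0, 1572.0],
    "share": [17.0, 29.0, np.nan, np.nan],
})

# Number of missing values per column and their share of all rows.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Replace the gaps with zeros, as proposed in the text.
filled = toy.fillna(0)

print(missing_counts.to_dict())
print(missing_share.round(2).to_dict())
print(int(filled.isna().sum().sum()))
```

Filling with zeros is a simplification that is only reasonable while the share of gaps stays small; for a binary feature such as **Paid** it silently treats an unknown value as "not paid".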

In [7]:
df.columns

Index(['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'Lifetime Post Total Reach',
       'Lifetime Post Total Impressions', 'Lifetime Engaged Users',
       'Lifetime Post Consumers', 'Lifetime Post Consumptions',
       'Lifetime Post Impressions by people who have liked your Page',
       'Lifetime Post reach by people who like your Page',
       'Lifetime People who have liked your Page and engaged with your post',
       'comment', 'like', 'share', 'Total Interactions'],
      dtype='object')

**Print a few rows to look at the actual data.**

In [8]:
df.head(4)

| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | Lifetime Post Total Reach | Lifetime Post Total Impressions | Lifetime Engaged Users | Lifetime Post Consumers | Lifetime Post Consumptions | Lifetime Post Impressions by people who have liked your Page | Lifetime Post reach by people who like your Page | Lifetime People who have liked your Page and engaged with your post | comment | like | share | Total Interactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139441 | Photo | 2 | 12 | 4 | 3 | 0.0 | 2752 | 5091 | 178 | 109 | 159 | 3078 | 1640 | 119 | 4 | 79.0 | 17.0 | 100 |
| 1 | 139441 | Status | 2 | 12 | 3 | 10 | 0.0 | 10460 | 19057 | 1457 | 1361 | 1674 | 11710 | 6112 | 1108 | 5 | 130.0 | 29.0 | 164 |
| 2 | 139441 | Photo | 3 | 12 | 3 | 3 | 0.0 | 2413 | 4373 | 177 | 113 | 154 | 2812 | 1503 | 132 | 0 | 66.0 | 14.0 | 80 |
| 3 | 139441 | Photo | 2 | 12 | 2 | 10 | 1.0 | 50128 | 87991 | 2211 | 790 | 1119 | 61027 | 32048 | 1386 | 58 | 1572.0 | 147.0 | 1777 |


The types of the features **Category** and **Paid** look suspicious. They are probably not numeric but categorical and binary, respectively.
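
If that guess is right, the two columns could be given more fitting dtypes. The sketch below shows one possible way on a made-up toy frame (`toy` and `typed` are illustrative names); the notebook itself keeps the original numeric types.

```python
import pandas as pd

# Toy frame with the two suspicious columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Category": [2, 2, 3, 1],
    "Paid": [0.0, 0.0, 0.0, 1.0],
})

# Category becomes a pandas categorical, Paid becomes a boolean flag.
# Caveat: NaN converts to True under astype(bool), so gaps must be handled first.
typed = toy.assign(
    Category=toy["Category"].astype("category"),
    Paid=toy["Paid"].astype(bool),
)

print(typed.dtypes)
```

With these dtypes, statistics such as the mean stop being computed for **Category** by accident, which makes the "numeric vs. categorical" distinction explicit.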

[Back to contents](#zero-bullet)

>> ## Exploratory data analysis <a class="anchor" id="eda-bullet"></a>

In [9]:
df.columns

Index(['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'Lifetime Post Total Reach',
       'Lifetime Post Total Impressions', 'Lifetime Engaged Users',
       'Lifetime Post Consumers', 'Lifetime Post Consumptions',
       'Lifetime Post Impressions by people who have liked your Page',
       'Lifetime Post reach by people who like your Page',
       'Lifetime People who have liked your Page and engaged with your post',
       'comment', 'like', 'share', 'Total Interactions'],
      dtype='object')

We will perform a basic EDA (Exploratory Data Analysis).

**Introduction:**

The data is probably a slice of various metrics collected in Facebook Analytics for users' posts. However, there is no codebook for the data, so we will rely on logic and on the column (feature) names, which are fairly informative. We will also look up the metric descriptions (not all metrics are named identically, and not all can be identified unambiguously) on the official site, in the [documentation](https://developers.facebook.com/docs/graph-api/reference/v3.2/insights#availmetrics) :)

**Preliminary descriptive analysis:**

_(expands on click, collapses on double click)_

In [10]:
%%time
pandas_profiling.ProfileReport(df)

Wall time: 3.4 s


0,1
Number of variables,19
Number of observations,500
Total Missing (%),0.0%
Total size in memory,74.3 KiB
Average record size in memory,152.2 B

0,1
Numeric,14
Categorical,1
Boolean,0
Date,0
Text (Unique),0
Rejected,4
Unsupported,0

0,1
Distinct count,3
Unique (%),0.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.88
Minimum,1
Maximum,3
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,2
Q3,3
95-th percentile,3
Maximum,3
Range,2
Interquartile range,2

0,1
Standard deviation,0.85267
Coef of variation,0.45355
Kurtosis,-1.5875
Mean,1.88
MAD,0.7568
Skewness,0.23197
Sum,940
Variance,0.72705
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
1,215,43.0%,
3,155,31.0%,
2,130,26.0%,

Value,Count,Frequency (%),Unnamed: 3
1,215,43.0%,
2,130,26.0%,
3,155,31.0%,

Value,Count,Frequency (%),Unnamed: 3
1,215,43.0%,
2,130,26.0%,
3,155,31.0%,

0,1
Distinct count,414
Unique (%),82.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,920.34
Minimum,9
Maximum,11452
Zeros (%),0.0%

0,1
Minimum,9.0
5-th percentile,169.75
Q1,393.75
Median,625.5
Q3,1062.0
95-th percentile,2581.2
Maximum,11452.0
Range,11443.0
Interquartile range,668.25

0,1
Standard deviation,985.02
Coef of variation,1.0703
Kurtosis,34.111
Mean,920.34
MAD,596.86
Skewness,4.5159
Sum,460172
Variance,970260
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
537,4,0.8%,
550,3,0.6%,
564,3,0.6%,
528,3,0.6%,
1141,3,0.6%,
909,3,0.6%,
421,3,0.6%,
541,3,0.6%,
424,3,0.6%,
338,3,0.6%,

Value,Count,Frequency (%),Unnamed: 3
9,1,0.2%,
15,1,0.2%,
17,1,0.2%,
24,1,0.2%,
25,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4840,1,0.2%,
5352,1,0.2%,
6164,1,0.2%,
8072,1,0.2%,
11452,1,0.2%,

0,1
Distinct count,382
Unique (%),76.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,609.99
Minimum,9
Maximum,4376
Zeros (%),0.0%

0,1
Minimum,9.0
5-th percentile,130.75
Q1,291.0
Median,412.0
Q3,656.25
95-th percentile,1834.2
Maximum,4376.0
Range,4367.0
Interquartile range,365.25

0,1
Standard deviation,612.73
Coef of variation,1.0045
Kurtosis,11.348
Mean,609.99
MAD,390.96
Skewness,2.9916
Sum,304993
Variance,375430
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
403,5,1.0%,
340,4,0.8%,
327,4,0.8%,
363,4,0.8%,
248,3,0.6%,
375,3,0.6%,
408,3,0.6%,
328,3,0.6%,
497,3,0.6%,
361,3,0.6%,

Value,Count,Frequency (%),Unnamed: 3
9,1,0.2%,
15,2,0.4%,
17,1,0.2%,
19,1,0.2%,
32,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
3430,1,0.2%,
3798,1,0.2%,
4104,1,0.2%,
4318,1,0.2%,
4376,1,0.2%,

0,1
Correlation,0.96821

0,1
Distinct count,440
Unique (%),88.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1415.1
Minimum,9
Maximum,19779
Zeros (%),0.0%

0,1
Minimum,9.0
5-th percentile,153.95
Q1,509.25
Median,851.0
Q3,1463.0
95-th percentile,4540.5
Maximum,19779.0
Range,19770.0
Interquartile range,953.75

0,1
Standard deviation,2000.6
Coef of variation,1.4137
Kurtosis,31.379
Mean,1415.1
MAD,1086.5
Skewness,4.8176
Sum,707565
Variance,4002400
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
730,3,0.6%,
513,3,0.6%,
889,3,0.6%,
795,3,0.6%,
599,3,0.6%,
431,3,0.6%,
719,3,0.6%,
652,3,0.6%,
966,2,0.4%,
247,2,0.4%,

Value,Count,Frequency (%),Unnamed: 3
9,1,0.2%,
19,1,0.2%,
20,1,0.2%,
26,1,0.2%,
31,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
11064,1,0.2%,
12074,1,0.2%,
14974,1,0.2%,
18115,1,0.2%,
19779,1,0.2%,

0,1
Distinct count,491
Unique (%),98.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,16766
Minimum,567
Maximum,1107833
Zeros (%),0.0%

0,1
Minimum,567.0
5-th percentile,1603.0
Q1,3969.8
Median,6255.5
Q3,14860.0
95-th percentile,48584.0
Maximum,1107833.0
Range,1107266.0
Interquartile range,10891.0

0,1
Standard deviation,59791
Coef of variation,3.5661
Kurtosis,247.44
Mean,16766
MAD,16690
Skewness,14.723
Sum,8383188
Variance,3575000000
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
2541,2,0.4%,
5732,2,0.4%,
2888,2,0.4%,
1284,2,0.4%,
4935,2,0.4%,
3675,2,0.4%,
4911,2,0.4%,
1210,2,0.4%,
5010,2,0.4%,
4664,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
567,1,0.2%,
721,1,0.2%,
723,1,0.2%,
935,1,0.2%,
943,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
122474,1,0.2%,
160270,1,0.2%,
184270,1,0.2%,
648611,1,0.2%,
1107833,1,0.2%,

0,1
Distinct count,494
Unique (%),98.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,29586
Minimum,570
Maximum,1110282
Zeros (%),0.0%

0,1
Minimum,570.0
5-th percentile,2451.7
Q1,5694.8
Median,9051.0
Q3,22086.0
95-th percentile,110240.0
Maximum,1110282.0
Range,1109712.0
Interquartile range,16391.0

0,1
Standard deviation,76803
Coef of variation,2.5959
Kurtosis,94.002
Mean,29586
MAD,32300
Skewness,8.351
Sum,14792974
Variance,5898700000
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
6503,2,0.4%,
12735,2,0.4%,
7004,2,0.4%,
4372,2,0.4%,
8533,2,0.4%,
8745,2,0.4%,
6476,1,0.2%,
4004,1,0.2%,
55633,1,0.2%,
8891,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
570,1,0.2%,
726,1,0.2%,
746,1,0.2%,
1029,1,0.2%,
1071,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
453213,1,0.2%,
457509,1,0.2%,
497910,1,0.2%,
665792,1,0.2%,
1110282,1,0.2%,

0,1
Distinct count,485
Unique (%),97.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,13903
Minimum,238
Maximum,180480
Zeros (%),0.0%

0,1
Minimum,238.0
5-th percentile,1326.6
Q1,3315.0
Median,5281.0
Q3,13168.0
95-th percentile,54319.0
Maximum,180480.0
Range,180242.0
Interquartile range,9853.0

0,1
Standard deviation,22741
Coef of variation,1.6356
Kurtosis,16.8
Mean,13903
MAD,13702
Skewness,3.6792
Sum,6951680
Variance,517140000
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
3322,2,0.4%,
13544,2,0.4%,
677,2,0.4%,
3754,2,0.4%,
9528,2,0.4%,
6692,2,0.4%,
2938,2,0.4%,
5280,2,0.4%,
2232,2,0.4%,
32208,2,0.4%,

Value,Count,Frequency (%),Unnamed: 3
238,1,0.2%,
391,1,0.2%,
452,1,0.2%,
584,1,0.2%,
617,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
128064,1,0.2%,
139008,1,0.2%,
153536,1,0.2%,
158208,1,0.2%,
180480,1,0.2%,

0,1
Distinct count,469
Unique (%),93.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6585.5
Minimum,236
Maximum,51456
Zeros (%),0.0%

0,1
Minimum,236.0
5-th percentile,912.75
Q1,2181.5
Median,3417.0
Q3,7989.0
95-th percentile,22847.0
Maximum,51456.0
Range,51220.0
Interquartile range,5807.5

0,1
Standard deviation,7682
Coef of variation,1.1665
Kurtosis,8.1618
Mean,6585.5
MAD,5267.8
Skewness,2.609
Sum,3292744
Variance,59013000
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
2044,2,0.4%,
2426,2,0.4%,
2660,2,0.4%,
2174,2,0.4%,
3768,2,0.4%,
2604,2,0.4%,
690,2,0.4%,
2388,2,0.4%,
5348,2,0.4%,
2124,2,0.4%,

Value,Count,Frequency (%),Unnamed: 3
236,1,0.2%,
380,1,0.2%,
450,1,0.2%,
511,1,0.2%,
516,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
38720,1,0.2%,
39776,1,0.2%,
47488,1,0.2%,
48368,1,0.2%,
51456,1,0.2%,

0,1
Distinct count,90
Unique (%),18.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,123190
Minimum,81370
Maximum,139441
Zeros (%),0.0%

0,1
Minimum,81370
5-th percentile,90804
Q1,112680
Median,129600
Q3,136390
95-th percentile,138900
Maximum,139441
Range,58071
Interquartile range,23717

0,1
Standard deviation,16273
Coef of variation,0.13209
Kurtosis,-0.2666
Mean,123190
MAD,13641
Skewness,-0.98245
Sum,61597088
Variance,264800000
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
136393,18,3.6%,
124940,17,3.4%,
129600,15,3.0%,
139441,14,2.8%,
138895,14,2.8%,
109670,13,2.6%,
107907,13,2.6%,
137177,12,2.4%,
100732,12,2.4%,
117764,11,2.2%,

Value,Count,Frequency (%),Unnamed: 3
81370,4,0.8%,
85093,3,0.6%,
85979,7,1.4%,
86491,5,1.0%,
86909,6,1.2%,

Value,Count,Frequency (%),Unnamed: 3
138353,9,1.8%,
138414,11,2.2%,
138458,3,0.6%,
138895,14,2.8%,
139441,14,2.8%,

0,1
Distinct count,3
Unique (%),0.6%
Missing (%),0.2%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.27856
Minimum,0
Maximum,1
Zeros (%),72.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,1
Maximum,1
Range,1
Interquartile range,1

0,1
Standard deviation,0.44874
Coef of variation,1.6109
Kurtosis,-1.0222
Mean,0.27856
MAD,0.40193
Skewness,0.99093
Sum,139
Variance,0.20137
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,360,72.0%,
1.0,139,27.8%,
(Missing),1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,360,72.0%,
1.0,139,27.8%,

Value,Count,Frequency (%),Unnamed: 3
0.0,360,72.0%,
1.0,139,27.8%,

0,1
Distinct count,22
Unique (%),4.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.84
Minimum,1
Maximum,23
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,3
Median,9
Q3,11
95-th percentile,14
Maximum,23
Range,22
Interquartile range,8

0,1
Standard deviation,4.3686
Coef of variation,0.55722
Kurtosis,-0.82136
Mean,7.84
MAD,3.9
Skewness,0.21385
Sum,3920
Variance,19.085
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
3,105,21.0%,
10,78,15.6%,
13,52,10.4%,
11,44,8.8%,
2,39,7.8%,
4,35,7.0%,
9,30,6.0%,
12,29,5.8%,
6,16,3.2%,
5,13,2.6%,

Value,Count,Frequency (%),Unnamed: 3
1,4,0.8%,
2,39,7.8%,
3,105,21.0%,
4,35,7.0%,
5,13,2.6%,

Value,Count,Frequency (%),Unnamed: 3
18,3,0.6%,
19,1,0.2%,
20,1,0.2%,
22,1,0.2%,
23,1,0.2%,

0,1
Correlation,0.94119

0,1
Distinct count,7
Unique (%),1.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.15
Minimum,1
Maximum,7
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,6
95-th percentile,7
Maximum,7
Range,6
Interquartile range,4

0,1
Standard deviation,2.0307
Coef of variation,0.48933
Kurtosis,-1.2754
Mean,4.15
MAD,1.762
Skewness,-0.10252
Sum,2075
Variance,4.1237
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
7,82,16.4%,
6,81,16.2%,
4,72,14.4%,
1,68,13.6%,
5,67,13.4%,
2,66,13.2%,
3,64,12.8%,

Value,Count,Frequency (%),Unnamed: 3
1,68,13.6%,
2,66,13.2%,
3,64,12.8%,
4,72,14.4%,
5,67,13.4%,

Value,Count,Frequency (%),Unnamed: 3
3,64,12.8%,
4,72,14.4%,
5,67,13.4%,
6,81,16.2%,
7,82,16.4%,

0,1
Correlation,0.92856

0,1
Distinct count,4
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0

0,1
Photo,426
Status,45
Link,22

Value,Count,Frequency (%),Unnamed: 3
Photo,426,85.2%,
Status,45,9.0%,
Link,22,4.4%,
Video,7,1.4%,

0,1
Distinct count,46
Unique (%),9.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.482
Minimum,0
Maximum,372
Zeros (%),21.2%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,1.0
Median,3.0
Q3,7.0
95-th percentile,25.05
Maximum,372.0
Range,372.0
Interquartile range,6.0

0,1
Standard deviation,21.181
Coef of variation,2.8309
Kurtosis,183.44
Mean,7.482
MAD,7.9863
Skewness,11.768
Sum,3741
Variance,448.63
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
0,106,21.2%,
2,71,14.2%,
1,62,12.4%,
4,44,8.8%,
3,36,7.2%,
6,26,5.2%,
5,20,4.0%,
7,20,4.0%,
9,15,3.0%,
10,11,2.2%,

Value,Count,Frequency (%),Unnamed: 3
0,106,21.2%,
1,62,12.4%,
2,71,14.2%,
3,36,7.2%,
4,44,8.8%,

Value,Count,Frequency (%),Unnamed: 3
64,1,0.2%,
103,1,0.2%,
144,1,0.2%,
146,1,0.2%,
372,1,0.2%,

0,1
Distinct count,258
Unique (%),51.6%
Missing (%),0.2%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,177.95
Minimum,0
Maximum,5172
Zeros (%),1.0%

0,1
Minimum,0.0
5-th percentile,7.0
Q1,56.5
Median,101.0
Q3,187.5
95-th percentile,534.1
Maximum,5172.0
Range,5172.0
Interquartile range,131.0

0,1
Standard deviation,323.4
Coef of variation,1.8174
Kurtosis,119.18
Mean,177.95
MAD,143.27
Skewness,8.9553
Sum,88795
Variance,104590
Memory size,4.0 KiB

Value,Count,Frequency (%),Unnamed: 3
98.0,7,1.4%,
79.0,6,1.2%,
72.0,6,1.2%,
148.0,6,1.2%,
7.0,6,1.2%,
53.0,6,1.2%,
101.0,5,1.0%,
74.0,5,1.0%,
48.0,5,1.0%,
56.0,5,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,5,1.0%,
1.0,1,0.2%,
2.0,3,0.6%,
3.0,3,0.6%,
4.0,4,0.8%,

Value,Count,Frequency (%),Unnamed: 3
1572.0,1,0.2%,
1622.0,1,0.2%,
1639.0,1,0.2%,
1998.0,1,0.2%,
5172.0,1,0.2%,

0,1
Correlation,0.90403

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,58,1572.0,147.0,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,19,325.0,49.0,393


**EDA interpretation:**

_Note:_ we will call columns **features** and rows **objects**.

**Some features contain a high share of zero values:**
  * Paid
  * comment

**Let us try to give the zeros a meaningful interpretation:**
* **Paid**. A binary feature. It probably indicates one of two outcomes: whether the user came on their own or was acquired through some paid channel, i.e. "did we explicitly pay for the user".
* **comment**. A numeric feature. It probably shows the number of comments under the object. Both on the internet and in real life, "users" do not comment on everything.


**Some features are highly correlated:** 
* Lifetime Post Consumers and Lifetime Engaged Users;
* Post Month and Page total likes;
* Total Interactions and share;
* share and like.

**Let us try to give the correlations a meaningful explanation:**
* Lifetime Post Consumers and Lifetime Engaged Users. Numeric features. The documentation _(and the marketing pages on the first page of search results)_ gives very similar descriptions for these two features: clicks on objects;
* Post Month and Page total likes;
* Total Interactions and share. Numeric features. They probably show the number of user interactions with the object. It seems logical that interactions are made up of "share" actions as well as likes and comments.
* share and like. Numeric features. They probably show user actions. It seems a logical and interesting pattern that a person who found an object worth sharing will also like it.
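
Highly correlated pairs such as the ones listed above can also be found directly with pandas. The sketch below is an illustration on a small made-up frame; the cut-off `CORR_THRESHOLD = 0.9` is an assumption chosen for the example, not a value taken from the profiling report.

```python
import pandas as pd

# Small toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "like": [79, 130, 66, 1572, 325],
    "share": [17, 29, 14, 147, 49],
    "Post Hour": [3, 10, 3, 10, 3],
})

CORR_THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

# Pearson correlation matrix (the pandas default).
corr = toy.corr()

# Walk the upper triangle so that each pair is reported once.
pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > CORR_THRESHOLD
]

for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Pearson correlation only captures linear dependence and is sensitive to outliers, so on heavily skewed metrics a rank correlation (`toy.corr(method="spearman")`) may be a useful cross-check.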

[Back to contents](#zero-bullet)

>> ## Computing and printing descriptive statistics <a class="anchor" id="descriptives-bullet"></a>

**Compute, for the numeric features:**
    
    a) over the whole data set
    b) by category (the **Type** feature)

**the following statistics:**
* the mean ($\mu$ or $E$, the arithmetic mean); 
* the maximum/minimum ($max$, $min$; the $100\%$ and $0\%$ quantiles, i.e. the $4$th and $0$th quartiles); 
* the median and the mode (the $50\%$ quantile or $2$nd quartile; the most frequent, i.e. most typical, value) 

**Also**, find the most popular object in the sample and explain why.
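
For reference, the listed statistics can be computed for a single column with plain pandas. The helper `summarize` below is a hypothetical illustration and is not the `print_stats` function imported later in the notebook, whose source is not shown here.

```python
import pandas as pd


def summarize(series: pd.Series) -> dict:
    """Return mean, max, min, median and mode(s) of a numeric Series."""
    return {
        "mean": float(series.mean()),
        "max": float(series.max()),
        "min": float(series.min()),
        "median": float(series.median()),
        # Series.mode() returns every value tied for the highest frequency.
        "mode": series.mode().tolist(),
    }


# Toy series; the values are made up for illustration.
toy = pd.Series([1, 1, 2, 3, 3, 3, 10], name="toy")
summary = summarize(toy)
print(summary)
```

Returning all tied modes as a list avoids silently picking one of them, which matters for near-unique columns where every value occurs once.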

**Print the statistics for the whole data set:** <a class="anchor" id="descriptives_all-bullet"></a>

_(expands on click, collapses on double click)_

In [11]:
from print_stats import print_stats

In [12]:
%%time
for col in tqdm_notebook(df.columns.difference(['Type'])):
    print(print_stats(df[col].fillna(0)))
del(print_stats)

For feature: Category
	Mean: 1.88.
	Maximum / Minimum: 3.00 / 1.00.
	Median: 2.00.
	Mode: {1}.

For feature: Lifetime Engaged Users
	Mean: 920.34.
	Maximum / Minimum: 11452.00 / 9.00.
	Median: 625.50.
	Mode: {537}.

For feature: Lifetime People who have liked your Page and engaged with your post
	Mean: 609.99.
	Maximum / Minimum: 4376.00 / 9.00.
	Median: 412.00.
	Mode: {403}.

For feature: Lifetime Post Consumers
	Mean: 798.77.
	Maximum / Minimum: 11328.00 / 9.00.
	Median: 551.50.
	Mode: {182}.

For feature: Lifetime Post Consumptions
	Mean: 1415.13.
	Maximum / Minimum: 19779.00 / 9.00.
	Median: 851.00.
	Mode: {431}.

For feature: Lifetime Post Impressions by people who have liked your Page
	Mean: 16766.38.
	Maximum / Minimum: 1107833.00 / 567.00.
	Median: 6255.50.
	Mode: {1210}.

For feature: Li

**By category**

The data contains $4$ unique categories:
* Photo; 
* Status; 
* Link; 
* Video

In [13]:
df.Type.unique()

array(['Photo', 'Status', 'Link', 'Video'], dtype=object)

**Print the statistics for the features by category:** <a class="anchor" id="descriptives_type-bullet"></a>

_(expands on click, collapses on double click)_

In [14]:
%%time
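for col in df.columns.difference(['Type']):
    print(f'For feature: {col}:')
    # Series.mode() is used instead of stats.mode(x)[0][0]: newer SciPy releases
    # return a scalar for 1-D input, so the double indexing fails there.
    # .iloc[0] takes the smallest of the tied modes and ignores NaN.
    df_aux = df.groupby('Type')[col].agg(['mean', 'max', 'min', 'median', lambda x: x.mode().iloc[0]])
    df_aux.columns = ['mean', 'max', 'min', 'median', 'mode']
    display(df_aux.round(2))
    print('\n')

del(df_aux)
del(col)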

For feature: Category:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 1.14 | 3 | 1 | 1 | 1 |
| Photo | 1.92 | 3 | 1 | 2 | 1 |
| Status | 2.02 | 3 | 1 | 2 | 2 |
| Video | 1.0 | 1 | 1 | 1 | 1 |




For feature: Lifetime Engaged Users:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 342.82 | 1374 | 24 | 244.0 | 66 |
| Photo | 818.95 | 11452 | 9 | 605.5 | 537 |
| Status | 2040.22 | 6164 | 128 | 1701.0 | 128 |
| Video | 1707.0 | 3872 | 459 | 1779.0 | 459 |




For feature: Lifetime People who have liked your Page and engaged with your post:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 210.55 | 788 | 19 | 161.5 | 19 |
| Photo | 507.31 | 3430 | 9 | 403.0 | 403 |
| Status | 1719.84 | 4376 | 101 | 1604.0 | 101 |
| Video | 979.43 | 2218 | 363 | 885.0 | 363 |




For feature: Lifetime Post Consumers:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 292.68 | 1106 | 23 | 205.0 | 322 |
| Photo | 690.43 | 11328 | 9 | 528.5 | 298 |
| Status | 1949.56 | 5934 | 86 | 1599.0 | 86 |
| Video | 1584.71 | 3822 | 411 | 1643.0 | 411 |




For feature: Lifetime Post Consumptions:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 374.09 | 1345 | 26 | 290 | 26 |
| Photo | 1299.03 | 19779 | 9 | 827 | 431 |
| Status | 2838.87 | 9237 | 112 | 2201 | 1692 |
| Video | 2600.14 | 7327 | 539 | 2331 | 539 |




For feature: Lifetime Post Impressions by people who have liked your Page:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 11148.59 | 42338 | 2307 | 9034.5 | 2307 |
| Photo | 16422.48 | 1107833 | 567 | 5498.0 | 1210 |
| Status | 18664.27 | 37849 | 5009 | 17502.0 | 5009 |
| Video | 43149.86 | 107502 | 21436 | 30131.0 | 21436 |




For feature: Lifetime Post Total Impressions:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 28725.45 | 229733 | 3094 | 9773.0 | 3094 |
| Photo | 28994.5 | 1110282 | 570 | 8118.5 | 4372 |
| Status | 24244.47 | 59964 | 7509 | 20849.0 | 7509 |
| Video | 102622.43 | 277100 | 30235 | 56950.0 | 30235 |




For feature: Lifetime Post Total Reach:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 18544.59 | 70912 | 1536 | 7422 | 1536 |
| Photo | 13137.81 | 180480 | 238 | 4675 | 677 |
| Status | 13078.89 | 31136 | 3930 | 11096 | 3930 |
| Video | 51205.71 | 139008 | 13544 | 30624 | 13544 |




For feature: Lifetime Post reach by people who like your Page:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 6544.36 | 27232 | 1180 | 5039 | 1180 |
| Photo | 6059.1 | 51456 | 236 | 3110 | 690 |
| Status | 9908.58 | 21352 | 2410 | 8980 | 2410 |
| Video | 17386.29 | 38720 | 9568 | 14112 | 9568 |




For feature: Page total likes:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 116363.18 | 138895 | 85979 | 115396 | 138353 |
| Photo | 122354.17 | 139441 | 81370 | 128032 | 124940 |
| Status | 132647.04 | 139441 | 104070 | 135713 | 139441 |
| Video | 135014.86 | 138895 | 126424 | 137893 | 137893 |




For feature: Paid:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 0.27 | 1.0 | 0.0 | 0.0 | 0.0 |
| Photo | 0.28 | 1.0 | 0.0 | 0.0 | 0.0 |
| Status | 0.22 | 1.0 | 0.0 | 0.0 | 0.0 |
| Video | 0.57 | 1.0 | 0.0 | 1.0 | 1.0 |




For feature: Post Hour:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 5.73 | 14 | 2 | 4 | 3 |
| Photo | 8.0 | 23 | 1 | 9 | 3 |
| Status | 7.24 | 15 | 2 | 9 | 10 |
| Video | 8.71 | 13 | 2 | 11 | 11 |




For feature: Post Month:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 6.59 | 12 | 1 | 4.5 | 12 |
| Photo | 6.81 | 12 | 1 | 7.0 | 10 |
| Status | 9.07 | 12 | 3 | 10.0 | 12 |
| Video | 9.57 | 12 | 6 | 11.0 | 11 |




For feature: Post Weekday:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 4.27 | 7 | 1 | 4.5 | 6 |
| Photo | 4.11 | 7 | 1 | 4.0 | 7 |
| Status | 4.58 | 7 | 1 | 5.0 | 5 |
| Video | 3.57 | 6 | 2 | 3.0 | 2 |




For feature: Total Interactions:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 89.05 | 420 | 6 | 52.5 | 14 |
| Photo | 216.58 | 6334 | 0 | 122.0 | 0 |
| Status | 217.04 | 1009 | 17 | 186.0 | 117 |
| Video | 295.86 | 550 | 81 | 271.0 | 81 |




For feature: comment:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 2.82 | 15 | 0 | 1.5 | 0 |
| Photo | 7.49 | 372 | 0 | 3.0 | 0 |
| Status | 8.91 | 60 | 0 | 4.0 | 2 |
| Video | 12.29 | 23 | 2 | 17.0 | 2 |




For feature: like:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 73.32 | 379.0 | 5.0 | 37.0 | 12.0 |
| Photo | 182.61 | 5172.0 | 0.0 | 100.0 | 7.0 |
| Status | 176.71 | 859.0 | 13.0 | 150.0 | 86.0 |
| Video | 231.43 | 449.0 | 65.0 | 204.0 | 65.0 |




For feature: share:

| Type | mean | max | min | median | mode |
|---|---|---|---|---|---|
| Link | 12.91 | 44.0 | 0.0 | 10.5 | 2.0 |
| Photo | 27.16 | 790.0 | 0.0 | 19.0 | 14.0 |
| Status | 31.42 | 123.0 | 1.0 | 28.0 | 1.0 |
| Video | 52.14 | 121.0 | 13.0 | 44.0 | 13.0 |




Wall time: 332 ms


**Let us identify the most popular object in the sample** <a class="anchor" id="popularest-bullet"></a>

First, let us define popularity. 
> The very idea of popularity is quite ephemeral; it depends on the requirements and wishes that determine through which lens and from which angle we look at, measure, and define popularity. History knows cases where figures popular with some were not merely hated by others but entirely unknown to them. 

Suppose that, with no outside requirements, we need to find some ["most popular object"](https://en.wikipedia.org/wiki/Spherical_cow) that does not depend on our target audience, and that has already attracted the largest number of users or will keep attracting them. Given our data, we can look for an object that meets the following conditions:
* a _maximal desired_ number of engaged users over its lifetime;
* a _non-maximal desired_ number of engaged "own" users (the regular audience) over its lifetime;
* a _maximal desired_ number of likes, shares, and comments (Total Interactions);
* in an ideal world, perhaps also not paid for. :)

**Let us formalize the above:**
* A _maximal desired_ value of a metric is one that exceeds a given quantile of that metric over all objects. In this example we take the $0.85$ quantile. A _non-maximal desired_ value is one that lies below a given quantile over all objects. We again take $0.85$.
* The number of engaged users is measured by **Lifetime Engaged Users** (The number of people who engaged with your Page. Engagement includes any click).
* Independence from our target audience is measured by **Lifetime People who have liked your Page and engaged with your post**. 
* The number of likes, shares, and comments is measured by **Total Interactions** ($N_{\text{comment}} + N_{\text{like}} + N_{\text{share}}$)
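
To make the quantile rule concrete, here is a minimal sketch on made-up data. The column names `engaged`, `interactions` and `fans_engaged` and the helpers `above` / `below` are hypothetical stand-ins for the real features and are used only for illustration.

```python
import pandas as pd

QUANTILE = 0.85  # the same threshold value as in the text

# Toy data: 20 objects with made-up metric values.
toy = pd.DataFrame({
    "engaged": range(1, 21),                    # 1..20
    "interactions": range(10, 210, 10),         # 10..200
    "fans_engaged": list(range(2, 21)) + [1],   # the last object has few "own" users
})


def above(col: str) -> pd.Series:
    """Mask of rows strictly above the chosen quantile of the column."""
    return toy[col] > toy[col].quantile(QUANTILE)


def below(col: str) -> pd.Series:
    """Mask of rows strictly below the chosen quantile of the column."""
    return toy[col] < toy[col].quantile(QUANTILE)


# High engagement, high interactions, but not dominated by the regular audience.
mask = above("engaged") & above("interactions") & below("fans_engaged")
candidates = toy[mask]
print(candidates)
```

The three conditions are combined with a logical AND, so raising the threshold shrinks the candidate set quickly; with a small sample it can easily become empty.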


In [16]:
THRESHOLD_upper = 0.85
THRESHOLD_lower = 0.85

def cond_slise(col, pos=True, THRESHOLD_upper=THRESHOLD_upper, THRESHOLD_lower=THRESHOLD_lower):
    """Boolean mask over df rows: above the upper quantile if pos, else below the lower one."""
    if pos:
        return (df[col] > df[col].quantile(THRESHOLD_upper))
    else:
        return (df[col] < df[col].quantile(THRESHOLD_lower))

_(expands on click, collapses on double click)_

In [15]:
df_aux = df[(cond_slise('Lifetime Engaged Users')) 
   & (cond_slise('Total Interactions')) 
   & (cond_slise('Lifetime People who have liked your Page and engaged with your post', False))]
print(f'Any of the {len(df_aux)} objects can be called the most popular under the current measurement scheme')
display(df_aux)
del(df_aux)

Any of the 5 objects can be called the most popular under the current measurement scheme


| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | Lifetime Post Total Reach | Lifetime Post Total Impressions | Lifetime Engaged Users | Lifetime Post Consumers | Lifetime Post Consumptions | Lifetime Post Impressions by people who have liked your Page | Lifetime Post reach by people who like your Page | Lifetime People who have liked your Page and engaged with your post | comment | like | share | Total Interactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67 | 138185 | Photo | 1 | 11 | 7 | 3 | 1.0 | 53456 | 93790 | 1576 | 995 | 1469 | 32646 | 14912 | 884 | 20 | 697.0 | 70.0 | 787 |
| 71 | 137893 | Video | 1 | 11 | 5 | 3 | 1.0 | 100768 | 220447 | 2101 | 1735 | 2331 | 59658 | 18880 | 885 | 17 | 449.0 | 84.0 | 550 |
| 254 | 129600 | Photo | 3 | 7 | 5 | 3 | 0.0 | 54256 | 82011 | 1620 | 963 | 1419 | 42128 | 24224 | 977 | 10 | 755.0 | 58.0 | 823 |
| 270 | 128032 | Photo | 2 | 7 | 4 | 5 | 1.0 | 53056 | 65260 | 2003 | 1412 | 2089 | 23679 | 17104 | 975 | 6 | 696.0 | 28.0 | 730 |
| 480 | 86909 | Photo | 2 | 1 | 4 | 11 | 0.0 | 11484 | 20696 | 1762 | 1635 | 2741 | 8774 | 5124 | 722 | 56 | 360.0 | 99.0 | 515 |


[Back to contents](#zero-bullet)