## Data-Driven Product Development at Pinkoi: Lessons Learned

<br/>
<br/>
<br/>

**廖尹禎**

a.k.a **Dboy Liao**

## About Me

- Data Scientist @ [Pinkoi](https://www.pinkoi.com/)
- Mainly responsible for the recommender system
- A Python code monkey (who would love to jump ship to Julia)
- Works on the [code generator](https://github.com/uTensor/utensor_cgen) for [uTensor](https://github.com/uTensor) in spare time

## Data-Driven Product Development?

1. `Collect Data`
2. `ETL/EDA`
3. `Statistics/Machine Learning/Deep Learning(?)`
4. `Production`

It's better to start with an example.

## A Highly Enlightening Example

<center>
    <img src='imgs/hypo.png' width='70%'/>
</center>

`python generate_data.py -h`

![gen-data](imgs/gen_data.png)

In [1]:
import pandas as pd

In [2]:
raw_data = pd.read_csv('family.csv')
raw_data.head()

|   | parent   | child    |
|---|----------|----------|
| 0 | ZbE2yu1Y | NaN      |
| 1 | gCGxZU5T | YgTV3iZy |
| 2 | gCGxZU5T | jBECVH1w |
| 3 | 5Mfc7OX9 | Sv3axtli |
| 4 | 5Mfc7OX9 | ysk3fo0L |


## Simple Logistic Regression

<br/>
<br/>

$$
\mathrm{Prob}(\mathrm{HasGrandchild} \mid \mathrm{HasChild}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot D_{\mathrm{HasChild}})}}
$$

Is $\beta_1$ significantly different from zero?

In [3]:
# Reshape the data into a form suitable for logistic regression
data_train = (
    raw_data.join(
        (
            raw_data.rename(columns={"parent": "child", "child": "grand_child"})
            .copy()
            .set_index("child")
        ),
        on="child",
        how="left",
    )[["child", "grand_child"]]
    .notna()
    .astype(int)
    .rename(columns={"child": "has_child", "grand_child": "has_grand_child"})
)

In [4]:
data_train.head(10)

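|   | has_child | has_grand_child |
|---|-----------|-----------------|
| 0 | 0         | 0               |
| 1 | 1         | 1               |
| 1 | 1         | 1               |
| 2 | 1         | 1               |
| 3 | 1         | 0               |
| 4 | 1         | 1               |
| 5 | 1         | 1               |
| 5 | 1         | 1               |
| 6 | 1         | 0               |
| 7 | 1         | 1               |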
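
To make the reshaping step easier to follow, here is a minimal sketch of the same self-join on a hand-made table. The toy rows and the names `toy`, `next_generation` and `toy_train` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy family table (not the real family.csv):
# "a" has children "b" and "c"; "b" has child "d"; "e" is childless.
toy = pd.DataFrame(
    {
        "parent": ["a", "a", "b", "e"],
        "child": ["b", "c", "d", None],
    }
)

# Re-label the same table one generation down:
# whoever is a "parent" here becomes the "child" key, and their child is a grandchild.
next_generation = toy.rename(
    columns={"parent": "child", "child": "grand_child"}
).set_index("child")

# Left self-join on "child", then turn "is a value present?" into 0/1 flags.
toy_train = (
    toy.join(next_generation, on="child", how="left")[["child", "grand_child"]]
    .notna()
    .astype(int)
    .rename(columns={"child": "has_child", "grand_child": "has_grand_child"})
)
print(toy_train)
```

Only the first row ends up with `has_grand_child == 1`, because `"b"` is the only child that appears again as a parent. A row can never have `has_child == 0` and `has_grand_child == 1`.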


In [5]:
from statsmodels.formula.api import logit

In [6]:
model = logit("has_grand_child ~ has_child", data_train).fit(method="bfgs")

Optimization terminated successfully.
         Current function value: 0.480177
         Iterations: 25
         Function evaluations: 26
         Gradient evaluations: 26


In [7]:
print(model.summary2())

                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.197      
Dependent Variable: has_grand_child  AIC:              335040.4569
Date:               2019-05-17 14:57 BIC:              335061.9818
No. Observations:   348868           Log-Likelihood:   -1.6752e+05
Df Model:           1                LL-Null:          -2.0859e+05
Df Residuals:       348866           LLR p-value:      0.0000     
Converged:          1.0000           Scale:            1.0000     
-------------------------------------------------------------------
              Coef.    Std.Err.     z     P>|z|    [0.025    0.975]
-------------------------------------------------------------------
Intercept    -14.4293    4.2993  -3.3562  0.0008  -22.8557  -6.0028
has_child     14.0233    4.2993   3.2618  0.0011    5.5968  22.4498



<center>
    <img alt=batman-balance src=imgs/batman_balance.jpg />
</center>

In [8]:
data_train.describe()

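|       | has_child | has_grand_child |
|-------|-----------|-----------------|
| count | 348868.0  | 348868.0        |
| mean  | 0.713525  | 0.285326        |
| std   | 0.452115  | 0.45157         |
| min   | 0.0       | 0.0             |
| 25%   | 0.0       | 0.0             |
| 50%   | 1.0       | 0.0             |
| 75%   | 1.0       | 1.0             |
| max   | 1.0       | 1.0             |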


In [9]:
y_pred = model.predict(data_train.has_child) > 0.5

In [10]:
y_pred.any()

False
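
One way to see why no prediction crosses 0.5 is to plug the fitted coefficients from the summary table back into the logistic formula. The sketch below does that with the standard library only; the helper name `sigmoid` and the variable names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the Logit summary of the unbalanced model.
intercept, beta_has_child = -14.4293, 14.0233

# Predicted probability of having a grandchild for each value of has_child.
p_without_child = sigmoid(intercept)
p_with_child = sigmoid(intercept + beta_has_child)

print(f"has_child=0: {p_without_child:.2e}")
print(f"has_child=1: {p_with_child:.4f}")
```

With a single binary feature the model can only output two probabilities. Here both stay below 0.5 (the one for `has_child=1` is roughly 0.40), so thresholding at 0.5 predicts class 0 for every row.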

In [11]:
from sklearn.metrics import classification_report

print(classification_report(data_train.has_grand_child.values, y_pred.values.astype('int')))

              precision    recall  f1-score   support

           0       0.71      1.00      0.83    249327
           1       0.00      0.00      0.00     99541

   micro avg       0.71      0.71      0.71    348868
   macro avg       0.36      0.50      0.42    348868
weighted avg       0.51      0.71      0.60    348868





In [12]:
from analysis_balanced import model as model_balanced

Optimization terminated successfully.
         Current function value: 0.528835
         Iterations: 27
         Function evaluations: 28
         Gradient evaluations: 28


In [13]:
y_pred_balanced = model_balanced.predict(data_train.has_child) > 0.5
y_pred_balanced.any()

True
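
The `analysis_balanced` module is not shown in these slides. Its 498,654 observations equal twice the 249,327 negative rows, which is consistent with oversampling the positive class up to the size of the negative class, but that is an inference. Under that assumption, a minimal oversampling sketch could look like this (the function name `oversample_minority` and the toy data are hypothetical):

```python
import pandas as pd

def oversample_minority(df, label_col, seed=0):
    """Resample each smaller class with replacement up to the largest class size."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, n in counts.items():
        rows = df[df[label_col] == label]
        if n < target:
            # Draw with replacement so the minority class reaches the target size.
            rows = rows.sample(n=target, replace=True, random_state=seed)
        parts.append(rows)
    return pd.concat(parts, ignore_index=True)

# Hypothetical toy data: 7 negatives and 3 positives.
toy = pd.DataFrame(
    {
        "has_child": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
        "has_grand_child": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    }
)
balanced = oversample_minority(toy, "has_grand_child")
print(balanced["has_grand_child"].value_counts())
```

The balanced frame could then be passed to `logit(...)` in the same way as `data_train`.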

In [14]:
print(classification_report(data_train.has_grand_child.values, y_pred_balanced.values.astype('int')))

              precision    recall  f1-score   support

           0       1.00      0.40      0.57    249327
           1       0.40      1.00      0.57     99541

   micro avg       0.57      0.57      0.57    348868
   macro avg       0.70      0.70      0.57    348868
weighted avg       0.83      0.57      0.57    348868



In [15]:
print(model_balanced.summary2())

                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.237      
Dependent Variable: has_grand_child  AIC:              527415.3787
Date:               2019-05-17 14:58 BIC:              527437.6181
No. Observations:   498654           Log-Likelihood:   -2.6371e+05
Df Model:           1                LL-Null:          -3.4564e+05
Df Residuals:       498652           LLR p-value:      0.0000     
Converged:          1.0000           Scale:            1.0000     
-------------------------------------------------------------------
              Coef.    Std.Err.     z     P>|z|    [0.025    0.975]
-------------------------------------------------------------------
Intercept    -16.1405   10.1155  -1.5956  0.1106  -35.9664   3.6855
has_child     16.6527   10.1155   1.6463  0.0997   -3.1732  36.4786
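
The `z` and `P>|z|` columns can be reproduced from `Coef.` and `Std.Err.` alone. The sketch below assumes the usual two-sided Wald test against a standard normal distribution; the helper name `two_sided_p_value` is illustrative.

```python
import math

def two_sided_p_value(coef, std_err):
    """Wald z statistic and its two-sided p-value under a standard normal."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# has_child row of the balanced model's summary table.
z, p = two_sided_p_value(16.6527, 10.1155)
print(f"z = {z:.4f}, p = {p:.4f}")
```

The resulting p-value sits just under 0.10 and well above 0.05, so whether $\beta_1$ counts as "significant" depends entirely on the threshold chosen.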



<center>
    <img src='imgs/hypo.png' width='70%'/>
    <br/>
    <br/>
    <div style='font-size: 20px;'>
        <strong>Conclusion:</strong> At the 10% significance level, having a child has a significant effect on having a grandchild
    </div>
</center>

<center>
    <img alt=say-ds src=imgs/say_ds.jpg />
</center>

<center>
    <img src=imgs/data_science.png alt=data-science />
</center>

<center>
    <img alt='data-science-2' src=imgs/data_science2.png />
</center>

1. `Collect Data`
2. `ETL`
3. `Statistics/Machine Learning/Deep Learning(?)`
4. `Production`

<br/>
`1+2+3+4 = Domain Knowledge`

<font size=32px>Science Is For <strong>Understanding</strong></font>

- [SHAP](https://github.com/slundberg/shap)

<a href=https://github.com/slundberg/shap>
    <img alt=shap-diagram src=https://raw.githubusercontent.com/slundberg/shap/master/docs/artwork/shap_diagram.png />
</a>

## Pinkoi Data Team Codebase Architecture

<br/>
<center>
    <img alt=pinkoi-data-stack src=imgs/pinkoi-data-stack.png width=80% />
</center>

```
                         +------------------+       +------------------+
 +------+                |                  |       | JSON  ...        |
 | Hive |                |      Spark       +------>+           MySQL  |
 +------+                |                  |       |  ..  Pandas  ..  |
     |                   +--------+---------+       | parquet  ...     |
     |                            ^                 |        Hive(ORC) | 
     |        +--------+          |                 +------------------+
     +------->+ PyHive |          +------+
              +----+---+                 |           
                   |                     |           
                   |        +------------+------------+
                   +------->+                         |
+-------+                   |     SQLAlchemy ORM      |
| MySQL +------------------>+ (Business Logic/Rules)  |
+-------+                   |                         |
                            +-------------------------+
```

<center>
    <img alt=sqlalchemy-logo src=imgs/sqlalchemy_logo.jpg width=70% />
</center>

<center>
    <img alt=sqlalchemy-arch src=imgs/sqlalchemy_achitech.png />
    <a href=http://www.aosabook.org/en/sqlalchemy.html >blog post</a>
</center>

<center>
    <img alt=spark-logo src=imgs/spark_logo.png width=70% />
</center>

## Case Study: Interaction Aggregation of Logging Data with Spark

- Definitely not about actual work at Pinkoi
- Any similarity is mere coincidence

### Raw Data

```
                     User Activity
+------+-------------------+--------+---------------------+
| User |      Designer     | Action |      Timestamp      |
+======+===================+========+=====================+
| dboy | pinkoi-experience |  view  | 2019-03-02 02:17:38 |
+------+-------------------+--------+---------------------+
|   :  |         :         |    :   |           :         |
+------+-------------------+--------+---------------------+
```

<center>
<img src=imgs/collab_filter.jpg width=80% />
<a href=https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f>Original Blog</a>
</center>

### Our Goal

<br/>

```
+-----------+-----------+---------+---------+
| designer1 | designer2 | metric1 | metric2 |
+-----------+-----------+---------+---------+
|     :     |     :     |    :    |    :    |
+-----------+-----------+---------+---------+
```

- `metric`s can be various interactions among `designer`s
- Learn the relationships among `designer`s, then generate recommendations based on the learned relations
- [Association Rule Learning](https://en.wikipedia.org/wiki/Association_rule_learning)

```
+------+-------------------+--------+---------------------+
| user |      designer     | action |      timestamp      |
+------+-------------------+--------+---------------------+
|   :  |         :         |    :   |          :          |
+------+-------------------+--------+---------------------+
                 |
                 |
                 |                +-----------+-----------+---------+---------+
               +---+              | designer1 | designer2 | metric1 | metric2 |
               | ? |------------->+-----------+-----------+---------+---------+
               +---+              |     :     |     :     |    :    |    :    |
                                  +-----------+-----------+---------+---------+
```

- `SQL`?

- `NoSQL`?

- `import multiprocessing`

### My Plan

<br/>

```
+------+-------------------+--------+---------------------+
| user |      designer     | action |      timestamp      |
+------+-------------------+--------+---------------------+
|   :  |         :         |    :   |          :          |
+------+-------------------+--------+---------------------+
                  |
                  | groupBy(user) + collect_list
                  |
                  +----> +------+--------------------------------+
                         | dboy | [(designer1, timestamp1), ...] |
                         +------+--------------------------------+
```
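
As a rough mental model of what `groupBy(user)` plus `collect_list` produces, here is a plain-Python stand-in that runs without a Spark cluster. It is not the Spark code itself; the sample rows and the function name `collect_by_user` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical activity log rows: (user, designer, action, timestamp).
logs = [
    ("dboy", "designer1", "view", datetime(2019, 3, 2, 2, 17, 38)),
    ("dboy", "designer2", "view", datetime(2019, 3, 5, 9, 0, 0)),
    ("alice", "designer1", "view", datetime(2019, 3, 1, 12, 0, 0)),
]

def collect_by_user(rows):
    """Group rows by user and collect (designer, timestamp) pairs into a list."""
    grouped = defaultdict(list)
    for user, designer, _action, timestamp in rows:
        grouped[user].append((designer, timestamp))
    return dict(grouped)

by_user = collect_by_user(logs)
for user, records in by_user.items():
    print(user, records)
```

Each user ends up with one row holding the full list of designers they interacted with, which is the input shape the pairing step expects.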

```
+------+--------------------------------+
| dboy | [(designer1, timestamp1), ...] |
+------+--------------------------------+
       |
       | C^n_2,
       | and duration(timestamp1, timestamp2) <= K
       |
       |                 +------+-----------+------------+
       +---------------> | dboy | designer1 | designer2  |
                         +------+-----------+------------+
                         | dboy | designer1 | designer2' |
                         +------+-----------+------------+
                         |   :  |     :     |      :     |
                         +------+-----------+------------+
```

In [16]:
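from itertools import combinations

def as_related_designer_pairs(raw_list):
    """Return both orderings of every designer pair seen within 15 days of each other."""
    rel_pairs = []
    # C(n, 2): every unordered pair of (designer, timestamp) records of one user
    for (designer1, stamp1), (designer2, stamp2) in combinations(raw_list, 2):
        # abs() on the timedelta itself keeps the window symmetric in the pair order
        if abs(stamp1 - stamp2).days < 15:
            rel_pairs.extend([(designer1, designer2), (designer2, designer1)])
    return rel_pairs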
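
To try the pairing logic outside Spark, the sketch below uses a variant with the time window exposed as a parameter. The name `related_designer_pairs`, the `max_days` argument and the sample records are hypothetical additions for illustration.

```python
from datetime import datetime
from itertools import combinations

def related_designer_pairs(records, max_days=15):
    """Pair up designers whose interactions are less than `max_days` days apart."""
    pairs = []
    for (designer1, stamp1), (designer2, stamp2) in combinations(records, 2):
        if abs(stamp1 - stamp2).days < max_days:
            # Emit both directions so the relation is symmetric.
            pairs.extend([(designer1, designer2), (designer2, designer1)])
    return pairs

# Hypothetical records collected for one user.
records = [
    ("designer1", datetime(2019, 3, 2)),
    ("designer2", datetime(2019, 3, 5)),
    ("designer3", datetime(2019, 4, 1)),
]
pairs = related_designer_pairs(records)
print(pairs)
```

Only `designer1` and `designer2` are paired here, since `designer3` was visited about four weeks later.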

In [17]:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

as_related_designer_pairs_udf = udf(
    as_related_designer_pairs,
    returnType=T.ArrayType(T.ArrayType(T.StringType()))
)

- PySpark UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html

```
+------+-----------+------------+
| dboy | designer1 | designer2  |                                
+------+-----------+------------+     
| dboy | designer1 | designer2' |
+------+-----------+------------+                                
|   :  |     :     |      :     |                                
+------+-----------+------------+
       |
       | groupBy(designer1, designer2) + count
       |
       |                     +-----------+-----------+-------+
       |                     | designer1 | designer2 | count |
       +------------------>  +-----------+-----------+-------+
                             |     :     |     :     |   :   |
                             +-----------+-----------+-------+
```
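
The `groupBy(designer1, designer2) + count` step amounts to counting how often each ordered pair occurs across all users. A plain-Python stand-in, with made-up per-user pairs, might look like this:

```python
from collections import Counter

# Hypothetical output of the pairing step, keyed by user.
pairs_by_user = {
    "dboy": [("designer1", "designer2"), ("designer2", "designer1")],
    "alice": [
        ("designer1", "designer2"),
        ("designer2", "designer1"),
        ("designer1", "designer3"),
        ("designer3", "designer1"),
    ],
}

# Flatten all users' pairs and count each (designer1, designer2) combination.
pair_counts = Counter(
    pair for pairs in pairs_by_user.values() for pair in pairs
)
for (designer1, designer2), count in sorted(pair_counts.items()):
    print(designer1, designer2, count)
```

In Spark the same aggregation runs in parallel across partitions, but the resulting table has the same three columns.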

```
+-----------+-----------+-------+
| designer1 | designer2 | count |
+-----------+-----------+-------+
|     :     |     :     |   :   |
+-----------+-----------+-------+
            |
            |          +-----------+-----------+---------+---------+
            |          | designer1 | designer2 | metric1 | metric2 |
            +--------> +-----------+-----------+---------+---------+
                       |     :     |     :     |    :    |    :    |
                       +-----------+-----------+---------+---------+
```
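
The slides do not say what `metric1` and `metric2` are. Purely as an illustration, the sketch below derives two possible metrics from the pair counts: the share of `designer1`'s co-occurrences that involve `designer2`, and a cosine-style normalised count. Both choices, and all names in the code, are assumptions.

```python
import math
from collections import defaultdict

# Hypothetical directed pair counts, shaped like the (designer1, designer2, count) table.
pair_counts = {
    ("designer1", "designer2"): 2,
    ("designer2", "designer1"): 2,
    ("designer1", "designer3"): 1,
    ("designer3", "designer1"): 1,
}

# Total co-occurrences per designer, used to normalise the raw counts.
totals = defaultdict(int)
for (designer1, _designer2), count in pair_counts.items():
    totals[designer1] += count

def pair_metrics(designer1, designer2):
    """Return (share, cosine) for an ordered designer pair present in pair_counts."""
    count = pair_counts.get((designer1, designer2), 0)
    share = count / totals[designer1]
    cosine = count / math.sqrt(totals[designer1] * totals[designer2])
    return share, cosine

metrics = {pair: pair_metrics(*pair) for pair in pair_counts}
for pair, (share, cosine) in metrics.items():
    print(pair, round(share, 3), round(cosine, 3))
```

The share is asymmetric (it depends on which designer comes first), while the cosine-style score is the same in both directions.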

With this implicit feedback in hand, you can apply it to

- Association rule learning
- Matrix Factorization
- Factorization Machine
- You name it

<img src=imgs/dataset_hierarchy.png />

## Q & A