###  This example shows how to use data-fusion-sm to join two data sources with the goal of sharing information.

A typical use case for this arises when a researcher wants to supplement available data with information not measured in the current study.  

For instance, consider a purchase prediction model built using purchasing behavior. Television ads are known to affect purchasing behavior, so it may help to include TV viewing behavior or ad viewing history in the model. This information is measured in a separate data source. Fusing the purchasing data with the viewing data creates a richer data source on which to train the prediction model, which may improve its accuracy and usefulness.

In this example, we join an online behavior survey to a TV viewing panel. Here the TV panel does not include viewing behavior, but it does include some useful characteristics that are not measured in the survey. Conversely, the survey measures income, which the panel does not. Sharing information across these sources would enrich the aggregate insights derived from either.

In [1]:
from datafusionsm.datasets import load_online_survey, load_tv_panel

survey = load_online_survey()
panel = load_tv_panel()

A key requirement for implicit fusion is a set of overlapping features that were measured in both data sources. The values of these features must also be coded consistently across the two sources. Let's see what's available and check the values.

In [2]:
linking = survey.columns.intersection(panel.columns).tolist()
print(linking)

for l in linking:
    print(f"{l} values:= survey: {sorted(survey[l].unique())}, panel: {sorted(panel[l].unique())}")

['age', 'gender', 'children', 'marital']
age values:= survey: ['16-24', '25-34', '35-44', '45-64'], panel: ['16-24', '25-34', '35-44', '45-64']
gender values:= survey: ['F', 'M'], panel: ['F', 'M']
children values:= survey: ['0', '1', '2', '3+'], panel: ['0', '1', '2', '3+']
marital values:= survey: ['married', 'single'], panel: ['married', 'single']


We can see there are 4 linking variables, and the possible values align between the data sources.  Ordinarily, there will be more common variables and a selection process should determine which are used for fusion.  Here, we can use all of them.
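When the sources are less tidy than these, the same check can be automated. The following is a minimal sketch using only pandas; the toy frames and the helper name `mismatched_levels` are hypothetical and are not part of `datafusionsm`.

```python
import pandas as pd

# Hypothetical toy frames standing in for two sources with shared columns.
left = pd.DataFrame(
    {"age": ["16-24", "25-34"], "gender": ["F", "M"], "income": ["Low", "High"]}
)
right = pd.DataFrame(
    {"age": ["25-34", "16-24"], "gender": ["M", "Male"], "hh_size": [2, 3]}
)


def mismatched_levels(a, b):
    """For each shared column, list the values that appear in only one frame."""
    report = {}
    for col in a.columns.intersection(b.columns):
        levels_a, levels_b = set(a[col].unique()), set(b[col].unique())
        if levels_a != levels_b:
            report[col] = {
                "only_left": sorted(levels_a - levels_b),
                "only_right": sorted(levels_b - levels_a),
            }
    return report


report = mismatched_levels(left, right)
print(report)
```

An empty report means every shared column uses the same set of values in both frames. Here `gender` is flagged because one frame codes it as `M` and the other as `Male`, so it would need recoding before it could serve as a linking variable.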

---

When matching person records based on demographics, we usually want to avoid nonsensical matches between respondents from very different demographic groups.

This depends on the data and the analysis being done, but situations such as matching an 18-year-old male to a 65-year-old female are often undesirable.  To avoid this, _`critical cells`_ are created within which all records have the same values for a particular set of fields.  Matching is then done within each cell on the remaining features.  A common choice in marketing is `age` and `gender`.
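To make the idea of a critical cell concrete, here is a minimal pandas sketch that partitions donors by cell and lists, for each recipient, the donors it is allowed to match. The toy data and variable names are hypothetical, and this is an illustration of the concept rather than how `datafusionsm` implements it.

```python
import pandas as pd

# Hypothetical toy records; only the column names mirror the example.
recipients = pd.DataFrame({
    "age": ["16-24", "16-24", "45-64"],
    "gender": ["M", "F", "F"],
    "children": ["0", "1", "2"],
})
donors = pd.DataFrame({
    "age": ["16-24", "45-64", "45-64", "16-24"],
    "gender": ["M", "F", "F", "F"],
    "children": ["1", "2", "0", "1"],
})

cells = ["age", "gender"]

# Map each (age, gender) cell to the donor rows that fall inside it.
donor_cells = {key: grp.index.tolist() for key, grp in donors.groupby(cells)}

# A recipient may only be matched to donors from its own cell.
candidates = {
    i: donor_cells.get(tuple(row[cells]), [])
    for i, row in recipients.iterrows()
}
print(candidates)
```

Recipients whose cell contains no donors get an empty candidate list, which is one reason to keep the set of critical fields small.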

In [3]:
critical_cells = ["age", "gender"]

To perform the actual "fusion", we use the main `HotDeck` class from the `datafusionsm.implicit_model` module.  Currently, this is the only implicit model offered.  We also set `income` as the target variable we want to donate from `survey` to `panel`.

In [4]:
from datafusionsm.implicit_model import HotDeck
target = "income"

Let's first run a model with the default parameters:  
>method=nearest  
>score_method=cosine  
>minimize=True  
>importance=None (no importance weights)

In [5]:
hd = HotDeck() 
hd.fit(survey, panel, linking=linking, critical=critical_cells)
fused = hd.transform(survey, panel, target=target)
fused[["panelist-id"] + linking + [target]].head()

   panelist-id    age gender children  marital   income
0        61913  45-64      M        0  married  Mid 50%
1        17173  45-64      M        0  married  Mid 50%
2        62118  45-64      M        0  married  Mid 50%
3        72817  45-64      M        0  married  Mid 50%
4        92177  45-64      M        0  married  Mid 50%


We can see the donated `income` value in the TV panel.  One can now treat this variable as if it were actually measured in the panel.
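For intuition about what a nearest-neighbor hot deck with a cosine score does, the following hand-rolled sketch matches each recipient to the closest donor within its critical cell and copies that donor's `income`. It assumes one-hot encoding of the matching variables and breaks ties by taking the first donor; the toy data and all names are hypothetical, and the library's internals may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the example, values are made up.
donors = pd.DataFrame({
    "age": ["16-24", "16-24", "45-64"],
    "gender": ["F", "F", "M"],
    "children": ["0", "2", "0"],
    "marital": ["single", "married", "married"],
    "income": ["Low", "Mid", "High"],
})
recipients = pd.DataFrame({
    "age": ["16-24", "45-64"],
    "gender": ["F", "M"],
    "children": ["2", "0"],
    "marital": ["married", "married"],
})

cells = ["age", "gender"]            # records must agree exactly on these
match_on = ["children", "marital"]   # distance is computed on these

# One-hot encode both sources together so they share the same columns.
both = pd.concat([donors[match_on], recipients[match_on]], keys=["d", "r"])
encoded = pd.get_dummies(both).astype(float)
d_enc, r_enc = encoded.loc["d"], encoded.loc["r"]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the encoded records are identical."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


donated = []
for i, rec in recipients.iterrows():
    # Candidate pool: donors in the same critical cell as this recipient.
    same_cell = donors[cells].eq(rec[cells], axis="columns").all(axis=1)
    pool = donors.index[same_cell]
    dists = [
        cosine_distance(r_enc.loc[i].to_numpy(), d_enc.loc[j].to_numpy())
        for j in pool
    ]
    best = pool[int(np.argmin(dists))]  # smallest distance wins
    donated.append(donors.loc[best, "income"])

fused_toy = recipients.assign(income=donated)
print(fused_toy)
```

The sketch assumes every recipient has at least one donor in its cell; a real implementation would need a fallback for empty cells.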

A quick way to evaluate the fusion results is to look at how well the donated information is preserved post-fusion; i.e. does the `income` variable keep a distribution in the panel similar to the one measured in the survey?  We can use `datafusionsm.evaluation.compare_dists(p, q)` as a check.

In [6]:
from datafusionsm.evaluation import compare_dists

measured_inc = survey["income"].value_counts() / survey.shape[0]
fused_inc = fused["income"].value_counts() / fused.shape[0]

compare_dists(measured_inc, fused_inc)

KL-Divergence               0.004285
Hellinger Distance          0.032555
Total Variation Distance    0.033000
Overlap                     0.967000
dtype: float64

From the above summary, we can see we recapture the `income` variable pretty well - as shown by the high degree of `Overlap` and the relatively small distances between the distributions.  This means we can be fairly confident when using `income` from the fused panel moving forward.
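For readers unfamiliar with these measures, the sketch below computes them with NumPy from their textbook definitions on two made-up distributions. The exact formulas used inside `compare_dists` (log base, direction of the KL term) are not confirmed here, so treat this as an assumption; it is consistent with the output above in that `Overlap` and `Total Variation Distance` sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical category shares: p plays the "measured" role, q the "fused" one.
p = pd.Series({"Low 25%": 0.25, "Mid 50%": 0.50, "High 25%": 0.25})
q = pd.Series({"Mid 50%": 0.55, "Low 25%": 0.25, "High 25%": 0.20})

# Align on category labels so shares are compared like for like.
q = q.reindex(p.index).fillna(0.0)

# Total variation distance: half the summed absolute difference in shares.
tvd = 0.5 * np.abs(p - q).sum()

# Overlap: the probability mass the two distributions have in common.
overlap = np.minimum(p, q).sum()

# Hellinger distance: bounded between 0 (identical) and 1 (disjoint).
hellinger = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

# KL divergence of q from p, natural log; assumes no zero shares.
kl = (p * np.log(p / q)).sum()

print(pd.Series({
    "KL-Divergence": kl,
    "Hellinger Distance": hellinger,
    "Total Variation Distance": tvd,
    "Overlap": overlap,
}))
```

All four measures agree in direction: identical distributions give zero distance and an overlap of 1.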

Another way to evaluate the fusion results is to inspect the matches themselves and see how close we were on average with respect to the linking variables and scores.  Since `age` and `gender` were critical cells, let's only look at `children` and `marital`.

In [7]:
from datafusionsm.evaluation import match_accuracy
summary, results = match_accuracy(hd.matches, survey, panel, ["children", "marital"])
print(summary)

children:

          0    1    2  3+  recipient
0      2606    0    0   0       2606
1         0  925    0   0        925
2         0    0  456   0        456
3+        0    0    0  13         13
donor  2606  925  456  13       4000


	acc:	1.0

	donor/recip	0	1	2	3+
			1.0	1.0	1.0	1.0

marital:

        married single  recipient
married    2521      0       2521
single        1   1478       1479
donor      2522   1478       4000


	acc:	0.9998

	donor/recip	married	single
			1.0004	0.9993

We can see the matches are very close: every match agrees on `children`, and only one of the 4,000 matches disagrees on `marital`.
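A table of this kind can be approximated with a plain pandas cross-tabulation of recipient values against donor values. This is a minimal sketch on hypothetical matched pairs, not the implementation of `match_accuracy`.

```python
import pandas as pd

# Hypothetical matched pairs: one row per recipient with its donor's value.
pairs = pd.DataFrame({
    "recipient": ["married", "married", "single", "single", "single"],
    "donor": ["married", "married", "single", "married", "single"],
})

# Rows are recipient values, columns are donor values, with totals in "All".
table = pd.crosstab(pairs["recipient"], pairs["donor"], margins=True)

# Share of matches where donor and recipient agree on the variable.
accuracy = (pairs["recipient"] == pairs["donor"]).mean()

print(table)
print(f"acc: {accuracy}")
```

Off-diagonal counts show which categories are being confused. In this toy data, one single recipient received a married donor.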