# This example shows how to use the fusion library to join two external third-party data sources so that information can be shared between them.

A typical use case for this arises when a researcher wants to supplement available data with information not measured in the current study.  

For instance, consider a purchase prediction model built using purchasing behavior. Television ads are known to affect purchasing behavior, so including TV viewing behavior or ad viewing history in the model may improve its accuracy. This information is measured in a separate data source. Fusing the purchasing data with the viewing data creates a richer data source on which to train the prediction model.

In this example, we join a GWI online behavior survey to a BARB TV viewing panel. The GWI survey doesn't measure viewing behavior, but it does capture some useful characteristics that are not measured in the BARB panel. For example, the BARB panel doesn't have income as a measured feature, but GWI does. It would be useful to share information across these sources to enrich aggregate insights derived from either.

In [1]:
import pandas as pd

survey = pd.read_csv("./data/online-survey.csv")
panel = pd.read_csv("./data/tv-panel.csv")

A key requirement for implicit fusion is a set of overlapping features that were measured in both data sources. These features must also be coded with the same values in both sources. Let's see what's available and check the values.

In [2]:
linking = survey.columns.intersection(panel.columns).tolist()
print(linking)

for l in linking:
    print(f"{l} values:= gwi: {sorted(survey[l].unique())}, barb: {sorted(panel[l].unique())}")

['age', 'gender', 'children', 'marital']
age values:= gwi: ['16-24', '25-34', '35-44', '45-64'], barb: ['16-24', '25-34', '35-44', '45-64']
gender values:= gwi: ['F', 'M'], barb: ['F', 'M']
children values:= gwi: ['0', '1', '2', '3+'], barb: ['0', '1', '2', '3+']
marital values:= gwi: ['married', 'single'], barb: ['married', 'single']


We can see there are 4 linking variables, and the possible values align between the data sources.  Ordinarily, there will be more common variables and a selection process should determine which are used for fusion.  Here, we can use all of them.
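The visual check above can also be automated. The following is a minimal sketch, not part of the fusion library: `check_linking_alignment` is a hypothetical helper, and the two toy frames stand in for the real survey and panel files. It reports, for each shared column, any values that appear in only one source.

```python
import pandas as pd


def check_linking_alignment(donor: pd.DataFrame, recipient: pd.DataFrame) -> dict:
    """Map each shared column to the values found in only one of the sources."""
    shared = donor.columns.intersection(recipient.columns)
    mismatches = {}
    for col in shared:
        donor_vals = set(donor[col].dropna().unique())
        recip_vals = set(recipient[col].dropna().unique())
        only_one_side = donor_vals ^ recip_vals  # symmetric difference
        if only_one_side:
            mismatches[col] = only_one_side
    return mismatches


# Toy data (hypothetical), with a deliberate coding mismatch in `age`.
toy_survey = pd.DataFrame(
    {"age": ["16-24", "25-34"], "gender": ["F", "M"], "income": ["Top 25%", "Mid 50%"]}
)
toy_panel = pd.DataFrame(
    {"age": ["16-24", "25-35"], "gender": ["M", "F"], "viewer": [1, 2]}
)

mismatches = check_linking_alignment(toy_survey, toy_panel)
print(mismatches)
```

An empty result means every shared column uses the same set of values in both sources; any column that does appear needs recoding before it can serve as a linking variable.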

---

When matching person records based on demographics, we usually want to avoid nonsensical matches between respondents from very different demographic groups.

What counts as nonsensical depends on the data and the analysis being done, but matches such as an 18-year-old male paired with a 65-year-old female are often undesirable. To avoid this, _`critical cells`_ are created within which all records have the same values for a particular set of fields. Matching is then done within each cell on the remaining features. A common choice in marketing is `age` and `gender`.

In [3]:
critical_cells = ["age", "gender"]
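To make the idea concrete, here is a small sketch of how critical cells restrict the donor pool. It uses plain pandas on hypothetical toy data and is only an illustration of the concept, not the library's internal implementation.

```python
import pandas as pd

critical = ["age", "gender"]

# Hypothetical toy donors and recipients.
toy_donors = pd.DataFrame({
    "age": ["16-24", "16-24", "45-64"],
    "gender": ["M", "F", "F"],
    "children": ["0", "1", "2"],
})
toy_recipients = pd.DataFrame({
    "age": ["16-24", "45-64", "45-64"],
    "gender": ["M", "F", "M"],
    "children": ["1", "2", "0"],
})

# Index the donors by their critical cell, e.g. ("16-24", "M").
donor_cells = {key: grp for key, grp in toy_donors.groupby(critical)}

# For each recipient, the candidate donors are those in the same cell only.
candidates = {}
for key, recips in toy_recipients.groupby(critical):
    donors = donor_cells.get(key)
    for idx in recips.index:
        candidates[idx] = [] if donors is None else donors.index.tolist()

print(candidates)
```

Note that the last recipient falls in a cell with no donors at all. Very fine-grained critical cells make this more likely, which is one reason to keep the set of critical fields small.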

To perform the actual "fusion", we use the main `HotDeck` class from the `datafusionsm.implicit_model` module. Currently, this is the only implicit model offered. We also set `income` as the target variable we want to donate from GWI (`survey`) to BARB (`panel`).

In [4]:
from datafusionsm.implicit_model import HotDeck
target = "income"

Let's first run a model with the default parameters:  
>method=nearest  
>score_method=cosine  
>minimize=True  
>importance=None (no importance weights)

In [5]:
hd = HotDeck() 
hd.fit(survey, panel, linking=linking, critical=critical_cells)
fused = hd.transform(survey, panel, target=target)
fused[["panelist-id"] + linking + [target]].head()

  ViewerID    age gender children  marital      income
0    548-1  45-64      M        0  married     Top 25%
1   1074-2  45-64      M        0  married  Bottom 25%
2   1103-2  45-64      M        0  married     Top 25%
3   1268-1  45-64      M        0  married     Mid 50%
4   1301-1  45-64      M        0  married     Top 25%


We can see the donated `income` value in the BARB panel. One can now treat this variable as if it were actually measured in the panel.
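For intuition about what a `nearest` match with a `cosine` score might look like, here is a rough sketch on toy data. It assumes the categorical linking variables are one-hot encoded before scoring, which is an assumption of this illustration and not something confirmed about `HotDeck`; all names below are hypothetical.

```python
import numpy as np
import pandas as pd

features = ["children", "marital"]

# Hypothetical toy data: donors carry `income`, recipients do not.
toy_donors = pd.DataFrame({
    "children": ["0", "2", "0"],
    "marital": ["married", "single", "single"],
    "income": ["Top 25%", "Mid 50%", "Bottom 25%"],
})
toy_recipients = pd.DataFrame({
    "children": ["0", "2"],
    "marital": ["single", "single"],
})

# One-hot encode both sources together so they share the same columns.
combined = pd.concat(
    [toy_donors[features], toy_recipients[features]], keys=["donor", "recip"]
)
encoded = pd.get_dummies(combined).astype(float)
D = encoded.loc["donor"].to_numpy()
R = encoded.loc["recip"].to_numpy()


def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


dist = cosine_distance(R, D)   # shape: (n_recipients, n_donors)
nearest = dist.argmin(axis=1)  # smallest distance wins (a minimizing score)

# Donate the target from each recipient's nearest donor.
toy_fused = toy_recipients.assign(income=toy_donors["income"].to_numpy()[nearest])
print(toy_fused)
```

In a real run this search would be carried out separately within each critical cell, so only the non-critical linking variables contribute to the score.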

A quick way to evaluate the fusion results is to check how well the donated information is preserved after fusion; i.e., does the `income` variable have a similar distribution in BARB after donation as it does where it was measured in GWI? We can use `compare_distributions(p, q)` from `datafusionsm.evaluation` as a check.

In [6]:
from datafusionsm.evaluation import compare_distributions

measured_inc = survey["income"].value_counts() / survey.shape[0]
fused_inc = fused["income"].value_counts() / fused.shape[0]

compare_distributions(measured_inc, fused_inc)

KL-Divergence               0.004165
Hellinger Distance          0.032141
Total Variation Distance    0.033679
Overlap                     0.966321
dtype: float64

From the above summary, we can see that the `income` distribution is recaptured well: the `Overlap` is about 0.97 and all three distances between the distributions are small. This means we can be reasonably confident when using `income` from the fused panel moving forward.
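The four summary statistics can be sketched with their textbook definitions. This is an illustration under assumptions, not the library's code: `distribution_metrics` is a hypothetical helper, and the library's exact conventions (log base, direction of the KL divergence) may differ. One thing the output above does confirm is that `Overlap` equals one minus the total variation distance.

```python
import numpy as np
import pandas as pd


def distribution_metrics(p: pd.Series, q: pd.Series) -> pd.Series:
    """Textbook distances between two discrete distributions indexed by category."""
    # Align on category labels; a category missing from one side gets probability 0.
    p, q = p.align(q, fill_value=0.0)
    p = p.to_numpy(dtype=float)
    q = q.to_numpy(dtype=float)

    mask = p > 0  # terms with p == 0 contribute nothing to KL(p || q)
    kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # infinite if q == 0 where p > 0
    hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    tvd = 0.5 * np.sum(np.abs(p - q))
    overlap = np.sum(np.minimum(p, q))  # equals 1 - tvd for proper distributions

    return pd.Series({
        "KL-Divergence": kl,
        "Hellinger Distance": hellinger,
        "Total Variation Distance": tvd,
        "Overlap": overlap,
    })


# Hypothetical toy distributions over three income bands.
toy_p = pd.Series({"Bottom 25%": 0.5, "Mid 50%": 0.3, "Top 25%": 0.2})
toy_q = pd.Series({"Bottom 25%": 0.4, "Mid 50%": 0.4, "Top 25%": 0.2})

metrics = distribution_metrics(toy_p, toy_q)
print(metrics)
```

All four quantities are zero-distance (or full overlap) when the two distributions are identical, so values near 0 for the distances and near 1 for the overlap indicate that the fusion preserved the marginal distribution.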

Another way to evaluate the fusion results is to inspect the matches themselves and see how close they were, on average, with respect to the linking variables and scores. Since `age` and `gender` were critical cells, matched records always agree on them, so let's look only at `children` and `marital`.

In [7]:
from datafusionsm.evaluation import demo_accuracy

# Compare donor and recipient values on the non-critical linking variables.
summary, results = demo_accuracy(hd.matches, survey, panel, ["children", "marital"])
print(summary)

children:

          0     1    2  3+  recipient
0      5000     0    0   0       5000
1         0  1868    0   0       1868
2         0     0  913   0        913
3+        0     0    0  53         53
donor  5000  1868  913  53       7834


acc: 1.0

donor/recip    0    1    2   3+
             1.0  1.0  1.0  1.0

marital:

        married single  recipient
married    4951      0       4951
single        0   2883       2883
donor      4951   2883       7834


acc: 1.0

donor/recip  married  single
                 1.0     1.0

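Every matched pair agrees exactly on both `children` and `marital` (an accuracy of 1.0 for each), so all 7,834 matches fall on the diagonal of the tables above.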
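An agreement table of this kind can be sketched with `pandas.crosstab`. The example below uses a hypothetical frame of matched pairs with one deliberate mismatch; it illustrates the idea only and is not how `demo_accuracy` is implemented.

```python
import pandas as pd

# Hypothetical matched pairs: one row per recipient with its donor's value.
matched = pd.DataFrame({
    "recipient_children": ["0", "0", "1", "2", "3+"],
    "donor_children": ["0", "0", "1", "1", "3+"],
})

# Cross-tabulate recipient values against donor values, with totals.
table = pd.crosstab(
    matched["recipient_children"], matched["donor_children"], margins=True
)

# Share of matches where donor and recipient agree exactly.
accuracy = (matched["recipient_children"] == matched["donor_children"]).mean()

print(table)
print(f"acc: {accuracy}")
```

Off-diagonal counts show which categories were matched to which, so this view helps spot systematic mismatches (for example, neighboring categories being swapped) that a single accuracy number would hide.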