## Case Study: Bayer

#### Goal

To increase awareness of Bayer's work and products, different teams frequently create
content designed for Health Care Providers (HCPs). The messages range from the
launch of a new product to updates on a clinical trial, new initiatives, partnerships...
Each piece of content is designed with a specific target audience in mind. However, Bayer's limited
knowledge of HCPs makes it difficult to choose the right message to send to a given HCP.
Your task is to come up with a solution that selects the right message for each HCP, with the
ultimate goal of optimizing the prescription of Bayer products by that HCP.
To do so, you will be provided with historical data on HCPs' reactions to newsletters in which the
messages were mixed randomly. Ultimately, we would like to send 1 to 5 messages per newsletter to each HCP.

#### Data Description

You are provided with a data set containing all the information needed to answer the questions: the
messages sent to HCPs through newsletters, together with information about the HCP, the newsletter and the
message.

![image.png](attachment:0038d558-68a6-488d-b534-33d9829b5821.png)

#### Exercises

1. The purpose of our solution is to increase the sales volume of Bayer products. We do not have final sales by prescription from the doctors, and we believe that knowledge of our brand drives sales in the long term. Based on the data provided to you, what target can you select or create to optimize for a long-term sales increase?
2. Based on the data you are provided, develop an algorithm that recommends to Bayer how to interact with HCPs in terms of newsletters and messages.
3. After developing such a system, how would it evolve over time in terms of algorithm, feedback loop and message creation?
4. We will launch your solution through a newsletter. In the future we want to interact differently with HCPs: what other UX (app, interface, experience…) can you envision for HCPs to consume content?
5. For the first version of the tool, what features of the model would you select? What enhancements would be part of further developments (algorithm, data, external sources,…)?
6. At the moment, messages are characterised by a series of items. In the future we would like to automate this through computer vision and text mining. Could you present an approach for doing so? How could this go into production?

##### 1. How to increase sales without final sales information?
* Assumption: if a message has been clicked, it has also been read.
* We should target the HCPs that fall into the cluster that maximizes the probability of our messages being read and clicked.
* Option A: Since each HCP might receive more than one message, we could aggregate the data per HCP and keep only the HCP's own information and their outcomes with our messages. This is risky, since an HCP's location can change, and the type and TA of the messages received can differ.
* Option B: discriminant analysis to see which set of features maximizes the probability of a message being read and/or clicked.
    * Option 1: Conditional probability: $P(Read| X_1 = x_1,\dots,X_n = x_n)$
        * Features to consider:
            * Message_type
            * Message_TA
            * office_or_hospital_based
            * gender
            * is_cardiologist
            * is_gp
            * years_since_graduation - discretize into 3 buckets:
                * Equal frequency - each bucket holds approximately the same number of HCPs
                * Equal width - each bucket spans the same interval length
            * Number of combinations: $6\cdot5\cdot3\cdot2\cdot2\cdot2\cdot3 = 2160$ combinations - far too many.
    * Decision tree - problematic with categorical features (try coding it in R instead).
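
The conditional-probability option above can be sketched as a simple group-by: the empirical $P(Read \mid X_1 = x_1,\dots,X_n = x_n)$ is the mean of the binary target within each feature combination. The snippet below is only an illustration under stated assumptions: the `toy` frame and its values are invented, and only two of the listed features are used so that the result stays readable. It also reproduces the size of the full grid quoted above, which shows why most cells would hold very few observations.

```python
import math
import pandas as pd

# Hypothetical toy data: column names follow the case-study data set,
# the values are invented for illustration only.
toy = pd.DataFrame({
    "Message_type": ["A", "A", "A", "B", "B", "B"],
    "is_gp":        [1, 1, 0, 1, 1, 0],
    "Message_read": [1, 0, 1, 0, 0, 1],
})

features = ["Message_type", "is_gp"]

# Empirical P(Read | features): mean of the binary target per combination,
# with the group size kept alongside to show how thin each cell is.
p_read = (
    toy.groupby(features)["Message_read"]
       .agg(p_read="mean", n="size")
       .reset_index()
)
print(p_read)

# Size of the full grid: the product of the level counts quoted above.
n_cells = math.prod([6, 5, 3, 2, 2, 2, 3])
print(n_cells)
```

Keeping the group size `n` next to each estimate makes it easy to discard or pool combinations that are backed by too few newsletters to be trusted.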

In [46]:
import pandas as pd
import seaborn as sns

# Load the case-study data (semicolon-separated CSV).
df_bayer = pd.read_csv("data/real_case/data_usecase.csv", sep=';')
print(df_bayer.columns.values)

# Keep the candidate predictors plus the two outcome columns.
features_relevants = ['Message_type', 'Message_TA','office_or_hospital_based','gender','is_cardiologist','is_gp','years_since_graduation','Message_click','Message_read']
# .copy() yields an independent frame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning.
df_bayer_filtered = df_bayer[features_relevants].copy()

# Equal-width discretization of seniority into 5 levels.
df_bayer_filtered['years_since_graduation'] = pd.cut(df_bayer_filtered['years_since_graduation'], 5, labels=["L1","L2","L3","L4","L5"])

['HCP_id' 'news_id' 'Message_id' 'Message_type' 'Message_creation_date'
 'Message_TA' 'news_date' 'office_or_hospital_based' 'gender'
 'is_cardiologist' 'is_gp' 'years_since_graduation' 'Message_read'
 'Message_click']

#### Mutual information or correlation study
First, let's measure how much information each of the other variables, taken independently of the rest, provides about 'Message_read' or 'Message_click':
* We assume all variables are discrete (for *years_since_graduation* we performed an equal-width discretization).
    * We assumed 5 levels of seniority.
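
To make the discretization choice concrete, here is a minimal sketch comparing equal-width and equal-frequency binning with pandas. The `years` values are hypothetical (deliberately skewed toward junior HCPs); only the five labels are taken from the notebook.

```python
import pandas as pd

# Hypothetical years-since-graduation values, skewed toward junior HCPs.
years = pd.Series([1, 2, 3, 4, 5, 6, 8, 12, 25, 40])
labels = ["L1", "L2", "L3", "L4", "L5"]

# Equal width: 5 intervals of identical length over [min, max].
equal_width = pd.cut(years, bins=5, labels=labels)

# Equal frequency: 5 quantile-based intervals with roughly equal counts.
equal_freq = pd.qcut(years, q=5, labels=labels)

# Compare how many observations land in each level under both schemes.
print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

On skewed data, equal-width bins tend to pile most observations into the first level and leave others nearly empty, whereas equal-frequency bins keep the levels balanced at the cost of uneven interval lengths.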

In [63]:
import numpy as np
from sklearn.metrics import mutual_info_score

features_relevants = ['Message_type', 'Message_TA','office_or_hospital_based','gender','is_cardiologist','is_gp']
target = 'Message_read'  # repeat with 'Message_click' for the second target
y = np.array(df_bayer_filtered[target])
for feature in features_relevants:
    X = np.array(df_bayer_filtered[feature])
    # Contingency table of the feature's levels (rows) against the target (columns).
    contingency_table = pd.crosstab(X, y, margins=False)
    print('Feature: ', feature, '\n', contingency_table)
    # Mutual information (in nats) computed directly from the contingency table.
    print(mutual_info_score(labels_true=None, labels_pred=None, contingency=contingency_table))

Feature:  Message_type 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
Feature:  Message_TA 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
Feature:  office_or_hospital_based 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
Feature:  gender 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
Feature:  is_cardiologist 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
Feature:  is_gp 
 col_0      0      1
row_0              
0      27498      0
1          0  28045
0.6930986860013235
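
Every table above is perfectly diagonal with the same counts, and every score is identical. That is the pattern one would expect if a variable were cross-tabulated against itself, in which case the mutual information reduces to the entropy of that variable. The following is a minimal check under that assumption, using only the counts printed above; it is a sanity check on the output, not a diagnosis of the cause.

```python
import math

# Counts copied from the contingency tables printed above.
n_unread, n_read = 27498, 28045
total = n_unread + n_read

# Shannon entropy of Message_read in nats (natural logarithm, the unit
# used by sklearn.metrics.mutual_info_score).
entropy = -sum((n / total) * math.log(n / total) for n in (n_unread, n_read))
print(entropy)
```

If the value agrees with the scores printed above, the features did not actually enter the computation, and the per-feature tables are worth re-checking before drawing any conclusion from them.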
