<div class="alert alert-info" role="alert">
    <center><h1 style="color:red;"><strong><font color = red>Social Media Campaign Analytics:<br>Should we INCREASE our investment in social media marketing?
    <br>--or--
    <br>The Problem with Sampling</font></strong></h1></center><br>
</div>
<br><br>

`Whenever you deal with a "sampling" problem you need to be VERY careful when drawing conclusions.  Think about every election that
was predicted to be a landslide, but wasn't.  That's a sampling problem.  Let's look at a marketing example.`

## Business Problem 

You are a business analyst, called to a meeting with your executive team to help with some marketing analytics.  CMO Sheila says,...

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/cmo.jpg)

>Last quarter was the best quarter in our history.  We crushed Wall Street's earnings targets by a wide margin.  I am POSITIVE 
that the **key reason** was our **revamped digital advertising campaigns**.  Last summer we conducted a **comprehensive** survey 
of our social media usage at our **mall stores** which I plotted in this **pie chart**:

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/pieChart.png)

She continues...

>You can see from the pie chart that our **Instagram presence is expanding**. 52% of survey respondents said they learned 
about our company from Instagram. **We should double-down on our Instagram ads** to continue our earnings growth 
trajectory.

### Your CEO is skeptical...

...before investing more budget in targeted IG campaigns, your CEO asks you to explore the data a bit.  You think it is best to do this
as part of a group _Design Thinking_ session where we can interactively look at the data and the problem together, hypothesize, and come 
to a conclusion about _what to do next_.  

>Half of my marketing dollars are wasted.  The problem is, I don't know which half.  --John Wanamaker

<img src="https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/jw.jpg" width="50">

## Design Thinking Session

You think that it might be a good idea to start by looking at how the survey questions were designed.  The way questions are asked can
greatly influence the outcome of a survey.  

>There's a whole science behind asking questions in surveys.  Too much to write about here.  

Here is the actual survey that was used: 

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/survey.png)

OK.  That's a pretty simple survey.  You also find out that the folks conducting the survey were guessing at the **age range** of the
respondents and capturing that information as well.  That's great!  

### Possible Issues

Your _Design Thinking_ team uncovers some _possible_ issues with this survey _experiment_.  And we build our hypotheses **before 
looking at the data**.  

* There _might_ be **sampling bias**.  You could make the case that the survey is biased towards store customers that are social
media users and that might not be reflective of all customers who visit the store.  CMO Sheila does *not* believe the survey was biased.  
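
To make the sampling-bias hypothesis concrete, here is a minimal sketch using entirely hypothetical numbers (none of them come from Sheila's survey). It assumes that 30% of store visitors are social media users and that those users are three times as likely as everyone else to stop and answer a survey about social media. The function name and all three inputs are assumptions made up for illustration.

```python
# Hypothetical illustration of sampling (self-selection) bias.
# Every number below is an assumption made up for this sketch, not survey data.

def observed_share(true_share, p_respond_users, p_respond_others):
    """Share of social media users among respondents, given who chooses to respond."""
    users_responding = true_share * p_respond_users
    others_responding = (1 - true_share) * p_respond_others
    return users_responding / (users_responding + others_responding)

true_share = 0.30        # assumed share of all visitors who use social media
p_respond_users = 0.60   # assumed chance a social media user answers the survey
p_respond_others = 0.20  # assumed chance a non-user answers the survey

biased = observed_share(true_share, p_respond_users, p_respond_others)
print(f"share of users among all visitors: {true_share:.0%}")
print(f"share of users among respondents:  {biased:.0%}")
```

Under these made-up assumptions the respondents look far more social-media-heavy than the visitors they were drawn from, which is the kind of distortion the team is worried about.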

You ask to see Sheila's data and she reluctantly provides it.  

## Exploratory Data Analytics

It's time to look at some data.  

In [2]:
## my standard spark template
## we also load a bunch of packages via requirements.txt

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *



In [3]:

# This notebook uses synapse.  Make sure you run requirements.txt as part of the Spark cluster setup.  
# This notebook assumes you've done that


# vars
# https://davewdemodata.dfs.core.windows.net/lake/MarketingAnalytics/surveys.csv
# let's use a SAS token so this is reproducible for everyone
#

storageAccount='davewdemodata'
container='lake'
sasToken='sv=2020-10-02&st=2021-02-17T16%3A26%3A00Z&se=2030-02-18T16%3A26%3A00Z&sr=c&sp=rl&sig=UrdPIPkBQsgvD5pZhKn0KYL0Ziyb8zaXeeLw1fhA68s%3D'
lakepath='wasbs://{}@{}.blob.core.windows.net/MarketingAnalytics/surveys.csv'.format(container,storageAccount)

sc._jsc.hadoopConfiguration().set("fs.azure.sas.{0}.{1}.blob.core.windows.net".format(container,storageAccount), sasToken)


In [4]:
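## let's load up the file and take a look
dfSurvey = spark.read \
    .option('header','true') \
    .option('delimiter', ',') \
    .csv(lakepath)
# show() prints the rows itself and returns None, so it is not wrapped in display()
dfSurvey.show(5)
dfSurvey.printSchema()
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
dfSurvey.createOrReplaceTempView("dfSurvey")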


+----------------+---------+----------+-----------+--------------------+--------------+----------+
|SurveyResponseId|Responded|SurveyDate| SurveyTime|SocialMediaAwareness|SocMedProperty|AgeBracket|
+----------------+---------+----------+-----------+--------------------+--------------+----------+
|               1|    FALSE| 7/14/2020| 9:13:00 AM|                null|          null|      null|
|               2|    FALSE| 7/14/2020|10:13:00 AM|                null|          null|      null|
|               3|    FALSE| 7/14/2020|11:13:00 AM|                null|          null|      null|
|               4|    FALSE| 7/14/2020|12:13:00 PM|                null|          null|      null|
|               5|    FALSE| 7/14/2020| 1:13:00 PM|                null|          null|      null|
+----------------+---------+----------+-----------+--------------------+--------------+----------+
only showing top 5 rows

root
 |-- SurveyResponseId: string (nullable = true)
 |-- Responded: string (nullabl

In [5]:
## start with summary to do Basic EDA (Exploratory Data Analytics)
display (dfSurvey.summary())


### Interpretation

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/results01.png)

* 363 rows:  this is the total number of surveys.  Is this enough?  
* ...but `Responded` looks like a TRUE/FALSE flag.  After some discussion you learn that there were 363 _attempts_ to get a survey answered, not 363 completed surveys.
* it looks like there is only one `SurveyDate`...that's weird.  

Let's dig in with a little SQL.


## Basic Exploratory Data Analysis

In [6]:
%%sql 

--number responded vs not
--a pie chart might be a good visual here
SELECT Responded, count(*) Count
from dfSurvey
group by Responded


<Spark SQL result set with 2 rows and 2 fields>

### Interpretation


![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/results02.png)

|Metric|Value|
|---|---|
|Respondents|128|
|People who didn't respond|235|
|Total Asked|363|
|**Response Rate**|**35%**|

**A 35% response rate is not that great.  That's a possible problem.**
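
The response rate is simple arithmetic on the two counts above. A minimal sketch in plain Python, using only the figures from the table (the variable names are my own):

```python
# Counts taken from the table above
respondents = 128
non_respondents = 235

total_asked = respondents + non_respondents
response_rate = respondents / total_asked

print(f"total asked:   {total_asked}")
print(f"response rate: {response_rate:.0%}")
```

Put the other way around, roughly two out of every three shoppers who were approached are missing from every chart built on this data.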

## _HOW_ was the survey conducted?  

Sheila tells you the survey was conducted just inside the entrance of a single retail store in a California mall location 
**as the shoppers were leaving the store**. 

**That's horrible** ...but let's look closer at the data.  

_When_ was the survey conducted?  





In [7]:
%%sql 

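--dates of surveys
--single-letter M and d also accept non-padded values such as 7/14/2020
SELECT distinct SurveyDate, date_format(to_date(SurveyDate,'M/d/yyyy'),'EEEE') AS DayOfWeek
from dfSurvey
ORDER BY SurveyDate;

--time ranges of surveys
--single-letter h accepts non-padded hours such as 9:13:00 AM; newer Spark versions expect a single 'a' for AM/PM
SELECT 
    min(to_timestamp(SurveyTime, 'h:mm:ss a'))  AS MinSurveyTime,
    max(to_timestamp(SurveyTime, 'h:mm:ss a'))  AS MaxSurveyTime
FROM dfSurvey;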


<Spark SQL result set with 1 rows and 2 fields>

<Spark SQL result set with 1 rows and 2 fields>

### Interpretation

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/results03.png)

* Surveys were conducted on a SINGLE day, a Tuesday.  
* The surveys were conducted from 9-5 on a summer day.  
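
The day-of-week finding can be cross-checked outside Spark. This is a small sketch in plain Python that parses the `SurveyDate` value seen in the sample rows (`7/14/2020`). The helper name `day_of_week` and the hard-coded weekday list are my own choices, not part of the original notebook.

```python
from datetime import datetime

# Weekday names in the order datetime.weekday() counts them (Monday = 0),
# hard-coded so the result does not depend on the machine's locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def day_of_week(date_text):
    """Return the weekday name for a date written like 7/14/2020."""
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return WEEKDAYS[parsed.weekday()]

print(day_of_week("7/14/2020"))
```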

If we think through this we should see some red flags.  

* Based on those times and the fact that this is a summer day, we should see **a lot of children or retired folks** who are not at work that day.  

Sheila provides you with this summary:  

|Age Bracket|Respondents|
|----|----|
|12-20|68%|
|20-40|15%|
|40-65|12%|
|65+|5%|

...and says...

>**83% of the "Under Age 40" demographic** are captured in the survey and this closely matches our **target sales demographic**.  

But, clearly that is misrepresenting the data.  You know that it's easy to **_confuse_ with numbers** if you aggregate the data in certain ways.  
This looks like one of those cases.  While the **'Under Age 40' demographic** is our target consumer, clearly we are heavily **skewed towards children**.  

_Our actual target demographic, and the ones that will spend the most, was likely at work when the survey was conducted._

>One common way people make **cognitive mistakes** with data is by `inappropriately aggregating data`.  ([Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) is one well-known example.)  Be careful!!
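
To see what the "Under Age 40" roll-up hides, it helps to split the aggregate back into its parts. The sketch below uses only the percentages from Sheila's age-bracket summary; the variable names are my own.

```python
# Percentages copied from Sheila's age-bracket summary
age_share = {"12-20": 68, "20-40": 15, "40-65": 12, "65+": 5}

# The aggregate Sheila quotes
under_40 = age_share["12-20"] + age_share["20-40"]

# How much of that aggregate comes from the youngest bracket alone
youngest_share_of_under_40 = age_share["12-20"] / under_40

print(f"under 40 (aggregated): {under_40}%")
print(f"12-20 as a share of the under-40 group: {youngest_share_of_under_40:.0%}")
```

The 83% figure is arithmetically correct, but most of it comes from the 12-20 bracket, so it says very little about the 20-40 shoppers.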



## Interpretation


Remember this chart supplied by your CMO?

![](https://raw.githubusercontent.com/davew-msft/PrescriptiveAnalytics/sparkconf/slides/pieChart.png)

We think a better way to display this data is:

|Channel|% Respondents|
|---|---|
|All Social Media|43%|
|Instagram|22%|
|Facebook|9%|

This is a much different way to think about this data.  There are other "issues" with this survey and data analysis, but I'll leave that as an exercise for the 
reader.  
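
One way to see why the two views differ is to look at the denominator. The sketch below rests on an assumption the source does not state: that the table's percentages are shares of all respondents, while the pie chart's slices are shares of only those respondents who named a social media property. The variable names are my own.

```python
# Percentages copied from the table above
# (assumed here to be shares of all respondents)
all_social_media = 43
instagram = 22

# Re-express Instagram as a share of the social media group only,
# which is the assumed basis of the pie chart.
instagram_within_social = instagram / all_social_media

print(f"Instagram, share of all respondents:      {instagram}%")
print(f"Instagram, share of social media answers: {instagram_within_social:.0%}")
```

If that assumption holds, the same Instagram answers read as roughly half of the pie but under a quarter of respondents. The result is close to the 52% slice without matching it exactly, so treat this as a plausibility check and not a reconstruction of the CMO's chart.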

## Recommendation

**The data does not support additional investments in Instagram**. If we follow the CMO's recommendation, we should be aware that we will potentially be targeting
the wrong demographic.  

We need to design a better experiment that addresses the issues identified above:
   * The survey design is fundamentally flawed
   * There is statistical bias in the data
   * There is sampling bias in the data

