# CANDY CONSUMPTION (10 mins)
# Instructions / Notes: Read these carefully
This **Python Jupyter Notebook** is split into the following sections:

1. **Initial section** with pre-filled cells that you should run to load some Python modules (packages), the dataset required for your task, and its variables into memory.
2. **Middle section** with **description of a concrete task** associated with the dataset. 
3. **Final section (with one or more empty cells)** where you can perform analyses with the loaded dataset (e.g., write a few lines of code if needed), answer the question posed, and describe your reasoning in words.

**Read and execute each cell in order, without skipping forward**. To execute any cell, press **Shift+Enter** on your keyboard. It might take a couple of seconds to receive an output.

Have fun!

In [1]:
# Run the following to import necessary packages and import dataset.
import pandas as pd
import numpy as np
import scipy as sp
from others_assessment import partial_corr 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')


dA = "SC_FT assessment_AgeCandyMarriage_population A.csv"
dB = "SC_FT assessment_AgeCandyMarriage_population B.csv"
dC = "SC_FT assessment_AgeCandyMarriage_population C.csv"

#DATASET A
dfA = pd.read_csv(dA)

#DATASET B
dfB = pd.read_csv(dB)

#DATASET C
dfC = pd.read_csv(dC)

#Print first five lines of dataset A as a check to see if the dataset is loaded properly.
dfA.head(n=5)

Unnamed: 0,Age,Candy Consumption,Marital Status
0,35,11.999648,0
1,37,13.533833,0
2,30,13.466628,0
3,44,11.124643,1
4,45,10.401901,1


# DATASET DESCRIPTION:
Each of the three datasets above contains some statistics for a random sample of 200 people:
1. **Age**
2. **Candy consumption (in terms of percentage of total food consumption)** 
3. **Marital status (0 - not married, 1 - married)**

Run the cell below to obtain **correlation table** AND **partial correlation** table for each of the three datasets. 

**Partial correlation** measures the strength of a relationship between two variables, while controlling for the effect of one or more other variables. The purpose of partial correlation is to find the unique variance between two variables while eliminating the variance from a third variable.
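
As a rough illustration of the definition above, the sketch below computes a first-order partial correlation in two equivalent ways: from the three pairwise Pearson correlations, and by correlating the residuals left after regressing each variable on the control variable. The function names and the synthetic data are assumptions made for this example; this is not necessarily how the `partial_corr` helper imported in this notebook is implemented.

```python
import numpy as np

def first_order_partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of x and y controlling for z, from pairwise Pearson correlations."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def partial_corr_via_residuals(x, y, z):
    """Same quantity: regress x and y on z (with intercept), then correlate the residuals."""
    Z = np.column_stack((np.ones(len(z)), z))
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic data (illustrative only): x and y are related solely through z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = 2.0 * z + rng.normal(size=500)
y = -1.5 * z + rng.normal(size=500)

r = np.corrcoef(np.vstack((x, y, z)))  # rows/columns ordered x, y, z
by_formula = first_order_partial_corr(r[0, 1], r[0, 2], r[1, 2])
by_residuals = partial_corr_via_residuals(x, y, z)

print("plain correlation x~y:  ", round(r[0, 1], 2))
print("partial correlation x~y:", round(by_formula, 2), round(by_residuals, 2))
```

In this synthetic setup the plain correlation between `x` and `y` is strong, while the partial correlation is close to zero, because the whole association runs through `z`.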


In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
print("Correlation table for COUNTRY A [cell (i,j) refers to the correlation between variable i and j]")
round (dfA.corr(method='pearson'),2)
print (" ")
print ("----")
print (" ")

print("Partial correlation table for COUNTRY A [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]")
data_int = np.hstack((np.ones((dfA.shape[0],1)), dfA)) 
X=partial_corr(data_int)[1:, 1:]
Y=pd.DataFrame(X,index=['Age', 'Candy Consumption', 'Marital Status'], columns=['Age', 'Candy Consumption', 'Marital Status'])
round (Y,2)


Correlation table for COUNTRY A [cell (i,j) refers to the correlation between variable i and j]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.84,0.75
Candy Consumption,-0.84,1.0,-0.64
Marital Status,0.75,-0.64,1.0


 
----
 
Partial correlation table for COUNTRY A [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.72,0.52
Candy Consumption,-0.72,1.0,-0.01
Marital Status,0.52,-0.01,1.0


In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
print("Correlation table for COUNTRY B [cell (i,j) refers to the correlation between variable i and j]")
round (dfB.corr(method='pearson'),2)
print (" ")
print ("----")
print (" ")

print("Partial correlation table for COUNTRY B [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]")
data_int = np.hstack((np.ones((dfB.shape[0],1)), dfB)) 
X=partial_corr(data_int)[1:, 1:]
Y=pd.DataFrame(X,index=['Age', 'Candy Consumption', 'Marital Status'], columns=['Age', 'Candy Consumption', 'Marital Status'])
round (Y,2)


Correlation table for COUNTRY B [cell (i,j) refers to the correlation between variable i and j]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.79,0.72
Candy Consumption,-0.79,1.0,-0.52
Marital Status,0.72,-0.52,1.0


 
----
 
Partial correlation table for COUNTRY B [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.71,0.59
Candy Consumption,-0.71,1.0,0.12
Marital Status,0.59,0.12,1.0


In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
print("Correlation table for COUNTRY C [cell (i,j) refers to the correlation between variable i and j]")
round (dfC.corr(method='pearson'),2)
print (" ")
print ("----")
print (" ")

print("Partial correlation table for COUNTRY C [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]")
data_int = np.hstack((np.ones((dfC.shape[0],1)), dfC)) 
X=partial_corr(data_int)[1:, 1:]
Y=pd.DataFrame(X,index=['Age', 'Candy Consumption', 'Marital Status'], columns=['Age', 'Candy Consumption', 'Marital Status'])
round (Y,2)


Correlation table for COUNTRY C [cell (i,j) refers to the correlation between variable i and j]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.77,0.72
Candy Consumption,-0.77,1.0,-0.51
Marital Status,0.72,-0.51,1.0


 
----
 
Partial correlation table for COUNTRY C [cell (i,j) refers to the partial correlation between variable i and j, controlling for the third remaining variable]


Unnamed: 0,Age,Candy Consumption,Marital Status
Age,1.0,-0.67,0.59
Candy Consumption,-0.67,1.0,0.09
Marital Status,0.59,0.09,1.0


# TASK:
Survey results from populations in **3 different countries A, B, C** suggest 
(i) **very similar (AND HIGH) correlation trends** between marital status and candy consumption (ranging between -0.5 and -0.65)
(ii) **very similar (AND VERY LOW) partial correlation trends** between marital status and candy consumption (ranging between -0.01 and 0.12)

A candy manufacturer that is newly entering the market has allocated **a limited budget** that can be used to give free samples to a targeted portion of the population in each of these 3 countries. The head of the company wants to **first target married people who consume more candy**. The company officials have brainstormed four potential strategies to **fulfil this goal**. 

**Strategy 1**: Select mostly married people who are in their adulthood (> 35 years) for inclusion in the targeted portion of the population. 

**Strategy 2**: Select mostly married people who are seniors (> 60 years) for inclusion in the targeted portion of the population. 

**Strategy 3**: Select mostly married people who are in their youth (> 15 years) for inclusion in the targeted portion of the population. 

**Strategy 4**: Randomly pick married people for inclusion in the targeted portion of the population.

**Which strategy would you recommend to the company head for each country**? 

Reason with the **dataset provided to you** and with the **correlation** and **partial correlation** trends.

1. Please mark your strategy recommendation for each country.
2. Please provide a brief **reasoning** behind your answer (an explanation of **why** you took certain steps or performed certain calculations to get to the solution).
3. Please mark your **confidence** in your answer (on a scale of 1 to 5).
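
One possible way to reason with the dataset, sketched below, is to compare the average candy consumption of married people above each strategy's age threshold. The helper name `married_candy_by_age_threshold` and the `toy` data frame are hypothetical and exist only for this illustration; the column names follow the dataset shown earlier, and the same helper could be applied to `dfA`, `dfB` and `dfC`.

```python
import pandas as pd

def married_candy_by_age_threshold(df, thresholds=(15, 35, 60)):
    """Mean candy consumption of married people older than each age threshold."""
    married = df[df["Marital Status"] == 1]
    rows = []
    for t in thresholds:
        group = married[married["Age"] > t]
        rows.append({
            "min_age": t,
            "n_married": len(group),
            "mean_candy": group["Candy Consumption"].mean(),
        })
    return pd.DataFrame(rows).round(2)

# Toy data, made up for illustration only (not taken from the survey files).
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 65, 70],
    "Candy Consumption": [14.0, 13.0, 11.0, 10.0, 8.0, 7.0],
    "Marital Status": [1, 0, 1, 1, 1, 0],
})
summary = married_candy_by_age_threshold(toy)
print(summary)
```

A summary like this only describes the sample; it is meant to support, not replace, the reasoning based on the correlation and partial correlation tables.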

# Not enough time...

In [None]:
#NOTE: Round all your statistics to 2 decimal places before reasoning with them!! 

#REPORT YOUR ANSWER (e.g., 112, 123, 234 etc)
country = 'None'
#Choose strategy recommendation (1, 2, 3, 4) for the countries in the following sequence:
#Country A --> Country B --> Country C
print(country)

#REPORT YOUR REASONING
strategy_appropriateness_reasoning = 'None'
print(strategy_appropriateness_reasoning)

#REPORT CONFIDENCE IN YOUR ANSWER
confidence_measure = 'None' 
#Choose among: 1 (low confidence), 2, 3 (medium confidence), 4, 5 (high confidence)
print(confidence_measure)


In [None]:
#ONLY use this space below to write your code (if needed) for answering the task. DO NOT ERASE this code segment from the workbook.












#Your intuitive ideas are valuable! If you need syntax-related help in implementing your ideas, you can access the following documentation files (use the "Search" tab for queries) and/or summarized syntax sheets.

#a) Pandas library
#Documentation file: https://pandas.pydata.org/pandas-docs/stable/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/fbc502d0-46b2-4e1b-b6b0-5402ff273251

#b) Numpy library
#Documentation file: https://docs.scipy.org/doc/numpy/user/index.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/e9f83f72-a81b-42c7-af44-4e35b48b20b7

#c) Matplotlib library
#Documentation file: https://matplotlib.org/contents.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/28b8210c-60cc-4f13-b0b4-5b4f2ad4790b

#d) Scipy library
#Documentation file: https://docs.scipy.org/doc/scipy/reference/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/5710caa7-94d4-4248-be94-d23dea9e668f