## *DISCLAIMER*
<p style="font-size:16px; color:#117d30;">
 By accessing this code, you acknowledge the code is made available for presentation and demonstration purposes only and that the code: (1) is not subject to SOC 1 and SOC 2 compliance audits; (2) is not designed or intended to be a substitute for the professional advice, diagnosis, treatment, or judgment of a certified financial services professional; (3) is not designed, intended or made available as a medical device; and (4) is not designed or intended to be a substitute for professional medical advice, diagnosis, treatment or judgement. Do not use this code to replace, substitute, or provide professional financial advice or judgment, or to replace, substitute or provide medical advice, diagnosis, treatment or judgement. You are solely responsible for ensuring the regulatory, legal, and/or contractual compliance of any use of the code, including obtaining any authorizations or consents, and any solution you choose to build that incorporates this code in whole or in part.
</p>

## Important – Do not use in production, for demonstration purposes only – please review the legal notices before continuing
 License agreement: https://github.com/microsoft/Azure-Analytics-and-AI-Engagement/blob/main/HealthCare/License.md 


## Legal Notices
This presentation, demonstration, and demonstration model are for informational purposes only. Microsoft makes no warranties, express or implied, in this presentation, demonstration, and demonstration model. Nothing in this presentation, demonstration, or demonstration model modifies any of the terms and conditions of Microsoft’s written and signed agreements. This is not an offer and applicable terms and the information provided is subject to revision and may be changed at any time by Microsoft.

This presentation, demonstration, and/or demonstration model do not give you or your organization any license to any patents, trademarks, copyrights, or other intellectual property covering the subject matter in this presentation, demonstration, and demonstration model.

The information contained in this presentation, demonstration and demonstration model represent the current view of Microsoft on the issues discussed as of the date of presentation and/or demonstration, and the duration of your access to the demonstration model. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of presentation and/or demonstration and for the duration of your access to the demonstration model.

No Microsoft technology, nor any of its component technologies, including the demonstration model, is intended or made available: (1) as a medical device; (2) for the diagnosis of disease or other conditions, or in the cure, mitigation, treatment or prevention of a disease or other conditions; or (3) as a substitute for the professional clinical advice, opinion, or judgment of a treating healthcare professional. Partners or customers are responsible for ensuring the regulatory compliance of any solution they build using Microsoft technologies.

© 2020 Microsoft Corporation. All rights reserved


## Please do not run this notebook or click "Run all"
At the time of writing, the core limit is 200 cores per workspace. Depending on the number of concurrent users, running the notebook may fail because core capacity or the maximum number of parallel jobs has been exceeded.

## Fetch Marketing Campaigns Data into a DataFrame and Calculate Revenue Variance

In [1]:
%%pyspark
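# Read the campaign CSV from ADLS Gen2; despite its name, data_path holds a Spark DataFrame.
data_path = spark.read.load('abfss://marketingdata@#STORAGE_ACCOUNT_NAME#.dfs.core.windows.net/CampaignData.csv', format='csv', header=True)
# Preview the first 10 rows.
data_path.show(10)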

+-------------------+-------+---------------+-----------+--------------+-----------+----------+------+
|             Region|Country|  Campaign_Name|    Revenue|Revenue_Target|       City|     State|RoleID|
+-------------------+-------+---------------+-----------+--------------+-----------+----------+------+
|         South East|     US|Patient Stories|$11,564.00 |   $19,306.00 |      Miami|   Florida|  NULL|
|Southern California|     US|  For Your Life| $6,497.00 |    $6,147.00 |Los Angeles|California|  NULL|
|         South East|     US|  Hit the track|$11,620.00 |   $17,230.00 |      Miami|   Florida|    20|
|         South East|     US|Patient Stories| $9,963.00 |   $18,377.00 |      Miami|   Florida|  NULL|
|Southern California|     US|  For Your Life|$16,850.00 |   $15,753.00 |Los Angeles|California|  NULL|
|         South East|     US|  Hit the track| $5,333.00 |    $7,346.00 |      Miami|   Florida|    20|
|         South East|     US|Patient Stories|$17,488.00 |    $9,941.00 | 

## Load into Pandas and Perform Cleansing Operations


In [2]:
%%pyspark
from pyspark.sql.functions import *
from pyspark.sql.types import *

import numpy as np

pd_df = data_path.select("*").toPandas()

'''Cleansing Operations: 
1. Columns Revenue, Revenue_Target: Remove '$' and ',' characters and convert datatype to float
2. Columns Revenue, Revenue_Target: Replace null values with 0
3. Columns Region, Country, Campaign_Name: Convert columns to Title Case
'''
pd_df['Revenue'] = pd_df['Revenue'].replace(r'[\$,]', '', regex=True).astype(float)
pd_df['Revenue_Target'] = pd_df['Revenue_Target'].replace(r'[\$,]', '', regex=True).astype(float)
pd_df['Revenue'] = pd_df['Revenue'].fillna(value=0)
pd_df['Revenue_Target'] = pd_df['Revenue_Target'].fillna(value=0)

pd_df['Region'] = pd_df.Region.str.title()
pd_df['Country'] = pd_df.Country.str.title()

pd_df['Campaign_Name'] = pd_df.Campaign_Name.str.title()
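
The cleansing steps above can be tried on plain strings without Spark or pandas. The following is a minimal sketch using only the standard library. `clean_currency` is a hypothetical helper introduced for illustration (it is not part of the notebook), and representing a missing value as `None` is an assumption about how nulls appear in the data.

```python
import re

# Same character class as the notebook's pandas replace: drop '$' and ','.
CURRENCY_CHARS = re.compile(r"[\$,]")


def clean_currency(value):
    """Hypothetical helper: strip '$' and ',' and convert to float.

    Missing values (assumed to arrive as None) become 0.0, mirroring fillna(0).
    """
    if value is None:
        return 0.0
    # float() ignores the leading/trailing spaces seen in the raw CSV values.
    return float(CURRENCY_CHARS.sub("", value))


# Sample values copied from the CampaignData.csv preview.
print(clean_currency("$11,564.00 "))  # 11564.0
print(clean_currency(" $6,497.00 "))  # 6497.0
print(clean_currency(None))           # 0.0

# str.title() capitalizes the first letter of each word and lowercases the rest.
print("Hit the track".title())        # Hit The Track
print("US".title())                   # Us
```

Note that title-casing also turns `US` into `Us`, which is why the Country column reads `Us` in the later outputs.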

## Data Transformation - Calculate Revenue Variance


In [3]:
# Create new column: target minus actual, so a positive value means revenue fell short of target
pd_df['Revenue_Variance'] = pd_df['Revenue_Target'] - pd_df['Revenue']

print(pd_df[1:5])

                Region Country  ... RoleID  Revenue_Variance
1  Southern California      Us  ...   NULL            -350.0
2           South East      Us  ...     20            5610.0
3           South East      Us  ...   NULL            8414.0
4  Southern California      Us  ...   NULL           -1097.0

[4 rows x 9 columns]
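
As a quick arithmetic check of the sign convention, the variance for rows 1 to 4 can be recomputed by hand. This is a small sketch in plain Python; the `rows` list and the `revenue_variance` function are illustrative names, and the figures are copied from the Revenue and Revenue_Target values shown in this notebook's outputs.

```python
# (Revenue, Revenue_Target) pairs for rows 1-4, copied from the notebook outputs.
rows = [
    (6497.0, 6147.0),    # Southern California, For Your Life
    (11620.0, 17230.0),  # South East, Hit The Track
    (9963.0, 18377.0),   # South East, Patient Stories
    (16850.0, 15753.0),  # Southern California, For Your Life
]


def revenue_variance(revenue, target):
    """Target minus actual: positive means the campaign fell short of its target."""
    return target - revenue


variances = [revenue_variance(revenue, target) for revenue, target in rows]
print(variances)  # [-350.0, 5610.0, 8414.0, -1097.0]
```

Negative values therefore mark campaigns that exceeded their revenue target.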

## Move data to Azure Data Lake Gen2


In [4]:
%%pyspark
df = spark.createDataFrame(pd_df)
df.show(5)

(df
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("csv")  # built-in CSV writer; "com.databricks.spark.csv" is the legacy package name
 .save('abfss://marketingdata@#STORAGE_ACCOUNT_NAME#.dfs.core.windows.net/Campaignsdata'))

+-------------------+-------+---------------+-------+--------------+-----------+----------+------+----------------+
|             Region|Country|  Campaign_Name|Revenue|Revenue_Target|       City|     State|RoleID|Revenue_Variance|
+-------------------+-------+---------------+-------+--------------+-----------+----------+------+----------------+
|         South East|     Us|Patient Stories|11564.0|       19306.0|      Miami|   Florida|  NULL|          7742.0|
|Southern California|     Us|  For Your Life| 6497.0|        6147.0|Los Angeles|California|  NULL|          -350.0|
|         South East|     Us|  Hit The Track|11620.0|       17230.0|      Miami|   Florida|    20|          5610.0|
|         South East|     Us|Patient Stories| 9963.0|       18377.0|      Miami|   Florida|  NULL|          8414.0|
|Southern California|     Us|  For Your Life|16850.0|       15753.0|Los Angeles|California|  NULL|         -1097.0|
+-------------------+-------+---------------+-------+--------------+----