# Hands on - Applied data engineering with Pandas

### ...or creating a simple ETL process

In this hands-on session, we will again work with the data from the ACM case. In the last module, some data scientists already invested time in engineering and wrangling this data.

Given our newly gained pandas skills, we now want to follow their path...


# 1) Importing Files

Import the survey data into pandas. The survey data is stored in three different sheets of the data file ("2019", "2020", and "2021"). Load each sheet into its own dataframe.


In [1]:
import pandas as pd

In [2]:
survey2019 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2019")

In [3]:
survey2020 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2020")

In [4]:
survey2021 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2021")

Have a look at the three dataframes. They all have the same structure and identical variable names. Paste them together into a new dataframe.

In [5]:
survey2019.head()

Unnamed: 0,DateAired,IndustryAdType,ProgramName,Spend,GRP,Impressions,gravity,relatability,heart,originality,adrenaline,smarts,passion,edge,Country,State
0,2019-04-13,Healthy Fast Food Chain,Planet Hypothesis,147051.0,9.658,11183903.0,-1.37,0.929,-0.434,1.595,-1.561,-1.007,-0.164,4.322,United States,Delaware
1,2019-04-13,Healthy Fast Food Chain,New Gal,387.0,0.163,188746.0,-0.938,0.88,0.45,-0.099,-1.244,-1.628,0.661,0.385,United States,Kentucky
2,2019-04-13,Healthy Fast Food Chain,Sister Home Sellers,78814.0,5.274,6106398.0,1.501,1.109,-0.56,0.399,1.185,-0.005,-0.101,0.149,United States,Hawaii
3,2019-04-13,Healthy Fast Food Chain,Freaky Vacations,178700.0,8.658,10026367.0,-1.483,0.036,0.39,-1.429,1.751,0.413,1.997,0.389,United States,Washington
4,2019-04-13,Healthy Fast Food Chain,Maui Five Ten,208500.0,5.241,6068858.0,-0.479,1.473,-0.205,0.327,0.773,-0.955,2.105,-0.217,United States,Florida


In [6]:
survey2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463 entries, 0 to 2462
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   DateAired       2463 non-null   datetime64[ns]
 1   IndustryAdType  2463 non-null   object        
 2   ProgramName     2463 non-null   object        
 3   Spend           2463 non-null   float64       
 4   GRP             2463 non-null   float64       
 5   Impressions     2463 non-null   float64       
 6   gravity         2463 non-null   float64       
 7   relatability    2463 non-null   float64       
 8   heart           2463 non-null   float64       
 9   originality     2463 non-null   float64       
 10  adrenaline      2463 non-null   float64       
 11  smarts          2463 non-null   float64       
 12  passion         2463 non-null   float64       
 13  edge            2463 non-null   float64       
 14  Country         2463 non-null   object        
 15  Stat

In [7]:
survey2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   DateAired       3900 non-null   datetime64[ns]
 1   IndustryAdType  3900 non-null   object        
 2   ProgramName     3900 non-null   object        
 3   Spend           3900 non-null   float64       
 4   GRP             3900 non-null   float64       
 5   Impressions     3900 non-null   float64       
 6   gravity         3900 non-null   float64       
 7   relatability    3900 non-null   float64       
 8   heart           3900 non-null   float64       
 9   originality     3900 non-null   float64       
 10  adrenaline      3900 non-null   float64       
 11  smarts          3900 non-null   float64       
 12  passion         3900 non-null   float64       
 13  edge            3900 non-null   float64       
 14  Country         3900 non-null   object        
 15  Stat

In [8]:
survey2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1483 entries, 0 to 1482
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   DateAired       1483 non-null   datetime64[ns]
 1   IndustryAdType  1483 non-null   object        
 2   ProgramName     1483 non-null   object        
 3   Spend           1483 non-null   float64       
 4   GRP             1483 non-null   float64       
 5   Impressions     1483 non-null   float64       
 6   gravity         1483 non-null   float64       
 7   relatability    1483 non-null   float64       
 8   heart           1483 non-null   float64       
 9   originality     1483 non-null   float64       
 10  adrenaline      1483 non-null   float64       
 11  smarts          1483 non-null   float64       
 12  passion         1483 non-null   float64       
 13  edge            1483 non-null   float64       
 14  Country         1483 non-null   object        
 15  Stat

Combine files row-wise or column-wise

*   set **axis=0** for row-wise combination
*   set **axis=1** for column-wise combination
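The difference between the two axes can be sketched on two toy dataframes. The values below are hypothetical and only illustrate the shapes that `pd.concat` produces; `ignore_index=True` is an optional extra that renumbers the rows after stacking.

```python
import pandas as pd

# Two toy frames with identical columns (hypothetical data, not the survey)
a = pd.DataFrame({"ProgramName": ["New Gal", "Maui Five Ten"],
                  "Spend": [387.0, 208500.0]})
b = pd.DataFrame({"ProgramName": ["Together Forever"],
                  "Spend": [38187.0]})

# axis=0 stacks the rows; ignore_index=True renumbers the index 0..n-1
rows = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligning them on the index
cols = pd.concat([a, b], axis=1)

print(rows.shape)  # (3, 2)
print(cols.shape)  # (2, 4)
```

Without `ignore_index=True`, each stacked frame keeps its original row labels, so index values can repeat in the combined dataframe.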

In [9]:
survey = pd.concat([survey2019, survey2020, survey2021], axis = 0)
survey

Unnamed: 0,DateAired,IndustryAdType,ProgramName,Spend,GRP,Impressions,gravity,relatability,heart,originality,adrenaline,smarts,passion,edge,Country,State
0,2019-04-13,Healthy Fast Food Chain,Planet Hypothesis,147051.0,9.658,11183903.0,-1.370,0.929,-0.434,1.595,-1.561,-1.007,-0.164,4.322,United States,Delaware
1,2019-04-13,Healthy Fast Food Chain,New Gal,387.0,0.163,188746.0,-0.938,0.880,0.450,-0.099,-1.244,-1.628,0.661,0.385,United States,Kentucky
2,2019-04-13,Healthy Fast Food Chain,Sister Home Sellers,78814.0,5.274,6106398.0,1.501,1.109,-0.560,0.399,1.185,-0.005,-0.101,0.149,United States,Hawaii
3,2019-04-13,Healthy Fast Food Chain,Freaky Vacations,178700.0,8.658,10026367.0,-1.483,0.036,0.390,-1.429,1.751,0.413,1.997,0.389,United States,Washington
4,2019-04-13,Healthy Fast Food Chain,Maui Five Ten,208500.0,5.241,6068858.0,-0.479,1.473,-0.205,0.327,0.773,-0.955,2.105,-0.217,United States,Florida
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1478,2021-05-22,Sport Utility Vehicle Sales,Planet Hypothesis,194193.0,14.693,17105095.0,-1.279,1.420,-0.749,0.530,-1.531,-1.209,0.255,0.667,United States,Illinois
1479,2021-05-22,Sport Utility Vehicle Sales,Waltzing with the Famous,48839.0,1.731,2014795.0,1.364,0.764,0.297,0.799,1.620,-2.017,-0.097,0.068,United States,Colorado
1480,2021-05-22,Sport Utility Vehicle Sales,Sister Home Sellers,3000.0,1.485,1728914.0,1.670,1.376,-0.608,-0.106,0.658,-0.389,-0.124,0.043,United States,New York
1481,2021-05-22,Sport Utility Vehicle Sales,Together Forever,38187.0,5.212,6065586.0,-0.878,0.150,-0.955,-0.505,-1.097,-1.271,-0.659,1.753,United States,Texas


Now also read in the intentionality results using an appropriate reading function. Watch out for the delimiter!



In [None]:
intentionality = pd.read_csv("https://raw.githubusercontent.com/casbdai/notebooks2023/main/Module2/DataEngineeringPandas/Pandas_TV_Intentionality_Data.csv", sep=";")
intentionality
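Why the delimiter matters can be seen in a small sketch that reads an in-memory sample instead of the real file. The two rows below are made up, and only the column names follow the intentionality data.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout
raw = (
    "date;IndustryAdType;ProgramName;Intentionality\n"
    "2019-04-13;Healthy Fast Food Chain;New Gal;0.5\n"
    "2019-04-13;Healthy Fast Food Chain;Maui Five Ten;0.7\n"
)

# Default sep="," finds no commas, so everything lands in one column
wrong = pd.read_csv(io.StringIO(raw))

# sep=";" splits the fields into four proper columns
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 4)
```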

We need to fix the variable type of "date"

In [None]:
intentionality.date = pd.to_datetime(______.______)
intentionality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7846 entries, 0 to 7845
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            7846 non-null   datetime64[ns]
 1   IndustryAdType  7846 non-null   object        
 2   ProgramName     7846 non-null   object        
 3   Intentionality  7846 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 245.3+ KB
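As a minimal sketch of the conversion, the toy column below holds dates as text and is turned into a datetime column. The column name `aired` and the day-first format are assumptions for illustration and do not describe the real file.

```python
import pandas as pd

# Hypothetical frame with dates stored as text
df = pd.DataFrame({"aired": ["13.04.2019", "22.05.2021"]})

# An explicit format avoids ambiguity between day-first and month-first dates
df["aired"] = pd.to_datetime(df["aired"], format="%d.%m.%Y")

# Datetime columns expose the .dt accessor
print(df["aired"].dt.year.tolist())  # [2019, 2021]
```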


In [None]:
gtrends = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_GTrends_Data.xlsx")
gtrends.info()

# 2) Merging Files

Now that the data is loaded, we want to combine it into one overarching data set. Be aware that the data needs to be joined on three variables: Industry Ad Type, Program Name, and date / Date Aired.

Perform an inner join of the data.

In [None]:
inner =  pd.merge(survey, intentionality,
                  how="inner",
                  left_on=["IndustryAdType", "ProgramName", "DateAired"],
                  right_on=["IndustryAdType", "ProgramName", "date"])

inner.info()
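How a join on several keys behaves when one key has a different name on each side can be sketched with toy data. The rows are hypothetical; the key names mirror the ones used above.

```python
import pandas as pd

# Hypothetical stand-ins for the survey and intentionality data
left_df = pd.DataFrame({
    "IndustryAdType": ["Fast Food", "Fast Food", "SUV Sales"],
    "ProgramName": ["New Gal", "Maui Five Ten", "New Gal"],
    "DateAired": pd.to_datetime(["2019-04-13", "2019-04-13", "2021-05-22"]),
    "Spend": [387.0, 208500.0, 3000.0],
})
right_df = pd.DataFrame({
    "IndustryAdType": ["Fast Food", "SUV Sales"],
    "ProgramName": ["New Gal", "New Gal"],
    "date": pd.to_datetime(["2019-04-13", "2021-05-22"]),
    "Intentionality": [0.5, 0.9],
})

# Keys are matched by position: DateAired on the left pairs with date on the right
merged = pd.merge(
    left_df, right_df,
    how="inner",
    left_on=["IndustryAdType", "ProgramName", "DateAired"],
    right_on=["IndustryAdType", "ProgramName", "date"],
)

print(len(merged))  # 2
print(merged.columns.tolist())
```

In this toy result, both `DateAired` and `date` are kept as separate columns because their names differ, while the identically named keys appear only once.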

Perform a left join of the data.

In [None]:
left =  pd.merge(________, ________,
                  how=________,
                  left_on=[________, ________, ________],
                  right_on=[________, ________,________])

left.info()

How many NaNs are introduced in the variable "Intentionality"? (You can use .info().)

Number of NaN: __
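Besides reading the non-null counts from .info(), missing values can be counted directly. The sketch below uses hypothetical toy data; `indicator=True` is an optional merge argument that records where each row came from.

```python
import pandas as pd

# Hypothetical data: only one of three programs has a score
ads = pd.DataFrame({"ProgramName": ["New Gal", "Maui Five Ten", "Together Forever"],
                    "Spend": [387.0, 208500.0, 38187.0]})
scores = pd.DataFrame({"ProgramName": ["New Gal"],
                       "Intentionality": [0.5]})

# A left join keeps every row of `ads`; unmatched rows get NaN
joined = pd.merge(ads, scores, how="left", on="ProgramName", indicator=True)

# Count the missing values in one column
n_missing = joined["Intentionality"].isna().sum()
print(n_missing)  # 2

# The _merge column shows which rows found no partner
print(joined["_merge"].value_counts())
```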

Which joining method would you use for combining the two dataframes? Why?

Your answer: __________________________

In [None]:
left =  pd.________(________, ________,
                  how=________,
                  left_on=[________, ________],
                  right_on=[________, ________])

left.info()

# 3) Dealing with NA

In order to practice our "dealing with missing data skills", we have decided to go with an outer join.

Create a new dataframe in which you have removed all missing values:

In [None]:
acmdata = left.dropna()
acmdata.info()

Create a new dataframe in which you insert 0 into the missing data fields of appropriate variables.

In [None]:
acmdata_0 = left.fillna(value=0)
acmdata_0.info()
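Filling every gap with 0 is not always appropriate, so it can help to treat columns individually. The sketch below uses hypothetical toy data; which columns count as "appropriate" for a 0 default is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in text and numeric columns
df = pd.DataFrame({
    "ProgramName": ["New Gal", "Maui Five Ten", None],
    "Spend": [387.0, np.nan, 3000.0],
    "Intentionality": [0.5, np.nan, 0.9],
})

# A dict fills only the named columns; all other gaps stay untouched
filled = df.fillna(value={"Spend": 0})

# subset= drops rows only when the listed columns are missing
scored = df.dropna(subset=["Intentionality"])

print(filled["Spend"].tolist())  # [387.0, 0.0, 3000.0]
print(len(scored))               # 2
```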

# 4) Transforming Variables

In the following exercises, we use the acmdata dataframe!

Rename the variable "Spend" to "Spend_in_000".

In [None]:
acmdata = acmdata.rename(columns={"Spend": "Spend_in_000"})
acmdata.info()

Delete the variable "date"

In [None]:
del acmdata["date_y"]
acmdata.info()

In [None]:
acmdata = acmdata.drop("date_x", axis = 1)
acmdata.info()
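Both deletion styles can be sketched on a toy frame. The column names below are hypothetical; `errors="ignore"` is an optional argument that skips names that do not exist instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical frame with a redundant date column
df = pd.DataFrame({
    "DateAired": pd.to_datetime(["2019-04-13", "2021-05-22"]),
    "date": pd.to_datetime(["2019-04-13", "2021-05-22"]),
    "Spend": [387.0, 3000.0],
})

# drop() returns a new frame and leaves df unchanged
trimmed = df.drop(columns=["date"])

# errors="ignore" tolerates column names that are not present
safe = df.drop(columns=["date", "not_there"], errors="ignore")

print(trimmed.columns.tolist())  # ['DateAired', 'Spend']
```

Checking `df.columns` before deleting is a cheap way to confirm the exact column names a merge has produced.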

Aggregate the acmdata data frame by "IndustryAdType" using .mean()

In [None]:
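# numeric_only=True skips text columns, which cannot be averaged (pandas 2.x raises a TypeError otherwise)
acmdata.groupby("IndustryAdType").mean(numeric_only=True)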
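A minimal sketch of a grouped mean on toy data, assuming a mix of text and numeric columns as in the merged dataframe. The values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: one grouping key, one text column, two numeric columns
df = pd.DataFrame({
    "IndustryAdType": ["Fast Food", "Fast Food", "SUV Sales"],
    "ProgramName": ["New Gal", "Maui Five Ten", "New Gal"],
    "Spend": [100.0, 300.0, 50.0],
    "GRP": [1.0, 3.0, 2.0],
})

# numeric_only=True restricts the mean to numeric columns
means = df.groupby("IndustryAdType").mean(numeric_only=True)

print(means.columns.tolist())            # ['Spend', 'GRP']
print(means.loc["Fast Food", "Spend"])   # 200.0
```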

Aggregate the acmdata dataframe by "Industry Ad Type" and "Program Name" using .sum()

In [None]:
acmdata.groupby([________, ________]).sum()

Again, aggregate the acmdata dataframe by "Industry Ad Type" and "Program Name" using .sum(). However, you are only interested in the "Spend_in_000" (formerly "Spend") and "Impressions" data.

In [None]:
________
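The general pattern of grouping by two keys and aggregating only selected columns can be sketched on unrelated toy data. The sales columns below are hypothetical and are not part of the ACM data.

```python
import pandas as pd

# Hypothetical sales data with one text column that should not be summed
df = pd.DataFrame({
    "Region": ["North", "North", "North", "South"],
    "Product": ["A", "A", "B", "A"],
    "Units": [1, 2, 3, 4],
    "Revenue": [10.0, 20.0, 30.0, 40.0],
    "Notes": ["x", "y", "z", "w"],
})

# A list of keys groups by their combinations; the double brackets
# select the columns to aggregate
totals = df.groupby(["Region", "Product"])[["Units", "Revenue"]].sum()

# reset_index() turns the group keys back into ordinary columns
flat = totals.reset_index()

print(totals.loc[("North", "A"), "Units"])  # 3
print(flat.columns.tolist())                # ['Region', 'Product', 'Units', 'Revenue']
```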

### Meaningful plots: Combining aggregations and .plot()

To create more meaningful, Tableau-like plots in Python, combine aggregations with the .plot() method.

In [None]:
acmdata.groupby(______)["Spend_in_000"].sum().plot()

Create a bar plot of Spend by Program Name.

In [None]:
acmdata.groupby([______])["Spend_in_000"].sum().______
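The aggregate-then-plot pattern can be sketched on unrelated toy data. The columns are hypothetical, and the `matplotlib.use("Agg")` line is an assumption for running the sketch without a display; it should be omitted inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in Colab/Jupyter
import pandas as pd

# Hypothetical revenue data
df = pd.DataFrame({
    "Region": ["North", "South", "North"],
    "Revenue": [100.0, 300.0, 50.0],
})

# Aggregate first, then sort so the largest bar comes first
totals = df.groupby("Region")["Revenue"].sum().sort_values(ascending=False)

# kind= selects the chart type; plot() returns a matplotlib Axes
ax = totals.plot(kind="bar", title="Revenue by region")
ax.set_ylabel("Revenue")
```

Sorting the aggregated values before plotting usually makes a bar chart easier to read than the default alphabetical order of the groups.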

# 5) Writing the Data File

Now, write the merged and tidied data to an Excel file.

In [None]:
acmdata.to_excel("acmdata.xlsx", index=False)

In [None]:
from google.colab import files
files.download('acmdata.xlsx')

Or write the data into an SQL database

In [None]:
import sqlalchemy as db

engine = db.create_engine("sqlite:///cleaned_database")
engine.connect()

acmdata.to_sql('clean_acm_data', con=engine, if_exists="replace", index=False)

inspector = db.inspect(engine)
inspector.get_table_names()

['clean_acm_data']
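To check that a write succeeded, the table can be read back into pandas. The sketch below is a simplified variant under stated assumptions: it uses the standard-library `sqlite3` module with an in-memory database instead of the SQLAlchemy engine above, and the two rows are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the cleaned data
df = pd.DataFrame({
    "ProgramName": ["New Gal", "Maui Five Ten"],
    "Spend_in_000": [387.0, 208500.0],
})

# In-memory database, so the sketch leaves no file behind
con = sqlite3.connect(":memory:")
df.to_sql("clean_acm_data", con=con, if_exists="replace", index=False)

# Read the table back with a plain SQL query
back = pd.read_sql("SELECT * FROM clean_acm_data ORDER BY Spend_in_000", con)
con.close()

print(back.shape)  # (2, 2)
```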