# Reading a datafeed file
To start, let's read in a sample datafeed file:

In [4]:
# Point R at the folder that holds the example files (adjust this path for your machine)
setwd("/Users/tpaulsen/Desktop/Product Management/Summit/2019/R/")
data_feed = read.table(file = "example_datafeed.tab", sep="\t", header = TRUE, na.strings = "", stringsAsFactors = FALSE)
head(data_feed)

visitor_id_hi,visitor_id_lo,visit_num,hit_time_gmt,post_event_list,post_campaign,ip,user_id
f73fe8ccf061718bf6de70f8ef6c1484,66360b670b0308e75eeeec6abeb15639,1,1517568611,20.0,campaign9,ip_address6351,
0198f59332915116862c6a1889538be2,0e303f38500e9f0f384451334a9e6bc3,1,1517569668,,,ip_address6545,user_id3795
457588336531e4208f3647e33fdb6028,a73fb9e74fee58fab7b64970aee739da,1,1517569678,20.0,campaign2233,ip_address5805,user_id1008
af5a0bbfd451ed2802265198f58688b4,278413a6033d4cea84136e6e495b5cfc,1,1517569662,20.0,campaign2240,ip_address6535,
af5a0bbfd451ed2802265198f58688b4,278413a6033d4cea84136e6e495b5cfc,1,1517569664,,campaign2240,ip_address6535,user_id2971
9741b5a62c07dbb2e606f5284e971685,d602a1427f25bab96a4dc0a063586603,1,1517569666,220.0,campaign89,ip_address6535,
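
The blank cells in the output above come from `na.strings = ""`, which makes `read.table` treat empty fields as `NA`. Here is a minimal sketch of that behavior on a tiny inline sample instead of the real file. The three rows and the name `sample_feed` are made up for illustration, and the timestamp conversion assumes `hit_time_gmt` holds Unix epoch seconds, which its name and values suggest.

```r
# Hypothetical three-row sample standing in for example_datafeed.tab
sample_text <- "visit_num\thit_time_gmt\tpost_campaign\tuser_id
1\t1517568611\tcampaign9\t
1\t1517569668\t\tuser_id3795
1\t1517569678\tcampaign2233\tuser_id1008"

# Same arguments as the real read above, but reading from a string
sample_feed <- read.table(text = sample_text, sep = "\t", header = TRUE,
                          na.strings = "", stringsAsFactors = FALSE)

# Empty fields are now NA rather than empty strings
print(is.na(sample_feed$user_id))
print(is.na(sample_feed$post_campaign))

# Assumption: hit_time_gmt is seconds since 1970-01-01 UTC
sample_feed$hit_time <- as.POSIXct(sample_feed$hit_time_gmt,
                                   origin = "1970-01-01", tz = "UTC")
print(sample_feed)
```

Reading missing values as `NA` up front means later steps can use `is.na()` instead of comparing against empty strings.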


# Reading a classification file
Read in a sample classification file:

In [7]:
classification = read.table(file="example_classification.tab", sep="\t", header = TRUE, stringsAsFactors = FALSE)
head(classification)

post_campaign,channel
campaign1,Email
campaign2,Email
campaign3,Email
campaign4,Email
campaign5,Email
campaign6,Email


# Dplyr basics
Now let's talk about dplyr!
dplyr is a very popular R package that boils complex data manipulation tasks down into a series of verbs. It's similar to SQL, but usually easier to read, and you can do a lot more with much less code.

Go here for more examples and information: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

In [8]:
# Install dplyr using install.packages("dplyr")
library(dplyr)

# FILTER: filter the classification to just the Email campaigns
email_campaigns = classification %>%
  filter(channel == "Email")

# ARRANGE: sort the datafeed by hit time gmt
sorted_data_feed = data_feed %>%
  arrange(hit_time_gmt)

# SELECT: select only certain columns of a datafeed (like a SQL select) and rename them
couple_columns_from_data_feed = data_feed %>%
  select(
    uuid = user_id,
    timestamp = hit_time_gmt
  )

# MUTATE: create a new column using some calculation
data_feed_with_modifications = data_feed %>%
  mutate(
    # concatenate visitor_id_hi and visitor_id_lo into a single visitor ID
    visitor_id = paste0(visitor_id_hi, "_", visitor_id_lo)
  )

# GROUP BY + SUMMARISE: perform operations on groupings of data
visitor_level_aggregations = data_feed_with_modifications %>%
  group_by(visitor_id) %>%
  summarise(
    rows = n(),
    visits = n_distinct(visit_num)
  )
head(visitor_level_aggregations)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



visitor_id,rows,visits
000063701a4af5d23b3c3b5e71638d22_6e2b54b479fb70ee940f8948b30a14ee,4,1
0003504c09f514587123f1ec5db3691a_97d1cc744fac1d7dcf82b5e165fe7650,99,3
000a1d529ffdcefaccc2082d9e50caf2_06cea78ea3363b4eb5fe0057284142de,1,1
000f3c967b5896cafa65383fbed6ef2b_7c8dacd46bcd1a1155dcffd79dde3e10,2,1
00102093fba32a4e93164af8a5fa4782_38978c46a68b216ef7218a393e2e0719,7,1
0010fa00adc3668ff8149e0cc5b63df9_e19ffb3b9cf1ca3fc8eb7d32b28f5aec,8,1
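
The verbs above were run one at a time, but they can also be chained into a single pipeline. The sketch below does that on a toy data frame so it runs without the example files. `toy_feed`, its values, and the extra `first_hit` column are made up for illustration.

```r
library(dplyr)

# Hypothetical toy datafeed; the values are invented for this example
toy_feed <- data.frame(
  visitor_id_hi = c("a", "a", "a", "b"),
  visitor_id_lo = c("1", "1", "1", "2"),
  visit_num     = c(1, 1, 2, 1),
  hit_time_gmt  = c(30, 10, 20, 40),
  stringsAsFactors = FALSE
)

toy_summary <- toy_feed %>%
  # MUTATE: build a single visitor ID from the two halves
  mutate(visitor_id = paste0(visitor_id_hi, "_", visitor_id_lo)) %>%
  # GROUP BY + SUMMARISE: one output row per visitor
  group_by(visitor_id) %>%
  summarise(
    rows = n(),
    visits = n_distinct(visit_num),
    first_hit = min(hit_time_gmt)
  ) %>%
  # ARRANGE: visitors with the most rows first
  arrange(desc(rows))

print(toy_summary)
```

Each verb takes a data frame and returns a data frame, which is why the steps compose with `%>%` without intermediate variables.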


# Connecting dplyr to an external database
The best part about dplyr is that it isn't limited to data frames on your local machine: you can use it to interface with nearly any system that accepts SQL commands!

This means that with the same dplyr commands shown above, you can run the same operations at scale on databases like MySQL, MariaDB, Postgres interfaces (including Amazon Redshift), SQLite, anything reachable through odbc, or Google BigQuery. You can also interface with Adobe's Cloud Platform this way!

To make it work, install the "dbplyr" package with install.packages("dbplyr") in addition to dplyr. See this link for more information: https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html

In [None]:
# First, create a connection to your database of choice. Each database has its own
# connection parameters, but the first argument is always the database driver.

# Typical example (a MySQL server; SQLite is file-based and takes no host, user, or password):
con = DBI::dbConnect(RMySQL::MySQL(),
  host = "database.host.com",
  user = "myuser",

  # Putting your password in a string isn't safe! RStudio gives this handy option:
  password = rstudioapi::askForPassword("Database password")
)

# Here's a bunch of options you can put as the first parameter:
RSQLite::SQLite()
RMySQL::MySQL()
RPostgreSQL::PostgreSQL()
odbc::odbc()
bigrquery::bigquery()
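
To try the database workflow without a server, one option is an in-memory SQLite database. The sketch below assumes the DBI, RSQLite, and dbplyr packages are installed. The table name `toy_feed` and its contents are made up for illustration. It copies a small data frame into SQLite, runs the same group-and-summarise verbs against the remote table, shows the SQL that gets generated, and pulls the result back into R.

```r
library(dplyr)

# In-memory SQLite database: no host, user, or password needed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data, copied into the database as a table
toy_feed <- data.frame(
  visitor_id = c("a_1", "a_1", "a_1", "b_2"),
  visit_num  = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)
copy_to(con, toy_feed, "toy_feed")

# tbl() creates a reference to the remote table; no rows are pulled yet
remote_feed <- tbl(con, "toy_feed")

visitor_query <- remote_feed %>%
  group_by(visitor_id) %>%
  summarise(
    rows = n(),
    visits = n_distinct(visit_num)
  ) %>%
  arrange(visitor_id)

# Inspect the SQL that dbplyr generates from the verbs above
show_query(visitor_query)

# collect() runs the query and returns the result as a local data frame
visitor_result <- collect(visitor_query)
print(visitor_result)

DBI::dbDisconnect(con)
```

Nothing is executed in the database until `collect()` (or printing) asks for rows, so the pipeline can be built up step by step without moving data.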

If you use an on-premise setup, you can connect dplyr directly to a Spark SQL backend using a package called "sparklyr", which is also amazing. You can read all about that here: https://spark.rstudio.com/

For the rest of this session, we'll just use a local file so that results can be easily reproduced, but I'll add comments to show you how you'd do things if connected to an external database.
