# SSC Data Science and Analytics Workshop 2022

### Intro to Databases in Industry: Data Cleaning, Querying, and Modeling at Scale
---------------

SQL is powerful, fast, and reliable. But unfortunately, queries can quickly become complex, even for routine data wrangling. 

Languages like R and Python have powerful packages, such as `tidyverse` and `pandas`, that are designed to facilitate wrangling and cleaning data. The disadvantage is that they work on data loaded into memory, so on large tables they are usually not as fast as letting the database do the work. Luckily for us, we can connect R directly to the database. Not only that, but we can use the usual `tidyverse` verbs, and `dbplyr` will generate the SQL queries for us! So we can have the best of both worlds!

In this part of the workshop, we will explore the R$\leftrightarrow$SQL interface. 




## 1. Connecting `R` to a database 

We will use the `DBI` package to connect R to a database. There are many different database management systems (DBMS) out there (e.g., Oracle, Microsoft SQL Server, PostgreSQL, MySQL). Although all these DBMS are similar, they differ in some details. For this reason, we need to tell the `DBI` package which DBMS we want to connect to. In our case, we are using PostgreSQL, so we need to install the PostgreSQL backend for DBI, which is the package `RPostgres`.

Finally, the package [dbplyr](https://dbplyr.tidyverse.org/) provides the `dplyr` interface to the database and converts `dplyr` verbs into SQL queries. How does that work? From the user's point of view, very much as if you had loaded the tables into R as data frames.
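To make the translation concrete, here is a minimal sketch that needs no database at all. It assumes the `dplyr` and `dbplyr` packages are installed; `toy_movies` is a hypothetical, empty stand-in table (not the workshop's `movies` table), and `simulate_postgres()` only asks `dbplyr` to generate PostgreSQL-flavoured SQL.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for a database table; no connection is opened.
toy_movies <- lazy_frame(
  title = character(),
  start_year = integer(),
  rating = double(),
  con = simulate_postgres()
)

# Ordinary dplyr verbs; nothing is executed, only a query is built.
toy_query <- toy_movies %>%
  filter(start_year > 2000) %>%
  select(title, rating)

# Print the SQL that dbplyr generated for the pipeline above.
show_query(toy_query)

# The same SQL as a plain character string.
generated_sql <- as.character(sql_render(toy_query))
```

The printed query should contain a `SELECT` of the two kept columns and a `WHERE` clause for the filter; the exact formatting depends on the installed `dbplyr` version.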


Let's start by creating the connection.

In [32]:
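library(tidyverse) # dbplyr is installed with the tidyverse; dplyr loads it when needed
library(DBI)       # generic database interface (dbConnect, dbListTables, ...)
library(RPostgres) # PostgreSQL backend for DBI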

**Exercise 1.1** 

Connect R to the `imdb` database located at `ssc-2022-workshop.ct6ghoz7smhy.us-east-1.rds.amazonaws.com`. 
Your username is `ssc_workshop` and your password is `sql_for_ds`.

In [137]:
# connection = dbConnect(
#     drv = Postgres(), 
#     user = ..., 
#     password = ..., 
#     port = 5432, # this is the default port for postgres 
#     dbname = ..., 
#     host = ...)

Congratulations!!! R is now connected to the database. 

## 2. Retrieving data from a database with R

Now that we have the connection ready to go, we can pull data from the database. But before we start pulling data from tables, it is useful to get some information about the database itself (e.g., which tables it contains and which fields each table has):

- `dbListTables(connection)`: lists all tables in the database accessed through `connection`;
- `dbListFields(connection, table_name)`: lists all columns of `table_name` in the database.
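As a small illustration of these two functions, the sketch below uses an in-memory SQLite database as a stand-in for the workshop's PostgreSQL server. This assumes the `RSQLite` package is installed; `toy_con` and `toy_movies` are hypothetical names used only here.

```r
library(DBI)

# Assumption: RSQLite is available; ":memory:" creates a throwaway database.
toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create one small table so there is something to list.
dbWriteTable(
  toy_con, "toy_movies",
  data.frame(title = c("A", "B"), start_year = c(2001L, 1950L))
)

tables <- dbListTables(toy_con)               # names of all tables
fields <- dbListFields(toy_con, "toy_movies") # column names of one table

print(tables)
print(fields)

dbDisconnect(toy_con)
```

The same two calls work unchanged on a PostgreSQL connection, since they are DBI generics implemented by each backend.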

**Exercise 2.1**

List all tables of the `imdb` database. 


In [138]:
# Your code goes here. 

**Exercise 2.2**

List all columns of the `movies` relation in the `imdb` database. (Note: relation is just another name for table in the database literature.)


In [139]:
# Your code goes here. 


### 2.1 Wrangling data with `dbplyr`

With `dbplyr`, we can work with a database table as if it were loaded into memory (but it isn't!).

To "read" a table from a database we can use the [dplyr::tbl](https://dplyr.tidyverse.org/reference/tbl.html) function. 

**Example**

Read the `movies` table from the `imdb` database.

In [36]:
(movies <- tbl(connection, 'movies'))

# Source:   table<movies> [?? x 8]
# Database: postgres [postgres@localhost:5432/imdb]
         id title         orig_title   start_year end_year runtime rating nvotes
      <int> <chr>         <chr>             <int>    <int>   <int>  <dbl>  <int>
 1 10035423 Kate & Leopo~ NA                 2001       NA     118    6.4  74982
 2 10042742 Mister 880    NA                 1950       NA      90    7.1   1171
 3 10041181 Black Hand    NA                 1950       NA      92    6.4    666
 4 10041387 Francis       NA                 1950

Now we can treat the `movies` variable like a regular tibble that was loaded into memory (although, again, it isn't) and use all the usual `tidyverse` verbs to wrangle and explore the data.

**Exercise 2.1.1**

What are the top rated movies produced after 2000 with more than 500 votes? Remove the `id`, `orig_title` and `end_year` columns. 

In [140]:
# Your code goes here. 
#top_rated_movies <- ...

All evaluations are lazy when using `dbplyr` as the backend of `dplyr` (i.e., the data is not retrieved until requested). So what the command actually does is generate the SQL code. 

We can check the generated SQL code using the `show_query` function. 

**Example**

In [38]:
top_rated_movies %>% 
    show_query()

<SQL>
SELECT *
FROM (SELECT "title", "start_year", "runtime", "rating", "nvotes"
FROM "movies"
WHERE ("start_year" > 2000.0 AND "nvotes" > 500.0)
LIMIT 10) "q01"
ORDER BY "rating" DESC


We can always call the `collect` function to collect the data from the database immediately. 

**Example**

In [39]:
top_rated_movies %>% 
    collect()

| title | start_year | runtime | rating | nvotes |
|---|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<dbl>` | `<int>` |
| The Lord of the Rings: The Fellowship of the Ring | 2001 | 178 | 8.8 | 1537080 |
| Frida | 2002 | 123 | 7.4 | 75612 |
| Corpse Bride | 2005 | 77 | 7.3 | 226501 |
| The Other Side of the Wind | 2018 | 122 | 6.9 | 4904 |
| The Dancer Upstairs | 2002 | 132 | 6.9 | 6117 |
| From Hell | 2001 | 122 | 6.8 | 140669 |
| The Shipping News | 2001 | 111 | 6.7 | 31012 |
| Star Wars: Episode II - Attack of the Clones | 2002 | 142 | 6.6 | 584616 |
| Kate & Leopold | 2001 | 118 | 6.4 | 74982 |
| Men in Black II | 2002 | 88 | 6.2 | 320765 |
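The lazy-then-collect workflow can be sketched end to end on toy data. The example assumes `dplyr`, `dbplyr`, and `RSQLite` are installed; `memdb_frame()` copies a small, hypothetical table into an in-memory SQLite database, so the titles and numbers below are made up and are not taken from `imdb`.

```r
library(dplyr)
library(dbplyr)

# Hypothetical toy table stored in an in-memory SQLite database.
toy_movies <- memdb_frame(
  title = c("A", "B", "C"),
  start_year = c(1999L, 2005L, 2010L),
  rating = c(7.1, 8.2, 6.5),
  nvotes = c(1000L, 200L, 900L)
)

# Building the pipeline only records the query; no rows are fetched yet.
lazy_result <- toy_movies %>%
  filter(start_year > 2000, nvotes > 500) %>%
  arrange(desc(rating))

class(lazy_result)       # a lazy table, not a data frame
show_query(lazy_result)  # the SQL that would be sent to the database

# collect() runs the query and returns an ordinary tibble in memory.
local_result <- collect(lazy_result)
print(local_result)
```

Until `collect()` is called, functions that need the full data in R are not applicable to `lazy_result`; after it, `local_result` behaves like any other tibble.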


In [40]:
(principals <- tbl(connection, 'principals'))

# Source:   table<principals> [?? x 3]
# Database: postgres [postgres@localhost:5432/imdb]
   movie_id ordering  name_id
      <int>    <int>    <int>
 1 10035423        1 20000212
 2 10035423        2 20413168
 3 10035423        3 20000630
 4 10035423        4 20005227
 5 10035423        5 20003506
 6 10035423        6 20737216
 7 10035423        7 20465298
 8 10035423        8 20448843
 9 100

**Exercise 2.1.2**

What are the median running times and the average ratings of movies in each genre in the `movie_genres` table? Check the SQL code generated by `dbplyr`, and collect the data.

In [141]:
# Your code goes here
# genres <- ...(..., 'movie_genres')
# ...

    

### 2.2 Writing your own SQL query

If you need to write your own queries, you can use the `DBI::dbGetQuery` function, which returns a `data.frame`. 

**Example**

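Retrieve the movies with the word `science` in the title. Note that the pattern used here matches `science` case-insensitively and only when it is surrounded by spaces, so titles that start or end with the word are not matched.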

In [63]:
my_query <- "
SELECT title, start_year 
  FROM movies
  WHERE title ILIKE '% science %';
"
dbGetQuery(connection, my_query)

| title | start_year |
|---|---|
| `<chr>` | `<int>` |
| My Science Project | 1985 |
| Mystery Science Theater 3000: The Movie | 1996 |
| Bill Nye: Science Guy | 2017 |
| The Science of Sleep | 2006 |
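How the pattern is written changes which rows come back. The sketch below illustrates this with `dbGetQuery` on an in-memory SQLite stand-in, assuming `RSQLite` is installed. SQLite has no `ILIKE`, but its `LIKE` is case-insensitive for ASCII text by default, so it plays the same role here. Apart from the first title, the titles and years are hypothetical.

```r
library(DBI)

toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy_con, "toy_movies", data.frame(
  title = c("My Science Project", "Science First", "Last Science", "Conscience"),
  start_year = c(1985L, 2001L, 2002L, 2003L)  # years after the first are made up
))

# Word surrounded by spaces: misses titles that start or end with it.
spaced <- dbGetQuery(
  toy_con,
  "SELECT title, start_year FROM toy_movies WHERE title LIKE '% science %';"
)

# Substring match: finds every title, including the unrelated 'Conscience'.
loose <- dbGetQuery(
  toy_con,
  "SELECT title, start_year FROM toy_movies WHERE title LIKE '%science%';"
)

print(spaced)
print(loose)
dbDisconnect(toy_con)
```

Both results are plain `data.frame` objects, so they can be passed straight to the usual `tidyverse` verbs.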


## 3. Cleaning Data
##### Suggested reading: [Tidy Data by Hadley Wickham](https://www.jstatsoft.org/article/view/v059i10/)

The task of cleaning and preparing data for analysis is broad and may involve many steps, such as:
- outlier detection;
- fixing typos;
- date parsing;
- imputing missing data;
- properly structuring the data for analysis.

Here we will focus on the last point. 


In this workshop, we looked at very organized data sets stored in our databases. What are the characteristics of the tables we used? 
- Each column corresponds to one variable;
- Each row corresponds to one unit;
- Each table corresponds to one entity.
  - For example, in the `imdb` database, we have a table for movies (an entity), another table for people, another one for genres, etc.


### 3.1 Activity

Let's take a look at the Gapminder dataset with information about countries. Do you see any problem with this dataset?

In [128]:
con_world = dbConnect(
    drv = Postgres(), 
    user = "ssc_workshop", 
    password = "sql_for_ds", 
    port = 5432, # this is the default port for postgres 
    dbname = "world", 
    host = "ssc-2022-workshop.ct6ghoz7smhy.us-east-1.rds.amazonaws.com")

(gapminder_raw <- 
    tbl(con_world, 'gapminder') %>%
    collect())

| country | variable | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Afghanistan/Asia | pop | 8.425333e+06 | 9.240934e+06 | 1.026708e+07 | 1.153797e+07 | 1.307946e+07 | 1.488037e+07 | 1.288182e+07 | 1.386796e+07 | 1.631792e+07 | 2.222742e+07 | 2.526841e+07 | 3.188992e+07 |
| Afghanistan/Asia | gdpPercap | 7.794453e+02 | 8.208530e+02 | 8.531007e+02 | 8.361971e+02 | 7.399811e+02 | 7.861134e+02 | 9.780114e+02 | 8.523959e+02 | 6.493414e+02 | 6.353414e+02 | 7.267341e+02 | 9.745803e+02 |
| Afghanistan/Asia | lifeExp | 2.880100e+01 | 3.033200e+01 | 3.199700e+01 | 3.402000e+01 | 3.608800e+01 | 3.843800e+01 | 3.985400e+01 | 4.082200e+01 | 4.167400e+01 | 4.176300e+01 | 4.212900e+01 | 4.382800e+01 |
| Albania/Europe | pop | 1.282697e+06 | 1.476505e+06 | 1.728137e+06 | 1.984060e+06 | 2.263554e+06 | 2.509048e+06 | 2.780097e+06 | 3.075321e+06 | 3.326498e+06 | 3.428038e+06 | 3.508512e+06 | 3.600523e+06 |
| Albania/Europe | gdpPercap | 1.601056e+03 | 1.942284e+03 | 2.312889e+03 | 2.760197e+03 | 3.313422e+03 | 3.533004e+03 | 3.630881e+03 | 3.738933e+03 | 2.497438e+03 | 3.193055e+03 | 4.604212e+03 | 5.937030e+03 |
| Albania/Europe | lifeExp | 5.523000e+01 | 5.928000e+01 | 6.482000e+01 | 6.622000e+01 | 6.769000e+01 | 6.893000e+01 | 7.042000e+01 | 7.200000e+01 | 7.158100e+01 | 7.295000e+01 | 7.565100e+01 | 7.642300e+01 |
| Algeria/Africa | pop | 9.279525e+06 | 1.027086e+07 | 1.100095e+07 | 1.276050e+07 | 1.476079e+07 | 1.715280e+07 | 2.003375e+07 | 2.325496e+07 | 2.629837e+07 | 2.907202e+07 | 3.128714e+07 | 3.333322e+07 |
| Algeria/Africa | gdpPercap | 2.449008e+03 | 3.013976e+03 | 2.550817e+03 | 3.246992e+03 | 4.182664e+03 | 4.910417e+03 | 5.745160e+03 | 5.681359e+03 | 5.023217e+03 | 4.797295e+03 | 5.288040e+03 | 6.223367e+03 |
| Algeria/Africa | lifeExp | 4.307700e+01 | 4.568500e+01 | 4.830300e+01 | 5.140700e+01 | 5.451800e+01 | 5.801400e+01 | 6.136800e+01 | 6.579900e+01 | 6.774400e+01 | 6.915200e+01 | 7.099400e+01 | 7.230100e+01 |
| Angola/Africa | pop | 4.232095e+06 | 4.561361e+06 | 4.826015e+06 | 5.247469e+06 | 5.894858e+06 | 6.162675e+06 | 7.016384e+06 | 7.874230e+06 | 8.735988e+06 | 9.875024e+06 | 1.086611e+07 | 1.242048e+07 |
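One possible way to bring a table of this shape into the "one variable per column, one unit per row" form is sketched below. It is only an illustration, not the workshop's official solution. It assumes `dplyr` and `tidyr` are installed and works on `toy_raw`, a hypothetical three-row extract that copies the Afghanistan values for 1952 and 1957 from the output above.

```r
library(dplyr)
library(tidyr)

# Small extract with the same layout as the raw table (two year columns only).
toy_raw <- tibble(
  country  = rep("Afghanistan/Asia", 3),
  variable = c("pop", "gdpPercap", "lifeExp"),
  `1952`   = c(8425333, 779.4453, 28.801),
  `1957`   = c(9240934, 820.8530, 30.332)
)

toy_tidy <- toy_raw %>%
  # Two variables share one column: split country and continent.
  separate(country, into = c("country", "continent"), sep = "/") %>%
  # Year values are stored as column names: move them into a `year` column.
  pivot_longer(
    cols = c(`1952`, `1957`),
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.integer)
  ) %>%
  # Variable names are stored as cell values: give each its own column.
  pivot_wider(names_from = variable, values_from = value)

print(toy_tidy)
```

After these three steps each row describes one country in one year, with `pop`, `gdpPercap`, and `lifeExp` as separate columns.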


### 3.2 Writing into a database from R

In [None]:
?dbWriteTable
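A minimal sketch of `dbWriteTable` in use follows. It writes to an in-memory SQLite database rather than the workshop server, on the assumption that `RSQLite` is installed and that workshop accounts may not have write permission. `toy_con`, `toy_countries`, and `toy_df` are hypothetical names.

```r
library(DBI)

toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")

toy_df <- data.frame(
  country   = c("Afghanistan", "Albania"),
  continent = c("Asia", "Europe")
)

# Create the table from a data frame.
dbWriteTable(toy_con, "toy_countries", toy_df)

# Writing to an existing table needs an explicit choice:
dbWriteTable(toy_con, "toy_countries", toy_df, overwrite = TRUE) # replace contents
dbWriteTable(toy_con, "toy_countries", toy_df, append = TRUE)    # add rows

table_exists <- dbExistsTable(toy_con, "toy_countries")
read_back <- dbReadTable(toy_con, "toy_countries")
print(read_back)

dbDisconnect(toy_con)
```

Without `overwrite = TRUE` or `append = TRUE`, a second write to the same table name stops with an error, which protects existing data from accidental replacement.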