# Course Description
This course builds on what you learned in Data Manipulation in R with dplyr by showing you how to combine data sets with dplyr's two-table verbs. In the real world, data comes split across many data sets, but dplyr's core functions are designed to work with single tables of data. In this course, you'll learn the best ways to combine data sets into single tables. You'll learn how to augment columns from one data set with columns from another with mutating joins, how to filter one data set against another with filtering joins, and how to sift through data sets with set operations. Along the way, you'll discover the best practices for building data sets and troubleshooting joins with dplyr. Afterwards, you’ll be well on your way to data manipulation mastery!

# 1) Mutating joins
Mutating joins add new variables to one dataset from another dataset, matching observations across rows in the process. This chapter will explain the various ways you can join datasets together and what happens when you do.

We will see how to join our datasets with:

    left_join & right_join
    inner_join
    full_join

or, if we want to filter one dataset against another, with the filtering joins covered in chapter 2:

    semi_join
    anti_join

## 1.1) Keys
A key is a column, or combination of columns, that occurs in each of the tables you want to join. dplyr completes the join by matching rows that have the same key values.

The key in the first dataset is the primary key, and the matching key in the second dataset is the secondary key. We will join our datasets through these keys.

## 1.2) Joins
The basic join function in dplyr is `left_join()`. To use it, we need:

    left_join(first_dataset, second_dataset, by = "key column name(s)")

To join on multiple keys, pass a character vector of column names:

    left_join(names, plays, by = c("name", "surname"))
    
### 1.2.1) A basic join
left_join() is the basic join function in dplyr. You can use it whenever you want to augment a data frame with information from another data frame.

For example, left_join(x, y) joins y to x. The second dataset you specify is joined to the first dataset. Keep that in mind as you go through the course.

Example:

    # Complete the code to join artists to bands
    bands2 <- left_join(bands, artists, by = c("first","last"))
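
As a minimal sketch of what this join returns, the example below builds two small tables and joins them. The rows are made-up stand-ins, not the course's actual `bands` and `artists` data.

```r
library(dplyr)

# Hypothetical toy data; not the course's real tables
bands <- tibble(
  first = c("John", "Mick", "Jimmy"),
  last  = c("Lennon", "Jagger", "Page"),
  band  = c("The Beatles", "The Rolling Stones", "Led Zeppelin")
)
artists <- tibble(
  first      = c("John", "Jimmy", "Tom"),
  last       = c("Lennon", "Page", "Jones"),
  instrument = c("Guitar", "Guitar", "Vocals")
)

# Keep every row of bands and add instrument where first and last match
bands2 <- left_join(bands, artists, by = c("first", "last"))
```

Every row of `bands` is kept. A band member with no match in `artists` gets `NA` in the `instrument` column, and rows that appear only in `artists` are dropped.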

### 1.2.2) A right join
There is more than one way to execute a left join. Knowing multiple methods will make you a more versatile data scientist, especially as you try to fit joins into pipes created with %>%.

    # Finish the code below to recreate bands3 with a right join
    bands2 <- left_join(bands, artists, by = c("first", "last"))
    bands3 <- right_join(artists, bands, by = c("first", "last"))

    # Check that bands3 is equal to bands2
    setequal(bands2, bands3)
    
## 1.3) Variations on joins

### 1.3.1) Inner joins and full joins
You may have noticed that some of the songs in songs correspond to some of the albums in albums. Suppose you want a new dataset that contains all of the songs for which you have data from both albums and songs. How would you make it?

The artists and bands datasets also share some information. What if you want to join these two datasets in such a way that you retain all of the information available in both tables, without throwing anything away?

You can think of inner joins as the most strict type of join: they only retain observations that appear in both datasets. In contrast, full joins are the most permissive type of join: they return all of the data that appears in both datasets (often resulting in many missing values).

Recall that `type_of_join(x, y)` joins y to x. The second dataset you specify is joined to the first dataset.

    # Join albums to songs using inner_join()
    inner_join(songs, albums, by = "album")

    # Join bands to artists using full_join()
    full_join(artists, bands, by = c("first", "last"))
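
To make the strict versus permissive contrast concrete, here is a small sketch with made-up tables that share an `album` key. The rows are illustrative assumptions, not the course's `songs` and `albums` data.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column "album"
songs <- tibble(
  song  = c("Come Together", "Dream On", "Hello, Goodbye"),
  album = c("Abbey Road", "Aerosmith", "Magical Mystery Tour")
)
albums <- tibble(
  album = c("Abbey Road", "A Hard Day's Night", "Magical Mystery Tour"),
  year  = c(1969, 1964, 1967)
)

# Only albums that appear in both tables survive
inner <- inner_join(songs, albums, by = "album")

# Every album from either table survives; gaps are filled with NA
full <- full_join(songs, albums, by = "album")
```

With this data `inner` has two rows and `full` has four: one row has no `year` and one has no `song`.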
    
### 1.3.2) Pipes
You can combine dplyr functions together with the pipe operator, %>%, to build up an analysis step-by-step. %>% takes the result of the code that comes before it and "pipes" it into the function that comes after it as the first argument of the function.

So for example, the two pieces of code below do the same thing:

    full_join(artists, bands, 
              by = c("first", "last"))
    
    artists %>% 
      full_join(bands, by = c("first", "last"))

Pipes are so efficient for multi-step analysis that you will use them for the remainder of the exercises in this course.

Example: 

    # Find guitarists in bands dataset (don't change)
    temp <- left_join(bands, artists, by = c("first", "last"))
    temp <- filter(temp, instrument == "Guitar")
    select(temp, first, last, band)

    # Reproduce code above using pipes
    bands %>% 
      left_join(artists, by = c("first", "last")) %>%
      filter(instrument == "Guitar") %>%
      select(first, last, band)

# 2) Filtering joins and set operations
Filtering joins and set operations combine information from datasets without adding new variables. Filtering joins filter the observations of one dataset based on whether or not they occur in a second dataset. Set operations use combinations of observations from both datasets to create a new dataset.

## 2.1) Apply a semi-join
Semi-joins provide a concise way to filter data from the first dataset based on information in a second dataset.

For example, the code below uses semi_join() to create a data frame of the artists in artists who have written a song in songs:

    # View the output of semi_join()
    artists %>% 
      semi_join(songs, by = c("first", "last"))

    # Create the same result
    artists %>% 
      right_join(songs, by = c("first", "last")) %>% 
      filter(!is.na(instrument)) %>% 
      select(first, last, instrument)
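
The sketch below runs both approaches on made-up data (an assumption, not the course's tables) in which one artist has written two songs. It illustrates that semi_join() only filters the first table and never duplicates its rows.

```r
library(dplyr)

# Made-up rows for illustration; not the course's artists/songs data
artists <- tibble(
  first      = c("John", "Paul", "Tom"),
  last       = c("Lennon", "McCartney", "Jones"),
  instrument = c("Guitar", "Bass", "Vocals")
)
songs <- tibble(
  song  = c("Come Together", "Imagine", "Dream On"),
  first = c("John", "John", "Steven"),
  last  = c("Lennon", "Lennon", "Tyler")
)

# semi_join() keeps the rows of artists that have a match in songs
writers <- artists %>%
  semi_join(songs, by = c("first", "last"))

# The mutating-join version returns one row per matching song instead
via_right_join <- artists %>%
  right_join(songs, by = c("first", "last")) %>%
  filter(!is.na(instrument)) %>%
  select(first, last, instrument)
```

With this toy data `writers` has one row while `via_right_join` has two, because the same artist matches two songs. The two approaches agree only when each artist matches at most one song.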
      
### 2.1.1) Exploring with semi-joins
Semi-joins provide a useful way to explore the connections between multiple tables of data.

For example, you can use a semi-join to determine the number of albums in the albums dataset that were made by a band in the bands dataset. 

    albums %>% 
      # Collect the albums made by a band
      semi_join(bands, by = "band") %>% 
      # Count the albums made by a band
      nrow()
      
      
## 2.2) Anti-join
An anti-join is the opposite of a semi-join: it returns the rows of the first dataset that do not have a match in the second dataset.
Anti-joins provide a useful way to reason about how a mutating join will work before you apply the join.

For example, you can use an anti-join to see which rows will not be matched to a second dataset by a join.

    # Return rows of artists that don't have bands info
    artists %>% anti_join(bands, by = c("first", "last"))
    
Anti-joins with anti_join() also provide a great way to diagnose joins that go wrong.

For example, they can help you zero in on rows that have capitalization or spelling errors in the keys. These things will make your primary and secondary keys appear different to R, even though you know they refer to the same thing.

For example, labels describes the record labels of the albums in albums. Compare the spellings of album names in labels with the names in albums. Are any of the album names of labels mis-entered? Use anti_join() to check. Note: Don't forget to specify the by argument.

    # Check whether album names in labels are mis-entered
    labels %>% anti_join(albums, by = "album")
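
Here is a small sketch of that diagnosis on made-up data. The tables and the deliberate capitalization slip are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical tables; "Let it Be" in labels differs in capitalization
albums <- tibble(
  album = c("Abbey Road", "Let It Be", "Revolver"),
  year  = c(1969, 1970, 1966)
)
labels <- tibble(
  album = c("Abbey Road", "Let it Be", "Revolver"),
  label = c("Apple", "Apple", "Parlophone")
)

# Rows of labels whose album key has no exact match in albums
unmatched <- labels %>% anti_join(albums, by = "album")
```

`unmatched` contains only the mis-entered row, which is the one a mutating join would fail to match.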

You may wonder when to use semi_join() and when to use anti_join(). The following exercise checks this.

Exercise: Which filtering join?
Think you have filtering joins down? Let's check.

Which filtering join would you use to determine how many rows in songs match a label in labels?

    # Check your understanding
    songs %>% 
      # Find the rows of songs that match a row in labels
      semi_join(labels, by = "album") %>% 
      # Number of matches between labels and songs
      nrow()
      
## 2.3) Set operations

1. union
2. intersect
3. setdiff
4. setequal

### 2.3.1) Multiple operations

There is no set operation to find rows that appear in one data frame or another, but not both. However, you can accomplish this by combining set operators.

    # Select songs from live and greatest_hits
    live_songs <- select(live, song)
    greatest_songs <- select(greatest_hits, song)


    # Find songs in at least one of live_songs and greatest_songs
    all_songs <- union(live_songs, greatest_songs)

    # Find songs in both 
    common_songs <- intersect(live_songs, greatest_songs)

    # Find songs that only exist in one dataset
    setdiff(all_songs, common_songs)
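
The same three steps can be sketched on tiny made-up tables to see what survives. The song titles here are placeholders, not the course's `live` and `greatest_hits` data.

```r
library(dplyr)

# Placeholder song lists; two titles overlap
live_songs     <- tibble(song = c("Song A", "Song B", "Song C"))
greatest_songs <- tibble(song = c("Song B", "Song C", "Song D"))

# Rows in at least one table
all_songs <- union(live_songs, greatest_songs)

# Rows in both tables
common_songs <- intersect(live_songs, greatest_songs)

# Rows in exactly one table: everything minus the overlap
only_one <- setdiff(all_songs, common_songs)
```

Only "Song A" and "Song D" remain in `only_one`, since each appears in exactly one of the two inputs.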
    
## 2.4) Comparing datasets
We have two commands that let us check whether two datasets are equal:

* Use identical() to determine whether the sets contain the same data in the same order.
* Use setequal() to determine whether the sets contain the same data in any order.
    
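The difference between the two checks shows up as soon as the row order changes. A minimal sketch with made-up values:

```r
library(dplyr)

# Same rows, different order
a <- tibble(x = c(1, 2, 3))
b <- tibble(x = c(3, 2, 1))

same_order <- identical(a, b)  # compares contents and order
same_rows  <- setequal(a, b)   # compares contents only
```

Here `same_order` is `FALSE` and `same_rows` is `TRUE`.
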

# 3) Assembling data
This chapter will show you how to build datasets from basic elements: vectors, lists, and individual datasets that do not require a join. dplyr contains a set of functions for assembling data that work more intuitively than base R's functions. The chapter will also look at when dplyr does and does not use data type coercion.

## 3.1) Binds
There is a simple way to combine datasets: if two datasets contain exactly the same columns, or exactly the same rows in the same order, you can simply paste them together. This process is called binding.

There are several differences between dplyr's bind_rows() and bind_cols() and base R's rbind() and cbind().

1. bind_rows() and bind_cols() are faster than rbind() and cbind()
2. bind_rows() and bind_cols() can take a list of data frames as input
3. bind_rows() and bind_cols() always return a tibble (a data frame with class tbl_df) in the dplyr version used by the course; current versions return the same type as the first input
4. rbind() returns an error when column names do not match across data frames. bind_rows() creates a column for each unique column name and distributes missing values as appropriate
5. bind_rows() accepts the **.id** argument, which labels each row with its source

To add a column to the bind_rows() output that indicates which data frame each row came from, you need to do two things: name each data frame in the call and set .id to the name of the new column.

    bind_rows(LabelName1 = source1, LabelName2 = source2, .id = "Column Name for new column")
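
As a small sketch of points 4 and 5, the example below binds two made-up tables whose columns do not match exactly. The table names and values are hypothetical.

```r
library(dplyr)

# Hypothetical monthly tables; feb has an extra column
jan <- tibble(item = c("a", "b"), sold = c(10, 5))
feb <- tibble(item = c("a", "c"), sold = c(7, 3), returned = c(1, 0))

# Name each input and use .id to record where every row came from
sales <- bind_rows(jan = jan, feb = feb, .id = "month")
```

`sales` has four rows, a `month` column holding "jan" or "feb", and `NA` in `returned` for the rows that came from `jan`.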
    

## 3.2) Build a better data frame
In R, the two most common functions for building a data frame are data.frame() and as.data.frame(). dplyr provides its own functions for building these structures. Some disadvantages of the base R functions are:

* They change strings to factors (the default in R versions before 4.0)
* They add row names
* They change unusual column names

By contrast, the advantages of data_frame() (named tibble() in current versions) are:

1. data_frame always returns a data frame of class tbl_df (a tibble).
2. data_frame will not change your column names, even if they are unorthodox.
3. data_frame never adds row names.
4. data_frame only recycles length 1 inputs.
5. data_frame evaluates its arguments lazily and in order.
6. data_frame never converts strings to factors.

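A short sketch of a few of these points follows. It uses tibble(), the current name for data_frame(), which is an assumption about the installed package version, and the column values are made up.

```r
library(dplyr)

# tibble() is the current name for data_frame(); values are made up
songs <- tibble(
  `song title` = c("Money", "Time"),
  minutes      = c(6, 7),
  seconds      = minutes * 60  # arguments are evaluated in order, so minutes exists here
)

# Base R equivalent for comparison: the unusual column name gets changed
songs_base <- data.frame(`song title` = c("Money", "Time"), minutes = c(6, 7))
```

`names(songs)` keeps "song title" as written, while `names(songs_base)` starts with "song.title".
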


# 4) Advanced joining
Now that you have the basics, let's dive deep into the mechanics of joins. This chapter will show you how to spot common join problems, how to join based on multiple or mismatched keys, how to join multiple tables, and how to recreate dplyr's joins with SQL and base R.

## 4.1) What can go wrong
Joins can run into problems. The ones we will look at include missing rows and duplicate keys.

### 4.1.1) Spot the key
R's data frames can store important information in the row.names attribute. This is not a tidy way to store data, but it does happen quite commonly. If the primary key of your dataset is stored in row.names, you will have trouble joining it to other datasets.

For example, stage_songs contains information about songs that appear in musicals. However, it stores the primary key (song name) in the row.names attribute. As a result, you cannot access the key with a join function.

One way to remedy this problem is to use the function rownames_to_column() from the tibble package. rownames_to_column() returns a copy of a dataset with the row names added to the data as a column.

    library(tibble)
    stage_songs %>% 
      # Add row names as a column named song
      rownames_to_column(var = "song") %>%
      # Left join stage_writers to stage_songs
      left_join(stage_writers, by = "song")

### 4.1.2) Non-unique keys
shows and composers provide some more information about songs that appear in musicals.

You can join the two datasets by the musical column, which appears in both the datasets. However, you will run into a glitch: the values of musical do not uniquely identify the rows of composers. For example, two rows have the value "The Sound of Music" for musical and two other rows have the value "The King and I".

### 4.1.3) Two non-unique keys
You saw in the last exercise that if a row in the primary dataset contains multiple matches in the secondary dataset, left_join() will duplicate the row once for every match. This is true for all of dplyr's mutating join functions.

Now, let's see how this rule would apply when the primary dataset contains duplicate key values.

show_songs contains songs that appear in the musicals written by the composers. You can join the two by the musical column, but like composers, show_songs has two rows where musical == "The Sound of Music".

How many entries (rows) will exist for The Sound of Music if you left join composers to show_songs by musical? The answer is four: left_join() duplicates each of the two rows in show_songs that contain The Sound of Music twice, once for each row of composers that contains The Sound of Music. The result is four rows that contain The Sound of Music.
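
The two-by-two arithmetic can be checked with a small sketch. The tables below are simplified stand-ins for `show_songs` and `composers`, not the course's data.

```r
library(dplyr)

# Simplified stand-ins: two rows per table share the key "The Sound of Music"
show_songs <- tibble(
  song    = c("Edelweiss", "My Favorite Things", "Getting to Know You"),
  musical = c("The Sound of Music", "The Sound of Music", "The King and I")
)
composers <- tibble(
  musical  = c("The Sound of Music", "The Sound of Music", "The King and I"),
  composer = c("Richard Rodgers", "Oscar Hammerstein II", "Richard Rodgers")
)

# Each of the 2 song rows is repeated once per matching composer row.
# Recent dplyr versions may warn about a many-to-many match here.
joined <- left_join(show_songs, composers, by = "musical")

n_sound_of_music <- sum(joined$musical == "The Sound of Music")
```

`n_sound_of_music` is 4 (two songs times two composers), and `joined` has five rows in total.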

## 4.2) Defining keys

Now we will focus on how to define a key in dplyr commands. You saw that we can match datasets through the `by` argument, but it is not required: if you omit it, dplyr joins on all columns that have the same name in both datasets. If the key columns have different names in the two datasets, you must specify them explicitly:

    left_join(members, plays, by = c("member" = "name"))
    
### 4.2.1) A subset of keys
Often the same column name will be used by two datasets to refer to different things. For example, the data frame movie_studios uses name to refer to the name of a movie studio. movie_years uses name to refer to the name of an actor.

You could join these datasets (they describe the same movies), but you wouldn't want to use the name column to do so!

dplyr will ignore duplicate column names if you set the by argument and do not include the duplicated name in the argument. When you do this, dplyr will treat the columns in the normal fashion, but it will add .x and .y to the duplicated names to help you tell the columns apart.

    movie_years %>% 
      # Left join movie_studios to movie_years
      left_join(movie_studios, by = "movie") %>% 
      # Rename the columns: artist and studio
      rename(artist = name.x, studio = name.y) 
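
The suffix behaviour can be seen in a minimal sketch. The two tables below are made-up stand-ins for `movie_years` and `movie_studios`.

```r
library(dplyr)

# Made-up stand-ins: "name" means an actor in one table and a studio in the other
movie_years <- tibble(
  movie = c("Casablanca", "Vertigo"),
  name  = c("Humphrey Bogart", "James Stewart"),
  year  = c(1942, 1958)
)
movie_studios <- tibble(
  movie = c("Casablanca", "Vertigo"),
  name  = c("Warner Bros.", "Paramount")
)

# Join on movie only; the two name columns become name.x and name.y
joined <- movie_years %>%
  left_join(movie_studios, by = "movie")

# Give the suffixed columns meaningful names
renamed <- joined %>%
  rename(artist = name.x, studio = name.y)
```

`names(joined)` is `movie`, `name.x`, `year`, `name.y`; after the rename the columns read `movie`, `artist`, `year`, `studio`.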
      
### 4.2.2) Mis-matched key names
Just as the same name can refer to different things in different datasets, different names can refer to the same thing. For example, elvis_movies and elvis_songs both describe movies starring Elvis Presley, but each uses a different column name to describe the name of the movie.

This type of inconsistency can be frustrating when you wish to join data based on the inconsistently named variable.

To make the join, set by to a named vector. The names of the vector will refer to column names in the primary dataset (x). The values of the vector will correspond to the column names in the secondary dataset (y), e.g.

    x %>% left_join(y, by = c("x.name1" = "y.name2"))
    
dplyr will make the join and retain the names in the primary dataset, for example:

    # Identify the column in elvis_songs that corresponds to a column in elvis_movies.
    # Left join elvis_songs to elvis_movies by this column.
    # Use rename() to give the result the column names movie, year, and song.
    
    elvis_movies %>% 
      # Left join elvis_songs to elvis_movies by this column
      left_join(elvis_songs, by = c("name" = "movie")) %>% 
      # Rename columns
      rename(movie = name, song = name.y)
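
To see which names survive a join on a named vector, here is a sketch on made-up stand-ins for `elvis_movies` and `elvis_songs`. The column layout is an assumption based on the exercise text.

```r
library(dplyr)

# Assumed layout: the movie title is "name" in one table and "movie" in the other
elvis_movies <- tibble(
  name = c("Jailhouse Rock", "Blue Hawaii"),
  year = c(1957, 1961)
)
elvis_songs <- tibble(
  name  = c("Treat Me Nice", "Can't Help Falling in Love"),
  movie = c("Jailhouse Rock", "Blue Hawaii")
)

# Match elvis_movies$name against elvis_songs$movie
joined <- elvis_movies %>%
  left_join(elvis_songs, by = c("name" = "movie"))

# The key keeps the primary dataset's name; the song column is suffixed
result <- joined %>%
  rename(movie = name, song = name.y)
```

The key column keeps its name from the primary dataset (`name`), and the non-key `name` column from `elvis_songs` arrives as `name.y`.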
      
## 4.3) Joining multiple tables

You can join three or more datasets together. The strategy is simple: join two datasets, then join the third to the result, and so on. You can use the pipe %>% to shorten the code, but there is another way to do this with the **purrr** R package.

To install the package:

    install.packages("purrr") # note the three r's
    library(purrr)
    
This package includes the function **reduce()**, which is very useful for joining multiple datasets together. It takes a list of datasets, for example:

    tables <- list(a, b, c) # where a, b, c are data frames
    reduce(tables, left_join, by = "id")
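
A runnable sketch of the same idea, with three hypothetical tables that share an `id` key:

```r
library(dplyr)
library(purrr)

# Hypothetical tables that share an "id" key
tbl_a <- tibble(id = 1:3, x = c("a1", "a2", "a3"))
tbl_b <- tibble(id = 1:2, y = c("b1", "b2"))
tbl_c <- tibble(id = 2:3, z = c("c2", "c3"))

tables <- list(tbl_a, tbl_b, tbl_c)

# Same as left_join(left_join(tbl_a, tbl_b, by = "id"), tbl_c, by = "id")
joined <- reduce(tables, left_join, by = "id")
```

`joined` keeps the three rows of `tbl_a` and gains the `y` and `z` columns, with `NA` wherever an `id` is missing from a later table.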
    
Most packages in the "tidyverse" have a single "job" that they do well. For example, the job of the dplyr package is to help you manipulate tabular data. It provides a short grammar of data manipulation.

What is the job of the purrr package? To help you apply R functions to data in efficient ways, purrr rounds out the functional programming tools in R (i.e. the tools that help you write programs with functions).

The job of reduce() is to apply a function in an iterative fashion to many datasets. As a result, reduce() works well with all of the dplyr join functions.

For example, you can use reduce() to filter observations with a filtering join.

    list(more_artists, more_bands, supergroups) %>% 
      # Return rows of more_artists in all three datasets
      reduce(semi_join, by = c("first", "last"))
      
## 4.4) Other implementations

There are other ways to combine datasets, for example base R's merge() function. You can also recreate dplyr joins with SQL.

Here we recreate only two of them:

    -- semi_join(x, y, by = "a")
    SELECT * FROM x WHERE
    EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
    
    -- anti_join(x, y, by = "a")
    SELECT * FROM x WHERE
    NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
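
On the base R side, the joins can be approximated with merge() and logical subsetting. This is a sketch on made-up data frames `x` and `y` with a shared key `a`, not an exact reimplementation of dplyr (for instance, merge() sorts the result by the key by default).

```r
# Made-up data frames with a shared key column "a"
x <- data.frame(a = c(1, 2, 3), x_val = c("p", "q", "r"))
y <- data.frame(a = c(1, 3, 4), y_val = c("u", "v", "w"))

left  <- merge(x, y, by = "a", all.x = TRUE)  # roughly left_join(x, y, by = "a")
inner <- merge(x, y, by = "a")                # roughly inner_join(x, y, by = "a")
full  <- merge(x, y, by = "a", all = TRUE)    # roughly full_join(x, y, by = "a")

semi <- x[x$a %in% y$a, ]    # rows of x with a match in y, like semi_join()
anti <- x[!x$a %in% y$a, ]   # rows of x without a match in y, like anti_join()
```

The last two lines mirror the EXISTS and NOT EXISTS queries: they filter `x` without adding any columns from `y`.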
    
It is important to know that dplyr functions can work directly with SQL databases, because dplyr can create a connection to the database. For example:

| Function | DBMS |
| --- | --- |
| src_sqlite() | SQLite |
| src_mysql() | MySQL, MariaDB |
| src_postgres() | PostgreSQL |

To use them, we first need the DBI package.