# Data Manipulation with dplyr in R
link: https://www.datacamp.com/courses/data-manipulation-with-dplyr-in-r

### Course Description
Say you've found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions—it can also help with basic transformations of your data. You'll also learn to aggregate your data and add, remove, or change the variables. Along the way, you'll explore a dataset containing information about counties in the United States. You'll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.

### Note 1: Resizing plots in the R kernel for Jupyter notebooks
https://blog.revolutionanalytics.com/2015/09/resizing-plots-in-the-r-kernel-for-jupyter-notebooks.html

    library(repr)

    # Change plot size to 4 x 3
    options(repr.plot.width=4, repr.plot.height=3)
    
### Note 2: Generating Markdown tables

https://www.tablesgenerator.com/markdown_tables


Other: the book *Machine Learning with R* by Brett Lantz.
Learn about the `attr` function.


### Note 2.2 RDS (Reading from an R data file)
Link: https://mgimond.github.io/ES218/Week02b.html

R has its own data file format, usually saved with the `.rds` extension. To read an R data file, invoke the `readRDS()` function.

    dat <- readRDS("ACS.rds")
    
As with a CSV file, you can load an RDS file straight from a website. However, you must first run the file through a decompressor before loading it via `readRDS()`. The built-in function `gzcon` can be used for this purpose.

    dat <- readRDS(gzcon(url("http://mgimond.github.io/ES218/Data/ACS.rds")))
    
The `.rds` file format is usually smaller than its text file counterpart and will therefore take up less storage space. The `.rds` file also preserves data types and classes such as factors and dates, eliminating the need to redefine data types after loading the file.
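A minimal sketch of the round trip, assuming a made-up data frame (`original`) and a temporary file path rather than the `ACS.rds` file mentioned above. It writes a table containing a factor and a date column, reads it back, and checks that the classes are still there.

```r
# Hypothetical toy data frame with a factor and a Date column
original <- data.frame(
  group = factor(c("a", "b", "a")),
  day   = as.Date(c("2015-01-01", "2015-06-15", "2015-12-31")),
  value = c(1.5, 2.5, 3.5)
)

# Write to a temporary .rds file and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(original, rds_path)
restored <- readRDS(rds_path)

# The column classes survive the round trip
class(restored$group)
class(restored$day)
```

A CSV round trip would return `group` and `day` as plain text unless the column types were declared again on import.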



### Note 3 - DataFrames

In [1]:
library(dplyr)

"package 'dplyr' was built under R version 3.5.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [2]:

#library(readr)
#path_csv<-"https://assets.datacamp.com/production/course_6430/datasets/Pokemon.csv"
#df_pokemon<-read_csv(path_csv)

#dat <- readRDS("https://assets.datacamp.com/production/repositories/4984/datasets/a924bf7063f02a5445e1f49cc1c75c78e018ac4c/counties.rds")
counties <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/4984/datasets/a924bf7063f02a5445e1f49cc1c75c78e018ac4c/counties.rds")))
babynames<- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/4984/datasets/a924ac5d86adba2e934d489cb9db446236f62b2c/babynames.rds")))


## 1) Transforming Data with dplyr
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You'll use these functions to modify the counties dataset to view particular observations and answer questions about the data.

### 1.1) (video) The counties dataset
As in other courses you may have taken, the goal of this course is that by the end you will know how to manipulate a real dataset. You will work with a dataset called `counties`, which comes from the 2015 United States Census. A state is one of the 50 regions of the United States, such as California or Texas; a county is a subregion of a state, such as Los Angeles County in California.
    
#### 1.1.1) Select 
The first verb to learn is `select`, which picks specific columns from a dataset. This is very common, because you rarely work with the full set of columns. The form is `select(data, columns...)`.
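As a small illustration, here is a sketch on a made-up table. `toy_counties` and its values are hypothetical stand-ins, not the real `counties` data.

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  state      = c("California", "California", "Texas"),
  county     = c("County A", "County B", "County C"),
  population = c(1000, 250, 800),
  walk       = c(5, 2, 1)
)

# Keep only the columns of interest
selected <- select(toy_counties, state, county, population)

# The same call written with the pipe
selected_piped <- toy_counties %>% select(state, county, population)

names(selected)
```

Both forms return the same three columns; the piped form is the one used in the rest of these notes.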

### 1.2 (video) The filter and arrange verbs
Now that you know how to select a specific set of columns, we will continue with the next two verbs of `dplyr`.

- `arrange` sorts data by one or more columns. By default it sorts in ascending order; to reverse that, wrap the column in `desc()`, e.g. `arrange(data, desc(column_of_interest))`.
- `filter` extracts particular observations based on a condition.
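The two verbs can be sketched together on a made-up table (`toy_counties` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  state      = c("California", "California", "Texas", "Texas"),
  county     = c("County A", "County B", "County C", "County D"),
  population = c(1000, 250, 800, 40)
)

# Keep only the Texas rows
texas <- toy_counties %>% filter(state == "Texas")

# Keep counties above a population threshold, largest first
large_sorted <- toy_counties %>%
  filter(population > 100) %>%
  arrange(desc(population))

large_sorted$county
```

Filtering before sorting keeps the sort working on fewer rows, although either order gives the same result here.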

### 1.3 (video) Mutate
It is very common that your dataset does not have all the variables or columns you need, or that you need to change an existing one. `mutate` lets you do both.
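A minimal sketch of both uses, on a hypothetical table with made-up values:

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  county     = c("County A", "County B"),
  population = c(1000, 200),
  walk       = c(5, 10)   # percent of people who walk to work
)

with_walkers <- toy_counties %>%
  # Add a new column: number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Overwrite an existing column: percent becomes a proportion
  mutate(walk = walk / 100)

with_walkers
```

Reusing an existing name on the left-hand side replaces that column; a new name appends a column at the end.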

In [4]:
?glimpse

## 2) Aggregating Data
Now that you know how to transform your data, you'll want to know more about how to aggregate your data to make it more interpretable. You'll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.

### 2.1 (video) The count verb
So far you have learned how to select, filter, and sort your observations, always working at the same level of detail as the original data. In this second chapter you will learn to aggregate data, that is, to take many observations and summarize them into one. The first tool is the `count` function, available in the `dplyr` package.

- You can sort the result with the argument `sort = TRUE` inside `count`.
- You can weight the count by a variable with the `wt` argument, which sums that variable instead of counting rows.



In [15]:
a<-data.frame( Estado = c("YUC","YUC","YUC","QROO","QROO"), 
           ciudad = c("MID","VAL","TIZ","CUN","CHT"), poblacion = c(10,2,3,5,3))
count(a)

count(a, Estado)

# Weighted count: sum poblacion within each Estado
count(a, Estado, wt = poblacion)

n
5


Estado,n
QROO,2
YUC,3


Estado,n
QROO,8
YUC,15



#### 2.1.1) Counting by region

**Exercise**
- Use count() to find the number of counties in each region, using a second argument to sort in descending order.

*Answer*

In [4]:
counties_selected <- select(counties, region, state, population, citizens)
#glimpse(counties_selected)

# Use count to find the number of counties in each region
counties_selected %>% count(region, sort = TRUE)
  

region,n
South,1420
North Central,1054
West,447
Northeast,217


#### 2.1.2) Counting citizens by state
You can weigh your count by particular variables rather than finding the number of counties. In this case, you'll find the number of citizens in each state.

**Exercise**
- Count the number of counties in each state, weighted based on the citizens column, and sorted in descending order.

*Answer*

In [6]:
# Find number of counties per state, weighted by citizens
head(counties_selected %>% count(state, wt = citizens, sort = TRUE))
  

state,n
California,24280349
Texas,16864864
Florida,13933052
New York,13531404
Pennsylvania,9710416
Illinois,8979999


#### 2.1.3) Mutating and counting
You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: "What are the US states where the most people walk to work?"

You'll use the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.

    counties_selected <- counties %>%
      select(region, state, population, walk)

**Exercise**
- Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county.
- Use a (weighted and sorted) count() to find the total number of people who walk to work in each state.      

*Answer*

In [16]:
counties_selected <- counties %>%
  select(region, state, population, walk)

counties_selected %>%
  # Add population_walk containing the total number of people who walk to work 
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)

state,n
New York,1237938.17
California,1017963.68
Pennsylvania,505397.19
Texas,430783.43
Illinois,400345.6
Massachusetts,316765.03
Florida,284722.87
New Jersey,273047.19
Ohio,266910.98
Washington,239764.32


### 2.2) (video) The group by, summarize and ungroup verbs
You saw how to use `count` to aggregate your data, but it is a special case of a more general pair of verbs: `group_by` and `summarize`.

- `summarize`: takes many observations and turns them into one. You can define several operations at once to aggregate in different ways.
- `group_by`: lets you aggregate your data within groups.

When you summarize a table that has multiple groups, only the last of them gets "peeled off". This is useful when you want to do additional summaries or aggregations. If you don't want to keep the remaining grouping (for example, state), add another dplyr verb, `ungroup()`.
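The peeling behaviour can be checked on a small made-up table. `toy_counties` and its values are hypothetical; `group_vars()` is the dplyr function that reports the current grouping columns.

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  region     = c("West", "West", "West", "South", "South"),
  state      = c("California", "California", "Oregon", "Texas", "Texas"),
  population = c(1000, 250, 300, 800, 40)
)

# Group by two columns, then summarize once
by_state <- toy_counties %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))

# Only the last grouping (state) was peeled off; region remains
group_vars(by_state)

# A second summarize therefore aggregates per region
by_region <- by_state %>%
  summarize(average_pop = mean(total_pop))

# ungroup() removes whatever grouping is left
group_vars(ungroup(by_state))
```

Depending on the dplyr version, `summarize()` may print a message about the grouping of its output; that is informational and not an error.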

#### 2.2.1) Summarizing by state
Another interesting column is land_area, which shows the land area in square miles. Here, you'll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).

    counties_selected <- counties %>%
      select(state, county, population, land_area)

**Exercise**
- Group the data by state, and summarize to create the columns total_area (with total area in square miles) and total_population (with total population).      
- Use mutate() to add the density column that contains the number of people per square mile in the state.

*Answer*



In [4]:
counties_selected <- counties %>%
  select(state, county, population, land_area)


# Add a density column, then sort in descending order
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))

state,total_area,total_population,density
New Jersey,7354.22,8904413,1210.789587
Rhode Island,1033.82,1053661,1019.191929
Massachusetts,7800.08,6705586,859.681696
Connecticut,4842.36,3593222,742.039419
Maryland,9707.23,5930538,610.940299
Delaware,1948.55,926454,475.458161
New York,47126.43,19673174,417.455216
Florida,53624.78,19645772,366.356226
Pennsylvania,44742.71,12779559,285.623267
Ohio,40860.73,11575977,283.303235


#### 2.2.2) Summarizing by state and region
You can group by multiple columns instead of grouping by one. Here, you'll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

    counties_selected <- counties %>%
      select(region, state, county, population)
      
**Exercise**
-  Summarize to find the total population, as a column called total_pop, in each combination of region and state.
- Notice the tibble is still grouped by region; use another summarize step to calculate two new columns: the average state population in each region (average_pop) and the median state population in each region (median_pop).

*Answer*     

In [4]:
counties_selected <- counties %>%
  select(region, state, county, population)

# Calculate the total_pop column for each region and state
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) 

# Calculate the average_pop and median_pop columns 
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))  %>%
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))





region,state,total_pop
North Central,Illinois,12873761
North Central,Indiana,6568645
North Central,Iowa,3093526
North Central,Kansas,2892987
North Central,Michigan,9900571
North Central,Minnesota,5419171
North Central,Missouri,6045448
North Central,Nebraska,1869365
North Central,North Dakota,721640
North Central,Ohio,11575977


region,average_pop,median_pop
North Central,5627687,5580644
Northeast,6221058,3593222
South,7370486,4804098
West,5722755,2798636


### 2.3) (video) The top_n verb
What if, instead of aggregating each state, you wanted to find only the largest county in each state?

dplyr's `top_n` is very useful for keeping the most extreme observations from each group. Like `summarize`, `top_n` operates on a grouped table. It takes two arguments: the first is the number of observations you want from each group, and the second is the column to weight by.
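A minimal sketch of the "largest county in each state" question, on a hypothetical table with made-up values. Recent dplyr releases document `top_n()` as superseded by `slice_max()`; the call below keeps `top_n()` to match the course.

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  state      = c("California", "California", "Texas", "Texas"),
  county     = c("County A", "County B", "County C", "County D"),
  population = c(1000, 250, 800, 40)
)

# Keep the single most populous county within each state
largest <- toy_counties %>%
  group_by(state) %>%
  top_n(1, population)

largest
```

Unlike `summarize`, `top_n` keeps whole rows, so the other columns of the winning observation (here, `county`) stay available.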

#### 2.3.1) Selecting a county from each region
Previously, you used the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count to find the total number of people who walk to work in each county.

Now, you're interested in finding the county within each region with the highest percentage of citizens who walk to work.

**Exercise**
- Find the county in each region with the highest percentage of citizens who walk to work.

*Answer*

In [5]:
counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)

# Group by region and find the county with the highest percentage of citizens who walk to work
counties_selected %>% group_by(region) %>% top_n(1, walk)
  


region,state,county,metro,population,walk
West,Alaska,Aleutians East Borough,Nonmetro,3304,71.2
Northeast,New York,New York,Metro,1629507,20.7
North Central,North Dakota,McIntosh,Nonmetro,2759,17.5
South,Virginia,Lexington city,Nonmetro,7071,31.7


#### 2.3.2) Finding the highest-income state in each region
You've been learning to combine multiple dplyr verbs together. Here, you'll combine `group_by()`, `summarize()`, and `top_n()` to find the state in each region with the highest income.

When you group by multiple columns and then summarize, it's important to remember that the summarize "peels off" one of the groups, but leaves the rest on. For example, if you `group_by(X, Y)` then summarize, the result will still be grouped by `X`.

**Exercise**
- Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you).
- Find the highest income state in each region

*Answer*

In [6]:
counties_selected <- counties %>%
  select(region, state, county, population, income)

counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)

region,state,average_income
North Central,North Dakota,55574.87
Northeast,New Jersey,73014.1
South,Maryland,69200.38
West,Alaska,65124.54


## 3) Selecting and Transforming Data
Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.

### 3.1 (video) Selecting
This chapter focuses on some advanced methods for selecting and transforming data.

- You can select a range of columns by name, e.g.

        counties %>% select(state, county, drive:work_at_home)

- Another technique it offers is "select helpers": functions that specify criteria for choosing columns. We'll start with `contains`:

        counties %>% select(state, county, contains("work"))

- You can use `starts_with` to select only the columns that start with a particular prefix:

        counties %>% select(state, county, starts_with("income"))

- Finally, you can remove specific columns from your selection by adding `-` in front of the column name:

        counties %>% select(-census_id)

There are other select helpers as well; you can list them with `?select_helpers`.
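The four techniques above can be tried on a small made-up table. The column names are borrowed from the calls above, but `toy_counties` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  census_id      = c(1, 2),
  state          = c("California", "Texas"),
  county         = c("County A", "County B"),
  income         = c(50000, 45000),
  income_per_cap = c(25000, 22000),
  drive          = c(70, 80),
  walk           = c(5, 2),
  work_at_home   = c(4, 3)
)

# Range of adjacent columns, from drive through work_at_home
range_cols <- toy_counties %>% select(state, drive:work_at_home)

# Columns whose name contains "work"
work_cols <- toy_counties %>% select(state, contains("work"))

# Columns whose name starts with "income"
income_cols <- toy_counties %>% select(state, starts_with("income"))

# Everything except census_id
no_id <- toy_counties %>% select(-census_id)

names(income_cols)
```

Note that a helper pattern is matched literally, so a stray space inside the quotes (for example `"income "`) would match no column.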
        
        
#### 3.1.1 ) Selecting columns
Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of columns. The colon (`:`) is useful for getting many columns at a time.

#### 3.1.2) Select helpers
In the video you learned about the select helper `starts_with()`. Another select helper is `ends_with()`, which finds the columns that end with a particular string.

### 3.3 (video) The rename verb
Sometimes you are not happy with a column's name and want to change it. You can do that with the `rename` verb, or directly inside `select`, e.g.:

    # select
    counties %>% select(state, county, population, unemployment_rate = unemployment)

    # rename
    counties %>% select(state, county, population, unemployment) %>%
      rename(unemployment_rate = unemployment)

The `rename()` verb is often useful for changing the name of a column that comes out of another verb, such as `count()`

    counties %>%
      count(state) %>%
      rename(num_counties = n)

### 3.4) (video) The transmute verb
Until now, you have seen how to select, rename, and mutate your data. Now it is time to see `transmute`, which is a combination of select and mutate.

The transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.

    counties %>% transmute(state, county, population, density = population / land_area)
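To make the difference from `mutate` concrete, here is a sketch on a hypothetical table with made-up values. Both calls compute the same `density` column; only the set of columns that is kept differs.

```r
library(dplyr)

# Hypothetical miniature stand-in for the counties table (made-up values)
toy_counties <- data.frame(
  state      = c("California", "Texas"),
  county     = c("County A", "County B"),
  population = c(1000, 200),
  land_area  = c(50, 100)
)

# transmute keeps only the columns that are named or computed
kept <- toy_counties %>%
  transmute(county, density = population / land_area)

# mutate keeps every original column and appends the new one
added <- toy_counties %>%
  mutate(density = population / land_area)

names(kept)
names(added)
```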


In [3]:
?select_helpers

#### 4) Case Study: The babynames Dataset


**This exercise shows how to compute a value per group and attach it to every row of the dataset at the same time (a grouped mutate).**

#### 4.1) Finding the year each name is most common
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you're going to explore which year each name was the most common.

To do this, you'll be combining the grouped mutate approach with a top_n.
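The pattern is easier to verify by hand on a tiny table. The sketch below assumes a made-up `toy_names` data frame with the same column names as `babynames` (`year`, `name`, `number`).

```r
library(dplyr)

# Hypothetical miniature stand-in for babynames (made-up values)
toy_names <- data.frame(
  year   = c(1990, 1990, 2000, 2000),
  name   = c("Ana", "Ben", "Ana", "Ben"),
  number = c(10, 90, 30, 20)
)

peak_years <- toy_names %>%
  # Grouped mutate: every row receives the total for its own year
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  # Share of that year's babies carrying each name
  mutate(fraction = number / year_total) %>%
  # For each name, keep the row where that share is highest
  group_by(name) %>%
  top_n(1, fraction)

peak_years
```

Here "Ana" has more babies in 2000 (30 of 50) than in 1990 (10 of 100), and "Ben" peaks in 1990 (90 of 100), so comparing fractions rather than raw counts is what decides the peak year.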

**Exercise**
- Complete the code so that it finds the year each name is most common.

*Answer*

In [3]:
# Find the year each name is most common 
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  group_by(name) %>%
  top_n(1, fraction)

year,name,number,year_total,fraction
1880,Abbott,5,201478,2.481661e-05
1880,Abe,50,201478,2.481661e-04
1880,Abner,27,201478,1.340097e-04
1880,Adelbert,28,201478,1.389730e-04
1880,Adella,26,201478,1.290463e-04
1880,Adolf,6,201478,2.977993e-05
1880,Adolph,93,201478,4.615889e-04
1880,Agustus,5,201478,2.481661e-05
1880,Albert,1493,201478,7.410238e-03
1880,Albertina,7,201478,3.474325e-05


In [4]:
babynames %>% filter (year == 1880) %>%
group_by(year) %>% 
summarize(total = sum(number))

year,total
1880,201478


## 4) Case study

### 4.1) Window functions

`lag()` function

In [11]:
a<-c(1,2,3,4,5)
b<-lag(a)
print(length(a))
print(length(b))

print(a-b)


[1] 5
[1] 5
[1] 0 0 0 0 0
attr(,"tsp")
[1] 0 4 1
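The all-zero differences and the `tsp` attribute in the output above are what `stats::lag()` produces: it leaves the values alone and only shifts the time base of the series. That is the function you get when dplyr is not attached in the session. The sketch below calls both versions explicitly by namespace, assuming the dplyr package is installed.

```r
a <- c(1, 2, 3, 4, 5)

# stats::lag() keeps the values and only shifts the "tsp" time attribute
b_stats <- stats::lag(a)
diff_stats <- a - b_stats

# dplyr::lag() shifts the values by one position and pads with NA
b_dplyr <- dplyr::lag(a)
diff_dplyr <- a - b_dplyr

b_dplyr      # NA 1 2 3 4
diff_dplyr   # NA 1 1 1 1
```

For window-style calculations such as year-over-year differences, the `dplyr::lag()` behaviour is the one that is needed, so it is worth confirming that `library(dplyr)` has run before calling a bare `lag()`.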


**Exercise**

*Answer*