# Tidyverse

# Manipulating a Data Set with the dplyr Package

## The gapminder dataset
We are working with the **Tidyverse**, a collection of data science tools for exploring, manipulating, and visualizing data. We will use a data set called **gapminder**, and we also need the **dplyr** library to transform it (filtering, sorting, and summarizing, for example).

    # Install the gapminder package (only needed once)
    install.packages("gapminder")
    library(gapminder)
    library(dplyr)
    # Print the data set
    gapminder

Now that the data set is loaded, we can start. The first thing we will look at is verbs.

## Verbs
The **filter** verb lets you look at only a subset of your observations, based on a particular condition. Every time you apply a verb, you will use a pipe:

The pipe, `%>%`, says "take whatever is before it, and feed it into the next step". After the pipe we use our verb, in this case **filter**. For example:


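    # Keep only the observations from 1997
    gapminder %>%
      filter(year == 1997)

    # Several conditions separated by commas must all be true
    gapminder %>%
      filter(year == 1997, country == "Mexico")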
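
To make the pipe's behavior concrete, here is a minimal sketch comparing the piped form with a direct call to `filter()`. The names `piped` and `direct` are only illustrative and do not come from the original notes.

```r
library(gapminder)
library(dplyr)

# Piped form: the data set on the left becomes the first argument of filter()
piped <- gapminder %>%
  filter(year == 1997, country == "Mexico")

# Direct call without the pipe, passing the data set explicitly
direct <- filter(gapminder, year == 1997, country == "Mexico")

# Compare the two results
identical(piped, direct)
```

The piped form tends to read better once several verbs are chained, because the steps appear in the order they are applied.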


The **arrange** verb sorts the observations of a data set, in ascending order by default. Wrap the column in `desc()` to sort in descending order.

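    # Sort by GDP per capita, lowest first
    gapminder %>%
      arrange(gdpPercap)

    # Keep 2007 only, then sort by GDP per capita, highest first
    gapminder %>%
      filter(year == 2007) %>%
      arrange(desc(gdpPercap))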
      
Suppose you want to change one of the variables in your data set based on another one, or suppose you want to add a new variable. For that you would use the **mutate** verb. You use this verb the same way you would use filter or arrange: after a pipe operator.

First, you need to learn how to change an existing variable. For example, we are going to change the `pop` column so that it is expressed in millions:


    # 3) Mutate (change)
    # 3.1) Change an existing variable
    gapminder %>%
      mutate(pop = pop / 1000000)

It is important to note that when we use **filter** and **mutate**, we are not changing the original data set; instead, we are creating a new data set.
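
A short sketch of this point, assuming a hypothetical variable name `gapminder_millions` for the transformed copy: the result of `mutate()` is kept only if it is assigned, and the original `gapminder` stays as it was.

```r
library(gapminder)
library(dplyr)

# mutate() returns a new data set; it is kept only because we assign it
gapminder_millions <- gapminder %>%
  mutate(pop = pop / 1000000)

# The original still holds raw population counts
max(gapminder$pop)

# The copy holds the same values expressed in millions
max(gapminder_millions$pop)
```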

Second, we create a new column:

    # 3.2) Create a new column
    gapminder %>%
      mutate(gdp = pop * gdpPercap)

Finally, we combine everything we have learned so far:

    # Add total GDP, keep 2007 only, and sort from highest to lowest GDP
    gapminder %>%
      mutate(gdp = pop * gdpPercap) %>%
      filter(year == 2007) %>%
      arrange(desc(gdp))


## Visualizing with ggplot2

In the first chapter we saw how to use the dplyr package to manipulate our data set. In some cases, though, it is better to understand your data through visualizations, so now we will learn how to use ggplot2.

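Note: It is recommended to assign the subset to a new variable. The plots in this chapter use `gapminder_2007`, which is assumed to hold only the observations from 2007.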
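
The notes never show how `gapminder_2007` is created, so the following is a minimal sketch under the assumption that it is simply the 2007 subset of `gapminder`, as its name suggests.

```r
library(gapminder)
library(dplyr)

# Assumption: gapminder_2007 is the subset of observations from 2007
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Print the subset to check it
gapminder_2007
```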

A scatter plot is built from three parts:

1. The first is the data that we are visualizing.
2. The second is the mapping of variables in your data set to **aesthetics** in your graph. An aesthetic is a visual dimension of a graph that can be used to communicate information.
3. The last is the type of graph you are creating. You specify it by adding a layer to the graph: use a **plus** (`+`) after the `ggplot()` call.

**geom** means you are adding a type of geometric object to the graph.

**point** indicates that it is a scatter plot.

Pay attention to spacing when writing the code:

    library(ggplot2)

    # Relationship between a country's wealth and its life expectancy
    # x = GDP per capita, y = life expectancy
    ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point()

## Log Scales
Sometimes a graph is hard to read because of the magnitudes involved. For example, when we are talking about money (GDP), some countries have values in the thousands of dollars and others in the hundreds. When one of your axes has that kind of distribution, it is useful to work with a **logarithmic scale**, that is, a scale where each fixed distance along the axis represents a multiplication of the value. To create this graph, add one additional option to the ggplot call, with another "plus" after `geom_point()`:

    # Sometimes we need to transform an axis, for example with a log scale
    ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()
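
The idea that a fixed distance represents a multiplication can be checked with plain arithmetic. This is a small sketch using made-up values (`gdp_values` is hypothetical, not taken from the data set) that each grow by a factor of 10.

```r
# Hypothetical values that each grow by a factor of 10
gdp_values <- c(100, 1000, 10000, 100000)

# On a linear axis, the gaps between neighbors keep growing
diff(gdp_values)

# On a log10 axis, every gap is the same fixed distance
diff(log10(gdp_values))
```

This is why countries with a low GDP per capita are no longer squeezed together at the left edge once `scale_x_log10()` is added.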


## Additional Aesthetics 

Now you will learn to add two more aesthetics, color and size, to communicate even more information in your scatter plot. In our examples we will work with the following:

> Continent is a categorical variable; a good way to represent this type of variable in a scatter plot is the color of your points. 

> For numerical variables, such as population, we can use the size of the points.

    # Adding more aesthetics to our graph
    # color for categorical variables, for example
    # size for numerical variables
    ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
      geom_point() + scale_x_log10()


## Faceting 
Now you will learn another way to explore your data in terms of a categorical variable. ggplot2 lets you divide your plot into subplots to get one smaller graph for each value of the categorical variable, in our case continent. You facet a plot by adding another option, with a `+`, to the end of your code:

    # Facet our graph by continent
    ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
      geom_point() + scale_x_log10() + facet_wrap(~ continent)
    
In R, the **~** symbol typically means **by**; here it tells `facet_wrap()` to split the plot by continent.
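
The same pattern works with any other variable after the `~`. As an illustration that is not part of the original notes, this sketch facets the full data set by year instead of by continent, which gives one subplot per year.

```r
library(gapminder)
library(ggplot2)

# Assumption: facet the full data set by year, one subplot per year
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)

# Draw the plot
p
```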


## Summarize Verb 
Now you will learn how to summarize many observations into a single data point.

    # We start with this verb: average life expectancy over all observations
    gapminder %>%
      summarize(avglifeExp = mean(lifeExp))
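
A single `summarize()` call can also compute several summaries at once, and it can follow other verbs. This is a minimal sketch; the column names `medianLifeExp` and `maxGdpPercap` and the variable `summary_2007` are illustrative choices.

```r
library(gapminder)
library(dplyr)

# Several summaries in one summarize() call, after filtering to 2007
summary_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# The result has a single row with one column per summary
summary_2007
```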
      
## The group_by Verb
The **group_by** verb lets us group our data set by one or more variables, so that a following **summarize** produces one row per group instead of a single row.

    # Now we use the group_by verb
    # as.numeric() avoids integer overflow when summing populations
    gapminder %>%
      group_by(year, continent) %>%
      summarize(avglifeExp = mean(lifeExp), totalpop = sum(as.numeric(pop)))
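
To see what grouping changes, this sketch runs the same summary with and without `group_by()` and compares the number of rows. The names `overall` and `by_continent` are hypothetical.

```r
library(gapminder)
library(dplyr)

# Without group_by(): one row for the whole data set
overall <- gapminder %>%
  summarize(avglifeExp = mean(lifeExp))

# With group_by(): one row per continent
by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(avglifeExp = mean(lifeExp))

# Compare the number of rows in each result
nrow(overall)
nrow(by_continent)
```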
      
## Visualizing summarized Data
Now we will combine everything we have seen so far. We start by saving the summarized data in a new variable, and then we graph it.

    by_year <- gapminder %>%
      group_by(year, continent) %>%
      summarize(totalpop = sum(as.numeric(pop)), meanLifeExp = mean(lifeExp))

    # expand_limits(y = 0) makes the y-axis include zero
    ggplot(by_year, aes(x = year, y = totalpop, color = continent)) + geom_point() +
      expand_limits(y = 0)

## Other Graphs with ggplot2
You will learn to make four more types of graphs:

1. Line plots, which are useful for showing change over time.
2. Bar plots, which are good at comparing statistics for each of several categories.
3. Histograms, which describe the distribution of a one-dimensional numeric variable.
4. Box plots, which compare the distribution of a numeric variable among several categories.
    
## Line Plots
We use the same syntax as before, but now with **geom_line()** in our code:

    year_continent <- gapminder %>%
      group_by(year, continent) %>%
      summarize(meanlifeExp = mean(lifeExp), totalpop = sum(as.numeric(pop)))

    # Line plot
    ggplot(year_continent, aes(x = year, y = meanlifeExp, color = continent)) +
      geom_line() + expand_limits(y = 0)
      
## Bar Plot (geom_col)
To draw this kind of graph we use the **geom_col()** function.

Note: By default, `geom_bar()` uses `stat = "count"`, which makes the height of each bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use `stat = "identity"` and map a variable to the y aesthetic. This matters because calling `geom_bar()` with a y aesthetic but without the `stat` parameter produces an error.
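
To illustrate the default counting behavior described in the note, here is a minimal sketch in which no y aesthetic is mapped, so each bar shows how many rows (countries) fall in each continent. It assumes `gapminder_2007` is the 2007 subset, and `p_count` is a hypothetical name.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Assumption: gapminder_2007 is the subset of observations from 2007
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# No y aesthetic: geom_bar() counts the rows (countries) in each continent
p_count <- ggplot(gapminder_2007, aes(x = continent)) + geom_bar()

# Draw the plot
p_count
```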

    # Bar plot
    year_07 <- gapminder %>%
      filter(year == 2007) %>%
      group_by(continent) %>%
      summarize(meanLifeExp = mean(lifeExp))

    year_07
    ggplot(year_07, aes(x = continent, y = meanLifeExp)) + geom_bar(stat = "identity")
    # Another way: geom_col() is equivalent to geom_bar(stat = "identity")
    ggplot(year_07, aes(x = continent, y = meanLifeExp)) + geom_col()
    

## Histogram
For this graph we need the **geom_histogram()** function, and it is important to define the width of each bin with `binwidth`.

    #Histogram Plot
    ggplot(gapminder_2007, aes(x=lifeExp)) + geom_histogram(binwidth = 5)
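
The choice of `binwidth` changes how coarse the histogram looks. This sketch compares two widths; the values 10 and 2 are arbitrary choices for illustration, and `gapminder_2007` is again assumed to be the 2007 subset.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Assumption: gapminder_2007 is the subset of observations from 2007
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Wide bins: fewer, coarser bars
p_wide <- ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 10)

# Narrow bins: more, finer bars
p_narrow <- ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

# Draw both plots to compare them
p_wide
p_narrow
```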

Sometimes a few values are much higher than the rest, which causes the distribution to be very skewed. In that case it is necessary to transform the data with a scale, for example a log scale.

    # Create a histogram of population (pop), with x on a log scale
    ggplot(gapminder_2007, aes(x=pop)) + geom_histogram() + scale_x_log10()
    
## Box Plot
You can draw a box plot with the **geom_boxplot()** function:

    #BoxPlot
    ggplot(gapminder_2007, aes(x=continent, y=lifeExp)) + geom_boxplot()
