# Intermediate Data Visualization with ggplot2
link: https://www.datacamp.com/search?facets%5Btechnology%5D%5B%5D=r&facets%5Btopic%5D%5B%5D=Data+Visualization&tab=courses

course: https://learn.datacamp.com/courses/intermediate-data-visualization-with-ggplot2

### Course Description
This ggplot2 course builds on your knowledge from the introductory course to produce meaningful explanatory plots. Statistics will be calculated on the fly and you’ll see how Coordinates and Facets aid in communication. You’ll also explore details of data visualization best practices with ggplot2 to help make sure you have a sound understanding of what works and why. By the end of the course, you’ll have all the tools needed to make a custom plotting function to explore a large data set, combining statistics and excellent visuals.


### Note 1 - Resizing plots in the R kernel for Jupyter notebooks
https://blog.revolutionanalytics.com/2015/09/resizing-plots-in-the-r-kernel-for-jupyter-notebooks.html

    library(repr)

    # Change plot size to 4 x 3
    options(repr.plot.width=4, repr.plot.height=3)
    
### Note 2 - Generating Markdown tables

https://www.tablesgenerator.com/markdown_tables


Other:

- Book: *Machine Learning with R* by Brett Lantz
- Learn about the `attr()` function


Laying out multiple plots on a page:  https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html

### Note 3 - DataFrames

At the moment we have trouble loading our dataset directly, but wrapping the address in `gzcon(url())`, or only `url()`, works.

    library(dplyr)
    library(ggplot2)
    library(gridExtra)
    data(mtcars)

    # Factor versions of cyl and am; in mtcars, am is coded 0 = automatic, 1 = manual
    mtcars$fcyl <- factor(mtcars$cyl, levels = c("4", "6", "8"))
    mtcars$fam <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))



## 1) Statistics
A picture paints a thousand words, which is why R ggplot2 is such a powerful tool for graphical data analysis. In this chapter, you’ll progress from simply plotting data to applying a variety of statistical methods. These include a variety of linear models, descriptive and inferential statistics (mean, standard deviation and confidence intervals) and custom functions.

### 1.1) (video) Stats with geoms

Welcome to the second ggplot2 course on data visualization. Here we build on the skills you learned in the first course and examine the following three layers in detail:

1. Statistics
2. Coordinates
3. Facets

Let's start with statistics. There are two categories of functions in this family:

- those called from within a `geom`
- those called independently

As you may have guessed, all statistics functions start with `stat_`.

We have already seen `stat_` functions at work. When we use `geom_histogram()`, it calls `stat_bin()` under the hood to count the observations in each bin. In the same way, `geom_bar()` uses `stat_count()` to count the observations in each group, and calling `stat_count()` directly gives the same result.

So specific `geom` functions and specific `stat` functions are paired.
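
As a minimal sketch of this pairing (assuming the built-in `mtcars` dataset with `cyl` as the grouping variable; the object names are only illustrative), the two plots below should be built from identical counts:

```r
library(ggplot2)

# Bar chart via the geom; geom_bar() uses stat_count() by default
p_geom <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# The same chart via the stat; stat_count() draws bars by default
p_stat <- ggplot(mtcars, aes(x = factor(cyl))) +
  stat_count()

# layer_data() returns the data frame that the stat computed for a layer
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count
```

Comparing `counts_geom` with `counts_stat` is a quick way to confirm that both layers summarize the data in the same way.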

### 1.2) (video) Stats: sum and quantile
Now we will discuss two useful functions, `geom_count()` and `geom_quantile()`. In the last course we saw four ways to overcome overplotting.

- `geom_count()` counts the number of observations at each location and maps the count onto the point area (the size aesthetic). Remember that each `geom` is associated with a `stat`; in this case it is `stat_sum()`.
- `geom_quantile()` fits a quantile regression to the data and draws the fitted quantiles as lines. The associated stat is `stat_quantile()`.


| Cause of overplotting           | Solutions               | Alternative here |
|---------------------------------|-------------------------|------------------|
| Large dataset                   | alpha                   |                  |
| Aligned values on a single axis | alpha + change position |                  |
| Low-precision data              | jitter                  | `geom_count()`   |
| Integer data                    | jitter                  | `geom_count()`   |

When both variables are integers, one solution is jittering with transparency. Another is `stat_sum()`, which calculates the total number of overlapping observations at each location and maps it onto the size aesthetic.

`stat_sum()` also computes a special variable, `..prop..` (written `after_stat(prop)` in ggplot2 3.3.0 and later), to show the proportion of values within the dataset.
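
A short sketch of both approaches, assuming the built-in `mtcars` dataset and ggplot2 3.3.0 or later for `after_stat()`. The choice of `cyl` and `gear` as the two integer variables is only for illustration:

```r
library(ggplot2)

# cyl and gear are both integer-valued, so many points overlap exactly
p_count <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_count()

# Map the proportion instead of the raw count onto size;
# group = 1 makes the proportions relative to the whole dataset
p_prop <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  stat_sum(aes(size = after_stat(prop), group = 1))

# One row per distinct (cyl, gear) pair, with the computed n and prop columns
count_data <- layer_data(p_count)
prop_data <- layer_data(p_prop)
```

Unlike jittering, this keeps every point at its true position and encodes the overlap in the point size.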


### 1.3) Stats outside geoms
Let's look at some statistics that you can call directly. The typical way to summarize data is with the mean and either the standard deviation or the 95% confidence interval. We can calculate them manually or do it directly in ggplot2.

We can also use another package, for example Hmisc:

    set.seed(123)
    xx <- rnorm(100)
    # Hmisc
    library(Hmisc)
    smean.sdl(xx, mult = 1)  # 1 sd
    
ggplot2 offers the same calculation as `mean_sdl(xx, mult = 1)`, which is a wrapper around the Hmisc function.

In summary, the stats that can be called directly include:

- `stat_summary()`: summarize y values at distinct x values.
- `stat_function()`: compute y values from a function of x values.
- `stat_qq()`: perform calculations for a quantile-quantile plot.
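
To make the last two concrete, here is a hedged sketch using the built-in `mtcars` dataset. The choice of `mpg`, the 10 bins, and the normal reference curve are assumptions for illustration, and `after_stat()` needs ggplot2 3.3.0 or later:

```r
library(ggplot2)

mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# stat_function(): overlay a normal density curve on a density-scaled histogram
p_fun <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10) +
  stat_function(fun = dnorm,
                args = list(mean = mpg_mean, sd = mpg_sd),
                colour = "red")

# stat_qq(): sample quantiles against theoretical normal quantiles,
# with stat_qq_line() adding the reference line
p_qq <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq() +
  stat_qq_line()

curve_data <- layer_data(p_fun, 2)  # x and y values computed by stat_function()
qq_data <- layer_data(p_qq, 1)      # one row per observation
```

Both plots address the same question, whether `mpg` looks roughly normal, from two different angles.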

`Summary statistics` refers to a combination of location (mean or median) and spread (standard deviation or confidence interval).

These metrics are calculated in `stat_summary()` by passing a function to the `fun.data` argument. `mean_sdl()` calculates the mean plus or minus a multiple of the standard deviation, and `mean_cl_normal()` calculates the t-corrected 95% CI.

Arguments to the data function are passed to `stat_summary()`'s `fun.args` argument as a list.
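
The sketch below shows how `fun.data` and `fun.args` fit together. To avoid depending on Hmisc, it uses a hypothetical helper, `mean_sd()`, that imitates the shape of what `mean_sdl()` returns (a data frame with `y`, `ymin` and `ymax`). The helper name and the choice of `wt` by `cyl` are assumptions, not part of the course material:

```r
library(ggplot2)

# Hypothetical helper: mean plus or minus `mult` standard deviations,
# returned in the y / ymin / ymax format that fun.data expects
mean_sd <- function(y, mult = 2) {
  data.frame(
    y = mean(y),
    ymin = mean(y) - mult * sd(y),
    ymax = mean(y) + mult * sd(y)
  )
}

# fun.args passes mult = 1 on to mean_sd(), so the ranges show +/- 1 SD
p_summary <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) +
  stat_summary(fun.data = mean_sd,
               fun.args = list(mult = 1),
               geom = "pointrange")

summary_data <- layer_data(p_summary)  # one row per cyl group
```

With Hmisc installed, replacing `mean_sd` with `mean_sdl` in the `stat_summary()` call should give the same picture.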

    # Manual calculation: mean and mean +/- 1 standard deviation
    set.seed(123)
    xx <- rnorm(100)
    mean(xx)
    mean(xx) + (sd(xx) * c(-1, 1))
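
The same manual approach extends to the t-corrected 95% confidence interval. This is a sketch of the standard formula, mean plus or minus the t quantile times the standard error; the variable names are illustrative:

```r
set.seed(123)
xx <- rnorm(100)

n <- length(xx)
std_error <- sd(xx) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# Lower and upper bounds of the 95% confidence interval
ci_95 <- mean(xx) + c(-1, 1) * t_crit * std_error
```

The result can be cross-checked against `t.test(xx)$conf.int`, which uses the same formula.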

## 2) Coordinates

The Coordinates layers offer specific and very useful tools for efficiently and accurately communicating data. Here we’ll look at the various ways of effectively using these layers, so you can clearly visualize lognormal datasets, variables with units, and periodic data.