# HarvardX: PH125.2x 
## Data Science: Visualization

#### In this course, you will learn:

* Data visualization principles to better communicate data-driven findings
* How to use ggplot2 to create custom plots
* The weaknesses of several widely used plots and why you should avoid them

#### Course overview

##### **Section 1: Introduction to Data Visualization and Distributions**

You will get started with data visualization and distributions in R. 

##### **Section 2: Introduction to ggplot2**

You will learn how to use ggplot2 to create plots. 

##### **Section 3: Summarizing with dplyr**

You will learn how to summarize data using dplyr.

##### **Section 4: Gapminder**

You will see examples of ggplot2 and dplyr in action with the Gapminder dataset.

##### **Section 5: Data Visualization Principles** 

You will learn general principles to guide you in developing effective data visualizations.


# Section 1: Introduction to Data Visualization and Distributions

After completing Section 1, you will:

* understand the importance of data visualization for communicating data-driven findings.
* be able to use distributions to summarize data.
* be able to use the average and the standard deviation to understand the normal distribution.
* be able to assess how well a normal distribution fits the data using a quantile-quantile plot.
* be able to interpret data from a boxplot.


### Data types

What type of data?

- Categorical (qualitative)

  - ordinal 
  
  - non-ordinal

- Numerical (quantitative)

  - discrete

  - continuous
  


### Exercise 1. Variable names

The type of data we are working with will often influence the data visualization technique we use. We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous.
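
As a rough illustration of how these four groups can look in R, here is a minimal sketch using base R only. The vectors `sex`, `size`, `children`, and `height` are made-up examples, not data from the course.

```r
# Categorical, non-ordinal: a factor with unordered levels
sex <- factor(c("Male", "Female", "Female", "Male"))

# Categorical, ordinal: an ordered factor (illustrative values)
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Numerical, discrete: counts stored as integers
children <- c(0L, 2L, 1L, 3L)

# Numerical, continuous: measurements stored as doubles
height <- c(68.5, 70.1, 61.32)

class(sex)          # "factor"
is.ordered(size)    # TRUE
size[1] < size[2]   # TRUE: "small" ranks below "large"
typeof(children)    # "integer"
typeof(height)      # "double"
```

Comparisons such as `<` are only meaningful for ordered factors, which is one practical difference between ordinal and non-ordinal categorical data.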

We will review data types using some of the examples provided in the dslabs package, starting with the heights dataset:
```
library(dslabs)
data(heights)
```

In [1]:
library(dslabs)
data(heights)
names(heights)

### Exercise 2. Variable type

We saw that sex is the first variable. We know what values are represented by this variable and can confirm this by looking at the first few entries:
```
library(dslabs)
data(heights)
head(heights)
```
What data type is the ```sex``` variable?

In [2]:
# categorical

### Exercise 3. Numerical values

Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members.

The height variable could be ordinal if, for example, we report a small number of values such as short, medium, and tall. Let's explore how many unique values are used by the heights variable. For this we can use the unique function:
```
x <- c(3, 3, 3, 3, 4, 4, 2)
unique(x)
```

In [3]:
library(dslabs)
data(heights)
x <- heights$height

length(unique(x))
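
To make the short/medium/tall idea concrete, here is a small sketch using the base R function `cut`. The heights and the cut points (64 and 72 inches) are arbitrary values chosen for illustration, not values from the heights dataset.

```r
# Illustrative heights in inches (not taken from the heights dataset)
x <- c(60, 63.5, 66, 68, 69.5, 71, 74, 76)

# Bin into three ordered groups; the cut points 64 and 72 are arbitrary
groups <- cut(x,
              breaks = c(-Inf, 64, 72, Inf),
              labels = c("short", "medium", "tall"),
              ordered_result = TRUE)

length(unique(x))       # 8 distinct numeric values
length(unique(groups))  # 3 groups
table(groups)           # short: 2, medium: 4, tall: 2
```

With only three groups, each containing several members, treating the binned variable as ordinal is reasonable. The raw numeric values do not have that property.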

### Exercise 4. Tables

One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table. Here is an example:
```
x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)
```
Use the table function to compute the frequencies of each unique height value. Because we will use the resulting frequency table in a later exercise, save the results into an object called tab.

In [4]:
library(dslabs)
data(heights)
x <- heights$height
tab <- table(x)
tab

x
              50               51               52               53 
               2                1                2                1 
           53.77               54               55               58 
               1                1                1                1 
              59          59.0551 59.0551181102362 59.8425196850394 
               6                1                2                1 
              60               61            61.32 61.8110236220472 
              17               12                1                2 
              62 62.2047244094488             62.4             62.5 
              19                3                1                2 
62.5984251968504             62.6  62.992125984252               63 
               1                1                3               31 
63.3858267716535 63.7795275590551               64           64.173 
               1                3               39                1 
         64.1732 64.173228346456
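
Frequencies can also be expressed as proportions. The sketch below reuses the small example vector from above and the base R function `prop.table`. It is an illustration, not part of the exercise solution.

```r
x <- c(3, 3, 3, 3, 4, 4, 2)
tab <- table(x)          # counts: the value 2 once, 3 four times, 4 twice

props <- prop.table(tab) # divide each count by the total number of entries
props
sum(props)               # proportions always add up to 1
```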

### Exercise 5. Indicator variables

To see why treating the reported heights as an ordinal value is not useful in practice, we note how many values are reported only once.

In the previous exercise we computed the variable tab which reports the number of times each unique value appears. For values reported only once tab will be 1. Use logicals and the function sum to count the number of times this happens.

Hint: here is an example.
```
x <- c(3, 3, 3, 3, 4, 4, 2)
tab <- table(x)
sum(tab==1)
```
Use the function sum to count the number of times entries in tab are equal to 1.

In [5]:
library(dslabs)
data(heights)
tab <- table(heights$height)
sum(tab == 1)
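
Counting with logicals works because R treats TRUE as 1 and FALSE as 0 inside `sum` and `mean`. A minimal sketch with the small example vector (the names `once` and `cond` are illustrative):

```r
x <- c(3, 3, 3, 3, 4, 4, 2)
tab <- table(x)

once <- tab == 1   # TRUE where a value appears exactly once
sum(once)          # 1: only the value 2 appears once
mean(once)         # proportion of distinct values that appear once (1 of 3)

# sum(condition) / length(x) and mean(condition) compute the same proportion
cond <- x > 2
all.equal(sum(cond) / length(x), mean(cond))   # TRUE
```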

### Exercise 6. Data types - heights

Since there are a finite number of reported heights and technically the height can be considered ordinal, which of the following is true:

In [6]:
# 1

# 1.2 Intro. to distributions

* Proportions
* Bar chart
* Histogram
* CDF (cumulative distribution function)
* Smooth density plot
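
As a small sketch of the first and fourth items, the base R functions `prop.table` and `ecdf` compute proportions and an empirical CDF. The example vector and the name `cdf` are illustrative. The course itself builds its plots with ggplot2.

```r
x <- c(3, 3, 3, 3, 4, 4, 2)

# Proportions of each distinct value (what a bar chart would display)
prop.table(table(x))

# Empirical CDF: cdf(a) is the proportion of values less than or equal to a
cdf <- ecdf(x)
cdf(3)            # 5/7: five of the seven values are <= 3
cdf(4) - cdf(2)   # 6/7: proportion of values in the interval (2, 4]
```

The difference of two CDF values gives the proportion of data in an interval, which is the same idea used later with the normal distribution.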

### Exercise 1. Distributions - 1

You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. So, for example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? We are going to learn when these two numbers are enough and when we need more elaborate summaries and plots to describe the data.
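
To see what a two-number summary can hide, here is a sketch with two made-up score vectors that share the same average (680) and standard deviation (50) but have very different shapes. The helper `rescale` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: rescale a vector to mean 680 and standard deviation 50
rescale <- function(v) 680 + 50 * (v - mean(v)) / sd(v)

a <- rescale(1:10)               # evenly spread scores
b <- rescale(c(rep(0, 9), 100))  # nine identical scores and one extreme score

c(mean(a), sd(a))   # 680 and 50
c(mean(b), sd(b))   # 680 and 50
range(a)            # roughly symmetric around 680
range(b)            # one score far above all the others
```

Both vectors would be reported as "680 plus or minus 50", yet only a look at the full distribution reveals the difference.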

Our first data visualization building block is learning to summarize lists of factors or numeric vectors. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as a distribution, there are several data visualization techniques to effectively relay this information. In later assessments we will practice writing code for data visualization. Here we start with some multiple choice questions to test your understanding of distributions and related basic plots.

In the murders dataset, the region is a categorical variable and on the right you can see its distribution. To the closest 5%, what proportion of the states are in the North Central region?

# 1.3 Quantiles, Percentiles, and Boxplots

* Quantile-quantile plot
* Percentiles
* Boxplot
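
The idea behind a quantile-quantile comparison can be sketched in base R with `quantile` and `qnorm`. The data here are simulated with `rnorm` (an assumption made so the example runs without the dslabs package), and the names `observed` and `theoretical` are illustrative.

```r
set.seed(1)                            # make the simulated data reproducible
x <- rnorm(1000, mean = 69, sd = 3)    # simulated, roughly normal "heights"

p <- seq(0.05, 0.95, 0.05)
observed    <- quantile(x, p)                         # sample percentiles
theoretical <- qnorm(p, mean = mean(x), sd = sd(x))   # normal percentiles

# For approximately normal data the two columns should be close
round(cbind(observed, theoretical), 1)

# Minimum, quartiles, and maximum: the quartiles define the box of a boxplot
quantile(x, c(0, 0.25, 0.5, 0.75, 1))
```

Plotting `observed` against `theoretical` gives a quantile-quantile plot. Points near the identity line suggest the normal distribution is a reasonable fit.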

### Exercise 1. Proportions

Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.

Here we focus on how the normal distribution helps us summarize data and can be useful in practice.

One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.
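
As a sketch of what "approximating without the entire list" means, the function `pnorm` gives the proportion of a normal distribution below a value using only an average and a standard deviation. The values 69 and 3 and the helper `prop_within` are assumptions chosen for illustration.

```r
avg <- 69     # assumed average, in inches
stdev <- 3    # assumed standard deviation, in inches

# Proportion of a normal distribution within k standard deviations of the average
prop_within <- function(k) {
  pnorm(avg + k * stdev, mean = avg, sd = stdev) -
    pnorm(avg - k * stdev, mean = avg, sd = stdev)
}

round(prop_within(1), 3)   # 0.683
round(prop_within(2), 3)   # 0.954
round(prop_within(3), 3)   # 0.997
```

These proportions do not depend on the particular average and standard deviation, which is why the two-number summary is so informative when the data are approximately normal.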

Load the heights dataset and create a vector x with just the male heights:
```
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
```
What proportion of the data is between 69 and 72 inches (taller than 69 but shorter than or equal to 72)?

The following is not the expected answer, even though it computes the same proportion:
```
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
sum(x > 69 & x <=72) / length(x)
```

In [7]:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 69 & x <=72)

### Exercise 2. Averages and Standard Deviations

Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution. We can compute the average and standard deviation like this:
```
library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
```
Suppose you only have avg and stdev, with no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?


Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches.
Note that you can't use x in your code, only avg and stdev. Also note that R has a function that may prove very helpful here: check out the pnorm function (and remember that you can get help by using ?pnorm).
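
Before using `pnorm` in the exercise, it may help to see how it behaves. The sketch below uses illustrative values (an average of 69 and a standard deviation of 3, not the values computed from the heights data) and a different interval from the one in the exercise.

```r
avg <- 69     # illustrative values, not computed from the heights data
stdev <- 3

# pnorm(q, mean, sd) gives the proportion of a normal distribution at or below q
pnorm(72, mean = avg, sd = stdev)

# Standardizing first gives the same answer from the standard normal
z <- (72 - avg) / stdev
pnorm(z)

# Proportion in the interval (66, 72]: subtract the two cumulative proportions
pnorm(72, avg, stdev) - pnorm(66, avg, stdev)
```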


In [8]:
library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)

pnorm(72, mean = avg, sd = stdev) - pnorm(69, mean = avg, sd = stdev)


### Exercise 3. Approximations

Notice that the approximation calculated in the second question is very close to the exact calculation in the first question. The normal distribution was a useful approximation for this case.

However, the approximation is not always useful. One case where it can fail is the more extreme values, often called the "tails" of the distribution. As an example, we can compute the proportion of heights between 79 and 81 inches.
```
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 79 & x <= 81)
```
Instructions

* Use normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx.
* Report how many times bigger the actual proportion is compared to the approximation.


In [9]:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
exact <- mean(x > 79 & x <= 81)
approx <- (pnorm(81, mean=mean(x), sd=sd(x))) - (pnorm(79, mean=mean(x), sd=sd(x)))
exact/approx
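
The same effect can be reproduced without the heights data. The sketch below builds a deliberately contrived vector: 990 values that follow a normal shape closely, plus 10 unusually large values. The construction is an assumption for illustration and is not how the heights dataset was generated.

```r
# Synthetic data: 990 values placed at normal quantiles,
# plus 10 unusually large values at 80 (a deliberately heavy right tail)
x <- c(qnorm(ppoints(990), mean = 69, sd = 3), rep(80, 10))

exact  <- mean(x > 79 & x <= 81)
approx <- pnorm(81, mean(x), sd(x)) - pnorm(79, mean(x), sd(x))

exact            # 0.01: only the 10 added values fall in (79, 81]
exact / approx   # well above 1: the normal approximation underestimates the tail
```

When a few extreme values are more common than the normal curve predicts, the average and standard deviation alone give a poor description of the tail.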

### Exercise 4. Seven footers and the NBA

Someone asks you what percent of seven footers are in the National Basketball Association (NBA). Can you provide an estimate? Let's try using the normal approximation to answer this question.

First, we will estimate the proportion of adult men that are 7 feet tall or taller.

Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.

Instructions

Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers. Print out your estimate; don't store it in an object.

In [12]:
1-pnorm(84, mean=69, sd=3)
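
For reference, 7 feet is 84 inches, which is 5 standard deviations above the assumed average. The sketch below also shows the `lower.tail = FALSE` argument of `pnorm`, an alternative that computes the upper tail directly instead of subtracting a number very close to 1.

```r
# 7 feet is 84 inches, which is 5 standard deviations above an average of 69
z <- (7 * 12 - 69) / 3
z

# Upper-tail proportion computed two ways
1 - pnorm(84, mean = 69, sd = 3)
pnorm(84, mean = 69, sd = 3, lower.tail = FALSE)
```

Both calls agree closely here. The direct upper-tail form tends to be the safer choice for values even further into the tail.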

### Exercise 5. Estimating the number of seven footers

Now we have an approximation for the proportion, call it p, of men that are 7 feet tall or taller.

We know that there are about 1 billion men between the ages of 18 and 40 in the world, the age range for the NBA.

Can we use the normal distribution to estimate how many of these 1 billion men are at least seven feet tall?

Instructions

* Use your answer to the previous exercise to estimate the proportion of men that are seven feet tall or taller in the world and store that value as p.
* Then round the number of 18-40 year old men who are seven feet tall or taller to the nearest integer. (Do not store this value in an object.)


In [11]:
p <- 1 - pnorm(84, mean = 69, sd = 3)
round(p * 10^9)
