### Summary Statistics vs. Visualizations
Summary statistics like the mean and standard deviation can be great for attempting to quickly understand aspects of a dataset, but they can also be misleading if you make too many assumptions about how the data distribution looks.

The [Datasaurus](https://video.udacity-data.com/topher/2019/November/5dc49fcf_samestats-differentgraphs/samestats-differentgraphs.pdf) dataset is an insightful and artistic example built on this same idea: datasets with the same summary statistics can look completely different when plotted. You can find the full dataset and the visualizations at the Datasaurus link.
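
As a minimal sketch of why summary statistics alone can mislead, the snippet below builds two synthetic columns (made up for illustration, not taken from the Datasaurus data) that share the same mean and standard deviation but have different shapes. The names `two_peaks` and `center_heavy` are hypothetical.

```python
import numpy as np

# Two synthetic columns, constructed so that both have
# mean 0 and population standard deviation 1.
two_peaks = np.tile([-1.0, 1.0], 50)  # values only at -1 and +1
center_heavy = np.tile([-np.sqrt(2), 0.0, 0.0, np.sqrt(2)], 25)  # half the values at 0

for name, values in [("two_peaks", two_peaks), ("center_heavy", center_heavy)]:
    print(f"{name}: mean={values.mean():.3f}, std={values.std():.3f}")

# The summary statistics agree, yet the value counts reveal different shapes.
print(np.unique(two_peaks, return_counts=True))
print(np.unique(center_heavy, return_counts=True))
```

A histogram of each column would show the difference at a glance, which is the reason to plot the data before trusting the summary numbers.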

### Exploratory vs. Explanatory Analyses

There are two main reasons for creating visuals using data:

1. **Exploratory analysis** is done when you are searching for insights. These visualizations don't need to be perfect or aesthetically appealing. You are the consumer of these plots, and you only need to be able to answer your own questions from them.


2. **Explanatory analysis** is done when you are presenting your results to others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

2. Clean - Here we could use exploratory visuals.

3. Explore - Here we use exploratory visuals.

4. Analyze - Here we might use either exploratory or explanatory visuals.

5. Share - Here is where explanatory visuals live.

### Univariate plots
For quantitative data, if we are looking at just one column of data, we have four common visuals:

1. Histogram
2. Normal Quantile Plot
3. Stem and Leaf Plot
4. Box and Whisker Plot

In most cases, you will want to use a **histogram**.
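
A minimal sketch of a histogram for one quantitative column, using Matplotlib. The data is randomly generated for illustration, and the column name `heights`, the bin count, and the axis labels are assumptions rather than anything from the course material.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 500 draws from a normal distribution, standing in for one column.
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=500)

fig, ax = plt.subplots()
# ax.hist returns the bin counts, the bin edges, and the drawn bar patches.
counts, bin_edges, _ = ax.hist(heights, bins=20)
ax.set_xlabel("height (cm)")
ax.set_ylabel("count")
```

To view the result, call `fig.savefig(...)` with a file name of your choice, or display `fig` in a notebook. Trying a few different values for `bins` is usually worthwhile, since the apparent shape can change with the bin width.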

For categorical data, if we are looking at just one variable (column), we have three common visuals:

1. Bar Chart
2. Pie Chart
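3. Pareto Chart: a chart that contains both bars and a line graph, where individual values are represented in descending order by bars and the cumulative total is represented by the line. It is built on the principle of "the vital few and the trivial many". The chart has two vertical axes and one horizontal axis: the left vertical axis shows frequency, the right vertical axis shows cumulative frequency (as a percentage), and the horizontal axis lists the names of the factors that affect quality, ordered by the size of their effect. The height of each bar shows how much the corresponding factor contributes (that is, how often it occurs), and the line above the bars shows the cumulative frequency (also called the Pareto curve).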

In most cases, you will want to use a **bar chart**.
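
The Pareto chart layout (bars in descending order plus a cumulative percentage line on a second axis) can be sketched as follows. The defect log is invented for illustration, and names such as `defects` and `cumulative_pct` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical defect log: one categorical column.
defects = pd.Series(
    ["scratch"] * 45 + ["dent"] * 30 + ["misprint"] * 15 + ["crack"] * 10,
    name="defect_type",
)

# value_counts sorts categories from most to least frequent by default.
counts = defects.value_counts()
cumulative_pct = counts.cumsum() / counts.sum() * 100

labels = list(counts.index)
fig, ax = plt.subplots()
ax.bar(labels, counts.values)  # bars: frequency per category (left axis)
ax.set_ylabel("frequency")

ax_pct = ax.twinx()  # second y-axis that shares the same x-axis
ax_pct.plot(labels, cumulative_pct.values, color="black", marker="o")
ax_pct.set_ylim(0, 105)
ax_pct.set_ylabel("cumulative %")
```

Dropping the `twinx` part leaves an ordinary bar chart, which is the plot the notes recommend in most cases.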

### Scatter plots
Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the **correlation coefficient** commonly denoted by **r**.

Though there are a few different ways to measure correlation between two variables, the most common way is with [Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). Pearson's correlation coefficient provides the

1. Strength
2. Direction

of a **linear relationship**. [Spearman's correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) ([Chinese version](https://zh.wikipedia.org/wiki/%E6%96%AF%E7%9A%AE%E5%B0%94%E6%9B%BC%E7%AD%89%E7%BA%A7%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0)) does not measure linear relationships specifically. It measures how well the relationship can be described by a monotonic function, so it might be more appropriate when two variables are associated but not linearly.
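
A small sketch of the difference between the two coefficients, on made-up data where `y` always increases with `x` but along a curve. Spearman's coefficient is computed here as Pearson's r on the ranks of the data. The helper `ranks` is a hypothetical shortcut that assumes no tied values.

```python
import numpy as np

# Hypothetical data: y grows with x, but along a curve rather than a line.
x = np.arange(1, 11)
y = x ** 3

# Pearson's r: strength and direction of the *linear* relationship.
pearson_r = np.corrcoef(x, y)[0, 1]


def ranks(values):
    """Rank positions 0..n-1; this shortcut assumes there are no tied values."""
    return np.argsort(np.argsort(values))


# Spearman's coefficient is Pearson's r computed on the ranks of the data.
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Because the relationship is perfectly monotonic, the rank-based coefficient reaches 1, while Pearson's r stays below 1 since the points do not fall on a straight line. For real data with ties, a library routine such as `scipy.stats.spearmanr` is a safer choice than this shortcut.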

### Line plots 
Line plots are a common plot for viewing data over time. These plots allow us to quickly identify overall trends, seasonal occurrences, peaks, and valleys in the data. You will commonly see these used in looking at stock prices over time, but really tracking anything over time can be easily viewed using these plots.
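
A minimal sketch of a line plot over time, using pandas for the dates and Matplotlib for the plot. The monthly series is synthetic (a trend plus a yearly cycle), and the name `sales` and its units are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical monthly series: an upward trend plus a yearly seasonal cycle.
months = pd.date_range("2020-01-01", periods=36, freq="MS")  # month-start dates
trend = np.linspace(100, 160, num=36)
seasonal = 10 * np.sin(2 * np.pi * np.arange(36) / 12)
sales = pd.Series(trend + seasonal, index=months)

fig, ax = plt.subplots()
ax.plot(sales.index, sales.values)
ax.set_xlabel("month")
ax.set_ylabel("sales (made-up units)")

# The peak and valley that the eye picks out on the line can also be located directly.
print("peak:", sales.idxmax().date(), "valley:", sales.idxmin().date())
```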

You can see one of Cole's seminars here. She stresses six lessons for communicating with data:

- **Understand the context** - this means knowing your audience and conveying a clear message about what you want your audience to know or do with the information you are providing.

- **Choose an appropriate visual display** - this was covered in the last lesson. Check out the lesson titled recap in the previous section if you need a quick refresher.

- **Eliminate clutter** - you should only provide information to the user that helps convey your message.

- **Focus attention where you want it**  - build visualizations that pull attention to the message you want to highlight.

- **Think like a designer**  - you will learn a number of design principles in this lesson to assist as you start to put together your own data visualizations.

- **Tell a story**  - your visualizations should give the audience a story. The most powerful data visualizations move people to take action.

### What experts say about visual encodings

Experts and researchers have determined the types of visual patterns that allow humans to best understand certain information. 

In general, humans are able to **best understand** data encoded with **positional changes** (differences in x- and y-position, as we see with scatter plots) and **length changes** (differences in box heights, as we see with bar charts and histograms).

In contrast, humans **struggle** to understand data encoded with **color hue changes** (which are unfortunately often used as an additional variable encoding in scatter plots; we'll study this in upcoming concepts) and **area changes** (as we see in pie charts, which often makes them a poor plot choice).

So, color should be reserved for drawing the eye of your audience to a key finding, and not for differentiating the bars of a bar chart when the colors carry no message.

Charts that have arcing angles or that compare values based on areas can easily deceive us.
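
One way to apply this advice is to draw every bar in a neutral color and use a single accent color for the bar that carries the message. The sketch below assumes made-up survey scores. The region names, values, and color choices are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical survey results; "Region C" is the finding to emphasize.
regions = ["Region A", "Region B", "Region C", "Region D"]
scores = [62, 58, 81, 60]
highlight = "Region C"

# One neutral color for context, one accent color for the key bar.
colors = ["tab:orange" if region == highlight else "lightgray" for region in regions]

fig, ax = plt.subplots()
ax.bar(regions, scores, color=colors)  # bar length carries the values
ax.set_ylabel("score")
ax.set_title("Region C scores well above the others")
```

The values are still encoded by bar length, which readers judge well. Color only directs attention.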

### Chart junk

Chart junk refers to all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph or that distract the viewer from this information.

Examples of chart junk include:

1. Heavy grid lines
2. Unnecessary text
3. Pictures surrounding the visual
4. Shading or 3D components
5. Ornamented chart axes
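
A hedged sketch of trimming some of these elements from a Matplotlib bar chart. The data is invented, and which elements count as clutter is a judgment call that depends on the chart and the audience.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical category counts.
categories = ["A", "B", "C"]
values = [12, 7, 9]

fig, ax = plt.subplots()
ax.bar(categories, values, color="gray")  # flat, single-color bars with no shading

ax.grid(False)  # no grid lines
for side in ("top", "right"):
    ax.spines[side].set_visible(False)  # remove the box edges that carry no data
ax.tick_params(length=0)  # hide tick marks but keep the tick labels
ax.set_ylabel("count")
```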

### Python Data Visualization Libraries
In this course, you will make use of the following libraries for creating data visualizations:

- Matplotlib: a versatile library for visualizations, but it can take some code effort to put together common visualizations.
- Seaborn: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.
- pandas: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data.

Together, these libraries will allow you to visualize data with a balance of productivity and flexibility, for both exploratory and explanatory analyses.