## Module 01: Practical Stats

### Lesson 01: Descriptive Statistics - Part I

> In this lesson, you will learn about data types, measures of center, and the basics of statistical notation.

#### 01. Introduce Instructors

Statistics is at the core of analyzing data.

For the stats portion of this class, you'll be learning from Sebastian Thrun and Josh Bernhard.

* Sebastian is a statistician and Stanford faculty member, as well as founder of Udacity and Google X. He'll be showcasing a number of examples for each of the statistical topics covered.
* Josh, who is also a statistician, has taught statistics at the University of Colorado and previously worked as a machine learning engineer for KPMG. He'll be working alongside Sebastian to ensure you can implement the statistical applications in Python.

#### 02. Text: Optional Lessons Note

This lesson and the next lesson for this course are the beginning of the [Data Foundations Nanodegree Program](https://www.udacity.com/course/data-foundations-nanodegree--nd100). Though this is considered prerequisite knowledge, these lessons are provided as a refresher for some of the ideas you may need to review.

Additionally, you might hear or see information about projects or other items that does not apply to the Data Analyst Nanodegree Program. Please ignore it and follow the information that applies to your program's projects and schedule. Cheers!

#### 03. Video: Welcome!

This video gives a quick overview leading up to your first project. We will start with an overview of data types and the most common statistics used when analyzing data.

We'll discuss **measures of center and spread**, the most common shapes that data takes on, and how to handle **outliers**. You will take this further by using **spreadsheets** to handle these calculations for you. You'll learn how to **build visuals** to better convey your message, as well as how to use these features of spreadsheets to take your data game to a whole new level.

All of this is aimed at helping you succeed on the very first project and at analyzing your own data.

#### 04. Video: What is Data? Why is it important?

**Definition of Data**

The word "Data(数据)" is defined as distinct pieces of information. 

**Forms of Data**

You may think of data as simply numbers on a spreadsheet, but data can come in many forms, from text to video to spreadsheets and databases to images to audio, and I'm sure I'm forgetting many other forms.

**Utilizing Data**

Utilizing data is the new way of the world. Data is used to understand and improve nearly every facet of our lives, from early disease detection to social networks that allow us to connect and communicate with people around the world. No matter what field you're in, from insurance and banking to medicine, education, agriculture, automotive, manufacturing, and so on, you can utilize data to make better decisions and accomplish your goals.

#### 05. Video: Data Types (Quantitative vs. Categorical)

**Data Types**: Quantitative and Categorical.

* **Quantitative(定量)** data take on numeric values that allow us to perform mathematical operations (like the number of dogs).
* **Categorical(分类)** data are used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.).

#### 06. Quiz: Data Types (Quantitative vs. Categorical)

**QUESTION**

Can you identify the data types below as either quantitative or categorical?

* Zip Code
* Age
* Income
* Marital Status (Single, Married, Divorced, etc.)
* Height
* Letter Grades (A+, A, A-, B+, B, B-, …)
* Travel Distance to Work
* Ratings on a Survey (Poor, Ok, Great)
* Temperature
* Average Speed

**Solution**

Q for Quantitative, C for Categorical:

| Variable | Data Type |
| --- | --- |
|Zip Code| C |
|Age| Q |
|Income| Q |
|Marital Status (Single, Married, Divorced, etc.)| C |
|Height| Q |
|Letter Grades (A+, A, A-, B+, B, B-, …)| C |
|Travel Distance to Work| Q |
|Ratings on a Survey (Poor, Ok, Great)| C |
|Temperature| Q |
|Average Speed| Q |

#### 07. Video: Data Types (Ordinal vs. Nominal)

We can divide categorical data further into two types: **Ordinal** and **Nominal**.

* **Categorical Ordinal data(分类定序数据)** take on a ranked ordering (like rating an interaction with the dogs on a scale from Very Poor to Very Good).
* **Categorical Nominal data(分类定类数据)** do not have an order or ranking (like the breeds of the dogs).

#### 08. Video: Data Types (Continuous vs. Discrete)

We can think of **quantitative** data as being either **continuous** or **discrete**.

* **Continuous(连续)** data can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of the dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.
* **Discrete(离散)** data only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

#### 09. Video: Data Types Summary

| Data Types |  |  |
| --- | --- | --- |
| **Quantitative** | **Continuous**: Height, Age, Income | **Discrete**: Pages in a Book, Trees in Yard, Dogs at a Coffee Shop |
| **Categorical** | **Ordinal**: Letter Grade, Survey Rating | **Nominal**: Gender, Marital Status, Breakfast Items |

#### 10. Text + Quiz: Data Types (Ordinal vs. Nominal)

To break down our data types, there are two main blocks: **Quantitative** and **Categorical**.

* **Quantitative** can be further divided into **Continuous** or **Discrete**.
* **Categorical** data can be divided into **Ordinal** or **Nominal**.

You should now have mastered which types of data in the world around us fall into each of these four buckets: **Discrete**, **Continuous**, **Nominal**, and **Ordinal**.

**Quantitative vs. Categorical**

* Some of these can be a bit tricky - **notice even though zip codes are a number, they aren’t really a quantitative variable**. If we add two zip codes together, we do not obtain any useful information from this new value. Therefore, this is a categorical variable.
* **Height, Age, the Number of Pages in a Book and Annual Income** all take on values that we can add, subtract and perform other operations with to gain useful insight. Hence, these are quantitative.
* **Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code** can be thought of as labels for a group of items or individuals. Hence, these are categorical.

**Continuous vs. Discrete**

* To consider if we have continuous or discrete data, we should see if we can split our data into smaller and smaller units. 
* Consider **time** - we could measure an event in years, months, days, hours, minutes, or seconds, and even at seconds we know there are smaller units we could measure time in. Therefore, we know this data type is continuous. 
* **Height, age, and income** are all examples of continuous data. Alternatively, **the number of pages in a book, dogs** I count outside a coffee shop, or trees in a yard are discrete data. We would not want to split our dogs in half.

**Ordinal vs. Nominal**

* In looking at categorical variables, we found that **Gender, Marital Status, Zip Code and your Breakfast items** are nominal variables, where there is no rank ordering associated with the data. Whether you ate **cereal, toast, eggs, or only coffee** for breakfast, there is no rank ordering associated with your breakfast.
* Alternatively, **the Letter Grade or Survey Ratings** have a rank ordering associated with them, which makes them ordinal data. If you receive an A, this is higher than an A-. An A- is ranked higher than a B+, and so on… Ordinal variables frequently occur on rating scales from very poor to very good. **In many cases we turn these ordinal variables into numbers**, as we can more easily analyze them, but more on this later!
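As a rough sketch of what "turning ordinal variables into numbers" can look like, the Python snippet below maps survey ratings to integer ranks. The five-point scale and the sample responses are assumptions made up for illustration; only the endpoints "Very Poor" and "Very Good" come from the lesson.

```python
# Hypothetical five-point scale, listed from lowest to highest rank.
RATING_ORDER = ["Very Poor", "Poor", "Ok", "Good", "Very Good"]

# Map each label to its position on the scale (1 = lowest, 5 = highest).
rating_to_rank = {label: rank for rank, label in enumerate(RATING_ORDER, start=1)}

# Made-up survey responses.
responses = ["Good", "Poor", "Very Good", "Ok"]

# Encode each response as its rank so it can be sorted and compared.
encoded = [rating_to_rank[response] for response in responses]
print(encoded)  # [4, 2, 5, 3]
```

The numbers preserve the ordering of the labels, but the gaps between them are not necessarily meaningful: "Good" is ranked above "Poor", yet it is not "twice as good".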

**Final Words**

In this section, we looked at the different data types we might work with in the world around us. When we work with data in the real world, it might not be very clean - sometimes there are **typos or missing values**. When this is the case, simply having some **expertise regarding the data and knowing the data type** can assist in our ability to ‘**clean**’ this data. **Understanding data types** can also assist in our ability to build visuals to best explain the data.

**Quiz**

This quiz will ensure you have a clear understanding of the differences between categorical nominal and categorical ordinal variables. All of the variables below are categorical. It is your job to check all of those that are **nominal**. Do not check the **ordinal categorical variables**.

* [ ] Letter Grades (A, B+, B, B-, etc.)
* [x] Types of Fruit (Apple, Banana, etc.)
* [ ] Ratings on a Survey (Poor, Ok, Great)
* [x] Types of Dog Breeds (German Shepherd, Collie, etc.)
* [x] Genres of Movies (Horror, Comedy, etc.)
* [x] Gender  
* [x] Nationality
* [ ] Education (HS, Associates, Bachelors, Masters, PhD, etc.)


#### 11. Data Types (Continuous vs. Discrete)

**Quiz**

This quiz will ensure you have a clear understanding of the differences between **quantitative continuous** and **discrete variables**. All of the variables below are quantitative. It is your job to check all of those that are **continuous**. Do not check the **discrete** ones.

* [x] Travel Distance from Home to Work
* [ ] Number of Pages in a Book  
* [x] Amount of Rain in a Year 
* [x] Time to Run a Mile 
* [ ] Number of Movies Watched in a Week 
* [x] Amount of Water Consumed in a Day  
* [ ] Number of Phones per Household

#### 12. Video: Introduction to Summary Statistics

In the next lessons, we will discuss how to use statistics to **describe quantitative data**. You will gain insight into the process of **how data is collected and how to answer questions using your data**. Throughout this lesson, I hope you learn to think critically about the analysis happening under the hood and about **what the numbers actually mean**.

**Measures of center** give you an idea of the average student, while **measures of spread** give you an idea of how students differ. **Visuals** can provide a more complete picture of how long it takes any student to complete a program.

#### 13. Video: Measures of Center (Mean)

**Four Aspects for Quantitative Data**

There are **four main aspects to analyzing Quantitative data**:

1. **Measures of Center(中心测度)**
2. **Measures of Spread(分布测度)**
3. **The Shape of the data(数据形状)**.
4. **Outliers(异常值)**

**Analyzing Categorical Data**

**Categorical data** is usually analyzed by looking at the **counts(计数/频数) or proportions(Proportion, 比/构成比)** of individuals that fall into each group.
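To make counts and proportions concrete, here is a small Python sketch using the standard library's `collections.Counter`. The list of dog breeds is invented sample data, not something from the course.

```python
from collections import Counter

# Made-up categorical observations: one breed label per dog.
breeds = ["Collie", "Lab", "Poodle", "Lab", "Collie", "Lab"]

# Count how many individuals fall into each group.
counts = Counter(breeds)

# Convert each count to a proportion of the whole dataset.
total = len(breeds)
proportions = {breed: count / total for breed, count in counts.items()}

print(counts["Lab"])       # 3
print(proportions["Lab"])  # 0.5
```

The proportions always add up to 1, which makes groups comparable across datasets of different sizes.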

**Measures of Center**

There are **three measures of center**:

1. Mean
2. Median
3. Mode

**The Mean**

The **mean(均值)** is often called the **average** or the **expected value(期望值)** in mathematics. We calculate the mean by adding all of our values together, and dividing by the number of values in our dataset.

#### 14. Measures of Center (Mean)

**Measures of Center I**

Which of the below are measures of center (Check all that apply)?

* [x] Mean
* [ ] Standard Deviation
* [ ] Variance
* [x] Median
* [ ] Inter-quartile Range
* [x] Mode
* [ ] Range
* [ ] Maximum
* [ ] Minimum

**Measures of Center II**

If we have the data $5, 8, 15, 7, 10, 22, 3, 1, 15$, what is the mean?

Answer: 
$(5+15+10+22+3+15+8+7+1) / 9 = 86/9 \approx 9.56$
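The same calculation can be sketched in a few lines of Python. The helper name `mean` is chosen for illustration; it simply adds the values and divides by how many there are.

```python
def mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


data = [5, 8, 15, 7, 10, 22, 3, 1, 15]
print(round(mean(data), 2))  # 9.56
```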

#### 15. Video: Measures of Center (Median)

The **Median(中位数)** splits our data so that 50% of our values are lower and 50% are higher. We found in this video that how we calculate the median depends on whether we have an **even** number of observations or an **odd** number of observations.

**Median for Odd Values**

If we have an **odd number** of observations, the median is simply the **number in the direct middle**. For example, if we have 7 observations, the median is the fourth value when our numbers are ordered from smallest to largest. If we have 9 observations, the median is the fifth value.

**Median for Even Values**

If we have an **even number** of observations, the median is the **average of the two values in the middle**. For example, if we have 8 observations, we average the fourth and fifth values together when our numbers are ordered from smallest to largest.

In order to compute the median we MUST sort our values first. Whether we use the mean or median to describe a dataset is largely dependent on the shape of our dataset and if there are any outliers.
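The odd and even cases can be sketched in Python as follows. The function name `median` and the sample values are assumptions for illustration; note that the values are sorted first, as required.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    if not values:
        raise ValueError("median requires at least one value")
    ordered = sorted(values)  # the median is only defined on sorted data
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single value in the direct middle.
        return ordered[middle]
    # Even count: average the two values around the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


seven_values = [3, 1, 7, 5, 9, 11, 13]      # sorted: 1, 3, 5, 7, 9, 11, 13
eight_values = [3, 1, 7, 5, 9, 11, 13, 15]  # sorted: 1, 3, 5, 7, 9, 11, 13, 15

print(median(seven_values))  # 7, the fourth of seven sorted values
print(median(eight_values))  # 8.0, the average of the fourth and fifth values
```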

#### 16. Measures of Center (Median)

**Median 1**

* If we have the data $5, 8, 15, 7, 10, 22, 3, 1, 15$, what is the median?

* Solution: sorted data $[1, 3, 5, 7, 8, 10, 15, 15, 22]$, median is $8$ (the fifth of nine values).

**Median - average**

If we have the data $5, 8, 15, 7, 10, 22, 3, 1, 15, 2$, what is the median?

* Solution: ordered data $[1, 2, 3, 5, 7, 8, 10, 15, 15, 22]$, median is $(7+8)/2 = 7.5$.

#### 17. Video: Measures of Center (Mode)

The **mode(众数)** is the **most frequently observed value** in our dataset. There might be multiple modes for a particular dataset, or no mode at all.

**No Mode**

If all observations in our dataset are observed with the **same frequency(频数)**, there is no mode. For example, the dataset $1, 1, 2, 2, 3, 3, 4, 4$ has no mode, because all observations occur the same number of times.

**Many Modes**

If two (or more) numbers share the maximum frequency, then there is more than one mode. For example, the dataset $1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9$ has two modes, 3 and 6, because these values share the **maximum frequency** of 3 occurrences, while all other values appear only once.

These are all **three potential measures of center**:

1. The Mean or the average
2. The Median or the middle value
3. The Mode or the most frequent value
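The "no mode" and "many modes" cases can be sketched in Python like this. The function name `find_modes` is made up for illustration. It follows the lesson's convention that a dataset whose distinct values all occur equally often has no mode; treating a dataset with only one distinct value as having that value as its mode is an extra assumption.

```python
from collections import Counter


def find_modes(values):
    """Return a sorted list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    lowest = min(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and highest == lowest:
        return []
    # Otherwise every value that reaches the highest frequency is a mode.
    return sorted(value for value, count in counts.items() if count == highest)


print(find_modes([1, 1, 2, 2, 3, 3, 4, 4]))                 # []
print(find_modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))  # [3, 6]
```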

#### 18. Measures of Center (Mode)

**Measures of center**

We want to summarize the number of dogs our friends have into a single number. We will use the measures of center for this problem. Ashley has 1 dog, Steve has 1 dog, Jeff has 2 dogs, Kylie has 3 dogs, and Lisa has 8 dogs. There is no best measure of center so we need to try all three to see what makes sense.
What is the mean, median, and mode for the number of dogs our friends have?

* Solution
    * original data: $[1, 1, 2, 3, 8]$
    * ordered data: $[1, 1, 2, 3, 8]$
    * mean: $(1+1+2+3+8)/5 = 15/5 = 3$
    * median: $2$ (the middle value of an odd number of values)
    * mode: $1$ (appears twice)
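For comparison, Python's standard `statistics` module can compute all three measures directly. This is only a sketch for checking the hand calculation above; the variable names are illustrative, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Number of dogs owned by Ashley, Steve, Jeff, Kylie, and Lisa.
dogs = [1, 1, 2, 3, 8]

center = {
    "mean": statistics.mean(dogs),
    "median": statistics.median(dogs),
    "mode": statistics.multimode(dogs),  # list of all most-frequent values
}
print(center)

# Dropping Lisa's 8 dogs shows how strongly one large value pulls the mean.
without_lisa = dogs[:-1]
print(statistics.mean(without_lisa))    # 1.75
print(statistics.median(without_lisa))  # 1.5
```

Removing the single largest value moves the mean from 3 to 1.75 but the median only from 2 to 1.5, which is one reason to look at more than one measure of center. Note that `multimode` returns every tied value, so it does not apply the lesson's "no mode" convention for datasets where all values occur equally often.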

**When? Where? Why?**

Check all of the below that are true with regards to our measures of center.

* [ ] The mode is the middle number in the dataset when the numbers are rank ordered. 
* [x] The median is the middle number in the dataset when the numbers are rank ordered. (Suspect)
* [ ] The mean is always the best measure of center for any dataset.
* [ ] The mean is always less than the median.
* [ ] The median is always the best measure of center for any dataset.
* [ ] The mode is always the best measure of center for any dataset.

**Mode**

If we have the data: $5, 8, 15, 7, 10, 22, 3, 1, 15$. What is the mode?

* Solution
    * original data: $[5, 8, 15, 7, 10, 22, 3, 1, 15]$
    * ordered data: $[1, 3, 5, 7, 8, 10, 15, 15, 22]$
    * mode: $15$ (appears twice)

**Mean, Median, Mode**

For the dataset below match the correct measure to the value: $8, 12, 32, 10, 3, 4, 4, 4, 4, 5, 12, 20$

* Solution
    * original data: $[8, 12, 32, 10, 3, 4, 4, 4, 4, 5, 12, 20]$
    * ordered data: $[3, 4, 4, 4, 4, 5, 8, 10, 12, 12, 20, 32]$
    * mean: $(3+5+12+4+4+12+4+4+32+8+10+20)/12 = 118/12 \approx 9.83$
    * median: $(5+8)/2 = 6.5$ (even number of values)
    * mode: $4$ (appears four times)

**Modes?**

If we have the data $5, 8, 15, 7, 10, 22, 3, 1, 15, 10$, mark all statements that are true.

* Solution
    * original data: $[5, 8, 15, 7, 10, 22, 3, 1, 15, 10]$
    * ordered data: $[1, 3, 5, 7, 8, 10, 10, 15, 15, 22]$
    * mean: $(1+7+22+5+15+10+10+15+3+8)/10 = 96/10 = 9.6$
    * median: $(8+10)/2 = 9$ (even number of values)
    * mode: $10$ and $15$ (each appears twice)


* [x] The mode is 15.
* [ ] The mean is 15.
* [x] The mode is 10.
* [ ] None of the above are true.


#### 19. Video: What is Notation?

**The Greek Alphabet**

**Notation**

**Notation(符号表达式)** is a common language used to communicate mathematical ideas. Think of notation as a universal language used by academic and industry professionals to convey mathematical ideas.

You likely already know some notation. Plus, minus, multiply, division, and equal signs all have mathematical symbols that you are likely familiar with. Each of these symbols replaces an idea for how numbers interact with one another. 

Though you will not need to use notation to complete the project, it does have the following **properties**:

1. **Understanding how to correctly use notation makes you seem really smart**. Knowing how to read and write in notation is like learning a new language. A language that is used to convey ideas associated with mathematics. 
2. **It allows you to read documentation, and implement an idea to your own problem**. Notation is used to convey how problems are solved all the time. One really popular mathematical algorithm that is used to solve some of the world's most difficult problems is known as [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting).
3. **It makes ideas that are hard to say in words easier to convey**. Sometimes we just don't have the right words to say. For those situations, I prefer to use notation to convey the message. Similar to the way an emoji or meme might convey a feeling better than words, notation can convey an idea better than words. Usually those ideas are related to mathematics, but I am not here to stifle your creativity.

#### 20. Video: Random Variables

**Rows and Columns**

**Spreadsheets(电子表格程序)** are a common way to hold data. They are composed of **rows(行)** and **columns(列)**. Rows run horizontally, while columns run vertically. Each **column** in a spreadsheet commonly holds a **specific variable**, while each **row** is commonly called an **instance(实例)** or **individual(个体)**.

The example used in the video is shown below.

| **Date** | **Day of Week** | **Time Spent On Site (X)** | **Buy (Y)** | 
|---|---|---|---|
| June 15 | Thursday | 5 | No | 
| June 15 | Thursday | 10 | Yes | 
| June 16 | Friday | 20 | Yes |

This is a **row**:

| Date | Day of Week | Time Spent On Site (X) | Buy (Y) |
|---|---|---|---|
| June 15 | Thursday | 5 | No |

This is a **column**:

|Time Spent On Site (X)|
|---|
|5|
|10|
|20|

**Random Variables**

A **random variable(随机变量)** is a **placeholder(占位符)** for the possible values of some process (mostly… the term 'some process' is a bit ambiguous). As was stated before, notation is useful in that it helps us **take complex ideas and simplify** (often to a single letter or single symbol). 

We see **random variables** represented by capital letters (X, Y, or Z are common ways to represent a random variable). We might have the random variable X, which is a holder for the possible values of the amount of time someone spends on our site. Or the random variable Y, which is a holder for the possible values of whether or not an individual purchases a product.

X is 'a holder' of the values that could possibly occur for the amount of time spent on our website. Any number from 0 to infinity really.

#### 21. Quiz: Variable Types

**Example Dataset**

An example of the data we might have collected in the previous video is shown here:

| Date | Day of Week | Time Spent On Site (X) | Buy (Y) | 
|---|---|---|---|
| June 15 | Thursday | 5 | No | 
| June 15 | Thursday | 10 | Yes | 
| June 16 | Friday | 20 | Yes |

1. Variable Type

What type of variable is the random variable X in the video in the previous concept?

Solution: **Quantitative - Continuous**

2. Data Types 2

What type of variable is the random variable Y in the video in the previous concept?

Solution: **Categorical - Nominal**

#### 22. Video: Capital vs. Lower

**Capital vs. Lower Case Letters**

Random variables are represented by capital letters. Once we observe an outcome of these random variables, we notate it as a lower case of the same letter.

**Example 1**

For example, the **amount of time someone spends on our site** is a **random variable** (we are not sure what the outcome will be for any particular visitor), and we would notate this with X. Then when the first person visits the website, if they spend 5 minutes, we have now observed this outcome of our random variable. We would **notate any outcome as a lowercase letter** with a subscript associated with the order that we observed the outcome.

If 5 individuals visit our website, the first spends 10 minutes, the second spends 20 minutes, the third spends 45 mins, the fourth spends 12 minutes, and the fifth spends 8 minutes; we can notate this problem in the following way:

$X$ is the amount of time an individual spends on the website.

$x_1=10$
$x_2=20$
$x_3=45$
$x_4=12$
$x_5=8$

The **capital X** is associated with this idea of a random variable, while the observations of the random variable take on **lowercase x values**.

**Example 2**

Taking this one step further, we could ask: What is the probability someone spends more than 20 minutes on our website?

In notation, we would write $P(X > 20)$. Here **P stands for probability(概率)**, while the parentheses encompass the statement for which we would like to find the probability. Since X represents the amount of time spent on the website, this notation represents the probability the amount of time on the website is greater than 20.

We could find this in the above example by noticing that only one of the 5 observations exceeds 20. So, we would say there is a 1 (the 45) in 5 or 20% chance that an individual spends more than 20 minutes on our website (based on this dataset).

**Example 3**

If we asked, "What is the probability of an individual spending 20 or more minutes on our website?", we could notate this as $P(X \geq 20)$.

We could then find this by noticing there are two out of the five individuals that spent 20 or more minutes on the website. So this probability is 2 out of 5 or 40%.
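Both probabilities can be checked with a short Python sketch that counts the observations satisfying a condition and divides by the total. The helper name `empirical_probability` is made up for illustration, and the result only describes this five-visitor dataset.

```python
# Observed times (in minutes) for the five visitors: x_1 through x_5.
times = [10, 20, 45, 12, 8]


def empirical_probability(observations, condition):
    """Fraction of observations for which condition(x) is true."""
    matches = sum(1 for x in observations if condition(x))
    return matches / len(observations)


p_more_than_20 = empirical_probability(times, lambda x: x > 20)
p_at_least_20 = empirical_probability(times, lambda x: x >= 20)

print(p_more_than_20)  # 0.2, only the 45-minute visit exceeds 20
print(p_at_least_20)   # 0.4, the 20- and 45-minute visits both count
```

The only difference between the two results is whether the visitor who spent exactly 20 minutes is included, which is the distinction between $>$ and $\geq$.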

#### 23. Quiz: Introduction to Notation

Consider we have the following table:

|Years Experience|Department|Part/Full Time|
|---|---|---|
|5|IT|Part Time|
|10|Finance|Full Time|
|8|HR|Full Time|
|1|Finance|Part Time|

Consider we have the following labels:

$X= \text{years of experience}$
$Y= \text{Department}$
$Z= \text{Part/Full Time}$

Match each piece of notation to its corresponding value:

$x_1 \implies 5$
$y_2 \implies \text{Finance}$
$z_3 \implies \text{Full Time}$
$n \implies 4$


#### 24. Video: Better Way?

**Notation for Calculating the Mean**

We know that the mean is calculated as the sum of all our values divided by the number of values in our dataset.

In our current notation, adding all of our values together can be extremely tedious. If we want to add 3 values of some random variable together, we would use the notation:

$x_1+x_2+x_3$


If we want to add 6 values together, we would use the notation:

$x_1+x_2+x_3+x_4+x_5+x_6$

To extend this to add one hundred, one thousand, or one million values would be ridiculous! How can we make this easier to communicate?!


#### 25. Video: Summation

**Aggregations**

An **aggregation(聚合)** is a way to turn multiple numbers into fewer numbers (commonly one number).

**Summation(求和)** is a common aggregation. The notation used to sum our values is the Greek letter sigma ($\Sigma$).

**Example 1**

Imagine we are looking at the amount of time individuals spend on our website. We collect data from nine individuals:

$x_1=10, \quad x_2=20, \quad x_3=45, \quad x_4=12, \quad x_5=8, \quad x_6=12, \quad x_7=3, \quad x_8=68, \quad x_9=5$

If we want to sum the first three values together in our previous notation, we write:

$x_1+x_2+x_3$

In our new notation, we can write:

$\displaystyle\sum_{i=1}^3 x_i$

Notice, our notation starts at the first observation ($i=1$) and ends at $3$ (the number at the top of our summation).

So all of the following are equal to one another: 
$\displaystyle\sum_{i=1}^3 x_i=x_1+x_2+x_3=10+20+45=75$

**Example 2**

Now, imagine we want to sum the last three values together.

$x_7+x_8+x_9$

In our new notation, we can write:

$\displaystyle\sum_{i=7}^9 x_i$

Notice, our notation starts at the seventh observation ($i=7$) and ends at $9$ (the number at the top of our summation).
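Summation notation maps naturally onto a loop or a slice in code. The sketch below uses a hypothetical helper named `sigma` that takes the 1-based start and end indices from the notation; Python lists are 0-based, so the helper shifts the start index by one.

```python
# The nine observed times on site: x_1 through x_9.
x = [10, 20, 45, 12, 8, 12, 3, 68, 5]


def sigma(values, start, end):
    """Sum values[start..end] using 1-based, inclusive indices as in the notation."""
    return sum(values[start - 1:end])


first_three = sigma(x, 1, 3)  # x_1 + x_2 + x_3
last_three = sigma(x, 7, 9)   # x_7 + x_8 + x_9

print(first_three)  # 75
print(last_three)   # 76
```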


#### 26. Video: Notation for the Mean

**Final Steps for Calculating the Mean**

To finalize our calculation of the mean, we introduce **n** as the **total number of values** in our dataset. We can use this notation both at the top of our summation, as well as for the value that we divide by when calculating the mean.

$\frac{1}{n}\displaystyle\sum_{i=1}^nx_i$

Instead of writing out all of the above, we commonly write $\bar{x}$ to represent the mean of a dataset. Although, similar to the first video, we could use any variable. Therefore, we might also write $\bar{y}$, or any other letter.

We also could index using any other letter, not just $i$. We could just as easily use $j,k$, or $m$ to index each of our data values. The quizzes on the next concept will help reinforce this idea.
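As a worked instance of the formula, take the five website visit times from the earlier example ($x_1=10$, $x_2=20$, $x_3=45$, $x_4=12$, $x_5=8$), so that $n=5$:

$\bar{x} = \frac{1}{5}\displaystyle\sum_{i=1}^5 x_i = \frac{10+20+45+12+8}{5} = \frac{95}{5} = 19$

The average visit in this small dataset therefore lasts 19 minutes.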

**Notice**
At 0:12 in the video, the notation shown on screen is incomplete. It should say $\displaystyle\sum_{i=1}^5x_i=x_1+x_2+x_3+x_4+x_5$.

#### 27. Quiz: Summation

**Match The Notation**

For this quiz, you will be matching the notation attached to the letters below to the corresponding numeric value to make sure you understand exactly what is being done with each part of the notation.

Imagine, we have the following table of values:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ |
|---|---|---|---|---|---|---|
| 5 | 15 | 3 | 3 | 8 | 10 | 12 |

Use the letters, numbers, and notation as defined above to match each letter to the appropriate value.

| Value | Notation |
| --- | --- |
| 8 | $x_5$ |
| 4 | $\frac{\displaystyle\sum_{i=3}^6 x_i}{n-1}$ |
| 57 | $\displaystyle\sum_{j=2}^7 x_j+6$ |
| 56 | $\displaystyle\sum_{i=1}^n x_i$ |
| 7 | $n$ |

#### 28. Quiz: Notation for the Mean

**Notation for Quizzes 1**

Use the letter next to the notation above to match the notation to the description of what the notation represents.

| Notation Letter | Description |
| --- | --- |
| $X$ | The notation for a random variable. |
| $Y$ | The notation for a random variable. |
| $x_1$ | The notation for the first observed value of a random variable. |
| $n$ | The notation for the number of rows in our dataset. |
| $\displaystyle\sum_{i=1}^n x_i$ | The notation for the sum of all the values in our dataset. |

**Notation for Quizzes 2**

If we wanted to provide notation for the mean of a particular dataset, which of the following letters would correspond to the notation attached to calculating the mean? (Mark all that apply.)

* [ ] $\displaystyle\sum_{i=1}^n x_i$
* [x] $\frac{\displaystyle\sum_{i=1}^n x_i}{n}$
* [x] $\bar x$
* [x] $\bar y$
* [x] $\frac{\displaystyle\sum_{j=1}^n x_j}{n}$

#### 29. Text: Summary on Notation

**Notation Recap**

Notation is an essential tool for communicating mathematical ideas. We have introduced the fundamentals of notation in this lesson that will allow you to read, write, and communicate with others using your new skills!

**Notation and Random Variables**

As a quick recap, **capital letters** signify **random variables**. When we look at **individual instances** of a particular random variable, we identify these as **lowercase letters with subscripts** attached to each specific observation.

For example, we might have $X$ be the amount of time an individual spends on our website. Our first visitor arrives and spends 10 minutes on our website, and we would say $x_1$ is 10 minutes.

We might imagine the random variables as columns in our dataset, while a particular value would be notated with the lower case letters.

| Notation | English | Example |
| --- | --- | --- |
| $X$ | A random variable | Time spent on website |
| $x_1$ | First observed value of the random variable $X$ | 15 mins |
| $\displaystyle\sum_{i=1}^n x_i$ | Sum values beginning at the first observation and ending at the last | $5 + 2 + … + 3$ |
| $\frac{1}{n}\displaystyle\sum_{i=1}^n x_i$ | Sum values beginning at the first observation and ending at the last and divide by the number of observations (the mean) | $(5 + 2 + 3)/3$ |
| $\bar x$ | Exactly the same as the above - the mean of our data. | $(5 + 2 + 3)/3$ |

**Notation for the Mean**

We took our notation even further by introducing the notation for summation $\Sigma$. Using this we were able to calculate the mean as:
$\frac{1}{n}\displaystyle\sum_{i=1}^n x_i$

In the next section, you will see this notation used to assist in your understanding of calculating various measures of spread. Notation can take time to fully grasp. Understanding notation not only helps in conveying mathematical ideas, but also in writing computer programs - if you decide you want to learn that too! Soon you will analyze data using spreadsheets. When that happens, many of these operations will be hidden by the functions you will be using. But until we get to spreadsheets, it is important to understand how mathematical ideas are commonly communicated. **This isn't easy, but you can do it!**

#### 30. Appendix: Glossary

* Data(数据)

* * *

* Categorical(分类)
* Categorical Ordinal data(分类定序数据) 
* Categorical Nominal data(分类定类数据) 
* Counts(计数/频数)
* Proportion(Proportion, 比/构成比)

* * *

* Quantitative(定量) 
* Continuous(连续)
* Discrete(离散) 
* Measures of Center(中心测度)
* Measures of Spread(分布测度)
* The Shape of the data(数据形状).
* Outliers(异常值)
* Mean(均值) - Expected Value(期望值) 
* Median(中位数)
* Mode(众数) 

* * *

* Notation(符号表达式)
* Random Variable(随机变量) 
* Placeholder(占位符)
* Probability(概率)
* Aggregation(聚合)
* Summation(求和)

* * *

* Spreadsheets(电子表格程序)
* Rows(行)
* Columns(列)
* Instance(实例)
* Individual(个体)
