# Data Analysis in Python
_Author: Ioann Dovgopoliy_

## Seminar 10. Not only Python

### Seminar outline

* Probability theory;
* Random variable & distribution;
* Sample (and its characteristics);
* Summary;
* Practice.

### Probability theory
What is probability? Suppose we are waiting for some important event. For instance, you are waiting for a friend at an agreed meeting point, and various mishaps could make your friend late. If we divide the number of scenarios in which a mishap occurs by the number of all possible scenarios (assuming all scenarios are equally likely), we get a value known as <i style="color: darkred;">probability</i>. Thus, the probability is $p=\frac{n}{N}$, where $n$ is the number of outcomes in which the event occurs and $N$ is the number of all possible outcomes.

For instance, the probability of getting tails in a single coin toss is $p=\frac{1}{2}$.
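The coin example can be sketched in code by listing the outcomes and counting. This is only an illustration: the names `outcomes`, `favourable` and `p_tails` are made up for the sketch, and it assumes both sides of the coin are equally likely.

```python
from fractions import Fraction

# All equally likely outcomes of a single coin toss.
outcomes = ["heads", "tails"]

# Outcomes in which the event of interest ("tails") occurs.
favourable = [o for o in outcomes if o == "tails"]

# p = n / N, kept as an exact fraction.
p_tails = Fraction(len(favourable), len(outcomes))
print(p_tails)  # 1/2
```

`Fraction` keeps the result exact, which makes it easy to compare with the formula written by hand.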

**Issue 1.** What is the probability of getting 6 when you roll a die?

In [None]:
# your code here

It is very important to remember two rules when dealing with probabilities ($\cap$ is intersection [and], $\cup$ is union [or]):

1. **Rule of sum.** If events $A$ and $B$ cannot occur together (they are mutually exclusive), the probability that $A$ *OR* $B$ occurs is the sum of their probabilities: $p(A \text{ or } B)=p(A \cup B)=p(A)+p(B)$;
2. **Rule of multiplication.** If events $A$ and $B$ are independent (one does not affect the other), the probability that $A$ *AND* $B$ both occur is the product of their probabilities: $p(A \text{ and } B)=p(A \cap B)=p(A) \cdot p(B)$.
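Both rules can be checked by enumerating outcomes. The sketch below uses two tosses of a fair coin; the helper `prob` and the variable names are hypothetical. The events being added cannot occur together, and the two tosses do not affect each other.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two coin tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Share of outcomes for which the predicate `event` is true."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

p_head = Fraction(1, 2)  # probability of heads in a single toss

# Rule of multiplication: heads on the first toss AND heads on the second.
p_both_heads = prob(lambda o: o == ("H", "H"))
print(p_both_heads == p_head * p_head)

# Rule of sum: exactly one head means HT OR TH (these cannot occur together).
p_ht = prob(lambda o: o == ("H", "T"))
p_th = prob(lambda o: o == ("T", "H"))
p_one_head = prob(lambda o: o.count("H") == 1)
print(p_one_head == p_ht + p_th)
```

Counting outcomes directly and applying the rules give the same numbers, which is a handy way to check an answer.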

**Issue 2.** What is the probability of getting 2 or 3 when you roll a die?

In [None]:
# your code here

**Issue 3.** What is the probability of getting 1 and then 1 again in two consecutive rolls?

In [None]:
# your code here

**Issue 4.** What is the probability of getting 4 and then 3 in two consecutive rolls?

In [None]:
# your code here

### Random variable & distribution
We are surrounded by random variables. What is the individual income of a random person in India? How many students will attend my next lecture? What grade will people who do not attend my lessons get? We cannot answer these questions exactly; we can only give approximate estimates.

Of course, one can argue that no event is physically random (we leave quantum theory aside here). But most events are affected by so many factors that it is reasonable to 'invent' the concept of a *random variable*. Warning: *random* here does not mean *arbitrary*. We cannot predict exactly the height of a random person on the street, but we can be almost certain that this stranger is shorter than 3 metres. In addition, you are more likely to meet a person of average height than, for instance, me.

So, a *random variable* has its characteristics and limitations.

**Issue 5.** What is the probability of meeting a dinosaur outside?

In [None]:
# your ideas in your mind

The characteristics and limitations of a *random variable* are described by its *distribution*. A distribution shows which values of the *random variable* are more likely and which are less likely. Let's look at the following picture:

![Z](https://cdn.scribbr.com/wp-content/uploads/2020/10/standard-normal-distribution-1024x633.png)

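This is the so-called *normal distribution*. The larger the area under the curve over some interval, the higher the probability of meeting a value from that interval. The total area under the curve always equals 1 (the variable is certain to take some value, and probability ranges from 0 to 1).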
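The link between area and probability can be sketched with `NormalDist` from Python's standard `statistics` module. The two intervals below are arbitrary choices for illustration, and the variable names are made up.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

# cdf(x) is the area under the curve to the left of x.
p_within_one = z.cdf(1) - z.cdf(-1)  # area between -1 and 1
p_right_tail = 1 - z.cdf(2)          # area to the right of 2

print(round(p_within_one, 4))  # 0.6827
print(round(p_right_tail, 4))  # 0.0228
```

The wide central part of the curve holds most of the area, while the tails hold very little.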

**Issue 6.** Which is more probable: meeting a value near 0 or near 1? Near -1 or near -2? Near 1 or near -1? Near -3 or near 6?

Many random variables are (approximately) normally distributed: human height and weight, psychological characteristics, etc.

**Issue 7.** What do you think: is the distribution of income in Russia normal (in the statistical sense)?

Of course, there are many other distributions.
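As one possible illustration of a non-normal distribution, the sketch below enumerates the number of heads in three tosses of a fair coin (a binomial distribution). The example and its variable names are assumptions added for illustration, not part of the seminar data.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three coin tosses.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give each number of heads.
counts = Counter(o.count("H") for o in outcomes)

# Distribution: number of heads -> probability.
distribution = {k: Fraction(v, len(outcomes)) for k, v in sorted(counts.items())}
for heads, p in distribution.items():
    print(heads, p)
```

The probabilities add up to 1, and the middle values (1 or 2 heads) are more likely than the extremes.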

### Sample (and its characteristics)
#### Measures of central tendency
Now we understand that every random variable has its distribution, i.e. its rules.

Assume that we want to estimate the average height of a Moscow citizen. Is it possible to measure the height of each person in Moscow? I think not. But we have an acceptable alternative: to take N random people from all Moscow citizens (let me call them the *population*). This random group is called a *sample*. If we measure the average height in our sample, we get an estimate for the whole population.

Once we have a sample, we can compute, for instance, its mean value. In this context, the mean value shows us a central tendency.
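The idea of estimating a population value from a sample can be sketched with simulated data. Everything here is hypothetical: the population is generated with `random.gauss`, and the parameters (170 cm, 10 cm, 10,000 people, a sample of 100) are made up for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (made-up parameters).
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a random subset of the population.
sample = random.sample(population, 100)

print(round(mean(population), 1))  # what we usually cannot measure directly
print(round(mean(sample), 1))      # estimate based on 100 people
```

The two means should be close but not identical: the sample gives an estimate, not the exact population value.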

**Issue 8.** Write a function that takes a sequence of real numbers as an argument and returns its mean.

In [None]:
heights = [188, 178, 164, 152, 157, 170, 188, 168, 172, 160]

# your code here

Of course, it can be tiring to write such functions manually. Here `pandas` helps us:

In [None]:
import pandas as pd

heights_pd = pd.Series(heights)
heights_pd

In [None]:
heights_pd.mean() # mean() method

Mean value is not the only measure of a central tendency. See:

In [None]:
heights_pd.mean(), pd.Series([240, 178, 164, 152, 157, 170, 188, 168, 172, 160]).mean() # replace 188 by 240

**Issue 9.** Above we have a very tall man among ordinary ones (such a value is called an *outlier*). Consequently, the mean value changes. Is the new mean representative of the sample?

For this case, we have another measure of central tendency: the *median*. It can be computed as follows:

1. Take the sample: [188, 178, 164, 152, 157, 170, 188, 168, 172, 160];
2. Sort it in ascending order: [152, 157, 160, 164, 168, 170, 172, 178, 188, 188];
3. Find the place where you have an equal number of values on the left and on the right: [152, 157, 160, 164, 168, **here** 170, 172, 178, 188, 188] (if the length of the sequence is odd, this place is the central number);
4. If the length of the sequence is even: $median = \frac{left+right}{2}$, where $left$ is the value to the left of the center and $right$ is the value to the right of the center (`(168 + 170) / 2 = 169`). If the length of the sequence is odd, the median is the central number.
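The steps above can be sketched as a small function. The name `median_by_steps` is made up for this sketch, and it assumes a non-empty sequence of numbers.

```python
def median_by_steps(values):
    """Median computed by sorting and taking the middle of the sequence."""
    ordered = sorted(values)           # step 2: sort in ascending order
    n = len(ordered)
    middle = n // 2                    # step 3: position of the centre
    if n % 2 == 1:
        return ordered[middle]         # odd length: the central number
    left, right = ordered[middle - 1], ordered[middle]
    return (left + right) / 2          # step 4: average of the two neighbours

heights = [188, 178, 164, 152, 157, 170, 188, 168, 172, 160]
print(median_by_steps(heights))  # 169.0
```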

Of course, `pandas` has a special method for the median:

In [None]:
heights_pd.median()

Unlike the mean value, the median is a *robust* statistic. This means that the median is only weakly influenced by outliers:

In [None]:
heights_pd.median(), pd.Series([240, 178, 164, 152, 157, 170, 188, 168, 172, 160]).median() # medians are equal

**Issue 10.** Find the medians for:

1. [1, 2, 3, 4, 5];
2. [1, 2, 3, 4, 5, 6];
3. [35, -2, 9, 56, 0.2, -81.3, 0];
4. [0, 0, 0, 0, 1, 0, 0].

**Issue 11.** What do you think: what could it mean when the mean and the median are very different?

#### Measures of dispersion

Look at these two sequences:

In [None]:
heights2 = [169.4, 170, 169.7, 169.6, 169.8, 169, 170.4, 169.7, 168.7, 170.7]
print(heights, heights2, sep='\n')

In [None]:
print(pd.Series(heights).mean(), pd.Series(heights2).mean(), sep='\n')

They have equal means. But are their values equally dispersed? Let's check:

In [None]:
min1 = pd.Series(heights).min()
max1 = pd.Series(heights).max()
min2 = pd.Series(heights2).min()
max2 = pd.Series(heights2).max()

print(f'First min and max are {min1} and {max1} while second - {min2} and {max2}.')

As you can see, the ranges of the first and the second sequences are different. Of course, there is a special statistic that measures how widely the values of a sample are spread around the mean. This statistic is called *variance*. Its formula is:

![var](https://www.statisticshowto.com/wp-content/uploads/2009/08/usual.png)

Here $X$ is each value, $\bar{X}$ ($X$ with a bar) is the mean, and the sum of squared differences is divided by the sample length ($N$) minus `1`, i.e. $\frac{\sum (X-\bar{X})^2}{N-1}$.
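As a small worked example of the variance described above, take the sample $[2, 4, 6]$. The numbers are made up for illustration, and $s^2$ is used here as an assumed symbol for the variance.

$$\bar{X} = \frac{2+4+6}{3} = 4, \qquad s^2 = \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3-1} = \frac{4+0+4}{2} = 4.$$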

**Issue 12.** Write a function that takes a sequence of real numbers as an argument and returns its variance.

In [None]:
heights = [188, 178, 164, 152, 157, 170, 188, 168, 172, 160]

# your code here

As *variance* involves squared differences, it is measured in squared units, and we might want to remove this 'squareness'. If we take the square root of the variance, we get the *standard deviation*. Of course, `pandas` allows us to compute both:

In [None]:
heights_pd.var()

In [None]:
heights_pd.std()

In [None]:
heights_pd.var() ** (1 / 2)
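The `pandas` methods `var()` and `std()` divide by $N - 1$ by default (their `ddof` parameter defaults to `1`). The sketch below uses Python's standard `statistics` module as an independent cross-check. It compares the "divide by $N - 1$" (sample) and "divide by $N$" (population) versions; the variable names are made up.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

heights = [188, 178, 164, 152, 157, 170, 188, 168, 172, 160]

sample_var = variance(heights)       # divides by N - 1
population_var = pvariance(heights)  # divides by N

print(sample_var, population_var)

# Standard deviation is the square root of the corresponding variance.
print(isclose(stdev(heights), sqrt(sample_var)))
print(isclose(pstdev(heights), sqrt(population_var)))
```

The sample variance is slightly larger than the population variance, because the same sum of squared differences is divided by a smaller number.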

**Issue 13.** Find the variance and standard deviation:

1. [1, 2, 3, 4, 5];
2. [1, 2, 3, 4, 5, 6];
3. [35, -2, 9, 56, 0.2, -81.3, 0];
4. [0, 0, 0, 0, 1, 0, 0].

#### Quantiles
One more interesting way to characterize a sample is the *quantile*.

**Definition.** The quantile of level $p$ is the value such that the share of values to its left (including the value itself) is $p$.

**Example.** Level is `0.5`:

Sequence: [3, 4, 1, 8, 2, 0];<br>
Sorted sequence: [0, 1, 2, 3, 4, 8];<br>
Quantile of level `0.5` is `2`, because:

1. How many values are to the left of `2` (including `2`)? 3: 0, 1, 2;
2. What is the length of the sequence? 6;
3. What is the share of 3 in 6? `3 / 6 = 0.5`.
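The counting in the example can be sketched as a function. This is one possible reading of the definition, not necessarily what `pandas` does internally: it returns the smallest value for which the share of values up to and including it is at least $p$. The name `quantile_by_definition` is hypothetical, and the sketch assumes a non-empty sequence and $0 < p \le 1$.

```python
def quantile_by_definition(values, p):
    """Smallest value such that the share of values <= it is at least p."""
    ordered = sorted(values)
    n = len(ordered)
    for count, value in enumerate(ordered, start=1):
        # `count` is the number of sorted values up to and including this position
        if count / n >= p:
            return value
    return ordered[-1]  # only reached if p is greater than 1

print(quantile_by_definition([3, 4, 1, 8, 2, 0], 0.5))  # 2
```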

There are several methods to compute quantiles in `pandas`; in general, they give close results. To select ours, set the `interpolation` argument to `'nearest'`:

In [None]:
pd.Series([3, 4, 1, 8, 2, 0]).quantile(0.5, interpolation='nearest')

**Issue 14.** You have the sequence from the example. Find quantiles of level:

1. 0.4;
2. 0.41;
3. 1;
4. 0.7.

### Summary

* *probability* is the ratio of the number of favourable outcomes to the number of all possible outcomes;
* *random variables* have their rules: the distribution;
* a *sample* can be characterized by its *mean*, *median* and *variance* (+ *standard deviation*).

### Practice
**Task 1.** Collect data about the height of 6 random people you meet from this moment on. Compute the mean, median, variance and standard deviation. Are the mean and the median very different? Why?

**Task 2.** You have the table `ec_data.xls`. Open it with `pandas` and compute all the metrics above for the integer and float columns. Interpret your results.

In [None]:
# your code here