# Recitation 1: Introduction to Jupyter & Basic Probability and Statistics

1. NumPy: The fundamental package for scientific computing with Python.



2. Pandas: Fast, powerful, flexible and easy to use open source data analysis and manipulation tool.



3. Matplotlib: Comprehensive library for creating static, animated, and interactive visualizations in Python (next recitation)


In addition, we will work with Jupyter Notebook which enables us to have code, text, visualizations and more in a single document. 


## NumPy

### Data Types in Python
***
Python is a dynamically typed language. This means that types are checked at run time, every time we perform an operation, rather than before the code is executed.

### Python integers in disguise 
Every time we store an integer in Python, e.g. `x = 10`, Python stores much more than just the value 10. It actually stores 4 different pieces of information:
1. The value of the digit
1. The size of the variable
1. The type of the object
1. The reference count (handling memory allocation)


Since every object is maintained separately, we can create heterogeneous lists in python:

In [None]:
l = [1, "string", True, 122.3]

In [None]:
for element in l:
    print(f"type({element}) = {type(element)}")

While this flexibility is convenient, it also comes at a price. Every element in the list is a Python object containing extra information. Every time we iterate over the data we need to check the type of each element, which costs time and memory. In cases where we are holding a list of elements of the same type, this overhead is redundant.
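
As a rough illustration of this overhead, the sketch below inspects a single integer object and the elements of the heterogeneous list. It assumes the CPython interpreter; the exact byte counts reported by `sys.getsizeof` may differ between Python versions and platforms.

```python
import sys

x = 10
print(type(x))             # the type information stored with the object
print(sys.getsizeof(x))    # bytes used by the whole object, more than the 8 bytes of a raw 64-bit integer

l = [1, "string", True, 122.3]
# Every element is a full Python object with its own size and type
sizes = {repr(element): sys.getsizeof(element) for element in l}
print(sizes)
```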

### Importing NumPy
***
The convention is `import numpy as np`. We use an alias `np` so that we don't pollute our code with too much `numpy`.

In [None]:
!pip install numpy

In [None]:
import numpy as np
print("Numpy version:", np.__version__)

### Numpy Arrays
***
You can think of a NumPy array as a list where all the elements **must** be of the same type. This means all the elements in a NumPy array are treated equally, so no per-element type checking is needed.

In [None]:
a = np.array([1.0, 2, 3, 4])
type(a), a, a.dtype

In [None]:
ll = np.array([1, "string", True, 122.3])

type(ll), ll
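
Because all elements must share one type, NumPy converts mixed input to a common dtype. The short sketch below (variable names are illustrative only) shows numbers being upcast to floats, and everything being converted to strings once a string is present.

```python
import numpy as np

mixed_numbers = np.array([1, 2.5, True])
print(mixed_numbers.dtype)          # a floating-point dtype: ints and bools were upcast

mixed_with_text = np.array([1, "string", True, 122.3])
print(mixed_with_text.dtype.kind)   # 'U' marks a unicode string dtype
print(mixed_with_text)              # every element is now a string
```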

### Creating arrays from lists
As we just saw we can create NumPy arrays from python lists.  

In [None]:
a = np.array([1,2,3,4]) # Creating a 1d Array.
b = np.array([[1,2,3], [4, 5, 6], [7, 8, 9]]) # Creating a 2d Array.

print(a)
print('='*10)
print(b)

### Creating arrays with NumPy methods
We can also create arrays using different NumPy methods:

In [None]:
x = np.zeros(10)

In [None]:
x.shape

In [None]:
np.zeros((5, 5))

In [None]:
np.ones((3, 3))

In [None]:
np.full((3, 3), 5.0)

In [None]:
a = np.empty((2, 2))
a

In [None]:
np.arange(5)

In [None]:
np.arange(0, 10, 2)

In [None]:
np.linspace(0, 10 , 5)
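
`np.arange` and `np.linspace` look similar but treat their arguments differently. A minimal side-by-side sketch (variable names are illustrative only):

```python
import numpy as np

stepped = np.arange(0, 10, 2)     # start, stop (excluded), step size
spaced = np.linspace(0, 10, 5)    # start, stop (included), number of points

print(stepped)
print(spaced)
```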

### Creating arrays with NumPy random methods
<img src="https://cdn.pixabay.com/photo/2012/04/05/01/24/dice-25637__340.png" width="200" height="">

Another very useful way of creating NumPy arrays is by using the NumPy random library - `np.random`.

In [None]:
np.random.rand(3, 3) # from the uniform distribution

In [None]:
np.random.randint(low=10, high=50, size=(4, 4))  # discrete uniform distribution

In [None]:
np.random.normal(loc=0, scale=1, size=(6, 2))  # loc - mean, scale - std

### Exercise 

Create a (5,5) matrix with all the values from 1 to 25.

Get the bottom corner of the matrix (top left corner of the print) of size 2x2.

Create a (5,5) array with 1's on the border and 0's inside.

### NumPy aggregation

In most data science applications we start exploring the data by querying different statistics.   
NumPy allows us to do that quickly by using aggregation functions (you aggregate information as you iterate over the array), which summarize the values in an array.
Some of the most common aggregations are: 
```py
sum, mean, std, var, min, max.   
```
To view the entire aggregation list visit : [Numpy aggregation](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html).

In [None]:
a = np.random.randint(10, 20, size=10)
a

In [None]:
a.min(), a.max(), a.mean(), a.std()

In [None]:
a.argmin(), a.argmax()

Multi-dimensional arrays:

In [None]:
M = np.random.rand(5, 5)

In [None]:
M

In [None]:
M.max(), M.min(), M.mean()

In [None]:
a = np.arange(10).reshape(2, 5)

In [None]:
a, a.sum()

In [None]:
a.sum(axis=0)

In [None]:
a.sum(axis=1)
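
The `axis` argument names the dimension that is collapsed by the aggregation. The sketch below repeats the sums above and checks the resulting shapes; `keepdims=True` is an optional extra that keeps the collapsed axis with size 1.

```python
import numpy as np

a = np.arange(10).reshape(2, 5)

col_sums = a.sum(axis=0)                     # collapse the rows: one value per column
row_sums = a.sum(axis=1)                     # collapse the columns: one value per row
row_sums_2d = a.sum(axis=1, keepdims=True)   # same values, but the collapsed axis is kept

print(col_sums.shape, row_sums.shape, row_sums_2d.shape)
```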

### Exercise

In [None]:
np.random.seed(42)

In [None]:
X = np.random.randint(low=0, high=50, size=(30, 4))
X

Get the mean values of each column in X.

In [None]:
# 1. What do you expect the output shape to be?
# 2. CODE!

Get the max value of each row in X.

In [None]:
# 1. What do you expect the output shape to be?
# 2. CODE!

Get the median value of all the values in X.

In [None]:
# 1. What do you expect the output shape to be?
# 2. CODE!

-----------------

In [None]:
np.random.seed(2222)
Y = np.random.randint(low=0, high=50, size=(4, 20))

In [None]:
# what is the shape of Y?

Get the mean of each column of X plus the mean of each row of Y.

In [None]:
# 1. What do you expect the output shape to be? 
# 2. CODE!

### Broadcasting

A very powerful mechanism of NumPy arrays is [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
Broadcasting is used when an operation is applied to two arrays of different shapes.
The rules are:

1. If the arrays differ in their number of dimensions, left-pad the shape of the array with fewer dimensions with 1s.
1. If the shapes differ along a dimension, stretch any dimension of size 1 to match the size of the other array along that dimension.
1. If the shapes still differ, raise an error.

Some examples:
![broadcasting examples](http://www.astroml.org/_images/fig_broadcast_visual_1.png)
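
The rules can also be checked without building any arrays. The sketch below uses `np.broadcast_shapes`, which is assumed to be available (it was added in NumPy 1.20), to compute the result shape for a few pairs of shapes.

```python
import numpy as np

print(np.broadcast_shapes((3,), ()))       # array and scalar
print(np.broadcast_shapes((3, 3), (3,)))   # (3,) is left-padded to (1, 3), then stretched
print(np.broadcast_shapes((3, 1), (6,)))   # (6,) becomes (1, 6); both size-1 axes are stretched

try:
    np.broadcast_shapes((3,), (4,))        # sizes differ and neither is 1
    compatible = True
except ValueError:
    compatible = False
print("compatible:", compatible)
```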

In [None]:
np.arange(3) + 5

In [None]:
np.ones((3,3)) + np.arange(3)

In [None]:
np.arange(3).reshape((3, 1)) + np.arange(6)

In [None]:
np.arange(3).reshape((3, 1))

In [None]:
 np.arange(6)

-----------------

In [None]:
a = np.arange(1, 11)
a.shape

In [None]:
b = np.arange(1, 11).reshape(10, 1)
b.shape

In [None]:
# What do you expect the output shape of a*b to be? 
# a * b

### Exercise

Given a 1D array X, calculate the difference between every pair of elements of X using broadcasting and save the result to an array D, meaning `D[i,j] = X[i] - X[j]`.

In [None]:
X = np.linspace(1, 10, 10)
X

In [None]:
# 1. What do you expect the output shape of D to be? 
# 2. CODE

Now that we have a solid understanding of NumPy, we can notice two things:
1. NumPy is extremely useful for working with numerical values of the same type.
1. NumPy lacks some flexibility when it comes to heterogeneous data (say, strings alongside floats), as well as to more data-science-related operations such as groupings and pivots.

So we would like a way to benefit from NumPy's ability to work efficiently with numerical values while also enjoying the flexibility to work with heterogeneous data. This is where pandas comes in.  
(The name is derived from *Panel Data*, which is multi-dimensional data involving measurements over time.) 

## Hello Pandas
***
Pandas is a Python module which builds on top of NumPy's capabilities, harnessing its numerical efficiency while enabling us to work with heterogeneous data. It does so by wrapping ndarrays with its own objects: the pandas DataFrame and the pandas Series, which we will discuss after we import.

In [None]:
import pandas as pd
pd.__version__

### Pandas Objects
Pandas supplies 3 main objects for us to work with: 
1. __Index__
1. __Series__
1. __DataFrame__  
They are listed in this order because each Series object contains an Index, and each DataFrame contains both Series objects and an Index object. However, we will discuss them in this order: Series, Index, DataFrame.

### Series
***

A Series is a "wrapped" 1D NumPy array: an array of values together with an index.

In [None]:
s = pd.Series([10, 20, 30, 40], name='random stuff') 
s

In [None]:
s.name

In [None]:
s[0], s[1], s[2], s[3]

In [None]:
s[1:3]

In [None]:
try:
    pd.Series(np.zeros((2, 2))) # exception
except Exception as e:
    print(f"ERROR!\n\t{e}")

## Index
*** 

We can think of the pandas index as an immutable numpy array. We can use some NumPy operations on arrays, but we cannot change any of the values (an Index is an immutable object).

In [None]:
idx = pd.Index([1, 2, 3, 4, 5, 6, 7, 8])

In [None]:
idx.shape, idx.ndim, idx.size, idx.dtype

In [None]:
idx[0], idx[1:5], idx[::2]

In [None]:
try:
    idx[0] = 12
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
type(idx.values)

A great feature of the Index object is that it supports set operations. We can perform various set operations between indices to create new indices.

A reminder of some set operations:  

* __Union__        : $A \cup B = \{a | a\in A ~or~ a\in B \}$ All the elements which are either in A or in B.  
* __Intersection__ : $A \cap B = \{a | a\in A ~and~ a\in B \}$ All the elements which are both in A and in B.  
* __Symmetric difference__ : $A \triangle B = \{a | a \in A ~or~ a\in B~ but ~not ~both \}$

In [None]:
ind_1 = pd.Index(np.arange(10))
ind_2 = pd.Index(np.arange(5, 12))
ind_1, ind_2

In [None]:
ind_1.union(ind_2)

In [None]:
ind_1.intersection(ind_2)

In [None]:
ind_1.symmetric_difference(ind_2)

In [None]:
np.random.seed(2611)
index_lettters = [chr(ord('a') + i) for i in range(10)]
s = pd.Series(np.random.randint(low=0, high=20, size=10), index=index_lettters)
s

In [None]:
s[['a', 'd', 'e']]

In [None]:
s[[0, 3, 4]]

#### loc and iloc 
Using both the implicit and explicit index can cause confusion. To make it clear which index is being used we can use the __loc__ and __iloc__ indexers.
* __iloc__ - refers to the implicit (positional, integer) index.
* __loc__ - refers to the explicit index (the labels).  
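
One difference worth keeping in mind when slicing: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position. A minimal sketch on a small, made-up Series:

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

by_label = s_demo.loc["a":"c"]     # label slice: the end label 'c' is included
by_position = s_demo.iloc[0:2]     # positional slice: the end position 2 is excluded

print(by_label)
print(by_position)
```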

In [None]:
np.random.seed(2611)
index_lettters = [chr(ord('a') + i) for i in range(10)]
s = pd.Series(np.random.randint(low=0, high=20, size=10), index=index_lettters)
s

In [None]:
try:
    print(s.iloc[[1, 2, 4, 6]])
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
try:
    print(s.loc[['a','b','d','h']])
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
try:
    print(s.loc[:4])
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
try:
    print(s.iloc[:4])
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
try:
    print(s.loc['a':'d'])
except Exception as e:
    print(f"ERROR!\n\t{e}")

In [None]:
try:
    print(s.iloc['a':'d'])
except Exception as e:
    print(f"ERROR!\n\t{e}")

A series object contains 2 objects as attributes: 
1. __values__ - a numpy array of values.
1. __index__ - a pandas Index object. (guess what holds the values under the hood? a numpy array)  
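
These two attributes can be inspected directly. The sketch below uses a small, made-up Series and prints the type of each attribute.

```python
import numpy as np
import pandas as pd

s_small = pd.Series([3, 1, 4], index=["a", "b", "c"])

print(type(s_small.values))         # the values are held in a NumPy array
print(type(s_small.index))          # the index is a pandas Index object
print(type(s_small.index.values))   # the Index itself is backed by a NumPy array
```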

### DataFrame
***

<img src="https://www.tutorialspoint.com/python_pandas/images/structure_table.jpg" width="300">

A pandas DataFrame is basically $n$ pandas Series placed side by side as columns that share the same index. Think of a DataFrame as an Excel sheet where you can name your columns. Another option is to think of it as an enhanced 2D NumPy array, where you can access the columns and the rows in a "fancy" way. 

In [None]:
data = np.random.randint(low=10, high=50, size=(15, 3))

In [None]:
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df.head(n=7)

In [None]:
df['A']

### DataFrame
A DataFrame is similar to a dictionary that maps each column name to a Series, where all the columns share the same index.
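
To make the mapping analogy concrete, the sketch below builds a small DataFrame from a dictionary of Series (the names `col_a`, `col_b` and `df_small` are illustrative only) and retrieves one column by its key.

```python
import pandas as pd

labels = ["a", "b", "c"]
col_a = pd.Series([1, 2, 3], index=labels)
col_b = pd.Series([0.5, 1.5, 2.5], index=labels)

# Each dictionary key becomes a column name; the shared index labels the rows
df_small = pd.DataFrame({"A": col_a, "B": col_b})
first_col = df_small["A"]   # dictionary-style access returns a Series

print(df_small)
print(type(first_col))
```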

In [None]:
np.random.seed(1010)
index_lettters = [chr(ord('a') + i) for i in range(10)]
data = np.random.randint(low=0, high=100, size=(10, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=index_lettters)
df

In [None]:
df['A']

In [None]:
# Accessing the 2nd and 3rd row
df.iloc[[1,2]]

In [None]:
df.loc[["b","c"]]

In [None]:
# Accessing rows [1,5)

In [None]:
df.iloc[1:5]

In [None]:
# Accessing row 'a'
df.loc['a']

In [None]:
# Accessing rows 'a', 'b', 'c'
df.loc[['a','b','c']]

In [None]:
# Accessing rows [1,5) with bracket slicing
df[1:5]

In [None]:
# get columns A and B
df[['A', 'B']]

In [None]:
# Where column A is larger than 50?
mask = df['A'] > 50
mask

In [None]:
# Get rows where the value of A is greater than 50
df[mask]

If you want to access the columns:
- Standard dictionary-like access works.
- Fancy indexing with a list of column labels works as well.

If you want to access the rows:
- use iloc for the implicit index.
- use loc for the explicit index.
- use slicing with bracket notation.
- use boolean masking with bracket notation.
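
The access patterns listed above can be summarized in one place. This is a minimal sketch on a small, made-up DataFrame (`demo` and its values are not part of the recitation data):

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [10, 60, 70, 20], "B": [1, 2, 3, 4], "C": [5, 6, 7, 8]},
    index=["a", "b", "c", "d"],
)

one_column = demo["A"]              # dictionary-like access returns a Series
two_columns = demo[["A", "B"]]      # fancy indexing with a list of column labels
by_position = demo.iloc[[1, 2]]     # rows by implicit (positional) index
by_label = demo.loc[["b", "c"]]     # rows by explicit (label) index
sliced = demo[1:3]                  # bracket slicing selects rows by position
masked = demo[demo["A"] > 50]       # boolean masking selects rows

print(masked)
```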

# Statistics

In [None]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Coin Toss

You toss a coin 30 times and see heads 24 times. Is it a fair coin?

The easiest solution is to use the programming approach and run a simulation.

We will run the experiment (tossing a fair coin 30 times) multiple times, and find the fraction of experiments in which we obtained 24 heads or more. 

### A single experiment:

In [None]:
experiment = np.random.randint(0,2,10)
experiment

In [None]:
experiment[experiment==1].sum()

In [None]:
def coin_toss_experiment(n_exp):
    """Repeat the 30-toss experiment n_exp times and return the head count of each run."""
    total_tosses = 30
    head_count = []

    for i in range(n_exp):
        # 0 = tail, 1 = head; both outcomes are equally likely
        experiment = np.random.randint(0, 2, total_tosses)
        head_count.append(experiment.sum())

    return np.array(head_count)

In [None]:
n_exp = 10000

In [None]:
%%time

head_count = coin_toss_experiment(n_exp)

In [None]:
num_heads = 24

In [None]:
probability_more_than_24 = len(head_count[head_count>=num_heads]) / n_exp

In [None]:
print(f"The estimated probability of seeing 24 or more heads (based on {n_exp} experiments) is {probability_more_than_24:.3f}")
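
As a cross-check on the simulation, the same probability can be computed exactly. This sketch assumes a fair coin, so the number of heads follows a Binomial(30, 0.5) distribution, and it uses `math.comb` (Python 3.8 or later). The name `exact_probability` is illustrative only.

```python
from math import comb

total_tosses = 30
num_heads = 24

# P(X >= 24): count the toss sequences with 24..30 heads out of all 2**30 sequences
favourable = sum(comb(total_tosses, k) for k in range(num_heads, total_tosses + 1))
exact_probability = favourable / 2**total_tosses

print(f"Exact probability of {num_heads} or more heads: {exact_probability:.6f}")
```

With only a few thousand simulated experiments, the simulated estimate of such a rare event can differ noticeably from this value.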

Visualize the number of head counts in our simulation.

In [None]:
sns.histplot(head_count, bins=19);

### NumPy exercise: 
Try to code this experiment without a for loop.
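
One possible approach, offered as a sketch rather than the reference solution: draw all the tosses at once as a 2D array and sum along each row. The function name `coin_toss_experiment_vectorized` is hypothetical.

```python
import numpy as np

def coin_toss_experiment_vectorized(n_exp, total_tosses=30):
    # One row per experiment, one column per toss (0 = tail, 1 = head)
    tosses = np.random.randint(0, 2, size=(n_exp, total_tosses))
    # Summing along each row gives the head count of every experiment at once
    return tosses.sum(axis=1)

head_count_vec = coin_toss_experiment_vectorized(10000)
print(head_count_vec[:10])
```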

### Part 2: Deviation from the mean

We want to know how often we obtained a result that is more than $k$ standard deviations away from the mean.

In [None]:
mu = head_count.mean()
sigma = head_count.std()

In [None]:
ks = np.arange(1, 5, 0.5)
probs = []

for k in ks: 
    c = 0
    for i in head_count:
        # count the samples that are more than k standard deviations away from the mean
        if abs(i - mu) > k * sigma :
            c += 1
    probs.append(c/n_exp)

In [None]:
plt.figure(figsize=(20,10))
plt.plot(ks,probs, marker='o')
plt.show()
print("Probability of a sample being more than k standard deviations away from the mean:")
for i, prob in enumerate(probs):
    print("k={}: Calculated probability={:.2f}  |  Chebyshev upper bound: {:.2f}".format(ks[i], prob, 1/ks[i]**2))
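
The $1/k^2$ value printed next to each calculated probability comes from Chebyshev's inequality. It is an upper bound that holds for any random variable $X$ with mean $\mu$ and finite standard deviation $\sigma$, not the exact probability, so the calculated values are expected to fall at or below it:

$$P\left(|X - \mu| \ge k\sigma\right) \le \frac{1}{k^2}$$

For $k = 1$ the bound equals 1 and carries no information; it becomes useful only for $k > 1$.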

# Random Variables

Using SciPy, it is simple to build a custom discrete random variable.

In [None]:
from scipy.stats import rv_discrete

**Single die**

In [None]:
x1 = [1, 2, 3, 4, 5, 6] # values
p1 = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] # probabilities per value
distribution1 = rv_discrete(values=(x1, p1))
print("Expected value: ", distribution1.expect())

**Sum of two dice**

In [None]:
x2 = [x for x in range(2,13)]
p2 = [1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36]
distribution2 = rv_discrete(values=(x2, p2))
print("Expected value: ", distribution2.expect())
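
The probabilities in `p2` can be derived rather than typed by hand. The sketch below enumerates all 36 equally likely outcomes of two fair dice and counts how often each sum occurs (the names `sum_counts`, `x2_check` and `p2_check` are illustrative only).

```python
from collections import Counter
from itertools import product

# Enumerate every (die 1, die 2) outcome and count the occurrences of each sum
sum_counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

x2_check = sorted(sum_counts)                        # the possible sums
p2_check = [sum_counts[s] / 36 for s in x2_check]    # probability of each sum
expected_sum = sum(s * p for s, p in zip(x2_check, p2_check))

print(dict(sorted(sum_counts.items())))
print("Expected value: ", expected_sum)
```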

### Linearity of expectation:

The expectation of the sum of two dice is the sum of the expectations of each die. This holds whether or not the dice are independent. Thus, $E[X_1 + X_2] = 3.5 + 3.5 = 7$.

**Exercise**: Two fair six-sided dice are rolled. Calculate the expectation of the product of two independent dice.

#### Solution

$$\begin{aligned}
E[X_1 X_2] &= \sum_{x_1} \sum_{x_2} x_1 x_2 \, p(X_1 = x_1) \, p(X_2 = x_2) \\
&= \sum_{x_1} x_1 \, p(X_1 = x_1) \cdot \sum_{x_2} x_2 \, p(X_2 = x_2) \\
&= E[X_1] E[X_2] = 3.5 \cdot 3.5 = 12.25
\end{aligned}$$


Programming approach:

$$ \frac{1}{36}\sum_{i=1}^{6} \sum_{j=1}^{6} i\cdot j$$
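
A minimal sketch of this double sum in NumPy, using `np.outer` to build the table of all 36 products (the variable names are illustrative only):

```python
import numpy as np

faces = np.arange(1, 7)
products = np.outer(faces, faces)        # products[i, j] = faces[i] * faces[j]
expected_product = products.sum() / 36   # each of the 36 outcomes has probability 1/36

print(expected_product)
```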