We will use Python for the coding parts of all assignments and exams. This tutorial gives a brief introduction to the Python programming language and to several useful Python libraries (NumPy, pandas, Matplotlib, and SciPy).

# 01 Environments

We will provide the starter code in notebooks, so we recommend writing and running your Python code in Jupyter notebooks.

**Run Jupyter notebook locally:** 
1. We strongly recommend using [Anaconda Distribution](https://www.anaconda.com/products/distribution), which provides an easy way to install packages and whose default environment can speed up NumPy and SciPy code.
2. After installing it, if you wish to create a virtual environment for all the code in this course, run `conda create -n <env_name> python=3.12.4`, where `<env_name>` can be, for example, `cs418-fa24`. To activate and enter the environment, run `conda activate <env_name>`.
3. Then run `conda install notebook` to install Jupyter Notebook. To open a notebook from the command line, run `jupyter notebook` in the terminal. Alternatively, you can install [Anaconda Navigator](https://docs.anaconda.com/anaconda/navigator/) and use its graphical interface to start a Jupyter Notebook server at `http://localhost:8888` in your favorite web browser.
4. Alternatively, I suggest using Visual Studio Code as the editor: with an extension it can open Jupyter notebooks and gives you access to a terminal at the same time, so you do not need a web browser at all. You can also pick the conda environment you created in Step 2 by clicking "Select Kernel" in the top right corner (on macOS), instead of having to remember the environment's name.

Note: You are not required to use Python 3.12.4 specifically. See this [page](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) for more details on virtual environments. If you would like to use Jupyter notebooks in Visual Studio Code, please see the details [here](https://code.visualstudio.com/docs/datascience/jupyter-notebooks).
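Once the environment is activated, a quick check like the following can confirm which interpreter is running and whether the libraries used in this tutorial are installed. This is only a sketch: the package list (including `seaborn`, which appears in the probability section) is an assumption about what you will need, not an official requirements list.

```python
import importlib.util
import sys

# Report the interpreter version provided by the active environment.
print("Python version:", sys.version.split()[0])

# Packages used later in this tutorial (assumed list).
# find_spec returns None when a top-level package cannot be found.
packages = ["numpy", "pandas", "matplotlib", "scipy", "seaborn"]
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can be added with `conda install <package>` inside the activated environment.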

**Run [Colab](https://colab.research.google.com/) notebooks (free):**
If you have difficulty setting up a local environment or package dependencies, you can use Google Colab to run Jupyter notebooks in the cloud. In Colab you can upload local files and download your edited code to your local machine for assignment submission.


# Optional : Python Introduction


## Variables and Printing
In Python, the types of variables are inferred automatically by the interpreter.

In [None]:
# Let's keep error messages very short
%xmode minimal

school = "Computer Science" # creates a string
number = 446       
pi = 3.141        

# Print Statements
print("department:", school)
print("number:", number)
print("pi:", pi)

print(type(school), type(number), type(pi))

## Containers
Python includes several built-in container types: lists, dictionaries, sets, and tuples.

### Lists
A list is the Python equivalent of an array, but is resizeable and can contain elements of different types:

In [None]:
fruit = ['apple', 'pear', 'tomato', 'avocado']
print(fruit)
print(fruit[0])
print(fruit[-2]) # specifies the element two spots from the end
print(fruit[1:3]) # slices select a range [start, end): start is included, end is excluded
print(fruit[:-1]) # slice indices can be negative; prints everything except the last item
fruit[3] = 'salmon' # can assign items in list

print(fruit)

In [None]:
print(len(fruit)) # this is the length of the list
print('pear' in fruit) # can check for existence in a list ("contains()")

In [None]:
fruit.append('trout') # you can add elements to a list

print(fruit)
print(len(fruit))

### Dictionaries
A dictionary stores (key, value) pairs, similar to a Map in Java or an object in JavaScript. You can use it like this:

In [None]:
d = {'apple': 'red', 'avocado': 'green'}  # Create a new dictionary with some data
print(d['apple'])       # Get an entry from a dictionary; prints "red"
print('apple' in d)     # Check if a dictionary has a given key; prints "True"
d['pear'] = 'green'     # Set an entry in a dictionary
print(d['pear'])      # Prints "green"
# print(d['tomato'])  # KeyError: 'tomato' not a key of d
print(d.get('tomato', 'N/A'))  # Get an element with a default; prints "N/A"
print(d.get('pear', 'N/A'))    # Get an element with a default; prints "green"
del d['avocado']         # Remove an element from a dictionary
print(d.get('avocado', 'N/A')) # "avocado" is no longer a key; prints "N/A"

### Sets
A set is an unordered collection of distinct elements. As a simple example, consider the following:

In [None]:
fruits = {'apple', 'avocado'}
print('apple' in fruits)   # Check if an element is in a set; prints "True"
print('pear' in fruits)  # prints "False"
fruits.add('pear')       # Add an element to a set
print('pear' in fruits)  # Prints "True"
print(len(fruits))       # Number of elements in a set; prints "3"
fruits.add('apple')        # Adding an element that is already in the set does nothing
print(len(fruits))       # Prints "3"
fruits.remove('apple')     # Remove an element from a set
print(len(fruits))       # Prints "2"

### Tuples
A tuple is an (immutable) ordered list of values. Tuples are a useful way to pack small amounts of data.

In [None]:
tupperware = (1,2,3)
print(tupperware)
print(tupperware[1])
a, b, c = tupperware # can 'unpack' tuples
_, d, _ = tupperware # ignore the elements you don't care about
print(c)
print(d)

In [None]:
#tupperware[1] = 4 # cannot modify tuples 
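To see what "cannot modify" means in practice, the sketch below catches the error raised by the assignment and then builds a new tuple instead. The name `updated` is introduced here only for illustration.

```python
tupperware = (1, 2, 3)

try:
    tupperware[1] = 4          # item assignment is not supported on tuples
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# To "change" a tuple, build a new one from pieces of the old one.
updated = tupperware[:1] + (4,) + tupperware[2:]
print(updated)                 # (1, 4, 3)
print(tupperware)              # (1, 2, 3), the original is untouched
```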

## Math operations

In [None]:
a = 13
b = 4

print(a + b)
print(a / b) # floating point division
print(a // b) # integer division
print(b**2) # powers are built-in

## Conditionals
Python cares about whitespace! Blocks are defined by indentation; there are no braces or end statements.

Instead of &&, ||, !, we use 'and', 'or', and 'not'.

Boolean values are written as 'True' and 'False'

In [None]:
a = 9.1
b = 7

# basic if else syntax
if a < 10:
    print('hello there!')
else:
    print('general kenobi')

print('this always prints')

In [None]:
# and syntax
if a < 10 and b > 5:
    print('and')

# or syntax
if a > 10 or b > 5:
    print('or')

In [None]:
# if, elif (else if), else syntax
if False:
    print('this never prints')
elif a != b:
    print('a != b')
else:
    print('else!')

## Loops

Enumerate Function - https://www.geeksforgeeks.org/enumerate-in-python/

Zip Function - https://www.w3schools.com/python/ref_func_zip.asp

In [None]:
# basic for loops
for i in range(5):
    print(i)
    
print()

# can loop over any iterable
for f in fruit:
    print(f)

In [None]:
# advanced for loops
for idx, item in enumerate(fruit):
    print(idx, item)
    
print()

a = [3, 1, 4, 1, 5]
b = [2, 7, 1, 8, 2]

for f, pi, e in zip(fruit, a, b):
    print(f, pi, e)

In [None]:
# while loops
my_str = "Ben"

while len(my_str) < 10:
    my_str += ' 10'

print(my_str)

In [None]:
# A useful way to make lists is using list comprehensions
a = [x**2 for x in range(10)]
print(a)

## Functions + Classes
Python functions are defined using the `def` keyword.

In [None]:
# this is a function - I will expect most of your code to be in functions
def square_this(x):
    return x * x

print(square_this(4))

In [None]:
# Classes are convenient ways to package several pieces of information together
class Squaring:
    
    # this is the constructor
    def __init__(self, a, b=4):
        self._a = a
        self._b = b
    
    # this is a method
    def square(self, x):
        return x * x
    
    # this is a static method
    @staticmethod
    def mymethod(x):
        return x**3

sq = Squaring(3)
print('method:', sq.square(3))
print('static method:', Squaring.mymethod(5))
print('a:', sq._a)
print('b:', sq._b)

# 02 Numpy
[NumPy](https://numpy.org/) is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations using C/C++ in the background.

Note: please see the [documentation](https://numpy.org/install/) for installation instructions.

## Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. We can initialize numpy arrays from nested Python lists and access elements much as we would with a list.

In [None]:
import numpy as np

a = np.array([1,2,3,4,5])
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(5,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"

b = np.array([5,4,3,2,1])
c = a + b                 # can do elementwise operations
print(c)
print(a*b)
print(a[3:5])             # can use same indexing as lists
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3 4 5]"

Numpy also provides many functions to create arrays:

In [None]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[ 0.  0.]
                      #          [ 0.  0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[ 1.  1.]]"

c = np.full((2,2), 7)  # Create a constant array
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[ 1.  0.]
                      #          [ 0.  1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print(e)                     # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"

Note: see more ways to create arrays [here](https://numpy.org/doc/stable/user/basics.creation.html#arrays-creation).

## Array indexing
Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [None]:
A = np.array([[1,2], [3,4]]) # can create multidimensional arrays
print(A)
print(A.shape) # this specifies the number of rows and columns
print(A[0]) # indexing into rows
print(A[:,1]) # indexing into columns
print(A[1,1], A[1][1]) # indexing into rows and columns
print(A[:,1].T) 
print(np.shape(A[:,1]))
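As a further sketch of "one slice per dimension", the example below uses a small 3x4 array (`M`, made up for illustration) and shows how a slice keeps a dimension while a plain integer index drops it.

```python
import numpy as np

# A 3x4 array used only for illustration
M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

# Rows 0-1 and columns 1-2: one slice per dimension
sub = M[:2, 1:3]
print(sub)          # [[2 3]
                    #  [6 7]]

# Mixing an integer index with a slice drops that dimension;
# using a length-1 slice instead keeps it.
row = M[1, :]       # shape (4,)
row_2d = M[1:2, :]  # shape (1, 4)
print(row.shape, row_2d.shape)
```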

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array has shape (3,):
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"

In [None]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print(a)  # prints "[[ 1  2  3]
          #          [ 4  5  6]
          #          [ 7  8  9]
          #          [10 11 12]]"

# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"

# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10

print(a)  # prints "[[11  2  3]
          #          [ 4  5 16]
          #          [17  8  9]
          #          [10 21 12]]"

## Array operations
Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

In [None]:
# matrices
A = np.array([[1,2],[3,4]])
B = np.array([[1, 0],[0, 1]]) # identity matrix (same as np.eye(2))

print(A + B)
print()
print(A @ B)  # @ is matrix multiplication
print(A * B)  # * is elementwise multiplication. You probably don't mean this

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

Note: Numpy provides many mathematical functions; please see the details [here](https://numpy.org/doc/stable/reference/routines.math.html).

We also frequently need to reshape or manipulate data in arrays. The most useful example is transposing a matrix. Please see other manipulating array functions [here](https://numpy.org/doc/stable/reference/routines.array-manipulation.html).

In [None]:
x = np.array([[1,2], [3,4]])
print(x)    # Prints "[[1 2]
            #          [3 4]]"
print(x.T)  # Prints "[[1 3]
            #          [2 4]]"

# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print(v)    # Prints "[1 2 3]"
print(v.T)  # Prints "[1 2 3]"
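Because transposing a rank 1 array does nothing, an explicit column vector has to be built another way. A minimal sketch using `reshape` (the names `col` and `m` are only for illustration):

```python
import numpy as np

v = np.array([1, 2, 3])

# reshape turns the 1-D array into an explicit 3x1 column;
# -1 asks numpy to infer that dimension from the array's size.
col = v.reshape(-1, 1)
print(col.shape)        # (3, 1)
print(col.T.shape)      # (1, 3)

# reshape also rearranges the same six values into a different grid
m = np.arange(6).reshape(2, 3)
print(m)                # [[0 1 2]
                        #  [3 4 5]]
print(m.T.shape)        # (3, 2)
```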

# 03 Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Note: please see the [documentation](https://pandas.pydata.org/getting_started.html) for installation instructions.

In [None]:
import pandas as pd

# How to create a series using pandas
s = pd.Series([1, 2, 3, 4, 5], 
              index=['a', 'b', 'c', 'd', 'e'])
print(s)

In [None]:
# How to make a dataframe using pandas
df = pd.DataFrame({"a" : [4 ,5, 6], 
                   "b" : [7, 8, 9], 
                   "c" : [10, 11, 12]}, 
                    index = [1, 2, 3])
print(df)

In [None]:
# How to read a .csv file
df = pd.read_csv('elections.csv')
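If `elections.csv` is not at hand, `read_csv` can also be tried on in-memory text. The sketch below uses a few made-up rows (the candidate and party names are placeholders, not real election data) with column names assumed to resemble the ones used later in this section.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT taken from elections.csv
csv_text = """Year,Candidate,Party,Result,%
2000,Candidate A,Party X,win,51.0
2000,Candidate B,Party Y,loss,49.0
2004,Candidate C,Party X,loss,48.5
2004,Candidate D,Party Y,win,51.5
"""

# read_csv accepts any file-like object, not just a path
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)      # (4, 5)
print(toy.dtypes)     # column types are inferred from the text
```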

In [None]:
df.shape

In [None]:
df.info()  # info() prints its summary directly and returns None

In [None]:
# This will show the first 5 rows of your dataframe
df.head()

In [None]:
# This gives a statistical description of the data
df.describe()

In [None]:
# This randomly outputs 2 samples (try running it multiple times to see different results)
df.sample(2)

In [None]:
elections = df.copy() #copy df as elections so it is easy to remember or interpret
type(elections)

Another CSV file

In [None]:
# If using Google Colab:
# mottos = pd.read_csv("/content/drive/My Drive/Colab Notebooks/mottos.csv", index_col = "State")
mottos = pd.read_csv("mottos.csv", index_col = "State")
mottos

## Indexing

As a simple indexing example, consider the code below, which returns the first 5 rows of the DataFrame.

In [None]:
elections.loc[0:4]

We can also use the head/tail command to return only a few rows of a dataframe.

In [None]:
mottos.head(5)
mottos.tail(5)

In [None]:
elections.set_index("Year")

What happens here?

In [None]:
elections.set_index("Year").loc[2020]

We can also use the tail command to get the last few rows.

In [None]:
elections.tail(5)

If we want a subset of the columns, we can also use loc just to ask for those.

In [None]:
elections.loc[0:4, "Year":"Party"]

In [None]:
#elections[0]

### loc

loc selects items by row and column label.

In [None]:
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

In [None]:
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
elections.loc[:, ["Year", "Candidate", "Result"]]

### iloc

iloc selects items by row and column number.

In [None]:
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
elections.iloc[[1, 2, 3], 0:2]

In [None]:
elections.iloc[[1, 2, 3], 1]

In [None]:
elections.iloc[:, [0, 1, 4]]
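The difference between `loc` and `iloc` is easiest to see when row labels differ from row positions. The sketch below uses a small made-up frame (`toy`, with placeholder values) rather than the elections data.

```python
import pandas as pd

# Toy frame (made-up values) whose row labels differ from row positions
toy = pd.DataFrame(
    {"Candidate": ["A", "B", "C", "D"], "%": [51.0, 49.0, 48.5, 51.5]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20, "Candidate"]      # row labeled 20, column named "Candidate"
by_position = toy.iloc[1, 0]             # second row, first column
print(by_label, by_position)             # B B

# Label slices include the end point; position slices exclude it
print(len(toy.loc[10:30]))               # 3
print(len(toy.iloc[0:2]))                # 2
```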

### []

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it does essentially the same thing.

If we provide a slice of row numbers, we get the numbered rows.

In [None]:
elections[3:7]

If we provide a list of column names, we get the listed columns.

In [None]:
elections[["Year", "Candidate", "Result"]].tail(5)

And if we provide a single column name we get back just that column.

In [None]:
elections["Candidate"].tail(5)

#### A little annoying puzzle

In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"],
    "1":["topcat","botcat"]
})
weird

In [None]:
weird[1] #try to predict the output

In [None]:
weird["1"] #try to predict the output

In [None]:
weird[1:] #try to predict the output

## Pandas Datastructures: DataFrames, Series, and Indices

In [None]:
type(elections)

In [None]:
type(elections["Candidate"])

In [None]:
mottos = pd.read_csv("mottos.csv", index_col = "State")
mottos.loc["California":"Illinois"]

In [None]:
elections["Candidate"].tail(5).to_frame()

In [None]:
elections[["Candidate"]].tail(5)

In [None]:
mottos.index

In [None]:
mottos.columns

## Conditional Selection

In [None]:
elections[elections["Party"] == "Independent"]

In [None]:
elections["Party"] == "Independent"

Boolean array selection also works with `loc`!

In [None]:
elections.loc[elections["Party"] == "Independent"]

In [None]:
elections[(elections["Result"] == "win") & (elections["%"] < 47)]

In [None]:
elections[[True]*len(elections)]

### An annoying puzzle

In [None]:
elections2 = pd.read_csv("annoying_puzzle2.csv")
elections2

In [None]:
# Which of the following yield a DataFrame of the first 3 Candidate names only for candidates that won with more than 50% of the vote?
elections2.iloc[[0, 3, 5], [0, 3]]
elections2.loc[[0, 3, 5], "Candidate":"Year"]
elections2.loc[elections2["%"] > 50, ["Candidate", "Year"]].head(3)
elections2.loc[elections2["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

In [None]:
elections2 = elections[(elections["Year"] == 1980) | (elections["Year"] == 1984) | (elections["Year"] == 1988)]
elections2

In [None]:
(
    elections[(elections["Party"] == "Anti-Masonic")  |
              (elections["Party"] == "American")      |
              (elections["Party"] == "Anti-Monopoly") |
              (elections["Party"] == "American Independent")]
)
#Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability

In [None]:
a_parties = ["Anti-Masonic", "American", "Anti-Monopoly", "American Independent"]
elections[elections["Party"].isin(a_parties)]

In [None]:
elections[elections["Party"].str.startswith("A")]

In [None]:
elections.query('Year >= 2000 and Result == "win"')

In [None]:
parties = ["Republican", "Democratic"]
elections.query('Result == "win" and Party not in @parties')

## Built In Functions

In [None]:
winners = elections.query('Result == "win"')["%"]
winners.head(5)

In [None]:
np.mean(winners)

In [None]:
max(winners)

In [None]:
elections

In [None]:
elections.size

In [None]:
elections.shape

In [None]:
elections.describe()

In [None]:
elections.sample(5).iloc[:, 0:2]

In [None]:
elections.query('Year == 2000').sample(4, replace = True).iloc[:, 0:2]

In [None]:
elections["Candidate"].value_counts()

In [None]:
elections["Party"].unique()

In [None]:
elections["Candidate"].sort_values()

In [None]:
elections.sort_values("%", ascending = False)

Note: Please see the pandas [user guide](https://pandas.pydata.org/docs/user_guide/index.html) for more functions.

# 04 Matplotlib
[Matplotlib](https://matplotlib.org/) is a library used to visualize data.

In [None]:
import matplotlib.pyplot as plt

# basic plotting (show scatter and plot)
xs = np.arange(10)
ys = xs ** 2
plt.scatter(xs, ys)
plt.plot(xs, ys)

# ALWAYS label your plots!
plt.title('plot of $x^2$ vs $x$')
plt.xlabel('$x$')
plt.ylabel('$x^2$')
plt.show()

## Subplots
You can plot different things in the same figure using the subplot function. Here is an example:

In [None]:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

# 05 Probability  

Probability theory is a mathematical framework for representing uncertain statements. 

A random variable is a variable that can take on different values randomly. There are two types of random variables, discrete and continuous.

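**Discrete variable**: a variable that takes a finite or countably infinite number of distinct values (for example, the result of rolling a die).

**Continuous variable**: a variable that can take any value in an interval of real numbers, so it has uncountably many possible values.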

## Probability distribution
In probability theory and statistics, a probability distribution describes how likely a random variable is to take on each of its possible values. The way we describe probability distributions depends on whether the variables are discrete or continuous.

### Discrete Variables and Probability Mass functions

A probability distribution over discrete variables may be described using a __probability mass function (PMF)__. A probability mass function maps from a state of a random variable to the probability of that random variable taking on that state.

We denote probability mass functions with $P$ and write the probability that $X$ takes the value $x$ as $P(X = x)$. For example, if $X$ is the outcome of rolling a die, $x$ can be any of the numbers on its faces.

In [None]:
import os
import random
import sys

from collections import defaultdict

def single_dice(x, sides, rolls):   
    result = roll(sides, rolls)
    for i in range(1, sides +1):
        plt.bar(i, result[i] / rolls)
    print("P(X = {}) = {}%".format(x, np.divide(np.multiply(result[x], 100), rolls)))

def roll(sides, rolls):   
    d = defaultdict(int)                    
    for _ in range(rolls):
        d[random.randint(1, sides)] += 1    # The random process
    return d


single_dice(x=6, sides=6, rolls=10000)
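For a fair die, the PMF is $P(X = x) = 1/6$ for every face. The sketch below compares that theoretical value with empirical frequencies from a simulation; it uses numpy's random generator with a fixed seed (an assumption made here for repeatability) instead of the `random` module.

```python
import numpy as np

sides = 6
rolls = 10000
rng = np.random.default_rng(0)                       # fixed seed so the run is repeatable

outcomes = rng.integers(1, sides + 1, size=rolls)    # upper bound is exclusive
counts = np.bincount(outcomes, minlength=sides + 1)[1:]  # drop the unused count for 0
empirical = counts / rolls
theoretical = np.full(sides, 1 / sides)

for face, (emp, theo) in enumerate(zip(empirical, theoretical), start=1):
    print(f"P(X = {face}): empirical {emp:.3f}, theoretical {theo:.3f}")
```

The empirical frequencies should move closer to $1/6$ as `rolls` grows.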

### Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a __probability density function (PDF)__. 

In [None]:
# import seaborn
import seaborn as sns
# settings for seaborn plotting style
sns.set_theme(color_codes=True)
# settings for seaborn plot sizes
sns.set(rc={'figure.figsize':(5,5)})

In [None]:
# import uniform distribution
from scipy.stats import uniform

# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)

In [None]:
ax = sns.histplot(data_uniform,
                  bins=10,
                  kde=True,
                  color='skyblue',
                  line_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
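Unlike a PMF, a PDF gives a density rather than a probability; probabilities come from the area under the curve over an interval. A minimal sketch with the same uniform distribution on $[10, 30]$, using the `pdf` and `cdf` methods of `scipy.stats.uniform`:

```python
from scipy.stats import uniform

start, width = 10, 20
dist = uniform(loc=start, scale=width)     # uniform on [10, 30]

density = dist.pdf(15)                     # constant height 1 / width inside the interval
outside = dist.pdf(5)                      # zero outside the interval
prob = dist.cdf(20) - dist.cdf(15)         # P(15 <= X <= 20) as a difference of CDF values

print(density)   # 0.05
print(outside)   # 0.0
print(prob)      # 0.25
```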

## Conditional Probability
In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event (by assumption, presumption, assertion or evidence) has already occurred:

$$\color{blue}{P(\mathrm{x} = x \ | \ \mathrm{y} = y) = \frac{P(\mathrm{x} = x, \mathrm{y} = y)}{P(\mathrm{y} = y)} \tag{1}}$$

Example: Suppose we send out a survey to 300 individuals asking them which sport they like best: baseball, basketball, football, or soccer. What is the probability that an individual is male (or female), given that baseball is their favorite sport?

In [None]:
# create pandas DataFrame with raw data
df = pd.DataFrame({'gender': np.repeat(np.array(['Male', 'Female']), 150),
                   'sport': np.repeat(np.array(['Baseball', 'Basketball', 'Football',
                                                'Soccer', 'Baseball', 'Basketball',
                                                'Football', 'Soccer']), 
                                    (34, 40, 58, 18, 34, 52, 20, 44))})

In [None]:
survey_data = pd.crosstab(index=df['gender'], columns=df['sport'], margins=True)
survey_data

In [None]:
# Calculate probability of being male, given that individual prefers baseball
survey_data.iloc[1, 0] / survey_data.iloc[2, 0]

In [None]:
# Calculate probability of preferring basketball, given that individual is female
survey_data.iloc[0, 1] / survey_data.iloc[0, 4]
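Written out with formula (1) and the counts from the survey table (34 of the 68 baseball fans are male; 52 of the 150 female respondents prefer basketball), the two calculations are:

$$P(\text{Male} \mid \text{Baseball}) = \frac{P(\text{Male}, \text{Baseball})}{P(\text{Baseball})} = \frac{34/300}{68/300} = 0.5$$

$$P(\text{Basketball} \mid \text{Female}) = \frac{P(\text{Basketball}, \text{Female})}{P(\text{Female})} = \frac{52/300}{150/300} \approx 0.347$$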

## Expectation, Variance and Covariance
The __expectation__, or __expected value__, of some function $f(x)$ with respect to a probability distribution $P(x)$ is the average, or mean value, that $f$ takes on when $x$ is drawn from $P$. For discrete variables this can be computed with a summation:

$$\color{blue}{\mathbb{E}_{x \sim P} [f(x)] = \displaystyle\sum_{x} P(x) f(x) \tag{2}}$$

In [None]:
# Example: six equally likely outcomes, each with a payoff f(x)

a = [-1, -1, -1, 0, 0, 4]    # f(x): one value per outcome
b = [1/6.] * 6               # P(x): probability of each outcome

expectation = 0
for p, f in zip(b, a):
    expectation += p * f     # summing P(x) * f(x)

# Print the expectation
print("The expectation E[f(x)] is: {:.4}".format(expectation))

For continuous variables, it is computed with an integral:

$$\color{blue}{\mathbb{E}_{x \sim P} [f(x)] = \int p(x) f(x) dx \tag{3}}$$
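As a sketch of equation (3), the integral can be evaluated numerically with `scipy.integrate.quad`. The choice of distribution (uniform on $[10, 30]$) and of $f(x) = x$ is an assumption made for illustration.

```python
from scipy.integrate import quad
from scipy.stats import uniform

dist = uniform(loc=10, scale=20)           # uniform on [10, 30]

# E[f(x)] = integral of p(x) * f(x) dx, here with f(x) = x.
# quad returns the integral estimate and an estimate of its absolute error.
expectation, abs_error = quad(lambda x: dist.pdf(x) * x, 10, 30)

print(expectation)    # approximately 20, the midpoint of the interval
print(dist.mean())    # scipy's closed-form mean for comparison
```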

__Variance__ is the expected squared deviation of a random variable from its mean:

$$\color{blue}{Var(f(x)) = \mathbb{E} \Big[ (f(x) - \mathbb{E}[f(x)])^2 \Big] \tag{4}}$$

In [None]:
a = np.array([[1, 2], [3, 4]])
np.var(a)
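To connect `np.var` with equation (4), the sketch below applies the definition directly. By default `np.var` divides by $n$ (population variance); passing `ddof=1` divides by $n - 1$ (sample variance).

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Definition applied directly: mean of squared deviations from the mean
manual = np.mean((a - a.mean()) ** 2)
print(manual, np.var(a))          # 1.25 1.25

# ddof=1 divides by n - 1 instead of n
sample_var = np.var(a, ddof=1)
print(sample_var)                 # 5 / 3, roughly 1.667
```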

**Covariance** measures how two variables vary together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases.
$$\color{blue}{Cov(f(x), g(y)) = \mathbb{E} \big[ (f(x) - \mathbb{E}[f(x)]) (g(y) - \mathbb{E}[g(y)]) \big] \tag{5}}$$

In [None]:
x = np.array([[0, 2], [1, 1], [2, 0]]).T
np.cov(x)
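The off-diagonal entry of that matrix can be checked by hand from equation (5). Note that `np.cov` treats each row as a variable and, by default, divides by $n - 1$. The names `u` and `v` are introduced here only for illustration.

```python
import numpy as np

x = np.array([[0, 2], [1, 1], [2, 0]]).T   # rows are variables, columns are observations
u, v = x[0], x[1]                          # u = [0 1 2], v = [2 1 0]

# Sample covariance: sum of products of deviations, divided by n - 1
manual = np.sum((u - u.mean()) * (v - v.mean())) / (len(u) - 1)
print(manual)            # -1.0
print(np.cov(x)[0, 1])   # -1.0
```

The negative value reflects that `v` decreases as `u` increases.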

In [None]:
x = [-2.1, -1,  4.3]
y = [3,  1.1,  0.12]
X = np.stack((x, y), axis=0)

np.cov(X)
np.cov(x, y)
np.cov(x)

## Common distributions


### Bernoulli Distribution

The __Bernoulli distribution__ is a distribution over a single binary random variable.

In [None]:
from scipy.stats import bernoulli
#
# Instance of Bernoulli distribution with parameter p = 0.7
#
bd = bernoulli(0.7)
#
# Outcome of experiment can take value as 0, 1
#
X = [0, 1]
#
# Create a bar plot; Note the usage of "pmf" function
# to determine the probability of different values of
# random variable
#
plt.figure(figsize=(7,7))
plt.xlim(-1, 2)
plt.bar(X, bd.pmf(X), color='orange')
plt.title('Bernoulli Distribution (p=0.7)', fontsize='15')
plt.xlabel('Values of Random Variable X (0, 1)', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.show()
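Beyond the plot, the frozen distribution object also exposes the mean and variance, which for a Bernoulli variable are $p$ and $p(1 - p)$. A minimal sketch, with a fixed `random_state` assumed for repeatability:

```python
from scipy.stats import bernoulli

p = 0.7
bd = bernoulli(p)

print(bd.pmf([0, 1]))        # probabilities of 0 and 1, i.e. 1 - p and p
print(bd.mean(), bd.var())   # p and p * (1 - p)

# Draw samples; their average should be close to p
samples = bd.rvs(size=10000, random_state=0)
print(samples.mean())
```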

### Gaussian (Normal) Distribution

The Gaussian (normal) distribution is among the most widely used probability distributions in statistics, in part because many real-world quantities are approximately normally distributed. Its density is:
$$\color{blue}{\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \Big(- \frac{1}{2 \sigma^2} (x - \mu)^2 \Big) \tag{6}}$$

In [None]:
import scipy.stats as stats
import math

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
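Equation (6) can also be coded directly and compared against `stats.norm.pdf`. The helper name `gaussian_pdf` is hypothetical, introduced only for this sketch.

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mu, sigma):
    """Evaluate equation (6) for mean mu and standard deviation sigma."""
    return np.sqrt(1.0 / (2 * np.pi * sigma**2)) * np.exp(-((x - mu) ** 2) / (2 * sigma**2))

x = np.linspace(-3, 3, 7)
manual = gaussian_pdf(x, mu=0.0, sigma=1.0)

# The hand-written formula should agree with scipy's implementation
print(np.allclose(manual, stats.norm.pdf(x, 0.0, 1.0)))   # True
```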

### Mixture Distributions
A mixture distribution is made up of several component distributions.

In [None]:
distributions = [
    {"type": np.random.normal, "kwargs": {"loc": -3, "scale": 2}},
    {"type": np.random.uniform, "kwargs": {"low": 4, "high": 6}},
    {"type": np.random.normal, "kwargs": {"loc": 2, "scale": 1}},
]
coefficients = np.array([0.5, 0.2, 0.3])
coefficients /= coefficients.sum()      # in case these did not add up to 1
sample_size = 100000

num_distr = len(distributions)
data = np.zeros((sample_size, num_distr))
for idx, distr in enumerate(distributions):
    data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)
sample = data[np.arange(sample_size), random_idx]
plt.hist(sample, bins=100, density=True)
plt.show()