# Machine Learning with Python - Introduction
 
## Learning goals 

In this "warm-up" exercise you will be introduced to some basic but useful functionalities provided by Python.
We will demonstrate how to 

* plot functions, 
* read in data from files or the internet, 
* visualize data sets using scatter plots, 
* compute an eigenvalue decomposition and 
* plot the probability density function of a Gaussian random variable. 

This notebook contains a few student tasks which require you to write a few lines of Python code to solve small problems. In particular, you have to fill in the gaps marked as **Student Task**.


## Exercise Contents

The exercise consists of the following parts:

1. [Matrices and Vectors](#Q1): You will learn how to define and work with matrices and vectors.

2. [Plotting](#Q2): You will learn how to define and plot a function.

3. [Reading in Data](#Q3): You will learn to load data from files and to create scatter plots.

4. [Eigenvectors](#Q4): You will review the concept of eigenvalue decompositions. 

5. [Probability Distributions](#Q5): You will review the concept of the Gaussian probability density function.

## Keywords 

`linear algebra`, `plotting`, `scatter plot`, `loading/saving files`, `eigenvalues and eigenvectors`, `matplotlib`, `numpy`, `pandas`, `Gaussian probability density function` 

## Relevant Sections in [Course Book](https://arxiv.org/abs/1805.05052)  

Section 1; Section 2


## 1. Matrices and Vectors
<a id="Q1"></a>

We now introduce the Python library `numpy`, which provides implementations of many useful matrix and vector operations, e.g. matrix multiplication or computing an eigenvalue decomposition. **Throughout this course we will represent matrices or vectors consistently using numpy data types ("numpy arrays").**

Hints: 

* You can find a more detailed tutorial on the `numpy` package under [this link](https://hackernoon.com/introduction-to-numpy-1-an-absolute-beginners-guide-to-machine-learning-and-data-science-5d87f13f0d51).

* You can read the "Learn the Basics" and "Data Science Tutorial" sections from [this link](https://www.learnpython.org/en/).

* A quick refresher for basic properties of matrices can be found under [this link](https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).
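
As a small illustration of working with numpy arrays, the sketch below builds a matrix and a vector and multiplies them. The names `A_demo`, `v_demo` and `product_demo` are chosen here for illustration only and are not used elsewhere in the exercise.

```python
import numpy as np

# hypothetical example data, not part of the exercise
A_demo = np.array([[1, 2], [3, 4]])   # 2 x 2 matrix
v_demo = np.array([1, 1])             # vector of length 2

# the @ operator computes the matrix-vector product
product_demo = A_demo @ v_demo
print("A_demo @ v_demo =", product_demo)

# .T transposes a matrix; for a 1-D array such as v_demo it has no effect
print("transpose of A_demo = \n", A_demo.T)
print("shape of v_demo.T =", v_demo.T.shape)
```

Note that a 1-D numpy array has no separate row or column orientation, which is why transposing it leaves its shape unchanged.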


In [None]:
#Important libraries to import:
# NumPy: is the fundamental package for scientific computing with Python.
# matplotlib.pyplot: provides a MATLAB-like plotting framework.

import numpy as np              # use shorthand "np" for the numpy library ("package")
import matplotlib.pyplot as plt # use shorthand "plt" for package matplotlib.pyplot
import pandas as pd             # use "pd" for package "pandas" providing methods for 
                                # loading and saving data from and to files 

# the following two imports are for testing purposes only
from plotchecker import ScatterPlotChecker
from plotchecker import LinePlotChecker

# let's create a vector a of length 9:
a = np.array([1,2,3,4,5,6,7,8,9]).transpose()
dimension=np.shape(a)
rows = dimension[0]
print("now we have a vector a =",a) 
print("the vector a has", rows, "elements \n")

# create a 2 x 2 matrix, denoted B, which contains only zero entries
B = np.zeros((2,2))
print("now we have a matrix B = \n",B)
dimension=np.shape(B)      # determine dimensions of matrix B
rows = dimension[0]        # first element of dimension is the number of rows 
cols = dimension[1]        # second element of "dimension" is the number of cols
print("the matrix B has", rows, "rows and", cols, "columns \n")

<a id='LoadDataset'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Matrix Multiplication. 
    </div>
    
The code snippet below implements a Python function `C,rows,cols=Matrix_multiplication(A,B)` which reads in two matrices $\mathbf{A}$ and $\mathbf{B}$ and returns the product $\mathbf{C}=\mathbf{A} \mathbf{B}$ along with the number of rows and columns of the resulting matrix. The aim of this demo is to show how to declare a function with multiple input and output parameters.
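
The shape rule behind such a product can be checked with a short sketch: a matrix of shape (m, n) can only be multiplied by a matrix of shape (n, p), and the result has shape (m, p). The matrices `A_demo` and `B_demo` and the flag `mismatch_detected` are hypothetical names used only here.

```python
import numpy as np

# hypothetical matrices used only for this illustration
A_demo = np.ones((3, 2))   # 3 rows, 2 columns
B_demo = np.ones((2, 4))   # 2 rows, 4 columns

# inner dimensions match (2 == 2), so the product is defined
print((A_demo @ B_demo).shape)

# swapping the factors gives inner dimensions 4 and 3, which do not match
try:
    B_demo @ A_demo
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("shapes do not match:", err)
```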

In [None]:
def Matrix_multiplication(A, B):
    """
    Compute the matrix multiplication A*B and its shape
    
    :param A: array-like, shape=(m, n)
    :param B: array-like, shape=(n, p)
    
    :return: The matrix resulting from the multiplication, its number of rows and its number of columns 
    """
    #We now construct another matrix H by multiplying the matrices A and B using the function np.dot():
    H = np.dot(A,B)
    shape = H.shape
    
    # A matrix that is the product of two matrices with r1 and r2 rows
    # and c1 and c2 columns will have r1 rows and c2 columns 
    
    rows = shape[0]
    cols = shape[1]
    
    return H, rows, cols

In [None]:
#Here we create two 3*3 matrices C and I:
C = np.array([[1,4,0],[3,2,5],[6,2,1]])
I = np.eye(3)
print("C = \n",C)
print("I = \n",I)

D, rows, cols = Matrix_multiplication(C,I)
print("D = C * I = \n",D)

#Read a particular row and column of matrix D:

secondcol = D[:,1]   # remember that indexing starts at 0 in Python ! 
secondrow = D[1,:]
print("second column of D = \n",secondcol)
print("second row of D =",secondrow)

# let us now determine the size of a matrix that 
# is obtained as the product of two other matrices F and G
F = np.array([[1,2],[3,4],[5,6]])

print("now we have a matrix F = \n", F)
print("\n the matrix F has", F.shape[0], "rows and", F.shape[1], "columns\n")

G = np.array([[1,2,3],[3,4,5]])

print("now we have a matrix G = \n", G)
print("\n the matrix G has", G.shape[0], "rows and", G.shape[1], "columns\n")

# what will be the size of F*G ?
H, rows, cols = Matrix_multiplication(F,G)

print("the product H=F*G is H = \n", H)
print("\n the matrix H has", rows, "rows and", cols, "columns\n")

print("\n") # create a line break 

# In the last step we want to take a look at how to build a 3d array
# by using the np.array() method
# you can also create directly a 3 dimension array by using methods such as np.empty, np.zeros etc. 
# and specifying a third parameter e.g. np.zeros((2,3,4))
L = np.array([[[0, 1],[2, 3]], [[4, 5],[6, 7]], [[9,10],[11,12]]]) # shape (3, 2, 2)

print("L = ", L) #you can see it as 3 matrices of size 2x2 each

## 2. Plotting Data
<a id="Q2"></a>

In this task you will learn how to use the Python library (or package) `matplotlib` (https://matplotlib.org/index.html) to depict functions such as the **sigmoid function** $f(x)=\dfrac{1}{1+e^{-x}}$. The sigmoid function is widely used in machine learning because it can represent confidence levels, from no confidence ($f(x)\approx 0$) to full confidence ($f(x)\approx 1$), in classification methods.

Hint: 
- You can find useful tutorials for the `matplotlib` package under [this link](https://matplotlib.org/3.0.2/tutorials/index.html#introductory)


<a id='LoadDataset'></a>

<div class=" alert alert-info">

<b>Demo.</b> 

Plotting Functions. 
  </div>
  
The code snippet below implements a Python function `axes=Plot_sigmoid()` 
which does not have any input parameters and returns an `axes` object 
for a plot of the sigmoid function $f(x)$ in the range $x=-6 ,\ldots,6$.

Hint: 

You can find more information about the `axes` object under [this link](https://matplotlib.org/api/axes_api.html?highlight=axes#module-matplotlib.axes)
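
As a side note, numpy functions such as `np.exp` operate element-wise on arrays, so the sigmoid can also be evaluated on a whole array at once. The following is a minimal sketch of that idea; the names `sigmoid_vectorized`, `x_demo` and `f_demo` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_vectorized(x):
    # np.exp works element-wise, so x may be a scalar or a numpy array
    return 1 / (1 + np.exp(-x))

x_demo = np.arange(-6, 6, 0.01)
f_demo = sigmoid_vectorized(x_demo)   # no explicit loop needed
print(f_demo.shape)
```

The result has the same shape as the input array, and every value lies strictly between 0 and 1.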

 


In [None]:
def Plot_sigmoid():
    """
    Plot the sigmoid function f(x) in the range [-6,6]
    
    :return: axes object used for testing, containing the plot of f(x).    
    """
    
    def sigmoid_func(x):
        f_x = 1/(1+np.exp(-x))
        return f_x

    fig, axes = plt.subplots(1, 1, figsize=(15, 5)) #used only for testing purpose

    # np.arange creates a vector starting from -6 to 6 (not included) with step 0.01.
    # range_x will contain [-6.0 -5.99 -5.98 ... 5.97 5.98 5.99]
    range_x = np.arange(-6 , 6 , 0.01)

    f_x = np.empty(len(range_x))

    for i in range(len(range_x)):
        f_x[i] = sigmoid_func(range_x[i])

    
    # plot the results, using the plot function in matplotlib.pyplot.
    axes.plot(range_x,f_x, label='sigmoid function')
    axes.set_xlabel('x')
    axes.set_ylabel('f(x)')
    axes.legend()
    
    return axes

In [None]:
axes = Plot_sigmoid()

<a id='LoadDataset'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Plotting Functions. 
    </div>

The following code snippet implements a Python function `axes=Plot_first_derivative()` which has no input parameters and returns an `axes` object. The returned `axes` object should contain a plot of the first derivative $f'(x)$ of the sigmoid function $f(x)=\dfrac{1}{1+e^{-x}}$ in the range $x=-6 ,\ldots,6$.
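
The derivative of the sigmoid function can be written in terms of the function itself: applying the chain rule gives $f'(x)=\dfrac{e^{-x}}{(1+e^{-x})^{2}} = f(x)\big(1-f(x)\big)$. The sketch below checks this identity numerically with a central finite difference; the step size `h` and the grid `x_demo` are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_demo = np.linspace(-6, 6, 1201)
h = 1e-5                                   # step size for the finite difference

analytic = sigmoid(x_demo) * (1 - sigmoid(x_demo))
numeric = (sigmoid(x_demo + h) - sigmoid(x_demo - h)) / (2 * h)

max_error = np.max(np.abs(analytic - numeric))
print("largest deviation:", max_error)
```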

In [None]:
def Plot_first_derivative():
    """
    Plot the first derivative of the sigmoid function f(x) in the range [-6,6]
    
    :return: axes object used for testing, containing the plot of f'(x).    
    """
    
    def sigmoid_func(x):
        f_x = 1/(1+np.exp(-x))
        return f_x

    fig, axes = plt.subplots(1, 1, figsize=(15, 5)) #used only for testing purpose

    # np.arange creates a vector starting from -6 to 6 (not included) with step 0.01.
    # range_x will contain [-6.0 -5.99 -5.98 ... 5.97 5.98 5.99]
    range_x = np.arange(-6 , 6 , 0.01)

    f_x = np.empty(len(range_x))

    for i in range(len(range_x)):
        f_x[i] = sigmoid_func(range_x[i])
    
    # Let's find the first derivative of the mentioned function and plot it. 
    f_prime_x = f_x * (1-f_x)
    axes.plot(range_x,f_prime_x, label='derivative of sigmoid function')
    axes.set_xlabel('x')
    axes.set_ylabel('f(x)(1-f(x))')
    axes.legend()
    
    return axes

In [None]:
axes = Plot_first_derivative()


<a id='LoadDataset'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Plotting Curves. 
    </div>
    
The following code snippet demonstrates how to plot multiple curves in the same figure.

In [None]:
#Example on how to plot multiple functions in the same graph

range_x = np.arange(-5 , 5 , 0.01)

def straight_line1(x):
    f_x = 3*x +2
    return f_x

def straight_line2(x):
    f_x = x +5
    return f_x

def straight_line3(x):
    f_x = -5*x +1
    return f_x

f_x1 = straight_line1(range_x)
f_x2 = straight_line2(range_x)
f_x3 = straight_line3(range_x)

#plt.plot() for every function
plt.plot(range_x,f_x1, label='straight_line1') 
plt.plot(range_x,f_x2, label='straight_line2')
plt.plot(range_x,f_x3, label='straight_line3')
plt.xlabel('x')
plt.ylabel('f(x)')

plt.legend()
plt.show() #plt.show() must be called once, and after plt.plot()

## 3. Reading in Data
<a id="Q3"></a>

We now demonstrate how to use the `pandas` package to read in data from csv files or from the internet (such as [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page)). We will also demonstrate how to use the `matplotlib.pyplot` library to create scatter plots.  

Hint: You can find more information about the `pandas` package under this [link](http://pandas.pydata.org/pandas-docs/stable/)


<a id='LoadDataset'></a>
<div class=" alert alert-warning">
    <b>Student Task.</b> Load Data from File. 
    </div>

- Implement a Python function `X, m, n=LoadData(filename)` which reads in the filename of a csv file as input parameter. The function should return three output parameters: a matrix $\mathbf{X} =\big(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)}\big)^{T}$ whose rows are the feature vectors $\mathbf{x}^{(i)} \in \mathbb{R}^{n}$ which are stored in the rows of the csv file, the sample size (total number of rows in the csv file) $m$ and the feature length $n$. 
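
To see how `pd.read_csv` treats a file, the sketch below reads a small csv text from memory instead of from disk. The csv content, the column names `x1`, `x2` and the variable names are hypothetical and only serve this illustration; by default, `pd.read_csv` interprets the first line as column names rather than as a data row.

```python
import io
import pandas as pd

# hypothetical CSV content with a header line and three data rows
csv_text = "x1,x2\n1.0,2.0\n3.0,4.0\n5.0,6.0\n"

df_demo = pd.read_csv(io.StringIO(csv_text))  # the header line is not a data row
X_demo = df_demo.values                       # numpy array holding the data rows
print(X_demo.shape)
```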


In [None]:
def LoadData(filename):
    """
    Load the dataframe reading the file with the filename given as a parameter.
    Print the sample size m and the feature length n.

    :input: String path to the file

    :return: numpy array of shape=(m, n), the sample size m and the feature length n    
    """

    df = pd.read_csv(filename)
    X = df.values # convert the data frame to numpy array

    #n = ...
    #m = ...
    #print('sample size m=',...)
    #print('feature length n=',...)
    # YOUR CODE HERE
    raise NotImplementedError()

    return X, m, n

In [None]:
X, m, n = LoadData("Data.csv")
assert X.shape == (600,2), f'Expected X to have shape (600, 2), but got {X.shape}'


print("Sanity checks passed! Some hidden tests may still fail.")

<a id='Scatterplots'></a>

<div class=" alert alert-info">
 
 <b>Demo.</b> Scatterplots. 
  </div>  
 
The following code snippet implements the function `axes=ScatterPlots()` that returns an axes object which contains two scatterplots: 
- one scatter plot depicting the feature vectors stored in "Data.csv"

- one scatter plot depicting the feature vectors stored in "Data.csv" but divided into 3 subsets corresponding to the first 200, the second 200, and the last 200 rows in "Data.csv". This second scatter plot uses different colors for the (feature vectors from) different subsets.


In [None]:
def ScatterPlots():
    """
    Plot the scatterplot of all the data, then plot the scatterplot of the 3 subsets,
    each one with a different color

    return: axes object used for testing, containing the 2 scatterplots.    
    """

    fig, axes = plt.subplots(1, 2, figsize=(15, 5)) #used only for testing purpose
    data, _, _, = LoadData("Data.csv")                # load data from csv file

    colors = ['r', 'g', 'b']

    axes[0].scatter(data[:,0],data[:,1], label='All data')
    axes[0].legend()
    axes[1].scatter(data[0:200,0],data[0:200,1], c=colors[0], label='first 200 data')
    axes[1].scatter(data[200:400,0],data[200:400,1], c=colors[1], label='second 200 data')
    axes[1].scatter(data[400:600,0],data[400:600,1], c=colors[2], label='third 200 data')
    axes[1].legend()

    return axes

In [None]:
axes = ScatterPlots()

<a id='wikidata'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Query wikidata.org. 
    </div>

The code snippet below shows how to access https://www.wikidata.org/ in order to obtain current statistics about countries belonging to the European Union (EU). Besides reading in data from wikidata, the code below also shows how to handle (impute) missing data ("NaN"). Different methods can be used to fill missing data fields. The basic idea of those methods is to interpolate between similar data points. The code below implements a simple imputation which replaces each missing value of a property by the mean of all known values of that property.  

Hint: 

- You can find more information on how to query wikidata under [this link.](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/A_gentle_introduction_to_the_Wikidata_Query_Service)
- You can find more information on coping with missing data using imputation under [this link.](http://www.paultwin.com/wp-content/uploads/Lodder_1140873_Paper_Imputation.pdf)
- Python provides several methods for data imputation ([see here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html))
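
The mean imputation described above can be sketched on a tiny table. The data frame `df_demo` and its values are made up for this illustration and have nothing to do with the wikidata query.

```python
import numpy as np
import pandas as pd

# hypothetical table with one missing entry per column
df_demo = pd.DataFrame({"population": [10.0, np.nan, 30.0],
                        "area": [1.0, 2.0, np.nan]})

# df_demo.mean() ignores NaN entries and returns one mean per column;
# fillna then inserts each column mean into the gaps of that column
df_filled = df_demo.fillna(df_demo.mean())
print(df_filled)
```

Here the missing population is replaced by 20.0 (the mean of 10.0 and 30.0) and the missing area by 1.5 (the mean of 1.0 and 2.0).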

In [None]:
import requests
import pandas as pd
from collections import OrderedDict
import matplotlib.pyplot as plt

# E.g. read in key figures (gdp, average age, population, ...) of all countries.
# only EU countries: ?country wdt:P463 wd:Q458

url = 'https://query.wikidata.org/sparql'
query = """
SELECT
  ?countryLabel ?population ?area ?medianIncome ?age ?nominalGDP
WHERE {
  ?country wdt:P463 wd:Q458
  OPTIONAL { ?country wdt:P1082 ?population }
  OPTIONAL { ?country wdt:P2046 ?area }
  OPTIONAL { ?country wdt:P3529 ?medianIncome }
  OPTIONAL { ?country wdt:P571 ?inception.
    BIND(year(now()) - year(?inception) AS ?age)
  }
  OPTIONAL { ?country wdt:P2131 ?nominalGDP}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

r = requests.get(url, params={'format': 'json', 'query': query}) #execute the query
data = r.json()

countries = []
#cleans the data because some values are missing
for item in data['results']['bindings']:
    countries.append(OrderedDict({'country': item['countryLabel']['value'],
'population': item['population']['value']
    if 'population' in item else None,
'area': item['area']['value']
    if 'area' in item else None,
'medianIncome': item['medianIncome']['value']
    if 'medianIncome' in item else None,
'age': item['age']['value']
    if 'age' in item else None,
'nominalGDP': item['nominalGDP']['value']
    if 'nominalGDP' in item else None}))

df_wikidata=pd.DataFrame(countries)
df_wikidata.set_index('country', inplace=True)
df_wikidata=df_wikidata.astype({'population': float, 'area': float, 'medianIncome': float, 'age': float, 'nominalGDP': float})

df_wikidata.fillna(df_wikidata.mean(), inplace=True)   # replace missing data "NaN" with column means 
data_wikidata = df_wikidata.values[:,0:2]              # needed for Task5 only 

print(df_wikidata)

## 4. Eigenvectors
<a id="Q4"></a>

Consider a dataset containing $m$ feature vectors $\mathbf{x}^{(1)},...,\mathbf{x}^{(m)} \in \mathbb{R}^{n}$ 
which we stack into the feature matrix $\mathbf{X}=(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)})^{T} \in \mathbb{R}^{m \times n}$. In order to characterize the intrinsic geometry of this dataset it is useful to compute the sample covariance 
matrix 

$$\mathbf{C} = (1/m) \mathbf{X}^{T} \mathbf{X}.$$ 

The entry in row $j$ and column $k$ of the matrix $\mathbf{C}$ is the scaled inner product $(1/m) \sum_{i=1}^{m} x^{(i)}_{j} x^{(i)}_{k}$ between the $j$-th and the $k$-th column of $\mathbf{X}$, i.e., between two features evaluated over all $m$ data points. The matrix $\mathbf{C}$ is positive semidefinite (psd) since 

$$\mathbf{w}^{T}\mathbf{C}\mathbf{w} = (1/m) \mathbf{w}^{T} \mathbf{X}^{T} \mathbf{X} \mathbf{w} = (1/m) \|\mathbf{X} \mathbf{w} \|^{2} \geq 0 \mbox{ for any vector }\mathbf{w} \in \mathbb{R}^{n}.$$ 

Since $\mathbf{C}$ is symmetric and psd, we can decompose it into three factors 

$$\mathbf{C} = \mathbf{U} {\bf \mathbf{\Lambda}} \mathbf{U}^{T}.$$ 

The matrix $\mathbf{U}=\big(\mathbf{u}^{(1)},\ldots,\mathbf{u}^{(n)}\big)$ is orthonormal (i.e., $\mathbf{U}^{T}\mathbf{U}=\mathbf{I}$) 
and its columns $\mathbf{u}^{(1)},\ldots,\mathbf{u}^{(n)}$ are eigenvectors of $\mathbf{C}$. The matrix ${\mathbf{\Lambda}}$ is diagonal having the eigenvalues $\lambda_{1}\geq\lambda_{2}\geq \ldots \geq \lambda_{n}$ of $\mathbf{C}$ on the main diagonal. The above decomposition is also known as the eigenvalue decomposition of the matrix $\mathbf{C}$.  
 
 
Hints:

- The eigenvector $\mathbf{u}^{(1)}$ (corresponding to the largest eigenvalue $\lambda_{1}$) should indicate the direction into which the dataset spreads the most.
- Computing eigenvalues and eigenvectors will be required in a later exercise on dimensionality reduction.   
- You can find more information about eigenvalue decomposition under [this link](http://math.mit.edu/~gs/linearalgebra/ila0601.pdf) and about the interpretation of the covariance matrix under [this link](http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/)
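
The decomposition $\mathbf{C} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^{T}$ can be illustrated on a small example. The sketch below uses `np.linalg.eigh`, which is one of several numpy routines for this purpose, on a hypothetical matrix `C_demo` that is unrelated to the exercise data.

```python
import numpy as np

# hypothetical symmetric psd matrix, not the covariance matrix of the exercise data
C_demo = np.array([[2.0, 1.0],
                   [1.0, 2.0]])

# np.linalg.eigh is intended for symmetric matrices and returns the
# eigenvalues in ascending order, with eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(C_demo)

# reorder so that the largest eigenvalue comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# check the decomposition C = U diag(lambda) U^T
reconstructed = eigvecs @ np.diag(eigvals) @ eigvecs.T
print(eigvals)
print(np.allclose(reconstructed, C_demo))
```

For this matrix the eigenvalues are 3 and 1, and the eigenvector belonging to the eigenvalue 3 points along the diagonal direction $(1,1)^{T}$ (up to sign and scaling).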

<a id='LoadAndScatter'></a>
<div class=" alert alert-warning">
    <b>Student Task.</b> Load and Visualize Data. 
    </div>

Implement a Python function `X, axes = LoadDataAndScatter(filename)` which reads in the filename of a csv file. The function should return all the feature vectors $\mathbf{x}^{(i)} \in \mathbb{R}^{n}$ stored in the rows of the file "DataTask4.csv", together with an axes object containing a scatter plot of them.

In [None]:
def LoadDataAndScatter(filename):
    """
    Load the data contained in the file DataTask4.csv and generate a scatterplot 
    of all the dataset.

    :input: String path to the file

    :return: numpy array of shape=(m, n), axes object containing the scatterplot.    
    """

    # read in the data from "DataTask4.csv", you can use the function LoadData()
    #X = ...
    # YOUR CODE HERE
    raise NotImplementedError()

    fig, axes = plt.subplots(1, 1, figsize=(8, 8)) #used only for testing purpose

    ### STUDENT TASK ###
    # draw the scatter plot for data X
    # YOUR CODE HERE
    raise NotImplementedError()

    plt.xlim((-4,4))
    plt.ylim((-4,4))
    plt.legend()

    return X, axes

In [None]:
X, axes = LoadDataAndScatter('DataTask4.csv')
assert X.shape == (1000,2), f'Expected X to have shape (1000, 2), but got {X.shape}'


print("Sanity checks passed! Some hidden tests may still fail.")

<a id='CovarianceAndEigen'></a>
<div class=" alert alert-warning">
    <b>Student Task.</b> Eigenvalue Decomposition of a Covariance Matrix.  
    </div>

Implement a Python function `C, g , U = EigenvalueDecomposition(X)` which reads in a data matrix $\mathbf{X}=\big(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)}\big)^{T}$ whose rows are feature vectors $\mathbf{x}^{(i)} \in \mathbb{R}^{n}$ of $m$ data points. The function should return 

- the covariance matrix $\mathbf{C} = (1/m) \mathbf{X}^{T} \mathbf{X} \in \mathbb{R}^{n \times n}$.
- a vector $\mathbf{g}=\big(\lambda_{1},\ldots,\lambda_{n}\big)^{T} \in \mathbb{R}^{n}$ whose entries are the decreasingly sorted eigenvalues $\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{n}$ of the matrix $\mathbf{C}$
- a matrix $\mathbf{U}=\big(\mathbf{u}^{(1)},\ldots,\mathbf{u}^{(n)}\big) \in \mathbb{R}^{n \times n}$ whose columns contain the eigenvectors $\mathbf{u}^{(i)}$ of $\mathbf{C}$, corresponding to the eigenvalue $\lambda_{i}$. 

In [None]:
def EigenvalueDecomposition():
    """
    Compute the covariance matrix C, eigenvectors and eigenvalues

    return: The covariance matrix C of shape=(n, n), eigenvalues and eigenvectors    
    """

    X, _, _ = LoadData('DataTask4.csv')

    ### STUDENT TASK ###
    # compute the sample covariance of data X
    #C = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    #Hint: C should be a 2x2 matrix. 

    ### STUDENT TASK ###
    # find eigenvalues and eigenvectors of C and print them
    # values, vectors = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    print('eigenvalues= ', values)
    print('eigenvectors= ', vectors)

    return C, values, vectors

In [None]:
C, values, vectors = EigenvalueDecomposition()
assert C.shape == (2,2), f"C should be a 2x2 matrix"


print("Sanity checks passed! Some hidden tests may still fail.")

<a id='PlotEigenVectors'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Eigenvectors.  
    </div>
    
The following code snippet indicates the eigenvectors $\mathbf{u}^{(1)}$, $\mathbf{u}^{(2)}$ (corresponding to the two largest eigenvalues) by drawing a line from the origin $(0,0)$ to the points given by $\mathbf{u}^{(1)}$ and $\mathbf{u}^{(2)} \in \mathbb{R}^{2}$, on top of the scatter plot with the data points.
 
Hints:

- If the same scaling is used for the horizontal and vertical axis of the scatter plot, then the eigenvectors should be perpendicular to each other. Use _plt.xlim_ and _plt.ylim_ to set the axes length. Furthermore, you can set the figure size by using _plt.figure(figsize=(8,8))_.


In [None]:
def PlotEigenVectors():
    """
    Plot the two eigenvectors in the scatterplot plotted from the previous task

    return: axes object used for testing, containing the two eigenvectors.    
    """
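    X, axes = LoadDataAndScatter('DataTask4.csv')

    # the eigenvectors are stored in the columns of "vectors", so the i-th
    # eigenvector has x-coordinate vectors[0,i] and y-coordinate vectors[1,i]
    axes.plot([0,vectors[0,0]],[0,vectors[1,0]],'r-', label='eigenvectors')
    axes.plot([0,vectors[0,1]],[0,vectors[1,1]],'r-')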

    plt.legend()
    return axes

In [None]:
axes = PlotEigenVectors()

## 5. Probability Distributions
<a id="Q5"></a>

In this task you are required to use the `multivariate_normal.pdf` function from the `scipy.stats` package to evaluate the probability density function (pdf) $f(x)$ of a Gaussian random variable.

The Gaussian distribution is important because many processes in nature and social sciences can be accurately modeled using Gaussian random variables.

The pdf of a Gaussian random variable $x$ is given by $f(x)=\dfrac{1}{\sqrt{2\pi\sigma^2}}{\rm exp} \big(-(x-\mu)^2/(2\sigma^{2})\big)$ with the mean $\mu ={\rm E} \{ x \}$ and the variance $\sigma^{2} = {\rm E} \{(x- \mu)^2 \}$. 

Some of the properties of the Gaussian pdf are:

- It is symmetric around the point $x = \mu$
- The area between the curve and the x-axis is equal to one
- About 99.7% of the area under the curve lies between $\mu - 3\sigma$ and $\mu + 3\sigma$
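
These properties can be checked numerically. The following is a minimal sketch that implements the formula for $f(x)$ directly with numpy and approximates the areas by a Riemann sum; the parameters `mu_demo = 0` and `sigma_demo = 1`, the grid, and the helper name `gaussian_pdf` are assumptions made for this illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # direct implementation of the formula for f(x) given above
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu_demo, sigma_demo = 0.0, 1.0          # hypothetical parameters
x_demo = np.linspace(-8, 8, 16001)      # fine grid covering mu +/- 8 sigma
dx = x_demo[1] - x_demo[0]
f_demo = gaussian_pdf(x_demo, mu_demo, sigma_demo)

total_area = np.sum(f_demo) * dx                       # Riemann sum over the grid
inside = np.abs(x_demo - mu_demo) <= 3 * sigma_demo
area_3sigma = np.sum(f_demo[inside]) * dx

print("total area      ~", round(total_area, 4))
print("area in 3 sigma ~", round(area_3sigma, 4))
```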

Under [this link](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.multivariate_normal.html) you can find more information about the Python function `multivariate_normal.pdf`. 

Under [this link](http://mathworld.wolfram.com/NormalDistribution.html), you can find more information about the Gaussian pdf and in [this link](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.contour.html) you can find more information about `plt.contour`. 


<a id='ComputeAndPlotGaussianpdf'></a>
<div class=" alert alert-warning">
    <b>Student Task.</b> Gaussian Random Variables.  
    </div>
 
Implement a Python function `axes = ComputeAndPlotGaussian()` which has no input parameters and returns an axes object representing the plot of the Gaussian pdf $f(x)$ in the range $x=0 ,\ldots,5$, computed using mean $5/2$ and variance $1/2$. 
 

In [None]:
from scipy.stats import multivariate_normal

def ComputeAndPlotGaussian():
    """
    Compute and plot the Gaussian pdf for a random variable x

    return: axes object used for testing, containing the Gaussian pdf plotted.    
    """

    x = np.linspace(0, 5, 100, endpoint=False)

    ### STUDENT TASK ###
    # compute the values of the gaussian probability density function
    # y = multivariate_normal.pdf(...)
    # YOUR CODE HERE
    raise NotImplementedError()

    fig, axes = plt.subplots(1, 1, figsize=(15, 5)) #used only for testing purpose

    ### STUDENT TASK
    # plot the function in the range [0,5]
    # YOUR CODE HERE
    raise NotImplementedError()

    return axes

In [None]:
axes = ComputeAndPlotGaussian()
pc = LinePlotChecker(axes)
test_x = np.linspace(0, 5, 100, endpoint=False)
np.testing.assert_allclose(pc.x_data[0], test_x, atol=1e-6, err_msg="The x values for the graph are incorrect. Don't modify the linspace line")


print("Sanity checks passed! Some hidden tests may still fail.")

<a id='Fit2DGaussian'></a>
<div class=" alert alert-info">
    <b>Demo.</b> Fitting a Gaussian Distribution to Data.  
    </div>

The code snippet below fits a two-dimensional Gaussian distribution to a set of data points which is read in from wikidata. 
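
Note that the formula $(1/m) \mathbf{X}^{T} \mathbf{X}$ coincides with the sample covariance matrix only if the data has zero sample mean. As a point of comparison, the sketch below fits a Gaussian to synthetic data and subtracts the sample mean before forming the covariance matrix. The data `X_demo`, the random seed and the helper `gaussian_pdf_2d` are hypothetical and not part of the demo that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical synthetic data: 500 points in R^2 with non-zero mean
X_demo = rng.normal(loc=[5.0, -2.0], scale=[2.0, 0.5], size=(500, 2))

mean_demo = X_demo.mean(axis=0)            # sample mean, one entry per feature
X_centered = X_demo - mean_demo            # subtract the mean from every row
C_demo = X_centered.T @ X_centered / X_demo.shape[0]

def gaussian_pdf_2d(points, mean, cov):
    # density of a two-dimensional Gaussian evaluated at each row of points
    diff = points - mean
    quad = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

density_at_mean = gaussian_pdf_2d(mean_demo[None, :], mean_demo, C_demo)[0]
print(C_demo)
print(density_at_mean)
```

The fitted density takes its largest value at the sample mean, which is where the innermost contour line of a contour plot would be centered.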

In [None]:
#scatter plot of the wikidata population and area
population = data_wikidata[:,0]
area = data_wikidata[:,1]
plt.figure(figsize=(15,8))
plt.scatter(population, area, label='wikidata')

#scaling for a better view of the data
plt.xlim(-2e7,9e7)
plt.ylim(-1e5,7e5)
#compute mean and covariance matrix of data_wikidata using the formula from Task number 4 and plot
#showing a red x as the mean in the graph
x_mean = np.mean(population)
y_mean = np.mean(area)
plt.plot(x_mean, y_mean, 'rx')
X = data_wikidata

#compute the covariance matrix C = (1/m) X^T X
C = np.matmul(np.transpose(X), X)/X.shape[0]

#plot a Gaussian using the mean and the covariance computed in the previous step, to fit the scatterplot
#data in the same graph as before, using the function plt.contour from matplotlib
#the grid limits are chosen by looking at the population and area values 
#(lower than the minimum and higher than the maximum value)
x = np.linspace(-12e7,9e7, 100) 
y = np.linspace(-1e5,7e5, 100)

x_mesh, y_mesh = np.meshgrid(x,y)
pos = np.array([x_mesh.flatten(),y_mesh.flatten()]).T
z = multivariate_normal([x_mean,y_mean],C).pdf(pos)
z = z.reshape((100,100))
plt.contour(x_mesh,y_mesh,z,10)
plt.xlabel('population')
plt.ylabel('area')

plt.show()