# Programming and Data Analysis

> Assignment 4

Kuo, Yao-Jen <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

In [1]:
import numpy as np
import pandas as pd

## Instructions

- The assignment session will be disconnected after idling for more than 10 minutes; we can start a new session by clicking the assignment link again.
- We've imported necessary modules at the top of each assignment.
- We've put necessary files (if any) in the working directory.
- We've defined the names of functions/inputs/parameters for you.
- Write down your solution between the comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- It is NECESSARY to `return` the answer; tests will fail if the answer is only printed.
- A `SyntaxError` or `IndentationError` is known to break our `test_runner.py` and result in a grade of zero. We highly recommend testing your solution by calling the functions/methods in the notebook or by running the tests before submission.
- Run the tests to check whether your solutions are correct:
    - File -> Save Notebook to save exercises.ipynb.
    - File -> New -> Terminal to open a Terminal.
    - Use the command `python test_runner.py` to run the tests.
- When you are ready to submit, click File -> Export Notebook As -> Executable Script.
- Rename the exported Python script with your student ID (e.g. `b01234567.py`) and upload it to the Assignment session on NTU COOL/NTNU Moodle.

## 01. Define a function named `add_intercepts` which horizontally prepends an `(m, 1)` column of zeros to a given array with `m` rows.

- Expected inputs: `np.ndarray`
- Expected outputs: `np.ndarray`

In [2]:
def add_intercepts(arr: np.ndarray) -> np.ndarray:
    """
    >>> A = np.array([5, 5, 6, 6]).reshape(-1, 1)
    >>> add_intercepts(A)
    array([[0, 5],
           [0, 5],
           [0, 6],
           [0, 6]])
    >>> B = np.ones((5, 2), dtype=int)
    >>> add_intercepts(B)
    array([[0, 1, 1],
           [0, 1, 1],
           [0, 1, 1],
           [0, 1, 1],
           [0, 1, 1]])
    """
    ### BEGIN SOLUTION
    m = arr.shape[0]
    intercepts = np.zeros(m, dtype=int).reshape(-1, 1)
    out = np.concatenate([intercepts, arr], axis=1)
    return out
    ### END SOLUTION

## 02. Define a function named `split_train_test` which splits a given array row-wise into a training part and a test part, where `test_size` is the fraction of rows assigned to the test part.

- Expected inputs: `np.ndarray`, `float`
- Expected outputs: `tuple`

In [3]:
def split_train_test(arr: np.ndarray, test_size: float) -> tuple:
    """
    >>> A = np.ones((10, 2))
    >>> A_train, A_test = split_train_test(A, test_size=0.3)
    >>> A_train.shape
    (7, 2)
    >>> A_test.shape
    (3, 2)
    >>> B = np.ones((20, 3))
    >>> B_train, B_test = split_train_test(B, test_size=0.4)
    >>> B_train.shape
    (12, 3)
    >>> B_test.shape
    (8, 3)
    """
    ### BEGIN SOLUTION
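    m = arr.shape[0]
    # Number of test rows; int() truncates toward zero.
    test_index = int(m * test_size)
    # np.split cuts at test_index: the leading rows form the test part, the rest the training part.
    arr_test, arr_train = np.split(arr, [test_index])
    return arr_train, arr_test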
    ### END SOLUTION
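
One detail worth checking in the solution above: `int(m * test_size)` truncates, so floating-point error in the product can leave the test part one row short. The sketch below is a variant under that assumption; the name `split_train_test_rounded` is hypothetical and not part of the assignment. It rounds to the nearest integer and otherwise mirrors the split.

```python
import numpy as np


def split_train_test_rounded(arr: np.ndarray, test_size: float) -> tuple:
    # Hypothetical variant: round() instead of int() when sizing the test part.
    n_test = int(round(arr.shape[0] * test_size))
    # As in the solution, the leading rows become the test part.
    arr_test, arr_train = np.split(arr, [n_test])
    return arr_train, arr_test


A = np.ones((100, 2))
A_train, A_test = split_train_test_rounded(A, test_size=0.29)
print(100 * 0.29)                   # slightly below 29 in floating point
print(int(100 * 0.29))              # truncation gives 28
print(A_train.shape, A_test.shape)  # rounding gives 71 and 29 rows
```

For the docstring cases (`test_size` of 0.3 and 0.4) both approaches produce the same shapes, so the difference only matters for products that land just below an integer.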

## 03. Define a function named `is_invertible` which determines if an inverse matrix exists for a given matrix.

- Expected inputs: `np.ndarray`
- Expected outputs: `bool`

In [4]:
def is_invertible(arr: np.ndarray) -> bool:
    """
    >>> A = np.array([1, 2, 2, 4]).reshape(2, 2)
    >>> is_invertible(A)
    False
    >>> B = np.array([5, 5, 6, 6]).reshape(2, 2)
    >>> is_invertible(B)
    False
    >>> C = np.array([5, 6, 7, 8]).reshape(2, 2)
    >>> is_invertible(C)
    True
    """
    ### BEGIN SOLUTION
    try:
        # np.linalg.inv raises LinAlgError for a singular or non-square matrix.
        np.linalg.inv(arr)
        return True
    except np.linalg.LinAlgError:
        return False
    ### END SOLUTION
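
The solution above relies on `np.linalg.inv` raising `LinAlgError`. For a matrix that is singular only up to floating-point round-off, the inversion may complete without raising, so a rank-based check is sometimes preferred. The following is a minimal sketch of that alternative; the helper name `is_invertible_by_rank` is hypothetical, and `np.linalg.matrix_rank` uses its default tolerance.

```python
import numpy as np


def is_invertible_by_rank(arr: np.ndarray) -> bool:
    # Hypothetical helper: a matrix is invertible only if it is square and has full rank.
    if arr.ndim != 2 or arr.shape[0] != arr.shape[1]:
        return False
    return bool(np.linalg.matrix_rank(arr) == arr.shape[0])


print(is_invertible_by_rank(np.array([[1, 2], [2, 4]])))  # False
print(is_invertible_by_rank(np.array([[5, 5], [6, 6]])))  # False
print(is_invertible_by_rank(np.array([[5, 6], [7, 8]])))  # True
```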

## 04. Define a function named `create_diagonal_split_matrix` which generates a square matrix of order `n` with zeros on the main diagonal and `fill_int` everywhere off the main diagonal.

- Expected inputs: `int`
- Expected outputs: `np.ndarray`

In [12]:
def create_diagonal_split_matrix(n: int, fill_int: int) -> np.ndarray:
    """
    >>> create_diagonal_split_matrix(2, 5566)
    array([[   0, 5566],
           [5566,    0]])
    >>> create_diagonal_split_matrix(3, 55)
    array([[ 0, 55, 55],
           [55,  0, 55],
           [55, 55,  0]])
    >>> create_diagonal_split_matrix(4, 66)
    array([[ 0, 66, 66, 66],
           [66,  0, 66, 66],
           [66, 66,  0, 66],
           [66, 66, 66,  0]])
    """
    ### BEGIN SOLUTION
    # Start from an (n, n) matrix filled with fill_int.
    out_arr = np.full(shape=(n, n), fill_value=fill_int)
    # Zero out the main diagonal in place.
    np.fill_diagonal(out_arr, 0)
    return out_arr
    ### END SOLUTION

## 05. Define a function named `create_square_matrix` which generates a square matrix of order `n` whose elements equal the product of their row number and column number, both counted from 1.

- Expected inputs: `int`
- Expected outputs: `np.ndarray`

In [6]:
def create_square_matrix(n: int) -> np.ndarray:
    """
    >>> create_square_matrix(3)
    array([[1, 2, 3],
           [2, 4, 6],
           [3, 6, 9]])
    >>> create_square_matrix(4)
    array([[ 1,  2,  3,  4],
           [ 2,  4,  6,  8],
           [ 3,  6,  9, 12],
           [ 4,  8, 12, 16]])
    >>> create_square_matrix(5)
    array([[ 1,  2,  3,  4,  5],
           [ 2,  4,  6,  8, 10],
           [ 3,  6,  9, 12, 15],
           [ 4,  8, 12, 16, 20],
           [ 5, 10, 15, 20, 25]])
    """
    ### BEGIN SOLUTION
    arr = np.arange(1, n + 1).reshape(-1, 1)
    out = arr.dot(arr.T)
    return out
    ### END SOLUTION
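
The column-times-row product in the solution above is an outer product. As a short sketch, the same table can be built with `np.outer` or with broadcasting; the variable names here are illustrative only.

```python
import numpy as np

n = 4
idx = np.arange(1, n + 1)  # 1-based row/column numbers
# Outer product: element [i, j] is idx[i] * idx[j].
by_outer = np.outer(idx, idx)
# Broadcasting a column vector against a row vector gives the same table.
by_broadcast = idx[:, np.newaxis] * idx[np.newaxis, :]
print(by_outer)
print(np.array_equal(by_outer, by_broadcast))  # True
```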

## 06. Define a class named `MeanError` whose instances provide 2 methods, `get_mse` and `get_mae`, that calculate the mean squared error and the mean absolute error between 2 equal-length arrays.

\begin{equation}
\begin{aligned}
MSE &= \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2 \\
MAE &= \frac{1}{m}\sum_{i=1}^{m} \lvert y_i - \hat{y}_i \rvert
\end{aligned}
\end{equation}

Sources: <https://en.wikipedia.org/wiki/Mean_squared_error>, <https://en.wikipedia.org/wiki/Mean_absolute_error>

- Expected inputs: `np.ndarray`
- Expected outputs: `float`

In [7]:
class MeanError:
    """
    >>> y = np.array([5, 5, 6, 6])
    >>> y_hat = np.array([5, 5, 6, 6])
    >>> me = MeanError(y, y_hat)
    >>> me.get_mse()
    0.0
    >>> me.get_mae()
    0.0
    >>> y = np.array([5, 5, 6, 6])
    >>> y_hat = np.array([5, 6, 7, 8])
    >>> me = MeanError(y, y_hat)
    >>> me.get_mse()
    1.5
    >>> me.get_mae()
    1.0
    """
    ### BEGIN SOLUTION
    def __init__(self, y, y_hat):
        self._y = y
        self._y_hat = y_hat
    def get_error(self):
        return self._y - self._y_hat
    def get_mse(self):
        error = self.get_error()
        se = error**2
        mse = np.sum(se) / error.size
        return mse
    def get_mae(self):
        error = self.get_error()
        ae = np.absolute(error)
        mae = np.sum(ae) / error.size
        return mae
    ### END SOLUTION
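
As a worked check of the two formulas, the sketch below recomputes the second docstring example step by step, using `np.mean` in place of the explicit sum and division.

```python
import numpy as np

y = np.array([5, 5, 6, 6])
y_hat = np.array([5, 6, 7, 8])
error = y - y_hat              # [ 0, -1, -1, -2]
mse = np.mean(error ** 2)      # (0 + 1 + 1 + 4) / 4
mae = np.mean(np.abs(error))   # (0 + 1 + 1 + 2) / 4
print(mse, mae)                # 1.5 1.0
```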

## 07. Define a function named `get_confusion_matrix` which generates a `(2, 2)` confusion matrix for 2 equal-length arrays. The numbers of true negatives, true positives, false negatives, and false positives are placed at `[0, 0]`, `[1, 1]`, `[1, 0]`, and `[0, 1]` respectively.

Source: <https://en.wikipedia.org/wiki/Confusion_matrix>

- Expected inputs: `np.ndarray`
- Expected outputs: `np.ndarray`

In [8]:
def get_confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """
    >>> np.random.seed(0)
    >>> y = np.random.randint(0, 2, size=100)
    >>> np.random.seed(1)
    >>> y_hat = np.random.randint(0, 2, size=100)
    >>> get_confusion_matrix(y, y_hat)
    array([[21, 23],
           [24, 32]])
    >>> np.random.seed(2)
    >>> y = np.random.randint(0, 2, size=100)
    >>> np.random.seed(3)
    >>> y_hat = np.random.randint(0, 2, size=100)
    >>> get_confusion_matrix(y, y_hat)
    array([[27, 28],
           [23, 22]])
    """
    ### BEGIN SOLUTION
    n_tn = 0
    n_tp = 0
    n_fn = 0
    n_fp = 0
    for y_true_i, y_pred_i in zip(y_true, y_pred):
        if y_true_i == 0 and y_pred_i == 0:
            n_tn += 1
        elif y_true_i == 1 and y_pred_i == 1:
            n_tp += 1
        elif y_true_i == 1 and y_pred_i == 0:
            n_fn += 1
        elif y_true_i == 0 and y_pred_i == 1:
            n_fp += 1
    cm = np.array([[n_tn, n_fp],
                   [n_fn, n_tp]])
    return cm
    ### END SOLUTION
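
The layout above puts the true label on the row index and the predicted label on the column index, so the counting loop can also be written without explicit branches. This is a minimal sketch, assuming both arrays hold only the integer labels 0 and 1; the helper name `confusion_matrix_vectorized` is hypothetical.

```python
import numpy as np


def confusion_matrix_vectorized(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    # Hypothetical helper; assumes both arrays contain only the integer labels 0 and 1.
    cm = np.zeros((2, 2), dtype=int)
    # Row index = true label, column index = predicted label.
    # np.add.at accumulates correctly when the same cell is indexed repeatedly.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])
print(confusion_matrix_vectorized(y_true, y_pred))
```

In this toy input there are 2 true negatives, 1 false positive, 1 false negative, and 2 true positives.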

## 08. Define a function named `import_all_time_olympic_medals` that imports `all_time_olympic_medals.csv` from the working directory.

Source: <https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table>

- Expected inputs: None.
- Expected outputs: `pd.DataFrame`

In [9]:
def import_all_time_olympic_medals() -> pd.DataFrame:
    """
    >>> all_time_olympic_medals = import_all_time_olympic_medals()
    >>> type(all_time_olympic_medals)
    pandas.core.frame.DataFrame
    >>> all_time_olympic_medals.shape
    (157, 17)
    """
    ### BEGIN SOLUTION
    df = pd.read_csv("all_time_olympic_medals.csv")
    return df
    ### END SOLUTION

## 09. Define a function named `find_taiwan_from_olympic_medals` that retrieves the data of Taiwan from `all_time_olympic_medals.csv` in the working directory.

PS: Taiwan might not be listed as "Taiwan" in the Olympic data.

Source: <https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table>

- Expected inputs: None.
- Expected outputs: `pd.DataFrame`

In [10]:
def find_taiwan_from_olympic_medals() -> pd.DataFrame:
    """
    >>> taiwan_from_olympic_medals = find_taiwan_from_olympic_medals()
    >>> type(taiwan_from_olympic_medals)
    pandas.core.frame.DataFrame
    >>> taiwan_from_olympic_medals.shape
    (1, 17)
    """
    ### BEGIN SOLUTION
    df = pd.read_csv("all_time_olympic_medals.csv")
    return df[df["team_ioc"] == "TPE"]
    ### END SOLUTION

## 10. Define a function named `find_the_king_of_summer_winter_olympics` that retrieves the data of the countries that won the most gold medals in the Summer and Winter Olympics, respectively, from `all_time_olympic_medals.csv` in the working directory.

Source: <https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table>

- Expected inputs: None.
- Expected outputs: `pd.DataFrame`

In [11]:
def find_the_king_of_summer_winter_olympics() -> pd.DataFrame:
    """
    >>> the_king_of_summer_winter_olympics = find_the_king_of_summer_winter_olympics()
    >>> type(the_king_of_summer_winter_olympics)
    pandas.core.frame.DataFrame
    >>> the_king_of_summer_winter_olympics.shape
    (2, 17)
    """
    ### BEGIN SOLUTION
    df = pd.read_csv("all_time_olympic_medals.csv")
    # Exclude the aggregate "Totals" row so it cannot be mistaken for a country.
    df_without_total = df[df["team_name"] != "Totals"]
    max_summer_golds = df_without_total["no_summer_golds"].max()
    max_winter_golds = df_without_total["no_winter_golds"].max()
    # Boolean masks marking the rows that hold each maximum.
    is_max_summer_golds = (df_without_total["no_summer_golds"] == max_summer_golds).values
    is_max_winter_golds = (df_without_total["no_winter_golds"] == max_winter_golds).values
    # Keep a row if it tops either the summer or the winter gold count.
    condition = np.logical_or(is_max_summer_golds, is_max_winter_golds)
    return df_without_total[condition]
    ### END SOLUTION
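
The same selection can be sketched with `idxmax`, shown here on a small in-memory frame so it runs without the CSV file. The team names and medal counts are hypothetical toy values; only the column names come from the solution above.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the solution above.
toy = pd.DataFrame({
    "team_name": ["Alpha", "Beta", "Gamma", "Totals"],
    "no_summer_golds": [10, 3, 7, 20],
    "no_winter_golds": [1, 8, 2, 11],
})
without_totals = toy[toy["team_name"] != "Totals"]
# idxmax returns the index label of the first maximum in each column.
top_labels = without_totals[["no_summer_golds", "no_winter_golds"]].idxmax()
# unique() avoids a duplicated row if one team leads both columns.
kings = without_totals.loc[top_labels.unique()]
print(kings)
```

Unlike the boolean-mask solution, `idxmax` keeps only the first row when several teams tie for a maximum, so the two approaches can differ on tied data.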