# Notes

The exam contains 6 problems. Most of them are of intermediate complexity and follow the material from class or graded assignments. Note that no loops are allowed in this exam: any solution containing loops will be graded as 0.

For this exam you'll need the [Titanic](https://www.kaggle.com/c/titanic) and [road accidents](https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales) datasets.

In [2]:
%pylab inline
plt.style.use("bmh")

Populating the interactive namespace from numpy and matplotlib


In [3]:
plt.rcParams["figure.figsize"] = (6,6)

In [4]:
import numpy as np
import pandas as pd
import torch

In [5]:
STUDENT = "Asaf Dahan"
ASSIGNMENT = "exam"
TEST = False

In [5]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 10

# NumPy

### 1. Filtering array (2 points).

Clip array values according to the following:

- given a two-dimensional array `arr` and a threshold value `max_val`,
- find the rows whose sum is `> max_val`,
- and replace the largest value in each of those rows: `v` $\rightarrow$ `v - <row sum> + max_val`.

For example, consider the following array and threshold `max_val=8`:

In [6]:
a = np.array([[1, 5, 4], [-3, 2, 8]])
a

array([[ 1,  5,  4],
       [-3,  2,  8]])

Row sums are:

In [7]:
a.sum(axis=1)

array([10,  7])

Since the sum of row `0` is `> max_val`, the largest value in that row (`a[0, 1]`, which is `5`) must be replaced with `5 - 10 + 8 = 3`, resulting in:

In [8]:
a_clipped = np.array([[1, 3, 4], [-3, 2, 8]])
a_clipped

array([[ 1,  3,  4],
       [-3,  2,  8]])

#### Notes:

- **do not change original array**,
- in this problem you may need to use **boolean and fancy indexing**, as well as `arr.argmax(...)`,
- you **cannot use loops**,
- the input array can have **any two-dimensional shape** (including `(N,1)` and `(1,N)`) and is filled with **random integers**,
- there may be no rows that satisfy the threshold condition, in which case the resulting array must be identical to the input array.
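
As an illustration of the indexing tools mentioned in the notes, here is a minimal sketch on the example array from above. The names `over`, `rows`, `cols` and `result` are hypothetical helpers introduced only for this sketch, and it is one possible way to combine a boolean mask with fancy indexing, not the reference solution.

```python
import numpy as np

arr = np.array([[1, 5, 4], [-3, 2, 8]])
max_val = 8

row_sums = arr.sum(axis=1)        # sum of every row
over = row_sums > max_val         # boolean mask of rows above the threshold
rows = np.flatnonzero(over)       # integer indices of those rows
cols = arr[rows].argmax(axis=1)   # column of the largest value in each such row

result = arr.copy()               # keep the original array untouched
# Fancy indexing with (rows, cols) addresses exactly one element per selected row
result[rows, cols] += max_val - row_sums[rows]
print(result)
```

When no row exceeds the threshold, `rows` and `cols` are empty and the copy is returned unchanged.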

In [9]:
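def clip_array(arr, max_val):
    """Clip array based on `max_val`."""
    tmp = arr.copy()  # work on a copy so the original array is unchanged
    row_sums = tmp.sum(axis=1)
    maxes = tmp.max(axis=1)
    # Rows over the threshold get v - <row sum> + max_val, other rows keep their maximum
    replacers = np.where(row_sums > max_val, maxes - row_sums + max_val, maxes)
    # Fancy indexing: write the replacement at the argmax position of every row
    tmp[np.arange(tmp.shape[0]), tmp.argmax(axis=1)] = replacers
    return tmp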

In [10]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, clip_array)

In [30]:
# Personal test - please ignore
if not TEST:
    print(a, "\n\n")
    print(clip_array(a, 8), "\n\n")
    
    print(a.reshape(6,1), "\n\n")
    print(clip_array(a.reshape(6,1), 7), "\n\n")
    
    print(a.reshape(1,6), "\n\n")
    print(clip_array(a.reshape(1,6), 7))

[[ 1  5  4]
 [-3  2  8]] 


[[ 1  3  4]
 [-3  2  8]] 


[[ 1]
 [ 5]
 [ 4]
 [-3]
 [ 2]
 [ 8]] 


[[ 1]
 [ 5]
 [ 4]
 [-3]
 [ 2]
 [ 7]] 


[[ 1  5  4 -3  2  8]] 


[[ 1  5  4 -3  2 -2]]


### 2. Calculate area (1 point).

In this problem you will construct a naive Monte-Carlo simulator. Provided with a 2D bounding box, you must calculate its area:

- a bounding box is specified by the minimum and maximum `x` and `y`, i.e. a bounding box is a **rectangle** between `minx` and `maxx` along the `x`-axis and between `miny` and `maxy` along the `y`-axis,
- all of `minx`, `maxx`, `miny`, `maxy` are `>=0` and `<=1`,
- you can sample **at most** `n_samples` points on the 2D plane,
- the ratio of the number of points inside the bounding box to the total number of points is an **estimate of the bounding box area**,
- the estimate is considered valid if it is **no more than 10% off the actual area value**,
- `n_samples` is chosen in such a way that a **10% error is achievable nearly always**, i.e. the chance of getting more than 10% error with a correct computation is negligibly small.

For example, take the bounding box `minx=0.25`, `maxx=0.5`, `miny=0.1`, `maxy=0.6`. Its actual area is `0.125`. Suppose that we sample `10000` points in the unit square $x \in [0, 1],\,y \in [0, 1]$ and 1215 of them fall inside the bounding box. Then the estimate of the bounding box area is `0.1215` (an error of about 2.8%). The image below illustrates this example.

![Monte-Carlo integration example](mc.png)
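
The arithmetic of this example can be checked with a short sketch. The helper names `relative_error` and `relative_std` are hypothetical. The second helper assumes that the sampled points are independent and uniform in the unit square, so that the number of hits follows a binomial distribution; this assumption is not stated in the problem.

```python
import math

def relative_error(estimate, actual):
    """Relative deviation of the estimate from the actual value."""
    return abs(estimate - actual) / actual

def relative_std(p, n):
    """Relative standard deviation of a hit ratio for n independent samples
    with hit probability p (binomial assumption)."""
    return math.sqrt((1.0 - p) / (n * p))

actual = (0.5 - 0.25) * (0.6 - 0.1)   # area of the example bounding box
estimate = 1215 / 10000               # hit ratio from the example

rel_err = relative_error(estimate, actual)
rel_std = relative_std(actual, 10000)
print(f"relative error: {rel_err:.1%}")
print(f"expected relative spread: {rel_std:.1%}")
```

Under that assumption the typical relative spread for this box is roughly 2.6%, which gives a feeling for why 10000 samples are comfortably within the 10% tolerance here.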

In [12]:
def calc_area(minx, maxx, miny, maxy, n_samples):
    """Calculate area of bounding box."""
    # Sample points uniformly in the unit square
    sampled_points = np.random.random_sample(size=(n_samples, 2))
    x, y = sampled_points[:, 0], sampled_points[:, 1]

    # Boolean mask of the points that fall inside the bounding box
    in_box = (x >= minx) & (x <= maxx) & (y >= miny) & (y <= maxy)

    # The fraction of points inside the box is the area estimate
    return np.count_nonzero(in_box) / n_samples


In [13]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, calc_area)

In [14]:
# Personal test - please ignore
if not TEST:
    print(f"Expected: {0.25 * 0.5}")
    print(f"Estimate 1: {calc_area(0.25, 0.5, 0.1, 0.6, 10000)}")
    print(f"Estimate 2: {calc_area(0.25, 0.5, 0.1, 0.6, 10000)}")
    print(f"Estimate 3: {calc_area(0.25, 0.5, 0.1, 0.6, 10000)}")

Expected: 0.125
Estimate 1: 0.1274
Estimate 2: 0.122
Estimate 3: 0.1223


### 3. Find outliers (3 points).

Given an array of shape `(N,2)`, filter out all the rows that are more than `thr` away from the other rows. The distance metric is Euclidean, i.e. the distance between rows `i` and `j` is (in pseudocode):

```
distance(i, j) = sqrt(square(arr[i, 0] - arr[j, 0]) + square(arr[i, 1] - arr[j, 1]))
```

The distance of row `i` from the other rows is the mean of its distances to them:

```
distance(i) = mean(distance(i, j)), j!=i
```

Rows with `distance(i) > thr` must be filtered out. In this problem you **cannot use loops**. Instead, use broadcasting (recall the recurrence matrix problem in GA-2 and extend it to the two-dimensional case).
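
To make the broadcasting idea concrete, here is a minimal sketch on three hand-picked points. The array `pts` and the threshold `6.0` are made up for illustration, and this is one possible formulation rather than the expected solution.

```python
import numpy as np

pts = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]])

# (N, 1, 2) - (1, N, 2) broadcasts to (N, N, 2): all pairwise coordinate differences
diffs = pts[:, np.newaxis, :] - pts[np.newaxis, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))   # (N, N) matrix of Euclidean distances

# distance(i, i) is 0, so the row sum divided by (N - 1) is the mean over j != i
mean_dist = dist.sum(axis=1) / (len(pts) - 1)

kept = pts[mean_dist <= 6.0]               # rows that are not outliers
print(mean_dist)
print(kept)
```

Here the pairwise distances are 5, 8 and 5, so the mean distances are 6.5, 5.0 and 6.5, and only the middle point survives the threshold of 6.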

As an example, consider 1000 samples from a standard normal distribution for `x` (axis 1) and `y` (axis 0) and a threshold of 2:

![Outliers filtering](outliers.png)

In [15]:
def find_outliers(arr, thr):
    """Find outliers."""

    # Find the number of samples
    points_count = arr.shape[0]

    # Pairwise differences via broadcasting: (N, 1, 2) - (1, N, 2) -> (N, N, 2)
    diffs = arr[:, np.newaxis, :] - arr[np.newaxis, :, :]
    # Euclidean distance matrix of shape (N, N)
    dist_matrix = np.sqrt(np.square(diffs).sum(axis=2))

    # distance(i) = mean(distance(i, j)), j != i
    # distance(i, i) is 0, so dividing the row sum by (points_count - 1)
    # evaluates the mean over j != i
    mean_dist = dist_matrix.sum(axis=1) / (points_count - 1)

    # Return the array without outliers
    return arr[~(mean_dist > thr)]

In [16]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_outliers)

In [17]:
# Personal test - please ignore
if not TEST:
    n = 10
    thr = 0.5
    arr = np.random.random_sample((n, 2))
    print(f"Array:\n{arr}\n")
    
    for r1 in arr:
        dists = 0
        print(f"Current point - {r1}")
        print(f"Distances:")
        for r2 in arr:
            dist = np.sqrt((r1[0] - r2[0])**2 + (r1[1] - r2[1])**2)
            print(f"Dist from {r2} - {dist}")            
            dists += dist
        print(f"Avg distance - {dists / (n-1)}")
        if (dists / (n-1)) > thr:
            print(f"{r1} is an outlier.\n")
        else:
            print(f"{r1} is not an outlier.\n")
        
    print(f"\nResult array:\n{find_outliers(arr, thr)}\n")

Array:
[[0.79337306 0.21029466]
 [0.06411658 0.76081864]
 [0.06063078 0.25195854]
 [0.75269828 0.76945336]
 [0.69995353 0.41157123]
 [0.73956342 0.99536732]
 [0.60139573 0.51776686]
 [0.84683245 0.62073325]
 [0.56215929 0.55135243]
 [0.01374006 0.45497757]]

Current point - [0.79337306 0.21029466]
Distances:
Dist from [0.79337306 0.21029466] - 0.0
Dist from [0.06411658 0.76081864] - 0.9137240570764532
Dist from [0.06063078 0.25195854] - 0.7339258287127035
Dist from [0.75269828 0.76945336] - 0.5606361405342877
Dist from [0.69995353 0.41157123] - 0.2218996673040353
Dist from [0.73956342 0.99536732] - 0.7869145817941781
Dist from [0.60139573 0.51776686] - 0.3624837256570277
Dist from [0.84683245 0.62073325] - 0.41390547543547807
Dist from [0.56215929 0.55135243] - 0.41204393915225984
Dist from [0.01374006 0.45497757] - 0.8171274885447677
Avg distance - 0.5802956560234657
[0.79337306 0.21029466] is an outlier.

Current point - [0.06411658 0.76081864]
Distances:
Dist from [0.79337306 0.2102

# PyTorch

### 4. Simple derivative (1 point).

Given some value of `x0`, calculate the derivative of the sigmoid function at that point. The input is a single floating point value. The output must also be a single floating point value (not a tensor!) equal to the derivative of $\sigma(x)$ at `x0`.

Do not use the exact formula for the derivative, but use PyTorch `.backward()`.
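
An autograd result can be sanity-checked without the exact formula by comparing it to a central finite difference. The sketch below is such an independent check in plain Python, not a solution to the problem; the helper name `numeric_derivative` and the step `h=1e-5` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x0, h=1e-5):
    """Central finite difference approximation of f'(x0)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

approx = numeric_derivative(sigmoid, 0.0)
print(approx)  # close to 0.25, the slope of the sigmoid at 0
```

A value returned by a `.backward()`-based implementation should agree with this approximation to several decimal places.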

In [18]:
def d_sigmoid(x0):
    """Derivative of sigmoid."""
    t = torch.tensor(float(x0), requires_grad=True)
    sig = torch.sigmoid(t)
    sig.backward()
    return t.grad.item()

In [19]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, d_sigmoid)

In [32]:
# Personal test - please ignore
if not TEST:
    samples = np.random.random_sample(10)
    
    from scipy.special import expit as sigmoid
    def sigmoid_grad(x):
        fx = sigmoid(x)
        return fx * (1 - fx)
    
    for i in samples:
        print(f"Expected value [{sigmoid_grad(i)}] - Func value [{d_sigmoid(i)}]")

Expected value [0.24174272434710267] - Func value [0.24174273014068604]
Expected value [0.2203676488485075] - Func value [0.2203676402568817]
Expected value [0.21691770324671786] - Func value [0.21691769361495972]
Expected value [0.24987886855075614] - Func value [0.2498788684606552]
Expected value [0.20168431122313693] - Func value [0.20168432593345642]
Expected value [0.24534878967613902] - Func value [0.24534879624843597]
Expected value [0.19714916179863964] - Func value [0.19714917242527008]
Expected value [0.19796580234424177] - Func value [0.1979658156633377]
Expected value [0.2307064038808568] - Func value [0.23070639371871948]
Expected value [0.22249340294103612] - Func value [0.22249341011047363]


# Pandas

### 5. Ratio of males travelling alone per class (1 point).

Given the Titanic dataset, calculate the ratio of males travelling alone (`SibSp==0` and `Parch==0`) per class. In other words, calculate the number of males travelling alone in each class, divided by the number of passengers in that class.

Input is indexed with `PassengerId` and is a concatenation of train and test sets. Output must be a series, indexed by class, containing the requested ratios.
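
The requested ratio can be illustrated on a tiny made-up table. The frame `toy` below is hypothetical data with the same column names as the Titanic set, and taking the mean of a boolean mask per group is just one possible way to get "count of matches divided by group size".

```python
import pandas as pd

# Hypothetical passengers, not taken from the real dataset
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 2, 3],
        "Sex": ["male", "female", "male", "male", "female", "male"],
        "SibSp": [0, 0, 1, 0, 0, 0],
        "Parch": [0, 0, 0, 0, 0, 2],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="PassengerId"),
)

# True for males with no siblings/spouses and no parents/children aboard
alone_male = (toy.Sex == "male") & (toy.SibSp == 0) & (toy.Parch == 0)

# Mean of a boolean mask per class = lone males / passengers in that class
ratio = alone_male.groupby(toy.Pclass).mean()
print(ratio)
```

With this formulation a class without any lone males shows up with a ratio of 0 instead of being absent from the result.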

In [21]:
def lone_males(df):
    """Calculate ratio of males travelling alone per class."""
    # Count lone males and total passengers per class
    lone_male = df[(df.SibSp==0) & (df.Parch==0) & (df.Sex=="male")].reset_index().\
                                groupby('Pclass').PassengerId.nunique().reset_index()
    total_passengers = df.reset_index().groupby('Pclass').PassengerId.nunique().reset_index()
    # Merge based on class and return the ratio
    res_df = lone_male.merge(total_passengers, on="Pclass", suffixes=['_lone', '_total'])
    res_df.set_index('Pclass', inplace=True)
    return (res_df.PassengerId_lone / res_df.PassengerId_total)

In [22]:
PROBLEM_ID = 5

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, lone_males)

In [23]:
# Personal test - please ignore
if not TEST:
    df_train = pd.read_csv('./train.csv')
    df_test = pd.read_csv('./test.csv')
    df = pd.concat([df_train, df_test])
    df.set_index('PassengerId', inplace=True)
    lone_male = df[(df.SibSp==0) & (df.Parch==0) & (df.Sex=="male")].reset_index().\
                                groupby('Pclass').PassengerId.nunique().reset_index()
    total_passengers = df.reset_index().groupby('Pclass').PassengerId.nunique().reset_index()
    res_df = lone_male.merge(total_passengers, on="Pclass", suffixes=['_lone', '_total'])
    res_df.set_index('Pclass', inplace=True)
    print(f"Lone males per class:\n{lone_male}\n")
    print(f"Total per class:\n{total_passengers}\n")
    print(f"Merged DF:\n{res_df}\n")
    print(f"Function result:\n{lone_males(df)}")

Lone males per class:
   Pclass  PassengerId
0       1          108
1       2          116
2       3          372

Total per class:
   Pclass  PassengerId
0       1          323
1       2          277
2       3          709

Merged DF:
        PassengerId_lone  PassengerId_total
Pclass                                     
1                    108                323
2                    116                277
3                    372                709

Function result:
Pclass
1    0.334365
2    0.418773
3    0.524683
dtype: float64


### 6. Worst days on UK roads in 2005 (2 points).

Calculate the top 5 days with the largest number of severe accidents (`Accident_Severity < 3`).

The input is a **dataframe** containing all the accidents in 2005 and the following columns: `date_time` (constructed in the same way as in the optional time series notebook) and `Accident_Severity`. The index is a default integer index. The result must be a list (or tuple) of dates (as `pd.Timestamp`) with 5 elements.
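
A small made-up frame shows the shape of the computation. The data in `toy` is hypothetical, the sketch takes the top 2 days instead of 5 to keep it short, and `dt.normalize()` is one way to truncate timestamps to midnight while keeping them as `pd.Timestamp` values.

```python
import pandas as pd

# Hypothetical accidents, not taken from the real dataset
toy = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            [
                "2005-01-01 08:00",
                "2005-01-01 17:30",
                "2005-01-02 09:15",
                "2005-01-02 10:00",
                "2005-01-02 23:00",
                "2005-01-03 12:00",
            ]
        ),
        "Accident_Severity": [1, 2, 3, 2, 3, 3],
    }
)

severe = toy[toy.Accident_Severity < 3]                   # keep severe accidents only
counts = severe.date_time.dt.normalize().value_counts()  # accidents per calendar day
top = counts.head(2).index.tolist()                       # days with the most severe accidents
print(top)
```

`value_counts()` sorts in descending order of count, so `head(k)` returns the `k` busiest days.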

In [24]:
def worst_days(df):
    """Calculate Top 5 most severe days."""
    tmp = df.copy()
    high_sev = tmp[tmp.Accident_Severity<3]
    top5 = high_sev.date_time.dt.date.value_counts().head(5)
    return top5.index.map(lambda x: pd.Timestamp(x)).tolist()

In [25]:
PROBLEM_ID = 6

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, worst_days)

In [26]:
# Personal test - please ignore
if not TEST:
    d = pd.read_csv('./accidents_2005_to_2007.csv')
    d.loc[:, 'dt'] = d.Date.str.cat(d.Time, sep=' ', na_rep='00:00')
    d.loc[:, 'date_time'] = pd.to_datetime(d.dt, dayfirst=True)



In [27]:
# Personal test - please ignore
if not TEST:
    tmp = d.copy()
    tmp.index = tmp.index.set_names(['acc_id'])
    tmp.reset_index(inplace=True)
    high_sev = tmp[tmp.Accident_Severity<3]
    top5 = high_sev.date_time.dt.date.value_counts().head(5)
    print(f"Dates value count -\n{top5}\n")
    print(f"Function result -\n{worst_days(d)}")

Dates value count -
2006-06-10    142
2006-09-09    118
2005-05-14    117
2005-06-18    115
2005-09-16    114
Name: date_time, dtype: int64

Function result -
[Timestamp('2006-06-10 00:00:00'), Timestamp('2006-09-09 00:00:00'), Timestamp('2005-05-14 00:00:00'), Timestamp('2005-06-18 00:00:00'), Timestamp('2005-09-16 00:00:00')]


In [28]:
if TEST:
    print(f"{STUDENT}: {int(100 * total_grade / MAX_POINTS)}")