# Notes

The exam contains 6 problems. Most of them are of intermediate complexity and follow the material from class or graded assignments. Note that no loops are allowed in this exam; any solution containing loops will be graded as 0.

For this exam you'll need the [Titanic](https://www.kaggle.com/c/titanic) and [road accidents](https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales) datasets.

In [1]:
%pylab inline
plt.style.use("bmh")

Populating the interactive namespace from numpy and matplotlib


In [2]:
plt.rcParams["figure.figsize"] = (6,6)

In [3]:
import numpy as np
import pandas as pd
import torch

In [4]:
STUDENT = "Daniel Volchegursky"
ASSIGNMENT = "exam"
TEST = False

In [5]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 10

# NumPy

### 1. Filtering array (2 points).

Clip array values according to the following:

- given a two-dimensional array `arr` and threshold value `max_val`,
- find the rows whose sum of values is `> max_val`,
- and, in each of those rows, replace the largest value `v` with `v - <row sum> + max_val`.

For example, consider the following array and threshold `max_val=8`:

In [6]:
a = np.array([[1, 5, 4], [-3, 2, 8]])
a

array([[ 1,  5,  4],
       [-3,  2,  8]])

Row sums are:

In [7]:
a.sum(axis=1)

array([10,  7])

Since the sum of row `0` is `> max_val`, the largest value in that row (`a[0, 1]`, which is `5`) must be replaced with `5 - 10 + 8 = 3`, resulting in:

In [8]:
a_clipped = np.array([[1, 3, 4], [-3, 2, 8]])
a_clipped

array([[ 1,  3,  4],
       [-3,  2,  8]])
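
The worked example above can be traced step by step with boolean and fancy indexing. This is only a minimal sketch on the example array; the intermediate names (`row_sums`, `over`, `rows`, `cols`, `clipped`) are chosen here for illustration.

```python
import numpy as np

a = np.array([[1, 5, 4], [-3, 2, 8]])
max_val = 8

row_sums = a.sum(axis=1)        # sum of every row
over = row_sums > max_val       # boolean mask of rows above the threshold
rows = np.flatnonzero(over)     # indices of the selected rows
cols = a[rows].argmax(axis=1)   # column of the largest value in each selected row

clipped = a.copy()              # leave the original array untouched
# Fancy indexing pairs rows[k] with cols[k], so one element per selected row is updated.
clipped[rows, cols] -= row_sums[rows] - max_val
print(clipped)
```

Pairing a row selector with a per-row column index is what lets a single assignment touch exactly one element in each selected row.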

#### Notes:

- **do not change original array**,
- in this problem you may need to use **boolean and fancy indexing**, as well as `arr.argmax(...)`,
- you **cannot use loops**,
- the input array has **any two-dimensional shape** (including `(N,1)` and `(1,N)`) and is filled with **random integers**,
- there may be no rows that satisfy the threshold condition; in that case the resulting array must be identical to the input array.

In [9]:
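def clip_array(arr, max_val):
    """Clip array based on `max_val`."""
    result = arr.copy()  # work on a copy so the input array is left unchanged
    row_sums = arr.sum(axis=1)
    mask = row_sums > max_val  # rows whose sum exceeds the threshold
    indices = arr.argmax(axis=1)  # column of the largest value in each row
    masked_indices = indices[mask]
    print('masked_indices=', masked_indices)
    print('result[mask,masked_indices]=', result[mask, masked_indices])
    print('result.sum(axis=1)-max_val=', row_sums - max_val)
    # Lower the largest value of each selected row by the excess over `max_val`.
    result[mask, masked_indices] -= row_sums[mask] - max_val
    return result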

In [10]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, clip_array)

In [11]:
if not TEST:
    a1 = np.array([[1, 5, 6], [-3, 2, 8]])
    threshold1 = 9
    result1 = clip_array(a1, threshold1)
    print('result1 =', result1)
    a2 = np.array([[1, 5, 6], [-3, 2, 8]])
    threshold2 = 20
    result2 = clip_array(a2, threshold2)
    print('result2 =', result2)
    a3 = np.array([[1, 5, 6], [-3, 2, 8]])
    threshold3 = 4
    result3 = clip_array(a3, threshold3)
    print('result3 =', result3)
    a4 = np.array([[1, 2, 3]]).T
    threshold4 = 4
    result4 = clip_array(a4, threshold4)
    print('result4 =', result4)
    a5 = np.array([[1, 2, 3]])
    threshold5 = 4
    result5 = clip_array(a5, threshold5)
    print('result5 =', result5)


masked_indices= [2]
result[mask,masked_indices]= [6]
result.sum(axis=1)-max_val= [ 3 -2]
result1 = [[ 1  5  3]
 [-3  2  8]]
masked_indices= []
result[mask,masked_indices]= []
result.sum(axis=1)-max_val= [ -8 -13]
result2 = [[ 1  5  6]
 [-3  2  8]]
masked_indices= [2 2]
result[mask,masked_indices]= [6 8]
result.sum(axis=1)-max_val= [8 3]
result3 = [[ 1  5 -2]
 [-3  2  5]]
masked_indices= []
result[mask,masked_indices]= []
result.sum(axis=1)-max_val= [-3 -2 -1]
result4 = [[1]
 [2]
 [3]]
masked_indices= [2]
result[mask,masked_indices]= [3]
result.sum(axis=1)-max_val= [2]
result5 = [[1 2 1]]


### 2. Calculate area (1 point).

In this problem you will construct a naive Monte-Carlo simulator. Provided with a 2D bounding box, you must calculate its area:

- a bounding box is specified by maximum and minimum `x` and `y`, i.e. a bounding box is a **rectangle** between `minx` and `maxx` over `x`-axis and between `miny` and `maxy` over `y`-axis,
- all of `minx`, `maxx`, `miny`, `maxy` are `>=0` and `<=1`,
- you can sample **at most** `n_samples` points on the 2D plane,
- ratio of number of points inside a bounding box to total number of points is an **estimate of bounding box area**,
- an estimate is considered valid if it is **no more than 10% off the actual area value**,
- `n_samples` is chosen in such a way that **10% error is achievable nearly always**, i.e. the chance of getting more than 10% error with a correct computation is negligibly small.

For example, take a bounding box with `minx=0.25`, `maxx=0.5`, `miny=0.1`, `maxy=0.6`. Its actual area is `0.125`. Suppose that we sample `10000` points in the unit square $x \in [0, 1],\,y \in [0, 1]$ and 1215 of them fall inside the bounding box. Then the estimate for the bounding box area is `0.1215` (an error of about 2.8%). The image below illustrates this example.

![Monte-Carlo integration example](mc.png)
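
The arithmetic of this example, plus a rough feel for why 10% is a comfortable tolerance, can be sketched in a few lines. The helper names `relative_error` and `relative_std_error` are hypothetical, and the standard-error formula assumes the points are drawn uniformly and independently from the unit square, so that the number of hits is binomial.

```python
import math

def relative_error(estimate, actual):
    """Relative deviation of an estimate from the true value."""
    return abs(estimate - actual) / actual

def relative_std_error(area, n_samples):
    """Relative standard deviation of the hit ratio (assumes binomial hit counts)."""
    return math.sqrt((1.0 - area) / (area * n_samples))

actual = (0.5 - 0.25) * (0.6 - 0.1)   # true area of the example box
estimate = 1215 / 10000               # hit ratio from the example

print(relative_error(estimate, actual))
print(relative_std_error(actual, 10000))
```

Under these assumptions the relative standard error for this box at `10000` samples is below 3%, so a 10% deviation is several standard deviations away.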

In [12]:
def calc_area(minx, maxx, miny, maxy, n_samples):
    """Calculate area of bounding box."""
    pts = np.random.random_sample((n_samples,2))
    filtered = pts[(pts[:,0] >= minx) & (pts[:,1] >= miny) & (pts[:,0] <= maxx) & (pts[:,1] <= maxy)]
    estimate = len(filtered)/n_samples
    return estimate

In [13]:
if not TEST:
    samples=50000
    area = calc_area(0.25, 0.5, 0.1, 0.6, samples)
    print('area = ', area)

area =  0.12434


In [14]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, calc_area)

### 3. Find outliers (3 points).

Given an array of shape `(N,2)`, filter out all the rows which are more than `thr` away from the other rows. The distance metric is Euclidean, i.e. the distance between rows `i` and `j` is (in pseudocode):

```
distance(i, j) = sqrt(square(arr[i, 0] - arr[j, 0]) + square(arr[i, 1] - arr[j, 1]))
```

Distance of row `i` from other rows is:

```
distance(i) = mean(distance(i, j)), j!=i
```

Rows with `distance(i) > thr` must be filtered out. In this problem you **cannot use loops**. Instead, use broadcasting (recall the recurrence matrix problem in GA-2 and extend it to the two-dimensional case).

As an example, consider 1000 samples from the standard normal distribution for `x` (axis 1) and `y` (axis 0) and a threshold of 2:

![Outliers filtering](outliers.png)
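
The broadcasting hint can be illustrated on three hand-picked points. This is a minimal sketch, not the reference solution: `mean_distances` is a hypothetical helper, and it assumes at least two rows. Because `distance(i, i)` is zero, dividing each row sum by `N - 1` gives the mean over `j != i` directly.

```python
import numpy as np

def mean_distances(arr):
    """Mean Euclidean distance from each row to all other rows (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    # (N, 1, 2) - (1, N, 2) broadcasts to (N, N, 2): one difference vector per row pair.
    diff = arr[:, np.newaxis, :] - arr[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (N, N) matrix of pairwise distances
    # Self-distances are 0, so the row sum already excludes j == i.
    return dist.sum(axis=1) / (len(arr) - 1)

pts = np.array([[0, 0], [3, 4], [6, 8]])
d = mean_distances(pts)
print(d)
print(pts[d <= 6])   # rows kept for a threshold of 6
```

The pairwise distances here are 5, 10 and 5, so the middle point has the smallest mean distance and is the only one kept at a threshold of 6.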

In [15]:
def find_outliers(arr, thr):
    """Find outliers."""
    # Reshape to (N, 1, 2) so that the subtraction broadcasts to all row pairs: (N, N, 2).
    aux = arr.reshape(np.prod(arr.shape[:-1]), 1, arr.shape[-1])
    diff = arr - aux
    distance_matrix = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
    # Exclude only the self-distances; duplicate rows (distance 0) still count.
    np.fill_diagonal(distance_matrix, np.nan)
    avg = np.nanmean(distance_matrix, axis=1)
    mask = avg <= thr
    result = arr[mask]
    if not TEST:
        print('distance_matrix:', *distance_matrix, sep='\n')
    return result

In [16]:
if not TEST:
    arr=np.array([[1,2],[3,4],[5,6]])
    threshold=3
    result=find_outliers(arr,threshold)
    print ('result:', *result, sep='\n')

distance_matrix:
[       nan 2.82842712 5.65685425]
[2.82842712        nan 2.82842712]
[5.65685425 2.82842712        nan]
result:
[3 4]


In [17]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_outliers)

# PyTorch

### 4. Simple derivative (1 point).

Given some value of `x0`, calculate the derivative of the sigmoid function at that point. Input is a single floating point value. Output must also be a single floating point value (not a tensor!) equal to the derivative of $\sigma(x)$ at `x0`.

Do not use the exact formula for the derivative, but use PyTorch `.backward()`.

In [18]:
def d_sigmoid(x0):
    """Derivative of sigmoid."""
    # `torch.autograd.Variable` is deprecated; a tensor with `requires_grad=True` tracks gradients.
    x = torch.tensor([x0], requires_grad=True)
    sigmoid = 1/(1 + (-x).exp())
    sigmoid.backward()
    return x.grad.item()

In [19]:
if not TEST:
    x0=2.0
    result = d_sigmoid(x0)
    print("result =", result)

result = 0.10499356687068939
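
As a sanity check on the printed value (not a replacement for the `.backward()` solution, which the task requires), the closed-form derivative follows from the definition $\sigma(x) = 1/(1+e^{-x})$:

$$
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr),
\qquad
\sigma'(2) \approx 0.880797 \times 0.119203 \approx 0.104994 .
$$

This agrees with the autograd result above to within single-precision rounding.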


In [20]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, d_sigmoid)

# Pandas

### 5. Ratio of males travelling alone per class (1 point).

Given the Titanic dataset, calculate the ratio of males travelling alone (`SibSp==0` and `Parch==0`) per class. In other words, calculate the number of males travelling alone in each class, divided by the number of passengers in that class.

Input is indexed with `PassengerId` and is a concatenation of train and test sets. Output must be a series, indexed by class, containing the requested ratios.
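
One way to picture the requested output is on a tiny made-up table. The sketch below is only an illustration: the `toy` dataframe and its values are invented, and the ratio is computed as the per-class mean of a boolean mask, which is one of several equivalent formulations.

```python
import pandas as pd

# Invented passengers; only the columns needed for the ratio are included.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
        "SibSp": [0, 0, 0, 1, 0, 1, 0, 0],
        "Parch": [0, 0, 0, 0, 0, 0, 0, 2],
    },
    index=pd.Index(range(1, 9), name="PassengerId"),
)

# True for males with no siblings/spouses and no parents/children aboard.
is_lone_male = (toy["Sex"] == "male") & (toy["SibSp"] == 0) & (toy["Parch"] == 0)
# The mean of a boolean mask within each class is the share of lone males in that class.
ratios = is_lone_male.groupby(toy["Pclass"]).mean()
print(ratios)
```

With this formulation a class that has no lone males gets a ratio of `0.0`, whereas dividing two separate `groupby(...).size()` results would give `NaN` for such a class.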

In [21]:
def lone_males(df):
    """Calculate ratio of males travelling alone per class."""
    lonely_males=df[(df['Sex']=='male')& (df['Parch'] == 0) & (df['SibSp'] == 0)]
    result = lonely_males.groupby('Pclass').size() / df.groupby('Pclass').size()
    return result

In [22]:
if not TEST:
    url_test='https://raw.githubusercontent.com/dsindy/kaggle-titanic/master/data/test.csv'
    url_train='https://raw.githubusercontent.com/dsindy/kaggle-titanic/master/data/train.csv'
    titanic_train = pd.read_csv(url_train, index_col="PassengerId")
    titanic_test = pd.read_csv(url_test, index_col="PassengerId")
    titanic_test[(titanic_test['Sex']=='male')]
    titanic = pd.concat([titanic_train, titanic_test], sort=False)
    result = lone_males(titanic)
    print('type(result)=',type(result))
    print('result =', result)

type(result)= <class 'pandas.core.series.Series'>
result = Pclass
1    0.334365
2    0.418773
3    0.524683
dtype: float64


In [23]:
#     url_test='https://raw.githubusercontent.com/dsindy/kaggle-titanic/master/data/test.csv'
#     url_train='https://raw.githubusercontent.com/dsindy/kaggle-titanic/master/data/train.csv'
#     titanic_train = pd.read_csv(url_train, index_col="PassengerId")
#     titanic_test = pd.read_csv(url_test, index_col="PassengerId")
#     titanic = pd.concat([titanic_train, titanic_test], sort=False)
#     lonely_males=titanic[(titanic['Sex']=='male')& (titanic['Parch'] == 0) & (titanic['SibSp'] == 0)]
#     lonely_males.groupby('Pclass').size()/titanic.groupby('Pclass').size()
#     titanic.groupby('Pclass').size()

In [24]:
PROBLEM_ID = 5

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, lone_males)

### 6. Worst days on UK roads in 2005 (2 points).

Find the top 5 days with the largest number of severe accidents (`Accident_Severity < 3`).

Input is a **dataframe** containing all the accidents in 2005, with the following columns: `date_time` (constructed in the same way as in the optional time series notebook) and `Accident_Severity`. The index is a default integer index. The result must be a list (or tuple) of 5 dates (each a `pd.Timestamp`).
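
The expected shape of the answer can be illustrated on a handful of invented records. This is a hedged sketch rather than the graded solution: the `toy` data is made up, only the top 2 days are taken because the table is tiny, and counting via `dt.normalize()` plus `value_counts()` is just one possible loop-free approach.

```python
import pandas as pd

# Invented accidents over three days; severity 1 and 2 count as severe.
toy = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            ["2005-01-01 08:00", "2005-01-01 17:30", "2005-01-02 09:15",
             "2005-01-03 12:00", "2005-01-03 13:00", "2005-01-03 23:59"]
        ),
        "Accident_Severity": [1, 2, 3, 2, 2, 1],
    }
)

severe = toy[toy["Accident_Severity"] < 3]
# Drop the time of day, then count severe accidents per calendar date.
per_day = severe["date_time"].dt.normalize().value_counts()
# The index of the largest counts is a DatetimeIndex; tolist() yields pd.Timestamp objects.
top = per_day.nlargest(2).index.tolist()
print(top)
```

Unlike `resample('D')`, this variant never creates entries for days without severe accidents, which makes no difference for a top-N query.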

In [25]:
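def worst_days(df):
    """Calculate Top 5 most severe days."""
    # Re-index without `inplace=True` so the caller's dataframe is left unchanged.
    severe = df.set_index('date_time')
    severe = severe[severe['Accident_Severity'] < 3]
    # Count severe accidents per calendar day and keep the five largest counts.
    top5 = severe.resample('D').size().sort_values(ascending=False).head(5)
    # `DatetimeIndex.tolist()` yields `pd.Timestamp` objects, so no explicit loop is needed.
    return top5.index.tolist()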

In [26]:
if not TEST:
    %pylab inline
    plt.style.use('bmh')
    import pathlib
    import numpy as np
    import pandas as pd
    import seaborn as sns
    sns.set()

Populating the interactive namespace from numpy and matplotlib


In [27]:
if not TEST:
    DATA_DIR = pathlib.Path("./")
#     d = pd.read_csv(DATA_DIR.joinpath('accidents_2005_to_2007.csv.zip'),low_memory=False)
#     d.info(memory_usage="deep")
#     d.loc[:, 'dt'] = d.Date.str.cat(d.Time, sep=' ', na_rep='00:00')
#     d.loc[:, 'date_time'] = pd.to_datetime(d.dt, dayfirst=True)
#     d.set_index('date_time', inplace=True)
#     d.loc["2005"].to_csv(DATA_DIR.joinpath("exam_accidents_2005.csv"))
    accidents_2005 = pd.read_csv(DATA_DIR.joinpath("exam_accidents_2005.csv"), parse_dates=["date_time"])
    result = worst_days(accidents_2005)
    print ('result:', *result, sep='\n')
    print (result)
    print('type(result)=',type(result))

result:
2005-05-14 00:00:00
2005-06-18 00:00:00
2005-09-16 00:00:00
2005-11-04 00:00:00
2005-12-23 00:00:00
[Timestamp('2005-05-14 00:00:00'), Timestamp('2005-06-18 00:00:00'), Timestamp('2005-09-16 00:00:00'), Timestamp('2005-11-04 00:00:00'), Timestamp('2005-12-23 00:00:00')]
type(result)= <class 'list'>


In [28]:
# accidents_2005['date_time'] = accidents_2005['date_time'].apply(lambda x: pd.Timestamp(x))
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size()
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values
# tuple(accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values)
# tuple(accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5)[date_time])
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values.tolist()
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).apply(lambda x: pd.Timestamp(x))
# pd.Timestamp(tuple(accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values)[0])
# accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values.apply(lambda x: pd.Timestamp(x))
# pd.Series(accidents_2005[(accidents_2005['Accident_Severity']<3)].resample('D').size().sort_values(ascending=False).head(5).index.values).apply(lambda x: pd.Timestamp(x))
# accidents_2005 = pd.read_csv(DATA_DIR.joinpath("exam_accidents_2005.csv"), parse_dates=["date_time"])
# accidents_2005['date_time'] = accidents_2005['date_time'].apply(lambda x: pd.Timestamp(x))
# res = [(pd.Timestamp(x)) for x in t]
# res

In [29]:
PROBLEM_ID = 6

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, worst_days)

In [30]:
if TEST:
    print(f"{STUDENT}: {int(100 * total_grade / MAX_POINTS)}")