## Activity 03: Filtering, Sorting, and Reshaping

Following up on the last activity, we have been asked to perform some more complex operations.   
We will therefore continue to work with the same dataset, `normal_distribution.csv`.

#### Loading the dataset

In [32]:
# importing the necessary dependencies
import numpy as np

In [33]:
# loading the Dataset
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')

In [34]:
import pandas as pd
pd.DataFrame(dataset)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 99.149315 | 104.038527 | 107.435347 | 97.852307 | 98.749869 | 98.808334 | 96.819649 | 98.567832 | 101.347459 |
| 1 | 92.026288 | 97.104393 | 99.320669 | 97.245848 | 92.926751 | 92.656578 | 105.719785 | 101.231629 | 93.871555 |
| 2 | 95.662537 | 95.177501 | 90.933181 | 110.188895 | 98.800844 | 105.952977 | 98.374814 | 106.546543 | 107.224824 |
| 3 | 91.372946 | 100.967814 | 100.401183 | 113.420905 | 105.485088 | 91.660495 | 106.147284 | 95.087158 | 103.404121 |
| 4 | 101.208625 | 103.573031 | 100.286909 | 105.852694 | 93.371263 | 108.579804 | 100.79479 | 94.200197 | 96.100203 |
| 5 | 102.803871 | 98.296876 | 93.243764 | 97.2413 | 89.034527 | 96.283275 | 104.603448 | 101.134424 | 97.627878 |
| 6 | 106.717516 | 102.975856 | 98.457233 | 100.724189 | 106.397985 | 95.464934 | 94.353732 | 106.832738 | 100.077215 |
| 7 | 96.025483 | 102.823609 | 106.475518 | 101.347459 | 102.456518 | 98.747675 | 97.575443 | 92.574876 | 91.372946 |
| 8 | 105.303504 | 92.877308 | 103.192583 | 104.405183 | 101.293268 | 100.854471 | 101.222604 | 106.038688 | 97.852307 |
| 9 | 110.444843 | 93.871555 | 101.536365 | 97.653935 | 92.750486 | 101.720746 | 96.968512 | 103.291471 | 99.149315 |


---

#### Filtering

To get better insights into our dataset, we want to look only at the values that fulfill certain conditions.   
Our client reaches out to us and asks us to provide lists of values that fulfill these conditions:
- all values greater than 105 (>105)
- all values that are between 90 and 95 (>90 and <95)
- the indices of all values that differ from 100 by less than 1 (|x - 100| < 1)

In [35]:
# values that are greater than 105
filter_gt_105 = dataset > 105
dataset[filter_gt_105]

array([107.43534677, 105.7197853 , 110.18889465, 105.95297652,
       106.54654286, 107.22482426, 113.42090475, 105.48508838,
       106.1472841 , 105.85269352, 108.57980357, 106.71751618,
       106.39798503, 106.83273763, 106.47551845, 105.30350449,
       106.03868807, 110.44484313, 106.6471081 , 105.0320535 ,
       107.02874163, 105.07475277, 106.57364584, 107.22482426,
       107.19119932, 108.09423367, 109.40523174, 106.11454989,
       106.57052697, 105.13668343, 105.37011896, 110.44484313,
       105.86078488, 106.89005002, 106.57364584, 107.40064604,
       106.38276709, 106.46476468, 110.43976681, 105.02389857,
       106.05042487, 106.89005002])

In [36]:
# values that are between 90 and 95 (strictly: >90 and <95, as requested)
filter_between_90_and_95 = (dataset > 90) & (dataset < 95)
dataset[filter_between_90_and_95]

array([92.02628776, 92.9267508 , 92.65657752, 93.87155456, 90.93318132,
       91.37294597, 91.6604946 , 93.37126331, 94.20019732, 93.24376389,
       94.35373179, 92.5748759 , 91.37294597, 92.87730812, 93.87155456,
       92.75048583, 93.97853495, 91.32093303, 92.0108226 , 93.18884302,
       93.83969256, 94.5081787 , 94.59300658, 93.04610867, 91.6779221 ,
       91.37294597, 94.76253572, 94.57421727, 94.11176915, 93.97853495])

> **Note:**    
Conditional filtering can be done either with the bracket syntax or with NumPy's `np.extract` function.
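
As a quick check that both approaches return the same values, here is a small sketch on a made-up array (`sample` is a hypothetical stand-in, not the real dataset):

```python
import numpy as np

# Hypothetical stand-in for the real dataset (values chosen for illustration).
sample = np.array([[99.1, 104.0, 107.4],
                   [92.0, 105.7, 93.9]])

mask = sample > 105

# Boolean-mask indexing returns a flat array of the matching values.
via_brackets = sample[mask]

# np.extract(condition, arr) flattens both arguments and keeps the matches.
via_extract = np.extract(mask, sample)

print(via_brackets.tolist())                      # [107.4, 105.7]
print(np.array_equal(via_brackets, via_extract))  # True
```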

In [45]:
# indices of values that have a delta of less than 1 to 100
# np.where returns one index array per axis: rows first, then columns
row_indices, col_indices = np.where(np.abs(dataset - 100) < 1)
row_indices

array([ 0,  1,  3,  3,  4,  4,  6,  6,  8,  9, 10, 10, 10, 12, 13, 13, 13,
       14, 14, 15, 16, 16, 17, 17, 18, 18, 20, 21, 21, 21, 22, 23, 23])
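
The output above shows only the row indices. To locate each match, the row and column arrays have to be read together. The sketch below uses a small made-up array (`sample` is an assumption) to show how the two arrays pair up, and that `np.argwhere` returns the same positions as coordinate pairs:

```python
import numpy as np

# Illustrative values only, not taken from the dataset.
sample = np.array([[99.1, 104.0, 100.4],
                   [92.0, 100.9, 93.9]])

close_to_100 = np.abs(sample - 100) < 1

# np.where on a 2-D condition returns (row_indices, column_indices).
rows, cols = np.where(close_to_100)
positions = list(zip(rows.tolist(), cols.tolist()))

# np.argwhere returns the same positions as one (n_matches, 2) array.
positions_argwhere = np.argwhere(close_to_100).tolist()

print(positions)           # [(0, 0), (0, 2), (1, 1)]
print(positions_argwhere)  # [[0, 0], [0, 2], [1, 1]]
```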

---

#### Sorting

They also want to experiment with some more plotting techniques, so they ask you to deliver these datasets as well:
- values sorted in ascending order for each row
- values sorted in ascending order for each column
- a matrix of indices that gives, for each row, the positions of the values in sorted order   
```
[3, 1, 2, 5, 4]  =>  [1, 2, 0, 4, 3]
```
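
The mapping in the example above is what `np.argsort` produces: the indices that would sort the input. A minimal sketch follows; the 2-D `matrix` is a made-up illustration, not the dataset:

```python
import numpy as np

values = np.array([3, 1, 2, 5, 4])

# argsort returns the indices that would put `values` in ascending order.
order = np.argsort(values)
print(order.tolist())          # [1, 2, 0, 4, 3]
print(values[order].tolist())  # [1, 2, 3, 4, 5]

# For a 2-D array, axis=1 computes the sorting indices row by row.
matrix = np.array([[3, 1, 2],
                   [9, 7, 8]])
row_order = np.argsort(matrix, axis=1)
print(row_order.tolist())      # [[1, 2, 0], [1, 2, 0]]

# Applying the indices along the same axis reproduces the sorted rows.
sorted_rows = np.take_along_axis(matrix, row_order, axis=1)
```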

In [38]:
# values sorted for each row
sorted_each_row = np.sort(dataset, axis=1)

# Print the result using a DataFrame because it's easier to see.
pd.DataFrame(sorted_each_row)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 96.819649 | 97.852307 | 98.567832 | 98.749869 | 98.808334 | 99.149315 | 101.347459 | 104.038527 | 107.435347 |
| 1 | 92.026288 | 92.656578 | 92.926751 | 93.871555 | 97.104393 | 97.245848 | 99.320669 | 101.231629 | 105.719785 |
| 2 | 90.933181 | 95.177501 | 95.662537 | 98.374814 | 98.800844 | 105.952977 | 106.546543 | 107.224824 | 110.188895 |
| 3 | 91.372946 | 91.660495 | 95.087158 | 100.401183 | 100.967814 | 103.404121 | 105.485088 | 106.147284 | 113.420905 |
| 4 | 93.371263 | 94.200197 | 96.100203 | 100.286909 | 100.79479 | 101.208625 | 103.573031 | 105.852694 | 108.579804 |
| 5 | 89.034527 | 93.243764 | 96.283275 | 97.2413 | 97.627878 | 98.296876 | 101.134424 | 102.803871 | 104.603448 |
| 6 | 94.353732 | 95.464934 | 98.457233 | 100.077215 | 100.724189 | 102.975856 | 106.397985 | 106.717516 | 106.832738 |
| 7 | 91.372946 | 92.574876 | 96.025483 | 97.575443 | 98.747675 | 101.347459 | 102.456518 | 102.823609 | 106.475518 |
| 8 | 92.877308 | 97.852307 | 100.854471 | 101.222604 | 101.293268 | 103.192583 | 104.405183 | 105.303504 | 106.038688 |
| 9 | 92.750486 | 93.871555 | 96.968512 | 97.653935 | 99.149315 | 101.536365 | 101.720746 | 103.291471 | 110.444843 |


> **Note:**   
By default, `np.sort` sorts along the last axis (`axis=-1`). For our two-dimensional dataset this is axis 1, so each row is sorted.
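
To make the effect of the `axis` argument concrete, here is a small sketch on a made-up 2 x 3 array (`sample` is an assumption for illustration):

```python
import numpy as np

# Illustrative values only, not taken from the dataset.
sample = np.array([[3, 8, 2],
                   [9, 1, 7]])

default_sorted = np.sort(sample)          # same as axis=-1, i.e. axis=1 here
rows_sorted = np.sort(sample, axis=1)     # each row sorted independently
cols_sorted = np.sort(sample, axis=0)     # each column sorted independently
flat_sorted = np.sort(sample, axis=None)  # flattened first, then sorted

print(rows_sorted.tolist())  # [[2, 3, 8], [1, 7, 9]]
print(cols_sorted.tolist())  # [[3, 1, 2], [9, 8, 7]]
print(flat_sorted.tolist())  # [1, 2, 3, 7, 8, 9]
```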

In [39]:
# values sorted for each column
sorted_each_column = np.sort(dataset, axis=0)

# Print the result using a DataFrame because it's easier to see.
pd.DataFrame(sorted_each_column)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 91.372946 | 88.802211 | 90.933181 | 93.188843 | 85.988396 | 91.660495 | 91.320933 | 92.574876 | 91.372946 |
| 1 | 92.026288 | 91.677922 | 93.243764 | 94.593007 | 89.034527 | 92.656578 | 93.046109 | 94.200197 | 91.372946 |
| 2 | 94.111769 | 92.010823 | 93.839693 | 96.746303 | 92.750486 | 95.191843 | 94.353732 | 94.762536 | 93.871555 |
| 3 | 95.65982 | 92.877308 | 94.508179 | 97.2413 | 92.926751 | 95.464934 | 96.503429 | 95.087158 | 93.978535 |
| 4 | 95.662537 | 93.871555 | 97.758876 | 97.245848 | 93.371263 | 95.623593 | 96.819649 | 95.852842 | 95.191843 |
| 5 | 96.025483 | 94.574217 | 98.457233 | 97.627878 | 93.978535 | 96.283275 | 96.892443 | 97.595722 | 96.100203 |
| 6 | 96.100203 | 95.177501 | 99.320669 | 97.653935 | 95.937992 | 96.346228 | 96.968512 | 98.00253 | 97.104393 |
| 7 | 96.768148 | 96.593855 | 99.578599 | 97.852307 | 98.29244 | 96.593778 | 97.575443 | 98.071227 | 97.2413 |
| 8 | 96.782662 | 97.104393 | 100.286909 | 99.488954 | 98.613252 | 98.659127 | 97.940469 | 98.567832 | 97.627878 |
| 9 | 97.213157 | 98.296876 | 100.401183 | 99.958279 | 98.749869 | 98.747675 | 97.997624 | 99.586647 | 97.852307 |


In [41]:
# indices of positions for each row
# np.argsort returns, for every row, the indices that would sort that row
indices = np.argsort(dataset, axis=1)
indices

---

#### Combining

After finishing their visualizations, they ask you to deliver a way to incrementally add the split parts of the dataset back together, so they can make sure everything works with every subset, too.   
Here, each "column" is one of the three blocks of three columns produced by `np.hsplit`.   
They want you to send them examples for:
- adding the second half of the first column
- adding the second column
- adding the third and last separate column


In [None]:
# split up dataset from activity03
# an integer argument splits into that many equal parts
thirds = np.hsplit(dataset, 3)
halfed_first = np.vsplit(thirds[0], 2)

# this is the part we've sent the client in activity03
halfed_first[0]

array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112]])

In [None]:
# adding the second half of the first column to the data
# np.concatenate((np.array([1,2,3]), np.array([4,5,6])))

np.concatenate((halfed_first[0], halfed_first[1]))


array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112],
       [ 95.65982034, 107.22482426, 107.19119932],
       [100.39303522,  92.0108226 ,  97.75887636],
       [103.1521596 , 109.40523174,  93.83969256],
       [106.11454989,  88.80221141,  94.5081787 ],
       [ 96.78266211,  99.84251605, 104.03478031],
       [101.86186193, 103.61720152,  99.57859892],
       [ 97.49594839,  96.59385486, 104.63817694],
       [ 96.76814836,  91.67792

In [None]:
# adding the second column to our combined dataset


In [None]:
# adding the third column to our combined dataset
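
One possible way to complete the two steps above is sketched below. It is not necessarily the intended solution. It uses stand-in data with the same 24 x 9 shape as the dataset, and the names `first_column`, `first_and_second_column`, and `full_data` are assumptions introduced for illustration:

```python
import numpy as np

# Stand-in data (assumption): same 24 x 9 shape as the activity's dataset.
dataset = np.arange(24 * 9, dtype=float).reshape(24, 9)

# Same split as in the earlier cell: three column blocks, the first halved by rows.
thirds = np.hsplit(dataset, 3)
halfed_first = np.vsplit(thirds[0], 2)

# Rebuild the first column block by stacking its two halves vertically.
first_column = np.vstack([halfed_first[0], halfed_first[1]])

# Append the second, then the third column block horizontally.
first_and_second_column = np.hstack([first_column, thirds[1]])
full_data = np.hstack([first_and_second_column, thirds[2]])

print(first_and_second_column.shape)       # (24, 6)
print(np.array_equal(full_data, dataset))  # True
```

Comparing the recombined array with the original via `np.array_equal` is a cheap way to confirm that no rows or columns were reordered along the way.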


> **Note:**    
Rows can be appended with `np.vstack` or `np.concatenate(..., axis=0)`, and columns with `np.hstack` or `np.concatenate(..., axis=1)`.    
`np.stack` behaves differently: it joins arrays along a new axis, so the result gains a dimension.   
Which of these you use depends on your preference.
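
As a small check of how these joining functions relate, the sketch below compares their results on two made-up 2 x 2 arrays (`a` and `b` are assumptions for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Joining along an existing axis keeps the number of dimensions.
rows_joined = np.concatenate((a, b), axis=0)  # shape (4, 2), same as np.vstack
cols_joined = np.concatenate((a, b), axis=1)  # shape (2, 4), same as np.hstack

# np.stack inserts a new axis, so two 2-D arrays become one 3-D array.
stacked = np.stack((a, b), axis=0)            # shape (2, 2, 2)

print(rows_joined.shape, cols_joined.shape, stacked.shape)
```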

---

#### Reshaping

For their internal AI algorithms, they need the dataset reshaped so that it has fewer columns.   
They asked us to deliver the whole dataset in the following shapes:
- reshaped into a one-dimensional list with all values
- reshaped into a matrix with only 2 columns

In [None]:
# reshaping to a list of values


In [None]:
# reshaping to a matrix with two columns


> **Note:**   
Passing `-1` as one dimension tells NumPy to infer that dimension's size from the total number of elements and the other dimensions.
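
A possible way to produce both requested shapes is sketched below. It uses stand-in data with the same 24 x 9 shape as the dataset; the names `flat` and `two_columns` are assumptions introduced for illustration:

```python
import numpy as np

# Stand-in data (assumption): same 24 x 9 shape as the activity's dataset.
dataset = np.arange(24 * 9, dtype=float).reshape(24, 9)

# One-dimensional array with all 216 values; -1 lets NumPy infer the length.
flat = dataset.reshape(-1)

# Two columns; NumPy infers 216 / 2 = 108 rows from the -1.
two_columns = dataset.reshape(-1, 2)

print(flat.shape)         # (216,)
print(two_columns.shape)  # (108, 2)
```

The inferred dimension only works when the total number of elements is divisible by the fixed dimensions; otherwise `reshape` raises a `ValueError`.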