## Activity 02: Indexing, Slicing, and Iterating

Our client wants to prove that our dataset is nicely distributed around the mean value of 100.   
They asked us to run some tests on several subsections of it to make sure they won't get a non-descriptive section of our data.

Look at the mean value of each subtask.

#### Loading the dataset

In [6]:
# importing the necessary dependencies
import numpy as np

In [7]:
# loading the Dataset
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')

In [8]:
# displaying the dataset as a table for easier reading
import pandas as pd
pd.DataFrame(dataset)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,99.149315,104.038527,107.435347,97.852307,98.749869,98.808334,96.819649,98.567832,101.347459
1,92.026288,97.104393,99.320669,97.245848,92.926751,92.656578,105.719785,101.231629,93.871555
2,95.662537,95.177501,90.933181,110.188895,98.800844,105.952977,98.374814,106.546543,107.224824
3,91.372946,100.967814,100.401183,113.420905,105.485088,91.660495,106.147284,95.087158,103.404121
4,101.208625,103.573031,100.286909,105.852694,93.371263,108.579804,100.79479,94.200197,96.100203
5,102.803871,98.296876,93.243764,97.2413,89.034527,96.283275,104.603448,101.134424,97.627878
6,106.717516,102.975856,98.457233,100.724189,106.397985,95.464934,94.353732,106.832738,100.077215
7,96.025483,102.823609,106.475518,101.347459,102.456518,98.747675,97.575443,92.574876,91.372946
8,105.303504,92.877308,103.192583,104.405183,101.293268,100.854471,101.222604,106.038688,97.852307
9,110.444843,93.871555,101.536365,97.653935,92.750486,101.720746,96.968512,103.291471,99.149315


---

#### Indexing

Since we need several rows of our dataset to complete the given task, we have to use indexing to get the right rows.   
To recap, we need: 
- the second row 
- the last row
- the first value of the second row
- the last value of the second-to-last row

In [9]:
# indexing the second row of the dataset (2nd row)
dataset[1]

array([ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
        92.9267508 ,  92.65657752, 105.7197853 , 101.23162942,
        93.87155456])

In [10]:
# indexing the last element of the dataset (last row)
dataset[-1]

array([ 94.11176915,  99.62387832, 104.51786419,  97.62787811,
        93.97853495,  98.75108352, 106.05042487, 100.07721494,
       106.89005002])

In [11]:
# indexing the first value of the second row (row index 1, column index 0)
dataset[1][0]

92.02628776

In [12]:
# indexing the last value of the second-to-last row with the combined access syntax
dataset[-2, -1]

101.2226037
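
The chained form `dataset[-2][-1]` and the combined form `dataset[-2, -1]` select the same element. The following is a minimal sketch on a small made-up array; `toy` is a hypothetical stand-in, not the activity's dataset.

```python
import numpy as np

# hypothetical 4x3 stand-in for the dataset (values 0..11)
toy = np.arange(12).reshape(4, 3)

second_row = toy[1]      # whole second row
last_row = toy[-1]       # whole last row
chained = toy[-2][-1]    # index the row first, then the value
combined = toy[-2, -1]   # row and column in a single indexing operation

print(second_row)         # [3 4 5]
print(last_row)           # [ 9 10 11]
print(chained, combined)  # 8 8
```

The combined form is usually preferable because it does not build the intermediate row before picking the value.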

---

#### Slicing

Besides single rows and values, we also need some subsets of the dataset.   
We want the following slices:
- a 2x2 slice covering the first two rows and the first two columns
- every other element of the 5th row
- the content of the last row in reversed order

In [13]:
# slicing an intersection of 4 elements (2x2) of the first two rows and first two columns
dataset[:2,:2]

array([[ 99.14931546, 104.03852715],
       [ 92.02628776,  97.10439252]])

##### Why is it not a problem if the mean of such a small subsection deviates more from 100?

In such a small subsection, a few unusually low (or high) values can cluster together and pull the mean well away from 100.   
A larger subsection averages over more values, so its mean is more likely to be representative of the whole dataset.
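
To see this effect, one can compare the means of progressively larger subsections. The sketch below does not load the CSV; it assumes synthetic values drawn around 100 (the seed, the spread of 5, and the name `toy` are all illustrative choices).

```python
import numpy as np

# assumption: synthetic data around 100, standing in for the real CSV
rng = np.random.default_rng(seed=0)
toy = rng.normal(loc=100, scale=5, size=(24, 9))

# mean of progressively larger top-left subsections
means = {}
for rows, cols in [(2, 2), (6, 6), (24, 9)]:
    subsection = toy[:rows, :cols]
    means[(rows, cols)] = float(np.mean(subsection))
    print(subsection.shape, round(means[(rows, cols)], 2))
```

With only four values, the first mean can land noticeably away from 100, while the mean over all 216 values should sit close to it.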

In [14]:
# selecting every second element of the fifth row 
dataset[4,::2]

array([101.20862522, 100.28690912,  93.37126331, 100.79478953,
        96.10020311])

In [15]:
# reversing the entry order, selecting the last row in reversed order
dataset[-1, ::-1]

array([106.89005002, 100.07721494, 106.05042487,  98.75108352,
        93.97853495,  97.62787811, 104.51786419,  99.62387832,
        94.11176915])
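
All three slices follow the `start:stop:step` pattern, applied per axis. Here is a short sketch on a made-up array, where `toy` is a hypothetical stand-in for the dataset:

```python
import numpy as np

# hypothetical 4x5 array with the values 0..19
toy = np.arange(20).reshape(4, 5)

block = toy[:2, :2]            # first two rows, first two columns
every_other = toy[2, ::2]      # third row, every second element
reversed_last = toy[-1, ::-1]  # last row, back to front

print(block)
# [[0 1]
#  [5 6]]
print(every_other)    # [10 12 14]
print(reversed_last)  # [19 18 17 16 15]
```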

---

#### Splitting

Our client's team only wants to use a small subset of the given dataset.   
Therefore, we first need to split it horizontally into 3 equal pieces and then give them the first half (the upper rows) of the first piece.   
They sent us this drawing to show us what they need:
```
1, 2, 3, 4, 5, 6          1, 2     3, 4    5, 6          1, 2  
3, 2, 1, 5, 4, 6    =>    3, 2     1, 5    4, 6    =>    3, 2    =>    1, 2
5, 3, 1, 2, 4, 3          5, 3     1, 2    4, 3                        3, 2
1, 2, 2, 4, 1, 5          1, 2     2, 4    1, 5          5, 3
                                                         1, 2
```
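
The drawing can be reproduced directly with `np.hsplit` and `np.vsplit`. This is a small sketch using the 4x6 example values from the drawing; the variable names are illustrative.

```python
import numpy as np

# the 4x6 example from the drawing
drawing = np.array([
    [1, 2, 3, 4, 5, 6],
    [3, 2, 1, 5, 4, 6],
    [5, 3, 1, 2, 4, 3],
    [1, 2, 2, 4, 1, 5],
])

thirds = np.hsplit(drawing, 3)    # three 4x2 column blocks
halves = np.vsplit(thirds[0], 2)  # two 2x2 row blocks of the first third

print(halves[0])
# [[1 2]
#  [3 2]]
```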

> **Note:**   
We are using a very small dataset here, but imagine you have a huge amount of data and only want to look at a small subset of it to tweak your visualizations.

In [16]:
# splitting up our dataset horizontally on indices one third and two thirds
hsplit_result = np.hsplit(dataset, 3)
hsplit_result

[array([[ 99.14931546, 104.03852715, 107.43534677],
        [ 92.02628776,  97.10439252,  99.32066924],
        [ 95.66253664,  95.17750125,  90.93318132],
        [ 91.37294597, 100.96781394, 100.40118279],
        [101.20862522, 103.5730309 , 100.28690912],
        [102.80387079,  98.29687616,  93.24376389],
        [106.71751618, 102.97585605,  98.45723272],
        [ 96.02548256, 102.82360856, 106.47551845],
        [105.30350449,  92.87730812, 103.19258339],
        [110.44484313,  93.87155456, 101.5363647 ],
        [101.3514185 , 100.37372248, 106.6471081 ],
        [ 97.21315663, 107.02874163, 102.17642112],
        [ 95.65982034, 107.22482426, 107.19119932],
        [100.39303522,  92.0108226 ,  97.75887636],
        [103.1521596 , 109.40523174,  93.83969256],
        [106.11454989,  88.80221141,  94.5081787 ],
        [ 96.78266211,  99.84251605, 104.03478031],
        [101.86186193, 103.61720152,  99.57859892],
        [ 97.49594839,  96.59385486, 104.63817694],
        [ 96

In [17]:
# splitting the first third vertically into two halves and looking at the first half
v_split_result = np.vsplit(hsplit_result[0], 2)
v_split_result[0]

array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112]])

In [18]:
# comparing the shapes (rows, columns) of the dataset and the subset
print("Dataset:", dataset.shape)
print("Subset:", v_split_result[0].shape)

Dataset: (24, 9)
Subset: (12, 3)


In [19]:
# splitting up our dataset vertically on index 2
v_split_result = np.vsplit(dataset, [2])

In [20]:
pd.DataFrame(v_split_result[0])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,99.149315,104.038527,107.435347,97.852307,98.749869,98.808334,96.819649,98.567832,101.347459
1,92.026288,97.104393,99.320669,97.245848,92.926751,92.656578,105.719785,101.231629,93.871555


In [21]:
pd.DataFrame(v_split_result[1])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,95.662537,95.177501,90.933181,110.188895,98.800844,105.952977,98.374814,106.546543,107.224824
1,91.372946,100.967814,100.401183,113.420905,105.485088,91.660495,106.147284,95.087158,103.404121
2,101.208625,103.573031,100.286909,105.852694,93.371263,108.579804,100.79479,94.200197,96.100203
3,102.803871,98.296876,93.243764,97.2413,89.034527,96.283275,104.603448,101.134424,97.627878
4,106.717516,102.975856,98.457233,100.724189,106.397985,95.464934,94.353732,106.832738,100.077215
5,96.025483,102.823609,106.475518,101.347459,102.456518,98.747675,97.575443,92.574876,91.372946
6,105.303504,92.877308,103.192583,104.405183,101.293268,100.854471,101.222604,106.038688,97.852307
7,110.444843,93.871555,101.536365,97.653935,92.750486,101.720746,96.968512,103.291471,99.149315
8,101.351418,100.373722,106.647108,100.617428,105.032054,99.36,98.870075,95.852842,93.978535
9,97.213157,107.028742,102.176421,96.746303,95.937992,102.623847,105.074753,97.595722,106.573646
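
The second argument of `np.vsplit` behaves differently depending on its type: an integer asks for that many equal parts, while a list gives the row indices to cut at. A minimal sketch on a made-up array (`toy` is a hypothetical stand-in):

```python
import numpy as np

# hypothetical 6x4 array
toy = np.arange(24).reshape(6, 4)

equal_parts = np.vsplit(toy, 3)  # integer: three equal 2x4 blocks
at_index = np.vsplit(toy, [2])   # list: cut before row index 2

print([part.shape for part in equal_parts])  # [(2, 4), (2, 4), (2, 4)]
print([part.shape for part in at_index])     # [(2, 4), (4, 4)]
```

This matches the result above, where the first part holds two rows and the second part holds the rest.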


---

#### Iterating

Once you have sent over the dataset, they tell you that they also need a way to iterate over the whole dataset element by element, as if it were a one-dimensional list.   
However, they also want to know the position of each element in the dataset itself.

They send you this piece of code and tell you that it's not working as intended.   
Come up with the right solution for their needs.

In [22]:
# the client's attempt: iterating over the whole dataset (each value in each row) with a flat running index
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1

99.14931546 0
104.03852715 1
107.43534677 2
97.85230675 3
98.74986914 4
98.80833412 5
96.81964892 6
98.56783189 7
101.34745901 8
92.02628776 9
97.10439252 10
99.32066924 11
97.24584816 12
92.9267508 13
92.65657752 14
105.7197853 15
101.23162942 16
93.87155456 17
95.66253664 18
95.17750125 19
90.93318132 20
110.18889465 21
98.80084371 22
105.95297652 23
98.37481387 24
106.54654286 25
107.22482426 26
91.37294597 27
100.96781394 28
100.40118279 29
113.42090475 30
105.48508838 31
91.6604946 32
106.1472841 33
95.08715803 34
103.40412146 35
101.20862522 36
103.5730309 37
100.28690912 38
105.85269352 39
93.37126331 40
108.57980357 41
100.79478953 42
94.20019732 43
96.10020311 44
102.80387079 45
98.29687616 46
93.24376389 47
97.24130034 48
89.03452725 49
96.2832753 50
104.60344836 51
101.13442416 52
97.62787811 53
106.71751618 54
102.97585605 55
98.45723272 56
100.72418901 57
106.39798503 58
95.46493436 59
94.35373179 60
106.83273763 61
100.07721494 62
96.02548256 63
102.82360856 64
106.475518

In [23]:
# iterating over the whole dataset with indices matching the position in the dataset
for index in np.ndindex(dataset.shape):
    print(index, dataset[index])

(0, 0) 99.14931546
(0, 1) 104.03852715
(0, 2) 107.43534677
(0, 3) 97.85230675
(0, 4) 98.74986914
(0, 5) 98.80833412
(0, 6) 96.81964892
(0, 7) 98.56783189
(0, 8) 101.34745901
(1, 0) 92.02628776
(1, 1) 97.10439252
(1, 2) 99.32066924
(1, 3) 97.24584816
(1, 4) 92.9267508
(1, 5) 92.65657752
(1, 6) 105.7197853
(1, 7) 101.23162942
(1, 8) 93.87155456
(2, 0) 95.66253664
(2, 1) 95.17750125
(2, 2) 90.93318132
(2, 3) 110.18889465
(2, 4) 98.80084371
(2, 5) 105.95297652
(2, 6) 98.37481387
(2, 7) 106.54654286
(2, 8) 107.22482426
(3, 0) 91.37294597
(3, 1) 100.96781394
(3, 2) 100.40118279
(3, 3) 113.42090475
(3, 4) 105.48508838
(3, 5) 91.6604946
(3, 6) 106.1472841
(3, 7) 95.08715803
(3, 8) 103.40412146
(4, 0) 101.20862522
(4, 1) 103.5730309
(4, 2) 100.28690912
(4, 3) 105.85269352
(4, 4) 93.37126331
(4, 5) 108.57980357
(4, 6) 100.79478953
(4, 7) 94.20019732
(4, 8) 96.10020311
(5, 0) 102.80387079
(5, 1) 98.29687616
(5, 2) 93.24376389
(5, 3) 97.24130034
(5, 4) 89.03452725
(5, 5) 96.2832753
(5, 6) 104.6034

In [24]:
# alternative: np.ndenumerate yields (index, value) pairs directly
for index, elem in np.ndenumerate(dataset):
    print(index, elem)

(0, 0) 99.14931546
(0, 1) 104.03852715
(0, 2) 107.43534677
(0, 3) 97.85230675
(0, 4) 98.74986914
(0, 5) 98.80833412
(0, 6) 96.81964892
(0, 7) 98.56783189
(0, 8) 101.34745901
(1, 0) 92.02628776
(1, 1) 97.10439252
(1, 2) 99.32066924
(1, 3) 97.24584816
(1, 4) 92.9267508
(1, 5) 92.65657752
(1, 6) 105.7197853
(1, 7) 101.23162942
(1, 8) 93.87155456
(2, 0) 95.66253664
(2, 1) 95.17750125
(2, 2) 90.93318132
(2, 3) 110.18889465
(2, 4) 98.80084371
(2, 5) 105.95297652
(2, 6) 98.37481387
(2, 7) 106.54654286
(2, 8) 107.22482426
(3, 0) 91.37294597
(3, 1) 100.96781394
(3, 2) 100.40118279
(3, 3) 113.42090475
(3, 4) 105.48508838
(3, 5) 91.6604946
(3, 6) 106.1472841
(3, 7) 95.08715803
(3, 8) 103.40412146
(4, 0) 101.20862522
(4, 1) 103.5730309
(4, 2) 100.28690912
(4, 3) 105.85269352
(4, 4) 93.37126331
(4, 5) 108.57980357
(4, 6) 100.79478953
(4, 7) 94.20019732
(4, 8) 96.10020311
(5, 0) 102.80387079
(5, 1) 98.29687616
(5, 2) 93.24376389
(5, 3) 97.24130034
(5, 4) 89.03452725
(5, 5) 96.2832753
(5, 6) 104.6034
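
If the client prefers to keep `np.nditer`, its `multi_index` flag makes the iterator track the position itself. The sketch below compares that with `np.ndenumerate` on a made-up array; `toy` is a hypothetical stand-in for the dataset.

```python
import numpy as np

# hypothetical 2x2 array
toy = np.array([[1.5, 2.5], [3.5, 4.5]])

# np.ndenumerate: (index tuple, value) pairs
pairs = [(index, float(value)) for index, value in np.ndenumerate(toy)]

# np.nditer with the multi_index flag exposes the current position
it = np.nditer(toy, flags=['multi_index'])
tracked = []
for value in it:
    tracked.append((it.multi_index, float(value)))

print(tracked)
# [((0, 0), 1.5), ((0, 1), 2.5), ((1, 0), 3.5), ((1, 1), 4.5)]
```

Both approaches yield the same positions and values, so the choice is mostly a matter of readability.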