## Activity 02: Indexing, Slicing, and Iterating

Our client wants to prove that our dataset is nicely distributed around the mean value of 100.   
They asked us to run some tests on several subsections of it to make sure they won't get an unrepresentative section of our data.

Look at the mean value of each subtask.

#### Loading the dataset

In [1]:
# importing the necessary dependencies
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np



In [2]:
# loading the Dataset
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')

---

#### Indexing

Since we need several rows of our dataset to complete the given task, we have to use indexing to get the right rows.   
To recap, we need: 
- the second row 
- the last row
- the first value of the first row
- the last value of the second-to-last row

In [14]:
# indexing the second row of the dataset (2nd row)
second_row = dataset[1,:]
print(second_row)
np.mean(second_row)


[ 92.02628776  97.10439252  99.32066924  97.24584816  92.9267508
  92.65657752 105.7197853  101.23162942  93.87155456]


96.90038836444445

In [15]:
# indexing the last element of the dataset (last row)
last_row = dataset[-1,:]
print(last_row)
np.mean(last_row)


[ 94.11176915  99.62387832 104.51786419  97.62787811  93.97853495
  98.75108352 106.05042487 100.07721494 106.89005002]


100.18096645222221

In [17]:
# indexing the first value of the first row (1st row, 1st value)
print(dataset[0,0])


99.14931546


In [22]:
# indexing the last value of the second to last row (we want to use the combined access syntax here) 

dataset[-2][-1]
# or 
dataset[-2,-1]


101.2226037

101.2226037
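
Both spellings return the same value, but they work differently: `dataset[-2][-1]` first extracts the whole second-to-last row and then indexes into it, while `dataset[-2,-1]` performs a single lookup with a (row, column) pair. A minimal sketch on a small stand-in array (the `toy` array and its values are assumptions, not part of the dataset):

```python
import numpy as np

# Hypothetical 3x4 stand-in for the real dataset
toy = np.arange(12).reshape(3, 4)

chained = toy[-2][-1]    # take the second-to-last row first, then its last value
combined = toy[-2, -1]   # one lookup with a (row, column) index

print(chained, combined)
```

The combined form is usually preferred for multi-dimensional arrays because it avoids creating the intermediate row.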

---

#### Slicing

Besides single rows and values, we also need some subsets of the dataset.   
Here we want these slices:
- a 2x2 slice made up of the second and third elements of the second and third rows
- every other element of the 5th row
- the content of the last row in reversed order
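
NumPy slices follow the pattern `start:stop:step`, and the `stop` index is excluded. The three kinds of slices listed above can be sketched on a small array whose values make the result easy to check (the `toy` array is a hypothetical stand-in, not the real dataset):

```python
import numpy as np

# Hypothetical 5x5 stand-in holding the values 1..25, row by row
toy = np.arange(1, 26).reshape(5, 5)

# rows 2-3 and columns 2-3 (indices 1 and 2; stop index 3 is excluded)
block = toy[1:3, 1:3]

# every other element of the 5th row (step of 2)
every_other = toy[4, ::2]

# last row in reversed order (step of -1)
reversed_last = toy[-1, ::-1]

print(block)
print(every_other)
print(reversed_last)
```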

In [25]:
# slicing a 2x2 block of 4 elements: second and third rows, second and third columns
subsection_2x2 = dataset[1:3,1:3]
np.mean(subsection_2x2)


95.63393608250001

##### Why is it not a problem if the mean of such a small subsection deviates more from 100?

Several smaller values can cluster in such a small subsection, pulling its mean well below 100.   
If we make our subsection larger, we have a higher chance of getting a more representative view of our data.
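
This effect can be illustrated with synthetic data. The sketch below does not use the real dataset: it assumes normally distributed values around 100 with a standard deviation of 5 (an arbitrary choice), and the helper `subsection_means` is a hypothetical name introduced only for this illustration. It compares how much the mean varies across many small subsections versus many large ones.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability

# Synthetic stand-in: 24x9 values around 100 (the scale of 5 is an assumption)
synthetic = rng.normal(loc=100, scale=5, size=(24, 9))

def subsection_means(data, size, trials=1000):
    """Return the means of `trials` randomly drawn subsections of `size` values."""
    flat = data.ravel()
    return np.array([
        rng.choice(flat, size=size, replace=False).mean()
        for _ in range(trials)
    ])

small_means = subsection_means(synthetic, size=4)
large_means = subsection_means(synthetic, size=100)

# The means of small subsections scatter much more widely than those of large ones
print("spread of 4-value means:  ", small_means.std())
print("spread of 100-value means:", large_means.std())
```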

In [29]:
# selecting every second element of the fifth row 
every_second_element = dataset[4,::2]
print(every_second_element)
np.mean(every_second_element)



[101.20862522 100.28690912  93.37126331 100.79478953  96.10020311]


98.35235805800001

In [32]:
# reversing the entry order, selecting the last row in reversed order
print(dataset[-1,::-1])


[106.89005002 100.07721494 106.05042487  98.75108352  93.97853495
  97.62787811 104.51786419  99.62387832  94.11176915]


---

#### Splitting

Our client's team only wants to use a small subset of the given dataset.   
Therefore, we first need to split it into 3 equal pieces and then give them the first half of the first piece.   
They sent us this drawing to show us what they need:
```
1, 2, 3, 4, 5, 6          1, 2     3, 4    5, 6          1, 2  
3, 2, 1, 5, 4, 6    =>    3, 2     1, 5    4, 6    =>    3, 2    =>    1, 2
5, 3, 1, 2, 4, 3          5, 3     1, 2    4, 3                        3, 2
1, 2, 2, 4, 1, 5          1, 2     2, 4    1, 5          5, 3
                                                         1, 2
```
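
The drawing can be reproduced directly with `np.hsplit` and `np.vsplit`. The following is a minimal sketch that uses the numbers from the drawing instead of the real dataset; the variable names are chosen for illustration only.

```python
import numpy as np

# The 4x6 example from the drawing
drawing = np.array([
    [1, 2, 3, 4, 5, 6],
    [3, 2, 1, 5, 4, 6],
    [5, 3, 1, 2, 4, 3],
    [1, 2, 2, 4, 1, 5],
])

# Split the columns into 3 equal pieces: three arrays of shape (4, 2)
thirds = np.hsplit(drawing, 3)

# Split the rows of the first piece into 2 equal halves: two arrays of shape (2, 2)
halves = np.vsplit(thirds[0], 2)

result = halves[0]
print(result)
```

When an integer is passed, both functions require that the axis can be divided evenly and raise a `ValueError` otherwise.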

> **Note:**   
We are using a very small dataset here, but imagine you have a huge amount of data and only want to look at a small subset of it to tweak your visualizations.

In [53]:
# splitting up our dataset horizontally into 3 equal pieces (at one third and two thirds of the columns)
hor_split = np.hsplit(dataset, 3)
hor_split[0]


array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112],
       [ 95.65982034, 107.22482426, 107.19119932],
       [100.39303522,  92.0108226 ,  97.75887636],
       [103.1521596 , 109.40523174,  93.83969256],
       [106.11454989,  88.80221141,  94.5081787 ],
       [ 96.78266211,  99.84251605, 104.03478031],
       [101.86186193, 103.61720152,  99.57859892],
       [ 97.49594839,  96.59385486, 104.63817694],
       [ 96.76814836,  91.67792

In [56]:
# splitting up the first third vertically into 2 equal halves
ver_split = np.vsplit(hor_split[0], 2)
print(ver_split[0])
print("Dataset", dataset.shape)
print("Subset", ver_split[0].shape)

[[ 99.14931546 104.03852715 107.43534677]
 [ 92.02628776  97.10439252  99.32066924]
 [ 95.66253664  95.17750125  90.93318132]
 [ 91.37294597 100.96781394 100.40118279]
 [101.20862522 103.5730309  100.28690912]
 [102.80387079  98.29687616  93.24376389]
 [106.71751618 102.97585605  98.45723272]
 [ 96.02548256 102.82360856 106.47551845]
 [105.30350449  92.87730812 103.19258339]
 [110.44484313  93.87155456 101.5363647 ]
 [101.3514185  100.37372248 106.6471081 ]
 [ 97.21315663 107.02874163 102.17642112]]
Dataset (24, 9)
Subset (12, 3)


---

#### Iterating

Once you have sent over the dataset, they tell you that they also need a way to iterate over the whole dataset element by element, as if it were a one-dimensional list.   
However, they also want to know the position of each element in the dataset itself.

They send you this piece of code and tell you that it's not working as intended.   
Come up with the right solution for their needs.

In [12]:
# iterating over whole dataset (each value in each row)
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1

99.14931546 0
104.03852715 1
107.43534677 2
97.85230675 3
98.74986914 4
98.80833412 5
96.81964892 6
98.56783189 7
101.34745901 8
92.02628776 9
97.10439252 10
99.32066924 11
97.24584816 12
92.9267508 13
92.65657752 14
105.7197853 15
101.23162942 16
93.87155456 17
95.66253664 18
95.17750125 19
90.93318132 20
110.18889465 21
98.80084371 22
105.95297652 23
98.37481387 24
106.54654286 25
107.22482426 26
91.37294597 27
100.96781394 28
100.40118279 29
113.42090475 30
105.48508838 31
91.6604946 32
106.1472841 33
95.08715803 34
103.40412146 35
101.20862522 36
103.5730309 37
100.28690912 38
105.85269352 39
93.37126331 40
108.57980357 41
100.79478953 42
94.20019732 43
96.10020311 44
102.80387079 45
98.29687616 46
93.24376389 47
97.24130034 48
89.03452725 49
96.2832753 50
104.60344836 51
101.13442416 52
97.62787811 53
106.71751618 54
102.97585605 55
98.45723272 56
100.72418901 57
106.39798503 58
95.46493436 59
94.35373179 60
106.83273763 61
100.07721494 62
96.02548256 63
102.82360856 64
106.475518

In [13]:
# iterating over the whole dataset with indices matching the position in the dataset
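
One possible solution, sketched here on a small stand-in array rather than the real dataset: `np.ndenumerate` yields each value together with its `(row, column)` index, so no manual counter is needed. The `toy` array and the `positions` list are hypothetical names used only for illustration.

```python
import numpy as np

# Hypothetical 2x2 stand-in for the real dataset
toy = np.array([[1.5, 2.5],
                [3.5, 4.5]])

positions = []
for index, value in np.ndenumerate(toy):
    # index is a tuple such as (0, 1): row 0, column 1
    positions.append((index, float(value)))
    print(index, value)
```

Applied to the real data, the loop would read `for index, value in np.ndenumerate(dataset):`.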
