## NumPy, Pandas
### BIOINF 575 - Fall 2021



_____


### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width = "100">

____
#### A list contains references to each of its values.
#### An array refers to a block of memory containing all values one after the other.
- <b>that is why we need to know the size of the array in advance, and why the array size cannot change</b> <br>


<img src = "https://www.python-course.eu/images/list_structure.png" width = 350 /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src = "https://www.python-course.eu/images/array_structure.png" width = 350 />
____

#### Arrays of different dimensions (`shape` gives the number of elements on each dimension):

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="data structures" width="500">  

_____


#### <b>NumPy basics</b>

Arrays are designed to:
* <b>handle vectorized operations (lists cannot do that)</b>
    - if you apply a function it is performed on every item in the array, rather than on the whole array object
    - both arrays and lists have 0-based indexing
* <b>store multiple items of the same data type</b>
* <b>handle missing values </b>
    - missing numerical values are represented using the `np.nan` object (not a number)
    - the object `np.inf` represents infinity  
* <b>have an unchangeable size</b>
    - the size of an array cannot be changed; to change it, create a new array
    - you know how much space the array needs when you create it, and that will not change  
* <b>have efficient memory usage</b>
    - an equivalent numpy array occupies much less space than a python list of lists
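
The points in this list can be checked with a short sketch. The variable names are illustrative and not part of the original notebook, and the byte counts depend on the platform and Python version, so they are printed rather than stated.

```python
import sys
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Vectorized: the operation is applied to every element of the array
doubled = arr * 2            # array([2, 4, 6, 8])
# The same operator on a list repeats the list instead
repeated = values * 2        # [1, 2, 3, 4, 1, 2, 3, 4]

# One data type per array: mixing ints and floats upcasts everything to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)           # float64

# Missing values: np.nan is a float and propagates through arithmetic
with_missing = np.array([1.0, np.nan, 3.0])
print(with_missing.sum())        # nan
print(np.nansum(with_missing))   # 4.0

# Memory: bytes used by the array data vs. the list plus its int objects
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(arr.nbytes, list_bytes)
```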

#### <b>Basic array attributes:</b>
* shape: number of elements along each dimension
* size: total number of elements in the array
* ndim: number of array dimensions (`len(arr.shape)`)
* dtype: data type of the array elements

#### <b>Importing NumPy</b>
The recommended convention to import numpy is to use the <b>np</b> alias:

In [2]:
import numpy as np

#### <b>Documentation and help</b>
https://numpy.org/doc/

In [4]:
# np.lookfor('sum') 

In [5]:
np.me*?

np.mean
np.median
np.memmap
np.meshgrid

In [7]:
#np.mean?

In [9]:
#help(np.mean)

#### <b>Motivating example</b> - transform temperatures from Celsius to Fahrenheit

In [10]:
temp_list_C = [-20, 25, 3, 10]

In [11]:
# using lists we need a loop to apply the formula to each element of the list
temp_list_F = []

for temp in temp_list_C:
    temp_list_F.append(temp * 1.8 + 32)

temp_list_F

[-4.0, 77.0, 37.4, 50.0]

In [12]:
# using arrays we can apply the formula directly to the array and it will be applied to each element

temp_array_C = np.array(temp_list_C)
temp_array_C

array([-20,  25,   3,  10])

In [15]:
temp_array_F = temp_array_C * 1.8 + 32
print(temp_array_F)

[-4.  77.  37.4 50. ]


In [14]:
type(temp_array_F)

numpy.ndarray

#### <b>Functions for creating arrays</b>
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

##### np.array() - array from lists - e.g. 2D array from a list of lists

In [22]:
 #help(np.array)

x = np.array([1,2,3,5.6])
x.dtype

dtype('float64')

In [23]:
x.shape

(4,)

In [24]:
x.size

4

In [25]:
x.ndim

1

##### np.arange() - vector of evenly spaced values from a range (arange) given by start, stop and step

In [31]:
#help(np.arange)

np.arange(1,51,5)


array([ 1,  6, 11, 16, 21, 26, 31, 36, 41, 46])

##### np.linspace() - vector of evenly spaced values (known number, linspace) given by start, stop and number of points

In [32]:
# help(np.linspace)

np.linspace(1,100)

array([  1.        ,   3.02040816,   5.04081633,   7.06122449,
         9.08163265,  11.10204082,  13.12244898,  15.14285714,
        17.16326531,  19.18367347,  21.20408163,  23.2244898 ,
        25.24489796,  27.26530612,  29.28571429,  31.30612245,
        33.32653061,  35.34693878,  37.36734694,  39.3877551 ,
        41.40816327,  43.42857143,  45.44897959,  47.46938776,
        49.48979592,  51.51020408,  53.53061224,  55.55102041,
        57.57142857,  59.59183673,  61.6122449 ,  63.63265306,
        65.65306122,  67.67346939,  69.69387755,  71.71428571,
        73.73469388,  75.75510204,  77.7755102 ,  79.79591837,
        81.81632653,  83.83673469,  85.85714286,  87.87755102,
        89.89795918,  91.91836735,  93.93877551,  95.95918367,
        97.97959184, 100.        ])
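
`np.linspace(1, 100)` returns 50 values because its `num` argument defaults to 50. The sketch below, with illustrative variable names, contrasts the two functions: `np.linspace` fixes the number of points and includes the stop value by default, while `np.arange` fixes the step and excludes the stop value.

```python
import numpy as np

# linspace: you choose HOW MANY points; the stop value is included by default
five_points = np.linspace(0, 1, num=5)
print(five_points)                     # [0.   0.25 0.5  0.75 1.  ]

# arange: you choose the STEP; the stop value is excluded
stepped = np.arange(0, 1, 0.25)
print(stepped)                         # [0.   0.25 0.5  0.75]

# With no num argument, linspace returns 50 points
print(np.linspace(1, 100).size)        # 50
```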

##### np.zeros() - array of zeros (e.g. 2D array), there is also a np.ones()

In [35]:
# help(np.zeros)

np.zeros((3,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

##### More functions to create special arrays:      
    np.identity(n) - 2D square array filled with 1 on the diagonal      
    np.eye(n,m) - 2D array filled with 1 on the diagonal      
    np.full((n,m), val) - array filled with a given value     

In [36]:
np.identity(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [39]:
np.eye(5,7)

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.]])
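
`np.full` and `np.ones` are listed in this section but not run in the notebook. A minimal sketch with arbitrary shapes and fill values:

```python
import numpy as np

# 2 rows, 3 columns, every element set to 7
sevens = np.full((2, 3), 7)
print(sevens)
# [[7 7 7]
#  [7 7 7]]

# np.ones mirrors np.zeros; the default dtype is float64
ones = np.ones((2, 3))
print(ones.dtype)        # float64

# np.identity(n) gives the same result as the square case np.eye(n)
print(np.array_equal(np.identity(3), np.eye(3)))   # True
```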

#### <b>Basic array attributes:</b>
* shape: number of elements along each dimension
* size: total number of elements in the array
* ndim: number of array dimensions (`len(arr.shape)`)
* dtype: data type of the array elements

In [40]:
# nested lists give us multi dimensional arrays

matrix = np.array([[1,2,3],[4,5,6]])
matrix

array([[1, 2, 3],
       [4, 5, 6]])

In [42]:
# dir(matrix)

In [43]:
# .size - total number of elements in the array

matrix.size

6

In [44]:
# .shape tells us the size on each dimension and, implicitly, the number of dimensions

matrix.shape

(2, 3)

In [None]:
# .ndim - number of array dimensions



In [48]:
# help(matrix.data)

In [45]:
# .dtype - type of the data stored in the array

matrix.dtype

dtype('int64')

In [49]:
matrix

array([[1, 2, 3],
       [4, 5, 6]])
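
The `.ndim` cell is left empty in the notebook. As a small sketch of how the attributes relate to each other, using the same 2 by 3 matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)                       # 2
print(matrix.ndim == len(matrix.shape))  # True
# size is the product of the entries of shape: 2 * 3 = 6
print(matrix.size == matrix.shape[0] * matrix.shape[1])  # True
# len() only reports the size of the first dimension (the number of rows)
print(len(matrix))                       # 2
```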

#### <b>Indexing/Slicing(subsetting): [][] or [,]</b>
___
<img src = "http://scipy-lectures.org/_images/numpy_indexing.png" width = 400/>

In [9]:
matrix = np.full((6,6),range(6)) + 10 * np.full((6,6),range(6)).T
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [21]:
list(range(6))

[0, 1, 2, 3, 4, 5]

In [12]:
np.full((6,6),range(6)) 

array([[0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5]])

In [24]:
10 * np.full((6,6),range(6)).T

array([[ 0,  0,  0,  0,  0,  0],
       [10, 10, 10, 10, 10, 10],
       [20, 20, 20, 20, 20, 20],
       [30, 30, 30, 30, 30, 30],
       [40, 40, 40, 40, 40, 40],
       [50, 50, 50, 50, 50, 50]])

In [13]:
10 * np.full((6,6),range(6)).T + np.full((6,6),range(6))

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])
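
The same matrix can also be built without `np.full`, by adding a column of tens to a row of units and letting NumPy broadcast the two shapes. This is an alternative sketch, not the construction used in the notebook; `rows`, `cols` and `matrix_bc` are illustrative names.

```python
import numpy as np

rows = 10 * np.arange(6).reshape(6, 1)   # column vector: 0, 10, ..., 50
cols = np.arange(6)                      # row vector: 0, 1, ..., 5

matrix_bc = rows + cols                  # shapes (6, 1) and (6,) broadcast to (6, 6)

# Same result as the np.full construction used in the notebook
matrix_full = np.full((6, 6), range(6)) + 10 * np.full((6, 6), range(6)).T
print(np.array_equal(matrix_bc, matrix_full))   # True
```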

#### <b>Reshaping</b> - changing the numbers of rows and columns - data and size stay the same

In [29]:
matrix.size

36

In [30]:
matrix.shape

(6, 6)

In [16]:
# help(matrix.reshape)

In [31]:
# .reshape((n,m)) - Reshaping - changing the shape of the matrix

matrix.shape

(6, 6)

In [32]:
# make a 4 by 9 matrix

matrix.reshape(4,9)

array([[ 0,  1,  2,  3,  4,  5, 10, 11, 12],
       [13, 14, 15, 20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35, 40, 41, 42],
       [43, 44, 45, 50, 51, 52, 53, 54, 55]])

In [33]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [34]:
# make a 3D array

matrix.reshape(4,3,3)

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [10, 11, 12]],

       [[13, 14, 15],
        [20, 21, 22],
        [23, 24, 25]],

       [[30, 31, 32],
        [33, 34, 35],
        [40, 41, 42]],

       [[43, 44, 45],
        [50, 51, 52],
        [53, 54, 55]]])

In [35]:
matrix.reshape(4,3,3).shape

(4, 3, 3)
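
Two details of `reshape` are worth a quick sketch: one dimension can be given as `-1` so that NumPy infers it from the size, and a shape that does not match the size raises an error. The 36-element array here is a stand-in for the notebook's `matrix`.

```python
import numpy as np

matrix = np.arange(36).reshape(6, 6)

# -1 asks NumPy to infer that dimension from the size: 36 / 9 = 4 rows
inferred = matrix.reshape(-1, 9)
print(inferred.shape)          # (4, 9)

# reshape returns a new array object; the original keeps its shape
print(matrix.shape)            # (6, 6)

# A shape whose product is not 36 raises a ValueError
try:
    matrix.reshape(5, 7)
except ValueError as err:
    print("reshape failed:", err)
```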

In [61]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [62]:
matrix[:3][:2]

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15]])

#### Indexing/Slicing

In [36]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [39]:
# [][] - List-like - works for row and column when getting only one element

matrix[5][2]


52

In [40]:
# [,] - Using both rows and columns indices to get a value
matrix[5,2]

52

In [41]:
matrix[:5,:2]

array([[ 0,  1],
       [10, 11],
       [20, 21],
       [30, 31],
       [40, 41]])

In [44]:
matrix[:5][:2]

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15]])
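
`matrix[:5][:2]` and `matrix[:5, :2]` give different results because chained brackets are applied one after the other, each time to the rows of the previous result. A sketch that spells out the two steps (`step1` and `step2` are illustrative names):

```python
import numpy as np

matrix = np.arange(6) + 10 * np.arange(6).reshape(6, 1)

step1 = matrix[:5]        # first five rows, all columns -> shape (5, 6)
step2 = step1[:2]         # first two rows OF THAT RESULT -> shape (2, 6)
print(np.array_equal(matrix[:5][:2], step2))   # True

# To slice rows and columns together, put both slices in one pair of brackets
print(matrix[:5, :2].shape)                    # (5, 2)
```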

In [45]:
# reshape - change the number of rows and columns - the new shape must be compatible with the size
matrix_reshaped = matrix.reshape((4,9))

In [46]:
matrix_reshaped

array([[ 0,  1,  2,  3,  4,  5, 10, 11, 12],
       [13, 14, 15, 20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35, 40, 41, 42],
       [43, 44, 45, 50, 51, 52, 53, 54, 55]])

In [47]:
# Using both rows and columns indices to get a sub-matrix

matrix_reshaped[:2,:3]

array([[ 0,  1,  2],
       [13, 14, 15]])

In [49]:
# Fun with arrays - build and display a checkerboard array
checkers_board = np.zeros((6,6),dtype=int)
print(checkers_board)

[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]


In [50]:
checkers_board[1::2,::2] = 1
print(checkers_board)

[[0 0 0 0 0 0]
 [1 0 1 0 1 0]
 [0 0 0 0 0 0]
 [1 0 1 0 1 0]
 [0 0 0 0 0 0]
 [1 0 1 0 1 0]]


In [51]:
checkers_board[::2,1::2] = 1
print(checkers_board)

[[0 1 0 1 0 1]
 [1 0 1 0 1 0]
 [0 1 0 1 0 1]
 [1 0 1 0 1 0]
 [0 1 0 1 0 1]
 [1 0 1 0 1 0]]


In [52]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [54]:
indices = [2,3,5]

In [57]:
y = matrix[:,indices]

In [58]:
y

array([[ 2,  3,  5],
       [12, 13, 15],
       [22, 23, 25],
       [32, 33, 35],
       [42, 43, 45],
       [52, 53, 55]])

In [60]:
y[indices,:]

array([[22, 23, 25],
       [32, 33, 35],
       [52, 53, 55]])

In [61]:
(matrix[:,indices])[indices,]

array([[22, 23, 25],
       [32, 33, 35],
       [52, 53, 55]])

In [63]:
matrix[:3, :3]

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [64]:
matrix[indices, indices] # pairs the indices elementwise: returns matrix[2,2], matrix[3,3], matrix[5,5], not the sub-matrix

array([22, 33, 55])
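
When two index lists are passed together, NumPy pairs them element by element. To get the full sub-matrix in a single step, one option is `np.ix_`, which is not used in the original notebook; the sketch below compares it with the two-step version.

```python
import numpy as np

matrix = np.arange(6) + 10 * np.arange(6).reshape(6, 1)
indices = [2, 3, 5]

# Paired lookup: one element per (row, column) pair
diagonal_like = matrix[indices, indices]
print(diagonal_like)                 # [22 33 55]

# np.ix_ builds an open mesh so every row index meets every column index
block = matrix[np.ix_(indices, indices)]
print(block)
# [[22 23 25]
#  [32 33 35]
#  [52 53 55]]

# Same result as the two-step version used in the notebook
print(np.array_equal(block, matrix[:, indices][indices, :]))   # True
```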

#### Array of indices subsetting - use array/list of indices to subset array with only the elements given by the indices

In [74]:
matrix 

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [75]:
indices = [0,2,3]
matrix[indices,]

array([[ 0,  1,  2,  3,  4,  5],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35]])

In [77]:
# columns
matrix[:,indices]


array([[ 0,  2,  3],
       [10, 12, 13],
       [20, 22, 23],
       [30, 32, 33],
       [40, 42, 43],
       [50, 52, 53]])

#### conditional subsetting - use array of booleans to subset array with only the elements where the bool array is True

In [85]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [65]:
# conditional subsetting - get from the array the rows that are at the indices where the condition is True
matrix[(matrix[:,0] > 20)]

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [66]:
matrix[:,0]

array([ 0, 10, 20, 30, 40, 50])

In [67]:
# deconstruct

(matrix[:,0] > 20)

array([False, False, False,  True,  True,  True])

In [69]:
matrix[[False, False, False,  True,  True,  True]]

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [70]:
new_matrix = matrix[(matrix[:,0] > 20)]
new_matrix

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [71]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [76]:
# no of columns
new_matrix.shape[1]

6

In [77]:
# no of rows
new_matrix.shape[0]

3

In [78]:
matrix = np.full((6,6),range(6)) + 10 * np.full((6,6),range(6)).T
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [79]:
# multiple conditions  
(matrix[:,0] > 20) & (matrix[:,0] <= 40)

array([False, False, False,  True,  True, False])

In [82]:
matrix[(matrix[:,0] > 20) & (matrix[:,0] <= 40)]

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45]])

In [80]:
(matrix[:,0] > 20)

array([False, False, False,  True,  True,  True])

In [81]:
(matrix[:,0] <= 40)

array([ True,  True,  True,  True,  True, False])
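
Conditions on arrays are combined with the element-wise operators `&` (and), `|` (or) and `~` (not). The Python keywords `and`, `or` and `not` raise a `ValueError` on arrays with more than one element. A short sketch with illustrative names:

```python
import numpy as np

matrix = np.arange(6) + 10 * np.arange(6).reshape(6, 1)
first_col = matrix[:, 0]             # array([ 0, 10, 20, 30, 40, 50])

# The parentheses are required: & and | bind more tightly than > and <=
inside = (first_col > 20) & (first_col <= 40)
outside = (first_col <= 20) | (first_col > 40)

print(matrix[inside][:, 0])          # [30 40]
print(matrix[outside][:, 0])         # [ 0 10 20 50]
print(np.array_equal(outside, ~inside))   # True
```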

In [84]:
vec = np.array([ 5,  8, 11, 14, 17, 20])
vec

array([ 5,  8, 11, 14, 17, 20])

In [85]:
matrix

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [86]:
new_matrix

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

#### <b>Matrix operations</b>

https://www.tutorialspoint.com/matrix-manipulation-in-python<br>
Arithmetic operators on arrays apply elementwise. <br> 
A new array is created and filled with the result.


#### <b>Array broadcasting</b><br>

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html<br>
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. <br>
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

<img src = "https://www.tutorialspoint.com/numpy/images/array.jpg" height=10/>


https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm
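
The "certain constraints" are the broadcasting rule: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. The helper below is a hypothetical convenience function written for this sketch (it is not part of NumPy); it uses `np.broadcast` to report the resulting shape.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or None if incompatible."""
    try:
        # np.broadcast only inspects shapes, so empty arrays are enough
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        return None

print(broadcast_shape((3, 4), (4,)))     # (3, 4)
print(broadcast_shape((3, 4), (3, 1)))   # (3, 4)
print(broadcast_shape((3, 4), (3,)))     # None
print(broadcast_shape((3, 1), (1, 4)))   # (3, 4)
```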

In [87]:
matrix = np.arange(1,13).reshape(3,4)
matrix


array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [88]:
# create an array with 4 values
vec = np.array([5,10,20,30])
vec

array([ 5, 10, 20, 30])

In [89]:
# addition with a data row
matrix + vec


array([[ 6, 12, 23, 34],
       [10, 16, 27, 38],
       [14, 20, 31, 42]])

In [None]:
####

In [90]:
# create an array with 3 values

vec = np.array([1,2,3])
vec

array([1, 2, 3])

In [91]:
matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [92]:
matrix + vec

ValueError: operands could not be broadcast together with shapes (3,4) (3,) 

In [93]:
matrix + vec.reshape(3,1)

array([[ 2,  3,  4,  5],
       [ 7,  8,  9, 10],
       [12, 13, 14, 15]])

In [94]:
vec.reshape(3,1)

array([[1],
       [2],
       [3]])
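
Turning a 1D vector into a column is common enough that there are several equivalent spellings. The sketch below shows two alternatives to `vec.reshape(3, 1)`; which one to use is a matter of taste.

```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)
vec = np.array([1, 2, 3])

col_a = vec.reshape(3, 1)
col_b = vec[:, np.newaxis]     # insert a new axis of length 1
col_c = vec.reshape(-1, 1)     # let NumPy infer the number of rows

print(col_a.shape, col_b.shape, col_c.shape)   # (3, 1) (3, 1) (3, 1)
print(matrix + col_b)
# [[ 2  3  4  5]
#  [ 7  8  9 10]
#  [12 13 14 15]]
```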

In [95]:
matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [90]:
new_matrix

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [96]:
# addition with a data column

new_matrix + vec.reshape((3,1))

array([[31, 32, 33, 34, 35, 36],
       [42, 43, 44, 45, 46, 47],
       [53, 54, 55, 56, 57, 58]])

In [97]:
##########

matrix


array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [100]:
# no of rows
matrix.shape[0]



6

In [102]:
# addition with a data row - error if dimensions do not match

vec = np.arange(5,21,3)
vec

array([ 5,  8, 11, 14, 17, 20])

In [103]:
##########

new_matrix

array([[30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [104]:
# add a row vec => 
# every respective value from vec is added to  each row

new_matrix + vec


array([[35, 39, 43, 47, 51, 55],
       [45, 49, 53, 57, 61, 65],
       [55, 59, 63, 67, 71, 75]])

In [None]:
# multiplication with a data column

new_matrix 



#### Multiplication (`*`) with a matrix of the same shape multiplies the elements at the respective indices (elementwise)
#### Mathematical matrix multiplication uses the `.dot` method or the `@` operator. For shapes (n1, m1) and (n2, m2) the dimensions need to be compatible, m1 == n2, and the result has shape (n1, m2). Each value in the result is the sum of the products of the paired elements from the respective row of the first matrix and column of the second

<img src = "https://miro.medium.com/max/1400/1*YGcMQSr0ge_DGn96WnEkZw.png" width = "400"/>
     
https://towardsdatascience.com/a-complete-beginners-guide-to-matrix-multiplication-for-data-science-with-python-numpy-9274ecfc1dc6
     

In [96]:
matrix 

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [99]:
matrix2 = np.arange(5,17).reshape(3,4)
matrix2

array([[ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

In [100]:
matrix * matrix2

array([[  5,  12,  21,  32],
       [ 45,  60,  77,  96],
       [117, 140, 165, 192]])

In [101]:
m1 = np.array([[1,2,3], [4,5,6]])
m1

array([[1, 2, 3],
       [4, 5, 6]])

In [102]:
m2 = np.array([[10,11],[20,21],[30,31]])
m2

array([[10, 11],
       [20, 21],
       [30, 31]])

In [103]:
# @ - matrix multiplication
# .dot

m1@m2

array([[140, 146],
       [320, 335]])
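
The comment in the cell mentions `.dot` without running it. The sketch below checks that `.dot` and `@` agree for these 2D arrays, recomputes one entry by hand, and shows how the result shape follows from the operand shapes.

```python
import numpy as np

m1 = np.array([[1, 2, 3], [4, 5, 6]])            # shape (2, 3)
m2 = np.array([[10, 11], [20, 21], [30, 31]])    # shape (3, 2)

product = m1 @ m2                                 # shape (2, 2)
print(np.array_equal(product, m1.dot(m2)))        # True

# Top-left entry: row 0 of m1 times column 0 of m2
print(1 * 10 + 2 * 20 + 3 * 30)                   # 140

# Swapping the operands changes the result shape: (3, 2) @ (2, 3) -> (3, 3)
print((m2 @ m1).shape)                            # (3, 3)
```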

#### <b>More matrix computation</b> - basic aggregate functions are available - min, max, sum, mean

In [None]:
matrix

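#### Use the axis argument to compute an aggregate (e.g. the mean or sum) for each column or row
#### axis = 0 - aggregate down the rows, one result per column
#### axis = 1 - aggregate across the columns, one result per row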

In [104]:
matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [105]:
help(matrix.sum)

Help on built-in function sum:

sum(...) method of numpy.ndarray instance
    a.sum(axis=None, dtype=None, out=None, keepdims=False, initial=0, where=True)
    
    Return the sum of the array elements over the given axis.
    
    Refer to `numpy.sum` for full documentation.
    
    See Also
    --------
    numpy.sum : equivalent function



In [107]:
matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [106]:
# col sum 
matrix.sum(axis = 0)



array([15, 18, 21, 24])

In [108]:
# row sum
matrix.sum(1)



array([10, 26, 42])
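
The other aggregate methods take the same `axis` argument as `sum`. A minimal sketch on the same 3 by 4 matrix:

```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)

# Without axis, the aggregate runs over all elements
print(matrix.min(), matrix.max(), matrix.mean())   # 1 12 6.5

col_means = matrix.mean(axis=0)      # one value per column -> shape (4,)
row_max = matrix.max(axis=1)         # one value per row    -> shape (3,)
print(col_means)                     # [5. 6. 7. 8.]
print(row_max)                       # [ 4  8 12]
```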

In [109]:
dir(matrix)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__

In [111]:
matrix.sum()

81

https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 2 rows and 3 columns with every fifth number starting from 1 (e.g. 1,6,11,16,...)


In [110]:
matrix = np.arange(1, 2*3*5+1, 5).reshape(2,3)

matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

#### Exercise

Normalize the values in the matrix to be between 0 and 1 (min-max normalization). 
Subtract the minimum value and divide by the maximum value of the resulting values.

In [112]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [113]:
min_val = matrix.min()
min_val

1

In [115]:
matrix_min = matrix - min_val
matrix_min

array([[ 0,  5, 10],
       [15, 20, 25]])

In [117]:
max_val = matrix_min.max()
max_val

25

In [118]:
matrix_normalized = matrix_min/max_val

In [119]:
matrix_normalized

array([[0. , 0.2, 0.4],
       [0.6, 0.8, 1. ]])
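
The steps of this exercise amount to the formula x' = (x - min) / (max - min). They can be wrapped in a small function; `min_max_normalize` is a hypothetical helper name, and the guard for constant input is an addition that the exercise does not ask for.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        raise ValueError("all values are equal; min-max normalization is undefined")
    return (values - values.min()) / value_range

matrix = np.arange(1, 2 * 3 * 5 + 1, 5).reshape(2, 3)
print(min_max_normalize(matrix))
# [[0.  0.2 0.4]
#  [0.6 0.8 1. ]]
```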

#### Exercise

Do the same normalization at the row level

In [120]:
matrix 

array([[ 1,  6, 11],
       [16, 21, 26]])

In [121]:
min_row = matrix.min(1)
min_row

array([ 1, 16])

In [122]:
matrix_rmin = matrix - min_row

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

In [123]:
matrix_rmin = matrix - min_row.reshape(2,1)

In [124]:
matrix_rmin

array([[ 0,  5, 10],
       [ 0,  5, 10]])

In [125]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [127]:
max_row = matrix_rmin.max(1)
max_row

array([10, 10])

In [130]:
matrix_rmin / max_row.reshape(max_row.size,1)

array([[0. , 0.5, 1. ],
       [0. , 0.5, 1. ]])
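
The reshape step can be avoided with `keepdims=True`, which keeps the reduced axis with length 1 so that the result broadcasts against the original matrix. This is an alternative sketch of the same row-level normalization, not the notebook's solution.

```python
import numpy as np

matrix = np.arange(1, 2 * 3 * 5 + 1, 5).reshape(2, 3)

row_min = matrix.min(axis=1, keepdims=True)      # shape (2, 1)
row_max = matrix.max(axis=1, keepdims=True)      # shape (2, 1)

row_normalized = (matrix - row_min) / (row_max - row_min)
print(row_normalized)
# [[0.  0.5 1. ]
#  [0.  0.5 1. ]]
```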

#### Exercise

Return the even numbers from the matrix.
Try to return the indices of the even numbers  (hint: look at the where method).

In [131]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [None]:
# help(np.where)

In [132]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [134]:
pos = np.where(matrix == 21)
pos

(array([1]), array([1]))

In [135]:
matrix[pos]

array([21])

In [138]:
7 % 5

2

In [141]:
pos = np.where(matrix % 2 == 0)
pos

(array([0, 1, 1]), array([1, 0, 2]))

In [139]:
matrix % 2 

array([[1, 0, 1],
       [0, 1, 0]])

In [140]:
matrix % 2 == 0

array([[False,  True, False],
       [ True, False,  True]])

In [142]:
pos = np.where(matrix % 2 == 0)
pos

(array([0, 1, 1]), array([1, 0, 2]))

In [143]:
matrix[pos]

array([ 6, 16, 26])
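
For the first part of the exercise the boolean mask is enough on its own, and `np.argwhere` gives the positions as (row, column) pairs. The sketch also shows the three-argument form of `np.where`, which the notebook does not use.

```python
import numpy as np

matrix = np.array([[1, 6, 11], [16, 21, 26]])
is_even = matrix % 2 == 0

# A boolean mask selects the values directly
print(matrix[is_even])            # [ 6 16 26]

# np.argwhere returns one (row, column) pair per True entry
print(np.argwhere(is_even))
# [[0 1]
#  [1 0]
#  [1 2]]

# With three arguments, np.where chooses between two values element by element
print(np.where(is_even, matrix, 0))
# [[ 0  6  0]
#  [16  0 26]]
```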

In [145]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [146]:
matrix.shape

(2, 3)

#### RESOURCES

http://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays   
https://www.python-course.eu/numpy.php   
https://numpy.org/devdocs/user/quickstart.html#universal-functions   
https://www.geeksforgeeks.org/python-numpy/

_____

### Pandas
<img src = "https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width = 200/>

https://commons.wikimedia.org/wiki/File:Pandas_logo.svg

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png" width = 550/>

https://www.geeksforgeeks.org/python-pandas-dataframe/

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array holds elements of a single data type.
* A pandas table can hold many types, one per column. 
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
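
A minimal sketch of such a table, using a hypothetical three-gene `DataFrame` (the `DataFrame` type is introduced later in these notes). Each column reports its own data type through `.dtypes`.

```python
import pandas as pd

# Hypothetical toy table: each column has its own dtype
genes = pd.DataFrame({
    "gene": ["EGFR", "IL6", "BRAF"],
    "snp_count": [3, 4, 3],
    "expression": [2.5, 10.2, 6.7],
})

# One dtype per column: text, integer and float columns side by side
print(genes.dtypes)
```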

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful, though not necessary, to import NumPy at the same time.

```python
import numpy as np
import pandas as pd


```

#### 1. `pd.Series`

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
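
A short sketch of the three arguments in use. The gene labels and counts are the ones used later in this notebook; the remaining names are illustrative.

```python
import pandas as pd

# data + index + dtype, passed by keyword
snps = pd.Series(data=[3, 4, 3], index=["EGFR", "IL6", "BRAF"], dtype="float64")
print(snps["IL6"])        # 4.0  (label-based lookup)

# Without an index, labels default to 0, 1, 2, ...
default_index = pd.Series([3, 4, 3])
print(list(default_index.index))   # [0, 1, 2]

# A scalar is repeated for every label in the index
filled = pd.Series(0, index=["EGFR", "IL6"])
print(filled.to_dict())   # {'EGFR': 0, 'IL6': 0}
```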

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

#### Create a Series from a Python list

In [147]:
import numpy as np
import pandas as pd

In [149]:
#dir(pd)

In [150]:

labels = ["EGFR","IL6","BRAF","ABL"]
values = [3,4,3,6]
gene_snp_no = pd.Series(data = values, index=labels)


In [151]:
gene_snp_no

EGFR    3
IL6     4
BRAF    3
ABL     6
dtype: int64

In [153]:
gene_snp_no.name = "GeneSNPs"

In [154]:
gene_snp_no

EGFR    3
IL6     4
BRAF    3
ABL     6
Name: GeneSNPs, dtype: int64

In [157]:
# Get the data, name, labels, value counts for the series

# gene_snp_no.values
# gene_snp_no.name
gene_snp_no.index

Index(['EGFR', 'IL6', 'BRAF', 'ABL'], dtype='object')

In [158]:
type(gene_snp_no)

pandas.core.series.Series

#### Create a Series from a dictionary

In [159]:
gene_expr_map = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
gene_expr_vals = pd.Series(data = gene_expr_map)


In [161]:
gene_expr_vals

EGFR     2.5
IL6     10.2
BRAF     6.7
ABL      5.3
dtype: float64

In [163]:
## Which genes have an expression greater than 5.5?

gene_expr_vals > 5.5

EGFR    False
IL6      True
BRAF     True
ABL     False
dtype: bool

In [164]:
gene_expr_vals[gene_expr_vals > 5.5]

IL6     10.2
BRAF     6.7
dtype: float64

In [168]:
## Which gene has the highest expression value?

## .idxmax() - Return the index label of the row with the max value

# dir(gene_expr_vals)
gene_expr_vals.idxmax()

'IL6'

In [169]:
gene_expr_vals.id*?


gene_expr_vals.idxmax
gene_expr_vals.idxmin

In [170]:
matrix

array([[ 1,  6, 11],
       [16, 21, 26]])

In [173]:
matrix.shape[1]

3

#### Random data
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

In [None]:
# Create an array filled with random values 
# Results are from the “continuous uniform” distribution over the half-open interval [0, 1).

# help(np.random.random)

In [None]:
# Generate the same random numbers every time
# Set seed

np.random.seed(42) 



In [None]:
# Create an array filled with random values from the standard normal distribution
help(np.random.randn) 
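
The cells above describe the functions without calling them. A minimal sketch of seeding and sampling follows; the last lines use `np.random.default_rng`, a newer NumPy interface that is not part of the original notebook and is shown only as an alternative.

```python
import numpy as np

np.random.seed(42)
uniform_a = np.random.random(3)        # uniform over [0.0, 1.0)
np.random.seed(42)
uniform_b = np.random.random(3)        # same seed -> same numbers
print(np.array_equal(uniform_a, uniform_b))    # True

normal_2x3 = np.random.randn(2, 3)     # standard normal, shape (2, 3)
print(normal_2x3.shape)                # (2, 3)

# Alternative: a Generator object carries its own seed instead of global state
rng = np.random.default_rng(42)
print(rng.random(3))
```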

#### 2. `pd.DataFrame`

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
np.random.seed(42)
expression_array = np.random.random(20).reshape(4,5) * 100
genes = ["HER2","PIK3CA", "BRAF", "IL6"]
samples = ["Sample1","Sample2", "Sample3", "Sample4", "Sample5"]
gene_expr = pd.DataFrame(data = expression_array, 
                         index = genes, 
                         columns = samples)


In [None]:
# .describe() -  generate descriptive statistics 


In [None]:
# Explore DataFrame attributes and methods .T, .shape, .size, .index, .columns 
# Get individual columns .<column_name>
# '.' operator selected columns are just a pd.Series and can be '[]' sliced on further
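
The exploration cells are left empty in the notebook. One way to fill them in, as a sketch that rebuilds `gene_expr` exactly as above so that it runs on its own:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
expression_array = np.random.random(20).reshape(4, 5) * 100
genes = ["HER2", "PIK3CA", "BRAF", "IL6"]
samples = ["Sample1", "Sample2", "Sample3", "Sample4", "Sample5"]
gene_expr = pd.DataFrame(data=expression_array, index=genes, columns=samples)

print(gene_expr.shape)          # (4, 5)
print(gene_expr.size)           # 20
print(gene_expr.T.shape)        # (5, 4)
print(list(gene_expr.index))    # ['HER2', 'PIK3CA', 'BRAF', 'IL6']

# A single column is a pd.Series and can be indexed further by label
sample1 = gene_expr.Sample1     # same as gene_expr["Sample1"]
print(type(sample1).__name__)   # Series
print(sample1["BRAF"] == gene_expr.loc["BRAF", "Sample1"])   # True

# Summary statistics for every sample (column)
print(gene_expr.describe())
```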




In [None]:
###

In [None]:
gene_expr

In [None]:
# we can sort the data by column - get the samples ranked by HER2 expression

gene_expr.T.sort_values(by='HER2', ascending=False)

In [None]:
# We can aggregate data - get the lowest value across samples for each gene

gene_expr.aggregate("min", axis=1)

In [None]:
######

In [None]:
gene_expr

#### The `join` method and `pd.concat` are used to add new rows/columns (`DataFrame.append` was removed in pandas 2.0)

In [None]:
# Add a new sample with the values 54.11, 20.65, 30.52, 96.86

# help(pd.DataFrame.join)
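
Samples are the columns of `gene_expr`, so a new sample is a new column with one value per gene. The sketch below shows three ways to add it; `Sample6` is an assumed name for the new sample, and the frame is rebuilt here so the example runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
gene_expr = pd.DataFrame(
    np.random.random(20).reshape(4, 5) * 100,
    index=["HER2", "PIK3CA", "BRAF", "IL6"],
    columns=["Sample1", "Sample2", "Sample3", "Sample4", "Sample5"],
)

new_values = [54.11, 20.65, 30.52, 96.86]     # one value per gene

# Option 1: assign a new column (on a copy, to keep gene_expr unchanged)
with_assign = gene_expr.copy()
with_assign["Sample6"] = new_values

# Option 2: build a one-column DataFrame and concatenate along the columns
sample6 = pd.DataFrame({"Sample6": new_values}, index=gene_expr.index)
with_concat = pd.concat([gene_expr, sample6], axis=1)

# Option 3: join on the row labels
with_join = gene_expr.join(sample6)

print(with_concat.shape)                      # (4, 6)
print(with_join.loc["HER2", "Sample6"])       # 54.11
```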



               

#### <b>I/O in Pandas</b>

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas handles much of this for you.

#### CSV Files

#### Output

You can easily save your `DataFrames`

In [None]:
gene_expr.to_csv('dataframe_data.csv')

In [None]:
# help(gene_expr.to_csv)

In [None]:
gene_expr.to_csv('dataframe_data.csv', index = True)

#### Input

You can easily bring data from a file into a `DataFrame`

In [None]:
pd.read_csv('dataframe_data.csv', index_col = 0)
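
The `index_col = 0` argument matters because `to_csv` writes the row labels as the first column of the file. The sketch below makes the round trip through an in-memory buffer instead of a file; the two-gene table and its values are hypothetical.

```python
import io
import pandas as pd

gene_expr = pd.DataFrame(
    {"Sample1": [37.5, 15.6], "Sample2": [95.1, 5.8]},   # hypothetical values
    index=["HER2", "PIK3CA"],
)

# to_csv returns the CSV text when no path is given
csv_text = gene_expr.to_csv()
print(csv_text)

# index_col=0 tells read_csv that the first column holds the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.index))              # ['HER2', 'PIK3CA']

# Without index_col, the labels come back as an ordinary unnamed column
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))               # ['Unnamed: 0', 'Sample1', 'Sample2']
```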

##### Excel Files (.to_excel(), .read_excel())
##### TSV Files (.to_csv( , sep = "\t"), .read_csv( , sep = "\t"))
##### Clipboard (.to_clipboard(), .read_clipboard() )

____

#### <b>Indexing/Exploring/Manipulating in Pandas</b>

Standard `'[]'` indexing/slicing can be used, as well as `'.'` attribute access for columns.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label/name-based
2. `.iloc` -> primarily integer-based

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')


Pandas allows you to do random sampling from the dataframe

In [None]:
df_small = df_iris.sample(n=5)
df_small

In [None]:
# or see the first 5 rows: .head()

df_iris.head()

In [None]:
### 

df_iris

#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**
Named rows can be selected by a range of the names

#### Selection <b>BY NAME</b>: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>
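
A minimal sketch of `.loc` on a hypothetical four-row table. The column names mimic the iris data, but the values and the row labels `a` to `d` are made up so that the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy table; the column names mimic the iris data
flowers = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "petal_length": [1.4, 4.7, 6.0, 1.5],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    },
    index=["a", "b", "c", "d"],
)

# Single value: row label, column label
print(flowers.loc["b", "species"])                 # versicolor

# Label slices include BOTH ends: rows a..c, columns sepal_length..petal_length
subset = flowers.loc["a":"c", "sepal_length":"petal_length"]
print(subset.shape)                                # (3, 2)

# Lists of labels select in the order given
print(flowers.loc[["d", "a"], ["species"]])
```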

In [None]:
df_iris.head()

#### Boolean indexing - returns rows that meet the condition
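
Boolean indexing on a `DataFrame` works like the conditional subsetting shown for NumPy arrays: a comparison produces a boolean `Series`, and that `Series` selects the rows. A sketch on a hypothetical toy table:

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

is_setosa = flowers["species"] == "setosa"        # boolean Series
print(flowers[is_setosa])

# Combine conditions with & and |, each wrapped in parentheses
long_non_setosa = flowers[(flowers["sepal_length"] > 6.5) & ~is_setosa]
print(long_non_setosa["species"].tolist())        # ['versicolor']

# .loc accepts the same mask plus a column selection
print(flowers.loc[is_setosa, "sepal_length"].tolist())   # [5.1, 4.9]
```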

#### Selection <b>BY POSITION</b>: the `.iloc` method

<b>A slice of specific items (based on position)</b>
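
A matching sketch for `.iloc`, again on a hypothetical toy table. Unlike `.loc`, position slices exclude the stop position, as in Python lists.

```python
import pandas as pd

flowers = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    },
    index=["a", "b", "c", "d"],
)

print(flowers.iloc[0, 1])                     # setosa  (first row, second column)

# Position slices exclude the stop: rows 0 and 1 only
print(flowers.iloc[:2].index.tolist())        # ['a', 'b']

# A list of positions picks rows in the order given
print(flowers.iloc[[3, 0]].index.tolist())    # ['d', 'a']
```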

In [None]:
# we can use a list of indices



#### Quick Exploration of the data

In [None]:
help(df_iris.groupby)
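
`groupby` splits the rows by the values of a key column, applies a function to each group, and combines the results. A minimal sketch on a hypothetical table with iris-like column names (the values are made up):

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal_length": [5.0, 5.5, 6.0, 6.5],
    "petal_length": [1.25, 1.75, 4.25, 4.75],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# One row per species, one column per numeric measurement
species_means = flowers.groupby("species").mean()
print(species_means)

# Select a column before aggregating to get a Series
petal_means = flowers.groupby("species")["petal_length"].mean()
print(petal_means["setosa"])        # 1.5
```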

In [None]:
# get the mean of the four characteristics grouped by species




In [None]:
# bar plot of petal length mean per species



In [None]:
## boxplot of the mean of the 4 characteristics 
## which one varies the most and the least between species?



In [None]:
## check the dataframe for missing values (NAs)

df_iris.isnull().any().any()


#### Exercise

In [None]:
## boxplot of the mean of the 4 characteristics for the species setosa



In [None]:
## histogram of the sepal length for the versicolor species




In [None]:
## Replace all values for the species "virginica" where sepal_length >7.5 or <  5.5 with np.nan
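
One possible approach to this exercise, sketched on a hypothetical toy table rather than on `df_iris`. It assumes that "replace all values" means replacing the `sepal_length` value itself; other readings of the exercise are possible.

```python
import numpy as np
import pandas as pd

flowers = pd.DataFrame({
    "sepal_length": [7.7, 6.3, 5.0, 7.9],
    "species": ["virginica", "virginica", "virginica", "setosa"],
})

out_of_range = (flowers["sepal_length"] > 7.5) | (flowers["sepal_length"] < 5.5)
target = (flowers["species"] == "virginica") & out_of_range

# .loc with a boolean mask and a column label assigns in place
flowers.loc[target, "sepal_length"] = np.nan

print(flowers["sepal_length"].isna().tolist())   # [True, False, True, False]
print(flowers.isna().any().any())                # True
```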




In [None]:
## check the dataframe for missing values .isna().any()



#### RESOURCES

https://www.python-course.eu/pandas.php    
https://www.python-course.eu/numpy.php    
https://scipy-lectures.org/packages/statistics/index.html?highlight=pandas  
https://www.geeksforgeeks.org/pandas-tutorial/

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
