## NumPy
### BIOINF 575 - Fall 2023



_____


### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width = "100">

____
#### A list contains references to each of its values.
#### An array refers to a block of memory containing all values one after the other.
- <b>that is why we need to know the size of the array in advance and why the array size cannot change</b> <br>


<img src = "https://www.python-course.eu/images/list_structure.png" width = 350 /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src = "https://www.python-course.eu/images/array_structure.png" width = 350 />
____

#### Arrays of different dimensions (`shape` gives the number of elements on each dimension):
<img src="https://raw.githubusercontent.com/elegant-scipy/elegant-scipy/master/figures/NumPy_ndarrays_v2.svg" alt="data structures" width="600">  

https://github.com/elegant-scipy/elegant-scipy
_____


#### <b>NumPy basics</b>

Arrays are designed to:
* <b>handle vectorized operations (lists cannot do that)</b>
    - if you apply a function it is performed on every item in the array, rather than on the whole array object
    - both arrays and lists have 0-based indexing
* <b>store multiple items of the same data type</b>
* <b>handle missing values </b>
    - missing numerical values are represented using the `np.nan` object (not a number)
    - the object `np.inf` represents infinity  
* <b>have an unchangeable size</b>
    - array size cannot be changed, should create a new array if you want to change the size
    - you know when you create the array how much space you need for it and that will not change  
* <b>have efficient memory usage</b>
    - an equivalent numpy array occupies much less space than a python list of lists
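
The properties listed above can be sketched in a few lines. This is a minimal illustration, not part of the original course material; the variable names are made up for the example.

```python
import numpy as np

values = np.array([1.5, np.nan, 3.0])

# Vectorized: the operation is applied to every element
doubled = values * 2

# One data type for all elements: mixing ints and floats upcasts to float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                # float64

# np.nan marks a missing value; np.isnan finds it
print(np.isnan(values))           # [False  True False]

# The size is fixed: np.append returns a new array, the original is unchanged
longer = np.append(values, 4.0)
print(values.size, longer.size)   # 3 4
```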

#### <b>Basic array attributes:</b>
* shape: array dimension
* size: Number of elements in array
* ndim: Number of array dimension (len(arr.shape))
* dtype: Data-type of the array

#### <b>Importing NumPy</b>
The recommended convention to import numpy is to use the <b>np</b> alias:

In [None]:
import numpy as np

#### <b>Documentation and help</b>
https://numpy.org/doc/

In [None]:
# np.lookfor('sum') 

In [None]:
np.me*?

In [None]:
# np.mean?

In [None]:
# help(np.mean)

#### <b>Motivating example</b> - transform temperatures from Celsius to Fahrenheit

In [None]:
temp_list_C = [-20, 25, 3, 10]

In [None]:
# using lists we need a loop to apply the formula to each element of the list
temp_list_F = []

for temp in temp_list_C:
    temp_list_F.append(temp * 1.8 + 32)

temp_list_F

In [None]:
# using arrays we can apply the formula directly to the array and it will be applied to each element

temp_array_C = np.array(temp_list_C)
temp_array_C

In [None]:
temp_array_F = temp_array_C * 1.8 + 32
temp_array_F

#### <b>Functions for creating arrays</b>
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

##### np.array() - array from lists - e.g. 2D array from a list of lists

In [None]:
# help(np.array)



##### np.arange() - vector of evenly spaced values from a range (arange) given by start, stop and step

In [None]:
# help(np.arange)



##### np.linspace() - vector of evenly spaced values (known number, linspace) given by start, stop and number of points

In [None]:
# help(np.linspace)



##### np.zeros() - array of zeros (e.g. 3D array), there is also a np.ones()

In [None]:
# help(np.zeros)



##### More functions to create special arrays:      
    np.identity(n) - 2D square array filled with 1 on the diagonal      
    np.eye(n,m) - 2D array filled with 1 on the diagonal      
    np.full((n,m), val) - array filled with a given value     
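
A short sketch of the creation functions named in this section. The shapes and values are arbitrary choices for illustration.

```python
import numpy as np

from_lists = np.array([[1, 2, 3], [4, 5, 6]])   # 2D array from a list of lists
stepped = np.arange(0, 10, 2)                   # start, stop (excluded), step
spaced = np.linspace(0, 1, 5)                   # start, stop (included), number of points
zeros_3d = np.zeros((2, 3, 4))                  # 3D array of zeros
filled = np.full((2, 3), 7)                     # 2 x 3 array filled with 7
identity = np.eye(3)                            # 3 x 3 array with 1 on the diagonal

print(stepped)   # [0 2 4 6 8]
print(spaced)    # [0.   0.25 0.5  0.75 1.  ]
```

Note the difference between the two range functions: `np.arange` excludes the stop value, while `np.linspace` includes it by default.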

#### <b>Basic array attributes:</b>
* shape: array dimension
* size: Number of elements in array
* ndim: Number of array dimension (len(arr.shape))
* dtype: Data-type of the array
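
As a quick illustration of these four attributes, here is a sketch on a small 2 x 3 array (the name `example` is only used here).

```python
import numpy as np

example = np.array([[1, 2, 3], [4, 5, 6]])

print(example.shape)   # (2, 3): 2 rows, 3 columns
print(example.size)    # 6 elements in total
print(example.ndim)    # 2 dimensions, same as len(example.shape)
print(example.dtype)   # an integer type; the exact width depends on the platform
```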

In [None]:
# nested lists give us multi dimensional arrays

matrix = np.array([[1,2,3],[4,5,6]])
matrix

In [None]:
# dir(matrix)

In [None]:
# .size - total number of elements in the array



In [None]:
# .shape tells us the size on each dimension and, implicitly, the number of dimensions



In [None]:
# .ndim - number of array dimensions



In [None]:
# .dtype - type of the data stored in the array



In [None]:
matrix

In [None]:
# .T - transpose of the array (rows and columns switched)


#### <b>Reshaping</b> - changing the numbers of rows and columns - data and size stay the same
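
A minimal sketch of reshaping, assuming a 12-element array chosen for illustration. The `-1` placeholder asks NumPy to infer that dimension from the total size.

```python
import numpy as np

flat = np.arange(1, 13)            # 12 values: 1..12
table = flat.reshape((3, 4))       # 3 rows x 4 columns, same 12 values
tall = table.reshape((6, -1))      # -1 lets NumPy infer the other dimension (2)

print(table.shape, tall.shape)     # (3, 4) (6, 2)

# The total number of elements must match, otherwise reshape raises ValueError
try:
    flat.reshape((5, 3))
except ValueError as err:
    print("cannot reshape:", err)
```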

In [None]:
# .reshape((n,m)) - Reshaping



#### <b>Indexing/Slicing(subsetting): [][] or [,]</b>
___
<img src = "http://scipy-lectures.org/_images/numpy_indexing.png" width = 400/>

In [None]:
matrix = np.full((6,6),range(6)) + 10 * np.full((6,6),range(6)).T
matrix

#### Indexing/Slicing
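
The two indexing styles can be compared on a 6 x 6 array laid out like the figure, where each value is 10 * row + column. The construction below is an equivalent, assumed way to build it and the name `grid` is only used in this sketch.

```python
import numpy as np

# Same 6 x 6 layout as the figure: value = 10 * row + column
grid = np.arange(6) + 10 * np.arange(6).reshape(6, 1)

print(grid[2][3])      # 23, list-like: select row 2, then element 3
print(grid[2, 3])      # 23, row and column in one step
print(grid[0, 3:5])    # [3 4]
print(grid[:, 2])      # [ 2 12 22 32 42 52]

corner = grid[4:, 4:]  # bottom-right 2 x 2 block
```

The `[,]` form is usually preferable: `grid[2][3]` first builds the intermediate row `grid[2]` and then indexes it, and `grid[:][2]` selects a row, not a column.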

In [None]:
# [][] - List-like 




In [None]:
# [,] - Using both rows and columns indices to get a value


In [None]:
matrix

In [None]:
# Using both rows and columns indices to get a sub-matrix

matrix[:2,:3]

In [None]:
# Fun with arrays - build a checkers board array
checkers_board = np.zeros((6,6),dtype=int)
print(checkers_board)

In [None]:
checkers_board[1::2,::2] = 1
print(checkers_board)

In [None]:
checkers_board[::2,1::2] = 1
print(checkers_board)

#### Array of indices subsetting - use array/list of indices to subset array with only the elements given by the indices
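
A small sketch of subsetting with a list of indices, for rows, for columns, and for individual elements. The array `grid` is an assumed stand-in with value = 10 * row + column.

```python
import numpy as np

grid = np.arange(6) + 10 * np.arange(6).reshape(6, 1)
indices = [0, 2, 3]

rows = grid[indices, :]       # rows 0, 2 and 3, all columns
cols = grid[:, indices]       # all rows, columns 0, 2 and 3
pairs = grid[[0, 1], [4, 5]]  # element-wise pairs: positions (0, 4) and (1, 5)

print(pairs)                  # [ 4 15]
```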

In [None]:
matrix 

In [None]:
indices = [0,2,3]
matrix[indices,]

In [None]:
# columns



#### Conditional subsetting - use an array of booleans to keep only the elements where the boolean array is True
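
The comparison produces a boolean array, and indexing with it keeps the positions that are `True`. A minimal sketch, using an assumed 6 x 6 array `grid` with value = 10 * row + column:

```python
import numpy as np

grid = np.arange(6) + 10 * np.arange(6).reshape(6, 1)

mask = grid[:, 0] > 20          # one boolean per row
print(mask)                     # [False False False  True  True  True]
selected_rows = grid[mask]      # keeps rows 3, 4 and 5

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = (grid[:, 0] > 20) & (grid[:, 0] <= 40)
print(grid[in_range][:, 0])     # [30 40]
```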

In [None]:
matrix

In [None]:
# conditional subsetting
matrix[(matrix[:,0] > 20)]

In [None]:
# deconstruct



In [None]:
matrix

In [None]:
# multiple conditions  
(matrix[:,0] > 20) & (matrix[:,0] <= 40)

#### <b>Matrix operations</b>

https://www.tutorialspoint.com/matrix-manipulation-in-python<br>
Arithmetic operators on arrays apply elementwise. <br> 
A new array is created and filled with the result.


#### <b>Array broadcasting</b><br>

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html<br>
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. <br>
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

<img src = "https://www.tutorialspoint.com/numpy/images/array.jpg" height=10/>


https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm
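
A minimal sketch of broadcasting on a 3 x 4 array. The values in `row` and `column` are arbitrary, chosen so the effect is easy to see.

```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)      # shape (3, 4)

row = np.array([10, 20, 30, 40])             # shape (4,): added to every row
plus_row = matrix + row

column = np.array([[100], [200], [300]])     # shape (3, 1): added to every column
plus_column = matrix + column

# A 1D array of 3 values has shape (3,), which does not match the last axis (4)
try:
    matrix + np.array([100, 200, 300])
except ValueError as err:
    print("shapes not compatible:", err)
```

Shapes are compared from the last axis backwards, and two sizes are compatible when they are equal or one of them is 1. That is why the 3 values have to be arranged as a (3, 1) column before they can be combined with the (3, 4) array.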

In [None]:
matrix = np.arange(1,13).reshape(3,4)
matrix


In [None]:
# create an array with 4 values



In [None]:
# addition with a data row



In [None]:
####

In [None]:
# create an array with 3 values



In [None]:
matrix

In [None]:
# addition with a data column



In [None]:
##########

matrix


In [None]:
# column vector



In [None]:
# addition with a data row - error if dimensions do not match




In [None]:
##########

matrix

In [None]:
# column vec



In [None]:
# multiplication with a data column




#### Multiplication (`*`) with a matrix of the same shape multiplies the elements at the respective indices
#### Mathematical matrix multiplication uses the `.dot` method or the `@` operator. For shapes (n1, m1) and (n2, m2) the dimensions need to be compatible, m1 == n2, and the result has shape (n1, m2). Each value in the result is the sum of the products of the paired elements from the respective row and column

<img src = "https://miro.medium.com/max/1400/1*YGcMQSr0ge_DGn96WnEkZw.png" width = "400"/>
     
https://towardsdatascience.com/a-complete-beginners-guide-to-matrix-multiplication-for-data-science-with-python-numpy-9274ecfc1dc6
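
The difference between element-wise and mathematical matrix multiplication, sketched on two small arrays made up for this example:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
b = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # shape (3, 2)

elementwise = a * a              # same shape: multiplies matching positions
product = a @ b                  # (2, 3) @ (3, 2) -> (2, 2); same as a.dot(b)

print(product)
# [[ 7  8]
#  [16 17]]
```

For instance, the top-left value is 1*1 + 2*0 + 3*2 = 7: the first row of `a` paired with the first column of `b`.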
     

#### <b>More matrix computation</b> - basic aggregate functions are available - min, max, sum, mean

In [None]:
matrix

#### Use the axis argument to compute mean for each column or row
#### axis = 0 - columns
#### axis = 1 - rows
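
A short sketch of the `axis` argument on a 3 x 4 array (an assumed example array with the values 1 to 12):

```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)

print(matrix.sum())          # 78, all elements
print(matrix.sum(axis=0))    # [15 18 21 24], one value per column
print(matrix.sum(axis=1))    # [10 26 42], one value per row
print(matrix.mean(axis=0))   # [5. 6. 7. 8.]
```

One way to remember it: the axis you name is the one that gets collapsed, so `axis=0` collapses the rows and leaves one result per column.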

In [None]:
help(matrix.sum)

In [None]:
matrix

In [None]:
# col sum 




In [None]:
# row sum




https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 2 rows and 3 columns with every fifth number starting from 1 (e.g. 1,6,11,16,...)


In [None]:
matrix = np.arange(1, 2*3*5+1, 5).reshape(2,3)

matrix

#### <font color = "red">Exercise:</font>   


Normalize the values in the matrix to be between 0 and 1 (min-max normalization).     
Subtract the minimum value, then divide by the maximum of the resulting values.

#### <font color = "red">Exercise:</font>   

Do the same normalization at the row level

#### <font color = "red">Exercise:</font>   


* Return the even numbers from the matrix.
* Try to return the indices of the even numbers  (hint: look at the where method).

In [None]:
# help(np.where)

In [None]:
matrix

In [None]:
pos = np.where(matrix == 3)
pos

In [None]:
matrix[pos]

#### RESOURCES

http://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays   
https://www.python-course.eu/numpy.php   
https://numpy.org/devdocs/user/quickstart.html#universal-functions   
https://www.geeksforgeeks.org/python-numpy/

_____

### Pandas
<img src = "https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width = 200/>

https://commons.wikimedia.org/wiki/File:Pandas_logo.svg

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png" width = 550/>

https://www.geeksforgeeks.org/python-pandas-dataframe/

#### How does pandas work?

Pandas is built off of [Numpy](http://www.numpy.org/), and therefore leverages Numpy's C-level speed for its data analysis.

* A NumPy array can only hold data of a single type.
* Pandas can use many types. 
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
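
The difference can be seen by giving both libraries mixed input. This is a sketch with made-up values; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# A NumPy array coerces mixed input to one common type (here: strings)
mixed_array = np.array(["EGFR", 3, 2.5])
print(mixed_array.dtype.kind)    # U (unicode strings)

# A DataFrame keeps one type per column
table = pd.DataFrame({"gene": ["EGFR", "IL6"],
                      "snp_count": [3, 4],
                      "expression": [2.5, 10.2]})
print(table.dtypes)
```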

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful, though not necessary, to import NumPy at the same time.

```python
import numpy as np
import pandas as pd


```

#### 1. `pd.Series`

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
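
A minimal sketch of how the three parameters behave. The values are illustrative only.

```python
import pandas as pd

# Explicit index labels and dtype
snp_counts = pd.Series(data=[3, 4, 3], index=["EGFR", "IL6", "BRAF"], dtype="float64")

# Without an index, rows are labeled 0, 1, 2, ...
default_index = pd.Series([3, 4, 3])

# Index labels may repeat; selecting a repeated label returns all matches
repeated = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(repeated["a"].tolist())    # [1, 2]
```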

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

#### Create a Series from a Python list

In [None]:
import numpy as np
import pandas as pd

In [None]:

labels = ["EGFR","IL6","BRAF","ABL"]
values = [3,4,3,6]
gene_snp_no = pd.Series(data = values, index=labels)


In [None]:
# Get the data, name, labels, value counts for the series



#### Create a Series from a dictionary

In [None]:
gene_expr_map = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
gene_expr_vals = pd.Series(data = gene_expr_map)


In [None]:
## Which genes have an expression greater than 5.5?



In [None]:
## Which gene has the highest expression value?

## .idxmax() - Return the index of the row with the max value



#### Random data
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html
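
A short sketch of why setting the seed matters: resetting it replays the same sequence of random numbers. The seed value 42 is arbitrary.

```python
import numpy as np

np.random.seed(42)
first = np.random.random(3)       # uniform values in [0, 1)

np.random.seed(42)                # resetting the seed restarts the sequence
second = np.random.random(3)

print(np.array_equal(first, second))   # True

normal = np.random.randn(2, 3)    # 2 x 3 array from the standard normal distribution
```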

In [None]:
# Create an array filled with random values 
# Results are from the “continuous uniform” distribution over the half-open interval [0, 1).

# help(np.random.random)

In [None]:
# Generate the same random numbers every time
# Set seed

np.random.seed(42) 



In [None]:
# Create an array filled with random values from the standard normal distribution
help(np.random.randn) 

#### 2. `pd.DataFrame`

**Two-dimensional** labeled data structure (rows and columns) with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types
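
A minimal sketch of building a `DataFrame` from a dictionary, where each key becomes a column with its own type. The values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration
gene_info = pd.DataFrame(
    data={"snp_count": [3, 4, 3],
          "expression": [2.5, 10.2, 6.7],
          "oncogene": [True, False, True]},
    index=["EGFR", "IL6", "BRAF"],
)

print(gene_info.shape)      # (3, 3)
print(gene_info.dtypes)     # one dtype per column
```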

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
np.random.seed(42)
expression_array = np.random.random(20).reshape(4,5) * 100
genes = ["HER2","PIK3CA", "BRAF", "IL6"]
samples = ["Sample1","Sample2", "Sample3", "Sample4", "Sample5"]
gene_expr = pd.DataFrame(data = expression_array, 
                         index = genes, 
                         columns = samples)


In [None]:
# .describe() -  generate descriptive statistics 


In [None]:
# Explore DataFrame attributes and methods .T, .shape, .size, .index, .columns 
# Get individual columns .<column_name>
# '.' operator selected columns are just a pd.Series and can be '[]' sliced on further




In [None]:
###

In [None]:
gene_expr

In [None]:
# we can sort the data by column - get the samples ranked by HER2 expression

gene_expr.T.sort_values(by='HER2', ascending=False)

In [None]:
# We can aggregate data - get the lowest value of each gene across samples

gene_expr.aggregate("min", axis=1)

In [None]:
######

In [None]:
gene_expr

#### `join` and `pd.concat` are used to add new rows/columns (`DataFrame.append` was removed in pandas 2.0)
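
One possible way to add a column and a row, sketched with `pd.concat` and `join` on a small stand-in table. The two-sample table and the gene name `GENE_X` are made up for this example.

```python
import numpy as np
import pandas as pd

genes = ["HER2", "PIK3CA", "BRAF", "IL6"]
np.random.seed(42)
gene_expr = pd.DataFrame(np.random.random((4, 2)) * 100,
                         index=genes, columns=["Sample1", "Sample2"])

# New column: a named Series aligned on the gene labels
new_sample = pd.Series([54.11, 20.65, 30.52, 96.86], index=genes, name="Sample3")
with_sample = pd.concat([gene_expr, new_sample], axis=1)    # axis=1 adds columns
same_result = gene_expr.join(new_sample)                    # join aligns on the index

# New row: concat along axis=0 with a one-row DataFrame (gene name is made up)
new_gene = pd.DataFrame([[1.0, 2.0]], index=["GENE_X"], columns=gene_expr.columns)
with_gene = pd.concat([gene_expr, new_gene], axis=0)
```

Both calls return a new `DataFrame` and leave `gene_expr` unchanged.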

In [None]:
# Add a new sample with the values 54.11, 20.65, 30.52, 96.86

# help(pd.DataFrame.join)



               

#### <b>I/O in Pandas</b>

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas deals with a lot of this.

#### CSV Files

#### Output

You can easily save a `DataFrame` to a file.

In [None]:
gene_expr.to_csv('dataframe_data.csv')

In [None]:
# help(gene_expr.to_csv)

In [None]:
gene_expr.to_csv('dataframe_data.csv', index = True)

#### Input

You can easily bring data from a file into a `DataFrame`.

In [None]:
pd.read_csv('dataframe_data.csv', index_col = 0)

##### ------

##### Excel Files (.to_excel(), .read_excel())
##### TSV Files (.to_csv( , sep = "\t"), .read_csv( , sep = "\t"))
##### Clipboard (.to_clipboard(), .read_clipboard() )
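
A sketch of a tab-separated round trip. To avoid writing a file, it keeps the text in memory with `io.StringIO`; the table values are made up.

```python
import io
import pandas as pd

table = pd.DataFrame({"Sample1": [37.5, 95.1], "Sample2": [73.2, 59.9]},
                     index=["HER2", "BRAF"])

# to_csv returns the text when no path is given; sep="\t" makes it tab-separated
tsv_text = table.to_csv(sep="\t")

# read_csv accepts a path or a file-like object; index_col=0 restores the row labels
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(restored)
```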

____

#### <b>Indexing/Exploring/Manipulating in Pandas</b>

Standard `'[]'` indexing/slicing can be used, as well as `'.'` methods.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label/name-based
2. `.iloc` -> primarily integer-based

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')


Pandas allows you to do random sampling from the dataframe

In [None]:
df_small = df_iris.sample(n=5)
df_small

In [None]:
# or see the first 5 rows: .head()

df_iris.head()

In [None]:
### 

df_iris

#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**
Named rows can be selected by a range of the names

#### Selection <b>BY NAME</b>: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>
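
A minimal sketch of label-based selection on a small hypothetical table with labeled rows (the values and the row labels `a` to `d` are made up):

```python
import pandas as pd

# Small hypothetical table with labeled rows
flowers = pd.DataFrame(
    {"sepal_length": [5.1, 7.0, 6.3, 5.8],
     "species": ["setosa", "versicolor", "virginica", "virginica"]},
    index=["a", "b", "c", "d"],
)

print(flowers.loc["b", "sepal_length"])          # 7.0
subset = flowers.loc["b":"c", ["sepal_length"]]  # rows b AND c: the stop label is included
print(subset.shape)                              # (2, 1)
```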

In [None]:
df_iris.head()

#### Boolean indexing - returns rows that meet the condition
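
A sketch of boolean indexing on a small hypothetical table; the column names follow the iris data but the values are made up.

```python
import pandas as pd

flowers = pd.DataFrame({"sepal_length": [5.1, 7.0, 6.3, 5.8],
                        "species": ["setosa", "versicolor", "virginica", "virginica"]})

# Rows where the condition is True
long_sepals = flowers[flowers["sepal_length"] > 6]

# Several conditions: combine with & or |, each one in parentheses
long_virginica = flowers.loc[(flowers["species"] == "virginica")
                             & (flowers["sepal_length"] > 6)]

print(len(long_sepals), len(long_virginica))     # 2 1
```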

#### Selection <b>BY POSITION</b>: the `.iloc` method

<b>A slice of specific items (based on position)</b>
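
A sketch of position-based selection on a small hypothetical table. Unlike `.loc`, an `.iloc` slice excludes the stop position, as in plain Python.

```python
import pandas as pd

flowers = pd.DataFrame({"sepal_length": [5.1, 7.0, 6.3, 5.8],
                        "petal_length": [1.4, 4.7, 6.0, 5.1]},
                       index=["a", "b", "c", "d"])

first_two = flowers.iloc[0:2, :]        # rows at positions 0 and 1 (stop excluded)
picked = flowers.iloc[[0, 3], [1]]      # lists of positions: rows 0 and 3, column 1
print(first_two.index.tolist())         # ['a', 'b']
print(picked.values.tolist())           # [[1.4], [5.1]]
```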

In [None]:
# we can use a list of indices



#### Quick Exploration of the data
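
Grouping splits the rows by the values of one column and then summarizes each group. A minimal sketch on a small hypothetical table, with column names borrowed from the iris data and made-up values:

```python
import pandas as pd

flowers = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "virginica"],
                        "sepal_length": [5.0, 5.2, 6.3, 6.5],
                        "petal_length": [1.0, 2.0, 6.0, 5.0]})

# One row per species, one mean per numeric column
species_means = flowers.groupby("species").mean()
print(species_means)
```

The result is itself a `DataFrame` indexed by species, so it can be sliced with `.loc` or passed on to plotting methods.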

In [None]:
help(df_iris.groupby)

In [None]:
# get the mean of the four characteristics grouped by species




In [None]:
# bar plot of petal length mean per species



In [None]:
## boxplot of the mean of the 4 characteristics 
## which one varies the most and the least between species?



In [None]:
## check the dataframe for NAs - True if any value in any column is missing

df_iris.isnull().any().any()


#### Exercise

In [None]:
## boxplot of the mean of the 4 characteristics for the species setosa



In [None]:
## histogram of the sepal length for the versicolor species




In [None]:
## Replace all values for the species "virginica" where sepal_length >7.5 or <  5.5 with np.nan




In [None]:
## check the dataframe for missing values .isna().any()



#### RESOURCES

https://www.python-course.eu/pandas.php    
https://www.python-course.eu/numpy.php    
https://scipy-lectures.org/packages/statistics/index.html?highlight=pandas  
https://www.geeksforgeeks.org/pandas-tutorial/

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
