## NumPy, Pandas
### BIOINF 575 - Fall 2021



_____


### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width = "100">

____
#### The list contains references to each of the values.
#### The array refers to a block of memory containing all values one after the other.
- <b>that is why we need to know the size of the array in advance, and why the array size cannot change</b> <br>


<img src = "https://www.python-course.eu/images/list_structure.png" width = 350 /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src = "https://www.python-course.eu/images/array_structure.png" width = 350 />
____

#### Arrays of different dimensions (`shape` gives the number of elements on each dimension):

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="data structures" width="500">  

_____


#### <b>NumPy basics</b>

Arrays are designed to:
* <b>handle vectorized operations (lists are not)</b>
    - if you apply a function, it is performed on every item in the array, rather than on the whole array object
    - both arrays and lists have 0-based indexing
* <b>store multiple items of the same data type</b>
* <b>handle missing values </b>
    - missing numerical values are represented using the `np.nan` object (not a number)
    - the object `np.inf` represents infinity  
* <b>have an unchangeable size</b>
    - array size cannot be changed; to resize, create a new array
    - you know when you create the array how much space you need for it and that will not change  
* <b>have efficient memory usage</b>
    - an equivalent numpy array occupies much less space than a python list of lists
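
A minimal sketch of three of the properties listed above (single data type, missing values, memory usage). The variable names are illustrative, and the exact byte counts depend on the Python build, so only the comparison matters:

```python
import sys
import numpy as np

# an array stores one data type: mixing ints and a float upcasts everything to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# missing values are marked with np.nan; nan-aware functions skip them
with_missing = np.array([1.0, np.nan, 3.0])
print(np.isnan(with_missing))      # which entries are missing
print(np.nanmean(with_missing))    # mean of the non-missing values

# rough memory comparison: the list stores references plus one Python int object per value
values = list(range(1000))
values_array = np.array(values, dtype=np.int64)
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(values_array.nbytes, "bytes in the array vs about", list_bytes, "bytes in the list")
```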

#### <b>Importing NumPy</b>
The recommended convention is to import numpy under the <b>np</b> alias:

In [None]:
import numpy as np

#### <b>Documentation and help</b>
https://numpy.org/doc/

In [None]:
# np.lookfor('sum') 

In [None]:
np.me*?

In [None]:
# np.mean?

In [None]:
# help(np.mean)

#### <b>Motivating example</b> - change weight from grams to ounces

In [None]:
weight_list_g = [20, 5, 30, 100]

In [None]:
# using lists we need a comprehension to apply the formula to each element of the list
weight_list_oz = [weightg*0.03527396195 for weightg in weight_list_g]
weight_list_oz

In [None]:
# using arrays we can apply the formula directly to the array and it will be applied to each element

np.array(weight_list_g)

In [None]:
weight_array_oz = np.array(weight_list_g)*0.03527396195
weight_array_oz

In [None]:
# plot the values
# the following command shows the plot in the notebook rather than in a pop-up window
%matplotlib inline   

# the following command imports the module used for plotting
import matplotlib.pyplot as plt  

plt.plot(weight_array_oz)
plt.show()

#### <b>Functions for creating arrays</b>
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

In [None]:
# Creating arrays - the different functions used to create arrays:

vector_list = np.array([[1,2,3,4], [40,60,70,80], [101, 202, 303, 404]])
print("2D array from a list of lists: \n", vector_list, "\n")

vector_range = np.arange(3,18,3) # Evenly spaced in a range (arange): start, stop (excluded), step
print("Vector of evenly spaced values from a range (arange) given by start, stop and step: \n", vector_range, "\n")

vector_lin = np.linspace(0, 1, 5)
print("Vector of evenly spaced values (known number, linspace) given by start, stop and number of points: \n", vector_lin, "\n")

vector_zeros = np.zeros((5,4), dtype = int)
print("2D array of zeros: \n", vector_zeros, "\n")

vector_ones = np.ones((4,5,3), dtype = int)
print("3D array of ones: \n", vector_ones, "\n")

val = 42
vector_val = np.full((4,5), val, dtype = int)
print("2D array filled with a given value: \n", vector_val, "\n")


vector_id = np.identity(4)
print("2D square array filled with 1 on the diagonal: \n", vector_id, "\n")


vector_id = np.eye(5)
print("2D square array filled with 1 on the diagonal: \n", vector_id, "\n")




In [None]:
# Evenly spaced by number of points (linspace):

vector = np.linspace(0, 1, 5)
vector

In [None]:
# Build array from Python list (array)

vector = np.array([1,2,3])
vector

#### Common arrays

In [None]:
# matrix with zeros 

np.zeros((3,4), dtype = int)

In [None]:
# matrix with ones

np.ones((3,4), dtype=int)

In [None]:
# matrix with a constant value

value = 20
np.full((3,4,2), value)

In [None]:
# Create a 4x6 array with ones on the diagonal shifted up by one (k=1) - np.eye does not need to be square

np.eye(4,6,k=1)       

In [None]:
help(np.eye)

In [None]:
# has to be square
np.identity(4)

#### Random data
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

In [None]:
help(np.random.random)

In [None]:
# Create an array filled with random values 
# Results are from the “continuous uniform” distribution over the half-open interval [0, 1).

np.random.random((3,4))        

In [None]:
# Create an array filled with random values from the standard normal distribution
np.random.randn(3,4)    

In [None]:
np.random.randn(3,4)

In [None]:
# Generate the same random numbers every time
# Set seed

np.random.seed(10)
print(np.random.randn(3,4))

print(np.random.randn(3,4))

np.random.seed(100)
print(np.random.randn(3,4))
                

In [None]:

np.random.seed(10)
print(np.random.randn(3,4))
print()
np.random.seed(10)
print(np.random.randn(3,4))
print()
np.random.seed(55)
print(np.random.randn(3,4))
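
The cells above seed NumPy's global random state. As a hedged aside, NumPy 1.17 and later also offer generator objects created with `np.random.default_rng`, which keep the seed local to one object. A minimal sketch (the names `rng_a`, `rng_b`, `sample_a` and `sample_b` are illustrative):

```python
import numpy as np

# two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(10)
rng_b = np.random.default_rng(10)

sample_a = rng_a.standard_normal((3, 4))   # standard normal, like np.random.randn(3,4)
sample_b = rng_b.standard_normal((3, 4))
print(np.array_equal(sample_a, sample_b))  # identical because the seeds match

# uniform values over [0, 1), like np.random.random((2,3))
print(rng_a.random((2, 3)))
```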

#### <b>Basic array attributes:</b>
* shape: size of the array along each dimension
* size: number of elements in the array
* ndim: number of array dimensions (len(arr.shape))
* dtype: data type of the array elements

In [None]:
# nested lists give us multi dimensional arrays

matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix

In [None]:
# dir(matrix)

In [None]:
# total number of elements in the array
matrix.size

In [None]:
# shape tells us the size on each dimension and, implicitly, the number of dimensions
matrix.shape

In [None]:
# number of array dimensions
matrix.ndim

In [None]:
# type of the data stored in the array

matrix.dtype

In [None]:
matrix

In [None]:
# transpose of the array (rows and columns switched)
matrix.T

#### <b>Reshaping</b> - changing the numbers of rows and columns - data and size stay the same

In [None]:
# Reshaping
matrix_reshaped = matrix.reshape(2,6)
matrix_reshaped

In [None]:
matrix_reshaped[0][2:5]

In [None]:
matrix_reshaped[0, 2:5]

#### <b>Indexing/Slicing(subsetting): [][] or [,]</b>
___
<img src = "http://scipy-lectures.org/_images/numpy_indexing.png" width = 400/>

In [None]:
matrix_reshaped

In [None]:
# List-like
matrix_reshaped[1][1]

In [None]:
# Using both row and column indices to get a value
matrix_reshaped[1,3]

In [None]:
matrix_reshaped

In [None]:
# Using both row and column indices to get a subset of a row
matrix_reshaped[1,:3]

In [None]:
# Using both row and column indices to get a sub-matrix

matrix_reshaped[:2,:3]

In [None]:
# iterating ... let's print the elements of matrix_reshaped
nrows = matrix_reshaped.shape[0]
ncols = matrix_reshaped.shape[1]

for i in range(nrows):
    for j in range(ncols):
        print(matrix_reshaped[i,j])



In [None]:
# Fun arrays - display a checkerboard array
checkers_board = np.zeros((8,8),dtype=int)
checkers_board[1::2,::2] = 1
checkers_board[::2,1::2] = 1
print(checkers_board)

In [None]:
checkers_board = np.zeros((8,8),dtype=int)
checkers_board

In [None]:
checkers_board = np.zeros((8,8),dtype=int)
checkers_board[1::2,::2] = 1
checkers_board

Create a 2d array with 1 on the border and 0 inside

In [None]:
boarder_array = np.zeros((8,8),dtype=int)
boarder_array[0,:] = 1

boarder_array

In [None]:
boarder_array = np.ones((8,8),dtype=int)
boarder_array[1:-1,1:-1] = 0
boarder_array

In [None]:
boarder_array[:,-1]

#### Array of indices subsetting - use an array of indices to subset an array, keeping only the elements given by the indices

In [None]:
matrix = np.arange(30)
matrix = matrix.reshape(5,6)
matrix

In [None]:
indices = [0,2,3]
matrix[indices,]

In [None]:
matrix[:,indices]

In [None]:
indices = [0,2,3]
matrix[indices, indices]

In [None]:
matrix[indices,][:,indices]
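
The last two cells give different results: passing the same index list for rows and columns pairs the indices up, while chaining two subsets selects a sub-matrix. A small sketch of the difference, using `np.ix_` as one alternative way to request the sub-matrix in a single step (the names `paired` and `sub_matrix` are illustrative):

```python
import numpy as np

matrix = np.arange(30).reshape(5, 6)
indices = [0, 2, 3]

# paired indices: picks the elements at (0,0), (2,2) and (3,3)
paired = matrix[indices, indices]
print(paired)

# np.ix_ builds an open mesh, so every listed row is combined with every listed column
sub_matrix = matrix[np.ix_(indices, indices)]
print(sub_matrix)

# same result as subsetting the rows first and then the columns
print(np.array_equal(sub_matrix, matrix[indices,][:, indices]))
```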

#### Conditional subsetting - use an array of booleans to subset an array, keeping only the elements where the boolean array is True

In [None]:
matrix

In [None]:
# conditional subsetting
matrix[(matrix[:,0] > 6)]

In [None]:
matrix[:,0] > 6

In [None]:
matrix[[False, False,  True,  True,  True]]

In [None]:
matrix

In [None]:
# condition on the first two columns: keep rows with column 0 between 4 and 20 and column 1 between 2 and 18
matrix[(4 <= matrix[:,0]) & (matrix[:,0] <= 20)
       & (2 <= matrix[:,1]) & (matrix[:,1] <= 18),]

#### <b>Matrix operations</b>

https://www.tutorialspoint.com/matrix-manipulation-in-python<br>
Arithmetic operators on arrays apply elementwise. <br> 
A new array is created and filled with the result.
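
A minimal sketch of the two statements above, with two small illustrative arrays `a` and `b`: the operators work element by element, and the operands are left unchanged because the result is a new array.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

total = a + b      # elementwise sum
product = a * b    # elementwise product, not the matrix product
print(total)
print(product)

# the operands are untouched; total and product are new arrays
print(a)
print(total is a)
```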


#### <b>Array broadcasting</b><br>

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html<br>
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. <br>
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

<img src = "https://www.tutorialspoint.com/numpy/images/array.jpg" height=10/>


https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm
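
A short sketch of the compatibility rule, assuming a 4x3 matrix like the one used in the cells that follow: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The names `row`, `col` and `bad_row` are illustrative.

```python
import numpy as np

matrix = np.arange(1, 13).reshape(4, 3)       # shape (4, 3)
row = np.array([10, 20, 30])                  # shape (3,)   matches the last dimension
col = np.array([1, 2, 3, 4]).reshape(4, 1)    # shape (4, 1) matches the first dimension

print(matrix + row)   # row is added to every row of matrix
print(matrix + col)   # col is added to every column of matrix

# a length-4 vector has shape (4,), which does not match the last dimension (3)
bad_row = np.array([0, 1, 2, 3])
try:
    matrix + bad_row
except ValueError as err:
    print("shapes cannot be broadcast:", err)
```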

In [None]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix


In [None]:
# create array of length 4 and reshape it to make it a column
col_vec = np.array([1,2,3,4]).reshape(4,1)
col_vec

In [None]:
# addition with a data column
matrix + col_vec

In [None]:
##########

matrix


In [None]:
# addition with a data row - error if dimensions do not match
# matrix + np.array([0,1,2,3])

In [None]:
matrix + np.array([1,2,3])

In [None]:
##########

matrix

In [None]:
col_vec

In [None]:
# multiplication with a data column
matrix * col_vec

In [None]:
##########

matrix

In [None]:
# create 4x3 matrix
matrix2 = np.array([[1,2,3],[5,6,7],[1,1,1],[2,2,2]])
matrix2

In [None]:
# multiplication with a matrix of the same shape
matrix * matrix2

In [None]:
##########

matrix

In [None]:
help(matrix.sum)

In [None]:
matrix.sum(axis = 1)

In [None]:
dir(matrix)

In [None]:
# matrix multiplication
col_vec = np.array([1,2,3]).reshape(3,1)
matrix.dot(col_vec)

In [None]:
# matrix multiplication - using the @ operator (Python 3.5+)
matrix@(np.array([1,2,3]).reshape(3,1))

In [None]:
##########

matrix

In [None]:
matrix2

In [None]:
# stacking arrays together - vertically
np.vstack((matrix,matrix2))

In [None]:
# stacking arrays together - horizontally
np.hstack((matrix,matrix2))

In [None]:
##########

matrix

In [None]:
# splitting arrays 
np.vsplit(matrix,2)

In [None]:
##########

matrix

In [None]:
np.hsplit(matrix,(2,3))

#### <b>Copy</b> - a shallow copy (view) looks at the same data; a deep copy can be changed without affecting the initial array

In [None]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix

In [None]:
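# plain assignment makes no copy at all: it is just another name for the same array
# view() makes a shallow copy - a new array object that looks at the same data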
matrix_copy = matrix
matrix_copy1 = matrix.view()

In [None]:
print(matrix)

print(matrix_copy)

print(matrix_copy1)

In [None]:
matrix_copy1[0,0] = 5000

In [None]:
print(matrix)

print(matrix_copy)

print(matrix_copy1)

In [None]:
# deep copy
matrix_copy2 = matrix.copy()
print(matrix_copy2)

In [None]:
matrix_copy2[0,0] = 7777

In [None]:
print(matrix)

print(matrix_copy)

print(matrix_copy1)

print(matrix_copy2)
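
One way to check which objects share data without modifying anything is `np.shares_memory`. A minimal sketch with illustrative names; it also shows that a basic slice is a view:

```python
import numpy as np

original = np.arange(6).reshape(2, 3)

alias = original            # same object under another name
view = original.view()      # new array object, same data
sliced = original[:, :2]    # basic slicing also returns a view
deep = original.copy()      # independent data

print(alias is original)                     # the very same object
print(np.shares_memory(original, view))      # shares data with original
print(np.shares_memory(original, sliced))    # shares data with original
print(np.shares_memory(original, deep))      # does not share data
```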

#### <b>More matrix computation</b> - basic aggregate functions are available - min, max, sum, mean

In [None]:
matrix

#### Use the axis argument to compute the mean for each column or row
#### axis = 0 - one value per column (computed down the rows)
#### axis = 1 - one value per row (computed across the columns)

In [None]:
# col mean 
matrix.mean(axis = 0)

In [None]:
# row mean
matrix.mean(axis = 1)

In [None]:
# unique values and counts
matrix = np.array([[ 5,  2,  3],
       [ 4,  5,  6],
       [ 3,  3,  2],
       [4, 2, 3]])
uvals, counts = np.unique(matrix, return_counts=True)
print(uvals,counts)

https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 5 rows and 6 columns with numbers from 1 to 30.
Add 2 to the odd values of the array.

In [None]:
matrix = np.arange(1,31).reshape(5,6)
matrix[matrix%2==1] +=  2 
matrix

Normalize the values in the matrix: subtract the mean and divide by the standard deviation.

In [None]:
mat_mean = np.mean(matrix)
mat_std = np.std(matrix)
matrix_norm = (matrix - mat_mean)/mat_std
matrix_norm

In [None]:
matrix

Create a random array (5 by 3) and compute: 
   * the sum of all elements 
   * the sum of the rows  
   * the sum of the columns

In [None]:
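matrix_rand = np.random.rand(5,3)
print(matrix_rand)
# print each result: a notebook cell displays only its last expression
print(matrix_rand.sum())    # sum of all elements
print(matrix_rand.sum(1))   # sum of each row
print(matrix_rand.sum(0))   # sum of each column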

In [None]:
positions = np.where(matrix  < 10) # returns tuple
positions

In [None]:
matrix[positions]

In [None]:
# help(np.where)

In [None]:
pos = np.where(matrix == 3)
pos

In [None]:
matrix[pos]

#### RESOURCES

http://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays   
https://www.python-course.eu/numpy.php   
https://numpy.org/devdocs/user/quickstart.html#universal-functions   
https://www.geeksforgeeks.org/python-numpy/

_____

### Pandas
<img src = "https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width = 200/>

https://commons.wikimedia.org/wiki/File:Pandas_logo.svg

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png" width = 550/>

https://www.geeksforgeeks.org/python-pandas-dataframe/

#### How does pandas work?

Pandas is built off of [Numpy](http://www.numpy.org/), and therefore leverages Numpy's C-level speed for its data analysis.

* A NumPy array holds values of a single data type.
* A pandas DataFrame can hold many types.
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
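
A small sketch of the table idea, using a hypothetical three-column DataFrame (the column names and values are made up for illustration): each column keeps its own data type.

```python
import pandas as pd

# each column has one type, but the types differ between columns
df_types = pd.DataFrame({
    "gene": ["EGFR", "IL6", "BRAF"],      # text
    "expression": [2.5, 10.2, 6.7],       # floating point
    "n_samples": [12, 8, 15],             # integer
})

print(df_types)
print(df_types.dtypes)   # one dtype per column
```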

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful, though not necessary, to import numpy at the same time.

```python
import numpy as np
import pandas as pd
```

#### 1. `pd.Series`

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
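
A brief sketch of how the three arguments behave, with illustrative labels and values: the index does not have to be unique, `dtype` sets the stored type, and a scalar is repeated for every label.

```python
import pandas as pd

# the index labels do not have to be unique; dtype forces the stored type
expression = pd.Series(data=[3, 4, 5], index=["gene", "gene", "protein"], dtype=float)
print(expression["gene"])    # a label that occurs twice returns both entries
print(expression.dtype)

# a scalar value is repeated for every index label
constant = pd.Series(7, index=["a", "b", "c"])
print(constant)
```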

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

#### From a Python list

In [None]:
import numpy as np
import pandas as pd

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
values = [3,4,5,6]
series_named_val = pd.Series(data = values, index=labels)


#### From dictionary

In [None]:
dict_var = dict(zip(labels, values))
pd.Series(dict_var)

In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
new_series = pd.Series(data = dict_var)
new_series

In [None]:
#help(new_series.idxmax)

In [None]:
# Return the index of the row with the max value
new_series.idxmax()

In [None]:
# generate descriptive statistics
new_series.describe()

In [None]:
# check for missing values
new_series.isna()

#### 2. `pd.DataFrame`

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
correlation_array = np.arange(40,52).reshape(3,4)
genes_rows = ["HER2","PIK3CA", "BRAF"]
genes_cols = ["HER1","EGFR", "IL6", "INSR"]
df_gene_correlation = pd.DataFrame(correlation_array, genes_rows, genes_cols)
df_gene_correlation

In [None]:
# Explore DataFrame attributes and methods

df_gene_correlation.T

In [None]:
df_gene_correlation.sort_values(by='EGFR',ascending=False)

In [None]:
# mean of each row (axis=1)
df_gene_correlation.aggregate("mean", axis=1)

In [None]:
df_gene_correlation.size

In [None]:
df_gene_correlation.index

In [None]:
df_gene_correlation.dtypes

In [None]:
'''
Create a 4 by 5 array with values from 20 to 80 going with a step of 3 
Create a list with row names: Gene1, Gene2 ...
Create a list with column names: GO_Term1, GO_Term2 ...
Create a DataFrame from the array created with the respective 
row names and column names from the lists
'''
values_array = np.arange(20,80,3).reshape(4,5)

#genes = ["Gene1","Gene2","Gene3","Gene4"]
#genes = ["Gene"+str(i+1) for i in range(values_array.shape[0])]
genes = []
for i in range(values_array.shape[0]):
    genes.append("Gene"+str(i+1))
genes

go_terms = ["GO_Term"+str(i+1) for i in range(values_array.shape[1])]
go_terms
df_gene_go = pd.DataFrame(values_array,genes,go_terms)
df_gene_go

#### From `pd.Series`

In [None]:
# Create pd.Series from the list with the go-terms names and set the name "new_row"
numbers_list = list(range(4,9))
numbers_series = pd.Series(numbers_list, index = go_terms, name = "new_row")

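#### Row-wise (`concat`)

In [None]:
# Now add on a row
# DataFrame.append was removed in pandas 2.0; pd.concat with a one-row DataFrame does the same job
pd.concat([df_gene_go, numbers_series.to_frame().T])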

#### Column-wise (`join`/`concat`)

#### `join`

In [None]:
df_gene_go

In [None]:
numbers_series1 = pd.Series([1,2,3], index = ["Gene1", "Gene2", "Gene3"], name = "new_column")


In [None]:
#different size
df_gene_go.join(numbers_series1)

#### `concat`

In [None]:
# Same size
numbers_series2 = pd.Series([1,2,3,4], index = genes, name = "new_column1")
pd.concat([df_gene_go, numbers_series2], axis=1)

In [None]:
# Unequal size
pd.concat([df_gene_go, numbers_series1], axis=1)

#### <b>I/O in Pandas</b>

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas handles much of this for you.
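
A minimal sketch of what that means in practice. To stay self-contained it reads from an in-memory string through `io.StringIO` instead of a file, and the column names and values are made up for illustration: the delimiter is a single argument, and the column types are inferred.

```python
import io
import pandas as pd

# tab-separated text standing in for the contents of a file
tsv_text = "gene\texpression\tn_samples\nEGFR\t2.5\t12\nIL6\t10.2\t8\n"

# sep sets the delimiter; index_col uses the gene column as row labels
df_read = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="gene")

print(df_read)
print(df_read.dtypes)   # numeric types were inferred from the text
```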

#### CSV Files

#### Output

You can easily save your `DataFrames`

In [None]:
df_gene_go.to_csv('dataframe_data.csv')

In [None]:
# help(df_gene_go.to_csv)

In [None]:
df_gene_go.to_csv('dataframe_data.csv', index = True)

#### Input

You can easily bring data from a file into a `DataFrame`

In [None]:
pd.read_csv('dataframe_data.csv', index_col = 0)

In [None]:
# help(pd.read_csv)

#### Excel Files

In [None]:
# Output
df_gene_go.to_excel('excel_output.xlsx')
# Input
pd.read_excel('excel_output.xlsx')

#### TSV Files

In [None]:
# Output 
df_gene_go.to_csv('tsv_output.tsv', sep="\t")
# Input
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_gene_go.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

#### <b>Indexing/Exploring/Manipulating in Pandas</b>

Standard `'[]'` indexing/slicing can be used, as well as `'.'` attribute access.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label/name-based
2. `.iloc` -> primarily integer-based

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(6)]

""" 
Create a DataFrame from a 10 by 6 array with values from 1 to 60, 
add the row_labels and col_labels we just created 
"""
data_array = np.arange(1,61).reshape(10,6)
data_array
df_example = pd.DataFrame(data_array,row_labels,col_labels)
df_example


Additionally, pandas allows you to draw a random sample of rows from the DataFrame

In [None]:
df_small = df_example.sample(n=5)
df_small

In [None]:
### 

df_example

#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
df_example[:3]

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
df_example.col1

`'.'` operator selected columns are now just a `pd.Series` and can be `'[]'` sliced on further

In [None]:
df_example.col1[:3]

However, if it is a named column that doesn't fit well as a `'.'` name, you can use `'[]'` selection as well

In [None]:
df_example["col1"][:3]

In [None]:
### 

df_example

Named rows can be selected by a range of the names

In [None]:
df_example['row1':'row3']

#### Selection <b>BY NAME</b>: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>

In [None]:
df_example.loc['row3':'row5', 'col2':'col4']

#### Boolean indexing

In [None]:
df_example.loc[df_example.col2 < 30]

#### Selection <b>BY POSITION</b>: the `.iloc` method

<b>A slice of specific items (based on position)</b>

In [None]:
df_example.iloc[:3,2]

In [None]:
# we can use a list of indices

df_example.iloc[:3,[0,1,3]]
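
One difference between the two is easy to miss: a `.loc` slice includes its stop label, while an `.iloc` slice excludes its stop position, like ordinary Python slicing. A small sketch on a hypothetical 4x3 DataFrame named `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                       index=["row0", "row1", "row2", "row3"],
                       columns=["col0", "col1", "col2"])

# label-based: both 'row2' and 'col1' are included, so the result is 2x2
by_label = df_demo.loc["row1":"row2", "col0":"col1"]
print(by_label)

# position-based: stop positions 2 and 1 are excluded, so the result is 1x1
by_position = df_demo.iloc[1:2, 0:1]
print(by_position)
```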

#### Quick Exploration of the data

In [None]:
df_example.col1.describe()

In [None]:
df_example.col1.aggregate("sum")


In [None]:
df_example[df_example > 50] = np.nan

In [None]:
df_example

In [None]:
print('Any missing values?')
# Checks for missing values in a column (pd.Series)
df_example.col1.hasnans


#### Object Manipulation

In [None]:
df_example

In [None]:
df_example.loc[df_example.col2 > 30, ['col2',"col4"]] = 0 


In [None]:
df_example

Replace all the 0 values in df_example with 200.
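
Two possible approaches, sketched on a small hypothetical DataFrame (`df_zero_demo`) so the cell is self-contained; the same calls should apply to `df_example`:

```python
import pandas as pd

df_zero_demo = pd.DataFrame({"col1": [0, 5, 8], "col2": [3, 0, 0]},
                            index=["row0", "row1", "row2"])

# option 1: replace returns a new DataFrame and leaves the original unchanged
replaced = df_zero_demo.replace(0, 200)
print(replaced)

# option 2: boolean-mask assignment, done here on a copy
masked = df_zero_demo.copy()
masked[masked == 0] = 200
print(masked)
```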

In [None]:
# some plotting

ts = pd.Series(np.random.randn(1000),
                  index=pd.date_range('1/1/2000', periods=1000))
   

ts = ts.cumsum()

ts.plot()

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4),
                      index=ts.index, columns=list('ABCD'))
 

df = df.cumsum()

plt.figure();

df.plot();

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris

Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers with petal length > 4 and petal width > 2 are there?



#### RESOURCES

https://www.python-course.eu/pandas.php  
https://www.python-course.eu/numpy.php    
https://scipy-lectures.org/packages/statistics/index.html?highlight=pandas  
https://www.geeksforgeeks.org/pandas-tutorial/

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
