# Introduction to Pandas modules for data handling
* [pandas](https://pandas.pydata.org/pandas-docs/stable/) imports and exports tabular data

### Tips:
* Using `import pandas as pd` lets us abbreviate the library name
* We can call `pd.DataFrame()` instead of `pandas.DataFrame()`

In [None]:
# !pip install --upgrade numpy pandas

In [None]:
import pandas as pd
import numpy as np

### Let's try loading data from an Excel file

In [None]:
data = pd.read_excel('CRC_sample_data.xlsx', sheet_name = 'expression', index_col = 0)
data.head()

### If you encounter an error message:
**ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd**
  
Then follow the instruction and install the missing library using this command template:

**!pip install _missing-lib-name_**

In [None]:
# !pip install xlrd openpyxl

## (pandas) DataFrame and Series
**pd.read_excel()** reads data from an Excel file

### Tips:
* For tab- or comma-separated files (.txt, .tsv, or .csv), use **pd.read_csv()**
* For an Excel file with multiple sheets, specify the **sheet_name** parameter
* **index_col** specifies the column that should be used as the row index
* **header** specifies the row that should be used as the column header
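As a minimal sketch of the same parameters with **pd.read_csv()**, the cell below reads tab-separated text from an in-memory string instead of a file. The table content and the names `tsv_text` and `toy` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a real .tsv file
tsv_text = "patient\tGeneA\tGeneB\nP1\t1.5\t2.0\nP2\t3.0\t4.5\n"

# sep='\t' selects tab as the delimiter
# header=0 takes the column names from the first row
# index_col=0 uses the first column as the row index
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0, index_col=0)
print(toy)
```

For a real file, the first argument would be the file path rather than the `io.StringIO` wrapper.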

#### The read result is a DataFrame
**head()** is used to preview the top rows of the DataFrame

In [None]:
data = pd.read_excel('CRC_sample_data.xlsx',
                     sheet_name = 'expression', 
                     index_col = 0,
                     header = 0)
data.head(2)

**tail()** shows the bottom rows of the data frame

In [None]:
data.tail(3)

### Pandas automatically determines the appropriate data type for each column
We can check data types with the built-in **DataFrame.dtypes**

In [None]:
data.dtypes

#### View the dimensions of the data with DataFrame.shape
This works like the `shape` attribute of a NumPy array

In [None]:
data.shape

### A DataFrame is a two-dimensional table with row indices and column headers
* DataFrame.index
* DataFrame.columns

In [None]:
data.index

In [None]:
data.columns

### Basic summary statistics for DataFrame
* DataFrame.describe()
* DataFrame.mean(axis = 0)
* DataFrame.std(axis = 0)
#### Notice how `describe()` ignores the non-numeric CMS column automatically; for `mean()` and `std()` we select only the numeric columns
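The sketch below illustrates this on a small made-up table (the `toy` DataFrame and its values are assumptions, not the course data). It uses the `numeric_only` parameter as an alternative to slicing off the text column by position.

```python
import pandas as pd

# Toy table with two numeric columns and one text column
toy = pd.DataFrame(
    {
        "GeneA": [1.0, 2.0, 3.0],
        "GeneB": [4.0, 6.0, 8.0],
        "CMS": ["CMS1", "CMS2", "CMS1"],
    },
    index=["P1", "P2", "P3"],
)

# describe() summarizes only the numeric columns when column types are mixed
summary = toy.describe()
print(summary.columns.tolist())

# numeric_only=True restricts mean() and std() to the numeric columns
means = toy.mean(axis=0, numeric_only=True)
stds = toy.std(axis=0, numeric_only=True)
print(means)
print(stds)
```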

In [None]:
data.describe()

In [None]:
data.iloc[:, :-1].mean(axis = 0)

In [None]:
data.iloc[:, :-1].std(axis = 0)

### Basic statistics for categorical columns
* DataFrame.nunique()
* DataFrame.value_counts()

In [None]:
print('number of distinct elements:', data['CMS'].nunique())
print('---------------------')
print(data['CMS'].value_counts())

#### We can get the unique elements with pd.unique()

In [None]:
unique_classes = pd.unique(data['CMS'])

## How to access rows, columns, and specific cells?
* DataFrame[A]
* DataFrame.loc[A, B]
* DataFrame.iloc[a, b]

#### DataFrame[headers] return a Series or a DataFrame
A Series is a one-dimensional labeled array, such as a single column of a DataFrame
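A quick way to see the difference is to compare the types returned for a single header and for a list of headers. This is a small sketch on a made-up `toy` table.

```python
import pandas as pd

toy = pd.DataFrame({"GeneA": [1.0, 2.0], "GeneB": [3.0, 4.0]}, index=["P1", "P2"])

single = toy["GeneA"]    # one header returns a Series
double = toy[["GeneA"]]  # a list of headers returns a DataFrame, even with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```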

In [None]:
data['AGR2'].head(2)

In [None]:
data[['AGR2', 'ASCL2']].head(2)

#### DataFrame.loc[A, B] lets us specify the row indices and column headers
The output follows the ordering in A and B

**:** can be used to select everything

In [None]:
print(data.loc['Patient3', 'AGR2'])
print('---------------------')
print(data.loc['Patient3', ['GFPT2', 'FAP']])
print('---------------------')
print(data.loc[['Patient3', 'Patient2'], ['GFPT2', 'FAP']])
print('---------------------')
print(data.loc['Patient3', :])

#### DataFrame.iloc[a, b] lets us specify the locations by integer positions 0, 1, ...

In [None]:
print(data.iloc[[0, 2], [-1, -3]])

#### Combination of access forms

In [None]:
print(data['FAP'].iloc[[10, 21]])

In [None]:
print(data['FAP'].loc[data['FAP'] < 5])

Who is the first patient whose `FAP` expression is lower than 5.0?

## Access with conditions (a list of booleans)
* data.loc[[True, False, ..., True], [True, False, ..., True]]
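To make the boolean form concrete, here is a minimal sketch on a made-up table. The names `toy`, `row_mask`, and `col_mask` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"GeneA": [4.2, 7.8, 5.1], "CMS": ["CMS1", "CMS3", "CMS3"]},
    index=["P1", "P2", "P3"],
)

# A comparison produces one boolean per row
row_mask = toy["CMS"] == "CMS3"
# One boolean per column: keep GeneA, drop CMS
col_mask = [True, False]

print(row_mask.tolist())
selected = toy.loc[row_mask, col_mask]
print(selected)
```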

In [None]:
data.loc[data['CMS'] == 'CMS3', ['FAP', 'SLC5A6', 'CMS']].head(5)

Select patients whose `DUSP4` expression is higher than 7.0

### Accessing by condition lets us do subpopulation-specific calculations

In [None]:
print('average DUSP4 expression in CMS1 is', data.loc[data['CMS'] == 'CMS1', 'DUSP4'].mean())
print('average DUSP4 expression in CMS2 is', data.loc[data['CMS'] == 'CMS2', 'DUSP4'].mean())
print('average DUSP4 expression in CMS3 is', data.loc[data['CMS'] == 'CMS3', 'DUSP4'].mean())

## Combining multiple conditions
Instead of **and**, **or**, **not**, we need to use `&`, `|`, `~`

First, try to select all CMS1 and CMS2 patients

Try to select CMS3 patients whose FAP expression is lower than 6.0

How about non-CMS3 patients?
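The operators can be sketched on a small made-up table before trying them on the real data (`toy` and its values are assumptions). Each comparison gets its own parentheses because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"GeneA": [4.2, 7.8, 5.1, 6.5], "CMS": ["CMS1", "CMS3", "CMS3", "CMS2"]},
    index=["P1", "P2", "P3", "P4"],
)

# & requires both conditions to hold
both = toy.loc[(toy["CMS"] == "CMS3") & (toy["GeneA"] < 6.0)]
# | requires at least one condition to hold
either = toy.loc[(toy["CMS"] == "CMS1") | (toy["CMS"] == "CMS2")]
# ~ inverts a condition
negated = toy.loc[~(toy["CMS"] == "CMS3")]

print(both)
print(either)
print(negated)
```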

### Selection on a categorical feature with `.isin()`

In [None]:
data.loc[(data['CMS'] == 'CMS1') | (data['CMS'] == 'CMS2'), :].shape
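The chained `|` comparison above can also be written with `.isin()`. The sketch below checks on a made-up table (`toy` is an assumption) that the two forms give the same mask.

```python
import pandas as pd

toy = pd.DataFrame(
    {"GeneA": [4.2, 7.8, 5.1, 6.5], "CMS": ["CMS1", "CMS3", "CMS3", "CMS2"]},
    index=["P1", "P2", "P3", "P4"],
)

# isin() tests membership in a list of allowed values
wanted = toy["CMS"].isin(["CMS1", "CMS2"])
# Equivalent mask built from chained comparisons
chained = (toy["CMS"] == "CMS1") | (toy["CMS"] == "CMS2")

print(wanted.tolist())
print(toy.loc[wanted, :].shape)
```

The `.isin()` form stays short as the list of categories grows.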

## Let's load the mutation data from a different sheet

In [None]:
mutation_data = pd.read_excel('CRC_sample_data.xlsx', sheet_name = 'mutation', header = 0, index_col = 0)
mutation_data.head(5)

## Identify missing values with `pd.isna()`

In [None]:
pd.isna(mutation_data['microsatelite_status'])

### Broadcasting this selection to the gene expression table
This assumes that the two tables have the same row ordering

Can we broadcast the selection in a safer manner?

In [None]:
data.loc[~pd.isna(mutation_data['KRAS']), :].head(2)
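One possible answer, sketched on made-up tables, is to carry the selection over as row labels rather than as positions. This assumes both tables use patient identifiers as their index; `expr`, `mut`, and their values are hypothetical.

```python
import pandas as pd

expr = pd.DataFrame({"GeneA": [1.0, 2.0, 3.0]}, index=["P1", "P2", "P3"])
# Same patients, deliberately in a different row order, with one missing value
mut = pd.DataFrame({"KRAS": ["G12D", None, "wt"]}, index=["P3", "P1", "P2"])

# Collect the labels of rows with a recorded KRAS value
labels = mut.loc[mut["KRAS"].notna()].index
# Keep only labels that also exist in the expression table
labels = labels.intersection(expr.index)

# Select by label, so row order in the two tables no longer matters
subset = expr.loc[labels]
print(subset)
```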

## Merging two DataFrames with pd.concat()
Designate the joining direction with **axis** and how common or distinct entries should be handled with **join**
* inner = intersection of entries
* outer = union of entries

In [None]:
merged = pd.concat([data, mutation_data], 
                   axis = 1)
print(merged.shape)
merged.head(2)
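The effect of **join** is easiest to see when the two tables share only some rows. The sketch below uses two made-up tables (`left` and `right` are assumptions) with a single patient in common.

```python
import pandas as pd

left = pd.DataFrame({"GeneA": [1.0, 2.0, 3.0]}, index=["P1", "P2", "P3"])
right = pd.DataFrame({"KRAS": ["wt", "G12D"]}, index=["P2", "P4"])

# outer keeps every row label found in either table and fills the gaps with NaN
outer = pd.concat([left, right], axis=1, join="outer")
# inner keeps only row labels present in both tables
inner = pd.concat([left, right], axis=1, join="inner")

print(outer.shape)
print(inner.shape)
print(inner)
```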

In [None]:
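# Exercise: one possible completion (an assumption) drops rows with any missing value
mutation_nomissing = mutation_data.dropna()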

## Copying data frame
As with lists and other mutable objects, assigning a DataFrame with `=` does not copy it: both variables refer to the same object
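The difference between a second name and an independent copy can be sketched on a made-up table (`original`, `alias`, and `independent` are illustrative names).

```python
import pandas as pd

original = pd.DataFrame({"GeneA": [1.0, 2.0]}, index=["P1", "P2"])

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

alias.loc["P1", "GeneA"] = -5.0        # also changes original
independent.loc["P2", "GeneA"] = 99.0  # leaves original untouched

print(alias is original)
print(original)
```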

#### Use DataFrame.copy() to get an independent copy

In [None]:
new_data = data.copy()
new_data.loc['Patient1', :] = -5
new_data.head(2)

The original data remains unchanged

In [None]:
data.head(2)

## Adding new column or row
Be sure to work on a copy to protect the original DataFrame

In [None]:
# Multiply the two columns named in the new header
new_data['FAP x SLC5A6'] = new_data['FAP'] * new_data['SLC5A6']
new_data.head(2)

In [None]:
new_data.loc['NewPatient'] = 0
new_data.tail(2)

In [None]:
new_data[['NewGene1', 'NewGene2']] = -1
new_data.head(2)

## Save DataFrame to file
Similar to **read_excel()** and **read_csv()**, we have **to_excel()** and **to_csv()**

In [None]:
new_data.to_excel('new_dataframe.xlsx')
new_data.to_csv('new_dataframe.csv', sep = ',')
new_data.to_csv('new_dataframe.tsv', sep = '\t')
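To check that a saved table can be read back, the sketch below writes a made-up DataFrame to an in-memory buffer instead of a file and reloads it. The names `toy`, `buffer`, and `restored` are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"GeneA": [1.5, 3.0], "CMS": ["CMS1", "CMS2"]},
    index=["P1", "P2"],
)
toy.index.name = "patient"

# Write tab-separated text into an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")

# Rewind the buffer and read it back, restoring the first column as the row index
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

With real files, the same `sep` and `index_col` arguments apply when reloading `new_dataframe.tsv`.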