## Data handling
See the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) for more in-depth resources and guides

### First, import the pandas library
The **as** keyword in an import statement lets us abbreviate the library name

Also import **numpy** for its mathematical functionality

In [5]:
import pandas as pd
import numpy as np

## If you encounter an error message like
* ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support. Use pip or conda to install xlrd.
  
Then follow the instruction and install the missing library with this command template

**!pip install _missing-lib-name_**

In [None]:
# !pip install xlrd openpyxl

## Tip 1: When in doubt, print()

In [None]:
x = 5
y = 5 ** 2

print(y)
print(x, x ** 2, x ** 5)
print("the value of", x, 'to the power of 6 is', x ** 6)

In [None]:
s = 'hello world'

print(s)
print(s + ' to everyone')

## Python data structures
* list
* tuple - a list that cannot be changed
* dictionary - a mapping of **key** to **value**

In [2]:
a_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a_tuple = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
a_dict = {'phone':'081-000-0180', 
          'email':['abc@gmail.com', 'ab.c@chula.ac.th'],
          'age':46}

In [3]:
a_dict['email']

['abc@gmail.com', 'ab.c@chula.ac.th']

## Accessing entries in list, tuple, and dictionary

In [4]:
print(a_list[0])
print(a_tuple[1])
print(a_dict['age'])

0.1
0.2
46


## Accessing multiple entries at once

In [5]:
print(a_list[2:])
print(a_tuple[-4:])

[0.3, 0.4, 0.5, 0.6]
(0.3, 0.4, 0.5, 0.6)


In [6]:
# print(a_list[1:4:2])
print(a_list[::-1])

[0.6, 0.5, 0.4, 0.3, 0.2, 0.1]


## Changing entries in list and dictionary

In [7]:
a_list[2] = 10
a_dict['age'] = 18

In [8]:
print(a_list)
print(a_dict['age'])

[0.1, 0.2, 10, 0.4, 0.5, 0.6]
18


## Entries in tuple cannot be changed
What could a tuple be used for?

In [9]:
a_tuple[1] = 2

TypeError: 'tuple' object does not support item assignment
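
One possible answer, as a minimal sketch: because a tuple cannot change, Python accepts it as a dictionary key, and it suits fixed records such as the dimensions of a table. The name `expression` and the numbers below are made up for illustration.

```python
# Tuples are hashable, so they can be dictionary keys
# (using a list as a key would raise a TypeError).
expression = {('Patient1', 'AGR2'): 9.91,
              ('Patient2', 'AGR2'): 8.97}
value = expression[('Patient1', 'AGR2')]
print(value)

# Tuple unpacking gives each entry its own name.
shape = (50, 16)
n_rows, n_cols = shape
print(n_rows, n_cols)
```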

## Locating an entry by value

In [10]:
a_list.index(0.5)

4

In [11]:
a_list.index(0)

ValueError: 0 is not in list
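
To avoid the ValueError, one option is to test membership with `in` before calling `index()`. The helper `find_position` below is a hypothetical name introduced only for this sketch.

```python
a_list = [0.1, 0.2, 10, 0.4, 0.5, 0.6]

def find_position(values, target):
    """Return the position of target in values, or None if it is absent."""
    if target in values:
        return values.index(target)
    return None

print(find_position(a_list, 0.5))
print(find_position(a_list, 0))
```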

## Pandas can read txt, tsv, csv, and even Excel files
For an Excel file with multiple sheets, we can read a specific sheet using the **sheet_name** parameter

**head()** is used to preview the top rows of the data frame

In [6]:
data = pd.read_excel('CRC_sample_data.xlsx', sheet_name = 'expression', index_col = 0)
data.head(2)

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient1,5.317879,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1


**tail()** shows the bottom rows of the data frame

In [17]:
data.tail(1)

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient50,5.759367,6.432468,4.713223,7.078099,6.763756,5.44207,6.134814,6.689782,6.583367,7.857809,9.571952,9.348392,6.505845,5.026038,4.529609,CMS3


## We can specify the location of header and index column
With **header** and **index_col** parameters

**Note:** In Python, indexing starts at zero, not one
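
As a small sketch of what these two parameters do, the example below reads a made-up table from a string. It uses `pd.read_csv` instead of `pd.read_excel` so that no file is needed; `header` and `index_col` have the same meaning in both functions. The text content and the name `toy` are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up file content: two title lines come before the real header row.
text = """Toy expression table
Typed by hand for illustration
SampleID,geneA,geneB
P1,1.5,2.5
P2,3.5,4.5
"""

# header = 2: the third line (position 2) holds the column names.
# index_col = 0: the first column holds the row labels.
toy = pd.read_csv(io.StringIO(text), header = 2, index_col = 0)
print(toy)
```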

In [4]:
data = pd.read_excel('CRC_sample_data.xlsx', sheet_name = 'expression', header = 0, index_col = 0)
data.head(5)


## Pandas automatically determines the appropriate data type for each column
We can check data types with the **dtypes** attribute

In [21]:
data.dtypes

FAP       float64
SLC5A6    float64
GFPT2     float64
ASCL2     float64
TSPAN6    float64
CCDC80    float64
DUSP4     float64
EFEMP2    float64
TRIM7     float64
DCN       float64
AGR2      float64
REG4      float64
TUBB6     float64
POFUT1    float64
RETNLB    float64
CMS        object
dtype: object
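
Once the data types are known, `select_dtypes` can split a data frame into its numeric and non-numeric parts. This is a sketch on a made-up frame named `toy`, not the course data.

```python
import pandas as pd

toy = pd.DataFrame({'geneA': [1.0, 2.0],
                    'geneB': [3.0, 4.0],
                    'CMS': ['CMS1', 'CMS2']},
                   index = ['P1', 'P2'])

# Keep only numeric columns, or only the non-numeric ones.
numeric_part = toy.select_dtypes(include = 'number')
other_part = toy.select_dtypes(exclude = 'number')

print(list(numeric_part.columns))
print(list(other_part.columns))
```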

## Dimension of data frame
The **shape** attribute returns a tuple of (number of rows, number of columns)

In [22]:
data.shape

(50, 16)

## Row indices and column names
* index
* columns

In [23]:
data.index

Index(['Patient1', 'Patient2', 'Patient3', 'Patient4', 'Patient5', 'Patient6',
       'Patient7', 'Patient8', 'Patient9', 'Patient10', 'Patient11',
       'Patient12', 'Patient13', 'Patient14', 'Patient15', 'Patient16',
       'Patient17', 'Patient18', 'Patient19', 'Patient20', 'Patient21',
       'Patient22', 'Patient23', 'Patient24', 'Patient25', 'Patient26',
       'Patient27', 'Patient28', 'Patient29', 'Patient30', 'Patient31',
       'Patient32', 'Patient33', 'Patient34', 'Patient35', 'Patient36',
       'Patient37', 'Patient38', 'Patient39', 'Patient40', 'Patient41',
       'Patient42', 'Patient43', 'Patient44', 'Patient45', 'Patient46',
       'Patient47', 'Patient48', 'Patient49', 'Patient50'],
      dtype='object', name='SampleID')

In [24]:
data.columns

Index(['FAP', 'SLC5A6', 'GFPT2', 'ASCL2', 'TSPAN6', 'CCDC80', 'DUSP4',
       'EFEMP2', 'TRIM7', 'DCN', 'AGR2', 'REG4', 'TUBB6', 'POFUT1', 'RETNLB',
       'CMS'],
      dtype='object')

## Summary statistics
* describe()
* agg([statistics], axis = 0), see documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
* mean(axis = 0)
* std(axis = 0)

In [25]:
data.describe()

,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,5.837798,6.716091,4.937316,6.18913,6.779853,5.365377,6.89828,6.352815,5.786682,7.268783,9.255964,8.171306,6.843639,5.32009,3.53965
std,0.726815,0.484525,0.844463,1.704388,0.72484,0.93014,0.976062,0.573696,0.805768,1.030827,1.427257,1.950079,0.654265,0.405347,1.327002
min,4.208262,5.705932,3.324667,2.125807,4.79143,3.469958,4.41772,4.954105,2.868241,4.22577,4.684659,3.855574,5.309111,4.707178,1.604823
25%,5.319341,6.364697,4.288292,4.979495,6.483483,4.778292,6.445787,5.947604,5.544531,6.678181,8.540691,6.975765,6.495158,5.042827,2.574891
50%,5.968247,6.634655,5.059908,6.867199,6.76216,5.263732,7.125535,6.467185,6.052655,7.448704,9.342965,8.934924,6.754874,5.266818,3.454829
75%,6.331971,7.031999,5.463169,7.314634,7.231183,6.0046,7.566512,6.711835,6.300318,7.948625,10.08179,9.639411,7.192919,5.519746,4.505157
max,7.241834,8.152112,7.715739,8.628652,8.645912,7.130213,8.373788,7.305372,7.002703,9.043028,11.572476,11.344753,8.408573,6.556573,7.165236


In [26]:
data.mean(numeric_only = True)


FAP       5.837798
SLC5A6    6.716091
GFPT2     4.937316
ASCL2     6.189130
TSPAN6    6.779853
CCDC80    5.365377
DUSP4     6.898280
EFEMP2    6.352815
TRIM7     5.786682
DCN       7.268783
AGR2      9.255964
REG4      8.171306
TUBB6     6.843639
POFUT1    5.320090
RETNLB    3.539650
dtype: float64

In [29]:
patient_name = ['Alice', 'Bob', 'Clare', 'Don', 'Eric', 'Fei', 'Gabriel', 'Henry', 'Ivan']
print(patient_name[0][3])

c


In [32]:
data.iloc[:, :-1].agg(['mean', 'std'], axis = 1)

SampleID,mean,std
Patient1,6.609214,1.243988
Patient2,6.312383,1.315664
Patient3,6.020332,1.551171
Patient4,6.760024,1.454852
Patient5,6.456409,1.584368
Patient6,7.072045,1.65386
Patient7,5.327263,1.196508
Patient8,5.843336,1.520018
Patient9,5.876659,2.062362
Patient10,6.454575,1.339245
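
The `axis` argument decides the direction of the calculation: `axis = 0` summarizes each column, `axis = 1` summarizes each row. A minimal sketch on a made-up 3 x 2 frame (the names `toy`, `geneA`, and `geneB` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'geneA': [1.0, 2.0, 3.0],
                    'geneB': [4.0, 6.0, 8.0]},
                   index = ['P1', 'P2', 'P3'])

col_means = toy.mean(axis = 0)                   # one value per column
row_means = toy.mean(axis = 1)                   # one value per row
row_stats = toy.agg(['mean', 'std'], axis = 1)   # several statistics per row

print(col_means)
print(row_means)
print(row_stats)
```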


In [34]:
data.mean(axis = 1, numeric_only = True)


SampleID
Patient1     6.609214
Patient2     6.312383
Patient3     6.020332
Patient4     6.760024
Patient5     6.456409
Patient6     7.072045
Patient7     5.327263
Patient8     5.843336
Patient9     5.876659
Patient10    6.454575
Patient11    6.775646
Patient12    6.610202
Patient13    5.628796
Patient14    6.798966
Patient15    6.583176
Patient16    6.198568
Patient17    6.510898
Patient18    6.012400
Patient19    5.684372
Patient20    6.385495
Patient21    5.906961
Patient22    6.632239
Patient23    6.169695
Patient24    6.554711
Patient25    5.783419
Patient26    7.004950
Patient27    6.430306
Patient28    5.613163
Patient29    6.493013
Patient30    6.628591
Patient31    6.207285
Patient32    6.652798
Patient33    6.361606
Patient34    6.497078
Patient35    6.307263
Patient36    6.302473
Patient37    5.914927
Patient38    6.136930
Patient39    6.050167
Patient40    6.693099
Patient41    6.221856
Patient42    6.466011
Patient43    6.849184
Patient44    6.471797
Patient45    6.568011
...

In [36]:
data.std(axis = 1, numeric_only = True)


SampleID
Patient1     1.243988
Patient2     1.315664
Patient3     1.551171
Patient4     1.454852
Patient5     1.584368
Patient6     1.653860
Patient7     1.196508
Patient8     1.520018
Patient9     2.062362
Patient10    1.339245
Patient11    1.585588
Patient12    1.806041
Patient13    1.690298
Patient14    2.105802
Patient15    1.640564
Patient16    2.139080
Patient17    2.101855
Patient18    1.920176
Patient19    1.778558
Patient20    1.202454
Patient21    1.951451
Patient22    1.094286
Patient23    2.176568
Patient24    1.902769
Patient25    2.310010
Patient26    1.762273
Patient27    1.592649
Patient28    1.409382
Patient29    1.629277
Patient30    1.428176
Patient31    1.685714
Patient32    1.929936
Patient33    1.994265
Patient34    1.831734
Patient35    1.671168
Patient36    1.540594
Patient37    2.206808
Patient38    1.612969
Patient39    1.149788
Patient40    1.504442
Patient41    1.530340
Patient42    1.429043
Patient43    1.427884
Patient44    1.531739
Patient45    2.049557
...

## Statistics for categorical features
* nunique()
* value_counts()

In [37]:
data['CMS'].nunique()

3

In [38]:
data['CMS'].value_counts()

CMS1    20
CMS2    15
CMS3    15
Name: CMS, dtype: int64
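
If proportions are more useful than counts, `value_counts` accepts `normalize = True`. The sketch below rebuilds a label column with the same counts as above (20, 15, 15) rather than loading the Excel file; `labels` is an illustrative name.

```python
import pandas as pd

# Stand-in for data['CMS'], built from the counts shown above.
labels = pd.Series(['CMS1'] * 20 + ['CMS2'] * 15 + ['CMS3'] * 15)

# normalize = True returns the fraction of rows in each category.
fractions = labels.value_counts(normalize = True)
print(fractions)
```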

## Accessing data entry

In [40]:
print(data.loc['Patient1', :])

FAP       5.317879
SLC5A6    7.521597
GFPT2     5.458581
ASCL2     7.873975
TSPAN6    6.777987
CCDC80    5.148662
DUSP4     6.372153
EFEMP2    6.495578
TRIM7     5.361258
DCN       7.529628
AGR2      9.910427
REG4      6.563663
TUBB6     6.467622
POFUT1    6.556573
RETNLB    5.782625
CMS           CMS1
Name: Patient1, dtype: object


In [41]:
print(data.loc['Patient1', 'AGR2'])

9.9104272014113


In [39]:
print(data.iloc[0, 10])

9.9104272014113
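
`loc` selects by label and `iloc` selects by position. When only the label is known, `get_loc` can look up the matching position. A minimal sketch with a made-up frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({'geneA': [1.0, 2.0],
                    'geneB': [3.0, 4.0]},
                   index = ['P1', 'P2'])

# Translate labels into zero-based positions.
row_pos = toy.index.get_loc('P2')
col_pos = toy.columns.get_loc('geneB')

by_label = toy.loc['P2', 'geneB']
by_position = toy.iloc[row_pos, col_pos]
print(row_pos, col_pos, by_label, by_position)
```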


## Quick way to standardize data frame
What is standardization?

In [42]:
data.head()

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient1,5.317879,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1
Patient3,5.364091,8.152112,4.220819,4.225933,7.615335,5.042837,6.566867,6.224913,4.497239,6.609805,8.543011,7.602395,6.0577,6.179182,3.402747,CMS1
Patient4,7.241834,6.989748,5.53598,7.283076,6.760564,7.0766,6.418798,7.018138,5.620279,8.986782,8.049796,7.436953,8.252535,5.978596,2.750682,CMS1
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1


In [50]:
data_std = (data - data.mean()) / data.std()
data_std.head()


SampleID,AGR2,ASCL2,CCDC80,CMS,DCN,DUSP4,EFEMP2,FAP,GFPT2,POFUT1,REG4,RETNLB,SLC5A6,TRIM7,TSPAN6,TUBB6
Patient1,0.458546,0.988534,-0.232992,,0.253044,-0.539031,0.248849,-0.71534,0.617273,3.05043,-0.824399,1.690257,1.662467,-0.527974,-0.002575,-0.574715
Patient2,-0.198581,0.497363,0.234904,,-0.234283,-0.009292,-0.52884,-0.516187,-1.113625,1.099434,-1.34707,0.494436,1.851903,0.358063,1.146303,0.027797
Patient3,-0.499527,-1.151848,-0.346765,,-0.639271,-0.339541,-0.222944,-0.651758,-0.848465,2.119397,-0.291738,-0.103167,2.963771,-1.600265,1.152644,-1.201254
Patient4,-0.845095,0.641841,1.839748,,1.666622,-0.491241,1.159715,1.931764,0.708928,1.624548,-0.376576,-0.594549,0.564796,-0.206515,-0.026611,2.153402
Patient5,0.905065,0.193664,0.316813,,-0.203252,-2.430767,0.267369,-0.145553,0.077878,1.308592,-0.382735,0.672305,1.107791,-0.603329,2.176258,-0.257825


In [45]:
alist = []
alist.append('x')

print(alist)

['x']


In [51]:
data_no_cms = data_std.drop('CMS', axis = 1)
data_no_cms.head()

SampleID,AGR2,ASCL2,CCDC80,DCN,DUSP4,EFEMP2,FAP,GFPT2,POFUT1,REG4,RETNLB,SLC5A6,TRIM7,TSPAN6,TUBB6
Patient1,0.458546,0.988534,-0.232992,0.253044,-0.539031,0.248849,-0.71534,0.617273,3.05043,-0.824399,1.690257,1.662467,-0.527974,-0.002575,-0.574715
Patient2,-0.198581,0.497363,0.234904,-0.234283,-0.009292,-0.52884,-0.516187,-1.113625,1.099434,-1.34707,0.494436,1.851903,0.358063,1.146303,0.027797
Patient3,-0.499527,-1.151848,-0.346765,-0.639271,-0.339541,-0.222944,-0.651758,-0.848465,2.119397,-0.291738,-0.103167,2.963771,-1.600265,1.152644,-1.201254
Patient4,-0.845095,0.641841,1.839748,1.666622,-0.491241,1.159715,1.931764,0.708928,1.624548,-0.376576,-0.594549,0.564796,-0.206515,-0.026611,2.153402
Patient5,0.905065,0.193664,0.316813,-0.203252,-2.430767,0.267369,-0.145553,0.077878,1.308592,-0.382735,0.672305,1.107791,-0.603329,2.176258,-0.257825
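
Standardization (z-scoring) rescales each column as z = (x - mean) / std, so that every column ends up with mean 0 and standard deviation 1. Pandas 2.0 and later no longer skip text columns such as CMS silently, so the sketch below selects the numeric columns first. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'geneA': [1.0, 2.0, 3.0],
                    'geneB': [10.0, 20.0, 60.0],
                    'CMS': ['CMS1', 'CMS1', 'CMS2']},
                   index = ['P1', 'P2', 'P3'])

# Standardize only the numeric columns; the label column is left out.
numeric = toy.select_dtypes(include = 'number')
toy_std = (numeric - numeric.mean()) / numeric.std()

print(toy_std)
print(toy_std.mean().round(6))   # approximately 0 for every column
print(toy_std.std())             # 1 for every column
```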


## How to access rows, columns, and specific cells?
* data['column name']
* data.loc['index', 'column name']

In [None]:
data.head(5)

In [None]:
data.loc['Patient3', :]

## Access with boolean list
* data.loc[[True, False, ..., True], ['column 1', 'column 2']]

In [52]:
data.loc['Patient3', 'DUSP4']

6.56686721773301

In [53]:
data.loc[['Patient3', 'Patient10'], ['DUSP4', 'AGR2']]

SampleID,DUSP4,AGR2
Patient3,6.566867,8.543011
Patient10,5.674381,7.223094


In [54]:
data['DUSP4'] > 5

SampleID
Patient1      True
Patient2      True
Patient3      True
Patient4      True
Patient5     False
Patient6      True
Patient7     False
Patient8      True
Patient9      True
Patient10     True
Patient11     True
Patient12     True
Patient13     True
Patient14     True
Patient15     True
Patient16     True
Patient17     True
Patient18     True
Patient19     True
Patient20     True
Patient21     True
Patient22     True
Patient23     True
Patient24     True
Patient25     True
Patient26     True
Patient27     True
Patient28     True
Patient29     True
Patient30     True
Patient31     True
Patient32     True
Patient33     True
Patient34     True
Patient35     True
Patient36     True
Patient37     True
Patient38     True
Patient39    False
Patient40     True
Patient41     True
Patient42     True
Patient43     True
Patient44     True
Patient45     True
Patient46     True
Patient47     True
Patient48     True
Patient49     True
Patient50     True
Name: DUSP4, dtype: bool

In [56]:
data.loc[data['DUSP4'] <= 5, :]

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1
Patient7,5.323728,6.797493,5.650428,3.786245,6.562456,4.34146,4.41772,5.867998,6.271575,7.070642,4.684659,3.855574,6.317054,5.677381,3.284533,CMS1
Patient39,5.517355,6.846884,4.9459,7.825758,7.302324,4.977612,4.739178,5.595053,5.925502,6.156288,8.034304,5.228956,7.470683,4.885445,5.301257,CMS3


In [57]:
print('average DUSP4 expression for patient from CMS1 is', data.loc[data['CMS'] == 'CMS1', 'DUSP4'].mean())

average DUSP4 expression for patient from CMS1 is 6.756505373534206


## Applying multiple conditions

In [58]:
(data['AGR2'] > 10) & (data['CMS'] == 'CMS1')

SampleID
Patient1     False
Patient2     False
Patient3     False
Patient4     False
Patient5      True
Patient6      True
Patient7     False
Patient8     False
Patient9      True
Patient10    False
Patient11    False
Patient12     True
Patient13    False
Patient14     True
Patient15     True
Patient16     True
Patient17     True
Patient18    False
Patient19    False
Patient20    False
Patient21    False
Patient22    False
Patient23    False
Patient24    False
Patient25    False
Patient26    False
Patient27    False
Patient28    False
Patient29    False
Patient30    False
Patient31    False
Patient32    False
Patient33    False
Patient34    False
Patient35    False
Patient36    False
Patient37    False
Patient38    False
Patient39    False
Patient40    False
Patient41    False
Patient42    False
Patient43    False
Patient44    False
Patient45    False
Patient46    False
Patient47    False
Patient48    False
Patient49    False
Patient50    False
dtype: bool

In [59]:
(data['AGR2'] > 10) | (data['CMS'] == 'CMS1')

SampleID
Patient1      True
Patient2      True
Patient3      True
Patient4      True
Patient5      True
Patient6      True
Patient7      True
Patient8      True
Patient9      True
Patient10     True
Patient11     True
Patient12     True
Patient13     True
Patient14     True
Patient15     True
Patient16     True
Patient17     True
Patient18     True
Patient19     True
Patient20     True
Patient21    False
Patient22    False
Patient23     True
Patient24    False
Patient25    False
Patient26     True
Patient27    False
Patient28    False
Patient29     True
Patient30    False
Patient31    False
Patient32    False
Patient33     True
Patient34     True
Patient35    False
Patient36    False
Patient37     True
Patient38    False
Patient39    False
Patient40    False
Patient41    False
Patient42    False
Patient43    False
Patient44    False
Patient45     True
Patient46    False
Patient47    False
Patient48    False
Patient49     True
Patient50    False
dtype: bool

In [61]:
data.loc[(data['AGR2'] > 10) | (data['CMS'] == 'CMS1'), ['AGR2', 'CMS']]

SampleID,AGR2,CMS
Patient1,9.910427,CMS1
Patient2,8.972537,CMS1
Patient3,8.543011,CMS1
Patient4,8.049796,CMS1
Patient5,10.547725,CMS1
Patient6,10.541127,CMS1
Patient7,4.684659,CMS1
Patient8,5.026115,CMS1
Patient9,10.086906,CMS1
Patient10,7.223094,CMS1
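
Conditions are combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. A minimal sketch on a made-up frame; storing each mask under a name, as done here, is one way to keep long conditions readable.

```python
import pandas as pd

toy = pd.DataFrame({'AGR2': [9.9, 10.5, 4.7, 10.2],
                    'CMS': ['CMS1', 'CMS1', 'CMS2', 'CMS3']},
                   index = ['P1', 'P2', 'P3', 'P4'])

high = toy['AGR2'] > 10
cms1 = toy['CMS'] == 'CMS1'

both = toy.loc[high & cms1]          # rows meeting both conditions
either = toy.loc[high | cms1]        # rows meeting at least one condition
neither = toy.loc[~(high | cms1)]    # rows meeting none of them

print(both)
print(either)
print(neither)
```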


## Selection for categorical feature

In [62]:
array = pd.unique(data['CMS'])
display(array)
print(array)

array(['CMS1', 'CMS2', 'CMS3'], dtype=object)

['CMS1' 'CMS2' 'CMS3']


In [64]:
lst = sorted(pd.unique(data['CMS']), reverse = True)
print(lst)

['CMS3', 'CMS2', 'CMS1']


In [None]:
data.loc[data['CMS'].isin(['CMS2', 'CMS3']), 'AGR2'].mean()
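
`isin` returns True for every row whose value appears in the given list, which is shorter than chaining several `==` tests with `|`. A sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'AGR2': [9.0, 11.0, 4.0, 10.0],
                    'CMS': ['CMS1', 'CMS1', 'CMS2', 'CMS3']},
                   index = ['P1', 'P2', 'P3', 'P4'])

in_group = toy['CMS'].isin(['CMS2', 'CMS3'])

mean_selected = toy.loc[in_group, 'AGR2'].mean()    # CMS2 and CMS3 rows
mean_rest = toy.loc[~in_group, 'AGR2'].mean()       # all other rows
print(mean_selected, mean_rest)
```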

## The index() method is available on lists, not arrays

In [65]:
lst.index('CMS2')

1

In [66]:
array.index('CMS2')

AttributeError: 'numpy.ndarray' object has no attribute 'index'

## But selecting entries by condition works on arrays, not lists

In [67]:
array = np.array([40, 60, 70, 80, 50])
lst = [40, 60, 70, 80, 50]

In [70]:
print(array)
print(lst)

[40 60 70 80 50]
[40, 60, 70, 80, 50]


In [68]:
array[array > 60]

array([70, 80])

In [69]:
lst[lst > 60]

TypeError: '>' not supported between instances of 'list' and 'int'

In [71]:
new_array = array / 10
print(new_array)

[4. 6. 7. 8. 5.]


In [72]:
new_lst = lst / 10

TypeError: unsupported operand type(s) for /: 'list' and 'int'

## List comprehension

In [73]:
lst = [3, 4, 5, 6, 7]
twice_lst = lst * 2

print(twice_lst)

[3, 4, 5, 6, 7, 3, 4, 5, 6, 7]


In [74]:
new_twice_lst = [x * 2 for x in lst]

print(new_twice_lst)

[6, 8, 10, 12, 14]


In [75]:
half_lst = [x / 2 for x in lst if x > 4]

print(half_lst)

[2.5, 3.0, 3.5]


## Convert between array and list

In [76]:
lst = list(array)
print(lst)

[40, 60, 70, 80, 50]


In [77]:
array = np.array(lst)
print(array)

[40 60 70 80 50]
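
An alternative to `list(array)` is the array's own `tolist()` method. `list(array)` keeps each entry as a NumPy scalar, whereas `tolist()` converts the entries to plain Python numbers, which can matter when printing or saving them. A short sketch:

```python
import numpy as np

array = np.array([40, 60, 70, 80, 50])

# tolist() returns a list of ordinary Python ints.
lst = array.tolist()
print(lst)
print(type(lst[0]))

# Converting back gives an array with the same values.
back = np.array(lst)
print(back)
```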


## Load a different sheet from an Excel file

In [78]:
mutation_data = pd.read_excel('CRC_sample_data.xlsx', sheet_name = 'mutation', header = 0, index_col = 0)
mutation_data.head(5)

SampleID,KRAS,BRAF,APC,TP53,PIK3CA,PTEN,microsatelite_status
Patient1,,,,,,,
Patient2,,,,,,,
Patient3,,,,,,,
Patient4,,,,,,,
Patient5,wt,wt,wt,wt,mt,mt,MSI


## Select non-missing values
**isna()** marks missing values as True; prefix the mask with **~** to keep only the non-missing entries

In [79]:
pd.isna(mutation_data['KRAS'])

SampleID
Patient1      True
Patient2      True
Patient3      True
Patient4      True
Patient5     False
Patient6     False
Patient7     False
Patient8     False
Patient9     False
Patient10    False
Patient11    False
Patient12     True
Patient13     True
Patient14     True
Patient15    False
Patient16     True
Patient17    False
Patient18    False
Patient19    False
Patient20    False
Patient21    False
Patient22    False
Patient23    False
Patient24    False
Patient25     True
Patient26    False
Patient27    False
Patient28    False
Patient29    False
Patient30     True
Patient31    False
Patient32    False
Patient33    False
Patient34     True
Patient35    False
Patient36     True
Patient37    False
Patient38    False
Patient39    False
Patient40    False
Patient41    False
Patient42    False
Patient43    False
Patient44    False
Patient45    False
Patient46    False
Patient47    False
Patient48     True
Patient49     True
Patient50     True
Name: KRAS, dtype: bool

In [80]:
mutation_data.loc[~pd.isna(mutation_data['KRAS']), 'KRAS']

SampleID
Patient5     wt
Patient6     wt
Patient7     wt
Patient8     mt
Patient9     wt
Patient10    wt
Patient11    mt
Patient15    mt
Patient17    wt
Patient18    wt
Patient19    wt
Patient20    mt
Patient21    wt
Patient22    mt
Patient23    wt
Patient24    wt
Patient26    wt
Patient27    wt
Patient28    wt
Patient29    wt
Patient31    wt
Patient32    wt
Patient33    wt
Patient35    wt
Patient37    wt
Patient38    mt
Patient39    wt
Patient40    wt
Patient41    mt
Patient42    mt
Patient43    wt
Patient44    wt
Patient45    mt
Patient46    mt
Patient47    mt
Name: KRAS, dtype: object
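
Pandas also offers `notna()`, the direct opposite of `isna()`, and `dropna()`, which removes missing entries in one step. The sketch below uses a made-up series named `kras` rather than the mutation sheet.

```python
import pandas as pd

kras = pd.Series(['wt', None, 'mt', None, 'wt'],
                 index = ['P1', 'P2', 'P3', 'P4', 'P5'], name = 'KRAS')

kept_mask = kras[kras.notna()]    # keep rows where a value is present
kept_drop = kras.dropna()         # same result without building a mask
n_missing = kras.isna().sum()     # True counts as 1 when summed

print(kept_mask)
print(n_missing)
```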

## Copying data frame
Copying is important because we sometimes want to preserve the original data

In [1]:
x = 5
y = x
x = 10
print(y)

5


In [2]:
lst = [1, 3, 4]
new_lst = lst
lst[1] = 5

print(new_lst)

[1, 5, 4]
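
The list example changes because `new_lst = lst` creates a second name for the same list, not a second list. A minimal sketch of the difference, using the list's `copy()` method (the names `alias` and `snapshot` are illustrative):

```python
lst = [1, 3, 4]
alias = lst              # second name for the same list
snapshot = lst.copy()    # independent copy of the entries

lst[1] = 5

print(alias)       # follows the change
print(snapshot)    # keeps the original entries
```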


In [7]:
new_data = data.copy()
new_data.head()

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient1,5.317879,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1
Patient3,5.364091,8.152112,4.220819,4.225933,7.615335,5.042837,6.566867,6.224913,4.497239,6.609805,8.543011,7.602395,6.0577,6.179182,3.402747,CMS1
Patient4,7.241834,6.989748,5.53598,7.283076,6.760564,7.0766,6.418798,7.018138,5.620279,8.986782,8.049796,7.436953,8.252535,5.978596,2.750682,CMS1
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1


In [8]:
new_data.loc['Patient1', 'FAP'] = -5
new_data.head()

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient1,-5.0,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1
Patient3,5.364091,8.152112,4.220819,4.225933,7.615335,5.042837,6.566867,6.224913,4.497239,6.609805,8.543011,7.602395,6.0577,6.179182,3.402747,CMS1
Patient4,7.241834,6.989748,5.53598,7.283076,6.760564,7.0766,6.418798,7.018138,5.620279,8.986782,8.049796,7.436953,8.252535,5.978596,2.750682,CMS1
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1


Original data remain unchanged

In [9]:
data.head()

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS
Patient1,5.317879,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1
Patient3,5.364091,8.152112,4.220819,4.225933,7.615335,5.042837,6.566867,6.224913,4.497239,6.609805,8.543011,7.602395,6.0577,6.179182,3.402747,CMS1
Patient4,7.241834,6.989748,5.53598,7.283076,6.760564,7.0766,6.418798,7.018138,5.620279,8.986782,8.049796,7.436953,8.252535,5.978596,2.750682,CMS1
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1


## Add a new column

In [12]:
new_data['average'] = data.mean(axis = 1, numeric_only = True)  # per-patient mean expression from the unmodified data
new_data['FAP x SLC5A6'] = new_data['FAP'] * new_data['SLC5A6']
new_data.head()

SampleID,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,CMS,average,FAP x SLC5A6
Patient1,-5.0,7.521597,5.458581,7.873975,6.777987,5.148662,6.372153,6.495578,5.361258,7.529628,9.910427,6.563663,6.467622,6.556573,5.782625,CMS1,6.609214,-37.607987
Patient2,5.462626,7.613383,3.996901,7.03683,7.610739,5.58387,6.889211,6.049421,6.075198,7.027278,8.972537,5.544412,6.861825,5.765743,4.195767,CMS1,6.312383,41.589065
Patient3,5.364091,8.152112,4.220819,4.225933,7.615335,5.042837,6.566867,6.224913,4.497239,6.609805,8.543011,7.602395,6.0577,6.179182,3.402747,CMS1,6.020332,43.728668
Patient4,7.241834,6.989748,5.53598,7.283076,6.760564,7.0766,6.418798,7.018138,5.620279,8.986782,8.049796,7.436953,8.252535,5.978596,2.750682,CMS1,6.760024,50.618598
Patient5,5.732008,7.252843,5.003082,6.519208,8.357291,5.660058,4.5257,6.506203,5.300539,7.059266,10.547725,7.424943,6.674952,5.850525,4.4318,CMS1,6.456409,41.573351


## Calculating statistics per group
groupby()

In [14]:
pd.unique(new_data['CMS'])

array(['CMS1', 'CMS2', 'CMS3'], dtype=object)

In [13]:
new_data.groupby('CMS').mean()

CMS,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,average,FAP x SLC5A6
CMS1,5.177354,6.886191,4.885876,6.427058,6.808196,5.016452,6.756505,6.270085,5.690918,7.170562,9.006503,7.885265,6.846878,5.47324,3.623595,6.296038,35.317076
CMS2,5.779586,6.695623,4.809646,5.584092,6.959512,5.509791,6.952251,6.332585,5.841067,7.139227,9.650661,8.663313,6.721061,5.290922,3.31374,6.349539,38.771797
CMS3,6.088745,6.509758,5.133573,6.476929,6.562403,5.686196,7.033342,6.483351,5.859985,7.529301,9.193881,8.060689,6.961898,5.14506,3.653633,6.42525,39.605577


In [15]:
new_data.groupby('CMS').median()

CMS,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,average,FAP x SLC5A6
CMS1,5.413358,6.809508,5.104045,6.729795,6.617792,4.888518,6.882406,6.397944,5.844112,7.080048,9.437429,8.752215,6.664631,5.304315,3.325177,6.420035,37.850216
CMS2,5.962015,6.787483,5.001984,6.855597,6.939538,5.357899,7.290589,6.354966,6.165573,7.605952,9.514784,9.160111,6.735849,5.316901,3.455666,6.430306,38.156645
CMS3,6.296011,6.432468,5.313438,7.078099,6.723037,5.888979,7.033327,6.666716,5.925502,7.679102,9.286057,8.930608,6.888585,5.166031,4.090627,6.466011,40.92862


In [16]:
new_data.groupby('CMS').std()

CMS,FAP,SLC5A6,GFPT2,ASCL2,TSPAN6,CCDC80,DUSP4,EFEMP2,TRIM7,DCN,AGR2,REG4,TUBB6,POFUT1,RETNLB,average,FAP x SLC5A6
CMS1,2.521938,0.529478,0.988115,1.563879,0.710954,0.946218,1.120124,0.60151,0.910132,1.112713,1.841101,2.063829,0.776496,0.529022,1.078805,0.458234,18.200314
CMS2,0.616043,0.359574,0.73798,1.863321,0.833197,0.767518,0.949679,0.586716,0.945766,1.077422,0.963606,1.557366,0.483658,0.255311,1.483349,0.366783,5.106172
CMS3,0.716974,0.473829,0.750807,1.677589,0.610155,0.957778,0.823646,0.536844,0.481396,0.879477,1.158014,2.171287,0.644138,0.25066,1.51816,0.275255,5.145026
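
Instead of calling `mean()`, `median()`, and `std()` one at a time, `agg` can compute several statistics per group in a single table. A sketch on a made-up frame, restricted to one column to keep the output small:

```python
import pandas as pd

toy = pd.DataFrame({'AGR2': [9.0, 11.0, 4.0, 6.0, 10.0],
                    'CMS': ['CMS1', 'CMS1', 'CMS2', 'CMS2', 'CMS3']})

# One row per group, one column per statistic.
summary = toy.groupby('CMS')['AGR2'].agg(['mean', 'median', 'count'])
print(summary)
```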


## Save data frame to an Excel file

In [17]:
new_data.to_excel('new_dataframe.xlsx')

## For loop

In [18]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [19]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9
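
**range()** also accepts a start and a step, and **enumerate()** gives the position of each item while looping over a list. A minimal sketch, where the short `genes` list is only an illustrative stand-in for real column names:

```python
# range(start, stop, step): the stop value itself is excluded
evens = list(range(0, 10, 2))
print(evens)

# enumerate() yields (position, item) pairs while looping over a list
genes = ['FAP', 'SLC5A6', 'GFPT2']  # illustrative list of names
for i, gene in enumerate(genes):
    print(i, gene)
```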


### Go over data one column at a time

In [20]:
data.columns

Index(['FAP', 'SLC5A6', 'GFPT2', 'ASCL2', 'TSPAN6', 'CCDC80', 'DUSP4',
       'EFEMP2', 'TRIM7', 'DCN', 'AGR2', 'REG4', 'TUBB6', 'POFUT1', 'RETNLB',
       'CMS'],
      dtype='object')

In [21]:
for c in data.columns[:-1]:
    print(c, 'mean =', data[c].mean(), 'std =', data[c].std())

FAP mean = 5.837798271570673 std = 0.7268151668649991
SLC5A6 mean = 6.716090501801843 std = 0.48452488022793194
GFPT2 mean = 4.937316203751788 std = 0.8444631359686181
ASCL2 mean = 6.189129583684837 std = 1.7043883930513326
TSPAN6 mean = 6.779853084510222 std = 0.7248398444823763
CCDC80 mean = 5.365376883441282 std = 0.9301397103582416
DUSP4 mean = 6.898279961337789 std = 0.9760619609925546
EFEMP2 mean = 6.3528146421449785 std = 0.5736958202424725
TRIM7 mean = 5.786682481831352 std = 0.805768441378514
DCN mean = 7.268783172079207 std = 1.030827356688954
AGR2 mean = 9.255963986762353 std = 1.427256881342723
REG4 mean = 8.171306490760951 std = 1.9500791067363323
TUBB6 mean = 6.84363871512698 std = 0.6542653298749708
POFUT1 mean = 5.320090465841044 std = 0.40534709175112954
RETNLB mean = 3.539649907218695 std = 1.3270021784751038
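
The same per-column statistics can be obtained without an explicit loop. The sketch below assumes a small made-up frame (`toy`) instead of the real `data`, and drops the non-numeric `CMS` column before aggregating.

```python
import pandas as pd

# Hypothetical toy data standing in for data (values are made up)
toy = pd.DataFrame({
    'FAP':    [5.0, 7.0, 6.0],
    'SLC5A6': [6.0, 6.5, 7.0],
    'CMS':    ['CMS1', 'CMS2', 'CMS3'],
})

# Drop the label column, then compute both statistics for every column
stats = toy.drop(columns='CMS').agg(['mean', 'std'])
print(stats)
```

Each statistic becomes a row and each gene stays a column, so `stats.loc['mean', 'FAP']` retrieves one value.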


### Go over data one row at a time

In [22]:
for i in range(10):
    print(i, 'FAP =', data.iloc[i, 0], 'SLC5A6 =', data.iloc[i, 1])

0 FAP = 5.31787854169241 SLC5A6 = 7.52159731275778
1 FAP = 5.46262582095284 SLC5A6 = 7.613383443550051
2 FAP = 5.36409097800165 SLC5A6 = 8.15211150589592
3 FAP = 7.241833950260019 SLC5A6 = 6.98974844869259
4 FAP = 5.73200786216588 SLC5A6 = 7.25284259391562
5 FAP = 7.210346651400889 SLC5A6 = 6.10483318329787
6 FAP = 5.32372753792337 SLC5A6 = 6.79749274296288
7 FAP = 5.31063855138597 SLC5A6 = 6.572456622232879
8 FAP = 4.208261744613741 SLC5A6 = 6.5543446277055
9 FAP = 6.17311094583072 SLC5A6 = 6.345487876655951


In [23]:
for i in range(10):
    print(i, 'FAP =', data['FAP'].iloc[i], 'SLC5A6 =', data['SLC5A6'].iloc[i])

0 FAP = 5.31787854169241 SLC5A6 = 7.52159731275778
1 FAP = 5.46262582095284 SLC5A6 = 7.613383443550051
2 FAP = 5.36409097800165 SLC5A6 = 8.15211150589592
3 FAP = 7.241833950260019 SLC5A6 = 6.98974844869259
4 FAP = 5.73200786216588 SLC5A6 = 7.25284259391562
5 FAP = 7.210346651400889 SLC5A6 = 6.10483318329787
6 FAP = 5.32372753792337 SLC5A6 = 6.79749274296288
7 FAP = 5.31063855138597 SLC5A6 = 6.572456622232879
8 FAP = 4.208261744613741 SLC5A6 = 6.5543446277055
9 FAP = 6.17311094583072 SLC5A6 = 6.345487876655951
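
pandas also provides **iterrows()** and **itertuples()** for looping over rows without tracking an integer position. A minimal sketch on a made-up frame (`toy`, with invented sample labels `s1` to `s3`):

```python
import pandas as pd

# Hypothetical toy data (values and sample labels are made up)
toy = pd.DataFrame({'FAP': [5.3, 5.5, 5.4], 'SLC5A6': [7.5, 7.6, 8.2]},
                   index=['s1', 's2', 's3'])

# iterrows() yields (index label, row as a Series) for each row
for label, row in toy.iterrows():
    print(label, 'FAP =', row['FAP'], 'SLC5A6 =', row['SLC5A6'])

# itertuples() yields one namedtuple per row, with the index label in .Index
for t in toy.itertuples():
    print(t.Index, 'FAP =', t.FAP, 'SLC5A6 =', t.SLC5A6)
```

For large tables, column-wise operations are usually much faster than any row-by-row loop.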


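## Dealing with missing values
Impute (fill in) missing values with **fillna()**, using either a constant or a per-column statistic such as the mean, which is computed over the non-missing entries only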

In [24]:
missing_data = pd.DataFrame([[0, np.nan, 1], 
                             [np.nan, 1, 2],
                             [2, 3, 4]], 
                            index = ['Cell1', 'Cell2', 'Cell3'],
                            columns = ['ProteinA', 'ProteinB', 'ProteinC'])
missing_data.head()

,ProteinA,ProteinB,ProteinC
Cell1,0.0,,1
Cell2,,1.0,2
Cell3,2.0,3.0,4


In [25]:
imputed_data = missing_data.fillna(-1)
imputed_data.head()

,ProteinA,ProteinB,ProteinC
Cell1,0.0,-1.0,1
Cell2,-1.0,1.0,2
Cell3,2.0,3.0,4


In [26]:
imputed_data = missing_data.fillna(missing_data.mean())
imputed_data.head()

,ProteinA,ProteinB,ProteinC
Cell1,0.0,2.0,1
Cell2,1.0,1.0,2
Cell3,2.0,3.0,4
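
**fillna()** also accepts a dictionary, so each column can get its own fill value. The sketch below rebuilds the same small table under the name `toy`; the choice of 0 for ProteinA and the median for ProteinB is only an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Same layout as missing_data above, rebuilt so the block is self-contained
toy = pd.DataFrame([[0, np.nan, 1],
                    [np.nan, 1, 2],
                    [2, 3, 4]],
                   index=['Cell1', 'Cell2', 'Cell3'],
                   columns=['ProteinA', 'ProteinB', 'ProteinC'])

# A dict maps each column name to its own fill value
filled = toy.fillna({'ProteinA': 0,
                     'ProteinB': toy['ProteinB'].median()})
print(filled)
```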


## Selecting rows with no missing data

In [27]:
nomissing_data = missing_data.loc[~pd.isna(missing_data['ProteinA']), :]
nomissing_data.head()

,ProteinA,ProteinB,ProteinC
Cell1,0.0,,1
Cell3,2.0,3.0,4


In [28]:
nomissing_data = missing_data.loc[~pd.isna(missing_data['ProteinA']) & ~pd.isna(missing_data['ProteinB']), :]
nomissing_data.head()

,ProteinA,ProteinB,ProteinC
Cell3,2.0,3.0,4


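Use the **any()** function to check whether at least one entry is True. With **axis = 1**, the check runs across the columns of each row, so it flags every row that contains a missing value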

In [29]:
print(pd.DataFrame([True, True, False, False]).any()[0])
print(pd.DataFrame([False, False, False, False]).any()[0])
print(pd.DataFrame([False, True, False, False]).any()[0])

True
False
True


In [30]:
pd.isna(missing_data).any(axis = 1)

Cell1     True
Cell2     True
Cell3    False
dtype: bool

In [31]:
nomissing_data = missing_data.loc[~pd.isna(missing_data).any(axis = 1), :]
nomissing_data.head()

,ProteinA,ProteinB,ProteinC
Cell3,2.0,3.0,4
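
pandas offers **dropna()** as a shortcut for the selections above. A minimal sketch, rebuilding the same small table under the name `toy` so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Same layout as missing_data above, rebuilt so the block is self-contained
toy = pd.DataFrame([[0, np.nan, 1],
                    [np.nan, 1, 2],
                    [2, 3, 4]],
                   index=['Cell1', 'Cell2', 'Cell3'],
                   columns=['ProteinA', 'ProteinB', 'ProteinC'])

# Keep only the rows with no missing value in any column
complete = toy.dropna()
print(complete)

# Keep the rows where ProteinA is present, whatever the other columns hold
has_a = toy.dropna(subset=['ProteinA'])
print(has_a)
```

The boolean-mask approach remains useful when the selection rule is more complex than "no missing values".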
