<a href="https://colab.research.google.com/github/araldi/Python_for_biomedical_data_analysis/blob/main/03_Intro_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

As you have seen, Python is a very powerful and dynamic programming language with several built-in functions. 

Sometimes, however, importing libraries is essential to perform certain operations without excessive coding from scratch.

Usually, at the beginning of the code or in the first cell of the notebook, you want to import all the libraries that you need. 



In [None]:
# In this example, we will be importing the following libraries :
# pandas, 
# numpy, 

# let's import them!

import pandas as pd  # the alias (pd) makes calls to the library shorter to type
import numpy as np


If you are using Jupyter Notebook/Lab on your own computer, you first need to install Pandas and NumPy, if you have not done so already.
Run these commands in a terminal:



```
pip install pandas
pip install numpy
```



# What can Pandas do for you?

Today we will learn how to:

* Create Series and DataFrames;
* Import data in the form of DataFrames (tables);
* Get info on your imported DataFrames;
* Subset the DataFrame.




# Intro to Pandas DataFrames

## Create Series



In [None]:
# Create an empty Series
sr = pd.Series()
sr

  


Series([], dtype: float64)

In [None]:
# Create an empty Series, made of objects (strings)
sr = pd.Series(dtype = 'object')
sr

Series([], dtype: object)

In [None]:
# Create an empty Series, made of integers
sr = pd.Series(dtype = 'int64')
sr

Series([], dtype: int64)

In [None]:
sr = pd.Series(dtype = 'object', index = ['this', 'is', 'an', 'index'])
sr

this     NaN
is       NaN
an       NaN
index    NaN
dtype: object

In [None]:
content = ['THIS', 'IS', 'THE', 'CONTENT']
sr = pd.Series(content, dtype = 'object', index = ['this', 'is', 'an', 'index'])
sr

this        THIS
is            IS
an           THE
index    CONTENT
dtype: object
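
Once a Series has an index, its values can be looked up by label. The following is only a minimal sketch that reuses the labels and content from the cell above; the variable names `one_value` and `two_values` are introduced here for illustration.

```python
import pandas as pd

content = ['THIS', 'IS', 'THE', 'CONTENT']
sr = pd.Series(content, dtype='object', index=['this', 'is', 'an', 'index'])

# look up one value by its index label
one_value = sr['an']

# look up several values by passing a list of labels
two_values = sr[['this', 'index']]

print(one_value)
print(two_values.tolist())
```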

In [None]:
# Create random numbers with numpy (more info later)
np.random.randint(36, 46, 5)

array([42, 40, 41, 38, 44])

In [None]:


shoe_size = pd.Series(np.random.randint(36, 46, 5)) # creates a Series with 5 random integers from 36 to 45 (the upper bound, 46, is excluded)
patient = pd.Series(['b', 'a', 'c', 'd', 'f'])

In [None]:
patient

0    b
1    a
2    c
3    d
4    f
dtype: object

In [None]:
shoe_size

0    44
1    39
2    42
3    37
4    41
dtype: int64

## Create DataFrames

In [None]:
# create an empty DataFrame

df = pd.DataFrame()
df

In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
# create an empty DataFrame with specific columns

df = pd.DataFrame(columns = ['these', 'are', 'columns'])
df

Unnamed: 0,these,are,columns


In [None]:
df = pd.DataFrame(columns = ['these', 'are', 'columns'], index = [0,1,2,3,4])
df

Unnamed: 0,these,are,columns
0,,,
1,,,
2,,,
3,,,
4,,,


In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
df1 = pd.DataFrame({'patient': ['b', 'a', 'c', 'e', 'f'], # this is the first column
                    'height [cm]': np.random.randint(140, 200, 5)} # this is the second column
                   
                   )
df1

Unnamed: 0,patient,height [cm]
0,b,149
1,a,165
2,c,165
3,e,156
4,f,177


In [None]:
type({'patient': ['b', 'a', 'c', 'e', 'f'], # this is the first column
                    'height [cm]': np.random.randint(140, 200, 5)} # this is the second column
                   
                   )

dict

In [None]:
df2 = pd.DataFrame({'patient': ['a', 'b', 'd','f'], 
                    'weight [kg]': np.random.uniform(45, 120, 4)})
df2

Unnamed: 0,patient,weight [kg]
0,a,79.61952
1,b,77.214834
2,d,111.482071
3,f,75.879996


In [None]:
df3 = pd.DataFrame({'patient': ['b', 'a', 'c', 'd', 'f'], 
                    'shoe size [EU]': np.random.randint(36, 46, 5)})
df3

Unnamed: 0,patient,shoe size [EU]
0,b,42
1,a,40
2,c,41
3,d,36
4,f,38


#### Create DataFrame from Series

In [None]:
df3 = pd.DataFrame()
# populate each column
df3['patient'] = patient
df3['shoe size [EU]'] = shoe_size
df3

Unnamed: 0,patient,shoe size [EU]
0,b,44
1,a,39
2,c,42
3,d,37
4,f,41


In [None]:
df3['patient']

0    b
1    a
2    c
3    d
4    f
Name: patient, dtype: object

In [None]:
df3


Unnamed: 0,patient,shoe size [EU]
0,b,44
1,a,39
2,c,42
3,d,37
4,f,41
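
A DataFrame can also be built in one step from a dictionary of Series. The sketch below assumes fixed shoe sizes instead of random ones, and the names `df_from_series` and `short_series` are made up for this example. It also shows that Series are matched on their index, so a Series with fewer labels leaves NaN in the rows it does not cover.

```python
import pandas as pd

patient = pd.Series(['b', 'a', 'c', 'd', 'f'])
shoe_size = pd.Series([44, 39, 42, 37, 41])  # fixed values instead of random ones

# build the DataFrame in one step from a dictionary of Series
df_from_series = pd.DataFrame({'patient': patient, 'shoe size [EU]': shoe_size})

# Series are matched on their index: rows without a matching label get NaN
short_series = pd.Series([170, 165], index=[0, 1])
df_from_series['height [cm]'] = short_series

print(df_from_series)
```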


## Data import
Finally, let's import some data.

Most of the data you will deal with in this course is in the form of plain text (.txt), comma-separated values (.csv), tab-separated values (.tsv), Excel files (.xlsx), etc.

Pandas provides a reader function for each of these formats, such as `pd.read_csv` and `pd.read_excel`.

It creates different objects to contain the data. We will use DataFrames at first.

#### Importing from web

Comma-separated file:


```
pd.read_csv('filename.csv')
```



In [None]:
SNP_file_name ="https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/CD93_exomeSNPs_annotation.csv"
SNPs = pd.read_csv(SNP_file_name)
SNPs

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
0,0,rs7492,3_prime_UTR_variant,,ENST00000246006,,,,,20,23079620,23079620
1,1,rs2567612,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082535,23082535
2,2,rs2749811,3_prime_UTR_variant,,ENST00000246006,,,,,20,23079544,23079544
3,3,rs2749812,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082290,23082290
4,4,rs2749813,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082347,23082347
...,...,...,...,...,...,...,...,...,...,...,...,...
2306,2306,rs1600423846,synonymous_variant,P,ENST00000246006,,,,,20,23085689,23085689
2307,2307,rs1600424016,missense_variant,W/S,ENST00000246006,1.00,probably damaging,0.00,deleterious,20,23085810,23085810
2308,2308,rs1600424406,missense_variant,T/P,ENST00000246006,0.36,benign,0.07,tolerated,20,23086186,23086186
2309,2309,rs1600424446,5_prime_UTR_variant,,ENST00000246006,,,,,20,23086256,23086256


Tab-separated file:


```
pd.read_csv(filename, sep='\t')
```
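
To see what the `sep` argument does without downloading anything, here is a small sketch that reads two in-memory "files" with made-up content (the drug names and doses are only placeholders). With the wrong separator this toy table ends up in a single wide column; depending on the content of a real file, pandas may raise an error instead.

```python
import io
import pandas as pd

# two tiny in-memory "files" with made-up content, one per separator
csv_text = "drug,dose_mg\naspirin,100\nibuprofen,400\n"
tsv_text = "drug\tdose_mg\naspirin\t100\nibuprofen\t400\n"

from_csv = pd.read_csv(io.StringIO(csv_text))            # default separator is ','
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # tab-separated

# reading the tab-separated text with the default separator gives one wide column
wrong_sep = pd.read_csv(io.StringIO(tsv_text))

print(from_csv.equals(from_tsv))
print(wrong_sep.shape)
```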






In [None]:
# what happens when you try to read a tsv file without specifying the separator?
drugs =  pd.read_csv('https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/drugs.tsv')

# tsv is a tab-separated file. You need to specify that with the argument sep='\t'
# the same applies if the file is separated by spaces: in that case, specify sep=' '


ParserError: ignored

In [None]:
drugs =  pd.read_csv('https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/drugs.tsv', 
                     sep='\t')
drugs

Unnamed: 0,PharmGKB Accession Id,Name,Generic Names,Trade Names,Brand Mixtures,Type,Cross-references,SMILES,InChI,Dosing Guideline,...,VIP Count,Dosing Guideline Sources,Top Clinical Annotation Level,Top FDA Label Testing Level,Top Any Drug Label Testing Level,Label Has Dosing Info,Has Rx Annotation,RxNorm Identifiers,ATC Identifiers,PubChem Compound Identifiers
0,PA164712302,2-amino-1-phenylethanol derivatives,,,,Drug Class,,,,No,...,0,,,,,,,,C04AA,
1,PA134967247,2-methoxyestradiol,,,,Drug,PubChem Compound:66414,,,No,...,0,,,,,,,,,66414
2,PA131887008,"3,4-methylenedioxymethamphetamine","Ecstasy,""MDMA""",,,Drug,"ChEBI:CHEBI:1391,""Chemical Abstracts Service:4...",CC(CC1=CC2=C(C=C1)OCO2)NC,InChI=1S/C11H15NO2/c1-8(12-2)5-9-3-4-10-11(6-9...,No,...,1,,3,,,,,,,1615
3,PA165958321,"3,5-dimethyl-2-(3-pyridyl)thiazolidin-4-one","( )-cis-3,5-Dimethyl-2-(3-pyridyl)thiazolidin-...",,,Drug,PubChem Compound:178014,C[C@H]1C(=O)N([C@H](S1)C2=CN=CC=C2)C.Cl,InChI=1S/C10H12N2OS.ClH/c1-7-9(13)12(2)10(14-7...,No,...,1,,,,,,,,,178014
4,PA165858618,3-aminopyridine-2-carboxaldehyde thiosemicarba...,,,,Drug,PubChem Compound:9571836,C1=CC(=C(N=C1)/C=N/NC(=S)N)N,InChI=1S/C7H9N5S/c8-5-2-1-3-10-6(5)4-11-12-7(9...,No,...,1,,,,,,,,,9571836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3443,PA451978,zonisamide,"Zonisamida [Spanish],""Zonisamidum [Latin]"",""zo...","Exceglan,""Excegram"",""Excegran"",""Zonegran""",,Drug,"BindingDB:10888,""ChEBI:CHEBI:10127"",""Chemical ...",C1=CC=C2C(=C1)C(=NO2)CS(=O)(=O)N,"InChI=1S/C8H8N2O3S/c9-14(11,12)5-7-6-3-1-2-4-8...",No,...,1,,3,,Informative PGx,,,39998,N03AX15,5734
3444,PA10236,zopiclone,"(+-)-zopiclone,""Zopiclona [INN-Spanish]"",""Zopi...","Amoban,""Amovane"",""Imovance"",""Imovane"",""Novo-zo...",,Drug,"BindingDB:50054136,""ChEBI:CHEBI:32315"",""Chemic...",CN1CCN(CC1)C(=O)OC2C3=NC=CN=C3C(=O)N2C4=NC=C(C...,InChI=1S/C17H17ClN6O3/c1-22-6-8-23(9-7-22)17(2...,No,...,0,,,,,,,40001,N05CF01,5735
3445,PA164924567,Zosuquidar,,,,Drug,,,,No,...,1,,,,,,,,,
3446,PA452606,zoxazolamine,,,,Drug,,,,No,...,0,,,,,,,,,


In [None]:
pd.read_csv(SNP_file_name, sep='\t')

Unnamed: 0,",Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)"
0,"0,rs7492,3_prime_UTR_variant,,ENST00000246006,..."
1,"1,rs2567612,3_prime_UTR_variant,,ENST000002460..."
2,"2,rs2749811,3_prime_UTR_variant,,ENST000002460..."
3,"3,rs2749812,3_prime_UTR_variant,,ENST000002460..."
4,"4,rs2749813,3_prime_UTR_variant,,ENST000002460..."
...,...
2306,"2306,rs1600423846,synonymous_variant,P,ENST000..."
2307,"2307,rs1600424016,missense_variant,W/S,ENST000..."
2308,"2308,rs1600424406,missense_variant,T/P,ENST000..."
2309,"2309,rs1600424446,5_prime_UTR_variant,,ENST000..."


In [None]:
# and an excel file?

ETH_workplaces =  pd.read_excel('https://github.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/blob/main/pandas/FS%202021_ETH%20Workplaces%20.xlsx?raw=true')

In [None]:
ETH_workplaces

Unnamed: 0,Gebäudebereich,Gebäude,Mo - Fr,Sa + So
0,ET,ETA,06:30 - 20:30,geschlossen
1,ET,ETF,06:30 - 20:30,geschlossen
2,ET,ETZ,06:30 - 20:30,geschlossen
3,HC,HCI,06:30 - 22:00,Sa: 9-19 / So: 10-16
4,HC,HCP,06:30 - 20:30,geschlossen
5,HC,HPI,05:45 - 21:00,geschlossen
6,HG,HG,06:00 - 22:00,08:00 - 17:00
7,HI,HIL,07:00 - 22:00,Sa: 08:00 - 12:00
8,HI,HIT,07:00 - 20:30,geschlossen
9,HP,HPH,07:00 - 20:30,geschlossen


Specify the sheet:



```
pd.read_excel('filename.xlsx', sheet_name='sheet_name')
```



#### Importing from a local drive

In [None]:
# choose a file from your computer (this works only in Google Colab, not in Jupyter Notebook)
from google.colab import files
uploaded = files.upload()
file_name = 'kidpackgenes.csv'


In [None]:
import io
genes = pd.read_csv(io.BytesIO(uploaded[file_name]))
# Dataset is now stored in a Pandas Dataframe

#### Importing from Google Drive (mounted in Colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Create a folder on your Google Drive for this course (in this case I called it HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L).

In [None]:
!ls

drive  sample_data


In [None]:
!pwd

/content


In [None]:
# note: !cd runs in a temporary subshell, so the working directory of the notebook does not change
!cd drive


In [None]:
cd drive/

/content/drive


In [None]:
ls

[0m[01;34mMyDrive[0m/


In [None]:
# changes to the folder of interest (%cd persists across cells, whereas !cd does not)
%cd /content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L

In [None]:
directory = '/content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L'
file_name = 'kidpackgenes.csv'

In [None]:
genes = pd.read_csv('%s/%s' %(directory, file_name))
genes

#### Importing from your computer (on Jupyter Notebook)

In [None]:
# in Jupyter Lab on your computer, you would pass the path of the file
# for instance (on macOS)

genes = pd.read_csv('/Users/elisa/kidpackgenes.csv')

# for instance (on Windows; the r prefix makes Python read the backslashes literally)

genes = pd.read_csv(r'C:\Documents\kidpackgenes.csv')
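
Paths can also be built with `pathlib`, which uses the right separator on macOS, Linux and Windows. This is just a sketch: a temporary folder stands in for your own data folder, and the file name `example_genes.csv` and its two-row content are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# a temporary folder stands in for your own data folder (assumption for this sketch)
with tempfile.TemporaryDirectory() as tmp_dir:
    directory = Path(tmp_dir)
    file_path = directory / 'example_genes.csv'   # hypothetical file name

    # write a tiny example table so that there is something to read back
    example = pd.DataFrame({'gene': ['CD93', 'TP53'], 'chromosome': [20, 17]})
    example.to_csv(file_path, index=False)

    # read_csv accepts a Path object as well as a string
    genes_example = pd.read_csv(file_path)

print(genes_example)
```

On your own computer you would replace the temporary folder with something like `Path.home() / 'Documents'`.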


## Get info about your DataFrame

#### Show parts of the DataFrame

In [None]:
SNPs.head() # shows the first 5 rows of the DataFrame; use head(n) for the first n rows

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
0,0,rs7492,3_prime_UTR_variant,,ENST00000246006,,,,,20,23079620,23079620
1,1,rs2567612,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082535,23082535
2,2,rs2749811,3_prime_UTR_variant,,ENST00000246006,,,,,20,23079544,23079544
3,3,rs2749812,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082290,23082290
4,4,rs2749813,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082347,23082347


In [None]:
SNPs.tail() # shows the last 5 rows of the DataFrame

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
2306,2306,rs1600423846,synonymous_variant,P,ENST00000246006,,,,,20,23085689,23085689
2307,2307,rs1600424016,missense_variant,W/S,ENST00000246006,1.0,probably damaging,0.0,deleterious,20,23085810,23085810
2308,2308,rs1600424406,missense_variant,T/P,ENST00000246006,0.36,benign,0.07,tolerated,20,23086186,23086186
2309,2309,rs1600424446,5_prime_UTR_variant,,ENST00000246006,,,,,20,23086256,23086256
2310,2310,rs1600424486,5_prime_UTR_variant,,ENST00000246006,,,,,20,23086310,23086310


In [None]:
SNPs.sample(10) # shows 10 random rows of the DataFrame

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
1267,1267,rs1033785456,3_prime_UTR_variant,,ENST00000246006,,,,,20,23083875,23083875
509,509,rs748574833,missense_variant,R/K,ENST00000246006,0.02,benign,0.18,tolerated,20,23084280,23084280
148,148,rs185416527,3_prime_UTR_variant,,ENST00000246006,,,,,20,23079712,23079712
2113,2113,rs1456830172,missense_variant,D/E,ENST00000246006,0.28,benign,0.06,tolerated,20,23085182,23085182
1001,1001,rs917687325,intron_variant,,ENST00000246006,,,,,20,23084031,23084031
213,213,rs200777310,missense_variant,Q/R,ENST00000246006,0.009,benign,0.53,tolerated,20,23085744,23085744
1168,1168,rs990967931,3_prime_UTR_variant,,ENST00000246006,,,,,20,23080096,23080096
1115,1115,rs969776198,synonymous_variant,K,ENST00000246006,,,,,20,23085857,23085857
1272,1272,rs1043783054,3_prime_UTR_variant,,ENST00000246006,,,,,20,23083644,23083644
1021,1021,rs928513601,3_prime_UTR_variant,,ENST00000246006,,,,,20,23081577,23081577


#### Show info about size/shape of DataFrame, column names, data types and null values


In [None]:
SNPs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2311 entries, 0 to 2310
Data columns (total 12 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Unnamed: 0                               2311 non-null   int64  
 1   Variant name                             2311 non-null   object 
 2   Variant consequence                      2311 non-null   object 
 3   Protein allele                           989 non-null    object 
 4   Transcript stable ID                     2311 non-null   object 
 5   PolyPhen score                           639 non-null    float64
 6   PolyPhen prediction                      639 non-null    object 
 7   SIFT score                               639 non-null    float64
 8   SIFT prediction                          639 non-null    object 
 9   Chromosome/scaffold name                 2311 non-null   int64  
 10  Chromosome/scaffold position start (bp)  2311 no

In [None]:
SNPs.describe()

Unnamed: 0.1,Unnamed: 0,PolyPhen score,SIFT score,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
count,2311.0,639.0,639.0,2311.0,2311.0,2311.0
mean,1155.0,0.376396,0.196948,20.0,23083410.0,23083410.0
std,667.272558,0.40975,0.272003,0.0,2005.353,2005.379
min,0.0,0.0,0.0,20.0,23079360.0,23079360.0
25%,577.5,0.007,0.0,20.0,23081740.0,23081740.0
50%,1155.0,0.113,0.07,20.0,23083880.0,23083880.0
75%,1732.5,0.8405,0.29,20.0,23085120.0,23085120.0
max,2310.0,1.0,1.0,20.0,23086320.0,23086320.0


In [None]:
SNPs.columns

Index(['Unnamed: 0', 'Variant name', 'Variant consequence', 'Protein allele',
       'Transcript stable ID', 'PolyPhen score', 'PolyPhen prediction',
       'SIFT score', 'SIFT prediction', 'Chromosome/scaffold name',
       'Chromosome/scaffold position start (bp)',
       'Chromosome/scaffold position end (bp)'],
      dtype='object')

In [None]:
SNPs.index

RangeIndex(start=0, stop=2311, step=1)

In [None]:
SNPs.dtypes

Unnamed: 0                                   int64
Variant name                                object
Variant consequence                         object
Protein allele                              object
Transcript stable ID                        object
PolyPhen score                             float64
PolyPhen prediction                         object
SIFT score                                 float64
SIFT prediction                             object
Chromosome/scaffold name                     int64
Chromosome/scaffold position start (bp)      int64
Chromosome/scaffold position end (bp)        int64
dtype: object

In [None]:
# how many null values in the data frame?
SNPs.isna().sum()

Unnamed: 0                                    0
Variant name                                  0
Variant consequence                           0
Protein allele                             1322
Transcript stable ID                          0
PolyPhen score                             1672
PolyPhen prediction                        1672
SIFT score                                 1672
SIFT prediction                            1672
Chromosome/scaffold name                      0
Chromosome/scaffold position start (bp)       0
Chromosome/scaffold position end (bp)         0
dtype: int64

In [None]:
# how many null values in a specific column?
SNPs['PolyPhen prediction'].isna().sum()

1672
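
The same checks can be tried on a tiny table where the answers are easy to verify by eye. The DataFrame `toy` below is made up for this sketch and is not part of the SNP data.

```python
import numpy as np
import pandas as pd

# small made-up table with some missing values
toy = pd.DataFrame({
    'variant': ['v1', 'v2', 'v3', 'v4'],
    'score': [0.9, np.nan, 0.1, np.nan],
})

n_rows, n_columns = toy.shape                   # (rows, columns)
missing_per_column = toy.isna().sum()           # number of null values in each column
present_in_score = toy['score'].notna().sum()   # number of non-null values in one column
fraction_missing = toy['score'].isna().mean()   # share of null values (True counts as 1)

print(n_rows, n_columns)
print(missing_per_column)
print(present_in_score, fraction_missing)
```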

# Getting data from DataFrames


#### Get values from one column

In [None]:
SNPs['Variant name']

0             rs7492
1          rs2567612
2          rs2749811
3          rs2749812
4          rs2749813
            ...     
2306    rs1600423846
2307    rs1600424016
2308    rs1600424406
2309    rs1600424446
2310    rs1600424486
Name: Variant name, Length: 2311, dtype: object

In [None]:
#columns are Series!!!

type(SNPs['Variant name'])

pandas.core.series.Series

In [None]:
type(SNPs)

pandas.core.frame.DataFrame

#### Get a value from a specific position of the dataframe

When you know the index label of the row and the name of the column, use:


.loc[ ]




In [None]:
# index 10 and column Variant name
SNPs.loc[10, 'Variant name']

'rs3746732'

In [None]:
# a range of rows (with .loc, both ends of the range are included) and column Variant name
SNPs.loc[10:20, 'Variant name']

10    rs3746732
11    rs3803984
12    rs3803985
13    rs3803986
14    rs6048536
15    rs6048537
16    rs6048538
17    rs6048539
18    rs6076019
19    rs6076020
20    rs6076020
Name: Variant name, dtype: object

In [None]:
# a range of rows for all columns
SNPs.loc[10:20, :]

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
10,10,rs3746732,synonymous_variant,R,ENST00000246006,,,,,20,23084705,23084705
11,11,rs3803984,3_prime_UTR_variant,,ENST00000246006,,,,,20,23080094,23080094
12,12,rs3803985,3_prime_UTR_variant,,ENST00000246006,,,,,20,23081506,23081506
13,13,rs3803986,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082256,23082256
14,14,rs6048536,3_prime_UTR_variant,,ENST00000246006,,,,,20,23082570,23082570
15,15,rs6048537,3_prime_UTR_variant,,ENST00000246006,,,,,20,23083648,23083648
16,16,rs6048538,missense_variant,S/Y,ENST00000246006,0.96,probably damaging,0.0,deleterious,20,23085537,23085537
17,17,rs6048539,synonymous_variant,P,ENST00000246006,,,,,20,23085701,23085701
18,18,rs6076019,3_prime_UTR_variant,,ENST00000246006,,,,,20,23083642,23083642
19,19,rs6076020,missense_variant,G/C,ENST00000246006,0.319,benign,0.01,deleterious,20,23085688,23085688


When you know the numerical positions of the row and the column (counting from 0), use:

.iloc[ ]


In [None]:
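# row position 10 and column position 2 (the third column, Variant consequence)
SNPs.iloc[10, 2]

# the same selection, with the positions stored in variables
row = 10
column = 2
SNPs.iloc[row, column]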

'synonymous_variant'
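
One difference between the two is easy to miss: a slice with `.loc` includes its end label, while a slice with `.iloc` excludes its end position. A minimal sketch on a made-up table, whose index labels are deliberately different from the row positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['a', 'b', 'c', 'd', 'e'], 'value': [10, 20, 30, 40, 50]},
    index=[100, 101, 102, 103, 104],   # labels that differ from the positions
)

by_label = toy.loc[101:103, 'value']   # labels 101, 102 and 103: the end label is included
by_position = toy.iloc[1:3, 1]         # positions 1 and 2: the end position is excluded

print(by_label.tolist())
print(by_position.tolist())
```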

#### Broadcasting

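I want to know the variant consequence of the variant rs6076020. Comparing a whole column with a single value applies the comparison to every row and returns a Series of True/False values (a boolean mask), which can then be used to select rows.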

In [None]:
# use a boolean mask as row selection

mask = SNPs['Variant name'] == 'rs6076020'

SNPs.loc[mask, :]

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
19,19,rs6076020,missense_variant,G/C,ENST00000246006,0.319,benign,0.01,deleterious,20,23085688,23085688
20,20,rs6076020,missense_variant,G/S,ENST00000246006,0.0,benign,0.5,tolerated,20,23085688,23085688


In [None]:
mask = SNPs['Chromosome/scaffold position start (bp)'] > 23085688

In [None]:
SNPs.loc[mask, :]

Unnamed: 0.1,Unnamed: 0,Variant name,Variant consequence,Protein allele,Transcript stable ID,PolyPhen score,PolyPhen prediction,SIFT score,SIFT prediction,Chromosome/scaffold name,Chromosome/scaffold position start (bp),Chromosome/scaffold position end (bp)
17,17,rs6048539,synonymous_variant,P,ENST00000246006,,,,,20,23085701,23085701
21,21,rs6076021,missense_variant,S/I,ENST00000246006,0.701,possibly damaging,0.01,deleterious,20,23085729,23085729
22,22,rs6076022,missense_variant,Q/H,ENST00000246006,0.482,possibly damaging,0.13,tolerated,20,23085743,23085743
45,45,rs41394246,5_prime_UTR_variant,,ENST00000246006,,,,,20,23086216,23086216
65,65,rs35174999,missense_variant,G/W,ENST00000246006,0.999,probably damaging,0.00,deleterious,20,23085838,23085838
...,...,...,...,...,...,...,...,...,...,...,...,...
2306,2306,rs1600423846,synonymous_variant,P,ENST00000246006,,,,,20,23085689,23085689
2307,2307,rs1600424016,missense_variant,W/S,ENST00000246006,1.000,probably damaging,0.00,deleterious,20,23085810,23085810
2308,2308,rs1600424406,missense_variant,T/P,ENST00000246006,0.360,benign,0.07,tolerated,20,23086186,23086186
2309,2309,rs1600424446,5_prime_UTR_variant,,ENST00000246006,,,,,20,23086256,23086256
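
Masks can be combined with `&` (and), `|` (or) and `~` (not). The sketch below uses a small made-up table rather than the SNP data; the column names and positions are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'consequence': ['missense_variant', 'synonymous_variant', 'missense_variant', 'intron_variant'],
    'position': [100, 150, 200, 250],
})

is_missense = toy['consequence'] == 'missense_variant'
is_downstream = toy['position'] > 120

both = toy.loc[is_missense & is_downstream, :]     # rows where both conditions are True
either = toy.loc[is_missense | is_downstream, :]   # rows where at least one condition is True
not_missense = toy.loc[~is_missense, :]            # rows where the first condition is False

print(both)
print(len(either), len(not_missense))
```

When the conditions are written inline instead of being stored in variables, each one needs its own parentheses, for example `(toy['position'] > 120) & (toy['position'] < 220)`.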


#### Select only specific columns of the DataFrame

In [None]:
drugs_subset = drugs[['PharmGKB Accession Id', 'Name', 'Type']] # a list of column names!
drugs_subset

Unnamed: 0,PharmGKB Accession Id,Name,Type
0,PA164712302,2-amino-1-phenylethanol derivatives,Drug Class
1,PA134967247,2-methoxyestradiol,Drug
2,PA131887008,"3,4-methylenedioxymethamphetamine",Drug
3,PA165958321,"3,5-dimethyl-2-(3-pyridyl)thiazolidin-4-one",Drug
4,PA165858618,3-aminopyridine-2-carboxaldehyde thiosemicarba...,Drug
...,...,...,...
3443,PA451978,zonisamide,Drug
3444,PA10236,zopiclone,Drug
3445,PA164924567,Zosuquidar,Drug
3446,PA452606,zoxazolamine,Drug
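
The number of brackets decides what comes back: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one name. A short sketch on a made-up table (the ids and names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['PA1', 'PA2'],
    'name': ['drug one', 'drug two'],
    'type': ['Drug', 'Drug'],
})

one_column = toy['name']            # single brackets: a Series
one_column_table = toy[['name']]    # a list with one name: a DataFrame with one column
two_columns = toy[['id', 'name']]   # a list with two names: a DataFrame with two columns

print(type(one_column).__name__, type(one_column_table).__name__)
print(two_columns.shape)
```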


# Save the dataframe with .to_csv()

In [None]:
# save the dataframe in a specific folder of your Google Drive
directory = '/content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L'

stuff='name1'
df3.to_csv('%s/example_saved_%s_dataframe.csv' %(directory, stuff))





In [None]:
# save the data in google colab space (see the file explorer on the left)

df3.to_csv('/content/sample_data/example_saved_dataframe.csv')
# and then download it
files.download('/content/sample_data/example_saved_dataframe.csv')



In [None]:
# on a Jupyter notebook on your computer, simply specify your folder of interest and file

df3.to_csv('/Users/elisa/Documents/file.csv')
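
By default `.to_csv()` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below assumes a small made-up table and writes to a string instead of a file, so that nothing is saved on disk; passing `index=False` leaves the index out.

```python
import io
import pandas as pd

toy = pd.DataFrame({'patient': ['b', 'a', 'c'], 'shoe size [EU]': [44, 39, 42]})

# default: the index is written as an extra first column without a name
with_index = toy.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(with_index))

# index=False leaves the index out of the output
without_index = toy.to_csv(index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```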




# Exercises



#### Exercise 1

Import a CSV file from your computer and get its column names, size, data types, and number of null values per column.

#### Exercise 1b

Import an Excel file from your computer and do the same as above.

#### Exercise 2

From the DataFrame above, select only the first 4 columns, and save the dataset as "dataframe.csv" in a new "exercise" folder in your Google Drive.

#### Exercise 3

Create a dictionary that has the PharmGKB Accession Id as keys and the Name of the drug as values (DataFrame *drugs* from above).

In [None]:
drugs.head()

Unnamed: 0,PharmGKB Accession Id,Name,Generic Names,Trade Names,Brand Mixtures,Type,Cross-references,SMILES,InChI,Dosing Guideline,...,VIP Count,Dosing Guideline Sources,Top Clinical Annotation Level,Top FDA Label Testing Level,Top Any Drug Label Testing Level,Label Has Dosing Info,Has Rx Annotation,RxNorm Identifiers,ATC Identifiers,PubChem Compound Identifiers
0,PA164712302,2-amino-1-phenylethanol derivatives,,,,Drug Class,,,,No,...,0,,,,,,,,C04AA,
1,PA134967247,2-methoxyestradiol,,,,Drug,PubChem Compound:66414,,,No,...,0,,,,,,,,,66414.0
2,PA131887008,"3,4-methylenedioxymethamphetamine","Ecstasy,""MDMA""",,,Drug,"ChEBI:CHEBI:1391,""Chemical Abstracts Service:4...",CC(CC1=CC2=C(C=C1)OCO2)NC,InChI=1S/C11H15NO2/c1-8(12-2)5-9-3-4-10-11(6-9...,No,...,1,,3.0,,,,,,,1615.0
3,PA165958321,"3,5-dimethyl-2-(3-pyridyl)thiazolidin-4-one","( )-cis-3,5-Dimethyl-2-(3-pyridyl)thiazolidin-...",,,Drug,PubChem Compound:178014,C[C@H]1C(=O)N([C@H](S1)C2=CN=CC=C2)C.Cl,InChI=1S/C10H12N2OS.ClH/c1-7-9(13)12(2)10(14-7...,No,...,1,,,,,,,,,178014.0
4,PA165858618,3-aminopyridine-2-carboxaldehyde thiosemicarba...,,,,Drug,PubChem Compound:9571836,C1=CC(=C(N=C1)/C=N/NC(=S)N)N,InChI=1S/C7H9N5S/c8-5-2-1-3-10-6(5)4-11-12-7(9...,No,...,1,,,,,,,,,9571836.0


#### Exercise 4

Find the row for aspirin in the DataFrame above.