<a href="https://colab.research.google.com/github/genomika/pandas-workshop/blob/master/workshop-python_pandas_gnmk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1st Introductory Workshop on the Pandas Library

![pandas](https://files.realpython.com/media/Python-Pandas-10-Tricks--Features-You-May-Not-Know-Watermark.e58bb5ce9835.jpg)
Figure 1: ["Python Pandas: Tricks & Features You May Not Know"](https://realpython.com/python-pandas-tricks/)

0. Installing conda, Jupyter Notebook, pandas, etc.

  * [Installing Conda on Ubuntu 18.04](https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart-pt)
  * [Installing Conda on Windows](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html)
  * [Installing Conda on macOS](https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html)

1. Opening files with Pandas

  * Series and DataFrames;
  * CSV;
  * Excel;
  * Python dictionaries;
  * Python lists of tuples;
  * Python lists of dictionaries;
  * Basic `DataFrame` manipulation;
  * Writing CSV and Excel.

2. Manipulating `DataFrame`s

  * Manipulating indexes;
  * Conditional selections;
  * Working with missing data;
  * Grouping a `DataFrame`;
  * How to join different `DataFrame`s.

3. Handling big data with Vaex

4. Hands-on with a sample dataset

# 1. Opening files with Pandas

### `Series` and `DataFrame`

To manipulate data with pandas, the data must first be imported into the library. The most common way to do this is to build the framework's native data structures from the material being analyzed. In this course we will work mainly with [*`DataFrame`*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and [*`Series`*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

A `Series` is essentially a structure similar to a one-dimensional array whose purpose is to store a sequence of values in an indexed way.

In [1]:
import pandas as pd
series = pd.Series([42, 13, 7, -99], index=["não", "aqui", "é o", "patrick"])
series

não        42
aqui       13
é o         7
patrick   -99
dtype: int64

More simply, we can think of a `Series` as an ordered, fixed-size dictionary. In fact, we can create these structures from dictionaries and use them in contexts where we would use dictionaries.

In [2]:
# https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify
spotify_mais_escutadas = {"Shape of You": 2433000, "Rockstar": 1831000, "One Dance": 1817000, "Closer": 1724000, "Thinking Out Loud": 1493000}
mais_escutadas_series = pd.Series(spotify_mais_escutadas)
mais_escutadas_series

Shape of You         2433000
Rockstar             1831000
One Dance            1817000
Closer               1724000
Thinking Out Loud    1493000
dtype: int64

In [3]:
"Rockstar" in mais_escutadas_series

True

In [4]:
"Bota Bota" in mais_escutadas_series # :(

False
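
As a small sketch of the dictionary analogy, the snippet below reads values by label, falls back to a default for a missing label, and iterates over label/value pairs. The `streams` name and the shortened data are illustrative and not part of the workshop material.

```python
import pandas as pd

# Illustrative Series built from a dictionary (a subset of the data above)
streams = pd.Series({"Shape of You": 2433000, "Rockstar": 1831000, "One Dance": 1817000})

# Label-based lookup, like a dictionary key
rockstar = streams["Rockstar"]

# .get() returns a default instead of raising KeyError for a missing label
missing = streams.get("Bota Bota", default=0)

# Position-based lookup, which a plain dictionary does not offer
first = streams.iloc[0]

# Iterating with .items() yields (label, value) pairs, like dict.items()
for song, count in streams.items():
    print(song, count)
```

Unlike a dictionary, a `Series` also supports positional access through `iloc`, which is why it is often described as a mix of an array and a mapping.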

`DataFrame`s are tabular representations of data. This structure contains an ordered set of columns, each of which can hold a specific type of value (numeric, text, boolean, etc.). In a `DataFrame`, both columns and rows have an index.

In [5]:
formato_data = "%d/%m/%Y"
spotify_mais_escutadas = {"música": ["Shape of You", "Rockstar", "One Dance", "Closer", "Thinking Out Loud"],
                          "número_streams": [2433000, 1831000, 1817000, 1724000, 1493000],
                          "data_publicação": [pd.to_datetime("6/1/2017", format=formato_data),
                                              pd.to_datetime("15/9/2017", format=formato_data),
                                              pd.to_datetime("5/4/2016", format=formato_data),
                                              pd.to_datetime("29/7/2016", format=formato_data),
                                              pd.to_datetime("20/6/2014", format=formato_data)]}
spotify_mais_escutadas_df = pd.DataFrame(spotify_mais_escutadas)
spotify_mais_escutadas_df

Unnamed: 0,música,número_streams,data_publicação
0,Shape of You,2433000,2017-01-06
1,Rockstar,1831000,2017-09-15
2,One Dance,1817000,2016-04-05
3,Closer,1724000,2016-07-29
4,Thinking Out Loud,1493000,2014-06-20


In this workshop we will work mostly with `DataFrame`s. Although it is not a universal solution to every problem, the set of methods and tools that come with the `DataFrame` type provides a solid, easy-to-use foundation for most use cases.
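
To make the "one type per column, an index on both axes" idea concrete, here is a minimal sketch that inspects a two-row version of the Spotify table. The variable name `df` and the reduced data are assumptions made for illustration.

```python
import pandas as pd

# Reduced, illustrative version of the table shown above
df = pd.DataFrame({
    "música": ["Shape of You", "Rockstar"],
    "número_streams": [2433000, 1831000],
    "data_publicação": pd.to_datetime(["6/1/2017", "15/9/2017"], format="%d/%m/%Y"),
})

print(df.dtypes)    # one dtype per column
print(df.index)     # row labels (a RangeIndex by default)
print(df.columns)   # column labels
```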

### CSV

To create a `DataFrame` from a `.csv` file we use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. By definition, `.csv` files use commas as field separators. However, we can change the argument of the function's `sep` parameter to create a `DataFrame` from files delimited by other characters, such as *tab*.

In [6]:
multianno_df = pd.read_csv("https://www.dropbox.com/s/xsxbyel2t9fjidl/GNMK_WORKSHOP-001.avinput.hg19_multianno.ann.txt?dl=1", sep="\t")
multianno_df.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
2,chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exo...,.,.,...,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
3,chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
4,chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,...,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"


By default, `read_csv` infers the `DataFrame` header from the first line of the file. This behavior can be changed with a combination of values for the `header` and `names` parameters.
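
The following sketch shows one such combination for a file that has no header row. It reads from an in-memory string instead of a real file, and the content and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated content without a header row
raw = "chr10\t89622938\tA\tC\nchr10\t89623142\tC\tT\n"

# header=None tells pandas the first line is data; names supplies the labels
df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["Chr", "Start", "Ref", "Alt"],
)
print(df)
```

Without `header=None`, the first data line would be consumed as column names, so one record would be silently lost.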

### Excel

To create a `DataFrame` from an Excel spreadsheet we use the [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) function. This function supports files with the extensions *xls*, *xlsx*, *xlsm*, *xlsb*, and *odf*. From a black-box perspective, `read_excel` behaves much like `read_csv`: both take a file and create a `DataFrame` from its contents.

In [7]:
aba_exonic_df = pd.read_excel("https://www.dropbox.com/s/gyx0hqco6dhtwg2/GNMK_WORKSHOP-001.xlsx?dl=1")
aba_exonic_df.head()

Unnamed: 0,RANK,CLINVAR,CLINVAR_COMMENTS,OMIM,OMIM_COMMENTS,UNIPROT,UNIPROT_COMMENTS,Func,Gene,ExonicFunc,...,MutationTaster_pred,GERP++_RS,Chr,Start,End,Ref,Alt,REPEATMASK,InterVar(automated),InterVar_Rules
0,29,OK,Pathogenic,OK,"Name={Breast cancer, male, susceptibility to},...",OK,SwissProt: P51587 # Breast cancer (BC) [MIM:11...,exonic,BRCA2,frameshift deletion,...,.,.,chr13,32929058,32929059,TC,-,,.,
1,11,OK,Benign,OK,"Name=Bannayan-Riley-Ruvalcaba syndrome, 153480...",OK,SwissProt: P60484 # Cowden syndrome 1 (CWS1) [...,splicing,PTEN,.,...,.,.,chr10,89623861,89623861,T,-,,.,
2,11,OK,Benign,OK,"Name=Blepharocheilodontic syndrome 1, 119580 (...",OK,SwissProt: P12830 # Hereditary diffuse gastric...,splicing,CDH1,.,...,.,.,chr16,68771372,68771372,C,T,,.,
3,10,OK,Benign,OK,"Name=Bannayan-Riley-Ruvalcaba syndrome, 153480...",OK,SwissProt: P60484 # Cowden syndrome 1 (CWS1) [...,exonic,PTEN,nonsynonymous SNV,...,.,.,chr10,89623901,89623901,G,C,Low_complexity_Low_complexity_GC_rich,Benign,"BA1, BS1"
4,10,OK,Benign,OK,"Name={Breast cancer, male, susceptibility to},...",OK,SwissProt: P51587 # Breast cancer (BC) [MIM:11...,exonic,BRCA2,nonsynonymous SNV,...,P,5.07,chr13,32929387,32929387,T,C,,Benign,"BA1, BS1, BS2, BP1, BP6"


If the spreadsheet contains more than one sheet, `read_excel` will by default create the `DataFrame` from the data in the first sheet. To use data from other sheets, we can change the argument of the `sheet_name` parameter.

In [8]:
# The sheet_name parameter can take a string with the name of the sheet whose data will be loaded
aba_intronic_df = pd.read_excel("https://www.dropbox.com/s/gyx0hqco6dhtwg2/GNMK_WORKSHOP-001.xlsx?dl=1", sheet_name="INTRONIC")
aba_intronic_df.head()

Unnamed: 0,RANK,CLINVAR,CLINVAR_COMMENTS,OMIM,OMIM_COMMENTS,UNIPROT,UNIPROT_COMMENTS,Func,Gene,ExonicFunc,...,MutationTaster_pred,GERP++_RS,Chr,Start,End,Ref,Alt,REPEATMASK,InterVar(automated),InterVar_Rules
0,17,NOK,.,OK,"Name=Cowden syndrome 4, 615107 (3)",OK,SwissProt: B2CW77 # Cowden syndrome 4 (CWS4) [...,UTR5,KLLN,.,...,.,.,chr10,89622938,89622938,A,C,,.,
1,17,OK,Conflicting_interpretations_of_pathogenicity,OK,"Name=Cowden syndrome 4, 615107 (3)",OK,SwissProt: B2CW77 # Cowden syndrome 4 (CWS4) [...,UTR5,KLLN,.,...,.,.,chr10,89623142,89623142,C,T,,.,
2,17,NOK,.,OK,"Name={Breast cancer, susceptibility to}, 11448...",OK,SwissProt: Q86YC2 # Breast cancer (BC) [MIM:11...,intronic,PALB2,.,...,.,.,chr16,23640160,23640160,G,T,SINE_Alu_AluSx1,.,
3,17,NOK,.,OK,"Name=Blepharocheilodontic syndrome 1, 119580 (...",OK,SwissProt: P12830 # Hereditary diffuse gastric...,intronic,CDH1,.,...,.,.,chr16,68820717,68820717,C,A,,.,
4,17,NOK,.,OK,"Name=Blepharocheilodontic syndrome 1, 119580 (...",OK,SwissProt: P12830 # Hereditary diffuse gastric...,intronic,CDH1,.,...,.,.,chr16,68856474,68856474,T,C,,.,


In [9]:
# We can also pass an integer that refers to the sheet's position (zero-based index)
aba_raw_df = pd.read_excel("https://www.dropbox.com/s/gyx0hqco6dhtwg2/GNMK_WORKSHOP-001.xlsx?dl=1", sheet_name=3)
aba_raw_df.head()

Unnamed: 0,RANK,CLINVAR,CLINVAR_COMMENTS,OMIM,OMIM_COMMENTS,UNIPROT,UNIPROT_COMMENTS,Func,Gene,ExonicFunc,...,Ref,Alt,REPEATMASK,InterVar(automated),InterVar_Rules,QUAL.1,FILTER.1,INFO,FORMAT,FORMAT_INFO
0,29,OK,Pathogenic,OK,"Name={Breast cancer, male, susceptibility to},...",OK,SwissProt: P51587 # Breast cancer (BC) [MIM:11...,exonic,BRCA2,frameshift deletion,...,TC,-,,.,,33599.7,PASS,AC=1;AF=0.5;AN=2;BaseQRankSum=2.352;DB;DP=1613...,GT:AD:DP:GQ:PL,"0/1:853,734:1613:99:33637,0,39748"
1,17,NOK,.,OK,"Name=Cowden syndrome 4, 615107 (3)",OK,SwissProt: B2CW77 # Cowden syndrome 4 (CWS4) [...,UTR5,KLLN,.,...,A,C,,.,,290.77,PASS,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
2,17,OK,Conflicting_interpretations_of_pathogenicity,OK,"Name=Cowden syndrome 4, 615107 (3)",OK,SwissProt: B2CW77 # Cowden syndrome 4 (CWS4) [...,UTR5,KLLN,.,...,C,T,,.,,7963.77,PASS,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
3,17,NOK,.,OK,"Name={Breast cancer, susceptibility to}, 11448...",OK,SwissProt: Q86YC2 # Breast cancer (BC) [MIM:11...,intronic,PALB2,.,...,G,T,SINE_Alu_AluSx1,.,,35.77,PASS,ABHet=0.6;AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=5...,GT:AD:DP:GQ:PL,"0/1:3,2:5:64:64,0,103"
4,17,NOK,.,OK,"Name=Blepharocheilodontic syndrome 1, 119580 (...",OK,SwissProt: P12830 # Hereditary diffuse gastric...,intronic,CDH1,.,...,C,A,,.,,680.77,PASS,ABHet=0.587;AC=1;AF=0.5;AN=2;BaseQRankSum=-1.4...,GT:AD:DP:GQ:PL,"0/1:37,26:63:99:709,0,1225"


In [10]:
# To load the data from every sheet in the spreadsheet, pass None to sheet_name
planilha_completa = pd.read_excel("https://www.dropbox.com/s/gyx0hqco6dhtwg2/GNMK_WORKSHOP-001.xlsx?dl=1", sheet_name=None)
type(planilha_completa)

dict

In [11]:
# We can then access the data from every sheet through a single object
aba_exonic_df = planilha_completa["EXONIC"]
aba_exonic_mosaic_df = planilha_completa["EXONIC MOSAIC"]
aba_intronic_df = planilha_completa["INTRONIC"]
aba_raw_df = planilha_completa["RAW"]
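
Because `sheet_name=None` returns a plain dictionary that maps sheet names to `DataFrame`s, the usual dictionary tools apply. The sketch below uses a hand-built stand-in dictionary (the values are made up, and no Excel file is read) to loop over the sheets and to stack them into a single table.

```python
import pandas as pd

# Stand-in for the dict returned by read_excel(..., sheet_name=None);
# the contents are invented for illustration
sheets = {
    "EXONIC": pd.DataFrame({"Gene": ["BRCA2", "PTEN"], "RANK": [29, 10]}),
    "INTRONIC": pd.DataFrame({"Gene": ["KLLN"], "RANK": [17]}),
}

# Each key is a sheet name and each value is that sheet's DataFrame
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)

# Stack all sheets into one DataFrame, keeping the sheet of origin as a column
combined = pd.concat(sheets, names=["sheet", "row"]).reset_index(level="sheet")
print(combined)
```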

### Python dictionaries

When we introduced `DataFrame`s, we used a Python dictionary as the example. Given a dictionary, just pass it to the `DataFrame()` constructor to have it converted.

In [12]:
climas_datas = {
    'dia': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperatura': [32, 35, 28, 24, 32, 31],
    'velocidade_vento': [6, 7, 2, 7, 4, 2],
    'evento': ['Chuva', 'Sol', 'Neve', 'Neve', 'Chuva', 'Sol']
}
df = pd.DataFrame(climas_datas)
df

Unnamed: 0,dia,temperatura,velocidade_vento,evento
0,1/1/2017,32,6,Chuva
1,1/2/2017,35,7,Sol
2,1/3/2017,28,2,Neve
3,1/4/2017,24,7,Neve
4,1/5/2017,32,4,Chuva
5,1/6/2017,31,2,Sol


### Python list of tuples

It is possible to create a `DataFrame` from a list of tuples. The column names should be specified through the `columns` argument (otherwise they default to integer labels), for example:

In [13]:
climas_datas = [
    ('1/1/2017', 32, 6, 'Chuva'),
    ('1/2/2017', 35, 7, 'Sol'),
    ('1/3/2017', 28, 2, 'Neve')
]

df = pd.DataFrame(climas_datas, columns=[
    'dia', 'temperatura', 'velocidade_vento', 'evento'
])
df

Unnamed: 0,dia,temperatura,velocidade_vento,evento
0,1/1/2017,32,6,Chuva
1,1/2/2017,35,7,Sol
2,1/3/2017,28,2,Neve


### Python list of dictionaries

It is possible to create a `DataFrame` from a list of dictionaries, for example:

In [14]:
climas_datas = [
    {'dia': '1/1/2017', 'temperatura': 32, 'velocidade_vento': 6,'evento': 'Chuva'},
    {'dia': '1/2/2017', 'temperatura': 35, 'velocidade_vento': 7,'evento': 'Sunny'},
    {'dia': '1/3/2017', 'temperatura': 28, 'velocidade_vento': 2,'evento': 'Neve'} 
]
df = pd.DataFrame(climas_datas)
df

Unnamed: 0,dia,temperatura,velocidade_vento,evento
0,1/1/2017,32,6,Chuva
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Neve
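
One practical property of the list-of-dictionaries form is that the records do not need identical keys. In the sketch below (illustrative data, with the `evento` key deliberately omitted from the second record), pandas builds the union of the keys and fills the gap with a missing value.

```python
import pandas as pd

records = [
    {"dia": "1/1/2017", "temperatura": 32, "evento": "Chuva"},
    {"dia": "1/2/2017", "temperatura": 35},  # no "evento" key
]
df = pd.DataFrame(records)
print(df)

# The absent key shows up as a missing value (NaN)
print(df["evento"].isna())
```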


* Pandas can open many other file types and turn them into `DataFrame`s, for example `JSON`, `HTML`, `HDF5`, `SQL`, and `Parquet`. For more formats, see the list of available [*IO tools*](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

### Basic `DataFrame` manipulation

To find the size of a `DataFrame` we can use the `shape` attribute, which returns a tuple with the values `(rows, columns)`.

In [15]:
multianno_df.shape

(92, 183)

Viewing the first and last rows of the data.

In [16]:
multianno_df.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
2,chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exo...,.,.,...,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
3,chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
4,chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,...,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"


* The `head()` method shows the first 5 rows of the `DataFrame` by default, but you can specify how many you want to see by passing an integer inside the `()`, for example:

In [17]:
multianno_df.head(23)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
2,chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exo...,.,.,...,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
3,chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
4,chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,...,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"
5,chr10,89654190,89654190,G,A,intronic,PTEN,.,.,.,...,chr10,89654190,rs181780682,G,A,29.77,.,ABHet=0.714;AC=1;AF=0.5;AN=2;BaseQRankSum=0.36...,GT:AD:DP:GQ:PL,"0/1:5,2:7:58:58,0,172"
6,chr10,89711576,89711576,C,T,intronic,PTEN,.,.,.,...,chr10,89711576,rs141005791,C,T,96.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.21...,GT:AD:DP:GQ:PL,"0/1:5,4:9:99:125,0,133"
7,chr10,89718057,89718057,A,G,intronic,PTEN,.,.,.,...,chr10,89718057,.,A,G,759.77,.,ABHet=0.324;AC=1;AF=0.5;AN=2;BaseQRankSum=0.83...,GT:AD:DP:GQ:PL,"0/1:11,23:34:99:788,0,283"
8,chr10,89720548,89720548,A,G,intronic,PTEN,.,.,.,...,chr10,89720548,.,A,G,1491.77,.,ABHet=0.467;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.5...,GT:AD:DP:GQ:PL,"0/1:42,48:90:99:1520,0,1330"
9,chr10,89720634,89720634,T,-,intronic,PTEN,.,.,.,...,chr10,89720633,.,CT,C,1805.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=-0.601;DP=689;Ex...,GT:AD:DP:GQ:PL,"0/1:443,192:689:99:1843,0,6330"


* The `tail()` method can be used in the same way for the last rows.

In [18]:
multianno_df.tail()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
87,chr17,41258136,41258136,A,-,intronic,BRCA1,.,.,.,...,chr17,41258135,.,TA,T,203.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=-1.019;DP=69;Exc...,GT:AD:DP:GQ:PL,"0/1:36,24:69:99:241,0,424"
88,chr19,1206797,1206797,T,-,UTR5,STK11,NM_000455:c.-116del-,.,.,...,chr19,1206796,.,CT,C,2222.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=1.767;ClippingRa...,GT:AD:DP:GQ:MLPSAC:MLPSAF:PL:SAC:SB,"0/1:604,172:776:99:1:0.5:2260,0,15656:245,359,..."
89,chr19,1219444,1219444,C,G,intronic,STK11,.,.,.,...,chr19,1219444,.,C,G,672.77,.,ABHet=0.307;AC=1;AF=0.5;AN=2;BaseQRankSum=-1.8...,GT:AD:DP:GQ:PL,"0/1:39,88:127:99:701,0,458"
90,chr19,1219451,1219451,C,G,intronic,STK11,.,.,.,...,chr19,1219451,.,C,G,213.77,.,ABHet=0.654;AC=1;AF=0.5;AN=2;BaseQRankSum=-1.6...,GT:AD:DP:GQ:PL,"0/1:117,62:179:99:242,0,1041"
91,chr19,1221418,1221418,T,G,intronic,STK11,.,.,.,...,chr19,1221418,.,T,G,25.78,.,ABHet=0.417;AC=1;AF=0.5;AN=2;BaseQRankSum=-5.9...,GT:AD:DP:GQ:PL,"0/1:108,151:263:54:54,0,1020"


In [19]:
multianno_df.tail(2)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
90,chr19,1219451,1219451,C,G,intronic,STK11,.,.,.,...,chr19,1219451,.,C,G,213.77,.,ABHet=0.654;AC=1;AF=0.5;AN=2;BaseQRankSum=-1.6...,GT:AD:DP:GQ:PL,"0/1:117,62:179:99:242,0,1041"
91,chr19,1221418,1221418,T,G,intronic,STK11,.,.,.,...,chr19,1221418,.,T,G,25.78,.,ABHet=0.417;AC=1;AF=0.5;AN=2;BaseQRankSum=-5.9...,GT:AD:DP:GQ:PL,"0/1:108,151:263:54:54,0,1020"


* It is possible to *slice* the rows to be displayed using integer positions (the end value is not included).

In [20]:
multianno_df[3:7]

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
3,chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
4,chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,...,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"
5,chr10,89654190,89654190,G,A,intronic,PTEN,.,.,.,...,chr10,89654190,rs181780682,G,A,29.77,.,ABHet=0.714;AC=1;AF=0.5;AN=2;BaseQRankSum=0.36...,GT:AD:DP:GQ:PL,"0/1:5,2:7:58:58,0,172"
6,chr10,89711576,89711576,C,T,intronic,PTEN,.,.,.,...,chr10,89711576,rs141005791,C,T,96.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.21...,GT:AD:DP:GQ:PL,"0/1:5,4:9:99:125,0,133"
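
As a quick check of the slicing rule, this sketch compares bracket slicing with the explicit positional indexer `iloc` on a small made-up frame. Both forms select the same rows and both exclude the end position.

```python
import pandas as pd

# Illustrative frame with ten rows and the default integer index
df = pd.DataFrame({"valor": range(10)})

by_brackets = df[3:7]    # rows at positions 3, 4, 5 and 6
by_iloc = df.iloc[3:7]   # same rows, using explicit positional indexing

print(by_brackets.equals(by_iloc))
```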


* To see which columns are present in the `DataFrame`, use the `columns` attribute, which returns a list-like object (an `Index`) containing all the column names.

In [21]:
multianno_df.columns

Index(['Chr', 'Start', 'End', 'Ref', 'Alt', 'Func.refGene', 'Gene.refGene',
       'GeneDetail.refGene', 'ExonicFunc.refGene', 'AAChange.refGene',
       ...
       'VCF_CHR', 'VCF_POS', 'VCF_ID', 'VCF_REF', 'VCF_ALT', 'VCF_QUAL_2',
       'VCF_FILTER', 'VCF_INFO', 'VCF_FORMAT', 'VCF_SAMPLE_NAME'],
      dtype='object', length=183)

* Note that the output is abbreviated. To see every name, use the `values` attribute on top of `columns`.

In [22]:
multianno_df.columns.values

array(['Chr', 'Start', 'End', 'Ref', 'Alt', 'Func.refGene',
       'Gene.refGene', 'GeneDetail.refGene', 'ExonicFunc.refGene',
       'AAChange.refGene', 'ABRaOM_HomozygousALT', 'ABRaOM_Hemizygous',
       'ABRaOM_Allele_number', 'ABRaOM_Allele_ALT_count',
       'ABRaOM_Frequencies', 'esp6500siv2_all', '1000g2015aug_all',
       'ExAC_ALL', 'ExAC_AFR', 'ExAC_AMR', 'ExAC_EAS', 'ExAC_FIN',
       'ExAC_NFE', 'ExAC_OTH', 'ExAC_SAS', 'CLNALLELEID', 'CLNDN',
       'CLNDISDB', 'CLNREVSTAT', 'CLNSIG', 'avsift', 'SIFT_score',
       'SIFT_converted_rankscore', 'SIFT_pred', 'Polyphen2_HDIV_score',
       'Polyphen2_HDIV_rankscore', 'Polyphen2_HDIV_pred',
       'Polyphen2_HVAR_score', 'Polyphen2_HVAR_rankscore',
       'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_converted_rankscore',
       'LRT_pred', 'MutationTaster_score',
       'MutationTaster_converted_rankscore', 'MutationTaster_pred',
       'MutationAssessor_score', 'MutationAssessor_score_rankscore',
       'MutationAssessor_pred', 'FA
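
An alternative sketch, assuming a plain Python list is acceptable: `Index.tolist()` returns every column name without abbreviation. The 200 placeholder column names below are generated only for illustration.

```python
import pandas as pd

# Empty illustrative frame with 200 generated column names
df = pd.DataFrame(columns=[f"col_{i}" for i in range(200)])

all_names = df.columns.tolist()  # plain Python list with every name
print(len(all_names))
print(all_names[:3])
```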

* There are two ways to view the data in a given column, and both return a `Series`:

In [23]:
multianno_df.Chr

0     chr10
1     chr10
2     chr10
3     chr10
4     chr10
      ...  
87    chr17
88    chr19
89    chr19
90    chr19
91    chr19
Name: Chr, Length: 92, dtype: object

In [24]:
multianno_df["Chr"]

0     chr10
1     chr10
2     chr10
3     chr10
4     chr10
      ...  
87    chr17
88    chr19
89    chr19
90    chr19
91    chr19
Name: Chr, Length: 92, dtype: object

In [25]:
type(multianno_df["Chr"])

pandas.core.series.Series

* Selecting a column with brackets and quotes is recommended, because attribute-style access raises an error when the column name contains unusual characters;
* To select multiple columns, just create a list with the columns you want to view.

In [26]:
col_list = ['Chr', 'Start', 'End']
multianno_df[col_list]

Unnamed: 0,Chr,Start,End
0,chr10,89622938,89622938
1,chr10,89623142,89623142
2,chr10,89623861,89623861
3,chr10,89623901,89623901
4,chr10,89654121,89654121
...,...,...,...
87,chr17,41258136,41258136
88,chr19,1206797,1206797
89,chr19,1219444,1219444
90,chr19,1219451,1219451


In [27]:
multianno_df[['Chr', 'Start', 'End']]

Unnamed: 0,Chr,Start,End
0,chr10,89622938,89622938
1,chr10,89623142,89623142
2,chr10,89623861,89623861
3,chr10,89623901,89623901
4,chr10,89654121,89654121
...,...,...,...
87,chr17,41258136,41258136
88,chr19,1206797,1206797
89,chr19,1219444,1219444
90,chr19,1219451,1219451
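
The recommendation to prefer brackets over attribute access can be illustrated with a small made-up frame. The column `Func.refGene` contains a dot, and the hypothetical column `count` shares its name with a `DataFrame` method, so neither is reachable as an attribute.

```python
import pandas as pd

df = pd.DataFrame({"Chr": ["chr10"], "Func.refGene": ["UTR5"], "count": [1]})

# Bracket notation works for any column name
func = df["Func.refGene"]

# Attribute access fails here: pandas looks for an attribute named "Func"
try:
    df.Func.refGene
    attribute_ok = True
except AttributeError:
    attribute_ok = False

# "count" is also a DataFrame method, so df.count is the method, not the column
is_method = callable(df.count)

print(attribute_ok, is_method)
```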


* Some basic statistics can be computed on columns, such as `max()`, `min()`, `mean()`, and `median()`.

In [28]:
# Replace "." with "0"
multianno_df["HIAE_EXOME_FREQUENCY(722)"] = multianno_df["HIAE_EXOME_FREQUENCY(722)"].apply(
    lambda x: "0" if x == "." else x)
# Round to 3 decimal places
multianno_df["HIAE_EXOME_FREQUENCY(722)"] = multianno_df["HIAE_EXOME_FREQUENCY(722)"].astype(float).round(3)
multianno_df["HIAE_EXOME_FREQUENCY(722)"].head()

0    0.000
1    0.003
2    0.339
3    0.350
4    0.001
Name: HIAE_EXOME_FREQUENCY(722), dtype: float64
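
The same clean-up can be sketched without `apply`, and it is worth noting that turning `.` into `0` is a modelling choice: zeros take part in `mean()` and `median()`, while missing values are skipped. The `freq` values below are made up, and `pd.to_numeric(..., errors="coerce")` is shown only as an alternative, not as the approach used in this workshop.

```python
import pandas as pd

# Made-up values mimicking a frequency column where "." marks missing data
freq = pd.Series(["0.0031", ".", "0.3392", "."])

# Replace the placeholder, then convert the whole column at once
as_zero = freq.replace(".", "0").astype(float).round(3)

# Alternative: keep "." as missing (NaN) instead of treating it as 0
as_nan = pd.to_numeric(freq, errors="coerce").round(3)

print(as_zero.mean())  # zeros pull the mean down
print(as_nan.mean())   # NaN values are ignored by default
```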

In [29]:
multianno_df["HIAE_EXOME_FREQUENCY(722)"].min()

0.0

In [30]:
multianno_df["HIAE_EXOME_FREQUENCY(722)"].max()

0.938

In [31]:
multianno_df["HIAE_EXOME_FREQUENCY(722)"].mean()

0.2149130434782609

In [32]:
multianno_df["HIAE_EXOME_FREQUENCY(722)"].median()

0.074

In [33]:
multianno_df["HIAE_EXOME_FREQUENCY(722)"].describe()

count    92.000000
mean      0.214913
std       0.294639
min       0.000000
25%       0.001000
50%       0.074000
75%       0.350500
max       0.938000
Name: HIAE_EXOME_FREQUENCY(722), dtype: float64

* To count the values in a column, we can use the `value_counts()` method.

In [34]:
multianno_df["CLNSIG"].value_counts()

.                                               47
Benign                                          37
Likely_benign                                    3
Conflicting_interpretations_of_pathogenicity     2
Uncertain_significance                           1
drug_response                                    1
Pathogenic                                       1
Name: CLNSIG, dtype: int64
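
When proportions are more useful than absolute counts, `value_counts()` accepts `normalize=True`. The sketch below uses a short made-up series in place of the `CLNSIG` column.

```python
import pandas as pd

# Made-up classification values for illustration
clnsig = pd.Series(["Benign", ".", "Benign", "Pathogenic", "."])

counts = clnsig.value_counts()                # absolute counts
shares = clnsig.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```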

* More operations on single columns (`Series`) can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

### Writing CSV, Excel

* To write a `CSV`, use the `to_csv()` method with the parameters you need.

In [35]:
output = multianno_df[["Chr", "Start", "End"]]
output.head()

Unnamed: 0,Chr,Start,End
0,chr10,89622938,89622938
1,chr10,89623142,89623142
2,chr10,89623861,89623861
3,chr10,89623901,89623901
4,chr10,89654121,89654121


In [36]:
output.to_csv('multianno_filtrado.csv', index=False)
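
As an example of adjusting those parameters, this sketch writes a tab-separated file and reads it back. It uses an in-memory buffer instead of a file on disk, and the three-column frame is a made-up stand-in for `output`.

```python
import io
import pandas as pd

# Illustrative stand-in for the filtered DataFrame
output = pd.DataFrame({
    "Chr": ["chr10", "chr19"],
    "Start": [89622938, 1206797],
    "End": [89622938, 1206797],
})

# sep="\t" writes tab-separated values; index=False omits the row labels
buffer = io.StringIO()
output.to_csv(buffer, sep="\t", index=False)

# Reading back with the same separator recovers the table
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```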

* Creating a simple Excel file works the same way, this time using the `to_excel()` method and its parameters.

> This requires the `openpyxl` library to be installed.

In [37]:
output.to_excel('multianno_filtrado.xlsx', sheet_name="bed", index=False)

* If you want to go further and create an Excel file with several sheets, for example using 2 `DataFrame`s, each in its own sheet:

In [52]:
df1 = multianno_df.head(15)
df1

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
2,chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exo...,.,.,...,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
3,chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
4,chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,...,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"
5,chr10,89654190,89654190,G,A,intronic,PTEN,.,.,.,...,chr10,89654190,rs181780682,G,A,29.77,.,ABHet=0.714;AC=1;AF=0.5;AN=2;BaseQRankSum=0.36...,GT:AD:DP:GQ:PL,"0/1:5,2:7:58:58,0,172"
6,chr10,89711576,89711576,C,T,intronic,PTEN,.,.,.,...,chr10,89711576,rs141005791,C,T,96.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.21...,GT:AD:DP:GQ:PL,"0/1:5,4:9:99:125,0,133"
7,chr10,89718057,89718057,A,G,intronic,PTEN,.,.,.,...,chr10,89718057,.,A,G,759.77,.,ABHet=0.324;AC=1;AF=0.5;AN=2;BaseQRankSum=0.83...,GT:AD:DP:GQ:PL,"0/1:11,23:34:99:788,0,283"
8,chr10,89720548,89720548,A,G,intronic,PTEN,.,.,.,...,chr10,89720548,.,A,G,1491.77,.,ABHet=0.467;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.5...,GT:AD:DP:GQ:PL,"0/1:42,48:90:99:1520,0,1330"
9,chr10,89720634,89720634,T,-,intronic,PTEN,.,.,.,...,chr10,89720633,.,CT,C,1805.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=-0.601;DP=689;Ex...,GT:AD:DP:GQ:PL,"0/1:443,192:689:99:1843,0,6330"


In [53]:
df2 = multianno_df.tail(15)
df2

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
77,chr17,7579644,7579659,CCCCAGCCCTCCAGGT,-,intronic,TP53,.,.,.,...,chr17,7579643,rs146534833,CCCCCAGCCCTCCAGGT,C,122210.0,.,AC=2;AF=1;AN=2;DB;DP=1405;ExcessHet=3.0103;FS=...,GT:AD:DP:GQ:PL,"1/1:0,924:1405:99:122247,2780,0"
78,chr17,7579801,7579801,G,C,UTR5,TP53,NM_001126118:c.-232C>G,.,.,...,chr17,7579801,rs1642785,G,C,9509.77,.,ABHom=0.651;AC=2;AF=1;AN=2;BaseQRankSum=16.02;...,GT:AD:DP:GQ:PL,"1/1:203,378:582:99:9538,522,0"
79,chr17,7590593,7590593,A,C,intronic,TP53;WRAP53,.,.,.,...,chr17,7590593,.,A,C,87.77,.,ABHet=0.749;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.3...,GT:AD:DP:GQ:PL,"0/1:188,63:251:99:116,0,652"
80,chr17,7590601,7590601,A,C,intronic,TP53;WRAP53,.,.,.,...,chr17,7590601,.,A,C,339.77,.,ABHet=0.585;AC=1;AF=0.5;AN=2;BaseQRankSum=-2.0...,GT:AD:DP:GQ:PL,"0/1:172,122:294:99:368,0,1128"
81,chr17,7590607,7590607,A,C,intronic,TP53;WRAP53,.,.,.,...,chr17,7590607,.,A,C,41.77,.,ABHet=0.704;AC=1;AF=0.5;AN=2;BaseQRankSum=-5.0...,GT:AD:DP:GQ:PL,"0/1:226,95:321:70:70,0,2914"
82,chr17,41196822,41196824,TTT,-,UTR3,BRCA1,NM_007300:c.*873_*871delAAA;NM_007297:c.*873_*...,.,.,...,chr17,41196821,.,CTTT,C,84.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=0.689;ClippingRa...,GT:AD:DP:GQ:MLPSAC:MLPSAF:PL:SAC:SB,"0/1:109,19:128:99:1:0.5:122,0,3145:69,40,12,7:..."
83,chr17,41197940,41197940,T,-,intronic,BRCA1,.,.,.,...,chr17,41197939,.,AT,A,1932.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=3.135;ClippingRa...,GT:AD:DP:GQ:MLPSAC:MLPSAF:PL:SAC:SB,"0/1:60,107:167:99:1:0.5:1970,0,654:2,58,33,74:..."
84,chr17,41247621,41247621,A,-,intronic,BRCA1,.,.,.,...,chr17,41247620,.,CA,C,197.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=-0.097;DP=38;Exc...,GT:AD:DP:GQ:PL,"0/1:24,13:38:99:235,0,434"
85,chr17,41256075,41256075,A,-,intronic,BRCA1,.,.,.,...,chr17,41256074,.,CA,C,932.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=0.364;DP=201;Exc...,GT:AD:DP:GQ:PL,"0/1:128,67:201:99:970,0,2530"
86,chr17,41256088,41256088,A,-,intronic,BRCA1,.,.,.,...,chr17,41256087,.,GA,G,357.73,.,AC=1;AF=0.5;AN=2;BaseQRankSum=-0.38;ClippingRa...,GT:AD:DP:GQ:MLPSAC:MLPSAF:PL:SAC:SB,"0/1:83,28:111:99:1:0.5:395,0,2036:83,0,28,0:70..."


In [55]:
with pd.ExcelWriter('multianno_filtrado_2abas.xlsx') as writer:
    df1.to_excel(writer, sheet_name="DataFrame1", index=False)
    df2.to_excel(writer, sheet_name="DataFrame2", index=False)



---

# 2. Manipulating `DataFrame`s

### Manipulating indexes

* There are several ways to select rows and columns; the recommended ones are the `iloc` and `loc` indexers.
* The first (`iloc`) selects by integer position:

In [38]:
# Select the first row; the column names become the index of the resulting Series
multianno_df.iloc[0]

Chr                                                            chr10
Start                                                       89622938
End                                                         89622938
Ref                                                                A
Alt                                                                C
                                         ...                        
VCF_QUAL_2                                                    290.77
VCF_FILTER                                                         .
VCF_INFO           ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....
VCF_FORMAT                                            GT:AD:DP:GQ:PL
VCF_SAMPLE_NAME                     0/1:1266,552:1820:99:319,0,12285
Name: 0, Length: 183, dtype: object

In [39]:
# Select the first row, but as a DataFrame
multianno_df.iloc[[0]]

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"


In [40]:
# Select the first two rows
multianno_df.iloc[[0, 1]]

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"


In [41]:
# Select the first two rows using slicing
multianno_df.iloc[:2]

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
1,chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"


In [42]:
# Select specific rows with a list of booleans
ex = multianno_df.iloc[:3]  # take a 3-row subset; otherwise the list would need 92 booleans
ex.iloc[[True, False, True]]

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
2,chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exo...,.,.,...,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
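
In practice the boolean list usually comes from a condition, not from typing it by hand. The sketch below assumes a small hypothetical `toy` frame standing in for `multianno_df`. It builds one boolean per row and converts the mask to a NumPy array, since `iloc` expects a plain list or array of booleans rather than a boolean `Series`.

```python
import pandas as pd

# Hypothetical toy data standing in for multianno_df
toy = pd.DataFrame(
    {
        "Chr": ["chr10", "chr10", "chr17"],
        "Start": [89622938, 89623142, 7579801],
        "Ref": ["A", "C", "G"],
    }
)

# One boolean per row, derived from a condition
mask = toy["Chr"] == "chr10"

# Convert the boolean Series to an array before handing it to iloc
chr10_rows = toy.iloc[mask.to_numpy()]
print(chr10_rows)
```

The mask must have exactly as many entries as the frame has rows, which is why the cell above first cut `multianno_df` down to three rows.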


In [43]:
# Select the value in the first row of the second column
multianno_df.iloc[0, 1]

89622938

In [44]:
# Select rows and columns with one list of positions per axis
multianno_df.iloc[[0, 2], [1, 3, 4]]

Unnamed: 0,Start,Ref,Alt
0,89622938,A,C
2,89623861,T,-


In [45]:
# Select rows and columns using slicing
multianno_df.iloc[1:3, 0:3]  # the end positions (row 3, column 3) are not included

Unnamed: 0,Chr,Start,End
1,chr10,89623142,89623142
2,chr10,89623861,89623861


* The second (`loc`) selects by row and column labels, or by a boolean mask:

In [46]:
# Because the index labels are the default integers, selecting a row looks the same as with iloc
multianno_df.loc[[0]]  # the double [[]] is not required; it is used here only to return a DataFrame

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
0,chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,...,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
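
The two indexers only agree while the index labels match the row positions. As a minimal sketch, assuming a hypothetical `toy` frame with a few rows copied from the sample, sorting the rows keeps the original labels, so `iloc[0]` and `loc[0]` return different rows afterwards.

```python
import pandas as pd

# Hypothetical toy data with a few rows taken from the sample
toy = pd.DataFrame(
    {
        "Gene": ["KLLN", "KLLN", "PTEN", "TP53"],
        "Start": [89622938, 89623142, 89623861, 7579801],
    }
)

# Sorting reorders the rows but keeps their original index labels
by_start = toy.sort_values("Start")

first_by_position = by_start.iloc[0]  # first row after sorting
row_labelled_zero = by_start.loc[0]   # row whose index label is 0

print(first_by_position["Gene"])  # TP53
print(row_labelled_zero["Gene"])  # KLLN
```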


In [47]:
# With a custom index, rows can be selected by label via loc
ex = multianno_df.set_index('Start')
ex.loc[[89623142]]

Start,Chr,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,ABRaOM_HomozygousALT,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
89623142,chr10,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,2,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"


In [48]:
# Select two rows with a list of labels
ex.loc[[89623142, 89623901]]

Start,Chr,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,ABRaOM_HomozygousALT,...,VCF_CHR,VCF_POS,VCF_ID,VCF_REF,VCF_ALT,VCF_QUAL_2,VCF_FILTER,VCF_INFO,VCF_FORMAT,VCF_SAMPLE_NAME
89623142,chr10,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,2,...,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
89623901,chr10,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,561,...,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"


In [49]:
# Select the value at a specific row label and column name
ex.loc[89623901, "Func.refGene"]

'exonic'
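
`loc` also accepts a boolean mask in the row position, which can be combined with a column label. The sketch below assumes a hypothetical `toy` frame indexed by `Start`, with values copied from the rows shown above.

```python
import pandas as pd

# Hypothetical toy data indexed by Start, mirroring the `ex` frame
toy = pd.DataFrame(
    {
        "Start": [89622938, 89623142, 89623861, 89623901],
        "Func.refGene": ["UTR5", "UTR5", "splicing", "exonic"],
        "Gene.refGene": ["KLLN", "KLLN", "PTEN", "PTEN"],
    }
).set_index("Start")

# Boolean Series aligned with the index: True where the gene is PTEN
is_pten = toy["Gene.refGene"] == "PTEN"

# Rows where the mask is True, restricted to one column
pten_funcs = toy.loc[is_pten, "Func.refGene"]
print(pten_funcs)
```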

In [50]:
# Select a range of rows via label slicing (both endpoints are included) and a list of columns
ex.loc[89623142:89623901, ["End","Func.refGene"]]

Start,End,Func.refGene
89623142,89623142,UTR5
89623861,89623861,splicing
89623901,89623901,exonic
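
Slicing behaves differently in the two indexers: a `loc` slice includes both endpoint labels, while an `iloc` slice excludes the end position. A minimal sketch, assuming a hypothetical `toy` frame indexed by `Start`:

```python
import pandas as pd

# Hypothetical toy data indexed by Start, mirroring the `ex` frame
toy = pd.DataFrame(
    {
        "Start": [89622938, 89623142, 89623861, 89623901],
        "Func.refGene": ["UTR5", "UTR5", "splicing", "exonic"],
    }
).set_index("Start")

by_label = toy.loc[89623142:89623901]  # both endpoint labels are included
by_position = toy.iloc[1:3]            # position 3 is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```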


* To see which values are in the index, use the `index` attribute:

In [51]:
print(multianno_df.index)
print(ex.index)

RangeIndex(start=0, stop=92, step=1)
Int64Index([89622938, 89623142, 89623861, 89623901, 89654121, 89654190,
            89711576, 89718057, 89720548, 89720634, 89720907, 89721094,
            89725294, 89725582, 32889968, 32899567, 32899878, 32905220,
            32905265, 32907536, 32911888, 32912299, 32913055, 32915005,
            32915411, 32915524, 32919637, 32920016, 32920618, 32920844,
            32929058, 32929232, 32929387, 32930373, 32936646, 32950658,
            32951111, 32951115, 32953388, 32954561, 32954692, 32954741,
            32968591, 32968607, 32970736, 32971425, 32973012, 23615043,
            23640160, 68770847, 68771122, 68771372, 68771418, 68771547,
            68772681, 68820717, 68847691, 68847692, 68847694, 68847718,
            68847719, 68847726, 68847727, 68855852, 68856474, 68857141,
            68857792, 68867612,  7576265,  7576443,  7577679,  7578115,
             7578645,  7578712,  7578712,  7578837,  7579472,  7579644,
             7579801,  7590