<a href="https://colab.research.google.com/github/genomika/pandas-workshop/blob/master/workshop-python_pandas_gnmk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1st Introductory Workshop on the Pandas Library

![pandas](https://files.realpython.com/media/Python-Pandas-10-Tricks--Features-You-May-Not-Know-Watermark.e58bb5ce9835.jpg)
Figure 1: ["Python Pandas: Tricks & Features You May Not Know"](https://realpython.com/python-pandas-tricks/)

0. Installing conda, Jupyter Notebook, pandas, etc.;

  * [Installing Conda on Ubuntu 18.04](https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart-pt)
  * [Installing Conda on Windows](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html)
  * [Installing Conda on macOS](https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html)

1. Opening files with Pandas

  * Series and DataFrames;
  * CSV;
  * Excel;
  * Python dictionaries;
  * Python lists of tuples;
  * Python lists of dictionaries;
  * Opening `sql`, `parquet`, `pickle`, `html` files;
  * Creating `DataFrame`s;
  * Manipulating rows and columns;
  * General operations;
  * Writing CSV, Excel.

2. Manipulating `DataFrame`s

  * Conditional selections;
  * Manipulating indexes;
  * Working with missing data;
  * Grouping a `DataFrame`;
  * How to join different `DataFrame`s.

3. Handling big data with Vaex

4. Hands-on with a sample dataset

# 1. Opening files with Pandas

To manipulate data with pandas, the data must first be imported into the library. The main goal is to build the library's native data structures from the material under analysis. In this course we will work mainly with [*`DataFrame`*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and [*`Series`*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

A `Series` is essentially a structure similar to a one-dimensional array, whose purpose is to store a sequence of values in an indexed way.

In [0]:
import pandas as pd
series = pd.Series([42, 13, 7, -99], index=["não", "aqui", "é o", "patrick"])
series
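As a rough sketch of what "indexed" means in practice, the example below rebuilds the same `series` and reads it back by label, by position, and with a boolean filter. The variable names (`by_label`, `by_position`, `positives`) are illustrative only.

```python
import pandas as pd

# Same values and labels as the cell above
series = pd.Series([42, 13, 7, -99], index=["não", "aqui", "é o", "patrick"])

by_label = series["patrick"]     # lookup by index label
by_position = series.iloc[0]     # lookup by integer position
labels = list(series.index)      # the index stores the labels
values = series.to_numpy()       # the values as a one-dimensional NumPy array
positives = series[series > 0]   # boolean filtering keeps the matching labels

print(by_label, by_position)
print(labels)
print(positives)
```

Label-based access is what distinguishes a `Series` from a plain array: the filtered result still carries the original labels rather than being renumbered.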

More simply, a `Series` can be thought of as an ordered, fixed-length dictionary. In fact, these structures can be created from dictionaries and used in contexts where a dictionary would be used.

In [0]:
# https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify
spotify_mais_escutadas = {"Shape of You": 2433000, "Rockstar": 1831000, "One Dance": 1817000, "Closer": 1724000, "Thinking Out Loud": 1493000}
mais_escutadas_series = pd.Series(spotify_mais_escutadas)
mais_escutadas_series

In [0]:
"Rockstar" in mais_escutadas_series

In [0]:
"Bota Bota" in mais_escutadas_series # :(
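The membership tests above check the index labels, not the values. A minimal sketch of a few other dictionary-like operations follows, using a shortened copy of the same data; the variable names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Shortened copy of the dictionary used above
streams = pd.Series({"Shape of You": 2433000, "Rockstar": 1831000, "One Dance": 1817000})

# dict-style lookup with a default for a missing label
missing = streams.get("Bota Bota", 0)

# iteration yields (label, value) pairs in insertion order
pairs = list(streams.items())

# unlike a dict, arithmetic is applied to every value at once
doubled = streams * 2

# and the Series can be converted back to a plain dict
round_trip = streams.to_dict()

print(missing)
print(pairs)
print(doubled)
```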

`DataFrame`s are tabular representations of data. This structure contains an ordered collection of columns, each of which can hold a specific type of value (numeric, text, boolean, etc.). In a `DataFrame`, both rows and columns have an index.

In [0]:
formato_data = "%d/%m/%Y"
spotify_mais_escutadas = {"música": ["Shape of You", "Rockstar", "One Dance", "Closer", "Thinking Out Loud"],
                          "número_streams": [2433000, 1831000, 1817000, 1724000, 1493000],
                          "data_publicação": [pd.to_datetime("6/1/2017", format=formato_data),
                                              pd.to_datetime("15/9/2017", format=formato_data),
                                              pd.to_datetime("5/4/2016", format=formato_data),
                                              pd.to_datetime("29/7/2016", format=formato_data),
                                              pd.to_datetime("20/6/2014", format=formato_data)]}
spotify_mais_escutadas_df = pd.DataFrame(spotify_mais_escutadas)
spotify_mais_escutadas_df
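To make the row and column indexes concrete, here is a small sketch built from the first two rows of the table above. It assumes the default integer row index that pandas assigns when none is given; the variable names are illustrative.

```python
import pandas as pd

formato_data = "%d/%m/%Y"
df_small = pd.DataFrame({
    "música": ["Shape of You", "Rockstar"],
    "número_streams": [2433000, 1831000],
    "data_publicação": [pd.to_datetime("6/1/2017", format=formato_data),
                        pd.to_datetime("15/9/2017", format=formato_data)],
})

column_labels = list(df_small.columns)   # the column index
row_labels = list(df_small.index)        # the row index (default: 0, 1, ...)
dtypes = df_small.dtypes                 # one dtype per column

one_column = df_small["número_streams"]  # a column is a Series
one_row = df_small.loc[0]                # a row is a Series labeled by column name
one_cell = df_small.loc[1, "música"]     # row label plus column label

print(column_labels, row_labels)
print(dtypes)
print(one_cell)
```

Selecting a single column or a single row returns a `Series`, which is why the two structures are introduced together.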

In this workshop we will work mostly with `DataFrame`s. Although it is not a universal solution to every problem, the set of methods and tools that come with the `DataFrame` type provides a solid, easy-to-use foundation for most use cases.

In [0]:
import pandas as pd

# The annotation file is tab-separated, so the separator is passed explicitly
df = pd.read_csv("https://www.dropbox.com/s/4p86i6zjfdolee7/GNMK_WORKSHOP-001.avinput.hg19_multianno.ann.txt?dl=1", sep="\t")
df.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,ABRaOM_HomozygousALT,ABRaOM_Hemizygous,ABRaOM_Allele_number,ABRaOM_Allele_ALT_count,ABRaOM_Frequencies,esp6500siv2_all,1000g2015aug_all,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS,CLNALLELEID,CLNDN,CLNDISDB,CLNREVSTAT,CLNSIG,avsift,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,...,AF_asj.1,AF_oth.1,non_topmed_AF_popmax.1,non_neuro_AF_popmax.1,non_cancer_AF_popmax.1,controls_AF_popmax.1,InterVar(automated),PVS1,PS1,PS2,PS3,PS4,PM1,PM2,PM3,PM4,PM5,PM6,PP1,PP2,PP3,PP4,PP5,BA1,BS1,BS2,BS3,BS4,BP1,BP2,BP3,BP4,BP5,BP6,BP7,bed,bed2,HIAE_EXOME_FREQUENCY(722),HIAE_EXOME_OCCURRENCES(722),Otherinfo
chr10,89622938,89622938,A,C,UTR5,KLLN,NM_001126049:c.-694T>G,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"Name=Cowden syndrome 4, 615107 (3)",.,.,.,het,290.77,1820,chr10,89622938,.,A,C,290.77,.,ABHet=0.696;AC=1;AF=0.5;AN=2;BaseQRankSum=-10....,GT:AD:DP:GQ:PL,"0/1:1266,552:1820:99:319,0,12285"
chr10,89623142,89623142,C,T,UTR5,KLLN,NM_001126049:c.-898G>A,.,.,2,0,916,5,0.005459,.,0.00559105,.,.,.,.,.,.,.,.,133117,Hereditary_cancer-predisposing_syndrome|PTEN_h...,"MedGen:C0027672,SNOMED_CT:699346009|MedGen:C19...","criteria_provided,_conflicting_interpretations",Conflicting_interpretations_of_pathogenicity,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"Name=Cowden syndrome 4, 615107 (3)",.,0.002770083102493075,"DLE_002-64749-262-SCEX_S40,reanaliseDLE_r34_00...",het,7963.77,574,chr10,89623142,.,C,T,7963.77,.,ABHet=0.534;AC=1;AF=0.5;AN=2;BaseQRankSum=-0.7...,GT:AD:DP:GQ:PL,"0/1:306,267:574:99:7992,0,9864"
chr10,89623861,89623861,T,-,splicing,PTEN,NM_001304717:exon1:c.154+1T>-;NM_001304717:exon2:c.155-1T>-,.,.,548,0,1096,1096,1.000000,.,1,.,.,.,.,.,.,.,.,433879,not_specified,MedGen:CN169374,"criteria_provided,_single_submitter",Benign,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"Name=Bannayan-Riley-Ruvalcaba syndrome, 153480...",.,0.33933518005540164,20180110_EXOMA_R30_DANIELA_CRISTINA_ENGSTER_27...,hom,234.8,7,chr10,89623860,rs71022512,CT,C,234.8,.,AC=2;AF=1;AN=2;DB;DP=7;ExcessHet=3.0103;FS=0;M...,GT:AD:DP:GQ:PL,"1/1:0,6:7:18:272,18,0"
chr10,89623901,89623901,G,C,exonic,PTEN,.,nonsynonymous SNV,PTEN:NM_001304717:exon2:c.194G>C:p.C65S,561,0,1124,1123,0.999110,.,1,.,.,.,.,.,.,.,.,433880,not_specified,MedGen:CN169374,"criteria_provided,_single_submitter",Benign,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,...,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,"Name=Bannayan-Riley-Ruvalcaba syndrome, 153480...",Name=GC_rich,0.35041551246537395,20180110_EXOMA_R30_DANIELA_CRISTINA_ENGSTER_27...,hom,440.77,14,chr10,89623901,rs2943772,G,C,440.77,.,ABHom=1;AC=2;AF=1;AN=2;DB;DP=14;Dels=0;ExcessH...,GT:AD:DP:GQ:PL,"1/1:0,14:14:36:469,36,0"
chr10,89654121,89654121,T,C,intronic,PTEN,.,.,.,0,0,1124,3,0.002669,.,0.00499201,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"Name=Bannayan-Riley-Ruvalcaba syndrome, 153480...",.,0.0013850415512465374,DLE_002-64749-262-SCEX_S40,het,216.77,18,chr10,89654121,rs139651072,T,C,216.77,.,ABHet=0.556;AC=1;AF=0.5;AN=2;BaseQRankSum=0.15...,GT:AD:DP:GQ:PL,"0/1:10,8:18:99:245,0,289"
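The same `read_csv` call can be tried without network access by reading from an in-memory string. The sketch below uses a handful of columns and rows copied from the output above. Passing `na_values="."` is an assumption on our part: the output shows `.` in empty fields, and this option is one way to have pandas treat them as missing values.

```python
import io
import pandas as pd

# A few columns and rows copied from the output above, tab-separated
text = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tExonicFunc.refGene\n"
    "chr10\t89622938\t89622938\tA\tC\tUTR5\tKLLN\t.\n"
    "chr10\t89623142\t89623142\tC\tT\tUTR5\tKLLN\t.\n"
    "chr10\t89623901\t89623901\tG\tC\texonic\tPTEN\tnonsynonymous SNV\n"
)

# io.StringIO makes the string behave like an open file
small_df = pd.read_csv(io.StringIO(text), sep="\t", na_values=".")

print(small_df.shape)
print(small_df.dtypes)
print(small_df["ExonicFunc.refGene"].isna().sum())
```

Without `na_values`, the `.` placeholders would be read as ordinary text, which can force otherwise numeric columns to a string dtype.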
