# Data analysis with Pandas

![](http://www.priroda.cz/clanky/foto/panda3.jpg)

First we import the required packages.
If you do not have these packages, install them by running the following commands on the command line.

`sudo apt-get install python3-pandas`

`sudo apt-get install python3-matplotlib`

The `numpy` package should be installed together with `pandas`. If it is not, install it with the following command:

`sudo apt-get install python3-numpy`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data into a DataFrame and name the columns (gene, transcript, chromosome, strand, start, stop, exons).

In [8]:
df = pd.read_table("/home/nasta/Documents/python_bio/apples/apple.genes", header=None)
df.columns = ["Gen", "Transkript", "Chromozom", "Řetězec", "Start", "Stop", "Exony"] 
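
The same loading step can be tried without the original file. The sketch below is only an illustration: it assumes the file is tab-separated (the default for `pd.read_table`), uses two rows copied from the output shown in this notebook, and passes the column names through `names=` instead of assigning `df.columns` afterwards. The names `sample` and `sample_df` are made up for this example.

```python
import io
import pandas as pd

# Two rows copied from the notebook output, joined by tabs.
# Assumption: the real file is tab-separated, which is the read_table default.
sample = (
    "MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\t"
    "(4276-4368,4423-4542,4733-4911,5321-5447)\n"
    "MDP0000322928\tMDP0000322928\tchr1\t+\t103533\t103686\t"
    "(103533-103686)\n"
)

columns = ["Gen", "Transkript", "Chromozom", "Řetězec", "Start", "Stop", "Exony"]

# header=None says the data has no header row; names= sets the labels at read time.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

print(sample_df.dtypes)
```

Printing `dtypes` is a quick way to confirm that "Start" and "Stop" were parsed as integers and the remaining columns as text.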

In [9]:
df.head()

,Gen,Transkript,Chromozom,Řetězec,Start,Stop,Exony
0,MDP0000303933,MDP0000303933,chr1,-,4276,5447,"(4276-4368,4423-4542,4733-4911,5321-5447)"
1,MDP0000223353,MDP0000223353,chr1,+,77339,79628,"(77339-77399,77484-77524,77589-77630,78413-784..."
2,MDP0000322928,MDP0000322928,chr1,+,103533,103686,(103533-103686)
3,MDP0000151845,MDP0000151845,chr1,-,121369,122541,(121369-122541)
4,MDP0000307409,MDP0000307409,chr1,-,123810,125906,"(123810-125614,125804-125906)"


In [10]:
df.tail()

,Gen,Transkript,Chromozom,Řetězec,Start,Stop,Exony
5451,MDP0000165503,MDP0000165503,chr3,-,39871832,39875913,"(39871832-39871939,39872744-39872807,39872992-..."
5452,MDP0000575784,MDP0000575784,chr3,-,39877141,39877811,"(39877141-39877434,39877686-39877811)"
5453,MDP0000575784,MDP0000575784.1,chr3,-,39877141,39877811,"(39877141-39877434,39877500-39877550,39877686-..."
5454,MDP0000647499,MDP0000647499,chr3,+,39898182,39898847,(39898182-39898847)
5455,MDP0000216874,MDP0000216874,chr3,-,39902674,39906448,"(39902674-39902837,39902954-39903045,39903532-..."


Find the number of rows and columns using `df.shape`.

In [18]:
df.shape

(5456, 7)

Add the column "Počet_exonů" (number of exons).

In [21]:
df["Počet_exonů"] = df["Exony"].str.count("-")
df.head()

,Gen,Transkript,Chromozom,Řetězec,Start,Stop,Exony,Počet_exonů
0,MDP0000303933,MDP0000303933,chr1,-,4276,5447,"(4276-4368,4423-4542,4733-4911,5321-5447)",4
1,MDP0000223353,MDP0000223353,chr1,+,77339,79628,"(77339-77399,77484-77524,77589-77630,78413-784...",7
2,MDP0000322928,MDP0000322928,chr1,+,103533,103686,(103533-103686),1
3,MDP0000151845,MDP0000151845,chr1,-,121369,122541,(121369-122541),1
4,MDP0000307409,MDP0000307409,chr1,-,123810,125906,"(123810-125614,125804-125906)",2
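
Counting hyphens works because each exon is written as `start-stop`, so one hyphen corresponds to one exon. As a cross-check, the exon strings can also be parsed explicitly. The sketch below does this on three strings copied from the output above; `parse_exons` is a hypothetical helper, not part of pandas, and it assumes the coordinates are never negative.

```python
import pandas as pd

# Three exon strings copied from the notebook output.
exony = pd.Series([
    "(4276-4368,4423-4542,4733-4911,5321-5447)",
    "(103533-103686)",
    "(123810-125614,125804-125906)",
])

def parse_exons(text):
    """Turn "(a-b,c-d)" into [(a, b), (c, d)] (hypothetical helper)."""
    pairs = []
    for part in text.strip("()").split(","):
        start, stop = part.split("-")
        pairs.append((int(start), int(stop)))
    return pairs

parsed = exony.apply(parse_exons)

# Both approaches should give the same exon counts.
count_by_parsing = parsed.apply(len)
count_by_hyphens = exony.str.count("-")

print(count_by_parsing.tolist())
print(count_by_hyphens.tolist())
```

The parsed form is slower than `str.count`, but it keeps the individual exon boundaries available for later calculations.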


Add the column "Velikost_genu" (gene size).

In [24]:
df["Velikost_genu"] = df["Stop"]-df["Start"]
df.head()

,Gen,Transkript,Chromozom,Řetězec,Start,Stop,Exony,Počet_exonů,Velikost_genu
0,MDP0000303933,MDP0000303933,chr1,-,4276,5447,"(4276-4368,4423-4542,4733-4911,5321-5447)",4,1171
1,MDP0000223353,MDP0000223353,chr1,+,77339,79628,"(77339-77399,77484-77524,77589-77630,78413-784...",7,2289
2,MDP0000322928,MDP0000322928,chr1,+,103533,103686,(103533-103686),1,153
3,MDP0000151845,MDP0000151845,chr1,-,121369,122541,(121369-122541),1,1172
4,MDP0000307409,MDP0000307409,chr1,-,123810,125906,"(123810-125614,125804-125906)",2,2096


Look at the descriptive statistics of the numeric columns using `df.describe()`.

In our case there are only 4 numeric columns, and descriptive statistics are meaningful only for "Počet_exonů" and "Velikost_genu", because "Start" and "Stop" are positions on the chromosome.

In [25]:
df.describe()

,Start,Stop,Počet_exonů,Velikost_genu
count,5456.0,5456.0,5456.0,5456.0
mean,19787426.791972,19790303.794172,5.077163,2877.002199
std,12017288.385368,12017296.260696,4.906019,2916.799005
min,4276.0,5447.0,1.0,90.0
25%,8825954.5,8829904.5,2.0,911.75
50%,19565861.5,19570904.0,3.0,2034.5
75%,31200448.5,31202358.5,7.0,3792.75
max,40159902.0,40163103.0,63.0,30951.0
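
Since only two of the numeric columns carry meaningful statistics, the summary can be limited to them by selecting those columns before calling `describe()`. The sketch below is a minimal illustration on the five rows shown by `df.head()`; the names `stats_df` and `summary` are made up for this example.

```python
import pandas as pd

# The numeric columns of the first five rows, copied from the notebook output.
stats_df = pd.DataFrame({
    "Start": [4276, 77339, 103533, 121369, 123810],
    "Stop": [5447, 79628, 103686, 122541, 125906],
    "Počet_exonů": [4, 7, 1, 1, 2],
    "Velikost_genu": [1171, 2289, 153, 1172, 2096],
})

# Select the two meaningful columns first, then summarise only those.
summary = stats_df[["Počet_exonů", "Velikost_genu"]].describe()

print(summary)
```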


Select columns.

In [27]:
df.Gen

0       MDP0000303933
1       MDP0000223353
2       MDP0000322928
3       MDP0000151845
4       MDP0000307409
5       MDP0000153869
6       MDP0000187420
7       MDP0000286949
8       MDP0000482754
9       MDP0000726869
10      MDP0000130529
11      MDP0000834450
12      MDP0000135949
13      MDP0000195757
14      MDP0000025650
15      MDP0000025650
16      MDP0000918616
17      MDP0000907499
18      MDP0000229381
19      MDP0000229382
20      MDP0000648408
21      MDP0000246923
22      MDP0000419196
23      MDP0000434787
24      MDP0000312784
25      MDP0000423722
26      MDP0000413077
27      MDP0000170030
28      MDP0000478153
29      MDP0000249932
            ...      
5426    MDP0000161050
5427    MDP0000930498
5428    MDP0000626322
5429    MDP0000265670
5430    MDP0000163387
5431    MDP0000163388
5432    MDP0000123032
5433    MDP0000317575
5434    MDP0000251717
5435    MDP0000498699
5436    MDP0000209004
5437    MDP0000367689
5438    MDP0000209003
5439    MDP0000498703
5440    MD

In [30]:
df[["Gen", "Start", "Stop"]]

,Gen,Start,Stop
0,MDP0000303933,4276,5447
1,MDP0000223353,77339,79628
2,MDP0000322928,103533,103686
3,MDP0000151845,121369,122541
4,MDP0000307409,123810,125906
5,MDP0000153869,135056,135555
6,MDP0000187420,157313,161160
7,MDP0000286949,161876,162460
8,MDP0000482754,178517,181369
9,MDP0000726869,218660,220437
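
The two selections above return different types: a single label gives a Series, while a list of labels gives a DataFrame, even when the list holds only one name. A small sketch on three rows from the output above (the variable names are illustrative only):

```python
import pandas as pd

# Three rows copied from the notebook output.
genes = pd.DataFrame({
    "Gen": ["MDP0000303933", "MDP0000223353", "MDP0000322928"],
    "Start": [4276, 77339, 103533],
    "Stop": [5447, 79628, 103686],
})

one_column = genes["Gen"]          # Series; genes.Gen gives the same result
as_frame = genes[["Gen"]]          # DataFrame with a single column
subset = genes[["Gen", "Stop"]]    # DataFrame with two columns

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(subset.shape)
```

Bracket access works for any column name, whereas attribute access such as `genes.Gen` only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute.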


Sort the rows in descending order by gene size and then by number of exons.

In [34]:
df.sort_values(["Velikost_genu", "Počet_exonů"], ascending=False)

,Gen,Transkript,Chromozom,Řetězec,Start,Stop,Exony,Počet_exonů,Velikost_genu
3448,MDP0000312088,MDP0000312088,chr2,+,35500600,35531551,"(35500600-35501010,35501120-35501236,35501464-...",50,30951
3953,MDP0000279951,MDP0000279951,chr3,-,5463665,5489261,"(5463665-5463781,5463888-5464147,5489147-5489261)",3,25596
3674,MDP0000321472,MDP0000321472,chr2,-,39915011,39940155,"(39915011-39915090,39915321-39915610,39916106-...",48,25144
4465,MDP0000310012,MDP0000310012,chr3,+,15197957,15221624,"(15197957-15198118,15198432-15198497,15198597-...",49,23667
2528,MDP0000259414,MDP0000259414,chr2,-,12061728,12085345,"(12061728-12061847,12062166-12062378,12062466-...",37,23617
647,MDP0000271982,MDP0000271982,chr1,+,20755546,20778581,"(20755546-20755625,20757466-20757520,20757620-...",20,23035
2515,MDP0000266474,MDP0000266474,chr2,-,11939032,11962051,"(11939032-11939556,11939651-11940202,11940370-...",33,23019
1570,MDP0000322409,MDP0000322409,chr1,-,35590702,35613351,"(35590702-35592167,35592285-35592718,35592827-...",63,22649
1052,MDP0000262883,MDP0000262883,chr1,+,28698320,28719998,"(28698320-28699220,28699392-28699573,28700523-...",20,21678
1952,MDP0000283610,MDP0000283610,chr2,-,4040399,4061683,"(4040399-4040536,4041164-4041287,4041371-40414...",35,21284
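
When sorting by several columns, the first column is the primary key and the later ones only break ties. `sort_values` also accepts a list for `ascending`, one flag per key, and returns a new DataFrame instead of changing the original. The sketch below illustrates this on four rows from the earlier output, with the key order swapped so that the tie-break is visible; `sizes` and `ordered` are names made up for this example.

```python
import pandas as pd

# Four rows copied from the notebook output.
sizes = pd.DataFrame({
    "Gen": ["MDP0000303933", "MDP0000322928", "MDP0000151845", "MDP0000307409"],
    "Počet_exonů": [4, 1, 1, 2],
    "Velikost_genu": [1171, 153, 1172, 2096],
})

# Exon count descending; among equal exon counts, gene size ascending.
ordered = sizes.sort_values(
    ["Počet_exonů", "Velikost_genu"], ascending=[False, True]
)

print(ordered)
```

To keep the sorted result, assign it to a variable as done here, since `sizes` itself stays in its original order.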
