# Preprocessing RNA-seq count data (1)

- Combine the per-sample count data into a single count table

### 1. Building the count table

kallisto produces a separate count file for each sample.<br>
Downstream tools are easier to use when all samples are in one table, so we combine the kallisto results (`abundance.tsv`) into a single count table.

Import the required module.

In [1]:
import pandas as pd

Read `SRR_Acc_List.txt`, which lists one SRA accession per line.

In [2]:
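with open('data/SRR_Acc_List.txt') as f:
    # strip() removes the newline safely, even if the last line has none; blank lines are skipped
    sralib = tuple(line.strip() for line in f if line.strip())  # tuple, so the order cannot be changed by accident
sralib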

('SRR17223720',
 'SRR17223721',
 'SRR17223722',
 'SRR17223723',
 'SRR17223724',
 'SRR17223725')

Build a list of paths to the kallisto count results (`abundance.tsv`).<br>
Each sample's result folder sits in the `kallisto` folder under the `data` folder.

In [3]:
# A tuple keeps the path order fixed, matching the order of sralib
kallisto_counts = tuple(
    'data/kallisto/' + sra + '_exp_kallisto/abundance.tsv' for sra in sralib
)

Check the result.

In [4]:
kallisto_counts

('data/kallisto/SRR17223720_exp_kallisto/abundance.tsv',
 'data/kallisto/SRR17223721_exp_kallisto/abundance.tsv',
 'data/kallisto/SRR17223722_exp_kallisto/abundance.tsv',
 'data/kallisto/SRR17223723_exp_kallisto/abundance.tsv',
 'data/kallisto/SRR17223724_exp_kallisto/abundance.tsv',
 'data/kallisto/SRR17223725_exp_kallisto/abundance.tsv')
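Before reading the files, it can help to confirm that every path actually exists, so that a missing kallisto run is reported by name rather than surfacing later as a read error. The following is a minimal sketch under assumptions: the helper names `abundance_paths` and `missing_files` are hypothetical, and the demo uses a temporary folder with made-up contents rather than the real `data/kallisto` folder.

```python
from pathlib import Path
import tempfile


def abundance_paths(accessions, root):
    """Build one abundance.tsv path per accession, preserving the input order."""
    return tuple(
        Path(root) / f"{acc}_exp_kallisto" / "abundance.tsv" for acc in accessions
    )


def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not p.is_file()]


# Demo in a temporary folder: only the first sample has a result file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "kallisto"
    accs = ("SRR17223720", "SRR17223721")
    first = root / "SRR17223720_exp_kallisto"
    first.mkdir(parents=True)
    (first / "abundance.tsv").write_text(
        "target_id\tlength\teff_length\test_counts\ttpm\n"
    )
    paths = abundance_paths(accs, root)
    missing_names = [p.parent.name for p in missing_files(paths)]

print(missing_names)
```

With the real data, calling `missing_files(abundance_paths(sralib, 'data/kallisto'))` should return an empty list if all six kallisto runs finished.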

Define a function, `read_countdata()`, that reads an `abundance.tsv` file<br>
and appends the SRA accession to the names of the `est_counts` and `tpm` columns.

In [5]:
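def read_countdata(num):
    """Read the num-th abundance.tsv and tag est_counts/tpm with its accession."""
    df = pd.read_table(kallisto_counts[num], sep='\t')
    newcol1 = 'est_counts_' + sralib[num]
    newcol2 = 'tpm_' + sralib[num]
    # Make the column names unique per sample so the tables can be merged later
    df.rename(columns={'est_counts': newcol1, 'tpm': newcol2}, inplace=True)

    return df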


Read the count data.

In [6]:
df0 = read_countdata(0)
df1 = read_countdata(1)
df2 = read_countdata(2)
df3 = read_countdata(3)
df4 = read_countdata(4)
df5 = read_countdata(5)
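Writing one line per sample works for six samples but becomes tedious as the number grows. A possible alternative, sketched here under assumptions, is to read the files in a loop and keep the tables in a list. The function `read_abundance` is a hypothetical variant of `read_countdata()` that takes the path and accession as arguments instead of an index, and the two-row toy file stands in for a real `abundance.tsv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_abundance(path, accession):
    """Read one kallisto abundance.tsv and tag est_counts/tpm with the accession."""
    df = pd.read_csv(path, sep="\t")
    return df.rename(
        columns={
            "est_counts": f"est_counts_{accession}",
            "tpm": f"tpm_{accession}",
        }
    )


# Toy content with the same five columns as a kallisto abundance.tsv
toy = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t100\t50.0\t3.0\t1.5\n"
    "tx2\t200\t150.0\t0.0\t0.0\n"
)

accs = ("S1", "S2")  # placeholder accessions for the demo
with tempfile.TemporaryDirectory() as tmp:
    frames = []
    for acc in accs:
        path = Path(tmp) / f"{acc}_abundance.tsv"
        path.write_text(toy)
        frames.append(read_abundance(path, acc))

print([list(df.columns) for df in frames])
```

The list keeps the same order as the accessions, so `frames[0]` plays the role of `df0`.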

Check that every file was read (all tables should have the same number of rows).

In [7]:
print(len(df0),len(df1),len(df2),len(df3),len(df4),len(df5))

117062 117062 117062 117062 117062 117062
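Equal row counts suggest the files are consistent, but they do not prove that every table lists the same transcripts. A slightly stricter check compares the `target_id` columns themselves. This is a sketch with toy tables; `same_targets` is a hypothetical helper name.

```python
import pandas as pd


def same_targets(frames, key="target_id"):
    """Return True if every table has the same key values in the same order."""
    first = list(frames[0][key])
    return all(list(df[key]) == first for df in frames[1:])


a = pd.DataFrame({"target_id": ["tx1", "tx2", "tx3"], "tpm": [1.0, 0.0, 2.0]})
b = pd.DataFrame({"target_id": ["tx1", "tx2", "tx3"], "tpm": [0.5, 0.0, 1.0]})
c = pd.DataFrame({"target_id": ["tx1", "tx3", "tx2"], "tpm": [0.5, 1.0, 0.0]})

matching = same_targets([a, b])    # same IDs, same order
reordered = same_targets([a, c])   # same IDs, different order

print(matching, reordered)
```

A difference in order alone is not a problem for a merge keyed on `target_id`, but a difference in the set of IDs would silently drop rows in an inner join.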


Show the first five rows of df0.

In [8]:
df0.head()

Unnamed: 0,target_id,length,eff_length,est_counts_SRR17223720,tpm_SRR17223720
0,ENSMUST00000178537.2,12,6.74193,0.0,0.0
1,ENSMUST00000178862.2,14,7.65825,0.0,0.0
2,ENSMUST00000196221.2,9,5.34639,0.0,0.0
3,ENSMUST00000179664.2,11,6.27959,0.0,0.0
4,ENSMUST00000177564.2,16,8.56364,0.0,0.0


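The `length` and `eff_length` columns (the latter is used to calculate TPM) are only needed from the first sample, so drop them from the other tables. The `target_id` column (the transcript ID) is kept in every table because it is the key for the merge.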

In [9]:
df1_count = df1.copy().drop(columns=['length','eff_length'])
df2_count = df2.copy().drop(columns=['length','eff_length'])
df3_count = df3.copy().drop(columns=['length','eff_length'])
df4_count = df4.copy().drop(columns=['length','eff_length'])
df5_count = df5.copy().drop(columns=['length','eff_length'])

Join all the tables, using `target_id` as the key.

In [10]:
new_df = pd.merge(df0, df1_count, on = 'target_id')
new_df = pd.merge(new_df, df2_count, on = 'target_id')
new_df = pd.merge(new_df, df3_count, on = 'target_id')
new_df = pd.merge(new_df, df4_count, on = 'target_id')
new_df = pd.merge(new_df, df5_count, on = 'target_id')
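The repeated merges can also be written as a single fold over a list of tables. The sketch below uses `functools.reduce` with `pd.merge` on toy data; `merge_samples` and `toy_sample` are hypothetical names. It passes `validate="one_to_one"`, which makes pandas raise an error if a `target_id` appears more than once in either table. Note that `pd.merge` performs an inner join by default, so IDs missing from any sample are dropped.

```python
from functools import reduce

import pandas as pd


def merge_samples(frames, key="target_id"):
    """Merge per-sample tables on key, keeping length/eff_length from the first only."""
    first = frames[0]
    rest = [df.drop(columns=["length", "eff_length"]) for df in frames[1:]]
    return reduce(
        lambda left, right: pd.merge(
            left, right, on=key, how="inner", validate="one_to_one"
        ),
        rest,
        first,
    )


def toy_sample(accession, counts):
    """Build a two-transcript table shaped like the output of read_countdata()."""
    return pd.DataFrame(
        {
            "target_id": ["tx1", "tx2"],
            "length": [100, 200],
            "eff_length": [50.0, 150.0],
            f"est_counts_{accession}": counts,
            f"tpm_{accession}": [c / 2 for c in counts],
        }
    )


frames = [
    toy_sample("S1", [3.0, 0.0]),
    toy_sample("S2", [1.0, 4.0]),
    toy_sample("S3", [0.0, 2.0]),
]
merged = merge_samples(frames)
print(merged.columns.tolist())
```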

Check the result.

In [11]:
new_df.head()

Unnamed: 0,target_id,length,eff_length,est_counts_SRR17223720,tpm_SRR17223720,est_counts_SRR17223721,tpm_SRR17223721,est_counts_SRR17223722,tpm_SRR17223722,est_counts_SRR17223723,tpm_SRR17223723,est_counts_SRR17223724,tpm_SRR17223724,est_counts_SRR17223725,tpm_SRR17223725
0,ENSMUST00000178537.2,12,6.74193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ENSMUST00000178862.2,14,7.65825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ENSMUST00000196221.2,9,5.34639,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ENSMUST00000179664.2,11,6.27959,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ENSMUST00000177564.2,16,8.56364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Make one table with only `est_counts` and `eff_length`, and another with only `tpm`.<br>
Rename the sample columns to the bare SRA accessions.

In [12]:
# est_counts and eff_length
drop_column1 = ['tpm_'+ i for i in sralib]
drop_column1.append('length')
rename_column1 = {'est_counts_'+ i:i for i in sralib}
new_df_count = new_df.copy().drop(columns = drop_column1).rename(columns=rename_column1)

Check the result.

In [13]:
new_df_count.head()

Unnamed: 0,target_id,eff_length,SRR17223720,SRR17223721,SRR17223722,SRR17223723,SRR17223724,SRR17223725
0,ENSMUST00000178537.2,6.74193,0.0,0.0,0.0,0.0,0.0,0.0
1,ENSMUST00000178862.2,7.65825,0.0,0.0,0.0,0.0,0.0,0.0
2,ENSMUST00000196221.2,5.34639,0.0,0.0,0.0,0.0,0.0,0.0
3,ENSMUST00000179664.2,6.27959,0.0,0.0,0.0,0.0,0.0,0.0
4,ENSMUST00000177564.2,8.56364,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# tpm
drop_column2 = ['est_counts_'+ i for i in sralib]
drop_column2.append('length')
drop_column2.append('eff_length')
rename_column2 = {'tpm_'+ i:i for i in sralib}
new_df_tpm = new_df.copy().drop(columns = drop_column2).rename(columns=rename_column2)

Check the result.

In [15]:
new_df_tpm.head()

Unnamed: 0,target_id,SRR17223720,SRR17223721,SRR17223722,SRR17223723,SRR17223724,SRR17223725
0,ENSMUST00000178537.2,0.0,0.0,0.0,0.0,0.0,0.0
1,ENSMUST00000178862.2,0.0,0.0,0.0,0.0,0.0,0.0
2,ENSMUST00000196221.2,0.0,0.0,0.0,0.0,0.0,0.0
3,ENSMUST00000179664.2,0.0,0.0,0.0,0.0,0.0,0.0
4,ENSMUST00000177564.2,0.0,0.0,0.0,0.0,0.0,0.0
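Both tables are built the same way: keep a few shared columns, keep the sample columns with a given prefix, then strip the prefix. As an illustration, that pattern can be folded into one helper that selects the columns to keep instead of listing the ones to drop. `split_table` is a hypothetical name, and the merged table here is toy data.

```python
import pandas as pd


def split_table(merged, prefix, keep=("target_id",)):
    """Keep the `keep` columns plus those starting with prefix, then strip the prefix."""
    value_cols = [c for c in merged.columns if c.startswith(prefix)]
    out = merged[list(keep) + value_cols]
    return out.rename(columns={c: c[len(prefix):] for c in value_cols})


merged = pd.DataFrame(
    {
        "target_id": ["tx1", "tx2"],
        "length": [100, 200],
        "eff_length": [50.0, 150.0],
        "est_counts_S1": [3.0, 0.0],
        "tpm_S1": [1.5, 0.0],
        "est_counts_S2": [1.0, 4.0],
        "tpm_S2": [0.5, 2.0],
    }
)

count_table = split_table(merged, "est_counts_", keep=("target_id", "eff_length"))
tpm_table = split_table(merged, "tpm_")

print(count_table.columns.tolist())
print(tpm_table.columns.tolist())
```

Selecting by prefix means the helper does not need to know the accession list, which may be convenient if samples are added later.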


Save the tables as tab-separated files.

In [None]:
new_df_count.to_csv('data/counts_kallisto.tsv', sep="\t",index=False)
new_df_tpm.to_csv('data/tpm_kallisto.tsv', sep="\t",index=False)
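To confirm that a saved file can be read back as expected, one option is a quick round trip. The sketch below writes a toy table to a temporary folder and reloads it; the table contents are made up. Passing `index=False` matters here: without it, the row index is written as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table shaped like the count table (placeholder sample names)
table = pd.DataFrame(
    {"target_id": ["tx1", "tx2"], "S1": [3.0, 0.0], "S2": [1.5, 2.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "counts_kallisto.tsv"
    table.to_csv(out, sep="\t", index=False)  # same options as in the cell above
    reloaded = pd.read_csv(out, sep="\t")

same_columns = list(reloaded.columns) == list(table.columns)
same_shape = reloaded.shape == table.shape
print(same_columns, same_shape)
```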

References
- 実験医学別冊 独習Pythonバイオ情報解析 (Experimental Medicine supplement: Self-Study Python for Bioinformatics Analysis), chapters 6 and 7 (2021, edited by 先進ゲノム解析研究推進プラットフォーム, ISBN 978-4-7581-2249-8)
- pandas official site https://pandas.pydata.org
- note.nkmk.me, summary of pandas-related articles https://note.nkmk.me/python-pandas-post-summary/
- Quantifying transcripts from A. thaliana paired-end reads with kallisto https://bi.biopapyrus.jp/rnaseq/mapping/kallisto/kallisto-paired.html
- Kallisto: fast RNA-seq quantification by quasi-mapping https://kazumaxneo.hatenablog.com/entry/2018/07/14/180503
- kallisto Manual http://pachterlab.github.io/kallisto/manual.html
