# Dataset Cleaning

In DatasetGeneration.ipynb I explained how I retrieved data for all human protein entries in UniProt with reviewed status. The results were stored in multiple CSV files, each containing 500 rows (one per entry).

In this notebook I show how to concatenate the data from all these files into a single dataframe, and how to split this initial messy dataframe into three tidier ones: PDB IDs, GO annotations, and basic information.

## Initial dataset

In [1]:
import glob
import pandas as pd

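# Read every chunk (";"-separated, no header row), stack them, and
# drop the columns that are empty in every row
df = pd.concat(
    [pd.read_csv(file, sep = ";", header = None)
     for file in glob.glob("../up_data_*.csv")]
).dropna(axis = 1, how = "all")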

df.columns = [
    "Protein",
    "Gene",
    "Species",
    "Description",
    "GOMolFunc",
    "GOBioProc",
    "GOSubLoc",
    "UniProt",
    "PDBs",
    "PubMeds",
    "Sequence",
    "Sites"
]

df.Sequence = df.Sequence.str.split("&", expand = True)[1] # Discard FASTA header

df.head()

Unnamed: 0,Protein,Gene,Species,Description,GOMolFunc,GOBioProc,GOSubLoc,UniProt,PDBs,PubMeds,Sequence,Sites
0,Centrosomal protein 20,CEP20,Homo sapiens (Human),Involved in the biogenesis of cilia (PubMed:20...,identical protein binding+GO:0042802,cilium assembly+GO:0060271&microtubule anchori...,centriolar satellite+GO:0034451&centriole+GO:0...,Q96NB1,,20551181&24018379,MATVAELKAVLKDTLEKKGVLGHLKARIRAEVFNALDDDREPRPSL...,
1,Collagen alpha-6(VI) chain,COL6A6,Homo sapiens (Human),Collagen VI acts as a cell-binding protein.,,cell adhesion+GO:0007155,extracellular region+GO:0005576&collagen trime...,A6NMZ7,,,MMLLILFLVIICSHISVNQDSGPEYADVVFLVDSSDRLGSKSFPFV...,
2,Protein Hikeshi,HIKESHI,Homo sapiens (Human),Acts as a specific nuclear import carrier for ...,Hsp70 protein binding+GO:0030544&nuclear impor...,cellular response to heat+GO:0034605&Golgi org...,cytosol+GO:0005829&nuclear body+GO:0016604&nuc...,Q53FT3,3WVZ&3WW0,,MFGCLVAGRLVQTAAQQVAEDKFVFDLPDYESINHVVVFMLGTIPF...,
3,Unconventional myosin-Vb,MYO5B,Homo sapiens (Human),May be involved in vesicular trafficking via i...,actin-dependent ATPase activity+GO:0030898&act...,actin filament organization+GO:0007015&endosom...,actin cytoskeleton+GO:0015629&myosin complex+G...,Q9ULV0,4J5M&4LNZ&4LWZ&4LX0,,MSVGELYSQCTRVWIPDPDEVWRSAELTKDYKEGDKSLQLRLEDET...,
4,BUD13 homolog,BUD13,Homo sapiens (Human),Involved in pre-mRNA splicing as component of ...,RNA binding+GO:0003723,"mRNA splicing, via spliceosome+GO:0000398",nucleoplasm+GO:0005654&nucleus+GO:0005634&RES ...,Q9BRD0,5Z56&5Z57&5Z58&6FF4&6FF7,,MAAAPPLSKAEYLKRYLSGADAGVDRGSESGRKRRKKRPKPGGAGG...,
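
To make the loading step easier to follow, here is a minimal sketch on two in-memory chunks. The chunk contents, the three-column layout, and the `>header` text are made up for illustration; only the `";"` separator, the missing header row, the all-empty column, and the `"&"` between FASTA header and sequence mirror the cell above.

```python
import io

import pandas as pd

# Two hypothetical chunks in the same ";"-separated, header-less layout.
# The trailing ";" on each line produces an extra column that is empty everywhere.
chunk_a = "Protein A;GENEA;>header&MAAA;\nProtein B;GENEB;>header&MBBB;\n"
chunk_b = "Protein C;GENEC;>header&MCCC;\n"

toy = pd.concat(
    [pd.read_csv(io.StringIO(text), sep=";", header=None)
     for text in (chunk_a, chunk_b)]
).dropna(axis=1, how="all")  # removes the all-empty trailing column
toy.columns = ["Protein", "Gene", "Sequence"]

# Keep only the part after "&", i.e. the residues without the FASTA header.
toy["Sequence"] = toy["Sequence"].str.split("&", expand=True)[1]

# The index restarts in every chunk unless it is reset explicitly.
toy_reset = toy.reset_index(drop=True)
```

Note that `pd.concat` keeps each chunk's own row labels, so index values repeat across chunks. That is harmless for the steps in this notebook, but `reset_index(drop=True)` (or `ignore_index=True` in `pd.concat`) would give unique labels if they are needed later.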


## PDB IDs database

In [2]:
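pdb = (df[["UniProt", "PDBs"]]
         .dropna()                                       # keep entries with at least one structure
         .assign(PDB = lambda d: d.PDBs.str.split("&"))  # "3WVZ&3WW0" -> ["3WVZ", "3WW0"]
         .drop(columns = "PDBs")
         .explode("PDB")                                 # one row per (UniProt, PDB) pair
      )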

pdb.head()

Unnamed: 0,UniProt,PDB
2,Q53FT3,3WVZ
2,Q53FT3,3WW0
3,Q9ULV0,4J5M
3,Q9ULV0,4LNZ
3,Q9ULV0,4LWZ
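
As a sanity check on the long format, the following sketch rebuilds the original `"&"`-joined strings from the exploded rows and counts structures per entry. It uses three rows copied from the table in the first section; the names `toy`, `pdb_long`, `counts`, and `rejoined` are only illustrative.

```python
import pandas as pd

# Three entries taken from the initial dataset: one without structures, two with.
toy = pd.DataFrame({
    "UniProt": ["A6NMZ7", "Q53FT3", "Q9ULV0"],
    "PDBs": [None, "3WVZ&3WW0", "4J5M&4LNZ&4LWZ&4LX0"],
})

pdb_long = (
    toy.dropna()                                        # A6NMZ7 has no PDB entry
       .assign(PDB=lambda d: d["PDBs"].str.split("&"))  # string -> list of IDs
       .drop(columns="PDBs")
       .explode("PDB")                                  # one row per ID
)

# Number of distinct structures per UniProt accession.
counts = pdb_long.groupby("UniProt")["PDB"].nunique()

# Joining the IDs again should give back the original strings.
rejoined = pdb_long.groupby("UniProt")["PDB"].agg("&".join)
```

Entries without any structure do not appear in the long table at all, so a count of zero has to be recovered by joining back to the basic information table.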


In [3]:
pdb.to_csv("all_human_pdb.csv", index = False, sep = ";")

## GO annotations database

In [4]:
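# Long format: one row per (entry, GO category) pair
go_df = df.melt(id_vars = ["UniProt"],
                value_vars = ["GOMolFunc", "GOBioProc", "GOSubLoc"],
                value_name = "GO")

# "name1+GO:id1&name2+GO:id2" -> [["name1", "id1"], ["name2", "id2"]];
# entries without annotations become [[""]]
go_df.GO = (go_df.GO.fillna("")
                 .str.split("&")
                 .apply(lambda terms: [term.split("+GO:") for term in terms]))

# One row per GO term, with a fresh 0..n-1 index
go_df = go_df.explode("GO").reset_index(drop = True)

go_df.head()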

Unnamed: 0,UniProt,variable,GO
0,Q96NB1,GOMolFunc,"[identical protein binding, 0042802]"
1,A6NMZ7,GOMolFunc,[]
2,Q53FT3,GOMolFunc,"[Hsp70 protein binding, 0030544]"
3,Q53FT3,GOMolFunc,"[nuclear import signal receptor activity, 0061..."
4,Q9ULV0,GOMolFunc,"[actin-dependent ATPase activity, 0030898]"


In [5]:
go_df[["GOAnnot", "GOID"]] = pd.DataFrame(go_df.GO.tolist(), index = go_df.index)
del go_df["GO"]

go_df.head()

Unnamed: 0,UniProt,variable,GOAnnot,GOID
0,Q96NB1,GOMolFunc,identical protein binding,0042802
1,A6NMZ7,GOMolFunc,,
2,Q53FT3,GOMolFunc,Hsp70 protein binding,0030544
3,Q53FT3,GOMolFunc,nuclear import signal receptor activity,0061608
4,Q9ULV0,GOMolFunc,actin-dependent ATPase activity,0030898
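
The same parsing can be sketched with pandas string accessors only, without building nested lists. This is an alternative formulation, not the code used above. The accession `TOY001` and its combination of terms are hypothetical; the individual `name+GO:id` strings are taken from the first table.

```python
import pandas as pd

toy = pd.DataFrame({
    # "TOY001" is a made-up accession used only to show a multi-term field.
    "UniProt": ["TOY001", "A6NMZ7"],
    "GOMolFunc": ["RNA binding+GO:0003723&identical protein binding+GO:0042802", None],
    "GOBioProc": ["mRNA splicing, via spliceosome+GO:0000398", "cell adhesion+GO:0007155"],
})

go_long = (
    toy.melt(id_vars="UniProt", var_name="GOType", value_name="GO")
       .dropna(subset=["GO"])          # skip categories without annotations
)
go_long["GO"] = go_long["GO"].str.split("&")           # one list item per term
go_long = go_long.explode("GO").reset_index(drop=True)

# rsplit treats the separator literally, so "+" needs no escaping.
go_long[["GOAnnot", "GOID"]] = go_long["GO"].str.rsplit("+GO:", n=1, expand=True)
go_long = go_long.drop(columns="GO")
```

Dropping the missing values before splitting means no empty placeholder rows are created, so the later `dropna()` step would not be needed in this variant. The identifiers stay as text, which keeps their leading zeros.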


In [6]:
go_df = (go_df.dropna()
              .replace({"GOMolFunc": "molecular function",
                        "GOSubLoc": "subcellular location",
                        "GOBioProc": "biological process"})
              .rename(columns = {"variable": "GOType"})
        )

go_df.head()

Unnamed: 0,UniProt,GOType,GOAnnot,GOID
0,Q96NB1,molecular function,identical protein binding,0042802
2,Q53FT3,molecular function,Hsp70 protein binding,0030544
3,Q53FT3,molecular function,nuclear import signal receptor activity,0061608
4,Q9ULV0,molecular function,actin-dependent ATPase activity,0030898
5,Q9ULV0,molecular function,actin filament binding,0051015


In [7]:
go_df.to_csv("all_human_go.csv", index = False, sep = ";")
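
One caveat when loading this file again: the GO identifiers are digit-only strings with leading zeros, and `pd.read_csv` converts such a column to integers by default. The sketch below shows the effect on a one-row table written to an in-memory buffer (a stand-in for `all_human_go.csv`) and how `dtype` avoids it.

```python
import io

import pandas as pd

go_toy = pd.DataFrame({
    "UniProt": ["Q96NB1"],
    "GOType": ["molecular function"],
    "GOAnnot": ["identical protein binding"],
    "GOID": ["0042802"],               # text, as produced by the string split
})

# Write with the same options as above, but to memory instead of disk.
buffer = io.StringIO()
go_toy.to_csv(buffer, index=False, sep=";")
csv_text = buffer.getvalue()

# Default parsing infers an integer column and drops the leading zeros.
naive = pd.read_csv(io.StringIO(csv_text), sep=";")

# Declaring the column as text preserves the identifier exactly.
safe = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"GOID": str})
```

With `dtype={"GOID": str}` the full `GO:0042802`-style identifier can be rebuilt by simply prefixing `"GO:"`.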

## Basic information dataset

In [8]:
# These columns now live in the PDB and GO tables
df = df.drop(columns = ["GOMolFunc", "GOBioProc", "GOSubLoc", "PDBs"])

df

Unnamed: 0,Protein,Gene,Species,Description,UniProt,PubMeds,Sequence,Sites
0,Centrosomal protein 20,CEP20,Homo sapiens (Human),Involved in the biogenesis of cilia (PubMed:20...,Q96NB1,20551181&24018379,MATVAELKAVLKDTLEKKGVLGHLKARIRAEVFNALDDDREPRPSL...,
1,Collagen alpha-6(VI) chain,COL6A6,Homo sapiens (Human),Collagen VI acts as a cell-binding protein.,A6NMZ7,,MMLLILFLVIICSHISVNQDSGPEYADVVFLVDSSDRLGSKSFPFV...,
2,Protein Hikeshi,HIKESHI,Homo sapiens (Human),Acts as a specific nuclear import carrier for ...,Q53FT3,,MFGCLVAGRLVQTAAQQVAEDKFVFDLPDYESINHVVVFMLGTIPF...,
3,Unconventional myosin-Vb,MYO5B,Homo sapiens (Human),May be involved in vesicular trafficking via i...,Q9ULV0,,MSVGELYSQCTRVWIPDPDEVWRSAELTKDYKEGDKSLQLRLEDET...,
4,BUD13 homolog,BUD13,Homo sapiens (Human),Involved in pre-mRNA splicing as component of ...,Q9BRD0,,MAAAPPLSKAEYLKRYLSGADAGVDRGSESGRKRRKKRPKPGGAGG...,
...,...,...,...,...,...,...,...,...
495,Cellular tumor antigen p53,TP53,Homo sapiens (Human),Acts as a tumor suppressor in many tumor types...,P04637,12524540&12524540&24051492,MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLS...,Metal binding&176&Zinc&Metal binding&179&Zinc&...
496,Deoxynucleotidyltransferase terminal-interacti...,DNTTIP1,Homo sapiens (Human),Increases DNTT terminal deoxynucleotidyltransf...,Q9H147,11473582,MGATGDAEQPRGPSGAERGGLELGDAGAAGQLVLTNPWNIMIKHRQ...,
497,Cytochrome P450 4F8,CYP4F8,Homo sapiens (Human),A cytochrome P450 monooxygenase involved in th...,P98187,10791960&16112640&15789615&10791960&10791960&1...,MSLLSLSWLGLRPVAASPWLLLLVVGASWLLARILAWTYAFYHNGR...,Metal binding&468&Iron (heme axial ligand)
498,Protein-tyrosine kinase 6,PTK6,Homo sapiens (Human),Non-receptor tyrosine-protein kinase implicate...,Q13882,,MVSRDQAHLGPKYVGLWDFKSRTDEELSFRAGDVFHVARKEEQWWW...,Binding site&219&ATP&Active site&312&Proton ac...
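
The `Sites` column is left packed in this notebook. Judging from the rows shown above, it appears to hold repeated `type&position&description` triples; that layout is an assumption based only on those rows, and the helper `parse_sites` below is hypothetical. A minimal sketch for unpacking it could be:

```python
def parse_sites(raw):
    """Split a packed Sites string into (type, position, description) tuples.

    Assumes the field is a flat "&"-joined run of triples and that no
    description contains "&" itself; missing values give an empty list.
    """
    if not isinstance(raw, str) or not raw:
        return []
    parts = raw.split("&")
    usable = len(parts) - len(parts) % 3   # ignore an incomplete trailing group
    return [tuple(parts[i:i + 3]) for i in range(0, usable, 3)]


# Complete field from the P98187 row.
cyp4f8_sites = parse_sites("Metal binding&468&Iron (heme axial ligand)")

# First two triples visible in the P04637 row (the rest is cut off in the display).
tp53_sites = parse_sites("Metal binding&176&Zinc&Metal binding&179&Zinc")

# Entries without annotated sites are NaN in the dataframe.
no_sites = parse_sites(float("nan"))
```

Applied with `df.Sites.apply(parse_sites)` followed by `explode`, this would give one row per site, in the same spirit as the PDB and GO tables.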


In [9]:
df.to_csv("all_human_basic_info.csv", index = False, sep = ";")
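
The three files share the `UniProt` accession as their key, so they can be recombined when needed. The sketch below uses small in-memory stand-ins for the basic information and PDB tables, with values copied from the outputs above, and joins them with a left merge.

```python
import pandas as pd

# Stand-ins for all_human_basic_info.csv and all_human_pdb.csv.
basic = pd.DataFrame({
    "UniProt": ["A6NMZ7", "Q53FT3"],
    "Gene": ["COL6A6", "HIKESHI"],
})
pdb = pd.DataFrame({
    "UniProt": ["Q53FT3", "Q53FT3"],
    "PDB": ["3WVZ", "3WW0"],
})

# A left merge keeps entries without structures (their PDB value is NaN).
merged = basic.merge(pdb, on="UniProt", how="left")

# count() ignores NaN, so genes without structures get 0.
structures_per_gene = merged.groupby("Gene")["PDB"].count()
```

The same pattern applies to the GO table, with `GOType` available as an extra column to filter on before merging.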