# CellPhoneDB structure

CellPhoneDB uses a SQLite database, which is distributed as a .db file. It contains 6 tables, which we extract as csv files.

You can do this with the sqlite3 command-line tool. The command below extracts the gene_table and saves it as gene_table.csv:

```sqlite3 -header -csv cellphone_pre.db "select * from gene_table;" > gene_table.csv```

Alternatively, you can use DB Browser for SQLite to export all the tables as csv through its GUI:

https://sqlitebrowser.org/
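
If you would rather script the export than run one `sqlite3` command per table, the same result can be sketched with Python's standard library. This is a minimal sketch, not part of CellPhoneDB: the function name `export_tables_to_csv` and the output directory are hypothetical, and it assumes the .db file is a regular SQLite file.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(db_path, out_dir="."):
    """Write every user table in a SQLite file to <out_dir>/<table>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists the schema objects; skip SQLite's internal tables
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
            )
        ]
        for table in tables:
            # table names cannot be bound as parameters, so quote the identifier
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            header = [col[0] for col in cursor.description]
            target = out_dir / f"{table}.csv"
            with open(target, "w", newline="", encoding="utf-8") as handle:
                writer = csv.writer(handle)
                writer.writerow(header)
                writer.writerows(cursor)
            written.append(target)
    finally:
        conn.close()
    return written
```

For the file used above, a call such as `export_tables_to_csv("cellphone_pre.db", "tables")` would write one csv per table into a `tables` folder.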

## CellPhoneDB input files

Adapted from 

Efremova, M., Vento-Tormo, M., Teichmann, S. A., & Vento-Tormo, R. (2020). CellPhoneDB: inferring cell-cell communication from combined expression of multi-subunit ligand-receptor complexes. Nature protocols, 15(4), 1484–1506. https://doi.org/10.1038/s41596-020-0292-x

CellPhoneDB stores ligand-receptor interactions as well as other properties of the interacting
partners, including their subunit architecture and gene and protein identifiers. In order to create
the content of the database, four main .csv data files are required: “gene_input.csv”,
“protein_input.csv”, “complex_input.csv” and “interaction_input.csv”.

As our database contains no complex (heteromeric) interactions, we do not need “complex_input.csv”.

**gene_input:**

Mandatory fields: `gene_name`, `uniprot`, `hgnc_symbol` and `ensembl`

This file is crucial for establishing the link between the scRNA-seq data and the interaction pairs stored at the protein level. It includes the following gene and protein identifiers: i) gene
name (“gene_name”); ii) UniProt identifier (“uniprot”); iii) HUGO Gene Nomenclature Committee
(HGNC) symbol (“hgnc_symbol”) and iv) Ensembl gene identifier (ENSG) (“ensembl”).

**protein_input:**

Mandatory fields: “uniprot”; “protein_name”

Optional fields: “transmembrane”; “peripheral”; “secreted”; “secreted_desc”; “secreted_highlight”; “receptor”; “receptor_desc”; “integrin”; “pfam”; “other”; “other_desc”; “tags”; “tags_description”; “tags_reason”.

However, the interactions in our database are directional, whereas those in CellPhoneDB are not, so it is important to specify the role of each gene in the "receptor" column.

**interaction_input:**

Mandatory fields: “partner_a”; “partner_b”; “annotation_strategy”; “source”

Optional fields: “protein_name_a”; “protein_name_b”

Interactions stored in CellPhoneDB are annotated using their UniProt identifier (“partner_a” and
“partner_b”). The name of the protein is also included, yet not mandatory (“protein_name_a”
and “protein_name_b”). Protein names are not stored in the database.
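
Before building the database it can help to confirm that each csv carries the mandatory fields listed above. The following is a minimal sketch using only the standard library; the helper name `missing_fields` and the `MANDATORY_FIELDS` table are hypothetical, and the field lists are copied from the descriptions in this section.

```python
import csv

# Mandatory columns per input file, as listed in the text above
MANDATORY_FIELDS = {
    "gene_input": ["gene_name", "uniprot", "hgnc_symbol", "ensembl"],
    "protein_input": ["uniprot", "protein_name"],
    "interaction_input": ["partner_a", "partner_b", "annotation_strategy", "source"],
}


def missing_fields(csv_path, kind):
    """Return the mandatory columns of `kind` that are absent from the csv header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # only the first row (the header) is read
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in MANDATORY_FIELDS[kind] if name not in present]
```

An empty list means the header is complete; for example, `missing_fields("interactions.csv", "interaction_input")` would report whether “annotation_strategy” or “source” still has to be added.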



In [1]:
import pandas as pd
import numpy as np
# from unipressed import IdMappingClient

In [2]:
# list of interaction pairs

interactions = pd.read_csv('LR_database.csv', index_col=False)

In [3]:
# CellPhoneDB uses the column names protein_name_a and protein_name_b for gene symbols;
# in our version these are the Ligand and Receptor columns
interactions.rename({'Ligand': 'protein_name_a', 'Receptor': 'protein_name_b'}, axis=1, inplace=True)

In [4]:
#CellPhoneDB mandatory fields to build a customDB
interactions["source"]="OmniPath"

In [5]:
# CellPhoneDB requires Ensembl IDs, so we retrieve them from the UniProt IDs

In [6]:
ligand_ids=list(set(interactions["partner_a"].values))

In [7]:
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="Ensembl", ids=ligand_ids
)
# the mapping job runs remotely; a fixed 2 s wait may be too short for large ID lists
time.sleep(2.0)
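
The fixed two-second sleep above assumes the remote mapping job finishes quickly. A slightly more defensive pattern is to poll until the job reports completion. The helper below is a generic sketch with a hypothetical name (`wait_for`); the commented usage line assumes the job object returned by `IdMappingClient.submit` offers a `get_status()` method that returns `"FINISHED"`, which should be checked against the installed unipressed version.

```python
import time


def wait_for(is_done, timeout=60.0, interval=1.0):
    """Call `is_done()` every `interval` seconds until it returns True.

    Returns True if the condition was met, False if `timeout` seconds passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use with the request above (assumes the job object has get_status()):
# finished = wait_for(lambda: request.get_status() == "FINISHED")
```

If `wait_for` returns False, the results are probably not ready yet and `each_result()` should not be trusted to be complete.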

In [8]:
lig_list=list(request.each_result())

In [9]:
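# store the results in a dictionary (key: UniProt ID, value: Ensembl ID);
# if one UniProt ID maps to several Ensembl IDs, the last one returned is kept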
lig_dict=dict()
for x in lig_list:
    lig_dict[x["from"]]=x["to"]

In [29]:
# some UniProt IDs have no Ensembl mapping, so we assign mock placeholder IDs to them
count=1
for x in ligand_ids:
    if x not in lig_dict:
        lig_dict[x]="ENSG000000000"+str(count)
        count+=1

In [11]:
# do the same for the receptors
receptor_ids=list(set(interactions["partner_b"].values))

In [12]:
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="Ensembl", ids=receptor_ids
)
time.sleep(2.0)

In [13]:
rec_list=list(request.each_result())

In [14]:
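# key: UniProt ID, value: Ensembl ID, as for the ligands
rec_dict=dict()
for x in rec_list:
    rec_dict[x["from"]]=x["to"]

# mock IDs for unmapped receptors; count continues from the ligand loop,
# so the receptor mock IDs do not repeat the ligand ones
for x in receptor_ids:
    if x not in rec_dict:
        rec_dict[x]="ENSG000000000"+str(count)
        count+=1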

In [15]:
# merge the two into one dictionary (for IDs present in both, the ligand entry wins)
rec_dict.update(lig_dict)
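
The ligand and receptor cells above repeat the same two steps: collect the returned mappings, then invent placeholder IDs for whatever was not mapped. As a sketch, both steps could be folded into one helper. The name `build_id_map` is hypothetical, and the sketch differs from the cells above in two labeled ways: it keeps the first Ensembl ID seen for a UniProt ID rather than the last, and it sorts the IDs so the placeholders are assigned in the same order on every run.

```python
def build_id_map(results, ids, start=1, prefix="ENSG000000000"):
    """Map every ID in `ids` to an Ensembl ID, with placeholders for unmapped ones.

    `results` is an iterable of {"from": ..., "to": ...} dicts, the shape the
    loops above read from each_result(). Returns the mapping and the next
    unused placeholder counter.
    """
    mapping = {}
    for item in results:
        # assumption of this sketch: keep the first target when a source ID has several
        mapping.setdefault(item["from"], item["to"])
    count = start
    # sorted() makes the placeholder numbering reproducible between runs
    for identifier in sorted(ids):
        if identifier not in mapping:
            mapping[identifier] = prefix + str(count)
            count += 1
    return mapping, count
```

Passing the returned counter as `start` for the receptor call keeps the placeholders distinct across ligands and receptors, which is what the shared `count` variable does in the cells above.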

In [16]:
# dictionary mapping each gene symbol to its UniProt ID
name2id=dict()
for row in interactions.itertuples(index=False):
    # skip pairs whose two symbols are both already recorded
    if row.protein_name_a in name2id and row.protein_name_b in name2id:
        continue
    name2id[row.protein_name_a]=row.partner_a
    name2id[row.protein_name_b]=row.partner_b

In [17]:
# CPDB requires a csv file with gene symbols, UniProt IDs and Ensembl IDs to generate a custom DB
df={"gene_name":[],
   "uniprot":[],
   "hgnc_symbol":[],
   "ensembl":[]}

In [18]:
for symbol,uniprot in name2id.items():
    df["gene_name"].append(symbol)
    df["uniprot"].append(uniprot)
    df["hgnc_symbol"].append(symbol)
    df["ensembl"].append(rec_dict[uniprot])

In [19]:
df=pd.DataFrame(df)

In [20]:
# version suffixes are not needed: keep only the part of the Ensembl ID before the dot
df["ensembl"] = df["ensembl"].str.split(".").str[0]

In [21]:
# CPDB requires a second file with UniProt IDs and protein names
# .copy() so that the edits below do not act on a view of df
prot = df.loc[:, ["uniprot", "hgnc_symbol"]].copy()

In [22]:
prot["hgnc_symbol"] = df["hgnc_symbol"] + "_HUMAN"

In [23]:
#the column name must be protein_name in this one
prot = prot.rename(columns={"hgnc_symbol": "protein_name"})

In [24]:
# tag the proteins that appear as receptors (partner_b) with 1, all others with 0
prot['receptor']=prot['uniprot'].isin(interactions['partner_b']).astype(int)

In [25]:
df[df["gene_name"]=="PIK3CD"]

Unnamed: 0,gene_name,uniprot,hgnc_symbol,ensembl
2178,PIK3CD,O00329,PIK3CD,ENSG00000171608


In [26]:
# prot['other']=1
# prot['peripheral']=1
# prot['secreted']=1
# prot['integrin']=1
# prot['transmembrane']=1
# prot['tags']="To_comment"
# prot['tags_reason']="curation"
# prot['secreted_highlight']=1

In [27]:
# combined = interactions.copy()

In [28]:
# write the gene, protein and interaction tables to csv files
df.to_csv('gene_user.csv', index=False)
prot.to_csv('prot_user.csv', index=False)
interactions.to_csv('interactions.csv', index=False)
combined = interactions.copy()

# Generate the custom DB with the command below

```cellphonedb database generate --user-interactions interactions.csv --user-interactions-only --user-protein prot_user.csv --user-gene gene_user.csv --result-path combined_raw```

In [32]:
pwd

'/home/mami/Maria/polish/clean/community-paper/src/method_comparison/compare_algorithms/run_CellPhoneDB/build_customDB'