# PCNet

In [None]:
import os
import networkx as nx
from PCNet import PCNet_network as pcn
from PCNet import PCNet_parser as pp

## Paths

The next cell defines the paths that will be used throughout the notebook.

Running it creates a *data* directory, if it does not already exist, containing the following folders:

- *pubmed*: the folder where you must put the *.xml.gz* files to be parsed.

- *csv*: the folder where the *.csv* files returned by the parser will be saved.

- *graph*: the folder where the *.gexf* file corresponding to the citation network will be saved.

In [None]:
# Set paths for data relative to your current directory
pubmed_path = '../data/pubmed/'
csv_path = '../data/csv/'
graph_path = '../data/graph/'

# Create the folders if they do not exist yet.
# Note: os.makedirs returns None, so its result must not be assigned to the path variables.
for path in (pubmed_path, csv_path, graph_path):
    os.makedirs(path, exist_ok=True)
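
Before parsing, it can be useful to confirm that the *pubmed* folder actually contains *.xml.gz* files. The following is a minimal sketch that uses only the standard library; `list_xml_gz` is a hypothetical helper introduced here for illustration and is not part of PCNet.

```python
from pathlib import Path


def list_xml_gz(folder):
    """Return the sorted names of the .xml.gz files found directly in `folder`."""
    # Hypothetical helper, not part of PCNet
    return sorted(p.name for p in Path(folder).glob('*.xml.gz'))


# Possible usage with the pubmed folder defined above:
# files = list_xml_gz('../data/pubmed/')
# print(f'{len(files)} file(s) ready to be parsed')
```

An empty list suggests that the files have not been copied into the folder yet.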

## Configuration
- **mesh**: the MeSH identifier of the research area you want to study. MeSH (Medical Subject Headings) is a controlled vocabulary used for indexing and organizing biomedical literature. To search for MeSH terms related to a certain word, you can use the [MeSH Browser](https://meshb.nlm.nih.gov/search) provided by the National Library of Medicine (NLM).

    If ```mesh = ''``` all the articles will be parsed.

- **term_mesh**: the term corresponding to the MeSH identifier.
    It is used to name the saved graph, *term_mesh*.gexf.

    If ```term_mesh = ''``` the graph will be saved as *pubmed*.gexf.

- **info**: a list of the information you want to extract during parsing. Each item will be added to the nodes as an attribute.

    By default it is ```info = ['title', 'abstract', 'date', 'authors', 'journal', 'keywords']```, i.e. all the available information.

    *PMID* and *References* are always extracted, since they are necessary to create the graph.

    If you need less information you can modify the list.
    Note: the order of the words in the list is important.

- **connected**: a boolean variable.

    If ```connected = True``` the graph will be weakly connected, which means that only the largest connected component of the graph will be kept.

- **keep_unknown_nodes**: a boolean variable.

    If ```keep_unknown_nodes = False``` the graph will contain only the nodes whose information is known because the related articles were parsed.

    If ```keep_unknown_nodes = True``` the graph will contain all the nodes, including those for which only the *PMID* and the *citations* are known.
    

In [None]:
# MeSH settings
mesh = 'D000086382'         # As an example, here is set MeSH 'D000086382'
term_mesh = 'Covid-19'      # corresponding to the term 'Covid-19'
if term_mesh == '':
    term_mesh = 'pubmed'

# Set info you want to extract (default: ['title', 'abstract', 'date', 'authors', 'journal', 'keywords'])
info = ['title', 'abstract', 'date', 'authors', 'journal', 'keywords']

# Network settings
connected = True
keep_unknown_nodes = False
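
If you shorten the `info` list, a small check can catch misspelled field names before the parser runs. The sketch below is only an illustration: `ALLOWED_INFO` and `check_info` are hypothetical names, not part of PCNet, and the list of allowed fields is copied from the default value described above.

```python
# Fields listed as the default value of `info` in the Configuration section
ALLOWED_INFO = ['title', 'abstract', 'date', 'authors', 'journal', 'keywords']


def check_info(info):
    """Raise ValueError on unknown or duplicated fields; otherwise return `info` unchanged."""
    # Hypothetical helper, not part of PCNet
    unknown = [field for field in info if field not in ALLOWED_INFO]
    if unknown:
        raise ValueError(f'Unknown fields in info: {unknown}')
    if len(set(info)) != len(info):
        raise ValueError('Duplicated fields in info')
    return info


# Example: a reduced list (the order given here is preserved)
info_small = check_info(['title', 'date', 'journal'])
```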

## Parser

The next step is parsing the files. The parameters of the function are the following:

- **path_xml**: path to the *.xml.gz* files to be parsed.

    If you put the *.xml.gz* files in the **pubmed** folder, ```path_xml = pubmed_path```.

- **path_csv**: path where the *.csv* files are saved.

    Default: ***csv*** folder, ```path_csv = csv_path```.

- **MeSH**: if specified in the configuration, ```MeSH = mesh```.

    If not specified, by default ```MeSH = ''```.

- **informations**: if specified in the configuration, ```informations = info```.

    If not specified, by default ```informations = ['title', 'abstract', 'date', 'authors', 'journal', 'keywords']```.

In [None]:
# Parse the xml files and save the extracted information in the csv files
pp.xml_parser(path_xml=pubmed_path, path_csv=csv_path, MeSH=mesh, informations=info)
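
To give an idea of what filtering by MeSH involves, the sketch below selects articles by descriptor identifier from a tiny hand-written sample that mimics the structure of a PubMed record. It is not PCNet's implementation: `pmids_with_mesh` and `SAMPLE_XML` are hypothetical names, and the sample contains only the elements needed for the illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, reduced to the elements used below (not real data)
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName UI="D000086382">COVID-19</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName UI="D006801">Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def pmids_with_mesh(xml_text, mesh_id):
    """Return the PMIDs of the articles tagged with `mesh_id` (all PMIDs if `mesh_id` is '')."""
    root = ET.fromstring(xml_text)
    selected = []
    for article in root.iter('PubmedArticle'):
        # Collect the MeSH descriptor identifiers attached to this article
        descriptors = {d.get('UI') for d in article.iter('DescriptorName')}
        if mesh_id == '' or mesh_id in descriptors:
            selected.append(article.findtext('MedlineCitation/PMID'))
    return selected


print(pmids_with_mesh(SAMPLE_XML, 'D000086382'))  # ['1']
print(pmids_with_mesh(SAMPLE_XML, ''))            # ['1', '2']
```

The empty string keeps every article, mirroring the behaviour described for ```MeSH = ''```.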

## Dataframes

The next cell loads the *.csv* files into *pandas dataframes*.

- **path_csv**: path where the *.csv* files are stored.

    It should be the same path used for the parser.

- **type_of_df**: it can be *links* or *nodes*.

    If ```type_of_df='links'``` the function will create a links dataframe with two columns: *source* and *target*.

    If ```type_of_df='nodes'``` the function will create a nodes dataframe. In this case the columns are set with the **columns** parameter, according to the information extracted during parsing. To do so, set ```columns=info```. As before, if not specified, by default ```columns = ['title', 'abstract', 'date', 'authors', 'journal', 'keywords']```.


In [None]:
# Create dataframes from csv files
df_links = pcn.csv_to_dataframe(path_csv=csv_path, type_of_df='links')
df_nodes = pcn.csv_to_dataframe(path_csv=csv_path, type_of_df='nodes', columns=info)
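
The relation between the two dataframes can be illustrated with toy data. In the sketch below the *source* and *target* columns follow the description above, while the `pmid` column name and all the values are assumptions made for the example; the real dataframes returned by PCNet may be laid out differently. The code lists the PMIDs that appear in a link but have no row in the nodes dataframe, i.e. the "unknown" nodes mentioned in the configuration.

```python
import pandas as pd

# Toy data: the 'pmid' column name and the values are illustrative assumptions
df_links_toy = pd.DataFrame({'source': ['1', '1', '2'], 'target': ['2', '3', '3']})
df_nodes_toy = pd.DataFrame({'pmid': ['1', '2'], 'title': ['Paper A', 'Paper B']})

# PMIDs that appear in a link but have no row in the nodes dataframe
linked = set(df_links_toy['source']) | set(df_links_toy['target'])
unknown = sorted(linked - set(df_nodes_toy['pmid']))
print(unknown)  # ['3']
```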

## Graph

Once you have the dataframes, you can create the graph with the function in the next cell.

- **df_links** and **df_nodes**: the dataframes just created.

- **connected_graph**: a boolean variable. Default: True.

    If ```connected_graph = True``` the graph will be weakly connected, which means that only the largest connected component of the graph will be kept.

    If ```connected_graph = False``` the graph will contain all the nodes.

    If specified in the configuration cell, ```connected_graph = connected```.

- **unknown_nodes**: a boolean variable. Default: False.

    If ```unknown_nodes = False``` the graph will contain only the nodes whose information is known because the related articles were parsed.

    If ```unknown_nodes = True``` the graph will contain all the nodes, including those for which only the *PMID* and the *citations* are known.

    If specified in the configuration cell, ```unknown_nodes = keep_unknown_nodes```.

In [None]:
# Create the graph from the dataframes
G = pcn.df_to_graph(df_links, df_nodes, connected_graph=connected, unknown_nodes=keep_unknown_nodes)
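
To make the meaning of "largest weakly connected component" concrete, here is a minimal sketch on a toy directed graph using plain *networkx*. It only illustrates the concept; it is not necessarily how `df_to_graph` builds the connected graph internally.

```python
import networkx as nx

# Toy directed graph with two separate components
G_toy = nx.DiGraph()
G_toy.add_edges_from([('1', '2'), ('3', '2'), ('4', '5')])

# Largest weakly connected component: edge direction is ignored when grouping nodes
largest = max(nx.weakly_connected_components(G_toy), key=len)
G_main = G_toy.subgraph(largest).copy()

print(sorted(G_main.nodes()))          # ['1', '2', '3']
print(nx.is_weakly_connected(G_main))  # True
```

Nodes `'4'` and `'5'` are dropped because they belong to a smaller component.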

## Save the graph

To save the graph you should specify:

- **path_graph**: path where the graph *.gexf* file will be stored.

    If specified in the configuration cell: ```path_graph = graph_path + term_mesh + '.gexf' ```

- **term_mesh**: it gives the name to the graph, which is saved as *term_mesh*.gexf.

    If it is not specified, by default ```term_mesh = 'pubmed'```.

In [None]:
# Save the graph as a .gexf file with the name specified in configuration cell
path_graph = graph_path + term_mesh + '.gexf'
nx.write_gexf(G, path_graph)
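
As a quick sanity check of the *.gexf* format, the sketch below writes a toy graph to a temporary file and reads it back with *networkx*. The graph, its `title` attribute and the file name `toy.gexf` are made up for the example.

```python
import os
import tempfile

import networkx as nx

# Toy citation graph with one node attribute
G_toy = nx.DiGraph()
G_toy.add_node('1', title='Paper A')
G_toy.add_node('2', title='Paper B')
G_toy.add_edge('1', '2')

# Write to a temporary .gexf file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.gexf')
    nx.write_gexf(G_toy, toy_path)
    G_back = nx.read_gexf(toy_path)

print(sorted(G_back.nodes()))      # ['1', '2']
print(G_back.nodes['1']['title'])  # Paper A
print(list(G_back.edges()))        # [('1', '2')]
```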

## Import a graph

Once you have saved your graphs, you may want to import one of them into this notebook again, in order to perform an analysis or to modify the data:

- If you want to import a graph, you can simply use the *networkx* function **read_gexf** with the path to the graph as argument.

- If you need to work on dataframes, you can obtain them in the same format as the ones created with the **PCNet** tool by using the *nodes_to_df* and *links_to_df* functions provided by ```PCNet_network.py```.

> :warning: With large networks, running the functions in the following cell could be very time consuming. If you saved the csv files corresponding to the network you need, it could be faster to build the graph again from them.

In [None]:
# Convert a graph from a .gexf file to a networkx graph
graph = 'pubmed.gexf'   # name of the saved file, i.e. term_mesh + '.gexf'
G = nx.read_gexf(graph_path + graph)

# Convert a graph from a .gexf file to pandas dataframes
df_nodes = pcn.nodes_to_df(graph_path + graph)
df_edges = pcn.links_to_df(graph_path + graph)
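
As an example of a simple analysis on an imported graph, the sketch below ranks the nodes of a toy graph by in-degree. It assumes that each edge points from the citing paper (*source*) to the cited one (*target*); check this convention on your own graph before interpreting the result as a citation count.

```python
import networkx as nx

# Toy graph; edges are assumed to point from the citing paper to the cited one
G_toy = nx.DiGraph([('1', '3'), ('2', '3'), ('3', '4')])

# Rank nodes by in-degree, i.e. by the number of incoming links
ranking = sorted(G_toy.in_degree(), key=lambda pair: pair[1], reverse=True)
print(ranking[0])  # ('3', 2)
```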