# Analysis of publications mentioning Gephi

Data: all publications that mention Gephi somewhere

In [1]:
import pandas as pd
data = pd.read_csv("../Data/scopus_all.csv")
data.columns

## Which software is used with Gephi?

Keep only the articles that mention Gephi in the abstract

In [11]:
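# case=False ignores capitalisation; na=False treats missing abstracts as non-matches
f = data["Abstract"].str.contains("gephi", case=False, na=False)
# .copy() makes df_sub an independent DataFrame, so new columns can be added safely
df_sub = data[f].copy()
df_sub.shape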

Use GLiNER to extract software mentions

In [17]:
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

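def split_text(text, chunk_size=300):
    # Fixed-size character chunks; a chunk boundary can fall inside a word
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks

def get_softwares(text):
    """Return the unique, lowercased software names GLiNER detects in text."""
    chunks = split_text(text)
    total = []
    for c in chunks:
        total += [i["text"].lower().strip() for i in model.predict_entities(c, labels=["Software"])]
    return list(set(total))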



In [41]:
get_softwares(data[f]["Abstract"].iloc[1])

['spss', 'exc', 'gephi']
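
The `'exc'` entry in the output above is probably a name cut at a 300-character boundary, since `split_text` slices at fixed offsets. A minimal sketch of a word-aware alternative, assuming the standard-library `textwrap` module is acceptable here; `split_text_on_words` is a hypothetical name, not part of the notebook:

```python
import textwrap

def split_text_on_words(text, chunk_size=300):
    """Split text into chunks of at most chunk_size characters without cutting words.

    Hypothetical helper: a single word longer than chunk_size is kept whole,
    so that chunk may exceed the limit.
    """
    return textwrap.wrap(
        text,
        width=chunk_size,
        break_long_words=False,   # never cut inside a word
        break_on_hyphens=False,   # keep hyphenated names together
    )

sample = "Networks were drawn with Gephi and statistics were computed in Excel and SPSS."
chunks = split_text_on_words(sample, chunk_size=30)
print(chunks)
```

A multi-word software name can still be split across two chunks with this approach, so it reduces truncated names rather than eliminating them.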

In [42]:
df_sub["softwares"] = df_sub["Abstract"].apply(get_softwares)
df_sub.to_csv("scopus_gephi_softwaredetect.csv")


Manual cleaning of the software names

In [70]:
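# Flatten the per-article lists and count how often each name was detected
t = pd.Series([j for i in df_sub["softwares"] for j in i]).value_counts().reset_index()
t.columns = ["name", "count"]
# Uncomment to export the candidates for manual cleaning (overwrites reco_software.csv)
#t.to_csv("reco_software.csv")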
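
The counting step above flattens the per-article lists and counts each name. An equivalent sketch on toy data using `Series.explode`, which may be easier to follow; the toy values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for df_sub["softwares"]: one list of detected names per article
toy = pd.Series([["gephi", "spss"], ["gephi", "vosviewer"], ["gephi"]])

counts = (
    toy.explode()                    # one row per (article, name) pair
       .value_counts()               # occurrences of each name, most frequent first
       .rename_axis("name")
       .reset_index(name="count")
)
print(counts)
```

An article with an empty list becomes a missing value after `explode`, and `value_counts` drops missing values by default, so such articles do not affect the counts.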

Load the cleaned list of software names. Caveat: matching names from a list back against the text is error-prone, because some names are polysemous (they are also ordinary words, or substrings of other words).

In [108]:
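# Keep the first 100 names of the manually cleaned list
softwares = list(pd.read_csv("reco_software.csv")["name"])[0:100]

# Note: this redefines the GLiNER-based get_softwares from the earlier cell
def get_softwares(texte, liste):
    """
    Return the names from liste that occur in texte (case-insensitive substring match).
    """
    return [i for i in liste if i.lower() in texte.lower()]

df_sub["softwares_reco"] = df_sub["Abstract"].apply(lambda x: get_softwares(x, softwares))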
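
The substring test above matches a name anywhere inside a word, so a short name such as `r` would match almost every abstract. One way to reduce this, sketched under the assumption that names should only match as whole tokens, is a regular expression with lookarounds. `match_whole_names` is a hypothetical helper, and it does not solve true polysemy (a name that is also an ordinary word):

```python
import re

def match_whole_names(text, names):
    """Return the names that occur in text as whole tokens (case-insensitive).

    Hypothetical helper: lookarounds are used rather than a plain word
    boundary so that names ending in punctuation (e.g. "c++") still match.
    """
    found = []
    for name in names:
        # The name must not be preceded or followed by a word character
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found

abstract = "Networks were built in R and visualized with Gephi; models were fitted in C++."
print(match_whole_names(abstract, ["gephi", "r", "c++", "excel", "pajek"]))
```

With this check, `r` is found in the sentence above only because it appears as a standalone token, not because the letter occurs inside other words.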


Build the co-occurrence network: one node per software, one edge per pair of software mentioned in the same abstract

In [109]:
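import networkx as nx
from itertools import combinations

reseau = nx.Graph()

def recode(x):
    """Lowercase a name and merge every Gephi variant into "gephi"."""
    x = x.lower()
    return "gephi" if "gephi" in x else x

for names in df_sub["softwares_reco"]:
    # Normalise once and deduplicate, so nodes and edges use the same labels
    names = sorted({recode(x) for x in names})

    # Node weight: number of abstracts mentioning the software
    for s in names:
        if not reseau.has_node(s):
            reseau.add_node(s, weight=0)
        reseau.nodes[s]["weight"] += 1
    # Edge weight: number of abstracts mentioning both software
    for a, b in combinations(names, 2):
        if not reseau.has_edge(a, b):
            reseau.add_edge(a, b, weight=0)
        reseau.edges[a, b]["weight"] += 1

# Every abstract mentions Gephi, so drop that node to see the other tools
reseau.remove_node("gephi")
print(reseau)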

Graph with 94 nodes and 511 edges
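
As a sanity check on the loop above, the same node and edge weights can be computed on toy data with plain counters. This is a sketch with made-up lists, not the notebook's data, and it deliberately does not use networkx:

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for df_sub["softwares_reco"] (made-up values)
rows = [
    ["gephi", "vosviewer", "citespace"],
    ["gephi", "vosviewer"],
    ["gephi", "r", "vosviewer", "citespace"],
]

node_weight = Counter()   # number of abstracts mentioning each tool
edge_weight = Counter()   # number of abstracts mentioning both tools of a pair

for names in rows:
    names = sorted(set(names) - {"gephi"})   # Gephi is in every row, so skip it
    node_weight.update(names)
    edge_weight.update(combinations(names, 2))

print(edge_weight.most_common(3))
```

Sorting each row keeps the pair order consistent, so `(a, b)` and `(b, a)` are counted as the same undirected edge.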


Display it

In [110]:
from ipysigma import Sigma
Sigma(reseau, node_size="weight", edge_size="weight")

Sigma(nx.Graph with 94 nodes and 511 edges)