Analysis of the tag graph

![common tag](./images/graph_schema-TAG_TAG_COMMON_PROJECT.png)

Note that the projection above is by no means the weight we should consider.

## Import data and define the weight

In [None]:
# make only some GPUs visible (must be set before importing cudf/cugraph)
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

In [None]:
import pandas as pd
import cudf
import cugraph as cnx
from tqdm import tqdm

tqdm.pandas()

In [None]:
project_tag = cudf.read_csv("../data/gen/project_tags.csv")
project_tag.rename(columns={":START_ID(Loan-ID)": "project_id", ":END_ID": "tag"}, inplace=True)
project_tag.drop(columns=[":TYPE"], inplace=True)
# project_tag["tag"] = project_tag["tag"].astype("category").cat.as_ordered() # does not work
project_tag.head()

Notice that the table above is the edge list of a bipartite graph whose node types are `project_id` and `tag`.
We study this graph under the hypothesis that there may be *communities* of tags that contribute the same type of impact.

Now, do a *bipartite* projection onto the *tag* nodes. We will use the following weight. Suppose there are two tags, `tag1` and `tag2`, whose corresponding project sets are $T_1$ and $T_2$. We can use the "intersection over union" (the Jaccard index) as the weight, that is

$$\mathrm{weight}(\mathrm{tag1}, \mathrm{tag2}) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

Notice that this weight is symmetric: for any two tags $t_1$ and $t_2$, $\mathrm{weight}(t_1, t_2) = \mathrm{weight}(t_2, t_1)$.
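
As a minimal sketch of this weight, the snippet below computes intersection over union on plain Python sets. The toy tags (`a`, `b`, `c`), the project ids (`p1` to `p4`) and the helper `jaccard_weight` are hypothetical and only serve as illustration; they are not part of the dataset or of the notebook's pipeline.

```python
# Hypothetical toy data: each tag maps to the set of projects that carry it.
projects_by_tag = {
    "a": {"p1", "p2", "p3"},
    "b": {"p1", "p2"},
    "c": {"p2", "p4"},
}


def jaccard_weight(t1, t2):
    """Intersection over union of two project sets (0.0 if both are empty)."""
    union = t1 | t2
    if not union:
        return 0.0
    return len(t1 & t2) / len(union)


w_ab = jaccard_weight(projects_by_tag["a"], projects_by_tag["b"])
w_ba = jaccard_weight(projects_by_tag["b"], projects_by_tag["a"])
w_ac = jaccard_weight(projects_by_tag["a"], projects_by_tag["c"])

# Swapping the arguments gives the same value, so one direction per pair is enough.
print(w_ab, w_ba, w_ac)
```

Here tags `a` and `b` share two of three distinct projects, while `a` and `c` share one of four.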

Project the bipartite graph onto the *tag* nodes.
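
For a small graph, the same kind of projection can be cross-checked with `networkx`, whose `bipartite.overlap_weighted_projected_graph` uses the Jaccard index when `jaccard=True`. The sketch below is only a sanity check on hypothetical toy edges (`p1` to `p4`, tags `a` to `c`); it is not the GPU pipeline used in this notebook and may not scale to the full dataset.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical (project, tag) edge list, same shape as the project/tag table.
edges = [
    ("p1", "a"), ("p1", "b"),
    ("p2", "a"), ("p2", "b"), ("p2", "c"),
    ("p3", "a"),
    ("p4", "c"),
]
project_nodes = {p for p, _ in edges}
tag_nodes = {t for _, t in edges}

# Build the bipartite graph, marking which side each node belongs to.
bip = nx.Graph()
bip.add_nodes_from(project_nodes, bipartite=0)
bip.add_nodes_from(tag_nodes, bipartite=1)
bip.add_edges_from(edges)

# Project onto the tag side; edge weights are |N(u) & N(v)| / |N(u) | N(v)|.
tag_graph = bipartite.overlap_weighted_projected_graph(bip, tag_nodes, jaccard=True)

for u, v, w in sorted(tag_graph.edges(data="weight")):
    print(u, v, round(w, 3))
```

Only tags that share at least one project get an edge, which matches what the merge-based approach produces.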

In [None]:
merged = project_tag.merge(project_tag, on="project_id")
merged.head(3)

In the table above, each row holds a `project_id` and two tags that both appear in that project.  
Notice that the merge also produces rows where `tag_x` and `tag_y` are the same; we filter those out.
Also, because the weight is symmetric, we keep only half of the table, namely the rows where `tag_x` > `tag_y`.
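
To make the effect of the self-merge and the filter concrete, here is a small sketch with plain `pandas` on hypothetical toy data (the frame `toy` and its values are made up for illustration; the notebook itself works on `cudf` frames).

```python
import pandas as pd

# Hypothetical project/tag edge list.
toy = pd.DataFrame(
    {
        "project_id": ["p1", "p1", "p2", "p2", "p2", "p3", "p4"],
        "tag": ["a", "b", "a", "b", "c", "a", "c"],
    }
)

# Self-merge on project_id: every ordered pair of tags within a project,
# including (tag, tag) pairs and both orderings of each distinct pair.
merged_toy = toy.merge(toy, on="project_id")

# Keeping tag_x > tag_y drops the self-pairs and one of the two orderings.
filtered_toy = merged_toy[merged_toy["tag_x"] > merged_toy["tag_y"]]

print(len(merged_toy), len(filtered_toy))
print(filtered_toy)
```

A project with $k$ tags contributes $k^2$ rows to the merge but only $k(k-1)/2$ rows after filtering.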

In [None]:
filtered = merged[merged["tag_x"] > merged["tag_y"]]
filtered.head()

In [None]:
inter = filtered.groupby(["tag_x", "tag_y"]).nunique()
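# NOTE: despite its name, the "union" column holds |T1 ∩ T2|,
# i.e. the number of projects shared by both tags
inter.rename(columns={"project_id": "union"}, inplace=True)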
inter.reset_index(inplace=True)
inter.head()

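Note that, by inclusion-exclusion, the size of the union can be computed from the sizes of the two sets and of their intersection: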

$${|T_1 \cup T_2|} = |T_1| + |T_2| - |T_1 \cap T_2|$$
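
The identity can be checked on a toy example. The sketch below, using plain `pandas` and hypothetical data, derives the union size from per-tag counts and the intersection count, then compares it with a direct set union. The column names `intersection`, `size_x`, `size_y` and `union` are illustrative choices and differ from the names used in the notebook's cells.

```python
import pandas as pd

# Hypothetical project/tag edge list.
toy = pd.DataFrame(
    {
        "project_id": ["p1", "p1", "p2", "p2", "p2", "p3", "p4"],
        "tag": ["a", "b", "a", "b", "c", "a", "c"],
    }
)

# Tag pairs that co-occur in a project, one ordering per pair.
pairs = toy.merge(toy, on="project_id")
pairs = pairs[pairs["tag_x"] > pairs["tag_y"]]

# |T1 ∩ T2|: number of distinct projects shared by each tag pair.
inter_toy = (
    pairs.groupby(["tag_x", "tag_y"])["project_id"]
    .nunique()
    .rename("intersection")
    .reset_index()
)

# |T1| and |T2|: number of distinct projects per tag.
sizes = toy.groupby("tag")["project_id"].nunique()
inter_toy["size_x"] = inter_toy["tag_x"].map(sizes)
inter_toy["size_y"] = inter_toy["tag_y"].map(sizes)

# Inclusion-exclusion gives |T1 ∪ T2| without building the union explicitly.
inter_toy["union"] = inter_toy["size_x"] + inter_toy["size_y"] - inter_toy["intersection"]
inter_toy["weight"] = inter_toy["intersection"] / inter_toy["union"]

# Cross-check against a direct set union.
tag_sets = toy.groupby("tag")["project_id"].apply(set)
for row in inter_toy.itertuples(index=False):
    assert row.union == len(tag_sets[row.tag_x] | tag_sets[row.tag_y])

weights_toy = {(r.tag_x, r.tag_y): r.weight for r in inter_toy.itertuples(index=False)}
print(weights_toy)
```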

In [None]:
pro_by_tag = project_tag.groupby("tag").nunique()
pro_by_tag.rename(columns={"project_id": "nunique"}, inplace=True)
pro_by_tag.reset_index(inplace=True)
pro_by_tag.head()

In [None]:
pair = (
    inter.merge(pro_by_tag, left_on="tag_x", right_on="tag")
    .drop(columns=["tag"])
    .rename(columns={"nunique": "nunique_x"})
)
pair.head()

In [None]:
pair = (
    pair.merge(pro_by_tag, left_on="tag_y", right_on="tag")
    .drop(columns=["tag"])
    .rename(columns={"nunique": "nunique_y"})
)
pair

In [None]:
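# NOTE: the column names are swapped with respect to their meaning:
# "union" holds |T1 ∩ T2|, so "overlap" = |T1| + |T2| - |T1 ∩ T2| is really |T1 ∪ T2|
pair["overlap"] = pair["nunique_x"] + pair["nunique_y"] - pair["union"]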
pair

In [None]:
# intersection over union: the "union" column holds |T1 ∩ T2| and "overlap" holds |T1 ∪ T2|
pair["weight"] = pair["union"] / pair["overlap"]
pair

In [None]:
pair.info()

## Let's try `networkx`

In [None]:
edges_df = pair[["tag_x", "tag_y", "weight"]].to_pandas()
edges_df.info()

In [None]:
import networkx as nx
import matplotlib.pyplot as plt


# build the weighted, undirected tag graph from the edge list
Gnx = nx.from_pandas_edgelist(edges_df, source="tag_x", target="tag_y", edge_attr="weight")
edge_vmin = edges_df["weight"].min()
edge_vmax = edges_df["weight"].max()
print(Gnx.number_of_nodes(), Gnx.number_of_edges(), edge_vmin, edge_vmax)
nx.write_gexf(Gnx, "../data/gen/tag_tag_common_loans.gexf")  # use this one for gephi

# draw the graph in circular layout, with labels, and map the color of the edges with the attribute weight
weights = [Gnx[u][v]["weight"] for u, v in Gnx.edges()]
nx.draw_circular(Gnx, with_labels=True, edge_color=weights, edge_cmap=plt.cm.Reds)

In [None]:
# calculate the weighted degree (sum of incident edge weights) of each node and store it in a dataframe
degree = Gnx.degree(weight="weight")
node_df = pd.DataFrame.from_dict(dict(degree), orient="index", columns=["degree"])
node_df.index.name = "tag"
node_df.reset_index(inplace=True)
node_df.head()

In [None]:
# naive community detection on the graph using the Louvain algorithm
communities = nx.community.louvain_communities(Gnx, resolution=1.1, seed=123)
# map each tag to the index of the community it belongs to
community_index = {node: i for i, members in enumerate(communities) for node in members}
partition = pd.DataFrame.from_dict(community_index, orient="index", columns=["louvain_community"])
partition

In [None]:
node_df = node_df.merge(partition, left_on="tag", right_index=True)
node_df

In [None]:
import forceatlas2
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(200, 400))
node_df["degree_scaled"] = scaler.fit_transform(node_df[["degree"]])

nodes = node_df.tag.values
nodes_size = node_df.degree_scaled.values
node_color = node_df.louvain_community.values

pos = forceatlas2.forceatlas2_networkx_layout(
    Gnx, niter=1000, scalingRatio=20.0, strongGravityMode=True, gravity=0.05
)  # Optionally specify iteration count
nx.draw_networkx(
    Gnx,
    pos,
    nodelist=nodes,
    node_size=nodes_size,
    with_labels=True,
    node_color=node_color,
    cmap=plt.cm.viridis,
    edge_color=weights,
    edge_cmap=plt.cm.Reds,
)
plt.show()

In [None]:
# Add community information to nodes
for node, community_id in community_index.items():
    Gnx.nodes[node]["community_louvain"] = community_id
nx.write_gexf(Gnx, "../data/gen/tag_tag_common_loans.gexf")  # use this one for gephi