# Social Network Analysis
## AO3 User Bookmarks

The following is an analysis of the network created by users bookmarking fics on AO3.

We used the [Social Network Analysis with Python](https://www.kirenz.com/post/2019-08-13-network_analysis/) article by Jan Kirenz as a reference for our analysis.

In [1]:
#imports
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
import warnings; warnings.simplefilter('ignore')
import json
import pandas as pd

First, we read in our data and reshape it so that we can build a graph from it.

In [2]:
#read in data
topfic_bookmarks = pd.read_json('../ao3bot/top_fic_bookmarks.json')
topfic_bookmarks.head()


Unnamed: 0,fandom,total_bookmarks,users,fic_info
0,僕のヒーローアカデミア | Boku no Hero Academia | My Hero ...,16100,[{'user_link': '/users/whole_grain_bagel/pseud...,"{'fic_link': '/works/8337607', 'fic_name': 'Ye..."
1,Haikyuu!!,11388,[{'user_link': '/users/artisticDragon074639/ps...,"{'fic_link': '/works/5096105', 'fic_name': 'In..."
2,Naruto,8451,[{'user_link': '/users/AllCrush/pseuds/AllCrus...,"{'fic_link': '/works/8211566', 'fic_name': 'Of..."
3,Shingeki no Kyojin | Attack on Titan,3809,[{'user_link': '/users/Frank_boi/pseuds/Frank_...,"{'fic_link': '/works/2336534', 'fic_name': 'Ei..."
4,Miraculous Ladybug,4335,[{'user_link': '/users/cook999/pseuds/cook999'...,"{'fic_link': '/works/7568518', 'fic_name': 'Pi..."


In [3]:
user_df = pd.read_json('../ao3bot/user_bookmarks_final.json')
user_df.head()

Unnamed: 0,user,fic_info,fandoms,user_stats
0,Watermelonflooff,"[[{'fic_link': '/works/8337607', 'fic_name': '...",[[{'fandom_link': '/tags/%E5%83%95%E3%81%AE%E3...,"{'total_bookmarks': 18, 'ratings': {'Explicit'..."
1,SanguineRoar,"[[{'fic_link': '/works/24273403', 'fic_name': ...",[[{'fandom_link': '/tags/%E5%83%95%E3%81%AE%E3...,"{'total_bookmarks': 263, 'ratings': {'Explicit..."
2,duhvy,"[[{'fic_link': '/works/8337607', 'fic_name': '...",[[{'fandom_link': '/tags/%E5%83%95%E3%81%AE%E3...,"{'total_bookmarks': 53, 'ratings': {'Teen And ..."
3,Sther_2515,"[[{'fic_link': '/works/8337607', 'fic_name': '...",[[{'fandom_link': '/tags/%E5%83%95%E3%81%AE%E3...,"{'total_bookmarks': 143, 'ratings': {'Teen And..."
4,BlackCat666,"[[{'fic_link': '/series/56292', 'fic_name': 'T...",[[{'fandom_link': '/tags/Teen%20Wolf%20(TV)/wo...,"{'total_bookmarks': 410, 'ratings': {'Teen And..."


In `user_df`, each row holds the `user` whose bookmarks we scraped, the `fic_info` for those bookmarks, and the `fandoms` associated with them. The `user_stats` column holds the general stats for that user's bookmarks.

In `topfic_bookmarks` we have the information on the top fic in each fandom we scraped and the users who bookmarked it. This is the starting file that guided the scraping for the `user_df` data.
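The cells below walk the nested `fic_info` column with three loops, which suggests each cell is a list of lists of dicts. As a minimal sketch under that assumption, the nesting can be flattened into one row per (user, fic) pair. The toy frame, the `flatten_bookmarks` helper, and the reading of the inner lists as scraped pages are all hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for user_df (hypothetical data). Each fic_info cell is assumed
# to be a list of lists of dicts, matching how the notebook loops over it.
toy_user_df = pd.DataFrame({
    "user": ["user_a", "user_b"],
    "fic_info": [
        [[{"fic_link": "/works/1", "fic_name": "A"},
          {"fic_link": "/works/2", "fic_name": "B"}]],
        [[{"fic_link": "/works/2", "fic_name": "B"}],
         [{"fic_link": "/works/3", "fic_name": "C"}]],
    ],
})


def flatten_bookmarks(df):
    """Return one row per (user, fic_link) pair found in the nested column."""
    rows = []
    for user, pages in zip(df["user"], df["fic_info"]):
        for page in pages:          # outer list (assumed: one entry per scraped page)
            for entry in page:      # inner list of bookmark dicts
                if "fic_link" in entry:
                    rows.append({"user": user, "fic_link": entry["fic_link"]})
    return pd.DataFrame(rows, columns=["user", "fic_link"])


long_df = flatten_bookmarks(toy_user_df)
print(long_df)
```

A long table like this makes later steps, such as counting bookmarks per fic or joining against a node list, a single pandas operation instead of nested loops.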

### Common Bookmarks

In this analysis we will be looking at the data in this way:
- Nodes: Fic
- Edges: A user has bookmarked both of the fics (nodes) that the edge connects

To do this we will need to wrangle the data into the correct format.
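To make the node and edge definitions concrete, here is a small sketch of a general co-bookmark edge list: two fics are linked whenever the same user bookmarked both, and the weight counts how many users did so. The `bookmarks` dictionary is made-up toy data, and this pairwise rule is an illustration only; the cells that follow build edges from each top fic instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: user -> fics that user bookmarked.
bookmarks = {
    "user_a": ["/works/1", "/works/2"],
    "user_b": ["/works/2", "/works/3"],
    "user_c": ["/works/1", "/works/2", "/works/3"],
}

edge_weights = Counter()
for user, fics in bookmarks.items():
    # Every unordered pair of distinct fics bookmarked by the same user
    # contributes one co-bookmark to that pair's weight.
    for fic_a, fic_b in combinations(sorted(set(fics)), 2):
        edge_weights[(fic_a, fic_b)] += 1

for (fic_a, fic_b), weight in sorted(edge_weights.items()):
    print(fic_a, fic_b, weight)
```

Sorting each user's fics before pairing keeps every edge key in a single canonical order, so `("/works/1", "/works/2")` and its reverse are never counted separately.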

First, we build the list of nodes, giving each fic a numeric `Id`:

In [4]:
nodes = []
seen_links = set()  # fic links that have already been given an Id
tf_users = topfic_bookmarks["users"].tolist()
count = 1
# one pass over every user who bookmarked a top fic
for fic_users in tf_users:
    for user in fic_users:
        # rows of user_df for this user (empty if we did not scrape them)
        l = user_df.loc[user_df["user"] == user["user_name"]]["fic_info"]
        for i in l:
            for j in i:
                for k in j:
                    if 'fic_link' in k and k["fic_link"] not in seen_links:
                        nodes.append([count, k["fic_link"]])
                        seen_links.add(k["fic_link"])
                        count += 1
print(len(nodes))
# nodes_df = pd.DataFrame(nodes, columns=["Id","Label"])
# nodes_df.head()

8882


In [5]:
nodes_df = pd.DataFrame(nodes, columns=["Id","Label"])
nodes_df.head()
nodes_df.to_csv("nodes_and_edges/common_fics_nodes_reformatted.csv", index = False)

In [6]:
nodes_df.loc[nodes_df.Label == '/works/31830625','Id'].tolist()[0]

3
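The lookup above filters the whole frame to find one `Id`. When the same lookup runs once per edge, a plain dictionary from label to Id may be considerably faster. The sketch below uses a toy node table with placeholder labels; `toy_nodes_df` and `label_to_id` are hypothetical names.

```python
import pandas as pd

# Toy node table with the same columns as nodes_df (placeholder labels).
toy_nodes_df = pd.DataFrame(
    [[1, "/works/111"], [2, "/works/222"], [3, "/works/333"]],
    columns=["Id", "Label"],
)

# Build the mapping once; each later lookup is a constant-time dict access.
label_to_id = dict(zip(toy_nodes_df["Label"], toy_nodes_df["Id"]))

print(label_to_id["/works/333"])
# .get returns None for a label that has no node instead of raising.
print(label_to_id.get("/works/999"))
```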

In [7]:
edges = []
tf_users = topfic_bookmarks["users"].tolist()
topfics = topfic_bookmarks["fic_info"].tolist()
# as written, every top fic is paired with the bookmarks of every user list in tf_users,
# not only with the users who bookmarked that particular fic
for fic in topfics:
    # Id of the top fic, used as the source of each edge
    src = nodes_df.loc[nodes_df.Label == fic["fic_link"],'Id'].tolist()[0]
    for fic_users in tf_users:
        for user in fic_users:
            l = user_df.loc[user_df["user"] == user["user_name"]]["fic_info"]
            for i in l:
                for j in i:
                    for k in j:
                        if 'fic_link' in k:
                            target = nodes_df.loc[nodes_df.Label == k["fic_link"],'Id'].tolist()[0]
                            edges.append([src, target, "Directed"])
print(len(edges))

550902


In [8]:
# tf_users = topfic_bookmarks["users"].tolist()
# topfics = topfic_bookmarks["fic_info"].tolist()
# users = user_df["user"].tolist()
# fics = []
# for fic in topfics:
#     for users in tf_users:
#         for user in users:
#             l = user_df.loc[user_df["user"] == user["user_name"]]["fic_info"]
#             if len(l) > 0:
#                 for i in l:
#                     for j in i:
#                         for k in j:
#                             if 'fic_link' in k.keys():
#                                 if fic["fic_link"] != k["fic_link"]:
#                                     fics.append([fic["fic_link"], k["fic_link"]])
                                    

In [9]:
fics_df = pd.DataFrame(edges, columns = ["Source", "Target", "Type"])
fics_df

Unnamed: 0,Source,Target,Type
0,1,1,Directed
1,1,2,Directed
2,1,3,Directed
3,1,4,Directed
4,1,5,Directed
...,...,...,...
550897,8563,8878,Directed
550898,8563,8879,Directed
550899,8563,8880,Directed
550900,8563,8881,Directed


In [10]:
# we could graph it here, but I am exporting the edges and using different software for faster graphing
fics_df.to_csv("nodes_and_edges/common_fics_edges_reformatted.csv", index = False)
# fic1 = fics_df["ficA"].tolist()
# fic2 = fics_df["ficB"].tolist()
# i = 0;
# g = nx.Graph()
# while i < 10000:
#     g.add_edge(fic1[i], fic2[i])
#     i+=1
# print(nx.info(g))

In [11]:
# pos = nx.spring_layout(g)
# betCent = nx.betweenness_centrality(g, normalized=True, endpoints=True)
# node_color = [20000.0 * g.degree(v) for v in g]
# node_size =  [v * 10000 for v in betCent.values()]
# plt.figure(figsize=(10,10))
# nx.draw_networkx(g, pos=pos, with_labels=False,
#                  node_color=node_color,
#                  node_size=node_size )
# plt.axis('off');
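For readers who want to stay in Python rather than export, the following is a minimal sketch of loading an edge table with the same `Source`, `Target`, and `Type` columns into NetworkX and ranking nodes by in-degree. The toy edges are made up, and dropping duplicate rows before building the graph is an assumption made here, not something the notebook does.

```python
import networkx as nx
import pandas as pd

# Toy edge table with the same columns as the exported CSV (hypothetical rows).
toy_edges = pd.DataFrame(
    [[1, 2, "Directed"], [1, 3, "Directed"], [2, 3, "Directed"], [1, 2, "Directed"]],
    columns=["Source", "Target", "Type"],
)

# A DiGraph keeps at most one edge per ordered pair, so repeated rows are
# dropped explicitly to make that visible.
unique_edges = toy_edges.drop_duplicates(subset=["Source", "Target"])
g = nx.from_pandas_edgelist(
    unique_edges, source="Source", target="Target", create_using=nx.DiGraph
)

# In-degree counts how many distinct sources point at each node.
in_deg = dict(g.in_degree())
ranked = sorted(in_deg.items(), key=lambda item: item[1], reverse=True)
print(g.number_of_nodes(), g.number_of_edges())
print(ranked)
```

On the full edge list this may still be slow to draw, which is consistent with the choice to export; summary statistics such as degree are usually much cheaper than computing a layout.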

### Common Fandoms
Let's look at the fandoms that users have in common, based on who bookmarked a top fic.

In [12]:
fandom_nodes = []
seen_fandoms = set()  # fandom names that have already been given an Id
tf_users = topfic_bookmarks["users"].tolist()
count = 1
# one pass over every user who bookmarked a top fic
for fic_users in tf_users:
    for user in fic_users:
        # rows of user_df for this user (empty if we did not scrape them)
        l = user_df.loc[user_df["user"] == user["user_name"]]["fandoms"]
        for i in l:
            for j in i:
                for k in j:
                    if 'fandom_name' in k and k["fandom_name"] not in seen_fandoms:
                        fandom_nodes.append([count, k["fandom_name"]])
                        seen_fandoms.add(k["fandom_name"])
                        count += 1
print(len(fandom_nodes))
# nodes_df = pd.DataFrame(nodes, columns=["Id","Label"])
# nodes_df.head()
# fandom_nodes = []
# for node in fandoms_df["fandomA"].tolist():
#     if node not in fandom_nodes:
#         fandom_nodes.append(node)
# for node in fandoms_df["fandomB"].tolist():
#     if node not in fandom_nodes:
#         fandom_nodes.append(node)
# fandom_nodes_df = pd.DataFrame(fandom_nodes, columns=["fandom_nodes"])
# fandom_nodes_df.head()

1326


In [13]:
fandom_nodes_df = pd.DataFrame(fandom_nodes, columns=["Id", "Label"])
fandom_nodes_df.head()
fandom_nodes_df.to_csv("nodes_and_edges/common_fandoms_nodes_reformatted.csv", index = False)

In [14]:
tf_users = topfic_bookmarks["users"].tolist()
topfics = topfic_bookmarks["fandom"].tolist()
fandom_egdes = []
# as written, every top-fic fandom is paired with the fandoms of every user list in tf_users,
# not only with the users who bookmarked that fandom's top fic
for fandom in topfics:
    # Id of the top fic's fandom (empty list if the fandom has no node)
    src = fandom_nodes_df.loc[fandom_nodes_df.Label == fandom,'Id'].tolist()
    for fic_users in tf_users:
        for user in fic_users:
            l = user_df.loc[user_df["user"] == user["user_name"]]["fandoms"]
            for i in l:
                for j in i:
                    for k in j:
                        if 'fandom_name' in k:
                            target = fandom_nodes_df.loc[fandom_nodes_df.Label == k["fandom_name"],'Id'].tolist()
                            if len(src) > 0 and len(target) > 0:
                                fandom_egdes.append([src[0], target[0], "Directed"])

In [15]:
fandoms_df = pd.DataFrame(fandom_egdes, columns = ["Source", "Target", "Type"])

In [16]:
fandoms_df.to_csv("nodes_and_edges/common_fandoms_edges_reformatted.csv", index = False)

In [17]:
# trying to get smaller graphs, let's look at My Hero Academia and Naruto
'''
My Hero Academia and Naruto are both shounen anime/manga, so let's see if there are any similarities in the fandoms people bookmark after they have bookmarked one of the top fics.
These are nodes 1 and 36
'''
# subset data to only edges with src = 1 or 36, then subset the node list
believe_it = fandoms_df[fandoms_df["Source"].isin([1, 36])].copy()  # copy so the assignment below does not write to a view
believe_it["Type"] = "Undirected"
print(believe_it.head())

nodes = []
nodes_unformatted = []
for idx, row in believe_it.iterrows():
    temp = fandom_nodes_df.loc[fandom_nodes_df['Id'] == row["Target"]]["Label"].tolist()
    if len(temp)>0:
        if row["Target"] not in nodes_unformatted:
            nodes.append([row["Target"], temp[0]])
            nodes_unformatted.append(row["Target"])
print(len(nodes))

   Source  Target        Type
0       1       1  Undirected
1       1       2  Undirected
2       1       3  Undirected
3       1       4  Undirected
4       1       3  Undirected
1326


In [18]:
quirks = pd.DataFrame(nodes, columns=["Id", "Label"])
quirks.head()
quirks.to_csv("nodes_and_edges/mha_nar_nodes_reformatted.csv", index = False)

In [19]:
believe_it.to_csv("nodes_and_edges/mha_nar_edges_reformatted.csv", index = False)
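The subset-and-export steps above are repeated for several groups of fandoms below. As a hedged sketch, they could be wrapped in one helper. The name `subset_by_source` and the toy tables are hypothetical. Unlike the loop above, this version returns nodes in node-table order rather than in order of first appearance in the edges.

```python
import pandas as pd


def subset_by_source(edges_df, nodes_df, source_ids):
    """Keep edges whose Source is in source_ids, plus the nodes they point to."""
    sub_edges = edges_df[edges_df["Source"].isin(source_ids)].copy()
    sub_edges["Type"] = "Undirected"
    # Only nodes that appear as a Target of the kept edges are retained.
    keep = nodes_df["Id"].isin(sub_edges["Target"].unique())
    sub_nodes = nodes_df[keep].reset_index(drop=True)
    return sub_edges, sub_nodes


# Toy tables with the same columns as fandom_nodes_df and fandoms_df.
toy_nodes = pd.DataFrame(
    [[1, "Fandom A"], [2, "Fandom B"], [3, "Fandom C"], [4, "Fandom D"]],
    columns=["Id", "Label"],
)
toy_edges = pd.DataFrame(
    [[1, 2, "Directed"], [1, 3, "Directed"], [2, 4, "Directed"], [3, 1, "Directed"]],
    columns=["Source", "Target", "Type"],
)

sub_edges, sub_nodes = subset_by_source(toy_edges, toy_nodes, [1, 3])
print(sub_edges)
print(sub_nodes)
```

Each later cell would then reduce to one call with a different list of source Ids, followed by the two `to_csv` exports.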

In [21]:
'''
Let's try looking at the most popular fandoms based on our analysis from before: Marvel, RPF, KPop
'''
# subset data to only edges with src = 32, 733 or 744, then subset the node list
df1 = fandoms_df[fandoms_df["Source"].isin([32, 733, 744])].copy()  # copy so the assignment below does not write to a view
df1["Type"] = "Undirected"
df1

nodes = []
nodes_unformatted = []
for idx, row in df1.iterrows():
    temp = fandom_nodes_df.loc[fandom_nodes_df['Id'] == row["Target"]]["Label"].tolist()
    if len(temp)>0:
        if row["Target"] not in nodes_unformatted:
            nodes.append([row["Target"], temp[0]])
            nodes_unformatted.append(row["Target"])
print(len(nodes))
df2 = pd.DataFrame(nodes, columns=["Id", "Label"])
df2.head()
df2.to_csv("nodes_and_edges/top3_nodes_reformatted.csv", index = False)
df1.to_csv("nodes_and_edges/top3_edges_reformatted.csv", index = False)

1326


In [22]:
'''
Let's try looking at two less popular anime fandoms and see if the bookmarks stick to the genre or not: Attack on Titan and Miraculous Ladybug
'''
# subset data to only edges with src = 7 or 90, then subset the node list
anime1 = fandoms_df[fandoms_df["Source"].isin([7, 90])].copy()  # copy so the assignment below does not write to a view
anime1["Type"] = "Undirected"
#print(anime1)

nodes = []
nodes_unformatted = []
for idx, row in anime1.iterrows():
    temp = fandom_nodes_df.loc[fandom_nodes_df['Id'] == row["Target"]]["Label"].tolist()
    if len(temp)>0:
        if row["Target"] not in nodes_unformatted:
            nodes.append([row["Target"], temp[0]])
            nodes_unformatted.append(row["Target"])
print(len(nodes))
anime2 = pd.DataFrame(nodes, columns=["Id", "Label"])
anime2.head()
anime2.to_csv("nodes_and_edges/anime3_nodes_reformatted.csv", index = False)
anime1.to_csv("nodes_and_edges/anime3_edges_reformatted.csv", index = False)

1326


In [23]:
'''
Let's try looking at three less popular fandoms in the musical category and see if the bookmarks stick to the genre or not:
Hamilton - Miranda, Newsies - All Media Types, and Be More Chill - Iconis/Tracz
'''
# subset data to only edges with src = 152, 1236 or 815, then subset the node list
musical1 = fandoms_df[fandoms_df["Source"].isin([152, 1236, 815])].copy()  # copy so the assignment below does not write to a view
musical1["Type"] = "Undirected"
#print(musical1)

nodes = []
nodes_unformatted = []
for idx, row in musical1.iterrows():
    temp = fandom_nodes_df.loc[fandom_nodes_df['Id'] == row["Target"]]["Label"].tolist()
    if len(temp)>0:
        if row["Target"] not in nodes_unformatted:
            nodes.append([row["Target"], temp[0]])
            nodes_unformatted.append(row["Target"])
print(len(nodes))
musical2 = pd.DataFrame(nodes, columns=["Id", "Label"])
musical2.head()
musical2.to_csv("nodes_and_edges/musical3_nodes_reformatted.csv", index = False)
musical1.to_csv("nodes_and_edges/musical3_edges_reformatted.csv", index = False)

1326
