# Analiza društvene mreže Reddit

Univerzitet u Beogradu, Elektrotehnički fakultet, veb adresa [ovde](https://www.etf.bg.ac.rs).  
Predmet: Analiza socijalnih mreža, veb adresa [ovde](https://rti.etf.bg.ac.rs/rti/ms1asm/).  
Tekst projektnog zadatka se može naći na veb sajtu predmeta [ovde](https://rti.etf.bg.ac.rs/rti/ms1asm/projekti/2021-2022/ASM_PZ2_2122.pdf).  
- Bogdan Bebić 2022/3051
- Marta Avramović 2022/3166

## Instalacija i učitavanje korišćenih biblioteka
Neophodne bilioteke se mogu instalirati korišćenjem Python package instalera `pip`.

In [1]:
import os
import networkx as nx
import numpy as np
import pandas as pd

## Format skupa podataka
Podaci su preuzeti sa sajta predmeta i otpakovani u odgovarajuće direktorijume.
Kako postoji više datoteka sa istim tipom podataka, učitavamo ih sve zajedno i spajamo u jedinstvenu strukturu podataka.

In [2]:
reddit_data_path = "ASM_PZ2_podaci_2122/reddit2008"
submission_dataPath = f"{reddit_data_path}/submissions_2008_asm/"
comments_dataPath = f"{reddit_data_path}/comments_2008_asm_v1.1/comments_2008/"


def loadDataSet(folderPath):
    allFileData = pd.DataFrame([])
    for fileName in os.listdir(folderPath):
        singleFileData = pd.read_csv(folderPath + fileName, low_memory=False)
        allFileData = pd.concat([allFileData, singleFileData])

    return allFileData


def groupby_count(data_frame, groupby_list):
    return data_frame.groupby(groupby_list).size().reset_index(name="counts")


def groupby_count_sorted(data_frame, groupby_list):
    return groupby_count(data_frame, groupby_list).sort_values('counts', ascending=False)

## Filtriranje skupa podataka
Na društvenoj mreži Reddit je moguće obrisati nalog - u tom slučaju se mogu dobiti podaci koji sadrže `[deleted]` kao korisničko ime.
Ovakvi podaci nepovoljno utiču na analizu jer su ti nalozi mogli pripadati proizvoljnom broju korisnika i stoga ih filtriramo i ne koristimo u daljoj analizi.

In [3]:
submissionData = loadDataSet(submission_dataPath)
commentsData = loadDataSet(comments_dataPath)

# It is possible to have "[deleted]" as author name
submissionFilter = submissionData["author"] != "[deleted]"
commentsFilter = commentsData["author"] != "[deleted]"

filteredSubmissions = submissionData[submissionFilter]
filteredComments = commentsData[commentsFilter]

allData = pd.concat([filteredSubmissions, filteredComments])

## Statistička obrada podataka

In [4]:
allSubredditIds = np.union1d(submissionData['subreddit_id'], commentsData['subreddit_id'])
print(f"Number of different subreddits: {len(allSubredditIds)}")

commentsPerSubreddit = groupby_count_sorted(commentsData, ["subreddit_id"])
print(f"Comments per subreddit:\n{commentsPerSubreddit[:1]}")

# subreddit - author - count interactions
interactionsPerAuthorPerSubreddit = groupby_count(allData, ["subreddit_id", "author"])
# subreddit - count authors
authorsPerSubreddit = groupby_count_sorted(interactionsPerAuthorPerSubreddit, ["subreddit_id"])
print(f"Authors per subreddit:\n{authorsPerSubreddit[:1]}")

print(f"AVG number users per subreddit:\n{authorsPerSubreddit['counts'].sum() / len(allSubredditIds)}")

submissionsPerAuthor = groupby_count_sorted(filteredSubmissions, ['author'])
commentsPerAuthor = groupby_count_sorted(filteredComments, ['author'])
print(f"Max submissions per author:\n{submissionsPerAuthor[:1]}")
print(f"Max comments per author:\n{commentsPerAuthor[:1]}")

# author - subreddit - count interactions
interactionsPerSubredditPerAuthor = groupby_count(allData, ['author', 'subreddit_id'])
# author - count subreddits
subredditsPerAuthor = groupby_count_sorted(interactionsPerSubredditPerAuthor, ['author'])
print(f"Subreddits per author:\n{subredditsPerAuthor[:1]}")

Number of different subreddits: 5032
Comments per subreddit:
     subreddit_id   counts
2689         t5_6  1884629
Authors per subreddit:
     subreddit_id  counts
4354         t5_6  163779
AVG number users per subreddit:
128.78398251192368
Max submissions per author:
      author  counts
84823    gst   18870
Max comments per author:
                author  counts
12603  NoMoreNicksLeft   13480
Subreddits per author:
         author  counts
26173  MrKlaatu     181


In [None]:
paersonCalculation = submissionsPerAuthor.copy().merge(commentsPerAuthor.copy(), on="author", how="inner", suffixes=["_submissions", "_comments"])
paersonCalculation.plot.scatter(y="counts_submissions", x="counts_comments")
print(f"Pearson correlation matrix:\n{paersonCalculation.corr(method='pearson')}")

In [None]:
filterNonOver18 = submissionData["over_18"] == False
filteredSubmissionsNonOver18 = submissionData[filterNonOver18]
extractedCommentsData = pd.DataFrame(commentsData["link_id"].map(lambda element: element.split("_")[1]))
commentsDataSubmissionId = groupby_count_sorted(extractedCommentsData, ["link_id"]).rename(columns={"link_id": "id"})
filteredSubmissionsNonOver18JoinedComments = filteredSubmissionsNonOver18.merge(commentsDataSubmissionId, how="inner", on="id")
topSubmissionsNonOver18JoinedComments = filteredSubmissionsNonOver18JoinedComments.sort_values(by="counts", ascending=False)[:10]
print(topSubmissionsNonOver18JoinedComments)
topSubmissionsNonOver18JoinedComments.to_csv("top_submission_non_over_18.csv")

## Modelovanje podataka grafovima
Skup podataka modelujemo pomoću 4 različita grafa:
1. SNet (Subreddit network) - sadrži kompletne podatke, sve sabredite i interakcije sa njima
2. SNetF (Subreddit network filtered) - filtrirani SNet na osnovu broja korisnika koji definišu interakciju izmedju više sabredita
3. SNetT (Subreddit network targeted) - filtrirani SNet na osnovu odabranih sabredita i grana kojima su povezani
4. UserNet - sadrži interakcije između korisnika - komentare na objave ili na komentare

Iz SNet eliminišemo sve čvorove koji nemaju nijednu granu kako bismo omogućili dalju analizu povezanosti čvorova grafa.

In [5]:
snet = nx.Graph()
snet.add_nodes_from(allSubredditIds)

authorSubredditIdGroups = allData.groupby(["author", "subreddit_id"]).groups
groups = dict()
for author, subredditId in authorSubredditIdGroups:
    if author not in groups:
        groups[author] = [subredditId]
    else:
        groups[author].append(subredditId)

for key in groups:
    subreddit_ids = groups[key]
    for i in range(0, len(subreddit_ids)):
        for j in range(i + 1, len(subreddit_ids)):
            if snet.has_edge(subreddit_ids[i], subreddit_ids[j]):
                snet.edges[subreddit_ids[i], subreddit_ids[j]]['weight'] += 1
            else:
                snet.add_edge(subreddit_ids[i], subreddit_ids[j], weight=1)

snet.remove_nodes_from(list(nx.isolates(snet)))

### Odabir praga za filtriranje grana po težini
Prag za filtriranje je uzet kao prosečna vrednost svih težina grana u grafu SNet.

In [6]:
average_weight = sum([tags["weight"] for u, v, tags in snet.edges(data=True)]) / len(snet.edges)
w_threshold = average_weight

# TODO: check if commented out works
# snetf = snet.edge_subgraph([(u, v) for u, v, tags in snet.edges(data=True) if tags["weight"] > w_threshold])

snetf = nx.Graph()
snetf.add_nodes_from(allSubredditIds)
snetf.add_edges_from([(u, v, tags) for u, v, tags in snet.edges(data=True) if tags["weight"] > w_threshold])

snetf.remove_nodes_from(list(nx.isolates(snetf)))

### Odabir sabredita od interesa
Sabrediti od interesa za analizu pomoću SNetT su uzeti iz teksta projektog zadatka.

In [7]:
targetSubreddits = {
    "reddit.com",
    "pics",
    "worldnews",
    "programming",
    "business",
    "politics",
    "obama",
    "science",
    "technology",
    "WTF",
    "AskReddit",
    "netsec",
    "philosophy",
    "videos",
    "offbeat",
    "funny",
    "entertainment",
    "linux",
    "geek",
    "gaming",
    "comics",
    "gadgets",
    "nsfw",
    "news",
    "environment",
    "atheism",
    "canada",
    "math",
    "Economics",
    "scifi",
    "bestof",
    "cogsci",
    "joel",
    "Health",
    "guns",
    "photography",
    "software",
    "history",
    "ideas",
}

targetSubredditIds = [allData[allData["subreddit"] == targetSubreddit]["subreddit_id"].unique()[0] for targetSubreddit in targetSubreddits]
snett = snet.subgraph(targetSubredditIds)

### UserNet
Modelujemo interakcije korisnika društvene mreže usmerenim grafom.

In [None]:
authorComments = pd.concat([filteredComments["author"], filteredComments["link_id"].map(lambda element: element.split("_")[1])], axis=1).rename(columns={"link_id": "id"})
authorToAuthorInteractions = groupby_count(authorComments.merge(allData[["author", "id"]], on="id", how="inner", suffixes=["_from", "_to"]), ["author_from", "author_to"])
edge_list = authorToAuthorInteractions.rename(columns={"author_from": "source", "author_to": "target", "counts": "weight"})
usernet = nx.from_pandas_edgelist(edge_list, edge_attr=True, create_using=nx.DiGraph)

Dobijene grafove čuvamo u standardnom `gml` formatu na disku kako bismo im mogli pristupati i iz eksternih alata.

In [None]:
nx.write_gml(snet, "snet.gml")
nx.write_gml(snetf, "snetf.gml")
nx.write_gml(snett, "snett.gml")
nx.write_gml(usernet, "usernet.gml")

### Generisanje Erdos-Renyi mreža
Erdos-Renyi mreže koristimo za poređenje sa mrežama reddit podataka.

Verovatnoća za stvaranje grana u grafu je odabrana kao gustina grafova koje ispitujemo.

In [None]:
erdos_renyi_snet = nx.erdos_renyi_graph(n=snet.number_of_nodes(), p=nx.density(snet))
erdos_renyi_snetf = nx.erdos_renyi_graph(n=snetf.number_of_nodes(), p=nx.density(snetf))
erdos_renyi_snett = nx.erdos_renyi_graph(n=snett.number_of_nodes(), p=nx.density(snett))
erdos_renyi_usernet = nx.erdos_renyi_graph(n=usernet.number_of_nodes(), p=nx.density(usernet), directed=True)

In [None]:
nx.write_gml(erdos_renyi_snet, "erdos_renyi_snet.gml")
nx.write_gml(erdos_renyi_snetf, "erdos_renyi_snetf.gml")
nx.write_gml(erdos_renyi_snett, "erdos_renyi_snett.gml")
nx.write_gml(erdos_renyi_usernet, "erdos_renyi_usernet.gml")

## Analiza modelovanih grafova
### Klub bogatih
Kako bi izvrsavanje funkcija trajalo kratko, koristi se nenormalizovana varijanta.
Kao sto se moze videti sa grafika dole, svi grafovi ispoljavaju klub bogatih (cvorovi su u klubu bogatih za vrednosti 1).

In [None]:
snet_rich_club_coefficient = nx.rich_club_coefficient(snet, normalized=False)
snetf_rich_club_coefficient = nx.rich_club_coefficient(snetf, normalized=False)
snett_rich_club_coefficient = nx.rich_club_coefficient(snett, normalized=False)
usernet_rich_club_coefficient = nx.rich_club_coefficient(usernet, normalized=False)

pd.DataFrame.from_dict(snet_rich_club_coefficient, orient="index").plot(title="SNet rich club coefficient", grid=True, legend=False)
pd.DataFrame.from_dict(snetf_rich_club_coefficient, orient="index").plot(title="SNetF rich club coefficient", grid=True, legend=False)
pd.DataFrame.from_dict(snett_rich_club_coefficient, orient="index").plot(title="SNetT rich club coefficient", grid=True, legend=False)
pd.DataFrame.from_dict(usernet_rich_club_coefficient, orient="index").plot(title="UserNet rich club coefficient", grid=True, legend=False)

### Asortativna analiza

In [8]:
// TODO: implement

TypeError: 'module' object is not callable