# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install some-package
%pip install networkx

Note: you may need to restart the kernel to use updated packages.


### Import Python packages

In [2]:
#e.g., import some-package
import networkx as nx
from scipy import stats
import pandas as pd

---

### Task 1 of 1

Examine the Graph Modelling Language (gml) files "socialmedia_cmt224_reply_network.gml" (reply network) and "socialmedia_cmt224_social_network.gml" (social network) which represent Twitter data between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users).  However, the shared user ids are contained within the "label" attribute in the .gml files, not the node "id" attribute of each individual .gml file.

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Edges are weighted with the weight representing the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.
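
The difference between the "id" and "label" attributes can be illustrated with a minimal sketch. The GML text below is a made-up two-node example (the user names `u1`, `u2` and the `occurrences` value are hypothetical, not taken from the real files); it only shows how the `label` argument changes the node keys NetworkX uses.

```python
import networkx as nx

# Hypothetical GML content: node ids are local to the file,
# while the "label" attribute carries the shared user id.
gml_text = """graph [
  directed 1
  node [
    id 0
    label "u1"
  ]
  node [
    id 1
    label "u2"
  ]
  edge [
    source 0
    target 1
    occurrences 2
  ]
]"""

# Key nodes by the shared "label" attribute (the default behaviour).
by_label = nx.parse_gml(gml_text.splitlines(), label="label")
# Key nodes by the file-local "id" attribute instead.
by_id = nx.parse_gml(gml_text.splitlines(), label="id")

print(sorted(by_label.nodes()))  # node keys are the labels
print(sorted(by_id.nodes()))     # node keys are the integer ids
print(by_label.is_directed())    # "directed 1" yields a directed graph
```

Keying both graphs by label is what makes it possible to look up the same user in either network.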

Using these networks, answer the following questions:

##### Q1. What fraction of users do not reply to or follow any other user, but have had others reply to their Tweets?

In [3]:
#CODE:
#Load the reply and social networks data
G_reply = nx.read_gml("socialmedia_cmt224_reply_network.gml", label = "label").to_directed()
G_social = nx.read_gml("socialmedia_cmt224_social_network.gml", label = "label").to_directed()

#create a list for those nodes without out degree in reply network => meaning they do not reply
who_donot_reply = [n for (n,out_) in G_reply.out_degree if out_ == 0 ]

#create a list for those nodes without out degree in social network => meaning they do not follow
who_donot_follow = [n for (n,out_) in G_social.out_degree if out_ == 0 ]

#combine both lists into a set to remove duplicate nodes => users who do not reply or do not follow
who_donot_reply_or_follow = set(who_donot_reply + who_donot_follow)

#create a list for those nodes with in degree > 0 in reply network => meaning they got reply from others
who_get_replied = [n for (n,in_) in G_reply.in_degree if in_ > 0 ]

#find the intersection nodes
those_had_others_replied = who_donot_reply_or_follow.intersection(who_get_replied)

#calculate the answer
answer1 = len(those_had_others_replied) / G_reply.number_of_nodes()
print(round(answer1,2))

0.37


ANSWER: The fraction is approximately 0.37 (37%).
- Explanation: 
    - Both networks were loaded as directed graphs because replying and following are directed actions. Nodes were keyed by the 'label' attribute so that the same user has the same identifier in both networks, which allows cross-analysis.
    - Two subsets of nodes were created based on their reply and follow behaviours. 
    - A set of nodes existing in both subsets was found using intersection.
- Justification:
    - In a directed network, the total degree of a node is the sum of its in-degree and out-degree. Using the total degree would be a worse method because it cannot separate outgoing actions (replying, following) from incoming ones (being replied to).
    - Therefore, it is more appropriate to consider:
        - Users who do not reply or follow are those with out-degree equal to 0 in both networks.
        - Users who have received replies from others are those with in-degree greater than 0 in the reply network.
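
The two conditions in the justification can be illustrated on a small example. The sketch below uses hypothetical toy graphs (`toy_reply`, `toy_social` and the user names are made up for illustration). It treats "out-degree equal to 0 in both networks" as a set intersection; for comparison, it also shows that a union would select users with zero out-degree in at least one network, which is a weaker condition.

```python
import networkx as nx

# Hypothetical toy data: four users, same node set in both networks.
users = ["a", "b", "c", "d"]

toy_reply = nx.DiGraph()
toy_reply.add_nodes_from(users)
# An edge (u, v) means u replied to v.
toy_reply.add_edges_from([("b", "a"), ("c", "a"), ("d", "b"), ("c", "d")])

toy_social = nx.DiGraph()
toy_social.add_nodes_from(users)
# An edge (u, v) means u follows v.
toy_social.add_edges_from([("b", "a"), ("c", "b")])

no_reply = {n for n, d in toy_reply.out_degree() if d == 0}    # never reply
no_follow = {n for n, d in toy_social.out_degree() if d == 0}  # never follow
got_reply = {n for n, d in toy_reply.in_degree() if d > 0}     # were replied to

# Zero out-degree in BOTH networks: neither replies nor follows.
neither = no_reply & no_follow
# Zero out-degree in AT LEAST ONE network.
either = no_reply | no_follow

print(sorted(neither & got_reply))  # ['a']
print(sorted(either & got_reply))   # ['a', 'd']
```

User `d` follows nobody but does reply, so `d` is selected only by the union.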

##### Q2. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [4]:
#CODE: 
#Calculate overall Sparsity of edges between users
density_reply = round(nx.density(G_reply),3)
density_social = round(nx.density(G_social),3)

print("The density metrics for reply and social network are {0} and {1} respectively."\
      .format(density_reply,density_social))

#Calculate the number of weakly/strongly connected groups of users for reply network
no_weakly_connected_reply = nx.number_weakly_connected_components(G_reply)
no_strongly_connected_reply = nx.number_strongly_connected_components(G_reply)

print("The number of weakly and strongly connected components for *reply network* are {0} and {1} respectively."\
      .format(no_weakly_connected_reply,no_strongly_connected_reply))

#Calculate the number of weakly/strongly connected groups of users for social network
no_weakly_connected_social = nx.number_weakly_connected_components(G_social)
no_strongly_connected_social = nx.number_strongly_connected_components(G_social)

print("The number of weakly and strongly connected components for *social network* are {0} and {1} respectively."\
      .format(no_weakly_connected_social,no_strongly_connected_social))

The density metrics for reply and social network are 0.0 and 0.001 respectively.
The number of weakly and strongly connected components for *reply network* are 5920 and 16217 respectively.
The number of weakly and strongly connected components for *social network* are 436 and 4648 respectively.


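ANSWER: The reply network is sparser than the social network according to the density metric (0.0 versus 0.001 when rounded to three decimal places). It is also split into many more groups: it has more weakly connected components (5920 versus 436) and more strongly connected components (16217 versus 4648) than the social network.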
- Explanation:
    - Density and the numbers of weakly and strongly connected components were calculated for both networks to compare their topological structure.
- Justification:
    - Density is the fraction of possible edges that are actually present, so it indicates how dense or sparse a network is and allows comparison between networks.
    - A weakly connected component is a group of nodes that can all reach each other when edge direction is ignored; a strongly connected component is a group of nodes that can all reach each other along directed edges.
    - One may choose to calculate only weakly connected components; however, considering both perspectives gives a fuller picture of the topological structure.
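
As a rough illustration of these three measures, the sketch below builds a hypothetical five-node directed graph (`toy` is made up for this example) and compares the density reported by NetworkX with the directed-graph formula m / (n(n - 1)), where m is the number of edges and n the number of nodes.

```python
import networkx as nx

# Hypothetical toy graph: 1 and 2 point at each other, 2 points at 3,
# and 4 points at 5 in a separate group.
toy = nx.DiGraph()
toy.add_edges_from([(1, 2), (2, 1), (2, 3), (4, 5)])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Density of a directed graph: edges present / ordered pairs possible.
manual_density = m / (n * (n - 1))
print(nx.density(toy), manual_density)

# Ignoring direction: {1, 2, 3} and {4, 5}.
weak = nx.number_weakly_connected_components(toy)
# Following direction: only {1, 2} is mutually reachable; 3, 4, 5 stand alone.
strong = nx.number_strongly_connected_components(toy)
print(weak, strong)  # 2 4
```

The number of strongly connected components can never be smaller than the number of weakly connected components, which matches the pattern seen in both networks above.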

##### Q3. Does the number of users a user follows in the social network correlate with the number of replies that they make?

In [5]:
#CODE:
#Create node Order
nodeOrder = list(G_reply.nodes())

#retrieve the out-strength (sum of outgoing edge weights) in the reply network => the number of replies made
reply_out_strength = [out_ for (n,out_) in G_reply.out_degree(nodeOrder,weight='occurrences')]

#retrieve the out degree in social network => the number of follows
follow_out_degree = [out_ for (n,out_) in G_social.out_degree(nodeOrder)]

#calculate the pearson r 
r, p = stats.pearsonr(reply_out_strength, follow_out_degree)

#print the result
print("The r-value is {0}, the p-value is {1}.".format(round(r, 3), round(p, 3)))

The r-value is 0.07, the p-value is 0.0.


ANSWER: The r-value is 0.07, indicating a very weak positive linear correlation between the two variables. The p-value rounds to 0.0 at three decimal places, meaning the result is very likely statistically significant.
- Explanation:
    - The order of nodes was preserved by using a list of nodes. 
    - Two variables, namely reply strength and follow degree, were extracted from networks.
    - Pearson correlation coefficient was calculated.
- Justification: 
    - The number of replies a user made is their out-strength (the sum of the weights on their outgoing edges) in the reply network. Using out-degree alone would ignore the edge weights and lead to a wrong answer.
    - The number of users a user follows is their out-degree in the social network.
    - The Pearson correlation coefficient is used due to its simplicity; it measures only linear association.
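
The difference between out-degree and out-strength can be seen on a small example. This is a minimal sketch on a hypothetical weighted graph (`toy_reply` and its weights are made up); it assumes, as in the code above, that the edge weight is stored under the `occurrences` attribute.

```python
import networkx as nx

# Hypothetical toy reply network: the weight is the number of replies.
toy_reply = nx.DiGraph()
toy_reply.add_edge("a", "b", occurrences=3)
toy_reply.add_edge("a", "c", occurrences=1)
toy_reply.add_edge("b", "c", occurrences=2)

# Out-degree counts distinct users replied to.
degree = dict(toy_reply.out_degree())
# Out-strength sums the weights, i.e. counts the replies themselves.
strength = dict(toy_reply.out_degree(weight="occurrences"))

print(degree)    # {'a': 2, 'b': 1, 'c': 0}
print(strength)  # {'a': 4, 'b': 2, 'c': 0}
```

User `a` replied to two users but made four replies in total, so only the weighted version answers "how many replies did they make".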

##### Q4. Is a user that replies to another user's Tweet multiple times more likely to follow that user in comparison to if they only replied once?

In [6]:
#CODE:

#Extract source and target pairs where the number of replies from source to target is greater than 1
multiple_reply = set([(source,target) for source,target,data in G_reply.edges(data=True) if sum(data.values())>1])

#Extract source and target pairs where the number of replies from source to target is equal to 1
reply_once = set([(source,target) for source,target,data in  G_reply.edges(data=True) if sum(data.values())==1])

#Extract the follow relationships as a set of (source, target) pairs
g_social_edges = set(G_social.edges())

#Multiple reply and also follow that user
multiple_reply_and_follow = multiple_reply.intersection(g_social_edges)

#Reply once and also follow that user
reply_once_and_follow = reply_once.intersection(g_social_edges)

print("Percentage of user who made multiple replies and followed:",round(len(multiple_reply_and_follow)/len(multiple_reply),2))
print("Percentage of user who made reply once and followed:",round(len(reply_once_and_follow)/len(reply_once),2))

Percentage of user who made multiple replies and followed: 0.88
Percentage of user who made reply once and followed: 0.84


ANSWER: The difference of 0.04 (0.88 versus 0.84) indicates that a user is slightly more likely to follow another user they have replied to multiple times than one they have replied to only once.

- Explanation: 
    - Source and target pairs were split into two subsets according to the edge weight (more than one reply, or exactly one reply).
    - For each subset, its intersection with the edges of the social network was taken to represent pairs where the replying user also follows the target.
    - The proportion of such pairs was calculated for each subset to represent the reply-follow likelihood.
- Justification:
    - The edge weight records how many times the source replied to the target, as opposed to the edge itself, which only shows that at least one reply occurred:
        - Users who replied to another user multiple times are the sources of edges with weight greater than 1.
        - Users who replied to another user only once are the sources of edges with weight equal to 1.
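
The steps above can be sketched on a toy example. The graphs and values below are hypothetical, and the sketch reads the weight from a named `occurrences` attribute (an assumption based on the attribute used in Q3) rather than summing all edge attribute values.

```python
import networkx as nx

# Hypothetical toy reply network: weight = number of replies from u to v.
toy_reply = nx.DiGraph()
toy_reply.add_edge("a", "b", occurrences=3)
toy_reply.add_edge("a", "c", occurrences=1)
toy_reply.add_edge("b", "c", occurrences=1)
toy_reply.add_edge("c", "a", occurrences=2)

# Hypothetical toy social network: an edge (u, v) means u follows v.
toy_social = nx.DiGraph()
toy_social.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("c", "a")])

# Split the reply pairs by edge weight.
multiple = {(u, v) for u, v, w in toy_reply.edges(data="occurrences") if w > 1}
once = {(u, v) for u, v, w in toy_reply.edges(data="occurrences") if w == 1}

# Pairs where the replying user also follows the target.
follows = set(toy_social.edges())
rate_multiple = len(multiple & follows) / len(multiple)
rate_once = len(once & follows) / len(once)

print(rate_multiple, rate_once)  # 1.0 0.5
```

Because the comparison is made on ordered pairs, an edge in the opposite direction in the social network does not count as a follow.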

##### Q5. How many users have only mutual following connections (i.e., every user they follow also follows them) and only mutual reply connections with these same users?

In [7]:
#CODE:
#Find users who have only mutual following connections => every user they follow also follows them
only_mutual_following = { key for (key,value) in nx.reciprocity(G_social,G_social.nodes()).items() if value == 1}

#Find users who have only mutual reply connections => every user they reply to also replies to them
only_mutual_reply = { key for (key,value) in nx.reciprocity(G_reply,G_reply.nodes()).items() if value == 1}

#Find the pairs for only mutual following
pairs_mutual_following = G_social.edges(only_mutual_following)

#Find the pairs for only mutual reply
pairs_mutual_reply = G_reply.edges(only_mutual_reply)

#Find the intersection
result_set = set([(u,v) for (u,v) in pairs_mutual_following]).intersection(set([(u,v) for (u,v) in pairs_mutual_reply]))

#Get the unique answer by using len on set
answer_q5 = len(set([u for u,v in result_set]))

print(answer_q5)

56


ANSWER: 56 users have only mutual following connections and only mutual reply connections with those same users.
- Explanation:
    - Nodes with only mutual following/reply connections were extracted based on their local reciprocity value.
    - Then, the edges starting from these nodes were collected in each network.
    - The intersection between the only-mutual-following edges and the only-mutual-reply edges was found, and the unique source users were counted.
- Justification:
    - Local reciprocity measures how many of a node's connections are mutual, which is a better choice than in/out-degree: degree counts alone do not show whether the incoming and outgoing edges involve the same users.
    - For a given node (user):
        - Having only mutual following connections means its reciprocity is equal to 1 in the social network.
        - Having only mutual reply connections means its reciprocity is also equal to 1 in the reply network.
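
To make the reciprocity condition concrete, the sketch below computes per-node reciprocity on a hypothetical three-user graph (`toy` is made up for illustration). It also shows, as an aside, that a reciprocity of 1 requires every connection to be mutual, incoming ones included, so it is a stricter test than checking only the users a node follows.

```python
import networkx as nx

# Hypothetical toy social network: a and b follow each other, c follows a.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("b", "a"), ("c", "a")])

# Per-node reciprocity: 2 * |mutual neighbours| / (in-degree + out-degree).
rec = nx.reciprocity(toy, list(toy.nodes()))
print(rec)

# Nodes whose connections are all mutual.
only_mutual = {n for n, r in rec.items() if r == 1}
print(sorted(only_mutual))  # ['b']

# Looser reading: every user a node follows also follows it back,
# regardless of one-way incoming followers.
follows_back = {
    u for u in toy
    if all(toy.has_edge(v, u) for v in toy.successors(u))
}
print(sorted(follows_back))  # ['a', 'b']
```

User `a` is followed back by everyone they follow, but the one-way follow from `c` lowers their reciprocity below 1; which reading fits the question is a judgment call.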