# Shawn Cicoria - MiniProject 2

## Task 1


# Introduction
For mini-project 2, I've initially selected the following three datasets:

1. Social circles from Facebook - https://snap.stanford.edu/data/ego-Facebook.html
2. Social network of GitHub developers - https://snap.stanford.edu/data/github-social.html
3. Bitcoin OTC web of trust network - https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html


## Social circles from Facebook
This data set contains both raw and processed data. The raw data used here is in `facebook_combined.txt`, which has `88,234` rows. Each row lists two connected nodes, so each row represents one edge.

Each node is identified by an integer in the range `0-4038`.

### Example data
A sample of rows is shown below. The data set is tagged as undirected.

```csv
0 1
0 2
0 3
0 4
0 5
```

### Loading data as edges

```python
edges = import_data('./data/facebook_combined.txt', skip_header=False, sep=' ', cols=2)
print(len(edges)) # result is 88234
```
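
As a quick sanity check on the node count and ID range quoted above, the distinct node IDs can be collected from the edge list. This is only a sketch: the helper name `node_id_summary` is hypothetical (it is not part of the project code), and the sample rows are the five example rows shown earlier rather than the full file.

```python
def node_id_summary(edges):
    """Return (node_count, min_id, max_id) for a list of [from, to] rows."""
    # IDs may arrive as strings (as from import_data), so convert first.
    ids = {int(node) for row in edges for node in row[:2]}
    return len(ids), min(ids), max(ids)


# Hypothetical sample: the five example rows, not the full data set.
sample_edges = [['0', '1'], ['0', '2'], ['0', '3'], ['0', '4'], ['0', '5']]
print(node_id_summary(sample_edges))  # (6, 0, 5)
```

Run against the full edge list, the first value should match the node count in the data summary if the IDs are contiguous.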

## Social network of GitHub developers
This data set contains both raw and processed data. The raw data used here is in `musae_git_edges.csv`, which has `289,003` rows of data. As before, each row lists two connected nodes and represents one edge.

Each node is identified by an integer in the range `0-37,699`.

### Example data
A sample of rows is shown below; the first row is a header. The data set is tagged as undirected.

```csv
id_1,id_2
0,23977
1,34526
1,2370
1,14683
```

### Loading data as edges

```python
# importing GitHub graph data
edges_github = import_data('./data/musae_git_edges.csv', skip_header=True, sep=',', cols=2)
print(len(edges_github))  # 289003
```


## Bitcoin OTC web of trust network
This data is a directed graph with weights. Weights can be positive or negative (signed).

Data is contained in `soc-sign-bitcoinotc.csv`. The file has no header row and four values per row. Each line holds one rating, and the lines are sorted by time, in the following format:

```
SOURCE, TARGET, RATING, TIME
```

- `SOURCE`: node id of source, i.e., rater
- `TARGET`: node id of target, i.e., ratee
- `RATING`: the source's rating for the target, ranging from -10 to +10 in steps of 1
- `TIME`: the time of the rating, measured as seconds since Epoch


Thus, our edges are the first two columns, with a direction and weight (rating).
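
To make the row layout concrete, here is a minimal sketch that parses one line into typed fields. The `Rating` tuple and the `parse_rating` function are illustrative names, not part of the project code; the sample line is the first row of the example data for this data set.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    source: int   # rater node id
    target: int   # ratee node id
    rating: int   # signed trust score, -10 to +10
    time: float   # seconds since Epoch


def parse_rating(line: str) -> Rating:
    """Split one comma-separated row and convert each field."""
    source, target, rating, time = line.strip().split(',')
    return Rating(int(source), int(target), int(rating), float(time))


r = parse_rating('6,2,4,1289241911.72836')
print(r.source, r.target, r.rating)  # 6 2 4
```

The directed edge is `(r.source, r.target)` and `r.rating` is its weight.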

Each node is identified by an integer in the range `0-5880`.

### Example of data

```csv
6,2,4,1289241911.72836
6,5,2,1289241941.53378
1,15,1,1289243140.39049
4,3,7,1289245277.36975
13,16,8,1289254254.44746
```

### Loading data as edges

```python
# import btc data
edges_btc = import_data('./data/soc-sign-bitcoinotc.csv', skip_header=False, sep=',', cols=3)
print(len(edges_btc))  # 35592
```

## Data Import function

```python
# importing graph data
def import_data(filename, skip_header = False, sep = ',', cols = 2):
    rv = []
    with open(filename, 'rt') as f:
        for line in f:
            if skip_header:
                skip_header = False
                continue
            # keep only the first `cols` fields of each row
            rv.append(line.strip().split(sep)[0:cols])
            assert len(rv[-1]) == cols

    return rv

```


# Data Summary

 Name | Directed | Nodes | Edges
---|---|---|---
 Social circles: Facebook | No | 4,039 | 88,234
 GitHub Social Network | No | 37,700 | 289,003
 Bitcoin OTC trust weighted signed network | Yes | 5,881 | 35,592


## Data descriptions


 Name | Description
---|---
 Bitcoin OTC trust weighted signed network | This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users' reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members on a scale of -10 (total distrust) to +10 (total trust) in steps of 1. This is the first explicit weighted signed directed network available for research.
 Social circles: Facebook | This dataset consists of 'circles' (or 'friends lists') from Facebook. Facebook data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks.
 GitHub Social Network | A large social network of GitHub developers which was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.




## Task 2 - Task 6 Python Code:

In [12]:
import numpy as np
import unittest

class connected_helper:
    def __init__(self, G: list = None):
        # Always define self.G so later `self.G is None` checks work.
        self.G = G
        self.max_id = 0

    # Task 2
    def import_data(self, filename,
                    skip_header = False,
                    sep = ',',
                    cols = 2) -> list:
        rv = []
        with open(filename, 'rt') as f:
            for line in f:
                if skip_header:
                    skip_header = False
                    continue
                row = line.strip().split(sep)[0:cols]
                self.max_id = max(self.max_id, int(row[0]), int(row[1]))  # needed later.
                row[0] = int(row[0])
                row[1] = int(row[1])
                rv.append(row)
                assert len(rv[len(rv) - 1]) == cols

        self.G = rv
        return rv

    # Task 3 implementation
    def node_freq(self, directed=False):
        '''faster implementation'''
        rv = {}
        for r in self.G:
            if not directed:
                if not r[0] in rv:
                    rv[r[0]] = 1
                else:
                    rv[r[0]] += 1
                if not r[1] in rv:
                    rv[r[1]] = 1
                else:
                    rv[r[1]] += 1
            else:
                if not r[1] in rv:
                    rv[r[1]] = 1
                else:
                    rv[r[1]] += 1

        #  return rv
        return list(rv.items())  # list of tuples

    # Task 3 - alternative NumPy-based approach
    def node_frequency(self, directed=False):
        '''NumPy-based attempt; this is NOT used.'''
        import numpy as np
        if not directed:  # undirected
            #  this flips the orig and concatenates it again
            a_all = np.concatenate((self.G, np.flip(self.G, axis=1)))
        else:
            a_all = self.G

        unique, counts = np.unique(np.array(a_all)[:, 1], return_counts=True)
        frequencies = np.asarray((unique, counts)).T

        return frequencies

    # Task 3
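    def node_top(self, count=100):
        # node_freq is called with directed=True, so only appearances in the
        # second (target) column are counted. The slice is fixed at 100;
        # `count` is not applied.
        return sorted(
            self.node_freq(directed=True),
            key = lambda x: x[1], reverse=True)[0:100]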

    # Task 4 implementation.
    def get_connected_counts(self) -> list:
        '''this just has to count each time a node
        appears. This doesn't have to report connected
        component structure...'''
        # visited = [False] * self.max_id
        cc = [0] * (self.max_id + 1)

        for edge in self.G:  # this is an edge (from,to)
            for node in edge:  # each node
                cc[int(node)] += 1

        self.connected_counts_alt = cc
        return cc

    # Task 4 implementation -- proper one
    #  returns an array where offset is the ID and each
    # element is the connected nodes as a sublist.
    def get_connected_nodes(self, directed: bool = None, graph: list = None) -> list:
        if graph is None:
            graph = self.G
        if directed is None:
            directed = False

        cc = [None] * (self.max_id + 1)
        visited = [False] * (self.max_id + 1)
        counts = [0] * (self.max_id + 1)

        for edge in graph:  # edge of (from,to)
            node_from = edge[0]
            node_to = edge[1]

            if not visited[int(node_from)]:
                cc[int(node_from)] = [node_to]
                visited[int(node_from)] = True
                #  counts[int(node_from)] = 1
            else:
                cc[int(node_from)].append(node_to)

            counts[int(node_from)] += 1

            if not directed:
                if not visited[int(node_to)]:
                    cc[int(node_to)] = [node_from]
                    visited[int(node_to)] = True
                    #  counts[int(node_to)] = 1
                else:
                    cc[int(node_to)].append(node_from)

                counts[int(node_to)] += 1

        self.connected_nodes = cc
        self.connected_counts = counts
        return cc

    def get_largest_node_degree(self):
        if getattr(self, 'G', None) is None:
            raise ValueError('Must import data first')

        _ = self.get_connected_nodes()
        max_index = self.connected_counts.index(max(self.connected_counts))

        return max_index, self.connected_nodes[max_index]

    #  def dfs_util(self, node, visited):
    #     visited[int(node)

    # Task 4 alternative method but an adjacency list as dict.
    def get_connected_counts_alt(self):
        '''alternative implementation'''
        from collections import defaultdict
        # adj_list = defaultdict(lambda: defaultdict(lambda: 0))
        # adj_list = defaultdict(lambda: defaultdict(int))

        # this alleviates need to pre-alloc an array of n items
        # and discovering the ID's of all the unique nodes.
        mysum = defaultdict(int)
        for start, end in self.G:
            # adj_list[start][end] += 1
            mysum[start] += 1
            mysum[end] += 1

        self.connected_counts_alt2 = mysum
        return mysum

    # Task 5 parts
    def get_reversed_graph(self, graph=None):
        if graph is None:
            graph = self.G

        self.R = [None] * len(graph)
        for i, edge in enumerate(graph):  # edge of (from,to)
            node_from = edge[0]
            node_to = edge[1]
            self.R[i] = [node_to, node_from]

        return self.R

    def get_scc(self, G=None):
        # if a graph is already imported it is in the
        # [[1,2], [2,3]] format and needs to be connected component instead.
        if G is None:
            G = self.get_connected_nodes(directed=True, graph=self.G)
        else:
            G = self.get_connected_nodes(directed=True, graph=G)

        # this has to be iterative; a recursive DFS overflows the stack on datasets with more than 500 edges.
        n = len(G)
        transposed = [[None]] * n
        order_w = []
        visited = [False] * n
        scc = [None] * n

        # first DFS: build the transposed graph and record nodes in finish order
        for u in range(n):
            if not visited[u]:
                visited[u] = True
                stack = [u]

                while len(stack) > 0:
                    u = stack[-1]  # peek at last item.
                    done = True
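                    # A node with no outgoing edges has G[u] == None. The break
                    # below abandons the current stack, so such nodes (and any
                    # ancestors still on the stack) can end up as None in scc.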
                    if G[u] is None:
                        break

                    for v in G[u]:
                        if transposed[v][0] is None:
                            transposed[v] = [u]
                        else:
                            transposed[v].append(u)
                        if not visited[v]:
                            visited[v] = True
                            done = False
                            stack.append(v)
                            break
                    if done:
                        stack.pop()
                        order_w.append(u)

        # second DFS on transposed to build the scc array
        while len(order_w) > 0:
            r = order_w.pop()
            stack = [r]
            if visited[r]:
                visited[r] = False
                scc[r] = r
            while len(stack) > 0:
                u = stack[-1]
                done = True
                if transposed[u][0] is not None:
                    for v in transposed[u]:
                        if visited[v]:
                            done = False
                            visited[v] = False
                            stack.append(v)
                            scc[v] = r
                            break
                if done:
                    stack.pop()

        return scc

    # Task 6 - this i've taken from my HW5 submission.
    def get_adj_matrix(self, data=None, directed=None):
        if data is None:
            data = self.G
        if directed is None:
            directed = False

        data = np.array(data)
        #  get the node list
        nodes = np.unique(data)
        n = len(nodes)
        # create a dict
        node_dict = {n: i for i, n in enumerate(nodes)}

        # inverted to vector
        numdata = np.vectorize(node_dict.get)(data)
        am = np.zeros((n, n),)
        for j, i in numdata:
            am[j, i] = 1
            if not directed:
                am[i, j] = 1

        return am.astype(int)

    # Task 6
    def get_number_paths(self, adj_matrix, source, target, k):
        # give up if k is exhausted (zero or negative).
        if (k == 0 and source == target):
            return 1
        if (k <= 0):
            return 0

        n = len(adj_matrix)
        steps = 0
        # traverse the adj matrix
        for i in range(n):
            if (adj_matrix[source][i] == 1):  # have a connection.
                # deduct step and recursive call; add to current steps.
                steps += self.get_number_paths(adj_matrix, i, target, k - 1)

        return steps


class test_one(unittest.TestCase):
    def setUp(self) -> None:
        pass

    def test_steps(self):
        arr = [[0, 1], [1, 0], [1, 2], [2, 0], [2, 3], [2, 4], [3, 4], [4, 5], [5, 6], [6, 4], [7, 6]]
        s = connected_helper(G=arr)
        s.max_id = 7
        adj = s.get_adj_matrix(directed=True)

        act = s.get_number_paths(adj, 0, 1, 1)
        self.assertEqual(1, act, 'path is 1')

        act = s.get_number_paths(adj, 3, 4, 1)
        self.assertEqual(1, act, 'path is 1')

        act = s.get_number_paths(adj, 2, 4, 2)
        self.assertEqual(1, act, 'path is 1')

        act = s.get_number_paths(adj, 4, 5, 3)
        self.assertEqual(0, act, 'expected no 3-step path')

    def test_steps_two(self):
        arr = [[0, 1], [0, 3], [0, 2], [1, 3], [2, 3]]
        s = connected_helper(G=arr)
        s.max_id = 3
        adj = s.get_adj_matrix(directed=True)
        act = s.get_number_paths(adj, 0, 3, 2)
        self.assertEqual(2, act, 'expected two 2-step paths')

    def test_adj_matrix(self):
        arr = [[0, 1], [1, 0], [1, 2], [2, 0], [2, 3], [2, 4], [3, 4], [4, 5], [5, 6], [6, 4], [7, 6]]
        s = connected_helper(G=arr)
        s.max_id = 7

        act = s.get_adj_matrix(directed=True)
        self.assertIsNotNone(act)

    def test_dfs_one(self):
        # arr = [[1, 2], [2, 3], [2, 5], [3, 4], [4, 6], [5, 1], [6, 3]]
        arr = [[0, 1], [1, 0], [1, 2], [2, 0], [2, 3], [2, 4], [3, 4], [4, 5], [5, 6], [6, 4], [7, 6]]
        #  cn_exp = [[1], [0, 2], [0, 3, 4], [4], [5], [6], [4], [6]]
        s = connected_helper(G=arr)
        s.max_id = 7

        act = s.get_scc(arr)
        exp = [0, 0, 0, 3, 4, 4, 4, 7]
        self.assertIsNotNone(act)
        self.assertListEqual(act, exp, 'alg done...')

    def test_reverse(self):
        arr = [['0', '2'], ['0', '3'], ['1', '1']]
        exp = [['2', '0'], ['3', '0'], ['1', '1']]
        s = connected_helper(G=arr)
        s.max_id = 3
        act = s.get_reversed_graph()
        self.assertListEqual(exp, act, 'reverse to orig ok')

    def test_one(self):
        arr = [['0', '2'], ['0', '3'], ['1', '1']]
        s = connected_helper(G=arr)
        s.max_id = 3
        cn = s.get_connected_nodes()

        self.assertIsNotNone(cn)

    def test_with_fb(self):
        s_fb = connected_helper()
        s_fb.import_data('./data/facebook_combined.txt', skip_header=False, sep=' ', cols=2)
        cn = s_fb.get_connected_nodes()
        cn_2 = s_fb.get_connected_counts()
        self.assertIsNotNone(cn)
        self.assertListEqual(cn_2, s_fb.connected_counts, 'two connected counts')

    # @unittest.skip('data not present')
    def test_largest_node(self):
        s_fb = connected_helper()
        s_fb.import_data('./data/facebook_combined.txt', skip_header=False, sep=' ', cols=2)
        index, con_nodes = s_fb.get_largest_node_degree()

        print('largest node has {}'.format(len(con_nodes)))

        self.assertIsNotNone(con_nodes)
        self.assertEqual(1045, len(con_nodes), 'length of connections')
        self.assertEqual(107, index, 'index of largest node')
        self.assertEqual(s_fb.connected_counts[107], len(con_nodes))


if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
    # unittest.main(verbosity=1)

.......

largest node has 1045


.
----------------------------------------------------------------------
Ran 8 tests in 0.763s

OK




# Task 2

In [13]:
s_fb = connected_helper()
edges_facebook = s_fb.import_data('./data/facebook_combined.txt', skip_header=False, sep=' ', cols=2)
print(len(edges_facebook))  # result is 88234

# importing GitHub graph data
s_gh = connected_helper()
edges_github = s_gh.import_data('./data/musae_git_edges.csv', skip_header=True, sep=',', cols=2)
print(len(edges_github))

# import btc data
s_btc = connected_helper()
edges_btc = s_btc.import_data('./data/soc-sign-bitcoinotc.csv', skip_header=False, sep=',', cols=3)
print(len(edges_btc))




88234
289003
35592


# Task 3

In [14]:
# The following gives the top 100
fb_100 = s_fb.node_top()
gh_100 = s_gh.node_top()
btc_100 = s_btc.node_top()

In [15]:
## just show the top 10 of the top 100

## output is (ID, count)
print(fb_100[0:10])
print(gh_100[0:10])
print(btc_100[0:10])

[(1888, 251), (2543, 246), (1800, 216), (2611, 197), (1827, 186), (1730, 183), (2607, 183), (1833, 182), (2602, 182), (2604, 182)]
[(31890, 7470), (35773, 2401), (36652, 2285), (18163, 1858), (19222, 1499), (36628, 1477), (35008, 1472), (3712, 884), (13638, 858), (30002, 819)]
[(35, 535), (2642, 412), (1810, 311), (2028, 279), (905, 264), (1, 226), (4172, 222), (7, 216), (4197, 203), (13, 191)]
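
Counts like these can also be produced with `collections.Counter`. In an undirected graph a node's degree counts every edge it appears in, on either side, while a directed in-degree counts only appearances as the target. The sketch below shows both; `top_degree_nodes` is a hypothetical helper and the edge list is a toy example, not one of the three data sets.

```python
from collections import Counter


def top_degree_nodes(edges, count=100, directed=False):
    """Return up to `count` (node, degree) pairs, highest degree first."""
    degree = Counter()
    for row in edges:
        node_from, node_to = row[0], row[1]
        degree[node_to] += 1          # target endpoint (in-degree)
        if not directed:
            degree[node_from] += 1    # undirected: count the source as well
    return degree.most_common(count)


# Hypothetical toy graph for illustration only.
toy_edges = [[0, 1], [0, 2], [0, 3], [1, 2]]
print(top_degree_nodes(toy_edges, count=1))                 # [(0, 3)]
print(top_degree_nodes(toy_edges, count=1, directed=True))  # [(2, 2)]
```

The two calls rank different nodes first, which is why the choice of `directed` matters when reading a top-100 list.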




# Task 4

In [16]:
arr = [['0', '2'], ['0', '3'], ['1', '1']]
s = connected_helper(G=arr)
s.max_id = 3
cn = s.get_connected_nodes()
print('small dataset result')
cn

small dataset result


[['2', '3'], ['1', '1'], ['0'], ['0']]

In [17]:
## node ID 2 is an example of a node with few connections
s_fb.get_connected_nodes()[2]

[0, 20, 115, 116, 149, 226, 312, 326, 333, 343]

In [18]:
# s_fb.get_connected_nodes()[0:3]

## Task 4 continued.


In [19]:

index, con_nodes = s_fb.get_largest_node_degree()
print('index ID of largest FB: {} with {} total nodes'.format(index, len(con_nodes)))

index ID of largest FB: 107 with 1045 total nodes
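
The same adjacency-list idea can be sketched without pre-allocating a list sized by the largest node ID. This is one possible variant, not the approach used above: `build_adjacency` and `largest_degree_node` are hypothetical names, and the edges are a toy example.

```python
from collections import defaultdict


def build_adjacency(edges, directed=False):
    """Map each node to the list of nodes it is connected to."""
    adjacency = defaultdict(list)
    for node_from, node_to in edges:
        adjacency[node_from].append(node_to)
        if not directed:
            adjacency[node_to].append(node_from)
    return adjacency


def largest_degree_node(adjacency):
    """Return (node, neighbours) for the node with the most connections."""
    node = max(adjacency, key=lambda n: len(adjacency[n]))
    return node, adjacency[node]


# Hypothetical toy graph for illustration only.
toy_edges = [[0, 2], [0, 3], [1, 2], [2, 3]]
node, neighbours = largest_degree_node(build_adjacency(toy_edges))
print(node, neighbours)  # 2 [0, 1, 3]
```

A dictionary keyed by node ID also works when IDs are sparse or are not integers.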


# Task 5

In [53]:
import collections
import numpy as np

rv = s_gh.get_scc()
nnn = np.array(rv)

print('SCC labels of nodes that were assigned a component:\n')
print(nnn[nnn != None])
# the ten most common SCC labels (None marks unassigned nodes)
collections.Counter(rv).most_common(n=10)


SCC labels of nodes that were assigned a component:

[18335 18335 36562 ... 37691 28140 37694]


[(None, 10095),
 (18335, 190),
 (23433, 80),
 (27418, 70),
 (36955, 69),
 (2433, 61),
 (35970, 59),
 (26989, 57),
 (29665, 43),
 (33416, 39)]
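
For reference, the two-pass idea (a DFS that records finish order, then a DFS over the reversed graph) can be sketched as a standalone function. This is a generic Kosaraju-style sketch, not the method used to produce the output above: `strongly_connected_components` is a hypothetical name, nodes without outgoing edges are treated as having an empty neighbour list, and the edge list is the small graph from the unit tests.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Label each node with a representative node of its SCC."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: iterative DFS on the graph, recording nodes in finish order.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # Neighbours exhausted (also true for nodes with no out-edges).
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    label = {}
    for root in reversed(order):
        if root in label:
            continue
        label[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in label:
                    label[nxt] = root
                    stack.append(nxt)
    return label


test_edges = [[0, 1], [1, 0], [1, 2], [2, 0], [2, 3], [2, 4],
              [3, 4], [4, 5], [5, 6], [6, 4], [7, 6]]
labels = strongly_connected_components(test_edges)
print([labels[n] for n in range(8)])  # [0, 0, 0, 3, 4, 4, 4, 7]
```

Every node that appears in an edge receives a label here, so no entry is left as `None`.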

# Task 6

In [56]:
## Using a sample edge list; the path count works on its adjacency matrix.

arr = [[0, 1], [0, 3], [0, 2], [1, 3], [2, 3]]
s = connected_helper(G=arr)
s.max_id = 3
adj = s.get_adj_matrix(directed=True)
act = s.get_number_paths(adj, 0, 3, 2)

print(act)

2
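
This count can be cross-checked with linear algebra: entry (i, j) of the k-th power of the adjacency matrix is the number of k-step walks from i to j. The sketch below builds the directed matrix by hand for the same sample edges, assuming node IDs are already 0..n-1, and uses `numpy.linalg.matrix_power`. `count_walks` is a hypothetical helper name.

```python
import numpy as np


def count_walks(adj_matrix, source, target, k):
    """Number of walks with exactly k edges from source to target."""
    return int(np.linalg.matrix_power(adj_matrix, k)[source, target])


edges = [[0, 1], [0, 3], [0, 2], [1, 3], [2, 3]]
adj = np.zeros((4, 4), dtype=int)
for node_from, node_to in edges:
    adj[node_from, node_to] = 1   # directed: one entry per edge

print(count_walks(adj, 0, 3, 2))  # 2
```

The two walks are 0 → 1 → 3 and 0 → 2 → 3, matching the recursive result.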


In [57]:
print(s.get_adj_matrix())

[[0 1 1 1]
 [1 0 0 1]
 [1 0 0 1]
 [1 1 1 0]]


In [59]:
## for facebook it's quite large...
adj = s_fb.get_adj_matrix(directed=True)
adj

array([[0, 1, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [61]:
# total number of IDs in the FB adjacency matrix
len(adj)

4039
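
A dense matrix of this size holds 4039 x 4039 entries, most of them zero, and a recursive path count revisits the same sub-problems many times as k grows. One possible alternative, sketched here under the assumptions that edges are `[from, to]` pairs and that no edge is listed twice, keeps only adjacency lists and advances a table of walk counts one step at a time. `count_walks_sparse` is a hypothetical name, not part of the project code.

```python
from collections import defaultdict


def count_walks_sparse(edges, source, target, k, directed=True):
    """Count walks with exactly k edges from source to target."""
    neighbours = defaultdict(list)
    for node_from, node_to in edges:
        neighbours[node_from].append(node_to)
        if not directed:
            neighbours[node_to].append(node_from)

    # walks[n] = number of walks of the current length from source to n.
    walks = {source: 1}
    for _ in range(k):
        next_walks = defaultdict(int)
        for node, ways in walks.items():
            for nxt in neighbours.get(node, ()):
                next_walks[nxt] += ways
        walks = next_walks
    return walks.get(target, 0)


sample_edges = [[0, 1], [0, 3], [0, 2], [1, 3], [2, 3]]
print(count_walks_sparse(sample_edges, 0, 3, 2))  # 2
```

Each of the k steps touches every stored edge at most once, so memory and time scale with the number of edges rather than with the square of the number of nodes.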