# 1. Bellevue.edu Site Link Analysis

In this part of the exercise, you will be examining the Bellevue University web site. The bellevue.edu directory contains two CSV files, all_links.csv and content_links.csv. The all_links.csv contains all the links between the pages including links from the default navigation bar. The content_links.csv file only contains links between pages if those links occurred in the content of the page and not in the navigation bar.

The files contain two columns of data, src_link and dst_link. The src_link indicates the page that contains a link to the dst_link. The links are relative to the main page, so / resolves to http://www.bellevue.edu while /degrees resolves to http://www.bellevue.edu/degrees.

For this exercise, you will need to create a Networkx DiGraph from both datasets. A Networkx DiGraph is a directed graph where each node is a relative URL for a page on the Bellevue University website, and each edge represents a link from one page to another. See Networkx’s tutorial if you need help creating these graphs.

In [56]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import json

In [2]:
bv_data_root = 'C:\\Users\\Dan Siegel\\Desktop\\Classes\\550\\data\\bellevue.edu\\'
all_links_path = bv_data_root+'all_links.csv'
content_links_path = bv_data_root+'content_links.csv'

In [3]:
all_links_df = pd.read_csv(all_links_path)
content_links_df = pd.read_csv(content_links_path)

In [4]:
import itertools 
from nltk import sent_tokenize 
def cooccurrence(text, cast): 
    """ Takes as input text, a dict of chapter {headings: text}, and cast, a comma separated list of character names. 
    Returns a dictionary of cooccurrence counts for each possible pair. """ 
    possible_pairs = list( itertools.combinations( cast, 2)) 
    cooccurring = dict.fromkeys(possible_pairs, 0) 
    for title, chapter in text[' chapters']. items(): 
        for sent in sent_tokenize( chapter): 
            for pair in possible_pairs: 
                if pair[ 0] in sent and pair[ 1] in sent: 
                    cooccurring[ pair] += 1 
                    return cooccurring


In [5]:
G = nx.from_pandas_edgelist(all_links_df, source='src_link', target='dst_link', edge_attr=True, create_using=nx.DiGraph())

In [14]:
p = dict(nx.shortest_path_length(G))

In [16]:
shortest = pd.DataFrame.from_dict(p)

In [28]:
shortest['/'] = shortest['/'].fillna(value=-1)

a. Clicks from the Homepage

Using the all_links.csv dataset, you will determine how many clicks it takes to get from the homepage to every other page. You can calculate the number of clicks it takes to get from the homepage to any other page by calculating the shortest path length from the homepage (/) as the source node and the destination link as the target node. You may need to check if there is a path between the homepage and the destination link.

Provide a summary of the number of clicks each page is away from the main page (i.e., the / link) using the following format.

In [29]:
shortest['/'].value_counts()
#-1 is No path links 

 2.0    423
 3.0    273
 1.0     74
-1.0     13
 4.0      2
 0.0      1
Name: /, dtype: int64

b. PageRank

Using the content_links.csv data, calculate the PageRank for each node. PageRank is an algorithm first developed by Google to determine the importance of a website by looking at the links between websites. Intuitively, the PageRank algorithm simulates a web user clicking random links on a website. The PageRank of a site is a measure of the likelihood a user would land on that site by randomly clicking links.

This means a page is more important if other pages link to it.

Provide a summary of the top 20 pages by PageRank using the following format.

In [35]:
GG = nx.from_pandas_edgelist(content_links_df, source='src_link', target='dst_link', edge_attr=True, create_using=nx.DiGraph())
pr = nx.pagerank(GG, alpha=0.9)

In [47]:
page_rank_df = pd.DataFrame.from_dict(pr, orient = 'index')

In [54]:
page_rank_df[0].head(20).sort_values

<bound method Series.sort_values of /about/about-us/community-service                     0.002317
/about/about-us                                       0.002285
/about/about-us/history                               0.002285
/about/about-us/locations                             0.004939
/about/about-us/mission-values                        0.002285
/about/about-us/news-events-calendar/calendar         0.002520
/about/about-us/pratt-award                           0.002359
/about/about-us/remembering-our-fallen                0.002285
/about/about-us/statistics-facts                      0.002397
/about/about-us/student-stories                       0.002285
/about/gcc                                            0.002496
/degrees/academic-catalog/signature-series            0.018014
/about/about-us/latino-dream                          0.000337
/about/about-us/community-service/latino-dream        0.000163
/admissions-tuition/financing-options/how-to-apply    0.002650
/about/about-us/loc

2. Featured Articles Entity Graph

In this part of the exercise, you will extract entity pairs from the Wikipedia featured articles data set. Extract the entity pairs and load them into a Networkx graph. Using this graph, report the following basic information.

reporting format

a. Report on the top 20 nodes for each rank for degree centrality

Report the top 20 degree centrality rankings as described in chapter 9 of the Applied Text Analysis with Python book.

In [57]:
path = 'C:\\Users\\Dan Siegel\\Desktop\\Classes\\550\\data\\wikipedia\\featured-articles\\featured-articles_000.jsonl'
with open(path) as f:
    lines = f.readlines()

articles = [json.loads(line) for line in lines]

In [59]:
articles_df = pd.DataFrame.from_records(articles)

In [73]:
articles_df = articles_df.interlinks

In [85]:
ggg = nx.DiGraph()

In [79]:
for i in articles_df:
    list_of_dicts.append(i)
for i in list_of_dicts:
    ggg.add_nodes_from(i.keys())
for i in list_of_dicts:
    for k, v in i.items():
        ggg.add_edges_from(([(k, t) for t in v]))

In [108]:
pr = nx.pagerank(ggg, alpha=0.9)

In [109]:
centrality_rank_df = pd.DataFrame.from_dict(pr, orient = 'index')

In [112]:
centrality_rank_df.head(20).sort_values

<bound method DataFrame.sort_values of                                    0
developmental disorder      0.000009
Interpersonal relationship  0.000009
communication               0.000009
behavior                    0.000009
developmental milestones    0.000009
Regressive autism           0.000009
Heritability of autism      0.000009
environmental factors       0.000009
infection                   0.000009
pregnancy                   0.000009
rubella                     0.000009
valproic acid               0.000009
Alcohol (drug)              0.000009
cocaine                     0.000009
Controversies in autism     0.000009
Causes of autism            0.000009
MMR vaccine controversy     0.000009
brain                       0.000009
nerve cell                  0.000009
synapse                     0.000009>

b. Report on the top 20 nodes for each rank for betweenness centrality Report the top 20 betweenness centrality rankings as described in chapter 9 of the Applied Text Analysis with Python book.

In [113]:
centrality_rank_df.head(20).sort_values

<bound method DataFrame.sort_values of                                    0
developmental disorder      0.000009
Interpersonal relationship  0.000009
communication               0.000009
behavior                    0.000009
developmental milestones    0.000009
Regressive autism           0.000009
Heritability of autism      0.000009
environmental factors       0.000009
infection                   0.000009
pregnancy                   0.000009
rubella                     0.000009
valproic acid               0.000009
Alcohol (drug)              0.000009
cocaine                     0.000009
Controversies in autism     0.000009
Causes of autism            0.000009
MMR vaccine controversy     0.000009
brain                       0.000009
nerve cell                  0.000009
synapse                     0.000009>

. Bot Detection on Reddit

Your data directory should contain the file user_comment_links.csv the reddit directory. This file contains three fields.

user1: The Reddit username of the first user
user2: The Reddit username of the second user
num_comments: The number of times user1 commented on something user2 wrote
Using Networkx, create a directed graph of user comments. The following is Python demonstrating how to add a weighted edge in a directed graph.

reporting format

In this exercise, you will use the degree centrality algorithm to find users who a central to in the communication graph. These users are most likely automated bots which explain how they are able to communicate with a wide variety of users across multiple subreddits.



In [128]:
wel = pd.read_csv('C:\\Users\\Dan Siegel\\Desktop\\Classes\\550\\data\\reddit\\user_comment_links.csv')

In [132]:
redG = nx.DiGraph()
redG.add_weighted_edges_from([tuple(x) for x in wel.values])

In [140]:
#df for average degrees
dsdff = pd.DataFrame.from_dict(dict(redG.degree), orient = 'index')

In [144]:
#df for avg out degrees
redgOutDeg = pd.DataFrame.from_dict(dict(redG.out_degree), orient = 'index')

a. Graph Information

Report the following basic information.
Nodes, edges, Average Degree, Average out degree

In [146]:
len(redG.nodes), len(redG.edges), dsdff.mean(), redgOutDeg.mean()

(1160746, 1724191, 0    2.970833
 dtype: float64, 0    1.485416
 dtype: float64)

In [150]:
Reddit_centrality = pd.DataFrame.from_dict(nx.degree_centrality(redG), orient = 'index')

b. Degree Centrality

Report the top 20 degree centrality rankings as described in chapter 9 of the Applied Text Analysis with Python book.

In [151]:
Reddit_centrality.head(20).sort_values

<bound method DataFrame.sort_values of                                  0
-----------------www  8.615157e-07
zcc0nonA              2.584547e-06
----------_----       3.015305e-05
Amicus-Regis          3.618366e-05
BobIV                 6.202913e-05
Dont-worry-about-it   3.446063e-06
Felikitsune           2.239941e-05
HeyImNiko             2.756850e-05
Iguessimnotcreative   1.550728e-05
KokuTatsu             5.599852e-05
Last-Man-Standing     9.476672e-06
Mimighster            1.723031e-06
Saskatchemoose        5.169094e-06
Swiffles33            8.615157e-07
TheRisingDownfall     9.476672e-06
TheSamshinCashew      8.615157e-07
UnpleasantVisitor     3.446063e-06
bmierror              6.892125e-06
ininja2               3.446063e-06
itztaytay             3.962972e-05>