## SubgraphManipulator

### Set environment variables

To load the modules in this notebook from outside the repository folder, set **PATH** below to the repository location, e.g. `C:/SubgraphManipulator`:

In [1]:
PATH = "/path/to/SubgraphManipulator" # <-- optional when running from inside the repository folder

In [2]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, "__init__.py"))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [3]:
import IPython
import plotly.offline as py

from MSM import MSM
from RSM import RSM
from TSM import TSM
from smlib import *

#### IPython-exclusive

In [4]:
py.init_notebook_mode(connected=True)

In [5]:
def configure_plotly_browser_state():
    display(IPython.core.display.HTML('''
            <script src="/static/components/requirejs/require.js"></script>
            <script>
                requirejs.config({
                paths: {
                base: '/static/base',
                plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
                },
              });
            </script>
            '''))

### Analyze data

SubgraphManipulator (1) slices the data into time buckets, (2) finds the top communities in each bucket, and (3) looks for communities that persist across buckets.

In [6]:
split_number = 1  # <-- number of days for every slice
max_items    = 15 # <-- maximum users, tags or items
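
The slicing step can be pictured with a minimal sketch, assuming `split_number` means consecutive windows of that many days counted from the earliest timestamp. The helper `slice_by_days` and the sample timestamps are hypothetical and only illustrate the idea; they are not the code used by `TSM`.

```python
from collections import defaultdict
from datetime import datetime

def slice_by_days(timestamps, split_number=1):
    """Group timestamps into consecutive buckets of `split_number` days.

    Hypothetical helper: bucket 0 starts on the day of the earliest timestamp.
    """
    if not timestamps:
        return {}
    start = min(timestamps).date()
    buckets = defaultdict(list)
    for ts in timestamps:
        # Whole days elapsed since the first day, divided into windows.
        index = (ts.date() - start).days // split_number
        buckets[index].append(ts)
    return dict(buckets)

# Made-up sample data, not taken from the tweets dataset.
sample = [
    datetime(2019, 10, 12, 9, 30),
    datetime(2019, 10, 12, 18, 0),
    datetime(2019, 10, 13, 7, 15),
    datetime(2019, 10, 15, 12, 0),
]
daily = slice_by_days(sample, split_number=1)
print({index: len(items) for index, items in sorted(daily.items())})
```

With a larger `split_number`, neighbouring days fall into the same bucket, so fewer and larger slices are analyzed.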

#### 1/3) Tweets dataset `OK`

In [1]:
input_file = "/home/neo/workspace/tweets/spiderman_80k.csv" # <-- path to the tweets CSV, e.g. "tweets.csv"

%time matches, gini_coef, time,\
top_rt, top_ht, dates = TSM(input_file,\
                            split_number=split_number,\
                            min_hashtags=max_items,\
                            min_retweets=max_items)

NameError: name 'TSM' is not defined

#### 2/3) News media dataset `WIP`

In [None]:
input_path = "." # <-- folder containing articles/stories JSON files

graph_file = "news_network.gexf"    # <-- graph network file output from MediaCollector
cent_file  = "news_centrality.xlsx" # <-- previously saved network measures (optional)

gini_coef, time = MSM(input_path,
                      graph_file=graph_file,
                      centrality_file=cent_file)

#### 3/3) RIS datasets `DEPRECATED`

In [None]:
input_path = "" # <-- "ISJ_Citations"

matches, gini_coef, time,\
top_rt, top_ht, dates = RSM(input_path,
                            output_file='dataset.csv',
                            split_number=7,
                            top=30)

### Visualization
#### Communities over time
Displays the identified communities and their persistence across the analyzed periods.

In [22]:
#configure_plotly_browser_state()
communities_over_time(matches, output=None, inline=True)

#### Population size

In [23]:
#configure_plotly_browser_state()
population_size(matches, output=None, inline=True)

#### Time rates

For each community, compute the partition time rate between two periods P1 and P2, i.e. the share of P1 users that also appear in P2, ``part_rate = |P2.users & P1.users| / |P1.users|``, and plot it over time.

In [24]:
#configure_plotly_browser_state()
time_rate(matches, output=None, inline=True)
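
The partition time rate defined above can be sketched with plain Python sets. This is a minimal illustration, assuming each period of a community is available as a collection of user names; the function `partition_rate` and the sample users are hypothetical and not part of `smlib`.

```python
def partition_rate(p1_users, p2_users):
    """Share of P1 users that are also present in P2 (hypothetical helper)."""
    p1, p2 = set(p1_users), set(p2_users)
    if not p1:
        # An empty first period has no users to retain.
        return 0.0
    return len(p1 & p2) / len(p1)

# Made-up users for two periods of the same community.
p1 = {"ana", "bob", "carol", "dave"}
p2 = {"bob", "carol", "erin"}
print(partition_rate(p1, p2))  # 2 of the 4 P1 users remain -> 0.5
```

A rate near 1 means almost all users of the earlier period are still in the community; a rate near 0 means the membership has largely turned over.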

#### Gini Coefficient

In [25]:
#configure_plotly_browser_state()
communities_gini(gini_coef, output=None, inline=True)
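
For reference, a Gini coefficient over a list of non-negative values can be computed as in the sketch below. The notebook does not state which quantity `gini_coef` is computed over, so applying it to community sizes here is an assumption, and the function `gini` is a hypothetical helper rather than the one used by `TSM`.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted values with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Made-up community sizes (assumption: sizes are what is being compared).
print(gini([5, 5, 5, 5]))    # all equal -> 0.0
print(gini([0, 0, 0, 10]))   # one community holds everything -> 0.75
```

For `n` values the maximum of this estimator is `(n - 1) / n`, which is why the fully concentrated example yields 0.75 rather than 1.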

#### Top influencers

In [26]:
#configure_plotly_browser_state()
top_influencers(top_rt, max_items, output=None, inline=True)

#### Top associated tags

In [27]:
#configure_plotly_browser_state()
top_influencers(top_ht, max_items, output=None, inline=True)

#### Detail communities

In [28]:
#configure_plotly_browser_state()
detail_communities(time, output=None, inline=True)

### Write data from communities

Data is automatically written to `communities_items.xlsx` and `communities_users.xlsx`, with one sheet per analyzed period.

In [29]:
write_communities(time, dates)

0 2019-10-12
1 2019-10-13
2 2019-10-14
3 2019-10-15
4 2019-10-16
5 2019-10-17
6 2019-10-18
7 2019-10-19
8 2019-10-20
9 2019-10-21


#### Compress output → `output.zip`

In [30]:
!zip output.zip *xlsx

  adding: communities_items.xlsx (deflated 27%)
  adding: communities_users.xlsx (deflated 10%)


In [31]:
!ls

communities_items.xlsx	__pycache__	  smlib.py
communities_users.xlsx	README.md	  SubgraphManipulator.ipynb
__init__.py		requirements.txt  SubgraphManipulator.py
MSM.py			RESULTS		  TSM.py
output.zip		RSM.py


### [Download output files](output.zip)

___

### References

* TSM on GitHub: https://github.com/dfreelon/TSM
* The TSM author, Deen Freelon: http://dfreelon.org/
* Line charts with plot.ly: https://plot.ly/python/line-charts/
* Bringing Matplotlib to the Browser: https://mpld3.github.io/examples/html_tooltips.html
* matplotlib.pyplot.scatter: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html