## PyCluster

### Set environmental variables

To load the project's modules when running this notebook from outside the repository folder, set **PATH** below to the PyCluster directory, e.g. `C:/PyCluster`:

In [None]:
PATH = "/path/to/PyCluster" # <-- optional if running from native path

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, "__init__.py"))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
import numpy as np
import plotly.offline as py

from chart import line_chart
from media import cluster_media
from tweets import cluster_tweets

py.init_notebook_mode(connected=True)

### Analyze data

Set the number of clusters **k** for k-means and start the data analysis. Alternatively, leave it as `0` to detect **k** automatically with the elbow method.

In [None]:
input_name = "" # data set to parse
k_value    = 0  # set number of clusters
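
**Sketch:** the elbow method mentioned above can be illustrated as follows. This is a minimal NumPy-only example, not PyCluster's own implementation: the helper names (`farthest_point_init`, `kmeans_wcss`, `elbow_k`), the deterministic initialization, and the "largest second difference" rule are all assumptions made for illustration.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic init: start at X[0], then repeatedly add the farthest point."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    return np.array(centers)

def kmeans_wcss(X, k, n_iter=50):
    """Run basic Lloyd's k-means and return the within-cluster sum of squares."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

def elbow_k(wcss_by_k):
    """Pick k at the largest second difference of the WCSS curve (a heuristic)."""
    ks = sorted(wcss_by_k)
    values = np.array([wcss_by_k[k] for k in ks])
    second_diff = values[:-2] - 2 * values[1:-1] + values[2:]
    return ks[int(second_diff.argmax()) + 1]

# Toy data (hypothetical): three well-separated blobs in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

wcss = {k: kmeans_wcss(X, k) for k in range(1, 7)}
best_k = elbow_k(wcss)
```

On this toy data the curve bends sharply at three clusters, so `best_k` should be `3`. Real text data rarely shows such a clean elbow, which is why the silhouette and gap statistic options exist as cross-checks.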

#### Advanced settings

Set the interval used to split the data, tune the clustering options, and enable k-value accuracy tests. **Note:** the `classic_k_means` method has been deprecated.

In [None]:
n_features = 0             # number of tweets (X axis)
n_dimensions = 0           # number of words (Y axis)
num_days = 0               # interval to split data

tfidf = True               # perform tf-idf processing
mini_batch = True          # faster at expense of accuracy

elbow_method = False       # within-cluster sum of squares
silhouette_scores = False  # silhouette coefficients
gap_statistics = False     # calculate gap statistics
all_metrics = False        # virality/ASS/topic complexity

random_state = None        # optionally set random seed

np.random.seed(random_state) 
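
**Sketch:** to show what the `tfidf` option refers to, here is a plain-Python version of the textbook tf-idf weighting (term frequency times `log(N / df)`). The `tfidf_weights` helper and the sample documents are hypothetical. PyCluster's actual processing may differ; scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed idf and L2 normalization by default.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized documents.
docs = [
    "storm hits coast".split(),
    "storm warning issued".split(),
    "election results announced".split(),
]
scores = tfidf_weights(docs)
```

In the first document, `hits` outweighs `storm` because `storm` also appears in the second document; a term present in every document would get a weight of zero.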

#### Read tweets data

In [None]:
df, ag, tg, mm, mx = cluster_tweets(input_name,
                                    k=k_value,
                                    days=num_days,
                                    tfidf=tfidf,
                                    minibatch=mini_batch,
                                    elb=elbow_method,
                                    sil=silhouette_scores,
                                    gap=gap_statistics,
                                    random_state=random_state,
                                    all_metrics=all_metrics)

#### Read news media data

In [None]:
df, ag, tg, mm, mx = cluster_media(input_name,
                                   k=k_value,
                                   days=num_days,
                                   tfidf=tfidf,
                                   minibatch=mini_batch,
                                   elb=elbow_method,
                                   sil=silhouette_scores,
                                   gap=gap_statistics,
                                   random_state=random_state,
                                   all_metrics=all_metrics)

### Attention graph

Plot overall narrative attention over time, i.e. the amount of input data during the period analyzed. **Tip:** smaller time frames result in more detailed steps.

In [None]:
line_chart(ag, inline=True)
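
**Sketch:** the effect of the time frame on the attention graph can be seen by counting items per window. The `attention_counts` helper and the sample dates below are hypothetical and only illustrate the idea behind `num_days`; they are not how `ag` is actually built.

```python
from collections import Counter
from datetime import date, timedelta

def attention_counts(dates, num_days=1):
    """Count items per consecutive window of `num_days` days."""
    start = min(dates)
    counts = Counter((d - start).days // num_days for d in dates)
    n_bins = max(counts) + 1
    # Include empty windows so the time axis has no gaps.
    return [(start + timedelta(days=i * num_days), counts.get(i, 0))
            for i in range(n_bins)]

# Hypothetical publication dates.
dates = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2),
         date(2020, 1, 4), date(2020, 1, 5)]

daily = attention_counts(dates, num_days=1)
two_day = attention_counts(dates, num_days=2)
```

With one-day windows the counts are `[2, 1, 0, 1, 1]`; with two-day windows they collapse to `[3, 1, 1]`, which is the loss of detail the tip refers to.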

### Themed graph

Plot clusters and their attention over time. **Note:** groups are named after their most frequent word, but defined by word co-occurrences (k-means).

In [None]:
line_chart(tg, inline=True)
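
**Sketch:** naming each group after its most frequent word could look like the following. The `name_clusters` helper, the token lists, and the labels are hypothetical; the cluster labels themselves would come from k-means, not from this function.

```python
from collections import Counter

def name_clusters(token_lists, labels):
    """Map each cluster label to the most frequent word among its documents."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {label: c.most_common(1)[0][0] for label, c in counts.items()}

# Hypothetical tokenized documents and their k-means labels.
token_lists = [
    ["storm", "coast"],
    ["storm", "warning"],
    ["election", "results", "election"],
]
labels = [0, 0, 1]

names = name_clusters(token_lists, labels)
```

Here cluster `0` is named `storm` and cluster `1` is named `election`, even though membership was decided by the clustering step.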

### Details on clusters

Return a data frame with detailed data on a specific cluster `k`. By default, it shows the objects in the first cluster (`k=0`).

In [None]:
k = 0

df[df['cluster'] == k]

### Markov chain

Display the Markov matrix, which describes a sequence of possible events in which the probability of each event depends only on the previous state.

In [None]:
from IPython.display import Image
from markov import markov_chain
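
**Sketch:** a Markov matrix like the one described above can be estimated by counting transitions between consecutive states and normalizing each row. The `transition_matrix` helper and the sample sequence are hypothetical. The result is a dict mapping each state to its row of probabilities; that layout is an assumption and may not match the real structure of `mm`.

```python
from collections import Counter, defaultdict

def transition_matrix(sequence):
    """Estimate row-normalized transition probabilities from a state sequence."""
    states = sorted(set(sequence))
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        pair_counts[current][nxt] += 1
    matrix = {}
    for s in states:
        total = sum(pair_counts[s].values())
        # A state with no outgoing transitions gets an all-zero row.
        matrix[s] = [pair_counts[s][t] / total if total else 0.0 for t in states]
    return matrix

# Hypothetical sequence of dominant clusters over time.
sequence = ["A", "B", "A", "A", "B", "C", "A"]
mm_example = transition_matrix(sequence)
```

Each row sums to 1 whenever the state has at least one observed successor. For example, `B` is followed once by `A` and once by `C`, so its row is `[0.5, 0.0, 0.5]`.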

#### Render transition diagram

Methods available for rendering the transition diagram: `graphviz`, `pygraphviz`, or `pydot`.

In [None]:
Q = list(mm.values())
states = list(mm.keys())
m = markov_chain(Q, states, method='graphviz'); m
#Image(m) # <-- uncomment for pygraphviz

### Cause-effect structure `WIP`

Integrated information theory provides a mathematical framework to fully characterize the cause-effect structure of a physical system. Here we use [PyPhi](http://integratedinformationtheory.org), which implements a framework for causal analysis and unfolds the full cause-effect structure of discrete dynamical systems of binary elements.

In [None]:
from phi import run_pyphi, complex_mechanism

#### Compute network

Return all complexes in the network, which is the context for all φ and Φ computations. Here we'll use the 2-dimensional state-by-node form for the TPM (transition probability matrix).

In [None]:
bc, ac = run_pyphi(mx['presence_matrix'],
                   #cm=mx['connective_matrix'], # <-- optional
                   states=mx['states'])
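
**Sketch:** for reference, a 2-dimensional state-by-node TPM has one row per network state and one column per node, where each entry is the probability that the node is ON at the next time step. The example below builds one with NumPy for a hypothetical two-node network in which each node copies the other. The `state_by_node_tpm` helper is an assumption for illustration, and `mx['presence_matrix']` may be constructed differently. Rows follow PyPhi's little-endian convention, in which the first node varies fastest.

```python
import itertools
import numpy as np

def state_by_node_tpm(update, n_nodes):
    """Build a 2-D state-by-node TPM for a deterministic binary network."""
    rows = []
    for bits in itertools.product((0, 1), repeat=n_nodes):
        state = bits[::-1]  # reverse so that the first node varies fastest
        rows.append(update(state))
    return np.array(rows, dtype=float)

def swap(state):
    """Hypothetical update rule: each node takes the other node's value."""
    a, b = state
    return (b, a)

tpm = state_by_node_tpm(swap, n_nodes=2)
# Row order: (0, 0), (1, 0), (0, 1), (1, 1)
```

The resulting array has shape `(4, 2)`: four states by two nodes.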

#### Mechanism details

See the details (cause and effect) of one selected mechanism, based on the list above and its small phi (φ) values.

In [None]:
m = 0 # <-- mechanism number

complex_mechanism(bc, m)

#### Best complex data

Display the first mechanism after the system irreducibility analysis, along with the cause-effect structure of the complex with the highest phi value.

In [None]:
bc

#### ALL complexes data (!)

Display all complexes identified by PyPhi.

In [None]:
ac

#### Compress output → `output.zip`

In [None]:
!zip output.zip *html *json *csv *xls *xlsx

### [Download output files](output.zip)

_____

### References:

* Scipy: [documentation](https://www.scipy.org/) | [GitHub](https://github.com/scipy/scipy)

* Scikit-learn: [documentation](https://scikit-learn.org/stable/documentation.html) | [GitHub](https://github.com/scikit-learn/scikit-learn)

* PyPhi: [website](https://pypi.org/project/pyphi/) | [arxiv](https://arxiv.org/abs/1712.09644) | [documentation](https://pyphi.readthedocs.io/en/latest/) | [GitHub](https://github.com/wmayner/pyphi)

* Mini-Batch K-means: [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans) | [article](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf)

* Gap Statistic for k-means: [anaconda](https://anaconda.org/milesgranger/gap-statistic/notebook)

* PyGraphviz documentation: [GitHub](https://pygraphviz.github.io/)

* PyDot markov chains: [Vince Knight](https://vknight.org/unpeudemath/code/2015/11/15/Visualising-markov-chains.html)

* Original d3-cloud by Jason Davies: [GitHub](http://github.com/jasondavies/d3-cloud)

* Based on twarc's implementation: [GitHub](https://github.com/DocNow/twarc)

* Wordle algorithm due to: [Jonathan Feinberg](http://static.mrfeinberg.com/bv_ch03.pdf)