## PyCluster

### Set environment variables

To load modules in this notebook when running it from outside the repository folder, set **PATH** below to the repository location, e.g. `C:/PyCluster`:

In [None]:
PATH = "/path/to/PyCluster" # <-- optional when running from the repository folder

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, '__init__.py'))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
import json

import numpy as np
import plotly.offline as py

from chart import line_chart
from clusterize import clusterize

py.init_notebook_mode(connected=True)

### Analyze data

Required: choose one of the supported input data types to clusterize: `tweets`, `media` or `text` (file/folder).

In [None]:
input_name = ""
data_type  = ""

#### Advanced settings

Optionally set an interval to split the data and run k-value accuracy tests. Leave `k_value` as `0` to auto-detect the number of clusters with the elbow method.

In [None]:
k_value = 0                 # set number of clusters
min_k = 2                   # minimum number of clusters
max_k = 8                   # maximum number of clusters
max_iter = 300              # maximum K-Means iterations

n_features = 0              # number of tweets X axis
n_dimensions = 0            # number of words Y axis
batch_size = 100            # size of the mini-batches
s_samples = None            # samples for silhouette
g_nrefs = 3                 # references for gap statistics

num_days = 0                # interval to split data
words_in_cloud = 100        # maximum words in clouds
sort_by = 'degree'          # centrality measure to sort by

idf = True                  # perform tf-idf weighting
minibatch = False           # faster at expense of accuracy
stemming = False            # remove affixes from words

elbow_method = False        # within-cluster sum of squares
silhouette_scores = False   # silhouette coefficients
gap_statistics = False      # calculate gap statistics

all_metrics = False         # virality/ASS/topic complexity

inline = True               # plot k-value analysis
output_folder = 'PYCLUSTER' # output folder name

random_state = None         # optionally store random seed

np.random.seed(random_state)
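
The notebook does not show how `clusterize` picks **k** when `k_value` is `0`, so the following is only a sketch of one common elbow heuristic, not PyCluster's implementation. It takes a within-cluster sum of squares (WSS) value per candidate **k** and returns the **k** whose point lies farthest from the straight line joining the first and last points. The helper `elbow_k` and the sample WSS values are hypothetical.

```python
import math


def elbow_k(ks, wss):
    """Return the k whose (k, WSS) point is farthest from the chord
    joining the first and last points of the curve."""
    (x1, y1), (x2, y2) = (ks[0], wss[0]), (ks[-1], wss[-1])
    norm = math.hypot(x2 - x1, y2 - y1)

    def distance(x, y):
        # Perpendicular distance from (x, y) to the chord.
        return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm

    return max(zip(ks, wss), key=lambda point: distance(*point))[0]


# Candidate range matching min_k=2 and max_k=8; WSS values are made up.
ks = [2, 3, 4, 5, 6, 7, 8]
wss = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0, 85.0]

best_k = elbow_k(ks, wss)
print(best_k)
```

On this sample curve the WSS drops sharply up to four clusters and flattens afterwards, so the heuristic selects `4`.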

#### Read and cluster data

Calculate the appropriate number of clusters **k** for k-means if unset, then start the data analysis.

In [None]:
df, ag, tg = clusterize(input_name,
                        data_type,
                        k=k_value,
                        min_k=min_k,
                        max_k=max_k,
                        max_iter=max_iter,
                        n_features=n_features,
                        n_dimensions=n_dimensions,
                        batch_size=batch_size,
                        s_samples=s_samples,
                        g_nrefs=g_nrefs,
                        interval=num_days,
                        words_in_cloud=words_in_cloud,
                        idf=idf,
                        minibatch=minibatch,
                        wss=elbow_method,
                        sil=silhouette_scores,
                        gap=gap_statistics,
                        all_metrics=all_metrics,
                        stemming=stemming,
                        random_state=random_state,
                        output_folder=output_folder,
                        inline=inline)
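
The internals of `clusterize` are not shown here. As a rough illustration of the tf-idf plus k-means idea behind the `idf`, `max_iter` and `random_state` settings, the sketch below uses scikit-learn directly. The toy corpus and variable names are hypothetical, and PyCluster may preprocess text differently.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three documents per topic, no shared vocabulary.
docs = [
    "cat purrs on the sofa",
    "the cat chases a mouse",
    "my cat sleeps all day",
    "stock market falls today",
    "investors sell stock shares",
    "market shares rally as stock rises",
]

# Rows are documents, columns are words; use_idf mirrors the `idf` setting.
matrix = TfidfVectorizer(use_idf=True, stop_words="english").fit_transform(docs)

# n_init restarts guard against a poor random initialisation.
model = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=0)
labels = model.fit_predict(matrix)

print(labels)
```

Swapping `KMeans` for scikit-learn's `MiniBatchKMeans` (which accepts a `batch_size` argument) would trade some accuracy for speed, in the spirit of the `minibatch` setting.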

### Attention graph

Plot overall narrative attention over time, i.e. the amount of input data during the period analyzed. **Tip:** smaller time frames result in more detailed steps.

In [None]:
line_chart(ag, inline=True)
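
To see why smaller time frames give more detailed steps, the sketch below counts items per window of `num_days` days using only the standard library. The helper `attention_counts` and the sample dates are hypothetical; how PyCluster builds `ag` is an assumption not confirmed by this notebook.

```python
from collections import Counter
from datetime import date, timedelta


def attention_counts(dates, num_days=1):
    """Count items per window of `num_days` days, starting at the earliest date."""
    if num_days < 1:
        raise ValueError("num_days must be at least 1")
    start = min(dates)
    counts = Counter((d - start).days // num_days for d in dates)
    # Emit every window, including empty ones, so the line has no gaps.
    return [(start + timedelta(days=i * num_days), counts.get(i, 0))
            for i in range(max(counts) + 1)]


dates = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2),
         date(2020, 1, 4), date(2020, 1, 4), date(2020, 1, 4)]

daily = attention_counts(dates, num_days=1)    # four points, one per day
two_day = attention_counts(dates, num_days=2)  # two points, coarser steps
```

With one-day windows the quiet day on 3 January shows up as a zero; with two-day windows it is absorbed into its neighbours.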

### Themed graph

Plot clusters and their attention over time. **Note:** groups are named after their most frequent word, but are defined by word co-occurrences (k-means).

In [None]:
line_chart(tg, inline=True)
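
Naming a group after its most frequent word can be sketched as follows. The helper `name_clusters` and the sample data are hypothetical, and PyCluster's own tokenization (stop words, stemming) is likely more involved.

```python
from collections import Counter, defaultdict


def name_clusters(texts, labels):
    """Map each cluster label to the most frequent word among its texts."""
    words = defaultdict(Counter)
    for text, label in zip(texts, labels):
        # Naive whitespace tokenization, for illustration only.
        words[label].update(text.lower().split())
    return {label: counter.most_common(1)[0][0]
            for label, counter in words.items()}


texts = ["cat sleeps", "cat eats", "stock falls", "stock rises"]
labels = [0, 0, 1, 1]

names = name_clusters(texts, labels)
```

Here cluster `0` is named `cat` and cluster `1` is named `stock`, even though the cluster assignment itself comes from the k-means step rather than from these counts.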

### Filter data by cluster

Returns a data frame with detailed data on a specific cluster **k**. By default, shows objects from the first cluster (`k=0`).

In [None]:
k = 0

df_ = df[df['cluster'] == k]; df_

#df_.to_csv('%s/cluster_%s.csv' % (output_folder, k)) # <-- uncomment to save as CSV file

#### Filter data for all clusters (!)

Write the data for each cluster to its own CSV file.

In [None]:
for k in sorted(df['cluster'].unique()):
    df_ = df[df['cluster'] == k]
    df_.to_csv('%s/cluster_%s.csv' % (output_folder, k))

#### Compress output → `output.zip`

In [None]:
!zip -r output.zip {output_folder}

### [Download output files](output.zip)

_____

### References:

* Scipy: [documentation](https://www.scipy.org/) | [GitHub](https://github.com/scipy/scipy)

* Scikit-learn: [documentation](https://scikit-learn.org/stable/documentation.html) | [GitHub](https://github.com/scikit-learn/scikit-learn)

* Mini-Batch K-means: [article](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) | [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans)

* Silhouette scores: [article](https://doi.org/10.1016/0377-0427(87)90125-7) | [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)

* Gap statistics: [article](https://statweb.stanford.edu/~gwalther/gap) | [anaconda](https://anaconda.org/milesgranger/gap-statistic/notebook)

* Wordle algorithm due to: [Jonathan Feinberg](http://static.mrfeinberg.com/bv_ch03.pdf)

* Original d3-cloud by Jason Davies: [GitHub](http://github.com/jasondavies/d3-cloud)

* Based on twarc's implementation: [GitHub](https://github.com/DocNow/twarc)