# HMDP Topic Model IPython Wrapper

This Python class wraps the Java binaries of the HMDP topic model from the PROMOSS topic modelling toolbox. The file *promoss.jar* is expected at *../promoss.jar*, relative to the working directory of the notebook.

## HMDP class

The HMDP class contains all the methods required to run the HMDP topic model. 

### Mandatory parameters
The class takes two mandatory parameters:
* directory 		String. Gives the directory of the texts.txt and groups.txt file.
* meta_params		String. Specifies the metadata types and gives the desired clustering. Types of metadata are given separated by semicolons (one type per metadata column in the meta.txt file). Possible datatypes are:
 * G	Geographical coordinates. The number of desired clusters is specified in brackets, e.g. G(1000) will cluster the documents into 1000 clusters based on the geographical coordinates. (Technical detail: we use EM to fit a mixture of Fisher distributions.)
 * T	UNIX timestamps (in seconds). The number of clusters (based on binning) is given in brackets, and there can be multiple clusterings based on a binning on the timeline or temporal cycles. This is indicated by a letter followed by the number of desired clusters:
   * L	Binning based on the timeline. Example: L1000 gives 1000 bins.
   * Y	Binning based on the yearly cycle. Example: Y1000 gives 1000 bins.
   * M	Binning based on the monthly cycle. Example: M1000 gives 1000 bins.
   * W	Binning based on the weekly cycle. Example: W1000 gives 1000 bins.
   * D	Binning based on the daily cycle. Example: D1000 gives 1000 bins.
 * O	Ordinal values (numbers)
 * N	Nominal values (text strings)
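
The following sketch shows one way to assemble a *meta_params* string from the rules above. The helper `build_meta_params` is hypothetical and not part of the wrapper; it only joins the type specifications with semicolons, and it assumes that the order of the entries matches the order of the metadata columns in meta.txt.

```python
def build_meta_params(specs):
    """Join metadata type specifications with semicolons.

    Each entry of `specs` is either a plain type letter ("O", "N") or a
    (letter, argument) pair, which is rendered as "letter(argument)".
    """
    parts = []
    for spec in specs:
        if isinstance(spec, tuple):
            letter, argument = spec
            parts.append("{}({})".format(letter, argument))
        else:
            parts.append(spec)
    return ";".join(parts)


# Geographical clustering into 1000 clusters, timeline binning into 1000 bins,
# followed by one ordinal and one nominal metadata column.
meta_params = build_meta_params([("G", 1000), ("T", "L1000"), "O", "N"])
print(meta_params)
```

The resulting string can be passed as the `meta_params` argument of the HMDP class.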

### Optional parameters
Additionally, optional parameters can be given. The most commonly used ones are *T, RUNS, processed, stemming, stopwords* and *language*.
* T			Integer. Number of truncated topics. Default: 100
* RUNS			Integer. Number of iterations the sampler will run. Default: 200
* SAVE_STEP		Integer. Number of iterations after which the learned parameters are saved. Default: 10
* TRAINING_SHARE		Double. Gives the share of documents which are used for training (0 to 1). Default: 1
* BATCHSIZE		Integer. Batch size for topic estimation. Default: 128
* BATCHSIZE_GROUPS	Integer. Batch size for group-specific parameter estimation. Default: BATCHSIZE
* BURNIN			Integer. Number of iterations till the topics are updated. Default: 0
* BURNIN_DOCUMENTS	Integer. Gives the number of sampling iterations where the group-specific parameters are not updated yet. Default: 0
* INIT_RAND		Double. Topic-word counts are initialised as INIT_RAND * RANDOM(). Default: 0
* SAMPLE_ALPHA		Integer. Every SAMPLE_ALPHAth document is used to estimate alpha_1. Default: 1
* BATCHSIZE_ALPHA	Integer. How many observations do we take before updating alpha_1. Default: 1000
* MIN_DICT_WORDS		Integer. If the words.txt file is missing, words.txt is created by using words which occur at least MIN_DICT_WORDS times in the corpus. Default: 100
* save_prefix		String. If given, this String is appended to all output files.
* alpha_0		Double. Initial value of alpha_0. Default: 1
* alpha_1		Double. Initial value of alpha_1. Default: 1
* epsilon		Comma-separated double. Dirichlet prior over the weights of contexts. Comma-separated double values, with dimensionality equal to the number of contexts.
* delta_fix 		If set, delta is fixed and set to this value. Otherwise delta is learned during inference.
* rhokappa		Double. Initial value of kappa, a parameter for the learning rate of topics. Default: 0.5
* rhotau			Integer. Initial value of tau, a parameter for the learning rate of topics. Default: 64
* rhos			Integer. Initial value of s, a parameter for the learning rate of topics. Default: 1
* rhokappa_document	Double. Initial value of kappa, a parameter for the learning rate of the document-topic distribution. Default: rhokappa
* rhotau_document	Integer. Initial value of tau, a parameter for the learning rate of the document-topic distribution. Default: rhotau
* rhos_document		Integer. Initial value of s, a parameter for the learning rate of the document-topic distribution. Default: rhos
* rhokappa_group		Double. Initial value of kappa, a parameter for the learning rate of the group-topic distribution. Default: rhokappa
* rhotau_group		Integer. Initial value of tau, a parameter for the learning rate of the group-topic distribution. Default: rhotau
* rhos_group		Integer. Initial value of s, a parameter for the learning rate of the group-topic distribution. Default: rhos
* processed		Boolean. Tells if the text is already processed. If true, words are split by spaces; otherwise words are split with complex regular expressions. Default: true.
* stemming		Boolean. Activates word stemming in case no words.txt/wordsets file is given. Default: false
* stopwords		Boolean. Activates stopword removal in case no words.txt/wordsets file is given. Default: false
* language		String. Currently "en" and "de" are available languages for stemming. Default: "en"
* store_empty		Boolean. Determines if empty documents should be omitted in the final document-topic matrix or if the topic distribution should be predicted using the context. Default: True
* topk			Integer. Set the number of top words returned in the topktopics file of the output. Default: 100
* gamma			Double. Initial scaling parameter of the top-level Dirichlet process. Default: 1
* learn_gamma		Boolean. Should gamma be learned during inference? Default: True
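
The list above describes kappa, tau and s only as parameters of a learning rate and does not give the formula. As an illustration, the sketch below assumes the schedule commonly used in stochastic variational inference, s / (tau + t)^kappa for update number t. The function `learning_rate` is hypothetical, and the schedule is an assumption that should be checked against the PROMOSS sources.

```python
def learning_rate(t, s=1, tau=64, kappa=0.5):
    """Step size s / (tau + t) ** kappa for update number t (assumed schedule)."""
    return s / (tau + t) ** kappa


# With the documented defaults the step size shrinks slowly as t grows;
# a larger kappa makes it decay faster, a larger tau damps the first updates.
for t in (0, 36, 192):
    print(t, learning_rate(t))
```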

### Provided methods

#### run()
This method executes the Java binaries with the parameters specified in the initialisation step. Existing output of a previous run in the data directory is deleted first.
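
A minimal sketch of how such a call can be assembled: every parameter becomes a `-name value` pair on the Java command line. The helper `build_command` and the directory `demo/` are hypothetical; the wrapper itself lists all flags explicitly.

```python
def build_command(jar_path, directory, meta_params, **options):
    """Return the argument list for the Java call; each option becomes '-name value'."""
    args = ["java", "-jar", jar_path,
            "-directory", directory,
            "-meta_params", meta_params]
    for name, value in options.items():
        # All values are passed as strings, e.g. True becomes "True"
        args.extend(["-" + name, str(value)])
    return args


command = build_command("../promoss.jar", "demo/", "G(1000);N",
                        T=50, RUNS=100, stemming=True)
print(" ".join(command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with the semicolons and brackets in *meta_params*.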

#### check_run()
Checks if the HMDP model was already trained.

* Output: Boolean
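
The check amounts to testing whether the output folder of the final iteration exists. The sketch below reproduces that test on a temporary directory; the function name `is_trained` is hypothetical.

```python
import os
import tempfile


def is_trained(directory, runs):
    """True if <directory>/output_HMDP/<runs> exists, i.e. the last iteration was saved."""
    return os.path.isdir(os.path.join(directory, "output_HMDP", str(runs)))


with tempfile.TemporaryDirectory() as directory:
    before = is_trained(directory, 200)
    # Simulate the folder the sampler writes after its final iteration
    os.makedirs(os.path.join(directory, "output_HMDP", "200"))
    after = is_trained(directory, 200)

print(before, after)
```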

#### map_from_JSON()
Creates HTML files with interactive maps showing the topic probabilities per cluster for all geographical metadata.
<img src="img/screenshot_map.png" style="height: 300px" />
* Input:
 * color: Gives the color of the markers (hexadecimal, e.g. #aa23cc). Default: auto (changing colours)
 * marker_size: Integer, size of markers. Default: 10
 * show_map: Show map in the IPython notebook. Warning, this can crash your browser. Default: false
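
Before the map layers are built, the geoJSON files are ordered by the topic number contained in their file names. A small sketch of that step, using hypothetical file names of the form `topic_<number>.geojson`:

```python
def sort_by_topic_number(file_names):
    """Sort names like 'topic_12.geojson' by the integer between '_' and '.'."""
    def topic_number(name):
        return int(name.split("_")[1].split(".")[0])
    return sorted(file_names, key=topic_number)


files = ["topic_10.geojson", "topic_2.geojson", "topic_0.geojson"]
print(sort_by_topic_number(files))
```

Sorting numerically matters because a plain string sort would place `topic_10` before `topic_2`.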


#### plot_zeta()
Show metadata (feature) weights.

#### plot_time()
Plot temporal distribution(s) of topic probabilities for a given topic.

* Input: ID of a topic
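
The time axis of these plots is labelled with the first and last date of the data. The sketch below shows the conversion from UNIX timestamps to such labels. It uses UTC so that the result does not depend on the local time zone, which is a choice made for this example; the timestamps are made up.

```python
from datetime import datetime, timezone


def date_label(timestamp):
    """Format a UNIX timestamp (in seconds) as day.month.year in UTC."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%d.%m.%Y")


timestamps = [1104537600, 0, 86400]
first_time, last_time = min(timestamps), max(timestamps)
print(date_label(first_time), date_label(last_time))
```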

#### plot_ordinal()
Plot distribution of topic probabilities over ordinal metadata variables for a given topic.
* Input: ID of a topic


#### get_topics()
* Output: Returns the top-k words of each topic (k is set by the *topk* parameter of the HMDP class) in a pandas DataFrame, one row per topic.
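
The plotting methods name each topic by its first three words. The sketch below builds such labels from space-separated lines, one line per topic, without pandas. The function `topic_labels` and the sample words are hypothetical.

```python
def topic_labels(lines, k=3):
    """Build 'Topic <id>: w1 w2 w3' labels from lines of space-separated top words."""
    labels = []
    for topic_id, line in enumerate(lines):
        words = line.split()
        labels.append("Topic {}: {}".format(topic_id, " ".join(words[:k])))
    return labels


sample = ["music band album song", "game team season player"]
for label in topic_labels(sample):
    print(label)
```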

# coding: utf-8
%matplotlib inline

import json
import io, os, shutil, time, datetime
import subprocess
import folium
from IPython.display import HTML, IFrame, display
import matplotlib.pyplot as plt
import pandas as pd

class HMDP(object):
    directory = "";
    meta_params = "";
    T=100
    RUNS=200
    SAVE_STEP=10
    TRAINING_SHARE=1.0
    BATCHSIZE=128
    BATCHSIZE_GROUPS=128
    BURNIN=0
    BURNIN_DOCUMENTS=0
    INIT_RAND=0
    SAMPLE_ALPHA=1
    BATCHSIZE_ALPHA=1000
    MIN_DICT_WORDS=100
    alpha_0=1
    alpha_1=1
    epsilon="none"
    delta_fix="none"
    rhokappa=0.5
    rhotau=64
    rhos=1
    rhokappa_document=0.5
    rhotau_document=64
    rhos_document=1
    rhokappa_group=0.5
    rhotau_group=64
    rhos_group=1
    processed=True
    stemming=False
    stopwords=False
    language="en"
    store_empty=True
    topk=100
    gamma = 1
    learn_gamma = True;
    
    def __init__(self,
    directory,
    meta_params,
    T=100,
    RUNS=200,
    SAVE_STEP=10,
    TRAINING_SHARE=1.0,
    BATCHSIZE=128,
    BATCHSIZE_GROUPS=128,
    BURNIN=0,
    BURNIN_DOCUMENTS=0,
    INIT_RAND=0,
    SAMPLE_ALPHA=1,
    BATCHSIZE_ALPHA=1000,
    MIN_DICT_WORDS=100,
    alpha_0=1,
    alpha_1=1,
    epsilon="none",
    delta_fix="none",
    rhokappa=0.5,
    rhotau=64,
    rhos=1,
    rhokappa_document=0.5,
    rhotau_document=64,
    rhos_document=1,
    rhokappa_group=0.5,
    rhotau_group=64,
    rhos_group=1,
    processed=True,
    stemming=False,
    stopwords=False,
    language="en",
    store_empty=True,
    topk=100,
    gamma = 1,
    learn_gamma = True
                ):
        self.directory = directory
        self.meta_params = meta_params
        self.T = T
        self.RUNS = RUNS
        self.SAVE_STEP = SAVE_STEP
        self.TRAINING_SHARE = TRAINING_SHARE
        self.BATCHSIZE = BATCHSIZE
        self.BATCHSIZE_GROUPS = BATCHSIZE_GROUPS
        self.BURNIN = BURNIN
        self.BURNIN_DOCUMENTS = BURNIN_DOCUMENTS
        self.INIT_RAND = INIT_RAND
        self.SAMPLE_ALPHA = SAMPLE_ALPHA
        self.BATCHSIZE_ALPHA = BATCHSIZE_ALPHA
        self.MIN_DICT_WORDS = MIN_DICT_WORDS
        self.alpha_0 = alpha_0
        self.alpha_1 = alpha_1
        self.epsilon = epsilon
        self.delta_fix = delta_fix
        self.rhokappa = rhokappa
        self.rhotau = rhotau
        self.rhos = rhos
        self.rhokappa_document = rhokappa_document
        self.rhotau_document = rhotau_document
        self.rhos_document = rhos_document
        self.rhokappa_group = rhokappa_group
        self.rhotau_group = rhotau_group
        self.rhos_group = rhos_group
        self.processed = processed
        self.stemming = stemming
        self.stopwords = stopwords
        self.language = language
        self.store_empty = store_empty
        self.topk = topk
        self.gamma = gamma
        self.learn_gamma = learn_gamma

    def run(self, RUNS = None):
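        #an explicit RUNS overrides the stored value, so that the
        #command line and check_run() refer to the same iteration
        if RUNS is not None:
            self.RUNS = RUNS;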
            
        print("Running HMDP topic model... (please wait)");

        if os.path.isdir(self.directory+"/output_HMDP"):
            shutil.rmtree(self.directory+"/output_HMDP") 
        if os.path.isdir(self.directory+"/cluster_desc"):
            shutil.rmtree(self.directory+"/cluster_desc") 

        if os.path.isfile(self.directory+"/groups"):
            os.remove(self.directory+"/groups")
        if os.path.isfile(self.directory+"/groups.txt"):
            os.remove(self.directory+"/groups.txt")
        if os.path.isfile(self.directory+"/text.txt"):
            os.remove(self.directory+"/text.txt")
        if os.path.isfile(self.directory+"/words.txt"):
            os.remove(self.directory+"/words.txt")
        if os.path.isfile(self.directory+"/wordsets"):
            os.remove(self.directory+"/wordsets")

        if not os.path.isfile("../promoss.jar"):
            print("Could not find ../promoss.jar. Exit")
            return;
        try:
            with subprocess.Popen(['java', '-jar', '../promoss.jar', 
                                '-directory', self.directory, 
                                '-meta_params', self.meta_params, 
                                '-T',str(self.T),
                                '-RUNS',str(self.RUNS),
                                '-SAVE_STEP',str(self.SAVE_STEP),
                                '-TRAINING_SHARE',str(self.TRAINING_SHARE),
                                '-BATCHSIZE',str(self.BATCHSIZE),
                                '-BATCHSIZE_GROUPS',str(self.BATCHSIZE_GROUPS),
                                '-BURNIN',str(self.BURNIN),
                                '-BURNIN_DOCUMENTS',str(self.BURNIN_DOCUMENTS),
                                '-INIT_RAND',str(self.INIT_RAND),
                                '-SAMPLE_ALPHA',str(self.SAMPLE_ALPHA),
                                '-BATCHSIZE_ALPHA',str(self.BATCHSIZE_ALPHA),
                                '-MIN_DICT_WORDS',str(self.MIN_DICT_WORDS),
                                '-alpha_0',str(self.alpha_0),
                                '-alpha_1',str(self.alpha_1),
                                '-epsilon',str(self.epsilon),
                                '-delta_fix',str(self.delta_fix),
                                '-rhokappa',str(self.rhokappa),
                                '-rhotau',str(self.rhotau),
                                '-rhos',str(self.rhos),
                                '-rhokappa_document',str(self.rhokappa_document),
                                '-rhotau_document',str(self.rhotau_document),
                                '-rhos_document',str(self.rhos_document),
                                '-rhokappa_group',str(self.rhokappa_group),
                                '-rhotau_group',str(self.rhotau_group),
                                '-rhos_group',str(self.rhos_group),
                                '-processed',str(self.processed),
                                '-stemming',str(self.stemming),
                                '-stopwords',str(self.stopwords),
                                '-language',str(self.language),
                                '-store_empty',str(self.store_empty),
                                '-topk',str(self.topk),
                                '-gamma',str(self.gamma),
                                '-learn_gamma',str(self.learn_gamma)
                                ], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) as p:

                #stderr is merged into stdout so that a full pipe cannot block the Java process
                for line in p.stdout:
                    line = line.decode("utf-8", errors="replace").rstrip("\r\n").replace("\t","   ")
                    print(line);

            #leaving the with-block waits for the process, so the return code is available here
            if p.returncode != 0:
                print("Finished with return code " + str(p.returncode));
        except OSError as e:
            #raised if the java executable cannot be started
            print("Could not start java: " + str(e))

    def check_run(self):
        if os.path.isdir(self.directory + "/output_HMDP/" + str(self.RUNS)):
            return True;
        else:
            print("Please call run() first");
            return False;
            
            
    #returns the command which we used to call the java file
    def get_command(self):
        args = ['java', '-jar', '../promoss.jar', 
                                '-directory', self.directory, 
                                '-meta_params', self.meta_params, 
                                '-T',str(self.T),
                                '-RUNS',str(self.RUNS),
                                '-SAVE_STEP',str(self.SAVE_STEP),
                                '-TRAINING_SHARE',str(self.TRAINING_SHARE),
                                '-BATCHSIZE',str(self.BATCHSIZE),
                                '-BATCHSIZE_GROUPS',str(self.BATCHSIZE_GROUPS),
                                '-BURNIN',str(self.BURNIN),
                                '-BURNIN_DOCUMENTS',str(self.BURNIN_DOCUMENTS),
                                '-INIT_RAND',str(self.INIT_RAND),
                                '-SAMPLE_ALPHA',str(self.SAMPLE_ALPHA),
                                '-BATCHSIZE_ALPHA',str(self.BATCHSIZE_ALPHA),
                                '-MIN_DICT_WORDS',str(self.MIN_DICT_WORDS),
                                '-alpha_0',str(self.alpha_0),
                                '-alpha_1',str(self.alpha_1),
                                '-epsilon',str(self.epsilon),
                                '-delta_fix',str(self.delta_fix),
                                '-rhokappa',str(self.rhokappa),
                                '-rhotau',str(self.rhotau),
                                '-rhos',str(self.rhos),
                                '-rhokappa_document',str(self.rhokappa_document),
                                '-rhotau_document',str(self.rhotau_document),
                                '-rhos_document',str(self.rhos_document),
                                '-rhokappa_group',str(self.rhokappa_group),
                                '-rhotau_group',str(self.rhotau_group),
                                '-rhos_group',str(self.rhos_group),
                                '-processed',str(self.processed),
                                '-stemming',str(self.stemming),
                                '-stopwords',str(self.stopwords),
                                '-language',str(self.language),
                                '-store_empty',str(self.store_empty),
                                '-topk',str(self.topk),
                                '-gamma',str(self.gamma),
                                '-learn_gamma',str(self.learn_gamma)
                                ];
        return (" ".join(args));
        
    #function to create topic maps based on JSON files by the HMDP topic model
    def map_from_JSON(self, base_folder = None, runs = None, color='auto', marker_size=10, show_map=False):

        if not self.check_run():
            return;
        
        if base_folder == None:
            base_folder = self.directory;
        if runs == None:
            runs = self.RUNS;
            
        topics = self.get_topics();
        k = 3;
        
        #we only create a map for the final run folder.
        #comment the next line to create maps for all folders
        final_run_folder = base_folder + "/output_HMDP/" + str(runs) +"/";

        #traverse folders containing geojson files
        folders = [x[0] for x in os.walk(final_run_folder) if x[0].endswith("_geojson")];
        
        if (len(folders)==0):
            print("No geoJSON data found. Does your model contain geographical metadata?");
            return;
        
        for folder in folders:
            print("opening folder "+folder+":");

            #Create new folium map class
            f_map = folium.Map(location=[50, 6], tiles='Stamen Toner', zoom_start=1);

            #traverse geoJSON files
            files = [f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f)) and f.endswith(".geojson")];
            
            topic_numbers = [""]*len(files);
            
            for i in range(0,len(files)):
                topic_numbers[i] = int(files[i].split("_")[1].split(".")[0]);
           
            files = [x for (y,x) in sorted(zip(topic_numbers,files))]
            
            i = 0;
            for file in files:
                print("processing "+file+" ...");

                with open(folder+'/'+file) as f:
                    geojson = json.load(f)

                icon_size = (14, 14)

                #name of the topic are the first three topic words
                name = "Topic "+str(i)+": "+" ".join(topics.iloc[i][0:k]);        
                #traverse geoJSON features
                feature_group = folium.FeatureGroup(name);
                for feature in geojson['features']:
                    #we get position, colour, transparency from JSON
                    lat, lon = feature['geometry']['coordinates'];
                    if color == 'auto':
                        fillColor = "#"+feature['properties']['fillColor'];
                    else:
                        fillColor = color;
                    fillOpacity = feature['properties']['fillOpacity'];
                    marker = folium.CircleMarker([lat, lon], 
                                                 fill_color=fillColor, 
                                                 fill_opacity=fillOpacity,
                                                 color = "none",
                                                 radius = marker_size)
                    feature_group.add_child(marker);

                f_map.add_child(feature_group);
                i=i+1;

            #add layer control to activate/deactivate topics
            folium.LayerControl().add_to(f_map);    
            #save map
            f_map.save(folder+'/topic_map.htm')
            print('created map in: '+folder+'/topic_map.htm');
            #show map only if wanted, can consume quite some memory
            if show_map:
                if not os.path.exists("tmp"):
                    os.makedirs("tmp");

                f_map.save("tmp/"+folder.split("/")[-1]+"_map.html");
                display(IFrame("tmp/"+folder.split("/")[-1]+"_map.html",width=400, height=400));
            display(HTML('<a href="file://'+folder+'/topic_map.htm'+'" target="_blank">Link to map of '+folder.split("/")[-1].replace("_geojson","")+'</a>'));

    #plot metadata (feature) weights
    def plot_zeta(self, directory=None, RUNS=None):
        
        if not self.check_run():
            return;
        
        if directory == None:
            directory = self.directory;
        if RUNS == None:
            RUNS = self.RUNS;
            
        fig = plt.figure();
        
        zeta_file = directory + "/output_HMDP/" + str(RUNS) +"/zeta";
        
        df = pd.read_csv(zeta_file, header=None);
        zeta = df.iloc[[0]].values[0];
        print(zeta);
               
        plt.bar(range(0,len(zeta)),zeta);
        plt.xticks(range(0,len(zeta)));
        plt.xlabel("Features");
        plt.ylabel("Feature weight");
        plt.show();
        
        return(fig);
    
    #read topics as DataFrame
    def get_topics(self, directory=None, RUNS=None):
        
        if not self.check_run():
            return;
        
        if directory == None:
            directory = self.directory;
        if RUNS == None:
            RUNS = self.RUNS;
            
        
        topic_file = directory + "/output_HMDP/" + str(RUNS) +"/topktopic_words";
        
        df = pd.read_csv(topic_file, header=None, sep=" ");
                
        return(df);
    
    #plot topic probabilities over time
    def plot_time(self, topic_ID, directory=None, RUNS=None):
        
        if not self.check_run():
            return;
        
        if directory == None:
            directory = self.directory;
        if RUNS == None:
            RUNS = self.RUNS;           
            
        topics = self.get_topics();
        k = min(3,len(topics.iloc[0]));
        
        #traverse folders containing time files
        time_files = [x for x in os.listdir(directory+"/cluster_desc/") if x.endswith("_L")];

       
        figs = [];
        
        for time_file in time_files:
            
            time_file = directory+"/cluster_desc/"+time_file;
            
            cluster_number = int(time_file.split("/")[-1][7:-2]);
        
            times = pd.read_csv(time_file, header=None, sep=" ");
            times = times[1];
            
            #print(times);
            
            first_time = min(times);
            last_time = max(times);            
            
            first_date = datetime.datetime.fromtimestamp(
                    int(first_time)
                    ).strftime('%d.%m.%Y');            
            last_date = datetime.datetime.fromtimestamp(
                    int(last_time)
                    ).strftime('%d.%m.%Y');

            cluster_file = directory + "/output_HMDP/" + str(RUNS) +"/clusters_"+str(cluster_number);
           
            probabilities = pd.read_csv(cluster_file, header=None);
            topic_probabilities = probabilities[topic_ID];

            topic_probabilities = [x for (y,x) in sorted(zip(times,topic_probabilities))]
            times = sorted(times);

            
            #name of the topic are the first three topic words
            name = "Topic "+str(topic_ID)+": "+" ".join(topics.iloc[topic_ID][0:k]);
            
            fig = plt.figure();
            
            #print(times)
            #print(topic_probabilities)
            
            plt.scatter(times, topic_probabilities);
            plt.xticks([first_time,last_time],[first_date,last_date]);
            plt.xlabel("Time");
            plt.ylabel("Topic probability");
            plt.legend([name]);
            plt.show();
            figs.append(fig);
        
        return(figs);
    
    #plot topic probabilities for ordinal data
    def plot_ordinal(self, topic_ID, directory=None, RUNS=None):
        
        if not self.check_run():
            return;
        
        if directory == None:
            directory = self.directory;
        if RUNS == None:
            RUNS = self.RUNS;           
            
        topics = self.get_topics();
        k = min(3,len(topics.iloc[0]));
        
        #traverse cluster description files of ordinal metadata
        cluster_files = [x for x in os.listdir(directory+"/cluster_desc/") if x.endswith("_O")];
       
        figs = [];
        
        for cluster_file in cluster_files:
            
            cluster_file = directory+"/cluster_desc/"+cluster_file;
            
            cluster_number = int(cluster_file.split("/")[-1][7:-2]);
        
            lines = pd.read_csv(cluster_file, header=None,names=["keys","values"], skiprows=1, sep=" ");
            keys = lines["keys"].values;
            values = lines["values"].values;
            
            #sort by keys
            [keys,values] = list(zip(*sorted(zip(keys,values))));
            
            #print(times);
            
            cluster_file = directory + "/output_HMDP/" + str(RUNS) +"/clusters_"+str(cluster_number);
           
            probabilities = pd.read_csv(cluster_file, header=None);
            topic_probabilities = probabilities[topic_ID];
            topic_probabilities = [x for (y,x) in sorted(zip(values,topic_probabilities))]
            
            #name of the topic are the first three topic words
            name = "Topic "+str(topic_ID)+": "+" ".join(topics.iloc[topic_ID][0:k]);
            
            fig = plt.figure();
            
            #print(times)
            #print(topic_probabilities)
            
            #copy the sorted values into a plain list for plotting
            value_array = list(values);
            
            plt.scatter(value_array, topic_probabilities);
            plt.xticks(values,keys);
            plt.xticks(rotation=90)
            plt.xlabel("Category");
            plt.ylabel("Topic probability");
            plt.legend([name]);
            plt.show();
            figs.append(fig);
        
        return(figs);