-
Notifications
You must be signed in to change notification settings - Fork 2
CSSR from http://masi.cscs.lsa.umich.edu/~crshalizi/CSSR/
License
barak/CSSR
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <meta name="GENERATOR" content="TtH 3.59"> <style type="text/css"> div.p { margin-top: 7pt;}</style> <style type="text/css"><!-- td div.comp { margin-top: -0.6ex; margin-bottom: -1ex;} td div.comb { margin-top: -0.6ex; margin-bottom: -.6ex;} td div.hrcomp { line-height: 0.9; margin-top: -0.8ex; margin-bottom: -1ex;} td div.norm {line-height:normal;} span.roman {font-family: serif; font-style: normal; font-weight: normal;} span.overacc2 {position: relative; left: .8em; top: -1.2ex;} span.overacc1 {position: relative; left: .6em; top: -1.2ex;} --></style> <title> ``Read Me'' for \CSSR </title> <h1 align="center">"Read Me" for <tt>CSSR</tt> </h1> <h3 align="center">Kristina Lisa Klinkner and Cosma Rohilla Shalizi<br /> kklinkner@gmail.com, cshalizi@stat.cmu.edu </h3> <h3 align="center">Last revised 11 September 2005 </h3> <div class="p"><!----></div> <h1>Contents </h1><a href="#tth_sEc1" >1 General</a><br /> <a href="#tth_sEc2" >2 Obtaining and Installing the Program</a><br /> <a href="#tth_sEc3" >3 Usage</a><br /> <a href="#tth_sEc4" >4 Some Suggestions About Parameters</a><br /> <a href="#tth_sEc5" >5 Known Issues</a><br /> <a href="#tth_sEc6" >6 Bug Reports, Fixes, Modifications</a><br /> <a href="#tth_sEc7" >7 Some Details on the Code</a><br /> <a href="#tth_sEc7.1" >7.1 Revision History</a><br /> <a href="#tth_sEc8" >8 Credits and Acknowledgments</a><br /> <div class="p"><!----></div> <h2><a name="tth_sEc1"> 1</a> General</h2> <div class="p"><!----></div> <tt>CSSR</tt> tries to infer the minimal Markovian model capable of generating a time-series, or set of time-series from the same source. The program implements the algorithm proposed in the paper ""Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", hereafter BC.<a href="#tthFtNtAAB" name="tthFrefAAB"><sup>1</sup></a> We won't describe the algorithm in any detail here (see BC for that), but the next two paragraphs say a little about what it produces and how it does it. <div class="p"><!----></div> The output of the algorithm is a set of states which form a Markov chain. Each state has a certain probability of emitting any of the symbols in the original time series. The current state and the symbol it emits fix the next state. (The states are "future-resolving", if you're from nonlinear dynamics, or "deterministic", if you're from automata theory.) Each state, moreover, corresponds to a distinct set of strings, in the following sense. If the state <i>A</i> contains a string <i>w</i>, and at time <i>t</i> the time-series ends with <i>w</i>, then at time <i>t</i> the Markov chain is in state <i>A</i>. The set of states, their transition probabilities and connections, is called the state machine. <div class="p"><!----></div> The algorithm uses a recursive inference procedure to find the simplest set of states with the above properties that can reproduce the statistical properties of the data. If we could give the algorithm an infinitely long time series, and let it consider infinitely long sub-strings, it would produce the <i>causal states</i> of the process, which are its ideal predictors (see BC for a formal definition). Since we have only finite data, there is always some probability that the inferred or estimated states are not the true causal states. Nonetheless, for the rest of this file, when we say "causal states", we mean the estimated causal states. <div class="p"><!----></div> <h2><a name="tth_sEc2"> 2</a> Obtaining and Installing the Program</h2> <div class="p"><!----></div> The code for <tt>CSSR</tt> can be obtained from <a href="http://bactra.org/CSSR/"><tt>http://bactra.org/CSSR/</tt></a>. If you'd like to set up a new archive for it, we'd appreciate hearing about it. <div class="p"><!----></div> From any site, download the CSSR-v0.1.tar.gz file for the code. When gunzipped and untarred, this will produce a directory called CSSR-v0.1, containing all the necessary header and source code files, a copy of this documentation, the release license, and a make file. Running <tt>make</tt> inside that directory will produce an executable, which should be moved to someplace in the command path. On most Unix systems, the following sequence of commands will create the executable and put it in the your <tt>bin</tt> directory, usually part of your command path. <pre> gunzip CSSR-v0.1.tar.gz tar xvf CSSR-v0.1.tar cd CSSR-v0.1 make cp CSSR ~/bin/ </pre> <div class="p"><!----></div> The code has been successfully compiled with gcc 3.1 on Macintosh OS X (10.2), with gcc 3.2 on Linux (Red Hat 9, SUSE 9), with gcc 3.3 on Macintosh OS X (10.3) and with Microsoft Visual C++ on Windows 98. On some systems, compilation may produce warnings about escape sequences or the use of deprecated headers. These can be safely ignored. <div class="p"><!----></div> <h2><a name="tth_sEc3"> 3</a> Usage</h2> <a name="usage"> </a> <div class="p"><!----></div> <tt>CSSR</tt> is a command-line program. <div class="p"><!----></div> <center><tt>CSSR</tt> <i>alphabetfile</i> <i>datafile</i> <i>maxlength</i> <tt>[-m] [-s siglevel] [-ch]</tt> </center> <div class="p"><!----></div> The first argument is the name of a file which contains all the symbols in the alphabet used by your data. The second argument is the name of the data file itself. Only one data file can be used, but it may contain multiple lines. The third argument is the maximum length of history to examine, which we will abbreviate by <i>L</i> here. The three trailing flags are optional arguments. Set the <tt>-m</tt> flag if the data-file contains multiple lines (see below). If the <tt>-s</tt> flag is set, it must be followed by a significance level; the default value is 0.001. (For more on setting parameters, see Section <a href="#parameters">4</a>.) If the <tt>-ch</tt> flag is set, the program will use the <font face="symbol">c</font ><sup>2</sup> significance test, rather than the default Kolmogorov-Smirnov test. <div class="p"><!----></div> In multiple-line mode (entered through the <tt>-m</tt> flag), each line of the data file is treated as a distinct time series from the same source. (Technically, as independent realizations of a single stochastic process.) The lines need not be of equal length. <div class="p"><!----></div> The program will create the following files after running, where <i>datafile</i> is the name of the file the data is in: <ol type="1"> <li> <i>datafile</i>_results <div class="p"><!----></div> </li> <li> <i>datafile</i>_info <div class="p"><!----></div> </li> <li> <i>datafile</i>_state_series <div class="p"><!----></div> </li> <li> <i>datafile</i>_inf.dot <div class="p"><!----></div> </li> </ol> <div class="p"><!----></div> (1) contains the information on the inferred states and their properties. For each state, it gives: <ul> <li> the histories of length <i>L</i><font face="symbol">-</font >1 and <i>L</i> in the state <div class="p"><!----></div> </li> <li> the probability that the state will emit different symbols (e.g. <i>P</i>(<i>a</i>) = <i>x</i>) and the states transitioned to when those symbols are emitted (e.g. <i>T</i>(<i>a</i>) = <i>s</i>) <div class="p"><!----></div> </li> <li> the observed probability of the state in the data-stream <div class="p"><!----></div> </li> </ul> <div class="p"><!----></div> (2) is the file containing the metrics run on the causal state machine. These are: <ul> <li> the number of states <div class="p"><!----></div> </li> <li> the statistical complexity (entropy of the states) <div class="p"><!----></div> </li> <li> the entropy rate <div class="p"><!----></div> </li> <li> three measures of the difference between the empirical distribution of symbol sequences, and that generated by the inferred causal state machine. These are: the divergence or relative entropy between the inferred and empirical distribution; the relative entropy rate, or increase per symbol in the divergence; the total variational distance ("variation") between the two distributions. <div class="p"><!----></div> </li> </ul> <div class="p"><!----></div> Note that the relative entropy and the relative entropy rate can be infinite; this indicates that the inferred model gives a probability of zero to a sequence in the data. <div class="p"><!----></div> Note that sometimes, when the relative entropy should be very small (order of 10<sup><font face="symbol">-</font >6</sup> bits or less), numerical rounding errors result in a negative number being calculated. In these cases, the program outputs zero. Similarly, the complexity of one-state machines is sometimes reported as <font face="symbol">-</font >0. <div class="p"><!----></div> (3) is the series of causal states in the data. That is, the program scans through the data, looks up which state the history-to-date is in, and writes the corresponding symbol to this file. What you see is then the trajectory through estimated causal state space of the data/process. Multiline data results in a multiline state-series file. <div class="p"><!----></div> (4) represents the state machine as a labeled directed graph, where each state has its own node, and each transition between states its own edge. The file is for use with the program <tt>dot</tt>, available from <a href="http://www.graphviz.org/"><tt>http://www.graphviz.org/</tt></a>. <div class="p"><!----></div> <h2><a name="tth_sEc4"> 4</a> Some Suggestions About Parameters</h2> <a name="parameters"> </a> <div class="p"><!----></div> It is always good to use as much data as you can. While it is generally good practice to hold back some data for testing or cross-validation, we recommend that this be minimized. High-entropy processes are especially data-hungry. (See BC.) For reference, let us call the number of data-points <i>N</i>. <div class="p"><!----></div> The two key parameters of the program are the maximum history length, <i>L</i>, and the significance level used in the test, <i>s</i>. For any given process, there is a minimum history length <font face="symbol">L</font >, such that the true states cannot be found if <i>L</i> < <font face="symbol">L</font >. The number of states returned may be less than the correct number <em>or higher</em>. If <i>L</i> <font face="symbol">³</font > <font face="symbol">L</font >, and there is enough data, there will generally be a "plateau" of values of <i>L</i> where the correct number of states is returned. For fixed <i>N</i>, if we keep increasing <i>L</i>, then past a certain point there are not enough examples of each string in the data. This tends to erroneously create new states, which spawn others through determinization. Thus there is generally a "blow-up" when <i>L</i> is too large (relative to <i>N</i> and <i>s</i>). A rough guide-line is to limit <i>L</i> to no more than log<sub>2</sub><i>N</i>/log<sub>2</sub><i>k</i>, where <i>k</i> is the alphabet size (see BC for details). <div class="p"><!----></div> In general, one should use as small an <i>L</i> as possible, since under-sampling, even before the blow-up, will reduce the accuracy of many probability estimates. Blow-up can be delayed by reducing <i>s</i> - that is, reducing the probability of mistakenly splitting a state - but this carries the risk of failing to create valid new states. We suggest exploring the data at low <i>L</i> and high <i>s</i> initially, and then increasing <i>L</i> and lowering <i>s</i>. If a stable architecture is found, it should be recorded at the lowest possible <i>L</i>. <div class="p"><!----></div> <h2><a name="tth_sEc5"> 5</a> Known Issues</h2> <div class="p"><!----></div> There are a few known issues with <tt>CSSR</tt> 's behavior. These arise from certain unavoidable approximations in the determinization procedure. (For details, see below.) <div class="p"><!----></div> After creating and determinizing the states, <tt>CSSR</tt> goes over the input data and attempts to produce the corresponding sequence of causal states, both as a filtering procedure, and as part of estimating the fit of the causal-state model (see Section <a href="#usage">3</a>, item (3)). Typically there will be some initial number of symbols which it must read before "synchronizing" to one of the causal states, after which it will follow the transitions deterministically, and never de-synchronize. Because of the approximations mentioned, it can happen that certain transitions are deleted from the causal state model which shouldn't be. This can lead to three kinds of problem: (1) <tt>CSSR</tt> can never synchronize to a state at all; (2) it has to re-synchronize at some point; (3) it encounters an apparently "impossible" sequence in the data. <div class="p"><!----></div> In case (1), <tt>CSSR</tt> writes the message "Error: Could not synchronize, cannot use this data" to standard error and halts execution. In case (2), it produces a warning message on both standard error and the <tt>_info</tt> output file. In case (3), it produces a warning, and discounts that particular string from various calculations. <div class="p"><!----></div> The best approach to these problems is to use a longer history length, if possible, and to provide more data. In the case of a third error, it can sometimes arise if a particular string appears only once, at the very beginning of the data, and sometimes removing that string from the data-file fixes matters. <div class="p"><!----></div> All three errors arise because we have only finite-length histories available to us, while what we want are really infinite ones. This forces us to make certain approximations in our implementation of the theory. Specifically, in the determinization procedure, we are forced to "guess" which state certain strings belong to, even though we have not directly examined these strings. The particular approximation scheme (or "closure") we have adopted may be investigated by consulting the code in AllStates.cpp. (Others are possible, but this one seemed to give the best over-all results.) Sometimes this closure will "guess wrong", and possible transitions will be labeled forbidden, etc. In these cases, extending the history length <em>should</em> solve the problem, if enough data is available for reliable inference. Similar approximations must be made in determining whether or not a given state is transient or recurrent on the basis of finite data. This occasionally leads to a recurrent state being labeled transient, which in turn is the most common cause of the code mistaking an actually-occurring string for an impossible one. Again, the best approach is to provide more data, and a longer history. <div class="p"><!----></div> <h2><a name="tth_sEc6"> 6</a> Bug Reports, Fixes, Modifications</h2> <div class="p"><!----></div> We welcome bug reports or reports of strange behavior. These reports are welcomed with more enthusiasm when accompanied by successful modifications to the code! (See the accompanying file on the Gnu Public License for information about modifying the code.) Even if you can't fix it, however, please do tell us about it; at the least it will be documented for other users. <div class="p"><!----></div> If you modify <tt>CSSR</tt> , and want to make the resulting program available, please let us know. We are happy to provide a link, and have a (limited) capability to host alternate versions and descendants. Also, if you use <tt>CSSR</tt> successfully in some application, we'd love to hear about it. <div class="p"><!----></div> Please check <a href="http://bactra.org/CSSR/"><tt>http://bactra.org/CSSR/</tt></a> for up-to-date contact information. <div class="p"><!----></div> <h2><a name="tth_sEc7"> 7</a> Some Details on the Code</h2> There are twelve classes in the program. They are each comprised of a .cpp file and a .h file, except for ArrayElem (contained in G_Array) and StringElem (contained in State), as well as a source file Main, and header files Common.h and Main.h. <br /> <div class="p"><!----></div> <table> <tr><td>AllStates </td><td>Contains and manipulates growable array of states</td></tr> <tr><td>ArrayElem </td><td>Each element in the growable array </td></tr> <tr><td>G_Array </td><td>A generic growable array</td></tr> <tr><td>Hash </td><td>Hash table which points from histories to their parent states</td></tr> <tr><td>Hash2 </td><td>Hash table which points from indices to symbol/alpha values</td></tr> <tr><td>Machine </td><td>Manipulates the determinized state machine and runs metrics</td></tr> <tr><td>ParseTree </td><td>Reads in data file and stores all strings present in file</td></tr> <tr><td></td><td>up to length <i>L</i> the maximum length input at the command line</td></tr> <tr><td>States </td><td>A state, contains all data for a single state</td></tr> <tr><td>StringElem </td><td>the element containing the history for a single state</td></tr> <tr><td>Test </td><td>Performs statistical significance tests</td></tr> <tr><td>TransTable </td><td>Stores initially estimated transitions from all histories of length </td></tr> <tr><td></td><td><i>L</i> in any given state. This class is used by AllStates to check for</td></tr> <tr><td></td><td>transient states before determinization. </td></tr></table> <div class="p"><!----></div> For brief descriptions of the classes, see the top of their source files. Note that the terms "string" and "history" are used interchangeably in the program. The terms both correspond to the concept of a "history" (as described in BC), but the program implements these as strings of symbols. <div class="p"><!----></div> Also, after initial state splitting, all strings/histories of less than maximum length minus one are deleted. This has no effect on the outcome of the algorithm and saves time and space. <div class="p"><!----></div> Lastly, the removal of transient states prior to determinization implemented in the <tt>AllStates::CheckConnComponents</tt> procedure is not strictly necessary. With a different implementation, the deletion of these states could automatically occur during the determinization process itself (see the pseudocode in BC for details), and the outcome of the algorithm would be the same. As it is implemented here the code is slightly redundant. <div class="p"><!----></div> <h3><a name="tth_sEc7.1"> 7.1</a> Revision History</h3> <div class="p"><!----></div> <table> <tr><td>0.1 </td><td>11 September 2005 </td><td>Minor bug-fixes to multi-line mode</td></tr> <tr><td>0.1 </td><td>7 August 2005 </td><td>Corrects minor numerical bugs. These only affected</td></tr> <tr><td></td><td></td><td>the calculations of relative entropy, relative entropy</td></tr> <tr><td></td><td></td><td>rate and total variation distance, and then only by</td></tr> <tr><td></td><td></td><td>about <i>O</i>(<i>k</i>/<i>N</i>).</td></tr> <tr><td>0.0 </td><td>7 August 2003 </td><td>First public release. </td></tr></table> <div class="p"><!----></div> <h2><a name="tth_sEc8"> 8</a> Credits and Acknowledgments</h2> <div class="p"><!----></div> <tt>CSSR</tt> was written by KLK, with some help in debugging and testing from CRS; this documentation was jointly written. Parts of the <tt>CSSR</tt> code were written at the Santa Fe Institute and at the University of San Francisco. Support at SFI was provided by SFI's core grants from the National Science Foundation and the MacArthur Foundation, by the NSF's Research Experience for Undergraduates program, and by the Dynamics of Learning group, under DARPA contractual agreement F30602-00-2-0583. Support at USF came from the Clare Boothe Luce Foundation. <div class="p"><!----></div> <hr /><h3>Footnotes:</h3> <div class="p"><!----></div> <a name="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a>Cosma Rohilla Shalizi and Kristina Lisa Shalizi, pp. 504-511 of Max Chickering and Joseph Halpern (eds.), <em>Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference</em>, available from <a href="http://arxiv.org/abs/cs.LG/0406011"><tt>http://arxiv.org/abs/cs.LG/0406011</tt></a>. <br /><br /><hr /><small>File translated from T<sub><font size="-1">E</font></sub>X by <a href="http://hutchinson.belmont.ma.us/tth/"> T<sub><font size="-1">T</font></sub>H</a>, version 3.59.<br />On 9 Aug 2005, 19:11.</small> </html>
About
CSSR from http://masi.cscs.lsa.umich.edu/~crshalizi/CSSR/
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published