## Networks of features based on co-occurrence

The ``features`` module in the ``tethne.networks`` subpackage contains a few functions for generating networks of features based on co-occurrence.

In [None]:
from tethne.networks import features

The ``transform`` method allows us to transform the values from one featureset using a custom function. One popular transformation for wordcount data is the [term frequency * inverse document frequency (tf\*idf)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) transformation. tf\*idf weights the wordcounts in each document by how rare each word is across the corpus as a whole, which is meant to foreground the words that are most "important" for each document.

The default behavior of ``transform`` is to apply the tf\*idf transformation. Just as in other feature-manipulating methods, the first argument, ``'abstractTerms'``, is the featureset to be transformed, and the second argument, ``'absTFIDF'``, is the featureset to be generated as a result of the manipulation.

In [None]:
MyWoSCorpus.transform('abstractTerms', 'absTFIDF')
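To make the weighting concrete, here is a minimal sketch of one common tf\*idf variant (raw count times the log of inverse document frequency) on a made-up three-document corpus. The data, the ``tfidf`` helper, and the exact formula are assumptions for illustration; the formula that ``transform`` applies internally may differ in its details.

```python
from __future__ import division, print_function
import math

# Toy corpus (hypothetical data): each document maps a term to its raw count.
docs = {
    'paper1': {'gene': 4, 'cell': 1},
    'paper2': {'cell': 2, 'model': 3},
    'paper3': {'cell': 1, 'gene': 1, 'network': 5},
}

N = len(docs)  # Total number of documents.

# Document frequency: the number of documents in which each term occurs.
doc_freq = {}
for counts in docs.values():
    for term in counts:
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tfidf(count, df, n_docs):
    """Raw count weighted by log inverse document frequency (one common variant)."""
    return count * math.log(n_docs / df)

weighted = {
    doc_id: {term: tfidf(c, doc_freq[term], N) for term, c in counts.items()}
    for doc_id, counts in docs.items()
}

for doc_id in sorted(weighted):
    print(doc_id, sorted(weighted[doc_id].items()))
```

In this toy corpus ``'cell'`` appears in every document, so its weight drops to zero everywhere, while ``'network'``, which appears in only one document, receives the largest weight.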

We can specify some other transformation by first defining a transformer function, and then passing it as a third argument to ``transform``. A transformer function should accept the following parameters, and return a single numerical value (int or float).

| Parameter | Description                                                       |
| --------- | ----------------------------------------------------------------- |
| ``s``     | Representation of the feature (e.g. string).                      |
| ``c``     | Value of the feature in the document (e.g. frequency).            |
| ``C``     | Value of the feature in the ``Corpus`` (e.g. global frequency).   |
| ``DC``    | Number of documents in which the feature occurs.                  |
| ``N``     | Total number of documents in the ``Corpus``.                      |

For example:

In [None]:
def mytransformer(s, c, C, DC, N):
    """
    Doubles the feature value and divides by the overall value in the Corpus.
    """
    return float(c)*2/(C)
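Before handing a transformer to ``transform``, it can be useful to call it directly with made-up values and confirm that it returns a single number. The argument values below are hypothetical: a term that occurs 3 times in one document, 12 times in the whole corpus, and in 5 of 100 documents.

```python
from __future__ import print_function

def mytransformer(s, c, C, DC, N):
    """
    Doubles the feature value and divides by the overall value in the Corpus.
    """
    return float(c)*2/(C)

# Hypothetical inputs: s='gene', c=3, C=12, DC=5, N=100.
value = mytransformer('gene', 3, 12, 5, 100)
print(value)    # 0.5
```

This particular transformer ignores ``s``, ``DC``, and ``N``, but it still accepts all five parameters listed in the table above.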

We can then pass the transformer function to ``transform`` as the ``transformer`` keyword argument.

In [None]:
MyWoSCorpus.transform('abstractTerms', 'absTransformed', transformer=mytransformer)

In [None]:
# Value for a particular feature and paper.
print 'Before: ', MyWoSCorpus.features['abstractTerms']['papers'][2773][0]
print 'TF*IDF: ', MyWoSCorpus.features['absTFIDF']['papers'][2773][0]
print 'mytransformer: ', MyWoSCorpus.features['absTransformed']['papers'][2773][0]

The ``mutual_information`` function in the ``features`` module generates a network based on the [pointwise mutual information](http://en.wikipedia.org/wiki/Pointwise_mutual_information) of each pair of features in a featureset.

As with most other network-building functions, the first argument is a list of Papers. The second argument is the featureset that we wish to use. The third argument, ``indexed_by``, is the field used to index the Papers in the corpus. This is important, because it tells the function how to look up Papers in the featureset. Finally, the fourth argument (``percentile``) tells the function how selective to be in drawing edges: ``percentile=99`` means that only PMI scores at or above the 99th percentile will result in a network edge.

In [None]:
MI_graph = features.mutual_information(MyWoSCorpus.all_papers(), MyWoSCorpus.features['absTFIDF'],
                                       indexed_by='wosid', percentile=99)
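For intuition about what the scores and the cutoff mean, the sketch below computes pointwise mutual information, log[p(x, y) / (p(x) p(y))], from document co-occurrence in a toy corpus and then keeps only the pairs at or above a percentile cutoff. This is an illustration under stated assumptions, not Tethne's implementation: the data, the helper names (``probability``, ``pmi``, ``percentile_cutoff``), and the nearest-rank percentile rule are all hypothetical.

```python
from __future__ import division, print_function
import math
from itertools import combinations

# Toy data (hypothetical): the set of features present in each document.
docs = [
    {'gene', 'cell', 'network'},
    {'gene', 'cell'},
    {'network', 'model'},
    {'gene', 'cell', 'model'},
]
N = len(docs)

def probability(*features):
    """Fraction of documents that contain all of the given features."""
    return sum(1 for d in docs if all(f in d for f in features)) / N

def pmi(x, y):
    """log[p(x, y) / (p(x) p(y))]; None when x and y never co-occur."""
    p_xy = probability(x, y)
    if p_xy == 0:
        return None
    return math.log(p_xy / (probability(x) * probability(y)))

def percentile_cutoff(values, percentile):
    """Nearest-rank percentile (an assumed rule, chosen for simplicity)."""
    ordered = sorted(values)
    rank = int(math.ceil(percentile / 100 * len(ordered)))
    return ordered[max(rank, 1) - 1]

# Score every pair of features that co-occurs at least once.
vocabulary = sorted(set().union(*docs))
scores = {}
for x, y in combinations(vocabulary, 2):
    score = pmi(x, y)
    if score is not None:
        scores[(x, y)] = score

# Keep only the pairs whose PMI is at or above the cutoff.
cutoff = percentile_cutoff(scores.values(), 80)
edges = sorted(pair for pair, s in scores.items() if s >= cutoff)
print(edges)
```

With a cutoff at the 80th percentile, only two of the six co-occurring pairs survive as edges here; raising the percentile makes the resulting network sparser.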

Take a look at the ratio of nodes to edges to get a sense of how to tune the ``percentile`` parameter. If you have an extremely high number of edges for the number of nodes, then you should probably increase ``percentile`` to obtain a more legible network. Depending on your field, you may have some guidance from theory as well.

In [None]:
print 'This graph has {0} nodes and {1} edges'.format(len(MI_graph.nodes()), len(MI_graph.edges()))
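One way to put a number on "too many edges" is graph density: the fraction of all possible edges that are actually present. The helper below is a minimal sketch that assumes an undirected graph without self-loops; the ``density`` name and the example counts are illustrative and are not taken from ``MI_graph``.

```python
from __future__ import division, print_function

def density(n_nodes, n_edges):
    """Fraction of possible undirected edges (no self-loops) that are present."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

# Hypothetical counts: 100 nodes and 150 edges.
print('Density: {0:.4f}'.format(density(100, 150)))
print('Edges per node: {0:.2f}'.format(150 / 100))
```

A density near 1 means that almost every pair of features is connected, which usually signals that ``percentile`` is set too low to produce a readable network.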

Once again, we'll use the GraphML writer to generate a visualizable network file.

In [None]:
from tethne.writers import graph

In [None]:
mi_outpath = '/Users/erickpeirson/tethne-notebooks/output/mi_graph.graphml'

In [None]:
graph.to_graphml(MI_graph, mi_outpath)

![](http://devo-evo.lab.asu.edu/methods/tethne/images/pmi_99.png)