# Content proximity, audience proximity

So the model, in a sense, captures a set of graph proximities among outlets - it tells us which pairs of outlets produce headlines that look similar and which produce headlines that look different, and it makes it possible to "score" these proximities in a precise and continuous way.

The interesting thing about looking at just the headlines in isolation is that they allow us to model the relative similarities among outlets purely at the level of the actual content that's being produced (as instantiated in headlines) - a sort of sealed linguistic graph of media sources. But, alongside this linguistic graph, we can also imagine news outlets to be linked together in a kind of "social" or "audience" graph, a function of the degree to which two outlets are read / watched / engaged with by the same cohort of people.

"Audience" could be modeled in a number of ways - the set of people who follow the official Twitter account for an outlet; the set of people who read an article or watched a broadcast from an outlet in the last week; the set of people who pay for a subscription to an outlet; etc. But, however we define audience, the point is that the audiences for any two outlets will always overlap to a greater or lesser degree - and this level of overlap will vary for different pairs of outlets.

For example, using the Twitter social graph as a representation of audience - XX people follow @breitbart, XX follow @dailycaller. Of the Breitbart followers, XX also follow @dailycaller; and, of the Daily Caller followers, XX also follow @breitbart - a very high level of overlap. A simple way to formalize this overlap between the two audiences is just to take the Jaccard similarity between the two cohorts of users, the size of the set intersection divided by the size of the set union: XX. By comparison, if we compare the followers of the @breitbart account to the followers of, say, @npr, the Jaccard similarity is just XX - very few users follow both Breitbart and NPR. In the same way that we can model linguistic similarity between the headlines produced by two outlets, we can also score the degree to which their audiences are similar or different.
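As a minimal sketch of the Jaccard calculation described above: the function below takes two collections of user IDs and returns the size of their intersection divided by the size of their union. The follower sets here are made-up integers used purely for illustration, not real Twitter data, and the names `jaccard`, `followers_x`, `followers_y`, and `followers_z` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union.

    Defined here as 0.0 when both sets are empty (an assumption made
    for convenience, to avoid dividing by zero).
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Made-up follower IDs, for illustration only.
followers_x = {1, 2, 3, 4, 5}
followers_y = {3, 4, 5, 6}
followers_z = {6, 7, 8}

sim_xy = jaccard(followers_x, followers_y)  # 3 shared users out of 6 total -> 0.5
sim_xz = jaccard(followers_x, followers_z)  # no shared users -> 0.0
```

Because the score is normalized by the union, it stays comparable across pairs of outlets with very different follower counts, though a very small account compared against a very large one will always score low.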

So, two types of "proximity" - proximity of language/content/style, and proximity of audience. Of course, we'd expect these two modes of similarity to be closely related in various (and complex) ways. In the simplest sense - almost all of the outlets under consideration (with the exception of RT, Sputniknews, and NPR, which receive direct government support) are commercial entities that have a profit motive (or at minimum, just a financial imperative) to produce content that will "succeed" with the outlet's audience, at least enough to pay the bills. Of course, profit motive isn't the whole story - we can assume that news organizations and individual journalists are driven by some complex function of commercial incentives, journalistic values, civic / political / ideological goals, and simple personal interest, and that these vary considerably from outlet to outlet.

But, whatever these goals, the question of what's being covered and how it's being covered - proxied here by the headlines - is always wound up, at least to some extent, with an *audience*. Articles aren't just cooked up in the abstract and fired off into the ether - they're produced in the context of some real (or intended) readership, viewership, listenership. If we follow Dor in thinking of headlines as "relevance" optimizers, then this notion of relevance of course entails a set of assumptions (if only implicit) about who is consuming the content, or who might be interested in the content. Different types of content, and different framings of the same content, will be more and less relevant to different audiences; and so any specific piece of content, encoding a set of editorial and stylistic decisions - a headline, a sentence, a whole article - will optimize relevance for some audiences at the expense of others.

Content and audience, then, exist in a kind of circular, push-and-pull relationship - different types of content presumably attract different audiences. Articles critical of Robert Mueller, for instance, might attract a conservative audience. But also, in the other direction, the existence of a particular type of audience can surely exert a reciprocal pressure on the coverage itself - if the NYT learns that thousands of its readers, at any given moment, are interested in buying apartments in New York, then the NYT becomes incentivized to write about the NY real estate market. Content shapes audience, and audience shapes content.

We might expect, then, a correlation between similarity of *content* and similarity of *audience*. And in some cases, this clearly holds. For example, it seems to hold for Breitbart, Daily Caller, and NPR - Breitbart and Daily Caller have highly overlapping audiences, and, as we saw before, they produce headlines that are similar, in the sense operationalized by the classifier - the model has a comparatively hard time telling them apart. And, likewise, Breitbart and NPR have very low overlap in audience, and highly differentiable headlines - audience similarity is low, and content similarity is also low. So, at a conceptual level, if we represent the accuracy of the headline classifier as a three-node graph, where a short edge represents a high level of similarity and a long edge represents low similarity, we might have something like:

XXX

And, the corresponding audience graph, based on the Jaccard similarities over the account followers:

XXX

Which looks very similar - content and audience are closely correlated. But how true is this generally? What we have, by way of the headline classifier and the Twitter social graph, is essentially a way to model both of these graphs - content and audience - in a fairly consistent and highly sampled way. If we model the complete content and audience graphs for the full set of 15 outlets from before - how strong, and how consistent, is this correlation? Does it always hold, or - maybe most interesting - are there places where it breaks down, where there is a mismatch between proximity of content and proximity of audience? That is - are there groups of outlets that produce similar headlines, but have very different audiences; or, vice versa, outlets that have very similar audiences, but produce very different headlines?

How similar are the content and audience graphs? And if they diverge - why?

# Modeling the audience graph

So, two "graphs," content and audience. To what extent, and how, are they similar or different? As a first step, we need to build complete representations of both of these graphs - which, it turns out, involves a number of modeling decisions that need to be explored. (Though, as we'll eventually see, the high-level results are similar regardless of how the graphs are constructed.)

The audience graph, arguably, is the simpler of the two. Essentially - Twitter users interact with media outlets, either by following an account or by directly engaging with content (eg, tweeting a link to an article). This allows us to assign a set of users to each outlet, possibly weighted in some way that captures the extent of each user's engagement - eg, to model the difference between a user who has tweeted one article versus 100. And then, since any individual user can be associated with any number of outlets, we can simply measure the degree to which the audiences of any two outlets overlap, using basic measurements of set similarity.
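One way the weighted version might look, as a rough sketch: each outlet's audience is a mapping from user to an engagement count, and overlap is scored with a weighted Jaccard similarity (sum of per-user minimums over sum of per-user maximums). This particular weighting is an assumption chosen for illustration, not necessarily the measure used here; the `events` list, user IDs, and outlet names are all hypothetical.

```python
from collections import Counter


def weighted_jaccard(x, y):
    """Weighted Jaccard: sum of min weights / sum of max weights per user.

    x and y map user -> non-negative engagement count. With all weights
    equal to 1 this reduces to the ordinary set-based Jaccard similarity.
    """
    users = set(x) | set(y)
    num = sum(min(x.get(u, 0), y.get(u, 0)) for u in users)
    den = sum(max(x.get(u, 0), y.get(u, 0)) for u in users)
    return num / den if den else 0.0


# Hypothetical (user, outlet) engagement events, eg one per tweeted link.
events = [
    ("u1", "outlet_a"), ("u1", "outlet_a"), ("u1", "outlet_b"),
    ("u2", "outlet_a"),
    ("u3", "outlet_b"), ("u3", "outlet_b"),
]

# Build outlet -> Counter(user -> number of events).
audience = {}
for user, outlet in events:
    audience.setdefault(outlet, Counter())[user] += 1

score = weighted_jaccard(audience["outlet_a"], audience["outlet_b"])
```

A follow relationship can be treated as the special case where every weight is 1, so the same function would cover both an unweighted follower graph and a count-weighted engagement graph.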

- followship graph
- link actor graph
- how similar?

# Modeling the headline graph

- something of an impedance mismatch. eg, with the multiclass model, the confusion matrix is asymmetric - the headline graph is directed, whereas the actor graph is undirected.
- how to get undirected, pairwise similarities?
- can do correlations of probability mass - Pearson, Spearman
- and, can sort of map misclassifications onto user multiple-membership. eg, can boil out pairwise confusions between outlets A and B, just looking at the masses assigned to A and B
- but, maybe simpler just to directly model the pairwise comparisons - sample 10 size-balanced for each pair, 10-fold CV on linear SVC, then take misclassifications as "shared" headlines, take Jaccard similarity
- compare similarity rankings induced by different approaches
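To make the "boil out pairwise confusions" idea concrete, here is one possible sketch. It assumes a multiclass confusion matrix stored as nested dicts, where `conf[i][j]` is the count (or probability mass) of headlines from outlet `i` assigned to outlet `j`. For each pair, only the mass assigned to A or B is kept, and the two directions of confusion are added together, which gives a single undirected score per pair. The matrix values and outlet names below are invented for illustration, and this is only one of several ways the asymmetry could be collapsed.

```python
def pairwise_confusion(conf, a, b):
    """Undirected confusion between outlets a and b.

    Looks only at the 2x2 sub-matrix for a and b: the mass confused in
    either direction, divided by all mass assigned to a or b.
    """
    cross = conf[a][b] + conf[b][a]
    total = conf[a][a] + conf[a][b] + conf[b][a] + conf[b][b]
    return cross / total if total else 0.0


def undirected_similarities(conf):
    """Return {(a, b): score} for every unordered pair of outlets."""
    outlets = sorted(conf)
    return {
        (a, b): pairwise_confusion(conf, a, b)
        for i, a in enumerate(outlets)
        for b in outlets[i + 1:]
    }


# Invented confusion counts: rows are true outlets, columns are predictions.
conf = {
    "a": {"a": 80, "b": 15, "c": 5},
    "b": {"a": 25, "b": 70, "c": 5},
    "c": {"a": 2, "b": 8, "c": 90},
}

sims = undirected_similarities(conf)
```

Under the direct pairwise setup, the same 2x2 calculation would apply to the held-out predictions of each binary classifier: if misclassified headlines are treated as belonging to both outlets, the Jaccard similarity of the two headline sets works out to the pair's misclassification rate.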

# (Dis)similarities between content and audience

- two sets of similarity scores, one for headlines, one for audience
- center these and scale to unit variance; compare distributions
- subtract (headline sim - audience sim)
- high audience overlap, low headline overlap (nyt/hill)
- more interesting - high headline overlap, low audience overlap (dailykos/caller, fox/ap)
- "triples"?

- change over time
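The standardize-and-subtract step in the notes above could be sketched roughly as follows. Both sets of pairwise scores are converted to z-scores (zero mean, unit variance), and the audience score is subtracted from the headline score for each pair. Large positive gaps then flag pairs with similar headlines but dissimilar audiences, and large negative gaps flag the reverse. The function names and the toy scores are hypothetical placeholders, not results.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Center scores to zero mean and scale to unit (population) variance."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {k: 0.0 for k in scores}
    return {k: (v - mu) / sigma for k, v in scores.items()}


def similarity_gaps(headline_sim, audience_sim):
    """Return (gap, pair) tuples sorted from largest to smallest gap.

    gap = z-scored headline similarity minus z-scored audience similarity,
    computed over the pairs present in both inputs.
    """
    pairs = headline_sim.keys() & audience_sim.keys()
    z_hl = standardize({p: headline_sim[p] for p in pairs})
    z_aud = standardize({p: audience_sim[p] for p in pairs})
    return sorted(((z_hl[p] - z_aud[p], p) for p in pairs), reverse=True)


# Placeholder scores for three made-up outlet pairs.
headline_sim = {("a", "b"): 0.30, ("a", "c"): 0.10, ("b", "c"): 0.20}
audience_sim = {("a", "b"): 0.05, ("a", "c"): 0.15, ("b", "c"): 0.10}

gaps = similarity_gaps(headline_sim, audience_sim)
```

Standardizing first matters because the two similarity measures live on different scales; without it, the subtraction would mostly reflect whichever measure has the larger raw range.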