Collective Order

Wang Cheng-Jun edited this page Dec 19, 2016 · 1 revision

计算传播学是计算社会科学的重要分支。它主要关注人类传播行为的可计算性基础,以传播网络分析、传播文本挖掘、数据科学等为主要分析工具,(以非介入地方式)大规模地收集并分析人类传播行为数据,挖掘人类传播行为背后的模式和法则,分析模式背后的生成机制与基本原理,可以被广泛地应用于数据新闻和计算广告等场景,注重编程训练、数学建模、可计算思维。

Clone this wiki locally

We propose a geometric model to quantify the dynamics of attention in online communities. Using clicks as a proxy of attention, we find that the diffusion of collective attention in Web forums and news sharing sites forms time-invariant “fields” whose density vary solely with distance from the center of the fields that represents the input of attention from the physical world. As time goes by, old information pieces are pushed farther from the center by new pieces, receive fewer and fewer clicks, and eventually become invisible in the virtual world. The discovered “attention fields” not only explain the fast decay of attention to information pieces, but also predict the accelerating growth of clicks against the active user population, which is a universal pattern relevant to the economics of scales of online interactions. The Hidden Geometry of Attention Diffusion. http://arxiv.org/abs/1501.06552

==阅读新闻的顺序== 我们想要描述的是新闻浏览的顺序(order)本身的规律。下图横轴是200个个体浏览新闻的顺序,纵轴是新闻的年龄(从新闻产生到用户阅读它经历的时间)。我们可以观察到两种主要的形式:一种是不太活跃,仅仅围绕着最近三天的新闻反复波动;另一种是非常活跃阅读足够多的新闻并且越来越倾向于阅读老新闻。'''换言之:当用户付出较少的注意力的时候,仅仅阅读最近三天的新闻;当用户付出较多注意力时,就会越来越倾向于阅读老新闻。'''

文件:News age and voting sequence.jpg

对于个体用户行为,我们可以采用这种粗糙的测度。对于新闻本身,我们如何刻画群体行为的特征?

==注意力流网络==

文件:File-Exampleflownetwork.PNG

我们可以采用构造注意力流网络的方法来解决这个问题(参考 注意力网络介绍)。

这种构建网络的方法:

  • 节点是内容(如帖子、新闻、微博等)
  • 节点间的链接表示从一个节点流向另一个节点的流的数量。
  • 加入source和sink来平衡网络当中的流,使得对于每个不是source和sink的节点,其流入量等于流出量。更重要的意义在于,可以很明确地表明节点与环境的流量交换。 注意力流网络可以采用流量矩阵进行表示(见 流动网络)。

文件:Flownetdistance example.png import networkx as nx %matplotlib inline import matplotlib.pyplot as plt import numpy as np from scholarNetwork import scholarNetwork

G=nx.DiGraph() G.add_edge('source',1,weight=80) G.add_edge(1,2,weight=50) G.add_edge(1,3,weight=30) G.add_edge(3,2,weight=10) G.add_edge(2,4,weight=20) G.add_edge(2,5,weight=30) G.add_edge(4,5,weight=10) G.add_edge(5,3,weight=5) G.add_edge(2,'sink',weight=10) G.add_edge(4,'sink',weight=10) G.add_edge(3,'sink',weight=25) G.add_edge(5,'sink',weight=35)

fig = plt.figure(figsize=(9, 9),facecolor='white') ax = fig.add_subplot(111) scholarNetwork.plotTree(G,ax) plt.show()

gb = scholarNetwork.flowBalancing(G) scholarNetwork.flowDistanceFromSource(gb)

{1: 1.0, 2: 2.223639455782313, 3: 2.352324263038549, 4: 3.2236394557823136, 5: 3.4736394557823136, 'sink': 3.935728458049887}

==流距离==

We define flow distance l_i as the average number of steps '''users''' takes from source to the ith node.

流距离(Flow distance)是测量节点间路径长度的一种方法,它具体定义为流经第i个节点的用户从source到第i个节点所经历的平均路径长度。

文件:Flowdistance.png

  • 从source到节点1的流通量为80,即有80个人流经节点1。这些人从source到节点1经过的路径长度为1,流距离也是l_1 = 1。
  • 从source到节点3的流通量为30,即有30个人流经节点3。这些人从source到节点3经过的路径长度为2,流距离也是l_3 = 2。
  • 从source到节点2的流通量为60,其中50人经过节点1流到节点2(经过路径长度为2),10人经过节点1和节点3到达节点2(经过路径长度为3),流距离l_2 = (502 + 103)/60 = 2.17。

流距离表明了节点的平均被访问顺序,因而是测量新闻阅读顺序的非常好的测量方式。我们发现:流距离揭示了新闻阅读的隐秩序。隐秩序是Holland著作hidden order的中文译法。Holland试图勾勒分析复杂适应系统的一整套方法,这套方法在探索层面是基于计算机仿真多主体建模展开的,最终达到数学形式的机制描述。此处与之不同。

flow network functions

def flowBalancing(G): RG = G.reverse() H = G.copy() def nodeBalancing(node): outw=0 for i in G.edges(node): outw+=G[i[0]][i[1]].values()[0] inw=0 for i in RG.edges(node): inw+=RG[i[0]][i[1]].values()[0] deltaflow=inw-outw if deltaflow > 0: H.add_edge(node, "sink",weight=deltaflow) elif deltaflow < 0: H.add_edge("source", node, weight=abs(deltaflow)) else: pass for i in G.nodes(): nodeBalancing(i) if ("source", "source") in H.edges(): H.remove_edge("source", "source")
if ("sink", "sink") in H.edges(): H.remove_edge("sink", "sink") return H

def flowDistanceFromSource(G): #input a balanced nx graph R = G.reverse() mapping = {'source':'sink','sink':'source'} H = nx.relabel_nodes(R,mapping) #---------initialize flow distance dict------ L = dict((i,1) for i in G.nodes()) #---------prepare weighted out-degree dict------ T = G.out_degree(weight='weight') #---------iterate until converge------------ ls = np.array(L.values()) delta = len(L)0.01 + 1 k=0 while delta > len(L)0.01: k+=1 if k>20: break for i in L: l=1 for m,n in H.edges(i): l+=L[n]*H[m][n].values()[0]/float(T[m]) L[i]=l delta = sum(np.abs(np.array(L.values()) - ls)) ls = np.array(L.values()) #---------clean the result------- if 'sink' in L: del L['sink'] for i in L: L[i]-=1 L['sink'] = L.pop('source') return L

==新闻的生命周期==

文件:News life cycle.png

===计算累计点击量C的方法=== 文件:Tree.png

'''问题:如何测量归一化的累计点击量C?它和累计点击量V_L之间具有什么关系?'''

我们发现累计点击量C和流距离L之间符合Gompertz方程。当L达到最大值的时候,C就等于流网络的所有点击量。

我们采用一个简单的树来说明,树结构使得一个节点只有一个流入,所以其流距离就是该节点到source的路径长度。例如:

  • 节点ABC的流距离为1,
  • DE的流距离是2,
  • F的流距离是3,
  • G的流距离是4。

那么根据L进行累计的点击量: C_{L_1} = 70, C_{L_2} = 110, C_{L_3} = 120, C_{L_4} = 125 。125就是该树状网络的所有点击量。

===流距离与新闻年龄=== We trace all news stories in the system and plot their flow distance L against their age T (Figure 2A). It is observed that the movement of all the 3,553 stories follows the same equation

L = kT^ω , (1)

in which k = 6.0 and ω = 0.35 are constant parameters estimated from empirical data. According to the nature of power functions, ω < 1 means that the distance goes up rapidly at first and then the increasing speed becomes slower as time goes by.

===点击量与流距离=== As all the news stories under study have the same motion equation('''问题:这是什么意思?'''), they can be viewed as moving sensors whose clicks received at certain positions reflect the density of attention at that position. By relating distance with clicks we obtain the spatial density distribution of attention.

We find that the cumulative number of clicks, after being normalized (represented by C), follows the Gompertz function:

\int_{0}^{L} C = e^{-\alpha e ^{- \beta L}}. (2)

由(2)可以得到:

C = \alpha \beta e^{-\alpha e^{-\beta L} - \beta L}. (3)

Figure B shows that the number of clicks (the orange bars) increases with the distance at first, and then decays with it gradually, forming a bell curve that can be quantified using the differential of the Gompertz function (the dark blue curve, see Eq. 3). The two parameters of the Gompertz function (the red curve) are estimated to be α = 10.34 and β = 0.42.

===点击量与新闻年龄===

由(1)和(3)可以得到公式(4):

C \sim e^{-\beta L} = e^{-\beta k T^w} . (4)

Eq. 4 is called stretched exponential function or Kohlrausch-Williams-Watts function. The parameter ω determines the decay rate of attention. When ω > 0, attention decays slower than exponential and faster than power law. Interestingly, this is exactly the function used by F. Wu and B. A. Huberman. Novelty and collective attention. Proceedings of the National Academy of Sciences, 104(45):17599–17601, 2007.to fit the decay of clicks to news stories.

==不随时间改变的标度特征==

文件:Scaling hidden order.png

我们已经知道累计点击量C和流距离L之间符合Gompertz方程。当L达到最大值的时候,C就等于流网络的所有点击量。

如果我们定义V_L为累计点击量,那么公式2可以表达为以下形式:

\frac{V_L}{V_{L_{max}} }= \int_{0}^{L} C_L = e^{-\alpha_1 e^{- \beta_1 L}} (5)

我们还可以定义F_L作为L的方程,也就是说用户在流距离效率L时离开系统的累计数量,因此它刻画了注意力耗散的情况。

我们发现 F_L 也符合Gompertz function:

\frac{F_L}{F_{L_{max}} }= \int_{0}^{L} D_L = e^{-\alpha_2 e^{- \beta_2 L}} (6)

其中,D_L是用户在流距离为L的节点离开系统的概率。

'''空间标度关系'''

根据图3a的inset我们知道, \frac{\beta_1}{\beta_2} \sim 1 而 \frac{\alpha_1}{\alpha_2} > 1,由此可以将公式(5)和(6)合并:

V_L = F_L^{ \frac{\alpha_1}{\alpha_2}}F_{L_{max}}^{- \frac{\alpha_1}{\alpha_2}}V_{L_{max}} (7)

也就是说V_L和F_L具有超线性的标度关系,见图3b, 称之为spatial scaling。

'''时间标度关系'''

同时,我们发现:V_{L_{max}} \sim F_{L_{max}}^\gamma (8),

也就是图3c所表示的不同时间点上点击流网络的异速增长, 称之为temporal scaling。

'''空间标度与时间标度之间的关系'''

将公式(7)和(8)合并,可以得到:

V_L = F_L^{\frac{\alpha_1}{\alpha_2}} F_{L_{max}}^{\gamma - \frac{\alpha_1}{\alpha_2}} = F_L^{\frac{\alpha_1}{\alpha_2}}F_{L_{max}}^{\psi} (9)

或 \gamma = \frac{\alpha_1}{\alpha_2} + \psi (10)

公式(10)表明spatial scaling的指数可以预测temporal scaling的指数,即为图3d。

===流通量与耗散量的关系=== there is a gap between the two curves (Figure 3A). This implies that the contribution of clicks is unequal between users. While most of the users leave the system within a few steps, a few users visit a lot of threads, generating long surfing paths. The larger the gap is, the more unequal the click contributions are among users.

===时间不变性=== As shown by Figure 3A, we fit Eq.5 and Eq.6 using 24 hourly clickstream networks collected from the EXO forum in the TIEBA system and find that the values of β1/β2 and α1/α2 are invariant over time, supporting our assumption on the time-invariant structure of attention fields.

=讨论= ===公交车比喻===

一个公交车,上车打卡,下车再次打卡扣款。行驶到某一站时累计收入为V, 该站下车的人为F。

如下图,我们可以发现四种流与outdegree、indegree间具有较好的幂律关系,尤其是outdegree效果更好。outdegree高的节点也就是hub节点,传递流量的作用更强、耗散更少。hub耗散低,接着往前走的人多。

File:outdegree_flow.png

参见耗散定律J Zhang, L Wu. Allometry and dissipation of ecological flow networks. PloS one, 2013

===随机游走===

def randomWalk(times): vs=[] v=0 # news age for i in range(times): if random.random()>0.5: v+=1 else: v-=1 vs.append(v) return vs

===浏览顺序的流网络===

根据一维的模型,比如随机游走,我们可以生成n个数字用来代表不同年龄的新闻,它表示用户群体在不同年龄的新闻之间跳动的行为。比如 [1, 2,1, 2, 3]表示用户看完年龄为1的新闻后看来年龄为2的新闻,看了年龄为2的新闻后再次看了年龄为1的新闻,看完年龄为1的新闻后看了年龄为2的新闻, 看完年龄为1的新闻后看了年龄为3的新闻。因而可以将这个浏览顺序转化为网络:

  • 1, 2
  • 2, 1
  • 1, 2
  • 2, 3

注意1, 2这条边出现了两次,因而这个网络天然是有权重的!


def ageDistance(vs): e=defaultdict(lambda:0) for i,j in zip(vs[:-1],vs[1:]): e[(i,j)]+=1 # i, j这条边出现的次数 g=nx.DiGraph() for k,v in e.items(): i,j=k g.add_edge(i,j,weight=v) h = flowBalancing(g) l = flowDistanceFromSource(h) age,distance=zip(*[[k,v ]for k,v in l.items() if k!='sink']) return age, distance

===有下限的随机游走=== 所谓的下限是最新只能当天的新闻,不能看到明天的新闻。可以按照这个原则改变随机游走模型


def boundRandomWalk(times): vs=[] v=0 # news age for i in range(times): if random.random()>0.5: v+=1 else: v-=1 if v < 0: v = 0 vs.append(v) return vs


vs = boundRandomWalk(10000) age, distance = ageDistance(vs)

fig = plt.figure(figsize=(10, 5),facecolor='white') ax = fig.add_subplot(2,1,1) plt.plot(range(1,len(vs)+1),vs,'b-') plt.xlabel('Steps'), plt.ylabel('News Age') plt.title("Random Walk with Low Bound")

ax = fig.add_subplot(2,1,2) plt.plot(age,distance,'r-') yss = [35.5i*0.028 for i in age] plt.plot(age,yss,'r-')

plt.xlim(0,100)

plt.xlabel('News Age') plt.ylabel('Flow Distance')

plt.tight_layout() plt.show()

File:fit_random_walk_with_low_bound.png

===减半后退-有下限随机游走===
def rw(upPro,N): vs=[] v=0 vs1.append(v) for i in range(N): if random.random()<=upPro: v+=1 else: v=int(v/2) #指数式归零 #v=0 #归零随机游走,导致线性结果 #v+=-1 # 纯随机游走 if v<0: v=0 vs.append(v) return vs


def tPlot(upPro,N,col): vs=rw(upPro,N) plt.plot(range(1,len(vs)+1),vs,color=col,linestyle='-',marker='')

def AgeDisPlot(upPro,N,col): vs=rw(upPro,N) e=defaultdict(lambda:0) for i,j in zip(vs[:-1],vs[1:]): e[(i,j)]+=1 g=nx.DiGraph() for k,v in e.items(): i,j=k g.add_edge(i,j,weight=v) h = flowBalancing(g) l = flowDistanceFromSource(h) age,distance=np.array([[k,v ]for k,v in l.items() if k!='sink']).T plt.plot(age,distance,color=col,linestyle='-',marker='.',label=str(np.round(upPro,2)))

from collections import Counter

def AgeDisClick(upPro,N): vs=rw(upPro,N) e=defaultdict(lambda:0) for i,j in zip(vs[:-1],vs[1:]): e[(i,j)]+=1 g=nx.DiGraph() for k,v in e.items(): i,j=k g.add_edge(i,j,weight=v) h = flowBalancing(g) l = flowDistanceFromSource(h) c = Counter(vs) age,distance, click=np.array([[k,v,c[k] ]for k,v in l.items() if k!='sink']).T return age, distance, click

'''指数减半随机游走'''

File:exponential_backward_random_walk.png

'''归零随机游走'''

File:zero_backward_random_walk.png

'''纯随机游走(有下限)'''

File:low_bound_random_walk.png


fig = plt.figure(figsize=(12, 4),facecolor='white') cmap = cm.get_cmap('rainbow_r',10) ax1 = fig.add_subplot(1,3,1) ax2 = fig.add_subplot(1,3,2) ax3 = fig.add_subplot(1,3,3)

j=0 for k in np.linspace(0.5,0.9,10): age, distance, click = AgeDisClick(k,1000) ax1.plot(distance,click, color=cmap(j),linestyle='-',marker='.',label=str(np.round(k,2))) ax2.plot(age,distance, color=cmap(j),linestyle='-',marker='.',label=str(np.round(k,2))) ax3.plot(age,click, color=cmap(j),linestyle='-',marker='.',label=str(np.round(k,2))) j+=1

plt.xlim(0,20)

xs=np.linspace(1,20,10) ax2.plot(xs,np.exp(4)xs*0.15,'k-.',linewidth=2) plt.legend(loc=1,fontsize=8)

plt.xscale('log',basex=2)

plt.yscale('log',basey=2) ax1.set_xlabel('Distance'),ax1.set_ylabel('Clicks') ax2.set_xlabel('Age'),ax2.set_ylabel('Distance') ax3.set_xlabel('age'),ax3.set_ylabel('Click') plt.tight_layout() plt.show()

Novelty and collective attention Wu F, Huberman BA (2007) Novelty and collective attention. Proceedings of the National Academy of Sciences 104: 17599–17601.

:File:Wu (2007) Novelty and collective attention. PNAS.pdf

=文献=