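Using the output of problem 83, build the word-context matrix X. Each element $X_{tc}$ of X is defined as follows.

- If $f(t,c) \geq 10$, then $X_{tc} = \mathrm{PPMI}(t,c) = \max\left\{\log\frac{N \times f(t,c)}{f(t,*) \times f(*,c)},\ 0\right\}$
- If $f(t,c) < 10$, then $X_{tc} = 0$

Here, PPMI(t,c) is a statistic called Positive Pointwise Mutual Information. Note that X has on the order of millions of rows and columns, so it is not feasible to hold every element of the matrix in main memory. Fortunately, almost all elements of X are 0, so it is enough to write out only the non-zero elements.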
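
As a quick sanity check of the definition above, the following minimal sketch evaluates the formula for a few made-up counts. The function name `ppmi`, its parameter names, and all of the numbers are hypothetical illustrations, not values from the corpus. The natural logarithm is assumed, matching the `math.log` call used later in this notebook.

```python
import math


def ppmi(n_total, f_tc, f_t, f_c, min_count=10):
    """Return X_tc: PPMI(t, c) if f(t, c) >= min_count, otherwise 0.0."""
    if f_tc < min_count:
        return 0.0
    return max(math.log((n_total * f_tc) / (f_t * f_c)), 0.0)


# Hypothetical counts chosen only to exercise the three cases.
print(ppmi(n_total=1000, f_tc=20, f_t=50, f_c=40))    # ratio 10  -> log(10)
print(ppmi(n_total=1000, f_tc=10, f_t=200, f_c=100))  # ratio 0.5 -> clipped to 0.0
print(ppmi(n_total=1000, f_tc=9, f_t=10, f_c=10))     # below the threshold -> 0.0
```

Only the first case would produce a stored entry in a sparse representation of X; the other two are zero and can be skipped.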

In [1]:
import math
import sys

import pandas as pd
from scipy import sparse, io

In [2]:
# Load the co-occurrence counts of word t and context word c, dropping pairs that co-occur 9 times or fewer
def read_tc():
    group_tc = pd.read_pickle('./083_group_tc.zip')
    return group_tc[group_tc > 9]

In [3]:
%%time
group_tc = read_tc()

CPU times: user 7.48 s, sys: 5.79 s, total: 13.3 s
Wall time: 13.3 s


In [4]:
%%time
group_t = pd.read_pickle('./083_group_t.zip')
group_c = pd.read_pickle('./083_group_c.zip')

CPU times: user 450 ms, sys: 143 ms, total: 593 ms
Wall time: 613 ms


In [5]:
matrix_x = sparse.lil_matrix((len(group_t), len(group_c)))

In [6]:
%%time
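# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
for ind, v in group_tc.items():

    # Expanded form (sum of logs, with LOG_N = math.log(68000317)): Wall time: 30.5 s (sys: 13.8 ms)
    # ppmi = max(LOG_N + math.log(v) - math.log(group_t[ind[0]]) - math.log(group_c[ind[1]]), 0)

    # Original form: Wall time: 31 s (sys: 12.3 ms)
    # 68000317 is the value of N in the PPMI formula
    ppmi = max(math.log((68000317 * v) / (group_t[ind[0]] * group_c[ind[1]])), 0)
    matrix_x[group_t.index.get_loc(ind[0]), group_c.index.get_loc(ind[1])] = ppmi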

CPU times: user 41.1 s, sys: 106 ms, total: 41.2 s
Wall time: 41.2 s
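
The loop above looks up each row and column position one pair at a time. As a possible alternative, the sketch below computes all PPMI values at once with NumPy and builds the matrix in COO format. It is not the approach used in this notebook and has not been timed against it. It assumes that `group_tc` is a Series with a two-level (t, c) MultiIndex and that `group_t` and `group_c` are Series with unique word indexes; the toy data and the value of `N` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy counts standing in for group_t, group_c and group_tc.
group_t = pd.Series({'cat': 50, 'dog': 200})
group_c = pd.Series({'pet': 40, 'the': 100})
group_tc = pd.Series(
    [20, 10],
    index=pd.MultiIndex.from_tuples([('cat', 'pet'), ('dog', 'the')]),
)
N = 1000  # hypothetical total; the notebook uses 68000317

# Map every (t, c) label pair to integer row/column positions in one call each.
rows = group_t.index.get_indexer(group_tc.index.get_level_values(0))
cols = group_c.index.get_indexer(group_tc.index.get_level_values(1))

# Vectorized PPMI: max(log(N * f(t,c) / (f(t,*) * f(*,c))), 0)
f_t = group_t.to_numpy()[rows]
f_c = group_c.to_numpy()[cols]
ppmi = np.maximum(np.log(N * group_tc.to_numpy() / (f_t * f_c)), 0.0)

# Build from (value, (row, col)) triplets, then drop entries clipped to zero.
matrix_x = sparse.coo_matrix(
    (ppmi, (rows, cols)), shape=(len(group_t), len(group_c))
).tocsr()
matrix_x.eliminate_zeros()

print(matrix_x.shape, matrix_x.nnz)
```

COO is convenient here because the matrix is assembled in a single step from triplets; converting to CSR afterwards gives a format better suited to arithmetic and row slicing.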


In [7]:
# Inspect the sparse matrix
print('matrix_x Shape:', matrix_x.shape)
print('matrix_x Number of non-zero entries:', matrix_x.nnz)
print('matrix_x Format:', matrix_x.getformat())

matrix_x Shape: (388836, 388836)
matrix_x Number of non-zero entries: 447875
matrix_x Format: lil


In [8]:
%%time
io.savemat('084.matrix_x.mat', {'x': matrix_x})

CPU times: user 298 ms, sys: 36.3 ms, total: 334 ms
Wall time: 335 ms
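
To confirm that a saved file can be read back, a round trip along the following lines may help. This is a minimal sketch using a tiny hypothetical matrix and a temporary file rather than the real `084.matrix_x.mat`; the key `'x'` matches the one used in the cell above.

```python
import os
import tempfile

from scipy import io, sparse

# Hypothetical 2x2 matrix with a single non-zero entry.
toy_x = sparse.csr_matrix([[0.0, 2.3], [0.0, 0.0]])

# Save under the key 'x', then load the file back from a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix_x.mat')
io.savemat(path, {'x': toy_x})

# Convert explicitly to CSR so the result does not depend on the format loadmat returns.
loaded = sparse.csr_matrix(io.loadmat(path)['x'])

print(loaded.shape, loaded.nnz)
```

Comparing `shape` and `nnz` before and after is a cheap check that nothing was lost when writing the file.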


In [9]:
%%time
print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
print(" ------------------------------------ ")
for var_name in dir():
    if not var_name.startswith("_") and sys.getsizeof(eval(var_name)) > 10000:  # only this line is customized
        print("{}{: >25}{}{: >10}{}".format('|',var_name,'|',sys.getsizeof(eval(var_name)),'|'))

|            Variable Name|    Memory|
 ------------------------------------ 
|                  group_c|  40314974|
|                  group_t|  40314974|
|                 group_tc|  63534504|
CPU times: user 3.87 s, sys: 19.8 ms, total: 3.89 s
Wall time: 3.9 s
