# Weight Inference Notebook

Examples on how to use the `ann_inference` package to gain a deeper understanding of neural networks. To begin I'll load the `arrow_helper.py` file into memory so I can use a few of the builtin methods to work with `Parquet` tables that have been written to disk on my system.

In [24]:
% run ../data/arrow_helper.py
% run ../sample/boot.py

Iterations of a simple feedforward neural network were run and the column averages of the model weights were computed over each epoch. The Apache `Arrow` project and specifically the `Parquet` format were leveraged heavily to incrementally write the results to disk. The incremental writes of model coefficients allowed for minimal impact on in-memory processing. In addition, the columnar nature as well as the efficient IO operations enabled by the `Arrow` project allowed for inference to be performed on the model as a product of training as opposed to designing additional training steps solely for inference purposes. For more information on the `Arrow` project or the `Parquet` format as utilized in the `ann_inference` package please see either the `pyarrow` documentation: [pyarrow](https://arrow.apache.org/docs/python/) or the `Arrow` homepage: [Apache Arrow](https://arrow.apache.org/).

Parquet files on disk can be read into a variety of powerful analytic engines including: `Spark`, `Drill`, and `Pandas DataFrames`. I will focus on Pandas as the files I will be working with are written on local disk but they could just have easily been added to hdfs or similar and processed.

Reading files into memory in the form of a `DataFrame` is as simple as calling `read_parquet_store(path, nthreads=5)` on a directory where the `fit` or `gen_test_datasets` methods have been pointed to. Note that in addition to the path users can specify the number of threads to use when reading in the table which can significantly improve IO speed.

In [4]:
path = '../../../parquet_store/regression_tests/12_9_2018/'
weight_1 = 'stat=weight_1/'
weight_2 = 'stat=weight_2/'

pd_weight1 = read_parquet_store(path + weight_1)
pd_weight2 = read_parquet_store(path + weight_2)

In [5]:
pd_weight1.head()

Unnamed: 0,id,epoch_num,weight_0,weight_1,weight_2,weight_3,weight_4,weight_5,weight_6,weight_7,...,weight_14,weight_15,weight_16,weight_17,weight_18,weight_19,weight_20,weight_21,weight_22,weight_23
0,8,0,-0.071435,-0.095128,0.057164,0.05075,0.11039,0.029797,-0.116168,0.000309,...,-0.06561,0.00685,0.251044,-0.137895,-0.028108,-0.000548,-0.056498,-0.02003,-0.152754,-0.022859
1,8,1,-0.04476,-0.106421,0.025078,0.060286,0.172469,0.078673,-0.16081,-0.018072,...,-0.087509,0.016619,0.313774,-0.20458,0.01222,0.042591,-0.097589,-0.024403,-0.212262,0.019503
2,8,2,-0.012611,-0.120125,-0.006748,0.076673,0.262823,0.136139,-0.218535,-0.037569,...,-0.124132,0.034019,0.432285,-0.299575,0.057452,0.087715,-0.148148,-0.033915,-0.303251,0.068829
3,8,3,0.027172,-0.137673,-0.04399,0.101274,0.384577,0.207367,-0.295965,-0.057796,...,-0.182837,0.060315,0.609462,-0.436094,0.108917,0.138808,-0.216622,-0.048528,-0.438521,0.126833
4,8,4,0.07567,-0.161504,-0.092994,0.134044,0.547262,0.296462,-0.402497,-0.083237,...,-0.272366,0.096202,0.857831,-0.633354,0.172057,0.200103,-0.310764,-0.070381,-0.638018,0.198318


In [18]:
np.sort(pd_weight1.id.unique())

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

From the above we can see that approximately 20 tests were run across the network. The number of observations is sufficient that we should be able to generate insight into the underlying behavior of the network at each epoch. I will focus most of my testing on the final epoch which we can easily see by calling the below.

In [19]:
pd_weight1.epoch_num.max()

499

Next, I'll generate a `DataFrame` consisting of the final epoch across all tests to better understand the behavior of the network.

In [22]:
final_epoch = pd_weight1[pd_weight1.epoch_num==pd_weight1.epoch_num.max()]

In [23]:
final_epoch

Unnamed: 0,id,epoch_num,weight_0,weight_1,weight_2,weight_3,weight_4,weight_5,weight_6,weight_7,...,weight_14,weight_15,weight_16,weight_17,weight_18,weight_19,weight_20,weight_21,weight_22,weight_23
499,8,499,0.327893,-0.329544,-0.415155,0.338067,1.302721,0.75351,-1.169199,-0.254125,...,-0.889235,0.302476,2.042427,-1.973327,0.472699,0.516767,-0.958766,-0.23355,-1.991574,0.520245
499,11,499,0.30398,-0.330608,-0.400635,0.351527,1.278058,0.761393,-1.16608,-0.272617,...,-0.908209,0.290667,2.050454,-1.974693,0.481918,0.491695,-0.953462,-0.231461,-2.004575,0.530335
499,19,499,0.316105,-0.32313,-0.410164,0.345922,1.283801,0.746934,-1.172615,-0.254714,...,-0.902846,0.275202,2.053539,-1.96124,0.482116,0.491827,-0.953637,-0.237383,-1.981324,0.512885
499,14,499,0.316509,-0.330442,-0.418492,0.349019,1.284906,0.742172,-1.203934,-0.245937,...,-0.912625,0.284511,2.034875,-1.962383,0.458443,0.479452,-0.964799,-0.237986,-1.983756,0.54713
499,5,499,0.302831,-0.329847,-0.421022,0.334386,1.300896,0.74112,-1.187677,-0.25037,...,-0.924568,0.295741,2.057034,-1.971517,0.49112,0.516452,-0.957887,-0.224429,-1.979303,0.525312
499,17,499,0.309302,-0.338831,-0.403287,0.325512,1.301427,0.758231,-1.200986,-0.276455,...,-0.908653,0.296592,2.049554,-1.950534,0.475223,0.505559,-0.95275,-0.22577,-1.986002,0.516724
499,16,499,0.320053,-0.351562,-0.388246,0.337194,1.284952,0.753948,-1.195881,-0.253483,...,-0.897631,0.29164,2.040516,-1.967697,0.487466,0.51453,-0.958366,-0.23165,-1.990224,0.53051
499,9,499,0.332829,-0.33151,-0.414016,0.334055,1.280147,0.720062,-1.186065,-0.263354,...,-0.919588,0.306642,2.056237,-1.955362,0.481104,0.499795,-0.982745,-0.233179,-1.977958,0.529198
499,6,499,0.328793,-0.323546,-0.389737,0.343175,1.296909,0.735945,-1.189104,-0.242138,...,-0.895035,0.295789,2.058506,-1.961391,0.475925,0.479986,-0.94861,-0.241603,-2.010661,0.525145
499,0,499,0.315472,-0.335306,-0.41926,0.345119,1.290294,0.740713,-1.192261,-0.260825,...,-0.925361,0.285483,2.046592,-1.962055,0.475055,0.498207,-0.947596,-0.23592,-1.993744,0.53534


Next, the weights will be converted into a __Numpy__ `ndarray` to allow bootstrap estimates of the mean and variance for the given weight to be calculated.

In [25]:
np_weight1 = final_epoch[final_epoch.columns[2:]].values

In [42]:
boot_samples = 999

boot_mean = np.zeros([999, np_weight1.shape[1]])
boot_var = np.zeros([999, np_weight1.shape[1]])

In [43]:
for j in np.arange(0, boot_stats.shape[1]):
    boot_mean[:, j] = boot_stat(np_weight1[:, j],n_iter=boot_samples, test_stat=np.mean)
    boot_var[:, j] = boot_stat(np_weight1[:, j],n_iter=boot_samples)

In [44]:
mean_percentile = np.percentile(boot_mean, [0.025, 0.975], axis=0)
var_percentile = np.percentile(boot_var, [0.025, 0.975], axis=0)

In [45]:
mean_percentile.shape

(2, 24)

In [46]:
mean_percentile

array([[ 0.31345909, -0.3313351 , -0.40760235,  0.34027868,  1.28698505,
         0.74953199, -1.19010639, -0.26040125,  1.17664455,  1.4657461 ,
         0.58602274, -0.20114825,  1.77999072, -1.74713196, -0.90831183,
         0.29073113,  2.0456559 , -1.96639814,  0.47761025,  0.49588564,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.31384396, -0.33123494, -0.40717574,  0.34035243,  1.28734185,
         0.74991472, -1.18969651, -0.25990862,  1.1771564 ,  1.46616125,
         0.58649911, -0.2009129 ,  1.78054709, -1.74684683, -0.9078282 ,
         0.29108927,  2.04628654, -1.96612368,  0.47790386,  0.49610406,
         0.        ,  0.        ,  0.        ,  0.        ]])