### Introduction

This notebook examines the metadata of the Amazon S3 bucket that holds all of the images and JSON files, and computes some summary statistics for the data.

In [1]:
import pandas as pd

In [2]:
df_raw = pd.read_json('bucket-contents.json')

In [3]:
df_raw.tail()

Unnamed: 0,Key,Size
1072863,metadata/99995.json,3740
1072864,metadata/99996.json,864
1072865,metadata/99997.json,2132
1072866,metadata/99998.json,2770
1072867,metadata/99999.json,1658


The total size of all files is approximately 31.6 GB.  
The mean file size is approximately 29.4 kB.

In [4]:
df_raw.Size.sum()

31564792008
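As a quick check on the figures quoted above, the byte counts can be converted to larger units. This is a minimal sketch: the two constants are copied from the notebook output, while the helper names `to_decimal_units` and `to_binary_units` are hypothetical and not part of the original notebook.

```python
# Values copied from the notebook output (bytes).
total_bytes = 31_564_792_008   # df_raw.Size.sum()
mean_bytes = 29_420.95         # mean reported by df_raw.describe()

def to_decimal_units(n_bytes):
    """Return the size in kB, MB and GB (powers of 1000)."""
    return {"kB": n_bytes / 1e3, "MB": n_bytes / 1e6, "GB": n_bytes / 1e9}

def to_binary_units(n_bytes):
    """Return the size in KiB, MiB and GiB (powers of 1024)."""
    return {"KiB": n_bytes / 2**10, "MiB": n_bytes / 2**20, "GiB": n_bytes / 2**30}

total_gb = to_decimal_units(total_bytes)["GB"]
total_gib = to_binary_units(total_bytes)["GiB"]
mean_kb = to_decimal_units(mean_bytes)["kB"]

print(f"total: {total_gb:.2f} GB ({total_gib:.2f} GiB), mean: {mean_kb:.1f} kB")
```

The decimal and binary figures differ by about 7% at this scale, so it helps to state which unit is meant when quoting the bucket size.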

In [5]:
df_raw.describe()

Unnamed: 0,Size
count,1072868.0
mean,29420.95
std,34686.91
min,56.0
25%,1849.0
50%,13362.5
75%,48572.0
max,358438.0


The raw data is listed sequentially. The first half (rows 0 through 536433) consists of the images, and the second half (rows 536434 onward) consists of the JSON documents.

In [6]:
df_raw[536430:536440]

Unnamed: 0,Key,Size
536430,bin-images/99996.jpg,58212
536431,bin-images/99997.jpg,39300
536432,bin-images/99998.jpg,36076
536433,bin-images/99999.jpg,35218
536434,metadata/00001.json,2472
536435,metadata/00002.json,2195
536436,metadata/00003.json,2195
536437,metadata/00004.json,2087
536438,metadata/00005.json,1390
536439,metadata/00006.json,1548
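The boundary row can also be located from the key prefix rather than read off by eye. The sketch below uses a four-row toy frame built from rows shown in the output above; `toy` and the other variable names are illustrative, and it assumes every image key starts with `bin-images/`.

```python
import pandas as pd

# Toy stand-in for df_raw, using four rows from the listing above.
toy = pd.DataFrame({
    "Key": ["bin-images/99998.jpg", "bin-images/99999.jpg",
            "metadata/00001.json", "metadata/00002.json"],
    "Size": [36076, 35218, 2472, 2195],
})

# Boolean mask: True for image rows, False for metadata rows.
is_image = toy["Key"].str.startswith("bin-images/")
n_images = int(is_image.sum())

# Position of the first non-image row.
first_meta = int((~is_image).to_numpy().argmax())

# The two blocks are contiguous only if every image precedes every metadata row.
contiguous = first_meta == n_images
print(n_images, first_meta, contiguous)
```

If `contiguous` is true on the real data, `first_meta` is the split point to use in place of a hard-coded row number.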


### Pivot the Dataframe

I want to reshape the dataframe so that each metadata file sits on the same row, and shares the same index, as its associated image. This makes it possible to look up the image and JSON file names through a common index, and to shuffle or randomly sample the pairs.

First, split the dataframe into its image and metadata halves.

In [7]:
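# Rows 0..536433 are images; rows 536434 onward are metadata.
df_img = df_raw[:536434]
# Renumber the metadata rows from 0 so they line up with the image rows.
df_meta = df_raw[536434:].reset_index(drop=True)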

In [8]:
df_img.tail()

Unnamed: 0,Key,Size
536429,bin-images/99995.jpg,103665
536430,bin-images/99996.jpg,58212
536431,bin-images/99997.jpg,39300
536432,bin-images/99998.jpg,36076
536433,bin-images/99999.jpg,35218


In [9]:
df_meta.tail()

Unnamed: 0,Key,Size
536429,metadata/99995.json,3740
536430,metadata/99996.json,864
536431,metadata/99997.json,2132
536432,metadata/99998.json,2770
536433,metadata/99999.json,1658


Rename the columns so that the image and metadata columns stay distinct once the frames are combined.

In [10]:
df_img = df_img.rename(columns={'Key': 'img_file', 'Size': 'img_size'})
df_meta = df_meta.rename(columns={'Key': 'meta_file', 'Size': 'meta_size'})

In [11]:
df_img.tail()

Unnamed: 0,img_file,img_size
536429,bin-images/99995.jpg,103665
536430,bin-images/99996.jpg,58212
536431,bin-images/99997.jpg,39300
536432,bin-images/99998.jpg,36076
536433,bin-images/99999.jpg,35218


In [12]:
df_meta.head()

Unnamed: 0,meta_file,meta_size
0,metadata/00001.json,2472
1,metadata/00002.json,2195
2,metadata/00003.json,2195
3,metadata/00004.json,2087
4,metadata/00005.json,1390


Finally, concatenate the split frames into a single dataframe.

In [13]:
df = pd.concat([df_img, df_meta], axis=1)
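The positional `pd.concat` above pairs rows correctly only if both halves are in the same order. As a hedged alternative, the sketch below joins on the file id extracted from each key, so the pairing does not depend on row order. The toy frames reuse two rows from this notebook's output with the metadata deliberately reversed; `file_id`, `df_img_toy` and `df_meta_toy` are hypothetical names.

```python
import pandas as pd

df_img_toy = pd.DataFrame({
    "img_file": ["bin-images/393611.jpg", "bin-images/393612.jpg"],
    "img_size": [27723, 29842],
})
# Metadata rows deliberately listed in the opposite order.
df_meta_toy = pd.DataFrame({
    "meta_file": ["metadata/393612.json", "metadata/393611.json"],
    "meta_size": [2396, 1654],
})

def file_id(keys):
    """Extract the part of each key between the last '/' and the extension."""
    return keys.str.extract(r"/([^/.]+)\.[^.]+$", expand=False)

paired = pd.merge(
    df_img_toy.assign(file_id=file_id(df_img_toy["img_file"])),
    df_meta_toy.assign(file_id=file_id(df_meta_toy["meta_file"])),
    on="file_id",
    how="inner",
    validate="one_to_one",  # raise if any id appears more than once on either side
)
print(paired)
```

An inner join silently drops images without metadata (and vice versa); using `how="outer"` with `indicator=True` instead would surface those rows.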

In [24]:
df[333500:333520]

Unnamed: 0,img_file,img_size,meta_file,meta_size
333500,bin-images/393611.jpg,27723,metadata/393611.json,1654
333501,bin-images/393612.jpg,29842,metadata/393612.json,2396
333502,bin-images/393613.jpg,33408,metadata/393613.json,3382
333503,bin-images/393614.jpg,60090,metadata/393614.json,2836
333504,bin-images/393615.jpg,34904,metadata/393615.json,2246
333505,bin-images/393616.jpg,33808,metadata/393616.json,2961
333506,bin-images/393617.jpg,41628,metadata/393617.json,2777
333507,bin-images/393618.jpg,43914,metadata/393618.json,2777
333508,bin-images/393619.jpg,43346,metadata/393619.json,2715
333509,bin-images/39362.jpg,49569,metadata/39362.json,56
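The sample above suggests the image and metadata names line up row by row. A minimal sketch of checking this across the whole frame follows; it runs on a two-row toy frame copied from the output above, and the helper `stem` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Two rows copied from the output above.
toy = pd.DataFrame({
    "img_file": ["bin-images/393619.jpg", "bin-images/39362.jpg"],
    "img_size": [43346, 49569],
    "meta_file": ["metadata/393619.json", "metadata/39362.json"],
    "meta_size": [2715, 56],
})

def stem(keys):
    """Strip the folder prefix and the file extension from each key."""
    names = keys.str.rsplit("/", n=1).str[-1]
    return names.str.rsplit(".", n=1).str[0]

img_id = stem(toy["img_file"])
meta_id = stem(toy["meta_file"])

# Rows where the image and metadata ids disagree; empty if the pairing is sound.
mismatches = toy[img_id != meta_id]
print(len(mismatches))
```

Running the same comparison on the full `df` would confirm whether the positional pairing holds for every row, not just the slice inspected here.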


Some more statistics on the images and metadata: the total and mean size, in bytes, of each group.  

Shall I pickle the resulting dataframe?

In [25]:
df.img_size.sum(), df.img_size.mean()

(30466377489, 56794.270104057534)

In [26]:
df.meta_size.sum(), df.meta_size.mean()

(1098414519, 2047.6228557473985)
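On the question of pickling raised earlier, a round trip through `to_pickle` and `read_pickle` is one way to persist the paired frame, after which `sample` can shuffle it. This is a sketch on a three-row toy frame copied from the output above; the file name `bucket-pairs.pkl` and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Three rows copied from the paired dataframe output above.
toy = pd.DataFrame({
    "img_file": ["bin-images/393611.jpg", "bin-images/393612.jpg", "bin-images/393613.jpg"],
    "img_size": [27723, 29842, 33408],
    "meta_file": ["metadata/393611.json", "metadata/393612.json", "metadata/393613.json"],
    "meta_size": [1654, 2396, 3382],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bucket-pairs.pkl")  # hypothetical file name
    toy.to_pickle(path)             # write the frame to disk
    restored = pd.read_pickle(path)  # read it back, dtypes and index included

# Shuffle all rows reproducibly; each image stays on the same row as its metadata.
shuffled = restored.sample(frac=1, random_state=0)
print(shuffled[["img_file", "meta_file"]])
```

Pickle files should only be loaded from trusted sources; a CSV or Parquet export is an alternative if the frame needs to be shared more widely.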