# Table of Contents
* [VIND dataset](#VIND-dataset)
	* [Basic stats](#Basic-stats)
		* [Test/Train split- video numbers](#Test/Train-split--video-numbers)
		* [fine-grained action types](#fine-grained-action-types)
		* [Establishing non-uniqueness of paths and movie names](#Establishing-non-uniqueness-of-paths-and-movie-names)
		* [Coarse category types](#Coarse-category-types)
		* [Comparing category distributions between test/train](#Comparing-category-distributions-between-test/train)
	* [Looking at movie length distribution of combined test/train](#Looking-at-movie-length-distribution-of-combined-test/train)
	* [Sorting replacement videos](#Sorting-replacement-videos)
	* [second pass over selected categories](#second-pass-over-selected-categories)
		* [In the future, should set max capture length](#In-the-future,-should-set-max-capture-length)
	* [Confirmation code output test](#Confirmation-code-output-test)
	* [identical frames](#identical-frames)
	* [file renaming example](#file-renaming-example)
	* [similar image comparison](#similar-image-comparison)


# VIND dataset

In [1]:
%%capture
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import operator
import math
from collections import Counter, defaultdict
import glob
import os

In [2]:
%%capture
import matplotlib as mpl
mpl.use("Agg")
import matplotlib.pyplot as plt
#%matplotlib notebook
%matplotlib inline
%load_ext base16_mplrc
%base16_mplrc light default
plt.rcParams['figure.figsize'] = (16.0, 10.0)

* Go over the rugby videos to screen for point-of-view (POV) changes.
* Ask about first-person POV.
* Low frame rate: train, lifting_weights, 130_0.
* Subset image/video.
* Slow motion / repeated frames: data/prediction_videos_final_train/falling-cliff-jumping/56_0
* Diving / cliff jumping: data/prediction_videos_final_train/falling-cliff-jumping/100_2

## Basic stats

### Test/Train split- video numbers

In [3]:
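# -d with */* lists one movie directory per line, so wc -l is not inflated
# by the per-category header and blank lines that a plain `ls dir/*` prints
!ls -d data/prediction_videos_final_train/*/* | wc -l
!ls -d data/prediction_videos_final_test/*/* | wc -l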

In [4]:
ntot = 6887

In [125]:
ntot*0.95

In [120]:
(ntot-6018)*15/100

In [6]:
1420 * 5 /3600

### fine-grained action types

In [7]:
test_cats = !ls data/prediction_videos_final_test
test_cats = set(test_cats)
test_cats

In [8]:
train_cats = !ls data/prediction_videos_final_train
train_cats = set(train_cats)
train_cats

In [9]:
test_cats.difference(train_cats)

Some categories appear in the test set but not in the training set.
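
The check above only looks in one direction (test minus train). A minimal sketch of a two-way comparison follows; `category_diff` is a hypothetical helper and the category names in the demo are toy values, not the real directory listing.

```python
def category_diff(train_cats, test_cats):
    """Return (only_in_test, only_in_train) as sorted lists."""
    train_cats, test_cats = set(train_cats), set(test_cats)
    only_in_test = sorted(test_cats - train_cats)
    only_in_train = sorted(train_cats - test_cats)
    return only_in_test, only_in_train


# Toy category names for illustration only
only_test, only_train = category_diff(
    train_cats={"passing-rugby", "falling-cliff-jumping"},
    test_cats={"passing-rugby", "diving-cliff-jumping"},
)
print(only_test)   # ['diving-cliff-jumping']
print(only_train)  # ['falling-cliff-jumping']
```

In the notebook, `train_cats` and `test_cats` from the cells above could be passed in directly.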

### Establishing non-uniqueness of paths and movie names

In [10]:
test_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    test_movie_names.append(movie.split('/')[4])

In [11]:
test_movie_names[:5]

In [12]:
print(len(test_movie_names))
print(len(set(test_movie_names)))

Movie names are not unique
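
To see which names repeat rather than only how many, a small sketch with `Counter` can help. `duplicated_names` is a hypothetical helper, and the list below is made-up sample input.

```python
from collections import Counter


def duplicated_names(names):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample; in the notebook this would be test_movie_names
example = ["56_0", "100_2", "56_0", "00000", "00000", "00000"]
dups = duplicated_names(example)
print(dups)  # {'56_0': 2, '00000': 3}
```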

In [13]:
[m for m in test_movie_names if m == '00000']

These 'stable' images can be ignored

In [14]:
train_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    train_movie_names.append(movie.split('/')[4])

In [15]:
print(len(train_movie_names))
print(len(set(train_movie_names)))
print(len(test_movie_names) + len(train_movie_names))
print()
print(len(set(train_movie_names).intersection(set(test_movie_names))))
print(len(set(train_movie_names).difference(set(test_movie_names))))

In [16]:
full_train_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    full_train_movie_names.append('/'.join(movie.split('/')[3:5]))

full_test_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    full_test_movie_names.append('/'.join(movie.split('/')[3:5]))

In [17]:
print(len(full_train_movie_names))
print(len(set(full_train_movie_names)))
print()
print(len(full_test_movie_names))
print(len(set(full_test_movie_names)))

In [18]:
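# Overlap of category/movie paths between train and test
# (the original intersected the train set with itself)
print(len(set(full_train_movie_names).intersection(set(full_test_movie_names))))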
print(len(set(train_movie_names)))

### Coarse category types

In [19]:
train_cat_occurence = [cat.split('/')[0] for cat in full_train_movie_names]

train_cat_counts = Counter(train_cat_occurence)
train_cat_counts.most_common()[:10]

In [20]:
test_cat_occurence = [cat.split('/')[0] for cat in full_test_movie_names]

test_cat_counts = Counter(test_cat_occurence)
test_cat_counts.most_common()[:10]

In [21]:
train_meta_cat_occurence = [cat.split('/')[0].split('-', maxsplit=1)[0] for cat in full_train_movie_names]

train_meta_cat_counts = Counter(train_meta_cat_occurence)
train_meta_cat_counts.most_common()

In [22]:
test_meta_cat_occurence = [cat.split('/')[0].split('-', maxsplit=1)[0] for cat in full_test_movie_names]

test_meta_cat_counts = Counter(test_meta_cat_occurence)
sorted_test = test_meta_cat_counts.most_common()

### Comparing category distributions between test/train

In [23]:
bot_test = [cat[0] for cat in sorted_test]
wid_test = [cat[1] for cat in sorted_test]

# Look up train counts in the same category order as the test bars
wid_train = [train_meta_cat_counts[cat] for cat in bot_test]

ypos = np.arange(len(bot_test))
fig, ax = plt.subplots()
ax.barh(ypos, wid_test, alpha=0.6, label='test')
ax.barh(ypos, wid_train, alpha=0.6, color='y', label='train')

# Fix the tick positions first so every label lines up with its bar
ax.set_yticks(ypos)
ax.set_yticklabels(bot_test)
_ = ax.legend()

In [24]:
all_movies = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    all_movies.append('/'.join(movie.split('/')[2:5]))

# Append the test movies to the same list
for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    all_movies.append('/'.join(movie.split('/')[2:5]))

In [25]:
all_movies[:5]

## Looking at movie length distribution of combined test/train

In [26]:
movie_lengths = {}
for path in all_movies:
    files = os.listdir('./data/'+path)
    # Drop the 'prediction_videos_final_' prefix: keys look like 'train/<category>/<movie>'
    movie_lengths[path.split('_', maxsplit=3)[3]] = len(files)

sorted_movies = sorted(movie_lengths.items(), key=operator.itemgetter(1))

In [28]:
sorted_movies[293:305]

In [29]:
386/24

In [30]:
sorted_movies[-20:]

In [31]:
length_series = pd.Series(list(movie_lengths.values()))
_ = length_series.hist(bins = 30, log = True)

In [128]:
length_series = pd.Series([val for val in movie_lengths.values() if val < 300])
_ = length_series.hist(bins = 30)

In [33]:
len_array = np.array([int(sm[1]) for sm in sorted_movies])

# Keep videos with more than 1 and fewer than 300 frames
len_array_f = len_array[(len_array > 1) & (len_array < 300)]
l_full, l_filt = len(len_array), len(len_array_f)
print(l_full, l_filt)
ldiff = l_full - l_filt
print(ldiff)
print(ldiff / l_full)  # fraction of videos excluded by the filter

Only about 5% of the videos have 300 or more frames, i.e. about 12 s.
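
The arithmetic behind this statement can be sketched as below. `length_summary` is a hypothetical helper, and `fps=25.0` is an assumption inferred from "300 frames is about 12 s"; the notebook does not state the frame rate. The frame counts in the demo are invented.

```python
import numpy as np


def length_summary(lengths, max_frames=300, fps=25.0):
    """Count videos at or above max_frames and convert the cutoff to seconds.

    fps is an assumed frame rate, not a value taken from the dataset.
    """
    lengths = np.asarray(lengths)
    n_long = int((lengths >= max_frames).sum())
    return {
        "n_total": int(lengths.size),
        "n_long": n_long,
        "fraction_long": n_long / lengths.size,
        "cutoff_seconds": max_frames / fps,
    }


# Invented frame counts; in the notebook this would be len_array
summary = length_summary([40, 120, 299, 300, 386])
print(summary)
```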

In [35]:
df = pd.DataFrame.from_records(sorted_movies)

Note: if this file is regenerated, first drop the movies that have since been removed from the dataset.

In [36]:
df[0].to_csv('./movies_sorted_by_length.csv', index = False)

## Sorting replacement videos

In [196]:
replacement_movie_names = []
for movie in glob.glob('./data/prediction_videos_3_categories/*/*'):
    replacement_movie_names.append('/'.join(movie.split('/')[3:5]))

In [207]:
bowling_videos = replacement_movie_names[0:129]

In [160]:
replacement_movie_lengths = {}
for path in replacement_movie_names:
    files = os.listdir('./data/prediction_videos_3_categories/'+path)
    replacement_movie_lengths[path] = len(files)

In [166]:
sorted_replacement_movies = sorted(replacement_movie_lengths.items(), key=operator.itemgetter(1))
sdf = pd.DataFrame.from_records(sorted_replacement_movies)
sdf[0].to_csv('./bowling_movies_sorted_by_length.csv', index = False)

In [244]:
bowling_movie_lengths = {}
for path in bowling_videos:
    files = os.listdir('./data/prediction_videos_3_categories/'+path)
    bowling_movie_lengths[path] = len(files)

sorted_bowling_videos = sorted(bowling_movie_lengths.items(), key=operator.itemgetter(1))
sdf = pd.DataFrame.from_records(sorted_bowling_videos)
sdf[0].to_csv('./bowling_movies_sorted_by_length.csv', index = False)

In [245]:
len(sorted_replacement_movies)

## second pass over selected categories

In [246]:
all_movie_2s = []
for movie_2 in glob.glob('./data/prediction_videos_final_train/*billiard*/*'):
    all_movie_2s.append('/'.join(movie_2.split('/')[2:5]))

# Append the matching test movies to the same list
for movie_2 in glob.glob('./data/prediction_videos_final_test/*billiard*/*'):
    all_movie_2s.append('/'.join(movie_2.split('/')[2:5]))
    
movie_2_lengths = {}
for path in all_movie_2s:
    files = os.listdir('./data/'+path)
    movie_2_lengths[''.join(path.split('_', maxsplit=3)[3:])] = len(files)

sorted_movie_2s = sorted(movie_2_lengths.items(), key=operator.itemgetter(1))

In [247]:
spdf = pd.DataFrame.from_records(sorted_movie_2s)
spdf[0].to_csv('./second_pass_billiard_test.csv', index = False)

In [248]:
%page sorted_movie_2s

In [249]:
len(sorted_movie_2s)

### In the future, should set max capture length
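
One possible stand-in until a limit is enforced at capture time is to truncate the frame list after the fact. This is only a sketch: `cap_frames` is a hypothetical helper, and the default of 300 frames is an assumption borrowed from the length cutoff used earlier in this notebook.

```python
def cap_frames(frame_files, max_frames=300):
    """Return at most max_frames frame names, in sorted (zero-padded) order."""
    return sorted(frame_files)[:max_frames]


# Zero-padded names sort in frame order, so slicing keeps the earliest frames
frames_on_disk = ["00003.png", "00000.png", "00002.png", "00001.png"]
kept = cap_frames(frames_on_disk, max_frames=3)
print(kept)  # ['00000.png', '00001.png', '00002.png']
```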

## Confirmation code output test

In [37]:
import ast

In [38]:
# Read the log lines that the cells below parse
with open('ex_log.txt', 'r') as f:
    frame_results = f.readlines()

In [None]:
spt = frame_results[0].split(', ', maxsplit =2)

In [None]:
flist = ast.literal_eval(spt[2])

## identical frames

In passing-rugby (train), 95_3 and 96_3 are identical.


data/prediction_videos_final_test/passing-rugby/118_3


data/prediction_videos_final_train/passing-rugby/117_3
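
Duplicates like these could be found automatically by hashing each clip directory. The sketch below only catches byte-identical clips with matching frame names; re-encoded or shifted copies would need an image-similarity measure instead. `clip_fingerprint` is a hypothetical helper and the demo builds throwaway directories with fake frame contents.

```python
import hashlib
import os
import tempfile


def clip_fingerprint(clip_dir):
    """Hash every file in clip_dir (sorted by name) into one hex digest."""
    digest = hashlib.sha256()
    for name in sorted(os.listdir(clip_dir)):
        digest.update(name.encode())
        with open(os.path.join(clip_dir, name), "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()


# Demo on temporary directories with fake frame bytes
with tempfile.TemporaryDirectory() as root:
    for clip, payload in [("95_3", b"abc"), ("96_3", b"abc"), ("97_3", b"xyz")]:
        os.mkdir(os.path.join(root, clip))
        with open(os.path.join(root, clip, "00000.png"), "wb") as f:
            f.write(payload)
    prints = {
        clip: clip_fingerprint(os.path.join(root, clip))
        for clip in sorted(os.listdir(root))
    }

same = prints["95_3"] == prints["96_3"]
different = prints["95_3"] != prints["97_3"]
print(same, different)  # True True
```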

## file renaming example

In [None]:
import os

In [None]:
frames = [['00000', '00020'], ['00040', '00050']]

In [None]:
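old_path = './79_2/'
new_path = './test/'
fext = '.png'
os.makedirs(new_path, exist_ok=True)  # do not fail if the folder already exists
for span in frames:
    # Each span is an inclusive [first, last] pair of zero-padded frame numbers
    for frame in range(int(span[0]), int(span[1]) + 1):
        fname = str(frame).zfill(5) + fext
        os.rename(os.path.join(old_path, fname), os.path.join(new_path, fname))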
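
Before moving files, it can be useful to list the names that the spans expand to. The following is a dry-run sketch; `frame_names` is a hypothetical helper that mirrors the inclusive-span and zero-padding logic of the cell above without touching the file system.

```python
def frame_names(spans, width=5, ext=".png"):
    """Expand inclusive [first, last] spans into zero-padded file names."""
    names = []
    for first, last in spans:
        for frame in range(int(first), int(last) + 1):
            names.append(str(frame).zfill(width) + ext)
    return names


names = frame_names([["00000", "00002"], ["00040", "00041"]])
print(names)
# ['00000.png', '00001.png', '00002.png', '00040.png', '00041.png']
```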

## similar image comparison

In [92]:
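try:
    # scikit-image >= 0.16
    from skimage.metrics import structural_similarity as ssim
    from skimage.metrics import mean_squared_error as mse
except ImportError:
    # older scikit-image releases
    from skimage.measure import compare_ssim as ssim
    from skimage.measure import compare_mse as mse
import cv2
from IPython.display import display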

In [93]:
image1 = cv2.imread('./similar_ex/53_6/00039.png')
image2 = cv2.imread('./similar_ex/53_7/00036.png')

In [96]:
s = ssim(image1, image2, multichannel=True)
s

In [97]:
s = mse(image1, image2)
s

In [74]:
_ = plt.imshow(image1[..., ::-1])  # OpenCV loads BGR; reverse to RGB for matplotlib

In [75]:
_ = plt.imshow(image2[..., ::-1])  # OpenCV loads BGR; reverse to RGB for matplotlib

In [78]:
_ = plt.imshow(cv2.absdiff(image1, image2))  # absdiff avoids uint8 wrap-around
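
Images loaded with `cv2.imread` are typically `uint8` arrays, so subtracting them directly wraps around modulo 256 wherever the second image is brighter. The NumPy-only sketch below illustrates the effect on a two-pixel toy example; `abs_diff_uint8` and `mse_float` are hypothetical helpers written for illustration, not the scikit-image functions used above.

```python
import numpy as np


def abs_diff_uint8(a, b):
    """Absolute per-pixel difference computed in a wider signed type."""
    wide = a.astype(np.int16) - b.astype(np.int16)
    return np.abs(wide).astype(np.uint8)


def mse_float(a, b):
    """Mean squared error computed in float64 to avoid integer overflow."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))


a = np.array([[10, 200]], dtype=np.uint8)
b = np.array([[20, 100]], dtype=np.uint8)

wrapped = a - b              # uint8 arithmetic: 10 - 20 wraps to 246
safe = abs_diff_uint8(a, b)  # true magnitudes: 10 and 100
err = mse_float(a, b)        # (10**2 + 100**2) / 2

print(wrapped.tolist())  # [[246, 100]]
print(safe.tolist())     # [[10, 100]]
print(err)               # 5050.0
```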