# Table of Contents
* [VIND dataset](#VIND-dataset)
	* [Basic stats](#Basic-stats)
		* [Test/Train split- video numbers](#Test/Train-split--video-numbers)
		* [fine-grained action types](#fine-grained-action-types)
		* [Establishing non-uniqueness of paths and movie names](#Establishing-non-uniqueness-of-paths-and-movie-names)
		* [Coarse category types](#Coarse-category-types)
		* [Comparing category distributions between test/train](#Comparing-category-distributions-between-test/train)
	* [Looking at movie length distribution of combined test/train](#Looking-at-movie-length-distribution-of-combined-test/train)
		* [In the future, should set max capture length](#In-the-future,-should-set-max-capture-length)


# VIND dataset

In [3]:
%%capture
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import operator
import math
from collections import Counter, defaultdict
import glob
import os

In [4]:
%%capture
import matplotlib as mpl
mpl.use("Agg")
import matplotlib.pyplot as plt
#%matplotlib notebook
%matplotlib inline
%load_ext base16_mplrc
%base16_mplrc light default
plt.rcParams['figure.figsize'] = (16.0, 10.0)

## Basic stats

### Test/Train split- video numbers

In [182]:
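# List the movie directories themselves (-d) so that the per-category
# headers and blank lines printed by a plain `ls dir/*` are not counted
!ls -d data/prediction_videos_final_train/*/* | wc -l
!ls -d data/prediction_videos_final_test/*/* | wc -l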

### fine-grained action types

In [6]:
test_cats = !ls data/prediction_videos_final_test
test_cats = set(test_cats)
test_cats

In [7]:
train_cats = !ls data/prediction_videos_final_train
train_cats = set(train_cats)
train_cats

In [8]:
test_cats.difference(train_cats)

Some categories do not appear in both the test and train sets; the set above lists those that occur only in test.
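
To see which categories are missing from either side, rather than only the test-only ones, the comparison can be made symmetric. The following is a minimal sketch: `compare_categories` is a hypothetical helper that is not part of this notebook, and the category names are made up for illustration.

```python
def compare_categories(test_cats, train_cats):
    """Return the categories unique to each split and those shared by both."""
    test_cats, train_cats = set(test_cats), set(train_cats)
    return {
        "test_only": test_cats - train_cats,
        "train_only": train_cats - test_cats,
        "shared": test_cats & train_cats,
    }

# Hypothetical category names, for illustration only.
example = compare_categories({"a-1", "b-1", "c-1"}, {"b-1", "c-1", "d-1"})
print(sorted(example["test_only"]))
print(sorted(example["train_only"]))
```

In the notebook, the same helper could be called with the `test_cats` and `train_cats` sets built above.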

### Establishing non-uniqueness of paths and movie names

In [9]:
test_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    test_movie_names.append(movie.split('/')[4])

In [10]:
test_movie_names[:5]

In [11]:
print(len(test_movie_names))
print(len(set(test_movie_names)))

Movie names are not unique
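
Comparing list length with set size only shows that duplicates exist. A small sketch for finding which names repeat and how often is given below; `duplicated_names` is a hypothetical helper and the sample names are invented for illustration.

```python
from collections import Counter

def duplicated_names(names):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

# Hypothetical movie names, for illustration only.
sample_names = ["00000", "00001", "00000", "00002", "00001", "00000"]
dups = duplicated_names(sample_names)
print(dups)
```

Applied to `test_movie_names`, this would list every repeated name rather than a single hand-picked one.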

In [12]:
[m for m in test_movie_names if m == '00000']

These 'stable' images can be ignored

In [13]:
train_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    train_movie_names.append(movie.split('/')[4])

In [221]:
print(len(train_movie_names))
print(len(set(train_movie_names)))
print(len(test_movie_names) + len(train_movie_names))
print()
print(len(set(train_movie_names).intersection(set(test_movie_names))))
print(len(set(train_movie_names).difference(set(test_movie_names))))

In [87]:
full_train_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    full_train_movie_names.append('/'.join(movie.split('/')[3:5]))

full_test_movie_names = []
for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    full_test_movie_names.append('/'.join(movie.split('/')[3:5]))

In [222]:
print(len(full_train_movie_names))
print(len(set(full_train_movie_names)))
print()
print(len(full_test_movie_names))
print(len(set(full_test_movie_names)))

In [18]:
# Number of category/movie paths shared between train and test
print(len(set(full_train_movie_names).intersection(set(full_test_movie_names))))
print(len(set(train_movie_names)))

### Coarse category types

In [108]:
train_cat_occurrence = [cat.split('/')[0] for cat in full_train_movie_names]

train_cat_counts = Counter(train_cat_occurrence)
train_cat_counts.most_common()[:10]

In [110]:
test_cat_occurrence = [cat.split('/')[0] for cat in full_test_movie_names]

test_cat_counts = Counter(test_cat_occurrence)
test_cat_counts.most_common()[:10]

In [150]:
# Coarse category = the part of the category name before the first '-'
train_meta_cat_occurrence = [cat.split('/')[0].split('-', maxsplit=1)[0] for cat in full_train_movie_names]

train_meta_cat_counts = Counter(train_meta_cat_occurrence)
train_meta_cat_counts.most_common()

In [149]:
# Coarse category = the part of the category name before the first '-'
test_meta_cat_occurrence = [cat.split('/')[0].split('-', maxsplit=1)[0] for cat in full_test_movie_names]

test_meta_cat_counts = Counter(test_meta_cat_occurrence)
sorted_test = test_meta_cat_counts.most_common()

### Comparing category distributions between test/train

In [171]:
bot_test = [cat[0] for cat in sorted_test]
wid_test = [cat[1] for cat in sorted_test]

# Look up train counts in the same category order as the test counts
wid_train = [train_meta_cat_counts[cat[0]] for cat in sorted_test]

y_pos = np.arange(len(bot_test))
fig, ax = plt.subplots()
ax.barh(y_pos, wid_test, alpha=0.6, label='test')
ax.barh(y_pos, wid_train, alpha=0.6, color='y', label='train')

# Set tick positions before labels so each label lines up with its bar
ax.set_yticks(y_pos)
ax.set_yticklabels(bot_test)
_ = ax.legend()
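
Because the two splits differ in size, raw counts are hard to compare directly. One possible approach, sketched here under the assumption that each split has at least one movie, is to convert the counts to fractions first. `category_fractions` is a hypothetical helper and the counts are made up for illustration.

```python
from collections import Counter

def category_fractions(counts):
    """Convert raw category counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical counts, for illustration only.
train_counts = Counter({"cat_a": 60, "cat_b": 30, "cat_c": 10})
test_counts = Counter({"cat_a": 10, "cat_b": 10})

train_frac = category_fractions(train_counts)
test_frac = category_fractions(test_counts)

# Difference in share per category; a category absent from a split counts as 0.
all_cats = sorted(set(train_frac) | set(test_frac))
share_gap = {c: test_frac.get(c, 0.0) - train_frac.get(c, 0.0) for c in all_cats}
print(share_gap)
```

In the notebook, `train_meta_cat_counts` and `test_meta_cat_counts` could be passed in place of the example counters.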

In [19]:
# Collect split/category/movie paths for both splits into one list
all_movies = []
for movie in glob.glob('./data/prediction_videos_final_train/*/*'):
    all_movies.append('/'.join(movie.split('/')[2:5]))

for movie in glob.glob('./data/prediction_videos_final_test/*/*'):
    all_movies.append('/'.join(movie.split('/')[2:5]))

In [20]:
all_movies[:5]

## Looking at movie length distribution of combined test/train

In [223]:
movie_lengths = {}
for path in all_movies:
    files = os.listdir('./data/'+path)
    movie_lengths[''.join(path.split('_', maxsplit=3)[3:])] = len(files)

In [75]:
sorted_movies = sorted(movie_lengths.items(), key=operator.itemgetter(1))

In [76]:
sorted_movies[295:305]

In [77]:
sorted_movies[-10:]

In [38]:
length_series = pd.Series(list(movie_lengths.values()))
_ = length_series.hist(bins = 30, log = True)

In [226]:
length_series = pd.Series([val for val in movie_lengths.values() if val < 300])
_ = length_series.hist(bins = 30)

In [216]:
len_array = np.array([int(sm[1]) for sm in sorted_movies])

# Keep movies with more than 1 and fewer than 300 frames
len_array_f = len_array[(len_array > 1) & (len_array < 300)]
l_full, l_filt = len(len_array), len(len_array_f)
print(l_full, l_filt)
# Number and fraction of movies removed by the filter
ldiff = l_full - l_filt
print(ldiff)
print(ldiff / l_full)

In [217]:
300/24

Only about 5% of the videos are dropped by this filter, so at most 5% have 300 or more frames (300 frames is 12.5 s at 24 fps).
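
The filter above removes both single-frame entries and long movies, so the two groups can be counted separately. This is a minimal sketch: `length_summary` is a hypothetical helper, the sample lengths are invented, and the 24 fps rate is an assumption taken from the `300/24` cell.

```python
import numpy as np

def length_summary(lengths, max_frames=300, fps=24.0):
    """Count movies at or below one frame and at or above max_frames."""
    lengths = np.asarray(lengths)
    too_short = int((lengths <= 1).sum())
    too_long = int((lengths >= max_frames).sum())
    return {
        "n_total": int(lengths.size),
        "n_too_short": too_short,
        "n_too_long": too_long,
        "frac_too_long": too_long / lengths.size,
        # Duration of max_frames at the assumed frame rate
        "max_seconds": max_frames / fps,
    }

# Hypothetical frame counts, for illustration only.
summary = length_summary([1, 20, 50, 120, 299, 300, 450, 80, 60, 40])
print(summary)
```

Calling it on `len_array` would separate the share of long movies from the share of single-frame ones.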

In [81]:
df = pd.DataFrame.from_records(sorted_movies)

In [176]:
df[0].to_csv('./movies_sorted_by_length.csv', index = False)

### In the future, should set max capture length
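
Until a cap is applied at capture time, long movies could be truncated when loading. The sketch below assumes frames are stored as files whose sorted names follow capture order; `cap_frames`, the 300-frame default, and the file names are hypothetical.

```python
def cap_frames(frame_files, max_frames=300):
    """Keep at most max_frames frames, taken in sorted filename order."""
    return sorted(frame_files)[:max_frames]

# Hypothetical frame file names, for illustration only.
frames = [f"{i:05d}.jpg" for i in range(450)]
capped = cap_frames(frames)
print(len(capped), capped[0], capped[-1])
```

Movies shorter than the cap are returned unchanged.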