# CAN A COMPUTER UNDERSTAND SIGN LANGUAGE?

# Experimental Write-Up

BACKGROUND:  Sign language is a visual language that is not easily understood by hearing people.

OBJECTIVE:  Can we use motion sensor data to translate signs in Australian sign language into written English?

HYPOTHESIS:  Given motion sensor data, a machine learning algorithm can distinguish between the six question words:  
   what, when, where, which, who, why

SUCCESS METRIC:  Confusion matrix (precision and recall) for correct classification of six question words.  Optimal Area Under Curve (AUC) = 1.


## Dataset Description

DATA SOURCE:  https://archive.ics.uci.edu/ml/machine-learning-databases/auslan2-mld/auslan.data.html

DESCRIPTION:
  * 2565 signs: 27 samples of 95 signs made by 5 different signers
      * Note that although 95 different signs were available, we limited to the 6 question words for simplicity
  * Motion captured using Flock system
      * origin is a point just below the chin
      * x = number of meters in x-direction from origin 
      * y = number of meters in y-direction from origin
      * z = number of meters in z-direction from origin
      * Roll = number of degrees of palm rotation
          * 0 means palm is flat horizontal from perspective of the signer
          * -0.5 <= roll <= 0.5
          * positive means wrist is rolled clockwise from the perspective of the signer
          * multiply by 180 to get degrees
      * Pitch
          * 0 means palm is flat horizontal from the perspective of the signer
          * -0.5 <= pitch <= 0.5
          * positive means fingers are pointing up, palm facing out
          * multiply by 180 to get degrees      
      * Yaw
          * 0 means palm is flat horizontal from the perspective of the signer
          * -1.0 <= pitch <= 1.0
          * positive means wrist is cocked to the right 
          * multiply by 180 to get degrees            
      * Finger bend for 5 fingers
          * for each finger
              * 0 means straight finger
              * 1 means totally curled
              * 0 <= bend <= 1
              * bend measurements are not very exact
  * Avg Frames per sign:  41-102 (ave: 57)
      * Note that we limit to 51-52 frames per sign below to simplify the classification problem


<img src="PitchRollYaw.png">

## Diagram of Roll, Pitch, and Yaw
Think of the fingers of the hand as being the nose of the plane and the palm of the hand as being the belly of the plane.  This is the origin for roll, pitch, and yaw.

## Potential Classification Methodologies

* K-nearest neighbor
    * Too many features to be effective (11 features)
* Random forest
* Support vector machine
* Hierarchical clustering of time series
   https://stackoverflow.com/questions/34940808/hierarchical-clustering-of-time-series-in-python-scipy-numpy-pandas

# Exploratory Data Analysis

In [64]:
import os

import math
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

  # PLOTTING
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import seaborn as sns

In [2]:
   # this function reads in one dataset at a time
def read_in_file(person, session_num, sign, test_num):
    df = pd.read_csv(os.path.join('..', '..', '..', 'DS-SF-34-MyStuff', 'Australian-Sign-Language',
                                  'signs', person+session_num, sign+test_num+'.sign'), header=None)
    df.rename(columns = {0: 'x',
    1: 'y',
    2: 'z',
    3: 'roll',
    4: 'pitch',
    5: 'yaw',
    6: 'thumb_bend',
    7: 'pointer_bend',
    8: 'middle_bend',
    9: 'ring_bend',
    10: 'pinky_bend'}, inplace = True)
    df['signer'] = person
    df['sign'] = sign
    df['session'] = session_num
    df['test'] = test_num
    df['sign_index']=range(1, len(df) + 1)
    return df


## WHAT

In [3]:
   #  read in all of the files for WHAT
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset
def get_what():
    file_1=read_in_file('adam','1','what','0')
    file_2=read_in_file('adam','1','what','1')
    file_3=read_in_file('adam','2','what','0')
    file_4=read_in_file('adam','2','what','1')
    file_5=read_in_file('adam','2','what','2')
    file_6=read_in_file('adam','2','what','3')
    file_7=read_in_file('adam','2','what','4')
    file_8=read_in_file('adam','2','what','5')
    file_9=read_in_file('andrew','1','what','0')
    file_10=read_in_file('andrew','2','what','1')
    file_11=read_in_file('andrew','2','what','2')
    file_12=read_in_file('andrew','2','what','3')
    file_13=read_in_file('andrew','2','what','4')
    file_14=read_in_file('andrew','3','what','0')
    file_15=read_in_file('andrew','3','what','1')
    file_16=read_in_file('john','1','what','0')
    file_17=read_in_file('john','2','what','0')
    file_18=read_in_file('john','2','what','1')
    file_19=read_in_file('john','2','what','2')
    file_20=read_in_file('john','2','what','3')
    file_21=read_in_file('john','2','what','4')
    file_22=read_in_file('john','3','what','0')
    file_23=read_in_file('john','3','what','1')
    file_24=read_in_file('john','3','what','2')
    file_25=read_in_file('john','3','what','3')
    file_26=read_in_file('john','3','what','4')
    file_27=read_in_file('john','4','what','0')
    file_28=read_in_file('john','4','what','1')
    file_29=read_in_file('john','4','what','2')
    file_30=read_in_file('john','4','what','3')
    file_31=read_in_file('john','4','what','4')
    file_32=read_in_file('john','5','what','0')
    file_33=read_in_file('john','5','what','1')
    file_34=read_in_file('stephen','1','what','0')
    file_35=read_in_file('stephen','2','what','1')
    file_36=read_in_file('stephen','2','what','2')
    file_37=read_in_file('stephen','2','what','3')
    file_38=read_in_file('stephen','2','what','4')
    file_39=read_in_file('stephen','3','what','0')
    file_40=read_in_file('stephen','3','what','1')
    file_41=read_in_file('stephen','3','what','2')
    file_42=read_in_file('stephen','3','what','3')
    file_68=read_in_file('stephen','3','what','4')
    file_43=read_in_file('stephen','4','what','0')
    file_44=read_in_file('stephen','4','what','1')
    file_45=read_in_file('stephen','4','what','2')
    file_46=read_in_file('stephen','4','what','3')
    file_47=read_in_file('stephen','4','what','4')
    file_48=read_in_file('waleed','1','what','0')
    file_49=read_in_file('waleed','1','what','1')
    file_50=read_in_file('waleed','1','what','2')
    file_51=read_in_file('waleed','1','what','3')
    file_52=read_in_file('waleed','1','what','4')
    file_53=read_in_file('waleed','2','what','0')
    file_54=read_in_file('waleed','2','what','1')
    file_55=read_in_file('waleed','2','what','2')
    file_56=read_in_file('waleed','2','what','3')
    file_57=read_in_file('waleed','2','what','4')
    file_58=read_in_file('waleed','3','what','0')
    file_59=read_in_file('waleed','3','what','1')
    file_60=read_in_file('waleed','3','what','2')
    file_61=read_in_file('waleed','3','what','3')
    file_62=read_in_file('waleed','3','what','4')
    file_63=read_in_file('waleed','4','what','0')
    file_64=read_in_file('waleed','4','what','1')
    file_65=read_in_file('waleed','4','what','2')
    file_66=read_in_file('waleed','4','what','3')
    file_67=read_in_file('waleed','4','what','4')
    what=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return what

what=get_what()

what['num_frames'] = what.groupby(['signer','session','test'])['x'].transform('count').astype(int)

    # how many frames do we have for each "what" sign?  It varies quite a bit.
what[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,49
adam,1,1,54
adam,2,0,46
adam,2,1,50
adam,2,2,51
...,...,...,...
waleed,4,0,41
waleed,4,1,44
waleed,4,2,52
waleed,4,3,46


In [4]:
  # 41
#what.groupby(['signer','session','test'])['x'].count().min()
  # 102
what.groupby(['signer','session','test'])['x'].count().max()
  # 57.14
#what.groupby(['signer','session','test'])['x'].count().mean()

102

In [5]:
freq_what = what.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_what.columns = ['num_frames','count']
freq_what

Unnamed: 0,num_frames,count
0,49,6
1,48,6
2,68,5
3,51,4
4,65,4
...,...,...
24,60,1
25,64,1
26,69,1
27,78,1


## WHEN

In [6]:
   #  read in all of the files for WHEN
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset

def get_when():
    file_1=read_in_file('adam','1','when','0')
    file_2=read_in_file('adam','1','when','1')
    file_3=read_in_file('adam','2','when','0')
    file_4=read_in_file('adam','2','when','1')
    file_5=read_in_file('adam','2','when','2')
    file_6=read_in_file('adam','2','when','3')
    file_7=read_in_file('adam','2','when','4')
    file_8=read_in_file('adam','2','when','5')
    file_9=read_in_file('andrew','1','when','0')
    file_10=read_in_file('andrew','2','when','1')
    file_11=read_in_file('andrew','2','when','2')
    file_12=read_in_file('andrew','2','when','3')
    file_13=read_in_file('andrew','2','when','4')
    file_14=read_in_file('andrew','3','when','0')
    file_15=read_in_file('andrew','3','when','1')
    file_16=read_in_file('john','1','when','0')
    file_17=read_in_file('john','2','when','0')
    file_18=read_in_file('john','2','when','1')
    file_19=read_in_file('john','2','when','2')
    file_20=read_in_file('john','2','when','3')
    file_21=read_in_file('john','2','when','4')
    file_22=read_in_file('john','3','when','0')
    file_23=read_in_file('john','3','when','1')
    file_24=read_in_file('john','3','when','2')
    file_25=read_in_file('john','3','when','3')
    file_26=read_in_file('john','3','when','4')
    file_27=read_in_file('john','4','when','0')
    file_28=read_in_file('john','4','when','1')
    file_29=read_in_file('john','4','when','2')
    file_30=read_in_file('john','4','when','3')
    file_31=read_in_file('john','4','when','4')
    file_32=read_in_file('john','5','when','0')
    file_33=read_in_file('john','5','when','1')
    file_34=read_in_file('stephen','1','when','0')
    file_35=read_in_file('stephen','2','when','1')
    file_36=read_in_file('stephen','2','when','2')
    file_37=read_in_file('stephen','2','when','3')
    file_38=read_in_file('stephen','2','when','4')
    file_39=read_in_file('stephen','3','when','0')
    file_40=read_in_file('stephen','3','when','1')
    file_41=read_in_file('stephen','3','when','2')
    file_42=read_in_file('stephen','3','when','3')
    file_68=read_in_file('stephen','3','when','4')
    file_43=read_in_file('stephen','4','when','0')
    file_44=read_in_file('stephen','4','when','1')
    file_45=read_in_file('stephen','4','when','2')
    file_46=read_in_file('stephen','4','when','3')
    file_47=read_in_file('stephen','4','when','4')
    file_48=read_in_file('waleed','1','when','0')
    file_49=read_in_file('waleed','1','when','1')
    file_50=read_in_file('waleed','1','when','2')
    file_51=read_in_file('waleed','1','when','3')
    file_52=read_in_file('waleed','1','when','4')
    file_53=read_in_file('waleed','2','when','0')
    file_54=read_in_file('waleed','2','when','1')
    file_55=read_in_file('waleed','2','when','2')
    file_56=read_in_file('waleed','2','when','3')
    file_57=read_in_file('waleed','2','when','4')
    file_58=read_in_file('waleed','3','when','0')
    file_59=read_in_file('waleed','3','when','1')
    file_60=read_in_file('waleed','3','when','2')
    file_61=read_in_file('waleed','3','when','3')
    file_62=read_in_file('waleed','3','when','4')
    file_63=read_in_file('waleed','4','when','0')
    file_64=read_in_file('waleed','4','when','1')
    file_65=read_in_file('waleed','4','when','2')
    file_66=read_in_file('waleed','4','when','3')
    file_67=read_in_file('waleed','4','when','4')
    when=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return when

when=get_when()

when['num_frames'] = when.groupby(['signer','session','test'])['x'].transform('count').astype(int)

when['index']=range(1, len(when) + 1)

    # how many frames do we have for each "when" sign?  It varies quite a bit.
when[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,55
adam,1,1,56
adam,2,0,41
adam,2,1,48
adam,2,2,47
...,...,...,...
waleed,4,0,67
waleed,4,1,53
waleed,4,2,56
waleed,4,3,57


In [7]:
  # 41
#when[['signer','session','test','x']].groupby(['signer','session','test']).count().min()
  # 292
#when[['signer','session','test','x']].groupby(['signer','session','test']).count().max()
  # 63.2
when[['signer','session','test','x']].groupby(['signer','session','test']).count().mean()

x    63.164179
dtype: float64

In [8]:
freq_when = when.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_when.columns = ['num_frames','count']
freq_when

Unnamed: 0,num_frames,count
0,58,7
1,55,5
2,48,5
3,60,4
4,56,4
...,...,...
24,68,1
25,70,1
26,72,1
27,78,1


## WHERE

In [9]:
   #  read in all of the files for where
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset

def get_where():
    file_1=read_in_file('adam','1','where','0')
    file_2=read_in_file('adam','1','where','1')
    file_3=read_in_file('adam','2','where','0')
    file_4=read_in_file('adam','2','where','1')
    file_5=read_in_file('adam','2','where','2')
    file_6=read_in_file('adam','2','where','3')
    file_7=read_in_file('adam','2','where','4')
    file_8=read_in_file('adam','2','where','5')
    file_9=read_in_file('andrew','1','where','0')
    file_10=read_in_file('andrew','2','where','1')
    file_11=read_in_file('andrew','2','where','2')
    file_12=read_in_file('andrew','2','where','3')
    file_13=read_in_file('andrew','2','where','4')
    file_14=read_in_file('andrew','3','where','0')
    file_15=read_in_file('andrew','3','where','1')
    file_16=read_in_file('john','1','where','0')
    file_17=read_in_file('john','2','where','0')
    file_18=read_in_file('john','2','where','1')
    file_19=read_in_file('john','2','where','2')
    file_20=read_in_file('john','2','where','3')
    file_21=read_in_file('john','2','where','4')
    file_22=read_in_file('john','3','where','0')
    file_23=read_in_file('john','3','where','1')
    file_24=read_in_file('john','3','where','2')
    file_25=read_in_file('john','3','where','3')
    file_26=read_in_file('john','3','where','4')
    file_27=read_in_file('john','4','where','0')
    file_28=read_in_file('john','4','where','1')
    file_29=read_in_file('john','4','where','2')
    file_30=read_in_file('john','4','where','3')
    file_31=read_in_file('john','4','where','4')
    file_32=read_in_file('john','5','where','0')
    file_33=read_in_file('john','5','where','1')
    file_34=read_in_file('stephen','1','where','0')
    file_35=read_in_file('stephen','2','where','1')
    file_36=read_in_file('stephen','2','where','2')
    file_37=read_in_file('stephen','2','where','3')
    file_38=read_in_file('stephen','2','where','4')
    file_39=read_in_file('stephen','3','where','0')
    file_40=read_in_file('stephen','3','where','1')
    file_41=read_in_file('stephen','3','where','2')
    file_42=read_in_file('stephen','3','where','3')
    file_68=read_in_file('stephen','3','where','4')
    file_43=read_in_file('stephen','4','where','0')
    file_44=read_in_file('stephen','4','where','1')
    file_45=read_in_file('stephen','4','where','2')
    file_46=read_in_file('stephen','4','where','3')
    file_47=read_in_file('stephen','4','where','4')
    file_48=read_in_file('waleed','1','where','0')
    file_49=read_in_file('waleed','1','where','1')
    file_50=read_in_file('waleed','1','where','2')
    file_51=read_in_file('waleed','1','where','3')
    file_52=read_in_file('waleed','1','where','4')
    file_53=read_in_file('waleed','2','where','0')
    file_54=read_in_file('waleed','2','where','1')
    file_55=read_in_file('waleed','2','where','2')
    file_56=read_in_file('waleed','2','where','3')
    file_57=read_in_file('waleed','2','where','4')
    file_58=read_in_file('waleed','3','where','0')
    file_59=read_in_file('waleed','3','where','1')
    file_60=read_in_file('waleed','3','where','2')
    file_61=read_in_file('waleed','3','where','3')
    file_62=read_in_file('waleed','3','where','4')
    file_63=read_in_file('waleed','4','where','0')
    file_64=read_in_file('waleed','4','where','1')
    file_65=read_in_file('waleed','4','where','2')
    file_66=read_in_file('waleed','4','where','3')
    file_67=read_in_file('waleed','4','where','4')
    where=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return where

where=get_where()

where['num_frames'] = where.groupby(['signer','session','test'])['x'].transform('count').astype(int)

where['index']=range(1, len(where) + 1)

    # how many frames do we have for each "where" sign?  It varies quite a bit.
where[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,53
adam,1,1,50
adam,2,0,46
adam,2,1,49
adam,2,2,47
...,...,...,...
waleed,4,0,50
waleed,4,1,53
waleed,4,2,55
waleed,4,3,60


In [10]:
  # 45
#where[['signer','session','test','x']].groupby(['signer','session','test']).count().min()
  # 485
#where[['signer','session','test','x']].groupby(['signer','session','test']).count().max()
  # 68.06
where[['signer','session','test','x']].groupby(['signer','session','test']).count().mean()

x    68.059701
dtype: float64

In [11]:
freq_where = where.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_where.columns = ['num_frames','count']
freq_where

Unnamed: 0,num_frames,count
0,53,8
1,55,5
2,50,5
3,52,4
4,47,4
...,...,...
29,79,1
30,81,1
31,85,1
32,92,1


##  WHICH

In [12]:
   #  read in all of the files for which
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset

def get_which():
    file_1=read_in_file('adam','1','which','0')
    file_2=read_in_file('adam','1','which','1')
    file_3=read_in_file('adam','2','which','0')
    file_4=read_in_file('adam','2','which','1')
    file_5=read_in_file('adam','2','which','2')
    file_6=read_in_file('adam','2','which','3')
    file_7=read_in_file('adam','2','which','4')
    file_8=read_in_file('adam','2','which','5')
    file_9=read_in_file('andrew','1','which','0')
    file_10=read_in_file('andrew','2','which','1')
    file_11=read_in_file('andrew','2','which','2')
    file_12=read_in_file('andrew','2','which','3')
    file_13=read_in_file('andrew','2','which','4')
    file_14=read_in_file('andrew','3','which','0')
    file_15=read_in_file('andrew','3','which','1')
    file_16=read_in_file('john','1','which','0')
    file_17=read_in_file('john','2','which','0')
    file_18=read_in_file('john','2','which','1')
    file_19=read_in_file('john','2','which','2')
    file_20=read_in_file('john','2','which','3')
    file_21=read_in_file('john','2','which','4')
    file_22=read_in_file('john','3','which','0')
    file_23=read_in_file('john','3','which','1')
    file_24=read_in_file('john','3','which','2')
    file_25=read_in_file('john','3','which','3')
    file_26=read_in_file('john','3','which','4')
    file_27=read_in_file('john','4','which','0')
    file_28=read_in_file('john','4','which','1')
    file_29=read_in_file('john','4','which','2')
    file_30=read_in_file('john','4','which','3')
    file_31=read_in_file('john','4','which','4')
    file_32=read_in_file('john','5','which','0')
    file_33=read_in_file('john','5','which','1')
    file_34=read_in_file('stephen','1','which','0')
    file_35=read_in_file('stephen','2','which','1')
    file_36=read_in_file('stephen','2','which','2')
    file_37=read_in_file('stephen','2','which','3')
    file_38=read_in_file('stephen','2','which','4')
    file_39=read_in_file('stephen','3','which','0')
    file_40=read_in_file('stephen','3','which','1')
    file_41=read_in_file('stephen','3','which','2')
    file_42=read_in_file('stephen','3','which','3')
    file_68=read_in_file('stephen','3','which','4')
    file_43=read_in_file('stephen','4','which','0')
    file_44=read_in_file('stephen','4','which','1')
    file_45=read_in_file('stephen','4','which','2')
    file_46=read_in_file('stephen','4','which','3')
    file_47=read_in_file('stephen','4','which','4')
    file_48=read_in_file('waleed','1','which','0')
    file_49=read_in_file('waleed','1','which','1')
    file_50=read_in_file('waleed','1','which','2')
    file_51=read_in_file('waleed','1','which','3')
    file_52=read_in_file('waleed','1','which','4')
    file_53=read_in_file('waleed','2','which','0')
    file_54=read_in_file('waleed','2','which','1')
    file_55=read_in_file('waleed','2','which','2')
    file_56=read_in_file('waleed','2','which','3')
    file_57=read_in_file('waleed','2','which','4')
    file_58=read_in_file('waleed','3','which','0')
    file_59=read_in_file('waleed','3','which','1')
    file_60=read_in_file('waleed','3','which','2')
    file_61=read_in_file('waleed','3','which','3')
    file_62=read_in_file('waleed','3','which','4')
    file_63=read_in_file('waleed','4','which','0')
    file_64=read_in_file('waleed','4','which','1')
    file_65=read_in_file('waleed','4','which','2')
    file_66=read_in_file('waleed','4','which','3')
    file_67=read_in_file('waleed','4','which','4')
    which=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return which

which=get_which()

which['num_frames'] = which.groupby(['signer','session','test'])['x'].transform('count').astype(int)

which['index']=range(1, len(which) + 1)

    # how many frames do we have for each "which" sign?  It varies quite a bit.
which[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,58
adam,1,1,51
adam,2,0,47
adam,2,1,47
adam,2,2,47
...,...,...,...
waleed,4,0,52
waleed,4,1,50
waleed,4,2,53
waleed,4,3,51


In [13]:
  # 43
#which[['signer','session','test','x']].groupby(['signer','session','test']).count().min()
  # 121
#which[['signer','session','test','x']].groupby(['signer','session','test']).count().max()
  # 58.8
which[['signer','session','test','x']].groupby(['signer','session','test']).count().mean()

x    58.820896
dtype: float64

In [14]:
freq_which = which.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_which.columns = ['num_frames','count']
freq_which

Unnamed: 0,num_frames,count
0,47,6
1,52,5
2,68,4
3,59,3
4,64,3
...,...,...
26,55,1
27,67,1
28,112,1
29,61,1


## WHO

In [15]:
   #  read in all of the files for who
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset

def get_who():
    file_1=read_in_file('adam','1','who','0')
    file_2=read_in_file('adam','1','who','1')
    file_3=read_in_file('adam','2','who','0')
    file_4=read_in_file('adam','2','who','1')
    file_5=read_in_file('adam','2','who','2')
    file_6=read_in_file('adam','2','who','3')
    file_7=read_in_file('adam','2','who','4')
    file_8=read_in_file('adam','2','who','5')
    file_9=read_in_file('andrew','1','who','0')
    file_10=read_in_file('andrew','2','who','1')
    file_11=read_in_file('andrew','2','who','2')
    file_12=read_in_file('andrew','2','who','3')
    file_13=read_in_file('andrew','2','who','4')
    file_14=read_in_file('andrew','3','who','0')
    file_15=read_in_file('andrew','3','who','1')
    file_16=read_in_file('john','1','who','0')
    file_17=read_in_file('john','2','who','0')
    file_18=read_in_file('john','2','who','1')
    file_19=read_in_file('john','2','who','2')
    file_20=read_in_file('john','2','who','3')
    file_21=read_in_file('john','2','who','4')
    file_22=read_in_file('john','3','who','0')
    file_23=read_in_file('john','3','who','1')
    file_24=read_in_file('john','3','who','2')
    file_25=read_in_file('john','3','who','3')
    file_26=read_in_file('john','3','who','4')
    file_27=read_in_file('john','4','who','0')
    file_28=read_in_file('john','4','who','1')
    file_29=read_in_file('john','4','who','2')
    file_30=read_in_file('john','4','who','3')
    file_31=read_in_file('john','4','who','4')
    file_32=read_in_file('john','5','who','0')
    file_33=read_in_file('john','5','who','1')
    file_34=read_in_file('stephen','1','who','0')
    file_35=read_in_file('stephen','2','who','1')
    file_36=read_in_file('stephen','2','who','2')
    file_37=read_in_file('stephen','2','who','3')
    file_38=read_in_file('stephen','2','who','4')
    file_39=read_in_file('stephen','3','who','0')
    file_40=read_in_file('stephen','3','who','1')
    file_41=read_in_file('stephen','3','who','2')
    file_42=read_in_file('stephen','3','who','3')
    file_68=read_in_file('stephen','3','who','4')
    file_43=read_in_file('stephen','4','who','0')
    file_44=read_in_file('stephen','4','who','1')
    file_45=read_in_file('stephen','4','who','2')
    file_46=read_in_file('stephen','4','who','3')
    file_47=read_in_file('stephen','4','who','4')
    file_48=read_in_file('waleed','1','who','0')
    file_49=read_in_file('waleed','1','who','1')
    file_50=read_in_file('waleed','1','who','2')
    file_51=read_in_file('waleed','1','who','3')
    file_52=read_in_file('waleed','1','who','4')
    file_53=read_in_file('waleed','2','who','0')
    file_54=read_in_file('waleed','2','who','1')
    file_55=read_in_file('waleed','2','who','2')
    file_56=read_in_file('waleed','2','who','3')
    file_57=read_in_file('waleed','2','who','4')
    file_58=read_in_file('waleed','3','who','0')
    file_59=read_in_file('waleed','3','who','1')
    file_60=read_in_file('waleed','3','who','2')
    file_61=read_in_file('waleed','3','who','3')
    file_62=read_in_file('waleed','3','who','4')
    file_63=read_in_file('waleed','4','who','0')
    file_64=read_in_file('waleed','4','who','1')
    file_65=read_in_file('waleed','4','who','2')
    file_66=read_in_file('waleed','4','who','3')
    file_67=read_in_file('waleed','4','who','4')
    who=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return who

who=get_who()

who['num_frames'] = who.groupby(['signer','session','test'])['x'].transform('count').astype(int)

who['index']=range(1, len(who) + 1)

    # how many frames do we have for each "who" sign?  It varies quite a bit.
who[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,59
adam,1,1,69
adam,2,0,50
adam,2,1,48
adam,2,2,54
...,...,...,...
waleed,4,0,48
waleed,4,1,53
waleed,4,2,53
waleed,4,3,52


In [16]:
  # 48
#who[['signer','session','test','x']].groupby(['signer','session','test']).count().min()
  # 126
#who[['signer','session','test','x']].groupby(['signer','session','test']).count().max()
  # 65.9
who[['signer','session','test','x']].groupby(['signer','session','test']).count().mean()

x    65.895522
dtype: float64

In [17]:
freq_who = who.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_who.columns = ['num_frames','count']
freq_who

Unnamed: 0,num_frames,count
0,54,6
1,73,6
2,70,5
3,51,4
4,69,4
...,...,...
24,58,1
25,79,1
26,60,1
27,124,1


## WHY

In [18]:
   #  read in all of the files for why
    # you have to do this manually for each sign, 
      # because each sign has a different # of session_# and test_#s from the dataset

def get_why():
    file_1=read_in_file('adam','1','why','0')
    file_2=read_in_file('adam','1','why','1')
    file_3=read_in_file('adam','2','why','0')
    file_4=read_in_file('adam','2','why','1')
    file_5=read_in_file('adam','2','why','2')
    file_6=read_in_file('adam','2','why','3')
    file_7=read_in_file('adam','2','why','4')
    file_8=read_in_file('adam','2','why','5')
    file_9=read_in_file('andrew','1','why','0')
    file_10=read_in_file('andrew','2','why','1')
    file_11=read_in_file('andrew','2','why','2')
    file_12=read_in_file('andrew','2','why','3')
    file_13=read_in_file('andrew','2','why','4')
    file_14=read_in_file('andrew','3','why','0')
    file_15=read_in_file('andrew','3','why','1')
    file_16=read_in_file('john','1','why','0')
    file_17=read_in_file('john','2','why','0')
    file_18=read_in_file('john','2','why','1')
    file_19=read_in_file('john','2','why','2')
    file_20=read_in_file('john','2','why','3')
    file_21=read_in_file('john','2','why','4')
    file_22=read_in_file('john','3','why','0')
    file_23=read_in_file('john','3','why','1')
    file_24=read_in_file('john','3','why','2')
    file_25=read_in_file('john','3','why','3')
    file_26=read_in_file('john','3','why','4')
    file_27=read_in_file('john','4','why','0')
    file_28=read_in_file('john','4','why','1')
    file_29=read_in_file('john','4','why','2')
    file_30=read_in_file('john','4','why','3')
    file_31=read_in_file('john','4','why','4')
    file_32=read_in_file('john','5','why','0')
    file_33=read_in_file('john','5','why','1')
    file_34=read_in_file('stephen','1','why','0')
    file_35=read_in_file('stephen','2','why','1')
    file_36=read_in_file('stephen','2','why','2')
    file_37=read_in_file('stephen','2','why','3')
    file_38=read_in_file('stephen','2','why','4')
    file_39=read_in_file('stephen','3','why','0')
    file_40=read_in_file('stephen','3','why','1')
    file_41=read_in_file('stephen','3','why','2')
    file_42=read_in_file('stephen','3','why','3')
    file_68=read_in_file('stephen','3','why','4')
    file_43=read_in_file('stephen','4','why','0')
    file_44=read_in_file('stephen','4','why','1')
    file_45=read_in_file('stephen','4','why','2')
    file_46=read_in_file('stephen','4','why','3')
    file_47=read_in_file('stephen','4','why','4')
    file_48=read_in_file('waleed','1','why','0')
    file_49=read_in_file('waleed','1','why','1')
    file_50=read_in_file('waleed','1','why','2')
    file_51=read_in_file('waleed','1','why','3')
    file_52=read_in_file('waleed','1','why','4')
    file_53=read_in_file('waleed','2','why','0')
    file_54=read_in_file('waleed','2','why','1')
    file_55=read_in_file('waleed','2','why','2')
    file_56=read_in_file('waleed','2','why','3')
    file_57=read_in_file('waleed','2','why','4')
    file_58=read_in_file('waleed','3','why','0')
    file_59=read_in_file('waleed','3','why','1')
    file_60=read_in_file('waleed','3','why','2')
    file_61=read_in_file('waleed','3','why','3')
    file_62=read_in_file('waleed','3','why','4')
    file_63=read_in_file('waleed','4','why','0')
    file_64=read_in_file('waleed','4','why','1')
    file_65=read_in_file('waleed','4','why','2')
    file_66=read_in_file('waleed','4','why','3')
    file_67=read_in_file('waleed','4','why','4')
    why=pd.concat([file_1,file_2,file_3,file_4,file_5,file_6,file_7,file_8,file_9,file_10,
                 file_11,file_12,file_13,file_14,file_15,file_16,file_17,file_18,file_19,file_10,
                 file_21,file_22,file_23,file_24,file_25,file_26,file_27,file_28,file_29,file_20,
                 file_31,file_32,file_33,file_34,file_35,file_36,file_37,file_38,file_39,file_30,
                 file_41,file_42,file_43,file_44,file_45,file_46,file_47,file_48,file_49,file_40,
                 file_51,file_52,file_53,file_54,file_55,file_56,file_57,file_58,file_59,file_50,
                 file_61,file_62,file_63,file_64,file_65,file_66,file_67,file_68])

    return why

why=get_why()

why['num_frames'] = why.groupby(['signer','session','test'])['x'].transform('count').astype(int)

why['index']=range(1, len(why) + 1)

    # how many frames do we have for each "why" sign?  It varies quite a bit.
why[['signer','session','test','x']].groupby(['signer','session','test']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,x
signer,session,test,Unnamed: 3_level_1
adam,1,0,52
adam,1,1,59
adam,2,0,50
adam,2,1,47
adam,2,2,41
...,...,...,...
waleed,4,0,65
waleed,4,1,56
waleed,4,2,48
waleed,4,3,49


In [19]:
  # 41
#why[['signer','session','test','x']].groupby(['signer','session','test']).count().min()
  # 132
#why[['signer','session','test','x']].groupby(['signer','session','test']).count().max()
  # 57.4
why[['signer','session','test','x']].groupby(['signer','session','test']).count().mean()

x    57.432836
dtype: float64

In [20]:
freq_why = why.groupby(['signer','session','test'])['x'].count().value_counts().reset_index()
freq_why.columns = ['num_frames','count']
freq_why

Unnamed: 0,num_frames,count
0,51,6
1,47,5
2,52,4
3,59,4
4,66,4
...,...,...
24,65,1
25,67,1
26,69,1
27,73,1


## COMBINED FREQ FILE

In [48]:
   # INNER JOIN, so order matters.
# based on the result below, selecting 51-52 frames as the baseline yields the largest number of data points per word.
freq1 = pd.merge(freq_what,freq_when,how='left',on='num_frames')
freq1.columns=['num_frames','count_what','count_when']
freq2 = pd.merge(freq1,freq_where,how='left',on='num_frames')
freq2.columns=['num_frames','count_what','count_when','count_where']
freq3 = pd.merge(freq2,freq_which,how='left',on='num_frames')
freq3.columns=['num_frames','count_what','count_when','count_where','count_which']
freq4 = pd.merge(freq3,freq_who,how='left',on='num_frames')
freq4.columns=['num_frames','count_what','count_when','count_where','count_which','count_who']
freq = pd.merge(freq4,freq_why,how='left',on='num_frames')
freq.columns=['num_frames','count_what','count_when','count_where','count_which','count_who','count_why']
freq.count_what.fillna(0, inplace = True)
freq.count_when.fillna(0, inplace = True)
freq.count_where.fillna(0, inplace = True)
freq.count_which.fillna(0, inplace = True)
freq.count_who.fillna(0, inplace = True)
freq.count_why.fillna(0, inplace = True)
freq['sum']=freq.count_what+freq.count_when+freq.count_where+freq.count_which+freq.count_who+freq.count_why
freq.sort_values('num_frames')
freq[(freq.num_frames==51)|(freq.num_frames==52)]


Unnamed: 0,num_frames,count_what,count_when,count_where,count_which,count_who,count_why,sum
3,51,4,1.0,3.0,3.0,4.0,6.0,21.0
13,52,2,4.0,4.0,5.0,1.0,4.0,20.0


# We only want the records with num_frames = 51 or 52.  For the 51 frame records:  repeat the 51st entry to get 52 frames

## What

In [68]:
   # define a function that will do the following:  
    # For 51 frame records, repeat the 51st entry to get 52 frames.
    
def copy_51_frame(sign_name):
    # how many records have exactly 51 frames?
    sign_name[(sign_name.num_frames==51)&(sign_name.sign_index==sign_name.num_frames)]
       # repeat the 51st record for each of the records above
    sign_name+'_toadd' = sign_name[(sign_name.num_frames==51)&(sign_name.sign_index==sign_name.num_frames)]
    sign_name+'_toadd'.set_value('50', 'sign_index', 52)
    what_toadd
       # lets see all the records together and confirm that we did this correctly.
    'new'+sign_name = pd.concat([sign_name,sign_name+'_toadd'])
    just_new = 'new'+sign_name['new'+sign_name.num_frames==51]
    just_new.sort_values(['signer','session','test','sign_index'])

SyntaxError: can't assign to operator (<ipython-input-68-de889db3a86b>, line 8)

In [67]:
copy_51_frame(what)

NameError: name 'copy_51_frame' is not defined

In [65]:
   # repeat the 51st record for each of the records above
what_toadd = what[(what.num_frames==51)&(what.sign_index==what.num_frames)]
what_toadd.set_value('50', 'sign_index', 52)
what_toadd
   # lets see all the records together and confirm that we did this correctly.
new_what = pd.concat([what,what_toadd])
just_new = new_what[new_what.num_frames==51]
just_new.sort_values(['signer','session','test','sign_index'])

Unnamed: 0,x,y,z,roll,pitch,...,sign,session,test,sign_index,num_frames
0,0.007812,0.000000,0.000000,0.083333,-1.0,...,what,2,2,1,51
1,0.007812,0.000000,0.000000,0.083333,-1.0,...,what,2,2,2,51
2,0.000000,-0.007812,0.000000,0.000000,-1.0,...,what,2,2,3,51
3,0.023438,0.007812,0.000000,0.083333,-1.0,...,what,2,2,4,51
4,0.031250,0.039062,0.000000,0.083333,-1.0,...,what,2,2,5,51
5,0.031250,0.078125,-0.015625,0.000000,-1.0,...,what,2,2,6,51
6,0.015625,0.132812,-0.023438,0.000000,-1.0,...,what,2,2,7,51
7,0.015625,0.218750,-0.039062,0.000000,-1.0,...,what,2,2,8,51
8,0.000000,0.343750,-0.046875,0.000000,-1.0,...,what,2,2,9,51
9,-0.023438,0.429688,-0.054688,0.000000,-1.0,...,what,2,2,10,51


## When

In [None]:
what[(what.num_frames==51)&(what.sign_index==what.num_frames)]