## Problem 3.1: Validating fish activity data

Attribution: Xinhong did the majority of this problem, and Zhiyang and Maddie helped to clean up the file. We all discussed ways of validating the activity data together.

In [2]:
import glob

import numpy as np
import pandas as pd

### Let's check the fish activity file first

In [4]:
fname = '../data/fish_actvity_for_validation/150717_2A_2B.csv'

# Load in the data
df = pd.read_csv(fname)

df.head()

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime
0,c97,z097,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,34,55.7,34,4.3,0,0.0,17/07/2015,14:30:00
1,c98,z098,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,0,60.0,0,0.0,0,0.0,17/07/2015,14:30:00
2,c99,z099,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,6,59.0,6,1.0,0,0.0,17/07/2015,14:30:00
3,c100,z100,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,25,55.8,25,4.2,0,0.0,17/07/2015,14:30:00
4,c101,z101,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,1,60.0,0,0.0,0,0.0,17/07/2015,14:30:00


Since we only care about fish 1-96 (which have already been genotyped), we can slice out the data for those fish first. The fish ID is embedded in the 'animal' column, so we can generate an ID column from it.

In [5]:
# Generate an ID column
df['ID'] = df['animal'].str[1:]
df.head()

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID
0,c97,z097,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,34,55.7,34,4.3,0,0.0,17/07/2015,14:30:00,97
1,c98,z098,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,0,60.0,0,0.0,0,0.0,17/07/2015,14:30:00,98
2,c99,z099,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,6,59.0,6,1.0,0,0.0,17/07/2015,14:30:00,99
3,c100,z100,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,25,55.8,25,4.2,0,0.0,17/07/2015,14:30:00,100
4,c101,z101,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,1,60.0,0,0.0,0,0.0,17/07/2015,14:30:00,101


In [6]:
# Slice out the fish with IDs from 1 to 96
df.loc[:,'ID'] = df.loc[:, 'ID'].astype(int)
df_fish = df.loc[df['ID'] <= 96, :]
df_fish.head()

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID
96,c1,z001,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,0,60.0,0,0.0,0,0.0,17/07/2015,14:29:59,1
97,c2,z002,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,10,58.5,10,1.5,0,0.0,17/07/2015,14:29:59,2
98,c3,z003,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,17,57.6,16,2.4,0,0.0,17/07/2015,14:29:59,3
99,c4,z004,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,4,59.5,4,0.6,0,0.0,17/07/2015,14:29:59,4
100,c5,z005,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,38,54.9,37,5.2,0,0.0,17/07/2015,14:29:59,5
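
As a possible simplification, the ID can be parsed as an integer at the moment it is created, which avoids the separate conversion step. The following is a minimal sketch on a toy frame, assuming every `animal` entry has the form `z` followed by digits; the names `df_toy` and `df_toy_fish` and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the activity DataFrame (values are illustrative only)
df_toy = pd.DataFrame({'animal': ['z097', 'z001', 'z002', 'z100'],
                       'middur': [4.3, 0.0, 1.5, 4.2]})

# Strip the leading 'z' and convert to int in one step
df_toy['ID'] = df_toy['animal'].str[1:].astype(int)

# Keep only the genotyped fish (IDs 1 through 96)
df_toy_fish = df_toy.loc[df_toy['ID'] <= 96, :]
print(df_toy_fish)
```

With an integer ID from the start, comparisons such as `ID <= 96` work without any further type conversion.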


#### Below are the tests we would like to perform on the dataset:
- Check whether there are NaNs in the activity column (middur)
- Check whether the activity column (middur) is >= 0 [because middur is a time within a 1-minute bin]
- Check whether the activity column (middur) is <= 60 s [because middur is a time within a 1-minute bin]
- Check that the time interval is always 1 minute across the whole session
- Check whether each fish has the same number of data points
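
The first three checks in the list could also be bundled into one function that collects every problem instead of stopping at the first failed assertion. This is only a sketch under assumptions: the function name `validate_activity`, the `max_dur` parameter, and the toy data are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd


def validate_activity(df, fname, max_dur=60.0):
    """Run basic checks on the 'middur' column; return a list of problems."""
    problems = []
    if df['middur'].isnull().any():
        problems.append(f'{fname} contains missing data')
    if (df['middur'] < 0).any():
        problems.append(f'{fname} contains negative activity data')
    if (df['middur'] > max_dur).any():
        problems.append(f'{fname} contains activity data larger than {max_dur} s')
    return problems


# Toy data: one valid value, one value above 60 s, one missing value
df_toy = pd.DataFrame({'middur': [4.3, 75.3, np.nan]})
problems = validate_activity(df_toy, 'toy.csv')
for problem in problems:
    print(problem)
```

Returning a list rather than asserting makes it possible to report all issues in a file at once; the assertion-based tests used in this notebook stop at the first failure.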

In [7]:
# Check whether there are NaNs in the activity column (middur)
def test_missing_data(df, fname):
    """Look for missing entries."""
    assert np.all(df['middur'].notnull()), fname + ' contains missing data'
    
test_missing_data(df_fish, fname)

It passed! There are no NaNs in the activity column.

In [8]:
# Check whether the activity column (middur) is >= 0 [because middur is a time within a 1-minute bin]
def test_negative(df, fname):
    """Look for negative activity values."""
    assert np.all(df['middur'] >= 0), \
            fname + ' contains negative activity data'

test_negative(df_fish,fname)

It passed! None of the activity data are negative.

In [9]:
# Check whether the activity column (middur) is <= 60 s [because middur is a time within a 1-minute bin]
def test_less_than_60s(df, fname):
    """Look for activity values larger than 60 s."""
    assert np.all(df['middur'] <= 60), \
            fname + ' contains data that is larger than 60s'

test_less_than_60s(df_fish,fname)

AssertionError: ../data/fish_actvity_for_validation/150717_2A_2B.csv contains data that is larger than 60s

It seems that some values in the activity column (middur) are larger than 60 s, which doesn't make sense given the definition of middur (time within one minute). Let's take a look at them.

In [10]:
df_activity_exceed_60 = df_fish.loc[df_fish['middur'] > 60]
df_activity_exceed_60

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID
619978,c67,z067,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,1.3,1,75.3,0,0.0,19/07/2015,20:20:04,67
619994,c83,z083,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,1.3,1,75.3,0,0.0,19/07/2015,20:20:11,83
620005,c94,z094,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,1.3,1,75.3,0,0.0,19/07/2015,20:20:15,94
620104,c3,z003,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.6,End of period,End of period,0,1.2,1,75.4,0,0.0,19/07/2015,20:19:39,3
620110,c9,z009,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.6,End of period,End of period,1,1.1,2,75.4,0,0.0,19/07/2015,20:19:41,9
620128,c28,z028,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.6,End of period,End of period,1,1.1,2,75.4,0,0.0,19/07/2015,20:19:48,28


We can see that 6 data points in the activity column are > 60 s, which is inconsistent with the researcher's claim. Maybe we should exclude those data points when doing the real analysis.

The time interval of these data points (end - start is about 76.5 s) is longer than 60 s, which explains why the activity value can exceed 60 s. This raises the concern that the measurement time is not always 1 minute across the whole dataset, so let's check that in the next step.

In [11]:
# Generate a column of time length for each measurement 
# using the 'end' and 'start' column
df_fish['time_length'] = df_fish['end'] - df_fish['start']
df_fish.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID,time_length
96,c1,z001,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,0,60.0,0,0.0,0,0.0,17/07/2015,14:29:59,1,60.0
97,c2,z002,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,10,58.5,10,1.5,0,0.0,17/07/2015,14:29:59,2,60.0
98,c3,z003,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,17,57.6,16,2.4,0,0.0,17/07/2015,14:29:59,3,60.0
99,c4,z004,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,4,59.5,4,0.6,0,0.0,17/07/2015,14:29:59,4,60.0
100,c5,z005,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,38,54.9,37,5.2,0,0.0,17/07/2015,14:29:59,5,60.0
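
The warning above appears because `df_fish` was created by slicing `df`. One way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit copy of the slice before adding columns to it.

```python
import pandas as pd

# Toy stand-in for the activity DataFrame (values are illustrative only)
df_toy = pd.DataFrame({'ID': [1, 2, 97],
                       'start': [0.0, 0.0, 0.0],
                       'end': [60.0, 76.5, 60.0]})

# .copy() makes the slice an independent DataFrame,
# so adding a column no longer triggers SettingWithCopyWarning
df_toy_fish = df_toy.loc[df_toy['ID'] <= 96, :].copy()
df_toy_fish['time_length'] = df_toy_fish['end'] - df_toy_fish['start']
print(df_toy_fish)
```

The original frame is left untouched; only the copy gains the new column.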


In [12]:
# Check that the time length is always 1 minute across the whole session
def test_time_length(df, fname):
    """Check that every measurement spans exactly 60 s."""
    assert np.all(df['time_length'] == 60), \
            fname + ' contains time length that is not 60s'

test_time_length(df_fish,fname)

AssertionError: ../data/fish_actvity_for_validation/150717_2A_2B.csv contains time length that is not 60s

It didn't pass, indicating that not all time lengths are 60 s, which is consistent with our observation in the last step. Let's take a look at the measurements whose time length is not 60 s.

In [15]:
df_fish.loc[df_fish['time_length'] > 60]

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID,time_length
619968,c10,z010,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193805.5,End of period,End of period,1,65.3,1,0.2,0,0.0,19/07/2015,20:19:41,10,65.5
619969,c61,z061,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193810.1,End of period,End of period,0,70.1,0,0.0,0,0.0,19/07/2015,20:20:02,61,70.1
619970,c57,z057,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193811.3,End of period,End of period,0,71.3,0,0.0,0,0.0,19/07/2015,20:20:00,57,71.3
619973,c62,z062,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,76.5,0,0.0,0,0.0,19/07/2015,20:20:02,62,76.5
619974,c63,z063,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,76.5,0,0.0,0,0.0,19/07/2015,20:20:02,63,76.5
619975,c64,z064,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,76.5,0,0.0,0,0.0,19/07/2015,20:20:03,64,76.5
619976,c65,z065,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,2,76.3,2,0.2,0,0.0,19/07/2015,20:20:03,65,76.5
619977,c66,z066,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,1,76.0,1,0.5,0,0.0,19/07/2015,20:20:04,66,76.5
619978,c67,z067,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,0,1.3,1,75.3,0,0.0,19/07/2015,20:20:04,67,76.5
619979,c68,z068,ZEBRALAB02\zebralab_user,1,0,quant,193740.0,193816.5,End of period,End of period,3,76.3,2,0.3,0,0.0,19/07/2015,20:20:05,68,76.5


We can see that 96 activity measurements were recorded over an interval > 60 s. We need to keep that in mind when doing the data analysis.

In [16]:
df_fish.loc[df_fish['time_length'] < 60]

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime,ID,time_length
620256,c1,z001,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,2,43.1,2,0.3,0,0.0,19/07/2015,20:20:44,1,43.4
620257,c2,z002,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,3,43.2,3,0.2,0,0.0,19/07/2015,20:20:44,2,43.4
620258,c3,z003,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,6,42.4,5,0.9,0,0.0,19/07/2015,20:20:44,3,43.4
620259,c4,z004,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,7,42.3,7,1.1,0,0.0,19/07/2015,20:20:44,4,43.4
620260,c5,z005,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,3,42.8,3,0.6,0,0.0,19/07/2015,20:20:44,5,43.4
620261,c6,z006,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,4,42.8,4,0.6,0,0.0,19/07/2015,20:20:44,6,43.4
620262,c7,z007,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,3,43.0,3,0.4,0,0.0,19/07/2015,20:20:44,7,43.4
620263,c8,z008,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,0,43.4,0,0.0,0,0.0,19/07/2015,20:20:44,8,43.4
620264,c9,z009,ZEBRALAB02\zebralab_user,1,0,quant,193816.6,193860.0,End of period,End of period,20,19.6,19,23.8,0,0.0,19/07/2015,20:20:44,9,43.4
620265,c10,z010,ZEBRALAB02\zebralab_user,1,0,quant,193805.5,193860.0,End of period,End of period,4,53.8,4,0.6,0,0.0,19/07/2015,20:20:17,10,54.5


We can see that 192 activity measurements were recorded over an interval < 60 s. We also need to keep that in mind when doing the data analysis.
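
Counts like these can be computed directly rather than read off a truncated table. The helper below is a rough sketch; the name `count_irregular_bins`, its `expected` parameter, and the toy data are assumptions for illustration. It uses `np.isclose` so that tiny floating-point differences are not counted as irregular bins.

```python
import numpy as np
import pandas as pd


def count_irregular_bins(df, expected=60.0):
    """Return (n_long, n_short): bins longer / shorter than `expected` s."""
    length = df['end'] - df['start']
    is_expected = np.isclose(length, expected)
    n_long = int(((length > expected) & ~is_expected).sum())
    n_short = int(((length < expected) & ~is_expected).sum())
    return n_long, n_short


# Toy data: two regular bins, one long bin, one short bin
df_toy = pd.DataFrame({'start': [0.0, 193740.0, 193816.6, 60.0],
                       'end': [60.0, 193816.5, 193860.0, 120.0]})
n_long, n_short = count_irregular_bins(df_toy)
print(n_long, 'bins longer than 60 s;', n_short, 'bins shorter than 60 s')
```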

We would also like to know whether each fish has the same number of data points.

In [308]:
# Check whether each fish has the same number of data points
df_datapoints_num = pd.DataFrame(df_fish.groupby('ID').size())
df_datapoints_num.head()

Unnamed: 0_level_0,0
ID,Unnamed: 1_level_1
1,4119
2,4119
3,4119
4,4119
5,4119


In [309]:
df_datapoints_num.columns = ['datapoints_num']
df_datapoints_num.loc[df_datapoints_num['datapoints_num'] != 4119]

Unnamed: 0_level_0,datapoints_num
ID,Unnamed: 1_level_1


All the fish have the same number (4119) of datapoints.
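
This check could also be written as a function with an assertion, in the same spirit as the tests on the activity column. The sketch below is one possible version; the name `check_equal_counts` and the toy data are hypothetical.

```python
import pandas as pd


def check_equal_counts(df, fname, id_col='ID'):
    """Check that every fish has the same number of data points."""
    counts = df.groupby(id_col).size()
    assert counts.nunique() == 1, (
        fname + ' has unequal numbers of data points per fish: '
        + str(counts.value_counts().to_dict()))


# Toy data: three fish with two data points each, so the check passes
df_toy = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3]})
check_equal_counts(df_toy, 'toy.csv')
```

Because it does not hard-code 4119, the same function could be reused on recordings of a different length.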

#### In general, the major problems we see in the behavior dataset are:
1. The data are not evenly spaced in time: 96 measurements have a time length > 60 s.
2. As a consequence, 6 data points in the 'middur' column are > 60 s, which is inconsistent with the researcher's definition of 'middur'.
3. There are also 192 measurements with a time length < 60 s.
4. When we do the real analysis on this dataset, it would be better to exclude these data points or normalize them to their bin size (time_length).
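
The two options in the last point can be sketched as follows. This is a toy illustration with made-up values, and the column name `frac_active` is hypothetical.

```python
import pandas as pd

# Toy data: one regular bin, one long bin, one short bin
df_toy = pd.DataFrame({'middur': [4.3, 75.3, 0.3],
                       'time_length': [60.0, 76.5, 43.4]})

# Option 1: keep only the bins that are exactly 60 s long
df_regular = df_toy.loc[df_toy['time_length'] == 60.0, :]

# Option 2: express activity as a fraction of the bin length
df_norm = df_toy.assign(
    frac_active=df_toy['middur'] / df_toy['time_length'])
print(df_norm)
```

Normalizing keeps every measurement, while excluding is simpler if the irregular bins are only a small part of the data.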

## Let's also check the genotype data

The researcher said that 'the fish in instrument 2A, numbered 1 through 96 in the activity data file 150717_2A_2B.csv, were genotyped'. That suggests the genotype file should contain genotypes for 96 fish. So let's load the file and have a look.

In [24]:
# Load in the genotype file, call it df_gt for genotype DataFrame
df_gt = pd.read_csv('../data/fish_actvity_for_validation/150717_2A_genotype_3.txt',
                    delimiter='\t',
                    comment='#')

# Take a look at it
df_gt.head()

Unnamed: 0,wt,het,mut
0,1.0,5,2.0
1,3.0,6,4.0
2,8.0,7,14.0
3,12.0,9,16.0
4,21.0,10,26.0


First let's tidy the dataframe.

In [25]:
# Tidy the DataFrame
df_gt = pd.melt(df_gt, var_name='genotype', value_name='location')

In [26]:
# Drop all the NaN and sort the dataset based on 'location'
df_gt = df_gt.dropna()
df_gt = df_gt.sort_values(['location'])
df_gt.head()

Unnamed: 0,genotype,location
0,wt,1.0
76,mut,2.0
1,wt,3.0
77,mut,4.0
38,het,5.0


After we drop the NaNs, we see there are only 75 rows of genotype data, which is inconsistent with the researcher's claim. So we would like to check what happened to the data:
#### 1. We would like to see which locations are not genotyped, so that we can exclude them when analyzing behavior.
#### 2. We would also like to make sure each location reports only a single genotype.

In [27]:
# Check if the location is not genotyped by ordering the locations and 
# checking that we're not missing any numbers from 1 to 96
df_gt['location+1'] = df_gt['location'].shift(-1)
df_gt['location_difference'] = df_gt['location+1'] - df_gt['location']

In [29]:
# If some locations are skipped, we would see the location interval > 1
df_gt.loc[df_gt['location_difference'] > 1]

Unnamed: 0,genotype,location,location+1,location_difference
42,het,10.0,12.0,2.0
4,wt,21.0,23.0,2.0
9,wt,35.0,37.0,2.0
85,mut,46.0,48.0,2.0
59,het,50.0,52.0,2.0
86,mut,54.0,56.0,2.0
15,wt,61.0,63.0,2.0
63,het,63.0,71.0,8.0
95,mut,82.0,85.0,3.0
74,het,94.0,96.0,2.0


From the above data, we can see that locations 11, 22, 36, 47, 51, 55, 62, 64-70, 83, 84, and 95 are not genotyped.
It would be better if the original genotype file had a fourth column listing the fish that were not genotyped.
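
The list of missing locations can also be produced automatically instead of being read off the table of gaps. A minimal sketch using a set difference is shown below; the function name `find_missing_locations`, the `n_wells` parameter, and the toy data are assumptions for illustration.

```python
import pandas as pd


def find_missing_locations(df_gt, n_wells=96):
    """Return sorted well numbers in 1..n_wells absent from 'location'."""
    genotyped = set(df_gt['location'].astype(int))
    return sorted(set(range(1, n_wells + 1)) - genotyped)


# Toy tidy genotype data covering a 6-well plate with wells 3 and 5 missing
df_toy = pd.DataFrame({'genotype': ['wt', 'mut', 'het', 'wt'],
                       'location': [1.0, 2.0, 4.0, 6.0]})
missing = find_missing_locations(df_toy, n_wells=6)
print(len(df_toy), 'locations genotyped; missing:', missing)
```

Unlike the shift-and-difference approach, this does not require the DataFrame to be sorted, and it would also catch missing wells at the start or end of the plate.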

In [315]:
# We would also like to make sure each location reports a single genotype,
# so there are no duplicates. A duplicated location would show a location interval == 0
df_gt.loc[df_gt['location_difference']==0]

Unnamed: 0,genotype,location,location+1,location_difference


The result is empty, which confirms that each location reports a single genotype.
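
The duplicate check could likewise be wrapped in a function that raises an informative error. The sketch below relies on `Series.duplicated`; the name `check_unique_locations` and the toy data are hypothetical.

```python
import pandas as pd


def check_unique_locations(df_gt, fname):
    """Check that no location is assigned more than one genotype."""
    is_dup = df_gt['location'].duplicated(keep=False)
    dup = df_gt.loc[is_dup, 'location']
    assert dup.empty, (fname + ' lists these locations more than once: '
                       + str(sorted(dup.unique().tolist())))


# Toy tidy genotype data with unique locations, so the check passes
df_toy = pd.DataFrame({'genotype': ['wt', 'mut', 'het'],
                       'location': [1.0, 2.0, 3.0]})
check_unique_locations(df_toy, 'toy_genotype.txt')
```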

#### In general, the major problems we see in the genotype dataset are:
- Locations 11, 22, 36, 47, 51, 55, 62, 64-70, 83, 84, and 95 are not genotyped; we need to exclude those wells when performing the data analysis.
- It would be better for the genotype file to have a fourth column listing the wells that were not genotyped.

In [316]:
%load_ext watermark

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [317]:
%watermark -v -p numpy,pandas,glob,jupyterlab

CPython 3.6.6
IPython 7.0.1

numpy 1.15.2
pandas 0.23.4
glob n
jupyterlab 0.35.0


# Grading

**Grade: 43/50**

Overall, you chose some important tests to validate the data and executed them accurately.

Some of your tests, like for the number of data points for each fish, or for all of your genotype data validation, aren't built in to have assertion errors or printed errors describing the issues.  The homework said to "write functions" to validate the fish activity data, which would allow the pipeline to be more customizable for other datasets the scientists would be likely to acquire. (-5 points)

It would have been a little cleaner for the ID column you make in the behavior DataFrame to be composed of ints, instead of making it a string and then having to convert it to int each time you want to use it quantitatively.

You say "we see there are only 75 rows of genotype data" after only printing the first few rows of the DataFrame.  Would be nice to have a function, or even just a line of code, to automatically return this number.

Your test for the locations without genotype data could be cleaner.  You could have automated the process, using a function, to take that DataFrame you returned and tell us what the locations (not just locations+1) without data were.  Our having to look at that DataFrame to figure it out defeats the purpose of making the process automated. (-2 points)