# Exploratory analysis of copepod behavior dataset
Here, I combine two csv files, one containing data from videos of copepod behavior and one containing comments from the people analyzing these videos. Then, I perform some basic exploratory data analyses.

In [62]:
import pandas as pd # import pandas library

First, read the two csv files into the workspace. The column names include some unusual characters, so the `latin-1` encoding is needed.

In [63]:
behav_data = pd.read_csv("behav_combined_out.csv", encoding = 'latin-1')
videogr_data = pd.read_csv("cops_measured_video_comments.csv", encoding = 'latin-1')

How many copepods have a behavior measurement?

In [64]:
behav_data['cop_name'].nunique(), videogr_data['cop'].nunique()

(689, 686)

Pretty similar, but there are 3 copepods that are missing from the videography data. Which ones are they?

In [65]:
videogr_cops = set(videogr_data['cop'].unique()) # copepods present in the videography file
cops_in_behav_not_videogr = []
for cop in behav_data['cop_name'].unique(): # loop over each copepod in the behavior data
    if cop not in videogr_cops:
        print(cop)
        cops_in_behav_not_videogr.append(cop)

39_1B
50_6D
66_1B
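
The same comparison can also be sketched with set operations, which has the side benefit of checking the opposite direction (copepods that appear only in the videography file). The two small data frames below are made-up stand-ins, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not the real data)
behav_toy = pd.DataFrame({'cop_name': ['01_1D', '01_1D', '39_1B', '01_2B']})
videogr_toy = pd.DataFrame({'cop': ['01_1D', '01_2B', '01_2C']})

behav_ids = set(behav_toy['cop_name'])
videogr_ids = set(videogr_toy['cop'])

# Set differences give the ids unique to each table
only_in_behav = sorted(behav_ids - videogr_ids)
only_in_videogr = sorted(videogr_ids - behav_ids)
print(only_in_behav, only_in_videogr)
```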


The videography file was used by multiple people, so maybe these mystery copepods were accidentally deleted from the file. One of the first things we can do is check how often they were observed.

In [66]:
(behav_data[ 
    behav_data['cop_name'].isin(cops_in_behav_not_videogr) ] # filter to just these copepods
 [['cop_name', 'day']].drop_duplicates()) # show only unique name-day combinations

Unnamed: 0,cop_name,day
221498,39_1B,7
227450,50_6D,7
277711,66_1B,9


In each case, they were recorded on only one day, instead of on all seven observation days. If these copepods had been accidentally deleted from the videography file, I would expect some to have more and some to have fewer observations, because they presumably would not all have been deleted at the same time.

Another possibility is that there is a typo in the file name. For example, maybe on day 7 copepod **39_1A** was accidentally recorded as **39_1B**. Let's look at all the copepods on plate 39 to see if any have missing data suggestive of typos.

In [67]:
( behav_data[behav_data.cop_name.str.contains("39_")] # filter to plate 39
 [['cop_name', 'day']].drop_duplicates() # select cop_name and day and drop duplicates
 .groupby(by = 'cop_name').count() # group by cop name and count number of days observed
)

cop_name,day
39_1B,1
39_1C,7
39_2B,5
39_2C,7
39_2D,7
39_5B,7
39_5D,7
39_6A,6
39_6B,7
39_6D,7


All but two of the copepods from this plate were recorded on all 7 observation days. In these two cases, the copepod did not have the full set because it died before the end of the experiment. Neither was missing data on day 7, when the mystery copepod **39_1B** was observed. This was also the case for the other two mystery copepods, so typos are an unlikely explanation for the occurrence of these data.
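
The check on the other two plates is not shown above. A minimal sketch of how it could be done for several plates at once follows. It assumes the plate id is the part of `cop_name` before the underscore, and the toy data frame is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of behav_data (invented rows, not the real data)
toy = pd.DataFrame({
    'cop_name': ['50_6D', '50_1A', '50_1A', '50_1A', '66_1B', '66_2C', '66_2C'],
    'day':      [7,        5,       7,       7,       9,       7,       9],
})
mystery = ['50_6D', '66_1B']

# Assumption: the plate id is the part of the name before the underscore
toy['plate'] = toy['cop_name'].str.split('_').str[0]
mystery_plates = toy.loc[toy['cop_name'].isin(mystery), 'plate'].unique()

# Number of distinct observation days per copepod on those plates
days_observed = (toy[toy['plate'].isin(mystery_plates)]
                 .groupby('cop_name')['day'].nunique())
print(days_observed)
```

Splitting on the underscore, rather than using `str.contains`, avoids matching the plate number elsewhere in the name.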

There are three treatment groups in the experiment: infected copepods, exposed but uninfected copepods, and unexposed control copepods. My main interest was the infected copepods, so the behavior of all infected copepods was recorded. All controls were also recorded. There were just a couple of controls per plate, because I exposed most copepods to maximize the number of infected copepods available. The least interesting, yet most common, group was the exposed but uninfected copepods. All three mystery copepods were in this group (this info is in a different table). Thus, I do not think they would have been chosen for behavioral processing. Rather, I think the videos for these copepods were recorded by accident. On the one hand, this is not problematic (more data is good). On the other hand, I plan to model variation in copepod behavior across days with mixed models, and copepods observed on a single day will not contribute information to this end. Still, at least right now, I do not see a strong reason to exclude them.

Now, let's combine the *behavior* and *videography* data tables. In the main *behavior* data frame, the first two columns record the observation number (starting from 0 or 1). Copepods were recorded for just over two minutes, with an observation taken every two seconds, hence the 62 observations per copepod per day. The `X` and `Y` variables record copepod position, and these coordinates are used to calculate the `Distance` moved by the copepod in each time interval. `Pixel Value` is simply how dark the tracked copepod appeared on the video. The three columns starting with `ok_` were created when compiling the dataset from individual txt files and flag cases of potential concern. The last three columns (`cop_name`, `day`, and `fname`) record the copepod id, the day post infection it was observed, and the file name containing the data for that copepod on that day.

In [68]:
behav_data.head(5)

Unnamed: 0.1,Unnamed: 0,Slice n°,X,Y,Distance,Pixel Value,ok_col_names,ok_col_num,ok_row_num,cop_name,day,fname
0,0,1.0,93.0,33.0,-1.0,131.0,1,0,0,01_1D,11,01_1D_11
1,1,2.0,88.0,23.0,1.442,150.0,1,0,0,01_1D,11,01_1D_11
2,2,3.0,88.0,22.0,0.129,155.0,1,0,0,01_1D,11,01_1D_11
3,3,4.0,71.0,15.0,2.372,197.0,1,0,0,01_1D,11,01_1D_11
4,4,5.0,69.0,13.0,0.365,205.0,1,0,0,01_1D,11,01_1D_11


In the *videography* data table, copepod name (`cop`), `day`, and file name (`fname`) correspond to those in the behavior dataset. The `by` variable records who processed the video. If problems were noted, they are flagged in the `Ok?` variable and described in `remarks`.

In [6]:
videogr_data.head(5)

Unnamed: 0,index,cop,day,fname,by,Ok?,remarks
0,1,01_1D,5,01_1D_5,DB,0.0,
1,10,01_2B,5,01_2B_5,MS,0.0,
2,19,01_2C,5,01_2C_5,MS,0.0,
3,28,01_3D,5,01_3D_5,MS,0.0,
4,37,01_4A,5,01_4A_5,MS,0.0,


Some of the information in the data tables is repetitive, so let's select the needed columns and rename them for simplicity and consistency.

In [7]:
behav_data = (behav_data[['Slice n°', 'Distance', 'Pixel Value', # select columns
                         'ok_col_names', 'ok_col_num', 'ok_row_num',
                         'cop_name', 'day', 'fname']]
              .rename(columns = {'Slice n°':'slice', # rename columns
                                 'Distance':'dist',
                                 'Pixel Value':'pixel'})
              .sort_values(['day', 'cop_name', 'slice'], ascending = [True, True, True]) # sort by day and copepod
             )

In [8]:
videogr_data = (videogr_data[['fname', 'by', 'Ok?', 'remarks']]
                .rename(columns = {'by':'recorded_by',
                                   'Ok?':'video_problematic',
                                   'remarks':'video_remarks'})
               )

Now we can merge the two data frames into a combined dataset.

In [9]:
combined_data = pd.merge(behav_data, videogr_data,
                         how = 'left', on = 'fname')
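
A merge like this can be sanity-checked with two optional arguments of `pd.merge`. The sketch below uses invented toy frames and assumes the videography table has at most one row per `fname`. `validate` raises an error if that assumption fails, and `indicator` adds a `_merge` column showing which behavior rows found no videography match.

```python
import pandas as pd

# Toy stand-ins (hypothetical values, not the real data)
behav_toy = pd.DataFrame({'fname': ['01_1D_5', '01_1D_5', '39_1B_7'],
                          'dist': [-1.0, 1.553, 0.2]})
videogr_toy = pd.DataFrame({'fname': ['01_1D_5'], 'recorded_by': ['DB']})

merged = pd.merge(behav_toy, videogr_toy, how='left', on='fname',
                  validate='many_to_one',  # error if fname repeats on the right
                  indicator=True)          # adds the _merge column

# File names present in the behavior data but absent from the videography data
unmatched = merged.loc[merged['_merge'] == 'left_only', 'fname'].unique()
print(unmatched)
```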

How many videos were processed? Each file name corresponds to one video, so let's count the file names.

In [10]:
combined_data['fname'].nunique()

4566

That's a lot! Now of those, how many are potentially problematic? There are 4 variables that flag cases worth looking at. Let's count the problems by focusing on these variables. In each, a 1 records a problematic case, while a 0 is a non-problematic one. Next, we group by the video (`fname`) and then sum the values. Anything above zero is a problematic video.

In [11]:
prob_df = combined_data[['fname', 'ok_col_names', 
                         'ok_col_num', 'ok_row_num',
                         'video_problematic']].groupby(by = 'fname').sum()

In [12]:
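def is_zero(x):
    """Collapse a per-video sum to a 0/1 flag: 0 stays 0, anything else becomes 1."""
    if x == 0:
        return x # no problems flagged for this video
    else:
        return 1.0 # at least one row was flagged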

In [13]:
pd.concat([
    prob_df['ok_col_names'].apply(is_zero).value_counts(),
    prob_df['ok_col_num'].apply(is_zero).value_counts(),
    prob_df['ok_row_num'].apply(is_zero).value_counts(),
    prob_df['video_problematic'].apply(is_zero).value_counts()
], axis = 1)

Unnamed: 0,ok_col_names,ok_col_num,ok_row_num,video_problematic
0.0,1919,4566.0,4494,4450
1.0,2647,,72,116
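
As an alternative to applying a helper function column by column, the same tally can be sketched with a single boolean comparison. The toy frame below is a hypothetical stand-in for `prob_df`, and the sketch assumes the sums are never negative.

```python
import pandas as pd

# Hypothetical per-video sums, indexed by file name (not the real data)
prob_toy = pd.DataFrame(
    {'ok_col_names': [0, 62, 62],
     'ok_row_num': [0, 0, 61],
     'video_problematic': [0.0, 0.0, 62.0]},
    index=['a_5', 'b_5', 'c_5'])

flagged = prob_toy > 0          # True where a video has at least one flag
n_flagged = flagged.sum()       # number of flagged videos per variable
share_flagged = flagged.mean()  # proportion of flagged videos per variable
print(n_flagged)
```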


The most common issue was that column names were missing from the txt file containing the data when it was imported. I think this was a byproduct of how the file was saved and is thus not likely to be important. Still, since the videos with and without column names are fairly evenly split, we can check whether there are obvious differences between the file types. None of the video output files had the wrong number of columns.
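
One way to make that comparison is to summarize `dist` separately for files with and without column names. This is only a sketch on invented values. It assumes the `-1` in the first slice of each video is a placeholder rather than a real distance, so those rows are dropped first.

```python
import pandas as pd

# Hypothetical miniature of combined_data (invented values)
toy = pd.DataFrame({
    'ok_col_names': [0, 0, 0, 1, 1, 1],
    'dist': [-1.0, 1.5, 0.5, -1.0, 2.0, 1.0],
})

# Assumption: negative distances are placeholders, not measurements
moves = toy[toy['dist'] >= 0]
summary = moves.groupby('ok_col_names')['dist'].agg(['count', 'mean', 'median'])
print(summary)
```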

The next most common issue was problematic videos (116/4566, 2.5%). Some of this is probably difficulty observing the copepod, but some cases are also likely video mistakes, such as letting it run too long. This leads to the next problem...

Some videos (72/4566, 1.6%) ended up with the wrong number of rows. Let's look at the distribution of row numbers.

In [14]:
c = combined_data[['fname', 'ok_row_num']].groupby(by = 'fname').count()
c['ok_row_num'].value_counts()

62     4493
61       41
47       13
63        8
60        3
59        1
66        1
113       1
53        1
124       1
96        1
56        1
52        1
Name: ok_row_num, dtype: int64

The most common mistake was to miss a single observation (61 instead of 62 observations). There also appear to be a few cases in which twice as many observations as expected were made. For cases in which fewer observations were made than expected, it might be possible to re-analyze the videos and correct the values. For cases in which more observations are made than expected, we need to see if we can exclude the extra observations, such that the `slice` number is correct and comparable. In a couple cases, the remarks made by the person analyzing the videos are informative. For example, the video for copepods on plate 61 recorded on day 9 was incomplete (only 47 observations). But usually, when there were problems with the video, under remarks, only a "?" was given, indicating there were problems in observing the copepod. 
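
A rough sketch of how videos could be labeled by row count and how extra observations might be trimmed is given below. It assumes that extra observations come at the end of a video and that `slice` keeps counting upward, which would need to be verified for the real files (a video with twice the expected rows might instead restart at 1). The expected count is set to 3 here for readability; it would be 62 in the real data.

```python
import pandas as pd

EXPECTED_N = 3  # 62 in the real data; small here to keep the toy example short

# Hypothetical videos with the right number, too many, and too few rows
toy = pd.DataFrame({
    'fname': ['a_5'] * 3 + ['b_5'] * 5 + ['c_5'] * 2,
    'slice': [1, 2, 3, 1, 2, 3, 4, 5, 1, 2],
})

# Label every row with the status of its video
n_rows = toy.groupby('fname')['slice'].transform('count')
toy['row_status'] = 'ok'
toy.loc[n_rows < EXPECTED_N, 'row_status'] = 'too_few'
toy.loc[n_rows > EXPECTED_N, 'row_status'] = 'too_many'

# Drop observations beyond the expected number so slice stays comparable
trimmed = toy[toy['slice'] <= EXPECTED_N]
print(trimmed.groupby('fname').size())
```

Videos labeled `too_few` are left untouched by the trimming step and would still need to be re-analyzed or handled separately.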

In [20]:
vid_probs = combined_data[combined_data.video_problematic == 1]
vid_probs.video_remarks.drop_duplicates()
combined_data

Unnamed: 0,slice,dist,pixel,ok_col_names,ok_col_num,ok_row_num,cop_name,day,fname,recorded_by,video_problematic,video_remarks
0,1.0,-1.000,136.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
1,2.0,1.553,171.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
2,3.0,1.296,200.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
3,4.0,0.387,204.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
4,5.0,0.903,196.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
5,6.0,0.000,204.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
6,7.0,0.547,161.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
7,8.0,0.129,144.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
8,9.0,0.000,138.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,
9,10.0,1.110,158.0,0,0,0,01_1D,5,01_1D_5,DB,0.0,


Steps to take???