# Exploratory analysis of copepod behavior dataset
Here, I combine two CSV files: one containing data from videos of copepod behavior and one containing comments from the people who analyzed these videos. I then perform some basic exploratory data analysis.

In [1]:
import pandas as pd # import pandas library

First, read the two csv files into the workspace. The column names include some unusual characters, so the `latin-1` encoding is needed.

In [2]:
behav_data = pd.read_csv("behav_combined_out.csv", encoding = 'latin-1')
videogr_data = pd.read_csv("cops_measured_video_comments.csv", encoding = 'latin-1')

How many copepods have a behavior measurement?

In [3]:
behav_data['cop_name'].nunique(), videogr_data['cop'].nunique()

(689, 686)

Pretty similar, but there are 3 copepods that are missing from the videography data. Which ones are they?

In [4]:
videogr_cops = set(videogr_data['cop'].unique()) # ids present in the videography data
cops_in_behav_not_videogr = []
for cop in behav_data['cop_name'].unique():
    if cop not in videogr_cops:
        print(cop)
        cops_in_behav_not_videogr.append(cop)

39_1B
50_6D
66_1B
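
The same comparison can also be written as a set difference. The sketch below runs on made-up ids (`c1`, `c2`, `c3` are hypothetical, not copepods from this dataset) and only illustrates the idea.

```python
import pandas as pd

# Toy stand-ins for the two tables; the ids are invented for illustration.
behav = pd.DataFrame({"cop_name": ["c1", "c1", "c2", "c3"]})
videogr = pd.DataFrame({"cop": ["c1", "c2"]})

# Ids that appear in the behavior table but not in the videography table.
missing = sorted(set(behav["cop_name"]) - set(videogr["cop"]))
print(missing)
```

Swapping the two sets gives the reverse check, i.e. ids that appear only in the videography table.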


My best guess is that these three copepods were deleted by accident from the videography file (it was used by multiple people). Another possibility is that there is a typo somewhere in either file. I will probably have to go back to the raw data to figure this out. Keep these three copepods in mind in case they turn out to be problematic.

Now, before we merge the two data tables, let's look at the column names.

In the main *behavior* data frame, the first two columns record the observation number (starting from 0 or 1). Copepods were recorded for just over two minutes, with an observation taken every two seconds, hence the 62 observations per copepod per day. The `X` and `Y` variables record copepod position, and these coordinates are used to calculate the `Distance` moved by the copepod in each time interval. `Pixel Value` is simply how dark the tracked copepod appeared on the video. The three columns starting with `ok_` were created when compiling the dataset from individual txt files and flag cases of potential concern. The last three columns (`cop_name`, `day`, and `fname`) record the copepod id, the day post infection on which it was observed, and the name of the file containing the data for that copepod on that day.

In [5]:
behav_data.head(5)

Unnamed: 0.1,Unnamed: 0,Slice n°,X,Y,Distance,Pixel Value,ok_col_names,ok_col_num,ok_row_num,cop_name,day,fname
0,0,1.0,93.0,33.0,-1.0,131.0,1,0,0,01_1D,11,01_1D_11
1,1,2.0,88.0,23.0,1.442,150.0,1,0,0,01_1D,11,01_1D_11
2,2,3.0,88.0,22.0,0.129,155.0,1,0,0,01_1D,11,01_1D_11
3,3,4.0,71.0,15.0,2.372,197.0,1,0,0,01_1D,11,01_1D_11
4,4,5.0,69.0,13.0,0.365,205.0,1,0,0,01_1D,11,01_1D_11
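
As a quick check on how `Distance` relates to `X` and `Y`, the sketch below retypes the five rows shown above and compares each recorded distance with the Euclidean step between consecutive positions. The constant ratio it finds is only an observation from these five rows, not a documented calibration, and reading the `-1.0` in the first row as a "no previous position" placeholder is likewise an assumption.

```python
import numpy as np
import pandas as pd

# The five rows displayed by behav_data.head(5), retyped by hand.
head = pd.DataFrame({
    "X": [93.0, 88.0, 88.0, 71.0, 69.0],
    "Y": [33.0, 23.0, 22.0, 15.0, 13.0],
    "Distance": [-1.0, 1.442, 0.129, 2.372, 0.365],
})

# Euclidean step length (in pixels) between consecutive observations.
step_px = np.hypot(head["X"].diff(), head["Y"].diff())

# Ratio of recorded distance to pixel step; the first row has no previous position.
ratio = (head["Distance"] / step_px).iloc[1:]
print(ratio.round(3).tolist())
```

For these rows the ratio is close to 0.129 throughout, which suggests `Distance` is the pixel step multiplied by a fixed scale factor.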


In the *videography* data table, copepod name (`cop`), `day`, and file name (`fname`) correspond to those in the behavior dataset. The `by` variable records who processed the video, and any problems that were noted are flagged in the `Ok?` variable.

In [6]:
videogr_data.head(5)

Unnamed: 0,index,cop,day,fname,by,Ok?,remarks
0,1,01_1D,5,01_1D_5,DB,0.0,
1,10,01_2B,5,01_2B_5,MS,0.0,
2,19,01_2C,5,01_2C_5,MS,0.0,
3,28,01_3D,5,01_3D_5,MS,0.0,
4,37,01_4A,5,01_4A_5,MS,0.0,


Some of the information in the data tables is repetitive, so let's select the needed columns and rename them for simplicity and consistency.

In [7]:
behav_data = (behav_data[['Slice n°', 'Distance', 'Pixel Value', # select columns
                         'ok_col_names', 'ok_col_num', 'ok_row_num',
                         'cop_name', 'day', 'fname']]
              .rename(columns = {'Slice n°':'slice', # rename columns
                                 'Distance':'dist',
                                 'Pixel Value':'pixel'})
              .sort_values(['day', 'cop_name', 'slice']) # sort by day, copepod, and observation (ascending by default)
             )

In [8]:
videogr_data = (videogr_data[['fname', 'by', 'Ok?', 'remarks']]
                .rename(columns = {'by':'recorded_by',
                                   'Ok?':'video_problematic',
                                   'remarks':'video_remarks'})
               )

Now we can merge the two data frames into a combined dataset.

In [9]:
combined_data = pd.merge(behav_data, videogr_data,
                         how = 'left', on = 'fname')
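
A left merge keeps every behavior row, including rows whose video has no matching comment entry. One way to make that visible is sketched below on toy data (the file names and values are invented). It uses two standard `pd.merge` options: `indicator=True` adds a `_merge` column saying where each row came from, and `validate='many_to_one'` raises an error if a file name occurs more than once in the right-hand table.

```python
import pandas as pd

# Toy stand-ins; "z_5" has no entry in the comments table.
behav = pd.DataFrame({"fname": ["a_5", "a_5", "b_5", "z_5"],
                      "dist": [0.1, 0.2, 0.3, 0.4]})
videogr = pd.DataFrame({"fname": ["a_5", "b_5"],
                        "recorded_by": ["DB", "MS"]})

merged = pd.merge(behav, videogr, how="left", on="fname",
                  validate="many_to_one",  # each fname at most once on the right
                  indicator=True)          # adds the _merge column

# Rows that found no match in the comments table.
n_unmatched = (merged["_merge"] == "left_only").sum()
print(n_unmatched)
```

Whether these checks are needed for the real files is an open question; the sketch only shows how they could be run.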

How many videos were processed? Each file name corresponds to one video, so let's count the file names.

In [11]:
combined_data['fname'].nunique()

4566

That's a lot! Of those, how many are potentially problematic? Four variables flag cases worth looking at. In each, a 1 records a problematic case and a 0 a non-problematic one. To count the problems, we group by video (`fname`) and sum each flag. Any sum above zero marks a problematic video.

In [16]:
prob_df = combined_data[['fname', 'ok_col_names', 
                         'ok_col_num', 'ok_row_num',
                         'video_problematic']].groupby(by = 'fname').sum()

In [53]:
def is_zero(x):
    """Collapse a per-video flag sum to 0 (no flagged rows) or 1.0 (at least one)."""
    if x == 0:
        return x
    else:
        return 1.0
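
The same 0/1 collapse can be done without a helper function by comparing the whole table to zero. This is a minimal sketch on toy sums (the column and video names are hypothetical), not a replacement for the cells in this notebook.

```python
import pandas as pd

# Toy per-video flag sums, shaped like the result of groupby('fname').sum().
sums = pd.DataFrame(
    {"flag_a": [0, 3, 62], "flag_b": [0.0, 0.0, 1.0]},
    index=["vid_1", "vid_2", "vid_3"],
)

# Count videos with no flagged rows and videos with at least one, per flag.
summary = pd.DataFrame({
    "ok": sums.eq(0).sum(),
    "problem": sums.ne(0).sum(),
})
print(summary)
```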

In [54]:
pd.concat([
    prob_df['ok_col_names'].apply(is_zero).value_counts(),
    prob_df['ok_col_num'].apply(is_zero).value_counts(),
    prob_df['ok_row_num'].apply(is_zero).value_counts(),
    prob_df['video_problematic'].apply(is_zero).value_counts()
], axis = 1)

Unnamed: 0,ok_col_names,ok_col_num,ok_row_num,video_problematic
0.0,1919,4566.0,4494,4450
1.0,2647,,72,116
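
The counts in the table above can be turned into percentages of all 4566 videos. The numbers below are copied by hand from that table, with the empty cell for `ok_col_num` read as zero flagged videos.

```python
n_videos = 4566

# Number of flagged videos per variable, copied from the table above.
flagged = {
    "ok_col_names": 2647,
    "ok_col_num": 0,
    "ok_row_num": 72,
    "video_problematic": 116,
}

# Share of videos flagged by each variable, in percent.
percent = {name: round(100 * n / n_videos, 1) for name, n in flagged.items()}
print(percent)
```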


The most common issue was that column names were missing from the txt file containing the data when it was imported. I think this was a byproduct of how the file was saved and is thus not likely to be important. Still, since the videos are split fairly evenly between those with and without column names, we can check whether there are obvious differences between the file types. None of the imported files had the wrong number of columns.

The next most common issue was problematic videos (116/4566, 2.5%). Some of these probably reflect difficulty observing the copepod, but others are likely recording mistakes, such as letting the video run too long. This leads to the next problem...

Some videos (72/4566, 1.6%) ended up with the wrong number of rows. Looking at the distribution of rows per video, the most common deviation was a single missing observation (61 instead of 62). A few videos have far more observations than expected (96, 113, and 124 rows, the last being exactly twice the expected 62). Whether the videos can be re-analyzed and corrected remains to be determined.

In [72]:
c = combined_data[['fname', 'ok_row_num']].groupby(by = 'fname').count()
c['ok_row_num'].value_counts()

62     4493
61       41
47       13
63        8
60        3
59        1
66        1
113       1
53        1
124       1
96        1
56        1
52        1
Name: ok_row_num, dtype: int64
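
To follow up on individual videos, it helps to list the file names whose row count differs from the expected value. The sketch below uses toy data with an expected count of 3 (the real data expect 62), and the file names are invented.

```python
import pandas as pd

EXPECTED_ROWS = 3  # 62 in the real dataset; 3 keeps the toy example small

# Toy table: "a_5" has the expected rows, "b_5" too few, "c_5" twice as many.
toy = pd.DataFrame({"fname": ["a_5"] * 3 + ["b_5"] * 2 + ["c_5"] * 6})

# Rows per video, then keep only the videos with an unexpected count.
rows_per_video = toy.groupby("fname").size()
unexpected = rows_per_video[rows_per_video != EXPECTED_ROWS]
print(unexpected)
```

Applied to `combined_data`, the resulting index would give the file names to look up in the raw data.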