# How to approach reading in real-world data

When you work with real-world data, the data is usually not ready for analysis right away. It needs some preprocessing first: you write a program that takes the original (messy) data and produces a new dataset, usually split across several files. This new dataset is what you will analyse.

In general, it is a good idea not to solve all problems at the same time. That is why you first want to produce clean and well-structured data and do the actual analysis later.

## Producing clean datasets

So, what do we need to produce a clean dataset? There are several aspects to it.

First of all, you want to have regular data values. If there are misspellings or invalid values in the original data, you want to correct them automatically when you produce the clean data. This is really a matter of writing *individual rules* about details. It can be very annoying, because your program has to cover every bad detail in the dataset. But once you have written the program, you have clean data! And if you later discover that there are still some invalid values, you extend the program, run it again and keep working with the clean data. Furthermore, by writing a program that creates a new version of the data, you document your changes and corrections in a reliable way. If you change data manually, you will probably not remember a couple of weeks later what you changed.
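
Such individual rules can be sketched as a small function. The misspellings, the set of valid values and the name `normalise_treatment` are made up for illustration; your own data will need its own rules.

```python
# Hypothetical correction rules: each key is a bad value seen in the raw data
# (in lower case), each value is the spelling we want in the clean data.
CORRECTIONS = {
    "a": "A",
    "treatment a": "A",
    "b": "B",
    "treatmnt b": "B",
}

VALID_TREATMENTS = {"A", "B"}


def normalise_treatment(raw_value):
    """Return the cleaned treatment label, or raise if no rule covers it."""
    value = raw_value.strip()
    # Look up a correction; fall back to the value itself if there is no rule.
    value = CORRECTIONS.get(value.lower(), value)
    if value not in VALID_TREATMENTS:
        # Failing loudly tells us that we need to add another rule.
        raise ValueError("No cleaning rule for value: {!r}".format(raw_value))
    return value


print([normalise_treatment(v) for v in ["A", " b ", "Treatmnt B"]])
```

When the function raises a `ValueError`, you add a rule for the new bad value and run the cleaning program again.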

Secondly, you want to split your data so that each dataset has its own file. For example, if you have an Excel sheet with measurements and further parameters that describe the experiment itself, split this information into two files: one file with the measurements and another file with the additional parameters. This way your program can read exactly the information it needs without jumping around in the file. (This point is less relevant for the Drosophila data but applies even more to the data from the redox experiments.)

Thirdly, you probably need to restructure the data. There is a convention that has become more and more widespread in recent years: *tidy data*. Tidy data follows these rules:

* Each variable forms a column
* Each observation forms a row
* Each type of observation forms a table

For example, the following structure is a tidy dataset:

<table>
<thead>
<tr><th>name</th><th>treatment</th><th>treatment_result</th></tr>
</thead>
<tbody>
<tr><td>John Smith</td><td>A</td><td></td></tr>
<tr><td>Jane Doe</td><td>A</td><td>16</td></tr>
<tr><td>Mary Johnson</td><td>A</td><td>3</td></tr>
<tr><td>John Smith</td><td>B</td><td>2</td></tr>
<tr><td>Jane Doe</td><td>B</td><td>11</td></tr>
<tr><td>Mary Johnson</td><td>B</td><td>1</td></tr>
</tbody>
</table>

(Table adapted from <a href="http://vita.had.co.nz/papers/tidy-data.pdf" target="_blank">Hadley Wickham's tidy data paper</a>)

Note that the names occur in several rows. That is because they are associated with different *observations*. Each column is a variable. The variables are `name`, `treatment` and `treatment_result`. Data that is structured in such a way is much easier to process with a program. Also note that missing values are represented by an empty cell. If you have a dataset where all missing values have the same representation it is easier to write a program that knows how to recognise such missing values.
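
To illustrate why a uniform representation of missing values helps, here is a minimal sketch that reads the table above and turns every empty cell into `None`. The tab-separated format and the helper name `parse_result` are assumptions made for this example.

```python
import csv
import io

# The tidy table from above as tab-separated text (the file format is an
# assumption made for this sketch). John Smith has no result for treatment A.
TIDY_TEXT = (
    "name\ttreatment\ttreatment_result\n"
    "John Smith\tA\t\n"
    "Jane Doe\tA\t16\n"
    "Mary Johnson\tA\t3\n"
    "John Smith\tB\t2\n"
    "Jane Doe\tB\t11\n"
    "Mary Johnson\tB\t1\n"
)


def parse_result(cell):
    """Turn a cell into an int, or None if the cell is empty (missing value)."""
    return None if cell == "" else int(cell)


rows = []
for row in csv.DictReader(io.StringIO(TIDY_TEXT), delimiter="\t"):
    row["treatment_result"] = parse_result(row["treatment_result"])
    rows.append(row)

# Because every missing value looks the same, one simple check finds them all.
missing = [r["name"] for r in rows if r["treatment_result"] is None]
print(missing)
```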

Note that the points above concern how you store data in files. Once you have loaded the data into your Python program, you can create other structures that are more convenient for processing it.
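
As a small sketch of this idea, the tidy rows from the example table can be regrouped in memory into a nested dictionary. The variable names are chosen freely for this illustration.

```python
# Tidy rows as they might look after loading (values copied from the table above;
# None stands for the missing value).
tidy_rows = [
    ("John Smith", "A", None),
    ("Jane Doe", "A", 16),
    ("Mary Johnson", "A", 3),
    ("John Smith", "B", 2),
    ("Jane Doe", "B", 11),
    ("Mary Johnson", "B", 1),
]

# Regroup into: { name -> { treatment -> result } }
results_by_person = {}
for name, treatment, result in tidy_rows:
    # setdefault returns the inner dictionary, creating it on first use
    results_by_person.setdefault(name, {})[treatment] = result

print(results_by_person["Jane Doe"])
```

The file stays tidy; the nested dictionary only exists while the program runs.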

# Approaching the larva datasets

Thankfully, the dataset is pretty clean. So let us think about which variables it contains:

* name/index of the larva
* time
* acc_dst
* acceleration
* area
* bending
* dst_to_origin
* go_phase
* head_x
* head_y
* is_coiled
* is_well_oriented
* left_bended
* mom_dst
* mom_x
* mom_y
* mov_direction
* perimeter
* radius_1
* right_bended
* spine_length
* spinepoint_1_x
* spinepoint_1_y
* tail_x
* tail_y
* velocity

My suggestion: each observation corresponds to a point in time, and the variables are the measurements. Each larva gets its own table, which gives us the following structure:

<table>
<thead>
<tr><th>time</th><th>acc_dst</th><th>acceleration</th><th>...</th><th>velocity</th></tr>
</thead>
<tbody>
<tr><td>0</td><td>3</td><td>-0,208191</td><td>...</td><td>0,608276</td></tr>
<tr><td>1</td><td>3</td><td>-0,840494</td><td>...</td><td>0,894284</td></tr>
<tr><td>2</td><td>3</td><td>-0,949482</td><td>...</td><td>0,289482</td></tr>
</tbody>
</table>

In general, when you write a program you should ask yourself whether it will run just a few times or frequently. If it runs frequently, it can be worth optimising it for speed. But data cleaning programs usually run only a few times, and they tend to be fast enough in practice. So it is okay to write your programs in a simpler way, even if that makes them less efficient. When you write your program in a simpler way, you are also much less likely to make mistakes.

What does that mean in practice? In this case we will open each file several times, because each file contains the measurements of several larvae. In principle it is possible to write a program that collects the measurements of all larvae from a file at once, but that makes the program more complex and more error-prone (and also more difficult for me to explain to you).

In [None]:

# For this code we will use a defaultdict. A defaultdict is a dictionary that behaves
# exactly like a normal dictionary but if you ask for a key that is not in the dictionary
# the defaultdict will automatically create a new value for the missing key and from
# the outside it will not be apparent if the value is new or old. This is very handy
# because it means that you do not have to provide initial values for the dictionary
# contents. A defaultdict can be used like this:
#
# `my_dict = defaultdict( list )`
#
# And when a missing key is requested the defaultdict will execute
#
# `new_value = list()`
#
# and the new value will be saved in the dictionary and returned
from collections import defaultdict


def clean_larva_data_from_file( filename, larva_index, outfile ):
    """
    Functions can be documented by adding a multi-line string after the `def` line.
    You can look up this documentation by running `help(clean_larva_data_from_file)`
    in Jupyter or in a Python console.
    
    Multiline-strings are defined by three quote characters in the beginning and
    the same three quote characters at the end of the string.
    
    There are some conventions on how to document functions (and methods).
    For example, the Sphinx convention asks you to document your parameters
    by using the :param directive followed by the name of the parameter plus
    a colon and the description of the parameter:
    
    :param filename: The path to the Drosophila larvae measurements file
    :param larva_index: The column index of the drosophila larva
    :param outfile: The path of the clean output file
    """ #this is the end of the multi-line string
    
    # structured_data: { variable -> [ measurements_at_different_time_points ] }
    # It is always a good idea to sketch out the structure of your data like this.
    # defaultdict: a dictionary that inserts a default value when we use a new key.
    # In this case the default value is an empty list.
    structured_data = defaultdict( list )
    
    #first part of this function: Read in data and store it in the variable structured_data
    
    with open( filename ) as f:
        for line in f:
            parts = line.rstrip( '\n' ).split( '\t' ) # parts are the individual cells in a row
            
            # I use a cheap trick to separate the label from the time. I basically split the string
            # at the `(` parenthesis. The stuff that comes before the `(` is the label and the stuff
            # that comes after the `(` is the time plus the closing parenthesis. I get rid of the
            # closing parenthesis by saying [ :-1 ] which takes a substring without the last letter.
            label = parts[ 0 ].split( '(' )[ 0 ] # from "label(time)" extract "label"
            if label == '': #if we are in the header row
                continue # skip the header row and go on with the next line of the file
            time = parts[ 0 ].split( '(' )[ 1 ][ :-1 ] #from "label(time)" extract "time"
            value = parts[ larva_index ]

            
            # `assert` is a Python mechanism for sanity checks.
            # It raises an AssertionError if the condition that follows is not True.
            # We assume that there is a measurement for each variable at each time point
            # and that time starts at 0 and increases by 1 between measurements.
            # In general, you use asserts to document what the program assumes to be true.
            # If an assumption is false, the `assert` will complain and you will know
            # that you need to reconsider your assumptions.
            assert len( structured_data[ label ] ) == int( time ), "Time is not increasing in steps of 1"
            
            # This is where the defaultdict comes in handy. With a normal dictionary we would
            # get a KeyError if the label key were absent. (The same applies to the line with the `assert`.)
            structured_data[ label ].append( value )
    
    #second part of this function: Write the data in structured_data to the output file
    
    with open( outfile, "w" ) as out:
        labels = list( structured_data.keys() ) # aka column headers or variables

        #write header line
        out.write( 'time\t{}\n'.format( '\t'.join( labels ) ) )

        # each row corresponds to a time point. We take the number of time points from the
        # first label; it does not matter which label we use as long as it is a valid label
        number_of_time_points = len( structured_data[ labels[ 0 ] ] ) if labels else 0
        for time in range( number_of_time_points ):

            # each column corresponds to a label/variable
            values = [] # temporary storage for the values for the current row
            for label in labels:
                values.append( structured_data[ label ][ time ] ) #gather all values for the current time

            out.write( "{}\t{}\n".format( time, '\t'.join( values ) ) ) #write a tidy row
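
The `defaultdict` behaviour described in the comments of the cell above can be tried out in isolation. This is only a small sketch; the keys and values are borrowed from the larva data for illustration.

```python
from collections import defaultdict

# A normal dictionary raises a KeyError when we use a key that does not exist yet.
plain = {}
try:
    plain["velocity"].append("0,608276")
except KeyError:
    print("plain dict: KeyError for a missing key")

# A defaultdict calls list() for a missing key and stores the result first.
auto = defaultdict(list)
auto["velocity"].append("0,608276")
auto["velocity"].append("0,894284")
print(dict(auto))

# Careful: merely reading a missing key also creates it.
print(len(auto["area"]), "area" in auto)
```

The last line shows a side effect worth remembering: looking up a key that was never filled still adds an empty list under that key.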

    



In [None]:
#Call the above function with one dataset for the first larva
clean_larva_data_from_file( 'measurements/Results_video1.csv', 1, "measurements/Results_video1_larva_col1.csv" )

And the results are really nice! Check them out <a href="measurements/Results_video1_larva_col1.csv" download="Results_video1_larva_col1.csv">here</a>. If you open the document in MS Excel or OpenOffice/LibreOffice, make sure you select "tab" as the column separator. If you get asked about the file encoding, choose UTF-8.
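
If you would rather check the result with Python than with a spreadsheet program, a sketch like the following reads the clean format back in. It assumes that the values are still written with a decimal comma (as in the structure table above) and uses a hypothetical helper `to_float`; the sample lines stand in for a real file.

```python
import csv
import io

# A few lines in the layout of the clean output (values taken from the table above,
# reduced to three variables to keep the example short).
CLEAN_TEXT = (
    "time\tacc_dst\tacceleration\tvelocity\n"
    "0\t3\t-0,208191\t0,608276\n"
    "1\t3\t-0,840494\t0,894284\n"
)


def to_float(cell):
    """Convert a number written with a decimal comma, e.g. '0,608276'."""
    return float(cell.replace(",", "."))


velocities = [
    to_float(row["velocity"])
    for row in csv.DictReader(io.StringIO(CLEAN_TEXT), delimiter="\t")
]
print(velocities)
```

To read one of the generated files instead, pass `open( 'measurements/Results_video1_larva_col1.csv', newline='' )` to `csv.DictReader` in place of the `io.StringIO` object.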

But wait! That is only one single dataset! Let's just ask the computer to convert them all for us!

In [None]:
#We have to do a bit of planning, because not all spreadsheets have the same number of larvae in them
number_of_larvae = {
    1: 2, # two larvae in Results_video1.csv file (we already processed this one in the previous cell)
    2: 4,
    3: 3,
    5: 2,
    6: 1,
    7: 2,
    8: 1,
    9: 2,
}
# We could also have written another program that gives us the number of columns for each file.
# But well, I just wrote down this dictionary in 2 minutes :)
# Keep in mind that in the future, if you should get more files, it is a better idea to automate this part as well.

for video_id, larva_count in number_of_larvae.items():
    for column_index in range( 1, larva_count + 1 ): #start at 1, end at larva_count
        clean_larva_data_from_file(
            'measurements/Results_video{}.csv'.format( video_id ), #input file
            column_index, #larva located at column `column_index`
            'measurements/Results_video{}_larva_col{}.csv'.format( video_id, column_index ) #output file
        )
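
As mentioned in the comments above, the larva counts could also be determined automatically. Here is a minimal sketch, assuming that the first line of each file is a header row whose first cell belongs to the label column and whose remaining cells each stand for one larva. The function name `count_larvae` and the example header names are made up.

```python
import io


def count_larvae(lines):
    """Count the larva columns using the header row of a measurements file.

    Assumption: the first line is the header, its first cell belongs to the
    label column and every further cell stands for one larva.
    """
    header = next(iter(lines))
    cells = header.rstrip("\n").split("\t")
    return len(cells) - 1


# Made-up example with two larvae in the assumed layout.
example = io.StringIO("\tlarva_1\tlarva_2\nacc_dst(0)\t3\t4\n")
print(count_larvae(example))
```

With a real file you would call it as `count_larvae( f )` inside a `with open( ... ) as f:` block and use the result instead of the hand-written dictionary.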

Well, that took less than a second! Let's take a look at the generated files!

In [None]:
from os import listdir
sorted( listdir( 'measurements/' ) )

And from here on we will proceed with the analysis! I recommend creating a fresh notebook for this.