# Data Processing from Prism files
We demonstrate programmatic data processing from two GraphPad Prism file formats: `.pzfx` and `.prism`.

## Import packages

In [1]:
import zipfile
import pandas as pd
import pzfx_parser
import io

## .pzfx file
Although `.pzfx` is a proprietary format used by GraphPad Prism, parsers are available. We use `pzfx_parser` (https://pypi.org/project/pzfx_parser/) to extract the data.

To use this package, please add the following to your environment:

`pzfx_parser="~0.4"`


In [2]:
# read the .pzfx file using pzfx_parser
# returns a dictionary of dataframes
df_pzfx = pzfx_parser.read_pzfx("data/prism_example_oldformat.pzfx")
df_pzfx

{'Data 1':    log ERY974 (nM)_0  log ERY974 (nM)_1  % cytotoxicity_0
 0          -5.123403          -5.123403          1.528881
 1          -4.162947          -4.162947          0.635631
 2          -3.157275          -3.157275          5.933002
 3          -2.136899          -2.136899         28.854434
 4          -1.096025          -1.096025         40.341922
 5          -0.113859          -0.113859         41.467704
 6           0.867409           0.867409         41.517207
 7           1.867805           1.867805         40.491442}

In [3]:
df_pzfx = (
    # select the desired dataframe by key
    df_pzfx['Data 1']
    # drop the first column, which duplicates the second in this dataset
    .drop(columns=['log ERY974 (nM)_0'])
    .rename(columns={'log ERY974 (nM)_1': 'log_dose_nM', '% cytotoxicity_0': 'percent_cytotoxicity'})
)
df_pzfx

   log_dose_nM  percent_cytotoxicity
0    -5.123403              1.528881
1    -4.162947              0.635631
2    -3.157275              5.933002
3    -2.136899             28.854434
4    -1.096025             40.341922
5    -0.113859             41.467704
6     0.867409             41.517207
7     1.867805             40.491442
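
Dropping a repeated column by name works for one file, but the name has to be known in advance. The sketch below shows one possible way to detect and drop repeated columns automatically. The helper `drop_duplicate_columns` and the synthetic table are illustrative assumptions, not part of `pzfx_parser`. Unlike the cell above, it keeps the first copy of a repeated column rather than the second.

```python
import pandas as pd


def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without columns whose values repeat an earlier column."""
    # Transpose so columns become rows, then flag rows that repeat an earlier one
    is_repeat = df.T.duplicated(keep="first")
    return df.loc[:, ~is_repeat.to_numpy()]


# Synthetic stand-in for a parsed table (values are illustrative only)
demo = pd.DataFrame({
    "x_0": [-5.0, -4.0, -3.0],
    "x_1": [-5.0, -4.0, -3.0],
    "y_0": [1.5, 0.6, 5.9],
})
deduped = drop_duplicate_columns(demo)
print(list(deduped.columns))
```

Comparing values rather than names avoids dropping a column that merely shares a label prefix with another.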


## .prism file
GraphPad Prism recently switched its file format to `.prism`, which behaves much like a `.zip` archive. The data are stored in `.csv` files, and other relevant information is stored in `.json` files. More on this file format here: https://www.graphpad.com/guides/prism/latest/user-guide/prism_file_format.htm

#### Tip!
Changing the extension of a `.prism` file to `.zip` and unzipping the resulting file exposes its contents (you can try this on your desktop).
Below, we use a Python script to read the contents of the `.prism` file.
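
Renaming is not required from Python: the standard library can check whether a file is zip-compatible regardless of its extension. The following is a minimal sketch under that assumption. The helper `list_archive_members` and the member names in the synthetic archive are hypothetical and do not reflect the actual layout of a `.prism` file.

```python
import io
import zipfile


def list_archive_members(source):
    """Return (name, size in bytes) for each member of a zip-compatible archive."""
    if not zipfile.is_zipfile(source):
        raise ValueError("not a zip-compatible archive")
    with zipfile.ZipFile(source, mode="r") as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Synthetic in-memory archive standing in for a real file (member names are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("data/tables/example/data.csv", "1,2\n3,4\n")
    archive.writestr("document.json", "{}")

members = list_archive_members(buffer)
print(members)
```

The same function accepts a path on disk in place of the in-memory buffer.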

In [4]:
# reading the zipfile
fprism_zip = zipfile.ZipFile(
    'data/prism_example_newformat.prism',
    mode='r'
)
fprism_zip

<zipfile.ZipFile filename='data/prism_example_newformat.prism' mode='r'>

In [5]:
# extracting names of files with .csv extension
csv_files = [file for file in fprism_zip.namelist() if file.endswith('.csv')]
csv_files

['data/tables/0F9A8752-97E5-469D-BA97-3DD6B9BECE6C/data.csv',
 'data/tables/4F4814CB-660B-4A39-B71D-8613009FF242/data.csv',
 'data/tables/5B474CD4-1E17-4B3A-BBAB-0F36896A1A42/data.csv',
 'data/tables/831842C8-EA10-4C56-8653-E243EADDF288/data.csv',
 'data/tables/93B56563-11DB-45D8-AB1A-630C417C4459/data.csv',
 'data/tables/BEB35D29-2AAF-4C78-9274-A2E89D63B0CD/data.csv',
 'data/tables/E0516C10-E3F4-49DC-8207-DE3754E4C6E1/data.csv']

In [6]:
# create empty list to hold the dataframes
dfList = []
# iterate over the list containing paths to csv files
for file in csv_files:
    # open each file
    with fprism_zip.open(file) as f:
        # read the file (output: bytes)
        fread = f.read()
        # decode the bytes to text and wrap in StringIO so read_csv can parse it
        # set header to None; otherwise the first data row becomes the header
        _df = pd.read_csv(io.StringIO(fread.decode('utf-8')), header=None)
    dfList.append(_df)
dfList

[          0          1
 0 -5.123403   2.000000
 1 -4.162947   1.000000
 2 -3.157275   1.000000
 3 -2.136899   4.000000
 4 -1.096025   8.000000
 5 -0.113859  30.000000
 6  0.867409  41.517207
 7  1.867805  40.491442,
           0          1
 0 -5.123403   1.000000
 1 -4.162947   3.000000
 2 -3.157275   6.000000
 3 -2.136899  24.000000
 4 -1.096025  40.341922
 5 -0.113859  41.467704
 6  0.867409  41.517207
 7  1.867805  40.491442,
                                                     0                     1
 0   log(agonist) vs. response -- Variable slope (f...                   NaN
 1                                     Best-fit values                   NaN
 2                                              Bottom                0.9525
 3                                                 Top                 41.21
 4                                             LogEC50                -2.435
 5                                           HillSlope                 1.196
 6                         

### MANUALLY select, modify, and save the desired csv file
A `.prism` file often contains multiple `.csv` files, so we must decide manually which one to extract, process, and save. Here, we work with the **fourth** dataframe (index 3).

#### Tip!
The free/trial version of Prism can be used to view `.prism` files, but not to edit or copy their contents. Because there are often multiple `.csv` files, we can open the file in the Prism viewer on our desktop to help identify the `.csv` file we are interested in. The `.csv` files do not have column labels; viewing the file in Prism lets us look up the labels and set them manually.
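
When the Prism viewer is not at hand, a quick summary of each extracted table can narrow down the candidates. The sketch below assumes that raw data tables are fully numeric while analysis results contain text labels, which holds for the output shown above but may not hold for every file. The helper `summarise_tables` and the two demo tables are illustrative, not taken from the example data.

```python
import pandas as pd


def summarise_tables(tables):
    """Build one summary row per dataframe: position, shape, and numeric-only flag."""
    rows = []
    for position, table in enumerate(tables):
        # Cells that cannot be parsed as numbers become NaN
        numeric = table.apply(pd.to_numeric, errors="coerce")
        rows.append({
            "position": position,
            "n_rows": table.shape[0],
            "n_cols": table.shape[1],
            # True when every non-missing cell parses as a number
            "all_numeric": bool(numeric.notna().eq(table.notna()).all().all()),
        })
    return pd.DataFrame(rows)


# Synthetic stand-ins: one data-like table and one results-like table
demo_tables = [
    pd.DataFrame([[-5.1, 1.5], [-4.2, 0.6]]),
    pd.DataFrame([["Best-fit values", None], ["Top", "41.21"]]),
]
summary = summarise_tables(demo_tables)
print(summary)
```

The `position` column is the list index to use when selecting a table.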


In [7]:
df_prism = (
    dfList[3]
    # manually rename columns
    .rename(columns={0: 'log_dose_nM', 1: 'percent_cytotoxicity'})
)
df_prism

   log_dose_nM  percent_cytotoxicity
0    -5.123403              1.528881
1    -4.162947              0.635631
2    -3.157275              5.933002
3    -2.136899             28.854434
4    -1.096025             40.341922
5    -0.113859             41.467704
6     0.867409             41.517207
7     1.867805             40.491442


#### Future work
- The column labels and other metadata are stored in the `.json` files. With some effort, it should be possible to automate the manual steps above.
- Some other popular file formats, such as `.docx` and `.pptx`, are also zip-based archives. In principle, the same approach should work for extracting data files embedded in them.
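
As a possible starting point for the first item, the sketch below loads every `.json` member of a zip-like archive into a dictionary so its contents can be inspected. It makes no assumption about the Prism JSON schema. The helper `load_json_members`, the member name `sheet.json`, and the `title` key are hypothetical and serve only to make the example runnable.

```python
import io
import json
import zipfile


def load_json_members(source):
    """Map each .json member name in a zip-like archive to its parsed content."""
    parsed = {}
    with zipfile.ZipFile(source, mode="r") as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    parsed[name] = json.load(handle)
    return parsed


# Synthetic in-memory archive; the member names and keys are made up
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("data/tables/example/data.csv", "1,2\n")
    archive.writestr("data/tables/example/sheet.json", json.dumps({"title": "Data 1"}))

metadata = load_json_members(buffer)
print(metadata)
```

Inspecting the resulting dictionary for a real file would show which keys, if any, carry the column labels.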

If you end up working on any of these, or have additional feature requests, please update this notebook or contact the SSI team :)