In [1]:
import pandas as pd
import numpy as np
import re
%load_ext watermark
%watermark -iv -v -d

pandas      0.20.3
numpy       1.11.3
re          2.2.1
2018-02-15 

CPython 3.5.4
IPython 6.1.0


## Prevention is the best treatment
The easiest way to deal with badly formatted, poorly filled-in spreadsheets is to avoid them in the first place: give collaborators a sample spreadsheet in which we have already filled in a few rows with dummy data.
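As a rough sketch of what such a template could look like, the snippet below builds a small table with two dummy rows. The layout (one column per piece of information, with `PatientID`, `Visit` and `Dilution` kept separate) and the dummy values are assumptions made for illustration, not the layout of the clinician's spreadsheet.

```python
import io
import pandas as pd

# Hypothetical template layout: one value per column, one sample per row.
template = pd.DataFrame(
    {
        "PatientID": ["00001", "00001"],
        "Visit": ["V1", "V1"],
        "Dilution": [100, 1000],
        "Hospital": ["hospital1", "hospital1"],
        "Age": [60, 60],
        "Gender": ["M", "M"],
    }
)

# Written to an in-memory CSV so the sketch runs without an Excel engine;
# template.to_excel("template.xlsx", index=False) would write a spreadsheet instead.
buffer = io.StringIO()
template.to_csv(buffer, index=False)
template_csv = buffer.getvalue()
print(template_csv)
```

Filling in every cell of the dummy rows shows collaborators that values should be repeated on each row rather than entered once per block.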

## Read in our data

Pandas comes with many different parsers, which makes our life a lot easier. Luckily, one of them handles Excel files. The data used here were modified from an original spreadsheet that a clinician handed to one of my professors.

In [2]:
excelfile = pd.ExcelFile('./terrible_spreadsheet.xlsx')

In [3]:
firstsheet = excelfile.sheet_names[0]
excelfile.sheet_names

['Plate 1',
 'Plate 2',
 'Plate 3',
 'Plate 4',
 'Plate 5',
 'Plate6',
 'Plate7',
 'Plate8',
 'Plate9',
 'Plate10',
 'Plate11']

In [4]:
ff = pd.read_excel(excelfile, sheetname=firstsheet, header=1)
ff.shape

(47, 54)

To make our life easier, we want to read all worksheets from the spreadsheet into a single DataFrame. To keep track of which row came from which worksheet, we also add a column named `sheet` to each DataFrame before concatenating.

In [5]:
df = pd.concat([pd.read_excel(excelfile, sheetname=sheet, header=1).assign(sheet=sheet)
                for sheet in excelfile.sheet_names])

In [6]:
df.shape

(470, 55)
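The same pattern can be tried on toy data without the Excel file. In this minimal sketch the `sheets` dictionary and its contents are hypothetical stand-ins for two worksheets.

```python
import pandas as pd

# Hypothetical stand-ins for two worksheets that share the same columns.
sheets = {
    "Plate 1": pd.DataFrame({"Sample ID": ["a", "b"], "SEB": [0.1, 0.2]}),
    "Plate 2": pd.DataFrame({"Sample ID": ["c"], "SEB": [0.3]}),
}

# Tag every row with the name of the sheet it came from, then stack the frames.
combined = pd.concat(
    [frame.assign(sheet=name) for name, frame in sheets.items()]
)

print(combined)
print(combined.index.tolist())  # per-sheet row labels are kept, so they repeat
```

Because `pd.concat` keeps each frame's own row labels by default, the combined index contains duplicates, which matters for the groupby step further down.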

## Clean the column names

For convenience in further analysis, we want column names to consist only of letters, digits and the underscore character. Columns named this way (as long as the name does not start with a digit or clash with an existing DataFrame attribute) can be accessed with attribute syntax, so to access `column1` we would write `df.column1`. We will use a regular expression (regex for short) to do this. For more info you can look at the slides from [Al Sweigart's talk "Yes it's time to learn regular expressions"](http://bitly.com/yesregex) or watch the talk itself.

In [7]:
df.columns

Index(['ABA', 'Age', 'BSA', 'Betatoxin', 'Exoprotein ext', 'Gender',
       'Glom.extract', 'HLA', 'HLA -2', 'HLA-1', 'HSA', 'Hemolysin gamma A',
       'Hemolysin gamma B', 'Hemolysin gamma C', 'Hospital ', 'LDL',
       'LukAB(Lab)', 'LukAB(cc30)', 'LukD', 'LukE', 'LukF-PV', 'LukS-PV',
       'PC-12', 'PC16', 'PC4', 'PLY', 'PNAG', 'PSM 4variant', 'PSMalpha2',
       'PSMalpha3', 'Pn CWPS', 'Pn PS12', 'Pn PS23', 'Rabbit IgG',
       'S.Pyogenese arcA', 'SEB', 'SEB.1', 'SEG', 'SEI', 'SEM', 'SEN', 'SEO',
       'SEU', 'SP', 'Sample ID', 'SpA domD5-WT', 'SpA domD5FcNull',
       'Tetanus Toxoid', 'Tetanus Toxoid.1', 'cytoplasmic ext', 'hIgA', 'hIgG',
       'psmalpah4', 'sheet', 'surface protein ext'],
      dtype='object')

In [8]:
oldcols = df.columns

In [9]:
colcleaningregex = re.compile(r'[^\w]')

In [10]:
newcols = [colcleaningregex.sub('_', col.strip()) for col in df.columns]

In [11]:
print(len(oldcols)-len(oldcols.unique()),
     len(newcols)-len(np.unique(newcols)))

0 0


In [12]:
df.columns = newcols
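Comparing counts tells us whether cleaning introduced duplicate names, but not which ones. The sketch below, using hypothetical column names chosen so that two of them collide, shows one way to list the offenders.

```python
import re
import pandas as pd

non_word = re.compile(r"[^\w]")  # same character class as colcleaningregex

# Hypothetical names: the first two differ only in punctuation.
raw = ["HLA-1", "HLA 1", "Hospital ", "LukAB(Lab)"]
cleaned = [non_word.sub("_", name.strip()) for name in raw]
print(cleaned)

# Names that occur more than once after cleaning.
cleaned_index = pd.Index(cleaned)
collisions = cleaned_index[cleaned_index.duplicated(keep=False)].unique().tolist()
print(collisions)
```

An empty `collisions` list corresponds to the `0 0` result printed above for the real column names.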

## Get rid of empty columns & clean index

Sometimes a spreadsheet contains columns that are completely empty and therefore useless for analysis. If we expected to process additional spreadsheets in the same format in the future, we would leave these columns in, but since that is not the case here we can simply delete them.

In [13]:
df.notnull().any()

ABA                     True
Age                     True
BSA                     True
Betatoxin               True
Exoprotein_ext          True
Gender                  True
Glom_extract            True
HLA                     True
HLA__2                  True
HLA_1                   True
HSA                     True
Hemolysin_gamma_A       True
Hemolysin_gamma_B       True
Hemolysin_gamma_C       True
Hospital                True
LDL                     True
LukAB_Lab_              True
LukAB_cc30_             True
LukD                    True
LukE                    True
LukF_PV                 True
LukS_PV                 True
PC_12                   True
PC16                    True
PC4                     True
PLY                     True
PNAG                    True
PSM_4variant            True
PSMalpha2               True
PSMalpha3               True
Pn_CWPS                 True
Pn_PS12                 True
Pn_PS23                 True
Rabbit_IgG              True
S_Pyogenese_ar

In [14]:
df = df.loc[:, df.notnull().any()]
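To see what the boolean mask does, here is a small sketch on a hypothetical three-column frame. It also shows `dropna(axis=1, how="all")`, which gives the same result in one call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [60.0, np.nan],
        "empty": [np.nan, np.nan],  # no values at all
        "Gender": ["M", None],
    }
)

mask = toy.notnull().any()  # True for columns with at least one value
kept = toy.loc[:, mask]
also_kept = toy.dropna(axis=1, how="all")  # built-in equivalent

print(mask)
print(kept.columns.tolist())
```

Columns that are only partly empty, like `Age` and `Gender` here, survive either way.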


## Fill *Hospital*, *Age* and *Gender* columns

The `Hospital`, `Age` and `Gender` columns are only filled in every couple of rows, and we want to fill in the blanks. Since everything is already loaded into one DataFrame, we have to forward-fill within a groupby on `sheet`. Otherwise the last value in these columns from one sheet could carry over into the empty rows at the beginning of the next sheet, which we do not want. Before we can use apply on the groupby object we need to call `reset_index`, because our DataFrame contains duplicate index labels, which are not allowed in groupby-apply operations.

In [15]:
df = df.reset_index(drop=True) #required in order for the groupby-apply to work
df.loc[:, ['Hospital', 'Age', 'Gender']] = df.groupby('sheet').apply(
    lambda x: x.loc[:, ['Hospital', 'Age', 'Gender']].fillna(method='ffill')
)

In [16]:
df.loc[:, ['Hospital', 'Age', 'Gender']].iloc[10:15]

| | Hospital | Age | Gender |
|---|---|---|---|
| 10 | hospital1 | 60.0 | M |
| 11 | hospital1 | 60.0 | M |
| 12 | hospital1 | 60.0 | M |
| 13 | hospital1 | 60.0 | M |
| 14 | hospital1 | 60.0 | M |
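The difference between a plain forward fill and a per-sheet forward fill is easiest to see on a tiny example. The frame below is hypothetical, and the sketch uses the groupby `ffill` method directly rather than the `apply` call from the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "sheet": ["Plate 1", "Plate 1", "Plate 2", "Plate 2"],
        "Hospital": ["hospital1", np.nan, np.nan, "hospital2"],
    }
)

# Plain forward fill carries "hospital1" across the sheet boundary.
naive = toy["Hospital"].ffill()

# Filling within each sheet leaves the leading gap of "Plate 2" empty.
per_sheet = toy.groupby("sheet")["Hospital"].ffill()

print(naive.tolist())
print(per_sheet.tolist())
```

The first row of `Plate 2` stays missing in `per_sheet`, which is the behaviour we want: a value is never borrowed from the previous sheet.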


## Extract *PatientID*, *Visit* and *Dilution* from *Sample_ID*

Three pieces of information are stored in the `Sample_ID` column: the `PatientID`, the `Visit` and the `Dilution`.
The `PatientID` is a five-digit number, the `Visit` sits between the `PatientID` and the `Dilution`, and the `Dilution` is a 1 followed by zeros at the end of the `Sample_ID` string.
Any of these can be missing in a row. We will use a regular expression to extract this information.

In [17]:
df.Sample_ID.str?

[1;31mType:[0m        StringMethods
[1;31mString form:[0m <pandas.core.strings.StringMethods object at 0x00000000090F75F8>
[1;31mFile:[0m        c:\users\tobias\miniconda3\envs\pdsh_mod\lib\site-packages\pandas\core\strings.py
[1;31mDocstring:[0m  
Vectorized string functions for Series and Index. NAs stay NA unless
handled otherwise by a particular method. Patterned after Python's string
methods, with some inspiration from R's stringr package.

Examples
--------
>>> s.str.split('_')
>>> s.str.replace('_', '')


In [26]:
#df.Sample_ID.str.extract(r'.*\s(?P<Dilution>1[0]+)', expand=True)
df.Sample_ID.str.extract(r'\s(?P<Visit>[^\s]*)\s', expand=False).unique()

array([nan, 'V1', 'V2', 'V3', 'v1', '', 'GS2', 'GS1', 'JM', 'VANDER'], dtype=object)

In [34]:
results = df.Sample_ID.str.extract(
    r'(?P<PatientID>\d{5})?\s*(?P<Visit>[^\s]+\d)?\s+(?P<Dilution>1[0]+)?\s*$', 
                         expand=True)
results

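| | PatientID | Visit | Dilution |
|---|---|---|---|
| 0 | NaN | NaN | 10 |
| 1 | NaN | NaN | 100 |
| 2 | NaN | NaN | 1000 |
| 3 | NaN | NaN | 10000 |
| 4 | NaN | NaN | 100000 |
| 5 | NaN | NaN | 1000000 |
| 6 | NaN | NaN | 10000000 |
| 7 | 23234 | V1 | 100 |
| 8 | 23234 | V1 | 1000 |
| 9 | 23234 | V1 | 10000 |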
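To get a feel for how the optional groups behave, the same pattern can be run on a handful of made-up strings. The sample IDs below are hypothetical and only meant to exercise the different cases. The pattern is split over several lines purely for commenting.

```python
import pandas as pd

pattern = (
    r"(?P<PatientID>\d{5})?\s*"   # optional five-digit patient number
    r"(?P<Visit>[^\s]+\d)?\s+"    # optional visit label ending in a digit
    r"(?P<Dilution>1[0]+)?\s*$"   # optional dilution: a 1 followed by zeros
)

# Hypothetical sample IDs, not taken from the spreadsheet.
samples = pd.Series(["23234 V1 100", "23234 V1 ", " 1000", "23234 100", "10"])
parsed = samples.str.extract(pattern, expand=True)
print(parsed)
```

Note that the `\s+` in front of the `Dilution` group is not optional, so a string without any whitespace (the last sample) does not match at all and yields missing values in every column.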


In [31]:
results.Visit.unique()

array([nan, 'V1', 'V2', 'V3', 'v1', 'GS2', 'GS1'], dtype=object)

In [32]:
results.PatientID.unique()

array([nan, '23234', '28531', '28729', '33142', '35568', '62901', '52950',
       '57756', '48689', '62129', '62300', '62900', '17588', '15363',
       '59707', '64779', '67029', '77612', '78202', '83700', '84504',
       '99361', '92827', '93954', '94232', '99382', '11825', '99624',
       '99682', '27764', '44989', '27986', '37422', '46713', '59302',
       '15439', '14127', '10732'], dtype=object)

In [33]:
results.Dilution.unique()

array(['10', '100', '1000', '10000', '100000', '1000000', '10000000', nan], dtype=object)