# Database cleanup item 2: repeated Wave II PWV records

```
From: "Qian, Yong (NIH/NIA/IRP) [C]" <QianY@grc.nia.nih.gov>
Date: Friday, April 28, 2017 at 12:08 PM
To: "Coletta, Christopher (NIH/NIA/IRP) [E]" <christopher.coletta@nih.gov>
Cc: "Ding, Jun (NIH/NIA/IRP) [E]" <jun.ding@nih.gov>, "Schlessinger, David (NIH/NIA/IRP) [E]" <SchlessingerD@grc.nia.nih.gov>
Subject: repeated records in 2017-02-24-Sardinia-Data-TAB.txt
 
Hi Chris,

There were a few wave2 pwv records that are repeated in 2017-02-24-Sardinia-Data-TAB.txt.  Can you correct the database file?
I understand that you didn’t process the wave2 data so it was not your fault.
 
Here are those:
id_individual   Wave    pwvDate pwvQual pwv
3528    2       2006-01-27      3       509
4285    2       2005-06-10      3       754.8   
11633   2       2006-12-06      3       590.5   
27393   2       2006-09-13      3       957.5   
30245   2       2006-01-31      2       771.7   
 
Yong
```

======

```
From: Coletta, Christopher (NIH/NIA/IRP) [E]
Sent: Friday, April 28, 2017 12:12:24 PM
To: Qian, Yong (NIH/NIA/IRP) [C]
Cc: Ding, Jun (NIH/NIA/IRP) [E]; Schlessinger, David (NIH/NIA/IRP) [E]
Subject: Re: repeated records in 2017-02-24-Sardinia-Data-TAB.txt
 
I understand.
 
When the same thing happened in the Wave IV data (duplicate PWV readings), I looked at the date and selected the ones that fell within the Wave IV date range. I’ll try to do the same in this case.
 
Is there a table somewhere available showing the dates that correspond to the waves?
 
-Chris
 ```
 
======

 ```
 I don't think there is a date table for waves.  

I think in general, waves are defined as:
wave1:  12/2001 to 5/2004
wave2:    5/2004 to 7/2008
wave3:    7/2008 to 1/2012
wave4:    2/2012 to 5/2016 ?

correct me if I am wrong.

Yong
```
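The approximate ranges quoted above can be turned into a small lookup for sanity-checking a `pwvDate` against its recorded wave. This is only a sketch: the boundaries are the month-level guesses from the e-mail, the wave 4 end date was marked uncertain there, and both the function name `wave_for_date` and the rule that a boundary month belongs to the later wave are assumptions made here for illustration.

```python
from datetime import date

# Approximate wave start dates from the e-mail, at month resolution.
# ASSUMPTION: a wave starts on the first day of its start month and runs
# until the next wave starts, so a shared boundary month (e.g. 5/2004)
# is assigned to the later wave.
WAVE_STARTS = [
    (1, date(2001, 12, 1)),
    (2, date(2004, 5, 1)),
    (3, date(2008, 7, 1)),
    (4, date(2012, 2, 1)),
]
# ASSUMPTION: exclusive end of wave 4, derived from the uncertain "5/2016 ?".
WAVE4_END = date(2016, 6, 1)


def wave_for_date(d):
    """Return the wave whose approximate range contains d, or None."""
    if d < WAVE_STARTS[0][1] or d >= WAVE4_END:
        return None
    wave = None
    for number, start in WAVE_STARTS:
        if d >= start:
            wave = number
    return wave


# The five pwvDate values listed in the first e-mail.
email_dates = [
    date(2006, 1, 27),
    date(2005, 6, 10),
    date(2006, 12, 6),
    date(2006, 9, 13),
    date(2006, 1, 31),
]
email_waves = [wave_for_date(d) for d in email_dates]
```

Under these assumptions all five dates from the e-mail fall inside the wave 2 range, so the date alone cannot tell the repeated records apart.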

# Initial declarations

In [1]:
pwd

'/Users/colettace/projects/david/sardiNIA_database/latest/20170703_IMT_update'

In [2]:
import pandas as pd

In [3]:
pd.__version__

'0.20.2'

In [4]:
prf = pd.read_csv( '2017-07-07-Sardinia-Data-TAB.txt', sep='\t')



# Wave II: multiple records for participant

In [5]:
wave2 = prf.loc[ prf.Wave == 2 ]

In [6]:
len( wave2)

5256

In [7]:
len( set( wave2.id_individual ) )

5251

In [8]:
from collections import Counter

In [9]:
trouble_ids = [ _id for _id, count in Counter( list(wave2.id_individual)).items() if count > 1]

In [10]:
trouble_rows = wave2[ wave2.id_individual.isin( trouble_ids ) ]


In [11]:
trouble_rows

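| | id_individual | id_sir | id_mad | Wave | Visit | Age | Sex | Education | Occupation | MaritalStatus | ... | audioR4000 | audioR6000 | audioR8000 | audioL250 | audioL500 | audioL1000 | audioL2000 | audioL4000 | audioL6000 | audioL8000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5204 | 3528 | | | 2 | | 29.2 | | | | | ... | | | | | | | | | | |
| 5205 | 3528 | | | 2 | | 29.2 | | | | | ... | | | | | | | | | | |
| 6206 | 4285 | | | 2 | | 64.9 | | | | | ... | | | | | | | | | | |
| 6207 | 4285 | | | 2 | | 64.9 | | | | | ... | | | | | | | | | | |
| 9695 | 11633 | | | 2 | | 41.0 | | | | | ... | | | | | | | | | | |
| 9696 | 11633 | | | 2 | | 41.0 | | | | | ... | | | | | | | | | | |
| 15437 | 27393 | | | 2 | | 66.4 | | | | | ... | | | | | | | | | | |
| 15438 | 27393 | | | 2 | | 66.4 | | | | | ... | | | | | | | | | | |
| 16617 | 30245 | | | 2 | | 44.5 | | | | | ... | | | | | | | | | | |
| 16618 | 30245 | | | 2 | | 44.5 | | | | | ... | | | | | | | | | | |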
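As an alternative to the `Counter` approach, pandas can flag repeated ids directly with `Series.duplicated`. A minimal sketch on a toy frame follows; the frame `toy` and the id 9999 are made up for illustration, and only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the Wave II subset; values are illustrative only.
toy = pd.DataFrame(
    {
        "id_individual": [3528, 3528, 4285, 9999],
        "Wave": [2, 2, 2, 2],
    }
)

# keep=False marks every member of a repeated group, not just the later copies.
dup_mask = toy["id_individual"].duplicated(keep=False)
toy_trouble_rows = toy[dup_mask]
toy_trouble_ids = sorted(toy_trouble_rows["id_individual"].unique().tolist())
```

Because the mask is built from the same frame it indexes, there is no index mismatch between the mask and the rows being selected.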


# Check that they are perfect duplicates that do not need to be merged

In [12]:
trouble_rows_copy = trouble_rows.copy()

In [13]:
trouble_rows_copy.fillna(0, inplace=True)

In [14]:
all( trouble_rows_copy.loc[ 5204] == trouble_rows_copy.loc[ 5205] )

True

In [15]:
all( trouble_rows_copy.loc[ 6206] == trouble_rows_copy.loc[ 6207] )

True

In [16]:
all( trouble_rows_copy.loc[ 9695 ] == trouble_rows_copy.loc[ 9696] )

True

In [17]:
all( trouble_rows_copy.loc[ 15437 ] == trouble_rows_copy.loc[ 15438 ] )

True

In [18]:
all( trouble_rows_copy.loc[ 16617 ] == trouble_rows_copy.loc[ 16618 ] )

True
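The `fillna(0)` step matters because `NaN == NaN` evaluates to False, so two rows that are both missing a value would otherwise never compare equal. One caveat of filling with 0 is that a genuine 0 and a missing value then look identical. A hedged alternative is `DataFrame.duplicated`, which compares whole rows and treats missing values in the same column as a match. The toy frame below is illustrative: the index labels and `pwv` values echo the notebook and e-mail, and the all-missing `id_sir` column stands in for the many empty columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "id_individual": [3528, 3528, 4285, 4285],
        "Wave": [2, 2, 2, 2],
        "id_sir": [np.nan, np.nan, np.nan, np.nan],
        "pwv": [509.0, 509.0, 754.8, 754.8],
    },
    index=[5204, 5205, 6206, 6207],
)

# Plain == never reports NaN as equal to NaN, so this check fails
# even though the two rows are copies of each other.
naive_equal = bool((toy.loc[5204] == toy.loc[5205]).all())

# duplicated() flags rows that repeat an earlier row in every column.
exact_repeat = toy.duplicated(keep="first")

# Every surplus row per id should be an exact repeat of an earlier row.
surplus_rows = len(toy) - toy["id_individual"].nunique()
all_perfect = int(exact_repeat.sum()) == surplus_rows
```

If `all_perfect` is True, no merging is needed and one row per id can simply be dropped.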

# Drop the duplicate rows

In [19]:
prf.drop( [5204, 6206, 9695, 15437 , 16617], inplace=True )
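The hard-coded index labels work for this one-off fix. If the same cleanup recurs, `drop_duplicates` avoids typing labels by hand, and `keep="last"` mirrors the choice above of dropping the first row of each pair. A sketch on toy data (the frame is illustrative, not the real table):

```python
import pandas as pd

toy = pd.DataFrame(
    {"id_individual": [3528, 3528, 4285, 4285], "Wave": [2, 2, 2, 2]},
    index=[5204, 5205, 6206, 6207],
)

# Only rows identical in every column are removed; the later copy is kept.
deduped = toy.drop_duplicates(keep="last")
kept_labels = deduped.index.tolist()
```

Since only fully identical rows are removed, repeated ids whose records differ would survive and still need a manual look.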

# Check to see it worked

In [20]:
wave2 = prf.loc[ prf.Wave == 2 ]

In [21]:
len( wave2)

5251

In [22]:
len( set( wave2.id_individual ) )

5251

In [23]:
trouble_ids = [ _id for _id, count in Counter( list(wave2.id_individual)).items() if count > 1]

In [24]:
trouble_ids

[]

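# Format the database as tab-separated text and write to file

Convert every column to strings, blank out missing values (`nan`, `NaT`), and strip trailing `.0` as well as floating-point noise (long runs of 0s or 9s before the final digit).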

In [39]:
for name in list(prf.columns):
    prf[name] = prf[name].astype(str)
prf.replace( to_replace='nan', value='', inplace=True, regex=False )
prf.replace( to_replace='NaT', value='', inplace=True, regex=False )
prf.replace( regex=True, inplace=True, to_replace=r'\.0$', value='')
prf.replace( regex=True, inplace=True, to_replace=r'00000+\d$', value='')
prf.replace( regex=True, inplace=True, to_replace=r'999999+\d$', value='')
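To see what the three regular expressions do to individual values, here is a standalone sketch that applies the same patterns with the standard `re` module. The helper name `trim` and the sample strings are chosen here for illustration. Note that the last two patterns truncate rather than round, so a value stored as `64.89999999999999` comes out as `64.8`, not `64.9`.

```python
import re


def trim(text):
    """Apply the notebook's replacements, in the same order, to one string."""
    if text in ("nan", "NaT"):  # missing values become empty cells
        return ""
    text = re.sub(r"\.0$", "", text)        # whole numbers: 2.0 -> 2
    text = re.sub(r"00000+\d$", "", text)   # long run of zeros plus final digit
    text = re.sub(r"999999+\d$", "", text)  # long run of nines plus final digit
    return text


samples = ["nan", "2.0", "754.8", "0.30000000000000004", "64.89999999999999"]
trimmed = [trim(s) for s in samples]

# A rounding alternative (an assumption, not what the notebook does):
# format to 10 significant digits, which also drops trailing zeros.
rounded = format(64.89999999999999, ".10g")
```

Whether truncation matters depends on how many values in the table actually carry a long run of 9s; the `format` call is shown only as one possible way to round instead.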

In [40]:
prf.to_csv( '2017-07-10-Sardinia-Data-TAB.txt', sep='\t', encoding='utf-8', index=False)
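A quick round-trip check can confirm that a written file reads back with the same shape and cell text. The sketch below uses an in-memory buffer and a toy frame instead of the real file; `dtype=str` and `keep_default_na=False` keep pandas from re-parsing the empty cells as NaN. The toy values are illustrative only.

```python
import io

import pandas as pd

# Toy frame of strings, already in the "formatted" state with one empty cell.
toy = pd.DataFrame(
    {"id_individual": ["3528", "4285"], "Wave": ["2", "2"], "pwv": ["509", ""]}
)

# Write tab-separated text to memory rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Read everything back as text so nothing is re-typed on the way in.
back = pd.read_csv(buffer, sep="\t", dtype=str, keep_default_na=False)
same_shape = back.shape == toy.shape
same_cells = back.values.tolist() == toy.values.tolist()
```

The same comparison against the real file would replace the buffer with the output filename.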

In [41]:
ls -l *TAB*.txt

-rw-r--r--@ 1 colettace  NIH\Domain Users  41187887 Jul  3 16:22 2017-07-03-Sardinia-TAB.txt
-rw-r--r--  1 colettace  NIH\Domain Users  41217915 Jul  7 16:54 2017-07-07-Sardinia-Data-TAB.txt
-rw-r--r--  1 colettace  NIH\Domain Users  41217915 Jul  7 18:00 2017-07-07-Sardinia-TAB-NEW.txt
-rw-r--r--  1 colettace  NIH\Domain Users  41212401 Jul 10 11:52 2017-07-10-Sardinia-Data-TAB.txt


In [42]:
len(prf)

20963