### Introduction

In the previous notebook, `clinical-trial-dataset-assessment.ipynb`, we identified the tidiness and quality issues in the dataset. Now we will clean those issues.
>To begin, the typical order for cleaning follows the data quality dimensions already noted:

- Completeness
- Validity
- Accuracy
- Consistency

>However, once we have tackled the completeness issues, it is more important to handle the tidiness issues next.

>This is because untidy data makes the cleaning process more difficult as we progress. Thus, we must ensure all the tables are structured as they should be before we start addressing the individual issues.

>Following the tidiness operations, we will then continue with the issues tied to the remaining data quality dimensions.

#### Process
>**Define**

Define the cleaning operation to be performed
>**Code**

Write the code to perform the operation
>**Test**

Test to confirm that our code worked as it should on the dataset


In [369]:
import pandas as pd

In [370]:
# create the dataframes

patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

## Must-Do

Before performing cleaning operations on any dataset, we must create a copy of that dataset and retain the original as it is.

All cleaning operations must be performed on the copy, not the original dataset.

In [371]:
# create a copy for each dataframe
patients_copy = patients.copy()
treatments_copy = treatments.copy()
adverse_reactions_copy = adverse_reactions.copy()
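
As a quick illustration of why the copy matters, the sketch below uses a small made-up frame (`original` and `working` are hypothetical names, not part of the dataset) to show that edits made to a `.copy()` leave the source untouched.

```python
import pandas as pd

# hypothetical toy frame standing in for one of the tables
original = pd.DataFrame({'height': [66, 27]})

# DataFrame.copy() makes a deep copy by default
working = original.copy()
working.loc[1, 'height'] = 72

print(original.height.tolist())  # the original keeps its values
print(working.height.tolist())   # only the copy was edited
```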

### __Assessment issues__

#### __Quality__

`patients` table:
- zip codes are floats and not strings
- state column is sometimes in full, other times abbreviated
- Tim Neudorf's height is 27 instead of 72
- zip_code column sometimes has four digits instead of five
- given name of the patient with patient id 9 is spelt Dsvid and not David
- some missing demographics data (e.g., address, city, state)
- erroneous datatypes
 
`treatments` table

- missing data in the hba1c change column
- missing rows of data: the trial total was 350, but there are only 280 records
- the 'u' in the start and end doses
- patient names are in lower case and not title case
- erroneous datatypes
- inaccurate hba1c changes

`adverse_reactions` table
- patient names are in lower case




### __Tidiness__ Issues
`patients` table
- contact column contains two variables

`treatments` table
- two columns (`auralin` and `novodra`) contain three variables (treatment, start dose, end dose)

`adverse_reactions` table
- table should be combined with the treatments table

`adverse_reactions` and `treatments` tables
- given name and surname columns are duplicated, i.e. unnecessary in both tables

_When we join the adverse_reactions and treatments tables, we'll also drop the duplicated name columns. Hence, only the patients table will have names in it. The primary key for both tables will, however, be the **patient id**, which doesn't change._

### Cleaning all missing records first

We have been provided with extra rows of data for the treatments table. We will append them to our main table by concatenating.


#### Define
Append the new rows of treatments table data to the copied df

#### Code

In [372]:
treatments_cut = pd.read_csv('treatments_cut.csv')

In [373]:
treatments_copy = pd.concat([treatments_copy, treatments_cut], ignore_index = True)

#### Test

In [374]:
treatments_copy.shape

(350, 7)

#### Define

>Drop rows with missing demographics data (e.g., address, city, state) in the patients table

>Fill the missing data in the hba1c change column of the treatments table by recalculating it from the start and end values

#### Code

In [375]:
# obtain the indexes of the rows with missing addresses
missing_address = patients_copy[patients_copy.address.isnull()].index

In [376]:
# drop those rows with missing addresses
patients_copy.drop(index=missing_address, axis=0, inplace=True)

In [377]:
# recalculate the hba1c change column to fill all the missing values in
treatments_copy.hba1c_change = treatments_copy.hba1c_start - treatments_copy.hba1c_end

#### Test

In [378]:
# confirm that no more rows have missing addresses
patients_copy[patients_copy.address.isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi


In [379]:
# confirm that there are no more NaN values in the hba1c change column
treatments_copy.sample(5)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
329,satsita,batukayev,-,42u - 42u,7.63,7.25,0.38
196,peter,pospíšil,51u - 60u,-,7.76,7.34,0.42
24,isac,berg,31u - 41u,-,9.68,9.29,0.39
84,furuta,osman,30u - 41u,-,7.52,7.18,0.34
74,ole,petersen,-,29u - 32u,7.95,7.55,0.4
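
A random sample can miss a stray null, so counting the remaining nulls is a more direct test. The sketch below uses a small made-up frame (`demo` is a hypothetical stand-in for `treatments_copy`) to show the recalculation followed by the check.

```python
import numpy as np
import pandas as pd

# hypothetical rows; the first one is missing its change value
demo = pd.DataFrame({
    'hba1c_start': [7.63, 7.76],
    'hba1c_end': [7.25, 7.34],
    'hba1c_change': [np.nan, 0.42],
})

# recalculate the change column, as in the cell above
demo['hba1c_change'] = demo.hba1c_start - demo.hba1c_end

# count the nulls left in the column
missing = int(demo.hba1c_change.isnull().sum())
print(missing)
```

On the real table, the equivalent check would be `treatments_copy.hba1c_change.isnull().sum()`.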


### Moving On

Now that we are done addressing the completeness issues, we need to prepare our dataset structure so that we can perform the remaining operations smoothly.

#### Define

- split the contact column of the patients table, which contains two variables
- separate the three variables held in two columns of the treatments table
- combine the adverse reactions table with the treatments table
- drop the given name and surname columns, which are duplicated (i.e. unnecessary) in the adverse reactions and treatments tables

#### Code

In [380]:
patients_copy.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [381]:
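import re
# optional country code, optional bracketed area code, then the remaining digit groups
phone_regex = r"((\+[0-9].)?(\(\d{3}\).)?([0-9].)?(\d{3}.)?(\d{3}.)(\d{4}))"
# the outermost capture group (column 0) holds the full phone number
patients_copy['phone_number'] = patients_copy.contact.str.extract(pat=phone_regex)[0]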

#### Test

In [382]:
patients_copy.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phone_number
77,78,female,Rut,Halldórsdóttir,1054 Zappia Drive,Lexington,KY,40507.0,United States,859-297-3368RutHalldorsdottir@einrot.com,5/9/1959,162.6,66,26.2,859-297-3368
351,352,female,Nile,Mehari,2458 Broadway Avenue,Chattanooga,TN,37403.0,United States,423-799-1730NileMehari@gustr.com,2/27/1956,197.6,65,32.9,423-799-1730
448,449,male,Ivan,Fomin,632 Peaceful Lane,Garfield Heights,OH,44128.0,United States,216-502-3773IvanFomin@dayrep.com,6/10/1930,139.9,65,23.3,216-502-3773
130,131,male,Albert,Wolfe,3710 Jerry Dove Drive,North Charleston,SC,29420.0,United States,843-494-0313AlbertRWolfe@jourrapide.com,6/16/1939,217.8,65,36.2,843-494-0313
255,256,female,Mette,Sandgreen,2324 Benson Street,Elk Lake,WI,54739.0,United States,MetteSandgreen@gustr.com715-231-3508,5/13/1938,187.1,60,36.5,715-231-3508


In [383]:
patients_copy[patients_copy.phone_number.isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phone_number
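
To see how the phone pattern behaves on the different contact layouts, it can be tried on a few strings with the standard `re` module. This is only a sketch: the three contact strings are taken from the sample rows shown in this notebook, and `contacts` and `phones` are names introduced here for illustration.

```python
import re

phone_regex = r"((\+[0-9].)?(\(\d{3}\).)?([0-9].)?(\d{3}.)?(\d{3}.)(\d{4}))"

# contact strings in the three layouts seen in the sample rows
contacts = [
    '951-719-9170ZoeWellish@superrito.com',
    'PamelaSHill@cuvox.de+1 (217) 569-3204',
    'AnnieJAllen@superrito.com1 508 921 6327',
]

# group(1) is the outermost capture group, i.e. the whole phone number
phones = [re.search(phone_regex, c).group(1) for c in contacts]
print(phones)
```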


#### Code

In [384]:
# a letter, then letters/digits/dots, an @, and a lowercase domain containing a dot
email_pattern = r"([A-Za-z][A-Za-z0-9.]*@[a-z]*\.[a-z]*)"
patients_copy['email'] = patients_copy.contact.str.extract(pat=email_pattern)

#### Test

In [385]:
print(patients_copy.email.isnull().sum())

0
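
A null count of zero shows that every row produced a match, but not what was matched. As a rough spot check, the pattern can be run on a few contact strings from the sample rows with the standard `re` module (`contacts` and `emails` are names introduced here). Note that the pattern assumes the local part holds only letters, digits and dots, and that the domain is lowercase letters.

```python
import re

email_pattern = r"([A-Za-z][A-Za-z0-9.]*@[a-z]*\.[a-z]*)"

# phone-first and email-first layouts from the sample rows
contacts = [
    '951-719-9170ZoeWellish@superrito.com',
    'PamelaSHill@cuvox.de+1 (217) 569-3204',
    'MetteSandgreen@gustr.com715-231-3508',
]

emails = [re.search(email_pattern, c).group(1) for c in contacts]
print(emails)
```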


In [386]:
patients_copy.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phone_number,email
294,295,female,Annie,Allen,3634 Lyon Avenue,Cambridge,MA,2142.0,United States,AnnieJAllen@superrito.com1 508 921 6327,3/31/1926,159.7,60,31.2,1 508 921 6327,AnnieJAllen@superrito.com
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,816-265-9578DavidGustafsson@armyspy.com,3/6/1937,163.9,66,26.5,816-265-9578,DavidGustafsson@armyspy.com
308,309,male,Daud,Batukayev,153 Fieldcrest Road,Huntington,New York,11743.0,United States,631-875-3023DaudBatukayev@teleworm.us,9/15/1983,178.9,73,23.6,631-875-3023,DaudBatukayev@teleworm.us
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4,1234567890,johndoe@email.com


In [387]:
# drop the contact column
patients_copy.drop(columns = 'contact', inplace=True)
patients_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 491 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    491 non-null    int64  
 1   assigned_sex  491 non-null    object 
 2   given_name    491 non-null    object 
 3   surname       491 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   birthdate     491 non-null    object 
 10  weight        491 non-null    float64
 11  height        491 non-null    int64  
 12  bmi           491 non-null    float64
 13  phone_number  491 non-null    object 
 14  email         491 non-null    object 
dtypes: float64(3), int64(2), object(10)
memory usage: 61.4+ KB


#### Define
Split the two treatment columns (`auralin` and `novodra`) of the treatments table into the three variables they contain

#### Code

In [388]:
treatments_copy.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [389]:
treatments_copy = pd.melt(treatments_copy, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'], value_vars=['auralin', 'novodra'],
 var_name='treatment', value_name = 'dosage', ignore_index=True)

In [390]:
treatments_copy.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dosage
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,auralin,-
2,yukitaka,takenaka,7.68,7.25,0.43,auralin,-
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,7.78,7.46,0.32,auralin,-


#### Define
Expand the dosage column of the treatments table to separate the start dose from the end dose, and remove the unit suffix (u)

In [391]:
treatments_copy['start_dose'] = treatments_copy.dosage.str.split(' - ').str[0].str[:2]
treatments_copy['end_dose'] = treatments_copy.dosage.str.split(' - ').str[1].str[:2]
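
Slicing with `.str[:2]` assumes every dose has exactly two digits. As an alternative sketch (not the approach used in this notebook), `str.extract` with two capture groups pulls out the digits whatever their length. The `dosage` Series below is made up for illustration.

```python
import pandas as pd

# hypothetical dosage values, including a one-digit and a three-digit dose
dosage = pd.Series(['41u - 48u', '-', '9u - 102u'])

# one capture group per dose; rows that do not match become NaN
doses = dosage.str.extract(r'(\d+)u - (\d+)u')
doses.columns = ['start_dose', 'end_dose']
print(doses)
```

Rows holding only `-` come back as NaN in both columns, so a later `dropna` still removes them.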

#### Test

In [392]:
treatments_copy.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dosage,start_dose,end_dose
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u,41,48.0
1,elliot,richardson,7.56,7.09,0.47,auralin,-,-,
2,yukitaka,takenaka,7.68,7.25,0.43,auralin,-,-,
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u,33,36.0
4,alissa,montez,7.78,7.46,0.32,auralin,-,-,


In [393]:
# drop the rows with NaN values (the rows whose dosage was '-'), then drop the dosage column
treatments_copy.dropna(axis=0, inplace = True)
treatments_copy.drop(columns = 'dosage', inplace=True)

#### Test


In [394]:
treatments_copy.reset_index(inplace=True, drop=True)
treatments_copy.tail(5)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose
345,christopher,woodward,7.51,7.06,0.45,novodra,55,51
346,maret,sultygov,7.67,7.3,0.37,novodra,26,23
347,lixue,hsueh,9.21,8.8,0.41,novodra,22,23
348,jakob,jakobsen,7.96,7.51,0.45,novodra,28,26
349,berta,napolitani,7.68,7.21,0.47,novodra,42,44


In [395]:
treatments_copy.shape

(350, 8)

In [396]:
treatments_copy[treatments_copy.given_name == 'elliot']

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose
175,elliot,richardson,7.56,7.09,0.47,novodra,40,45


#### Define
Combine the adverse reactions table with the treatments table by merging them

#### Code

In [397]:
adverse_reactions_copy.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


In [398]:
treatments_copy = treatments_copy.merge(adverse_reactions_copy, how='left', on=['given_name', 'surname'])

#### Define
Since the given name and surname columns are now duplicated in the treatments table, make the patient id the common identifier between the patients and treatments tables. Drop the names from the treatments table.

In [399]:
# create a df containing only the given name, surname and patient id from the patients table
df_id = patients_copy[['patient_id', 'given_name', 'surname']].copy()
df_id

Unnamed: 0,patient_id,given_name,surname
0,1,Zoe,Wellish
1,2,Pamela,Hill
2,3,Jae,Debord
3,4,Liêm,Phan
4,5,Tim,Neudorf
...,...,...,...
498,499,Mustafa,Lindström
499,500,Ruman,Bisliev
500,501,Jinke,de Keizer
501,502,Chidalu,Onyekaozulu


In [400]:
# convert the letter case in the df_id table to lower case to match with treatment table
df_id.given_name = df_id.given_name.str.lower()
df_id.surname = df_id.surname.str.lower()

In [401]:
# merge the new df with the treatments table on the given name and surname
treatments_copy = df_id.merge(treatments_copy, how ='inner', on=['given_name', 'surname'])
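
An inner merge keeps only the rows whose names appear in both tables, so treatment rows without a matching patient are dropped without notice. One way to see which rows those are is a left merge with `indicator=True`. The sketch below uses two made-up frames (`ids` and `treat` are hypothetical stand-ins for `df_id` and `treatments_copy`).

```python
import pandas as pd

# hypothetical id lookup with one misspelt given name
ids = pd.DataFrame({'patient_id': [1, 9],
                    'given_name': ['zoe', 'dsvid'],
                    'surname': ['wellish', 'gustafsson']})

# hypothetical treatment rows
treat = pd.DataFrame({'given_name': ['zoe', 'david'],
                      'surname': ['wellish', 'gustafsson'],
                      'treatment': ['novodra', 'auralin']})

# keep every treatment row and record where each one found a match
check = treat.merge(ids, how='left', on=['given_name', 'surname'], indicator=True)

# rows present only in the treatment frame would be lost by an inner merge
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['given_name', 'surname']])
```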

#### Test

In [402]:
treatments_copy.head()

Unnamed: 0,patient_id,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction
0,1,zoe,wellish,7.71,7.3,0.41,novodra,33,33,
1,2,pamela,hill,9.53,9.1,0.43,novodra,27,29,
2,4,liêm,phan,7.58,7.1,0.48,novodra,43,48,
3,6,rafael,costa,7.73,7.34,0.39,auralin,50,60,
4,7,mary,adams,7.65,7.26,0.39,novodra,32,33,


In [403]:
# drop the given name and surname columns 
treatments_copy.drop(columns = ['given_name', 'surname'], inplace=True)
treatments_copy.reset_index(inplace=True, drop=True)

In [404]:
treatments_copy

Unnamed: 0,patient_id,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction
0,1,7.71,7.30,0.41,novodra,33,33,
1,2,9.53,9.10,0.43,novodra,27,29,
2,4,7.58,7.10,0.48,novodra,43,48,
3,6,7.73,7.34,0.39,auralin,50,60,
4,7,7.65,7.26,0.39,novodra,32,33,
...,...,...,...,...,...,...,...,...
338,495,8.90,8.59,0.31,novodra,26,24,
339,497,7.71,7.35,0.36,auralin,35,38,
340,499,7.92,7.60,0.32,novodra,35,33,
341,500,7.72,7.39,0.33,auralin,46,53,


### Cleaning Quality Issues
>Since we are done cleaning the missing data and tidiness issues, we now proceed to the other quality issues we encountered.

**Note:** We have already cleaned some of them during the process above.

#### __Quality__

`patients` table:
- zip codes are floats and not strings
- state column is sometimes in full, other times abbreviated
- Tim Neudorf's height is 27 instead of 72
- zip_code column sometimes has four digits instead of five
- given name of the patient with patient id 9 is spelt Dsvid and not David
- some missing demographics data (e.g., address, city, state)
- erroneous datatypes
 
`treatments` table

- missing data in the hba1c change column
- missing rows of data: the trial total was 350, but there are only 280 records
- the 'u' in the start and end doses
- patient names are in lower case and not title case
- erroneous datatypes
- inaccurate hba1c changes

`adverse_reactions` table
- patient names are in lower case




#### Define
- using the pad function, add a leading zero to zip codes in the zip_code column that have four digits instead of five
- convert zip codes to strings

#### Code

In [405]:
# convert the datatype of the zip_code column to string
patients_copy.zip_code = patients_copy.zip_code.astype('str')

In [406]:
# strip the trailing '.0' left over from the float representation
patients_copy.zip_code = patients_copy.zip_code.str[:-2]

In [407]:
# using the pad function to pad zeros before four digit zip codes
patients_copy.zip_code.str.pad(5, fillchar='0')

0      92390
1      61812
2      68467
3      07095
4      36303
       ...  
498    03852
499    86341
500    64110
501    98109
502    68324
Name: zip_code, Length: 491, dtype: object
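
`str.pad` returns a new Series rather than changing the column in place, so the padded values displayed here are only kept if the result is assigned back. A minimal sketch on a made-up Series (`zips` is a hypothetical name):

```python
import pandas as pd

# hypothetical zip codes, two of them missing the leading zero
zips = pd.Series(['92390', '7095', '3852'])

# assign the result back so the padding is stored
zips = zips.str.pad(5, fillchar='0')
print(zips.tolist())
```

On the real table this would read `patients_copy.zip_code = patients_copy.zip_code.str.pad(5, fillchar='0')`; `str.zfill(5)` is an equivalent shorthand.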

#### Define
- Change Tim Neudorf's height from 27 to 72
- Change the given name of the patient with id 9 from Dsvid to David

#### Code

In [408]:
# use .loc to set the value directly and avoid chained assignment
patients_copy.loc[patients_copy.given_name == 'Tim', 'height'] = 72


#### Test

In [409]:
patients_copy[patients_copy.given_name == 'Tim']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,72,26.1,334-515-7487,TimNeudorf@cuvox.de


In [410]:
# change the given name for patient id 9, using .loc to avoid chained assignment
patients_copy.loc[patients_copy.patient_id == 9, 'given_name'] = 'David'


#### Test

In [411]:
patients_copy[patients_copy.patient_id == 9]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105,United States,3/6/1937,163.9,66,26.5,816-265-9578,DavidGustafsson@armyspy.com


#### Define
- change the state names from full forms to abbreviated ones

In [412]:
# find the state names that are written in full forms
patients_copy.state.value_counts()

California    36
TX            32
New York      25
CA            24
NY            22
MA            22
PA            18
GA            15
Illinois      14
OH            14
OK            13
MI            13
Florida       13
LA            13
NJ            12
VA            11
MS            10
WI            10
IL            10
IN             9
MN             9
FL             9
AL             9
TN             9
WA             8
NC             8
KY             8
MO             7
ID             6
NV             6
KS             6
SC             5
IA             5
CT             5
ME             4
RI             4
Nebraska       4
ND             4
CO             4
AZ             4
AR             4
MD             3
DE             3
WV             3
SD             3
OR             3
NE             2
MT             2
VT             2
DC             2
AK             1
WY             1
NH             1
NM             1
Name: state, dtype: int64

In [413]:
# create a dictionary to carry the state name and its abbreviation
states = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}

In [414]:
# write the function to replace the affected names in the state column with the abbreviations in the dict

def abbreviation(df):
    if df['state'] in states.keys():
        abb = states[df['state']]
        return abb
    else:
        return df['state']
    
patients_copy.apply(abbreviation, axis= 1)

0      CA
1      IL
2      NE
3      NJ
4      AL
       ..
498    ME
499    AZ
500    MO
501    WA
502    NE
Length: 491, dtype: object
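
`apply` returns a new Series, so the abbreviations need to be assigned back to the `state` column to take effect. As an alternative sketch to the row-wise function, `Series.replace` accepts the dictionary directly. The `state` Series below is made up for illustration.

```python
import pandas as pd

states = {'California': 'CA',
          'New York': 'NY',
          'Illinois': 'IL',
          'Florida': 'FL',
          'Nebraska': 'NE'}

# hypothetical mix of full and abbreviated state names
state = pd.Series(['California', 'NJ', 'Nebraska', 'AL'])

# values found in the dict are swapped; everything else is left as it is
state = state.replace(states)
print(state.tolist())
```

On the real table this would read `patients_copy.state = patients_copy.state.replace(states)`.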

#### Define
- Correct all erroneous datatypes of both dfs

In [415]:
patients_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 491 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    491 non-null    int64  
 1   assigned_sex  491 non-null    object 
 2   given_name    491 non-null    object 
 3   surname       491 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    object 
 8   country       491 non-null    object 
 9   birthdate     491 non-null    object 
 10  weight        491 non-null    float64
 11  height        491 non-null    int64  
 12  bmi           491 non-null    float64
 13  phone_number  491 non-null    object 
 14  email         491 non-null    object 
dtypes: float64(2), int64(2), object(11)
memory usage: 61.4+ KB


### Phone number data type
>The phone number column should be of type int, so we have to remove all the spaces and non-digit characters from it

#### Code

In [416]:
# strip every non-digit character (regex=True makes the pattern explicit), then left-pad to 11 digits with '1'
patients_copy.phone_number = patients_copy.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')


#### Test

In [417]:
patients_copy.phone_number.sample(5)

308    16318753023
401    17857488181
259    17274397150
44     12078614587
499    19282844492
Name: phone_number, dtype: object
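
To see what the cleaning step does to a single value, the same logic can be sketched with the standard library. `clean_phone` is a hypothetical helper introduced here, not part of the notebook; the inputs are phone numbers extracted earlier.

```python
import re

def clean_phone(raw):
    """Strip non-digit characters, then left-pad with '1' to 11 characters."""
    digits = re.sub(r'\D+', '', raw)
    return digits.rjust(11, '1')

samples = ['951-719-9170', '+1 (217) 569-3204', '1 508 921 6327']
cleaned = [clean_phone(p) for p in samples]
print(cleaned)
```

Numbers that already carry the leading country code stay at 11 digits, while ten-digit numbers gain a leading `1`.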

In [418]:
# change the datatypes
patients_copy.assigned_sex = patients_copy.assigned_sex.astype('category')
patients_copy.zip_code = patients_copy.zip_code.astype('category')
patients_copy.birthdate = pd.to_datetime(patients_copy.birthdate)

In [419]:
patients_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 491 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    491 non-null    int64         
 1   assigned_sex  491 non-null    category      
 2   given_name    491 non-null    object        
 3   surname       491 non-null    object        
 4   address       491 non-null    object        
 5   city          491 non-null    object        
 6   state         491 non-null    object        
 7   zip_code      491 non-null    category      
 8   country       491 non-null    object        
 9   birthdate     491 non-null    datetime64[ns]
 10  weight        491 non-null    float64       
 11  height        491 non-null    int64         
 12  bmi           491 non-null    float64       
 13  phone_number  491 non-null    object        
 14  email         491 non-null    object        
dtypes: category(2), datetime64[ns](1), float

### Treatment table datatypes
>Change the datatype of the start and end dose columns to int

In [420]:
treatments_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 343 entries, 0 to 342
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   patient_id        343 non-null    int64  
 1   hba1c_start       343 non-null    float64
 2   hba1c_end         343 non-null    float64
 3   hba1c_change      343 non-null    float64
 4   treatment         343 non-null    object 
 5   start_dose        343 non-null    object 
 6   end_dose          343 non-null    object 
 7   adverse_reaction  34 non-null     object 
dtypes: float64(3), int64(1), object(4)
memory usage: 21.6+ KB


In [421]:
treatments_copy.start_dose = treatments_copy.start_dose.astype(int)
treatments_copy.end_dose = treatments_copy.end_dose.astype(int)

#### Test

In [422]:
treatments_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 343 entries, 0 to 342
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   patient_id        343 non-null    int64  
 1   hba1c_start       343 non-null    float64
 2   hba1c_end         343 non-null    float64
 3   hba1c_change      343 non-null    float64
 4   treatment         343 non-null    object 
 5   start_dose        343 non-null    int32  
 6   end_dose          343 non-null    int32  
 7   adverse_reaction  34 non-null     object 
dtypes: float64(3), int32(2), int64(1), object(2)
memory usage: 18.9+ KB


### Duplicate patient names
Some patients, like Jakobsen, Gersten, and Taylor, have more than one record. There are also some useless John Doe records.
>To clean them out, we'll select all rows of the df where the address is not duplicated

In [423]:
# rows with duplicated address for same patients but with their nicknames
patients_copy[patients_copy.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
229,230,male,John,Doe,123 Main Street,New York,NY,12345,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
237,238,male,John,Doe,123 Main Street,New York,NY,12345,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
244,245,male,John,Doe,123 Main Street,New York,NY,12345,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
251,252,male,John,Doe,123 Main Street,New York,NY,12345,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
277,278,male,John,Doe,123 Main Street,New York,NY,12345,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com


#### Define
Select the rest of the dataframe, excluding these rows, using tilde indexing
>NOTE: `~` negates the boolean mask, i.e. it means 'is not'

In [424]:
# select the rest of the dataframe, excluding these rows, using tilde indexing
patients_copy = patients_copy[~patients_copy.address.duplicated()]
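
`duplicated()` marks every repeat of a value after its first occurrence, and `~` flips that mask, so the first record for each address is kept. The sketch below uses a made-up Series (`address` is a hypothetical name). Passing `keep=False`, shown here as an option rather than what the notebook does, would flag every row that shares an address.

```python
import pandas as pd

# hypothetical addresses; the first and last rows share one
address = pd.Series(['123 Main Street', '648 Old Dear Lane', '123 Main Street'])

# True only for the second and later occurrences of a value
repeat = address.duplicated()

# ~ inverts the mask, keeping the rows that are not repeats
kept = address[~repeat]

# keep=False flags every member of a duplicated group
all_shared = address.duplicated(keep=False)
print(kept.tolist())
```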

In [425]:
patients_copy

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,1976-07-10,121.7,66,19.6,19517199170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,1967-04-03,118.8,66,19.2,12175693204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,1980-02-19,177.8,71,24.8,14023636804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095,United States,1951-07-26,220.9,70,31.7,17326368246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,72,26.1,13345157487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497,498,male,Masataka,Murakami,1179 Patton Lane,Tulsa,OK,74116,United States,1937-08-19,155.1,72,21.0,19189849171,MasatakaMurakami@einrot.com
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852,United States,1959-04-10,181.1,72,24.6,12074770579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341,United States,1948-03-26,239.6,70,34.4,19282844492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110,United States,1971-01-13,171.2,67,26.8,18162236007,JinkedeKeizer@teleworm.us


#### Test


In [426]:
patients_copy[patients_copy.surname == 'Taylor']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,33179,United States,1992-09-02,186.6,69,27.6,13054346299,RogelioJTaylor@teleworm.us


In [448]:
treatments_copy.treatment.value_counts()

novodra    173
auralin    170
Name: treatment, dtype: int64