## The Process

Before any cleaning occurs, make a copy of each dataset. All of the cleaning operations are then conducted on the copy, so you can still view the original dirty and/or messy dataset later. In pandas, DataFrames are copied using the `copy` method. If the original DataFrame is called `df`, the soon-to-be-clean copy could be named `df_clean`.

**Note** that simply assigning a DataFrame to a new variable name does not create a copy: both names refer to the same object, so changes made through the new name also modify the original DataFrame.
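
A minimal sketch of the difference between assignment and `copy`, using a tiny made-up DataFrame (the values and the name `alias` are illustrative assumptions, not part of the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real patients table
df = pd.DataFrame({"zip_code": [7102.0, 90815.0]})

alias = df            # plain assignment: a second name for the same object
df_clean = df.copy()  # independent copy (deep by default)

# Changing the copy leaves the original untouched
df_clean.loc[0, "zip_code"] = 0.0
original_after_copy_edit = df.loc[0, "zip_code"]

# Changing through the alias changes the original as well
alias.loc[1, "zip_code"] = 0.0
original_after_alias_edit = df.loc[1, "zip_code"]
```

Here `original_after_copy_edit` keeps its old value, while `original_after_alias_edit` reflects the change made through `alias`.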

##### This is an example of how to pad text with leading zeros

Imagine a dataset called `patients_clean` in which we want to clean up the variable `zip_code`:

    patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
    
This code converts the float to a string, removes the trailing `.0` (the decimal point and the digit after it) with the `[:-2]` slice, then pads the remaining string on the left with 0s to ensure all strings are at least 5 characters long.
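
To make the three steps easier to follow, here is the same chain split into intermediate variables. It is a sketch on two made-up values; the names `as_text`, `no_decimal` and `padded` are only for illustration:

```python
import pandas as pd

# Hypothetical sample: one five-digit zip and one that lost its leading zero
zip_code = pd.Series([90815.0, 7102.0])

as_text = zip_code.astype(str)                # '90815.0', '7102.0'
no_decimal = as_text.str[:-2]                 # '90815', '7102'
padded = no_decimal.str.pad(5, fillchar="0")  # '90815', '07102'
```

`str.pad` pads on the left by default, which is what we want for zip codes. Missing values need separate handling, because slicing and padding are applied to whatever text a missing value is turned into.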

### Gather

In [1]:
import pandas as pd 

In [2]:
patients = pd.read_csv(r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\Assessing Data Datasets\patients.csv")
treatments = pd.read_csv(r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\Assessing Data Datasets\treatments.csv")
adverse_reactions = pd.read_csv(r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\Assessing Data Datasets\adverse_reactions.csv")

### Assess

In [3]:
patients.info()
treatments.info()
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 37.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype

In [4]:
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

In [5]:
list(patients)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'contact',
 'birthdate',
 'weight',
 'height',
 'bmi']

In [6]:
patients[patients['address'].isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [7]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [8]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [9]:
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
32,33,female,Một,Phạm,2690 Pin Oak Drive,Long Beach,CA,90815.0,United States,PhamThiBichMot@dayrep.com562-985-4582,1/30/1997,147.6,69,21.8
354,355,female,Vivian,House,4932 Goldleaf Lane,Newark,NJ,7102.0,United States,201-586-2848VivianRHouse@dayrep.com,8/13/1936,130.2,62,23.8
169,170,male,Markus,Solberg,4148 Callison Lane,Bensalem,DE,19020.0,United States,302-474-8075MarkusSolberg@fleckens.hu,11/21/1953,140.6,70,20.2
113,114,female,Svanhvít,Guðjónsdóttir,2641 Michael Street,Houston,TX,77074.0,United States,SvanhvitGujonsdottir@cuvox.de1 713 779 6516,7/23/1931,204.2,63,36.2
437,438,male,Alwin,Svensson,1846 Joseph Street,Union Grove,WI,53182.0,United States,AlwinSvensson@armyspy.com+1 (262) 878-9576,11/2/1924,137.7,63,24.4


In [10]:
patients.surname.value_counts()

Doe           6
Taylor        3
Jakobsen      3
Cabrera       2
Kadyrov       2
             ..
Schmitt       1
Piirainen     1
Czerwinska    1
Koldenhof     1
Grubišić      1
Name: surname, Length: 466, dtype: int64

In [11]:
patients.address.value_counts()

123 Main Street            6
2476 Fulton Street         2
2778 North Avenue          2
648 Old Dear Lane          2
2852 Irving Road           1
                          ..
1507 Woodlawn Drive        1
4220 Simpson Square        1
272 Boone Crockett Lane    1
3613 Lodgeville Road       1
4500 Myra Street           1
Name: address, Length: 483, dtype: int64

In [12]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [13]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [14]:
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check

210    19.055827
dtype: float64

In [15]:
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64
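
The check above relies on the imperial BMI formula, 703 × weight (lbs) / height (in)². Below is a small sketch of the same arithmetic as a function. The function name is made up, and the height of 63 in is an assumption chosen because it reproduces the value shown above:

```python
def bmi_from_imperial(weight_lbs: float, height_in: float) -> float:
    """BMI from pounds and inches: 703 * weight / height**2."""
    return 703 * weight_lbs / height_in ** 2

KG_TO_LBS = 2.20462  # same conversion factor as the cell above

# 48.8 is the suspicious weight from the table; 63 in is an assumed height
bmi_if_kg = bmi_from_imperial(48.8 * KG_TO_LBS, 63)   # treat 48.8 as kilograms
bmi_if_lbs = bmi_from_imperial(48.8, 63)              # treat 48.8 as pounds
```

Treating 48.8 as kilograms gives a BMI of about 19.06, which agrees with the recorded 19.1, whereas treating it as pounds gives a BMI below 9. This supports the conclusion that the weight was entered in kg.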

In [16]:
sum(treatments.auralin.isnull())

0

In [17]:
sum(treatments.novodra.isnull())

0

#### Quality
##### `patients` table
- Zip code is a float not a string
- Zip code has four digits sometimes
- Tim Neudorf height is 27 in instead of 72 in
- Full state names sometimes, abbreviations other times
- Misspelled given name: Dsvid Gustafsson (should be David)
- Missing demographic information (address - contact columns) ***(can't clean)***
- Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns)
- Multiple phone number formats
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- kgs instead of lbs for Zaitseva weight

##### `treatments` table
- Missing HbA1c changes
- The letter 'u' in starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (auralin and novodra columns)
- Inaccurate HbA1c changes (leading 4s mistaken as 9s)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `adverse_reactions` table
- Lowercase given names and surnames

#### Tidiness
- Contact column in `patients` table should be split into phone number and email
- Three variables in two columns in `treatments` table (treatment, start dose and end dose)
- Adverse reaction should be part of the `treatments` table
- Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables

## Clean

In [18]:
# Create copy datasets from original to perform clean actions on
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### Missing Data

<font color='red'>Complete the following two "Missing Data" **Define, Code, and Test** sequences after watching the "**Address Missing Data First**" video.</font>

##### `treatments`: Missing records (280 instead of 350)

##### Define
There are currently only 280 records in the `treatments` table rather than the correct 350.
The missing 70 records are stored separately in a file called `treatments_cut.csv`.
We need to read this file in and append it to the existing `treatments` dataset.

##### Code

In [19]:
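# write cleaning code here
# read the 70 missing records into their own variable (without overwriting `treatments`)
treatments_cut = pd.read_csv(r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\Assessing Data Datasets\treatments_cut.csv")
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
treatments_clean = pd.concat([treatments_clean, treatments_cut], ignore_index=True)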

##### Test

In [20]:
# write testing code here
print(len(treatments_clean))    # should show 350 records in output

350


##### `treatments`: Missing HbA1c changes and Inaccurate HbA1c changes (leading 4s mistaken as 9s)
*Note: the "Inaccurate HbA1c changes (leading 4s mistaken as 9s)" observation, which is an accuracy issue and not a completeness issue, is included in this header because it is also fixed by the cleaning operation that fixes the "Missing HbA1c changes" observation. Several **Define, Code, and Test** headers in this notebook cover multiple observations in this way.*

##### Define
In the treatments dataset, there are cases where the HbA1c changes are null (because they have not been calculated) or, where they are populated, but have been populated incorrectly, likely due to a typing/reading error.

If we recreate this column by calculating the change as the starting value minus the ending value, we can fix both issues.

##### Code

In [21]:
treatments_clean.hba1c_change = (treatments_clean.hba1c_start - treatments_clean.hba1c_end)

##### Test

In [22]:
treatments_clean.hba1c_change.head()

0    0.43
1    0.47
2    0.43
3    0.35
4    0.32
Name: hba1c_change, dtype: float64

### Tidiness

<font color='red'>Complete the following four "Tidiness" **Define, Code, and Test** sequences after watching the *"Cleaning for Tidiness"* video.</font>

#### Contact column in `patients` table contains two variables: phone number and email

##### Define
*Your definition here. Hint 1: use regular expressions with pandas' [`str.extract` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html). Here is an amazing [regex tutorial](https://regexone.com/). Hint 2: [various phone number regex patterns](https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number). Hint 3: [email address regex pattern](http://emailregex.com/), which you might need to modify to distinguish the email from the phone number.*

Extract the *phone number* and *email* variables from the *contact* column using regular expressions and pandas' `str.extract` method. Drop the *contact* column when done.

##### Code

In [23]:
# create a new phone number variable by extracting the pattern of an American-style phone number
# (raw strings keep Python from interpreting the regex backslashes as escape sequences)
patients_clean['phone_number'] = patients_clean.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)

# [a-zA-Z] at both ends because emails in this dataset all start and end with letters
# from the same contact column, extract the substring that looks like an email address
patients_clean['email'] = patients_clean.contact.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])', expand=True)

# now, drop old contact column as we have 2 new clean ones
# Note: axis=1 denotes that we are referring to a column, not a row
patients_clean = patients_clean.drop('contact', axis=1)

##### Test

In [24]:
# Confirm contact column is gone
list(patients_clean)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'birthdate',
 'weight',
 'height',
 'bmi',
 'phone_number',
 'email']

In [25]:
# show a sample of the new phone numbers column
patients_clean.phone_number.sample(10)

29     +1 (845) 858-7707
448         216-502-3773
481         918 706 2776
282         304-438-2648
351         423-799-1730
178         773-934-7423
444         415-277-2563
396         863-386-3795
37     +1 (605) 204-6572
60          619-570-3898
Name: phone_number, dtype: object

In [26]:
# show a sample of the new email address column 
patients_clean.email.sample(10)

462        GuniHeimisson@superrito.com
80     EufemioRosarioAlarcon@gustr.com
186               JaneCitizen@cuvox.de
380       SiljeAKristiansen@dayrep.com
86           PhilemonAbdulov@rhyta.com
150         FelicijanBubanj@einrot.com
427           TeclaOnio@jourrapide.com
136     VictoriaTMikkelsen@armyspy.com
260        EmilyNHenriksen@armyspy.com
229                  johndoe@email.com
Name: email, dtype: object
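
To see how the two patterns separate a phone number from an email that are glued together, here is a small sketch run on three contact strings taken from the `patients.sample(5)` output earlier in this notebook. The variable names are illustrative; `expand=False` is used only so that each result is a Series:

```python
import pandas as pd

# Three contact strings copied from the sample output earlier in this notebook
contact = pd.Series([
    "PhamThiBichMot@dayrep.com562-985-4582",
    "201-586-2848VivianRHouse@dayrep.com",
    "AlwinSvensson@armyspy.com+1 (262) 878-9576",
])

phone_pattern = r"((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})"
email_pattern = r"([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])"

# With a single capture group and expand=False, str.extract returns a Series
phone = contact.str.extract(phone_pattern, expand=False)
email = contact.str.extract(email_pattern, expand=False)
```

The trailing `[a-zA-Z]` in the email pattern is what stops the match at `.com` in the first string: the domain part could otherwise swallow the digits of the phone number that follows it.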

#### Three variables in two columns in `treatments` table (treatment, start dose and end dose)

##### Define

*Your definition here. Hint: use pandas' [melt function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) and [`str.split()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). Here is an excellent [`melt` tutorial](https://deparkes.co.uk/2016/10/28/reshape-pandas-data-with-melt/).*

This is basically the pandas equivalent to a proc transpose in SAS.

Melt the *auralin* and *novodra* columns to a *treatment* and a *dose* column (dose will still contain both start and end dose at this point). Then split the dose column on ' - ' to obtain *dose_start* and *dose_end* columns. Drop the intermediate *dose* column.

##### Code

In [27]:
# start by looking at our current data 
treatments_clean.head(3)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43


In [28]:
# First - list the columns we don't want to change in id_vars; these are used as 'keys'
# Second - the remaining column names are going to be transposed as values into one column,
#          and we label that one column using var_name
# Third - the values that existed inside those columns before they were transposed also need to be moved to a new column.
#         We label this column using the value_name parameter
treatments_clean = pd.melt(treatments_clean, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
                           var_name='treatment', value_name='dose')
treatments_clean.head(3)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,auralin,-
2,yukitaka,takenaka,7.68,7.25,0.43,auralin,-


In [29]:
# now we filter - making sure we only retain cases where dose is actually populated
treatments_clean = treatments_clean[treatments_clean.dose != "-"]
treatments_clean.head(3)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u - 42u


In [30]:
# now we want to split out the dose variable into 'start' and 'end'
# we can do this by splitting the current string on either side of the ' - '
# (expand=True returns the two parts as separate columns)
treatments_clean[['dose_start', 'dose_end']] = treatments_clean['dose'].str.split(' - ', n=1, expand=True)
treatments_clean.head(3)


Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose,dose_start,dose_end
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u,41u,48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u,33u,36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u - 42u,37u,42u


In [31]:
# finally, drop the old dose column (axis='columns' means we are dropping a column, not a row)
treatments_clean = treatments_clean.drop('dose', axis='columns')
treatments_clean.head(3)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u,48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u


##### Test

In [32]:
# take a look at a sample of records
treatments_clean.sample(5) 

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end
316,chibuzo,okoli,7.64,7.29,0.35,auralin,28u,33u
656,firenze,fodor,7.89,7.55,0.34,novodra,30u,35u
315,alvin,jackson,7.62,7.23,0.39,auralin,38u,43u
266,ursula,freud,7.75,7.46,0.29,auralin,42u,54u
34,alexander,mathiesen,7.96,7.55,0.41,auralin,47u,58u
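
The whole melt, filter, split and drop sequence can be tried end to end on a two-row frame. This is a sketch on made-up rows shaped like the `treatments` table (only one id column is kept to stay short); `wide` and `tidy` are illustrative names:

```python
import pandas as pd

# Two rows in the same wide layout as the treatments table
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Wide to long: one row per (patient, treatment) pair
tidy = pd.melt(wide, id_vars=["given_name"],
               var_name="treatment", value_name="dose")

# Keep only the treatment each patient actually received
tidy = tidy[tidy.dose != "-"].copy()

# Split '41u - 48u' into two columns, then drop the combined column
tidy[["dose_start", "dose_end"]] = tidy["dose"].str.split(" - ", n=1, expand=True)
tidy = tidy.drop(columns="dose").reset_index(drop=True)
```

After melting there are four rows (two patients × two treatments); filtering out the `-` placeholders brings this back to one row per patient.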


#### Adverse reaction should be part of the `treatments` table

##### Define
Merge the *adverse_reaction* column to the `treatments` table, joining on *given name* and *surname*.

##### Code

In [33]:
# use the pandas merge function to perform a left join, with given name & surname as join keys.
# we will join on the adverse_reactions dataset - let's look at some records:
adverse_reactions_clean.head(3)

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia


In [34]:
# only the columns not already in treatments_clean (here, adverse_reaction) will be added by the merge
treatments_clean = pd.merge(treatments_clean, adverse_reactions_clean,
                            on=['given_name', 'surname'], how='left')
treatments_clean.head(3)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u,48u,
1,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u,
2,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u,


##### Test

In [35]:
# check a sample of records to test everything worked ok
treatments_clean.sample(5)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
121,tomáš,navrátil,7.84,7.41,0.43,auralin,24u,36u,
91,kerman,dandonneau,7.82,7.28,0.54,auralin,41u,50u,
97,minea,lindgren,9.45,8.94,0.51,auralin,38u,45u,
249,ivona,jakšić,7.98,7.54,0.44,novodra,41u,41u,
300,kang,mai,7.78,7.45,0.33,novodra,39u,36u,injection site discomfort
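
A left join keeps every row of the left table and fills in missing values where the right table has no match, which is why most rows above show an empty *adverse_reaction*. A minimal sketch on made-up miniature tables (the `_small` names are illustrative):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
treatments_small = pd.DataFrame({
    "given_name": ["veronika", "kang"],
    "surname": ["jindrová", "mai"],
    "treatment": ["auralin", "novodra"],
})
reactions_small = pd.DataFrame({
    "given_name": ["kang"],
    "surname": ["mai"],
    "adverse_reaction": ["injection site discomfort"],
})

# how='left' keeps both treatment rows, matched or not
merged = pd.merge(treatments_small, reactions_small,
                  on=["given_name", "surname"], how="left")
n_without_reaction = merged["adverse_reaction"].isna().sum()
```

Comparing `len(merged)` with the length of the left table is a quick way to confirm that the join neither dropped nor duplicated any rows.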


#### Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables and Lowercase given names and surnames

##### Define
The `adverse_reactions` table is no longer needed, so ignore that part. Isolate the patient ID and names in the `patients` table, then convert these names to lower case to join with `treatments`. Then drop the given name and surname columns in the `treatments` table (so that their being lowercase isn't an issue anymore).

##### Code

In [36]:
# collect the patients IDs and their names
id_names = patients_clean[['patient_id', 'given_name', 'surname']]
id_names.head(10)

Unnamed: 0,patient_id,given_name,surname
0,1,Zoe,Wellish
1,2,Pamela,Hill
2,3,Jae,Debord
3,4,Liêm,Phan
4,5,Tim,Neudorf
5,6,Rafael,Costa
6,7,Mary,Adams
7,8,Xiuxiu,Chang
8,9,Dsvid,Gustafsson
9,10,Sophie,Cabrera


In [37]:
# set both name variables to lower case
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
id_names = id_names.assign(given_name=id_names.given_name.str.lower(),
                           surname=id_names.surname.str.lower())
id_names.head(10)

Unnamed: 0,patient_id,given_name,surname
0,1,zoe,wellish
1,2,pamela,hill
2,3,jae,debord
3,4,liêm,phan
4,5,tim,neudorf
5,6,rafael,costa
6,7,mary,adams
7,8,xiuxiu,chang
8,9,dsvid,gustafsson
9,10,sophie,cabrera


In [38]:
# now join the lower case names to the treatments_clean data set - so we map on Patient ID
treatments_clean = pd.merge(treatments_clean, id_names, on=['given_name', 'surname'])
treatments_clean.head(5)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u,48u,,225
1,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u,,242
2,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u,,345
3,eddie,archer,7.89,7.55,0.34,auralin,31u,38u,,276
4,asia,woźniak,7.76,7.37,0.39,auralin,30u,36u,,15


In [39]:
# now drop given_name and surname, so lowercase is not an issue anymore, and we just have the Patient ID
treatments_clean = treatments_clean.drop(['given_name', 'surname'], axis=1)
treatments_clean.head(5)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48u,,225
1,7.97,7.62,0.35,auralin,33u,36u,,242
2,7.65,7.27,0.38,auralin,37u,42u,,345
3,7.89,7.55,0.34,auralin,31u,38u,,276
4,7.76,7.37,0.39,auralin,30u,36u,,15


##### Test

In [40]:
# Confirm the merge was executed correctly
treatments_clean.sample(10)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
345,7.67,7.3,0.37,novodra,26u,23u,,420
48,7.54,7.17,0.37,auralin,53u,64u,,315
100,7.99,7.72,0.27,auralin,43u,51u,,66
47,8.34,7.9,0.44,auralin,25u,31u,hypoglycemia,460
72,7.85,7.52,0.33,auralin,39u,44u,,287
110,7.69,7.31,0.38,auralin,25u,35u,,257
52,7.64,7.23,0.41,auralin,32u,41u,hypoglycemia,8
296,7.97,7.59,0.38,novodra,51u,54u,,289
150,7.66,7.3,0.36,auralin,39u,46u,,126
162,7.58,7.29,0.29,auralin,43u,56u,,395


In [41]:
# finally - check that patient_id is the only duplicated column between patients_clean & treatments_clean
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]

22    patient_id
dtype: object
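
One thing to keep in mind is that `pd.merge` performs an inner join by default, so any treatment row whose name has no exact match in the patient names is silently dropped. A sketch of how this could be checked with the `indicator` parameter follows. The two small frames are made up for illustration, and whether a name mismatch actually affects the real data is not confirmed here:

```python
import pandas as pd

# Hypothetical frames: the second name is spelled differently in each table
treatments_small = pd.DataFrame({
    "given_name": ["zoe", "david"],
    "surname": ["wellish", "gustafsson"],
})
id_names_small = pd.DataFrame({
    "patient_id": [1, 9],
    "given_name": ["zoe", "dsvid"],
    "surname": ["wellish", "gustafsson"],
})

# Default (inner) join: the unmatched row disappears without any warning
inner = pd.merge(treatments_small, id_names_small, on=["given_name", "surname"])

# Left join with indicator=True adds a _merge column that flags unmatched rows
checked = pd.merge(treatments_small, id_names_small,
                   on=["given_name", "surname"], how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
```

If `unmatched` is not empty, the names it lists are worth fixing in the source table before the join is repeated.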

### Quality

Refer back to our notes under quality above. Each of our tables has some data quality issues that we should look to rectify.

`Task 1` - Zip code is a float, not a string. Zip code is also fewer than 5 characters long in some cases.

##### Define

Convert the zip code variable to a string using the `astype` method, remove the decimal (.0) with slicing, and pad what is left with leading 0s to ensure all records are 5 characters long.

##### Code

In [42]:
import numpy as np

patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')

# Reconvert NaN entries that were converted to '0000n' by the code above
patients_clean.zip_code = patients_clean.zip_code.replace('0000n', np.nan)

##### Test

In [43]:
# Take a sample & check that all zip codes are dtype: object, and all have 5 chars (with maybe leading 0's)
patients_clean.zip_code.sample(5)

200    08110
115    95453
463    65066
386    95134
469    28412
Name: zip_code, dtype: object

`Task 2` - Height observation recorded wrong way round for one patient

##### Define 
Replace the height observation of 27 in the `patients` table with the correct value, 72.

##### Code

In [44]:
# use .replace() method to change any '27' record in the height variable, to the value 72
patients_clean.height = patients_clean.height.replace(27, 72)

##### Test

In [45]:
# should have no return / be empty 
print(patients_clean[patients_clean.height == 27])

Empty DataFrame
Columns: [patient_id, assigned_sex, given_name, surname, address, city, state, zip_code, country, birthdate, weight, height, bmi, phone_number, email]
Index: []


`Task 3` - Some full state names, some are abbreviated. Clean this up.

##### Define 
Apply a function that converts full state name to state abbreviation.

##### Code

In [46]:
# First - let's identify all the state inputs to see what we have 
patients_clean['state'].value_counts()

California    36
TX            32
New York      25
CA            24
NY            22
MA            22
PA            18
GA            15
OH            14
Illinois      14
MI            13
Florida       13
OK            13
LA            13
NJ            12
VA            11
IL            10
WI            10
MS            10
AL             9
IN             9
FL             9
MN             9
TN             9
NC             8
WA             8
KY             8
MO             7
KS             6
ID             6
NV             6
CT             5
IA             5
SC             5
ND             4
Nebraska       4
CO             4
AZ             4
RI             4
ME             4
AR             4
DE             3
MD             3
OR             3
SD             3
WV             3
NE             2
MT             2
VT             2
DC             2
NM             1
NH             1
WY             1
AK             1
Name: state, dtype: int64

From the output above, we can see that five states appear under their full names and need to be modified:

Nebraska (NE), Florida (FL), New York (NY), California (CA) & Illinois (IL)

We need to ensure that all fully written state names are converted to their abbreviations.

In [47]:
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}

# Create a function to apply changes to full state names listed above 
def abbreviate_state(patient):
    if patient['state'] in state_abbrev.keys():
        abbrev = state_abbrev[patient['state']]
        return abbrev
    else:
        return patient['state']


# apply the abbreviate_state function to the 'state' variable within the patients_clean dataset 
patients_clean['state'] = patients_clean.apply(abbreviate_state, axis=1)

##### Test

In [48]:
# run our value counts code again - should only see abbreviations 
patients_clean['state'].value_counts()

CA    60
NY    47
TX    32
IL    24
FL    22
MA    22
PA    18
GA    15
OH    14
MI    13
LA    13
OK    13
NJ    12
VA    11
WI    10
MS    10
TN     9
IN     9
MN     9
AL     9
NC     8
KY     8
WA     8
MO     7
KS     6
ID     6
NE     6
NV     6
SC     5
CT     5
IA     5
ND     4
CO     4
AZ     4
RI     4
AR     4
ME     4
OR     3
WV     3
DE     3
SD     3
MD     3
MT     2
DC     2
VT     2
WY     1
AK     1
NH     1
NM     1
Name: state, dtype: int64
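
As an alternative to the row-wise `apply`, the same mapping can be passed straight to `Series.replace`, which leaves any value that is not a key in the dictionary (including missing values) unchanged. A sketch on a small made-up Series:

```python
import numpy as np
import pandas as pd

# Same mapping as in the cell above
state_abbrev = {"California": "CA",
                "New York": "NY",
                "Illinois": "IL",
                "Florida": "FL",
                "Nebraska": "NE"}

# Hypothetical sample with a full name, an abbreviation and a missing value
state = pd.Series(["California", "NY", np.nan, "Nebraska"])

# Values found in the dict are swapped; everything else passes through
abbreviated = state.replace(state_abbrev)
```

This avoids writing a helper function and works on the column directly instead of on each row of the DataFrame.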

In [49]:
# We need to fix the patient names where David is mis-spelled as 'Dsvid' 
patients_clean.given_name = patients_clean.given_name.replace('Dsvid', 'David')

# Test that patient 
patients_clean[patients_clean.surname == 'Gustafsson']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105,United States,3/6/1937,163.9,66,26.5,816-265-9578,DavidGustafsson@armyspy.com


In [50]:
# wrong datatypes given to Sex, State, Zip Code & Birthdate variables.
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')
patients_clean.state = patients_clean.state.astype('category')
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)

# check it changed 
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    category      
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       491 non-null    object        
 5   city          491 non-null    object        
 6   state         491 non-null    category      
 7   zip_code      491 non-null    object        
 8   country       491 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone_number  491 non-null    object        
 14  email         491 non-null    object        
dtypes: category(2), datetime64[ns](1), float
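
The two kinds of conversion used above can be tried on a small frame. This is a sketch with made-up rows; passing an explicit `format` to `pd.to_datetime` is an optional addition that is not used in the cell above, shown here because the birthdates follow a month/day/year layout:

```python
import pandas as pd

# Hypothetical sample using the same text formats as the patients table
sample = pd.DataFrame({
    "assigned_sex": ["female", "male", "female"],
    "birthdate": ["8/14/1950", "4/9/1978", "9/23/1976"],
})

# A small fixed set of labels is a natural fit for the category dtype
sample["assigned_sex"] = sample["assigned_sex"].astype("category")

# Parse month/day/year text into real timestamps
sample["birthdate"] = pd.to_datetime(sample["birthdate"], format="%m/%d/%Y")

categories = sorted(sample["assigned_sex"].cat.categories)
first_birthdate = sample["birthdate"].iloc[0]
```

Once the column holds timestamps, date arithmetic and accessors such as `.dt.year` become available.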

In [51]:
# In the treatments table, we need to clean the start & end dose variables. Remove the letter "u" from the string, convert to integer
treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u').astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.str.strip('u').astype(int)

# Check it changed
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   hba1c_start       349 non-null    float64
 1   hba1c_end         349 non-null    float64
 2   hba1c_change      349 non-null    float64
 3   treatment         349 non-null    object 
 4   dose_start        349 non-null    int32  
 5   dose_end          349 non-null    int32  
 6   adverse_reaction  35 non-null     object 
 7   patient_id        349 non-null    int64  
dtypes: float64(3), int32(2), int64(1), object(2)
memory usage: 19.1+ KB
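
Stripping the unit and converting to an integer is what makes arithmetic on the doses possible. A tiny sketch on made-up values (`dose_units` and `change` are illustrative names):

```python
import pandas as pd

# Hypothetical doses in the same 'NNu' text format
dose = pd.Series(["41u", "48u", "33u"])

# Remove the 'u' from the ends of each string, then convert to integers
dose_units = dose.str.strip("u").astype(int)

# Numeric values can now be subtracted, averaged, and so on
change = dose_units.iloc[1] - dose_units.iloc[0]
```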


In [53]:
# Fix the phone number variable to remove all formatting and just contain digits. Ensure each string is then at least 11
# characters long by padding with a leading 1, the country calling code for the US
# (regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default)
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')

# check output with a sample to ensure correct
patients_clean.phone_number.sample(5)

261    16172970387
470    13862345932
310    19133229114
486    12546814504
281    12054946040
Name: phone_number, dtype: object
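
Here is the same normalization applied step by step to three phone numbers taken from the sample output earlier in this notebook. It is a sketch; `digits_only` and `normalized` are illustrative names:

```python
import pandas as pd

# Three formats seen in the phone_number sample earlier in this notebook
phone = pd.Series(["+1 (845) 858-7707", "216-502-3773", "918 706 2776"])

# \D+ matches any run of non-digit characters; replace each run with nothing
digits_only = phone.str.replace(r"\D+", "", regex=True)

# Ten-digit numbers get a leading '1'; eleven-digit numbers are left alone
normalized = digits_only.str.pad(11, fillchar="1")
```

Because `str.pad` only adds characters up to the requested width, a number that already includes the leading 1 is not changed by the second step.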

In [58]:
# we need to remove the 'John Doe' records from the patients table. These are dummy entries
patients_clean = patients_clean[patients_clean.surname != 'Doe']

# test that Doe is no longer in the surname list - should be an empty df returned 
print(patients_clean[patients_clean.surname == 'Doe'])

Empty DataFrame
Columns: [patient_id, assigned_sex, given_name, surname, address, city, state, zip_code, country, birthdate, weight, height, bmi, phone_number, email]
Index: []


##### Define
Remove the Jake Jakobsen, Pat Gersten, and Sandy Taylor rows from the `patients` table. These rows use nicknames, which also do not appear in the `treatments` table (removing the wrong name would create a consistency issue between the `patients` and `treatments` tables). Each is the second occurrence of its duplicate, and they are the only occurrences of non-null duplicate addresses.

##### Code

In [59]:
# tilde means not: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]

##### Test

In [63]:
# Ensure that the nickname Jake is no longer present, only the full name Jakob etc.
patients_clean[patients_clean.surname.isin(['Jakobsen','Gersten','Taylor'])]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,33179,United States,1992-09-02,186.6,69,27.6,13054346299,RogelioJTaylor@teleworm.us
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020,United States,1962-11-25,185.2,67,29.0,19792030438,KarenJakobsen@jourrapide.com
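
The filter above combines two boolean masks, and the `notnull` part matters because `duplicated` also flags repeated missing values. A sketch on a made-up four-row frame (the names `is_repeat`, `has_address` and `deduplicated` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one repeated address and two missing addresses
people = pd.DataFrame({
    "given_name": ["Jakob", "Lalita", "Jake", "Mỹ"],
    "address": ["648 Old Dear Lane", np.nan, "648 Old Dear Lane", np.nan],
})

is_repeat = people.address.duplicated()   # True from the second occurrence on
has_address = people.address.notnull()    # False for missing addresses

# Drop only rows that repeat a real address; keep rows with no address at all
deduplicated = people[~(is_repeat & has_address)]
```

Without the `has_address` condition, the second patient with a missing address would be removed too, even though nothing shows that the two are the same person.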


In [65]:
# fix the weight issue in the observation surname=Zaitseva
# The entry needs to be converted from KGs to pounds

# find the min. weight
weight_kg = patients_clean.weight.min()
print(weight_kg)

48.8


In [66]:
# build a boolean mask that selects the patient in question and store it in a holding variable
mask = patients_clean.surname == 'Zaitseva'

In [67]:
# .loc[] allows us to access a group of rows & columns by label(s) or a boolean array
# select the weight value for that specific patient and overwrite it with the converted value
patients_clean.loc[mask, 'weight'] = weight_kg * 2.20462

In [68]:
# Test
# 48.8 shouldn't be the lowest weight now
patients_clean.weight.sort_values()

459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 494, dtype: float64
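
A slightly more defensive variant of the same fix reads the value to convert from the masked row itself rather than from `weight.min()`, so the result does not depend on that patient having the smallest weight. A sketch on a made-up two-row frame (the second row is invented for illustration):

```python
import pandas as pd

KG_TO_LBS = 2.20462  # same conversion factor used earlier in the notebook

# Hypothetical frame: one weight recorded in kg, one correctly in lbs
patients_small = pd.DataFrame({
    "surname": ["Zaitseva", "Hill"],
    "weight": [48.8, 118.8],
})

# Boolean mask selecting only the row that needs converting
mask = patients_small.surname == "Zaitseva"

# Convert in place, using the value already stored in that row
patients_small.loc[mask, "weight"] = patients_small.loc[mask, "weight"] * KG_TO_LBS
```

Only the masked row changes; every other weight is left exactly as it was.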