# Data Integrity

Coming out of the last section, where we started looking at some of the data challenges you'll run across, hopefully you're starting to see that it's not just _read in the data and get to modeling_.  A great deal of time goes into getting the data, checking it, summarizing it, asking questions, and fixing issues, then rinsing and repeating.  Real-world data is messy, and there's no one-size-fits-all methodology.  Nearly every dataset is unique and has peculiarities specific to its domain and data collection method.

Remember talking about "garbage in, garbage out"?  If your data is riddled with holes and untrustworthy values, then your entire model and project won't be worth a darn.  So again, be sure to give this EDA step all the attention it requires.

Since you've already seen how to identify most of these issues, we're going to jump straight into addressing them.  In this section we'll cover the most common issues you'll run across.  There are others to be sure, but they'll be less common and more domain specific in most cases.

- Duplicate records
- Missing values/records
- Anomalous values/Outliers
- Censoring

Below we'll speed up getting our data back into our environment and ready to tackle the integrity issues.

In [20]:
# Import libraries
import pandas as pd
import matplotlib

# Read in the data from github repository
url = 'https://github.com/bradybr/practical-data-science-and-ml/blob/main/datasets/sleep_study.csv?raw=true'
dat = pd.read_csv(url, sep = ',')

# Create a list of the features you want to change and recast them using ".astype()"
# (named "cols" rather than "vars" to avoid shadowing Python's built-in vars())
cols = ['id']
dat[cols] = dat[cols].astype('object')
dat

Unnamed: 0,id,gender,age,country,study_begin,study_end,active_mins,sleep_disturb_mins,sleep_rem_mins
0,1,F,38.0,USA,4/3/2021,4/9/2023,96.8,227.5,25.9
1,2,M,72.0,Poland,4/3/2021,4/9/2023,245.6,644.2,86.7
2,3,F,95.0,Italy,4/3/2021,4/9/2023,279.4,465.6,31.8
3,4,M,37.0,USA,4/3/2021,4/9/2023,60.0,109.0,19.7
4,5,F,80.0,Spain,4/3/2021,4/9/2023,89.4,113.3,38.6
...,...,...,...,...,...,...,...,...,...
872,872,F,17.0,Italy,4/3/2021,4/9/2023,42.1,87.7,42.3
873,873,F,64.0,USA,4/3/2021,4/9/2023,128.5,268.8,23.4
874,874,F,27.0,Italy,4/3/2021,4/9/2023,43.2,84.1,18.0
875,875,F,69.0,Spain,4/3/2021,4/9/2023,246.8,237.6,44.1


<h3>Duplicate Records</h3>

Let's deal with the easiest ones first.  Remember we had a few duplicate observations?  Well, we spoke with our business expert, and she said there's no reason for those to exist and they are in fact real duplicate records.  We can go ahead and delete them.  

Easy enough.  Let's find them again and then we'll use the `.drop_duplicates()` method to drop them from our data.

In [4]:
# Check for duplicate observations
dat[dat.duplicated()]

Unnamed: 0,id,gender,age,country,study_begin,study_end,active_mins,sleep_disturb_mins,sleep_rem_mins
88,89,F,53.0,Italy,4/3/2021,4/9/2023,141.0,186.6,107.1
238,238,M,0.0,Poland,4/3/2021,4/9/2023,2.0,2.0,2.0
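
Note that `duplicated()` flags only the later copies of a repeated row by default.  If you want to eyeball every copy side by side before deleting anything, passing `keep=False` can help.  Here's a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the sleep study data):

```python
import pandas as pd

# Hypothetical toy data: ids 2 and 3 each appear twice as exact copies
toy = pd.DataFrame({
    'id': [1, 2, 2, 3, 3],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'age': [38.0, 72.0, 72.0, 95.0, 95.0],
})

# Default keep='first' flags only the later copies
later_copies = toy[toy.duplicated()]

# keep=False flags every row that has an exact twin, originals included
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(len(later_copies), len(all_copies), len(deduped))
```

Comparing the row counts before and after the drop is a cheap sanity check that you removed exactly what you expected.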


In [21]:
# Delete the duplicate records & check the dimensions of the dataset
dat.drop_duplicates(inplace = True)
dat.shape

(875, 9)

Success!  We only lost two of our observations and we're down to 875 records.  On to the missing values!  

<h3>Missing Values/Records</h3>

Ok, so we have some missing values to deal with now.  Let's deal with the easy ones first.  We again asked our business domain expert to help us understand the data collection and participant intake process so we could pin down how these might have occurred.  Here's what we found out.

- __Age:__  Participants self-reported their age on their intake applications when applying for the study.  Missing values were never followed up on once the study actually started.
- __Study End Date:__ All dates should be 4/9/2023.  Any date after 4/9/2023 is invalid and a data entry mistake.
- __Sleep Pattern Minutes:__  Missing values in sleep recordings were accidental technician omissions.

In [29]:
# Count NA's by feature
dat.isna().sum()

id                    0
gender                0
age                   0
country               0
study_begin           0
study_end             0
active_mins           4
sleep_disturb_mins    6
sleep_rem_mins        9
dtype: int64

In [24]:
# Delete the 1 observation with a missing age
dat.dropna(subset = ['age'], inplace = True)

In [28]:
# Set all of the end dates to '4/9/2023'
dat['study_end'] = '4/9/2023'
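
Before overwriting a whole column, it can be worth counting how many entries actually break the rule.  The sketch below parses the date strings and flags anything after the official end date.  It assumes the dates are month/day/year strings like the ones shown earlier, and the `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy data: one end date falls after the official study end
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'study_end': ['4/9/2023', '4/19/2023', '4/9/2023'],
})

# Parse the month/day/year strings into real dates so they compare correctly
end_dates = pd.to_datetime(toy['study_end'], format='%m/%d/%Y')
cutoff = pd.Timestamp('2023-04-09')

# Rows whose end date is later than the cutoff are the data entry mistakes
invalid = toy[end_dates > cutoff]
n_invalid = len(invalid)

# Overwrite only the invalid rows, leaving valid entries untouched
toy.loc[end_dates > cutoff, 'study_end'] = '4/9/2023'

print(n_invalid)
```

Because every valid date is the same here, this ends up equivalent to setting the whole column, but you also learn how widespread the problem was.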

Now, for the missing sleep pattern minutes... There are few enough that we could just delete them and it probably wouldn't materially change anything in our analysis.  On the other hand, because there are only a few of them, we could take a shot at imputation without creating a large number of artificial values.  For this example it's probably a toss-up as to whether it matters either way; however, you will definitely face more complicated and difficult decisions in the real world.

When you're considering deleting information, which should be a last resort, get in the habit of understanding what you're deleting.  For example, the observations you remove may be the only ones with a particular value in some other feature, in which case you'd lose visibility into that group entirely.  Or they may be involved in important interactions with other features under study.  The point is, try not to get in the habit of deleting data without analyzing it a bit and thinking it through.
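
One way to put this into practice is to compare group counts before and after the deletion you have in mind.  Here's a minimal sketch, using a hypothetical `toy` frame in which the only participant from Spain has a missing reading:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: the only participant from Spain has a missing reading
toy = pd.DataFrame({
    'country': ['USA', 'USA', 'Italy', 'Italy', 'Spain'],
    'sleep_rem_mins': [25.9, 19.7, 31.8, np.nan, np.nan],
})

# Rows that a blanket dropna() on this column would remove
to_drop = toy[toy['sleep_rem_mins'].isna()]

# Compare group counts before and after the deletion
before = toy['country'].value_counts()
after = toy.dropna(subset=['sleep_rem_mins'])['country'].value_counts()

# Groups that would disappear entirely if we deleted the rows
lost_groups = sorted(set(before.index) - set(after.index))

print(lost_groups)
```

If `lost_groups` isn't empty, deleting those rows would wipe out an entire category, which is usually a good reason to pause and consider imputation instead.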

Let's give imputation a try so you can see how it might work.

There are tons of different ways you could go about imputing missing values, from the very simple to the unnecessarily complex.  We'll try somewhere in the middle.
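
As one illustration of a middle-of-the-road approach, the sketch below fills each missing value with the median of that feature within the participant's gender group.  This is an assumption about what "somewhere in the middle" could look like, not necessarily the method used for the sleep study data, and the `toy` frame is hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with a few missing sleep readings
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'active_mins': [90.0, np.nan, 110.0, 200.0, 240.0, np.nan],
    'sleep_rem_mins': [20.0, 30.0, np.nan, 80.0, np.nan, 90.0],
})

sleep_cols = ['active_mins', 'sleep_rem_mins']

# Median of each column within each gender, aligned back to the original rows
group_medians = toy.groupby('gender')[sleep_cols].transform('median')

# Fill only the missing cells; observed values are left as they are
imputed = toy.copy()
imputed[sleep_cols] = toy[sleep_cols].fillna(group_medians)

print(imputed)
```

The median is used instead of the mean because it is less sensitive to extreme values, which matters when a feature may still contain outliers that haven't been cleaned yet.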

And that's it.  If we re-run our `isna().sum()` counts we should see the fruits of our labor with all of the missing values taken care of.

In [29]:
# Count NA's by feature
dat.isna().sum()

id                    0
gender                0
age                   0
country               0
study_begin           0
study_end             0
active_mins           4
sleep_disturb_mins    6
sleep_rem_mins        9
dtype: int64

<h3>Anomalous Values/Outliers</h3>

In [31]:
# Print numeric summary stats
dat.describe()

Unnamed: 0,age,active_mins,sleep_disturb_mins,sleep_rem_mins
count,874.0,870.0,868.0,865.0
mean,54.243707,215.427126,311.946313,121.906127
std,23.669669,820.03536,817.874558,826.77915
min,0.0,2.0,2.0,2.0
25%,35.0,81.325,146.75,26.5
50%,54.0,130.5,227.05,43.8
75%,74.0,199.625,322.875,69.3
max,146.0,9999.0,9999.0,9999.0


In [33]:
# Print categorical summary stats
dat[['id','gender','country','study_begin','study_end']].describe()

Unnamed: 0,id,gender,country
count,874,874,874
unique,874,2,4
top,1,F,Italy
freq,1,612,344


In [30]:
# Delete static and non-informative features
dat.drop(['study_begin', 'study_end'], axis = 1, inplace = True)