# Data Characteristics

I wish I could tell you that your data will always be clean, complete, and ready for analysis.  I really do.  The truth, however, is far from it.  Real-world data is messy.  Even working for a $100 Bn company, I'm continually shocked at how difficult it is to create, standardize, archive, and retrieve good data.  Very rarely does data come ready to go for proof-of-concept analytics.  It's only after we've made the effort to identify, extract, explore, and massage the data, and proven there's some value in the solution we're building, that we move on to building a standard data asset.  That new data asset then delivers the data in exactly the form we need for our production program from then on.  But we have to get there first through exploratory analysis.

There's no end to the unique challenges you'll run into once you start working with data, and there's no standardization from dataset to dataset.  What we can do, though, is lay out a standard workflow that is always a good place to start, regardless of how the datasets differ.

We'll get some overlap between this section and the integrity and plotting sections that follow, so don't worry if you see us plotting something, calling out an issue, or fixing something and you don't fully understand it yet.  Where we don't fully explain a topic here, it's because it makes more sense to cover it in a later section.  We have to start somewhere though, so let's get to it.

We'll go through the following in this lesson.

- Data shape & features
- Descriptive summaries
- Missing values/records
- Duplicates
- Data Quality Report

<h3>Data Shape & Features</h3>

Almost without exception, the first thing you should do is simply take a look at your data.  Read it in and then print it to the screen, or view it through an object explorer.  Just see what you have before doing anything else.

Does it look clean?  Is it human interpretable?  Do you understand what all of the columns mean?  Do you need to ask the business domain sponsor for any explanations or data dictionaries for special keys or encodings?  What about the unit of analysis?  Can you tell what it is?

Let's read in a dataset and see all of these in action.

In [None]:
# Import libraries
import pandas as pd
import matplotlib

The data we're going to play with is a large sleep study tracking the average minutes per night for various sleep patterns, over a one week period of time, for 877 participants.  The recorded sleep states are as follows.

- active_mins = not asleep
- sleep_disturb_mins =  sleeping, but not in a REM cycle
- sleep_rem_mins = sleeping, REM cycle

In [None]:
# Read in the data from github repository
url = 'https://github.com/bradybr/practical-data-science-and-ml/blob/main/datasets/sleep_study.csv?raw=true'
dat = pd.read_csv(url, sep = ',')
dat

Can you infer the _unit of analysis_?  My vote would be-

> Sleep pattern minutes, by participant (id)

Looks like there's one person per row as our level of analysis.  This would be cross-sectional data if we think back to our review of data types.  Other than that, it looks pretty clean and easy to understand.  Each person should have a unique "id" that identifies them, then some demographic details, and finally their average minutes for each sleep state during the study.  Easy enough.

The dataframe printout shows us the number of rows (observations) and columns (features/variables), but we can also print these out using the `.shape` attribute.  Note that it's an attribute, not a method, so it takes no parentheses.

In [None]:
dat.shape

Cool.  They match up!
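If you'd like to try this without downloading anything, here's a minimal sketch on a tiny made-up dataframe.  The `toy` frame and its values are assumptions for illustration only, not the real sleep study data.  It shows that `.shape` returns a `(rows, columns)` tuple you can unpack, alongside a quick peek with `.head()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the sleep study; values are made up
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["F", "M", "F", "M"],
    "age": [34.0, 51.0, 27.0, 45.0],
    "sleep_rem_mins": [92.5, 80.0, 101.2, 88.4],
})

# .shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} features")

# A quick peek at the first two rows and the column names
print(toy.head(2))
print(list(toy.columns))
```

Unpacking the tuple is handy later on when you want to report what share of rows an issue affects.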

Generally, after looking at my data, the next thing I want to consider is the feature data types.  Meaning, do my variables have the correct data types?  This matters because we will soon start to perform operations on our data, and those operations depend on the types being correct.  You can't perform mathematical operations on words, and you don't want to treat numeric variables as objects, so we'll need to cast them into the correct type if they were read in incorrectly.

So where do we start?  You guessed it.  There's a function for this too - `.info()`

In [None]:
dat.info()

Above we can see that the "id" feature is an integer.  Does that feel correct?  It shouldn't, because numbers carry implied characteristics.  Is id = 4, two times greater than id = 2?  Hopefully you said no.  These are not really numeric values.  They're just numbers being used as categorical identifiers for the participants in the study, so we want to recast "id" as an "object" (categorical string).  See below.

In [None]:
# Create a list of the features you want to change and recast them using ".astype()"
id_cols = ['id']  # named "id_cols" rather than "vars", which would shadow a Python built-in
dat[id_cols] = dat[id_cols].astype('object')
dat.info()

Great!  Now they all look correct.  The categorical variables (e.g. id, gender, country) are all objects, and the numeric variables (e.g. age, minutes) are all floats.  Perfect!
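One way to double check the result without reading through the `.info()` output by eye is to split the column names by type.  The sketch below uses a small made-up `toy` frame (an assumption, not the study data) and passes a dictionary to `.astype()`, which is just an alternative to the list approach above.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "age": [34.0, 51.0, 27.0],
})

# Recast the identifier so it is no longer treated as a number
toy = toy.astype({"id": "object"})

# Split the column names by data type to confirm the result
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Keeping these two lists around is convenient, since the numeric and categorical features usually get summarized separately.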

<h3>Descriptive Summaries</h3>

Next up, let's take a look at some summaries of our data.  We're generally making a first pass here, trying to get a sense for what we're working with and whether there might be some issues to address.  Check each item in the list below and see if it passes the sniff test or seems funny to you.

- Numeric ranges (min, max, central tendencies, distribution shapes - skew/kurtosis)
- Nonsensical values
- Cardinality (number of unique values)
- High/low uniqueness

The easiest place to start is with the `.describe()` function.

In [None]:
# Print numeric summary stats
dat.describe()

Notice how we only see the numeric features?  This is because you can't get an average or minimum value for a categorical string.  This is why it's so important to get your data types right.  If you don't, you'll end up with summaries that don't make any sense.  While we will usually go the next step and plot these distributions as well, this nice little table summary is a really succinct way to spot anomalies.

Do you notice anything odd in the table above?

How about the minimum and maximum values for Age?  It's not possible to have anyone in the study with an age of 0, and it's just as unlikely to have someone at the age of 146.  So right away we know we have some data integrity issues going on that we'll need to solve for in the next section.

A couple of other things to take note of here as well:

- Counts are all less than the total number of observations in the data (877), indicating we have missing records
- Minimum values of 2 across the sleep pattern minutes look suspicious
- Likely placeholder '9999' values in the minute features
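
To put numbers on observations like these, a rough sketch is to count how many rows fall outside a plausible range or match a suspected placeholder.  The bounds `AGE_MIN` and `AGE_MAX`, the `SENTINEL` value, and the `toy` data are all assumptions for illustration.  The real cutoffs should come from your business domain expert.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up to mimic the issues above
toy = pd.DataFrame({
    "age": [0.0, 34.0, 146.0, 51.0, np.nan],
    "sleep_rem_mins": [92.5, 9999.0, 2.0, 88.4, 101.2],
})

# Assumed plausibility bounds and placeholder value, for this sketch only
AGE_MIN, AGE_MAX = 18, 100
SENTINEL = 9999

# between() is False for NaN, so keep missing ages out of the "implausible" flag
age_known = toy["age"].notna()
bad_age = age_known & ~toy["age"].between(AGE_MIN, AGE_MAX)
is_sentinel = toy["sleep_rem_mins"].eq(SENTINEL)

print("implausible ages:", int(bad_age.sum()))
print("sentinel minutes:", int(is_sentinel.sum()))
print(toy[bad_age | is_sentinel])
```

This only identifies the suspect rows.  Deciding what to do with them is a separate step.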

And the same thing below with the categorical features.  We have some missing values to deal with, but otherwise, the unique counts and top mode values make sense.

In [None]:
# Print categorical summary stats
dat[['id','gender','country','study_begin','study_end']].describe()

Did you notice that the study begin date only has one unique static value of 4/3/2021?  This is both confirmation that everyone started on the same date and we have no issues, and also telling us that the variable will serve no further purpose for us since there is no variability in the data.  Variables that do not change at all carry no information content so they can usually be removed from our analysis.
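
Rather than spotting constant columns by eye, you can count the unique values per column with `.nunique()`.  Here's a minimal sketch on a made-up `toy` frame (the data is hypothetical) that lists any column with a single distinct value.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["US", "CA", "US", "MX"],
    "study_begin": ["4/3/2021"] * 4,
})

# Cardinality: number of distinct non-missing values per column
cardinality = toy.nunique()
print(cardinality)

# Columns with at most one distinct value carry no variability
constant_cols = cardinality[cardinality <= 1].index.tolist()
print("constant columns:", constant_cols)

# drop() returns a new frame; the original is left untouched
trimmed = toy.drop(columns=constant_cols)
print(list(trimmed.columns))
```

The same `cardinality` series also covers the high end of the scale: a column whose unique count equals the number of rows, like an id, is a pure identifier.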

Did you also notice there are three unique values for the "study_end" date?  There should only be one since the study ended after one week for all participants.  Let's take a closer look by printing a frequency table using the `.value_counts()` function.

In [None]:
# Create frequency table for "study_end" date
dat.study_end.value_counts()

Interesting.  We have two that closed out a few days after the study ended on 4/9/23, and then one around six months later.  No idea what that one's about.  This is an example of when we'd go back to our business domain expert to find out how this might have happened.  Were they typos?  Or was there some reason two were recorded three days after the study closed, with someone entering that date instead of the date the study actually ended?  And what about the fact that we only have 861 recorded end dates?  What about the other 16 participants?  Did they drop out of the study early, or were these just omissions?
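
By default `.value_counts()` leaves missing values out of the table, which is why the missing end dates only show up as a gap in the total.  A small sketch, again on made-up data, shows how `dropna=False` puts them in the same frequency table.

```python
import numpy as np
import pandas as pd

# Hypothetical end dates; values are made up for illustration
toy = pd.DataFrame({
    "study_end": ["4/9/2021", "4/9/2021", "4/12/2021", np.nan, "4/9/2021", np.nan],
})

# dropna=False adds a row for the missing values to the frequency table
counts = toy["study_end"].value_counts(dropna=False)
print(counts)

# The same number, computed directly
n_missing = int(toy["study_end"].isna().sum())
print("missing end dates:", n_missing)
```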

So many questions that we'd need to follow up on so we can figure out how to handle them.  Stay tuned for the next {doc}`../Chapter5/data_integrity` section.

<h3>Missing Values/Records</h3>

The topic of missing values could fill a book by itself, believe it or not.  There are several different kinds of "missing" values, and several different mechanisms that create them, all of which have implications for how we need to handle them.

At the end of the day, it's always a battle between removing the entire variable (column) with the missing values, removing any observation (row) that has missing values, or _imputing_ (filling in) the missing values with a proxy value so we're able to keep the observation and still use it in our analysis.  The main considerations with imputation are, 1) what methods are we going to use to do so, and 2) at what percentage of imputed values does the variable become too synthetic to be useful?  If you impute and fill in 90% of a column of data with approximated values, is it really going to tell you anything worthwhile?  As a general rule of thumb, you would probably be wise to remove any feature that has more than 50% of its values missing.  We'll address these considerations more in the next section.
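
As a rough sketch of that rule of thumb, you can take the mean of `.isna()` to get the share of missing values per column and compare it to a cutoff.  The `toy` data and the `THRESHOLD` name are assumptions for illustration, and the 50% figure is only the rule of thumb mentioned above, not a hard rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "age": [34.0, np.nan, 27.0, 45.0],
    "sleep_rem_mins": [np.nan, np.nan, np.nan, 88.4],
    "gender": ["F", "M", "F", "M"],
})

# The mean of a True/False mask is the share of True values
missing_share = toy.isna().mean()
print(missing_share)

# Flag features above the rule-of-thumb cutoff
THRESHOLD = 0.5
too_sparse = missing_share[missing_share > THRESHOLD].index.tolist()
print("candidates to drop:", too_sparse)
```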

Now, what about the types of missing values?

1. Missing Completely At Random (MCAR)

   MCAR assumes that all of the missing values have the same probability of being absent, and there's no systematic bias or pattern as to how the values are missing.  It is safest to treat MCAR as unrealistic in the real world.

2. Missing At Random (MAR)

   MAR indicates that the probability of the value being missing is somehow related to the value of one or more other observed features in the dataset.  This type may or may not introduce bias into the system.

3. Missing Not At Random (MNAR)

   MNAR missing values are missing because they are systematically related to the unobserved data in some way, i.e. related to factors outside of our control which are not measured.  This type of missing value will most likely introduce bias.

Let's see what we have in our dataset by running `.isna()` and summing up the missing values for each feature.

In [None]:
# Count NA's by feature
dat.isna().sum()

We will have to deal with these in the next section when we talk to our business domain expert, but for now think about what types of missing values these might be.  The missing Age value may just be an omission because it's only 1 person, or it might be MNAR because it could be related to something outside of what we can observe.  If there were more, it might be MAR due to some kind of relationship between, say, gender and a preference not to give their age, and so on.
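
If you had enough missing values to look for a pattern, one informal check is to compare the missing rate of one feature across the groups of another.  The sketch below uses made-up data, and the `age_missing` column is a helper introduced here for illustration.  A visible difference between groups is a hint toward MAR worth raising with your domain expert.  It is not a formal test, and it can't tell you anything about MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age": [np.nan, 41.0, np.nan, 29.0, 35.0, 52.0, 47.0, 60.0],
})

# Helper indicator: True where age is missing
toy["age_missing"] = toy["age"].isna()

# Share of missing ages within each gender group
rate_by_gender = toy.groupby("gender")["age_missing"].mean()
print(rate_by_gender)
```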

One last consideration here, which we'll cover when we get to the time series section under machine learning, is the situation of missing time periods.  It's quite common to see missing date records (rows), either due to omission or the compression that happens in data storage when there is no value to record.  This will throw a wrench in the works for our time series algorithms, which expect every single time period in our total range of dates to be accounted for.  This is known as _regular_ periodicity, where the records are recorded at regular intervals, as opposed to irregular periodicity.  More to come on this topic later.
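
As a small preview, here's a minimal sketch of how you might spot missing dates in a daily series.  The series `ts` and its dates are made up, and a daily frequency is assumed.  The idea is to build the full calendar with `pd.date_range()` and compare it against the dates you actually have.

```python
import pandas as pd

# Hypothetical daily series with three days missing; values are made up
ts = pd.Series(
    [10.0, 12.0, 11.0, 15.0],
    index=pd.to_datetime(["2021-04-03", "2021-04-04", "2021-04-06", "2021-04-09"]),
)

# Every day we would expect to see between the first and last record
full_range = pd.date_range(ts.index.min(), ts.index.max(), freq="D")

# Dates in the full calendar that are absent from the data
missing_dates = full_range.difference(ts.index)
print(missing_dates)

# Reindexing adds the absent dates back as rows with NaN values
regular = ts.reindex(full_range)
print(regular)
```

After the reindex the series has regular periodicity again, with the gaps now showing up as ordinary missing values.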

<h3>Duplicates</h3>

The last issue we may want to understand is the possibility of duplicates.  Luckily there's a built-in function for this one too.

In [None]:
# Check for duplicate observations
dat[dat.duplicated()]

In [None]:
# Find all of the duplicates by the id's identified above if you want to see all of them plus the originals
dat[dat['id'].isin([89,238])]

We'll address these in the next section too.
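
A couple of options on `.duplicated()` are worth knowing here.  The sketch below uses a made-up `toy` frame to show the difference between the default behavior, `keep=False`, and the `subset` argument.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "age": [34.0, 51.0, 51.0, 27.0, 28.0],
})

# Default keep="first" flags only the repeats, not the first occurrence
repeats_only = toy[toy.duplicated()]

# keep=False flags every member of a fully duplicated group
all_copies = toy[toy.duplicated(keep=False)]

# subset= checks for repeated keys even when the other columns differ
repeated_ids = toy[toy.duplicated(subset=["id"], keep=False)]

print(repeats_only)
print(all_copies)
print(repeated_ids)
```

With `keep=False` you get the originals and their copies in one step, without looking up the ids by hand.  The `subset` version catches a different problem, where the same id appears twice with conflicting values.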

This can get quite a bit more complicated for different kinds of data, think panel data, where we have multiple levels and time periods in the data.  We'll likely need to use a "split-apply-combine" methodology with something like the pandas `.groupby().apply()` functionality.  More on this when we get to the {doc}`../Chapter5/wrangling` section soon.
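
To give a flavor of what that looks like, here's a rough sketch on a made-up panel where each participant should have one row per night.  The `panel` frame and the `night` column are assumptions for illustration.  It splits by the keys that should be unique, counts the rows in each group, and keeps the groups with more than one.

```python
import pandas as pd

# Hypothetical panel data: one row expected per (id, night); values are made up
panel = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "night": [1, 2, 2, 1, 2, 3],
    "sleep_rem_mins": [90.0, 85.0, 86.0, 70.0, 75.0, 72.0],
})

# Split by the keys that should be unique, then count rows per group
rows_per_key = panel.groupby(["id", "night"]).size()
duplicated_keys = rows_per_key[rows_per_key > 1]
print(duplicated_keys)

# Count distinct nights per participant to spot missing periods too
nights_per_id = panel.groupby("id")["night"].nunique()
print(nights_per_id)
```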

Ok, so that's all I would probably do at this time for my initial pass through the data.  In the next section we will discuss how to fix all of these issues, and also start to introduce some basic plotting and graphing techniques you can use to explore the relationships in your data.

The last topic I'd like to introduce in this section is the notion of a Data Quality Report, or Profile Report.  Learning how to work through all of this manually is extremely valuable time well spent.  If you can learn to think through how to identify issues, and think through the questions you want to ask of the data, you'll be a much stronger analyst for it.  Having said all of that, you should know that there are plenty of Python libraries we can import that essentially do everything we've just done automatically.

Don't be angry.  Remember, you're a better analyst now.

For kicks let's check one such automated example and see if we like it.

<h3>Data Quality Report</h3>

I personally find value in doing it all manually like we've done above, but I certainly understand the appeal of a simple two-lines-of-code approach as well.  To each his or her own.  Use whatever suits your style.

See the `ydata_profiling` example below.  Give it a spin and see what you think.

In [None]:
from ydata_profiling import ProfileReport

dqr = ProfileReport(dat, title = "Profiling Report")
dqr.to_notebook_iframe()
#dqr.to_file("dqr_report.html")