# Video: Finding and Imputing Missing Data

This video walks through basic checks for missing data, and common ways to fill it in.

In [None]:
import pandas as pd


In [None]:
penguins_adelie = pd.read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff", index_col="Sample Number")
penguins_gentoo = pd.read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381", index_col="Sample Number")
penguins_chinstrap = pd.read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462", index_col="Sample Number")
penguins = pd.concat([penguins_adelie, penguins_gentoo, penguins_chinstrap])
penguins.head()

Script:
* Two of the first five rows of the penguins data set have visibly missing values.
* That makes this an easy case: we know there is missing data, but we do not yet know its full extent.
* When looking for missing data, the first three things you should do are check the documentation, plot a histogram, and check for NA values in the data frame that you loaded.
* For this video, I'll assume that you looked at the documentation already, and go straight to the histogram.

In [None]:
penguins.hist()

Script:
* These histograms are not pretty, but you can see quickly that there aren't any huge unexpected spikes at particular values.
* However, this view is incomplete.
* It only covers numeric columns, and it leaves out values that pandas treats as missing (NA).
* How can we look at those?
* There are a couple of quick ways.

In [None]:
penguins.count()

Script:
* The `count` method on a data frame counts the present (non-missing) values in each column.
* So a column with missing values shows a lower count.
* In this case, the sex and delta columns in particular have a number of missing values.
* The comments column, on the other hand, is mostly missing values.
* This check is quick, but be sure to compare each count to the length of the data frame (its number of rows), not to the highest value shown in the count results.
* The two may differ if every column is missing at least one value.
* A more direct way to check for missing values is to use the `isna` method to explicitly check for missing data recognized by pandas, and then sum up the number found by column.

In [None]:
penguins.isna().sum()
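
The relationship between the two checks can be seen on a small made-up data frame. This is a minimal sketch; the `toy` frame and its column names are invented for illustration and are not the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every column is missing at least one value.
toy = pd.DataFrame({
    "mass": [3750.0, np.nan, 3250.0, 3450.0],
    "sex": ["MALE", "FEMALE", None, None],
})

n_rows = len(toy)            # total number of rows
present = toy.count()        # non-missing values per column
missing = toy.isna().sum()   # missing values per column

# Compare counts against the number of rows, not the largest count.
print(n_rows - present)
print(missing)
```

Here the largest count is smaller than the number of rows, so comparing columns only against each other would undercount the missing values.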

Script:
* This view makes it more obvious at a glance that a few physical measurements are missing too, not just sex, blood isotope ratios, and comments.
* So what should you do to fill in the missing data?
* This is very context dependent, and column dependent too.
* For most of these columns, you could probably fill in the mean or median value and get reasonable results.
* But what about the Sex column?
* That column has string values - female or male.
* If we pick a value to fill in there, we will be prone to skewing the data set in some way.
* Think carefully before you fill in non-numeric values.
* So how do we fill in the missing values?
* Scikit-Learn has a class SimpleImputer to take care of most cases.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(strategy="mean")

Script:
* If there is a sentinel value like -1 that you found in the documentation or histogram, you can pass a `missing_values` parameter when creating the `SimpleImputer` object to replace that value.
* You can also use a different calculation for the replacement values, such as the median (`"median"`), the mode (`"most_frequent"`), or a constant (`"constant"`), by changing the `strategy` value.
* Let's stick with the mean here and calculate that value to fill in now.
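
The sentinel and strategy options described above can be sketched on a tiny array. This is an illustration under assumptions: the values and the choice of -1 as the sentinel are made up, and the name `sentinel_imputer` is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical measurements where -1 marks a missing reading.
X = np.array([[10.0], [-1.0], [20.0], [60.0]])

# Treat -1 as missing and replace it with the median of the other values.
sentinel_imputer = SimpleImputer(missing_values=-1, strategy="median")
filled = sentinel_imputer.fit_transform(X)

print(sentinel_imputer.statistics_)  # learned fill value per column
print(filled.ravel())
```

The median is computed from the remaining values only, so the sentinel itself does not distort the fill value.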

In [None]:
# Expected to raise a ValueError: the mean strategy cannot handle the string columns.
imputer.fit(penguins)

Script:
* The main limitation of the mean strategy is that it only works on numeric data, not strings.
* So it won't work for the ID columns, where imputing a value would be questionable anyway, and it won't work for the sex column that we already talked about.
* Let's apply it to just the numeric columns.

In [None]:
penguins.info()

In [None]:
penguins_numeric_columns = [c for c in penguins.columns if penguins[c].dtype == "float64"]
penguins_numeric_columns

In [None]:
imputer = SimpleImputer(strategy="mean")
imputer.fit(penguins[penguins_numeric_columns])

Script:
* Now that the imputer has learned the mean of each numeric column, we can use it to fill in missing data.
* The imputer will reject inputs whose columns do not match the ones it was fit on, so we will pass in just the numeric columns.

In [None]:
imputer.transform(penguins[penguins_numeric_columns])

Script:
* We can update the numeric columns by assigning the transform output to them.
* I will do that now with a copy of the penguins data frame so we can compare afterwards.

In [None]:
penguins_new = penguins.copy()
penguins_new[penguins_numeric_columns] = imputer.transform(penguins[penguins_numeric_columns])

In [None]:
penguins_new.head()

Script:
* Now all the NaN values are gone from those early rows except the missing Sex value.
* And for comparison...

In [None]:
penguins.head()
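
The same before-and-after comparison can be made with missing-value counts rather than by eye. The sketch below repeats the copy-and-assign pattern on a made-up frame; `toy`, `toy_new`, and `toy_imputer` are hypothetical names, not part of the penguins notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numeric gap and one string gap.
toy = pd.DataFrame({
    "mass": [3000.0, np.nan, 4000.0],
    "sex": pd.Series(["MALE", np.nan, "FEMALE"], dtype=object),
})
numeric_columns = ["mass"]

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(toy[numeric_columns])

# Fill a copy so the original stays available for comparison.
toy_new = toy.copy()
toy_new[numeric_columns] = toy_imputer.transform(toy[numeric_columns])

print(toy.isna().sum())      # missing counts before imputation
print(toy_new.isna().sum())  # numeric gap filled, string gap untouched
```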

Script:
* We started with many missing values and quickly replaced most of them with the column mean using the `SimpleImputer` class.
* We could do something similar with the missing values in the Sex column using the `SimpleImputer` class, but you should think carefully about what you fill in there before doing so.
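
As a rough sketch of what that could look like, here are two options on a made-up stand-in for the Sex column. The data and the `"UNKNOWN"` placeholder are assumptions for illustration; which option is appropriate, if either, depends on the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the Sex column.
sex = pd.DataFrame(
    {"Sex": pd.Series(["MALE", "FEMALE", np.nan, "FEMALE"], dtype=object)}
)

# Option 1: fill with the most common value (skews toward that value).
mode_imputer = SimpleImputer(strategy="most_frequent")
mode_filled = mode_imputer.fit_transform(sex)

# Option 2: fill with an explicit placeholder so the gap stays visible.
constant_imputer = SimpleImputer(strategy="constant", fill_value="UNKNOWN")
constant_filled = constant_imputer.fit_transform(sex)

print(mode_filled.ravel())
print(constant_filled.ravel())
```

The placeholder keeps the missing rows distinguishable from real observations, while the most-frequent fill hides them inside the larger group.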