# Data Consistency

- Why inconsistencies arise (input errors, etc.)
- Example: integers stored as strings
- Duplicated data
- Simple checks to do:
  - Pandas column dtypes
  - Numeric data: check unique values, mean, min, max
  - Combine with domain knowledge (what are sensible ranges of values for each column?)
- Reasons for strange values:
  - Missingness encoded as a large negative value
  - Input errors (decimal point in the wrong place, decimal point instead of comma...)
- How to fix or replace weird values
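
As a rough illustration of the points in this outline, the sketch below builds a small made-up table (not the Iris data) that has numbers stored as text, a duplicated row, and a missing value encoded as `-999`. The column names and the `-999` code are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Iris dataset) with three typical problems:
# numbers stored as text, a duplicated row, and -999 used to mean "missing".
raw = pd.DataFrame(
    {
        "sample_id": [1, 2, 2, 3, 4],
        "count": ["10", "12", "12", "7", "n/a"],
        "length_cm": [5.1, 4.9, 4.9, -999.0, 47.0],
    }
)

# 1. Column dtypes: "count" is reported as a text dtype, not a numeric one.
print(raw.dtypes)

# 2. Convert text to numbers; values that cannot be parsed become NaN.
converted = raw.copy()
converted["count"] = pd.to_numeric(converted["count"], errors="coerce")

# 3. Count rows that are exact copies of an earlier row, then drop them.
n_duplicates = converted.duplicated().sum()
clean = converted.drop_duplicates().reset_index(drop=True)

# 4. Replace the assumed missing-value code with NaN.
clean["length_cm"] = clean["length_cm"].replace(-999.0, np.nan)

# 47.0 may be 4.7 with a misplaced decimal point; deciding that needs
# domain knowledge, so it is left untouched here.
print(n_duplicates)
print(clean)
```

Converting with `errors="coerce"` silently turns unparseable entries into `NaN`, so it is worth counting the missing values before and after the conversion.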

## Having a First Look at the Data

Here we will use the Iris dataset, a well-known example dataset for machine learning.

In [22]:
import pandas as pd

In [23]:
df = pd.read_csv("data/Iris.csv")

Display the first ten rows of the data:

> **Note:** `display` is provided by IPython and is available automatically in Jupyter notebooks, where it renders tables with prettier formatting than Python's `print` function. It is not defined in a plain Python script, so use `print(df.head(10))` there instead.

In [69]:
display(df.head(10))

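|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |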


We could also look at the last few rows of the data with `df.tail()`, or a random sample of rows with `df.sample(n=5)` (by default, `df.sample()` returns a single row).
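
A minimal sketch of both calls, using a small made-up frame as a stand-in for the Iris data (the `demo` frame and its `value` column are hypothetical). Passing `random_state` is optional; it is used here only so the random sample is reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the Iris data: 20 rows with a running Id.
demo = pd.DataFrame({"Id": range(1, 21), "value": [i * 0.5 for i in range(20)]})

# Last 3 rows (tail() returns 5 rows if no number is given).
last_rows = demo.tail(3)

# 3 randomly chosen rows; random_state fixes the seed for reproducibility.
random_rows = demo.sample(n=3, random_state=0)

print(last_rows)
print(random_rows)
```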

To check the number of rows and columns we can use:

In [30]:
print(df.shape)

(150, 6)

Our data has 150 rows and 6 columns. It might also be useful to look at the column names (especially for larger datasets with many columns where they may not all be displayed by `df.head()`):

In [28]:
print(df.columns)

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

A useful command that summarises much of this information is `df.info()`:

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values (i.e. the number of rows that have data) and data type of each column.

Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The [Scaling to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html) page in the Pandas documentation gives pointers for what you can try in that case.
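
The `+` in `7.2+ KB` indicates a lower bound: for text (object) columns, `df.info()` does not measure the strings themselves by default. The sketch below, on a made-up frame with a repetitive text column, shows one way to get a fuller figure with `memory_usage(deep=True)` and one common way to reduce it, by converting the column to the `category` dtype. The frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with a repetitive text column, similar in spirit to Species.
demo = pd.DataFrame(
    {
        "measurement": [0.1 * i for i in range(3000)],
        "label": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"] * 1000,
    }
)

# deep=True also counts the memory used by the text values themselves.
bytes_before = demo.memory_usage(deep=True).sum()

# A categorical column stores each distinct label once plus small integer codes.
demo["label"] = demo["label"].astype("category")
bytes_after = demo.memory_usage(deep=True).sum()

print(bytes_before, bytes_after)
```

This conversion tends to help most when a column has few distinct values relative to its length, as `Species` does.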

The pandas `describe()` method gives summary statistics for the numeric columns in our data (the count, mean, standard deviation, minimum and maximum values, and quartiles for each column):

In [41]:
display(df.describe())

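|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|----|---------------|--------------|---------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |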


Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense?
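
One way to turn such a sanity check into code is to compare each column against a range that domain knowledge says is plausible. The sketch below uses made-up measurements containing two deliberately odd values; the `valid_ranges` limits are assumptions chosen for illustration, not established bounds for iris flowers.

```python
import pandas as pd

# Hypothetical measurements; rows 3 and 4 contain deliberately odd values.
demo = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 6.3, 58.0, 5.0],
        "SepalWidthCm": [3.5, 3.0, 2.9, 3.1, -999.0],
    }
)

# Assumed plausible ranges in cm, chosen for illustration only.
valid_ranges = {"SepalLengthCm": (0.0, 10.0), "SepalWidthCm": (0.0, 10.0)}

# Boolean frame marking every cell that falls outside its range.
# between() includes both end points; missing values are also flagged.
out_of_range = pd.DataFrame(
    {col: ~demo[col].between(low, high) for col, (low, high) in valid_ranges.items()}
)

suspicious_counts = out_of_range.sum()            # suspicious values per column
suspicious_rows = demo[out_of_range.any(axis=1)]  # rows needing a closer look

print(suspicious_counts)
print(suspicious_rows)
```

A value such as `58.0` could be `5.8` with a misplaced decimal point, and `-999.0` looks like a missing-value code, but confirming either requires knowing how the data was collected.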

TODO: make it so there's a min/max value that looks wrong

Note that the `Species` column does not appear above as it contains text. For both text and numeric columns it can be helpful to know the number of unique values in each column:

In [76]:
print(df.nunique())

Id               150
SepalLengthCm     35
SepalWidthCm      23
PetalLengthCm     43
PetalWidthCm      22
Species            3
dtype: int64


We see the `Id` column has as many different values as rows in our dataset, i.e. it is a unique identifier for each sample in the data. All the other columns have fewer than 150 unique values, and the `Species` column has three different values.
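
Rather than comparing the unique count with the row count by eye, the same check can be done directly. A minimal sketch on a made-up frame in which one identifier is repeated (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data where the Id 3 appears twice.
demo = pd.DataFrame(
    {"Id": [1, 2, 3, 3, 5], "Species": ["a", "a", "b", "b", "c"]}
)

# True only if every value in the column occurs exactly once.
id_is_unique = demo["Id"].is_unique

# keep=False marks every occurrence of a repeated Id, not just the later ones.
repeated = demo[demo["Id"].duplicated(keep=False)]

print(id_is_unique)
print(repeated)
```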

The `value_counts()` function, applied to the `Species` column, shows the number of occurrences of the three different values in the dataset:

In [40]:
print(df["Species"].value_counts())

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64


So we have 50 measurements each of "setosa", "versicolor", and "virginica" irises.
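
In messier data, the same kind of check can reveal unexpected labels. The sketch below uses a made-up `Species` column containing a differently capitalised label, an `UNKNOWN` placeholder, and a missing value, and compares it against an assumed set of expected labels.

```python
import pandas as pd

# Hypothetical column with three problem entries (positions 2, 3 and 5).
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "iris-setosa", "UNKNOWN", "Iris-virginica", None],
    name="Species",
)

# Assumed set of valid labels.
expected = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}

# dropna=False makes missing values show up as their own row in the counts.
print(species.value_counts(dropna=False))

# Entries that are not among the expected labels (missing values included).
unexpected = species[~species.isin(expected)]
print(unexpected)
```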

TODO: Make it so a weird value turns up, e.g. UNKNOWN

## Displaying Data Frames with Style 😎

You can also get fancy with how you display data frames by highlighting and formatting cells differently using the data frame's `style` attribute. There are a few examples below. For more details, see the [Table Visualization page in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styler-Object-and-HTML).

Change the precision with which numbers are displayed:

In [71]:
df_top10 = df.head(10)  # just style the first 10 rows for demo purposes here

# round values to nearest integer (0 decimal places)
display(df_top10.style.format(precision=0))

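|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5 | 4 | 1 | 0 | Iris-setosa |
| 1 | 2 | 5 | 3 | 1 | 0 | Iris-setosa |
| 2 | 3 | 5 | 3 | 1 | 0 | Iris-setosa |
| 3 | 4 | 5 | 3 | 2 | 0 | Iris-setosa |
| 4 | 5 | 5 | 4 | 1 | 0 | Iris-setosa |
| 5 | 6 | 5 | 4 | 2 | 0 | Iris-setosa |
| 6 | 7 | 5 | 3 | 1 | 0 | Iris-setosa |
| 7 | 8 | 5 | 3 | 2 | 0 | Iris-setosa |
| 8 | 9 | 4 | 3 | 1 | 0 | Iris-setosa |
| 9 | 10 | 5 | 3 | 2 | 0 | Iris-setosa |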


Apply a colour gradient to each column based on each cell's value:

In [73]:
display(df_top10.style.background_gradient())

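|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |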


Highlight the smallest value in each column:

In [74]:
display(df_top10.style.highlight_min())

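|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |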
