## Cleaning data
So far, the datasets you have worked with in this subject have been relatively clean. In practice, raw data tends to be very messy. It will usually contain missing or incorrect values that need to be handled before analysis.

***
## Missing values
If a dataset contains missing values, pandas will typically import them as <code>NaN</code> ('not a number').

You can use the <code>isna</code> method in pandas to analyse missing values. This method returns a dataframe where entries are <code>True</code> for values that are missing and <code>False</code> otherwise. You can use the <code>sum</code> aggregate to then count the number of missing values.

```python
# get missing value count for each column
df_name.isna().sum()

# get total missing value count
df_name.isna().sum().sum()
```

To remove rows or columns with missing values, you can call the <code>dropna</code> method. By default it removes rows, and it returns a new dataframe rather than modifying the original.

```python
# remove rows with missing values
df_name.dropna()
```
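As a minimal sketch of the other ways <code>dropna</code> can be called, the example below uses a small made-up dataframe (<code>df_example</code> and its column names are hypothetical). It shows dropping rows, dropping columns with <code>axis=1</code>, and only checking selected columns with <code>subset</code>.

```python
import numpy as np
import pandas as pd

# hypothetical example data with two missing values
df_example = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [4.0, 5.0, 6.0],
    'c': [np.nan, 8.0, 9.0],
})

# keep only rows that have no missing values
rows_dropped = df_example.dropna()

# keep only columns that have no missing values
cols_dropped = df_example.dropna(axis=1)

# only drop rows where column 'a' is missing
subset_dropped = df_example.dropna(subset=['a'])
```

In each case <code>df_example</code> itself is left unchanged, so the result needs to be assigned to a variable if you want to keep it.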

To replace missing values, you can call the <code>fillna</code> method. Use this with caution - missing values should only be replaced if you are certain what value they should have.

```python
# example: replace missing values with 0
df_name.fillna(0)
```

Note that pandas aggregates and plotting will still work if there are missing values. By default, pandas ignores missing values when computing aggregates and does not plot them. Because of this, simply leaving the missing values in is often better than removing or replacing them.
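The following sketch illustrates this behaviour on a small made-up series (the name <code>values</code> is hypothetical). The <code>skipna</code> parameter controls whether missing values are ignored.

```python
import numpy as np
import pandas as pd

# hypothetical example data with one missing value
values = pd.Series([1.0, 2.0, np.nan, 4.0])

# default behaviour: the missing value is ignored, so this is (1 + 2 + 4) / 3
mean_ignoring_missing = values.mean()

# with skipna=False the missing value is not ignored and the result is NaN
mean_with_missing = values.mean(skipna=False)

# count only counts the non-missing values
non_missing_count = values.count()
```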

***
## Significant outliers
The simplest tool to identify significant outliers is data visualisation. A scatter matrix will quickly plot the univariate and bivariate distributions of the numeric variables in the data, allowing you to see if any observations are significantly different to others.

```python
# create a scatter matrix
import seaborn as sns
sns.pairplot(df_name)
```

Generally, you should only remove observations from the data if you have strong evidence to suggest they are incorrect. Often this requires some domain knowledge of the data (i.e. understanding which values are and aren't possible for each variable).

If data does contain significant outliers, a good approach is to use 'robust statistics'. These are statistics that aren't heavily influenced by outliers. The median is a good example of a robust statistic, since the size of the outliers doesn't heavily skew it (unlike the mean).
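As a small illustration with made-up numbers (the series names are hypothetical), compare the mean and median of the same data with and without a single extreme value.

```python
import pandas as pd

# hypothetical example data: identical except for the last value
typical = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 1200])

# the mean changes substantially when the outlier is present
mean_typical = typical.mean()
mean_outlier = with_outlier.mean()

# the median is the middle value, so the outlier does not move it here
median_typical = typical.median()
median_outlier = with_outlier.median()
```

Here the mean jumps from 11.6 to 249.2, while the median stays at 12 in both cases.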

***
## Domain/formatting checks
Often you will know the required domain or format of a variable. For example, an Australian mobile number must have 10 digits and start with '04'. In this case you can implement a formatting check to see if any of the mobile numbers have an invalid format. The following is an example function that would do this check.

```python
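def is_valid_mobile(number):
    # convert to a string so the length and prefix can be checked
    number = str(number)
    # valid if it is exactly 10 digits and starts with '04'
    return len(number) == 10 and number.isdigit() and number.startswith('04')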
```

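You could then call the <code>apply</code> method to do this check for every value in a column. This returns a series of <code>True</code>/<code>False</code> values, one for each row.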

```python
df_name['col_name'].apply(is_valid_mobile)
```