# Data Cleaning

Data cleaning is the process of fixing or removing missing and invalid values in a dataset so that any analysis performed on it is correct.

## Introduction to Missing Data

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a _`Salary`_ field with an empty value, the number 0, or an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy" (values that evaluate to `False` in a boolean context):

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
falsy_values = (0, False, None, '', [], {})
falsy_values

<a id="Falsy"></a>
For Python, all the values above are considered "falsy", so `any` returns `False` for the whole tuple:

In [None]:
any(falsy_values)

Numpy has a special floating-point value for missing numbers: `np.nan`, short for _NaN_ ("Not a Number").

In [None]:
np.nan

The `np.nan` value is kind of a virus. Everything that it touches becomes `np.nan`:

In [None]:
3 + np.nan

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a.sum()

In [None]:
a.mean()

This is better than regular `None` values, which in the previous examples would have raised an exception:

In [None]:
3 + None

For a numeric array, the `None` value is replaced by `np.nan`:

In [None]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')

In [None]:
a

As we said, `np.nan` is like a virus. If you have any `nan` value in an array and you try to perform an operation on it, you'll get unexpected results:

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a.mean()

In [None]:
a.sum()

Numpy also has a value for infinity, `np.inf`:

In [None]:
np.inf

Which also behaves as a virus:

In [None]:
3 + np.inf

In [None]:
np.inf / 3

In [None]:
np.inf / np.inf

In [None]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)  # the np.float alias was removed from recent NumPy

In [None]:
b.sum()

![separator1](https://i.imgur.com/ZUWYTii.png)

### Checking for `nan` or `inf`

There are two functions: `np.isnan` and `np.isinf` that will perform the desired checks:

In [None]:
np.isnan(np.nan)

In [None]:
np.isinf(np.inf)

`np.isfinite` covers both cases at once: it returns `False` for `nan` and for `inf`, and `True` for any regular number.

In [None]:
np.isfinite(np.nan), np.isfinite(np.inf)

`np.isnan` and `np.isinf` also take arrays as inputs, and return boolean arrays as results:

In [None]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

In [None]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

In [None]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))
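These helper functions are needed because `nan` never compares equal to anything, itself included, so an `==` test cannot find it. A minimal sketch (the array name `values` is only for illustration):

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# nan is not equal to itself, so == never matches it
print(np.nan == np.nan)     # False
print(values == np.nan)     # all False, even at the nan position

# np.isnan is the reliable check
print(np.isnan(values))     # True only at index 1
```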

_Note: It's not so common to find infinite values. From now on, we'll keep working with only `np.nan`_

![separator1](https://i.imgur.com/ZUWYTii.png)

### Filtering them out

Whenever you're trying to perform an operation with a Numpy array and you know there might be missing values, you'll need to filter them out before proceeding, to avoid `nan` propagation. We'll use a combination of the previous `np.isnan` + boolean arrays for this purpose:

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a[~np.isnan(a)]

Since this array contains no infinite values, that is equivalent to:

In [None]:
a[np.isfinite(a)]

And with that result, all the operations can now be performed:

In [None]:
a[np.isfinite(a)].sum()

In [None]:
a[np.isfinite(a)].mean()
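If you only need the aggregate and not the filtered array, Numpy also ships NaN-aware reductions such as `np.nansum` and `np.nanmean`. A short sketch, reusing the same sample array:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# NaN-aware reductions skip nan values instead of propagating them
total = np.nansum(a)      # 10.0
average = np.nanmean(a)   # 2.5
print(total, average)
```

Both give the same numbers as filtering first with `a[~np.isnan(a)]`.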

![separator2](https://i.imgur.com/4gX5WFr.png)

## Handling Missing Data with Pandas

Pandas borrows numpy's boolean selection capabilities and adds a number of convenient methods to handle missing values. Let's look at them one at a time:

### Pandas utility functions

Similarly to `numpy`, pandas also has a few utility functions to identify and detect null values. `isnull()` and `isna()` are synonyms: they behave identically, so you can use either one.

In [None]:
pd.isnull(np.nan)

In [None]:
pd.isnull(None)

In [None]:
pd.isna(np.nan)

In [None]:
pd.isna(None)

The opposite functions also exist: `notnull` and `notna` are synonyms and behave identically.

In [None]:
pd.notnull(None)

In [None]:
pd.notnull(np.nan)

In [None]:
pd.notna(np.nan)

In [None]:
pd.notnull(3)

These functions also work with `Series` and `DataFrame`s. `isnull` returns a boolean mask with `True` wherever a value is missing (`None`, `np.nan` or `NaT`) and `False` otherwise. Note that this is not the same as being [Falsy](#Falsy): `0`, `False` and `''` are not considered null. `notnull()` returns the opposite mask.

In [None]:
pd.Series([1, np.nan, 7])

In [None]:
pd.isnull(pd.Series([1, np.nan, 7]))

In [None]:
pd.notnull(pd.Series([1, np.nan, 7]))

In [None]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))
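To make explicit which values pandas treats as missing, here is a small sketch. The Series name `candidates` is made up for the example, and `dtype=object` is used only so the mixed values are stored as-is:

```python
import numpy as np
import pandas as pd

candidates = pd.Series([0, False, '', None, np.nan, pd.NaT], dtype=object)

# Only None, nan and NaT are reported as missing;
# 0, False and '' are ordinary (if falsy) values
mask = candidates.isnull()
print(mask.tolist())   # [False, False, False, True, True, True]
```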

![separator1](https://i.imgur.com/ZUWYTii.png)

### Pandas Operations with Missing Values

Pandas manages missing values more gracefully than numpy. `nan`s no longer behave as "viruses" in aggregations: methods like `count`, `sum` and `mean` skip them by default:

In [None]:
pd.Series([1, 2, np.nan]).count()

In [None]:
pd.Series([1, 2, np.nan]).sum()

In [None]:
pd.Series([2, 2, np.nan]).mean()
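The skipping applies to reductions only. As a quick sketch, the `skipna` parameter switches it off, and element-wise arithmetic still propagates `nan`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])

# Reductions skip nan by default (skipna=True)...
print(s.sum())               # 3.0
# ...but can be told to propagate it, like numpy does
print(s.sum(skipna=False))   # nan

# Element-wise arithmetic keeps the nan in place
print((s + 1).tolist())      # [2.0, 3.0, nan]
```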

### Filtering missing data

As we saw with numpy, we can combine boolean selection with `pd.isnull` or `pd.notnull` to filter out `nan`s and null values:

In [None]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [None]:
pd.notnull(s)

In [None]:
pd.isnull(s)

In Python, boolean values can be summed: `True` counts as `1` and `False` as `0`. We can use this to count null values. `notnull()` returns a boolean Series with `True` for non-null values and `False` for null values, so applying `.sum()` to the result gives the total number of non-null values.

Similarly, `isnull()` returns `True` if a value is null and `False` otherwise, so `.sum()` on its result gives the count of null values.

In [None]:
pd.notnull(s).sum()

In [None]:
pd.isnull(s).sum()

We can also use these functions to select all null or non-null values from a Series or DataFrame, since `isnull` and `notnull` both return a boolean mask:

In [None]:
s[pd.notnull(s)]

Both `notnull` and `isnull` are also methods of `Series` and `DataFrame`s, so we can call them directly on the object:

In [None]:
s.isnull()

In [None]:
s.notnull()

In [None]:
s[s.notnull()]

![separator1](https://i.imgur.com/ZUWYTii.png)

### Dropping null values

Boolean selection + `notnull()` seems a little bit verbose and repetitive. And as we said before: any repetitive task will probably have a better, more DRY way. In this case, we can use the `dropna` method, which drops all the null values and returns a new object containing only the non-null ones. `dropna` does not modify the existing object.

In [None]:
s

In [None]:
s.dropna()

In [None]:
s

### Dropping null values on DataFrames

You saw how simple it is to drop `na`s with a Series. But with `DataFrame`s, there will be a few more things to consider, because you can't drop single values. You can only drop entire columns or rows. Let's start with a sample `DataFrame`:

In [None]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

In [None]:
df

To identify the null values in a DataFrame, we start by checking its `shape` attribute and calling `info()` to get an idea of the structure of the dataset and of how many non-null values each column holds:

In [None]:
df.shape

In [None]:
df.info()

We can see above that Column A has 2 non-null values, Columns B and C have 3 each, and Column D has 4. The `RangeIndex` line tells us the DataFrame has 4 rows, so we can work out the number of null values in each column by subtraction. `isnull()` gives the same information directly:

In [None]:
df.isnull()

In [None]:
df.isnull().sum()
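The same trick extends from counts to proportions: the mean of a boolean mask is the fraction of `True` values. A small sketch on the sample DataFrame (the name `missing_share` is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Share of missing entries in each column
missing_share = df.isnull().mean()
print(missing_share)   # A: 0.5, B: 0.25, C: 0.25, D: 0.0
```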

By default, `dropna()` will drop all the rows in the DataFrame which have 1 or more null values.

In [None]:
df.dropna()

In this case we're dropping **rows**: every row containing a null value is removed from the DataFrame. You can also use the `axis` parameter to drop columns containing null values:

In [None]:
df.dropna(axis=1)  # axis='columns' also works

In [None]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})

In [None]:
df2

The default behaviour of `dropna()` can be a bit extreme, since it drops every row (or column) that has at least 1 null value. The `how` parameter controls this: `how='any'` (the default) drops a row or column if any of its values is null, while `how='all'` drops it only if all of its values are null.

In [None]:
df2.dropna(how='all')  # drops only the row in which every value is null

In [None]:
df2.dropna(how='any')  # default behavior

You can also use the `thresh` parameter to indicate a _threshold_, a minimum number of non-null values for the row/column to be kept:

In [None]:
df

In [None]:
df.dropna(thresh=3)

In [None]:
df.dropna(thresh=3, axis='columns')
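`dropna` also accepts a `subset` parameter, which this lesson doesn't otherwise use: it limits the null check to the listed columns. A minimal sketch on the same sample DataFrame, assuming only `Column A` is required to be present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Drop a row only when 'Column A' is null; nulls elsewhere are tolerated
required = df.dropna(subset=['Column A'])
print(required)   # keeps rows 0 and 2
```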

![separator1](https://i.imgur.com/ZUWYTii.png)

### Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. The right choice depends heavily on the context and the dataset. We can replace the `nan` values with a fixed value (like `0`), with a statistic such as the `mean` of the data, or with the closest value (the one just above or below the missing data point). Some domain knowledge about the data is important to pick a sensible fill value.

In [None]:
s

Filling nulls with an arbitrary value like `0`:

In [None]:
s.fillna(0)

Filling null values with the `mean` value

In [None]:
s.fillna(s.mean())

Like `dropna`, `fillna()` does not modify the object in place; it returns a new object instead:

In [None]:
s

We can also fill the missing values using the values in close proximity to them (the value above or below). A forward fill, `ffill()`, propagates the last valid value before the gap into it; a backward fill, `bfill()`, does the reverse and uses the first valid value after the gap. Older pandas versions spelled these as `fillna(method='ffill')` and `fillna(method='bfill')`, which recent versions deprecate.

In [None]:
s.ffill()

In [None]:
s.bfill()

This can still leave null values at the extremes of the Series/DataFrame:

In [None]:
pd.Series([np.nan, 3, np.nan, 9]).ffill()

In [None]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).bfill()
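One common way to cover those edges, sketched here with a made-up Series named `gaps`, is to chain both directions. Whether copying the nearest value is acceptable depends on your data:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, 3, np.nan, 9])

# Forward fill handles the interior gap; the follow-up backward fill
# takes care of the leading nan that has no earlier value to copy
filled = gaps.ffill().bfill()
print(filled.tolist())   # [3.0, 3.0, 3.0, 9.0]
```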

### Filling null values on DataFrames

The `fillna` method also works on `DataFrame`s, and it works similarly. The main differences are that you can specify the `axis` (as usual, rows or columns) along which to fill (relevant for forward and backward fills) and that you have more control over the values passed, for example one fill value per column:

In [None]:
df

In [None]:
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})

When forward or backward filling a DataFrame, `axis` defines the direction of the fill. `axis=0` fills along each column (top to bottom for `ffill`, bottom to top for `bfill`) and `axis=1` fills along each row (left to right for `ffill`, right to left for `bfill`):

In [None]:
df.ffill(axis=0)

In [None]:
df.ffill(axis=1)
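Instead of listing a value per column by hand, you can pass a Series of per-column statistics. As a sketch (filling with the mean is only one possible choice), `df.mean()` is indexed by column name, so `fillna` matches each mean to its column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Each column's nulls are replaced by that column's own mean
filled = df.fillna(df.mean())
print(filled)
```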

![separator1](https://i.imgur.com/ZUWYTii.png)

### Checking if there are NAs

The question is: Does this `Series` or `DataFrame` contain any missing value? The answer should be yes or no: `True` or `False`. How can you verify it?

#### **Method 1: Checking the length**

If there are missing values, `s.dropna()` will have fewer elements than `s`:

In [None]:
s.dropna().count()

In [None]:
len(s)

In [None]:
len(s.dropna())

In [None]:
missing_values = len(s.dropna()) != len(s)
missing_values

Since `s` contains missing values, it has more elements than `s.dropna()`, so comparing the two lengths gives `True`.

There's also a `count` method, that excludes `nan`s from its result:

In [None]:
len(s)

In [None]:
s.count()

So we could just do:

In [None]:
missing_values = s.count() != len(s)
missing_values

#### **Method 2: More Pythonic solution `any`**

The methods `any` and `all` check whether there's `any` `True` value in a Series or whether `all` the values are `True`. They work the same way as Python's built-in `any` and `all`:

In [None]:
pd.Series([True, False, False]).any()

In [None]:
pd.Series([True, False, False]).all()

In [None]:
pd.Series([True, True, True]).all()

The `isnull()` method returned a Boolean `Series` with `True` values wherever there was a `nan`:

In [None]:
s.isnull()

So we can just use the `any` method with the boolean array returned:

In [None]:
pd.Series([1, np.nan]).isnull().any()

In [None]:
pd.Series([1, 2]).isnull().any()

In [None]:
s.isnull().any()

The same check can run directly on the underlying numpy array, exposed by the `values` attribute of the Series:

In [None]:
s.isnull().values

In [None]:
s.isnull().values.any()
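Two more variants of the same check, as a sketch: Series expose a `hasnans` attribute, and for a DataFrame (here a made-up one named `frame`) the per-column answers need to be collapsed once more:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
frame = pd.DataFrame({'A': [1, np.nan], 'B': [3, 4]})

# Series expose the answer directly as an attribute
print(s.hasnans)                      # True

# On a DataFrame, the first any() works per column,
# the second collapses the per-column answers into one
print(frame.isnull().any().any())     # True
print(frame.isnull().values.any())    # True, same check on the raw array
```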

![separator2](https://i.imgur.com/4gX5WFr.png)

## Cleaning not-null values

We previously saw that when data is missing from the dataset, i.e., we have `np.nan` values, we can easily identify the missing values (using methods like `isnull`) and fill or drop them (using `fillna` or `dropna`), dealing with the problem in a straightforward manner.

But sometimes there are values in the dataset that are not missing but are invalid (for example, a value that is not possible for a particular attribute, or an outlier).

In [None]:
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})
df

The previous `DataFrame` doesn't have any "missing value", but clearly has invalid data. `290` doesn't seem like a valid age, and `D` and `?` don't correspond with any known sex category. How can you clean these not-missing, but clearly invalid values then?

### Finding Unique Values

The first step to clean invalid values is to **notice** them, then **identify** them and finally handle them appropriately (remove them, replace them, etc). Usually, for a "categorical" type of field (like Sex, which only takes values from a discrete set `('M', 'F')`), we start by analyzing the variety of values present. For that, we use the `unique()` method, which returns all the unique values present in the Series (including `nan`):

In [None]:
df['Sex'].unique()

The `value_counts()` method returns the count of each unique value, i.e., how many data points we have for each of the values that `unique()` listed. Unlike `unique()`, it leaves `nan` out by default:

In [None]:
df['Sex'].value_counts()

Clearly, if you see values like `'D'` or `'?'`, they'll immediately catch your attention. Now, what to do with them? Let's say you picked up the phone, called the survey company and they told you that `'D'` was a typo and it should actually be `'F'`. You can use the `replace` function to replace these values:

In [None]:
df['Sex'].replace('D', 'F')

It can accept a dictionary of values to replace. For example, they also told you that there might be a few `'N's`, that should actually be `'M's`:

In [None]:
df['Sex'].replace({'D': 'F', 'N': 'M'})

If you have many columns to replace, you could apply it at "DataFrame level":

In [None]:
df.replace({
    'Sex': {
        'D': 'F',
        'N': 'M'
    },
    'Age': {
        290: 29
    }
})

In the previous example, I explicitly replaced 290 with 29 (assuming it was just an extra 0 entered at the data-entry phase). But what if you'd like to remove all the extra 0s from the ages column (for example, turning `150` into `15` and `490` into `49`)?

The first step would be to just set the limit of the "not possible" age. Is it 100? 120? Let's say that anything above 100 isn't credible for **our** dataset. We can then combine boolean selection with the operation:

In [None]:
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'F', '?'],
    'Age': [29, 30, 24, 290, 25],
})
df

In [None]:
df[df['Age'] > 100]

And we can now just divide by 10:

In [None]:
df.loc[df['Age'] > 100, 'Age'] / 10

In [None]:
df.loc[df['Age'] > 100, 'Age'] = df.loc[df['Age'] > 100, 'Age'] / 10

In [None]:
df
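The same boolean-selection idea works for categorical columns. As a sketch, assuming the only accepted categories are `'M'` and `'F'` (the names `survey` and `valid_sex` are made up for the example), `isin` flags everything outside the accepted set so it can be turned into a regular missing value:

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})

valid_sex = ['M', 'F']   # assumed set of accepted categories

# Rows whose category is outside the accepted set
invalid = ~survey['Sex'].isin(valid_sex)
print(survey[invalid])

# Keep valid entries, turn everything else into a missing value
survey['Sex'] = survey['Sex'].where(~invalid, np.nan)
print(survey['Sex'].isnull().sum())   # 2
```

From there, the `dropna` and `fillna` tools from the first part of the lesson apply.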

![separator1](https://i.imgur.com/ZUWYTii.png)

### Duplicates

Checking for duplicate values is simple, although it behaves differently for Series and DataFrames. Let's start with Series. As an example, let's say we're throwing a fancy party and we're inviting Ambassadors from Europe, but we can only invite one ambassador per country. This is our original list, and as you can see, both the UK and Germany have duplicated ambassadors:

In [None]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth'
])

In [None]:
ambassadors

The two most important methods to deal with duplicates are `duplicated` (that will tell you which values are duplicates) and `drop_duplicates` (which will just get rid of duplicates):

In [None]:
ambassadors.duplicated()

In this case `duplicated` didn't consider `'Kim Darroch'`, the first instance of the United Kingdom or `'Peter Wittig'` as duplicates. That's because, by default, it'll consider the first occurrence of the value as not-duplicate. You can change this behavior with the `keep` parameter:

In [None]:
ambassadors.duplicated(keep='last')

In this case, the result is "flipped": `'Kim Darroch'` and `'Peter Wittig'` (the first ambassadors of their countries) are considered duplicates, but `'Peter Westmacott'` and `'Klaus Scharioth'` are not. You can also pass `keep=False`, which marks every value that appears 2 or more times as a duplicate, instead of sparing the first (or last) occurrence:

In [None]:
ambassadors.duplicated(keep=False)

A similar method is `drop_duplicates`, which just excludes the duplicated values and also accepts the `keep` parameter:

In [None]:
ambassadors.drop_duplicates()

In [None]:
ambassadors.drop_duplicates(keep='last')

In [None]:
ambassadors.drop_duplicates(keep=False)

### Duplicates in DataFrames

Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates:

In [None]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})

In [None]:
players

In the previous DataFrame, we clearly see that the value `Kobe Bryant` is duplicated in the `Name` column, but he appears with two different positions (`SG` and `SF` in the `Pos` column). We can call the `duplicated` method to check for duplicates. By default it works the same way as for Series: the first matching row is not considered a duplicate, but the subsequent ones are.

In [None]:
players.duplicated()

"Duplicated" for a DataFrame means that all the column values match: a row is only flagged if it is an exact copy of an earlier row.

We can also look for duplicates in specific column(s) using the `subset` parameter. We pass it the list of column names to consider, and `duplicated` then compares rows using only those columns:

In [None]:
players.duplicated(subset=['Name'])

And the same rules of `keep` still apply:

In [None]:
players.duplicated(subset=['Name'], keep='last')

`drop_duplicates` takes the same parameters:

In [None]:
players.drop_duplicates()

In [None]:
players.drop_duplicates(subset=['Name'])

In [None]:
players.drop_duplicates(subset=['Name'], keep='last')
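Before dropping anything, it can help to look at every row involved in a duplication. A small sketch combining `subset` with `keep=False` on the same data (the name `same_name` is just for illustration):

```python
import pandas as pd

players = pd.DataFrame({
    'Name': ['Kobe Bryant', 'LeBron James', 'Kobe Bryant',
             'Carmelo Anthony', 'Kobe Bryant'],
    'Pos': ['SG', 'SF', 'SG', 'SF', 'SF'],
})

# keep=False flags every member of a duplicated group
same_name = players[players.duplicated(subset=['Name'], keep=False)]
print(same_name)

# Number of fully identical rows beyond the first occurrence
print(players.duplicated().sum())   # 1
```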

![separator1](https://i.imgur.com/ZUWYTii.png)

### Text Handling

Cleaning text values can be incredibly hard. Invalid text values usually come from mistyping, which is unpredictable and doesn't follow any pattern. Thankfully, it's less common these days, as many data-entry tasks have been taken over by machines. Still, let's explore the most common cases:

#### Splitting Columns

The result of a survey is loaded and this is what you get:

In [None]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
]})

In [None]:
df

In pandas, Series and DataFrame columns can have dtypes other than int or float, such as object (used for strings) or datetime. Some of these dtypes come with special accessors that expose extra attributes and functions for that kind of data.

For strings, a Series has an accessor called `str` (other dtypes have their own, such as `dt` for datetimes). The `str` accessor exposes string functions that are applied to every element, such as `split`, `contains`, `strip` and `replace`.

You know that the single column represents the values "year, Sex, Country and number of children", but it's all been grouped in the same column and separated by underscores. Pandas has a convenient method named `split` that we can use in these situations. It breaks each string apart wherever it finds the separator (a string or regex) given as input, and returns a list of the separated pieces. The separator itself is not part of the output:

In [None]:
df['Data'].str.split('_')

To spread the elements of each list across separate columns, we can add the parameter `expand=True`:

In [None]:
df['Data'].str.split('_', expand=True)

In [None]:
df = df['Data'].str.split('_', expand=True)

In [None]:
df.columns = ['Year', 'Sex', 'Country', 'No Children']

In [None]:
df

You can also check which values contain a given pattern with the `contains` method:

In [None]:
df['Year'].str.contains(r'\?')  # raw string, so the backslash reaches the regex engine untouched

[`contains`](http://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.Series.str.contains.html) takes a regex/pattern as its first argument, so we need to escape the `?` symbol as it has a special meaning in these patterns. Regular letters don't need escaping:

In [None]:
df['Country'].str.contains('U')

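Removing blank spaces (like in `'US '` or `'I  T'`) can be achieved with `strip` (`lstrip` and `rstrip` also exist) or just `replace`. Note that `strip` only removes leading and trailing whitespace, so the inner spaces of `'I  T'` need `replace`: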

In [None]:
df['Country'].str.strip()

In [None]:
df['Country'].str.replace(' ', '')

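As we said, `replace` and `contains` can work with regex patterns, which makes it easier to replace values in bulk. In recent pandas versions `str.replace` treats the pattern as a literal string unless you pass `regex=True`:

In [None]:
df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'), regex=True)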

But, be warned:

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

As you can see, all these string/text-related operations are applied over the `str` attribute of the series. That's because they have a special place in Series handling and you can read more about it [here](https://pandas.pydata.org/pandas-docs/stable/text.html).
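Putting the pieces of this section together, the survey column can be cleaned end to end. This is only a sketch: the names `raw` and `clean` are made up, and it assumes that dropping the `?` marks and every space is the right fix for this data:

```python
import pandas as pd

raw = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2',
    ]})

clean = raw['Data'].str.split('_', expand=True)
clean.columns = ['Year', 'Sex', 'Country', 'No Children']

# Drop the stray '?' (regex=False treats it as a literal character)
clean['Year'] = clean['Year'].str.replace('?', '', regex=False).astype(int)
# Remove every space, including the ones inside 'I  T'
clean['Country'] = clean['Country'].str.replace(' ', '', regex=False)
clean['No Children'] = clean['No Children'].astype(int)

print(clean)
print(clean.dtypes)
```

Converting `Year` and `No Children` to integers at the end also acts as a check: `astype(int)` raises an error if any non-numeric text survived the cleaning.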

![separator2](https://i.imgur.com/4gX5WFr.png)

## Visualizations

Apart from evaluating the data statistically and mathematically, it also helps to __see__ the data to understand it better. That's where matplotlib comes into the picture. matplotlib is a library that provides graphical representations of data through many types of graphs and charts, letting us __plot__ the data. The `plot` function of pandas also uses matplotlib internally.

In matplotlib, there are 2 ways to create plots: the Global API and the OOP Interface.
* [Global API](https://matplotlib.org/api/index.html#the-pyplot-api): after importing pyplot (`import matplotlib.pyplot as plt`), we call functions on the `plt` module directly to create the figure, plot the data, add labels and titles, etc. The drawback is that every call changes the state of a single, globally shared "current" figure and axes. If we create multiple plots in our program, the sequence of calls becomes very important, since each one has some impact on the global state. The resulting code is not very modular and is prone to mistakes.
* [OOP Interface](https://matplotlib.org/api/index.html#the-object-oriented-api): we create `Figure` and `Axes` objects from `plt` (for example with `plt.subplots`) and then call methods on those objects to plot our data. These calls don't depend on the global state, so we can create as many `Axes` objects as we want and each of them maintains its state independently. This makes the plots easier to manage, and we can be sure that changing an attribute of one plot will not affect another.

### API Overview

#### Global API

Matplotlib's default pyplot API has a global, MATLAB-style interface:

In [None]:
x = np.arange(-10, 11)

In a basic plot, we specify the following things -
* `figure` - This is used to create a new figure. A figure is the canvas in which the plot and the other operations are drawn. We can pass `figsize` to set the width and height (in inches) of the figure
* `title` - This sets the title of the plot, which makes it more expressive
* `plot` - The plot function is used to plot the actual data. We pass it the values for the x-axis and the y-axis

In [None]:
plt.figure(figsize=(12, 6))

plt.title('My Nice Plot')

plt.plot(x, x ** 2)
plt.plot(x, -1 * (x ** 2))


We can also draw multiple plots in a single figure by using the `subplot` function:

In [None]:
plt.figure(figsize=(12, 6))
plt.suptitle('My Nice Plot')  # figure-level title; plt.title here would create an extra empty axes first

plt.subplot(1, 2, 1)  # rows, columns, panel selected
plt.plot(x, x ** 2)
plt.plot([0, 0, 0], [-10, 0, 100])
plt.legend(['X^2', 'Vertical Line'])
plt.xlabel('X')
plt.ylabel('X Squared')

plt.subplot(1, 2, 2)
plt.plot(x, -1 * (x ** 2))
plt.plot([-10, 0, 10], [-50, -50, -50])
plt.legend(['-X^2', 'Horizontal Line'])

plt.xlabel('X')
plt.ylabel('X Squared')

#### OOP Interface
In the OOP Interface, as we saw, we work with `Figure` and `Axes` objects. The `subplots` function creates a figure and as many subplots as we want in that figure. It returns 2 objects, `fig` and `axes`. `fig` is the [figure](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure) object and `axes` is either a single [axes](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes) object or an array of them (it will be an array when we create multiple plots in a single figure; we will see more examples below).

In [None]:
fig, axes = plt.subplots(figsize=(12, 6))

In [None]:
axes.plot(
    x, (x ** 2), color='red', linewidth=3,
    marker='o', markersize=8, label='X^2')

axes.plot(x, -1 * (x ** 2), 'b--', label='-X^2')

axes.set_xlabel('X')
axes.set_ylabel('X Squared')

axes.set_title("My Nice Plot")

axes.legend()

fig

In [None]:
fig, axes = plt.subplots(figsize=(12, 6))

axes.plot(x, x + 0, linestyle='solid')
axes.plot(x, x + 1, linestyle='dashed')
axes.plot(x, x + 2, linestyle='dashdot')
axes.plot(x, x + 3, linestyle='dotted');

axes.set_title("My Nice Plot")

In [None]:
fig, axes = plt.subplots(figsize=(12, 6))

axes.plot(x, x + 0, '-og', label="solid green")
axes.plot(x, x + 1, '--c', label="dashed cyan")
axes.plot(x, x + 2, '-.b', label="dashdot blue")
axes.plot(x, x + 3, ':r', label="dotted red")

axes.set_title("My Nice Plot")

axes.legend()

There are a lot of line and marker types.

In [None]:
print('Markers: {}'.format([m for m in plt.Line2D.markers]))

In [None]:
linestyles = ['-', '--', '-.', ':']  # solid, dashed, dashdot, dotted

print('Line styles: {}'.format(linestyles))

![separator1](https://i.imgur.com/ZUWYTii.png)

### Different types of plots

#### Figures and subfigures

When we call the `subplots()` function we get a tuple containing a `Figure` and an `Axes` object.

In [None]:
plot_objects = plt.subplots()

fig, ax = plot_objects

ax.plot([1,2,3], [1,2,3])

plot_objects

We can also define how many elements we want inside our figure. To do that we can set the `nrows` and `ncols` params.

In [None]:
plot_objects = plt.subplots(nrows=2, ncols=2, figsize=(14, 6))

fig, ((ax1, ax2), (ax3, ax4)) = plot_objects

plot_objects

In [None]:
ax4.plot(np.random.randn(50), c='yellow')
ax1.plot(np.random.randn(50), c='red', linestyle='--')
ax2.plot(np.random.randn(50), c='green', linestyle=':')
ax3.plot(np.random.randn(50), c='blue', marker='o', linewidth=3.0)


fig
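When the grid gets larger, unpacking every axes by name becomes tedious. A sketch of an alternative: `subplots` returns the grid as a numpy array, so we can loop over it (the color list is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 6))

# axes is a 2x2 numpy array of Axes objects; .flat walks it row by row
colors = ['red', 'green', 'blue', 'orange']
for ax, color in zip(axes.flat, colors):
    ax.plot(np.random.randn(50), c=color)
    ax.set_title(color)

fig.tight_layout()   # avoid overlapping titles and tick labels
```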

##### The `subplot2grid` command

There is another way to make subplots using a grid-like format:

In [None]:
plt.figure(figsize=(14, 6))

ax1 = plt.subplot2grid((3,3), (0,0), colspan=3)
ax2 = plt.subplot2grid((3,3), (1,0), colspan=2)
ax3 = plt.subplot2grid((3,3), (1,2), rowspan=2)
ax4 = plt.subplot2grid((3,3), (2,0))
ax5 = plt.subplot2grid((3,3), (2,1))

#### Scatter Plot

In [None]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (20 * np.random.rand(N))**2  # marker areas for radii from 0 to 20 points

In [None]:
plt.figure(figsize=(14, 6))

plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Spectral')
plt.colorbar()

plt.show()

In [None]:
fig = plt.figure(figsize=(14, 6))

ax1 = fig.add_subplot(1,2,1)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel1')
plt.colorbar()

ax2 = fig.add_subplot(1,2,2)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel2')
plt.colorbar()

plt.show()

The full list of available `cmap` options is at: https://matplotlib.org/users/colormaps.html

#### Histograms

In [None]:
values = np.random.randn(1000)

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))  # keep a reference to this figure so it can be saved later

plt.hist(values, bins=100, alpha=0.8,
          histtype='bar', color='steelblue',
          edgecolor='green')
plt.xlim(xmin=-5, xmax=5)

plt.show()

In [None]:
fig.savefig('hist.png')

#### KDE (kernel density estimation)

In [None]:
from scipy import stats

density = stats.gaussian_kde(values)  # the stats.kde namespace is deprecated in recent SciPy
density

In [None]:
plt.subplots(figsize=(12, 6))

values2 = np.linspace(min(values)-10, max(values)+10, 100)

plt.plot(values2, density(values2), color='#FF7F00')
plt.fill_between(values2, 0, density(values2), alpha=0.5, color='#FF7F00')
plt.xlim(xmin=-5, xmax=5)

plt.show()

#### Combine plots

In [None]:
plt.subplots(figsize=(12, 6))

plt.hist(values, bins=100, alpha=0.8, density=1,
          histtype='bar', color='steelblue',
          edgecolor='green')

plt.plot(values2, density(values2), color='#FF7F00', linewidth=3.0)
plt.xlim(xmin=-5, xmax=5)

plt.show()

#### Bar plots

In [None]:
Y = np.random.rand(1, 5)[0]
Y2 = np.random.rand(1, 5)[0]

In [None]:
plt.figure(figsize=(12, 4))

barWidth = 0.5
plt.bar(np.arange(len(Y)), Y, width=barWidth, color='#00b894')

plt.show()

Bars can also be stacked, and a legend added to the plot:

In [None]:
plt.figure(figsize=(12, 4))

barWidth = 0.5
plt.bar(np.arange(len(Y)), Y, width=barWidth, color='#00b894', label='Label Y')
plt.bar(np.arange(len(Y2)), Y2, width=barWidth, color='#e17055', bottom=Y, label='Label Y2')

plt.legend()
plt.show()

#### Boxplots and outlier detection

In [None]:
values = np.concatenate([np.random.randn(10), np.array([10, 15, -10, -15])])

In [None]:
plt.figure(figsize=(12, 4))

plt.hist(values)

In [None]:
plt.figure(figsize=(12, 4))

plt.boxplot(values)
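
The points that a boxplot draws beyond its whiskers can also be found numerically. The sketch below assumes matplotlib's default whisker rule (1.5 times the IQR beyond the quartiles) and uses a small hand-made array, `toy`, rather than the random `values` above.

```python
import numpy as np

toy = np.array([-15, -10, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 10, 15], dtype=float)

# Quartiles and interquartile range
toy_q1, toy_q3 = np.percentile(toy, [25, 75])
toy_iqr = toy_q3 - toy_q1

# Anything outside these fences would be drawn as an individual point
lower_fence = toy_q1 - 1.5 * toy_iqr
upper_fence = toy_q3 + 1.5 * toy_iqr

toy_outliers = toy[(toy < lower_fence) | (toy > upper_fence)]
print(toy_outliers)
```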

![separator2](https://i.imgur.com/4gX5WFr.png)

## Exercise

### Introduction to exercise

Let's get started by importing Bitcoin and Ether data:

In [None]:
df = pd.read_csv(
    'data/btc-eth-prices-outliers.csv',
    index_col=0,
    parse_dates=True
)

In [None]:
df.head()

And now we can run a simple visualization:

In [None]:
df.plot(figsize=(16, 9))

There are clearly some invalid values: both ETH and BTC have huge spikes. On top of that, there seems to be some data missing in Ether in December 2017:

In [None]:
df.loc['2017-12': '2017-12-15'].plot(y='Ether', figsize=(16, 9))

In [None]:
df_na = df.loc['2017-12': '2017-12-15']

Are those null values?

In [None]:
df_na['Ether'].isna().values.any()

When? Which periods of time?

In [None]:
df_na.loc[df_na['Ether'].isna()]
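
The same checks can be tried on a tiny, made-up frame where the answer is known in advance. The names `toy_df`, `missing_per_column` and `missing_dates` are illustrative only.

```python
import numpy as np
import pandas as pd

toy_idx = pd.date_range('2017-12-01', periods=6, freq='D')
toy_df = pd.DataFrame({
    'Bitcoin': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    'Ether': [1.0, np.nan, np.nan, 4.0, 5.0, np.nan],
}, index=toy_idx)

# How many values are missing in each column?
missing_per_column = toy_df.isna().sum()

# On which dates is Ether missing?
missing_dates = toy_df.index[toy_df['Ether'].isna()]

print(missing_per_column)
print(missing_dates)
```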

Let's add a little bit more context:

In [None]:
df.loc['2017-12-06': '2017-12-12']

We now need to decide what to do with the missing values. Drop them? Fill them? If we decide to fill them, what will we use as the fill value? For example, we can backfill: use the next valid value and assume the price was already at that level.

In [None]:
df.loc['2017-12-06': '2017-12-12'].bfill()

In [None]:
df.bfill(inplace=True)
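
Fill strategies differ in which neighbor they copy. A toy series (not the real price data) makes the difference visible. Which one is appropriate depends on what you are willing to assume about the gap.

```python
import numpy as np
import pandas as pd

toy_prices = pd.Series([100.0, np.nan, np.nan, 130.0])

forward = toy_prices.ffill()        # repeat the previous known value
backward = toy_prices.bfill()       # pull the next known value back
linear = toy_prices.interpolate()   # straight line between known values

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'interpolate': linear}))
```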

Let's take a look now:

In [None]:
df.plot(figsize=(16, 9))

Much better. We now need to fix the huge spikes. The first step is identifying them. How can we do it? The simplest answer is, of course, visually. They seem to be located in the last 10 days of December 2017 and the first days of March 2018:

In [None]:
df.loc['2017-12-25':'2018-01-01'].plot()

In [None]:
df.loc['2018-03-01':'2018-03-09'].plot()

Apparently, they occur on '2017-12-28' and '2018-03-04':

In [None]:
df_cleaned = df.drop(pd.to_datetime(['2017-12-28', '2018-03-04']))

In [None]:
df_cleaned.plot(figsize=(16, 9))

Now it looks much better. Our data seems to be clean.

### Cleaning Analysis 

Visualizations help us make sense of the data and let us judge whether our analysis is on the right track. But we need a more powerful method to handle our data. That's what we call "analysis". We'll use _analytical_ methods to identify these outliers or skewed values.

#### Central Tendency

We'll use a set of common indicators to measure central tendency and identify these outliers:

##### mean
The mean is probably the most common and popular one. The problem is that it's really sensitive to outliers. The mean of our dataset with invalid values is:

In [None]:
df.mean()

Both values seem too high. That's because the outliers are skewing the mean:

In [None]:
df_cleaned.mean()

##### median

In [None]:
df.median()
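
To see why the median is a useful companion to the mean, compare both on a made-up series before and after adding a single extreme value. The names `calm` and `spiky` are illustrative only.

```python
import pandas as pd

calm = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
# Same data plus one extreme value
spiky = pd.concat([calm, pd.Series([1000.0])], ignore_index=True)

print("mean:  ", calm.mean(), "->", spiky.mean())
print("median:", calm.median(), "->", spiky.median())
```

The mean jumps by more than an order of magnitude, while the median barely moves.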

##### mode

It doesn't make much sense to measure the mode, as we have continuous values. But you can do it just with `df.mode()`.

#### Visualizing distribution
Now we can use a few of the charts that we saw before, plus seaborn, to visualize the distribution of our values. In particular, we're interested in **histograms**:

In [None]:
df_cleaned.plot(kind='hist', y='Ether', bins=150)

In [None]:
df_cleaned.plot(kind='hist', y='Bitcoin', bins=150)

Using seaborn:

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Ether'], ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], rug=True, ax=ax)

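Seaborn's `distplot` is a general function that plots a histogram, a KDE and, optionally, a rugplot. (It is deprecated since seaborn 0.11 in favor of `histplot` and `displot`.) You can also use the individual plots separately: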

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
# `fill=True` replaces the deprecated `shade=True` (seaborn >= 0.11)
sns.kdeplot(df_cleaned['Ether'], fill=True, cut=0, ax=ax)
sns.rugplot(df_cleaned['Ether'], ax=ax);

We can also visualize a cumulative plot of our distribution:

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], ax=ax,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))


This plot shows what fraction of the samples falls at or below a certain value. We can increase the number of bins in order to have more detail:

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], ax=ax, bins=50,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
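
The quantity that a cumulative plot displays can be checked by hand: for a given threshold, it is the fraction of observations at or below it. A minimal sketch with a made-up sample follows; `ecdf_at` is a hypothetical helper, not a seaborn or pandas function.

```python
import numpy as np

toy_sample = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def ecdf_at(sample, threshold):
    """Fraction of observations less than or equal to `threshold`."""
    return float(np.mean(np.asarray(sample) <= threshold))

# 5 of the 8 values are <= 4
print(ecdf_at(toy_sample, 4.0))
```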


#### Visualizing bivariate distributions

The most common way to observe a bivariate distribution is a scatterplot. The `jointplot` function also includes the distribution of each variable:

In [None]:
sns.jointplot(x="Bitcoin", y="Ether", data=df_cleaned, height=9)

If you want only the scatter plot, you can use the `regplot` function, which also fits a linear regression model to the data and draws it on the plot:

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.regplot(x="Bitcoin", y="Ether", data=df_cleaned, ax=ax)

#### Quantiles, quartiles and percentiles

In [None]:
df_cleaned['Bitcoin'].quantile(.2)

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], ax=ax, bins=50,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
ax.axhline(0.2, color='red')
ax.axvline(df_cleaned['Bitcoin'].quantile(.2), color='red')

In [None]:
df_cleaned['Bitcoin'].quantile(.5)

In [None]:
df_cleaned['Bitcoin'].median()

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], ax=ax, bins=50,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
ax.axhline(0.5, color='red')
ax.axvline(df_cleaned['Bitcoin'].quantile(.5), color='red')

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df_cleaned['Bitcoin'], ax=ax, bins=50,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
ax.axhline(0.5, color='red')
ax.axvline(df_cleaned['Bitcoin'].median(), color='red')

Quantile `0.25` == Percentile `25%` == Quartile `1st`
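
This equivalence is easy to verify on a made-up series where the answer is obvious. The series below simply holds the numbers 1 to 101, so the first quartile is 26 and the median is 51.

```python
import numpy as np
import pandas as pd

toy_series = pd.Series(np.arange(1, 102, dtype=float))  # 1, 2, ..., 101

q1_quantile = toy_series.quantile(0.25)        # quantile 0.25
q1_percentile = np.percentile(toy_series, 25)  # percentile 25%
q1_describe = toy_series.describe()['25%']     # 1st quartile as reported by describe()

print(q1_quantile, q1_percentile, q1_describe)
```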

### Dispersion

We'll use a few methods to measure dispersion in our dataset, most of them well known:

* Range
* Variance and Standard Deviation
* IQR

#### Range

Range is fairly simple to understand: it's just the maximum value minus the minimum value:

In [None]:
df['Bitcoin'].max() - df['Bitcoin'].min()

Range is **really** sensitive to outliers. As you can see, the range value is extremely high (might indicate the presence of outliers / invalid values).

In [None]:
df_cleaned['Bitcoin'].max() - df_cleaned['Bitcoin'].min()

This value now makes a lot more sense. We know that Bitcoin had a high of about 20k, and it was around 900 when we started measuring.

#### Variance and Standard Deviation

In [None]:
df['Bitcoin'].var()

In [None]:
df['Bitcoin'].std()

Both variance and std are sensitive to outliers as well. We can check with our cleaned dataset:

In [None]:
df_cleaned['Bitcoin'].var()

In [None]:
df_cleaned['Bitcoin'].std()

#### IQR

The [Interquartile range](https://en.wikipedia.org/wiki/Interquartile_range) is a good measure of "centered" dispersion, and is calculated as `Q3 - Q1` (3rd quartile - 1st quartile).

In [None]:
df['Bitcoin'].quantile(.75) - df['Bitcoin'].quantile(.25)

In [None]:
df_cleaned['Bitcoin'].quantile(.75) - df_cleaned['Bitcoin'].quantile(.25)

As you can see, IQR is more robust than std or range, because it's not so sensitive to outliers.
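
The three dispersion measures can be compared side by side on a made-up series, with and without one extreme value. `dispersion_summary` is a hypothetical helper written for this illustration.

```python
import pandas as pd

def dispersion_summary(series):
    """Range, standard deviation and IQR of a numeric Series."""
    return {
        'range': series.max() - series.min(),
        'std': series.std(),
        'iqr': series.quantile(0.75) - series.quantile(0.25),
    }

steady = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
# Same data plus one extreme value
with_spike = pd.concat([steady, pd.Series([1000.0])], ignore_index=True)

steady_stats = dispersion_summary(steady)
spike_stats = dispersion_summary(with_spike)
print(steady_stats)
print(spike_stats)
```

One extreme value inflates the range and the standard deviation by two orders of magnitude, while the IQR only moves from 4 to 4.5.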

### Analytical Analysis of invalid values

We can now use the measurements we've seen to analyze those values that seem invalid.

#### Using `std`: Z scores

We can now flag as invalid those values that lie more than a couple of standard deviations (Z scores) above or below the mean. For example, using two standard deviations:

In [None]:
upper_limit = df['Bitcoin'].mean() + 2 * df['Bitcoin'].std()
lower_limit = df['Bitcoin'].mean() - 2 * df['Bitcoin'].std()

In [None]:
print("Upper Limit: {}".format(upper_limit))
print("Lower Limit: {}".format(lower_limit))

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df['Bitcoin'], ax=ax)
ax.axvline(lower_limit, color='red')
ax.axvline(upper_limit, color='red')

This seems like a good measurement. The lower limit doesn't make much sense, as negative prices are invalid, but the upper limit works well: anything above \$27,369 is considered an invalid value, which is pretty accurate.
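
The limits above are equivalent to computing a Z score for every observation and keeping those whose absolute value is at most 2. A minimal sketch on a made-up series (`toy_btc` is not the real Bitcoin column):

```python
import pandas as pd

toy_btc = pd.Series([100.0] * 9 + [1000.0])

# Z score: distance from the mean, measured in standard deviations
z_scores = (toy_btc - toy_btc.mean()) / toy_btc.std()

# Flag anything more than 2 standard deviations away from the mean
flagged = toy_btc[z_scores.abs() > 2]
print(flagged)
```

Note that the outlier itself inflates both the mean and the standard deviation, so with very extreme or very numerous outliers this rule can become too lenient.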

#### Using IQRs

We can use the IQR instead of std if we think that the standard deviation might be **too** affected by the outliers/invalid values.

In [None]:
iqr = df['Bitcoin'].quantile(.75) - df['Bitcoin'].quantile(.25)
iqr

In [None]:
upper_limit = df['Bitcoin'].mean() + 2 * iqr
lower_limit = df['Bitcoin'].mean() - 2 * iqr

In [None]:
print("Upper Limit: {}".format(upper_limit))
print("Lower Limit: {}".format(lower_limit))

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(df['Bitcoin'], ax=ax)
ax.axvline(lower_limit, color='red')
ax.axvline(upper_limit, color='red')

Our measurement now is a little bit less precise. There are a few valid values (20k) that seem to be above our upper limit. Regardless, it's still a good indicator.
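
A different, widely used IQR-based rule measures the fences from the quartiles instead of the mean (often called Tukey's fences, with a factor of 1.5). It is not the rule used in this notebook. The sketch below only contrasts the two on made-up prices to show how the outlier drags the mean-based limit upward.

```python
import pandas as pd

toy_prices = pd.Series([900.0, 950.0, 1000.0, 1050.0, 1100.0, 1150.0, 1200.0, 50000.0])

toy_q1 = toy_prices.quantile(0.25)
toy_q3 = toy_prices.quantile(0.75)
toy_iqr = toy_q3 - toy_q1

# Tukey's fences: measured from the quartiles (a swapped-in alternative)
tukey_upper = toy_q3 + 1.5 * toy_iqr
# The variant used in the text: measured from the mean
mean_upper = toy_prices.mean() + 2 * toy_iqr

print(tukey_upper, mean_upper)
print(toy_prices[toy_prices > tukey_upper])
```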

#### Cleaning invalid values analytically

It's time now to remove these invalid values analytically. We'll use the upper limit defined by the standard deviation:

In [None]:
upper_limit = df['Bitcoin'].mean() + 2 * df['Bitcoin'].std()

In [None]:
df[df['Bitcoin'] < upper_limit].plot(figsize=(16, 7))

In [None]:
df.drop(df[df['Bitcoin'] > upper_limit].index).plot(figsize=(16, 7))
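
If the same filter has to be applied more than once, it can be wrapped in a small function that returns a new frame and leaves the original untouched. `drop_above_limit` and its `n_std` parameter are hypothetical names for this sketch, and the data is made up.

```python
import pandas as pd

def drop_above_limit(frame, column, n_std=2):
    """Return a copy of `frame` without rows where `column` > mean + n_std * std."""
    limit = frame[column].mean() + n_std * frame[column].std()
    # Note: rows where `column` is NaN are dropped too, since NaN <= limit is False
    return frame[frame[column] <= limit].copy()

toy_frame = pd.DataFrame({
    'Bitcoin': [100.0] * 9 + [1000.0],
    'Ether': [10.0] * 10,
})
toy_result = drop_above_limit(toy_frame, 'Bitcoin')
print(len(toy_frame), '->', len(toy_result))
```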

![separator2](https://i.imgur.com/4gX5WFr.png)