# Session Seven
# Programming with Python  MOD007891

# Data Manipulation with Pandas


# Outline:

 - Installing and Using Pandas
 - Pandas Objects
 - Pandas Series Object
 - Pandas DataFrame Object
 - Pandas Index Object
 - Data Indexing and Selection
 - Handling Missing Data



### Supplementary Datasets

- state-population.csv
- state-areas.csv
- state-abbrevs.csv
- births.csv

In the previous sessions, we dove into detail on **NumPy** and its ``ndarray`` object, which provides **efficient storage** and **manipulation** of dense typed arrays in Python.


Here we'll build on this knowledge by looking in detail at the **data structures** provided by the Pandas library.


Pandas is a newer package **built on top of NumPy**, and provides an efficient implementation of a ``DataFrame``.


``DataFrame``s are essentially **multidimensional arrays** with **attached row and column labels**, and often with **heterogeneous types** and/or **missing data.**


As well as offering a **convenient storage** interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.



**NumPy**'s limitations become clear when we need more **flexibility** (e.g., **attaching labels** to data, working with **missing data**, etc.).


Pandas, and in particular **its ``Series`` and ``DataFrame`` objects**, builds on the **NumPy array** structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.



## Installing and Using Pandas

Installation of Pandas on your system **requires NumPy** to be installed.



Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).


Once Pandas is installed, you can import it and check the version:


In [None]:
import pandas
pandas.__version__

Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:

In [None]:
import pandas as pd

This import convention will be used throughout the remainder of this module.

## Reminder about Built-In Documentation

As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). (Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb) if you need a refresher on this.)

For example, to display all the contents of the pandas namespace, you can type

```ipython
In [3]: pd.<TAB>
```

And to display Pandas's built-in documentation, you can use this:

```ipython
In [4]: pd?
```

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

# Introducing Pandas Objects


At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the **rows and columns are identified with labels rather than simple integer indices.**


Pandas provides a host of useful tools, methods, and functionality on top of the basic **data structures.**


Thus, before we go any further, let's introduce these **three fundamental Pandas data structures**:

- ``Series`` 
- ``DataFrame``
- ``Index``

We will start our code sessions with the standard NumPy and Pandas imports:

In [None]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a **one-dimensional array of indexed data.**
It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

As we see in the output, the ``Series`` wraps both a **sequence of values** and a **sequence of indices**, which we can access with the ``values`` and ``index`` attributes.


The ``values`` are simply a **NumPy array**

In [None]:
data.values

In [None]:
type(data.values)

In [None]:
data.values.dtype

The ``index`` is an **array-like** object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [None]:
data.index

Like with a NumPy array, data can be **accessed** by the **associated index** via the familiar Python square-bracket notation:

In [None]:
data[1]

In [None]:
data[1:3]

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.

### ``Series`` as generalized NumPy array


From what we've seen so far, it may look like the ``Series`` object is basically **interchangeable** with a **one-dimensional NumPy array.**


The essential difference is the **presence of the index:** while the NumPy array has an ***implicitly defined*** integer index used to access the values, the Pandas ``Series`` has an ***explicitly defined*** index associated with the values.

This explicit index definition gives the ``Series`` object **additional capabilities.**

For example, the index **need not be an integer,** but can consist of values of any desired type.
If we wish, we can use strings as an index:

In [None]:
# the index in Pandas Series need not be an integer
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

And the item access works as expected:

In [None]:
data['b']

We can even use **non-contiguous** or **non-sequential** indices:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

In [None]:
data[5]

### Series as specialized dictionary

You can think of a Pandas ``Series`` a bit like a specialization of a **Python dictionary.**


A dictionary is a structure that maps **arbitrary keys** to a set of **arbitrary values**, and a Pandas ``Series`` is a structure which **maps typed keys** to a set of **typed values.**

But because a Pandas ``Series`` is backed by a **NumPy array,** it is much **more efficient** than a **Python dictionary** for certain operations.
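
As a minimal sketch of this difference, the example below builds a ``Series`` from a small dictionary and doubles every value both ways. The names ``prices_dict`` and ``prices`` are hypothetical and are not part of the supplementary datasets.

```python
import pandas as pd

# Hypothetical example data (illustrative names only)
prices_dict = {'apple': 1.5, 'banana': 0.25, 'cherry': 4.0}
prices = pd.Series(prices_dict)

# The Series keeps its values in a single typed NumPy array
print(prices.dtype)            # float64

# A plain dictionary needs an explicit Python-level loop ...
doubled_dict = {key: value * 2 for key, value in prices_dict.items()}

# ... while the Series applies the operation to the whole array at once
doubled = prices * 2
print(doubled)
```

Both approaches give the same numbers; the ``Series`` version simply delegates the arithmetic to NumPy.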



In [None]:
# constructing a Pandas Series object directly from a Python dictionary

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

In [None]:
population['California']

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as **slicing:**

In [None]:
population['California':'New York']

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an **optional** argument, and ``data`` can be one of many entities.

For example, ``data`` can be a **list** or **NumPy array**, in which case ``index`` defaults to an integer sequence:

In [None]:
# where index is an optional argument
pd.Series([2, 4, 6])

In [None]:
# explicit indexing 
pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

``data`` can be a scalar, which is **repeated to fill** the specified index:

In [None]:
pd.Series(5, index=[100, 200, 300])

``data`` can be a **dictionary**, in which case ``index`` defaults to the dictionary keys (in insertion order in recent Pandas versions; older versions sorted them):

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})
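
The ordering of the resulting index can be checked directly. This sketch assumes a reasonably recent Pandas release, where the keys keep their insertion order; ``sort_index()`` gives the sorted form if that is what you need.

```python
import pandas as pd

letters = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# keys keep the order in which they were written in the dictionary
print(list(letters.index))          # [2, 1, 3]

# sort explicitly when a sorted index is wanted
sorted_letters = letters.sort_index()
print(list(sorted_letters.index))   # [1, 2, 3]
```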

In each case, the index can be **explicitly** set if a different result is preferred:

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
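
The reverse case is also worth knowing: if the explicit index names a key that the dictionary does not contain, that entry is filled with a missing value. A small sketch (the label ``4`` is chosen only for illustration):

```python
import pandas as pd

# 3 exists in the dictionary, 4 does not
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])
print(subset)

# the entry for the unknown key is marked as missing
print(subset.isna().tolist())   # [False, True]
```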

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.


The ``DataFrame`` can be thought of either as a **generalization of a NumPy array**, or as a specialization of a **Python dictionary.**




### DataFrame as a generalized NumPy array


If a ``Series`` is an analog of a **one-dimensional array** with flexible indices, a ``DataFrame`` is an analog of a **two-dimensional array** with both flexible row indices and flexible column names.


You can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.


Here, by ***"aligned"*** we mean that they **share the same index.**


- **Let's demonstrate this:**



Let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

Now that we have ``area`` along with the ``population`` Series, we can use a **dictionary** to construct a single two-dimensional object containing this information:

In [None]:
states = pd.DataFrame({'Population': population,'Area': area})
states

Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [None]:
states.index

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
states.columns
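
The two ``Series`` used here happen to share exactly the same labels. As a sketch of what "aligned" means when they do not, the example below uses two small, hypothetical ``Series`` (``heights`` and ``weights``) with only one label in common:

```python
import pandas as pd

# Hypothetical data: only 'Bob' appears in both Series
heights = pd.Series({'Ann': 165, 'Bob': 180})
weights = pd.Series({'Bob': 80, 'Cal': 72})

people = pd.DataFrame({'height': heights, 'weight': weights})
print(people)

# every label from either Series gets a row; gaps are filled with NaN
print(people.isna())
```

The resulting ``DataFrame`` has one row per label, and each column is matched to its row by label rather than by position.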

### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a **specialization of a dictionary.**


Where a dictionary **maps a key to a value**, a ``DataFrame`` maps a **column name** to a ``Series`` of **column data.**


For example, indexing with the ``'Area'`` column name returns the ``Series`` object containing the areas we saw earlier:

In [None]:
states['Area']

Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*.

For a ``DataFrame``, ``data['col0']`` will return the first *column*.


Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays.
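
A short side-by-side sketch of this difference, using a small illustrative array and the hypothetical column names ``col0`` and ``col1``:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row_of_array = arr[0]        # NumPy: a single index selects a row
first_column_of_df = df['col0']    # DataFrame: a single key selects a column
first_row_of_df = df.iloc[0]       # a row must be requested explicitly

print(first_row_of_array)
print(first_column_of_df.tolist())
print(first_row_of_df.tolist())
```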



### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [None]:
type(population)

In [None]:
# convert a single Pandas Series to a Pandas DataFrame
x = pd.DataFrame(population, columns=['population'])
x

In [None]:
type(x)

#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [None]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data

In [None]:
pd.DataFrame(data)

Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
# first row missing c, Second row missing a
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [None]:
population

In [None]:
area

In [None]:
# DataFrame made from two Series using the "dictionary of Series" syntax
pd.DataFrame({'population': population,'area': area})

#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names:

In [None]:
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

If the column and index names are omitted, an integer index is used for each:

In [None]:
# A was not defined earlier in this session; a small random array is assumed here
A = np.random.rand(3, 2)
pd.DataFrame(A)

## The Pandas Index Object

We have seen here that both the ``Series`` and ``DataFrame`` objects contain an **explicit *index*** that lets you reference and modify data.


This ``Index`` object is an interesting structure in itself, and it can be thought of either as:

-  *Immutable Array* 
-  *Ordered Set* 



In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

### Index as immutable array

The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

In [None]:
ind[1]

In [None]:
ind[::2]

``Index`` objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

One difference between ``Index`` objects and NumPy arrays is that **indices are immutable** and they **cannot be modified** via the normal means:

In [None]:
ind[1] = 0

This **immutability** makes it **safer** to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.
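
In practice, the failed assignment raises a ``TypeError``, and a changed index has to be built as a new object. The sketch below shows one possible way to do that with the ``delete()`` and ``insert()`` methods of ``Index``; it is only an illustration, not the only approach.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                      # in-place modification is not allowed
except TypeError as err:
    print("Index is immutable:", err)

# build a new Index instead; the original is left untouched
new_ind = ind.delete(1).insert(1, 0)
print(list(ind))        # [2, 3, 5, 7, 11]
print(list(new_ind))    # [2, 0, 5, 7, 11]
```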

### Index as ordered set

Pandas objects are designed to facilitate operations such as **joins** across datasets.


The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, so that **unions, intersections, differences,** and other combinations can be computed.



In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

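In [None]:
indA.intersection(indB)  # intersection

In [None]:
indA.union(indB)  # union

In [None]:
indA.symmetric_difference(indB)  # symmetric difference

Older Pandas releases also exposed these operations through the ``&``, ``|``, and ``^`` operators. That behavior was deprecated, and in Pandas 2.0 and later the operators act element-wise instead, so the methods shown here are the reliable way to perform set operations.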

# Data Indexing and Selection

Previously, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays.


These included:

- indexing (e.g., ``arr[2, 1]``)
- slicing (e.g., ``arr[:, 1:5]``)
- masking (e.g., ``arr[arr > 0]``), 
- fancy indexing (e.g., ``arr[0, [1, 5]]``)
- combinations thereof (e.g., ``arr[:, [1, 5]]``).


Here we'll look at **similar means** of **accessing** and **modifying** values in Pandas ``Series`` and ``DataFrame`` objects.




## Data Selection in Series

As we saw in the previous section, a ``Series`` object acts in many ways like:


- A standard **Python dictionary**
- A one-dimensional **NumPy array**



If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [None]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

We can also use **dictionary-like Python expressions** and methods to examine the keys/indices and values:

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
data.values

In [None]:
list(data.items())

``Series`` objects can even be **modified with a dictionary-like syntax.**


Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [None]:
# assigning a new item to Pandas Series  
data['e'] = 1.25
data

### Series as one-dimensional array

A ``Series`` provides array-style **item selection, slices, masking, and fancy indexing** via the same basic mechanisms as **NumPy arrays.**


Examples of these are as follows:

In [None]:
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2]

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'e']]

**NOTE:** 

Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index (upper bound) is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index (upper bound) is *excluded* from the slice.
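
The difference in the upper bound is easy to verify. This sketch uses the ``loc`` and ``iloc`` indexers so that the kind of slice is unambiguous; they follow the same inclusive and exclusive conventions described in the note.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: 'c' is included
by_position = data.iloc[0:2]     # positional slice: position 2 is excluded

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```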

## Indexers: loc, iloc


Pandas provides some special **indexer attributes** that explicitly expose certain indexing schemes.




First, the ``loc`` attribute allows indexing and slicing that always references the **explicit index:**


In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

In [None]:
data.loc['a']

In [None]:
data.loc['a':'c']

The ``iloc`` attribute allows indexing and slicing that always references the **implicit Python-style index:**

In [None]:
data.iloc[0]

In [None]:
data.iloc[1:3]

**NOTE:**

One guiding principle of Python code is that "explicit is better than implicit."



The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code. Especially in the case of integer indexes, I recommend using them, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.
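
The risk of confusion is greatest when the explicit index itself consists of integers. The sketch below uses a small illustrative ``Series`` whose labels are ``1, 3, 5`` to show that ``loc`` and ``iloc`` give different answers for the same numbers:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: the numbers are labels, and the slice includes its upper bound
print(data.loc[1])               # 'a'
print(data.loc[1:3].tolist())    # ['a', 'b']

# iloc: the numbers are positions, and the slice excludes its upper bound
print(data.iloc[1])              # 'b'
print(data.iloc[1:3].tolist())   # ['b', 'c']
```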

## Data Selection in DataFrame

A ``DataFrame`` acts in many ways like:

- A **dictionary of ``Series``**
- A **two-dimensional array,** 






### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

# Join two Series using dictionary syntax to create a DataFrame
data = pd.DataFrame({'area':area, 'pop':pop})
data

The individual ``Series`` that make up the columns of the ``DataFrame`` can be **accessed via dictionary-style indexing** of the column name:

In [None]:
# dictionary-style indexing
data['area']

Equivalently, we can use **attribute-style access** with column names that are strings:

In [None]:
# attribute-style of indexing
data.area
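
Attribute-style access is a shorthand and does not work for every column. As a sketch, the small ``DataFrame`` below (a two-row stand-in for the one above) shows that ``data.area`` matches ``data['area']``, while ``data.pop`` refers to the ``pop()`` method of ``DataFrame`` rather than to the ``'pop'`` column:

```python
import pandas as pd

# Two-row stand-in for the states DataFrame
data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# attribute access and dictionary access agree for 'area'
same_values = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop() method, so data.pop is a method
pop_is_method = callable(data.pop)

print(same_values, pop_is_method)   # True True
print(data['pop'].tolist())         # dictionary-style access still works
```

For this reason dictionary-style indexing is the safer default, particularly when assigning to columns.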



Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be **used to modify** the object, in this case adding a new column:

In [None]:
data['density'] = data['pop'] / data['area']
data

This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects.

### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced **two-dimensional array.**


We can examine the raw underlying data array using the ``values`` attribute:

In [None]:
data.values

Many familiar **array-like** operations can be done on the ``DataFrame`` itself.


For example, we can **transpose** the full ``DataFrame`` to swap rows and columns:

In [None]:
data

In [None]:
data.T

In [None]:
data

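When it comes to indexing, however, the dictionary-style indexing of columns means a ``DataFrame`` cannot simply be treated as a NumPy array. Passing a single index to the underlying array accesses a row:

In [None]:
data.values[1]

whereas passing a single "index" to a ``DataFrame`` accesses a column: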

In [None]:
data['area']

Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc``, ``iloc`` indexers mentioned earlier.


Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array, but the ``DataFrame`` index and column labels are maintained in the result:

In [None]:
data.iloc[:3]

In [None]:
data.iloc[:3, :2]

Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the **explicit index** and **column names:**

In [None]:
data.loc[:'Illinois', :'pop']

Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:

In [None]:
data.loc[data.density > 100, ['pop', 'density']]

#### Modify values using ``iloc``
Any of these indexing conventions may also be **used to set or modify values;** this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
data.iloc[0, 2] = 90
data
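
The same kind of assignment works with ``loc``, using labels or a Boolean mask instead of integer positions. A minimal sketch on a two-row stand-in for the states ``DataFrame`` (the new values are made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# set a single cell by row label and column name
data.loc['Texas', 'pop'] = 27000000

# set every row that satisfies a condition
data.loc[data['area'] > 500000, 'area'] = 0

print(data)
```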

To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.

### Additional indexing conventions

There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

In [None]:
data['Florida':'Illinois']

Such slices can also refer to rows by number rather than by index:

In [None]:
data[1:3]

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
data[data.density > 100]

## Ufuncs in Pandas

One of the essential pieces of NumPy is its vectorized operations (ufuncs): the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).


Pandas inherits much of this functionality from NumPy, and ufuncs are the key to it.

Pandas includes a couple useful twists, however: 

- For **unary operations** like **negation** and **trigonometric functions**, these ufuncs will **preserve index and column labels** in the output.


- For **binary** operations such as **addition** and **multiplication,** Pandas will automatically **align indices** when passing the objects to the ufunc.




## Ufuncs: Index Preservation (unary operations)

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Create a random Pandas Series
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

In [None]:
# Create a random Pandas Dataframe
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [None]:
np.exp(ser)

Or, for a slightly more complex calculation:

In [None]:
np.sin(df * np.pi / 4)

## UFuncs: Index Alignment (binary operations)

For **binary operations** on two ``Series`` or ``DataFrame`` objects, Pandas will **align indices** in the process of performing the operation.
This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are **combining two different data sources,** and find only the **top three** US states by **area** and the **top three** US states by **population**:

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [None]:
area

In [None]:
population

**Texas** and **California** are the common rows.  
Let's see what happens when we **divide** these to compute the **population density:**

In [None]:
population / area

The resulting array contains the ***union* of indices** of the two input arrays, which can be computed directly with the ``union()`` method of these indices:

In [None]:
area.index.union(population.index)

Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data.


This index matching applies to any of Python's built-in arithmetic expressions; missing values are filled in with ``NaN`` by default:

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

If using ``NaN`` values is not the desired behavior, the **fill value** can be modified using appropriate **object methods** in place of the **operators.**

- What are **object methods**?


For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [None]:
A.add(B, fill_value=0)

### Index alignment in DataFrame

A similar type of alignment takes place for *both* **columns** and **indices** when performing operations on ``DataFrame``s:

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

In [None]:
# Reminder 
list('ABC')

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

In [None]:
A + B

Notice that indices are **aligned correctly irrespective of their order** in the two objects, and **indices in the result are sorted.**


As was the case with ``Series``, we can use the associated **object's arithmetic method** and pass any desired ``fill_value`` to be used in place of missing entries.


Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [None]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
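
As a sketch of how the table is used, the example below reuses the small ``A`` and ``B`` series from the index alignment section and applies two of the listed methods with a ``fill_value``:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# without a fill value, the method and the operator give the same result
same = (A - B).equals(A.sub(B))

# with a fill value, a label missing from one side is replaced before the operation
diff = A.sub(B, fill_value=0)
prod = A.mul(B, fill_value=1)

print(same)             # True
print(diff.tolist())    # [2.0, 3.0, 3.0, -5.0]
print(prod.tolist())    # [2.0, 4.0, 18.0, 5.0]
```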


# Handling Missing Data

The difference between **data found in many tutorials** and data in the **real world** is that real-world data is **rarely clean** and **homogeneous.**


In particular, many interesting datasets will have some amount of **data missing.**


To make matters even more complicated, different data sources may indicate missing data in different ways.


In this section, we will discuss some general considerations for missing data, discuss **how Pandas chooses to represent it,** and demonstrate some built-in Pandas tools for **handling missing data** in Python.


Here and throughout this module, we'll refer to missing data in general as *null*, *NaN*, or *NA* values. 

## Trade-Offs in Missing Data Conventions

There are a number of **schemes** that have been developed to indicate the **presence of missing data** in a table or DataFrame.


Generally, they revolve around one of **two strategies:**



- **Masking approach:** In the masking approach, the **mask** might be an **entirely separate Boolean array,** or it may involve appropriation of **one bit** in the data representation to **locally indicate** the null status of a value.

- **Sentinel approach:** In the **sentinel approach**, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).


### Trade-offs:
None of these approaches is without **trade-offs:** use of a **separate mask array** requires **allocation of an additional Boolean array,** which adds **overhead** in both storage and computation.


A **sentinel value** reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like **NaN** are not available for all data types.
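
The two strategies can be sketched in plain NumPy. The example below is only an illustration: the readings are made up, the mask is held in NumPy's masked-array type, and ``-9999`` is an arbitrary sentinel chosen for this example.

```python
import numpy as np

# Masking approach: a separate Boolean array marks the missing entry
readings = np.array([3, 7, 5, 9])
mask = np.array([False, True, False, False])     # True means "missing"
masked = np.ma.masked_array(readings, mask=mask)
masked_mean = masked.mean()                      # the masked entry is skipped

# Sentinel approach: a reserved value inside the data marks the missing entry
SENTINEL = -9999                                 # arbitrary choice for this sketch
with_sentinel = np.array([3, SENTINEL, 5, 9])
sentinel_mean = with_sentinel[with_sentinel != SENTINEL].mean()

print(masked_mean, sentinel_mean)
```

Both give the mean of the three valid readings, but the first needs an extra array and the second gives up one value of the integer range and requires every computation to filter it out.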


Different languages and systems use different conventions.
For example, the **R** language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the SciDB system uses an extra byte attached to every cell to indicate an NA state.

## Missing Data in Pandas

The way in which **Pandas handles missing values** is constrained by its reliance on the **NumPy package,** which **does not have a built-in notion of NA values** for non-floating-point data types.

Pandas could have followed **R's** lead in specifying **bit patterns** for each individual data type to **indicate nullness,** but this approach turns out to be rather unwieldy.

### Why not?

While **R** contains **four basic data types,** NumPy supports ***far* more** than this: for example, while R has a single integer type, NumPy supports *fourteen* basic integer types once you account for available precisions, signedness, and endianness of the encoding.


**Reserving** a specific **bit pattern** in **all available NumPy types** would lead to an unwieldy amount of **overhead** in special-casing various operations for various types, likely even requiring a new fork of the NumPy package. 


Further, for the smaller data types (such as 8-bit integers), **sacrificing a bit** to use **as a mask** will significantly **reduce the range** of values it can represent.
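
The size of this loss can be worked out for an 8-bit integer. The sketch below assumes that the remaining seven bits would still be interpreted as a signed two's-complement number; that is an assumption made for illustration, not a description of any real implementation.

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)            # -128 127

# reserve one of the 8 bits as a "missing" flag, leaving 7 bits for the value
bits_left = info.bits - 1
reduced_min = -2 ** (bits_left - 1)
reduced_max = 2 ** (bits_left - 1) - 1
print(reduced_min, reduced_max)      # -64 63
```

Under this assumption the representable range is halved.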



With these constraints in mind, **Pandas chose to use sentinels** for **missing data,** and further chose to use two already-existing Python null values: the special floating-point **``NaN``** value, and the Python **``None``** object.


This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.

### ``None``: Pythonic missing data

The first sentinel value used by Pandas is ``None``, a Python singleton object (an instance of the singleton design pattern) that is often used for missing data in Python code.


Because it is a **Python object,** ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., **arrays of Python objects**):

In [None]:
import numpy as np
import pandas as pd

In [None]:
vals1 = np.array([1, None, 3, 4])
vals1

This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are **Python objects.**


This kind of **object array** is useful for some purposes, **but** any operations on the data will be done at the **Python level,** with much **more overhead** than the typically fast operations seen for arrays with native types:

In [None]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

The **use** of **Python objects** in an array also means that if you perform **aggregations** like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally **get an error:**

In [None]:
vals1.sum()

This reflects the fact that addition between an integer and ``None`` is undefined.

### ``NaN``: Missing numerical data

The other missing data representation, ``NaN`` (short for ***Not a Number***), is different: it is a **special floating-point value** recognized by all systems that use the standard IEEE floating-point representation:

In [None]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

Notice that NumPy chose a **native floating-point** type for this array: this means that unlike the object array from before, **this array supports fast operations** pushed into compiled code.


**NOTE:** You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches.

Regardless of the operation, **the result of arithmetic with ``NaN`` will be another ``NaN``:**

In [None]:
1 + np.nan

In [None]:
0 *  np.nan

Note that this means that **aggregates over the values** are well defined (they do not raise an **error,** unlike with ``None``), but they are not always **useful:**

In [None]:
vals2.sum()

In [None]:
vals2.min()

In [None]:
vals2.max()

**NumPy** does provide some **special aggregations** that will ignore these missing values:

In [None]:
np.nansum(vals2)

In [None]:
np.nanmin(vals2)

In [None]:
np.nanmax(vals2)

Keep in mind that ``NaN`` is specifically a **floating-point value;** there is no equivalent NaN value for integers, strings, or other types.
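
This can be checked on a plain NumPy integer array: assigning ``NaN`` raises an error, and the array has to be converted to a floating-point type first. A minimal sketch:

```python
import numpy as np

ints = np.array([1, 2, 3])
raised = False
try:
    ints[0] = np.nan                 # an integer array has no slot for NaN
except ValueError as err:
    raised = True
    print("cannot store NaN in an integer array:", err)

# converting to float makes room for the missing value
floats = ints.astype(float)
floats[0] = np.nan
print(floats)
```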

### NaN and None in Pandas

``NaN`` and ``None`` both have their place, and **Pandas** is built to handle the two of them **nearly interchangeably,** converting between them where appropriate:

In [None]:
pd.Series([1, np.nan, 2, None])

For types that don't have an available sentinel value, Pandas **automatically type-casts** when NA values are present.


For example, if we set a value in an **integer array** to ``np.nan``, it will automatically be upcast to a **floating-point type** to accommodate the NA:

In [None]:
x = pd.Series(range(2), dtype=int)
x

In [None]:
# The Series is automatically upcast to a floating-point type to accommodate the missing value
x[0] = None
x

Notice that in addition to casting the **integer array** to **floating point,** Pandas automatically converts the ``None`` to a ``NaN`` value.




The following table lists the **upcasting conventions in Pandas** when NA values are introduced:

|Typeclass     | Conversion When Storing NAs | NA Sentinel Value      |
|--------------|-----------------------------|------------------------|
| ``floating`` | No change                   | ``np.nan``             |
| ``object``   | No change                   | ``None`` or ``np.nan`` |
| ``integer``  | Cast to ``float64``         | ``np.nan``             |
| ``boolean``  | Cast to ``object``          | ``None`` or ``np.nan`` |
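
The conventions in the table can be checked with a short sketch. As a simplifying assumption, each ``Series`` here is built directly from a list that already contains ``None``, rather than by assigning a missing value afterwards; the variable names are illustrative only:

```python
import pandas as pd

# Each Series is built from values that include a missing entry (None)
float_series = pd.Series([1.5, None])                # floating: stays float64
int_series = pd.Series([1, None])                    # integer: becomes float64
bool_series = pd.Series([True, None])                # boolean: becomes object
object_series = pd.Series([1, None], dtype=object)   # object: stays object

for name, s in [("floating", float_series), ("integer", int_series),
                ("boolean", bool_series), ("object", object_series)]:
    print(name, s.dtype)
```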

Keep in mind that in **Pandas, string data** has traditionally been stored with an ``object`` dtype, which is comparatively slow.

## Operating on Null Values

As we have seen, Pandas treats ``None`` and ``NaN`` as essentially **interchangeable** for indicating **missing or null** values.


To facilitate this convention, there are several useful methods for **detecting, removing,** and **replacing** null values in Pandas data structures.
They are:

- ``isnull()``: Generate a **boolean mask** indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.

### Detecting null values using ``isnull()`` and ``notnull()`` 
Pandas data structures have two useful methods for detecting null data: ``isnull()`` and ``notnull()``.
Either one will return a Boolean mask over the data. For example:

In [None]:
data = pd.Series([1, np.nan, 'hello', None])

In [None]:
data.isnull()

**NOTE:** Boolean masks can be used directly as a ``Series`` or ``DataFrame`` index:

In [None]:
data[data.notnull()]

The ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s.
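
As a small sketch of the ``DataFrame`` case, the hypothetical frame below (the name ``frame`` and its columns are made up for illustration) shows that ``isnull()`` returns a Boolean ``DataFrame`` of the same shape, which can then be summed to count the missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with one missing value in each column
frame = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, None]})

mask = frame.isnull()      # DataFrame of booleans, same shape as frame
per_column = mask.sum()    # number of missing values in each column

print(mask)
print(per_column)
```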

### Dropping null values using ``dropna()``

In addition to the masking used before, there are the convenience methods, ``dropna()``
(which removes NA values) and ``fillna()`` (which fills in NA values). 

For a ``Series``, the result is straightforward:

In [None]:
data

In [None]:
data.dropna()

For a ``DataFrame``, there are more options.
Consider the following ``DataFrame``:

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

We cannot drop single values from a ``DataFrame``; we can **ONLY** drop **full rows** or **full columns.**



By **default**, ``dropna()`` will **drop all rows** in which *any* null value is present:

In [None]:
df.dropna()

Alternatively, you can drop NA values along a different axis; ``axis=1`` (or, equivalently, ``axis='columns'``) **drops all columns** containing a null value:

In [None]:
df.dropna(axis='columns')

But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values.
This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through.

The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.
You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='columns', how='all')

For finer-grained control, the ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [None]:
df.dropna(axis='rows', thresh=3)

Here the first and last rows have been dropped, because each contains only two non-null values.
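
To see where the threshold comes from, the sketch below rebuilds the same ``df`` (including the all-``NaN`` column ``3``) and counts the non-null entries in each row before dropping. It uses ``axis='index'`` to select rows; the names ``non_null_per_row`` and ``kept`` are illustrative only:

```python
import numpy as np
import pandas as pd

# Same values as df above, including the all-NaN column 3
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Count of valid (non-null) entries in each row
non_null_per_row = df.notnull().sum(axis=1)

# Keep only rows with at least 3 non-null values
kept = df.dropna(axis='index', thresh=3)

print(non_null_per_row)
print(kept)
```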

### Filling null values

Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced.

Consider the following ``Series``:

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

We can fill NA entries with a single value, such as zero:

In [None]:
data.fillna(0)

We can specify a forward-fill to propagate the previous value forward:

In [None]:
# forward-fill (fillna(method='ffill') is deprecated in recent pandas releases)
data.ffill()

Or we can specify a back-fill to propagate the next values backward:

In [None]:
# back-fill (fillna(method='bfill') is deprecated in recent pandas releases)
data.bfill()

For ``DataFrame``s, the options are similar, but we can also specify an ``axis`` along which the fills take place:

In [None]:
df

In [None]:
# forward-fill along each row; ffill() replaces the deprecated fillna(method='ffill')
df.ffill(axis=1)

Notice that if a previous value is not available during a forward fill, the NA value remains.
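
The imputation and interpolation mentioned at the start of this section can be sketched in the same way. This is one simple possibility rather than a recommended strategy: it fills the gaps in the ``Series`` from above with the mean of the valid values, and separately with linear interpolation between neighbouring values:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mean() skips NaN by default, so this is the mean of 1, 2 and 3
mean_filled = data.fillna(data.mean())

# Linear interpolation between the neighbouring valid values
interpolated = data.interpolate()

print(mean_filled)
print(interpolated)
```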

# Good Luck!