Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

## Installing and Using Pandas

Installing the Pandas library is straightforward; see the Pandas documentation at https://pandas.pydata.org for details.

Once installed, Pandas can be imported and its version checked:

In [1]:
import pandas
pandas.__version__

'1.1.5'

It is extremely common in the data science community to import Pandas under the alias **pd**:

In [2]:
import pandas as pd
pd.__version__

'1.1.5'

To get help on Pandas, try this:

In [3]:
pd?

## Introducing Pandas Objects

A useful way to think about Pandas data structures is as *rows* and *columns* of labeled data, much like a database table.

In [4]:
import numpy as np
import pandas as pd

### The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A Series wraps both a sequence of *values* and a sequence of *indices*, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [11]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The index is an array-like object of type pd.Index.

In [12]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [13]:
# data can be accessed by the associated index
data[1]

0.5

In [14]:
# or sliced with ":" (the stop position is excluded)
data[1:3]

1    0.50
2    0.75
dtype: float64

In [16]:
data[0:-1]

0    0.25
1    0.50
2    0.75
dtype: float64

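### Series as Generalized NumPy Array

A Series may look like a one-dimensional NumPy array, but the main difference is the index: a NumPy array has an implicitly defined integer index, while a Pandas Series has an explicitly defined index associated with its values.

Because the index is explicit, it need not be an integer. It can consist of values of any desired type, and it need not be contiguous or sequential.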

In [17]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [19]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [20]:
data[5]

0.5

### Series as Specialized Dictionary

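A Series can also be seen as a specialized Python dictionary: it maps typed keys to typed values, and this typing makes it more efficient than a plain dictionary for certain operations, just as type-specific NumPy arrays are more efficient than Python lists.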

In [21]:
# use it like Python dictionary
population_dict = {'California': 38332521,
                    'Texas': 26448193,
                    'New York': 19651127,
                    'Florida': 19552860,
                    'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [22]:
# access data like a dictionary
population['California']

38332521

In [24]:
# unlike a Python dictionary, a Series also supports array-style operations such as slicing
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
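
The efficiency claim above comes from the fact that all values in a Series share one NumPy dtype, so arithmetic is vectorized while the keys are carried along. A minimal sketch, reusing two of the population figures from above:

```python
import pandas as pd

# Two entries taken from the population example above
population = pd.Series({'California': 38332521, 'Texas': 26448193})

# All values share a single NumPy dtype
print(population.dtype)

# Vectorized arithmetic applies to every value at once and keeps the keys
in_millions = population / 1e6
print(in_millions)
```

A plain dictionary would need an explicit loop or comprehension for the same conversion.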

### Constructing Series Objects

A Series can be created with the following constructor, where index is an optional argument and data can be one of several kinds of object:

*pd.Series(data, index=index)*

In [25]:
# data can be a list or NumPy array, in which case index defaults to an integer sequence
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [28]:
# data can be a scalar, which is repeated to fill the specified index
pd.Series(5, index = [100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [29]:
# data can be a dictionary, in which case the index defaults to the dictionary keys (here in insertion order, not sorted)
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [30]:
# a Python dict cannot hold duplicate keys: the later value for key 2 overwrites the earlier one before Pandas sees it
pd.Series({2:'a', 1:'b', 2:'c'})

2    c
1    b
dtype: object

In [31]:
# the index can be set explicitly if a different result is preferred;
# the Series is populated only with the listed keys, so 1:'b' is dropped
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

In [32]:
python_list = [3, 7, 1, 3, 5]
numpy_list = np.array(python_list)
pd.Series(numpy_list)

0    3
1    7
2    1
3    3
4    5
dtype: int64
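
The explicit index can also name keys that the dictionary does not contain. A small sketch, assuming a hypothetical extra key 4, suggests that such entries are filled with NaN rather than raising an error:

```python
import pandas as pd

# Key 4 is hypothetical: it does not exist in the dictionary
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

print(s.loc[3])              # value found in the dictionary
print(pd.isnull(s.loc[4]))   # the missing key is represented as a null value
```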

## The Pandas DataFrame Object

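The other fundamental Pandas structure is the DataFrame. Like the Series, it can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary.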

### DataFrame as a Generalized NumPy Array

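If a Series is the analog of a one-dimensional array with flexible indices, a DataFrame is the analog of a two-dimensional array with both flexible row indices and flexible column names.

A DataFrame can therefore be viewed as a sequence of aligned Series objects, where "aligned" means that they share the same index.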

In [33]:
# first construct a new Series listing the area of each of the five states discussed in the previous section
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [35]:
# now combine both population and area series together
states = pd.DataFrame({'population': population, 'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [36]:
# get index information
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [37]:
# get column information
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as Specialized Dictionary

In [38]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [39]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
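
The dictionary analogy also seems to extend to membership tests and iteration, which operate on column names rather than on row labels. A short sketch with a hypothetical two-row DataFrame:

```python
import pandas as pd

# Hypothetical two-row DataFrame built from figures used above
df = pd.DataFrame({'population': [38332521, 26448193],
                   'area': [423967, 695662]},
                  index=['California', 'Texas'])

has_area = 'area' in df             # membership tests look at column names
has_state = 'California' in df      # row labels are not keys
column_names = list(df)             # iteration yields column names

print(has_area, has_state, column_names)
```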

### Constructing DataFrame Objects

A DataFrame can be constructed in a variety of ways:

In [40]:
# from a single Series object
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [42]:
# from a list of dicts
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data


[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [43]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [44]:
# keys missing from some of the dictionaries are filled in with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [45]:
# from a dictionary of Series objects
pd.DataFrame({'population': population, 'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [55]:
# from a two-dimensional NumPy array
A = np.random.rand(3, 2)
A

array([[0.87638915, 0.89460666],
       [0.08504421, 0.03905478],
       [0.16983042, 0.8781425 ]])

In [56]:
pd.DataFrame(A, columns=['foo', 'bar'], index=['a', 'b', 'c'])

        foo       bar
a  0.876389  0.894607
b  0.085044  0.039055
c  0.169830  0.878143


In [53]:
# from a Numpy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [54]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
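
When a DataFrame is built from Series objects whose indices differ, the rows are aligned on the union of the labels. The sketch below uses hypothetical partial Series (figures taken from elsewhere in this notebook) to show the gaps being filled with NaN:

```python
import pandas as pd

# Hypothetical partial data: the two Series share only the 'Texas' label
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Alaska': 1723337})

df = pd.DataFrame({'population': pop, 'area': area})
print(df)
```

Only the Texas row is complete; the other two rows each contain one NaN.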


## The Pandas Index Object

The Index object can be thought of either as an immutable array or as an ordered set.

In [57]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Index as Immutable Array

Use standard Python indexing notation to retrieve values or slices.

In [58]:
ind[1]

3

In [59]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [60]:
# Index objects also have many of the attributes familiar from NumPy arrays
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [61]:
# one difference: an Index object is immutable, so item assignment raises an error
ind[1] = 0

TypeError: Index does not support mutable operations
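
Since an Index cannot be changed in place, a modified index has to be built as a new object. A minimal sketch, where replacing position 1 with 0 is only an illustrative choice:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                      # item assignment is rejected
except TypeError as err:
    print('TypeError:', err)

# Build a new Index instead: drop position 1, then insert 0 at position 1
new_ind = ind.delete(1).insert(1, 0)
print(list(new_ind))
print(list(ind))                    # the original is untouched
```

This immutability makes it safer to share one Index between several Series and DataFrames.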

### Index as Ordered Set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

In [62]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [63]:
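# intersection (operator form as supported by the Pandas version used here; the method form further down is the portable spelling)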
indA & indB

Int64Index([3, 5, 7], dtype='int64')

In [64]:
# union
indA | indB

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [65]:
# symmetric difference
indA ^ indB

Int64Index([1, 2, 9, 11], dtype='int64')

In [66]:
# these operations may also be accessed via object methods
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [67]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [68]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')

## Data Indexing and Selection

### Data Selection in Series

#### Series as Dictionary

In [69]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [72]:
# Series object provides a mapping from a collection of keys to a collection of values
data['b']

0.5

In [71]:
# dictionary-like expressions work too, e.g. testing whether a key (index label) is present
'a' in data

True

In [74]:
# get all the keys
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [76]:
new_list = list(data.items())

In [77]:
type(new_list)

list

In [78]:
# Series objects can be modified or extended
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as 1-D Array

In [79]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [80]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [81]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [82]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [83]:
data[['a', 'c', 'e']]

a    0.25
c    0.75
e    1.25
dtype: float64

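#### Indexers: loc, iloc, and ix

The slicing and indexing conventions shown above can be a source of confusion. If a Series has an explicit integer index, an indexing operation such as data[1] uses the explicit index, while a slicing operation such as data[1:3] uses the implicit Python-style index. Pandas provides the loc and iloc indexer attributes to remove this ambiguity, in keeping with one guiding principle of Python code: **explicit is better than implicit.**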

In [84]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [85]:
# explicit index when indexing
data[1]

'a'

In [86]:
data[5]

'c'

In [87]:
# raises KeyError: 4 is not a label in the explicit index
data[4]

KeyError: 4

In [88]:
# implicit (position-based) index when slicing
data[1:3]

3    b
5    c
dtype: object

In [89]:
# "loc" attribute allows indexing and slicing that always references the explicit index
data.loc[1] # like data[1]

'a'

In [90]:
data.loc[3] # like data[3]

'b'

In [91]:
data.loc[4] # like data[4]: label 4 does not exist

KeyError: 4

In [92]:
# "iloc" attribute allows indexing and slicing that always references the implicit Python-style index
data.iloc[1]  # implicit Python-style index (position based)

'b'

In [93]:
data.iloc[1:3] # implicit Python-style index (position based)

3    b
5    c
dtype: object
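
The same slice bounds give different results under the two indexers, which is where the explicit spelling pays off. A short sketch using the Series defined above:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc slices by label, and the stop label is included
by_label = list(data.loc[1:3])

# iloc slices by position, and the stop position is excluded
by_position = list(data.iloc[1:3])

print(by_label)      # ['a', 'b']
print(by_position)   # ['b', 'c']
```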

## Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

### DataFrame as a Dictionary

In [94]:
area = pd.Series({'California': 423967,
                  'Texas': 695662,
                  'New York': 141297,
                  'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 
                 'Texas': 26448193,
                 'New York': 19651127, 
                 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [95]:
# dictionary-style access to the data
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [96]:
# or attribute-style access, for column names that are strings
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [97]:
# attribute-style access returns exactly the same object as dictionary-style access
data.area is data['area']

True

In [98]:
# keep in mind that this does not work in all cases:
# DataFrame has a pop() method, so data.pop refers to the method rather than the 'pop' column
data.pop is data['pop']

False

In [99]:
# dictionary-style syntax can also be used to add a new column
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [100]:
data.density is data['density']

True
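
Attribute-style access is convenient for reading, but it appears to be unsafe for creating columns. The sketch below uses a hypothetical attribute name, ratio, to show that assigning to a new attribute does not add a column:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# Item assignment creates a real column
df['density'] = df['pop'] / df['area']

# Attribute assignment to a new name ('ratio' is hypothetical) only sets a
# Python attribute; Pandas warns about this, silenced here for the demo
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.ratio = df['area'] / df['pop']

print(list(df.columns))     # 'ratio' is not among the columns
print(callable(df.pop))     # df.pop is the DataFrame method, not the column
```

For that reason, data['col'] = value is the safer habit for column assignment.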

### DataFrame as Two-Dimensional Array

We can examine the raw underlying data array using the values attribute:

In [101]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [102]:
# transpose the DataFrame to swap rows and columns
data.T

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


In [103]:
# passing a single index to the underlying array accesses a row, not a column
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [104]:
# passing a single "index" to the DataFrame accesses a column
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [105]:
data.iloc[:3, :2]  # Python-style (position-based) index

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [108]:
data.loc[:'Florida', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860


In [109]:
data.loc[:'California', :'area']

              area
California  423967


In [112]:
data.ix[:3, :'pop']  # .ix was removed in Pandas 1.0; use loc or iloc instead

AttributeError: 'DataFrame' object has no attribute 'ix'

In [113]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [115]:
# modify data
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


## Additional Indexing Conventions

In [116]:
# while indexing refers to columns, slicing refers to rows
data['Florida': 'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [117]:
# slices can also refer to rows by number rather than by index
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


In [118]:
# masking operations are also interpreted row-wise rather than column-wise
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


## Operating on Data in Pandas

### Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.

In [119]:
import numpy as np
import pandas as pd

In [120]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 14, 4))
ser

0     6
1     3
2    12
3    10
dtype: int64

In [121]:
# generate random integers from 0 to 9 in an array of shape (3, 4)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  7  4  6  9
1  2  6  7  4
2  3  7  7  2


In [122]:
# apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved
np.exp(ser)

0       403.428793
1        20.085537
2    162754.791419
3     22026.465795
dtype: float64

In [123]:
# more complex calculation
np.sin(df * np.pi / 4)

          A             B         C             D
0 -0.707107  1.224647e-16 -1.000000  7.071068e-01
1  1.000000 -1.000000e+00 -0.707107  1.224647e-16
2  0.707107 -7.071068e-01 -0.707107  1.000000e+00


### Ufuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation.

### Index Alignment in Series



In [124]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

In [125]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [126]:
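# the index of the result is the union of the two input indices (equivalently, area.index.union(population.index))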
area.index | population.index

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [127]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [128]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
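
The fill_value argument is available on the method form of the other arithmetic operators as well. A brief sketch with the same A and B, where the fill values 0 and 1 are illustrative choices:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Subtraction: a label missing from one side is treated as 0
diff = A.sub(B, fill_value=0)

# Multiplication: a label missing from one side is treated as 1
prod = A.mul(B, fill_value=1)

print(diff.tolist())   # [2.0, 3.0, 3.0, -5.0]
print(prod.tolist())   # [2.0, 4.0, 18.0, 5.0]
```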

### Index Alignment in DataFrame



In [132]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A

    A   B
0  18  11
1  19   2


In [133]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B

   B  A  C
0  4  2  6
1  4  8  6
2  1  3  8


In [134]:
A + B

      A     B   C
0  20.0  15.0 NaN
1  27.0   6.0 NaN
2   NaN   NaN NaN


In [135]:
# fill missing entries with the mean of all values in A,
# computed by first stacking the rows of A
fill = A.stack().mean()
A.add(B, fill_value=fill)

      A     B     C
0  20.0  15.0  18.5
1  27.0   6.0  18.5
2  15.5  13.5  20.5


### Ufuncs: Operations Between DataFrame and Series

In [136]:
A = rng.randint(10, size=(3, 4))
A

array([[1, 9, 8, 9],
       [4, 1, 3, 6],
       [7, 2, 0, 3]])

In [137]:
# subtracting a row from a 2-D array is applied row-wise, following NumPy broadcasting rules
A - A[0]

array([[ 0,  0,  0,  0],
       [ 3, -8, -5, -3],
       [ 6, -7, -8, -6]])

In [138]:
df = pd.DataFrame(A, columns=list('QRST'))
df

   Q  R  S  T
0  1  9  8  9
1  4  1  3  6
2  7  2  0  3


In [139]:
# in Pandas, the same operation also applies row-wise by default
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  3 -8 -5 -3
2  6 -7 -8 -6


In [140]:
# to operate column-wise instead, use the method form and specify the axis
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -8  0 -1  0
1  3  0  2  5
2  5  0 -2  1


In [141]:
# select every other column of the first row as a Series
halfrow = df.iloc[0, ::2]
halfrow

Q    1
S    8
Name: 0, dtype: int64
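
DataFrame and Series operations align on labels just like the operations shown earlier. A sketch of what subtracting halfrow from df would look like, rebuilding df from the values printed above so that the example is self-contained:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly
df = pd.DataFrame([[1, 9, 8, 9],
                   [4, 1, 3, 6],
                   [7, 2, 0, 3]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]      # Series with index ['Q', 'S']

# Columns are matched by label; R and T have no partner in halfrow
result = df - halfrow
print(result)
```

Columns Q and S hold the row-wise differences, while R and T come out as NaN because halfrow has no entries for them.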

## Handling Missing Data

A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.

In the *masking approach*, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.

In the *sentinel approach*, the sentinel value could be some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.

Pandas takes the *sentinel approach*, using two existing Python null values as sentinels: **NaN** and **None**.
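
The difference between the two strategies can be sketched with plain NumPy. The -9999 code below is a hypothetical data-specific convention, and np.ma is used only to illustrate masking; it is not how Pandas stores missing values:

```python
import numpy as np

# -9999 is a hypothetical "missing" code in the raw data
raw = np.array([1.0, -9999.0, 3.0, 4.0])

# Masking approach: a separate Boolean array marks the missing entries
masked = np.ma.masked_array(raw, mask=(raw == -9999.0))
masked_total = masked.sum()            # the masked entry is skipped

# Sentinel approach: the missing entry becomes NaN inside the data itself
sentinel = np.where(raw == -9999.0, np.nan, raw)
sentinel_total = np.nansum(sentinel)   # NaN-aware aggregation skips it

print(masked_total, sentinel_total)    # 8.0 8.0
```

The mask costs an extra Boolean array, while the sentinel reuses a value of the data type itself.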

### Pythonic Missing Data

The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. 


In [1]:
import numpy as np
import pandas as pd

In [2]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [6]:
# use of 'object' dtype is more expensive
for dtype in ['object', 'int']:
  print('dtype=', dtype)
  %timeit np.arange(1E6, dtype=dtype).sum()
  print()

dtype= object
10 loops, best of 5: 69.7 ms per loop

dtype= int
100 loops, best of 5: 2.12 ms per loop



In [5]:
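# because of dtype=object, aggregations fall back to Python-level addition,
# and adding an integer to None raises an exception
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'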

### Missing Numerical Data

The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [7]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. *You should be aware that NaN is a bit like a data virus—it infects any other object it touches.* Regardless of the operation, the result of arithmetic with NaN will be another NaN:

In [8]:
1 + np.nan

nan

In [9]:
0 * np.nan

nan

In [10]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [11]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
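
One more consequence of the IEEE definition is that NaN never compares equal to anything, including itself, so an equality test cannot find it. A minimal sketch of locating NaN entries with np.isnan instead:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to itself, so == cannot be used to find it
same = (np.nan == np.nan)

# np.isnan builds a Boolean mask of the NaN positions
mask = np.isnan(vals)
total = vals[~mask].sum()

print(same)            # False
print(mask.tolist())   # [False, True, False, False]
print(total)           # 8.0
```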

### **NaN** and **None** in Pandas

NaN and None both have their place, and *Pandas is built to handle the two of them nearly interchangeably*, converting between them where appropriate:

In [12]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don’t have an available sentinel value, Pandas automatically type-casts when NA values are present. 

In [13]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [14]:
# it will automatically be upcast to a floating-point type to accommodate the NA:
x[0] = None
x

0    NaN
1    1.0
dtype: float64
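
The upcasting rule depends on the original type. The sketch below compares a few cases; the nullable 'Int64' extension dtype in the last case is an addition not covered in the text above, shown as one way to keep integers alongside a missing value:

```python
import pandas as pd

# Integers with a missing value are upcast to float64 (None becomes NaN)
ints = pd.Series([1, None])

# Booleans have no NaN-capable NumPy type, so the result is object dtype
bools = pd.Series([True, None])

# Assumption: the nullable extension dtype 'Int64' keeps integer values
# and marks the gap with a dedicated missing-value marker instead
nullable = pd.Series([1, None], dtype='Int64')

print(ints.dtype, bools.dtype, nullable.dtype)
```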

### Operating on Null Values

As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.

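- isnull(): generates a Boolean mask indicating missing values
- notnull(): the opposite of isnull()
- dropna(): returns a filtered version of the data
- fillna(): returns a copy of the data with missing values filled or imputed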


### Detecting Null Values

In [16]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [17]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [18]:
data[data.notnull()]

0        1
2    hello
dtype: object
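
Because the masks are Boolean, they can be aggregated directly to count missing entries. A short sketch on the same Series:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

n_missing = data.isnull().sum()     # True counts as 1
n_present = data.notnull().sum()
any_missing = data.isnull().any()

print(n_missing, n_present, any_missing)   # 2 2 True
```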

### Dropping Null Values

In [20]:
data = pd.Series([1, np.nan, 'hello', None])

data.dropna()

0        1
2    hello
dtype: object

In [21]:
df = pd.DataFrame([[1,      np.nan,   2],
                   [2,      3,        5],
                   [np.nan, 4,        6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


**We cannot drop single values from a DataFrame; we can only drop full rows or full columns.** 

By default, dropna() will drop all rows in which any null value is present:

In [22]:
df.dropna() # not in-place update

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis; axis=1 (or axis='columns') drops all columns containing a null value:

In [23]:
df.dropna(axis='columns') # not in-place update

   2
0  2
1  5
2  6


In [24]:
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [26]:
df[3] = np.nan # add a new column (3) with NaN value
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [28]:
df.dropna(axis='columns', how='all')  # drop only the columns in which all values are NaN

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [35]:
# thresh sets the minimum number of non-null values a row/column must have to be kept
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
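
Besides how and thresh, dropna() also accepts a subset argument that restricts the null check to particular labels. A minimal sketch on the original three-column DataFrame, where checking only column 0 is an illustrative choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Drop a row only if column 0 is null; nulls in other columns are ignored
kept = df.dropna(subset=[0])

print(kept.index.tolist())   # [0, 1]
print(df.shape)              # (3, 3): dropna returned a new object
```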


### Filling Null Values

In [36]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [37]:
# fill NaN with a value
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [39]:
# forward-fill
data.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [40]:
# backward fill
data.fillna(method='bfill')

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [41]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [42]:
df.fillna(method='ffill', axis=1)  # axis=1 fills along each row, left to right; axis=0 would fill down each column

# Notice that if a previous value is not available during a forward fill, the NA value remains.

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
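
Forward and backward fills are also available as dedicated methods, and the fill value can be computed from the data itself. A sketch on the Series used above; filling with the mean is just one possible imputation choice:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()                 # same result as fillna(method='ffill')
backward = data.bfill()                # same result as fillna(method='bfill')
with_mean = data.fillna(data.mean())   # mean() skips NaN, giving 2.0 here

print(forward.tolist())     # [1.0, 1.0, 2.0, 2.0, 3.0]
print(backward.tolist())    # [1.0, 2.0, 2.0, 3.0, 3.0]
print(with_mean.tolist())   # [1.0, 2.0, 2.0, 2.0, 3.0]
```

The method forms may be the safer spelling going forward, since newer Pandas releases appear to deprecate the method argument of fillna().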
