# DataFrame

A DataFrame is a tabular, spreadsheet-like data structure containing an ordered collection of columns. Each column can hold a different value type (numeric, string, boolean, etc.). One of the most common ways to construct a DataFrame is from a dict of equal-length lists or NumPy arrays.

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [2]:
employees = {"name": ["Kasper", "Ellen", "Lexi", "Cecilia", "Jason", "Andrew", "Doug"],
             "year": [2012, 2011, 2011, 2012, 2013, 2011, 2012],
             "school": ["Cal Poly", "UCB", "Stanford", "Cal Tech", "UCSB", "Stanford", "Michigan"]}
frame = DataFrame(employees)
frame

Unnamed: 0,name,school,year
0,Kasper,Cal Poly,2012
1,Ellen,UCB,2011
2,Lexi,Stanford,2011
3,Cecilia,Cal Tech,2012
4,Jason,UCSB,2013
5,Andrew,Stanford,2011
6,Doug,Michigan,2012


The names of columns can be passed to make them appear in a specific order.

In [3]:
DataFrame(employees, columns = ["name", "year", "school"])

Unnamed: 0,name,year,school
0,Kasper,2012,Cal Poly
1,Ellen,2011,UCB
2,Lexi,2011,Stanford
3,Cecilia,2012,Cal Tech
4,Jason,2013,UCSB
5,Andrew,2011,Stanford
6,Doug,2012,Michigan


Passing a column name that is not in the data produces a column of null values (the same way it does with Series).

In [4]:
frame2 = DataFrame(employees, columns = ["name", "year", "school", "hometown"],
                       index = ["one", "two", "three", "four", "five", "six", "seven"])
frame2

Unnamed: 0,name,year,school,hometown
one,Kasper,2012,Cal Poly,
two,Ellen,2011,UCB,
three,Lexi,2011,Stanford,
four,Cecilia,2012,Cal Tech,
five,Jason,2013,UCSB,
six,Andrew,2011,Stanford,
seven,Doug,2012,Michigan,


A column can be retrieved as a Series either with dict-like notation or by attribute access.

In [5]:
frame2["name"]

one       Kasper
two        Ellen
three       Lexi
four     Cecilia
five       Jason
six       Andrew
seven       Doug
Name: name, dtype: object

In [6]:
frame2.name

one       Kasper
two        Ellen
three       Lexi
four     Cecilia
five       Jason
six       Andrew
seven       Doug
Name: name, dtype: object

Rows can be retrieved by label with the loc indexer. (Older pandas versions used the ix indexer for this; it was deprecated in 0.20 and removed in 1.0.)

In [7]:
frame2.loc["four"]

name         Cecilia
year            2012
school      Cal Tech
hometown         NaN
Name: four, dtype: object
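
As a minimal sketch of row retrieval in recent pandas releases, a row can be selected by label with `loc` or by integer position with `iloc`. The small frame and the names `people`, `by_label`, and `by_position` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row frame indexed by string labels.
people = pd.DataFrame(
    {"name": ["Kasper", "Ellen", "Lexi"], "year": [2012, 2011, 2011]},
    index=["one", "two", "three"],
)

by_label = people.loc["two"]    # select the row whose label is "two"
by_position = people.iloc[1]    # select the second row by position

# Both lookups return the same row as a Series named after its label.
print(by_label.equals(by_position))
```

Label-based and position-based access are kept separate so that an integer key is never ambiguous.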

Columns can be modified by assignment. A scalar value is broadcast to every row, while a list or array must have the same length as the DataFrame.

In [8]:
frame2["hometown"] = "SF"
frame2

Unnamed: 0,name,year,school,hometown
one,Kasper,2012,Cal Poly,SF
two,Ellen,2011,UCB,SF
three,Lexi,2011,Stanford,SF
four,Cecilia,2012,Cal Tech,SF
five,Jason,2013,UCSB,SF
six,Andrew,2011,Stanford,SF
seven,Doug,2012,Michigan,SF


In [9]:
frame2["year"] = np.arange(7.)
frame2

Unnamed: 0,name,year,school,hometown
one,Kasper,0,Cal Poly,SF
two,Ellen,1,UCB,SF
three,Lexi,2,Stanford,SF
four,Cecilia,3,Cal Tech,SF
five,Jason,4,UCSB,SF
six,Andrew,5,Stanford,SF
seven,Doug,6,Michigan,SF


When assigning a list or array to a column, its length must match the length of the DataFrame. When assigning a Series, pandas aligns it on the DataFrame's index and inserts null values into any holes.

In [10]:
exp = Series([3, 3, 2, 1, 7], index = ["two", "three", "four", "five", "six"])
frame2["year"] = exp
frame2

Unnamed: 0,name,year,school,hometown
one,Kasper,,Cal Poly,SF
two,Ellen,3.0,UCB,SF
three,Lexi,3.0,Stanford,SF
four,Cecilia,2.0,Cal Tech,SF
five,Jason,1.0,UCSB,SF
six,Andrew,7.0,Stanford,SF
seven,Doug,,Michigan,SF
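
The difference between positional and label-based assignment can be sketched as follows. The frame `staff` and the variable `length_error` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical frame with three labelled rows.
staff = pd.DataFrame({"name": ["Kasper", "Ellen", "Lexi"]},
                     index=["one", "two", "three"])

# A list is matched by position, so a wrong length is rejected.
try:
    staff["year"] = [2012, 2011]
    length_error = None
except ValueError as err:
    length_error = err

# A Series is matched by label; labels it lacks become NaN.
staff["year"] = pd.Series([2011], index=["two"])
print(staff)
```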


Assigning a column that doesn't exist creates a new column. Columns can also be renamed and deleted.

In [11]:
frame2["status"] = "intern"
frame2

Unnamed: 0,name,year,school,hometown,status
one,Kasper,,Cal Poly,SF,intern
two,Ellen,3.0,UCB,SF,intern
three,Lexi,3.0,Stanford,SF,intern
four,Cecilia,2.0,Cal Tech,SF,intern
five,Jason,1.0,UCSB,SF,intern
six,Andrew,7.0,Stanford,SF,intern
seven,Doug,,Michigan,SF,intern


In [12]:
frame2.rename(columns = {frame2.columns[4]:"title"}, inplace = True)
frame2

Unnamed: 0,name,year,school,hometown,title
one,Kasper,,Cal Poly,SF,intern
two,Ellen,3.0,UCB,SF,intern
three,Lexi,3.0,Stanford,SF,intern
four,Cecilia,2.0,Cal Tech,SF,intern
five,Jason,1.0,UCSB,SF,intern
six,Andrew,7.0,Stanford,SF,intern
seven,Doug,,Michigan,SF,intern


In [13]:
del frame2["title"]
frame2

Unnamed: 0,name,year,school,hometown
one,Kasper,,Cal Poly,SF
two,Ellen,3.0,UCB,SF
three,Lexi,3.0,Stanford,SF
four,Cecilia,2.0,Cal Tech,SF
five,Jason,1.0,UCSB,SF
six,Andrew,7.0,Stanford,SF
seven,Doug,,Michigan,SF
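
As an alternative sketch to in-place renaming and `del`, `rename` (without `inplace`) and `drop(columns=...)` each return a new DataFrame and leave the original untouched. The frame `roster` and its contents are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical two-row frame.
roster = pd.DataFrame({"name": ["Kasper", "Ellen"],
                       "status": ["intern", "intern"]})

renamed = roster.rename(columns={"status": "title"})  # new frame, column renamed
trimmed = renamed.drop(columns=["title"])             # new frame, column removed

# The original frame still has its original columns.
print(list(roster.columns), list(renamed.columns), list(trimmed.columns))
```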


Another common form of data is a nested dict of dicts, where the outer keys are interpreted as columns and the inner keys as row labels.

In [14]:
wins = {"Giants": {2009: 88, 2010: 92, 2011: 86, 2012: 94, 2013: 76, 2014: 88},
        "Dodgers": {2010: 80, 2011: 82, 2012: 86, 2013: 92, 2014: 94},
        "Padres": {2010: 90, 2011: 71, 2012: 76, 2013: 76, 2014: 77}}
frame3 = DataFrame(wins)
frame3

Unnamed: 0,Dodgers,Giants,Padres
2009,,88,
2010,80.0,92,90.0
2011,82.0,86,71.0
2012,86.0,94,76.0
2013,92.0,76,76.0
2014,94.0,88,77.0


The T attribute transposes the DataFrame, swapping its columns and index.

In [15]:
frame3.T

Unnamed: 0,2009,2010,2011,2012,2013,2014
Dodgers,,80,82,86,92,94
Giants,88.0,92,86,94,76,88
Padres,,90,71,76,76,77


The keys of the inner dicts are combined to form the index of the result, unless a specific index is specified.

In [16]:
DataFrame(wins, index = [2008, 2009, 2010, 2011])

Unnamed: 0,Dodgers,Giants,Padres
2008,,,
2009,,88.0,
2010,80.0,92.0,90.0
2011,82.0,86.0,71.0


The index and column names can also be displayed if their name attributes are set.

In [17]:
frame3.index.name = "year"
frame3.columns.name = "team"
frame3

team,Dodgers,Giants,Padres
year,,,
2009,,88,
2010,80.0,92,90.0
2011,82.0,86,71.0
2012,86.0,94,76.0
2013,92.0,76,76.0
2014,94.0,88,77.0


The values attribute returns the data contained in the DataFrame as a two-dimensional NumPy array, just as it returns a one-dimensional array for a Series.

In [18]:
frame3.values

array([[ nan,  88.,  nan],
       [ 80.,  92.,  90.],
       [ 82.,  86.,  71.],
       [ 86.,  94.,  76.],
       [ 92.,  76.,  76.],
       [ 94.,  88.,  77.]])
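
When the columns hold different types, the array returned by `values` needs a dtype that can accommodate all of them. The following is a small sketch of that behavior; the frames `numeric` and `mixed` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: one all-float, one mixing strings and integers.
numeric = pd.DataFrame({"wins": [88.0, 92.0], "losses": [74.0, 70.0]})
mixed = pd.DataFrame({"team": ["Giants", "Dodgers"], "wins": [88, 92]})

print(numeric.values.dtype)  # a floating-point dtype shared by all columns
print(mixed.values.dtype)    # strings and integers fall back to object
```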

### Index Objects

Index objects in pandas hold the axis labels and other metadata, such as the axis name. Index objects are immutable, meaning they can't be changed in place. This allows them to be safely shared among data structures. Each Index has methods for combining and comparing labels. Some examples are as follows:
- append (concatenate with additional index objects)
- difference (set difference; called diff in older pandas releases)
- intersection (set intersection)
- union (set union)
- delete (delete element at index i)

IMPORTANT: All of these methods create a new index; they do not modify the old index.
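
The immutability and set-like methods described above can be sketched as follows, using the method names available in current pandas releases. The labels and variable names are illustrative assumptions.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
other = pd.Index(["c", "d"])

# In-place modification is refused because an Index is immutable.
try:
    idx[0] = "z"
    mutation_error = None
except TypeError as err:
    mutation_error = err

combined = idx.union(other)        # labels found in either index
shared = idx.intersection(other)   # labels found in both
only_left = idx.difference(other)  # labels in idx but not in other
appended = idx.append(other)       # concatenation, duplicates kept
shortened = idx.delete(0)          # new index without the element at position 0

print(list(idx))  # the original index is unchanged
```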

### Reindexing

The reindex method creates a new object with the data conformed to a new index. Any label in the new index that is missing from the original object is filled with a null unless a fill value is specified.

In [19]:
object1 = Series([13, 14, 1, 22], index = ["e", "r", "i", "c"])
object1

e    13
r    14
i     1
c    22
dtype: int64

In [20]:
object2 = object1.reindex(["o", "c", "i", "m", "e"])
object2

o   NaN
c    22
i     1
m   NaN
e    13
dtype: float64

In [21]:
object2 = object1.reindex(["o", "c", "i", "m", "e"], fill_value = 0)
object2

o     0
c    22
i     1
m     0
e    13
dtype: int64

It is possible to either forward fill or backfill values when reindexing by passing method = "ffill" or method = "bfill". This can be really useful for ordered data such as time series.

In [22]:
cum_time = Series(["wait time", "setup time", "queue time", "load time", "run time"], index = [0, 3, 5, 6, 8])
cum_time.reindex(range(12), method = "ffill")

0      wait time
1      wait time
2      wait time
3     setup time
4     setup time
5     queue time
6      load time
7      load time
8       run time
9       run time
10      run time
11      run time
dtype: object

In a DataFrame, reindex can alter the rows, the columns, or both. After the DataFrame is constructed, the index and columns can be reindexed as normal.

In [23]:
bland_frame = DataFrame(np.arange(16).reshape((4,4)), index = ["a", "b", "c", "d"], columns = ["e", "f", "g", "h"])
bland_frame

Unnamed: 0,e,f,g,h
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


### Dropping Entries from an Axis

One or more entries are dropped from an axis by passing their labels to the drop method, which returns a new object. In a DataFrame, entire rows or columns are dropped by specifying the axis: axis = 0 refers to the index (rows) while axis = 1 refers to the columns. By default, axis = 0 is assumed.

In [24]:
object1

e    13
r    14
i     1
c    22
dtype: int64

In [25]:
new_object1 = object1.drop("r")
new_object1

e    13
i     1
c    22
dtype: int64

In [26]:
object1.drop(["e", "c"])

r    14
i     1
dtype: int64

In [27]:
bland_frame.drop(["a", "c"])

Unnamed: 0,e,f,g,h
b,4,5,6,7
d,12,13,14,15


In [28]:
bland_frame.drop("e", axis = 1)

Unnamed: 0,f,g,h
a,1,2,3
b,5,6,7
c,9,10,11
d,13,14,15


In [29]:
bland_frame.drop(["f", "g"], axis = 1)

Unnamed: 0,e,h
a,0,3
b,4,7
c,8,11
d,12,15


### Indexing, Selection and Filtering

Series indexing works just like NumPy array indexing, except that the Series's index values can be used instead of only integers. One major difference to note is that slicing with labels differs from normal Python slicing: the endpoint is inclusive.

In [30]:
object1

e    13
r    14
i     1
c    22
dtype: int64

In [31]:
object1["r"]

14

In [32]:
object1[1]

14

In [33]:
object1[1:3]

r    14
i     1
dtype: int64

In [34]:
object1[["e", "r", "c"]]

e    13
r    14
c    22
dtype: int64

In [35]:
object1[[0, 3]]

e    13
c    22
dtype: int64

In [36]:
object1[object1 > 1]

e    13
r    14
c    22
dtype: int64

In [37]:
object1["r":"c"]

r    14
i     1
c    22
dtype: int64

In [38]:
bland_frame

Unnamed: 0,e,f,g,h
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [39]:
bland_frame["f"]

a     1
b     5
c     9
d    13
Name: f, dtype: int32

In [40]:
bland_frame[["g", "e"]]

Unnamed: 0,g,e
a,2,0
b,6,4
c,10,8
d,14,12


In [41]:
bland_frame[:2]

Unnamed: 0,e,f,g,h
a,0,1,2,3
b,4,5,6,7


In [42]:
bland_frame[bland_frame["e"] > 6]

Unnamed: 0,e,f,g,h
c,8,9,10,11
d,12,13,14,15


The loc indexer allows a subset of rows and columns to be selected by label. (Older pandas versions used the ix indexer, which was removed in pandas 1.0.)

In [43]:
bland_frame.loc["a", ["f", "h"]]

f    1
h    3
Name: a, dtype: int32
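
A few more ways to select rows and columns together are sketched here with the `loc` (label) and `iloc` (position) indexers. The frame mirrors `bland_frame`; the result names are assumptions.

```python
import numpy as np
import pandas as pd

bland = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["a", "b", "c", "d"],
                     columns=["e", "f", "g", "h"])

subset = bland.loc[["a", "c"], ["f", "h"]]      # rows and columns by label
masked = bland.loc[bland["e"] > 6, ["e", "g"]]  # boolean row mask plus column labels
corner = bland.iloc[:2, :2]                     # first two rows and columns by position

print(subset)
```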

### Arithmetic and Data Alignment

When two Series are added together, the index of the result is the union of the two indexes. Any label not present in both Series appears as a null in the output.

In [44]:
series1 = Series([2, 6, 3, 7, 12], index = ["a", "c", "t", "w", "d"])
series2 = Series([3, -1, 5, 4, 4], index = ["a", "b", "c", "d", "e"])

In [45]:
series1

a     2
c     6
t     3
w     7
d    12
dtype: int64

In [46]:
series2

a    3
b   -1
c    5
d    4
e    4
dtype: int64

In [47]:
series1 + series2

a     5
b   NaN
c    11
d    16
e   NaN
t   NaN
w   NaN
dtype: float64
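
The alignment rule can be checked directly on a smaller pair of Series. In this sketch the names `left`, `right`, `total`, and `matched` are hypothetical.

```python
import pandas as pd

left = pd.Series([2, 6, 3], index=["a", "c", "t"])
right = pd.Series([3, -1, 5], index=["a", "b", "c"])

total = left + right

# The result is indexed by every label that appears in either operand.
print(total.index.equals(left.index.union(right.index)))

# Only labels present in both operands carry a value.
matched = total.dropna()
print(matched)
```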

The same properties hold true in DataFrames.

In [48]:
dataframe1 = DataFrame(np.arange(9.).reshape((3, 3)), columns = list("abc"), index = ["Ben", "Paul", "Drew"])
dataframe2 = DataFrame(np.arange(12.).reshape((4, 3)), columns = list("pab"), index = ["Ben", "Paul", "Drew", "Tao"])

In [49]:
dataframe1

Unnamed: 0,a,b,c
Ben,0,1,2
Paul,3,4,5
Drew,6,7,8


In [50]:
dataframe2

Unnamed: 0,p,a,b
Ben,0,1,2
Paul,3,4,5
Drew,6,7,8
Tao,9,10,11


In [51]:
dataframe1 + dataframe2

Unnamed: 0,a,b,c,p
Ben,1.0,3.0,,
Drew,13.0,15.0,,
Paul,7.0,9.0,,
Tao,,,,


To substitute a value (such as 0) for entries that are missing from one of the operands, use the add method and pass fill_value. Entries missing from both operands would still be null.

In [52]:
dataframe3 = DataFrame(np.arange(4.).reshape((2,2)), columns = list("ab"))
dataframe4 = DataFrame(np.arange(12.).reshape((3,4)), columns = list("abcd"))
dataframe3 + dataframe4

Unnamed: 0,a,b,c,d
0,0.0,2.0,,
1,6.0,8.0,,
2,,,,


In [53]:
dataframe3.add(dataframe4, fill_value = 0)

Unnamed: 0,a,b,c,d
0,0,2,2,3
1,6,8,6,7
2,8,9,10,11


Relatedly, a fill value can also be specified when reindexing.

In [54]:
dataframe3.reindex(columns = dataframe4.columns, fill_value = 0)

Unnamed: 0,a,b,c,d
0,0,1,0,0
1,2,3,0,0


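Arithmetic between a DataFrame and a Series is well defined and works like NumPy broadcasting between a two-dimensional array and one of its rows. By default, the Series index is matched against the DataFrame's columns and the operation is applied to every row.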

In [55]:
array1 = np.arange(16.).reshape((4, 4))
array1

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.]])

In [56]:
array1[0]

array([ 0.,  1.,  2.,  3.])

In [57]:
array1 - array1[0]

array([[  0.,   0.,   0.,   0.],
       [  4.,   4.,   4.,   4.],
       [  8.,   8.,   8.,   8.],
       [ 12.,  12.,  12.,  12.]])

In [58]:
bland_frame

Unnamed: 0,e,f,g,h
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [59]:
series = bland_frame.iloc[0]
series

e    0
f    1
g    2
h    3
Name: a, dtype: int32

In [60]:
bland_frame - series

Unnamed: 0,e,f,g,h
a,0,0,0,0
b,4,4,4,4
c,8,8,8,8
d,12,12,12,12


### Sorting

Use the sort_index method to sort data lexicographically by row or column index. This returns a new, sorted object. In a DataFrame, you can sort on either axis.

In [61]:
object3 = Series(range(5), index = ["c", "r", "a", "i", "g"])
object3.sort_index()

a    2
c    0
g    4
i    3
r    1
dtype: int64

In [62]:
new_bland_frame = DataFrame(np.arange(15).reshape((3, 5)), index = ["slo", "sb", "sd"], columns = ["m", "t", "w", "r", "f"])
new_bland_frame

Unnamed: 0,m,t,w,r,f
slo,0,1,2,3,4
sb,5,6,7,8,9
sd,10,11,12,13,14


In [63]:
new_bland_frame.sort_index()

Unnamed: 0,m,t,w,r,f
sb,5,6,7,8,9
sd,10,11,12,13,14
slo,0,1,2,3,4


In [64]:
new_bland_frame.sort_index(axis = 1)

Unnamed: 0,f,m,r,t,w
slo,4,0,3,1,2
sb,9,5,8,6,7
sd,14,10,13,11,12


By default, the data is sorted in ascending order, but this can be changed to descending order.

In [65]:
new_bland_frame.sort_index(axis = 1, ascending = False)

Unnamed: 0,w,t,r,m,f
slo,2,1,3,0,4
sb,7,6,8,5,9
sd,12,11,13,10,14


To instead sort by values (as opposed to the index), use the sort_values method (called order in older pandas versions). Any missing values are put at the end of the sort by default.

In [66]:
object4 = Series([4, 5, -12, 33, 2])
object4.sort_values()

2   -12
4     2
0     4
1     5
3    33
dtype: int64

In [67]:
object5 = Series([2, np.nan, np.nan, 7, -3, 16, np.nan])
object5.sort_values()

4    -3
0     2
3     7
5    16
1   NaN
2   NaN
6   NaN
dtype: float64

With a DataFrame, values can be sorted in one or multiple columns by passing the column names to the by option.

In [68]:
dataframe5 = DataFrame({"a": [3, 3, 2, 3, 2], "b": [9, 1, 3, 5, -1]})
dataframe5

Unnamed: 0,a,b
0,3,9
1,3,1
2,2,3
3,3,5
4,2,-1


In [69]:
dataframe5.sort_values(by = "a")

Unnamed: 0,a,b
2,2,3
4,2,-1
0,3,9
1,3,1
3,3,5


In [70]:
dataframe5.sort_values(by = "b")

Unnamed: 0,a,b
4,2,-1
1,3,1
2,2,3
3,3,5
0,3,9


In [71]:
dataframe5.sort_values(by = ["a", "b"])

Unnamed: 0,a,b
4,2,-1
2,2,3
1,3,1
3,3,5
0,3,9
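
When sorting by several columns, each column can be given its own direction. This is a minimal sketch using `sort_values` from current pandas releases; the name `ordered` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"a": [3, 3, 2, 3, 2], "b": [9, 1, 3, 5, -1]})

# Sort by "a" ascending, then break ties by "b" descending.
ordered = frame.sort_values(by=["a", "b"], ascending=[True, False])
print(ordered)
```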


### Ranking

Think of a rank as a finishing position in a competition. In a race, the lowest time receives a rank of 1, the second-lowest a rank of 2, and so forth. In other competitions, the highest score receives a rank of 1 and ranks continue in order of descending scores. The same ranking system can be applied to Series and DataFrames. By default, rank orders values ascending and breaks ties by assigning each tied value the mean of the ranks they span.

In [72]:
outputs = Series([3, 4, 1, 9, 3, -2, 5])
outputs.rank()

0    3.5
1    5.0
2    2.0
3    7.0
4    3.5
5    1.0
6    6.0
dtype: float64

Ties can also be broken by the order in which they're observed in the data (method = "first"). As stated above, ranks can also be assigned in descending order.

In [73]:
outputs.rank(method = "first")

0    3
1    5
2    2
3    7
4    4
5    1
6    6
dtype: float64

In [74]:
outputs.rank(ascending = False, method = "max")

0    5
1    3
2    6
3    1
4    5
5    7
6    2
dtype: float64
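
Two further tie-breaking methods, `"min"` and `"dense"`, are sketched below on the same values for comparison. The variable names are assumptions for this example.

```python
import pandas as pd

outputs = pd.Series([3, 4, 1, 9, 3, -2, 5])

# "min": tied values share the lowest rank in their group; later ranks skip ahead.
lowest = outputs.rank(method="min")

# "dense": like "min", but ranks always increase by 1 between groups.
dense = outputs.rank(method="dense")

print(pd.DataFrame({"value": outputs, "min": lowest, "dense": dense}))
```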

In a DataFrame, ranks can be computed down each column (the default) or across each row (axis = 1).

In [75]:
dataframe5

Unnamed: 0,a,b
0,3,9
1,3,1
2,2,3
3,3,5
4,2,-1


In [76]:
dataframe5.rank()

Unnamed: 0,a,b
0,4.0,5
1,4.0,2
2,1.5,3
3,4.0,4
4,1.5,1


In [77]:
dataframe5.rank(axis = 1)

Unnamed: 0,a,b
0,1,2
1,2,1
2,1,2
3,1,2
4,2,1


### Axis Indexes with Duplicate Values

Some pandas functions, such as reindex, require that labels be unique, but unique labels are not mandatory in general.

In [78]:
dupes = Series(range(7), index = ["c", "a", "l", "p", "o", "l", "y"])
dupes

c    0
a    1
l    2
p    3
o    4
l    5
y    6
dtype: int64

In [79]:
dupes.index.is_unique

False

Selecting a label that occurs more than once returns a Series, while selecting a unique label returns a scalar value.

In [80]:
dupes["l"]

l    2
l    5
dtype: int64

In [81]:
dupes["p"]

3

The same principles hold true with DataFrames.

In [82]:
dupes2 = DataFrame(np.random.rand(4, 5), index = ["s", "b", "c", "c"])
dupes2

Unnamed: 0,0,1,2,3,4
s,0.047183,0.996029,0.276031,0.023333,0.341311
b,0.370452,0.747338,0.763194,0.953116,0.86586
c,0.180184,0.375337,0.835334,0.141111,0.224942
c,0.207849,0.523573,0.015508,0.679267,0.984879


In [83]:
dupes2.loc["c"]

Unnamed: 0,0,1,2,3,4
c,0.180184,0.375337,0.835334,0.141111,0.224942
c,0.207849,0.523573,0.015508,0.679267,0.984879


### Descriptive Statistics

pandas objects come with many mathematical and statistical methods. Null values are excluded from these calculations unless the skipna option is set to False.

In [84]:
random_dataframe = DataFrame([[3.2, np.nan, np.nan], [9.1, 2.3, 0.2], [np.nan, np.nan, 2.3], [np.nan, 0.3, 8.1], [4.2, 5.1, 7.2]], index = ["a", "b", "c", "d", "e"], columns = ["uno", "dos", "tres"])
random_dataframe

Unnamed: 0,uno,dos,tres
a,3.2,,
b,9.1,2.3,0.2
c,,,2.3
d,,0.3,8.1
e,4.2,5.1,7.2


In [85]:
random_dataframe.sum()

uno     16.5
dos      7.7
tres    17.8
dtype: float64

In [86]:
random_dataframe.sum(axis = 1)

a     3.2
b    11.6
c     2.3
d     8.4
e    16.5
dtype: float64

In [87]:
random_dataframe.mean(axis = 1)

a    3.200000
b    3.866667
c    2.300000
d    4.200000
e    5.500000
dtype: float64

In [88]:
random_dataframe.mean(axis = 1, skipna = False)

a         NaN
b    3.866667
c         NaN
d         NaN
e    5.500000
dtype: float64

Various other methods produce other interesting statistical information. idxmax and idxmin return the index labels at which the maximum and minimum values occur, respectively. cumsum accumulates the values going down each column.

In [89]:
random_dataframe.idxmax()

uno     b
dos     e
tres    d
dtype: object

In [90]:
random_dataframe.idxmin()

uno     a
dos     d
tres    b
dtype: object

In [91]:
random_dataframe.cumsum()

Unnamed: 0,uno,dos,tres
a,3.2,,
b,12.3,2.3,0.2
c,,,2.5
d,,2.6,10.6
e,16.5,7.7,17.8
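
The way cumsum treats nulls, and the use of idxmax across rows, can be sketched on a smaller frame. The frame `scores` and its values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"uno": [1.0, np.nan, 2.0], "dos": [np.nan, 5.0, 0.5]},
                      index=["a", "b", "c"])

# A null stays null in the output, but the running total carries on past it.
running = scores["uno"].cumsum()

# With axis=1, idxmax returns the column label of each row's largest value.
best_column = scores.idxmax(axis=1)

print(running)
print(best_column)
```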


In [92]:
random_dataframe.describe()

Unnamed: 0,uno,dos,tres
count,3.0,3.0,4.0
mean,5.5,2.566667,4.45
std,3.157531,2.411086,3.810949
min,3.2,0.3,0.2
25%,3.7,1.3,1.775
50%,4.2,2.3,4.75
75%,6.65,3.7,7.425
max,9.1,5.1,8.1


### Filtering Out Missing Data

Missing data is a very common obstacle in data analysis, and filtering it out is a common way to work around the issue. On a DataFrame, dropna drops any row containing a null value by default.

In [93]:
from numpy import nan as NA

incomplete_data = Series([2, 4, NA, 3, NA])
incomplete_data.dropna()

0    2
1    4
3    3
dtype: float64

In [94]:
incomplete_data[incomplete_data.notnull()]

0    2
1    4
3    3
dtype: float64

In [95]:
random_dataframe

Unnamed: 0,uno,dos,tres
a,3.2,,
b,9.1,2.3,0.2
c,,,2.3
d,,0.3,8.1
e,4.2,5.1,7.2


In [96]:
cleaned_dataframe = random_dataframe.dropna()
cleaned_dataframe

Unnamed: 0,uno,dos,tres
b,9.1,2.3,0.2
e,4.2,5.1,7.2


Passing how = "all" drops only the rows in which every value is null (no row qualifies in this case).

In [97]:
random_dataframe.dropna(how = "all")

Unnamed: 0,uno,dos,tres
a,3.2,,
b,9.1,2.3,0.2
c,,,2.3
d,,0.3,8.1
e,4.2,5.1,7.2


In [98]:
random_dataframe.dropna(axis = 1)

a
b
c
d
e


A minimum number of non-null values can be set with thresh; rows (or columns) with fewer non-null values are dropped.

In [99]:
random_dataframe.dropna(thresh = 2)

Unnamed: 0,uno,dos,tres
b,9.1,2.3,0.2
d,,0.3,8.1
e,4.2,5.1,7.2


### Filling in Missing Data

As an alternative to removing missing values, they can be filled in with values that will (hopefully) taint the data the least.

In [100]:
random_dataframe.fillna(0)

Unnamed: 0,uno,dos,tres
a,3.2,0.0,0.0
b,9.1,2.3,0.2
c,0.0,0.0,2.3
d,0.0,0.3,8.1
e,4.2,5.1,7.2


Calling fillna with a dict allows a different value to replace the null values in each respective column. The mean or median of each column can also be passed in.

In [101]:
random_dataframe.fillna({"uno": 1, "dos": 4, "tres": 9})

Unnamed: 0,uno,dos,tres
a,3.2,4.0,9.0
b,9.1,2.3,0.2
c,1.0,4.0,2.3
d,1.0,0.3,8.1
e,4.2,5.1,7.2


In [102]:
random_dataframe.fillna(random_dataframe.mean())

Unnamed: 0,uno,dos,tres
a,3.2,2.566667,4.45
b,9.1,2.3,0.2
c,5.5,2.566667,2.3
d,5.5,0.3,8.1
e,4.2,5.1,7.2


In [103]:
random_dataframe.fillna(random_dataframe.median())

Unnamed: 0,uno,dos,tres
a,3.2,2.3,4.75
b,9.1,2.3,0.2
c,4.2,2.3,2.3
d,4.2,0.3,8.1
e,4.2,5.1,7.2
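
For ordered data, another option is to carry the last valid observation forward. The sketch below compares `ffill` with filling by the mean on a small Series; the names `readings`, `carried`, and `averaged` are hypothetical.

```python
import numpy as np
import pandas as pd

readings = pd.Series([2.0, 4.0, np.nan, 3.0, np.nan])

carried = readings.ffill()                    # repeat the previous valid value
averaged = readings.fillna(readings.mean())   # mean of 2, 4 and 3 is 3.0

# Both calls return new objects; the original still contains its nulls.
print(carried.tolist(), averaged.tolist(), readings.isna().sum())
```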
