# pandas 05 - Manipulating Data

by Nova@Douban

The video recording of this session is available here: https://zoom.us/recording/share/BMnU5AHK8mriOojOb3PtF2oyzjZ7ivW_mXC_Rp-olUiwIumekTziMw

---

## 5.1 Handling Missing values

### 5.1.1 What are missing values in pandas?

A `NaN` value in a Series or DataFrame indicates that no value is specified for a particular index label.

Causes of `NaN`:

1. A join of two sets of data;
2. Dirty data from an external source;
3. The value was not known, or an error occurred, at the time the data was generated;
4. Reindexing of data;
5. The shape of the data has changed and there are now additional rows or columns.

In [1]:
import pandas as pd
import numpy as np
%load_ext memory_profiler

df = pd.DataFrame([np.random.randn(3)]*3, columns=['a', 'b', 'c'], index=['A', 'B', 'C'] )
df['d'], df.loc['D'] = np.nan, np.nan
df

Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,,,,
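
As a minimal sketch of two of the causes listed above (reindexing and joining), the following uses small made-up objects. The names `s`, `left`, and `right` and their values are hypothetical and not part of the session's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
s = pd.Series([1.0, 2.0, 3.0], index=["A", "B", "C"])

# Reindexing: label "D" has no value in s, so it becomes NaN
reindexed = s.reindex(["A", "B", "C", "D"])

# Joining: a row that exists in only one frame gets NaN in the other's columns
left = pd.DataFrame({"x": [1, 2]}, index=["A", "B"])
right = pd.DataFrame({"y": [10, 20]}, index=["B", "C"])
joined = left.join(right, how="outer")

print(reindexed)
print(joined)
```

Only label `B` is present in both frames, so it is the only row of `joined` without a `NaN`.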


---

### 5.1.2 Evaluating NaN values

1. `pd.DataFrame.isnull()`: returns a boolean DataFrame that is `True` where a value is `NaN`;
2. `pd.DataFrame.notnull()`: the inverse of `isnull()`, `True` where a value is not `NaN`;
3. `pd.DataFrame.count()`: counts the non-missing values in each column.

In [2]:
display(df.isnull(), df.notnull())

Unnamed: 0,a,b,c,d
A,False,False,False,True
B,False,False,False,True
C,False,False,False,True
D,True,True,True,True


Unnamed: 0,a,b,c,d
A,True,True,True,False
B,True,True,True,False
C,True,True,True,False
D,False,False,False,False


In [3]:
df.count()

a    3
b    3
c    3
d    0
dtype: int64
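
A common companion to `count()` is to sum the boolean result of `isnull()`, which gives the number of missing values per column. The sketch below assumes a small made-up frame named `toy` (hypothetical, not the `df` used above).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# True counts as 1 when summed, so this is the number of NaN values per column
missing_per_column = toy.isnull().sum()

# count() reports the complementary number of non-missing values
valid_per_column = toy.count()

print(missing_per_column)
print(valid_per_column)
```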

---

### 5.1.3 Filtering out missing values

1. by default, `dropna` drops any row containing a missing value;
2. passing `how='all'` drops only the rows in which every value is missing;
3. passing `axis=1` drops columns instead of rows.

In [4]:
df.dropna()

Unnamed: 0,a,b,c,d


In [33]:
display(df.dropna(how='all'), df.dropna(how='all', axis=1))

Unnamed: 0,a,b,c,d,e
A,-0.05975,1.021281,0.37801,,-2.675991
B,-0.05975,1.021281,0.37801,,-2.675991
C,-0.05975,1.021281,0.37801,,-2.675991


Unnamed: 0,a,b,c,e
A,-0.05975,1.021281,0.37801,-2.675991
B,-0.05975,1.021281,0.37801,-2.675991
C,-0.05975,1.021281,0.37801,-2.675991
D,,,,
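
The three `dropna` variants above can also be compared on a self-contained frame. This is a sketch on made-up data; the frame `toy` and its values are assumptions chosen so that each variant gives a different result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" and row "C" are entirely missing
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan],
        "b": [4.0, np.nan, np.nan],
        "c": [np.nan, np.nan, np.nan],
    },
    index=["A", "B", "C"],
)

any_rows = toy.dropna()                   # every row has a NaN, so nothing is left
all_rows = toy.dropna(how="all")          # only row "C" is dropped
all_cols = toy.dropna(how="all", axis=1)  # only column "c" is dropped

print(any_rows, all_rows, all_cols, sep="\n\n")
```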


---

### 5.1.4 Filling in missing values

1. Calling `fillna` with a constant replaces missing values with that value;

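2. pandas allows forward and backward filling with `method='ffill'` or `method='bfill'`. In pandas 2.1 and later, the `method` argument of `fillna` is deprecated in favor of the dedicated `ffill()` and `bfill()` methods.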

In [6]:
df.fillna(0)

Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,0.0
B,-0.05975,1.021281,0.37801,0.0
C,-0.05975,1.021281,0.37801,0.0
D,0.0,0.0,0.0,0.0


In [7]:
display(df.fillna(method='ffill'), df.fillna(method='bfill'))

Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,-0.05975,1.021281,0.37801,


Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,,,,
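
Forward and backward filling can also be written with the dedicated `ffill()` and `bfill()` methods. The sketch below uses a made-up Series `s` (hypothetical values) in which the gaps sit at the start, in the middle, and at the end, so the difference between the two directions is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with leading, inner, and trailing gaps
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill carries the last valid value forward; the leading NaN stays
forward = s.ffill()

# Backward fill pulls the next valid value backward; the trailing NaN stays
backward = s.bfill()

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```

A gap with no valid value on the side the fill comes from is left as `NaN`, which is also why column `d` above remains empty in both outputs.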


---

### 5.1.5 Interpolation of missing values

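Both DataFrame and Series have an `.interpolate()` method that, by default, performs a linear interpolation of missing values:

1. it takes the first valid value before and after a run of `NaN` values, divides their difference by the number of steps between them, and adds that increment repeatedly from the starting value to fill the run;
2. a trailing run of `NaN` values has no valid value after it, so by default it is filled with the last valid value, as row `D` shows below.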

In [8]:
display(df, df.interpolate())

Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,,,,


Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,-0.05975,1.021281,0.37801,
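
Because the rows of `df` are identical, the output above does not show the incremental steps. As a small worked sketch on a made-up Series (the values are assumptions chosen for easy arithmetic), the gap between 1 and 4 spans three steps, so the increment is (4 - 1) / 3 = 1:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: two missing values between 1.0 and 4.0
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Default linear interpolation
filled = s.interpolate()

# The same result computed by hand: start value plus a constant increment
increment = (4.0 - 1.0) / 3
by_hand = [1.0 + increment * k for k in range(4)]

print(filled.tolist())
print(by_hand)
```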


---

## 5.2 Handling duplicate values

Duplicate values are another form of bad data. We can use the following methods to handle duplicate values:

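1. `pd.DataFrame.duplicated()`: returns a boolean Series marking each row that duplicates an earlier row;
2. `pd.DataFrame.drop_duplicates()`: drops duplicate rows from a DataFrame and returns a copy, keeping the first occurrence by default;
3. `pd.DataFrame.drop_duplicates(keep='last')`: keeps the last occurrence of each duplicate row instead.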

In [9]:
df.duplicated()

A    False
B     True
C     True
D    False
dtype: bool

In [35]:
display(df.drop_duplicates(), df.drop_duplicates(keep='last'))

Unnamed: 0,a,b,c,d,e
A,-0.05975,1.021281,0.37801,,-2.675991
D,,,,,


Unnamed: 0,a,b,c,d,e
C,-0.05975,1.021281,0.37801,,-2.675991
D,,,,,
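
The effect of `keep` is easiest to see from the index labels that survive. The following sketch assumes a small made-up frame `toy` (hypothetical) whose first two rows are identical.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are duplicates of each other
toy = pd.DataFrame({"k": ["x", "x", "y"], "v": [1, 1, 2]})

flags = toy.duplicated()                      # marks row 1 only
keep_first = toy.drop_duplicates()            # keeps rows 0 and 2
keep_last = toy.drop_duplicates(keep="last")  # keeps rows 1 and 2

print(flags.tolist())
print(keep_first.index.tolist(), keep_last.index.tolist())
```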


## 5.3 Handling duplicate index labels and column names

It is quite common to have duplicate index labels or column names.

We can:

1. reset / replace the index labels or column names;
2. remove duplicate index labels or column names.

In [37]:
df1 = pd.concat([df, df])
df1.index

Index(['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'], dtype='object')

In [12]:
df1.reset_index(inplace=True)
df1.index

RangeIndex(start=0, stop=8, step=1)

In [13]:
df2 = pd.concat([df, df], axis=1)
df2.columns

Index(['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'], dtype='object')

In [14]:
df2 = df2.loc[:,~df2.columns.duplicated()]
df2

Unnamed: 0,a,b,c,d
A,-0.05975,1.021281,0.37801,
B,-0.05975,1.021281,0.37801,
C,-0.05975,1.021281,0.37801,
D,,,,


## 5.4 Filtering data

We mentioned boolean selection in pandas 03, which is one way to filter data in pandas. There are two other ways to do this:

1. `pandas.DataFrame.query`: Query the columns of a frame with a boolean expression.
2. `pandas.DataFrame.eval`: Evaluate a string describing operations on DataFrame columns.

The difference in computation time between the traditional boolean method and the `eval`/`query` methods is usually not significant:

1. the boolean method is faster for smaller arrays;
2. the benefit of `eval`/`query` lies mainly in the memory saved and in the sometimes cleaner syntax they offer.

In [15]:
nasdaq = pd.read_csv('../data/nasdaq.csv')
nasdaq[(nasdaq['Open'] > 6500) & (nasdaq['Open'] < 7000)]
# A chained comparison does not work on a Series (it raises ValueError):
# nasdaq[6500 < nasdaq['Open'] < 7000]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2018-11-23,6919.52002,6987.890137,6919.160156,6938.97998,6938.97998,958950000
10,2018-12-10,6959.629883,7047.620117,6878.990234,7020.52002,7020.52002,2367560000
14,2018-12-14,6986.370117,7027.169922,6898.990234,6910.660156,6910.660156,2200510000
15,2018-12-17,6886.459961,6931.810059,6710.009766,6753.72998,6753.72998,2665240000
16,2018-12-18,6809.819824,6847.27002,6733.709961,6783.910156,6783.910156,2595400000
17,2018-12-19,6777.589844,6868.859863,6586.5,6636.830078,6636.830078,2899950000
18,2018-12-20,6607.759766,6666.200195,6447.910156,6528.410156,6528.410156,3258090000
19,2018-12-21,6573.490234,6586.680176,6304.629883,6333.0,6333.0,4534120000


In [39]:
nn = nasdaq.query('6000 < Open < 7000')


In [17]:
eval_result = nasdaq.eval('6000 < Open < 7000')
display(eval_result, nasdaq[eval_result])

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14     True
15     True
16     True
17     True
18     True
19     True
Name: Open, dtype: bool

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2018-11-23,6919.52002,6987.890137,6919.160156,6938.97998,6938.97998,958950000
10,2018-12-10,6959.629883,7047.620117,6878.990234,7020.52002,7020.52002,2367560000
14,2018-12-14,6986.370117,7027.169922,6898.990234,6910.660156,6910.660156,2200510000
15,2018-12-17,6886.459961,6931.810059,6710.009766,6753.72998,6753.72998,2665240000
16,2018-12-18,6809.819824,6847.27002,6733.709961,6783.910156,6783.910156,2595400000
17,2018-12-19,6777.589844,6868.859863,6586.5,6636.830078,6636.830078,2899950000
18,2018-12-20,6607.759766,6666.200195,6447.910156,6528.410156,6528.410156,3258090000
19,2018-12-21,6573.490234,6586.680176,6304.629883,6333.0,6333.0,4534120000
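
Since `../data/nasdaq.csv` may not be at hand, here is a sketch that checks the three filtering styles against each other on a made-up frame. The frame `prices` and its values are hypothetical; only the column name `Open` is borrowed from the data above.

```python
import pandas as pd

# Hypothetical stand-in for the nasdaq data
prices = pd.DataFrame({"Open": [5900.0, 6500.0, 6999.0, 7000.0, 7100.0]})

# 1. Boolean selection: each comparison needs its own parentheses
mask = (prices["Open"] > 6000) & (prices["Open"] < 7000)
by_mask = prices[mask]

# 2. query: the chained comparison is written as a string
by_query = prices.query("6000 < Open < 7000")

# 3. eval: returns the boolean Series, which is then used for selection
by_eval = prices[prices.eval("6000 < Open < 7000")]

print(by_mask)
```

All three select the same rows; the boundaries are strict, so the row with `Open` equal to 7000 is excluded.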


## 5.5 Transforming data

1. map values to others;
2. replace values;
3. apply methods to transform;
4. change the direction of the DataFrame;

### 5.5.1 Mapping

`pd.Series.map()`:

1. It matches the values of the outer Series (the one `map` is called on) with the index labels of the inner Series (the one passed in) and returns the corresponding inner values.

2. If a value has no matching index label, the result for that value is `NaN`.

In [43]:
x = pd.Series({"one": 2, "two": 2, "three": 3, 'Three': 3}) 
y = pd.Series({2: "b", 3: "c"})
display(x, x.map(y), x.replace({2: 'b', 3: 'c'}))

one      2
two      2
three    3
Three    3
dtype: int64

one      b
two      b
three    c
Three    c
dtype: object

one      b
two      b
three    c
Three    c
dtype: object

In [45]:
%timeit %memit x.map(y)
%timeit %memit x.replace({2: 'b', 3: 'c'})

peak memory: 76.71 MiB, increment: 0.00 MiB
peak memory: 76.71 MiB, increment: 0.00 MiB
peak memory: 76.71 MiB, increment: 0.00 MiB
peak memory: 76.71 MiB, increment: 0.00 MiB
peak memory: 76.72 MiB, increment: 0.00 MiB
peak memory: 76.72 MiB, increment: 0.00 MiB
peak memory: 76.72 MiB, increment: 0.00 MiB
peak memory: 76.72 MiB, increment: 0.00 MiB
306 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
peak memory: 76.73 MiB, increment: 0.00 MiB
300 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
x = pd.Series({"one": 1, "two": 2, "three": 3})  # redefine x so that the value 3 has no match in y
y = pd.Series({1: "a", 2: "b"})
x.map(y)

one        a
two        b
three    NaN
dtype: object
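
`map` also accepts a plain dict, and the matching rule is the same. The sketch below is self-contained with made-up values; it differs from `replace` in that a value without a match becomes `NaN` rather than being left unchanged.

```python
import pandas as pd

# Hypothetical toy Series, for illustration only
x = pd.Series({"one": 1, "two": 2, "three": 3})

# The dict keys play the role of the inner Series' index labels
mapped = x.map({1: "a", 2: "b"})

# replace leaves unmatched values as they are
replaced = x.replace({1: "a", 2: "b"})

print(mapped)
print(replaced)
```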

---

### 5.5.2 Replacing

`pd.Series.replace()`:

1. replaces any value with another value;
2. can replace multiple items at once (with a list or a dict);

In [20]:
x.replace(2, 5)

one      1
two      5
three    3
dtype: int64

In [21]:
x.replace([1, 2, 3], ['I', 'II', 'III'])

one        I
two       II
three    III
dtype: object

In [46]:
x.replace({1: 'I', 2: 'II', 3: 'III'})

one        I
two       II
three    III
dtype: object

---

### 5.5.3 Applying functions

`pd.Series.apply()`:

1. use this method with care: the applied function is ordinary Python rather than a vectorized pandas operation, so it may significantly reduce performance;
2. it is common to store the result of an `apply` operation as a new column of the DataFrame;
3. it always applies the provided function to every item (for a Series) or to every row or column (for a DataFrame);

In [23]:
%time %memit x.apply(lambda i: i * 2)

peak memory: 76.25 MiB, increment: 0.11 MiB
CPU times: user 50.7 ms, sys: 36.9 ms, total: 87.6 ms
Wall time: 231 ms


In [24]:
df['e'] = df['c'].apply(lambda i: i * 6 / 7 - 3)
df

Unnamed: 0,a,b,c,d,e
A,-0.05975,1.021281,0.37801,,-2.675991
B,-0.05975,1.021281,0.37801,,-2.675991
C,-0.05975,1.021281,0.37801,,-2.675991
D,,,,,
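
For simple arithmetic like the expression above, the same result can usually be obtained without `apply` by operating on the whole Series at once. This is a sketch on a made-up Series `s` (hypothetical) that only checks the two forms agree; it makes no claim about the size of the speed difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series, including one missing value
s = pd.Series([0.0, 1.0, 2.0, np.nan, 4.0])

# Element by element, calling the Python lambda once per item
via_apply = s.apply(lambda i: i * 6 / 7 - 3)

# Vectorized: the arithmetic is applied to the whole Series at once
vectorized = s * 6 / 7 - 3

print(pd.DataFrame({"apply": via_apply, "vectorized": vectorized}))
```

In both forms the `NaN` propagates to the result.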


### 5.5.4 Transposing a DataFrame

`pd.DataFrame.T` returns the transpose of a DataFrame, swapping its rows and columns.

In [25]:
df

Unnamed: 0,a,b,c,d,e
A,-0.05975,1.021281,0.37801,,-2.675991
B,-0.05975,1.021281,0.37801,,-2.675991
C,-0.05975,1.021281,0.37801,,-2.675991
D,,,,,


In [26]:
df.T

Unnamed: 0,A,B,C,D
a,-0.05975,-0.05975,-0.05975,
b,1.021281,1.021281,1.021281,
c,0.37801,0.37801,0.37801,
d,,,,
e,-2.675991,-2.675991,-2.675991,


In [27]:
df.T.T

Unnamed: 0,a,b,c,d,e
A,-0.05975,1.021281,0.37801,,-2.675991
B,-0.05975,1.021281,0.37801,,-2.675991
C,-0.05975,1.021281,0.37801,,-2.675991
D,,,,,


---

## 5.6 Some Advanced Data Manipulations

1. `pd.Series.isin`: checks whether each element of the Series is contained in the given values;
2. `pd.Series.where`: for each element of the Series, keeps the element where `cond` is __True__; otherwise uses the corresponding value from `other` (`NaN` if `other` is not given);
3. `pd.Series.mask`: the opposite of `pd.Series.where`; it replaces the element where `cond` is __True__.

In [28]:
date_list = ['2018-12-03', '2018-12-04', '2018-12-05']
nasdaq['Date'].isin(date_list)

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: Date, dtype: bool

In [29]:
nasdaq['Open'].where(nasdaq['Open'] > 7000)

0             NaN
1     7026.500000
2     7041.229980
3     7135.080078
4     7267.370117
5     7279.299805
6     7486.129883
7     7407.950195
8     7017.049805
9     7163.490234
10            NaN
11    7121.660156
12    7127.000000
13    7135.279785
14            NaN
15            NaN
16            NaN
17            NaN
18            NaN
19            NaN
Name: Open, dtype: float64

In [30]:
nasdaq['Open'].where(nasdaq['Open'] > 7000, 7000)

0     7000.000000
1     7026.500000
2     7041.229980
3     7135.080078
4     7267.370117
5     7279.299805
6     7486.129883
7     7407.950195
8     7017.049805
9     7163.490234
10    7000.000000
11    7121.660156
12    7127.000000
13    7135.279785
14    7000.000000
15    7000.000000
16    7000.000000
17    7000.000000
18    7000.000000
19    7000.000000
Name: Open, dtype: float64

In [31]:
nasdaq['Open'].mask(nasdaq['Open'] > 7000, 7000)

0     6919.520020
1     7000.000000
2     7000.000000
3     7000.000000
4     7000.000000
5     7000.000000
6     7000.000000
7     7000.000000
8     7000.000000
9     7000.000000
10    6959.629883
11    7000.000000
12    7000.000000
13    7000.000000
14    6986.370117
15    6886.459961
16    6809.819824
17    6777.589844
18    6607.759766
19    6573.490234
Name: Open, dtype: float64
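
The relationship between `where` and `mask` can be stated as a small check: masking with a condition gives the same result as `where` with the negated condition. The sketch below uses a made-up Series `s` (hypothetical values).

```python
import pandas as pd

# Hypothetical toy Series, for illustration only
s = pd.Series([6900.0, 7100.0, 7300.0])
cond = s > 7000

# where keeps values where cond is True and replaces the rest
floored = s.where(cond, 7000)

# mask replaces values where cond is True and keeps the rest
capped = s.mask(cond, 7000)

print(floored.tolist())
print(capped.tolist())
```

Used this way, `where` acts as a lower bound of 7000 and `mask` as an upper bound.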

---

## 5.7 Exercise

1. Check the parameters of `pd.DataFrame.interpolate()`.
2. Check how to use `pd.Series.replace()` to replace values in a DataFrame.
3. How can we remove duplicate index labels?
4. How can we replace duplicate column names?
5. If there is only one condition, how do we pass it to `pd.Series.isin()`?