## Lesson 07 — Pandas Part II: Data cleaning and wrangling

In this lesson we will cover some more advanced features of [Pandas](http://pandas.pydata.org).

### Readings

* [_Data Cleaning and Preparation_, by Wes McKinney](https://wesmckinney.com/book/data-cleaning)
* [_Data Wrangling: Join, Combine, and Reshape_, by Wes McKinney](https://wesmckinney.com/book/data-wrangling)
* [_SQL OUTER JOIN_, by IONOS Redaktion](https://www.ionos.de/digitalguide/hosting/hosting-technik/sql-outer-join/)

### Table of Contents

* [concat](#concat)
* [merge](#merge)
* [join](#join)
* [stack](#stack)
* [unstack](#unstack)
* [values](#values)
* [apply](#apply)
* [map](#map)
* [sort_index](#sort_index)
* [sort_values](#sort_values)
* [isnull](#isnull)
* [fillna](#fillna)

In [1]:
# import modules
import pandas as pd
import numpy as np

### Concatenating (appending) and merging (joining) DataFrames
See [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html) for more information.

This table describes how the functions we are going to learn about relate to each other:

Action        | Combine two           | Add one to another
--------------|-----------------------|-------------------
Concatenating | pd.concat([df1, df2]) | df1.append(df2) (removed in pandas 2.0; use pd.concat)
Merging       | pd.merge(df1, df2)    | df1.join(df2)
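
Because `DataFrame.append` is no longer available in pandas 2.0 and later, adding one frame to another is written with `pd.concat`. A minimal sketch, where the frames `base` and `extra` are made up for illustration:

```python
import pandas as pd

# Hypothetical frames, used only for this illustration
base = pd.DataFrame({"A": ["A0", "A1"]})
extra = pd.DataFrame({"A": ["A2"]})

# Replacement for the old base.append(extra, ignore_index=True)
combined = pd.concat([base, extra], ignore_index=True)
print(combined)
```

Passing `ignore_index=True` renumbers the rows, which is usually what you want when the original row labels carry no meaning.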

#### concat

See [concat documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) for more info.

```python
pd.concat(
    objs,
    axis=0,
    join='outer',
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=False,
)
```

* `objs`: list or dict of Series or DataFrame objects. If a dict is passed, its keys will be used as the `keys` argument, unless `keys` is passed explicitly, in which case the values will be selected (see below).
* `axis`: `{0/'index', 1/'columns'}`, default `0`. The axis to concatenate along.
* `join`: `{'inner', 'outer'}`, default `'outer'`. How to handle indexes on the other axis (or axes): outer for union, inner for intersection.
* `ignore_index`: boolean, default `False`. If `True`, do not use the index values on the concatenation axis. The resulting axis will be labeled `0, ..., n - 1`. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
* `keys`: sequence, default `None`. Construct a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, it should contain tuples.
* `levels`: list of sequences, default `None`. If keys are passed, specific levels to use for the resulting MultiIndex. Otherwise they will be inferred from the keys.
* `names`: list, default `None`. Names for the levels in the resulting hierarchical index.
* `verify_integrity`: boolean, default `False`. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
* `sort`: boolean, default `False`. Sort the non-concatenation axis if it is not already aligned. One exception is when the non-concatenation axis is a DatetimeIndex, `join='outer'`, and the axis is not already aligned. In that case, the non-concatenation axis is always sorted lexicographically.
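
The cells below exercise `keys`, `axis`, `join`, and `ignore_index`, but not `verify_integrity`. A small sketch of that option, assuming two made-up frames `top` and `bottom` that share the row label `1`:

```python
import pandas as pd

# Hypothetical frames whose indexes overlap on label 1
top = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
bottom = pd.DataFrame({"A": ["A2", "A3"]}, index=[1, 2])

# By default the duplicated label is kept without complaint
stacked = pd.concat([top, bottom])
print(stacked.index.is_unique)

# With verify_integrity=True the overlap raises a ValueError instead
try:
    pd.concat([top, bottom], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print("concat refused:", err)
```

This check is handy when duplicate row labels would indicate a data problem, at the cost of extra work on large inputs.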

In [2]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7]) 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

In [3]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [4]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [5]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [6]:
frames = [df1, df2, df3]
result = pd.concat(frames)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


In [7]:
result = pd.concat(frames, keys=('x', 'y', 'z'))
result

Unnamed: 0,Unnamed: 1,A,B,C,D
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9


In [8]:
result.loc['y']

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [9]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])
df4

Unnamed: 0,B,D,F
2,B2,D2,F2
3,B3,D3,F3
6,B6,D6,F6
7,B7,D7,F7


In [10]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [11]:
pd.concat([df1, df4]) # axis=0

Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
2,,B2,,D2,F2
3,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7


In [12]:
pd.concat([df1, df4], axis=1)

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


In [13]:
pd.concat([df1, df4], axis=1, join='outer')

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


In [14]:
pd.concat([df1, df4], axis=1, join='inner')

Unnamed: 0,A,B,C,D,B.1,D.1,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


In [15]:
# Not ignoring indexes (default)
pd.concat([df1, df4], ignore_index=False)

Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
2,,B2,,D2,F2
3,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7


In [16]:
# Ignoring indexes
pd.concat([df1, df4], ignore_index=True)

Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,,B2,,D2,F2
5,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7


In [17]:
s3 = pd.Series([0, 1, 2, 3], name='foo')
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])
result = pd.concat([s3, s4, s5], axis=1)
result

Unnamed: 0,foo,0,1
0,0,0,0
1,1,1,1
2,2,2,4
3,3,3,5


In [18]:
result = pd.concat([s3, s4, s5], axis=1, keys=('red', 'blue', 'yellow'))
result

Unnamed: 0,red,blue,yellow
0,0,0,0
1,1,1,1
2,2,2,4
3,3,3,5


#### merge

See [merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) for more info.

```python
pd.merge(
    left, 
    right, 
    how='inner', 
    on=None, 
    left_on=None, 
    right_on=None, 
    left_index=False, 
    right_index=False,
    sort=False, 
    suffixes=('_x', '_y'), 
    copy=None, 
    indicator=False,
    validate=None,
)
```

* `left`: A DataFrame object.
* `right`: Another DataFrame object.
* `on`: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and `left_index` and `right_index` are `False`, the intersection of the columns in the DataFrames will be inferred to be the join keys.
* `left_on`: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
* `right_on`: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
* `left_index`: If `True`, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
* `right_index`: Same usage as `left_index` for the right DataFrame
* `how`: One of `'left'`, `'right'`, `'outer'`, `'inner'`. Defaults to inner. See below for more detailed description of each method
* `sort`: Defaults to `False`. Sort the result DataFrame by the join keys in lexicographical order. Otherwise, the order will depend on the join type.
* `suffixes`: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y').
* `copy`: If `True`, always copy data from the passed DataFrame objects, even when reindexing is not necessary. Copying cannot be avoided in many cases, but skipping it where possible may improve performance and memory usage. The cases where copying can be avoided are somewhat pathological, but the option is provided nonetheless.
* `indicator`: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in 'left' DataFrame, right_only for observations whose merge key only appears in 'right' DataFrame, and both if the observation’s merge key is found in both.
* `validate`: If specified, checks if merge is of specified type.
    * `"one_to_one"` or `"1:1"`: check if merge keys are unique in both left and right datasets.
    * `"one_to_many"` or `"1:m"`: check if merge keys are unique in left dataset.
    * `"many_to_one"` or `"m:1"`: check if merge keys are unique in right dataset.
    * `"many_to_many"` or `"m:m"`: allowed, but does not result in checks.
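
The `validate` checks above can be sketched as follows. The frames `customers` and `orders` are made up for illustration; the key `K0` appears twice in `orders`, so a one-to-one check should fail while a one-to-many check should pass:

```python
import pandas as pd

# Hypothetical data: keys are unique on the left, repeated on the right
customers = pd.DataFrame({"key": ["K0", "K1"], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"key": ["K0", "K0", "K1"], "amount": [10, 20, 30]})

# Left keys are unique, so the one-to-many check passes
ok = pd.merge(customers, orders, on="key", validate="one_to_many")
print(ok)

# Right keys are not unique, so the one-to-one check raises MergeError
try:
    pd.merge(customers, orders, on="key", validate="one_to_one")
    check_failed = False
except pd.errors.MergeError as err:
    check_failed = True
    print("merge refused:", err)
```

Using `validate` is a cheap way to catch unexpected duplicate keys before they silently multiply rows in the result.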



In [19]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K3', 'K2', 'K1', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [20]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [21]:
right

Unnamed: 0,key,C,D
0,K3,C0,D0
1,K2,C1,D1
2,K1,C2,D2
3,K0,C3,D3


In [22]:
result = pd.merge(left, right, on='key')
result

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C3,D3
1,K1,A1,B1,C2,D2
2,K2,A2,B2,C1,D1
3,K3,A3,B3,C0,D0


In [23]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [24]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [25]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [26]:
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


The `how` argument to `merge` specifies which keys are included in the resulting table. If a key combination is missing from one of the two tables, the columns that come from that table will be NA in the joined table. Here is a summary of the `how` options and their SQL equivalent names:


| Merge method    | SQL Join Name    | Description                               |
|-----------------|------------------|-------------------------------------------|
| left            | LEFT OUTER JOIN  | Use keys from left frame only             |
| right           | RIGHT OUTER JOIN | Use keys from right frame only            |
| outer           | FULL OUTER JOIN  | Use union of keys from both frames        |
| inner (default) | INNER JOIN       | Use intersection of keys from both frames |

Here is a Venn Diagram visualization of the join types:

<center>
<img src="../images/ionos_joins.webp" width="60%"/>
</center>

In [27]:
pd.merge(left, right, how='left', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


In [28]:
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [29]:
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K0,,,C3,D3
5,K2,K1,A3,B3,,


In [30]:
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [31]:
# merge indicator
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})

In [32]:
df1

Unnamed: 0,col1,col_left
0,0,a
1,1,b


In [33]:
df2

Unnamed: 0,col1,col_right
0,1,2
1,2,2
2,2,2


In [34]:
# note: indicator=True adds a column named _merge; passing a string (here 'Merged') sets the column name instead
# (the indicator argument requires pandas 0.17.0 or greater)
result = pd.merge(df1, df2, on='col1', how='outer', indicator='Merged')
result

Unnamed: 0,col1,col_left,col_right,Merged
0,0,a,,left_only
1,1,b,2.0,both
2,2,,2.0,right_only
3,2,,2.0,right_only


#### join

The related `DataFrame.join` method uses `merge` internally for the index-on-index and index-on-column(s) joins, **but joins on indexes by default** rather than trying to join on common columns (the default behavior for `merge`). If you are joining on the index, you may wish to use `DataFrame.join` to save yourself some typing.
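
To make the relationship concrete, the sketch below compares `join` with the equivalent `merge` call. The frames `left_demo` and `right_demo` are made up for illustration. Note that `join` defaults to `how='left'` whereas `merge` defaults to `how='inner'`, so `how` is spelled out in the `merge` call:

```python
import pandas as pd

# Hypothetical frames indexed by key
left_demo = pd.DataFrame({"A": ["A0", "A1"]}, index=["K0", "K1"])
right_demo = pd.DataFrame({"C": ["C0", "C2"]}, index=["K0", "K2"])

# join: index-on-index, how='left' by default
via_join = left_demo.join(right_demo)

# The same operation written with merge
via_merge = pd.merge(
    left_demo, right_demo, left_index=True, right_index=True, how="left"
)

print(via_join)
print(via_join.equals(via_merge))
```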

In [35]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                     index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [36]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [37]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [38]:
left.join(right, how='left')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [39]:
left.join(right, how='right')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
K3,,,C3,D3


In [40]:
left.join(right, how='inner')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2


In [41]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [42]:
# overlapping value columns: suffixes
left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})

In [43]:
left

Unnamed: 0,k,v
0,K0,1
1,K1,2
2,K2,3


In [44]:
right

Unnamed: 0,k,v
0,K0,4
1,K0,5
2,K3,6


In [45]:
result = pd.merge(left, right, on='k')
result

Unnamed: 0,k,v_x,v_y
0,K0,1,4
1,K0,1,5


In [46]:
result = pd.merge(left, right, on='k', suffixes=['_l', '_r'])
result

Unnamed: 0,k,v_l,v_r
0,K0,1,4
1,K0,1,5


In [47]:
# Why would I use this? Example: Merging gene expression tables
degs = pd.DataFrame({'gene': ['cds1', 'cds3', 'cds6'], 'count': [345, 887, 459]})
names = pd.DataFrame({
    'gene': ['cds1', 'cds2', 'cds3', 'cds4', 'cds5', 'cds6'], 
    'description': ['primase', 'ligase', 'aldolase', 'amylase', 'polymerase', 'kinase']
})

In [48]:
degs

Unnamed: 0,gene,count
0,cds1,345
1,cds3,887
2,cds6,459


In [49]:
names

Unnamed: 0,gene,description
0,cds1,primase
1,cds2,ligase
2,cds3,aldolase
3,cds4,amylase
4,cds5,polymerase
5,cds6,kinase


In [50]:
degs_plus_names = pd.merge(degs, names)

In [51]:
degs_plus_names

Unnamed: 0,gene,count,description
0,cds1,345,primase
1,cds3,887,aldolase
2,cds6,459,kinase


In [52]:
# argument settings on and how were inferred/defaults
degs_plus_names = pd.merge(degs, names, on='gene', how='inner')

In [53]:
degs_plus_names

Unnamed: 0,gene,count,description
0,cds1,345,primase
1,cds3,887,aldolase
2,cds6,459,kinase


#### merge/join by index

In [54]:
degs.index = degs.gene
degs.drop('gene', axis=1, inplace=True)
degs

Unnamed: 0_level_0,count
gene,Unnamed: 1_level_1
cds1,345
cds3,887
cds6,459


In [55]:
names.index = names.gene
names.drop('gene', axis=1, inplace=True)
names

Unnamed: 0_level_0,description
gene,Unnamed: 1_level_1
cds1,primase
cds2,ligase
cds3,aldolase
cds4,amylase
cds5,polymerase
cds6,kinase


In [56]:
pd.merge(left=degs, right=names, left_index=True, right_index=True)

Unnamed: 0_level_0,count,description
gene,Unnamed: 1_level_1,Unnamed: 2_level_1
cds1,345,primase
cds3,887,aldolase
cds6,459,kinase


<a id="set_option"></a>

### World Series example

In [57]:
df_ws = pd.read_csv('../data/WorldSeriesWinners.txt', header=None)
df_ws.columns = ['team']

In [58]:
df_new = pd.DataFrame({'team': ['Atlanta Braves']})

In [59]:
df_new

Unnamed: 0,team
0,Atlanta Braves


#### pd.concat()

In [60]:
df2 = pd.concat([df_ws, df_new], ignore_index=True)
df2

Unnamed: 0,team
0,Boston Americans
1,No Winner
2,New York Giants
3,Chicago White Sox
4,Chicago Cubs
...,...
118,Atlanta Braves
119,Houston Astros
120,Texas Rangers
121,Los Angeles Dodgers


#### add columns

In [61]:
df2.shape[0]

123

In [62]:
df2['year'] = np.arange(1903, 1903 + df2.shape[0])
df2['first_initial'] = [team[0] for team in df2['team']]
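
The list comprehension above works; as an alternative, the `.str` accessor on a string column can do the same thing without an explicit loop. A minimal sketch on a made-up frame `teams`:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the World Series table
teams = pd.DataFrame({"team": ["Boston Americans", "Chicago Cubs"]})

# .str[0] takes the first character of every string in the column
teams["first_initial"] = teams["team"].str[0]
print(teams)
```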

In [63]:
df2

Unnamed: 0,team,year,first_initial
0,Boston Americans,1903,B
1,No Winner,1904,N
2,New York Giants,1905,N
3,Chicago White Sox,1906,C
4,Chicago Cubs,1907,C
...,...,...,...
118,Atlanta Braves,2021,A
119,Houston Astros,2022,H
120,Texas Rangers,2023,T
121,Los Angeles Dodgers,2024,L


#### stack

Data is often stored in either “wide” or “long” (also called “stacked”) format. In the wide format there is typically one row for each subject. In the long format there are multiple rows for each subject where applicable.

`stack()` “pivots” a level of the (possibly hierarchical) column labels into the row index, returning a DataFrame (or a Series, when the columns are not a MultiIndex) whose index has a new inner-most level of row labels.

<center>
<img src="../images/reshaping_stack.png" width="60%"/>
</center>
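
Before applying this to the World Series table, a tiny sketch may help. The frame `wide` is made up for illustration: stacking its two columns gives a Series with one entry per (row label, column label) pair.

```python
import pandas as pd

# Hypothetical wide table: one row per subject
wide = pd.DataFrame(
    {"height": [1.8, 1.6], "weight": [80.0, 55.0]},
    index=["alice", "bob"],
)

# The column labels become the inner level of a two-level row index
tall = wide.stack()
print(tall)

# Individual values are addressed by (row label, former column label)
print(tall.loc[("bob", "weight")])
```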

In [64]:
stack_df = df2.stack()
stack_df

0    team             Boston Americans
     year                         1903
     first_initial                   B
1    team                    No Winner
     year                         1904
                            ...       
121  year                         2024
     first_initial                   L
122  team               Atlanta Braves
     year                         2025
     first_initial                   A
Length: 369, dtype: object

In [65]:
type(stack_df)

pandas.core.series.Series

In [66]:
stack_df[113]

team             Chicago Cubs
year                     2016
first_initial               C
dtype: object

In [67]:
stack_df[113]["team"]

'Chicago Cubs'

In [68]:
stack_df.index

MultiIndex([(  0,          'team'),
            (  0,          'year'),
            (  0, 'first_initial'),
            (  1,          'team'),
            (  1,          'year'),
            (  1, 'first_initial'),
            (  2,          'team'),
            (  2,          'year'),
            (  2, 'first_initial'),
            (  3,          'team'),
            ...
            (119, 'first_initial'),
            (120,          'team'),
            (120,          'year'),
            (120, 'first_initial'),
            (121,          'team'),
            (121,          'year'),
            (121, 'first_initial'),
            (122,          'team'),
            (122,          'year'),
            (122, 'first_initial')],
           length=369)

In [69]:
stacked_df = pd.DataFrame(stack_df)
stacked_df

Unnamed: 0,Unnamed: 1,0
0,team,Boston Americans
0,year,1903
0,first_initial,B
1,team,No Winner
1,year,1904
...,...,...
121,year,2024
121,first_initial,L
122,team,Atlanta Braves
122,year,2025


#### unstack

`unstack()` is the inverse operation of `stack()`: it “pivots” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

<center>
<img src="../images/reshaping_unstack.png" width="60%"/>
</center>
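
Which index level moves to the columns is controlled by the `level` argument, which defaults to the inner-most level. A minimal sketch on a made-up two-level Series `tall`:

```python
import pandas as pd

# Hypothetical long-format Series with a (subject, measurement) index
tall = pd.Series(
    [1.8, 80.0, 1.6, 55.0],
    index=pd.MultiIndex.from_product(
        [["alice", "bob"], ["height", "weight"]]
    ),
)

# Default (level=-1): the inner level becomes the columns
by_inner = tall.unstack()

# level=0: the outer level becomes the columns instead
by_outer = tall.unstack(level=0)

print(by_inner)
print(by_outer)
```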

In [70]:
unstack_df = df2.unstack()
unstack_df

team           0       Boston Americans
               1              No Winner
               2        New York Giants
               3      Chicago White Sox
               4           Chicago Cubs
                            ...        
first_initial  118                    A
               119                    H
               120                    T
               121                    L
               122                    A
Length: 369, dtype: object

In [71]:
type(unstack_df)

pandas.core.series.Series

In [72]:
unstack_df.index

MultiIndex([(         'team',   0),
            (         'team',   1),
            (         'team',   2),
            (         'team',   3),
            (         'team',   4),
            (         'team',   5),
            (         'team',   6),
            (         'team',   7),
            (         'team',   8),
            (         'team',   9),
            ...
            ('first_initial', 113),
            ('first_initial', 114),
            ('first_initial', 115),
            ('first_initial', 116),
            ('first_initial', 117),
            ('first_initial', 118),
            ('first_initial', 119),
            ('first_initial', 120),
            ('first_initial', 121),
            ('first_initial', 122)],
           length=369)

In [73]:
unstack_df["team"][113]

'Chicago Cubs'

In [74]:
pd.DataFrame(unstack_df)

Unnamed: 0,Unnamed: 1,0
team,0,Boston Americans
team,1,No Winner
team,2,New York Giants
team,3,Chicago White Sox
team,4,Chicago Cubs
...,...,...
first_initial,118,A
first_initial,119,H
first_initial,120,T
first_initial,121,L


In [75]:
# unstacking a stacked series is the reverse operation
stack_df.unstack()

Unnamed: 0,team,year,first_initial
0,Boston Americans,1903,B
1,No Winner,1904,N
2,New York Giants,1905,N
3,Chicago White Sox,1906,C
4,Chicago Cubs,1907,C
...,...,...,...
118,Atlanta Braves,2021,A
119,Houston Astros,2022,H
120,Texas Rangers,2023,T
121,Los Angeles Dodgers,2024,L


### World data example with stack and unstack

In [76]:
# I have downloaded the World data properties from https://www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023
df_raw = pd.read_csv('../data/world-data-2023.csv', index_col=0, thousands=',')

In [77]:
df_raw

Unnamed: 0_level_0,Density (P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,CPI,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,60,AF,58.10%,652230.0,323000.0,32.49,93.0,Kabul,8672.0,149.90,...,78.40%,0.28,38041754.0,48.90%,9.30%,71.40%,11.12%,9797273.0,33.939110,67.709953
Albania,105,AL,43.10%,28748.0,9000.0,11.78,355.0,Tirana,4536.0,119.05,...,56.90%,1.20,2854191.0,55.70%,18.60%,36.60%,12.33%,1747593.0,41.153332,20.168331
Algeria,18,DZ,17.40%,2381741.0,317000.0,24.28,213.0,Algiers,150006.0,151.36,...,28.10%,1.72,43053054.0,41.20%,37.20%,66.10%,11.70%,31510100.0,28.033886,1.659626
Andorra,164,AD,40.00%,468.0,,7.20,376.0,Andorra la Vella,469.0,,...,36.40%,3.33,77142.0,,,,,67873.0,42.506285,1.521801
Angola,26,AO,47.50%,1246700.0,117000.0,40.73,244.0,Luanda,34693.0,261.73,...,33.40%,0.21,31825295.0,77.50%,9.20%,49.10%,6.89%,21061025.0,-11.202692,17.873887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,32,VE,24.50%,912050.0,343000.0,17.88,58.0,Caracas,164175.0,2740.27,...,45.80%,1.92,28515829.0,59.70%,,73.30%,8.80%,25162368.0,6.423750,-66.589730
Vietnam,314,VN,39.30%,331210.0,522000.0,16.75,84.0,Hanoi,192668.0,163.52,...,43.50%,0.82,96462106.0,77.40%,19.10%,37.60%,2.01%,35332140.0,14.058324,108.277199
Yemen,56,YE,44.60%,527968.0,40000.0,30.45,967.0,Sanaa,10609.0,157.58,...,81.00%,0.31,29161922.0,38.00%,,26.60%,12.91%,10869523.0,15.552727,48.516388
Zambia,25,ZM,32.10%,752618.0,16000.0,36.19,260.0,Lusaka,5141.0,212.31,...,27.50%,1.19,17861030.0,74.60%,16.20%,15.60%,11.43%,7871713.0,-13.133897,27.849332


In [78]:
df_raw.dtypes

Density (P/Km2)                                int64
Abbreviation                                  object
Agricultural Land( %)                         object
Land Area(Km2)                               float64
Armed Forces size                            float64
Birth Rate                                   float64
Calling Code                                 float64
Capital/Major City                            object
Co2-Emissions                                float64
CPI                                          float64
CPI Change (%)                                object
Currency-Code                                 object
Fertility Rate                               float64
Forested Area (%)                             object
Gasoline Price                                object
GDP                                           object
Gross primary education enrollment (%)        object
Gross tertiary education enrollment (%)       object
Infant mortality                             f

In [79]:
# Let's select a couple of markers for the dataset
df_subset = df_raw[["Calling Code", "Capital/Major City", "Currency-Code", "Latitude", "Longitude", "Population", "Co2-Emissions"]]
df_subset

Unnamed: 0_level_0,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Afghanistan,93.0,Kabul,AFN,33.939110,67.709953,38041754.0,8672.0
Albania,355.0,Tirana,ALL,41.153332,20.168331,2854191.0,4536.0
Algeria,213.0,Algiers,DZD,28.033886,1.659626,43053054.0,150006.0
Andorra,376.0,Andorra la Vella,EUR,42.506285,1.521801,77142.0,469.0
Angola,244.0,Luanda,AOA,-11.202692,17.873887,31825295.0,34693.0
...,...,...,...,...,...,...,...
Venezuela,58.0,Caracas,VED,6.423750,-66.589730,28515829.0,164175.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0
Yemen,967.0,Sanaa,YER,15.552727,48.516388,29161922.0,10609.0
Zambia,260.0,Lusaka,ZMW,-13.133897,27.849332,17861030.0,5141.0


In [80]:
# If you transpose, you switch the columns with the index and vice-versa
df_transposed = df_subset.transpose()

In [81]:
df_transposed

Country,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,United Kingdom,United States,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Calling Code,93.0,355.0,213.0,376.0,244.0,1.0,54.0,374.0,61.0,43.0,...,44.0,1.0,598.0,998.0,678.0,58.0,84.0,967.0,260.0,263.0
Capital/Major City,Kabul,Tirana,Algiers,Andorra la Vella,Luanda,"St. John's, Saint John",Buenos Aires,Yerevan,Canberra,Vienna,...,London,"Washington, D.C.",Montevideo,Tashkent,Port Vila,Caracas,Hanoi,Sanaa,Lusaka,Harare
Currency-Code,AFN,ALL,DZD,EUR,AOA,XCD,ARS,AMD,AUD,EUR,...,GBP,USD,UYU,UZS,VUV,VED,VND,YER,ZMW,
Latitude,33.93911,41.153332,28.033886,42.506285,-11.202692,17.060816,-38.416097,40.069099,-25.274398,47.516231,...,55.378051,37.09024,-32.522779,41.377491,-15.376706,6.42375,14.058324,15.552727,-13.133897,-19.015438
Longitude,67.709953,20.168331,1.659626,1.521801,17.873887,-61.796428,-63.616672,45.038189,133.775136,14.550072,...,-3.435973,-95.712891,-55.765835,64.585262,166.959158,-66.58973,108.277199,48.516388,27.849332,29.154857
Population,38041754.0,2854191.0,43053054.0,77142.0,31825295.0,97118.0,44938712.0,2957731.0,25766605.0,8877067.0,...,66834405.0,328239523.0,3461734.0,33580650.0,299882.0,28515829.0,96462106.0,29161922.0,17861030.0,14645468.0
Co2-Emissions,8672.0,4536.0,150006.0,469.0,34693.0,557.0,201348.0,5156.0,375908.0,61448.0,...,379025.0,5006302.0,6766.0,91811.0,147.0,164175.0,192668.0,10609.0,5141.0,10983.0


In [82]:
# We are going to select just a couple of countries in our dataset
df = df_transposed[
    ["Brazil", "Argentina", "Germany", "France", "Italy", "Vietnam", "Australia", "Canada", "Costa Rica", "Vatican City"]
].transpose()
df

Unnamed: 0_level_0,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Brazil,55.0,Bras���,BRL,-14.235004,-51.92528,212559417.0,462299.0
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0
Costa Rica,506.0,San Jos������,CRC,9.748917,-83.753428,5047561.0,8023.0
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,


In [83]:
# Some data cleaning required, fix some of the capital names!
df.loc["Brazil", "Capital/Major City"] = "Brasília"
df.loc["Costa Rica", "Capital/Major City"] = "San José"

### apply

Apply a function along an axis of the DataFrame.
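
To make the role of the axis concrete, here is a minimal sketch. The frame `scores` and the helper `value_range` are made up for illustration. With the default `axis=0` the function receives each column as a Series; with `axis=1` it receives each row.

```python
import pandas as pd

# Hypothetical frame of scores
scores = pd.DataFrame(
    {"math": [90, 70], "art": [60, 80]},
    index=["ann", "bob"],
)

def value_range(values):
    """Return the spread (max minus min) of a Series."""
    return values.max() - values.min()

# axis=0 (default): one result per column
per_column = scores.apply(value_range)

# axis=1: one result per row
per_row = scores.apply(value_range, axis=1)

print(per_column)
print(per_row)
```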

In [84]:
def is_tropical(latitude_column):
    # According to Wikipedia (https://en.wikipedia.org/wiki/Tropics), the tropics lie between latitudes -23.43595 and +23.43595
    return (-23.43595 <= latitude_column) & (latitude_column <= 23.43595)

In [85]:
df["Is tropical"] = df[["Latitude"]].apply(is_tropical)

In [86]:
df

Unnamed: 0_level_0,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False


In [87]:
df.dtypes

Calling Code          object
Capital/Major City    object
Currency-Code         object
Latitude              object
Longitude             object
Population            object
Co2-Emissions         object
Is tropical             bool
dtype: object
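
Every column except `Is tropical` is reported as `object`, presumably because the frame was transposed twice above and transposing a mixed-type frame forces a common dtype. One possible way to recover numeric dtypes is `infer_objects()`, sketched here on a made-up frame `mixed` rather than on `df` itself:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
mixed = pd.DataFrame({"city": ["Hanoi", "Rome"], "lat": [14.06, 41.87]})

# Transposing twice leaves the numeric column with dtype object
round_trip = mixed.transpose().transpose()
print(round_trip.dtypes)

# infer_objects() attempts to find a better dtype for object columns
restored = round_trip.infer_objects()
print(restored.dtypes)
```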

In [88]:
# drop the first 2 columns (Calling Code and Capital/Major City); note that Currency-Code is still text
df.iloc[:, 2:]

Unnamed: 0_level_0,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Brazil,BRL,-14.235004,-51.92528,212559417.0,462299.0,True
Argentina,ARS,-38.416097,-63.616672,44938712.0,201348.0,False
Germany,EUR,51.165691,10.451526,83132799.0,727973.0,False
France,EUR,46.227638,2.213749,67059887.0,303276.0,False
Italy,EUR,41.87194,12.56738,60297396.0,320411.0,False
Vietnam,VND,14.058324,108.277199,96462106.0,192668.0,True
Australia,AUD,-25.274398,133.775136,25766605.0,375908.0,False
Canada,CAD,56.130366,-106.346771,36991981.0,544894.0,False
Costa Rica,CRC,9.748917,-83.753428,5047561.0,8023.0,True
Vatican City,EUR,41.902916,12.453389,836.0,,False


#### df.stack() and df.unstack() (again)

In [89]:
df.iloc[:, 3:5].stack()

Country                
Brazil        Latitude     -14.235004
              Longitude     -51.92528
Argentina     Latitude     -38.416097
              Longitude    -63.616672
Germany       Latitude      51.165691
              Longitude     10.451526
France        Latitude      46.227638
              Longitude      2.213749
Italy         Latitude       41.87194
              Longitude      12.56738
Vietnam       Latitude      14.058324
              Longitude    108.277199
Australia     Latitude     -25.274398
              Longitude    133.775136
Canada        Latitude      56.130366
              Longitude   -106.346771
Costa Rica    Latitude       9.748917
              Longitude    -83.753428
Vatican City  Latitude      41.902916
              Longitude     12.453389
dtype: object

In [90]:
df.iloc[:, 3:5].unstack()

           Country     
Latitude   Brazil          -14.235004
           Argentina       -38.416097
           Germany          51.165691
           France           46.227638
           Italy             41.87194
           Vietnam          14.058324
           Australia       -25.274398
           Canada           56.130366
           Costa Rica        9.748917
           Vatican City     41.902916
Longitude  Brazil           -51.92528
           Argentina       -63.616672
           Germany          10.451526
           France            2.213749
           Italy             12.56738
           Vietnam         108.277199
           Australia       133.775136
           Canada         -106.346771
           Costa Rica      -83.753428
           Vatican City     12.453389
dtype: object
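
As the two outputs above show, `stack()` moves the column labels into an inner index level, while `unstack()` on a frame with a single-level index puts the column labels in the outer level. A minimal sketch on a hypothetical two-row frame (`toy` is an illustration, not the lesson's `df`) shows that stacking and then unstacking returns to the original wide layout:

```python
import pandas as pd

# Hypothetical two-country frame with the same layout as the lesson's data
toy = pd.DataFrame(
    {"Latitude": [46.2, 51.2], "Longitude": [2.2, 10.5]},
    index=pd.Index(["France", "Germany"], name="Country"),
)

# stack(): columns become the inner level of a (Country, column) MultiIndex
long = toy.stack()

# unstack() on the stacked Series pivots the inner level back into columns
wide = long.unstack()

# unstack() directly on the frame yields a Series indexed by (column, Country)
by_column = toy.unstack()

print(long)
print(by_column)
```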

#### values

In [91]:
# sometimes you need the raw values as a NumPy array rather than a pandas object; first, the Series itself
df["Currency-Code"]

Country
Brazil          BRL
Argentina       ARS
Germany         EUR
France          EUR
Italy           EUR
Vietnam         VND
Australia       AUD
Canada          CAD
Costa Rica      CRC
Vatican City    EUR
Name: Currency-Code, dtype: object

In [92]:
df["Currency-Code"].values

array(['BRL', 'ARS', 'EUR', 'EUR', 'EUR', 'VND', 'AUD', 'CAD', 'CRC',
       'EUR'], dtype=object)
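
The pandas documentation recommends `.to_numpy()` over `.values` for new code. A short sketch on a hypothetical Series (the name `codes` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical Series of currency codes
codes = pd.Series(["BRL", "ARS", "EUR"], name="Currency-Code")

# to_numpy() returns the data as a NumPy array, dropping the index and name
arr = codes.to_numpy()

print(type(arr), arr)
```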

#### value_counts

In [93]:
df["Currency-Code"].value_counts()

Currency-Code
EUR    4
BRL    1
ARS    1
VND    1
AUD    1
CAD    1
CRC    1
Name: count, dtype: int64
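
If relative frequencies are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a hypothetical Series (`codes` is made up for illustration):

```python
import pandas as pd

# Hypothetical Series of currency codes
codes = pd.Series(["EUR", "EUR", "BRL", "EUR"])

# Absolute counts per distinct value, most frequent first
counts = codes.value_counts()

# Shares instead of counts; the shares sum to 1
shares = codes.value_counts(normalize=True)

print(counts)
print(shares)
```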

#### map

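`DataFrame.map` applies a function that accepts and returns a scalar to every element of a DataFrame. It was added in pandas 2.1; in earlier versions the same method is called `applymap`.

It differs from `apply`: `map` passes one element at a time to the function, whereas `apply` passes an entire column (or an entire row with `axis=1`).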
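
The difference can be sketched on a small hypothetical frame. The names `toy`, `double`, and `spread` are assumptions for illustration, and the `hasattr` check is only there so that the sketch also runs on pandas versions older than 2.1:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def double(x):
    # receives a single scalar element
    return x * 2

def spread(col):
    # receives a whole column as a Series
    return col.max() - col.min()

# Element-wise: the result has the same shape as the input
elementwise = toy.map(double) if hasattr(toy, "map") else toy.applymap(double)

# Column-wise: one result per column, returned as a Series
per_column = toy.apply(spread)

print(elementwise)
print(per_column)
```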

In [94]:
# Convert each currency code to an exchange rate using a lookup table taken from a currency API
# Each value is the amount of the foreign currency that 1 euro buys
currency_conversion_rate = {"EUR": 1, "ARS": 1733.97, "BRL": 6.25, "VND": 31029.00, "AUD": 1.78, "CAD": 1.62, "CRC": 593.30}
df['Currency Value (to EUR)'] = df[["Currency-Code"]].map(lambda x: currency_conversion_rate[x])
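
A lookup such as `currency_conversion_rate[x]` raises a `KeyError` as soon as a code is missing from the dictionary. As a hedged alternative, `Series.map` also accepts the dictionary directly and then yields `NaN` for unknown keys. The sketch below uses a hypothetical subset of the rates and a made-up code `"XYZ"`:

```python
import pandas as pd

# Hypothetical subset of the rate table and a Series containing an unknown code
rates = {"EUR": 1.0, "BRL": 6.25}
codes = pd.Series(["EUR", "BRL", "XYZ"], name="Currency-Code")

# Passing the dict itself: codes that are not in the dict become NaN
mapped = codes.map(rates)

# Equivalent with an explicit function and a default value
mapped_default = codes.map(lambda code: rates.get(code, float("nan")))

print(mapped)
```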

In [95]:
df

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False,1.0


In [96]:
df.dtypes

Calling Code                object
Capital/Major City          object
Currency-Code               object
Latitude                    object
Longitude                   object
Population                  object
Co2-Emissions               object
Is tropical                   bool
Currency Value (to EUR)    float64
dtype: object

#### sort_index, sort_values

In [97]:
df.sort_index(axis=0, ascending=True, inplace=False)

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False,1.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0


In [98]:
df.sort_index(axis=1, ascending=True, inplace=False)

Country,Calling Code,Capital/Major City,Co2-Emissions,Currency Value (to EUR),Currency-Code,Is tropical,Latitude,Longitude,Population
Brazil,55.0,Brasília,462299.0,6.25,BRL,True,-14.235004,-51.92528,212559417.0
Argentina,54.0,Buenos Aires,201348.0,1733.97,ARS,False,-38.416097,-63.616672,44938712.0
Germany,49.0,Berlin,727973.0,1.0,EUR,False,51.165691,10.451526,83132799.0
France,33.0,Paris,303276.0,1.0,EUR,False,46.227638,2.213749,67059887.0
Italy,39.0,Rome,320411.0,1.0,EUR,False,41.87194,12.56738,60297396.0
Vietnam,84.0,Hanoi,192668.0,31029.0,VND,True,14.058324,108.277199,96462106.0
Australia,61.0,Canberra,375908.0,1.78,AUD,False,-25.274398,133.775136,25766605.0
Canada,1.0,Ottawa,544894.0,1.62,CAD,False,56.130366,-106.346771,36991981.0
Costa Rica,506.0,San José,8023.0,593.3,CRC,True,9.748917,-83.753428,5047561.0
Vatican City,379.0,Vatican City,,1.0,EUR,False,41.902916,12.453389,836.0


In [99]:
df.sort_values('Population', axis=0, ascending=False, inplace=False)

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False,1.0


In [100]:
df.sort_values(['Is tropical', 'Latitude'], axis=0, ascending=False, inplace=False)

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
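
The call above sorts both keys in descending order. `ascending` also accepts one flag per sort key, which allows mixed directions. A minimal sketch on a hypothetical four-row frame (`toy` is not the lesson's `df`):

```python
import pandas as pd

# Hypothetical frame with the two sort keys used above
toy = pd.DataFrame(
    {
        "Is tropical": [True, False, True, False],
        "Latitude": [14.1, 56.1, -14.2, -38.4],
    },
    index=["Vietnam", "Canada", "Brazil", "Argentina"],
)

# Tropical countries first (descending), then south to north (ascending)
ordered = toy.sort_values(["Is tropical", "Latitude"], ascending=[False, True])

print(ordered)
```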


In [101]:
df

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,,False,1.0


#### df.isnull(), df.fillna()

In [102]:
df['Co2-Emissions'] = df['Co2-Emissions'].astype(np.float64)
df['Co2-Emissions']

Country
Brazil          462299.0
Argentina       201348.0
Germany         727973.0
France          303276.0
Italy           320411.0
Vietnam         192668.0
Australia       375908.0
Canada          544894.0
Costa Rica        8023.0
Vatican City         NaN
Name: Co2-Emissions, dtype: float64

In [103]:
df.isnull()

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Brazil,False,False,False,False,False,False,False,False,False
Argentina,False,False,False,False,False,False,False,False,False
Germany,False,False,False,False,False,False,False,False,False
France,False,False,False,False,False,False,False,False,False
Italy,False,False,False,False,False,False,False,False,False
Vietnam,False,False,False,False,False,False,False,False,False
Australia,False,False,False,False,False,False,False,False,False
Canada,False,False,False,False,False,False,False,False,False
Costa Rica,False,False,False,False,False,False,False,False,False
Vatican City,False,False,False,False,False,False,True,False,False


In [104]:
df.fillna(0.0, inplace=True)
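
Filling the whole frame with one value affects every column that contains missing data. A hedged sketch of two related patterns follows: counting missing values per column with `isnull().sum()`, and restricting the fill to selected columns by passing a dictionary to `fillna`. The frame `toy` is hypothetical and only mirrors the lesson's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing emission value
toy = pd.DataFrame(
    {"Population": [836.0, 5047561.0], "Co2-Emissions": [np.nan, 8023.0]},
    index=["Vatican City", "Costa Rica"],
)

# True counts as 1, so summing the boolean mask counts missing values per column
missing_per_column = toy.isnull().sum()

# A dict limits the fill to the named columns; without inplace=True
# a new frame is returned and toy itself is left unchanged
filled = toy.fillna({"Co2-Emissions": 0.0})

print(missing_per_column)
print(filled)
```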

In [105]:
df

Country,Calling Code,Capital/Major City,Currency-Code,Latitude,Longitude,Population,Co2-Emissions,Is tropical,Currency Value (to EUR)
Brazil,55.0,Brasília,BRL,-14.235004,-51.92528,212559417.0,462299.0,True,6.25
Argentina,54.0,Buenos Aires,ARS,-38.416097,-63.616672,44938712.0,201348.0,False,1733.97
Germany,49.0,Berlin,EUR,51.165691,10.451526,83132799.0,727973.0,False,1.0
France,33.0,Paris,EUR,46.227638,2.213749,67059887.0,303276.0,False,1.0
Italy,39.0,Rome,EUR,41.87194,12.56738,60297396.0,320411.0,False,1.0
Vietnam,84.0,Hanoi,VND,14.058324,108.277199,96462106.0,192668.0,True,31029.0
Australia,61.0,Canberra,AUD,-25.274398,133.775136,25766605.0,375908.0,False,1.78
Canada,1.0,Ottawa,CAD,56.130366,-106.346771,36991981.0,544894.0,False,1.62
Costa Rica,506.0,San José,CRC,9.748917,-83.753428,5047561.0,8023.0,True,593.3
Vatican City,379.0,Vatican City,EUR,41.902916,12.453389,836.0,0.0,False,1.0
