# Unit 3 - missing values
---

1. Find rows with missing values
2. Remove missing values using dropna()  
3. Fill missing values using fillna()
4. Fill missing values using interpolate()
5. A note on slicing - copy()
6. GroupBy()





In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

<a id='section1'></a>

`null` / `na` - no value

`NaN` - **N**ot **a** **N**umber - the value is missing. This value will be ignored in calculations such as `.mean()`


### 1. Find rows with missing values

In [3]:
vacc_df.isnull().sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                      9403
people_vaccinated                      10199
people_fully_vaccinated                12953
daily_vaccinations_raw                 11522
daily_vaccinations                       228
total_vaccinations_per_hundred          9403
people_vaccinated_per_hundred          10199
people_fully_vaccinated_per_hundred    12953
daily_vaccinations_per_million           228
dtype: int64

`isnull()` is a pandas function, so either use it on a dataframe or call it through pd

In [4]:
pd.isnull(vacc_df).sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                      9403
people_vaccinated                      10199
people_fully_vaccinated                12953
daily_vaccinations_raw                 11522
daily_vaccinations                       228
total_vaccinations_per_hundred          9403
people_vaccinated_per_hundred          10199
people_fully_vaccinated_per_hundred    12953
daily_vaccinations_per_million           228
dtype: int64

In [5]:
vacc_df['daily_vaccinations'].notnull().sum()

23278

In [6]:
vacc_df['daily_vaccinations'].isnull().sum()

228

`isnan` is a numpy function

In [7]:
np.isnan(vacc_df['daily_vaccinations']).sum()

228

### 2. Remove missing values using dropna() 

##### Look at Zimbabwe for example. Zimbabwe contains missing values:

In [8]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']
#zimbabwe.head(10)

In [45]:
zimbabwe['total_vaccinations'].isnull().sum()

0

In [46]:
zimbabwe['total_vaccinations'].notnull().sum()

103

##### We can see the difference between the number of values per row:

In [11]:
zimbabwe.count()

location                               103
iso_code                               103
date                                   103
total_vaccinations                     100
people_vaccinated                      100
people_fully_vaccinated                 71
daily_vaccinations_raw                  98
daily_vaccinations                     102
total_vaccinations_per_hundred         100
people_vaccinated_per_hundred          100
people_fully_vaccinated_per_hundred     71
daily_vaccinations_per_million         102
dtype: int64

##### Remove all rows that contain one or more missing values: 

In [42]:
zimbabwe.dropna()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
23403,Zimbabwe,ZWE,2021-02-18,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.0,0.0,0.0
23404,Zimbabwe,ZWE,2021-02-19,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,328.5
23405,Zimbabwe,ZWE,2021-02-20,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,657.0
23406,Zimbabwe,ZWE,2021-02-21,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,985.5
23407,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,0.0,0.0,328.0,0.01,0.01,0.00,22.0,1314.0,1314.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0,953389.0,953389.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0,976796.0,976796.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0,1002465.0,1002465.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0,1011973.0,1011973.0


Note: `dropna()`, like most other functions in the pandas API returns a new DataFrame 
(a copy of the original with changes) as the result, so you should assign it back if you want to see changes:

In [44]:
zimbabwe

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
23403,Zimbabwe,ZWE,2021-02-18,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.0,0.0,0.0
23404,Zimbabwe,ZWE,2021-02-19,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,328.5
23405,Zimbabwe,ZWE,2021-02-20,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,657.0
23406,Zimbabwe,ZWE,2021-02-21,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,985.5
23407,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,0.0,0.0,328.0,0.01,0.01,0.00,22.0,1314.0,1314.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0,953389.0,953389.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0,976796.0,976796.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0,1002465.0,1002465.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0,1011973.0,1011973.0


assign it back:

In [43]:
zimbabwe2 = zimbabwe.dropna()
zimbabwe2

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
23403,Zimbabwe,ZWE,2021-02-18,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.0,0.0,0.0
23404,Zimbabwe,ZWE,2021-02-19,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,328.5
23405,Zimbabwe,ZWE,2021-02-20,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,657.0
23406,Zimbabwe,ZWE,2021-02-21,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,985.5
23407,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,0.0,0.0,328.0,0.01,0.01,0.00,22.0,1314.0,1314.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0,953389.0,953389.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0,976796.0,976796.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0,1002465.0,1002465.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0,1011973.0,1011973.0


##### Remove all values for a specific column - using `subset`

In [47]:
zimbabwe.dropna(subset = ['total_vaccinations'])

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
23403,Zimbabwe,ZWE,2021-02-18,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.0,0.0,0.0
23404,Zimbabwe,ZWE,2021-02-19,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,328.5
23405,Zimbabwe,ZWE,2021-02-20,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,657.0
23406,Zimbabwe,ZWE,2021-02-21,0.0,0.0,0.0,0.0,328.0,0.00,0.00,0.00,22.0,0.0,985.5
23407,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,0.0,0.0,328.0,0.01,0.01,0.00,22.0,1314.0,1314.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0,953389.0,953389.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0,976796.0,976796.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0,1002465.0,1002465.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0,1011973.0,1011973.0


For more columns:

In [16]:
zimbabwe.dropna(subset = ['total_vaccinations', 'daily_vaccinations_per_million']).head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
23407,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,,,328.0,0.01,0.01,,22.0
23408,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,2727.0,808.0,0.03,0.03,,54.0
23409,Zimbabwe,ZWE,2021-02-24,7872.0,7872.0,,3831.0,1312.0,0.05,0.05,,88.0
23410,Zimbabwe,ZWE,2021-02-25,11007.0,11007.0,,3135.0,1572.0,0.07,0.07,,106.0
23411,Zimbabwe,ZWE,2021-02-26,12579.0,12579.0,,1572.0,1750.0,0.08,0.08,,118.0


---
>A summary of the functions so far:
>
>* `.isnull()` - display rows that contain missing values
>* `.notnull()` - display rows that don't contain missing values
>* `.dropna()` - Remove rows with missing values according to parameters:
    * `.dropna()` (default) - drops rows if at least one column has NaN
    * `.dropna(subset = ['column_name'])` - drop rows that contain missing values in the subset of column names
    * `.dropna(how='all')` - drops rows only if all of its columns have NaNs
    * `.dropna(thresh = k)` - k how many non-null values you want to keep (k=3 means the row should contain at least 3 non-null values)
    * `.dropna(axis=1)` - drop columns instead of rows
> 

See documnetation [here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

---


### 3. Fill missing values using fillna()

Use `.fillna()` to fill missing dataframe values with:
* Whatever value you choose
* Mean, median, mode

Replace all NaNs with 0s

In [50]:
vacc_df.fillna(0, inplace = True )
vacc_df

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
0,Afghanistan,AFG,2021-02-22,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.0,0.0,0.000000e+00
1,Afghanistan,AFG,2021-02-23,0.0,0.0,0.0,0.0,1367.0,0.00,0.00,0.00,35.0,0.0,1.366667e+03
2,Afghanistan,AFG,2021-02-24,0.0,0.0,0.0,0.0,1367.0,0.00,0.00,0.00,35.0,0.0,2.733333e+03
3,Afghanistan,AFG,2021-02-25,0.0,0.0,0.0,0.0,1367.0,0.00,0.00,0.00,35.0,0.0,4.100000e+03
4,Afghanistan,AFG,2021-02-26,0.0,0.0,0.0,0.0,1367.0,0.00,0.00,0.00,35.0,0.0,5.466667e+03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0,953389.0,9.533890e+05
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0,976796.0,9.767960e+05
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0,1002465.0,1.002465e+06
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0,1011973.0,1.011973e+06


>`inplace = False` is the default. This doesn't change the vacc_df dataframe. 
>
>To change it you need:
>
>`vacc_df.fillna(0 , inplace = True)`
>
>or to assign:
>
>`vacc_df = vacc_df.fillna(0)`
>
>But we won't do that! This is where some **business understanding** comes in: it's not a good idea to fill a column like `total_vaccinations` with 0s. 
>
>See what happens:

In [18]:
vacc_df.fillna(0).head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,AFG,2021-02-23,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
2,Afghanistan,AFG,2021-02-24,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
3,Afghanistan,AFG,2021-02-25,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
4,Afghanistan,AFG,2021-02-26,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
5,Afghanistan,AFG,2021-02-27,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,0.0,0.0,1367.0,0.02,0.02,0.0,35.0
7,Afghanistan,AFG,2021-03-01,0.0,0.0,0.0,0.0,1580.0,0.0,0.0,0.0,41.0
8,Afghanistan,AFG,2021-03-02,0.0,0.0,0.0,0.0,1794.0,0.0,0.0,0.0,46.0
9,Afghanistan,AFG,2021-03-03,0.0,0.0,0.0,0.0,2008.0,0.0,0.0,0.0,52.0


So we'll use 0's only for the daily_vaccinations columns, and perhaps for some other columns (which?)

In [51]:
vacc_df['daily_vaccinations'].fillna(0 , inplace = True)

checkout some of the data to see that it works

In [20]:
vacc_df.iloc[0:3,[0,2,7]]

Unnamed: 0,location,date,daily_vaccinations
0,Afghanistan,2021-02-22,0.0
1,Afghanistan,2021-02-23,1367.0
2,Afghanistan,2021-02-24,1367.0


Other options - using central measures:

In [21]:
# Using median
vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].median(), inplace=True)
  
# Using mean
#vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mean(), inplace=True)
  
# Using mode
#vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mode(), inplace=True)


What about `total_vaccinations`? - there are some `NaN`s there as well:

In [22]:
vacc_df.iloc[52:62,[0,2,3]]

Unnamed: 0,location,date,total_vaccinations
52,Afghanistan,2021-04-15,
53,Afghanistan,2021-04-16,
54,Afghanistan,2021-04-17,
55,Afghanistan,2021-04-18,
56,Afghanistan,2021-04-19,
57,Afghanistan,2021-04-20,
58,Afghanistan,2021-04-21,
59,Afghanistan,2021-04-22,240000.0
60,Afghanistan,2021-04-23,
61,Afghanistan,2021-04-24,


For the `total_vaccinations` we'll use `ffill` which fills the missing values with first non-missing value that occurs before it.

Yes, `bfill` exists as well. If does what you think it does :-)

In [23]:
vacc_df[['total_vaccinations']].fillna(method='ffill')[52:62]
#vacc_df['total_vaccinations'][52:62]

Unnamed: 0,total_vaccinations
52,120000.0
53,120000.0
54,120000.0
55,120000.0
56,120000.0
57,120000.0
58,120000.0
59,240000.0
60,240000.0
61,240000.0


The first value for some country might be NaN 

Business understanding: this isn't good enought! We need to aggregate by country!!

Use `groupby()` and `apply`  (This is more advanced and we will return to it shortly)

We will create a new column here, `newTotal` - so we can see the difference in `total_vaccinations`


In [24]:
vacc_df['newTotal'] = vacc_df.groupby('location')[['total_vaccinations']].apply(lambda x: x.fillna(method='ffill'))
vacc_df.iloc[52:62,[0,2,3,12]]

Unnamed: 0,location,date,total_vaccinations,newTotal
52,Afghanistan,2021-04-15,,120000.0
53,Afghanistan,2021-04-16,,120000.0
54,Afghanistan,2021-04-17,,120000.0
55,Afghanistan,2021-04-18,,120000.0
56,Afghanistan,2021-04-19,,120000.0
57,Afghanistan,2021-04-20,,120000.0
58,Afghanistan,2021-04-21,,120000.0
59,Afghanistan,2021-04-22,240000.0,240000.0
60,Afghanistan,2021-04-23,,240000.0
61,Afghanistan,2021-04-24,,240000.0


### 4. Fill missing values using interpolate()

In [25]:
vacc_df['newTotal2'] = vacc_df['total_vaccinations'].interpolate(method ='linear') 
vacc_df.iloc[52:62,[0,2,3,12, 13]]

Unnamed: 0,location,date,total_vaccinations,newTotal,newTotal2
52,Afghanistan,2021-04-15,,120000.0,184000.0
53,Afghanistan,2021-04-16,,120000.0,192000.0
54,Afghanistan,2021-04-17,,120000.0,200000.0
55,Afghanistan,2021-04-18,,120000.0,208000.0
56,Afghanistan,2021-04-19,,120000.0,216000.0
57,Afghanistan,2021-04-20,,120000.0,224000.0
58,Afghanistan,2021-04-21,,120000.0,232000.0
59,Afghanistan,2021-04-22,240000.0,240000.0,240000.0
60,Afghanistan,2021-04-23,,240000.0,253921.157895
61,Afghanistan,2021-04-24,,240000.0,267842.315789


---
>A summary of the functions so far:
>
>* `.fillna()` - fill missing values according to parameters:
    * `.fillna('k')`  - with value k, create a new dataframe
    * `.fillna('k', inplace = True)` - with value k, into the existing dataframe
    * `.fillna(method='ffill')` - fill with first non-missing value that occurs before it 
    * `.fillna(method='bfill')` - fill with first non-missing value that occurs after it  
> * `interpolate` - fill using some interpolation technique
>
>See documnetation:
>
>* [Missing data handling documentation](https://pandas-docs.github.io/pandas-docs-travis/reference/frame.html#missing-data-handling)
---

### 5. A note on slicing

Slicing is taking only part of a dataframe. For example - the slice we named zimbabwe:

In [26]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

When we change data in a slice, we are changing the ORIGINAL dataframe. This will cause a warning to appear:

In [27]:
zimbabwe.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


The warning will disappear if you rerun the command, but it can still be scary. Best way to avoid it is to create a `copy` of the dataframe:

In [28]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe'].copy()
zimbabwe.fillna(0, inplace=True)

This works fine, no warnings. But - this won't change the original dataframe (which might be a good thing, if you didn't plan to change it, or a bad thing, if you did)

What about changes in the original dataframe? Your copy will not change.
If you do  want your copy to change, use a shallow copy:

In [29]:
small_example = pd.Series([1, 2], index=["a", "b"])
small_example

a    1
b    2
dtype: int64

In [30]:
my_deep_copy = small_example.copy()
my_deep_copy

a    1
b    2
dtype: int64

In [31]:
my_shallow_copy = small_example.copy(deep=False)
my_shallow_copy

a    1
b    2
dtype: int64

Make a change to the dataframe - where will it appear?

In [32]:
small_example[0] = -100
small_example

a   -100
b      2
dtype: int64

In [33]:
my_deep_copy

a    1
b    2
dtype: int64

In [34]:
my_shallow_copy

a   -100
b      2
dtype: int64

### 6. Groupby()

#### Group according to something + some columns + some summary statistic

The `mean` of `daily_vaccinations` according to `location`:


In [54]:
vacc_df.groupby('location')[['daily_vaccinations']].median()

Unnamed: 0_level_0,daily_vaccinations
location,Unnamed: 1_level_1
Afghanistan,4822.0
Africa,202056.0
Albania,2606.0
Algeria,3748.0
Andorra,197.0
...,...
Wallis and Futuna,103.0
World,7218390.0
Yemen,4276.0
Zambia,3768.0


The same, but for two columns (though as we said, not much business logic for mean value of `total_vaccinations`)

In [63]:
vacc_df.loc[(vacc_df.date == '2020-12-02')]


Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,newTotal,newTotal2
6533,Europe,OWID_EUR,2020-12-02,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
8928,High income,OWID_HIC,2020-12-02,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
15893,Norway,NOR,2020-12-02,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
23154,World,OWID_WRL,2020-12-02,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


Still the same, but using a lambda function

In [37]:
vacc_df.groupby('location')[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.mean())

Unnamed: 0_level_0,daily_vaccinations,total_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,6.065878e+03,3.483454e+05
Africa,2.123865e+05,1.114164e+07
Albania,5.358493e+03,2.986477e+05
Algeria,3.139545e+03,2.501000e+04
Andorra,2.800796e+02,1.246850e+04
...,...,...
Wallis and Futuna,1.213594e+02,4.997700e+03
World,1.018555e+07,5.067592e+08
Yemen,4.072381e+03,6.131250e+04
Zambia,3.108745e+03,7.530030e+04


`fillna()` is not an aggregation function, so the result is different:

In [38]:
vacc_df.groupby('location')[['daily_vaccinations']].apply(lambda x: x.fillna(x.mean()))

Unnamed: 0_level_0,Unnamed: 1_level_0,daily_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,0,0.0
Afghanistan,1,1367.0
Afghanistan,2,1367.0
Afghanistan,3,1367.0
Afghanistan,4,1367.0
...,...,...
Zimbabwe,23501,12285.0
Zimbabwe,23502,12695.0
Zimbabwe,23503,14056.0
Zimbabwe,23504,14420.0


The same but for two columns:

In [39]:
vacc_df.groupby('location')[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.fillna(x.mean()))

Unnamed: 0_level_0,Unnamed: 1_level_0,daily_vaccinations,total_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,0,0.0,0.000000e+00
Afghanistan,1,1367.0,3.483454e+05
Afghanistan,2,1367.0,3.483454e+05
Afghanistan,3,1367.0,3.483454e+05
Afghanistan,4,1367.0,3.483454e+05
...,...,...,...
Zimbabwe,23501,12285.0,9.533890e+05
Zimbabwe,23502,12695.0,9.767960e+05
Zimbabwe,23503,14056.0,1.002465e+06
Zimbabwe,23504,14420.0,1.011973e+06


---
>A summary:
>
>* `.copy()` - creates a copy of the slice of the dataframe
>
>* `.copy(deep=False)` - updates to the original dataframe will show in the copy
>
>* `.groupby()` - group according to the columns specified
---