# Unit 2
---




1. [Introducing Pandas](#section1)
2. [Reading files](#section2)
3. [Selecting data](#section3)
4. [Conditional selection](#section4)








<a id='section1'></a>

## 1. Introducing Pandas
---

<div>
<img src="images/pandas.JPG" width="400"/>
</div>



[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)



To begin, we need to import pandas.
By convention it is imported under the alias `pd`, so whenever you see `pd`, it refers to pandas.

In [5]:
import numpy as np
import pandas as pd

Pandas is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet).

There are two main data structures used by pandas:
- Series: like a vector or a list
- DataFrame: equivalent to a table

Each column in a pandas DataFrame is a pandas Series. We will mainly be looking at the DataFrame.
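As a minimal sketch of how the two structures relate (the names `scores` and `toy_df` and their values are made up for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labelled array
scores = pd.Series([4.6, 8.3, 9.0], name='score')

# A DataFrame is a table; here it is built from a dict of columns
toy_df = pd.DataFrame({'name': ['Jane', 'Nick', 'Aaron'],
                       'score': [4.6, 8.3, 9.0]})

# Selecting a single column of the DataFrame returns a Series
column = toy_df['score']
print(type(scores))
print(type(column))
```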

We can easily create a pandas DataFrame by reading a .csv file.


<a id='section2'></a>


## 2. Reading files
---

<div>
<img src="images/reading.PNG" width="400"/>
</div>


We will read the whole file at once using Pandas.
Sometimes you might want to read the file line by line and process each line. That's possible, of course. See for example [here.](https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/)

We will read data on [COVID-19 vaccinations](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations)

To do that, I retrieved the URL of the raw data:

<div>
<img src="images/raw.png" width="800"/>
</div>


In [6]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

`read_csv` accepts many optional parameters. See the 
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)

For example, `sep='\t'` is used for tab-delimited files and `usecols` reads only specific columns. 
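A small sketch of these two options, assuming a made-up tab-delimited text held in memory rather than a real file (the column names and values are hypothetical):

```python
import io
import pandas as pd

# Hypothetical tab-delimited data; io.StringIO lets read_csv treat it like a file
tsv_text = "location\tdate\tdoses\nA\t2021-01-01\t10\nB\t2021-01-02\t20\n"

# sep handles the tab delimiter; usecols keeps only the named columns
small_df = pd.read_csv(io.StringIO(tsv_text), sep='\t', usecols=['location', 'doses'])
print(small_df)
```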

In [7]:
type(vacc_df)

pandas.core.frame.DataFrame

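View the dataframe (pandas displays only the first and last rows; the `.shape` attribute returns the number of rows and columns as a tuple):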

In [46]:
vacc_df

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.00,0.00,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0


View basic information:

In [9]:
vacc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23506 entries, 0 to 23505
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   location                             23506 non-null  object 
 1   iso_code                             23506 non-null  object 
 2   date                                 23506 non-null  object 
 3   total_vaccinations                   14103 non-null  float64
 4   people_vaccinated                    13307 non-null  float64
 5   people_fully_vaccinated              10553 non-null  float64
 6   daily_vaccinations_raw               11984 non-null  float64
 7   daily_vaccinations                   23278 non-null  float64
 8   total_vaccinations_per_hundred       14103 non-null  float64
 9   people_vaccinated_per_hundred        13307 non-null  float64
 10  people_fully_vaccinated_per_hundred  10553 non-null  float64
 11  daily_vaccinations_per_milli

In [10]:
vacc_df.columns

Index(['location', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated',
       'daily_vaccinations_raw', 'daily_vaccinations',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred',
       'daily_vaccinations_per_million'],
      dtype='object')

View the first few rows:

In [11]:
vacc_df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0


What do you think the `tail` method does? Try it out!

What happens if we just type `vacc_df`, without `head` or `tail`?

In [12]:
vacc_df.tail()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.3,970.0
23505,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0


In [52]:
vacc_df.describe()

Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
count,14103.0,13307.0,10553.0,11984.0,23278.0,14103.0,13307.0,10553.0,23278.0
mean,26560470.0,14897380.0,8060504.0,650470.1,329465.4,19.011819,13.108453,7.150422,3111.603016
std,113101300.0,57212220.0,30941860.0,2510323.0,1694184.0,26.402102,16.787676,11.275678,7895.901272
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,103935.5,81098.5,36885.0,4691.5,943.0,1.61,1.37,0.6,337.0
50%,816175.0,602150.0,311945.0,25622.0,7712.0,8.07,5.89,2.8,1567.0
75%,4944926.0,3511914.0,1911143.0,151935.5,45419.0,25.635,18.6,8.77,4309.0
max,1937501000.0,840956200.0,430361700.0,40325490.0,33225240.0,230.09,116.0,114.09,1000000.0



A summary of the functions so far:

* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
* `.info()` - View basic information about rows, columns & data types
* `.describe()` - View statistical information about numeric columns
* `.columns` - Get the list of column names
* `.shape` - Get the number of rows & columns as a tuple
* `.head()`, `.tail()` - View the first or last few rows
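
The attributes and methods in this summary can be tried on a small frame without downloading anything. The sketch below uses a made-up `toy_df`; the names and values are illustrative only:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['A', 'A', 'B', 'B', 'C'],
                       'doses': [10, 20, 5, 15, 30]})

n_rows, n_cols = toy_df.shape   # (rows, columns) tuple
first_two = toy_df.head(2)      # head and tail take an optional number of rows
last_two = toy_df.tail(2)
print(n_rows, n_cols)
print(list(toy_df.columns))
```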


<a id='section3'></a>

---
## 3. Selecting data

![](https://i.imgur.com/zfxLzEv.png)

In [14]:
# Pandas format is NOT similar to this
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]
covid_data_list

[{'date': '2020-08-30',
  'new_cases': 1444,
  'new_deaths': 1,
  'new_tests': 53541},
 {'date': '2020-08-31',
  'new_cases': 1365,
  'new_deaths': 4,
  'new_tests': 42583},
 {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
 {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8},
 {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6}]

In [15]:
# Pandas format is similar to this
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}
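
A dictionary of columns like the one above can be passed directly to `pd.DataFrame`. A minimal sketch, repeating the dictionary so the example stands on its own (`covid_df` is a name introduced here for illustration):

```python
import pandas as pd

covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests':  [53541, 42583, 54395, None, None]
}

# Each key becomes a column; each list holds that column's values
covid_df = pd.DataFrame(covid_data_dict)
print(covid_df)
```

The two `None` entries show up as missing values (`NaN`) in the `new_tests` column.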

#### The index of a dataframe doesn't have to be numeric

In [16]:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
df

Unnamed: 0,age,color,food,height,score,state
Jane,30,blue,Steak,165,4.6,NY
Nick,2,green,Lamb,70,8.3,TX
Aaron,12,red,Mango,120,9.0,FL
Penelope,4,white,Apple,80,3.3,AL
Dean,32,gray,Cheese,180,1.8,AK
Christina,33,black,Melon,172,9.5,TX
Cornelia,69,red,Beans,150,2.2,TX


In our case it just happens to be numeric:

In [17]:
vacc_df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0


In [18]:
vacc_df.location
#is the same as this:
vacc_df['location']


0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
23501       Zimbabwe
23502       Zimbabwe
23503       Zimbabwe
23504       Zimbabwe
23505       Zimbabwe
Name: location, Length: 23506, dtype: object

Note: the `.` notation works only for columns whose names contain no spaces or special characters and do not clash with an existing DataFrame attribute or method.
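
A short sketch of this restriction, using a made-up frame with one column name that contains a space:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['A', 'B'],
                       'daily doses': [10, 20]})

# Both notations return the same Series for a simple column name
same = toy_df.location.equals(toy_df['location'])

# A name with a space can only be reached with brackets
doses = toy_df['daily doses']
print(same, list(doses))
```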

What data type is `vacc_df.location`? (list? Series? DataFrame?)

In [19]:
type(vacc_df.location)

pandas.core.series.Series

In [54]:
#retrieve a specific cell
vacc_df.location[2131]

'Barbados'

In [21]:
#retrieve two columns
vacc_df[['location','date']]

Unnamed: 0,location,date
0,Afghanistan,2021-02-22
1,Afghanistan,2021-02-23
2,Afghanistan,2021-02-24
3,Afghanistan,2021-02-25
4,Afghanistan,2021-02-26
...,...,...
23501,Zimbabwe,2021-05-27
23502,Zimbabwe,2021-05-28
23503,Zimbabwe,2021-05-29
23504,Zimbabwe,2021-05-30


#### Selecting subsets of rows and columns

One way to do that is `.iloc`.

`.iloc` - selects subsets of rows and columns by integer location only

In [55]:
# Rows:
# vacc_df.iloc[0]  # first row
vacc_df.iloc[-1234]  # the 1234th row from the end (use -1 for the last row)

location                               United Kingdom
iso_code                                          GBR
date                                       2021-05-13
total_vaccinations                         5.5435e+07
people_vaccinated                          3.6116e+07
people_fully_vaccinated                    1.9319e+07
daily_vaccinations_raw                         637325
daily_vaccinations                             514372
total_vaccinations_per_hundred                  81.66
people_vaccinated_per_hundred                    53.2
people_fully_vaccinated_per_hundred             28.46
daily_vaccinations_per_million                   7577
Name: 22272, dtype: object

The `:` operator:

- when used alone, it means "everything"

- it is also used to indicate a ***slice*** of values
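
As a minimal sketch of both uses of `:` with `.iloc`, on a made-up frame (`toy_df` and its values are illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame({'a': [1, 2, 3, 4],
                       'b': [10, 20, 30, 40],
                       'c': [5, 6, 7, 8]})

everything = toy_df.iloc[:, :]   # all rows, all columns
middle_rows = toy_df.iloc[1:3]   # rows at positions 1 and 2 (position 3 is excluded)
first_col = toy_df.iloc[:, 0]    # all rows of the first column
last_row = toy_df.iloc[-1]       # negative positions count from the end
```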


In [71]:
vacc_df.iloc[1:3]  # second and third rows (the end of the slice is excluded)
vacc_df.iloc[[-1,2,22]]  # a few specific rows

# Columns:
vacc_df.iloc[:,0]  # first column of the dataframe
vacc_df.iloc[:,1]  # second column of the dataframe
vacc_df.iloc[:,-1]  # last column of the dataframe

# Rows and columns:
vacc_df.iloc[0:5,1]  # first five rows of the second column
#vacc_df.iloc[:, 0:2]  # first two columns of the dataframe with all rows
#vacc_df.iloc[[0,3,6,24], [0,5,6]]  # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
vacc_df.iloc[700:770,8:9]  # rows 700-769 of the ninth column, returned as a dataframe
vacc_df.iloc[:,9]  # tenth column, all rows (only this last expression is displayed)

0        0.00
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
23501    4.36
23502    4.42
23503    4.49
23504    4.51
23505    4.55
Name: people_vaccinated_per_hundred, Length: 23506, dtype: float64

In [24]:
vacc_df.iloc[0:19]


Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0
5,Afghanistan,AFG,2021-02-27,,,,,1367.0,,,,35.0
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,,,1367.0,0.02,0.02,,35.0
7,Afghanistan,AFG,2021-03-01,,,,,1580.0,,,,41.0
8,Afghanistan,AFG,2021-03-02,,,,,1794.0,,,,46.0
9,Afghanistan,AFG,2021-03-03,,,,,2008.0,,,,52.0


What if I want to select the 'daily_vaccinations' column, but I don't remember which column it is?

Use `.loc`

`.loc` - selects subsets of rows and columns by label only. Allowed inputs are:

- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

- A list or array of labels, e.g. ['a', 'b', 'c'].

- A slice object with labels, e.g. 'a':'f'.
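
The three kinds of input can be sketched on a frame with a non-numeric index. The example below assumes a made-up `people` frame, modelled on the small `df` shown earlier:

```python
import pandas as pd

people = pd.DataFrame({'age': [30, 2, 12, 4],
                       'state': ['NY', 'TX', 'FL', 'AL']},
                      index=['Jane', 'Nick', 'Aaron', 'Penelope'])

one_row = people.loc['Nick']                  # single label, returns a Series
some_rows = people.loc[['Jane', 'Aaron']]     # list of labels
label_slice = people.loc['Nick':'Penelope']   # label slice, both ends included
one_cell = people.loc['Aaron', 'state']       # row label and column label
```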

In [25]:
vacc_df.loc[:,'daily_vaccinations']

0            NaN
1         1367.0
2         1367.0
3         1367.0
4         1367.0
          ...   
23501    12285.0
23502    12695.0
23503    14056.0
23504    14420.0
23505    15022.0
Name: daily_vaccinations, Length: 23506, dtype: float64

I'm missing the location. Let's add it. 

In [26]:
vacc_df.loc[0:3,['location','daily_vaccinations']]

#vacc_df.loc[0:4:,['location', 'total_vaccinations']]  this is to select specific rows

Unnamed: 0,location,daily_vaccinations
0,Afghanistan,
1,Afghanistan,1367.0
2,Afghanistan,1367.0
3,Afghanistan,1367.0


Semantics are similar to iloc. But note:

- `iloc` excludes the last element.  `df.iloc[0:1000]` will return entries 0...999
- `loc`, includes the last element.  `df.loc[0:1000]` will return entries 0...1000

Try it yourself! What is the difference between:

> vacc_df.iloc[0:5]

> vacc_df.loc[0:5]

In [72]:
vacc_df.iloc[0:5,1:11]

Unnamed: 0,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred
0,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,
1,AFG,2021-02-23,,,,,1367.0,,,
2,AFG,2021-02-24,,,,,1367.0,,,
3,AFG,2021-02-25,,,,,1367.0,,,
4,AFG,2021-02-26,,,,,1367.0,,,


In [73]:
vacc_df.loc[0:5,1:11]

TypeError: cannot do slice indexing on Index with these indexers [1] of type int
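
The error above arises because `.loc` expects column labels, while `1:11` is a slice of integer positions. A minimal sketch of the two equivalent styles on a made-up frame (`toy_df` is illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame({'a': [1, 2, 3, 4],
                       'b': [10, 20, 30, 40],
                       'c': [5, 6, 7, 8]})

by_position = toy_df.iloc[0:2, 1:3]   # rows at positions 0-1, columns at positions 1-2
by_label = toy_df.loc[0:2, 'b':'c']   # rows labelled 0-2, columns 'b' through 'c'
print(by_position.shape, by_label.shape)
```

Both select columns `b` and `c`, but the label-based row slice includes its end point, so `by_label` has one more row.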

---
---

Now you: 

What would you do to select:

a. the first five rows?

b. the first two columns, all rows?

c. the 1st and 3rd rows and the 2nd and 4th columns?

---
---

a. the first five rows are:

In [29]:
vacc_df.head()



Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0


b. the first two columns, all rows are:

In [30]:
vacc_df.iloc[:, 0:2]


Unnamed: 0,location,iso_code
0,Afghanistan,AFG
1,Afghanistan,AFG
2,Afghanistan,AFG
3,Afghanistan,AFG
4,Afghanistan,AFG
...,...,...
23501,Zimbabwe,ZWE
23502,Zimbabwe,ZWE
23503,Zimbabwe,ZWE
23504,Zimbabwe,ZWE


c. the 1st and 3rd rows and the 2nd and 4th columns are (positions start at 0):

In [31]:
vacc_df.iloc[[0,2],[1,3]]


Unnamed: 0,iso_code,total_vaccinations
0,AFG,0.0
2,AFG,


---
A summary of the functions in this unit:

* `.iloc` - selects rows and columns by integer location
* `.loc` - selects rows and columns by label location



Note: the plain indexing operator (`[]`), like the one used on dictionaries, also works in pandas. For more advanced selections, however, it is better to get used to `loc` and `iloc`.

---

<a id='section4'></a>

## 4. Conditional selection




In [32]:
vacc_df.loc[:,'location'] == 'Israel'


0        False
1        False
2        False
3        False
4        False
         ...  
23501    False
23502    False
23503    False
23504    False
23505    False
Name: location, Length: 23506, dtype: bool

This creates a Series of `True`/`False` values, one per row.

We can pass this Series to the dataframe to select only the matching rows:

In [33]:
vacc_df[vacc_df.loc[:,'location'] == 'Israel'] 

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
10349,Israel,ISR,2020-12-19,60.0,60.0,,,,0.00,0.00,,
10350,Israel,ISR,2020-12-20,7419.0,7419.0,,7359.0,7359.0,0.09,0.09,,850.0
10351,Israel,ISR,2020-12-21,32302.0,32302.0,,24883.0,16121.0,0.37,0.37,,1863.0
10352,Israel,ISR,2020-12-22,76920.0,76920.0,,44618.0,25620.0,0.89,0.89,,2960.0
10353,Israel,ISR,2020-12-23,139760.0,139760.0,,62840.0,34925.0,1.61,1.61,,4035.0
...,...,...,...,...,...,...,...,...,...,...,...,...
10508,Israel,ISR,2021-05-27,10574763.0,5447371.0,5127392.0,4522.0,3626.0,122.17,62.94,59.24,419.0
10509,Israel,ISR,2021-05-28,10576599.0,5447984.0,5128615.0,1836.0,3572.0,122.19,62.94,59.25,413.0
10510,Israel,ISR,2021-05-29,10576836.0,5448078.0,5128758.0,237.0,3540.0,122.20,62.94,59.25,409.0
10511,Israel,ISR,2021-05-30,10580056.0,5449465.0,5130591.0,3220.0,3242.0,122.23,62.96,59.28,375.0


Another way:

In [34]:
vacc_df.loc[vacc_df.location == 'Israel']

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
10349,Israel,ISR,2020-12-19,60.0,60.0,,,,0.00,0.00,,
10350,Israel,ISR,2020-12-20,7419.0,7419.0,,7359.0,7359.0,0.09,0.09,,850.0
10351,Israel,ISR,2020-12-21,32302.0,32302.0,,24883.0,16121.0,0.37,0.37,,1863.0
10352,Israel,ISR,2020-12-22,76920.0,76920.0,,44618.0,25620.0,0.89,0.89,,2960.0
10353,Israel,ISR,2020-12-23,139760.0,139760.0,,62840.0,34925.0,1.61,1.61,,4035.0
...,...,...,...,...,...,...,...,...,...,...,...,...
10508,Israel,ISR,2021-05-27,10574763.0,5447371.0,5127392.0,4522.0,3626.0,122.17,62.94,59.24,419.0
10509,Israel,ISR,2021-05-28,10576599.0,5447984.0,5128615.0,1836.0,3572.0,122.19,62.94,59.25,413.0
10510,Israel,ISR,2021-05-29,10576836.0,5448078.0,5128758.0,237.0,3540.0,122.20,62.94,59.25,409.0
10511,Israel,ISR,2021-05-30,10580056.0,5449465.0,5130591.0,3220.0,3242.0,122.23,62.96,59.28,375.0
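
The two forms above can be compared on a small frame. This is a sketch with made-up values; `toy_df` and `mask` are names introduced for illustration:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['Israel', 'Denmark', 'Israel', 'Chile'],
                       'doses': [60, 1, 7419, 5]})

mask = toy_df['location'] == 'Israel'   # boolean Series, one value per row
with_brackets = toy_df[mask]            # keep rows where the mask is True
with_loc = toy_df.loc[mask]             # same selection through .loc
print(mask.sum())                       # True counts as 1, so this counts matches
```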


Select two countries:

In [78]:
two_countries = vacc_df.loc[(vacc_df.location == 'Israel') | (vacc_df.location == 'Denmark')]
two_countries

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
5278,Denmark,DNK,2020-12-17,1.0,1.0,,,,0.00,0.00,,
5279,Denmark,DNK,2020-12-18,,,,,0.0,,,,0.0
5280,Denmark,DNK,2020-12-19,2.0,2.0,,,0.0,0.00,0.00,,0.0
5281,Denmark,DNK,2020-12-20,,,,,0.0,,,,0.0
5282,Denmark,DNK,2020-12-21,3.0,3.0,,,0.0,0.00,0.00,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
10508,Israel,ISR,2021-05-27,10574763.0,5447371.0,5127392.0,4522.0,3626.0,122.17,62.94,59.24,419.0
10509,Israel,ISR,2021-05-28,10576599.0,5447984.0,5128615.0,1836.0,3572.0,122.19,62.94,59.25,413.0
10510,Israel,ISR,2021-05-29,10576836.0,5448078.0,5128758.0,237.0,3540.0,122.20,62.94,59.25,409.0
10511,Israel,ISR,2021-05-30,10580056.0,5449465.0,5130591.0,3220.0,3242.0,122.23,62.96,59.28,375.0
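
Each condition needs its own parentheses because `|` binds more tightly than `==`. When matching several values of the same column, `isin` is a possible alternative to chaining `|`. A sketch on made-up data:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['Israel', 'Denmark', 'Israel', 'Chile'],
                       'doses': [60, 1, 7419, 5]})

# Two conditions combined with | (element-wise "or")
with_or = toy_df.loc[(toy_df.location == 'Israel') | (toy_df.location == 'Denmark')]

# The same selection, testing membership in a list of values
with_isin = toy_df.loc[toy_df.location.isin(['Israel', 'Denmark'])]
print(len(with_or))
```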


Only the index labels of the selected rows:

In [36]:
two_countries.index.values

array([ 5278,  5279,  5280,  5281,  5282,  5283,  5284,  5285,  5286,
        5287,  5288,  5289,  5290,  5291,  5292,  5293,  5294,  5295,
        5296,  5297,  5298,  5299,  5300,  5301,  5302,  5303,  5304,
        5305,  5306,  5307,  5308,  5309,  5310,  5311,  5312,  5313,
        5314,  5315,  5316,  5317,  5318,  5319,  5320,  5321,  5322,
        5323,  5324,  5325,  5326,  5327,  5328,  5329,  5330,  5331,
        5332,  5333,  5334,  5335,  5336,  5337,  5338,  5339,  5340,
        5341,  5342,  5343,  5344,  5345,  5346,  5347,  5348,  5349,
        5350,  5351,  5352,  5353,  5354,  5355,  5356,  5357,  5358,
        5359,  5360,  5361,  5362,  5363,  5364,  5365,  5366,  5367,
        5368,  5369,  5370,  5371,  5372,  5373,  5374,  5375,  5376,
        5377,  5378,  5379,  5380,  5381,  5382,  5383,  5384,  5385,
        5386,  5387,  5388,  5389,  5390,  5391,  5392,  5393,  5394,
        5395,  5396,  5397,  5398,  5399,  5400,  5401,  5402,  5403,
        5404,  5405,

The first index label:

In [79]:
two_countries.index.values[0]

5278

How many rows are there for the two countries? (`count` reports the number of non-missing values in each column.)

In [80]:
two_countries.count()

location                               329
iso_code                               329
date                                   329
total_vaccinations                     326
people_vaccinated                      326
people_fully_vaccinated                295
daily_vaccinations_raw                 321
daily_vaccinations                     327
total_vaccinations_per_hundred         326
people_vaccinated_per_hundred          326
people_fully_vaccinated_per_hundred    295
daily_vaccinations_per_million         327
dtype: int64
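
Because `count` skips missing values, columns with gaps report fewer entries than the number of rows. A minimal sketch of the difference, on a made-up frame:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'location': ['A', 'A', 'B'],
                       'doses': [10.0, np.nan, 5.0]})

non_null = toy_df.count()   # per-column count of non-missing values
n_rows = len(toy_df)        # number of rows, missing values included
print(non_null)
print(n_rows)
```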

At the end of the file we have some world data.

Use `str.contains` if you are not sure what this location is called:

In [39]:
vacc_df[vacc_df['location'].str.contains('World')]                          

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
23154,World,OWID_WRL,2020-12-02,1.000000e+00,1.0,,,,0.00,0.00,,
23155,World,OWID_WRL,2020-12-03,1.000000e+00,1.0,,0.0,0.0,0.00,0.00,,0.0
23156,World,OWID_WRL,2020-12-04,2.000000e+00,2.0,,1.0,0.0,0.00,0.00,,0.0
23157,World,OWID_WRL,2020-12-05,2.000000e+00,2.0,,0.0,0.0,0.00,0.00,,0.0
23158,World,OWID_WRL,2020-12-06,2.000000e+00,2.0,,0.0,0.0,0.00,0.00,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
23330,World,OWID_WRL,2021-05-27,1.813771e+09,810431629.0,414771680.0,33957660.0,31080505.0,23.27,10.40,5.32,3987.0
23331,World,OWID_WRL,2021-05-28,1.845718e+09,818819847.0,419760225.0,31947359.0,31483414.0,23.68,10.50,5.39,4039.0
23332,World,OWID_WRL,2021-05-29,1.876595e+09,827707203.0,423563234.0,30876843.0,32007393.0,24.07,10.62,5.43,4106.0
23333,World,OWID_WRL,2021-05-30,1.906443e+09,835289867.0,427421843.0,29848009.0,32897914.0,24.46,10.72,5.48,4220.0
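
As a sketch of how `str.contains` behaves, here it is applied to a few made-up location names. The `case=False` argument makes the match case-insensitive:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['World', 'Upper middle income',
                                    'Lower middle income', 'Israel']})

# True wherever the text contains the given substring
income_mask = toy_df['location'].str.contains('income')

# Ignore upper/lower case when matching
world_mask = toy_df['location'].str.contains('world', case=False)
print(toy_df[income_mask])
```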


Remove the world data:

In [40]:
vacc_df_noWorld = vacc_df.loc[vacc_df.location != 'World']
vacc_df_noWorld.tail()



Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
23501,Zimbabwe,ZWE,2021-05-27,953389.0,648121.0,305268.0,16349.0,12285.0,6.41,4.36,2.05,827.0
23502,Zimbabwe,ZWE,2021-05-28,976796.0,656630.0,320166.0,23407.0,12695.0,6.57,4.42,2.15,854.0
23503,Zimbabwe,ZWE,2021-05-29,1002465.0,666786.0,335679.0,25669.0,14056.0,6.74,4.49,2.26,946.0
23504,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.3,970.0
23505,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0


Find the location with the maximum total vaccinations (note that the data also contains aggregates such as continents):

In [41]:
max_vacc = vacc_df_noWorld.total_vaccinations.max()
max_vacc

1062201277.0

In [83]:
vacc_df_noWorld.loc[vacc_df_noWorld.total_vaccinations == max_vacc]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
1266,Asia,OWID_ASI,2021-05-31,1062201000.0,276224506.0,104895621.0,26636410.0,24089671.0,22.89,5.95,2.26,5192.0
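
An alternative to comparing every row against the maximum is `idxmax`, which returns the index label of the largest value. This sketch uses made-up numbers and is not the approach taken above:

```python
import pandas as pd

toy_df = pd.DataFrame({'location': ['A', 'B', 'C'],
                       'total_vaccinations': [100.0, 250.0, 175.0]})

max_label = toy_df['total_vaccinations'].idxmax()   # index label of the largest value
max_row = toy_df.loc[max_label]                     # the full row with that label
print(max_row['location'])
```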


What do you think this function does?

In [43]:
vacc_df_noWorld.total_vaccinations.mean()

20317402.17662692

----
#### Your turn:

Select the number of daily vaccinations in Israel on date 2021-02-06 (hint: use &)

In [44]:
vacc_df.loc[(vacc_df.location == 'Israel') & (vacc_df.date == '2021-02-06')]





Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
10398,Israel,ISR,2021-02-06,5517761.0,3442374.0,2075387.0,43668.0,99784.0,63.75,39.77,23.98,11528.0


Find all the rows with more than 300,000 daily vaccinations:

In [45]:
vacc_df_noWorld.loc[vacc_df_noWorld.daily_vaccinations > 300000]


Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
175,Africa,OWID_AFR,2021-03-27,10302320.0,6886943.0,3606160.0,325837.0,301154.0,0.77,0.51,0.27,225.0
176,Africa,OWID_AFR,2021-03-28,10561169.0,7143129.0,3608823.0,258849.0,335054.0,0.79,0.53,0.27,250.0
177,Africa,OWID_AFR,2021-03-29,10774084.0,7229895.0,3743032.0,212915.0,348008.0,0.80,0.54,0.28,260.0
178,Africa,OWID_AFR,2021-03-30,10970869.0,7311247.0,3870507.0,196785.0,355231.0,0.82,0.55,0.29,265.0
218,Africa,OWID_AFR,2021-05-09,21198985.0,15987716.0,5518749.0,729471.0,351787.0,1.58,1.19,0.41,262.0
...,...,...,...,...,...,...,...,...,...,...,...,...
22615,Upper middle income,OWID_UMC,2021-05-27,822880783.0,156476628.0,85995305.0,21093601.0,20642008.0,31.00,5.89,3.24,7775.0
22616,Upper middle income,OWID_UMC,2021-05-28,844859174.0,158683995.0,87125564.0,21978391.0,20979756.0,31.82,5.98,3.28,7903.0
22617,Upper middle income,OWID_UMC,2021-05-29,865304962.0,160384542.0,87689754.0,20445788.0,21568527.0,32.59,6.04,3.30,8124.0
22618,Upper middle income,OWID_UMC,2021-05-30,886276477.0,162275624.0,88521156.0,20971515.0,22392985.0,33.38,6.11,3.33,8435.0


---
Summary of the functions in this unit:

* `.index.values` - the index labels of the selected rows, as an array
* `.str.contains` - marks the values of a text column that contain a given string
* `.max` - maximum value
* `.mean` - average value
* `.count` - the number of non-missing values in each column