# Python: Pandas

In [Python Intro to Python - II](https://github.com/cra-international/Intro-to-Python/tree/master/Intro%20to%20Python%20-%20II), we explored some fundamental functions provided by the pandas module in Python. Feel free to check out the comprehensive [pandas API reference](https://pandas.pydata.org/docs/reference/index.html) to learn more about pandas.

We will now dive deeper into the pandas package, which is an open source library built on top of NumPy. It supports fast analysis, data cleaning, and data preparation, and has built-in visualization features.

![PandasTrend](https://warehouse-camo.ingress.cmh1.psfhosted.org/72a12ce501802d03822c790fa370b2725a40d218/68747470733a2f2f6e7363686c6f652e6769746875622e696f2f737461636b746167732f736f2d6578616d706c652e737667)

# What Pandas is for

Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do.

This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

- Calculate statistics and answer questions about the data, like:
    - What's the average, median, max, or min of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, another file format, or a database
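
The workflow in the list above can be sketched roughly as follows. This is only an illustration on a tiny made-up table: the column names `a`, `b`, `c` and the output file name `cleaned.csv` are assumptions, not part of any real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [5.0, None, 7.0, 9.0],
})

# Summary statistics for one column.
mean_a = toy["a"].mean()
median_a = toy["a"].median()

# Correlation between two columns (Pearson by default).
corr_ab = toy["a"].corr(toy["b"])

# Clean: drop every row that contains a missing value.
cleaned = toy.dropna()

# Store the cleaned data back into a CSV file (written to a temp directory here).
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
cleaned.to_csv(out_path, index=False)
```

Plotting is left out of this sketch because it needs Matplotlib installed.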

# Optional: Install Pandas

Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

**conda install pandas**

OR

**pip install pandas**


Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

In [None]:
!pip install pandas

_Note: The ! at the beginning runs the rest of the line as if it were typed in a terminal._

# Import Pandas
We usually import pandas under a shorter name, since it's used so much:

In [1]:
import pandas as pd

In [2]:
pd.*?

In [3]:
pd.DataFrame?

In [4]:
pd.DataFrame

pandas.core.frame.DataFrame

# Sections
- 1. <a href='#Series'>Series</a>

    - 1.1 <a href='#CreateSeries'>Create a Series using list and dictionary</a>
    
- 2. <a href='#DataFrame'>DataFrame</a>

    - 2.1 <a href='#CreateDataFrame'>Create a dataframe </a>
    
    - 2.2 <a href='#InputData'>Read in Data</a>

        - 2.2.1 <a href='#CSV'>CSV</a>
    
        - 2.2.2 <a href='#Excel'>Excel</a>
    
    - 2.3 <a href='#ViewData'>View Data </a>
    
    - 2.4 <a href='#SelectandIndexDataFrame'>Grabbing data from a dataframe </a>
    
    - 2.5 <a href='#indexSelection'>Index-based selection </a>

    - 2.6 <a href='#LabelSelection'>Label-based selection </a>
    
    - 2.7 <a href='#ManipulateIndex'>Manipulating the index</a>

    - 2.8 <a href='#ConditionalSelection'>Conditional Selection</a>
    
    - 2.9 <a href='#AssignData'>Assign Data </a>
    
    - 2.10 <a href='#Duplicates'>Handling duplicates </a>
    
    - 2.11 <a href='#SummaryFunctions'>Summary Functions </a>
    
    - 2.12 <a href='#MissingData'>Missing Data</a>
    
    - 2.13 <a href='#Map'>Map</a>
    
    - 2.14 <a href='#GroupBy'>Using Group by with a dataframe</a>

    - 2.15 <a href='#Sort'>Sorting</a>

- 3. <a href='#MergeJoinConcat'>Merging, joining and Concatenation</a>

    - 3.1 <a href='#Concatenation'>Concatenation</a>
    
    - 3.2 <a href='#Merging'>Merging</a>
    
    - 3.3 <a href='#Joining'>Joining</a>
    
- 4. <a href='#Pivot'>Pivot Table</a>
    
- 6. <a href='#Transpose'>Transpose Table</a>
    
- 7. <a href='#HTML'>Optional: Data input from html</a>
    
- 8. <a href='#SQL'>Optional: Data input from SQL</a>
    

# Series
<a id='Series'> </a>

A Series is essentially a _column_. A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object. All data in a Series shares a single data type, so mixed values are stored under the generic object dtype.
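
As a small sketch of the difference between label and position, the snippet below looks up the same element both ways. The values and labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]        # look up by axis label
by_position = s.iloc[1]  # look up by integer position

# A Series holding mixed Python objects falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
```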


![Series](https://miro.medium.com/max/700/1*3h2YForpOw5O_MjHgMOBsQ.png)

### Creating a Series
<a id='CreateSeries'> </a>
You can convert a list, NumPy array, or dictionary to a Series:


_Using List with default index setting_


Lists are used to store multiple items in a single variable.

In [5]:
my_list = [10,20,30]
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

_Using List with specified index_

In [6]:
labels = ['a','b','c']
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

_Dictionary_

Dictionaries are used to store data values in **key:value** pairs. A dictionary is a collection that is changeable, does not allow duplicate keys, and (since Python 3.7) preserves insertion order. Dictionaries are written with curly brackets, and have keys and values. If we create a Series from a Python dictionary, the **key becomes the row index while the value becomes the value at that row index.**

In [7]:
d = {'a':10,'b':20,'c':30}
pd.Series(d)

a    10
b    20
c    30
dtype: int64

Things don’t change if the values in the dictionary contain a list of items. Each list stays together as a single entry under its row index, as in the case below:


In [8]:
d = {'a' : [1,2,3], 'b': [4,5], 'c':6, 'd': "Hello World"}
pd.Series(d)

a      [1, 2, 3]
b         [4, 5]
c              6
d    Hello World
dtype: object

# DataFrames
<a id='DataFrame'> </a>
A DataFrame is a two-dimensional table made up of a collection of Series. DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.

![df](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)
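
To make the "Series sharing one index" idea concrete, here is a minimal sketch that builds a DataFrame from two Series. The names and numbers are illustrative assumptions.

```python
import pandas as pd

# Two Series that use the same row labels.
apples = pd.Series([3, 2, 0], index=["Sam", "Robert", "Lily"])
oranges = pd.Series([0, 3, 7], index=["Sam", "Robert", "Lily"])

# Putting them side by side gives a DataFrame with one column per Series.
fruit = pd.DataFrame({"apples": apples, "oranges": oranges})

# Pulling a column back out returns a Series with the shared index.
col = fruit["apples"]
```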

_Create DataFrames_

<a id='CreateDataFrame'> </a>


We use the pd.DataFrame() constructor to generate DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (apples and oranges in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter. The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.

In [9]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

In [10]:
pd.DataFrame(data)

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor. Let's customize our index:

In [11]:
pd.DataFrame(data, index=['Sam', 'Robert', 'Lily', 'David'])

Unnamed: 0,apples,oranges
Sam,3,0
Robert,2,3
Lily,0,7
David,1,2


In [12]:
# Practice: see if you can create a DataFrame that looks like the printout
data = {
    'name': ['nick','david','joe','rosss'], 
    'age': [5,10, 7, 6]
}
pd.DataFrame(data)

Unnamed: 0,name,age
0,nick,5
1,david,10
2,joe,7
3,rosss,6


# Read in data
<a id='InputData'> </a>

It’s quite simple to load data from various file formats into a DataFrame.

### CSV
<a id='CSV'> </a>


The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, when a CSV file has a built-in index column, pandas does not pick up on it automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.
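
A minimal sketch of index_col, assuming a small CSV held in memory rather than a file on disk. The CSV text here is made up; its first column is a stored index with an empty header name.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column is a stored index.
csv_text = ",apples,oranges\n0,3,0\n1,2,3\n"

# Without index_col, the stored index shows up as an extra "Unnamed: 0" column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the DataFrame's index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
```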


This data is from [Kaggle](https://www.kaggle.com/gpreda/covid-world-vaccination-progress) and tracks COVID-19 vaccination around the world. The data is collected daily from the [Our World in Data](https://ourworldindata.org/) GitHub repository for COVID-19, then merged and uploaded.

In [13]:
df = pd.read_csv('country_vaccinations.csv')
df

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,Wales,,2021-01-18,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1393,Wales,,2021-01-19,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1394,Wales,,2021-01-20,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1395,Wales,,2021-01-21,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...



### Excel
<a id='Excel'> </a>
Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; having images or macros in the file may cause the read_excel method to crash.

In [14]:
### Excel Input
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')


Unnamed: 0.1,Unnamed: 0,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15


### HTML
<a id='HTML'> </a>

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt run:

    conda install lxml
    conda install html5lib
    conda install BeautifulSoup4

Then restart Jupyter Notebook.
(or use pip install if you aren't using the Anaconda Distribution)

Pandas can read tables (table tags) from HTML. For example:

In [15]:
#Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

list_df = pd.read_html('https://en.wikipedia.org/wiki/Pythonidae')

len(list_df)

8

In [16]:
list_df[1]

Unnamed: 0,Pythonidae,Pythonidae.1
0,,
1,Indian python (Python molurus),Indian python (Python molurus)
2,Scientific classification,Scientific classification
3,Kingdom:,Animalia
4,Phylum:,Chordata
5,Class:,Reptilia
6,Order:,Squamata
7,Suborder:,Serpentes
8,Superfamily:,Pythonoidea
9,Family:,"PythonidaeFitzinger, 1826"


**Let's use the COVID vaccine data for the following tutorial:**

## Viewing data

<a id='ViewData'> </a>

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head(), which outputs the first five rows of your DataFrame by default. For example:


In [17]:
df.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...


To see the last five rows, use .tail(). Like .head(), .tail() also accepts a number; in this case we rely on the default and print the bottom five rows:

In [18]:
df.tail()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
1392,Wales,,2021-01-18,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1393,Wales,,2021-01-19,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1394,Wales,,2021-01-20,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1395,Wales,,2021-01-21,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1396,Wales,,2021-01-22,241016.0,240547.0,469.0,28284.0,15148.0,7.64,7.63,0.01,4804.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
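
Both methods accept the number of rows to return. A quick sketch on a made-up ten-row frame (the column name `n` is an assumption for illustration):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows at positions 0, 1, 2
last_two = numbers.tail(2)     # rows at positions 8, 9
```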


To get info about your data, use .info(). It should be one of the very first commands you run after loading your data.

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Notice in our dataset we have some obvious missing values in the various columns such as iso_code. We'll look at how to handle those in a bit.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1397 entries, 0 to 1396
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              1397 non-null   object 
 1   iso_code                             1240 non-null   object 
 2   date                                 1397 non-null   object 
 3   total_vaccinations                   919 non-null    float64
 4   people_vaccinated                    885 non-null    float64
 5   people_fully_vaccinated              223 non-null    float64
 6   daily_vaccinations_raw               738 non-null    float64
 7   daily_vaccinations                   1337 non-null   float64
 8   total_vaccinations_per_hundred       919 non-null    float64
 9   people_vaccinated_per_hundred        885 non-null    float64
 10  people_fully_vaccinated_per_hundred  223 non-null    float64
 11  daily_vaccinations_per_million

In [20]:
#Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns)
df.shape

(1397, 15)
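
If you only need the missing-value counts that .info() reports indirectly, one possible approach is sketched below on a toy frame. The data is made up, and isna() and count() are just one way to get these numbers.

```python
import pandas as pd

# Hypothetical toy frame with missing iso_code values.
toy = pd.DataFrame({
    "country": ["Argentina", "Wales", "Wales"],
    "iso_code": ["ARG", None, None],
})

n_rows, n_cols = toy.shape            # (rows, columns)
missing_per_column = toy.isna().sum() # number of missing values in each column
non_null_per_column = toy.count()     # number of non-null values in each column
```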

## Selection and Indexing
<a id='SelectandIndexDataFrame'> </a>


In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.

Hence, to access the country property of the DataFrame, we can use:

In [21]:
df.country

0       Argentina
1       Argentina
2       Argentina
3       Argentina
4       Argentina
          ...    
1392        Wales
1393        Wales
1394        Wales
1395        Wales
1396        Wales
Name: country, Length: 1397, dtype: object

Alternatively, just as we can access the values of a Python dictionary using the indexing ([]) operator, we can do the same with columns in a DataFrame.


The indexing operator [] has the advantage that it can handle column names with reserved characters in them (e.g. if we had a country province column, df.country province wouldn't work).
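
A short sketch of the two access styles, using a hypothetical `country province` column that is not in the vaccination dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Argentina", "Wales"],
    "country province": ["Buenos Aires", "Cardiff"],  # hypothetical column
})

# Attribute access and bracket access return the same column.
same = toy.country.equals(toy["country"])

# A name containing a space only works with brackets;
# writing toy.country province would be a SyntaxError.
province = toy["country province"]
```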

In [22]:
df['vaccines']

0                                 Sputnik V
1                                 Sputnik V
2                                 Sputnik V
3                                 Sputnik V
4                                 Sputnik V
                       ...                 
1392    Oxford/AstraZeneca, Pfizer/BioNTech
1393    Oxford/AstraZeneca, Pfizer/BioNTech
1394    Oxford/AstraZeneca, Pfizer/BioNTech
1395    Oxford/AstraZeneca, Pfizer/BioNTech
1396    Oxford/AstraZeneca, Pfizer/BioNTech
Name: vaccines, Length: 1397, dtype: object

In [23]:
type(df.vaccines)

pandas.core.series.Series

In [24]:
# Pass a list of column names
df[['country','vaccines']]

Unnamed: 0,country,vaccines
0,Argentina,Sputnik V
1,Argentina,Sputnik V
2,Argentina,Sputnik V
3,Argentina,Sputnik V
4,Argentina,Sputnik V
...,...,...
1392,Wales,"Oxford/AstraZeneca, Pfizer/BioNTech"
1393,Wales,"Oxford/AstraZeneca, Pfizer/BioNTech"
1394,Wales,"Oxford/AstraZeneca, Pfizer/BioNTech"
1395,Wales,"Oxford/AstraZeneca, Pfizer/BioNTech"


In [25]:
# Practice
# See if you can access the source_name column in the df

## Index-based selection

<a id='indexSelection'> </a>

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. **iloc** follows this paradigm. 

To select the first row of data in a DataFrame, we may use the following:



In [26]:
df.iloc[0]

country                                                                        Argentina
iso_code                                                                             ARG
date                                                                          2020-12-29
total_vaccinations                                                                   700
people_vaccinated                                                                    NaN
people_fully_vaccinated                                                              NaN
daily_vaccinations_raw                                                               NaN
daily_vaccinations                                                                   NaN
total_vaccinations_per_hundred                                                         0
people_vaccinated_per_hundred                                                        NaN
people_fully_vaccinated_per_hundred                                                  NaN
daily_vaccinations_pe

In [27]:
type(df.iloc[0])

pandas.core.series.Series

iloc is row-first, column-second: the first selector picks rows and the second picks columns. On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

In [28]:
df.iloc[:,0]

0       Argentina
1       Argentina
2       Argentina
3       Argentina
4       Argentina
          ...    
1392        Wales
1393        Wales
1394        Wales
1395        Wales
1396        Wales
Name: country, Length: 1397, dtype: object

In [29]:
#if we want to select the country column from just the first, second, and third row, we would do:
df.iloc[:3,0]

0    Argentina
1    Argentina
2    Argentina
Name: country, dtype: object
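
Beyond ranges, iloc also accepts lists of positions and negative positions. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Wales", "Wales"],
    "date": ["2020-12-29", "2020-12-30", "2021-01-21", "2021-01-22"],
})

picked = toy.iloc[[0, 2], 0]  # rows 0 and 2 of the first column
last_two = toy.iloc[-2:]      # negative positions count from the end
cell = toy.iloc[1, 1]         # single value: second row, second column
```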

In [31]:
# Practice
# Select just the second and third entries:

## Label-based selection

<a id='LabelSelection'> </a>


The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, that matters. Unlike iloc, which excludes the end of a range just as standard Python slicing does, loc includes it. This is because loc can index any data type in the index: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
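
The difference in how the two operators treat the end of a range can be sketched like this. The produce table is a made-up example with a sorted string index.

```python
import pandas as pd

produce = pd.DataFrame(
    {"count": [3, 5, 2, 7]},
    index=["Apples", "Bananas", "Potatoes", "Tomatoes"],
)

# loc slices by label and includes the end label.
by_label = produce.loc["Apples":"Potatoes"]

# iloc slices by position and excludes the end position.
by_position = produce.iloc[0:2]
```

Here `by_label` keeps three rows (Apples through Potatoes), while `by_position` keeps only the first two.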

In [32]:
# get the first entry in country
df.loc[0, 'country']

'Argentina'

In [33]:
df.loc[:, ['country', 'iso_code', 'date']]

Unnamed: 0,country,iso_code,date
0,Argentina,ARG,2020-12-29
1,Argentina,ARG,2020-12-30
2,Argentina,ARG,2020-12-31
3,Argentina,ARG,2021-01-01
4,Argentina,ARG,2021-01-02
...,...,...,...
1392,Wales,,2021-01-18
1393,Wales,,2021-01-19
1394,Wales,,2021-01-20
1395,Wales,,2021-01-21


## Manipulating the index

<a id='ManipulateIndex'> </a>

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. 

In [34]:
df.set_index("date")

date,country,iso_code,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
2020-12-29,Argentina,ARG,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2020-12-30,Argentina,ARG,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2020-12-31,Argentina,ARG,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2021-01-01,Argentina,ARG,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2021-01-02,Argentina,ARG,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-01-18,Wales,,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
2021-01-19,Wales,,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
2021-01-20,Wales,,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
2021-01-21,Wales,,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
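
One detail worth noting is that set_index() returns a new DataFrame and leaves the original untouched unless you assign the result. A minimal sketch on toy data (the frame below is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2020-12-29", "2020-12-30"],
    "country": ["Argentina", "Argentina"],
})

# set_index returns a new DataFrame; toy itself still has a date column.
by_date = toy.set_index("date")

# reset_index moves the index back into an ordinary column.
restored = by_date.reset_index()
```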


In [35]:
# Practice
# Use label-based selection to look at the vaccines recorded on 2020-12-30


### Conditional Selection
<a id='ConditionalSelection'> </a>

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

Suppose that we're interested specifically in COVID-19 vaccines in the US.

In [36]:
# We can start by checking whether each record is from the US or not.
# This operation produces a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data:

df.country=='United States'

0       False
1       False
2       False
3       False
4       False
        ...  
1392    False
1393    False
1394    False
1395    False
1396    False
Name: country, Length: 1397, dtype: bool

In [37]:
df.loc[df.country=='United States']

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
1321,United States,USA,2020-12-20,556208.0,556208.0,,,,0.17,0.17,,,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1322,United States,USA,2020-12-21,614117.0,614117.0,,57909.0,57909.0,0.19,0.19,,175.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1323,United States,USA,2020-12-22,,,,,127432.0,,,,385.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1324,United States,USA,2020-12-23,1008025.0,1008025.0,,,150606.0,0.3,0.3,,455.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1325,United States,USA,2020-12-24,,,,,191001.0,,,,577.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1326,United States,USA,2020-12-25,,,,,215238.0,,,,650.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1327,United States,USA,2020-12-26,1944585.0,1944585.0,,,231396.0,0.59,0.59,,699.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1328,United States,USA,2020-12-27,,,,,211379.0,,,,639.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1329,United States,USA,2020-12-28,2127143.0,2127143.0,,,216147.0,0.64,0.64,,653.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1330,United States,USA,2020-12-29,,,,,235685.0,,,,712.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...


In [38]:
# Practice
# How many records are from the UK?

### Multiple Conditions

We also want to know which US records are dated after 2021-01-01. We can use the ampersand (&) to bring the two conditions together:

In [39]:
df.loc[(df.country=='United States')&(df.date>'2021-01-01')]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
1334,United States,USA,2021-01-02,4225756.0,4225756.0,,,325882.0,1.28,1.28,,985.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1335,United States,USA,2021-01-03,,,,,336949.0,,,,1018.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1336,United States,USA,2021-01-04,4563260.0,4563260.0,,,348017.0,1.38,1.38,,1051.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1337,United States,USA,2021-01-05,4836469.0,4836469.0,,273209.0,339372.0,1.46,1.46,,1025.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1338,United States,USA,2021-01-06,5306797.0,5306797.0,,470328.0,358887.0,1.6,1.6,,1084.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1339,United States,USA,2021-01-07,5919418.0,5919418.0,,612621.0,378253.0,1.79,1.79,,1143.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1340,United States,USA,2021-01-08,6688231.0,6688231.0,,768813.0,419933.0,2.02,2.02,,1269.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1341,United States,USA,2021-01-09,,,,,461263.0,,,,1394.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1342,United States,USA,2021-01-10,,,,,546636.0,,,,1651.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1343,United States,USA,2021-01-11,8987322.0,8987322.0,,,632009.0,2.72,2.72,,1909.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
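
Two details in the cell above are easy to miss, so here is a small sketch on made-up rows. First, each condition is wrapped in parentheses because & binds more tightly than == and >. Second, the date column holds strings, and the comparison works only because ISO-formatted dates (YYYY-MM-DD) sort the same way as text and as dates.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["United States", "United States", "Argentina"],
    "date": ["2020-12-31", "2021-01-02", "2021-01-02"],
})

# Naming each mask avoids the need for parentheses when combining them.
is_us = toy.country == "United States"
after_new_year = toy.date > "2021-01-01"  # string comparison on ISO dates

both = toy.loc[is_us & after_new_year]    # rows matching both conditions
either = toy.loc[is_us | after_new_year]  # rows matching at least one
```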


Suppose we are interested in any vaccine info from Argentina or the US. For this we use a pipe (|):

In [40]:
df.loc[(df.country=='United States')|(df.country=='Argentina')]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1351,United States,USA,2021-01-19,15707588.0,13595803.0,2023124.0,,911493.0,4.75,4.11,0.61,2754.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1352,United States,USA,2021-01-20,16525281.0,14270441.0,2161419.0,817693.0,892403.0,4.99,4.31,0.65,2696.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1353,United States,USA,2021-01-21,17546374.0,15053257.0,2394961.0,1021093.0,913912.0,5.30,4.55,0.72,2761.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1354,United States,USA,2021-01-22,19107959.0,16243093.0,2756953.0,1561585.0,975540.0,5.77,4.91,0.83,2947.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...


Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin. isin lets you select data whose value "is in" a list of values, so we can rewrite the code above as:

In [41]:
df.loc[df.country.isin(['United States','Argentina'])]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1351,United States,USA,2021-01-19,15707588.0,13595803.0,2023124.0,,911493.0,4.75,4.11,0.61,2754.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1352,United States,USA,2021-01-20,16525281.0,14270441.0,2161419.0,817693.0,892403.0,4.99,4.31,0.65,2696.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1353,United States,USA,2021-01-21,17546374.0,15053257.0,2394961.0,1021093.0,913912.0,5.30,4.55,0.72,2761.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
1354,United States,USA,2021-01-22,19107959.0,16243093.0,2756953.0,1561585.0,975540.0,5.77,4.91,0.83,2947.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
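
As a minimal sketch on a toy DataFrame (the column names and values below are made up for illustration and are not the vaccination dataset), isin() should produce the same mask as chaining == comparisons with |, and ~ inverts the mask:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Argentina", "Austria", "United States", "Argentina"],
    "doses": [700, 8360, 556208, 32013],
})

# Mask built with isin()
mask_isin = toy["country"].isin(["United States", "Argentina"])

# Equivalent mask built with explicit comparisons joined by | (or)
mask_or = (toy["country"] == "United States") | (toy["country"] == "Argentina")

selected = toy.loc[mask_isin]    # rows whose country is in the list
excluded = toy.loc[~mask_isin]   # ~ inverts the mask: rows NOT in the list

print(selected)
print(excluded)
```

isin() tends to read better than a long chain of | conditions once the list grows beyond two or three values.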


In [42]:
#Practice
#How many records in the USA or UK have total_vaccinations greater than 300000?

## Assign Data

<a id='AssignData'> </a>

Going the other way, assigning data to a DataFrame is easy. You can assign a constant value to a whole column:

In [43]:
df['DownloadFrom'] = 'Kaggle'

In [44]:
df.head(1)

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle


In [45]:
#Practice
# Use loc to create a USA_Flag column with a value of 1 where country is 'United States'

In [46]:
df.loc[df.country=='United States','USA_Flag']=1

In [47]:
df.loc[df.country=='United States'].USA_Flag

1321    1.0
1322    1.0
1323    1.0
1324    1.0
1325    1.0
1326    1.0
1327    1.0
1328    1.0
1329    1.0
1330    1.0
1331    1.0
1332    1.0
1333    1.0
1334    1.0
1335    1.0
1336    1.0
1337    1.0
1338    1.0
1339    1.0
1340    1.0
1341    1.0
1342    1.0
1343    1.0
1344    1.0
1345    1.0
1346    1.0
1347    1.0
1348    1.0
1349    1.0
1350    1.0
1351    1.0
1352    1.0
1353    1.0
1354    1.0
1355    1.0
Name: USA_Flag, dtype: float64

In [48]:
df.loc[df.country=='United Kingdom'].USA_Flag

1287   NaN
1288   NaN
1289   NaN
1290   NaN
1291   NaN
1292   NaN
1293   NaN
1294   NaN
1295   NaN
1296   NaN
1297   NaN
1298   NaN
1299   NaN
1300   NaN
1301   NaN
1302   NaN
1303   NaN
1304   NaN
1305   NaN
1306   NaN
1307   NaN
1308   NaN
1309   NaN
1310   NaN
1311   NaN
1312   NaN
1313   NaN
1314   NaN
1315   NaN
1316   NaN
1317   NaN
1318   NaN
1319   NaN
1320   NaN
Name: USA_Flag, dtype: float64
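
The two outputs above show that rows not matched by loc are left as NaN, which also makes the column float64. If a 0/1 flag is preferred, one option is to build it from the boolean mask directly. This is a sketch on toy data; the column name USA_Flag_int is hypothetical:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"country": ["United States", "United Kingdom", "United States"]})

# Assigning through loc fills only the matching rows; the rest become NaN
toy.loc[toy["country"] == "United States", "USA_Flag"] = 1

# Building the flag from the boolean mask gives 0 for every other row instead
toy["USA_Flag_int"] = (toy["country"] == "United States").astype(int)

print(toy)
```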

## Handling duplicates

<a id='Duplicates'> </a>

To check whether there are duplicate rows, we can use the .drop_duplicates() method, which returns a copy of our DataFrame with duplicates removed. Comparing the shape of that copy with the original shape shows how many rows were dropped.

In [49]:
df.drop_duplicates().shape,df.shape

((1397, 17), (1397, 17))

In [50]:
# This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows.
# let's see if there are duplicated rows based on country and iso_code
df.drop_duplicates(subset=['country','iso_code']).shape,df.shape

((60, 17), (1397, 17))
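
When a subset is given, drop_duplicates() keeps one row per distinct key, and the keep parameter controls which one. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Austria", "Austria"],
    "date": ["2020-12-29", "2020-12-30", "2021-01-05", "2021-01-06"],
    "doses": [700, 1500, 8360, 9000],
})

first_rows = toy.drop_duplicates(subset=["country"])              # keep="first" is the default
last_rows = toy.drop_duplicates(subset=["country"], keep="last")  # keep the last row per country

# Comparing row counts tells us how many rows were treated as duplicates
n_removed = len(toy) - len(first_rows)

print(first_rows)
print(last_rows)
print(n_removed)
```

With rows sorted by date, keep="last" is a quick way to obtain the most recent record per country.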

In [51]:
df.drop_duplicates(subset=['country','iso_code'])

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
26,Austria,AUT,2021-01-05,8360.0,8360.0,,,,0.09,0.09,,,Pfizer/BioNTech,Ministry of Health,https://info.gesundheitsministerium.gv.at/open...,Kaggle,
45,Bahrain,BHR,2020-12-23,38965.0,38965.0,,,,2.29,2.29,,,"Pfizer/BioNTech, Sinopharm",Ministry of Health,https://twitter.com/MOH_Bahrain/status/1351989...,Kaggle,
74,Belgium,BEL,2020-12-28,298.0,298.0,,,,0.0,0.0,,,Pfizer/BioNTech,Sciensano,https://datastudio.google.com/embed/u/0/report...,Kaggle,
100,Brazil,BRA,2021-01-16,0.0,0.0,,,,0.0,0.0,,,Sinovac,Regional governments via Coronavirus Brasil,https://coronavirusbra1.github.io/,Kaggle,
108,Bulgaria,BGR,2020-12-29,1719.0,1719.0,,,,0.02,0.02,,,"Moderna, Pfizer/BioNTech",Ministry of Health,https://coronavirus.bg/bg/statistika,Kaggle,
134,Canada,CAN,2020-12-14,297.0,297.0,,,,0.0,0.0,,,"Moderna, Pfizer/BioNTech",COVID-19 Canada Open Data Working Group,https://github.com/ishaberry/Covid19Canada,Kaggle,
175,Chile,CHL,2020-12-24,420.0,420.0,,,,0.0,0.0,,,Pfizer/BioNTech,Department of Statistics and Health Information,https://informesdeis.minsal.cl/SASVisualAnalyt...,Kaggle,
205,China,CHN,2020-12-15,1500000.0,1500000.0,,,,0.1,0.1,,,"CNBG, Sinovac",National Health Commission,https://www.globaltimes.cn/page/202101/1213364...,Kaggle,
242,Costa Rica,CRI,2020-12-24,55.0,,,,,0.0,,,,Pfizer/BioNTech,National Health Commission,https://www.larepublica.net/noticia/sube-a-293...,Kaggle,


In [52]:
#Practice
# Check whether there are duplicated rows based on country alone

_How to look at the duplicate rows_

The duplicated() method returns a boolean Series denoting duplicate rows. The first occurrence is marked as a non-duplicate (False) and every later occurrence is marked as a duplicate (True). Here we check for duplicated rows based on the country column alone.

In [53]:
df.duplicated(subset=['country'])

0       False
1        True
2        True
3        True
4        True
        ...  
1392     True
1393     True
1394     True
1395     True
1396     True
Length: 1397, dtype: bool

In [54]:
df.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,


In [55]:
df[df.duplicated(subset=['country'])]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
5,Argentina,ARG,2021-01-03,,,,,7400.0,,,,164.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,Wales,,2021-01-18,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1393,Wales,,2021-01-19,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1394,Wales,,2021-01-20,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1395,Wales,,2021-01-21,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
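
By default the first occurrence is not flagged, so the filtered result above omits one row per country. As a sketch on toy data (values invented), keep=False flags every member of a repeated group, and summing the mask counts the duplicates:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Brazil", "Austria", "Austria"],
    "doses": [700, 1500, 0, 8360, 9000],
})

dup_default = toy.duplicated(subset=["country"])          # first occurrence -> False
dup_all = toy.duplicated(subset=["country"], keep=False)  # every repeated row -> True

n_duplicates = int(dup_default.sum())  # True counts as 1, so sum() counts duplicates
all_repeated = toy[dup_all]            # all rows whose country appears more than once

print(n_duplicates)
print(all_repeated)
```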


## Summary functions

<a id='SummaryFunctions'> </a>

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

In [56]:
#let's check the descriptive statistics for total_vaccinations
df.total_vaccinations.describe()

count    9.190000e+02
mean     5.894958e+05
std      1.857483e+06
min      0.000000e+00
25%      1.501150e+04
50%      6.282200e+04
75%      2.787140e+05
max      2.053799e+07
Name: total_vaccinations, dtype: float64

In [57]:
df.describe()

Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,USA_Flag
count,919.0,885.0,223.0,738.0,1337.0,919.0,885.0,223.0,1337.0,35.0
mean,589495.8,551640.6,167644.6,47356.01,42173.37,2.434276,2.303085,0.650179,1432.382947,1.0
std,1857483.0,1673335.0,418836.8,131928.1,111674.5,5.064139,4.548534,1.533988,3120.828782,0.0
min,0.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,15011.5,15023.0,2139.0,1706.0,1662.0,0.215,0.22,0.015,262.0,1.0
50%,62822.0,60753.0,8449.0,7627.0,5512.0,0.79,0.8,0.07,640.0,1.0
75%,278714.0,282450.0,190302.0,37856.5,25299.0,1.955,1.95,0.685,1065.0,1.0
max,20537990.0,17390340.0,3027865.0,1561585.0,1057387.0,39.79,28.86,10.94,30869.0,1.0
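
Note that the count row differs between columns in the output above because describe() skips missing values. A small sketch with made-up numbers illustrates this:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "doses": [10.0, np.nan, 30.0, 50.0],
})

summary = toy["doses"].describe()

# count only includes non-missing values, so it can be smaller than len(toy)
print(summary["count"], len(toy))
print(summary["mean"])  # mean of 10, 30 and 50
print(summary["50%"])   # the 50% quantile is the median
```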


In [58]:
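#Return the median of each numeric column
#NA/null values are excluded when computing the result.
#numeric_only=True skips text columns, which recent pandas versions would otherwise reject
df.median(numeric_only=True)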

total_vaccinations                     62822.00
people_vaccinated                      60753.00
people_fully_vaccinated                 8449.00
daily_vaccinations_raw                  7627.00
daily_vaccinations                      5512.00
total_vaccinations_per_hundred             0.79
people_vaccinated_per_hundred              0.80
people_fully_vaccinated_per_hundred        0.07
daily_vaccinations_per_million           640.00
USA_Flag                                   1.00
dtype: float64

In [59]:
#To see a list of unique values we can use the unique() function:
df.vaccines.unique()

array(['Sputnik V', 'Pfizer/BioNTech', 'Pfizer/BioNTech, Sinopharm',
       'Sinovac', 'Moderna, Pfizer/BioNTech', 'CNBG, Sinovac',
       'Oxford/AstraZeneca, Pfizer/BioNTech',
       'Pfizer/BioNTech, Pifzer/BioNTech', 'Covaxin, Covishield',
       'Sinopharm'], dtype=object)

In [60]:
# To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:
df.country.value_counts()

Scotland                41
Canada                  41
Northern Ireland        41
Wales                   41
China                   37
Israel                  36
United States           35
United Kingdom          34
England                 34
Mexico                  31
Chile                   30
Russia                  30
Switzerland             30
Lithuania               30
Bahrain                 29
Italy                   28
Czechia                 28
Denmark                 28
Portugal                27
Romania                 27
Germany                 27
Estonia                 27
Latvia                  27
Greece                  27
Bulgaria                26
Norway                  26
Belgium                 26
Argentina               26
Poland                  26
Malta                   26
Costa Rica              26
Hungary                 26
Iceland                 24
Croatia                 24
Finland                 23
Luxembourg              23
Oman                    23
S
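
Two related helpers are worth knowing: nunique() returns the number of distinct values, and value_counts(normalize=True) returns shares instead of counts. A minimal sketch on a made-up Series (the labels are placeholders):

```python
import pandas as pd

# Toy data, invented for illustration only
vaccines = pd.Series(["Pfizer", "Pfizer", "Moderna", "Pfizer", "Sinovac", "Moderna"])

n_unique = vaccines.nunique()                   # number of distinct values
counts = vaccines.value_counts()                # occurrences, most frequent first
shares = vaccines.value_counts(normalize=True)  # same, as fractions that sum to 1

print(n_unique)
print(counts)
print(shares)
```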

# Missing Data
<a id='MissingData'> </a>

Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used thusly:

In [61]:
df.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,


In [62]:
df.isnull()
#equivalent alternatives:
#df.isna()
#pd.isna(df)

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,False,False,False,False,True,True,True,True,False,True,True,True,False,False,False,False,True
1,False,False,False,True,True,True,True,False,True,True,True,False,False,False,False,False,True
2,False,False,False,False,True,True,True,False,False,True,True,False,False,False,False,False,True
3,False,False,False,True,True,True,True,False,True,True,True,False,False,False,False,False,True
4,False,False,False,True,True,True,True,False,True,True,True,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1393,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1394,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1395,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True


In [63]:
df.notnull()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,True,True,True,True,False,False,False,False,True,False,False,False,True,True,True,True,False
1,True,True,True,False,False,False,False,True,False,False,False,True,True,True,True,True,False
2,True,True,True,True,False,False,False,True,True,False,False,True,True,True,True,True,False
3,True,True,True,False,False,False,False,True,False,False,False,True,True,True,True,True,False
4,True,True,True,False,False,False,False,True,False,False,False,True,True,True,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False
1393,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False
1394,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False
1395,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False
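
These boolean frames are most useful as masks or when summed. As a sketch on toy data (values invented), summing isnull() counts missing entries per column, and a single-column mask filters rows:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Wales"],
    "iso_code": ["ARG", "ARG", np.nan],
    "doses": [700.0, np.nan, 162197.0],
})

missing_per_column = toy.isnull().sum()          # number of NaN values in each column
rows_with_code = toy[toy["iso_code"].notnull()]  # keep rows where iso_code is present
rows_missing_doses = toy[toy["doses"].isnull()]  # keep rows where doses is NaN

print(missing_per_column)
print(rows_with_code)
print(rows_missing_doses)
```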


In [64]:
#Practice
# Extract the records where daily_vaccinations_per_million is not null

Alternatively, we can use the dropna() method to drop records that contain missing values.

In [65]:
#by default, it drops every record that contains at least one NaN
df.dropna()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
1346,United States,USA,2021-01-14,11148991.0,9690757.0,1342086.0,870529.0,747082.0,3.37,2.93,0.41,2257.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
1347,United States,USA,2021-01-15,12279180.0,10595866.0,1610524.0,1130189.0,798707.0,3.71,3.2,0.49,2413.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
1352,United States,USA,2021-01-20,16525281.0,14270441.0,2161419.0,817693.0,892403.0,4.99,4.31,0.65,2696.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
1353,United States,USA,2021-01-21,17546374.0,15053257.0,2394961.0,1021093.0,913912.0,5.3,4.55,0.72,2761.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
1354,United States,USA,2021-01-22,19107959.0,16243093.0,2756953.0,1561585.0,975540.0,5.77,4.91,0.83,2947.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
1355,United States,USA,2021-01-23,20537990.0,17390345.0,3027865.0,1430031.0,1057387.0,6.2,5.25,0.91,3194.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...,Kaggle,1.0
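
Dropping every row with any NaN is often too aggressive, as the small result above suggests. The how and thresh parameters of dropna() relax the rule. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

any_missing = toy.dropna()           # default how="any": drop rows with at least one NaN
all_missing = toy.dropna(how="all")  # drop only rows where every value is NaN
at_least_two = toy.dropna(thresh=2)  # keep rows with at least 2 non-missing values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
```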


In [66]:
#Practice
# Drop a record only if daily_vaccinations_per_million is missing

Replacing missing values is a common operation. Pandas provides a handy method for this: fillna(), which offers a few different strategies for handling missing data. For example, we can simply replace each NaN with "UNKNOWN":

In [67]:
df.iso_code.unique()

array(['ARG', 'AUT', 'BHR', 'BEL', 'BRA', 'BGR', 'CAN', 'CHL', 'CHN',
       'CRI', 'HRV', 'CYP', 'CZE', 'DNK', 'ECU', nan, 'EST', 'FIN', 'FRA',
       'DEU', 'GIB', 'GRC', 'HUN', 'ISL', 'IND', 'IDN', 'IRL', 'IMN',
       'ISR', 'ITA', 'KWT', 'LVA', 'LTU', 'LUX', 'MLT', 'MEX', 'NLD',
       'NOR', 'OMN', 'PAN', 'POL', 'PRT', 'ROU', 'RUS', 'SAU', 'SRB',
       'SYC', 'SGP', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'ARE',
       'GBR', 'USA'], dtype=object)

In [68]:
df.iso_code.fillna("UNKNOWN")

0           ARG
1           ARG
2           ARG
3           ARG
4           ARG
         ...   
1392    UNKNOWN
1393    UNKNOWN
1394    UNKNOWN
1395    UNKNOWN
1396    UNKNOWN
Name: iso_code, Length: 1397, dtype: object

In [69]:
df.iso_code

0       ARG
1       ARG
2       ARG
3       ARG
4       ARG
       ... 
1392    NaN
1393    NaN
1394    NaN
1395    NaN
1396    NaN
Name: iso_code, Length: 1397, dtype: object

In [70]:
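#assign the result back to the column; calling fillna(inplace=True) on a selected column may leave df unchanged in recent pandas versions
df['iso_code'] = df.iso_code.fillna("UNKNOWN")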

In [71]:
df.iso_code

0           ARG
1           ARG
2           ARG
3           ARG
4           ARG
         ...   
1392    UNKNOWN
1393    UNKNOWN
1394    UNKNOWN
1395    UNKNOWN
1396    UNKNOWN
Name: iso_code, Length: 1397, dtype: object
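
Beyond a single constant, fillna() accepts a dict with one fill value per column, and ffill() carries the last valid value forward. A minimal sketch on toy data (values invented):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "iso_code": ["ARG", np.nan, "AUT"],
    "doses": [700.0, np.nan, 8360.0],
})

# A dict fills each column with its own replacement value
filled = toy.fillna({"iso_code": "UNKNOWN", "doses": 0})

# ffill() carries the last valid observation forward
carried = toy["doses"].ffill()

# fillna() returns a new object; toy itself still holds the NaN values
still_missing = int(toy["doses"].isna().sum())

print(filled)
print(carried)
print(still_missing)
```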

In [72]:
#Practice
#fill missing value in total_vaccinations with the mean of total_vaccinations

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that we don't want an 'UNKNOWN' iso_code in our dataset; we can replace 'UNKNOWN' with 'ISO UNKNOWN'. Note that replace() returns a new Series, so df.iso_code itself is unchanged unless the result is assigned back:

In [73]:
df.iso_code.replace("UNKNOWN", "ISO UNKNOWN")

0               ARG
1               ARG
2               ARG
3               ARG
4               ARG
           ...     
1392    ISO UNKNOWN
1393    ISO UNKNOWN
1394    ISO UNKNOWN
1395    ISO UNKNOWN
1396    ISO UNKNOWN
Name: iso_code, Length: 1397, dtype: object

In [74]:
df.iso_code

0           ARG
1           ARG
2           ARG
3           ARG
4           ARG
         ...   
1392    UNKNOWN
1393    UNKNOWN
1394    UNKNOWN
1395    UNKNOWN
1396    UNKNOWN
Name: iso_code, Length: 1397, dtype: object
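
The same behavior can be seen on a small made-up Series: replace() leaves the original untouched, and a dict replaces several values in one call.

```python
import pandas as pd

# Toy data, invented for illustration only
codes = pd.Series(["ARG", "UNKNOWN", "AUT", "UNKNOWN"])

# replace() returns a new Series; codes is unchanged until we assign the result
renamed = codes.replace("UNKNOWN", "ISO UNKNOWN")

# A dict replaces several values in one call
relabelled = codes.replace({"UNKNOWN": "ISO UNKNOWN", "ARG": "Argentina"})

print(codes.tolist())
print(renamed.tolist())
print(relabelled.tolist())
```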

## Map

<a id='Map'> </a>

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. For example, suppose that we wanted to calculate a column that re-centers total_vaccinations around 0 (by subtracting the mean), so we can easily see which countries have total_vaccinations above or below the mean in the dataset.

The function you pass to map() should expect a single value from the Series (a total_vaccinations value, in the example below), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

In [75]:
total_vaccinations_mean = df.total_vaccinations.mean()

In [76]:
#map() doesn't modify the original data it is called on.
df.total_vaccinations.map(lambda x: x - total_vaccinations_mean)

0      -588795.76605
1                NaN
2      -557482.76605
3                NaN
4                NaN
            ...     
1392   -427298.76605
1393   -413309.76605
1394   -398664.76605
1395   -376763.76605
1396   -348479.76605
Name: total_vaccinations, Length: 1397, dtype: float64

In [77]:
df.total_vaccinations

0          700.0
1            NaN
2        32013.0
3            NaN
4            NaN
          ...   
1392    162197.0
1393    176186.0
1394    190831.0
1395    212732.0
1396    241016.0
Name: total_vaccinations, Length: 1397, dtype: float64
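
For simple arithmetic like this, plain vectorized subtraction should give the same result as map(), and map() can also take a dict as a lookup table. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
doses = pd.Series([700.0, np.nan, 32013.0])
mean_doses = doses.mean()  # NaN is skipped: (700 + 32013) / 2

centred_map = doses.map(lambda x: x - mean_doses)  # element by element
centred_vec = doses - mean_doses                   # same result, vectorized

# map() also accepts a dict; values without a matching key become NaN
codes = pd.Series(["ARG", "AUT", "XXX"])
names = codes.map({"ARG": "Argentina", "AUT": "Austria"})

print(centred_map)
print(names)
```

The vectorized form is usually faster on large columns; map() is the better fit when the transformation cannot be written as column arithmetic.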

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

In [78]:
def remean_total_vaccinations(row):
    row.total_vaccinations = row.total_vaccinations - total_vaccinations_mean
    return row


In [79]:
df.apply(remean_total_vaccinations, axis='columns')

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,Argentina,ARG,2020-12-29,-588795.76605,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
2,Argentina,ARG,2020-12-31,-557482.76605,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,Wales,UNKNOWN,2021-01-18,-427298.76605,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1393,Wales,UNKNOWN,2021-01-19,-413309.76605,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1394,Wales,UNKNOWN,2021-01-20,-398664.76605,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1395,Wales,UNKNOWN,2021-01-21,-376763.76605,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
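
apply() with axis='columns' is most useful when the function needs several fields of the same row. In the sketch below the column names first_dose and second_dose and the helper function are hypothetical, and the values are made up:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "first_dose": [100, 250, 40],
    "second_dose": [20, 50, 0],
})

def share_fully_vaccinated(row):
    # row is a Series holding one row; fields are accessed by column name
    return row["second_dose"] / row["first_dose"]

# axis="columns" (same as axis=1) passes one row at a time to the function
share_apply = toy.apply(share_fully_vaccinated, axis="columns")

# The vectorized form gives the same numbers and is usually faster
share_vec = toy["second_dose"] / toy["first_dose"]

print(share_apply.tolist())
print(share_vec.tolist())
```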


In [80]:
#Practice
# Re-center the daily_vaccinations_per_million column, using the code above as a reference

# Groupby

<a id='GroupBy'> </a>

Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column. However, often we want to group our data, and then do something specific to the group the data is in.


Now you can use the .groupby() method to group rows together based on a column name. This will create a DataFrameGroupBy object:

In [81]:
df.groupby('country')


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000233CF691C88>

In [82]:
#You can save this object as a new variable:
by_country = df.groupby("country")

In [83]:
# then call aggregate methods off the object:
by_country.total_vaccinations.median()

country
Argentina                183796.0
Austria                   74100.0
Bahrain                   68472.0
Belgium                   28898.0
Brazil                    20006.5
Bulgaria                  17038.0
Canada                   117727.0
Chile                     18526.0
China                   9000000.0
Costa Rica                 9751.0
Croatia                   32276.5
Cyprus                     8130.5
Czechia                   47583.0
Denmark                  117201.5
Ecuador                      54.0
England                 3352029.5
Estonia                   10860.5
Finland                   23126.0
France                   282691.5
Germany                  607874.0
Gibraltar                  5847.0
Greece                    42121.5
Hungary                   82754.0
Iceland                    6205.0
India                    674835.0
Indonesia                 66000.0
Ireland                   40000.0
Isle of Man                3603.0
Israel                  1548990.5
Italy 

In [84]:
type(by_country.total_vaccinations.median())

pandas.core.series.Series

In [85]:
by_country.total_vaccinations.median().reset_index()

Unnamed: 0,country,total_vaccinations
0,Argentina,183796.0
1,Austria,74100.0
2,Bahrain,68472.0
3,Belgium,28898.0
4,Brazil,20006.5
5,Bulgaria,17038.0
6,Canada,117727.0
7,Chile,18526.0
8,China,9000000.0
9,Costa Rica,9751.0
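
Several statistics can be computed per group in one call with agg(). A minimal sketch on toy data (values invented), which also contrasts size() with count():

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "doses": [10.0, 20.0, np.nan, 5.0, 15.0],
})

by_country = toy.groupby("country")

# agg() with a list computes several statistics per group at once
stats = by_country["doses"].agg(["count", "median", "max"])

# size() counts rows per group including NaN; count() skips NaN
rows_per_group = by_country.size()

print(stats)
print(rows_per_group)
```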


In [86]:
df.groupby("country").total_vaccinations.max()

country
Argentina                 288064.0
Austria                   169549.0
Bahrain                   144130.0
Belgium                   174307.0
Brazil                    537774.0
Bulgaria                   26119.0
Canada                    801238.0
Chile                      63047.0
China                   15000000.0
Costa Rica                 29389.0
Croatia                    64951.0
Cyprus                     17379.0
Czechia                   192267.0
Denmark                   196167.0
Ecuador                      108.0
England                  5526071.0
Estonia                    25704.0
Finland                    91260.0
France                   1008720.0
Germany                  1632777.0
Gibraltar                   8877.0
Greece                    157388.0
Hungary                   150128.0
Iceland                     8249.0
India                    1582201.0
Indonesia                 132000.0
Ireland                   121900.0
Isle of Man                 3648.0
Israel      

In [87]:
#Practice
# Use the describe method on the total_vaccinations column, broken down by country

_By default, the group-by column is set as the index, but you can set the as_index option to False_

In [88]:
df.groupby("country",as_index=False).total_vaccinations.max()

Unnamed: 0,country,total_vaccinations
0,Argentina,288064.0
1,Austria,169549.0
2,Bahrain,144130.0
3,Belgium,174307.0
4,Brazil,537774.0
5,Bulgaria,26119.0
6,Canada,801238.0
7,Chile,63047.0
8,China,15000000.0
9,Costa Rica,29389.0


In [89]:
#similarly, you can use the syntax below
df.groupby("country").total_vaccinations.max().reset_index()

Unnamed: 0,country,total_vaccinations
0,Argentina,288064.0
1,Austria,169549.0
2,Bahrain,144130.0
3,Belgium,174307.0
4,Brazil,537774.0
5,Bulgaria,26119.0
6,Canada,801238.0
7,Chile,63047.0
8,China,15000000.0
9,Costa Rica,29389.0
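
The difference between the indexed and the flat result can be checked on a small made-up frame:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "doses": [10.0, 20.0, 5.0],
})

indexed = toy.groupby("country")["doses"].max()                 # Series indexed by country
flat_a = toy.groupby("country", as_index=False)["doses"].max()  # DataFrame, country is a column
flat_b = indexed.reset_index()                                  # same layout via reset_index()

print(indexed)
print(flat_a)
print(flat_b)
```

The flat layout is convenient when the result is going to be merged with another DataFrame on the country column.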


# Sorting

<a id='Sort'> </a>

Looking again at max total vaccination in each country we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not in the data.

To get data in the order we want, we can sort it ourselves. The sort_values() method is handy for this.

In [90]:
df.groupby("country").total_vaccinations.max().reset_index(name='total_vaccinations_max')

Unnamed: 0,country,total_vaccinations_max
0,Argentina,288064.0
1,Austria,169549.0
2,Bahrain,144130.0
3,Belgium,174307.0
4,Brazil,537774.0
5,Bulgaria,26119.0
6,Canada,801238.0
7,Chile,63047.0
8,China,15000000.0
9,Costa Rica,29389.0


In [91]:
by_country_totalvaccinations_max=df.groupby("country").total_vaccinations.max().reset_index(name='total_vaccinations_max')
by_country_totalvaccinations_max.sort_values('total_vaccinations_max')

Unnamed: 0,country,total_vaccinations_max
14,Ecuador,108.0
30,Kuwait,2500.0
27,Isle of Man,3648.0
40,Panama,5594.0
33,Luxembourg,6897.0
23,Iceland,8249.0
20,Gibraltar,8877.0
48,Seychelles,13163.0
11,Cyprus,17379.0
34,Malta,17767.0
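
As the output above shows, sort_values() keeps the original index labels. The sketch below, on made-up data, sorts by two columns with a different direction for each and then renumbers the rows:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "value": [3, 1, 2, 4],
})

# Sort by group ascending, then by value descending within each group
sorted_multi = toy.sort_values(["group", "value"], ascending=[True, False])

# The original index labels travel with the rows; reset_index(drop=True) renumbers them
renumbered = sorted_multi.reset_index(drop=True)

print(sorted_multi)
print(renumbered)
```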


In [92]:
#Practice
# Can you sort the DataFrame above in descending order of total_vaccinations_max?

# Merging, Joining, and Concatenating
<a id='MergeJoinConcat'> </a>

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. 

## Merging

<a id='Merging'> </a>

You can use merge() any time you want to do database-like join operations. It’s the most flexible of the three operations you’ll learn.

When you want to combine data objects based on one or more keys in a similar way to a relational database, merge() is the tool you need. More specifically, merge() is most useful when you want to combine rows that share data.

You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-one join, one of your datasets will have many rows in the merge column that repeat the same values (such as 1, 1, 3, 5, 5), while the merge column in the other dataset will not have repeat values (such as 1, 3, 5).

In a many-to-many join, both of your merge columns will have repeat values. These merges are more complex and result in the Cartesian product of the joined rows.

This means that, after the merge, you’ll have every combination of rows that share the same value in the key column. You’ll see this in action in the examples below.
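
A minimal sketch of the two cases, using the key values from the text (1, 1, 3, 5, 5) and otherwise made-up columns. The validate parameter of merge() can be used to guard against an accidental many-to-many join:

```python
import pandas as pd

# Toy data, invented for illustration only
left = pd.DataFrame({"key": [1, 1, 3, 5, 5], "left_val": list("abcde")})
unique_right = pd.DataFrame({"key": [1, 3, 5], "right_val": ["x", "y", "z"]})
repeat_right = pd.DataFrame({"key": [1, 1, 5], "right_val": ["x1", "x2", "z"]})

# Many-to-one: each left row finds exactly one partner, so 5 rows come back
many_to_one = pd.merge(left, unique_right, on="key")

# Many-to-many: key 1 yields 2 x 2 rows, key 5 yields 2 x 1 rows, key 3 has no partner
many_to_many = pd.merge(left, repeat_right, on="key")

# validate raises MergeError when the right-hand keys are not unique
try:
    pd.merge(left, repeat_right, on="key", validate="many_to_one")
    check_passed = True
except pd.errors.MergeError:
    check_passed = False

print(len(many_to_one), len(many_to_many), check_passed)
```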

What makes merge() so flexible is the sheer number of options for defining the behavior of your merge. While the list can seem daunting, with practice you’ll be able to expertly merge datasets of all kinds.

When you use merge(), you’ll provide two required arguments:

- The left DataFrame
- The right DataFrame

After that, you can provide a number of optional arguments to define how your datasets are merged, for example:

- how: This defines what kind of merge to make. It defaults to 'inner', but other possible options include 'outer', 'left', and 'right'.

Before getting into the details of how to use merge(), you should first understand the various forms of joins:

- inner
- outer
- left
- right

![image.png](attachment:image.png)
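
The four join types can be compared side by side on a small example. This is a sketch on hypothetical toy frames (the `country` values and the `max_val`/`min_val` columns are invented); it counts the rows each `how` option keeps and uses the `indicator=True` option of `pd.merge` to show where each row came from.

```python
import pandas as pd

# Hypothetical toy frames: "B" exists only on the left, "D" only on the right.
left = pd.DataFrame({"country": ["A", "B", "C"], "max_val": [10, 20, 30]})
right = pd.DataFrame({"country": ["A", "C", "D"], "min_val": [1, 3, 4]})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, how=how, on="country")
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}

# indicator=True adds a `_merge` column with the values
# "both", "left_only", or "right_only" for each row.
outer = pd.merge(left, right, how="outer", on="country", indicator=True)
print(outer[["country", "_merge"]])
```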

Let's define two dataframes:
 - one contains the maximum total vaccinations breakdown by country
 - one contains the minimum total vaccinations breakdown by country

In [93]:
by_country_totalvaccinations_max=df.groupby("country").total_vaccinations.max().reset_index(name='total_vaccinations_max')
by_country_totalvaccinations_min=df.groupby("country").total_vaccinations.min().reset_index(name='total_vaccinations_min')


In [94]:
# For by_country_totalvaccinations_min,
# keep only rows with a minimum greater than 0, which excludes some countries
by_country_totalvaccinations_min=by_country_totalvaccinations_min[by_country_totalvaccinations_min.total_vaccinations_min>0]

In [95]:
by_country_totalvaccinations_max.shape,by_country_totalvaccinations_min.shape

((60, 2), (51, 2))

### Inner Join

In [96]:
pd.merge(by_country_totalvaccinations_max,by_country_totalvaccinations_min, how='inner',on='country')

Unnamed: 0,country,total_vaccinations_max,total_vaccinations_min
0,Argentina,288064.0,700.0
1,Austria,169549.0,8360.0
2,Bahrain,144130.0,38965.0
3,Belgium,174307.0,298.0
4,Bulgaria,26119.0,1719.0
5,Canada,801238.0,297.0
6,Chile,63047.0,420.0
7,China,15000000.0,1500000.0
8,Costa Rica,29389.0,55.0
9,Croatia,64951.0,7864.0


### Left join

In [97]:
pd.merge(by_country_totalvaccinations_max,by_country_totalvaccinations_min, how='left',on='country')

Unnamed: 0,country,total_vaccinations_max,total_vaccinations_min
0,Argentina,288064.0,700.0
1,Austria,169549.0,8360.0
2,Bahrain,144130.0,38965.0
3,Belgium,174307.0,298.0
4,Brazil,537774.0,
5,Bulgaria,26119.0,1719.0
6,Canada,801238.0,297.0
7,Chile,63047.0,420.0
8,China,15000000.0,1500000.0
9,Costa Rica,29389.0,55.0


In [98]:
# Practice
# See what the right and outer merge results look like

## Joining
<a id='Joining'> </a>

 By default, .join() will attempt to do a left join on indices. If you want to join on columns like you would with merge(), then you’ll need to set the columns as indices.
 
 Like merge(), .join() has a few parameters that give you more flexibility in your joins. However, with .join(), the list of parameters is relatively short:
 
 - how: This has the same options as how from merge(). The difference is that it is index-based unless you also specify columns with on.
 
When there are overlapping columns, you’ll need to specify a suffix with lsuffix, rsuffix, or both, but this example will demonstrate the more typical behavior of .join():

In [99]:
#set index to be country
by_country_totalvaccinations_max=by_country_totalvaccinations_max.set_index('country')
by_country_totalvaccinations_min=by_country_totalvaccinations_min.set_index('country')

In [100]:
by_country_totalvaccinations_max.join(by_country_totalvaccinations_min, lsuffix="_left")

country,total_vaccinations_max,total_vaccinations_min
Argentina,288064.0,700.0
Austria,169549.0,8360.0
Bahrain,144130.0,38965.0
Belgium,174307.0,298.0
Brazil,537774.0,
Bulgaria,26119.0,1719.0
Canada,801238.0,297.0
Chile,63047.0,420.0
China,15000000.0,1500000.0
Costa Rica,29389.0,55.0

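The suffix parameters only matter when the two DataFrames share a column name, which is not the case in the cell above. The following sketch uses hypothetical toy frames that both have a `value` column, and also shows the `on` parameter matching a column of the calling frame against the index of the other.

```python
import pandas as pd

# Hypothetical toy frames that share the column name "value".
left = pd.DataFrame({"value": [1, 2, 3]}, index=["A", "B", "C"])
right = pd.DataFrame({"value": [10, 30]}, index=["A", "C"])

# Without any suffix this join would raise a ValueError because "value" overlaps.
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(joined.columns.tolist())  # ['value_left', 'value_right']

# With `on`, a column of the calling frame is matched against the other frame's index.
left_col = left.reset_index().rename(columns={"index": "country"})
joined_on = left_col.join(right, on="country", rsuffix="_right")
print(joined_on)
```

Because .join() defaults to a left join, row "B" is kept and its right-hand value is NaN.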

In [101]:
# Practice
# See what the right and outer join results look like

## Concatenation
<a id='Concatenation'> </a>

Concatenation glues DataFrames together along an axis. Labels on the other axis are aligned, and entries that are missing from one of the inputs are filled with NaN. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [102]:
by_country_totalvaccinations_max.head(1)

country,total_vaccinations_max
Argentina,288064.0


In [103]:
by_country_totalvaccinations_min.head(1)

country,total_vaccinations_min
Argentina,700.0


In [104]:
pd.concat([by_country_totalvaccinations_max,by_country_totalvaccinations_min],axis=0)

country,total_vaccinations_max,total_vaccinations_min
Argentina,288064.0,
Austria,169549.0,
Bahrain,144130.0,
Belgium,174307.0,
Brazil,537774.0,
...,...,...
Sweden,,4115.0
United Arab Emirates,,826301.0
United Kingdom,,667287.0
United States,,556208.0

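The NaN values in the output above appear because the two DataFrames have different columns. A minimal sketch on hypothetical toy frames shows how `pd.concat` treats columns that do not match, using its `join` and `ignore_index` parameters.

```python
import pandas as pd

# Hypothetical toy frames with one shared column ("a") and one column each of their own.
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# Default join="outer": keep the union of columns and fill the gaps with NaN.
stacked = pd.concat([top, bottom], axis=0)
print(stacked.shape)  # (4, 3)

# join="inner": keep only the columns present in every frame.
# ignore_index=True replaces the repeated row labels with 0..n-1.
stacked_inner = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)
print(stacked_inner.columns.tolist())  # ['a']
```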

In [105]:
#practice
# try to glue these two dataframes horizontally

## Drop Columns/Rows

<a id='Drop'> </a>

In [106]:
# you can drop a list of columns
df.drop(['USA_Flag','DownloadFrom'],axis=1)

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,Wales,UNKNOWN,2021-01-18,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1393,Wales,UNKNOWN,2021-01-19,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1394,Wales,UNKNOWN,2021-01-20,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...
1395,Wales,UNKNOWN,2021-01-21,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...


In [107]:
# Practice
# Can you drop the first row of the dataset?

## Rename Column 

<a id='Rename'> </a>

In [108]:
df.rename(columns={'iso_code':'Country ISO Code'})

Unnamed: 0,country,Country ISO Code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website,DownloadFrom,USA_Flag
0,Argentina,ARG,2020-12-29,700.0,,,,,0.00,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...,Kaggle,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1392,Wales,UNKNOWN,2021-01-18,162197.0,161932.0,265.0,10259.0,10123.0,5.14,5.14,0.01,3211.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1393,Wales,UNKNOWN,2021-01-19,176186.0,175816.0,370.0,13989.0,10672.0,5.59,5.58,0.01,3385.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1394,Wales,UNKNOWN,2021-01-20,190831.0,190435.0,396.0,14645.0,11105.0,6.05,6.04,0.01,3522.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,
1395,Wales,UNKNOWN,2021-01-21,212732.0,212317.0,415.0,21901.0,12318.0,6.75,6.73,0.01,3907.0,"Oxford/AstraZeneca, Pfizer/BioNTech",Government of the United Kingdom,https://coronavirus.data.gov.uk/details/health...,Kaggle,

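Note that the output above still contains DownloadFrom and USA_Flag even though they were dropped in the previous cell. As called here, drop and rename return a new DataFrame and leave the original unchanged. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({"iso_code": ["ARG", "AUT"], "USA_Flag": [None, None]})

renamed = toy.rename(columns={"iso_code": "Country ISO Code"})
dropped = toy.drop(["USA_Flag"], axis=1)

print(toy.columns.tolist())      # ['iso_code', 'USA_Flag'] (unchanged)
print(renamed.columns.tolist())  # ['Country ISO Code', 'USA_Flag']
print(dropped.columns.tolist())  # ['iso_code']
```

To keep a change, assign the result back to a variable, for example `toy = toy.drop(["USA_Flag"], axis=1)`.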

## Pivot Table

<a id='Pivot'> </a>


In its simplest form, pivot_table needs:

- data: the pandas DataFrame you pass to the function (or call the method on, as in df.pivot_table(...))
- index: the feature used to group your data. It appears as the index of the resulting table.

The values shown in the table are the result of the summarization that aggfunc applies to the feature data. aggfunc is an aggregate function that pivot_table applies to your grouped data.


In [109]:
df.pivot_table(values='people_vaccinated',index=['date'])

date,people_vaccinated
2020-12-13,10220.67
2020-12-14,297.0
2020-12-15,509887.7
2020-12-16,3683.0
2020-12-17,7201.0
2020-12-18,10615.0
2020-12-19,5975.5
2020-12-20,239027.0
2020-12-21,221435.0
2020-12-22,49479.67

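Since no aggfunc was given in the cell above, the values are means, which is the default. The sketch below, on a hypothetical toy frame, compares the default with an explicit `aggfunc="max"`.

```python
import pandas as pd

# Hypothetical toy data with two rows per date.
toy = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "people_vaccinated": [100.0, 300.0, 50.0, 150.0],
})

# Default aggregation is the mean of each group.
mean_table = toy.pivot_table(values="people_vaccinated", index="date")
print(mean_table)  # d1 -> 200.0, d2 -> 100.0

# An explicit aggfunc replaces the mean with another summary.
max_table = toy.pivot_table(values="people_vaccinated", index="date", aggfunc="max")
print(max_table)   # d1 -> 300.0, d2 -> 150.0
```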


Using multiple features as indexes is fine, but using some features as columns will help you to intuitively understand the relationship between them. Also, the resultant table can always be better viewed by incorporating the columns parameter of the pivot_table.

This columns parameter is optional and displays the values horizontally on the top of the resultant table.

Neither columns nor index is individually required, but pivot_table needs at least one of them to group by. Using them effectively will help you to intuitively understand the relationship between the features.

In [110]:
df.pivot_table(values='people_vaccinated',index=['date'],columns=['country'])

date,Argentina,Austria,Bahrain,Belgium,Brazil,Bulgaria,Canada,Chile,China,Costa Rica,...,Slovakia,Slovenia,Spain,Sweden,Switzerland,Turkey,United Arab Emirates,United Kingdom,United States,Wales
2020-12-13,,,,,,,,,,,...,,,,,,,,,,8181.0
2020-12-14,,,,,,,297.0,,,,...,,,,,,,,,,
2020-12-15,,,,,,,1163.0,,1500000.0,,...,,,,,,,,,,
2020-12-16,,,,,,,3683.0,,,,...,,,,,,,,,,
2020-12-17,,,,,,,7201.0,,,,...,,,,,,,,,,
2020-12-18,,,,,,,10615.0,,,,...,,,,,,,,,,
2020-12-19,,,,,,,11894.0,,,,...,,,,,,,,,,
2020-12-20,,,,,,,14492.0,,,,...,,,,,,,,667287.0,556208.0,23567.0
2020-12-21,,,,,,,20866.0,,,,...,,,,,,,,,614117.0,
2020-12-22,,,,,,,26287.0,,,,...,,,,,,,,,,

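One way to read a pivot table with both index and columns is as a groupby on the two features followed by moving one of them into the columns. The sketch below illustrates this reading on a hypothetical toy frame; it is not meant to describe how pandas implements pivot_table internally.

```python
import pandas as pd

# Hypothetical toy data: date "d2" has two rows for country "X".
toy = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2", "d2"],
    "country": ["X", "Y", "X", "X", "Y"],
    "people_vaccinated": [1.0, 2.0, 3.0, 5.0, 8.0],
})

# Pivot: one row per date, one column per country, mean as the default aggregation.
pivoted = toy.pivot_table(values="people_vaccinated", index="date", columns="country")

# The same table built step by step: group, aggregate, then unstack "country".
grouped = toy.groupby(["date", "country"])["people_vaccinated"].mean().unstack("country")

print(pivoted)
print(grouped)
```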

In [111]:
# Practice
# Get the mean of total_vaccinations for each value of vaccines (the vaccines used in the country)


## Optional: Data input from HTML
<a id='HTML'> </a>

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt run:

    conda install lxml
    conda install html5lib
    conda install BeautifulSoup4

If you aren't using the Anaconda distribution, use pip install instead. Then restart Jupyter Notebook.

Pandas can read tables directly from HTML. For example:

In [112]:
# HTML input
# read_html reads the tables on a web page and returns a list of DataFrame objects.
# Note: this reassigns df, so the vaccination DataFrame is no longer available under that name.
df = pd.read_html('https://en.wikipedia.org/wiki/Pythonidae')

In [113]:
len(df)

8

In [114]:
# We get a list of 8 tables. On the Wikipedia page, the table at index 1 is the infobox on the right-hand side.
df[1]

Unnamed: 0,Pythonidae,Pythonidae.1
0,,
1,Indian python (Python molurus),Indian python (Python molurus)
2,Scientific classification,Scientific classification
3,Kingdom:,Animalia
4,Phylum:,Chordata
5,Class:,Reptilia
6,Order:,Squamata
7,Suborder:,Serpentes
8,Superfamily:,Pythonoidea
9,Family:,"PythonidaeFitzinger, 1826"


In [115]:
df[2]

Unnamed: 0,Genus[2],Taxon author[2],Species[2],Subsp.[a][2],Common name,Geographic range[1]
0,Antaresia,"Wells & Wellington, 1984",4,2,Children's pythons,Australia in arid and tropical regions
1,Apodora[13],"Kluge, 1993",1,0,Papuan olive python,Papua New Guinea
2,Aspidites,"Peters, 1877",2,0,Shield pythons,Australia except in the south of the country
3,Bothrochilus,"Fitzinger, 1843",7,0,White-lipped pythons,"Most of New Guinea (below 1,200 metres (3,900 ..."
4,Leiopython,"Hubrecht, 1879",2,0,,Papua New Guinea
5,Liasis,"Gray, 1842",3,5,Water pythons,"Indonesia in the Lesser Sunda Islands, east th..."
6,Malayopython,"Reynolds, 2014",2,3,Reticulated and Timor pythons,From India to Timor
7,Morelia,"Gray, 1842",8,6,Tree pythons,"From Indonesia in the Maluku Islands, east thr..."
8,Nawaran,"Donnellan, Brennan, Lemmon, Moriarty Lemmon, Z...",4,0,Oenpelli python,"Northern Territory,Australia"
9,Python[b],"Daudin, 1803",10,2,"""True"" pythons",Africa in the tropics south of the Sahara (not...


## Optional: Data input from SQL
<a id='SQL'> </a>

The pandas.io.sql module provides a collection of query wrappers that facilitate data retrieval and reduce dependency on DB-specific APIs. Database abstraction is provided by SQLAlchemy if installed. In addition, you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite, the driver is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is provided only for sqlite. This mode requires a Python database adapter that respects the Python DB-API.

See also the pandas cookbook for some advanced strategies.

The key functions are:

* read_sql_table(table_name, con[, schema, ...])
    * Read SQL database table into a DataFrame.
* read_sql_query(sql, con[, index_col, ...])
    * Read SQL query into a DataFrame.
* read_sql(sql, con[, index_col, ...])
    * Read SQL query or database table into a DataFrame.
* DataFrame.to_sql(name, con[, schema, ...])
    * Write records stored in a DataFrame to a SQL database.
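
As a minimal sketch of the sqlite fallback, the example below uses a `sqlite3` connection from the standard library instead of a SQLAlchemy engine. The table name `pythons` and the toy data are hypothetical. It sticks to `to_sql` and `read_sql_query`, because `read_sql_table` requires SQLAlchemy.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data, loosely modeled on the Wikipedia table above.
toy = pd.DataFrame({"genus": ["Antaresia", "Morelia"], "species": [4, 8]})

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, without storing the row index as a column.
toy.to_sql("pythons", conn, index=False)

# Read back only the rows that satisfy a SQL condition.
result = pd.read_sql_query("SELECT genus FROM pythons WHERE species > 5", conn)
print(result)

conn.close()
```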

In [116]:
from sqlalchemy import create_engine


In [117]:
engine = create_engine('sqlite:///:memory:')

df[1].to_sql('data', engine)

sql_df = pd.read_sql('data',con=engine)

sql_df

Unnamed: 0,index,Pythonidae,Pythonidae.1
0,0,,
1,1,Indian python (Python molurus),Indian python (Python molurus)
2,2,Scientific classification,Scientific classification
3,3,Kingdom:,Animalia
4,4,Phylum:,Chordata
5,5,Class:,Reptilia
6,6,Order:,Squamata
7,7,Suborder:,Serpentes
8,8,Superfamily:,Pythonoidea
9,9,Family:,"PythonidaeFitzinger, 1826"
