# Pandas Basics

## Pandas DataFrames

Pandas is built on the NumPy package, and its key data structure is the DataFrame. A DataFrame lets you store and manipulate tabular data as rows of observations and columns of variables.

One way to create a DataFrame is to use a dictionary.

In [1]:
import pandas as pd

# Named brics_data rather than dict, which would shadow Python's built-in type.
# area is in millions of square kilometres, population in millions of people.
brics_data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
              "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
              "area": [8.516, 17.10, 3.286, 9.597, 1.221],
              "population": [200.4, 143.5, 1252, 1357, 52.98]}

brics = pd.DataFrame(brics_data)
print(brics)

        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98


As the output shows, Pandas has labelled the rows with a default integer index running from 0 to 4. If you would prefer different index values, such as each country's two-letter code, you can assign them directly.

In [2]:
# Set the index for brics
brics.index = ["BR", "RU", "IN", "CN", "ZA"]

# Print out brics with new index values
print(brics)

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CN         China    Beijing   9.597     1357.00
ZA  South Africa   Pretoria   1.221       52.98
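
Assigning to the index attribute is not the only option. As a minimal sketch, the snippet below uses a three-row subset of the same data (the variable names data, labelled and by_country are illustrative, not part of the original notebook) to show two alternatives: passing index= when the DataFrame is created, and promoting an existing column with set_index.

```python
import pandas as pd

# Illustrative three-row subset of the BRICS data
data = {
    "country": ["Brazil", "Russia", "India"],
    "area": [8.516, 17.10, 3.286],
}

# Option 1: supply the row labels when the DataFrame is created
labelled = pd.DataFrame(data, index=["BR", "RU", "IN"])

# Option 2: promote an existing column to the index
by_country = pd.DataFrame(data).set_index("country")

print(labelled.index.tolist())          # ['BR', 'RU', 'IN']
print(by_country.loc["India", "area"])  # 3.286
```

Note that set_index returns a new DataFrame and leaves the original unchanged, whereas assigning to .index modifies the DataFrame in place.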


Another way to create a DataFrame is to import a CSV file with Pandas.

In [12]:
# Import a CSV file
df = pd.read_csv('items for penicillins per.csv')

print(df)

            date   id                             name  y_items  \
0     2016-02-01  02D               NHS VALE ROYAL CCG       53   
1     2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG        1   
2     2016-02-01  10C             NHS SURREY HEATH CCG     2445   
3     2016-02-01  02G          NHS WEST LANCASHIRE CCG     3058   
4     2016-02-01  02Q                NHS BASSETLAW CCG     3473   
...          ...  ...                              ...      ...   
8123  2021-01-01  15N                    NHS DEVON CCG    17778   
8124  2021-01-01  15E  NHS BIRMINGHAM AND SOLIHULL CCG    20704   
8125  2021-01-01  36L        NHS SOUTH WEST LONDON CCG    21088   
8126  2021-01-01  72Q        NHS SOUTH EAST LONDON CCG    23464   
8127  2021-01-01  91Q          NHS KENT AND MEDWAY CCG    29004   

      y_actual_cost  x_items  x_actual_cost  
0            185.56        0              0  
1              1.70        0              0  
2           7624.29        0              0  
3          

In [19]:
print(df)

            date   id                             name  y_items  y_actual_cost
0     2016-02-01  02D               NHS VALE ROYAL CCG       53         185.56
1     2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG        1           1.70
2     2016-02-01  10C             NHS SURREY HEATH CCG     2445        7624.29
3     2016-02-01  02G          NHS WEST LANCASHIRE CCG     3058       10272.31
4     2016-02-01  02Q                NHS BASSETLAW CCG     3473        9287.69
...          ...  ...                              ...      ...            ...
8123  2021-01-01  15N                    NHS DEVON CCG    17778       52749.86
8124  2021-01-01  15E  NHS BIRMINGHAM AND SOLIHULL CCG    20704       57009.80
8125  2021-01-01  36L        NHS SOUTH WEST LONDON CCG    21088       54922.16
8126  2021-01-01  72Q        NHS SOUTH EAST LONDON CCG    23464       67455.56
8127  2021-01-01  91Q          NHS KENT AND MEDWAY CCG    29004       76874.62

[8128 rows x 5 columns]
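
The CSV file itself is not reproduced here, so the following is only a sketch of the same read_csv call on a small in-memory stand-in. The three rows and the column names are copied from the printed output above; io.StringIO is used in place of a file path, and the name sample is hypothetical.

```python
import io
import pandas as pd

# Three rows taken from the printed output; StringIO stands in for a file on disk
csv_text = """date,id,name,y_items,y_actual_cost
2016-02-01,02D,NHS VALE ROYAL CCG,53,185.56
2016-02-01,05J,NHS REDDITCH AND BROMSGROVE CCG,1,1.70
2016-02-01,10C,NHS SURREY HEATH CCG,2445,7624.29
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (3, 5): rows, columns
print(sample.dtypes)   # column types inferred by read_csv
print(sample.head(2))  # first two rows only
```

For a file with thousands of rows, shape and head() are usually a quicker check than printing the whole DataFrame.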


## Indexing DataFrames 

There are several ways to index a Pandas DataFrame. One of the easiest ways to do this is by using square bracket notation.

In the example below, square brackets select one column of the penicillin DataFrame. You can use either single or double brackets: single brackets return a Pandas Series, while double brackets return a Pandas DataFrame.

Here index_col is set to 0 rather than left at its default of None, so the first column of the file (date) is used as the index.

In [21]:
penicillin = pd.read_csv('items for penicillins per.csv', index_col=0)

print(penicillin)

             id                             name  y_items  y_actual_cost  \
date                                                                       
2016-02-01  02D               NHS VALE ROYAL CCG       53         185.56   
2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG        1           1.70   
2016-02-01  10C             NHS SURREY HEATH CCG     2445        7624.29   
2016-02-01  02G          NHS WEST LANCASHIRE CCG     3058       10272.31   
2016-02-01  02Q                NHS BASSETLAW CCG     3473        9287.69   
...         ...                              ...      ...            ...   
2021-01-01  15N                    NHS DEVON CCG    17778       52749.86   
2021-01-01  15E  NHS BIRMINGHAM AND SOLIHULL CCG    20704       57009.80   
2021-01-01  36L        NHS SOUTH WEST LONDON CCG    21088       54922.16   
2021-01-01  72Q        NHS SOUTH EAST LONDON CCG    23464       67455.56   
2021-01-01  91Q          NHS KENT AND MEDWAY CCG    29004       76874.62   
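
To make the effect of index_col concrete, here is a small sketch that reads the same in-memory text twice, once with the default and once with index_col=0. The two-row CSV text and the names default_index and date_index are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real file
csv_text = """date,id,y_items
2016-02-01,02D,53
2016-02-01,05J,1
"""

# Default (index_col=None): Pandas generates an integer index
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (date) becomes the index
date_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())  # [0, 1]
print(date_index.index.name)         # date
print(date_index.columns.tolist())   # ['id', 'y_items']
```

Once date is the index it no longer appears among the columns, which is why the printed penicillin DataFrame shows date on its own header row.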

           

In [22]:
print(penicillin['id'])

date
2016-02-01    02D
2016-02-01    05J
2016-02-01    10C
2016-02-01    02G
2016-02-01    02Q
             ... 
2021-01-01    15N
2021-01-01    15E
2021-01-01    36L
2021-01-01    72Q
2021-01-01    91Q
Name: id, Length: 8128, dtype: object


In [24]:
print(penicillin[['id','name']])

             id                             name
date                                            
2016-02-01  02D               NHS VALE ROYAL CCG
2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG
2016-02-01  10C             NHS SURREY HEATH CCG
2016-02-01  02G          NHS WEST LANCASHIRE CCG
2016-02-01  02Q                NHS BASSETLAW CCG
...         ...                              ...
2021-01-01  15N                    NHS DEVON CCG
2021-01-01  15E  NHS BIRMINGHAM AND SOLIHULL CCG
2021-01-01  36L        NHS SOUTH WEST LONDON CCG
2021-01-01  72Q        NHS SOUTH EAST LONDON CCG
2021-01-01  91Q          NHS KENT AND MEDWAY CCG

[8128 rows x 2 columns]
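
The difference between the two bracket styles can be checked directly. This sketch uses a two-row DataFrame named demo (a hypothetical stand-in for penicillin) and inspects the type and shape of each result.

```python
import pandas as pd

# Hypothetical stand-in with values taken from the output above
demo = pd.DataFrame({
    "id": ["02D", "05J"],
    "name": ["NHS VALE ROYAL CCG", "NHS REDDITCH AND BROMSGROVE CCG"],
    "y_items": [53, 1],
})

single = demo["id"]    # single brackets: a Series
double = demo[["id"]]  # double brackets: a one-column DataFrame

print(type(single).__name__)       # Series
print(type(double).__name__)       # DataFrame
print(single.shape, double.shape)  # (2,) (2, 1)
```

The inner brackets in demo[["id"]] are an ordinary Python list of column names, which is why the same syntax also selects several columns at once.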


Square brackets can also select observations (rows) from a DataFrame when you pass a slice of row positions; as with Python lists, the end position is excluded. For example:

In [26]:
# Print out first 4 observations
print(penicillin[0:4])

             id                             name  y_items  y_actual_cost  \
date                                                                       
2016-02-01  02D               NHS VALE ROYAL CCG       53         185.56   
2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG        1           1.70   
2016-02-01  10C             NHS SURREY HEATH CCG     2445        7624.29   
2016-02-01  02G          NHS WEST LANCASHIRE CCG     3058       10272.31   

            x_items  x_actual_cost  
date                                
2016-02-01        0              0  
2016-02-01        0              0  
2016-02-01        0              0  
2016-02-01        0              0  


In [27]:
# Print out fifth and sixth observations
print(penicillin[4:6])

             id                                  name  y_items  y_actual_cost  \
date                                                                            
2016-02-01  02Q                     NHS BASSETLAW CCG     3473        9287.69   
2016-02-01  09A  NHS CENTRAL LONDON (WESTMINSTER) CCG     3637       11704.73   

            x_items  x_actual_cost  
date                                
2016-02-01        0              0  
2016-02-01        0              0  
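
Row selection with square brackets only works with a slice. The sketch below, on a hypothetical six-row DataFrame named demo, repeats the two slices used above and then shows that a bare integer is treated as a column label instead, which raises a KeyError here because no column is named 0.

```python
import pandas as pd

# Hypothetical stand-in: six rows sharing one date label, as in the real data
demo = pd.DataFrame(
    {"y_items": [53, 1, 2445, 3058, 3473, 3637]},
    index=["2016-02-01"] * 6,
)

first_four = demo[0:4]   # positions 0, 1, 2, 3
fifth_sixth = demo[4:6]  # positions 4 and 5

print(first_four["y_items"].tolist())   # [53, 1, 2445, 3058]
print(fifth_sixth["y_items"].tolist())  # [3473, 3637]

# A single integer is looked up as a column label, not a row position
try:
    demo[0]
except KeyError:
    print("demo[0] looks for a column labelled 0")
```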


You can also use loc and iloc to perform almost any data selection operation. loc is label-based: you specify rows and columns by their row and column labels.

iloc is integer-position-based: you specify rows and columns by their integer position, as with the slices in the section above.

In [28]:
# Print out the observation for Surrey Heath (third row, position 2)
print(penicillin.iloc[2])

id                                10C
name             NHS SURREY HEATH CCG
y_items                          2445
y_actual_cost                 7624.29
x_items                             0
x_actual_cost                       0
Name: 2016-02-01, dtype: object


In [36]:
# Print out observations for 2016-02-01
print(penicillin.loc[['2016-02-01']])

             id                             name  y_items  y_actual_cost  \
date                                                                       
2016-02-01  02D               NHS VALE ROYAL CCG       53         185.56   
2016-02-01  05J  NHS REDDITCH AND BROMSGROVE CCG        1           1.70   
2016-02-01  10C             NHS SURREY HEATH CCG     2445        7624.29   
2016-02-01  02G          NHS WEST LANCASHIRE CCG     3058       10272.31   
2016-02-01  02Q                NHS BASSETLAW CCG     3473        9287.69   
...         ...                              ...      ...            ...   
2016-02-01  93C     NHS NORTH CENTRAL LONDON CCG    34278      124324.54   
2016-02-01  15E  NHS BIRMINGHAM AND SOLIHULL CCG    37547      128799.97   
2016-02-01  36L        NHS SOUTH WEST LONDON CCG    40592      136044.77   
2016-02-01  72Q        NHS SOUTH EAST LONDON CCG    47031      170830.78   
2016-02-01  91Q          NHS KENT AND MEDWAY CCG    57300      170015.79   
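
As a compact side-by-side of the two accessors, the following sketch builds a three-row DataFrame with a repeated date label (the name demo and its contents are illustrative, with values borrowed from the output above) and selects rows and single cells both ways.

```python
import pandas as pd

# Hypothetical stand-in: two rows share the label '2016-02-01'
demo = pd.DataFrame(
    {"id": ["02D", "05J", "15N"], "y_items": [53, 1, 17778]},
    index=["2016-02-01", "2016-02-01", "2021-01-01"],
)

by_position = demo.iloc[2]         # third row, returned as a Series
by_label = demo.loc["2016-02-01"]  # every row carrying this label

cell_by_label = demo.loc["2021-01-01", "y_items"]  # row label, column label
cell_by_position = demo.iloc[2, 1]                 # row position, column position

print(by_position["id"])                # 15N
print(len(by_label))                    # 2
print(cell_by_label, cell_by_position)  # 17778 17778
```

Because index labels need not be unique, a single label passed to loc can return several rows, as it does for the date index in the penicillin data. Wrapping the label in a list, as in loc[['2016-02-01']], returns a DataFrame regardless of how many rows match.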

           