# Pandas simple tutorial
This is a simple tutorial I created as a quick reference.<br>

[Pandas API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)

## Important concepts:
* [Series](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
* [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe) is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
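
The "dict of Series" view can be made concrete with a minimal sketch. The names and values below are made up for illustration and are not part of the contact data used in the rest of the tutorial.

```python
import pandas as pd

# a Series: one-dimensional values plus an index of labels
ages = pd.Series([28, 42], index=['Susan', 'Bently'], name='age')
cities = pd.Series(['London', 'Kathmandu'], index=['Susan', 'Bently'], name='city')

# a DataFrame built as a dict of Series; the shared index aligns the rows
people = pd.DataFrame({'age': ages, 'city': cities})

print(people.loc['Susan', 'city'])  # label-based lookup of a single cell
print(type(people['age']))          # each column is itself a Series
```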

In [1]:
import pandas as pd
import numpy as np

In [2]:
contacts = {
            'name': ['Susan Calvin', 'Bently Powell', 'Gregory Powell', 'Mike Donovan'],
            'city': ['London', 'Kathmandu', 'Moskow', 'Bangalore'],
            'phone': ['056152358', '096523995', '895712365', '886549702'],
            'age' : ['28', '42', '66', '67'],
            'e-mail': ['SusanCalvin@email.com', 'BentlyP@email.com', 'GregP14@email.com', 'MDonovan@email.com']
            }

In [3]:
# create a DataFrame from the dictionary; each key becomes a column
df = pd.DataFrame(contacts)

In [4]:
# display the DataFrame; in a notebook, a bare name on the last line of a cell is rendered as a table
df

Unnamed: 0,name,city,phone,age,e-mail
0,Susan Calvin,London,56152358,28,SusanCalvin@email.com
1,Bently Powell,Kathmandu,96523995,42,BentlyP@email.com
2,Gregory Powell,Moskow,895712365,66,GregP14@email.com
3,Mike Donovan,Bangalore,886549702,67,MDonovan@email.com


In [5]:
# shape of the df (rows, columns)
df.shape

(4, 5)

In [6]:
df.columns

Index(['name', 'city', 'phone', 'age', 'e-mail'], dtype='object')

### Working with columns

In [7]:
# select a column
df['name']

0      Susan Calvin
1     Bently Powell
2    Gregory Powell
3      Mike Donovan
Name: name, dtype: object

In [8]:
# select two or more columns; notice the extra pair of brackets (a list of column names)
df[['name', 'city']]

Unnamed: 0,name,city
0,Susan Calvin,London
1,Bently Powell,Kathmandu
2,Gregory Powell,Moskow
3,Mike Donovan,Bangalore


In [9]:
# select a single entry of a given column (here the row labeled 1)
df['e-mail'][1]

'BentlyP@email.com'

### Renaming columns

In [10]:
# note that df.columns is an Index object, which can be replaced by assigning a list
df.columns

Index(['name', 'city', 'phone', 'age', 'e-mail'], dtype='object')

In [11]:
# renaming all at once
df.columns = ['NAME', 'CITY', 'PHONE', 'AGE', 'E-MAIL']
df

Unnamed: 0,NAME,CITY,PHONE,AGE,E-MAIL
0,Susan Calvin,London,56152358,28,SusanCalvin@email.com
1,Bently Powell,Kathmandu,96523995,42,BentlyP@email.com
2,Gregory Powell,Moskow,895712365,66,GregP14@email.com
3,Mike Donovan,Bangalore,886549702,67,MDonovan@email.com


In [12]:
# using list comprehension
df.columns = [x.lower() for x in df.columns]
df

Unnamed: 0,name,city,phone,age,e-mail
0,Susan Calvin,London,56152358,28,SusanCalvin@email.com
1,Bently Powell,Kathmandu,96523995,42,BentlyP@email.com
2,Gregory Powell,Moskow,895712365,66,GregP14@email.com
3,Mike Donovan,Bangalore,886549702,67,MDonovan@email.com


In [13]:
# item by item
df.rename(columns = {'name': 'full_name', 'e-mail': 'email'}, inplace=True)
df

Unnamed: 0,full_name,city,phone,age,email
0,Susan Calvin,London,56152358,28,SusanCalvin@email.com
1,Bently Powell,Kathmandu,96523995,42,BentlyP@email.com
2,Gregory Powell,Moskow,895712365,66,GregP14@email.com
3,Mike Donovan,Bangalore,886549702,67,MDonovan@email.com


### Working with rows
The most common way to access rows is through two indexers, **iloc** and **loc**.
* iloc selects rows and columns by integer position.
* loc selects rows and columns by label, and also accepts slices of labels and boolean masks.
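
The difference between the two only becomes visible when the index labels are not 0, 1, 2, ... A minimal sketch, using a small hypothetical DataFrame whose labels are deliberately 10, 20, 30:

```python
import pandas as pd

# hypothetical data; the index labels are deliberately not 0, 1, 2
demo = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy']}, index=[10, 20, 30])

by_position = demo.iloc[0]  # first row, whatever its label is
by_label = demo.loc[10]     # row whose index label is 10

# demo.loc[0] would raise a KeyError because there is no label 0
print(by_position['name'], by_label['name'])
```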

In [14]:
df.iloc[3]

full_name          Mike Donovan
city                  Bangalore
phone                 886549702
age                          67
email        MDonovan@email.com
Name: 3, dtype: object

In [15]:
# to access more than one row or column, pass lists of labels inside brackets
df.loc[[1,3,0], ['full_name', 'email']]

Unnamed: 0,full_name,email
1,Bently Powell,BentlyP@email.com
3,Mike Donovan,MDonovan@email.com
0,Susan Calvin,SusanCalvin@email.com


In [16]:
# no inner brackets are needed for slicing; unlike standard Python slices, the stop label is inclusive with loc
# [rows, columns]
df.loc[1:3, 'full_name':'phone']

Unnamed: 0,full_name,city,phone
1,Bently Powell,Kathmandu,96523995
2,Gregory Powell,Moskow,895712365
3,Mike Donovan,Bangalore,886549702
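
To see the inclusive stop next to the usual exclusive one, here is a small sketch on a hypothetical numeric DataFrame. The same bounds give a different number of rows with loc and with iloc.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})

# label slice: rows 1, 2, 3 and columns a, b (stop label included)
label_slice = demo.loc[1:3, 'a':'b']

# position slice: rows 1, 2 and columns a, b (stop position excluded)
position_slice = demo.iloc[1:3, 0:2]

print(label_slice.shape, position_slice.shape)
```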


### Indexing

In [17]:
# set the email column as the index; inplace=True modifies df instead of returning a new DataFrame
df.set_index('email', inplace=True)
df

email,full_name,city,phone,age
SusanCalvin@email.com,Susan Calvin,London,56152358,28
BentlyP@email.com,Bently Powell,Kathmandu,96523995,42
GregP14@email.com,Gregory Powell,Moskow,895712365,66
MDonovan@email.com,Mike Donovan,Bangalore,886549702,67


In [18]:
# now the email is the index, so loc looks rows up by email and the old integer labels no longer work
df.loc['GregP14@email.com']

full_name    Gregory Powell
city                 Moskow
phone             895712365
age                      66
Name: GregP14@email.com, dtype: object

In [19]:
df.loc['MDonovan@email.com', ['full_name', 'phone']]

full_name    Mike Donovan
phone           886549702
Name: MDonovan@email.com, dtype: object

In [20]:
# iloc still works with integer positions
df.iloc[1]

full_name    Bently Powell
city             Kathmandu
phone            096523995
age                     42
Name: BentlyP@email.com, dtype: object

In [21]:
# sorting the index, ascending order is the default
df.sort_index(inplace=True)
df

email,full_name,city,phone,age
BentlyP@email.com,Bently Powell,Kathmandu,96523995,42
GregP14@email.com,Gregory Powell,Moskow,895712365,66
MDonovan@email.com,Mike Donovan,Bangalore,886549702,67
SusanCalvin@email.com,Susan Calvin,London,56152358,28


In [22]:
# sorting the index in descending order
df.sort_index(ascending=False, inplace=True)
df

email,full_name,city,phone,age
SusanCalvin@email.com,Susan Calvin,London,56152358,28
MDonovan@email.com,Mike Donovan,Bangalore,886549702,67
GregP14@email.com,Gregory Powell,Moskow,895712365,66
BentlyP@email.com,Bently Powell,Kathmandu,96523995,42


In [23]:
# to reset the index; the old index comes back as a regular column
df.reset_index(inplace=True)
df

Unnamed: 0,email,full_name,city,phone,age
0,SusanCalvin@email.com,Susan Calvin,London,56152358,28
1,MDonovan@email.com,Mike Donovan,Bangalore,886549702,67
2,GregP14@email.com,Gregory Powell,Moskow,895712365,66
3,BentlyP@email.com,Bently Powell,Kathmandu,96523995,42


### Updating rows

In [24]:
df.loc[1]

email        MDonovan@email.com
full_name          Mike Donovan
city                  Bangalore
phone                 886549702
age                          67
Name: 1, dtype: object

In [25]:
# updating all items
df.loc[1] = ['MHoward@email.com', 'Mike Howard', 'New Delhi', '225896337', '68']
df

Unnamed: 0,email,full_name,city,phone,age
0,SusanCalvin@email.com,Susan Calvin,London,56152358,28
1,MHoward@email.com,Mike Howard,New Delhi,225896337,68
2,GregP14@email.com,Gregory Powell,Moskow,895712365,66
3,BentlyP@email.com,Bently Powell,Kathmandu,96523995,42


In [26]:
# updating selected items
df.loc[1, ['full_name', 'email']] = ['Mike Donovan', 'MDonovan@email.com']
df

Unnamed: 0,email,full_name,city,phone,age
0,SusanCalvin@email.com,Susan Calvin,London,56152358,28
1,MDonovan@email.com,Mike Donovan,New Delhi,225896337,68
2,GregP14@email.com,Gregory Powell,Moskow,895712365,66
3,BentlyP@email.com,Bently Powell,Kathmandu,96523995,42


### Adding/removing Columns and rows

In [27]:
# splitting the full_name column into two new columns: first and last
# expand=True returns a DataFrame with one column per piece; with expand=False the result is a Series of lists of strings
# https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/

df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df

Unnamed: 0,email,full_name,city,phone,age,first,last
0,SusanCalvin@email.com,Susan Calvin,London,56152358,28,Susan,Calvin
1,MDonovan@email.com,Mike Donovan,New Delhi,225896337,68,Mike,Donovan
2,GregP14@email.com,Gregory Powell,Moskow,895712365,66,Gregory,Powell
3,BentlyP@email.com,Bently Powell,Kathmandu,96523995,42,Bently,Powell
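
The split above yields exactly two columns only because every name has exactly two parts. A hedged sketch of one way to handle longer names is the `n` parameter of `str.split`, which limits the number of splits. The three-part name below is hypothetical.

```python
import pandas as pd

# hypothetical names, one of them with three parts
names = pd.Series(['Susan Calvin', 'Mary Ann Smith'])

# n=1 splits only at the first space, so exactly two columns come back
parts = names.str.split(' ', n=1, expand=True)
parts.columns = ['first', 'last']

print(parts)
```

Whether everything after the first space really belongs to the last name depends on the data, so treat this as a starting point rather than a rule.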


In [28]:
# adding items from 2 columns to form a new one
df['full_name_2'] = df['first'] + ' ' + df['last']
df

Unnamed: 0,email,full_name,city,phone,age,first,last,full_name_2
0,SusanCalvin@email.com,Susan Calvin,London,56152358,28,Susan,Calvin,Susan Calvin
1,MDonovan@email.com,Mike Donovan,New Delhi,225896337,68,Mike,Donovan,Mike Donovan
2,GregP14@email.com,Gregory Powell,Moskow,895712365,66,Gregory,Powell,Gregory Powell
3,BentlyP@email.com,Bently Powell,Kathmandu,96523995,42,Bently,Powell,Bently Powell


In [29]:
# removing columns
df.drop(columns=['full_name','full_name_2'], inplace=True)
df

Unnamed: 0,email,city,phone,age,first,last
0,SusanCalvin@email.com,London,56152358,28,Susan,Calvin
1,MDonovan@email.com,New Delhi,225896337,68,Mike,Donovan
2,GregP14@email.com,Moskow,895712365,66,Gregory,Powell
3,BentlyP@email.com,Kathmandu,96523995,42,Bently,Powell


In [30]:
# removing rows; this returns a new DataFrame and leaves df unchanged (pass inplace=True to modify df)
df.drop(index=[1, 3])

Unnamed: 0,email,city,phone,age,first,last
0,SusanCalvin@email.com,London,56152358,28,Susan,Calvin
2,GregP14@email.com,Moskow,895712365,66,Gregory,Powell


### apply and applymap
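* apply() - applies a function to each column or row of a DataFrame, or to each value of a Series
* applymap() - applies a function to every element of a DataFrame (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map())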
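
What the function receives depends on where apply is called. The sketch below uses a small hypothetical DataFrame to show the three cases. The element-wise variant goes through `Series.map` on each column, which is only one of several ways to do it.

```python
import pandas as pd

demo = pd.DataFrame({'first': ['Susan', 'Mike'], 'last': ['Calvin', 'Donovan']})

# Series.apply: the function receives one value at a time
first_lengths = demo['first'].apply(len)

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time
full_names = demo.apply(lambda row: row['first'] + ' ' + row['last'], axis=1)

# DataFrame.apply with the default axis=0: the function receives one column at a time;
# mapping len over each column gives an element-wise result
lengths = demo.apply(lambda col: col.map(len))

print(first_lengths.tolist(), full_names.tolist())
print(lengths)
```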

In [31]:
# lowercasing all emails using apply; notice that the function is passed without the trailing ()
df['email'] = df['email'].apply(str.lower) 
df

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358,28,Susan,Calvin
1,mdonovan@email.com,New Delhi,225896337,68,Mike,Donovan
2,gregp14@email.com,Moskow,895712365,66,Gregory,Powell
3,bentlyp@email.com,Kathmandu,96523995,42,Bently,Powell


In [32]:
df['email'].apply(len)

0    21
1    18
2    17
3    17
Name: email, dtype: int64

In [33]:
# can also be used with lambda functions
# https://realpython.com/python-lambda/
df['email'].apply(lambda x: x.upper())

0    SUSANCALVIN@EMAIL.COM
1       MDONOVAN@EMAIL.COM
2        GREGP14@EMAIL.COM
3        BENTLYP@EMAIL.COM
Name: email, dtype: object

In [34]:
df.applymap(len)

Unnamed: 0,email,city,phone,age,first,last
0,21,6,9,2,5,6
1,18,9,9,2,4,7
2,17,6,9,2,7,6
3,17,9,9,2,6,6


### Filtering and the "&, |, ~" operators


In [35]:
flt_name = df['first'] == 'Mike'
df.loc[flt_name]

Unnamed: 0,email,city,phone,age,first,last
1,mdonovan@email.com,New Delhi,225896337,68,Mike,Donovan


#### Pandas boolean operators:
* and: &
* or: |
* not: ~
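
One detail worth a sketch: each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==`, `<` or `>` in Python. The DataFrame below is hypothetical and uses numeric ages, unlike the string ages in the contact data.

```python
import pandas as pd

demo = pd.DataFrame({'first': ['Mike', 'Susan', 'Mike'], 'age': [68, 28, 30]})

is_mike = demo['first'] == 'Mike'

# parentheses around each comparison are required when combining masks
older_mikes = demo.loc[is_mike & (demo['age'] > 60)]
mike_or_young = demo.loc[is_mike | (demo['age'] < 29)]
not_mike = demo.loc[~is_mike]

print(older_mikes)
```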

In [36]:
# Not operator
df.loc[~flt_name]

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358,28,Susan,Calvin
2,gregp14@email.com,Moskow,895712365,66,Gregory,Powell
3,bentlyp@email.com,Kathmandu,96523995,42,Bently,Powell


In [37]:
flt = (df['first'] == 'Mike') & (df['city'] == 'Bangalore')
df.loc[flt]

Unnamed: 0,email,city,phone,age,first,last


In [38]:
flt = (df['first'] == 'Mike') | (df['age'] == '28')
df.loc[flt]

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358,28,Susan,Calvin
1,mdonovan@email.com,New Delhi,225896337,68,Mike,Donovan


In [39]:
# a boolean mask can be combined with a column selection in .loc
df.loc[flt, 'email']

0    susancalvin@email.com
1       mdonovan@email.com
Name: email, dtype: object

### Concatenating

In [40]:
contacts2 = {
            'first': ['Gladia', 'Cinda', 'Harla'],
            'last': ['Delmarre', 'Monay', 'Branno'],
            #'full_name': ['Gladia Delmarre', 'Cinda Monay', 'Harla Branno'],
            'city': ['London', 'London', 'London'],
            'phone': ['558697243', '' , '798866541'],
            'age' : ['35', '40', '35'],
            'email': ['GladiaDel@email.com', 'CMonay@email.com', 'AngolaMilan@email.com']
            }
df2 = pd.DataFrame(contacts2)
df2

Unnamed: 0,first,last,city,phone,age,email
0,Gladia,Delmarre,London,558697243.0,35,GladiaDel@email.com
1,Cinda,Monay,London,,40,CMonay@email.com
2,Harla,Branno,London,798866541.0,35,AngolaMilan@email.com


In [41]:
df = pd.concat([df, df2])
df

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
1,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
2,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
3,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell
0,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
1,CMonay@email.com,London,,40,Cinda,Monay
2,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
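
Notice that the index labels 0, 1 and 2 now appear twice, because concat keeps the original index of each DataFrame. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one option. A minimal sketch with hypothetical frames:

```python
import pandas as pd

a = pd.DataFrame({'first': ['Susan', 'Mike']})
b = pd.DataFrame({'first': ['Gladia']})

kept = pd.concat([a, b])                           # index labels: 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)  # index labels: 0, 1, 2

print(kept.index.tolist(), renumbered.index.tolist())
```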


### Sorting

In [42]:
df.sort_index()

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
0,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
1,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
1,CMonay@email.com,London,,40,Cinda,Monay
2,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
2,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
3,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell


In [43]:
# sorting by email, ascending order is the default
df.sort_values(['email'])

Unnamed: 0,email,city,phone,age,first,last
2,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
1,CMonay@email.com,London,,40,Cinda,Monay
0,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
3,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell
2,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
1,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
0,susancalvin@email.com,London,56152358.0,28,Susan,Calvin


In [44]:
# sorting by descending order
df.sort_values(['email'], ascending=False)

Unnamed: 0,email,city,phone,age,first,last
0,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
1,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
2,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
3,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell
0,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
1,CMonay@email.com,London,,40,Cinda,Monay
2,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno


In [45]:
# sort by several columns: 'last' ascending, then 'first' descending to break ties
df.sort_values(['last', 'first'], ascending=[True, False], inplace=True)
df

Unnamed: 0,email,city,phone,age,first,last
2,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
0,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
0,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
1,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
1,CMonay@email.com,London,,40,Cinda,Monay
2,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
3,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell


In [46]:
# resetting the index; drop=True discards the previous index instead of saving it as a column
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,email,city,phone,age,first,last
0,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
1,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
2,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
3,mdonovan@email.com,New Delhi,225896337.0,68,Mike,Donovan
4,CMonay@email.com,London,,40,Cinda,Monay
5,gregp14@email.com,Moskow,895712365.0,66,Gregory,Powell
6,bentlyp@email.com,Kathmandu,96523995.0,42,Bently,Powell


### Casting

In [47]:
# notice that age has dtype object because its values are strings
df.dtypes

email    object
city     object
phone    object
age      object
first    object
last     object
dtype: object

In [48]:
df['age'] = df['age'].astype(int)
df.dtypes

email    object
city     object
phone    object
age       int64
first    object
last     object
dtype: object
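
`astype(int)` works here because every age is a clean numeric string. When a column may contain values that cannot be parsed, `pd.to_numeric` with `errors='coerce'` is one alternative. The sketch below assumes a hypothetical column with one bad value.

```python
import pandas as pd

# hypothetical column with one value that is not a number
raw = pd.Series(['28', '42', 'unknown'])

# raw.astype(int) would raise a ValueError on 'unknown';
# errors='coerce' turns unparseable values into NaN instead
ages = pd.to_numeric(raw, errors='coerce')

print(ages)
```

The result is a float column, since NaN cannot be stored in a plain int64 column.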

In [49]:
# now apply can be used with numeric functions
df['age'].apply(lambda x:  x**2)

0    1225
1     784
2    1225
3    4624
4    1600
5    4356
6    1764
Name: age, dtype: int64

### groupby

In [50]:
# group rows by city; with a plain string key, get_group() also takes a plain string
city_grp = df.groupby('city')

In [51]:
city_grp.get_group('London')

Unnamed: 0,email,city,phone,age,first,last
0,AngolaMilan@email.com,London,798866541.0,35,Harla,Branno
1,susancalvin@email.com,London,56152358.0,28,Susan,Calvin
2,GladiaDel@email.com,London,558697243.0,35,Gladia,Delmarre
4,CMonay@email.com,London,,40,Cinda,Monay


In [52]:
city_grp['age'].value_counts()

city       age
Kathmandu  42     1
London     35     2
           28     1
           40     1
Moskow     66     1
New Delhi  68     1
Name: age, dtype: int64

In [53]:
city_grp['age'].mean().sort_values(ascending=False)

city
New Delhi    68.0
Moskow       66.0
Kathmandu    42.0
London       34.5
Name: age, dtype: float64

In [54]:
city_grp['age'].median().sort_values()

city
London       35
Kathmandu    42
Moskow       66
New Delhi    68
Name: age, dtype: int64

In [55]:
city_grp['age'].max()

city
Kathmandu    42
London       40
Moskow       66
New Delhi    68
Name: age, dtype: int64

In [56]:
city_grp['age'].min()

city
Kathmandu    42
London       28
Moskow       66
New Delhi    68
Name: age, dtype: int64

In [57]:
# agg applies more than one function to the data selected from the group
city_grp['age'].agg(['median', 'mean'])

city,median,mean
Kathmandu,42,42.0
London,35,34.5
Moskow,66,66.0
New Delhi,68,68.0
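
A related option is named aggregation, where keyword arguments to agg choose the output column names. This is a sketch on hypothetical data, and the names `youngest`, `oldest` and `average` are made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({'city': ['London', 'London', 'Moskow'], 'age': [28, 40, 66]})

# keyword = output column name, value = aggregation to apply
summary = demo.groupby('city')['age'].agg(youngest='min', oldest='max', average='mean')

print(summary)
```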


In [58]:
city_grp['age'].agg(['median', 'mean']).loc['London']

median    35.0
mean      34.5
Name: London, dtype: float64

In [59]:
# normalize=True returns the relative frequencies of the unique values instead of their counts
city_grp['age'].value_counts(normalize=True)

city       age
Kathmandu  42     1.00
London     35     0.50
           28     0.25
           40     0.25
Moskow     66     1.00
New Delhi  68     1.00
Name: age, dtype: float64

### Pivot tables
The default aggregation function is the mean.
<br>
[more details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)

In [60]:
pd.pivot_table(df, index='city', values='age')

city,age
Kathmandu,42.0
London,34.5
Moskow,66.0
New Delhi,68.0
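
The aggregation can be changed with `aggfunc`, and a `columns` argument spreads a second key across the columns. The sketch below uses a hypothetical DataFrame. The `status` column is an assumption for illustration and does not exist in the contact data.

```python
import pandas as pd

demo = pd.DataFrame({
    'city': ['London', 'London', 'Moskow'],
    'status': ['active', 'retired', 'retired'],  # hypothetical column
    'age': [28, 70, 66],
})

# maximum age per city instead of the default mean
oldest = pd.pivot_table(demo, index='city', values='age', aggfunc='max')

# one row per city, one column per status; missing combinations become NaN
by_status = pd.pivot_table(demo, index='city', columns='status', values='age', aggfunc='mean')

print(oldest)
print(by_status)
```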


### Other useful functions
* tolist() - returns the values of a Series as a Python list
* to_dict() - returns a Series or DataFrame as a Python dictionary
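
A short sketch of both conversions on a hypothetical DataFrame. The `orient='records'` form is included because a list with one dictionary per row is often the more convenient shape.

```python
import pandas as pd

demo = pd.DataFrame({'first': ['Susan', 'Mike'], 'age': [28, 68]})

ages = demo['age'].tolist()              # Series -> list
by_column = demo.to_dict()               # {column: {index label: value}}
by_row = demo.to_dict(orient='records')  # one dict per row

print(ages)
print(by_column)
print(by_row)
```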

In [61]:
city_grp['age'].agg(['median', 'mean']).to_dict()

{'median': {'Kathmandu': 42, 'London': 35, 'Moskow': 66, 'New Delhi': 68},
 'mean': {'Kathmandu': 42.0,
  'London': 34.5,
  'Moskow': 66.0,
  'New Delhi': 68.0}}

In [62]:
# string methods are available on a column through the .str accessor
df['email'] = df['email'].str.upper() 
df

Unnamed: 0,email,city,phone,age,first,last
0,ANGOLAMILAN@EMAIL.COM,London,798866541.0,35,Harla,Branno
1,SUSANCALVIN@EMAIL.COM,London,56152358.0,28,Susan,Calvin
2,GLADIADEL@EMAIL.COM,London,558697243.0,35,Gladia,Delmarre
3,MDONOVAN@EMAIL.COM,New Delhi,225896337.0,68,Mike,Donovan
4,CMONAY@EMAIL.COM,London,,40,Cinda,Monay
5,GREGP14@EMAIL.COM,Moskow,895712365.0,66,Gregory,Powell
6,BENTLYP@EMAIL.COM,Kathmandu,96523995.0,42,Bently,Powell


In [63]:
retirement = ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'No']
df['retired'] = retirement
df

Unnamed: 0,email,city,phone,age,first,last,retired
0,ANGOLAMILAN@EMAIL.COM,London,798866541.0,35,Harla,Branno,No
1,SUSANCALVIN@EMAIL.COM,London,56152358.0,28,Susan,Calvin,No
2,GLADIADEL@EMAIL.COM,London,558697243.0,35,Gladia,Delmarre,No
3,MDONOVAN@EMAIL.COM,New Delhi,225896337.0,68,Mike,Donovan,Yes
4,CMONAY@EMAIL.COM,London,,40,Cinda,Monay,No
5,GREGP14@EMAIL.COM,Moskow,895712365.0,66,Gregory,Powell,Yes
6,BENTLYP@EMAIL.COM,Kathmandu,96523995.0,42,Bently,Powell,No


In [64]:
# map replaces each value using a dictionary; values missing from the dictionary become NaN
df['retired'] = df['retired'].map({'Yes': True, 'No': False})
df

Unnamed: 0,email,city,phone,age,first,last,retired
0,ANGOLAMILAN@EMAIL.COM,London,798866541.0,35,Harla,Branno,False
1,SUSANCALVIN@EMAIL.COM,London,56152358.0,28,Susan,Calvin,False
2,GLADIADEL@EMAIL.COM,London,558697243.0,35,Gladia,Delmarre,False
3,MDONOVAN@EMAIL.COM,New Delhi,225896337.0,68,Mike,Donovan,True
4,CMONAY@EMAIL.COM,London,,40,Cinda,Monay,False
5,GREGP14@EMAIL.COM,Moskow,895712365.0,66,Gregory,Powell,True
6,BENTLYP@EMAIL.COM,Kathmandu,96523995.0,42,Bently,Powell,False


In [65]:
df.dtypes

email      object
city       object
phone      object
age         int64
first      object
last       object
retired      bool
dtype: object

In [66]:
# isna() and notna() return boolean masks marking missing and present values
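
A minimal sketch of both methods, using a hypothetical DataFrame with one missing phone number. Note that an empty string, like the `''` phone in `contacts2`, is not treated as missing. It would first have to be replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'first': ['Gladia', 'Cinda'], 'phone': ['558697243', np.nan]})

missing = demo['phone'].isna()                # True where the value is missing
with_phone = demo.loc[demo['phone'].notna()]  # keep only rows that have a phone
missing_per_column = demo.isna().sum()        # number of missing values per column

# an empty string is a regular value, not a missing one
empty_is_missing = pd.Series(['']).isna().tolist()

print(missing.tolist(), missing_per_column['phone'], empty_is_missing)
```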