# pandas

pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. The biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.

### Importing pandas

In [1]:
# Import pandas

import pandas as pd

## Introduction to pandas Data Structures

### Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its *index*.

In [2]:
# The simplest Series is formed from only an array of data
# Since we did not specify an index for the data, a default one consisting of the integers 0 through N-1 is created

obj = pd.Series([4,7,-3,1])
print(obj)

0    4
1    7
2   -3
3    1
dtype: int64


In [3]:
# You can get the array representation and index object of the Series via its values and index attributes, respectively

print(obj.values)
print(obj.index)

[ 4  7 -3  1]
RangeIndex(start=0, stop=4, step=1)


In [4]:
# Often it will be desirable to create a Series with an index identifying each data point with a label

obj2 = pd.Series([2,4,6,8], index=['b','a','c','d'])
print(obj2)

b    2
a    4
c    6
d    8
dtype: int64


In [5]:
# Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values

print(obj2['a'])
print(obj2[['a', 'c']])

4
a    4
c    6
dtype: int64


In [6]:
# Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, 
# or applying math functions, will preserve the index-value link

import numpy as np

print(obj2[obj2>2])
print(f"\n{obj2*10}")
print(f"\n{np.exp(obj2)}")

a    4
c    6
d    8
dtype: int64

b    20
a    40
c    60
d    80
dtype: int64

b       7.389056
a      54.598150
c     403.428793
d    2980.957987
dtype: float64


In [7]:
# Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.
# It can be used in many contexts where you might use a dict such as finding if an index exists (key in a dictionary)

print('b' in obj2)
print(4 in obj2)

True
False


In [8]:
# You can also pass in a dictionary to create a pandas Series. The dictionary keys become the Series index

cdata = {'Seoul':1000000, 'Busan': 700000, 'Daegu':200000, 'Bucheon':100000}
obj3 = pd.Series(cdata)
print(obj3)

Seoul      1000000
Busan       700000
Daegu       200000
Bucheon     100000
dtype: int64
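
When building a Series from a dict, you can also pass an explicit `index` to choose which keys appear and in what order. The sketch below uses a hypothetical `populations` dict (the "Incheon" label is made up for illustration); keys missing from the dict should come out as `NaN`, and keys not listed in `index` are left out.

```python
import pandas as pd

# Hypothetical data, used only for illustration
populations = {"Seoul": 1000000, "Busan": 700000, "Daegu": 200000}

# "Incheon" is not a key in the dict, so its value is NaN; "Daegu" is left out
ordered = pd.Series(populations, index=["Busan", "Seoul", "Incheon"])
print(ordered)
```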


In [9]:
# Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality

obj3.name = 'My Series'
obj3.index.name = 'Cities'

print(obj3.name,',', obj3.index.name)

My Series , Cities


In [10]:
# A Series's index can be altered in-place by assignment

obj3.index = ['Seoul', 'Busan', 'Daegu', 'Gyeongsan']
print(obj3.index)

Index(['Seoul', 'Busan', 'Daegu', 'Gyeongsan'], dtype='object')


### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index.

In [11]:
# There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists
# or NumPy arrays

# Dictionary of lists
data = {'city': ['Seoul', 'Bucheon', 'Gyeongsan', 'Daegu', 'Jecheon'], 
       'year': [1991, 1968, 1988, 1995, 2022], 
       'pop': [3.2, 1.0, 0.5, 2.0, 0.2]}
df = pd.DataFrame(data)
df

        city  year  pop
0      Seoul  1991  3.2
1    Bucheon  1968  1.0
2  Gyeongsan  1988  0.5
3      Daegu  1995  2.0
4    Jecheon  2022  0.2


In [12]:
# For large DataFrames, the head method selects only the first rows: five by default, or n if you pass a number (here 2)

df.head(2)

      city  year  pop
0    Seoul  1991  3.2
1  Bucheon  1968  1.0


In [13]:
# If you specify a sequence of columns, the DataFrame's columns will be arranged in that order

df = pd.DataFrame(data, columns=['pop', 'city', 'year'])
df

   pop       city  year
0  3.2      Seoul  1991
1  1.0    Bucheon  1968
2  0.5  Gyeongsan  1988
3  2.0      Daegu  1995
4  0.2    Jecheon  2022


In [14]:
# A column in a DataFrame can be retrieved as a Series by dict-like notation

df['city']

0        Seoul
1      Bucheon
2    Gyeongsan
3        Daegu
4      Jecheon
Name: city, dtype: object

In [15]:
# Rows can also be retrieved by label with the special loc attribute (much more on this later)
# With the default integer index, the label 0 is also the first row
# This returns a pandas Series with the column names as its index

df.loc[0]

pop       3.2
city    Seoul
year     1991
Name: 0, dtype: object

In [16]:
# Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values

df['debt'] = None
df['debt'] = 16.5
df

   pop       city  year  debt
0  3.2      Seoul  1991  16.5
1  1.0    Bucheon  1968  16.5
2  0.5  Gyeongsan  1988  16.5
3  2.0      Daegu  1995  16.5
4  0.2    Jecheon  2022  16.5


In [17]:
# When you assign a list or an array to a column, the value's length must match the length of the DataFrame

ls = [n for n in range(1, len(df)+1)]

df['debt'] = ls
df

   pop       city  year  debt
0  3.2      Seoul  1991     1
1  1.0    Bucheon  1968     2
2  0.5  Gyeongsan  1988     3
3  2.0      Daegu  1995     4
4  0.2    Jecheon  2022     5


In [18]:
# If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes

sr = pd.Series([1, 5, 3, 4], index=[0, 2, 3, 5])
df['debt'] = sr
df

   pop       city  year  debt
0  3.2      Seoul  1991   1.0
1  1.0    Bucheon  1968   NaN
2  0.5  Gyeongsan  1988   5.0
3  2.0      Daegu  1995   3.0
4  0.2    Jecheon  2022   NaN
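
The alignment rule is easier to see on a tiny example. The sketch below uses a hypothetical three-row `frame` and a `values` Series (both names are made up for illustration): values are matched by index label rather than by position, labels missing from the Series become `NaN`, and Series labels that do not exist in the DataFrame are silently discarded.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
frame = pd.DataFrame({"city": ["Seoul", "Busan", "Daegu"]})

# Labels 0 and 2 exist in the frame; label 7 does not
values = pd.Series([10, 20, 30], index=[0, 2, 7])

frame["score"] = values  # aligned on index labels, not on position

print(frame)
print(frame["score"].isna().tolist())
```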


In [19]:
# The del keyword will delete columns as with a dict

del df['debt']
df

   pop       city  year
0  3.2      Seoul  1991
1  1.0    Bucheon  1968
2  0.5  Gyeongsan  1988
3  2.0      Daegu  1995
4  0.2    Jecheon  2022


In [20]:
# Another common form of data is a nested dict of dicts
# If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices

pop = {'Seoul': {2001: 3.0, 2002: 4.0, 2003: 4.5}, 
      'Bucheon': {2001: 1.5, 2002: 1.7}}
df1 = pd.DataFrame(pop)
df1

      Seoul  Bucheon
2001    3.0      1.5
2002    4.0      1.7
2003    4.5      NaN


In [21]:
# You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array

df.T

          0        1          2      3        4
pop     3.2      1.0        0.5    2.0      0.2
city  Seoul  Bucheon  Gyeongsan  Daegu  Jecheon
year   1991     1968       1988   1995     2022


In [22]:
# If a DataFrame's index and columns have their name attributes set, these will also be displayed

df1.index.name = 'yr'
df1.columns.name = 'city'

print(df1.index)
print(df1.columns)


Index([2001, 2002, 2003], dtype='int64', name='yr')
Index(['Seoul', 'Bucheon'], dtype='object', name='city')


In [23]:
# As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray

df.values

array([[3.2, 'Seoul', 1991],
       [1.0, 'Bucheon', 1968],
       [0.5, 'Gyeongsan', 1988],
       [2.0, 'Daegu', 1995],
       [0.2, 'Jecheon', 2022]], dtype=object)

### Essential Functionality

#### Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [24]:
# Consider this example

obj = pd.Series([5,3,6,1,2], index=['e', 'd', 'c', 'b', 'a'])
print(obj)

e    5
d    3
c    6
b    1
a    2
dtype: int64


In [25]:
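# Assigning to the index attribute only relabels the existing values in place; the data keeps its order.
# To rearrange the data so that each value follows its label, use the reindex method instead.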

obj.index = ['a', 'b', 'c', 'd', 'e']
print(obj)

a    5
b    3
c    6
d    1
e    2
dtype: int64
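
For contrast with assigning to `.index`, here is a minimal sketch of `reindex` on the same values. The variable names `s`, `rearranged`, and `extended` are illustrative. `reindex` returns a new object in which every value stays attached to its original label, and labels that were not in the original index get `NaN`.

```python
import pandas as pd

s = pd.Series([5, 3, 6, 1, 2], index=["e", "d", "c", "b", "a"])

# reindex returns a new Series; each value stays attached to its label
rearranged = s.reindex(["a", "b", "c", "d", "e"])
print(rearranged.tolist())

# Labels absent from the original index are filled with NaN
extended = s.reindex(["a", "b", "z"])
print(extended)
```

The original `s` is left unchanged, which is the main practical difference from overwriting `s.index`.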


In [26]:
# For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing.
# The method option allows us to do this, using a method such as ffill, which forward-fills the values

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3 = obj3.reindex(range(6), method='ffill')
print(obj3)

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


In [27]:
# We can also reset the index to the default integer index. The drop parameter determines whether the current index
# is discarded (drop=True) or kept as a new column. The inplace parameter determines whether the change is applied
# to the existing Series

obj.reset_index(drop=True, inplace=True)
print(obj)

0    5
1    3
2    6
3    1
4    2
dtype: int64
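
The effect of `drop` is easiest to see side by side. In this sketch (the Series name "value" and the variable names are chosen only for illustration), leaving `drop` at its default keeps the old index as a column, which turns the result into a DataFrame, while `drop=True` discards it.

```python
import pandas as pd

s = pd.Series([5, 3, 6], index=["a", "b", "c"], name="value")

kept = s.reset_index()              # drop=False is the default
dropped = s.reset_index(drop=True)  # old labels are discarded

print(type(kept).__name__, list(kept.columns))
print(type(dropped).__name__, dropped.index.tolist())
```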


In [28]:
# With a DataFrame, assigning to the index attribute alters the (row) index.

print(df)
print()
df.index = [1,2,3,4,5]
print(df)

   pop       city  year
0  3.2      Seoul  1991
1  1.0    Bucheon  1968
2  0.5  Gyeongsan  1988
3  2.0      Daegu  1995
4  0.2    Jecheon  2022

   pop       city  year
1  3.2      Seoul  1991
2  1.0    Bucheon  1968
3  0.5  Gyeongsan  1988
4  2.0      Daegu  1995
5  0.2    Jecheon  2022


In [29]:
# You can set one of the columns as the index

df.set_index('city', inplace=True)
df

           pop  year
city                
Seoul      3.2  1991
Bucheon    1.0  1968
Gyeongsan  0.5  1988
Daegu      2.0  1995
Jecheon    0.2  2022


In [30]:
# You can rename columns with the .rename() method or by simply assigning a new list to df.columns

df.rename(columns={'pop':'Pop', 'year':'Year'}, inplace=True)
df.index.name = 'City'
print(df)

new_columns = ['Population', 'Year']
df.columns = new_columns
print(df)

           Pop  Year
City                
Seoul      3.2  1991
Bucheon    1.0  1968
Gyeongsan  0.5  1988
Daegu      2.0  1995
Jecheon    0.2  2022
           Population  Year
City                       
Seoul             3.2  1991
Bucheon           1.0  1968
Gyeongsan         0.5  1988
Daegu             2.0  1995
Jecheon           0.2  2022


#### Dropping Entries from an Axis

In [36]:
# The drop method will return a new object with the indicated value or values deleted from an axis

obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
print(obj)

new_obj = obj.drop("c")
print(new_obj)

new_new_obj = new_obj.drop(['a','b'])
print(new_new_obj)

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
d    3.0
e    4.0
dtype: float64


In [38]:
# drop also works with the default integer index; the remaining labels are not renumbered

obj = pd.Series([1,2,3,4,5])
new_obj = obj.drop(0)
print(new_obj)

1    2
2    3
3    4
4    5
dtype: int64


In [44]:
# With DataFrame, index values can be deleted from either axis.

data = {'city':['Seoul', 'Busan', 'Bucheon'], 'population':[10,5,1]}
obj = pd.DataFrame(data, index=[1,2,3])
print(obj)

# Calling drop with a sequence of labels will drop values from the row labels (axis 0)

new_obj = obj.drop(index=[2])
print(new_obj)

# To drop labels from the columns, instead use the columns keyword

new_new_obj = obj.drop(columns=['population'])
print(new_new_obj)

# You can also drop values from the columns by passing axis=1 or axis='columns'

new_new_new_obj = obj.drop('population', axis=1)
print(new_new_new_obj)

      city  population
1    Seoul          10
2    Busan           5
3  Bucheon           1
      city  population
1    Seoul          10
3  Bucheon           1
      city
1    Seoul
2    Busan
3  Bucheon
      city
1    Seoul
2    Busan
3  Bucheon


#### Indexing, Selection, and Filtering

In [53]:
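# Series indexing (obj[...]) works analogously to NumPy array indexing, 
# except you can use the Series’s index values instead of only integers.
# Note: positional access with [] on a label-indexed Series (data[0], data[[2,4]]) raises a
# FutureWarning in recent pandas 2.x releases; iloc is the safer choice for positions.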

data = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(data)

print(data['a'])
print(data[['c','e']])
print(data[0])
print(data[[2,4]])
print(data[3:])

a    1
b    2
c    3
d    4
e    5
dtype: int64
1
c    3
e    5
dtype: int64
1
c    3
e    5
dtype: int64
d    4
e    5
dtype: int64




In [57]:
# The preferred way to select index values is with the special loc operator

print(data.loc[['a','b']])
print(obj.loc[[1,2]])

a    1
b    2
dtype: int64
    city  population
1  Seoul          10
2  Busan           5


In [61]:
# The reason to prefer loc is because of the different treatment of integers 
# when indexing with []. Regular []-based indexing will treat integers as labels if the index 
# contains integers, so the behavior differs depending on the data type of the index.

d1 = pd.Series([1,2,3,4,5], index=[3,2,1,4,0])
d2 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])

print(d1[[0, 1, 2]])
print(d2[[0, 1, 2]])

# When using loc, the expression d2.loc[[0, 1, 2]] will fail when the index does not contain integers

try:
    d2.loc[[0,1,2]]
except KeyError as e:
    print(e)

0    5
1    3
2    2
dtype: int64
a    1
b    2
c    3
dtype: int64
"None of [Index([0, 1, 2], dtype='int32')] are in the [index]"




In [78]:
# Since the loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively
# with integer positions, so it works consistently whether or not the index contains integers

print(d1.iloc[[0, 1, 2]])
print(d2.iloc[[0,1,2]])

3    1
2    2
1    3
dtype: int64
a    1
b    2
c    3
dtype: int64
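
The difference matters most when the index itself holds integers. A minimal sketch, reusing the same shuffled integer index as `d1` (the names `s`, `by_label`, and `by_position` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=[3, 2, 1, 4, 0])

by_label = s.loc[0]      # the value whose label is 0
by_position = s.iloc[0]  # the value in the first position

print(by_label, by_position)
```

Here the same key `0` picks two different elements, which is why spelling out `loc` or `iloc` avoids ambiguity.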


In [65]:
# You can also slice with labels, but it works differently from normal Python slicing in that the endpoint is inclusive

print(data.loc["a":"c"])

# Assigning values using these methods modifies the corresponding section of the Series

data.loc["a":"c"] = 5
print(data)

a    1
b    2
c    3
dtype: int64
a    5
b    5
c    5
d    4
e    5
dtype: int64
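
To make the inclusive endpoint concrete, this sketch compares a label slice with the roughly equivalent positional slice on a fresh Series (variable names are illustrative). The label slice includes its stop label, whereas the `iloc` slice follows ordinary Python rules and excludes the stop position.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

label_slice = s.loc["a":"c"]   # includes the endpoint "c"
position_slice = s.iloc[0:2]   # excludes position 2, like a Python list

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```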


In [125]:
# Indexing into a DataFrame retrieves one or more columns either with a single value or sequence

obj = {'city':['Seoul', 'Bucheon', 'Sidney', 'Tianjin', 'Qingdao', 'Vancouver', 'Montreal', 'Toronto'],
      'age': [0, 7, 9, 9, 13, 14, 18, 27],
      'population': [10, 1, 3, 9, 5, 1, 1, 2]}
df = pd.DataFrame(obj)
print(df)

print(df[['city', 'age']])

        city  age  population
0      Seoul    0          10
1    Bucheon    7           1
2     Sidney    9           3
3    Tianjin    9           9
4    Qingdao   13           5
5  Vancouver   14           1
6   Montreal   18           1
7    Toronto   27           2
        city  age
0      Seoul    0
1    Bucheon    7
2     Sidney    9
3    Tianjin    9
4    Qingdao   13
5  Vancouver   14
6   Montreal   18
7    Toronto   27


In [126]:
# Indexing like this has a few special cases. The first is slicing or selecting data with a Boolean array

print(df[:2])
print()
print(df[df['age']>10])
print()

# A scalar comparison on a column produces a Boolean Series (a mask) with one entry per row.

print(df['age'] > 10)
print()

# Assigning through this mask with [] sets every column of each row where the mask is True to 10, like so:

df2 = df.copy()
df2[df2['age']>10] = 10
print(df2)

      city  age  population
0    Seoul    0          10
1  Bucheon    7           1

        city  age  population
4    Qingdao   13           5
5  Vancouver   14           1
6   Montreal   18           1
7    Toronto   27           2

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
Name: age, dtype: bool

      city  age  population
0    Seoul    0          10
1  Bucheon    7           1
2   Sidney    9           3
3  Tianjin    9           9
4       10   10          10
5       10   10          10
6       10   10          10
7       10   10          10
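
Note that the assignment above overwrites every column in the matching rows, including `city`. If the intent is to change only one column, one common approach is to pass both the mask and a column label to `loc`. The sketch below uses a small hypothetical `people` frame rather than the full `df`.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
people = pd.DataFrame({"city": ["Seoul", "Qingdao", "Toronto"],
                       "age": [0, 13, 27]})

mask = people["age"] > 10        # Boolean Series, one entry per row
capped = people.copy()
capped.loc[mask, "age"] = 10     # only the 'age' column is changed

print(capped)
```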


##### Selection on DataFrame with loc and iloc

In [127]:
# Like Series, DataFrame has special attributes loc and iloc for label-based and integer-based indexing, respectively.
# Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either
# axis-labels (loc) or integers (iloc).

df.set_index('city', inplace = True)
df['city color'] = ['blue', 'red', 'orange', 'yellow', 'purple', 'black', 'white', 'green']
print(df)

           age  population city color
city                                 
Seoul        0          10       blue
Bucheon      7           1        red
Sidney       9           3     orange
Tianjin      9           9     yellow
Qingdao     13           5     purple
Vancouver   14           1      black
Montreal    18           1      white
Toronto     27           2      green


In [133]:
# As a first example, let's select a single row by label

print(df.loc['Seoul'])
print()

# The result of selecting a single row is a Series with an index that contains the DataFrame's column labels.
# To select multiple rows, creating a new DataFrame, pass a sequence of labels

print(df.loc[['Seoul', 'Bucheon']])

age              0
population      10
city color    blue
Name: Seoul, dtype: object

         age  population city color
city                               
Seoul      0          10       blue
Bucheon    7           1        red


In [135]:
# You can combine both row and column selection in loc by separating the selections with a comma

print(df.loc[["Seoul", "Bucheon"], ['age', 'population']])

         age  population
city                    
Seoul      0          10
Bucheon    7           1


In [146]:
# We will then perform some similar selections with integers using iloc

print(df.iloc[1])
print()
print(df.iloc[[1,3,5]])
print()
print(df.iloc[2, [1,2]])
print()
print(df.iloc[[2,4], [0,2]])

age             7
population      1
city color    red
Name: Bucheon, dtype: object

           age  population city color
city                                 
Bucheon      7           1        red
Tianjin      9           9     yellow
Vancouver   14           1      black

population         3
city color    orange
Name: Sidney, dtype: object

         age city color
city                   
Sidney     9     orange
Qingdao   13     purple


In [None]:
# Both indexing functions work with slices in addition to single labels or lists of labels
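
A minimal sketch of slicing on both axes, using a small hypothetical `table` modelled on the frame above (the column name `color` and the variable names are illustrative). Label slices with `loc` include both endpoints, while integer slices with `iloc` exclude the stop position, so the two selections below should pick out the same block.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
table = pd.DataFrame(
    {"age": [0, 7, 9],
     "population": [10, 1, 3],
     "color": ["blue", "red", "orange"]},
    index=["Seoul", "Bucheon", "Sidney"],
)

# Label slices include both endpoints
by_label = table.loc["Seoul":"Bucheon", "age":"population"]

# Integer slices exclude the stop position
by_position = table.iloc[:2, :2]

print(by_label)
print(by_label.equals(by_position))
```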

