# 3.3) Intro to Pandas

In [1]:
import numpy as np

In [2]:
import pandas as pd

- Pandas provides two core data structures:
  - Series
  - DataFrame

### Pandas Series

- A Series is a one-dimensional (1D) labeled array that can hold data of any type.
- A NumPy 1D array is displayed as a row, whereas a Pandas Series is displayed as a column, with an index name beside each value.
- A Pandas Series can be created from a:
  - List
  - 1D array
  - Dictionary
- In Pandas, the labels that identify items are called index names rather than index numbers, because they need not be integers. Generally, we choose a column that holds unique items as the index. However, several items may share the same index name.

**1. Creating Pandas series using List**

In [2]:
l = [10,20,30]
type(l)

list

In [5]:
# converting list to numpy array
a = np.array(l)
print(a)    # the list items are displayed as a row

[10 20 30]


In [6]:
# converting list to pandas Series
s = pd.Series(l)
print(s)    # the list items are displayed as a column

0    10
1    20
2    30
dtype: int64


In [7]:
s+1  # it will add one to all items of series (just like NumPy array)

0    11
1    21
2    31
dtype: int64

- Here, 0, 1, and 2 are index names (labels), not just index numbers.
- We can choose our own index names using the index parameter of pd.Series(). If we do not provide any, the index numbers 0, 1, 2, ... are used as index names.

In [8]:
s = pd.Series(l, index=['a','b','c'])
print(s)

a    10
b    20
c    30
dtype: int64


- We can also give the same index name to multiple items. However, this is discouraged.

In [9]:
s = pd.Series(l, index=['a','a','c'])
print(s)

a    10
a    20
c    30
dtype: int64
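
One reason repeated index names are discouraged is that a lookup by name no longer returns a single value. A minimal sketch of this behaviour follows; the variable names `dup`, `unique_lookup`, and `dup_lookup` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Series with the index name 'a' used twice
dup = pd.Series([10, 20, 30], index=['a', 'a', 'c'])

# A unique index name returns a single value
unique_lookup = dup.loc['c']

# A repeated index name returns a Series holding every matching item
dup_lookup = dup.loc['a']

print(unique_lookup)         # 30
print(dup_lookup.tolist())   # [10, 20]
print(dup.index.is_unique)   # False
```

Checking `index.is_unique` before relying on name-based lookups is one simple way to catch this situation early.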


In [11]:
s = pd.Series(l, index=['25','b','c'])
print(s)

25    10
b     20
c     30
dtype: int64


**2. Creating Pandas series using 1D array**

In [10]:
a = np.array([10, 20, 30])
print(a)

[10 20 30]


In [12]:
# converting the 1D array to series
s = pd.Series(a)
print(s)

0    10
1    20
2    30
dtype: int64


In [13]:
# We can specify index names using the index parameter:
s = pd.Series(a, index=[1,5,'g'])
print(s)

1    10
5    20
g    30
dtype: int64


**3. Creating Pandas series using dictionary**

In [14]:
d = {'a': [10,20,40], 'b':20, 'c':30}
print(d)

{'a': [10, 20, 40], 'b': 20, 'c': 30}


In [15]:
# converting dictionary to pandas series
s = pd.Series(d)  # Here, key-names of dictionary will be converted to index names
print(s)

a    [10, 20, 40]
b              20
c              30
dtype: object


In [17]:
s = pd.Series(d, index=[1,2,'d'])
print(s)

1    NaN
2    NaN
d    NaN
dtype: object


- What happened here?
  - When we converted the dictionary d to a Pandas Series with index=[1, 2, 'd'], Pandas looked up each requested index name among the dictionary keys. Since none of the requested names (1, 2, and 'd') is a key in the dictionary, every value in the resulting Series is NaN.

In [18]:
s = pd.Series(d, index=['a','d','c'])
print(s)

a    [10, 20, 40]
d             NaN
c              30
dtype: object
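
When only some of the requested index names exist as dictionary keys, the missing entries can be detected and removed afterwards. The sketch below assumes a simpler dictionary with scalar values (`prices` is a hypothetical name) and uses `isna()` and `dropna()`.

```python
import pandas as pd

prices = {'a': 10, 'b': 20, 'c': 30}

# 'd' is not a key in the dictionary, so its value becomes NaN
s_partial = pd.Series(prices, index=['a', 'd', 'c'])

# Boolean mask marking the missing entries
missing = s_partial.isna()

# Drop the entries that have no value
s_clean = s_partial.dropna()

print(missing.tolist())         # [False, True, False]
print(s_clean.index.tolist())   # ['a', 'c']
print(s_partial.dtype)          # float64, because NaN is a float
```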


In [26]:
type(s)

pandas.core.series.Series

In [27]:
s.ndim

1

In [28]:
s.shape

(3,)

### Pandas DataFrame

- A Pandas DataFrame is a 2-dimensional data structure, like a 2 dimensional array, or a table with rows and columns.

**1. Creating a DataFrame using dictionary**

In [20]:
d = {'col1': [1, 2], 'col2': [3, 4]}
print(d)

{'col1': [1, 2], 'col2': [3, 4]}


In [23]:
# converting d to DataFrame
df = pd.DataFrame(d)
print(df)   # each dictionary value becomes a column, and its key becomes the column name

   col1  col2
0     1     3
1     2     4


In [24]:
d = {'col1': 1, 'col2': [3, 4]}
df = pd.DataFrame(d)
print(df)   # key-name is column-name

   col1  col2
0     1     3
1     1     4


In [25]:
d = {'col1': 0, 'col2': [3, 4]}
df = pd.DataFrame(d)
print(df)

   col1  col2
0     0     3
1     0     4
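
In the two cells above, the single value given for col1 is repeated for every row because the list in col2 fixes the number of rows. A minimal sketch of the case where every value is a scalar follows (variable names are illustrative): pandas then has no way to infer a row count unless an index is supplied.

```python
import pandas as pd

# At least one list tells pandas how many rows to create
mixed = pd.DataFrame({'col1': 0, 'col2': [3, 4]})

# With only scalars the number of rows is unknown -> ValueError
try:
    pd.DataFrame({'col1': 0, 'col2': 3})
    scalar_error = None
except ValueError as exc:
    scalar_error = exc

# Passing an index supplies the missing row count
with_index = pd.DataFrame({'col1': 0, 'col2': 3}, index=[0, 1])

print(mixed['col1'].tolist())        # [0, 0]
print(type(scalar_error).__name__)   # ValueError
print(with_index.shape)              # (2, 2)
```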


In [29]:
type(df)

pandas.core.frame.DataFrame

In [30]:
df.ndim

2

In [31]:
df.shape

(2, 2)

**2. Creating DataFrame using nested list**

In [32]:
l = [[1,2,3],[4,5,6]]
print(l)

[[1, 2, 3], [4, 5, 6]]


In [33]:
# converting nested list to 2D array
print(np.array(l))

[[1 2 3]
 [4 5 6]]


In [35]:
# converting nested list to DataFrame
df = pd.DataFrame(l)
print(df)

   0  1  2
0  1  2  3
1  4  5  6


- We can provide our own row and column names using the index and columns parameters:

In [37]:
df = pd.DataFrame(l, index=['a','b'], columns=['col1','col2','col3'])
print(df)

   col1  col2  col3
a     1     2     3
b     4     5     6


**3. Creating DataFrame using 2D array:**

In [4]:
arr_2d = np.array([[1,2,3],[4,5,6]])
df2 = pd.DataFrame(arr_2d, index=['a','b'], columns=['col1','col2','col3'])
print(df2)

   col1  col2  col3
a     1     2     3
b     4     5     6


- **Observation:**
  - When converting a dictionary to a DataFrame, each dictionary value is placed column-wise.
  - When converting a nested list or a 2D array to a DataFrame, each inner list (or each row of the 2D array) is placed row-wise.
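
The observation above can be checked directly. In this sketch (variable names are illustrative), the same data is passed once as a dictionary and once as a nested list; `DataFrame.from_dict(..., orient='index')` is one way to make dictionary values run row-wise instead.

```python
import pandas as pd

data = {'x': [1, 2, 3], 'y': [4, 5, 6]}

# Dictionary: each value becomes a column -> 3 rows, 2 columns
by_column = pd.DataFrame(data)

# Nested list: each inner list becomes a row -> 2 rows, 3 columns
by_row = pd.DataFrame(list(data.values()))

# orient='index' treats each dictionary key as a row name instead
dict_as_rows = pd.DataFrame.from_dict(data, orient='index')

print(by_column.shape)               # (3, 2)
print(by_row.shape)                  # (2, 3)
print(dict_as_rows.shape)            # (2, 3)
print(dict_as_rows.index.tolist())   # ['x', 'y']
```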

In [5]:
df2.ndim

2

In [6]:
df2.shape

(2, 3)

#### Some operations related to DataFrame

- **Adding a column in DataFrame**

In [9]:
d = {'Age': [22,30,23,25,24],
     'Salary':[17000,17000,46000,42000,55000],
     'Gender':['M','F','M','M','F']}

In [11]:
df = pd.DataFrame(d)
df

   Age  Salary Gender
0   22   17000      M
1   30   17000      F
2   23   46000      M
3   25   42000      M
4   24   55000      F


- Adding or replacing a column in a DataFrame is similar to adding or replacing a value in a dictionary.
- In a dictionary: dict['key'] = value
- In a DataFrame: df[col_name] = values

In [14]:
df['Gender'] = ['M', 'M', 'F', 'M', 'F']
df

   Age  Salary Gender
0   22   17000      M
1   30   17000      M
2   23   46000      F
3   25   42000      M
4   24   55000      F
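
Two details of column assignment are worth a quick check: a single value is repeated for every row, while a list must contain exactly one value per row. The sketch below uses a small hypothetical DataFrame (`people`) and a made-up `Active` column purely for illustration.

```python
import pandas as pd

people = pd.DataFrame({'Age': [22, 30, 23], 'Salary': [17000, 17000, 46000]})

# A scalar is repeated for every row
people['Active'] = True

# A list must have exactly one value per row
try:
    people['Gender'] = ['M', 'F']      # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = exc

print(people['Active'].tolist())       # [True, True, True]
print(type(length_error).__name__)     # ValueError
print('Gender' in people.columns)      # False
```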


**Say we want to add a column named 'S.No.' to df that runs from 15 to 19:**

In [16]:
df['S.No.'] = np.arange(15,20)
df

   Age  Salary Gender  S.No.
0   22   17000      M     15
1   30   17000      M     16
2   23   46000      F     17
3   25   42000      M     18
4   24   55000      F     19


**Setting the index**
- To use a column as the index names, we can call the set_index() method.
- df.set_index(column) returns a new DataFrame with the specified column set as the index; it does not modify the original DataFrame df. To update df itself, either reassign the result (df = df.set_index("S.No.")) or pass inplace=True: df.set_index("S.No.", inplace=True)
- Say we want to set the 'S.No.' column of the DataFrame above as the index:

In [20]:
df.set_index("S.No.", inplace=True)
df

       Age  Salary Gender
S.No.                    
15      22   17000      M
16      30   17000      M
17      23   46000      F
18      25   42000      M
19      24   55000      F
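
The difference between the returned DataFrame and the original can be seen in a short sketch. It assumes a small hypothetical DataFrame named `staff`, and also shows `reset_index()`, which moves the index back into an ordinary column.

```python
import pandas as pd

staff = pd.DataFrame({'Age': [22, 30], 'S.No.': [15, 16]})

# set_index() returns a new DataFrame; 'staff' itself is unchanged
indexed = staff.set_index('S.No.')

# reset_index() moves the index back into an ordinary column
restored = indexed.reset_index()

print(staff.columns.tolist())      # ['Age', 'S.No.']
print(indexed.index.tolist())      # [15, 16]
print(indexed.columns.tolist())    # ['Age']
print(restored.columns.tolist())   # ['S.No.', 'Age']
```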


- **Getting value based on index name (i.e. row name)**
  - We use the loc[] property
  - Syntax: data_frame.loc[row_name, column_name]

In [23]:
df.loc[15]    # for this we use loc[] property

Age          22
Salary    17000
Gender        M
Name: 15, dtype: object

- **Getting value based on index number**
  - We use the iloc[] property

In [33]:
df.iloc[-2]    # for this we use iloc[] property

Age          25
Salary    42000
Gender        M
Name: 18, dtype: object

- **Dropping a row from a DataFrame**
  - We use the df.drop() method. It drops the row with the given index name and returns a new DataFrame. To drop in place instead, we can specify inplace=True (the default is False).

In [25]:
df.drop(index=19, inplace=True)
df

       Age  Salary Gender
S.No.                    
15      22   17000      M
16      30   17000      M
17      23   46000      F
18      25   42000      M
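
A brief sketch of how `drop()` behaves without `inplace=True`, and of what happens when the requested index name does not exist. The DataFrame `staff` is hypothetical; `errors='ignore'` is an existing `drop()` option that skips missing names.

```python
import pandas as pd

staff = pd.DataFrame({'Age': [22, 30, 23]}, index=[15, 16, 17])

# Without inplace=True the original DataFrame keeps all of its rows
smaller = staff.drop(index=17)

# Dropping an index name that does not exist raises KeyError by default
try:
    staff.drop(index=99)
    drop_error = None
except KeyError as exc:
    drop_error = exc

# errors='ignore' silently skips names that are not present
unchanged = staff.drop(index=99, errors='ignore')

print(staff.shape)                 # (3, 1)
print(smaller.shape)               # (2, 1)
print(type(drop_error).__name__)   # KeyError
print(unchanged.shape)             # (3, 1)
```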


**Example**

In [26]:
d = {'Age': [22,30,23,25,24],
     'Salary':[17000,17000,46000,42000,55000],
     'Gender':['M','F','M','M','F']}

df = pd.DataFrame(d)
df

   Age  Salary Gender
0   22   17000      M
1   30   17000      F
2   23   46000      M
3   25   42000      M
4   24   55000      F


In [27]:
df.size

15

In [28]:
df.ndim

2

In [29]:
df.shape

(5, 3)

In [31]:
# adding a column named 's.no'
df['s.no'] = range(10, 10+df.shape[0])
df

   Age  Salary Gender  s.no
0   22   17000      M    10
1   30   17000      F    11
2   23   46000      M    12
3   25   42000      M    13
4   24   55000      F    14


In [34]:
# setting 's.no' column as index name
df.set_index('s.no', inplace=True)
df

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [35]:
# extracting a row from DataFrame using index name (i.e. row name)
df.loc[12]

Age          23
Salary    46000
Gender        M
Name: 12, dtype: object

In [39]:
# getting multiple rows data using index name:
df.loc[[10, 13]]

      Age  Salary Gender
s.no                    
10     22   17000      M
13     25   42000      M


In [37]:
# extracting a single column from the DataFrame using the column name
df.loc[:,'Salary']     # since we use ':' in place of the row name, all rows are selected

s.no
10    17000
11    17000
12    46000
13    42000
14    55000
Name: Salary, dtype: int64

In [40]:
# extracting multiple columns from the DataFrame using the column names
df.loc[:,['Salary','Age']]

      Salary  Age
s.no             
10     17000   22
11     17000   30
12     46000   23
13     42000   25
14     55000   24


In [44]:
# extracting a single value using the row and column names
print(df.loc[10,'Salary'])
df.loc[10,'Salary']

17000


np.int64(17000)

In [48]:
# extracting multiple rows and columns data using row and column names
print(df.loc[[10,13],['Salary','Age']])

      Salary  Age
s.no             
10     17000   22
13     42000   25


In [49]:
# getting all rows and all columns data using loc[]:
df.loc[:,:]

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [50]:
# getting a particular row using index number
df.iloc[-2]

Age          25
Salary    42000
Gender        M
Name: 13, dtype: object

In [51]:
print(df.iloc[-2])

Age          25
Salary    42000
Gender        M
Name: 13, dtype: object


In [52]:
# getting a particular column data using column index number
df.iloc[:,-1]

s.no
10    M
11    F
12    M
13    M
14    F
Name: Gender, dtype: object

In [55]:
# getting a particular row and columns data
df.iloc[1,-2]

np.int64(17000)

In [56]:
# getting multiple rows data using index number
df.iloc[[-1, 0, -2]]

      Age  Salary Gender
s.no                    
14     24   55000      F
10     22   17000      M
13     25   42000      M


In [57]:
# getting all rows and columns data using iloc[]
df.iloc[:]

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


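- **Note:** loc[] expects index names and iloc[] expects index numbers (positions). If we pass an index name that does not exist to loc[], we get a KeyError; if we pass an out-of-range index number to iloc[], we get an IndexError.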
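
A minimal sketch of what happens when an index number is passed to loc[] or an index name is passed to iloc[]. It assumes a small hypothetical DataFrame (`staff`) whose index names are the integers 10 to 12, similar to the example above.

```python
import pandas as pd

staff = pd.DataFrame({'Age': [22, 30, 23]}, index=[10, 11, 12])

# loc[] looks for the index name 0, which does not exist -> KeyError
try:
    staff.loc[0]
    loc_error = None
except KeyError as exc:
    loc_error = exc

# iloc[] looks for position 10, but there are only 3 rows -> IndexError
try:
    staff.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = exc

print(type(loc_error).__name__)                 # KeyError
print(type(iloc_error).__name__)                # IndexError
print(staff.loc[10, 'Age'], staff.iloc[0, 0])   # 22 22
```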

In [58]:
df

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [59]:
# dropping a row
df1 = df.drop(index=11)
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [61]:
# dropping multiple rows
print(df)
df1 = df.drop(index=np.array([10,12]))   # or we can simply do df.drop(index = [10,12])
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


      Age  Salary Gender
s.no                    
11     30   17000      F
13     25   42000      M
14     24   55000      F


In [62]:
# dropping multiple rows
print(df)
df1 = df.drop(index = [10,12])
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


      Age  Salary Gender
s.no                    
11     30   17000      F
13     25   42000      M
14     24   55000      F


In [64]:
# dropping a column
print(df)
df1 = df.drop(columns='Age')
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


      Salary Gender
s.no               
10     17000      M
11     17000      F
12     46000      M
13     42000      M
14     55000      F


In [65]:
# dropping multiple columns
print(df)
df1 = df.drop(columns=['Age', 'Gender'])   # we can even use np.array(['Age', 'Gender'])
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


      Salary
s.no        
10     17000
11     17000
12     46000
13     42000
14     55000


In [66]:
# deleting multiple rows and columns
print(df)
df1 = df.drop(index = [10, 12], columns=['Age','Gender'])
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


      Salary
s.no        
11     17000
13     42000
14     55000


In [69]:
# dropping all rows and columns
print(df)
df1 = df.drop(index=range(10,15), columns=['Age','Gender','Salary'])
df1

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [70]:
df

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F


In [72]:
# for extracting all columns we can even do:
df.loc[:,'Age':'Gender']   # slicing of columns from starting to end (Here, end is also included)

      Age  Salary Gender
s.no                    
10     22   17000      M
11     30   17000      F
12     23   46000      M
13     25   42000      M
14     24   55000      F
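
Because a slice of names in loc[] includes its end while a slice of index numbers in iloc[] does not, the two can give different results for what looks like the same range. A small sketch, using a hypothetical DataFrame `staff`:

```python
import pandas as pd

staff = pd.DataFrame(
    {'Age': [22, 30], 'Salary': [17000, 17000], 'Gender': ['M', 'F']},
    index=[10, 11],
)

# Name slice: both 'Age' and 'Salary' are included
by_label = staff.loc[:, 'Age':'Salary']

# Position slice: position 1 is excluded, so only 'Age' is returned
by_position = staff.iloc[:, 0:1]

print(by_label.columns.tolist())      # ['Age', 'Salary']
print(by_position.columns.tolist())   # ['Age']
```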


In [73]:
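# similarly, for dropping all columns we might try the following, but a slice of names is only valid inside loc[], not inside a list, so it fails: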
df1 = df.drop(columns=['Age':'Gender'])
df1

SyntaxError: invalid syntax (725808090.py, line 2)
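
One possible way to drop a range of columns is to select the range with loc[] first and pass the resulting column names to drop(). This is a sketch on a hypothetical DataFrame `staff`, not the only way to do it.

```python
import pandas as pd

staff = pd.DataFrame(
    {'Age': [22, 30], 'Salary': [17000, 17000], 'Gender': ['M', 'F']},
    index=[10, 11],
)

# Select the column range with loc[] first, then pass its names to drop()
to_drop = staff.loc[:, 'Age':'Gender'].columns
no_columns = staff.drop(columns=to_drop)

print(to_drop.tolist())    # ['Age', 'Salary', 'Gender']
print(no_columns.shape)    # (2, 0)
```

The result keeps its two rows but has no columns left, since the slice covered every column.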

- **Some Attributes:**

In [7]:
df2.size   # returns the total number of elements (rows * columns)

6

In [8]:
df2.columns   # returns column names

Index(['col1', 'col2', 'col3'], dtype='object')
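
The attributes used in this section are related to each other, as the following sketch shows. It rebuilds a DataFrame like df2 under the hypothetical name `table` so that the block runs on its own.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                     index=['a', 'b'], columns=['col1', 'col2', 'col3'])

# shape is (number of rows, number of columns)
rows, cols = table.shape

print(table.size == rows * cols)   # True
print(table.index.tolist())        # ['a', 'b']
print(table.columns.tolist())      # ['col1', 'col2', 'col3']
print(len(table))                  # 2 (number of rows)
```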