# Pandas-3: Basic Operations on DataFrame


### Table of Contents

* Recap - Pandas Series and DataFrame Introduction
* DataTypes in Pandas
* Index, Columns and Labels in a DataFrame
* Selecting and dropping columns - drop()
* Selecting rows - loc[] and iloc[]
* Subsetting DataFrames based on conditions
* Adding rows and columns to a DataFrame
* Renaming the Index and Columns of a DataFrame

In [2]:
#import
import numpy as np
import pandas as pd

In [56]:
print(pd.__version__)

2.3.1


### Pandas Series

- A **one-dimensional labeled array** in pandas.  
- Can store data of any type (integers, floats, strings, objects).  
- Each element has an associated **index label**.  
- Similar to a **column in Excel** or a **1D NumPy array** with labels.  

In [218]:
# Creating a Series
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(s)

a    10
b    20
c    30
d    40
dtype: int64
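
Because every element carries a label, a Series can be read either by label or by position. The following is a minimal sketch that reuses the values above; the name `s_demo` is only chosen for this illustration.

```python
import pandas as pd

# Same values and labels as the Series above
s_demo = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s_demo.loc["c"]      # label-based lookup
by_position = s_demo.iloc[2]    # position-based lookup (0-indexed)
doubled = s_demo * 2            # arithmetic is element-wise and keeps the labels

print(by_label, by_position)
print(doubled)
```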


---

### Pandas DataFrame

- A **two-dimensional labeled data structure** in pandas.  
- Data is arranged in **rows and columns** (like a table in SQL or Excel).  
- Each column is a **Pandas Series**.  
- Columns can store different data types (int, float, string, etc.).  


#### Creating a DataFrame

In [240]:
# Creating a DataFrame
data = {
    'UId': [101, 201, 401],
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}

df = pd.DataFrame(data)
print(df)

   UId   Name  Age    City
0  101   Amit   25  Mumbai
1  201  Anand   28   Delhi
2  401  Akash   30   Patna


In [241]:
# show at most the first 5 rows of the DataFrame
df.head()

   UId   Name  Age    City
0  101   Amit   25  Mumbai
1  201  Anand   28   Delhi
2  401  Akash   30   Patna


In [242]:
# shape of dataframe
df.shape

(3, 4)

In [243]:
# number of rows
len(df)

3

#### Pandas DataTypes

In [244]:
# dataframe datatypes
df.dtypes

UId      int64
Name    object
Age      int64
City    object
dtype: object

In [248]:
# downcast Age from int64 to int16 to reduce memory use
df['Age'] = df['Age'].astype('int16')
df.dtypes

UId      int64
Name    object
Age      int16
City    object
dtype: object
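
To see what the downcast saves, the sketch below compares the bytes used by the same three ages stored as `int64` and as `int16`, and checks that the values fit into the smaller type first. The variable names are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 28, 30], dtype="int64")
ages_small = ages.astype("int16")

# Bytes used by the values only (the index is excluded)
bytes_int64 = ages.memory_usage(index=False)
bytes_int16 = ages_small.memory_usage(index=False)
print(bytes_int64, bytes_int16)

# int16 can only hold values in this range, so check before downcasting
limits = np.iinfo(np.int16)
fits = ages.between(limits.min, limits.max).all()
print(limits.min, limits.max, fits)
```

Values outside the `int16` range would not survive the cast intact, which is why the range check comes before `astype`.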

**Note**
- By default (as in the pandas 2.3.1 used here), a column that contains Python strings gets the `object` dtype.
- `object` is a generic Python object type that can hold anything: strings, lists, or even mixed types.
- Since pandas 1.0 there is a dedicated `string` dtype, which you can request with `astype('string')`.

In [250]:
df['City'] = df['City'].astype('string')
df.dtypes

UId              int64
Name            object
Age              int16
City    string[python]
dtype: object
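
One practical difference of the `string` dtype is how it treats missing values. The sketch below assumes pandas 2.x behaviour and uses a made-up Series with one missing entry.

```python
import pandas as pd

# A hypothetical city column with one missing entry
cities = pd.Series(["Mumbai", None, "Patna"], dtype="string")

missing = cities.isna().tolist()   # which entries are missing
upper = cities.str.upper()         # string methods leave missing entries missing

print(missing)
print(upper)
```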

In [251]:
# basic dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   UId     3 non-null      int64 
 1   Name    3 non-null      object
 2   Age     3 non-null      int16 
 3   City    3 non-null      string
dtypes: int16(1), int64(1), object(1), string(1)
memory usage: 210.0+ bytes


In [252]:
df

   UId   Name  Age    City
0  101   Amit   25  Mumbai
1  201  Anand   28   Delhi
2  401  Akash   30   Patna


#### Index, Columns and Labels in dataframe

In [253]:
# Columns Attribute/property
df.columns

Index(['UId', 'Name', 'Age', 'City'], dtype='object')

In [255]:
print(df.index)

RangeIndex(start=0, stop=3, step=1)


In a DataFrame:
- **Index** holds the row labels of the DataFrame. By default, pandas assigns `0, 1, 2, ...` as the index.
- **Columns** are the labels for each vertical series in the DataFrame.
- **Labels** are the identifiers for both rows (index) and columns.

| Component | Description                  | Example                     |
| --------- | ---------------------------- | --------------------------- |
| Index     | Row labels                   | `0, 1, 2` or custom labels  |
| Columns   | Column labels                | `"Name"`, `"Age"`, `"City"` |
| Labels    | Identifiers for rows/columns | Used in `loc`               |


In [256]:
from numpy.random import randint as ri
matrix_data = ri(1,100,20).reshape(5,4)
matrix_data

array([[29,  9, 61, 83],
       [75, 32, 76, 18],
       [71, 35, 76, 43],
       [51, 87, 28, 31],
       [40, 81, 12, 47]], dtype=int32)
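
The matrix above comes from an unseeded random generator, so its values change on every run. If reproducible numbers are needed, one option is NumPy's `default_rng` with a fixed seed. This is a sketch: the seed `42` and the variable names are arbitrary choices, and the numbers it produces will differ from the matrix shown above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)                 # the seed value is arbitrary
matrix_seeded = rng.integers(1, 100, size=(5, 4))    # integers in [1, 100)

# Re-creating the generator with the same seed reproduces the same matrix
matrix_again = np.random.default_rng(seed=42).integers(1, 100, size=(5, 4))

print(matrix_seeded.shape)
print((matrix_seeded == matrix_again).all())
```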

In [257]:
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
df

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [258]:
df.shape

(5, 4)

In [259]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, A to E
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   W       5 non-null      int32
 1   X       5 non-null      int32
 2   Y       5 non-null      int32
 3   Z       5 non-null      int32
dtypes: int32(4)
memory usage: 120.0+ bytes


In [260]:
df.describe()

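               W         X          Y          Z
count   5.000000   5.00000   5.000000   5.000000
mean   53.200000  48.80000  50.600000  44.400000
std    19.728152  33.73722  29.151329  24.368012
min    29.000000   9.00000  12.000000  18.000000
25%    40.000000  32.00000  28.000000  31.000000
50%    51.000000  35.00000  61.000000  43.000000
75%    71.000000  81.00000  76.000000  47.000000
max    75.000000  87.00000  76.000000  83.000000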


* The `df.describe()` method generates descriptive statistics for the DataFrame's columns. By default it summarizes only the numeric columns (count, mean, std, min, quartiles, max); passing `include='all'` also covers non-numeric columns such as strings or categoricals.
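
As a small illustration of the non-numeric case, the sketch below calls `describe` on a made-up frame with one numeric and one text column. The frame `people` and its values are assumptions for this example only.

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [25, 28, 30],
    "City": ["Mumbai", "Delhi", "Mumbai"],
})

numeric_only = people.describe()              # default: numeric columns only
everything = people.describe(include="all")   # numeric and non-numeric columns

print(numeric_only.columns.tolist())
print(everything.loc[["count", "unique", "top", "freq"], "City"])
```

For text columns the summary reports `unique`, `top` (the most frequent value) and `freq` instead of mean and quartiles.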

In [261]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [262]:
df.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

#### Select a Column

In [265]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [268]:
# Access Column X
df['X']

A     9
B    32
C    35
D    87
E    81
Name: X, dtype: int32

In [269]:
type(df['X'])

pandas.core.series.Series

In [270]:
# Access columns X and Z - df[[col_1, col_2]]
df[['X','Z']]

    X   Z
A   9  83
B  32  18
C  35  43
D  87  31
E  81  47


In [271]:
type(df[['X','Z']])

pandas.core.frame.DataFrame

#### Drop a Row or Column
**df.drop()**

In [272]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [273]:
# drop row 'A'
df.drop('A') # default axis=0 (rows); returns a new DataFrame, df changes only with inplace=True

    W   X   Y   Z
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [275]:
# drop column X
df.drop('X',axis=1) # axis=1 (columns); returns a new DataFrame, df changes only with inplace=True

    W   Y   Z
A  29  61  83
B  75  76  18
C  71  76  43
D  51  28  31
E  40  12  47


In [276]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47
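
As the last output shows, `df` still has row `A` and column `X`: `drop` returned new DataFrames and left the original alone. A minimal sketch of keeping the result, using a shortened copy of the data (`df_demo` and the non-existent label `"Q"` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[29, 9, 61, 83], [75, 32, 76, 18], [71, 35, 76, 43]],
    index=["A", "B", "C"],
    columns=["W", "X", "Y", "Z"],
)

# Keep the result by assigning it; df_demo itself is left untouched
without_x = df_demo.drop(columns="X")           # same as drop("X", axis=1)
without_a_b = df_demo.drop(index=["A", "B"])    # same as drop(["A", "B"], axis=0)

# errors="ignore" skips labels that do not exist instead of raising KeyError
safe = df_demo.drop(columns=["X", "Q"], errors="ignore")

print(without_x.columns.tolist(), without_a_b.index.tolist())
```

The `columns=` and `index=` keywords say the same thing as `axis=1` and `axis=0`, but make the intent visible at the call site.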


#### Selecting/indexing Rows
* Label-based selection: the `loc` indexer
* Position-based (integer) selection: the `iloc` indexer

In [277]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [278]:
# selecting the row with label 'C' (loc) or at position 2 (iloc) => Subsetting
print(df.loc['C'])
print("="*100)
print(df.iloc[2])

W    71
X    35
Y    76
Z    43
Name: C, dtype: int32
W    71
X    35
Y    76
Z    43
Name: C, dtype: int32


In [279]:
# Selecting multiple rows with label - 'A' and 'C' => Subsetting
print(df.loc[['A','C']])
print("="*100)
print(df.iloc[[0,2]])

    W   X   Y   Z
A  29   9  61  83
C  71  35  76  43
    W   X   Y   Z
A  29   9  61  83
C  71  35  76  43


In [281]:
# Slicing multiple rows - first 3 rows; loc includes the end label, iloc excludes the stop position
print(df.loc['A':'C'])
print("="*100)
print(df.iloc[0:3])

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
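
Both slices above return three rows, but they reach the end of the range differently: `loc['A':'C']` includes the label `'C'`, while `iloc[0:3]` stops before position 3. The sketch below makes the difference visible by using the same numbers on both sides of the colon (`df_demo` is an illustrative frame):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"W": [29, 75, 71, 51, 40], "X": [9, 32, 35, 87, 81]},
    index=["A", "B", "C", "D", "E"],
)

by_label = df_demo.loc["A":"C"]     # includes the end label "C": rows A, B, C
by_position = df_demo.iloc[0:2]     # excludes position 2: rows A, B

print(by_label.index.tolist())
print(by_position.index.tolist())
```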


#### Subsetting DataFrame
Subsetting refers to filtering a DataFrame down to a subset of its rows, columns, or both.

In [282]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


In [283]:
# Element at row 'B' and column 'Y' is
print(df.loc['B','Y'])
print(df.iloc[1,2])

76
76


In [284]:
# Subset comprising rows B and D, and columns W and Y
print(df.loc[ ['B','D'], ['W','Y'] ])
print(df.iloc[ [1,3], [0,2] ])

    W   Y
B  75  76
D  51  28
    W   Y
B  75  76
D  51  28
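
Row and column selectors can also be mixed: a slice on one axis and a list on the other, or `:` to take everything along an axis. For single cells, `at` and `iat` are the scalar-only counterparts of `loc` and `iloc`. A short sketch on a four-row copy of the data (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[29, 9, 61, 83], [75, 32, 76, 18], [71, 35, 76, 43], [51, 87, 28, 31]],
    index=["A", "B", "C", "D"],
    columns=["W", "X", "Y", "Z"],
)

# Row slice by label combined with a list of columns
block = df_demo.loc["B":"D", ["W", "Y"]]

# ":" selects everything along that axis: all rows, first two columns by position
first_two_cols = df_demo.iloc[:, :2]

# at / iat read a single cell by label / by position
cell_by_label = df_demo.at["B", "Y"]
cell_by_position = df_demo.iat[1, 2]

print(block)
print(cell_by_label, cell_by_position)
```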


In [285]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


* Subsetting a DataFrame based on a condition applied to every cell

In [286]:
# Subsetting dataFrame based on some condition - value > 40
booldf = df>40
booldf

       W      X      Y      Z
A  False  False   True   True
B   True  False   True  False
C   True  False   True   True
D   True   True  False  False
E  False   True  False   True


In [287]:
df[booldf]

      W     X     Y     Z
A   NaN   NaN  61.0  83.0
B  75.0   NaN  76.0   NaN
C  71.0   NaN  76.0  43.0
D  51.0  87.0   NaN   NaN
E   NaN  81.0   NaN  47.0


In [288]:
# conditional subsetting - one-liner
df[df>40]

      W     X     Y     Z
A   NaN   NaN  61.0  83.0
B  75.0   NaN  76.0   NaN
C  71.0   NaN  76.0  43.0
D  51.0  87.0   NaN   NaN
E   NaN  81.0   NaN  47.0
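
Masking with a boolean DataFrame keeps the shape and marks every failing cell as missing (`NaN`), which is also why the surviving values are shown as floats. If a different placeholder is wanted, or only the fully passing rows, `where` and `dropna` are two options. A sketch on two columns of the data (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"W": [29, 75, 71], "Y": [61, 76, 76]},
    index=["A", "B", "C"],
)

masked = df_demo[df_demo > 40]             # failing cells become NaN
filled = df_demo.where(df_demo > 40, 0)    # failing cells become 0 instead
complete_rows = masked.dropna()            # keep only rows where every cell passed

print(masked)
print(filled)
print(complete_rows.index.tolist())
```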


In [289]:
df.head()

    W   X   Y   Z
A  29   9  61  83
B  75  32  76  18
C  71  35  76  43
D  51  87  28  31
E  40  81  12  47


* Subsetting a DataFrame based on a condition on a column, keeping only the matching rows

In [292]:
# creating a dataframe from a plain 2-D array (NumPy discourages np.matrix for new code)
matrix_data = np.array([[22, 160, 70], [42, 170, 80], [30, 175, 90], [35, 160, 70], [25, 140, 70]])
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight'] # Height in cm, Weight in kg

In [295]:
matrix_data

array([[ 22, 160,  70],
       [ 42, 170,  80],
       [ 30, 175,  90],
       [ 35, 160,  70],
       [ 25, 140,  70]])

In [296]:
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nA new DataFrame\n",'-'*25, sep='')
print(df)


A new DataFrame
-------------------------
   Age  Height  Weight
A   22     160      70
B   42     170      80
C   30     175      90
D   35     160      70
E   25     140      70


In [297]:
print("\nRows with Height > 150 cm\n",'-'*35, sep='')
df[df['Height']>150]


Rows with Height > 150 cm
-----------------------------------


   Age  Height  Weight
A   22     160      70
B   42     170      80
C   30     175      90
D   35     160      70


In [298]:
print("\nRows with Height > 150 cm and Weight >70 kg\n",'-'*55, sep='')
df[(df['Height']>150) & (df['Weight']>70)]


Rows with Height > 150 cm and Weight >70 kg
-------------------------------------------------------


   Age  Height  Weight
B   42     170      80
C   30     175      90
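
Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below rebuilds the same data under the illustrative name `people` and also shows `query`, which expresses a filter as a string.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 42, 30, 35, 25],
     "Height": [160, 170, 175, 160, 140],
     "Weight": [70, 80, 90, 70, 70]},
    index=["A", "B", "C", "D", "E"],
)

# OR: taller than 170 cm, or younger than 25
tall_or_young = people[(people["Height"] > 170) | (people["Age"] < 25)]

# NOT: everyone whose weight is not 70 kg
not_70 = people[~(people["Weight"] == 70)]

# query() expresses the AND filter from the cell above as a string
heavy_and_tall = people.query("Height > 150 and Weight > 70")

print(tall_or_young.index.tolist())
print(not_70.index.tolist())
print(heavy_and_tall.index.tolist())
```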


#### Resetting the Index of your DataFrame

In [299]:
df

   Age  Height  Weight
A   22     160      70
B   42     170      80
C   30     175      90
D   35     160      70
E   25     140      70


In [300]:
# current index labels of df
df.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [301]:
df.reset_index() # returns a new DataFrame; use inplace=True to modify df itself

  index  Age  Height  Weight
0     A   22     160      70
1     B   42     170      80
2     C   30     175      90
3     D   35     160      70
4     E   25     140      70


In [302]:
df.reset_index(drop=True)

   Age  Height  Weight
0   22     160      70
1   42     170      80
2   30     175      90
3   35     160      70
4   25     140      70
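
By default the old labels land in a column called `index`. One way to get a more meaningful header is to name the index before resetting it. A minimal sketch; the name `"person"` and the frame `people` are hypothetical choices for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 42], "Height": [160, 170]},
    index=["A", "B"],
)

# reset_index returns a new DataFrame; assign it to keep the result
flat = people.reset_index()

# Naming the index first gives the new column that name
named = people.rename_axis("person").reset_index()

print(flat.columns.tolist())
print(named.columns.tolist())
```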


#### Add a Row or Column to a DataFrame

In [304]:
df.head()

   Age  Height  Weight
A   22     160      70
B   42     170      80
C   30     175      90
D   35     160      70
E   25     140      70


In [305]:
# adding a row to a dataframe
df.loc['F'] = [22,162,75]

In [306]:
df

   Age  Height  Weight
A   22     160      70
B   42     170      80
C   30     175      90
D   35     160      70
E   25     140      70
F   22     162      75
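
Assigning through `loc` adds one row at a time, and it overwrites the row if the label already exists. For several rows at once, a common alternative is to build a second DataFrame and stack the two with `pd.concat` (the older `DataFrame.append` was removed in pandas 2.0). A sketch with made-up frames `people` and `new_rows`:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 42], "Height": [160, 170], "Weight": [70, 80]},
    index=["A", "B"],
)

new_rows = pd.DataFrame(
    {"Age": [30, 35], "Height": [175, 160], "Weight": [90, 70]},
    index=["C", "D"],
)

# concat stacks the frames row-wise and returns a new DataFrame
combined = pd.concat([people, new_rows])
print(combined)
```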


In [310]:
(df['Height']/100)**2

A    2.5600
B    2.8900
C    3.0625
D    2.5600
E    1.9600
F    2.6244
Name: Height, dtype: float64

In [311]:
# add a new column BMI = Weight (kg) / Height (m)^2; Height is stored in cm, hence the /100
df['BMI'] = df['Weight'] / (df['Height']/100)**2
df

   Age  Height  Weight        BMI
A   22     160      70  27.343750
B   42     170      80  27.681661
C   30     175      90  29.387755
D   35     160      70  27.343750
E   25     140      70  35.714286
F   22     162      75  28.577961


In [312]:
# Adding a new column
df['Profession'] = "Student Teacher Engineer Doctor Nurse Student".split()
print(df)

   Age  Height  Weight        BMI Profession
A   22     160      70  27.343750    Student
B   42     170      80  27.681661    Teacher
C   30     175      90  29.387755   Engineer
D   35     160      70  27.343750     Doctor
E   25     140      70  35.714286      Nurse
F   22     162      75  28.577961    Student
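
Assigning with `df['name'] = ...` changes `df` in place and appends the column at the end; a list assigned this way must have one value per row. Two related tools, sketched below on a made-up two-row frame, are `assign`, which returns a new DataFrame, and `insert`, which places the column at a chosen position.

```python
import pandas as pd

people = pd.DataFrame(
    {"Height": [160, 170], "Weight": [70, 80]},
    index=["A", "B"],
)

# assign returns a new DataFrame; people itself keeps its two columns
with_bmi = people.assign(BMI=lambda d: d["Weight"] / (d["Height"] / 100) ** 2)

# insert places a column at a chosen position and modifies the frame in place
people.insert(0, "Age", [22, 42])

print(with_bmi.round(2))
print(people.columns.tolist())
```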


In [313]:
# Setting 'Profession' column as index (returns a new DataFrame; df is unchanged)
print(df.set_index('Profession'))

            Age  Height  Weight        BMI
Profession                                
Student      22     160      70  27.343750
Teacher      42     170      80  27.681661
Engineer     30     175      90  29.387755
Doctor       35     160      70  27.343750
Nurse        25     140      70  35.714286
Student      22     162      75  28.577961
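
The output above lists `Student` twice: pandas allows duplicate index labels. The sketch below, on a shortened made-up copy of the data, shows how to detect this and what a label lookup returns in that case.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 42, 22],
     "Profession": ["Student", "Teacher", "Student"]},
    index=["A", "B", "F"],
)

by_profession = people.set_index("Profession")   # people is not modified

print(by_profession.index.is_unique)
students = by_profession.loc["Student"]   # every row labelled "Student"
teacher = by_profession.loc["Teacher"]    # a single matching row

print(students)
print(teacher)
```

A duplicated label returns all matching rows as a DataFrame, so code that expects exactly one row per label should check `index.is_unique` first.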


In [314]:
df

   Age  Height  Weight        BMI Profession
A   22     160      70  27.343750    Student
B   42     170      80  27.681661    Teacher
C   30     175      90  29.387755   Engineer
D   35     160      70  27.343750     Doctor
E   25     140      70  35.714286      Nurse
F   22     162      75  28.577961    Student


#### Rename the Index of a DataFrame

In [315]:
df.head()

   Age  Height  Weight        BMI Profession
A   22     160      70  27.343750    Student
B   42     170      80  27.681661    Teacher
C   30     175      90  29.387755   Engineer
D   35     160      70  27.343750     Doctor
E   25     140      70  35.714286      Nurse


In [316]:
df.index

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [317]:
df.index = ['a', 'b', 'c', 'd', 'e', 'f']

In [318]:
df

   Age  Height  Weight        BMI Profession
a   22     160      70  27.343750    Student
b   42     170      80  27.681661    Teacher
c   30     175      90  29.387755   Engineer
d   35     160      70  27.343750     Doctor
e   25     140      70  35.714286      Nurse
f   22     162      75  28.577961    Student


#### Rename the Columns of a DataFrame

In [319]:
df.columns

Index(['Age', 'Height', 'Weight', 'BMI', 'Profession'], dtype='object')

In [320]:
df.columns = ['age', 'height', 'weight', 'bmi', 'profession']

In [321]:
df

   age  height  weight        bmi profession
a   22     160      70  27.343750    Student
b   42     170      80  27.681661    Teacher
c   30     175      90  29.387755   Engineer
d   35     160      70  27.343750     Doctor
e   25     140      70  35.714286      Nurse
f   22     162      75  28.577961    Student
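
Assigning to `df.index` or `df.columns` replaces every label at once and needs a list of exactly the right length. When only some labels should change, or the change follows a rule, `rename` is an alternative that returns a new DataFrame. A minimal sketch; `height_cm` and the frame `people` are illustrative names.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 42], "Height": [160, 170]},
    index=["A", "B"],
)

# A mapping renames only the labels it mentions
partly = people.rename(columns={"Height": "height_cm"}, index={"A": "a"})

# A function is applied to every label
lowered = people.rename(columns=str.lower, index=str.lower)

print(partly)
print(lowered)
```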


---

Happy Learning ! Team DecodeAiML !!