# Pandas - Tutorial
<img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" alt="Pandas Logo" width="25%" align="center"/>

In [1]:
import numpy as np
import pandas as pd

print("Pandas Version:",pd.__version__)

Pandas Version: 1.4.4


# Series

A Series is a one-dimensional labeled array that can hold data of any type.

In [2]:
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)

### Creating Series

> With the `index` argument, you can name your own labels.


In [3]:
# Creating Pandas Series with data and labels
labels = ['a','b','c']
pd.Series(data=my_data , index=labels)

a    10
b    20
c    30
dtype: int64

### Key/Value Objects as Series



In [4]:
# Creating Series using Dictionary
d = {'a':10, 'b':20, 'c':30}
pd.Series(d)

a    10
b    20
c    30
dtype: int64

In [5]:
# Pandas Series can also hold Objects
pd.Series(data=[sum,len,print])

0      <built-in function sum>
1      <built-in function len>
2    <built-in function print>
dtype: object

#### A Convenient Way to Add an Index

In [6]:
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
subject = ["Physics", "Mathematics", "Social Science", "Chemistry", "Drawing" , "Biology", "English"]
pd.Series(data=subject, index=days)

Mon           Physics
Tue       Mathematics
Wed    Social Science
Thu         Chemistry
Fri           Drawing
Sat           Biology
Sun           English
dtype: object

### Accessing Series

In [7]:
ser1 = pd.Series(["Rajasthan", "Maharashtra", "Delhi"], ['1st','2nd','3rd'])
display(ser1)

1st      Rajasthan
2nd    Maharashtra
3rd          Delhi
dtype: object

In [8]:
ser1['3rd']

'Delhi'

In [9]:
ser2 = pd.Series({'i':10, 'ii':31, 'iii':52, 'iv':24})
ser2

i      10
ii     31
iii    52
iv     24
dtype: int64

### Operations on Series

In [10]:
ser1+ser2

1st    NaN
2nd    NaN
3rd    NaN
i      NaN
ii     NaN
iii    NaN
iv     NaN
dtype: object

In [11]:
ser2+ser2

i       20
ii      62
iii    104
iv      48
dtype: int64

In [12]:
ser2**2

i       100
ii      961
iii    2704
iv      576
dtype: int64
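
The all-`NaN` result of `ser1+ser2` above comes from index alignment: Pandas matches labels before applying the operation, and labels found in only one Series produce `NaN`. The sketch below illustrates this with two hypothetical Series (`left` and `right` are not part of the tutorial data) whose labels overlap only partly.

```python
import pandas as pd

# Hypothetical Series with partially overlapping labels
left = pd.Series({'i': 10, 'ii': 31, 'iii': 52})
right = pd.Series({'ii': 1, 'iii': 2, 'iv': 3})

# Labels are matched first; a label missing on either side gives NaN
aligned_sum = left + right

# add() with fill_value treats a missing label as 0 instead of NaN
filled_sum = left.add(right, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Using `add()` with `fill_value` is one option when a missing label should count as zero; whether that is appropriate depends on the data.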

# DataFrames

* A DataFrame is a two-dimensional, labeled data structure similar to a table or a spreadsheet.
* DataFrames consist of rows and columns, where each column can have different data types.
* DataFrames can be created from various data sources, including dictionaries, lists, CSV files, Excel files, and databases.

In [13]:
import pandas as pd
from numpy.random import randn

np.random.seed(101)

In [14]:
df = pd.DataFrame(data=randn(5,4),
                  index=['A','B','C','D','E'],
                  columns=['W','X','Y','Z'])

df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


## Difference between `iloc` and `loc`

The main difference between `iloc` and `loc` in Pandas is the way they are used to access data in a DataFrame:

1. `iloc`:
   - `iloc` is used for indexing and selecting data based on integer positions.
   - It accepts integer-based indexing to access rows and columns.
   - The indexing starts from 0 for the first row or column.
   - You can pass single integers, slices, or lists of integer positions to `iloc`.
   - Example: `df.iloc[0]` accesses the first row, `df.iloc[:, 1:3]` accesses the columns at positions 1 and 2 (the end position 3 is excluded).

2. `loc`:
   - `loc` is used for indexing and selecting data based on labels (index or column names).
   - It accepts label-based indexing to access rows and columns.
   - The labels can be the default integer index or user-defined labels.
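   - You can pass single labels, slices, or lists of labels to `loc`. Unlike `iloc`, a label slice includes both endpoints.
   - Example: `df.loc[0]` accesses the row with label 0 (valid only if the index contains the label 0), `df.loc[:, 'Name':'Age']` accesses columns 'Name' through 'Age', inclusive, in a DataFrame that has such columns.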

In summary:
- `iloc` uses integer positions for indexing.
- `loc` uses labels (index or column names) for indexing.

Both can select specific rows or columns of a DataFrame; choose between them based on whether you know the position or the label of the data you need.
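
As a minimal sketch of the difference, the example below uses a small hypothetical DataFrame (`people`, with made-up `Name`, `City`, and `Age` columns and the integer labels 10, 20, 30) so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical data; names and values are for illustration only
people = pd.DataFrame(
    {'Name': ['Asha', 'Ravi', 'Meena'],
     'City': ['Jaipur', 'Pune', 'Delhi'],
     'Age': [31, 27, 45]},
    index=[10, 20, 30],
)

first_by_position = people.iloc[0]   # row at position 0
first_by_label = people.loc[10]      # row whose label is 10 (the same row)

# Position slices exclude the stop; label slices include it
cols_by_position = people.iloc[:, 0:2]        # Name, City
cols_by_label = people.loc[:, 'Name':'Age']   # Name, City, Age

# There is no row labeled 0, so people.loc[0] would raise a KeyError
has_label_zero = 0 in people.index
```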

### Locating Column in DataFrame

In [15]:
type(df['W'])

pandas.core.series.Series

In [16]:
# Shape of dataframe
df.shape

(5, 4)

#### Method 1: Locating a column using bracket notation:


In [17]:
df['W'] # Accesses the 'W' column

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [18]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [19]:
df[['W','Z','Y']] # Note we are using double square brackets

          W         Z         Y
A  2.706850  0.503826  0.907969
B  0.651118  0.605965 -0.848077
C -2.018168 -0.589001  0.528813
D  0.188695  0.955057 -0.933237
E  0.190794  0.683509  2.605967


#### Method 2: Locating using the dot notation `df.ColumnName`

In [20]:
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

#### Method 3 : Locating a column using `loc`

In [21]:
df.loc[:, 'W']  # Accesses the 'W' column using loc


A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

#### Method 4 : Locating a column using `iloc`

In [22]:
df.iloc[:, 0]  # Accesses the first column (index 0) using iloc

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

## Locating Row in DataFrame

To locate a row in a Pandas DataFrame, use `iloc` to select by integer position, or `loc` to select by index label or by a boolean condition.
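
Selecting rows with a boolean condition through `loc` can be sketched as follows. The `scores` DataFrame and its column names are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame(
    {'math': [72, 45, 90], 'art': [60, 88, 75]},
    index=['r1', 'r2', 'r3'],
)

# Rows where the condition is True
passed = scores.loc[scores['math'] > 50]

# The same row filter, restricted to a single column
passed_art = scores.loc[scores['math'] > 50, 'art']

print(passed)
print(passed_art)
```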



In [23]:
type(df['W'])

pandas.core.series.Series

In [24]:
df.iloc[1]  # Accesses the row at integer position 1 (the second row)

W    0.651118
X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64

In [25]:
df.loc["B"]  # Accesses the row with index label B

W    0.651118
X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64

In [26]:
# Printing First Two Rows
print(df.iloc[[0, 1]])

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965


In [27]:
# Printing First Two Columns
print(df.iloc[:,[0,1]])

          W         X
A  2.706850  0.628133
B  0.651118 -0.319318
C -2.018168  0.740122
D  0.188695 -0.758872
E  0.190794  1.978757


In [28]:
# Printing First 3 rows and last 2 columns
print(df.iloc[:3, -2:])

          Y         Z
A  0.907969  0.503826
B -0.848077  0.605965
C  0.528813 -0.589001


**Summary of `iloc`**
> `:3` is the row selection. It selects rows from the start of the DataFrame up to, but not including, integer position 3. Because `iloc` works on positions rather than labels, this returns the first three rows (A, B, and C).

> `-2:` is the column selection. A negative position counts from the end of the DataFrame, so `-2:` selects from the second-to-last column (inclusive) through the last one, which gives the last two columns (Y and Z). This is the same slicing, applied to columns instead of rows.

## Adding & Removing Columns

### Adding Columns

#### Method 1: Adding a new column based on existing columns:

In [29]:
df['new'] = df['W'] + df['Y']
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


#### Method 2: Adding a new column using a function or lambda expression:


In [30]:
# Apply a function to each row (axis=1) to sum its values; the sum includes the 'new' column
df['Sum'] = df.apply(lambda row: row.sum(), axis=1)
print(df)

          W         X         Y         Z       new       Sum
A  2.706850  0.628133  0.907969  0.503826  3.614819  8.361597
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959 -0.107271
C -2.018168  0.740122  0.528813 -0.589001 -1.489355 -2.827588
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542 -1.292899
E  0.190794  1.978757  2.605967  0.683509  2.796762  8.255789


> In Pandas, the `axis` parameter indicates the axis along which an operation is performed. It can take two values:
- *axis=0* (the default for most methods): the operation runs down the rows, along the index, so a reduction or `apply` produces one result per column.
- *axis=1*: the operation runs across the columns, so a reduction or `apply` produces one result per row. In `drop()`, `axis=1` means the given labels refer to columns.
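
A small sketch of how `axis` changes the result, using a hypothetical two-by-two DataFrame named `grid` so the totals are easy to check by hand:

```python
import pandas as pd

# Hypothetical data for illustration
grid = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['r1', 'r2'])

col_totals = grid.sum(axis=0)   # one total per column: a=3, b=30
row_totals = grid.sum(axis=1)   # one total per row: r1=11, r2=22

# In drop(), axis says whether the label names a row or a column
without_b = grid.drop('b', axis=1)     # removes column 'b'
without_r1 = grid.drop('r1', axis=0)   # removes row 'r1'
```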

### Removing Columns

To remove columns from a Pandas DataFrame, you can use the `drop()` method or the `del` keyword. 

#### Using the drop() method:

* The `drop()` method allows you to remove one or more columns by specifying their column names.
* By default, the `drop()` method creates a new DataFrame without the specified columns and returns the modified DataFrame.
* The original DataFrame remains unchanged unless you assign the result of `drop()` back to the original DataFrame.
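
As a sketch of these points, the example below drops several columns at once through the `columns` keyword, which is an alternative to passing `axis=1`. The `table` DataFrame and the label `'missing'` are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
table = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'Y': [5, 6]})

# drop() returns a new DataFrame; table itself keeps all three columns
trimmed = table.drop(columns=['X', 'Y'])

# errors='ignore' skips labels that are not present instead of raising
safe = table.drop(columns=['X', 'missing'], errors='ignore')

# To keep the change, assign the result back to a name
table_without_x = table.drop(columns='X')
```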


In [31]:
df.drop('new', axis=1)

          W         X         Y         Z       Sum
A  2.706850  0.628133  0.907969  0.503826  8.361597
B  0.651118 -0.319318 -0.848077  0.605965 -0.107271
C -2.018168  0.740122  0.528813 -0.589001 -2.827588
D  0.188695 -0.758872 -0.933237  0.955057 -1.292899
E  0.190794  1.978757  2.605967  0.683509  8.255789


In [32]:
# Note that drop() did not modify the original df
df

          W         X         Y         Z       new       Sum
A  2.706850  0.628133  0.907969  0.503826  3.614819  8.361597
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959 -0.107271
C -2.018168  0.740122  0.528813 -0.589001 -1.489355 -2.827588
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542 -1.292899
E  0.190794  1.978757  2.605967  0.683509  2.796762  8.255789


> __NOTE__: When `inplace=True` is specified, the original object is modified and the method returns `None` instead of a new object.

> Review your code carefully before using `inplace=True` to avoid unintended modifications.
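
A common mistake that follows from this is assigning the result of an `inplace=True` call back to a variable. The sketch below uses a hypothetical `frame` to show what that call returns.

```python
import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({'W': [1, 2], 'new': [3, 4]})

# With inplace=True the DataFrame is changed and nothing is returned
result = frame.drop('new', axis=1, inplace=True)

print(result)                # None
print(list(frame.columns))   # ['W']
```

Writing `frame = frame.drop('new', axis=1, inplace=True)` would therefore replace `frame` with `None`.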

In [33]:
# Use inplace to modify the original df
df.drop('new', axis=1, inplace=True)
print(df)

          W         X         Y         Z       Sum
A  2.706850  0.628133  0.907969  0.503826  8.361597
B  0.651118 -0.319318 -0.848077  0.605965 -0.107271
C -2.018168  0.740122  0.528813 -0.589001 -2.827588
D  0.188695 -0.758872 -0.933237  0.955057 -1.292899
E  0.190794  1.978757  2.605967  0.683509  8.255789


#### Using the `del` keyword:

* The `del` keyword allows you to remove columns in-place directly from the DataFrame.
* This method modifies the original DataFrame and does not create a new DataFrame.

In [34]:
del df['Sum']
print(df)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


## Logical Operation & Boolean Dataframes

In Pandas, a Boolean DataFrame is a DataFrame object where the values in each cell are boolean values (True or False). This type of DataFrame is useful when working with logical operations or filtering data based on certain conditions.
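
Besides filtering, a Boolean DataFrame can be summarized directly, because `True` counts as 1 in a sum. The following is a minimal sketch with a hypothetical `values` DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration
values = pd.DataFrame(
    {'W': [2.7, -2.0, 0.2], 'X': [0.6, 0.7, -0.8]},
    index=['A', 'B', 'C'],
)

mask = values > 0                       # Boolean DataFrame

positives_per_column = mask.sum()       # number of True cells in each column
all_positive_rows = mask.all(axis=1)    # True where every cell in the row is True
any_non_positive = (~mask).any().any()  # True if any cell failed the test
```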



In [35]:
booldf = df>0
booldf

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


### Conditional Selection in DataFrame


Conditional selection in a DataFrame allows you to filter and retrieve specific rows or columns based on certain conditions.

In [36]:
# Keep values greater than 0; every other cell becomes NaN
selecting_values = df[df>0]  # returns a new DataFrame; df itself is unchanged
print(selecting_values)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [37]:
# Boolean Series: True where the value in column 'W' is greater than 0
df['W'] > 0 

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [38]:
# Selecting rows where values in column 'W' are greater than 0
resultdf = df[df['W'] > 0]
print(resultdf)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [39]:
resultdf[['Z','W']]

          Z         W
A  0.503826  2.706850
B  0.605965  0.651118
D  0.955057  0.188695
E  0.683509  0.190794


### Multiple Conditions with the help of & and |

In [40]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [41]:
# Python's 'and' raises a ValueError on Series; use '&' and wrap each condition in parentheses
df[(df['W']>0) & (df['Y']<0)]

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057


In [42]:
# Python's 'or' raises a ValueError on Series; use '|' and wrap each condition in parentheses
df[(df['W']>1) | (df['Y']<1)]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
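
The sketch below puts the element-wise operators side by side, adds `~` for negation, and shows the error that the Python keyword `and` raises. The `data` DataFrame is hypothetical and kept to three rows so the results can be checked by eye.

```python
import pandas as pd

# Hypothetical data for illustration
data = pd.DataFrame(
    {'W': [2.7, 0.65, -2.0], 'Y': [0.9, -0.85, 0.5]},
    index=['A', 'B', 'C'],
)

both = data[(data['W'] > 0) & (data['Y'] < 0)]        # row B
either = data[(data['W'] > 1) | (data['Y'] < 0)]      # rows A and B
neither = data[~((data['W'] > 1) | (data['Y'] < 0))]  # row C

# 'and' asks for a single truth value of a whole Series, which is ambiguous
try:
    data[(data['W'] > 0) and (data['Y'] < 0)]
    raised = False
except ValueError:
    raised = True
```

The parentheses around each comparison matter because `&` and `|` bind more tightly than `>` and `<` in Python.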


## Manipulating Index

### Resetting Index:
Using the `reset_index()` method

In [43]:
# reset_index() returns a new DataFrame; pass inplace=True to change the original df
df.reset_index()

   index         W         X         Y         Z
0      A  2.706850  0.628133  0.907969  0.503826
1      B  0.651118 -0.319318 -0.848077  0.605965
2      C -2.018168  0.740122  0.528813 -0.589001
3      D  0.188695 -0.758872 -0.933237  0.955057
4      E  0.190794  1.978757  2.605967  0.683509
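
By default `reset_index()` keeps the old index as a new column named `index`. If the old labels are not needed, `drop=True` discards them. A minimal sketch with a hypothetical `frame`:

```python
import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({'W': [1.0, 2.0]}, index=['A', 'B'])

kept = frame.reset_index()              # old labels move into an 'index' column
dropped = frame.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```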


### Setting a new index:
Using `set_index()` method


In [44]:
new_index = "RJ UP MP AP DL".split(sep=' ')
df['states'] = new_index
display(df)

          W         X         Y         Z  states
A  2.706850  0.628133  0.907969  0.503826      RJ
B  0.651118 -0.319318 -0.848077  0.605965      UP
C -2.018168  0.740122  0.528813 -0.589001      MP
D  0.188695 -0.758872 -0.933237  0.955057      AP
E  0.190794  1.978757  2.605967  0.683509      DL


In [45]:
df.set_index('states', inplace=True)
print(df)

               W         X         Y         Z
states                                        
RJ      2.706850  0.628133  0.907969  0.503826
UP      0.651118 -0.319318 -0.848077  0.605965
MP     -2.018168  0.740122  0.528813 -0.589001
AP      0.188695 -0.758872 -0.933237  0.955057
DL      0.190794  1.978757  2.605967  0.683509


### Renaming the index:
using `rename()` method or by directly assigning new values to the `index` attribute.



In [46]:
# Renaming the index using the rename() method
df.rename(index={'AP': 'GJ'}, inplace=True)
display(df)

               W         X         Y         Z
states
RJ      2.706850  0.628133  0.907969  0.503826
UP      0.651118 -0.319318 -0.848077  0.605965
MP     -2.018168  0.740122  0.528813 -0.589001
GJ      0.188695 -0.758872 -0.933237  0.955057
DL      0.190794  1.978757  2.605967  0.683509


In [47]:
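# Assigning to the index attribute replaces the labels by position, so the
# list order must match the row order. Here 'dl' lands on the row that was
# labeled 'GJ', and 'gj' on the row that was labeled 'DL'.
df.index = ['rj', 'up', 'mp','dl', 'gj']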
display(df)

           W         X         Y         Z
rj  2.706850  0.628133  0.907969  0.503826
up  0.651118 -0.319318 -0.848077  0.605965
mp -2.018168  0.740122  0.528813 -0.589001
dl  0.188695 -0.758872 -0.933237  0.955057
gj  0.190794  1.978757  2.605967  0.683509


### Reordering the index:
using the `reindex()` method

In [48]:
df.reindex(['up', 'mp', 'rj','gj', 'dl'])

           W         X         Y         Z
up  0.651118 -0.319318 -0.848077  0.605965
mp -2.018168  0.740122  0.528813 -0.589001
rj  2.706850  0.628133  0.907969  0.503826
gj  0.190794  1.978757  2.605967  0.683509
dl  0.188695 -0.758872 -0.933237  0.955057
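
`reindex()` does more than reorder: a requested label that does not exist in the index yields a row of `NaN`, unless a `fill_value` is given. The sketch below assumes a hypothetical `frame` and a hypothetical label `'tn'` that is not in its index.

```python
import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=['rj', 'up', 'mp'])

reordered = frame.reindex(['mp', 'rj', 'up'])   # same rows, new order

# 'tn' is not in the index, so its row is filled with NaN
extended = frame.reindex(['rj', 'up', 'tn'])

# fill_value supplies a value for such new rows instead of NaN
filled = frame.reindex(['rj', 'up', 'tn'], fill_value=0.0)
```

Like `drop()`, `reindex()` returns a new DataFrame and leaves `frame` unchanged.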


### Removing the Index Name:
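using the `rename_axis()` method with `None` as the argument. Here the direct assignment to `df.index` above had already discarded the name `states`, so the output is unchanged.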

In [49]:
df.rename_axis(None, inplace=True)
display(df)

           W         X         Y         Z
rj  2.706850  0.628133  0.907969  0.503826
up  0.651118 -0.319318 -0.848077  0.605965
mp -2.018168  0.740122  0.528813 -0.589001
dl  0.188695 -0.758872 -0.933237  0.955057
gj  0.190794  1.978757  2.605967  0.683509


## Multi-Index & Index Hierarchy in DataFrames

In Pandas, a MultiIndex is a way to have multiple levels of indexing in a DataFrame or Series. It allows you to work with higher-dimensional data by creating hierarchical row and column labels. MultiIndexing is useful when you need to represent and analyze data with multiple dimensions or categorical variables.
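
When the index should contain every combination of the level values, `pd.MultiIndex.from_product()` builds it without listing each pair by hand. This is a sketch of one more construction route, not a replacement for the methods in this section; the level names and data are the same illustrative ones used below.

```python
import pandas as pd

# Every combination of the two lists, in order: (A, X), (A, Y), (B, X), (B, Y)
index = pd.MultiIndex.from_product(
    [['A', 'B'], ['X', 'Y']],
    names=['Level1', 'Level2'],
)

frame = pd.DataFrame({'Data': [1, 2, 3, 4]}, index=index)
print(frame)
```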

### Creating Multi-Index

#### Method 1: From a list of arrays:


In [50]:
index = [['A', 'A', 'B', 'B'], ['X', 'Y', 'X', 'Y']]
data = [1, 2, 3, 4]
multi_index = pd.MultiIndex.from_arrays(index, names=['Level1', 'Level2'])
df = pd.DataFrame(data, index=multi_index)
display(df)

               0
Level1 Level2
A      X       1
       Y       2
B      X       3
       Y       4


#### Method 2: From a list of tuples:


In [51]:
index = [('A', 'X'), ('A', 'Y'), ('B', 'X'), ('B', 'Y')]
data = [1, 2, 3, 4]
multi_index = pd.MultiIndex.from_tuples(index, names=['Level1', 'Level2'])
df = pd.DataFrame(data, index=multi_index)
display(df)

               0
Level1 Level2
A      X       1
       Y       2
B      X       3
       Y       4


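#### Method 3: From a dictionary of arrays (via `from_frame`):

* Keys represent the level names 
* Values represent the corresponding arrays for each level
* The dictionary is first converted to a DataFrame, which `pd.MultiIndex.from_frame()` then turns into the index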

In [52]:
data = {
    'Level1': ['A', 'A', 'B', 'B'],
    'Level2': ['X', 'Y', 'X', 'Y']
}
df = pd.DataFrame({'Data': [1, 2, 3, 4]}, index=pd.MultiIndex.from_frame(pd.DataFrame(data)))
display(df)

               Data
Level1 Level2
A      X          1
       Y          2
B      X          3
       Y          4


### Accessing data

In [53]:
# Accessing the value at index ('A', 'X') in the 'Data' column
df.loc[('A', 'X'), 'Data']


1

In [61]:
# Accessing the value in 'Data' column for index 'A' in the first level and index 'X' in the second level
df.loc['A'].loc['X']['Data']

1

In [54]:
# Accessing all rows with level 'A' in the first level of the MultiIndex
df.loc['A'] 


        Data
Level2
X          1
Y          2


In [55]:
# Slicing from ('A', 'X') to ('B', 'Y'); both endpoints are included
df.loc[('A', 'X'):('B', 'Y')]

               Data
Level1 Level2
A      X          1
       Y          2
B      X          3
       Y          4


In [56]:
df.index.names

FrozenList(['Level1', 'Level2'])
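
The examples above select through the first level. To select by an inner level, one option is the `xs()` (cross-section) method with its `level` argument. The sketch below rebuilds the same illustrative DataFrame under the name `frame` so that it runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'X'), ('A', 'Y'), ('B', 'X'), ('B', 'Y')],
    names=['Level1', 'Level2'],
)
frame = pd.DataFrame({'Data': [1, 2, 3, 4]}, index=index)

# All rows whose second-level label is 'X'; the selected level is dropped
x_rows = frame.xs('X', level='Level2')

# The labels of a single level, one entry per row
level1_labels = frame.index.get_level_values('Level1')

print(x_rows)
print(list(level1_labels))
```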