Data filtering is one of the most important steps in data analysis.
It is the process of selecting a relevant subset of data from a larger dataset based on certain conditions.
With the rise of big data and machine learning, data filtering has become a routine task for data analysts and data scientists.
One of the most popular data analysis libraries used for filtering data is Pandas.
Pandas has many built-in methods that allow users to filter data based on different criteria.


### 2. Filtering Data with Boolean Indexing:
One of the easiest ways to filter data in Pandas is by using Boolean indexing.
It is a technique that allows us to select rows of data based on a condition.
Comparing a column against a value produces a Series of True and False values, one for each row of the original dataset.
We can then pass this Series to the DataFrame's indexing operator to select only the rows that meet the condition.



In [3]:
# Importing the Pandas Library and Creating a DataFrame
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'John'],
        'age': [25, 30, 30, 40, 45, 55],
        'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
        'score': [80, 90, 85, 95, 90, 95]}

df = pd.DataFrame(data)
df

Unnamed: 0,name,age,gender,score
0,Alice,25,F,80
1,Bob,30,M,90
2,Charlie,30,M,85
3,David,40,M,95
4,Emily,45,F,90
5,John,55,M,95



In [4]:
# For example, let’s filter the dataset to include only the rows where the score is greater than or equal to 90:
filtered_df = df[df['score'] >= 90]
print(filtered_df)

    name  age gender  score
1    Bob   30      M     90
3  David   40      M     95
4  Emily   45      F     90
5   John   55      M     95
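The filter above happens in two steps that can be looked at separately: the comparison builds a Boolean Series, and indexing with that Series keeps the rows marked True. The sketch below reuses the same sample data; the variable names `mask` and `high_scoring_men` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'John'],
    'age': [25, 30, 30, 40, 45, 55],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'score': [80, 90, 85, 95, 90, 95],
})

# Step 1: the comparison alone yields a Boolean Series, one value per row
mask = df['score'] >= 90
print(mask.tolist())  # [False, True, False, True, True, True]

# Step 2: indexing with the mask keeps only the rows marked True
print(df[mask])

# Conditions combine with & (and), | (or) and ~ (not);
# each condition needs its own parentheses
high_scoring_men = df[(df['score'] >= 90) & (df['gender'] == 'M')]
print(high_scoring_men['name'].tolist())  # ['Bob', 'David', 'John']
```

Note that Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.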


### 3. Filtering Data with Query Method
Another way to filter data in Pandas is by using the query() method.
The query() method takes the filter condition as a string, which lets us write SQL-like expressions that refer to columns by name.
It selects the same rows as Boolean indexing, but compound conditions are often easier to read in this form.

In [5]:
filtered_df = df.query('score >= 90')
print(filtered_df)

    name  age gender  score
1    Bob   30      M     90
3  David   40      M     95
4  Emily   45      F     90
5   John   55      M     95
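As a rough illustration of a compound query, the sketch below combines two conditions with `and` and reads a threshold from a local Python variable using the `@` prefix. The variable `min_score` is a hypothetical name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'John'],
    'age': [25, 30, 30, 40, 45, 55],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'score': [80, 90, 85, 95, 90, 95],
})

min_score = 90  # hypothetical threshold, referenced in the query as @min_score

# Column names are written bare; local variables are prefixed with @
result = df.query('score >= @min_score and gender == "M"')
print(result['name'].tolist())  # ['Bob', 'David', 'John']

# The equivalent Boolean-indexing expression selects the same rows
same = df[(df['score'] >= min_score) & (df['gender'] == 'M')]
print(result.equals(same))  # True
```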


### Filtering Data with loc and iloc Methods
### Using loc Method
The loc indexer selects rows and columns based on their labels.
It accepts up to two arguments inside square brackets: row labels and, optionally, column labels.


In [6]:
# Selecting a single row using loc method
print(df.loc[2])

name      Charlie
age            30
gender          M
score          85
Name: 2, dtype: object



In [7]:
# Selecting a range of rows with labels 2 through 5 (loc slices include the end label)
print(df.loc[2: 5])

      name  age gender  score
2  Charlie   30      M     85
3    David   40      M     95
4    Emily   45      F     90
5     John   55      M     95


In [8]:
# Selecting multiple rows and particular columns using loc method
print(df.loc[[1, 4, 5], ['name', 'score']])


    name  score
1    Bob     90
4  Emily     90
5   John     95


In [10]:
# Selecting records with age == 30 and score > 90 using a Boolean condition inside loc (no rows match here)
display(df.loc[(df.age == 30) & (df.score > 90)])

Unnamed: 0,name,age,gender,score
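The result above is empty because neither 30-year-old in the sample data scores above 90. The sketch below, which assumes the same sample data, relaxes the threshold to show a non-empty result and also passes a list of column labels as the second argument to loc. The names `no_match` and `matches` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'John'],
    'age': [25, 30, 30, 40, 45, 55],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'score': [80, 90, 85, 95, 90, 95],
})

# Original condition: Bob scores 90 and Charlie 85, so nothing exceeds 90
no_match = df.loc[(df['age'] == 30) & (df['score'] > 90)]
print(len(no_match))  # 0

# Relaxed condition, plus a column selection as the second loc argument
matches = df.loc[(df['age'] == 30) & (df['score'] >= 85), ['name', 'score']]
print(matches)
```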


### Using iloc Method
The iloc indexer selects rows and columns based on their integer position.
It accepts up to two arguments inside square brackets: row positions and, optionally, column positions.
It is similar to loc[], but it takes only integer positions (counted from 0), not labels.

In [11]:
# Selecting a single row using iloc method
print(df.iloc[2])

name      Charlie
age            30
gender          M
score          85
Name: 2, dtype: object


In [12]:
# Selecting multiple rows and columns using iloc method
print(df.iloc[[1, 3], [0, 3]])

    name  score
1    Bob     90
3  David     95
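With the default integer index, loc and iloc can look interchangeable because labels and positions coincide. The sketch below makes the difference visible by moving the `name` column into the index; the variable `by_name` is an illustrative name, and the data is a shortened version of the sample above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 30, 40],
    'score': [80, 90, 85, 95],
})

# Use the names as row labels so that labels and positions differ
by_name = df.set_index('name')

# loc looks up the label 'Charlie'; iloc looks up position 2 (the third row)
print(by_name.loc['Charlie'].equals(by_name.iloc[2]))  # True

# Slicing differs: loc includes the end label, iloc excludes the end position
print(by_name.loc['Bob':'David'].index.tolist())  # ['Bob', 'Charlie', 'David']
print(by_name.iloc[1:3].index.tolist())           # ['Bob', 'Charlie']
```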


###  Filtering Data with filter() Method

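The filter() method selects a subset of rows or columns based on their labels (the column names or the index values), not on the data they contain.
It is particularly useful with wide datasets, where you may want to keep only the columns whose names follow a certain pattern.
Keeping only the labels you need is also a way to exclude the remaining rows or columns from your analysis. To select rows by their values, use Boolean indexing or query() instead.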

In [None]:
# syntax
# DataFrame.filter(items=None, like=None, regex=None, axis=None)


items: A list of column labels or row labels to filter on.

like: A string containing a substring to filter on.
This searches for columns or rows that contain the specified substring.

regex: A regular expression to filter on.
This searches for columns or rows that match the specified regular expression.

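axis: The axis along which to filter. For a DataFrame the default is the columns (axis=1); pass axis=0 to filter on the row index instead.

Only one of items, like, and regex may be passed in a single call.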
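The parameters above can be tried on a small frame with a string index. This is a minimal sketch; the row labels `x1`, `x2`, `y1` and the column `Z` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x1', 'x2', 'y1'])

# axis=0 applies the filter to the row labels instead of the column names
rows = df.filter(like='x', axis=0)
print(rows.index.tolist())  # ['x1', 'x2']

# items keeps the listed labels; labels that do not exist are skipped
cols = df.filter(items=['B', 'Z'])
print(cols.columns.tolist())  # ['B']

# Passing more than one of items, like, regex is rejected
try:
    df.filter(items=['A'], like='B')
except TypeError as err:
    print('TypeError:', err)
```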

In [13]:
# Example 1: Filtering by column names using the "items" parameter
# Suppose you have a DataFrame with several columns, and you want to select only a subset of columns based on their names.
# You can do this using the items parameter of the filter() method, like this:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print(df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [15]:
# Use the filter() method to select columns 'A' and 'B'
df_filtered = df.filter(items=['A', 'B'])
print(df_filtered)

   A  B
0  1  4
1  2  5
2  3  6


### Filtering by column names using the like parameter
Suppose you have a DataFrame with several columns, and you want to select only a subset of columns that contain a certain substring. You can do this using the like parameter of the filter() method, like this:



In [16]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'apple': [1, 2, 3], 'banana': [4, 5, 6], 'orange': [7, 8, 9]})
print(df)


   apple  banana  orange
0      1       4       7
1      2       5       8
2      3       6       9


In [18]:
# Use the filter() method to select columns containing the substring 'an'
df_filtered = df.filter(like='an')
print(df_filtered)

   banana  orange
0       4       7
1       5       8
2       6       9


### Filtering by column names using the regex parameter
Suppose you have a DataFrame with several columns, and you want to select only the columns whose names match a regular expression. You can do this using the regex parameter of the filter() method, like this:



In [19]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A_1': [1, 2, 3], 'A_2': [4, 5, 6], 'B_1': [7, 8, 9]})
print(df)


   A_1  A_2  B_1
0    1    4    7
1    2    5    8
2    3    6    9


In [20]:
# Use the filter() method to select columns whose names contain a match for the regular expression 'A.*'
df_filtered = df.filter(regex='A.*')
print(df_filtered)

   A_1  A_2
0    1    4
1    2    5
2    3    6
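The regex parameter searches for the pattern anywhere in the label, so an unanchored pattern can select more columns than intended. The sketch below adds a hypothetical column `B_A` to show the effect of anchoring with `^` and `$`.

```python
import pandas as pd

# 'B_A' is a made-up column added to show the difference anchoring makes
df = pd.DataFrame({'A_1': [1], 'A_2': [2], 'B_1': [3], 'B_A': [4]})

# Unanchored: any label containing 'A' is selected
print(df.filter(regex='A').columns.tolist())    # ['A_1', 'A_2', 'B_A']

# Anchored at the start: only labels beginning with 'A'
print(df.filter(regex='^A').columns.tolist())   # ['A_1', 'A_2']

# Anchored at the end: only labels ending with '_1'
print(df.filter(regex='_1$').columns.tolist())  # ['A_1', 'B_1']
```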
