# Manipulating Pandas Dataframes
Once you have loaded data into a Pandas dataframe, you will often
need to manipulate it further: selecting a subset of rows or columns,
filtering rows on a condition, dropping what you do not need, sorting
the data, finding unique values, and so on.

## 1. Selecting Data Using Indexing and Slicing
Indexing refers to fetching data from a Pandas dataframe using its
index labels or column names. Slicing, on the other hand, refers to
selecting a contiguous range of rows or columns using the same
indexing techniques.

In [24]:
# Importing the matplotlib.pyplot module for data visualization
import matplotlib.pyplot as plt

# Importing the seaborn library for advanced data visualization
import seaborn as sns

# Set the default style of the plots to 'darkgrid' for better readability
sns.set_style("darkgrid")

# Load the Titanic dataset from Seaborn’s built-in datasets
titanic_data = sns.load_dataset('titanic')

# Display the first 5 rows of the Titanic dataset to inspect the data
titanic_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### 1.1. Selecting Data Using Brackets []
One of the simplest ways to select data from various columns is by
using square brackets. To get column data in the form of a series
from a Pandas dataframe, you need to pass the column name inside
square brackets that follow the Pandas dataframe name.

In [25]:
# Print the "class" column from the titanic_data DataFrame
print(titanic_data["class"])

# Print the type of the "class" column (usually a pandas Series)
print(type(titanic_data["class"]))

0       Third
1       First
2       Third
3       First
4       Third
        ...  
886    Second
887     First
888     Third
889     First
890     Third
Name: class, Length: 891, dtype: category
Categories (3, object): ['First', 'Second', 'Third']
<class 'pandas.core.series.Series'>


In [26]:
# Print the type of the DataFrame containing only the 'class', 'sex', and 'age' columns
print(type(titanic_data[["class", "sex", "age"]]))

# Display the contents of the DataFrame with the selected columns
titanic_data[["class", "sex", "age"]]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,class,sex,age
0,Third,male,22.0
1,First,female,38.0
2,Third,female,26.0
3,First,female,35.0
4,Third,male,35.0
...,...,...,...
886,Second,male,27.0
887,First,female,19.0
888,Third,female,
889,First,male,26.0
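
The difference between single and double brackets can be seen on a tiny frame. The sketch below is illustrative only: `frame` and its values are made up for the example and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
frame = pd.DataFrame({"class": ["Third", "First"], "age": [22.0, 38.0]})

# A single column name returns a one-dimensional Series
as_series = frame["class"]

# A list of names returns a DataFrame, even if the list holds one name
as_frame = frame[["class"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Passing a list is handy when later code expects a dataframe rather than a series.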


In [27]:
# Filter the Titanic dataset to include only rows where the passenger's sex is male
my_df = titanic_data[titanic_data["sex"] == "male"]

# Display the first 5 rows of the filtered DataFrame
my_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [28]:
# Filter the Titanic dataset to include only rows where:
# - the passenger is male
# - the passenger was in the First class
my_df = titanic_data[(titanic_data["sex"] == "male") & 
                     (titanic_data["class"] == "First")]

# Display the first 5 rows of the filtered DataFrame
my_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True
27,0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False
30,0,1,male,40.0,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
34,0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False


In [29]:
# Define a list of specific ages to filter by
ages = [20, 21, 22]

# Filter the Titanic dataset to include only rows where the "age" column matches one of the specified ages
age_dataset = titanic_data[titanic_data["age"].isin(ages)]

# Display the first 5 rows of the filtered dataset
age_dataset.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
37,0,3,male,21.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
51,0,3,male,21.0,0,0,7.8,S,Third,man,True,,Southampton,no,True
56,1,2,female,21.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True
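
Conditions can also be combined with `|` (or) and negated with `~` (not). The following is a minimal sketch on a small stand-in frame; `passengers` and its values are assumptions made for the example. Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (values are illustrative only)
passengers = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "First"],
    "age": [22.0, 38.0, 26.0, 54.0],
})

# OR: keep rows where the passenger is female or travelled First class
female_or_first = passengers[
    (passengers["sex"] == "female") | (passengers["class"] == "First")
]

# NOT: keep rows whose age is not in the given list
not_listed_ages = passengers[~passengers["age"].isin([22, 26])]

print(female_or_first)
print(not_listed_ages)
```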


### 1.2. Indexing and Slicing Using the loc Indexer
The loc indexer of a Pandas dataframe selects rows and columns by
label. It also accepts boolean masks, so it can be used to filter
records in the dataframe.

In [30]:
# Import the pandas library and assign it the alias 'pd' for convenience
import pandas as pd

# Define a list of dictionaries where each dictionary represents a subject's exam result
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a Pandas DataFrame for tabular data manipulation
my_df = pd.DataFrame(scores)

# Display the first few rows of the DataFrame (by default, this shows the first 5 rows)
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [31]:
# Print the row in the DataFrame `my_df` with index label 2
print(my_df.loc[2])

# Display the type of the object returned by `my_df.loc[2]`
# This will usually be a pandas Series if only one row is selected
type(my_df.loc[2])

Subject    English
Score           76
Grade            C
Remarks       Fair
Name: 2, dtype: object


pandas.core.series.Series

In [32]:
# Select the rows labeled 2 through 4 from the DataFrame `my_df` using .loc[]
# .loc[] slices by index label and includes both endpoints, unlike positional slicing
my_df.loc[2:4]

Unnamed: 0,Subject,Score,Grade,Remarks
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [33]:
# Select rows with index 2 through 4 (inclusive) and columns "Grade" and "Score"
my_df.loc[2:4, ["Grade", "Score"]]

Unnamed: 0,Grade,Score
2,C,76
3,C,72
4,A,95


In [34]:
# Create a list of dictionaries, each representing a student's performance in a subject
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a DataFrame, assigning custom row indices for students
my_df = pd.DataFrame(scores, index=["Student1", "Student2", "Student3", "Student4", "Student5"])

# Display the DataFrame
my_df

Unnamed: 0,Subject,Score,Grade,Remarks
Student1,Mathematics,85,B,Good
Student2,History,98,A,Excellent
Student3,English,76,C,Fair
Student4,Science,72,C,Fair
Student5,Arts,95,A,Excellent


In [35]:
# Access the row labeled "Student1" in the DataFrame 'my_df'
my_df.loc["Student1"]

Subject    Mathematics
Score               85
Grade                B
Remarks           Good
Name: Student1, dtype: object

In [36]:
# Create a list of index labels for the rows you want to access
index_list = ["Student1", "Student2"]

# Use .loc[] to select rows from the DataFrame `my_df` that match the given index labels
# This returns a new DataFrame containing only the rows for "Student1" and "Student2"
my_df.loc[index_list]

Unnamed: 0,Subject,Score,Grade,Remarks
Student1,Mathematics,85,B,Good
Student2,History,98,A,Excellent


In [37]:
# Access the value in the "Grade" column for the row labeled "Student1"
my_df.loc["Student1", "Grade"]

'B'

In [38]:
# Select a subset of the DataFrame `my_df` using `.loc[]`
# This retrieves rows from "Student1" to "Student2" (inclusive)
# and selects only the column named "Grade"
subset = my_df.loc["Student1":"Student2", "Grade"]
subset

Student1    B
Student2    A
Name: Grade, dtype: object

In [39]:
# Select rows from "Student1" to "Student4" (inclusive) and the "Grade" column from the DataFrame
my_df.loc["Student1":"Student4", "Grade"]

Student1    B
Student2    A
Student3    C
Student4    C
Name: Grade, dtype: object

In [40]:
# Selects rows from the DataFrame `my_df` using a boolean mask.
# The mask has to be the same length as the number of rows in `my_df`.
# Only rows corresponding to `True` values in the list are selected.
# In this case, only the 4th row (index 3, since indexing starts at 0) will be returned.
my_df.loc[[False, False, False, True, False]]

Unnamed: 0,Subject,Score,Grade,Remarks
Student4,Science,72,C,Fair


In [41]:
# Build a boolean mask that is True for rows where the 'Score' column is greater than 80
my_df["Score"] > 80

Student1     True
Student2     True
Student3    False
Student4    False
Student5     True
Name: Score, dtype: bool

In [42]:
# Select all rows in the DataFrame 'my_df' where the value in the "Score" column is greater than 80
my_df.loc[my_df["Score"] > 80]

Unnamed: 0,Subject,Score,Grade,Remarks
Student1,Mathematics,85,B,Good
Student2,History,98,A,Excellent
Student5,Arts,95,A,Excellent


In [43]:
# Select rows in the DataFrame 'my_df' where:
# - the value in the "Score" column is greater than 80
# - AND the value in the "Remarks" column is equal to "Excellent"
# The loc[] indexer accesses a group of rows and columns by labels or a boolean array
my_df.loc[(my_df["Score"] > 80) & (my_df["Remarks"] == "Excellent")]

Unnamed: 0,Subject,Score,Grade,Remarks
Student2,History,98,A,Excellent
Student5,Arts,95,A,Excellent


In [44]:
# Select rows in 'my_df' where the value in the "Score" column is greater than 80
# Then return only the "Score" and "Grade" columns for those rows
my_df.loc[my_df["Score"] > 80, ["Score", "Grade"]]

Unnamed: 0,Score,Grade
Student1,85,B
Student2,98,A
Student5,95,A


In [45]:
# Assign the value 90 to the row labeled "Student4" in the DataFrame `my_df`.
# Because "Student4" already exists, every column in that row is overwritten with 90.
# If the label did not exist, a new row with that label would be added instead.
my_df.loc["Student4"] = 90

# Displays the updated DataFrame
my_df

Unnamed: 0,Subject,Score,Grade,Remarks
Student1,Mathematics,85,B,Good
Student2,History,98,A,Excellent
Student3,English,76,C,Fair
Student4,90,90,90,90
Student5,Arts,95,A,Excellent
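
Overwriting a whole row is rarely what you want. A more targeted assignment passes both a row selector and a column label to loc. The sketch below assumes a small hypothetical frame named `marks`; it is not the `my_df` from the cells above.

```python
import pandas as pd

# Hypothetical frame used only to illustrate targeted assignment
marks = pd.DataFrame(
    {"Subject": ["Science", "Arts"], "Score": [72, 95], "Grade": ["C", "A"]},
    index=["Student4", "Student5"],
)

# Update a single cell: row label plus column label
marks.loc["Student4", "Score"] = 90

# Update one column for every row that matches a condition
marks.loc[marks["Score"] >= 90, "Grade"] = "A"

print(marks)
```

Only the cells that are addressed change; the "Subject" value for "Student4" is left as it was.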


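### 1.3. Indexing and Slicing Using the iloc Indexer
The iloc indexer selects rows and columns by integer position rather
than by label. Positions start at 0, and the end of a slice is excluded.

In [46]: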
# Create a list of dictionaries, each representing a subject's score details
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History',     'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English',     'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science',     'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts',        'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a Pandas DataFrame
my_df = pd.DataFrame(scores)

# Display the first 5 rows of the DataFrame (in this case, it will show all since there are only 5 rows)
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [47]:
# Access the 4th row of the DataFrame `my_df` using integer-location based indexing
my_df.iloc[3]

Subject    Science
Score           72
Grade            C
Remarks       Fair
Name: 3, dtype: object

In [48]:
# Use iloc (integer-location based indexing) to select specific rows from the DataFrame
# In this case, we're selecting the row at index 3 (the 4th row, since Python is zero-indexed)
my_df.iloc[[3]]

Unnamed: 0,Subject,Score,Grade,Remarks
3,Science,72,C,Fair


In [49]:
# Use .iloc[] to select specific rows from the DataFrame 'my_df' by their integer positions
# This selects the 3rd and 4th rows (index positions 2 and 3)
my_df.iloc[[2, 3]]

Unnamed: 0,Subject,Score,Grade,Remarks
2,English,76,C,Fair
3,Science,72,C,Fair


In [50]:
# Select rows from index 2 up to (but not including) index 4 using iloc
# iloc is used for integer-location based indexing
my_df.iloc[2:4]

Unnamed: 0,Subject,Score,Grade,Remarks
2,English,76,C,Fair
3,Science,72,C,Fair


In [51]:
# Select specific rows and columns from the DataFrame using iloc (integer-location based indexing)
my_df.iloc[[
    2,  # Select the 3rd row (index starts at 0)
    3   # Select the 4th row
], [
    0,  # Select the 1st column
    1   # Select the 2nd column
]]

Unnamed: 0,Subject,Score
2,English,76
3,Science,72


In [52]:
# Select rows from index 2 (inclusive) to 4 (exclusive) 
# and columns from index 0 (inclusive) to 2 (exclusive)
subset = my_df.iloc[2:4, 0:2]
print(subset)

   Subject  Score
2  English     76
3  Science     72
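
The difference between loc and iloc is easiest to see when the index labels are integers that do not match the row positions. The following sketch assumes a hypothetical frame `grades` with labels 10, 20, 30, and 40.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions
grades = pd.DataFrame(
    {"Score": [85, 98, 76, 72]},
    index=[10, 20, 30, 40],
)

# Label-based slice: both end labels are included
by_label = grades.loc[10:30]

# Position-based slice: the end position is excluded
by_position = grades.iloc[0:2]

# loc looks up labels, so asking for label 0 fails on this frame
try:
    grades.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label)
print(by_position)
print(label_zero_found)
```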


## 2. Dropping Rows and Columns with the drop() Method
Apart from selecting rows and columns with the loc and iloc indexers,
you can also use the drop() method to remove unwanted rows and columns
from your dataframe while keeping the rest.

### 2.1. Dropping Rows
The following script creates a dummy dataframe that you will use in
this section.

In [53]:
# Create a list of dictionaries where each dictionary represents a subject's record
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a pandas DataFrame
my_df = pd.DataFrame(scores)

# Display the first 5 rows of the DataFrame (in this case, it shows all because there are only 5 rows)
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [54]:
# Create a new DataFrame `my_df2` by dropping rows with index labels 1 and 4 from the original DataFrame `my_df`
my_df2 = my_df.drop([1, 4])

# Display the first 5 rows of the new DataFrame to verify the result
my_df2.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
2,English,76,C,Fair
3,Science,72,C,Fair


In [55]:
# Reset the index of the DataFrame 'my_df2' and apply the change in place
# This will move the current index into a column and reset the index to default integer values (0, 1, 2, ...)
my_df2.reset_index(inplace=True)

# Display the first 5 rows of the updated DataFrame to preview the changes
my_df2.head()

Unnamed: 0,index,Subject,Score,Grade,Remarks
0,0,Mathematics,85,B,Good
1,2,English,76,C,Fair
2,3,Science,72,C,Fair


In [56]:
# Drop rows at index 1 and 4 from the original DataFrame 'my_df'
my_df2 = my_df.drop([1, 4])

# Display the first 5 rows of the new DataFrame 'my_df2' after dropping the specified rows
my_df2.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
2,English,76,C,Fair
3,Science,72,C,Fair


In [57]:
# Reset the index of the DataFrame `my_df2` in place.
# `inplace=True` modifies the DataFrame directly without creating a new one.
# `drop=True` means the old index will not be added as a new column in the DataFrame.
my_df2.reset_index(inplace=True, drop=True)

# Display the first 5 rows of the updated DataFrame to check the result.
my_df2.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,English,76,C,Fair
2,Science,72,C,Fair


In [58]:
# Drop the rows with index labels 1, 3, and 4 from the DataFrame 'my_df'
# Note: This returns a new DataFrame by default unless 'inplace=True' is specified
my_df.drop([1, 3, 4])

# Display the first 5 rows of the original DataFrame 'my_df'
# Since drop() was not assigned or used with inplace=True, 'my_df' is unchanged here
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [59]:
# Drop rows with index 1, 3, and 4 from the DataFrame 'my_df'
# The 'inplace=True' argument ensures that the changes are made directly to 'my_df' without needing to assign it to a new variable
my_df.drop([1, 3, 4], inplace=True)

# Display the first 5 rows of the modified DataFrame to check the result
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
2,English,76,C,Fair


### 2.2. Dropping Columns
You can also drop columns using the drop() method.

In [60]:
# Import the pandas library and assign it the alias 'pd'
import pandas as pd

# Define a list of dictionaries where each dictionary represents a subject with its score, grade, and remarks
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a Pandas DataFrame
my_df = pd.DataFrame(scores)

# Display the first 5 rows of the DataFrame (in this case, it will display the entire DataFrame since there are only 5 rows)
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [61]:
# Create a new DataFrame 'my_df2' by dropping the "Subject" and "Grade" columns from 'my_df'
# axis=1 specifies that we're dropping columns (axis=0 would be for rows)
my_df2 = my_df.drop(["Subject", "Grade"], axis=1)

# Display the first 5 rows of the new DataFrame to preview the result
my_df2.head()

Unnamed: 0,Score,Remarks
0,85,Good
1,98,Excellent
2,76,Fair
3,72,Fair
4,95,Excellent


In [62]:
# Drop the columns "Subject" and "Grade" from the DataFrame `my_df`
# axis=1 specifies that we are dropping columns (not rows)
# inplace=True means the changes are made directly to `my_df` and not returned as a new DataFrame
my_df.drop(["Subject", "Grade"], axis=1, inplace=True)

# Display the first 5 rows of the modified DataFrame
my_df.head()

Unnamed: 0,Score,Remarks
0,85,Good
1,98,Excellent
2,76,Fair
3,72,Fair
4,95,Excellent
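
As an alternative to `axis=1`, drop() accepts the `columns` and `index` keyword arguments, which make the intent explicit. The sketch below uses a hypothetical frame named `results` that mirrors the scores data.

```python
import pandas as pd

# Hypothetical sample frame mirroring the scores data used in this section
results = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English"],
    "Score": [85, 98, 76],
    "Grade": ["B", "A", "C"],
    "Remarks": ["Good", "Excellent", "Fair"],
})

# Same effect as results.drop(["Subject", "Grade"], axis=1)
without_text = results.drop(columns=["Subject", "Grade"])

# Rows and columns can be dropped in a single call
trimmed = results.drop(index=[1], columns=["Remarks"])

print(without_text)
print(trimmed)
```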


## 3. Filtering Rows and Columns with Filter Method
The drop() method removes the rows or columns you name, and the
filter() method does the reverse: it keeps only the rows or columns
whose labels you name. Note that filter() matches on index and column
labels, not on the values stored in the dataframe.

### 3.1. Filtering Rows

In [63]:
# Create a list of dictionaries, where each dictionary contains details of a subject and corresponding score, grade, and remarks
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a Pandas DataFrame for tabular representation and analysis
my_df = pd.DataFrame(scores)

# Display the first five rows of the DataFrame (in this case, it will show the entire table since it only has five rows)
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [64]:
# Select specific rows from the DataFrame `my_df` using the .filter() method
# Here, rows with indices 1, 3, and 4 are selected (axis=0 means rows)
my_df2 = my_df.filter([1, 3, 4], axis=0)

# Display the first 5 rows of the resulting DataFrame `my_df2`
my_df2.head()

Unnamed: 0,Subject,Score,Grade,Remarks
1,History,98,A,Excellent
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [65]:
# Reset the index of the DataFrame `my_df2` and drop the old index column.
# This is useful after filtering or transforming the DataFrame to get a clean, sequential index.
my_df2 = my_df2.reset_index(drop=True)

# Display the first 5 rows of the updated DataFrame to quickly inspect its contents.
my_df2.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,History,98,A,Excellent
1,Science,72,C,Fair
2,Arts,95,A,Excellent


### 3.2. Filtering Columns
To filter columns, pass the column names and set axis=1.

In [66]:
# Define a list of dictionaries, each representing a subject and its corresponding score, grade, and remarks
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the list of dictionaries into a Pandas DataFrame
my_df = pd.DataFrame(scores)

# Display the first 5 rows of the DataFrame
my_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Science,72,C,Fair
4,Arts,95,A,Excellent


In [67]:
# Select only the "Score" and "Grade" columns from the DataFrame 'my_df'
# The 'axis=1' specifies that we're selecting columns (axis=0 would be for rows)
my_df2 = my_df.filter(["Score", "Grade"], axis=1)

# Display the first 5 rows of the filtered DataFrame to preview the result
my_df2.head()

Unnamed: 0,Score,Grade
0,85,B
1,98,A
2,76,C
3,72,C
4,95,A
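
Besides a list of labels, filter() can match labels by substring with `like` or by regular expression with `regex`. The following sketch assumes a hypothetical frame named `report`; the patterns are chosen only to illustrate the two parameters.

```python
import pandas as pd

# Hypothetical frame used only for illustration
report = pd.DataFrame(
    {"Score": [85, 98], "Grade": ["B", "A"], "Remarks": ["Good", "Excellent"]},
    index=["Student1", "Student2"],
)

# regex= keeps labels that match the pattern: here, column names ending in "e"
ends_with_e = report.filter(regex="e$", axis=1)

# like= keeps labels that contain the substring: here, row labels containing "2"
second_student = report.filter(like="2", axis=0)

print(ends_with_e)
print(second_student)
```

In both cases only labels are inspected. To keep rows based on cell values, use a boolean mask instead.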


## 4. Sorting Dataframes
You can also sort the records in your Pandas dataframe based on the
values in one or more columns. Let’s see how to do this.
For this section, you will use the Titanic dataset, which you can
load through the Seaborn library with the following script:

In [68]:
# Import the matplotlib.pyplot module for plotting
import matplotlib.pyplot as plt

# Import seaborn, a statistical data visualization library
import seaborn as sns

# Set the default plotting style to 'darkgrid' for better readability
sns.set_style("darkgrid")

# Load the built-in Titanic dataset from seaborn
titanic_data = sns.load_dataset('titanic')

# Display the first 5 rows of the Titanic dataset to examine the structure
titanic_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [69]:
# Sort the Titanic dataset by the 'age' column in ascending order
# This helps us see passengers from youngest to oldest
age_sorted_data = titanic_data.sort_values(by=['age'])

# Display the first 5 rows of the sorted dataset
# This allows us to quickly view the youngest passengers
age_sorted_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
803,1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
755,1,2,male,0.67,1,1,14.5,S,Second,child,False,,Southampton,yes,False
644,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
469,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
78,1,2,male,0.83,0,2,29.0,S,Second,child,False,,Southampton,yes,False


In [70]:
# Sort the Titanic dataset by the 'age' column in descending order (from oldest to youngest)
age_sorted_data = titanic_data.sort_values(by=['age'], ascending=False)

# Display the first 5 rows of the sorted DataFrame
age_sorted_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
851,0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
116,0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [71]:
# Sort the Titanic dataset by 'age' and then by 'fare', both in descending order
# The oldest passengers appear first; passengers of the same age are ordered from highest to lowest fare
age_sorted_data = titanic_data.sort_values(by=['age', 'fare'], ascending=False)

# Display the first 5 rows of the sorted dataset
age_sorted_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
851,0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
116,0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
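
The sort direction can also be set per column by passing a list to `ascending`, and `na_position` controls where missing values end up. This is a minimal sketch on a hypothetical frame named `fares`, with made-up values.

```python
import pandas as pd

# Hypothetical frame with a missing age, used only for illustration
fares = pd.DataFrame({
    "age": [71.0, 71.0, None, 22.0],
    "fare": [34.65, 49.50, 8.46, 7.25],
})

# Oldest first; among passengers of the same age, cheapest fare first
mixed = fares.sort_values(by=["age", "fare"], ascending=[False, True])

# Rows with a missing age go to the top instead of the bottom
missing_first = fares.sort_values(by="age", na_position="first")

print(mixed)
print(missing_first)
```

By default, rows with a missing sort key are placed last regardless of the sort direction.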


## 5. Pandas Unique and Count Functions
In this section, you will see how you can get a list of unique values,
the number of all unique values, and records per unique value from a
column in a Pandas dataframe.

In [72]:
# Importing the required libraries for visualization
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive plots
import seaborn as sns  # Built on top of matplotlib, provides attractive statistical graphics

# Set the aesthetic style of the plots to 'darkgrid'
sns.set_style("darkgrid")

# Load the built-in Titanic dataset from Seaborn
titanic_data = sns.load_dataset('titanic')

# Display the first 5 rows of the Titanic dataset to understand its structure
titanic_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [73]:
# Retrieve the unique values from the "class" column of the Titanic dataset
# This helps identify all distinct passenger classes present in the data (e.g., First, Second, Third)
titanic_data["class"].unique()

['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third']

In [74]:
# This line calculates the number of unique values in the "class" column of the titanic_data DataFrame.
titanic_data["class"].nunique()

3

In [75]:
# Returns the number of non-null (non-NaN) entries in each column of the DataFrame 'titanic_data'
titanic_data.count()

survived       891
pclass         891
sex            891
age            714
sibsp          891
parch          891
fare           891
embarked       889
class          891
who            891
adult_male     891
deck           203
embark_town    889
alive          891
alone          891
dtype: int64
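
Since count() reports non-null entries, the number of missing values per column can be obtained with isna() followed by sum(). The sketch below assumes a small hypothetical frame named `sample` rather than the full Titanic data.

```python
import pandas as pd

# Hypothetical frame with missing values, used only for illustration
sample = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "deck": [None, "C", None],
})

# Non-null entries per column
non_missing = sample.count()

# Null entries per column
missing = sample.isna().sum()

print(non_missing)
print(missing)
```

For every column, the two results add up to the number of rows in the frame.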

In [76]:
# Count the number of occurrences of each unique value in the "class" column of the Titanic dataset
# This is useful to see how many passengers are in each class (e.g., First, Second, Third)
titanic_data["class"].value_counts()

class
Third     491
First     216
Second    184
Name: count, dtype: int64
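
value_counts() can also report proportions instead of raw counts, and it can include missing values. The following sketch uses a hypothetical series named `classes` with one missing entry.

```python
import pandas as pd

# Hypothetical series with one missing value, used only for illustration
classes = pd.Series(
    ["Third", "First", "Third", None, "Second", "Third"], name="class"
)

# normalize=True returns each value's share of the non-missing entries
shares = classes.value_counts(normalize=True)

# dropna=False adds a separate entry for missing values
with_missing = classes.value_counts(dropna=False)

print(shares)
print(with_missing)
```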