# Notebook 02: Data Selection and Cleaning in Pandas 📊  

In this notebook, we will explore essential techniques for **accessing, slicing, filtering, and cleaning data** in Pandas.  
- We'll start with **accessing and slicing** data using methods like `.loc[]`, `.iloc[]`, and boolean indexing.  
- Then, we'll cover **filtering and replacing** values to refine datasets efficiently.  
- Finally, we'll focus on **handling missing data** using functions like `.fillna()`, `.ffill()`, `.bfill()`, and `.dropna()`.  

In [111]:
import pandas as pd 
import numpy as np
students = pd.read_csv('./dataset/students_data.csv')
students

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
2,S1003,Faraz,Male,20.0,Mathematics,,2022,9876543000.0
3,S1004,Danish,Male,21.0,Physics,2.5,2023,
4,S1005,Laiba,Female,22.0,Engineering,3.0,2024,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
7,S1008,Aqib,Male,,Mathematics,2.5,2022,9876543000.0
8,S1009,Taha,Male,21.0,Physics,3.0,2023,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0


## Accessing and Slicing 

- We can access a DataFrame column by putting its name inside square brackets, much like looking up a key in a dictionary. A single name returns a Series; a list of names returns a DataFrame.

In [21]:
names = students['Name'].head(2)
print(type(names))
print('\n',names[0])
names

<class 'pandas.core.series.Series'>

 Ali


0     Ali
1    Umar
Name: Name, dtype: object

In [16]:
names = students[['Name']].head(2)
print(type(names))
names

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Name
0,Ali
1,Umar


- Accessing multiple columns

In [18]:
names = students[['Name','Age','Enrollment_Year']].head(2)
print(type(names))
names

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Name,Age,Enrollment_Year
0,Ali,18.0,2020
1,Umar,19.0,2021


- We can slice DataFrame rows by position with square brackets, just as we slice a list.
- Syntax: `df[start:end:step]`

In [24]:
students[1:7:2]

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
3,S1004,Danish,Male,21.0,Physics,2.5,2023,
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0


In [27]:
# Let's try to slice rows and columns together.
students[0:3,1:4]

InvalidIndexError: (slice(0, 3, None), slice(1, 4, None))

- We can't slice rows and columns together with this approach. Pandas provides two separate indexers for that.
### **loc** and **iloc**
- loc : Accesses rows and columns by label (e.g., df.loc[2, "Name"]).
- iloc : Accesses rows and columns by integer position (e.g., df.iloc[2, 1]).

- Accessing

In [32]:
students.iloc[[0,3],[4,6]] # 5th and 7th columns of the 1st and 4th students

Unnamed: 0,Department,Enrollment_Year
0,Computer Science,2020
3,Physics,2023


In [35]:
students.loc[[0,3,12,5,7,9],['Name','Age','Department','GPA']]

Unnamed: 0,Name,Age,Department,GPA
0,Ali,18.0,Computer Science,3.0
3,Danish,21.0,Physics,2.5
12,Zunaira,20.0,Mathematics,3.0
5,Noor,18.0,Computer Science,3.5
7,Aqib,,Mathematics,2.5
9,Sheri,22.0,Engineering,3.5


- Slicing 

In [37]:
students.iloc[:2,:2] # first two rows and first two columns

Unnamed: 0,Student_ID,Name
0,S1001,Ali
1,S1002,Umar


In [44]:
students.iloc[1::2,1:6] # Every second row, starting from the 2nd row; columns 2 to 6

Unnamed: 0,Name,Gender,Age,Department,GPA
1,Umar,Male,19.0,Business,3.5
3,Danish,Male,21.0,Physics,2.5
5,Noor,Female,18.0,Computer Science,3.5
7,Aqib,Male,,Mathematics,2.5
9,Sheri,Male,22.0,Engineering,3.5
11,Mehak,Female,19.0,Business,2.5
13,Awais,Male,21.0,Physics,
15,Mahnoor,Female,18.0,Computer Science,2.5
17,Hafsa,Female,,Mathematics,3.5
19,Umair,Male,22.0,Engineering,2.5


In [47]:
students.loc[20:,'Name':'GPA']

Unnamed: 0,Name,Gender,Age,Department,GPA
20,Umama,Female,18.0,Computer Science,3.0
21,Hamza,Male,19.0,Business,
22,Saifullah,Male,20.0,Mathematics,4.0
23,Mehwish,Female,21.0,Physics,2.5
24,Shayan,Male,22.0,Engineering,3.0


In [51]:
students.loc[::2,'Gender':'Department']  # Gender, Age and Department of even-indexed rows

Unnamed: 0,Gender,Age,Department
0,Male,18.0,Computer Science
2,Male,20.0,Mathematics
4,Female,22.0,Engineering
6,Female,19.0,Business
8,Male,21.0,Physics
10,Male,18.0,Computer Science
12,Male,20.0,Mathematics
14,Female,,Engineering
16,Female,19.0,Business
18,Female,,Physics
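
With the default index 0, 1, 2, ... labels and positions coincide, so `.loc` and `.iloc` can look interchangeable. The sketch below uses a small hypothetical frame (not the `students` dataset) with a deliberately non-default index to make the difference visible. It also shows that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical mini-frame; the index is deliberately not 0, 1, 2, ...
df = pd.DataFrame(
    {"Name": ["Ali", "Umar", "Faraz", "Danish"],
     "Age": [18, 19, 20, 21]},
    index=[10, 20, 30, 40],
)

# .loc works with labels: the row labelled 20, the column labelled "Name"
by_label = df.loc[20, "Name"]

# .iloc works with positions: the second row, the first column
by_position = df.iloc[1, 0]

# A label slice includes its end label; a position slice excludes its end position
label_slice = df.loc[10:30, "Name"]   # labels 10, 20 and 30
position_slice = df.iloc[0:2, 0]      # positions 0 and 1 only

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Here `df.loc[1, "Name"]` would raise a `KeyError`, because no row carries the label 1.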


## Filtering and Replacing Data
Filtering in Pandas allows us to extract specific rows from a DataFrame based on conditions. We use **comparison operators (`==`, `>`, `<`, `!=`)**, **logical operators (`&`, `|`, `~`)**, and **string methods** to filter data efficiently.

#### 🔹 How to Filter a DataFrame?
`DataFrame[condition]`, where `condition` is a boolean Series holding one True/False value per row.
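
As a minimal sketch of the `|` (or) and `~` (not) operators, the example below builds a small hypothetical frame rather than loading the `students` file. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample rows, shaped like the students dataset
df = pd.DataFrame({
    "Name": ["Ali", "Umar", "Laiba", "Neha"],
    "Department": ["Computer Science", "Business", "Engineering", "Business"],
    "GPA": [3.0, 3.5, 3.0, 4.0],
})

# Each comparison yields a boolean Series. Wrap every comparison in
# parentheses, because & and | bind more tightly than == and >=.
business_or_top = df[(df["Department"] == "Business") | (df["GPA"] >= 4.0)]

# ~ inverts a mask: everyone who is NOT in Business
not_business = df[~(df["Department"] == "Business")]

print(business_or_top["Name"].tolist())
print(not_business["Name"].tolist())
```

Python's `and`, `or`, and `not` do not work element-wise on a Series, which is why the symbols `&`, `|`, and `~` are used instead.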

In [56]:
students[students['Department'] == 'Computer Science'] # Filter students by a specific department

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0
15,S1016,Mahnoor,Female,18.0,Computer Science,2.5,2020,9876543000.0
20,S1021,Umama,Female,18.0,Computer Science,3.0,2020,9876543000.0


In [57]:
students[students["GPA"] > 3.5] #  Filter students with GPA above 3.5

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0
18,S1019,Aliza,Female,,Physics,4.0,2023,
22,S1023,Saifullah,Male,20.0,Mathematics,4.0,2022,9876543000.0


In [61]:
students[students["Enrollment_Year"] > 2023] # Filter students enrolled after 2023

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
4,S1005,Laiba,Female,22.0,Engineering,3.0,2024,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0
14,S1015,Shanza,Female,,Engineering,3.5,2024,9876543000.0
19,S1020,Umair,Male,22.0,Engineering,2.5,2024,9876543000.0
24,S1025,Shayan,Male,22.0,Engineering,3.0,2024,9876543000.0


In [63]:
students[(students["Age"] > 19) & (students["Age"] <= 21)] # Filter students older than 19 and at most 21 (aged 20 or 21)

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
2,S1003,Faraz,Male,20.0,Mathematics,,2022,9876543000.0
3,S1004,Danish,Male,21.0,Physics,2.5,2023,
8,S1009,Taha,Male,21.0,Physics,3.0,2023,9876543000.0
12,S1013,Zunaira,Male,20.0,Mathematics,3.0,2022,9876543000.0
13,S1014,Awais,Male,21.0,Physics,,2023,9876543000.0
22,S1023,Saifullah,Male,20.0,Mathematics,4.0,2022,9876543000.0
23,S1024,Mehwish,Female,21.0,Physics,2.5,2023,9876543000.0


In [64]:
students[(students["Gender"] == "Male") & (students["GPA"] > 3.0)] # Filter male students with GPA above 3.0

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0
22,S1023,Saifullah,Male,20.0,Mathematics,4.0,2022,9876543000.0


In [69]:
students[students["Department"].isin(["Business", "Computer Science"])] # Filter students in the Computer Science and Business departments

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0
11,S1012,Mehak,Female,19.0,Business,2.5,2021,9876543000.0
15,S1016,Mahnoor,Female,18.0,Computer Science,2.5,2020,9876543000.0
16,S1017,Kainat,Female,19.0,Business,,2021,9876543000.0
20,S1021,Umama,Female,18.0,Computer Science,3.0,2020,9876543000.0
21,S1022,Hamza,Male,19.0,Business,,2021,9876543000.0


In [71]:
students[students["Contact"].isna()] # Filter students with missing contact information

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
3,S1004,Danish,Male,21.0,Physics,2.5,2023,
18,S1019,Aliza,Female,,Physics,4.0,2023,


In [72]:
students[students["Age"].isna()] # Filter students with missing age

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
7,S1008,Aqib,Male,,Mathematics,2.5,2022,9876543000.0
14,S1015,Shanza,Female,,Engineering,3.5,2024,9876543000.0
17,S1018,Hafsa,Female,,Mathematics,3.5,2022,9876543000.0
18,S1019,Aliza,Female,,Physics,4.0,2023,


In [73]:
students[(students["Enrollment_Year"] == 2020) & (students["GPA"] > 3.2)] # Filter students who enrolled in 2020 and have a GPA above 3.2

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0


In [76]:
students[students["Name"].str.contains("Ali", case=False, na=False)] # Filter students whose names contain 'Ali'

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
18,S1019,Aliza,Female,,Physics,4.0,2023,
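
The `.str` accessor offers other matchers besides `contains`. The sketch below, on a hypothetical list of names, contrasts a case-insensitive substring match with `startswith`, and shows how `na=False` treats a missing name as a non-match.

```python
import pandas as pd

# Hypothetical names, including one missing value
df = pd.DataFrame({"Name": ["Ali", "Aliza", "Umar", None, "Khalid"]})

# case=False ignores letter case, so "ali" inside "Khalid" also matches;
# na=False makes the missing name count as "no match" instead of NaN
contains_ali = df[df["Name"].str.contains("ali", case=False, na=False)]

# startswith only looks at the beginning of each string (case-sensitive)
starts_with_ali = df[df["Name"].str.startswith("Ali", na=False)]

print(contains_ali["Name"].tolist())
print(starts_with_ali["Name"].tolist())
```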


#### Replacing Data

In [79]:
students.replace(to_replace='Ali',value='Ali Raza').head(3)

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali Raza,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
2,S1003,Faraz,Male,20.0,Mathematics,,2022,9876543000.0


In [100]:
# Replacing multiple values in a column
students['Gender'] = students['Gender'].replace(to_replace=['Male','Female'],value=['M','F'])
students.head(5)

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,M,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,M,19.0,Business,3.5,2021,9876543000.0
2,S1003,Faraz,M,20.0,Mathematics,,2022,9876543000.0
3,S1004,Danish,M,21.0,Physics,2.5,2023,
4,S1005,Laiba,F,22.0,Engineering,3.0,2024,9876543000.0
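
`replace` matches whole cell values, which is why replacing `'Ali'` leaves a name such as `'Aliza'` untouched. The sketch below, using a hypothetical three-row frame, shows two dictionary forms of `replace` and contrasts them with the substring-based `str.replace`.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Name": ["Ali", "Aliza", "Umar"],
    "Gender": ["Male", "Female", "Male"],
})

# Nested dict: {column: {old value: new value}} limits the change to one column
renamed = df.replace({"Name": {"Ali": "Ali Raza"}})

# A flat dict on a single column maps several old values at once
recoded = df["Gender"].replace({"Male": "M", "Female": "F"})

# str.replace works on substrings, so it also rewrites "Aliza"
substring = df["Name"].str.replace("Ali", "Ali Raza", regex=False)

print(renamed["Name"].tolist())
print(recoded.tolist())
print(substring.tolist())
```

None of these calls modifies `df`; each returns a new object that has to be assigned if the change should persist.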


## Handling Missing Data in Pandas  

Missing data occurs when some values in a DataFrame are **NaN (Not a Number) or None**, which can affect analysis. Pandas provides various methods to handle missing data effectively.  
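
A short sketch of why dedicated detection methods are needed: `NaN` never compares equal to anything, including itself, so an ordinary `==` test finds nothing. The Series below is hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA values; None is converted to NaN in a float Series
s = pd.Series([3.0, np.nan, None, 2.5])

# An equality test against NaN is always False
equal_to_nan = (s == np.nan).sum()

# isna() and notna() detect both NaN and None
missing = s.isna().sum()
present = s.notna().sum()

print(equal_to_nan, missing, present)
```

`isnull()` is an alias of `isna()`, so either spelling gives the same result.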

### 🔹 Common Techniques to Handle Missing Data  

 **Detect Missing Values**: `students.isnull().sum()` counts the missing values in each column.


In [104]:
students.isnull().head(8) # True marks a cell whose value is missing

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False
7,False,False,False,True,False,False,False,False


In [106]:
students.isnull().sum()

Student_ID         0
Name               0
Gender             0
Age                4
Department         0
GPA                4
Enrollment_Year    0
Contact            2
dtype: int64

#### Techniques for Filling Missing Data
1. **By a specific value**

In [115]:
students.fillna(0)[11:21]

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
11,S1012,Mehak,Female,19.0,Business,2.5,2021,9876543000.0
12,S1013,Zunaira,Male,20.0,Mathematics,3.0,2022,9876543000.0
13,S1014,Awais,Male,21.0,Physics,0.0,2023,9876543000.0
14,S1015,Shanza,Female,0.0,Engineering,3.5,2024,9876543000.0
15,S1016,Mahnoor,Female,18.0,Computer Science,2.5,2020,9876543000.0
16,S1017,Kainat,Female,19.0,Business,0.0,2021,9876543000.0
17,S1018,Hafsa,Female,0.0,Mathematics,3.5,2022,9876543000.0
18,S1019,Aliza,Female,0.0,Physics,4.0,2023,0.0
19,S1020,Umair,Male,22.0,Engineering,2.5,2024,9876543000.0
20,S1021,Umama,Female,18.0,Computer Science,3.0,2020,9876543000.0


2. **By forward filling** : Fills missing values with the **previous non-null value** in the column.

In [117]:
students.ffill()

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
2,S1003,Faraz,Male,20.0,Mathematics,3.5,2022,9876543000.0
3,S1004,Danish,Male,21.0,Physics,2.5,2023,9876543000.0
4,S1005,Laiba,Female,22.0,Engineering,3.0,2024,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
7,S1008,Aqib,Male,19.0,Mathematics,2.5,2022,9876543000.0
8,S1009,Taha,Male,21.0,Physics,3.0,2023,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0


3. **By backward filling** : Fills missing values with the **next non-null value** in the column.

In [119]:
students.bfill()

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
2,S1003,Faraz,Male,20.0,Mathematics,2.5,2022,9876543000.0
3,S1004,Danish,Male,21.0,Physics,2.5,2023,9876543000.0
4,S1005,Laiba,Female,22.0,Engineering,3.0,2024,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
7,S1008,Aqib,Male,21.0,Mathematics,2.5,2022,9876543000.0
8,S1009,Taha,Male,21.0,Physics,3.0,2023,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0
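
Forward and backward filling each have a blind spot: `ffill` cannot fill a gap at the very start of a column, and `bfill` cannot fill one at the very end. The sketch below illustrates this on a hypothetical Series, and also shows the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA column with gaps at the start, middle and end
gpa = pd.Series([np.nan, 3.5, np.nan, np.nan, 2.5, np.nan])

forward = gpa.ffill()          # the leading NaN has no previous value and stays
backward = gpa.bfill()         # the trailing NaN has no next value and stays
both = gpa.ffill().bfill()     # chaining the two fills every gap here
limited = gpa.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
print(limited.tolist())
```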


### Filling Missing Values with Mean, Median, and Mode  

When handling missing data, we can replace null values with a **statistical measure** of the column, such as its **mean, median, or mode**, so that the filled values stay representative of the observed data.  

###### 🔹 **Mean**  
The **mean** (average) is calculated by summing all values in a dataset and dividing by the total number of values. It is sensitive to outliers.  

###### 🔹 **Median**  
The **median** is the middle value in an **ordered dataset**. If the total number of values is even, it is the average of the two middle values. It is **less affected by outliers** than the mean.

###### 🔹 **Mode**  
The **mode** is the **most frequently occurring value** in a dataset. A dataset can have **one mode (unimodal), multiple modes (multimodal), or no mode** if all values are unique.

These measures are essential for understanding data distribution and handling missing values. 🚀
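
To make the effect of an outlier concrete, here is a minimal sketch on a hypothetical `ages` Series in which the value 60 plays the role of the outlier. All three statistics skip the missing value by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one outlier (60) and one missing value
ages = pd.Series([18, 19, 19, 20, 60, np.nan])

mean_age = ages.mean()       # (18 + 19 + 19 + 20 + 60) / 5, pulled up by the outlier
median_age = ages.median()   # middle of the sorted values 18, 19, 19, 20, 60
mode_age = ages.mode()[0]    # mode() returns a Series; [0] takes the first mode

# Fill the gap with the median, which the outlier barely affects
filled = ages.fillna(median_age)

print(mean_age, median_age, mode_age)
print(filled.tolist())
```

In this sample the mean is about 27.2 while the median and mode are both 19, which is one reason the median is often preferred for skewed columns.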

In [129]:
# By Mean
# print(students["Age"].mean())
# students["Age"].fillna(students["Age"].mean())

# By Mode
# print(students["Age"].mode()[0]) # two modes; taking the first
# students["Age"].fillna(students["Age"].mode()[0])

# By Median
print(students["Age"].median())
students["Age"].fillna(students["Age"].median())

20.0


0     18.0
1     19.0
2     20.0
3     21.0
4     22.0
5     18.0
6     19.0
7     20.0
8     21.0
9     22.0
10    18.0
11    19.0
12    20.0
13    21.0
14    20.0
15    18.0
16    19.0
17    20.0
18    20.0
19    22.0
20    18.0
21    19.0
22    20.0
23    21.0
24    22.0
Name: Age, dtype: float64
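
`fillna` returns a new object and leaves the original untouched, so the result has to be assigned to be kept. It also accepts a dictionary of per-column fill values. The sketch below combines both points on a hypothetical frame; filling `Age` with its median and `GPA` with its mean is an illustrative choice, not a recommendation from the dataset's author.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with one gap in each numeric column
df = pd.DataFrame({
    "Name": ["Faraz", "Danish", "Aqib", "Taha"],
    "Age": [20.0, 21.0, np.nan, 21.0],
    "GPA": [np.nan, 2.5, 3.5, 3.0],
})

# One fill value per column: median for Age, mean for GPA
fill_values = {"Age": df["Age"].median(), "GPA": df["GPA"].mean()}

# fillna returns a new DataFrame; df itself still contains the NaNs
cleaned = df.fillna(fill_values)

print(cleaned)
print(df.isna().sum().sum(), "missing values remain in the original")
```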

### Dropping Missing Values
The `dropna()` function in Pandas is used to **remove missing (NaN) values** from a DataFrame or Series. It helps in cleaning datasets by dropping rows or columns containing null values.  
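
Dropping every row that has any gap can discard a lot of data. As a minimal sketch on a hypothetical frame, the `subset`, `thresh`, and `how` parameters of `dropna()` allow a more selective clean-up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with gaps in different columns
df = pd.DataFrame({
    "Name": ["Ali", "Faraz", "Danish", "Aliza"],
    "Age": [18.0, 20.0, 21.0, np.nan],
    "GPA": [3.0, np.nan, 2.5, 4.0],
    "Contact": [9876543000.0, 9876543000.0, np.nan, np.nan],
})

# Default: drop a row if any of its values is missing
complete_rows = df.dropna()

# subset: only look at the listed columns when deciding
has_gpa = df.dropna(subset=["GPA"])

# thresh: keep rows that have at least this many non-null values
at_least_three = df.dropna(thresh=3)

# how="all": drop a row only if every value in it is missing
not_empty = df.dropna(how="all")

print(complete_rows["Name"].tolist())
print(has_gpa["Name"].tolist())
print(at_least_three["Name"].tolist())
print(len(not_empty))
```

`how` and `thresh` are used in separate calls here because recent pandas versions do not accept both in the same call.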

---

#### **Syntax:**
- inplace : If True, modifies the original DataFrame instead of returning a new one
- how : "any" drops rows/columns with at least one NaN, "all" drops only if all values are NaN
```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```


In [132]:
students.dropna() # Drop rows that have any null value

Unnamed: 0,Student_ID,Name,Gender,Age,Department,GPA,Enrollment_Year,Contact
0,S1001,Ali,Male,18.0,Computer Science,3.0,2020,9876543000.0
1,S1002,Umar,Male,19.0,Business,3.5,2021,9876543000.0
4,S1005,Laiba,Female,22.0,Engineering,3.0,2024,9876543000.0
5,S1006,Noor,Female,18.0,Computer Science,3.5,2020,9876543000.0
6,S1007,Neha,Female,19.0,Business,4.0,2021,9876543000.0
8,S1009,Taha,Male,21.0,Physics,3.0,2023,9876543000.0
9,S1010,Sheri,Male,22.0,Engineering,3.5,2024,9876543000.0
10,S1011,Irfanullah,Male,18.0,Computer Science,4.0,2020,9876543000.0
11,S1012,Mehak,Female,19.0,Business,2.5,2021,9876543000.0
12,S1013,Zunaira,Male,20.0,Mathematics,3.0,2022,9876543000.0


In [134]:
students.dropna(axis=1) # Drop columns that have any null value

Unnamed: 0,Student_ID,Name,Gender,Department,Enrollment_Year
0,S1001,Ali,Male,Computer Science,2020
1,S1002,Umar,Male,Business,2021
2,S1003,Faraz,Male,Mathematics,2022
3,S1004,Danish,Male,Physics,2023
4,S1005,Laiba,Female,Engineering,2024
5,S1006,Noor,Female,Computer Science,2020
6,S1007,Neha,Female,Business,2021
7,S1008,Aqib,Male,Mathematics,2022
8,S1009,Taha,Male,Physics,2023
9,S1010,Sheri,Male,Engineering,2024


## Conclusion  

In this notebook, we explored essential techniques for working with data in Pandas, including **accessing and slicing DataFrames** efficiently. We also learned **how to filter data** based on specific conditions and replace values to **clean and standardize our dataset**. Finally, we covered various methods to **handle missing data, ensuring data integrity for further analysis**. Mastering these concepts is crucial for **effective data manipulation and preprocessing** in Pandas. 🚀
