The Raw Data

Reference Table: This is the raw data provided by the admin.
| Name     | Age     | Marks   | Subject   |
| -------- | ------- | ------- | --------- |
| Aman     | 22      | 85      | Math      |
| Saurav   | 23      | Missing | Physics   |
| Karan    | Missing | 88      | Math      |
| Shubham  | 22      | 90      | Chemistry |
| Abhishek | 25      | 56      | Math      |
| Missing  | 24      | 72      | Physics   |
| Sameer   | 26      | Missing | Chemistry |

Note: “Missing” indicates data was not captured (NaN/None).

In [7]:
import pandas as pd
import numpy as np

# Create the raw dataset
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"]
})
data

       Name   Age  Marks    Subject
0      Aman  22.0   85.0       Math
1    Saurav  23.0    NaN    Physics
2     Karan   NaN   88.0       Math
3   Shubham  22.0   90.0  Chemistry
4  Abhishek  25.0   56.0       Math
5      None  24.0   72.0    Physics
6    Sameer  26.0    NaN  Chemistry



Task 1.1: Create the DataFrame
Construct a pandas DataFrame named df (the solution cells in this notebook call it data) that replicates the "Raw Data" table shown in the previous slide. Use None or np.nan for missing values.

Task 1.2: Basic Inspection
Once created:

Check the shape of the dataset.

Display the data types of the columns.

Generate a statistical summary (mean/min/max) for the numerical columns.

In [8]:

print("DataFrame:")
print(data)

# 1. Shape of dataset
print("\nShape of dataset:")
print(data.shape)

# 2. Data types of columns
print("\nData types:")
print(data.dtypes)

# 3. Statistical summary
print("\nStatistical Summary:")
print(data.describe())

DataFrame:
       Name   Age  Marks    Subject
0      Aman  22.0   85.0       Math
1    Saurav  23.0    NaN    Physics
2     Karan   NaN   88.0       Math
3   Shubham  22.0   90.0  Chemistry
4  Abhishek  25.0   56.0       Math
5      None  24.0   72.0    Physics
6    Sameer  26.0    NaN  Chemistry

Shape of dataset:
(7, 4)

Data types:
Name        object
Age        float64
Marks      float64
Subject     object
dtype: object

Statistical Summary:
             Age      Marks
count   6.000000   5.000000
mean   23.666667  78.200000
std     1.632993  14.254824
min    22.000000  56.000000
25%    22.250000  72.000000
50%    23.500000  85.000000
75%    24.750000  88.000000
max    26.000000  90.000000
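
The task asks only for mean/min/max, while describe() also reports count, std, and quartiles. A minimal sketch of a narrower summary follows, assuming the same dataset rebuilt here so the snippet runs on its own; the variable name summary is illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild the raw dataset so this snippet is self-contained
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"],
})

# Keep only the numeric columns, then compute just the requested statistics.
# Missing values (NaN) are skipped, the same way describe() skips them.
summary = data.select_dtypes(include="number").agg(["mean", "min", "max"])
print(summary)
```

The rows of summary should match the mean, min, and max rows of the describe() output above.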


Exercise 2: Selection & Aggregation

Task: Filter and group the data.

Selection: Create a subset showing only the 'Name' and 'Marks' columns.

Filtering: Select students who have scored more than 80 marks.

Complex Filtering: Select students older than 22 who also scored more than 80.

Aggregation: Count how many students are enrolled in each Subject.

In [9]:

# Selection: only Name and Marks
selection_view = data[["Name", "Marks"]]
print("Selection:\n", selection_view)

# Filtering: Marks > 80 (rows with NaN marks compare as False and are excluded)
filtered_students = data[data["Marks"] > 80]
print("\nStudents scoring > 80:\n", filtered_students)

# Complex Filtering: Age > 22 AND Marks > 80
complex_filter = data[(data["Age"] > 22) & (data["Marks"] > 80)]
print("\nAge > 22 and Marks > 80:\n", complex_filter)

# Aggregation: count students per Subject
# Note: count() tallies non-null Name values, so a row with a missing Name is not counted
aggregation = data.groupby("Subject")["Name"].count()
print("\nStudents per Subject:\n", aggregation)

Selection:
        Name  Marks
0      Aman   85.0
1    Saurav    NaN
2     Karan   88.0
3   Shubham   90.0
4  Abhishek   56.0
5      None   72.0
6    Sameer    NaN

Students scoring > 80:
       Name   Age  Marks    Subject
0     Aman  22.0   85.0       Math
2    Karan   NaN   88.0       Math
3  Shubham  22.0   90.0  Chemistry

Age > 22 and Marks > 80:
 Empty DataFrame
Columns: [Name, Age, Marks, Subject]
Index: []

Students per Subject:
 Subject
Chemistry    2
Math         3
Physics      1
Name: Name, dtype: int64
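
Physics shows 1 student here even though two rows have Subject "Physics": count() on the Name column skips the row whose Name is missing. If the goal is enrollment per subject regardless of whether the name was captured, size() counts rows instead. A minimal sketch, assuming the same dataset rebuilt so it runs standalone; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the raw dataset so this snippet is self-contained
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"],
})

# count() tallies non-null values in the selected column
named_per_subject = data.groupby("Subject")["Name"].count()

# size() tallies rows in each group, including rows with a missing Name
rows_per_subject = data.groupby("Subject").size()

print("Non-null names per Subject:\n", named_per_subject)
print("\nRows per Subject:\n", rows_per_subject)
```

Which of the two is the right answer depends on whether an unnamed record should count as an enrolled student.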


Exercise 3: Sorting & Ranking

Task: Order the data to find rankers.

Sort the dataframe by 'Marks' in descending order.

Use a specific Pandas function to find the top 2 highest scorers (Scholarship Candidates) without sorting the whole dataframe manually.

Identify the student with the lowest marks (Remedial Candidate).

In [10]:
# Sort by Marks (descending); rows with NaN marks are placed last by default
sorted_df = data.sort_values(by="Marks", ascending=False)
print("Sorted by Marks (Descending):\n", sorted_df)

# Top 2 highest scorers (without full sorting)
top2 = data.nlargest(2, "Marks")
print("\nTop 2 Highest Scorers:\n", top2)

# Student with lowest marks
lowest = data.nsmallest(1, "Marks")
print("\nLowest Marks (Remedial Candidate):\n", lowest)

Sorted by Marks (Descending):
        Name   Age  Marks    Subject
3   Shubham  22.0   90.0  Chemistry
2     Karan   NaN   88.0       Math
0      Aman  22.0   85.0       Math
5      None  24.0   72.0    Physics
4  Abhishek  25.0   56.0       Math
1    Saurav  23.0    NaN    Physics
6    Sameer  26.0    NaN  Chemistry

Top 2 Highest Scorers:
       Name   Age  Marks    Subject
3  Shubham  22.0   90.0  Chemistry
2    Karan   NaN   88.0       Math

Lowest Marks (Remedial Candidate):
        Name   Age  Marks Subject
4  Abhishek  25.0   56.0    Math
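
The exercise title mentions ranking, but the cell above only sorts. One way to attach an explicit rank to each student is Series.rank, sketched below. The Rank column and the ranked variable are illustrative names, not part of the original notebook, and the dataset is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the raw dataset so this snippet is self-contained
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"],
})

# Rank 1 = highest marks. method="min" gives tied students the same (lowest) rank.
# Rows with missing marks keep a NaN rank (na_option="keep" is the default).
ranked = data.assign(Rank=data["Marks"].rank(method="min", ascending=False))
print(ranked.sort_values("Rank"))

# idxmin() returns the index label of the lowest non-missing mark
lowest_label = data["Marks"].idxmin()
print("\nRemedial candidate:", data.loc[lowest_label, "Name"])
```

assign returns a new DataFrame, so data itself is left unchanged.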


Exercise 4: Data Cleaning

Task: Handle Missing Values.

Calculate the total number of missing values (NaN) in each column.

Filter and display the rows where 'Marks' are missing.

Create a new dataframe called cleaned_df by removing any row that contains even a single missing value.

Compare the shape of cleaned_df vs original df. 
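
The notebook has no solution cell for this exercise. The following is one possible approach, assuming the same dataset as in the earlier cells (rebuilt here so the snippet is self-contained). Apart from cleaned_df, which the task names, the variable names are illustrative.

```python
import pandas as pd

# Rebuild the raw dataset so this snippet is self-contained
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"],
})

# 1. Missing values per column: isna() flags both None and NaN
missing_per_column = data.isna().sum()
print("Missing values per column:\n", missing_per_column)

# 2. Rows where Marks is missing
missing_marks = data[data["Marks"].isna()]
print("\nRows with missing Marks:\n", missing_marks)

# 3. Drop every row that has at least one missing value (returns a new DataFrame)
cleaned_df = data.dropna()

# 4. Compare shapes
print("\nOriginal shape:", data.shape)
print("Cleaned shape: ", cleaned_df.shape)
```

Four of the seven rows contain at least one missing value, so dropping them leaves three rows. Losing more than half of a dataset this way is a reason to consider filling values or dropping on selected columns only (the subset parameter of dropna) before removing rows wholesale.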

Exercise 5: Critical Thinking

Analyze the Logic:

Technical Q: Look at the 'Marks' column in your output. Why are the values floats (e.g., 85.0) instead of integers?

Technical Q: What is the risk of using inplace=True when dropping missing values?

Business Q: Based on the data, what specific feedback should you give to the admin team regarding the "Name" column?

Answers (Concise)

1️⃣ Why are Marks values floats instead of integers?
Because the column contains missing values. pandas represents them as NaN, which is a floating-point value, and NumPy's integer dtypes cannot hold NaN, so the entire column is stored as float64 to accommodate it.
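
The float conversion can be seen on a standalone Series, and pandas' nullable integer dtype ("Int64", with a capital I) is one way to keep whole-number marks while still allowing missing entries. A minimal sketch, using only the Marks values from the table; the variable names are illustrative.

```python
import pandas as pd

# The Marks values from the raw data, with None for the two missing entries
marks = pd.Series([85, None, 88, 90, 56, 72, None])
print(marks.dtype)   # float64, because NaN forces a float column

# Convert to the nullable integer dtype; missing entries become <NA>
marks_nullable = marks.astype("Int64")
print(marks_nullable.dtype)
print(marks_nullable)
```

The conversion works here because every non-missing value is a whole number; a value such as 85.5 would make astype("Int64") raise an error.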

2️⃣ Risk of using inplace=True when dropping missing values:

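It modifies the original DataFrame itself, so the dropped rows are gone from that object.

Unless you kept a copy or can reload the source, you cannot recover the original data if it is needed later.

It makes debugging harder, since no copy is preserved to compare against.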
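
The difference can be sketched as follows, assuming the same dataset rebuilt so the snippet runs standalone; backup, cleaned, and result are illustrative names.

```python
import pandas as pd

# Rebuild the raw dataset so this snippet is self-contained
data = pd.DataFrame({
    "Name": ["Aman", "Saurav", "Karan", "Shubham", "Abhishek", None, "Sameer"],
    "Age": [22, 23, None, 22, 25, 24, 26],
    "Marks": [85, None, 88, 90, 56, 72, None],
    "Subject": ["Math", "Physics", "Math", "Chemistry", "Math", "Physics", "Chemistry"],
})

backup = data.copy()  # explicit copy taken before any destructive step

# Default behaviour: dropna() returns a new DataFrame and leaves data untouched
cleaned = data.dropna()
print("After dropna():", data.shape, "->", cleaned.shape)

# inplace=True mutates data and returns None
result = data.dropna(inplace=True)
print("Return value with inplace=True:", result)
print("data after inplace drop:", data.shape)
print("backup still intact:", backup.shape)
```

Because the in-place call returns None, writing data = data.dropna(inplace=True) replaces the DataFrame with None, which is a common way to lose the data entirely.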

3️⃣ Feedback to admin team about the "Name" column:
The Name column has a missing entry (row 5, a Physics student aged 24 with 72 marks), which indicates incomplete data entry. Name should be a mandatory, validated field during data collection, so that every record can be tied to a student and data integrity is maintained.