# Reading Student Scores with Pandas

## What is Jupyter Notebook?

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It is widely used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and machine learning.

## Types of Cells

In Jupyter Notebooks, there are primarily two types of cells:

1. **Code Cells**: These cells allow you to write and execute code. You can run the code in these cells, and the output will be displayed directly below the cell. Code cells support various programming languages, but Python is the most commonly used language in Jupyter Notebooks.

2. **Markdown Cells**: These cells are used for writing text in Markdown format. You can use Markdown cells to add explanations, documentation, headings, lists, links, images, and other formatted text to your notebook. Markdown cells do not execute code; they are purely for text content.

## Markdown Syntax

Here are some common Markdown syntax elements you can use in Jupyter Notebooks:
- **Headings**: Use `#` for headings. More `#` symbols indicate smaller headings.
  - Example: `# Heading 1`, `## Heading 2`, `### Heading 3`
- **Bold Text**: Use `**` or `__` to make text bold.
  - Example: `**bold text**` or `__bold text__`
- **Italic Text**: Use `*` or `_` to italicize text.
  - Example: `*italic text*` or `_italic text_`
- **Lists**: Use `-`, `*`, or `+` for unordered lists and numbers for ordered lists.
  - Unordered example: `- Item 1`, `- Item 2`
  - Ordered example: `1. First item`, `2. Second item`

## More on Markdown

- **Links**: Use `[text](URL)` to create hyperlinks.
  - Example: `[OpenAI](https://www.openai.com)`
- **Images**: Use `![alt text](image URL)` to embed images.
  - Example: `![Python Logo](https://www.python.org/static/community_logos/python-logo.png)`

## Markdown Documentation on GitHub

For more detailed information on Markdown syntax, you can refer to the official GitHub documentation: [Mastering Markdown](https://guides.github.com/features/mastering-markdown/).

In [2]:
# Let's Show Datetime, Python and Pandas version
import datetime
import sys
import pandas as pd

print("Current datetime:", datetime.datetime.now())
print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
# A standalone script would end here, but a notebook keeps going:
# we can write more code in the cells below,
# and everything defined in the cells above stays available

Current datetime: 2025-11-30 22:35:48.444100
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
Pandas version: 2.3.3


In [3]:
# let's make some variables
a = 10
b = 20
c = a + b
print("The sum of a and b is:", c)
# this cell is over, but our variables are still alive in the kernel

The sum of a and b is: 30


In [4]:
# so I can add another value to c and print it again
d = 50
c = c + d
print("The new sum of a, b and d is:", c)

The new sum of a, b and d is: 80


## Loading Excel/CSV Files with Pandas

Pandas offers powerful tools for reading and manipulating data from Excel and CSV files. Here are some common functions used to load these files:
- **Reading CSV Files**: Use `pd.read_csv('file_path.csv')` to read a CSV file into a DataFrame.
- **Reading Excel Files**: Use `pd.read_excel('file_path.xlsx', sheet_name='Sheet1')` to read an Excel file into a DataFrame. You can specify the sheet by name or by index. Reading `.xlsx` files requires the `openpyxl` package to be installed.

In [5]:
# let's start with reading a csv file using pandas
data_csv = pd.read_csv('grades_jan.csv')
# read_csv has many extra options
# full documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# print the shape of the dataframe, then show its first 5 rows
print(f"CSV data has {data_csv.shape[0]} rows and {data_csv.shape[1]} columns.")
print(f"Shape of the CSV data: {data_csv.shape}")
print("First 5 rows of the CSV data:")
data_csv.head() # default is 5 rows but we can add a number inside head() function


CSV data has 30 rows and 5 columns.
Shape of the CSV data: (30, 5)
First 5 rows of the CSV data:


Unnamed: 0,student_id,student_name,course,grade,month
0,1,Alice,Math,83,Jan
1,1,Alice,Physics,76,Jan
2,1,Alice,Biology,85,Jan
3,2,Bob,Math,73,Jan
4,2,Bob,Physics,77,Jan
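
The comment above mentions that `read_csv` has many extra options. As a minimal sketch of three of them (`usecols`, `nrows`, and `dtype`), the example below reads from an in-memory string instead of a file. The `csv_text` contents are a hypothetical stand-in for `grades_jan.csv`, not the actual file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file such as grades_jan.csv
csv_text = """student_id,student_name,course,grade,month
1,Alice,Math,83,Jan
1,Alice,Physics,76,Jan
1,Alice,Biology,85,Jan
2,Bob,Math,73,Jan
"""

# usecols keeps only the listed columns,
# nrows limits how many data rows are read,
# dtype forces a column type instead of letting pandas infer it
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["student_name", "course", "grade"],
    nrows=3,
    dtype={"grade": "float64"},
)

print(subset.shape)
print(subset.dtypes)
```

Limiting columns and rows this way can be useful for a quick look at a large file before loading it in full.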


In [6]:
# let's get some info on the dataframe
print("Information about the CSV data:")
data_csv.info() 

Information about the CSV data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   student_id    30 non-null     int64 
 1   student_name  30 non-null     object
 2   course        30 non-null     object
 3   grade         30 non-null     int64 
 4   month         30 non-null     object
dtypes: int64(2), object(3)
memory usage: 1.3+ KB


In [7]:
# let's read an excel file called students.xlsx
students = pd.read_excel('students.xlsx')
# if needed, you can specify the sheet by name or by index
# students = pd.read_excel('students.xlsx', sheet_name='Sheet1')
# shape
print(f"Excel data has {students.shape[0]} rows and {students.shape[1]} columns.")
# head
print("First 7 students from the Excel data:")
students.head(7)

Excel data has 10 rows and 4 columns.
First 7 students from the Excel data:


Unnamed: 0,student_id,first_name,last_name,email
0,1,Alice,Smith,alice.smith@example.com
1,2,Bob,Johnson,bob.johnson@example.com
2,3,Carol,Williams,carol.williams@example.com
3,4,Dave,Brown,dave.brown@example.com
4,5,Eve,Jones,eve.jones@example.com
5,6,Frank,Miller,frank.miller@example.com
6,7,Grace,Davis,grace.davis@example.com


In [8]:
# let's get last 4 students from the Excel data
print("Last 4 students from the Excel data:")
students.tail(4)

Last 4 students from the Excel data:


Unnamed: 0,student_id,first_name,last_name,email
6,7,Grace,Davis,grace.davis@example.com
7,8,Heidi,Wilson,heidi.wilson@example.com
8,9,Ivan,Taylor,ivan.taylor@example.com
9,10,Judy,Anderson,judy.anderson@example.com


In [9]:
# we can also get a random sample of rows
# this can be very handy when working with large datasets
print("Random sample of 5 students from the Excel data:")
students.sample(5)

Random sample of 5 students from the Excel data:


Unnamed: 0,student_id,first_name,last_name,email
8,9,Ivan,Taylor,ivan.taylor@example.com
9,10,Judy,Anderson,judy.anderson@example.com
4,5,Eve,Jones,eve.jones@example.com
0,1,Alice,Smith,alice.smith@example.com
5,6,Frank,Miller,frank.miller@example.com
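
Because `sample` picks rows at random, each run of the cell above can return different students. A minimal sketch of making the sample repeatable with the `random_state` parameter follows. The `students_demo` frame is a made-up stand-in for the Excel data.

```python
import pandas as pd

# Hypothetical stand-in for the students dataframe
students_demo = pd.DataFrame({
    "student_id": range(1, 11),
    "first_name": ["Alice", "Bob", "Carol", "Dave", "Eve",
                   "Frank", "Grace", "Heidi", "Ivan", "Judy"],
})

# The same random_state gives the same rows on every run
sample_a = students_demo.sample(5, random_state=42)
sample_b = students_demo.sample(5, random_state=42)
print(sample_a.equals(sample_b))

# frac samples a fraction of the rows instead of a fixed count
half = students_demo.sample(frac=0.5, random_state=0)
print(len(half))
```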


In [10]:
# you can also get rows by numeric index using iloc
print("Students from index 3 to 7:")
students.iloc[3:8]

Students from index 3 to 7:


Unnamed: 0,student_id,first_name,last_name,email
3,4,Dave,Brown,dave.brown@example.com
4,5,Eve,Jones,eve.jones@example.com
5,6,Frank,Miller,frank.miller@example.com
6,7,Grace,Davis,grace.davis@example.com
7,8,Heidi,Wilson,heidi.wilson@example.com
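
One detail worth checking when slicing: `iloc` works by position and excludes the end of the slice (like Python lists), while `loc` works by label and includes it. A small sketch on a made-up frame with the default 0-based index:

```python
import pandas as pd

# Hypothetical stand-in data with the default RangeIndex (0, 1, 2, ...)
demo = pd.DataFrame({"first_name": ["Alice", "Bob", "Carol", "Dave", "Eve"]})

by_position = demo.iloc[1:3]  # positions 1 and 2; the end position 3 is excluded
by_label = demo.loc[1:3]      # labels 1, 2 and 3; the end label is included

print(by_position["first_name"].tolist())
print(by_label["first_name"].tolist())
```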


In [11]:
# pandas also offers label-based indexing and boolean filtering using loc
# for example, to select the rows where the first_name column equals 'Dave':
students.loc[students['first_name'] == 'Dave']

Unnamed: 0,student_id,first_name,last_name,email
3,4,Dave,Brown,dave.brown@example.com


In [12]:
# we can also get all students with student_id over 5
students.loc[students['student_id'] > 5]

Unnamed: 0,student_id,first_name,last_name,email
5,6,Frank,Miller,frank.miller@example.com
6,7,Grace,Davis,grace.davis@example.com
7,8,Heidi,Wilson,heidi.wilson@example.com
8,9,Ivan,Taylor,ivan.taylor@example.com
9,10,Judy,Anderson,judy.anderson@example.com
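
Filters like the two above can be combined. The sketch below assumes a made-up `students_demo` frame and shows `&` (and), `|` (or), and `isin`. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Hypothetical stand-in for the students dataframe
students_demo = pd.DataFrame({
    "student_id": range(1, 11),
    "first_name": ["Alice", "Bob", "Carol", "Dave", "Eve",
                   "Frank", "Grace", "Heidi", "Ivan", "Judy"],
})

# AND: student_id over 5 and not named Grace
both = students_demo.loc[
    (students_demo["student_id"] > 5) & (students_demo["first_name"] != "Grace")
]

# OR: the first two or the last two ids
either = students_demo.loc[
    (students_demo["student_id"] < 3) | (students_demo["student_id"] > 8)
]

# isin: membership in a list of values
named = students_demo.loc[students_demo["first_name"].isin(["Dave", "Eve"])]

print(both["student_id"].tolist())
print(either["student_id"].tolist())
print(named["student_id"].tolist())
```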


In [13]:
# if I wanted to save some filtered results to a new variable
filtered_students = students.loc[students['student_id'] > 5] # creates a new dataframe
# filtering with a boolean mask returns new data, so changes to filtered_students
# do not affect the students dataframe; however, pandas may show a
# SettingWithCopyWarning when you modify the filtered result
# an explicit .copy() makes the independence clear and avoids that warning
filtered_students_copy = filtered_students.copy()
# show both dataframes
display(filtered_students.head())
# copy
display(filtered_students_copy.head())

Unnamed: 0,student_id,first_name,last_name,email
5,6,Frank,Miller,frank.miller@example.com
6,7,Grace,Davis,grace.davis@example.com
7,8,Heidi,Wilson,heidi.wilson@example.com
8,9,Ivan,Taylor,ivan.taylor@example.com
9,10,Judy,Anderson,judy.anderson@example.com


Unnamed: 0,student_id,first_name,last_name,email
5,6,Frank,Miller,frank.miller@example.com
6,7,Grace,Davis,grace.davis@example.com
7,8,Heidi,Wilson,heidi.wilson@example.com
8,9,Ivan,Taylor,ivan.taylor@example.com
9,10,Judy,Anderson,judy.anderson@example.com
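
A quick way to check how a filtered copy relates to its source is to modify the copy and then look at the original. This is a minimal sketch on made-up data, not the notebook's actual `students` frame.

```python
import pandas as pd

# Hypothetical stand-in for the students dataframe
students_demo = pd.DataFrame({
    "student_id": [4, 5, 6, 7],
    "first_name": ["Dave", "Eve", "Frank", "Grace"],
})

# Filter, then take an explicit copy so later edits are clearly independent
filtered = students_demo.loc[students_demo["student_id"] > 5].copy()

# Modify only the copy
filtered["first_name"] = filtered["first_name"].str.upper()

print(filtered["first_name"].tolist())       # changed
print(students_demo["first_name"].tolist())  # unchanged
```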


In [14]:
# we can use describe to show some statistics about numeric columns
print("Statistical summary of the grade data:")
data_csv.describe()

Statistical summary of the grade data:


Unnamed: 0,student_id,grade
count,30.0,30.0
mean,5.5,76.5
std,2.921384,11.19344
min,1.0,60.0
25%,3.0,67.0
50%,5.5,77.0
75%,8.0,85.0
max,10.0,98.0


In [15]:
# you can also describe non-numeric columns by specifying include='all'
#print("Statistical summary of all columns in the grade data:")
data_csv.describe(include='all')

Unnamed: 0,student_id,student_name,course,grade,month
count,30.0,30,30,30.0,30
unique,,10,3,,1
top,,Alice,Math,,Jan
freq,,3,10,,30
mean,5.5,,,76.5,
std,2.921384,,,11.19344,
min,1.0,,,60.0,
25%,3.0,,,67.0,
50%,5.5,,,77.0,
75%,8.0,,,85.0,
max,10.0,,,98.0,


In [16]:
# we can show columns alone as a list
print("Columns in the CSV data:")
column_list = data_csv.columns.tolist()
print(column_list)

Columns in the CSV data:
['student_id', 'student_name', 'course', 'grade', 'month']


In [17]:
# let's look at the first 8 rows of only the student_name, course and grade columns
data_csv[['student_name', 'course', 'grade']].head(8)
# note the double square brackets [[]] when selecting multiple columns
# essentially we are passing a list of column names to the dataframe

Unnamed: 0,student_name,course,grade
0,Alice,Math,83
1,Alice,Physics,76
2,Alice,Biology,85
3,Bob,Math,73
4,Bob,Physics,77
5,Bob,Biology,85
6,Carol,Math,61
7,Carol,Physics,78


In [18]:
# we could get a single column as well
data_csv['student_name'].head(4)
# here the output was actually a Pandas Series not a DataFrame
# Series is like a single column dataframe with some extra features
# or you can think of Dataframe as a collection of Series objects

0    Alice
1    Alice
2    Alice
3      Bob
Name: student_name, dtype: object
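
The difference between single and double brackets can be checked directly. The sketch below uses a small made-up `grades_demo` frame: single brackets return a `Series`, while double brackets with the same column return a one-column `DataFrame`.

```python
import pandas as pd

# Hypothetical stand-in for the grades dataframe
grades_demo = pd.DataFrame({
    "student_name": ["Alice", "Alice", "Bob"],
    "grade": [83, 76, 73],
})

single = grades_demo["student_name"]    # a Series (one dimension)
double = grades_demo[["student_name"]]  # a DataFrame with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```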

In [19]:
# let's create a new column for students
# they all go to RTU
students['university'] = 'RTU' # so we copied the string 'RTU' to all rows in the new column
# let's take a sample
students.sample(5)

Unnamed: 0,student_id,first_name,last_name,email,university
6,7,Grace,Davis,grace.davis@example.com,RTU
3,4,Dave,Brown,dave.brown@example.com,RTU
8,9,Ivan,Taylor,ivan.taylor@example.com,RTU
4,5,Eve,Jones,eve.jones@example.com,RTU
0,1,Alice,Smith,alice.smith@example.com,RTU
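
A new column does not have to be a constant. As a sketch on made-up data, a column can also be computed from existing columns. The `full_name` column here is a hypothetical example and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the students dataframe
students_demo = pd.DataFrame({
    "first_name": ["Alice", "Bob"],
    "last_name": ["Smith", "Johnson"],
})

# A constant is repeated for every row
students_demo["university"] = "RTU"

# Column arithmetic works row by row; for strings, + concatenates
students_demo["full_name"] = (
    students_demo["first_name"] + " " + students_demo["last_name"]
)

print(students_demo)
```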


In [20]:
# we can rename last_name column to surname
students.rename(columns={'last_name': 'surname'}, inplace=True)
# inplace means that we modify the original dataframe directly
students.head()

Unnamed: 0,student_id,first_name,surname,email,university
0,1,Alice,Smith,alice.smith@example.com,RTU
1,2,Bob,Johnson,bob.johnson@example.com,RTU
2,3,Carol,Williams,carol.williams@example.com,RTU
3,4,Dave,Brown,dave.brown@example.com,RTU
4,5,Eve,Jones,eve.jones@example.com,RTU
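
As an alternative to `inplace=True`, `rename` returns a new dataframe by default and leaves the original untouched. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in for the students dataframe
students_demo = pd.DataFrame({
    "first_name": ["Alice", "Bob"],
    "last_name": ["Smith", "Johnson"],
})

# Without inplace=True the result must be assigned to a variable
renamed = students_demo.rename(columns={"last_name": "surname"})

print(students_demo.columns.tolist())  # original keeps last_name
print(renamed.columns.tolist())        # new frame has surname
```

Assigning the result to a new name keeps the original available, which can make it easier to re-run cells in any order.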


In [21]:
# so let's load february grades as well
feb_grades = pd.read_csv('grades_feb.csv')
# show first 5 rows
feb_grades.head()


Unnamed: 0,student_id,student_name,course,grade,month
0,1,Alice,Math,64,Feb
1,1,Alice,Physics,70,Feb
2,1,Alice,Biology,80,Feb
3,2,Bob,Math,72,Feb
4,2,Bob,Physics,88,Feb


In [22]:
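# so now let's combine our january and february grades into a single dataframe
# concat stacks the rows of both dataframes on top of each other
# both files share the same columns, so the rows line up cleanly
# (a column missing from one dataframe would be filled with NaN)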
all_grades = pd.concat([data_csv, feb_grades], ignore_index=True)
# shape of all_grades
print(f"All grades data has {all_grades.shape[0]} rows and {all_grades.shape[1]} columns.")

All grades data has 60 rows and 5 columns.
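
To illustrate what `ignore_index=True` does and what happens when columns differ, here is a small sketch on two made-up frames. The `jan` frame deliberately lacks the `month` column.

```python
import pandas as pd

# Hypothetical data: jan has no month column, feb does
jan = pd.DataFrame({"student_name": ["Alice", "Bob"], "grade": [83, 73]})
feb = pd.DataFrame({
    "student_name": ["Alice", "Bob"],
    "grade": [64, 72],
    "month": ["Feb", "Feb"],
})

# ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([jan, feb], ignore_index=True)

# Without it, each frame keeps its own row labels, so labels repeat
kept_index = pd.concat([jan, feb])

print(combined.index.tolist())
print(kept_index.index.tolist())
# Rows that came from jan have NaN in the month column
print(combined["month"].isna().sum())
```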


In [23]:
# let's sort the grades by student_name
# note: rows with the same student_name are not guaranteed to keep their original order
all_grades_sorted = all_grades.sort_values(by='student_name')
all_grades_sorted.head(10)

Unnamed: 0,student_id,student_name,course,grade,month
0,1,Alice,Math,83,Jan
32,1,Alice,Biology,80,Feb
31,1,Alice,Physics,70,Feb
30,1,Alice,Math,64,Feb
2,1,Alice,Biology,85,Jan
1,1,Alice,Physics,76,Jan
4,2,Bob,Physics,77,Jan
3,2,Bob,Math,73,Jan
35,2,Bob,Biology,63,Feb
34,2,Bob,Physics,88,Feb


In [24]:
# we can sort by more than one column as well
# let's sort by name and course
# course is the tie breaker when there are multiple entries for the same student_name
all_grades_sorted_multi = all_grades.sort_values(by=['student_name', 'course'])
# there is also a way to sort in descending order: pass ascending=False
all_grades_sorted_multi.head(10)

Unnamed: 0,student_id,student_name,course,grade,month
2,1,Alice,Biology,85,Jan
32,1,Alice,Biology,80,Feb
0,1,Alice,Math,83,Jan
30,1,Alice,Math,64,Feb
1,1,Alice,Physics,76,Jan
31,1,Alice,Physics,70,Feb
5,2,Bob,Biology,85,Jan
35,2,Bob,Biology,63,Feb
3,2,Bob,Math,73,Jan
33,2,Bob,Math,72,Feb
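
Descending order can be set per column by passing a list to `ascending`. The sketch below, on made-up data, sorts names A to Z and, within each name, grades from highest to lowest.

```python
import pandas as pd

# Hypothetical stand-in for the combined grades dataframe
grades_demo = pd.DataFrame({
    "student_name": ["Bob", "Alice", "Alice", "Bob"],
    "course": ["Math", "Math", "Biology", "Biology"],
    "grade": [73, 83, 85, 85],
})

# One True/False per sort column: names ascending, grades descending
ranked = grades_demo.sort_values(
    by=["student_name", "grade"],
    ascending=[True, False],
)

print(ranked)
```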


In [26]:
# let's save our combined grades to a new csv file
all_grades_sorted_multi.to_csv('all_grades_sorted_multi.csv', index=False) # index=False means we do not want to save the row indices

In [27]:
# we can also save to excel
all_grades_sorted_multi.to_excel('all_grades_sorted_multi.xlsx', index=False)
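
To see what `index=False` changes, the sketch below writes a small made-up frame to CSV text both ways and reads it back. It uses in-memory strings instead of files, which is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical stand-in for the combined grades dataframe
grades_demo = pd.DataFrame({"student_name": ["Alice", "Bob"], "grade": [83, 73]})

# to_csv returns the CSV text when no file path is given
without_index = grades_demo.to_csv(index=False)
with_index = grades_demo.to_csv()  # index=True is the default

back_clean = pd.read_csv(io.StringIO(without_index))
back_extra = pd.read_csv(io.StringIO(with_index))

print(back_clean.columns.tolist())
# The saved row index comes back as an extra column named 'Unnamed: 0'
print(back_extra.columns.tolist())
```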