# Python Data Lab

## Resources 
- [Pandas](https://pandas.pydata.org/)
- [NumPy](https://numpy.org/)
- [Markdown Guide](https://www.markdownguide.org/getting-started/)
  
## Goals 
- Become more familiar with Python and learn NumPy and pandas so I can manage and clean data sets for my data analysis work
- Learn how to use Markdown 


In [1]:
# Shell commands such as pip need a leading ! in a notebook cell
# List the installed libraries (the imports themselves come in a later cell)

!pip list

Package                   Version
------------------------- -----------
anyio                     4.12.1
argon2-cffi               25.1.0
argon2-cffi-bindings      25.1.0
arrow                     1.4.0
asttokens                 3.0.1
async-lru                 2.0.5
attrs                     25.4.0
babel                     2.17.0
beautifulsoup4            4.14.3
bleach                    6.3.0
certifi                   2026.1.4
cffi                      2.0.0
charset-normalizer        3.4.4
colorama                  0.4.6
comm                      0.2.3
debugpy                   1.8.19
decorator                 5.2.1
defusedxml                0.7.1
executing                 2.2.1
fastjsonschema            2.21.2
fqdn                      1.5.1
h11                       0.16.0
httpcore                  1.0.9
httpx                     0.28.1
idna                      3.11
ipykernel                 7.1.0
ipython                   9.9.0
ipython_pygments_lexers   1.1.1
ipywidgets          

In [2]:
print('HELLO WORLD')

HELLO WORLD


# Data Lab Project Idea

## SSID Matching when verifying Fall 2 and certification reports

## Description
- When we run the Aeries query report we get a list of students, but Aeries uses real-time data while CALPADS is a snapshot, so we have to run the query as of Census Day and include inactive students
- Even once the Aeries query reflects CALPADS, there can still be discrepancies, such as a student going from EL to RFEP
- In Excel we used sort by color: we took one report (the Aeries one) and highlighted it all blue so we know which rows came from it
- Then we go to the supporting report of the certification report (for example, 2.7 supports 2.4) and highlight the columns
- We want to compare both lists of SSIDs or student IDs and make sure every ID appears in both lists (all duplicates)
- The IDs that are not duplicates are the discrepancies that we check in Aeries


In [3]:
# Import the libraries we need: pathlib for locating the data files and pandas for handling the data
from pathlib import Path
import pandas as pd

In [4]:
# Referencing Our Sample Data
PROJECT_ROOT = Path("..")
DATA_DIR = PROJECT_ROOT / "data"
aeries_path = DATA_DIR / "aeries-mock.csv"
calpads_path = DATA_DIR / "calpads-mock.csv"
print(aeries_path)

..\data\aeries-mock.csv


In [5]:
# Load our data into pandas
df_aeries = pd.read_csv(aeries_path)
df_calpads = pd.read_csv(calpads_path)


# How to preview data
- After .read_csv() has loaded the CSV data, we can preview it with .head() or .tail()
- By default head() shows the first 5 rows and tail() shows the last 5; pass a number to change how many rows come back
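
As a small sketch of the two calls, the cell below builds a throwaway DataFrame and previews both ends of it. The name demo and its values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical 8-row frame, used only to demonstrate head() and tail()
demo = pd.DataFrame({"Student ID": range(10001, 10009)})

first_five = demo.head()    # default: first 5 rows
last_three = demo.tail(3)   # pass a number to change the row count

print(first_five)
print(last_three)
```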

In [6]:
df_aeries.head(12)

Unnamed: 0,Student ID,Last Name,First Name,EL_Status
0,10001,Martinez,Alice,English Learner
1,10002,Ramirez,Carlos,English Learner
2,10003,Lopez,Diana,English Learner
3,10004,Nguyen,Emily,English Learner
4,10005,Johnson,Frank,English Learner
5,10006,Patel,Grace,English Learner
6,10007,Kim,Henry,English Learner
7,10008,Rodriguez,Isabella,English Learner
8,10009,Thompson,Jason,English Learner
9,10010,O’Neill,Kevin,English Learner


In [7]:
df_calpads.head(10)

Unnamed: 0,Student ID,Last Name,First Name,EL_Status
0,10001,Martinez,Alice,English Learner
1,10002,Ramirez,Carlos,English Learner
2,10003,Lopez,Diana,English Learner
3,10004,Nguyen,Emily,English Learner
4,10005,Johnson,Frank,English Learner
5,10006,Patel,Grace,English Learner
6,10007,Kim,Henry,English Learner
7,10008,Rodriguez,Isabella,English Learner
8,10009,Thompson,Jason,English Learner
9,10010,O’Neill,Kevin,English Learner
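
The Unnamed: 0 column in both previews is what pandas shows when the first column of a CSV has no header, which usually happens when a DataFrame index was written out as the file was saved. Assuming that is the case for the mock files here, a minimal sketch of reading that column back as the index could look like this. The inline CSV text is a two-row stand-in for the real file.

```python
import io
import pandas as pd

# Stand-in for a file like aeries-mock.csv: the first header cell is empty
csv_text = (
    ",Student ID,Last Name,First Name,EL_Status\n"
    "0,10001,Martinez,Alice,English Learner\n"
    "1,10002,Ramirez,Carlos,English Learner\n"
)

# Default read: the unnamed first column becomes "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```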


In [8]:
# A column can be referenced with df['column name']

Aeries_STU = df_aeries['Student ID']
Calpads_STU = df_calpads['Student ID']


In [9]:
# Count how many students are in the Aeries query and in the CALPADS data
print(f'There are {int(Aeries_STU.count())} students in Aeries')
print(f'There are {int(Calpads_STU.count())} students in CALPADS')

There are 12 students in Aeries
There are 10 students in CALPADS


In [10]:
# Check which Aeries Student IDs also appear in CALPADS (the same check can be run the other way round)
Aeries_STU.isin(Calpads_STU)

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11    False
Name: Student ID, dtype: bool

In [22]:
# Access the 11th and 12th Aeries students (index 10 and 11), which isin() flagged as False
print(f'Student with STU_ID {int(Aeries_STU[10])} not in CALPADS supporting report')
print(f'Student with STU_ID {int(Aeries_STU[11])} not in CALPADS supporting report')

Student with STU_ID 10011 not in CALPADS supporting report
Student with STU_ID 10012 not in CALPADS supporting report
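
The two lookups above depend on already knowing that positions 10 and 11 came back False. A possible alternative, sketched here with short stand-in Series rather than the real columns, is to invert the isin() mask with ~ and let pandas pull out every unmatched ID.

```python
import pandas as pd

# Stand-in Series (hypothetical, shortened versions of the two ID columns)
aeries_ids = pd.Series([10001, 10002, 10011, 10012], name="Student ID")
calpads_ids = pd.Series([10001, 10002], name="Student ID")

# ~ flips the boolean mask, so True now means "not found in CALPADS"
not_in_calpads = aeries_ids[~aeries_ids.isin(calpads_ids)]

for stu_id in not_in_calpads:
    print(f"Student with STU_ID {stu_id} not in CALPADS supporting report")
```

This way the cell keeps working if the number or position of unmatched students changes.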


# Main Takeaway
- We can check whether the values in one column are in another column
- Since that worked, the possibilities are open: experiment with the DataFrame, access different columns, and run different checks
- Read the documentation to see what else pandas has to offer
- We can flesh out this idea of comparing an Aeries query against CALPADS supporting reports so that the mismatches can be downloaded as a CSV, along with other columns and fields as needed
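
One way the last takeaway could look in practice is sketched below. It assumes two small stand-in DataFrames and a hypothetical output file, mismatches.csv, written to the system temp folder. Neither the frames nor the file name are part of the lab files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frames (hypothetical rows, same column names as the mock CSVs)
df_a = pd.DataFrame({
    "Student ID": [10001, 10002, 10011],
    "EL_Status": ["English Learner"] * 3,
})
df_c = pd.DataFrame({"Student ID": [10001, 10002]})

# Keep the full Aeries rows whose Student ID is missing from CALPADS
mismatches = df_a[~df_a["Student ID"].isin(df_c["Student ID"])]

# Hypothetical output location; index=False avoids writing the row index
# as an extra unnamed column
out_path = Path(tempfile.gettempdir()) / "mismatches.csv"
mismatches.to_csv(out_path, index=False)
print(f"Wrote {len(mismatches)} mismatched row(s) to {out_path}")
```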


These are the tools to explore:

a) isin()

Checks if values in one column exist in another Series.

Example mental workflow:

“Which Student IDs in Aeries also exist in CALPADS?”

Useful for filtering matched vs unmatched rows

b) merge()

Combines two DataFrames on a key column (Student ID)

Options:

how='inner' → only Student IDs found in both

how='left' → all Aeries rows, plus matching CALPADS data

how='outer' → everything from both; values missing from one side show up as NaN

This is how you get side-by-side comparison for each student.
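
A minimal sketch of the outer merge, assuming small stand-in frames with made-up EL_Status values. The indicator=True argument adds a _merge column that says which side each row came from, which is one way to surface missing students.

```python
import pandas as pd

# Stand-in frames; the rows and the RFEP status are hypothetical
aeries = pd.DataFrame({
    "Student ID": [10001, 10002, 10011],
    "EL_Status": ["English Learner", "English Learner", "English Learner"],
})
calpads = pd.DataFrame({
    "Student ID": [10001, 10002],
    "EL_Status": ["English Learner", "RFEP"],
})

merged = aeries.merge(
    calpads,
    on="Student ID",
    how="outer",
    suffixes=("_aeries", "_calpads"),  # label the shared EL_Status columns
    indicator=True,                    # adds _merge: left_only / right_only / both
)

# Students that appear in only one of the two reports
only_one_side = merged[merged["_merge"] != "both"]

# Students in both reports whose status differs
in_both = merged[merged["_merge"] == "both"]
status_diff = in_both[in_both["EL_Status_aeries"] != in_both["EL_Status_calpads"]]

print(only_one_side)
print(status_diff)
```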

c) duplicated()

Checks for duplicate rows in a single column or DataFrame

Can be handy if a CSV has multiple rows for the same Student ID (less likely in the mock data, but real data sometimes has this)
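
A short sketch with a hypothetical ID list in which one Student ID is repeated:

```python
import pandas as pd

# Hypothetical IDs where 10002 appears twice
ids = pd.DataFrame({"Student ID": [10001, 10002, 10002, 10003]})

# keep=False marks every occurrence of a repeated ID, not just the later ones
dupes = ids[ids.duplicated(subset="Student ID", keep=False)]
print(dupes)
```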

d) isnull() + empty string check

Ensures no empty/missing IDs before comparing

Always a good first step
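
A sketch of the combined check, assuming the IDs were read as text. The values below are made up, with one missing entry and one whitespace-only entry.

```python
import pandas as pd

# Hypothetical IDs read as text, with one missing value and one blank string
raw_ids = pd.Series(["10001", None, "  ", "10004"], name="Student ID")

# True where the value is missing OR is empty after trimming whitespace
is_blank = raw_ids.isnull() | (raw_ids.fillna("").str.strip() == "")

clean_ids = raw_ids[~is_blank]
print(f"{is_blank.sum()} blank or missing IDs dropped")
```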

e) value_counts()

Can count occurrences of IDs or statuses

Helps summarize counts after filtering or merging
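
A small sketch on a hypothetical status column. The mix of values is made up for the example.

```python
import pandas as pd

# Hypothetical status column, not taken from the mock CSVs
status = pd.Series(
    ["English Learner", "English Learner", "RFEP", "English Learner"],
    name="EL_Status",
)

# value_counts() returns one row per distinct value, most frequent first
counts = status.value_counts()
print(counts)
```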