## Notes

* asterisk indicates suppressed value (see below)
  * 2,788 suppressed values in 2020-2021 school year
  * Impute median of similar-sized schools for these?
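
One way the imputation idea above might look, as a rough sketch rather than a settled method: bin schools by enrollment and fill each missing count with the median of its bin. The function name `impute_by_size`, the column names, and the toy numbers are all made up for illustration. Because suppression rule (1) covers counts below 5, a bin median above 4 could overstate those cells, so this would need checking against the real data.

```python
import numpy as np
import pandas as pd


def impute_by_size(frame, value_col, size_col, n_bins=4):
    """Fill NaNs in value_col with the median of schools in the same enrollment bin."""
    out = frame.copy()
    # Quantile bins of enrollment; duplicates="drop" guards against tied bin edges
    bins = pd.qcut(out[size_col], q=n_bins, duplicates="drop")
    # Median of the observed (non-missing) values within each size bin
    medians = out.groupby(bins, observed=True)[value_col].transform("median")
    out[value_col] = out[value_col].fillna(medians)
    return out


# Toy data (hypothetical): three small schools and three large ones
toy = pd.DataFrame({
    "enrollment": [100, 120, 110, 900, 950, 1000],
    "cases": [2.0, np.nan, 4.0, 30.0, np.nan, 50.0],
})
filled = impute_by_size(toy, "cases", "enrollment", n_bins=2)
print(filled)
```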

"* indicates cell has been suppressed, a blank cell indicates no report has been received for a given district in the indicated time period, a 0 indicates that a report was received and no cases were reported for that group in the reported time period. Single-campus student cases and sources of infection are suppressed when (1) reported student cases are fewer than 5, (2) a campus has at least a 90% student positivity rate when on-campus enrollment for a school is at least 15 students, or (3) a campus has at least a 50% positivity rate when on-campus enrollment has fewer than 15 students. If only one campus in a district has suppressed student numbers then student and source of infection numbers for the campus with the next smallest numbers of positive students are also suppressed. Cumulative student cases and sources of infection numbers for a campus are suppressed when (1) student cases are less than five, or (2) current report numbers have been suppressed for the first three weeks that student cases are reported. If there is only one campus reporting in a district and it is a multiple campus, student and source of infection numbers are not suppressed for the district total. Otherwise, district totals are suppressed when (1) student cases are fewer than 5, or (2) a district has at least a 90% student positivity rate when total district enrollment is least 15 students, or, (3) a district has at least a 50% positivity rate when total district enrollment has fewer than 15 students, or (4) cases on a campus have been suppressed for the first three weeks that student cases are reported and there are fewer than 5 campuses reporting in a district."

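The single-campus weekly rules (1) to (3) quoted above can be written as a small predicate. This is only my reading of the text, with two assumptions flagged in the comments: positivity is taken as student cases divided by on-campus enrollment, and a reported 0 is treated as not suppressed (the quote says a 0 is shown as 0). The function name is hypothetical, and the secondary rule about the campus with the next smallest count is not modeled.

```python
def campus_cases_suppressed(student_cases, on_campus_enrollment):
    """Return True if a weekly single-campus student count would be starred out."""
    # Assumption: a reported 0 is displayed, so only 1-4 cases trigger rule (1)
    if student_cases == 0:
        return False
    # Rule (1): fewer than 5 reported student cases
    if student_cases < 5:
        return True
    if on_campus_enrollment <= 0:
        return False  # no denominator to compute a rate from
    # Assumption: positivity rate = cases / on-campus enrollment
    rate = student_cases / on_campus_enrollment
    # Rule (2): at least 90% positivity when enrollment is at least 15
    if on_campus_enrollment >= 15:
        return rate >= 0.90
    # Rule (3): at least 50% positivity when enrollment is below 15
    return rate >= 0.50


print(campus_cases_suppressed(3, 200))   # small count
print(campus_cases_suppressed(20, 400))  # ordinary campus
print(campus_cases_suppressed(7, 12))    # tiny campus, high rate
```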
In [1]:
import pandas as pd
import numpy as np
import missingno as msno
import requests
from bs4 import BeautifulSoup
import re



In [None]:
# Loading Excel file to get sheet names

sheets1 = pd.ExcelFile('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls')
print(sheets1.sheet_names)

# Loading one dataframe to get column names

nov22 = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', sheet_name="Campus Report_November 22", header=5)
print(nov22.head(20))

In [2]:
# NEW LOOP FOR LOADING, CLEANING, AND MERGING ORIGINAL SHEETS

# List of sheet names from original Excel file (excludes district reports)

sheets = ['Campus Report_August 1', 'Campus Report_July 25', 'Campus Report_July 18', 'Campus Report_July 11', 
          'Campus Report_July 04', 'Campus Report_June 27', 'Campus Report_June 20', 'Campus Report_June13', 
          'Campus Report_June 6', 'Campus Report_May 30', 'Campus Report_May 23', 'Campus Report_May 16', 
          'Campus Report_May 09', 'Campus Report_May 02', 'Campus Report_April 25', 'Campus Report_April 18', 
          'Campus Report_April 11', 'Campus Report_April 4', 'Campus Report_March 28', 'Campus Report_March 21', 
          'Campus Report_March 14', 'Campus Report_March 7', 'Campus Report_February 28', 
          'Campus Report_February 21', 'Campus Report_February 14', 'Campus Report_February 7', 
          'Campus Report_January 31', 'Campus Report_January 24', 'Campus Report_January 17', 
          'Campus Report_January 12', 'Campus Report_January 5th', 'Campus Report_December 29', 
          'Campus Report_December 20', 'Campus Report_December 15', 'Campus Report_December 6', 
          'Campus Report_November 29']

# Reversing list to get reports in order from beginning of school year

sheets.reverse()

# Getting first weekly report as base dataframe

df = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', sheet_name="Campus Report_November 22", header=5)

# Renaming columns that will be kept (at least at first --
# I'm sure I'll remove many of them once I figure out the exact analyses)
# REMEMBER THAT DATED COLUMNS REPRESENT TOTAL STUDENT CASES IN THAT REPORT

df.rename(columns = {'District Name':'District', 'District\nLEA\nNumber':'Dist LEA', 
                    'Total District\nEnrollment as\nof October 30, 2020':'Dist Enrollment 10/30/20', 
                    'Approximate\nDistrict On Campus\nEnrollment as of\nOctober 30, 2020':'Dist On-Campus Enrollment 10/30/20', 
                    'Total School\nEnrollment as\nof October 30, 2020':'Sch Enrollment 10/30/2020', 
                    'On-Campus\nEnrollment for\nSchool as of\nOctober 30, 2020':'Sch On-Campus Enrollment 10/30/2020', 
                    'Campus\nID':'Campus ID', 'Total\nStudent\nCases':'Nov22', 
                    'Total\nStaff\nCases':'Total staff cases_Nov22'},
                     inplace=True)

# Replace 'multiple campus' listings with NaN campus ID
# (assigning back instead of chained inplace=True, which can silently
# operate on a copy in newer pandas)

df['Campus ID'] = df['Campus ID'].replace('Multiple\nCampus', np.nan)
df['Campus ID'] = df['Campus ID'].str.strip("'").astype('float')

# Convert asterisks (suppressed values) and blank cells to NaN
# Using float so NaNs can coexist in column

df['Nov22'] = df['Nov22'].replace(['*', ' '], np.nan).astype('float')

# Drop rows with no campus ID, then convert campus ID to int

df.dropna(subset=['Campus ID'], inplace=True)
df['Campus ID'] = df['Campus ID'].astype('int')

# Empty list for adding abbreviations (to be appended to column names)

abbrevs=[]

# List for raw dataframes

ds = []

for i, sheet in enumerate(sheets):
    
    # Loading sheets into dataframes in list
    
    ds.append(pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', 
                                                sheet_name = sheet, header=5))
    
    # Create date abbreviation for student cases columns
    
    if sheet[-2] == " ":
        abbrev = sheet[14:17] + sheet[-1]
    else:
        abbrev = sheet[14:17] + sheet[-2:]
        
    if abbrev == 'Janth':
        abbrev = "Jan5" 
    
    # Add abbreviations to list
    
    abbrevs.append(abbrev)
    
    # Renaming student total column (used columns.values because not all
    # sheets use the same column headers)
    
    ds[i].rename(columns = {'Campus\nID': 'Campus ID',
                            ds[i].columns.values[13]: abbrevs[i]}, inplace=True)  
    
    # Cutting each dataframe down to only campus ID and student cases
    # (.copy() so the assignments below don't act on a slice of the full sheet)
                            
    ds[i] = ds[i][['Campus ID', abbrevs[i]]].copy()
                            
    # Replace 'multiple campus' listings with NaN campus ID and drop rows with
    # no campus ID
    
    ds[i]['Campus ID'] = ds[i]['Campus ID'].replace('Multiple\nCampus', np.nan)
    ds[i].dropna(subset=['Campus ID'], inplace=True)
                            
    # Remove leading apostrophe and convert to int
                            
    ds[i]['Campus ID'] = ds[i]['Campus ID'].str.strip("'").astype('int')

    # Convert asterisks (suppressed values) and blank spaces to NaN
    # Using float again so NaNs can coexist in column
    
    ds[i][abbrevs[i]] = ds[i][abbrevs[i]].replace(['*', ' '], np.nan).astype('float')
    
    # One campus has duplicated reports (can detail this in report later),
    # and it looks like the second report is more likely to be accurate.
    
    ds[i].drop_duplicates(subset='Campus ID', keep='last', inplace=True)
    
    # Merge all data frames to base dataframe
    
    df = df.merge(ds[i], how='inner', on='Campus ID')
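
The slicing logic that builds the date abbreviations in the loop above could also be done with a regular expression, which handles 'June13' and 'January 5th' without special cases. This is a sketch, and `sheet_abbrev` is a hypothetical helper name. It assumes every sheet name follows the 'Campus Report_<Month> <day>' pattern seen in the list.

```python
import re


def sheet_abbrev(sheet_name):
    """Turn 'Campus Report_January 5th' into 'Jan5', keeping the day digits as written."""
    match = re.fullmatch(
        r"Campus Report_([A-Za-z]+)\s*(\d+)(?:st|nd|rd|th)?", sheet_name
    )
    if match is None:
        raise ValueError(f"Unexpected sheet name: {sheet_name!r}")
    month, day = match.groups()
    # First three letters of the month plus the day exactly as it appears
    return month[:3] + day


for name in ["Campus Report_August 1", "Campus Report_June13",
             "Campus Report_January 5th", "Campus Report_July 04"]:
    print(name, "->", sheet_abbrev(name))
```

A sheet name that does not fit the pattern raises a `ValueError` instead of silently producing an odd column label.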

In [3]:
# Note: Looks like most schools have the same case totals across
# the last 4–5 reports, so I thought I might have some sort of
# error, but I did find one campus (index 1185) with different
# numbers in Jul18 and Jul25, so I think this probably worked.

# Mean cases goes up every week (just barely)

print(df.sample(20))

                      District Dist LEA Dist Enrollment 10/30/20  \
6622               MIDLAND ISD  '165901                    25703   
7461                CENTER ISD  '210901                     2492   
1763             COMMUNITY ISD  '043918                     2752   
7377      MOUNT ENTERPRISE ISD  '201907                      400   
6772                CONROE ISD  '170902                    64575   
2724                COOPER ISD  '060902                      806   
2023  GATEWAY CHARTER\nACADEMY  '057831                      782   
9161                 HUTTO ISD  '246906                     8432   
299                 SALADO ISD  '014908                     2106   
8270            SAN ANGELO ISD  '226903                    14097   
5555               LA JOYA ISD  '108912                    26633   
7736            FORT WORTH ISD  '220905                    77276   
1315          POINT ISABEL ISD  '031909                     2069   
1681                 PLANO ISD  '043910         

In [4]:
def pullstats(campusid):
    
    '''Grabs percentage of economically disadvantaged students, average class size on campus,
    STAAR performance rates for 'meets grade level or above', and total operating expenditures
    per student.'''
    
    campus_stats = pd.DataFrame({'Campus ID': pd.Series(dtype='int'), 
                                 'Econ disadv': pd.Series(dtype='float'),
                                 'Avg class': pd.Series(dtype='float'),
                                 'STAAR 2021': pd.Series(dtype='float'),
                                 'Spending': pd.Series(dtype='float')})

    campus_stats['Campus ID'] = [campusid]
                                 
    # Request TEA school report card corresponding to campus ID
    
    r = requests.get(f'https://rptsvr1.tea.texas.gov/cgi/sas/broker?_service=marykay&_program=perfrept.perfmast.sas&_debug=0&ccyy=2022&lev=C&id={campusid}&prgopt=reports%2Fsrc%2Fsrc.sas')
    
    # Parse HTML
    
    soup = BeautifulSoup(r.text, 'html.parser')

    # Find percentage of economically disadvantaged students on campus
    # Note: state average 60.7%

    econ = soup.find('td', string=re.compile('Economically Disadvantaged'))

    for i in range(3):
        econ = econ.next_element
    
    campus_stats['Econ disadv'] = [pd.to_numeric(econ.get_text(strip=True).strip('%')) / 100]

    # Gets average class size on campus, only taking into account listed values
    # Note: state averages listed by class

    classes = ['Kindergarten', 'Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5', 'Grade 6', 
              'English/Language Arts', 'Foreign Languages', 'Mathematics', 'Science', 'Social Studies']
    class_sizes = []

    for cl in classes:
        size = soup.find('td', string=re.compile(cl))

        # Not totally sure why, but the campus number is three elements away

        for i in range (3):
            size = size.next_element

        class_sizes.append(pd.to_numeric(size.text, errors='coerce'))

    # np.nanmean ignores NaNs and calculates mean of numbers

    campus_stats['Avg class'] = [np.nanmean(class_sizes)]

    # Get "STAAR Performance Rates at Meets Grade Level or Above" from all subjects in 2021
    # Note: 2021 is closer to start of pandemic and infection stats (double-check?)
    # Note: actual figures from 2019 are in PDF form

    staar = soup.find_all('th', string=re.compile('2021'))[5]

    for i in range(6):
        staar = staar.next_sibling
    
    campus_stats['STAAR 2021'] = [pd.to_numeric((staar.get_text(strip=True).rstrip('%')), errors='coerce') / 100]

    # Expenditures per student

    exp = soup.find('td', string=re.compile('Total Operating Expenditures'))

    for i in range(2):
        exp = exp.next_sibling

    campus_stats['Spending'] = [pd.to_numeric((exp.get_text(strip=True)).replace(',', '').strip('$'), errors='coerce')]
    
    return campus_stats

In [8]:
stats = pd.DataFrame()

for campus in df.sample(10)['Campus ID']:
    print(campus)
    print(pullstats(campus))


247903003


  campus_stats['Avg class'] = [np.nanmean(class_sizes)]


IndexError: list index out of range
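
The `IndexError` above most likely comes from indexing `find_all(...)[5]` on a report card with fewer matching header cells than expected. Similarly, `soup.find` returns `None` when nothing matches, which would surface as an `AttributeError`. One possible way to keep the loop going is to catch those errors per campus and record which IDs failed. In the sketch below, `collect_stats` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `pullstats` so the example runs without a network call.

```python
def collect_stats(campus_ids, fetch):
    """Call fetch(campus_id) for each ID, keeping results and failures separate."""
    results, failures = [], []
    for campus_id in campus_ids:
        try:
            results.append(fetch(campus_id))
        except (IndexError, AttributeError) as err:
            # IndexError: fewer matching cells than expected
            # AttributeError: a lookup returned None
            failures.append((campus_id, type(err).__name__))
    return results, failures


def fake_fetch(campus_id):
    """Hypothetical stand-in for pullstats that fails on one campus."""
    if campus_id == 247903003:
        return [][5]  # mimics find_all(...)[5] on a list that is too short
    return {"Campus ID": campus_id}


results, failures = collect_stats([220901008, 247903003], fake_fetch)
print(results)
print(failures)
```

With the real `pullstats`, the successful one-row dataframes could then be combined with `pd.concat(results, ignore_index=True)`, and the failed IDs inspected by hand.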

In [None]:
# List of sheet names from original Excel file (excludes district reports)

sheets = ['Campus Report_August 1', 'Campus Report_July 25', 'Campus Report_July 18', 'Campus Report_July 11', 
          'Campus Report_July 04', 'Campus Report_June 27', 'Campus Report_June 20', 'Campus Report_June13', 
          'Campus Report_June 6', 'Campus Report_May 30', 'Campus Report_May 23', 'Campus Report_May 16', 
          'Campus Report_May 09', 'Campus Report_May 02', 'Campus Report_April 25', 'Campus Report_April 18', 
          'Campus Report_April 11', 'Campus Report_April 4', 'Campus Report_March 28', 'Campus Report_March 21', 
          'Campus Report_March 14', 'Campus Report_March 7', 'Campus Report_February 28', 
          'Campus Report_February 21', 'Campus Report_February 14', 'Campus Report_February 7', 
          'Campus Report_January 31', 'Campus Report_January 24', 'Campus Report_January 17', 
          'Campus Report_January 12', 'Campus Report_January 5th', 'Campus Report_December 29', 
          'Campus Report_December 20', 'Campus Report_December 15', 'Campus Report_December 6', 
          'Campus Report_November 29']

# Reversing list to get reports in order from beginning of school year

sheets.reverse()

# Getting first weekly report as base dataframe

df = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', sheet_name="Campus Report_November 22", header=5)

# Renaming columns that will be kept (at least at first --
# I'm sure I'll remove many of them once I figure out the exact analyses)

df.rename(columns = {'District Name':'District', 'District\nLEA\nNumber':'Dist LEA', 
                    'Total District\nEnrollment as\nof October 30, 2020':'Dist Enrollment 10/30/20', 
                    'Approximate\nDistrict On Campus\nEnrollment as of\nOctober 30, 2020':'Dist On-Campus Enrollment 10/30/20', 
                    'Total School\nEnrollment as\nof October 30, 2020':'Sch Enrollment 10/30/2020', 
                    'On-Campus\nEnrollment for\nSchool as of\nOctober 30, 2020':'Sch On-Campus Enrollment 10/30/2020', 
                    'Campus\nID':'Campus ID', 'Total\nStudent\nCases':'Total student cases_Nov22', 
                    'Total\nStaff\nCases':'Total staff cases_Nov22'},
                     inplace=True)

# Empty list for adding abbreviations (to be appended to column names)

abbrevs=[]

# List for raw dataframes

ds = []

for i, sheet in enumerate(sheets):
    
    # Loading sheets into dataframes (d2–d37)
    # Starts with d2 because of base dataframe above
    
    globals()['d' + str(i + 2)] = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', 
                                                sheet_name = sheet, header=5)
    
    # Renaming student and staff total columns (used columns.values because not all
    # sheets use the same column headers)
    
    globals()['d' + str(i+2)].rename(columns = {globals()['d' + str(i+2)].columns.values[13]: 
                                                          'Total student cases', 
                                                globals()['d' + str(i+2)].columns.values[14]: 
                                                            'Total staff cases'}, inplace=True)
    
    # Add name of raw dataframe to ds list for iterating over later
    
    ds.append(str('d' + str(i + 2)))
    
    # Create abbreviations for appending to columns in final merged dataframe
    
    if sheet[-2] == " ":
        abbrev = sheet[14:17] + sheet[-1]
    else:
        abbrev = sheet[14:17] + sheet[-2:]
        
    if abbrev == 'Janth':
        abbrev = "Jan5" 
    
    # Add abbreviations to list
    
    abbrevs.append(abbrev)

In [None]:
# List for dataframes with only columns of interest

dfs = []

for i, d in enumerate(ds):
    
    # Create empty dataframe to append the updated columns from raw df
    
    globals()['df' + str(i+2)] = pd.DataFrame()
    
    # Replace 'multiple campus' listing with NaNs
    # Add to new dataframe as float (float instead of int because it's nullable)
    
    globals()[d]['Campus\nID'].replace('Multiple\nCampus', np.nan, inplace=True) 
    globals()['df' + str(i+2)]['Campus ID'] = globals()[d]['Campus\nID'].str.strip("'").astype('float')
    
    # Convert asterisks (suppressed values) and blank spaces to NaN, add to new dataframe
    # Using float again so NaNs can coexist in column
    
    globals()[d]['Total student cases'].replace([r'*', ' '], np.nan, inplace=True)
    globals()[d]['Total staff cases'].replace([r'*', ' '], np.nan, inplace=True)
    
    globals()['df' + str(i+2)]['Total student cases_' + abbrevs[i]] = globals()[d]['Total student cases'].astype('float')
    globals()['df' + str(i+2)]['Total staff cases_' + abbrevs[i]] = globals()[d]['Total staff cases'].astype('float')
    
    # Add name of new dataframe to dfs list
    
    dfs.append(str('df' + str(i+2)))

In [None]:
# Replace 'multiple campus' listing in base dataframe with NaN
# Update base dataframe dtypes to match

df['Campus ID'].replace('Multiple\nCampus', np.nan, inplace=True) 
df['Campus ID'] = df['Campus ID'].str.strip("'").astype('float')
    
# Convert asterisks (suppressed values) to NaN, add to new dataframe
# Using float again so NaNs can coexist in column
    
df['Total student cases_Nov22'].replace([r'*', ' '], np.nan, inplace=True)
df['Total staff cases_Nov22'].replace([r'*', ' '], np.nan, inplace=True)
    
df['Total student cases_Nov22'] = df['Total student cases_Nov22'].astype('float')
df['Total staff cases_Nov22'] = df['Total staff cases_Nov22'].astype('float')

In [None]:
# Drop rows where there is no campus ID
# Loop for dataframes in dfs list, base dataframe after

for i in dfs:
    
    globals()[i].dropna(subset='Campus ID', inplace=True)
    globals()[i]['Campus ID'] = globals()[i]['Campus ID'].astype(int)
    
df.dropna(subset='Campus ID', inplace=True)
df['Campus ID'] = df['Campus ID'].astype(int)

In [None]:
# One campus has duplicated reports (can detail this in report later),
# and it looks like the second report is more likely to be accurate.

for i in dfs:
    
    globals()[i].drop_duplicates(subset='Campus ID', keep='last', inplace=True)

In [None]:
for i in dfs:
    
    # Merge all dataframes
    
    df = df.merge(globals()[i], how='inner', on='Campus ID')
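
The cells above keep each sheet under a generated global name (`d2`, `df2`, and so on). A dictionary keyed by date abbreviation is one alternative that avoids `globals()` and makes the merge a single fold. The sketch below uses two small made-up frames in place of the real sheets.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins (hypothetical values) for two cleaned weekly sheets
frames = {
    "Nov29": pd.DataFrame({"Campus ID": [1, 2, 3], "Nov29": [0.0, 5.0, 7.0]}),
    "Dec6": pd.DataFrame({"Campus ID": [1, 2, 2], "Dec6": [1.0, 6.0, 8.0]}),
}

# Keep the last report when a campus appears twice, as in the cells above
cleaned = [
    frame.drop_duplicates(subset="Campus ID", keep="last")
    for frame in frames.values()
]

# Fold the list into one dataframe with successive inner merges on campus ID
merged = reduce(
    lambda left, right: left.merge(right, how="inner", on="Campus ID"),
    cleaned,
)
print(merged)
```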

In [None]:
df.columns

In [None]:
# List of dates for looking at all reports, just student reports, and
# just staff reports.

all_dates = ['Total student cases_Nov22', 'Total staff cases_Nov22', 'Total student cases_Nov29', 
             'Total staff cases_Nov29', 'Total student cases_Dec6', 'Total staff cases_Dec6', 
             'Total student cases_Dec15', 'Total staff cases_Dec15', 'Total student cases_Dec20', 
             'Total staff cases_Dec20', 'Total student cases_Dec29', 'Total staff cases_Dec29', 
             'Total student cases_Jan5', 'Total staff cases_Jan5', 'Total student cases_Jan12', 
             'Total staff cases_Jan12', 'Total student cases_Jan17', 'Total staff cases_Jan17', 
             'Total student cases_Jan24', 'Total staff cases_Jan24', 'Total student cases_Jan31', 
             'Total staff cases_Jan31', 'Total student cases_Feb7', 'Total staff cases_Feb7', 
             'Total student cases_Feb14', 'Total staff cases_Feb14', 'Total student cases_Feb21', 
             'Total staff cases_Feb21', 'Total student cases_Feb28', 'Total staff cases_Feb28', 
             'Total student cases_Mar7', 'Total staff cases_Mar7', 'Total student cases_Mar14', 
             'Total staff cases_Mar14', 'Total student cases_Mar21', 'Total staff cases_Mar21', 
             'Total student cases_Mar28', 'Total staff cases_Mar28', 'Total student cases_Apr4', 
             'Total staff cases_Apr4', 'Total student cases_Apr11', 'Total staff cases_Apr11', 
             'Total student cases_Apr18', 'Total staff cases_Apr18', 'Total student cases_Apr25', 
             'Total staff cases_Apr25', 'Total student cases_May02', 'Total staff cases_May02', 
             'Total student cases_May09', 'Total staff cases_May09', 'Total student cases_May16', 
             'Total staff cases_May16', 'Total student cases_May23', 'Total staff cases_May23', 
             'Total student cases_May30', 'Total staff cases_May30', 'Total student cases_Jun6', 
             'Total staff cases_Jun6', 'Total student cases_Jun13', 'Total staff cases_Jun13', 
             'Total student cases_Jun20', 'Total staff cases_Jun20', 'Total student cases_Jun27', 
             'Total staff cases_Jun27', 'Total student cases_Jul04', 'Total staff cases_Jul04', 
             'Total student cases_Jul11', 'Total staff cases_Jul11', 'Total student cases_Jul18', 
             'Total staff cases_Jul18', 'Total student cases_Jul25', 'Total staff cases_Jul25', 
             'Total student cases_Aug1', 'Total staff cases_Aug1']

student_dates = ['Total student cases_Nov22', 'Total student cases_Nov29', 'Total student cases_Dec6', 
       'Total student cases_Dec15', 'Total student cases_Dec20', 'Total student cases_Dec29', 
       'Total student cases_Jan5', 'Total student cases_Jan12', 'Total student cases_Jan17', 
       'Total student cases_Jan24', 'Total student cases_Jan31', 'Total student cases_Feb7', 
       'Total student cases_Feb14', 'Total student cases_Feb21', 'Total student cases_Feb28', 
       'Total student cases_Mar7', 'Total student cases_Mar14', 'Total student cases_Mar21', 
       'Total student cases_Mar28', 'Total student cases_Apr4', 'Total student cases_Apr11', 
       'Total student cases_Apr18', 'Total student cases_Apr25', 'Total student cases_May02', 
       'Total student cases_May09', 'Total student cases_May16', 'Total student cases_May23', 
       'Total student cases_May30', 'Total student cases_Jun6', 'Total student cases_Jun13', 
       'Total student cases_Jun20', 'Total student cases_Jun27', 'Total student cases_Jul04', 
       'Total student cases_Jul11', 'Total student cases_Jul18', 'Total student cases_Jul25', 
       'Total student cases_Aug1']

staff_dates = ['Total staff cases_Nov22', 'Total staff cases_Nov29', 'Total staff cases_Dec6', 
       'Total staff cases_Dec15', 'Total staff cases_Dec20', 'Total staff cases_Dec29', 
       'Total staff cases_Jan5', 'Total staff cases_Jan12', 'Total staff cases_Jan17', 
       'Total staff cases_Jan24', 'Total staff cases_Jan31', 'Total staff cases_Feb7', 
       'Total staff cases_Feb14', 'Total staff cases_Feb21', 'Total staff cases_Feb28', 
       'Total staff cases_Mar7', 'Total staff cases_Mar14', 'Total staff cases_Mar21', 
       'Total staff cases_Mar28', 'Total staff cases_Apr4', 'Total staff cases_Apr11', 
       'Total staff cases_Apr18', 'Total staff cases_Apr25', 'Total staff cases_May02', 
       'Total staff cases_May09', 'Total staff cases_May16', 'Total staff cases_May23', 
       'Total staff cases_May30', 'Total staff cases_Jun6', 'Total staff cases_Jun13', 
       'Total staff cases_Jun20', 'Total staff cases_Jun27', 'Total staff cases_Jul04', 
       'Total staff cases_Jul11', 'Total staff cases_Jul18', 'Total staff cases_Jul25', 
       'Total staff cases_Aug1']
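
Since the column names follow a fixed pattern, the three lists above could be generated from the abbreviations instead of typed out, which avoids copy-paste slips. A short sketch, using only the first three abbreviations for brevity:

```python
# In the notebook this would be 'Nov22' followed by the abbrevs list built earlier
abbrevs = ["Nov22", "Nov29", "Dec6"]

student_dates = [f"Total student cases_{a}" for a in abbrevs]
staff_dates = [f"Total staff cases_{a}" for a in abbrevs]

# Interleave student and staff columns: student, staff, student, staff, ...
all_dates = [col for pair in zip(student_dates, staff_dates) for col in pair]

print(all_dates)
```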

In [None]:
# Fraction of missing values in each student report column
# Number of non-missing values in each student report column

pd.set_option('display.max_rows', None)

print(df[student_dates].isna().sum()/len(df[student_dates]))
print(df[student_dates].notnull().sum())

In [None]:
df.sample(10)

In [None]:
xtest = pullstats(220901008)

print(xtest)