## Notes

* asterisk indicates suppressed value (see below)
  * 2,788 suppressed values in 2020-2021 school year
  * Impute median of similar-sized schools for these?

"* indicates cell has been suppressed, a blank cell indicates no report has been received for a given district in the indicated time period, a 0 indicates that a report was received and no cases were reported for that group in the reported time period. Single-campus student cases and sources of infection are suppressed when (1) reported student cases are fewer than 5, (2) a campus has at least a 90% student positivity rate when on-campus enrollment for a school is at least 15 students, or (3) a campus has at least a 50% positivity rate when on-campus enrollment has fewer than 15 students. If only one campus in a district has suppressed student numbers then student and source of infection numbers for the campus with the next smallest numbers of positive students are also suppressed. Cumulative student cases and sources of infection numbers for a campus are suppressed when (1) student cases are less than five, or (2) current report numbers have been suppressed for the first three weeks that student cases are reported. If there is only one campus reporting in a district and it is a multiple campus, student and source of infection numbers are not suppressed for the district total. Otherwise, district totals are suppressed when (1) student cases are fewer than 5, or (2) a district has at least a 90% student positivity rate when total district enrollment is least 15 students, or, (3) a district has at least a 50% positivity rate when total district enrollment has fewer than 15 students, or (4) cases on a campus have been suppressed for the first three weeks that student cases are reported and there are fewer than 5 campuses reporting in a district."

In [116]:
import pandas as pd
import requests
import bs4
import os

In [117]:
print(os.getcwd())

/Users/dralbright/Documents/Data Career/Projects/tx-school-risk/notebooks


In [118]:
# Loading Excel file to get sheet names

sheets1 = pd.ExcelFile('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls')
print(sheets1.sheet_names)

['Dashboard', 'Campus Report_August 1', 'Campus Report_July 25', 'Campus Report_July 18', 'Campus Report_July 11', 'Campus Report_July 04', 'Campus Report_June 27', 'Campus Report_June 20', 'Campus Report_June13', 'Campus Report_June 6', 'Campus Report_May 30', 'Campus Report_May 23', 'Campus Report_May 16', 'Campus Report_May 09', 'Campus Report_May 02', 'Campus Report_April 25', 'Campus Report_April 18', 'Campus Report_April 11', 'Campus Report_April 4', 'Campus Report_March 28', 'Campus Report_March 21', 'Campus Report_March 14', 'Campus Report_March 7', 'Campus Report_February 28', 'Campus Report_February 21', 'Campus Report_February 14', 'Campus Report_February 7', 'Campus Report_January 31', 'Campus Report_January 24', 'Campus Report_January 17', 'Campus Report_January 12', 'Campus Report_January 5th', 'Campus Report_December 29', 'Campus Report_December 20', 'Campus Report_December 15', 'Campus Report_December 6', 'Campus Report_November 29', 'Campus Report_November 22', 'Distri

In [119]:
# Attempt at putting this all in a loop

# List of sheet names from original Excel file

sheets = ['Campus Report_August 1', 'Campus Report_July 25', 'Campus Report_July 18', 'Campus Report_July 11', 'Campus Report_July 04', 'Campus Report_June 27', 'Campus Report_June 20', 'Campus Report_June13', 'Campus Report_June 6', 'Campus Report_May 30', 'Campus Report_May 23', 'Campus Report_May 16', 'Campus Report_May 09', 'Campus Report_May 02', 'Campus Report_April 25', 'Campus Report_April 18', 'Campus Report_April 11', 'Campus Report_April 4', 'Campus Report_March 28', 'Campus Report_March 21', 'Campus Report_March 14', 'Campus Report_March 7', 'Campus Report_February 28', 'Campus Report_February 21', 'Campus Report_February 14', 'Campus Report_February 7', 'Campus Report_January 31', 'Campus Report_January 24', 'Campus Report_January 17', 'Campus Report_January 12', 'Campus Report_January 5th', 'Campus Report_December 29', 'Campus Report_December 20', 'Campus Report_December 15', 'Campus Report_December 6', 'Campus Report_November 29']

# Getting first weekly report as base dataframe and cleaning out unnecessary columns
# This excludes on- and off-campus numbers, though those may be worth looking at later

df = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls', sheet_name="Campus Report_November 22", header=5)
df1 = df.iloc[:, 0:8]
df = pd.concat([df1, df.iloc[:, 13:15]])

# Renaming columns

df.columns.values[8] = 'Total student cases_Nov22'
df.columns.values[9] = 'Total staff cases_Nov22'

for i in sheets:
    
    # Create abbreviations for appending to columns in final merged dataframe
    
    if i[-2] == " ":
        abbrev = i[14:17] + i[-1]
    else:
        abbrev = i[14:17] + i[-2:]
        
    if abbrev == 'Janth':
        abbrev = "Jan5"
    
    # Read next sheet from Excel file
    
    df2 = pd.read_excel('../data/raw/Public-School-Data-Files-2020_2021-School-Year.xls',\
                            sheet_name=i, header=5)
    
    # Rename columns of interest (total student and staff cases)
    
    df2.columns.values[13] = 'Total student cases_' + abbrev
    df2.columns.values[14] = 'Total staff cases_' + abbrev
    
    # Merge with base dataframe
    
    df = df.merge(df2.iloc[:, 13:15], how='left', left_index=True, right_index=True)

print(df.columns)

Index(['District Name', 'District\nLEA\nNumber',
       'Total District\nEnrollment as\nof October 30, 2020',
       'Approximate\nDistrict On Campus\nEnrollment as of\nOctober 30, 2020',
       'Campus Name', 'Campus\nID',
       'Total School\nEnrollment as\nof October 30, 2020',
       'On-Campus\nEnrollment for\nSchool as of\nOctober 30, 2020',
       'Total student cases_Nov22', 'Total staff cases_Nov22',
       'Total student cases_Aug1', 'Total staff cases_Aug1',
       'Total student cases_Jul25', 'Total staff cases_Jul25',
       'Total student cases_Jul18', 'Total staff cases_Jul18',
       'Total student cases_Jul11', 'Total staff cases_Jul11',
       'Total student cases_Jul04', 'Total staff cases_Jul04',
       'Total student cases_Jun27', 'Total staff cases_Jun27',
       'Total student cases_Jun20', 'Total staff cases_Jun20',
       'Total student cases_Jun13', 'Total staff cases_Jun13',
       'Total student cases_Jun6', 'Total staff cases_Jun6',
       'Total student c

In [124]:
df.rename(columns = {'District Name':'District', 'District\nLEA\nNumber':'Dist LEA', \
                    'Total District\nEnrollment as\nof October 30, 2020':'Dist Enrollment 10/30/20', \
                    'Approximate\nDistrict On Campus\nEnrollment as of\nOctober 30, 2020':'Dist On-Campus Enrollment 10/30/20', \
                    'Total School\nEnrollment as\nof October 30, 2020':'Sch Enrollment 10/30/2020', \
                    'On-Campus\nEnrollment for\nSchool as of\nOctober 30, 2020':'Sch On-Campus Enrollment 10/30/2020', \
                    'Campus\nID':'Campus ID'}, \
         inplace=True)

In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22300 entries, 0 to 11149
Data columns (total 82 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   District                             11149 non-null  object 
 1   Dist LEA                             11147 non-null  object 
 2   Dist Enrollment 10/30/20             11147 non-null  object 
 3   Dist On-Campus Enrollment 10/30/20   11147 non-null  object 
 4   Campus Name                          9409 non-null   object 
 5   Campus ID                            9931 non-null   object 
 6   Sch Enrollment 10/30/2020            11147 non-null  object 
 7   Sch On-Campus Enrollment 10/30/2020  11147 non-null  object 
 8   Total student cases_Nov22            8404 non-null   object 
 9   Total staff cases_Nov22              8404 non-null   float64
 10  Total student cases_Aug1             22300 non-null  object 
 11  Total staff cases_Aug1      

In [130]:
# New dataframe that does not contain district total rows
# Not working yet

dfsch = df[df['District'].str.contains('Total')==False]

dfsch.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11149 entries, 0 to 11149
Data columns (total 82 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   District                             11149 non-null  object 
 1   Dist LEA                             11147 non-null  object 
 2   Dist Enrollment 10/30/20             11147 non-null  object 
 3   Dist On-Campus Enrollment 10/30/20   11147 non-null  object 
 4   Campus Name                          9409 non-null   object 
 5   Campus ID                            9931 non-null   object 
 6   Sch Enrollment 10/30/2020            11147 non-null  object 
 7   Sch On-Campus Enrollment 10/30/2020  11147 non-null  object 
 8   Total student cases_Nov22            0 non-null      object 
 9   Total staff cases_Nov22              0 non-null      float64
 10  Total student cases_Aug1             11149 non-null  object 
 11  Total staff cases_Aug1      

In [132]:
df.sample(20)

Unnamed: 0,District,Dist LEA,Dist Enrollment 10/30/20,Dist On-Campus Enrollment 10/30/20,Campus Name,Campus ID,Sch Enrollment 10/30/2020,Sch On-Campus Enrollment 10/30/2020,Total student cases_Nov22,Total staff cases_Nov22,...,Total student cases_Dec29,Total staff cases_Dec29,Total student cases_Dec20,Total staff cases_Dec20,Total student cases_Dec15,Total staff cases_Dec15,Total student cases_Dec6,Total staff cases_Dec6,Total student cases_Nov29,Total staff cases_Nov29
9079,KRESS ISD,'219905,267.0,263.0,KRESS EL,'219905101,133,133,,,...,0,2.0,0,3.0,,,0,3.0,*,0.0
3940,HUCKABAY ISD,'072908,267.0,267.0,HUCKABAY SCHOOL,'072908001,267,267,,,...,*,1.0,*,11.0,40,10.0,14,10.0,116,51.0
383,,,,,,,,,*,0.0,...,,,,,,,39,12.0,,
6500,,,,,,,,,0,1.0,...,,,*,0.0,0,1.0,,,16,5.0
9482,MANSFIELD ISD,'220908,35191.0,20597.0,MARY JO SHEPPARD EL,'220908118,422,294,,,...,7,2.0,10,4.0,10,4.0,,,,
8543,AMARILLO ISD,'188901,31402.0,26649.0,BIVINS EL,'188901103,456,398,,,...,12,3.0,,,,,*,2.0,*,1.0
3167,,,,,,,,,9,3.0,...,*,2.0,*,2.0,5,2.0,,,*,0.0
10675,,,,,,,,,0,6.0,...,*,5.0,0,6.0,0,6.0,20,7.0,*,2.0
4603,,,,,,,,,,,...,*,6.0,*,2.0,*,1.0,,,*,1.0
9267,FORT WORTH ISD,'220905,77276.0,33320.0,WORLD LANGUAGES INSTITUTE,'220905084,519,233,,,...,15,0.0,26,0.0,455,7.0,6,5.0,,
