# Project Luther: Web Scraping and Data Collection

In this project, I analyze data on student and teacher demographics to see whether there is any relationship between Illinois teacher demographics and the "achievement gap" between Hispanic students and their white peers on standardized high school math tests (the PSAE).

This notebook contains the code to scrape the publicly available data from the Illinois Report Card website and convert it to a pandas dataframe object.

URL: https://www.illinoisreportcard.com/ListSchools.aspx

Analysis will be covered in a second notebook.

## Initializing libraries and modules

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import numpy as np

import time
import re
import pickle
import collections

## Selenium Data Scraping Script

**I decided to use Selenium as my weapon of choice to scrape data because the Illinois Report Card Website has dynamic content that requires quite a bit of clicking**

The code below is just a few values and a function to keep track of scraping progress so that data collection can be smoothly resumed if there is an interruption.

All the data will be collected from a new Chrome window, opened later in the notebook, with the collection automated through Selenium.

In [3]:
current_alpha_index = 0
current_item_index = 0
current_school_name= "N/A (No Data collected yet)"

In [4]:
def current_scrape_progress():
    print(f"The last school encountered was: {current_school_name}")
    print(f"Next Alphabetical Index value to check: {current_alpha_index}")
    print(f"Next school index on that page is: {current_item_index}")
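The resume logic that these values support can be illustrated without a browser. The sketch below is a simplified stand-in, not the scraper itself: the `pages` list, the `progress` dictionary, and the `scrape` function are hypothetical names, and the interruption is simulated with an exception.

```python
# Hypothetical stand-in for the site: one list of school names per alphabet page.
pages = [["ADDISON TRAIL", "ALTON"], ["BARTLETT", "BELVIDERE", "BLOOM"]]

# Mirrors current_alpha_index / current_item_index from the cell above.
progress = {"alpha_index": 0, "item_index": 0}
collected = []


def scrape(pages, progress, collected, fail_on=None):
    """Walk the pages from the saved position, updating progress after each school."""
    while progress["alpha_index"] < len(pages):
        page = pages[progress["alpha_index"]]
        # Skip the schools on this page that were already handled.
        for name in page[progress["item_index"]:]:
            if name == fail_on:
                raise ConnectionError(f"simulated interruption at {name}")
            collected.append(name)
            progress["item_index"] += 1
        # Page finished: move to the next letter and start at its first school.
        progress["alpha_index"] += 1
        progress["item_index"] = 0


try:
    scrape(pages, progress, collected, fail_on="BELVIDERE")
except ConnectionError:
    pass  # progress now points at the school where the run stopped

interrupted_at = dict(progress)

# Calling the function again picks up from the saved position, with no duplicates.
scrape(pages, progress, collected)
```

Because the position is updated only after a school has been handled, a rerun starts at the school that was interrupted rather than skipping it.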

Web scraping yields one dictionary of scraped data per school. All of these dictionaries are stored in the **master_list_school_dictionaries** variable until data collection is complete.
Once all the scraping is complete, the list of dictionaries will be converted into a pandas dataframe object.
The **valid_district_counties_master_dictionary** holds all the school districts present in the data set as keys and their associated counties as values.

In [2]:
master_list_school_dictionaries = []
valid_district_counties_master_dictionary = {}

### Navigation functions

The following functions navigate the website to collect data on each high school. The schools are listed alphabetically, with a separate webpage for each letter.

- **data_scraper_wrapper** is just a wrapper that prints the current progress and then calls page_navigator.

- **page_navigator** sets the entire data scraping apparatus in motion. By default it starts at the first webpage, with school names beginning with 'A', moves to the next webpage after all the school links on the current page have been processed, and ends at the letter 'Z'.
- **link_navigator** cycles through all the school links on the current webpage.
    - The **is_highschool** helper function returns a boolean that lets link_navigator tell high schools apart from elementary/middle schools.
    - The **obtain_district_and_county** helper function extracts the district and county listed next to each school.


- **open_school_in_new_tab** opens the school link in a new tab when link_navigator has selected a high school. It runs the grab_school_data function on the new tab to collect the school's data. After data collection for the school has been completed, open_school_in_new_tab closes the tab and returns focus to the main window containing all the school links.

In [5]:
def data_scraper_wrapper(driver,
                         starting_alpha_index=current_alpha_index,
                         starting_item_index=current_item_index):
    time.sleep(1)
    current_scrape_progress()
    page_navigator(driver,starting_alpha_index,starting_item_index)
    print(f"Scrape Completed. Total schools in data set = {len(master_list_school_dictionaries)}")


def page_navigator(driver,
                   starting_alpha_index=current_alpha_index,
                   starting_item_index=current_item_index):
    """
    will navigate pages by alphabetical-navigation while scraping data about every high school
    """
    global current_alpha_index
    global current_item_index
    
    driver.switch_to_default_content()
    alpha_page_nav_list = driver.find_elements_by_xpath("//div[contains(@class,'pagination')]//a")
    alpha_page_index =  starting_alpha_index
    working_school_index = starting_item_index
    
    while alpha_page_index < len(alpha_page_nav_list):
        # the need to redefine alpha_page_nav_list is because of all the scraping done, this list of stored
        # elements is now "stale" and no longer in memory
        alpha_page_nav_list = driver.find_elements_by_xpath("//div[contains(@class,'pagination')]//a")
        if alpha_page_index >0:
            moveto_page=alpha_page_nav_list[alpha_page_index]
            moveto_page.send_keys("\n")
        link_navigator(driver,working_school_index)
        print(f'Just completed Alpha_index {alpha_page_index}!!!')
        alpha_page_index +=1
        current_alpha_index = alpha_page_index
        current_item_index = 0
        working_school_index = current_item_index
        

def link_navigator(driver,
                  starting_item_index=current_item_index):
    """
    For the school links on the alpha-nav sorted page, this function
    will append scraped data about every high school to
    master_list_school_dictionaries
    """
    global current_item_index
    global current_school_name
    
    # Generate school list for schools remaining on page
    xpath_schools = "//div[contains(@class,'cellLeft')]//a"
    school_elements = driver.find_elements_by_xpath(xpath_schools)
    total_schools_on_page = len(school_elements)
    school_elements = school_elements[starting_item_index:]
    # Generate the associated school type for schools remaining on page
    xpath_school_type = "//div[a and contains(@class,'cellLeft')]"
    school_type_elements = driver.find_elements_by_xpath(xpath_school_type)
    school_type_elements = school_type_elements[starting_item_index:]
    # Generate Districts and counties for remaining schools on page
    xpath_districts = "//div[div[contains(@class,'cellLeft')]//a]/div[contains(@class,'4')]"
    district_county_elements = driver.find_elements_by_xpath(xpath_districts)
    district_county_elements = [element.text for element in district_county_elements]
    district_county_elements = district_county_elements[starting_item_index:]
    #want to only select high schools
    for school_link,school_type_el,district_county_el in zip(school_elements,school_type_elements,district_county_elements):
        current_school_name = school_link.text
        school_type = school_type_el.text
        if is_highschool(school_type):
            school_data_dict = open_school_in_new_tab(school_link,driver)
            if school_data_dict is not None:
                school_data_dict['school_name'] = current_school_name
                dist_county_info = obtain_district_and_county(district_county_el)
                school_data_dict.update(dist_county_info)
                master_list_school_dictionaries.append(school_data_dict)
                if len(master_list_school_dictionaries) % 20 == 0:
                    print(f"Our master list now has data on {len(master_list_school_dictionaries)} schools")                    
        if (current_item_index + 1) % 30 == 0:
            print(f"Just passed school {current_item_index + 1} ({current_school_name}) of {total_schools_on_page} schools; only {total_schools_on_page - current_item_index -1} to go!")
        current_item_index += 1
        
def is_highschool(school_type):
    """
    Based on description on site, checks to see if a given school is a high school
    if it is a high school, returns true. if not a high school, returns false.
    """
    it_is_a_highschool = False
    regex = re.compile('-12',re.DOTALL|re.MULTILINE)
    is_a_highschool_query = re.search(regex,school_type)
    if is_a_highschool_query:
        it_is_a_highschool = True    
    return it_is_a_highschool

def obtain_district_and_county(district_county_raw_text):
    """
    Returns a dictionary with the district and county of current school element
    """
    district_county_dict = {}
    district_regex = re.compile(r'(.*)\n')
    district = re.findall(district_regex,district_county_raw_text)[0]
    district_county_dict['school_district'] = district
    if district not in valid_district_counties_master_dictionary.keys():
        # raw string avoids invalid-escape warnings for \s and \(
        county_regex = re.compile(r'\n\s*\((.*)\)')
        county = re.findall(county_regex,district_county_raw_text)[0]
        district_county_dict['school_county'] = county
        # remember the county so later schools in this district reuse it
        valid_district_counties_master_dictionary[district] = county
    else:
        district_county_dict['school_county'] = valid_district_counties_master_dictionary[district]
    return district_county_dict

def open_school_in_new_tab(school_link,driver):
    """
    opens the school link in a new tab, runs data scraping algorithm,
    closes the tab, returns the data for the school as a dictionary,
    and then switches window focus back to the list of schools
    """
    main_window=driver.current_window_handle
    #open the school in a new tab
    school_link.send_keys(Keys.CONTROL + Keys.RETURN)
    #switch to the new tab
    driver.switch_to_window(driver.window_handles[-1])
    #collect the school data
    driver.switch_to_default_content()
    school_data_dict = grab_school_data(driver)
    #close the tab and switch focus to the original school list
    driver.close()
    driver.switch_to_window(main_window)
    driver.switch_to_default_content()
    return school_data_dict
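The two text-parsing helpers above can be tried out without a browser. This is a minimal sketch under assumptions: the sample strings are made up to resemble the list page (a grade range such as "9-12" in the school-type cell, and the district on one line followed by the county in parentheses), and the function names are hypothetical simplifications of is_highschool and obtain_district_and_county.

```python
import re


def looks_like_high_school(school_type_text):
    """True when the text contains a grade range ending in 12, such as '9-12'."""
    return re.search(r"-12", school_type_text) is not None


def split_district_and_county(raw_text):
    """Return the district (first line) and the county (in parentheses on the next line)."""
    district = re.findall(r"(.*)\n", raw_text)[0]
    county = re.findall(r"\n\s*\((.*)\)", raw_text)[0]
    return {"school_district": district, "school_county": county}


# Made-up sample text, not copied from the website.
high_school_cell = "EXAMPLE HIGH SCHOOL\nHigh School (9-12)"
elementary_cell = "EXAMPLE ELEMENTARY\nElementary School (K-5)"
district_cell = "DUPAGE HSD 88\n (DUPAGE)"

is_hs = looks_like_high_school(high_school_cell)
is_elementary_hs = looks_like_high_school(elementary_cell)
district_info = split_district_and_county(district_cell)
```

Note that the `-12` test also accepts ranges such as "K-12" or "6-12", so any school whose grades run through 12th grade counts as a high school here.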

### Data collection functions

The following six functions are responsible for extracting the data from each school.
For each school, a dictionary will be returned containing:

*{Hispanic-White achievement gap, Black-White and Black-Hispanic achievement gaps (where available), student enrollment, white student demographics, black student demographics, hispanic student demographics, white teacher demographics, black teacher demographics, and hispanic teacher demographics}*

The school name, district, and county are added to the dictionary afterwards by link_navigator.

- The **grab_school_data** function is the main wrapper for executing the smaller functions. If the *Hispanic-White achievement gap* value is not present for a particular school, the function will stop collecting data for that school and return a value of *None* to the open_school_in_new_tab function, which originally called grab_school_data.
- **grab_hw_achievement_gap** is the gatekeeper. If a Hispanic-White achievement gap value cannot be obtained, there is no point in collecting any more data for the school since the achievement gap **is my output variable of interest**.
- **grab_other_achievement_gaps**, **grab_student_enrollment**, **grab_student_ethnicity**, and **grab_teacher_ethnicity** are self-explanatory.

In [6]:
def grab_school_data(driver):
    """
    runs a scraping script for a specific school and returns
    a dictionary containing desired data in key:value form
    """
    # Sometimes a school's link may be broken. Let's guard against that by seeking the presence of a left navbar
    nav_bar_xpath = "//ul[@id='leftNavTabs']/li/a"
    nav_bar = driver.find_elements_by_xpath(nav_bar_xpath)
    if len(nav_bar) == 0:
        error_xpath = "//div[contains(@class,'alert') and contains(text(),'not available')]"
        error_message = driver.find_elements_by_xpath(error_xpath)
        if len(error_message) > 0:
            return None
    # Onto the main function if the link checks out
    school_data_dict = {}
    hw_achievement_gap = grab_hw_achievement_gap(driver)
    if hw_achievement_gap != {}:
        other_achievement_gaps = grab_other_achievement_gaps(driver)
        student_enrollment = grab_student_enrollment(driver)
        student_demographics = grab_student_ethnicity(driver)
        district_teacher_demographics = grab_teacher_ethnicity(driver)
        school_data_dict.update(hw_achievement_gap)
        school_data_dict.update(other_achievement_gaps)
        school_data_dict.update(student_enrollment)
        school_data_dict.update(student_demographics)
        school_data_dict.update(district_teacher_demographics)
        return school_data_dict
    else:
        return None

def grab_hw_achievement_gap(driver):    
    """
    checks to see if data for the school includes a
    white-hispanic standardized test score achievement gap
    if it does, this will return the value of the gap as a dictionary.
    """
    hw_achievement_gap = {}
    academic_info_xpath = "//ul[@id='leftNavTabs']/li/a[text()='Academic Progress']"
    academic_info = driver.find_element_by_xpath(academic_info_xpath)
    academic_info.send_keys("\n")
    #Clicking on the Achievement Gap Tab
    achievement_gap_xpath = "//ul[@id='leftNavTabs']/li[a[text()='Academic Progress']]//a[text()='Achievement Gap']"
    achievement_gap = driver.find_element_by_xpath(achievement_gap_xpath)
    achievement_gap.send_keys("\n")
    #Selecting SAT scores (not PARCC)
    sat_select_xpath = "//ul[@class='pagination']//a[contains(text(),'SAT')]"
    sat_select = driver.find_element_by_xpath(sat_select_xpath)
    sat_select.send_keys("\n")
    time.sleep(5)
    # Entering the IFrame
    iframe_xpath = "//iframe[@id='IFrame_IRC']"
    driver.switch_to_frame(driver.find_element_by_xpath(iframe_xpath))
    # Clicking options
    poverty_checkbox_xpath = "//input[contains(@id,'Income')]"
    hisp_white_checkbox_xpath = "//input[contains(@id,'Hispanic,White')]"
    math_checkbox_xpath = "//label[@data-code='Mathematics']/input"
    results_xpath = "//div[@class='result']"
    unclick_poverty = driver.find_element_by_xpath(poverty_checkbox_xpath)
    unclick_poverty.send_keys(Keys.SPACE)
    click_hisp_white_gap = driver.find_element_by_xpath(hisp_white_checkbox_xpath)
    click_hisp_white_gap.send_keys(Keys.SPACE)
    click_math = driver.find_element_by_xpath(math_checkbox_xpath)
    click_math.send_keys(Keys.SPACE)
    Hisp_White_Achievement_Gap = driver.find_element_by_xpath(results_xpath)
    Hisp_White_Achievement_Gap = Hisp_White_Achievement_Gap.text
    regex = re.compile('Hispanic and White\n(.?[0-9]+)\n',re.IGNORECASE|re.DOTALL)
    if re.search(regex,Hisp_White_Achievement_Gap):
        Hisp_White_Achievement_Gap_Value = float(re.findall(regex,Hisp_White_Achievement_Gap)[0])
        hw_achievement_gap['Hispanic_White_Achievement_Gap']=Hisp_White_Achievement_Gap_Value
    driver.switch_to_default_content()
    return hw_achievement_gap
    
def grab_other_achievement_gaps(driver):    
    """
    checks to see if data for the school includes
    black-white and black-hispanic standardized test score achievement gaps
    returns whichever gaps are present as a dictionary.
    """
    other_achievement_gaps = {}
    #Should already be on the achievement gap , SAT section of the Academic Progress Tab
    #from the grab_hw_achievement_gap function
    # Re-entering the IFrame
    iframe_xpath = "//iframe[@id='IFrame_IRC']"
    driver.switch_to_frame(driver.find_element_by_xpath(iframe_xpath))
    # Extracting Black - White Achievement Gap
    hisp_white_checkbox_xpath = "//input[contains(@id,'Hispanic,White')]"
    blk_white_checkbox_xpath = "//input[contains(@id,'Black,White')]"
    blk_hisp_checkbox_xpath = "//input[contains(@id,'Black,Hispanic')]"
    results_xpath = "//div[@class='result']"
    unclick_hisp_white_gap = driver.find_element_by_xpath(hisp_white_checkbox_xpath)
    unclick_hisp_white_gap.send_keys(Keys.SPACE)
    click_blk_white_gap = driver.find_element_by_xpath(blk_white_checkbox_xpath)
    click_blk_white_gap.send_keys(Keys.SPACE)
    Blk_White_Achievement_Gap = driver.find_element_by_xpath(results_xpath)
    Blk_White_Achievement_Gap = Blk_White_Achievement_Gap.text
    regex = re.compile('Black and White\n(.?[0-9]+)\n',re.IGNORECASE|re.DOTALL)
    if re.search(regex,Blk_White_Achievement_Gap):
        Blk_White_Achievement_Gap_Value = float(re.findall(regex,Blk_White_Achievement_Gap)[0])
        other_achievement_gaps['Black_White_Achievement_Gap'] = Blk_White_Achievement_Gap_Value
    # Extracting Black - Hispanic Achievement Gap
    unclick_blk_white_gap = click_blk_white_gap # Just re-selecting the above checkbox element to un-click it
    unclick_blk_white_gap.send_keys(Keys.SPACE)
    click_blk_hisp_gap = driver.find_element_by_xpath(blk_hisp_checkbox_xpath)
    click_blk_hisp_gap.send_keys(Keys.SPACE)
    Blk_Hisp_Achievement_Gap = driver.find_element_by_xpath(results_xpath)
    Blk_Hisp_Achievement_Gap = Blk_Hisp_Achievement_Gap.text
    regex = re.compile('Black and Hispanic\n(.?[0-9]+)\n',re.IGNORECASE|re.DOTALL)
    if re.search(regex,Blk_Hisp_Achievement_Gap):
        Blk_Hisp_Achievement_Gap_Value = float(re.findall(regex,Blk_Hisp_Achievement_Gap)[0])
        other_achievement_gaps['Black_Hispanic_Achievement_Gap'] = Blk_Hisp_Achievement_Gap_Value
    driver.switch_to_default_content()
    return other_achievement_gaps


def grab_student_enrollment(driver):
    """
    returns student enrollment as a dictionary
    """
    student_enrollment_dict = {}
    students_info_xpath = "//ul[@id='leftNavTabs']/li/a[text()='Students']"
    students_info = driver.find_element_by_xpath(students_info_xpath)
    students_info.send_keys("\n")
    student_enrollment_xpath = "//ul[@id='leftNavTabs']/li[a[text()='Students']]//a[text()='Enrollment']"
    student_enrollment_el = driver.find_element_by_xpath(student_enrollment_xpath)
    student_enrollment_el.send_keys("\n")
    time.sleep(5)
    # Entering IFrame
    iframe_xpath = "//iframe[@id='IFrame_IRC']"
    driver.switch_to_frame(driver.find_element_by_xpath(iframe_xpath))
    enrollment_val_xpath = "//div[contains(@class,'grid-cell')][last()]"
    enrollment_val_el = driver.find_element_by_xpath(enrollment_val_xpath)
    enrollment_val = enrollment_val_el.text
    student_enrollment_dict['student_enrollment'] = int(float(enrollment_val.replace(',','')))
    driver.switch_to_default_content()
    return student_enrollment_dict

def grab_student_ethnicity(driver):
    """
    returns black, white, and hispanic student demographics as a dictionary
    """
    # Should already have Students Tab selected from the left navbar
    student_demographics = {}
    student_ethnicity_el = driver.find_element_by_xpath("//ul[@id='leftNavTabs']/li[a[text()='Students']]//a[text()='Racial/Ethnic Diversity']")
    student_ethnicity_el.send_keys("\n")
    time.sleep(5)
    # Entering IFrame
    iframe_xpath = "//iframe[@id='IFrame_IRC']"
    driver.switch_to_frame(driver.find_element_by_xpath(iframe_xpath))
    graph_info_xpath = "//div[@id='graph-data']"
    graph_info = driver.find_element_by_xpath(graph_info_xpath)
    graph_info_text = graph_info.text
    # raw strings avoid invalid-escape warnings for \( and \.
    regex = re.compile(r'White \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    student_demographics['white_students'] = float(re.findall(regex,graph_info_text)[0])
    regex = re.compile(r'Black \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    student_demographics['black_students'] = float(re.findall(regex,graph_info_text)[0])
    regex = re.compile(r'Hispanic \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    student_demographics['hispanic_students'] = float(re.findall(regex,graph_info_text)[0])
    driver.switch_to_default_content()
    return student_demographics
    

def grab_teacher_ethnicity(driver):
    """
    returns black, white, and hispanic teacher demographics as a dictionary
    """
    teacher_demographics = {}
    teacher_info_xpath = "//ul[@id='leftNavTabs']/li/a[text()='Teachers']"
    teachers_info = driver.find_element_by_xpath(teacher_info_xpath)
    teachers_info.send_keys('\n')
    teacher_dems_xpath = "//ul[@id='leftNavTabs']/li[a[text()='Teachers']]//a[text()='Demographics']"
    teacher_dems_el = driver.find_element_by_xpath(teacher_dems_xpath)
    teacher_dems_el.send_keys('\n')
    time.sleep(7)
    # Entering IFrame
    iframe_xpath = "//iframe[@id='IFrame_IRC']"
    driver.switch_to_frame(driver.find_element_by_xpath(iframe_xpath))
    # Extracting teacher demographics
    graph_info_xpath = "//div[@id='nested-graph']"
    graph_info = driver.find_element_by_xpath(graph_info_xpath)
    graph_info_text = graph_info.text
    # raw strings avoid invalid-escape warnings for \( and \.
    regex = re.compile(r'White \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    temp_re = re.findall(regex,graph_info_text)
    teacher_demographics['Dist_white_teachers'] = float(temp_re[0])
    regex = re.compile(r'Black \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    temp_re = re.findall(regex,graph_info_text)
    teacher_demographics['Dist_black_teachers'] = float(temp_re[0])
    regex = re.compile(r'Hispanic \(([0-9]+\.?[0-9]*)%\)',re.IGNORECASE|re.DOTALL)
    temp_re = re.findall(regex,graph_info_text)
    teacher_demographics['Dist_hispanic_teachers'] = float(temp_re[0])
    driver.switch_to_default_content()
    return teacher_demographics
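The regular expressions used in these functions can be checked against plain text before pointing them at the live site. The sketch below assumes the results panel and graph legend render roughly like the made-up strings shown; `parse_gap` and `parse_percent` are hypothetical helpers that return None when a value is missing instead of raising an IndexError.

```python
import re


def parse_gap(label, results_text):
    """Return the number on the line after `label`, or None if it is absent."""
    pattern = re.compile(re.escape(label) + r"\n(.?[0-9]+)\n", re.IGNORECASE | re.DOTALL)
    match = pattern.search(results_text)
    return float(match.group(1)) if match else None


def parse_percent(group, graph_text):
    """Return the percentage written as 'Group (12.3%)', or None if it is absent."""
    pattern = re.compile(re.escape(group) + r" \(([0-9]+\.?[0-9]*)%\)", re.IGNORECASE)
    match = pattern.search(graph_text)
    return float(match.group(1)) if match else None


# Made-up text that imitates the assumed page layout.
results_text = "Hispanic and White\n-17\nBlack and White\n-29\n"
graph_text = "White (49.3%)\nBlack (24.3%)\nHispanic (23%)"

hw_gap = parse_gap("Hispanic and White", results_text)
bh_gap = parse_gap("Black and Hispanic", results_text)
white_pct = parse_percent("White", graph_text)
hispanic_pct = parse_percent("Hispanic", graph_text)
```

Returning None for a missing value is a design choice of this sketch; the notebook's own functions index into the `re.findall` result directly for demographics and skip the key for missing gaps.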

## Running the script

To scrape the data, all that needs to be done is run **data_scraper_wrapper**, which in turn calls page_navigator. All data will be collected in the list *master_list_school_dictionaries*.

Caution: A nasty NoSuchElementException will occur if your internet connection is too slow to load the dynamic, JavaScript-based webpage content before the script searches for certain elements. Possible fixes:

a. Add or increase time.sleep() calls in certain parts of the script. This is hacky and generally considered bad practice. Unfortunately, it is sometimes needed because IFrames are weird.

b. Better method: use WebDriverWait together with the expected_conditions module from Selenium, which waits up to a user-specified time for an element to load before raising an error.
    - This is considered best practice.
    - time.sleep() works but is not best practice (allegedly).

c. My preferred method: use an implicit wait, which instructs the WebDriver to keep searching for an element for a set period of time (I chose 12 seconds).
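The idea behind options b and c is the same: poll for the element until it appears or a deadline passes, instead of sleeping for a fixed time. The sketch below shows that polling pattern in plain Python so it runs without a browser. It is not Selenium's implementation; `wait_until` and `SlowPage` are hypothetical names, and with Selenium the equivalent is `WebDriverWait(driver, timeout).until(...)` with a condition from `expected_conditions`.

```python
import time


def wait_until(condition, timeout=12.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception as error:  # e.g. a lookup that fails while the page is still loading
            last_error = error
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds") from last_error
        time.sleep(poll_interval)


class SlowPage:
    """Hypothetical stand-in for a page whose element only appears on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def find(self):
        self.lookups += 1
        if self.lookups < 3:
            raise LookupError("element not loaded yet")
        return "achievement-gap-panel"


page = SlowPage()
element = wait_until(page.find, timeout=2.0, poll_interval=0.01)
```

Unlike a fixed `time.sleep(5)`, the wait ends as soon as the element is available, so fast page loads are not slowed down and slow ones get the full timeout.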

In [9]:
current_scrape_progress()

The last school encountered was: ADLAI E STEVENSON HIGH SCHOOL
Next Alphabetical Index value to check: 0
Next school index on that page is: 43


#### Running the next cell is all that is needed to collect the data. If the data collection is interrupted, just re-run this cell and the collection process will continue where it left off.

Note: My data collection is complete. The script took 5+ hours to run, not including interruptions. I have included long hard-coded sleeps/waits at points in the data collection functions to account for elements sometimes taking too long to load.

In [None]:
chromedriver_path = "/home/farhaan/chromedriver"
chrome_driver = webdriver.Chrome(chromedriver_path)
chrome_driver.implicitly_wait(12)
chrome_driver.get('https://www.illinoisreportcard.com/ListSchools.aspx')
data_scraper_wrapper(chrome_driver,current_alpha_index,current_item_index)

In [110]:
# This cell can be used to view the last 5 schools scraped
master_list_school_dictionaries[-5:]

[{'Black_Hispanic_Achievement_Gap': nan,
  'Black_White_Achievement_Gap': nan,
  'Dist_black_teachers': 0.2,
  'Dist_hispanic_teachers': 8.3,
  'Dist_white_teachers': 88.7,
  'Hispanic_White_Achievement_Gap': -15.0,
  'black_students': 0.7,
  'hispanic_students': 32.2,
  'school_county': 'MCHENRY',
  'school_district': 'WOODSTOCK CUSD 200',
  'school_name': 'WOODSTOCK NORTH HIGH SCHOOL',
  'student_enrollment': 942,
  'white_students': 62.6},
 {'Black_Hispanic_Achievement_Gap': -30.0,
  'Black_White_Achievement_Gap': -57.0,
  'Dist_black_teachers': 0.4,
  'Dist_hispanic_teachers': 4.4,
  'Dist_white_teachers': 92.4,
  'Hispanic_White_Achievement_Gap': -27.0,
  'black_students': 3.3,
  'hispanic_students': 13.7,
  'school_county': 'DUPAGE',
  'school_district': 'ELMHURST SD 205',
  'school_name': 'YORK COMM HIGH SCHOOL',
  'student_enrollment': 2732,
  'white_students': 75.2},
 {'Black_Hispanic_Achievement_Gap': -5.0,
  'Black_White_Achievement_Gap': -27.0,
  'Dist_black_teachers': 1.0,

## Pickling Data and formatting it for future use

The variable **master_list_school_dictionaries** contains the dictionaries for all the schools.
We can turn our data into a dataframe by turning the list of dictionaries into a dictionary of lists and then using the pandas module to convert it into a dataframe.
The resulting dataframe will be pickled for future use.

The way I wrote the code means that not all the dictionaries have all of the keys. For example, if there were not enough Black students at a school, there isn't a value for the Black-Hispanic or Black-White achievement gap. The **check_for_keys()** function looks at each school dictionary from master_list_school_dictionaries individually and adds the missing keys with values of np.nan in place of actual values. These can be imputed later if necessary.
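To make the key-filling step concrete, here is a standalone sketch of the same idea on two made-up school records. It is a variation rather than the notebook's code: it uses `float("nan")` in place of np.nan so that it needs no third-party imports, and it fills gaps while building the columns instead of modifying each dictionary first.

```python
from collections import defaultdict

# Subset of the expected keys, for illustration only.
expected_keys = [
    "school_name",
    "Hispanic_White_Achievement_Gap",
    "Black_White_Achievement_Gap",
]

# Made-up records; the second one lacks the Black-White gap.
records = [
    {
        "school_name": "EXAMPLE HIGH SCHOOL",
        "Hispanic_White_Achievement_Gap": -17.0,
        "Black_White_Achievement_Gap": -29.0,
    },
    {
        "school_name": "SAMPLE HIGH SCHOOL",
        "Hispanic_White_Achievement_Gap": -34.0,
    },
]

columns = defaultdict(list)
for record in records:
    for key in expected_keys:
        # A missing key becomes NaN so every column ends up the same length.
        columns[key].append(record.get(key, float("nan")))
```

Every list in `columns` now has one entry per school, which is the shape a dataframe constructor expects from a dictionary of lists.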

In [86]:
All_keys = ['Black_Hispanic_Achievement_Gap',
  'Black_White_Achievement_Gap',
  'Dist_black_teachers',
  'Dist_hispanic_teachers',
  'Dist_white_teachers',
  'Hispanic_White_Achievement_Gap',
  'black_students',
  'hispanic_students',
  'school_county',
  'school_district',
  'school_name',
  'student_enrollment',
  'white_students']

In [103]:
def check_for_keys(dict_to_check,list_of_keys):
    """
    checks a dictionary to make sure it has the specified keys.
    If it doesn't, it will create a new key with a value of np.nan
    Returns the modified dictionary
    """
    for key_of_interest in list_of_keys:
        if key_of_interest not in dict_to_check.keys():
            # np.nan is the spelling that works across NumPy versions
            dict_to_check[key_of_interest] = np.nan
    return dict_to_check

In [108]:
def list_of_dicts_to_dict_of_lists(list_of_dicts):
    """
    Turns a list of dictionaries with common keys into one dictionary containing
    a list of values for each key. This makes it easy to create a dataframe object.
    Dictionaries with missing keys are handled by the check_for_keys function
    Returns a Dictionary of Lists.
    """
    dict_of_lists = collections.defaultdict(list)
    for index,dictionary in enumerate(list_of_dicts,1):
        dictionary = check_for_keys(dictionary,All_keys)
        for key, value in dictionary.items():
            dict_of_lists[key].append(value)
        if index % 25 == 0:
            print(f"{index} / {len(list_of_dicts)} completed")
    print(f"All {len(list_of_dicts)} schools converted")
    return dict_of_lists

In [111]:
print("master_list_school_dictionaries is a",type(master_list_school_dictionaries))
pre_df_student_data = list_of_dicts_to_dict_of_lists(master_list_school_dictionaries)
student_data_df = pd.DataFrame(pre_df_student_data)
print("student_data_df is a",type(student_data_df))

master_list_school_dictionaries is a <class 'list'>
25 / 236 completed
50 / 236 completed
75 / 236 completed
100 / 236 completed
125 / 236 completed
150 / 236 completed
175 / 236 completed
200 / 236 completed
225 / 236 completed
All 236 schools converted
student_data_df is a <class 'pandas.core.frame.DataFrame'>


Before pickling the DataFrame, here is a preview of its first rows:

In [112]:
student_data_df.head()

| | Black_Hispanic_Achievement_Gap | Black_White_Achievement_Gap | Dist_black_teachers | Dist_hispanic_teachers | Dist_white_teachers | Hispanic_White_Achievement_Gap | black_students | hispanic_students | school_county | school_district | school_name | student_enrollment | white_students |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -12.0 | -29.0 | 3.6 | 5.5 | 89.9 | -17.0 | 24.3 | 23.0 | COOK | CHSD 218 | A B SHEPARD HIGH SCH (CAMPUS) | 1723 | 49.3 |
| 1 | NaN | NaN | 1.6 | 7.3 | 87.2 | -34.0 | 2.2 | 61.7 | DUPAGE | DUPAGE HSD 88 | ADDISON TRAIL HIGH SCHOOL | 2005 | 30.4 |
| 2 | -22.0 | -51.0 | 0.9 | 0.0 | 94.8 | -29.0 | 1.9 | 7.3 | LAKE | ADLAI E STEVENSON HSD 125 | ADLAI E STEVENSON HIGH SCHOOL | 4027 | 60.3 |
| 3 | -31.0 | -19.0 | 4.8 | 0.2 | 92.3 | -13.0 | 25.0 | 2.4 | MADISON | ALTON CUSD 11 | ALTON HIGH SCHOOL | 2008 | 63.6 |
| 4 | -5.0 | -27.0 | 0.4 | 0.8 | 98.1 | -22.0 | 4.3 | 15.7 | COOK | CONS HSD 230 | AMOS ALONZO STAGG HIGH SCHOOL | 2286 | 75.3 |


The code below pickles the dataframe once it has been created:
``` python
pd.to_pickle(student_data_df,'../../temp_files_projects/student_data_df.pkl')
```
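Since the scrape takes hours, it may also be worth saving the raw list of dictionaries, not just the final dataframe. The sketch below is one possible way to do that with the standard-library pickle module; the records are made up, and the file name and temporary directory are hypothetical choices that keep the example self-contained.

```python
import os
import pickle
import tempfile

# Made-up records in the same shape as master_list_school_dictionaries.
records = [
    {"school_name": "EXAMPLE HIGH SCHOOL", "student_enrollment": 1200},
    {"school_name": "SAMPLE COMM HIGH SCHOOL", "student_enrollment": 850},
]

# Hypothetical file name inside a temporary directory.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "school_records_checkpoint.pkl")

# Write the list to disk.
with open(checkpoint_path, "wb") as handle:
    pickle.dump(records, handle)

# Read it back, e.g. in a later session.
with open(checkpoint_path, "rb") as handle:
    restored = pickle.load(handle)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.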