# Web Scraping: College Factual

In this project, I will scrape ranking and public information from the top 1600 colleges/universities in the United States. Scraping is performed in the College Factual [website](https://www.collegefactual.com/), which provides the ranking and general information such as tuition fees, acceptance rate, number of undergraduate and graduate students, etc.

This scraping project was performed in April 2020 using BeautifulSoup. Since websites are frequently redesigned and updated, I cannot guarantee that this script will work without modifications in the future. However, I will show the DataFrame attained after scraping and make that dataset available in my `Data/college_factual/` folder for download.

In [1]:
import pandas as pd
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import requests
import re
import csv

In [2]:
# helper write functions used for saving out results and debugging
def write_csv_df(df,link):
    df.to_csv(link, index=False)
def write_html(s,link):
    f = open(link, "w")
    f.write(s)
    f.close()

**`scrape_colleges`** is the main function to run the scraping off of College Factual in the following steps:
1. The general [rankings page](https://www.collegefactual.com/rankings/best-colleges/) is scraped using the function **`scrape_search_page`**. This function yields a list of colleges with the ranking, name and CollegeFactual internal link.
2. Each college contains a CollegeFactual link, such as Yale University is `yale-university`. This basic ranking dataset is written as a csv file at `colleges_basic.csv`.
3. For each college, use the link to access its page. For example, Yale University has its own CollegeFactual page at `https://www.collegefactual.com/colleges/yale-university/`. Use function **`scrape_college_page`** to scrape detailed information on each college.

In [3]:
def scrape_colleges():
    url = 'https://www.collegefactual.com/rankings/best-colleges/'

    # chrome driver
    driver = None # initial
    if(driver is None):
        options = webdriver.ChromeOptions()
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--headless')
        chrome_link = '/Users/chauvu/Documents/Chau/DataScience/bin/chromedriver'
        driver = webdriver.Chrome(chrome_link, options=options)

    # scrape first page
    driver.get(url)
    first_page = driver.page_source
    soup = bs4.BeautifulSoup(first_page, 'lxml')

    # find number of pages
    num_pages = int(soup.findAll("title")[0].text.replace(')','').split()[-1])

    # scrape each search page
    colleges = scrape_search_page(soup)
    for p in range(2, num_pages):
        url_page = url+'p'+str(p)+'.html'
        driver.get(url_page)
        page = driver.page_source
        soup = bs4.BeautifulSoup(page, 'lxml')
        colleges = colleges.append(scrape_search_page(soup),ignore_index=True)
    # ranking and links for every college
    write_csv_df(colleges, '../Data/college_factual/colleges_basic.csv')
    colleges = pd.read_csv('../Data/college_factual/colleges_basic.csv') 

    # scrape each college
    colleges_df = pd.DataFrame(columns=['Rank','Name','Link','Location_City', 'Location_State','Private_Public', 
                                        'Profit_Type', 'Campus_Setting', 'Starting_Salary', 'Percent_from_State', 
                                        'Percent_FTE', 'Percent_Male', 'Accept_rate', 'Old_SAT_Score', 
                                        'New_SAT_Score', 'ACT_Score', 'Num_Undergrad', 'Num_Faculty', 
                                        'Student_Faculty_Ratio', 'Num_Degree', 'Num_Major', 'Num_Field',
                                        'Num_Annual_Grad', 'Percent_POC', 'Tuition_Fee', 'Room_Fee', 
                                        'Total_Cost', 'Financial_Aid'])

    for college_index, college_row in colleges.iterrows():
#         print(int(college_index))
        college = college_row
        features = scrape_college_page(college, colleges_df.columns, driver)
        colleges_df = colleges_df.append(features)
#         if(int(college_index) % 10 == 0): # write every 10 rows
#             write_csv_df(colleges_df, '../Data/college_factual/colleges_detailed_'+str(int(college_index))+'.csv')

    write_csv_df(colleges_df, '../Data/college_factual/colleges_detailed.csv')
    return colleges_df

Function **`scrape_search_page`** scrapes the list of 1600 colleges and returns the college name, ranking and internal CollegeFactual link.

In [4]:
def scrape_search_page(soup):
    tags = soup.findAll("div", {"class": "collegeName ellipsis"})
    links = []
    ranks = []
    names = []
    for tag in tags:
        info = tag.findAll("a")[0]
        links.append(info.get('href').split('/')[-2])
        text = info.get_text().replace('\xa0',' ').split(maxsplit=1)
        ranks.append(int(text[0][1:].replace(',','')))
        names.append(text[1])
    df = pd.DataFrame({'Rank':ranks, 'Name':names, 'Link':links})
    return df

Function **`scrape_college_age`** scrapes each college's individual CollegeFactual page. Each of these pages for each college was scraped, then the resulting dataframe was written out into csv file. For example, Yale University has its own page at `https://www.collegefactual.com/colleges/yale-university/`. Each page contains the following sections:

1. General:
    * Campus location and setting
    * Funding info - public/private, non-profit/for-profit
    * Student and Faculty info - student/teach ratio, in-state percent, graduate starting salary
2. Applying: [Yale example](https://www.collegefactual.com/colleges/yale-university/applying/)
    * Acceptance rate
    * Test scores - SAT and ACT
3. Academics: [Yale example](https://www.collegefactual.com/colleges/yale-university/academic-life/)
    * Number of majors, fields and degrees
4. Student life and diversity: [Yale example](https://www.collegefactual.com/colleges/yale-university/student-life/diversity/)
    * Percent people of color
5. Paying: [Yale example](https://www.collegefactual.com/colleges/yale-university/paying-for-college/)
    * Tuition fee, room & board fee, financial aid

In [5]:
def scrape_college_page(college, df_columns, driver):
    # overview page: type, campus_setting, average starting salary, percent student from state
    url = 'https://www.collegefactual.com/colleges/'+college['Link']+'/'
    driver.get(url)
    page = driver.page_source
    soup = bs4.BeautifulSoup(page, 'lxml')
    tags = soup.findAll('div',{'class':'ellipsis'})
    location = tags[1].get_text().split(',')
    location_city = location[0]
    location_state = location[1]
    tags = soup.find('h3',text='Type')
    funding_type = tags.next_sibling.get_text().split() if tags is not None else None
    if(len(funding_type)>1):
        private_public = funding_type[0]
        profit_type = funding_type[1]
    else:
        private_public = funding_type[0]
        profit_type = None
    tags = soup.find('h3',text='Campus Setting')
    campus_setting = tags.next_sibling.get_text() if tags is not None else None
    tags = soup.find('h3',text='Average Starting Salary')
    starting_salary = tags.next_sibling.get_text().split('$') if tags is not None else None
    starting_salary = int(starting_salary[1].replace(',','_')) if len(starting_salary)>1 else None
    tags = soup.find('h3',text=re.compile(r'Students From*'))  
    percent_from_state = int(tags.next_sibling.get_text().split('%')[0]) if tags is not None else None
    tags = soup.find('h3',text='Full-Time Teachers')
    percent_fte = int(tags.next_sibling.get_text().split('%')[0]) if tags is not None else None
    tags = soup.find('h3',text='Male To Female %')
    percent_male = tags.next_sibling.find('span',{'class':'male'}) if tags is not None else None
    percent_male = int(percent_male.get_text()) if percent_male is not None else 0

    overview_list = [location_city, location_state, private_public, profit_type, campus_setting, starting_salary, percent_from_state, percent_fte, percent_male]

    # applying: acceptance rate, average SAT/ACT
    url_applying = 'https://www.collegefactual.com/colleges/'+college['Link']+'/applying/'
    driver.get(url_applying)
    page = driver.page_source
    soup = bs4.BeautifulSoup(page, 'lxml')
    tags = soup.find('td',text='Applications Accepted')
    accept_rate = float(tags.next_sibling.get_text().split('%')[0]) if tags is not None else None
    tags = soup.find('td',text='Average Old SAT (math/reading)')
    old_sat_score = tags.next_sibling.get_text().replace(',','_') if tags is not None else None
    old_sat_score = int(old_sat_score) if old_sat_score.isdigit() else None
    tags = soup.find('td',text='Average New SAT Score')
    new_sat_score = tags.next_sibling.get_text().replace(',','_') if tags is not None else None
    new_sat_score = int(new_sat_score) if new_sat_score.isdigit() else None
    tags = soup.find('td',text='Average ACT Score')
    act_score = tags.next_sibling.get_text() if tags is not None else None
    act_score = int(act_score) if act_score.isdigit() else None

    applying_list = [accept_rate, old_sat_score, new_sat_score, act_score]

    # academics: full-time, adjunct, faculty-student ratio
    url_academic = 'https://www.collegefactual.com/colleges/'+college['Link']+'/academic-life/'
    driver.get(url_academic)
    page = driver.page_source
    soup = bs4.BeautifulSoup(page, 'lxml')
    tags = soup.find('h2',text='Faculty Resources')
    faculty_resources = tags.next_sibling.get_text() if tags is not None else None
    nums_faculty_resources = [int(word) for word in faculty_resources.replace(',','').replace('.','').split() if(word.isdigit())]
    num_undergrad = nums_faculty_resources[0]
    num_faculty = nums_faculty_resources[1]
    student_faculty_ratio = nums_faculty_resources[2]
    tags = soup.find('h2',text=re.compile(r'Most Popular Majors*'))
    majors_resources = tags.next_sibling.get_text() if tags is not None else None
    nums_majors_resources = [int(word) for word in majors_resources.replace(',','').replace('.','').split() if(word.isdigit())]
    num_degree = nums_majors_resources[0]
    num_major = nums_majors_resources[1]
    num_field = nums_majors_resources[2]
    num_annual_grad = nums_majors_resources[3]

    academic_list = [num_undergrad, num_faculty, student_faculty_ratio, num_degree, num_major, num_field, num_annual_grad]

    # student life/diversity: percent people of color in student body
    url_student = 'https://www.collegefactual.com/colleges/'+college['Link']+'/student-life/diversity/chart-ethnic-diversity.html'
    driver.get(url_student)
    page = driver.page_source
    soup = bs4.BeautifulSoup(page, 'lxml')
    tags = soup.find('td',text='White')
    percent_poc = 100.0-(float(tags.next_sibling.get_text().replace(',',''))/num_undergrad)*100.0 if tags is not None else None

    diversity_list = [percent_poc]

    # paying: tuition fees, room/board, total cost, average aid, average net price, annual growth rate
    url_paying = 'https://www.collegefactual.com/colleges/'+college['Link']+'/paying-for-college/'
    driver.get(url_paying)
    page = driver.page_source
    soup = bs4.BeautifulSoup(page, 'lxml')
    tags = soup.find('td',text='Tuition and fees')
    tuition_fee = int(tags.next_sibling.get_text().split('$')[1].replace(',','')) if tags is not None else None
    tags = tags.parent.next_sibling.findAll('td')[1]
    room_fee = int(tags.get_text().split('$')[1].replace(',','')) if tags is not None else None
    tags = soup.find('td',text='Total cost')
    total_cost = int(tags.next_sibling.get_text().split('$')[1].replace(',','')) if tags is not None else None
    tags = soup.find('h3',text=re.compile(r'Net Price*')).next_sibling.findAll('td')
    financial_aid = int(tags[3].get_text().split('$')[1].replace(',','')) if tags is not None and len(tags)>4 else None

    paying_list = [tuition_fee, room_fee, total_cost, financial_aid]

    all_info = [college['Rank'],college['Name'],college['Link']] + overview_list + applying_list + academic_list + diversity_list + paying_list
    df = pd.DataFrame([all_info], columns=df_columns)

    return df

## Results of scrape

Now I will run the function `scrape_colleges` to scrape the CollegeFactual ranking website and show two datasets: 
1. `df_basic` contains college names, ranks and links
2. `df_detailed` expands to details on general college information, application, academics, student life and financing

In [6]:
scrape_colleges()
df_detailed = pd.read_csv('../Data/college_factual/colleges_detailed.csv')
df_basic = pd.read_csv('../Data/college_factual/colleges_basic.csv')

We see that the CollegeFactual ranking is slightly different from the USNEWS ranking, in which the top 3 spots are taken by HYP (Harvard, Yale, Princeton); the top 3 colleges in CollegeFactual are Duke, MIT and UPenn. Nevertheless, we expect that colleges that rank high in USNEWS would also rank high in CollegeFactual.

In [7]:
df_basic.head()

Unnamed: 0,Link,Name,Rank
0,duke-university,Duke University,1
1,massachusetts-institute-of-technology,Massachusetts Institute of Technology,2
2,university-of-pennsylvania,University of Pennsylvania,3
3,yale-university,Yale University,4
4,harvard-university,Harvard University,5


Here is the detailed information for the top 5 and bottom 5 colleges in CollegeFactual. The top names are easily recognizable, whereas the bottom names are local schools that most people do not recognize.

In [8]:
df_detailed.head()

Unnamed: 0,Rank,Name,Link,Location_City,Location_State,Private_Public,Profit_Type,Campus_Setting,Starting_Salary,Percent_from_State,...,Student_Faculty_Ratio,Num_Degree,Num_Major,Num_Field,Num_Annual_Grad,Percent_POC,Tuition_Fee,Room_Fee,Total_Cost,Financial_Aid
0,1,Duke University,duke-university,Durham,North Carolina,Private,Non-Profit,Large City,48000.0,16,...,7,61,48,20,2446,56.391876,53500,14798,71764,47836
1,2,Massachusetts Institute of Technology,massachusetts-institute-of-technology,Cambridge,Massachusetts,Private,Non-Profit,Midsize City,67000.0,10,...,3,39,37,16,1298,66.593358,49892,14720,67430,43248
2,3,University of Pennsylvania,university-of-pennsylvania,Philadelphia,Pennsylvania,Private,Non-Profit,Large City,54000.0,24,...,6,92,66,21,3531,56.811198,53534,15066,71715,44801
3,4,Yale University,yale-university,New Haven,Connecticut,Private,Non-Profit,Midsize City,48000.0,7,...,6,79,53,17,1566,55.273234,51400,15500,71290,50897
4,5,Harvard University,harvard-university,Cambridge,Massachusetts,Private,Non-Profit,Midsize City,48000.0,17,...,7,88,53,18,2642,57.13999,48949,16660,69600,49870


In [9]:
df_detailed.tail()

Unnamed: 0,Rank,Name,Link,Location_City,Location_State,Private_Public,Profit_Type,Campus_Setting,Starting_Salary,Percent_from_State,...,Student_Faculty_Ratio,Num_Degree,Num_Major,Num_Field,Num_Annual_Grad,Percent_POC,Tuition_Fee,Room_Fee,Total_Cost,Financial_Aid
1650,1657,Huston - Tillotson University,huston-tillotson-university,Austin,Texas,Private,Non-Profit,Large City,,89,...,14,18,18,16,182,94.106464,14346,7568,24414,9347
1651,1658,Missouri Valley College,missouri-valley-college,Marshall,Missouri,Private,Non-Profit,Small Town,,50,...,14,32,28,17,219,43.2703,20200,8750,33830,13881
1652,1659,Lincoln University,lincoln-university-missouri,Jefferson City,Missouri,Public,,Small City,,67,...,17,47,36,22,333,60.031847,7632,6770,18454,7801
1653,1660,Five Towns College,five-towns-college,Dix Hills,New York,Private,For-Profit,Large Suburb,,93,...,15,8,8,5,153,50.159236,19380,12600,36980,9474
1654,1661,Wilberforce University,wilberforce-university,Wilberforce,Ohio,Private,Non-Profit,Rural,,29,...,9,25,22,12,99,97.044335,13250,7000,23450,12308


The dataframes `df_basic` and `df_detailed` are available in the `Data/college_factual/` folder in my GitHub. Feel free to download the csv files for your own analysis. Overall, I am impressed with the information available on the CollegeFactual website. It does not only contain general information, but also a high level of details such as the percent of students of color and number of degrees/majors at each school. Even for the bottom-ranked schools, these details are still provided, thus giving users the ability to research not only top private schools but also local schools.