# Web Scraping the Number of Colleges in Massachusetts Towns

As part of my project, I will examine whether the presence of colleges in a community correlates with the performance of that community's schools. In this notebook I use the BeautifulSoup library in Python to build a Pandas DataFrame listing the Massachusetts towns that contain colleges and how many colleges each town has.

## Import the required libraries

In [1]:
import sys
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd
from collections import Counter

## Request the HTML from Wikipedia

Here I send a GET request to the Wikipedia page and create a BeautifulSoup object from the response.

In [17]:
url = "https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Massachusetts"

response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
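
The request above assumes the download succeeds. As a minimal sketch, the fetch and parse steps could be wrapped in small helpers that fail loudly on an HTTP error. The names `parse_html` and `fetch_soup`, the timeout value, and the optional `headers` argument are my own illustrative choices and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(content):
    """Parse raw HTML (bytes or str) into a BeautifulSoup object."""
    return BeautifulSoup(content, "html.parser")


def fetch_soup(url, timeout=10, headers=None):
    """Download a page and parse it, raising an exception on HTTP errors.

    timeout and headers are illustrative defaults (assumptions), not values
    taken from the original notebook.
    """
    response = requests.get(url, timeout=timeout, headers=headers)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the real article.
    response.raise_for_status()
    return parse_html(response.content)
```

With these helpers, `soup = fetch_soup(url)` would stand in for the two lines above.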

## Extract the table containing college and town info

In [53]:
html_tables = soup.find_all('table')
college_town_table  = html_tables[0]
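
Selecting the table by position works only while the page layout stays the same. As a hedged alternative, the sketch below picks the first table whose header cells include a given label. The sample HTML, the helper name `find_table_with_header`, and the "Location" header text are assumptions for illustration; the real page's header text should be checked before relying on this.

```python
from bs4 import BeautifulSoup

# Small stand-in for a page with several tables (not the real Wikipedia HTML).
sample_html = """
<table><tr><th>Note</th></tr><tr><td>Not the table we want</td></tr></table>
<table>
  <tr><th>School</th><th>Location</th></tr>
  <tr><td>Example College</td><td>Exampletown</td></tr>
</table>
"""


def find_table_with_header(soup, header_text):
    """Return the first table that has a <th> matching header_text, else None."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = find_table_with_header(sample_soup, "Location")
```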

## Extract town names and frequencies from the table
I will iterate through each row of the table and extract the town name for each college. I will collect these names in a list, then build a frequency table as a dictionary using Counter from the collections module. The result will then be reformatted into a more usable form.

In [94]:
# initialize dictionary
frequency_tbl ={'Town':[],'No_of_Colleges':[]}

# Initialize list that will store each college's town
town_list = []

# Go through each row of the college table, skipping the header row
for row in college_town_table.find_all('tr')[1:]:

    # Extract the town name of the college. Town names are found in links to
    # that town's wiki page; the second link in each row is taken as the town.
    town_list.append(row.find_all('a')[1].string)

# Use Counter to build a frequency dictionary mapping each town to its college count
counts = dict(Counter(town_list))

# reorganize dictionary to be more useful in overall project
frequency_tbl['Town'] = list(counts.keys())
frequency_tbl['No_of_Colleges'] = list(counts.values())

df = pd.DataFrame(frequency_tbl)
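
To make the counting step concrete, here is a small sketch on a made-up list of towns. It builds the same two-column layout directly from `Counter.most_common()`, which also sorts towns by college count. The sample data and the variable names `sample_towns` and `sample_df` are illustrative only.

```python
from collections import Counter

import pandas as pd

# Made-up town list standing in for the scraped town_list.
sample_towns = ["Boston", "Amherst", "Boston", "Worcester", "Boston", "Amherst"]

# most_common() returns (town, count) pairs ordered from most to least frequent.
sample_counts = Counter(sample_towns)
sample_df = pd.DataFrame(
    sample_counts.most_common(), columns=["Town", "No_of_Colleges"]
)
```

This yields one row per town, with the same column names used in the frequency table above.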

## Export dataframe as a csv for use in project

In [95]:
df.to_csv('webscrapped_college_freq.csv')

# Web Scraping Per Capita Income for Each Massachusetts Town

## Request the HTML from Wikipedia

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_Massachusetts_locations_by_per_capita_income"

response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')

## Extract the Table Containing Income Data

In [13]:
html_tables = soup.find_all("table")
income_table = html_tables[2]

## Extract Town and Per Capita Income from Table

In [12]:
# Initialize the dictionary
income_dict = {"Town":[],"Per_Capita_Income":[]}

# Go through the table one row at a time skipping the header row.
# In each row the town name and the per capita income will be added to the dictionary
for row in income_table.find_all('tr')[1:]:
    cells = row.find_all('td')
    income_dict['Town'].append(cells[1].a.string)
    # Store the cell's text rather than the Tag object itself
    income_dict['Per_Capita_Income'].append(cells[4].get_text(strip=True))
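
Income values scraped from a table are text, so they cannot be used in numeric analysis until they are converted. The sketch below assumes the values look like `$45,678` (a dollar sign with comma separators); that format, and the helper name `parse_income`, are assumptions and should be checked against the actual table cells. Cells with footnote markers or missing values would need extra handling.

```python
def parse_income(text):
    """Convert a string such as '$45,678' to the integer 45678.

    Assumes a plain dollar amount; raises ValueError for anything else.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


# Example on made-up values
sample_incomes = [parse_income(s) for s in ["$45,678", " $101,000\n"]]
```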

## Convert dictionary to a DataFrame and then save table as a csv file

In [14]:
df = pd.DataFrame(income_dict)
df.to_csv('per_capita_income.csv')