# (Prototype) Extracting NYT table

In this prototype, we will extract data from the table present here:

https://www.nytimes.com/interactive/2018/06/29/nyregion/nyc-high-schools-middle-schools-shsat-students.html?rref=collection%2Fbyline%2Fjasmine-c.-lee&action=click&contentCollection=undefined&region=stream&module=stream_unit&version=latest&contentPlacement=1&pgtype=collection

It contains the number of applicants and admitted students (in SPHS) for schools in NYC, although not every school is listed.

*Data is from the end of 2017*

In [2]:
import requests
import parsel

URL = r'https://www.nytimes.com/interactive/2018/06/29/nyregion/nyc-high-schools-middle-schools-shsat-students.html?rref=collection%2Fbyline%2Fjasmine-c.-lee&action=click&contentCollection=undefined&region=stream&module=stream_unit&version=latest&contentPlacement=1&pgtype=collection'
r = requests.get(URL)
r.raise_for_status()  # stop early if the page could not be fetched
s = parsel.Selector(r.text)

In [4]:
_rows = s.css('.g-main tr')
len(_rows)

589

In [35]:
_row = _rows[0]

In [37]:
DBN = _row.css('::attr(data-dbn)').extract_first()

school_name_number = _row.css('.g-school-name-number::text').extract_first()
school_name_details = _row.css('.g-school-name-details::text').extract_first()
borough = _row.css('.g-borough::text').extract_first()

testers = _row.css('.g-testers::text').extract_first()
offers = _row.css('.g-offers::text').extract_first()
offers_per_student = _row.css('.g-offers-per-student::text').extract_first()
pct_hispanic_black = _row.css('.g-pct::text').extract_first()

In [39]:
from collections import OrderedDict


schools = []

for _row in _rows:
    school = OrderedDict()
    school['DBN'] = _row.css('::attr(data-dbn)').extract_first()
    school['school_name_number'] = _row.css('.g-school-name-number::text').extract_first()
    school['school_name_details'] = _row.css('.g-school-name-details::text').extract_first()
    school['borough'] = _row.css('.g-borough::text').extract_first()
    school['testers'] = _row.css('.g-testers::text').extract_first()
    school['offers'] = _row.css('.g-offers::text').extract_first()
    school['offers_per_student'] = _row.css('.g-offers-per-student::text').extract_first()
    school['pct_hispanic_black'] = _row.css('.g-pct::text').extract_first()
    schools.append(school)

In [40]:
import pandas as pd

df = pd.DataFrame(schools)
df.head()

| | DBN | school_name_number | school_name_details | borough | testers | offers | offers_per_student | pct_hispanic_black |
|---|---|---|---|---|---|---|---|---|
| 0 | 20K187 | Intermediate School 187 | The Christa McAuliffe School | Brooklyn | 251 | 205 | 75% | 8% |
| 1 | 21K239 | Intermediate School 239 | The Mark Twain Intermediate School for the Gif... | Brooklyn | 336 | 196 | 46% | 13% |
| 2 | 03M054 | Junior High School 54 | The Booker T. Washington School | Manhattan | 257 | 150 | 53% | 23% |
| 3 | 15K051 | Midde School 51 | The William Alexander School | Brooklyn | 280 | 122 | 33% | 28% |
| 4 | 02M312 | | New York City Lab Middle School for Collaborat... | Manhattan | 163 | 113 | 62% | 8% |


In [62]:
df.testers.value_counts().head()

—     52
6     19
7     17
9     17
10    17
Name: testers, dtype: int64

In [63]:
df.shape

(589, 8)

And all is good. Some values are missing and shown as '—' (not '-'). This means that 5 or fewer students fall into that category or, for a percentage, that it can't be calculated.
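
Every scraped value is still a string, so the counts and percentages need converting before they can be analysed. Here is a minimal sketch of one way to do that. It assumes that counts are plain integers, that percentages end in `%`, and that the placeholder character should become a missing value. `parse_cell` and `MISSING` are hypothetical names, not part of the notebook.

```python
MISSING = '\u2014'  # the placeholder character the table uses for suppressed values


def parse_cell(value):
    """Convert one scraped cell to a number, or None when it is missing.

    Assumptions (not confirmed by the source page): counts are integers,
    possibly with thousands separators, and percentages end in '%'.
    """
    if value is None:
        return None
    text = value.strip()
    if text in ('', MISSING):
        return None
    if text.endswith('%'):
        return float(text[:-1]) / 100  # percentage string -> fraction
    return int(text.replace(',', ''))  # count string -> int
```

It could then be applied column by column, for example `df['testers'].map(parse_cell)`.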

There are far fewer schools in this dataset than in the dataset we were provided. This may pose a problem: schools that don't appear here would be excluded, and they may be the ones that need the most help.
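
To see which schools are affected, the two sets of identifiers can be compared. This is a rough sketch that assumes the provided dataset also identifies schools by DBN, which has not been checked here. `missing_dbns` is a hypothetical helper and the lists below are toy values.

```python
def normalize_dbn(dbn):
    """Strip whitespace and upper-case a DBN so that comparisons are consistent."""
    return dbn.strip().upper()


def missing_dbns(provided_dbns, scraped_dbns):
    """Return, sorted, the DBNs in the provided dataset that the scraped table lacks."""
    provided = {normalize_dbn(d) for d in provided_dbns if d}
    scraped = {normalize_dbn(d) for d in scraped_dbns if d}
    return sorted(provided - scraped)


# Toy values for illustration only; '99X999' is a made-up DBN.
provided = ['20K187', '21K239', '99X999']
scraped = ['20k187 ', '21K239', None]
print(missing_dbns(provided, scraped))
```

With the real data, `scraped` would be `df['DBN']` and `provided` the matching column of the provided dataset.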

Let's convert this into a simple script and join it into the pipeline.
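
One possible shape for that script is sketched below. It is not the final pipeline code. The selectors are copied from the cells above, while the names `FIELDS`, `row_to_record` and `extract_schools` are hypothetical.

```python
from collections import OrderedDict

# Column name -> CSS selector, copied from the extraction cells above.
FIELDS = OrderedDict([
    ('DBN', '::attr(data-dbn)'),
    ('school_name_number', '.g-school-name-number::text'),
    ('school_name_details', '.g-school-name-details::text'),
    ('borough', '.g-borough::text'),
    ('testers', '.g-testers::text'),
    ('offers', '.g-offers::text'),
    ('offers_per_student', '.g-offers-per-student::text'),
    ('pct_hispanic_black', '.g-pct::text'),
])


def row_to_record(row):
    """Extract one school from a table-row selector, keeping the column order."""
    return OrderedDict(
        (name, row.css(css).extract_first()) for name, css in FIELDS.items()
    )


def extract_schools(html):
    """Parse the page HTML and return one record per table row."""
    import parsel  # imported here so row_to_record has no hard dependency on it

    selector = parsel.Selector(html)
    return [row_to_record(row) for row in selector.css('.g-main tr')]
```

Keeping the selectors in one mapping avoids repeating the `.css(...).extract_first()` call for every column, and `pd.DataFrame(extract_schools(r.text))` should give the same frame as before.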