# Scraping Intro Homework: Columbia J-School Data Faculty

In this assignment, we'll practice our scraping skills by examining the Columbia Journalism School's listing of data faculty: https://journalism.columbia.edu/faculty?expertise=116

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that even though we installed the library as `pip install beautifulsoup4`, the import statement we practiced is slightly different.

In [86]:
import requests
import pandas as pd
import lxml.html  # used here in place of BeautifulSoup; .cssselect() also needs the `cssselect` package
import pprint as pp

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning it to a variable.

In [10]:
URL = "https://journalism.columbia.edu/faculty?expertise=116"
page = requests.get(URL).text
pp.pprint(page)

('<!DOCTYPE html>\n'
 '<html lang="en" dir="ltr">\n'
 '<head>\n'
 '\t\n'
 '<!-- Google tag (gtag.js) -->\n'
 '<script async '
 'src="https://www.googletagmanager.com/gtag/js?id=G-KQW8XM5VEJ"></script>\n'
 '<script>\n'
 '  window.dataLayer = window.dataLayer || [];\n'
 '  function gtag(){dataLayer.push(arguments);}\n'
 "  gtag('js', new Date());\n"
 '\n'
 "  gtag('config', 'G-KQW8XM5VEJ');\n"
 '</script>\n'
 '\n'
 '<!-- Anti-flicker snippet (recommended) INC1425337\xa0-->\n'
 '<style>.async-hide { opacity: 0 !important} </style>\n'
 "<script>(function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;\n"
 "h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};\n"
 '(a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;\n'
 "})(window,document.documentElement,'async-hide','dataLayer',4000,\n"
 "{'GTM-MV2DS4J':true});</script>\n"
 '<!-- Modified Analytics tracking code with Optimize plugin INC1425337 -->\n'
 '\xa0 \xa0 <script>\n'
 '\xa0 \xa0 

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [48]:
# Parse the HTML string into an lxml element tree (used here instead of BeautifulSoup)
j_DOM = lxml.html.fromstring(page)
print(j_DOM)

<Element html at 0x11961fd30>


### 3) Use `.select(...)` to select all elements representing a faculty member

Assign the resulting elements to a variable named `faculty_els`.

You'll want to use "View Source" or pop open the Element Inspector to figure out which elements to target.

Note: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. 

A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
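
To see why either selector matches, it can help to look at the `class` attribute as a space-separated list of tokens. The sketch below parses the example `<div>` with Python's built-in `xml.etree.ElementTree` rather than BeautifulSoup or lxml, which only works because the snippet is tiny and well-formed; a real page should still go through an HTML parser. The `has_class` helper is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# The example element from the note above.
snippet = '<div class="potato hamburger">Hello</div>'
el = ET.fromstring(snippet)

# The class attribute is one string; splitting on whitespace yields the classes.
classes = el.get("class").split()


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the element's classes."""
    return name in element.get("class", "").split()


print(classes)                      # ['potato', 'hamburger']
print(has_class(el, "potato"))      # True
print(has_class(el, "hamburger"))   # True
print(has_class(el, "fries"))       # False
```

A class selector such as `.potato` performs the same kind of membership check, which is why it does not need to match the whole attribute value.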

In [49]:
faculty_els = j_DOM.cssselect(".views-row")
pp.pprint(faculty_els)

[<Element div at 0x11961df30>,
 <Element div at 0x1197b1ee0>,
 <Element div at 0x119372ca0>,
 <Element div at 0x119786bb0>,
 <Element div at 0x10caa13a0>,
 <Element div at 0x1198649f0>,
 <Element div at 0x119864c20>,
 <Element div at 0x119864cc0>]


### 4) Count the number of matching elements, using `len`

Does it match the number of faculty you see on the page? (It should.)

In [27]:
len(faculty_els)

8

### 5) For each faculty member, print their name, title, and faculty page URL

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
```

You'll note that the "href" is not a complete URL, but rather a "[relative path](https://www.w3schools.com/html/html_filepaths.asp)". Don't worry too much about that for now, although you're welcome to try "solving" that part.
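
One possible way to "solve" the relative path, sketched here as an optional extra, is `urljoin` from Python's standard library. It resolves an `href` against the URL of the page it was found on. The variable names are illustrative; the page URL and `href` are the ones shown in this assignment.

```python
from urllib.parse import urljoin

# URL of the page being scraped, and a relative href found on it.
page_url = "https://journalism.columbia.edu/faculty?expertise=116"
href = "/faculty/denise-ajiri"

# urljoin keeps the scheme and host, then replaces the path and query.
full_url = urljoin(page_url, href)
print(full_url)  # https://journalism.columbia.edu/faculty/denise-ajiri
```

Compared with gluing strings together, `urljoin` also leaves an `href` alone when it is already a complete URL.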

In [97]:
j_root_url = "https://journalism.columbia.edu"
j_names = j_DOM.cssselect("h2.title a")
j_titles = j_DOM.cssselect(".sub-title p")
j_list = []

for i in range(len(faculty_els)):
    d = {}
    
    name = j_names[i].text
    d['name'] = name
    
    title = j_titles[i].text_content() # can use .text or .text_content() method
    d['title'] = title
    
    link = j_root_url + j_names[i].attrib["href"]
    d['link'] = link

    j_list.append(d)
    blurb = f"\n{name}'s title is '{title}'. You can find more information about them @ {link}\n"
    print(blurb)
    print("---------------------------------------------\n")


Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ https://journalism.columbia.edu/faculty/denise-ajiri

---------------------------------------------


Andrea Fuller's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.columbia.edu/faculty/andrea-fuller

---------------------------------------------


Robert Gebeloff's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.columbia.edu/faculty/robert-gebeloff

---------------------------------------------


Mark Hansen's title is 'David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation'. You can find more information about them @ https://journalism.columbia.edu/faculty/mark-hansen

---------------------------------------------


Tom  Meagher's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.co

### 6) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `name`, `title`, `href`.

In [130]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html

df = pd.DataFrame.from_records(j_list)
df

| | name | title | link |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | https://journalism.columbia.edu/faculty/denise... |
| 1 | Andrea Fuller | Adjunct Faculty | https://journalism.columbia.edu/faculty/andrea... |
| 2 | Robert Gebeloff | Adjunct Faculty | https://journalism.columbia.edu/faculty/robert... |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | https://journalism.columbia.edu/faculty/mark-h... |
| 4 | Tom Meagher | Adjunct Faculty | https://journalism.columbia.edu/faculty/tom-me... |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | https://journalism.columbia.edu/faculty/dhrumi... |
| 6 | Matt Rocheleau | Adjunct Faculty | https://journalism.columbia.edu/faculty/matt-r... |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | https://journalism.columbia.edu/faculty/gianni... |


### 7) Using that `DataFrame`, calculate how many are "Adjunct Faculty"

In [133]:
# Count the rows whose title is exactly 'Adjunct Faculty'
adjunct1 = len(df[df.title == 'Adjunct Faculty'])
print(f"There are currently {adjunct1} adjuncts")

# Alternatively, tally every title and look up 'Adjunct Faculty'
adjunct2 = df.title.value_counts()['Adjunct Faculty']
print(f"There are currently {adjunct2} adjuncts")

There are currently 4 adjuncts
There are currently 4 adjuncts
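
As a cross-check that does not depend on `pandas`, the same tally can be sketched with `collections.Counter` from the standard library. The titles below are copied from the `DataFrame` output shown earlier, including the `...` where the display truncated them, so the long titles are not the full strings. The sketch also shows how an exact match differs from a substring match on "Adjunct".

```python
from collections import Counter

# Titles as displayed in the DataFrame output; "..." marks display truncation.
titles = [
    "Adjunct Assistant Professor",
    "Adjunct Faculty",
    "Adjunct Faculty",
    "David and Helen Gurley Brown Professor of Jour...",
    "Adjunct Faculty",
    "Associate Professor in Data Journalism; Deputy...",
    "Adjunct Faculty",
    "John S. and James L. Knight Professor of Profe...",
]

# Exact match: only titles equal to 'Adjunct Faculty'.
title_counts = Counter(titles)
exact_adjuncts = title_counts["Adjunct Faculty"]

# Substring match: any title that mentions 'Adjunct'.
any_adjunct = sum("Adjunct" in t for t in titles)

print(exact_adjuncts)  # 4
print(any_adjunct)     # 5
```

The substring count is higher because it also picks up "Adjunct Assistant Professor", so which number is right depends on how the question is read.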


---
