# Scraping Intro Homework: Columbia J-School Data Faculty

In this assignment, we'll practice our scraping skills by examining the Columbia Journalism School's listing of data faculty: https://journalism.columbia.edu/faculty?expertise=116

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that even though we installed the library with `pip install beautifulsoup4`, the import statement we practiced is slightly different.

In [1]:
import requests

In [4]:
requests.get('https://journalism.columbia.edu/faculty?expertise=116')

<Response [200]>

In [5]:
faculty_response = requests.get('https://journalism.columbia.edu/faculty?expertise=116')

In [2]:
import pandas as pd

In [3]:
from bs4 import BeautifulSoup

<title>Faculty | Columbia Journalism School</title>

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning it to a variable.

In [19]:
faculty_html = faculty_response.text
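
A slightly more defensive version of this step might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration; they are not part of the assignment.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url`, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# faculty_html = fetch_html('https://journalism.columbia.edu/faculty?expertise=116')
```

Checking the status before parsing means a failed request stops the notebook right away instead of silently handing an error page to `BeautifulSoup`.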

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [21]:
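# Name the parser explicitly so the result doesn't depend on which parsers are installed
faculty_els = BeautifulSoup(faculty_html, 'html.parser')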

### 3) Use `.select(...)` to select all elements representing a faculty member

Assign the resulting elements to a variable named `faculty_els`.

You'll want to use "View Source" or open the Element Inspector to figure out which elements to target.

Note: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. 

A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
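
The multi-class behavior can be checked on a tiny HTML string. This is a minimal sketch: the second `<div>` and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# The first div comes from the note above; the second is added for contrast.
html = '<div class="potato hamburger">Hello</div><div class="potato">Bye</div>'
soup = BeautifulSoup(html, 'html.parser')

potato_els = soup.select('.potato')          # matches both divs
hamburger_els = soup.select('.hamburger')    # matches only the first div
both_els = soup.select('.potato.hamburger')  # requires both classes on one element

print(len(potato_els), len(hamburger_els), len(both_els))
```

Chaining class selectors with no space between them (`.potato.hamburger`) narrows the match to elements that carry every listed class.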

In [28]:
faculty_els.select('.about-link')

[<a class="about-link" href="/faculty/denise-ajiri">Denise Ajiri</a>,
 <a class="about-link" href="/faculty/andrea-fuller">Andrea Fuller</a>,
 <a class="about-link" href="/faculty/robert-gebeloff">Robert Gebeloff</a>,
 <a class="about-link" href="/faculty/mark-hansen">Mark Hansen</a>,
 <a class="about-link" href="/faculty/tom-meagher">Tom  Meagher</a>,
 <a class="about-link" href="/faculty/dhrumil-mehta">Dhrumil Mehta</a>,
 <a class="about-link" href="/faculty/matt-rocheleau">Matt Rocheleau</a>,
 <a class="about-link" href="/faculty/giannina-segnini">Giannina Segnini</a>]

### 4) Count the number of matching elements, using `len`

Does it match the number of faculty you see on the page? (It should.)

In [92]:
len(faculty_els.select('.faculty-bio'))

8

In [126]:
faculty_els.select('.faculty-bio')

[<article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1596">
 <div class="faculty-photo">
 <a href="/faculty/denise-ajiri"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2022/31/f-88-6-13176708_01wgebpz_denise.jpg?itok=k-ND0dPI" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/denise-ajiri">Denise Ajiri</a></h2>
 <div class="sub-title"><p>Adjunct Assistant Professor</p>
 </div>
 </article>,
 <article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1156">
 <div class="faculty-photo">
 <a href="/faculty/andrea-fuller"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2019/04/andrea-fuller.jpg?itok=o-b7JFxn" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/andrea-fuller">Andrea Fuller</a><

### 5) For each faculty member, print their name, title, and faculty page URL

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
```

You'll note that the "href" is not a complete URL, but rather a "[relative path](https://www.w3schools.com/html/html_filepaths.asp)". Don't worry too much about that for now, although you're welcome to try "solving" that part.
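
One possible way to "solve" the relative path is `urljoin` from Python's standard library, which resolves an `href` against the URL of the page it came from. A minimal sketch, with variable names chosen only for illustration:

```python
from urllib.parse import urljoin

# The page we scraped, and one relative href found on it
page_url = 'https://journalism.columbia.edu/faculty?expertise=116'
relative_href = '/faculty/denise-ajiri'

# urljoin keeps the scheme and host of page_url and swaps in the new path
full_url = urljoin(page_url, relative_href)
print(full_url)
```

Because the `href` starts with `/`, it replaces the whole path (and the query string) of the page URL rather than being appended to it.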

In [155]:
for teacher in faculty_els.select('.faculty-bio'):
    name = teacher.select(".about-link")[0].text
    subtitle = teacher.select(".sub-title")[0].text
    link = teacher.select("a")[0]['href']
    print(f"{name}'s title is \"{subtitle}.\" You can find out more information about them @{link}.")
# .select() returns a ResultSet, which behaves like a list, so [0] picks the first match

Denise Ajiri's title is "Adjunct Assistant Professor
." You can find out more information about them @/faculty/denise-ajiri.
Andrea Fuller's title is "Adjunct Faculty
." You can find out more information about them @/faculty/andrea-fuller.
Robert Gebeloff's title is "Adjunct Faculty
." You can find out more information about them @/faculty/robert-gebeloff.
Mark Hansen's title is "David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation
." You can find out more information about them @/faculty/mark-hansen.
Tom  Meagher's title is "Adjunct Faculty
." You can find out more information about them @/faculty/tom-meagher.
Dhrumil Mehta's title is "Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Journalism
." You can find out more information about them @/faculty/dhrumil-mehta.
Matt Rocheleau's title is "Adjunct Faculty
." You can find out more information about them @/faculty

In [156]:
import pandas as pd

<article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-338">
<div class="faculty-photo">
<a href="/faculty/giannina-segnini"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2016/21/giannina_segnini.jpg?itok=IOEYXzCH" width="269"/></a> </div>
<h2 class="title regular"><a class="about-link" href="/faculty/giannina-segnini">Giannina Segnini</a></h2>
<div class="sub-title"><p>John S. and James L. Knight Professor of Professional Practice in Data Journalism</p>
</div>
</article>

### 6) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `name`, `title`, `href`.

In [172]:
df = pd.DataFrame([{
    'name' : teacher.select('.about-link')[0].text,
    'title' : teacher.select('.sub-title')[0].text.strip(),
    'href' : teacher.select('a')[0]['href']
} for teacher in faculty_els.select('.faculty-bio')])

# The .sub-title <div> contains a line break after its <p> tag, so .text ends with "\n"; .strip() removes it.
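
The trailing newline can be reproduced on a single `.sub-title` element copied from the page source shown earlier. This sketch compares `.text`, `.text.strip()`, and `get_text(strip=True)`, which is another way to get the cleaned string:

```python
from bs4 import BeautifulSoup

# Markup copied from one faculty entry, including the line break before </div>
html = '<div class="sub-title"><p>Adjunct Assistant Professor</p>\n</div>'
sub_title = BeautifulSoup(html, 'html.parser').select_one('.sub-title')

raw_text = sub_title.text                       # includes the newline after </p>
stripped_text = sub_title.text.strip()          # whitespace removed from both ends
clean_text = sub_title.get_text(strip=True)     # strips each text fragment

print(repr(raw_text))
print(repr(stripped_text))
print(repr(clean_text))
```

Printing with `repr` makes the otherwise invisible `\n` show up in the output.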

### 7) Using that `DataFrame`, calculate how many are "Adjunct Faculty"

In [174]:
df

| | name | title | href |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | /faculty/denise-ajiri |
| 1 | Andrea Fuller | Adjunct Faculty | /faculty/andrea-fuller |
| 2 | Robert Gebeloff | Adjunct Faculty | /faculty/robert-gebeloff |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | /faculty/mark-hansen |
| 4 | Tom Meagher | Adjunct Faculty | /faculty/tom-meagher |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | /faculty/dhrumil-mehta |
| 6 | Matt Rocheleau | Adjunct Faculty | /faculty/matt-rocheleau |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | /faculty/giannina-segnini |


In [178]:
len(df[df['title'].str.contains('Adjunct Faculty')])

4
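
`str.contains` matches any title that includes the phrase, so an exact comparison is a stricter alternative. The sketch below rebuilds just the `title` column from the values printed earlier; for this data both approaches give the same count, but they could differ if a title contained "Adjunct Faculty" alongside other text.

```python
import pandas as pd

# Titles as they appear in the output of step 5
titles = pd.Series([
    'Adjunct Assistant Professor',
    'Adjunct Faculty',
    'Adjunct Faculty',
    'David and Helen Gurley Brown Professor of Journalism and Innovation; '
    'Director, David and Helen Gurley Brown Institute of Media Innovation',
    'Adjunct Faculty',
    'Associate Professor in Data Journalism; '
    'Deputy Director of the Tow Center for Digital Journalism',
    'Adjunct Faculty',
    'John S. and James L. Knight Professor of Professional Practice in Data Journalism',
], name='title')

# Substring match: True wherever the phrase appears anywhere in the title
contains_count = int(titles.str.contains('Adjunct Faculty', regex=False).sum())

# Exact match: True only when the whole title equals the phrase
exact_count = int((titles == 'Adjunct Faculty').sum())

print(contains_count, exact_count)
print(titles.value_counts().head(1))
```

Summing a boolean Series counts its `True` values, which avoids building a filtered `DataFrame` just to take its length.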

---
