# Module 5 In-Class Assignment
Python for Data Analytics 
<br>Professor James Ng

<div class="alert alert-block alert-warning">
    <b>This assignment is due by THURSDAY, DEC 12 at 11:59PM.</b> 
    <br>Late submissions will NOT be graded. 
    <br>
    <br>Remember to save and submit frequently!
</div>

## Scraping university websites for email addresses of department heads

Suppose you want to collect the email addresses of all department heads at US universities so that you can send them a survey. You could do this manually, but with hundreds of websites that would be too labor-intensive. (*Instructor note: This was part of an actual project my team and I worked on.*)

For this final in-class assignment, you will do some simple web scraping. You will scrape email addresses from the four university websites listed in the CSV file below. Please read the IMPORTANT INSTRUCTIONS AND HINTS section before you start.

## IMPORTANT INSTRUCTIONS AND HINTS
* Your final product should be a DataFrame that looks like this (showing the first ten rows):

![Image of first ten rows of results](https://www3.nd.edu/~jng2/df_emails_head10.png)

* That is, you only need to populate two columns: one for the email addresses, and another that stores the URL of the website each email address came from. Do not spend time scraping anything else.

* Your final DataFrame should have **231 rows** and, as mentioned above, the two columns. Watch out for duplicate entries.

* `pandas`, `BeautifulSoup` and `requests` are all you need for this assignment. I strongly discourage you from using anything else.

* Because all I am asking for are email addresses, code that you write for one website should also work without any trouble on the other websites. Therefore, pick one website and write and test your code on it. Once your code works well on that website, it should work equally well on the other three.
* There may be some irrelevant email addresses, such as webmaster@website.com. Do not spend time trying to detect and remove them; just leave them in your final DataFrame.
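
To illustrate why one approach can carry over from site to site: email links in HTML are usually written as anchors whose `href` starts with `mailto:`, so a single CSS selector can find them regardless of the page layout. The sketch below is only an illustration under that assumption. The HTML snippet and the `example.edu` addresses are made up and are not taken from any of the four websites.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded department page.
sample_html = """
<ul>
  <li><a href="mailto:chair.one@example.edu">Chair One</a></li>
  <li><a href="mailto:chair.two@example.edu">Chair Two</a></li>
  <li><a href="mailto:chair.one@example.edu">Chair One (again)</a></li>
  <li><a href="https://example.edu/about">About</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select only the anchors whose href attribute starts with "mailto".
links = soup.select('a[href^="mailto"]')

# Keep everything after the first colon, i.e. drop the "mailto:" scheme.
emails = [a["href"].split(":", 1)[1] for a in links]

# Remove repeated addresses while preserving their original order.
unique_emails = list(dict.fromkeys(emails))
print(unique_emails)
```

On a real page you would pass the text returned by `requests.get(url)` to `BeautifulSoup` instead of `sample_html`, and store the addresses in a DataFrame rather than a list.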

In [2]:
# SETUP. RUN BUT DO NOT CHANGE. 
# These should be all the packages you need for this assignment.
import pandas as pd
from bs4 import BeautifulSoup
import requests

# The csv file below contains the four websites to scrape. 
websites = pd.read_csv("http://www3.nd.edu/~jng2/dept_urls_small.csv")
websites

| | dept_url |
|---|---|
| 0 | https://provost.nd.edu/about/deans-and-chairpe... |
| 1 | https://provost.uchicago.edu/deans-and-departm... |
| 2 | https://as.virginia.edu/department-and-academi... |
| 3 | https://www.purdue.edu/provost/heads/dh_list.html |


In [3]:
pd.set_option('display.max_rows', 500)  
pd.set_option('display.max_columns', 500)  

In [74]:
# YOUR CODE HERE
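def get_emails(url):
    """Return a DataFrame of the mailto addresses found on the page at `url`."""
    req = requests.get(url)
    page_soup = BeautifulSoup(req.text, 'html.parser')
    email_list = []
    # Anchors whose href starts with "mailto" carry the email addresses.
    for link in page_soup.select('a[href^=mailto]'):
        try:
            # Expect exactly "mailto:<address>"; any other shape raises ValueError.
            _, address = link['href'].split(':')
        except ValueError:
            continue  # skip a malformed link instead of abandoning the rest of the page
        email_list.append(address)
    # One copy of the source URL per address (a list, not one long repeated string).
    urls = [url] * len(email_list)
    results = pd.DataFrame(
        {'email': email_list,
         'dept_url': urls
        })
    return results

# Stack the per-site results, then drop repeated (email, dept_url) rows.
final = pd.concat(get_emails(url) for url in websites['dept_url']).reset_index(drop=True)
final.drop_duplicates(inplace=True)
final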

| | email | dept_url |
|---|---|---|
| 0 | Michael.N.Lykoudis.1@nd.edu?cc=Barbara.A.Panzi... | https://provost.nd.edu/about/deans-and-chairpe... |
| 1 | mustillo.5@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 2 | Dianne.M.Pinderhughes.1@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 3 | Erika.Doss.2@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 4 | afuentes@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 5 | rgray@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 6 | lgrillo@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 7 | Yongping.Zhu.46@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 8 | wevans1@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |
| 9 | jlander@nd.edu | https://provost.nd.edu/about/deans-and-chairpe... |


In [75]:
len(final)

231
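
Row 0 of the output above still carries a `?cc=...` suffix, because a `mailto:` link may include extra fields after a question mark. If a clean address were wanted, one possible approach is sketched below using only string methods. The helper name `clean_mailto` and the `example.edu` addresses are hypothetical, and the assignment does not say whether the expected 231 rows were counted before or after this kind of cleanup, so treat it as optional.

```python
def clean_mailto(href):
    """Return the address part of a mailto: href, without any ?query suffix."""
    address = href.split(":", 1)[-1]    # drop the "mailto:" scheme
    address = address.split("?", 1)[0]  # drop ?cc=..., ?subject=..., and similar fields
    return address.strip()


# Hypothetical link shaped like the one behind row 0.
print(clean_mailto("mailto:first.person@example.edu?cc=second.person@example.edu"))
```

Splitting with a limit of 1 keeps the helper from failing on links that contain more than one colon or question mark.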