# Live Assignment 5
## DS 6001: Practice and Application of Data Science
### Drew Haynes (rbc6wr)

Data science is a new field, especially compared to long-established academic disciplines like physics, mathematics, and medicine. So it is a lot harder to know the landscape of important journals, programs, and conferences. At the UVA School of Data Science, we are hoping to send our faculty, staff, and students to more conferences to learn cutting edge developments in the field, build a wider network, recruit new students and faculty, and get the word out about our school. One issue is that the phrase "data science" is still very trendy and that makes it difficult to know worthwhile conferences from ones that are sort of scammy. We are working to build out a comprehensive database on data science conferences. But before we take on the challenge of collecting all of the data ourselves, we need to see what databases already exist.

<img src="https://comicsandcartridges.com/wp-content/uploads/2019/04/comic-book-conventions.jpg"
     alt="Figure 12.1"
     width="600" />
     
Does San Diego Comic Con count as a Data Science conference? Source: https://comicsandcartridges.com/7-geeky-conventions-every-comic-book-fan-should-attend/

One option is the conference database on the website of the World Academy of Science, Engineering and Technology: https://waset.org/machine-learning-conferences I would like to collect all of the conferences that are listed here, and organize them in pandas dataframe. Unfortunately, this website does not appear to have any public API, so we have to resort to webscraping.

**Goal 1**: scrape important following features from one conference's page (for example: https://waset.org/aeronautics-and-astronautics-conference-in-february-2022-in-paris

**Goal 2**: Collect all URLs for all conferences listed on World Academy of Science, Engineering and Technology webpage from February 2022, and build a Spider that scrapes the data from Goal 1 for all these conferences

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

In [2]:
soup("http://www.google.com")



<html><body><p>http://www.google.com</p></body></html>

## Robots.txt

https://waset.org/robots.txt

User-agent: Googlebot
Allow: /

User-agent: *
Disallow:

User-agent: ia_archiver
Disallow: /

Sitemap: https://waset.org/sitemaps/index.xml
Sitemap: https://publications.waset.org/sitemaps/index_publications.xml



In [2]:
url = 'https://waset.org/archaeology-conference-in-march-2022-in-paris'
my_headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 rbc6wr@virginia.edu'}
r = requests.get(url, headers = my_headers)
r

<Response [200]>

In [3]:
conf = soup(r.text, 'html.parser')

In [4]:
conf_title = conf.find_all('title')[0].string # or .text
conf_title

'International Conference on Archaeology ICA in March 2022 in Paris'

In [5]:
conf.find_all("td", "textright")[3].string

AttributeError: ResultSet object has no attribute 'string'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [7]:
conf_date = conf.find_all("h2", "text-center")[0].string.split(' in ')[0]
conf_loc = conf.find_all("h2", "text-center")[0].string.split(' in ')[1]

In [8]:
conf.find('td',text='Conference Dates').findNext().findNext().string

'March 28-29, 2022'

In [9]:
conf_topics = conf.find_all("div", "tab-pane fade", id='nav-topics')[0].div.text

In [10]:
conf_dict = {'Conference title': [conf_title],
           'Conference date' : [conf_date],
           'Conference location' : [conf_loc],
            'Conference topics' : [conf_topics]}
pd.DataFrame(conf_dict)

Unnamed: 0,Conference title,Conference date,Conference location,Conference topics
0,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...


In [11]:
def scrape_one_conference(url):
    r = requests.get(url, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 rbc6wr@virginia.edu'})
    conf_title = conf.find_all('title')[0].string # or .text
    conf_date = conf.find_all("h2", "text-center")[0].string.split(' in ')[0]
    conf_loc = conf.find_all("h2", "text-center")[0].string.split(' in ')[1]
    conf_topics = conf.find_all("div", "tab-pane fade", id='nav-topics')[0].div.text
    conf_dict = {'Conference title': [conf_title],
           'Conference date' : [conf_date],
           'Conference location' : [conf_loc],
            'Conference topics' : [conf_topics]}
    return pd.DataFrame(conf_dict)
    

In [12]:
scrape_one_conference(url)

Unnamed: 0,Conference title,Conference date,Conference location,Conference topics
0,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...


In [13]:
conf_url = 'https://waset.org/conferences-in-march-2022-in-paris'
r = requests.get(conf_url, headers = my_headers)

In [15]:
conflist = soup(r.text, 'html.parser')
my_list = conflist.find_all("a", title=True, class_=False)[2:]

In [93]:
url_list = [m['href'] for m in my_list]

In [96]:
conf_df = pd.DataFrame()
for u in url_list:
    print(u)
    one_conf = scrape_one_conference(u)
    conf_df = conf_df.append(one_conf)
    #conf_df = pd.concat([conf_df,one_conf], join = 'outer', axis = 0) # as an alternative to append

https://waset.org/archaeology-conference-in-march-2022-in-paris
https://waset.org/aerospace-avionics-conference-in-march-2022-in-paris
https://waset.org/aluminum-alloys-and-alloy-design-conference-in-march-2022-in-paris
https://waset.org/aircraft-aerodynamics-aerodynamic-development-and-testing-conference-in-march-2022-in-paris
https://waset.org/advanced-applications-of-cartography-conference-in-march-2022-in-paris
https://waset.org/advanced-aviation-composites-and-technology-conference-in-march-2022-in-paris
https://waset.org/aerospace-and-aeronautical-engineering-conference-in-march-2022-in-paris
https://waset.org/advances-in-algebraic-informatics-conference-in-march-2022-in-paris
https://waset.org/aviation-administration-and-management-conference-in-march-2022-in-paris
https://waset.org/applications-of-advanced-porous-materials-conference-in-march-2022-in-paris
https://waset.org/acting-and-acting-techniques-conference-in-march-2022-in-paris
https://waset.org/advances-in-animal-welfa

In [97]:
conf_df.reset_index(drop=True)

Unnamed: 0,Conference title,Conference date,Conference location,Conference topics
0,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
1,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
2,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
3,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
4,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
...,...,...,...,...
95,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
96,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
97,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
98,International Conference on Archaeology ICA in...,"March 28-29, 2022","Paris, France",\nArchaeology\r\nArchaeo-chronometry \r\nArcha...
