# Welcome to DSC 495 004 Data Wrangling and Web Scraping
***

1. Overview of research project
1. Syllabus
1. Discussion
1. Some data exploration
***

We will start by importing the libraries we need. Over the course of the semester we will make ample use of `urllib` and `BeautifulSoup`.

Many Python packages (and the programming language itself) have neat nomenclature origins. In this case, `BeautifulSoup` is named for the song sung by the Mock Turtle in Lewis Carroll's _Alice's Adventures in Wonderland_.

In [10]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [13]:
# Request the NC State homepage, then parse the raw HTML into a soup object.
html = urlopen("https://www.ncsu.edu/")
soup = BeautifulSoup(html.read(), 'html.parser')

`BeautifulSoup` takes the webpage and builds a tree of Python objects from it.
The first argument is the markup to parse: the HTML (HyperText Markup Language) or XML (Extensible Markup Language) contents of the page.
The second argument explicitly names the parser to use. Here it is `'html.parser'`, which ships with Python.
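
To make the "tree of Python objects" idea concrete, here is a minimal sketch that parses a small inline string instead of a live page. The `sample_html` snippet and the name `tree` are made up for illustration; they are not part of the NC State page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = (
    "<html><body>"
    "<h1>Hello</h1>"
    "<p class='intro'>First paragraph.</p>"
    "</body></html>"
)

# First argument: the markup. Second argument: the parser to use.
tree = BeautifulSoup(sample_html, "html.parser")

# Tags become attributes of the tree, and each tag knows its parent.
print(tree.h1.get_text())   # text inside the first <h1>
print(tree.p["class"])      # class attribute of the first <p>, as a list
print(tree.p.parent.name)   # name of the tag that contains the <p>
```

Accessing `tree.h1` or `tree.p` returns only the first matching tag, which is why searching with `find_all` becomes useful once a page has many tags of the same kind.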

We can see the HTML by pressing `Ctrl+Shift+I` (or `F12`) on a webpage, or by right-clicking and choosing Inspect. This opens the inspect tool in many web browsers.

We want to find out what today's feature columns are.

NCSU uses division tags to separate these columns on its webpage. These tags appear as `<div>` in the HTML.


In [83]:
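# soup.prettify()  # returns the whole page as an indented string
# Collect every <div> whose class is "feature-column".
divs = soup.find_all('div', {'class':'feature-column'})
print(divs[0])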

<div class="feature-column">
<a class="feature-block" data-ua-action="News Click" data-ua-cat="News Module" data-ua-label="https://news.ncsu.edu/2022/01/is-flurona-even-a-real-thing/" href="https://news.ncsu.edu/2022/01/is-flurona-even-a-real-thing/">
<div class="feature-txt">
<p class="feature-date">Jan 7, 2022</p>
<h3>Is ‘Flurona’ Even a Real Thing? What’s That All About?</h3>
<p>TL;DR: Not <span class="nowrap">really.<span class="glyphicon glyphicon-thin-arrow"></span></span></p>
</div>
</a>
</div>
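
The class filter used above can be tried without touching the live site. The sketch below assumes a made-up `sample_html` string and shows that the dictionary form and the `class_` keyword form of `find_all` select the same tags, and that a tag with several classes still matches on any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two feature columns and one unrelated div.
sample_html = """
<div class="feature-column"><h3>Article one</h3></div>
<div class="feature-column wide"><h3>Article two</h3></div>
<div class="footer"><h3>Not a feature</h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = sample_soup.find_all("div", {"class": "feature-column"})
by_keyword = sample_soup.find_all("div", class_="feature-column")

print([d.h3.get_text() for d in by_dict])
print([d.h3.get_text() for d in by_keyword])
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.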


***
Let's get the titles of the articles and the dates they were featured. Header tags have the form `<h#>`, where # sets the size of the header (1 is the largest).

We can see here that the titles are in `<h3>` and `<h4>` tags and the dates appear in paragraph tags `<p>` of class `feature-date`.

So let's get the titles and the dates.
***
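
As a rough sketch of targeting those two tags directly, the snippet below uses hypothetical markup modeled on the feature column printed earlier. The headline and summary text are placeholders, and `column`, `title_tag`, and `date_tag` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like one feature column from the page.
sample_html = """
<div class="feature-column">
  <div class="feature-txt">
    <p class="feature-date">Jan 7, 2022</p>
    <h3>Sample headline</h3>
    <p>Sample summary.</p>
  </div>
</div>
"""
column = BeautifulSoup(sample_html, "html.parser").find("div", class_="feature-column")

# find() returns the first matching tag inside the column, or None.
title_tag = column.find("h3")
date_tag = column.find("p", class_="feature-date")

# strip=True trims the whitespace around the text.
print(title_tag.get_text(strip=True))
print(date_tag.get_text(strip=True))
```

Filtering the `<p>` by class avoids relying on the date being the first paragraph in the column.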

In [84]:
for div in divs:
    print(div.get_text())




Jan 7, 2022
Is ‘Flurona’ Even a Real Thing? What’s That All About?
TL;DR: Not really.






Dec 15, 2021
Sweat-Powered Wearable Sensors Land NC State Researcher on Newsweek’s Inaugural ‘Greatest Disruptors’ List
Newsweek recognized Amay J. Bandodkar for his work on wearable battery-free sensors and skin-friendly wearable batteries. 






Dec 15, 2021
Fitts-Woolard Hall Named North Carolina ‘Public Project of the Year’
The project was recognized as one of top commercial developments across the state based on design, innovation and community impact.






Jan11


Tuesday1:15 PM
Dean of CVM Nomination Committee Meeting






Jan11


Tuesday7:00 PM
Talley Tuesday Takeover at Talley Student Union






Jan13


Thursday11:00 AM
Campus Community Center Open House at Talley Student Union






Jan14


Friday5:00 PM
Avery Bolden Art Exhibition: Black Girl Maverick at Witherspoon Student Center





In [96]:
for div in divs:
    print(div.h3)

<h3>Is ‘Flurona’ Even a Real Thing? What’s That All About?</h3>
<h3>Sweat-Powered Wearable Sensors Land NC State Researcher on Newsweek’s Inaugural ‘Greatest Disruptors’ List</h3>
<h3>Fitts-Woolard Hall Named North Carolina ‘Public Project of the Year’</h3>
None
None
None
None


***
**Holy dates, Batman** 

Something looks different here...

Some elements are actually calendar events, not articles! They have no `<h3>` tag, so `div.h3` returns `None` for them.

So, let's fix this. 
***

In [113]:
print(divs[0].prettify())
divs2 = soup.find_all("div", class_='feature-txt')
divs2

<div class="feature-column">
 <a class="feature-block" data-ua-action="News Click" data-ua-cat="News Module" data-ua-label="https://news.ncsu.edu/2022/01/is-flurona-even-a-real-thing/" href="https://news.ncsu.edu/2022/01/is-flurona-even-a-real-thing/">
  <div class="feature-txt">
   <p class="feature-date">
    Jan 7, 2022
   </p>
   <h3>
    Is ‘Flurona’ Even a Real Thing? What’s That All About?
   </h3>
   <p>
    TL;DR: Not
    <span class="nowrap">
     really.
     <span class="glyphicon glyphicon-thin-arrow">
     </span>
    </span>
   </p>
  </div>
 </a>
</div>



[<div class="feature-txt">
 <p class="feature-date">Jan 7, 2022</p>
 <h3>Is ‘Flurona’ Even a Real Thing? What’s That All About?</h3>
 <p>TL;DR: Not <span class="nowrap">really.<span class="glyphicon glyphicon-thin-arrow"></span></span></p>
 </div>,
 <div class="feature-txt">
 <p class="feature-date">Dec 15, 2021</p>
 <h3>Sweat-Powered Wearable Sensors Land NC State Researcher on Newsweek’s Inaugural ‘Greatest Disruptors’ List</h3>
 <p>Newsweek recognized Amay J. Bandodkar for his work on wearable battery-free sensors and skin-friendly wearable <span class="nowrap">batteries. <span class="glyphicon glyphicon-thin-arrow"></span></span></p>
 </div>,
 <div class="feature-txt">
 <p class="feature-date">Dec 15, 2021</p>
 <h3>Fitts-Woolard Hall Named North Carolina ‘Public Project of the Year’</h3>
 <p>The project was recognized as one of top commercial developments across the state based on design, innovation and community <span class="nowrap">impact.<span class="glyphicon glyphicon-thin-arr

Selecting on `feature-txt` also picks up the NC State visit text, and we don't want that. So instead, we just use logic: keep only the columns that contain an `<h3>`.

In [118]:
for div in divs:
    # Calendar events have no <h3>, so div.h3 is None for them; skip those.
    if div.h3 is not None:
        print(div.h3.get_text(), div.p.get_text())

Is ‘Flurona’ Even a Real Thing? What’s That All About? Jan 7, 2022
Sweat-Powered Wearable Sensors Land NC State Researcher on Newsweek’s Inaugural ‘Greatest Disruptors’ List Dec 15, 2021
Fitts-Woolard Hall Named North Carolina ‘Public Project of the Year’ Dec 15, 2021
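
Printing is fine for a quick look, but for later wrangling it may help to store the results. The sketch below applies the same `None` check to made-up markup and collects each article as a dictionary. The `<h4>` structure of the calendar column is an assumption, and `sample_divs` and `articles` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one article column and one calendar column.
sample_html = """
<div class="feature-column">
  <p class="feature-date">Jan 7, 2022</p>
  <h3>Sample article</h3>
</div>
<div class="feature-column">
  <h4>Sample calendar event</h4>
</div>
"""
sample_divs = BeautifulSoup(sample_html, "html.parser").find_all(
    "div", class_="feature-column"
)

articles = []
for div in sample_divs:
    title_tag = div.find("h3")
    date_tag = div.find("p", class_="feature-date")
    # Skip columns that lack either piece, such as calendar events.
    if title_tag is None or date_tag is None:
        continue
    articles.append(
        {
            "title": title_tag.get_text(strip=True),
            "date": date_tag.get_text(strip=True),
        }
    )

print(articles)
```

A list of dictionaries like this converts easily into a table later on.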
