# CAS Big Data, Computational Methods, and Programming Summer Camp

### Day Three
Today, we will use Python to collect data from websites using two different methods. The first is to use webscraping tools. These read the actual code of a website and extract the things we ask them to. This is a great way to collect information that you want to analyze or preserve, but it does require a bit of tinkering as well as some understanding of how HTML works.

The second way we will use is through accessing an API. A good way to think about an API is to picture websites as an access point to a data storage unit. When you load a website, like The Washington Post, the stories and links that you see are pulled from a server where that data is saved. Social media functions in a similar way. The pictures you see on Instagram are not part of the code of the webpage, but are pulled from a server as you scroll.

An API allows us to access the server directly without having to scrape a website. This is useful as many websites you might be interested in do not allow webscraping on their pages. However, it does require access credentials, can be very costly, and is limited to what the company will give you access to.

### Web scraping

To start webscraping, we will use requests and then beautifulsoup.

In [None]:
import requests

In [None]:
help(requests)

Requests lets us collect the HTML code from websites. First we will get the data from Wikipedia using requests.get. Then we will check the status code.

In [None]:
r = requests.get("https://en.wikipedia.org/wiki/Main_Page")
r.status_code

In [None]:
r = requests.get("https://www.govtrack.us/congress/members/current")
r.status_code

Some websites do not allow "robots" to access their sites. Look at the following example and see the status code we get.

In [None]:
r = requests.get("https://www.lansingcitypulse.com")
r.status_code

Status codes:
200 = OK

Codes that start with 4 usually point to a problem with your request, or to something you are trying to do but are not allowed to.

- 400 = Bad Request (your request is likely malformed)
- 401 = Unauthorized (the page requires a login, and would in a browser too)
- 403 = Forbidden (website is blocking requests, you can try a header, described later)
- 404 = Not Found (the page doesn't exist at that URL)
- 408 = Request Timeout (the server gave up waiting for the request)

Codes that start with 5 usually point to a problem with the website's server rather than your code. Try loading the webpage in your browser.
- 500 = Internal Server Error (problem with the website server)
- 502 = Bad Gateway (problem with server)
- 503 = Service Unavailable (problem with the server)

You can find more errors here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
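
The groupings above can be turned into a small helper. This is a minimal sketch using Python's built-in `http.HTTPStatus`; the function name `describe_status` and the wording of its messages are illustrative assumptions, not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase   # e.g. "Not Found" for 404
    except ValueError:
        phrase = "Unknown status"          # code is not in the standard list
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error (check your request or permissions)"
    elif 500 <= code < 600:
        kind = "server error (try the page in your browser)"
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"

for code in (200, 403, 404, 503):
    print(describe_status(code))
```

You could pass `r.status_code` to a helper like this after any `requests.get` call.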


Sometimes you can get around a block like this by sending headers, which tell the website you are a browser and not a script. However, this does not work on every website, especially those protected by Cloudflare. There are sometimes workarounds for Cloudflare, but they are complicated and we will not go over them in this camp.

In [None]:
r = requests.get("https://www.lansingcitypulse.com")
r.status_code

In [None]:
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get('https://www.lansingcitypulse.com', headers=headers)
r.status_code

#### Extracting Things

In [None]:
from bs4 import BeautifulSoup
import requests
import random
import re

Requests allows us to get the HTML of a website, but we need a package called beautifulsoup to work with it. Above we imported BeautifulSoup from the bs4 package; here we will use it to look at the Wikipedia main page.

In [None]:
help(BeautifulSoup)

In [None]:
res = requests.get("https://en.wikipedia.org/wiki/Main_Page")

In [None]:
soup = BeautifulSoup(res.text, 'html.parser')

In [None]:
soup

In [None]:
soup.get_text()

In [None]:
print(soup.get_text())

HTML Tags:
- \<html\> Wraps the whole page, opening at the start and closing at the end.
- \<head\> Where the page resources are loaded, like the title, what is saved when you bookmark, etc.
- \<body\> Where the site content actually is.
- \<div\> Division tag, a block of the page. Divs make it easier for the page owner to swap out content, so the main content you want is usually inside one.
- \<h1\> Header tag, also h2, h3, etc.
- \<p\> Paragraph tag, this is where the text content usually lives.

There are many other tags as well; these are just the most common.

#### An example using a fake HTML page

In [None]:
html_doc = """
<html>
<body>
<p> This is a paragraph tag. </p>
<p> This is another paragraph tag. </p>
<a href="https://website.org"> This is the text of the anchor tag</a>
<a href="https://website2.org"> More text for our anchor</a>
<span class="blue"> This is a span tag with the class blue </span>
<span class="blue"> This is another span tag with the class blue </span>
<span class="red"> This is a span tag with the class red </span>
</body>
</html>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

paras = soup.find_all('p')

In [None]:
paras

In [None]:
for i in paras:
    print(i.text)

In [None]:
links = soup.find_all('a')

In [None]:
print(links)

In [None]:
links[0].get_text()

In [None]:
linksText2 = []
for i in links:
    linksText2.append(i.get_text())
linksText2

In [None]:
linksText = [link.get_text() for link in links]

In [None]:
print(links)

In [None]:
print(linksText)

In [None]:
spans = soup.find_all('span', {'class' : 'blue'})
print(spans)

In [None]:
spansText = [span.get_text() for span in spans]

In [None]:
print(spans)

In [None]:
print(spansText)

#### Practice

In [None]:
html_doc = """
<html>
<body>
<title> This is the title tag! </title>
<td> This is a table cell tag</td>
<li> li is for a list item tag </li>
<div id="id one"> This is a divider or section tag with the id of one</div>
<div id="id two"> This is a divider or section tag with the id of two</div>
<p> This is a paragraph tag. These are usually used for the actual text in the website.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

Run the cell above and then write some code to extract the title from the HTML.

Now extract the table tag

Now the list tag

Challenge: Extract the div tag with the id one. Here you will look for "div", id = ...

#### Getting links or text
The next set of code cells gets a list of links from a website and then collects the text attached to each link. There are many ways to do this; this is one that works for the website in question. You will need to adapt the approach for other websites.

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

headers = {'user-agent': 'my-app/0.0.1'} #This is another way you can do headers.

url = 'https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
r = requests.get(url, headers = headers)
soup = bs(r.content, 'html.parser')

Here we are going to use regular expressions to search for text that we care about. Regular expressions are like search terms that help you constrain what you are looking for. For instance, if you wanted to search for the words "smile", "smiles", and "smiling", you could write code that would look for all three of those. You could also search for "smil\S+", where \S+ is a regular expression that matches one or more non-whitespace characters, meaning it would find all three words.
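
To see the "smil\S+" pattern in action, here is a minimal sketch using Python's built-in `re` module. The sample sentence is made up for illustration.

```python
import re

# A made-up sentence containing the three target words plus two near misses
text = "She will smile today. He smiles often. They were smiling all day. A small smirk."

# \S+ means "one or more non-whitespace characters"
matches = re.findall(r"smil\S+", text)
print(matches)

# \S+ also swallows punctuation attached to the word; \w+ stops at it
with_punctuation = re.findall(r"smil\S+", "Keep smiling.")
letters_only = re.findall(r"smil\w+", "Keep smiling.")
print(with_punctuation, letters_only)
```

Note that "small" and "smirk" are skipped because they do not begin with "smil".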

In [None]:
links = soup.find_all('a')
links

In [None]:
links = soup.find_all('a', attrs={'href': re.compile("^https://")})

l = []
for i in links:
    l.append(i['href'])
l[:10]

In [None]:
links

In [None]:
links[45].text

In [None]:
links = soup.find_all('a', attrs={'href': re.compile("^https://")})

l = []
for i in links:
    l.append(i.text)
l[:10]

In [None]:
link = []
text = []

links = soup.find_all('a', attrs={'href': re.compile("^https://")})
for i in links:
    link.append(i['href'])
    text.append(i.text)
    
df = pd.DataFrame()

df['Text'] = text
df['URL'] = link
df.head(10)


In [None]:
df['dup'] = df.duplicated(subset=['Text'])
df.head()

In [None]:
df['dup'].value_counts()

In [None]:
df1 = df.drop_duplicates(subset=['Text'])

In [None]:
print(df.shape)
print(df1.shape)

In [None]:
df1.head(10)

In [None]:
link = []
text = []

links = soup.find_all('a', attrs={'href': re.compile("^https://")})
for i in links:
    link.append(i['href'])
    text.append(i.text)
    
df = pd.DataFrame()

df['Text'] = text
df['URL'] = link

df['Text'] = df['Text'].replace('\t','',regex = True)
df['Text'] = df['Text'].replace('\n','',regex = True)
df.head(10)

In [None]:
df['dup'] = df.duplicated(subset=['Text'])
df.head()
df1 = df.drop_duplicates(subset=['Text'])
print(df.shape)
print(df1.shape)
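
Cleaning the text before dropping duplicates matters because "\n\tHome\n" and "Home" only count as the same text once the whitespace is removed. The idea can be sketched in plain Python without pandas. The link pairs below are hypothetical, and this version collapses whitespace into single spaces rather than deleting it outright.

```python
import re

def clean_text(raw):
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

# Hypothetical (text, url) pairs standing in for scraped links
pairs = [
    ("\n\tHome\n", "https://example.org/"),
    ("About  us", "https://example.org/about"),
    ("Home", "https://example.org/index"),
]

seen = set()
unique_pairs = []
for raw_text, url in pairs:
    text = clean_text(raw_text)
    if text not in seen:   # keep only the first occurrence of each text
        seen.add(text)
        unique_pairs.append((text, url))

print(unique_pairs)
```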

#### Another example using Wikipedia

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

In [None]:
url = "https://en.wikipedia.org/wiki/Mira_Potkonen"

r = requests.get(url)
if r.status_code == requests.codes.ok:
    print(r.text[:200])

In [None]:
page = r.content
soup = BeautifulSoup(page, 'html.parser')
soup

This looks for the title tag

In [None]:
soup.title

Just the text of the title tag.

In [None]:
soup.title.text

First large header tag

In [None]:
soup.find('h1')

In [None]:
soup.find('h1').text

All anchor tags (links)

In [None]:
soup.find_all('a')

#### Try it out!

Write some code that selects the first table (table)

Write some code that selects all the items in lists on the page (li)

Write some code that selects the text within the first paragraph (p)

[Challenge] Use the Developer Tools in your browser to select some text on the website. Write code that will find that text.

#### Downloading multiple web pages
- First, we'll grab all the URLs to medalist Wikipedia pages.

In [None]:
url = "https://en.wikipedia.org/wiki/Finland_at_the_Olympics"
r = requests.get(url)

In [None]:
page = r.content

soup = BeautifulSoup(page, 'html.parser')
soup

We want to find the tables, specifically the one with the Summer Olympic medalists. It is the 9th table, but you would normally need to do a bit of trial and error to find it.

In [None]:
tables = soup.find_all('table')
tables
table = tables[8]

In [None]:
table.text[:200]

Now we are going to select the rows in the table and get the headers for the table, which is the first row.

In [None]:
rows = table.find_all('tr')
rows[0].text

We can also look at the last row to find our friend Mira. From there, look for the second cell in the row. The td tag marks a cell in a table.

In [None]:
cells = rows[-1].find_all('td')
cells[1]

In [None]:
cells[1].text

Now we can do something a bit more complicated. We are going to make an empty list and then loop through each row to find the link to each athlete's personal page.

In [None]:
rows = table.find_all('tr')

In [None]:
rows

In [None]:
links_to_athletes = []

for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        if (cells[1].find('a')):
            link_to_athlete = cells[1].find('a')['href']
            links_to_athletes.append(link_to_athlete)

In [None]:
cells

In [None]:
links_to_athletes

Now we can turn those strings into full URLs by adding the site's address to the front of each one.

In [None]:
links_to_athletes = ['https://en.wikipedia.org' + i for i in links_to_athletes]
links_to_athletes[0:10]
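
Joining the strings with `+` works here because every link starts with `/wiki/`. As an alternative sketch, the standard library's `urljoin` handles the same case and also leaves links that are already complete untouched. The paths below are illustrative examples, not output from the table.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

# Illustrative relative links like the ones found in the table cells
relative_links = ["/wiki/Mira_Potkonen", "/wiki/Paavo_Nurmi"]

full_links = [urljoin(base, link) for link in relative_links]
print(full_links)

# A link that is already a full URL passes through unchanged
already_full = urljoin(base, "https://example.org/page")
print(already_full)
```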

Next, we'll visit each of the pages, find the birthplace of each athlete, and add it to our list. We use soup.find to select the first span element that has the class attribute "birthplace".

In [None]:
data = []

for athlete_page in links_to_athletes:
    r = requests.get(athlete_page)
    page = r.content
    
    soup = BeautifulSoup(page, 'html.parser')
    
    birthplace = soup.find("span", {"class": "birthplace"})
    
    if birthplace:
        athlete_info = {}
        athlete_info['name'] = soup.find('h1').text
        athlete_info['birthplace'] = birthplace.text
        data.append(athlete_info)
    

    time.sleep(1)

In [None]:
data

### Check this data out in pandas

In [None]:
import pandas as pd
df = pd.DataFrame(data)
df

We can select rows that contain athletes born in Helsinki

In [None]:
df[df.birthplace=="Helsinki, Finland"]

### Try it out!

Select the row that contains our pal Mira

In [None]:
df[df.name=="Paavo Nurmi"]

Use value_counts to see how many times each person appears in the dataframe.

#### Yet another example

In [None]:
import requests
from bs4 import BeautifulSoup as bs
headers = {'user-agent': 'my-app/0.0.1'}

url = 'https://time.com/6110450/kalamazoo-foundation-for-excellence/'
r = requests.get(url, headers = headers)
soup = bs(r.content, 'html.parser')

In [None]:
soup.text

In [None]:
pList = soup.find_all('p')
len(pList)

In [None]:
print(pList)

In [None]:
myString = ""
for i, para in enumerate(pList):
    myString = myString + "\n\n" + para.text
print(myString)

In [None]:
soupText = re.findall("residents",soup.text)
print(soupText)

In [None]:
len(soupText)
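
A plain search for "residents" is case-sensitive and misses the singular form. The sketch below compares it with a slightly more flexible regular expression. The paragraphs are made up and stand in for the text of each p tag.

```python
import re

# Made-up paragraphs standing in for the text of each <p> tag
paragraphs = [
    "Residents voted on the plan.",
    "Many residents supported it.",
    "One resident disagreed.",
]

# Join the paragraphs with blank lines between them
article = "\n\n".join(paragraphs)

# Exact, case-sensitive search
exact = re.findall("residents", article)

# \b marks a word boundary, s? makes the final "s" optional,
# and re.IGNORECASE matches "Residents" as well as "residents"
flexible = re.findall(r"\bresidents?\b", article, flags=re.IGNORECASE)

print(len(exact), len(flexible))
```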

### Practice

Go to Time Magazine's website and find an article you like. Then count how many paragraphs there are. We did this just above.

Save those paragraphs as an object.

Look for a word that you might care about. For an extra challenge see if you can use a regular expression in your search.

## APIs

As mentioned above, APIs allow us to access the database behind what we see on a website. For our examples, we are going to use the Wikipedia API and NewsAPI.

### Example using Wikipedia's API

There is also a package for Wikipedia's API. This does not require credentials, which is great for us. We do need to install the package first, then import it.

In [None]:
pip install wikipedia-api

In [None]:
import wikipediaapi

In [None]:
# Set up a unique user agent string
# user_agent = 'MyWikipediaBot/1.0 (myemail@example.com)'

# Initialize the Wikipedia object with the specified user agent
# wiki_wiki = wikipediaapi.Wikipedia(
#    language='en',
#    user_agent=user_agent,
#    extract_format=wikipediaapi.ExtractFormat.WIKI
#)

We first need to set the language we want and then we can ask the API to give us a page.

In [None]:
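# Initialize the Wikipedia object (recent versions of wikipedia-api require a user agent)
wikiPage = wikipediaapi.Wikipedia(user_agent='MyWikipediaBot/1.0 (myemail@example.com)', language='en')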

# Retrieve a Wikipedia page
pagePy = wikiPage.page('Python_(programming_language)')

# Display the first 400 characters of the page text
print(pagePy.text[:400])

We can see what attributes are in our object.

In [None]:
dir(pagePy)

This will give us the title and summary of the page.

In [None]:
pagePy = wikiPage.page('Python_(programming_language)')

print("Page - Title: %s" % pagePy.title)
print("Page - Summary: %s" % pagePy.summary[0:60])

All of the text of the page.

In [None]:
# Retrieve a Wikipedia page
pWiki = wikiPage.page("Michigan State University")

# Display the first 400 characters of the page text
print(pWiki.text[:400])

In [None]:
def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)

In [None]:
print_sections(pWiki.sections)

In [None]:
section_history = pWiki.section_by_title('Campus')
print("%s - %s" % (section_history.title, section_history.text[0:140]))

In [None]:
pWiki.links

In [None]:
def print_links(page):
    links = page.links
    for title in sorted(links.keys()):
        print("%s: %s" % (title, links[title]))

print_links(pWiki)

In [None]:
# Retrieve the Wikipedia page for "1920"
page_1920 = wikiPage.page('1920')

# Get all sections by title "January"
sections_january = page_1920.sections_by_title('January')

# Display information about each section in "January"
for s in sections_january:
    print(f"* {s.title} - {s.text[0:140]}")

Our second example uses NewsAPI. This API gives us access to news articles about topics we might care about. The basic version is free (though limited). It does require an API key, which you can get here: https://newsapi.org

API calls are, at a basic level, requests just like the ones we made above. However, here we build a URL that describes the data we want from the server.

In [None]:
import requests

An API call usually consists of the base URL, followed by a ?, and then the search you are executing plus your credentials, written as name=value pairs joined by &. There is variation here, but that's fairly standard.

In [None]:
# Replace YOUR_API_KEY with the key you received from newsapi.org
r = requests.get("https://newsapi.org/v2/top-headlines?country=us&apiKey=YOUR_API_KEY")
r.text
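
The pieces of the URL described above can also be assembled from a dictionary rather than typed by hand. This is a minimal sketch using the standard library; `YOUR_API_KEY` is a placeholder, and nothing is sent over the network.

```python
from urllib.parse import urlencode

base_url = "https://newsapi.org/v2/top-headlines"

# Search terms and credentials as name/value pairs (the key is a placeholder)
params = {"country": "us", "apiKey": "YOUR_API_KEY"}

# urlencode joins the pairs with & and escapes special characters
url = base_url + "?" + urlencode(params)
print(url)

# Spaces in a search term are escaped for us
query = urlencode({"q": "climate change"})
print(query)
```

Requests can do the same assembly for you: `requests.get(base_url, params=params)` builds an equivalent URL before sending the request.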

### API Python Wrapper

While that works, it isn't always the best way to get API data. Instead, we can use a wrapper, which is just a Python package. We need to install the NewsAPI Python wrapper. This is a shortcut to writing out requests but doesn't come preinstalled with Jupyter. The code to install the package is below. Once installed, we need to import the client class.

In [None]:
pip install newsapi-python

In [None]:
from newsapi import NewsApiClient

# Replace YOUR_API_KEY with the key you received from newsapi.org
newsapi = NewsApiClient(api_key='YOUR_API_KEY')

headLines = newsapi.get_everything(q='michigan',
                                          sources='abc-news',
                                          language='en')

In [None]:
len(headLines)

In [None]:
headLines

In [None]:
sources = newsapi.get_sources(country="us")
sources

In [None]:
sources

In [None]:
s = pd.DataFrame(sources)
s

In [None]:
headLines = newsapi.get_everything(q='michigan')

In [None]:
headLines

In [None]:
df = pd.DataFrame(headLines)
df

In [None]:
articles = headLines['articles']
s = pd.DataFrame(articles)
s

In [None]:
len(s)
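
It helps to know the shape of the response before converting it. A NewsAPI reply is a dictionary with three top-level keys (`status`, `totalResults`, and `articles`), which is why the list under `articles` converts to a DataFrame more cleanly than the whole response. The sketch below uses a hypothetical, shortened response with made-up articles to show how to pull values out.

```python
# Hypothetical response with the same top-level keys as a NewsAPI reply
sample_response = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"source": {"id": None, "name": "Example News"},
         "title": "First headline",
         "url": "https://example.org/1"},
        {"source": {"id": None, "name": "Sample Times"},
         "title": "Second headline",
         "url": "https://example.org/2"},
    ],
}

# len() on the whole response counts the top-level keys, not the articles
print(len(sample_response))

# The articles themselves are a list of dictionaries
articles = sample_response["articles"]
titles = [article["title"] for article in articles]
source_names = [article["source"]["name"] for article in articles]
print(titles)
print(source_names)
```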

We can add more to our API call if we want. Retrieve news articles by keyword and from specific sources:

In [None]:
headLines = newsapi.get_everything(q='michigan',
                                   language='en',
                                   from_param='2024-04-10',
                                   sources ='associated-press,abc-news'                                   
                                  )
articles = headLines['articles']
s = pd.DataFrame(articles)
s

In [None]:
type(articles)

List available news sources:

In [None]:
# Retrieve news sources in the US
sources_us = newsapi.get_sources(country='us')

# Extract sources and convert to DataFrame
df_sources_us = pd.DataFrame(sources_us['sources'])
print(df_sources_us.head(10))


Filter news by date range:

In [None]:
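# Retrieve articles about "climate change" from the past 30 days
from datetime import date, timedelta

start_date = (date.today() - timedelta(days=30)).isoformat()  # e.g. '2024-04-10'
headlines_climate = newsapi.get_everything(
    q='climate change',
    from_param=start_date,
    language='en'
)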

# Display the total number of results
print(f"Total Results: {headlines_climate['totalResults']}")

# Extract articles and convert to DataFrame
articles_climate = headlines_climate['articles']
df_climate = pd.DataFrame(articles_climate)
print(df_climate.head(5))


Pagination for large result sets:

In [None]:
# Retrieve the first page to estimate the total number of results
headlines_soccer_first_page = newsapi.get_everything(q='soccer', language='en')
total_results = headlines_soccer_first_page['totalResults']
print(f"Total Results: {total_results}")

# Calculate the number of pages (up to 100 results per page), rounding up
import math
paginate = min(math.ceil(total_results / 100), 5)  # Limiting to 5 pages for the example

# Retrieve articles across pages
articles_soccer = []
for i in range(1, paginate + 1):
    headlines_soccer_page = newsapi.get_everything(q='soccer', language='en', page=i)
    articles_soccer.extend(headlines_soccer_page['articles'])

# Convert to DataFrame
df_soccer = pd.DataFrame(articles_soccer)
print(df_soccer.head())
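
The page-count arithmetic can be checked on its own without calling the API. This is a minimal sketch; the function name `pages_needed` and its default values are assumptions chosen to mirror the example above.

```python
import math

def pages_needed(total_results, page_size=100, max_pages=5):
    """Return how many pages to request, rounded up and capped at max_pages."""
    if total_results <= 0:
        return 0
    return min(math.ceil(total_results / page_size), max_pages)

for total in (0, 1, 200, 250, 1000):
    print(total, pages_needed(total))
```

Rounding up means 250 results need three pages, while exactly 200 results need only two.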


#### Practice

Use the NewsAPI to collect some data. It can be anything you want. Print the headLines. 

Now retrieve articles from "BBC News" only and display the first 5 results.

Retrieve articles about "climate change" from the past 30 days and display the total results and the first 5 articles.

Retrieve a list of all available news sources in the US, and display the first 10.

## YouTube Data API

The YouTube Data API allows for video searches, channel statistics, and playlist management.

### Python Wrapper: google-api-python-client



In [None]:
pip install google-api-python-client

In [None]:
from googleapiclient.discovery import build

# Initialize the YouTube client with your API key
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # replace with your own key

# Search for videos related to "machine learning"
request = youtube.search().list(
    q='machine learning',
    part='snippet',
    type='video',
    maxResults=5
)
response = request.execute()

# Display the video titles and IDs
for item in response['items']:
    print(f"Video Title: {item['snippet']['title']}")
    print(f"Video ID: {item['id']['videoId']}\n")


### Retrieve all videos in a specified playlist

In [None]:
from googleapiclient.discovery import build

# Initialize the YouTube Data API client
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # replace with your own key

# Set the playlist ID
playlist_id = 'PLINj2JJM1jxNOvEFIABOBa6OmURYOxOk3'

# Get all videos in the specified playlist
request = youtube.playlistItems().list(
    playlistId=playlist_id,
    part='snippet',
    maxResults=5  # Adjust the number of results per page as needed
)
response = request.execute()

# Display the video titles and IDs
for item in response['items']:
    print(f"Video Title: {item['snippet']['title']}")
    print(f"Video ID: {item['snippet']['resourceId']['videoId']}\n")
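
The request above returns only one page of results. YouTube responses include a `nextPageToken` when more pages exist, so collecting every video means repeating the request until that token is missing. The sketch below shows the loop with a fake two-page response so it runs without an API key. The function `collect_all_items` and the `fetch_page` callable are hypothetical names used for illustration.

```python
def collect_all_items(fetch_page):
    """Gather items from every page.

    fetch_page(token) is assumed to return a dictionary with an 'items'
    list and, when more pages exist, a 'nextPageToken' string.
    """
    items = []
    token = None   # no token is needed for the first page
    while True:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:   # no token means this was the last page
            break
    return items

# Fake two-page response for demonstration
fake_pages = {
    None: {"items": ["video-1", "video-2"], "nextPageToken": "page-2"},
    "page-2": {"items": ["video-3"]},
}

all_items = collect_all_items(lambda token: fake_pages[token])
print(all_items)
```

With the real client, `fetch_page` would likely wrap the `playlistItems().list(...)` call shown above, passing the token through its `pageToken` parameter and calling `.execute()`. Keep in mind that each page counts against your API quota.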


### Retrieve channel information

In [None]:
# Specify the channel username (forUsername looks up a channel by username, not by ID)
channel_username = 'TaylorSwift'

# Get the channel information
request = youtube.channels().list(
    forUsername=channel_username,
    part='snippet,statistics'
)
response = request.execute()

# Display the channel information
for item in response['items']:
    print(f"Channel Title: {item['snippet']['title']}")
    print(f"Subscribers: {item['statistics']['subscriberCount']}")
    print(f"Total Views: {item['statistics']['viewCount']}\n")


### Retrieve Trending Videos

In [None]:
# Get trending videos in the US
request = youtube.videos().list(
    chart='mostPopular',
    regionCode='US',
    part='snippet,statistics',
    maxResults=5
)
response = request.execute()

# Display the video titles and statistics
for item in response['items']:
    print(f"Video Title: {item['snippet']['title']}")
    print(f"Views: {item['statistics']['viewCount']}")
    # likeCount is missing when a video hides its likes, so fall back to a label
    print(f"Likes: {item['statistics'].get('likeCount', 'hidden')}\n")


### Retrieve Channel's Playlists

In [None]:
# Specify the channel ID 
channel_id = 'UCqECaJ8Gagnn7YCbPEzWH6g'

# Get all playlists for the specified channel
request = youtube.playlists().list(
    channelId=channel_id,
    part='snippet',
    maxResults=5
)
response = request.execute()

# Display the playlist titles and IDs
for item in response['items']:
    print(f"Playlist Title: {item['snippet']['title']}")
    print(f"Playlist ID: {item['id']}\n")


### Retrieve Video Comments

In [None]:
# Specify the video ID 
video_id = 'nfWlot6h_JM'

# Get comments for the specified video
request = youtube.commentThreads().list(
    videoId=video_id,
    part='snippet',
    maxResults=5
)
response = request.execute()

# Display the comments
for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    print(f"Author: {comment['authorDisplayName']}")
    print(f"Comment: {comment['textOriginal']}\n")
