# Web Scraping Project

In this project, we scrape the title, link, teaser, author, and date of articles from a news website (https://www.searchenginejournal.com/category/news/), covering 15 listing pages, and combine the results into one data frame.

In [None]:
# Importing packages

# The Requests package provides the ability to query a webpage’s HTML code via Python.

# BeautifulSoup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (the name comes from "tag soup"). It creates a parse tree for each parsed page that can be used to extract data from HTML, which is useful for web scraping.

# pandas provides the DataFrame that will hold the scraped records.

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [603]:
# Fetching the first listing page with the requests package
page = requests.get('https://www.searchenginejournal.com/category/news/')

In [604]:
# Creating a BeautifulSoup object for parsing HTML and XML documents
soup = BeautifulSoup(page.text, 'html.parser')

In [605]:
# Selecting the titles by tag ('a') and class ('title-anchor')
# The strip() method removes any leading and trailing whitespace.
titles = [title.text.strip() for title in soup.find_all('a', class_='title-anchor')]

In [606]:
titles

['OpenAI CEO Ouster: Sam Altman, Greg Brockman Post Statements',
 'Google Presentation May Change How We Think About Ranking',
 'Google Bard Updates Includes New Math And Data Visualization Features',
 'Google Unwraps New AI Tools To Deck Your Holiday Shopping',
 'Google’s Update Florida Offers Insights 20 Years Later',
 'Bing Employs GPT-4 To Write Custom Search Snippets',
 'Microsoft Copilot AI With Bing Will Use OpenAI GPTs And Plugins',
 'Instagram Adds New Ways To Create Content',
 'YouTube Creators Must Comply With New Rules For AI Content',
 'OpenAI Pauses New Subscriptions And Upgrades To ChatGPT Plus',
 'Google Launches “Notes” To Add User Comments In Search Results',
 'Google Alters Search Rankings To Prioritize First-Hand Knowledge',
 'Google Reveals Best & Worst Times For Holiday Travel',
 'Google Maps Introduces New Ways To Plan Travel & Navigate',
 'This Lawsuit Could Make Social Media Safer For Your Kids',
 'Airbnb Acquires GamePlanner.AI To Accelerate AI Projects',
 'Go

In [607]:
# Selecting the links: find every 'h2' tag whose class is 'h4 dark-link m-top-15 margin-bottom-0'
# and read the 'href' attribute of the 'a' tag inside it
link_lists = [h2.find('a')['href'] for h2 in soup.find_all('h2', class_='h4 dark-link m-top-15 margin-bottom-0')]

In [608]:
link_lists

['https://www.searchenginejournal.com/sam-altman-replaced-by-mira-murati-as-interim-ceo-at-openai/501582/',
 'https://www.searchenginejournal.com/googles-danny-sullivan-presentation/501558/',
 'https://www.searchenginejournal.com/google-bard-updates-includes-new-math-and-data-visualization-features/501524/',
 'https://www.searchenginejournal.com/google-unwraps-new-ai-tools-to-deck-your-holiday-shopping/501507/',
 'https://www.searchenginejournal.com/googles-update-florida-offers-insights-20-years-later/501482/',
 'https://www.searchenginejournal.com/bing-employs-gpt-4-to-write-custom-search-snippets/501474/',
 'https://www.searchenginejournal.com/microsoft-copilot-ai-with-bing-will-use-openai-gpts-and-plugins/501422/',
 'https://www.searchenginejournal.com/instagram-adds-new-ways-to-create-content/501436/',
 'https://www.searchenginejournal.com/youtube-creators-must-comply-with-new-rules-for-ai-content/501437/',
 'https://www.searchenginejournal.com/openai-pauses-new-chatgpt-plus-subsc
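
Passing the full string `'h4 dark-link m-top-15 margin-bottom-0'` to `class_` only matches tags whose class attribute is exactly that string, in that order. A CSS selector through `select()` may be a more tolerant alternative. The sketch below runs on a made-up HTML snippet that only imitates the class names used above, so it says nothing about the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names used above;
# the second heading lists its classes in a different order.
html = """
<h2 class="h4 dark-link m-top-15 margin-bottom-0"><a href="https://example.com/a/">A</a></h2>
<h2 class="dark-link h4 margin-bottom-0 m-top-15"><a href="https://example.com/b/">B</a></h2>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact class-string match: only the first heading qualifies.
exact = [h2.find("a")["href"]
         for h2 in soup_demo.find_all("h2", class_="h4 dark-link m-top-15 margin-bottom-0")]

# CSS selector: any h2 carrying both classes, in any order.
by_css = [a["href"] for a in soup_demo.select("h2.h4.dark-link a[href]")]

print(exact)
print(by_css)
```

Selecting on one or two stable classes rather than the whole attribute tends to survive small layout changes on the site.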

In [609]:
# Selecting the teasers by tag ('p') and class ('sej-art-desc')
teasers = [teaser.text.strip() for teaser in soup.find_all('p', class_='sej-art-desc')]
teasers

['Explore the latest updates about OpenAI’s decision to remove Sam Altman as CEO, with Mira Murati stepping in as leader in the interim.',
 'Danny Sullivan suggested in a presentation that Google’s published SEO guidance is not what we think it is',
 'The arrival of Google Bard’s newest features for teens coincided with the launch of Common Sense Media’s AI ratings system to evaluate generative AI safety.',
 'Google rolls out AI shopping tools to inspire gift ideas, visualize searches, and expand virtual try-on.',
 'The legacy of Google’s Update Florida continues to influence how we do SEO 20 years later',
 'Bing launches AI-powered captions for search results to provide more informative snippets and enhance the search experience.',
 'Discover the new Copilot AI Companion with Bing, which will soon utilize OpenAI GPTs for customization and plugins to boost personal and professional productivity.',
 'Instagram added new ways to find, edit and create content, filters for creatively expre

In [610]:
# Selecting the authors by tag ('p') and class ('sej-art-author')
# The replace() method replaces every occurrence of a substring. Here it removes 'By' from each author string.
# Note that this leaves the space that followed 'By', and it would also remove a 'By' inside a name.
authors = [author.text.strip().replace('By', '') for author in soup.find_all('p', class_='sej-art-author')]
authors

[' Kristi Hines',
 ' Roger Montti',
 ' Kristi Hines',
 ' Matt G. Southern',
 ' Roger Montti',
 ' Matt G. Southern',
 ' Kristi Hines',
 ' Roger Montti',
 ' Matt G. Southern',
 ' Kristi Hines',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Kristi Hines',
 ' Roger Montti',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Roger Montti',
 ' Roger Montti',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Matt G. Southern',
 ' Roger Montti',
 ' Kristi Hines',
 ' Roger Montti',
 ' Kristi Hines',
 ' Kristi Hines',
 ' Kristi Hines']
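
The leading space in each name above is what remains after `replace('By', '')`. A small helper that removes only a leading "By" label could look like the sketch below. `clean_author` is a hypothetical name, and the sample strings assume the raw text has the form "By <name>".

```python
import re

def clean_author(raw: str) -> str:
    """Drop a leading 'By' label and surrounding whitespace from an author string."""
    # The pattern is anchored at the start and requires whitespace after 'By',
    # so a name that merely starts with or contains 'By' is left untouched.
    return re.sub(r"^By\s+", "", raw.strip())

print(clean_author("By Kristi Hines"))
print(clean_author("  By Byron Example "))  # hypothetical name, for illustration
```

By contrast, `'By Byron Example'.replace('By', '')` would also cut the first two letters of the name.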

In [611]:
# Selecting the dates by tag ('span') and class ('entrydate')
publish_dates = [d.text.strip() for d in soup.find_all('span', class_='entrydate')]
publish_dates

['Nov 17, 2023',
 'Nov 17, 2023',
 'Nov 17, 2023',
 'Nov 16, 2023',
 'Nov 16, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 15, 2023',
 'Nov 14, 2023',
 'Nov 14, 2023',
 'Nov 14, 2023',
 'Nov 14, 2023',
 'Nov 14, 2023',
 'Nov 14, 2023',
 'Nov 13, 2023',
 'Nov 13, 2023',
 'Nov 13, 2023',
 'Nov 13, 2023',
 'Nov 13, 2023',
 'Nov 10, 2023',
 'Nov 10, 2023',
 'Nov 9, 2023',
 'Nov 9, 2023',
 'Nov 9, 2023']
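
The dates come back as plain strings. If they need to be sorted or filtered later, they can be converted to date objects. The sketch below assumes every date follows the "Nov 9, 2023" pattern seen in the output and that the process runs under an English (or C) locale. `parse_publish_date` is a hypothetical helper name.

```python
from datetime import datetime

def parse_publish_date(text: str):
    """Convert a string such as 'Nov 9, 2023' into a datetime.date."""
    # %b is the abbreviated month name and is locale dependent.
    return datetime.strptime(text.strip(), "%b %d, %Y").date()

sample = ["Nov 17, 2023", "Nov 9, 2023"]
parsed = [parse_publish_date(s) for s in sample]
print(parsed)
```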

In [612]:
# Creating a function that returns a dataframe for one page
def read_one_page(link):
    print(link)  # progress indicator
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    titles = [title.text.strip() for title in soup.find_all('a', class_='title-anchor')]
    link_lists = [h2.find('a')['href'] for h2 in soup.find_all('h2', class_='h4 dark-link m-top-15 margin-bottom-0')]
    teasers = [teaser.text.strip() for teaser in soup.find_all('p', class_='sej-art-desc')]
    # Remove the 'By' label, then strip the space it leaves behind
    authors = [author.text.strip().replace('By', '').strip() for author in soup.find_all('p', class_='sej-art-author')]
    publish_dates = [d.text.strip() for d in soup.find_all('span', class_='entrydate')]
    # All five lists must have the same length, otherwise pd.DataFrame raises a ValueError
    df = pd.DataFrame({'title': titles, 'link': link_lists, 'teaser': teasers, 'author': authors, 'date': publish_dates})
    return df
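
Building five independent lists works only while every article on the page has all five fields. If one teaser were missing, the lists would have different lengths and the rows could no longer be lined up. One possible alternative is to walk the page one article at a time. In the sketch below the `<article>` wrapper is an assumption made for illustration (the real page's container element has not been checked here); only the inner class names come from the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <article> wrapper is assumed for this sketch.
# The second article deliberately has no teaser.
html = """
<article>
  <h2><a class="title-anchor" href="https://example.com/one/">First title</a></h2>
  <p class="sej-art-desc">First teaser</p>
  <p class="sej-art-author">By Jane Doe</p>
  <span class="entrydate">Nov 17, 2023</span>
</article>
<article>
  <h2><a class="title-anchor" href="https://example.com/two/">Second title</a></h2>
  <p class="sej-art-author">By John Roe</p>
  <span class="entrydate">Nov 16, 2023</span>
</article>
"""

def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag else None

def parse_articles(markup):
    """Build one record per <article>, so fields stay aligned even if one is missing."""
    soup_demo = BeautifulSoup(markup, "html.parser")
    records = []
    for article in soup_demo.find_all("article"):
        anchor = article.find("a", class_="title-anchor")
        records.append({
            "title": anchor.get_text(strip=True) if anchor else None,
            "link": anchor.get("href") if anchor else None,
            "teaser": text_or_none(article, "p", "sej-art-desc"),
            "author": text_or_none(article, "p", "sej-art-author"),
            "date": text_or_none(article, "span", "entrydate"),
        })
    return records

records = parse_articles(html)
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`, with missing fields showing up as empty values instead of raising an error.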

In [613]:
# Creating the links for all the pages we want (let's say 15 pages)
# https://www.searchenginejournal.com/category/news/
# https://www.searchenginejournal.com/category/news/page/2/
# https://www.searchenginejournal.com/category/news/page/3/

all_page_links = ['https://www.searchenginejournal.com/category/news/']

links = [f"https://www.searchenginejournal.com/category/news/page/{k}/" for k in range(2, 16)]

# The extend() method adds the specified list elements (or any iterable) to the end of the current list.
all_page_links.extend(links)

all_page_links

['https://www.searchenginejournal.com/category/news/',
 'https://www.searchenginejournal.com/category/news/page/2/',
 'https://www.searchenginejournal.com/category/news/page/3/',
 'https://www.searchenginejournal.com/category/news/page/4/',
 'https://www.searchenginejournal.com/category/news/page/5/',
 'https://www.searchenginejournal.com/category/news/page/6/',
 'https://www.searchenginejournal.com/category/news/page/7/',
 'https://www.searchenginejournal.com/category/news/page/8/',
 'https://www.searchenginejournal.com/category/news/page/9/',
 'https://www.searchenginejournal.com/category/news/page/10/',
 'https://www.searchenginejournal.com/category/news/page/11/',
 'https://www.searchenginejournal.com/category/news/page/12/',
 'https://www.searchenginejournal.com/category/news/page/13/',
 'https://www.searchenginejournal.com/category/news/page/14/',
 'https://www.searchenginejournal.com/category/news/page/15/']
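
The same list can be produced by a small function so that the number of pages becomes a parameter. This is only a sketch of the URL pattern shown above, and `build_page_links` is a hypothetical helper name.

```python
def build_page_links(base_url: str, n_pages: int) -> list:
    """Return the listing URL for page 1 followed by pages 2..n_pages."""
    # Page 1 is the bare category URL; later pages append 'page/<k>/'.
    # range() excludes its upper bound, hence n_pages + 1.
    return [base_url] + [f"{base_url}page/{k}/" for k in range(2, n_pages + 1)]

demo_links = build_page_links("https://www.searchenginejournal.com/category/news/", 15)
print(len(demo_links))
print(demo_links[-1])
```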

In [614]:
# Mapping our function to our list
# The map() function executes a specified function for each item in an iterable. The item is sent to the function as a parameter.
list_of_dfs = list(map(read_one_page, all_page_links))

https://www.searchenginejournal.com/category/news/
https://www.searchenginejournal.com/category/news/page/2/
https://www.searchenginejournal.com/category/news/page/3/
https://www.searchenginejournal.com/category/news/page/4/
https://www.searchenginejournal.com/category/news/page/5/
https://www.searchenginejournal.com/category/news/page/6/
https://www.searchenginejournal.com/category/news/page/7/
https://www.searchenginejournal.com/category/news/page/8/
https://www.searchenginejournal.com/category/news/page/9/
https://www.searchenginejournal.com/category/news/page/10/
https://www.searchenginejournal.com/category/news/page/11/
https://www.searchenginejournal.com/category/news/page/12/
https://www.searchenginejournal.com/category/news/page/13/
https://www.searchenginejournal.com/category/news/page/14/
https://www.searchenginejournal.com/category/news/page/15/
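
`map()` calls the function on the pages back to back, and a single failed request stops the whole run. A plain loop leaves room for a pause between requests and for collecting failures. The sketch below is one way to do that, not part of the original notebook. `scrape_pages` and `fake_read_page` are hypothetical names, and the stand-in reader replaces `read_one_page` so the example runs without network access.

```python
import time

def scrape_pages(links, read_page, delay_seconds=1.0):
    """Call read_page on each link, pausing between calls and recording failures."""
    results, failed = [], []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay_seconds)  # pause so the server is not hit in a burst
        try:
            results.append(read_page(link))
        except Exception as exc:  # broad on purpose in this sketch
            failed.append((link, repr(exc)))
    return results, failed

# Stand-in for read_one_page; it fails on one URL to show the error path.
def fake_read_page(link):
    if link.endswith("/bad/"):
        raise ValueError("simulated failure")
    return {"link": link}

results, failed = scrape_pages(
    ["https://example.com/1/", "https://example.com/bad/", "https://example.com/2/"],
    fake_read_page,
    delay_seconds=0,
)
print(len(results), len(failed))
```

With the real function, the call would be `scrape_pages(all_page_links, read_one_page)`, and `results` would take the place of `list_of_dfs`.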


In [615]:
# Combining all pages into one dataframe
pd.concat(list_of_dfs, axis="rows").reset_index(drop=True)

| | title | link | teaser | author | date |
|---|---|---|---|---|---|
| 0 | OpenAI CEO Ouster: Sam Altman, Greg Brockman P... | https://www.searchenginejournal.com/sam-altman... | Explore the latest updates about OpenAI’s deci... | Kristi Hines | Nov 17, 2023 |
| 1 | Google Presentation May Change How We Think Ab... | https://www.searchenginejournal.com/googles-da... | Danny Sullivan suggested in a presentation tha... | Roger Montti | Nov 17, 2023 |
| 2 | Google Bard Updates Includes New Math And Data... | https://www.searchenginejournal.com/google-bar... | The arrival of Google Bard’s newest features f... | Kristi Hines | Nov 17, 2023 |
| 3 | Google Unwraps New AI Tools To Deck Your Holid... | https://www.searchenginejournal.com/google-unw... | Google rolls out AI shopping tools to inspire ... | Matt G. Southern | Nov 16, 2023 |
| 4 | Google’s Update Florida Offers Insights 20 Yea... | https://www.searchenginejournal.com/googles-up... | The legacy of Google’s Update Florida continue... | Roger Montti | Nov 16, 2023 |
| ... | ... | ... | ... | ... | ... |
| 445 | Google Debunks The “Index Bloat” Theory | https://www.searchenginejournal.com/google-deb... | Google’s John Mueller debunks the “Index Bloat... | Matt G. Southern | Jun 8, 2023 |
| 446 | How To Control Googlebot’s Interaction With Yo... | https://www.searchenginejournal.com/how-to-con... | Google’s Search Relations team provides insigh... | Matt G. Southern | Jun 8, 2023 |
| 447 | WWDC 2023: How Apple Could Revolutionize The W... | https://www.searchenginejournal.com/wwdc-2023-... | Discover Apple’s latest innovations at WWDC 20... | Kristi Hines | Jun 8, 2023 |
| 448 | Gmail Glitch Sending Newsletters To Spam, Mail... | https://www.searchenginejournal.com/gmail-glit... | An issue with Gmail is redirecting newsletters... | Matt G. Southern | Jun 8, 2023 |
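
The concatenation step can be tried on small stand-in frames. The sketch below uses made-up data rather than the scraped pages. It shows that `ignore_index=True` renumbers the rows from 0 in the same call, which should give the same index as `reset_index(drop=True)`, and it assigns the result to a variable so it can be reused.

```python
import pandas as pd

# Two tiny stand-in frames; the real list_of_dfs holds one frame per page.
df_a = pd.DataFrame({"title": ["t1", "t2"], "date": ["Nov 17, 2023", "Nov 16, 2023"]})
df_b = pd.DataFrame({"title": ["t3"], "date": ["Nov 15, 2023"]})

# Without ignore_index each frame would keep its own 0-based labels (0, 1, 0).
combined = pd.concat([df_a, df_b], ignore_index=True)

print(combined)
print(combined.shape)
```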
