
# Web Scraping Using Beautiful Soup

Chenoa Nussberger

### Beautiful Soup

Beautiful Soup is an open-source Python package used for web scraping.

As the name suggests, it takes messy web data, repairs badly formed HTML, and presents the result to us as an easily traversable tree structure that we can search and navigate.

#### First, let us install the package

In [None]:
# html5lib is the parser that the cells below pass to BeautifulSoup
!pip install beautifulsoup4 html5lib



### Importing required libraries/packages
First, we import the packages that are required:
- The `requests` package allows our Python script to communicate with websites and to 'request' information from those sites.
- The Beautiful Soup package, imported as `bs4`, takes the raw information from the websites and provides helpful functions to extract information.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import requests
import bs4
from urllib.request import urlopen
import shutil
import os

In [None]:
from bs4 import BeautifulSoup

#### Now that we have the required libraries, let us try something
##### Let us try to get the title of a Wikipedia page

In [None]:
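base_url = 'https://en.wikipedia.org/wiki/Jupiter'
# Send a GET request to the page
r = requests.get(base_url)
# Print the response object so that its status code is shown
print(r)
# Parse the returned HTML and print the page's <title> tag
soup = bs4.BeautifulSoup(r.text,'html5lib')
print(soup.title)

<Response [200]>
<title>Jupiter - Wikipedia</title>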


### Task: Try changing the base_url from 'https://en.wikipedia.org/wiki/Jupiter' to 'https://en.wikipedia.org/wiki/Jus21'.
What did the response code become?

These codes can help in troubleshooting and solving connection problems, but for now we just need to know that a response of 200 means that your GET request was successful!

<em>In case the link is not working for you, you may get a `<Response [404]>` from the code above, which means the page was not found.</em>
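To make the meaning of a response code easier to read, here is a minimal sketch that uses only Python's standard library. The helper name `describe_status` is an assumption made for illustration and is not part of `requests`; in the notebook you could pass it `r.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    # HTTPStatus raises ValueError for codes it does not know.
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

Codes in the 200 range mean success, the 400 range points to a problem with the request (such as a wrong URL), and the 500 range points to a problem on the server.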

### Now change the base_url back to 'https://en.wikipedia.org/wiki/Jupiter' from 'https://en.wikipedia.org/wiki/Jus21' to continue with the rest of the notebook.

Connection established! We have access!

### Using Beautiful Soup to get data from websites

Now we can use `bs4` to read the outputs from our request.

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.
*Note: This section of the Jupyter Notebook is written in 'Markdown' format.*

Much like our browser, bs4 is able to understand HTML and to read the `<tags>`.



See the code below!

In [None]:
# `r.text` contains the raw HTML returned when we made our GET request earlier.
# `'html5lib'` tells BeautifulSoup which parser to use when reading the HTML.
soup = bs4.BeautifulSoup(r.text,'html5lib')

Essentially, we are taking the raw response from the GET request, and asking BeautifulSoup to read and understand the response, and store all of this as your variable `soup`!
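To see what the parsed object gives us without making a network request, here is a small sketch that parses a hand-written HTML string. The string and the name `demo_soup` are made up for illustration, and the sketch uses Python's built-in `'html.parser'` instead of `'html5lib'` so that it runs without any extra install.

```python
from bs4 import BeautifulSoup

# A tiny, deliberately messy page: the <b> and <p> tags are never closed.
html = "<html><head><title>Demo page</title></head><body><p>Hello, <b>world"

# "html.parser" is Python's built-in parser.
demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.title)         # the whole <title> tag
print(demo_soup.title.string)  # only the text inside the tag
print(demo_soup.b.get_text())  # text of the unclosed <b> tag
print(demo_soup.p)             # the paragraph, with the closing tags added
```

Even though the input never closes `<b>` or `<p>`, the parsed tree contains complete tags, which is what makes the messy HTML easy to search afterwards.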

### Using Beautiful Soup to search for specific tags

This is where it gets interesting. Beautiful Soup can be used to search for specific tags.

Let's use Beautiful Soup to look for all the `<h3>` headings within the Wikipedia page!

In [None]:
headers = []
# find_all returns every <h3> tag in the parsed page
for heading in soup.find_all("h3"):
    headers.append(heading.text)
    print(heading.text)

In this code snippet, we find all `<h3>` tags. We then loop through each tag found and extract its text by using the `.text` attribute.

*Note: The `<h3>` tag defines a [header](https://www.w3docs.com/learn-html/html-header-tag.html), which defines the heading for the text to follow. To differentiate `<h3>` tags appearing in different locations, they can be 'named' with a `class` attribute, for example `class="tag-name"`. In this exercise, we are searching for `<h3>` headers without specifying any class.*
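As a hedged illustration of how a class can narrow the search, the sketch below runs `find_all` on a hand-written snippet. The class names `section-title` and `sidebar-title` are invented for this example and do not come from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with three <h3> tags, two of which carry a class.
html = """
<h3 class="section-title">Formation</h3>
<h3 class="sidebar-title">Tools</h3>
<h3>Unclassified heading</h3>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Every <h3>, regardless of class.
all_h3 = [h.get_text() for h in demo_soup.find_all("h3")]

# Only the <h3> tags with a specific class (note the trailing underscore in class_).
section_h3 = [h.get_text() for h in demo_soup.find_all("h3", class_="section-title")]

# Only the <h3> tags that have no class attribute at all, using a filter function.
plain_h3 = [
    h.get_text()
    for h in demo_soup.find_all(lambda tag: tag.name == "h3" and not tag.has_attr("class"))
]

print(all_h3)
print(section_h3)
print(plain_h3)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.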

#### Let us try to extract all URLs from the Wikipedia page

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Special:SpecialPages
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Jupiter
/w/index.php?title=Special:UserLogin&returnto=Jupiter
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Jupiter
/w/index.php?title=Special:UserLogin&returnto=Jupiter
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Name_and_symbol
#Formation_and_migration
#Physical_characteristics
#Composition
#Size_and_mass
#Internal_str
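Many of the links printed above are relative (for example `/wiki/Main_Page` or `#Name_and_symbol`), so they cannot be requested as they are. A minimal sketch of turning them into full URLs with the standard library's `urljoin` follows; the short `hrefs` list is copied by hand from the output above, with a `None` added to stand in for an `<a>` tag that has no `href`.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Jupiter"

# Sample href values; link.get('href') returns None when the attribute is missing.
hrefs = [
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "#Name_and_symbol",
    None,
]

# Skip missing hrefs and resolve the rest against the page we scraped.
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for url in absolute_links:
    print(url)
```

In the notebook, the same list comprehension could be applied to the values returned by `link.get('href')` in the loop above.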

## Let us move into scraping images from the web

For the purpose of this example, we will be using the Walmart website for scraping data.

Let us look at the images used on the Walmart search page.

In [None]:
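# Fetch the raw HTML of the page
htmldata = urlopen('https://www.walmart.com/search?')
soup = BeautifulSoup(htmldata, 'html.parser')
# Collect every <img> tag on the page
images = soup.find_all('img')

for item in images:
    # .get avoids a KeyError for <img> tags that have no src attribute
    print(item.get('src'))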
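Real pages often contain `<img>` tags with a relative `src`, a lazy-loading attribute, or no source at all. The sketch below shows one way to collect usable image URLs under those conditions. The page address `https://example.com/shop/`, the HTML snippet, and the `data-src` fallback are assumptions chosen for illustration, not something taken from the Walmart page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://example.com/shop/"  # hypothetical page address

# Hypothetical markup covering three common cases.
html = """
<img src="/static/logo.png" alt="Logo">
<img data-src="images/item1.jpg" alt="Lazy-loaded item">
<img alt="No source at all">
"""
demo_soup = BeautifulSoup(html, "html.parser")

image_urls = []
for img in demo_soup.find_all("img"):
    # Prefer src, fall back to data-src, and skip tags that have neither.
    src = img.get("src") or img.get("data-src")
    if src:
        image_urls.append(urljoin(page_url, src))

print(image_urls)
```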

##### Let us now use a Google API to download an image

The working of the code here is simple.
We give the YouTube Data API the ID of a video, and it returns the video's metadata, including the URL of its thumbnail image.

After finding the thumbnail URL, we use `requests` to download the image and save it to a file.

The code needs two inputs from you: your own API key and the ID of the video whose thumbnail you want.

Any public video ID will do!

Go ahead and enter it!

In [None]:
import requests

# Replace with your own API key (never share or commit a real key)
API_KEY = 'YOUR_API_KEY'

def get_video_thumbnail(video_id):
    # API endpoint and query parameters
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {'id': video_id, 'key': API_KEY, 'part': 'snippet'}

    # Send request to YouTube API
    response = requests.get(url, params=params, timeout=10)

    # Check if response is OK
    if response.status_code == 200:
        video_data = response.json()

        # 'items' is empty when no video matches the given ID
        if video_data.get('items'):
            thumbnail_url = video_data['items'][0]['snippet']['thumbnails']['high']['url']
            print(f"Thumbnail URL: {thumbnail_url}")
            return thumbnail_url
        print("No video found.")
    else:
        print(f"Error: {response.status_code}")
    # Reaching this point means no thumbnail URL was found
    return None

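def download_image(image_url, filename):
    # Download the image and save it
    response = requests.get(image_url, timeout=10)
    # Stop on a failed download instead of saving an error page as an image
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded image: {filename}")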

# Example: Replace with any YouTube video ID
video_id = 'dQw4w9WgXcQ'  # Replace with the video ID you want to fetch the thumbnail for
thumbnail_url = get_video_thumbnail(video_id)

if thumbnail_url:
    # Download the image to your local machine
    download_image(thumbnail_url, f'{video_id}_thumbnail.jpg')


Thumbnail URL: https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg
Downloaded image: dQw4w9WgXcQ_thumbnail.jpg
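The notebook imports `shutil` at the top, and it can also be used to save a download by copying the data to disk in chunks rather than holding the whole file in memory. The following is a minimal sketch of that idea; the helper name `save_stream` is an assumption, and an in-memory `io.BytesIO` object stands in for the network stream so that the example runs offline. With `requests`, the source would typically be `requests.get(url, stream=True).raw`.

```python
import io
import os
import shutil
import tempfile

def save_stream(source, filename):
    """Copy a binary file-like object to disk in chunks and return the file size."""
    with open(filename, "wb") as out_file:
        shutil.copyfileobj(source, out_file)
    return os.path.getsize(filename)

# Stand-in for a downloaded image: 3 header bytes followed by 1000 zero bytes.
fake_image = io.BytesIO(b"\xff\xd8\xff" + b"\x00" * 1000)

# Write into a temporary directory so nothing is left in the working folder.
target = os.path.join(tempfile.mkdtemp(), "demo.jpg")
size = save_stream(fake_image, target)
print(size)  # 1003
```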
