# Playing with Beautiful Soup
* Author: Filip Wodnicki
* Date: August 2018
* Language: Python 3.6
* Objective: Use Beautiful Soup to get images from a web page

## Step 1: Import

In [3]:
from bs4 import BeautifulSoup
import requests

In [4]:
# url = 'https://iot.eetimes.com/renesas-s3a1-mcu-group-offers-improved-security-connectivity-for-modern-iot-solutions/'
url = 'https://hackernoon.com/install-python-gdal-using-conda-on-mac-8f320ca36d90'

r = requests.get(url)
soup = BeautifulSoup(r.text, features="lxml")
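
The cell above parses whatever comes back, including error pages. A minimal sketch of a more defensive fetch, assuming a hypothetical helper name `fetch_soup` and an arbitrary 10-second timeout (neither is part of the original notebook):

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, parser="lxml"):
    """Download a page and parse it (hypothetical helper, not from the notebook)."""
    # A timeout keeps the request from hanging indefinitely.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, features=parser)
```

Calling `soup = fetch_soup(url)` would then stand in for the two lines above. The `"lxml"` parser requires the `lxml` package, as it does in the original cell.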

## Step 2: Get all images with width and height

In [5]:
images = []
for pic in soup.find_all('img', width=True, height=True):
    images.append(pic.get('src'))
    print(pic['width'], pic['height']) 

267 36


In [6]:
images

['https://cdn-images-1.medium.com/letterbox/534/72/50/50/1*1rGa2Bo9HvfRLuZA2N8qLA.png?source=logoAvatar-lo_1Xn52FhAbdlV---3a8144eabfe3']

Problem: Not all `<img>` tags have `width` and `height` attributes, so this filter finds only one image.
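
To see how the `width=True, height=True` filter behaves, here is a small sketch on a made-up HTML snippet (the file names are placeholders, and `html.parser` is used only so the example runs without `lxml`):

```python
from bs4 import BeautifulSoup

# Made-up markup: only the first <img> declares both dimensions.
html = """
<img src="logo.png" width="267" height="36"/>
<img src="photo.jpeg"/>
<img src="chart.png" width="100"/>
"""
demo = BeautifulSoup(html, "html.parser")

# attribute=True matches tags that have the attribute, whatever its value.
with_size = demo.find_all("img", width=True, height=True)
all_imgs = demo.find_all("img")

# Collect the images the filter silently skips.
missing = [
    img["src"]
    for img in all_imgs
    if not (img.has_attr("width") and img.has_attr("height"))
]

print(len(with_size), "of", len(all_imgs))  # 1 of 3
print(missing)  # ['photo.jpeg', 'chart.png']
```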

## Step 3: Get all images, period.

In [7]:
images = []
for pic in soup.find_all('img'):
    images.append(pic.get('src'))

In [8]:
images

['https://cdn-images-1.medium.com/letterbox/534/72/50/50/1*1rGa2Bo9HvfRLuZA2N8qLA.png?source=logoAvatar-lo_1Xn52FhAbdlV---3a8144eabfe3',
 'https://cdn-images-1.medium.com/fit/c/120/120/1*LugIfotGMh5i1qjxuJiRzQ.jpeg',
 'https://cdn-images-1.medium.com/max/1600/1*m4cnTYJWM7Rmpsju8dSHmQ.jpeg',
 'https://cdn-images-1.medium.com/max/1600/1*3dt1a4jg7DzEti_D3VmGxQ.png',
 'https://cdn-images-1.medium.com/max/1600/1*OpxJXlkjvcCuhkYYlPwTjA.png',
 'https://cdn-images-1.medium.com/max/1600/1*I765vVaynvCDsiPEKBuUfQ.png',
 'https://cdn-images-1.medium.com/max/1600/1*qcumvBkuOKXFu83evbimKQ.png',
 'https://cdn-images-1.medium.com/max/1600/1*B4pr9PAX7ik1s1y5lbdrxA.png',
 'https://cdn-images-1.medium.com/max/1600/1*NM-d0y5B_plTUDv9E4RBvQ.png',
 'https://cdn-images-1.medium.com/max/1600/1*j-qfAUL54oD27GS_quTRmg.png',
 'https://cdn-images-1.medium.com/max/1600/1*BPGCDr9_C5ici0NVI2QESw.png',
 'https://cdn-images-1.medium.com/max/1600/1*D5Oy7phUlCfnfo2gnyi2Aw.png',
 'https://cdn-images-1.medium.com/max/1600

Problem: Way too many images, and nothing yet says which one is the featured image.
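
Before picking one image, the list can at least be tidied up. A rough sketch, assuming a hypothetical helper `collect_image_urls` that skips tags without `src`, drops duplicates, and resolves relative paths against the page URL (the demo markup and URLs are made up):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_image_urls(soup, base_url):
    """Return unique, absolute image URLs in document order (hypothetical helper)."""
    seen = set()
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:  # <img> without a usable src
            continue
        absolute = urljoin(base_url, src)  # leaves absolute URLs unchanged
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


demo = BeautifulSoup(
    '<img src="/a.png"/><img/><img src="/a.png"/>'
    '<img src="https://cdn.example.com/b.jpg"/>',
    "html.parser",
)
print(collect_image_urls(demo, "https://example.com/post"))
# ['https://example.com/a.png', 'https://cdn.example.com/b.jpg']
```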
    
## Step 4: Get first image after the h1 tag (should be the featured image)

source: https://stackoverflow.com/questions/36754686/beautiful-soup-get-picture-size-from-html

# Article title

![my alt text](https://cdn-images-1.medium.com/max/1600/1*m4cnTYJWM7Rmpsju8dSHmQ.jpeg)
Summary line 1
Summary line 2
Summary line 3

[Full text](https://reddit.com)
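
One possible way to grab just the first image after the `h1` is Beautiful Soup's `find_next`. The sketch below assumes a hypothetical function name `get_first_img_after_h1` and made-up demo markup; whether the first image really is the featured one depends on the page layout.

```python
from bs4 import BeautifulSoup


def get_first_img_after_h1(soup):
    """Return the src of the first <img> after the first <h1>, or None."""
    h1 = soup.find("h1")
    if h1 is None:  # page has no <h1>
        return None
    # find_next looks at everything after the tag in document order.
    img = h1.find_next("img")
    return img.get("src") if img is not None else None


demo = BeautifulSoup(
    '<img src="logo.png"/><h1>Title</h1>'
    '<p><img src="featured.jpeg"/></p><img src="other.png"/>',
    "html.parser",
)
print(get_first_img_after_h1(demo))  # featured.jpeg
```

The logo placed before the `h1` is ignored, which is the point of anchoring the search on the title.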

## Step 5: Get the largest image


In [11]:
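def get_largest_image(soup):
    """Return the src of the <img> with the largest width * height, or None."""
    images = []
    sizes = []
    # Only <img> tags with explicit width and height attributes are considered.
    for img in soup.find_all('img', width=True, height=True):
        images.append(img.get('src'))
        sizes.append(int(img.get('width')) * int(img.get('height')))

    if not sizes:  # max() raises ValueError on an empty list
        return None
    max_index = sizes.index(max(sizes))
    return images[max_index]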

In [12]:
get_largest_image(soup)

'https://cdn-images-1.medium.com/letterbox/534/72/50/50/1*1rGa2Bo9HvfRLuZA2N8qLA.png?source=logoAvatar-lo_1Xn52FhAbdlV---3a8144eabfe3'

In [13]:
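def get_all_img_after_h1(soup):
    """Return every <img> that appears after the first <h1>, in document order."""
    soup_h1 = soup.find('h1')
    if soup_h1 is None:  # no <h1> on the page
        return []
    return soup_h1.find_all_next('img')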

In [15]:
get_all_img_after_h1(soup)

[<img class="graf-image" data-height="335" data-image-id="1*m4cnTYJWM7Rmpsju8dSHmQ.jpeg" data-width="568" src="https://cdn-images-1.medium.com/max/1600/1*m4cnTYJWM7Rmpsju8dSHmQ.jpeg"/>,
 <img class="graf-image" data-height="62" data-image-id="1*3dt1a4jg7DzEti_D3VmGxQ.png" data-width="576" src="https://cdn-images-1.medium.com/max/1600/1*3dt1a4jg7DzEti_D3VmGxQ.png"/>,
 <img class="graf-image" data-action="zoom" data-action-value="1*OpxJXlkjvcCuhkYYlPwTjA.png" data-height="774" data-image-id="1*OpxJXlkjvcCuhkYYlPwTjA.png" data-width="1240" src="https://cdn-images-1.medium.com/max/1600/1*OpxJXlkjvcCuhkYYlPwTjA.png"/>,
 <img class="graf-image" data-action="zoom" data-action-value="1*I765vVaynvCDsiPEKBuUfQ.png" data-height="336" data-image-id="1*I765vVaynvCDsiPEKBuUfQ.png" data-width="1244" src="https://cdn-images-1.medium.com/max/1600/1*I765vVaynvCDsiPEKBuUfQ.png"/>,
 <img class="graf-image" data-height="330" data-image-id="1*qcumvBkuOKXFu83evbimKQ.png" data-width="562" src="https://cdn-ima