Our project is "analyzing metadata embedded in the HTML pages of popular news outlets".

This metadata is invaluable for a wide variety of uses: it makes pages machine-readable, which helps visually impaired people browse the web; it makes it easier for search engines to find information; and it makes it easier to archive news, much as libraries do with paper newspapers, whose records go back to the 1800s.

Once you have built the function below, we will use powerful NLP techniques during week 3 to see whether the metadata explains the full text of the news article itself.

Write a Python function based on BeautifulSoup that loads the following information into a dictionary/JSON:

- extract the text from the title tag of the page.
- list which meta tags are available on the page; find their content and lengths

- count the number of characters in the title and in the meta tags

- extract information in h1 tags and other headings such as h2 and h3.

- count number of images on the page

- load the image URLs, check the image alt attributes, and extract their text. This is pretty important!
  All images should carry descriptive alt text.
  An alt description helps search engines understand the image and its purpose in the content.
  Using the keyword in the alt description can help your image rank in Google Images and can also boost the overall ranking for that keyword in Google Search.
  Visually impaired people rely on these descriptions to understand the image, and so do search engines.

- find the internal links on a given page and their anchor text.

- use the extruct package to check the structured information on the page

- finally, check the page load time.
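
The alt-text requirement can be checked mechanically. The following is a minimal sketch, not a reference solution: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `AltTextAudit`, `audit_alt_text`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class AltTextAudit(HTMLParser):
    """Collect (src, alt) for every <img>; alt is '' when missing or empty."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # Lazy-loaded images often keep the real URL in data-src.
        src = attr.get("src") or attr.get("data-src") or ""
        alt = (attr.get("alt") or "").strip()
        self.images.append((src, alt))


def audit_alt_text(html):
    """Return the image count, the images lacking alt text, and the coverage ratio."""
    parser = AltTextAudit()
    parser.feed(html)
    missing = [src for src, alt in parser.images if not alt]
    total = len(parser.images)
    coverage = (total - len(missing)) / total if total else 1.0
    return {"image_count": total, "missing_alt": missing, "coverage": coverage}


# Hypothetical page fragment: one described image, one empty alt, one without alt.
sample_html = """
<article>
  <img src="/img/flood.jpg" alt="Flooded street in the city centre">
  <img src="/img/spacer.gif" alt="">
  <img data-src="/img/chart.png">
</article>
"""
report = audit_alt_text(sample_html)
print(report)
```

The same check maps onto BeautifulSoup by iterating over `soup.find_all('img')` and reading `tag.get('alt')`.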

In [30]:
import pandas as pd
from urllib.request import Request, urlopen
import urllib.request as urllib2
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
from collections import defaultdict
from urllib.error import HTTPError
import requests
import re
from w3lib.html import get_base_url
import extruct
import urllib
from time import time
from datetime import timedelta

Below are the functions that extract each piece of information, such as the title, meta tags, images, and headings. The 'analysis' function collects all of this information and stores it in a dictionary.

In [31]:
def get_html(url):
    # Send a browser-like User-Agent; some sites reject the default one used by requests.
    # Access-Control-* headers are response headers and have no effect on a request, so they are omitted.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
    }
    # The timeout keeps a stalled server from blocking the notebook indefinitely
    req = requests.get(url, headers=headers, timeout=30)
    return req.text

def is_valid(url):
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

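def get_title(html):
    # Fall back from <title> to the Open Graph title, the Twitter title and finally the first <h1>
    title = None
    # html.title is None when the page has no <title> tag, so check it before reading .string
    if html.title and html.title.string:
        title = html.title.string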
    elif html.find("meta", property="og:title"):
        title = html.find("meta", property="og:title").get('content')
    elif html.find("meta", property="twitter:title"):
        title = html.find("meta", property="twitter:title").get('content')
    elif html.find("h1"):
        title = html.find("h1").string
    return title


def get_meta_tags(soup):
    # Returns (meta, meta2): meta maps each tag's identifier to [content, length];
    # meta2 marks which commonly used tags are present on the page.
    # Note: 'canonical' is normally a <link rel="canonical"> element, so a search over <meta> tags rarely finds it.
    default=['description','robots','og:title','og:url','og:type','og:description','og:image','canonical','keywords','viewport']
    meta, meta2 = {}, {}
    li2 = []
    for tag in soup.find_all('meta'):
        # Identify the tag by its first attribute other than 'content' (name, property, charset, ...)
        idents = [v for k, v in tag.attrs.items() if k != 'content']
        if not idents:
            continue
        val = idents[0]
        li2.append(val)
        if val in meta:
            continue  # keep the first occurrence of a repeated tag
        s = tag.get('content')
        meta[val] = [s, len(s)] if s is not None else ""
    for i in default:
        meta2[i] = "Yes" if i in li2 else "No"

    return meta, meta2


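def get_headings(soup):
    # Map each heading level (h1..h6) to a list of [text, length] pairs.
    # The anchored pattern avoids matching other tag names that merely contain 'h' followed by a digit.
    headings = defaultdict(list)
    for tag in soup.find_all(re.compile(r'^h[1-6]$')):
        text = tag.text.strip()
        headings[tag.name].append([text, len(text)])
    return dict(headings)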


def get_images(soup):
    img = {}
    images = soup.find_all('img')
    img['Image_count'] = len(images)
    full = []
    for img_tag in images:
        # Record [alt, src] for every image; an empty alt flags a missing description
        alt = img_tag.get('alt') or ''
        # Lazy-loaded images often keep the real URL in data-src
        src = img_tag.get('src') or img_tag.get('data-src') or ''
        full.append([alt, src])

    img['Image_description'] = full
    return img


def get_internal_links(url):
    # Return a list of {'link', 'Anchor Text'} dicts for links that stay on the page's own host
    internal_urls = set()
    li = []

    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(get_html(url), "html.parser")

    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            continue
        # Resolve relative links and drop the query string and fragment
        parsed_href = urlparse(urljoin(url, href))
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        # Skip duplicates and links to other hosts (compare hosts, not substrings of the URL)
        if href in internal_urls or parsed_href.netloc != domain_name:
            continue
        internal_urls.add(href)
        # get_text also works when the anchor wraps nested tags, where .string would be None
        li.append({'link': href, 'Anchor Text': a_tag.get_text(strip=True)})
    return li


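def get_time(url):
    # Start the clock before opening the URL so the connection time is included,
    # not only the time spent reading the response body
    start = time()
    with urlopen(url) as stream:
        stream.read()
    end = time()
    return str(timedelta(seconds=end-start))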

def struct_info(url):
    
    html=get_html(url)
    base_url = get_base_url(html, url)
    data = extruct.extract(html, base_url)
    return data

def analysis(url):
    
    if not is_valid(url):
        # Raise instead of returning a string: callers unpack two return values
        raise ValueError("Not a valid url: " + url)

    html=get_html(url)
    soup = BeautifulSoup(html, 'html.parser')

    #dictionary to store the required information
    final={}
    final['title']=get_title(soup)                                         #1. title
    a,b=get_meta_tags(soup)
    final['Significant meta tags']=b
    final['meta_tags_description']=a                               #2.meta tags - content & length              
    final['headings']=get_headings(soup)                       #3. headings h1, h2,..etc - description
    final['images']=get_images(soup)                             #4. Images - count & descriptive text
    final['internal_links']=get_internal_links(url)              #5. Internal links - link & anchor text
    final['Page Load Time']=get_time(url)                      #6. Page load time
    
    #7. Structured information
    final_structured_info=struct_info(url) 
    
    return final, final_structured_info
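
Whether a link counts as internal comes down to comparing the hostname of the resolved link with the hostname of the page. A plain substring test on the URL would treat a host such as `example.com.evil.test` as internal, which is why this sketch compares parsed hostnames. It is a standalone illustration that uses only `urllib.parse`; `is_internal` and `_host` are hypothetical helper names, the `example.com` URLs are placeholders, and ignoring a leading `www.` is an assumption that may not suit every site.

```python
from urllib.parse import urljoin, urlparse


def _host(url):
    """Lower-cased hostname without a leading 'www.' (an assumption of this sketch)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(page_url, href):
    """True when href, resolved against page_url, points at the same host."""
    absolute = urljoin(page_url, href)
    if urlparse(absolute).scheme not in ("http", "https"):
        return False  # mailto:, javascript:, tel: and similar are not page links
    return _host(absolute) == _host(page_url)


page = "https://www.example.com/news/story.html"
for href in ["/sport",
             "https://example.com/about",
             "https://example.com.evil.test/x",
             "mailto:desk@example.com"]:
    print(href, is_internal(page, href))
```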

#### The data of a sample url is analyzed below

In [32]:
url='http://www.songkick.com/artists/236156-elysian-fields'
data, structured_data=analysis(url)


### Output

 - 'data' is the dictionary where all the extracted information is stored

In [33]:
data.keys()

dict_keys(['title', 'Significant meta tags', 'meta_tags_description', 'headings', 'images', 'internal_links', 'Page Load Time'])

In [34]:
data

{'title': 'Elysian Fields Tour Announcements 2021 & 2022, Notifications, Dates, Concerts & Tickets – Songkick',
 'Significant meta tags': {'description': 'Yes',
  'robots': 'Yes',
  'og:title': 'Yes',
  'og:url': 'Yes',
  'og:type': 'Yes',
  'og:description': 'Yes',
  'og:image': 'Yes',
  'canonical': 'No',
  'keywords': 'No',
  'viewport': 'Yes'},
 'meta_tags_description': {'robots': ['all', 3],
  'description': ['Find out when Elysian Fields is next playing live near you. List of all Elysian Fields tour dates, concerts, support acts, reviews and venue info.',
   146],
  'fb:app_id': ['308540029359', 12],
  'viewport': ['user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width',
   74],
  'apple-mobile-web-app-capable': ['yes', 3],
  'og:site_name': ['Songkick', 8],
  'og:type': ['songkick-concerts:artist', 24],
  'og:title': ['Elysian Fields', 14],
  'og:description': ['Find out when Elysian Fields is next playing live near you. List of all Elysian Fields tour dates

- 'structured_data' is the structured information extracted with extruct

In [35]:
structured_data

{'microdata': [{'type': 'http://schema.org/Review',
   'properties': {'itemReviewed': 'Elysian Fields',
    'reviewBody': "They played at a venue called Das Bett in Frankfurt, at an audience of maybe twenty people. I asked the girl at the beer counter why? She said that Frankfurt might be not the right place for music like this.\n\nElysian Fields performed about one hour. I liked the music and her voice. But it wasn't so much fun in a such a situation.\n\nAnyway. Who missed the show should view their vids on YouTube. Really great!\n\nRead more\n\nReport as inappropriate",
    'author': 'rainerkromarek'}}],
 'json-ld': [{'@context': 'http://schema.org',
   '@type': 'MusicEvent',
   'name': 'Elysian Fields',
   'url': 'https://www.songkick.com/concerts/39553135-elysian-fields-at-espace-180?utm_medium=organic&utm_source=microformat',
   'image': 'media/profile_images/artists/236156/huge_avatar',
   'location': {'@type': 'Place',
    'address': {'@type': 'PostalAddress',
     'addressLocal
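
The extruct result is a nested dictionary, so a quick way to see what a page declares is to count the schema types per syntax. The sketch below is an illustration under stated assumptions: `summarize_types` is a hypothetical helper, and `sample` is a hand-written stand-in that copies only the shape of the output above (a `type` key for microdata items, an `@type` key for JSON-LD items), not the real page data.

```python
from collections import Counter


def summarize_types(structured):
    """Count schema types per syntax in an extruct-style result dictionary."""
    summary = {}
    for syntax, key in (("microdata", "type"), ("json-ld", "@type")):
        counts = Counter()
        for item in structured.get(syntax, []):
            if not isinstance(item, dict):
                continue
            value = item.get(key)
            # A type may be a single string or a list of strings.
            values = value if isinstance(value, list) else [value]
            counts.update(v for v in values if v)
        summary[syntax] = dict(counts)
    return summary


# Hand-written stand-in with the same shape as the extruct output.
sample = {
    "microdata": [{"type": "http://schema.org/Review",
                   "properties": {"itemReviewed": "Elysian Fields"}}],
    "json-ld": [{"@context": "http://schema.org",
                 "@type": "MusicEvent",
                 "name": "Elysian Fields"}],
}
print(summarize_types(sample))
```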

###### Counter method on full text of the webpage
- I didn't completely understand what has to be done with the Counter method, but I was able to clean the text by removing the script and style tags, as shown below.

In [36]:
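def remove_tags(html):
    # Name the parser explicitly; otherwise BeautifulSoup picks one and emits a warning
    soup = BeautifulSoup(html, 'html.parser')
    # Drop script and style elements so only the visible text remains
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)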
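
One possible reading of the Counter step, assuming it refers to `collections.Counter`, is to count word frequencies in the cleaned text. This is only a sketch of that interpretation: `top_words`, its `min_length` cut-off, and the sample sentence are made up for illustration, and in the notebook the input would be the string returned by `remove_tags`.

```python
import re
from collections import Counter


def top_words(text, n=10, min_length=3):
    """Return the n most frequent lower-cased words with at least min_length letters."""
    # [^\W\d_]+ matches runs of letters only (no digits, underscores or punctuation)
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)


# Made-up sample text standing in for the cleaned page text.
sample_text = "Elysian Fields tour dates. Elysian Fields concerts and tour tickets."
print(top_words(sample_text, n=3))
```

Comparing these counts with the words used in the title and meta description would be one way to see how well the metadata reflects the page text.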

In [37]:
url='http://www.songkick.com/artists/236156-elysian-fields'
html=get_html(url)
remove_tags(html)

"Elysian Fields Tour Announcements 2021 & 2022, Notifications, Dates, Concerts & Tickets – Songkick This event has been added to your Plans . Live streams Tiruchchirappalli... Tiruchchirappalli concerts Tiruchchirappalli concerts See all Tiruchchirappalli concerts ( Change\xa0location ) Today · Next 7 days · Next 30 days Artists Most popular artists worldwide Trending artists worldwide Rihanna Coldplay Drake Eminem Maroon 5 Ed Sheeran Bruno Mars Kanye West U2 Adele Olivia Rodrigo Corpse The Kid LAROI. jxdn Sault Get your tour dates seen by one billion fans: Sign up as an artist Festivals Sign up Log in Get the app Home Live streams Tiruchchirappalli concerts Change\xa0location Popular Artists Festivals Language English Français Español Log in to your account Sign up Live streams Tiruchchirappalli Your artists Popular artists Live Check out the best live stream concerts! Elysian Fields On tour: no Upcoming 2021 concerts: none 8,073 fans get concert alerts for this artist. Join Songkick 

-------