# Scrape Static Websites with `requests` and `BeautifulSoup`

## Install Libraries

We need two main libraries:
1. `requests`: already installed in the environment we are using (Python 3.6 and above); otherwise install it with `pip install requests`
2. `bs4` - see [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup)

and some additional standard data processing ones like:
1. `pandas` - see [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)

## Import Libraries

In [66]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from collections import Counter

!python --version 

Python 3.6.9


## Start with A Problem

Let us start with some concrete examples. Let us say we want to know:

> What genres of animes are most popular this season on `gogoanime.io`?


## Look for `robots.txt` and `sitemap.xml`

Scraping puts load on other people's servers. Before we start, always check `robots.txt` for the site's policy on whether scraping is allowed, and `sitemap.xml` for a list of the URLs on the site.

See https://www18.gogoanime.io/robots.txt
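As a rough sketch of how this check could be automated, Python's standard `urllib.robotparser` module can evaluate a URL against a set of `robots.txt` rules. The rules and the `example.com` URLs below are made up for illustration and are not the actual policy of any site; in practice the real file would have to be downloaded first. The helper name `is_allowed` is also just a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path that no rule mentions is allowed; a path under /private/ is not.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/new-season.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page.html"))
```

A check like this only reports what the file says; whether and how fast to scrape remains a judgment call.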

## What Are The Animes of This Season

In [0]:
list_url = 'https://www18.gogoanime.io/new-season.html?page=1'

#get html and format with BeautifulSoup
with requests.get(list_url) as r:
    soup = BeautifulSoup(r.text,features='html.parser')
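The request above parses whatever comes back, including an error page. A slightly more defensive variant might look like the sketch below. The function name `fetch_soup`, the `User-Agent` string, and the 10-second timeout are illustrative assumptions, not values required by the site or by `requests`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download url and parse it, raising on HTTP errors instead of parsing an error page."""
    http = session or requests  # a requests.Session can be passed in to reuse connections
    response = http.get(
        url,
        headers={"User-Agent": "my-scraper-tutorial/0.1"},  # hypothetical identifier
        timeout=timeout,  # without a timeout, a stalled server can hang the notebook
    )
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.text, features="html.parser")
```

It would be called as `soup = fetch_soup(list_url)` in place of the `with requests.get(...)` block.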

In [0]:
# #see what a "soup" looks like
# soup

After inspecting the page in the browser, we know that the anime names are in `<p class="name">` tags.

In [28]:
#get one tag we want
soup.find('p', class_='name')

<p class="name"><a href="/category/bna" title="BNA">BNA</a></p>

In [29]:
#get the text of one tag we want
soup.find('p', class_='name').text

'BNA'

In [30]:
#get all tags we want
soup.find_all('p', class_='name')

[<p class="name"><a href="/category/bna" title="BNA">BNA</a></p>,
 <p class="name"><a href="/category/appare-ranman" title="Appare-Ranman!">Appare-Ranman!</a></p>,
 <p class="name"><a href="/category/argonavis-from-bang-dream" title="Argonavis from BanG Dream!	">Argonavis from BanG Dream!	</a></p>,
 <p class="name"><a href="/category/arte" title="Arte">Arte</a></p>,
 <p class="name"><a href="/category/bungou-to-alchemist-shinpan-no-haguruma" title="Bungou to Alchemist: Shinpan no Haguruma">Bungou to Alchemist: Shinpan no Haguruma</a></p>,
 <p class="name"><a href="/category/fruits-basket-2nd-season" title="Fruits Basket 2nd Season">Fruits Basket 2nd Season</a></p>,
 <p class="name"><a href="/category/fugou-keiji-balanceunlimited" title="Fugou Keiji: Balance:Unlimited">Fugou Keiji: Balance:Unlimited</a></p>,
 <p class="name"><a href="/category/gal-to-kyouryuu" title="Gal to Kyouryuu">Gal to Kyouryuu</a></p>,
 <p class="name"><a href="/category/gal-gaku-hijiri-girls-square-gakuin" title=

In [31]:
#get the text of all tags we want
[i.text for i in soup.find_all('p', class_='name')]

['BNA',
 'Appare-Ranman!',
 'Argonavis from BanG Dream!\t',
 'Arte',
 'Bungou to Alchemist: Shinpan no Haguruma',
 'Fruits Basket 2nd Season',
 'Fugou Keiji: Balance:Unlimited',
 'Gal to Kyouryuu',
 'Gal-gaku.: Hijiri Girls Square Gakuin',
 'Gleipnir',
 'Hachi-nan tte, Sore wa Nai deshou!',
 'Honzuki no Gekokujou: Shisho ni Naru Tame ni wa Shudan wo Erandeiraremasen 2nd Season',
 'Houkago Teibou Nisshi',
 'Jashin-chan Dropkick 2nd Season',
 'Kaguya-sama wa Kokurasetai?: Tensai-tachi no Renai Zunousen 2',
 'Kakushigoto (TV)',
 'Kami no Tou',
 'Kingdom 3rd Season',
 'Kiratto Pri☆chan Season 3',
 'Kitsutsuki Tanteidokoro']
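Note that one name in the output ends with a stray tab (`'Argonavis from BanG Dream!\t'`). One possible way to clean this up is `get_text(strip=True)` instead of `.text`. The snippet below is a small offline sketch that uses three entries abridged from the listing above rather than the live page.

```python
from bs4 import BeautifulSoup

# Three entries abridged from the listing above; the third title ends with a tab.
html = (
    '<p class="name"><a href="/category/bna" title="BNA">BNA</a></p>'
    '<p class="name"><a href="/category/appare-ranman">Appare-Ranman!</a></p>'
    '<p class="name"><a href="/category/argonavis-from-bang-dream">Argonavis from BanG Dream!\t</a></p>'
)
soup = BeautifulSoup(html, features='html.parser')

# .text keeps the whitespace exactly as it appears in the HTML
raw_names = [p.text for p in soup.find_all('p', class_='name')]
# get_text(strip=True) trims leading and trailing whitespace from each piece of text
clean_names = [p.get_text(strip=True) for p in soup.find_all('p', class_='name')]

print(raw_names)
print(clean_names)
```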

In [33]:
#get names from page 1 to 3
pages = [1,2,3]
all_names = []
for page in pages:
    list_url = f'https://www18.gogoanime.io/new-season.html?page={page}'
    #get html and format with BeautifulSoup
    with requests.get(list_url) as r:
        soup = BeautifulSoup(r.text,features='html.parser')
    #get names from page 
    names_of_this_page = [i.text for i in soup.find_all('p', class_='name')]
    all_names+=names_of_this_page
all_names

['BNA',
 'Appare-Ranman!',
 'Argonavis from BanG Dream!\t',
 'Arte',
 'Bungou to Alchemist: Shinpan no Haguruma',
 'Fruits Basket 2nd Season',
 'Fugou Keiji: Balance:Unlimited',
 'Gal to Kyouryuu',
 'Gal-gaku.: Hijiri Girls Square Gakuin',
 'Gleipnir',
 'Hachi-nan tte, Sore wa Nai deshou!',
 'Honzuki no Gekokujou: Shisho ni Naru Tame ni wa Shudan wo Erandeiraremasen 2nd Season',
 'Houkago Teibou Nisshi',
 'Jashin-chan Dropkick 2nd Season',
 'Kaguya-sama wa Kokurasetai?: Tensai-tachi no Renai Zunousen 2',
 'Kakushigoto (TV)',
 'Kami no Tou',
 'Kingdom 3rd Season',
 'Kiratto Pri☆chan Season 3',
 'Kitsutsuki Tanteidokoro',
 'Listeners',
 'Major 2nd (TV) 2nd Season',
 'Nami yo Kiitekure',
 'No Guns Life 2nd Season',
 'Olympia Kyklos',
 'Otome Game no Hametsu Flag shika Nai Akuyaku Reijou ni Tensei shiteshimatta...',
 'Princess Connect! Re:Dive',
 'Shachou, Battle no Jikan Desu!',
 'Shadowverse (TV)',
 'Shin Sakura Taisen the Animation',
 'Shironeko Project: Zero Chronicle',
 'Shokugeki no 

## What Are The URLs of Those Animes

In [42]:
#try with one anime
all_animes = soup.find_all('p', class_='name')
one_anime = all_animes[5]
one_anime

<p class="name"><a href="/category/motto-majime-ni-fumajime-kaiketsu-zorori" title="Motto! Majime ni Fumajime Kaiketsu Zorori">Motto! Majime ni Fumajime Kaiketsu Zorori</a></p>

In [43]:
#get <a> within one_anime
one_anime_a = one_anime.find('a')
one_anime_a

<a href="/category/motto-majime-ni-fumajime-kaiketsu-zorori" title="Motto! Majime ni Fumajime Kaiketsu Zorori">Motto! Majime ni Fumajime Kaiketsu Zorori</a>

In [44]:
#get href of <a> within one_anime
one_anime_a.get('href')

'/category/motto-majime-ni-fumajime-kaiketsu-zorori'

In [46]:
#use f-string to get real URL
f"https://www18.gogoanime.io{one_anime_a.get('href')}"

'https://www18.gogoanime.io/category/motto-majime-ni-fumajime-kaiketsu-zorori'
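The f-string works here because every `href` on the listing page is root-relative. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves an `href` against the page it came from and also copes with links that are already absolute. The variable names below are illustrative.

```python
from urllib.parse import urljoin

page_url = 'https://www18.gogoanime.io/new-season.html?page=1'

# A root-relative href is resolved against the scheme and host of page_url.
relative_result = urljoin(page_url, '/category/bna')
# An href that is already absolute is returned unchanged.
absolute_result = urljoin(page_url, 'http://www18.gogoanime.io/genre/action')

print(relative_result)  # https://www18.gogoanime.io/category/bna
print(absolute_result)  # http://www18.gogoanime.io/genre/action
```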

In [47]:
#get urls from page 1 to 3
pages = [1,2,3]
all_urls = []
for page in pages:
    list_url = f'https://www18.gogoanime.io/new-season.html?page={page}'
    #get html and format with BeautifulSoup
    with requests.get(list_url) as r:
        soup = BeautifulSoup(r.text,features='html.parser')
    #get urls from page
    urls_of_this_page = [f"https://www18.gogoanime.io{i.find('a').get('href')}" for i in soup.find_all('p', class_='name')]
    all_urls+=urls_of_this_page
all_urls[:10]

['https://www18.gogoanime.io/category/bna',
 'https://www18.gogoanime.io/category/appare-ranman',
 'https://www18.gogoanime.io/category/argonavis-from-bang-dream',
 'https://www18.gogoanime.io/category/arte',
 'https://www18.gogoanime.io/category/bungou-to-alchemist-shinpan-no-haguruma',
 'https://www18.gogoanime.io/category/fruits-basket-2nd-season',
 'https://www18.gogoanime.io/category/fugou-keiji-balanceunlimited',
 'https://www18.gogoanime.io/category/gal-to-kyouryuu',
 'https://www18.gogoanime.io/category/gal-gaku-hijiri-girls-square-gakuin',
 'https://www18.gogoanime.io/category/gleipnir']

## Get Genres of Each Anime

In [55]:
one_anime_url = all_urls[10]
one_anime_url

'https://www18.gogoanime.io/category/hachi-nan-tte-sore-wa-nai-deshou'

In [59]:
#get genre elements
#get html and format with BeautifulSoup
with requests.get(one_anime_url) as r:
    soup = BeautifulSoup(r.text,features='html.parser')
p_types = soup.find_all('p', class_='type')
p_types

[<p class="type"><span>Type: </span>
 <a href="/sub-category/spring-2020-anime" title="Spring 2020 Anime">Spring 2020 Anime</a>
 </p>,
 <p class="type"><span>Plot Summary: </span>Shingo Ichinomiya, a 25-year-old man working at a firm company, while thinking of tomorrow's busy working day, goes to sleep. However, when he woke up, he found himself in a room unknown to him and realized that he is inside a 6-years-old body, taking over his body and mind. He soon learns from the memories of the boy that the boy was born as the youngest child of a poor noble family living in a back country. Having no administrative skill, he can't do anything to manage the vast land his family has. Fortunately, he is blessed with a very rare talent, the talent of magic. Unfortunately, while his talent could bring prosperity to his family, in his situation it only brought disaster. This is the story of the boy, Wendelin Von Benno Baumeister, opening his own path in a harsh world.</p>,
 <p class="type"><span>G

In [63]:
#find the first <a> under each <p class="type">
genres = []
for p_type in p_types:
    first_a = p_type.find('a')
    print(first_a)
    if first_a:
        genres.append(first_a.text)

<a href="/sub-category/spring-2020-anime" title="Spring 2020 Anime">Spring 2020 Anime</a>
None
<a href="http://www18.gogoanime.io/genre/action" title="Action">Action</a>
None
None
None


In [64]:
genres

['Spring 2020 Anime', 'Action']
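Two things stand out in this result: `'Spring 2020 Anime'` comes from the "Type" paragraph rather than from a genre, and `find('a')` returns only the first link in each paragraph, so an anime with several genres contributes just one. A possible refinement is sketched below on a hand-written HTML fragment. The `Genre` label and the second genre link are assumptions, since the output above is cut off before that paragraph, and `extract_genres` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page fragment modelled on the output above.
# The "Genre: " label and the second genre link are assumptions for illustration.
html = '''
<p class="type"><span>Type: </span>
<a href="/sub-category/spring-2020-anime" title="Spring 2020 Anime">Spring 2020 Anime</a>
</p>
<p class="type"><span>Genre: </span>
<a href="/genre/action" title="Action">Action</a>,
<a href="/genre/fantasy" title="Fantasy">Fantasy</a>
</p>
'''
soup = BeautifulSoup(html, features='html.parser')


def extract_genres(soup, label='Genre'):
    """Collect the text of every <a> inside the <p class="type"> whose <span> starts with label."""
    genres = []
    for p_type in soup.find_all('p', class_='type'):
        span = p_type.find('span')
        if span and span.get_text(strip=True).startswith(label):
            genres += [a.get_text(strip=True) for a in p_type.find_all('a')]
    return genres


print(extract_genres(soup))
```

If the real page labels the paragraph differently, the `label` argument would need to be adjusted after inspecting the HTML.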

## Combine Everything Together

In [71]:
genres = []
for url in all_urls[:5]:
    print(f'Computing {url}')

    #get soup of url
    with requests.get(url) as r:
        soup = BeautifulSoup(r.text,features='html.parser')

    #get all <p class="type">
    p_types = soup.find_all('p', class_='type')
    #keep the text of the first <a> of every <p class="type"> that has one
    for p_type in p_types:
        if p_type.find('a'):
            genres.append(p_type.find('a').text)

Computing https://www18.gogoanime.io/category/bna
Computing https://www18.gogoanime.io/category/appare-ranman
Computing https://www18.gogoanime.io/category/argonavis-from-bang-dream
Computing https://www18.gogoanime.io/category/arte
Computing https://www18.gogoanime.io/category/bungou-to-alchemist-shinpan-no-haguruma
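The loop above sends its requests back to back. Since scraping puts load on the server, one common courtesy is to pause between requests. The sketch below shows the idea with a stand-in download function so that it runs offline; `fetch_all`, `fake_fetch`, and the delay value are illustrative choices, not recommendations from the site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, sleeping between consecutive calls."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Stand-in for a real download so the sketch runs without network access.
def fake_fetch(url):
    return f'<html>{url}</html>'


pages = fetch_all(
    ['https://example.com/a', 'https://example.com/b'],
    fake_fetch,
    delay_seconds=0.1,
)
print(list(pages))
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`.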


In [72]:
Counter(genres)

Counter({'Action': 1,
         'Drama': 1,
         'Fantasy': 1,
         'Historical': 1,
         'Music': 1,
         'Spring 2020 Anime': 5})
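To turn these counts into an answer to the original question, the season label can be dropped and the rest ranked with `Counter.most_common`. The sketch below reuses the counts printed above; with only five anime sampled, every genre is tied at 1, so a meaningful ranking would need the full list of URLs.

```python
from collections import Counter

# Counts copied from the output above.
counts = Counter({'Action': 1, 'Drama': 1, 'Fantasy': 1, 'Historical': 1,
                  'Music': 1, 'Spring 2020 Anime': 5})

# 'Spring 2020 Anime' comes from the "Type" field, so leave it out of the ranking.
genre_counts = Counter({name: n for name, n in counts.items()
                        if name != 'Spring 2020 Anime'})

print(counts.most_common(1))        # the season label dominates the raw counts
print(genre_counts.most_common(3))  # top genres once the label is removed
```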