<a href="https://colab.research.google.com/github/VayLorm/ServerManagment/blob/main/PARS_BASE_PRO_stud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parsing with BeautifulSoup

Beautiful Soup is a Python library for parsing HTML and XML documents. It provides a convenient way to search, navigate, and modify the parse tree of an HTML/XML document (similar in spirit to the DOM, the Document Object Model).
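
As a minimal sketch of those three operations, the snippet below parses a small inline HTML fragment, searches it, navigates from one node to another, and modifies the tree. The fragment is invented for illustration and is not taken from the target site.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only for illustration
html = """
<ul id="books">
  <li class="book"><a href="/first">First book</a></li>
  <li class="book"><a href="/second">Second book</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Search: collect every <li> element with class "book"
items = soup.find_all('li', class_='book')
titles = [li.a.get_text() for li in items]

# Navigate: move from the first link up to its parent <ul>
first_link = soup.find('a')
parent_list = first_link.find_parent('ul')

# Modify: change an attribute in place
first_link['href'] = '/first-edited'

print(titles)
print(parent_list['id'])
print(first_link['href'])
```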

# Task

Build a dataset by scraping the data from this site:

https://books.toscrape.com/

The site lists 1000 books in total, so the dataset must contain 1000 rows, one per book.

The final table must contain the following columns:

| Column name | Description |
|--|--|
|id| Book identifier |
|book_name| Book title |
|price| Price in £ |
|stock| Availability: 1 or 0 |
|url| Link to the book |

**Notes on the columns:**
- `id` is filled in by the dataset developer. The first scraped book has `id` = `0`.
- `url` must contain the full link, not just the relative tail shown on the site, so that the book page opens with a single click.
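
One way to turn a relative link into a full one is `urllib.parse.urljoin` from the standard library. The sketch below assumes the relative `href` has the `../../...` shape shown here; the listing-page URL is used as the base for resolution.

```python
from urllib.parse import urljoin

# URL of the listing page on which the link was found
page_url = 'https://books.toscrape.com/catalogue/category/books_1/page-1.html'
# Relative href in the shape assumed for links on listing pages
href = '../../a-light-in-the-attic_1000/index.html'

# Resolve the relative href against the page it came from
full_url = urljoin(page_url, href)
print(full_url)
```

Resolving against the page URL avoids hand-counting the `../` prefixes, so it keeps working if the nesting depth of the listing pages changes.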

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from IPython.display import HTML

## Creating the dataset and parsing the data

In [None]:
# Store the site address in a variable
url = 'https://books.toscrape.com/'

In [None]:
# Request the page and keep the response
page = requests.get(url, timeout=10)
page

<Response [200]>

In [None]:
# Parse the response into a soup object for querying the page
if page.status_code == 200:
    soup = BeautifulSoup(page.text, features='html.parser')
else:
    soup = None

In [None]:
# Create the dataframe
df = pd.DataFrame({
    'id':[],
    'book_name':[],
    'price':[],
    'stock':[],
    'url':[],
})

In [None]:
# Write the values scraped from one page into the dataframe, one dict per book
def get_quotes(soup, data=df):

    divs = soup.find_all('article', {'class':['product_pod']})
    url = 'https://books.toscrape.com/catalogue/'

    for div in divs:
        quote = {}

        quote['id'] = len(data)
        # The image alt text holds the full, untruncated title
        quote['book_name'] = div.find('img').get('alt')
        quote['price'] = div.find('p', class_='price_color').text

        # 1 if the availability text says "In stock", otherwise 0
        availability = div.find('p', class_='availability')
        if availability is not None and 'In stock' in availability.text:
            quote['stock'] = 1
        else:
            quote['stock'] = 0
        # Drop the leading '../../' from the relative href to build the full link
        quote['url'] = url + div.find('a').get('href')[6:]

        data.loc[len(data)] = quote

In [None]:
# Apply get_quotes() to every catalogue page
url = 'https://books.toscrape.com/catalogue/category/books_1/page-'

for i in range(1, 60):
    page_url = url + str(i) + '.html'
    print(f'Page N:{i}')
    print(page_url)
    r = requests.get(page_url, timeout=10)

    # Stop at the first missing page, before trying to parse it
    if r.status_code != 200:
        break

    soup = BeautifulSoup(r.content, 'html.parser')
    get_quotes(soup, df)

Page N:1
https://books.toscrape.com/catalogue/category/books_1/page-1.html
Page N:2
https://books.toscrape.com/catalogue/category/books_1/page-2.html
Page N:3
https://books.toscrape.com/catalogue/category/books_1/page-3.html
Page N:4
https://books.toscrape.com/catalogue/category/books_1/page-4.html
Page N:5
https://books.toscrape.com/catalogue/category/books_1/page-5.html
Page N:6
https://books.toscrape.com/catalogue/category/books_1/page-6.html
Page N:7
https://books.toscrape.com/catalogue/category/books_1/page-7.html
Page N:8
https://books.toscrape.com/catalogue/category/books_1/page-8.html
Page N:9
https://books.toscrape.com/catalogue/category/books_1/page-9.html
Page N:10
https://books.toscrape.com/catalogue/category/books_1/page-10.html
Page N:11
https://books.toscrape.com/catalogue/category/books_1/page-11.html
Page N:12
https://books.toscrape.com/catalogue/category/books_1/page-12.html
Page N:13
https://books.toscrape.com/catalogue/category/books_1/page-13.html
Page N:14
https:/
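
An alternative to counting page numbers is to follow the pager's "next" link until it disappears. The sketch below is illustrative only: `next_page_url` is a hypothetical helper, and the `<li class="next">` markup is an assumption about the pager, reproduced here as inline fragments so the example runs without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumed markup: <li class="next"><a href="page-3.html">next</a></li>
    link = soup.select_one('li.next > a')
    if link is None:
        return None
    return urljoin(current_url, link['href'])

# Inline fragments standing in for real pages (illustrative only)
middle_page = '<ul class="pager"><li class="next"><a href="page-3.html">next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

base = 'https://books.toscrape.com/catalogue/category/books_1/page-2.html'
print(next_page_url(middle_page, base))
print(next_page_url(last_page, base))
```

A crawl loop built on this helper would stop when it returns `None`, with no need for an upper bound on the page count.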

In [None]:
df['id'] = df.index.astype(str)

In [None]:
# Display the whole dataframe with clickable links
#HTML(df.to_html(render_links=True, escape=False))

Unnamed: 0,id,book_name,price,stock,url
0,0,A Light in the Attic,£51.77,1,https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
1,1,Tipping the Velvet,£53.74,1,https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
2,2,Soumission,£50.10,1,https://books.toscrape.com/catalogue/soumission_998/index.html
3,3,Sharp Objects,£47.82,1,https://books.toscrape.com/catalogue/sharp-objects_997/index.html
4,4,Sapiens: A Brief History of Humankind,£54.23,1,https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
5,5,The Requiem Red,£22.65,1,https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
6,6,The Dirty Little Secrets of Getting Your Dream Job,£33.34,1,https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
7,7,"The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull",£17.93,1,https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
8,8,The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics,£22.60,1,https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
9,9,The Black Maria,£52.15,1,https://books.toscrape.com/catalogue/the-black-maria_991/index.html


## Final dataset

In [None]:
df.shape

(1000, 5)

In [None]:
display(
    df.head(),
    df.tail()
)

Unnamed: 0,id,book_name,price,stock,url
0,0,A Light in the Attic,£51.77,1,https://books.toscrape.com/catalogue/a-light-i...
1,1,Tipping the Velvet,£53.74,1,https://books.toscrape.com/catalogue/tipping-t...
2,2,Soumission,£50.10,1,https://books.toscrape.com/catalogue/soumissio...
3,3,Sharp Objects,£47.82,1,https://books.toscrape.com/catalogue/sharp-obj...
4,4,Sapiens: A Brief History of Humankind,£54.23,1,https://books.toscrape.com/catalogue/sapiens-a...


Unnamed: 0,id,book_name,price,stock,url
995,995,Alice in Wonderland (Alice's Adventures in Won...,£55.53,1,https://books.toscrape.com/catalogue/alice-in-...
996,996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",£57.06,1,https://books.toscrape.com/catalogue/ajin-demi...
997,997,A Spy's Devotion (The Regency Spies of London #1),£16.97,1,https://books.toscrape.com/catalogue/a-spys-de...
998,998,1st to Die (Women's Murder Club #1),£53.98,1,https://books.toscrape.com/catalogue/1st-to-di...
999,999,"1,000 Places to See Before You Die",£26.08,1,https://books.toscrape.com/catalogue/1000-plac...


# PRO task

So, we have scraped the book data, but it is incomplete. Part of each title is cut off in the display, and the dataset has neither the full book title, nor a description, nor a genre.

Extend the dataset by scraping additional data from the same site:

https://books.toscrape.com/

The final table must contain the following columns:

| Column name | Description |
|--|--|
|id| Book identifier |
|book_name| Book title, full title only |
|genre| Book genre |
|desc| Description |
|price| Price in £ |
|stock| Availability: 1 or 0 |
|url| Link to the book |
|num_of_rev| Number of reviews |

## Parsing the data and enriching the dataset

In [None]:
# Fetch a page and return it as a parsed soup object, or None on a non-200 response
def get_soup(url, **kwargs):
    response = requests.get(url, **kwargs)
    if response.status_code == 200:
        # Pass raw bytes so the parser detects the encoding itself
        soup = BeautifulSoup(response.content, features='html.parser')
    else:
        soup = None
    return soup
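
A note on text encoding when fetching pages: if a server's `Content-Type` header omits the charset, `requests` falls back to ISO-8859-1 for `response.text`, which garbles UTF-8 characters such as accented letters or the pound sign. The standard-library sketch below reproduces the effect on a sample string chosen for illustration; handing `response.content` (raw bytes) to `BeautifulSoup` is one way to sidestep it.

```python
# UTF-8 bytes as a server would send them (sample text chosen for illustration)
raw = 'proche de la nôtre, £50.10'.encode('utf-8')

# Decoding with the wrong codec produces the typical garbled characters
garbled = raw.decode('iso-8859-1')

# Decoding with the right codec restores the text
correct = raw.decode('utf-8')

# Already garbled text can be repaired by reversing the wrong decoding step
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled)
print(correct)
print(repaired == correct)
```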

In [None]:
# Add the new columns to the dataframe and fill them in for every book
for idx, info in df.url.items():

    soup = get_soup(info, timeout=10)
    if soup is None:
        # Skip a book whose page could not be fetched
        continue

    # The description is present only when the first sub-header is 'Product Description'
    if soup.find_all('div', class_='sub-header')[0].find('h2').contents[0] == 'Product Description':
        df.loc[idx, 'desc'] = soup.find_all('article')[0].find_all('p')[3].text

    # Genre is the third breadcrumb item
    df.loc[idx, 'genre'] = soup.find_all('li')[2].find('a').text

    # Number of reviews is the seventh cell of the product information table
    df.loc[idx, 'num_of_rev'] = soup.find_all(class_='table table-striped')[0].find_all('td')[6].text

# 'Данные отсутствуют' means 'No data available'
df.desc = df.desc.fillna('Данные отсутствуют')
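
Positional lookups such as "the fourth `<p>`" or "the seventh `<td>`" break when a page has a slightly different layout. A hedged alternative is to locate each field by a nearby label. In the sketch below, `parse_book_page` is a hypothetical helper, and the inline `sample` markup is an assumed simplification of a product page, not a copy of the real one.

```python
from bs4 import BeautifulSoup

def parse_book_page(html):
    """Extract genre, description and review count from one product page."""
    soup = BeautifulSoup(html, 'html.parser')

    # Genre: third breadcrumb entry (Home > Books > Genre > Title)
    crumbs = soup.select('ul.breadcrumb li')
    genre = crumbs[2].get_text(strip=True) if len(crumbs) > 2 else None

    # Description: first <p> after the "product_description" header, if any
    header = soup.find('div', id='product_description')
    desc_tag = header.find_next_sibling('p') if header else None
    desc = desc_tag.get_text(strip=True) if desc_tag else None

    # Review count: the <td> next to the "Number of reviews" header cell
    th = soup.find('th', string='Number of reviews')
    num_of_rev = int(th.find_next_sibling('td').get_text()) if th else None

    return {'genre': genre, 'desc': desc, 'num_of_rev': num_of_rev}

# Assumed, simplified product-page markup (illustrative only)
sample = """
<ul class="breadcrumb">
  <li><a href="/">Home</a></li>
  <li><a href="/books">Books</a></li>
  <li><a href="/poetry">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
<div id="product_description" class="sub-header"><h2>Product Description</h2></div>
<p>A sample description.</p>
<table class="table table-striped">
  <tr><th>Availability</th><td>In stock</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""
print(parse_book_page(sample))
```

Returning a plain dict per page also makes it easy to collect the results in a list and attach them to the dataframe in one step.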

In [None]:
df

Unnamed: 0,id,book_name,price,stock,url,desc,genre,num_of_rev
0,0,A Light in the Attic,£51.77,1,https://books.toscrape.com/catalogue/a-light-i...,It's hard to imagine a world without A Light i...,Poetry,0
1,1,Tipping the Velvet,£53.74,1,https://books.toscrape.com/catalogue/tipping-t...,"""Erotic and absorbing...Written with starling ...",Historical Fiction,0
2,2,Soumission,£50.10,1,https://books.toscrape.com/catalogue/soumissio...,"Dans une France assez proche de la nôtre, un ...",Fiction,0
3,3,Sharp Objects,£47.82,1,https://books.toscrape.com/catalogue/sharp-obj...,"WICKED above her hipbone, GIRL across her hear...",Mystery,0
4,4,Sapiens: A Brief History of Humankind,£54.23,1,https://books.toscrape.com/catalogue/sapiens-a...,From a renowned historian comes a groundbreaki...,History,0
...,...,...,...,...,...,...,...,...
995,995,Alice in Wonderland (Alice's Adventures in Won...,£55.53,1,https://books.toscrape.com/catalogue/alice-in-...,Данные отсутствуют,Classics,0
996,996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",£57.06,1,https://books.toscrape.com/catalogue/ajin-demi...,High school student Kei Nagai is struck dead i...,Sequential Art,0
997,997,A Spy's Devotion (The Regency Spies of London #1),£16.97,1,https://books.toscrape.com/catalogue/a-spys-de...,"In England’s Regency era, manners and elegan...",Historical Fiction,0
998,998,1st to Die (Women's Murder Club #1),£53.98,1,https://books.toscrape.com/catalogue/1st-to-di...,"James Patterson, bestselling author of the Ale...",Mystery,0


In [None]:
# Arrange the columns in the order required by the task
df = df[['id', 'book_name', 'genre', 'desc', 'price', 'stock', 'url', 'num_of_rev']]

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1000 non-null   object
 1   book_name   1000 non-null   object
 2   genre       1000 non-null   object
 3   desc        1000 non-null   object
 4   price       1000 non-null   object
 5   stock       1000 non-null   int64 
 6   url         1000 non-null   object
 7   num_of_rev  1000 non-null   object
dtypes: int64(1), object(7)
memory usage: 70.3+ KB
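
The `df.info()` output lists `price` and `num_of_rev` as `object`, that is, strings. If numeric columns are wanted for later analysis (the task itself does not require this), a conversion along these lines may help. The sketch uses a tiny stand-in frame with values copied from the first two rows of the dataset.

```python
import pandas as pd

# Tiny stand-in frame with the same string-typed columns
sample = pd.DataFrame({
    'price': ['£51.77', '£53.74'],
    'num_of_rev': ['0', '0'],
})

# Strip the currency sign and convert the prices to float
sample['price'] = sample['price'].str.replace('£', '', regex=False).astype(float)
# Convert the review counts to integers
sample['num_of_rev'] = sample['num_of_rev'].astype(int)

print(sample.dtypes)
```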


## Final PRO dataset

In [None]:
df.shape

(1000, 8)

In [None]:
display(
    df.head(),
    df.tail()
)

Unnamed: 0,id,book_name,genre,desc,price,stock,url,num_of_rev
0,0,A Light in the Attic,Poetry,It's hard to imagine a world without A Light i...,£51.77,1,https://books.toscrape.com/catalogue/a-light-i...,0
1,1,Tipping the Velvet,Historical Fiction,"""Erotic and absorbing...Written with starling ...",£53.74,1,https://books.toscrape.com/catalogue/tipping-t...,0
2,2,Soumission,Fiction,"Dans une France assez proche de la nôtre, un ...",£50.10,1,https://books.toscrape.com/catalogue/soumissio...,0
3,3,Sharp Objects,Mystery,"WICKED above her hipbone, GIRL across her hear...",£47.82,1,https://books.toscrape.com/catalogue/sharp-obj...,0
4,4,Sapiens: A Brief History of Humankind,History,From a renowned historian comes a groundbreaki...,£54.23,1,https://books.toscrape.com/catalogue/sapiens-a...,0


Unnamed: 0,id,book_name,genre,desc,price,stock,url,num_of_rev
995,995,Alice in Wonderland (Alice's Adventures in Won...,Classics,Данные отсутствуют,£55.53,1,https://books.toscrape.com/catalogue/alice-in-...,0
996,996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",Sequential Art,High school student Kei Nagai is struck dead i...,£57.06,1,https://books.toscrape.com/catalogue/ajin-demi...,0
997,997,A Spy's Devotion (The Regency Spies of London #1),Historical Fiction,"In England’s Regency era, manners and elegan...",£16.97,1,https://books.toscrape.com/catalogue/a-spys-de...,0
998,998,1st to Die (Women's Murder Club #1),Mystery,"James Patterson, bestselling author of the Ale...",£53.98,1,https://books.toscrape.com/catalogue/1st-to-di...,0
999,999,"1,000 Places to See Before You Die",Travel,"Around the World, continent by continent, here...",£26.08,1,https://books.toscrape.com/catalogue/1000-plac...,0
