<a href="https://www.kaggle.com/code/fabinahian/web-scraping-a-bengali-online-library?scriptVersionId=155922904" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Website**: [কবি ও কবিতা](https://banglakobita.com.bd/%E0%A6%97%E0%A6%B2%E0%A7%8D%E0%A6%AA-%E0%A6%89%E0%A6%AA%E0%A6%A8%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%B8/)
 



# 🎯 GOALS:

* Scrape book names, writer names, genres, and a text snippet of each book from listing pages 1 through 10.
* Build a dataframe from the scraped data.

**Minor Details:**

* The "Book Text" column contains only the text on the first page of each book; pagination within a book is ignored.
* The "Genre" column is meant to hold every genre listed for a book; the code below currently reads only the first category link.

# 🔖 Importing Libraries

In [1]:
from bs4 import BeautifulSoup # parses the downloaded HTML
import requests # sends the HTTP requests that fetch the web pages
import pandas as pd # builds the final dataframe

# 🔖 Obtaining the Response (Contents) of the First Page

In [2]:
#getting the response from the first page of the website

url = "https://banglakobita.com.bd/%E0%A6%97%E0%A6%B2%E0%A7%8D%E0%A6%AA-%E0%A6%89%E0%A6%AA%E0%A6%A8%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%B8/page/1/" 
response = requests.get(url)
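
The percent-encoded segment in the URL is the UTF-8 encoding of the Bengali category slug. As a small sketch using only the standard library, the snippet below decodes it and rebuilds the listing URL for any page number. The helper name `page_url` and the constants `BASE` and `ENCODED_SLUG` are illustrative and not part of the original notebook.

```python
from urllib.parse import quote, unquote

BASE = "https://banglakobita.com.bd"
# percent-encoded category slug, copied from the URL used in this notebook
ENCODED_SLUG = "%E0%A6%97%E0%A6%B2%E0%A7%8D%E0%A6%AA-%E0%A6%89%E0%A6%AA%E0%A6%A8%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%B8"

# unquote() turns the UTF-8 percent-escapes back into readable Bengali text
slug = unquote(ENCODED_SLUG)

def page_url(page_no):
    """Hypothetical helper: URL of listing page `page_no`."""
    # quote() re-encodes the slug so the URL stays ASCII-only
    return f"{BASE}/{quote(slug)}/page/{page_no}/"

print(slug)
print(page_url(1))
```

Keeping the slug in one place means the single-page cell and the ten-page loop cannot drift apart.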

In [3]:
response 

# output: <Response [200]> 
# status code 200 means the request was successful, so we have access.

<Response [200]>

In [4]:
# getting the raw contents (bytes) of the page

html = response.content

# parsing the contents as HTML with BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
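
Reading the status code by eye works in a notebook, but in a script the same check can be done in code. The following is a minimal sketch, assuming a 10-second timeout is acceptable. `fetch_html` is a hypothetical helper name; `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Hypothetical helper: return the page body as bytes, or raise on failure."""
    # use the given session if there is one, otherwise the module-level requests.get
    getter = session if session is not None else requests
    # timeout stops a hung connection from blocking forever
    response = getter.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.content
```

Passing a shared `requests.Session()` as `session` would reuse one connection across the many requests made while scraping all ten pages.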

# 🔖 Accessing the Necessary Locations

In [5]:
list_of_books = soup.find('div', class_ = 'sek-grid-items sek-list-layout sek-thumb-no-custom-height sek-shadow-on-hover')

In [6]:
books = list_of_books.find_all('article', class_ = 'sek-has-thumb')
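
`find` returns `None` when nothing matches, and a multi-class string passed to `class_` has to match the `class` attribute exactly, so a small theme change would make the `find_all` call fail with an `AttributeError`. One possible alternative is a CSS selector that names only the classes it depends on. It is sketched here on a made-up HTML fragment rather than the live page, and the names `sample_html`, `sample_soup` and `sample_books` are illustrative.

```python
from bs4 import BeautifulSoup

# made-up fragment that mimics the listing structure used in this notebook
sample_html = """
<div class="sek-grid-items sek-list-layout extra-class">
  <article class="sek-has-thumb"><img alt="Title A - Writer A"></article>
  <article class="sek-has-thumb"><img alt="Title B - Writer B"></article>
  <article class="no-thumb"><img alt="ignored"></article>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector; class order and extra classes do not matter
sample_books = sample_soup.select("div.sek-grid-items article.sek-has-thumb")

if not sample_books:
    # select() returns an empty list (not None) when nothing matches
    print("no books found; the page layout may have changed")

alts = [book.find("img")["alt"] for book in sample_books]
print(alts)
```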

# 🔖 Scraping the First Page

In [7]:
data = []

for book in books:
    article = book.find('img')
    names = article.attrs['alt']
    
    #handling book names and writer names
    
    book_name_and_writer = names.split(" - ")
    
    if len(book_name_and_writer) != 2: # the alt text did not split into exactly [book name, writer]; unpacking below would raise a ValueError
        continue  # skip this book
        
    book_name, writer = book_name_and_writer
    
    #handling genres
    
    ctgry = book.find('div', class_ = 'sek-pg-category')
    genre = ctgry.find('a').text
    
    #handling text snippets of the books
    
    snippet_loc_1 = book.find('div', class_ = 'sek-pg-content')
    snippet_loc_2 = snippet_loc_1.find('h2', class_ = 'sek-pg-title')
    text_url = snippet_loc_2.find('a').attrs['href'] #this collects the url of the page that contains the book's material
    
    response_text = requests.get(text_url).text
    soup_text = BeautifulSoup(response_text, 'html.parser') #parsing the webpage
    
    page = soup_text.find('div', class_ =  'entry-content clearfix')
    
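    elements = page.find_all('p') # elements holds ALL the paragraphs (<p>..</p>)
    limit = len(elements)
    text = ''
    for i in range(2, limit-2): # skips the first two and last two paragraphs so that the title and the publication date are ignored and only the story is kept
        text = text + elements[i].text
        
    # inspecting the collected information for one book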
    
#     print(book_name_and_writer)
#     print("title = ", book_name)
#     print("writer = ", writer)
#     print("read book here: ", text_url)
#     print("first page: ",text)
#     print("genre = ", genre)
#     print("\n\n")

    
    #storing the scraped data
    
    data.append([book_name, writer, text, genre])

    
# print("collected data: ", data)
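
The loop above stores only the first link inside the `sek-pg-category` block. If a book lists several categories, one way to keep them all is to collect every link and join the names. This is a sketch on a made-up fragment; the `", "` separator and the `extract_genres` helper name are assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def extract_genres(book, sep=", "):
    """Hypothetical helper: join the text of every category link of one book."""
    category = book.find("div", class_="sek-pg-category")
    if category is None:
        # some entries may have no category block at all
        return ""
    return sep.join(a.get_text(strip=True) for a in category.find_all("a"))

# made-up example with two categories
sample = BeautifulSoup(
    '<article><div class="sek-pg-category">'
    '<a href="#">Story</a> / <a href="#">Novel</a>'
    '</div></article>',
    "html.parser",
)
print(extract_genres(sample.find("article")))
```

Inside the scraping loop this would replace the two genre lines with `genre = extract_genres(book)`.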

# 🔖 Scraping ALL Ten Pages

In [8]:
data = [] #this will hold all the scraped data for each book

for page_no in range (1,11): #since there are 10 pages

    url = f"https://banglakobita.com.bd/%E0%A6%97%E0%A6%B2%E0%A7%8D%E0%A6%AA-%E0%A6%89%E0%A6%AA%E0%A6%A8%E0%A7%8D%E0%A6%AF%E0%A6%BE%E0%A6%B8/page/{page_no}/" 
    response = requests.get(url)

    # getting the raw contents (bytes) of the page

    html = response.content

    # parsing the contents as HTML with BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')

    list_of_books = soup.find('div', class_ = 'sek-grid-items sek-list-layout sek-thumb-no-custom-height sek-shadow-on-hover')
    books = list_of_books.find_all('article', class_ = 'sek-has-thumb')

    for book in books:
        article = book.find('img')
        names = article.attrs['alt']

        #handling book names and writer names

        book_name_and_writer = names.split(" - ")

        if len(book_name_and_writer) != 2: # the alt text did not split into exactly [book name, writer]; unpacking below would raise a ValueError
            continue  # skip this book

        book_name, writer = book_name_and_writer

        #handling genres

        ctgry = book.find('div', class_ = 'sek-pg-category')
        genre = ctgry.find('a').text

        #handling text snippets of the books

        snippet_loc_1 = book.find('div', class_ = 'sek-pg-content')
        snippet_loc_2 = snippet_loc_1.find('h2', class_ = 'sek-pg-title')
        text_url = snippet_loc_2.find('a').attrs['href'] #this collects the url of the page that contains the book's material

        response_text = requests.get(text_url).text
        soup_text = BeautifulSoup(response_text, 'html.parser') #parsing the webpage

        page = soup_text.find('div', class_ =  'entry-content clearfix')
        elements = page.find_all('p') #elements has ALL the paragraphs using <p>..</p>
        limit = len(elements)
        text = ''
        for i in range(2, limit-2): # the bounds skip the first two and last two paragraphs, so the title and the publication date are ignored and only the story is kept
            text = text + elements[i].text

        #storing the scraped data

        data.append([book_name, writer, text, genre])
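
The inner loop concatenates paragraphs with no separator, which is likely why sentences run together in the scraped text (for example `না।আগে` in the second row of the output). A possible variant, written on plain strings so it runs without the site, slices off the same two leading and two trailing paragraphs and joins the rest with a space. `join_story` and its parameters are illustrative names.

```python
def join_story(paragraphs, skip_head=2, skip_tail=2, sep=" "):
    """Hypothetical helper: drop leading/trailing paragraphs and join the rest."""
    end = len(paragraphs) - skip_tail
    # the slice is empty when the page has too few paragraphs
    body = paragraphs[skip_head:end]
    return sep.join(p.strip() for p in body)

# placeholder strings standing in for the text of each <p> element
sample = ["head 1", "head 2", "First paragraph.", "Second paragraph.", "tail 1", "tail 2"]
print(join_story(sample))
```

With BeautifulSoup the input list could be built as `[p.get_text() for p in page.find_all('p')]`.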

# 🔖 Creating the Data Frame


In [9]:
df = pd.DataFrame(data, columns =['Book Name', 'Writer', 'Book Text', 'Genre'], dtype = str)
df.to_csv('Bengali_library.csv')

In [10]:
table = pd.read_csv("/kaggle/working/Bengali_library.csv")
table

| Unnamed: 0.1 | Unnamed: 0 | Book Name | Writer | Book Text | Genre |
|---|---|---|---|---|---|
| 0 | 0 | সেপ্টোপাসের খিদে | সত্যজিৎ রায় | কড়া নাড়ার আওয়াজ পেয়ে আপনা থেকেই মুখ থেকে এ... | গল্প/উপন্যাস |
| 1 | 1 | টেরোড্যাকটিলের ডিম | সত্যজিৎ রায় | বদনবাবু আপিসের পর আর কার্জন পার্কে আসেন না।আগে... | গল্প/উপন্যাস |
| 2 | 2 | বঙ্কুবাবুর বন্ধু | সত্যজিৎ রায় | বঙ্কুবাবুকে বুকে কেউ কোনওদিন রাগতে দেখেনি। সত্... | গল্প/উপন্যাস |
| 3 | 3 | বর্ণান্ধ | সত্যজিৎ রায় | দুষ্প্রাপ্য বইখানাকে বগলে গুঁজে পুরনো বইয়ের দ... | গল্প/উপন্যাস |
| 4 | 4 | পুরস্কার | সত্যজিৎ রায় | বয়স চব্বিশ, লম্বা, রোগাটে। হাত-পা রোগা, মুখখা... | গল্প/উপন্যাস |
| ... | ... | ... | ... | ... | ... |
| 81 | 81 | বিলাসী | শরৎচন্দ্র চট্টোপাধ্যায় | পাকা দুই ক্রোশ পথ হাঁটিয়া স্কুলে বিদ্যা অর্জন... | গল্প/উপন্যাস |
| 82 | 82 | মহেশ | শরৎচন্দ্র চট্টোপাধ্যায় | ১গ্রামের নাম কাশীপুর। গ্রাম ছোট, জমিদার আরও ছো... | গল্প/উপন্যাস |
| 83 | 83 | একরাত্রি | রবীন্দ্রনাথ ঠাকুর | সুরবালার সঙ্গে একত্রে পাঠশালায় গিয়াছি, এবং ব... | গল্প/উপন্যাস |
| 84 | 84 | কাবুলিওয়ালা | রবীন্দ্রনাথ ঠাকুর | আমার পাঁচ বছর বয়সের ছোটো মেয়... | গল্প/উপন্যাস |
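
The `Unnamed: 0` column in the table above is the DataFrame index that `to_csv` writes by default and `read_csv` then reads back as an ordinary column. A minimal sketch of avoiding it with `index=False` follows, using a two-row placeholder frame and an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# placeholder rows standing in for the scraped data
sample = pd.DataFrame(
    [["Title A", "Writer A", "text a", "genre"],
     ["Title B", "Writer B", "text b", "genre"]],
    columns=["Book Name", "Writer", "Book Text", "Genre"],
)

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False leaves the row index out of the file
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

An alternative that keeps the index in the file is to read it back with `pd.read_csv(path, index_col=0)`.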


# 🏁 Final Thoughts

This code scraped 86 of the 95 books, which largely meets the goal of the project. The remaining 9 books are missing from the dataframe.

Observation: Some book names do not follow the standard `book name - writer` pattern. The length check in the loop skips such entries, which most likely accounts for the missing books, and the code is not yet equipped to handle these anomalies.
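
One possible way to keep the books that are currently skipped is to split on the last separator only and fall back to an empty writer when no separator is present. This is a sketch under the assumption that the anomalies are titles that contain an extra ` - ` or have no writer at all. The actual failing `alt` strings are not shown in this notebook, so the sample strings below are invented, and `split_title_writer` is a hypothetical helper name.

```python
def split_title_writer(alt_text, sep=" - "):
    """Hypothetical helper: return (book_name, writer) from an image alt text."""
    if sep in alt_text:
        # rsplit with maxsplit=1 splits at the LAST separator only,
        # so a title that itself contains " - " stays in one piece
        book_name, writer = alt_text.rsplit(sep, 1)
    else:
        # no separator: keep the title and record the writer as missing
        book_name, writer = alt_text, ""
    return book_name.strip(), writer.strip()

for alt in ["Title - Writer", "Part One - Part Two - Writer", "Title Only"]:
    print(split_title_writer(alt))
```

Whether the last separator is the right place to split depends on how the site writes its `alt` text, so the skipped entries are worth inspecting before adopting this rule.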

