# Web scraping of a Gaming Website using BeautifulSoup

Gaming news was once hard to come by; today, dedicated websites report even minor updates about games, and gamers are always looking for the latest releases and updates. To make this information available in a structured form, this project extracts the blog listings of one such gaming website, [DigiStatement](https://digistatement.com/).
![digi-statement](https://www.linkpicture.com/q/Screenshot_20221226_194400.png)

## Introduction to [Web Scraping](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/)
<b> What is meant by web scraping?</b> <br>
![image-4.png](https://i.ibb.co/2PdK54T/3.png)
Web scraping is the process of extracting unstructured information from a web page and converting it into structured data, for example a spreadsheet or a database. Several tools can be used for this; [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python package that parses HTML for scraping.
A website's layout usually stays the same while the data it displays changes over time. Web scraping is a way to collect the required data in a structured form for personal or professional use.
The project is written in the [Python Programming Language](https://en.wikipedia.org/wiki/Python_(programming_language)).


## Outline
These are the steps we will follow to build the project:
1. Install the required third-party Python libraries, `requests` and `beautifulsoup4`.
2. Parse the HTML source code using BeautifulSoup
3. Extract the blog titles, author names, dates and URLs from a page
4. Extract and combine data from multiple pages
5. Compile extracted information into Python lists and dictionaries
6. Save the extracted information to a CSV file

By the end of this project, we will be able to see the `Top 100 Blogs on DigiStatement` in the following `csv` file format:
![final csv](https://www.linkpicture.com/q/Screenshot_20221227_111300.png)

## 1.  How to get started with the project?

This project is hosted on Jovian. To get started, open the `Jupyter Notebook` and run it on `Binder`. Alternatively, run it locally using `Anaconda`.
![run](https://i.imgur.com/J6vTodm.png)

## 2. Installing and importing the Libraries required
We start by installing and importing the libraries used later in the project:
* <b>Requests</b> - to download the web page as text.
* <b>BeautifulSoup</b> - to parse the text downloaded by Requests for further processing.

In [1]:
# The --quiet flag hides the output produced while installing the libraries.
# The --upgrade flag installs the latest available version (if any).
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

Importing the `modules` that we require from the installed `libraries`.

In [2]:
from bs4 import BeautifulSoup
import requests

## 3. Creating the function `get_page()` 
1. The base URL points the function at the site's paginated listing.
2. The function takes a `page_number` argument to target the required web page.
3. The `page_url` variable stores the complete URL of the page to be extracted, including the page number.
4. `requests.get` sends an HTTP GET request to that URL and returns the response.
5. The status code is checked with an `if` statement. A code in the 200-299 range indicates success; in most cases it is exactly 200.
6. Finally, the variable `doc` holds the page's HTML parsed by `BeautifulSoup`.

In [3]:
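def get_page(page_number):
    base_url = "https://digistatement.com/page/"
    page_url = base_url + str(page_number) + "/"
    response = requests.get(page_url)

    # Any status code outside the 200-299 range means the download failed.
    if not 200 <= response.status_code < 300:
        raise Exception("Failed to fetch {} (status code {})".format(page_url, response.status_code))

    # Parse the downloaded HTML so it can be searched by tag and class.
    doc = BeautifulSoup(response.text, 'html.parser')

    return doc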
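
The URL construction and the status check described above can be tried without touching the network. The following is a minimal sketch; the helper names `build_page_url` and `is_success` are hypothetical and not part of the notebook.

```python
# Base URL taken from the notebook; the helper names below are illustrative only.
BASE_URL = "https://digistatement.com/page/"


def build_page_url(page_number):
    """Return the listing URL for one page number."""
    return "{}{}/".format(BASE_URL, page_number)


def is_success(status_code):
    """Return True for any HTTP status code in the 200-299 range."""
    return 200 <= status_code < 300


print(build_page_url(3))
print(is_success(200), is_success(404))
```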

## 4. Creating the function `parse_blog()` to extract Blog Title, Author's Name, URL, and Date of Publishing of a particular page
1. It receives a single `article` tag (the parameter is named `doc`), taken from the page parsed by `get_page()`.
2. The `a_tags` variable stores the `a` tags found with `find_all()` under the first `h3` tag of that article. The `a` tag inside the `h3` holds both the title (its text) and the URL (its `href` attribute), which are stored in the `title` and `url` variables.
3. Similarly, the date and the author's name are read from the `div` tags with the classes `jeg_meta_date` and `jeg_meta_author`. The leading `by ` is removed from the author's name.
4. Finally, a dictionary containing the Title, Author's Name, URL, and Date of Publishing is returned.

In [4]:
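def parse_blog(doc):
    # The first <a> inside the article's <h3> carries both the title and the link.
    a_tags = doc.h3.find_all('a')
    title = a_tags[0].text.strip()
    url = a_tags[0]['href']
    author_tag = doc.find_all('div', class_="jeg_meta_author")
    # Remove only the leading "by " so names that contain "by " stay intact.
    author_name = author_tag[0].text.strip().replace('by ', '', 1)
    date_tag = doc.find_all('div', class_="jeg_meta_date")
    date = date_tag[0].text.strip()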
    return {
        'Blog Title':title,
        'Author Name': author_name,
        'URL': url,
        'Date of Publishing': date
    }
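
To see these lookups in isolation, they can be run against a small inline HTML snippet. This is only a sketch: the markup, names, and URLs below are made up to mimic the structure the parsing code relies on, and the real pages may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the parsing code relies on.
sample_html = """
<article class="jeg_post jeg_pl_md_2 format-standard">
  <h3 class="jeg_post_title"><a href="https://example.com/sample-post/">Sample Game Update</a></h3>
  <div class="jeg_meta_author">by <a href="https://example.com/author/">Jane Doe</a></div>
  <div class="jeg_meta_date"><a href="https://example.com/sample-post/">December 26, 2022</a></div>
</article>
"""

article = BeautifulSoup(sample_html, "html.parser").article

# The same lookups as in parse_blog(), applied to the sample article.
link = article.h3.find_all("a")[0]
author = article.find_all("div", class_="jeg_meta_author")[0].text.strip()
date = article.find_all("div", class_="jeg_meta_date")[0].text.strip()

blog = {
    "Blog Title": link.text.strip(),
    "Author Name": author.replace("by ", "", 1),
    "URL": link["href"],
    "Date of Publishing": date,
}
print(blog)
```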

## 5. Creating the function `get_top_blogs()`
1. It takes `doc`, the parsed page returned by `get_page()`, as its argument.
2. Each result on a DigiStatement page sits inside an `<article>` tag, so the `article_tags` variable stores all matching article tags from that page.
3. Finally, `parse_blog()` is applied to each `article` tag, and a list of the resulting dictionaries (Title, Author's Name, URL, and Date of Publishing) is returned.

In [5]:
def get_top_blogs(doc):
    article_tags = doc.find_all('article', class_="jeg_post jeg_pl_md_2 format-standard")
    blogs_on_page = [parse_blog(article) for article in article_tags]
    return blogs_on_page
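
One thing to be aware of: when a string with several class names is passed to `class_`, BeautifulSoup matches it against the exact value of the `class` attribute, so the order of the names matters. A CSS selector is one possible alternative that ignores the order. The sketch below uses made-up markup to compare the two; whether the real site ever reorders its classes is not known.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes, written in two different orders.
sample_html = """
<article class="jeg_post jeg_pl_md_2 format-standard"><h3><a href="/a/">A</a></h3></article>
<article class="format-standard jeg_post jeg_pl_md_2"><h3><a href="/b/">B</a></h3></article>
"""
doc = BeautifulSoup(sample_html, "html.parser")

# Exact-string match: only articles whose class attribute is written in this order.
exact = doc.find_all("article", class_="jeg_post jeg_pl_md_2 format-standard")

# CSS selector: every article that carries all three classes, in any order.
by_selector = doc.select("article.jeg_post.jeg_pl_md_2.format-standard")

print(len(exact), len(by_selector))
```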

## 6. Extracting top 100 Blogs of DigiStatement from its first 10 pages.

Each page on DigiStatement lists 10 blogs, sorted with the latest first. To extract the top 100 blogs, we first fetch the first 10 pages with the `get_page(page_number)` function and then apply the `get_top_blogs(doc)` function to each page to extract its 10 blogs. This is done with an empty list and a `for` loop.

The `get_top100_blogs()` function does all of the above and returns the results as one list per page.

In [6]:
def get_top100_blogs():
    top_100 = []
    for i in range(1,11):
        top_100.append(get_top_blogs(get_page(i)))
    return top_100
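
Because `append` adds each page's list as a single element, the result is a list of lists. One possible alternative is `extend`, which adds the items themselves. The sketch below assumes a hypothetical `fetch_page_blogs` callable in place of the real download, so that it runs without network access.

```python
def collect_blogs(fetch_page_blogs, first_page=1, last_page=10):
    """Collect blog dictionaries from several pages into one flat list.

    fetch_page_blogs is a hypothetical callable that returns the list of
    blog dictionaries for one page number.
    """
    all_blogs = []
    for page_number in range(first_page, last_page + 1):
        # extend() adds the items themselves, so no nested lists are created.
        all_blogs.extend(fetch_page_blogs(page_number))
    return all_blogs


# Stand-in for the real download: two made-up blogs per page.
def fake_fetch(page_number):
    return [{"Blog Title": "Post {}-{}".format(page_number, i)} for i in range(1, 3)]


blogs = collect_blogs(fake_fetch, first_page=1, last_page=3)
print(len(blogs))
```

With the notebook's own functions, the callable could be `lambda n: get_top_blogs(get_page(n))`.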


## 7. Creating a flat list to get a clean result

The `get_top100_blogs()` function returns a list of lists, which cannot be written to a CSV file directly. The following function, `flatten_list(_2d_list)`, turns it into a single flat list.

In [7]:
def flatten_list(_2d_list):
    flat_list = []
    for e in _2d_list:
        if isinstance(e, list):
            for i in e:
                flat_list.append(i)
        else:
            flat_list.append(e)
    return flat_list
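
When every element of the outer list is itself a list, the standard library offers a shorter route. This is a minimal sketch with made-up data; unlike `flatten_list`, it assumes there are no non-list elements at the top level.

```python
from itertools import chain

# Hypothetical nested result: one inner list of blog dictionaries per page.
pages = [
    [{"Blog Title": "A"}, {"Blog Title": "B"}],
    [{"Blog Title": "C"}],
]

# chain.from_iterable walks each inner list in turn; list() materialises the result.
flat = list(chain.from_iterable(pages))
print(flat)
```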

## 8. Creating a function to deploy the final result into a ".csv" file

In [8]:
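import csv

def write_csv(items, path):
    # newline='' lets the csv module control the line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        # The csv module quotes values that contain commas (for example dates
        # such as "December 26, 2022"), so they stay in a single column.
        writer = csv.DictWriter(f, fieldnames=headers, restval=' ', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(items)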
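
Values such as dates can contain commas, which is why fields in a CSV file need quoting. The sketch below uses a made-up row and an in-memory buffer to compare a plain comma join with the standard `csv` module.

```python
import csv
import io

# Hypothetical row: the date contains a comma, as "Month day, year" dates do.
row = {"Blog Title": "Sample Post", "Date of Publishing": "December 26, 2022"}

# Plain join: the comma inside the date produces an extra column when split again.
naive_line = ",".join(row.values())
print(naive_line.split(","))

# csv.DictWriter quotes the field, so it survives a round trip.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerow(row)

buffer.seek(0)
restored = next(csv.DictReader(buffer))
print(restored)
```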

## 9. Creating a single function to combine everything 

This final function takes no arguments. It brings the previously defined functions together, writes the Top 100 Blogs on DigiStatement to a CSV file, and returns the file's path.

In [9]:
def scrape_top_100_blogs():
    '''Get the top 100 blogs for DigiStatement and write them to csv file'''
    path = 'Top_100_Blogs.csv'
    main = get_top100_blogs()
    flat_list = flatten_list(main)
    write_csv(flat_list,path)
    print('Top 100 Blogs of DigiStatement written to file {}'.format(path))
    return path

In [10]:
scrape_top_100_blogs()

Top 100 Blogs of DigiStatement written to file Top_100_Blogs.csv


'Top_100_Blogs.csv'

## Summary
In this project, the gaming website DigiStatement.com was scraped for the articles listed on its first 10 pages. Python, together with the `requests` and `BeautifulSoup` libraries, was used to download the pages and extract the relevant data. The results were saved to a CSV file for further processing.

## Future works
This web scraping project is the starting point for a larger NLP project, updates on which will be posted later. Here I started by looking at websites that provide news articles on the specific topic of 'Gaming'.

Later, the project will move towards scraping data for user-defined topics from selected websites, first within gaming and then in other areas as well.

Another area to explore is how a particular topic is covered in the media: the general sentiment of the authors, and which part of the academic spectrum the coverage leans towards.

## Importing Jovian and committing the work

In [11]:
pip install jovian

Note: you may need to restart the kernel to use updated packages.


In [12]:
import jovian

In [14]:
jovian.commit()

[jovian] Updating notebook "abhibhardwajabbb/abhishek-web-scraping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/abhibhardwajabbb/abhishek-web-scraping-project-final


'https://jovian.ai/abhibhardwajabbb/abhishek-web-scraping-project-final'