# Web Scraping and Introductory Data Analysis

### Purpose of This Assignment
- Write a web scraping script using Beautiful Soup and Selenium to collect data from a website.
- Sample from the collected dataset and compare the statistics of the sample and the population.
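
The second goal can be sketched on synthetic data before any scraping is done. The snippet below is only an illustration under stated assumptions: the log-normal `population` is a made-up stand-in for a fee column (it is not scraped data), and the sample size of 200 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic, right-skewed stand-in for a fee column (assumption, not scraped data)
population = pd.Series(rng.lognormal(mean=-6.0, sigma=1.0, size=2000), name="tnx_fee")

# Simple random sample without replacement
sample = population.sample(n=200, random_state=0)

# Put the same statistics side by side for comparison
stats = ["mean", "std", "median"]
summary = pd.DataFrame({
    "population": population.agg(stats),
    "sample": sample.agg(stats),
})
print(summary)
```

With a simple random sample the two columns should be close but not identical; the gap generally shrinks as the sample size grows.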

### Environment Setup

Create the virtualenv by running the following command:

`python -m venv .venv`

Activate the virtualenv by running the following command:

Windows:
`.venv/Scripts/activate`

Linux:
`source .venv/bin/activate`

Now install the required packages with pip by running the following command:

`pip install -r ./../requirements.txt`

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import bs4
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [5]:
NUMBER_BLOCKS = 10
ETHERESCAN_URL = "https://etherscan.io/txs"

## Web Scraping
**Class: `EthereumScraping`**

This class is designed to scrape Ethereum transaction data from Etherscan. It offers functionalities to:

- Extract data from a specified number of blocks (`number_block`)
- Clean and organize the extracted data into a pandas DataFrame
- Handle web scraping using Selenium (specifically ChromeDriver)

**Properties:**

- `url` (str): The URL of Etherscan's transaction list page, taken from the module-level constant `ETHERESCAN_URL`.
- `columns` (list): A list of column names for the DataFrame to be created (`['tnx_hash', 'method', 'block', 'date', 'from', 'to', 'value', 'tnx_fee']`).
- `number_block` (int, default=10): The desired number of blocks to scrape data from.
- `driver` (Optional[webdriver.Chrome]): A reference to the ChromeDriver instance used for web scraping (initialized during the first scraping operation).
- `df` (pd.DataFrame): The DataFrame that stores the scraped transaction data, with columns defined in `columns`.

**Methods:**

- `__init__(self, number_block: int=10) -> None`
  - Initializes the class instance with the specified `number_block`.
  - Sets `driver` to `None` initially.
  - Creates an empty DataFrame with the columns specified in `columns`.

- `__del__(self) -> None` (Destructor)
  - Closes the ChromeDriver instance if it's open to avoid resource leaks.

- `_get_data_from_td_tag(self, element: bs4.element.Tag) -> str` (Private)
  - Reads the `href` attribute of the first anchor (`<a>`) tag within a table data cell (`<td>`) element.
  - Splits the extracted URL by '/' and returns the last element. In this class it is applied to the sender and recipient cells, so the returned value is an address.

- `_collect_data_from_tr_tag(self, elements: bs4.element.ResultSet) -> tuple[pd.Series, int]` (Private)
  - Takes a collection of table cell elements (`elements`) from a table row (`<tr>`).
  - Returns a tuple: a pandas Series indexed by `columns`, and the row's block number converted to integer. The Series holds the following data (extracted from corresponding cell indices):
    - Transaction hash (index: 'tnx_hash')
    - Method name (index: 'method')
    - Block number (index: 'block') (kept as text inside the Series)
    - Date (index: 'date')
    - From address (index: 'from')
    - To address (index: 'to')
    - Transaction value (index: 'value')
    - Transaction fee (index: 'tnx_fee')
  - Utilizes `_get_data_from_td_tag` to extract the from and to addresses from anchor tags within specific cells; every other field is the stripped text of its cell.

- `_extract_data_from_html(self, html_content: str) -> int` (Private)
  - Parses the provided HTML content using BeautifulSoup.
  - Finds all table rows (`<tr>`) and iterates through them:
    - Extracts data from table cell elements (`<td>`) within each row using `_collect_data_from_tr_tag`.
    - Appends the extracted Series as a new row to the `df` DataFrame, ignoring the index during concatenation.
    - Keeps track of the maximum block number encountered.
  - Returns the maximum block number extracted from the HTML content.

- `_extract_data_from_url(self) -> int` (Private)
  - Locates the table body with the CSS selector `"tbody.align-middle.text-nowrap"` using the `driver` and reads its markup through the element's `get_attribute("outerHTML")` method.
  - Calls `_extract_data_from_html` to parse the HTML and extract data, and returns the maximum block number it reports.

- `_click_next_button(self) -> None` (Private)
  - Creates a WebDriverWait object with a 10-second timeout.
  - Waits for the "Next" button (`a[aria-label='Next']`) to become clickable using `element_to_be_clickable`.
  - Clicks the button once it is clickable; on failure (for example a timeout), prints an error message and re-raises the exception.

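- `_extract_data(self) -> None` (Private)
  - Starts a ChromeDriver instance and stores it in `driver`.
  - Navigates to the `url` using the `driver`.
  - Calls `_extract_data_from_url` to extract data from the initial page and records its maximum block number.
  - Repeatedly calls `_click_next_button` and `_extract_data_from_url` until the gap between the first page's maximum block number and the current page's maximum block number reaches `number_block`.

- `scrap(self) -> pd.core.frame.DataFrame`
  - Runs `_extract_data` and returns the populated `df`.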
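
The row-parsing steps described above can be tried without a browser. This is a minimal sketch on hypothetical markup: the two-cell row and the shortened link text are invented for illustration, and the real Etherscan rows have more cells. It shows why reading the `href` and keeping its last path segment yields the full address even if the visible link text is abbreviated.

```python
from bs4 import BeautifulSoup

# Hypothetical two-cell row; not the real Etherscan table layout
html = """
<table><tbody>
<tr>
  <td><a href="/block/19389347">19389347</a></td>
  <td><a href="/address/0x95222290dd7278aa3ddd389cc1e1d165cc4bafe5">0x952222...4bafe5</a></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
parsed = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # Plain cell text, stripped and converted to an integer
    block = int(cells[0].text.strip())
    # Last path segment of the link target, as in _get_data_from_td_tag
    address = cells[1].find("a").get("href").split("/")[-1]
    parsed.append((block, address))

print(parsed)
```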
  

In [6]:
class EthereumScraping:
    url = ETHERESCAN_URL
    columns = ['tnx_hash', 'method', 'block', 'date', 'from', 'to', 'value', 'tnx_fee']

    def __init__(self, number_block: int=10) -> None:
        self.number_block = number_block
        self.driver = None
        self.df = pd.DataFrame(columns=self.columns)

    def __del__(self) -> None:
        if self.driver:
            self.driver.quit()

    def _get_data_from_td_tag(self, element: bs4.element.Tag) -> str:
        # The last path segment of the cell's link is the full address
        return element.find('a').get('href').split('/')[-1]

    def _collect_data_from_tr_tag(self, elements: bs4.element.ResultSet) -> tuple[pd.Series, int]:
        # Returns the row as a Series plus its block number as an integer
        return pd.Series(
            [
                elements[1].text.strip(),                  # tnx_hash
                elements[2].text.strip(),                  # method
                elements[3].text.strip(),                  # block
                elements[4].text.strip(),                  # date
                self._get_data_from_td_tag(elements[7]),   # from
                self._get_data_from_td_tag(elements[9]),   # to
                elements[10].text.strip(),                 # value
                elements[11].text.strip()                  # tnx_fee
            ],
            index=self.columns
        ), int(elements[3].text.strip())


    def _extract_data_from_html(self, html_content: str) -> int:
        soup = BeautifulSoup(html_content, "html.parser")
        rows = soup.find_all("tr")
        block_number = 0
        for row in rows:
            cells = row.find_all("td")
            series, block = self._collect_data_from_tr_tag(cells)
            block_number = max(block_number, block)
            self.df = pd.concat([self.df, pd.DataFrame([series])], ignore_index=True)

        return block_number

            
    def _extract_data_from_url(self) -> int:
        return self._extract_data_from_html(
            self.driver.find_element(
                By.CSS_SELECTOR, "tbody.align-middle.text-nowrap"
            ).get_attribute("outerHTML")
        )
    
    def _click_next_button(self) -> None:
        try:
            WebDriverWait(self.driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']"))
            ).click()
        except Exception as e:
            print(f"Error clicking the 'Next' button: {e}")
            raise

    def _extract_data(self) -> None:
        # Start the browser only once, even if scrap() is called again
        if self.driver is None:
            self.driver = webdriver.Chrome()
        self.driver.get(self.url)
        # Pages list the newest transactions first, so the gap grows with each "Next" click
        block_number = new_block_number = self._extract_data_from_url()
        while (block_number - new_block_number) < self.number_block:
            self._click_next_button()
            new_block_number = self._extract_data_from_url()

    def scrap(self) -> pd.core.frame.DataFrame:
        self._extract_data()
        return self.df
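
The stopping rule in `_extract_data` can be checked without opening a browser. The sketch below is a simplified stand-in, not the class itself: `pages_scraped` is a hypothetical helper, and the list of per-page maximum block numbers is made up.

```python
def pages_scraped(page_max_blocks, number_block):
    """Count how many pages are read before the block gap reaches number_block."""
    pages = iter(page_max_blocks)
    first = current = next(pages)  # maximum block number on the first page
    count = 1
    while first - current < number_block:
        try:
            current = next(pages)  # stands in for "click Next, then extract"
        except StopIteration:
            break  # ran out of simulated pages
        count += 1
    return count


# Made-up maximum block number for each successive page
print(pages_scraped([100, 99, 97, 95, 90, 88], number_block=10))
```

Because the check uses each page's maximum block and runs only after the page has been stored, the last page is kept in full. The collected rows can therefore cover slightly more than `number_block` blocks, which is consistent with the output below running from block 19389347 down to 19389337.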
        
      





In [7]:
scraptEther = EthereumScraping(NUMBER_BLOCKS)
df = scraptEther.scrap()




In [8]:
df

| | tnx_hash | method | block | date | from | to | value | tnx_fee |
|---|---|---|---|---|---|---|---|---|
| 0 | 0xbeed63ac01811c1ef07e90166e0d612a259339116bb1... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x95222290dd7278aa3ddd389cc1e1d165cc4bafe5 | 0x9ca3e337a8587352f4ea97b75f6163ef5fd80c3f | 0.062462594 ETH | 0.00099483 |
| 1 | 0x45e11803d0a5b68fb4bd47935ef985884deb7a9de3d5... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x9658d0971f9690e45eed3d11c42d5c56367c11d8 | 0xdac17f958d2ee523a2206206994597c13d831ec7 | 0 ETH | 0.00218478 |
| 2 | 0x9c9bc09c34f18324ccef65cd9bad57f9ee0909c7ce4c... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0xd2d75cf1a4ac4daf3b670398fd1b5b03cd06ecb0 | 0xdac17f958d2ee523a2206206994597c13d831ec7 | 0 ETH | 0.00299446 |
| 3 | 0xeac497f47800d93c917a315b3f00223e772805e360e9... | Transfer* | 19389347 | 2024-03-08 8:47:11 | 0x415c8893d514f9bc5211d36eeda4183226b84aa7 | 0xff00000000000000000000000000000000081457 | 0 ETH | 0.07150771 |
| 4 | 0xe9ffeeb15e066a75d68fde4e1b8446e3e7e330837dbf... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x4dc964672b6f3637c56a2afce19ed4fc04166766 | 0xf017d3690346eb8234b85f74cee5e15821fee1f4 | 0 ETH | 0.00245131 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2045 | 0xe24ed3debeb34467e8f9e0decf8294d7a66572f1e11f... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x176df84e6c0cee98df1eccfdbff6de782183b79c | 0x42476f744292107e34519f9c357927074ea3f75d | 0 ETH | 0.0014543 |
| 2046 | 0xfbc937715599dda492a9ce9e8f379072ca2a316f700a... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x74dec05e5b894b0efec69cdf6316971802a2f9a1 | 0xd12cb5e72cb8dafb04b042d0f074819af408d06f | 2.79755916 ETH | 0.00102526 |
| 2047 | 0xa70f684b5db48522dc8b8d71b22546b4d024d3f7c88e... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0xf80609e58cf2193a9036f080426083ef96f2e6dd | 0x42a7797351dfd281a80807196c8508eb70bb2af9 | 0 ETH | 0.00183204 |
| 2048 | 0x73bae17d7a70d12eb93d3a12fc73a5d45ed4d31a2846... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x6edf968da408a9640b8865826429a977a11c5048 | 0xa0027187490b307388365d0612125654061fe3cd | 0.69185 ETH | 0.00102526 |


## Data Cleaning

In [28]:
df['tnx_hash'].nunique()

1622

In [23]:
df.drop_duplicates()

| | tnx_hash | method | block | date | from | to | value | tnx_fee |
|---|---|---|---|---|---|---|---|---|
| 0 | 0xbeed63ac01811c1ef07e90166e0d612a259339116bb1... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x95222290dd7278aa3ddd389cc1e1d165cc4bafe5 | 0x9ca3e337a8587352f4ea97b75f6163ef5fd80c3f | 0.062462594 ETH | 0.00099483 |
| 1 | 0x45e11803d0a5b68fb4bd47935ef985884deb7a9de3d5... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x9658d0971f9690e45eed3d11c42d5c56367c11d8 | 0xdac17f958d2ee523a2206206994597c13d831ec7 | 0 ETH | 0.00218478 |
| 2 | 0x9c9bc09c34f18324ccef65cd9bad57f9ee0909c7ce4c... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0xd2d75cf1a4ac4daf3b670398fd1b5b03cd06ecb0 | 0xdac17f958d2ee523a2206206994597c13d831ec7 | 0 ETH | 0.00299446 |
| 3 | 0xeac497f47800d93c917a315b3f00223e772805e360e9... | Transfer* | 19389347 | 2024-03-08 8:47:11 | 0x415c8893d514f9bc5211d36eeda4183226b84aa7 | 0xff00000000000000000000000000000000081457 | 0 ETH | 0.07150771 |
| 4 | 0xe9ffeeb15e066a75d68fde4e1b8446e3e7e330837dbf... | Transfer | 19389347 | 2024-03-08 8:47:11 | 0x4dc964672b6f3637c56a2afce19ed4fc04166766 | 0xf017d3690346eb8234b85f74cee5e15821fee1f4 | 0 ETH | 0.00245131 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2045 | 0xe24ed3debeb34467e8f9e0decf8294d7a66572f1e11f... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x176df84e6c0cee98df1eccfdbff6de782183b79c | 0x42476f744292107e34519f9c357927074ea3f75d | 0 ETH | 0.0014543 |
| 2046 | 0xfbc937715599dda492a9ce9e8f379072ca2a316f700a... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x74dec05e5b894b0efec69cdf6316971802a2f9a1 | 0xd12cb5e72cb8dafb04b042d0f074819af408d06f | 2.79755916 ETH | 0.00102526 |
| 2047 | 0xa70f684b5db48522dc8b8d71b22546b4d024d3f7c88e... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0xf80609e58cf2193a9036f080426083ef96f2e6dd | 0x42a7797351dfd281a80807196c8508eb70bb2af9 | 0 ETH | 0.00183204 |
| 2048 | 0x73bae17d7a70d12eb93d3a12fc73a5d45ed4d31a2846... | Transfer | 19389337 | 2024-03-08 8:45:11 | 0x6edf968da408a9640b8865826429a977a11c5048 | 0xa0027187490b307388365d0612125654061fe3cd | 0.69185 ETH | 0.00102526 |
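
`df['tnx_hash'].nunique()` reports 1622 distinct hashes while the index runs to 2048, so some transactions appear more than once. Note that `drop_duplicates()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned. A minimal sketch on a toy frame (all values hypothetical) of keeping one row per hash:

```python
import pandas as pd

# Toy frame with one repeated transaction hash (hypothetical values)
toy = pd.DataFrame({
    "tnx_hash": ["0xaa", "0xbb", "0xaa", "0xcc"],
    "block": [105, 105, 105, 104],
    "tnx_fee": ["0.001", "0.002", "0.001", "0.003"],
})

# Keep the first occurrence of each hash and assign the result
deduped = toy.drop_duplicates(subset="tnx_hash", keep="first").reset_index(drop=True)

print(len(toy), len(deduped))
```

Deduplicating on `tnx_hash` rather than on all columns is a design choice made here for illustration: it treats the hash as the identity of a transaction even if another scraped field differs between two copies.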



In [29]:
df['tnx_fee'] = df['tnx_fee'].astype(float)
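
The `value` column cannot be cast the same way because each entry carries an ` ETH` suffix, as the rows above show. The following is a sketch only, assuming every entry ends in ` ETH` and that any thousands separators are commas; the column name `value_eth` is introduced here for illustration.

```python
import pandas as pd

# A few entries copied from the output above
sample = pd.DataFrame({
    "value": ["0.062462594 ETH", "0 ETH", "2.79755916 ETH"],
    "tnx_fee": ["0.00099483", "0.00218478", "0.00102526"],
})

# Strip the unit suffix (and commas, if any) before casting to float
sample["value_eth"] = (
    sample["value"]
    .str.replace(" ETH", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
sample["tnx_fee"] = sample["tnx_fee"].astype(float)

print(sample.dtypes)
```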