# Web Scraping and Introductory Data Analysis

Welcome to Homework 0, where we will delve into web scraping and perform an introductory data analysis. This homework will be a hands-on exercise that will help you become familiar with the process of extracting data from websites and conducting basic statistical analysis. 

## Objectives

By the end of this homework, you will be able to:

1. Set up a Python environment with the necessary libraries for web scraping and data analysis.
2. Write a web scraping script using Beautiful Soup and Selenium to collect data from a website.
3. Sample from the collected dataset and compare the statistics of the sample and the population.
   
## Tasks

1. **Environment Setup**: Install the required libraries such as Beautiful Soup, Selenium, pandas, numpy, matplotlib, and seaborn.

2. **Web Scraping**: Write a script to scrape transaction data from [Etherscan.io](https://etherscan.io/txs). Use Selenium to interact with the website and Beautiful Soup to parse the HTML content.

3. **Data Sampling**: Once the data is collected, create a sample from the dataset. Compare the sample statistics (mean and standard deviation) with the population statistics.
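
Before starting, it can help to confirm that the libraries listed in the first task are importable. The snippet below is a minimal sketch of such a check; the helper name `missing_modules` is hypothetical, and the list uses import names rather than pip package names (Beautiful Soup is imported as `bs4`).

```python
import importlib.util

# Import names, which can differ from the pip package names:
# Beautiful Soup installs as "beautifulsoup4" but is imported as "bs4".
REQUIRED_MODULES = ["bs4", "selenium", "pandas", "numpy", "matplotlib", "seaborn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any name reported as missing needs to be installed in the environment that runs the notebook kernel.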


## Deliverables

1. A Jupyter notebook with all the code and explanations.
2. A detailed report on the findings, including the comparison of sample and population statistics.

Note: You can include the report in your notebook.

## Getting Started

Begin by setting up your Python environment and installing the necessary libraries. Then, proceed with the web scraping task, ensuring that you handle any potential issues such as rate limiting. Once you have the data, move on to the data sampling and statistical analysis tasks. 

Remember to document your process and findings in the Jupyter notebook, and to include visualizations where appropriate to illustrate your results.

Good luck, and happy scraping!

## Data Collection (Etherscan)

In this section, we gather transaction data from the Ethereum blockchain by scraping the Etherscan block explorer. The objective is to collect the transactions from the **last 10 blocks** on Ethereum.

The target URL for the data collection is:

[https://etherscan.io/txs](https://etherscan.io/txs)

### Steps

1. **Navigate to the URL**: Use Selenium to open the Etherscan transactions page in a browser.

2. **Locate the Transaction Data**: Identify the HTML elements that contain the transaction data for the specified block range.

3. **Extract the Data**: Write a script to extract the transaction details, such as Hash, Method, and Block.

4. **Handle Pagination**: If the transactions span multiple pages, implement pagination handling to navigate through the pages and collect all relevant transaction data.

5. **Store the Data**: Save the extracted transaction data into a structured format, such as a CSV file or a pandas DataFrame, for further analysis.
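
For the pagination step, one option is to read the page count from the pagination label instead of counting list items. The sketch below assumes the control renders text of the form `Page 1 of 7`; that format and the helper name `parse_page_count` are assumptions to verify against the live page.

```python
import re


def parse_page_count(label):
    """Extract N from a label such as 'Page 1 of 7'; fall back to 1 page."""
    match = re.search(r"Page\s+\d+\s+of\s+([\d,]+)", label)
    if match is None:
        return 1  # no recognizable label: treat the result as a single page
    # Remove thousands separators before converting, e.g. '1,234' -> 1234.
    return int(match.group(1).replace(",", ""))
```

The label text itself could be taken from the parsed pagination element, for example with Beautiful Soup's `get_text()`.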

### Considerations

- **Rate Limiting**: Be mindful of the website's rate limits to avoid being blocked. Implement delays between requests if necessary.
- **Dynamic Content**: The Etherscan website may load content dynamically. Ensure that Selenium waits for the necessary elements to load before attempting to scrape the data.
- **Data Cleaning**: After extraction, clean the data to remove any inconsistencies or errors that may have occurred during the scraping process.
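
The rate-limiting consideration can be sketched as a small wrapper that pauses before every request and waits longer after each failure. This is an illustration rather than a required design: `fetch_with_backoff`, its parameters, and the delay values are hypothetical, and `fetch` stands for any callable that loads a URL (for example `driver.get`).

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and backing off on failure."""
    last_error = None
    for attempt in range(retries):
        # Wait base_delay * 2**attempt seconds plus up to 0.5 s of random jitter.
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
        try:
            return fetch(url)
        except Exception as error:  # in practice, narrow this to timeout errors
            last_error = error
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

The `sleep` parameter exists only so the delays can be replaced in tests; in the notebook the default `time.sleep` is the natural choice.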

### Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Ethereum](https://ethereum.org/en/)

In [26]:
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd
import time
from bs4 import BeautifulSoup
import re
import json
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC  
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

In [27]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd

def scrape_data():
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)

    # Open the Etherscan block list and wait for the table to render
    driver.get("https://etherscan.io/blocks")
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table')))

    # Use BeautifulSoup to parse block numbers.
    # Rows 1-10 skip the header row and cover the last 10 blocks;
    # the first link in each row is taken to be the block number.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    blocks = [row.find('a').text for row in soup.find_all('tr')[1:11]]

    transactions_data = []

    # Iterate over blocks
    # extract transactions
    for block in blocks:
        driver.get(f"https://etherscan.io/txs?block={block}")
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table-hover')))

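        # Work out how many result pages this block has.
        # NOTE: this assumes the pagination list holds one <li> per page plus
        # the 'previous' and 'next' buttons; check this against the live markup.
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        pagination = soup.find('ul', class_='pagination')
        pages = 1
        if pagination:
            pages = len(pagination.find_all('li')) - 2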

        for page in range(1, pages + 1):
            if page > 1:
                # Go directly to the page
                driver.get(f"https://etherscan.io/txs?block={block}&p={page}")
                wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table-hover')))

            soup = BeautifulSoup(driver.page_source, 'html.parser')
            transactions_table = soup.find('table', class_='table-hover')

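            # Extract transactions.
            # NOTE: the cell indices below depend on Etherscan's table layout.
            # In the recorded output, 'From' holds a Unix timestamp and 'To' is
            # empty, which suggests the indices from cells[6] onward are shifted;
            # check them against the live page before relying on the columns.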
            for row in transactions_table.find('tbody').find_all('tr'):
                cells = row.find_all('td')
                if len(cells) > 1:  # Ensure it's not an empty row
                    transaction = {
                        'Hash': cells[1].text.strip(),
                        'Method': cells[2].text.strip(),
                        'Block': block,
                        'Age': cells[4].text.strip(),
                        'From': cells[6].text.strip(),
                        'To': cells[8].text.strip(),
                        'Value': cells[9].text.strip(),
                        'TxnFee': cells[10].text.strip(),
                    }
                    transactions_data.append(transaction)

    driver.quit()

    # Convert the list of dictionaries into a DataFrame and save as CSV
    df = pd.DataFrame(transactions_data)
    df.to_csv('data.csv', index=False)
    return df

# Execute the function and print the first few rows of the dataframe
df_optimized = scrape_data()
print(df_optimized.head())



                                                Hash               Method  \
0  0xfe2d690b99d54fd95ec62878169fa1bcf2162f4de09c...      Sell To Uniswap   
1  0xe33896e4a5bafa442b90f0d7235bf42380ba62ff1f82...      Sell To Uniswap   
2  0xe3064529e5983ca98f9898636c5ec81f9f68bf3b195c...      Sell To Uniswap   
3  0xc9c806e8554527c51796db3cf814c7a8584ca04b75fe...              Approve   
4  0x5b9815d0ac6885b604b69a37f15d0f4746d187c22c03...  Multiplex Multi ...   

      Block                  Age        From To                   Value  \
0  19356643  2024-03-03 19:11:59  1709493119         0x: Exchange Proxy   
1  19356643  2024-03-03 19:11:59  1709493119         0x: Exchange Proxy   
2  19356643  2024-03-03 19:11:59  1709493119         0x: Exchange Proxy   
3  19356643  2024-03-03 19:11:59  1709493119     PulsePad: PLSPAD Token   
4  19356643  2024-03-03 19:11:59  1709493119         0x: Exchange Proxy   

            TxnFee  
0  0.014403182 ETH  
1  0.018443724 ETH  
2  1.352960825 ETH  
3            0 ETH  
4            0 ETH  

In [28]:
df_optimized

| | Hash | Method | Block | Age | From | To | Value | TxnFee |
|---|---|---|---|---|---|---|---|---|
| 0 | 0xfe2d690b99d54fd95ec62878169fa1bcf2162f4de09c... | Sell To Uniswap | 19356643 | 2024-03-03 19:11:59 | 1709493119 | | 0x: Exchange Proxy | 0.014403182 ETH |
| 1 | 0xe33896e4a5bafa442b90f0d7235bf42380ba62ff1f82... | Sell To Uniswap | 19356643 | 2024-03-03 19:11:59 | 1709493119 | | 0x: Exchange Proxy | 0.018443724 ETH |
| 2 | 0xe3064529e5983ca98f9898636c5ec81f9f68bf3b195c... | Sell To Uniswap | 19356643 | 2024-03-03 19:11:59 | 1709493119 | | 0x: Exchange Proxy | 1.352960825 ETH |
| 3 | 0xc9c806e8554527c51796db3cf814c7a8584ca04b75fe... | Approve | 19356643 | 2024-03-03 19:11:59 | 1709493119 | | PulsePad: PLSPAD Token | 0 ETH |
| 4 | 0x5b9815d0ac6885b604b69a37f15d0f4746d187c22c03... | Multiplex Multi ... | 19356643 | 2024-03-03 19:11:59 | 1709493119 | | 0x: Exchange Proxy | 0 ETH |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1358 | 0x0e4e8469319a32e232128038cc481b4061c8ce29a893... | Execute | 19356634 | 2024-03-03 19:09:59 | 1709492999 | | Uniswap: Universal Router | 0 ETH |
| 1359 | 0x43cfcfe5bddc1f80bb09b2f0cb85e6baa1f0a6cce4c8... | Execute | 19356634 | 2024-03-03 19:09:59 | 1709492999 | | Uniswap: Universal Router | 0.29 ETH |
| 1360 | 0x87f56a5c67253684dfa8e09611509b401f95ad65bfee... | Execute | 19356634 | 2024-03-03 19:09:59 | 1709492999 | | Uniswap: Universal Router | 0 ETH |
| 1361 | 0xe88c2c159ab64fa0644fad7d8b6572a9658df329c6b1... | 0x2b603313 | 19356634 | 2024-03-03 19:09:59 | 1709492999 | | MEV Bot: 0x6b7...A80 | 218 wei |
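
The amounts in the recorded output mix units, for example `0.29 ETH` and `218 wei`, so they need to be converted to one numeric unit before any statistics are computed. The sketch below shows one way to do that; the helper name `parse_amount_eth` is hypothetical, and the `gwei` entry is an assumption in case that unit appears in the scraped text.

```python
from decimal import Decimal

# Conversion factors to ETH: 1 ETH = 10**18 wei and 1 gwei = 10**9 wei.
UNIT_TO_ETH = {
    "eth": Decimal(1),
    "gwei": Decimal(10) ** -9,
    "wei": Decimal(10) ** -18,
}


def parse_amount_eth(text):
    """Convert a string such as '0.29 ETH' or '218 wei' into a float in ETH."""
    number, unit = text.strip().split()
    factor = UNIT_TO_ETH[unit.lower()]  # raises KeyError for an unknown unit
    # Decimal keeps the scaling exact until the final conversion to float.
    return float(Decimal(number.replace(",", "")) * factor)
```

With pandas, the function can be applied to whichever column holds the amounts, for example with `Series.map`.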


## Data Analysis

Now that we have collected the transaction data from Etherscan, the next step is to conduct an initial analysis. This task involves the following steps:

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by converting data types, removing any irrelevant information, and handling **duplicate** values.

3. **Statistical Analysis**: Calculate the mean and standard deviation of the population. Evaluate these statistics to understand the distribution of transaction values. Perform the analysis and plotting on the **Txn Fee** and **Value** columns.

4. **Visualization**: This phase involves the creation of visual representations to aid in the analysis of transaction values. The visualizations include:
    - A histogram for each data column, which provides a visual representation of the data distribution. The selection of bin size is crucial and should be based on the data's characteristics to ensure accurate representation. Explain your choice of bin size.
    - A normal distribution plot fitted alongside the histogram to compare the empirical distribution of the data with the theoretical normal distribution.
    - A box plot and a violin plot to identify outliers and provide a comprehensive view of the data's distribution.
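
One data-driven way to justify a bin size is the Freedman-Diaconis rule, which sets the bin width to 2 · IQR / n^(1/3). The sketch below is one possible choice among several, not the required method; the function name `freedman_diaconis_bins` is hypothetical.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Return (bin_width, bin_count) using the Freedman-Diaconis rule."""
    data = sorted(values)
    n = len(data)
    # Quartiles with linear interpolation between the sorted data points.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        return 0.0, 1  # IQR is zero, so the rule gives no usable width
    count = math.ceil((data[-1] - data[0]) / width)
    return width, max(count, 1)
```

The resulting count can be passed as `bins` to a histogram call. Matplotlib and NumPy also accept `bins='fd'`, which applies the same rule. Because the rule relies on the interquartile range, it is less sensitive to extreme transaction values than rules based on the standard deviation.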

### Deliverables

This part of the project should produce the following:

- A refined pandas DataFrame containing the transaction data, which has undergone thorough cleaning and is ready for analysis.
- A simple statistical analysis evaluating the population statistics, offering insights into the distribution of transaction values and fees.
- A set of visualizations showcasing the distribution of transaction values for the population. These visualizations include histograms, normal distribution plots, box plots, and violin plots, each serving a specific purpose in the analysis.

### Getting Started

The project starts by importing the transaction data into a pandas DataFrame, setting the stage for data manipulation and analysis. The next step is to clean the data to ensure its quality and reliability, followed by the calculation of population statistics. Finally, a series of visualizations is created to analyze the distribution of transaction values and fees.
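
When computing the population statistics, note that the population standard deviation divides by N while the sample standard deviation divides by n - 1. pandas `Series.std()` defaults to `ddof=1` (sample) and `numpy.std` defaults to `ddof=0` (population), so the two can disagree on the same data. The sketch below illustrates the difference with the standard library; the function names are hypothetical.

```python
import statistics


def population_summary(values):
    """Return the mean and the population standard deviation (divisor N)."""
    return statistics.fmean(values), statistics.pstdev(values)


def sample_summary(values):
    """Return the mean and the sample standard deviation (divisor n - 1)."""
    return statistics.fmean(values), statistics.stdev(values)
```

For the full scraped dataset treated as the population, the divisor-N version is the one that matches the definition. For a sample drawn from it, the n - 1 version is the usual estimator.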

In [2]:
# Your code here

## Data Sampling and Analysis

In this section, we will delve into the process of data sampling and perform an initial analysis on the transaction data we have collected. Our objective is to understand the distribution of transaction values by sampling the data and comparing the sample statistics with the population statistics.

### Steps

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by handling missing values, converting data types, and removing any irrelevant information.

3. **Simple Random Sampling (SRS)**: Create a sample from the dataset using a simple random sampling method. This involves randomly selecting a subset of the data without regard to any specific characteristics of the data.

4. **Stratified Sampling**: Create another sample from the dataset using a stratified sampling method. This involves dividing the data into strata based on a specific characteristic (e.g., transaction value) and then randomly selecting samples from each stratum. Explain what you have stratified the data by and why you chose this column.

5. **Statistical Analysis**: Calculate the mean and standard deviation of the samples and the population. Compare these statistics to understand the distribution of transaction values.

6. **Visualization**: Plot the distribution of transaction values and fees for both the samples and the population to visually compare their distributions.
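
The two sampling methods in the steps above can be sketched as follows. This is an illustration on plain lists of rows, assuming proportional allocation for the stratified case; the function names, the fixed seed, and the use of `Method` as a stratification key in the usage note are assumptions, not the required choice.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every subset of size n is equally likely."""
    return random.Random(seed).sample(rows, n)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so small groups stay represented.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

For example, `stratified_sample(rows, key=lambda r: r["Method"], fraction=0.1)` keeps roughly 10% of each method. On a DataFrame, `df.sample(n=..., random_state=...)` and `df.groupby(...).sample(frac=...)` provide the corresponding operations.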

### Considerations

- **Sample Size**: The size of the sample should be large enough to represent the population accurately but not so large that it becomes impractical to analyze.
- **Sampling Method**: Choose the appropriate sampling method based on the characteristics of the data and the research question.

Explain the above considerations in your report.
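
One way to reason about sample size is the standard error of the sample mean, which for simple random sampling is σ / √n, multiplied by the finite population correction √((N - n) / (N - 1)) when sampling without replacement from a population of size N. The sketch below computes it; the function name and parameters are hypothetical.

```python
import math


def standard_error_of_mean(population_sd, n, population_size=None):
    """Standard error of the sample mean under simple random sampling.

    If population_size is given, the finite population correction for
    sampling without replacement is applied.
    """
    se = population_sd / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se
```

Because the error shrinks with the square root of n, quadrupling the sample size only halves it, which is one reason very large samples bring diminishing returns.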