## Ethereum Blockchain Transaction Analysis

### Link to GitHub Repository
https://github.com/cs418-fa24/project-check-in-team-2

## Project Introduction
This project analyzes Ethereum blockchain transaction data to identify trends in recent transactions.
Our primary focus is on transaction patterns, fees, and the frequency of the various transaction types, with the aim of better understanding blockchain dynamics.

The analysis involves collecting recent transactions, cleaning the data, performing exploratory data analysis (EDA), and applying machine learning techniques.

### Key Questions:
1. What patterns or trends can be identified in Ethereum transactions?
2. Are there specific metrics that can help predict Ethereum price movements?

## Changes to Project Scope
Since the initial proposal, we have made the following changes:
- **Scope Addition**: We now include a deeper analysis of transaction fees and of activity by block height, in order to identify temporal trends.
- **Scope Reduction**: We initially planned to analyze more historical data; however, we have limited the scope to recent blocks due to time constraints.

These adjustments will allow us to focus on recent trends and complete the analysis within the project timeline.

## Data Collection (Etherscan)
**Data Source:** We collect the data from [Etherscan.io](https://etherscan.io/txs) using Selenium to automate the browser. For each transaction we record the transaction hash, method, block number, timestamp (the age column), sender and recipient addresses, value, and fee.

**Initial Observations:** The collected attributes support an analysis of transaction patterns, including transaction frequency, values, and fees.

In [1]:
%pip install selenium

In [1]:
from bs4 import BeautifulSoup as bs
import requests as req
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import math
import time
from tqdm import tqdm
import re
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import norm
import numpy as np
from math import sqrt
import random

We have disabled the data collection process by commenting it out, as the data has already been saved in a CSV file.

In [None]:
# # Initialize WebDriver
# driver = webdriver.Chrome()

In [None]:
# # Navigate to the blocks page to retrieve the latest block ID
# driver.get("https://etherscan.io/blocks")
# time.sleep(1)
# lastBlock = driver.find_element(by=By.XPATH, value="//*[@id=\"content\"]/section[2]/div[2]/div[2]/table/tbody/tr[1]/td[1]/a")
# lastBlockID = int(lastBlock.text)

In [None]:
# # Generate URLs for the last 20 blocks
# blockURLs = []
# for blockID in range(lastBlockID, lastBlockID-20, -1):
#     blockURLs.append("https://etherscan.io/txs?block=" + str(blockID))
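
The URL generation above can be sketched as a small standalone function. This is a minimal sketch, not the code that was run: the name `block_urls` and the `count` parameter are introduced here for illustration, and only the URL pattern comes from the cell above.

```python
# Hypothetical helper; only the URL pattern is taken from the notebook cell.
BASE_URL = "https://etherscan.io/txs?block="


def block_urls(last_block_id, count=20):
    """Build transaction-list URLs for `count` blocks, newest block first."""
    return [
        f"{BASE_URL}{block_id}"
        for block_id in range(last_block_id, last_block_id - count, -1)
    ]


# Example with a made-up block number
for url in block_urls(100, count=3):
    print(url)
```

Keeping the block count as a parameter would make it easy to widen or narrow the sample without editing the loop.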

In [None]:
# # Function to read each row's data in the transactions table
# def readRow(rowIndex):
#     row = {}
#     xPath = "//*[@id=\"ContentPlaceHolder1_divTransactions\"]/div[2]/table/tbody/tr[" + str(rowIndex) + "]/"
#     row["txnHash"] = driver.find_element(by=By.XPATH, value=xPath + "td[2]/div/span/a").text
#     row["method"] = driver.find_element(by=By.XPATH, value=xPath + "td[3]/span").text
#     row["block"] = driver.find_element(by=By.XPATH, value=xPath + "td[4]/a").text
#     row["age"] = driver.find_element(by=By.XPATH, value=xPath + "td[6]/span").text
#     row["from"] = driver.find_element(by=By.XPATH, value=xPath + "td[8]/div/a[1]").get_attribute("href").split("/")[-1]
#     row["to"] = driver.find_element(by=By.XPATH, value=xPath + "td[10]/div").find_elements(by=By.CSS_SELECTOR, value="a")[-1].get_attribute("data-clipboard-text")
#     row["value"] = driver.find_element(by=By.XPATH, value=xPath + "td[11]/span").text
#     row["txnFee"] = driver.find_element(by=By.XPATH, value=xPath + "td[12]").text
#     return row

*We pause for one second with time.sleep(1) after each page load to avoid being blocked by the website.*
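
One possible variation on the fixed one-second pause is to add a small random jitter so that requests are not perfectly periodic. The helper below is only a sketch under that assumption: `polite_delay`, its parameters, and the jitter itself are not part of the original notebook, which uses a plain `time.sleep(1)`.

```python
import random
import time


def polite_delay(base=1.0, jitter=0.5, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    `sleep` is injectable so the helper can be exercised without waiting.
    Returns the delay that was requested.
    """
    delay = base + random.uniform(0.0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_delay()` wherever the notebook calls `time.sleep(1)` would keep the minimum pause at one second.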

In [None]:
# # Collect data for each block
# table = []
# for blockUrl in tqdm(blockURLs, desc="Collecting Data Blocks"):
#     driver.get(blockUrl)
#     tranCount = int(driver.find_element(by=By.XPATH, value="//*[@id=\"ContentPlaceHolder1_divDataInfo\"]/div/div[1]/span").text.split(" ")[3])
#     pageCount = math.ceil(tranCount / 50)
#     for pageIndex in range(1, pageCount + 1):
#         url = blockUrl + "&p=" + str(pageIndex)
#         driver.get(url)
#         time.sleep(1)
#         rowBound = (tranCount - (pageCount - 1) * 50 + 1) if (pageIndex == pageCount) else 51
#         for rowIndex in range(1, rowBound):
#             table.append(readRow(rowIndex))
#     time.sleep(1)
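
The pagination arithmetic in the loop above can be checked in isolation. The sketch below assumes the summary text looks like "A total of 123 transactions found" (an assumption about the page, not verified here) and that each page holds 50 rows, as in the loop. The function names are hypothetical, and the parser takes the first integer in the text rather than the fourth space-separated token used in the cell.

```python
import math
import re

PAGE_SIZE = 50  # rows per page, matching the 50 used in the collection loop


def parse_transaction_count(text):
    """Extract the first integer from a summary string.

    Assumed input shape: "A total of 123 transactions found".
    Thousands separators such as "1,234" are tolerated.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))


def rows_per_page(tran_count, page_size=PAGE_SIZE):
    """Return the number of rows expected on each page, in page order."""
    page_count = math.ceil(tran_count / page_size)
    counts = []
    for page_index in range(1, page_count + 1):
        if page_index == page_count:
            # The last page holds whatever remains after the full pages
            counts.append(tran_count - (page_count - 1) * page_size)
        else:
            counts.append(page_size)
    return counts


print(rows_per_page(parse_transaction_count("A total of 123 transactions found")))
# → [50, 50, 23]
```

For 123 transactions the loop visits three pages and reads 50, 50, and 23 rows, which matches the `rowBound` expression: on the last page it evaluates to 24, so `range(1, 24)` covers 23 rows.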

*We save the collected data to a CSV file so that later runs can load it instead of scraping again.*

In [2]:
# # Convert to DataFrame and save to CSV
# dataFrame = pd.DataFrame(table)
# dataFrame.to_csv("DataFrame.csv", index=False)

# # Close the driver
# driver.quit()
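
Rather than commenting the collection cells in and out, the "collect once, then reuse the CSV" idea could be expressed as a small cache helper. The sketch below is an illustration only: `load_or_collect` and its `collect` argument are hypothetical, and it uses the standard-library `csv` module to stay self-contained, whereas the notebook itself saves with `DataFrame.to_csv` (and would read back with `pd.read_csv`).

```python
import csv
import os


def load_or_collect(path, collect):
    """Return rows from `path` if it exists; otherwise call `collect()` and save.

    `collect` is a hypothetical zero-argument callable returning a list of
    dicts with identical keys (like the rows built by readRow).
    Rows read back from disk contain string values only.
    """
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    rows = collect()
    if rows:
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

With this pattern the scraper would run only when the CSV file is missing, so the collection code could stay uncommented.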