# Introduction to GDeltCryptoNewsDownload Notebook

The purpose of this Jupyter notebook is to demonstrate the process of collecting and consolidating a dataset of news articles related to cryptocurrencies from the Global Database of Events, Language, and Tone (GDELT) project. This dataset, referred to as `GA_data.csv`, is intended for use in risk and fraud analysis projects where understanding the impact of news on the cryptocurrency market is crucial.

## Objective

The goal is to automate the retrieval of news articles from the GDELT database, restricted to cryptocurrency-related content over a specified time period. Articles are selected by keywords associated with the cryptocurrency market, such as "cryptocurrency", "Bitcoin", and "Ethereum", so that the dataset stays relevant to the project's needs.

## Methodology

The notebook uses the `gdeltdoc` Python library, a client for the GDELT 2.0 Doc API, to query the GDELT database. The library simplifies searching for and downloading news articles by letting search criteria (e.g., keywords, date ranges) be specified in a structured way.

### Key Steps:

1. **Setting Up Filters**: Define a set of filters using the `Filters` class from the `gdeltdoc` library. Each filter specifies one keyword to search for (e.g., "crypto", "Bitcoin", "Ethereum") and a one-day date range running from the previous day to the current day. The notebook steps this window backwards from the configured `end_date`, so re-running it with a later `end_date` picks up the most recent news.

2. **Searching for Articles**: Execute a search query against the GDELT database using the defined filters. The results are then filtered on the `language` column to keep only English articles, which keeps the dataset consistent.

3. **Consolidating and Cleaning the Data**: The search results are compiled into a pandas DataFrame, with duplicates (based on article URLs) removed to ensure each article is unique. This step is repeated for each keyword and each day in the specified period, resulting in a comprehensive dataset of cryptocurrency-related news.

4. **Output**: The final DataFrame, containing unique news articles related to the specified keywords and within the given time frame, is saved to a CSV file named `GA_data.csv`. This file serves as the primary dataset for further analysis in the risk and fraud project.
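
The one-day windows used in the steps above can be sketched without touching the network. The helper name `daily_windows` is hypothetical (it does not appear in the notebook), and this sketch covers exactly the span from `start` to `end`, which may differ by a day from the notebook's own loop bounds. It only illustrates how the date strings passed to `Filters` can be generated.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (window_start, window_end) pairs as YYYYMMDD strings, newest first.

    Hypothetical helper: each pair is a one-day window suitable for the
    start_date / end_date arguments used in the notebook.
    """
    current = end
    while current > start:
        previous = current - timedelta(days=1)
        yield previous.strftime("%Y%m%d"), current.strftime("%Y%m%d")
        current = previous


windows = list(daily_windows(date(2024, 2, 17), date(2024, 2, 20)))
print(windows)
# [('20240219', '20240220'), ('20240218', '20240219'), ('20240217', '20240218')]
```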

## Usage

This notebook is intended to be run on a regular basis (e.g., daily) to refresh the `GA_data.csv` file with the latest news articles. Users can adjust `start_date`, `end_date`, and the keyword list to tailor the dataset to their specific project requirements.
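
As written, the notebook's final cell overwrites `GA_data.csv` on each run. One possible way to keep rows from earlier runs is to merge new results into the existing file and deduplicate by URL. The function below is a minimal sketch under that assumption; `merge_into_csv` is a hypothetical name, not part of the notebook or of `gdeltdoc`.

```python
import os

import pandas as pd


def merge_into_csv(new_rows, path="GA_data.csv"):
    """Append new_rows to the CSV at path, keeping one row per article URL.

    Hypothetical helper: when a URL appears in both the file and new_rows,
    the newer row wins (keep="last"), mirroring the notebook's dedup call.
    """
    if os.path.exists(path):
        combined = pd.concat([pd.read_csv(path), new_rows], ignore_index=True)
    else:
        combined = new_rows.reset_index(drop=True)
    combined = combined.drop_duplicates(subset=["url"], keep="last")
    combined = combined.reset_index(drop=True)
    combined.to_csv(path, index=False)
    return combined
```

With this in place, a daily run would call `merge_into_csv(df)` instead of `df.to_csv("GA_data.csv", index=False)`.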

## Requirements

- Python 3.x
- pandas (`pandas`)
- gdeltdoc (`gdeltdoc`)

## Conclusion

By drawing on the GDELT project's extensive database of global news articles, this notebook provides a repeatable process for collecting and preparing a dataset focused on the cryptocurrency market. The dataset supports risk and fraud analysis by showing how news coverage relates to market behavior and by surfacing patterns or trends that may indicate fraudulent activity.


---
## Library Installation for all 5 notebooks in the project

In [2]:
# !pip install -r requirements.txt

---

In [None]:
## Imports
from datetime import date, timedelta, datetime
import pandas as pd
from gdeltdoc import GdeltDoc, Filters
import time

In [10]:
start_date = datetime.strptime('31/01/2024 00:00:00', '%d/%m/%Y %H:%M:%S')
end_date = datetime.strptime('20/02/2024 00:00:00', '%d/%m/%Y %H:%M:%S')

In [11]:
print(start_date.strftime("%Y%m%d%H%M%S"))

20240210000000


In [12]:
num_of_days = (end_date - start_date).days+1
num_of_days

11

In [21]:
currentday = end_date
previousday = end_date - timedelta(days=1)

f = Filters(
    keyword = "crypto",
    start_date = str(previousday.strftime("%Y%m%d")),
    end_date = str(currentday.strftime("%Y%m%d"))
)

gd = GdeltDoc()

# Search for articles matching the filters
articles = gd.article_search(f)
df = articles[(articles["language"] == 'English')]
df

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry
0,https://www.newsbtc.com/news/company/crypto-on...,https://www.newsbtc.com/news/company/crypto-on...,Crypto on the Rise : Who Are the Strongest Bel...,20240219T211500Z,https://www.newsbtc.com/wp-content/uploads/202...,newsbtc.com,English,United States
1,https://economictimes.indiatimes.com/markets/c...,https://m.economictimes.com/markets/cryptocurr...,Bitcoin : Small investors starting to tiptoe b...,20240219T020000Z,"https://img.etimg.com/thumb/msid-107805046,wid...",economictimes.indiatimes.com,English,India
3,https://cointelegraph.com/news/crypto-super-pa...,,Flood of money from crypto Super PACs could...,20240219T190000Z,https://images.cointelegraph.com/cdn-cgi/image...,cointelegraph.com,English,China
4,https://www.coindesk.com/markets/2024/02/19/bi...,https://www.coindesk.com/markets/2024/02/19/bi...,Bitcoin ( BTC ) ETFs See Record $2 . 4B Weekly...,20240219T161500Z,https://www.coindesk.com/resizer/ZbcDrB_CFh7PN...,coindesk.com,English,United States
5,https://www.pymnts.com/cryptocurrency/2024/cha...,,Chainanalysis : Crypto Money Laundering Plummets,20240219T024500Z,https://www.pymnts.com/wp-content/uploads/2024...,pymnts.com,English,United States
...,...,...,...,...,...,...,...,...
245,https://www.modernreaders.com/news/2024/02/19/...,,Kyber Network Crystal Legacy ( KNCL ) Reaches ...,20240219T020000Z,,modernreaders.com,English,United States
246,https://www.hedgeweek.com/sec-cyber-lapses-pos...,,SEC cyber lapses pose risk to trading secrets ...,20240219T133000Z,,hedgeweek.com,English,United Kingdom
247,https://www.npr.org/2024/02/19/1232449119/ye-t...,,Inside Kanye and Ty Dolla $ign Vulture listen...,20240219T211500Z,https://media.npr.org/assets/img/2024/02/19/ye...,npr.org,English,United States
248,https://dailyreporter.com/2024/02/19/crews-tak...,,Crews secure graffiti - scarred Los Angeles to...,20240219T231500Z,https://dailyreporter.com/files/2024/02/AP2404...,dailyreporter.com,English,United States


In [5]:
keywords = ['cryptocurrency', 'cryptocurrencies', 'CBDC', 'Bitcoin', 'Ethereum',
            'BinanceCoin', 'crypto', 'btc', 'eth', 'USDT']
gd = GdeltDoc()  # one client is enough for all queries

for i in range(num_of_days):
    # Step a one-day window backwards from end_date
    currentday = end_date - timedelta(days=i)
    previousday = end_date - timedelta(days=i+1)
    print("start_date, end_date = ", previousday, currentday)
    for item in keywords:
        print(item)
        try:
            f = Filters(
                keyword = item,
                start_date = previousday.strftime("%Y%m%d"),
                end_date = currentday.strftime("%Y%m%d")
            )
            # Search for articles matching the filters, keeping English only
            articles = gd.article_search(f)
            articles = articles[articles["language"] == 'English']
            # Append to the running DataFrame and keep one row per URL
            df = pd.concat([df, articles], axis=0)
            df = df.drop_duplicates(subset=['url'], keep='last')
            time.sleep(1)
        except Exception as exc:
            # Report the failure instead of silently dropping this keyword/day
            print("  skipped:", exc)
    time.sleep(5) # Sleeping 5 seconds to avoid being blocked by Gdelt/Gdeltdoc



start_date, end_date =  2024-02-19 00:00:00 2024-02-20 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-18 00:00:00 2024-02-19 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-17 00:00:00 2024-02-18 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-16 00:00:00 2024-02-17 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-15 00:00:00 2024-02-16 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-14 00:00:00 2024-02-15 00:00:00
cryptocurrency
cryptocurrencies
CBDC
Bitcoin
Ethereum
BinanceCoin
crypto
btc
eth
USDT
start_date, end_date =  2024-02-13 00:00:00 2024-02-14 00:00:00
cryptocurrency
cryptocurrencies
CBDC
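
The loop above makes a single attempt per keyword and day, so a transient failure loses that query. A possible refinement is to retry with exponential backoff. The sketch below is not part of the notebook; `call_with_retries` and its parameters are hypothetical names chosen for illustration.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and return its result, retrying on any Exception.

    Hypothetical helper: waits base_delay * 2**n seconds after the n-th
    failure and re-raises the last error once all attempts are used up.
    The sleep argument is injectable so the function can be tested quickly.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise
            wait = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s")
            sleep(wait)
```

Inside the loop, the search call could then be wrapped as `articles = call_with_retries(lambda: gd.article_search(f))`, leaving the rest of the cell unchanged.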

In [6]:
df = df.reset_index(drop=True)

In [9]:
df.to_csv("GA_data.csv", index=False)

In [8]:
df

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry
0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States
1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States
2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,
3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,
4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,
...,...,...,...,...,...,...,...,...
13999,https://www.kilkennypeople.ie/news/national-ne...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.kilkennypeople.ie/resizer/1200/700...,kilkennypeople.ie,English,Ireland
14000,https://www.leitrimobserver.ie/news/national-n...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.leitrimobserver.ie/resizer/1200/70...,leitrimobserver.ie,English,Ireland
14001,https://www.donegallive.ie/news/national-news/...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.donegallive.ie/resizer/1200/700/tr...,donegallive.ie,English,
14002,https://www.banklesstimes.com/news/2024/01/29/...,,RFK Jr . Joins Trump in Anti - CBDC Stance,20240129T141500Z,"https://cdn.banklesstimes.com/tr:f-jpg,w-1200,...",banklesstimes.com,English,United States
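
The `seendate` column in the output above holds compact timestamps such as `20240219T211500Z`. For downstream analysis it is usually convenient to parse them into timezone-aware datetimes. The sketch below assumes the trailing `Z` denotes UTC; `parse_seendate` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_seendate(value):
    """Parse a seendate string like '20240219T211500Z' into an aware datetime.

    Assumption: the trailing 'Z' means the timestamp is in UTC.
    """
    return datetime.strptime(value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


print(parse_seendate("20240219T211500Z").isoformat())
# 2024-02-19T21:15:00+00:00
```

For the whole column, `pd.to_datetime(df["seendate"], format="%Y%m%dT%H%M%SZ", utc=True)` should give the equivalent result in pandas.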
