---
title: "Data Gathering"
---

# Data Gathering

## Overview

In this section, we gather the data used in this analysis. All of it comes from the free Alpha Vantage API.

## API

Our primary dataset comes from Alpha Vantage and covers the daily stock prices of six prominent technology companies (AAPL, MSFT, GOOGL, AMZN, META, and TSLA) over the most recent six months. These datasets form the basis of our exploratory analysis, letting us examine current market trends and behavior in the technology sector.

To access these datasets, we first obtain an API key from Alpha Vantage and pass it with every request to authenticate.
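As a minimal sketch of how the key can be supplied without writing it into the notebook, the snippet below reads it from an environment variable and assembles the query string with `urllib.parse.urlencode`. The variable name `ALPHAVANTAGE_API_KEY`, the helper `build_query_url`, and the `"demo"` fallback are assumptions made for illustration; they are not part of the original analysis.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"

def build_query_url(function, api_key, **params):
    """Assemble an Alpha Vantage query URL (hypothetical helper)."""
    # The key travels as the `apikey` query parameter on every request
    query = {"function": function, **params, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(query)}"

# Assumed environment variable name; falls back to a placeholder key
api_key = os.environ.get("ALPHAVANTAGE_API_KEY", "demo")
url = build_query_url("TIME_SERIES_DAILY", api_key, symbol="AAPL", datatype="csv")
```

Keeping the key outside the code also makes it easier to share the notebook without exposing the credential.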

In [4]:
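import io
import requests
import pandas as pd
from datetime import datetime, timedelta

api_key = 'PMPL4GDC4VRLN6XJ'

def get_stock_data(symbol, api_key):
    """Fetch daily prices for `symbol` and keep roughly the last six months."""
    url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&outputsize=full&apikey={api_key}&datatype=csv'
    r = requests.get(url)
    if r.status_code != 200:
        print(f"Error fetching data: {r.status_code}")
        return None

    # Parse the body already downloaded instead of requesting the URL a second time
    df = pd.read_csv(io.StringIO(r.text))
    if 'timestamp' not in df.columns:
        # Alpha Vantage can report rate-limit and key errors in the body with status 200
        print(f"Unexpected response for {symbol}: {r.text[:200]}")
        return None

    df['timestamp'] = pd.to_datetime(df['timestamp'])

    # Approximate six months as 180 days
    six_months_ago = datetime.now() - timedelta(days=6*30)
    return df[df['timestamp'] >= six_months_ago]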

# Download and save six months of daily prices for each company
stock_symbols = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'TSLA']

for stock_symbol in stock_symbols:
    data = get_stock_data(stock_symbol, api_key)

    if data is not None:
        data.to_csv(f'{stock_symbol}_six_months_data.csv', index=False)
        print(f"Data stored for {stock_symbol} for the recent 6 months.")
    else:
        print("No data to store.")


Data stored for AAPL for the recent 6 months.
Data stored for MSFT for the recent 6 months.
Data stored for GOOGL for the recent 6 months.
Data stored for AMZN for the recent 6 months.
Data stored for META for the recent 6 months.
Data stored for TSLA for the recent 6 months.
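
The download step approximates six months as 180 days. As a hedged alternative, the sketch below uses a calendar-based cutoff built with `pd.DateOffset`. The helper name `filter_recent_months` and the sample rows are hypothetical; the only assumption about the real files is that they have a `timestamp` column, as the code above expects.

```python
import pandas as pd

def filter_recent_months(df, months=6, now=None, column="timestamp"):
    """Keep rows whose `column` falls within the last `months` calendar months."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    cutoff = now - pd.DateOffset(months=months)
    out = df.copy()
    out[column] = pd.to_datetime(out[column])
    # Sort oldest first so the result is ready for time-series plots
    return out[out[column] >= cutoff].sort_values(column).reset_index(drop=True)

# Made-up rows, used only to illustrate the cutoff
sample = pd.DataFrame({
    "timestamp": ["2023-11-01", "2023-06-15", "2023-01-10"],
    "close": [170.0, 185.0, 130.0],
})
recent = filter_recent_months(sample, months=6, now="2023-12-01")
print(recent)
```

Passing `now` explicitly makes the filter reproducible, which helps when the notebook is re-run at a later date.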


We also want a view of the broader economic trend in the United States, so we retrieve real GDP per capita to represent it.

In [5]:
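from io import StringIO

url = f'https://www.alphavantage.co/query?function=REAL_GDP_PER_CAPITA&outputsize=full&apikey={api_key}&datatype=csv'
r = requests.get(url)
gdp_data = None

if r.status_code == 200:
    # Parse the body already downloaded instead of requesting the URL again
    df = pd.read_csv(StringIO(r.text))

    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        six_months_ago = datetime.now() - timedelta(days=6*30)
        gdp_data = df[df['timestamp'] >= six_months_ago]
    else:
        # An error or rate-limit message can arrive with status 200
        print(f"Unexpected response: {r.text[:200]}")
else:
    print(f"Error fetching data: {r.status_code}")

# Save the GDP data itself, not the `data` variable left over from the stock downloads
if gdp_data is not None:
    gdp_data.to_csv('Real_GDP_PER_CAPITA_six_months_data.csv', index=False)
    print("Data stored for the recent 6 months.")
else:
    print("No data to store.")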

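In the recorded run, this request failed with `KeyError: 'timestamp'`, meaning the parsed response had no `timestamp` column. Alpha Vantage reports rate-limit and invalid-request messages as JSON with HTTP status 200, so the most likely cause is that the body was such a message rather than CSV, and checking `r.status_code` alone did not catch it.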
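
A `KeyError: 'timestamp'` of this kind can be caught earlier by checking whether the response body is a JSON message before handing it to pandas. The following is a sketch under that assumption; the helper name `parse_csv_or_raise` and both sample payloads are hypothetical and are not taken from a real API response.

```python
import io
import json
import pandas as pd

def parse_csv_or_raise(text):
    """Parse CSV text, raising ValueError if the body is a JSON message instead."""
    stripped = text.lstrip()
    if stripped.startswith("{"):
        try:
            payload = json.loads(stripped)
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict):
            # Surface the first message in the payload, whatever its key is called
            message = next(iter(payload.values()), "unknown error")
            raise ValueError(f"API returned a message instead of CSV: {message}")
    return pd.read_csv(io.StringIO(text))

# Made-up bodies, used only to show both code paths
good_body = "timestamp,value\n2023-01-01,100\n"
bad_body = '{"Note": "rate limit reached"}'

print(parse_csv_or_raise(good_body))
try:
    parse_csv_or_raise(bad_body)
except ValueError as err:
    print(err)
```

Raising an exception here, rather than printing, stops the notebook before an empty or malformed file is written to disk.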

The last dataset we need is news coverage of Apple, which we will use to examine the relationship between news or events and subsequent stock price movements.

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta

api_key = 'PMPL4GDC4VRLN6XJ'
ticker = 'AAPL'
# f-string, so that ticker and api_key are actually substituted into the URL.
# sort=EARLIEST asks for the oldest articles first, which the paging below relies on.
base_url = f'https://www.alphavantage.co/query?function=NEWS_SENTIMENT&tickers={ticker}&sort=EARLIEST&apikey={api_key}'

all_data = []
start_date = datetime(2023, 5, 1)

for _ in range(20):
    time_from = start_date.strftime('%Y%m%dT%H%M')
    url = f"{base_url}&time_from={time_from}&limit=1000"  # request up to 1000 articles per call
    r = requests.get(url)
    data = r.json()
    feed_data = data.get('feed', [])

    # Break the loop if no data is returned
    if not feed_data:
        break

    # Add the retrieved data to the all_data list
    all_data.extend(feed_data)

    # time_published is formatted like 20230501T093000. Because time_from has
    # minute resolution, continue from the minute after the last article.
    last_article_time = feed_data[-1]['time_published']
    start_date = datetime.strptime(last_article_time, '%Y%m%dT%H%M%S') + timedelta(minutes=1)

# Convert the collected data to a DataFrame
df = pd.DataFrame(all_data)

# Save the DataFrame to a CSV file
df.to_csv("News_AAPL.csv", index=False)
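
To relate news to daily prices later on, the articles need to be on the same daily grid as the stock data. One possible first step is sketched below: drop repeated articles and count articles per calendar day. The helper name `daily_article_counts` and the sample records are hypothetical, and the sketch assumes each article carries a `url` field and a `time_published` string formatted like `20230501T093000`.

```python
import pandas as pd

def daily_article_counts(articles):
    """Count unique articles per calendar day from a list of feed records."""
    df = pd.DataFrame(articles)
    # Paging by time can return the same article twice, so deduplicate on URL
    df = df.drop_duplicates(subset="url")
    df["published"] = pd.to_datetime(df["time_published"], format="%Y%m%dT%H%M%S")
    # normalize() truncates each timestamp to midnight of its day
    counts = df.groupby(df["published"].dt.normalize()).size()
    return counts.rename("article_count")

# Made-up records, used only to illustrate the aggregation
sample_feed = [
    {"url": "https://example.com/a", "time_published": "20230501T093000"},
    {"url": "https://example.com/b", "time_published": "20230501T154500"},
    {"url": "https://example.com/a", "time_published": "20230501T093000"},
    {"url": "https://example.com/c", "time_published": "20230502T080000"},
]
counts = daily_article_counts(sample_feed)
print(counts)
```

The resulting series is indexed by date, so it can be joined to the saved price files on their `timestamp` column.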
