# Accessing Structured and Unstructured Data Sources in Data Science

## Introduction to Data Sources
Data for data science projects comes from a variety of sources, each with unique characteristics. These sources can be broadly categorized into:
- **Structured Data**: Highly organized data stored in tables (e.g., databases) or spreadsheets
- **Unstructured Data**: Free-form data that lacks a pre-defined structure (e.g., text documents, images, web pages)
- **Streaming Data**: Continuous flows of data (e.g., logs, real-time events, sockets)

This session will cover how to access different types of data sources, including databases, web scraping, document processing, streaming data, and real-time APIs.

## Accessing Structured Data: Databases
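Databases are a common source of structured data. SQL databases such as MySQL and PostgreSQL store data in tables with a defined schema, which makes them well suited to large amounts of structured data. The example below uses SQLite, a lightweight SQL database that ships with Python, so it runs without a separate database server.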


In [1]:
import sqlite3
import pandas as pd
import random
from datetime import datetime, timedelta

# Connect to an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# STEP1: Create the 'sales_data' table
cursor.execute('''
    CREATE TABLE sales_data (
        id INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        product_price REAL,
        customer_id INTEGER,
        timestamp TEXT
    )
''')

# Build some fake data
products = [
    ("Laptop", 1200.00),
    ("Smartphone", 699.99),
    ("Headphones", 199.99),
    ("Keyboard", 49.99),
    ("Mouse", 29.99),
    ("Monitor", 299.99),
    ("Tablet", 329.99),
    ("Smartwatch", 199.99)
]

# Function to create a random timestamp within the last year
def random_timestamp():
    # Subtract a random offset (under 365 days) from now so the result is never in the future
    offset = timedelta(days=random.randint(0, 364), hours=random.randint(0, 23), minutes=random.randint(0, 59))
    random_date = datetime.now() - offset
    return random_date.strftime('%Y-%m-%d %H:%M:%S')

# Insert fake data into the table
for _ in range(50):  # Creating 50 rows of data
    product_name, product_price = random.choice(products)
    customer_id = random.randint(1000, 2000)  # Fake customer IDs between 1000 and 2000
    timestamp = random_timestamp()
    cursor.execute("INSERT INTO sales_data (product_name, product_price, customer_id, timestamp) VALUES (?, ?, ?, ?)",
                   (product_name, product_price, customer_id, timestamp))

# Commit changes
conn.commit()

# Query and display the data in a DataFrame for easy viewing
df = pd.read_sql_query("SELECT * FROM sales_data", conn)
df.head(10)  # Display the first 10 rows of the generated data


Unnamed: 0,id,product_name,product_price,customer_id,timestamp
0,1,Mouse,29.99,1832,2024-01-06 18:09:37
1,2,Smartwatch,199.99,1471,2024-10-17 11:57:37
2,3,Headphones,199.99,1445,2024-08-20 08:19:37
3,4,Keyboard,49.99,1552,2023-11-25 00:40:37
4,5,Smartphone,699.99,1260,2024-11-01 11:57:37
5,6,Tablet,329.99,1485,2024-04-29 08:43:37
6,7,Smartwatch,199.99,1379,2024-01-05 14:05:37
7,8,Keyboard,49.99,1026,2024-07-10 16:35:37
8,9,Smartphone,699.99,1969,2024-05-05 16:29:37
9,10,Laptop,1200.0,1954,2024-09-19 23:58:37
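
Once the data is in a table, most of the work can be pushed into SQL rather than done row by row in Python. The sketch below is one possible illustration, not part of the original session: it rebuilds a tiny, hypothetical version of `sales_data` and runs a parameterized aggregate query. The helper name `revenue_by_product` and the sample rows are assumptions made for the example.

```python
import sqlite3

# Hypothetical miniature of the sales_data table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data ("
    "id INTEGER PRIMARY KEY, product_name TEXT NOT NULL, "
    "product_price REAL, customer_id INTEGER, timestamp TEXT)"
)
rows = [
    ("Laptop", 1200.00, 1001, "2024-01-05 10:00:00"),
    ("Mouse", 29.99, 1002, "2024-02-10 12:30:00"),
    ("Mouse", 29.99, 1003, "2024-03-15 09:15:00"),
    ("Keyboard", 49.99, 1001, "2024-04-20 16:45:00"),
]
conn.executemany(
    "INSERT INTO sales_data (product_name, product_price, customer_id, timestamp) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()


def revenue_by_product(connection, min_price):
    """Return (product_name, units_sold, revenue) for products priced at or above min_price."""
    query = (
        "SELECT product_name, COUNT(*) AS units_sold, SUM(product_price) AS revenue "
        "FROM sales_data WHERE product_price >= ? "
        "GROUP BY product_name ORDER BY revenue DESC"
    )
    # The ? placeholder lets the driver bind min_price safely
    return connection.execute(query, (min_price,)).fetchall()


summary = revenue_by_product(conn, 40.0)
for product_name, units_sold, revenue in summary:
    print(f"{product_name}: {units_sold} sold, revenue {revenue:.2f}")
```

Binding values through `?` placeholders, as the insert statement above already does, avoids building SQL strings by hand and the injection risks that come with it.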


# Accessing Unstructured Data: Web Scraping
Web scraping extracts data from websites. It is commonly used to gather data for research or analysis when APIs are not available, and (questionably) for training LLMs.

## Explanation

- __Send Request__: `requests.get(url)` sends a GET request to the website.
- __Parse Content__: The BeautifulSoup parser (`'html.parser'`) processes the page’s HTML so that elements are easy to access.
- __Locate Data__:
  - Each book is located via its HTML tag: `<article class='product_pod'>`
  - Within each `product_pod`, the title is read with `h3.a['title']`
  - The price is retrieved with `find('p', class_='price_color').text`

This example extracts and prints each book’s title and price, demonstrating how to gather structured data from a website.

In [2]:
import requests
from bs4 import BeautifulSoup

# Target URL: a sandbox site provided for practicing web scraping
theURL = 'http://books.toscrape.com/'

# Send a request to get the top-level page (the timeout avoids hanging indefinitely)
response = requests.get(theURL, timeout=10)
#print(response.content)  # Uncomment to show the raw page content

# Check if the request was successful
if response.status_code == 200:
    # Parse the page content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all book containers
    books = soup.find_all('article', class_='product_pod')

    # Extract book titles and prices
    for book in books:
        # Get the title
        title = book.h3.a['title']

        # Get the price
        price = book.find('p', class_='price_color').text

        # Print the extracted data
        print(f"Title: {title}, Price: {price}")
else:
    print("Failed to retrieve the webpage.")

Title: A Light in the Attic, Price: £51.77
Title: Tipping the Velvet, Price: £53.74
Title: Soumission, Price: £50.10
Title: Sharp Objects, Price: £47.82
Title: Sapiens: A Brief History of Humankind, Price: £54.23
Title: The Requiem Red, Price: £22.65
Title: The Dirty Little Secrets of Getting Your Dream Job, Price: £33.34
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, Price: £17.93
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, Price: £22.60
Title: The Black Maria, Price: £52.15
Title: Starving Hearts (Triangular Trade Trilogy, #1), Price: £13.99
Title: Shakespeare's Sonnets, Price: £20.66
Title: Set Me Free, Price: £17.46
Title: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), Price: £52.29
Title: Rip it Up and Start Again, Price: £35.02
Title: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, Price: £57.25
Title: Olio, Price: £23.88
Title:
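
The scraped prices are still strings such as `£51.77`, so they need cleaning before any numeric analysis. The following is a minimal sketch of that step using only the standard library. The helper `parse_price` is a hypothetical name, and the three sample rows are copied from the output above rather than fetched live.

```python
import re
from decimal import Decimal


def parse_price(price_text):
    """Extract the numeric amount from a scraped price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No numeric price found in {price_text!r}")
    # Decimal keeps currency amounts exact instead of using binary floats
    return Decimal(match.group())


# Sample (title, price) pairs taken from the scraper output above
scraped = [
    ("A Light in the Attic", "£51.77"),
    ("Tipping the Velvet", "£53.74"),
    ("Soumission", "£50.10"),
]

books = [{"title": title, "price": parse_price(price)} for title, price in scraped]
cheapest = min(books, key=lambda book: book["price"])
print(cheapest["title"], cheapest["price"])
```

Because the pattern only looks for digits, it ignores whatever currency symbol or stray characters precede the amount. That also means it discards the currency itself, which would matter on a site that mixes currencies.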

# Working with Documents (e.g., PDFs and Word Documents)
Document processing is crucial for extracting information from unstructured data in formats like PDFs, Word documents, or text files.

In [3]:
!pip install PyMuPDF  # Library for handling PDFs

import fitz  # PyMuPDF
from io import BytesIO

# Download a sample PDF (Apple's iPhone 15 product environmental report)
theURL = "https://images.apple.com/id/environment/pdf/products/iphone/iPhone_15_and_iPhone_15_Plus_PER_Sept2023.pdf"
theResponse = requests.get(theURL)

# Check if the request was successful
if theResponse.status_code == 200:
    theStream = BytesIO(theResponse.content) #load pdf content into stream
    thePDF = fitz.open(stream=theStream, filetype="pdf") #make in-memory pdf from stream

    theFirstPage = thePDF[0].get_text()
    print("Text from the first page:\n", theFirstPage)
else:
    print("Failed to retrieve the PDF ", theResponse)


Collecting PyMuPDF
  Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.13
Text from the first page:
 Progress toward
our 2030 goal
23% recycled or renewable content1 
31% of manufacturing electricity sourced 
from supplier clean energy projects2
Recovery
Return your device through
Apple Trade In, and we’ll give it
a new life or recycle it for free.
Responsible
packaging
99% fiber-based, due to our work to 
eliminate plastic in packaging5
100% recycled or responsibly sourced 
wood fibers
Responsible manufacturing
Apple Supplier Code of Conduct sets 
strict standards for the protection of people 
in our supply chain and the planet.
Product Enviro
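
Text extracted from a PDF usually keeps the hard line breaks of the page layout, as the output above shows. A small cleanup pass is one way to make it easier to search or feed into later processing. The sketch below is an assumption about what such a pass might look like: `normalize_pdf_text` is a hypothetical helper, and the sample string is a fragment of the output above.

```python
import re


def normalize_pdf_text(raw_text):
    """Collapse the hard line breaks and repeated spaces typical of extracted PDF text."""
    # Strip each line, drop empty ones, then join with single spaces
    lines = [line.strip() for line in raw_text.splitlines()]
    joined = " ".join(line for line in lines if line)
    # Collapse any remaining runs of whitespace into one space
    return re.sub(r"\s+", " ", joined)


# Fragment of the first-page text shown above
sample = (
    "99% fiber-based, due to our work to \n"
    "eliminate plastic in packaging5\n"
    "100% recycled or responsibly sourced \n"
    "wood fibers\n"
)

cleaned = normalize_pdf_text(sample)
print(cleaned)
```

This deliberately simple approach also merges separate paragraphs and leaves footnote markers (the trailing `5` in `packaging5`) in place, so real documents typically need rules tailored to their layout.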

# Accessing Streaming Data
Streaming data is real-time data that flows continuously. Common sources include social media feeds, financial market data, and sensor data.

## Example: Simulating a Data Stream

> Note: Streaming platforms such as Kafka require a complex setup to run inside a Jupyter Notebook. Here’s a simplified example that uses a generator to simulate streaming data instead.

In [4]:
import time

# Simulated streaming data
def data_stream():
    for i in range(5):  # Generate 5 data points
        yield {"timestamp": pd.Timestamp.now(), "value": i}
        time.sleep(1)  # Simulate a 1-second delay

# Access the stream
for data_point in data_stream():
    print(data_point)

{'timestamp': Timestamp('2024-11-07 04:14:18.162912'), 'value': 0}
{'timestamp': Timestamp('2024-11-07 04:14:19.168768'), 'value': 1}
{'timestamp': Timestamp('2024-11-07 04:14:20.170486'), 'value': 2}
{'timestamp': Timestamp('2024-11-07 04:14:21.171496'), 'value': 3}
{'timestamp': Timestamp('2024-11-07 04:14:22.172496'), 'value': 4}
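
Because a stream never presents the whole dataset at once, statistics are usually computed incrementally as data points arrive. As a hedged illustration, the sketch below computes a rolling mean over a generator. The names `rolling_mean` and `sample_stream` are hypothetical, and `sample_stream` stands in for `data_stream()` without the one-second delay.

```python
from collections import deque


def rolling_mean(stream, window_size):
    """Yield the mean of the last window_size values as each data point arrives."""
    window = deque(maxlen=window_size)  # Old values fall off automatically
    for data_point in stream:
        window.append(data_point["value"])
        yield sum(window) / len(window)


def sample_stream():
    # Hypothetical stand-in for data_stream(), without timestamps or the sleep
    for i in range(5):
        yield {"value": i}


means = list(rolling_mean(sample_stream(), window_size=3))
print(means)  # [0.0, 0.5, 1.0, 2.0, 3.0]
```

Only the last `window_size` values are held in memory, which is what lets the same pattern work on a stream that never ends.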


# Accessing Real-Time APIs
APIs provide access to data over the internet. Real-time APIs are often used in data science for retrieving current data (e.g., weather, stock prices).

In [28]:
import requests
import pandas as pd

# API key and endpoint
api_key = 'YOUR_API_KEY'  # Replace with your own OpenWeatherMap API key; never publish a real key
city = 'London'
lat = 51.5073219
lon = -0.1276474
# Temperatures are returned in Kelvin unless a units parameter is supplied
theURL = f"http://api.openweathermap.org/data/2.5/weather?q={city},uk&APPID={api_key}"

# Make the request and stop early on an HTTP error (e.g., an invalid API key)
response = requests.get(theURL, timeout=10)
response.raise_for_status()

# Display relevant information
data = response.json()
theWeather = {
    "City": city,
    "Lat": lat,
    "Lon": lon,
    "Temperature": data["main"]["temp"],
    "Forecast": data["weather"][0]["description"]
}

df = pd.DataFrame(list(theWeather.items()), columns=['Key', 'Value'])
print(df)



           Key            Value
0         City           London
1          Lat        51.507322
2          Lon        -0.127647
3  Temperature           282.87
4     Forecast  overcast clouds
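
The temperature above (282.87) is in Kelvin, and the request assumes the API call succeeded. The sketch below shows one way to flatten such a response, convert the temperature to Celsius, and surface API errors. It makes several assumptions: the helper name `summarize_weather` is hypothetical, the sample payload is hand-written to resemble the response used above, and the `cod`, `message`, `name`, and `coord` fields are assumed from the shape of the service's JSON and should be checked against its documentation.

```python
def summarize_weather(payload):
    """Turn a weather payload into a flat record, converting Kelvin to Celsius."""
    # Assumption: error responses carry a non-200 "cod" and a "message" field
    if str(payload.get("cod", 200)) != "200":
        raise ValueError(f"API error: {payload.get('message', 'unknown error')}")
    return {
        "city": payload["name"],
        "lat": payload["coord"]["lat"],
        "lon": payload["coord"]["lon"],
        "temperature_c": round(payload["main"]["temp"] - 273.15, 2),
        "conditions": payload["weather"][0]["description"],
    }


# Hypothetical payload shaped like the response used above
sample_payload = {
    "cod": 200,
    "name": "London",
    "coord": {"lat": 51.5073, "lon": -0.1276},
    "main": {"temp": 282.87},
    "weather": [{"description": "overcast clouds"}],
}

record = summarize_weather(sample_payload)
print(record)
```

Reading the coordinates from the response, rather than hard-coding them next to the city name, keeps the record consistent with what the API actually returned.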


# Best Practices for Accessing Data Sources
- **Use APIs When Available**: APIs are generally more reliable and structured compared to web scraping
- **Handle Errors and Rate Limits**: APIs and web scraping may have rate limits, so handle errors gracefully
- **Sanitize and Structure Unstructured Data**: When dealing with unstructured data (e.g., web scraping, PDFs), use text cleaning and preprocessing techniques
- **Ensure Data Security and Compliance**: Sensitive data (e.g., customer data) should be accessed securely, following relevant data protection laws
- **Right of Conveyance**: You might have the right to consume a data source, but not to store or forward it to others -- confirm your rights!
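
As an illustration of handling errors and rate limits, the sketch below retries a failing call with exponential backoff. It is a simplified example under stated assumptions: `call_with_retries` and `flaky_fetch` are hypothetical names, the failure is simulated rather than coming from a real API, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def call_with_retries(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with exponential backoff when it raises an exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait 0.5s, 1s, 2s, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky source: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return {"status": "ok"}


delays = []  # Record the waits instead of sleeping, to keep the demo fast
result = call_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Production code would normally catch only the specific exceptions that signal a temporary failure, and would respect any retry interval the server communicates.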

## Transient Storage
> Golden Rule of Data: Parallel systems never are

In other words, be careful about gathering data from a primary source and caching it in a secondary store.

> QUESTION : What problems might this introduce?
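
One way to explore the question is to watch a cache and its primary source drift apart. The sketch below is a toy illustration, not a recommended caching design: `TTLCache`, the `primary` dictionary, and the fake clock are all hypothetical stand-ins.

```python
class TTLCache:
    """Cache values from a primary source, refreshing them after ttl seconds."""

    def __init__(self, fetch, ttl, clock):
        self.fetch = fetch    # Function that reads from the primary source
        self.ttl = ttl        # Seconds a cached value is considered fresh
        self.clock = clock    # Function returning the current time in seconds
        self._store = {}      # key -> (value, time_fetched)

    def get(self, key):
        now = self.clock()
        if key in self._store:
            value, fetched_at = self._store[key]
            if now - fetched_at < self.ttl:
                return value  # May already differ from the primary source
        value = self.fetch(key)
        self._store[key] = (value, now)
        return value


primary = {"Laptop": 1200.00}  # Stand-in for the primary data source
fake_time = [0.0]              # Controllable clock for the demo
cache = TTLCache(fetch=primary.__getitem__, ttl=60, clock=lambda: fake_time[0])

first = cache.get("Laptop")    # Read through to the primary source
primary["Laptop"] = 999.00     # The primary source changes
stale = cache.get("Laptop")    # Still within the TTL: the cached copy is served
fake_time[0] = 61.0
fresh = cache.get("Laptop")    # TTL expired: re-fetched from the primary source
print(first, stale, fresh)
```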

# Conclusion

Accessing data from various sources is essential for data science. By combining structured and unstructured data, data scientists can generate valuable insights and build comprehensive data-driven solutions. Each data source requires different tools and techniques, and understanding how to effectively access and process each one is crucial in data science.