# **Batch data ingestion process using Python**

The  Python scripts demonstrate an automated batch data ingestion process, utilizing synthetic transaction data to simulate daily activities.
- The first script generates synthetic data, representing various transaction types over consecutive days, which are saved in a CSV file. This data includes randomized transaction amounts, categories, and dates, ensuring a diverse dataset for testing.
- The second script automates the ingestion of this data into an SQLite database, scheduled to run daily.
- This setup is particularly useful for testing and developing data processing workflows like ETL processes and database updates without the need for real, sensitive data.

In [None]:
#Install required libraries
!pip install schedule

In [2]:
# Importing necessary library
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import schedule
import time

In [4]:
# Generate synthetic data. This data generator should not be edited as it simulates a fixed schema dataset.
def generate_synthetic_data():
    np.random.seed(0)
    dates = pd.date_range(start='2021-01-01', periods=100, freq='D')
    modes = ['Cash', 'Card', 'Online', 'Wallet']
    categories = ['Grocery', 'Electronics', 'Apparel', 'Dining']
    subcategories = ['Vegetables', 'Mobiles', 'Clothing', 'Restaurants']
    notes = ['Purchase', 'Payment', 'Refund', 'Expense']
    amounts = np.random.uniform(100, 5000, size=100)
    income_expense = ['Income', 'Expense']
    currency = 'INR'


    df = pd.DataFrame({
        'Date': dates,
        'Mode': np.random.choice(modes, size=100),
        'Category': np.random.choice(categories, size=100),
        'Subcategory': np.random.choice(subcategories, size=100),
        'Note': np.random.choice(notes, size=100),
        'Amount': amounts,
        'Income/Expense': np.random.choice(income_expense, size=100),
        'Currency': currency
    })


    df.to_csv('daily_transactions.csv', index=False)

# Call the function to generate data
generate_synthetic_data()

In [3]:
"""
Function to simulate batch data ingestion from a CSV file into an SQLite database.
Note:
- No need for database credentials or network setup.
"""
def batch_ingest():

    df = pd.read_csv('daily_transactions.csv')
    df['transaction_date'] = pd.to_datetime(df['Date'])
    df['amount'] = df['Amount'].astype(float)

    # Create an SQLite engine
    engine = create_engine('sqlite:///mydatabase.db')

    # Ingest data into the SQLite database
    df.to_sql('transactions', engine, if_exists='append', index=False)

    print(f"Ingested {len(df)} records at {time.strftime('%Y-%m-%d %H:%M:%S')}")

# Setting up a scheduler to run the batch ingestion daily at 1:00 AM
schedule.every().day.at("01:00").do(batch_ingest)

# Infinite loop to keep the script running
# Note: This will run indefinitely. User must manually stop execution.
try:
    while True:
        schedule.run_pending()
        time.sleep(5)
except KeyboardInterrupt:
    print("Batch ingestion stopped by user.")

Batch ingestion stopped by user.


# **Real-time data streaming using Python and Kafka**

- The script demonstrates an implementation of real-time data streaming using Apache Kafka.
- It creates a Kafka consumer to listen to the 'user_activity' topic, dynamically processing JSON messages as they arrive.
- Each message is parsed to extract user activity details, which then triggers specific actions based on the type of activity.

In [None]:
# Install required libraries
!pip install kafka-python

In [6]:
# Install required libraries
from kafka import KafkaConsumer
import json

In [None]:
"""
Creates and returns a Kafka consumer listening on a specified topic with
auto-commit enabled and latest offset reset.
"""
def create_kafka_consumer():
    consumer = KafkaConsumer(
        'user_activity',
        bootstrap_servers=['localhost:9092'],
        auto_offset_reset='latest',
        enable_auto_commit=True,
        value_deserializer=lambda x: json.loads(x.decode('utf-8'))
    )
    return consumer
"""
Processes incoming Kafka messages. It performs operations based on the type of user activity,
such as updating profiles or triggering security alerts.
"""
def process_messages(consumer):
    for message in consumer:
        user_activity = message.value
        print(f"Received user activity: {user_activity}")
        # Example: Update user profile or trigger alert based on activity type.
        if user_activity['activity_type'] == 'purchase':
            # This is a placeholder for your update function
            update_user_profile(user_activity['user_id'], user_activity['item_id'])
        elif user_activity['activity_type'] == 'login' and user_activity['location'] != 'usual_location':
            # This is a placeholder for your security alert function
            trigger_security_alert(user_activity['user_id'], user_activity['location'])


def update_user_profile(user_id, item_id):
    """Placeholder function to update user profile"""
    print(f"Updating profile for user {user_id} with item {item_id}")

def trigger_security_alert(user_id, location):
    """Placeholder function to trigger security alert"""
    print(f"Security alert for user {user_id} logging in from {location}")

# Create the Kafka consumer
consumer = create_kafka_consumer()

# Process incoming messages in real-time
process_messages(consumer)

# **API-based data collection using Python and the requests library**

- The script uses the requests library to access structured data via the OpenWeatherMap API.

- It fetches weather data for a specified city and time range, converting the JSON response into a structured pandas DataFrame. This DataFrame is then saved as a CSV file.

In [None]:
# Install required libraries
!pip install requests

In [9]:
# Import necessary libraries
import requests
import pandas as pd
from datetime import datetime, timedelta

In [10]:
# Collect weather forecast data using the OpenWeatherMap API.
def collect_weather_data(api_key, city, days=7):

    base_url = "http://api.openweathermap.org/data/2.5/forecast"
    params = {
        'q': city,
        'appid': api_key,
        'units': 'metric'
    }
    response = requests.get(base_url, params=params)

    if response.status_code == 200:
        data = response.json()
        weather_data = []
        for item in data['list']:
            weather_data.append({
                'datetime': item['dt_txt'],
                'temperature': item['main']['temp'],
                'humidity': item['main']['humidity'],
                'description': item['weather'][0]['description']
            })
        return weather_data
    else:
        print(f"Error: {response.status_code}")
        return None
# Convert the list of weather data to a DataFrame and save it as a CSV file.
def process_weather_data(weather_data, city):
    if weather_data:
        weather_df = pd.DataFrame(weather_data)
        print(weather_df.head())
        weather_df.to_csv(f"{city}_weather_forecast.csv", index=False)
    else:
        print("No data to process.")

# Example usage of the function
if __name__ == "__main__":
    api_key = '3102dd06454b055d78f482f602e4c4a9'  # Example API key
    city = 'Mexico'  # Example city
    weather_data = collect_weather_data(api_key, city)

    # Process the collected weather data
    process_weather_data(weather_data, city)

              datetime  temperature  humidity      description
0  2024-08-05 09:00:00        30.15        72    moderate rain
1  2024-08-05 12:00:00        27.23        84    moderate rain
2  2024-08-05 15:00:00        25.23        91    moderate rain
3  2024-08-05 18:00:00        24.95        93       light rain
4  2024-08-05 21:00:00        24.77        94  overcast clouds


# **Web scraping using Python and BeautifulSoup**

- The script utilizes the BeautifulSoup library to perform web scraping, specifically designed to extract book information from a webpage.
- It fetches data such as book titles, prices, and availability, which are typical HTML content elements identified by their tags and classes.

In [None]:
# Install required libraries
!pip install requests beautifulsoup4 pandas

In [11]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [12]:
# Scrape book information from a website using BeautifulSoup.
def scrape_book_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = []
    for book in soup.find_all('article', class_='product_pod'):
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        availability = book.find('p', class_='instock availability').text.strip()

        books.append({
            'title': title,
            'price': price,
            'availability': availability
        })

    return pd.DataFrame(books)
# Process the scraped book data and store the results in a CSV file.
def process_book_data(books_df):
    books_df.to_csv('data_science_books.csv', index=False)

# Usage
url = 'http://books.toscrape.com/catalogue/shoe-dog-a-memoir-by-the-creator-of-nike_831/index.html'
books_df = scrape_book_data(url)

print(books_df.head())
process_book_data(books_df)

                                               title    price availability
0  The 10% Entrepreneur: Live Your Startup Dream ...  Â£27.55     In stock
1  The Third Wave: An Entrepreneurâs Vision of ...  Â£12.61     In stock
2                             If I Run (If I Run #1)  Â£49.97     In stock
3         Counted With the Stars (Out from Egypt #1)  Â£17.97     In stock
4               Like Never Before (Walker Family #2)  Â£28.77     In stock
