
## Our Tools: Database Systems, Scripting Languages, and Interactive Notebooks

In the world of data mining, we rely on a combination of powerful tools to collect, store, manipulate, and analyze data effectively. Just as a skilled craftsperson needs various tools for different tasks, a data miner uses different types of software to handle various aspects of the data mining process. In this chapter, we'll explore three key categories of tools: database management systems, scripting languages, and interactive notebooks.

### Database Management Systems: Our Data Warehouses

At the heart of any data mining operation is the need to store and manage large amounts of data efficiently. This is where Relational Database Management Systems (RDBMSs) come into play. An RDBMS is like a highly organized warehouse for our data, providing a structured way to store, retrieve, and manage information.

In our journey, we'll be using SQLite as our example DBMS. You're already familiar with SQLite from previous chapters, but let's recap why RDBMSs like SQLite are crucial in data mining:

1.  DBMSs allow us to store vast amounts of data (on disk) in a structured manner. Instead of having information scattered across multiple files, we can organize it into tables with defined relationships. This structure makes it easier to understand and work with our data.
2.  DBMSs enforce rules that help maintain the **integrity** of our data. For example, we can specify that certain fields must always have a value, or that values in one table must correspond to values in another table.
3.  In real-world scenarios, multiple users or processes might need to access the data simultaneously. RDBMSs handle this **concurrent access**, ensuring that data remains consistent even when multiple operations are happening at the same time.
4.  Perhaps one of the most powerful features of an RDBMS is its ability to query data. Using **Structured Query Language (SQL)**, we can ask complex questions about our data and retrieve exactly the information we need. This is like having a super-efficient assistant who can instantly find and compile any information we request from our vast data warehouse.
5.  RDBMSs provide security mechanisms to control who can access the data and what they can do with it. This is crucial when working with sensitive information. (SQLite is an exception here: it has no user accounts of its own and relies on the file permissions of the database file.)
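
To make the integrity and querying points concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `players` and `scores` tables are hypothetical, invented only for this illustration. One detail worth knowing: SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical tables: every score must belong to an existing player
conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE scores (
        score_id  INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL REFERENCES players(player_id),
        score     INTEGER NOT NULL
    )
""")

conn.execute("INSERT INTO players (player_id, name) VALUES (1, 'Steve')")
conn.execute("INSERT INTO scores (player_id, score) VALUES (1, 95)")

# Integrity: a score for a player who does not exist is rejected
try:
    conn.execute("INSERT INTO scores (player_id, score) VALUES (99, 80)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Querying: join the two tables with SQL
rows = conn.execute("""
    SELECT p.name, s.score
    FROM players AS p
    JOIN scores AS s ON s.player_id = p.player_id
""").fetchall()

print(rejected, rows)
```

The `NOT NULL` and `REFERENCES` clauses are the "rules" described in item 2; the `SELECT ... JOIN` is a small instance of the querying described in item 4.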

While SQLite is excellent for learning and small to medium-sized projects, it's worth noting that in large-scale data mining operations, more robust DBMSs like PostgreSQL, MySQL, or Oracle might be used. However, the fundamental principles remain the same.

### Scripting Languages: Our Data Manipulation Tools

While DBMSs excel at storing and retrieving data, we often need more flexible tools for data manipulation, analysis, and visualization. This is where scripting languages come in. In our course, we'll be using Python, but other languages like R or Julia are also popular in data science.

Scripting languages serve several crucial roles in the data mining process:

1.  Scripting languages provide libraries (like Python's `sqlite3`) that allow us to connect to databases and execute SQL queries. This enables us to extract data from our DBMS for further processing.
2.  Real-world data is often messy. Scripting languages offer powerful tools for cleaning data (handling missing values, correcting inconsistencies) and transforming it into more useful formats.
3.   With libraries like Pandas in Python, we can perform complex statistical analyses, aggregate data, and uncover patterns and trends.
4.  Scripting languages, coupled with visualization libraries, allow us to create insightful graphs and charts to communicate our findings visually.
5.  We can use scripting languages to automate repetitive tasks in our data mining workflow, saving time and reducing the chance of human error.
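
As a rough sketch of the first role (pulling data out of a database into Python), the snippet below builds a tiny in-memory SQLite table and reads a query result into a Pandas DataFrame with `pd.read_sql_query`. The `players` table and its three rows are assumptions made for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical table with a few rows, held in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_name TEXT, category TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Steve", "Miner", 95), ("Alex", "Builder", 88), ("Notch", "Explorer", 92)],
)

# Run a parameterized SQL query and receive the result as a DataFrame
high_scorers = pd.read_sql_query(
    "SELECT player_name, score FROM players WHERE score >= ? ORDER BY score DESC",
    conn,
    params=(90,),
)
print(high_scorers)
```

Passing the threshold through `params` rather than formatting it into the SQL string is a habit worth keeping, since it avoids SQL injection when values come from outside the script.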

In Python, one of the most important libraries for data mining is Pandas. Pandas provides data structures (like DataFrames) and functions that make it easy to work with structured data. Here's a small sample of what we can do with Pandas:

First, let's start with a Comma-Separated Values (CSV) file:

In [None]:
%%writefile data.csv
player_name,category,score,blocks_mined,items_crafted
Steve,Miner,95,1000,50
Alex,Builder,88,500,200
Notch,Explorer,92,750,100
Herobrine,None,98,300,75
Enderman,Teleporter,85,100,None
Creeper,Destroyer,78,50,10
Villager,Trader,82,25,300
Zombie,Survivor,70,150,5
Skeleton,Archer,75,200,30
Pig,None,60,0,0
Ghast,Flyer,80,0,15
Wither,Boss,99,500,150

Overwriting data.csv


Now, look what we can do with just a few lines of Pandas code. First, let's load the data into a Pandas **DataFrame** and look at its **head** (the first few rows of the data).

In [None]:
import pandas as pd

# Read data from a CSV file
minecraft_df = pd.read_csv('data.csv')

# Display the first few rows
minecraft_df.head()

Unnamed: 0,player_name,category,score,blocks_mined,items_crafted
0,Steve,Miner,95,1000,50.0
1,Alex,Builder,88,500,200.0
2,Notch,Explorer,92,750,100.0
3,Herobrine,,98,300,75.0
4,Enderman,Teleporter,85,100,


Pandas also makes it easy to get an overview of our dataset with methods like `df.info()`.

In [None]:
minecraft_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   player_name    12 non-null     object 
 1   category       10 non-null     object 
 2   score          12 non-null     int64  
 3   blocks_mined   12 non-null     int64  
 4   items_crafted  11 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 608.0+ bytes


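This lets us see important information at a glance: for example, the data type of each column and the number of non-null values in each, from which we can spot missing values ("nulls"). Here, `category` has 10 non-null entries out of 12 and `items_crafted` has 11, so three values are missing in total.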
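
Once missing values have been spotted, a common next step is to count them per column and decide how to fill them. The sketch below uses four rows from the CSV above. Passing `na_values=["None"]` makes the treatment of the literal text `None` explicit rather than relying on Pandas defaults, and the fill values (`"Unknown"` and `0`) are illustrative choices, not the only reasonable ones.

```python
import io
import pandas as pd

# A few rows from data.csv, inlined so the sketch is self-contained
csv_text = """player_name,category,score,blocks_mined,items_crafted
Steve,Miner,95,1000,50
Herobrine,None,98,300,75
Enderman,Teleporter,85,100,None
Pig,None,60,0,0
"""

# Treat the literal text "None" as a missing value
df = pd.read_csv(io.StringIO(csv_text), na_values=["None"])

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill the gaps (illustrative choices) and restore an integer dtype
cleaned = df.copy()
cleaned["category"] = cleaned["category"].fillna("Unknown")
cleaned["items_crafted"] = cleaned["items_crafted"].fillna(0).astype(int)
print(cleaned)
```

Whether a missing `items_crafted` really means zero is a judgment call about the data; in some analyses it would be safer to leave the value missing or drop the row.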

### Interactive Notebooks: Our Communication and Provenance Tools

The final piece of our toolset is the interactive notebook, exemplified by Jupyter Notebooks (which we've been using throughout this class). While DBMSs store our data and scripting languages help us process it, interactive notebooks serve two crucial roles: communication and provenance.

1.  **Communication.** Jupyter Notebooks allow us to combine code, its output, visualizations, and narrative text in a single document. This makes it an excellent tool for sharing our data mining process and results with others. We can explain our methodology, show our code, and present our findings all in one place.
2.  **Provenance.** In data mining, it's crucial to be able to trace how we arrived at our conclusions. Jupyter Notebooks provide a chronological record of our data mining process. Each cell in a notebook can be run and re-run, allowing us (or others) to reproduce our results step by step. This reproducibility is a key principle in scientific computing and data analysis.


Here's how a typical data mining workflow might look in a Jupyter Notebook:

1.  We start with a markdown cell explaining the purpose of our analysis.
2.  We then have a code cell that connects to our database and extracts some data.
3.  The next few cells might clean and transform the data, with markdown cells explaining our decisions.
4.  We might then have cells that perform analysis and create visualizations.
5.  Finally, we'd have markdown cells that interpret our results and draw conclusions.

This entire process is documented in a single, shareable file that others can read, run, and even modify to extend our analysis.

As we progress through this chapter, you'll see how these three types of tools - DBMSs, scripting languages, and interactive notebooks - work together seamlessly in the data mining process. We'll use our DBMS (SQLite) to store and query our data, our scripting language (Python) to manipulate and analyze it, and Jupyter Notebooks to document our process and communicate our findings.

## Data Integration: ETL and ELT Processes

The first step of data mining is to get our data into our SQL database. Doing so generally requires a combination of the tools (SQL and scripting languages) that we just talked about. Imagine we're helping Blocky Betty, the owner of "Pixelated Pickaxes," a thriving online store selling Minecraft-themed merchandise. Betty wants to analyze her business performance, but her data is scattered across various sources. She needs our help to bring it all together for analysis. Here's what we're dealing with:

1.  Sales data from her e-commerce platform in CSV files
2.  Customer reviews stored in JSON format from her website
3.  Inventory data in a MySQL transactional database
4.  Shipping information in an Excel spreadsheet

Our goal is to integrate all this data into a single SQLite database that Betty can use for analysis. This scenario is a perfect example of the challenges in data integration, and we'll use it to explore two common approaches: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform).

Let's dive into how we might tackle this data integration challenge using both ETL and ELT approaches:

### ETL: Extract, Transform, Load

In the ETL process, we'll gather the data, process it, and then load it into our SQLite database. Here's how it might work for Blocky Betty's data:

1.  **Extract**: In this step, we gather all the data from its various sources:
    -   We download the CSV files containing sales data from Betty's e-commerce platform.
    -   We use an API or web scraping tool to collect the customer reviews in JSON format from her website.
    -   We connect to the MySQL database and extract the inventory data using SQL queries.
    -   We open and read the Excel spreadsheet containing shipping information.

    At this point, we have all of Betty's raw data, but it's in different formats and might have inconsistencies or quality issues.
2.  **Transform**: Now we process the extracted data to make it consistent and ready for analysis. In ETL, we might do much of this in Python/Pandas (since we haven't yet loaded the data into an RDBMS). This might involve:
    -   *Cleaning the data.* We might remove duplicate orders from the sales data, correct any obvious errors in the inventory counts, or standardize shipping addresses.
    -   *Converting formats.* We ensure all dates are in the same format across all datasets. We might need to parse the JSON data into a tabular format.
    -   *Combining data.* We could merge the sales data with the shipping information, using order numbers as a common key.
    -   *Calculating new values.* We might compute the profit for each sale by combining the price from the sales data with the cost from the inventory data.
    -   *Filtering.* We might remove any cancelled orders or out-of-stock items.
    -   *Aggregating.* We could summarize daily sales into monthly totals.

    After this step, we have a clean, consistent dataset that's ready for analysis.
3.  **Load**: Finally, we load our processed data into the SQLite database:
    -   We create the necessary tables in the SQLite database to hold our transformed data.
    -   We insert the processed data into these tables.
    -   We might create indexes on frequently-queried columns to improve performance.

    Now, all of Betty's data is in one place, cleaned and ready for analysis.
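
The three ETL steps can be sketched in a few lines of Pandas and `sqlite3`. Everything specific here is an assumption for illustration: the two small DataFrames stand in for Betty's sales CSV and shipping spreadsheet, and the column names, carrier names, and `orders` table are invented.

```python
import sqlite3
import pandas as pd

# Extract: small stand-ins for the sales CSV and the shipping spreadsheet
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["06/01/2023", "06/02/2023", "06/02/2023", "07/05/2023"],
    "price": [12.99, 19.99, 19.99, 14.99],
    "status": ["complete", "complete", "complete", "cancelled"],
})
shipping = pd.DataFrame({
    "order_id": [1, 2, 3],
    "carrier": ["BlockPost", "EnderExpress", "BlockPost"],
})

# Transform (in Pandas, before anything touches the database)
# Convert dates from MM/DD/YYYY to ISO format
sales["order_date"] = pd.to_datetime(
    sales["order_date"], format="%m/%d/%Y"
).dt.strftime("%Y-%m-%d")
sales = sales.drop_duplicates(subset="order_id")   # remove the duplicate order
sales = sales[sales["status"] != "cancelled"]      # filter out cancelled orders
orders = sales.merge(shipping, on="order_id", how="left")  # combine on the shared key

# Load: write the cleaned table into SQLite and index a frequently queried column
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False, if_exists="replace")
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

loaded = conn.execute(
    "SELECT order_id, order_date, carrier FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Note that only the cleaned result reaches the database; the duplicate and the cancelled order are gone for good, which is the main trade-off against ELT.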

### ELT: Extract, Load, Transform

In the ELT process, we change the order of operations. Here's how it might look:

1.  **Extract**: This step is the same as in ETL. We gather all the raw data from Betty's various sources.
2.  **Load**: Instead of transforming the data first, we load it directly into our SQLite database:
    -   We create tables in the SQLite database to hold the raw data from each source.
    -   We load the CSV, JSON, MySQL data, and Excel data into these tables, preserving the original format as much as possible.
    -   At this point, our SQLite database contains all of Betty's raw, unprocessed data.
3.  **Transform**: Now we perform our transformations within the SQLite database. This time, we can use SQL rather than Python/Pandas.
    -   We use SQL queries to clean the data, removing duplicates and correcting errors.
    -   We create views that combine data from different tables, joining sales data with shipping information, for example.
    -   We use SQL functions to standardize formats, like converting all dates to a consistent format.
    -   We create new tables or views that contain calculated values, like profit per sale.

    The advantage here is that we always have access to the original, raw data in our SQLite database, and we can create different transformations for different analytical needs.
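
For comparison, here is a minimal ELT sketch using only `sqlite3`. The raw rows are loaded untouched (duplicate included), and the cleaning happens afterwards in SQL through a view. The `raw_sales` table, its columns, and the idea of storing `cost` alongside `price` are simplifying assumptions; in Betty's case the cost would come from the inventory data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: the raw table mirrors the source, duplicates and all
conn.execute(
    "CREATE TABLE raw_sales (order_id INTEGER, order_date TEXT, price REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?, ?)",
    [
        (1, "2023-06-01", 12.99, 5.00),
        (2, "2023-06-02", 19.99, 8.00),
        (2, "2023-06-02", 19.99, 8.00),  # duplicate row, kept in the raw table
    ],
)

# Transform: a view that removes duplicates and adds a calculated column
conn.execute("""
    CREATE VIEW clean_sales AS
    SELECT DISTINCT order_id, order_date, price,
           ROUND(price - cost, 2) AS profit
    FROM raw_sales
""")

raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean_rows = conn.execute(
    "SELECT order_id, profit FROM clean_sales ORDER BY order_id"
).fetchall()
print(raw_count, clean_rows)
```

Because `clean_sales` is a view, the raw table stays intact and a second view with different rules could be defined over the same data.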

### Why Choose ETL or ELT?

The choice between ETL and ELT often depends on the specific needs of a project:

**ETL** might be preferred when:
  -   The source data needs significant cleaning or processing before it can be usefully stored. For example, if we're dealing with sensitive information that needs to be anonymized before storage.
  -   The target system (i.e., the database server) has limited processing power, so it's better to transform the data before loading it.
  -   There are strict data quality requirements that need to be met before data can enter the target system.
  -   We're dealing with relatively small amounts of data that can be processed efficiently before loading. Pandas works best with datasets that fit in a computer's main memory (often 16 GB or 32 GB on a typical machine), while relational databases can handle much larger datasets.

**ELT** might be chosen when:
-   We're dealing with very large volumes of data that would be time-consuming to transform before loading.
-   The target system (like a modern data lake or cloud data warehouse) has powerful processing capabilities that can handle transformations efficiently.
-   We want to preserve the raw data for future use cases we haven't thought of yet.
-   We need the flexibility to transform the same data in different ways for different analyses.

In our Minecraft analogy, ETL would be like processing all player data before adding it to our central database, ensuring everything is clean and consistent from the start. ELT would be like dumping all our raw player data into a massive data lake, then using powerful tools to process it as needed for different analyses, always keeping the original data intact.

Both ETL and ELT are valuable approaches in data integration. As we progress in our data mining journey, we'll explore more specific techniques for each step of these processes. The key is to understand the flow of data from its raw state to a form that's ready for analysis, regardless of the exact order of operations.

## Delta Load: Updating Our Data Efficiently

Imagine Blocky Betty's Pixelated Pickaxes store has been running for a while, and we've already integrated her initial data into our SQLite database. But Betty's store is active every day, with new sales, reviews, and inventory changes. We don't want to reload all of her data every time we need to update our database. This is where the concept of delta load comes in.

**Delta load**, also called incremental load, is a data integration technique in which only new or changed data is processed and added to the target database. Instead of reloading everything each time, we deal only with the "delta": the difference between what's already in our database and the new data coming in.

Think of it like this: Imagine you're organizing a chest in Minecraft. You've already sorted all your items, but after a mining expedition, you have new resources. Instead of emptying the entire chest and re-sorting everything, you just add the new items to their appropriate places. That's essentially what delta load does for our data.

To implement delta load, we typically follow these steps:

1.  *Identify New or Changed Data*. We need a way to determine what data is new or has changed since our last update. This often involves using timestamps or unique identifiers.
2.  *Extract the Delta*. We only extract this new or changed data from our source systems.
3.  *Transform (if necessary)*. We apply any needed transformations to this delta data.
4.  *Update the Target*. We add the new data to our target database and update any changed records.

Let's say Betty wants to update her sales data daily. Here's how we might approach this:

1.  We add a 'last_updated' timestamp to our sales table in the SQLite database.
2.  Each day, we check the e-commerce platform for any sales records with a timestamp later than our 'last_updated' time.
3.  We extract only these new records, process them as needed, and add them to our SQLite database.
4.  We update the 'last_updated' timestamp to the current time.

This way, we're only processing a small amount of data each day, rather than Betty's entire sales history.

Delta load is crucial for dealing with large, constantly changing datasets efficiently. It reduces processing time and resources, allowing for more frequent updates to our integrated data.
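
A delta load along these lines can be sketched with `sqlite3`. Several details are assumptions made for the example: the source system is simulated by a Python list, the "last updated" time is kept in a separate hypothetical `load_log` table, and the watermark is advanced to the newest timestamp actually seen rather than to the current clock time (which avoids skipping records that arrive while the load is running).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE load_log (table_name TEXT PRIMARY KEY, last_updated TEXT)")

# State after an earlier load: two orders, watermark at midnight on June 30
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 12.99, "2023-06-28 09:00:00"),
    (2, 19.99, "2023-06-29 16:30:00"),
])
conn.execute("INSERT INTO load_log VALUES ('sales', '2023-06-30 00:00:00')")

# Stand-in for the source system (e.g. the e-commerce platform)
source_rows = [
    (1, 12.99, "2023-06-28 09:00:00"),  # unchanged
    (2, 21.99, "2023-06-30 10:15:00"),  # changed since the last load
    (3, 14.99, "2023-06-30 11:00:00"),  # new
]

def delta_load(conn, source_rows):
    """Insert or update only the rows newer than the stored watermark."""
    (watermark,) = conn.execute(
        "SELECT last_updated FROM load_log WHERE table_name = 'sales'"
    ).fetchone()
    # ISO-style timestamps sort correctly as plain strings
    delta = [row for row in source_rows if row[2] > watermark]
    conn.executemany(
        "INSERT OR REPLACE INTO sales (order_id, amount, updated_at) VALUES (?, ?, ?)",
        delta,
    )
    if delta:
        newest = max(row[2] for row in delta)
        conn.execute(
            "UPDATE load_log SET last_updated = ? WHERE table_name = 'sales'",
            (newest,),
        )
    conn.commit()
    return len(delta)

first_run = delta_load(conn, source_rows)   # processes the changed and the new row
second_run = delta_load(conn, source_rows)  # nothing newer than the watermark
print(first_run, second_run)
```

In a real pipeline the filtering would happen in the query sent to the source system, so that unchanged rows are never transferred at all.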

## The Role of APIs in Data Integration

Now, let's talk about another important concept in modern data integration: APIs.

**API** stands for Application Programming Interface. In simple terms, it's a set of rules and protocols that allows different software applications to communicate with each other. Think of an API as a waiter in a restaurant. You (the customer) don't go directly to the kitchen to order food. Instead, you tell the waiter what you want, and they bring it to you. Similarly, an API takes your request for data, goes to the database or service, and returns the data you need.

APIs play a crucial role in data integration for several reasons:

1.  *Real-time Data Access.* APIs often allow us to access data in real-time, rather than waiting for batch exports.
2.  *Standardized Data Format.* APIs typically return data in standard formats like JSON or XML, which are easy for computers to process.
3.  *Selective Data Retrieval.* Instead of getting all the data from a system, APIs often allow us to request only the specific data we need.
4.  *Automation.* We can write scripts that automatically fetch data from APIs at regular intervals, keeping our integrated data up-to-date.

Let's say Blocky Betty decides to start selling her Minecraft merchandise on a popular online marketplace. This marketplace provides an API for sellers to access their sales data. Here's how we might use this API in our data integration process:

1.  We register for API access and receive an API key (like a special password for our application).
2.  We write a Python script that sends a request to the marketplace API every hour, asking for any new sales data.
3.  The API returns this data in JSON format, which our script can easily process.
4.  We transform this data if needed and add it to our SQLite database.
5.  We can set this script to run automatically, ensuring Betty always has up-to-date sales data from the marketplace.

By using the API, we can keep Betty's sales data current without her having to manually export and send us data files. It's efficient, timely, and reduces the chance of human error.

### APIs and Delta Load Together

APIs and delta load strategies often work hand in hand. Many APIs allow you to request data changes since a specific time, which is perfect for implementing a delta load approach. For example, we could ask the marketplace API, "Give me all sales data since 2023-06-30 14:00:00", and only process those new records.
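
As a sketch of how the two ideas combine, the code below builds a request URL carrying a "since" timestamp and turns a JSON response into rows ready for insertion. The endpoint, the `since` parameter name, and the response fields are all hypothetical; a real marketplace API would document its own. No network request is made here, and the sample payload is written by hand.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; example.com is a reserved placeholder domain
BASE_URL = "https://marketplace.example.com/api/sales"

def build_sales_url(since):
    """Return a URL asking only for sales recorded after `since`."""
    return f"{BASE_URL}?{urlencode({'since': since})}"

def parse_sales(payload):
    """Convert a JSON response body into (order_id, amount, sold_at) tuples."""
    records = json.loads(payload)["sales"]
    return [(r["order_id"], r["amount"], r["sold_at"]) for r in records]

url = build_sales_url("2023-06-30 14:00:00")

# Hand-written stand-in for what such an API might return
sample_payload = (
    '{"sales": [{"order_id": 101, "amount": 12.99, "sold_at": "2023-06-30 15:20:00"}]}'
)
rows = parse_sales(sample_payload)

print(url)
print(rows)
```

The tuples returned by `parse_sales` have the shape expected by `sqlite3`'s `executemany`, so they could be passed straight into an insert statement.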


## Data Collection Methods: Web Scraping

As Blocky Betty's Pixelated Pickaxes business grows, she's always on the lookout for new product ideas and pricing information. She's discovered a website called "Minecraft Marketplace Monitor" that regularly updates a list of top-selling Minecraft-related products across various online platforms. Betty wants to use this information to inform her product decisions, but the website doesn't offer an API or downloadable dataset. This is where web scraping comes in handy.

**Web scraping** is the process of automatically extracting data from websites. It's like having a robot assistant that can visit a webpage, read its content, and copy the specific information you need into a structured format. This technique is useful when data is publicly available on a website but not easily accessible through other means.

### The Target Website

Let's imagine the "Minecraft Marketplace Monitor" website has a simple table that looks like this:

In [5]:
%%writefile marketplace.html
<h1>Top Products</h1>
This is a website!
<table id="top-products">
  <tr>
    <th>Rank</th>
    <th>Product Name</th>
    <th>Category</th>
    <th>Average Price</th>
    <th>Trend</th>
  </tr>
  <tr>
    <td>1</td>
    <td>Diamond Pickaxe Keychain</td>
    <td>Accessories</td>
    <td>$12.99</td>
    <td>↑</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Creeper Face T-Shirt</td>
    <td>Clothing</td>
    <td>$19.99</td>
    <td>↓</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Minecraft Grass Block Mug</td>
    <td>Kitchenware</td>
    <td>$14.99</td>
    <td>→</td>
  </tr>
  <!-- More rows... -->
</table>

Overwriting marketplace.html


In [6]:
# view website
from IPython.display import HTML
HTML(filename='marketplace.html')

Rank,Product Name,Category,Average Price,Trend
1,Diamond Pickaxe Keychain,Accessories,$12.99,↑
2,Creeper Face T-Shirt,Clothing,$19.99,↓
3,Minecraft Grass Block Mug,Kitchenware,$14.99,→


### How to Scrape the Website

To scrape this website, Betty can use Python along with a popular library called BeautifulSoup. Here's a step-by-step explanation of how she could do it:

1.  *Send a request to the website*. First, Betty's script needs to access the webpage, just like a browser would.
2.  *Get the HTML content*. Once the webpage is accessed, the script downloads its HTML content.
3.  *Parse the HTML*. The BeautifulSoup library is used to parse the HTML, making it easy to navigate and search.
4.  *Locate the desired data*. Betty's script looks for the specific table containing the product information.
5.  *Extract the data*. The script goes through each row of the table, pulling out the relevant information.
6.  *Store the data*. Finally, the extracted data is stored in a structured format, like a CSV file or a database.

Here's a simplified version of what Betty's Python script might look like:

In [8]:
import pandas as pd
from bs4 import BeautifulSoup
import csv

# Open and read the local HTML file
with open('marketplace.html', 'r', encoding='utf-8') as file:
    content = file.read()

# Parse the HTML content
soup = BeautifulSoup(content, 'html.parser')

# Find the table with the product data
table = soup.find('table', id='top-products')

# Extract and store the data in a CSV file
with open('top_products.csv', 'w', newline='', encoding='utf-8') as file:  # utf-8 so the trend arrows are written correctly on every platform
    writer = csv.writer(file)
    writer.writerow(["Rank", "Product Name", "Category", "Average Price", "Trend"])

    for row in table.find_all('tr')[1:]:  # Skip the header row
        columns = row.find_all('td')
        if columns:
            rank = columns[0].text.strip()
            product = columns[1].text.strip()
            category = columns[2].text.strip()
            price = columns[3].text.strip()
            trend = columns[4].text.strip()
            writer.writerow([rank, product, category, price, trend])

print("Data has been scraped and saved to 'top_products.csv'")

# Load the CSV data into a Pandas DataFrame and display it
df = pd.read_csv('top_products.csv')
df.head()

Data has been scraped and saved to 'top_products.csv'


Unnamed: 0,Rank,Product Name,Category,Average Price,Trend
0,1,Diamond Pickaxe Keychain,Accessories,$12.99,↑
1,2,Creeper Face T-Shirt,Clothing,$19.99,↓
2,3,Minecraft Grass Block Mug,Kitchenware,$14.99,→


This script would create a CSV file containing the scraped data, which Betty could then use for her product research and decision-making.
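
Scraped values arrive as text, so prices like `$12.99` need converting before Betty can sort or average them. A minimal sketch, using the three scraped rows inlined as a DataFrame; the `price_usd` column name is an assumption for the example.

```python
import pandas as pd

# The scraped rows, inlined so the sketch is self-contained
products = pd.DataFrame({
    "Product Name": [
        "Diamond Pickaxe Keychain",
        "Creeper Face T-Shirt",
        "Minecraft Grass Block Mug",
    ],
    "Average Price": ["$12.99", "$19.99", "$14.99"],
})

# Strip the dollar sign and convert the text to a numeric column
products["price_usd"] = (
    products["Average Price"].str.replace("$", "", regex=False).astype(float)
)

# With numbers in place, sorting and summarizing work as expected
cheapest = products.sort_values("price_usd").iloc[0]["Product Name"]
print(cheapest)
print(products["price_usd"].max())
```

Setting `regex=False` matters here because `$` has a special meaning in regular expressions.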

### Ethical Considerations in Web Scraping

While web scraping can be a powerful tool, it's important to use it responsibly:

1.  Check the website's terms of service. Some websites prohibit scraping in their terms of use.
2.  Don't overload the server. Sending too many requests too quickly can strain the website's servers.
3.  Respect **robots.txt**. This file, served from the root of a website, tells automated clients which parts of the site they are asked not to crawl.
4.  Be mindful of copyright. Just because data is publicly visible doesn't always mean it's free to use for any purpose.
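
Python's standard library can check a `robots.txt` file for us. The sketch below parses a hand-written example file instead of downloading one, so the rules, the `BettyBot` user-agent name, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download the site's own file
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) scraper may fetch each URL
allowed = parser.can_fetch("BettyBot", "https://example.com/top-products")
blocked = parser.can_fetch("BettyBot", "https://example.com/private/data")

# Requested pause between requests, in seconds (None if the file does not set one)
delay = parser.crawl_delay("BettyBot")

print(allowed, blocked, delay)
```

A scraper that honors these results would skip any URL for which `can_fetch` returns `False` and call `time.sleep(delay)` between requests, which also addresses the second point about not overloading the server.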

By using web scraping responsibly, Betty can gather valuable market data to help grow her Pixelated Pickaxes business, ensuring she's always on top of the latest Minecraft merchandise trends.

## Data Collection Methods: Public Databases and APIs

As Blocky Betty's Pixelated Pickaxes continues to grow, she realizes she needs more data to make informed decisions. She's heard about public databases and APIs that could provide valuable information. Let's explore how Betty can use these resources to enhance her data mining efforts.

**Public databases** are collections of data that are freely available for anyone to access and use. They can be goldmines of information for data miners. These databases are often maintained by government agencies, research institutions, or other organizations committed to open data.

For example, let's imagine there's a public database called "Minecraft Server Stats" that collects anonymous data about Minecraft servers worldwide. This database could be valuable for Betty to understand player behavior and preferences.

### APIs: The Key to Accessing Data

An API (Application Programming Interface) is a set of protocols and tools for building software applications. In the context of data mining, APIs often serve as the gateway to accessing data from public databases or other web services.

#### How APIs Work

1.  **Request**: You send a request to a specific URL (endpoint) provided by the API.
2.  **Authentication**: Often, you need to include an API key to prove you have permission to access the data.
3.  **Response**: The API sends back the requested data, typically in a format like JSON or XML.
4.  **Processing**: You parse the received data and use it in your application or analysis.

### Getting Data Using an API

Let's walk through an example of how Betty could use the "Minecraft Server Stats" API to get data about popular server types.

#### Step 1: Obtain API Access

First, Betty would need to register on the Minecraft Server Stats website to get an API key. This key allows her to make requests to the API.

#### Step 2: Make an API Request

Betty can use the Python `requests` library to make an API call. Here's what that might look like:

```python
import requests

# API endpoint URL
url = "https://api.minecraftserverstats.com/v1/server-types"

# Parameters for the API request
params = {
    "limit": 10,  # Get top 10 server types
    "sort": "player_count"
}

# Headers including the API key
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE"
}

# Make the API request (the timeout keeps the script from waiting indefinitely)
response = requests.get(url, params=params, headers=headers, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()  # Parse the JSON response
    print("Data retrieved successfully!")
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")
```
    
#### Step 3: Parse the JSON Data

The API response is typically in JSON format. Here's an example of what the response might look like:

In [9]:
%%writefile minecraft_server.json
{
  "server_types": [
    {
      "type": "Survival",
      "player_count": 500000,
      "average_playtime": 120
    },
    {
      "type": "Creative",
      "player_count": 300000,
      "average_playtime": 90
    },
    {
      "type": "Minigames",
      "player_count": 200000,
      "average_playtime": 45
    }
  ]
}

Writing minecraft_server.json
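
Parsing this response takes only the standard `json` module. In the sketch below the JSON text is inlined as a string so the example runs on its own; with the file written above, `json.load` on the open file would give the same result.

```python
import json

# The example response, inlined so the sketch is self-contained
payload = """
{
  "server_types": [
    {"type": "Survival",  "player_count": 500000, "average_playtime": 120},
    {"type": "Creative",  "player_count": 300000, "average_playtime": 90},
    {"type": "Minigames", "player_count": 200000, "average_playtime": 45}
  ]
}
"""

# JSON objects become dicts and JSON arrays become lists
data = json.loads(payload)
server_types = data["server_types"]

# Ordinary Python is enough to explore the parsed structure
most_popular = max(server_types, key=lambda s: s["player_count"])
total_players = sum(s["player_count"] for s in server_types)

print(most_popular["type"], total_players)
```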


#### Step 4: Convert to a Pandas DataFrame
Now that we have the JSON data, we can easily convert it to a Pandas DataFrame for further analysis: