# Acquiring Your Data

## Introduction to Data Acquisition

Data acquisition is the first critical step in the data analytics process. It involves gathering raw data from multiple sources, then transforming it into a format suitable for further analysis and processing. Understanding this process is essential because the quality, structure, and relevance of the data directly influence the outcomes of any analytics project.

In the context of data analytics, data is typically gathered from the following kinds of sources:
- **Internal Data**: Data the organization already produces and holds in its own systems, such as log files.
- **APIs**: These allow you to retrieve data in real time from online services like weather information, financial markets, social media platforms, and more.
- **Online Datasets**: Repositories such as Kaggle, UCI Machine Learning Repository, and governmental portals offer pre-curated datasets in multiple formats (CSV, JSON, XML, Excel, etc.).
- **Web Scraping**: When the required data is available on websites, web scraping techniques can be employed to extract unstructured data directly from HTML pages.
- **Databases**: Structured data stored in relational databases (e.g., MySQL, PostgreSQL, SQLite) or NoSQL databases can be accessed using query languages like SQL or through specialized connectors.

There are four methods of acquiring data:
- collecting new data
- converting or transforming legacy data
- sharing or exchanging data
- purchasing data

<img src="https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/styles/side_image/public/thumbnails/image/DataAcquisitionVennDiagram.jpg?itok=zqYml3K-" width="340" height="340">

Source: https://www.usgs.gov/data-management/data-acquisition-methods

Together, these methods cover automated collection (e.g., of sensor-derived data), the manual recording of empirical observations, and obtaining existing data from other sources.

**Why Is Data Acquisition Important?**
- Foundation for Analysis: The accuracy and reliability of your insights largely depend on the quality of the input data. Inaccurate or poorly formatted data can lead to misleading conclusions.
- Diverse Data Sources: Modern analytics often require the integration of data from multiple sources. A well-designed data acquisition process ensures that disparate data can be merged, cleaned, and analyzed seamlessly.
- Automation and Reproducibility: Automating data acquisition workflows (using scripts, scheduled jobs, or ETL tools) not only saves time but also makes the data analytics process more reproducible and scalable.

**The Process of Data Acquisition**

**1. Identify Data Sources:** Determine where the necessary data resides. This could be external sources like APIs or web pages, or internal systems such as databases or log files.

**2. Extraction:** Use the appropriate tools and techniques to extract the data. For instance, employ Python libraries like requests for API calls, BeautifulSoup or Scrapy for web scraping, and pandas or SQL connectors for databases.

**3. Data Cleaning & Transformation:** Raw data is rarely analysis-ready. It often contains missing values, inconsistencies, or errors. Cleaning involves removing or imputing missing data, normalizing formats, and transforming the data to make it consistent across sources.

**4. Integration:** When data comes from multiple sources, it must be merged into a coherent dataset. This involves aligning different data formats, handling duplicates, and ensuring that the integrated data preserves its integrity.

**5. Storage:** Once cleaned and integrated, data is typically stored in a format suited to the analysis that follows, such as CSV files, JSON files, Excel tables, or SQL databases.
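The five steps above can be sketched end to end with pandas. This is a minimal illustration rather than a prescribed workflow: the two inlined sources, the column names (`customer_id`, `signup_date`, `amount`), and the output file name are all made up for the example.

```python
import io

import pandas as pd

# Steps 1-2. Identify and extract: two hypothetical sources, inlined so the
# example runs offline (in practice: a file, an API response, a database).
customers_csv = io.StringIO(
    "customer_id,name,signup_date\n"
    "1,Ana,2024-01-15\n"
    "2,Ben,\n"
    "2,Ben,\n"  # exact duplicate of the previous row
    "3,Chloé,2024-03-02\n"
)
orders_json = io.StringIO(
    '[{"customer_id": 1, "amount": 20.5},'
    ' {"customer_id": 3, "amount": 12.0},'
    ' {"customer_id": 3, "amount": 7.5}]'
)
customers = pd.read_csv(customers_csv)
orders = pd.read_json(orders_json)

# Step 3. Clean and transform: drop duplicate rows, parse dates
# (a missing date becomes NaT instead of an empty string).
customers = customers.drop_duplicates()
customers["signup_date"] = pd.to_datetime(customers["signup_date"])

# Step 4. Integrate: total the orders per customer, then left-join so that
# customers without orders are kept.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
dataset = customers.merge(totals, on="customer_id", how="left")
dataset["amount"] = dataset["amount"].fillna(0.0)

# Step 5. Store: plain CSV here; the file name is arbitrary.
dataset.to_csv("customers_with_totals.csv", index=False)
print(dataset)
```

A left join is used in step 4 so that a customer with no orders still appears in the result with an amount of zero; an inner join would silently drop that row.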

**Challenges in Data Acquisition**
- **Data Quality Issues**: Incomplete, inconsistent, or outdated data can skew analysis. It is crucial to validate data accuracy at the point of collection.
    - Missing values, duplicates, or inconsistent formatting (e.g., dates as MM/DD/YYYY vs. DD-MM-YYYY).
    - Example: A survey dataset where 30% of respondents skipped income-level fields.
    - Data may come in incompatible formats (e.g., API returns XML, but your tool expects JSON).

- **Data Volume**: As data sources grow, handling large datasets efficiently becomes a challenge, requiring techniques for optimizing memory usage and processing speed.
    - Handling large datasets (e.g., 10GB CSV files) may crash standard tools.

- **Legal and Ethical Concerns**: Some data sources have strict usage policies or privacy restrictions. It is important to adhere to legal guidelines (e.g., GDPR) and respect website terms when scraping data.
    - GDPR/CCPA Compliance: Ensure personal data is anonymized.
    - Web Scraping Ethics: Respect robots.txt, avoid overloading servers.

- **Integration Complexity**: Merging data from multiple formats and sources can lead to complications in maintaining data consistency and resolving conflicts between different data sets.
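For the data volume challenge, one common technique is to process a file in chunks instead of loading it all at once. The sketch below assumes a CSV with hypothetical `region` and `sales` columns; a small in-memory buffer stands in for the large file so the example runs anywhere.

```python
import io

import pandas as pd

# Stand-in for a file too large to load in one go. A real call would pass
# a file path instead of this buffer.
big_csv = io.StringIO("region,sales\n" + "north,10\nsouth,5\n" * 5000)

totals = {}
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only 1000 rows are held in memory at a time.
for chunk in pd.read_csv(big_csv, chunksize=1000):
    partial = chunk.groupby("region")["sales"].sum()
    for region, value in partial.items():
        totals[region] = totals.get(region, 0) + int(value)

print(totals)
```

This pattern works for aggregations that can be combined across chunks (sums, counts, minima, maxima). Statistics such as the median need a different approach.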

**Types of Data Sources**

- **Structured Data:**
    - Definition: Organized in predefined formats (tables, rows, columns).
    - Examples:
        - Relational databases (MySQL, PostgreSQL).
        - CSV/Excel files.
    - Pros: Easy to query and analyze.
    - Cons: Limited flexibility for complex/nested data.

- **Semi-Structured Data:**
    - Definition: Loosely organized with tags or markers (no strict schema).
    - Examples:
        - JSON (API responses), XML (web feeds), log files.
    - Pros: Flexible for hierarchical/nested data.
    - Cons: Requires parsing to extract meaning (e.g., nested JSON keys).

- **Unstructured Data:**
    - Definition: No predefined format; often text-heavy or multimedia.
    - Examples:
        - Social media posts, images, audio files, PDFs.
    - Pros: Rich in insights (e.g., sentiment from text).
    - Cons: Requires advanced tools (NLP, computer vision).
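To illustrate the parsing step that semi-structured data requires, the sketch below flattens nested JSON-like records into a table with `pandas.json_normalize`. The records and their keys are invented for the example.

```python
import pandas as pd

# Hypothetical API-style records: "user" is nested, and the second record
# has no "city" key at all (no strict schema).
records = [
    {"id": 1, "user": {"name": "Ana", "city": "Lisbon"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Ben"}, "tags": []},
]

# Nested dictionaries become dotted column names such as "user.name";
# keys missing from a record become NaN in that row.
flat = pd.json_normalize(records)
print(flat)
```

Lists such as `tags` are left as Python lists inside a single column; turning them into rows would take an extra step (for example `DataFrame.explode`).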

## Working with Flat Files

Flat files store data in plain text or tabular formats without complex hierarchies. They are widely used for data exchange due to their simplicity and compatibility. Below is a comparison of common formats:

<table><thead><tr><th><strong>Format</strong></th><th><strong>Structure</strong></th><th><strong>Pros</strong></th><th><strong>Cons</strong></th><th><strong>Use Cases</strong></th></tr></thead><tbody><tr><td><strong>CSV</strong></td><td>Comma-separated values</td><td>Lightweight, universal support</td><td>No data types, no hierarchy</td><td>Exporting SQL tables, raw data</td></tr><tr><td><strong>Excel</strong></td><td>Spreadsheets (rows/columns)</td><td>Supports formulas, multiple sheets</td><td>Proprietary, slow with large data</td><td>Manual data entry, reporting</td></tr><tr><td><strong>JSON</strong></td><td>Key-value pairs (nested)</td><td>Hierarchical, flexible schema</td><td>Verbose, harder to parse</td><td>APIs, web data</td></tr><tr><td><strong>XML</strong></td><td>Tag-based markup</td><td>Standardized, supports metadata</td><td>Bulky syntax, complex parsing</td><td>Legacy systems, config files</td></tr></tbody></table>
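The "no data types" drawback of CSV listed in the table can be seen in a short round trip. In this sketch (the `zip_code` and `visits` columns are made up), a text column of digit strings comes back as integers unless the reader is told otherwise.

```python
import io

import pandas as pd

df = pd.DataFrame({"zip_code": ["01234", "98765"], "visits": [3, 7]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# CSV stores only text, so pandas has to guess each column's type.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the type explicitly keeps the leading zero.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"zip_code": str})

print(naive["zip_code"].tolist())
print(typed["zip_code"].tolist())
```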

### Text encoding: ASCII, Unicode, UTF-8

- [The Absolute Minimum Every Software Developer Must Know About Unicode in 2023](https://tonsky.me/blog/unicode/)
- [Unicode is harder than you think](https://mcilloni.ovh/2023/07/23/unicode-is-hard/)

Text encoding is the process of converting characters (letters, numbers, symbols) into a sequence of bytes that computers can store, process, and transmit. Since computers fundamentally operate with binary data, encoding serves as the bridge between human-readable text and machine-readable code.

The ASCII encoding has 128 characters, only 95 of which are printable. The good news about ASCII is that it's the lowest common denominator of most data exchange. The bad news is that it doesn't begin to handle the complexities of the many alphabets and writing systems of the world. Reading files using ASCII encoding is almost certain to cause trouble and throw errors on character values that it doesn't understand, whether it's a German ü, a Portuguese ç, or something from almost any language other than English.

One way to mitigate this confusion is Unicode. The Unicode encoding called UTF-8 accepts the basic ASCII characters without any change but also allows an almost unlimited set of other characters and symbols according to the Unicode standard.

Because of its flexibility, UTF-8 was used in more than 85% of web pages served at the time I wrote this chapter, which means that your best bet for reading text files is to assume UTF-8 encoding. If the files contain only ASCII characters, they'll still be read correctly, but you'll also be covered if other characters are encoded in UTF-8. The good news is that the Python 3 string data type was designed to handle Unicode by default.
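The relationship between ASCII and UTF-8 described above can be checked directly in Python. The sample strings are arbitrary; the only assumption is that the ü is the single precomposed code point U+00FC, which the escape sequence guarantees.

```python
plain = "Zurich"        # ASCII-only text
accented = "Z\u00fcrich"  # "Zürich", with ü as one code point (U+00FC)

# ASCII-only text produces identical bytes in both encodings.
same_bytes = plain.encode("utf-8") == plain.encode("ascii")

# In UTF-8, ü takes two bytes, so six characters become seven bytes.
encoded = accented.encode("utf-8")

# ASCII has no representation for ü, so encoding fails.
try:
    accented.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_bytes, len(accented), len(encoded), ascii_failed)
```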

Even with Unicode, there'll be occasions when your text contains values that can't be successfully decoded or encoded. Fortunately, the open function in Python accepts an optional errors parameter that tells it how to deal with encoding errors when reading or writing files. The default option is 'strict', which causes an error to be raised whenever an encoding error is encountered. Other useful options are 'ignore', which causes the character causing the error to be skipped; 'replace', which causes the character to be replaced by a marker character (U+FFFD, shown as �, when reading; ? when writing); and 'backslashreplace', which replaces the offending bytes with backslash escape sequences.

The following code creates a file that contains "ABC" followed by three non-ASCII bytes, which may be rendered differently depending on the encoding used to read them.

In [9]:
with open("out2.txt", "wb") as f:
    f.write(bytes([65, 66, 67, 255, 192, 193]))

In [5]:
! powershell cat out2.txt

ABC���


In [10]:
with open("out2.txt", encoding="utf-8") as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

The fourth byte, which has a value of 255 (0xff), can never start a valid UTF-8 sequence, so the 'strict' errors setting raises an exception. Now see how the other error options handle the same file, keeping in mind that each of the last three bytes is invalid as UTF-8:

In [12]:
with open("out2.txt", errors="ignore", encoding="utf-8") as f:
    print(f.read())

ABC


In [13]:
with open("out2.txt", errors="replace", encoding="utf-8") as f:
    print(f.read())

ABC���


In [14]:
with open("out2.txt", errors="backslashreplace", encoding="utf-8") as f:
    print(f.read())

ABC\xff\xc0\xc1


If you want any problem characters to disappear, 'ignore' is the option to use. The 'replace' option only marks the place occupied by each invalid byte, and options such as 'backslashreplace' preserve the invalid bytes as escape sequences without interpreting them.

When no encoding is passed to open, Python uses the result of locale.getpreferredencoding(False) as the default encoding for file operations. On most Western Windows installations, this returns "cp1252", but the actual default varies with the system's locale settings, which is why the same file can read differently on another machine.

In [15]:
import locale

print(locale.getpreferredencoding(False))

cp1252


In [16]:
with open("out2.txt") as f:
    print(f.read())

ABCÿÀÁ


In [17]:
# remove the file
! powershell rm out2.txt

### Reading and Writing Data with pandas

In [1]:
import numpy as np
import pandas as pd

The **pandas I/O API** is a set of top-level `reader` functions, accessed like `pandas.read_csv()`, that generally return a pandas object. The corresponding `writer` functions are object methods, accessed like `DataFrame.to_csv()`. Below is a table containing available readers and writers.

<table class="table">
<colgroup>
<col style="width: 12.0%">
<col style="width: 40.0%">
<col style="width: 24.0%">
<col style="width: 24.0%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Format Type</p></th>
<th class="head"><p>Data Description</p></th>
<th class="head"><p>Reader</p></th>
<th class="head"><p>Writer</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></p></td>
<td><p><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></p></td>
<td><p><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p>Fixed-Width Text File</p></td>
<td><p><a class="reference internal" href="#io-fwf-reader"><span class="std std-ref">read_fwf</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://www.json.org/">JSON</a></p></td>
<td><p><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></p></td>
<td><p><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></p></td>
<td><p><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></p></td>
<td><p><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/LaTeX">LaTeX</a></p></td>
<td><p>NA</p></td>
<td><p><a class="reference internal" href="#io-latex"><span class="std std-ref">Styler.to_latex</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p><a class="reference external" href="https://www.w3.org/standards/xml/core">XML</a></p></td>
<td><p><a class="reference internal" href="#io-read-xml"><span class="std std-ref">read_xml</span></a></p></td>
<td><p><a class="reference internal" href="#io-xml"><span class="std std-ref">to_xml</span></a></p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p>Local clipboard</p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></p></td>
<td><p><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></p></td>
<td><p><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="http://opendocumentformat.org">OpenDocument</a></p></td>
<td><p><a class="reference internal" href="#io-ods"><span class="std std-ref">read_excel</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://orc.apache.org/">ORC Format</a></p></td>
<td><p><a class="reference internal" href="#io-orc"><span class="std std-ref">read_orc</span></a></p></td>
<td><p><a class="reference internal" href="#io-orc"><span class="std std-ref">to_orc</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></p></td>
<td><p><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></p></td>
<td><p><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></p></td>
<td><p><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SPSS">SPSS</a></p></td>
<td><p><a class="reference internal" href="#io-spss-reader"><span class="std std-ref">read_spss</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></p></td>
</tr>
<tr class="row-even"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google BigQuery</a></p></td>
<td><p>read_gbq</p></td>
<td><p>to_gbq</p></td>
</tr>
</tbody>
</table>
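The reader/writer pairing in the table can be seen in a short round trip. This is a minimal sketch with made-up data; in-memory strings are used instead of files so that nothing is written to disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"city": ["Lisbon", "Oslo"], "temp_c": [21.5, 9.0]})

# Writers are DataFrame methods. Called without a path, to_csv and
# to_json return the serialized text instead of writing a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Readers are top-level functions that build a new DataFrame.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(csv_text)
print(json_text)
print(from_json)
```

In real use the buffer is usually replaced by a file path or URL, for example `pd.read_csv("data.csv")` with a hypothetical file name.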

Online datasets are pre-collected data available in various formats and hosted on platforms dedicated to data sharing.

## Collecting Data from APIs

**Objective:** Fetch data from REST APIs and parse results into pandas.

**Topics:**
- REST API fundamentals (endpoints, methods, status codes)
- Authentication: API keys, OAuth 2.0
- Using the requests library: GET/POST, headers, pagination
- Handling rate limits and error codes

**Tutorial:**
- Fetch weather data from the OpenWeatherMap API.
- Paginate through the GitHub Issues API and combine results.

**Exercise:** Build a DataFrame of trending YouTube videos using the YouTube Data API.
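As a starting point for the pagination topic, the sketch below separates the page loop from the HTTP call. The function names `github_issues_page` and `collect_pages` are hypothetical helpers written for this example; the endpoint and the `state`, `per_page`, and `page` parameters follow the GitHub REST API for listing repository issues. The `requests` import is kept inside the function so the loop can be tried without network access.

```python
import pandas as pd


def github_issues_page(owner, repo, page, per_page=100, token=None):
    """Fetch one page of issues; returns (status_code, list_of_dicts)."""
    import requests  # third-party library, imported only when a real call is made

    headers = {"Accept": "application/vnd.github+json"}
    if token:  # optional personal access token for higher rate limits
        headers["Authorization"] = f"Bearer {token}"
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers=headers,
        params={"state": "all", "per_page": per_page, "page": page},
        timeout=10,
    )
    items = response.json() if response.ok else []
    return response.status_code, items


def collect_pages(get_page, max_pages=50):
    """Call get_page(1), get_page(2), ... until an empty page is returned."""
    rows = []
    for page in range(1, max_pages + 1):
        status, items = get_page(page)
        if status != 200:
            # Covers rate limiting (403/429) and other error codes.
            raise RuntimeError(f"page {page} failed with HTTP {status}")
        if not items:  # an empty page signals the end of the results
            break
        rows.extend(items)
    return pd.DataFrame(rows)


# Offline demonstration with a fake two-page API.
fake_api = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
issues = collect_pages(lambda page: (200, fake_api.get(page, [])))
print(issues)
```

Against the real API, the same loop would be driven by something like `collect_pages(lambda p: github_issues_page("pandas-dev", "pandas", p))`. The `max_pages` cap is a deliberate guard against runaway loops and rate-limit exhaustion.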

## Web Scraping Basics

**Objective:** Extract structured data from websites.

**Topics:**
- HTML/CSS basics: tags, classes, IDs
- Ethical scraping: robots.txt, rate limiting
- Tools: BeautifulSoup, requests, pandas.read_html()

**Tutorial:**
- Scrape Wikipedia tables into DataFrames with pd.read_html().
- Use BeautifulSoup to extract product prices from an e-commerce site.

**Exercise:** Scrape real estate listings and calculate average prices.
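For the ethical scraping topic, Python's standard library can check a site's robots.txt rules before any page is requested. The sketch below parses an inlined, made-up robots.txt so that it runs offline; the user agent name `course-bot` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt. For a real site, call set_url(".../robots.txt")
# followed by read() instead of parse().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("course-bot", "https://example.com/products")
blocked = parser.can_fetch("course-bot", "https://example.com/private/data")
delay = parser.crawl_delay("course-bot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

A scraper would call `can_fetch` before each request and sleep for at least `delay` seconds between requests when a crawl delay is declared.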

## Working with Databases

Many organizations store data in databases, which can be queried to extract exactly the information needed.

**Types of Databases**
- Relational Databases: Store data in tables and use SQL for querying (e.g., MySQL, PostgreSQL, SQLite).
- NoSQL Databases: Designed for semi-structured or unstructured data with flexible schemas (e.g., MongoDB, Cassandra).
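A minimal sketch of querying a relational database into pandas, using an in-memory SQLite database so that no server is needed. The `sales` table and its columns are invented for the example.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)
conn.commit()

# Let the database filter and aggregate, so that only the rows needed
# travel into pandas. The ? placeholder is filled from params.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= ?
    GROUP BY region
    ORDER BY region
"""
totals = pd.read_sql_query(query, conn, params=(50,))
conn.close()

print(totals)
```

Passing values through `params` rather than formatting them into the SQL string avoids quoting mistakes and SQL injection.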

## Big Data & Cloud Storage

**Objective:** Access large datasets from cloud storage.

**Topics:**
- Cloud storage: AWS S3, Google Cloud Storage
- Parquet/Feather formats for efficient I/O
- Using boto3 to read from S3

**Tutorial:**
- Load a 1GB Parquet file from S3 using s3fs and pd.read_parquet().
- Compare read speeds for CSV vs. Parquet.

## End-to-End Data Acquisition Pipeline

**Objective:** Combine multiple data sources into a unified dataset.

**Requirements:**
- Scrape product data from a website.
- Fetch complementary pricing data via API.
- Merge with historical sales data from a SQL database.
- Validate, clean, and export to Parquet.
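The merge and validation steps of such a pipeline might look like the sketch below. The three small DataFrames stand in for the scraped, API, and SQL sources, and every column name and value is made up. The Parquet export is left as a comment because it needs an optional engine (pyarrow or fastparquet) to be installed.

```python
import pandas as pd

# Stand-ins for the three sources (all values are invented).
products = pd.DataFrame(  # scraped from a website
    {"sku": ["A1", "B2", "C3"], "name": ["Lamp", "Desk", "Chair"]}
)
prices = pd.DataFrame(  # fetched from an API
    {"sku": ["A1", "B2"], "price": [9.99, 24.50]}
)
sales = pd.DataFrame(  # queried from a SQL database
    {"sku": ["A1", "A1", "B2"], "units": [3, 2, 1]}
)

# Reduce sales to one row per product before joining.
units = sales.groupby("sku", as_index=False)["units"].sum()

# validate="one_to_one" makes the merge raise an error if a key is
# duplicated on either side, catching accidental row multiplication.
merged = products.merge(
    prices, on="sku", how="left", validate="one_to_one"
).merge(units, on="sku", how="left", validate="one_to_one")

# Validate and clean: report products without a price, then drop them;
# products with no recorded sales get zero units.
missing_price = merged[merged["price"].isna()]
clean = merged.dropna(subset=["price"]).fillna({"units": 0})
clean = clean.astype({"units": "int64"})

# Export (requires pyarrow or fastparquet; file name is arbitrary):
# clean.to_parquet("products.parquet", index=False)
print(clean)
print("Missing price:", missing_price["sku"].tolist())
```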