# Acquiring Your Data

## Introduction to Data Acquisition

Data acquisition is the first critical step in the data analytics process. It involves gathering raw data from multiple sources, then transforming it into a format suitable for further analysis and processing. Understanding this process is essential because the quality, structure, and relevance of the data directly influence the outcomes of any analytics project.

Data acquisition is the method of collecting, measuring, and analyzing information from various sources. In the context of data analytics, it means gathering data from:
- **Internal data**: Data your organization already holds, such as internal databases and log files.
- **APIs**: These allow you to retrieve data in real time from online services like weather information, financial markets, social media platforms, and more.
- **Online Datasets**: Repositories such as Kaggle, UCI Machine Learning Repository, and governmental portals offer pre-curated datasets in multiple formats (CSV, JSON, XML, Excel, etc.).
- **Web Scraping**: When the required data is available on websites, web scraping techniques can be employed to extract unstructured data directly from HTML pages.
- **Databases**: Structured data stored in relational databases (e.g., MySQL, PostgreSQL, SQLite) or NoSQL databases can be accessed using query languages like SQL or through specialized connectors.

There are four methods of acquiring data:
- collecting new data
- converting or transforming legacy data
- sharing or exchanging data
- purchasing data

<img src="https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/styles/side_image/public/thumbnails/image/DataAcquisitionVennDiagram.jpg?itok=zqYml3K-" width="340" height="340">

Source: https://www.usgs.gov/data-management/data-acquisition-methods

Together, these methods cover automated collection (e.g., of sensor-derived data), the manual recording of empirical observations, and obtaining existing data from other sources.

**Why Is Data Acquisition Important?**
- Foundation for Analysis: The accuracy and reliability of your insights largely depend on the quality of the input data. Inaccurate or poorly formatted data can lead to misleading conclusions.
- Diverse Data Sources: Modern analytics often require the integration of data from multiple sources. A well-designed data acquisition process ensures that disparate data can be merged, cleaned, and analyzed seamlessly.
- Automation and Reproducibility: Automating data acquisition workflows (using scripts, scheduled jobs, or ETL tools) not only saves time but also makes the data analytics process more reproducible and scalable.

**The Process of Data Acquisition**

**1. Identify Data Sources:** Determine where the necessary data resides. This could be external sources like APIs or web pages, or internal systems such as databases or log files.

**2. Extraction:** Use the appropriate tools and techniques to extract the data. For instance, employ Python libraries like requests for API calls, BeautifulSoup or Scrapy for web scraping, and pandas or SQL connectors for databases.

**3. Data Cleaning & Transformation:** Raw data is rarely analysis-ready. It often contains missing values, inconsistencies, or errors. Cleaning involves removing or imputing missing data, normalizing formats, and transforming the data to make it consistent across sources.

**4. Integration:** When data comes from multiple sources, it must be merged into a coherent dataset. This involves aligning different data formats, handling duplicates, and ensuring that the integrated data preserves its integrity.

**5. Storage:** Once cleaned and integrated, data is typically stored in formats that are optimal for analysis, such as CSV files, SQL databases, Excel tables, or JSON files.
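The five steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not a production pipeline: the two in-memory "sources", their field names, and the `"unknown"` placeholder are all made up for the example.

```python
import csv
import io
import json

# Step 1: identify sources. Two hypothetical sources, held in memory here.
csv_source = "id,name,city\n1,Ann,Oslo\n2,Bob,\n2,Bob,\n"
json_source = '[{"id": 1, "score": 9.5}, {"id": 2, "score": 7.0}]'

# Step 2: extraction. Parse each source with the matching reader.
people = list(csv.DictReader(io.StringIO(csv_source)))
scores = {row["id"]: row["score"] for row in json.loads(json_source)}

# Step 3: cleaning. Convert ids to int, fill the missing city,
# and drop the duplicate row by keying on the id.
cleaned = {}
for row in people:
    key = int(row["id"])
    cleaned[key] = {"id": key, "name": row["name"], "city": row["city"] or "unknown"}

# Step 4: integration. Attach the score from the second source.
merged = [dict(record, score=scores.get(record["id"])) for record in cleaned.values()]

# Step 5: storage. Write the integrated records as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "city", "score"])
writer.writeheader()
writer.writerows(merged)
stored_csv = buffer.getvalue()

print(stored_csv)
```

In a real project each step would usually live in its own function so that the workflow can be scheduled and rerun.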

**Challenges in Data Acquisition**
- **Data Quality Issues**: Incomplete, inconsistent, or outdated data can skew analysis. It is crucial to validate data accuracy at the point of collection.
    - Missing values, duplicates, or inconsistent formatting (e.g., dates as MM/DD/YYYY vs. DD-MM-YYYY).
    - Example: A survey dataset where 30% of respondents skipped income-level fields.
    - Data may come in incompatible formats (e.g., API returns XML, but your tool expects JSON).

- **Data Volume**: As data sources grow, handling large datasets efficiently becomes a challenge, requiring techniques for optimizing memory usage and processing speed.
    - Handling large datasets (e.g., 10GB CSV files) may crash standard tools.

- **Legal and Ethical Concerns**: Some data sources have strict usage policies or privacy restrictions. It is important to adhere to legal guidelines (e.g., GDPR) and respect website terms when scraping data.
    - GDPR/CCPA Compliance: Ensure personal data is anonymized.
    - Web Scraping Ethics: Respect robots.txt, avoid overloading servers.

- **Integration Complexity**: Merging data from multiple formats and sources can lead to complications in maintaining data consistency and resolving conflicts between different data sets.
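As a small illustration of the data quality issues listed above, the sketch below normalizes dates that arrive in the two formats mentioned (MM/DD/YYYY and DD-MM-YYYY) and measures the share of values that could not be parsed. The helper name `parse_date` and the sample values are hypothetical.

```python
from datetime import date, datetime

# Formats we expect to meet; the order matters when a string could match more than one.
KNOWN_FORMATS = ("%m/%d/%Y", "%d-%m-%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


raw_values = ["03/25/2024", "25-03-2024", "", "not a date"]
parsed = [parse_date(value) for value in raw_values]

# Share of values that are missing or unparseable.
missing_share = parsed.count(None) / len(parsed)
print(parsed, missing_share)
```

Recording the missing share at the point of collection makes it easier to decide later whether to impute, drop, or re-collect the affected fields.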

**Types of Data Sources**

- **Structured Data:**
    - Definition: Organized in predefined formats (tables, rows, columns).
    - Examples:
        - Relational databases (MySQL, PostgreSQL).
        - CSV/Excel files.
    - Pros: Easy to query and analyze.
    - Cons: Limited flexibility for complex/nested data.

- **Semi-Structured Data:**
    - Definition: Loosely organized with tags or markers (no strict schema).
    - Examples:
        - JSON (API responses), XML (web feeds), log files.
    - Pros: Flexible for hierarchical/nested data.
    - Cons: Requires parsing to extract meaning (e.g., nested JSON keys).

- **Unstructured Data:**
    - Definition: No predefined format; often text-heavy or multimedia.
    - Examples:
        - Social media posts, images, audio files, PDFs.
    - Pros: Rich in insights (e.g., sentiment from text).
    - Cons: Requires advanced tools (NLP, computer vision).
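To make the "requires parsing" point for semi-structured data concrete, here is a minimal sketch that flattens a nested JSON object into dotted column names. The `flatten` helper and the sample payload are hypothetical and only cover nested objects, not lists of objects.

```python
import json

# A made-up API response with one level of nesting.
response_text = '{"user": {"id": 7, "name": "Ann"}, "tags": ["a", "b"]}'


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


record = flatten(json.loads(response_text))
print(record)
```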

## Working with Flat Files

Flat files store data in plain text or tabular formats without complex hierarchies. They are widely used for data exchange due to their simplicity and compatibility. Below is a comparison of common formats:

<table><thead><tr><th><strong>Format</strong></th><th><strong>Structure</strong></th><th><strong>Pros</strong></th><th><strong>Cons</strong></th><th><strong>Use Cases</strong></th></tr></thead><tbody><tr><td><strong>CSV</strong></td><td>Comma-separated values</td><td>Lightweight, universal support</td><td>No data types, no hierarchy</td><td>Exporting SQL tables, raw data</td></tr><tr><td><strong>Excel</strong></td><td>Spreadsheets (rows/columns)</td><td>Supports formulas, multiple sheets</td><td>Proprietary, slow with large data</td><td>Manual data entry, reporting</td></tr><tr><td><strong>JSON</strong></td><td>Key-value pairs (nested)</td><td>Hierarchical, flexible schema</td><td>Verbose, harder to parse</td><td>APIs, web data</td></tr><tr><td><strong>XML</strong></td><td>Tag-based markup</td><td>Standardized, supports metadata</td><td>Bulky syntax, complex parsing</td><td>Legacy systems, config files</td></tr></tbody></table>
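The trade-offs in the table can be seen by writing the same two records in three of the formats. This sketch uses only the standard library; the records and the element names `people` and `person` are made up for the illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

# CSV: compact, but every value comes back as a string.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()
round_tripped = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: keeps numbers as numbers and allows nesting.
json_text = json.dumps(rows)

# XML: tag-based, with room for attributes as metadata.
root = ET.Element("people")
for row in rows:
    person = ET.SubElement(root, "person", id=str(row["id"]))
    person.text = row["name"]
xml_text = ET.tostring(root, encoding="unicode")

print(csv_text, json_text, xml_text, sep="\n")
```

Note that `round_tripped[0]["id"]` is the string `"1"`, while the JSON round trip preserves the integer.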

### Text encoding: ASCII, Unicode, UTF-8

- [The Absolute Minimum Every Software Developer Must Know About Unicode in 2023](https://tonsky.me/blog/unicode/)
- [Unicode is harder than you think](https://mcilloni.ovh/2023/07/23/unicode-is-hard/)

Text encoding is the process of converting characters (letters, numbers, symbols) into a sequence of bytes that computers can store, process, and transmit. Since computers fundamentally operate with binary data, encoding serves as the bridge between human-readable text and machine-readable code.

The ASCII encoding has 128 characters, only 95 of which are printable. The good news about ASCII encoding is that it’s the lowest common denominator of most data exchange. The bad news is that it doesn’t begin to handle the complexities of the many alphabets and writing systems of the world. Reading files using ASCII encoding is almost certain to cause trouble and throw errors on character values that it doesn’t understand, whether it’s a German ü, a Portuguese ç, or something from almost any language other than English.

One way to mitigate this confusion is Unicode. The Unicode encoding called UTF-8 accepts the basic ASCII characters without any change but also allows an almost unlimited set of other characters and symbols according to the Unicode standard.
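A quick way to see the relationship between ASCII and UTF-8 is to encode a few characters and count the bytes. The sample characters below are chosen only for illustration and are written as escapes so the example does not depend on how this file itself is encoded.

```python
# "A", u-umlaut, c-cedilla, and the euro sign.
samples = ["A", "\u00fc", "\u00e7", "\u20ac"]

# UTF-8 uses one byte for ASCII characters and more for everything else.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)

# The ASCII codec cannot represent characters outside its 128 code points.
try:
    "\u00fc".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```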

Because of its flexibility, UTF-8 was used in more than 85% of web pages served at the time I wrote this chapter, which means that your best bet for reading text files is to assume UTF-8 encoding. If the files contain only ASCII characters, they’ll still be read correctly, but you’ll also be covered if other characters are encoded in UTF-8. The good news is that the Python 3 string data type was designed to handle Unicode by default.

Even with Unicode, there’ll be occasions when your text contains values that can’t be successfully decoded or encoded. Fortunately, the open function in Python accepts an optional errors parameter that tells it how to deal with encoding errors when reading or writing files. The default option is 'strict', which causes an error to be raised whenever an encoding error is encountered. Other useful options are 'ignore', which causes the character causing the error to be skipped; 'replace', which causes the character to be replaced by a marker character (often, ?); and 'backslashreplace', which replaces the character with a backslashed escape sequence.

The following code writes a file that contains “ABC” followed by three non-ASCII bytes, which may be rendered differently depending on the encoding used.

In [9]:
with open("out2.txt", "wb") as f:
    f.write(bytes([65, 66, 67, 255, 192, 193]))

In [5]:
! powershell cat out2.txt

ABC���


In [10]:
with open("out2.txt", encoding="utf-8") as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

The fourth byte, which has a value of 255 (0xFF), is never valid in UTF-8, so the 'strict' errors setting raises an exception. Now see how the other error options handle the same file, keeping in mind that each of the last three bytes is invalid as UTF-8:

In [12]:
with open("out2.txt", errors="ignore", encoding="utf-8") as f:
    print(f.read())

ABC


In [13]:
with open("out2.txt", errors="replace", encoding="utf-8") as f:
    print(f.read())

ABC���


In [14]:
with open("out2.txt", errors="backslashreplace", encoding="utf-8") as f:
    print(f.read())

ABC\xff\xc0\xc1


If you want any problem characters to disappear, 'ignore' is the option to use. The 'replace' option only marks the place occupied by the invalid character, and the other options in different ways attempt to preserve the invalid characters without interpretation.
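The same `errors` options apply in the other direction, when text is encoded to bytes. The sketch below encodes a made-up sample word to ASCII with `str.encode`, which accepts the same option names as `open`.

```python
text = "na\u00efve"  # "naive" with a diaeresis on the i

ignored = text.encode("ascii", errors="ignore")             # drops the character
replaced = text.encode("ascii", errors="replace")           # substitutes "?"
escaped = text.encode("ascii", errors="backslashreplace")   # keeps an escape sequence

try:
    text.encode("ascii")  # errors="strict" is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored, replaced, escaped, strict_failed)
```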

When no encoding is passed explicitly, Python uses the result of locale.getpreferredencoding(False) as the default encoding for file operations. On most Western Windows installations, this call returns "cp1252", but the actual default can vary depending on the system’s locale settings.

In [15]:
import locale

print(locale.getpreferredencoding(False))

cp1252


In [16]:
with open("out2.txt") as f:
    print(f.read())

ABCÿÀÁ


In [17]:
# remove the file
! powershell rm out2.txt
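Because the default encoding depends on the machine, passing `encoding=` explicitly is the safer habit. The sketch below writes a made-up word as UTF-8 into a temporary file and then reads it back twice, once with the right encoding and once with cp1252, to show the kind of garbling a wrong default can produce.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name

    # Write the German word "Strasse" spelled with a sharp s (U+00DF).
    path.write_text("Stra\u00dfe", encoding="utf-8")

    as_utf8 = path.read_text(encoding="utf-8")      # correct round trip
    as_cp1252 = path.read_text(encoding="cp1252")   # two UTF-8 bytes read as two characters

print(as_utf8)
print(as_cp1252)
```

The second read does not raise an error; it silently returns the wrong text, which is why an explicit encoding is worth the extra argument.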

### Reading and Writing Data with pandas

In [124]:
import numpy as np
import pandas as pd

The **pandas I/O API** is a set of top level `reader` functions accessed like `pandas.read_csv()` that generally return a pandas object. The corresponding `writer` functions are object methods that are accessed like `DataFrame.to_csv()`. Below is a table containing available readers and writers.

<table class="table">
<colgroup>
<col style="width: 12.0%">
<col style="width: 40.0%">
<col style="width: 24.0%">
<col style="width: 24.0%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Format Type</p></th>
<th class="head"><p>Data Description</p></th>
<th class="head"><p>Reader</p></th>
<th class="head"><p>Writer</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></p></td>
<td><p><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></p></td>
<td><p><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p>Fixed-Width Text File</p></td>
<td><p><a class="reference internal" href="#io-fwf-reader"><span class="std std-ref">read_fwf</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://www.json.org/">JSON</a></p></td>
<td><p><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></p></td>
<td><p><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></p></td>
<td><p><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></p></td>
<td><p><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/LaTeX">LaTeX</a></p></td>
<td><p>NA</p></td>
<td><p><a class="reference internal" href="#io-latex"><span class="std std-ref">Styler.to_latex</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p><a class="reference external" href="https://www.w3.org/standards/xml/core">XML</a></p></td>
<td><p><a class="reference internal" href="#io-read-xml"><span class="std std-ref">read_xml</span></a></p></td>
<td><p><a class="reference internal" href="#io-xml"><span class="std std-ref">to_xml</span></a></p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p>Local clipboard</p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></p></td>
<td><p><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></p></td>
<td><p><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="http://opendocumentformat.org">OpenDocument</a></p></td>
<td><p><a class="reference internal" href="#io-ods"><span class="std std-ref">read_excel</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://orc.apache.org/">ORC Format</a></p></td>
<td><p><a class="reference internal" href="#io-orc"><span class="std std-ref">read_orc</span></a></p></td>
<td><p><a class="reference internal" href="#io-orc"><span class="std std-ref">to_orc</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></p></td>
<td><p><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></p></td>
<td><p><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></p></td>
<td><p><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SPSS">SPSS</a></p></td>
<td><p><a class="reference internal" href="#io-spss-reader"><span class="std std-ref">read_spss</span></a></p></td>
<td><p>NA</p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></p></td>
</tr>
<tr class="row-even"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google BigQuery</a></p></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
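The reader/writer pairing can be tried without touching the disk. This is a minimal sketch with a made-up DataFrame: the writer method `to_csv` produces text, and the top-level reader `read_csv` turns it back into a DataFrame.

```python
import io

import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0]})

# Writer: a DataFrame method. With no path it returns the CSV as a string.
csv_text = df.to_csv(index=False)

# Reader: a top-level function that accepts a path or a file-like object.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

`index=False` keeps the row labels out of the file; without it, reading the file back produces an extra unnamed column.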

#### CSV files

In [125]:
from pathlib import Path

data_folder_path = Path.cwd().parent / "data"

In [126]:
# example 1
data = pd.read_csv(data_folder_path / "seaslug.txt", sep="\t")
data.head(10)

Unnamed: 0,Time,Percent
0,99,0.067
1,99,0.133
2,99,0.067
3,99,0.0
4,99,0.0
5,0,0.5
6,0,0.467
7,0,0.857
8,0,0.5
9,0,0.357


- `sep`: str, defaults to ',' for read_csv(), \t for read_table(): Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.
- `delimiter` str, default None: Alternative argument name for sep.
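A short sketch of the two ways to handle a non-comma separator: naming it explicitly, or passing `sep=None` with the Python engine so that the separator is sniffed. The semicolon-separated sample text is made up for the example.

```python
import io

import pandas as pd

text = "time;percent\n0;0.5\n1;0.467\n"

# Explicit separator: works with the default C engine.
explicit = pd.read_csv(io.StringIO(text), sep=";")

# sep=None: the Python engine detects the separator with csv.Sniffer.
sniffed = pd.read_csv(io.StringIO(text), sep=None, engine="python")

print(explicit)
print(sniffed)
```

Sniffing is convenient for exploration, but an explicit `sep` is more predictable in automated workflows.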

In [127]:
# example 2: Encoding: `iso-8859-1`, separator: `^`
data = pd.read_csv(data_folder_path / "FOOD_DES.txt", sep="^", encoding="iso-8859-1", header=None, nrows=5, quotechar="~")
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1001,100,"Butter, salted","BUTTER,WITH SALT",,,Y,,0,,6.38,4.27,8.79,3.87
1,1002,100,"Butter, whipped, with salt","BUTTER,WHIPPED,W/ SALT",,,Y,,0,,6.38,,,
2,1003,100,"Butter oil, anhydrous","BUTTER OIL,ANHYDROUS",,,Y,,0,,6.38,4.27,8.79,3.87
3,1004,100,"Cheese, blue","CHEESE,BLUE",,,Y,,0,,6.38,4.27,8.79,3.87
4,1005,100,"Cheese, brick","CHEESE,BRICK",,,Y,,0,,6.38,4.27,8.79,3.87


- `nrows: int, default None` Number of rows of file to read. Useful for reading pieces of large files.


- `header: int or list of ints, default 'infer'` Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.


- `encoding: str, default None` Encoding to use for UTF when reading/writing (e.g. 'utf-8'). [List of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).


- `quotechar: str (length 1)`: The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
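The parameters `header`, `nrows`, and `quotechar` can be tried on a small in-memory sample. The rows below are made up, but they follow the same conventions as the file in example 2: `^` as the separator and `~` as the quote character.

```python
import io

import pandas as pd

# Three made-up rows; the second one has the separator inside a quoted field.
raw = (
    "~A1~^~Butter, salted~^15.87\n"
    "~A2~^~Oil^anhydrous~^0.24\n"
    "~A3~^~Cheese, blue~^42.41\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    sep="^",
    quotechar="~",   # "^" inside ~...~ is treated as data, not as a separator
    header=None,     # the sample has no header row, so columns are numbered
    nrows=2,         # read only the first two rows
)

print(sample)
```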


In [128]:
# example 3
data = pd.read_csv(data_folder_path / "mpls_stops.csv", nrows=3)
data.columns

Index(['Unnamed: 0', 'id Num', 'date', 'problem', 'MDC', 'citation Issued',
       'person Search', 'vehicle Search', 'pre Race', 'race', 'gender', 'lat',
       'long', 'police Precinct', 'neighborhood'],
      dtype='object')

In [129]:
new_column_names = list(data.columns)
new_column_names = [name.lower().replace(" ", "_") for name in new_column_names]
new_column_names[0] = "case_number_id"
print(new_column_names)

['case_number_id', 'id_num', 'date', 'problem', 'mdc', 'citation_issued', 'person_search', 'vehicle_search', 'pre_race', 'race', 'gender', 'lat', 'long', 'police_precinct', 'neighborhood']


In [130]:
data = pd.read_csv(
    data_folder_path / "mpls_stops.csv",
    names=new_column_names,
    skiprows=2,
    engine="c",
    true_values=["YES"],
    false_values=["NO"],
    index_col="case_number_id",
    parse_dates=["date"],
    date_format="%Y-%m-%d %H:%M:%S",
    na_values=["Unknown"],
    dtype={
        "mdc": "category",
        "problem": "category",
        "pre_race": "category",
        "race": "category",
        "gender": "category",
        "police_precinct": "int8",
        "neighborhood": "category",
    },
)
data.index = data.index.astype("int")
data.head()

case_number_id,id_num,date,problem,mdc,citation_issued,person_search,vehicle_search,pre_race,race,gender,lat,long,police_precinct,neighborhood
6823,17-000003,2017-01-01 00:00:42,suspicious,MDC,,False,False,,,,44.966617,-93.246458,1,Cedar Riverside
6824,17-000007,2017-01-01 00:03:07,suspicious,MDC,,False,False,,,Male,44.98045,-93.27134,1,Downtown West
6825,17-000073,2017-01-01 00:23:15,traffic,MDC,,False,False,,White,Female,44.94835,-93.27538,5,Whittier
6826,17-000092,2017-01-01 00:33:48,suspicious,MDC,,False,False,,East African,Male,44.94836,-93.28135,5,Whittier
6827,17-000098,2017-01-01 00:37:58,traffic,MDC,,False,False,,White,Female,44.979078,-93.262076,1,Downtown West


In [131]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33198 entries, 6823 to 41399
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id_num           33198 non-null  object        
 1   date             33198 non-null  datetime64[ns]
 2   problem          33198 non-null  category      
 3   mdc              33198 non-null  category      
 4   citation_issued  3656 non-null   object        
 5   person_search    28242 non-null  object        
 6   vehicle_search   28242 non-null  object        
 7   pre_race         9828 non-null   category      
 8   race             22821 non-null  category      
 9   gender           24258 non-null  category      
 10  lat              33198 non-null  float64       
 11  long             33198 non-null  float64       
 12  police_precinct  33198 non-null  int8          
 13  neighborhood     33198 non-null  category      
dtypes: category(6), datetime64[ns](1), float64(2), int8(1), object(4)

- `names: array-like, default None` List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.


- `skiprows: list-like or integer, default None` Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.


- `engine: {‘c’, ‘python’, ‘pyarrow’}` Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.

In [132]:
%timeit -n 1 -r 1 mpls = pd.read_csv(data_folder_path / "mpls_stops.csv", names=new_column_names, skiprows=2, engine='python')

163 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [133]:
%timeit -n 1 -r 1 mpls = pd.read_csv(data_folder_path / "mpls_stops.csv", names=new_column_names, skiprows=2, engine='c')

70.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [134]:
%timeit -n 1 -r 1 mpls = pd.read_csv(data_folder_path / "mpls_stops.csv", names=new_column_names, skiprows=2, engine='pyarrow')

26.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


- `true_values: list, default None` Values to consider as True.
- `false_values: list, default None` Values to consider as False.


- `index_col: int, str, sequence of int / str, or False, default None` Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.


- `dtype: Type name or dict of column -> type, default None` Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with suitable na_values settings to preserve and not interpret dtype.


- `parse_dates: boolean or list of ints or names or list of lists or dict, default False.`
  - If True -> try parsing the index.
  - If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
  - If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
  - If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for iso8601-formatted dates.


- `date_format: str or dict of column -> format, default None` Format to use for parsing dates when used in conjunction with parse_dates, given as a strftime pattern, e.g. "%d/%m/%Y". See the strftime documentation for more information on choices, though note that "%f" will parse all the way up to nanoseconds.


- `na_values: scalar, str, list-like, or dict, default None` Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. See the pandas `read_csv` documentation for the list of values interpreted as NaN by default.
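The parameters described above can be combined on a small in-memory sample. The two rows and the column names below are made up and only loosely modeled on the police stops file, so that each option has a visible effect.

```python
import io

import pandas as pd

raw = (
    "Id,When,Searched,Race,Precinct\n"
    "1,2017-01-01 00:00:42,NO,Unknown,1\n"
    "2,2017-01-01 00:03:07,YES,White,5\n"
)

stops = pd.read_csv(
    io.StringIO(raw),
    names=["stop_id", "date", "searched", "race", "precinct"],
    skiprows=1,                       # skip the original header line
    index_col="stop_id",              # use this column as row labels
    parse_dates=["date"],
    date_format="%Y-%m-%d %H:%M:%S",  # requires pandas 2.0 or newer
    true_values=["YES"],
    false_values=["NO"],
    na_values=["Unknown"],            # treat "Unknown" as missing
    dtype={"race": "category", "precinct": "int8"},
)

print(stops)
print(stops.dtypes)
```

Here `searched` becomes a boolean column because every value matches either `true_values` or `false_values`; a column that also contains missing values would stay as `object`.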


#### Text files

In [135]:
# example 1
! powershell cat  ../data/iperf.txt

Wed Aug 15 19:35:11 CEST 2018
Connecting to host x.x.x.x, port 5201
[  4] local x.x.x.x port 48944 connected to x.x.x.x port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  273    471 KBytes
[  4]   1.00-2.00   sec   428 MBytes  3.59 Gbits/sec  145    376 KBytes
[  4]   2.00-3.00   sec   360 MBytes  3.02 Gbits/sec  148    454 KBytes
[  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   83    407 KBytes
[  4]   4.00-5.00   sec   305 MBytes  2.56 Gbits/sec  104    414 KBytes
[  4]   5.00-6.00   sec   301 MBytes  2.53 Gbits/sec  186    440 KBytes
[  4]   6.00-7.00   sec   325 MBytes  2.73 Gbits/sec  174    485 KBytes
[  4]   7.00-8.00   sec   434 MBytes  3.64 Gbits/sec   81    677 KBytes
[  4]   8.00-9.00   sec   412 MBytes  3.46 Gbits/sec  226    537 KBytes
[  4]   9.00-10.00  sec   409 MBytes  3.43 Gbits/sec   47    372 KBytes
[  4]   10.00-11.00  sec   523 MBytes  3.81 Gbits/sec   96    422 KBytes


In [136]:
import csv
import datetime

file_path = data_folder_path / "iperf.txt"
temp_file_path = data_folder_path / "iperf_temp.txt"

with file_path.open("r") as f:
    raw_data = f.readlines()
    raw_data = [line.strip() for line in raw_data]

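# NOTE: CEST is UTC+2. Tagging the parsed time as UTC keeps the wall-clock value from the log;
# it does not convert the timestamp to true UTC.
start_time = datetime.datetime.strptime(raw_data[0], "%a %b %d %H:%M:%S CEST %Y").replace(tzinfo=datetime.timezone.utc)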

print(start_time, type(start_time))

rows = []
for line in raw_data[4:]:
    line_splitted = line.split()
    # seconds to add to start time
    add_seconds = int(line_splitted[2].split(".")[0])
    timestamp = start_time + datetime.timedelta(seconds=add_seconds)
    transfer_mbytesec = int(line_splitted[4])
    bandwidth_gbitsec = float(line_splitted[6])
    retr = int(line_splitted[8])
    cwnd_kbytes = int(line_splitted[9])
    rows.append((timestamp, transfer_mbytesec, bandwidth_gbitsec, retr, cwnd_kbytes))


headers = ["timestamp", "transfer_mbytesec", "bandwidth_gbitsec", "retr", "cwnd_kbytes"]

with temp_file_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)

data = pd.read_csv(temp_file_path, parse_dates=["timestamp"], index_col=["timestamp"])
data.head(10)

# remove the file
temp_file_path.unlink()

2018-08-15 19:35:11+00:00 <class 'datetime.datetime'>
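The timestamp printed above carries a UTC label on a time that the log recorded in CEST. If true UTC is needed, one possible approach is to attach a fixed UTC+2 offset and then convert. This is a sketch under the assumption that CEST always means UTC+2; the constant name `CEST` is introduced here only for the illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset time zone for Central European Summer Time (assumed to be UTC+2).
CEST = timezone(timedelta(hours=2), name="CEST")

header_line = "Wed Aug 15 19:35:11 CEST 2018"

# Parse the wall-clock time, attach the CEST offset, then convert to UTC.
local_time = datetime.strptime(header_line, "%a %b %d %H:%M:%S CEST %Y").replace(tzinfo=CEST)
utc_time = local_time.astimezone(timezone.utc)

print(local_time.isoformat())
print(utc_time.isoformat())
```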


In [137]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11 entries, 2018-08-15 19:35:11+00:00 to 2018-08-15 19:35:21+00:00
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   transfer_mbytesec  11 non-null     int64  
 1   bandwidth_gbitsec  11 non-null     float64
 2   retr               11 non-null     int64  
 3   cwnd_kbytes        11 non-null     int64  
dtypes: float64(1), int64(3)
memory usage: 440.0 bytes
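Splitting on whitespace works for this log, but it relies on fixed field positions. An alternative sketch uses a regular expression with named groups, so that lines that do not match (such as the header) are skipped automatically. The pattern and the field names are assumptions based on the sample lines shown above.

```python
import re

# One named group per field of an iperf interval line.
LINE_PATTERN = re.compile(
    r"\[\s*\d+\]\s+(?P<start>[\d.]+)-(?P<end>[\d.]+)\s+sec\s+"
    r"(?P<transfer>[\d.]+)\s+MBytes\s+(?P<bandwidth>[\d.]+)\s+Gbits/sec\s+"
    r"(?P<retr>\d+)\s+(?P<cwnd>\d+)\s+KBytes"
)

sample_lines = [
    "[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd",
    "[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  273    471 KBytes",
    "[  4]   9.00-10.00  sec   409 MBytes  3.43 Gbits/sec   47    372 KBytes",
]

records = []
for line in sample_lines:
    match = LINE_PATTERN.match(line)
    if match is None:
        continue  # header or any other non-data line
    records.append(
        {
            "start": float(match["start"]),
            "transfer_mbytes": float(match["transfer"]),
            "bandwidth_gbitsec": float(match["bandwidth"]),
            "retr": int(match["retr"]),
            "cwnd_kbytes": int(match["cwnd"]),
        }
    )

print(records)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)`.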


#### JSON files

[Datasets examples](https://github.com/jdorfman/awesome-json-datasets#bitcoinm)

[`pandas.read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html): pandas.read_json(path_or_buf, *, orient=None, typ='frame', dtype=None, convert_axes=None, convert_dates=True, keep_default_dates=True, precise_float=False, date_unit=None, encoding=None, encoding_errors='strict', lines=False, chunksize=None, compression='infer', nrows=None, storage_options=None, dtype_backend=<no_default>, engine='ujson')

**Orient options**

Default is `'columns'`

<table class="colwidths-given table">
<colgroup>
<col style="width: 12%">
<col style="width: 88%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">split</span></code></p></td>
<td><p>dict like {index -&gt; [index], columns -&gt; [columns], data -&gt; [values]}</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">records</span></code></p></td>
<td><p>list like [{column -&gt; value}, … , {column -&gt; value}]</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">index</span></code></p></td>
<td><p>dict like {index -&gt; {column -&gt; value}}</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">columns</span></code></p></td>
<td><p>dict like {column -&gt; {index -&gt; value}}</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">values</span></code></p></td>
<td><p>just the values array</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">table</span></code></p></td>
<td><p>adhering to the JSON <a class="reference external" href="https://specs.frictionlessdata.io/json-table-schema/">Table Schema</a></p></td>
</tr>
</tbody>
</table>


In [138]:
df_test = pd.DataFrame({"A": range(1, 4), "B": range(4, 7), "C": range(7, 10)}, columns=list("ABC"), index=list("xyz"))
df_test

Unnamed: 0,A,B,C
x,1,4,7
y,2,5,8
z,3,6,9


The format of the JSON string:


- Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index:


In [139]:
df_test.to_json(orient="columns")

'{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

- Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary:


In [140]:
df_test.to_json(orient="index")

'{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

- Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js:


In [141]:
df_test.to_json(orient="records")

'[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'

- Value oriented is a bare-bones option that serializes to nested JSON arrays of values only; column and index labels are not included:


In [142]:
df_test.to_json(orient="values")

'[[1,4,7],[2,5,8],[3,6,9]]'

- Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also included for Series:


In [143]:
df_test.to_json(orient="split")

'{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'

- Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including but not limited to dtypes and index names.


In [144]:
df_test.to_json(orient="table")

'{"schema":{"fields":[{"name":"index","type":"string"},{"name":"A","type":"integer"},{"name":"B","type":"integer"},{"name":"C","type":"integer"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":"x","A":1,"B":4,"C":7},{"index":"y","A":2,"B":5,"C":8},{"index":"z","A":3,"B":6,"C":9}]}'
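When the JSON is read back, the `orient` passed to `read_json` should match the one used for writing, and orientations that omit labels cannot restore them. A minimal sketch, assuming the same `df_test` frame as above and using `io.StringIO` as a stand-in for a file on disk:

```python
from io import StringIO

import pandas as pd

df_test = pd.DataFrame(
    {"A": range(1, 4), "B": range(4, 7), "C": range(7, 10)},
    index=list("xyz"),
)

# "split" stores index, columns and data separately, so the labels survive
split_json = df_test.to_json(orient="split")
from_split = pd.read_json(StringIO(split_json), orient="split")

# "records" does not store the index, so a fresh 0..n-1 index is created
records_json = df_test.to_json(orient="records")
from_records = pd.read_json(StringIO(records_json), orient="records")

print(from_split.index.tolist())
print(from_records.index.tolist())
```

If the row labels matter, `split`, `index`, `columns`, or `table` are the safer choices; `records` and `values` are more compact but lossy.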

In [145]:
# example 1
oceans = pd.read_json(data_folder_path / "ocenas.json", orient="columns")
oceans = oceans.drop(columns="description")
oceans = oceans.drop(["title", "units", "base_period", "missing"])
oceans.index.name = "year"
oceans = oceans.rename(columns={"data": "temp_anomaly_celsius"})
oceans.index = pd.to_datetime(oceans.index).year
oceans.head()

Unnamed: 0_level_0,temp_anomaly_celsius
year,Unnamed: 1_level_1
1880,-0.12
1881,-0.09
1882,-0.1
1883,-0.18
1884,-0.27


In [146]:
# example 2
import json

temp_file_path = data_folder_path / "temp_temperatures.json"

with (data_folder_path / "temperatures.json").open("rt") as f:
    raw_data = json.load(f)

raw_data = raw_data["data"]

with temp_file_path.open("wt") as f:
    json.dump(raw_data, f, indent=4)

data = pd.read_json(temp_file_path, orient="index")

# remove the file
temp_file_path.unlink()

data.head()

Unnamed: 0,value,anomaly
189512,50.34,-1.68
189612,51.99,-0.03
189712,51.56,-0.46
189812,51.43,-0.59
189912,51.01,-1.01


In [147]:
# example 3
cities = pd.read_json(data_folder_path / "cities.json", orient="records", dtype={"nametype": "category", "recclass": "category", "fall": "category"})
coordinates = pd.json_normalize(cities["geolocation"].to_list())["coordinates"]
cities["coordinates_x"] = coordinates.str[0]
cities["coordinates_y"] = coordinates.str[1]
cities = cities.set_index("name")
cities = cities.drop(columns=["geolocation", ":@computed_region_cbhk_fwbd", ":@computed_region_nnqa_25f4", "id"])
cities["year"] = cities["year"].astype(str).str[:4].astype(float)
cities.head(5)

Unnamed: 0_level_0,nametype,recclass,mass,fall,year,reclat,reclong,coordinates_x,coordinates_y
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Aachen,Valid,L5,21.0,Fell,1880.0,50.775,6.08333,6.08333,50.775
Aarhus,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,10.23333,56.18333
Abee,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.0,-113.0,54.21667
Acapulco,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.9,-99.9,16.88333
Achiras,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95,-64.95,-33.16667


In [148]:
cities.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Aachen to Tomakovka
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   nametype       1000 non-null   category
 1   recclass       1000 non-null   category
 2   mass           972 non-null    float64 
 3   fall           1000 non-null   category
 4   year           999 non-null    float64 
 5   reclat         988 non-null    float64 
 6   reclong        988 non-null    float64 
 7   coordinates_x  988 non-null    float64 
 8   coordinates_y  988 non-null    float64 
dtypes: category(3), float64(6)
memory usage: 62.8+ KB


[pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html): Normalize semi-structured JSON data into a flat table.
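To see what "flattening" means in practice, here is a small sketch on in-memory data. The records are hypothetical and only loosely shaped like a transactions feed; `record_path` selects the nested list that becomes rows, and `meta` names parent fields to repeat on each row:

```python
import pandas as pd

# Hypothetical nested records (made up for illustration)
txs = [
    {
        "hash": "aa11",
        "time": 100,
        "out": [{"addr": "addr-1", "value": 5}, {"addr": "addr-2", "value": 7}],
    },
    {"hash": "bb22", "time": 101, "out": [{"addr": "addr-3", "value": 9}]},
]

# One row per element of each "out" list; "hash" and "time" are copied
# from the parent record onto every row that came from it
flat = pd.json_normalize(txs, record_path=["out"], meta=["hash", "time"])

# Without record_path, nested dictionaries become dotted column names
users = pd.json_normalize([{"id": 1, "user": {"name": "Ann", "city": "Oslo"}}])

print(flat)
print(users.columns.tolist())
```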


In [149]:
# example 4
with (data_folder_path / "transactions.json").open("rt") as f:
    data = json.load(f)

trans = pd.json_normalize(data["txs"], record_path=["out"], meta=["time", "relayed_by", "vout_sz", "hash"])
trans.head()

Unnamed: 0,spent,tx_index,type,addr,value,n,script,time,relayed_by,vout_sz,hash
0,False,0,0,1H7r57SXAwaKs3Tf5ugbkRNxwfh9YaxC5b,7541,0,76a914b0cd787a7a879ac0a5277b0013ec7b11c145055d...,1586376721,0.0.0.0,2,0f06714015f334626a168ee3e0aa5e0d3866a33dad504b...
1,False,0,0,1BPULhbGfrojrknyD7aZYMtRVUu38Cn75j,1364400,1,76a91471f13b222426eb80b47d2413d21a8904ec1966b2...,1586376721,0.0.0.0,2,0f06714015f334626a168ee3e0aa5e0d3866a33dad504b...
2,False,0,0,1LQ6YURobx4EGZRp8bdEDHup6T56o5NGKN,3127836,0,76a914d4c895721d3a8cd74bb3ccbb699a3dbe342c0807...,1586376722,0.0.0.0,2,3684072a50d7389933210d7adf4f98640d3d53c8cb245e...
3,False,0,0,1HSLVVSSQmzaNG8sbakhFDrmpzUPZLnYCe,30036732,1,76a914b44cae99837337275d21d2c5c6ed6cddf7a7e9f7...,1586376722,0.0.0.0,2,3684072a50d7389933210d7adf4f98640d3d53c8cb245e...
4,False,0,0,3Lb2MJWbBE88BUHf6tAw8ZzhkR6H2cYRhR,206183,0,a914cf48401e3cf81080352f281ea859ccabd51a821487,1586376721,0.0.0.0,3,3d3cc141654170060a7e298a9e5298557970e8cd0051ab...


#### Excel files

pandas provides the `read_excel()` function to read data from Excel files into DataFrames. This function supports various parameters to customize the reading process.

To facilitate working with multiple sheets from the same file, the `ExcelFile` class can be used to wrap the file and can be passed into `read_excel()`. Reading multiple sheets this way is faster because the file is read into memory only once.


The `sheet_names` property returns a list of the sheet names in the file.

In [156]:
# Assign spreadsheet filename: file
file = data_folder_path / "battledeath.xlsx"

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# Print sheet names
print(xls.sheet_names)

['2002', '2004']


`read_excel()` reads an Excel file into a pandas DataFrame.

It supports the xls, xlsx, xlsm, xlsb, and odf file extensions, read from a local filesystem or a URL, and it can read a single sheet or a list of sheets.


In [157]:
df_2002 = pd.read_excel(xls, "2002")
df_2002.head()

Unnamed: 0,"War, age-adjusted mortality due to",2002
0,Afghanistan,36.08399
1,Albania,0.128908
2,Algeria,18.31412
3,Andorra,0.0
4,Angola,18.96456


The ExcelFile class can also be used as a context manager. The primary use-case for an ExcelFile is parsing multiple sheets with different parameters.


In [158]:
with pd.ExcelFile(file) as xls:
    df_2002 = pd.read_excel(xls, "2002", names=["Country", "AAM due to War (2002)"], index_col="Country")
    df_2004 = pd.read_excel(xls, "2004", names=["Country", "War(2004)"], index_col="Country")

In [159]:
df_2002.head(2)

Unnamed: 0_level_0,AAM due to War (2002)
Country,Unnamed: 1_level_1
Afghanistan,36.08399
Albania,0.128908


In [160]:
df_2004.head(2)

Unnamed: 0_level_0,War(2004)
Country,Unnamed: 1_level_1
Afghanistan,9.451028
Albania,0.130354


#### XML files

XML is a markup language designed to store and transport data in a form that different systems can exchange. pandas provides `read_xml()` to read XML into a DataFrame and the `DataFrame.to_xml()` method to write a DataFrame out as XML, so XML data can be handled in the same workflow as the other formats.

Suppose we have an XML file named employees.xml:

In [164]:
path = data_folder_path / "employees.xml"
data = pd.read_xml(path, xpath=".//employee")
data = data.set_index("id")
data.head()

Unnamed: 0_level_0,name,department
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Alice,HR
2,Bob,Engineering


### Reading online data

Online datasets are pre-collected data available in various formats and hosted on platforms dedicated to data sharing.

Let's say you have a URL that serves CSV data in raw format. You can read it directly into a pandas DataFrame using the following code:

In [169]:
url = "https://raw.githubusercontent.com/secretGeek/AwesomeCSV/refs/heads/master/awesomecsv.csv"
data = pd.read_csv(url)
data.head()

Unnamed: 0,Category,Name,URL,Description
0,Awesome CSV Tools,NimbleText/Live,https://NimbleText.com/Live,Use patterns to manipulate CSV; the world's si...
1,Awesome CSV Tools,PapaParse,https://www.papaparse.com,A powerful in-browser CSV parser
2,Awesome CSV Tools,CSVKit,http://csvkit.readthedocs.org/en/0.7.3/,CSV utilities that includes csvsql / csvgrep /...
3,Awesome CSV Tools,XSV,https://github.com/BurntSushi/xsv,A fast CSV command-line toolkit written in Rust
4,Awesome CSV Tools,sed (gnu tool),https://www.gnu.org/software/sed/manual/sed.html,Stream editor


### Converting and writing data to files

pandas is useful not only for data manipulation and analysis but also for exporting data to a variety of file formats, including CSV, JSON, Excel, HTML, and XML.

In [170]:
f500 = pd.read_csv(data_folder_path / "f500.csv", index_col=0)
f500.head()

Unnamed: 0_level_0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


In [171]:
# Export data to CSV but with a different separator (;)
f500.to_csv(data_folder_path / "f500_semicolon.csv", sep=";", encoding="utf-8")

In [173]:
# Export data to JSON (note: orient="records" omits the index, here the company names)
f500.to_json(data_folder_path / "f500.json", orient="records")

In [180]:
# Export data to Excel (the context manager saves and closes the file)
with pd.ExcelWriter(data_folder_path / "f500.xlsx") as excel:
    f500.to_excel(excel, sheet_name="f500")

In [181]:
# Export data to HTML
f500.to_html(data_folder_path / "f500.html")

In [182]:
# Export data to XML
f500.to_xml(data_folder_path / "f500.xml")

In [183]:
# remove all created files
(data_folder_path / "f500_semicolon.csv").unlink(missing_ok=True)
(data_folder_path / "f500.json").unlink(missing_ok=True)
(data_folder_path / "f500.xlsx").unlink(missing_ok=True)
(data_folder_path / "f500.html").unlink(missing_ok=True)
(data_folder_path / "f500.xml").unlink(missing_ok=True)
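A quick way to check an export is to read the file back and compare. The sketch below uses a small made-up frame in place of `f500` and a temporary directory, so it leaves no files behind. It also shows one pitfall: `orient="records"` does not write the index, so the index is moved into a regular column first.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the f500 frame, indexed by company name
companies = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=pd.Index(["Walmart", "State Grid"], name="company"),
)

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # CSV: the reader must be told about the non-default separator
    companies.to_csv(tmp_path / "companies.csv", sep=";")
    from_csv = pd.read_csv(tmp_path / "companies.csv", sep=";", index_col="company")

    # JSON: orient="records" omits the index, so turn it into a column first
    companies.reset_index().to_json(tmp_path / "companies.json", orient="records")
    from_json = pd.read_json(tmp_path / "companies.json", orient="records")

print(from_csv.index.tolist())
print(from_json.columns.tolist())
```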

## Working with Binary Files

### Python Pickle Format


[Pickle in Python: Object Serialization](https://www.datacamp.com/community/tutorials/pickle-python-tutorial)


Pickle is used for serializing and de-serializing Python object structures, also called marshalling or flattening. Serialization is the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be retrieved and de-serialized back into a Python object. Pickling is not to be confused with compression: the former converts an object from one representation (data in random access memory, RAM) to another (bytes on disk), while the latter encodes data with fewer bits in order to save disk space.


Pickling is useful for applications where you need some degree of persistence in your data. Your program's state can be saved to disk so you can continue working on it later. It can also be used to send data over a Transmission Control Protocol (TCP) or socket connection, or to store Python objects in a database. Pickle is particularly useful when working with machine learning models: a trained model can be saved and later reloaded to make new predictions without retraining it.


If you want to use data across different programming languages, pickle is not recommended. Its protocol is specific to Python, so cross-language compatibility is not guaranteed. The same holds for different versions of Python itself: unpickling a file that was pickled in a different version of Python may not always work properly, so make sure you are using the same version and upgrade if necessary. Do not unpickle data from an untrusted source, because malicious code inside the file can be executed during unpickling.


- File type native to Python
- Motivation: many datatypes for which it isn’t obvious how to store them
- Pickled files are serialized
- Serialize = convert object to bytestream


There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner. JSON, covered earlier in this chapter, is appropriate for Python dictionaries.

However, if you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream.
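As a minimal illustration with the standard-library `pickle` module, the made-up dictionary below uses tuple keys and a set, which plain JSON cannot represent but pickle handles directly:

```python
import pickle

# A dictionary with tuple keys and a set value (made up for illustration)
measurements = {
    ("station_a", 2024): [1.5, 2.0, None],
    ("station_b", 2024): {"tags": {"coastal", "automatic"}},
}

payload = pickle.dumps(measurements)  # object -> bytes
restored = pickle.loads(payload)      # bytes -> a new, equal object

print(type(payload))
print(restored == measurements)
```

`pickle.dump` and `pickle.load` do the same with a file opened in binary mode (`"wb"` / `"rb"`) instead of an in-memory bytes object.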


In [None]:
# prepare the data that will be written to pickle format
titanic = pd.read_csv(
    "data/titanic_sub.csv",
    index_col="PassengerId",
    usecols=["PassengerId", "Survived", "Pclass", "Sex", "Age", "Fare", "Cabin", "Embarked"],
)

In [None]:
titanic.head(3)

Unnamed: 0_level_0,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,0,3,male,22.0,7.25,,S
2,1,1,female,38.0,71.2833,C85,C
3,1,3,female,26.0,7.925,,S


[pandas.DataFrame.to_pickle](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html#pandas.DataFrame.to_pickle)


In [None]:
titanic.to_pickle("data/titanic_sub.pkl")

Load a pickled pandas object (or any other pickled object) from a file:


[pandas.read_pickle](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html)


In [None]:
titanic_read = pd.read_pickle("data/titanic_sub.pkl")

In [None]:
titanic_read.head()

Unnamed: 0_level_0,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,0,3,male,22.0,7.25,,S
2,1,1,female,38.0,71.2833,C85,C
3,1,3,female,26.0,7.925,,S
4,1,1,female,35.0,53.1,C123,S
5,0,3,male,35.0,8.05,,S


## Collecting Data from APIs

**Objective:** Fetch data from REST APIs and parse results into pandas.

**Topics:**
- REST API fundamentals (endpoints, methods, status codes)
- Authentication: API keys, OAuth 2.0
- Using the requests library: GET/POST, headers, pagination
- Handling rate limits and error codes

**Tutorial:**
- Fetch weather data from the OpenWeatherMap API.
- Paginate through the GitHub Issues API and combine results.

**Exercise:** Build a DataFrame of trending YouTube videos using the YouTube Data API.
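Pagination is the part of this workflow that is easiest to sketch without network access. The example below assumes a page-numbered API that returns an empty list after the last page; real APIs may use cursors or `Link` headers instead. `collect_pages` and `fake_api` are hypothetical names, and the lambda stands in for a real HTTP request:

```python
import pandas as pd


def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page signals the end of the data
            break
        records.extend(batch)
    return pd.DataFrame.from_records(records)


# Stand-in for a real HTTP call (for example a GET request with a "page" parameter)
fake_api = {
    1: [{"id": 101, "title": "First issue"}, {"id": 102, "title": "Second issue"}],
    2: [{"id": 103, "title": "Third issue"}],
}

issues = collect_pages(lambda page: fake_api.get(page, []))
print(issues)
```

Passing the fetch function in as an argument keeps the paging logic separate from authentication, retries, and rate-limit handling, which can then be added inside the fetch function. The `max_pages` cap guards against an endpoint that never returns an empty page.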

## Web Scraping Basics

**Objective:** Extract structured data from websites.

**Topics:**
- HTML/CSS basics: tags, classes, IDs
- Ethical scraping: robots.txt, rate limiting
- Tools: BeautifulSoup, requests, pandas.read_html()

**Tutorial:**
- Scrape Wikipedia tables into DataFrames with pd.read_html().
- Use BeautifulSoup to extract product prices from an e-commerce site.

**Exercise:** Scrape real estate listings and calculate average prices.
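Before scraping a site, it is worth checking what its `robots.txt` allows. The sketch below uses the standard-library `urllib.robotparser` on a hypothetical `robots.txt` given as a string, so it runs offline; the bot name and URLs are made up:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (a real one is served from the site root)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "course-demo-bot"  # hypothetical crawler name
allowed_public = parser.can_fetch(user_agent, "https://example.com/listings/page1")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/data")
delay = parser.crawl_delay(user_agent)  # seconds to wait between requests, if declared

print(allowed_public, allowed_private, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.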

## Working with Databases

Many organizations store data in databases, which can be queried to extract exactly the information needed.

**Types of Databases**
- **Relational Databases**: Use SQL for querying (e.g., MySQL, PostgreSQL, SQLite).
- **NoSQL Databases**: Designed for unstructured data (e.g., MongoDB, Cassandra).
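For a relational database, pandas can run a SQL query and return the result as a DataFrame. A minimal sketch using the standard-library `sqlite3` module and an in-memory database; the `sales` table and its rows are made up for illustration:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical "sales" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", "north", 120.0), ("widget", "south", 80.0), ("gadget", "north", 200.0)],
)
conn.commit()

# Let the database do the filtering; the parameter is passed separately
# from the SQL text instead of being formatted into the string
north_sales = pd.read_sql_query(
    "SELECT product, amount FROM sales WHERE region = ? ORDER BY amount",
    conn,
    params=("north",),
)
conn.close()

print(north_sales)
```

For other database servers such as MySQL or PostgreSQL, the connection is usually created through SQLAlchemy rather than passed as a raw driver connection.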

## Big Data & Cloud Storage

**Objective:** Access large datasets from cloud storage.

**Topics:**
- Cloud storage: AWS S3, Google Cloud Storage
- Parquet/Feather formats for efficient I/O
- Using boto3 to read from S3

**Tutorial:**
- Load a 1 GB Parquet file from S3 using s3fs and pd.read_parquet().
- Compare read speeds for CSV vs. Parquet.

## End-to-End Data Acquisition Pipeline

**Objective:** Combine multiple data sources into a unified dataset.

**Requirements:**
- Scrape product data from a website.
- Fetch complementary pricing data via API.
- Merge with historical sales data from a SQL database.
- Validate, clean, and export to Parquet.
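The merge and validation steps can be sketched independently of where the data comes from. Below, three small made-up DataFrames stand in for the scraped, API, and SQL sources; the column names (`sku`, `price`, `units_sold`) are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for the three sources
scraped = pd.DataFrame({"sku": ["A1", "B2", "C3"], "name": ["Lamp", "Desk", "Chair"]})
api_prices = pd.DataFrame({"sku": ["A1", "B2", "C3"], "price": [19.9, 149.0, None]})
sql_sales = pd.DataFrame({"sku": ["A1", "C3"], "units_sold": [340, 12]})

# validate= raises a MergeError if a key is unexpectedly duplicated
combined = (
    scraped
    .merge(api_prices, on="sku", how="left", validate="one_to_one")
    .merge(sql_sales, on="sku", how="left", validate="one_to_one")
)

# Basic cleaning: require a price, treat missing sales history as zero
combined = combined.dropna(subset=["price"])
combined["units_sold"] = combined["units_sold"].fillna(0).astype(int)

print(combined)
# combined.to_parquet("combined.parquet")  # needs pyarrow or fastparquet installed
```

Left joins keep every scraped product, so rows with missing prices or sales stay visible and can be handled explicitly rather than being dropped silently by an inner join.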