# Task 1. Data Quality

#### a. Example of Poor Quality Structured Data
We consider a securities trading dataset where:
- *Timestamp* field mixes ISO 8601 format (`2023-08-15T09:30:00Z`) and local time (`Aug 15, 2023 9:30 AM EST`)
- *Asset identifiers* inconsistently use tickers (`AAPL`), ISINs (`US0378331005`), and internal codes (`EQ123`)
- *Transaction volumes* contain negative values

#### b. Recognizing Poor Quality in Structured Data
We identify three critical failures of data quality principles:
1. **Lack of Validity**: Negative share volumes violate domain rules, since a traded quantity cannot be below zero.
2. **Inconsistent Formatting**: Timestamp heterogeneity breaches the consistency requirement for time-series analysis.
3. **Ambiguous Identifier Schema**: Mixed asset identification methods compromise uniqueness and traceability, preventing accurate instrument mapping.
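
The three checks above can be sketched as a small row-level audit. This is a minimal illustration rather than a production validator: the field names (`timestamp`, `asset_id`, `volume`), the sample rows, and the choice of ISIN as the canonical identifier are assumptions made for the example, and the ISIN pattern only checks the shape of the code, not its check digit.

```python
import re
from datetime import datetime

# Assumed target formats: ISO 8601 UTC timestamps and ISIN identifiers.
ISO_UTC_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")  # shape only, no checksum


def is_iso_utc(value):
    """Return True if the string parses as an ISO 8601 UTC timestamp."""
    try:
        datetime.strptime(value, ISO_UTC_FORMAT)
        return True
    except ValueError:
        return False


def audit(rows):
    """Return (row_index, field) pairs for every rule violation found."""
    issues = []
    for index, row in enumerate(rows):
        if not is_iso_utc(row["timestamp"]):
            issues.append((index, "timestamp"))   # consistency failure
        if not ISIN_PATTERN.match(row["asset_id"]):
            issues.append((index, "asset_id"))    # ambiguous identifier scheme
        if row["volume"] < 0:
            issues.append((index, "volume"))      # validity failure
    return issues


# Hypothetical rows built from the examples in section (a).
sample_rows = [
    {"timestamp": "2023-08-15T09:30:00Z", "asset_id": "US0378331005", "volume": 100},
    {"timestamp": "Aug 15, 2023 9:30 AM EST", "asset_id": "AAPL", "volume": -50},
    {"timestamp": "2023-08-15T09:31:00Z", "asset_id": "EQ123", "volume": 200},
]

for index, field in audit(sample_rows):
    print(f"row {index}: problem in field '{field}'")
```

On the sample rows, the first record passes all three checks, the second fails all three, and the third fails only the identifier check.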

#### c. Example of Poor Quality Unstructured Data
In the Social Media Data course, we observed financial social media data with the following problems:
- Tweets truncated mid-sentence due to API limitations
- Duplicate or reposted content from bot accounts (identical content posted or reposted ≥5 times/minute)
- Irrelevant content (e.g., meme images without financial context in a market discussion corpus)

#### d. Assessing Poor Quality in Unstructured Data
We would evaluate failures through these lenses:
1. **Incompleteness**: Truncated tweets lose critical sentiment signals, violating the *comprehensiveness* requirement for NLP modeling.
2. **Non-Uniqueness**: Bot-generated duplicates artificially inflate term frequencies, distorting *representational accuracy* of market sentiment.
3. **Contextual Irrelevance**: Non-financial content introduces noise that breaches *fitness-for-purpose* in trading signal extraction.
4. **Unverifiable Provenance**: Absence of user verification metadata undermines *auditability*, a core quality attribute for regulatory use cases.
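
Two of these failures, incompleteness and non-uniqueness, lend themselves to simple automated screening. The sketch below is a rough heuristic under stated assumptions: a post is treated as truncated if it ends in an ellipsis, and two posts count as duplicates if they match after lowercasing, URL removal, and whitespace normalization. The function names and sample posts are hypothetical, and a real pipeline would likely also use posting timestamps and account metadata.

```python
import re


def normalize(text):
    """Lowercase, strip URLs, and collapse whitespace so reposts compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def looks_truncated(text):
    """Heuristic: a trailing ellipsis suggests the post was cut off."""
    return text.rstrip().endswith(("…", "..."))


def deduplicate(posts):
    """Keep the first occurrence of each normalized post, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


# Hypothetical posts for illustration only.
sample_posts = [
    "$AAPL looks strong today",
    "$AAPL  looks STRONG today",
    "Fed decision could push yields hi…",
    "$AAPL looks strong today https://t.co/abc",
]

unique_posts = deduplicate(sample_posts)
complete_posts = [post for post in unique_posts if not looks_truncated(post)]
print(len(sample_posts), len(unique_posts), len(complete_posts))
```

Relevance and provenance are harder to screen with rules like these and would typically need a classifier or account-level metadata.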


# Task 2
We retrieved U.S. Treasury yield data from the Federal Reserve Economic Data (FRED) platform using the `fredapi` Python package.
Specifically, we collected yield curves from January 2024 to June 2025, a recent 18-month period that:

* Covers a wide range of market conditions, including changing inflation readings and Federal Reserve rate decisions.

* Provides sufficient data points for robust model fitting and comparison.

* Matches the project's requirement of including short- to long-term maturities (from 1 month to 30 years).

The dataset includes maturities: 1M, 3M, 6M, 1Y, 2Y, 3Y, 5Y, 7Y, 10Y, 20Y, and 30Y.
Missing values (e.g., on market holidays) were left as-is to preserve the raw data and can be dropped at fitting time.
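
A retrieval along these lines might look like the sketch below. It is not necessarily the exact script used for the project: the series IDs are the standard FRED daily constant-maturity Treasury series for the listed maturities, while the exact start and end dates, the function names, and the `api_key` argument are assumptions for illustration. A valid FRED API key and the `fredapi` and `pandas` packages are required to actually run the download.

```python
# FRED series IDs for daily constant-maturity Treasury yields, keyed by maturity label.
SERIES_IDS = {
    "1M": "DGS1MO",
    "3M": "DGS3MO",
    "6M": "DGS6MO",
    "1Y": "DGS1",
    "2Y": "DGS2",
    "3Y": "DGS3",
    "5Y": "DGS5",
    "7Y": "DGS7",
    "10Y": "DGS10",
    "20Y": "DGS20",
    "30Y": "DGS30",
}


def fetch_yield_curves(api_key, start="2024-01-01", end="2025-06-30"):
    """Download each maturity and return one DataFrame with a column per maturity.

    The default dates are an assumed reading of "January 2024 to June 2025".
    """
    # Imported here so the mapping above can be inspected without these packages.
    import pandas as pd
    from fredapi import Fred

    fred = Fred(api_key=api_key)
    columns = {
        label: fred.get_series(series_id, observation_start=start, observation_end=end)
        for label, series_id in SERIES_IDS.items()
    }
    # Series are aligned on their date index; dates missing for a maturity stay NaN.
    return pd.DataFrame(columns)


def complete_curves(yields):
    """Return only the dates where every maturity has a value, for model fitting."""
    return yields.dropna(how="any")
```

Keeping the raw frame from `fetch_yield_curves` and applying `complete_curves` only at fitting time matches the approach described above: the stored data stays untouched, and incomplete dates are filtered out just before a curve model is fitted.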

