# A: Where Data Comes From

This section covers the main places data comes from, the assumptions each source bakes in, and how to run a quick *source audit* before cleaning.

## A.1 Files: CSV, JSON, Excel

Flat files are common, portable, and simple.

**What these formats are good at:**
<table style="text-align: left;">
    <tr>
        <th>Format</th>
        <th>Best suited for</th>
        <th>Common misuse</th>
    </tr>
    <tr>
        <td>CSV</td>
        <td>Simple rectangular tables; system exports; quick sharing</td>
        <td>Encoding complex structure or multiple tables in one file</td>
    </tr>
    <tr>
        <td>Excel</td>
        <td>Human-facing analysis, manual edits, reporting</td>
        <td>Using it as a source of truth or as input to an automated data pipeline</td>
    </tr>
    <tr>
        <td>JSON</td>
        <td>Nested or semi-structured records; API responses</td>
        <td>Assuming fields are stable or consistently present</td>
    </tr>
</table>
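
The JSON misuse in the last row can be checked directly. The following is a minimal sketch, not a prescribed method: it counts how often each top-level key appears across a batch of records. The `records` list and its field names are hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical API records; the second one has no "email" key at all,
# and the third has the key but with a null value.
records = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2},
    {"user_id": 3, "email": None},
]

# Count how many records contain each top-level key.
key_counts = Counter(key for rec in records for key in rec)

for key, count in key_counts.items():
    print(f"{key}: present in {count} of {len(records)} records")
```

A key that appears in fewer records than the batch size is a field that cannot be assumed present. Note that a key that is present but null is a separate case and still counts as present here.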

### Hidden assumptions
* Someone chose column names, encodings, delimiters, and header rows.
* Missing values may be implicit ("", NA, -1).
* Types are inferred later, not enforced at creation.
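
One way to surface these assumptions is to state them in the read call instead of relying on defaults. The sketch below is illustrative only: the semicolon-delimited text, the column names, and the choice to treat -1 as a missing-age sentinel are all assumptions made for the example.

```python
import pandas as pd
from io import StringIO

# Hypothetical export: semicolon-delimited, with -1 assumed to mean "age unknown".
raw = "user_id;age;city\n101;34;Denver\n102;-1;Boulder\n103;;Denver\n"

df = pd.read_csv(
    StringIO(raw),
    sep=";",                    # state the delimiter instead of guessing
    header=0,                   # the first row holds the column names
    na_values={"age": ["-1"]},  # treat the sentinel as missing, for age only
    dtype={"user_id": "int64", "age": "Int64", "city": "string"},  # enforce types
)

print(df.dtypes.to_dict())
print(df.isna().sum().to_dict())
```

Here `Int64` is the pandas nullable integer dtype, which keeps ages as integers even when some are missing. Whether -1 really means "missing" is a question for whoever produced the file, not something the parser can decide.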

### Microlab: File sanity check

Load two simulated messy exports and practice the first questions to ask of any new file:
* What are the columns?
* What looks like missing data?
* What types are being inferred?

In [1]:
import pandas as pd
from io import StringIO
import json

print("=== CSV export (note blanks, NA, weird spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,,NA, Boulder
103,29,65000,Denver
104, -1,5400, "Colorado Springs"
"""
# skipinitialspace=True ignores the space that follows a delimiter (" Boulder" -> "Boulder")
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": None, "city": "Boulder"}, "income": "79000"},
      {"user_id": 203, "profile": {"city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
# json_normalize flattens nested dicts into dotted column names (e.g. profile.age)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

print("\nTry:")
print("- Replace -1 with a real age and re-check missingness")
print("- Make income consistently numeric")
print("- Rename columns to something consistent (snake_case)")


=== CSV export (note blanks, NA, weird spacing) ===
   user_id   age   income              city
0      101  34.0  72000.0            Denver
1      102   NaN      NaN           Boulder
2      103  29.0  65000.0            Denver
3      104  -1.0   5400.0  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('float64'), 'income': dtype('float64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 1, 'income': 1, 'city': 0}

=== JSON export (nested fields) ===
   user_id income  profile.age profile.city
0      201  80000         31.0       Denver
1      202  79000          NaN      Boulder
2      203  61000          NaN       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('O'), 'profile.age': dtype('float64'), 'profile.city': dtype('O')}

Try:
- Replace -1 with a real age and re-check missingness
- Make income consistently numeric
- Rename columns to something consistent (snake_case)
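
The three exercises can be sketched as follows. This is one possible reading, not the only answer: it assumes -1 is a sentinel for an unknown age and marks it as missing instead of inventing a value. The small frames are stand-ins for `df_csv` and `df_json` so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Small stand-ins for df_csv and df_json from the cell above.
df_csv = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [34.0, np.nan, 29.0, -1.0],
})
df_json = pd.DataFrame({
    "user_id": [201, 202, 203],
    "income": [80000, "79000", 61000],
    "profile.age": [31.0, np.nan, np.nan],
})

# 1. Assumption: -1 means "age unknown", so mark it as missing.
df_csv["age"] = df_csv["age"].replace(-1, np.nan)
print("Missing ages:", int(df_csv["age"].isna().sum()))

# 2. Coerce income to numbers; values that cannot be parsed would become NaN.
df_json["income"] = pd.to_numeric(df_json["income"], errors="coerce")
print("Income dtype:", df_json["income"].dtype)

# 3. Replace the dots from json_normalize to get snake_case column names.
df_json = df_json.rename(columns=lambda c: c.replace(".", "_").lower())
print("Columns:", list(df_json.columns))
```

Marking the sentinel as missing raises the missing count for age, which is the point of re-checking: the data was never complete, the gap was only hidden.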


#### Example with cleaned data

In [3]:
import pandas as pd
from io import StringIO
import json

# Same steps as the first cell; only the input values were cleaned by hand.
print("=== CSV export (cleaned: no blanks, sentinels, or stray spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,27,99000,Boulder
103,29,65000,Denver
104,44,5400,Colorado Springs
"""
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": 27, "city": "Boulder"}, "income": 79000},
      {"user_id": 203, "profile": {"age": 44, "city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

=== CSV export (cleaned: no blanks, sentinels, or stray spacing) ===
   user_id  age  income              city
0      101   34   72000            Denver
1      102   27   99000           Boulder
2      103   29   65000            Denver
3      104   44    5400  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('int64'), 'income': dtype('int64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 0, 'income': 0, 'city': 0}

=== JSON export (nested fields) ===
   user_id  income  profile.age profile.city
0      201   80000           31       Denver
1      202   79000           27      Boulder
2      203   61000           44       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('int64'), 'profile.age': dtype('int64'), 'profile.city': dtype('O')}


## A.2 SQL databases