<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/mthree-c422-Avantika/Multiple_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Practical Data Ingestion from Multiple Sources in Colab
This lab guides you through ingesting data from multiple sources **(CSV, JSON, REST API)**, cleaning and transforming it, and producing a single, unified dataset using Python and pandas in Google Colab.

Objectives
- Ingest data from CSV, JSON, and REST API sources
- Use a central “ingestion layer” (pandas) for data import
- Apply cleaning and transformation steps modularly
- Consolidate results into a single, unified output

In [1]:
# Step 1: Set Up Environment
!pip install pandas requests -q

In [2]:
# Step 2: Ingest Data from Multiple Sources
# a. CSV File
import pandas as pd

csv_url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df_csv = pd.read_csv(csv_url)
print("CSV Data Sample:")
print(df_csv.head())

# b. JSON File (pandas fetches and parses the URL directly)

json_url = "https://jsonplaceholder.typicode.com/users"
df_json = pd.read_json(json_url)
print("\nJSON Data Sample:")
print(df_json.head())

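# c. REST API
import requests

api_url = "https://randomuser.me/api/?results=5"
response = requests.get(api_url, timeout=10)  # avoid hanging on a slow endpoint
response.raise_for_status()  # fail early on an HTTP error status
data = response.json()
# Flatten nested fields (e.g. name.first, name.last) into columns
df_api = pd.json_normalize(data['results'])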
print("\nREST API Data Sample:")
print(df_api.head())


CSV Data Sample:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

JSON Data Sample:
   id              name   username                      email  \
0   1     Leanne Graham       Bret          Sincere@april.biz   
1   2      Ervin Howell  Antonette          Shanna@melissa.tv   
2   3  Clementine Bauch   Samantha         Nathan@yesenia.net   
3   4  Patricia Lebsack   Karianne  Julianne.OConner@kory.org   
4   5  Chelsey Dietrich     Kamren   Lucio_Hettinger@annie.ca   

                                             address                  phone  \
0  {'street': 'Kulas Light', 'suite': 'Apt. 556',...  1-770-736-8031 x56442   
1  {'street': 'Victor Plains', 
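
The `address` column in the JSON sample holds nested dictionaries rather than plain values. A minimal sketch of flattening such records with `pd.json_normalize`, using two inline records copied from the sample output instead of a live request (the `records` variable is an assumption made for illustration, and the second record keeps only the fields visible in the truncated output):

```python
import pandas as pd

# Two records shaped like the users shown in the sample output;
# only a few fields are kept so the example runs offline.
records = [
    {"id": 1, "name": "Leanne Graham",
     "address": {"street": "Kulas Light", "suite": "Apt. 556"}},
    {"id": 2, "name": "Ervin Howell",
     "address": {"street": "Victor Plains"}},
]

# json_normalize expands nested dicts into dotted column names
# (address.street, address.suite); keys missing from a record become NaN.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

Flattening before selecting columns makes nested fields available for the same column-based cleaning used on the other sources.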

In [3]:
# Step 3: Modular Cleaning/Transformation
# Example: Clean and select specific columns from each

# For CSV (Iris), drop incomplete rows, rename 'species' to 'name'
# so it lines up with the other sources, and label the source
df_csv_clean = df_csv.dropna().rename(columns={'species': 'name'})
df_csv_clean['source'] = 'csv'

# For JSON (User info), select name and email
df_json_clean = df_json[['name', 'email']].copy()
df_json_clean['source'] = 'json'

# For API data (Random users), grab first/last name, email
df_api_clean = pd.DataFrame()
df_api_clean['name'] = df_api['name.first'] + " " + df_api['name.last']
df_api_clean['email'] = df_api['email']
df_api_clean['source'] = 'api'


In [4]:
# Step 4: Prepare each cleaned DataFrame with identical columns

iris_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
common_cols = ['name', 'email', 'source'] + iris_cols

# CSV (Iris): rename species to name and add the missing columns
# (the numeric iris columns already exist)
df_csv_clean = df_csv.rename(columns={'species': 'name'})
df_csv_clean['email'] = None
df_csv_clean['source'] = 'csv'

# JSON (Users): add placeholder iris columns
df_json_clean = df_json[['name', 'email']].copy()
df_json_clean['source'] = 'json'
for col in iris_cols:
    df_json_clean[col] = None

# API (Random Users): add placeholder iris columns
df_api_clean = pd.DataFrame({
    'name': df_api['name.first'] + ' ' + df_api['name.last'],
    'email': df_api['email'],
    'source': 'api'
})
for col in iris_cols:
    df_api_clean[col] = None

# Step 5: Concatenate into unified DataFrame
unified_df = pd.concat([
    df_csv_clean[common_cols],
    df_json_clean[common_cols],
    df_api_clean[common_cols]
], ignore_index=True)

print("\nUnified Clean Dataset Sample:")
print(unified_df.head(10))



Unified Clean Dataset Sample:
     name email source  sepal_length  sepal_width  petal_length  petal_width
0  setosa  None    csv           5.1          3.5           1.4          0.2
1  setosa  None    csv           4.9          3.0           1.4          0.2
2  setosa  None    csv           4.7          3.2           1.3          0.2
3  setosa  None    csv           4.6          3.1           1.5          0.2
4  setosa  None    csv           5.0          3.6           1.4          0.2
5  setosa  None    csv           5.4          3.9           1.7          0.4
6  setosa  None    csv           4.6          3.4           1.4          0.3
7  setosa  None    csv           5.0          3.4           1.5          0.2
8  setosa  None    csv           4.4          2.9           1.4          0.2
9  setosa  None    csv           4.9          3.1           1.5          0.1
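
Because the CSV rows come first, `head(10)` shows only one source. A small sketch of checks that cover every source, run here on a hand-built stand-in for `unified_df` so it works offline (the stand-in rows, including the made-up `jane.doe@example.com` entry, are assumptions for illustration):

```python
import pandas as pd

# Stand-in for unified_df with a reduced set of columns
unified_df = pd.DataFrame({
    "name": ["setosa", "setosa", "Leanne Graham", "Ervin Howell", "Jane Doe"],
    "email": [None, None, "Sincere@april.biz", "Shanna@melissa.tv",
              "jane.doe@example.com"],
    "source": ["csv", "csv", "json", "json", "api"],
    "sepal_length": [5.1, 4.9, None, None, None],
})

# How many rows each source contributed
rows_per_source = unified_df["source"].value_counts()

# First row from each source, in original row order
preview = unified_df.groupby("source").head(1)

# Fraction of missing values per column (placeholders show up here)
missing_share = unified_df.isna().mean()

print(rows_per_source)
print(preview)
print(missing_share)
```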


### Step 6: Reflection
- Identify which steps required the most standardization.

Step 4 (preparing each cleaned DataFrame with the same columns) required the most standardization, because each source (CSV, JSON, API) arrived with different column names, structures, and data types. For the data to coexist in one DataFrame, every source needed the same set of columns with compatible data types, even if that meant adding placeholder columns of `None` values for fields a source did not have.
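
One way to reduce the per-source boilerplate in this step is `DataFrame.reindex`, which adds any missing columns as NaN and fixes the column order in one call. A minimal sketch, assuming a hypothetical helper named `align_columns` and small inline frames in place of the real sources:

```python
import pandas as pd

COMMON_COLS = ["name", "email", "source",
               "sepal_length", "sepal_width", "petal_length", "petal_width"]

def align_columns(df, source, columns=COMMON_COLS):
    """Label the rows with their source and conform to the shared schema.

    Hypothetical helper: columns absent from df are added as NaN,
    and columns not listed in `columns` are dropped.
    """
    out = df.copy()
    out["source"] = source
    return out.reindex(columns=columns)

# Inline stand-ins for the cleaned JSON and CSV frames
users = pd.DataFrame({"name": ["Leanne Graham"], "email": ["Sincere@april.biz"]})
iris = pd.DataFrame({"name": ["setosa"], "sepal_length": [5.1], "sepal_width": [3.5],
                     "petal_length": [1.4], "petal_width": [0.2]})

unified = pd.concat(
    [align_columns(users, "json"), align_columns(iris, "csv")],
    ignore_index=True,
)
print(unified)
```

With this shape, adding a source means one extra `align_columns` call rather than another placeholder loop.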

- What common problems occur when merging data of different shapes and sources?

  - Inconsistent column names: Each source may use a different name for the same kind of data (e.g., "species" vs. "name").
  - Differing data types: A column can be a string in one source and a number in another.
  - Missing columns: Some sources may not contain every column the unified dataset needs.
  - Differing data structures: JSON and API data may be nested and must be flattened before merging.
  - Source-specific quality issues: Each source may have its own problems (missing values, incorrect formats, etc.) that must be fixed before merging.
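
Two of these problems can be caught with a few lines before concatenating. The sketch below reports missing columns per source and coerces a text column to numbers; the `missing_columns` helper, the `REQUIRED` list, and the sample values are assumptions for illustration:

```python
import pandas as pd

REQUIRED = ["name", "email", "source"]  # hypothetical shared schema

def missing_columns(frames, required):
    """Map each source label to the required columns it lacks."""
    return {label: [col for col in required if col not in df.columns]
            for label, df in frames.items()}

frames = {
    "csv": pd.DataFrame({"name": ["setosa"], "petal_width": ["0.2"]}),
    "json": pd.DataFrame({"name": ["Leanne Graham"],
                          "email": ["Sincere@april.biz"]}),
}
gaps = missing_columns(frames, REQUIRED)
print(gaps)

# A numeric column that arrived as text: unparseable entries become NaN
widths = pd.Series(["0.2", "0.4", "n/a"])
clean_widths = pd.to_numeric(widths, errors="coerce")
print(clean_widths)
```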

- Why is a central ingestion and transformation layer important for reliability and scalability?

**Reliability**: A central layer allows data from all sources to be processed in the same way based on the same rules and logic, thus reducing the chances of errors and inconsistencies in the resulting dataset.


**Scalability**: As you add data sources or process more data, a modular central layer lets you plug in each new source and scale without rewriting the shared cleaning logic for every source. It also encourages code reuse and maintainability.


**Maintainability**: Changes to existing data sources or cleaning requirements are made in one place within the central layer, so you can update a source with minimal effort and little risk of affecting other parts of the process.


**Reproducibility**: A clean, clearly defined ingestion and transformation layer makes every processing step visible and repeatable, which is important for debugging and auditing.
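
These properties can be illustrated with a small loader registry: each source contributes one function, and a single routine applies the same labeling and column alignment to all of them. This is a sketch under stated assumptions, not the lab's method; `LOADERS`, `register`, `ingest_all`, and the inline sample rows (including the made-up `jane.doe@example.com` user) are hypothetical.

```python
from typing import Callable, Dict

import pandas as pd

COMMON_COLS = ["name", "email", "source"]

# Hypothetical registry: source label -> function returning a raw DataFrame
LOADERS: Dict[str, Callable[[], pd.DataFrame]] = {}

def register(source: str):
    """Decorator that adds a loader function to the registry."""
    def decorator(func):
        LOADERS[source] = func
        return func
    return decorator

@register("json")
def load_users() -> pd.DataFrame:
    # Stand-in for pd.read_json(json_url)
    return pd.DataFrame({"name": ["Leanne Graham"],
                         "email": ["Sincere@april.biz"]})

@register("api")
def load_random_users() -> pd.DataFrame:
    # Stand-in for the flattened REST response
    raw = pd.DataFrame({"name.first": ["Jane"], "name.last": ["Doe"],
                        "email": ["jane.doe@example.com"]})
    return pd.DataFrame({"name": raw["name.first"] + " " + raw["name.last"],
                         "email": raw["email"]})

def ingest_all() -> pd.DataFrame:
    """Run every registered loader through the same shared steps."""
    frames = []
    for source, loader in LOADERS.items():
        df = loader().copy()
        df["source"] = source                            # same labeling rule
        frames.append(df.reindex(columns=COMMON_COLS))   # same schema
    return pd.concat(frames, ignore_index=True)

unified = ingest_all()
print(unified)
```

Adding a source then means writing one decorated loader; the shared steps in `ingest_all` stay untouched.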