In [1]:


# Step 1: Set Up the Environment
# Install required libraries (requests and pandas)
%pip install -q requests pandas



In [2]:


# Step 2: Ingest Data from a Public REST API
# Example: Fetch random user data from the Random User Generator API.
import requests
import pandas as pd

api_url = "https://randomuser.me/api/?results=10"
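response = requests.get(api_url, timeout=10)  # do not hang forever on a stalled connection
response.raise_for_status()  # fail early on HTTP 4xx/5xx instead of parsing an error body
data = response.json()

# Flatten the nested records under 'results' into a DataFrame
df_api = pd.json_normalize(data['results'])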
print("Data from REST API:")
print(df_api.head())



Data from REST API:
   gender                           email         phone          cell nat  \
0  female         beatrice.lo@example.com  T43 W20-6596  Q67 W86-4025  CA   
1  female       natalia.nunez@example.com   900-627-559   607-875-987  ES   
2    male        gunbir.singh@example.com    7705147489    8161145902  IN   
3    male     viktor.gjelsvik@example.com      79565344      46408309  NO   
4    male  dharmesh.mardhekar@example.com    7332016631    9159175098  IN   

  name.title name.first  name.last  location.street.number  \
0       Miss   Beatrice         Lo                    8251   
1       Miss    Natalia      Núñez                    6444   
2         Mr     Gunbir      Singh                    4359   
3         Mr     Viktor   Gjelsvik                    3403   
4         Mr   Dharmesh  Mardhekar                    4615   

   location.street.name  ...  \
0               Pine Rd  ...   
1  Calle de La Almudena  ...   
2       MG Rd Bangalore  ...   
3         Hersle

In [3]:


# Step 3: Ingest Data from a CSV File
# Example using the Iris dataset (direct CSV link):

csv_url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df_csv = pd.read_csv(csv_url)
print("\nData from CSV File:")
print(df_csv.head())




Data from CSV File:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [4]:


# Step 4: Inspect and Compare Data
# Inspect the columns and summary info of both DataFrames.
print("API Data Columns:", df_api.columns)
print("CSV Data Columns:", df_csv.columns)

# DataFrame.info() prints its summary directly and returns None,
# so wrapping it in print() would add a stray "None" line.
print("\nAPI Data Info:")
df_api.info()

print("\nCSV Data Info:")
df_csv.info()


API Data Columns: Index(['gender', 'email', 'phone', 'cell', 'nat', 'name.title', 'name.first',
       'name.last', 'location.street.number', 'location.street.name',
       'location.city', 'location.state', 'location.country',
       'location.postcode', 'location.coordinates.latitude',
       'location.coordinates.longitude', 'location.timezone.offset',
       'location.timezone.description', 'login.uuid', 'login.username',
       'login.password', 'login.salt', 'login.md5', 'login.sha1',
       'login.sha256', 'dob.date', 'dob.age', 'registered.date',
       'registered.age', 'id.name', 'id.value', 'picture.large',
       'picture.medium', 'picture.thumbnail'],
      dtype='object')
CSV Data Columns: Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

API Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype 


Step 5: Reflection Questions
At the end of your notebook, answer these questions:

What were the main steps for ingesting data from a REST API vs. a CSV?

What are some possible challenges or error scenarios for each ingestion method?

For your workflow, when would you prefer an API vs. a CSV file?

Here are the answers to the reflection questions:

**What were the main steps for ingesting data from a REST API vs. a CSV?**

**REST API:**
1. Import necessary libraries (e.g., `requests` for fetching data, `pandas` for data manipulation).
2. Define the API endpoint URL.
3. Make an HTTP GET request to the API endpoint using `requests.get()`.
4. Parse the JSON response from the API using the `.json()` method.
5. Convert the relevant part of the JSON data (e.g., the 'results' list) into a pandas DataFrame using `pd.json_normalize()`.
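
The flattening in step 5 is what produces dotted column names such as `name.first` in the API output. A minimal offline sketch of that behavior follows; the `payload` dictionary is a hypothetical stand-in shaped like the API response, with made-up values.

```python
import pandas as pd

# Hypothetical payload shaped like the API response (values are made up).
payload = {
    "results": [
        {"gender": "female", "name": {"first": "Ana", "last": "Lopez"}, "dob": {"age": 36}},
        {"gender": "male", "name": {"first": "Ben", "last": "Okoro"}, "dob": {"age": 41}},
    ],
    "info": {"results": 2},
}

# Only the list under "results" holds records; nested dicts become dotted columns.
df_flat = pd.json_normalize(payload["results"])
print(df_flat.columns.tolist())

# The separator is configurable, which helps when dots in column names are awkward.
df_underscore = pd.json_normalize(payload["results"], sep="_")
print(df_underscore.columns.tolist())
```

Passing the whole `payload` instead of `payload["results"]` would yield a single row containing the raw list, which is why the notebook selects `data['results']` first.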

**CSV File:**
1. Import the `pandas` library.
2. Define the URL or file path to the CSV file.
3. Read the CSV file directly into a pandas DataFrame using `pd.read_csv()`.

**What are some possible challenges or error scenarios for each ingestion method?**

**REST API:**
*   **API Rate Limits:** APIs often limit how many requests you can make in a given time frame. Exceeding the limit typically returns an error response such as HTTP 429 (Too Many Requests).
*   **Authentication/Authorization:** Many APIs require API keys or tokens for access, which need to be handled securely.
*   **API Changes:** The API structure or endpoint might change, breaking your code.
*   **Network Issues:** Connectivity problems or slow network speeds can cause requests to fail or time out.
*   **Data Format Inconsistencies:** Even when the response is JSON, nested or inconsistent structures (missing keys, fields whose type varies between records) require careful parsing.
*   **API Downtime:** The API service could be temporarily unavailable.
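
Several of the API failure modes above (rate limits, network issues, downtime) can be softened with timeouts and retries. The following is a minimal sketch, not a definitive implementation: `fetch_json`, its parameters, and the choice to retry on HTTP 429 and 5xx responses are assumptions made for illustration, and the injectable `get` argument exists only so the function can be exercised without network access.

```python
import time

import requests


def fetch_json(url, retries=3, backoff_seconds=1.0, timeout=10, get=requests.get):
    """Fetch JSON from `url`, retrying on network errors, HTTP 429, and 5xx responses.

    Hypothetical helper for illustration; `get` can be swapped for a stub in tests.
    """
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=timeout)
            if response.status_code == 429 or response.status_code >= 500:
                # Rate limiting and server-side failures may succeed on a later attempt.
                last_error = RuntimeError(f"HTTP {response.status_code} from {url}")
            else:
                # Other 4xx codes (e.g. 401, 404) will not fix themselves: fail fast.
                response.raise_for_status()
                return response.json()
        except (requests.ConnectionError, requests.Timeout) as exc:
            # Connectivity problems and timeouts are also worth retrying.
            last_error = exc
        if attempt < retries - 1:
            # Exponential backoff: 1x, 2x, 4x ... the base delay.
            time.sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

In the notebook, `data = fetch_json(api_url)` could stand in for the `requests.get(...)` and `.json()` pair. Authentication is left out here because the Random User Generator endpoint used above does not require a key.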

**CSV File:**
*   **Incorrect File Path or URL:** The file might not exist at the specified location.
*   **Incorrect Delimiter or Encoding:** CSV files can use different delimiters (comma, semicolon, tab) or character encodings, which can cause parsing issues.
*   **Malformed Data:** Rows with inconsistent field counts or unquoted delimiters can raise parsing errors, while missing values load as `NaN` and can silently change a column's type.
*   **Large File Size:** Very large CSV files can consume a lot of memory and take a long time to load.
*   **Changes in File Structure:** If columns are added, removed, or reordered, the code reading the file may break.
*   **Data Type Issues:** pandas may infer unintended data types (for example, reading identifiers with leading zeros as integers), requiring an explicit `dtype` or manual conversion.
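
Most of the CSV issues above map onto explicit `pd.read_csv` arguments. The sketch below uses a small, made-up, semicolon-delimited sample held in memory (the column names and values are hypothetical) to show delimiter, encoding, type, missing-value, and bad-row handling, plus chunked reading for large files. It assumes pandas 1.3 or later, where `on_bad_lines` is available.

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-delimited, Latin-1 encoded,
# one row with an extra field, and "?" used as a missing-value marker.
raw_text = (
    "id;city;temp\n"
    "1;Oslo;3.5\n"
    "2;Málaga;21.0;extra\n"
    "3;Pune;?\n"
    "4;Münster;9.0\n"
)
raw_bytes = raw_text.encode("latin-1")

df_sample = pd.read_csv(
    io.BytesIO(raw_bytes),
    sep=";",                                  # delimiter is not a comma
    encoding="latin-1",                       # match the file's actual encoding
    dtype={"id": "int64", "city": "string"},  # do not rely on type inference
    na_values=["?"],                          # treat "?" as missing
    on_bad_lines="skip",                      # drop rows with too many fields (pandas >= 1.3)
)
print(df_sample)

# For very large files, read in chunks instead of loading everything at once.
total_rows = 0
for chunk in pd.read_csv(
    io.BytesIO(raw_bytes), sep=";", encoding="latin-1",
    na_values=["?"], on_bad_lines="skip", chunksize=2,
):
    total_rows += len(chunk)
print("Rows read in chunks:", total_rows)
```

Skipping bad rows keeps the load running but discards data silently, so `on_bad_lines="warn"` may be the safer choice while exploring an unfamiliar file.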

**For your workflow, when would you prefer an API vs. a CSV file?**

**Prefer API when:**
*   You need real-time or near-real-time data.
*   The data source is dynamic and updated frequently.
*   You only need a subset of the data based on specific parameters (APIs often allow filtering).
*   You need to interact with the data (e.g., send data back to the source).
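
Filtering through an API usually means passing query parameters. The sketch below builds such a request without sending it, so it runs offline. The `results` parameter appears in the notebook's URL; the `nat` and `inc` parameter names are assumptions about the Random User Generator API and should be checked against its documentation.

```python
import requests

base_url = "https://randomuser.me/api/"
params = {
    "results": 5,              # number of records, as in the notebook's URL
    "nat": "ca,es",            # assumed parameter: restrict nationalities
    "inc": "name,email,nat",   # assumed parameter: restrict returned fields
}

# Prepare the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)
```

Calling `requests.get(base_url, params=params, timeout=10)` would send the same query, letting `requests` handle the URL encoding instead of concatenating strings by hand.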

**Prefer CSV file when:**
*   You need a static snapshot of data that is rarely updated.
*   You need to process the entire dataset at once.
*   The data source is readily available as a file and doesn't require complex querying.
*   You are working with historical data archives.
*   Simplicity and ease of use are paramount for smaller datasets.