In [1]:


# Step 1: Set Up the Environment
# Install required libraries (requests and pandas)
%pip install requests pandas -q



In [6]:
# Step 2: Ingest Data from a Public REST API
# Example: Fetch per-country COVID-19 statistics from the disease.sh API.
import requests
import pandas as pd

api_url = "https://disease.sh/v3/covid-19/countries"
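response = requests.get(api_url, timeout=10)  # avoid hanging indefinitely
response.raise_for_status()  # fail fast on HTTP 4xx/5xx responses
data = response.json()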

# Convert the results to a DataFrame
df_api = pd.json_normalize(data)
print("Data from REST API:")
print(df_api.head())

Data from REST API:
         updated      country   cases  todayCases  deaths  todayDeaths  \
0  1754302476454  Afghanistan  234174           0    7996            0   
1  1754302476442      Albania  334863           0    3605            0   
2  1754302476446      Algeria  272010           0    6881            0   
3  1754302476502      Andorra   48015           0     165            0   
4  1754302476477       Angola  107327           0    1937            0   

   recovered  todayRecovered  active  critical  ...  oneTestPerPeople  \
0     211080               0   15098         0  ...                29   
1     330233               0    1025         0  ...                 1   
2     183061               0   82068         0  ...               196   
3          0               0   47850         0  ...                 0   
4     103419               0    1971         0  ...                23   

   activePerOneMillion  recoveredPerOneMillion  criticalPerOneMillion  \
0               370.46 

In [11]:


# Step 3: Ingest Data from a CSV File
# Example using a local sample leads file (leads-100.csv in the working directory):

csv_path = "leads-100.csv"
df_csv = pd.read_csv(csv_path)
print("\nData from CSV File:")
print(df_csv.head())




Data from CSV File:
   Index       Account Id       Lead Owner First Name Last Name  \
0      1  0970F99ED4a2CE4    Bethany Dixon     Victor   Cochran   
1      2  e9AABddbCA4AFee  Andres Callahan    Maureen   Fuentes   
2      3  99CF54fE56dDc5e     Angel Ortega      Ralph   Murillo   
3      4  5B8C661A897ACE3    Dana Mcdonald    Richard    Obrien   
4      5  2C7aEb5e8F432Ab      Sharon Cruz      Terri     Perry   

                   Company                Phone 1               Phone 2  \
0  Grimes, Madden and Huff        +1-532-344-1362            2754733370   
1               Crosby Inc  001-672-799-2170x5610  +1-887-577-1205x5686   
2         Velasquez-Hardin      985.939.5411x2641   +1-852-904-1856x071   
3              Barrett Ltd           223.427.4047    (240)452-2332x4601   
4             Maddox Group             4953467238          802-739-3164   

                    Email 1                       Email 2  \
0  hatkinson@mclaughlin.com             ygibbs@guzman.com   
1  

In [12]:


# Step 4: Inspect and Compare Data
# Print the column names and a summary (dtypes, non-null counts) for both DataFrames.
print("API Data Columns:", df_api.columns)
print("CSV Data Columns:", df_csv.columns)

print("\nAPI Data Info:")
df_api.info()  # info() prints its summary itself and returns None

print("\nCSV Data Info:")
df_csv.info()


API Data Columns: Index(['updated', 'country', 'cases', 'todayCases', 'deaths', 'todayDeaths',
       'recovered', 'todayRecovered', 'active', 'critical',
       'casesPerOneMillion', 'deathsPerOneMillion', 'tests',
       'testsPerOneMillion', 'population', 'continent', 'oneCasePerPeople',
       'oneDeathPerPeople', 'oneTestPerPeople', 'activePerOneMillion',
       'recoveredPerOneMillion', 'criticalPerOneMillion', 'countryInfo._id',
       'countryInfo.iso2', 'countryInfo.iso3', 'countryInfo.lat',
       'countryInfo.long', 'countryInfo.flag'],
      dtype='object')
CSV Data Columns: Index(['Index', 'Account Id', 'Lead Owner', 'First Name', 'Last Name',
       'Company', 'Phone 1', 'Phone 2', 'Email 1', 'Email 2', 'Website',
       'Source', 'Deal Stage', 'Notes'],
      dtype='object')

API Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------              


Step 5: Reflection Questions
At the end of your notebook, answer these questions:

What were the main steps for ingesting data from a REST API vs. a CSV?

What are some possible challenges or error scenarios for each ingestion method?

For your workflow, when would you prefer an API vs. a CSV file?

Here are the answers to the reflection questions:

**What were the main steps for ingesting data from a REST API vs. a CSV?**

**REST API:**
1. Import necessary libraries (e.g., `requests` for fetching data, `pandas` for data manipulation).
2. Define the API endpoint URL.
3. Make an HTTP GET request to the API endpoint using `requests.get()`.
4. Parse the JSON response from the API using the `.json()` method.
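5. Convert the relevant part of the JSON data into a pandas DataFrame using `pd.json_normalize()`. Here the response is already a top-level list of country records, so it is passed in directly; other APIs wrap the records in a key such as `results`.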
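
The five steps above can be sketched as two small helpers. This is a minimal sketch rather than the notebook's own code: the names `fetch_json` and `to_frame`, the 10-second timeout, and the two-record sample payload are assumptions made for illustration. The sample only mimics the shape of the disease.sh response so that the normalization step can run without network access.

```python
import pandas as pd
import requests


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and return the parsed JSON body (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()


def to_frame(records) -> pd.DataFrame:
    """Flatten a list of (possibly nested) JSON objects into a DataFrame."""
    return pd.json_normalize(records)


# Illustrative sample shaped like the API response; the values are made up.
sample = [
    {"country": "A", "cases": 10, "countryInfo": {"iso2": "AA", "lat": 1.0}},
    {"country": "B", "cases": 20, "countryInfo": {"iso2": "BB", "lat": 2.0}},
]
df_sample = to_frame(sample)

# Nested keys become dotted column names such as "countryInfo.iso2".
print(df_sample.columns.tolist())
```

With network access, `to_frame(fetch_json(api_url))` should produce a frame equivalent to `df_api`.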

**CSV File:**
1. Import the `pandas` library.
2. Define the URL or file path to the CSV file.
3. Read the CSV file directly into a pandas DataFrame using `pd.read_csv()`.
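
For comparison, the CSV path needs only one call. The sketch below reads from an in-memory string instead of `leads-100.csv` so that it runs anywhere; the two rows are made-up placeholders that reuse a few of the file's column names.

```python
import io

import pandas as pd

# Made-up two-row sample using some column names from leads-100.csv.
csv_text = """Index,Account Id,First Name,Last Name
1,ACC-001,Ada,Example
2,ACC-002,Bob,Sample
"""

# read_csv accepts a file path, a URL, or any file-like object.
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)
```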

**What are some possible challenges or error scenarios for each ingestion method?**

**REST API:**
*   **API Rate Limits:** APIs often limit how many requests you can make in a given time frame. Exceeding the limit typically returns an HTTP 429 (Too Many Requests) error.
*   **Authentication/Authorization:** Many APIs require API keys or tokens for access, which need to be handled securely.
*   **API Changes:** The API structure or endpoint might change, breaking your code.
*   **Network Issues:** Connectivity problems or slow network speeds can cause requests to fail or time out.
*   **Data Format Inconsistencies:** JSON responses vary in structure; nested objects (such as `countryInfo` in this dataset) must be flattened, and missing or renamed keys can require defensive parsing.
*   **API Downtime:** The API service could be temporarily unavailable.
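
Several of the API failure modes above (rate limits, transient network errors, brief downtime) can be softened with retries and timeouts. The following is one possible sketch using `requests` with urllib3's `Retry`; the retry count, backoff factor, status list, and the helper name `get_json` are illustrative assumptions, not requirements of any particular API.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry policy values are illustrative assumptions.
retry = Retry(
    total=3,                                     # at most 3 retries per request
    backoff_factor=0.5,                          # increasing wait between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # rate limit and server errors
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))


def get_json(url: str, timeout: float = 10.0):
    """Fetch JSON through the retrying session (hypothetical helper)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # surface errors that remain after retries
    return response.json()
```

A call such as `get_json(api_url)` would then retry the listed status codes before raising.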

**CSV File:**
*   **Incorrect File Path or URL:** The file might not exist at the specified location.
*   **Incorrect Delimiter or Encoding:** CSV files can use different delimiters (comma, semicolon, tab) or character encodings, which can cause parsing issues.
*   **Malformed Data:** Inconsistent rows, extra commas, or missing values can lead to errors during reading.
*   **Large File Size:** Very large CSV files can consume a lot of memory and take a long time to load.
*   **Changes in File Structure:** If columns are added, removed, or reordered, the code reading the file may break.
*   **Data Type Issues:** Pandas might infer incorrect data types, requiring manual conversion.

**For your workflow, when would you prefer an API vs. a CSV file?**

**Prefer API when:**
*   You need real-time or near-real-time data.
*   The data source is dynamic and updated frequently.
*   You only need a subset of the data based on specific parameters (APIs often allow filtering).
*   You need to interact with the data (e.g., send data back to the source).
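
Parameter-based filtering usually means passing query parameters with the request. The sketch below only builds the request URL, so it runs without network access. It assumes the disease.sh endpoint accepts a `sort` query parameter; that should be checked against the API's documentation before relying on it.

```python
import requests

base_url = "https://disease.sh/v3/covid-19/countries"

# Assumption: the endpoint accepts a `sort` query parameter.
prepared = requests.Request("GET", base_url, params={"sort": "cases"}).prepare()
print(prepared.url)
```

Sending it is then a matter of `requests.get(base_url, params={"sort": "cases"}, timeout=10)`.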

**Prefer CSV file when:**
*   You need a static snapshot of the data.
*   The data is not updated frequently.
*   You need to process the entire dataset at once.
*   The data source is readily available as a file and doesn't require complex querying.
*   You are working with historical data archives.
*   Simplicity and ease of use are paramount for smaller datasets.