# Day 08 – Python Scripting: Automate Data Fetching & Cleaning
### Objective: Build a reusable Python workflow that pulls, cleans, and saves real-world data.

In this notebook you’ll learn how to:
- Fetch data (CSV + JSON)
- Clean and preprocess it automatically
- Combine datasets
- Save cleaned data to disk
- Log every step


## 1️⃣ Setup & Imports

In [1]:
import pandas as pd
import requests
import json
import os
from datetime import datetime

# Create folders for raw and cleaned data
os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/cleaned', exist_ok=True)

def log(message):
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {message}")
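
The `log()` helper above prints timestamped messages to the console only. If you also want a record on disk, one option is the standard-library `logging` module. The sketch below is a possible variant, not part of the original notebook: the logger name `pipeline` and the path `logs/pipeline.log` are assumptions chosen for illustration.

```python
import logging
import os


def get_logger(name="pipeline", log_path="logs/pipeline.log"):
    """Return a logger that writes timestamped messages to the console and a file.

    `name` and `log_path` are hypothetical defaults chosen for this sketch.
    """
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured (e.g. the cell was re-run): avoid duplicate lines.
        return logger

    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "[%(asctime)s] %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
    )

    # Console output, same idea as the print-based helper.
    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    # File output, so each run leaves a trace on disk.
    log_dir = os.path.dirname(log_path)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    return logger


logger = get_logger()
logger.info("Setup complete")
```

Checking `logger.handlers` before configuring matters in a notebook, where re-running a cell would otherwise attach the same handlers again and print every message twice.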

## 2️⃣ Fetch Data from Public APIs and URLs

In [2]:
# Example 1: Fetch JSON data (COVID-19 summary)
json_url = "https://disease.sh/v3/covid-19/countries"
log("Fetching COVID-19 JSON data...")
response = requests.get(json_url, timeout=30)  # timeout so a stalled connection cannot hang the notebook

if response.status_code == 200:
    covid_data = response.json()
    countries_df = pd.json_normalize(covid_data)
    countries_df.to_csv('data/raw/covid_summary.csv', index=False)
    log(f"COVID data fetched successfully → {countries_df.shape[0]} rows")
else:
    log(f"Failed to fetch JSON data: {response.status_code}")

# Example 2: Fetch CSV from GitHub (sample movies dataset)
csv_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv"
log("Fetching Movies CSV data...")
movies_df = pd.read_csv(csv_url)
movies_df.to_csv('data/raw/movies.csv', index=False)
log(f"Movies data fetched successfully → {movies_df.shape[0]} rows")

[2025-10-28 10:18:00] Fetching COVID-19 JSON data...
[2025-10-28 10:18:02] COVID data fetched successfully → 231 rows
[2025-10-28 10:18:02] Fetching Movies CSV data...
[2025-10-28 10:18:03] Movies data fetched successfully → 146 rows
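
`pd.json_normalize` flattens nested JSON objects into dot-separated column names, which is why the COVID DataFrame ends up with columns such as `countryInfo.iso2`. A minimal offline sketch follows. The two records are hand-written placeholders that only imitate the shape of the API response; they are not real figures.

```python
import pandas as pd

# Hand-written placeholder records that imitate the nested shape of the API response.
sample_records = [
    {"country": "A", "cases": 10, "countryInfo": {"iso2": "AA", "lat": 1.0}},
    {"country": "B", "cases": 20, "countryInfo": {"iso2": "BB", "lat": 2.0}},
]

flat_df = pd.json_normalize(sample_records)

# Nested keys become "parent.child" columns.
print(list(flat_df.columns))
```

Working on a small local sample like this is also a convenient way to test cleaning code without calling the API on every run.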


## 3️⃣ Inspect and Clean the Datasets

In [7]:
# Basic inspection
log("Inspecting datasets...")
print(countries_df.head(3))
print(movies_df.head(3))

# Cleaning COVID data
log("Cleaning COVID dataset...")
covid_clean = countries_df[['country', 'cases', 'deaths', 'recovered']].copy()
covid_clean.rename(columns={'cases': 'covid_positive_confirmed'}, inplace=True)  # Rename 'cases' by name rather than by position
covid_clean.dropna(inplace=True)

# Cleaning Movies data
log("Cleaning Movies dataset...")
movies_clean = movies_df[['FILM', 'RottenTomatoes', 'IMDB', 'Metacritic']].copy()
movies_clean.dropna(inplace=True)
movies_clean.rename(columns={'FILM': 'film', 'RottenTomatoes': 'rotten_tomatoes', 'IMDB': 'imdb_rating', 'Metacritic': 'metacritic_rating'}, inplace=True)
log("Movies data cleaned successfully.")

[2025-10-28 10:31:35] Inspecting datasets...
         updated      country   cases  todayCases  deaths  todayDeaths  \
0  1761626698435  Afghanistan  234174           0    7996            0   
1  1761626698428      Albania  334863           0    3605            0   
2  1761626698431      Algeria  272010           0    6881            0   

   recovered  todayRecovered  active  critical  ...  oneTestPerPeople  \
0     211080               0   15098         0  ...                29   
1     330233               0    1025         0  ...                 1   
2     183061               0   82068         0  ...               196   

   activePerOneMillion  recoveredPerOneMillion  criticalPerOneMillion  \
0               370.46                 5179.32                    0.0   
1               357.59               115209.32                    0.0   
2              1809.65                 4036.61                    0.0   

   countryInfo._id countryInfo.iso2  countryInfo.iso3  countryInfo.lat  
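
The cleaning cell above selects columns by name, so it raises a bare `KeyError` if a source ever drops or renames a column. The sketch below shows one way to wrap the same three steps (select, drop missing rows, rename) with a more readable error. `select_and_clean` is a hypothetical helper and the DataFrame contains placeholder values only.

```python
import pandas as pd


def select_and_clean(df, columns, rename_dict=None):
    """Keep `columns`, drop rows with missing values in them, then rename.

    Hypothetical helper for this sketch; raises a readable error if a column is absent.
    """
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    cleaned = df[columns].dropna()
    if rename_dict:
        cleaned = cleaned.rename(columns=rename_dict)
    return cleaned.reset_index(drop=True)


# Placeholder data: the second row has a missing 'recovered' value.
raw = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "cases": [10, 20, 30],
        "deaths": [1, 2, 3],
        "recovered": [9, None, 27],
        "extra": ["x", "y", "z"],
    }
)

clean = select_and_clean(
    raw,
    ["country", "cases", "deaths", "recovered"],
    rename_dict={"cases": "confirmed"},
)
print(clean)
```

Selecting the columns before calling `dropna()` means a missing value in an unused column (here `extra`) does not cost you a row.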

## 4️⃣ Combine and Save Cleaned Data

In [8]:
log("Saving cleaned datasets...")
covid_clean.to_csv('data/cleaned/covid_clean.csv', index=False)
movies_clean.to_csv('data/cleaned/movies_clean.csv', index=False)

log("Cleaned data saved successfully ✅")

[2025-10-28 10:31:39] Saving cleaned datasets...
[2025-10-28 10:31:39] Cleaned data saved successfully ✅
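
The cell above saves the two cleaned datasets separately, since the COVID and movies tables share no key to join on. When two sources do share a key, a merge is one way to combine them. The sketch below assumes a hypothetical second table keyed on `country`; both frames hold placeholder values, not data from this notebook.

```python
import pandas as pd

# Placeholder frames: `covid_part` mimics the cleaned COVID columns,
# `population_part` is a hypothetical second source that shares the `country` key.
covid_part = pd.DataFrame(
    {"country": ["A", "B", "C"], "covid_positive_confirmed": [10, 20, 30]}
)
population_part = pd.DataFrame(
    {"country": ["A", "B", "D"], "population": [1000, 2000, 4000]}
)

# A left join keeps every COVID row; `indicator=True` adds a `_merge` column
# that records whether a match was found in the second table.
combined = covid_part.merge(population_part, on="country", how="left", indicator=True)

unmatched = combined.loc[combined["_merge"] == "left_only", "country"].tolist()
print(f"Rows without a match: {unmatched}")
```

Logging the unmatched keys is worth the extra line: mismatched spellings of the same country name are a common reason for silently missing values after a join.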


## 5️⃣ Automate with Functions (Reusable Script Design)

In [5]:
def fetch_csv(url, save_path):
    log(f"Fetching CSV from {url}")
    df = pd.read_csv(url)
    df.to_csv(save_path, index=False)
    log(f"Saved → {save_path} ({df.shape[0]} rows)")
    return df

def clean_dataframe(df, dropna=True, rename_dict=None):
    if dropna:
        df = df.dropna()
    if rename_dict:
        df = df.rename(columns=rename_dict)
    return df

def main():
    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
    df = fetch_csv(url, 'data/raw/tips.csv')
    clean_df = clean_dataframe(df, rename_dict={'total_bill':'bill', 'tip':'gratuity'})
    clean_df.to_csv('data/cleaned/tips_clean.csv', index=False)
    log("Automation complete ✅")

if __name__ == "__main__":
    main()

[2025-10-28 10:18:52] Fetching CSV from https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
[2025-10-28 10:18:53] Saved → data/raw/tips.csv (244 rows)
[2025-10-28 10:18:53] Automation complete ✅
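
`fetch_csv` above downloads the file on every run. For a script that runs repeatedly, one possible refinement is to reuse the saved copy when it already exists. `fetch_csv_cached` below is a hypothetical variant, sketched under the assumption that the remote file rarely changes. The demonstration uses a temporary local file in place of the URL so that it runs offline.

```python
import os
import tempfile

import pandas as pd


def fetch_csv_cached(source, save_path, force=False):
    """Load `save_path` if it already exists; otherwise read `source` and save a copy.

    Hypothetical variant of `fetch_csv`; `source` may be a URL or a local path,
    since `pd.read_csv` accepts both. Returns (DataFrame, served_from_cache).
    """
    if os.path.exists(save_path) and not force:
        return pd.read_csv(save_path), True
    df = pd.read_csv(source)
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
    df.to_csv(save_path, index=False)
    return df, False


# Offline demonstration: a temporary local file stands in for the remote CSV.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.csv")
    pd.DataFrame({"total_bill": [10.5, 20.0], "tip": [1.5, 3.0]}).to_csv(
        source, index=False
    )

    target = os.path.join(tmp, "raw", "tips.csv")
    first_df, first_cached = fetch_csv_cached(source, target)
    second_df, second_cached = fetch_csv_cached(source, target)
    print(first_cached, second_cached)
```

Passing `force=True` skips the saved copy and fetches again, which is useful when you know the source has been updated.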


## ✅ Summary

- You learned to automate repetitive data collection.
- You cleaned both JSON and CSV data.
- You created a reusable automation pattern (`main()`, logging, modular helpers).
- This fetch, clean, save structure is a common starting point for data pipelines that prepare data before model training.

**Next → Day 09: Exploratory Data Analysis (EDA) Automation**