### Part 1: Data Collection (Raw data vs Clean data)

The weather data comes from the OpenWeatherMap 5-day forecast API, which returns forecasts covering 5 days (120 hours) at 3-hour intervals.

Here’s how the data is structured:

The `list` field of the response holds the forecasts, one per 3-hour interval.
For every city this gives 8 forecasts per day, or 40 forecast points over the 5 days (120 hours / 3 hours).
For example:

- Forecast 1: 00:00 to 03:00 (Day 1)
- Forecast 2: 03:00 to 06:00 (Day 1)
- Forecast 3: 06:00 to 09:00 (Day 1)

And so on...
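
The full sequence of windows can be generated rather than listed by hand. The sketch below is only an illustration: it assumes the forecast starts at midnight on 2025-01-27 (the first timestamp in the sample output further down), and `forecast_windows` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime, timedelta

STEP = timedelta(hours=3)
POINTS = (5 * 24) // 3  # 120 hours / 3 hours = 40 forecast points


def forecast_windows(start, points=POINTS):
    """Return one label per 3-hour window, in the style of the list above."""
    labels = []
    for i in range(points):
        begin = start + i * STEP
        end = begin + STEP
        day = i // 8 + 1  # 8 windows per day
        labels.append(f"Forecast {i + 1}: {begin:%H:%M} to {end:%H:%M} (Day {day})")
    return labels


# Assumed start time, taken from the first row of the sample output
windows = forecast_windows(datetime(2025, 1, 27, 0, 0))
for label in windows[:3]:
    print(label)
print(len(windows))
```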

Each forecast gives the weather information for that specific 3-hour window, such as temperature, humidity, wind speed, and cloud coverage.
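
To make the nesting concrete, here is a hand-trimmed stand-in for one city's response. It keeps only the fields this notebook reads, and the values are copied from the first two Boulder rows of the sample output; a real response has about 40 entries and more fields. `summarize` is a hypothetical helper used only for this illustration.

```python
# Hand-built stand-in, not an actual API response
sample_response = {
    "list": [
        {
            "dt_txt": "2025-01-27 00:00:00",
            "main": {"temp": -7.3, "temp_min": -8.52, "temp_max": -7.3,
                     "humidity": 79, "pressure": 1027},
            "wind": {"speed": 2.5},
            "visibility": 10000,
            "clouds": {"all": 90},
            "weather": [{"description": "overcast clouds"}],
        },
        {
            "dt_txt": "2025-01-27 03:00:00",
            "main": {"temp": -8.35, "temp_min": -9.18, "temp_max": -8.35,
                     "humidity": 79, "pressure": 1029},
            "wind": {"speed": 3.91},
            "visibility": 10000,
            "clouds": {"all": 43},
            "weather": [{"description": "scattered clouds"}],
        },
    ]
}


def summarize(forecast):
    """Pull a few fields out of one forecast entry."""
    return (
        forecast["dt_txt"],
        forecast["main"]["temp"],
        forecast["weather"][0]["description"],
        # "rain" is missing from these entries, so .get() falls back to None
        forecast.get("rain", {}).get("3h"),
    )


summaries = [summarize(f) for f in sample_response["list"]]
for row in summaries:
    print(row)
```

The chained `.get()` calls are the same pattern the notebook uses, so a missing key such as `rain` produces a fallback value instead of a `KeyError`.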

In [1]:
import requests
import pandas as pd
import json

# Function to fetch weather data for a city
def get_weather_data(city, api_key):
    base_url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {
        "q": city,
        "appid": api_key,
        "units": "metric"  # Metric for Celsius
    }
    
    try:
        response = requests.get(base_url, params=params, timeout=10)  # Time out instead of hanging indefinitely
        response.raise_for_status()  # Raise an exception for HTTP errors
        return response.json()  # Return the raw JSON response weather data for the city.
    except requests.exceptions.RequestException as e:
        print(f"Error fetching weather data for {city}: {e}")
        return None

# Function to collect weather data in a structured format for pandas DataFrame
def collect_weather_data(city, weather_data):
    weather_list = []
    if not weather_data:
        print(f"No data available for {city}.")
        return weather_list
    
    for forecast in weather_data.get("list", []):
        dt_txt = forecast.get("dt_txt", "N/A")
        temp = forecast.get("main", {}).get("temp", "N/A")
        temp_min = forecast.get("main", {}).get("temp_min", "N/A")
        temp_max = forecast.get("main", {}).get("temp_max", "N/A")
        humidity = forecast.get("main", {}).get("humidity", "N/A")
        pressure = forecast.get("main", {}).get("pressure", "N/A")
        wind_speed = forecast.get("wind", {}).get("speed", "N/A")
        visibility = forecast.get("visibility", "N/A")
        cloud_coverage = forecast.get("clouds", {}).get("all", "N/A")
        weather_description = (forecast.get("weather") or [{}])[0].get("description", "No description")  # Also handles an empty "weather" list
        rain_volume = forecast.get("rain", {}).get("3h", "No rain")
        snow_volume = forecast.get("snow", {}).get("3h", "No snow")
        
        weather_list.append({
            "City": city,
            "Date & Time": dt_txt,
            "Temperature": temp,
            "Temp Min": temp_min,
            "Temp Max": temp_max,
            "Humidity": humidity,
            "Pressure": pressure,
            "Wind Speed": wind_speed,
            "Visibility": visibility,
            "Cloud Coverage": cloud_coverage,
            "Weather Description": weather_description.capitalize(),
            "Rain Volume (last 3h)": rain_volume,
            "Snow Volume (last 3h)": snow_volume
        })

    # Returns: A list of dictionaries with weather information.
    return weather_list

# Replace with your actual API key
API_KEY = "YOUR_API_KEY"

# List of cities in Colorado
colorado_cities = ["Boulder", "Denver", "Colorado Springs", "Fort Collins", "Aurora", "Grand Junction", "Pueblo", "Greeley", "Lakewood",
                   "Thornton", "Arvada", "Westminster", "Centennial", "Loveland", "Longmont"]
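
A key pasted into a notebook is easy to leak when the notebook is shared. One possible alternative, sketched below, is to read it from an environment variable. The variable name `OPENWEATHER_API_KEY` and the helper `load_api_key` are assumptions made for this illustration; they are not defined by OpenWeatherMap or by this notebook.

```python
import os


def load_api_key(env=None, var_name="OPENWEATHER_API_KEY"):
    """Return the API key from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Demonstration with a stand-in mapping instead of the real environment
demo_key = load_api_key({"OPENWEATHER_API_KEY": "demo-key-123"})
print(demo_key)
```

In the notebook itself this would be used as `API_KEY = load_api_key()`.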

In [2]:
# Create a dictionary to store raw data for all cities
raw_data_dict = {}

# Collect weather data for each city
all_weather_data = []
for city in colorado_cities:
    print(f"Fetching data for {city}...")
    raw_data = get_weather_data(city, API_KEY)
    if raw_data:
        raw_data_dict[city] = raw_data  # Save raw data for the city
    weather_list = collect_weather_data(city, raw_data)
    all_weather_data.extend(weather_list)

# Save the raw data to a JSON file
raw_data_file = "colorado_weather_raw.json"
with open(raw_data_file, "w") as json_file:
    json.dump(raw_data_dict, json_file, indent=4)
print(f"Raw weather data saved to {raw_data_file}")

Fetching data for Boulder...
Fetching data for Denver...
Fetching data for Colorado Springs...
Fetching data for Fort Collins...
Fetching data for Aurora...
Fetching data for Grand Junction...
Fetching data for Pueblo...
Fetching data for Greeley...
Fetching data for Lakewood...
Fetching data for Thornton...
Fetching data for Arvada...
Fetching data for Westminster...
Fetching data for Centennial...
Fetching data for Loveland...
Fetching data for Longmont...
Raw weather data saved to colorado_weather_raw.json
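
Once the raw file is written, a quick sanity check is to reload it and count the forecast entries per city. The sketch below assumes the file has the same layout as `raw_data_dict` (city name mapped to the API response); `forecast_counts` is a hypothetical helper, and the demonstration uses a tiny stand-in file so it runs without calling the API.

```python
import json
import os
import tempfile


def forecast_counts(path):
    """Map each city in a raw JSON dump to its number of forecast entries."""
    with open(path) as json_file:
        raw = json.load(json_file)
    return {city: len(payload.get("list", [])) for city, payload in raw.items()}


# Stand-in data with the same shape as raw_data_dict, not real forecasts
sample_raw = {"Boulder": {"list": [{}, {}, {}]}, "Denver": {"list": [{}]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_raw.json")
    with open(sample_path, "w") as json_file:
        json.dump(sample_raw, json_file, indent=4)
    counts = forecast_counts(sample_path)

print(counts)
```

Pointed at `colorado_weather_raw.json`, each city would be expected to show 40 entries if every request returned the full 5-day forecast.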


In [3]:
# Create a pandas DataFrame from the structured data
weather_df = pd.DataFrame(all_weather_data)
print("Structured Data:\n")
weather_df.head()

Structured Data:



| | City | Date & Time | Temperature | Temp Min | Temp Max | Humidity | Pressure | Wind Speed | Visibility | Cloud Coverage | Weather Description | Rain Volume (last 3h) | Snow Volume (last 3h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Boulder | 2025-01-27 00:00:00 | -7.3 | -8.52 | -7.3 | 79 | 1027 | 2.5 | 10000 | 90 | Overcast clouds | No rain | No snow |
| 1 | Boulder | 2025-01-27 03:00:00 | -8.35 | -9.18 | -8.35 | 79 | 1029 | 3.91 | 10000 | 43 | Scattered clouds | No rain | No snow |
| 2 | Boulder | 2025-01-27 06:00:00 | -8.06 | -8.06 | -8.06 | 78 | 1029 | 4.66 | 10000 | 11 | Few clouds | No rain | No snow |
| 3 | Boulder | 2025-01-27 09:00:00 | -7.21 | -7.21 | -7.21 | 74 | 1028 | 5.12 | 10000 | 4 | Clear sky | No rain | No snow |
| 4 | Boulder | 2025-01-27 12:00:00 | -6.62 | -6.62 | -6.62 | 70 | 1026 | 6.09 | 10000 | 4 | Clear sky | No rain | No snow |
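
The rain and snow columns mix text placeholders ("No rain", "No snow") with numbers, and the timestamps are plain strings. One possible cleaning step, sketched below under the assumption that a missing volume means 0.0, converts these columns to numeric and datetime types. `clean_weather_frame` is a hypothetical helper, and the 0.25 snow value in the sample is made up to show a mixed column.

```python
import pandas as pd


def clean_weather_frame(df):
    """Return a copy with typed timestamp and precipitation columns."""
    cleaned = df.copy()
    # Unparseable timestamps (such as "N/A") become NaT
    cleaned["Date & Time"] = pd.to_datetime(cleaned["Date & Time"], errors="coerce")
    for col in ("Rain Volume (last 3h)", "Snow Volume (last 3h)"):
        # Text placeholders become NaN, then 0.0 (assumption: no entry means no precipitation)
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce").fillna(0.0)
    return cleaned


# Two raw-style rows; the snow value 0.25 is invented for illustration
raw_rows = pd.DataFrame({
    "City": ["Boulder", "Boulder"],
    "Date & Time": ["2025-01-27 00:00:00", "2025-01-27 03:00:00"],
    "Temperature": [-7.3, -8.35],
    "Rain Volume (last 3h)": ["No rain", "No rain"],
    "Snow Volume (last 3h)": ["No snow", 0.25],
})

cleaned_rows = clean_weather_frame(raw_rows)
print(cleaned_rows.dtypes)
```

Working on a copy keeps the raw frame available for comparison with the cleaned one.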


In [4]:
# Save the DataFrame to a CSV file
csv_file = "colorado_weather_data.csv"
weather_df.to_csv(csv_file, index=False)
print(f"Structured weather data saved to {csv_file}")

Structured weather data saved to colorado_weather_data.csv


In [5]:
weather_df.shape

(600, 13)
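
The shape matches the expected size: 15 cities with 40 forecast points each give 600 rows, and the 13 columns are the fields collected per forecast. A per-city check can confirm that no city came back short. The sketch below uses a toy frame rather than `weather_df`, and `incomplete_cities` is a hypothetical helper.

```python
import pandas as pd

FORECASTS_PER_CITY = (5 * 24) // 3  # 40 points: 5 days at 3-hour intervals


def incomplete_cities(df, expected=FORECASTS_PER_CITY):
    """Return {city: row_count} for cities that lack the expected number of rows."""
    counts = df.groupby("City").size()
    return {city: int(n) for city, n in counts[counts != expected].items()}


# Toy frame: Boulder is complete, Denver is one forecast short
toy_df = pd.DataFrame({"City": ["Boulder"] * 40 + ["Denver"] * 39})
short = incomplete_cities(toy_df)
print(short)
print(15 * FORECASTS_PER_CITY)  # expected row count for 15 cities
```

An empty dictionary from `incomplete_cities(weather_df)` would indicate that every city has the full set of forecasts.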