# Weather Data Analysis

## Introduction

This notebook analyzes a small sample of daily weather readings from 2018 to 2020 (three January days per year). It writes the sample data to CSV files, reads it back with pandas, calculates the average temperature and humidity for each year, identifies each year's temperature extremes, and visualizes temperature trends over time.

---


In [None]:
import os
import csv
import pandas as pd
import matplotlib.pyplot as plt
import logging


## Data Preparation

### Creating Sample Data


In [None]:
# Create 'data' directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Sample data for 2018
data_2018 = [
    ['Date', 'Temperature', 'Humidity'],
    ['2018-01-01', '25', '60'],
    ['2018-01-02', '28', '65'],
    ['2018-01-03', '22', '55'],
]

# Write to CSV file
with open('data/weather_2018.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_2018)

# Sample data for 2019
data_2019 = [
    ['Date', 'Temperature', 'Humidity'],
    ['2019-01-01', '24', '58'],
    ['2019-01-02', '26', '62'],
    ['2019-01-03', '23', '57'],
]

with open('data/weather_2019.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_2019)

# Sample data for 2020
data_2020 = [
    ['Date', 'Temperature', 'Humidity'],
    ['2020-01-01', '27', '66'],
    ['2020-01-02', '29', '70'],
    ['2020-01-03', '26', '64'],
]

with open('data/weather_2020.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_2020)
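
The write logic above is repeated once per year. As a minimal sketch, the same three files could be produced in a loop. `HEADER`, `SAMPLE_ROWS`, and `write_sample_files` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import csv
from pathlib import Path

HEADER = ['Date', 'Temperature', 'Humidity']

# Same sample rows as in the cell above, keyed by year (hypothetical structure).
SAMPLE_ROWS = {
    2018: [['2018-01-01', '25', '60'], ['2018-01-02', '28', '65'], ['2018-01-03', '22', '55']],
    2019: [['2019-01-01', '24', '58'], ['2019-01-02', '26', '62'], ['2019-01-03', '23', '57']],
    2020: [['2020-01-01', '27', '66'], ['2020-01-02', '29', '70'], ['2020-01-03', '26', '64']],
}

def write_sample_files(directory='data'):
    """Write one weather_<year>.csv per year and return the paths written."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for year, rows in SAMPLE_ROWS.items():
        path = out_dir / f'weather_{year}.csv'
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(HEADER)   # header row first
            writer.writerows(rows)    # then the data rows
        paths.append(path)
    return paths

paths = write_sample_files()
```

Keeping the rows in one dictionary means adding another year only requires a new entry, not another copy of the write block.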


## Data Reading

### Reading CSV Data


In [None]:
def read_csv(file_path):
    return pd.read_csv(file_path)

# Read data into DataFrames
df_2018 = read_csv('data/weather_2018.csv')
df_2019 = read_csv('data/weather_2019.csv')
df_2020 = read_csv('data/weather_2020.csv')

# Display first few rows
print("Data 2018:")
display(df_2018.head())
print("Data 2019:")
display(df_2019.head())
print("Data 2020:")
display(df_2020.head())
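
`pd.read_csv` infers column types on its own, so clean numeric columns usually arrive as integers or floats without further conversion. The sketch below reads from an in-memory string instead of the files above (`csv_text` is an illustrative stand-in) and also passes `parse_dates` so the `Date` column is parsed while reading.

```python
import io
import pandas as pd

# Illustrative stand-in for one of the CSV files written earlier.
csv_text = (
    "Date,Temperature,Humidity\n"
    "2018-01-01,25,60\n"
    "2018-01-02,28,65\n"
    "2018-01-03,22,55\n"
)

# parse_dates asks pandas to convert the listed columns to datetime64.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# Inspect the inferred type of each column.
print(df.dtypes)
```

If a column contains a malformed entry, pandas typically falls back to the generic `object` type for that column, which is when an explicit conversion step becomes useful.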


## Data Processing

### Processing Weather Data


In [None]:
# Setup logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

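def to_float(value):
    """Convert value to float; log a warning and return NaN if that fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings such as 'n/a'
        logging.warning(f"Unable to convert '{value}' to float.")
        return float('nan')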

def process_weather(df):
    """Convert Temperature and Humidity to floats in place (df is modified) and return their means."""
    df['Temperature'] = df['Temperature'].apply(to_float)
    df['Humidity'] = df['Humidity'].apply(to_float)
    
    # Check for missing values
    if df['Temperature'].isnull().any():
        logging.warning("Missing values detected in Temperature column.")
    if df['Humidity'].isnull().any():
        logging.warning("Missing values detected in Humidity column.")
    
    avg_temp = df['Temperature'].mean()
    avg_humidity = df['Humidity'].mean()
    
    return avg_temp, avg_humidity

# Process weather data
avg_temp_2018, avg_humidity_2018 = process_weather(df_2018)
avg_temp_2019, avg_humidity_2019 = process_weather(df_2019)
avg_temp_2020, avg_humidity_2020 = process_weather(df_2020)

# Print averages
print(f"2018 - Average Temperature: {avg_temp_2018:.2f}, Average Humidity: {avg_humidity_2018:.2f}")
print(f"2019 - Average Temperature: {avg_temp_2019:.2f}, Average Humidity: {avg_humidity_2019:.2f}")
print(f"2020 - Average Temperature: {avg_temp_2020:.2f}, Average Humidity: {avg_humidity_2020:.2f}")
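
As an alternative to a row-by-row conversion, pandas offers `pd.to_numeric` with `errors='coerce'`, which turns values it cannot parse into NaN in one vectorized call. The sketch below is one possible variant, not the notebook's method; `coerce_numeric` and the malformed `'n/a'` reading are made up for illustration.

```python
import pandas as pd

def coerce_numeric(df, columns=('Temperature', 'Humidity')):
    """Return a copy of df with the given columns made numeric; bad values become NaN."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

raw = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-02', '2018-01-03'],
    'Temperature': ['25', 'n/a', '22'],  # 'n/a' is a made-up malformed reading
    'Humidity': ['60', '65', '55'],
})

clean = coerce_numeric(raw)

# mean() skips NaN by default, so the bad reading is excluded from the average.
print(clean['Temperature'].mean())
print(clean['Temperature'].isna().sum(), "value(s) could not be converted")
```

Unlike the in-place approach, this version works on a copy, so the original strings remain available for inspection.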


## Statistical Analysis

### Calculating Minimum and Maximum Temperatures


In [None]:
def get_min_max(df):
    """Return (min, max) of the Temperature column; expects it to be numeric already."""
    min_temp = df['Temperature'].min()
    max_temp = df['Temperature'].max()
    return min_temp, max_temp

# Get min and max temperatures
min_temp_2018, max_temp_2018 = get_min_max(df_2018)
min_temp_2019, max_temp_2019 = get_min_max(df_2019)
min_temp_2020, max_temp_2020 = get_min_max(df_2020)

# Print min and max temperatures
print(f"2018 - Min Temperature: {min_temp_2018}, Max Temperature: {max_temp_2018}")
print(f"2019 - Min Temperature: {min_temp_2019}, Max Temperature: {max_temp_2019}")
print(f"2020 - Min Temperature: {min_temp_2020}, Max Temperature: {max_temp_2020}")
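
The per-year statistics can also be computed in a single pass once the data sits in one DataFrame. The following is a minimal sketch using `groupby` and `agg`; it rebuilds the sample temperatures inline so that it runs on its own, and `summary` is a name chosen here for illustration.

```python
import pandas as pd

# The nine sample readings from the Data Preparation section.
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2018-01-01', '2018-01-02', '2018-01-03',
        '2019-01-01', '2019-01-02', '2019-01-03',
        '2020-01-01', '2020-01-02', '2020-01-03',
    ]),
    'Temperature': [25.0, 28.0, 22.0, 24.0, 26.0, 23.0, 27.0, 29.0, 26.0],
})

# Group rows by calendar year and compute min, max, and mean per group.
summary = (
    df.groupby(df['Date'].dt.year)['Temperature']
      .agg(['min', 'max', 'mean'])
      .rename_axis('Year')
)

print(summary)
```

This yields one row per year, so supporting additional years needs no new variables or print statements.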


## Data Visualization

### Temperature Trends Over Time


In [None]:
# Add 'Year' column to each DataFrame
df_2018['Year'] = 2018
df_2019['Year'] = 2019
df_2020['Year'] = 2020

# Combine DataFrames
df_all = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)

# Convert 'Date' column to datetime
df_all['Date'] = pd.to_datetime(df_all['Date'])

# Sort by date
df_all = df_all.sort_values('Date')


In [None]:
# Plot Temperature over time
plt.figure(figsize=(10, 6))
plt.plot(df_all['Date'], df_all['Temperature'], marker='o')
plt.title('Temperature Over Time')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


### Temperature Distribution


In [None]:
# Histogram of temperatures
plt.figure(figsize=(8, 6))
plt.hist(df_all['Temperature'].dropna(), bins=10, edgecolor='black')
plt.title('Distribution of Temperatures')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


## Conclusion

Average temperature and humidity were calculated for each year from 2018 to 2020, and the minimum and maximum temperatures were identified for each year. The plots show how temperature changes over time and how temperatures are distributed across the combined dataset.

---


## Further Enhancements

- **Error Handling**: Improve robustness by handling missing or malformed data more effectively.
- **Data Validation**: Implement data validation checks before processing.
- **Scalability**: Adapt the code to process larger datasets efficiently, possibly integrating parallel processing.
- **Modular Code**: Refactor code into reusable modules or classes for better maintainability.
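
To illustrate the data validation idea from the list above, here is a minimal sketch of a check that could run before processing. `REQUIRED_COLUMNS` and `validate_weather` are hypothetical names, and the 0 to 100 range assumes that `Humidity` holds relative humidity in percent, which the notebook does not state.

```python
import pandas as pd

REQUIRED_COLUMNS = ('Date', 'Temperature', 'Humidity')

def validate_weather(df):
    """Return a list of problems found in df; an empty list means no issue was detected."""
    problems = []

    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need all columns

    # Dates that cannot be parsed become NaT with errors='coerce'.
    dates = pd.to_datetime(df['Date'], errors='coerce')
    if dates.isna().any():
        problems.append("unparseable or missing dates")

    # Values that cannot be parsed as numbers become NaN.
    for col in ('Temperature', 'Humidity'):
        values = pd.to_numeric(df[col], errors='coerce')
        if values.isna().any():
            problems.append(f"non-numeric or missing values in {col}")

    # Assumption: Humidity is relative humidity in percent.
    humidity = pd.to_numeric(df['Humidity'], errors='coerce')
    if ((humidity < 0) | (humidity > 100)).any():
        problems.append("humidity outside 0-100")

    return problems

good = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-02'],
    'Temperature': ['25', '28'],
    'Humidity': ['60', '65'],
})
bad = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-02'],
    'Temperature': ['25', 'oops'],   # made-up malformed value
    'Humidity': ['60', '140'],       # made-up out-of-range value
})

print(validate_weather(good))
print(validate_weather(bad))
```

Returning a list of messages rather than raising on the first problem lets the caller decide whether to log, skip the file, or stop.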

---