# Question 3: Weather Data Analysis

**Task**: Convert weather data (temperature measured on certain days) to tidy format, plot simple graphs, and discuss monthly patterns.

## 1. Load Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style('whitegrid')

## 2. Load Data

In [None]:
# Load weather data
df = pd.read_csv('Data/Weather Data/weather.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())
print("\nColumn names:")
print(df.columns.tolist())

## 3. Convert to Tidy Format

**Tidy Data Transformation**:
- **Original**: Wide format with columns d1-d31 for each day of the month
- **Target**: Long format with one row per observation (country-year-month-day-element)
- **Variables**: Country, year, month, day, element (tmax/tmin), temperature
- **Values**: Temperature readings in degrees Celsius

In [None]:
# Melt the dataframe to convert day columns (d1-d31) to rows
day_cols = [f'd{i}' for i in range(1, 32)]

df_tidy = df.melt(
    id_vars=['Country', 'year', 'month', 'element'],
    value_vars=day_cols,
    var_name='day',
    value_name='temperature'
)

# Extract day number from 'd1', 'd2', etc. (expand=False returns a Series, not a DataFrame)
df_tidy['day'] = df_tidy['day'].str.extract(r'(\d+)', expand=False).astype(int)

# Remove rows with missing temperature values (NA)
df_tidy = df_tidy.dropna(subset=['temperature'])

# Create a proper date column from the year, month, and day columns;
# impossible dates become NaT because of errors='coerce'
df_tidy['date'] = pd.to_datetime(
    df_tidy[['year', 'month', 'day']],
    errors='coerce'
)

# Remove invalid dates (e.g., February 30)
df_tidy = df_tidy.dropna(subset=['date'])

print(f"Tidy dataset shape: {df_tidy.shape}")
print("\nTidy format sample:")
print(df_tidy.head(10))

## 4. Pivot to Separate tmax and tmin

**Assumption**: `tmax` and `tmin` are two variables measured on the same day, so it is more useful to store them as separate columns for analysis.
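
One caveat when separating the columns with `pivot_table`: its default `aggfunc` is the mean, so duplicate readings for the same date and element would be averaged silently. The following is a minimal sketch on made-up data that checks for such duplicates before pivoting; the frame `toy` and its values are illustrative assumptions, not part of the weather dataset.

```python
import pandas as pd

# Made-up long-format readings; the last row duplicates (2015-01-01, tmax)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02', '2015-01-01']),
    'element': ['tmax', 'tmin', 'tmax', 'tmax'],
    'temperature': [25.0, 12.0, 27.0, 29.0],
})

# Flag every row whose (date, element) key occurs more than once
dupes = toy[toy.duplicated(subset=['date', 'element'], keep=False)]
print(f"Rows sharing a (date, element) key: {len(dupes)}")

# pivot_table averages the duplicated readings (default aggfunc='mean')
wide = toy.pivot_table(index='date', columns='element', values='temperature')
print(wide)
```

If the duplicate check returns no rows, the pivot is lossless. With duplicates present, `DataFrame.pivot` would raise a `ValueError` instead of averaging, which may be preferable when silent aggregation is unwanted.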

In [None]:
# Pivot to have tmax and tmin as separate columns
df_final = df_tidy.pivot_table(
    index=['Country', 'year', 'month', 'day', 'date'],
    columns='element',
    values='temperature'
).reset_index()

# Remove the 'element' label left on the columns axis by the pivot
df_final.columns.name = None

print(f"Final dataset shape: {df_final.shape}")
print("\nFinal format:")
print(df_final.head(10))
print("\nData types:")
print(df_final.dtypes)

## 5. Visualize Temperature Patterns

In [None]:
# Plot temperature over time
plt.figure(figsize=(14, 6))
plt.plot(df_final['date'], df_final['tmax'], 'ro-', alpha=0.6, label='Max Temperature', markersize=4)
plt.plot(df_final['date'], df_final['tmin'], 'bo-', alpha=0.6, label='Min Temperature', markersize=4)
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.title('Temperature Measurements Over Time (India, 2015)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Monthly Temperature Patterns

In [None]:
# Calculate monthly statistics
monthly_stats = df_final.groupby('month').agg({
    'tmax': ['mean', 'min', 'max', 'count'],
    'tmin': ['mean', 'min', 'max']
}).round(2)

print("Monthly Temperature Statistics:")
print(monthly_stats)

In [None]:
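# Plot monthly average temperatures; reindex to all 12 months so that
# months without data show up as gaps instead of being bridged by the line
monthly_avg = df_final.groupby('month')[['tmax', 'tmin']].mean().reindex(range(1, 13))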

plt.figure(figsize=(12, 6))
months = monthly_avg.index
plt.plot(months, monthly_avg['tmax'], 'ro-', label='Average Max Temp', markersize=8)
plt.plot(months, monthly_avg['tmin'], 'bo-', label='Average Min Temp', markersize=8)
plt.fill_between(months, monthly_avg['tmin'], monthly_avg['tmax'], alpha=0.2)
plt.xlabel('Month')
plt.ylabel('Temperature (°C)')
plt.title('Average Monthly Temperature Range (2015)')
plt.xticks(range(1, 13))
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Temperature Distribution by Month

In [None]:
# Box plot for temperature distribution by month
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Max temperature
df_final.boxplot(column='tmax', by='month', ax=axes[0])
axes[0].set_title('Maximum Temperature Distribution by Month')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Max Temperature (°C)')
axes[0].get_figure().suptitle('')  # Remove default title

# Min temperature
df_final.boxplot(column='tmin', by='month', ax=axes[1])
axes[1].set_title('Minimum Temperature Distribution by Month')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Min Temperature (°C)')
axes[1].get_figure().suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

## 8. Data Coverage Analysis

In [None]:
# Count measurements per month
coverage = df_final.groupby('month').size().reset_index(name='measurements')

plt.figure(figsize=(10, 5))
plt.bar(coverage['month'], coverage['measurements'])
plt.xlabel('Month')
plt.ylabel('Number of Measurements')
plt.title('Data Coverage: Number of Temperature Measurements per Month')
plt.xticks(range(1, 13))
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nMeasurements per month:")
print(coverage)

## 9. Discussion: Tidy Data Transformation

### Original Data Structure
- **Dimensions**: 22 rows × 35 columns (sparse data)
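- **Variables**: Mixed layout - day values (d1-d31) are spread across columns, while the variable names (tmax/tmin) are stored as values in the `element` column
- **Values**: Temperature readings spread across 31 day columns
- **Problem**: Not tidy - day is a variable encoded in column names, and `element` holds variable names rather than values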

### Transformation Steps
1. **Melted day columns**: Converted d1-d31 from columns to rows
2. **Extracted day number**: Parsed 'd1' → 1, 'd2' → 2, etc.
3. **Created date variable**: Combined year-month-day into proper date
4. **Removed missing values**: Dropped NA entries (days without measurements)
5. **Pivoted element**: Separated tmax/tmin into columns for easier analysis
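
The steps above can be sketched as a single function and run on a small made-up table. The helper name `tidy_weather` and the toy values are assumptions for illustration; the sketch drops missing temperatures and invalid dates in one call rather than two.

```python
import pandas as pd

def tidy_weather(wide):
    """Hypothetical helper: wide d1..d31 layout -> one row per date."""
    id_cols = ['Country', 'year', 'month', 'element']
    day_cols = [c for c in wide.columns if c not in id_cols]
    # Step 1: melt the day columns into rows
    long = wide.melt(id_vars=id_cols, value_vars=day_cols,
                     var_name='day', value_name='temperature')
    # Step 2: 'd1' -> 1, 'd2' -> 2, ...
    long['day'] = long['day'].str.lstrip('d').astype(int)
    # Step 3: build a date; impossible dates become NaT
    long['date'] = pd.to_datetime(long[['year', 'month', 'day']], errors='coerce')
    # Step 4: drop missing temperatures and invalid dates
    long = long.dropna(subset=['temperature', 'date'])
    # Step 5: one column per element (tmax, tmin)
    out = long.pivot_table(index=['Country', 'year', 'month', 'day', 'date'],
                           columns='element', values='temperature').reset_index()
    out.columns.name = None
    return out

# Made-up example: February with readings on day 1 and on day 30 (an invalid date)
wide = pd.DataFrame({
    'Country': ['India', 'India'], 'year': [2015, 2015], 'month': [2, 2],
    'element': ['tmax', 'tmin'],
    'd1': [28.0, 15.0], 'd2': [float('nan')] * 2, 'd30': [30.0, 16.0],
})
result = tidy_weather(wide)
print(result)
```

In this toy case only February 1 survives: day 2 has no readings and February 30 is not a valid date.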

### Final Tidy Format
- **Each variable forms a column**: Country, year, month, day, date, tmax, tmin
- **Each observation forms a row**: One date with temperature readings
- **Each value in a cell**: Single measurement

### Key Findings - Monthly Patterns
- **Missing data**: September (month 9) has no measurements, which indicates a gap in data collection
- **Seasonal pattern**: Temperatures follow a clear seasonal cycle
  - Highest temperatures: April-June (summer months)
  - Lowest temperatures: November-February (winter months)
- **Daily variation**: The gap between max and min temperatures stays roughly constant (~12-15°C)
- **Data coverage**: Measurements are sparse (not all days have readings), so each monthly average rests on only a few observations
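
Findings like these can be checked directly against the final table. The sketch below uses a made-up stand-in for `df_final` (the values are illustrative only) to list months without data and to summarise the daily range per month, together with the number of readings behind each mean.

```python
import pandas as pd

# Made-up stand-in for df_final (values are illustrative only)
df_final = pd.DataFrame({
    'month': [1, 1, 4, 5, 11],
    'tmax': [24.0, 26.0, 38.0, 40.0, 28.0],
    'tmin': [11.0, 12.0, 24.0, 27.0, 14.0],
})

# Months with no measurements at all
present = df_final['month'].unique().tolist()
missing_months = sorted(set(range(1, 13)) - set(present))
print(f"Months without data: {missing_months}")

# Daily range (tmax - tmin) per month, with the count behind each mean
df_final['daily_range'] = df_final['tmax'] - df_final['tmin']
summary = df_final.groupby('month')['daily_range'].agg(['mean', 'count'])
print(summary)
```

Reporting the count next to the mean makes it easier to judge how much weight a monthly figure can carry when coverage is sparse.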

### Assumptions Made
1. **Invalid dates**: Removed entries with impossible dates (e.g., February 30, April 31)
2. **NA handling**: Assumed NA means no measurement was taken (not a reading of zero)
3. **Tmax/Tmin separation**: Separated into columns for easier comparison
4. **Location**: Assumed all measurements come from the same location in India
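
Two of these assumptions can be checked in code. The sketch below, on made-up data, shows how `errors='coerce'` turns impossible dates into `NaT` and how to count the distinct country-year pairs; the frames `parts` and `df` are illustrative stand-ins, not the real table.

```python
import pandas as pd

# Assumption 1: impossible dates become NaT with errors='coerce'
parts = pd.DataFrame({'year': [2015, 2015, 2015],
                      'month': [2, 4, 4],
                      'day': [30, 31, 30]})
dates = pd.to_datetime(parts, errors='coerce')
invalid = dates.isna().tolist()
print(invalid)

# Assumption 4: confirm that only one country and year are present
# (made-up stand-in for the loaded table)
df = pd.DataFrame({'Country': ['India', 'India'], 'year': [2015, 2015]})
n_groups = df[['Country', 'year']].drop_duplicates().shape[0]
print(f"Distinct (Country, year) pairs: {n_groups}")
```

A single country-year pair is consistent with the location assumption but does not prove it, since the table carries no station identifier.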