**Ali Bazrkar - Iran Weather EDA**

# Introduction

In an ever-changing world where weather patterns significantly impact our daily lives, understanding the climate is more important than ever. This exploratory data analysis project delves into the weather data from various cities across Iran, spanning from 2011 (1390) to the present. Using data sourced from **Open-Meteo**, we aim to uncover meaningful insights that can help us better understand regional weather dynamics.

The significance of this analysis extends beyond mere statistics; it has the potential to inform agricultural practices, guide urban planning, and enhance disaster preparedness. By exploring key factors such as temperature variations, precipitation patterns, and wind dynamics, we can paint a comprehensive picture of how climate trends evolve over time.

I am an artist specializing in pixel, digital, and portrait art, and my passion for mathematics, particularly calculus, probability, and statistics, has driven me to explore the fascinating world of artificial intelligence. My journey into machine learning began about a year ago when I encountered GANs and GPTs, which sparked a desire to learn how these technologies create and analyze data.

Throughout this notebook, we will engage in a structured exploration of the dataset, starting with data visualization to identify trends and anomalies. We will then dive into detailed analyses, uncovering correlations and patterns that may have significant implications. Whether you’re a researcher, policymaker, or simply a curious observer, this analysis aims to present the findings in an accessible and engaging manner, ensuring you stay intrigued every step of the way.

This project serves as my final Data Analysis project for the **Tehran Institute of Technology - MFT** program, allowing me to apply the skills I have acquired in a meaningful way. Join me on this journey as we uncover the stories hidden within the data and explore the world of Iran's weather patterns!

# Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point
from persiantools.jdatetime import JalaliDate

# Data Preparation

## Loading Datasets

In [3]:
geography = pd.read_csv("Dataset/Geography Information.csv")
geography['city'] = geography['city'].str.strip()
print(geography)

FileNotFoundError: [Errno 2] No such file or directory: 'Dataset/Geography Information.csv'

In [5]:
# City name lists
cities = [
    "Tehran",
    "Karaj",
    "Tabriz",
    "Mashhad",
    "Isfahan",
    "Shiraz",
    "Kerman",
    "Ahvaz",
    "Bandar Abbas",
    "Rasht"
]

# List that collects one DataFrame per city
dataframes = []

# Read each city's CSV, tag it with the city name, and append it to the list
for city in cities:
    df = pd.read_csv(f"Dataset/{city}.csv")
    df['city'] = city
    dataframes.append(df)
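
The loading loop above stops at the first missing file. As a minimal sketch (the helper name `load_city_frames` and its parameters are hypothetical, not part of the original notebook), the same loop can check all paths up front and report every missing file at once:

```python
from pathlib import Path

import pandas as pd


def load_city_frames(data_dir, cities):
    """Read one CSV per city and tag each frame with its city name.

    Hypothetical helper: raises a single FileNotFoundError that lists
    every missing city file instead of failing on the first one.
    """
    data_dir = Path(data_dir)
    missing = [city for city in cities if not (data_dir / f"{city}.csv").is_file()]
    if missing:
        raise FileNotFoundError(f"Missing city files in {data_dir}: {missing}")

    frames = []
    for city in cities:
        frame = pd.read_csv(data_dir / f"{city}.csv")
        frame["city"] = city  # same tagging step as in the loop above
        frames.append(frame)
    return frames
```

Calling `load_city_frames("Dataset", cities)` would then return the same list that the loop builds, assuming the files follow the `Dataset/{city}.csv` layout used in this notebook.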

### Dataset Combining 

In [6]:
df_combined = pd.concat(dataframes, ignore_index=True)
df_combined = pd.merge(df_combined, geography, on='city', how='left')
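
A left merge silently leaves `NaN` in the geography columns when a city name does not match (for example because of stray whitespace). The sketch below uses small made-up frames, with illustrative values only, to show how the `validate` and `indicator` options of `pd.merge` can surface such problems:

```python
import pandas as pd

# Toy stand-ins for the weather and geography tables (values are illustrative)
weather = pd.DataFrame({
    "city": ["Tehran", "Tehran", "Rasht"],
    "temp_mean (°C)": [18.2, 19.0, 15.4],
})
geography = pd.DataFrame({
    "city": ["Tehran ", "Rasht"],  # note the trailing space in "Tehran "
    "latitude": [35.69, 37.28],
})
geography["city"] = geography["city"].str.strip()

merged = pd.merge(
    weather,
    geography,
    on="city",
    how="left",
    validate="many_to_one",  # raises if a city appears twice in geography
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Cities that found no partner in the geography table
unmatched = merged.loc[merged["_merge"] == "left_only", "city"].unique()
print("Unmatched cities:", list(unmatched))

merged = merged.drop(columns="_merge")
```

Without the `str.strip()` step, `"Tehran"` would show up in `unmatched`, which is the situation the stripping in the geography cell guards against.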

## Data Cleaning 

In [7]:
del df, df_combined, dataframes, geography, cities
df = pd.read_csv("Dataset/Combined Dataset.csv")

### Checking NaNs 

In [8]:
print("Dataset Missing Values:")
missing_values = df.isnull().sum()
print(missing_values)

Dataset Missing Values:
time                               0
temp_max (°C)                      0
temp_min (°C)                      0
temp_mean (°C)                     0
daylight_duration (s)              0
precipitation_sum (mm)             0
rain_sum (mm)                      0
snowfall_sum (cm)                  0
precipitation_hours (h)            0
wind_speed_max (km/h)              0
wind_gusts_max (km/h)              0
wind_direction_dominant (°)        0
shortwave_radiation_sum (MJ/m²)    0
evapotranspiration (mm)            0
city                               0
latitude                           0
longitude                          0
elevation                          0
dtype: int64
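
A zero `NaN` count does not rule out missing days, since an absent row is not a null value. As a hedged sketch (the helper `find_date_gaps` is hypothetical and the sample rows are made up), the calendar coverage per city can be checked like this:

```python
import pandas as pd


def find_date_gaps(frame, date_col="time", city_col="city"):
    """Count, per city, the calendar days missing between first and last record."""
    gaps = {}
    for city, group in frame.groupby(city_col):
        dates = pd.DatetimeIndex(pd.to_datetime(group[date_col]))
        expected = pd.date_range(dates.min(), dates.max(), freq="D")
        gaps[city] = len(expected.difference(dates))
    return gaps


# Toy sample: Tehran is missing 2011-03-23, Rasht is complete
sample = pd.DataFrame({
    "time": ["2011-03-21", "2011-03-22", "2011-03-24", "2011-03-21", "2011-03-22"],
    "city": ["Tehran", "Tehran", "Tehran", "Rasht", "Rasht"],
})
gaps = find_date_gaps(sample)
print(gaps)
```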


### Gregorian and Jalali Date Settings 

In [9]:
df["time"] = pd.to_datetime(df["time"])

In [10]:
# create Jalali date column
df["date_jalali"] = df["time"].apply(lambda time: JalaliDate(time)) 

# rename time to date_gregorian
df.rename(columns={'time': 'date_gregorian'}, inplace=True)

In [11]:

df["year"] = df["date_jalali"].apply(lambda date: date.year)
df["month"] = df["date_jalali"].apply(lambda date: date.month)
df["season"] = df["month"].apply(lambda month:
                                    "Spring" if month in [1, 2, 3] else 
                                    "Summer" if month in [4, 5, 6] else 
                                    "Autumn" if month in [7, 8, 9] else 
                                    "Winter")
df['month'] = df['month'].map({
    1: "Farvardin",
    2: "Ordibehesht",
    3: "Khordad",
    4: "Tir",
    5: "Mordad",
    6: "Shahrivar",
    7: "Mehr",
    8: "Aban",
    9: "Azar",
    10: "Dey",
    11: "Bahman",
    12: "Esfand"
})

df.drop(columns="date_jalali", inplace=True)
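
Because each Jalali season spans exactly three consecutive months, the season can also be derived with integer division, and month names can be stored as an ordered categorical so that sorting and plotting follow calendar order rather than alphabetical order. This is an alternative sketch on a small made-up series of month numbers, not the notebook's own method:

```python
import pandas as pd

MONTH_NAMES = [
    "Farvardin", "Ordibehesht", "Khordad", "Tir", "Mordad", "Shahrivar",
    "Mehr", "Aban", "Azar", "Dey", "Bahman", "Esfand",
]
SEASONS = ["Spring", "Summer", "Autumn", "Winter"]

# Toy Jalali month numbers (1-12)
months = pd.Series([1, 3, 4, 7, 10, 12], name="month")

# Months 1-3 -> index 0 (Spring), 4-6 -> 1 (Summer), and so on
season = ((months - 1) // 3).map(dict(enumerate(SEASONS)))

# Ordered categorical: "Farvardin" < "Ordibehesht" < ... < "Esfand"
month_name = pd.Categorical.from_codes(
    (months - 1).to_numpy(), categories=MONTH_NAMES, ordered=True
)

print(pd.DataFrame({"month": months, "month_name": month_name, "season": season}))
```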

### Adjusting Units

In [12]:
# Convert snowfall from centimetres to millimetres
df["snowfall_sum (cm)"] = df["snowfall_sum (cm)"] * 10
df.rename(columns={'snowfall_sum (cm)': 'snowfall_sum (mm)'}, inplace=True)

# Convert daylight duration from seconds to hours
df["daylight_duration (s)"] = df["daylight_duration (s)"] / 3600
df.rename(columns={'daylight_duration (s)': 'daylight_duration (h)'}, inplace=True)

### GeoPandas Data Additions

In [13]:
city_to_province = {
    'Tehran': 'Tehran',
    'Tabriz': 'East Azerbaijan',
    'Mashhad': 'Razavi Khorasan',
    'Isfahan': 'Isfahan',
    'Shiraz': 'Fars',
    'Ahvaz': 'Khuzestan',
    'Rasht': 'Gilan',
    'Kerman': 'Kerman',
    'Bandar Abbas': 'Hormozgan',
    'Karaj': 'Alborz',
}

# Add a province column to the DataFrame by looking up each city
df['province'] = df['city'].map(city_to_province)
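
`Series.map` with a dictionary returns `NaN` for any key that is not in the dictionary, so a city missing from the mapping would go unnoticed. A small sketch with a deliberately incomplete toy mapping shows one way to list such cities:

```python
import pandas as pd

# Deliberately incomplete toy mapping (the notebook's mapping covers all ten cities)
city_to_province = {"Tehran": "Tehran", "Rasht": "Gilan"}
cities = pd.Series(["Tehran", "Rasht", "Karaj", "Karaj"], name="city")

province = cities.map(city_to_province)

# Cities whose lookup produced NaN
unmapped = sorted(cities[province.isna()].unique())
print("Cities without a province:", unmapped)
```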

### Additional Columns for Easier Analysis

In [14]:
df["temp_diff (°C)"] = df["temp_max (°C)"] - df["temp_min (°C)"]
df["remain_precipitation (mm)"] = df["precipitation_sum (mm)"] - df["evapotranspiration (mm)"]

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49260 entries, 0 to 49259
Data columns (total 25 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   date_gregorian                   49260 non-null  datetime64[ns]
 1   temp_max (°C)                    49260 non-null  float64       
 2   temp_min (°C)                    49260 non-null  float64       
 3   temp_mean (°C)                   49260 non-null  float64       
 4   daylight_duration (h)            49260 non-null  float64       
 5   precipitation_sum (mm)           49260 non-null  float64       
 6   rain_sum (mm)                    49260 non-null  float64       
 7   snowfall_sum (mm)                49260 non-null  float64       
 8   precipitation_hours (h)          49260 non-null  float64       
 9   wind_speed_max (km/h)            49260 non-null  float64       
 10  wind_gusts_max (km/h)            49260 non-null  float64  

### Memory Cleaning 

In [16]:
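def datatype_cleaner(df):
    """Downcast numeric columns of df in place to the smallest dtype whose range fits.

    Only the value range is checked; float16 keeps roughly three significant
    decimal digits, so downcast values may be rounded.
    """
    for column in df.select_dtypes(include=['float', 'int']).columns: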
        min_val = df[column].min()
        max_val = df[column].max()
    
        col_dtype = df[column].dtype
    
        # Check ranges for int and float types
        if np.issubdtype(col_dtype, np.floating):
            if min_val >= -65504 and max_val <= 65504:  # float16
                df[column] = df[column].astype(np.float16)
            elif min_val >= -3.4e38 and max_val <= 3.4e38:  # float32
                df[column] = df[column].astype(np.float32)
            else:
                df[column] = df[column].astype(np.float64)
    
        elif np.issubdtype(col_dtype, np.integer):
            if min_val >= -128 and max_val <= 127:  # int8
                df[column] = df[column].astype(np.int8)
            elif min_val >= -32768 and max_val <= 32767:  # int16
                df[column] = df[column].astype(np.int16)
            elif min_val >= -2147483648 and max_val <= 2147483647:  # int32
                df[column] = df[column].astype(np.int32)
            else:
                df[column] = df[column].astype(np.int64)
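
The range checks in `datatype_cleaner` guarantee that values do not overflow, but they say nothing about precision. The sketch below uses illustrative values to show how much `float16` rounds, and outlines a hypothetical alternative, `safe_downcast`, that accepts a smaller dtype only when the round trip stays within a relative tolerance. It is an assumption-laden sketch, not a replacement endorsed by the original notebook.

```python
import numpy as np

# Illustrative values: a temperature, a longitude-like coordinate, an elevation-like number
values = np.array([23.47, 51.389, 2049.0], dtype=np.float64)

as_f16 = values.astype(np.float16)
as_f32 = values.astype(np.float32)

print("float16:", as_f16.astype(np.float64))  # 51.389 becomes 51.375, 2049 becomes 2048
print("float32:", as_f32.astype(np.float64))


def safe_downcast(array, rtol=1e-6):
    """Return the smallest float dtype that reproduces `array` within `rtol`.

    Hypothetical helper: tries float16, then float32, and otherwise
    returns the input unchanged.
    """
    for dtype in (np.float16, np.float32):
        cast = array.astype(dtype)
        if np.allclose(cast.astype(np.float64), array, rtol=rtol, atol=0):
            return cast
    return array


print(safe_downcast(values).dtype)
```

For coordinates, a `float16` error of about 0.014 degrees corresponds to more than a kilometre on the ground, so `float32` is usually the safer lower bound for such columns.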

In [17]:
# calling the function to clean datatype memory usage
datatype_cleaner(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49260 entries, 0 to 49259
Data columns (total 25 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   date_gregorian                   49260 non-null  datetime64[ns]
 1   temp_max (°C)                    49260 non-null  float16       
 2   temp_min (°C)                    49260 non-null  float16       
 3   temp_mean (°C)                   49260 non-null  float16       
 4   daylight_duration (h)            49260 non-null  float16       
 5   precipitation_sum (mm)           49260 non-null  float16       
 6   rain_sum (mm)                    49260 non-null  float16       
 7   snowfall_sum (mm)                49260 non-null  float16       
 8   precipitation_hours (h)          49260 non-null  float16       
 9   wind_speed_max (km/h)            49260 non-null  float16       
 10  wind_gusts_max (km/h)            49260 non-null  float16  

In [18]:
df.to_csv("Final Dataset.csv", index=False)
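
CSV files store plain text, so the downcast dtypes are not preserved: reading the file back yields `float64` columns and unparsed dates unless the types are passed explicitly. The following sketch demonstrates this with an in-memory buffer and a two-row toy frame; the column names mirror this notebook, while the values are made up:

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "date_gregorian": pd.to_datetime(["2011-03-21", "2011-03-22"]),
    "temp_max (°C)": np.array([20.5, 31.25], dtype=np.float16),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: dtypes are inferred again from the text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit read: restore the intended dtypes
buffer.seek(0)
typed = pd.read_csv(
    buffer,
    dtype={"temp_max (°C)": np.float32},
    parse_dates=["date_gregorian"],
)

print(plain.dtypes)
print(typed.dtypes)
```

A binary format such as Parquet keeps dtypes across save and load, but it needs an extra dependency, so whether it fits here depends on how the final dataset is consumed.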

# Data Visualization