# **30 Day Forecast: Amberpet Mandal Weather Conditions**

# **Notebook 1: Data Cleaning & Preprocessing**

## 1.0 Introduction

**Telangana**
- Telangana is one of the states in India, situated in the south-central part of the Indian subcontinent on the high Deccan Plateau.
- The capital city of Telangana State is Hyderabad.
- Telangana state is divided into 33 districts which are further divided into 584 mandals.
- Hyderabad district is divided into 16 mandals. 
- Amberpet is a mandal in Hyderabad District.
  
**Climate**
- Telangana is a semi-arid area and has a predominantly hot and dry climate. 
- Summers start in March and peak in mid-April, with average high temperatures in the 37–38 °C (99–100 °F) range.
- The monsoon arrives in June and lasts until late September, bringing about 755 mm (29.7 inches) of precipitation.
- A dry, mild winter starts in late November and lasts until early February, with little humidity and average temperatures in the 22–23 °C (72–73 °F) range.
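
The paired units quoted above can be checked with a quick conversion. This is a minimal sketch; the helper names `celsius_to_fahrenheit` and `mm_to_inches` are illustrative and not part of the original notebook.

```python
# Illustrative helpers (not from the original notebook) for checking the unit pairs above.

def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def mm_to_inches(millimetres: float) -> float:
    """Convert a length from millimetres to inches (1 inch = 25.4 mm)."""
    return millimetres / 25.4


# Summer highs, winter averages and monsoon rainfall as quoted in the climate notes.
summer_highs_f = [round(celsius_to_fahrenheit(c), 1) for c in (37, 38)]
winter_means_f = [round(celsius_to_fahrenheit(c), 1) for c in (22, 23)]
monsoon_rain_in = round(mm_to_inches(755), 1)

print(summer_highs_f, winter_means_f, monsoon_rain_in)
```

The Fahrenheit figures in the text are these values rounded to whole degrees.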

## 1.1 Problem Statement

Daily weather conditions for Telangana State are available from February 1, 2023, to January 31, 2025. The data was extracted from Open Data Telangana. The goal is to use this historical record to predict the weather conditions for the next 30 days.

Weather forecasting is important for public safety, agriculture, transportation, and daily life, as it allows people to prepare for weather events and make informed decisions. It helps save lives by predicting severe weather, supports farmers in planting and harvesting, guides travel planning, and enables businesses to manage energy use. 

## 1.2 Project Objectives


- To develop a multivariate time series forecasting model that predicts the daily weather conditions for Amberpet Mandal over the next 30 days.
- To use the developed model to forecast the following daily weather variables for Amberpet Mandal over the next 30 days: Rain (mm), Min Temp (°C), Max Temp (°C), Min Humidity (%), Max Humidity (%), Min Wind Speed (Kmph), Max Wind Speed (Kmph).

## 1.3 Installing & Importing Required Libraries

In [None]:
# Use the %pip magic so the package is installed into the environment of the running kernel
%pip install --upgrade statsmodels

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Libraries required for data analysis, manipulation, visualization and preprocessing
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
import math  # For mathematical operations

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## 1.4 Importing Data

In [2]:
# Import the Telangana time series dataset from Hugging Face:
# https://huggingface.co/datasets/ron-the-code/Telangana_time_series_2023-2025
# The dataset was retrieved from Open Data Telangana and covers February 1, 2023, to January 31, 2025, at daily granularity.

# Reading an hf:// path relies on the huggingface_hub package being installed.
# Log in first (e.g. with `huggingface-cli login`) to access this dataset.
weather_data = pd.read_csv("hf://datasets/ron-the-code/Telangana_time_series_2023-2025/Telangana_combined_weather_data.csv")

# Display the DataFrame
print(weather_data.head())

   District          Mandal       Date  Rain (mm)  Min Temp (°C)  \
0  Adilabad  Adilabad Rural  01-Feb-23        0.0           16.6   
1  Adilabad  Adilabad Rural  02-Feb-23        0.0           13.7   
2  Adilabad  Adilabad Rural  03-Feb-23        0.0           10.3   
3  Adilabad  Adilabad Rural  04-Feb-23        0.0           11.7   
4  Adilabad  Adilabad Rural  05-Feb-23        0.0            9.2   

   Max Temp (°C)  Min Humidity (%)  Max Humidity (%)  Min Wind Speed (Kmph)  \
0           30.9              49.3              82.0                    0.0   
1           28.7              46.4              81.1                    0.0   
2           29.5              24.2              55.8                    0.0   
3           30.3              26.3              59.9                    0.0   
4           31.3              25.2              57.6                    0.0   

   Max Wind Speed (Kmph)  
0                   11.4  
1                   12.0  
2                   10.9  
3                   16.7  
4                    9.7  

In [3]:
# save the dataset to a CSV file
weather_data.to_csv('Telangana_combined_weather_data.csv', index=False)

In [4]:
weather_data.tail()

Unnamed: 0,District,Mandal,Date,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
445207,Yadadri Bhuvanagiri,Yadagirigutta,27-Jan-25,0.0,17.0,35.2,46.0,77.6,0.0,4.6
445208,Yadadri Bhuvanagiri,Yadagirigutta,28-Jan-25,0.0,16.6,35.6,41.5,83.0,0.0,0.8
445209,Yadadri Bhuvanagiri,Yadagirigutta,29-Jan-25,0.0,17.3,34.9,48.4,81.9,0.0,8.6
445210,Yadadri Bhuvanagiri,Yadagirigutta,30-Jan-25,0.0,18.0,32.7,52.6,86.8,0.0,3.9
445211,Yadadri Bhuvanagiri,Yadagirigutta,31-Jan-25,0.0,17.6,33.1,55.7,86.3,0.0,6.0


In [5]:
weather_data.head()

Unnamed: 0,District,Mandal,Date,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
0,Adilabad,Adilabad Rural,01-Feb-23,0.0,16.6,30.9,49.3,82.0,0.0,11.4
1,Adilabad,Adilabad Rural,02-Feb-23,0.0,13.7,28.7,46.4,81.1,0.0,12.0
2,Adilabad,Adilabad Rural,03-Feb-23,0.0,10.3,29.5,24.2,55.8,0.0,10.9
3,Adilabad,Adilabad Rural,04-Feb-23,0.0,11.7,30.3,26.3,59.9,0.0,16.7
4,Adilabad,Adilabad Rural,05-Feb-23,0.0,9.2,31.3,25.2,57.6,0.0,9.7


In [6]:
weather_data.info() #get dataset information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445212 entries, 0 to 445211
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   District               445212 non-null  object 
 1   Mandal                 445212 non-null  object 
 2   Date                   445212 non-null  object 
 3   Rain (mm)              445212 non-null  float64
 4   Min Temp (°C)          445212 non-null  float64
 5   Max Temp (°C)          445212 non-null  float64
 6   Min Humidity (%)       445212 non-null  float64
 7   Max Humidity (%)       445212 non-null  float64
 8   Min Wind Speed (Kmph)  445212 non-null  float64
 9   Max Wind Speed (Kmph)  445212 non-null  float64
dtypes: float64(7), object(3)
memory usage: 34.0+ MB


- No column has missing values.
- There are 445212 rows of data.
- There are 10 columns.
- 3 columns (District, Mandal and Date) are stored as text (`object` dtype). Date still has to be converted to a datetime type before it can be used as a time index.
- 7 columns hold numerical data on rain, temperature, humidity and wind speed.
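
The dtype split and the missing-value count summarized above can also be computed directly rather than read off the `info()` output. The sketch below uses a small toy frame (values copied from the `head()` output, with only two of the seven numeric columns) instead of the full dataset; the variable names are illustrative.

```python
import pandas as pd

# Toy frame mirroring part of the dataset's layout; values come from the head() output above.
sample = pd.DataFrame(
    {
        "District": ["Adilabad", "Adilabad"],
        "Mandal": ["Adilabad Rural", "Adilabad Rural"],
        "Date": ["01-Feb-23", "02-Feb-23"],
        "Rain (mm)": [0.0, 0.0],
        "Min Temp (°C)": [16.6, 13.7],
    }
)

# Split columns into numeric and non-numeric (text) groups.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

# Total number of missing cells across the whole frame.
total_missing = int(sample.isna().sum().sum())

print(text_columns)
print(numeric_columns)
print(total_missing)
```

Running the same three expressions on `weather_data` should reproduce the counts listed in the bullets.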

In [7]:
# Get number of unique entries in the District Column 
unique_districts = weather_data['District'].nunique()
print(f"Number of districts with weather data:\n {unique_districts}")

Number of districts with weather data:
 33


In [8]:
# Get names of unique entries in the District Column 
name_unique_districts = weather_data['District'].unique()
print(f"Name of districts with weather data:\n {name_unique_districts}")

Name of districts with weather data:
 ['Adilabad' 'Bhadradri Kothagudem' 'Hanumakonda' 'Hyderabad' 'Jagtial'
 'Jangaon' 'Jayashankar' 'Jogulamba Gadwal' 'Kamareddy' 'Karimnagar'
 'Khammam' 'Kumuram Bheem' 'Mahabubabad' 'Mahabubnagar' 'Mancherial'
 'Medak' 'Medchal-Malkajgiri' 'Mulugu' 'Nagarkurnool' 'Nalgonda'
 'Narayanpet' 'Nirmal' 'Nizamabad' 'Peddapalli' 'Rajanna Sircilla'
 'Rangareddy' 'Sangareddy' 'Siddipet' 'Suryapet' 'Vikarabad' 'Wanaparthy'
 'Warangal' 'Yadadri Bhuvanagiri']


In [9]:
# Get number of unique entries in the Mandal Column 
unique_mandal = weather_data['Mandal'].nunique()
print(f"Number of mandals with weather data:\n {unique_mandal}")

Number of mandals with weather data:
 598
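
Note that `nunique()` on the `Mandal` column counts distinct mandal *names*. If the same name were used in more than one district, it would be counted only once. A minimal sketch of counting (District, Mandal) pairs instead, using made-up toy data rather than the real dataset:

```python
import pandas as pd

# Toy data: the mandal name "Alpha" appears in two different (made-up) districts.
toy = pd.DataFrame(
    {
        "District": ["D1", "D1", "D2", "D2"],
        "Mandal": ["Alpha", "Beta", "Alpha", "Gamma"],
    }
)

# Distinct mandal names, ignoring which district they belong to.
unique_names = toy["Mandal"].nunique()

# Distinct (District, Mandal) combinations.
unique_pairs = len(toy[["District", "Mandal"]].drop_duplicates())

print(unique_names, unique_pairs)
```

Applying the second expression to `weather_data` would show whether the name count and the pair count differ for this dataset.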


## 1.5 Data Preprocessing

### 1.5.1 Selecting district for weather forecasting

- The time series data covers 33 districts.
- Forecasting will focus on Hyderabad district.

In [10]:
# Select the rows for Hyderabad district with a boolean filter.
hyderabad_weather = weather_data[weather_data['District'] == 'Hyderabad'].copy()
hyderabad_weather.head()

Unnamed: 0,District,Mandal,Date,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
1540,Hyderabad,Amberpet,01-Feb-23,0.0,21.5,29.7,42.2,78.6,0.0,2.0
1541,Hyderabad,Amberpet,02-Feb-23,0.0,19.9,30.4,37.0,73.6,0.0,1.9
1542,Hyderabad,Amberpet,03-Feb-23,0.0,18.5,30.1,27.8,55.1,0.0,2.0
1543,Hyderabad,Amberpet,04-Feb-23,0.0,19.1,31.4,15.6,60.3,0.0,2.1
1544,Hyderabad,Amberpet,05-Feb-23,0.0,20.1,31.9,24.2,58.1,0.0,2.1


In [11]:
# save the dataset to a csv file
hyderabad_weather.to_csv('Hyderabad_weather_data.csv', index=False)

In [12]:
# check information for the hyderabad weather data
print("Data for Hyderabad")
hyderabad_weather.info() 

Data for Hyderabad
<class 'pandas.core.frame.DataFrame'>
Index: 11696 entries, 1540 to 428440
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   District               11696 non-null  object 
 1   Mandal                 11696 non-null  object 
 2   Date                   11696 non-null  object 
 3   Rain (mm)              11696 non-null  float64
 4   Min Temp (°C)          11696 non-null  float64
 5   Max Temp (°C)          11696 non-null  float64
 6   Min Humidity (%)       11696 non-null  float64
 7   Max Humidity (%)       11696 non-null  float64
 8   Min Wind Speed (Kmph)  11696 non-null  float64
 9   Max Wind Speed (Kmph)  11696 non-null  float64
dtypes: float64(7), object(3)
memory usage: 1005.1+ KB


**Hyderabad weather data consists of 11696 rows and 10 columns.**

In [13]:
# Get number of unique entries in the Mandal Column 
mandal = hyderabad_weather['Mandal'].nunique()
print(f"Number of hyderabad mandals with weather data:\n {mandal}")

Number of hyderabad mandals with weather data:
 16


In [14]:
# Get names of unique entries in the Mandal Column 
mandal = hyderabad_weather['Mandal'].unique()
print(f"Names of hyderabad mandals with weather data:\n {mandal}")

Names of hyderabad mandals with weather data:
 ['Amberpet' 'Ameerpet' 'Asifnagar' 'Bahadurpura' 'Bandlaguda' 'Charminar'
 'Golkonda' 'Himayatnagar' 'Khairatabad' 'Maredpally' 'Musheerabad'
 'Nampally' 'Saidabad' 'Secunderabad' 'Shaikpet' 'Tirumalgiri']


### 1.5.2 Check for missingness

In [15]:
# check for missing values
missing_values = hyderabad_weather.isnull().sum()
print("\n--- Missing Values in Each Column ---")
print(missing_values)


--- Missing Values in Each Column ---
District                 0
Mandal                   0
Date                     0
Rain (mm)                0
Min Temp (°C)            0
Max Temp (°C)            0
Min Humidity (%)         0
Max Humidity (%)         0
Min Wind Speed (Kmph)    0
Max Wind Speed (Kmph)    0
dtype: int64


### 1.5.3 Check for duplicate rows

In [16]:
# check for duplicates
duplicates = hyderabad_weather.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicates}")

Number of duplicate rows in the dataset: 0
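
`duplicated()` with no arguments only flags rows that match in every column. For a time series it can also be useful to check that each (Mandal, Date) key occurs once, since two rows for the same day with different readings would pass the full-row check. A minimal sketch on made-up readings:

```python
import pandas as pd

# Toy data: two rows share the same mandal and date but carry different (made-up) readings.
toy = pd.DataFrame(
    {
        "Mandal": ["Amberpet", "Amberpet", "Amberpet"],
        "Date": ["01-Feb-23", "01-Feb-23", "02-Feb-23"],
        "Rain (mm)": [0.0, 1.2, 0.0],
    }
)

# Full-row check: no row is an exact copy of another.
full_row_duplicates = int(toy.duplicated().sum())

# Key-based check: the second row repeats the (Mandal, Date) key of the first.
key_duplicates = int(toy.duplicated(subset=["Mandal", "Date"]).sum())

print(full_row_duplicates, key_duplicates)
```

The same `subset=["Mandal", "Date"]` argument could be passed to the check on `hyderabad_weather`.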


### 1.5.4 Indexing Date column

In [17]:
# Set 'Date column' as index
hyderabad_indexed = hyderabad_weather.set_index('Date')
hyderabad_indexed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11696 entries, 01-Feb-23 to 31-Jan-25
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   District               11696 non-null  object 
 1   Mandal                 11696 non-null  object 
 2   Rain (mm)              11696 non-null  float64
 3   Min Temp (°C)          11696 non-null  float64
 4   Max Temp (°C)          11696 non-null  float64
 5   Min Humidity (%)       11696 non-null  float64
 6   Max Humidity (%)       11696 non-null  float64
 7   Min Wind Speed (Kmph)  11696 non-null  float64
 8   Max Wind Speed (Kmph)  11696 non-null  float64
dtypes: float64(7), object(2)
memory usage: 913.8+ KB


In [18]:
# confirming the index
hyderabad_indexed.head()

Date,District,Mandal,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
01-Feb-23,Hyderabad,Amberpet,0.0,21.5,29.7,42.2,78.6,0.0,2.0
02-Feb-23,Hyderabad,Amberpet,0.0,19.9,30.4,37.0,73.6,0.0,1.9
03-Feb-23,Hyderabad,Amberpet,0.0,18.5,30.1,27.8,55.1,0.0,2.0
04-Feb-23,Hyderabad,Amberpet,0.0,19.1,31.4,15.6,60.3,0.0,2.1
05-Feb-23,Hyderabad,Amberpet,0.0,20.1,31.9,24.2,58.1,0.0,2.1


In [19]:
# confirming the index
hyderabad_indexed.tail()

Date,District,Mandal,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
27-Jan-25,Hyderabad,Tirumalgiri,0.0,17.7,33.0,38.0,89.6,0.0,4.4
28-Jan-25,Hyderabad,Tirumalgiri,0.0,18.0,33.2,35.7,90.1,0.0,3.5
29-Jan-25,Hyderabad,Tirumalgiri,0.0,18.5,32.8,43.7,91.4,0.0,0.3
30-Jan-25,Hyderabad,Tirumalgiri,0.0,17.8,32.0,51.0,100.0,0.0,0.7
31-Jan-25,Hyderabad,Tirumalgiri,0.0,18.0,33.1,51.6,95.9,0.0,3.2


In [20]:
# After setting index, convert it to datetime
hyderabad_indexed.index = pd.to_datetime(hyderabad_indexed.index, format='%d-%b-%y')
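
The format string `'%d-%b-%y'` reads a two-digit day, an abbreviated month name and a two-digit year. The sketch below shows the parse on a few sample strings and, as an optional extra not used in the notebook, how `errors="coerce"` can surface malformed dates as `NaT` instead of raising.

```python
import pandas as pd

# Strings in the same layout as the Date column.
raw_dates = pd.Index(["01-Feb-23", "29-Feb-24", "31-Jan-25"])
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y")

# Optional validation: "31-Feb-23" is not a real date, so it becomes NaT under errors="coerce".
checked = pd.to_datetime(
    pd.Index(["01-Feb-23", "31-Feb-23"]), format="%d-%b-%y", errors="coerce"
)
n_bad = int(checked.isna().sum())

print(parsed)
print(n_bad)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first interpretations.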

### 1.5.5 Selecting the mandal to forecast

**Weather data for Hyderabad district is collected for 16 mandals, so the forecast focuses on one specific mandal: Amberpet.**

In [21]:
# Select the rows for Amberpet mandal with a boolean filter.
Amberpet_weather = hyderabad_indexed[hyderabad_indexed['Mandal'] == 'Amberpet'].copy()
Amberpet_weather.head()

Date,District,Mandal,Rain (mm),Min Temp (°C),Max Temp (°C),Min Humidity (%),Max Humidity (%),Min Wind Speed (Kmph),Max Wind Speed (Kmph)
2023-02-01,Hyderabad,Amberpet,0.0,21.5,29.7,42.2,78.6,0.0,2.0
2023-02-02,Hyderabad,Amberpet,0.0,19.9,30.4,37.0,73.6,0.0,1.9
2023-02-03,Hyderabad,Amberpet,0.0,18.5,30.1,27.8,55.1,0.0,2.0
2023-02-04,Hyderabad,Amberpet,0.0,19.1,31.4,15.6,60.3,0.0,2.1
2023-02-05,Hyderabad,Amberpet,0.0,20.1,31.9,24.2,58.1,0.0,2.1


In [22]:
# Get information for Amberpet mandal weather data
print("Data for Amberpet mandal in Hyderabad district")
Amberpet_weather.info()

Data for Amberpet mandal in Hyderabad district
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 731 entries, 2023-02-01 to 2025-01-31
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   District               731 non-null    object 
 1   Mandal                 731 non-null    object 
 2   Rain (mm)              731 non-null    float64
 3   Min Temp (°C)          731 non-null    float64
 4   Max Temp (°C)          731 non-null    float64
 5   Min Humidity (%)       731 non-null    float64
 6   Max Humidity (%)       731 non-null    float64
 7   Min Wind Speed (Kmph)  731 non-null    float64
 8   Max Wind Speed (Kmph)  731 non-null    float64
dtypes: float64(7), object(2)
memory usage: 57.1+ KB


#### 1.5.5.1 Dropping unnecessary columns from Amberpet mandal weather dataset

In [23]:
# District & mandal columns will not be necessary for forecasting and will be dropped.
Amberpet_weather_cleaned = Amberpet_weather.drop(columns=['District', 'Mandal'])
Amberpet_weather_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 731 entries, 2023-02-01 to 2025-01-31
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rain (mm)              731 non-null    float64
 1   Min Temp (°C)          731 non-null    float64
 2   Max Temp (°C)          731 non-null    float64
 3   Min Humidity (%)       731 non-null    float64
 4   Max Humidity (%)       731 non-null    float64
 5   Min Wind Speed (Kmph)  731 non-null    float64
 6   Max Wind Speed (Kmph)  731 non-null    float64
dtypes: float64(7)
memory usage: 45.7 KB


Amberpet weather dataset has 731 entries/rows and 7 columns.

In [24]:
# get column names
column_names = Amberpet_weather_cleaned.columns
print(f" Column names \n {column_names} ")

 Column names 
 Index(['Rain (mm)', 'Min Temp (°C)', 'Max Temp (°C)', 'Min Humidity (%)',
       'Max Humidity (%)', 'Min Wind Speed (Kmph)', 'Max Wind Speed (Kmph)'],
      dtype='object') 


In [25]:
# Build the expected daily date range and check that no date is missing from the index
date_range = pd.date_range(start='2023-02-01', end='2025-01-31', freq='D')
missing_dates = date_range.difference(Amberpet_weather_cleaned.index)
print(f"Missing dates: {len(missing_dates)}")

Missing dates: 0
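
The expected range holds 731 days: 365 from February 1, 2023, to January 31, 2024, plus 366 from February 1, 2024, to January 31, 2025, which includes the leap day. With 731 rows and zero missing dates, each date therefore appears exactly once. The sketch below shows how the same check reports a gap, using a hypothetical index with one day removed.

```python
import pandas as pd

# Full expected daily range, as in the cell above.
expected = pd.date_range(start="2023-02-01", end="2025-01-31", freq="D")
n_expected = len(expected)

# Hypothetical observed index with one day dropped (position 28 is 2023-03-01).
observed = expected.delete(28)

# Dates present in the expected range but absent from the observed index.
missing = expected.difference(observed)

print(n_expected)
print(missing)
```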


In [26]:
# Save cleaned data to a new CSV file
Amberpet_weather_cleaned.to_csv('Amberpet_weather_cleaned.csv', index=True) # index=True to keep the Date index
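
Because the CSV stores the index as plain text, a later notebook has to ask pandas to parse it again when loading. A minimal round-trip sketch, using an in-memory buffer and a two-row toy frame in place of the real file (values taken from the Amberpet `head()` output):

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned Amberpet data, with a DatetimeIndex named "Date".
frame = pd.DataFrame(
    {"Rain (mm)": [0.0, 0.0], "Min Temp (°C)": [21.5, 19.9]},
    index=pd.DatetimeIndex(["2023-02-01", "2023-02-02"], name="Date"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=True)
buffer.seek(0)

# Restore the Date column as a parsed datetime index on load.
reloaded = pd.read_csv(buffer, index_col="Date", parse_dates=["Date"])

print(reloaded.index)
```

Without `parse_dates`, the reloaded index would contain strings rather than timestamps.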