
# Temperature Forecast Project using ML

Problem Statement:

# Data Set Information:

This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model's next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.

# Attribute Information:

For more information, read [Cho et al, 2020].

station - used weather station number: 1 to 25

Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')

Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6

Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9

LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5

LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100

LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5

LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6

LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9

LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4

LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97

LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97

LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98

LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97

LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7


LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6

LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8

LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7

lat - Latitude (°): 37.456 to 37.645

lon - Longitude (°): 126.826 to 127.135

DEM - Elevation (m): 12.4 to 212.3

Slope - Slope (°): 0.1 to 5.2

Solar radiation - Daily incoming solar radiation (Wh/m2): 4329.5 to 5992.9

Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9

Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8

Please note that there are two target variables here:


1) Next_Tmax: Next day maximum temperature

2) Next_Tmin: Next day minimum temperature

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# read csv file
df=pd.read_csv(r"C:\Users\DELL\3D Objects\\temperature forecaast_trainning project_2 dataset.csv")
df.head(10)

In [None]:
# shape of dataset
df.shape

In [None]:
df.isnull().sum()

In [None]:
# display all columns and all rows when printing a DataFrame
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

In [None]:
df.head()

In [None]:
df.columns  # all column names

In [None]:
df["Date"]

In [None]:
df.isnull().sum()

In [None]:
# drop rows with null values; the dataset is large enough to afford it
df=df.dropna()

In [None]:
df.isnull().sum().sum()  # zero null values

# EDA

In [None]:
# check the information
df.info()

In [None]:
df=df.drop(columns=["Date"])  # drop the Date column because it is object (string) type

In [None]:
plt.figure(figsize=(20,25))
sns.heatmap(abs(df.corr()),annot=True)  # many columns are highly correlated (multicollinearity)
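
A large annotated heatmap is hard to read, so it can help to list the strongly correlated pairs directly. The following is a minimal sketch using plain NumPy on synthetic data; the helper name `high_corr_pairs` and the 0.8 threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.8):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in for the real data (assumption: three numeric columns)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)   # nearly a copy of a
c = rng.normal(size=200)                  # unrelated to a and b
demo = np.column_stack([a, b, c])

pairs = high_corr_pairs(demo, ["a", "b", "c"])
print(pairs)
```

On the notebook's data the same helper could be called as `high_corr_pairs(df.to_numpy(), list(df.columns))`.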

In [None]:
df.head(2)

In [None]:
df.corr()

In [None]:
# it is all numerical data

In [None]:
df.shape

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# distribution plot of every column (univariate analysis)
plt.figure(figsize=(40,40))
graph=1
for num in df.columns[:24]:
    plt.subplot(6,4,graph)
    sns.histplot(df[num],kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn releases
    plt.xlabel(num,fontsize=20)
    graph+=1
plt.show()

In [None]:
# drop the station column
df=df.drop(columns=["station"],axis=1)

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df["LDAPS_PPT3"].unique()

In [None]:
df["LDAPS_PPT4"].unique()

* checking for outliers

In [None]:
# box plot of every column to check for outliers
plt.figure(figsize=(40,80))
graph=1
for num in df.columns[:23]:
    plt.subplot(6,4,graph)
    sns.boxplot(x=df[num])  # pass the series explicitly as x
    plt.xlabel(num,fontsize=20)
    graph+=1
plt.show()


In [None]:
abs(df.corr()["Next_Tmin"]).sort_values(ascending=False)

In [None]:
abs(df.corr()["Next_Tmax"]).sort_values(ascending=False)

In [None]:
# visualization of each feature's correlation with the target variable
df.drop(columns=["Next_Tmax"]).corrwith(df["Next_Tmax"]).plot(kind='bar',figsize=(15,15))

In [None]:
df.drop(columns=["Next_Tmin"]).corrwith(df["Next_Tmin"]).plot(kind='bar',figsize=(15,15))

In [None]:
# interquartile range (IQR) of every column, used below to flag outliers
q1=df.quantile(0.25)
q3=df.quantile(0.75)

iqr=q3-q1
print(iqr)

In [None]:
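# OUTLIER REMOVAL: drop rows whose Present_Tmax is below the lower IQR fence (Q1 - 1.5*IQR)
out=(q1.Present_Tmax-1.5*iqr.Present_Tmax)
print(out)
index=np.where(df["Present_Tmax"]<out)[0]
df=df.drop(df.index[index])
df=df.reset_index(drop=True)  # drop=True keeps the old index from becoming a feature column
df.shape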
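
The IQR rule flags values outside the interval from Q1 - 1.5·IQR to Q3 + 1.5·IQR. Below is a small sketch of that rule on a hand-made array; the function name `iqr_bounds` and the sample values are assumptions for illustration only.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hand-made sample with one obvious outlier (50.0)
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 50.0])
low, high = iqr_bounds(values)

# Keep only the values that fall inside both fences
kept = values[(values >= low) & (values <= high)]
print(low, high, len(kept))
```

The notebook only trims the lower side of `Present_Tmax`; whether the upper fence should also be applied is a modelling choice.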

# DATA PREPROCESSING

In [None]:
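X=df.drop(columns=['Next_Tmax','Next_Tmin'])
y1=df["Next_Tmax"]   # y1: next-day maximum temperature (matches the Next_Tmax model file saved later)
y2=df["Next_Tmin"]   # y2: next-day minimum temperature (matches the Next_Tmin model file saved later)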


# standardization and normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Scale the features to the [0, 1] range
scaler = MinMaxScaler()
x_scale = scaler.fit_transform(X)
X=pd.DataFrame(x_scale,columns=X.columns)  # keep the scaled values and the column names

In [None]:
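# Yeo-Johnson power transform to reduce skew in the scaled features
pt = PowerTransformer(method='yeo-johnson')
x_transformed = pt.fit_transform(x_scale)
X=pd.DataFrame(x_transformed,columns=X.columns)  # carry the transformed features forward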
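
The cells above fit the scaler and the power transform on the full feature matrix before any train/test split, which lets test rows influence the scaling parameters. A hedged sketch of the alternative (learn the parameters from the training rows only, then reuse them on the test rows) is shown here with plain NumPy; `fit_minmax` and `apply_minmax` are hypothetical helpers, and the random data only stands in for the real features.

```python
import numpy as np

def fit_minmax(train):
    """Learn per-column minimum and range from the training rows only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return lo, span

def apply_minmax(data, lo, span):
    """Scale any rows with parameters learned from the training rows."""
    return (data - lo) / span

# Synthetic stand-in for the feature matrix (assumption: 100 rows, 4 columns)
rng = np.random.default_rng(42)
features = rng.normal(loc=25.0, scale=3.0, size=(100, 4))

# 80/20 split on shuffled row indices
idx = rng.permutation(len(features))
train_idx, test_idx = idx[:80], idx[80:]

lo, span = fit_minmax(features[train_idx])
train_scaled = apply_minmax(features[train_idx], lo, span)
test_scaled = apply_minmax(features[test_idx], lo, span)  # may fall slightly outside [0, 1]
```

With scikit-learn the same idea amounts to calling `fit_transform` on the training split and `transform` on the test split.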

# VIF for multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=pd.DataFrame()
vif["vif"]=[variance_inflation_factor(x_transformed,i) for i in range(x_transformed.shape[1])]

In [None]:
vif["Feature"]=X.columns
vif
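
For reference, the variance inflation factor of column j is 1 / (1 - R_j²), where R_j² comes from regressing column j on all the other columns. The sketch below computes it with NumPy least squares and an explicit intercept; `vif_scores` is an illustrative name and the data is synthetic. Note that the statsmodels function regresses each column on the others as given, so an intercept column generally has to be supplied by the caller.

```python
import numpy as np

def vif_scores(data):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, p = data.shape
    scores = []
    for j in range(p):
        target = data[:, j]
        others = np.delete(data, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic data: columns 0 and 1 are nearly collinear, column 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
vifs = vif_scores(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.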

# PCA is applied to remove the multicollinearity

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca=PCA(n_components=13)
principal_component=pca.fit_transform(X)  # keep the top 13 principal components
X=pd.DataFrame(principal_component)

In [None]:
X.shape
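
The notebook fixes the number of components at 13 without showing how much variance they retain. One way to check such a choice is the cumulative explained-variance ratio. The sketch below derives it from the singular values of the centered data; the helper names, the 0.95 target, and the synthetic data are all assumptions.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    singular = np.linalg.svd(centered, compute_uv=False)
    variance = singular ** 2
    return variance / variance.sum()

def components_for(data, target=0.95):
    """Smallest number of components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio(data))
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))

# Synthetic data: six columns driven by two independent latent factors plus small noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
demo = latent @ mixing + rng.normal(scale=0.01, size=(300, 6))

ratio = explained_variance_ratio(demo)
n_keep = components_for(demo, target=0.95)
print(np.round(ratio, 4), n_keep)
```

A fitted scikit-learn `PCA` object exposes the same quantity as `explained_variance_ratio_`.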

# MODEL TRAINING (Next_Tmax and Next_Tmin prediction)

In [None]:

from sklearn.metrics import r2_score,mean_squared_error


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

lr = LinearRegression()

for i in range(0, 1000):
    x_train, x_test, y_train, y_test = train_test_split(X, y1, test_size=0.2, random_state=i)
    lr.fit(x_train, y_train)
    
    pred_train = lr.predict(x_train)
    pred_test = lr.predict(x_test)
    
    train_r2 = r2_score(y_train, pred_train) * 100
    test_r2 = r2_score(y_test, pred_test) * 100
    
    if round(train_r2, 1) == round(test_r2, 1):
        print("Random state:", i)
        print("Training R² score:", train_r2)
        print("Testing R² score:", test_r2)

In [None]:
# y2 prediction
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

lr = LinearRegression()

for i in range(0, 1000):
    x_train, x_test, y_train, y_test = train_test_split(X, y2, test_size=0.2, random_state=i)
    lr.fit(x_train, y_train)
    
    pred_train = lr.predict(x_train)
    pred_test = lr.predict(x_test)
    
    train_r2 = r2_score(y_train, pred_train) * 100
    test_r2 = r2_score(y_test, pred_test) * 100
    
    if round(train_r2, 1) == round(test_r2, 1):
        print("Random state:", i)
        print("Training R² score:", train_r2)
        print("Testing R² score:", test_r2)

# chosen random states: y1=995 and y2=295
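
Choosing a `random_state` because its train and test scores happen to match uses the test split for selection, so the reported score may be optimistic. K-fold cross-validation gives an estimate that does not depend on a single split. The following is a minimal NumPy sketch of 5-fold R² for ordinary least squares; `kfold_r2` and the synthetic data are illustrative assumptions, and scikit-learn's `cross_val_score` provides the same idea for any estimator.

```python
import numpy as np

def kfold_r2(features, target, k=5, seed=0):
    """R^2 of an ordinary least squares fit on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(target)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds, with an intercept column
        design = np.column_stack([np.ones(len(train)), features[train]])
        coef, *_ = np.linalg.lstsq(design, target[train], rcond=None)
        # Score on the held-out fold
        pred = np.column_stack([np.ones(len(test)), features[test]]) @ coef
        ss_res = np.sum((target[test] - pred) ** 2)
        ss_tot = np.sum((target[test] - target[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic linear data standing in for the PCA features and a temperature target
rng = np.random.default_rng(7)
features = rng.normal(size=(200, 3))
target = features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

scores = kfold_r2(features, target)
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds makes it easier to compare models than a single hand-picked split.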

In [None]:
# split y1
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y1,random_state=995,test_size=0.2)
print("shape of x_train",x_train.shape)
print("shape of x_test",x_test.shape)
print("shape of y_train",y_train.shape)
print("shape of y_test",y_test.shape)

In [None]:
# split y2
from sklearn.model_selection import train_test_split
x_train2,x_test2,y_train2,y_test2=train_test_split(X,y2,random_state=295,test_size=0.2)
print("shape of x_train2",x_train2.shape)
print("shape of x_test2",x_test2.shape)
print("shape of y_train2",y_train2.shape)
print("shape of y_test2",y_test2.shape)

In [None]:
# import metrics
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score



In [None]:
# linear regression model for y1
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train,y_train)
pred=lr.predict(x_test)
print("score of model",r2_score(y_test,pred))
mse=mean_squared_error(y_test,pred)
print("mean absolute error",mean_absolute_error(y_test,pred))
print("RMSE",np.sqrt(mse))

# learning curve (scikit-plot)
import scikitplot as skplt
skplt.estimators.plot_learning_curve(lr,X,y1,cv=5,scoring='r2',title="LinearRegression")
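
The same three metrics (R², MAE, RMSE) are printed for every model in the cells that follow. As a reference for what each number means, here is a small NumPy sketch that computes them from their definitions; `regression_report` and the four sample temperatures are made up for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """R^2, mean absolute error and root mean squared error from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"r2": r2, "mae": mae, "rmse": rmse}

# Every prediction is off by exactly 1 degree
report = regression_report([30.0, 32.0, 28.0, 34.0], [31.0, 31.0, 29.0, 33.0])
print(report)  # r2 = 0.8, mae = 1.0, rmse = 1.0
```

MAE and RMSE are in °C for this dataset, while R² is unitless, so the error metrics are usually the easier ones to interpret for a forecast.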


In [None]:
# linear regression model for y2
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train2,y_train2)
pred=lr.predict(x_test2)
print("score of model",r2_score(y_test2,pred))
mse=mean_squared_error(y_test2,pred)
print("mean absolute error",mean_absolute_error(y_test2,pred))
print("RMSE",np.sqrt(mse))

# learning curve (scikit-plot), plotted against y2 to match this model
import scikitplot as skplt
skplt.estimators.plot_learning_curve(lr,X,y2,cv=5,scoring='r2',title="LinearRegression")

# ensemble learning

In [None]:
# random forest for y1
from sklearn.ensemble import RandomForestRegressor
RM=RandomForestRegressor()
RM.fit(x_train,y_train)
pred_rm=RM.predict(x_test)
print(r2_score(y_test,pred_rm))
print("mean absolute error",mean_absolute_error(y_test,pred_rm))
mse=mean_squared_error(y_test,pred_rm)
print("RMSE",np.sqrt(mse))

# plot
#skplt.estimators.plot_learning_curve(RM,X,y1,cv=5,scoring="r2",title="RANDOMFOREST")

In [None]:
# random forest for y2
from sklearn.ensemble import RandomForestRegressor
RM=RandomForestRegressor()
RM.fit(x_train2,y_train2)
pred_rm=RM.predict(x_test2)
print(r2_score(y_test2,pred_rm))
print("mean absolute error",mean_absolute_error(y_test2,pred_rm))
mse=mean_squared_error(y_test2,pred_rm)
print("RMSE",np.sqrt(mse))

# plot
#skplt.estimators.plot_learning_curve(RM,X,y2,cv=5,scoring="r2",title="RANDOMFOREST")

In [None]:
# gradient boosting for y1
from sklearn.ensemble import GradientBoostingRegressor
gb=GradientBoostingRegressor()
gb.fit(x_train,y_train)
pred_gb=gb.predict(x_test)
print(r2_score(y_test,pred_gb))
print("mean absolute error",mean_absolute_error(y_test,pred_gb))
mse=mean_squared_error(y_test,pred_gb)
print("RMSE",np.sqrt(mse))

# plot
# skplt.estimators.plot_learning_curve(gb,X,y1,cv=5,scoring="r2",title="Gradient_boosting")

In [None]:
# gradient boosting for y2
from sklearn.ensemble import GradientBoostingRegressor
gb=GradientBoostingRegressor()
gb.fit(x_train2,y_train2)
pred_gb=gb.predict(x_test2)
print(r2_score(y_test2,pred_gb))
print("mean absolute error",mean_absolute_error(y_test2,pred_gb))
mse=mean_squared_error(y_test2,pred_gb)
print("RMSE",np.sqrt(mse))

# plot
# skplt.estimators.plot_learning_curve(gb,X,y2,cv=5,scoring="r2",title="Gradient_boosting")

In [None]:
# XGBoost for y1
from xgboost import XGBRegressor
xgb=XGBRegressor()
xgb.fit(x_train,y_train)
pred_xgb=xgb.predict(x_test)
print(r2_score(y_test,pred_xgb))
print("mean absolute error",mean_absolute_error(y_test,pred_xgb))
mse=mean_squared_error(y_test,pred_xgb)
print("RMSE",np.sqrt(mse))

# plot
# skplt.estimators.plot_learning_curve(xgb,X,y1,cv=5,scoring="r2",title="XGBoost")

In [None]:
# XGBoost for y2
from xgboost import XGBRegressor
xgb1=XGBRegressor()
xgb1.fit(x_train2,y_train2)
pred_xgb=xgb1.predict(x_test2)
print(r2_score(y_test2,pred_xgb))
print("mean absolute error",mean_absolute_error(y_test2,pred_xgb))
mse=mean_squared_error(y_test2,pred_xgb)
print("RMSE",np.sqrt(mse))

# plot
# skplt.estimators.plot_learning_curve(xgb1,X,y2,cv=5,scoring="r2",title="XGBoost")

# xgboost model - hyperparameter tuning

In [None]:
#from sklearn.model_selection import train_test_split
#from sklearn.model_selection import GridSearchCV
#x_train,x_test,y_train,y_test=train_test_split(X,y1,test_size=0.2,random_state=995)
#params={
#    "n_estimators":range(2,100),
#    "max_depth":range(1,100),
#    "learning_rate":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8],
#}
#rsc = GridSearchCV(estimator=xgb, param_grid=params, cv=3)
#rsc.fit(x_train,y_train)
#rsc1=rsc.best_estimator_
#
#rsc1.fit(x_train,y_train)
#predf=rsc1.predict(x_test)
#print("score",r2_score(y_test,predf))
#print("mean absolute error",mean_absolute_error(y_test,predf))
# hyperparameter tuning reduced the accuracy here

# RandomForestRegressor - hyperparameter tuning

In [None]:
# random forest for y1
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
x_train,x_test,y_train,y_test=train_test_split(X,y1,test_size=0.2,random_state=995)
params={
    "n_estimators":[50,100],
    "min_samples_split":range(2,4),
    "min_samples_leaf":range(1,2,1)
    
    
    
}
rsc=RandomizedSearchCV(estimator=RM,param_distributions=params,cv=3
                      )
rsc.fit(x_train,y_train)
rsc2=rsc.best_estimator_

rsc2.fit(x_train,y_train)
predf2=rsc2.predict(x_test)


print(r2_score(y_test,predf2))
print("mean absolute error",mean_absolute_error(y_test,predf2))
mse=mean_squared_error(y_test,predf2)
print("RMSE",np.sqrt(mse))

# After hyperparameter tuning and cross-validation, random forest is our best model

In [None]:
import joblib
joblib.dump(rsc.best_estimator_,"temperature_prediction_Next_Tmax2.pkl")

In [None]:
# random forest for y2
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
x_train2,x_test2,y_train2,y_test2=train_test_split(X,y2,test_size=0.2,random_state=295)  # split on y2, not y1
params={
    "n_estimators":[50,100],
    "min_samples_split":range(2,4),
    "min_samples_leaf":range(1,2,1)
    
    
    
}
rsc=RandomizedSearchCV(estimator=RM,param_distributions=params,cv=3
                      )
rsc.fit(x_train2,y_train2)
rsc2=rsc.best_estimator_

rsc2.fit(x_train2,y_train2)
predf2=rsc2.predict(x_test2)


print(r2_score(y_test2,predf2))
print("mean absolute error",mean_absolute_error(y_test2,predf2))
mse=mean_squared_error(y_test2,predf2)
print("RMSE",np.sqrt(mse))

In [None]:
import joblib
joblib.dump(rsc.best_estimator_,"temperature_prediction_Next_Tmin2.pkl")

In [None]:
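# load the saved Next_Tmin model and predict on the y2 test split
model=joblib.load("temperature_prediction_Next_Tmin2.pkl")
pred_tmin=model.predict(x_test2)
pred_tmin_df=pd.DataFrame(pred_tmin,columns=["predicted"])  # separate name so the dataset in df is not overwritten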

In [None]:
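# load the saved Next_Tmax model and predict on the y1 test split
model=joblib.load("temperature_prediction_Next_Tmax2.pkl")
pred_tmax=model.predict(x_test)
pred_tmax_df=pd.DataFrame(pred_tmax,columns=["predicted"])  # separate name so the dataset in df is not overwritten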

In [None]:
print("end----------------------------")