# Introduction

# About the Data

In [None]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

This dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/nikhil7280/weather-type-classification). It is a synthetic dataset created for students and data scientists to practice data preprocessing, feature engineering, model evaluation, and other data mining tasks. The dataset has 11 features and 13,200 samples.

In [None]:
df = pd.read_csv("weather_classification_data.csv")

In [None]:
df.head()

In [None]:
df.describe()

# Humidity ranges from 20% to 109%. This is due to the dataset being synthetic and its creator
# not constraining values to real-world ranges. The same issue occurs with precipitation.
# Visibility has a minimum of 0, which is also unrealistic for most real-world conditions. UV index in the
# dataset ranges from 0 to 14. The real-world UV index scale runs from 0 to 11+ (with 12, 13, and 14 being extremely unlikely).
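The range problems noted above can be counted rather than read off `describe()`. The following is a minimal sketch on toy rows, not the real CSV. The `bounds` dictionary is a hypothetical set of plausibility limits chosen for illustration, and the column names are assumed to match those used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows standing in for the real CSV (values invented for illustration).
toy = pd.DataFrame({
    "Humidity": [45, 109, 100, 20],
    "Precipitation (%)": [10.0, 101.0, 55.0, 0.0],
    "UV Index": [0, 14, 11, 3],
})

# Hypothetical plausibility bounds (inclusive); adjust to taste.
bounds = {
    "Humidity": (0, 100),
    "Precipitation (%)": (0, 100),
    "UV Index": (0, 11),
}

# Count the rows that fall outside each column's bounds.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Running the same dictionary comprehension against `df` would show how many rows each cleaning rule would affect before any are dropped.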

In [None]:
# plot distribution of discrete features

# Overcast conditions occurred most frequently, followed by partly cloudy, clear, and then cloudy skies.
# UV Index values are most commonly low, with frequency decreasing as the index increases.
# Winter had the highest observation count, while the other seasons had roughly equal and lower counts.
# Inland and mountain regions had similar and higher observation counts compared to coastal areas.
# All four weather types (Rainy, Cloudy, Sunny, and Snowy) had equal counts, showing a balanced distribution.

categorical_features = ['Cloud Cover', 'UV Index', 'Season', 'Location']

for cat in categorical_features:
    sns.countplot(data=df, x=cat)
    plt.title(f'Count by {cat}')
    plt.show()

# Pre-Processing

In [None]:
# one-hot encoding
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[categorical_features])
encoded = encoder.transform(df[categorical_features])
column_names = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded, columns=column_names, index=df.index)
non_categorical = df.drop(columns=categorical_features)

# merge
processed_df = pd.concat([non_categorical, encoded_df], axis=1)

### Correcting issues with synthetic data
![Weatherman!](weatherman.png)

In [None]:
# remove rows with precipitation over 100%

processed_df = processed_df[processed_df["Precipitation (%)"] <= 100]
processed_df["Precipitation (%)"].describe()

In [None]:
processed_df.head()

In [None]:
# count nulls
# data is synthetic so should be 0
processed_df.isnull().sum()

In [None]:
# Removing temperature outliers
processed_df = processed_df[processed_df['Temperature'] < 56]

processed_df.head()

In [None]:
# generate a basic correlation matrix to visualize relationships

correlation_matrix = processed_df.drop(columns=["Weather Type"]).corr()
# Plot heatmap for better visualization
plt.figure(figsize=(15, 9))
sns.heatmap(correlation_matrix, annot=True, cmap="inferno", fmt=".2f", linewidths=0.5)
plt.show()

# print 10 values with highest (absolute value) correlation
corr_abs = correlation_matrix.abs()
upper_triangle = corr_abs.where(np.triu(np.ones(corr_abs.shape), k=1).astype(bool))
top10 = upper_triangle.unstack().sort_values(ascending=False).head(10)

print("Top 10 highest (absolute) correlations:")
print(top10)

The correlations make logical sense, showing the dataset isn't completely random even though it's synthetic. This indicates we may be able to train a useful model.

## Boxplots

In [None]:
sns.boxplot(data=df[['Temperature', 'Humidity', 'Wind Speed']])
plt.show()

The box plot compares Temperature, Humidity, and Wind Speed. Temperature has a large range
(~4 to 32) and many outliers above 70 °C. Humidity is symmetric with few to no outliers after
processing. Wind speed has a low median value but a high number of outliers.
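The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which is the default whisker length for these plots. As a rough sketch, the outliers can be counted directly. The helper `count_iqr_outliers` and the toy series are hypothetical and not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, whisker: float = 1.5) -> int:
    """Count values beyond whisker * IQR from the first and third quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((values < lower) | (values > upper)).sum())

# Toy wind speeds with one extreme value (invented for illustration).
toy_wind = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 48])
n_outliers = count_iqr_outliers(toy_wind)
print(n_outliers)
```

Applying the helper to each numeric column would put a number on "many" and "few" in the description above.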

# Methods

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

X = processed_df.drop(columns=["Weather Type"])
y = label_encoder.fit_transform(processed_df["Weather Type"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
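`LabelEncoder` replaces the class names with integers, so it helps to know which integer means which weather type. The sketch below uses a short toy list of labels (an assumption for illustration) to show that classes are numbered in sorted order and can be decoded again.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the four weather types named in this notebook.
toy_labels = ["Rainy", "Cloudy", "Sunny", "Snowy", "Rainy"]

toy_label_encoder = LabelEncoder()
codes = toy_label_encoder.fit_transform(toy_labels)

# classes_ is sorted, so the position of each name is its integer code.
label_map = {str(name): int(code) for code, name in enumerate(toy_label_encoder.classes_)}
decoded = toy_label_encoder.inverse_transform(codes)
print(label_map)
```

Reading the names from `classes_` avoids hard-coding the label order when plotting confusion matrices later.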

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
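kNN classifies by distance, so features on larger numeric scales can dominate the neighbor search. One possible variation, not what the cell above does, is to standardize features inside a pipeline. The sketch below uses toy data from `make_classification` as a stand-in for the weather features; the data and the inflated column are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy four-class data standing in for the weather features.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
# Inflate one column so the features sit on very different scales.
X_toy[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The scaler is fit on the training split only, then applied to the test split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_tr, y_tr)
scaled_accuracy = scaled_knn.score(X_te, y_te)
print(f"Scaled kNN accuracy on toy data: {scaled_accuracy:.4f}")
```

Whether scaling helps on the weather data is an empirical question; comparing the two accuracies on the same split would answer it.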

In [None]:
# generate a graph for k values

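k_values = range(1, 30)
accuracy = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy.append(accuracy_score(y_test, y_pred))

# plot against the actual k values so the x-axis starts at 1 rather than 0
plt.plot(k_values, accuracy, marker='o')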
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Accuracy for K Values')
plt.ylim(0.85, .9)
plt.grid(True)
plt.show()
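The loop above scores each k on the test set, so the best-looking k is partly tuned to that particular split. A common alternative is to choose k by cross-validation on the training data and keep the test set for a final check. The following is a sketch of that idea on toy data; the dataset and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy four-class data standing in for the training split.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

k_candidates = list(range(1, 16))
# Mean 5-fold cross-validated accuracy for each candidate k.
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_toy, y_toy, cv=5).mean()
    for k in k_candidates
]

best_k = k_candidates[int(np.argmax(mean_scores))]
print(f"Best k by cross-validation on toy data: {best_k}")
```

On the real data, `X_train` and `y_train` would take the place of the toy arrays.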

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

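# y_pred currently holds the predictions of the last model fit in the loop above (k=29)
# LabelEncoder sorts its classes, so classes_ gives the labels in code order
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=label_encoder.classes_, cmap='Blues')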
plt.title('kNN Confusion Matrix')
plt.show()

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

dt = DecisionTreeClassifier(criterion='gini', max_depth=3)

dt.fit(X_train, y_train)

print(f"Accuracy Score: {dt.score(X_test, y_test)}")

Now we plot the decision tree we just made, confirming it has a max depth of 3, and seeing what features are being used to do the classification.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 12), dpi=300)
# pass the feature names as a plain list, which is accepted across scikit-learn versions
plot_tree(dt, max_depth=3, feature_names=list(X.columns), filled=True, ax=ax)
plt.show()
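The same two checks, the depth of the tree and which features it splits on, can also be done without a plot. This sketch fits a depth-3 tree on toy data; the data and the `feature_{i}` names are hypothetical placeholders for the weather columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy four-class data standing in for the weather features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_toy.shape[1])]  # hypothetical names

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_toy, y_toy)

# get_depth() reports the fitted depth, which cannot exceed max_depth.
depth = tree.get_depth()

# Features with non-zero importance are the ones used in at least one split.
used_features = {
    name: round(float(importance), 3)
    for name, importance in zip(feature_names, tree.feature_importances_)
    if importance > 0
}
print(depth, used_features)
```

With the notebook's own objects, `dt.get_depth()` and `dt.feature_importances_` paired with `X.columns` would give the equivalent summary.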

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = dt.predict(X_test)

cf = confusion_matrix(y_test, y_pred)

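# fmt='d' prints integer counts; tick labels come from the fitted label encoder
sns.heatmap(cf, annot=True, fmt='d', square=True,
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()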
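A confusion matrix also yields per-class precision and recall, which show which weather types a model confuses even when overall accuracy looks good. The sketch below works through a small example; the toy label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted class codes for four classes (0..3).
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0, 3, 3])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Recall: correct predictions divided by the row total (all true members).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predicted members).
precision = cm.diagonal() / cm.sum(axis=0)

print("recall:", recall)
print("precision:", precision)
```

For the models above, `cf` from the previous cell can be substituted for `cm`.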

# Evaluation

# Impact

Our project can have both positive and negative impacts. One negative is that our dataset is synthetic: it mimics real-world data rather than recording it. Because the data may be limited or may not accurately represent real weather, the project's results could be seen as unreliable. Beyond that, unreliable weather-prediction models can erode public trust in meteorologists, so accuracy is very important. On the positive side, this project could contribute to better weather forecasting through improved prediction outcomes, provided the synthetic data is supplemented with real-world data. Better forecasts can benefit both space exploration and the lives of the average person. Other potential benefits include building more advanced models for more complex weather patterns and saving lives through accurate predictions.

### GitHub Repository/Code/Data
https://github.com/ajebril1/weather_prediction