# KNN Weather Classification

### Importing Libraries:

In [662]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotnine import *
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample


%matplotlib inline



### Reading the Dataset:

In [663]:
weather = pd.read_csv("seattle-weather.csv")

The "date" variable is dropped because it is stored as a string rather than a quantitative variable, which would complicate making predictions.

In [664]:
weather = weather.drop(['date'], axis=1)
weather.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   precipitation  1461 non-null   float64
 1   temp_max       1461 non-null   float64
 2   temp_min       1461 non-null   float64
 3   wind           1461 non-null   float64
 4   weather        1461 non-null   object 
dtypes: float64(4), object(1)
memory usage: 57.2+ KB


## Removing null data

Rows with missing values are removed with the dropna function so that every column has the same number of observations for training; many machine learning models raise errors when given missing values. The summary above already shows 1461 non-null entries in every column, so this step acts as a safeguard and removes no rows here.
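
Before dropping anything, it can help to count the missing values per column. The sketch below uses a small made-up DataFrame (the name toy and its values are illustrative, not taken from the weather data) to show how isna().sum() and dropna() behave together.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not the real weather data)
toy = pd.DataFrame({
    "precipitation": [0.0, 10.9, np.nan, 20.3],
    "wind": [4.7, 4.5, 2.3, np.nan],
    "weather": ["drizzle", "rain", "rain", "rain"],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

If every count is zero, as in the weather data, dropna leaves the DataFrame unchanged.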








In [687]:
weather.dropna(inplace=True)


In [688]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   precipitation  1461 non-null   float64
 1   temp_max       1461 non-null   float64
 2   temp_min       1461 non-null   float64
 3   wind           1461 non-null   float64
 4   weather        1461 non-null   object 
dtypes: float64(4), object(1)
memory usage: 57.2+ KB


## Tidying Data

The data is tidied so that it is easier to work with when training and testing the predictor.

The weather variable is converted to a categorical type, since it is the target variable for classification.

A new column, "temp_avg", is added as the mean of each day's maximum and minimum temperature. This gives one more candidate predictor to evaluate.

In [689]:
# Create a new DataFrame "weather_data":
#   - convert the weather column to a categorical variable
#   - add a "temp_avg" column (mean of temp_max and temp_min)
weather_data = weather.assign(
    weather=pd.Categorical(weather['weather']),
    temp_avg=(weather['temp_max'] + weather['temp_min']) / 2
)

weather_data

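| | precipitation | temp_max | temp_min | wind | weather | temp_avg |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | 8.90 |
| 1 | 10.9 | 10.6 | 2.8 | 4.5 | rain | 6.70 |
| 2 | 0.8 | 11.7 | 7.2 | 2.3 | rain | 9.45 |
| 3 | 20.3 | 12.2 | 5.6 | 4.7 | rain | 8.90 |
| 4 | 1.3 | 8.9 | 2.8 | 6.1 | rain | 5.85 |
| ... | ... | ... | ... | ... | ... | ... |
| 1456 | 8.6 | 4.4 | 1.7 | 2.9 | rain | 3.05 |
| 1457 | 1.5 | 5.0 | 1.7 | 1.3 | rain | 3.35 |
| 1458 | 0.0 | 7.2 | 0.6 | 2.6 | fog | 3.90 |
| 1459 | 0.0 | 5.6 | -1.0 | 3.4 | sun | 2.30 |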


To get an idea of how the weather types are distributed across the days, the number of observations of each type is counted and displayed in an HTML table with a "weather" column and an "n" column.

The groupby function counts the observations for each weather type:

In [693]:
weather_data_count = weather_data.groupby('weather').agg(n=('weather', 'count')).reset_index()

The counts are then rendered as an HTML table with a caption:

In [695]:
from IPython.display import display, HTML

# Group the data and count the number of observations for each weather type
weather_data_count = weather_data.groupby('weather').agg(n=('weather', 'count')).reset_index()

# to_html() already returns a complete <table> element, so the caption is inserted
# right after the opening <table ...> tag instead of wrapping it in a second table
table_html = weather_data_count.to_html(index=False)
caption_html = '<caption>Number of observations recorded for each type of weather</caption>'
table_with_caption_html = table_html.replace('>', '>' + caption_html, 1)

# Display the table with the caption
display(HTML(table_with_caption_html))


| weather | n |
|---|---|
| drizzle | 53 |
| fog | 101 |
| rain | 641 |
| snow | 26 |
| sun | 640 |


As the table shows, there are far fewer observations of snow, fog, and drizzle than of sun and rain. The predictor may therefore find these weather types harder to predict correctly. This observation can guide later steps, such as visualizing which weather types are misclassified most often.
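
To put the imbalance in numbers, the short sketch below computes each class's share and the accuracy of a naive model that always predicts the most common class. The counts are copied from the table above; the baseline is computed on the full dataset rather than on a test split, so it is only a rough point of comparison.

```python
# Class counts copied from the table above
counts = {"drizzle": 53, "fog": 101, "rain": 641, "snow": 26, "sun": 640}
total = sum(counts.values())

# Share of each weather type in the dataset
proportions = {label: n / total for label, n in counts.items()}
for label, share in proportions.items():
    print(f"{label:8s} {share:.3f}")

# Accuracy of a naive model that always predicts the most common class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.3f}")
```

Any trained model should comfortably beat this baseline of roughly 0.44 before its accuracy is considered meaningful.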

## Feature Selection:

Feature selection is used to determine the most relevant predictors. The date is excluded because it does not directly determine the weather, and because the goal is a model that generalizes to other locations, where seasonal trends can differ greatly. In the next step, the precipitation, temperature, and wind variables are scored against the weather label using SelectKBest with the ANOVA F-test (f_classif). This is a univariate filter rather than a stepwise forward selection, and because k=5 equals the number of candidate columns, all five features are retained.

In [698]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Separate the feature matrix and target variable
X = weather_data.drop("weather", axis=1)
y = weather_data["weather"]

# Score each feature with the ANOVA F-test and keep the top k
# (k=5 matches the number of columns in X, so every feature is kept)
selector = SelectKBest(score_func=f_classif, k=5)

X_new = selector.fit_transform(X, y)

# Get the selected feature names
selected_features = X.columns[selector.get_support(indices=True)]

# Print the selected feature names
print("Selected Features:", selected_features)

Selected Features: Index(['precipitation', 'temp_max', 'temp_min', 'wind', 'temp_avg'], dtype='object')
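
SelectKBest scores each feature on its own. A greedy forward selection, which adds one feature at a time based on cross-validated accuracy, could be sketched with scikit-learn's SequentialFeatureSelector. The example below is an illustration on synthetic data, not the notebook's method; the estimator, the number of features to keep, and the names X_demo, y_demo, and sfs are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the weather features (5 columns, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Greedily add one feature at a time, keeping the one that most improves
# 5-fold cross-validated accuracy, until 2 features are selected
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=5),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X_demo, y_demo)

# Boolean mask over the 5 columns; True marks a selected feature
selected_mask = sfs.get_support()
print("Selected column indices:", selected_mask.nonzero()[0])
```

Unlike the univariate filter, this approach evaluates features in combination, at the cost of fitting the model many times.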


# Training and Testing 

During the training phase, the model is presented with a set of labeled data to learn patterns and relationships between the input variables (predictors) and the output variable (target). The goal is for the model to generalize this knowledge and accurately predict the target variable for new, unseen data.

To assess how well the model has learned these patterns, we test it on a separate set of data that was not used in the training process. This is known as the testing phase. The testing data allows us to evaluate the model's performance on new, unseen data and to determine how well it can generalize beyond the training data.

In [671]:
from sklearn.model_selection import train_test_split

# Build the feature matrix from the selected columns and the target vector
X = weather_data[selected_features]
y = weather_data["weather"]

# Split the data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)
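
Because the classes are imbalanced, one option (not used in the split above) is to pass stratify to train_test_split so that rare classes such as snow appear in both sets in similar proportions. The sketch below rebuilds only the labels from the class counts reported earlier; the placeholder feature matrix and the names X_demo, y_demo, y_tr, and y_te are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts reported earlier
counts = {"drizzle": 53, "fog": 101, "rain": 641, "snow": 26, "sun": 640}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels matter here

# stratify=y_demo keeps the class proportions nearly identical in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=100, stratify=y_demo
)

snow_share_all = counts["snow"] / len(y_demo)
snow_share_test = np.mean(y_te == "snow")
print(f"snow share overall: {snow_share_all:.3f}, in test split: {snow_share_test:.3f}")
```

Switching the notebook's split to a stratified one would change which rows land in the test set, so the reported accuracy would likely differ.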






# Scaling

##### After creating training and testing sets, they must now be scaled:

Scaling is a common preprocessing step that transforms the features so that they share a comparable scale. This matters most for distance-based algorithms such as K-nearest neighbors, where a feature with a larger numeric range would otherwise dominate the distance calculation even if it is no more informative.

The StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so that each scaled feature has a mean of zero and a standard deviation of one. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into preprocessing.

In [700]:
# Scale the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
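
The standardization can be checked by hand. The sketch below applies z = (x - mean) / std to a few precipitation and temp_max values taken from the first rows shown earlier, treating four rows as training data and one as test data (an arbitrary choice for illustration), and compares the result with StandardScaler. The names train, test, and demo_scaler are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# precipitation and temp_max from the first five rows shown earlier
train = np.array([[0.0, 12.8], [10.9, 10.6], [0.8, 11.7], [20.3, 12.2]])
test = np.array([[1.3, 8.9]])

# Manual standardization: statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
train_manual = (train - mean) / std
test_manual = (test - mean) / std

# The same transformation with StandardScaler
demo_scaler = StandardScaler().fit(train)
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

print(np.allclose(train_manual, train_scaled), np.allclose(test_manual, test_scaled))
```

Note that the test row is scaled with the training mean and standard deviation, so its scaled values need not have mean zero.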


# Random Forest

Now that the data is scaled, it is time to train.

The RandomForestClassifier is an ensemble learning method that builds many decision trees and combines their predictions into a final prediction. Each tree is trained on a bootstrap sample of the data and considers a random subset of the features at each split, which helps reduce overfitting and improves the generalization performance of the model.

In [701]:
from sklearn.ensemble import RandomForestClassifier

# Initialize a RandomForestClassifier model
forest = RandomForestClassifier()

# Train the model on the training data using the fit method
forest.fit(X_train, y_train)
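
To make the idea behind the ensemble concrete, the sketch below builds a small forest by hand on synthetic data: each tree sees a bootstrap sample, each split considers a random subset of features, and the trees are combined by majority vote. This is a simplified illustration, not scikit-learn's implementation (which averages the trees' predicted probabilities rather than counting hard votes); the data and all names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with two classes (not the weather dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)

# Fit each tree on a bootstrap sample (rows drawn with replacement);
# max_features="sqrt" makes every split consider a random subset of features
trees = []
for seed in range(25):
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# Combine the trees by majority vote (labels here are 0 and 1; 25 trees, so no ties)
votes = np.stack([tree.predict(X_demo) for tree in trees])  # shape: (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the hand-built ensemble:", (ensemble_pred == y_demo).mean())
```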



#### With training done, the predictor can now be tested:

GridSearchCV systematically searches over a grid of hyperparameter values, using cross-validation to find the combination that gives the best model performance. Extracting the best model from the fitted GridSearchCV object and using it to make predictions on the testing data gives an estimate of how well the model generalizes to new, unseen data.
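
The cell that builds and fits the GridSearchCV object is not shown in this notebook. A minimal sketch of how such a search might be set up follows, assuming a KNeighborsClassifier (suggested by the notebook's title and imports) and a hypothetical parameter grid. It runs on synthetic data, and the names demo_search and param_grid are illustrative, so its output is not comparable to the accuracy reported below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the weather dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=100)

# Hypothetical grid: the values actually searched in the notebook are not shown
param_grid = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]}

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)  # cross-validation uses the training data only

test_accuracy = demo_search.best_estimator_.score(X_te, y_te)
print("Best parameters:", demo_search.best_params_)
print("Test accuracy: {:.2f}".format(test_accuracy))
```

Keeping the test set out of the search means the final score is measured on data that played no part in choosing the hyperparameters.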





In [703]:
# Baseline: predictions from the random forest trained above
y_pred_forest = forest.predict(X_test)

# NOTE: grid_search must be a fitted GridSearchCV object; the cell that
# creates it is not included in this notebook
# Use the best model from GridSearchCV to make predictions on the testing data
y_pred = grid_search.best_estimator_.predict(X_test)

# Print the accuracy of the model on the testing data
print("Accuracy on testing data: {:.2f}".format(grid_search.best_estimator_.score(X_test, y_test)))




Accuracy on testing data: 0.81




The output of 0.81 represents the accuracy of the best model on the testing data.

Accuracy is a common evaluation metric used for classification problems, which measures the proportion of correctly classified samples in the dataset. In this case, an accuracy of 0.81 means that the model correctly classified 81% of the samples in the testing data.

However, accuracy is not always the best metric, especially when the dataset is imbalanced, as it is here, or when the cost of misclassification differs between classes. In such cases, metrics such as precision, recall, F1 score, or area under the ROC curve (AUC-ROC) may be more appropriate for evaluating the model. These are further analyses that could be carried out on this model.
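
As a sketch of what such an analysis could look like, the example below computes a confusion matrix and per-class precision, recall, and F1 score with scikit-learn. The labels are made up for illustration and are not the notebook's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration (not the notebook's actual predictions)
labels = ["rain", "snow", "sun"]
y_true_demo = ["rain", "rain", "rain", "snow", "snow", "sun", "sun", "sun", "sun", "sun"]
y_pred_demo = ["rain", "rain", "sun", "rain", "snow", "sun", "sun", "sun", "sun", "rain"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Per-class precision, recall, and F1 score, plus macro and weighted averages
report = classification_report(y_true_demo, y_pred_demo, labels=labels, zero_division=0)
print(report)
```

In this toy example the overall accuracy is 0.70, yet only one of the two snow days is recovered, which is the kind of per-class weakness that a single accuracy figure hides.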