DATE / nominal / Each record has a date, from 01/01/2021 to 12/31/2021

CLOUDCOVER / categorical / % / 17 different types of cloud cover. Categories are:
    "Fair", "Fair / Windy", "Partly Cloudy", "Partly Cloudy / Windy", "Cloudy", "Cloudy / Windy", "Mostly Cloudy", "Mostly Cloudy / Windy", "Fog", "Haze", "Light Rain",
    "Light Rain with Thunder", "Thunder", "Rain", "Thunder / Windy", "Heavy T-Storm", "Thunder in the Vicinity", "T-Storm"

RAINFALL / continuous / inches / Amount of rainfall for the day / from 0 to 5

MIN TEMP / continuous / Fahrenheit / Minimum temperature at 3pm / from 34 to 83

WIND SPEED / continuous / miles per hour / Wind speed at 3pm / from 0 to 29

HUMIDITY / continuous / % / Humidity at 3pm / from 0 to 100
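
The ranges in the dictionary above can be checked against the loaded data before any scoring. The printed results further down show min_temp values of 0, which fall outside the documented 34 to 83 range, so a check like this may help tell data-entry problems apart from genuinely unusual days. This is a minimal sketch: the helper `out_of_range` and the toy rows are made up for illustration, and the lowercase column names are assumed to match the ones used in the code cells.

```python
import pandas as pd

# Documented ranges from the data dictionary (treated as inclusive bounds).
expected_ranges = {
    "rainfall": (0, 5),
    "min_temp": (34, 83),
    "windspeed": (0, 29),
    "humidity": (0, 100),
}

def out_of_range(df, ranges):
    """Return a boolean frame marking values outside their documented range.

    Missing values are not flagged here; they are a separate problem.
    """
    flags = pd.DataFrame(False, index=df.index, columns=list(ranges))
    for col, (lo, hi) in ranges.items():
        flags[col] = df[col].notna() & ~df[col].between(lo, hi)
    return flags

# Toy rows for illustration only (not taken from weather_data.csv).
toy = pd.DataFrame({
    "rainfall": [0.0, 4.9, None],
    "min_temp": [65, 0, 70],
    "windspeed": [7, 7, 30],
    "humidity": [52, 88, 40],
})
flags = out_of_range(toy, expected_ranges)
print(flags.sum())  # number of out-of-range values per column
```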

In [45]:
import pandas
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import numpy as np

In [46]:
cloudcover_map = {
    "Fair": 1, "Fair / Windy": 2, "Partly Cloudy": 3, "Partly Cloudy / Windy": 4,
    "Cloudy": 5, "Cloudy / Windy": 6, "Mostly Cloudy": 7, "Mostly Cloudy / Windy": 8,
    "Fog": 9, "Haze": 10, "Light Rain": 11, "Light Rain with Thunder": 12,
    "Thunder": 13, "Rain": 14, "Thunder / Windy": 15, "Heavy T-Storm": 16,
    "Thunder in the Vicinity": 17
}
# Reverse map, used to restore the category labels in printed output later on
reverse_cloudcover_map = {v: k for k, v in cloudcover_map.items()}

# GPT had a neat idea for overwriting the cloudcover categories with numerical equivalents (I've never done it with the map method before, this is cool)
# Note: any label that is not a key in cloudcover_map becomes NaN after .map()
data = pandas.read_csv('weather_data.csv')
data["cloudcover"] = data["cloudcover"].map(cloudcover_map)
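
One thing to watch with `.map()` and a dict: a label that has no key in the dict comes back as NaN, which then looks exactly like a field that was missing in the CSV. The data dictionary lists "T-Storm", but `cloudcover_map` has no entry for it, so if that label occurs in the file it would be silently treated as missing. A rough way to check, sketched on a toy column (`toy_map` and `labels` are made up here, not the real data):

```python
import pandas as pd

# Shortened, hypothetical mapping; the real one is cloudcover_map.
toy_map = {"Fair": 1, "Cloudy": 5, "Rain": 14}

# Toy column: "T-Storm" has no entry in toy_map, and one value is already missing.
labels = pd.Series(["Fair", "T-Storm", None, "Rain", "Cloudy"])

encoded = labels.map(toy_map)

# Values that were present before mapping but are NaN afterwards were unmapped.
unmapped = labels[labels.notna() & encoded.isna()]
print(sorted(unmapped.unique()))
```

Running the same comparison on the real column before overwriting it would show whether any labels fall through the map.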

The data is missing enough random inputs throughout that I can't just run it through the k-nearest function. I've got two options: either throw out the rows with incomplete fields, or use a KNN-based imputer (KNNImputer) to guess an average value from the most similar rows and fill it in so that the row can be kept.

I'll do both just to see.
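
To make the trade-off concrete, here is a small sketch of the two options side by side on a toy frame. The values are invented for illustration and `n_neighbors=2` is picked only because the toy frame is tiny. Dropping loses rows; imputing keeps every row but fills the gaps with estimates.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame for illustration only (not taken from weather_data.csv).
toy = pd.DataFrame({
    "rainfall": [0.0, 0.1, np.nan, 2.0],
    "humidity": [40.0, 42.0, 41.0, np.nan],
})

# Option 1: drop every row that has at least one missing field.
dropped = toy.dropna()

# Option 2: fill each gap with the mean of that column over the
# n_neighbors closest rows (distance uses only the fields that are present).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(len(toy), len(dropped), len(filled))
print(filled)
```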

### for rows with incomplete fields, use best guess and fill empty fields

(Used this one for the 3 sets required by the assignment doc -- the differences between the two methods weren't huge and this one is cooler.)

In [47]:
from sklearn.impute import KNNImputer

# k = 7
# k = 14
k = 21  # The golden number for playing around with stuff

# Take fields (not date) and scale because euclidean distance is sensitive to large magnitude differences
# Ensure cloudcover is encoded properly before scaling (if it is categorical)
fields = ['rainfall', 'min_temp', 'windspeed', 'humidity', 'cloudcover']
temp = data[fields]

# Impute missing values using KNN Imputer
imputer = KNNImputer(n_neighbors=5)  # Using k=5 for imputation (independent of the outlier k above)
temp_imputed = imputer.fit_transform(temp)

# dataScaled = StandardScaler().fit_transform(temp_imputed) # StandardScaler transforms data to have a mean of 0 and std dev of 1
dataScaled = MinMaxScaler().fit_transform(temp_imputed)  # MinMaxScaler rescales each field to [0, 1] (in theory better for k-nearest)

# Use the scikit-learn library to find k-dist instead of manually calculating the k-nearest neighbors
nbrs = NearestNeighbors(n_neighbors=k).fit(dataScaled)

# distances[i] holds the distances from point i to its k nearest neighbors, sorted ascending
# indices[i] holds the row positions of those neighbors (if you want them later)
# Note: the query set is the fitted set, so each point counts itself as a neighbor at distance 0
distances, indices = nbrs.kneighbors(dataScaled)

# Calculate the outlier score (OLS) as the mean distance to the k nearest neighbors and add to data
data['OLS'] = np.mean(distances, axis=1)

# Sort by OLS to find top outliers
top_outliers = data.sort_values(by='OLS', ascending=False)
top_outliers['cloudcover'] = top_outliers['cloudcover'].map(reverse_cloudcover_map)

# Display top outliers
print(top_outliers.head(5))

# Save the entire dataset sorted by OLS to a new CSV file
# output_file = "data_w_fill_sorted_by_OLS_k7.csv"
# output_file = "data_w_fill_sorted_by_OLS_k14.csv"
output_file = "data_w_fill_sorted_by_OLS_k21.csv"
top_outliers.to_csv(output_file, index=False)
print(f"Entire dataset sorted by OLS saved to {output_file}")

         date  min_temp  rainfall  windspeed  humidity             cloudcover  \
226   8/15/21         0       4.9          7        52          Mostly Cloudy   
103   4/14/21         0       1.2          7        88          Heavy T-Storm   
299  10/27/21        65       3.6         21        36  Partly Cloudy / Windy   
210   7/30/21        79       2.3         22        52        Thunder / Windy   
178   6/28/21        75       4.7          7        87             Light Rain   

          OLS  
226  0.970020  
103  0.775861  
299  0.653069  
210  0.576821  
178  0.563471  
Entire dataset sorted by OLS saved to data_w_fill_sorted_by_OLS_k21.csv
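
A detail about the score: the cell above queries `kneighbors` with the same array the model was fitted on, so each row finds itself as a neighbor at distance 0, and the mean is taken over that zero plus k-1 real neighbors. scikit-learn's `kneighbors()` called with no argument leaves each point out of its own neighbor list. The sketch below compares the two on toy 1-D points (the points and `k = 2` are made up for illustration); it is not meant as the scoring used for the results above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 1-D points for illustration: three close together and one far away.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
k = 2

nbrs = NearestNeighbors(n_neighbors=k).fit(X)

# Querying with the training data: each point finds itself at distance 0.
dist_with_self, _ = nbrs.kneighbors(X)

# Querying with no argument: each point is excluded from its own neighbor list.
dist_without_self, _ = nbrs.kneighbors()

score_with_self = dist_with_self.mean(axis=1)
score_without_self = dist_without_self.mean(axis=1)

print(dist_with_self[:, 0])   # the self-distances
print(score_with_self)
print(score_without_self)
```

In this toy case the far-away point gets the highest score either way; the self-distance mainly shrinks the scores and means the effective number of real neighbors is k-1.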


### skip rows with incomplete fields

In [48]:
k = 5  # The golden number for playing around with stuff

# Drop rows with missing data (copy so that adding the OLS column later does not touch or warn about `data`)
data_clean = data.dropna(subset=['rainfall', 'min_temp', 'windspeed', 'humidity', 'cloudcover']).copy()

# Take fields (not date) and scale the data
fields = ['rainfall', 'min_temp', 'windspeed', 'humidity', 'cloudcover']
temp = data_clean[fields].values
# dataScaled = StandardScaler().fit_transform(temp) # StandardScaler transforms data to have a mean of 0 and std dev of 1
dataScaled = MinMaxScaler().fit_transform(temp)  # MinMaxScaler rescales each field to [0, 1] (in theory better for k-nearest)


# Use a method from the scikit-learn library to find k-dist instead of manually calculating the k-nearest neighbors
nbrs = NearestNeighbors(n_neighbors=k).fit(dataScaled)

# distances[i] holds the distances from point i to its k nearest neighbors, sorted ascending
# indices[i] holds the row positions of those neighbors (if you want them later)
# Note: the query set is the fitted set, so each point counts itself as a neighbor at distance 0
distances, indices = nbrs.kneighbors(dataScaled)

# Calculate the outlier score (OLS) as the mean distance to the k nearest neighbors and add to data_clean
data_clean.loc[:, 'OLS'] = np.mean(distances, axis=1)

# Sort by OLS to find top outliers
top_outliers = data_clean.sort_values(by='OLS', ascending=False)
top_outliers['cloudcover'] = top_outliers['cloudcover'].map(reverse_cloudcover_map)

# Display top outliers
print(top_outliers.head(5))

output_file = "data_no_fill_sorted_by_OLS.csv"
top_outliers.to_csv(output_file, index=False)
print(f"Entire dataset sorted by OLS saved to {output_file}")

         date  min_temp  rainfall  windspeed  humidity             cloudcover  \
226   8/15/21         0       4.9          7        52          Mostly Cloudy   
103   4/14/21         0       1.2          7        88          Heavy T-Storm   
299  10/27/21        65       3.6         21        36  Partly Cloudy / Windy   
210   7/30/21        79       2.3         22        52        Thunder / Windy   
41    2/11/21        40       2.6         17        86                 Cloudy   

          OLS  
226  0.705377  
103  0.508765  
299  0.456199  
210  0.436678  
41   0.364218  
Entire dataset sorted by OLS saved to data_no_fill_sorted_by_OLS.csv
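
To put a number on how similar the two rankings are, one option is to count how many dates show up in the top n of both. The sketch below does that with a hypothetical helper, `top_n_overlap`, using only the dates and OLS values from the two printed top-5 tables; running it on the full saved CSVs would be the more useful check.

```python
import pandas as pd

def top_n_overlap(a, b, n, key="date", score="OLS"):
    """Return the keys that appear in the top n of both frames, ranked by score."""
    top_a = set(a.nlargest(n, score)[key])
    top_b = set(b.nlargest(n, score)[key])
    return top_a & top_b

# Stand-ins built from the two printed top-5 tables.
with_fill = pd.DataFrame({
    "date": ["8/15/21", "4/14/21", "10/27/21", "7/30/21", "6/28/21"],
    "OLS": [0.970020, 0.775861, 0.653069, 0.576821, 0.563471],
})
no_fill = pd.DataFrame({
    "date": ["8/15/21", "4/14/21", "10/27/21", "7/30/21", "2/11/21"],
    "OLS": [0.705377, 0.508765, 0.456199, 0.436678, 0.364218],
})

shared = top_n_overlap(with_fill, no_fill, n=5)
print(len(shared), sorted(shared))
```

Four of the five dates match. Note that the fill run shown above used k = 21 and the no-fill run used k = 5, so the OLS values themselves are not directly comparable between the two tables, only the ordering.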
