# Clustering and Dimensionality Reduction on Weather Data from Zurich

## Preparations
We first load the relevant libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

The weather data are openly available from the city of Zurich at [https://data.stadt-zuerich.ch](https://data.stadt-zuerich.ch). We download them directly into a pandas DataFrame:

In [None]:
weather_data = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/download/ugz_ogd_meteo_h1_2023.csv")
weather_data.head(20)

We will focus on the data from the site `Zch_Stampfenbachstrasse`, which is close to the ETH main building:

In [None]:
wd_rel = weather_data[weather_data['Standort'] == 'Zch_Stampfenbachstrasse']
wdp = wd_rel.pivot(index=['Datum', 'Standort'], columns='Parameter', values='Wert')
wdp.head()

We see that there is one set of records every hour.
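
The reshaping from the long format (one row per measurement) to the wide format (one row per timestamp, one column per parameter) can be illustrated on a toy table. This is only a sketch: the timestamps `t0`/`t1`, the site `A`, and the values are made up and merely mimic the column layout of the Zurich data.

```python
import pandas as pd

# Hypothetical long-format table: one row per (time, site, parameter)
long_df = pd.DataFrame({
    "Datum": ["t0", "t0", "t1", "t1"],
    "Standort": ["A", "A", "A", "A"],
    "Parameter": ["T", "Hr", "T", "Hr"],
    "Wert": [1.5, 80.0, 2.0, 75.0],
})

# Wide format: one row per (time, site), one column per parameter
wide_df = long_df.pivot(index=["Datum", "Standort"], columns="Parameter", values="Wert")
print(wide_df)
```

Each distinct value of `Parameter` becomes its own column, which is the layout that the scaling, clustering, and PCA steps expect.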

In [None]:
wdp.describe()

For simplicity, we limit ourselves to the five attributes humidity, rain duration, temperature, wind speed, and air pressure. We drop the other attributes as well as all rows with missing values, and rename the remaining columns to English terms:

In [None]:
X = wdp.reset_index().drop(['Datum', 'Standort', 'StrGlo', 'WD', 'WVv'], axis=1).dropna().\
    rename(columns={'Hr': 'Humidity', 'RainDur': 'RainDuration', 'T': 'Temperature', 'WVs': 'WindSpeed', 'p': 'AirPressure'})

In [None]:
X.describe()

## Clustering with K-Means
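To identify typical weather situations, we apply the K-Means clustering algorithm. K-Means is based on Euclidean distances, so attributes measured on very different scales (e.g. air pressure and rain duration) must first be brought to a comparable scale.

### Standardize data
We standardize each attribute to zero mean and unit variance with `StandardScaler`: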

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
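
To make explicit what the standardization does, here is a minimal NumPy sketch on made-up numbers (the array `X_toy` is hypothetical and not taken from the weather data). It subtracts the column mean and divides by the column standard deviation, which corresponds to the default behaviour of `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 attributes on very different scales
X_toy = np.array([[10.0, 950.0],
                  [12.0, 960.0],
                  [14.0, 970.0],
                  [16.0, 980.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)

# z-score: every column ends up with mean 0 and standard deviation 1
X_toy_scaled = (X_toy - mean) / std
print(X_toy_scaled)
```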

### Clustering data with K-Means: K=4
We choose (for no particular reason) K=4:

In [None]:
kmeans4 = KMeans(n_clusters=4, n_init=30, random_state=42)
kmeans4.fit(X_scaled)
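
For intuition about what happens during fitting, the following sketch implements the basic alternating scheme of K-Means (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain NumPy. It is a simplified illustration, not the scikit-learn implementation: the function name `kmeans_sketch`, the fixed number of iterations, and the toy blobs are assumptions made for this example, and the smarter initialization and repeated restarts (`n_init`) of `KMeans` are left out.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, n_iter=20):
    """Hypothetical helper: plain alternating K-Means updates for illustration."""
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Two well-separated toy blobs (made-up data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[-5.0, -5.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])

# Start from one data point of each blob
centroids, labels = kmeans_sketch(X_toy, init_centroids=X_toy[[0, -1]])
print(centroids)
```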

#### Cluster centers
We want to understand the clusters, i.e. to identify the corresponding weather situations. To do this, we transform the `cluster_centers_` back to the original scale, i.e. we undo the standardization with the method `scaler.inverse_transform(...)`. We also add the original attribute names back:

In [None]:
centroids4 = kmeans4.cluster_centers_

# reverse transformation
centroids4_original = scaler.inverse_transform(centroids4)

# Add the original column names to interpret the centroids
centroids4_df = pd.DataFrame(centroids4_original, columns=X.columns)

In [None]:
centroids4_df
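
Undoing the standardization amounts to multiplying by the per-attribute standard deviation and adding the per-attribute mean. The sketch below shows this on hypothetical numbers; `mean`, `std`, and `centroids_scaled` are made up and stand in for the quantities stored in the fitted scaler and the K-Means model.

```python
import numpy as np

# Hypothetical per-attribute statistics, as a fitted scaler would store them
mean = np.array([12.0, 960.0])
std = np.array([2.0, 10.0])

# Hypothetical centroids in standardized units
centroids_scaled = np.array([[-1.0, 0.5],
                             [1.0, -0.5]])

# Inverse of the z-score: back to the original units
centroids_original = centroids_scaled * std + mean
print(centroids_original)
```

A centroid value of -1.0 in the first column thus reads as "one standard deviation below the mean" of that attribute, which is 10.0 in original units here.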

We also query the number of data points per cluster:

In [None]:
pd.Series(kmeans4.labels_).value_counts()

#### Interpretation of the clusters

**EXERCISE**: Interpret the cluster centers. To which weather situations do they correspond?

### Clustering data with K-Means: K=5
For comparison, we also run k-Means with K=5:

**EXERCISE**: Run K-Means with K=5. Follow the example above for K=4. Store the fitted model in the variable `kmeans5` and the obtained centroids in the original scale as a DataFrame in the variable `centroid5_df`.

In [None]:
# ...

# centroid5_df = 

Again, we will look at the centroids and how many hours are attributed to each cluster:

In [None]:
centroid5_df

In [None]:
pd.Series(kmeans5.labels_).value_counts()

## Principal Component Analysis (PCA)
We now try to better understand the data based on principal component analysis (PCA).

### Calculating the PCA

In [None]:
pca_weather = PCA().fit(X_scaled)
pca_weather_trans = pca_weather.transform(X_scaled)
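
As a rough sketch of what a PCA fit computes, the same quantities can be obtained from a singular value decomposition of the centered data. The toy data below are made up, and the variable names merely mirror the attributes `components_` and `explained_variance_ratio_`; the signs of individual components may differ from what scikit-learn returns.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 attributes, two of them strongly correlated
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([latent, 2 * latent, rng.normal(size=(200, 1))])
X_toy = X_toy + 0.1 * rng.normal(size=X_toy.shape)

# Center the data and decompose it
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                                   # rows are principal directions
explained_variance = S**2 / (len(X_toy) - 1)      # variance along each direction
explained_variance_ratio = explained_variance / explained_variance.sum()
scores = X_centered @ Vt.T                        # data expressed in the new axes

print(explained_variance_ratio)
```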

### Explained variance per component

In [None]:
plt.bar(range(1, 1+len(pca_weather.explained_variance_ratio_)), pca_weather.explained_variance_ratio_, color='b', label='per component')
plt.plot(range(1, 1+len(pca_weather.explained_variance_ratio_)), np.cumsum(pca_weather.explained_variance_ratio_), 'r-', label='cumulative')
plt.grid(True)
plt.legend()
plt.xlabel('Principal Component')
plt.ylabel('Ratio of Explained Variance')
plt.show()

**EXERCISE**:

* How much of the variance is explained by the first component?

* How many components are needed to explain 80% of the variance?
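
Questions of this kind can also be answered programmatically from the cumulative sum of the ratios. The sketch below uses hypothetical ratios (they are not the values of the weather data) and a made-up variable name `n_components_needed`.

```python
import numpy as np

# Hypothetical explained-variance ratios, NOT the values of the weather data
ratios = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
threshold = 0.80

cumulative = np.cumsum(ratios)
# Index of the first entry that reaches the threshold, plus 1 to get a count
n_components_needed = int(np.argmax(cumulative >= threshold)) + 1
print(n_components_needed)
```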

### Visualisation of the Loadings
Next, we plot the loadings matrix to get an understanding of the individual components:

In [None]:
fig, ax = plt.subplots()
# components_ has shape (n_components, n_features); transpose it so that
# rows (y-axis) are features and columns (x-axis) are components
loadings = pca_weather.components_.T
im = ax.imshow(loadings)

components = ['c1', 'c2', 'c3', 'c4', 'c5']
features = X.columns

ax.figure.colorbar(im)

plt.xticks(ticks = range(len(components)), labels = components)
plt.yticks(ticks = range(len(features)), labels = features)
plt.grid(False)

# Loop over data dimensions and create text annotations.
for i in range(len(components)):
    for j in range(len(features)):
        text = ax.text(i, j, np.round(loadings[j, i], 1),
                       ha="center", va="center", color="w")
plt.title('Visualisation of Loadings')
plt.show()

**EXERCISE**:

* Which attributes are most important for the first component?
* Which component is most influenced by the rain duration?

## Comparison of the typical weather situations found
We can of course compare the weather situations found directly in the tables of centroids. Alternatively, we can use PCA to visualize the cluster centers of the two K-Means results. To do so, we project both sets of centroids with the PCA that we fitted above on all weather measurements.
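
Projecting new points with an already fitted PCA means subtracting the mean of the training data and multiplying by the stored principal directions. The following NumPy sketch illustrates this idea on made-up data; `X_toy` and `centers_toy` are hypothetical and are not the weather centroids.

```python
import numpy as np

# Hypothetical training data with very different variance per attribute
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])

# "Fit": store the mean and the principal directions of the training data
mean = X_toy.mean(axis=0)
_, _, Vt = np.linalg.svd(X_toy - mean, full_matrices=False)

# "Transform": project new points (hypothetical cluster centers) with the stored fit
centers_toy = np.array([[2.0, 0.0, 0.0],
                        [-2.0, 1.0, 0.0]])
centers_2d = (centers_toy - mean) @ Vt[:2].T  # keep the first two components
print(centers_2d)
```

Because the same mean and directions are reused, points projected in separate calls end up in the same coordinate system and can be drawn in one plot.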

**EXERCISE**: The code below projects the centroids from K-Means with K=4 into PCA space and plots them along the first two principal components. Extend this code so that it also plots the centroids from K-Means with K=5, using the same two-dimensional projection. Interpret the result.

In [None]:
# Transform centroids:
centroids4_pca = pca_weather.transform(centroids4)

# make a plot
plt.plot(centroids4_pca[:, 0], centroids4_pca[:, 1], 'k*', label='K-Means, K=4')
plt.xlabel('First PCA Component')
plt.ylabel('Second PCA Component')
plt.legend()
plt.grid()
plt.show()