<a href="https://colab.research.google.com/github/fedhere/MLTSA_FBianco/blob/main/Code%20examples/timeSeriesClustering_populationexample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering analysis on population trends

You are clustering the "shape" of time series to find trends, specifically population growth trends. Do any countries stand out in their population growth trends over the past 60 years? Are there groups of countries that have similar trends (and why)?

NOTE: your clusters may not be identical to mine!
## Imports

In [None]:
import pandas as pd
import pylab as pl
import numpy as np

from sklearn import preprocessing
from sklearn import cluster

pl.rcParams['font.size'] = 12

Run this to install the package that pandas needs to read an Excel file with Python,
then restart the notebook.

In [None]:
# needed to read the Excel file with pandas
!pip install xlrd==2.0.1

# Data processing

## Get the data

I wanted the data to be read from the World Bank API directly, but the link is down tonight (11/2), so I put the file on the shared drive. Mount your Google Drive and get it from `/content/drive/Shareddrives/PUS2022/data`. The file name is `SP.POP.TOTL?downloadformat=excel`.

You will have to skip some rows (`skiprows=`) and ideally use only the relevant columns (the country name and each year column from 1960). You can use `usecols=`, or you can read everything in and then throw away the columns you do not need.

Finally, set the country name as the index of this dataframe. You can do that with `set_index()`, passing the relevant column name as the argument (don't forget that you want to do it in place: `inplace=True`).
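
As a minimal sketch of these keywords, the example below applies `skiprows=`, `usecols=`, and `set_index()` to a tiny made-up table. It uses `pd.read_csv` on an in-memory string so that it runs without the World Bank file; `pd.read_excel` accepts the same `skiprows=` and `usecols=` arguments. The table contents and the names `raw`, `wanted`, and `toy_df` are invented for illustration.

```python
import io
import pandas as pd

# a made-up stand-in for the World Bank file: two junk rows, then the header
raw = io.StringIO(
    "Data Source,World Development Indicators\n"
    "Last Updated,(made up)\n"
    "Country Name,Country Code,1960,1961,1962\n"
    "Aland,AAA,10,11,12\n"
    "Borduria,BBB,20,19,18\n"
)

# keep only the country name and the year columns
wanted = ["Country Name"] + [str(year) for year in range(1960, 1963)]

toy_df = pd.read_csv(raw, skiprows=2, usecols=wanted)
toy_df.set_index("Country Name", inplace=True)
print(toy_df)
```

The `Country Code` column is dropped at read time, so no separate column selection is needed afterwards.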

In [None]:
#reading in the data
pop_df = pd.read_excel('https://github.com/fedhere/MLTSA_FBianco/raw/refs/heads/main/data/SP.POP.TOTL_downloadformat=excel', skiprows=3)
# the country name plus one column per year, 1960 through 2020 (the year headers are strings)
columns = ['Country Name'] + [str(year) for year in range(1960, 2021)]
pop_df = pop_df[columns]
pop_df.set_index("Country Name", inplace=True)
pop_df

## Data cleaning

Remove NaNs and any unwanted columns.

In [None]:
import missingno as mno
mno.matrix(pop_df)

In [None]:
pop_df.loc[pop_df.isnull().sum(axis=1) > 20]

In [None]:
# I recommend you drop any column that is all NaN and any row that has any NaN
# you control this (dropping only if all are, vs dropping if any is) with the keyword "how" of .dropna()
pop_df_clean = pop_df.dropna(axis=1, how="all")
pop_df_clean = pop_df_clean.dropna(axis=0, how="all")
pop_df_clean.shape


In [None]:
pop_df_clean.dropna(axis=0, how="any", inplace=True)
pop_df_clean.shape


In [None]:
# looking a little at the data
print (f"In the cleaning process we lost {np.array(pop_df.shape) - np.array(pop_df_clean.shape)} (rows, columns)")
# fraction of cells lost = 1 - (cells kept / cells in the original table)
print ("In the cleaning process we lost  {:.2f}% of the data".format(
    100 * (1 - np.prod(pop_df_clean.shape) / np.prod(pop_df.shape))))

In [None]:
mno.matrix(pop_df_clean)

Consider improving this! Can you fill NaN values with interpolation or nearest neighbours? What are the pros and cons of each choice?
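
One possible starting point, sketched on made-up numbers: `DataFrame.interpolate` with `axis=1` fills gaps along each row (each country's time series) by linear interpolation between the neighbouring years. With `limit_direction="both"`, gaps at the start or end of a series are filled with the nearest valid value, which is a strong assumption for a growing population. The table `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[1.0, np.nan, 3.0, 4.0],
     [np.nan, 20.0, np.nan, 40.0]],
    index=["A", "B"],
    columns=["1960", "1961", "1962", "1963"],
)

# interpolate along each row (axis=1), i.e. along time for each country
filled = toy.interpolate(method="linear", axis=1, limit_direction="both")
print(filled)
```

Compared with dropping rows, this keeps every country in the sample, at the cost of inventing values that the clustering will treat as real observations.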

In [None]:
pop_df_clean.head()

In [None]:
pop_df_clean.tail()

In [None]:
ax = pop_df_clean.T.plot(legend=False);
ax.set_title("Original data")

Figure 1: the time series of population for 258 countries. Clearly the overall population size dominates the differences between series, although general growth trends are still visible.

In [None]:
# prompt: cluster the time series in pop_clean with kmeans into 3 clusters

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

## Scale the data
#scaler = StandardScaler()
#scaled_data = scaler.fit_transform(pop_df_clean)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)  # Set random_state for reproducibility
kmeans.fit(pop_df_clean)

# Add cluster labels to the DataFrame
pop_df_clean['cluster'] = kmeans.labels_

pop_df_clean.head()


In [None]:
ax = pop_df_clean[pop_df_clean.cluster == 0].drop("cluster", axis=1).T.plot(legend=False, color="k")
pop_df_clean[pop_df_clean.cluster == 1].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="r")
pop_df_clean[pop_df_clean.cluster == 2].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="b")


In [None]:
import matplotlib.pyplot as plt
from tqdm import tqdm

# plot the raw series only: drop the "cluster" label column added above
pop_raw = pop_df_clean.drop("cluster", axis=1)
fig, ax = plt.subplots(20, 1, figsize=(10,20))
for i,idx in tqdm(enumerate(pop_raw.index[:20])):
   pop_raw.loc[idx].plot(ax=ax[i])
   ax[i].axis('off')
   ax[i].text(0, pop_raw.loc[idx, "1960"], idx)



Figure 2: the first 20 time series in the collection shown in Figure 1, with the axes turned off so as to display the differences in trend rather than the overall normalization. General growth trends are obvious, but specific trends are also visible: e.g. Bulgaria, Armenia, and Albania have a population drop, while most countries have a steady increase. Clustering without normalizing did not capture this: the clusters were determined by the mean population size.
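
A small sketch of why this happens, on made-up series: k-means groups points by Euclidean distance, so when the series differ in size by orders of magnitude, the size decides the clusters. Below, two small and two large "countries" each include one growing and one declining series. On the raw values the clusters separate small from large; after each series is standardized to zero mean and unit variance they separate growing from declining. The numbers, the variable names, and `n_init=10` are choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing

# rows: small growing, small declining, large growing, large declining
X_toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [1000.0, 1010.0, 1020.0, 1030.0],
    [1030.0, 1020.0, 1010.0, 1000.0],
])

# clustering the raw values: the magnitude dominates the distances
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy).labels_

# clustering after standardizing each row (each time series)
X_rows = preprocessing.scale(X_toy, axis=1)
shape_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_rows).labels_

print("raw values:       ", raw_labels)
print("standardized rows:", shape_labels)
```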

In [None]:

## Scale the data
# standardizing the data
X= pop_df_clean.drop("cluster", axis=1).values
scaled_data = preprocessing.scale(X, axis=0, with_mean = True, with_std = True)

pod_standardized_bycol = pd.DataFrame(data=scaled_data, index = pop_df_clean.index, columns=pop_df_clean.drop("cluster", axis=1).columns )

fig, ax = plt.subplots(20, 1, figsize=(10,20))
for i,idx in tqdm(enumerate(pod_standardized_bycol.index[:20])):
   pod_standardized_bycol.loc[idx].T.plot(ax=ax[i])
   ax[i].axis('off')
   ax[i].text(0, pod_standardized_bycol.loc[idx, "1960"], idx)

#pod_standardized_bycol.head()

Figure 3: the shape has changed! The time series have lost their meaning: the clustering will be based on the value in each year relative to the mean of all time series in that year.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,5))
ax[0] = pop_df_clean.drop("cluster", axis=1).T.plot(legend=False, ax=ax[0]);
ax[0].set_title("Original data")
ax[1] = pod_standardized_bycol.T.plot(legend=False, ax=ax[1]);
ax[1].set_title("Standardized by column (feature)")

In [None]:
# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)  # Set random_state for reproducibility
kmeans.fit(scaled_data)

# Add cluster labels to the DataFrame
pod_standardized_bycol['cluster'] = kmeans.labels_

pod_standardized_bycol.head()

ax = pod_standardized_bycol[pod_standardized_bycol.cluster == 0].drop("cluster", axis=1).T.plot(legend=False, color="k")
pod_standardized_bycol[pod_standardized_bycol.cluster == 1].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="r")
pod_standardized_bycol[pod_standardized_bycol.cluster == 2].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="b")


Figure 4: having standardized incorrectly, the clustering is still based on the overall average population size of each country.

## Scaling (standardizing) correctly!

In [None]:

## Scale the data
# standardizing the data
X= pop_df_clean.drop("cluster", axis=1).values
scaled_data = preprocessing.scale(X, axis=1, with_mean = True, with_std = True)

pod_standardized_byrow = pd.DataFrame(data=scaled_data, index = pop_df_clean.index, columns=pop_df_clean.drop("cluster", axis=1).columns )

fig, ax = plt.subplots(20, 1, figsize=(10,20))
for i,idx in tqdm(enumerate(pod_standardized_byrow.index[:20])):
   pod_standardized_byrow.loc[idx].T.plot(ax=ax[i])
   ax[i].axis('off')
   ax[i].text(0, pod_standardized_byrow.loc[idx, "1960"], idx)

#pod_standardized_byrow.head()

Figure 5: in this plot the time series again look like those in Figure 2, because each time series has been rescaled but its shape has not changed.
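
For reference, standardizing along `axis=1` means subtracting each row's own mean and dividing by each row's own standard deviation, while `axis=0` uses the mean and standard deviation of each column. The sketch below writes both out with NumPy on made-up numbers (`X_toy`, `by_row`, and `by_col` are hypothetical names): two series with the same shape but different sizes become identical when standardized by row, and stay different when standardized by column.

```python
import numpy as np

# two made-up series with the same shape, one 100 times larger, plus a declining one
X_toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [100.0, 200.0, 300.0, 400.0],
    [4.0, 3.0, 2.0, 1.0],
])

# by row: each time series gets mean 0 and standard deviation 1
by_row = (X_toy - X_toy.mean(axis=1, keepdims=True)) / X_toy.std(axis=1, keepdims=True)

# by column: each year gets mean 0 and standard deviation 1 across the series
by_col = (X_toy - X_toy.mean(axis=0, keepdims=True)) / X_toy.std(axis=0, keepdims=True)

print(np.allclose(by_row[0], by_row[1]))  # do the two same-shape rows coincide by row?
print(np.allclose(by_col[0], by_col[1]))  # do they coincide by column?
```

`np.std` uses the population standard deviation by default, which is also what `preprocessing.scale` uses.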


In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15,5))
ax[0] = pop_df_clean.drop("cluster", axis=1).T.plot(legend=False, ax=ax[0]);
ax[0].set_title("Original data")
ax[1] = pod_standardized_bycol.drop("cluster", axis=1).T.plot(legend=False, ax=ax[1]);
ax[1].set_title("Standardized by column (feature)")
ax[2] = pod_standardized_byrow.T.plot(legend=False, ax=ax[2]);
ax[2].set_title("Standardized by row (time series)")


Figure 6: now the time series are standardized correctly! We can see differences in trends and cluster according to those.

In [None]:
# looking at the data
plt.plot(pod_standardized_byrow.T, color="k", alpha=0.2)
plt.xlabel("year")
plt.ylabel("standardized population")
plt.xticks(range(0,70,10), ["%d"%i for i in range(1960, 2030, 10)]);

Figure 7: These figures show changes in population by year. The image on the left shows the population (in billions) of different countries (each country represented by a color) from 1960 to 2020. The figure on the right shows the population of each country in each year in standardized units. Different trends are visible, including near-linear growth, rise and fall, and some dramatic drops at different times.

In [None]:
# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)  # Set random_state for reproducibility
kmeans.fit(scaled_data)

# Add cluster labels to the DataFrame
pod_standardized_byrow['cluster'] = kmeans.labels_

pod_standardized_byrow.head()

ax = pod_standardized_byrow[pod_standardized_byrow.cluster == 0].drop("cluster", axis=1).T.plot(legend=False, color="k", alpha=0.2)
pod_standardized_byrow[pod_standardized_byrow.cluster == 1].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="r", alpha=0.2)
pod_standardized_byrow[pod_standardized_byrow.cluster == 2].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color="b", alpha=0.2)


Figure 8: Clustering the time series after correct standardization shows three distinct trends: convex growth, with increasingly rapid growth in the 2000s; concave growth, with decreased growth speed in the 1990s; and a subset of countries whose population size decreased after 1990.
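
One way to check a description like this is to look at the cluster centroids, which `KMeans` stores in `cluster_centers_`: each centroid is the average standardized series of its cluster. The sketch below uses synthetic series (convex, concave, and rise-and-fall shapes with a little noise), not the population data; all names and numbers in it are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 61)  # 61 time steps, like 1960 through 2020

# three synthetic families of series, 10 noisy copies each
shapes = [t**2, np.sqrt(t), t * (1.2 - t)]
X_syn = np.vstack([s + rng.normal(0, 0.02, size=(10, t.size)) for s in shapes])

# standardize each series, then cluster
X_syn = preprocessing.scale(X_syn, axis=1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_syn)

print("cluster sizes:", np.bincount(km.labels_))
# change between the first and last time step of each centroid
print("centroid end minus start:", km.cluster_centers_[:, -1] - km.cluster_centers_[:, 0])
```

Plotting the rows of `cluster_centers_` against the years gives one representative curve per cluster.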


In [None]:
!wget https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.geojson


In [None]:
import geopandas as gpd
countriesshp = gpd.GeoDataFrame.from_file("geoBoundariesCGAZ_ADM0.geojson")


In [None]:
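# replace "&" with "and" inside the country names (plain substring replacement) and keep the result
countriesshp["shapeName"] = countriesshp["shapeName"].str.replace("&", "and", regex=False)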
countriesshp.sort_values(by="shapeName").shapeName.values

In [None]:
# count the countries in the population table that have no matching name in the shapefile
count = 0
for i in pod_standardized_byrow.index:
  if i not in countriesshp.shapeName.values:
    count += 1
    print(i, count)

In [None]:
fig,ax = pl.subplots(1,1, figsize=(10,10))

ax.set_title("Cluster 1 ")
ax.set_xticks(range(0,70,10))
ax.set_xticklabels(["%d"%i for i in range(1960, 2030, 10)]);
ax.plot(pod_standardized_byrow[pod_standardized_byrow.cluster == 1 ].drop("cluster", axis=1).T);
ax.legend(labels=pod_standardized_byrow.loc[pod_standardized_byrow.cluster == 1].index, bbox_to_anchor=(1.0, 1.0), loc='upper left')



In [None]:
mapcluster = countriesshp.merge(pod_standardized_byrow, left_on="shapeName", right_index=True)
ax = mapcluster.plot("cluster", figsize=(20,20), legend=True,     categorical=True)
plt.axis('off')

In [None]:
# prompt: cluster the data with dbscan

from sklearn.cluster import DBSCAN

# Assuming 'scaled_data' from the previous code contains the correctly scaled data
# Replace with your actual data if different.

# Define DBSCAN parameters
dbscan = DBSCAN(eps=3, min_samples=5) # Adjust eps and min_samples as needed

# Fit DBSCAN to the data
dbscan.fit(scaled_data)

# Get cluster labels
labels = dbscan.labels_

# Add cluster labels to the DataFrame
pod_standardized_byrow['cluster'] = labels
print(pod_standardized_byrow['cluster'].unique())

# Now you can analyze the clusters as you did with KMeans
# Visualize clusters: noise points (label -1) in black, each cluster in its own color
ax = pod_standardized_byrow[pod_standardized_byrow.cluster == -1].drop("cluster", axis=1).T.plot(legend=False, color="k", alpha=0.2)
for l in range(0, pod_standardized_byrow['cluster'].max()+1):
  pod_standardized_byrow[pod_standardized_byrow.cluster == l].drop("cluster", axis=1).T.plot(ax=ax, legend=False, color=f"C{l % 10}", alpha=0.2)

# Note: the eps and min_samples parameters are crucial. You might need to tune these
# values based on your dataset to get meaningful results.
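
A common heuristic for choosing `eps` (a sketch on synthetic points, not a recipe tuned to the population data) is to compute, for every point, the distance to its `min_samples`-th nearest neighbour, sort those distances, and look for the value where the curve bends sharply upward. The blobs, the variable names, and the shortcut of taking the 90th percentile instead of reading the bend off a plot are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# two synthetic blobs of 50 points each
X_syn = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                   rng.normal(10, 0.5, size=(50, 2))])

min_samples = 5
# querying the training points themselves returns each point as its own first
# neighbour, which matches how DBSCAN counts a point in its own neighbourhood
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_syn)
distances, _ = nn.kneighbors(X_syn)
k_dist = np.sort(distances[:, -1])  # sorted distance to the min_samples-th neighbour

eps_guess = np.percentile(k_dist, 90)  # crude stand-in for reading the bend off a plot
labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit(X_syn).labels_
print("eps guess:", round(eps_guess, 3))
print("labels found:", np.unique(labels))
```

Plotting `k_dist` and choosing `eps` by eye near the bend is the usual way to apply the heuristic; the percentile here only keeps the example free of plots.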




# Identifying the countries in the smallest clusters

Plot the two smallest clusters, with labels for the countries.

In [None]:
fig,ax = pl.subplots(1,2, figsize=(20,10))

ax[0].set_title("Cluster 1 ")
ax[0].set_xticks(range(0,70,10))
ax[0].set_xticklabels(["%d"%i for i in range(1960, 2030, 10)]);
ax[0].plot(pod_standardized_byrow[pod_standardized_byrow.cluster == 1 ].drop("cluster", axis=1).T);
ax[0].legend(labels=pod_standardized_byrow.loc[pod_standardized_byrow.cluster == 1].index, bbox_to_anchor=(1.0, 1.0), loc='upper left')


ax[1].set_title("Cluster -1 (outliers)")
ax[1].set_xticks(range(0,70,10))
ax[1].set_xticklabels(["%d"%i for i in range(1960, 2030, 10)]);
ax[1].plot(pod_standardized_byrow[pod_standardized_byrow.cluster == -1 ].drop("cluster", axis=1).T);
ax[1].legend(labels=pod_standardized_byrow.loc[pod_standardized_byrow.cluster == -1].index, bbox_to_anchor=(1.0, 1.0), loc='upper left')
# placing legend method via https://www.delftstack.com/howto/matplotlib/how-to-place-legend-outside-of-the-plot-in-matplotlib/



Figure 9: This figure shows the countries in the two smallest clusters of the sample. These two clusters include the countries that either had a decline in population or did not have population increases.

Can you do some library research to figure out why those countries may cluster together?

In both cases the inflection point was around 1990. This period was marked by the collapse of the Soviet Union (formally dissolved in 1991), which led to a crisis in Eastern Europe and other socialist countries.