In [None]:
# In this mockup, we set up the bones of an unsupervised machine learning model that clusters the country-level data for our Expat App.
# At a high level, the model groups countries with similar indicator profiles, without needing labeled outcomes.
#
# Then, when we use a ranked-choice algorithm to choose a "best" country for a user based on their preferences, we will also be able to return
# a cluster of "similar" countries the user can also consider. This seemed like the best possible way to use ML for our project for a few reasons:
#
# 1) Supervised learning is difficult here because data on where expatriates actually moved is hard to find
# 2) Even if we used migration data to see where people did in fact move, that data doesn't tell us whether they are satisfied with their new country
# 3) Most of the data we are using can be converted to numerical form, so we can cluster on percentile ranks and/or normalized values
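#
# A minimal sketch of the two options named in point 3 (percentile ranks vs. normalization).
# The toy values and column names below are hypothetical placeholders, not our final schema.

```python
import pandas as pd

# Hypothetical toy data standing in for two numeric country indicators.
toy = pd.DataFrame(
    {
        "gdp": [500.0, 1500.0, 3000.0, 20000.0],
        "years_of_schooling": [6.0, 9.0, 12.0, 13.0],
    },
    index=["Country A", "Country B", "Country C", "Country D"],
)

# Option 1: percentile rank. Each value becomes its rank within the column,
# scaled into (0, 1], which dampens the effect of extreme values.
percentile = toy.rank(pct=True)

# Option 2: z-score normalization. Each column is shifted to zero mean and
# scaled to unit (population) standard deviation; relative distances are kept.
zscore = (toy - toy.mean()) / toy.std(ddof=0)

print(percentile)
print(zscore.round(2))
```

# Percentile ranks make skewed columns (like GDP) comparable with bounded ones;
# z-scores keep more of the original spacing, so outliers stay influential.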
#
# Also, if a particular data column is challenging to incorporate into our ML model (like language, where it can be hard to track related languages
# without recourse to more advanced techniques), we can instead expose it to the end user as a filter (e.g. filter for countries where English is spoken).
#
# Because the number of rows in the datasets we are working with is relatively small, we will use hierarchical clustering instead of just K-Means.
#
# SUMMARY: We are planning to use unsupervised learning to cluster our country-level data in addition to a ranked-choice-style algorithm that will choose
# a country for expatriation based on user input. The user will weight/filter which of a small set of components is most important to them (indicators
# include Economy, Health, Political System, Education, & Lifestyle) and then the algorithm will return the #1 country (via ranked choice), along with 
# its cluster of alternatives as determined by our unsupervised machine learning algorithm.
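#
# The ranking step is not specified yet. As a stand-in, here is a simple weighted-score ranking
# (NOT a true ranked-choice procedure). The function name, component scores, and weights are all
# hypothetical and only illustrate how user weights could select a #1 country.

```python
import pandas as pd


def rank_countries(scores: pd.DataFrame, weights: dict) -> pd.Series:
    """Return countries sorted by weighted average component score, best first.

    `scores` has one row per country and one column per component, with
    values assumed to be on a comparable 0-1 scale.
    """
    w = pd.Series(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    total = scores[w.index].mul(w, axis=1).sum(axis=1)
    return total.sort_values(ascending=False)


# Hypothetical component scores (assumption: already scaled to 0-1).
component_scores = pd.DataFrame(
    {
        "Economy": [0.9, 0.4, 0.6],
        "Health": [0.5, 0.8, 0.7],
        "Education": [0.6, 0.9, 0.5],
    },
    index=["Country A", "Country B", "Country C"],
)

# Hypothetical user input: Health matters three times as much as the others.
user_weights = {"Economy": 1, "Health": 3, "Education": 1}

ranking = rank_countries(component_scores, user_weights)
best_country = ranking.index[0]
print(ranking)
```

# The country at the top of `ranking` would then be looked up in the cluster table
# to fetch its alternatives.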
#
# We still need to have some drill-down conversations about specific aspects of the data that will be most important to our final dataframe, but
# based on our conversations last week some of the columns are likely to include: GDP of the country, a cost of living index, internet speed, a happiness
# index (along with some other health indicators), the human freedom index for the country, the average years of schooling, the literacy rate, an index
# for gender equality in education, and a few indicators that affect lifestyle (some climatological data points, what languages are spoken in the country
# + the % of English speakers, and some cultural indicators we are still working on solidifying like art, music, movies, literacy, nightlife, and others)

In [None]:
# Imports for hierarchical clustering
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import hvplot.pandas  # imported for its side effect: registers the .hvplot accessor on DataFrames

In [None]:
# A cell for importing the data from the database
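# A possible shape for this cell, assuming the data can be read with a SQL query.
# The in-memory SQLite database, table name, and columns below are hypothetical stand-ins;
# the real connection details are still to be decided.

```python
import sqlite3

import pandas as pd

# Stand-in database so the sketch runs on its own (assumption: not our real DB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE country_indicators ("
    "country TEXT PRIMARY KEY, gdp REAL, happiness REAL)"
)
conn.executemany(
    "INSERT INTO country_indicators VALUES (?, ?, ?)",
    [("Country A", 1.2, 6.5), ("Country B", 3.4, 7.1)],
)
conn.commit()

# Load the table into a DataFrame indexed by country name.
df_countries = pd.read_sql_query(
    "SELECT * FROM country_indicators", conn, index_col="country"
)
conn.close()

print(df_countries)
```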

In [None]:
# A cell for applying PCA
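# A minimal sketch of this step: standardize the numeric columns, then reduce them with PCA.
# The random data, column names, `df_pca`, and n_components=2 are assumptions for illustration;
# the real number of components should be chosen from the explained variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature table: 12 countries x 5 indicators.
rng = np.random.default_rng(0)
df_features = pd.DataFrame(
    rng.normal(size=(12, 5)),
    index=[f"Country {i}" for i in range(12)],
    columns=["gdp", "cost_of_living", "internet_speed", "happiness", "schooling"],
)

# Scale first so that indicators with large units do not dominate the components.
scaled = StandardScaler().fit_transform(df_features)

# Project onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

df_pca = pd.DataFrame(components, index=df_features.index, columns=["PC1", "PC2"])
explained = pca.explained_variance_ratio_.sum()

print(df_pca.head())
print(f"Variance explained by 2 components: {explained:.2%}")
```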

In [None]:
import plotly.figure_factory as ff

In [None]:
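# Create a dendrogram
# NOTE: df_iris_pca is not defined in this notebook yet; the name appears to be left over from an iris example
# and should be replaced with the PCA output for our country data
fig = ff.create_dendrogram(df_iris_pca, color_threshold=0)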
fig.update_layout(width=800, height=500)
fig.show()

In [None]:
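# Perform agglomerative clustering
# NOTE: df_iris_pca is a placeholder name that is not defined in this notebook yet; pass the PCA output
# for our country data here instead
agg = AgglomerativeClustering(n_clusters=3)
model = agg.fit(df_iris_pca)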

In [None]:
# Add a new cluster column to the dataframe
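# A minimal sketch of this step: fit the clusterer and attach its labels as a new column.
# The toy points, `df_pca`, and `df_clustered` are hypothetical; n_clusters=3 is only an example value.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Hypothetical PCA output: three well-separated groups of five countries each.
rng = np.random.default_rng(1)
points = np.vstack(
    [rng.normal(loc=center, scale=0.1, size=(5, 2)) for center in (0.0, 5.0, 10.0)]
)
df_pca = pd.DataFrame(
    points,
    columns=["PC1", "PC2"],
    index=[f"Country {i}" for i in range(15)],
)

# fit_predict returns one integer label per row, in row order.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(df_pca)

# assign() returns a copy, so the PCA frame itself stays free of the label column.
df_clustered = df_pca.assign(cluster=labels)

print(df_clustered["cluster"].value_counts())
```

# Keeping the label out of `df_pca` avoids accidentally feeding it back into a later fit.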

In [None]:
# Create a plot to show the results of the hierarchical clustering algorithm

In [None]:
# (Maybe) perform statistical testing to verify differences between clusters?
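# One possible approach, sketched under assumptions: a Kruskal-Wallis test per indicator across clusters.
# The data below is synthetic and the column names are hypothetical. Caveat: testing a feature that was
# itself used to build the clusters is circular, so p-values are optimistic; this is more defensible on
# indicators held out of the clustering.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Synthetic example: one indicator whose level differs clearly by cluster.
rng = np.random.default_rng(2)
df = pd.DataFrame(
    {
        "feature": np.concatenate(
            [rng.normal(0, 1, 20), rng.normal(5, 1, 20), rng.normal(10, 1, 20)]
        ),
        "cluster": np.repeat([0, 1, 2], 20),
    }
)

# Collect the indicator values for each cluster, then compare the groups.
groups = [group["feature"].to_numpy() for _, group in df.groupby("cluster")]
stat, p_value = kruskal(*groups)

print(f"H statistic = {stat:.2f}, p-value = {p_value:.3g}")
```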

In [None]:
# Search for the returned country's cluster & return the cluster
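# A minimal sketch of the lookup, assuming a DataFrame indexed by country name with a `cluster` column.
# The function name `similar_countries` and the toy data are hypothetical.

```python
import pandas as pd


def similar_countries(df: pd.DataFrame, country: str, cluster_col: str = "cluster") -> list:
    """Return the other countries that share a cluster with `country`."""
    if country not in df.index:
        raise KeyError(f"{country!r} is not in the clustered data")
    cluster_id = df.loc[country, cluster_col]
    same_cluster = df.index[df[cluster_col] == cluster_id]
    # Exclude the queried country itself so only alternatives are returned.
    return [name for name in same_cluster if name != country]


# Hypothetical clustered data.
df_clustered = pd.DataFrame(
    {"cluster": [0, 1, 0, 2, 1]},
    index=["Country A", "Country B", "Country C", "Country D", "Country E"],
)

print(similar_countries(df_clustered, "Country A"))
```

# A country alone in its cluster returns an empty list, which the app would need to handle.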