# Matching tool 

Inspired by: https://towardsdatascience.com/dating-algorithms-using-machine-learning-and-ai-814b68ecd75e
Dataset generated with: https://generatedata.com/generator

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# pip install seaborn
# pip install scikit-learn
# pip install tensorflow


In [None]:
df = pd.read_json("data-for_model.json")
df


## Preprocessing the data

In [None]:
from sklearn.preprocessing import StandardScaler

# scaling the data
scaler = StandardScaler().fit(df)
array_scaled = scaler.transform(df)


In [None]:
df_scaled = pd.DataFrame(array_scaled, columns=[
                         'k1', 'k2', 'k3', 'k4', 'k5', 's1', 's2', 'd4'])
df_scaled


In [None]:
# Check that the data is standardized:
df_scaled.mean()

# Every column mean rounds to 0 --> the data is correctly standardized
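
As a minimal sketch of what this check verifies: with its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so every scaled column should end up with mean ~0 and standard deviation ~1. The small array below is made-up illustration data, not the notebook's dataset, and the computation is done by hand in NumPy rather than through scikit-learn.

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 feature columns (not the notebook's dataset).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Z-score per column: subtract the mean, divide by the population std (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_scaled.mean(axis=0)  # should be ~0 for every column
col_stds = X_scaled.std(axis=0)    # should be ~1 for every column
print(col_means)
print(col_stds)
```

Both columns end up on the same scale even though the raw values differ by a factor of ten.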


## Choosing the clustering method

In [None]:
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Choosing the number of clusters: try every count from 2 up to and including 19

cluster_cnt = list(range(2, 20))

# Establishing empty lists to store the scores for the evaluation metrics
s_scoresA = []
db_scoresA = []

s_scoresB = []
db_scoresB = []

# Looping through different iterations for the number of clusters
for i in cluster_cnt:

    # Hierarchical Agglomerative Clustering with different number of clusters
    hac = AgglomerativeClustering(n_clusters=i)

    hac.fit(df_scaled)

    cluster_assignmentsA = hac.labels_

    # KMeans Clustering with different number of clusters
    k_means = KMeans(n_clusters=i)

    k_means.fit(df_scaled)

    cluster_assignmentsB = k_means.predict(df_scaled)

    # Appending the scores to the empty lists
    s_scoresA.append(silhouette_score(df_scaled, cluster_assignmentsA))
    db_scoresA.append(davies_bouldin_score(df_scaled, cluster_assignmentsA))

    s_scoresB.append(silhouette_score(df_scaled, cluster_assignmentsB))
    db_scoresB.append(davies_bouldin_score(df_scaled, cluster_assignmentsB))


In [None]:
def plot_evaluation(y, x=cluster_cnt):
    """
    Plots the scores of a set evaluation metric. Prints out the max and min values of the evaluation scores.
    """

    # Creating a DataFrame of scores indexed by cluster count,
    # used to report the max and min scores
    scores_df = pd.DataFrame({'Cluster Score': y}, index=x)

    print('Max Value:\nCluster #',
          scores_df[scores_df['Cluster Score'] == scores_df['Cluster Score'].max()])
    print('\nMin Value:\nCluster #',
          scores_df[scores_df['Cluster Score'] == scores_df['Cluster Score'].min()])

    # Plotting out the scores based on cluster count
    plt.figure(figsize=(16, 6))
    plt.style.use('ggplot')
    plt.plot(x, y)
    plt.xlabel('Number of clusters')
    plt.ylabel('Score')
    plt.show()


Plot 1 - "The silhouette score of 1 means that the clusters are very dense and nicely separated. The score of 0 means that clusters are overlapping. The score of less than 0 means that data belonging to clusters may be wrong/incorrect."

https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam#:~:text=The%20silhouette%20score%20of%201,value%20of%20the%20K%20(no.
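
To make the silhouette score concrete, here is a small hand-rolled sketch (the notebook itself uses `silhouette_score` from scikit-learn). For each point, `a` is the mean distance to the other points in its own cluster and `b` is the lowest mean distance to the points of any other cluster; the point's score is `(b - a) / max(a, b)`. The function name `silhouette` and the four toy points are assumptions made for illustration only.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distances (illustrative)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two tight pairs of points, far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good_score = silhouette(points, [0, 0, 1, 1])  # labels follow the two pairs
bad_score = silhouette(points, [0, 1, 0, 1])   # labels split each pair
print(good_score, bad_score)
```

The labeling that follows the two pairs scores close to 1, while the labeling that splits each pair scores below 0, matching the interpretation quoted above.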


Plot 2 - "Davies-Bouldin index is a validation metric that is often used in order to evaluate the optimal number of clusters to use. It is defined as a ratio between the cluster scatter and the cluster's separation and a lower value will mean that the clustering is better"

https://stackoverflow.com/questions/59279056/davies-bouldin-index-higher-or-lower-score-better#:~:text=Davies%2DBouldin%20index%20is%20a,that%20the%20clustering%20is%20better.
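
The "ratio between scatter and separation" can be sketched by hand as follows (the notebook itself uses `davies_bouldin_score` from scikit-learn). For each cluster, scatter is taken as the mean distance of its points to the centroid; for each pair of clusters the ratio is the sum of their scatters divided by the distance between their centroids; the index averages each cluster's worst ratio. The function name `davies_bouldin` and the toy points are assumptions for illustration.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with Euclidean distances (illustrative)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(clusters)
    ])
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst.append(max(ratios))  # the most similar other cluster
    return float(np.mean(worst))

# Hypothetical toy data: two tight pairs of points, far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
tight = davies_bouldin(points, [0, 0, 1, 1])   # compact, well-separated clusters
spread = davies_bouldin(points, [0, 1, 0, 1])  # wide, overlapping clusters
print(tight, spread)
```

The compact labeling gives a much lower index than the overlapping one, which is why the minimum of this curve is the value of interest.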


In [None]:
# Running the function on the list of scores
plot_evaluation(s_scoresA)
plot_evaluation(db_scoresA)


So, based on these two plots, 16 is the optimal number of clusters, and we use KMeans.

### AgglomerativeClustering

Plot 1: the maximum is at 19 clusters, with a value of 0.129011.

Plot 2: the minimum is at 19 clusters, with a value of 1.61575.

In [None]:
plot_evaluation(s_scoresB)
plot_evaluation(db_scoresB)


### KMeans
Plot 1: the maximum is at 17 clusters, with a value of 0.140338.

Plot 2: the minimum is at 19 clusters, with a value of 1.595732.

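### Conclusion
The scores are nearly identical for the two clustering methods. KMeans scores marginally better on both metrics, but AgglomerativeClustering is consistent across them: its silhouette maximum and its Davies-Bouldin minimum both fall at 19 clusters. AgglomerativeClustering with 19 clusters is therefore used below.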
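
The best cluster count per metric can also be read off the score lists programmatically instead of from the plots: the highest silhouette score and the lowest Davies-Bouldin score. A minimal sketch, where `best_cluster_counts` is a hypothetical helper and the score lists are made-up values, not the notebook's results:

```python
def best_cluster_counts(cluster_counts, silhouette_scores, db_scores):
    """Return (k with the highest silhouette, k with the lowest Davies-Bouldin)."""
    by_silhouette = max(zip(cluster_counts, silhouette_scores),
                        key=lambda pair: pair[1])[0]
    by_db = min(zip(cluster_counts, db_scores),
                key=lambda pair: pair[1])[0]
    return by_silhouette, by_db

# Made-up scores for cluster counts 2..5 (illustration only).
counts = [2, 3, 4, 5]
sil = [0.10, 0.12, 0.15, 0.11]
db = [2.0, 1.9, 1.6, 1.7]

best_sil, best_db = best_cluster_counts(counts, sil, db)
print(best_sil, best_db)
```

In the notebook this would be called with `cluster_cnt` and the matching pair of score lists, for example `s_scoresA` and `db_scoresA`. When the two returned counts agree, the choice of cluster count is unambiguous.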

## Clustering

In [None]:
# Instantiating HAC
hac = AgglomerativeClustering(n_clusters=19)

# Fitting
hac.fit(df_scaled)

# Getting cluster assignments
cluster_assignments = hac.labels_

# Assigning the clusters to each profile
df_scaled['Cluster #'] = cluster_assignments

# Viewing the dating profiles with cluster assignments
df_scaled


In [None]:
data = df_scaled
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data['k1'], data['k2'], c=data["Cluster #"], s=50)
ax.set_title("Agglomerative Clustering")
ax.set_xlabel("K1")
ax.set_ylabel("K2")
plt.colorbar(scatter)
plt.show()


In [None]:
df["# Cluster"] = data["Cluster #"]
df_grouped = df.groupby(by=["# Cluster"]).mean()
df_grouped


The table above shows the mean of every feature per cluster. A model will now be trained on these cluster assignments.

## TF model

In [None]:
np.set_printoptions(precision=3, suppress=True)


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


In [None]:
print(tf.__version__)


In [None]:
df["# Cluster"] = data["Cluster #"]
dataset = df.copy()


In [None]:
dataset.isna().sum()


No NaN or missing values, so the data is ready for training.

In [None]:
# Split the data into training (80%) and test (20%) sets
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)


In [None]:
sns.pairplot(train_dataset, diag_kind='kde')


The plot above shows no clear functional relationship between the target (the cluster number) and the other columns.

In [None]:
# Now separate the features from the labels (the cluster number) in both sets
train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('# Cluster')
test_labels = test_features.pop('# Cluster')


To normalize the columns we use the Keras Normalization layer.

In [None]:
from tensorflow.keras.layers import Normalization

normalizer = Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())


In [None]:
first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
    print('First example:', first)
    print()
    print('Normalized:', normalizer(first).numpy())


Above you can see the difference between the raw and the normalized data. Normalized inputs put all features on a comparable scale, which lets the neural network learn its weights faster.

In [None]:
def build_and_compile_model1(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])

    model.compile(loss='mean_absolute_error', optimizer=tf.keras.optimizers.Adam(0.001))
    return model


In [None]:
model1 = build_and_compile_model1(normalizer)
model1.summary()

In [None]:
%%time
history = model1.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)


In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error [Cluster]')
    plt.legend()
    plt.grid(True)


In [None]:
plot_loss(history)


In [None]:
gem_afwijking = model1.evaluate(test_features, test_labels, verbose=0)


In [None]:
gem_afwijking


The value above (`gem_afwijking`) is the mean absolute error: the average absolute difference between the predicted and the true values. The smaller it is, the better the model fits the data.
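
As a small worked example of this metric, using made-up labels and predictions rather than the notebook's results: the mean absolute error takes the absolute difference per sample and averages them. This is the same quantity the models here are compiled with via `loss='mean_absolute_error'`.

```python
import numpy as np

# Hypothetical true cluster numbers and model outputs (illustration only).
true_clusters = np.array([3.0, 7.0, 12.0, 0.0])
predicted = np.array([2.5, 9.0, 12.0, 1.5])

# Absolute differences are 0.5, 2.0, 0.0 and 1.5; their average is the MAE.
abs_errors = np.abs(predicted - true_clusters)
mae = abs_errors.mean()
print(mae)
```

An error of 1.0 here means the predictions are, on average, one cluster number away from the true label.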

In [None]:
test_predictions = model1.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values [Cluster]')
plt.ylabel('Predictions [Cluster]')
lims = [0, 20]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)


The closer the points lie to the diagonal line, the better the model predicts.

Interesting layers to try:
* Embedding layer
* Dropout layer
* Noise layer
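
Of the layers listed, dropout is the one already used in the first model. As a rough sketch of the idea (not Keras's actual implementation): during training each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)` so the expected total stays the same; at inference the input passes through unchanged. The `dropout` function and its arguments are hypothetical names for illustration.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of x and rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: pass through unchanged
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

dropped = dropout(activations, rate=0.2, training=True, rng=rng)
zero_fraction = (dropped == 0).mean()  # roughly the dropout rate
mean_after = dropped.mean()            # roughly the original mean of 1.0
print(zero_fraction, mean_after)
```

Raising `rate`, as the second model does by going from 0.2 to 0.3, removes more activations per training step and so regularizes more strongly.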


In [None]:
def build_and_compile_model2(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])

    model.compile(loss='mean_absolute_error', optimizer=tf.keras.optimizers.Adam(0.001))
    return model


In [None]:
model2 = build_and_compile_model2(normalizer)
model2.summary()


In [None]:
history2 = model2.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=200)

In [None]:
plot_loss(history2)

In [None]:
model2.evaluate(test_features, test_labels, verbose=0)