# 4IF Data Mining Lab
Nadine Saadalla, Noémie Varjabedian, Eléonore Dravet  
February 2025


# Introduction
During this Data Mining project, we will be analyzing the geolocation data of photographs taken in Grand Lyon. The data comes from the Flickr database, and the photographs were taken between 2010 and 2019. We will explore different clustering methods to identify areas of interest for tourism in the metropolis, i.e. areas where many photographs are taken. Finally, we will work on photograph captions to assign descriptions to our areas of interest.

# Python set-up

In this section, we will set up a virtual Python environment and install all the packages needed to run the code.

In [None]:
! source ./vitualenvpython/bin/activate
! which python3

In [None]:
# installation of required libraries and dependencies
# numeric calculations
! pip install numpy==1.26.0 
# data frames 
! pip install pandas==2.1.1 
# machine learning algorithms 
! pip install scikit-learn==1.5.1 
! pip install scipy==1.12.0
# plotting 
! pip install plotly==5.24.1 
! pip install matplotlib==3.8.0 
! pip install plotly-express==0.4.1 
! pip install chart-studio==1.1.0 
# web app library 
! pip install streamlit==1.37.1 
# mapping library
! pip install folium
# association rules
! pip install mlxtend==0.23.3

! pip install nbformat==5.9.2 

In [None]:
# load pandas to deal with the data
import pandas as pd
import numpy as np
# plotting
import matplotlib.pyplot as plt
# maps
import folium
# k-means
from sklearn.cluster import KMeans
# convex hulls around the clusters
from scipy.spatial import ConvexHull, QhullError
# hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
# tags
from collections import Counter
# images
import os

# Data loading and cleaning

We start by viewing the current dataset.

In [None]:
data = pd.read_table("flickr_data2.csv", sep=",")
data.head()

### Cleaning data in unnamed columns

In [None]:
data.info()

We notice the three lines    

16  Unnamed: 16          142 non-null     float64  
17  Unnamed: 17          0 non-null       float64  
18  Unnamed: 18          2 non-null       float64  

which indicate that some rows have values in unnamed columns, probably because of stray ',' separators inside a field. They represent a very small ratio of the overall data (~150/420 000), so we chose to delete those rows without further investigation.

In [None]:
unnamed_columns = data.columns[data.columns.str.contains('^Unnamed')]
data = data.loc[~data["Unnamed: 16"].notna(),:]
data = data.loc[~data["Unnamed: 17"].notna(),:]
data = data.loc[~data["Unnamed: 18"].notna(),:]
data.info()

The rows have been deleted ("0 non-null"). The columns still appear, so we can now delete them.

In [None]:
data=data.drop(columns=unnamed_columns)
data.info()

### Spaces

We notice that except for 'id', the column names start with a space. To manipulate the columns more easily, we remove these spaces. 

In [None]:
data=data.rename(columns=lambda x: x.strip())

### Incoherent values
We can look at the statistics to check that they are coherent.

In [None]:
data.describe()

We see that the max date_taken_year is 2238, which is impossible. We therefore delete all rows where the year taken is later than 2025 (the current year), and print all the years in which pictures were taken and uploaded in our dataset, to check that everything is coherent.


In [None]:
# rows with an impossible year (after 2025)
index = data[data['date_taken_year'] > 2025].index
# drop() returns a new DataFrame, so the result must be assigned back
data = data.drop(index=index)
taken_years = data["date_taken_year"].unique()
uploaded_years = data["date_upload_year"].unique()
print(f" When pictures are taken : {taken_years}\n")
print(f" When pictures are uploaded : {uploaded_years}")

<a id="missing-vals"></a>
### Missing Values

In [None]:
data.isna().sum()

We see that the missing values come from the tags, the title or the upload_year. For our analysis of the coordinates, this information is not decisive, so we could keep the corresponding rows. However, in order to run our algorithms, we sample the data (our CPUs are not powerful enough to process the full dataset) and keep only a thousand points, so we may as well keep only the rows that have tags, which we need for our text analysis later.

In [None]:
data=data.drop(index=data[data['tags'].isna()].index)

<a id="duplicates"></a>
### Duplicates

For our analysis to be relevant, we do not want to count several times pictures taken by the same person, in the same place, within the same hour. This way we count visits rather than photographs, and we avoid being influenced by someone taking burst photos. Hence we remove the duplicates using the following columns.

In [None]:
columns_for_duplicates=['user', 'lat', 'long','date_taken_hour', 'date_taken_day', 'date_taken_month','date_taken_year', 'date_upload_minute', 'date_upload_hour','date_upload_day', 'date_upload_month', 'date_upload_year']
data=data.drop_duplicates(subset=columns_for_duplicates)

We now have ~90 000 points out of the ~420 000 initial ones, which is still enough for our analysis.

In [None]:
data.info()

We can visualize all the points on a map as a last check of our values. We choose 1000 random tagged points.

In [None]:
m = folium.Map([45.762611, 4.832805], zoom_start=14)
data_sample = data.sample(1000)

for index, row in data_sample.iterrows():
    folium.Marker(
        location=[row["lat"],row["long"]],
        icon=folium.DivIcon(html=f"""<svg width="20px" height="20px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
        <path d="M12 9.5C13.3807 9.5 14.5 10.6193 14.5 12C14.5 13.3807 13.3807 14.5 12 14.5C10.6193 14.5 9.5 13.3807 9.5 12C9.5 10.6193 10.6193 9.5 12 9.5Z" fill="#e63946"/>
        </svg>""")
    ).add_to(m)

m

Everything looks clean. We can save our cleaned data to a new CSV file and work with this version from now on. We will not have to run the cleaning again, since we can directly load data_cleaned.csv.

In [None]:
data.to_csv('./data_cleaned.csv',index=False)

# Analyzing clustering methods

In [None]:

data_cleaned=pd.read_csv('./data_cleaned.csv')
data_sample = data_cleaned.sample(1000, random_state=42)
cluster_colors = ['#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#FF00FF', '#FFA500', '#800080']

## k-means Clustering and Finding the Optimal Number of Clusters using the Elbow Method

Before using the Elbow method, we observe the different clusters we obtain by changing the value of k. 

### Elbow method 
We use the elbow method to determine the best parameter k for our K-Means clustering.

In [None]:
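# inertia (within-cluster sum of squared distances) for k = 1..29
inertias = []
for i in range(1, 30):
    kmeans = KMeans(n_clusters=i, init='k-means++')
    kmeans.fit(data_sample[["lat", "long"]])
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 30), inertias)
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()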

The optimal value seems to be k=7. However, when we visualize the result on the map, there are not enough clusters to represent tourist spots. We therefore chose a higher k to better match reality.

P.S.: We only show clusters that contain at least 10 points. Smaller clusters are considered irrelevant. This changes nothing for k=7, but for a bigger k or for other methods, this rule has an impact on the displayed map.

### k-means maps for k = 7 and k = 100

In [None]:
k = 7

kmeans = KMeans(n_clusters=k, init='k-means++')
kmeans.fit(data_sample[["lat", "long"]])

m = folium.Map(location=[45.762611, 4.832805], zoom_start=14)

labels = kmeans.labels_

clustered_points = data_sample.assign(cluster=labels)

for cluster_id in clustered_points['cluster'].unique():
    cluster_points = clustered_points[clustered_points['cluster'] == cluster_id]
    points = cluster_points[['lat', 'long']].values

    if len(points) >= 10:  # We only show clusters with at least 10 points.
        try:
            hull = ConvexHull(points)
            hull_points = points[hull.vertices]  
            
            folium.Polygon(
                locations=hull_points,
                color=cluster_colors[cluster_id % len(cluster_colors)],
                weight=2,
                fill=True,
                fill_opacity=0.2
            ).add_to(m)
        except QhullError:
            print(f"Cluster {cluster_id}: QhullError occurred; skipping hull computation.")
    else:
        print(f"Cluster {cluster_id}: Less than 10 points; skipping perimeter.")

for _, row in data_sample.iterrows():
    folium.Marker(
        location=[row["lat"], row["long"]],
        icon=folium.DivIcon(html=f"""
        <svg width="10px" height="10px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
            <circle cx="12" cy="12" r="5" fill="black"/>
        </svg>
        """)
    ).add_to(m)

m.save("cluster_map_k_means_7.html")


In [None]:
k = 100

kmeans = KMeans(n_clusters=k, init='k-means++')
kmeans.fit(data_sample[["lat", "long"]])

m = folium.Map(location=[45.762611, 4.832805], zoom_start=14)

labels = kmeans.labels_

clustered_points = data_sample.assign(cluster=labels)

for cluster_id in clustered_points['cluster'].unique():
    cluster_points = clustered_points[clustered_points['cluster'] == cluster_id]
    points = cluster_points[['lat', 'long']].values

    if len(points) >= 10:  # We only show clusters with at least 10 points.
        try:
            hull = ConvexHull(points)
            hull_points = points[hull.vertices]  
            
            
            folium.Polygon(
                locations=hull_points,
                color=cluster_colors[cluster_id % len(cluster_colors)],
                weight=2,
                fill=True,
                fill_opacity=0.2
            ).add_to(m)
        except QhullError:
            print(f"Cluster {cluster_id}: QhullError occurred; skipping hull computation.")
    else:
        print(f"Cluster {cluster_id}: Less than 10 points; skipping perimeter.")

for _, row in data_sample.iterrows():
    folium.Marker(
        location=[row["lat"], row["long"]],
        icon=folium.DivIcon(html=f"""
        <svg width="10px" height="10px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
            <circle cx="12" cy="12" r="5" fill="black"/>
        </svg>
        """)
    ).add_to(m)

m.save("cluster_map_k_means_100.html")


## Task #5: Cluster Evaluation using Silhouette Coefficient | TODO

To evaluate the quality of the clustering, we can use the **Silhouette Coefficient**. The Silhouette Coefficient for a sample is given by $(b - a) / \max(a, b)$, where `a` is the mean intra-cluster distance (i.e. the mean distance between the sample and all other samples in the same cluster) and `b` is the mean distance between the sample and the samples of the nearest cluster that it is not a part of.

The silhouette score ranges from -1 to 1 and indicates how well each data point fits within its assigned cluster:

* Score near +1 means:
    - The data point is far from neighboring clusters
    - The point is well-matched to its cluster
    - Indicates very distinct, well-separated clustering
* Score near 0 means:
    - The data point is close to the decision boundary between clusters
    - The point could potentially belong to either cluster
    - Suggests overlapping or not well-defined clusters
* Score near -1 means:
    - The data point might be assigned to the wrong cluster
    - The point is closer to points in another cluster than its own
    - Indicates poor clustering or potential misassignments

We can use [`sklearn.metrics.silhouette_score`](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.silhouette_score.html) and [`sklearn.metrics.silhouette_samples`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html).
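
As a small illustration of the formula, the sketch below computes the silhouette coefficient of one point by hand and compares it with the value returned by `silhouette_samples`. The four toy points and their labels are made up for the example and are not taken from the Flickr data.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data (assumption): two clusters of two points each on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette of the first point, computed by hand.
# a: mean distance to the other points of its own cluster.
a = np.mean([abs(X[0, 0] - X[1, 0])])
# b: mean distance to the points of the nearest other cluster.
b = np.mean([abs(X[0, 0] - X[2, 0]), abs(X[0, 0] - X[3, 0])])
manual_score = (b - a) / max(a, b)

# Same quantity for every point, computed by scikit-learn.
sklearn_scores = silhouette_samples(X, labels)
print(manual_score, sklearn_scores[0])
```

Here `a = 1` and `b = 10.5`, so the score is `9.5 / 10.5`, close to +1: the point is much nearer to its own cluster than to the other one.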

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

**QUESTIONS**

* For k-means clustering with `k=3`, calculate Silhouette score for each data point, for each cluster and average silhouette score 
* Display Silhouette score plot
* Comment

In [None]:
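features = ["lat", "long"]

# k-means with k=3 on the sampled coordinates
kmeans = KMeans(n_clusters=3, init='k-means++')
for_silhouette_df = data_sample[features].copy()
for_silhouette_df["labels"] = kmeans.fit_predict(for_silhouette_df[features])

for_silhouette_df = for_silhouette_df.sort_values(by=["labels"])

# silhouette score of each data point
for_silhouette_df["silhouettes"] = silhouette_samples(for_silhouette_df[features], for_silhouette_df["labels"])

# mean silhouette score of each cluster
cluster_silhouette_score = for_silhouette_df.groupby(["labels"])["silhouettes"].mean()

# average silhouette score over all points
average_silhouette_score = for_silhouette_df["silhouettes"].mean()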

In [None]:
import matplotlib.cm as cm

In [None]:
fig, ax1 = plt.subplots(1, 1, figsize=(20, 20))
y_lower = 10
for i in range(3):
    # sorted silhouette values of the points in cluster i
    ith_cluster_silhouette_values = for_silhouette_df.loc[for_silhouette_df["labels"] == i, "silhouettes"].sort_values().to_numpy()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 3)
    ax1.fill_betweenx(
        np.arange(y_lower, y_upper),
        0,
        ith_cluster_silhouette_values,
        facecolor=color,
        edgecolor=color,
        alpha=0.7,
    )

    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    y_lower = y_upper + 10

ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

ax1.axvline(x=average_silhouette_score, color="red", linestyle="--")

ax1.set_yticks([]) 
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

In [None]:
# ANSWER


As general guidelines, the plot can be interpreted by looking at:
* the thickness of the clusters (number of points);
* whether any cluster has many negative values;
* the consistency of the silhouette widths within clusters;
* the average value. Recall that in general, the following interpretation applies:
    - \> 0.7: Strong clustering structure
    - 0.5 - 0.7: Reasonable clustering structure
    - 0.25 - 0.5: Weak clustering structure
    - < 0.25: No substantial clustering structure


* Cluster Silhouette scores. 

YOUR COMMENT: TO COMPLETE

## Hierarchical Clustering : single, complete and ward

In this part, we analyze three linkage variants of hierarchical clustering: single, complete and ward.

In [None]:
clustering_single = AgglomerativeClustering(n_clusters=100, linkage='single').fit(data_sample[["lat", "long"]] )

In [None]:
clustering_complete = AgglomerativeClustering(n_clusters=100, linkage='complete').fit(data_sample[["lat", "long"]] )

In [None]:
clustering_ward = AgglomerativeClustering(n_clusters=100, linkage='ward').fit(data_sample[["lat", "long"]] )

After comparing the different results on the map, we conclude that the best hierarchical clustering method is ward linkage.
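
The visual comparison can be complemented by a simple count. The sketch below, run on synthetic coordinates rather than on our sample, reports for each linkage the size of the largest cluster and how many clusters reach the 10-point threshold we use for display. The helper `clusters_kept`, the synthetic data and the choice of 30 clusters are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic (lat, long) points around the center of Lyon (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=[45.76, 4.83], scale=0.01, size=(300, 2))

def clusters_kept(labels, min_points=10):
    """Number of clusters that contain at least min_points points."""
    return int(np.sum(np.bincount(labels) >= min_points))

for linkage in ["single", "complete", "ward"]:
    labels = AgglomerativeClustering(n_clusters=30, linkage=linkage).fit_predict(X)
    sizes = np.bincount(labels)
    print(f"{linkage:>8}: largest cluster = {sizes.max()}, "
          f"clusters with >= 10 points = {clusters_kept(labels)}")
```

The same loop could be run on `data_sample[["lat", "long"]]` with `n_clusters=100` to compare the three variants on our data.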

To see the results on the map, we only need to change this line below with the corresponding clustering variation and then run the code. We left it at the ward method because we have concluded that it's the best of the three.

In [None]:
labels = clustering_ward.labels_

In [None]:
m = folium.Map(location=[45.762611, 4.832805], zoom_start=14)
clustered_points = data_sample.assign(cluster=labels)

for cluster_id in clustered_points['cluster'].unique():
    cluster_points = clustered_points[clustered_points['cluster'] == cluster_id]
    points = cluster_points[['lat', 'long']].values

    if len(points) >= 10: # We only show clusters with at least 10 points.
        try:
            hull = ConvexHull(points)
            hull_points = points[hull.vertices]
            
            folium.Polygon(
                locations=hull_points,
                color=cluster_colors[cluster_id % len(cluster_colors)], 
                weight=2,
                fill=True,
                fill_opacity=0.2,
            ).add_to(m)
        except QhullError:
            print(f"Cluster {cluster_id}: QhullError occurred; skipping hull computation.")
    else:
        print(f"Cluster {cluster_id}: Less than 10 points; skipping perimeter.")

for _, row in data_sample.iterrows():
    folium.Marker(
        location=[row["lat"], row["long"]],
        icon=folium.DivIcon(html=f"""
        <svg width="10px" height="10px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
            <circle cx="12" cy="12" r="5" fill="black"/>
        </svg>
        """),

    ).add_to(m)

m.save("cluster_map_hierarchical.html")
m

# Task 7: Apply DBSCAN

**QUESTIONS**

* Apply [sklearn.cluster.DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) algorithm
* Identify the best values for `eps` and `min_samples` by varying the values within a range and by using the Silhouette coefficient
* Apply DBSCAN with the best parameters found
* Print number of clusters and noise points
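
The parameter search described in these questions could look like the sketch below. It runs on synthetic points (three blobs plus uniform noise) instead of our sample, and the candidate values for `eps` and `min_samples` are arbitrary. Noise points (label -1) are left out of the silhouette computation, which is one possible choice among several and tends to favor parameter sets that discard many points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the (lat, long) sample: three tight blobs plus noise.
rng = np.random.default_rng(42)
centers = np.array([[45.76, 4.83], [45.75, 4.85], [45.78, 4.82]])
blobs = np.vstack([c + rng.normal(scale=0.001, size=(100, 2)) for c in centers])
noise = rng.uniform(low=[45.74, 4.81], high=[45.79, 4.86], size=(30, 2))
X = np.vstack([blobs, noise])

best = None  # (score, eps, min_samples, labels)
for eps in [0.001, 0.002, 0.003, 0.005]:
    for min_samples in [3, 5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # ignore noise points when scoring
        if len(set(labels[mask])) < 2:
            continue  # the silhouette needs at least two clusters
        score = silhouette_score(X[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, eps, min_samples, labels)

score, eps, min_samples, labels = best
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"eps={eps}, min_samples={min_samples}, silhouette={score:.3f}")
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```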

In [None]:
# DBSCAN
from sklearn.cluster import DBSCAN

In [None]:
# ANSWER
data_sample = data_sample.sample(1000, random_state=42)

clustering_db = DBSCAN(eps=0.002, min_samples=3).fit(data_sample[["lat", "long"]] )
clustering_db.labels_

In [None]:
from scipy.spatial import ConvexHull, QhullError

# Initialize the map
m = folium.Map(location=[45.762611, 4.832805], zoom_start=12)

# Assign cluster labels
labels = clustering_db.labels_

# Group points by cluster
clustered_points = data_sample.assign(cluster=labels)

# Define colors for clusters (extendable)
cluster_colors = ['#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#FF00FF', '#FFA500', '#800080']

# Track if any polygons are added
polygon_added = False

# Loop through each cluster and compute its perimeter
for cluster_id in clustered_points['cluster'].unique():

    if cluster_id < 0:
        print(f"Skipping cluster {cluster_id} (outlier)")
        continue

    cluster_points = clustered_points[clustered_points['cluster'] == cluster_id]
    points = cluster_points[['lat', 'long']].values  # Extract points as (lat, long)

    if len(points) >= 3:  # ConvexHull requires at least 3 points
        try:
            hull = ConvexHull(points)
            hull_points = points[hull.vertices]  # Get the hull points
            
            # Add the perimeter as a polygon to the map
            folium.Polygon(
                locations=hull_points,
                color=cluster_colors[cluster_id % len(cluster_colors)],  # Use modulo for repeating colors
                weight=2,
                fill=True,
                fill_opacity=0.2
            ).add_to(m)
            polygon_added = True  # Mark that a polygon was successfully added
        except QhullError:
            print(f"Cluster {cluster_id}: QhullError occurred; skipping hull computation.")
    else:
        print(f"Cluster {cluster_id}: Less than 3 points; skipping perimeter.")

# Add markers for all points
for _, row in data_sample.iterrows():
    folium.Marker(
        location=[row["lat"], row["long"]],
        icon=folium.DivIcon(html=f"""
        <svg width="10px" height="10px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
            <circle cx="12" cy="12" r="5" fill="black"/>
        </svg>
        """)
    ).add_to(m)

# Save and display the map
m.save("cluster_map_dbscan.html")
m

## Task 8: Cluster Characterisation using Apriori algorithm

Now, we would like to describe the obtained clusters. To do so, let's use frequent pattern mining and in particular the **Apriori algorithm**.

**QUESTIONS**
* First, convert numerical features to categorical (low, medium, high) based on quantiles. Add binary columns, e.g. `sepal length low`, `sepal length medium`, `sepal length high` depending on the values
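
A minimal sketch of this conversion with pandas is given below. It uses `pd.qcut` to cut one numeric column into three quantile-based levels and `pd.get_dummies` to produce one binary column per level. The column `date_taken_hour` exists in our dataset, but the values and the choice of this feature are only assumptions for the illustration.

```python
import pandas as pd

# Hypothetical values for one numeric feature (invented for the example).
df = pd.DataFrame({"date_taken_hour": [1, 3, 8, 9, 12, 14, 15, 18, 21, 23, 10, 16]})

# Split the feature into three quantile-based bins: low / medium / high.
df["hour_level"] = pd.qcut(df["date_taken_hour"], q=3, labels=["low", "medium", "high"])

# One binary column per level, e.g. "date_taken_hour low".
binary = pd.get_dummies(df["hour_level"], prefix="date_taken_hour", prefix_sep=" ").astype(bool)
print(binary.head())
```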

In [None]:
# ANSWER


To find associations, we are going to use [mlxtend.frequent_patterns.apriori](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).

In [None]:
# frequent patterns
from mlxtend.frequent_patterns import apriori

**QUESTIONS**

* Use apriori algorithm to find frequent patterns for each cluster
* Then among these itemsets, find those that are not frequent for other clusters
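
To make these two steps concrete, here is a small level-wise (Apriori-style) search written with the standard library only, as a stand-in for `mlxtend`. The function `frequent_itemsets`, the tag sets and the `min_support` value are all hypothetical. With `mlxtend`, the input would instead be a one-hot encoded DataFrame per cluster.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    size = 1
    while candidates:
        # Keep the candidates that are frequent enough.
        level = {}
        for c in candidates:
            support = sum(c <= t for t in transactions) / n
            if support >= min_support:
                level[c] = support
        result.update(level)
        # Next level: combine frequent items, prune sets with an infrequent subset.
        size += 1
        items = sorted({item for c in level for item in c})
        candidates = {
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        }
    return result

# Hypothetical tag sets of photos, grouped by cluster id.
tags_by_cluster = {
    0: [{"lyon", "fourviere", "basilique"}, {"lyon", "fourviere"},
        {"fourviere", "basilique"}, {"lyon", "fourviere", "night"}],
    1: [{"lyon", "bellecour"}, {"lyon", "bellecour", "statue"},
        {"bellecour", "statue"}, {"lyon", "night"}],
}

frequent = {cid: frequent_itemsets(t, min_support=0.5) for cid, t in tags_by_cluster.items()}

# Itemsets that are frequent in one cluster and in no other cluster.
distinctive = {
    cid: {s for s in sets if all(s not in other for oid, other in frequent.items() if oid != cid)}
    for cid, sets in frequent.items()
}
for cid, sets in distinctive.items():
    print(cid, sorted(sorted(s) for s in sets))
```

In this toy example, the tag `lyon` is frequent in both clusters and is therefore dropped, while `fourviere` and `bellecour` remain as descriptions of their respective clusters.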

In [None]:
# ANSWER

# Special function : show image on click

In order to show the photograph when we click on a point, we first had to get a Flickr API key.

In [None]:
FLICKR_API_KEY = "c9f9183c6f9f862b589a12aa27b9c8e6"

The process of showing images:
- We first wanted to show images on hover, but the only way to do that was to preload all the images, which took approximately 27 seconds for 100 images and 5 min 30 s for all 1000.
- We therefore decided to show images on click instead. However, we then faced the issue of dynamically rendering the photos when a marker is clicked.
- We used HTML and JavaScript to render the images dynamically.

In [None]:
unavailable_image_path = os.path.abspath("unavailable-image.jpg")

fetch_image_script = f"""
<script>
function fetchFlickrImage(photo_id, containerId) {{
    console.log("Fetching image for photo_id:", photo_id);
    var apiUrl = "https://api.flickr.com/services/rest/?method=flickr.photos.getInfo&api_key={FLICKR_API_KEY}&photo_id=" + photo_id + "&format=json&nojsoncallback=1";

    fetch(apiUrl)
        .then(response => response.json())
        .then(data => {{
            var container = document.getElementById(containerId);

            if (data.photo) {{
                var server_id = data.photo.server;
                var secret = data.photo.secret;
                var imageUrl = "https://live.staticflickr.com/" + server_id + "/" + photo_id + "_" + secret + ".jpg";
                console.log("Image URL:", imageUrl);

                // Replace placeholder with the fetched image
                container.innerHTML = "<img src='" + imageUrl + "' style='width: 250px; height: 180px; object-fit: cover; border-radius: 8px; display: block; margin: auto;'/>";
            }} else {{
                console.log("Image not found, using fallback.");
                container.innerHTML = "<img src='file://{unavailable_image_path}' style='width: 250px; height: 180px; object-fit: cover; border-radius: 8px; display: block; margin: auto;'/>";
            }}
        }})
        .catch(error => {{
            console.error("Error fetching image:", error);
            var container = document.getElementById(containerId);
            container.innerHTML = "<img src='file://{unavailable_image_path}' style='width: 250px; height: 180px; object-fit: cover; border-radius: 8px; display: block; margin: auto;'/>";
        }});
}}
</script>
"""

And in case you haven't run the code in order, make sure to run the next cell.

In [None]:
clustering_ward = AgglomerativeClustering(n_clusters=100, linkage='ward').fit(data_sample[["lat", "long"]] )
labels = clustering_ward.labels_

In [None]:
m = folium.Map(location=[45.762611, 4.832805], zoom_start=14)
clustered_points = data_sample.assign(cluster=labels)


m.get_root().html.add_child(folium.Element(fetch_image_script))

for cluster_id in clustered_points['cluster'].unique():
    cluster_points = clustered_points[clustered_points['cluster'] == cluster_id]
    points = cluster_points[['lat', 'long']].values

    if len(points) >= 10: # We only show clusters with at least 10 points.
        try:
            hull = ConvexHull(points)
            hull_points = points[hull.vertices]
            
            folium.Polygon(
                locations=hull_points,
                color=cluster_colors[cluster_id % len(cluster_colors)], 
                weight=2,
                fill=True,
                fill_opacity=0.2,
            ).add_to(m)
        except QhullError:
            print(f"Cluster {cluster_id}: QhullError occurred; skipping hull computation.")
    else:
        print(f"Cluster {cluster_id}: Less than 10 points; skipping perimeter.")


for i, row in data_sample.iterrows():
    lat, lon, photo_id = row["lat"], row["long"], row["id"]
    
    
    link_id = f"image_link_{i}"
    container_id = f"image_container_{i}"

    # fetchFlickrImage(photo_id, containerId) takes two arguments:
    # the photo id and the id of the container to fill with the image
    popup_html = f"""
    <div id="{container_id}" style="width: 250px; height: 180px; display: flex; align-items: center; justify-content: center; text-align: center; border-radius: 8px; background-color: white; padding: 10px;">
        <a href="#" id="{link_id}" onclick="fetchFlickrImage('{photo_id}', '{container_id}'); return false;"
           style="text-decoration: none; font-size: 14px; color: blue; font-weight: bold;">
            📷 Click to Load Image
        </a>
    </div>
    """

    popup = folium.Popup(popup_html, max_width=270)

    folium.Marker(
        location=[lat, lon],
        popup=popup,
        icon=folium.DivIcon(html=f"""
        <svg width="10px" height="10px" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
            <circle cx="12" cy="12" r="5" fill="black"/>
        </svg>
        """),
    ).add_to(m)

m.save("cluster_map_click_fixed_size_image.html")
m
