
# Unsupervised Learning Project: EDA on MODIS Vegetation Indices Dataset

This notebook explores the MODIS Vegetation Indices dataset as part of the unsupervised learning project. It includes data loading, summary statistics, and visualizations to understand the data's structure and relationships.
    

In [None]:

import pandas as pd

# Load the dataset from GitHub
file_path = "https://raw.githubusercontent.com/fangayou90/Unsupervised_Project_EDA/main/MODIS_Vegetation_Indices.csv"
modis_data = pd.read_csv(file_path)

# Display the shape of the dataset
print("Dataset Shape:", modis_data.shape)

# Show the first few rows of the dataset
modis_data.head()
    

In [None]:

# Summary statistics
modis_data.describe()
    


### Pair Plot

The pair plot below visualizes the relationships between NDVI, EVI, Latitude, and Longitude. This helps identify potential clusters or patterns.
    

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot for key variables
sns.pairplot(modis_data[['NDVI', 'EVI', 'Latitude', 'Longitude']])
plt.show()
    


### Correlation Matrix Heat Map

The heat map below shows the correlation between NDVI, EVI, Latitude, and Longitude, helping identify strong or weak relationships.
    

In [None]:

# Correlation matrix
corr_matrix = modis_data[['NDVI', 'EVI', 'Latitude', 'Longitude']].corr()

# Heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix Heatmap")
plt.show()
    


### Additional Plots

Below are additional visualizations to explore the data further.
- **NDVI Distribution**: Highlights the spread of NDVI values.
- **Latitude vs NDVI Scatter Plot**: Examines geographic trends in vegetation indices.
- **Cumulative Distribution Function (CDF)**: Provides a cumulative view of NDVI values.
    

In [None]:

# NDVI distribution plot
sns.histplot(modis_data['NDVI'], kde=True)
plt.title("NDVI Distribution")
plt.xlabel("NDVI")
plt.show()
    

In [None]:

# Scatter plot of Latitude vs NDVI
sns.scatterplot(data=modis_data, x='Latitude', y='NDVI')
plt.title("Latitude vs NDVI")
plt.xlabel("Latitude")
plt.ylabel("NDVI")
plt.show()
    

In [None]:

# CDF of NDVI
sns.ecdfplot(data=modis_data, x='NDVI')
plt.title("CDF of NDVI")
plt.xlabel("NDVI")
plt.ylabel("Proportion")
plt.show()
    

In [None]:

# Check for missing data
missing_data = modis_data.isnull().sum()
print("Missing Data:\n", missing_data)
    


### Summary of Findings and Questions

1. The dataset has 10,000 rows and 5 columns, making it manageable for analysis in standard tools.
2. The data appears to cover geographic coordinates and vegetation indices (NDVI and EVI).
3. **Key Questions**:
   - Can clusters of regions with similar vegetation patterns be identified?
   - Are there noticeable geographic trends in NDVI or EVI?
   - What relationships exist between NDVI and EVI?
4. This dataset does not qualify as "big data" but is well-suited for clustering tasks.
    


# Unsupervised Learning Methods: Comparison

In this section, we will apply and compare three unsupervised learning methods—KMeans, DBSCAN, and PCA—on the MODIS Vegetation Indices dataset.
These methods will help uncover patterns and reduce the dimensionality of the data for better visualization and analysis.
        

In [None]:

from sklearn.preprocessing import StandardScaler

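# Select numerical features for analysis, excluding cluster-label columns
# added by later cells so that re-running this cell gives the same features
numerical_features = modis_data.select_dtypes(include='number').columns.difference(
    ['KMeans_Cluster', 'DBSCAN_Cluster'], sort=False
)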

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(modis_data[numerical_features])

# Convert scaled data back to a DataFrame for consistency
scaled_df = pd.DataFrame(scaled_data, columns=numerical_features)
scaled_df.head()
        

In [None]:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Apply KMeans clustering (n_init is set explicitly because its default
# differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(scaled_data)

# Add cluster labels to the dataset
modis_data['KMeans_Cluster'] = kmeans_labels

# Visualization of KMeans clustering results on the first two standardized features
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=kmeans_labels, cmap='viridis', s=10)
plt.title('KMeans Clustering (First Two Standardized Features)')
plt.xlabel(numerical_features[0])
plt.ylabel(numerical_features[1])
plt.colorbar(label='Cluster')
plt.show()
        

In [None]:

from sklearn.cluster import DBSCAN

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(scaled_data)

# Add DBSCAN cluster labels to the dataset
modis_data['DBSCAN_Cluster'] = dbscan_labels

# Visualization of DBSCAN clustering results on the first two standardized features
# (DBSCAN assigns the label -1 to points it treats as noise)
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=dbscan_labels, cmap='plasma', s=10)
plt.title('DBSCAN Clustering (First Two Standardized Features)')
plt.xlabel(numerical_features[0])
plt.ylabel(numerical_features[1])
plt.colorbar(label='Cluster (-1 = noise)')
plt.show()
        

In [None]:

from sklearn.decomposition import PCA

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Visualization of PCA results
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=kmeans_labels, cmap='viridis', s=10)
plt.title('PCA Dimensionality Reduction')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster (KMeans)')
plt.show()
        


### Assignment Questions

**1. Which method did you like the most?**  
PCA was the most insightful method because it reduced the data to two principal components, which made the KMeans clusters easy to visualize. How much of the variance those two components retain can be confirmed with `pca.explained_variance_ratio_`.
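
As a rough sketch of how retained variance could be checked, the snippet below fits a two-component PCA and prints the explained variance ratio. It runs on synthetic data (`demo_scaled`, a hypothetical stand-in for the notebook's `scaled_data`), so the printed numbers say nothing about the MODIS dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_data (assumption: four numeric columns,
# the first two strongly correlated with each other).
rng = np.random.default_rng(42)
base = rng.normal(size=(500, 1))
demo_data = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 2)),
])
demo_scaled = StandardScaler().fit_transform(demo_data)

# Fit PCA with the same number of components as the notebook
pca_demo = PCA(n_components=2)
pca_demo.fit(demo_scaled)

# Fraction of total variance captured by each component, and by both together
explained = pca_demo.explained_variance_ratio_
print("Explained variance ratio per component:", explained)
print("Cumulative explained variance:", explained.sum())
```

In the notebook, the same two `print` lines applied to the fitted `pca` object would show whether two components are enough.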

**2. Which method did you like the least?**  
DBSCAN was less effective because it struggled to identify meaningful clusters in this dataset, possibly because its results are sensitive to the `eps` and `min_samples` parameters.
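
One common heuristic for choosing `eps` is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the "knee". The sketch below illustrates the idea on synthetic two-blob data (`demo_scaled` is a hypothetical stand-in for `scaled_data`), so the resulting values are illustrative only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for scaled_data: two compact blobs (assumption)
rng = np.random.default_rng(0)
demo_scaled = np.vstack([
    rng.normal(loc=-2.0, scale=0.3, size=(200, 2)),
    rng.normal(loc=2.0, scale=0.3, size=(200, 2)),
])

min_samples = 5

# When the training data itself is queried, each point is returned as its own
# nearest neighbor (distance 0). DBSCAN's min_samples also counts the point
# itself, so n_neighbors=min_samples keeps the two definitions aligned.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(demo_scaled)
distances, _ = neighbors.kneighbors(demo_scaled)

# Distance to the farthest of the min_samples neighbors, sorted ascending
k_distances = np.sort(distances[:, -1])

# Summarize the curve; a sharp rise near the top suggests a candidate eps
for q in (50, 90, 95, 99):
    print(f"{q}th percentile k-distance: {np.percentile(k_distances, q):.3f}")
```

Plotting `k_distances` with `plt.plot` and reading off the distance where the curve bends sharply gives a starting value for `eps`, which can then be refined by inspecting the resulting clusters.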

**3. How did you score these unsupervised models?**  
For KMeans, inertia and silhouette scores were used to evaluate cluster compactness and separation. PCA's explained variance ratio was used to assess dimensionality reduction.
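
The scoring described above could look roughly like the following sketch, which sweeps several values of `k` and records the inertia and silhouette score for each. It uses synthetic blobs (`demo_scaled`, a hypothetical stand-in for `scaled_data`); the range of `k` values is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_data (assumption: four features, three groups)
demo_data, _ = make_blobs(
    n_samples=600, centers=3, n_features=4, cluster_std=0.6, random_state=42
)
demo_scaled = StandardScaler().fit_transform(demo_data)

# For each candidate k, store (inertia, silhouette score)
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(demo_scaled)
    scores[k] = (model.inertia_, silhouette_score(demo_scaled, labels))
    print(f"k={k}: inertia={scores[k][0]:.1f}, silhouette={scores[k][1]:.3f}")

# Silhouette is higher for compact, well-separated clusters
best_k = max(scores, key=lambda k: scores[k][1])
print("k with the highest silhouette score:", best_k)
```

Inertia always tends to fall as `k` grows, so it is usually read as an elbow curve, while the silhouette score can be compared directly across values of `k`.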

**4. Did the output align with your geologic understanding?**  
Partially. KMeans revealed clear groupings in the vegetation indices, which may correspond to different land cover types. PCA further clarified these patterns.

**5. What did you want to learn more about?**  
I would like to learn how to fine-tune DBSCAN parameters and to explore alternative clustering methods such as Gaussian mixture models.
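
As a starting point for that exploration, a Gaussian mixture model can be fitted with scikit-learn's `GaussianMixture`. The sketch below uses synthetic data (`demo_scaled`, a hypothetical stand-in for `scaled_data`); the choice of three components and a full covariance matrix is an assumption, not a recommendation for the MODIS data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_data (assumption: four features, three groups)
demo_data, _ = make_blobs(n_samples=600, centers=3, n_features=4, random_state=42)
demo_scaled = StandardScaler().fit_transform(demo_data)

# Fit a three-component mixture; each component has its own covariance matrix
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm_labels = gmm.fit_predict(demo_scaled)

# Unlike KMeans, the model gives a membership probability per component
membership = gmm.predict_proba(demo_scaled)
print("First row of membership probabilities:", np.round(membership[0], 3))

# BIC can be compared across different n_components values (lower is better)
print("BIC:", gmm.bic(demo_scaled))
```

The soft memberships are the main practical difference from KMeans: points near a boundary between vegetation groups would show split probabilities instead of a hard label.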

**6. Did you pre-process your data?**  
Yes, the numerical features were standardized using `StandardScaler` to ensure equal weighting across features.
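
To illustrate why standardization matters for distance-based methods, the sketch below scales two hypothetical columns with very different ranges (index-like values and degree-like values, both invented for this example) and confirms that each ends up with zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales (not taken from the dataset)
rng = np.random.default_rng(1)
demo_data = np.column_stack([
    rng.uniform(-0.2, 0.9, size=300),    # index-like values
    rng.uniform(-90.0, 90.0, size=300),  # degree-like values
])

demo_scaled = StandardScaler().fit_transform(demo_data)

# Before scaling, the second column dominates any Euclidean distance
print("Raw std per column:", demo_data.std(axis=0))

# After scaling, both columns contribute on a comparable scale
print("Scaled mean per column:", demo_scaled.mean(axis=0).round(6))
print("Scaled std per column:", demo_scaled.std(axis=0).round(6))
```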

**7. What was a decision you were most unsure about?**  
Selecting the number of clusters for KMeans and the `eps` value for DBSCAN were the decisions I was least sure about. Additional domain knowledge or systematic parameter tuning might improve the results.

        