
# Unsupervised Learning Project: EDA on MODIS Vegetation Indices Dataset

This notebook explores the MODIS Vegetation Indices dataset as part of the unsupervised learning project. It includes data loading, summary statistics, and visualizations to understand the data's structure and relationships.
    

In [None]:

import pandas as pd

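# Path recorded in the original notebook. It points to a notebook file rather
# than a data file and is not read anywhere below.
file_path = r"C:\Users\fanga\Downloads\Unsupervised_Project_EDA (2).ipynb"

# Build the DataFrame in memory: every numeric column increases linearly
# with the row index i.
modis_data = pd.DataFrame({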
    'Region': [f'Region_{i}' for i in range(10000)],
    'NDVI': [round(0.5 + i * 0.00001, 5) for i in range(10000)],
    'EVI': [round(0.3 + i * 0.00001, 5) for i in range(10000)],
    'Latitude': [round(10 + i * 0.0001, 5) for i in range(10000)],
    'Longitude': [round(-50 + i * 0.0001, 5) for i in range(10000)]
})

# Display the shape of the dataset
print("Dataset Shape:", modis_data.shape)
modis_data.head()
    

In [None]:

# Summary statistics
modis_data.describe()
    


### Pair Plot

The pair plot below visualizes the relationships between NDVI, EVI, Latitude, and Longitude. This helps identify potential clusters or patterns.
    

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot for key variables
sns.pairplot(modis_data[['NDVI', 'EVI', 'Latitude', 'Longitude']])
plt.show()
    


### Correlation Matrix Heatmap

The heatmap below shows the pairwise Pearson correlations (the default for `DataFrame.corr()`) between NDVI, EVI, Latitude, and Longitude, which helps separate strong relationships from weak ones.
    

In [None]:

# Correlation matrix
corr_matrix = modis_data[['NDVI', 'EVI', 'Latitude', 'Longitude']].corr()

# Heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix Heatmap")
plt.show()
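
The same correlation matrix can also be read as a ranked list, which is sometimes easier to scan than a heatmap. The sketch below is one possible way to do that; the helper name `rank_correlations` and the toy columns `a`, `b`, `c` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame) -> pd.Series:
    """Return each unique column pair once, sorted by absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    # and the diagonal of 1.0 values is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy data: b is an exact multiple of a, c is negatively related to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
ranked = rank_correlations(toy)
print(ranked)
```

Applied to the notebook's data, the call would be `rank_correlations(modis_data[['NDVI', 'EVI', 'Latitude', 'Longitude']])`.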
    


### Additional Plots

Below are additional visualizations to explore the data further.
- **NDVI Distribution**: Highlights the spread of NDVI values.
- **Latitude vs NDVI Scatter Plot**: Examines geographic trends in vegetation indices.
- **Cumulative Distribution Function (CDF)**: Provides a cumulative view of NDVI values.
    

In [None]:

# NDVI distribution plot
sns.histplot(modis_data['NDVI'], kde=True)
plt.title("NDVI Distribution")
plt.xlabel("NDVI")
plt.show()
    

In [None]:

# Scatter plot of Latitude vs NDVI
sns.scatterplot(data=modis_data, x='Latitude', y='NDVI')
plt.title("Latitude vs NDVI")
plt.xlabel("Latitude")
plt.ylabel("NDVI")
plt.show()
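
A geographic trend seen in the scatter plot can be quantified with a straight-line fit. The sketch below rebuilds the Latitude and NDVI columns from the formulas in the data cell and fits a degree-1 polynomial with `np.polyfit`; using a linear fit here is an assumption made for illustration, not a step from the original analysis.

```python
import numpy as np

# Rebuild the two columns exactly as the notebook's data cell defines them.
latitude = np.array([round(10 + i * 0.0001, 5) for i in range(10000)])
ndvi = np.array([round(0.5 + i * 0.00001, 5) for i in range(10000)])

# Degree-1 least-squares fit: ndvi is approximated by slope * latitude + intercept.
slope, intercept = np.polyfit(latitude, ndvi, deg=1)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

Because both columns are linear in the row index, the fit should recover a slope close to 0.1 and an intercept close to -0.5.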
    

In [None]:

# CDF of NDVI
sns.ecdfplot(data=modis_data, x='NDVI')
plt.title("CDF of NDVI")
plt.xlabel("NDVI")
plt.ylabel("Proportion")
plt.show()
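
For reference, the curve that `sns.ecdfplot` draws can be computed by hand: sort the values and assign the k-th smallest value the proportion k/n. The function name `ecdf` below is an illustrative assumption.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and the proportion of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    # The k-th smallest of n observations sits at cumulative proportion k / n.
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# Hypothetical toy values, standing in for modis_data['NDVI'].
x, y = ecdf([0.53, 0.51, 0.52, 0.54])
print(x)
print(y)
```

Reading off the value of `x` where `y` first reaches 0.5 gives the median, which is one practical use of the CDF view.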
    

In [None]:

# Check for missing data
missing_data = modis_data.isnull().sum()
print("Missing Data:\n", missing_data)
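
Raw counts of missing values are harder to compare across columns than shares. As a small sketch, `isna().mean()` gives the fraction of missing entries per column; the toy frame below is hypothetical and only borrows the notebook's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps; the column names mirror the notebook's.
toy = pd.DataFrame({
    "NDVI": [0.51, np.nan, 0.53, 0.54],
    "EVI": [0.31, 0.32, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values, i.e. missing entries.
missing_share = toy.isna().mean()
print(missing_share)
```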
    


### Summary of Findings and Questions

1. The dataset has 10,000 rows and 5 columns, making it manageable for analysis in standard tools.
2. The data cover geographic coordinates (Latitude, Longitude) and two vegetation indices (NDVI and EVI), with one row per region.
3. **Key Questions**:
   - Can clusters of regions with similar vegetation patterns be identified?
   - Are there noticeable geographic trends in NDVI or EVI?
   - What relationships exist between NDVI and EVI?
4. This dataset does not qualify as "big data" but is well-suited for clustering tasks.
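
As a possible starting point for the first key question, the sketch below runs a minimal NumPy-only version of Lloyd's k-means on standardized (NDVI, EVI) pairs. The function `kmeans`, the toy data, and the choice of k = 2 are all assumptions made for illustration; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialise the centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of (NDVI, EVI) pairs.
rng = np.random.default_rng(42)
low = rng.normal(loc=[0.2, 0.1], scale=0.02, size=(50, 2))
high = rng.normal(loc=[0.8, 0.6], scale=0.02, size=(50, 2))
features = np.vstack([low, high])

# Standardise so both indices contribute equally to the distance.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
labels, centroids = kmeans(scaled, k=2)
print(np.bincount(labels))
```

Standardizing first matters whenever the features have different scales, as Latitude and NDVI do in this notebook, because k-means relies on Euclidean distance.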
    