# 2.4 DataFrame Exploration

When working with geoscientific data in machine learning, the quality and structure of your data are key factors in building reliable models. A first step toward AI-ready data is an initial exploration of the data set. In this lecture, we will walk through a typical data preparation pipeline using a pandas DataFrame, focusing on cleaning and transforming the data so that it is ready for modeling. The example will highlight reading data, checking correlations, handling missing values (NaNs), and removing zeros where appropriate.

## Read and Explore the data

We will download a Kaggle data set:

https://www.kaggle.com/datasets/lucidlenn/sloan-digital-sky-survey




https://www.kaggle.com/code/alanabd/skyserver-sql2-27-2018-ile-ml-ve-cv/input

In [11]:
import pandas as pd

!wget "https://raw.githubusercontent.com/UW-MLGEO/MLGeo-dataset/refs/heads/main/data/Skyserver_SQL2_27_2018 6_51_39 PM.csv"

--2024-10-07 06:15:27--  https://raw.githubusercontent.com/UW-MLGEO/MLGeo-dataset/refs/heads/main/data/Skyserver_SQL2_27_2018%206_51_39%20PM.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1377602 (1.3M) [text/plain]
Saving to: ‘Skyserver_SQL2_27_2018 6_51_39 PM.csv’


2024-10-07 06:15:28 (4.27 MB/s) - ‘Skyserver_SQL2_27_2018 6_51_39 PM.csv’ saved [1377602/1377602]



In [12]:
df = pd.read_csv("Skyserver_SQL2_27_2018 6_51_39 PM.csv")

In [3]:
# Get the first few rows of the dataset
df.head()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1.23765e+18,183.531326,0.089693,19.47406,17.0424,15.94699,15.50342,15.22531,752,301,4,267,3.72236e+18,STAR,-9e-06,3306,54922,491
1,1.23765e+18,183.59837,0.135285,18.6628,17.21449,16.67637,16.48922,16.3915,752,301,4,267,3.63814e+17,STAR,-5.5e-05,323,51615,541
2,1.23765e+18,183.680207,0.126185,19.38298,18.19169,17.47428,17.08732,16.80125,752,301,4,268,3.23274e+17,GALAXY,0.123111,287,52023,513
3,1.23765e+18,183.870529,0.049911,17.76536,16.60272,16.16116,15.98233,15.90438,752,301,4,269,3.72237e+18,STAR,-0.000111,3306,54922,510
4,1.23765e+18,183.883288,0.102557,17.55025,16.26342,16.43869,16.55492,16.61326,752,301,4,269,3.72237e+18,STAR,0.00059,3306,54922,512


In [13]:
# what datatypes are in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   objid      10000 non-null  float64
 1   ra         10000 non-null  float64
 2   dec        10000 non-null  float64
 3   u          10000 non-null  float64
 4   g          10000 non-null  float64
 5   r          10000 non-null  float64
 6   i          10000 non-null  float64
 7   z          10000 non-null  float64
 8   run        10000 non-null  int64  
 9   rerun      10000 non-null  int64  
 10  camcol     10000 non-null  int64  
 11  field      10000 non-null  int64  
 12  specobjid  10000 non-null  float64
 13  class      10000 non-null  object 
 14  redshift   10000 non-null  float64
 15  plate      10000 non-null  int64  
 16  mjd        10000 non-null  int64  
 17  fiberid    10000 non-null  int64  
dtypes: float64(10), int64(7), object(1)
memory usage: 1.4+ MB


The `class` attribute has dtype `object`, meaning it holds character strings such as `STAR` and `GALAXY`. All other attributes are numerical (`float64` or `int64`).
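
One way to separate numerical columns from string columns programmatically is `select_dtypes`. The sketch below uses a small hypothetical DataFrame named `toy`, which mimics a few columns of the data set rather than loading the full file, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical three-row sample mimicking a few columns of the data set
toy = pd.DataFrame({
    "ra": [183.53, 183.60, 183.68],
    "redshift": [-9e-06, -5.5e-05, 0.123111],
    "run": [752, 752, 752],
    "class": ["STAR", "STAR", "GALAXY"],
})

# Columns holding numbers (floats and integers)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Columns holding anything else (here, the string label)
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of rows in each category of the label column
class_counts = toy["class"].value_counts()

print(numeric_cols)
print(non_numeric_cols)
print(class_counts)
```

The same calls can be applied to `df` directly. `value_counts()` is also a quick check on whether the classes are balanced.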

In [4]:
# Summary statistics
df.describe()


Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,redshift,plate,mjd,fiberid
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,1.23765e+18,175.529987,14.836148,18.619355,17.371931,16.840963,16.583579,16.422833,981.0348,301.0,3.6487,302.3801,1.645022e+18,0.143726,1460.9864,52943.5333,353.0694
std,0.0,47.783439,25.212207,0.828656,0.945457,1.067764,1.141805,1.203188,273.305024,0.0,1.666183,162.577763,2.013998e+18,0.388774,1788.778371,1511.150651,206.298149
min,1.23765e+18,8.2351,-5.382632,12.98897,12.79955,12.4316,11.94721,11.61041,308.0,301.0,1.0,11.0,2.99578e+17,-0.004136,266.0,51578.0,1.0
25%,1.23765e+18,157.370946,-0.539035,18.178035,16.8151,16.173333,15.853705,15.618285,752.0,301.0,2.0,184.0,3.389248e+17,8.1e-05,301.0,51900.0,186.75
50%,1.23765e+18,180.394514,0.404166,18.853095,17.495135,16.85877,16.554985,16.389945,756.0,301.0,4.0,299.0,4.96658e+17,0.042591,441.0,51997.0,351.0
75%,1.23765e+18,201.547279,35.649397,19.259232,18.010145,17.512675,17.25855,17.141447,1331.0,301.0,5.0,414.0,2.8813e+18,0.092579,2559.0,54468.0,510.0
max,1.23765e+18,260.884382,68.542265,19.5999,19.91897,24.80204,28.17963,22.83306,1412.0,301.0,6.0,768.0,9.46883e+18,5.353854,8410.0,57481.0,1000.0


In [5]:
# Check for missing values (NaNs)
print(df.isnull().sum())


objid        0
ra           0
dec          0
u            0
g            0
r            0
i            0
z            0
run          0
rerun        0
camcol       0
field        0
specobjid    0
class        0
redshift     0
plate        0
mjd          0
fiberid      0
dtype: int64


In [6]:

# Check for zeros
print((df == 0).sum())


objid         0
ra            0
dec           0
u             0
g             0
r             0
i             0
z             0
run           0
rerun         0
camcol        0
field         0
specobjid     0
class         0
redshift     19
plate         0
mjd           0
fiberid       0
dtype: int64


Geoscience datasets often contain missing values (e.g., due to sensor malfunctions or data collection gaps) and zeros (which may or may not be meaningful depending on the context). You'll need to treat these cases carefully.

In this data set, no column contains missing values, and only `redshift` contains zeros (19 rows).
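
The options for treating NaNs and zeros can be sketched as follows. This is a minimal illustration on a hypothetical DataFrame `toy` with one NaN and one zero inserted by hand. Whether a zero should be treated as missing is an assumption that has to be checked column by column; here it is done only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one NaN in "u" and one zero in "redshift"
toy = pd.DataFrame({
    "u": [19.47, np.nan, 19.38, 17.77],
    "redshift": [0.0, -5.5e-05, 0.123111, 0.00059],
})

# Option 1: drop every row that has at least one NaN
dropped = toy.dropna()

# Option 2: fill NaNs with the column median instead of dropping rows
filled = toy.fillna(toy.median(numeric_only=True))

# Zeros, option A: mark them as missing
# (only valid if zero is a placeholder for "no measurement")
zero_as_nan = toy["redshift"].replace(0.0, np.nan)

# Zeros, option B: remove the rows where the column is exactly zero
no_zero_rows = toy[toy["redshift"] != 0]

print(len(dropped), len(no_zero_rows))
```

Dropping rows is simple but discards data. Filling keeps every row at the cost of introducing values that were never measured.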

## Correlation Analysis
In machine learning, understanding the relationships between features can provide valuable insights. For example, in geosciences, soil moisture might be correlated with precipitation or vegetation indices. You can use a correlation matrix to check for such relationships.
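
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. For two features $x$ and $y$ with $n$ samples and sample means $\bar{x}$ and $\bar{y}$, it is

$$
r_{xy} = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}}
$$

The value lies between $-1$ and $1$. It is undefined for a feature with zero variance, which pandas reports as NaN.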

In [14]:
# Calculate the correlation matrix on the numerical columns only;
# the string column `class` cannot be converted to float
corr_matrix = df.corr(numeric_only=True)

In [None]:

# Display correlation matrix
print(corr_matrix)

# Visualize the correlation matrix (optional but helpful)
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()


Look for high correlations, which can indicate redundancy among features. In cases where variables are highly correlated, you may decide to drop one to avoid multicollinearity.
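
A common way to act on this is to scan the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below is one possible approach, not a prescribed method. The DataFrame `toy`, the column names, and the `threshold` value of 0.9 are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is an exact linear function of "a"; "c" is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.9  # arbitrary cutoff chosen for illustration

# Absolute correlations, so strong negative relationships count too
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is strongly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Which member of a correlated pair to keep is a modeling decision. This sketch simply keeps the column that appears first.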