# NOTEBOOK 2: Data Exploration

## Import libraries for data exploration

In [None]:
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np
import sys
sys.path.append('..')
from utilities.creds import Credentials
from snowflake.snowpark import version
print(version.VERSION)

%matplotlib inline

## Create Snowpark Session

In [None]:
session = Session.builder.configs(Credentials().__dict__).create()

session.use_role("LEARNINGSNOWPARKROLE")
session.use_database("SCIKIT_LEARN")
session.use_schema("SCIKIT_LEARN.PUBLIC")
session.use_warehouse("LEARNINGSNOWPARKVW")

print(session.sql('select current_warehouse(), current_database(), current_schema()').collect())

## Pandas DataFrames compared to Snowpark DataFrames

This section showcases the differences between using a Pandas DataFrame (data in memory on the client machine) and a Snowpark DataFrame (data in Snowflake).

In [None]:
# Creating a Pandas DataFrame
pandas_df = pd.read_csv('datasets/housing/housing.csv')
print(type(pandas_df))

In [None]:
# Creating a Snowpark DataFrame
snowpark_df = session.table('HOUSING_DATA')
print(type(snowpark_df))

Since a Snowpark DataFrame does not contain any data, i.e. it is only a "pointer" to data in Snowflake, the memory it uses on the client side is minimal.

In [None]:
# Compare size
print('Size in MB of Pandas DataFrame in Memory:\n', np.round(sys.getsizeof(pandas_df) / (1024.0**2), 2))
print('Size in MB of Snowpark DataFrame in Memory:\n', np.round(sys.getsizeof(snowpark_df) / (1024.0**2), 2))

A Snowpark DataFrame can easily be converted to a Pandas DataFrame with the **to_pandas** method. This pulls the data back from Snowflake and loads it into client memory.

In [None]:
# Converting a Snowpark DataFrame to Pandas DataFrame
pandas_df_from_snowflake = snowpark_df.to_pandas()

In [None]:
pandas_df.shape, pandas_df_from_snowflake.shape

To get a peek at the data that a Snowpark DataFrame represents, the **show** method can be used.

In [None]:
snowpark_df.show()

Looking at the queries shows what is actually kept in client memory: the SQL needed to return the data according to our DataFrame definition.

In [None]:
snowpark_df.queries
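The idea that a DataFrame can hold only SQL text, and no rows, can be illustrated with a toy class. This is a simplified sketch in plain Python: `LazyFrame` and its methods are hypothetical names made up for illustration, and this is not how Snowpark is implemented internally.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Snowpark)."""

    def __init__(self, sql):
        # The only state kept on the client is the SQL text.
        self.sql = sql

    @classmethod
    def table(cls, name):
        return cls(f"SELECT * FROM {name}")

    def select(self, *cols):
        # Wrap the previous query instead of fetching any rows.
        return LazyFrame(f"SELECT {', '.join(cols)} FROM ({self.sql})")

    def filter(self, condition):
        return LazyFrame(f"SELECT * FROM ({self.sql}) WHERE {condition}")


df = LazyFrame.table("HOUSING_DATA")
subset = df.select("TOTAL_ROOMS", "HOUSEHOLDS").filter("HOUSEHOLDS > 100")
print(subset.sql)
```

Each operation returns a new object whose query wraps the previous one, so nothing is fetched until a result is actually requested.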

The Snowpark DataFrame API supports multiple ways to select specific columns

In [None]:
# Select specific columns
snowpark_df_subset = snowpark_df.select('HOUSING_MEDIAN_AGE','TOTAL_ROOMS','TOTAL_BEDROOMS','HOUSEHOLDS','OCEAN_PROXIMITY')
snowpark_df_subset.show()

In [None]:
# pandas-like syntax for column selection from a Snowpark DataFrame
snowpark_df_subset = snowpark_df[['HOUSING_MEDIAN_AGE','TOTAL_ROOMS','TOTAL_BEDROOMS','HOUSEHOLDS','OCEAN_PROXIMITY']]
snowpark_df_subset.show()

In [None]:
snowpark_df_subset.queries

The **with_column** method can be used to add a new column to a Snowpark DataFrame (**with_columns** adds several columns at the same time).

In [None]:
snowpark_df_new_col = snowpark_df_subset.with_column('BEDROOM_RATIO', F.col('TOTAL_BEDROOMS') / F.col('TOTAL_ROOMS'))
snowpark_df_new_col.show()

In [None]:
snowpark_df_new_col.queries

To remove a column from a Snowpark DataFrame, the **drop** method can be used.

In [None]:
snowpark_df_drop_col = snowpark_df_new_col.drop('BEDROOM_RATIO')
snowpark_df_drop_col.show()

To filter (select rows from) a Snowpark DataFrame, **filter** or **where** can be used.

In [None]:
snowpark_df_filtered = snowpark_df_drop_col.filter(F.col('OCEAN_PROXIMITY').in_(['INLAND','ISLAND', 'NEAR BAY']))
snowpark_df_filtered.show()

In [None]:
snowpark_df_filtered.queries

To aggregate the data in a Snowpark DataFrame, **group_by** and **agg** can be used.

In [None]:
# Aggregate data
snowpark_df_agg = snowpark_df_filtered.group_by(['OCEAN_PROXIMITY']).agg([F.avg('HOUSEHOLDS').as_('AVG_HOUSEHOLDS')])
snowpark_df_agg.show()
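What this aggregation computes can be worked through by hand on a few rows. The sketch below uses plain Python and made-up values (not taken from the housing data) to group by category and average within each group.

```python
from collections import defaultdict

# Illustrative rows only: (OCEAN_PROXIMITY, HOUSEHOLDS)
rows = [
    ("INLAND", 100),
    ("NEAR BAY", 300),
    ("INLAND", 200),
    ("NEAR BAY", 500),
    ("ISLAND", 50),
]

# Collect the HOUSEHOLDS values that belong to each category.
groups = defaultdict(list)
for proximity, households in rows:
    groups[proximity].append(households)

# One output row per category, holding the average of its values.
avg_households = {k: sum(v) / len(v) for k, v in groups.items()}
print(avg_households)
```

In Snowpark the same grouping and averaging is expressed as SQL and executed in Snowflake; only the aggregated rows are returned to the client.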

The returned result for a Snowpark DataFrame can be sorted using **sort**

In [None]:
snowpark_df_sorted = snowpark_df_agg.sort(F.col('AVG_HOUSEHOLDS').asc())
snowpark_df_sorted.show()

## Data Preprocessing using scikit-learn

Let's start by getting some basic understanding of our data.

We can use the **describe** method on our **numeric** and **character** columns to get some basic statistics; the count row shows the number of non-null values in each column.

In [None]:
snowpark_df.describe().show()

The output above shows that TOTAL_BEDROOMS has missing values (its count is less than 20640), so we need to handle them before training a model.

Using the schema of a Snowpark DataFrame allows us to easily get the numerical and categorical (character) column names

In [None]:
# Get all numerical columns
numeric_types = [T.DecimalType, T.LongType, T.DoubleType, T.FloatType, T.IntegerType]
numeric_columns = [c.name for c in snowpark_df.schema.fields if type(c.datatype) in numeric_types]
numeric_columns

In [None]:
# Get all categorical columns (columns with character data type)
categorical_types = [T.StringType]
categorical_columns = [c.name for c in snowpark_df.schema.fields if type(c.datatype) in categorical_types]
categorical_columns

Now we will impute the missing values in TOTAL_BEDROOMS using scikit-learn's **SimpleImputer**.

In [None]:
from sklearn.impute import SimpleImputer

# Pull the data back from Snowflake into a Pandas DataFrame; the data is now held in client memory
pandas_df = snowpark_df.to_pandas()

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(pandas_df[['TOTAL_BEDROOMS']])
pandas_df['TOTAL_BEDROOMS'] = imputer.transform(pandas_df[['TOTAL_BEDROOMS']])
pandas_df

Now if we print the count of the TOTAL_BEDROOMS column, we will see the full count of 20640.

In [None]:
print(pandas_df["TOTAL_BEDROOMS"].count())
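What mean imputation does to a single column can be worked through by hand. The sketch below uses plain Python with made-up values (not taken from the housing data): the mean is computed over the observed entries only, and every missing entry is replaced with it.

```python
import math
from statistics import mean

# Illustrative values only; NaN marks a missing entry.
total_bedrooms = [129.0, float("nan"), 190.0, 235.0, float("nan"), 280.0]

# Mean of the observed (non-missing) values.
observed = [v for v in total_bedrooms if not math.isnan(v)]
fill_value = mean(observed)

# Replace every missing entry with that mean.
imputed = [fill_value if math.isnan(v) else v for v in total_bedrooms]

print(fill_value)
print(imputed)
```

After imputation the number of non-missing values equals the number of rows, which is why the count above reaches the full row count.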

Similarly, many machine learning models perform better when the data is normalised before training.

For that we can use scikit-learn's **normalize** function, which by default rescales each row (sample) to unit L2 norm.

In [None]:
from sklearn import preprocessing

df_norm = preprocessing.normalize(pandas_df[["LATITUDE","LONGITUDE","TOTAL_BEDROOMS"]].dropna())
df_norm
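It helps to be explicit about what this call computes. With its default arguments (`norm='l2'`, `axis=1`), `preprocessing.normalize` divides each row by that row's Euclidean length; it does not rescale each column independently, which is what scalers such as `StandardScaler` do. The plain-Python sketch below reproduces the row-wise arithmetic on made-up values.

```python
import math

# Illustrative rows only: (LATITUDE, LONGITUDE, TOTAL_BEDROOMS)
rows = [
    [3.0, 4.0, 0.0],
    [37.88, -122.23, 129.0],
]

def l2_normalize(row):
    """Divide each value by the row's Euclidean length."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row]

normalized = [l2_normalize(r) for r in rows]

for row in normalized:
    # Every normalised row has length 1 (up to floating-point rounding).
    print(row, math.sqrt(sum(v * v for v in row)))
```

Because the scaling is per row, a column with large values (here TOTAL_BEDROOMS) dominates each row's length. Whether that is the intended behaviour depends on the model being trained.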


Everything we did so far with scikit-learn used a Pandas DataFrame and was executed on the local machine. In the next notebook we will see how to run all of this inside Snowflake.

## Data Visualisation

To understand which features are useful for our machine learning models, we can visualise our data set to get a better view of it.
Let's create some basic visualisations of the data we have.

In [None]:
# We will start by creating a pie chart based on the OCEAN_PROXIMITY column.
# First we group by OCEAN_PROXIMITY and sum MEDIAN_HOUSE_VALUE for each distinct value
# We are using pyplot for this visualisation

df_pie = snowpark_df.group_by("OCEAN_PROXIMITY").agg(F.sum('MEDIAN_HOUSE_VALUE').as_('MEDIAN_HOUSE_VALUE')).to_pandas()
df_pie.set_index('OCEAN_PROXIMITY', inplace=True)
df_pie.plot.pie(y='MEDIAN_HOUSE_VALUE', figsize=(8,8))

To analyse the distribution of our continuous variables, we will plot histograms for all of them.

In [None]:
# Plotting histograms for all continuous variables

pd_numeric = snowpark_df.select(numeric_columns).to_pandas()
pd_numeric.hist(bins=30, figsize=(15,15))
plt.show()

Plotting a correlation matrix helps to identify how different features are related to each other.

In [None]:
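# We will use the seaborn library to plot the correlation matrix.
# Only numeric columns are selected, since correlation is defined for numeric data
# (recent pandas versions raise an error on text columns such as OCEAN_PROXIMITY).
sn.heatmap(snowpark_df.select(numeric_columns).to_pandas().corr(), annot=True)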
plt.show()
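Each cell of the heatmap is a Pearson correlation coefficient between two columns (Pearson is the default method of pandas `corr`). The sketch below computes that coefficient by hand in plain Python on made-up values, to show what values near 1 and near -1 mean.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only.
rooms = [2.0, 4.0, 6.0, 8.0]
bedrooms = [1.0, 2.0, 3.0, 4.0]   # rises in step with rooms
age = [40.0, 30.0, 20.0, 10.0]    # falls as rooms rises

print(pearson(rooms, bedrooms))   # perfectly positive relationship
print(pearson(rooms, age))        # perfectly negative relationship
```

Values close to 0 indicate no linear relationship, while pairs of features with coefficients close to 1 or -1 carry largely redundant information.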

In [None]:
session.close()