*For a local installation, create a conda virtual environment.*
- `conda create --name snowpark-de-ml -c https://repo.anaconda.com/pkgs/snowflake python=3.8` # Keep Python 3.8: at the time of writing, the Snowpark package did not work with later versions.

- `conda activate snowpark-de-ml`

Install Snowpark Python and the other libraries in the conda environment:

- `conda install -c https://repo.anaconda.com/pkgs/snowflake snowflake-snowpark-python pandas notebook scikit-learn cachetools`

Now select `snowpark-de-ml` as the Python interpreter (kernel) for your notebook.
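
Before moving on, it can help to confirm that the notebook is really running inside the new environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of Snowpark.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the active interpreter can locate the module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "snowflake") is itself missing.
        return False


# The interpreter path should point inside the snowpark-de-ml environment.
print("Interpreter:", sys.executable)
print("Python version: {}.{}".format(*sys.version_info[:2]))
print("Snowpark importable:", is_installed("snowflake.snowpark"))
```

If the interpreter path points elsewhere or Snowpark is reported as not importable, the notebook is probably still using a different kernel.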

## Import Libraries

In [16]:
# Import all the necessary packages
from snowflake.snowpark.session import Session
from snowflake.snowpark.version import VERSION
from snowflake.snowpark.functions import month, year, col, sum
import json
import logging
logger = logging.getLogger("snowflake.snowpark.session")
logger.setLevel(logging.ERROR)

## Establish Connection to Snowflake

I keep my credentials in a JSON file (loaded below as `auth.json`; substitute your own file name) that holds the connection parameters.
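
The contents of that file are not shown here. As a sketch, a Snowpark connection file typically holds keys such as `account`, `user`, `password`, `role`, `warehouse`, `database`, and `schema`. Every value below is a placeholder, and the temporary file `auth_example.json` is a hypothetical stand-in for your real credentials file.

```python
import json
import os
import tempfile

# Placeholder values only; replace them with your own account details.
example_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_user>",
    "password": "<your_password>",
    "role": "<your_role>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}

# Write the structure to a throwaway file and read it back, mirroring how
# the notebook loads its credentials with json.load.
path = os.path.join(tempfile.mkdtemp(), "auth_example.json")
with open(path, "w") as f:
    json.dump(example_parameters, f, indent=2)

with open(path) as f:
    connection_parameters = json.load(f)

print(sorted(connection_parameters))
```

Keeping the real file outside version control avoids leaking the password along with the notebook.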

In [3]:
# Use a context manager so the credentials file is closed after reading
with open('auth.json') as f:
    connection_parameters = json.load(f)
session = Session.builder.configs(connection_parameters).create()
session.sql_simplifier_enabled = True
snowflake_environment = session.sql('select current_user(), current_version()').collect()
snowpark_version = VERSION

In [4]:
# Current Environment Details
print('User: {}'.format(snowflake_environment[0][0]))
print('Warehouse: {}'.format(session.get_current_warehouse()))
print('Snowflake version: {}'.format(snowflake_environment[0][1]))

User: AGUPTA07
Warehouse: "COMPUTE_WH"
Snowflake version: 7.23.1


## Read the table from your Snowflake environment


### Load and read the data with a Snowpark DataFrame

In [5]:
dataset = session.table('google_keywords_search_dataset.datafeeds.google_keywords')
dataset.queries

{'queries': ['SELECT  *  FROM (google_keywords_search_dataset.datafeeds.google_keywords)'],
 'post_actions': []}

In [7]:
dataset.show(50)

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"COUNTRY"  |"KEYWORD"                        |"SITE"                    |"YEAR"  |"MONTH"  |"DAY"  |"PLATFORM"  |"REFERRAL_TYPE"  |"CLEAN_LANDINGPAGE"                                 |"CALIBRATED_USERS"  |"CALIBRATED_CLICKS"  |"IS_BRANDED_KEYWORD"  |"IS_QUESTION"  |"DATE"    |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|36         |japanese name converter          |nolanlawson.com           |22      |06       |02     |desktop     |ORGANIC          |japanesenameconverter.nolanlaws

In [10]:
# As in a SQL worksheet, you can set the current database and schema
session.use_database('workforce_data_analytics')
session.use_schema('public')

In [14]:
dataset2 = session.table('REVELIO_INDIVIDUAL_POSITION')
dataset2.show(20)

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"USER_ID"  |"POSITION_ID"         |"COMPANY_RAW"                               |"COMPANY_LINKEDIN_URL"                              |"COMPANY_CLEANED"                           |"REGION"                  |"COUNTRY"      |"STATE"           |"MSA"                                   |"STARTDATE"  |"ENDDATE"   |"JOBTITLE_RAW"                    |"MAPPED_ROLE"                 |"JOB_CATEGORY"  |"SENIORITY"  |"SALARY"    |"RN"  |"RCID"

Now let's check the current environment details.

In [11]:
print('User :{}'.format(snowflake_environment[0][0]))
print('Role :{}'.format(session.get_current_role()))
print('Warehouse :{}'.format(session.get_current_warehouse()))
print('Database :{}'.format(session.get_current_database()))
print('Schema :{}'.format(session.get_current_schema()))

User :AGUPTA07
Role :"ACCOUNTADMIN"
Warehouse :"COMPUTE_WH"
Database :"WORKFORCE_DATA_ANALYTICS"
Schema :"PUBLIC"


### Load and read the data with a SQL query

In [12]:
dataset3 = session.sql('select user_id, company_cleaned, region, country, job_category, salary from workforce_data_analytics.public.revelio_individual_position')
dataset3.show(20)

-----------------------------------------------------------------------------------------------------------------------------------
|"USER_ID"  |"COMPANY_CLEANED"                           |"REGION"                  |"COUNTRY"      |"JOB_CATEGORY"  |"SALARY"    |
-----------------------------------------------------------------------------------------------------------------------------------
|8953745    |synchrony                                   |Northern America          |United States  |Finance         |34108.966   |
|8953745    |universal orlando resort                    |Northern America          |United States  |Sales           |33845.617   |
|89538111   |credipass                                   |Southern Europe           |Italy          |Sales           |98148.392   |
|89538117   |dla piper llp us                            |Northern America          |United States  |Admin           |54068.393   |
|89538117   |suncoke energy                              |Northern America  

## Data Transformation

### Find Total Salary Spend per year and month for all companies

In [23]:
dataset2.withColumn("STARTDATE", col("STARTDATE").cast("date"))

<snowflake.snowpark.dataframe.DataFrame at 0x215a433abb0>

In [25]:
print(dataset2.schema['STARTDATE'].datatype)

StringType()


In [19]:
spend_per_company = (
    dataset2.group_by(year('startdate'), month('startdate'), 'company_cleaned')
    .agg(sum('salary').as_('total_salary'))
    .with_column_renamed('"year(startdate)"', "year")
    .with_column_renamed('"month(date)"', "month")
    .sort('year', 'month')
)

SnowparkSQLException: (1304): 01adaed3-0001-32de-0002-1e320007108a: 002016 (22000): SQL compilation error:
Function EXTRACT does not support VARCHAR(512) argument type
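
The error indicates that `year` and `month` (which, per the message, Snowflake evaluates through `EXTRACT`) received `STARTDATE` as a `VARCHAR`. Snowpark DataFrames are immutable, so `withColumn` returns a new DataFrame rather than changing `dataset2` in place, which would explain why the schema check above still reports `StringType()`. The following is a sketch of one possible fix, not necessarily the approach the author takes next. It assumes the `STARTDATE` strings are in a format Snowflake can cast to `DATE`, and the helper name `total_salary_by_month` is hypothetical.

```python
def total_salary_by_month(df):
    """Total salary per start year, start month and company (sketch)."""
    # Imported lazily so the helper can be defined before Snowpark is loaded.
    from snowflake.snowpark.functions import col, month, sum as sum_, year
    from snowflake.snowpark.types import DateType

    # Assign the result: with_column returns a new DataFrame.
    typed = df.with_column("STARTDATE", col("STARTDATE").cast(DateType()))

    # Derive explicit YEAR and MONTH columns so no renaming is needed later.
    typed = typed.with_column("YEAR", year(col("STARTDATE"))).with_column(
        "MONTH", month(col("STARTDATE"))
    )

    return (
        typed.group_by("YEAR", "MONTH", "COMPANY_CLEANED")
        .agg(sum_(col("SALARY")).alias("TOTAL_SALARY"))
        .sort("YEAR", "MONTH")
    )


# Usage with the DataFrame from the cells above (requires a live session):
# spend_per_company = total_salary_by_month(dataset2)
# spend_per_company.show()
```

Importing `sum` under the alias `sum_` also avoids shadowing Python's built-in `sum`, and creating the `YEAR` and `MONTH` columns explicitly sidesteps any guesswork about the auto-generated column names.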