# DEVELOPMENT LOG

This file collects information sources and tricks.

## Create smaller test data

Use this script: data/helper/create_test_data.py.
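For reference, a minimal sketch of what such a down-sampling script might do (the input/output paths and the 1% fraction are illustrative assumptions, not taken from the actual script):

```python
# Hypothetical sketch: sample a fraction of a large CSV to create smaller test data.
# Paths and the sampling fraction are assumptions.
import pandas as pd

df = pd.read_csv("data/raw/full_dataset.csv")
sample = df.sample(frac=0.01, random_state=42)  # reproducible 1% sample
sample.to_csv("data/test/test_dataset.csv", index=False)
```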

## Open Issues and Research

### Using Spark and Airflow on AWS

Sources:
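As a starting point, a minimal sketch of an Airflow DAG that submits a Spark job on a schedule (the application path, connection id, and schedule are assumptions):

```python
# Hypothetical sketch: an Airflow DAG that submits a Spark job once a day.
# The application path, conn_id, and schedule are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_etl",
        application="jobs/etl_job.py",  # assumed path to the Spark application
        conn_id="spark_default",
    )
```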

### Merging dimension tables from different sources

- Use a join (inner or left?)
- Maybe use fuzzy matching
- Maybe keep the different dimensions and introduce a mapping table between them? (see the sketch below)
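A minimal sketch of the mapping-table idea in PySpark (all table and column names are assumptions):

```python
# Hypothetical sketch: reconcile two dimension tables via an explicit mapping table.
# All table and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dim_a = spark.table("dim_country_source_a")   # columns: a_id, country_name
dim_b = spark.table("dim_country_source_b")   # columns: b_id, country_name
mapping = spark.table("dim_country_mapping")  # columns: a_id, b_id

# A left join keeps every row from source A even when no match exists in source B.
merged = (
    dim_a
    .join(mapping, on="a_id", how="left")
    .join(dim_b, on="b_id", how="left")
)
```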

Sources:

## Knowledge

### AWS

### EDA

### Geo-spatial analysis

### Time-series

### Profiling / Make pandas quicker

### SQL

Good article about window functions: link
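For quick reference, a minimal window-function example in PySpark (the dataframe and the `country` column are assumptions; `trade_usd` is borrowed from the filter example below):

```python
# Hypothetical sketch: rank rows within a partition using a window function.
# The dataframe and column names are illustrative assumptions.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("country").orderBy(F.col("trade_usd").desc())

# Rank each trade within its country by value, then keep the top 3 per country.
top_trades = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") <= 3)
)
```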

## Tricks

### Spark

- Remove null values from a dataframe:

  ```python
  df.where(df.FirstName.isNotNull())

  # or, for string columns, also drop empty strings
  df.where(df.FirstName != '')
  ```

  Alternative syntax (`col` comes from `pyspark.sql.functions`):

  ```python
  from pyspark.sql.functions import col

  df.where(col("FirstName").isNotNull())
  ```
- Multiple filter conditions:

  ```python
  # AND
  df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

  # OR
  at_least_one_factual_values = df.filter(
      df['trade_usd'].isNotNull() | df['weight_kg'].isNotNull() | df['quantity'].isNotNull()
  )
  ```
- Get the list of column names from a PySpark dataframe:

  ```python
  col_names = df.columns
  ```
- Check whether a column contains a value:

  ```python
  df.filter(df.name.contains('o')).collect()
  ```
- Remove duplicates:

  ```python
  df.dropDuplicates()

  # or based on specific column(s) only
  df.dropDuplicates(['col_name_a'])
  df.dropDuplicates(['col_name_a', 'col_name_b'])
  ```