Skip to content

Latest commit

 

History

History
605 lines (546 loc) · 10.5 KB

pyspark.sql.rst

File metadata and controls

605 lines (546 loc) · 10.5 KB

Spark SQL

Core Classes

pyspark.sql

SparkSession DataFrame Column Row GroupedData PandasCogroupedOps DataFrameNaFunctions DataFrameStatFunctions Window

Spark Session APIs

pyspark.sql

The entry point to programming Spark with the Dataset and DataFrame API. To create a Spark session, you should use SparkSession.builder attribute. See also SparkSession.

SparkSession.builder.appName SparkSession.builder.config SparkSession.builder.enableHiveSupport SparkSession.builder.getOrCreate SparkSession.builder.master SparkSession.catalog SparkSession.conf SparkSession.createDataFrame SparkSession.getActiveSession SparkSession.newSession SparkSession.range SparkSession.read SparkSession.readStream SparkSession.sparkContext SparkSession.sql SparkSession.stop SparkSession.streams SparkSession.table SparkSession.udf SparkSession.version

Configuration

pyspark.sql.conf

RuntimeConfig

Input and Output

pyspark.sql

DataFrameReader.csv DataFrameReader.format DataFrameReader.jdbc DataFrameReader.json DataFrameReader.load DataFrameReader.option DataFrameReader.options DataFrameReader.orc DataFrameReader.parquet DataFrameReader.schema DataFrameReader.table DataFrameWriter.bucketBy DataFrameWriter.csv DataFrameWriter.format DataFrameWriter.insertInto DataFrameWriter.jdbc DataFrameWriter.json DataFrameWriter.mode DataFrameWriter.option DataFrameWriter.options DataFrameWriter.orc DataFrameWriter.parquet DataFrameWriter.partitionBy DataFrameWriter.save DataFrameWriter.saveAsTable DataFrameWriter.sortBy DataFrameWriter.text

DataFrame APIs

pyspark.sql

DataFrame.agg DataFrame.alias DataFrame.approxQuantile DataFrame.cache DataFrame.checkpoint DataFrame.coalesce DataFrame.colRegex DataFrame.collect DataFrame.columns DataFrame.corr DataFrame.count DataFrame.cov DataFrame.createGlobalTempView DataFrame.createOrReplaceGlobalTempView DataFrame.createOrReplaceTempView DataFrame.createTempView DataFrame.crossJoin DataFrame.crosstab DataFrame.cube DataFrame.describe DataFrame.distinct DataFrame.drop DataFrame.dropDuplicates DataFrame.drop_duplicates DataFrame.dropna DataFrame.dtypes DataFrame.exceptAll DataFrame.explain DataFrame.fillna DataFrame.filter DataFrame.first DataFrame.foreach DataFrame.foreachPartition DataFrame.freqItems DataFrame.groupBy DataFrame.head DataFrame.hint DataFrame.inputFiles DataFrame.intersect DataFrame.intersectAll DataFrame.isLocal DataFrame.isStreaming DataFrame.join DataFrame.limit DataFrame.localCheckpoint DataFrame.mapInPandas DataFrame.na DataFrame.orderBy DataFrame.persist DataFrame.printSchema DataFrame.randomSplit DataFrame.rdd DataFrame.registerTempTable DataFrame.repartition DataFrame.repartitionByRange DataFrame.replace DataFrame.rollup DataFrame.sameSemantics DataFrame.sample DataFrame.sampleBy DataFrame.schema DataFrame.select DataFrame.selectExpr DataFrame.semanticHash DataFrame.show DataFrame.sort DataFrame.sortWithinPartitions DataFrame.stat DataFrame.storageLevel DataFrame.subtract DataFrame.summary DataFrame.tail DataFrame.take DataFrame.toDF DataFrame.toJSON DataFrame.toLocalIterator DataFrame.toPandas DataFrame.transform DataFrame.union DataFrame.unionAll DataFrame.unionByName DataFrame.unpersist DataFrame.where DataFrame.withColumn DataFrame.withColumnRenamed DataFrame.withWatermark DataFrame.write DataFrame.writeStream DataFrame.writeTo DataFrame.to_pandas_on_spark DataFrameNaFunctions.drop DataFrameNaFunctions.fill DataFrameNaFunctions.replace DataFrameStatFunctions.approxQuantile DataFrameStatFunctions.corr DataFrameStatFunctions.cov DataFrameStatFunctions.crosstab DataFrameStatFunctions.freqItems DataFrameStatFunctions.sampleBy

Column APIs

pyspark.sql

Column.alias Column.asc Column.asc_nulls_first Column.asc_nulls_last Column.astype Column.between Column.bitwiseAND Column.bitwiseOR Column.bitwiseXOR Column.cast Column.contains Column.desc Column.desc_nulls_first Column.desc_nulls_last Column.dropFields Column.endswith Column.eqNullSafe Column.getField Column.getItem Column.isNotNull Column.isNull Column.isin Column.like Column.name Column.otherwise Column.over Column.rlike Column.startswith Column.substr Column.when Column.withField

Data Types

pyspark.sql.types

ArrayType BinaryType BooleanType ByteType DataType DateType DecimalType DoubleType FloatType IntegerType LongType MapType NullType ShortType StringType StructField StructType TimestampType

Row

pyspark.sql

Row.asDict

Functions

pyspark.sql.functions

abs acos acosh add_months aggregate approxCountDistinct approx_count_distinct array array_contains array_distinct array_except array_intersect array_join array_max array_min array_position array_remove array_repeat array_sort array_union arrays_overlap arrays_zip asc asc_nulls_first asc_nulls_last ascii asin asinh assert_true atan atanh atan2 avg base64 bin bitwise_not bitwiseNOT broadcast bround bucket cbrt ceil coalesce col collect_list collect_set column concat concat_ws conv corr cos cosh count count_distinct countDistinct covar_pop covar_samp crc32 create_map cume_dist current_date current_timestamp date_add date_format date_sub date_trunc datediff dayofmonth dayofweek dayofyear days decode degrees dense_rank desc desc_nulls_first desc_nulls_last element_at encode exists exp explode explode_outer expm1 expr factorial filter first flatten floor forall format_number format_string from_csv from_json from_unixtime from_utc_timestamp get_json_object greatest grouping grouping_id hash hex hour hours hypot initcap input_file_name instr isnan isnull json_tuple kurtosis lag last last_day lead least length levenshtein lit locate log log10 log1p log2 lower lpad ltrim map_concat map_entries map_filter map_from_arrays map_from_entries map_keys map_values map_zip_with max md5 mean min minute monotonically_increasing_id month months months_between nanvl next_day nth_value ntile overlay pandas_udf percent_rank percentile_approx posexplode posexplode_outer pow product quarter radians raise_error rand randn rank regexp_extract regexp_replace repeat reverse rint round row_number rpad rtrim schema_of_csv schema_of_json second sentences sequence sha1 sha2 shiftleft shiftright shiftrightunsigned shuffle signum sin sinh size skewness slice sort_array soundex spark_partition_id split sqrt stddev stddev_pop stddev_samp struct substring substring_index sum sum_distinct sumDistinct tan tanh timestamp_seconds toDegrees toRadians to_csv to_date to_json to_timestamp to_utc_timestamp transform transform_keys transform_values translate trim trunc udf unbase64 unhex unix_timestamp upper var_pop var_samp variance weekofyear when window xxhash64 year years zip_with

pyspark.sql.avro.functions

from_avro to_avro

Window

pyspark.sql

Window.currentRow Window.orderBy Window.partitionBy Window.rangeBetween Window.rowsBetween Window.unboundedFollowing Window.unboundedPreceding WindowSpec.orderBy WindowSpec.partitionBy WindowSpec.rangeBetween WindowSpec.rowsBetween

Grouping

pyspark.sql

GroupedData.agg GroupedData.apply GroupedData.applyInPandas GroupedData.avg GroupedData.cogroup GroupedData.count GroupedData.max GroupedData.mean GroupedData.min GroupedData.pivot GroupedData.sum PandasCogroupedOps.applyInPandas