#### PySpark select() in Azure Databricks

###### Gentle reminder:
In Databricks,
  - the SparkSession is available as spark
  - the SparkContext is available as sc

If you want to create them manually, use the code below.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

##### a) Create a PySpark DataFrame manually

In [0]:
data = [
    (1,"Sascha","1998-09-03"),
    (2,"Lise","2008-09-17"),
    (3,"Nola","2008-08-23"),
    (4,"Demetra","1997-06-02"),
    (5,"Lowrance","2006-07-02")
]

df = spark.createDataFrame(data, schema=["id","name","dob"])
df.printSchema()
df.show()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- dob: string (nullable = true)

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+



##### b) Create a PySpark DataFrame by reading a file

In [0]:
# Before running this cell, set file_path to the location of the source file you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

##### Note: The examples below use the manually created DataFrame (df)

##### 1. Selecting a single column

In [0]:
from pyspark.sql.functions import col

# 1. Selecting a column by its name as a string
df.select("name").show()

# 2. Selecting a column using Python dot notation (attribute access)
df.select(df.name).show()

# 3. Selecting a column using the column name as a key
df.select(df["name"]).show()

# 4. Selecting a column using the col() function
df.select(col("name")).show()

+--------+
|    name|
+--------+
|  Sascha|
|    Lise|
|    Nola|
| Demetra|
|Lowrance|
+--------+

+--------+
|    name|
+--------+
|  Sascha|
|    Lise|
|    Nola|
| Demetra|
|Lowrance|
+--------+

+--------+
|    name|
+--------+
|  Sascha|
|    Lise|
|    Nola|
| Demetra|
|Lowrance|
+--------+

+--------+
|    name|
+--------+
|  Sascha|
|    Lise|
|    Nola|
| Demetra|
|Lowrance|
+--------+



##### 2. Selecting multiple columns

In [0]:
from pyspark.sql.functions import col

# 1. Selecting columns by their names as strings
df.select("id", "name").show()

# 2. Selecting columns using Python dot notation (attribute access)
df.select(df.id, df.name).show()

# 3. Selecting columns using the column names as keys
df.select(df["id"], df["name"]).show()

# 4. Selecting columns using the col() function
df.select(col("id"), col("name")).show()

+---+--------+
| id|    name|
+---+--------+
|  1|  Sascha|
|  2|    Lise|
|  3|    Nola|
|  4| Demetra|
|  5|Lowrance|
+---+--------+

+---+--------+
| id|    name|
+---+--------+
|  1|  Sascha|
|  2|    Lise|
|  3|    Nola|
|  4| Demetra|
|  5|Lowrance|
+---+--------+

+---+--------+
| id|    name|
+---+--------+
|  1|  Sascha|
|  2|    Lise|
|  3|    Nola|
|  4| Demetra|
|  5|Lowrance|
+---+--------+

+---+--------+
| id|    name|
+---+--------+
|  1|  Sascha|
|  2|    Lise|
|  3|    Nola|
|  4| Demetra|
|  5|Lowrance|
+---+--------+



##### 3. Selecting all columns

In [0]:
# 1. Selecting all columns using "*" symbol
df.select("*").show()

# 2. Selecting all columns by passing a list of column names
df.select(["id", "name", "dob"]).show()

# 3. Selecting all columns using the DataFrame's columns attribute
df.select(df.columns).show()

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+



##### 4. Selecting columns by index

In [0]:
# 1. Selecting the first column
df.select(df.columns[0]).show()

# 2. Selecting every column from the second one onward
df.select(df.columns[1:]).show()

# 3. Selecting every second column, starting from the first
df.select(df.columns[::2]).show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

+--------+----------+
|    name|       dob|
+--------+----------+
|  Sascha|1998-09-03|
|    Lise|2008-09-17|
|    Nola|2008-08-23|
| Demetra|1997-06-02|
|Lowrance|2006-07-02|
+--------+----------+

+---+----------+
| id|       dob|
+---+----------+
|  1|1998-09-03|
|  2|2008-09-17|
|  3|2008-08-23|
|  4|1997-06-02|
|  5|2006-07-02|
+---+----------+
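
The index expressions above are ordinary Python list operations, because df.columns is a plain list of column-name strings. The sketch below runs without a Spark session: the list named columns is a hypothetical stand-in for df.columns of the manually created DataFrame, and the three variable names are illustrative only. It shows what each expression hands to select().

```python
# Hypothetical stand-in for df.columns of the manually created DataFrame
# (an assumption for illustration; in the notebook the list comes from df).
columns = ["id", "name", "dob"]

first_column = columns[0]    # a single string, not a list
from_second = columns[1:]    # every name from index 1 onward
every_second = columns[::2]  # names at indices 0, 2, ...

print(first_column)   # id
print(from_second)    # ['name', 'dob']
print(every_second)   # ['id', 'dob']
```

Note that indexing with a single integer yields one column name, while a slice always yields a list, even when it contains only one name. select() accepts both forms.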



##### 5. Selecting columns in reverse order

In [0]:
df.select(df.columns[::-1]).show()

+----------+--------+---+
|       dob|    name| id|
+----------+--------+---+
|1998-09-03|  Sascha|  1|
|2008-09-17|    Lise|  2|
|2008-08-23|    Nola|  3|
|1997-06-02| Demetra|  4|
|2006-07-02|Lowrance|  5|
+----------+--------+---+
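
The reversal comes from the slice itself: a step of -1 builds a new list in the opposite order and leaves the original list untouched. A minimal sketch, again assuming a plain list named columns as a hypothetical stand-in for df.columns so that it runs without Spark:

```python
# Hypothetical stand-in for df.columns (an assumption for illustration).
columns = ["id", "name", "dob"]

# A step of -1 walks the list from the last element to the first
# and returns a new list.
reversed_columns = columns[::-1]

print(reversed_columns)  # ['dob', 'name', 'id']
print(columns)           # ['id', 'name', 'dob'] (original order is unchanged)
```

Because the slice returns a new list, the column order of df itself is not affected. Only the DataFrame returned by select() carries the reversed order.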

