## <mark> df.select().show()

In PySpark, select() is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.

    df.select("firstname","lastname").show()
    df.select(df.firstname,df.lastname)
    df.select(df["firstname"],df["lastname"])

    #By using col() function
    from pyspark.sql.functions import col
    df.select(col("firstname"),col("lastname"))

    #Select columns by regular expression
    df.select(df.colRegex("`^.*name*`"))


    # Select All columns from List
    df.select(*columns)

    # Select All columns
    df.select([c for c in df.columns])
    df.select("*")

    #Selects first 3 columns
    df.select(df.columns[:3])

    #Selects the 3rd and 4th columns (indices 2 and 3; the slice end is exclusive)
    df.select(df.columns[2:4])


#### Select Nested Struct Columns

    df2.select("name")

    +----------------------+
    |name                  |
    +----------------------+
    |[James, Mac, Smith]   |

    df2.select("name.firstname","name.lastname")

    +---------+--------+
    |firstname|lastname|
    +---------+--------+
    |James    |Smith   |
    
    # Select all fields of the StructType column "name"
    df2.select("name.*")

### <mark> df.collect()

    deptDF.collect()        # returns a Python list of Row objects
    deptDF.collect()[0]     # returns the first element of the list (1st row)
    deptDF.collect()[0][0]  # returns the value of the first row & first column
    

PySpark RDD/DataFrame collect() is **an action (not a transformation) that retrieves all the elements of the dataset (from all nodes) to the driver node.** Use collect() only on a **smaller dataset**, usually after filter(), groupBy(), etc. Retrieving a larger dataset can cause an OutOfMemory error on the driver.
    
    dataCollect = deptDF.collect()
    print(dataCollect)

    [Row(dept_name='Finance', dept_id=10), 
    Row(dept_name='Marketing', dept_id=20), 
    Row(dept_name='Sales', dept_id=30), 
    Row(dept_name='IT', dept_id=40)]

Note that collect() is an action, so it **does not return a DataFrame; instead, it returns the data as a list of Row objects** to the driver. Once the data is in a list, you can use a Python for loop to process it further.

    for row in dataCollect:
        print(row['dept_name'] + "," + str(row['dept_id']))
        
    # to return value of First Row, First Column which is "Finance"
    deptDF.collect()[0][0]
    
If you only need certain columns of a DataFrame, call the select() transformation first.

    dataCollect = deptDF.select("dept_name").collect()
    
**select() is a transformation that returns a new DataFrame** holding the selected columns, whereas collect() is an action that returns the entire dataset as a list of Row objects to the driver.

### <mark> withColumn() 
    
withColumn() is a DataFrame transformation used to change the value of a column, convert the datatype of an existing column, create a new column, and more.

It returns a new DataFrame instead of changing the original one.

Change the datatype using withColumn() and cast()

    df.withColumn(
        "salary", 
        col("salary").cast("Integer")
        )

Update The Value of an Existing Column

    df.withColumn("salary", col("salary")*100)
        
Create a Column from an Existing Column

    df.withColumn("new_salary", col("salary")*1.5)
        
Add a New Column

    from pyspark.sql.functions import lit
    df.withColumn("country", lit('USA')) \
    .withColumn("planet", lit('earth'))
    
Drop Column
    
    df.drop("salary")

### <mark> df.withColumnRenamed()
    
Since a DataFrame is an immutable collection, you can’t rename or update a column in place. Instead, withColumnRenamed() **creates a new DataFrame with updated column names and doesn’t modify the current DataFrame.**

    df.withColumnRenamed("dob","DateOfBirth")

    df2 = df.withColumnRenamed("dob","DateOfBirth") \
            .withColumnRenamed("salary","salary_amount") \
            .withColumnRenamed("fname","first_name") \
            .withColumnRenamed("lname","last_name")
        
When the data has a flat structure (no nested columns), use toDF() with the new column names to change all column names at once.
    
    newColumns = ["newCol1","newCol2","newCol3","newCol4"]
    df.toDF(*newColumns)

### <mark> distinct() & dropDuplicates()

PySpark distinct() is used to **drop/remove duplicate rows (comparing all columns)** from a DataFrame, and dropDuplicates() is used to **drop duplicate rows based on selected (one or multiple) columns.** Called without arguments, dropDuplicates() behaves like distinct().

    distinctDF = df.distinct()

    df2 = df.dropDuplicates()

    dropDisDF = df.dropDuplicates(["department","salary"])