#### Schema
In PySpark, a schema is the blueprint of a DataFrame: it lists each column's name, its data type, and whether the column may contain null values.
PySpark uses the schema to parse, organize, and process the data correctly.
You can let Spark infer the schema from the data (as with `inferSchema=True`) or define it yourself with `StructType`.

#### Main data types in PySpark
* **float**: Number with a fractional part (floating point).
* **integer**: Whole number.
* **string**: Text.
* **boolean**: True or False.
* **timestamp**: Date and time.


In [0]:
path = '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'
df = spark.read.csv(path, header=True, inferSchema=True)

In [0]:
display(df.limit(2))

In [0]:
# Importing Struct Types
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, IntegerType
from pyspark.sql import functions as F
 
# A StructType is a list of StructFields
schema = StructType([
  # A StructField takes the name of the column, its data type,
  # and a boolean indicating whether null values are accepted.
  StructField("_c0", StringType(), False),
  StructField("carat", DoubleType(), False),
  StructField("cut", StringType(), False),
  StructField("color", StringType(), False),
  StructField("clarity", StringType(), False),
  StructField("depth", DoubleType(), False),
  StructField("table", IntegerType(), False),
  StructField("price", IntegerType(), False),
  StructField("x", DoubleType(), False),
  StructField("y", DoubleType(), False),
  StructField("z", DoubleType(), False)
])

# Reload the DF with the schema applied
# header=True keeps the header row from being read as a data row
df = spark.read.format("csv").option("header", True).schema(schema).load(path)

### Creating Data Frames

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

my_data = [("2023-01-01","PPDQ1","A",366.3),
        ("2023-01-02","PPDQ2","A",389.5),
        ("2023-01-02","PPDQ1","B",289.78),
        ("2023-01-03","PPDQ6","A",367.45),
        ("2023-01-04","PPDQ3","A",766.45),
        ("2023-01-04","PPDQ3","B",703.7),
        ("2023-01-05","PPDQ3","A",426.74)
  ]

schema = StructType([
    StructField("date", StringType(), True),
    StructField("product", StringType(), True),
    StructField("store", StringType(), True),
    StructField("total", DoubleType(), True),
])
 
sales = spark.createDataFrame(data=my_data,schema=schema)

In [0]:
display(sales)

#### Date and Time

In [0]:
# Convert the "date" column from string to date type.
# In the pattern, 'MM' means month; lowercase 'mm' would mean minutes.
sales = (
    sales #dataset
    .select(F.to_date(F.col('date'), 'yyyy-MM-dd').alias('date'),
            'product', 'store', 'total')
)

In [0]:
display(sales.limit(2))

In [0]:
# Extract day of month, week of year, and month from the date
(sales
 .select('date',
         F.dayofmonth('date').alias('day'),
         F.weekofyear('date').alias('week'),
         F.month('date').alias('mth'))
 .display()
)

In [0]:
display(
    sales #dataset
    .withColumn('today', F.current_date()) # current system date
    .withColumn('today_tmstp', F.current_timestamp()) # current timestamp
    .withColumn('date+10', F.date_add(F.col('date'), 10)) # add 10 days to date
    .withColumn('date_difference', F.date_diff(F.col('today'), F.col('date'))) # days between today and date
)