
In [None]:
# Uncomment the next line to install PySpark outside Databricks (for example, in Colab)
#!pip install pyspark

Hands-On Tutorial: PySpark DataFrame and Basic EDA
1. Setting up the PySpark Environment

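First, ensure that PySpark is installed and configured in your environment. In Databricks, the environment is pre-configured and a SparkSession is already available as 'spark', so getOrCreate() below simply returns the existing session.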

In [None]:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark DataFrame and EDA Tutorial") \
    .getOrCreate()


2. Creating a DataFrame with Random Data

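We'll create a DataFrame with two columns: 'id' and 'value'. The 'id' column is a unique identifier generated by spark.range(), and 'value' is a random number. Because rand() draws uniformly from [0, 1), multiplying by 10000 gives values in [0, 10000). Fixing the seed keeps the results repeatable across runs on the same setup.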

In [None]:
import pyspark.sql.functions as F

# Number of rows in the DataFrame
num_rows = 10000

# Create a DataFrame with random data
df = spark.range(0, num_rows).withColumn('value', F.rand(seed=10) * 10000)
df.show(5)


+---+-----------------+
| id|            value|
+---+-----------------+
|  0|1709.497137955568|
|  1|8051.143958005459|
|  2|5775.925576589018|
|  3|9476.047869880924|
|  4|   2093.704977577|
+---+-----------------+
only showing top 5 rows



3. Basic Exploratory Data Analysis (EDA)

Now, let's perform some basic EDA on the DataFrame.

3.1. Basic Descriptive Statistics

Get summary statistics for the 'value' column. describe() reports the count, mean, sample standard deviation, minimum, and maximum.

In [None]:
df.describe('value').show()

+-------+------------------+
|summary|             value|
+-------+------------------+
|  count|             10000|
|   mean| 4996.345729236676|
| stddev|2876.5823304479145|
|    min|0.6312299744748451|
|    max| 9997.842389944872|
+-------+------------------+
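
As a quick sanity check on the summary above, the observed mean and standard deviation can be compared with the closed-form values for a uniform distribution. This is a minimal sketch in plain Python, assuming 'value' is uniform on [0, 10000) as implied by rand() * 10000. The variable names are illustrative and not part of the original notebook.

```python
import math

# Assumption: 'value' = rand() * 10000, i.e. uniform on [low, high).
low, high = 0.0, 10000.0

# Closed-form mean and standard deviation of a continuous uniform distribution.
expected_mean = (low + high) / 2
expected_stddev = (high - low) / math.sqrt(12)

# Values copied from the describe() output above.
observed_mean = 4996.345729236676
observed_stddev = 2876.5823304479145

# Relative deviation of the sample statistics from the theoretical ones.
mean_error = abs(observed_mean - expected_mean) / expected_mean
stddev_error = abs(observed_stddev - expected_stddev) / expected_stddev

print(f"expected mean:   {expected_mean:.2f}")
print(f"expected stddev: {expected_stddev:.2f}")
print(f"mean off by   {mean_error:.2%}")
print(f"stddev off by {stddev_error:.2%}")
```

The theoretical mean is 5000 and the theoretical standard deviation is about 2886.75, so both observed values land well within one percent of what a uniform sample should produce.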



3.2. Count and Distinct Count

Count the total number of rows and distinct 'id' values.

In [None]:
total_rows = df.count()
distinct_ids = df.select('id').distinct().count()

print(f"Total Rows: {total_rows}, Distinct IDs: {distinct_ids}")

Total Rows: 10000, Distinct IDs: 10000


3.3. Minimum, Maximum, and Average

Calculate the minimum, maximum, and average of the 'value' column.

In [None]:
df.agg(F.min('value'), F.max('value'), F.avg('value')).show()


+------------------+-----------------+-----------------+
|        min(value)|       max(value)|       avg(value)|
+------------------+-----------------+-----------------+
|0.6312299744748451|9997.842389944872|4996.345729236676|
+------------------+-----------------+-----------------+



3.4. Simple Data Visualization (Optional)

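If you're using Databricks, display() renders the DataFrame as an interactive table from which you can plot a histogram or a bar chart of the 'value' column. In other environments, such as a plain Jupyter or Colab session, display() only prints the DataFrame's schema summary, as in the output below.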

In [None]:
display(df.select('value'))


DataFrame[value: double]

4. Closing the Spark Session

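Finally, close the Spark session when you're done. This applies to local and Colab runs; in Databricks the session is managed by the platform, so calling spark.stop() there is generally unnecessary.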

In [None]:
spark.stop()