# Spark DataFrame Basics I

<p>Note: When a Databricks notebook is exported to .ipynb, the table output loses its formatting. If you run this notebook on a Databricks cluster, the output is displayed as a properly formatted table.</p>

<p>For example, the following exported output:</p>
<p>+----+-------+ age| name| +----+-------+ null|Michael| 30| Andy| 19| Justin| +----+-------+</p>
<p>is actually displayed as:</p>
<pre>+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+</pre>

### Import SparkSession

In [2]:
from pyspark.sql import SparkSession

### Create Spark session

In [3]:
spark = SparkSession.builder.appName('Basics').getOrCreate()

### Load JSON data

In [4]:
df = spark.read.json('/FileStore/tables/people.json')

### Display the content of the DataFrame

In [5]:
df.show()

### Print the schema in a tree format

In [6]:
df.printSchema()

### Return the columns of the DataFrame

In [7]:
df.columns

### Display descriptive statistics of the data

In [8]:
df.describe().show()

In [9]:
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

### Define a StructField for each column

In [10]:
# StructField(name, dataType, nullable): True marks the column as nullable
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]

### Build the StructType from the fields

In [11]:
final_struc = StructType(fields=data_schema)

### Load data with an explicit schema

In [12]:
df = spark.read.json('/FileStore/tables/people.json', schema=final_struc)

In [13]:
df.printSchema()

### Select a column to be returned

In [14]:
df.select('age').show()

In [15]:
# head(2) returns a list with the first two Row objects; [0] selects the first one
df.head(2)[0]

### Select columns to be returned

In [16]:
df.select(['age', 'name']).show()

### Create new column

In [17]:
df.withColumn('double_age', df['age'] * 2).show()

### Rename column

In [18]:
df.withColumnRenamed('age', 'my_new_age').show()

### Create temporary view

In [19]:
df.createOrReplaceTempView('people')

### Use SQL commands to query the data through the temporary view

In [20]:
results = spark.sql("SELECT * FROM people")
results.show()

In [21]:
new_results = spark.sql("SELECT * FROM people WHERE age=30")
new_results.show()