## Dataframe Basics - JSON File

---

In [None]:
#Import our SparkSession so we can use it
from pyspark.sql import SparkSession

In [None]:
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName('basics').getOrCreate()

In [None]:
# Let's read in some data to play with
data = spark.read.json('data/food.json')
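By default, `spark.read.json` expects JSON Lines input: one complete JSON object per line, not a single top-level array. A file spanning multiple lines per record needs the `multiLine` option. The sketch below writes a small file in that layout using only the standard library. The records and the temporary path are hypothetical stand-ins, since the actual contents of `data/food.json` are not shown in this notebook.

```python
import json
import os
import tempfile

# Hypothetical records; the real contents of data/food.json are not shown here.
records = [
    {"food": "pizza", "price": 8},
    {"food": "salad", "price": 5},
    {"food": "soup"},  # a key missing from a record is read as null by Spark
]

# Write one JSON object per line (JSON Lines) to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "food.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back as plain text to confirm the layout.
with open(path) as f:
    lines = f.read().splitlines()

for line in lines:
    print(line)
```

A file written this way could then be passed to `spark.read.json(path)` in place of `'data/food.json'`.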

In [None]:
# Let's show the data
data.show()

In [None]:
#Print schema
data.printSchema()

In [None]:
#Show the columns
data.columns

In [None]:
# Describe our data (describe() returns a new DataFrame, so call show() to see the summary)
data.describe().show()

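When you are working with a CSV file, Spark can infer the schema if you ask it to. Spark also infers a schema when reading JSON, as `printSchema()` showed above, but the inferred types are not always the ones we want (whole numbers are read as `long`, for example). In this example we set the schema manually, which we can do like this.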

In [None]:
#Import Struct Fields that we can use
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

In [None]:
#Next we need to create the list of Struct Fields
schema = [StructField("price", IntegerType(), True), StructField("food", StringType(), True)]
schema

In [None]:
#Pass in our fields
final = StructType(fields=schema)
final

In [None]:
#Read our data with our new schema
df = spark.read.json('data/food.json', schema=final)
df

In [None]:
#Print it out
df.printSchema()
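When writing a schema by hand, it can help to first check which keys and value types actually occur in the file. The sketch below does this with the standard library only. The sample text is a hypothetical stand-in for `data/food.json`, and the helper names (`sample`, `types_by_key`, `summary`) are illustrative, not part of the original notebook.

```python
import json
from collections import defaultdict

# Hypothetical JSON Lines content standing in for data/food.json.
sample = (
    '{"food": "pizza", "price": 8}\n'
    '{"food": "salad", "price": 5}\n'
    '{"food": "soup"}\n'
)

# Collect the Python type names seen for each key across all records.
types_by_key = defaultdict(set)
for line in sample.splitlines():
    for key, value in json.loads(line).items():
        types_by_key[key].add(type(value).__name__)

summary = {key: sorted(names) for key, names in types_by_key.items()}
print(summary)
```

A key that only ever holds `str` values is a natural fit for `StringType()`, and one that only holds `int` values for `IntegerType()` (or `LongType()` for large numbers). A key missing from some records is a reason to leave the field nullable, which is what the `True` argument to `StructField` does above.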

---

## Accessing data

In [None]:
df['price']

In [None]:
type(df['price'])

In [None]:
df.select('price')

In [None]:
type(df.select('price'))

In [None]:
df.select('price').show()

---

## Manipulating Columns

In [None]:
#Add new columns
df.withColumn('newprice', df['price']).show()

In [None]:
#Rename an existing column
df.withColumnRenamed('price','newerprice').show()

In [None]:
#Double the price
df.withColumn('doubleprice', df['price']*2).show()

In [None]:
#Add a dollar to the price
df.withColumn('add_one_dollar',df['price']+1).show()

In [None]:
#Halve the price (division returns a double, even for an integer column)
df.withColumn('half_price', df['price']/2).show()

In [None]:
#Collect a column as a list of Row objects
df.select('price').collect()

---

## Converting PySpark Dataframe to Pandas Dataframe

In [None]:
#toPandas() collects every row to the driver, so use it only on data that fits in memory
import pandas as pd
pandas_df = df.toPandas()

In [None]:
pandas_df.head()