## Show
Used for debugging, to check that your DataFrame (DF) has been loaded correctly. **.show(n)** prints the first n rows (20 if n is omitted).

In [None]:
from pyspark.sql import SparkSession

# Create a Spark Session object
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from persons.csv
df = spark.read.load("persons.csv", format="csv", header=True, inferSchema=True)

# Print the first 2 rows
df.show(2)

## Print Schema
Prints the schema of your DataFrame on the standard output.

## Count
Returns the number of rows in the input DF

## Distinct
Removes duplicate rows from the DF. Use it only if you really need it: it is a heavy operation in terms of data sent over the network, because it requires a shuffle phase.

## Select
The select(col1, .., coln) method of the DataFrame class returns a new DataFrame that contains only the specified columns of the input DataFrame. Passing "*" selects all the columns, so use it only to create a copy of the DF.
**PAY ATTENTION:** select generates an error at runtime if a column name is misspelled or does not exist.

In [None]:
# Create a DataFrame from persons2.csv
df = spark.read.load("persons2.csv", format="csv", header=True, inferSchema=True)

dfNamesAges = df.select("name", "age")

## SelectExpr
The selectExpr(expression1, .., expressionN) method of the DataFrame class is a variant of the select method in which each argument can be an SQL expression.

In [None]:
# Create a DataFrame from persons2.csv
df = spark.read.load("persons2.csv", format="csv", header=True, inferSchema=True)

# Create a new DataFrame with four columns:
# name, age, gender, newAge = age +1
dfNewAge = df.selectExpr("name", "age", "gender", "age+1 as newAge")

## Filter
The filter(conditionExpr) method of the DataFrame class returns a new DataFrame that contains only the rows satisfying the specified condition. In Scala and Java there are also Datasets in addition to DataFrames: with them we can use lambda functions, so that errors are caught at compile time and not only at runtime.

In [None]:
# Create a Spark Session object
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from persons.csv
df = spark.read.load("persons.csv", format="csv", header=True, inferSchema=True)

# Keep only the rows with age between 20 and 31 (both included)
df_filtered = df.filter("age>=20 and age<=31")

## Where
Equivalent to filter: where(conditionExpr) is an alias of filter(conditionExpr).

## Join
The join(right, on, how) method of the DataFrame class is used to join two DataFrames. It returns a DataFrame that contains the join of the tuples of the two input DataFrames, based on the join condition given in on; how selects the type of join.

**Types of join** (values of the how parameter):

- inner (default)
- cross
- outer
- full
- full_outer
- left
- left_outer
- right
- right_outer
- left_semi
- left_anti

**NOTE:** this method can generate errors at runtime if there are errors in the join expression.

### **Example 1**
Two DFs:
- uid, name, age
- uid, sportname

Join the content of the two DataFrames (uid is the join column) and show it on the standard output.

In [None]:
# Read persons_id.csv and store it in a DataFrame
dfPersons = spark.read.load("persons_id.csv",format="csv",header=True,inferSchema=True)

# Read liked_sports.csv and store it in a DataFrame
dfUidSports = spark.read.load("liked_sports.csv",format="csv",header=True,inferSchema=True)

# Join the two input DataFrames, note that the first param is the 'right' df
dfPersonLikes = dfPersons.join(dfUidSports,dfPersons.uid == dfUidSports.uid)

# Print the result on the standard output
dfPersonLikes.show()

### **Example 2**
Two DFs:
- uid, name, age
- uid, bannedmotivation

Select the profiles of the non-banned users and show them on the standard output. A left anti join keeps only the rows of the left DataFrame that have no match in the right DataFrame for the join condition.

In [None]:
# Read persons_id.csv and store it in a DataFrame
dfPersons = spark.read.load("persons_id.csv",format="csv",header=True,inferSchema=True)

# Read banned.csv and store it in a DataFrame
dfBannedUsers = spark.read.load("banned.csv",format="csv",header=True,inferSchema=True)

# Apply the Left Anti Join on the two input DataFrames
dfSelectedProfiles = dfPersons.join(dfBannedUsers, dfPersons.uid == dfBannedUsers.uid, "left_anti")

# Print the result on the standard output
dfSelectedProfiles.show()