- The code is Python and uses the PySpark library for data manipulation and analysis.
- `mpg` is a PySpark DataFrame, a table-like data structure, that contains information about car mileage.
- `mpg.select(...)` returns a new DataFrame containing only the specified columns. In this case, it selects the "hwy" column.
- `mpg.hwy.alias("highway_mileage")` refers to the "hwy" column and gives it the name "highway_mileage" in the result. `alias()` does not add a column or modify `mpg`; it only changes the column's name in the DataFrame returned by `select()`.
- `.show(5)` prints the resulting DataFrame. The argument `5` limits the output to the first 5 rows.
- Overall, this code selects the "hwy" column from the `mpg` DataFrame under the name "highway_mileage" and displays the first 5 rows of the result.

### Create a spark data frame that contains your favorite programming languages.


- This code creates a SparkSession, which is the entry point to interact with Spark.
- The data is defined as a list of tuples, where each tuple contains a single value representing a programming language.
- `spark.createDataFrame(data, ["language"])` creates a DataFrame named `df` with a single column named "language".
- `df.printSchema()` displays the schema of the DataFrame, which shows the column name and its data type.
- The shape of the DataFrame is calculated using `df.count()` to get the number of rows and `len(df.columns)` to get the number of columns.
- The first 5 records in the DataFrame are shown using `df.show(5)`.

In [None]:
# Import the necessary libraries
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Define the data
data = [("Python",), ("Java",), ("JavaScript",), ("C++",), ("Ruby",)]

# Create a DataFrame with a single column named "language"
df = spark.createDataFrame(data, ["language"])

# View the schema of the DataFrame
df.printSchema()

# Output the shape of the DataFrame
shape = (df.count(), len(df.columns))
print("Shape of the DataFrame:", shape)

# Show the first 5 records in the DataFrame
df.show(5)

### Load the mpg dataset as a spark dataframe.

In [None]:
# Import the column functions used below as F
from pyspark.sql import functions as F

# Load the mpg dataset as a DataFrame
mpg = spark.read.csv("mpg.csv", header=True, inferSchema=True)

# Create a new column named "output" with the desired message
mpg = mpg.withColumn("output", F.concat(F.lit("The "), mpg.year, F.lit(" "), mpg.manufacturer, F.lit(" "), mpg.model, F.lit(" has a "), mpg.cyl, F.lit(" cylinder engine.")))

# Transform the "trans" column to only contain "manual" or "auto"
mpg = mpg.withColumn("trans", F.when(F.col("trans").like("%manual%"), "manual").otherwise("auto"))

- The code assumes that the "mpg.csv" file is present in the current directory.
- `spark.read.csv("mpg.csv", header=True, inferSchema=True)` loads the dataset as a DataFrame named `mpg`. `header=True` indicates that the first row of the CSV file contains column names, and `inferSchema=True` tells Spark to infer the data type of each column.
- `mpg.withColumn("output", ...)` adds a new column named "output" to the `mpg` DataFrame. The message is built with `concat()` and `lit()` from the `pyspark.sql.functions` module, which join literal strings with the values of the existing columns.
- `mpg.withColumn("trans", ...)` replaces the "trans" column so that it contains only "manual" or "auto". `when()` from `pyspark.sql.functions` checks whether the value contains "manual", and the `otherwise()` method supplies "auto" for every other row.
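The two column expressions can be spot-checked without starting Spark by mirroring them in plain Python. This is a minimal sketch: the sample rows are illustrative stand-ins rather than values read from `mpg.csv`, and `describe` and `simplify_trans` are hypothetical helper names introduced only here.

```python
# Plain-Python mirror of the two Spark column expressions, for spot-checking.
# The rows are illustrative stand-ins, not values read from mpg.csv.
sample_rows = [
    {"year": 1999, "manufacturer": "audi", "model": "a4", "cyl": 4, "trans": "auto(l5)"},
    {"year": 2008, "manufacturer": "audi", "model": "a4", "cyl": 4, "trans": "manual(m6)"},
]


def describe(row):
    """Build the same sentence that F.concat assembles for the "output" column."""
    return (
        f"The {row['year']} {row['manufacturer']} {row['model']} "
        f"has a {row['cyl']} cylinder engine."
    )


def simplify_trans(trans):
    """Mirror F.when(col.like("%manual%"), "manual").otherwise("auto")."""
    return "manual" if "manual" in trans else "auto"


outputs = [describe(row) for row in sample_rows]
simplified = [simplify_trans(row["trans"]) for row in sample_rows]

print(outputs[0])
print(simplified)
```

One difference from the sketch: in Spark, any "trans" value that does not match the pattern, including a null, falls through to "auto", and `concat()` returns null for a row if any of its inputs is null.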

### Load the tips dataset as a spark dataframe.

In [None]:
# Load the tips dataset as a DataFrame
tips = spark.read.csv("tips.csv", header=True, inferSchema=True)

# Calculate the percentage of observations that are smokers
smokers_percentage = (tips.filter(F.col("smoker") == "Yes").count() / tips.count()) * 100
print("Percentage of observations that are smokers:", smokers_percentage)

# Create a column named "tip_percentage"
tips = tips.withColumn("tip_percentage", (F.col("tip") / F.col("total_bill")) * 100)

# Calculate the average tip percentage for each combination of sex and smoker
average_tip_percentage = tips.groupby("sex", "smoker").agg(F.avg("tip_percentage").alias("avg_tip_percentage"))
average_tip_percentage.show()

- The code assumes that the "tips.csv" file is present in the current directory.
- `spark.read.csv("tips.csv", header=True, inferSchema=True)` loads the dataset as a DataFrame named `tips`.
- The percentage of observations that are smokers is calculated by filtering the DataFrame for rows where the "smoker" column is "Yes", counting those rows, dividing by the total number of rows, and multiplying by 100.
- `tips.withColumn("tip_percentage", ...)` adds a new column named "tip_percentage" to the `tips` DataFrame. The column is calculated by dividing the "tip" column by the "total_bill" column and multiplying by 100.
- The average tip percentage for each combination of "sex" and "smoker" is calculated by grouping the DataFrame with its `groupby()` method and aggregating with `agg()`, using `avg()` from the `pyspark.sql.functions` module on the "tip_percentage" column. The resulting DataFrame is displayed using `show()`.
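The arithmetic behind these steps can be checked by hand on a few rows. The sketch below uses plain Python and hypothetical rows standing in for `tips.csv` (only the columns used here), so the numbers are illustrative, not results from the real dataset.

```python
from collections import defaultdict

# Hypothetical rows standing in for tips.csv; only the columns used here.
sample_tips = [
    {"total_bill": 20.0, "tip": 5.0, "sex": "Female", "smoker": "No"},
    {"total_bill": 16.0, "tip": 2.0, "sex": "Male", "smoker": "Yes"},
    {"total_bill": 40.0, "tip": 10.0, "sex": "Male", "smoker": "Yes"},
    {"total_bill": 8.0, "tip": 2.0, "sex": "Female", "smoker": "No"},
]

# Share of rows where smoker == "Yes", as a percentage.
smoker_count = sum(1 for row in sample_tips if row["smoker"] == "Yes")
smokers_pct = smoker_count / len(sample_tips) * 100

# Per-row tip percentage, grouped by (sex, smoker), then averaged per group.
groups = defaultdict(list)
for row in sample_tips:
    tip_pct = row["tip"] / row["total_bill"] * 100
    groups[(row["sex"], row["smoker"])].append(tip_pct)
avg_tip_pct = {key: sum(vals) / len(vals) for key, vals in groups.items()}

print(smokers_pct)
print(avg_tip_pct)
```

Note that this averages the per-row percentages, which is what `F.avg("tip_percentage")` does. It is not the same as dividing a group's total tips by its total bills.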

### Use the seattle weather dataset referenced in the lesson to answer the questions below.

In [None]:
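# Load the seattle weather dataset as a DataFrame
weather = spark.read.csv("seattle_weather.csv", header=True, inferSchema=True)

# Derive the calendar columns used by the queries below.
# Assumption: the CSV has a "date" column and no precomputed month/year/quarter columns.
weather = (
    weather.withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("quarter", F.quarter("date"))
)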

# Convert the temperatures to Fahrenheit
weather = weather.withColumn("temp_max_fahrenheit", (weather["temp_max"] * 9/5) + 32)
weather = weather.withColumn("temp_min_fahrenheit", (weather["temp_min"] * 9/5) + 32)

# Which month has the most rain, on average?
most_rain_month = weather.groupby("month").agg(F.avg("precipitation").alias("avg_precipitation")).orderBy(F.desc("avg_precipitation")).first()
print("Month with the most rain, on average:", most_rain_month["month"])

# Which year was the windiest?
windiest_year = weather.groupby("year").agg(F.sum("wind").alias("total_wind")).orderBy(F.desc("total_wind")).first()
print("Year with the highest total wind:", windiest_year["year"])

# What is the most frequent type of weather in January?
january_weather = weather.filter(F.col("month") == 1)
most_frequent_weather = january_weather.groupby("weather").count().orderBy(F.desc("count")).first()
print("Most frequent type of weather in January:", most_frequent_weather["weather"])

# What is the average high and low temperature on sunny days in July in 2013 and 2014?
july_sunny_days = weather.filter((F.col("month") == 7) & (F.col("weather") == "sun"))
july_sunny_days_2013_2014 = july_sunny_days.filter((F.col("year") == 2013) | (F.col("year") == 2014))
average_high_low_temp = july_sunny_days_2013_2014.agg(F.avg("temp_max_fahrenheit").alias("avg_temp_max"), F.avg("temp_min_fahrenheit").alias("avg_temp_min")).first()
print("Average high temperature on sunny days in July in 2013 and 2014:", average_high_low_temp["avg_temp_max"])
print("Average low temperature on sunny days in July in 2013 and 2014:", average_high_low_temp["avg_temp_min"])

# What percentage of days were rainy in Q3 of 2015?
q3_2015_rainy_days = weather.filter((F.col("year") == 2015) & (F.col("quarter") == 3) & (F.col("weather") == "rain"))
rainy_days_percentage = (q3_2015_rainy_days.count() / weather.filter((F.col("year") == 2015) & (F.col("quarter") == 3)).count()) * 100
print("Percentage of rainy days in Q3 of 2015:", rainy_days_percentage)

# For each year, find what percentage of days it rained (had non-zero precipitation)
rainy_days_percentage_by_year = weather.groupby("year").agg(((F.sum(F.when(F.col("precipitation") > 0, 1).otherwise(0)) / F.count("*")) * 100).alias("rainy_days_percentage"))
rainy_days_percentage_by_year.show()

- The code assumes that the "seattle_weather.csv" file is present in the current directory.
- `spark.read.csv("seattle_weather.csv", header=True, inferSchema=True)` loads the dataset as a DataFrame named `weather`.
- The temperatures are converted to Fahrenheit by adding two new columns computed as `(temp * 9/5) + 32`.
- The month with the most rain on average is found by grouping the DataFrame by "month" with `groupby()` and computing the average precipitation with `agg()` and `F.avg()`. The result is sorted by average precipitation in descending order, and the first row gives the month.
- The windiest year is found by grouping the DataFrame by "year" and summing the "wind" column. The result is sorted by total wind in descending order, and the first row gives the year. Because this compares totals, a year with more recorded days has a slight advantage over a year with fewer.
- The most frequent type of weather in January is found by filtering for rows where "month" is 1, grouping by "weather", and counting the rows in each group. The result is sorted by count in descending order, and the first row gives the weather type.
- The average high and low temperatures on sunny days in July of 2013 and 2014 are found by filtering for rows where "month" is 7 and "weather" is "sun", then keeping only the years 2013 and 2014. The averages are computed with `F.avg()` over both years combined and printed.
- The percentage of rainy days in Q3 of 2015 is found by filtering for rows where "year" is 2015, "quarter" is 3, and "weather" is "rain". That count is divided by the count of all days in Q3 of 2015 and multiplied by 100.
- For each year, the percentage of days with non-zero precipitation is found by grouping by "year", using `F.sum()` over a `F.when(...).otherwise(0)` flag to count the days with precipitation, dividing by `F.count("*")`, and multiplying by 100. The resulting DataFrame is displayed using `show()`.
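The conversion formula and the per-year percentage can be verified on a handful of records before trusting the Spark output. The following is a plain-Python sketch under stated assumptions: the records are hypothetical stand-ins for rows of `seattle_weather.csv`, the source temperatures are taken to be in Celsius, and `celsius_to_fahrenheit` and `rainy_share_by_year` are illustrative helper names.

```python
# Hypothetical daily records standing in for seattle_weather.csv rows.
sample_days = [
    {"year": 2015, "temp_max": 0.0, "precipitation": 0.0},
    {"year": 2015, "temp_max": 25.0, "precipitation": 2.5},
    {"year": 2016, "temp_max": 100.0, "precipitation": 1.0},
    {"year": 2016, "temp_max": 10.0, "precipitation": 4.0},
]


def celsius_to_fahrenheit(celsius):
    """Same formula as the Spark expression: (temp * 9/5) + 32."""
    return celsius * 9 / 5 + 32


def rainy_share_by_year(days):
    """Percentage of days per year with precipitation greater than zero."""
    totals, rainy = {}, {}
    for day in days:
        year = day["year"]
        totals[year] = totals.get(year, 0) + 1
        rainy[year] = rainy.get(year, 0) + (1 if day["precipitation"] > 0 else 0)
    return {year: rainy[year] / totals[year] * 100 for year in totals}


highs_f = [celsius_to_fahrenheit(day["temp_max"]) for day in sample_days]
rainy_pct = rainy_share_by_year(sample_days)

print(highs_f)
print(rainy_pct)
```

The per-year function follows the same pattern as the grouped aggregation: a 0/1 flag per day is summed and divided by the number of days in the group.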