<a href="https://colab.research.google.com/github/chinnuanna123/spark/blob/main/local_mode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install pyspark




Creating a Spark Session

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkCreation").getOrCreate()
print("Created Spark Session")


Created Spark Session


Creating and Displaying a DataFrame

In [6]:
from pyspark.sql import SparkSession

# Create a Spark session in local mode; local[*] uses as many worker
# threads as there are logical cores on the machine.
# getOrCreate() reuses the session from the previous cell if it is still active.
spark = SparkSession.builder.master("local[*]").appName("LocalModeExample").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Display the DataFrame
df.show()

# Keep the session running for the cells below;
# uncomment the next line to stop it when you are finished.
#spark.stop()


+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



Checking the Spark Version


In [11]:
spark.version


'3.5.5'

Writing a CSV File Using Pandas



In [7]:
import pandas as pd

# Create a sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
df = pd.DataFrame(data)

# Save to CSV (in Colab session)
df.to_csv("sample.csv", index=False)

# Verify the file was created
!ls


sample.csv  sample_data


Reading a CSV File

In [28]:
df = spark.read.csv("sample.csv", header=True, inferSchema=True)
df.show()
df.printSchema()


+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
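
A CSV file stores only text, so without inferSchema=True Spark reads every column as a string; the option is what makes Age come back as an integer above. The sketch below illustrates the idea with Python's standard csv module instead of Spark. The csv_text value is assumed to mirror sample.csv, and infer_int is a hypothetical helper, far simpler than Spark's actual schema inference.

```python
import csv
import io

# Assumed to match the contents of sample.csv written earlier.
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# Without any inference, every field comes back as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'Name': 'Alice', 'Age': '25'}


def infer_int(value):
    """Hypothetical helper: convert to int when possible, else keep the string."""
    try:
        return int(value)
    except ValueError:
        return value


# A hand-rolled stand-in for schema inference: try each value as an integer.
typed_rows = [{key: infer_int(value) for key, value in row.items()} for row in rows]
print(typed_rows[0])  # {'Name': 'Alice', 'Age': 25}
```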



Add a New Column

In [32]:
df.withColumn("new_age", df.Age + 5).show()


+-------+---+-------+
|   Name|Age|new_age|
+-------+---+-------+
|  Alice| 25|     30|
|    Bob| 30|     35|
|Charlie| 35|     40|
+-------+---+-------+



Select Specific Columns

In [29]:
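# Column names are resolved case-insensitively by default
# (spark.sql.caseSensitive is false), so "name" matches the "Name" column.
df.select("name").show()
df.select("name", "age").show()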


+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



Filtering and Selecting Data

In [10]:
df.filter(df.Age > 28).select("Name").show()

+-------+
|   Name|
+-------+
|    Bob|
|Charlie|
+-------+



Creating RDDs
Parallelizing a Collection

In [18]:
from pyspark.sql import SparkSession

# getOrCreate() returns the Spark session that is already running
spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Create an RDD from a Python list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Show RDD contents
print(rdd.collect())


[1, 2, 3, 4, 5]


RDD Operations
Transformations

In [19]:
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = numbers.map(lambda x: x ** 2)
print(squared.collect())  # Output: [1, 4, 9, 16, 25]


[1, 4, 9, 16, 25]
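
map() is a transformation: Spark only records it, and the work runs when an action such as collect() is called. The sketch below imitates that behavior with Python's built-in lazy map, without using Spark. It is an analogy rather than a picture of Spark's internals, and the names square, calls, and calls_before are illustrative.

```python
# Plain-Python stand-in for an RDD pipeline; no Spark is involved here.
calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x ** 2


numbers = [1, 2, 3, 4, 5]

# Like rdd.map(...): this only describes the work, nothing is computed yet.
lazy_squared = map(square, numbers)
calls_before = list(calls)

# Like rdd.collect(): consuming the pipeline triggers the computation.
squared = list(lazy_squared)

print(calls_before)  # []
print(squared)       # [1, 4, 9, 16, 25]
```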


In [20]:
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
even_numbers = numbers.filter(lambda x: x % 2 == 0)
print(even_numbers.collect())

[2, 4, 6, 8, 10]


In [22]:
sentences = spark.sparkContext.parallelize(["Hello World PySpark is fun flatMap is useful"])
words = sentences.flatMap(lambda sentence: sentence.split(" "))
print(words.collect())

['Hello', 'World', 'PySpark', 'is', 'fun', 'flatMap', 'is', 'useful']
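
flatMap() differs from map() in that it flattens the lists returned for each element into one sequence. A plain-Python sketch of the difference follows, using two short sentences chosen for illustration; split_words is a hypothetical helper, and list comprehensions stand in for the RDD methods.

```python
from itertools import chain

sentences = ["Hello World", "PySpark is fun"]


def split_words(sentence):
    return sentence.split(" ")


# map-like behavior: one output element per input element (a list of lists).
mapped = [split_words(s) for s in sentences]

# flatMap-like behavior: the per-sentence lists are flattened into one list.
flat_mapped = list(chain.from_iterable(split_words(s) for s in sentences))

print(mapped)       # [['Hello', 'World'], ['PySpark', 'is', 'fun']]
print(flat_mapped)  # ['Hello', 'World', 'PySpark', 'is', 'fun']
```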


In [23]:
# Create an RDD with key-value pairs (category, amount)
transactions = spark.sparkContext.parallelize([
    ("electronics", 1000),
    ("clothing", 500),
    ("electronics", 1200),
    ("clothing", 700),
    ("grocery", 200)
])
# Use reduceByKey to sum values by category
category_totals = transactions.reduceByKey(lambda a, b: a + b)
# Collect and print results
print(category_totals.collect())


[('clothing', 1200), ('electronics', 2200), ('grocery', 200)]
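
What reduceByKey() computes can be sketched in plain Python: the values of each key are folded together with the supplied function. The reduce_by_key helper below is hypothetical and runs on a single machine. Spark merges values within and across partitions, which is why the function should be associative and commutative, and why the order of the collected pairs is not guaranteed.

```python
def reduce_by_key(pairs, func):
    """Hypothetical single-machine stand-in for RDD.reduceByKey (not a Spark API)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            # Fold the new value into the running result for this key.
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return totals


transactions = [
    ("electronics", 1000),
    ("clothing", 500),
    ("electronics", 1200),
    ("clothing", 700),
    ("grocery", 200),
]

category_totals = reduce_by_key(transactions, lambda a, b: a + b)

# Sort for a stable display order.
print(sorted(category_totals.items()))
# [('clothing', 1200), ('electronics', 2200), ('grocery', 200)]
```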


Word Count Using reduceByKey()

In [25]:
# Create an RDD of words
words = spark.sparkContext.parallelize(["apple", "banana", "apple", "orange", "banana", "apple"])
# Convert words into (word, 1) pairs
word_pairs = words.map(lambda word: (word, 1))
# Use reduceByKey to count occurrences
word_counts = word_pairs.reduceByKey(lambda a, b: a + b)
# Collect and print results
print(word_counts.collect())

[('apple', 3), ('banana', 2), ('orange', 1)]
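
For a small input like this one, the counts can be cross-checked locally with collections.Counter from the standard library. This is only a sanity check on data that fits in memory, not a replacement for the distributed version.

```python
from collections import Counter

words = ["apple", "banana", "apple", "orange", "banana", "apple"]

# Count occurrences locally and sort for a stable display order.
expected_counts = sorted(Counter(words).items())
print(expected_counts)  # [('apple', 3), ('banana', 2), ('orange', 1)]
```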


Run SQL Queries on a DataFrame

In [9]:
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 25").show()


+-------+---+
|   Name|Age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+



Sort a DataFrame

In [10]:
df.orderBy("Age").show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

