# Task 1 – Big Data Analysis using PySpark
Dataset: NYC Taxi Trips ([Kaggle Link](https://www.kaggle.com/datasets/nyc-tlc/trip-record-data))

This notebook uses PySpark to load a sample of NYC taxi trip records and summarize it by passenger count, payment type, trip distance, and pickup hour.

In [None]:
# Step 1: PySpark Setup
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NYC Taxi Analysis').getOrCreate()

In [None]:
# Step 2: Load Dataset
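# Download a sample CSV from Kaggle manually (e.g., yellow_tripdata_2022-01.csv)
# and update the path below to match the downloaded file name.
# header=True uses the first row as column names; inferSchema=True has Spark scan the data to guess column types.
df = spark.read.csv("sample_taxi_data.csv", header=True, inferSchema=True)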
df.printSchema()
df.show(5)
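
`inferSchema=True` makes Spark pass over the file an extra time to guess column types, which can be slow on a large trip file. One possible workaround is to infer the schema from a fraction of the rows and reuse it for the real load. This is only a sketch: the helper name `load_with_sampled_schema` and the 10% ratio are illustrative assumptions, not part of the original notebook. `samplingRatio` is an option of PySpark's CSV reader.

```python
def load_with_sampled_schema(spark, path, sampling_ratio=0.1):
    """Infer the CSV schema from a fraction of rows, then load with that schema.

    Hypothetical helper: `spark` is an existing SparkSession and `path`
    points at a CSV file with a header row.
    """
    # First read: infer column types from roughly `sampling_ratio` of the rows.
    inferred = spark.read.csv(
        path, header=True, inferSchema=True, samplingRatio=sampling_ratio
    ).schema
    # Second read: apply the fixed schema, so no further inference is needed.
    return spark.read.csv(path, header=True, schema=inferred)


# Usage in this notebook (assumes `spark` from Step 1):
# df = load_with_sampled_schema(spark, "sample_taxi_data.csv")
```

A small sample can settle on a type that is too narrow for the full file, so it is worth checking `df.printSchema()` and looking for unexpected nulls after loading this way.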

In [None]:
# Step 3: Basic Exploration
# Trips per passenger count, most common first
df.groupBy('passenger_count').count().orderBy('count', ascending=False).show()
# Trips per payment type code
df.groupBy('payment_type').count().show()
# Count, mean, stddev, min, quartiles, and max of trip distance
df.select('trip_distance').summary().show()

In [None]:
# Step 4: Time-based Insights
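from pyspark.sql.functions import hour, to_timestamp
# Parse the pickup time as a timestamp and keep only its hour of day (0-23)
df = df.withColumn('pickup_hour', hour(to_timestamp('tpep_pickup_datetime')))
# Trips per pickup hour, in chronological order
df.groupBy('pickup_hour').count().orderBy('pickup_hour').show()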
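
The hourly counts above can also be ranked to surface the busiest pickup hours. A minimal sketch, assuming `df` already has the `pickup_hour` column from Step 4; `busiest_hours` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
def busiest_hours(df, n=3):
    """Return the n pickup hours with the most trips as (hour, trips) tuples.

    Hypothetical helper; expects a Spark DataFrame with a `pickup_hour` column.
    """
    top = (
        df.groupBy("pickup_hour")
        .count()
        .orderBy("count", ascending=False)
        .limit(n)  # keep only n rows so collect() stays small
        .collect()
    )
    # Use item access: `row.count` would resolve to the tuple method, not the column.
    return [(row["pickup_hour"], row["count"]) for row in top]


# Usage (assumes `df` from Step 4):
# for hour_of_day, trips in busiest_hours(df, n=3):
#     print(hour_of_day, trips)
```

Limiting before `collect()` keeps the ranking on the cluster and brings only a handful of rows back to the driver.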

### ✅ Summary:
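- Loaded the taxi trip CSV into a Spark DataFrame with an inferred schema
- Counted trips by passenger count and payment type, and summarized trip distance
- Derived the pickup hour from `tpep_pickup_datetime` and counted trips per hour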

You can now extend this by visualizing sampled data in Pandas/Matplotlib or saving filtered outputs.
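
Both extensions mentioned above can be sketched as follows. The sampling fraction, row cap, filter condition, and output path are all assumptions chosen for illustration, and `sample_to_pandas` and `save_filtered` are hypothetical helper names. `toPandas()` needs pandas installed and pulls every remaining row onto the driver, which is why the sample is capped first.

```python
def sample_to_pandas(df, fraction=0.01, seed=42, max_rows=100_000):
    """Draw a random sample from a Spark DataFrame and return it as pandas.

    Hypothetical helper; `fraction` and `max_rows` are illustrative defaults.
    """
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    # Cap the row count so the result fits comfortably in driver memory.
    return sampled.limit(max_rows).toPandas()


def save_filtered(df, output_path, condition="trip_distance > 0"):
    """Filter rows with a SQL expression and write the result as Parquet.

    Hypothetical helper; the default condition is an assumed example filter.
    """
    filtered = df.filter(condition)
    # mode("overwrite") replaces any existing output at the same path.
    filtered.write.mode("overwrite").parquet(output_path)
    return filtered


# Usage (assumes `df` from the steps above and matplotlib installed):
# import matplotlib.pyplot as plt
# pdf = sample_to_pandas(df, fraction=0.01)
# pdf["trip_distance"].plot.hist(bins=50)
# plt.xlabel("trip distance")
# plt.show()
# save_filtered(df, "filtered_trips.parquet")
```

Parquet is used here because it keeps column types with the data, so a later `spark.read.parquet(...)` does not need schema inference.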