# Task 1: **Company Count by Sector in California**

1. **Import SparkSession** – `SparkSession` is the entry point to Spark's DataFrame and SQL APIs.
2. **Create a Spark session** – `SparkSession.builder.appName(...).getOrCreate()` returns the active session if one already exists, or starts a new one named `"FINC612_Task1_CA_SectorCounts"`.
3. **Set the file path** – `data_path` points to where your `sp500_constituents.json` file is stored (here, Colab’s `/content` folder).
4. **Read the JSON file** – `spark.read.json(data_path)` loads the S&P 500 data into a Spark DataFrame called `sp500_df`.
5. **Create a SQL view** – `createOrReplaceTempView("sp500")` registers the DataFrame as a temporary view named `sp500`, so you can run SQL queries on it for the lifetime of the session.


In [None]:
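from pyspark.sql import SparkSession

# Reuse the active Spark session if there is one; otherwise start a new one.
spark = SparkSession.builder.appName("FINC612_Task1_CA_SectorCounts").getOrCreate()

# Absolute path to the file in Colab's /content folder (forward slashes, no backslash escapes).
data_path = "/content/sp500_constituents.json"
sp500_df = spark.read.json(data_path)

# Register the DataFrame as a temporary view so it can be queried with SQL.
sp500_df.createOrReplaceTempView("sp500")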
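
By default, `spark.read.json` expects JSON Lines input: one complete JSON object per line. A file that holds a single pretty-printed array or object is usually read with the `multiLine` option instead. The layout of `sp500_constituents.json` is not stated here, so the following is only a minimal standard-library sketch of a check you could run before loading. The helper name `looks_like_json_lines` and its `max_lines` parameter are hypothetical and not part of Spark.

```python
import json


def looks_like_json_lines(path, max_lines=5):
    """Return True if the first non-empty lines each parse as a standalone JSON object."""
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A line such as "[" or "{" from a pretty-printed file fails here.
                return False
            if not isinstance(record, dict):
                return False
            checked += 1
            if checked >= max_lines:
                break
    # An empty file has no records, so it does not count as JSON Lines.
    return checked > 0
```

If the check returns `False` for your file, reading it with `spark.read.option("multiLine", True).json(data_path)` is the usual alternative.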

1. **Write the SQL query** –

   * `SELECT sector, COUNT(*) AS company_count` chooses the sector column and counts how many companies are in each sector.
   * `FROM sp500` tells Spark to use the temporary table you created earlier.
   * `WHERE state IN ('CA', 'California')` filters the data so only companies from California are included.
   * `GROUP BY sector` groups the rows by sector so the count is calculated per sector.
   * `ORDER BY company_count DESC, sector` sorts the results, showing the sectors with the most companies first, and then sorting by sector name if counts are equal.

2. **Run the query** – `spark.sql(sector_counts_sql)` executes the SQL and stores the results in `sector_counts`.

3. **Display the results** – `sector_counts.show(truncate=False)` prints the result in the notebook (up to 20 rows by default) without truncating long cell values.


In [None]:
sector_counts_sql = """
SELECT
  sector,
  COUNT(*) AS company_count
FROM sp500
WHERE state IN ('CA', 'California')
GROUP BY sector
ORDER BY company_count DESC, sector
"""

sector_counts = spark.sql(sector_counts_sql)

sector_counts.show(truncate=False)

+----------------------+-------------+
|sector                |company_count|
+----------------------+-------------+
|Technology            |31           |
|Healthcare            |9            |
|Communication Services|7            |
|Real Estate           |6            |
|Consumer Cyclical     |5            |
|Financial Services    |4            |
|Utilities             |3            |
|Consumer Defensive    |2            |
|Energy                |1            |
+----------------------+-------------+
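
To see how the clauses interact, the same query can be traced on a handful of made-up rows. This sketch uses Python's built-in `sqlite3` module rather than Spark, purely so that it runs without a Spark installation. The symbols and rows are invented for illustration and are not taken from `sp500_constituents.json`.

```python
import sqlite3

# Hypothetical sample rows (symbol, sector, state); not real S&P 500 data.
rows = [
    ("AAA", "Technology", "CA"),
    ("BBB", "Technology", "California"),
    ("CCC", "Healthcare", "CA"),
    ("DDD", "Energy", "CA"),
    ("EEE", "Technology", "NY"),
    ("FFF", "Energy", "TX"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sp500 (symbol TEXT, sector TEXT, state TEXT)")
conn.executemany("INSERT INTO sp500 VALUES (?, ?, ?)", rows)

# Same query text as in the Spark cell, run here by SQLite.
query = """
SELECT
  sector,
  COUNT(*) AS company_count
FROM sp500
WHERE state IN ('CA', 'California')
GROUP BY sector
ORDER BY company_count DESC, sector
"""

result = conn.execute(query).fetchall()
conn.close()

print(result)
# [('Technology', 2), ('Energy', 1), ('Healthcare', 1)]
```

The `NY` and `TX` rows are dropped by the `WHERE` clause before grouping, both spellings of California are counted together, and the tie between Energy and Healthcare is broken alphabetically by the second `ORDER BY` key.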



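For a more readable display, convert the Spark result to a pandas DataFrame with `toPandas()`. This collects every row to the driver, which is fine for a small result like this one.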




In [None]:
sector_counts_pd = sector_counts.toPandas()
sector_counts_pd

                   sector  company_count
0              Technology             31
1              Healthcare              9
2  Communication Services              7
3             Real Estate              6
4       Consumer Cyclical              5
5      Financial Services              4
6               Utilities              3
7      Consumer Defensive              2
8                  Energy              1
