# Task
Demonstrate Hadoop and Spark in Google Colab by installing necessary dependencies, configuring the environment, initializing a Spark session, loading and processing a sample dataset, and extracting meaningful results.

## Install dependencies

### Subtask:
Install necessary libraries for PySpark and findspark in Colab.


**Reasoning**:
The subtask requires installing pyspark and findspark libraries. A single code block can be used to install both libraries using pip.



In [1]:
%pip install pyspark findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


## Configure environment

### Subtask:
Set up the environment variables for Spark and Java.


**Reasoning**:
Set up the environment variables for Spark and Java, and initialize findspark.



In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/usr/local/lib/python3.10/dist-packages/pyspark"
import findspark
findspark.init()

Exception: Unable to find py4j in /usr/local/lib/python3.10/dist-packages/pyspark/python, your SPARK_HOME may not be configured correctly

**Reasoning**:
The previous attempt failed because the `SPARK_HOME` environment variable was set incorrectly. The error message indicates that `py4j` could not be found in the specified path. I need to correct the `SPARK_HOME` path to the actual Spark installation directory, not the pyspark library directory. I will install Spark first and then set the `SPARK_HOME` environment variable to the correct path. I will also remove the unnecessary `findspark.init()` call as it is likely to fail again if the environment variable is not set correctly.
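
A candidate `SPARK_HOME` can be checked before calling `findspark.init()`. Judging from the error message, findspark expects a `py4j-*.zip` archive under `$SPARK_HOME/python/lib`, which is the layout of a Spark binary distribution. The sketch below tests a directory for that layout. The helper name `has_py4j` is hypothetical, and the layout check is an assumption inferred from the error, not a documented findspark API.

```python
import glob
import os


def has_py4j(spark_home):
    """Return True if spark_home contains python/lib/py4j-*.zip.

    Assumption: this mirrors the layout of a Spark binary distribution,
    which is what the error message above complains about.
    """
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*.zip")
    return len(glob.glob(pattern)) > 0


# Candidate paths taken from this notebook; either may be absent on your machine.
for candidate in (
    "/usr/local/lib/python3.10/dist-packages/pyspark",
    "/content/spark-3.5.1-bin-hadoop3",
):
    print(candidate, "->", has_py4j(candidate))
```

A path that prints `False` here would be expected to trigger the same exception.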



In [3]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xf spark-3.5.1-bin-hadoop3.tgz
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:4 https://cli.github.com/packages stable InRelease
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:10 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [80.2 kB]
Get:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64

**Reasoning**:
The previous command failed to extract the Spark archive because the file was not found. I need to correct the wget command to download the Spark archive to the correct location. I will keep the rest of the commands from the previous cell as they are still needed.



In [4]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz -P /content/
!tar xf /content/spark-3.5.1-bin-hadoop3.tgz -C /content/
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease

**Reasoning**:
The previous command still failed to extract the Spark archive because the wget command did not download the file. The output shows that the wget command did not complete successfully. I will try a different mirror for downloading Spark and also verify the download before attempting to extract it.
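
Because `wget -q` prints nothing when a download fails, it helps to check the archive explicitly before extracting it. The following is a minimal sketch of such a check in Python, assuming the archive path used in this notebook. The helper name `archive_is_usable` is hypothetical.

```python
import os
import tarfile


def archive_is_usable(path):
    """Return True if path exists, is non-empty, and is a readable tar archive."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    return tarfile.is_tarfile(path)


archive = "/content/spark-3.5.1-bin-hadoop3.tgz"  # path used in this notebook
if archive_is_usable(archive):
    # Extract only when the download is known to be a valid archive.
    with tarfile.open(archive) as tar:
        tar.extractall("/content/")
else:
    print(f"{archive} is missing or not a valid tar archive; skipping extraction")
```

This reports a missing or partial download directly instead of leaving it to surface later as a `tar` failure.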



In [5]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz -P /content/
!ls /content/
!tar xf /content/spark-3.5.1-bin-hadoop3.tgz -C /content/
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 https://cli.github.com/packages stable InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
sample_data  s

**Reasoning**:
The previous command successfully downloaded and extracted the Spark archive and set the SPARK_HOME environment variable. Now I need to set the JAVA_HOME environment variable and initialize findspark to make Spark available in the Python environment.



In [6]:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import findspark
findspark.init()

## Initialize Spark session

### Subtask:
Create a Spark session to start using Spark.


**Reasoning**:
Import the SparkSession class and create a Spark session builder, then get or create the Spark session.



In [7]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

## Load and process data

### Subtask:
Load a sample dataset (e.g., a CSV file) into a Spark DataFrame and perform some basic data processing (e.g., filtering, aggregation).


**Reasoning**:
Download a sample CSV file, load it into a Spark DataFrame, perform a basic data processing operation, and display the result.



In [8]:
!wget -q https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv -O gapminder.csv
df = spark.read.csv("gapminder.csv", header=True, inferSchema=True)
filtered_df = df.filter("pop > 100000000")
filtered_df.show()

+----------+----+------------+---------+--------+-----------+
|   country|year|         pop|continent| lifeExp|  gdpPercap|
+----------+----+------------+---------+--------+-----------+
|Bangladesh|1987|1.03764241E8|     Asia|  52.819|751.9794035|
|Bangladesh|1992|1.13704579E8|     Asia|  56.018|837.8101643|
|Bangladesh|1997|1.23315288E8|     Asia|  59.412|972.7700352|
|Bangladesh|2002| 1.3565679E8|     Asia|  62.013| 1136.39043|
|Bangladesh|2007|1.50448339E8|     Asia|  64.062|1391.253792|
|    Brazil|1972|1.00840058E8| Americas|  59.504|4985.711467|
|    Brazil|1977|1.14313951E8| Americas|  61.489|6660.118654|
|    Brazil|1982|1.28962939E8| Americas|  63.336|7030.835878|
|    Brazil|1987|1.42938076E8| Americas|  65.205|7807.095818|
|    Brazil|1992|1.55975974E8| Americas|  67.057|6950.283021|
|    Brazil|1997|1.68546719E8| Americas|  69.388|7957.980824|
|    Brazil|2002|1.79914212E8| Americas|  71.006|8131.212843|
|    Brazil|2007|1.90010647E8| Americas|   72.39|9065.800825|
|     Ch

## Extract meaningful results

### Subtask:
Analyze the processed data to extract insights or patterns.


**Reasoning**:
Calculate and display the average life expectancy and GDP per capita, the number of unique countries, and the average population by continent for the filtered data.
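
To make explicit what the grouped average and the distinct count compute, here is a plain-Python sketch over only the rows visible in the truncated output above (Bangladesh and Brazil). Because it uses a subset, its numbers will not match the Spark results for the full filtered DataFrame. The helper name `average_by_key` is hypothetical.

```python
from collections import defaultdict

# (country, continent, pop) for the rows visible in the truncated output above.
rows = [
    ("Bangladesh", "Asia", 103764241.0),
    ("Bangladesh", "Asia", 113704579.0),
    ("Bangladesh", "Asia", 123315288.0),
    ("Bangladesh", "Asia", 135656790.0),
    ("Bangladesh", "Asia", 150448339.0),
    ("Brazil", "Americas", 100840058.0),
    ("Brazil", "Americas", 114313951.0),
    ("Brazil", "Americas", 128962939.0),
    ("Brazil", "Americas", 142938076.0),
    ("Brazil", "Americas", 155975974.0),
    ("Brazil", "Americas", 168546719.0),
    ("Brazil", "Americas", 179914212.0),
    ("Brazil", "Americas", 190010647.0),
]


def average_by_key(records, key_index, value_index):
    """Group records by one field and average another, like groupBy(...).agg(avg(...))."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[value_index]
        counts[record[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}


avg_pop = average_by_key(rows, key_index=1, value_index=2)
unique_countries = {country for country, _, _ in rows}

print(avg_pop)
print(len(unique_countries))
```

Each country contributes one row per survey year that passes the filter, so the averages are taken over country-year rows rather than over countries.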



In [9]:
from pyspark.sql import functions as F

# Calculate and display average life expectancy and GDP per capita
# Compute both averages in a single aggregation instead of two separate jobs
avg_life_expectancy, avg_gdp_per_capita = filtered_df.agg(
    F.avg("lifeExp"), F.avg("gdpPercap")
).first()
print(f"Average Life Expectancy: {avg_life_expectancy:.2f}")
print(f"Average GDP per Capita: {avg_gdp_per_capita:.2f}")

# Calculate and display the number of unique countries
unique_countries_count = filtered_df.select("country").distinct().count()
print(f"Number of unique countries: {unique_countries_count}")

# Group by continent and calculate average population
avg_pop_by_continent = filtered_df.groupBy("continent").agg(F.avg("pop").alias("average_population"))
avg_pop_by_continent.show()

Average Life Expectancy: 63.89
Average GDP per Capita: 8637.62
Number of unique countries: 10
+---------+--------------------+
|continent|  average_population|
+---------+--------------------+
|   Africa|1.2038009233333333E8|
| Americas|1.8778264472727272E8|
|     Asia|4.5764983661538464E8|
+---------+--------------------+



## Summary:

### Data Analysis Key Findings

*   Across the filtered rows (country-year observations with a population above 100,000,000), the average life expectancy is approximately 63.89 years.
*   The average GDP per capita across the same rows is approximately \$8637.62.
*   The filtered rows cover 10 unique countries.
*   The average population per filtered row varies widely by continent: approximately 1.20 x 10⁸ for Africa, 1.88 x 10⁸ for the Americas, and 4.58 x 10⁸ for Asia.
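
Spark prints these averages in E-notation, which is hard to read at a glance. As a small sketch, the values from the output above can be converted to millions in plain Python. The helper name `to_millions` is hypothetical.

```python
def to_millions(spark_value):
    """Convert a value such as '1.2038009233333333E8' to an 'N.NN million' string."""
    return f"{float(spark_value) / 1e6:.2f} million"


# Values copied from the grouped output above.
averages = [
    ("Africa", "1.2038009233333333E8"),
    ("Americas", "1.8778264472727272E8"),
    ("Asia", "4.5764983661538464E8"),
]

for continent, value in averages:
    print(continent, to_millions(value))
```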

### Insights or Next Steps

*   Because the filter keeps only country-year rows above 100 million people, these statistics describe high-population countries and weight each country by the number of survey years in which it exceeded the threshold; they are not global averages.
*   Further analysis could compare these averages with those of the unfiltered dataset, or break the results down by country or year to identify trends.
