<a href="https://colab.research.google.com/github/chris-kehl/DataScience_with_PySpark/blob/main/data_analysis_with_python_and_pyspark_chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found


In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


In [None]:
# Executing pySpark on colab
!apt-get update # Update apt-get repository
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Unzip the tgz file.
!pip install -q findspark # install findspark. Adds PySpark to the system path during runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()

# Create pySpark session
from pyspark.sql import SparkSession
spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice.")
         .getOrCreate())


In [None]:
spark.sparkContext


Decide how chatty you want your Spark session to be. Here the log level is set to ERROR, so only errors are reported.

In [None]:
spark.sparkContext.setLogLevel("ERROR")

To read data, use spark.read, which returns a DataFrameReader. Calling dir() on it lists its attributes, including the readers Spark supports: csv, jdbc, json, orc, parquet, table, and text.

In [None]:
spark.read
dir(spark.read)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_df',
 '_jreader',
 '_set_opts',
 '_spark',
 'csv',
 'format',
 'jdbc',
 'json',
 'load',
 'option',
 'options',
 'orc',
 'parquet',
 'schema',
 'table',
 'text']
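Most of the listing above is dunder and private names. A minimal sketch of filtering them out follows; public_names is a hypothetical helper written for this illustration, and it is demonstrated on a plain string so that it runs without a Spark session.

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Demonstrated on a plain string so the block runs without Spark.
print(public_names("text")[:3])

# With the session created earlier, the same helper could be applied to the reader:
# public_names(spark.read)
```

Applied to spark.read, this should leave only the entries from csv through text in the listing above.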

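Pull the data from GitHub at https://github.com/jonesberg/DataAnalysisWithPythonAndPySpark. The path in the next cell assumes the data was copied to Google Drive and that Drive is mounted at /content/drive.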

In [None]:
book = spark.read.text("/content/drive/MyDrive/DataAnalysisWithPythonAndPySpark-Data-trunk/gutenberg_books/1342-0.txt")

book

DataFrame[value: string]

In [None]:
book.printSchema()
print(book.dtypes)

root
 |-- value: string (nullable = true)

[('value', 'string')]


Take a look at the data in the DataFrame. With no arguments, show() displays the first 20 rows and truncates each value to 20 characters.

In [None]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



Show the top ten rows, truncating each value to 50 characters


In [None]:
book.show(10, truncate=50)

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
|    with this eBook or online at www.gutenberg.org|
|                                                  |
|                                                  |
|                        Title: Pride and Prejudice|
|                                                  |
+--------------------------------------------------+
only showing top 10 rows
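The truncation seen in the two tables above can be mimicked in plain Python. This is a minimal sketch, not Spark's own code: truncate_cell is a hypothetical helper, and the rule it encodes (keep width - 3 characters and append "...") is inferred from the output shown here.

```python
def truncate_cell(text, width=20):
    """Shorten text the way the show() output above appears to."""
    if width <= 0 or len(text) <= width:
        return text  # short enough, or truncation disabled
    if width < 4:
        # Assumption: very small widths simply cut the string.
        return text[:width]
    return text[: width - 3] + "..."


title = "The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen"
print(truncate_cell(title))      # default width of 20
print(truncate_cell(title, 50))  # same width as show(10, truncate=50)
```

The two printed strings should match the first row of each table above.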



Split the lines into arrays of words


In [None]:
from pyspark.sql.functions import split
lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



Select the value column from the book DataFrame. The four forms below are equivalent.

In [None]:
from pyspark.sql.functions import col
book.select(book.value)
book.select(book["value"])
book.select(col("value"))
book.select("value")

DataFrame[value: string]

In [None]:
book.show(5, truncate=100)

+--------------------------------------------------------------------+
|                                                               value|
+--------------------------------------------------------------------+
|  The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen|
|                                                                    |
|    This eBook is for the use of anyone anywhere at no cost and with|
|almost no restrictions whatsoever.  You may copy it, give it away or|
| re-use it under the terms of the Project Gutenberg License included|
+--------------------------------------------------------------------+
only showing top 5 rows



Split the lines of text into a list of words again, this time without an alias, so the column takes its default name split(value,  , -1)

In [None]:
from pyspark.sql.functions import col, split

lines = book.select(split(col("value"), " "))

lines.printSchema()

lines.show(5, truncate=50)

root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------------------------------------------------+
|                               split(value,  , -1)|
+--------------------------------------------------+
|[The, Project, Gutenberg, EBook, of, Pride, and...|
|                                                []|
|[This, eBook, is, for, the, use, of, anyone, an...|
|[almost, no, restrictions, whatsoever., , You, ...|
|[re-use, it, under, the, terms, of, the, Projec...|
+--------------------------------------------------+
only showing top 5 rows
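The fourth row above contains an empty element after "whatsoever." because the source line has two consecutive spaces there. The same effect can be reproduced in plain Python, as a sketch that runs without Spark. The sample string is the start of that line of the book; re.split with " +" is used here as a stand-in for a pattern that matches one or more spaces.

```python
import re

line = "almost no restrictions whatsoever.  You may copy it"

# Splitting on a single space keeps an empty string between the two adjacent spaces.
single = line.split(" ")

# Splitting on one or more spaces avoids the empty token.
collapsed = re.split(r" +", line)

print(single)
print(collapsed)
```

The second argument of pyspark.sql.functions.split is a regular expression, so passing " +" instead of " " is one way to try the same idea on the book DataFrame. That variant was not run here.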

