In [27]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
!tar xf spark-3.0.3-bin-hadoop3.2.tgz
!pip install -q findspark

We can now check the contents of the Java directory:

In [28]:
!ls /usr/lib/jvm/

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


Now that Java and Spark are installed in Colab, set the environment variables that let PySpark find them. Set the locations of Java and Spark by running the following code:

In [29]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop3.2"
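Before going further, it can help to confirm that both variables point at directories that actually exist, since a wrong path tends to surface only later, when the session is created. A minimal sketch, assuming a hypothetical helper named `missing_env_dirs` (not part of Spark or Colab):

```python
import os

def missing_env_dirs(names):
    """Return the variable names that are unset or do not point to an existing directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# In the notebook this would be called right after setting the two variables;
# an empty list means both paths were found.
print(missing_env_dirs(["JAVA_HOME", "SPARK_HOME"]))
```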

Finally, let's install PySpark, the Python library for Spark:

In [30]:
!pip install pyspark==3.0.3



Configuring a SparkSession

The entry point to Spark SQL is an object called SparkSession. It starts a Spark application in which all the code for that session runs.

.builder: gives access to the Builder API, which is used to configure the session.

.master(): sets where the program runs. "local[*]" runs it locally on all available cores, while "local[1]", for example, uses a single core. Here "locally" means the Colab virtual machine, so our programs run on Google's servers.

.appName(): optional method that names the Spark application.

.getOrCreate(): returns the existing SparkSession, or creates a new one if none exists.

In [31]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Big_Data_Application_ICP_2").getOrCreate()

To use a local file in Google Colab, run the following code, which prompts you to select a file from your computer:

In [32]:
from google.colab import files
files.upload()

Saving icp2.txt to icp2 (1).txt


{'icp2.txt': b'As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends.Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020.How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk?To help navigate these questions, we spoke with CNN Medical Analyst Dr.Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of Pub
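The dictionary printed above maps each uploaded file name to its contents as `bytes`. As a small sketch of how that return value could be used directly, the hypothetical helper `decode_uploads` below turns it into text. The UTF-8 encoding is an assumption about the file, and the sample dictionary stands in for the real return value of `files.upload()`:

```python
def decode_uploads(uploaded, encoding="utf-8"):
    """Map each uploaded file name to its contents decoded as text."""
    return {name: data.decode(encoding) for name, data in uploaded.items()}

# Stand-in for the value returned by files.upload(); in Colab you would write
# uploaded = files.upload() and pass that instead.
sample = {"icp2.txt": b"As the Labor Day holiday nears, many people are planning travel"}
texts = decode_uploads(sample)
print(texts["icp2.txt"][:25])
```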

In [33]:
# import required libraries
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter
from itertools import groupby
from operator import itemgetter
import pprint

# download the stop word list used by stopwords.words('english')
nltk.download('stopwords', quiet=True)

def rmv_duplic(data):
    """Return data with repeated space-separated words removed, keeping first occurrences."""
    # split the string on single spaces
    words = data.split(" ")

    # Counter maps each word to its frequency; its keys keep first-seen order
    counts = Counter(words)

    # join the unique words back into one string
    return " ".join(counts.keys())
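To see what `rmv_duplic` does, a compact equivalent can be tried on a short string. This is a sketch, not the notebook's function: the name `dedupe_words` is hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7+), which is also what keeps the `Counter` keys in first-seen order.

```python
def dedupe_words(text):
    """Drop repeated space-separated tokens, keeping the first occurrence of each."""
    return " ".join(dict.fromkeys(text.split(" ")))

print(dedupe_words("the rates are higher than the rates were"))
```

Matching is case-sensitive and happens before punctuation is stripped, so `Disease` and `disease` are both kept.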

In [34]:
# sparkContext: main entry point for Spark functionality; represents the connection to a Spark cluster
sc = spark.sparkContext

# read the uploaded file as an RDD with one element per line
readfile = sc.textFile('/content/icp2.txt')

# collect the lines into a Python list on the driver
newFile = readfile.collect()
print(newFile)

# letters used to group the words alphabetically
alphacheck = 'abcdefghijklmnopqrstuvwxyz'
dictionary = {}
newListForDict = []

['As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends.Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020.How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk?To help navigate these questions, we spoke with CNN Medical Analyst Dr.Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of Public Health. S

In [35]:
# build the stop word set once instead of on every iteration
stop_words = set(stopwords.words('english'))

for i in newFile:
  # remove duplicate words from the line
  newList = rmv_duplic(i)
  for j in newList.split():
    # remove punctuation
    j = re.sub(r'[^\w\s]', '', j)
    if j.lower() not in stop_words:
      newListForDict.append(j)
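The filtering step can be tried in isolation on a short token list. The sketch below assumes a tiny hand-written stop word set (`STOP_WORDS`, a stand-in for NLTK's English list so the example runs without a download) and a hypothetical helper `clean_tokens`:

```python
import re

# Assumption: a small stand-in for stopwords.words('english').
STOP_WORDS = {"the", "are", "to", "and", "at", "is", "this"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Strip punctuation from each token and drop stop words (case-insensitive)."""
    kept = []
    for token in tokens:
        word = re.sub(r"[^\w\s]", "", token)
        if word.lower() not in stop_words:
            kept.append(word)
    return kept

print(clean_tokens(["The", "rates", "are", "climbing.", "Covid-19", "get-togethers"]))
```

Because hyphens count as punctuation, `Covid-19` becomes `Covid19` and `get-togethers` becomes `gettogethers`, as in the notebook's printed word list.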

In [36]:
# write the words to a text file, grouped by first letter
with open('/content/output.txt', 'w') as saveFile:
  for letter in alphacheck:
    print('\n', letter, end=', ')
    saveFile.write('\n%s' % (letter + ', '))
    for n in newListForDict:
      if n.lower().startswith(letter):
        print(n, end=', ')
        saveFile.write(n + ', ')


 a, also, affect, Analyst, author, Angeles, 
 b, book, 
 c, Covid19, climbing, coronavirus, compared, CNN, County, Centers, Control, 
 d, Day, DrLeana, Doctors, different, disease, Disease, 
 e, extended, emergency, 
 f, family, friendsUnfortunately, friends, Fight, 
 g, gettogethers, George, 
 h, holiday, higher, Hospitalizations, highest, hospitalizations, house, help, health, Health, HealthThings, hospitalized, 
 i, infections, Institute, 
 j, January, Journey, 
 k, 
 l, Labor, levels, last, Lifelines, likely, Los, 
 m, many, members, Medical, management, Milken, main, 
 n, nears, new, navigate, 
 o, occurring, one, officials, 
 p, people, planning, parts, physician, professor, policy, Public, protect, People, published, Prevention, 
 q, questions, 
 r, rates, reduce, risk, riskTo, reason, report, 
 s, see, since, States, safety, safe, stay, steps, start, school, spoke, School, Shes, severe, said, 
 t, travel, time, think, together, take, times, 
 u, United, University, unvaccinate