In [1]:
## Run this notebook on Google Colab.

In [1]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8 (headless), suppressing output.
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, substring, split, size
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from string import punctuation


In [3]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Read text from disk

In [4]:
text_file_path = "news.txt"
text_df = spark.read.text(text_file_path)

### Number of news and words

The following cell performs these operations using PySpark:

1. **Counting Paragraphs:**
   - `paragraphs_df = text_df.select(split(text_df.value, "\n").alias("paragraphs"))`: This line splits the text of each row on newline ("\n") and creates a DataFrame with a single column named "paragraphs" containing arrays of paragraphs. Because `spark.read.text` already returns one row per line, each array holds exactly one element.
   - `paragraph_count_df = paragraphs_df.select(size(paragraphs_df.paragraphs).alias("paragraph_count"))`: This line calculates the size (number of elements) of each array in the "paragraphs" column, giving the count of paragraphs in each row.
   - `total_paragraph_count = paragraph_count_df.agg({"paragraph_count": "sum"}).collect()[0][0]`: This line sums the per-row counts to get the total number of paragraphs in the entire file, which here equals the number of lines.

2. **Counting Words:**
   - `words_df = text_df.select(split(text_df.value, " ").alias("words"))`: This line splits the text of each row into words using a single space (" ") as the delimiter and creates a DataFrame with a single column named "words" containing arrays of words. Standalone punctuation tokens such as "-" count as words at this stage.
   - `word_count_df = words_df.select(size(words_df.words).alias("word_count"))`: This line calculates the size (number of elements) of each array in the "words" column, essentially giving the count of words in each row.
   - `total_word_count = word_count_df.agg({"word_count": "sum"}).collect()[0][0]`: This line aggregates the counts to get the total number of words in the entire file.

3. **Getting the First Five Words from the First Row:**
   - `first_row_words = words_df.head(1)[0]["words"][:5]`: This line retrieves the first row of the "words" column, which is an array of words, and then selects the first five words from that array.

Finally, the code prints the total number of paragraphs, the total number of words, and the first five words of the first row.
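As a rough illustration of the counting logic above, the sketch below mirrors the same steps in plain Python, without Spark. The `sample_lines` list is made up for illustration and stands in for the rows that `spark.read.text` produces (one row per line); it also assumes that Python's `str.split` behaves like Spark's `split` for these simple delimiters.

```python
# Hypothetical sample: one string per line, standing in for the rows of text_df.
sample_lines = ["alpha beta gamma", "", "delta epsilon"]

# A single line contains no "\n", so splitting on it gives a one-element list.
# The "paragraph" total is therefore just the number of lines.
paragraph_count = sum(len(line.split("\n")) for line in sample_lines)

# Splitting on a single space; an empty line still yields one (empty) token.
word_count = sum(len(line.split(" ")) for line in sample_lines)

# First five tokens of the first line (fewer if the line is shorter).
first_five_words = sample_lines[0].split(" ")[:5]

print(paragraph_count, word_count, first_five_words)
```

Under these assumptions, every blank line adds one empty token to the word total, which may be worth keeping in mind when reading the figures printed by the cell.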

In [5]:
paragraphs_df = text_df.select(split(text_df.value, "\n").alias("paragraphs"))
paragraph_count_df = paragraphs_df.select(size(paragraphs_df.paragraphs).alias("paragraph_count"))
total_paragraph_count = paragraph_count_df.agg({"paragraph_count": "sum"}).collect()[0][0]
print("Total number of paragraphs in the file:", total_paragraph_count)

words_df = text_df.select(split(text_df.value, " ").alias("words"))
word_count_df = words_df.select(size(words_df.words).alias("word_count"))
total_word_count = word_count_df.agg({"word_count": "sum"}).collect()[0][0]
print("Total number of words in the file:", total_word_count)

first_row_words = words_df.head(1)[0]["words"][:5]
print("First five words:", first_row_words)

Total number of paragraphs in the file: 12
Total number of words in the file: 2787
First five words: ['JAPAN', 'TO', 'REVISE', 'LONG', '-']


### Top ten most repeated words

The next cell performs the following tasks:

1. **Lowercasing Words:**
   - Converts all words in the "words" column of the original DataFrame (`words_df`) to lowercase, creating a new DataFrame (`lowercase_words_df`).

2. **Exploding the Array of Words:**
   - Transforms each array of words into separate rows, resulting in a DataFrame (`exploded_words_df`) with a single column "word" containing individual words.

3. **Counting Word Occurrences:**
   - Groups the DataFrame by the "word" column and counts the occurrences of each word, creating a new DataFrame (`word_counts_df`) with columns "word" and "count."

4. **Selecting Top 10 Words:**
   - Orders the DataFrame by word frequency in descending order and selects the top 10 words, creating a new DataFrame (`top_10_words`).

5. **Displaying the Result:**
   - Prints the top 10 words and their frequencies without truncation.
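The five steps above can be sketched in plain Python with `collections.Counter`. This is only an analogy for the DataFrame pipeline, not the notebook's own method; the `rows` sample and the name `top_two` are hypothetical.

```python
from collections import Counter

# Hypothetical rows of already-split words (a stand-in for words_df).
rows = [["The", "cat", "saw", "the", "dog"], ["the", "Dog", "ran"]]

# Steps 1-2: lowercase every word and flatten the arrays into one sequence,
# which is what transform(... lower ...) followed by explode achieves.
flat_words = [word.lower() for row in rows for word in row]

# Step 3: count the occurrences of each word (the groupBy("word").count() step).
word_counts = Counter(flat_words)

# Steps 4-5: keep the most frequent words (two here, for brevity) and print them.
top_two = word_counts.most_common(2)
print(top_two)
```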


In [6]:
lowercase_words_df = words_df.selectExpr("transform(words, word -> lower(word)) as words")

exploded_words_df = lowercase_words_df.select(explode(lowercase_words_df.words).alias("word"))
word_counts_df = exploded_words_df.groupBy("word").count()
top_10_words = word_counts_df.orderBy("count", ascending=False).limit(10)
top_10_words.show(truncate=False)

+----+-----+
|word|count|
+----+-----+
|.   |130  |
|the |123  |
|,   |102  |
|to  |84   |
|of  |64   |
|said|55   |
|and |55   |
|in  |54   |
|a   |45   |
|s   |33   |
+----+-----+



### Top ten most repeated words without punctuation
The following cells clean the lowercased words and repeat the frequency count:

1. **`remove_punctuation` Function:**
   - Defines a Python function `remove_punctuation` that takes a list of words, removes punctuation from each word, and drops any word that is left empty.

2. **UDF Registration:**
   - Registers the function as a PySpark User-Defined Function (UDF) named `remove_punctuation_udf`, with an array of strings as its return type, so that it can be used in Spark SQL queries.

3. **Temporary View Creation:**
   - Creates a temporary view named "lowercase_words_view" from the DataFrame `lowercase_words_df`, which lets Spark SQL queries refer to the DataFrame by name.

4. **Spark SQL Query:**
   - Executes a Spark SQL query that applies the registered UDF to the "words" column of "lowercase_words_view" and creates a new DataFrame `clean_words_df` with the cleaned words.

5. **Counting and Displaying:**
   - Explodes the cleaned arrays into one word per row, counts each word, orders the counts in descending order (`top_words`), and shows the first 10 rows.
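To make the cleaning step concrete, here is a small sketch of the `str.maketrans` and `str.translate` mechanism that the function relies on, applied to a few tokens outside Spark. The `sample_tokens` list is invented for illustration.

```python
from string import punctuation

# Translation table that maps every ASCII punctuation character to "deleted".
translator = str.maketrans("", "", punctuation)

# Hypothetical tokens, some of which consist of punctuation only.
sample_tokens = ["u.s.", ",", "said", "-", "it's"]

# Strip punctuation from each token; pure-punctuation tokens become "".
stripped = [token.translate(translator) for token in sample_tokens]

# Drop tokens that ended up empty (or whitespace-only).
non_empty = [token for token in stripped if token.strip()]

print(stripped)
print(non_empty)
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes in the input would pass through unchanged.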



In [7]:
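def remove_punctuation(words):
    # Translation table that deletes every ASCII punctuation character.
    translator = str.maketrans("", "", punctuation)
    cleaned_words = [word.translate(translator) for word in words]
    # Drop words that are empty or whitespace-only after cleaning.
    cleaned_words = [word for word in cleaned_words if word.strip()]
    return cleaned_words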


In [8]:
spark.udf.register("remove_punctuation_udf", remove_punctuation, ArrayType(StringType()))
lowercase_words_df.createOrReplaceTempView("lowercase_words_view")
clean_words_df = spark.sql("SELECT remove_punctuation_udf(words) as words FROM lowercase_words_view")

exploded_words_df = clean_words_df.select(explode(clean_words_df.words).alias("word"))
word_counts_df = exploded_words_df.groupBy("word").count()
top_words = word_counts_df.orderBy("count", ascending=False)
top_words.show(10, truncate=False)


+----+-----+
|word|count|
+----+-----+
|the |123  |
|to  |84   |
|of  |64   |
|and |55   |
|said|55   |
|in  |54   |
|a   |45   |
|s   |33   |
|on  |28   |
|for |22   |
+----+-----+
only showing top 10 rows



### Top five most repeated first letters
The following cell extends the analysis by summing word occurrences per first letter, using the previously obtained `top_words` DataFrame:

1. **Extracting First Letters:**
   - Adds a new column "first_letter" to the `top_words` DataFrame, containing the first letter of each word.


2. **Counting Words by First Letter:**
   - Groups the DataFrame (`first_letter_df`) by the "first_letter" column.
   - Sums the "count" values of all words that share a first letter, renaming the result column to "letter_count."


3. **Sorting Letter Counts:**
   - Orders the DataFrame (`letter_counts`) by the "letter_count" column in descending order.

4. **Displaying Top 5 Letter Counts:**
   - Prints the top 5 letters along with the total number of word occurrences starting with each letter.
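The aggregation can be sketched in plain Python as follows. As input it uses only the ten rows shown in the punctuation-free output above, so the totals are smaller than those of the full `top_words` DataFrame; the variable names are illustrative.

```python
from collections import Counter

# The ten (word, count) rows displayed by the previous cell.
# The real top_words DataFrame contains every distinct word, not just these.
shown_rows = [
    ("the", 123), ("to", 84), ("of", 64), ("and", 55), ("said", 55),
    ("in", 54), ("a", 45), ("s", 33), ("on", 28), ("for", 22),
]

# Sum the occurrence counts of all words that share a first letter,
# mirroring substring(word, 1, 1) followed by groupBy(...).agg(sum).
letter_counts = Counter()
for word, count in shown_rows:
    letter_counts[word[0]] += count

# Order by total in descending order and keep the top five.
top_five_letters = letter_counts.most_common(5)
print(top_five_letters)
```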


In [9]:
first_letter_df = top_words.withColumn("first_letter", substring(col("word"), 1, 1))
letter_counts = first_letter_df.groupBy("first_letter").agg({"count": "sum"}).withColumnRenamed("sum(count)", "letter_count")
sorted_letter_counts = letter_counts.orderBy("letter_count", ascending=False)
sorted_letter_counts.show(5, truncate=False)


+------------+------------+
|first_letter|letter_count|
+------------+------------+
|t           |337         |
|a           |224         |
|s           |200         |
|o           |164         |
|i           |150         |
+------------+------------+
only showing top 5 rows

