**HW 03**  
> In this homework, we will use MapReduce to calculate two canonical quantities in data analysis: the term frequency-inverse document frequency (tf-idf) measure, and the loss function for a support vector machine.  
> You should write your own functions to calculate these quantities instead of using functions from pyspark's MLlib library. Your code should use the RDDs and DataFrames created from pyspark.  
> While you will probably be able to use Chat to generate all of the necessary functions, I encourage you to first try designing and working through the solution yourself before asking Chat.

**1. tf-idf definition**  

The tf-idf measure is defined as follows:

Let $t$ be a term (a word), $d$ be a document, and $D$ be the collection of the documents.

Term frequency (tf):

$$\mathrm{tf}(t, d) = \frac{\textrm{\# occurrences of } t \textrm{ in } d}{\textrm{\# terms in } d},$$

Inverse document frequency (idf): $$\mathrm{idf}(t, D) = \log\left(\frac{\textrm{\# docs in } D}{\textrm{\# docs containing } t}\right).$$

As a result, the tf-idf measure is

$$\textrm{tf-idf}(t, d, D) = \mathrm{tf}(t, d)\times \mathrm{idf}(t, D).$$

Note: You can assume the number of documents in $D$ is pre-computed, e.g. by calling `.count()` on your DataFrame/RDD.
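
To make the definitions concrete, here is a minimal worked sketch in plain Python (no Spark). The toy corpus `docs` and the helper names `tf`, `idf`, and `tf_idf` are hypothetical and only for illustration; the sketch also assumes the natural logarithm, since the formula above does not fix a base.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the AG news data):
# document id -> list of terms
docs = {
    0: ["oil", "prices", "soar", "oil"],
    1: ["oil", "economy"],
    2: ["stocks", "fall"],
}

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the number of terms in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(# docs / # docs containing `term`).

    Assumes `term` appears in at least one document and uses the natural log.
    """
    n_containing = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc_id, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, docs[doc_id]) * idf(term, docs)

# "oil" occurs 2 times among the 4 terms of document 0, and in 2 of 3 documents,
# so tf = 2/4 and idf = log(3/2).
print(tf_idf("oil", 0, docs))
```

A term that appears in every document has idf equal to $\log(1) = 0$, so its tf-idf is zero regardless of how often it occurs.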

**Tasks**  

* Design the MapReduce functions for calculating the tf-idf measure.
* Calculate the tf-idf measure for each row in agnews_clean.csv. Save the measures in a new column.
* Print out the tf-idf measure for the first 5 documents.

**Dataset**  

The AG news dataset is cleaned and stored in agnews_clean.csv below:

In [1]:
!curl https://raw.githubusercontent.com/mosesyhc/de300-2025sp-class/refs/heads/main/agnews_clean.csv -O

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33.2M  100 33.2M    0     0  5009k      0  0:00:06  0:00:06 --:--:-- 5010k


In [2]:
# import necessary packages
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

In [26]:
# set up pyspark (the session must exist before it is used below):
spark = (SparkSession.builder
         .master("local[*]")
         .appName("AG news")
         .getOrCreate()
        )

# read in csv:
agnews = spark.read.csv("agnews_clean.csv", inferSchema=True, header=True)

In [24]:
# suppress log messages below the ERROR level:
spark.sparkContext.setLogLevel("ERROR")

In [18]:
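# rename the unnamed index column to 'id':
agnews = agnews.withColumnRenamed('_c0', 'id')
# parse the 'filtered' column from a JSON string into an array of strings:
agnews = agnews.withColumn('filtered', F.from_json('filtered', ArrayType(StringType())))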

In [25]:
# each row contains the document id and a list of filtered words
agnews.show(5, truncate=30)

+---+------------------------------+
|_c0|                      filtered|
+---+------------------------------+
|  0|[wall, st, bears, claw, bac...|
|  1|[carlyle, looks, toward, co...|
|  2|[oil, economy, cloud, stock...|
|  3|[iraq, halts, oil, exports,...|
|  4|[oil, prices, soar, time, r...|
+---+------------------------------+
only showing top 5 rows




What do we need to calculate?

* For each $d$, the count of each term $t$ in $d$.  
> Refer to the word count example.  
* For each $d$, the total number of words in $d$.  
* For each $t$, the number of documents $d$ that contain $t$.  
> What should be returned if we only want to know whether the document contains $t$ or not?
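
The three quantities above can be sketched as map and reduce steps in plain Python, as a rough stand-in for the RDD operations (in PySpark, the map steps would roughly correspond to `flatMap` and the reduce step to `reduceByKey`). The toy `docs` list and all function names here are hypothetical, and this is one possible design under those assumptions rather than the required solution.

```python
from collections import defaultdict

# Hypothetical toy input: (document id, list of terms) pairs
docs = [(0, ["oil", "prices", "oil"]), (1, ["oil", "economy"])]

def map_term_counts(doc):
    """Emit ((doc_id, term), 1) for every occurrence, as in word count."""
    doc_id, terms = doc
    return [((doc_id, t), 1) for t in terms]

def map_doc_length(doc):
    """Emit (doc_id, number of terms in the document)."""
    doc_id, terms = doc
    return [(doc_id, len(terms))]

def map_doc_presence(doc):
    """Emit (term, 1) once per distinct term, so repeats within a document
    do not inflate the document count."""
    doc_id, terms = doc
    return [(t, 1) for t in set(terms)]

def reduce_by_key(pairs):
    """Sum the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run(mapper, docs):
    """Apply the mapper to every document, flatten, then reduce by key."""
    pairs = [pair for doc in docs for pair in mapper(doc)]
    return reduce_by_key(pairs)

term_counts = run(map_term_counts, docs)   # count of t in each d
doc_lengths = run(map_doc_length, docs)    # number of words in each d
doc_freqs = run(map_doc_presence, docs)    # number of d containing each t
```

Deduplicating with `set(terms)` in the mapper is the design choice that turns a word count into a document count: each document contributes at most 1 to a term's total.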
