## Question 1: Part-of-Speech (POS) Counts
**Objective**: Create an RDD pipeline to count the occurrences of each part-of-speech tag in a text dataset.

## Steps:
## 1. Read the Text File:
- Load the text data using `sc.textFile` to create an RDD from the dataset.
- Example: `text = sc.textFile("path/to/textfile.txt")`
## 2. Filter the Data:
- Filter out blank lines and lines starting with 'URL'.
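
The filtering step can be sketched as a small predicate. The name `keep_line` is hypothetical, and the sketch assumes that whitespace-only lines count as blank and that 'URL' is matched after stripping leading whitespace. It is exercised on a plain Python list here so it can be checked without a Spark context.

```python
def keep_line(line):
    """Return True for lines that should stay in the dataset."""
    stripped = line.strip()
    # Drop blank (or whitespace-only) lines and lines starting with 'URL'
    return bool(stripped) and not stripped.startswith("URL")

# Hand-written sample lines, standing in for the contents of the text RDD
sample = ["URL http://example.com/page", "", "   ", "Spark is fast."]
kept = [line for line in sample if keep_line(line)]

# On the RDD from step 1 the same predicate would be passed to filter:
# filtered_rdd = text.filter(keep_line)
```

Keeping the predicate as a named function rather than a lambda makes it easy to test locally before running it on the cluster.
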
## 3. Tokenization and POS Tagging:
- Define a function `pos_tag_counter` that tokenizes each line and tags each token with its part-of-speech.
- Use `nltk` for tokenization and POS tagging.
- Example:

In [None]:
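import nltk  # word_tokenize and pos_tag need the NLTK tokenizer and tagger data installed

def pos_tag_counter(line):
    # Split the line into word tokens
    tokens = nltk.word_tokenize(line)
    # Tag each token; the result is a list of (token, tag) pairs
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags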

## 4. FlatMap to Apply POS Tagging:
- Use `flatMap` to apply `pos_tag_counter` to each entry in the RDD.
- Example: `pos_tagged_rdd = filtered_rdd.flatMap(pos_tag_counter)`
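
To see why `flatMap` is used here rather than `map`, the following local sketch compares the two shapes on plain Python lists. The tagged lines are hand-written stand-ins for what `pos_tag_counter` would return, not actual `nltk` output.

```python
from itertools import chain

# Hypothetical tagger output for two lines (tags written by hand)
tagged_lines = [
    [("Spark", "NNP"), ("runs", "VBZ")],
    [("Fast", "JJ"), ("jobs", "NNS"), ("finish", "VBP")],
]

# map-like result: one element per input line, each a list of pairs
mapped = [pairs for pairs in tagged_lines]

# flatMap-like result: the per-line lists are concatenated,
# so each element is a single (token, tag) pair
flat_mapped = list(chain.from_iterable(tagged_lines))
```

With the flattened shape, the next step can read the tag directly as `x[1]` from each element.
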
## 5. Map and Reduce by Key:
- Map each POS tag to a key-value pair where the key is the tag and the value is 1.
- Reduce by key to get the count of each POS tag.
- Example:

In [None]:
pos_counts_rdd = pos_tagged_rdd.map(lambda x: (x[1], 1)).reduceByKey(lambda a, b: a + b)
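
As a rough local illustration of what the `map` and `reduceByKey` pair computes, the sketch below reproduces the same logic on a plain list. The tagged tokens are hand-written sample data, and the grouping dictionary is only a stand-in for Spark's shuffle, not how Spark implements it.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (token, tag) pairs, in the shape of pos_tagged_rdd elements
tagged = [("Spark", "NNP"), ("runs", "VBZ"), ("Hadoop", "NNP"), ("jobs", "NNS")]

# map step: emit (tag, 1) for every token
pairs = [(tag, 1) for _token, tag in tagged]

# group the 1s by tag, then combine them with the same lambda as reduceByKey
grouped = defaultdict(list)
for tag, one in pairs:
    grouped[tag].append(one)
pos_counts = {tag: reduce(lambda a, b: a + b, ones) for tag, ones in grouped.items()}
```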

## 6. Sort Results:
- Sort the results by counts in descending order.
- Example: `sorted_pos_counts_rdd = pos_counts_rdd.sortBy(lambda x: x[1], ascending=False)`
## 7. Output the Results:
- Take the top 10 results and print them.
- Example:

In [None]:
top_10_pos_counts = sorted_pos_counts_rdd.take(10)
for pos, count in top_10_pos_counts:
    print(f"{pos}: {count}")

## Question 2: Noun Phrase Length Distribution
**Objective**: Create an RDD pipeline to show the distribution of the length of noun phrases in a text dataset.
## Steps:
## 1. Define the Grammar for Noun Phrases:
- Define a grammar to identify noun phrases using `nltk.RegexpParser`.
- Example:

In [None]:
grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""

## 2. Define a Function to Extract Noun Phrases:
- Create a function `get_noun_phrases` that tokenizes and POS tags each line, then extracts noun phrases based on the defined grammar.
- Example:

In [None]:
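def get_noun_phrases(line):
    # Tokenize and POS-tag the line (requires `import nltk`)
    tokens = nltk.word_tokenize(line)
    pos_tags = nltk.pos_tag(tokens)
    # Chunk the tagged tokens according to the grammar defined above
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(pos_tags)
    # Each noun phrase is returned as a list of (token, tag) pairs
    return [subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]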

## 3. FlatMap to Apply Noun Phrase Extraction:
- Use `flatMap` to apply `get_noun_phrases` to each entry in the RDD.
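- Example: `noun_phrases_rdd = text.flatMap(get_noun_phrases)` (substitute `filtered_rdd` from Question 1 if blank lines and lines starting with 'URL' should be excluded here as well)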
## 4. Map and Reduce by Key:
- Map each noun phrase to a key-value pair where the key is the length of the noun phrase (its number of tokens) and the value is 1.
- Reduce by key to get the count of each length.
- Example:

In [None]:
length_rdd = noun_phrases_rdd.map(lambda np: (len(np), 1)).reduceByKey(lambda a, b: a + b)
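
The length used as the key can be checked locally with a short sketch. The noun phrases below are hand-written examples in the shape `get_noun_phrases` returns (lists of `(token, tag)` pairs), and `Counter` stands in for the `map` plus `reduceByKey` pair.

```python
from collections import Counter

# Hypothetical noun phrases (tags written by hand, not produced by nltk)
noun_phrases = [
    [("data", "NN"), ("pipeline", "NN")],
    [("cluster", "NN")],
    [("memory", "NN"), ("usage", "NN")],
]

# len(phrase) counts the (token, tag) pairs, i.e. the tokens in the phrase
length_counts = Counter(len(phrase) for phrase in noun_phrases)
```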

## 5. Sort Results:
- Sort the results by counts in descending order.
- Example: `sorted_length_counts_rdd = length_rdd.sortBy(lambda x: x[1], ascending=False)`
## 6. Output the Results:
- Take the top 10 results and print them.
- Example:

In [None]:
top_10_length_counts = sorted_length_counts_rdd.take(10)
for length, count in top_10_length_counts:
    print(f"Length: {length}, Count: {count}")