# Transforming Government Petition Data

## Objective
We are provided with a JSON data file containing government petitions. Each petition has a title (labelled *label*), a main text (labelled *abstract*) and a number of signatures (labelled *numberOfSignatures*).

In this report, our main objective is to transform the government petition data from JSON into a structured Pandas DataFrame and save it as a CSV file with one row per petition and $21$ columns. These $21$ columns consist of a newly created *petition_id* column and one column for each of the $20$ words that appear most often across the abstracts of all petitions. The *petition_id* column is a unique identifier for each petition, and each of the $20$ most common words must contain $5$ or more letters. Each of these $20$ columns stores the count of its word in each petition.


## Code Overview
First, we import the libraries required for this task: *json* for JSON parsing, *pandas* for data manipulation, *Counter* for word counting, and *nltk* for natural language processing. We also download the resources we need from the Natural Language Toolkit (NLTK), namely the tokenizer models and the stopword list, which the text processing below depends on.

In [1]:
import json
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK resources
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Akanksha99\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Akanksha99\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

When extracting the most common words, we need to exclude words that occur very frequently in English, such as “himself”, “the”, “is” and “are”. NLTK provides a list of such words, called stopwords, which we can use as a filter. Some words that we want to exclude are not in this list, so we customise it by adding them.

In [2]:
# Get English stopwords
english_stopwords = set(stopwords.words('english'))

# Add extra words that are not in the NLTK stopwords list
custom_stopwords = english_stopwords.copy()
custom_stopwords.add('could')
custom_stopwords.add('without')
custom_stopwords.add('every')
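
As a minimal sketch of how such a customised set acts as a filter, the example below uses a tiny hand-written set as a stand-in for the NLTK list (an assumption made so that the snippet runs without any downloads) and a made-up list of tokens.

```python
# Hypothetical stand-in for NLTK's English stopword list (the real list is much longer)
base_stopwords = {"the", "is", "are", "himself", "their", "about"}

# Copy the base set and add the extra words, mirroring the cell above
custom_stopwords = base_stopwords.copy()
custom_stopwords.update({"could", "without", "every"})

# Made-up tokens, already lowercased
tokens = ["every", "school", "could", "provide", "meals", "without", "charge"]

# Keep only the tokens that are not stopwords
kept = [token for token in tokens if token not in custom_stopwords]
print(kept)  # ['school', 'provide', 'meals', 'charge']
```

Set membership is case-sensitive and the NLTK stopwords are lowercase, so the text should be lowercased before this filter is applied.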

This part of the code loads the input data from the JSON file and then checks whether any petition is missing its *label* or *abstract*.

To check for missing abstracts, we create an empty list called missing_labels and iterate over the petitions, numbering them from 1 so that the index matches petition_id. If a petition's abstract is absent, empty or only whitespace, its id is appended to missing_labels. Finally, if any values are missing, the code prints the ids of those petitions.

In [3]:
# Load input data from the JSON file
with open("input_data.json", "r") as file:
    input_data = json.load(file)

In [4]:
missing_labels = []
for petition_id, petition_data in enumerate(input_data, start=1):
    abstract = petition_data.get("abstract", {}).get("_value")
    if abstract is None or not abstract.strip():
        missing_labels.append(petition_id)

if missing_labels:
    print(f"Missing abstracts in petitions with indices: {missing_labels}")
else:
    print("No missing abstracts found.")

No missing abstracts found.


Now we know that no abstract is missing in the JSON file. We checked the *label* field in the same way and found no missing values there either.
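
Since the same check is applied to more than one field, it could be wrapped in a small helper. The sketch below is one possible way to do this. The function name `find_missing` and the sample records are hypothetical, and it assumes that *label* stores its text under "_value" in the same way as *abstract*.

```python
def find_missing(data, field):
    """Return the 1-based ids of petitions whose `field` value is absent or blank."""
    missing = []
    for petition_id, petition in enumerate(data, start=1):
        # Treat a missing field, a missing "_value" and blank text alike
        value = (petition.get(field) or {}).get("_value")
        if value is None or not str(value).strip():
            missing.append(petition_id)
    return missing

# Made-up sample records, for illustration only
sample = [
    {"label": {"_value": "Title one"}, "abstract": {"_value": "Some text"}},
    {"label": {"_value": "   "}, "abstract": {"_value": "More text"}},
    {"abstract": {"_value": ""}},
]

for field in ("label", "abstract"):
    print(f"{field}: missing in petitions {find_missing(sample, field)}")
# label: missing in petitions [2, 3]
# abstract: missing in petitions [3]
```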

Next, we build the pre-processing step, the *preprocess_text()* function, which converts the text to lowercase and removes punctuation by keeping only alphanumeric characters and whitespace. This gives the text a consistent format, which makes it easier to extract words from sentences.

In [5]:
def preprocess_text(text):
    # Convert text to lowercase and remove punctuation
    text = text.lower()
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])
    return text
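
To illustrate what this step does to a few made-up strings, the sketch below repeats the same logic so that it runs on its own. Note that punctuation is dropped rather than replaced by a space, so a hyphenated word is merged into a single token.

```python
def preprocess_text(text):
    # Same logic as above: lowercase, then keep letters, digits and whitespace
    text = text.lower()
    return "".join(char for char in text if char.isalnum() or char.isspace())

examples = ["Hello, World!", "A well-known issue", "It's 2024!"]
for example in examples:
    print(repr(preprocess_text(example)))
# 'hello world'
# 'a wellknown issue'
# 'its 2024'
```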

Next, we create the function *get_top_words()*, which extracts the most common words from the input texts. It tokenizes each text, i.e. splits it into individual words, and filters out stopwords, using our customised set custom_stopwords, as well as short words with fewer than 5 letters.

This is done with a for loop that iterates over the input texts and tokenizes each one with the *word_tokenize()* function, producing a list called words. An inner loop then goes through this list and skips stopwords and words with fewer than 5 letters.

The function counts the occurrences of each remaining word using Counter from collections. Calling Counter() creates an empty Counter object, a collection in which elements are stored as dictionary keys and their counts as dictionary values.

The top 20 words are selected by frequency using the *most_common()* method of the Counter class, which returns a list of tuples, each containing a word and its count. Finally, the function returns a list called *top_words* containing the 20 most common words.

In [6]:
def get_top_words(texts, n=20):
    # Count word occurrences in the texts
    word_counts = Counter()
    for text in texts:
        words = word_tokenize(text)
        for word in words:
            if len(word) >= 5 and word not in custom_stopwords:
                word_counts[word] += 1
    
    # Get the top n words
    top_words = [word for word, _ in word_counts.most_common(n)]
    return top_words
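
The counting and selection steps can be seen in isolation in the short sketch below, which uses a made-up list of tokens in place of tokenized abstracts.

```python
from collections import Counter

# Made-up tokens, for illustration only
tokens = ["sauce", "tasty", "sauce", "tomato", "tasty", "sauce"]

word_counts = Counter()
for token in tokens:
    word_counts[token] += 1  # a missing key starts at 0, so no initialisation is needed

# most_common(n) returns (word, count) tuples, highest count first
print(word_counts.most_common(2))  # [('sauce', 3), ('tasty', 2)]

# Keep only the words, as get_top_words() does
top_words = [word for word, _ in word_counts.most_common(2)]
print(top_words)  # ['sauce', 'tasty']
```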

Then, we create a second function, *create_petition_df()*, which is the core of the code and is responsible for transforming the input petition data into a structured DataFrame. It returns a Pandas DataFrame with 21 columns, as required by the task.

Inside it, a list comprehension iterates over the input data and collects all abstracts into the variable abstracts by extracting the "_value" field under the "abstract" key of each petition. We then pre-process all the abstracts with *preprocess_text()*, store them in the list preprocessed_abstracts, and obtain the top 20 common words with the *get_top_words()* function.

After this, a loop iterates over the input data, numbering the petitions from 1. For each petition, it extracts the abstract, pre-processes it and appends to the petitions list a dictionary containing "petition_id" and the count of each top word in the pre-processed abstract. Finally, we convert this list to a Pandas DataFrame.

In [7]:
def create_petition_df(data):
    petitions = []
    
    # Extract the abstracts
    abstracts = [petition["abstract"]["_value"] for petition in data]
    # Preprocess abstracts
    preprocessed_abstracts = [preprocess_text(abstract) for abstract in abstracts]

    # Get the top 20 words
    top_words = get_top_words(preprocessed_abstracts, n=20)

    # Create DataFrame
    for petition_id, petition_data in enumerate(data, start=1):
        abstract = petition_data["abstract"]["_value"]

        # Preprocess abstract
        processed_abstract = preprocess_text(abstract)

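        # Count whole tokens rather than substrings, so that a top word is
        # not also counted inside longer words that contain it
        token_counts = Counter(word_tokenize(processed_abstract))

        petitions.append({
            "petition_id": petition_id,
            **{word: token_counts[word] for word in top_words}
        })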

    df = pd.DataFrame(petitions)
    return df
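
When counting a word inside a single abstract, it helps to keep in mind that Python's `str.count` matches substrings, whereas counting over a list of tokens matches whole words only. The sketch below shows the difference on a made-up sentence. Whitespace splitting stands in for *word_tokenize()* here, which is an assumption made to keep the snippet free of downloads.

```python
from collections import Counter

text = "the governments asked the government to govern"  # made-up sentence
word = "govern"

substring_count = text.count(word)         # also matches inside longer words
token_count = Counter(text.split())[word]  # whole-word matches only

print(substring_count, token_count)  # 3 1
```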

The code below calls *create_petition_df()* to create the petition DataFrame and then saves it to a CSV file named "output.csv".

In [8]:
# Create DataFrame with the desired structure
petition_df = create_petition_df(input_data)

# Save the DataFrame to a CSV file
petition_df.to_csv("output.csv", index=False)
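
As a quick sanity check on the saved format, the sketch below writes a small made-up DataFrame to an in-memory buffer instead of a file and reads it back. The rows and column names are illustrative only. With the real output, the same check would be expected to show 21 columns.

```python
import io
import pandas as pd

# Made-up rows with the same shape as the output: an id plus word-count columns
df = pd.DataFrame([
    {"petition_id": 1, "sauce": 1, "tasty": 1},
    {"petition_id": 2, "sauce": 2, "tasty": 0},
])

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row index column
print(buffer.getvalue())

# Read the CSV text back and inspect the columns
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))  # ['petition_id', 'sauce', 'tasty']
```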

## Conclusion
The provided code successfully achieves its objective of transforming government petition data into a structured CSV file, incorporating natural language processing techniques to identify and count the most common words. 

The DataFrame petition_df can now be used for further analysis.

## Test
Now let us run a set of unit tests, using the "unittest" framework, on the functions that transform the government petition data.

We first import the "unittest" module, which provides a framework for writing and running tests in Python. Then, we define a test class named "TestPetitionTransformation" that inherits from unittest.TestCase. Each method in this class represents an individual test case.

*test_preprocess_text()* checks that *preprocess_text()* converts the input text to lowercase and removes punctuation. The assertion is that "Hello, World!" is processed to "hello world".

*test_get_top_words()* passes a set of texts to *get_top_words()* and checks that it extracts the top 2 words with a minimum length of 5 characters. The assertion is that the extracted top words are "sauce" and "tasty".

*test_create_petition_df()* passes a small set of input data to *create_petition_df()* and checks that it creates the DataFrame correctly. The assertion is that the columns of the DataFrame include the expected top words ("sauce" and "tasty").

Then we run all the test methods in the TestPetitionTransformation class.

In [13]:
# Unit tests
import unittest

class TestPetitionTransformation(unittest.TestCase):
    def test_preprocess_text(self):
        self.assertEqual(preprocess_text("Hello, World!"), "hello world")

    def test_get_top_words(self):
        texts = ["This is a tasty sauce.", "sauce is very tasty.", "I like tomato sauce."]
        self.assertCountEqual(get_top_words(texts, n=2), ["sauce", "tasty"])

    def test_create_petition_df(self):
        input_data = [
            {"abstract": {"_value": "This is a tasty sauce."}},
            {"abstract": {"_value": "sauce is very tasty."}}
        ]

        df = create_petition_df(input_data)
        expected_top_words = set(get_top_words(["This is a tasty sauce.", "sauce is very tasty."], n=2))

        self.assertTrue(expected_top_words.issubset(set(df.columns)))

# Run the tests
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestPetitionTransformation))

...
----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK


<unittest.runner.TextTestResult run=3 errors=0 failures=0>