<div class="alert alert-block alert-success">
    
# FIT5196 Task 2 in Assessment 1
#### Student Name 1: Animesh Dubey
#### Student ID: 33758484
#### Student Name 2: Ashwin Gururaj
#### Student ID: 33921199

Date: 08 Aug 2024


Environment: Python 3.11.5

Libraries used:
* nltk.corpus
* nltk.tokenize
* nltk.stem
* nltk.collocations
* collections.defaultdict
* collections.Counter
* os
* json
* pandas
    
</div>

## Importing Libraries and Downloading NLTK Data

In [10]:
from nltk.corpus import stopwords, words
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from collections import defaultdict, Counter
import nltk
import os
import json
import pandas as pd

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('words')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashwingururaj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ashwingururaj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/ashwingururaj/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

Importing Libraries and Downloading NLTK Data:<br>

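We start by importing the libraries used throughout the notebook: nltk for text processing, pandas for tabular data, os for file paths, and json for reading the Task 1 output. We then download the NLTK resources the pipeline relies on: the English stop word list and the words corpus, which is used to check for valid English words. The punkt tokenizer data is downloaded as well, although the RegexpTokenizer used later does not require it.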

## Set Up File Paths

In [11]:
# Use the current working directory as the base folder (os.getcwd() returns where the notebook is run from)
script_dir = os.getcwd()

# Define file paths dynamically using os.path.join
csv_output_path = os.path.join(script_dir, 'task1_100.csv')
json_output_path = os.path.join(script_dir, 'task1_100.json')
stopwords_file_path = os.path.join(script_dir, 'stopwords_en.txt')
vocab_file_path = os.path.join(script_dir, 'group100_vocab.txt')
countvec_file_path = os.path.join(script_dir, 'group100_countvec.txt')

Defining File Paths:<br>

The paths for the Task 1 CSV and JSON outputs, the stop word file, and the two output files are built with os.path.join, relative to the current working directory. os.path.join inserts the correct path separator for the operating system, so the notebook runs unchanged on Windows, macOS, and Linux as long as the input files sit in the folder it is run from.
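
As a small illustration of the path handling described above, the sketch below joins a base folder with the input file names and reports any that are missing. The helper name find_missing_files is hypothetical and not part of the original notebook; the file names are the ones defined in the cell above.

```python
import os

def find_missing_files(base_dir, file_names):
    """Return the joined paths under base_dir that do not exist as files."""
    paths = [os.path.join(base_dir, name) for name in file_names]
    return [path for path in paths if not os.path.isfile(path)]

# Input files the notebook expects in the working directory
required_inputs = ['task1_100.csv', 'task1_100.json', 'stopwords_en.txt']

missing = find_missing_files(os.getcwd(), required_inputs)
if missing:
    print('Missing input files:', missing)
```

Running a check like this before the later cells gives a clear message instead of a FileNotFoundError part-way through the pipeline.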

## Filtering Eligible Businesses and Reviews Based on Text Review Count

In [12]:
# Load CSV output from Task 1 to identify eligible businesses with at least 70 text reviews
csv_df = pd.read_csv(csv_output_path)

# Keep the gmapIDs with review_text_count >= 70 (a set gives fast membership tests below)
eligible_gmapIDs = set(csv_df.loc[csv_df['review_text_count'] >= 70, 'gmapID'])

# Load JSON output from Task 1 containing review data
with open(json_output_path, 'r') as json_file:
    json_data = json.load(json_file)

# Filter JSON data to include only businesses with eligible gmapIDs
filtered_json_data = {gmapID: data for gmapID, data in json_data.items() if gmapID in eligible_gmapIDs}

Loading CSV and JSON Data:<br>

The code first loads the Task 1 CSV output into a pandas DataFrame and keeps the gmapIDs of businesses whose review_text_count is at least 70, so that every business in the analysis has enough review text.<br>

Next, the code loads the Task 1 JSON output, which holds the individual reviews for each gmapID, and keeps only the entries for the eligible gmapIDs. From this point on, every step works on the same set of businesses.
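
The two-step filter described above can be sketched on toy data. The IDs, counts, and review texts below are made up for illustration; only the threshold of 70 and the reviews / review_text keys come from the notebook.

```python
# Toy stand-ins for the Task 1 outputs (values are illustrative only)
review_text_counts = {'id_a': 120, 'id_b': 12, 'id_c': 70}
reviews_by_business = {
    'id_a': {'reviews': [{'review_text': 'great coffee'}]},
    'id_b': {'reviews': [{'review_text': 'too slow'}]},
    'id_c': {'reviews': [{'review_text': 'friendly staff'}]},
}

MIN_TEXT_REVIEWS = 70

# Step 1: collect the IDs that meet the threshold (70 itself qualifies)
eligible_ids = {gid for gid, n in review_text_counts.items() if n >= MIN_TEXT_REVIEWS}

# Step 2: keep only the review data for those IDs
filtered = {gid: data for gid, data in reviews_by_business.items() if gid in eligible_ids}

print(sorted(filtered))  # ['id_a', 'id_c']
```

Holding the eligible IDs in a set keeps each membership test constant-time, which matters when the JSON file contains many businesses.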

## Converting Filtered JSON Data to DataFrame and Preprocessing Text Reviews

In [13]:
# Convert the filtered JSON data into a DataFrame for further processing
selected_reviews = []
for gmapID, data in filtered_json_data.items():
    for review in data['reviews']:
        selected_reviews.append({
            'gmapID': gmapID,
            'text': review['review_text']
        })

df_reviews = pd.DataFrame(selected_reviews)

# Dropping NaN values and ensuring the text is in lowercase
df_reviews.dropna(subset=['text'], inplace=True)
df_reviews['text'] = df_reviews['text'].str.lower()

Converting Filtered JSON Data to DataFrame:<br>

The filtered JSON data is nested: each gmapID maps to a list of reviews. The code flattens it into a DataFrame with one row per review and two columns, gmapID and text, which is easier to process than the nested structure.<br>

Reviews with missing text are then dropped and the remaining text is converted to lowercase, so that later comparisons against the word list and the stop word lists are not affected by capitalization. Working in a DataFrame also lets the later steps apply each transformation to the whole text column with a single apply call.

## Filtering Text Reviews to Retain Only Valid English Words and Remove Stopwords

In [14]:
# Further filter text to keep only valid English words using NLTK's words corpus
english_words = set(words.words())

# Load the custom stopwords from a provided file and combine with NLTK stopwords
with open(stopwords_file_path, 'r') as f:
    custom_stopwords = set(f.read().splitlines())

# Combine custom stopwords with NLTK's English stopwords
user_defined_stopwords = set(stopwords.words('english')).union(custom_stopwords)

# Function to filter only valid English words from the text
def filter_english_words(text):
    """
    Description: Filters out non-English words from a given text. This function splits the input text into individual tokens (words), 
                 checks each token against a predefined set of valid English words, and returns a new string containing only valid English words.
    Arguments:
        text (str): The input text string that needs to be filtered for English words.
    Returns:
        str: A string containing only valid English words from the input text, separated by spaces.
    """
    tokens = text.split()
    return ' '.join([token for token in tokens if token in english_words])

df_reviews['text'] = df_reviews['text'].apply(filter_english_words)

Filtering for English Words and Removing Stopwords:<br>

Each review is reduced to the words that appear in NLTK's English word list. Because the text is split on whitespace before the lookup, a word with punctuation attached (for example "food,") does not match the list and is dropped along with non-English tokens. This cell also loads the custom stop word file and combines it with NLTK's English stop words; the combined set is applied in the next step.<br>

NLTK word list: A large list of English words provided by the words corpus of the Natural Language Toolkit (NLTK). Keeping only words from this list removes misspellings and non-English words that would add noise to the analysis, but it can also drop legitimate words that are missing from the list.
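
The effect of splitting on whitespace before the word-list lookup can be seen on a toy example. This is only a sketch: toy_english_words is a three-word stand-in for NLTK's word list, and the regex variant is shown for comparison, not as the notebook's method.

```python
import re

# Tiny stand-in for set(words.words()); illustrative only
toy_english_words = {'great', 'food', 'service'}

text = 'great food, great service!'

# Whitespace split, as in filter_english_words: punctuation stays attached
whitespace_kept = [t for t in text.split() if t in toy_english_words]

# Letter-run extraction: punctuation is stripped before the lookup
regex_kept = [t for t in re.findall(r'[a-z]+', text) if t in toy_english_words]

print(whitespace_kept)  # ['great', 'great']
print(regex_kept)       # ['great', 'food', 'great', 'service']
```

With the whitespace split, "food," and "service!" fail the lookup, so the choice of splitting rule directly affects which words reach the later steps.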

## Text Preprocessing: Tokenization, Stopword Removal, and Stemming

In [15]:
# Initialize the tokenizer and the stemmer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
stemmer = PorterStemmer()

# Function to preprocess text: tokenization, stopword removal, and stemming
def text_transformation(text):
    """
    Description: Preprocesses a review by tokenizing it, removing stopwords (NLTK's English list plus the custom list),
                 dropping tokens shorter than 3 characters, and stemming the remaining tokens.
    Arguments: text (str), the input text that needs to be preprocessed
    Return: list(str), the list of processed tokens
    """
    tokens = tokenizer.tokenize(text)
    # user_defined_stopwords already includes NLTK's English stopwords, so one pass is enough
    tokens = [token for token in tokens if token not in user_defined_stopwords]
    # The length check is applied to each token before it is stemmed
    tokens = [stemmer.stem(token) for token in tokens if len(token) >= 3]
    return tokens

df_reviews['tokens'] = df_reviews['text'].apply(text_transformation)

Tokenization, Stemming, and Stopwords Removal:<br>

The filtered review texts are split into individual words (tokens). Stop words, both NLTK's English list and the custom list, are removed, tokens shorter than three characters are dropped, and the remaining tokens are reduced to their stems. The resulting token lists are the input for the later steps: counting how often words appear and finding pairs of words (bigrams).<br>

Tokenization: The process of breaking text into individual words or tokens. Here a regular expression keeps runs of letters only, so digits and punctuation are discarded.<br>

Stemming: The process of reducing words to a common stem by stripping suffixes. For example, the Porter stemmer maps both "running" and "runs" to "run", while an irregular form such as "ran" is left unchanged. Stemming reduces the number of unique tokens and groups related word forms together.

## Token Frequency Analysis and Contextual Stopwords Removal

In [16]:
# Collect the set of unique tokens that appear in each business's reviews
business_entities = defaultdict(set)
for gmapID, tokens in zip(df_reviews['gmapID'], df_reviews['tokens']):
    business_entities[gmapID].update(tokens)

# Count the number of businesses in which each token appears (document frequency)
token_frequency_distribution = defaultdict(int)
for tokens in business_entities.values():
    for token in tokens:
        token_frequency_distribution[token] += 1

# Remove tokens appearing in 95% or more of businesses (context-dependent stopwords)
total_businesses = len(business_entities)
context_stopwords = {token for token, freq in token_frequency_distribution.items() if freq / total_businesses >= 0.95}
df_reviews['tokens'] = df_reviews['tokens'].apply(lambda tokens: [token for token in tokens if token not in context_stopwords])

# Remove rare tokens appearing in less than 5% of businesses
low_frequency_tokens = {token for token, freq in token_frequency_distribution.items() if freq / total_businesses < 0.05}
df_reviews['tokens'] = df_reviews['tokens'].apply(lambda tokens: [token for token in tokens if token not in low_frequency_tokens])

Analyzing Token Frequency and Removing Context-Dependent Stopwords:<br>

For each token, the code counts the number of businesses whose reviews contain it (its document frequency). Tokens found in 95% or more of the businesses are treated as context-dependent stop words, and tokens found in fewer than 5% of the businesses are treated as rare; both groups are removed.<br>

A token that appears for almost every business says little about any particular one, and a token that appears for very few businesses is often noise. Removing both leaves the tokens that best distinguish one business from another.
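
The document-frequency filter can be sketched on a toy corpus. The function name split_by_document_frequency and the business data are hypothetical; the default thresholds mirror the 95% and 5% cut-offs in the cell above, and the toy call uses a lower bound of 0.30 only so that a four-business example has something to drop.

```python
from collections import Counter

def split_by_document_frequency(tokens_by_business, upper=0.95, lower=0.05):
    """Return (too_common, too_rare) token sets based on the share of businesses using each token."""
    total = len(tokens_by_business)
    # Each business contributes at most 1 to a token's count because its tokens are a set
    doc_freq = Counter(token for tokens in tokens_by_business.values() for token in tokens)
    too_common = {t for t, n in doc_freq.items() if n / total >= upper}
    too_rare = {t for t, n in doc_freq.items() if n / total < lower}
    return too_common, too_rare

# Unique tokens per business (illustrative data)
toy_businesses = {
    'b1': {'food', 'pizza', 'staff'},
    'b2': {'food', 'staff'},
    'b3': {'food', 'hair'},
    'b4': {'food', 'staff'},
}

too_common, too_rare = split_by_document_frequency(toy_businesses, upper=0.95, lower=0.30)
print(sorted(too_common))  # ['food']
print(sorted(too_rare))    # ['hair', 'pizza']
```

Here "food" appears for all four businesses and is dropped as too common, "pizza" and "hair" each appear for one business in four and are dropped as rare, and "staff" (three in four) is kept.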

## Bigram Identification and Vocabulary Construction

In [17]:
# Combine all tokens into a single list to find meaningful bigrams
aggregate_tokens = [token for tokens in df_reviews['tokens'] for token in tokens]

# Initialize Bigram finder and use PMI to identify significant bigrams
bigram_analysis = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(aggregate_tokens)
finder.apply_freq_filter(5)  # Ignore bigrams that appear fewer than 5 times
bigrams = finder.nbest(bigram_analysis.pmi, 200)

# Combine unigrams and bigrams into a single vocabulary and sort alphabetically
vocab = set(aggregate_tokens + [' '.join(bigram) for bigram in bigrams])
vocab = sorted(vocab)

# Save the vocabulary with token indices to a file
vocab_dict = {token: index for index, token in enumerate(vocab)}
with open(vocab_file_path, 'w') as f:
    for token, index in vocab_dict.items():
        f.write(f'{token}:{index}\n')

Bigram Analysis and Vocabulary Creation:<br>

The code combines the tokens from all reviews into one sequence and scores adjacent pairs of tokens (bigrams) with Pointwise Mutual Information (PMI). Bigrams that occur fewer than 5 times are ignored, and the 200 highest-scoring bigrams are kept. Because the reviews are concatenated, a pair can also span the boundary between two reviews. The selected bigrams are combined with the single tokens (unigrams), sorted alphabetically, and written to the vocabulary file as token:index pairs.<br>

Bigram analysis: The process of identifying pairs of adjacent words that occur together more often than expected by chance, such as fixed phrases.<br>
Pointwise Mutual Information (PMI): A measure of the association between two words that compares how often they occur together with how often they would co-occur if they were independent. A higher PMI value indicates a stronger association. PMI tends to favor rare pairs, which is why the frequency filter is applied first.<br>

Combining bigrams with unigrams gives a vocabulary that covers both single words and two-word phrases. Because the bigrams are found after stop word removal and stemming, they are pairs of stems rather than pairs of original words.
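
The PMI score can be worked through by hand on a toy token list. This is a minimal sketch of the textbook definition, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), with probabilities estimated from raw counts. The function name bigram_pmi and the toy tokens are illustrative, and NLTK's BigramAssocMeasures.pmi may differ in the details of how the counts are normalized.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent token pair in one token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        # log2( (pair_count / n) / ((count(w1) / n) * (count(w2) / n)) )
        scores[(w1, w2)] = math.log2(pair_count * n / (unigram_counts[w1] * unigram_counts[w2]))
    return scores

toy_tokens = ['ice', 'cream', 'good', 'ice', 'cream', 'bad', 'good', 'bad']
scores = bigram_pmi(toy_tokens)

print(scores[('ice', 'cream')])   # 2.0
print(scores[('cream', 'good')])  # 1.0
```

Every word occurs twice in the eight-token list, but "ice cream" occurs as a pair twice while "cream good" occurs once, so the former scores log2(2 * 8 / 4) = 2 and the latter log2(1 * 8 / 4) = 1.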

## Count Vector Generation and Aggregation

In [18]:
# Function to generate count vectors for each review using the vocabulary
def generate_token_count_vector(tokens, vocab_dict):
    """ 
    Description: Creates a count vector string from a list of words using a vocabulary dictionary.
    Arguments:
    - tokens (list of str): The words you want to count.
    - vocab_dict (dict): A dictionary that maps each word to its corresponding index in the vocabulary.
    Returns:
    - str: A string showing how many times each word appears, formatted as "index:count" and separated by commas.
    """
    token_counts = Counter(tokens)
    count_vec = [f"{vocab_dict[token]}:{count}" for token, count in token_counts.items() if token in vocab_dict]
    return ', '.join(count_vec)

# Generate a count vector string for each review using the vocabulary
df_reviews['count_vector'] = df_reviews['tokens'].apply(lambda tokens: generate_token_count_vector(tokens, vocab_dict))

# Remove reviews with empty count vectors
df_reviews = df_reviews[df_reviews['count_vector'].str.strip() != '']

# Aggregate the count vectors for each gmapID
aggregated_count_vectors = defaultdict(Counter)

for gmapID, count_vector in zip(df_reviews['gmapID'], df_reviews['count_vector']):
    if count_vector:
        for pair in count_vector.split(', '):
            index, freq = pair.split(':')
            aggregated_count_vectors[gmapID][index] += int(freq)

# Save the aggregated count vectors to a file
with open(countvec_file_path, 'w') as f:
    for gmapID, counter in aggregated_count_vectors.items():
        count_vector_str = ', '.join([f"{index}:{freq}" for index, freq in sorted(counter.items(), key=lambda x: int(x[0]))])
        f.write(f"{gmapID}, {count_vector_str}\n")

Generating Count Vectors and Saving to File:<br>

The code builds a count vector for each review: for every vocabulary entry found in the review's tokens, it records index:count. Reviews with no vocabulary tokens are discarded, and the remaining counts are summed per gmapID and written to the output file, one line per business, with indices in ascending order. Because the token lists contain only single words, the bigram entries in the vocabulary do not receive counts in this step. The result is a structured, sparse representation of the review text that can be used for further analysis or machine learning.<br>

Count vector: A numerical representation of a document in which each element is the number of times a vocabulary term occurs in the document. Only non-zero counts are written out.<br>

Aggregation by gmapID: The counts from all reviews of the same business are added together, giving one vector per business that describes its overall word usage.
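
The counting and aggregation steps can be sketched on toy data. This variant sums Counter objects directly instead of parsing the per-review strings; that is an alternative formulation for illustration, not the notebook's implementation, and the vocabulary, business IDs, and tokens are made up.

```python
from collections import Counter, defaultdict

# Toy vocabulary (token -> index) and tokenized reviews (business ID, tokens)
vocab_dict = {'cream': 0, 'good': 1, 'ice': 2}
reviews = [
    ('b1', ['ice', 'cream', 'good']),
    ('b1', ['good', 'good', 'unknown']),  # 'unknown' is not in the vocabulary
    ('b2', ['ice']),
]

# Sum the counts of vocabulary tokens per business, keyed by vocabulary index
aggregated = defaultdict(Counter)
for business_id, tokens in reviews:
    for token in tokens:
        if token in vocab_dict:
            aggregated[business_id][vocab_dict[token]] += 1

# Format one line per business with indices in ascending order
lines = []
for business_id, counts in aggregated.items():
    pairs = ', '.join(f'{index}:{count}' for index, count in sorted(counts.items()))
    lines.append(f'{business_id}, {pairs}')

print(lines)  # ['b1, 0:1, 1:3, 2:1', 'b2, 2:1']
```

Keeping the indices as integers means they sort numerically without the int conversion needed when the counts are parsed back from strings.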

<div class="alert alert-block alert-success"> 

## References <a class="anchor" name="Ref"></a>

* Bird, S., Klein, E., & Loper, E. (n.d.). Natural Language Toolkit. NLTK Project. https://www.nltk.org/
* Stack Overflow. (2015, April 14). Python remove stop words from pandas dataframe. https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe
* Natural Language Toolkit. (n.d.). collocations.doctest. GitHub. https://github.com/nltk/nltk/blob/develop/nltk/test/collocations.doctest

## Acknowledgement
* We acknowledge the assistance of ChatGPT, powered by OpenAI, in completing certain parts of this assignment. The tool provided support with regex patterns, text preprocessing, and improving the overall quality of the work.
OpenAI. (2023). ChatGPT (GPT-4). https://openai.com/chatgpt


</div>