<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

# IFN619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774) *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder on Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774)


In [None]:
# Complete the following cell with your details and run to produce your personalised header for this assignment

from IPython.display import HTML

# personal details
first_name = "Yonten"
last_name = "Loday"
student_number = "N11828773"

personal_header = f"<h1>{first_name} {last_name} ({student_number})</h1>"
HTML(personal_header)

## Introduction
The Advance Queensland Program and Grants is a Queensland Government initiative that aims to promote innovation, economic growth, and job creation within Queensland. The first findings (assessment 1) showed an unequal distribution of funds between South East Queensland and Regional Queensland. South East Queensland received considerably more funding than Regional Queensland, raising concerns about the fairness and effectiveness of the current fund distribution policy. The number of funding recipients has also fallen drastically since the beginning of 2021. Those findings revealed the patterns, trends and biases in fund allocation. Based on the insights from assessment 1, this report looks into the following questions. 
1. What is the public reaction to the funding distribution? (Sentiment Analysis)
2. Can Machine Learning confirm the regional funding disparity? (K-Means Clustering)
3. Why is there a significant difference in fund allocation between South-East Queensland and Regional Queensland? (NMF)

To address these questions, I will follow the QDAVI cycle of data analytics, incorporating unstructured data from the Guardian API, applying machine learning techniques to uncover deeper insights, and considering the ethical implications of the findings. Through this approach, I aim to provide a narrative that not only highlights the existing issues but also offers policy recommendations for improving fund distribution. 

Ethical considerations are integral to ensuring that the results and outcomes are just, transparent, and beneficial for all stakeholders. By adhering to ethical standards, analysts can produce reliable, fair, and actionable insights that contribute positively to decision-making processes and societal well-being. Therefore, this report incorporates the relevant ethical aspects and flags each one where it applies (just before the implementation code cell) with the note **Ethical Considerations**.

--- -----------------------------------------------------------------------------------------------------------------------
I will start by importing all required libraries; the purpose of each group of libraries is marked in a comment.

In [None]:
#json data and dataframe
import requests
import numpy as np
import pandas as pd
#preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
#text analysis
from collections import Counter
from textblob import TextBlob
#visualization
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#machine learning 
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

# Question 1
**What is the public reaction to the funding distribution? (Sentiment Analysis)**

Importance of the question: It is important to understand public opinion before starting the in-depth analysis, because it gives an overview that frames the rest of the work. For instance, if public sentiment is negative or close to negative, it tells us what to look into next. 

## Data
To answer the first question, I will use articles (unstructured data) from the Guardian API.

**Ethical Considerations**: API Key Management & Data Privacy</br>
Before requesting articles from the Guardian API, I set up my API key, which is stored in a key.txt file in a private folder outside the notebook folder. This ensures that access credentials are not exposed publicly, protecting the data source from unauthorized access. The articles retrieved from the Guardian API are publicly available information, ensuring compliance with data privacy standards. No personal data is used or exposed in the analysis.

In [None]:
with open('../../private/key.txt', 'r') as file:
    key = file.read().strip()

To gauge the public reaction, I want to do sentiment analysis on the content (bodyText) of the articles. Sentiment analysis will show how positive, negative, or neutral the coverage of the program is. 

I also want to check whether the number of articles changes the sentiment. So, the following code cell has two dictionaries that request different numbers of articles: one with 100 articles (params_100) and another with 200 articles (params_200). Apart from page-size, params_100 also adds `south east` to the search terms and lists webUrl in show-fields. For the sentiment analysis, I want only the headline (title), trailText (summary/highlights), bodyText, and webUrl (for question 3) of the articles. 

In [None]:
base_url = 'https://content.guardianapis.com/search'

params_100 = {
    'q': 'Queensland AND (economic growth OR fund allocation OR regional development OR south east)',
    'api-key': key,
    'page-size': 100,  
    'show-fields': 'headline,trailText,bodyText,webUrl'   
}

params_200 = {
    'q': 'Queensland AND (economic growth OR fund allocation OR regional development)',
    'api-key': key,
    'page-size': 200,  
    'show-fields': 'headline,trailText,bodyText' 
}

I then request articles from the Guardian API using the query in each dictionary (for params_100, 'q': 'Queensland AND (economic growth OR fund allocation OR regional development OR south east)'; params_200 omits 'south east') with the number of articles specified. The JSON responses are stored in data_100 and data_200. 

In [None]:
# Make the API request
response_100 = requests.get(base_url, params=params_100)
data_100 = response_100.json()

response_200 = requests.get(base_url, params=params_200)
data_200 = response_200.json()

The following code cell prints the unique keys inside ['response']['results']

In [None]:
results_100 = data_100['response']['results']
# Collect unique keys from each dictionary in the 'results' array
unique_keys = set()
fields_keys = set()
for result in results_100:
    unique_keys.update(result.keys())
    fields_keys.update(result['fields'].keys())
print(unique_keys)
print(fields_keys) 

results_200 = data_200['response']['results']
unique_keys = set()
fields_keys = set()
for result in results_200:
    unique_keys.update(result.keys())
    fields_keys.update(result['fields'].keys())
print("------For 200 articles------")
print(unique_keys)
print(fields_keys) 

Then I collected the relevant information from both sets of articles using the `append()` function. The collected information is then converted to the data frames df_articles_100 and df_articles_200 using `pd.DataFrame()`. 

In [None]:
# Extract relevant information from the response
articles_100 = []
for result in data_100['response']['results']:
    article = {
        'headline': result['fields']['headline'],
        'trailText': result['fields']['trailText'],
        'bodyText': result['fields']['bodyText'],
        'webUrl': result['webUrl']
    }
    articles_100.append(article)
    
articles_200 = []
for result in data_200['response']['results']:
    article = {
        'headline': result['fields']['headline'],
        'trailText': result['fields']['trailText'],
        'bodyText': result['fields']['bodyText'],
        'webUrl': result['webUrl']
    }
    articles_200.append(article)
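
The two extraction loops above are identical apart from their input, so they could be folded into one helper. The following is a minimal sketch, not the code used for the results in this notebook: `extract_articles` and `sample_payload` are hypothetical names, and the use of `.get()` with empty-string defaults is an assumption to avoid a `KeyError` if an article lacks a field.

```python
def extract_articles(payload):
    """Flatten a Guardian search payload into a list of article dicts."""
    articles = []
    for result in payload.get('response', {}).get('results', []):
        fields = result.get('fields', {})
        articles.append({
            'headline': fields.get('headline', ''),
            'trailText': fields.get('trailText', ''),
            'bodyText': fields.get('bodyText', ''),
            # webUrl sits at the top level of each result, not inside 'fields'
            'webUrl': result.get('webUrl', ''),
        })
    return articles

# Hypothetical payload shaped like data_100 / data_200, for illustration only
sample_payload = {
    'response': {
        'results': [
            {'webUrl': 'https://example.com/a',
             'fields': {'headline': 'A', 'trailText': 'a', 'bodyText': 'text a'}},
            {'webUrl': 'https://example.com/b',
             'fields': {'headline': 'B', 'bodyText': 'text b'}},  # no trailText
        ]
    }
}

sample_articles = extract_articles(sample_payload)
print(len(sample_articles), repr(sample_articles[1]['trailText']))
```

With such a helper, `articles_100 = extract_articles(data_100)` and `articles_200 = extract_articles(data_200)` would replace both loops.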

Then I print the last 5 rows of each dataframe to check that I have collected the right number of articles and that the two sets differ: the last 5 articles of df_articles_100 should be different from those of df_articles_200. 

In [None]:
df_articles_100 = pd.DataFrame(articles_100)
df_articles_100.tail()

In [None]:
df_articles_200 = pd.DataFrame(articles_200)
df_articles_200.tail()

## Analysis

In [None]:
# nltk.download('stopwords')
# nltk.download('punkt')

I have downloaded the NLTK stopwords and punkt resources to preprocess the text data of the dataframes. English stop words and non-alphanumeric characters are removed because they do not add any value to the sentiment analysis. The bodyText is converted to lowercase and then tokenized. These steps are all done by the `preprocess_text()` function defined below. 

The tokens are then joined back into a single string so that the text is compatible with the sentiment analysis. The preprocessed bodyText is stored in a new column called cleaned_text, which I will use hereafter (for sentiment analysis).

In [None]:
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove non-alphanumeric characters
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize the text
    filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
    return ' '.join(filtered_tokens)

# Apply preprocessing to the bodyText column
df_articles_100['cleaned_text'] = df_articles_100['bodyText'].apply(preprocess_text)
df_articles_200['cleaned_text'] = df_articles_200['bodyText'].apply(preprocess_text)
df_articles_200.tail()
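
To make the effect of these steps concrete, here is a small standard-library sketch of the same pipeline. It is a stand-in for illustration, not the notebook's function: `preprocess_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny assumed sample rather than NLTK's English list, and `str.split()` replaces `word_tokenize`.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list
DEMO_STOP_WORDS = {'the', 'a', 'is', 'to', 'in', 'of', 'and', 'for'}

def preprocess_text_demo(text, stop_words=DEMO_STOP_WORDS):
    text = re.sub(r'\W', ' ', text)  # replace non-word characters with spaces
    text = text.lower()              # convert to lowercase
    tokens = text.split()            # whitespace split stands in for word_tokenize
    return ' '.join(t for t in tokens if t not in stop_words)

demo_cleaned = preprocess_text_demo("The funding for Regional Queensland is $2.5m in 2024!")
print(demo_cleaned)
```

Note that `\W` keeps digits, so numbers such as `2024` and fragments such as `5m` remain in the cleaned text.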

## Visualization

Before the sentiment analysis, I want to give readers a quick visual summary of the prominent keywords and topics in the articles, which sets the context for the sentiment analysis. 

To do this, I created a word cloud using the WordCloud library. To generate a word cloud, I first need the frequency of each word. Therefore, the following code cell combines all text in the cleaned_text column into one single string. Then, `Counter()` is used to count the number of occurrences (frequency) of each word.

In [None]:
# Combine all articles into a single string
all_text_100 = ' '.join(df_articles_100['cleaned_text'])
all_text_200 = ' '.join(df_articles_200['cleaned_text'])

# Get word frequency
word_freq_100 = Counter(all_text_100.split())
word_freq_200 = Counter(all_text_200.split())
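
As a quick check before drawing the clouds, the most frequent words can be read straight from a `Counter`. This is a minimal sketch on a made-up string (`demo_text` is hypothetical); in the notebook the same call would be `word_freq_100.most_common(10)`.

```python
from collections import Counter

# Made-up cleaned text, for illustration only
demo_text = 'government funding queensland government health funding government'
demo_freq = Counter(demo_text.split())

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = demo_freq.most_common(2)
print(top_two)
```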

The frequency counters are then passed to `generate_from_frequencies()` of WordCloud, along with other parameters for the word cloud [<a href = "https://www.geeksforgeeks.org/generating-word-cloud-python/">geeksforgeeks</a>]

In [None]:
wordcloud_100 = WordCloud(
    width = 800,
    height = 400,
    max_font_size = 80, 
    background_color = 'white',
    colormap = 'viridis'
).generate_from_frequencies(word_freq_100)

wordcloud_200 = WordCloud(
    width = 800,
    height = 400,
    max_font_size = 80, 
    background_color = 'white',
    colormap = 'viridis'
).generate_from_frequencies(word_freq_200)

The two word clouds, the first from 100 articles and the second from 200 articles, are plotted side by side using matplotlib subplots. 

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
# Plot the 100 articles word cloud
axs[0].imshow(wordcloud_100, interpolation='bilinear')
axs[0].axis("off")
axs[0].set_title('Word Cloud from 100 Articles')

# Plot the 200 articles word cloud
axs[1].imshow(wordcloud_200, interpolation='bilinear')
axs[1].axis("off")
axs[1].set_title('Word Cloud from 200 Articles')

# Show the plots
plt.show()

**Insights from the Word Clouds of 100 and 200 Guardian Articles**

Common Themes:
- In both word clouds, "government," "australia," "said," "minister," and "people" are among the most prominent words. This indicates that discussions around government, national affairs, and public statements are central themes in the articles.
- Words like "albanese" (likely referring to Anthony Albanese, the Australian Prime Minister) and "minister" suggest a significant focus on political leadership and governmental figures.

Word Cloud from 100 Articles:
- Key Topics: In addition to the common themes, this word cloud highlights "health," "education," "public," and "funding." These topics suggest that a substantial portion of the articles focus on public services and funding issues.
- Geographical Focus: The presence of words like "queensland" and "south" indicates some regional focus, particularly on Queensland and southern parts of Australia.
- Specific Issues: Words such as "energy," "nuclear," "school," and "police" suggest discussions on energy policies, law enforcement, and educational institutions.
  
Word Cloud from 200 Articles:
- Expanded Topics: With a larger dataset, additional prominent words like "NSW" (New South Wales), "court," "report," and "time" emerge, indicating a broader range of topics covered, including legal matters and various time-bound events.
- Public and Social Services: Similar to the 100 articles, there is a strong emphasis on "public," "health," and "services," indicating consistent coverage of public welfare topics.
- Government and Policy: The continued prominence of words like "government," "state," and "federal" suggests ongoing discussions about government policies and state-level governance.

Now that readers have a brief overview of the articles from the word clouds, I will do sentiment analysis using `TextBlob()`. TextBlob is a popular Python library for basic natural language processing tasks. It provides a simple interface for performing basic sentiment analysis, making it suitable for quick, general-purpose tasks without needing extensive setup or customization 
[<a href ='https://textblob.readthedocs.io/en/dev/'>TextBlob Documentation</a>]

Using TextBlob, I compute the polarity of each article and add it to each dataframe (100 and 200 articles) as a new column called sentiment.  

In [None]:
# Function to get sentiment polarity
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

# Apply sentiment analysis
df_articles_100['sentiment'] = df_articles_100['cleaned_text'].apply(get_sentiment)
df_articles_100.tail()

In [None]:
# Apply sentiment analysis
df_articles_200['sentiment'] = df_articles_200['cleaned_text'].apply(get_sentiment)
df_articles_200.tail()
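
TextBlob polarity is a number between -1 (most negative) and 1 (most positive). To read the scores as positive, negative, or neutral, one option is to bucket them around zero. The sketch below is an assumption for illustration: `label_polarity` is a hypothetical helper and the 0.05 threshold is an arbitrary choice, not a TextBlob convention.

```python
def label_polarity(polarity, threshold=0.05):
    """Bucket a polarity score in [-1, 1] into a coarse label.

    The threshold is an assumed cut-off, chosen only for illustration.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

demo_labels = [label_polarity(p) for p in (0.10, 0.00, -0.10)]
print(demo_labels)
```

In the notebook this could be applied with `df_articles_100['sentiment'].apply(label_polarity)` to count articles per label.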

With the sentiment column holding the polarity score of each article, I want to give readers a visual representation of how the sentiments are distributed. I am using seaborn on matplotlib subplots to plot two histograms with a Kernel Density Estimate (KDE), one histogram for each set of articles. 

KDE plots smooth out data noise to reveal underlying patterns and provide a clearer view of data distribution, making it easier to compare peaks, spread, and overall shape between datasets. They complement histograms by offering a continuous estimate of probability density, enhancing the insights gained from data visualization. [<a href = "https://seaborn.pydata.org/generated/seaborn.kdeplot.html">seaborn.kdeplot</a>]

**Ethical Considerations**</br>
The number of bins is not chosen arbitrarily; it is calculated using Sturges' Rule, a widely used technique for determining the number of bins for a histogram. Sturges' formula: </br> <img src = "formula.png" width = "200"/> where *n* is the number of observations and ⌈⋅⌉ denotes the ceiling function (round up to the nearest integer) [<a href = "https://en.wikipedia.org/wiki/Sturges%27s_rule#:~:text=This%20rule%20is%20widely%20employed,the%20default%20bin%20selection%20method.&text=(due%20to%20counting%20the%200,the%20result%20is%20rounded%20up.">Wikipedia</a>]
</br> 
*for 100 articles*</br>
1 + log2(100) ≈ 7.644, rounded up to 8
</br>*for 200 articles*</br>
1 + log2(200) ≈ 8.644, rounded up to 9
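
The same calculation can be done in code, so the bin count follows the actual number of rows rather than being typed in by hand. This is a minimal sketch; `sturges_bins` is a hypothetical helper name.

```python
import math

def sturges_bins(n_observations):
    """Number of histogram bins by Sturges' rule: ceil(1 + log2(n))."""
    if n_observations < 1:
        raise ValueError('need at least one observation')
    return math.ceil(1 + math.log2(n_observations))

bins_100 = sturges_bins(100)
bins_200 = sturges_bins(200)
print(bins_100, bins_200)
```

In the notebook, `sturges_bins(len(df_articles_100))` would guard against the API returning fewer articles than requested.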

**Ethical Considerations** </br>
Because the two samples contain different numbers of articles, the two histograms would otherwise have inconsistent x-axis and y-axis scales, which would be misleading and make them difficult to compare. So, I synchronise the y-axis and x-axis to allow an accurate and meaningful visual comparison.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 6))

# Plot the sentiment distribution for 100 articles
sns.histplot(df_articles_100['sentiment'], bins=8, kde=True, ax=axs[0])
axs[0].set_title('Sentiment Distribution of 100 Guardian Articles')
axs[0].set_xlabel('Sentiment Polarity')
axs[0].set_ylabel('Frequency')

# Plot the sentiment distribution for 200 articles
sns.histplot(df_articles_200['sentiment'], bins=9, kde=True, ax=axs[1])
axs[1].set_title('Sentiment Distribution of 200 Guardian Articles')
axs[1].set_xlabel('Sentiment Polarity')
axs[1].set_ylabel('Frequency')

# Synchronize y-axis limits
max_freq = max(axs[0].get_ylim()[1], axs[1].get_ylim()[1])
axs[0].set_ylim(0, max_freq)
axs[1].set_ylim(0, max_freq)

# Synchronize x-axis limits
min_x = min(axs[0].get_xlim()[0], axs[1].get_xlim()[0])
max_x = max(axs[0].get_xlim()[1], axs[1].get_xlim()[1])
axs[0].set_xlim(min_x, max_x)
axs[1].set_xlim(min_x, max_x)

plt.tight_layout()
plt.show()

## Insights
- Both distributions are centered around a sentiment polarity of approximately 0.1, indicating that the average sentiment of the Guardian articles tends to be neutral to mildly positive. This central tendency is consistent across both the smaller sample of 100 articles and the larger sample of 200 articles. However, some articles have a sentiment polarity around -0.1, which is a concern for the program, although their frequency is low. 
- The highest frequency of articles is found around a sentiment polarity of 0.1 in both graphs. This peak of the KDE curve indicates that mildly positive sentiment is the most common sentiment observed in the Guardian articles. The larger sample in the 200-article graph results in a more pronounced and smoother peak, confirming the trend seen in the smaller sample.
- The slightly broader range in the 200-article graph suggests a more diverse sample but a similar sentiment spread. This indicates that increasing the number of articles does not change the overall sentiment; it just smooths the spread. 

# Question 2
**Can Machine Learning confirm the regional funding disparity?**

Importance of Question 2: Accurate identification of funding disparities ensures that resources can be reallocated more equitably, promoting balanced regional development and reducing economic inequalities. Although the disparity was seen in assessment 1, this machine learning approach provides further confirmation of the regional funding disparity. Confirming the disparity will also help explain the **sentiment/reaction** of the public.  

To answer this, I will use K-Means clustering. K-Means clustering is suitable for analyzing regional funding disparity because it can handle multi-dimensional data, discover patterns, and provide clear and interpretable results. Its efficiency and scalability allow analysts to confirm and visualize funding disparities objectively.

## Data
For K-Means Clustering, I will be using cleaned data of Advance Queensland Funding Recipients from assessment 1. 

In [None]:
fund_data = pd.read_csv('fund_data.csv')
print(fund_data.dtypes)

In [None]:
fund_data.columns

## Analysis

The cleaned data has four columns, but only Region_Category and Actual Contractual Commitment($) are useful for my K-Means clustering. Therefore, I select only those two with the following code cell. 

In [None]:
imp_cols = ['Region_Category','Actual Contractual Commitment($)']
fund_data = fund_data[imp_cols].copy()  # copy so later column assignments do not act on a view
fund_data

For simplicity, I changed Region_Category to Region. 

In [None]:
fund_data = fund_data.rename(columns = {'Region_Category':'Region'})

As we can see, Region is categorical, so I convert its labels to numbers using `LabelEncoder()`. LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1 [<a href = "https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets">Label Encoding </a>]

Regional Queensland: Region 0 </br>
South East Queensland: Region 1

In [None]:
# Encode categorical data
label_encoder = LabelEncoder()
fund_data['Region'] = label_encoder.fit_transform(fund_data['Region'])
fund_data
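
`LabelEncoder` assigns codes in sorted order of the distinct labels, which is why Regional Queensland becomes 0 and South East Queensland becomes 1. The sketch below reproduces that ordering in plain Python for illustration; `label_mapping` is a hypothetical helper and the label strings are assumed to match the values in Region_Category.

```python
def label_mapping(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# Assumed label values, for illustration only
demo_mapping = label_mapping(
    ['South East Queensland', 'Regional Queensland', 'South East Queensland']
)
print(demo_mapping)
```

In the notebook, the fitted encoder's `label_encoder.classes_` attribute lists the labels in code order and can be used to confirm the mapping.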

**Ethical Considerations** : Fair Clustering</br>
Then I standardized the Actual Contractual Commitment($) column using `StandardScaler()` from sklearn. This step is important because standardizing before applying K-Means clustering helps each feature contribute comparably to the distance calculations used to form clusters. This process converts the feature to a common scale, preventing features with larger ranges from disproportionately influencing the clustering results. Standardization therefore improves the quality of the clustering, enabling the algorithm to identify underlying patterns in the data without bias towards any particular feature.

In [None]:
# Standardize the numerical data
scaler = StandardScaler()
fund_data['Actual Contractual Commitment($)'] = scaler.fit_transform(fund_data[['Actual Contractual Commitment($)']])

fund_data.head()
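
Standardization subtracts the column mean and divides by the column standard deviation. The following plain-Python sketch shows the arithmetic on made-up dollar amounts; `standardize` is a hypothetical helper, and it uses the population standard deviation, which is what `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column has no spread to scale
    return [(v - mean) / std for v in values]

# Made-up funding amounts, for illustration only
demo_scaled = standardize([10_000, 20_000, 30_000])
print([round(z, 3) for z in demo_scaled])
```

After this step the values are unitless z-scores, so they no longer read as dollar amounts.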

**Ethical Considerations**</br>
One of the important parameters of K-Means clustering is n_clusters, the number of clusters to form. There are multiple ways of selecting an appropriate n_clusters, from random selection to complex algorithms. For my K-Means clustering, I used the Elbow method. It involves plotting the inertia for different numbers of clusters and identifying the “elbow” point, where the sharp decrease in inertia levels off, suggesting an appropriate cluster count for analysis or model training. Inertia is the sum of squared distances between each data point and its nearest cluster centroid. 

The following code cell computes the inertia for 1 to 10 clusters. Using matplotlib and these inertia values, I then plot the elbow graph. 

In [None]:
inertia = []
scaled_data = fund_data.values

for n in range(1, 11): #(number of clusters trial)
    kmeans = KMeans(n_clusters=n, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)
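
To show what the inertia value measures, here is a small plain-Python sketch of the definition. It is an illustration only: `inertia_of`, the points, and the centroids are made up, and real centroids come from the fitted `KMeans` model.

```python
def inertia_of(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        )
    return total

# Made-up (region code, scaled amount) points forming two obvious groups
demo_points = [(0.0, 0.0), (0.0, 2.0), (1.0, 10.0), (1.0, 12.0)]

one_cluster = inertia_of(demo_points, [(0.5, 6.0)])
two_clusters = inertia_of(demo_points, [(0.0, 1.0), (1.0, 11.0)])
print(one_cluster, two_clusters)
```

The large drop from one centroid to two is the kind of change the elbow plot looks for; once extra clusters stop producing such drops, the curve flattens.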

In [None]:
inertia

In [None]:
# Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Determining Optimal Number of Clusters')
plt.show()

Based on the elbow plot, the appropriate number of clusters for this data is 3, because the sharp decrease in inertia levels off at 3. K-Means is then applied with 3 clusters. 

In [None]:
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
fund_data['cluster'] = kmeans.fit_predict(scaled_data)
fund_data

Then I plot the K-Means clusters of fund allocations for SEQ and RQ to visualise them. 

In [None]:
fig = px.scatter(fund_data, x='Region', y='Actual Contractual Commitment($)', color='cluster',
                 title='K-Means Clustering of Fund Allocation between SEQ and RQ',
                 labels={'Region': 'Region (0 = RQ, 1 = SEQ)', 'Actual Contractual Commitment($)': 'Fund Amount (standardized)'})
fig.show()

## Insights

<b>Cluster 0 (Blue):</b>
This cluster primarily consists of funding amounts allocated to South-East Queensland (SEQ) with a lower range of funding amounts. 
</br><b>Cluster 1 (Pink):</b>
This cluster also consists of funding amounts allocated to SEQ, with more dispersed and higher funding amounts compared to Cluster 0.
</br><b>Cluster 2 (Yellow):</b>
This cluster is associated with Regional Queensland (RQ). The data points are tightly grouped together, suggesting a consistent and lower range of funding allocations for RQ. 

The insights from assignment 1 were confirmed here with the machine learning (K-Means clustering). The K-Means clustering analysis provides clear evidence of the regional funding disparities between South-East Queensland and Regional Queensland. SEQ receives a broader range of funding amounts, with some projects receiving significantly higher allocations. In contrast, RQ receives lower funding amounts. 

# Question 3
**Why is there a significant difference in fund allocation between South-East Queensland and Regional Queensland? (NMF)**

Importance of the question: Having established that there is a significant funding allocation disparity between SEQ and RQ, it is important to understand what caused it. 

To answer this question, I will use NMF (Non-negative Matrix Factorization) with Guardian articles for topic modelling. With this I will identify key topics within the articles. In the end, I will select top articles, and then filter only those articles related to politics, Queensland, Regional Queensland, and economy. I will review the article manually to see the appropriate reasons.

## Data 

I will use the cleaned data for the 200 articles from **Question 1**. I am choosing the 200-article set because it increases the chance of finding articles related to the project.

In [None]:
df_articles = df_articles_200
df_articles.head()

## Analysis 
All the cleaning, including the removal of stop words and non-alphanumeric characters, lowercasing, and tokenizing, was done in the analysis part of Question 1. 

The following code cell vectorises the text with TF-IDF, ignoring terms that occur in more than 95% of the documents (max_df=0.95) and terms that occur in fewer than 2 documents (min_df=2). The vectorization is done on the cleaned_text column of the dataframe. 

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df_articles['cleaned_text'])
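
The effect of `max_df` and `min_df` can be illustrated with a plain-Python sketch of the document-frequency filter. This covers only that one step (no stop-word removal, tokenisation rules, or TF-IDF weighting); `filter_vocabulary` and the toy documents are hypothetical.

```python
def filter_vocabulary(documents, max_df=0.95, min_df=2):
    """Keep terms found in at least min_df documents and at most max_df of all documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )

# Toy cleaned documents, for illustration only
demo_docs = [
    'queensland funding regional',
    'queensland funding health',
    'queensland mining regional',
    'queensland court',
]
demo_vocab = filter_vocabulary(demo_docs)
print(demo_vocab)
```

Here `queensland` is dropped because it appears in every document, and `health`, `mining` and `court` are dropped because each appears in only one.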

Then NMF is applied to generate 5 distinct topics. 

In [None]:
nmf = NMF(n_components=5, random_state=42)
nmf.fit(tfidf)

The following code cell displays the top words for each topic. 

In [None]:
def display_nmf_topics(model, feature_names, no_top_words):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        topics[topic_idx] = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
    return topics

no_top_words = 10
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
nmf_topics = display_nmf_topics(nmf, tfidf_feature_names, no_top_words)

# Display the topics
for topic, words in nmf_topics.items():
    print(f"Topic {topic}: {', '.join(words)}")
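
The slice `argsort()[:-no_top_words - 1:-1]` used above is compact but easy to misread: it sorts indices by ascending weight and then walks backwards from the end to take the largest `no_top_words` entries. The plain-Python sketch below mirrors that logic on made-up weights; `top_indices`, `demo_terms` and `demo_weights` are hypothetical.

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    return ascending[:-n - 1:-1]  # walk backwards from the end, n steps

# Made-up topic weights for four terms, for illustration only
demo_terms = ['funding', 'court', 'mining', 'health']
demo_weights = [0.9, 0.1, 0.4, 0.7]

demo_top = [demo_terms[i] for i in top_indices(demo_weights, 2)]
print(demo_top)
```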

Now based on the topics, I want to generate top Guardian Articles with their respective webUrls. 

In [None]:
# Transform the document-term matrix using the NMF model to get the topic distribution
topic_distribution = nmf.transform(tfidf)

# Function to get the top N articles for each topic
def get_top_articles_per_topic(topic_distribution, df_articles, top_n=10):
    top_articles = {}
    for topic_idx in range(topic_distribution.shape[1]):
        top_article_indices = np.argsort(topic_distribution[:, topic_idx])[::-1][:top_n]
        top_articles[topic_idx] = df_articles.iloc[top_article_indices][['webUrl']]
    return top_articles

# Get the top 10 articles for each topic
top_articles_per_topic = get_top_articles_per_topic(topic_distribution, df_articles, top_n=10)

# Display the top article URLs for each topic
for topic, articles in top_articles_per_topic.items():
    print(f"\nTopic {topic}:")
    for idx, row in articles.iterrows():
        print(f"- URL: {row['webUrl']}")


As expected, not all top articles are related to regional funding disparity. Therefore, I want to filter and retrieve only the articles that are related to politics, Queensland, Regional Queensland and the economy. The following code cell does this filtering on the article URLs.  

In [None]:
# Define keywords for filtering
keywords = ['politics', 'Queensland', 'Regional Queensland', 'economic']

def filter_articles(df_articles, keywords):
    filtered_articles = []
    for idx, row in df_articles.iterrows():
        if any(keyword.lower() in row['webUrl'].lower() for keyword in keywords):
            filtered_articles.append(row['webUrl'])
    return filtered_articles

# Filter the top articles
for topic, articles in top_articles_per_topic.items():
    print(f"\nTopic {topic}:")
    filtered_urls = filter_articles(articles, keywords)
    for url in filtered_urls:
        print(f"- URL: {url}")
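
One detail of URL-based filtering is that Guardian URLs separate words with hyphens, so a keyword containing a space, such as 'Regional Queensland', will not match as a plain substring. The sketch below is one possible adjustment under that assumption: `url_matches` is a hypothetical helper and the example.com URLs are made up.

```python
import re

def url_matches(url, keywords):
    """True if any keyword appears in the URL once separators become spaces."""
    slug = re.sub(r'[^a-z0-9]+', ' ', url.lower())  # hyphens, slashes, dots to spaces
    return any(keyword.lower() in slug for keyword in keywords)

# Made-up URLs, for illustration only
demo_urls = [
    'https://example.com/news/regional-queensland-funding',
    'https://example.com/sport/cricket',
]
demo_hits = [u for u in demo_urls if url_matches(u, ['Regional Queensland', 'economic'])]
print(demo_hits)
```

Because the keyword list above also contains 'Queensland' on its own, the notebook's filter already catches such URLs; the adjustment matters only if the broader keyword is removed.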

## Insights
I have reviewed the articles and following are the findings:
1. *"Queensland government approves <b> Winchester South mine </b> despite report warning of potential ‘climate change consequences’"* - This is an <a href='https://www.theguardian.com/australia-news/2024/feb/07/winchester-south-mine-queensland-federal-government-approval-emissions-whitehaven'>excerpt</a> from an article. With mining approved in one of the Regional Queensland towns, it will create many <b> job opportunities, improve the livelihood of the region, and impact the community</b>. Although the full article reveals that environmentalists warned the government not to approve the mine because of climate change concerns, introducing such a project in regional areas will help the people of that community in terms of the <b> economic benefits </b> it brings. This may have improved the sentiment of the public.
   
2. *"South East Queensland’s population is expected to grow by 2.2 million people by 2046. The state government, in consultation with communities, industry and other key stakeholders, has created a plan to respond to this growth."* - This is an <a href = "https://www.theguardian.com/qld-gov-shapingseq/2024/apr/08/the-future-of-south-east-queensland-how-a-new-plan-is-supporting-a-fast-growing-region"> excerpt</a>. This article shows one of the many **reasons why funding is mostly allocated to SEQ**. With a growing population, there is a need for a properly funded plan to avoid obstacles. The Queensland government may have allocated more to SEQ for this reason.
   
3. *"Australia’s economic growth slows in March quarter despite rise in household spending"* - This is an <a href = "https://www.theguardian.com/business/2022/jun/01/australias-economic-growth-slows-in-march-quarter-despite-rise-in-household-spending">excerpt</a> from an article. This article helps explain my insight from assignment 1 about the decreasing number of funding recipients. Given the **slow economic growth**, the government may have cut part of the budget for this program. The same article states: "In the first three months of 2022, gross domestic product rose to an annual rate of 3.3%, the Australian Bureau of Statistics said. That eased from the earlier reported annual pace of 4.2% and compared with about 3% forecast by economists."

# Policy Recommendations:
- Equitable Distribution: Implement policies that ensure funds are distributed equitably, based on clearly defined and transparent criteria that consider the unique needs of different regions within RQ. An equal split of funds between SEQ and RQ is not feasible given the differences in population, need, priority and economy. SEQ will obviously receive a larger allocation, but the distribution should be equitable so that RQ regions also receive the funding they need.
  
- Periodic Reviews: Conduct periodic reviews of fund allocation policies to assess their effectiveness and make necessary adjustments based on feedback and changing regional needs.
  
- Stakeholder Involvement: Involve local communities and stakeholders in the decision-making process to ensure that the fund distribution policy is inclusive and addresses the actual needs of the regions.
  
- Transparency and Accountability: Enhance transparency in the fund allocation process by publicly sharing criteria, decision-making processes, and outcomes. This can help build public trust and ensure accountability.

# Other Ethical Considerations 
- Methodology Documentation: The project includes clear documentation of all steps, including data collection, preprocessing, analysis, visualisations, and insights. This transparency allows for reproducibility and exploration by others, ensuring that the analysis can be independently verified.

- Clear visualisation: The use of clear and interpretable visualisations, such as word clouds, sentiment distribution histograms, and K-Means clustering plots, helps communicate findings effectively and transparently to the audience.

- Ethical Data Handling: The data from the Guardian API is used responsibly, with a clear understanding that it is intended for analysis and insights rather than any misuse.

- Avoiding Misinterpretation: The project ensures that the findings are presented accurately to avoid misinterpretation that could lead to harmful decisions. 

# Conclusion

The comprehensive analysis of the Advance Queensland Program and Grants reveals significant regional disparities in funding allocation between South-East Queensland (SEQ) and Regional Queensland (RQ). Common themes in Guardian articles highlight the centrality of government actions and public statements, with a strong focus on political leadership and public services. Sentiment analysis shows a neutral to mildly positive public reaction, with a consistent sentiment spread regardless of sample size.

K-Means clustering analysis confirms the initial findings, demonstrating that SEQ receives a broader and higher range of funding amounts, while RQ receives lower funding amounts. Several articles give context for this disparity: the approval of the Winchester South mine in RQ is seen as a job creator, while SEQ's growing population justifies higher funding to manage growth effectively. Additionally, the overall economic slowdown may have led to budget cuts, affecting the number of funding recipients in recent years.

The narrative underscores the importance of equitable fund distribution to promote balanced regional development and reduce economic inequalities across Queensland. By addressing these disparities and considering the broader socio-economic context, policymakers can make informed decisions, guided by the policy recommendations above, to support both SEQ and RQ and ensure sustainable growth and development for all regions.