# <div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

# IFN619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774) *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder on Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774)


In [1]:
# Complete the following cell with your details and run to produce your personalised header for this assignment

from IPython.display import HTML

# personal detail
first_name = "Adnan"
last_name = "Chowdhry"
student_number = 11869828

personal_header = f"<h1>{first_name} {last_name} ({student_number})</h1>"
HTML(personal_header)

---


## CONTEXT
In Assignment 1, I established that there is an imbalance in Advance Queensland funds distribution between SEQ and non-SEQ regions. However, this imbalance is not due to population differences or a bias towards SEQ recipients. Instead, it fulfills the clear objective of promoting science, technology, and innovation initiatives. It is merely a coincidence that the majority of significant amounts are awarded to SEQ clients. Additionally, I found that the overall committed amounts have declined from 2019 to 2023. One reason for this decline could be that, after funding notable and worthy clients throughout the state, proposals for less impactful projects are now being submitted. This indicates that Advance Queensland needs to take the lead and collaborate with public and private sector organizations to initiate projects that require urgent attention.


## INTRODUCTION
In 2020, the $5 million funding commitment to AI Consortium Pty Ltd for the AI Hub seemed appropriate and future-oriented. However, from then until 2023, no further investment was made in the broad AI domain, apart from a few specific projects in areas such as health and mining. This suggests an underestimation of AI's potential to transform the future of technology. The lack of funding to either support the positive aspects of AI or mitigate its negatives is particularly concerning given the arrival of Generative AI applications such as ChatGPT, which by 2023 had sparked intense debate about the risks for users and regulators. My analysis gives a glimpse of how the world is reacting to this technology and of the risks it poses to society.



## QUESTION 1: 
What projects in the domain of AI has Advance Queensland funded, and which of them have the potential to address its broader challenges?

In [2]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random
import plotly.express as px
from sklearn.cluster import KMeans
import numpy as np

## ADVANCE QLD FUNDING RECIPIENTS DATA

In [3]:
# open a CSV file into a new dataframe from a url
funding_recipients_data = "https://www.data.qld.gov.au/dataset/db190f2d-f866-4811-9a6e-4b78744b551b/resource/0f97b985-f5c7-49d2-8b0a-bc5dfbe070b9/download/advance-queensland-funding-recipients.csv"

df = pd.read_csv(funding_recipients_data, encoding='latin 1')


## Ethical Considerations in Data Access and Use:
The Advance Queensland funding recipients data is freely available for sharing and adaptation without the need for anonymization.

In this step, I will remove the inconsistencies (thousands separators) from the "Actual Contractual Commitment ($)" column and convert it to the float data type so that I can use it in my analysis.

In [4]:
# ChatGPT helped me with the loop used in this code
# To aggregate the 'Actual Contractual Commitment ($)' column, remove the thousands separators and convert it to float

# Loop through each value in the 'Actual Contractual Commitment ($)' column
for idx, value in enumerate(df['Actual Contractual Commitment ($)']):
    # Only strings can contain a comma; missing values (NaN) are skipped
    if isinstance(value, str) and ',' in value:
        # Replace comma with empty string
        df.loc[idx, 'Actual Contractual Commitment ($)'] = value.replace(',', '')

# Convert column to float
df['Actual Contractual Commitment ($)'] = df['Actual Contractual Commitment ($)'].astype(float)

KeyError: 'Actual Contractual Commitment ($)'
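
The `KeyError` above means the column name used in the cell did not match any column in the dataframe at the time of the run. The following is a minimal sketch of a more defensive clean-up, assuming the mismatch comes from stray whitespace in the CSV header (an assumption, since the actual header is not shown). The helper `clean_amount_column` and the sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample standing in for the funding CSV; the real headers may differ.
sample = pd.DataFrame(
    {" Actual Contractual Commitment ($) ": ["1,000", "2,500,000", None, "300"]}
)

def clean_amount_column(frame, column):
    """Return a copy with trimmed headers and `column` converted to float."""
    frame = frame.copy()
    # Remove leading/trailing whitespace from every column name
    frame.columns = frame.columns.str.strip()
    if column not in frame.columns:
        # Fail with a message that lists what is actually available
        raise KeyError(f"{column!r} not found; available columns: {list(frame.columns)}")
    # Drop thousands separators, then coerce anything non-numeric to NaN
    without_commas = frame[column].astype(str).str.replace(",", "", regex=False)
    frame[column] = pd.to_numeric(without_commas, errors="coerce")
    return frame

cleaned = clean_amount_column(sample, "Actual Contractual Commitment ($)")
print(cleaned.dtypes)
```

If the column is still missing after trimming, the raised message lists the available headers, which makes it easier to see whether the dataset publisher has renamed the column.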

In [None]:
df.head(5)

In this step, I would like to confirm whether the data type has changed.

In [None]:
# listing data types of columns 
print(df.dtypes)

In this step, I am converting the 'Approval date' column to datetime format so that it can be used for comparing and extracting data.

In [None]:
# Convert 'Approval date' from object to datetime
df['Approval date'] = pd.to_datetime(df['Approval date'], format='%d/%m/%Y')

# Check the data types after conversion
print(df.dtypes)
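
The conversion above raises an error if any value does not match the `%d/%m/%Y` format. As a hedged sketch, the sample below (hypothetical values, not taken from the dataset) shows how `errors="coerce"` can be used to flag malformed dates instead of stopping the notebook.

```python
import pandas as pd

# Hypothetical sample; the third value is deliberately malformed.
dates = pd.Series(["03/02/2020", "15/11/2021", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

# Count the rows that could not be parsed so they can be inspected
print("Unparseable rows:", parsed.isna().sum())

# The explicit format guarantees day-first parsing (3 February, not 2 March)
print(parsed.iloc[0].day, parsed.iloc[0].month, parsed.iloc[0].year)
```

Whether coercion is appropriate depends on the data: silently dropping dates could bias a year-by-year comparison, so the count of unparseable rows should be checked before continuing.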

In this step, I would like to see the earliest and latest approval dates in the data so that I can make sense of the dates I will consider in my analysis.

In [None]:
# Get the first and last dates in the 'Approval date' column
first_date = df['Approval date'].min()
last_date = df['Approval date'].max()

print("First date:", first_date)
print("Last date:", last_date, '\n')

In [None]:
# Sort DataFrame by 'Approval date' in descending order
df_sorted = df.sort_values(by='Approval date', ascending=False)

In this step, I am extracting the year from 'Approval date' into a new column that I can use to group the data in my analysis.

In [None]:
# Extract year from 'Approval date' and create a new column
df_sorted.loc[:, 'Year'] = df_sorted['Approval date'].dt.year

# listing the column names
column_names = df_sorted.columns
print('List of column Names: \n')
for name in column_names:
    print(name)

In this step, I am pulling all the rows in which the project title contains the string "artificial intelligence", regardless of case.

In [None]:
# Filter rows containing the string "artificial intelligence" in the 'Investment/Project Title' column (case insensitive)
# na=False treats missing titles as non-matches instead of raising an error
filtered_df = df_sorted[df_sorted['Investment/Project Title'].str.contains('artificial intelligence', case=False, na=False)]

# Displaying the number of filtered DataFrame rows and columns
filtered_df.shape
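
Searching only for the full phrase can miss projects that use an abbreviation or a related term. The sketch below is an illustration on hypothetical titles (none are taken from the dataset), and the extra search terms are my own assumption about what might count as an AI project.

```python
import pandas as pd

# Hypothetical titles used only to demonstrate the matching behaviour
titles = pd.Series([
    "Artificial Intelligence Hub",
    "AI-driven crop monitoring",
    "Machine learning for mine safety",
    "Rail maintenance upgrade",
    None,
])

# \bAI\b matches the standalone abbreviation but not the letters inside "Rail"
pattern = r"artificial intelligence|\bAI\b|machine learning"

# case=False ignores case; na=False treats missing titles as non-matches
mask = titles.str.contains(pattern, case=False, na=False, regex=True)

print(titles[mask].tolist())
```

A broader pattern changes which projects are counted, so any extra terms should be justified in the narrative rather than added silently.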

In this step, I would like to list the project titles extracted above to understand the objective and scope of each project.

In [None]:
# Collect the project titles into a list
formatted_titles = list(filtered_df['Investment/Project Title'])

# Print each title from the list
for title in formatted_titles:
    print(title)
    print('-' * 80)  # Separator line for better readability

## INSIGHT:
Of the 10 projects listed, only 2 seem to have broader goals beyond solving a specific existing problem. The first is the Artificial Intelligence Hub, and the second is the Artificial Intelligence STEM Platform. The AI STEM Platform is focused on the education sector, whereas the AI Hub aims to connect, promote, and enhance AI capabilities. The AI Hub is a project that brings together businesses, government, and research institutes to address current and future challenges of AI with support and accountability (https://qldaihub.com/about-2/).

In this step, I would like to see the details of the funding to AI Consortium Pty Ltd to identify the stakeholders and the amount available for spending.

In [None]:
# Filter out rows belonging to the program "Artificial Intelligence Hub"
AIhub_df = filtered_df[filtered_df['Investment/Project Title'] == 'Artificial Intelligence Hub']
AIhub_df = AIhub_df[['Year', 'Investment/Project Title', 'Recipient Name', 'University Collaborator (if applicable)','Other Partners; Collaborators (if applicable)', 'Actual Contractual Commitment ($)']]
AIhub_df

## INSIGHT:
I would call this project a holistic endeavor where all the bases are covered: great purpose, partners, and financial resources. However, the objectives and funding should be revised to address future challenges, or new projects should be initiated. This has not happened, as the AI Hub remains the only project of its kind.

## QUESTION 2:
At this point in my analysis, I want to highlight how power-hungry tech companies are reacting to the arrival of OpenAI's ChatGPT application. This reaction underscores the intense challenges that users, regulators, and lawmakers face in handling AI. These challenges are likely to become even more significant, requiring consistent effort to survive and thrive.

## UNSTRUCTURED DATA FROM GUARDIAN OPEN PLATFORM API

In [None]:
# Load the data - articles from The Guardian previously saved to a JSON file
file_path = ""
file_name = "microsoft_invest_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")
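
The JSON file loaded above was saved earlier from the Guardian Open Platform. As a rough sketch of how such a file could be produced, the code below builds the query parameters for the content search endpoint and converts a response into a `{title: body text}` dictionary. That layout is an assumption inferred from how the notebook uses `articles.keys()` as titles. The helper names and the mock payload are hypothetical, and no network request is made here.

```python
import json

GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"

def build_search_params(query, api_key, from_date, to_date, page=1):
    """Assemble query parameters for the Guardian content search endpoint."""
    return {
        "q": query,
        "from-date": from_date,      # ISO date, e.g. "2023-01-01"
        "to-date": to_date,
        "show-fields": "bodyText",   # ask for the plain article text
        "page-size": 50,
        "page": page,
        "api-key": api_key,
    }

def results_to_articles(payload):
    """Map article title -> body text, skipping items without text."""
    articles = {}
    for item in payload.get("response", {}).get("results", []):
        body = item.get("fields", {}).get("bodyText", "")
        if body:
            articles[item["webTitle"]] = body
    return articles

# Mock payload shaped like a search response (values are made up)
mock_payload = {
    "response": {
        "results": [
            {"webTitle": "Example headline A", "fields": {"bodyText": "Body A"}},
            {"webTitle": "Example headline B", "fields": {}},
        ]
    }
}

params = build_search_params("chatgpt", "YOUR-API-KEY", "2023-01-01", "2023-12-31")
articles_sketch = results_to_articles(mock_payload)
print(json.dumps(articles_sketch, indent=2))

# With the requests library installed, a live call would look like:
# payload = requests.get(GUARDIAN_SEARCH_URL, params=params).json()
```

Keeping the API key out of the notebook (for example, in an environment variable) avoids publishing it when the assignment folder is zipped and submitted.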

## Ethical Considerations in Data Access and Use:
The Guardian news data is available for free use in any non-profit project. Access to this data has been granted following registration and the acquisition of a developer key. All necessary guidelines and protocols provided by The Guardian for accessing and using the data have been followed.

In this step, I show the titles of articles from 2023 that contain the keywords ChatGPT or OpenAI.

In [None]:
keywords = ["chatgpt", "openai"]

for i, title in enumerate(articles.keys()):
    if any(keyword.lower() in title.lower() for keyword in keywords):
        print(f"Title {i+1}: {title}")

## INSIGHT:
The above output shows that a number of tech giants, such as Microsoft (Title 3) and Amazon (Title 6), are investing billions in AI, which confirms the importance of the technology and its potential to generate profits for these organizations. The scale of investment also indicates how quickly the technology is likely to evolve. To conclude, embracing this technology will not only bring convenience to society but also challenge every one of us at different levels, which is why there is an urgent need to monitor it and adapt in order to benefit from it.

## Ethical Consideration:
Asserting that major tech companies' heavy investments in AI-driven projects, such as ChatGPT, are driving a significant shift in technology is not necessarily baseless or negative publicity for these organizations. Rather, it reflects a trend observed in multiple articles included in this analysis, indicating that numerous knowledgeable individuals share the same perspective. For instance, titles such as "Title 4: UK watchdog to examine Microsoft’s partnership with OpenAI" and "Title 31: The OpenAI meltdown will only accelerate the artificial intelligence race" by Sarah Kreps highlight the growing attention and scrutiny surrounding AI advancements and collaborations. To justify the heavy investments in AI, it is critical for stakeholders to balance the rapid advancement and integration of AI technology against its potential societal impacts.


## QUESTION 3:
What risks does AI technology such as ChatGPT pose if users, regulators, and lawmakers do not take appropriate and timely action to control its spread into society?

## UNSTRUCTURED DATA FROM GUARDIAN OPEN PLATFORM API
To answer the above question, I use the Guardian Open Platform to retrieve news articles on this issue, and then apply TF-IDF with NMF topic modelling to extract latent topics on the risks of AI from the corpus of documents.

In [None]:
# Load the data - articles from The Guardian about AI risks to society
file_path = ""
file_name = "AI risks to society.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    documents = json.load(fp)

print(f"Loaded {len(documents)} documents from {file_name}")

In this step, I will convert the collection of documents, which I extracted from the Guardian Open Platform and saved in a JSON file, into a matrix of TF-IDF features (terms). I used TF-IDF because it provides an effective way to preprocess text data for NMF, including normalization and feature selection.

In [None]:
# Only keep terms that appear in at most 75% of documents and in at least 2 documents.
# Keep a maximum of 10000 terms, and remove common English stop words
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=2, max_features=10000, stop_words="english"
)

In this step, I will compute the TF-IDF representation of the documents, retrieve the feature names, and display the TF-IDF vector for the first document in array format.

In [None]:
# Get the document vectors
tfidf_ft_matrix = tfidf_vectorizer.fit_transform(documents.values())

# Get the feature names (terms)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Display the vector for the first document
tfidf_ft_matrix.toarray()[0]

In this step, I will initialize an NMF model and fit it to the TF-IDF matrix, which produces document-topic and topic-term matrices. These matrices allow me to analyze the distribution of topics across documents and the distribution of terms within topics, both of which are crucial for interpreting the results of topic modelling. I used NMF because it generates non-negative topic-term and document-topic matrices, which makes the topics more interpretable.

In [None]:
# Set the number of topics
num_topics = 15

# Create the model
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius', max_iter=500, random_state=42)

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_ft_matrix)

topic_term_nmf = nmf_model.components_
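
NMF approximates the TF-IDF matrix $X$ (documents × terms) as a product $X \approx WH$, where $W$ is the document-topic matrix and $H$ is the topic-term matrix. The sketch below illustrates this on a hypothetical 4 × 5 matrix. The `init="nndsvda"` setting is my own choice for a deterministic toy example and differs from the `init='random'` used in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
X = np.array([
    [1.0, 0.8, 0.0, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.2],
])

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=42)
W = toy_nmf.fit_transform(X)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)

# Frobenius norm of X - WH: a smaller value means a closer approximation
print(np.linalg.norm(X - W @ H))

# Topic with the largest weight for each document
print(W.argmax(axis=1))
```

Comparing the reconstruction error for several values of `n_components` is one way to check whether 15 topics is a reasonable choice for the corpus.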

In this step, I will store all the topics and their terms in the dictionary and print them.

In [None]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()



In this step, I would like to label the topics produced above according to their terms and weights for better understanding and the identification of topics specific to AI risks.

In [None]:
# Inspect and label each topic
topic_labels = {}

# Example labels based on provided topics and their top terms
topic_labels[0] = "Australian Politics"
topic_labels[1] = "Indigenous America"  
topic_labels[2] = "Epstein Cases" 
topic_labels[3] = "Australian News" 
topic_labels[4] = "Cybersecurity"
topic_labels[5] = "Deception and Privacy(AI)" 
topic_labels[6] = "Identification and Surveillance(AI)"
topic_labels[7] = "Disinformation(AI)" 
topic_labels[8] = "Market Indicators"
topic_labels[9] = "Safety and Regulations(AI)"
topic_labels[10] = "Online Harms and legal Issues"
topic_labels[11] = "Market Competition"
topic_labels[12] = "Political Conspiracies" 
topic_labels[13] = "Global Economic Forum (Davos)" 
topic_labels[14] = "Gaza and Israel War"

# Verify the topic labels
for index, label in topic_labels.items():
    print(f"Topic {index}: {label}")
    print(f"Top terms: {nmf_topic_dict[f'topic_{index}']}")
    print()


## VISUALIZATION

In [None]:
# Code generated with the help of ChatGPT

# Calculate topic distribution (sum of topic weights across all documents)
topic_distribution = doc_topic_nmf.sum(axis=0)

# Update topic_labels_list based on the updated topic_labels dictionary
topic_labels_list = [topic_labels[i] for i in range(num_topics)]

# Convert topic distribution and labels to a DataFrame
topic_distribution_df = pd.DataFrame({'Topics': topic_labels_list, 'Document Count': topic_distribution})

# Create a custom color palette for the bars
custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', 
                 '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
                 '#ff0000', '#00ff00', '#0000ff', '#ffff00', '#00ffff']

# Create a bar plot using Plotly Express with custom colors
fig = px.bar(topic_distribution_df, x='Topics', y='Document Count', 
             title='Topic Distribution',
             labels={'Document Count': 'Document Count', 'Topics': 'Topics'},
             color='Document Count',
             color_continuous_scale=custom_colors)

# Rotate x-axis labels
fig.update_layout(xaxis=dict(tickangle=45))

# Increase the height of the graph
fig.update_layout(height=600)

# Show the plot
fig.show()


## INSIGHTS:
The chart visualizes the relationship between topics and their document counts within the text corpus.

X-axis (Topics): Each bar on the chart represents a topic derived from the text corpus. The topics are labeled based on their content or themes, as assigned in the topic_labels dictionary.

Y-axis (Document Count): The height of each bar is the sum of that topic's NMF weights across all documents (computed with `doc_topic_nmf.sum(axis=0)`), so it measures the overall prevalence of the topic in the corpus rather than a literal number of documents.

To conclude, by combining TF-IDF with NMF topic modelling, I have identified four significant risks or challenges facing users, regulators, and lawmakers in the context of AI in the near future:

- Deception and Privacy Risk
- Identification and Surveillance Risk
- Safety and Regulations Risk
- Disinformation Risk

The prominence of the Safety and Regulations topic in the chart underscores the urgent need for swift and comprehensive directives aimed at addressing safety concerns and establishing robust regulatory frameworks to mitigate the potential risks associated with AI technologies.
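
The bars in the chart come from `doc_topic_nmf.sum(axis=0)`, which adds up topic weights rather than counting documents. One way to obtain a literal document count per topic is to assign each document to its highest-weight topic. The sketch below shows this on a hypothetical 5 × 3 document-topic matrix, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical document-topic weights (5 documents, 3 topics)
doc_topic = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.2, 0.8],
    [0.5, 0.4, 0.1],
])

# Aggregate weight per topic (what the bar chart above plots)
weight_sum = doc_topic.sum(axis=0)

# Dominant topic per document, then the number of documents per topic
dominant_topic = doc_topic.argmax(axis=1)
doc_count = np.bincount(dominant_topic, minlength=doc_topic.shape[1])

print("Summed weights:", weight_sum)
print("Documents per dominant topic:", doc_count)
```

The two measures can rank topics differently, because a topic may carry small weights in many documents without being dominant in any of them.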

## Ethical Consideration: Unawareness and Lack of Capacity:

My conclusion that swift and comprehensive directives are needed to address safety concerns and establish robust regulatory frameworks for AI may be weak on ethical grounds, because regulators and lawmakers may not have the technical background necessary to fully comprehend the complexities of AI technology. This gap in understanding can lead to misinformed decisions, inadequate regulations, and an inability to anticipate or mitigate risks effectively.

## FURTHER ANALYSIS USING MACHINE LEARNING TECHNIQUE

In this step, I group the documents into clusters based on their topic distributions using the K-means machine learning technique, and then determine and assign a dominant topic to each cluster, which again emphasises the importance of the topics within the documents. It also organizes the documents and identifies common themes among them.

In [None]:
# Fit KMeans clustering model
num_clusters = num_topics  # Using the number of topics as the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(doc_topic_nmf)

cluster_labels

In [None]:
# Code generated with the help of ChatGPT

# Group documents by cluster
clusters = {i: [] for i in range(num_clusters)}
for doc_index, cluster_label in enumerate(cluster_labels):
    clusters[cluster_label].append(doc_index)

# Determine the dominant topic in each cluster
cluster_topic_distribution = np.zeros((num_clusters, num_topics))

for cluster_label, doc_indices in clusters.items():
    for doc_index in doc_indices:
        cluster_topic_distribution[cluster_label] += doc_topic_nmf[doc_index]

# Assign cluster names based on the dominant topics
cluster_names = {}
for cluster_label in range(num_clusters):
    dominant_topic = np.argmax(cluster_topic_distribution[cluster_label])
    cluster_names[cluster_label] = topic_labels[dominant_topic]

# Print the cluster name and documents in each cluster
for cluster_label, doc_indices in clusters.items():
    print(f"Cluster {cluster_label} ({cluster_names[cluster_label]}): {doc_indices}")

## INSIGHTS:

The above output illustrates the following:

1. Clustering Documents: By using KMeans clustering, the code groups documents that have similar topic distributions (as identified by NMF).
2. Determining Dominant Topics: For each cluster, it calculates the overall topic distribution and identifies the most prevalent topic within the cluster.
3. Assigning Cluster Names: Based on the dominant topic, it assigns a descriptive name to each cluster.
4. Visualizing Results: Finally, it prints out the cluster names and the documents belonging to each cluster, which helps in understanding the composition of each cluster in terms of topic distribution.

## Ethical Point of View: Verification and Validation:

My limited understanding of K-means means I may not fully grasp the nuances and potential pitfalls of the algorithm, such as how it handles different types of data or the influence of initial centroids. This can lead to weak reasoning and conclusions based on the algorithm's output, undermining the credibility of my analysis. It becomes difficult to ensure that the clusters identified are meaningful and accurately represent the underlying data patterns.
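
One hedged way to address the validation concern above is to compare candidate cluster counts with the silhouette score and to run K-means from several initial centroids (`n_init`). The sketch below uses hypothetical 2-D points with three well-separated groups. In the notebook, `points` would be replaced by `doc_topic_nmf`, and the candidate range of `k` is my own assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: 30 points scattered around each of three centres
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.2, size=(30, 2)) for c in centres])

scores = {}
for k in range(2, 6):
    # n_init=10 repeats the fit from 10 different initial centroids
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

A low silhouette score for the chosen number of clusters would suggest that setting the cluster count equal to the number of topics is not well supported by the data.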
