

The following RQs motivated this empirical study:

RQ1. Artificial Intelligence Sentiments - What sentiments are being expressed in AI-related headlines on Reddit?

RQ2. Artificial Intelligence Topics - What are the main headlines that are being discussed about ChatGPT on Twitter?

The following Python script scrapes the "new" section of the "artificial" subreddit on Reddit using the PRAW (Python Reddit API Wrapper) library, analyzes the sentiment of the headlines using the Natural Language Toolkit (NLTK) library, and creates visualizations of the word frequency distribution of positive and negative headlines using Matplotlib and Seaborn libraries.

The code first imports the required libraries, including PRAW, NLTK, Matplotlib, and Seaborn, and then creates a Reddit instance using PRAW with a specific client ID, client secret, username, password, and user agent. The script then scrapes the "new" section of the "artificial" subreddit and stores the titles of the posts in a set called "headlines."

Next, the script uses the SentimentIntensityAnalyzer from the NLTK library to analyze the sentiment of the headlines. The polarity scores for each headline are stored in a list called "results," which is then converted into a Pandas DataFrame called "df." The DataFrame contains the polarity scores for each headline and a label (positive, neutral, or negative) based on a compound score cutoff of 0.2.

The script then creates a CSV file called "reddit_ai_headlines.csv" containing the headline and label data. It also prints out the positive and negative headlines and their frequencies and creates a bar plot showing the percentage of positive, neutral, and negative headlines.

Finally, the script processes the text of the positive and negative headlines using the NLTK tokenizer and stop words, creates a frequency distribution for each set of headlines, lists the 20 most common words in each set, and plots each word frequency distribution using Matplotlib.

In [None]:
# Import libraries needed for this project
from IPython import display
import math
import datetime as dt
from pprint import pprint
import pandas as pd
import numpy as np
!pip install praw
import praw
# Only praw is installed above, so the model classes are imported from praw rather than asyncpraw
from praw.models import MoreComments
from praw.models.reddit.comment import Comment
from praw.models.reddit.submission import Submission
from praw.models.reddit.subreddit import Subreddit
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('vader_lexicon')
nltk.download('stopwords')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Dark2')

In [9]:
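# Connect to the Reddit API using praw
# Credentials are read from environment variables instead of being hard-coded;
# the variable names below are placeholders chosen for this example
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'],
                     user_agent='reddit devscraper by u/bradmclellan')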

In [10]:
# Define the set that will hold the scraped headlines (a set discards duplicate titles)
headlines = set()

In [None]:
# Iterate through the /r/artificial subreddit using the API client
for submission in reddit.subreddit('artificial').new(limit=None):
    headlines.add(submission.title)
    display.clear_output()
    print(len(headlines))

Setting the limit to None while iterating through the "new" entries in /r/artificial returns up to 1000 headlines. Reddit caps listings at 1000 results, so retrieving more requires more sophisticated tactics. Because the headlines are stored in a set, this loop can be rerun over time to keep adding new headlines without creating duplicates, or a streaming variation can be used instead. For now, the focus is on the sentiment analysis of the AI headlines.
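The streaming variation mentioned above could look roughly like the following sketch. The helper name `collect_titles` and its `max_new` parameter are illustrative and not part of the original notebook. With PRAW, the iterable would typically come from `reddit.subreddit('artificial').stream.submissions(skip_existing=True)`; that call appears only as a comment here so the block runs without API credentials, and stand-in data is used instead.

```python
from collections import namedtuple
from itertools import islice


def collect_titles(submissions, headlines, max_new=100):
    """Add titles from an iterable of submissions to the headlines set.

    Reads at most max_new submissions, because a live stream never ends
    on its own. Returns how many titles were not already in the set.
    """
    added = 0
    for submission in islice(submissions, max_new):
        if submission.title not in headlines:
            headlines.add(submission.title)
            added += 1
    return added


# With PRAW (assumption: `reddit` is an authenticated praw.Reddit instance):
#   stream = reddit.subreddit('artificial').stream.submissions(skip_existing=True)
#   collect_titles(stream, headlines, max_new=50)

# Stand-in data so the sketch runs offline
FakeSubmission = namedtuple('FakeSubmission', ['title'])
demo_headlines = {'AI beats benchmark'}
demo_stream = [FakeSubmission('AI beats benchmark'),
               FakeSubmission('New model released')]
demo_added = collect_titles(demo_stream, demo_headlines)
print(demo_added, sorted(demo_headlines))
```

Capping the number of submissions read keeps the cell from blocking indefinitely in a notebook; the cap can be raised or the helper can be called repeatedly.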

In [None]:
# Append each sentiment dictionary to a results list
sia = SIA()
results = []

for line in headlines:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

pprint(results[:3], width=100)

In [None]:
# Transform the results list into a pandas dataframe
df = pd.DataFrame.from_records(results)
df.head()

The dataframe contains the four columns produced by the sentiment scoring (neg, neu, pos, and compound) plus the headline text. The first three give the proportion of the headline that falls into each category, and compound is the overall sentiment score, which ranges from -1 (extremely negative) to 1 (extremely positive).
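To see why the compound value stays between -1 and 1, it helps to look at the normalization step. VADER's reference implementation maps the summed word valence x to x / sqrt(x^2 + alpha), with alpha = 15 by default. The sketch below reproduces only that formula for illustration; the function name `normalize_compound` is made up for this example, and the real score should still come from `sia.polarity_scores`, which also applies rules for negation, punctuation, and capitalization before normalizing.

```python
import math


def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed valence score into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm))


# Larger raw sums move toward -1 or 1 but never pass them
for raw in (-4.0, -0.5, 0.0, 0.5, 4.0):
    print(raw, round(normalize_compound(raw), 4))
```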

Headlines with a compound value greater than 0.2 are labeled positive (1), those with a value less than -0.2 are labeled negative (-1), and the remainder are labeled neutral (0).

In [None]:
# Add a label column based on the compound score
df['label'] = 0
df.loc[df['compound'] > 0.2, 'label'] = 1
df.loc[df['compound'] < -0.2, 'label'] = -1
df.head()

In [17]:
# Save the dataframe to a csv file
df2 = df[['headline', 'label']]
df2.to_csv('reddit_ai_headlines.csv', mode='a', encoding='utf-8', index=False)

In [None]:
# Review the positive and negative headlines in the dataframe
print("Positive headlines:\n")
pprint(list(df[df['label'] == 1].headline)[:5], width=200)

print("\nNegative headlines:\n")
pprint(list(df[df['label'] == -1].headline)[:5], width=200)

In [None]:
# Check the number of positive and negatives in the dataset
print(df.label.value_counts())

print(df.label.value_counts(normalize=True) * 100)

Above, the first line of output shows the raw counts for each label, while the second shows the percentages obtained with the normalize keyword. Below is the plot showing the percentage of negative, neutral, and positive headlines.

In [None]:
# Plot the percentage distribution of negative, neutral, and positive headlines
fig, ax = plt.subplots(figsize=(8, 8))

# Sort by label (-1, 0, 1) so the bars line up with the tick labels set below
counts = df.label.value_counts(normalize=True).sort_index() * 100

sns.barplot(x=counts.index, y=counts, ax=ax)

ax.set_xticklabels(['Negative', 'Neutral', 'Positive'])
ax.set_ylabel("Percentage")

plt.show()

There appear to be two primary causes for the abundance of neutral headlines:

First, headlines with a compound value between -0.2 and 0.2 were assumed to be neutral, and the share of neutral headlines grows as that margin widens. Second, the AI posts were scored using a general-purpose lexicon. An AI-specific lexicon would likely be a better fit, but it would require hand labeling the data or locating an existing custom lexicon.
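The effect of the margin on the neutral share can be checked directly by recomputing the label percentages for several cutoffs. The following is a minimal sketch: the helper `label_shares` is hypothetical, and the scores are made-up values used only for illustration. In the notebook, `list(df['compound'])` would take their place.

```python
def label_shares(compounds, threshold):
    """Return the percentage of negative, neutral, and positive scores for a cutoff."""
    total = len(compounds)
    pos = sum(1 for c in compounds if c > threshold)
    neg = sum(1 for c in compounds if c < -threshold)
    neu = total - pos - neg
    return {
        'negative': 100 * neg / total,
        'neutral': 100 * neu / total,
        'positive': 100 * pos / total,
    }


# Made-up compound scores for illustration only
sample_compounds = [-0.6, -0.3, -0.1, 0.0, 0.0, 0.05, 0.15, 0.25, 0.4, 0.7]

# Widening the margin moves more headlines into the neutral class
for threshold in (0.05, 0.2, 0.5):
    print(threshold, label_shares(sample_compounds, threshold))
```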

In [21]:
tokenizer = RegexpTokenizer(r'\w+')

In [26]:
stop_words = stopwords.words('english')

In [27]:
# Read the headline list and perform lowercasing, tokenizing, and stopword removal
def process_text(headlines):
    tokens = []
    for line in headlines:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stop_words]
        tokens.extend(toks)
    
    return tokens

In [None]:
# Collect the most common words in the positive headlines
pos_lines = list(df[df.label == 1].headline)

pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)

pos_freq.most_common(20)

In [None]:
# Plot the frequency distribution of the most common words in the positive headlines
y_val = [x[1] for x in pos_freq.most_common()]

fig = plt.figure(figsize=(10,5))
plt.plot(y_val)

plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Word Frequency Distribution (Positive)")
plt.show()

In [None]:
# Sum the frequencies in groups of four, then plot log(frequency) against log(position)
y_final = []
for i, k, z, t in zip(y_val[0::4], y_val[1::4], y_val[2::4], y_val[3::4]):
    y_final.append(math.log(i + k + z + t))

x_val = [math.log(i + 1) for i in range(len(y_final))]

fig = plt.figure(figsize=(10,5))

plt.xlabel("Words (Log)")
plt.ylabel("Frequency (Log)")
plt.title("Word Frequency Distribution (Positive)")
plt.plot(x_val, y_final)
plt.show()

As anticipated, the log-log plot shows a nearly straight line with a noisy, heavy tail, which is the pattern typically associated with Zipf's law. The plot demonstrates that a small minority of words occur very frequently, while the majority of words occur only rarely.
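One way to put a number on "nearly straight" is to fit a least-squares line to log(frequency) against log(rank) and look at the slope. The sketch below is an assumption-laden illustration rather than part of the original analysis: the helper `loglog_slope` is hypothetical, it works on unbinned counts instead of the grouped sums plotted above, and the synthetic data exists only to check the helper. In the notebook, the input could be `[count for _, count in pos_freq.most_common()]`.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies must hold at least two positive counts sorted from
    most to least common.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts following frequency = 1000 / rank, used only to check the helper
synthetic = [1000 / rank for rank in range(1, 201)]
print(round(loglog_slope(synthetic), 3))
```

A slope close to -1 would be consistent with a classic Zipf distribution; real headline data will usually deviate, especially in the tail where many words share the same low count.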

In [None]:
# Collect the most common words in the negative headlines
neg_lines = list(df[df.label == -1].headline)

neg_tokens = process_text(neg_lines)
neg_freq = nltk.FreqDist(neg_tokens)

neg_freq.most_common(20)

In [None]:
# Plot the frequency distribution of the most common words in the negative headlines
y_val = [x[1] for x in neg_freq.most_common()]

fig = plt.figure(figsize=(10,5))
plt.plot(y_val)

plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Word Frequency Distribution (Negative)")
plt.show()

In [None]:
# Sum the frequencies in groups of three, then plot log(frequency) against log(position)
y_final = []
for i, k, z in zip(y_val[0::3], y_val[1::3], y_val[2::3]):
    if i + k + z == 0:
        break
    y_final.append(math.log(i + k + z))

x_val = [math.log(i+1) for i in range(len(y_final))]

fig = plt.figure(figsize=(10,5))

plt.xlabel("Words (Log)")
plt.ylabel("Frequency (Log)")
plt.title("Word Frequency Distribution (Negative)")
plt.plot(x_val, y_final)
plt.show()

The slope of the negative distribution is a little smoother, but the heavy tail is unmistakably present. The conclusion is the same as the one drawn from the positive distribution.

The conclusions drawn from this research show that attitudes toward AI are generally positive, but there are a few themes that show opposition to some of its features.

Literature review:

Fast, E., & Horvitz, E. (2017, February). Long-term trends in the public perception of artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).

This paper examines the long-term trends in public perception of AI. It contributes to the research question by exploring the factors influencing public sentiment towards AI. It also conceptualizes how AI is perceived, operationalizes the methods used to measure public sentiment, and describes the population under examination. Additionally, the paper discusses the importance of understanding the underlying causes of public sentiment towards AI and how that sentiment affects the development of policies to address the potential ethical concerns of AI technologies.

Bimber, B., & Gil de Zúñiga, H. (2020). The unedited public sphere. New Media & Society, 22(4), 700-715.

This paper examines the unedited public sphere and its relationship to AI. It contributes to the research question by exploring how AI is perceived in the public discourse and how it influences public sentiment. The paper conceptualizes the unedited public sphere and operationalizes the methods used to access and analyze conversations about AI. It also describes the population of people engaging in those conversations and the data selection process used for the study. Additionally, it discusses the implications of AI on public discourse and how it shapes public opinion.
