#Name of student: ADITI PALA


#Sentiment Polarity - Classifier Building

---


In [1]:
import nltk
nltk.download('sentence_polarity')
from nltk.corpus import sentence_polarity
import random

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!


In [2]:
documents = [(sent, cat) for cat in sentence_polarity.categories()
	for sent in sentence_polarity.sents(categories=cat)]

print(documents[0])
print(documents[-1])
#displaying to check the last and the first entry in the list

(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'neg')
(['provides', 'a', 'porthole', 'into', 'that', 'noble', ',', 'trembling', 'incoherence', 'that', 'defines', 'us', 'all', '.'], 'pos')


Feature Extraction:
Define the set of words that will be used for features.
Limit the words to the 2000 most frequent words.


In [3]:
# Get all words in the entire document collection
all_words_list = [word for (sent, cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)

# Get the 2000 most frequently appearing keywords in the corpus
word_items = all_words.most_common(2000)
word_features = [word for (word, count) in word_items]
print(word_features[:50])


['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'it', 'as', 'but', 'with', 'film', 'this', 'for', 'its', 'an', 'movie', "it's", 'be', 'on', 'you', 'not', 'by', 'about', 'more', 'one', 'like', 'has', 'are', 'at', 'from', 'than', '"', 'all', '--', 'his', 'have', 'so', 'if', 'or', 'story', 'i', 'too', 'just', 'who', 'into', 'what']


####Bag of words
This section builds a sentiment analysis model using the Bag of Words approach and a Naive Bayes classifier.
The most frequent words are selected as features, and a feature set is created for each document that records which of those words are present.

The Bag of Words approach represents a text by the words it contains, ignoring word order and structure. A common variant counts how often each word occurs; the version used here records only whether each word is present (a Boolean value). Either way, the text is converted into a format suitable for machine learning algorithms.
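To make the difference between counting words and recording their presence concrete, here is a minimal sketch using only the standard library. The vocabulary and the tokenized sentence are made up for illustration and do not come from the corpus.

```python
from collections import Counter

# Hypothetical toy vocabulary and tokenized sentence (not from the corpus)
vocabulary = ["good", "bad", "movie"]
tokens = ["good", "movie", ",", "good", "cast"]

# Count-based bag of words: how often each vocabulary word occurs
counts = Counter(tokens)
count_features = {"V_{}".format(w): counts[w] for w in vocabulary}

# Presence-based bag of words: only whether each vocabulary word occurs
token_set = set(tokens)
presence_features = {"V_{}".format(w): (w in token_set) for w in vocabulary}

print(count_features)
print(presence_features)
```

Words outside the vocabulary (here "cast" and the comma) are ignored by both variants.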






In [4]:
#Bag of Words (BOW) Features

# Define a function to extract BOW features for a document
def document_features(document, word_features):
    document_words = set(document)
    features = {}

    # For each word in word_features, create a feature 'V_<word>' with a Boolean value
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)

    return features

# Get feature sets for each document, including BOW features and category feature
featuresets = [(document_features(d, word_features), c) for (d, c) in documents]

# The feature sets are now ready for sentiment classification
# You can check the features for the first document as an example
print(featuresets[0])


({'V_.': True, 'V_the': False, 'V_,': True, 'V_a': False, 'V_and': True, 'V_of': False, 'V_to': False, 'V_is': False, 'V_in': False, 'V_that': False, 'V_it': False, 'V_as': False, 'V_but': False, 'V_with': False, 'V_film': False, 'V_this': False, 'V_for': False, 'V_its': False, 'V_an': False, 'V_movie': False, "V_it's": False, 'V_be': False, 'V_on': False, 'V_you': False, 'V_not': False, 'V_by': False, 'V_about': False, 'V_more': False, 'V_one': False, 'V_like': False, 'V_has': False, 'V_are': False, 'V_at': False, 'V_from': False, 'V_than': False, 'V_"': False, 'V_all': False, 'V_--': False, 'V_his': False, 'V_have': False, 'V_so': False, 'V_if': False, 'V_or': False, 'V_story': False, 'V_i': False, 'V_too': False, 'V_just': False, 'V_who': False, 'V_into': False, 'V_what': False, 'V_most': False, 'V_out': False, 'V_no': False, 'V_much': False, 'V_even': False, 'V_good': False, 'V_up': False, 'V_will': False, 'V_comedy': False, 'V_time': False, 'V_can': False, 'V_some': False, 'V_char

In [5]:
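# Hold out the first 1000 feature sets for testing and train on the rest.
# Note: documents is built one category at a time and is not shuffled, so this
# test slice contains only 'neg' sentences; shuffling documents (for example
# with random.shuffle) before building featuresets would give a mixed test set.
train_set, test_set = featuresets[1000:], featuresets[:1000]

# Train the Naive Bayes classifier on the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)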

# Evaluate the accuracy of the classifier
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Classifier Accuracy: {:.2%}".format(accuracy))


Classifier Accuracy: 71.40%


The function show_most_informative_features lists the top-ranked features, ordered by how much more likely each feature value is under one label than under the other.

In [6]:

# Display the most informative features
classifier.show_most_informative_features(30)


Most Informative Features
               V_generic = True              neg : pos    =     18.5 : 1.0
               V_routine = True              neg : pos    =     17.6 : 1.0
            V_engrossing = True              pos : neg    =     17.6 : 1.0
                  V_warm = True              pos : neg    =     17.6 : 1.0
             V_wonderful = True              pos : neg    =     17.6 : 1.0
                V_boring = True              neg : pos    =     16.0 : 1.0
              V_mediocre = True              neg : pos    =     15.2 : 1.0
              V_powerful = True              pos : neg    =     15.1 : 1.0
                  V_flat = True              neg : pos    =     14.0 : 1.0
             V_inventive = True              pos : neg    =     12.7 : 1.0
            V_unexpected = True              pos : neg    =     12.7 : 1.0
                V_unless = True              neg : pos    =     12.7 : 1.0
                  V_dull = True              neg : pos    =     11.6 : 1.0
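The ratios in this listing compare how likely a feature value is under one label versus the other. The sketch below shows one way such a ratio can be computed from counts. The counts are hypothetical, and the additive smoothing constant of 0.5 is an assumption made for illustration rather than a statement about the classifier's internals.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: sentences per label in which the feature is True
true_in_neg, total_neg = 36, 5000
true_in_pos, total_pos = 1, 5000

p_neg = smoothed_prob(true_in_neg, total_neg)
p_pos = smoothed_prob(true_in_pos, total_pos)

# How many times more likely the feature is under 'neg' than under 'pos'
ratio = p_neg / p_pos
print("neg : pos = {:.1f} : 1.0".format(ratio))
```

Large ratios tend to come from words that are rare overall but appear almost exclusively under one label.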

###Sentiment Analysis - Negation
Negation handling is one part of the sentiment analysis methodology. The approach looks for negation words (such as "not," "never," and "no") and marks words as negated, so that the classifier can treat, for example, "like" and a negated "like" as different features. The goal of this tactic is to mark the single word affected by each negation. Negated contractions (such as "doesn't") also signal negation, but the word list below does not include them.

####Bag-of-Words + Negation

In [7]:
# Define a list of negation words
negation_words = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather', 'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']

# Define a function to extract features for sentiment analysis with negation handling
def document_features_with_negation(document, word_features, negation_words):
    document_words = set(document)
    features = {}

    is_negated = False
    for word in document:
        # If a negation word is encountered, set the negation flag to True
        if word in negation_words:
            is_negated = True

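        # Note: the flag is checked in the same iteration that sets it, so the
        # 'V_NOT' prefix is attached to the negation word itself (e.g. 'V_NOTnot')
        # and the flag is cleared before the following word is processed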
        if is_negated:
            features['V_NOT{}'.format(word)] = True
            is_negated = False
        else:
            features['V_{}'.format(word)] = True

    return features
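In the function above, the negation flag is set and consumed within the same loop iteration, so the `V_NOT` prefix lands on the negation word rather than on the word after it. The following is a minimal sketch of a variant that prefixes the following word instead. The function name is hypothetical, and this is not the function that produced the scores reported in this notebook.

```python
def negate_next_word_features(document, negation_words):
    """Mark the word that follows a negation word with a 'V_NOT' prefix."""
    features = {}
    is_negated = False
    for word in document:
        if word in negation_words:
            # Record the negation word itself, then flag the next word
            features['V_{}'.format(word)] = True
            is_negated = True
            continue
        if is_negated:
            features['V_NOT{}'.format(word)] = True
            is_negated = False
        else:
            features['V_{}'.format(word)] = True
    return features

example = negate_next_word_features(['i', 'do', 'not', 'like', 'it'],
                                    {'not', 'never', 'no'})
print(sorted(example))
```

Switching to a variant like this changes the feature sets, so the cross-validation scores would need to be recomputed.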


In [8]:

# Get feature sets for each document, including BOW features and category feature
featuresets = [(document_features_with_negation(d, word_features, negation_words), c) for (d, c) in documents]


In [9]:
# Define the number of folds for cross-validation
num_folds = 5



In [10]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)


In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize lists to store evaluation metrics
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for train_index, test_index in kf.split(featuresets):
    train_set = [featuresets[i] for i in train_index]
    test_set = [featuresets[i] for i in test_index]

    # Train the classifier
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Test the classifier
    test_labels = [label for (_, label) in test_set]
    predicted_labels = [classifier.classify(features) for (features, _) in test_set]

    # Calculate and store evaluation metrics for this fold
    accuracy_scores.append(accuracy_score(test_labels, predicted_labels))
    precision_scores.append(precision_score(test_labels, predicted_labels, average='weighted'))
    recall_scores.append(recall_score(test_labels, predicted_labels, average='weighted'))
    f1_scores.append(f1_score(test_labels, predicted_labels, average='weighted'))


In [231]:
# Calculate and print the mean and standard deviation of scores
mean_accuracy = sum(accuracy_scores) / num_folds
mean_precision = sum(precision_scores) / num_folds
mean_recall = sum(recall_scores) / num_folds
mean_f1 = sum(f1_scores) / num_folds

std_accuracy = (sum((x - mean_accuracy) ** 2 for x in accuracy_scores) / num_folds) ** 0.5
std_precision = (sum((x - mean_precision) ** 2 for x in precision_scores) / num_folds) ** 0.5
std_recall = (sum((x - mean_recall) ** 2 for x in recall_scores) / num_folds) ** 0.5
std_f1 = (sum((x - mean_f1) ** 2 for x in f1_scores) / num_folds) ** 0.5

# Print the mean and standard deviation of scores
print("\nMean Scores:")
print(f"Accuracy: {mean_accuracy:.4f}")
print(f"Precision: {mean_precision:.4f}")
print(f"Recall: {mean_recall:.4f}")
print(f"F1-Score: {mean_f1:.4f}")

print("\nStandard Deviation:")
print(f"Accuracy: {std_accuracy:.4f}")
print(f"Precision: {std_precision:.4f}")
print(f"Recall: {std_recall:.4f}")
print(f"F1-Score: {std_f1:.4f}")



Mean Scores:
Accuracy: 0.7704
Precision: 0.7707
Recall: 0.7704
F1-Score: 0.7704

Standard Deviation:
Accuracy: 0.0094
Precision: 0.0094
Recall: 0.0094
F1-Score: 0.0094
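The standard deviation above divides by the number of folds, which is the population form. As a small sketch, the standard library's `statistics` module gives the same value through `pstdev`, while `stdev` gives the sample form that divides by n - 1. The fold scores below are hypothetical.

```python
import statistics

fold_scores = [0.76, 0.78, 0.77, 0.75, 0.79]  # hypothetical per-fold accuracies

mean_score = statistics.fmean(fold_scores)
pop_std = statistics.pstdev(fold_scores)     # divides by n
sample_std = statistics.stdev(fold_scores)   # divides by n - 1

# Manual population standard deviation, written as in the cell above
manual_std = (sum((x - mean_score) ** 2 for x in fold_scores) / len(fold_scores)) ** 0.5
print(mean_score, pop_std, sample_std)
```

With only five folds the two forms differ noticeably, so it helps to state which one is reported.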


###Information on model feature & experiments:

Objective:
- The objective of this experiment was to enhance the accuracy of sentiment analysis by addressing the challenges posed by negation. In natural language, negation is a linguistic phenomenon where words or phrases are used to negate or reverse the meaning of other words or phrases. Negation handling aims to correctly identify and interpret sentences containing negation words such as "not," "never," "no," and similar terms, which can significantly affect the sentiment expressed.

Features:
- To address negation, we implemented a negation handling mechanism that identifies negation words and marks the single word each one negates. For instance, in the sentence "I do not like this product," "not" is a negation word and "like" is the negated word. Marking "like" as negated is meant to capture the sentiment correctly, which is negative in this example.

Classifier:
- For the sentiment analysis task with negation handling, we selected the Naive Bayes classifier. Naive Bayes is a widely used machine learning algorithm for text classification. It treats each feature as independent given the label, so it does not model context or relationships between words directly; the negation-marked features are what let it distinguish a negated word from the same word used plainly.

- In this experiment, the Naive Bayes classifier was trained on labeled sentences and evaluated with 5-fold cross-validation, using feature sets that include the negation markers.

Overall, the experiment aimed to test whether incorporating negation handling leads to more accurate sentiment classification.
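The way a Naive Bayes classifier combines Boolean features can be sketched in a few lines: each label's score is its log prior plus the sum of log likelihoods of the observed feature values, with every feature treated independently. All probabilities below are hypothetical, and the function name is illustrative only.

```python
import math

def naive_bayes_log_scores(features, priors, likelihoods):
    """Return log P(label) + sum of log P(feature=value | label) per label."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for name, value in features.items():
            p_true = likelihoods[label][name]
            score += math.log(p_true if value else 1.0 - p_true)
        scores[label] = score
    return scores

# Hypothetical probabilities of each feature being True under each label
priors = {'pos': 0.5, 'neg': 0.5}
likelihoods = {
    'pos': {'V_NOTlike': 0.01, 'V_wonderful': 0.05},
    'neg': {'V_NOTlike': 0.04, 'V_wonderful': 0.005},
}

scores = naive_bayes_log_scores({'V_NOTlike': True, 'V_wonderful': False},
                                priors, likelihoods)
prediction = max(scores, key=scores.get)
print(prediction)
```

Because the features are combined independently, a negation marker only helps if it is encoded as its own feature, as `V_NOTlike` is here.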

###Sentiment Analysis - Stopwords

####Bag-of-Words + Stopwords removal

In [13]:
# Download NLTK stopwords data (if not already downloaded)
nltk.download('stopwords')
negationwords = []

# Define the NLTK stopwords list and remove some negation words
stopwords = nltk.corpus.stopwords.words('english')
negationwords.extend(['ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'])
new_stopwords = [word for word in stopwords if word not in negationwords]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:

# Remove stop words from the all words list
new_all_words_list = [word for (sent, cat) in documents for word in sent if word not in new_stopwords]

# Define a new all words dictionary and get the 2000 most common words as new_word_features
new_all_words = nltk.FreqDist(new_all_words_list)
new_word_items = new_all_words.most_common(2000)
new_word_features = [word for (word, count) in new_word_items]
print(new_word_features[:30])


['.', ',', 'film', 'movie', 'one', 'like', '"', '--', 'story', 'much', 'even', 'good', 'comedy', 'time', 'characters', 'little', 'way', 'funny', 'make', 'enough', 'never', 'makes', 'may', 'us', 'work', 'best', 'bad', 'director', ')', '?']
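The filtering step above can be illustrated without downloading any corpus data. This is a minimal sketch with a made-up stopword list and two made-up sentences; `keep_words` plays the role of the words deliberately excluded from the stopword list.

```python
from collections import Counter

# Hypothetical miniature stopword list and corpus, for illustration only
stopword_list = {'the', 'is', 'a', 'and', 'not'}
keep_words = {'not'}  # negation-bearing words we do not want to discard
effective_stopwords = stopword_list - keep_words

sentences = [
    ['the', 'film', 'is', 'not', 'a', 'good', 'film'],
    ['a', 'good', 'story', 'and', 'the', 'cast', 'is', 'good'],
]

# Drop stopwords, then rank the remaining words by frequency
filtered = [w for sent in sentences for w in sent if w not in effective_stopwords]
top_words = [w for w, _ in Counter(filtered).most_common(2)]
print(top_words)
```

Punctuation tokens are not stopwords, which is why "." and "," still lead the frequency list in the output above.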


###SL Features Creation
The readSubjectivity function reads the subjectivity lexicon file and stores each word's subjectivity information in a dictionary. Calling it on the specified file creates the SL dictionary.

In [226]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [16]:
from google.colab import files
#uploaded = files.upload()
SLpath = "/content/drive/MyDrive/Colab Notebooks/IST 664/2023/subjclueslen1-HLTEMNLP05.tff"

In [17]:


# this function returns a dictionary where you can look up words and get back
# four items of subjectivity information: strength, part-of-speech tag,
# whether the word is stemmed, and polarity
def readSubjectivity(path):
    # initialize an empty dictionary
    sldict = {}
    # open the lexicon file and close it automatically when done
    with open(path, 'r') as flexicon:
        for line in flexicon:
            fields = line.split()   # default is to split on whitespace
            # split each field on the '=' and keep the second part as the value
            strength = fields[0].split("=")[1]
            word = fields[2].split("=")[1]
            posTag = fields[3].split("=")[1]
            stemmed = fields[4].split("=")[1]
            polarity = fields[5].split("=")[1]
            isStemmed = (stemmed == 'y')
            # put a dictionary entry with the word as the keyword
            #     and a list of the other values
            sldict[word] = [strength, posTag, isStemmed, polarity]
    return sldict
SL = readSubjectivity(SLpath)
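The parsing relies on each lexicon line being a series of whitespace-separated `key=value` fields. As a sketch, one such line can be turned into a dictionary keyed by field name instead of by position. The sample line is an assumed example of the layout, and `parse_lexicon_line` is a hypothetical helper.

```python
def parse_lexicon_line(line):
    """Split one 'key=value key=value ...' line into a dictionary."""
    return dict(field.split('=', 1) for field in line.split() if '=' in field)

# Assumed sample entry in the lexicon's key=value layout
sample = "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative"
entry = parse_lexicon_line(sample)
print(entry['word1'], entry['type'], entry['priorpolarity'])
```

Looking fields up by name makes the parser less sensitive to a stray extra token on a line than fixed positions such as `fields[5]`.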

In [18]:
# Define a function to extract features for sentiment analysis
def SL_features(document, word_features, SL):
    document_words = set(document)
    features = {}

    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)

    # count variables for the 4 classes of subjectivity
    weakPos = 0
    strongPos = 0
    weakNeg = 0
    strongNeg = 0

    for word in document_words:
        if word in SL:
            strength, posTag, isStemmed, polarity = SL[word]
            if strength == 'weaksubj' and polarity == 'positive':
                weakPos += 1
            if strength == 'strongsubj' and polarity == 'positive':
                strongPos += 1
            if strength == 'weaksubj' and polarity == 'negative':
                weakNeg += 1
            if strength == 'strongsubj' and polarity == 'negative':
                strongNeg += 1
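            # Strong entries count double. These two keys are only set once a
            # lexicon word has been seen, so a document with no lexicon words
            # has neither key.
            features['positivecount'] = weakPos + (2 * strongPos)
            features['negativecount'] = weakNeg + (2 * strongNeg)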

    return features
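The weighting used for the two lexicon scores can be shown in isolation. This is a simplified sketch with a hypothetical three-word lexicon. It treats every entry that is not `strongsubj` as weak, and it always returns both scores, including zeros when no lexicon word is present.

```python
def polarity_scores(words, lexicon):
    """Return (positive, negative) scores; strong entries count double."""
    positive = negative = 0
    for word in set(words):
        if word not in lexicon:
            continue
        strength, polarity = lexicon[word]
        weight = 2 if strength == 'strongsubj' else 1
        if polarity == 'positive':
            positive += weight
        elif polarity == 'negative':
            negative += weight
    return positive, negative

# Hypothetical three-word lexicon: word -> (strength, polarity)
toy_lexicon = {
    'wonderful': ('strongsubj', 'positive'),
    'nice': ('weaksubj', 'positive'),
    'dull': ('weaksubj', 'negative'),
}
scores = polarity_scores(['a', 'wonderful', 'but', 'dull', 'and', 'nice', 'film'],
                         toy_lexicon)
print(scores)
```

Returning zeros for documents without lexicon words keeps every feature set the same shape.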


In [19]:
# Create SL_featuresets using the new_word_features
SL_featuresets2 = [(SL_features(d, new_word_features, SL), c) for (d, c) in documents]




In [20]:
import collections

In [21]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Define the number of folds for cross-validation
num_folds = 5

kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Initialize lists to store evaluation metrics for the second classifier
accuracy_scores_2 = []
precision_scores_2 = []
recall_scores_2 = []
f1_scores_2 = []

for train_index, test_index in kf.split(SL_featuresets2):
    train_set = [SL_featuresets2[i] for i in train_index]
    test_set = [SL_featuresets2[i] for i in test_index]

    # Train the second classifier with NLTK's NaiveBayesClassifier (classifier2)
    classifier2 = nltk.NaiveBayesClassifier.train(train_set)

    # Test the second classifier
    test_labels = [label for (_, label) in test_set]
    predicted_labels_2 = [classifier2.classify(features) for (features, _) in test_set]

    # Calculate and store evaluation metrics for the second classifier
    accuracy_scores_2.append(accuracy_score(test_labels, predicted_labels_2))
    precision_scores_2.append(precision_score(test_labels, predicted_labels_2, average='weighted'))
    recall_scores_2.append(recall_score(test_labels, predicted_labels_2, average='weighted'))
    f1_scores_2.append(f1_score(test_labels, predicted_labels_2, average='weighted'))



In [234]:

# Calculate and print the average evaluation metrics across all folds for the second classifier
avg_precision_2 = np.mean(precision_scores_2)
avg_recall_2 = np.mean(recall_scores_2)
avg_accuracy_2 = np.mean(accuracy_scores_2)
avg_f1_2 = np.mean(f1_scores_2)

# Calculate the standard deviation for each metric
std_precision_2 = np.std(precision_scores_2)
std_recall_2 = np.std(recall_scores_2)
std_accuracy_2 = np.std(accuracy_scores_2)
std_f1_2 = np.std(f1_scores_2)

print(f"Average Precision (Classifier 2): {avg_precision_2:.2f}")
print(f"Average Recall (Classifier 2): {avg_recall_2:.2f}")
print(f"Average Accuracy (Classifier 2): {avg_accuracy_2:.2f}")
print(f"Average F1 (Classifier 2): {avg_f1_2:.2f}")

# Print the standard deviation for each metric
print(f"Standard Deviation Precision (Classifier 2): {std_precision_2:.2f}")
print(f"Standard Deviation Recall (Classifier 2): {std_recall_2:.2f}")
print(f"Standard Deviation Accuracy (Classifier 2): {std_accuracy_2:.2f}")
print(f"Standard Deviation F1 (Classifier 2): {std_f1_2:.2f}")

Average Precision (Classifier 2): 0.75
Average Recall (Classifier 2): 0.75
Average Accuracy (Classifier 2): 0.75
Average F1 (Classifier 2): 0.75
Standard Deviation Precision (Classifier 2): 0.01
Standard Deviation Recall (Classifier 2): 0.01
Standard Deviation Accuracy (Classifier 2): 0.01
Standard Deviation F1 (Classifier 2): 0.01


####Information on model features and Experiment:

Objective:
- By emphasising more significant words and lessening the impact of common, uninformative words, this experiment aimed to improve sentiment analysis. Stopwords such as "the" and "is" are very frequent in natural language text. Our goal was to give more weight to content-bearing words, and so improve accuracy, by removing stopwords from the feature vocabulary.

Features:
- We removed stopwords before selecting the 2000 most frequent words as features. Stopwords are words that are commonly used in English but carry little meaning on their own, such as "the," "is," "and," and "in." Negated contraction stems such as "didn" and "wasn" were kept. The feature sets for this experiment also include the positive and negative subjectivity-lexicon scores computed by SL_features.

Classifier:
- In keeping with the earlier experiment, we used the Naive Bayes classifier, which is a good fit for text classification tasks with word-presence features.

- The classifier was trained on feature sets built from the stopword-filtered vocabulary and evaluated with 5-fold cross-validation. This method attempted to reduce the noise caused by common, non-discriminative words in order to increase the accuracy of sentiment analysis.

Overall, this experiment showed how stopword removal affects sentiment analysis and emphasised the importance of word choice in text data for precise sentiment classification.

###Comparison of results from Negation & Stop word removal experiments
---

- The negation handling experiment received better mean scores for accuracy, precision, recall, and F1-Score; accuracy and F1-Score were both about 0.7704.

- The mean scores for accuracy, precision, recall, and F1-Score in the stopword removal experiment were approximately 0.75, which is marginally lower.

- In general, the negation handling experiment appears to have outperformed the stopword removal experiment.

- Both experiments' standard deviations are low (about 0.01), which suggests consistent results across folds.

- The two feature sets differ in more than one respect (vocabulary restriction and lexicon scores as well as negation marking), so the difference cannot be attributed to negation handling alone.


#Sentiment Polarity – Analysis of Fake and Real “text”

---

In [181]:
from google.colab import drive
drive.mount("/drive", force_remount=True)

Mounted at /drive


Take only the first 50 rows of each dataset.

In [182]:
import csv
import pandas as pd
true_data = pd.read_csv('/drive/My Drive/Colab Notebooks/IST 664/2023/True.csv').head(50)

In [183]:
fake_data = pd.read_csv('/drive/My Drive/Colab Notebooks/IST 664/2023/Fake.csv').head(50)

In [184]:
# Print the first few rows of the DataFrame
display(true_data.head())

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [185]:
display(fake_data.head())

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


##*Data Preprocessing*

---



In [186]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Why Tokenization?

Tokenization is the process of splitting a text or document into individual words or tokens.

For this data, the article text needs to be broken into individual words before any text analysis can be performed.
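As a rough illustration of what tokenization produces, the sketch below uses a simple regular expression as a stand-in for a full tokenizer. It is not equivalent to `word_tokenize`, which handles contractions and abbreviations more carefully, and the input sentence is made up in the style of the dataset.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("WASHINGTON (Reuters) - The head of a committee")
print(tokens)
```

Each punctuation mark becomes its own token, which is what allows a later step to filter punctuation out.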

In [187]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

true_data['text_tokenized'] = true_data['text'].apply(lambda x: word_tokenize(x))


In [188]:
display(true_data.head())

Unnamed: 0,title,text,subject,date,text_tokenized
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017","[WASHINGTON, (, Reuters, ), -, The, head, of, ..."
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017","[WASHINGTON, (, Reuters, ), -, Transgender, pe..."
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017","[WASHINGTON, (, Reuters, ), -, The, special, c..."
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017","[WASHINGTON, (, Reuters, ), -, Trump, campaign..."
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017","[SEATTLE/WASHINGTON, (, Reuters, ), -, Preside..."


In [189]:

fake_data['text_tokenized'] = fake_data['text'].apply(lambda x: word_tokenize(x))


In [190]:
display(fake_data.head())

Unnamed: 0,title,text,subject,date,text_tokenized
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017","[Donald, Trump, just, couldn, t, wish, all, Am..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017","[House, Intelligence, Committee, Chairman, Dev..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017","[On, Friday, ,, it, was, revealed, that, forme..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017","[On, Christmas, day, ,, Donald, Trump, announc..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017","[Pope, Francis, used, his, annual, Christmas, ..."


Create copies of the original DataFrames to store results, so that calculations can be performed and saved in these copies to obtain the required statistics for each article.

In [191]:
results_data_true=true_data.copy()
results_data_fake=fake_data.copy()

Why convert tokens to lowercase?

This ensures uniformity in text analysis.

It treats the same word in different cases (e.g., "Trump" and "trump") as one token, which avoids duplicate entries and improves the accuracy of text-based operations.
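The effect of lowercasing on token counts can be seen in a small sketch with made-up tokens:

```python
from collections import Counter

tokens = ['Trump', 'said', 'trump', 'TRUMP', 'said']  # made-up tokens

# Without lowercasing, each casing is counted as a separate word
case_sensitive = Counter(tokens)

# With lowercasing, all casings collapse into one entry
case_folded = Counter(token.lower() for token in tokens)

print(len(case_sensitive), len(case_folded))
```

One trade-off is that lowercasing also merges proper nouns with common words that share their spelling.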






In [192]:
# Make tokens lowercase
true_data['text_tokenized_lower'] = true_data['text_tokenized'].apply(lambda x: [word.lower() for word in x])


In [193]:
# Make tokens lowercase
fake_data['text_tokenized_lower'] = fake_data['text_tokenized'].apply(lambda x: [word.lower() for word in x])


In [194]:
display(true_data.head())

Unnamed: 0,title,text,subject,date,text_tokenized,text_tokenized_lower
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017","[WASHINGTON, (, Reuters, ), -, The, head, of, ...","[washington, (, reuters, ), -, the, head, of, ..."
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017","[WASHINGTON, (, Reuters, ), -, Transgender, pe...","[washington, (, reuters, ), -, transgender, pe..."
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017","[WASHINGTON, (, Reuters, ), -, The, special, c...","[washington, (, reuters, ), -, the, special, c..."
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017","[WASHINGTON, (, Reuters, ), -, Trump, campaign...","[washington, (, reuters, ), -, trump, campaign..."
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017","[SEATTLE/WASHINGTON, (, Reuters, ), -, Preside...","[seattle/washington, (, reuters, ), -, preside..."


In [195]:
display(fake_data.head())

Unnamed: 0,title,text,subject,date,text_tokenized,text_tokenized_lower
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017","[Donald, Trump, just, couldn, t, wish, all, Am...","[donald, trump, just, couldn, t, wish, all, am..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017","[House, Intelligence, Committee, Chairman, Dev...","[house, intelligence, committee, chairman, dev..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017","[On, Friday, ,, it, was, revealed, that, forme...","[on, friday, ,, it, was, revealed, that, forme..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017","[On, Christmas, day, ,, Donald, Trump, announc...","[on, christmas, day, ,, donald, trump, announc..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017","[Pope, Francis, used, his, annual, Christmas, ...","[pope, francis, used, his, annual, christmas, ..."


Why remove non-words?

When checking for fake news, non-words such as punctuation, special characters, and digits make the text data noisy. Removing them reduces that noise and lets the analysis concentrate on the words that carry the text's meaning.

In [196]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [197]:
print(type(true_data['text_tokenized_lower'].iloc[0]))


<class 'list'>


In [198]:
import string

# Define a function to remove non-words and return tokens
def remove_non_words(tokens):
    # Filter out tokens that contain digits or special characters
    cleaned_tokens = [token for token in tokens if all(char.isalpha() or char.isspace() for char in token)]

    return cleaned_tokens
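To show what this kind of filter keeps and drops, here is a small self-contained sketch. `keep_alphabetic` is a hypothetical helper that uses `str.isalpha()`, which should behave the same as the check above for tokens that contain no whitespace. The sample tokens are made up.

```python
def keep_alphabetic(tokens):
    """Keep only tokens made up entirely of letters."""
    return [token for token in tokens if token.isalpha()]

sample = ['washington', '(', 'reuters', ')', '-', '2017', "n't", 'u.s.', 'budget']
cleaned = keep_alphabetic(sample)
print(cleaned)
```

Tokens such as "u.s." and "n't" are removed along with punctuation and digits, so abbreviations and negated contractions are lost at this step.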


In [199]:
true_data['text_words'] = true_data['text_tokenized_lower'].apply(remove_non_words)

In [200]:
fake_data['text_words'] = fake_data['text_tokenized_lower'].apply(remove_non_words)

In [218]:
display(fake_data)

Unnamed: 0,title,text,subject,date,text_tokenized,text_tokenized_lower,text_words,positive_sentences,negative_sentences
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017","[Donald, Trump, just, couldn, t, wish, all, Am...","[donald, trump, just, couldn, t, wish, all, am...","[donald, trump, just, couldn, t, wish, all, am...",0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017","[House, Intelligence, Committee, Chairman, Dev...","[house, intelligence, committee, chairman, dev...","[house, intelligence, committee, chairman, dev...",0,0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017","[On, Friday, ,, it, was, revealed, that, forme...","[on, friday, ,, it, was, revealed, that, forme...","[on, friday, it, was, revealed, that, former, ...",0,0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017","[On, Christmas, day, ,, Donald, Trump, announc...","[on, christmas, day, ,, donald, trump, announc...","[on, christmas, day, donald, trump, announced,...",0,0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017","[Pope, Francis, used, his, annual, Christmas, ...","[pope, francis, used, his, annual, christmas, ...","[pope, francis, used, his, annual, christmas, ...",0,0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017","[The, number, of, cases, of, cops, brutalizing...","[the, number, of, cases, of, cops, brutalizing...","[the, number, of, cases, of, cops, brutalizing...",0,0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017","[Donald, Trump, spent, a, good, portion, of, h...","[donald, trump, spent, a, good, portion, of, h...","[donald, trump, spent, a, good, portion, of, h...",0,0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017","[In, the, wake, of, yet, another, court, decis...","[in, the, wake, of, yet, another, court, decis...","[in, the, wake, of, yet, another, court, decis...",0,0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017","[Many, people, have, raised, the, alarm, regar...","[many, people, have, raised, the, alarm, regar...","[many, people, have, raised, the, alarm, regar...",0,0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017","[Just, when, you, might, have, thought, we, d,...","[just, when, you, might, have, thought, we, d,...","[just, when, you, might, have, thought, we, d,...",0,0



In [227]:
# Load SL from the provided file
SLpath = "/content/drive/MyDrive/Colab Notebooks/IST 664/2023/subjclueslen1-HLTEMNLP05.tff"
SL = readSubjectivity(SLpath)

In [236]:
# Define a function to classify sentences
def classify_sentences_with_SL(data, SL):
    # Initialize lists to store the results
    texts = []
    positive_counts = []
    negative_counts = []

    for i, row in data.iterrows():
        text_words = row['text_words']
        text = " ".join(text_words)  # Convert the list of words into a single string
        sentences = nltk.sent_tokenize(text)

        # Running totals of the positive and negative counts for this article
        positive_count = 0
        negative_count = 0

        for sentence in sentences:
            words = word_tokenize(sentence)
            features = SL_features(words, word_features, SL)

            # Add this sentence's positive and negative counts to the article totals
            positive_count += features['positivecount']
            negative_count += features['negativecount']

        texts.append(text)
        positive_counts.append(positive_count)
        negative_counts.append(negative_count)

    # Create a new dataframe with the specified fields
    results_data = pd.DataFrame({
        'text': texts,
        'the number of positive_sentences': positive_counts,
        'the number of negative_sentences': negative_counts,
    })

    return results_data
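
The function above adds each sentence's `positivecount` and `negativecount` features into per-article totals. If a literal count of positive and negative sentences is wanted instead, one possible variation is to label each sentence by comparing its two counts and then tally the labels. The sketch below illustrates that idea only; `TOY_LEXICON`, `label_sentence`, and `count_sentence_labels` are hypothetical names, and the four-word lexicon is a stand-in for `SL`, not the notebook's actual lexicon or method.

```python
# Hypothetical toy lexicon standing in for the real subjectivity lexicon (SL).
TOY_LEXICON = {
    "good": "positive",
    "great": "positive",
    "bad": "negative",
    "awful": "negative",
}


def label_sentence(words, lexicon):
    """Label one tokenized sentence 'pos', 'neg', or 'neutral'.

    The label comes from comparing how many words the lexicon marks as
    positive against how many it marks as negative; ties are 'neutral'.
    """
    pos = sum(1 for w in words if lexicon.get(w.lower()) == "positive")
    neg = sum(1 for w in words if lexicon.get(w.lower()) == "negative")
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return "neutral"


def count_sentence_labels(sentences, lexicon):
    """Tally sentence-level labels for one article (a list of token lists)."""
    counts = {"pos": 0, "neg": 0, "neutral": 0}
    for words in sentences:
        counts[label_sentence(words, lexicon)] += 1
    return counts


# Made-up example article with three tokenized sentences.
example_article = [
    ["a", "great", "and", "good", "film"],
    ["awful", "pacing"],
    ["good", "cast", "bad", "script"],
]
sentence_counts = count_sentence_labels(example_article, TOY_LEXICON)
print(sentence_counts)
```

With this variation each sentence contributes at most one to a counter, so a long sentence with many polar words does not outweigh several short ones.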


In [240]:

# Call the function to classify sentences in fake_data
results_data_fake = classify_sentences_with_SL(fake_data, SL)

# Call the function to classify sentences in true_data
results_data_true = classify_sentences_with_SL(true_data, SL)




In [241]:
# Display the resulting dataframes
print("Fake Data:")
display(results_data_fake.head(20))

print("True Data:")
display(results_data_true.head(20))

Fake Data:


Unnamed: 0,text,the number of positive_sentences,the number of negative_sentences
0,donald trump just couldn t wish all americans ...,34,25
1,house intelligence committee chairman devin nu...,12,24
2,on friday it was revealed that former milwauke...,28,31
3,on christmas day donald trump announced that h...,14,8
4,pope francis used his annual christmas day mes...,26,13
5,the number of cases of cops brutalizing and ki...,18,25
6,donald trump spent a good portion of his day a...,13,22
7,in the wake of yet another court decision that...,20,23
8,many people have raised the alarm regarding th...,30,33
9,just when you might have thought we d get a br...,24,23


True Data:


Unnamed: 0,text,the number of positive_sentences,the number of negative_sentences
0,washington reuters the head of a conservative ...,37,11
1,washington reuters transgender people will be ...,42,9
2,washington reuters the special counsel investi...,13,17
3,washington reuters trump campaign adviser geor...,11,16
4,reuters president donald trump called on the p...,19,37
5,west palm beach reuters the white house said o...,23,23
6,west palm beach fla reuters president donald t...,25,26
7,the following statements were posted to the ve...,4,12
8,the following statements were posted to the ve...,6,3
9,washington reuters alabama secretary of state ...,5,10


In [230]:
# Save the resulting dataframes as CSV files
results_data_fake.to_csv("/content/drive/MyDrive/Colab Notebooks/IST 664/2023/fake_results.csv", index=False)
results_data_true.to_csv("/content/drive/MyDrive/Colab Notebooks/IST 664/2023/true_results.csv", index=False)

####Comparative analysis of the positive sentences and negative sentences in Fake.csv versus in True.csv
---
- Comparing positive and negative sentences across the "Fake.csv" and "True.csv" datasets reveals different patterns. In the "Fake.csv" dataset, negative sentences significantly outnumber positive ones. This suggests that the sensational, fabricated, or misleading content of fake news articles evokes stronger negative sentiment.

- In the "True.csv" dataset, by contrast, positive and negative sentences are more evenly distributed. This balance may reflect the factual and objective character of the news reports labelled "true": these articles are expected to keep a more neutral tone, which leads to a less skewed distribution of sentiment.

- Overall, authentic news displays a more balanced distribution of positive and negative sentiment, whereas fake news is more likely to elicit a negative emotional response. This result emphasises how useful sentiment analysis can be for distinguishing reliable news from false information.
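
One way to put numbers behind a comparison like this is to total the two count columns for each dataset and look at the negative share. The sketch below does that for the ten rows of each table displayed above only; the (positive, negative) pairs are copied from that output, and `summarize` is a hypothetical helper. Ten rows per dataset are too few to confirm or refute conclusions drawn from the full files, so this is an illustration of the calculation rather than a result.

```python
# (positive, negative) counts copied from the ten displayed rows of each table.
fake_rows = [(34, 25), (12, 24), (28, 31), (14, 8), (26, 13),
             (18, 25), (13, 22), (20, 23), (30, 33), (24, 23)]
true_rows = [(37, 11), (42, 9), (13, 17), (11, 16), (19, 37),
             (23, 23), (25, 26), (4, 12), (6, 3), (5, 10)]


def summarize(rows):
    """Aggregate (positive, negative) pairs into totals and a negative share."""
    total_pos = sum(pos for pos, _ in rows)
    total_neg = sum(neg for _, neg in rows)
    return {
        "positive": total_pos,
        "negative": total_neg,
        # Fraction of all counted items that are negative.
        "negative_share": total_neg / (total_pos + total_neg),
        # Number of articles whose negative count exceeds the positive count.
        "articles_more_negative": sum(1 for pos, neg in rows if neg > pos),
    }


fake_summary = summarize(fake_rows)
true_summary = summarize(true_rows)

for name, summary in [("Fake", fake_summary), ("True", true_summary)]:
    print(f"{name}: positive={summary['positive']}, "
          f"negative={summary['negative']}, "
          f"negative share={summary['negative_share']:.3f}")
```

To run the same summary on the full results, the pairs could be built from the dataframes, for example with `zip(results_data_fake['the number of positive_sentences'], results_data_fake['the number of negative_sentences'])`.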


####What I learned from this assignment:
During this assignment, I investigated sentiment analysis and became familiar with feature extraction techniques like Bag of Words (BoW). I also learned how to construct sentiment classifiers using the Naive Bayes algorithm and why methods like stopword removal and negation handling matter. When I analysed the results, I found that the choice of approach affected the sentiment analysis results; negation handling performed marginally better than stopword removal.

I compared the sentiment in "Fake.csv" and "True.csv," finding that while real news maintains a balanced emotional distribution, fake news typically elicits more negative sentiment. This demonstrates how sentiment analysis can be used to tell real news from fake content.

To sum up, this assignment helped me gain a deeper understanding of sentiment analysis, how the choice of technique affects its outcomes, and how it can be used practically in news classification.