# Assignment 02

This is the Text Processing project.

See Canvas for its deadline. 

In [None]:
# import packages
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import requests
import re
from urllib.parse import urlparse
import urllib.robotparser
from bs4 import BeautifulSoup

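# Check whether a site's robots.txt allows fetching the given URL.
# Returns True or False, or None if robots.txt could not be read.
def canFetch(url):

    parsed_uri = urlparse(url)
    # No trailing slash here, so the robots.txt path below contains a single "/"
    domain = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(domain + "/robots.txt")
    try:
        rp.read()
        canFetchBool = rp.can_fetch("*", url)
    except Exception:
        canFetchBool = None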
    
    return canFetchBool

## Assignment 02 examples

#### Example 0 of project for assignment 02: Text processing

1) perform the analysis of data/ira.csv (similar to what was done in processing_text.ipynb, including most frequent words etc.)

2) perform sentiment analysis on this dataset

3) detail comments, explain step by step what is happening, and try to write down a paragraph or two at the end discussing what you figured out

************

# Decoding Voter Sentiments


In anticipation of the presidential election scheduled for November of this year, this project conducts sentiment analysis on a large dataset of tweets from American citizens. It focuses on public sentiment towards Donald Trump, who held office for the four years before Joe Biden, with the aim of gaining insight into the current political landscape. The dataset used in this project is sourced from the Internet Research Agency and offers a diverse collection of opinions and perspectives.

## The Main Question

This project seeks to understand the prevailing sentiments of American voters, particularly towards Donald Trump.
Are these sentiments predominantly positive or negative?

## First Glance: Exploring the Data

In [None]:
# Reading the raw file into a list of strings, one per line (not yet split into fields)
with open("data/ira.csv", encoding='utf8') as f:
    ira_tweets = [x.strip() for x in f.readlines()]
ira_tweets

In [68]:
# Counting the total number of lines (one per tweet) in the IRA dataset
def count_ira(x):
    count = 0
    for i in x:
        count+=1
    return count

count_ira(ira_tweets)

90000

#### There are 90,000 tweets collected by the IRA that were analysed to make a prediction about Trump's standing in the elections.

In [69]:
# Computing the length of the shortest tweet found by IRA
def shortest_ira_tweet(lst):
    least_len = float('inf')
    for i in lst:
        if len(i) < least_len:
            least_len = len(i)
    return least_len 

shortest_ira_tweet(ira_tweets)

43

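#### The shortest line is 43 characters. This length covers the whole line, including the number code, account name and timestamp, not just the tweet text.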

In [70]:
# Computing the length of the longest tweet found by IRA
def longest_ira_tweet(lst):
    most_len = 0
    for i in lst:
        if len(i) > most_len:
            most_len = len(i)
    return most_len 

longest_ira_tweet(ira_tweets)

305

#### The longest line is 305 characters, again counting the number code, account name and timestamp along with the tweet text.
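
The three helper functions above can also be written with Python's built-ins. The sketch below uses a small made-up list (`sample_lines`, a hypothetical stand-in for `ira_tweets`; only the first line is taken from the data shown in this notebook) so that it runs on its own.

```python
# Hypothetical stand-in for ira_tweets: a short list of already-stripped lines.
# Only the first line comes from the dataset; the other two are invented.
sample_lines = [
    "272878,San Francisco Daily,2016-03-18 19:28,Uber balks at rules proposed by world’s busiest airport  #news",
    "1,Example Account,2016-01-01 00:00,Short tweet",
    "2,Example Account,2016-01-02 00:00,A somewhat longer example tweet for comparison",
]

line_count = len(sample_lines)                       # same result as count_ira
shortest = min(len(line) for line in sample_lines)   # same result as shortest_ira_tweet
longest = max(len(line) for line in sample_lines)    # same result as longest_ira_tweet

print(line_count, shortest, longest)
```

One difference to keep in mind: on an empty list, `min` and `max` raise a `ValueError`, whereas the loops above return `inf` and `0`.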

## Cleaning & Filtering the Data

Every tweet in ira_tweets is preceded by a number code, an account name and a date and time stamp.
These fields are not needed here, because the analysis looks only at the text of the tweets.

In [6]:
# Splitting each line on commas to get a list of lists, one sublist of fields per tweet.
# Note: split(',') also splits on commas inside the tweet text, so some tweets end up spread over several fields.
tweets_all = [x.strip().split(',') for x in open("data/ira.csv").readlines()]
tweets_all

[['3906258',
  'ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef974526887856fe299d3f2c0',
  '2016-11-16 09:04',
  'The Best Exercise To Lose Belly Fat In 2 weeks  https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38'],
 ['1051443',
  '8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5fa79c825e2d6f291bf',
  '2016-12-24 04:31',
  '"RT @Philanthropy: Dozens of ‘hate groups’ have charity status',
  ' Chronicle study finds https://t.co/FxUBBHNlKy"'],
 ['2823399',
  'Room Of Rumor',
  '2016-08-18 20:26',
  '"Artificial intelligence can find',
  ' map poverty',
  ' researchers say  #tech"'],
 ['272878',
  'San Francisco Daily',
  '2016-03-18 19:28',
  'Uber balks at rules proposed by world’s busiest airport  #news'],
 ['7697802',
  '41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed8980383c82378e7518',
  '2016-07-30 15:44',
  '"RT @dirtroaddiva1: #IHatePokemonGoBecause he  didn\'t let me do ""that"" for a Klondike bar.    Screw you Pokemon.  #PokesAreJokes. https://t.…"'

In [7]:
# Keeping only the fourth field (the tweet text) and dropping the number code, account name and timestamp.
tweets_only = [x[3] for x in tweets_all]
tweets_only

['The Best Exercise To Lose Belly Fat In 2 weeks  https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38',
 '"RT @Philanthropy: Dozens of ‘hate groups’ have charity status',
 '"Artificial intelligence can find',
 'Uber balks at rules proposed by world’s busiest airport  #news',
 '"RT @dirtroaddiva1: #IHatePokemonGoBecause he  didn\'t let me do ""that"" for a Klondike bar.    Screw you Pokemon.  #PokesAreJokes. https://t.…"',
 'Chick-fil-A remains closed after health violations  #health',
 "RT @SenSanders: We cannot afford to wait to address this public health crisis. We must quickly fund efforts to stop Zika's spread. https://…",
 'RT @MatthewGellert: #IWouldPreferToForget that the two leading Republican candidates are an ignorant bully and an ignorant preacher.',
 '"RT @rapstationradio: #NowPlaying: RJ (OMMIO) ""From Nothing (Prod. By Davo)"" #rap #hiphop #music https://t.co/8TJZ3vVCxs"',
 'Hill Street Vida Blues. #AthleticsTVShows @susanslusser',
 '
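
Some entries in the output above stop at the first comma of the tweet (for example, the @Philanthropy retweet ends at "charity status"), because `split(',')` treats commas inside the quoted tweet text as field separators. A minimal sketch of an alternative, assuming the file follows standard CSV quoting, is to parse it with Python's `csv` module. The `raw` string below is reconstructed from two rows of the output above and is only for illustration.

```python
import csv
import io

# Two rows reconstructed from the output above, as they would appear in the raw file.
raw = (
    '1051443,8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5fa79c825e2d6f291bf,'
    '2016-12-24 04:31,"RT @Philanthropy: Dozens of ‘hate groups’ have charity status, '
    'Chronicle study finds https://t.co/FxUBBHNlKy"\n'
    '272878,San Francisco Daily,2016-03-18 19:28,'
    'Uber balks at rules proposed by world’s busiest airport  #news\n'
)

# Naive approach: every comma is a separator, so the first tweet is cut in two.
naive_rows = [line.split(',') for line in raw.splitlines()]

# csv.reader respects the double quotes, so the tweet text stays in one field.
csv_rows = list(csv.reader(io.StringIO(raw)))
tweets_text = [row[3] for row in csv_rows]

print(len(naive_rows[0]), len(csv_rows[0]))
print(tweets_text[0])
```

For the real file, the same idea would be `csv.reader(open("data/ira.csv", encoding="utf8", newline=""))`. The counts reported in the rest of this notebook were obtained with the `split(',')` version and may differ slightly if the parsing is changed.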

In [8]:
# Filtering the IRA tweets further to attain the ones that talk about Trump, former President of the United States.
def in_text(y):
    return 'Trump' in y
trump_tweets = list(filter(in_text, tweets_only))
trump_tweets

["RT @shannoncoulter: You don't have to use your daughters and wives as surrogates for your outrage over #TrumpTapes. You can just be offende…",
 '"RT @DanScavino: .@realDonaldTrump was considered ""not nice"" regarding Paris &amp; #Brussels comments- months ago.   #WakeUpAmerica! https://t.c…"',
 "RT @thehill: Kasich: I'm not ready to endorse Trump https://t.co/gvQMujpJ7v https://t.co/XLJEHp267T",
 "Trump: I 'most likely' won't do GOP debate",
 "RT @TeamTrump: 'Trump's final speech a message of HOPE to Michigan voters' #ElectionDay https://t.co/xQ60buYYND",
 'RT @PaulBlu: Donald Trump owes hundreds of millions in debt to Goldman Sachs. https://t.co/6Y9Mzvl2zi',
 'I hope @SheriffClarke will become part of the Trump Administration! He is a true patriot &amp; Hes just cool. Like John Wayne or Clint Eastwood https://t.co/3bEkpeAJlG',
 'Alex Jones “CIA Report is FAKE NEWS to Attack Trump” https://t.co/KrWrkzZ8ff https://t.co/EV7lHpyIjh',
 'All 4 of my Grandparents born USA. All 4 rolling o

In [9]:
len(trump_tweets)

6276

#### 6,276 of the original 90,000 tweets in the IRA dataset contain the string 'Trump'.
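
The check `'Trump' in y` is a case-sensitive substring test, so it misses "TRUMP" or "trump" and also matches handles and hashtags such as @TeamTrump. The sketch below compares it with a case-insensitive whole-word pattern. The pattern and the sample list are illustrative assumptions (only the first and last tweets come from the data above), not the filter used in this notebook.

```python
import re

# Case-insensitive, whole-word pattern (an illustrative choice)
trump_pattern = re.compile(r"\btrump\b", re.IGNORECASE)

sample_tweets = [
    "Trump: I 'most likely' won't do GOP debate",   # from the data above
    "RT @TeamTrump: rally tonight",                 # invented: name appears only inside a handle
    "TRUMP wins Michigan",                          # invented: all caps
    "Uber balks at rules proposed by world’s busiest airport  #news",
]

substring_matches = [t for t in sample_tweets if 'Trump' in t]
regex_matches = [t for t in sample_tweets if trump_pattern.search(t)]

print(substring_matches)
print(regex_matches)
```

Which behaviour is preferable depends on the question: the substring test keeps hashtag and handle mentions, while the whole-word pattern keeps differently capitalised mentions.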

In [10]:
# Creating a DataFrame with a single column holding the filtered tweets.
trump_df = pd.DataFrame({'Tweets': trump_tweets})
trump_df

| | Tweets |
|---|---|
| 0 | RT @shannoncoulter: You don't have to use your... |
| 1 | "RT @DanScavino: .@realDonaldTrump was conside... |
| 2 | RT @thehill: Kasich: I'm not ready to endorse ... |
| 3 | Trump: I 'most likely' won't do GOP debate |
| 4 | RT @TeamTrump: 'Trump's final speech a message... |
| ... | ... |
| 6271 | People for Trump! #CrookedHillary only has fal... |
| 6272 | RT @Don_Vito_08: #Trump was never called a rac... |
| 6273 | RT @mike_pence: Stopping in to surprise our ha... |
| 6274 | Put your ballots for Trump right there https:/... |


## Counting Words

Finding the most frequently used words by splitting the tweets into tokens.


In [11]:
# Made a single long string with the tweet text, and split the tweets into a list of only words.
all_tweets_text = " ".join(trump_tweets)
words_list = all_tweets_text.split()
print("First 20 words:", words_list[:20])

First 20 words: ['RT', '@shannoncoulter:', 'You', "don't", 'have', 'to', 'use', 'your', 'daughters', 'and', 'wives', 'as', 'surrogates', 'for', 'your', 'outrage', 'over', '#TrumpTapes.', 'You', 'can']


In [12]:
# Checking total number of words, and distinct words used in the tweets. 
total_words = len(words_list)
distinct_words = set(words_list)
num_distinct_words = len(distinct_words)

print("Total words:", total_words)
print("Number of distinct words:", num_distinct_words)


Total words: 87407
Number of distinct words: 24149


#### There are 87,407 words in total, of which 24,149 are unique.

In [13]:
# Removing short words (fewer than three characters) such as 'a', 'in' and 'to'.
# This is only a rough stand-in for stop-word removal: longer stop words like 'the' and 'and' remain.
filtered_words = [word for word in words_list if len(word) >= 3]

# Calculated the total number of words after filtering
total_filtered_words = len(filtered_words)

# Calculated the number of distinct words after filtering using a set
distinct_filtered_words = set(filtered_words)
num_distinct_filtered_words = len(distinct_filtered_words)

# Print the results
print("Total words after removing short words:", total_filtered_words)
print("Number of distinct words after removing short words:", num_distinct_filtered_words)

Total words after removing short words: 72702
Number of distinct words after removing short words: 23694


#### After removing words shorter than three characters, we were left with 72,702 words, of which 23,694 are unique.
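
A length cut-off and a stop-word list do not remove the same words. The sketch below contrasts the two on the first few words of the tweets. `STOP_WORDS` is a small hand-picked set used purely for illustration (an assumption, not a standard list); a fuller list, such as the one shipped with NLTK, could be substituted.

```python
# Small hand-picked stop-word set, for illustration only (not a standard list)
STOP_WORDS = {"a", "an", "the", "and", "to", "for", "as", "of", "in", "your", "you", "have"}

# The first 16 words of the tweet text, as printed earlier in this notebook
words = ['RT', '@shannoncoulter:', 'You', "don't", 'have', 'to', 'use', 'your',
         'daughters', 'and', 'wives', 'as', 'surrogates', 'for', 'your', 'outrage']

# Length-based filter, as used in the cell above: drops 'RT', 'to', 'as' but keeps 'and', 'for'
length_filtered = [w for w in words if len(w) >= 3]

# Stop-word filter: compares the lowercased word against the set, regardless of length
stop_filtered = [w for w in words if w.lower() not in STOP_WORDS]

print(length_filtered)
print(stop_filtered)
```

Note that the length filter also drops 'RT', which is short but not a stop word, while the stop-word filter keeps it.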

## Counting Word Frequency

In [14]:
# Creating a categorical distribution (a count per word) with a dictionary, using every word in words_list.

categorical_distribution = {}
for word in words_list:
    if word in categorical_distribution:
        categorical_distribution[word] += 1
    else:
        categorical_distribution[word] = 1

# Printed the categorical distribution
print(categorical_distribution)
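
Printing the whole dictionary does not show which words are most frequent. One way to rank them, sketched here on the first 20 words printed earlier rather than the full list, is `collections.Counter` from the standard library, which builds the same word-to-count mapping and can return the top entries.

```python
from collections import Counter

# The first 20 words of the tweet text, as printed earlier in this notebook
words = ['RT', '@shannoncoulter:', 'You', "don't", 'have', 'to', 'use', 'your',
         'daughters', 'and', 'wives', 'as', 'surrogates', 'for', 'your', 'outrage',
         'over', '#TrumpTapes.', 'You', 'can']

# Counter builds the same word -> count mapping as the loop above
word_counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs
top_words = word_counts.most_common(2)
print(top_words)
```

On the full data, `Counter(words_list).most_common(20)` would list the 20 most frequent words.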



## Tokenizing again (using NLTK)

In [55]:
from nltk import tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/aag022/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [56]:
allText = all_tweets_text # pass in a string consisting of all tweets

wordList = tokenize.word_tokenize(allText)
len(wordList)

124352

## Counting again

In [57]:
# Removed short words
filtered_words = [word for word in wordList if len(word) >= 3]

# Created a categorical distribution using a dictionary for filtered words
categorical_distribution_filtered = {}
for word in filtered_words:
    if word in categorical_distribution_filtered:
        categorical_distribution_filtered[word] += 1
    else:
        categorical_distribution_filtered[word] = 1

# Printed the categorical distribution for filtered words
print(categorical_distribution_filtered)



## Sentiment with NLTK

In [58]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/aag022/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [59]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [60]:
sid = SentimentIntensityAnalyzer()
sid.polarity_scores("Good test!")

{'neg': 0.0, 'neu': 0.239, 'pos': 0.761, 'compound': 0.4926}

In [61]:
tweetSentiments = []

for tweet in trump_tweets:
    tweetSentiment = sid.polarity_scores(tweet)
    tweetSentiment['text'] = tweet
    tweetSentiments.append(tweetSentiment)
tweetSentiments  

[{'neg': 0.13,
  'neu': 0.87,
  'pos': 0.0,
  'compound': -0.5106,
  'text': "RT @shannoncoulter: You don't have to use your daughters and wives as surrogates for your outrage over #TrumpTapes. You can just be offende…"},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'text': '"RT @DanScavino: .@realDonaldTrump was considered ""not nice"" regarding Paris &amp; #Brussels comments- months ago.   #WakeUpAmerica! https://t.c…"'},
 {'neg': 0.312,
  'neu': 0.688,
  'pos': 0.0,
  'compound': -0.4717,
  'text': "RT @thehill: Kasich: I'm not ready to endorse Trump https://t.co/gvQMujpJ7v https://t.co/XLJEHp267T"},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'text': "Trump: I 'most likely' won't do GOP debate"},
 {'neg': 0.0,
  'neu': 0.768,
  'pos': 0.232,
  'compound': 0.5622,
  'text': "RT @TeamTrump: 'Trump's final speech a message of HOPE to Michigan voters' #ElectionDay https://t.co/xQ60buYYND"},
 {'neg': 0.161,
  'neu': 0.839,
  'pos': 0.0,
  'compound':

In [62]:
tweetSentimentDf = pd.DataFrame(tweetSentiments)

In [63]:
tweetSentimentDf.sort_values('compound')

| | neg | neu | pos | compound | text |
|---|---|---|---|---|---|
| 4997 | 0.592 | 0.408 | 0.000 | -0.9769 | RT @tracieeeeee: I am SICK &amp; DAMN tired of... |
| 4847 | 0.428 | 0.507 | 0.065 | -0.9627 | RT @Blackamazon: Want to be mad at Trump? Cool... |
| 718 | 0.644 | 0.356 | 0.000 | -0.9590 | Terrorist who tried to kill Donald Trump charg... |
| 1837 | 0.540 | 0.460 | 0.000 | -0.9561 | RT @TEN_GOP: 🚨Spread it Illegal Immigrant acti... |
| 4250 | 0.535 | 0.465 | 0.000 | -0.9537 | RT @TrumpSuperPAC: The only thing worse than r... |
| ... | ... | ... | ... | ... | ... |
| 2744 | 0.000 | 0.474 | 0.526 | 0.9238 | WOW! Another slap to the MSM! They are Mexican... |
| 2346 | 0.000 | 0.561 | 0.439 | 0.9260 | RT @leftyguitar1: #ProbableTrumpsTweets My bes... |
| 3266 | 0.000 | 0.569 | 0.431 | 0.9276 | RT @USAforTrump2016: Wow!!! I can't believe Tw... |
| 5962 | 0.000 | 0.569 | 0.431 | 0.9276 | RT @USAforTrump2016: Wow!!! I can't believe Tw... |


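The DataFrame above lists VADER's scores for each filtered tweet, sorted from most negative to most positive. The neg, neu and pos columns give the proportion of the tweet's text rated negative, neutral and positive, and compound is a single normalized score between -1 (most negative) and +1 (most positive).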

In [64]:
# Calculating the mean 'neg' score across the filtered tweets (the average share of a tweet's text rated negative).
neg_sum = tweetSentimentDf['neg'].sum()
neg_prop = neg_sum/tweetSentimentDf.shape[0]

print("Negativity quotient:", neg_prop)

Negativity quotient: 0.07897418738049714


In [65]:
# Calculating the mean 'pos' score across the filtered tweets (the average share of a tweet's text rated positive).
pos_sum = tweetSentimentDf['pos'].sum()
pos_prop = pos_sum/tweetSentimentDf.shape[0]

print("Positivity quotient:", pos_prop)

Positivity quotient: 0.08453346080305926


In [66]:
# Calculating the mean 'neu' score across the filtered tweets (the average share of a tweet's text rated neutral).
neu_sum = tweetSentimentDf['neu'].sum()
neu_prop = neu_sum/tweetSentimentDf.shape[0]

print("Neutral quotient:", neu_prop)

Neutral quotient: 0.8364929891650733
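
The three quotients above are averages of per-tweet scores, so they do not say how many tweets are positive or negative overall. A minimal sketch of counting tweets instead is to label each one by its compound score. The ±0.05 cut-offs below are the thresholds commonly suggested for VADER and are an assumption here, as is the helper name `classify_compound`. The five scores are the ones printed in the sentiment output earlier.

```python
def classify_compound(compound, threshold=0.05):
    """Label a VADER compound score as positive, negative or neutral.

    The 0.05 threshold is an assumed cut-off, not a value taken from this notebook.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first five tweets in the output above
compound_scores = [-0.5106, 0.0, -0.4717, 0.0, 0.5622]

labels = [classify_compound(c) for c in compound_scores]

# Share of tweets falling under each label
shares = {label: labels.count(label) / len(labels)
          for label in ("negative", "neutral", "positive")}

print(labels)
print(shares)
```

On the full DataFrame, the same idea would be `tweetSentimentDf['compound'].apply(classify_compound).value_counts(normalize=True)`.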


## Conclusion

The sentiment analysis gives a nuanced and somewhat inconclusive picture. On average, about 83.65 percent of each tweet's text was scored as neutral. The remaining 16.35 percent is split almost evenly: the average negative score is about 7.9 percent, which may reflect disapproval of or discontent with Trump, and the average positive score is about 8.5 percent, which may reflect support or admiration for the former president. Because these figures are averages of per-tweet scores rather than counts of tweets, they do not show how many tweets were positive or negative overall. Still, the breakdown suggests that neutral language predominates, with criticism and endorsement present in roughly equal measure.