# Twitter Sentiment Analysis Project

In this project, we analyze the sentiments expressed by Amazon founder Jeff Bezos in his tweets. We draw on several data points to explore the emotions and attitudes behind them:

1. **Tweet Content**: We will analyze the text of tweets to identify predominant emotions such as happiness, sadness, anger, among others.
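2. **Geographical Location**: We will examine the location associated with each tweet to observe whether emotions vary across geographic regions. In this notebook the location is a randomly assigned placeholder country rather than a value read from a user profile.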
3. **Interactions (Likes and Retweets)**: We will consider the number of likes and retweets tweets receive as indicators of their emotional impact on the audience.
4. **Tweet Creation Date**: We will investigate if there are seasonal or daily patterns in the emotions expressed over time.
5. **Sentiment**: We will use a pretrained sentiment model to predict the feeling that each tweet expresses.

The ultimate goal is to create a comprehensive analysis that allows us to better understand how emotions manifest on the Twitter platform and how these sentiments may vary based on factors such as location, social interactions, and contextual content. This analysis could have practical applications in fields such as market research, mental health, and social trend analysis.



## Load libraries
The first step in our project is to install the libraries and tools required to perform sentiment analysis on Twitter data. They cover data manipulation, natural language processing (NLP), and data cleaning. The following cell installs them with pip:



In [1]:
!pip install ntscraper
!pip install pandas
!pip install requests
!pip install bs4
!pip install nltk
!pip install selenium
!pip install scipy
!pip install transformers
!pip install torch torchvision



In this initial step of our project, we set up our Python environment by importing the necessary libraries and then use ntscraper to retrieve the tweets of the JeffBezos account from Nitter, an alternative front-end for Twitter.

## Step 1: Importing libraries
We will start by importing the required libraries for data retrieval, text processing, sentiment analysis, and visualization.

In [2]:

import pandas as pd
import numpy as np
import random
import torch
from ntscraper import Nitter
import requests
import re
import json
import nltk
from nltk.tokenize import word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import time
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vmxrls/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 2: Using ntscraper to retrieve Tweets
We will define a function that queries Nitter for the tweets of the JeffBezos account and extracts the relevant information from each tweet: text, user, date, retweets, and likes.

In [11]:
def pull_tweets():
  tweets_data = []
  scraper = Nitter(log_level=1, skip_instance_check=False)
  tweets = scraper.get_tweets("JeffBezos", mode="user", number=500)
  print(tweets)

  for tweet in tweets["tweets"]:
    text = tweet["text"]
    user = tweet["user"]["username"]
    date = tweet["date"]
    retweets = tweet["stats"]["retweets"]
    likes = tweet["stats"]["likes"]

    tweets_data.append({"text": text, "user": user, "date": date, "retweets": retweets, "likes": likes})
  return tweets_data

tweets = pull_tweets()
df_tweets = pd.DataFrame(tweets)

df_tweets

Testing instances: 100%|██████████| 77/77 [01:00<00:00,  1.27it/s]

14-May-24 16:17:03 - No instance specified, using random instance https://nitter.esmailelbob.xyz





14-May-24 16:17:09 - Current stats for JeffBezos: 20 tweets, 0 threads...
14-May-24 16:17:13 - Current stats for JeffBezos: 40 tweets, 0 threads...
14-May-24 16:17:18 - Current stats for JeffBezos: 60 tweets, 0 threads...
14-May-24 16:17:23 - Current stats for JeffBezos: 80 tweets, 0 threads...
14-May-24 16:17:27 - Current stats for JeffBezos: 100 tweets, 0 threads...
14-May-24 16:17:31 - Current stats for JeffBezos: 120 tweets, 0 threads...
14-May-24 16:17:35 - Current stats for JeffBezos: 140 tweets, 0 threads...
14-May-24 16:17:40 - Current stats for JeffBezos: 160 tweets, 0 threads...
14-May-24 16:17:44 - Current stats for JeffBezos: 180 tweets, 0 threads...
14-May-24 16:17:49 - Current stats for JeffBezos: 200 tweets, 0 threads...
14-May-24 16:17:53 - Current stats for JeffBezos: 220 tweets, 0 threads...
14-May-24 16:17:57 - Current stats for JeffBezos: 240 tweets, 0 threads...
14-May-24 16:18:02 - Current stats for JeffBezos: 260 tweets, 0 threads...
14-May-24 16:18:06 - Current 

Unnamed: 0,text,user,date,retweets,likes
0,Impressive visit to the @blueorigin Huntsville...,@SenBillNelson,"Oct 27, 2023 · 2:00 PM UTC",854,4126
1,Nominal run!,@torybruno,"Jun 8, 2023 · 1:05 AM UTC",424,3985
2,Honored to be on this journey with @NASA to la...,@JeffBezos,"May 19, 2023 · 3:06 PM UTC",2131,13562
3,Big milestone. Kudos and congrats to the whole...,@JeffBezos,"Feb 13, 2023 · 8:06 PM UTC",837,5353
4,Episode 3 of Last of Us is unbelievably good s...,@JeffBezos,"Jan 31, 2023 · 4:59 PM UTC",1386,29744
...,...,...,...,...,...
358,Congrats @SpaceX on landing Falcon's suborbita...,@JeffBezos,"Dec 22, 2015 · 1:49 AM UTC",1218,2478
359,Finally trashed by @realDonaldTrump. Will stil...,@JeffBezos,"Dec 7, 2015 · 11:30 PM UTC",6748,8753
360,What 400 very happy rocket scientists look lik...,@JeffBezos,"Dec 3, 2015 · 4:31 PM UTC",457,839
361,"Breakthrough Energy Coalition. When in a box, ...",@JeffBezos,"Nov 30, 2015 · 10:52 PM UTC",224,577
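The preview above contains rows whose user is not @JeffBezos (for example @SenBillNelson and @torybruno), which suggests that the timeline also includes retweets. If the analysis should cover only tweets written by the account itself, a filter along the following lines could be applied. This is a minimal sketch on a hand-made frame; the helper name `keep_own_tweets` is hypothetical and not part of the notebook.

```python
import pandas as pd

def keep_own_tweets(df, handle="@JeffBezos"):
    """Return only the rows whose `user` column matches `handle` (case-insensitive)."""
    mask = df["user"].str.lower() == handle.lower()
    return df[mask].reset_index(drop=True)

# Tiny stand-in for df_tweets, built from two rows of the preview above
sample = pd.DataFrame({
    "text": ["Nominal run!", "Big milestone. Kudos and congrats to the whole..."],
    "user": ["@torybruno", "@JeffBezos"],
})

own = keep_own_tweets(sample)
print(own)
```

Whether retweets should be kept depends on the question being asked: they show what the account chose to amplify, but they were not written by the account owner.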


In [12]:
paises = [
    "Argentina", "Australia", "Austria", "Belgium", "Belize", "Bolivia", "Bosnia and Herzegovina", "Brazil", "Bulgaria",
    "Cameroon", "Canada", "Central African Republic", "Chile", "China", "Colombia",
    "Congo", "Costa Rica", "Croatia", "Cuba", "Cyprus", "Czech Republic", "Denmark",
    "Dominican Republic", "Ecuador", "Egypt", "El Salvador", "Equatorial Guinea",
    "Finland", "France", "Gabon", "Gambia", "Georgia", "Germany", "Ghana", "Greece", "Guatemala",
    "Guinea", "Guinea-Bissau", "Honduras", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq",
    "Ireland", "Israel", "Italy", "Jamaica", "Japan", "Kazakhstan", "Kenya", "Kiribati", "Korea, North",
    "Korea, South", "Laos", "Latvia", "Lithuania", "Luxembourg", "Madagascar", "Malaysia", "Maldives", "Mali", "Malta",
    "Mexico", "Morocco", "Mozambique", "Myanmar", "Namibia", "Nepal", "Netherlands", "New Zealand", "Nicaragua",
    "Nigeria", "North Macedonia", "Norway", "Pakistan", "Palestine", "Panama", "Papua New Guinea", "Paraguay",
    "Peru", "Philippines", "Poland", "Portugal", "Qatar", "Romania", "Russia", "Saudi Arabia", "Senegal", "Serbia",
    "Singapore", "Slovakia", "Slovenia", "South Africa",
    "Spain", "Sweden", "Switzerland", "Syria", "Taiwan",
    "Thailand", "Tunisia", "Turkey",
    "Ukraine", "United Arab Emirates", "United Kingdom", "United States", "Uruguay",
    "Venezuela", "Vietnam", "Zambia", "Zimbabwe"
]

df_tweets["location"] = [random.choice(paises) for _ in range(len(df_tweets))]
df_tweets

Unnamed: 0,text,user,date,retweets,likes,location
0,Impressive visit to the @blueorigin Huntsville...,@SenBillNelson,"Oct 27, 2023 · 2:00 PM UTC",854,4126,Maldives
1,Nominal run!,@torybruno,"Jun 8, 2023 · 1:05 AM UTC",424,3985,Belgium
2,Honored to be on this journey with @NASA to la...,@JeffBezos,"May 19, 2023 · 3:06 PM UTC",2131,13562,Iran
3,Big milestone. Kudos and congrats to the whole...,@JeffBezos,"Feb 13, 2023 · 8:06 PM UTC",837,5353,Zimbabwe
4,Episode 3 of Last of Us is unbelievably good s...,@JeffBezos,"Jan 31, 2023 · 4:59 PM UTC",1386,29744,Iran
...,...,...,...,...,...,...
358,Congrats @SpaceX on landing Falcon's suborbita...,@JeffBezos,"Dec 22, 2015 · 1:49 AM UTC",1218,2478,United States
359,Finally trashed by @realDonaldTrump. Will stil...,@JeffBezos,"Dec 7, 2015 · 11:30 PM UTC",6748,8753,Thailand
360,What 400 very happy rocket scientists look lik...,@JeffBezos,"Dec 3, 2015 · 4:31 PM UTC",457,839,France
361,"Breakthrough Energy Coalition. When in a box, ...",@JeffBezos,"Nov 30, 2015 · 10:52 PM UTC",224,577,Cuba
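Note that the location values above are drawn at random from `paises`, so they are placeholders and will differ on every run. If reproducible placeholders are wanted, one option is a seeded generator. The sketch below assumes a short stand-in list and a hypothetical helper `assign_locations`; the seed value is arbitrary.

```python
import random

def assign_locations(n, countries, seed=42):
    """Pick n countries at random; the same seed always yields the same picks."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return [rng.choice(countries) for _ in range(n)]

countries = ["Argentina", "France", "Japan", "Kenya"]  # stand-in for `paises`

first = assign_locations(5, countries)
second = assign_locations(5, countries)
print(first == second)  # True: identical seed, identical sequence
```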


## Step 3: Cleaning and preprocessing the data

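In this step, we clean and preprocess the data: we parse the date strings into timestamps, split them into year, month, day, and hour columns, and normalize the tweet text by removing stop words and punctuation and converting it to lowercase.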

In [13]:
def preprocess_date(date_str):
    return pd.to_datetime(date_str, format="%b %d, %Y · %I:%M %p UTC", errors="coerce")

df_tweets["date"] = df_tweets["date"].apply(preprocess_date)

df_tweets

Unnamed: 0,text,user,date,retweets,likes,location
0,Impressive visit to the @blueorigin Huntsville...,@SenBillNelson,2023-10-27 14:00:00,854,4126,Maldives
1,Nominal run!,@torybruno,2023-06-08 01:05:00,424,3985,Belgium
2,Honored to be on this journey with @NASA to la...,@JeffBezos,2023-05-19 15:06:00,2131,13562,Iran
3,Big milestone. Kudos and congrats to the whole...,@JeffBezos,2023-02-13 20:06:00,837,5353,Zimbabwe
4,Episode 3 of Last of Us is unbelievably good s...,@JeffBezos,2023-01-31 16:59:00,1386,29744,Iran
...,...,...,...,...,...,...
358,Congrats @SpaceX on landing Falcon's suborbita...,@JeffBezos,2015-12-22 01:49:00,1218,2478,United States
359,Finally trashed by @realDonaldTrump. Will stil...,@JeffBezos,2015-12-07 23:30:00,6748,8753,Thailand
360,What 400 very happy rocket scientists look lik...,@JeffBezos,2015-12-03 16:31:00,457,839,France
361,"Breakthrough Energy Coalition. When in a box, ...",@JeffBezos,2015-11-30 22:52:00,224,577,Cuba
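`pd.to_datetime` also accepts a whole Series, so the same conversion can be written without `.apply`. The following sketch uses the format string from the cell above on two made-up inputs, one valid and one invalid, to show what `errors="coerce"` does.

```python
import pandas as pd

DATE_FORMAT = "%b %d, %Y · %I:%M %p UTC"

# One date in the Nitter style shown above and one deliberately broken value
raw = pd.Series(["Oct 27, 2023 · 2:00 PM UTC", "not a date"])

# Parse the whole column at once; values that do not match become NaT
parsed = pd.to_datetime(raw, format=DATE_FORMAT, errors="coerce")
print(parsed)
```

Because unparseable values silently become `NaT`, it can be worth checking `parsed.isna().sum()` before deriving further columns from the dates.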


In [14]:
# The .dt accessor works on the whole column at once, so no loop is needed
df_tweets["year"] = df_tweets["date"].dt.year
df_tweets["month"] = df_tweets["date"].dt.month
df_tweets["day"] = df_tweets["date"].dt.day
df_tweets["hour"] = df_tweets["date"].dt.hour

df_tweets = df_tweets.drop("date", axis=1)
df_tweets

Unnamed: 0,text,user,retweets,likes,location,year,month,day,hour
0,Impressive visit to the @blueorigin Huntsville...,@SenBillNelson,854,4126,Maldives,2023,10,27,14
1,Nominal run!,@torybruno,424,3985,Belgium,2023,6,8,1
2,Honored to be on this journey with @NASA to la...,@JeffBezos,2131,13562,Iran,2023,5,19,15
3,Big milestone. Kudos and congrats to the whole...,@JeffBezos,837,5353,Zimbabwe,2023,2,13,20
4,Episode 3 of Last of Us is unbelievably good s...,@JeffBezos,1386,29744,Iran,2023,1,31,16
...,...,...,...,...,...,...,...,...,...
358,Congrats @SpaceX on landing Falcon's suborbita...,@JeffBezos,1218,2478,United States,2015,12,22,1
359,Finally trashed by @realDonaldTrump. Will stil...,@JeffBezos,6748,8753,Thailand,2015,12,7,23
360,What 400 very happy rocket scientists look lik...,@JeffBezos,457,839,France,2015,12,3,16
361,"Breakthrough Energy Coalition. When in a box, ...",@JeffBezos,224,577,Cuba,2015,11,30,22
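With the date split into separate columns, simple counts per year or per hour give a first look at temporal patterns. The sketch below is only an illustration on a five-row stand-in frame (its values are taken from rows shown above); on the real data, `sample` would be replaced by `df_tweets`.

```python
import pandas as pd

# Stand-in for the year and hour columns produced above
sample = pd.DataFrame({
    "year": [2023, 2023, 2015, 2015, 2015],
    "hour": [14, 1, 1, 23, 16],
})

# Number of tweets per calendar year and per hour of day (UTC)
tweets_per_year = sample.groupby("year").size()
tweets_per_hour = sample.groupby("hour").size()

print(tweets_per_year)
print(tweets_per_hour)
```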


In [15]:
stop_words = set(stopwords.words('english'))

def clean_text(text):
  # Keep only word tokens; this drops punctuation, whitespace, and emoji
  tokens = regexp_tokenize(text, pattern=r'\w+')
  # Remove English stop words (case-insensitive) and lowercase what remains
  return ' '.join(token.lower() for token in tokens if token.lower() not in stop_words)

df_tweets["text"] = df_tweets["text"].apply(clean_text)

# Drop tweets that are empty after cleaning; .copy() avoids SettingWithCopyWarning
df_tweets = df_tweets[df_tweets["text"].str.len() > 0].copy()

df_tweets


Unnamed: 0,text,user,retweets,likes,location,year,month,day,hour
0,impressive visit blueorigin huntsvil...,@SenBillNelson,854,4126,Maldives,2023,10,27,14
1,nominal run,@torybruno,424,3985,Belgium,2023,6,8,1
2,honored journey nasa land ...,@JeffBezos,2131,13562,Iran,2023,5,19,15
3,big milestone kudos congrats who...,@JeffBezos,837,5353,Zimbabwe,2023,2,13,20
4,episode 3 last us unbelievably ...,@JeffBezos,1386,29744,Iran,2023,1,31,16
...,...,...,...,...,...,...,...,...,...
358,congrats spacex landing falcon subor...,@JeffBezos,1218,2478,United States,2015,12,22,1
359,finally trashed realdonaldtrump stil...,@JeffBezos,6748,8753,Thailand,2015,12,7,23
360,400 happy rocket scientists look ...,@JeffBezos,457,839,France,2015,12,3,16
361,breakthrough energy coalition box ...,@JeffBezos,224,577,Cuba,2015,11,30,22
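To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same idea (word tokenization, stop-word removal, lowercasing) to a single string. It is an approximation rather than the notebook's code: it uses `re.findall` instead of NLTK, and `STOP_WORDS` is a tiny hand-written stand-in for NLTK's English stop-word list.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, for illustration only)
STOP_WORDS = {"to", "be", "on", "this", "with", "the"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep word tokens, drop stop words, and lowercase the rest."""
    tokens = re.findall(r"\w+", text)  # punctuation, '@', and emoji are not matched
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

# "\U0001F680" is the rocket emoji
cleaned = clean_text("Honored to be on this journey with @NASA \U0001F680")
print(cleaned)  # honored journey nasa
```

One consequence worth keeping in mind is that mentions lose their `@` and contractions are split at the apostrophe, so the cleaned text is no longer the exact input a tweet-trained model would normally see.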


## Step 4: Creating a sentiment field from the 'text' field

In this step, we will analyze the "text" field to determine the sentiment of each entry. Using the pretrained cardiffnlp/twitter-roberta-base-sentiment model, we classify each tweet as negative, neutral, or positive. The resulting labels give an overview of the emotional tone of the tweets.
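The classification boils down to turning the model's three raw scores (logits) into probabilities with softmax and taking the most probable class. The sketch below shows that step in isolation with NumPy only. The logits are made up, and the label order (0 = Negative, 1 = Neutral, 2 = Positive) follows the mapping used in this notebook.

```python
import numpy as np

LABELS = ["Negative", "Neutral", "Positive"]  # index order used in this notebook

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max improves numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def label_from_logits(logits):
    """Return the most probable label together with all class probabilities."""
    probs = softmax(np.asarray(logits, dtype=float))
    return LABELS[int(np.argmax(probs))], probs

label, probs = label_from_logits([-1.0, 0.5, 2.0])  # made-up logits
print(label, probs.round(3))
```

Keeping the probabilities, and not only the winning label, makes it possible to flag low-confidence predictions later.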


In [18]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Index order of the model's output classes
labels = ["Negative", "Neutral", "Positive"]
sentiments = []

for text in df_tweets["text"]:
  encoded_text = tokenizer(text, return_tensors='pt')
  # Inference only, so gradient tracking is not needed
  with torch.no_grad():
    output = model(**encoded_text)
  scores = softmax(output[0][0].numpy())
  sentiments.append(labels[np.argmax(scores)])

df_tweets["sentiment"] = sentiments
# Move the sentiment column next to the text column
cols = df_tweets.columns.tolist()
cols.insert(1, cols.pop(cols.index("sentiment")))
df_tweets = df_tweets[cols]
df_tweets.to_csv("df_tweets.csv")
df_tweets



Unnamed: 0,text,sentiment,user,retweets,likes,location,year,month,day,hour
0,impressive visit blueorigin huntsvil...,Positive,@SenBillNelson,854,4126,Maldives,2023,10,27,14
1,nominal run,Neutral,@torybruno,424,3985,Belgium,2023,6,8,1
2,honored journey nasa land ...,Neutral,@JeffBezos,2131,13562,Iran,2023,5,19,15
3,big milestone kudos congrats who...,Positive,@JeffBezos,837,5353,Zimbabwe,2023,2,13,20
4,episode 3 last us unbelievably ...,Positive,@JeffBezos,1386,29744,Iran,2023,1,31,16
...,...,...,...,...,...,...,...,...,...,...
358,congrats spacex landing falcon subor...,Positive,@JeffBezos,1218,2478,United States,2015,12,22,1
359,finally trashed realdonaldtrump stil...,Neutral,@JeffBezos,6748,8753,Thailand,2015,12,7,23
360,400 happy rocket scientists look ...,Positive,@JeffBezos,457,839,France,2015,12,3,16
361,breakthrough energy coalition box ...,Neutral,@JeffBezos,224,577,Cuba,2015,11,30,22
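Once every tweet has a sentiment label, the interaction counts can be compared across labels, which is one way to approach the question of how likes and retweets relate to emotional tone. The sketch below is a possible starting point, not a result: it uses a five-row stand-in frame with the engagement numbers of the first five rows above, and on the real data `sample` would be replaced by `df_tweets`.

```python
import pandas as pd

# Stand-in frame with the first five rows of the table above
sample = pd.DataFrame({
    "sentiment": ["Positive", "Neutral", "Neutral", "Positive", "Positive"],
    "retweets": [854, 424, 2131, 837, 1386],
    "likes": [4126, 3985, 13562, 5353, 29744],
})

# How many tweets fall into each sentiment class
counts = sample["sentiment"].value_counts()

# Average engagement per sentiment class
summary = sample.groupby("sentiment")[["retweets", "likes"]].mean()

print(counts)
print(summary)
```

Like counts are usually heavily skewed by a few viral tweets, so the median may be a more robust summary than the mean on the full data set.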
