## Project Outline:

- Start by deriving the monthly aggregated social media sentiment.
- Use semantic methods on the tweets to determine the major sentiment associated with each event, and plot it for easier visual interpretation.
- Categorize the events into different domains: political, entertainment, etc.
- Use the most suitable correlation method to analyze the strength of the correlation between Google's stock prices and the monthly aggregated sentiments. (Both should be grouped by month for uniformity.)
- Find out which event specifically had the strongest correlation with Google's stock price.
- Use all this data to predict Google's future stock prices using time-series analysis.

In [3]:
!pip install tqdm



In [4]:
# import statements

import pandas as pd
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm.auto import tqdm  # progress bars; a plain `import tqdm` would be shadowed by this name

In [5]:
google_dataset = pd.read_csv("/kaggle/input/google-daily-stock-prices-2004-today/googl_daily_prices.csv")
twitter_dataset = pd.read_csv("/kaggle/input/twitter-dataset/twitter_dataset.csv")

In [7]:
# conduct eda on the twitter dataset to understand it better
twitter_dataset.columns

Index(['Tweet_ID', 'Username', 'Text', 'Retweets', 'Likes', 'Timestamp'], dtype='object')

In [8]:
# conduct sentiment analysis using TextBlob on the following tweets

twitter_dataset["Text"]

0       Party least receive say or single. Prevent pre...
1       Hotel still Congress may member staff. Media d...
2       Nice be her debate industry that year. Film wh...
3       Laugh explain situation career occur serious. ...
4       Involve sense former often approach government...
                              ...                        
9995    Agree reflect military box ability ever hold. ...
9996    Born which push still. Degree sometimes contro...
9997    You day agent likely region. Teacher data mess...
9998    Guess without successful save. Particular natu...
9999    Body onto understand team about product beauti...
Name: Text, Length: 10000, dtype: object

In [10]:
from textblob import TextBlob

# create two columns for polarity and subjectivity
# (one TextBlob pass per tweet, then column-wise assignment instead of row-by-row .loc writes)

sentiments = twitter_dataset["Text"].apply(lambda tweet: TextBlob(tweet).sentiment)
twitter_dataset["polarity"] = sentiments.apply(lambda s: s.polarity)
twitter_dataset["subjectivity"] = sentiments.apply(lambda s: s.subjectivity)

In [11]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,polarity,subjectivity
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,0.115714,0.552857
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,0.308333,0.558333
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,0.220000,0.600000
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,0.054762,0.428571
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,0.033333,0.133333
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,-0.150000,0.550000
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,0.046667,0.586667
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,-0.090476,0.378571
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,0.253770,0.506944


In [6]:
# label each tweet with an emotion using the boltuix/bert-emotion model

# !git clone https://huggingface.co/boltuix/bert-emotion

from transformers import pipeline  # needed here; otherwise `pipeline` is undefined in this cell

sentiment_model = pipeline("text-classification", model="boltuix/bert-emotion")
twitter_dataset["sentiment"] = twitter_dataset["Text"].apply(lambda tweet: sentiment_model(tweet)[0]["label"])

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
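
The warning above suggests passing many texts to the pipeline at once instead of making one call per row. A minimal sketch of that idea, assuming the classifier accepts a list of strings and returns one `{"label": ...}` dict per text (which is how a `transformers` text-classification pipeline behaves when called on a list). `label_in_batches` and `fake_classifier` are hypothetical names used only for illustration:

```python
def label_in_batches(texts, classify, batch_size=32):
    """Return the top label for each text, calling `classify` once per batch."""
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # `classify` is assumed to return one {"label": ..., "score": ...} dict per text
        labels.extend(result["label"] for result in classify(batch))
    return labels


def fake_classifier(batch):
    # Stand-in for the real model so the sketch runs without a download
    return [{"label": "neutral", "score": 1.0} for _ in batch]


example_labels = label_in_batches(
    ["first tweet", "second tweet", "third tweet"], fake_classifier, batch_size=2
)
print(example_labels)
```

With the real model, something like `label_in_batches(twitter_dataset["Text"].tolist(), sentiment_model)` should work; the pipeline also takes a `batch_size` argument directly when it is called on a list.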


In [7]:
# save the labelled twitter dataset to a csv file, because the sentiment labelling takes a while to rerun

twitter_dataset.to_csv("twitter_dataset.csv", index=False)

In [8]:
twitter_dataset = pd.read_csv("/kaggle/working/twitter_dataset.csv")

In [9]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral
...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness


In [10]:
import torch

print(torch.cuda.is_available())  # Should return True

True


In [None]:
# categorize these tweets into different domains; add a domains column to the dataset
from transformers import pipeline
tqdm.pandas()

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
candidate_labels = ["sports", "politics", "technology", "finance", "entertainment", "health", "education"]

twitter_dataset["domain"] = twitter_dataset["Text"].progress_apply(lambda x: classifier(x, candidate_labels)["labels"][0])
  

Device set to use cuda:0



In [12]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral,entertainment
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness,sports
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral,education
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment


In [13]:
# save the dataset with the new domain column, because the zero-shot classification takes a while to rerun

twitter_dataset.to_csv("twitter_dataset(domain).csv", index=False)

In [14]:
twitter_dataset = pd.read_csv("/kaggle/working/twitter_dataset(domain).csv")

In [None]:
# plot out the popularity of emotions for each of the domains
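
One possible way to fill in this cell: count tweets per (domain, emotion) pair with `pd.crosstab` and draw a stacked bar chart. The small frame below is made-up illustration data standing in for `twitter_dataset`; only the `domain` and `sentiment` column names come from the notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample rows; in the notebook this would be twitter_dataset
tweets = pd.DataFrame({
    "domain": ["politics", "politics", "sports", "sports", "sports", "finance"],
    "sentiment": ["neutral", "neutral", "happiness", "neutral", "happiness", "love"],
})

# Rows are domains, columns are emotions, cells are tweet counts
emotion_counts = pd.crosstab(tweets["domain"], tweets["sentiment"])

# One bar per domain, with a coloured segment per emotion
ax = emotion_counts.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_xlabel("Domain")
ax.set_ylabel("Number of tweets")
ax.set_title("Emotion counts per domain")
plt.tight_layout()
```

Passing `normalize="index"` to `pd.crosstab` would show shares instead of raw counts, which may be easier to compare when the domains differ a lot in size.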

In [None]:
# find the total number of retweets and likes for each domain
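
A minimal sketch of this step, assuming the `domain`, `Retweets`, and `Likes` columns shown earlier. The rows below are made-up illustration data standing in for `twitter_dataset`:

```python
import pandas as pd

# Made-up sample rows; in the notebook this would be twitter_dataset
tweets = pd.DataFrame({
    "domain": ["entertainment", "entertainment", "politics", "sports"],
    "Retweets": [2, 35, 27, 81],
    "Likes": [25, 29, 80, 86],
})

# Sum both engagement columns within each domain, most-liked domain first
engagement = (
    tweets.groupby("domain")[["Retweets", "Likes"]]
    .sum()
    .sort_values("Likes", ascending=False)
)
print(engagement)
```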

In [None]:
# merge the twitter dataset and the google dataset after grouping by month and year in both the datasets
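
A sketch of how the monthly merge might look. The Google dataset's columns are not shown in this notebook, so `date` and `close` are assumed names, and the price values are made up. The tweet side assumes a numeric `polarity` column such as the TextBlob one computed earlier.

```python
import pandas as pd

# Made-up sample rows; in the notebook these would be twitter_dataset and google_dataset
tweets = pd.DataFrame({
    "Timestamp": ["2023-01-30 11:00:51", "2023-01-02 22:45:58",
                  "2023-02-27 14:55:08", "2023-04-10 22:06:29"],
    "polarity": [0.1, 0.3, -0.1, 0.05],
})
prices = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-31", "2023-02-01", "2023-03-01"],  # assumed column name
    "close": [90.0, 100.0, 98.0, 92.0],                                # assumed column name
})

# Reduce both timestamps to a year-month period so they share one key
tweets["month"] = pd.to_datetime(tweets["Timestamp"]).dt.to_period("M")
prices["month"] = pd.to_datetime(prices["date"]).dt.to_period("M")

# One row per month on each side: mean polarity and mean closing price
monthly_sentiment = tweets.groupby("month", as_index=False)["polarity"].mean()
monthly_close = prices.groupby("month", as_index=False)["close"].mean()

# Keep only months present in both datasets
merged = monthly_sentiment.merge(monthly_close, on="month", how="inner")
print(merged)
```

An inner join drops months that appear in only one dataset; `how="left"` would keep every tweet month and leave the price empty where no trading data exists.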

In [None]:
# Analyze the correlation between Google stock prices and tweet emotions across different domains
# to identify which domain's emotional activity most closely tracks market movements.
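
A rough sketch of the correlation step, assuming a monthly table with one numeric sentiment column per domain plus a `close` column. All names and numbers below are made up for illustration. Pearson measures linear association, while Spearman is computed here as Pearson on ranks, which is its definition and avoids extra dependencies.

```python
import pandas as pd

# Made-up monthly values; the per-domain columns could hold, for example,
# mean polarity or the share of non-neutral tweets in that domain
monthly = pd.DataFrame({
    "month": pd.period_range("2023-01", periods=6, freq="M"),
    "politics": [0.10, 0.15, 0.18, 0.12, 0.25, 0.30],
    "sports": [0.30, 0.10, 0.20, 0.25, 0.15, 0.05],
    "close": [90.0, 95.0, 98.0, 92.0, 105.0, 110.0],
}).set_index("month")

domain_columns = ["politics", "sports"]

# Pearson correlation of each domain column with the closing price
pearson = monthly[domain_columns].corrwith(monthly["close"])

# Spearman correlation: rank every column, then take Pearson on the ranks
ranks = monthly.rank()
spearman = ranks[domain_columns].corrwith(ranks["close"])

# Domain whose sentiment tracks the price most closely, in either direction
strongest_domain = spearman.abs().idxmax()
print(pearson, spearman, strongest_domain, sep="\n")
```

With only a handful of monthly points, any coefficient is very noisy, so the result is better read as a hint than as evidence of a relationship.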