## Project Outline:

- Start by deriving the monthly aggregated social media sentiment.
- Use semantic methods on the tweets to determine the dominant sentiment associated with each event, and plot it for easier visual interpretation.
- Categorize the events into domains such as politics and entertainment.
- Use the most suitable correlation method to analyze the strength of the correlation between Google's stock prices and the monthly aggregated sentiment. (Both series should be grouped by month for uniformity.)
- Find out which event had the strongest correlation with Google's stock price.
- Use all of this data to predict Google's future stock prices using time-series analysis.
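
The monthly grouping in the outline could be sketched roughly as follows. This is only an illustration: the helper name `monthly_sentiment_vs_close` is hypothetical, and it assumes a per-tweet `polarity` column (such as the one computed with TextBlob later in this notebook) alongside the `Timestamp`, `date` and `4. close` columns of the two datasets. Since the tweets here span only a few months, any correlation computed on the monthly table would rest on very few points.

```python
import pandas as pd

def monthly_sentiment_vs_close(tweets: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
    """Average tweet polarity and closing price per calendar month (hypothetical helper)."""
    # Monthly periods keep year and month together in a single key
    tweet_month = pd.to_datetime(tweets["Timestamp"]).dt.to_period("M")
    price_month = pd.to_datetime(prices["date"]).dt.to_period("M")

    monthly_polarity = tweets["polarity"].groupby(tweet_month).mean().rename("mean_polarity")
    monthly_close = prices["4. close"].groupby(price_month).mean().rename("mean_close")

    # An inner join keeps only the months present in both series
    return pd.concat([monthly_polarity, monthly_close], axis=1, join="inner")
```

The resulting table has one row per month, so `table["mean_polarity"].corr(table["mean_close"])` would give the monthly correlation the outline asks for.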

In [1]:
!pip install tqdm



In [23]:
# import statements

import pandas as pd
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from tqdm.auto import tqdm
import plotly.express as px

In [3]:
google_dataset = pd.read_csv("/kaggle/input/google-daily-stock-prices-2004-today/googl_daily_prices.csv")
twitter_dataset = pd.read_csv("/kaggle/input/twitter-dataset/twitter_dataset.csv")

In [4]:
# conduct eda on the twitter dataset to understand it better
twitter_dataset.columns

Index(['Tweet_ID', 'Username', 'Text', 'Retweets', 'Likes', 'Timestamp'], dtype='object')

In [5]:
# conduct sentiment analysis using TextBlob on the following tweets

twitter_dataset["Text"]

0       Party least receive say or single. Prevent pre...
1       Hotel still Congress may member staff. Media d...
2       Nice be her debate industry that year. Film wh...
3       Laugh explain situation career occur serious. ...
4       Involve sense former often approach government...
                              ...                        
9995    Agree reflect military box ability ever hold. ...
9996    Born which push still. Degree sometimes contro...
9997    You day agent likely region. Teacher data mess...
9998    Guess without successful save. Particular natu...
9999    Body onto understand team about product beauti...
Name: Text, Length: 10000, dtype: object

In [6]:
from textblob import TextBlob

# create two columns for polarity and subjectivity

# TextBlob's .sentiment exposes .polarity and .subjectivity;
# compute it once per tweet, then assign each column in a single step
sentiments = [TextBlob(tweet).sentiment for tweet in twitter_dataset["Text"]]
twitter_dataset["polarity"] = [s.polarity for s in sentiments]
twitter_dataset["subjectivity"] = [s.subjectivity for s in sentiments]

In [7]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,polarity,subjectivity
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,0.115714,0.552857
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,0.308333,0.558333
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,0.220000,0.600000
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,0.054762,0.428571
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,0.033333,0.133333
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,-0.150000,0.550000
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,0.046667,0.586667
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,-0.090476,0.378571
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,0.253770,0.506944


In [12]:
# label each tweet with an emotion using the boltuix/bert-emotion model

# !git clone https://huggingface.co/boltuix/bert-emotion

sentiment_model = pipeline("text-classification", model="boltuix/bert-emotion")
twitter_dataset["sentiment"] = twitter_dataset["Text"].apply(lambda tweet: sentiment_model(tweet)[0]["label"])

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
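
The warning above appears because `apply` sends the tweets to the pipeline one at a time. A possible alternative, sketched here under the assumption that the classifier behaves like a Hugging Face text-classification pipeline (called with a list of strings and a `batch_size` keyword, returning one `{"label": ..., "score": ...}` dict per text), is to pass the whole column in one call. The helper name `label_texts` is hypothetical.

```python
from typing import Callable, List, Sequence

def label_texts(texts: Sequence[str], classifier: Callable, batch_size: int = 32) -> List[str]:
    """Return the top label for each text using a single batched call (hypothetical helper)."""
    # Assumption: classifier(list_of_str, batch_size=...) returns one dict per input text
    results = classifier(list(texts), batch_size=batch_size)
    return [result["label"] for result in results]

# Intended usage in this notebook (not executed here):
# twitter_dataset["sentiment"] = label_texts(twitter_dataset["Text"].tolist(), sentiment_model)
```

The best `batch_size` depends on the GPU memory available, so 32 is only a starting point.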


In [13]:
# write the twitter dataset to a csv file so the slow sentiment-labelling step does not have to be rerun

twitter_dataset.to_csv("twitter_dataset.csv", index=False)

In [14]:
twitter_dataset = pd.read_csv("/kaggle/working/twitter_dataset.csv")

In [15]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,polarity,subjectivity,sentiment
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,0.115714,0.552857,neutral
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,0.308333,0.558333,neutral
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,0.220000,0.600000,love
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,0.054762,0.428571,neutral
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,0.033333,0.133333,neutral
...,...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,-0.150000,0.550000,happiness
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,0.046667,0.586667,neutral
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,-0.090476,0.378571,neutral
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,0.253770,0.506944,happiness


In [16]:
import torch

print(torch.cuda.is_available())  # Should return True

True


In [36]:
# # categorize these tweets into different domains; add a domains column to the dataset
# from transformers import pipeline
# tqdm.pandas()

# classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
# candidate_labels = ["sports", "politics", "technology", "finance", "entertainment", "health", "education"]

# twitter_dataset["domain"] = twitter_dataset["Text"].progress_apply(lambda x: classifier(x, candidate_labels)["labels"][0])
  

In [12]:
twitter_dataset

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral,entertainment
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness,sports
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral,education
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment


In [35]:
# # write the twitter dataset into a csv file because it takes a while to load the dataset

# twitter_dataset.to_csv("twitter_dataset(domain).csv", index=False)

In [34]:
# twitter_dataset = pd.read_csv("/kaggle/working/twitter_dataset(domain).csv")

In [22]:
# plot out the popularity of emotions for each of the domains

twitter_updated = pd.read_csv("/kaggle/input/twitter-domain/twitter_dataset(domain).csv")
twitter_updated

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral,entertainment
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness,sports
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral,education
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment


In [51]:
plot_df = twitter_updated.groupby("domain")[["Retweets", "Likes"]].sum().reset_index()
melted_df = plot_df.melt(id_vars="domain", value_vars=["Retweets", "Likes"], var_name="Metric", value_name="Count")
melted_df

Unnamed: 0,domain,Metric,Count
0,education,Retweets,53067
1,entertainment,Retweets,267567
2,finance,Retweets,33998
3,health,Retweets,47347
4,politics,Retweets,46790
5,sports,Retweets,20850
6,technology,Retweets,27593
7,education,Likes,53729
8,entertainment,Likes,264161
9,finance,Likes,34754


In [52]:
fig = px.bar(melted_df, x="domain", y="Count", color="Metric")
fig.show()

In [37]:
import plotly.express as px

plot_df2 = twitter_updated.groupby(["sentiment", "domain"]).size().reset_index(name="count")

fig = px.density_heatmap(
    plot_df2, 
    x="domain", 
    y="sentiment", 
    z="count", 
    color_continuous_scale="Blues", 
    title="Sentiment vs. Domain Heatmap"
)
fig.show()


In [55]:
# merge the twitter dataset and the google dataset after grouping by month and year in both the datasets

google_prices = pd.read_csv("/kaggle/input/google-daily-stock-prices-2004-today/googl_daily_prices.csv")
google_prices

Unnamed: 0,date,1. open,2. high,3. low,4. close,5. volume
0,2025-05-30,171.350,172.2050,167.4400,171.740,52639911.0
1,2025-05-29,174.000,174.4193,170.6300,171.860,29373803.0
2,2025-05-28,173.160,175.2650,171.9107,172.360,34783997.0
3,2025-05-27,170.160,173.1700,170.0000,172.900,37995670.0
4,2025-05-23,169.055,169.9600,167.8900,168.470,35211439.0
...,...,...,...,...,...,...
5224,2004-08-25,104.760,108.0000,103.8800,106.000,9188600.0
5225,2004-08-24,111.240,111.6000,103.5700,104.870,15247300.0
5226,2004-08-23,110.760,113.4800,109.0500,109.400,18256100.0
5227,2004-08-20,101.010,109.0800,100.5000,108.310,22834300.0


In [56]:
# add separate columns for day, month, year, weekend/weekday, time, hour, etc for the boosting models to be able to identify patterns better
twitter_updated

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral,entertainment
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics
...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness,sports
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral,education
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment


In [60]:
twitter_updated["Timestamp"] = pd.to_datetime(twitter_updated["Timestamp"])

twitter_updated["year"] = twitter_updated["Timestamp"].dt.year
twitter_updated["month"] = twitter_updated["Timestamp"].dt.month
twitter_updated["day"] = twitter_updated["Timestamp"].dt.day
twitter_updated["day_of_week"] = twitter_updated["Timestamp"].dt.dayofweek  # Monday=0, Sunday=6
twitter_updated["is_weekend"] = twitter_updated["day_of_week"].isin([5, 6]).astype(int)
twitter_updated["hour"] = twitter_updated["Timestamp"].dt.hour
twitter_updated["minute"] = twitter_updated["Timestamp"].dt.minute
twitter_updated["second"] = twitter_updated["Timestamp"].dt.second
twitter_updated["time_of_day"] = twitter_updated["hour"].apply(
    lambda h: "morning" if 5 <= h < 12 else
              "afternoon" if 12 <= h < 17 else
              "evening" if 17 <= h < 21 else
              "night"
)


twitter_updated

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain,year,month,day,day_of_week,is_weekend,hour,minute,second,time_of_day
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment,2023,1,30,0,0,11,0,51,morning
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58,neutral,entertainment,2023,1,2,0,0,22,45,58,night
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment,2023,1,18,2,0,11,25,19,morning
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment,2023,4,10,0,0,22,6,29,night
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics,2023,1,24,1,0,7,12,21,morning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,ntate,Agree reflect military box ability ever hold. ...,81,86,2023-01-15 11:46:20,happiness,sports,2023,1,15,6,1,11,46,20,morning
9996,9997,garrisonjoshua,Born which push still. Degree sometimes contro...,73,100,2023-05-06 00:46:54,neutral,education,2023,5,6,5,1,0,46,54,night
9997,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education,2023,2,27,0,0,14,55,8,afternoon
9998,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment,2023,1,9,0,0,16,9,35,afternoon


In [70]:
google_prices["date"] = pd.to_datetime(google_prices["date"])
google_prices["year"] = google_prices["date"].dt.year
google_prices["month"] = google_prices["date"].dt.month
google_prices["day"] = google_prices["date"].dt.day
google_prices["day_of_week"] = google_prices["date"].dt.dayofweek  
google_prices["is_weekend"] = google_prices["day_of_week"].isin([5, 6]).astype(int)

google_prices

Unnamed: 0,date,1. open,2. high,3. low,4. close,5. volume,year,month,day,day_of_week,is_weekend
0,2025-05-30,171.350,172.2050,167.4400,171.740,52639911.0,2025,5,30,4,0
1,2025-05-29,174.000,174.4193,170.6300,171.860,29373803.0,2025,5,29,3,0
2,2025-05-28,173.160,175.2650,171.9107,172.360,34783997.0,2025,5,28,2,0
3,2025-05-27,170.160,173.1700,170.0000,172.900,37995670.0,2025,5,27,1,0
4,2025-05-23,169.055,169.9600,167.8900,168.470,35211439.0,2025,5,23,4,0
...,...,...,...,...,...,...,...,...,...,...,...
5224,2004-08-25,104.760,108.0000,103.8800,106.000,9188600.0,2004,8,25,2,0
5225,2004-08-24,111.240,111.6000,103.5700,104.870,15247300.0,2004,8,24,1,0
5226,2004-08-23,110.760,113.4800,109.0500,109.400,18256100.0,2004,8,23,0,0
5227,2004-08-20,101.010,109.0800,100.5000,108.310,22834300.0,2004,8,20,4,0


In [82]:
# merge both the datasets

merged_df = twitter_updated.merge(google_prices, on = ["year", "month", "day"])
merged_df

Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain,year,month,...,second,time_of_day,date,1. open,2. high,3. low,4. close,5. volume,day_of_week_y,is_weekend_y
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment,2023,1,...,51,morning,2023-01-30,97.48,98.2900,96.395,96.94,27226198.0,0,0
1,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment,2023,1,...,19,morning,2023-01-18,92.14,92.7999,90.640,91.12,29116691.0,2,0
2,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment,2023,4,...,29,night,2023-04-10,106.98,107.5900,105.120,106.44,27067355.0,0,0
3,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics,2023,1,...,21,morning,2023-01-24,98.10,99.6100,97.200,97.70,33078512.0,1,0
4,6,ramirezmikayla,Cell without report weight. Could father chang...,22,75,2023-03-30 09:56:07,neutral,finance,2023,3,...,7,morning,2023-03-30,100.91,101.1550,99.780,100.89,33086183.0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6880,9993,holly83,Including father some level in. Mr born claim ...,64,54,2023-02-24 01:31:29,desire,entertainment,2023,2,...,29,night,2023-02-24,89.44,89.8900,88.575,89.13,36585093.0,4,0
6881,9995,rthornton,Boy deal wrong sport. We maintain game languag...,20,22,2023-01-31 13:27:50,neutral,sports,2023,1,...,50,afternoon,2023-01-31,96.87,98.8800,96.820,98.84,29870669.0,1,0
6882,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education,2023,2,...,8,afternoon,2023-02-27,89.87,90.1600,89.335,89.87,27502302.0,0,0
6883,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment,2023,1,...,35,afternoon,2023-01-09,88.36,90.0500,87.860,88.02,29003901.0,0,0


In [80]:
# check for null open values; the inner merge above already drops tweets posted on non-trading days, so none are expected

merged_df["1. open"].isnull().value_counts()

1. open
False    6885
Name: count, dtype: int64
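
Because `merge` defaults to an inner join, tweets posted on weekends or market holidays are dropped rather than kept with null prices, which is why the check above finds no nulls. To see how many tweets fall on non-trading days, one option is a left merge with `indicator=True`. The sketch below assumes both frames carry the `year`, `month` and `day` columns built earlier; the helper name `split_by_trading_day` is hypothetical.

```python
import pandas as pd

def split_by_trading_day(tweets: pd.DataFrame, prices: pd.DataFrame):
    """Split tweets into those with and without a matching price row (hypothetical helper)."""
    keys = ["year", "month", "day"]
    # A left merge keeps every tweet; `_merge` records whether a price row was found
    merged = tweets.merge(prices, on=keys, how="left", indicator=True)
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    unmatched = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    return matched, unmatched
```

Here `len(unmatched)` counts the tweets posted on days with no price data, and the price columns of `unmatched` are all null.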

In [86]:
merged_df["MA_20"] = merged_df["4. close"].rolling(window = 20).mean()
merged_df["STD_20"] = merged_df["4. close"].rolling(window = 20).std()
merged_df[:20]



Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain,year,month,...,date,1. open,2. high,3. low,4. close,5. volume,day_of_week_y,is_weekend_y,MA_20,STD_20
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment,2023,1,...,2023-01-30,97.48,98.29,96.395,96.94,27226198.0,0,0,,
1,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment,2023,1,...,2023-01-18,92.14,92.7999,90.64,91.12,29116691.0,2,0,,
2,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment,2023,4,...,2023-04-10,106.98,107.59,105.12,106.44,27067355.0,0,0,,
3,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics,2023,1,...,2023-01-24,98.1,99.61,97.2,97.7,33078512.0,1,0,,
4,6,ramirezmikayla,Cell without report weight. Could father chang...,22,75,2023-03-30 09:56:07,neutral,finance,2023,3,...,2023-03-30,100.91,101.155,99.78,100.89,33086183.0,3,0,,
5,9,turneredgar,Reveal table state view manager she. Fly yeah ...,15,26,2023-03-24 15:17:03,neutral,entertainment,2023,3,...,2023-03-24,104.99,105.49,103.84,105.44,30411043.0,4,0,,
6,10,audreymooney,List allow family rather continue. Agency mess...,97,28,2023-02-01 20:32:07,neutral,entertainment,2023,2,...,2023-02-01,98.71,101.19,97.58,100.43,35531104.0,2,0,,
7,11,timothyhardy,Image simply article list event imagine want r...,82,0,2023-03-01 08:31:29,neutral,entertainment,2023,3,...,2023-03-01,89.98,91.03,89.67,90.36,31111225.0,2,0,,
8,12,qdavis,You hold central. Seem miss look very. None hi...,99,97,2023-02-07 13:22:19,neutral,entertainment,2023,2,...,2023-02-07,103.22,108.18,103.12,107.64,49010230.0,1,0,,
9,13,davidgarcia,Paper but then field audience. Read pick sudde...,12,99,2023-03-15 07:21:43,neutral,politics,2023,3,...,2023-03-15,93.22,96.93,92.64,96.11,50622050.0,2,0,,
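
Note that `merged_df` holds one row per tweet, in tweet order, so the rolling window above spans 20 tweets rather than 20 trading days. If a 20-session moving average and standard deviation are intended, one option is to compute them on the daily price table, sorted by date, before merging. This is a sketch under that assumption; the helper name `add_rolling_stats` is hypothetical.

```python
import pandas as pd

def add_rolling_stats(prices: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Add rolling mean and std of the close over `window` trading days (hypothetical helper)."""
    # The price file is listed newest first, so sort into chronological order
    out = prices.sort_values("date").reset_index(drop=True)
    out[f"MA_{window}"] = out["4. close"].rolling(window=window).mean()
    out[f"STD_{window}"] = out["4. close"].rolling(window=window).std()
    return out
```

The first `window - 1` rows have no value because the window is not yet full; merging the tweets onto this table afterwards would carry one consistent daily value to every tweet from the same day.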


In [93]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Candlestick(
    x=merged_df["date"],
    open=merged_df["1. open"],
    high=merged_df["2. high"],
    low=merged_df["3. low"],
    close=merged_df["4. close"],
    name="Candlestick Graph"
))

fig.update_layout(
    title="Google Stock Prices with Candlestick Chart",
    xaxis_title="Date",
    yaxis_title="Price (USD)",
    xaxis_rangeslider_visible=False
)

fig.show()


In [134]:
monthly_vol = merged_df.groupby(["month", "year"], as_index=False)["STD_20"].mean()
monthly_vol["month_year"] = monthly_vol["year"].astype(str) + "-" + monthly_vol["month"].astype(str).str.zfill(2)

In [135]:
monthly_vol

Unnamed: 0,month,year,STD_20,month_year
0,1,2023,7.195996,2023-01
1,2,2023,7.154607,2023-02
2,3,2023,7.049839,2023-03
3,4,2023,7.107367,2023-04
4,5,2023,7.336349,2023-05


In [136]:
fig = px.line(
    monthly_vol,
    x="month_year",
    y="STD_20",
    title="Average Monthly Price Volatility (20-day Std Dev)"
)

fig.show()


In [158]:
daily_activity

Unnamed: 0,year,month,day,Retweets,5. volume
0,2023,1,3,4129,28131224.0
1,2023,1,4,3661,34854776.0
2,2023,1,5,3928,27194375.0
3,2023,1,6,3743,41381495.0
4,2023,1,9,3515,29003901.0
...,...,...,...,...,...
87,2023,5,9,3826,36360141.0
88,2023,5,10,3672,63153367.0
89,2023,5,11,4003,78900029.0
90,2023,5,12,4414,41102330.0


In [161]:
daily_activity = merged_df.groupby(["year", "month", "day"], as_index=False).agg({
    "Retweets": "sum",
    "Likes": "sum",  
    "5. volume": "mean"
})

corr = daily_activity["Retweets"].corr(daily_activity["5. volume"], method = "pearson")
corr


0.031858159990331515

## The number of retweets and the volume of Google shares traded show essentially no linear correlation (Pearson r ≈ 0.03)

In [166]:
daily_activity = merged_df.groupby(["year", "month", "day"], as_index=False).agg({
    "Retweets": "sum",
    "Likes": "sum",  
    "1. open": "mean"
})

corr = daily_activity["Retweets"].corr(daily_activity["1. open"], method = "pearson")
corr


-0.03642602325190193
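
Pearson's coefficient only measures linear association. As a cross-check, a rank-based (Spearman) coefficient can be obtained by applying Pearson to the ranks of each series. The sketch below is illustrative only, and the helper name `pearson_and_spearman` is hypothetical.

```python
import pandas as pd

def pearson_and_spearman(x: pd.Series, y: pd.Series) -> dict:
    """Return Pearson's r and Spearman's rho for two aligned series (hypothetical helper)."""
    pearson = x.corr(y)  # Series.corr uses Pearson by default
    # Spearman's rho is Pearson's r computed on ranks (ties get the average rank)
    spearman = x.rank().corr(y.rank())
    n_pairs = int((x.notna() & y.notna()).sum())
    return {"pearson": pearson, "spearman": spearman, "n": n_pairs}

# Intended usage in this notebook (not executed here):
# pearson_and_spearman(daily_activity["Retweets"], daily_activity["1. open"])
```

If both coefficients stay close to zero, the absence of a relationship is less likely to be an artifact of non-linearity or outliers.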

In [167]:
merged_df



Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,sentiment,domain,year,month,...,1. open,2. high,3. low,4. close,5. volume,day_of_week_y,is_weekend_y,MA_20,STD_20,month_year
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51,neutral,entertainment,2023,1,...,97.48,98.2900,96.395,96.94,27226198.0,0,0,,,2023-01
1,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19,love,entertainment,2023,1,...,92.14,92.7999,90.640,91.12,29116691.0,2,0,,,2023-01
2,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29,neutral,entertainment,2023,4,...,106.98,107.5900,105.120,106.44,27067355.0,0,0,,,2023-04
3,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21,neutral,politics,2023,1,...,98.10,99.6100,97.200,97.70,33078512.0,1,0,,,2023-01
4,6,ramirezmikayla,Cell without report weight. Could father chang...,22,75,2023-03-30 09:56:07,neutral,finance,2023,3,...,100.91,101.1550,99.780,100.89,33086183.0,3,0,,,2023-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6880,9993,holly83,Including father some level in. Mr born claim ...,64,54,2023-02-24 01:31:29,desire,entertainment,2023,2,...,89.44,89.8900,88.575,89.13,36585093.0,4,0,99.9925,7.562728,2023-02
6881,9995,rthornton,Boy deal wrong sport. We maintain game languag...,20,22,2023-01-31 13:27:50,neutral,sports,2023,1,...,96.87,98.8800,96.820,98.84,29870669.0,1,0,99.6360,7.433064,2023-01
6882,9998,adriennejackson,You day agent likely region. Teacher data mess...,10,62,2023-02-27 14:55:08,neutral,education,2023,2,...,89.87,90.1600,89.335,89.87,27502302.0,0,0,98.7085,7.436642,2023-02
6883,9999,kcarlson,Guess without successful save. Particular natu...,21,60,2023-01-09 16:09:35,happiness,entertainment,2023,1,...,88.36,90.0500,87.860,88.02,29003901.0,0,0,97.2810,6.510012,2023-01


In [180]:
prevailing_domains = merged_df[
    merged_df["domain"] == "technology"
].groupby(["year", "month", "day"], as_index=False).agg({
    "Retweets": "sum",
    "Likes": "sum",  
    "1. open": "mean"
})

corr = prevailing_domains["Retweets"].corr(prevailing_domains["1. open"], method = "pearson")
corr

0.10896390192900142

In [178]:
prevailing_domains = merged_df[
    merged_df["domain"].isin(["technology", "finance"])
].groupby(["year", "month", "day"], as_index=False).agg({
    "Retweets": "sum",
    "Likes": "sum",  
    "1. open": "mean"
})

corr = prevailing_domains["Retweets"].corr(prevailing_domains["1. open"], method = "pearson")
corr

0.18513138858366812
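
Rather than filtering one domain at a time, the same daily aggregation can be repeated for every domain and the coefficients compared side by side. The sketch below assumes `merged_df` has the `domain`, `year`, `month`, `day`, `Retweets` and `1. open` columns shown above; the helper name `retweet_open_corr_by_domain` and the `min_days` threshold are hypothetical.

```python
import pandas as pd

def retweet_open_corr_by_domain(merged: pd.DataFrame, min_days: int = 3) -> pd.Series:
    """Pearson r between daily retweets and open price, per domain (hypothetical helper)."""
    results = {}
    for domain, group in merged.groupby("domain"):
        # Same daily aggregation as the cells above, restricted to one domain
        daily = group.groupby(["year", "month", "day"], as_index=False).agg(
            {"Retweets": "sum", "1. open": "mean"}
        )
        # Skip domains with too few days for a meaningful coefficient
        if len(daily) >= min_days:
            results[domain] = daily["Retweets"].corr(daily["1. open"])
    return pd.Series(results, name="pearson_r").sort_values(ascending=False)
```

With only a few months of data, small differences between domains should be read cautiously, since some will look stronger than others by chance alone.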