# Metaverse Twitter Analysis

## Motivation

As students growing up in a digital age, we are intrigued to see what the Metaverse amounts to. The hype around the Metaverse feels reminiscent of the creation of “The Internet” in the ’70s and carries the same mystique, with little known about Meta’s approach (Ravenscraft). In our analysis, we investigate the metadata of tweets about the Metaverse posted on Twitter. We expect the analysis to give us a better idea of the trajectory of the Metaverse and of the consumer demographics expressing interest in the technology. Our group also wanted to experiment with using the Twitter API to generate our own data and to familiarize ourselves with real-world field data. Our inspiration for this idea and code came from “Metaverse Tweets”, a Kaggle dataset that is linked below and cited in our code.

Ravenscraft, Eric. “What Is the Metaverse, Exactly?” Wired, 25 Nov. 2021, www.wired.com/story/what-is-the-metaverse.

Inspiration: https://www.kaggle.com/mathurinache/metaverse-tweets/version/1

## Related Work

The concept of the Metaverse has been all over social media and continues to gain traction. It has been controversial in a number of online communities in terms of what it contributes to its users, to the tech community, and, more broadly, to society. No particular article shaped our view of this topic, but the Wired article mentioned above did provide an interesting analogy for why people, us included, are intrigued. The Metaverse carries the same buzz as the start of the Internet, which means opinions on its success can be polarized (Ravenscraft). This is an aspect of the data our group was hoping to capture.

The dataset we are using for our project was generated by a Python script using the Twitter API. This method is used in other analyses, such as the article by Özgür Ağrali, a researcher at Muğla Sıtkı Koçman University in the department of Artificial Intelligence, and Ömer Aydın. In the article “Tweet Classification and Sentiment Analysis on Metaverse Related Messages”, the authors take a sentiment analysis approach and find an increasing quantity of tweets about the Metaverse in the days after Mark Zuckerberg’s highly publicized company name change to “Meta”. They also note an increase in neutral to negative tweets after Zuckerberg’s speech. In a similar light, we hope to provide insight into consumers’ perceptions of the Metaverse and its potential success. However, our analysis focuses on the users’ metadata rather than the sentiment of their tweets, since the main focus of this research is data analytics rather than machine learning. We also drew on another article, “Identifying Correlated Bots in Twitter” by Nikan Chavoshi, Senior Member of the Technical Staff at Oracle Labs. From this paper, we gathered key assumptions about Twitter bots, which we used in our analysis.

Ağrali, Özgür & Aydın, Ömer. (2021). Tweet Classification and Sentiment Analysis on Metaverse Related Messages. Journal of Metaverse, 1, 25-30.

Chavoshi, N. (2016). Identifying Correlated Bots in Twitter. Retrieved March 20, 2022, from https://www.cs.unm.edu/~chavoshi/debot/SocInfo.pdf


## Data

The dataset for this project was generated by a Python script that queries the Twitter API, inspired by the Kaggle “Metaverse Tweets” dataset linked above. Each row is one tweet and carries the user’s metadata (`user_name`, `user_location`, `user_description`, `user_created`, `user_followers`, `user_friends`, `user_favourites`, `user_verified`) alongside the tweet itself (`date`, `text`, `hashtags`, `source`, `is_retweet`). The tweets span only a few hours of a single day.
 

## Questions

For particular questions, we are making certain assumptions in order to generalize our data analysis.

Our first question is: “Are most of the tweets posted by users whose usernames or user descriptions relate to the Metaverse?” In other words, is activity driven by accounts dedicated to the Metaverse (and, by extension, the users behind those accounts), or by the average Twitter user?

For this question, we grouped the hashtags and usernames by three keywords: ‘nft’, ‘meta’, and ‘crypto’. We chose these keywords after visually inspecting the frequency of hashtags and usernames. After checking for the presence of these words, we manually placed the remaining tags into the three buckets based on their association with the terms.
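
The automated first pass of this bucketing can be sketched as follows. This is a simplified illustration rather than the notebook's exact procedure: the function name `bucket_counts` and the example tags are made up, the manual second pass is omitted, and a tag that contains several keywords is counted once per keyword.

```python
from collections import Counter

# The three primary keywords named in the text above
KEYWORDS = ("nft", "meta", "crypto")

def bucket_counts(tags, keywords=KEYWORDS):
    """Count how many tags contain each keyword; unmatched tags go to 'other'."""
    totals = {kw: 0 for kw in keywords}
    totals["other"] = 0
    # Lowercase first so that 'NFT' and 'nft' are treated as the same tag
    for tag, n in Counter(t.lower() for t in tags).items():
        matched = [kw for kw in keywords if kw in tag]
        if matched:
            for kw in matched:  # a tag may match more than one keyword
                totals[kw] += n
        else:
            totals["other"] += n
    return totals

# Illustrative tags only, not taken from the dataset
example = ["NFT", "Metaverse", "NFTs", "CryptoNews", "GameFi", "metaverse"]
print(bucket_counts(example))
# {'nft': 2, 'meta': 2, 'crypto': 1, 'other': 1}
```

Substring matching is deliberately loose here: it catches variants such as "nfts" but can also match unrelated words that happen to contain a keyword.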

Another question our group wanted to tackle was identifying, filtering, and studying bots in our dataset, as well as their Twitter activity, to analyze their interests.

To answer this question, we used the paper “Identifying Correlated Bots in Twitter” to choose thresholds for our bot detection algorithm. For example, a user who tweeted twice within 40 seconds or tweeted more than 40 times in an hour is flagged as a bot. Our algorithm is very restrictive with bot classification and prioritizes minimizing false positives.
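
A minimal sketch of this rule is shown below, under some stated assumptions. The function name `flag_bots` and the sample rows are hypothetical, the tweet count is taken over the whole sample rather than per hour (a simplification that is only reasonable because the data covers a few hours), and a gap of zero seconds is treated as bot-like.

```python
import pandas as pd

def flag_bots(tweets, max_tweets=40, min_gap_seconds=40):
    """Return the set of user names that match either heuristic threshold."""
    ordered = tweets.assign(date=pd.to_datetime(tweets["date"])).sort_values("date")
    flagged = set()
    for user, group in ordered.groupby("user_name"):
        if len(group) < 2:
            continue  # a single tweet gives no gap to measure
        # Smallest time between consecutive tweets by this user, in seconds
        smallest_gap = group["date"].diff().dt.total_seconds().min()
        if len(group) > max_tweets or smallest_gap < min_gap_seconds:
            flagged.add(user)
    return flagged

# Illustrative rows only
sample = pd.DataFrame({
    "user_name": ["a", "a", "b", "b", "c"],
    "date": ["2022-02-27 23:00:00", "2022-02-27 23:00:10",
             "2022-02-27 22:00:00", "2022-02-27 23:00:00",
             "2022-02-27 23:30:00"],
})
print(flag_bots(sample))  # {'a'}
```

User "a" is flagged because two tweets are 10 seconds apart; "b" tweets an hour apart and "c" tweets only once, so neither is flagged.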

Our last question is: “Which geographic regions drive the most Twitter traffic/activity on the Metaverse?”

The assumption made for this question relates to user-supplied locations. Because the Twitter API does not expose most users’ actual locations, our analysis relied on the location each user entered in their profile. This required us to discard any tweets with null locations or with locations that do not map to a specific geographic place. We also cleaned the location fields of hashtags and emoticons.

Many of these questions can be answered by analyzing just a few individual variables in the dataset once it is cleaned as described in the Data section of this project proposal. However, this is not true of all of the questions.

## HashTags and Username Analysis

In [195]:
# Standard library
import re
from collections import Counter
from functools import partial

# Third-party packages
import numpy as np
import pandas as pd
import altair as alt
import folium
import spacy
from dataprep.clean import validate_country
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm

# import NLP  # not used anywhere below; left disabled

# Load the small English spaCy model
nlp = spacy.load('en_core_web_sm')

# Register DataFrame.progress_apply so long-running applies show a progress bar
tqdm.pandas()

First five rows of the data from the csv file

In [196]:
df = pd.read_csv('metaverse_tweets.csv')
df.head(5)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Diep Ngoc,,,2021-08-22 09:53:23+00:00,162,129,1816,False,2022-02-27 23:59:46+00:00,@KingArthurrNFT My choice is #Elemon 😊🤩😍🥰\n#BS...,"['Elemon', 'BSC', 'NFT', 'GameFi', 'Metaverse'...",Twitter for Android,False
1,Duncs,"Melbourne, Australia",NFT's + Crypto + Metaverse + Outdoors + Goodtimes,2012-11-28 01:47:19+00:00,745,1363,2583,False,2022-02-27 23:59:33+00:00,Can't wait to hear this #Bapesclan \n\n#BapesM...,"['Bapesclan', 'BapesMetavestor', 'Metaverse']",Twitter Web App,False
2,Putik Gita,"Banyuwangi, Indonesia",just ordinary woman,2016-04-07 03:31:10+00:00,21,130,65,False,2022-02-27 23:59:26+00:00,@SuperStarPad @binance @Nuls good project.. ho...,,Twitter for Android,False
3,🍑𝗔𝗕𝗗 𝗖𝗫 𝟏𝟎𝟎𝐊🍑,"Dhaka, Bangladesh",🍑🌱🍑𝐅𝐎𝐋𝐋𝐎𝐖 𝐌𝐄 🍑🌱🍑\n𝐈 𝐅𝐎𝐋𝐋𝐎𝐖 𝐁𝐀𝐂𝐊 𝐅𝐀𝐒𝐓🍑𝟏𝟎𝟎%𝐈𝐅𝐁🍑\...,2021-10-15 18:21:33+00:00,2355,3349,8550,False,2022-02-27 23:59:13+00:00,Starts Trading on Uniswap and XT Exchanges Feb...,,Twitter Web App,False
4,Ilyass,,Cool,2021-07-30 15:48:01+00:00,4,130,115,False,2022-02-27 23:59:05+00:00,@hustleofwargame #P2E #GameFi #Metaverse #Game...,"['P2E', 'GameFi', 'Metaverse', 'GameNFTs', 'Bl...",Twitter for Android,False


Histogram of Source Tweets

In [197]:
alt.Chart(df).mark_bar().encode(
    x ="source",
    y='count()',
)

## General Hashtag and Username Classification

In [198]:
hashtags_no_na = df.hashtags.dropna()

def formatted_ht(ht):
    ht = ht[1:-1]
    hts = ht.split(',')
    return [stripped.strip().replace("'", "") for stripped in hts]
    
hashtags_processed = [formatted_ht(str_ht) for str_ht in hashtags_no_na]
hashtags_flat = [item.lower() for sublst in hashtags_processed for item in sublst]

ht_series = pd.Series(hashtags_flat)  
frame = {'hashtags_flat': ht_series }
result = pd.DataFrame(frame)
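
As an alternative to slicing and splitting the string by hand, the stringified lists could be parsed with `ast.literal_eval`. This is a sketch that assumes the `hashtags` column stores the `repr` of a Python list, as the preview above suggests; `parse_hashtags` is a hypothetical helper name.

```python
import ast

def parse_hashtags(raw):
    """Parse a stringified list such as "['NFT', 'Metaverse']" into lowercase tags."""
    try:
        tags = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed or truncated cell
    if not isinstance(tags, (list, tuple)):
        return []
    return [str(t).lower() for t in tags]

print(parse_hashtags("['Bapesclan', 'BapesMetavestor', 'Metaverse']"))
# ['bapesclan', 'bapesmetavestor', 'metaverse']
```

Unlike splitting on commas, this keeps a tag intact even if it contains a comma or a quote character, and it returns an empty list instead of raising on unparseable input.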

In [199]:
# GENERAL HASHTAG CLASSIFICATION
count_dict = dict(Counter(hashtags_flat).items())
# Copy so that deleting categorized hashtags below does not also mutate count_dict
other_dict = dict(count_dict)

cat_dict = {'nft' : 0, 'meta' : 0, 'crypto' : 0, 'other' : 0}

for k1 in list(count_dict.keys()):
    categorized = False
    for k2 in list(cat_dict.keys()):
        if k2 in k1:
            categorized = True
            cat_dict[k2] += count_dict[k1]
    if categorized:
        del other_dict[k1]
        
def l_in(L, v): 
    for val in L:
        if val in v:
            return True
    return False
        
for k1 in list(other_dict.keys()):
    categorized = False
    if l_in(['coin', 'eth', 'chain', 'bsc', 'btc', 'doge', 'dao', 'elemon', 'bape', 'shib', 'binance', 'token', 'ryoshis', 'minikishu', 'lte', 'stak', 'hold', 'bluesparrow', 'solana', 'chee', 'defi'], k1):
        cat_dict['crypto'] += other_dict[k1]
        categorized = True
    if l_in(['art', 'mvbiv', 'opensea'], k1):
        cat_dict['nft'] += other_dict[k1]
        categorized = True
    if l_in(['game', 'gaming', 'moniwar', 'morpheuslabs', 'mlseed'], k1):
        cat_dict['meta'] += other_dict[k1]
        categorized = True
    if categorized:
        del other_dict[k1] 
    
for k in list(other_dict.keys()):
    # needs justification
    if other_dict[k] <= 1:
         del other_dict[k]
            
total = 0
for k1 in list(other_dict.keys()):
    total += other_dict[k1] 

for k in list(other_dict.keys()):
    cat_dict['other'] += count_dict[k]
            
cat_cts = {'counts': pd.Series(cat_dict) }
cc = pd.DataFrame(cat_cts)
cc["categories"] = cc.index

alt.Chart(cc).mark_bar().encode(
    x=alt.X('counts'),
    y=alt.Y('categories')
)

In [200]:
# GENERAL USERNAME CLASSIFICATION
df['user_name_lowered'] = df['user_name'].str.lower()

count_dict = dict(Counter(df['user_name_lowered']).items())
other_dict = count_dict
cat_dict = {'nft' : 0, 'meta' : 0, 'crypto' : 0, 'other' : 0}

for k1 in list(count_dict.keys()):
    categorized = False
    for k2 in list(cat_dict.keys()):
        if k2 in k1:
            categorized = True
            cat_dict[k2] += count_dict[k1]
    if categorized:
        del other_dict[k1]
        
for k1 in list(other_dict.keys()):
    categorized = False
    if l_in(['coin', 'eth', 'chain', 'bsc', 'btc', 'doge', 'dao', 'elemon', 'bape', 'shib', 'binance', 'token', 'ryoshis', 'minikishu', 'lte', 'stak', 'hold', 'bluesparrow', 'solana', 'chee', 'defi'], k1):
        cat_dict['crypto'] += other_dict[k1]
        categorized = True
    if l_in(['art', 'mvbiv', 'opensea'], k1):
        cat_dict['nft'] += other_dict[k1]
        categorized = True
    if l_in(['game', 'gaming', 'moniwar', 'morpheuslabs', 'mlseed'], k1):
        cat_dict['meta'] += other_dict[k1]
        categorized = True
    if categorized:
        del other_dict[k1] 
    
for k in list(other_dict.keys()):
    # needs justification
    if other_dict[k] <= 1:
         del other_dict[k]
            
total = 0
for k1 in list(other_dict.keys()):
    total += other_dict[k1] 

for k in list(other_dict.keys()):
    cat_dict['other'] += count_dict[k]
            
cat_cts = {'counts': pd.Series(cat_dict) }
cc = pd.DataFrame(cat_cts)
cc["categories"] = cc.index

alt.Chart(cc).mark_bar().encode(
    x=alt.X('counts'),
    y=alt.Y('categories')
)

Bot Username Classification 

In [201]:
# frequent tweeters: tweet count per user, keeping only users with more than one tweet
user_dict = {k: v for k, v in Counter(df['user_name']).items() if v > 1}

def min_diff(user):
    """Smallest gap in seconds between consecutive tweets by `user` (0 if fewer than two tweets)."""
    times = pd.to_datetime(df.loc[df['user_name'] == user, 'date']).sort_values()
    if len(times) <= 1:
        return 0
    # total_seconds() counts the full gap; .seconds would ignore whole days
    return times.diff().dropna().dt.total_seconds().min()

bots_master = []  # (user, tweet count, smallest gap in seconds)
bots = []
for user, n_tweets in user_dict.items():
    gap = min_diff(user)  # computed once per user
    # As in the original logic, a gap of exactly 0 is not flagged
    if gap != 0 and (n_tweets > 40 or gap < 40):
        bots_master.append((user, n_tweets, gap))
        bots.append(user)

# dataframe of bots
bot_df = df[df['user_name'].isin(bots)]
#bot_df

In [202]:
# BOT HASHTAG CLASSIFICATION
hashtags_no_na_bots = bot_df.hashtags.dropna()
hashtags_processed_bots = [formatted_ht(str_ht) for str_ht in hashtags_no_na_bots]
hashtags_flat_bots = [item.lower() for sublst in hashtags_processed_bots for item in sublst]
ht_series = pd.Series(hashtags_flat_bots)  

count_dict = dict(Counter(hashtags_flat_bots).items())
other_dict = count_dict
cat_dict = {'nft' : 0, 'meta' : 0, 'crypto' : 0, 'other' : 0}

for k1 in list(count_dict.keys()):
    categorized = False
    for k2 in list(cat_dict.keys()):
        if k2 in k1:
            categorized = True
            cat_dict[k2] += count_dict[k1]
    if categorized:
        del other_dict[k1]
        
for k1 in list(other_dict.keys()):
    categorized = False
    if l_in(['coin', 'eth', 'chain', 'bsc', 'btc', 'doge', 'dao', 'elemon', 'bape', 'shib', 'binance', 'token', 'ryoshis', 'minikishu', 'lte', 'stak', 'hold', 'bluesparrow', 'solana', 'chee', 'defi'], k1):
        cat_dict['crypto'] += other_dict[k1]
        categorized = True
    if l_in(['art', 'mvbiv', 'opensea'], k1):
        cat_dict['nft'] += other_dict[k1]
        categorized = True
    if l_in(['game', 'gaming', 'moniwar', 'morpheuslabs', 'mlseed'], k1):
        cat_dict['meta'] += other_dict[k1]
        categorized = True
    if categorized:
        del other_dict[k1] 
    
for k in list(other_dict.keys()):
    # needs justification
    if other_dict[k] <= 1:
         del other_dict[k]
            
total = 0
for k1 in list(other_dict.keys()):
    total += other_dict[k1] 
            
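# Add the remaining uncategorized hashtags to 'other' before building the chart data;
# otherwise the 'other' bar is plotted as 0
for k in list(other_dict.keys()):
    cat_dict['other'] += count_dict[k]

cat_cts = {'counts': pd.Series(cat_dict) }
cc = pd.DataFrame(cat_cts)
cc["categories"] = cc.index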

alt.Chart(cc).mark_bar().encode(
    x=alt.X('counts'),
    y=alt.Y('categories')
)

In [203]:
# BOT USERNAME CLASSIFICATION
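# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
bot_df = bot_df.copy()
bot_df['user_name_lowered'] = bot_df['user_name'].str.lower()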

count_dict = dict(Counter(bot_df['user_name_lowered']).items())
other_dict = count_dict
cat_dict = {'nft' : 0, 'meta' : 0, 'crypto' : 0, 'other' : 0}

for k1 in list(count_dict.keys()):
    categorized = False
    for k2 in list(cat_dict.keys()):
        if k2 in k1:
            categorized = True
            cat_dict[k2] += count_dict[k1]
    if categorized:
        del other_dict[k1]
        
for k1 in list(other_dict.keys()):
    categorized = False
    if l_in(['coin', 'eth', 'chain', 'bsc', 'btc', 'doge', 'dao', 'elemon', 'bape', 'shib', 'binance', 'token', 'ryoshis', 'minikishu', 'lte', 'stak', 'hold', 'bluesparrow', 'solana', 'chee', 'defi'], k1):
        cat_dict['crypto'] += other_dict[k1]
        categorized = True
    if l_in(['art', 'mvbiv', 'opensea'], k1):
        cat_dict['nft'] += other_dict[k1]
        categorized = True
    if l_in(['game', 'gaming', 'moniwar', 'morpheuslabs', 'mlseed'], k1):
        cat_dict['meta'] += other_dict[k1]
        categorized = True
    if categorized:
        del other_dict[k1] 
    
for k in list(other_dict.keys()):
    # needs justification
    if other_dict[k] <= 1:
         del other_dict[k]
            
total = 0
for k1 in list(other_dict.keys()):
    total += other_dict[k1] 

for k in list(other_dict.keys()):
    cat_dict['other'] += count_dict[k]
            
cat_cts = {'counts': pd.Series(cat_dict) }
cc = pd.DataFrame(cat_cts)
cc["categories"] = cc.index

alt.Chart(cc).mark_bar().encode(
    x=alt.X('counts'),
    y=alt.Y('categories')
)



## Number of Bot tweets in the Database

In [204]:
len(bot_df)

389

## Histogram of the source tweet for Bots

In [205]:
alt.Chart(bot_df).mark_bar().encode(
    x ="source",
    y='count()',
)

## Cleaning Twitter Location Data

In [206]:
#Removing na's from the data
df = df[df["user_location"].notna()]
#Removing locations that contain hashtags
df = df[~df["user_location"].astype(str).str.contains("#")]

In [207]:
#Removing emoji from the data using the re module and hard-coded Unicode ranges
def remove_emoji(string):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        # The original single range \U000024C2-\U0001F251 also covered CJK characters,
        # so it would strip Chinese location names; it is split into two narrow ranges here
        u"\U000024C2"             # circled M
        u"\U0001F200-\U0001F251"  # enclosed ideographic symbols
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub(r"", string)

df["user_location"] = df["user_location"].map(remove_emoji)

In [208]:
# Ref: https://geopy.readthedocs.io/en/stable/#usage-with-pandas
#geolocator = Nominatim(user_agent='my_email@myserver.com')

#This code uses GeoPy (Nominatim) to get the coordinates of each location. It takes about 25 minutes to convert
#the cleaned locations to coordinates, so we have already run it and saved the result in a csv file.

#UNCOMMENT THIS CODE TO RUN THAT PORTION

#geolocator = Nominatim(user_agent='my_email@myserver.com')
#geocode = RateLimiter(geolocator.geocode, min_delay_seconds=3, max_retries=5)
#df["location"] = df["user_location"].progress_apply(geocode, language="en") # Some locations are in hindi, chinese. Language=’en’ returns location in english
#df["coordinates"] = df["location"].apply(lambda loc: tuple(loc.point) if loc else None)
#df["state"] = df["location"].apply(lambda loc: loc[0].split(",")[0] if loc else None)
#df["country"] = df["location"].apply(lambda loc: loc[0].split(",")[-1] if loc else None)
#df.to_csv('metaverse_tweets_final.csv', encoding='utf-8', index = False)
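
Since the rate-limited geocoder is the slow step, one possible way to shorten the run is to geocode each distinct location string once and map the results back onto the column. The sketch below is an assumption about how this could be done, not part of the original notebook: `geocode_unique` is a hypothetical helper, and `fake_geocode` is a stand-in so the example runs offline.

```python
import pandas as pd

def geocode_unique(locations, geocode_fn):
    """Call geocode_fn once per distinct location and map the results back onto the Series."""
    unique_values = locations.dropna().unique()
    lookup = {value: geocode_fn(value) for value in unique_values}
    # Values missing from the lookup (such as None) become NaN
    return locations.map(lookup)

# Stand-in for a real geocoder; it records each call so the saving is visible
calls = []
def fake_geocode(name):
    calls.append(name)
    return name.upper()

locations = pd.Series(["Dhaka, Bangladesh", "Melbourne, Australia", "Dhaka, Bangladesh", None])
result = geocode_unique(locations, fake_geocode)
print(len(calls))  # 2
```

In the notebook, the rate-limited `geocode` callable (for example wrapped as `partial(geocode, language="en")`) would take the place of `fake_geocode`.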


In [209]:
#Reading already constructed file 
df = pd.read_csv('metaverse_tweets_final.csv')

In [210]:
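# Cleaning the coordinates column: drop rows that could not be geocoded first
# (slicing a NaN would raise a TypeError), then parse the stored "(lat, lon, alt)"
# strings into tuples of floats
df = df[df['coordinates'].notna()].copy()
df['coordinates'] = df['coordinates'].map(lambda x: tuple(float(v) for v in x[1:-1].split(',')))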

In [211]:
#Creating map using folium
my_map = folium.Map(
    location=[13.133932434766733, 16.103938729508073],
    zoom_start=2)

## Geographical visualization of all tweets 

In [212]:
#print(len(df)) ---> 435, about half the original dataset
for index, row in df.iterrows():
    lat, lon, *a = row['coordinates']
    folium.Marker(
        location = [lat, lon],
        popup= row['user_location'],
        tooltip = row['user_location'],
    ).add_to(my_map)
    
my_map

## Geographical visualization of all bot tweets 

In [213]:
#Creating map using folium
my_map2 = folium.Map(
    location=[13.133932434766733, 16.103938729508073],
    zoom_start=2)

# dataframe of bots
bot_df = df[df['user_name'].isin(bots)]
#print(len(bot_df)) ---> 193 about half the bot database 

for index, row in bot_df.iterrows():
    #print(row['coordinates'], row['user_location'])
    lat, lon, *a = row['coordinates']
    folium.Marker(
        location = [lat, lon],
        popup= row['user_location'],
        tooltip = row['user_location'],
    ).add_to(my_map2)
    
my_map2
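
To complement the maps with numbers for the geographic question, tweets could also be tallied per country. The sketch below assumes the saved CSV contains the `country` column produced by the geocoding step above; `top_countries` is a hypothetical helper and the demo rows are illustrative, not results.

```python
import pandas as pd

def top_countries(frame, n=5):
    """Return the n most frequent values in the 'country' column, ignoring missing entries."""
    # Splitting the geocoded address on commas leaves leading spaces, so strip them
    countries = frame["country"].dropna().str.strip()
    return countries.value_counts().head(n)

# Illustrative rows only
demo = pd.DataFrame({"country": [" India", "India", "Australia", None, "Bangladesh", "India "]})
print(top_countries(demo, n=2))
```

Calling `top_countries(df)` and `top_countries(bot_df)` would allow a side-by-side comparison of all tweets against bot tweets.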

## Possible Findings and Implications

At this stage, we will begin by tackling the questions presented above while considering how to address the more complex ideas mentioned previously. This will also allow us to become more familiar with the dataset and its quirks as we proceed with our analyses. Eventually, it could be interesting to compare Metaverse tweet activity with general tweet activity, or with other news data relating to the Metaverse from the same time period. A key limitation of the dataset is that it spans only a few hours of one day, so it gives a limited view of the overall influence of the Metaverse.