## Cleaning Data for Metaverse Tweets

In [10]:
import pandas as pd
import numpy as np
import NLP
from dataprep.clean import validate_country
import altair as alt
# Instantiate the NLP helper
nlp = NLP.NLP()

### CSV file to Dataframe

In [11]:
df = pd.read_csv('metaverse_tweets.csv')

In [12]:
df.head(10)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Diep Ngoc,,,2021-08-22 09:53:23+00:00,162,129,1816,False,2022-02-27 23:59:46+00:00,@KingArthurrNFT My choice is #Elemon 😊🤩😍🥰\n#BS...,"['Elemon', 'BSC', 'NFT', 'GameFi', 'Metaverse'...",Twitter for Android,False
1,Duncs,"Melbourne, Australia",NFT's + Crypto + Metaverse + Outdoors + Goodtimes,2012-11-28 01:47:19+00:00,745,1363,2583,False,2022-02-27 23:59:33+00:00,Can't wait to hear this #Bapesclan \n\n#BapesM...,"['Bapesclan', 'BapesMetavestor', 'Metaverse']",Twitter Web App,False
2,Putik Gita,"Banyuwangi, Indonesia",just ordinary woman,2016-04-07 03:31:10+00:00,21,130,65,False,2022-02-27 23:59:26+00:00,@SuperStarPad @binance @Nuls good project.. ho...,,Twitter for Android,False
3,🍑𝗔𝗕𝗗 𝗖𝗫 𝟏𝟎𝟎𝐊🍑,"Dhaka, Bangladesh",🍑🌱🍑𝐅𝐎𝐋𝐋𝐎𝐖 𝐌𝐄 🍑🌱🍑\n𝐈 𝐅𝐎𝐋𝐋𝐎𝐖 𝐁𝐀𝐂𝐊 𝐅𝐀𝐒𝐓🍑𝟏𝟎𝟎%𝐈𝐅𝐁🍑\...,2021-10-15 18:21:33+00:00,2355,3349,8550,False,2022-02-27 23:59:13+00:00,Starts Trading on Uniswap and XT Exchanges Feb...,,Twitter Web App,False
4,Ilyass,,Cool,2021-07-30 15:48:01+00:00,4,130,115,False,2022-02-27 23:59:05+00:00,@hustleofwargame #P2E #GameFi #Metaverse #Game...,"['P2E', 'GameFi', 'Metaverse', 'GameNFTs', 'Bl...",Twitter for Android,False
5,Lyxo🌿🌿,"Uyo,Akwa Ibom",Anatomist||Man utd faithful || Music freak || ...,2019-09-16 00:42:59+00:00,849,94,26422,False,2022-02-27 23:59:04+00:00,@supertobi64 #MiniKishu is the next BIG projec...,"['MiniKishu', 'BSC', 'gaming', 'staking']",Twitter for iPhone,False
6,Diep Ngoc,,,2021-08-22 09:53:23+00:00,162,129,1816,False,2022-02-27 23:58:58+00:00,@aegean2356 My choice is #Elemon 🥰🤩💖💗💞\n#BSC ...,"['Elemon', 'BSC', 'NFT', 'GameFi', 'Metaverse'...",Twitter for Android,False
7,🈹ㄒ山|几🈯️,"Atlanta, GA📍",𝕋𝕚𝕞𝕞𝕚𝕚 T̆̈𝙬𝙞𝙣T̑̈𝙬𝙞𝙣 || This the link: https://...,2013-06-03 18:46:28+00:00,3965,4975,119194,False,2022-02-27 23:58:57+00:00,NFT community!!! 📣\nYa go follow @twin_nfts. W...,,Twitter for iPhone,False
8,Lyxo🌿🌿,"Uyo,Akwa Ibom",Anatomist||Man utd faithful || Music freak || ...,2019-09-16 00:42:59+00:00,849,94,26422,False,2022-02-27 23:58:46+00:00,@PeeGee05 @MatrixETF @DEnergizer645 @anietieet...,"['MiniKishu', 'BSC']",Twitter for iPhone,False
9,Rabby78,,Like comments flow korben plz,2021-08-30 19:02:38+00:00,30,1166,1767,False,2022-02-27 23:58:32+00:00,The project is great and this projector has a ...,,Twitter for Android,False


### Cleaning Location: Removing emojis, converting NaN values

Histogram of Location 

Twitter lets users enter free text in the location field, so there is no straightforward way to clean it. We first handle the NaN values and then split each location string on commas to get simpler values to work with. Lastly, we use the `validate_country` function from the `dataprep` module to keep only the tweets whose location names a correctly spelled country or region. In the future, we hope to refine this by mapping states to their countries and by accounting for locations that relate to our topic, "Metaverse".
For now, we have a histogram of the countries/regions that tweeted the most on Feb 27, 2022. The data comes from the metaverse_tweets.csv file, which collected the last 1000 tweets of the day.
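
The section heading mentions removing emojis from locations. A minimal sketch of one way to do that is shown here, assuming it is acceptable to drop every character in a Unicode symbol category. The helper name `strip_symbols` and the sample values are illustrative and not part of the original notebook.

```python
import unicodedata

import pandas as pd


def strip_symbols(text):
    """Drop emoji and other symbol characters from a location string."""
    if pd.isna(text):
        return None
    kept = [
        ch for ch in text
        # Categories starting with "S" are symbols; most emoji are "So".
        # Note this also drops characters such as "+" or "$".
        if not unicodedata.category(ch).startswith("S")
        # Variation selector and zero-width joiner often accompany emoji.
        and ch not in ("\ufe0f", "\u200d")
    ]
    cleaned = "".join(kept).strip()
    # Treat strings that contained only symbols as missing.
    return cleaned or None


# Illustrative values modelled on the user_location column.
locations = pd.Series(["Atlanta, GA📍", "Melbourne, Australia", None, "🌿🌿"])
cleaned = locations.apply(strip_symbols)
```

Applying the same function to `df["user_location"]` before splitting on commas should leave plain text for the later steps, although stylized letters (such as mathematical bold characters) are not symbols and would pass through unchanged.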

In [13]:
# validate_country needs missing locations as None rather than NaN
df["user_location"] = df["user_location"].apply(lambda loc: None if pd.isna(loc) else loc)
# Keep the part after the first comma, e.g. "Melbourne, Australia" -> "Australia"
df["new_location"] = df["user_location"].str.split(",").str[1].str.strip()
# Keep only the rows whose extracted location is a valid country/region name
is_country = validate_country(df["new_location"])
df = df.loc[is_country].copy()
# Title-case the names so that multi-word countries are capitalized consistently
df["new_location"] = df["new_location"].str.title()
alt.Chart(df).mark_bar().encode(
    x="new_location",
    y="count()",
)
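
The future work described above, mapping states to their countries, could be sketched as follows. This is only an illustration: `REGION_TO_COUNTRY` is a hypothetical lookup table with two entries taken from the sample rows, and a real table would need far broader coverage.

```python
import pandas as pd

# Hypothetical lookup table; a real one would need many more regions.
REGION_TO_COUNTRY = {
    "GA": "United States",
    "Akwa Ibom": "Nigeria",
}


def region_to_country(region):
    """Return the mapped country, or the input unchanged if no mapping is known."""
    if pd.isna(region):
        return None
    region = region.strip()
    return REGION_TO_COUNTRY.get(region, region)


# Illustrative values modelled on the text after the comma in user_location.
regions = pd.Series([" GA", "Akwa Ibom", " Australia", None])
countries = regions.apply(region_to_country)
```

Running a mapping like this before `validate_country` would let tweets such as the one from "Atlanta, GA" count toward their country instead of being filtered out.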

Histogram of the Hashtags 

In [14]:
# Hashtags are stored as strings such as "['BSC', 'NFT']"; skip tweets without any
hashtags_no_na = df.hashtags.dropna()

def formatted_ht(ht):
    """Turn a stringified list of hashtags into a list of clean hashtag strings."""
    ht = ht[1:-1]  # drop the surrounding square brackets
    return [tag.strip().replace("'", "") for tag in ht.split(',')]

hashtags_processed = [formatted_ht(str_ht) for str_ht in hashtags_no_na]

# Flatten the per-tweet lists into a single list of hashtags
hashtags_flat = [item for sublst in hashtags_processed for item in sublst]

result = pd.DataFrame({'hashtags_flat': hashtags_flat})

alt.Chart(result).mark_bar().encode(
    x='hashtags_flat',
    y='count()',
)
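
As an alternative to slicing and splitting the strings by hand, the stringified lists can be parsed with `ast.literal_eval` from the standard library. The sketch below assumes every non-null `hashtags` value in the CSV is a complete Python list literal (the `...` in the preview above is only display truncation). The sample values and the top-2 cutoff are illustrative.

```python
import ast
from collections import Counter

import pandas as pd


def parse_hashtags(raw):
    """Parse a string such as "['BSC', 'NFT']" into a list of hashtag strings."""
    return ast.literal_eval(raw)


# Illustrative values modelled on the hashtags column.
raw_hashtags = pd.Series(["['Elemon', 'BSC', 'NFT']", None, "['MiniKishu', 'BSC']"])

# Count how often each hashtag appears across all tweets.
counts = Counter(
    tag
    for raw in raw_hashtags.dropna()
    for tag in parse_hashtags(raw)
)

# Keep only the most frequent hashtags so a bar chart stays readable.
top_hashtags = pd.DataFrame(counts.most_common(2), columns=["hashtag", "count"])
```

Charting a frame like `top_hashtags` instead of every distinct hashtag keeps the x-axis short, which may help when the full data set contains hundreds of different tags.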

Histogram of the source of tweets

This chart breaks down the tweets by the client they were posted from, such as Twitter for Android or Twitter Web App.

In [15]:
alt.Chart(df).mark_bar().encode(
    x ="source",
    y='count()',
)
