# Data Cleaning

In this notebook, the goal is to preprocess the tweets so that they are better suited to sentiment analysis. In particular, we aim to remove the "#" character from tweets while keeping the word that follows it, since hashtagged words can affect the sentiment.
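A minimal sketch of the hashtag step described above, assuming that only the "#" symbol should be dropped and the tagged word kept. The name `strip_hashtags` is illustrative and does not appear elsewhere in this notebook.

```python
import re

# Match a "#" that starts a hashtag (it must be followed by a word character)
# and capture the tagged word so that it can be kept.
HASHTAG_RE = re.compile(r"#(\w+)")

def strip_hashtags(text):
    """Remove the leading '#' from hashtags but keep the tagged word."""
    return HASHTAG_RE.sub(r"\1", text)

print(strip_hashtags("Few #chargingstations in my area, but I love my #Tesla!"))
```

With pandas, the same function could be applied to the whole column, for example `tesla_df['Tweet Text'].map(strip_hashtags)`.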

## Section 1: Removing non-English tweets

Our dataset contains 150k tweets collected over a span of 7 months. Some of the tweets are not in English. Here, we remove all non-English tweets using an ML model that detects the language of a given input.
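As a hedged illustration of how such a filter can be made robust, the sketch below wraps an arbitrary detector function so that missing values and detection failures count as non-English. `make_english_filter` and `stub_detect` are hypothetical names used only for this example; in this notebook the detector would be `langdetect.detect`.

```python
def make_english_filter(detect_fn):
    """Wrap a language detector so that it always returns a bool."""
    def is_english(text):
        # Missing values (e.g. NaN from pandas) and blank strings are not text.
        if not isinstance(text, str) or not text.strip():
            return False
        try:
            return detect_fn(text) == "en"
        except Exception:
            # Detectors can raise on input with no usable features.
            return False
    return is_english

# Stub detector for demonstration only; swap in langdetect.detect in practice.
def stub_detect(text):
    if "tiene" in text:
        return "es"
    return "en"

is_english = make_english_filter(stub_detect)
print(is_english("Tesla is a trillion dollar company"))
print(is_english("#Tesla tiene récord de autos vendidos"))
print(is_english(float("nan")))
```

The `langdetect` README also notes that detection is not deterministic by default; setting `DetectorFactory.seed = 0` before the first call is the documented way to make results reproducible.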

In [1]:
import pandas as pd
import numpy as np
from langdetect import detect
from pandarallel import pandarallel

In [2]:
tesla_df = pd.read_csv("tesla-tweets-updated.csv")

In [3]:
tesla_df.head()

Unnamed: 0,Date & Time,Profile Picture Link,Twitter ID,Tweet Text,Tweet Link
0,"April 10, 2022 at 07:44PM",http://pbs.twimg.com/profile_images/1512074518...,@Jessica1988kk,"RT @invest_answers: Crypto news, #Bitcoin Whal...",https://twitter.com/Jessica1988kk/status/15131...
1,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/87878355348773...,@JotaGe2014,#Tesla tiene récord de autos vendidos. Es impr...,https://twitter.com/JotaGe2014/status/15131737...
2,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/9364223687...,@MmeCallas,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/MmeCallas/status/151317374...
3,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/1463665918...,@BotSecx,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/BotSecx/status/15131737626...
4,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/1116758599...,@agseh,RT @RupiReportero_: 🙆‍♂️🚘 Al que le robaron la...,https://twitter.com/agseh/status/1513173864829...


In [4]:
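def detect_en(text):
    #return True only when langdetect classifies the text as English
    try:
        return detect(text) == 'en'
    except Exception:
        #detect raises on input with no usable features (e.g. empty text); treat it as non-English
        return False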

In [5]:
#remove all rows whose tweet text is not detected as English
pandarallel.initialize(progress_bar=True)
tesla_df = tesla_df[tesla_df['Tweet Text'].parallel_apply(detect_en)]

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.



In [6]:
tesla_df.head()

Unnamed: 0,Date & Time,Profile Picture Link,Twitter ID,Tweet Text,Tweet Link
0,"April 10, 2022 at 07:44PM",http://pbs.twimg.com/profile_images/1512074518...,@Jessica1988kk,"RT @invest_answers: Crypto news, #Bitcoin Whal...",https://twitter.com/Jessica1988kk/status/15131...
2,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/9364223687...,@MmeCallas,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/MmeCallas/status/151317374...
3,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/1463665918...,@BotSecx,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/BotSecx/status/15131737626...
5,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/1507382366...,@ElTendies,RT @cb_doge: Tesla - A Trillion Dollar Company...,https://twitter.com/ElTendies/status/151317393...
6,"April 10, 2022 at 07:45PM",http://pbs.twimg.com/profile_images/1355296710...,@LauraCory2013,"@elonmusk, few #chargingstations in my area. I...",https://twitter.com/LauraCory2013/status/15131...


## Section 2: Grouping by Week

In [7]:
#ATTENTION: two of the dates in the original csv were in day-month-year form (with no timestamp)
#Both were changed manually to the format parsed below, with a timestamp of 12:00AM
#The corrected csv is named "tesla-tweets-updated"
tesla_df['Date & Time'] = pd.to_datetime(tesla_df['Date & Time'], format='%B %d, %Y at %I:%M%p')
tesla_df.head()

Unnamed: 0,Date & Time,Profile Picture Link,Twitter ID,Tweet Text,Tweet Link
0,2022-04-10 19:44:00,http://pbs.twimg.com/profile_images/1512074518...,@Jessica1988kk,"RT @invest_answers: Crypto news, #Bitcoin Whal...",https://twitter.com/Jessica1988kk/status/15131...
2,2022-04-10 19:45:00,http://pbs.twimg.com/profile_images/9364223687...,@MmeCallas,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/MmeCallas/status/151317374...
3,2022-04-10 19:45:00,http://pbs.twimg.com/profile_images/1463665918...,@BotSecx,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/BotSecx/status/15131737626...
5,2022-04-10 19:45:00,http://pbs.twimg.com/profile_images/1507382366...,@ElTendies,RT @cb_doge: Tesla - A Trillion Dollar Company...,https://twitter.com/ElTendies/status/151317393...
6,2022-04-10 19:45:00,http://pbs.twimg.com/profile_images/1355296710...,@LauraCory2013,"@elonmusk, few #chargingstations in my area. I...",https://twitter.com/LauraCory2013/status/15131...
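The manual correction described in the comments of cell In [7] could also be done in code. The sketch below tries the main format first and then falls back to a date-only format. The fallback pattern `%d-%m-%Y` is an assumption, since the notebook only says that the two rows were in day-month-year form; `parse_tweet_time` is a hypothetical helper name.

```python
from datetime import datetime

MAIN_FORMAT = "%B %d, %Y at %I:%M%p"   # format used in the notebook
FALLBACK_FORMAT = "%d-%m-%Y"           # assumed shape of the two odd rows

def parse_tweet_time(raw):
    """Parse a 'Date & Time' string; date-only rows default to midnight."""
    for fmt in (MAIN_FORMAT, FALLBACK_FORMAT):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

print(parse_tweet_time("April 10, 2022 at 07:44PM"))
print(parse_tweet_time("10-04-2022"))
```

Raising on an unknown format, rather than silently returning a missing value, makes any further irregular rows visible immediately.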


In [8]:
#drop the time component, keeping only the date
tesla_df['Date & Time'] = pd.to_datetime(tesla_df['Date & Time']).dt.date
tesla_df.head()

Unnamed: 0,Date & Time,Profile Picture Link,Twitter ID,Tweet Text,Tweet Link
0,2022-04-10,http://pbs.twimg.com/profile_images/1512074518...,@Jessica1988kk,"RT @invest_answers: Crypto news, #Bitcoin Whal...",https://twitter.com/Jessica1988kk/status/15131...
2,2022-04-10,http://pbs.twimg.com/profile_images/9364223687...,@MmeCallas,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/MmeCallas/status/151317374...
3,2022-04-10,http://pbs.twimg.com/profile_images/1463665918...,@BotSecx,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,https://twitter.com/BotSecx/status/15131737626...
5,2022-04-10,http://pbs.twimg.com/profile_images/1507382366...,@ElTendies,RT @cb_doge: Tesla - A Trillion Dollar Company...,https://twitter.com/ElTendies/status/151317393...
6,2022-04-10,http://pbs.twimg.com/profile_images/1355296710...,@LauraCory2013,"@elonmusk, few #chargingstations in my area. I...",https://twitter.com/LauraCory2013/status/15131...


In [9]:
#organize tweets in columns by week

from datetime import datetime, timedelta

start_date = datetime.strptime('2022-04-10', '%Y-%m-%d').date()
end_date = datetime.strptime('2022-04-16', '%Y-%m-%d').date()

final_end_date = datetime.strptime('2022-11-12', '%Y-%m-%d').date()

week_count = 1

dfs = []

while end_date <= final_end_date:
    
    weekly_tweets = tesla_df.loc[(tesla_df['Date & Time'] >= start_date) & 
                                 (tesla_df['Date & Time'] <= end_date), 'Tweet Text']
    weekly_df = weekly_tweets.to_frame().reset_index(drop=True)
    
    week_range = f"Week {week_count} ({start_date.strftime('%m/%d')}-{end_date.strftime('%m/%d')})"
    weekly_df.columns = [week_range]

    dfs.append(weekly_df)

    start_date += timedelta(days=7)
    end_date += timedelta(days=7)
    
    week_count += 1

tesla_df_byweek = pd.concat(dfs, axis=1)
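An equivalent way to derive the week label for a single date, shown as a sketch rather than a replacement for the loop above: count how many whole weeks separate the date from the first day of the dataset. `week_label` is an illustrative helper name; the start date is the one used in the cell above.

```python
from datetime import date, timedelta

FIRST_DAY = date(2022, 4, 10)  # start of Week 1, as in the notebook

def week_label(day, first_day=FIRST_DAY):
    """Return a label like 'Week 2 (04/17-04/23)' for the week containing day."""
    index = (day - first_day).days // 7          # 0-based week number
    start = first_day + timedelta(days=7 * index)
    end = start + timedelta(days=6)
    return f"Week {index + 1} ({start.strftime('%m/%d')}-{end.strftime('%m/%d')})"

print(week_label(date(2022, 4, 20)))
print(week_label(date(2022, 11, 12)))
```

Mapping this function over the date column and grouping by the result would keep the tweets in long form (one row per tweet), which avoids the padding that appears when weeks of different sizes are placed side by side as columns.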


In [10]:
tesla_df_byweek.head()

Unnamed: 0,Week 1 (04/10-04/16),Week 2 (04/17-04/23),Week 3 (04/24-04/30),Week 4 (05/01-05/07),Week 5 (05/08-05/14),Week 6 (05/15-05/21),Week 7 (05/22-05/28),Week 8 (05/29-06/04),Week 9 (06/05-06/11),Week 10 (06/12-06/18),...,Week 22 (09/04-09/10),Week 23 (09/11-09/17),Week 24 (09/18-09/24),Week 25 (09/25-10/01),Week 26 (10/02-10/08),Week 27 (10/09-10/15),Week 28 (10/16-10/22),Week 29 (10/23-10/29),Week 30 (10/30-11/05),Week 31 (11/06-11/12)
0,"RT @invest_answers: Crypto news, #Bitcoin Whal...",❎VRR#BNWA-Q9IY-NXLN ❎\nIf you consider buying ...,RT @Ali_TeslaMY: Isn’t it ironic that the guy ...,RT @JasonDanheiser: $TSLA FSD Beta 10.11.2 dri...,"RT @ElonPromises: ""Offer Tesla car insurance w...",Good job! h8ten was first to spot a 2021 Tesla...,"RT @JohnRadioTFI: Jeez @EdRadioTFI, we'll have...",THE F’IN CAR NEEDS THREE NEW TIRES. #tesla BIT...,Please @JoeBiden please send @elonmusk a check...,This still seems to be a pretty serious issue ...,...,RT @Cyberfleet_NFT: TRUCK OF THE DAY\n ...,#Tesla Solar + #Powerwall more than covers mon...,"RT @gaspassman: Come on internet, help me catc...","@WellsFargo To support, We are launching a mas...",RT @Irvings31075139: @WhalePumpCom 🤖🔥 it doesn...,RT @amiris_brown: @Just_Hey_JR This sounds lik...,RT @mvollmer1: Tesla’s Bot 🤖 is going to remov...,RT @deanlee4real: @elonmusk @muskQu0tes @elonm...,@sarajahgp @Teslarati @WilliamWritin Most peop...,Let's take a look at #Tesla as an #investment....
1,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,RT @EveryElonReply: Elon Musk liked a tweet fr...,RT @Ali_TeslaMY: Isn’t it ironic that the guy ...,RT @JasonDanheiser: $TSLA FSD Beta 10.11.2 dri...,RT @OzobgO: @ZinuNews @elonmusk @ZinuToken Thi...,Whee! h8ten just spotted a 2022 Tesla Model Y ...,Elon Musk conditions his takeover of Twitter -...,Come Tesla in india #Tesla #India $TSLA #Remem...,RT @Aiaddict1: #Tesla service rep claims they ...,@SimplisticSimon @GonzalesKristie @EvilMopacAT...,...,RT @DreamCarClubNFT: Grandpa rides in the fast...,"RT @Maxlock911: You know what #Tesla , #Amazon...",RT @YOSEquehacer2_: Finanzas con sentido común...,"RT @jotticsgo: @elonmusk To support, We are la...",RT @geniusventures_: Tesla CEO Elon Musk demon...,#bb24 #Ethereum #EthereumMerge #BNB #TRX #TORN...,RT @ToddJobson: @elonmusk @KimDotcom @Twitter ...,RT @NotATeslaApp: Model Y is expected to gener...,RT @ClaireMusk: Gary Black says #Tesla's gonna...,Still waiting for your #Tesla pick-up truck? I...
2,RT @CottonCodes: 🐒 #love in my #MariaCallas I ...,RT @kimpaquette: Todays activity. 😍 #tesla #mo...,RT @Teslaconomics: Fuck him up @elonmusk. 👊\n\...,RT @jsongtrades: Buying a Tesla was one of the...,RT @JilianneParker: I completely shut down the...,Just posting the stats for this mornings drive...,@spicedrop71 @TheRickWilson Don’t forget the A...,RT @EvaFoxU: Tesla jumped 35 positions on the ...,RT @BLKMDL3: Model X Plaid beats a Porsche 992...,@CryptosGemsCom ⚡ @VoltInuOfficial⚡\n#VOLTINU ...,...,RT @DreamCarClubNFT: Grandpa rides in the fast...,@KadakKaspar @CJ_NFA @WPipperger Yeah I am not...,RT @YOSEquehacer2_: Finanzas con sentido común...,"@GilbertAZMayor @WellsFargo To support, We are...",RT @goldenkatepark: Who knew the hardest part ...,@Tesla Semi to begin production of first #Elec...,RT @GerberKawasaki: Private companies should n...,#Bitcoin: #Tesla’s $BTC holding after selling ...,RT @ElonGoatToken: $EGT is busy planning the r...,"@elonmusk @Plinz Is that really you, Elon? It ..."
3,RT @cb_doge: Tesla - A Trillion Dollar Company...,"Musk Go-Private Tweet Ruled False by Judge, Te...",RT @Teslaconomics: I bet Wall Street Bets &amp...,Is it possible that earth was once two moons o...,RT @HermitPossibly: #TRUST AND #INCLUSION ARE ...,Sign Up : Top Stedy #Crypto Platform #Bitcoin ...,@netshrink @aliasSubbu Isn't hard cash means c...,RT @NodeopolyNFT: #NFTGiveaway coming tomorrow...,#Tesla scores 100/100 for 7th consecutive year...,RT @_iTsLxght_: @VoltInuOfficial @FTX_Official...,...,RT @MalcolmNance: Treat #CNN like #Tesla. Let’...,@VoltInuOfficial @henokcrypto #VOLT #VDSC #VOL...,RT @YOSEquehacer2_: Finanzas con sentido común...,RT @jotticsgo: @WhiteHouse @SecDebHaaland To s...,RT @Irvings31075139: @nftsonsolana 🤖🔥 it doesn...,A cunning plan indeed… 🤦🏼‍♂️🤣 #matt #tesla #El...,RT @EdTowers2022: #Tesla 50% drop in stock va...,RT @EvaFoxU: Tesla accelerated its Supercharge...,So we finally got this question solved once an...,RT @TechTreesSDG: @elonmusk @kcoleman @ubiquit...
4,"@elonmusk, few #chargingstations in my area. I...",RT @web3_coin: 🎁Buy WEB3COIN IDO to Win a Tesl...,"@jhall Bill Gates writes a book called ""How to...",RT @JasonDanheiser: $TSLA FSD Beta 10.11.2 dri...,RT @GerberKawasaki: Hey @elonmusk what if we w...,Tesla wouldn’t be successful without Liberals....,https://t.co/M6a7pBwZY0 @clusterdafrica #europ...,RT @granitacademy: If you post a motivation qu...,RT @BLKMDL3: Model X Plaid beats a Porsche 992...,@DrPepperBlog Doesn't the #tesla logo look lik...,...,RT @DreamCarClubNFT: Grandpa rides in the fast...,RT @stevenmarkryan: https://t.co/NGZB8dP6Us - ...,RT @YOSEquehacer2_: Finanzas con sentido común...,RT @jotticsgo: @WhiteHouse @SecDebHaaland To s...,RT @Irvings31075139: @WholeMarsBlog @elonmusk ...,Contact me if you want to hire me :\nGmail: ta...,RT @GerberKawasaki: Private companies should n...,RT @mootron: Every once in a while @Nextdoor o...,@PplFuture @SolomonYue @catturd2 @birdymating ...,RT @BaldInvesting: Let's take a look at #Tesla...


## Section 3: Performing the sentiment analysis

In [None]:
tesla_df.tail()

In [None]:
tweets = tesla_df["Tweet Text"]
#TODO: include the date alongside the tweet text
tweets.head()

## Sample test of Sentiment Analysis model

This test uses the Hugging Face Inference API, following this tutorial: https://huggingface.co/blog/sentiment-analysis-twitter

In [11]:
#import requests and define the analysis function

import os
import requests
model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
#read the API token from the environment instead of hard-coding it in the notebook
#(HF_TOKEN is an assumed variable name; set it before running this cell)
hf_token = os.environ["HF_TOKEN"]

API_URL = "https://api-inference.huggingface.co/models/" + model
headers = {"Authorization": f"Bearer {hf_token}"}

def analysis(data):
    #send the text to the hosted model and return the parsed JSON response
    payload = dict(inputs=data, options=dict(wait_for_model=True))
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


In [13]:
tweets = tesla_df_byweek["Week 1 (04/10-04/16)"]
tweets_analysis = []

for tweet in tweets:
    if pd.isna(tweet): #shorter weeks are padded with NaN (not None) by pd.concat
        break
    try:
        sentiment_result = analysis(tweet)[0]
        top_sentiment = max(sentiment_result, key=lambda x: x['score']) # Get the sentiment with the highest score
        tweets_analysis.append({'tweet': tweet, 'sentiment': top_sentiment['label']})
 
    except Exception as e:
        print(repr(e))

KeyError(0)
[... the same line is printed 16 times in total ...]


KeyboardInterrupt: 
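The repeated `KeyError(0)` suggests that `analysis` returned a dictionary rather than a list: indexing a dict with `0` raises exactly this error, whereas the loop above expects a list. A likely cause is an error payload from the API, although the notebook does not show the response body. The sketch below checks the shape of the response before indexing it. `top_label` is a hypothetical helper, and the `error` key is an assumption about the payload.

```python
def top_label(result):
    """Return the highest-scoring label, or None if the response is not a result list."""
    # An error payload is assumed to be a dict such as {"error": "..."}.
    if isinstance(result, dict):
        print("API returned an error payload:", result.get("error", result))
        return None
    if not result:
        return None
    scores = result[0]  # one list of {label, score} dicts per input text
    return max(scores, key=lambda item: item["score"])["label"]

ok = [[{"label": "positive", "score": 0.91},
       {"label": "neutral", "score": 0.07},
       {"label": "negative", "score": 0.02}]]
print(top_label(ok))
print(top_label({"error": "example error message"}))
```

Printing the payload once, instead of only `repr(e)`, would show whether the failures come from authentication, rate limiting, or the input itself.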

In [12]:
tweets = tesla_df["Tweet Text"]
tweets_analysis = []

for tweet in tweets:
    try:
        sentiment_result = analysis(tweet)[0]
        top_sentiment = max(sentiment_result, key=lambda x: x['score']) # Get the sentiment with the highest score
        tweets_analysis.append({'tweet': tweet, 'sentiment': top_sentiment['label']})
 
    except Exception as e:
        print(repr(e))

KeyError(0)
[... the same line is printed 49 times in total ...]


KeyboardInterrupt: 

In [None]:
print(sentiment_result)

In [None]:

# for i in range(10):
#     print(tweets[i])
#     print('\n')
print(tweets.iloc[1]) #positional access: index label 1 was dropped by the language filter

In [None]:
#sanity check

tweets2 = tweets[0:1].copy() #copy so that the assignment below does not touch tesla_df
tweets2[1] = "Tesla has a record for cars sold. It's impressive, but it never ceases to amaze me that by selling 6 times less units than Toyota, for example, Tesla is worth 3 or 4 times more. https://t.co/u7Jm8oS54t via @Inoreader"
print(tweets2)

print(analysis(tweets2[1])[0])