# Data Cleaning

This notebook preprocesses the data so that it is better suited to sentiment analysis. In particular, we aim to remove the "#" character from tweets while keeping the word that follows it, since hashtag words can affect the sentiment.
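A minimal sketch of the "#" removal described above, assuming a regular expression is an acceptable tool for it. `strip_hashtag_symbol` is a hypothetical helper introduced here for illustration; it is not defined elsewhere in this notebook.

```python
import re

# Hypothetical helper (an assumption, not part of the original notebook):
# drop the "#" but keep the word that follows it.
_HASHTAG = re.compile(r"#(\w+)")

def strip_hashtag_symbol(text: str) -> str:
    """Replace each '#word' with 'word'; a bare '#' is left untouched."""
    return _HASHTAG.sub(r"\1", text)

cleaned = strip_hashtag_symbol("few #chargingstations in my area")
print(cleaned)  # few chargingstations in my area
```

Applied to the dataframe, this could be used as `tesla_df['Tweet Text'].apply(strip_hashtag_symbol)`, so that the hashtag word stays available to the sentiment model.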

## Section 1: Removing non-English tweets

Our dataset contains about 150k tweets collected over a span of 7 months. Some of the tweets are not in English. Here, we remove all non-English tweets using an ML model (via the `langdetect` library) that detects the language of a given input.

In [1]:
import pandas as pd
from langdetect import detect
from pandarallel import pandarallel

In [4]:
tesla_df = pd.read_csv("data/tesla-tweets.csv")
#tesla_df.head()
#tesla_df.tail()

In [5]:
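def detect_en(text):
    """Return True if langdetect classifies `text` as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() raises when the text has no detectable features (e.g. an empty string);
        # treat such rows as non-English so they are dropped
        return False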


In [7]:
# remove all rows that are not English
pandarallel.initialize(progress_bar=False)
tesla_df = tesla_df[tesla_df['Tweet Text'].parallel_apply(detect_en)]

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


## Section 2: Performing the sentiment analysis

In [8]:
import pandas as pd
import numpy as np

In [9]:
tesla_df.head()

| Unnamed: 0 | Date & Time | Profile Picture Link | Twitter ID | Tweet Text | Tweet Link |
|---|---|---|---|---|---|
| 0 | April 10, 2022 at 07:44PM | http://pbs.twimg.com/profile_images/15120745... | @Jessica1988kk | RT @invest_answers: Crypto news, #Bitcoin Whal... | https://twitter.com/Jessica1988kk/status/15131... |
| 2 | April 10, 2022 at 07:45PM | http://pbs.twimg.com/profile_images/936422368... | @MmeCallas | RT @CottonCodes: 🐒 #love in my #MariaCallas I ... | https://twitter.com/MmeCallas/status/151317374... |
| 3 | April 10, 2022 at 07:45PM | http://pbs.twimg.com/profile_images/146366591... | @BotSecx | RT @CottonCodes: 🐒 #love in my #MariaCallas I ... | https://twitter.com/BotSecx/status/15131737626... |
| 5 | April 10, 2022 at 07:45PM | http://pbs.twimg.com/profile_images/150738236... | @ElTendies | RT @cb_doge: Tesla - A Trillion Dollar Company... | https://twitter.com/ElTendies/status/151317393... |
| 6 | April 10, 2022 at 07:45PM | http://pbs.twimg.com/profile_images/135529671... | @LauraCory2013 | @elonmusk, few #chargingstations in my area. I... | https://twitter.com/LauraCory2013/status/15131... |


In [10]:
tesla_df.tail()

| Unnamed: 0 | Date & Time | Profile Picture Link | Twitter ID | Tweet Text | Tweet Link |
|---|---|---|---|---|---|
| 151986 | November 12, 2022 at 02:18PM | http://pbs.twimg.com/profile_images/261032294... | @NamaloomInsan | @stratosathens @alfonslopeztena @elonmusk @tes... | https://twitter.com/NamaloomInsan/status/15913... |
| 151991 | November 12, 2022 at 02:19PM | http://pbs.twimg.com/profile_images/765238470... | @DemApples00 | #DOGE #DogelonMars ……🚀🌗\n\nThe PEOPLES AND OFF... | https://twitter.com/DemApples00/status/1591382... |
| 151992 | November 12, 2022 at 02:19PM | http://pbs.twimg.com/profile_images/238736534... | @Mrtnl79 | RT @HakanHoca22: Elon, my friend, come to Edir... | https://twitter.com/Mrtnl79/status/15913826993... |
| 151993 | November 12, 2022 at 02:20PM | http://pbs.twimg.com/profile_images/765238470... | @DemApples00 | #DOGE #DogelonMars …🚀🌗\n\nThe PEOPLES AND OFFI... | https://twitter.com/DemApples00/status/1591382... |
| 151999 | November 12, 2022 at 03:11PM | http://pbs.twimg.com/profile_images/157116418... | @JandTContent | Crash and burn EVERYWHERE... \n\nAnother one b... | https://twitter.com/JandTContent/status/159139... |


In [12]:
tweets = tesla_df["Tweet Text"]
#note - include date
tweets.head()

0    RT @invest_answers: Crypto news, #Bitcoin Whal...
1    #Tesla tiene récord de autos vendidos. Es impr...
2    RT @CottonCodes: 🐒 #love in my #MariaCallas I ...
3    RT @CottonCodes: 🐒 #love in my #MariaCallas I ...
4    RT @RupiReportero_: 🙆‍♂️🚘 Al que le robaron la...
Name: Tweet Text, dtype: object
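
The `#note - include date` comment in the cell above suggests keeping the timestamp alongside each tweet. A minimal sketch of parsing the `Date & Time` strings, assuming they all follow the format visible in the `head()` and `tail()` outputs and that the process runs under an English locale. `DATE_FORMAT` and `parse_tweet_time` are hypothetical names introduced here.

```python
from datetime import datetime

# Format inferred from the "Date & Time" values shown in this notebook,
# e.g. "April 10, 2022 at 07:44PM". This is an assumption about the whole column.
DATE_FORMAT = "%B %d, %Y at %I:%M%p"

def parse_tweet_time(raw: str) -> datetime:
    """Parse one 'Date & Time' string into a datetime (12-hour clock with AM/PM)."""
    return datetime.strptime(raw, DATE_FORMAT)

parsed = parse_tweet_time("April 10, 2022 at 07:44PM")
print(parsed.isoformat())  # 2022-04-10T19:44:00
```

With pandas, the same format string could be passed as `pd.to_datetime(tesla_df['Date & Time'], format=DATE_FORMAT)` to convert the whole column at once.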

## Sample test of Sentiment Analysis model

We use the Hugging Face hosted inference API, following this tutorial: https://huggingface.co/blog/sentiment-analysis-twitter

In [25]:
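# import the dependencies and define the analysis function

import os
import requests
model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# read the API token from the environment instead of hard-coding it in the notebook;
# HF_TOKEN is an assumed variable name, set it before running this cell
hf_token = os.environ["HF_TOKEN"]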

API_URL = "https://api-inference.huggingface.co/models/" + model
headers = {"Authorization": "Bearer %s" % (hf_token)}

def analysis(data):
    #function that computes the sentiment
    payload = dict(inputs=data, options=dict(wait_for_model=True))
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


In [34]:

tweets_analysis = []

for tweet in tweets:
    try:
        sentiment_result = analysis(tweet)[0]
        top_sentiment = max(sentiment_result, key=lambda x: x['score']) # Get the sentiment with the highest score
        tweets_analysis.append({'tweet': tweet, 'sentiment': top_sentiment['label']})
 
    except Exception as e:
        print(e)

In [35]:
print(sentiment_result)

[{'label': 'neutral', 'score': 0.8032652139663696}, {'label': 'positive', 'score': 0.14004221558570862}, {'label': 'negative', 'score': 0.05669258534908295}]
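
As a small worked check, the selection used in the loop (`max` keyed on `score`) can be run against the result printed above. The scores below are copied from that output; `top_label` is a hypothetical helper name used only for this illustration.

```python
# Scores copied verbatim from the printed `sentiment_result` above.
sentiment_result = [
    {'label': 'neutral', 'score': 0.8032652139663696},
    {'label': 'positive', 'score': 0.14004221558570862},
    {'label': 'negative', 'score': 0.05669258534908295},
]

def top_label(scores):
    """Return the label whose score is highest (hypothetical helper)."""
    return max(scores, key=lambda x: x['score'])['label']

print(top_label(sentiment_result))  # neutral
```

Because `max` compares only the `score` field, the order in which the API returns the three labels does not affect the result.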


In [47]:

# for i in range(10):
#     print(tweets[i])
#     print('\n')
print(tweets[1])

#Tesla tiene récord de autos vendidos. Es impresionante, pero no deja de sorprenderme que vendiendo 6 veces menos unidades que Toyota, por ejemplo, Tesla valga 3 o 4 veces más. https://t.co/u7Jm8oS54t vía @Inoreader


In [50]:
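# sanity check: score tweet 1 again after translating it from Spanish to English

tweets2 = tweets[0:1].copy()  # copy so the assignment below does not write into a slice of `tweets`
tweets2[1] = "Tesla has a record for cars sold. It's impressive, but it never ceases to amaze me that by selling 6 times less units than Toyota, for example, Tesla is worth 3 or 4 times more. https://t.co/u7Jm8oS54t via @Inoreader"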
print(tweets2)

print(analysis(tweets2[1])[0])

0    RT @invest_answers: Crypto news, #Bitcoin Whal...
1    Tesla has a record for cars sold. It's impress...
Name: Tweet Text, dtype: object
[{'label': 'positive', 'score': 0.8890513181686401}, {'label': 'neutral', 'score': 0.09520529210567474}, {'label': 'negative', 'score': 0.015743352472782135}]
