# Twitter vs. Public

Table of Contents
- Introduction
- Data Collection
- Data Cleaning
- Analysis
- Results
- Appendix

## Introduction <a name="Introduction"></a>

This work seeks to provide evidence on the differences in sentiment (Positive/Negative) between perceptions shared on twitter versus perceptions held by the public at large. Specifically, this analysis will be limited to sentiment of famous brands between the two groups. The public's sentiment is approximated by brand specific survey data collected by YouGov. Whereas for Twitter data, tweets responding to these brands will be scraped and classified as either positive or negative. 

It is worth noting that the publicly available YouGov brand data is country specific. To ensure the validity of the data, only UK data will be considered. The UK was chosen for three reasons. First, in most countries, large brands have region specific twitter accounts (i.e., Pizza Hut has the @pizzahutuk twitter account). However, often times for the United States, brands opt not to have a US specific account (i.e., the American Pizza Hut account is simply @pizzahut). It is fair to assume that someone responding to a brand's regional account has a high likelihood of being from that region. However, it is not fair to assume that tweets directed at a brand's main account are from the US. Second, the UK is a predominantly English speaking and writing country, enabling me to process the text of the replies. Third, I could not find this same level of data publicly available for any other country satisfying the above two criteria.

### Modules used for analysis, version numbers are available in packages.txt

In [2]:
import numpy as np
import pandas as pd
from selenium import webdriver
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import random
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import json
from sklearn.feature_extraction.text import CountVectorizer
import dill
from flair.models import TextClassifier
from flair.data import Sentence
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

## Data Collection <a name="DataCollection"/>

Both the survey data and Twitter data were scraped using the Selenium library. The benefit of Selenium for this project comes from the ability to click on buttons, collect data not in a tabular form, and navigate to links within other webpages.

### Survey Scraping

Data comes from https://yougov.co.uk/ratings/consumer/fame/brands/all (collected 2022 Q2 data). Only brands with fame of at least 95% were included. This criterion was chosen with the assumption that more famous brands are more likely to have regional twitter accounts, and with a greater number of replies than their less famous counterparts.

In [10]:
public_data_url = "https://yougov.co.uk/ratings/consumer/fame/brands/all" #survey data site

def page_down_wait_one_second(webdriver, body='/html/body'):
    webdriver.find_element(By.XPATH, body).send_keys(Keys.PAGE_DOWN)
    time.sleep(1)

def repeat_page_down(number_of_times, webdriver, body='/html/body'):
    for x in range(0,number_of_times):
        page_down_wait_one_second(webdriver, body='/html/body')

# chunk below: going to survey url and getting all data on screen
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(public_data_url)
driver.find_element(By.ID, "onetrust-accept-btn-handler").send_keys(Keys.ENTER)
driver.find_element(By.NAME, "rankings-load-more-entities").send_keys(Keys.ENTER)
repeat_page_down(70, driver) # after viewing the webpage, pressing page down 70 times is sufficient to capture all relevant data on screen

[WDM] - Downloading: 100%|██████████| 6.21M/6.21M [00:00<00:00, 7.57MB/s]
  if sys.path[0] == "":


In [11]:
# get link for yougov profile for each brand. This page has more robust sentiment data than the main list
all_brands = driver.find_elements(By.XPATH, '//yg-rankings-entities-list//a')
all_brands_links = [brand.get_attribute("href") for brand in all_brands]
above_95 = all_brands_links[0:343] # only include brands with fame >= 95

In [12]:
temp_driver = webdriver.Chrome(ChromeDriverManager().install())

  """Entry point for launching an IPython kernel.


In [13]:
# loop through each url and scrape sentiment data
brands_sentiment = []
for link in above_95:
    temp_driver.get(link)
    time.sleep(10) # Selenium risks recording incomplete data if there is not sufficient time for the page to load
    just_sentiment = [item.text for item in temp_driver.find_elements(By.CLASS_NAME, "value")]
    just_name = temp_driver.find_element(By.CLASS_NAME, "entity-name").text
    brand_sentiment_and_name = [just_name, just_sentiment[0], just_sentiment[1], just_sentiment[2], just_sentiment[3]]
    brands_sentiment.append(brand_sentiment_and_name)
    print(brand_sentiment_and_name)

['ADIDAS', '100%', '67%', '6%', '26%']
['ALDI', '100%', '72%', '9%', '19%']
['CADBURY ROSES', '99%', '67%', '10%', '22%']
['MALTESERS', '99%', '86%', '5%', '8%']
['LINDT LINDOR MILK CHOCOLATE', '99%', '77%', '8%', '15%']
['PRINGLES', '99%', '74%', '8%', '18%']
['PIZZA HUT', '99%', '54%', '17%', '29%']
['WEETABIX', '99%', '77%', '7%', '16%']
['VOLVO', '99%', '55%', '7%', '37%']
['KIT KAT FOUR FINGER', '99%', '80%', '7%', '13%']
['TESCO', '99%', '76%', '6%', '18%']
['DOVE', '99%', '76%', '2%', '21%']
['GREGGS', '99%', '73%', '7%', '19%']
['HEINZ TOMATO KETCHUP', '99%', '78%', '9%', '13%']
['SKY', '99%', '49%', '23%', '27%']
['TK MAXX', '99%', '55%', '12%', '32%']
['FERRERO ROCHER', '99%', '65%', '16%', '18%']
['SPORTS DIRECT', '99%', '45%', '23%', '32%']
['SHELL', '99%', '31%', '31%', '38%']
['PG TIPS', '99%', '70%', '8%', '21%']
['VISA', '99%', '71%', '7%', '21%']
['TWIX', '99%', '76%', '6%', '18%']
['LEGO', '99%', '83%', '3%', '14%']
['AUDI', '99%', '56%', '11%', '32%']
['LLOYDS BANK',

In [14]:
#Removing percent sign and casting as float
for brand in brands_sentiment:
    brand[1] = float( str.replace(brand[1], '%', "") )
    brand[2] = float( str.replace(brand[2], '%', "") )
    brand[3] = float( str.replace(brand[3], '%', "") )
    brand[4] = float( str.replace(brand[4], '%', "") )

In [15]:
#saving survey data to csv
df_public = pd.DataFrame(brands_sentiment)
df_public.columns = ['name', 'fame', 'positive', 'negative', 'neutral']
df_public.to_csv('public_data.csv')
driver.close()
temp_driver.close()

### Twitter Data

I hand collected the twitter handles for companies with fame greater than 95%. This data is stored on the famous_brands.csv file.

The inclusion criteria is as follows:
- The brand's Twitter handle must contain the prefix or suffix 'UK'.
- A subsidiary brand like Peanut M&M will have its larger brand included (M&M), unless there exists an account for the subsidiary brand.
- If multiple accounts claiming to be the twitter account of a brand are shown in the search bar, and none of them are verified, no account will be added.

#### Preparation

In [17]:
famous_brands_with_twitter_df = pd.read_csv('famous_brands_with_twitter.csv')

Verifying that each brand in famous_brands_with_twitter_df is in public_df

In [18]:
brand_names_with_twitter = list(famous_brands_with_twitter_df['name'])
public_brand_names = list(df_public['name'])

not_included = []
for brand in brand_names_with_twitter:
    if brand.upper() not in public_brand_names:
        not_included.append(brand)

print(not_included)
print(len(brand_names_with_twitter))

[]
123


Adding Twitter handle variable to survey data

In [19]:
#Joining twitter handle to dataframe with survey sentiment data
famous_brands_with_twitter_df['name'] = famous_brands_with_twitter_df['name'].apply(lambda x: str.upper(x))
public_with_twitter_df = pd.merge(df_public, famous_brands_with_twitter_df, on='name')
public_with_twitter_df.drop(columns='Unnamed: 0', inplace=True)
public_with_twitter_df

Unnamed: 0,name,fame,positive,negative,neutral,twitter_uk
0,ADIDAS,100.0,67.0,6.0,26.0,adidasUK
1,ALDI,100.0,72.0,9.0,19.0,AldiUK
2,MALTESERS,99.0,86.0,5.0,8.0,MaltesersUK
3,PRINGLES,99.0,74.0,8.0,18.0,Pringles_UK
4,PIZZA HUT,99.0,54.0,17.0,29.0,pizzahutuk
...,...,...,...,...,...,...
118,CASIO,95.0,54.0,7.0,33.0,CasioMusicUK
119,THE BODY SHOP,95.0,56.0,8.0,31.0,TheBodyShopUK
120,BOUNTY,95.0,52.0,20.0,22.0,BountyUK
121,HÄAGEN-DAZS COOKIES AND CREAM ICE CREAM,95.0,50.0,16.0,29.0,haagendazsuk


In [20]:
public_with_twitter_df.to_csv('merged_corp_public.csv')

#### Scraping Twitter

In [21]:
advanced_search_url = 'https://twitter.com/search-advanced?lang=en'

def q2_search_url(handle):
    return f'https://twitter.com/search?lang=en&q=(to%3A{handle})%20until%3A2022-06-30%20since%3A2022-04-01&src=typed_query'

driver = webdriver.Chrome(ChromeDriverManager().install())

  


Following code does three things for each brand
1. Search for tweets sent in reply to the brand during Q2 period
2. Scrape all these tweets
3. Create a dictionary with handle as key, with a list of all tweets to that brand as the value

In [22]:
twitter_handles = pd.read_csv('merged_corp_public.csv')
twitter_handles = list(twitter_handles['twitter_uk'])

brand_replies = {}
for handle in twitter_handles:
    driver.get(q2_search_url(handle))
    time.sleep(10) # allow sufficient time for page to fully load
    # After scrolling some distance down, earlier tweets will not be accessable from the HTML file.
    # To account for this, all available tweets will be scraped after every page down, and duplicate tweets will be removed later
    tweets_with_duplicates = []
    for pg_down in range(50):
        #scrape text from tweets
        tweet_objects = driver.find_elements(By.XPATH, '//div[@data-testid="tweetText"]')    
        for tweet in tweet_objects:
            tweets_with_duplicates.append(tweet.text)
        repeat_page_down(1, driver)
    # including only unique tweets
    tweets = []
    for tweet in tweets_with_duplicates:
        if tweet not in tweets:
            tweets.append(tweet)
    brand_replies.update({handle: tweets})

In [23]:
with open("brand_replies.json", "w") as file:
    json.dump(brand_replies, file)

## Data Cleaning <a name="DataCleaning"/>

Verifying every handle in the twitter replies corresponds with a handle in the merged_corp_public dataset

In [24]:
bool_test = []
for brand in brand_replies.keys():
    bool_test.append(brand in list(twitter_handles))

print(False in bool_test)

False


Verifying correct data types

In [25]:
public_with_twitter_df.dtypes #correct data types

name           object
fame          float64
positive      float64
negative      float64
neutral       float64
twitter_uk     object
dtype: object

Processing Text to Remove Symbols and Non-English Letters

In [26]:
with open('brand_replies.json') as file:
    brand_replies = json.load(file)

In [27]:
symbols = []
number_of_characters = 0
for brand in brand_replies:
    for reply in brand_replies[brand]:
        for char in reply:
            symbols.append(char)
            number_of_characters += 1
symbols = set(symbols)

english_letters_and_space = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ']


In [28]:
english_letters_and_space = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ']

# replace newline with space
for brand in brand_replies:
    replies_no_symbols = []
    for reply in brand_replies[brand]:
        reply = reply.replace('\n', "")
        replies_no_symbols.append(reply)
    brand_replies.update({brand : replies_no_symbols})  

# removing all non-english symbols from tweets, allowing for greater text processing
for brand in brand_replies:
    replies_no_symbols = []
    for reply in brand_replies[brand]:
        english_reply = reply # initializes vairable that will hold a reply containing only english letters
        for symbol in symbols:
            if symbol not in english_letters_and_space:
                english_reply = english_reply.replace(symbol, "")
        replies_no_symbols.append(english_reply)
    brand_replies.update({brand : replies_no_symbols})    

In [29]:
with open("brand_replies.json", "w") as file:
    json.dump(brand_replies, file)

Survey data has values within valid range of 0 and 100

In [30]:
num_of_fame_violations = len(public_with_twitter_df[ (public_with_twitter_df["fame"] > 100) | (public_with_twitter_df["fame"] < 0) ])
num_of_positive_violations = len(public_with_twitter_df[ (public_with_twitter_df["positive"] > 100) | (public_with_twitter_df["positive"] < 0) ])
num_of_negative_violations = len(public_with_twitter_df[ (public_with_twitter_df["negative"] > 100) | (public_with_twitter_df["negative"] < 0) ])
num_of_neutral_violations = len(public_with_twitter_df[ (public_with_twitter_df["neutral"] > 100) | (public_with_twitter_df["neutral"] < 0) ])
num_of_total_violations = num_of_fame_violations + num_of_positive_violations +num_of_negative_violations + num_of_neutral_violations
print(num_of_total_violations)

0


Theoretically, Fame = Positive + Negative + Neutral. However as shown below, rounding/truncating causes the formula to occasionally be slightly off. Beyond setting the inclusion criteria, the fame variable plays no role in the remainder of the analysis. As such the violation poses no threat to analysis.

In [31]:
print(len(public_with_twitter_df[ public_with_twitter_df['fame'] != (public_with_twitter_df["positive"] + public_with_twitter_df["negative"] + public_with_twitter_df["neutral"]) ])) # many rows violate the fame = positive + negative + neutral
# testing if discrepency is due to rounding
rounding_test = pd.Series(public_with_twitter_df['fame'] - (public_with_twitter_df["positive"] + public_with_twitter_df["negative"] + public_with_twitter_df["neutral"]))
print(max(rounding_test))
print(min(rounding_test))
# since the difference between public_with_twitter_df['fame'] and (public_with_twitter_df["positive"] + public_with_twitter_df["negative"] + public_with_twitter_df["neutral"]) is at most 1 point away, and the original data did not have
# fractions, the difference is most likely due to rounding/truncating.

38
1.0
-1.0


### Missing Values
There were no missing values in the survey data. There were multiple brands that did not get responses to their tweets in Q2 2022. Since no sentiment analysis could be performed on these brands, they were dropped from analysis.

In [32]:
# for survey data
public_with_twitter_df.isnull().sum()

name          0
fame          0
positive      0
negative      0
neutral       0
twitter_uk    0
dtype: int64

In [33]:
# for tweets
num_of_brands_no_replies = 0
brands_no_replies = []
for brand in brand_replies:
    if brand_replies[brand] == []:
        num_of_brands_no_replies += 1
        brands_no_replies.append(brand)

# As no twitter data exists to estimate sentiment distribution, these brands will be dropped
for brand in brands_no_replies:
    del brand_replies[brand]

# brands_no_replies in public_with_twitter_df["twitter_uk"]
just_handles = public_with_twitter_df["twitter_uk"]
public_with_twitter_df = public_with_twitter_df[ public_with_twitter_df['twitter_uk'].apply(lambda handle: False if (handle in brands_no_replies) else True) ]

### Filtering out Brands with less than 10 Replies

Finding Brands with less than 10 replies

In [40]:
with open('brand_replies.json') as file:
    brand_dict = json.load(file)

In [41]:
brand_handles , brand_replies = list(brand_dict.keys()), list(brand_dict.values())

In [None]:
reply_counter_per_brand = []
for brand in brand_replies:
    counter = 0
    for tweet in brand:
        counter += 1
    reply_counter_per_brand.append(counter)

Joining twitter reply count to dataset

In [43]:
df = pd.read_csv('twitter_vs_public.csv')
df_to_merge = pd.DataFrame(list(zip(brand_handles,reply_counter_per_brand)), columns=['twitter_uk', 'reply_counter'])

In [44]:
df = pd.merge(df, df_to_merge, on='twitter_uk')

In [45]:
print(len(df))
print(len(df[ df['reply_counter'] >= 10 ]))

115
102


In [46]:
df = df[ df['reply_counter'] >= 10 ]

In [47]:
df.to_csv('twitter_vs_public.csv')

## Analysis <a name="Analysis"/>

### Sentiment Analysis Model

Flair is a pretrained neural network used for sentiment classification. As demonstrated below, Flair has a high accuracy for predicting the sentiment of amazon reviews. One star reviews were used for negative sentiment, five star reviews were used for positive sentiment. 10000 reviews from each category were randomly chosen and passed through the model for classification. The model demonstrated an accuracy and precision greater than 90%.

Flair classifies data as either 'Positive' or 'Negative'. However, some text may be neutral. This is examined in greater detail in the appendix

### Amazon and Flair

In [6]:
class Sentiment:
    negative = 'NEGATIVE'
    positive = 'POSITIVE'
    neutral = 'NEUTRAL'

class Review:
    def __init__(self, text, rating):
        self.text = text
        self.rating = int(rating)
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.rating == 5:
            return Sentiment.positive
        elif self.rating == 1:
            return Sentiment.negative
        elif self.rating == 3:
            return Sentiment.neutral
        else: return "DROP"

In [7]:
# importing data
amazon_data = []
with open('kcore_5.json') as file:
    num = 0
    for line in file:
        review = json.loads(line)
        amazon_data.append(Review(review["reviewText"], review["overall"]))
        num += 1
        if num >= 250000:
            break

In [8]:
one_star_reviews = []
five_star_reviews = []
for review in amazon_data:
    if review.text == "":
        amazon_data.remove(review)
        continue
    if review.rating == 1:
        one_star_reviews.append(review)
    elif review.rating == 5:
        five_star_reviews.append(review)

print("number of 1 star reviews: ",len(one_star_reviews))
print("number of 5 star reviews: ",len(five_star_reviews))

number of 1 star reviews:  12706
number of 5 star reviews:  142122


In [9]:
#selecting a subset of data for testing flair model
random.seed(0)
ten_k_one_star = random.sample(one_star_reviews,10000)
ten_k_five_star = random.sample(five_star_reviews,10000)
aggregated_samples = []
for sample in [ten_k_one_star, ten_k_five_star]:
    for draw in sample:
        aggregated_samples.append(draw)

print("number of observations: ",len(aggregated_samples))

number of observations:  20000


In [10]:
text_reviews = []
for review in aggregated_samples:
    text_reviews.append(review.text)

numerical_reviews = []
for review in aggregated_samples:
    numerical_reviews.append(review.rating)

In [11]:
classifier = TextClassifier.load('en-sentiment')

2022-08-29 09:51:49,737 loading file C:\Users\franc\.flair\models\sentiment-en-mix-distillbert_4.pt


In [12]:
predicted_sentiment = []

for review in text_reviews:
    sentence = Sentence(review)
    classifier.predict(sentence)

    # taking prediction from object form into string
    temp_string = str.split(sentence.labels[0].__str__(), '→')[1]
    temp_string = str.split(temp_string,'(')[0].strip()


    predicted_sentiment.append(temp_string) #POSITIVE and NEGATIVE options


In [13]:
amazon_sentiment_data = pd.DataFrame(list(zip(text_reviews, numerical_reviews, predicted_sentiment)), columns=["text_reviews", "true_stars_reviews", "predicted_sentiment"])
    

def stars_to_words(series):
    if series == 1:
        return 'NEGATIVE'
    if series == 3:
        return 'NEUTRAL'
    if series == 5:
        return 'POSITIVE'


amazon_sentiment_data['true_sentiment_reviews'] = amazon_sentiment_data['true_stars_reviews'].apply(lambda x: stars_to_words(x))

In [14]:


print("Accuracy Score: ",accuracy_score(amazon_sentiment_data["true_sentiment_reviews"], amazon_sentiment_data["predicted_sentiment"]))
print("Precision Score: ",precision_score(amazon_sentiment_data["true_sentiment_reviews"], amazon_sentiment_data["predicted_sentiment"], pos_label=Sentiment.positive))

Accuracy Score:  0.9477
Precision Score:  0.9690905280804694


### Flatten Twitter Reply Data

The below code transforms the twitter data to have one response on each row, rather than one row containing every reply to a brand

In [15]:
with open('brand_replies.json') as file:
    brand_replies = json.load(file)

In [16]:
brand_list = []
reply_list = []
sentiment_list = []

for brand in brand_replies:
    for reply in brand_replies[brand]:
        brand_list.append(brand)
        reply_list.append(reply)
        

flat_reply = pd.DataFrame(list(zip(brand_list, reply_list)), columns=['brand', 'reply'])
print(len(flat_reply))
flat_reply = flat_reply[ flat_reply['reply'] != ""]
flat_reply = flat_reply[ flat_reply['reply'] != " "]
print(len(flat_reply))

8115
8104


### Sentiment Analysis, Twitter

applying the Flair model to scraped tweets

In [17]:
predicted_sentiment = []

for reply in flat_reply['reply']:
    sentence = Sentence(reply)
    classifier.predict(sentence)
    # taking prediction from object form into string
    temp_string = str.split(sentence.labels[0].__str__(), '→')[1]
    temp_string = str.split(temp_string,'(')[0].strip()

    predicted_sentiment.append(temp_string) #POSITIVE and NEGATIVE options

In [18]:
predicted_sentiment_dummy = []
positive_prediction = []
negative_prediction = []

for prediction in predicted_sentiment:
    if prediction == Sentiment.positive:
        predicted_sentiment_dummy.append(1)
        positive_prediction.append(1)
        negative_prediction.append(0)
    if prediction == Sentiment.negative:
        predicted_sentiment_dummy.append(0)
        negative_prediction.append(1)
        positive_prediction.append(0)

#flat_reply["classification"] = predicted_sentiment_dummy
flat_reply["positive_prediction"] = positive_prediction
flat_reply["negative_prediction"] = negative_prediction
flat_reply["total_prediction"] = flat_reply['positive_prediction'] + flat_reply['negative_prediction']

Aggregating predicted sentiment data to merge with survey data

In [19]:
grouped_twitter = flat_reply.groupby('brand').mean(['positive_prediction', 'negative_prediction'])

In [20]:
grouped_twitter.to_csv('brand_classification.csv')

Merging survey data and classification summary data

In [21]:
grouped_twitter = pd.read_csv('brand_classification.csv')
merged_corp_public_df = pd.read_csv('merged_corp_public.csv')

In [22]:
#creating shared column name

grouped_twitter.rename(columns={'brand': 'twitter_uk'}, inplace=True)

print(len(grouped_twitter))
print(len(merged_corp_public_df))
df = pd.merge(merged_corp_public_df, grouped_twitter, on='twitter_uk', how='inner')
print(len(df))

114
123
115


Error in data collection, Tescos and Tesco Express were both added when they shared the same twitter handle. Tesco Express is dropped below

In [23]:
df[ df['twitter_uk'] == 'TescosUK' ]
df = df[ df['name'] != 'TESCO EXPRESS' ] # 1 obs was removed, error in data collection, they share the same twitter handle

In [24]:
df.drop(columns=['fame', 'neutral', 'Unnamed: 0'], inplace=True) #not necessary for analysis

In [25]:
df.rename(columns={'positive': 'positive_survey', 'negative': 'negative_survey'}, inplace=True)

In [26]:
df.to_csv('twitter_vs_public.csv')

## Creating additional Variables

3 additional variables were created as defined below:
1. like_50_survey = 1 if at least 50% of respondents had a positive impression of a brand, 0 otherwise
2. like_50_prediction = 1 if replies to a brand was at least 50% positive, 0 otherwise
3. twitter_more_popular = 1 if a brand had a greater percent of positive responses than positive impression from survey

In [27]:
df = pd.read_csv('twitter_vs_public.csv')

In [28]:
df['like_50_survey'] = df['positive_survey'].apply(lambda x: 1 if (x >= 50) else 0 )

In [29]:
df['like_50_prediction'] = df['positive_prediction'].apply(lambda x: 1 if (x >= .50) else 0 )

In [30]:
df['twitter_more_positive'] = (df['positive_prediction']*100 > df['positive_survey']).astype(int)

In [31]:
df.to_csv('twitter_vs_public.csv')

## Results <a name="Results"/>

%%html
<div class='tableauPlaceholder' id='viz1661448710888' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Tw&#47;TwittervsPublic&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='TwittervsPublic&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Tw&#47;TwittervsPublic&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1661448710888');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1526px';vizElement.style.height='878px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

## Appendix <a name="Appendix"/>

The sentiment classification model only classifies a string as positive or negative. As such tweets that were neutral in sentiment were classified as positive or negative. This section aims to estimates the classification distribution of neutral text via amazon review data. 

In [32]:
# importing data
amazon_data = []
with open('kcore_5.json') as file:
    num = 0
    for line in file:
        review = json.loads(line)
        amazon_data.append(Review(review["reviewText"], review["overall"]))
        num += 1
        if num >= 250000:
            break

In [33]:
three_star_reviews = []
for review in amazon_data:
    if review.text == "":
        amazon_data.remove(review)
        continue
    if review.rating == 3:
        three_star_reviews.append(review)


print("number of 3 star reviews: ",len(three_star_reviews))

number of 3 star reviews:  26019


In [34]:
#selecting a subset of data for testing flair model
random.seed(0)
ten_k_three_star = random.sample(three_star_reviews,10000)
aggregated_samples = []
for sample in [ten_k_three_star]:
    for draw in sample:
        aggregated_samples.append(draw)

len(aggregated_samples)

10000

In [35]:
text_reviews = []
for review in aggregated_samples:
    text_reviews.append(review.text)

numerical_reviews = []
for review in aggregated_samples:
    numerical_reviews.append(review.rating)

In [36]:
classifier = TextClassifier.load('en-sentiment')

2022-08-29 10:41:57,073 loading file C:\Users\franc\.flair\models\sentiment-en-mix-distillbert_4.pt


In [37]:
predicted_sentiment = []

for review in text_reviews:
    sentence = Sentence(review)
    classifier.predict(sentence)

    # taking prediction from object form into string
    temp_string = str.split(sentence.labels[0].__str__(), '→')[1]
    temp_string = str.split(temp_string,'(')[0].strip()


    predicted_sentiment.append(temp_string) #POSITIVE and NEGATIVE options

predicted_sentiment

['NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',

In [38]:
amazon_sentiment_data = pd.DataFrame(list(zip(text_reviews, numerical_reviews, predicted_sentiment)), columns=["text_reviews", "true_stars_reviews", "predicted_sentiment"])
    

def stars_to_words(series):
    if series == 1:
        return 'NEGATIVE'
    if series == 3:
        return 'NEUTRAL'
    if series == 5:
        return 'POSITIVE'


amazon_sentiment_data['true_sentiment_reviews'] = amazon_sentiment_data['true_stars_reviews'].apply(lambda x: stars_to_words(x))
amazon_sentiment_data

Unnamed: 0,text_reviews,true_stars_reviews,predicted_sentiment,true_sentiment_reviews
0,I read a review of this in the Marine Corp Gaz...,3,NEGATIVE,NEUTRAL
1,"I love me some Kingsolver, but this book was a...",3,NEGATIVE,NEUTRAL
2,I have discovered Elizabeth George years ago w...,3,POSITIVE,NEUTRAL
3,"In EXECUTIVE ORDERS, Tom Clancy continues the ...",3,NEGATIVE,NEUTRAL
4,"Interesting, but the author really needs a goo...",3,NEGATIVE,NEUTRAL
...,...,...,...,...
9995,I found this hard to write because I am such a...,3,NEGATIVE,NEUTRAL
9996,"Kaye Gibbons has a very breezy, readable style...",3,POSITIVE,NEUTRAL
9997,This was a true story and what a sad story it ...,3,NEGATIVE,NEUTRAL
9998,Let me start by saying that I like photos in m...,3,NEGATIVE,NEUTRAL


Among the 3 star reviews, 74% of them were assigned as negative and 26% were assigned positive. From personal experience and opinion, 3 star amazon reviews tend to skew negative, which may partially account for the difference, but is nevertheless important to keep in mind when considering this analysis. 

In [39]:
only_threes = amazon_sentiment_data[ amazon_sentiment_data['true_stars_reviews'] == 3 ]
print('the number of 3 star reviews predicted positive are: ' ,len( only_threes[ only_threes['predicted_sentiment'] == Sentiment.positive ])/len(only_threes))
print('the number of 3 star reviews predicted negative are: ' ,len( only_threes[ only_threes['predicted_sentiment'] == Sentiment.negative ])/len(only_threes))

the number of 3 star reviews predicted positive are:  0.2578
the number of 3 star reviews predicted negative are:  0.7422
