# Apple Sentiment on Twitter: 2013 vs 2023
By Sarah Prusaitis, Rick Lataille, and Allison Ward

## Overview
- 2013 SXSW festival was a big success, very positive responses to new products
- What was the nature of that positive response, what did people like
- 10 years later, what is the public's response to Apple's new products

## Business Problem
- Has Apple maintained the public's support

## Data Limitations
- language shifts over time
- dataset only includes pos/neg, no neutral examples
- working with limited vocabulary, more robust approach would train on larger dataset
- pos/neg sentiment may reflect not just apple, but tech in general

# Analysis
- Using 2013 dataset, tagged positive or negative by human raters
    - Also including 10% of more recent tweets, labeled with VADER, to broaden vocabulary
- Applying fitted models to new Vision Pro datasets to determine sentiment balance

<span style="color: red;">NOTE: I ADDED ITERTOOLS IN THE NEXT CELL<span>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns
import networkx as nx

import re
import string
import langid
import itertools

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer, word_tokenize, regexp_tokenize, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import lightgbm as lgb

In [2]:
# Read original data
# df = pd.read_csv('data/judge-1377884607_tweet_product_company.csv', encoding = 'ISO-8859-1')

In [3]:
# # Read synthetic data
# df2 = pd.read_csv('data/Apple_Product_Negative_ Tweets_Sheet1.csv', encoding = 'ISO-8859-1')

<span style="color: red;">NOTE: I ADDED THE NEXT SEVERAL CELLS<span>

In [31]:
# Read Vision Pro data
df_vp = pd.read_csv('data/vision_pro_sentiment.csv',
                    encoding='ISO-8859-1',
                    usecols=['tweetText', 'mark']
                   ).rename(columns={'tweetText': 'tweet', 'mark':'sentiment'})

### Pre-processing
- Converting emoticons to unique strings
- Removing semantically meaningless patterns (mentions, links, etc)
- Adding limited additional stopwords that likely have no semantic meaning
- Tokenization, POS-tagging and Lemmatization
- TF-IDF Vectorization and Doc2Vec

In [32]:
# Keep only English-language tweets, then drop language column
df_vp['language'] = df_vp['tweet'].apply(lambda x: langid.classify(x)[0])
df_english = df_vp.loc[df_vp['language']=='en'].drop('language',axis=1).copy()
df_english.to_csv('Data/english_vp.csv')

In [6]:
# # Take 10% of Vision Pro tweets for training purposes
# df_english_train = df_english.sample(n=round(len(df_vp)*.1))
# df_english.drop(df_english_train.index, inplace=True)

In [7]:
# Convert Vision Pro labels to 1, 0
convert = {'Positive emotion':1, 'Negative emotion':0}
df_english['sentiment'] = df_english['sentiment'].map(convert)

In [8]:
# # Rename columns for simplicity
# df = df.rename(columns = {'tweet_text': 'tweet', 
#                          'emotion_in_tweet_is_directed_at': 'product', 
#                          'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'})

In [9]:
# Combine rows into a single DataFrame
# df = pd.concat([df1, df2], ignore_index = True)

In [10]:
# # Combined and renamed Apple products and non Apple products 

# df['product'] = df['product'].replace({
#     'iPad': 'Apple',
#     'Apple': 'Apple',
#     'iPad or iPhone App': 'Apple',
#     'iPhone': 'Apple',
#     'Other Apple product or service': 'Apple',
#     'Google': 'Other',
#     'Other Google product or service': 'Other',
#     'Android App': 'Other',
#     'Android': 'Other'
# })
# #there are 5802 rows that are null - what should we do with those?

<span style="color: red;">NOTE: I THINK I MOVED THIS UP IN ORDER TO DROP AS MANY TWEETS AS EARLY AS POSSIBLE (MAKES SUBSEQUENT CODE RUN FASTER)<span>

In [11]:
# # Filter DataFrame for only Apple tweets, and drop 'product column'
# df_apple = df[df['product']=='Apple'].drop('product',axis=1).copy()

In [12]:
# # Bring in Vision Pro tweets
# df_apple = pd.concat([df_apple, df_english_train], ignore_index = True)

In [13]:
# Consolidate no emotion entries, and drop
# df_apple['sentiment'] = df_apple['sentiment'].replace("I can't tell", "No emotion toward brand or product")
# df_apple = df_apple.drop(df_apple[df_apple['sentiment'] == 'No emotion toward brand or product'].index).reset_index(drop=True)

In [14]:
# df_apple['tweet'] = df_apple['tweet'].astype(str)

In [15]:
def replace_emoticons(text):
    # Define a dictionary mapping emoticons to their corresponding meanings
    emoticon_mapping = {
        ':D': 'emojismile',
        ':)': 'emojismile',
        ':-D': 'emojismile',
        ':\'': 'emojiunsure',
        ':p': 'emojitongue',
        ':P': 'emojitongue',
        ':(': 'emojisad'
        # Add more emoticons and their meanings as needed
    }
    pattern = re.compile('|'.join(re.escape(emoticon) for emoticon in emoticon_mapping.keys()))
    
    def replace(match):
        return emoticon_mapping[match.group(0)]

    return pattern.sub(replace, text)

<span style="color: red;">NOTE: I ADDED A LINE BELOW<span>

In [16]:
# Replace emoticons with mapped strings
# df_apple['tweet'] = df_apple['tweet'].apply(replace_emoticons)
df_english['tweet'] = df_english['tweet'].apply(replace_emoticons)

In [17]:
####  NEW

# Define stopwords
additional_stopwords = {'w', 'u', 'amp', 'sxsw', 'rt', 'apple', 'sxswi', 'ipad', 'iphone', 'store'}  # amp = & 
stop_words = set(stopwords.words('english'))
stop_words.update(additional_stopwords)

In [18]:
def preprocess_tweet(tweet):
    # Remove links and mentions
    tweet = re.sub(r'http\S+|@\S+', '', tweet)
    
    # Remove {link}
    tweet = re.sub(r'\{link\}', '', tweet)
    
    # Replace &quot; with "
    tweet = tweet.replace('&quot;', '"')
    
    # Remove extra space between quotation mark and words
    tweet = re.sub(r'\s+"', '"', tweet)
    tweet = re.sub(r'"\s+', '"', tweet)
    
    # Convert to lowercase
    tweet = tweet.lower()
    
    # Remove numbers
    tweet = re.sub(r'\d+', '', tweet)
    
    # Remove punctuation
    tweet = re.sub(r'([^\w\s]|_)+', ' ', tweet)
    
    # Tokenize
    tokens = nltk.word_tokenize(tweet)
    
    # Part-of-speech tagging
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for word, pos in tagged_tokens:
        if pos.startswith('J'):
            pos = 'a'  # Adjective
        elif pos.startswith('V'):
            pos = 'v'  # Verb
        elif pos.startswith('N'):
            pos = 'n'  # Noun
        elif pos.startswith('R'):
            pos = 'r'  # Adverb
        else:
            pos = 'n'  # Default to noun
        lemma = lemmatizer.lemmatize(word, pos=pos)
        lemmatized_tokens.append(lemma)
    
    # Remove stopwords
    tweet = [word for word in tokens if word not in stop_words]
    
    return tweet

<span style="color: red;">NOTE: I ADDED A LINE BELOW<span>

In [19]:
# Preprocess all tweets
# df_apple['tweet'] = df_apple['tweet'].astype(str).apply(preprocess_tweet)
df_english['tweet'] = df_english['tweet'].astype(str).apply(preprocess_tweet)

In [20]:
# Label target with 1's and 0's
# df_apple['target'] = df_apple['sentiment'].map({'Positive emotion': 1, 'Negative emotion': 0})

In [21]:
#### NEW
df_english = df_english.loc[df_english['sentiment']==1]

In [22]:
# # train test split
# X = df_apple['tweet']
# y = df_apple['target'] # Target
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [23]:
# Save tokenized data for Doc2Vec vectorizer
# tokenized_train_data = X_train
# tokenized_test_data = X_test

In [24]:
# Convert tokenized tweets back into strings for TfidfVectorizer
# X_train_str = X_train.apply(lambda x: ' '.join(x))
# X_test_str = X_test.apply(lambda x: ' '.join(x))


In [25]:
# ####  NEW
# tweets = pd.concat([X_train_str, X_test_str])
tweets = df_english['tweet'].apply(lambda x: ' '.join(x))

In [26]:
####  NEW
token_list = []
for tweet in tweets:
    tokens = word_tokenize(tweet)
    token_list.extend(tokens)

In [27]:
#### NEW

fdist_train = FreqDist(token_list)

In [28]:
#### NEW

print(f'Total Number of tokens: {fdist_train.N()}')
print(f'Total Unique tokens: {fdist_train.B()}')
print(f"Frequency of 'pop': {fdist_train['pop']}")
most_common_words = fdist_train.most_common(250)
print(f"Most Common Words: {most_common_words}")

Total Number of tokens: 122591
Total Unique tokens: 10855
Frequency of 'pop': 7
Most Common Words: [('applevisionpro', 6258), ('vision', 3854), ('ð', 3848), ('pro', 2945), ('â', 1090), ('like', 899), ('vr', 868), ('visionpro', 719), ('new', 663), ('ar', 572), ('see', 553), ('future', 520), ('tech', 512), ('love', 485), ('app', 472), ('get', 467), ('experience', 455), ('quest', 451), ('ï', 450), ('spatial', 448), ('would', 440), ('one', 381), ('headset', 378), ('spatialcomputing', 371), ('zombies', 371), ('win', 365), ('technology', 361), ('jup', 354), ('gme', 351), ('first', 337), ('exciting', 334), ('amazing', 331), ('visionos', 319), ('solana', 318), ('time', 312), ('video', 307), ('meta', 305), ('apps', 302), ('ai', 300), ('innovation', 300), ('best', 298), ('world', 298), ('reality', 297), ('also', 290), ('free', 281), ('share', 274), ('immersive', 271), ('great', 271), ('better', 270), ('use', 265), ('digital', 260), ('virtualreality', 258), ('ready', 258), ('way', 255), ('wait', 

In [29]:
####  NEW

[item[0] for item in most_common_words]

['applevisionpro',
 'vision',
 'ð',
 'pro',
 'â',
 'like',
 'vr',
 'visionpro',
 'new',
 'ar',
 'see',
 'future',
 'tech',
 'love',
 'app',
 'get',
 'experience',
 'quest',
 'ï',
 'spatial',
 'would',
 'one',
 'headset',
 'spatialcomputing',
 'zombies',
 'win',
 'technology',
 'jup',
 'gme',
 'first',
 'exciting',
 'amazing',
 'visionos',
 'solana',
 'time',
 'video',
 'meta',
 'apps',
 'ai',
 'innovation',
 'best',
 'world',
 'reality',
 'also',
 'free',
 'share',
 'immersive',
 'great',
 'better',
 'use',
 'digital',
 'virtualreality',
 'ready',
 'way',
 'wait',
 'people',
 'itâ',
 'create',
 'iâ',
 'us',
 'xr',
 'good',
 'metaquest',
 'computing',
 'device',
 'via',
 'think',
 'using',
 'day',
 'game',
 'make',
 'metaverse',
 'tag',
 'news',
 'look',
 'help',
 'opportunity',
 'launch',
 'virtual',
 'today',
 'experiences',
 'want',
 'check',
 'next',
 'days',
 'fun',
 'friend',
 'watch',
 'could',
 'available',
 'real',
 'really',
 'spread',
 'x',
 'cool',
 'latest',
 'work',
 'even

#### TF-IDF Vectorization
- Most common approach, more meaning than simple bag-of-words 

<span style="color: red;">NOTE: I CHANGED SOME OF THE INSTANTIATION ARGUMENTS<span>

In [30]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1,2))

# Fit and transform the vectorizer on the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_str)
X_test_tfidf = tfidf_vectorizer.transform(X_test_str)

NameError: name 'X_train_str' is not defined

<span style="color: red;">NOTE: THIS CELL IS NEW, IT'S NEEDED IN ORDER TO RESTRICT VOCAB ON VISION PRO DATA<span>

In [None]:
# Get vocab for vision pro analysis and retrain
vocab = tfidf_vectorizer.get_feature_names_out()
tfidf_vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1,2), vocabulary=vocab)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_str)
X_test_tfidf = tfidf_vectorizer.transform(X_test_str)

#### Doc2Vec Vectorization
- This is a more sophisticated approach, captures additional meaning from document and word context
- GBM models can make use of this approach
- May not be better for smaller datasets

In [None]:
# Tag tokenized_tweets with an index for identification
tagged_train_data = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_train_data)]

# Initialize and train a Doc2Vec vectorizer
vectorizer = Doc2Vec(tagged_train_data, vector_size=50, window=2, min_count=1, workers=4, epochs=40)

In [None]:
# Infer vectors for testing set
test_vectors = np.array([vectorizer.infer_vector(doc_tokens) for doc_tokens in tokenized_test_data])

In [None]:
vectors = np.array([vectorizer.dv[i] for i in range(len(tagged_train_data))])

### Modeling
- Many potentially good models, not clear which one will work best
- Trying 5
    - Logistic Regression
    - Multinomial Naive Bayes
    - Support Vector Machines
    - Random Forest Classifier
    - LightGBM

#### Logistic Regression
- Performs well with binomial classification tasks
- Interpretable

In [None]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_tfidf, y_train)

In [None]:
y_preds_lr = logreg.predict(X_train_tfidf)
y_test_preds_lr = logreg.predict(X_test_tfidf)

In [None]:
print(classification_report(y_train, y_preds_lr))

In [None]:
print(f'LogReg Train Accuracy: {accuracy_score(y_train, y_preds_lr):.1%}')

In [None]:
print(classification_report(y_test,y_test_preds_lr))

In [None]:
print(f'LogReg Test Accuracy: {accuracy_score(y_test,y_test_preds_lr):.1%}')

#### Multinomial Naive Bayes
- Good for multinomial classification problems

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

<span style="color: red;">NOTE: I THINK THERE WERE SOME TYPOS IN THE MULTI NB SECTION, I FIXED<span>

In [None]:
y_preds_mnb = mnb.predict(X_train_tfidf)
y_preds_test_mnb = mnb.predict(X_test_tfidf)

In [None]:
print(classification_report(y_train, y_preds_mnb))

In [None]:
print(f'MultinomialNB Train Accuracy: {accuracy_score(y_train, y_preds_mnb):.1%}')

In [None]:
print(classification_report(y_test, y_preds_test_mnb))

In [None]:
print(f'MultinomialNB Test Accuracy: {accuracy_score(y_test, y_preds_test_mnb):.1%}')

#### Support Vector Machines
- Typically performs better in image or text classification task
- Not interpretable
- May not work as well on data with new features

<span style="color: red;">NOTE: I COMMENTED OUT INITIAL AND GRIDSEARCH CODE TO SPEED UP THE MODEL, WE CAN UNCOMMENT FOR THE FINAL AND LET IT RUN<span>

In [None]:
# svc = SVC(random_state=42)
# svc.fit(X_train_tfidf, y_train)

In [None]:
# y_preds_svc = svc.predict(X_train_tfidf)
# y_preds_test_svc = svc.predict(X_test_tfidf)

In [None]:
# print(classification_report(y_train, y_preds_svc))

In [None]:
# print(classification_report(y_test, y_preds_test_svc))

In [None]:
# params = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
#           'degree':[2,3,4],
#           'shrinking':[True,False],
#          }

In [None]:
# svc_grid = GridSearchCV(svc, param_grid=params, cv=5)

In [None]:
# %%time
# svc_grid.fit(X_train_tfidf, y_train)

In [None]:
# print(svc_grid.best_estimator_)
# print(svc_grid.best_params_)

In [None]:
svc_tuned = SVC(degree=2, kernel='poly', shrinking=True, random_state=42)
svc_tuned.fit(X_train_tfidf, y_train)

In [None]:
y_preds_svc = svc_tuned.predict(X_train_tfidf)
y_preds_test_svc = svc_tuned.predict(X_test_tfidf)

In [None]:
print(classification_report(y_train, y_preds_svc))

In [None]:
print(f'SVM Train Accuracy: {accuracy_score(y_train, y_preds_svc):.1%}')

In [None]:
print(classification_report(y_test, y_preds_test_svc))

In [None]:
print(f'SVM Test Accuracy: {accuracy_score(y_test, y_preds_test_svc):.1%}')

#### Random Forest
- Might be better for non-linear relationships

In [None]:
# rf = RandomForestClassifier()
# rf.fit(X_train_tfidf, y_train)

In [None]:
# y_preds_rf = rf.predict(X_train_tfidf)
# y_preds_test_rf = rf.predict(X_test_tfidf)

In [None]:
# print(classification_report(y_train, y_preds_rf))

In [None]:
# print(classification_report(y_test, y_preds_test_rf))

In [None]:
# rf_params = {'n_estimators':[10, 50, 100],
#              'criterion':['gini','entropy','log_loss'],
#              'max_depth':[5,10,20]
#             }

In [None]:
# rf_grid = GridSearchCV(rf, param_grid=rf_params, cv=5)

In [None]:
# rf_grid.fit(X_train_tfidf, y_train)

In [None]:
# print(rf_grid.best_estimator_)
# print(rf_grid.best_params_)

In [None]:
rf2 = RandomForestClassifier(criterion='log_loss', max_depth=20, n_estimators=100, random_state=42)
rf2.fit(X_train_tfidf, y_train)

In [None]:
y_preds_rf2 = rf2.predict(X_train_tfidf)
y_preds_test_rf2 = rf2.predict(X_test_tfidf)

In [None]:
print(classification_report(y_train, y_preds_rf2))

In [None]:
print(f'Random Forest Train Accuracy: {accuracy_score(y_train, y_preds_rf2):.1%}')

In [None]:
print(classification_report(y_test, y_preds_test_rf2))

In [None]:
print(f'Random Forest Test Accuracy: {accuracy_score(y_test, y_preds_test_rf2):.1%}')

#### LightGBM
- Performs very well on very large datasets
- Excels at detecting complex patterns

In [None]:
train_data = lgb.Dataset(vectors, label=y_train.ravel())

In [None]:
params = {'boosting_type': 'gbdt',  # Traditional Gradient Boosting Decision Tree
          'objective': 'binary',    # Binary classification
          'metric': ['binary_error'],  # Evaluation metrics
          'lambda_l1': 0.5,
          'lambda_l2': 0.5,
          'max_bin': 100,
          'num_leaves': 20,         # Number of leaves in full trees
          'learning_rate': 0.05,    # Learning rate
          'feature_fraction': 0.9,  # Fraction of features to be used at each iteration
          'bagging_fraction': 0.8,  # Fraction of data to be used for each iteration
          'bagging_freq': 5,        # Frequency for bagging
          'verbose': 1              # Verbose output in the terminal
}

In [None]:
# Train the model
num_round = 100  # Number of boosting rounds
lgb_model = lgb.train(params, train_data, num_round)

In [None]:
# Make predictions and convert to binary
y_preds_lgb = lgb_model.predict(vectors, num_iteration=lgb_model.best_iteration)
y_preds_binary = [1 if prob > 0.5 else 0 for prob in y_preds_lgb]

print(classification_report(y_train, y_preds_binary))

In [None]:
print(f'LightGBM Train Accuracy: {accuracy_score(y_train, y_preds_binary):.1%}')

In [None]:
# Make predictions on test set
y_preds_test_lgb = lgb_model.predict(test_vectors, num_iteration=lgb_model.best_iteration)
y_preds_test_binary = [1 if prob > 0.5 else 0 for prob in y_preds_test_lgb]

print(classification_report(y_test, y_preds_test_binary))

In [None]:
print(f'LightGBM Test Accuracy: {accuracy_score(y_test, y_preds_test_binary):.1%}')

### Modeling Conclusion
- LightGBM is bad
- SVM seems badly overfit
- Multinomial NB and Random Forest are okay
- Logistic Regression is best

<span style="color: red;">NOTE: EVERYTHING BELOW IS NEW<span>

### Vectorize vision pro tweets
- Use the *fitted* vectorizer to vectorize the Vision Pro tweets
- The vectorizer must be limited to only the vocabulary seen in the initial dataset
- Words not in the initial dataset will be dropped, which will limit the model's accuracy on unseen data

In [None]:
# Convert tokenized tweets back into strings for TfidfVectorizer
df_english['tweet_join'] = df_english['tweet'].apply(lambda x: ' '.join(x))

In [None]:
# Transform the new tweets with the fitted vectorizer
vp_vectored = tfidf_vectorizer.fit_transform(df_english['tweet_join'])

### Logistic Regression predictions

In [None]:
# Make predictions on new data
vp_preds = logreg.predict(vp_vectored)

In [None]:
cm = confusion_matrix(df_english['sentiment'], vp_preds)
ConfusionMatrixDisplay(cm).plot();

In [None]:
print(classification_report(df_english['sentiment'], vp_preds))

In [None]:
print(f'LogReg Accuracy vs VADER: {accuracy_score(df_english["sentiment"], vp_preds):.1%}')