# 🚀👨‍🚀🔴 **Martian Brand Audit Pipeline [LogReg Version]**  
*For Intergalactic Reputational Assessment*

This pipeline was developed by the **Intergalactic Branding Board (IBB)** on behalf of the **Martian Alliance for Population Growth (MAPG)** to evaluate Earthlings' emotional response to *The Subject* — Elon Musk.

The Subject has served as MAPG’s primary brand ambassador since the early 2000s, inspiring the dream of interplanetary expansion and human settlement on Mars. However, his Earth-based public image has become increasingly polarized, making it critical to assess whether he still evokes trust and admiration, or now sparks mockery or fear, among potential Martian recruits.

To maintain intergalactic budget discipline and remain on schedule, IBB assigned the mission to a junior Earthling data scientist — operating under significant time constraints and with limited computational credits.

Given these constraints, the team opted for a lightweight yet interpretable machine learning model: **Logistic Regression**. While less complex than sentiment engines like BERT, this model offers clarity and agility — qualities well-suited for this high-frequency Martian audit.

Despite its simplicity, the system processed a representative sample of labeled tweets, extracted emotional signals at a baseline level (about 48% accuracy and a macro F1 of 0.36 on the held-out test set), and scaled predictions to the 238,427 tweets that passed filtering, out of a corpus of **500,000 real Earthling tweets**.

The result: a **baseline emotional footprint** of the Subject’s Earth reputation — just enough to illuminate the prevailing sentiment frequencies across the Terran social sphere.

📦 Full dataset the project is based on: 500,000 Elon Musk Tweets on Kaggle (JSON file): [https://www.kaggle.com/datasets/clementdelteil/500-000-tweets-on-elon-musk-nov-dec-2022](https://www.kaggle.com/datasets/clementdelteil/500-000-tweets-on-elon-musk-nov-dec-2022)


# 0. Import Libraries & Setup

This section loads all required Python libraries for data handling, text processing, model training, and visualization. The notebook is compatible with Google Colab, SageMaker Studio, and local environments.

In [7]:
# 📦 Import all libraries and setup

# ✅ Install external libraries (only needed once per session in Colab)
!pip install emoji

# 🧮 Data Handling
import pandas as pd
import numpy as np
import random

# 🧹 Text Preprocessing
import re
import string
import emoji
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# 🔤 Text Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# 🤖 Model Training & Evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score

# 📊 Visualization (optional but useful)
import matplotlib.pyplot as plt
import seaborn as sns

# ⚙️ NLTK Resource Downloads
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# 🔐 Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


# 1. Sample Extraction & Manual Labeling

To enable supervised sentiment classification, we begin by extracting a high-quality sample of 500 tweets from the full dataset.
These tweets are preserved in raw format, emojis included, so that emotional signals and tone of voice remain visible during labeling.

Before sampling, we apply semantic filters to ensure the tweets are relevant, directed toward Elon Musk, and not part of noisy tagging chains or reply spam.
This helps improve the quality of both labeling and upcoming model training.
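
As a minimal sketch of these filters on plain strings (the example tweets and the helper name `passes_prefilter` are made up for illustration), the three checks can be expressed as one predicate. Counting `@` characters caps the number of mentions but does not by itself confirm that `@elonmusk` is among them; the optional `require_handle` flag is an assumption showing one way to make that explicit.

```python
import re

MENTION_RE = re.compile(r"@\w+")

def passes_prefilter(text, require_handle=False):
    """Return True if a raw tweet survives the three pre-sampling filters."""
    if len(text.split()) < 5:          # 1. at least 5 whitespace-separated words
        return False
    if text.count("@") > 2:            # 2. at most two "@" characters
        return False
    if "elon" not in text.lower():     # 3. must reference "elon" (any case)
        return False
    if require_handle:
        # Optional stricter check (an assumption, not part of the notebook):
        # require the @elonmusk handle itself among the mentions.
        mentions = [m.lower() for m in MENTION_RE.findall(text)]
        return "@elonmusk" in mentions
    return True

# Hypothetical example tweets
examples = [
    "@elonmusk Good luck.",                                   # too short
    "@a @b @c @elonmusk this thread is getting very long",    # too many mentions
    "@FoxNews Elon Musk voices support for a rival",          # passes, but no @elonmusk
    "@elonmusk @neiltyson this tweet is some deep insight",   # passes both variants
]
results = [passes_prefilter(t) for t in examples]
strict = [passes_prefilter(t, require_handle=True) for t in examples]
print(results)
print(strict)
```

The third example shows why the stricter variant might matter: a tweet addressed to another account that merely names Elon still passes the default filters.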

Once the sample is collected, it is labeled with a hybrid annotation method:

1. Initial labels are suggested by ChatGPT.

2. The suggestions are then reviewed and refined manually to ensure emotional accuracy and contextual integrity.

This human-in-the-loop process ensures the sentiment labels reflect nuanced, real-world interpretation aligned with Martian values.

📂 **Note:**  
This notebook was originally developed in Google Colab. If you're running this in a different environment (e.g. SageMaker), make sure to update the file path accordingly.
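
One way to avoid editing hard-coded paths per environment is to resolve the data directory once, at the top of the notebook. The sketch below is only an illustration: the environment variable `ELON_DATA_DIR` is a hypothetical name, and the fallback to the current directory is an assumption. The two file names are the ones used in this notebook.

```python
import os
from pathlib import Path

# Hypothetical environment variable; falls back to the current directory.
DATA_DIR = Path(os.environ.get("ELON_DATA_DIR", "."))

RAW_TWEETS_PATH = DATA_DIR / "elonmusk_tweets_raw.json"
LABELED_PATH = DATA_DIR / "elon_hybrid_labeled_data.csv"

def describe_paths():
    """Report which of the expected input files are present."""
    return {str(p): p.exists() for p in (RAW_TWEETS_PATH, LABELED_PATH)}

print(describe_paths())
```

In Colab, `ELON_DATA_DIR` could point at the mounted Drive folder; in SageMaker or locally, at wherever the files were copied.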

In [9]:
# SECTION 1.1: Load the Full JSON Dataset

import json

# ⚠️ Adjust this path if needed in a different environment (e.g. SageMaker)
file_path = "/content/drive/MyDrive/2025, Apr-Jul: AI BOOT CAMP/Projects/Project #3 [Module 3]/elonmusk_tweets_raw.json"

# Load the JSON content
with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)

# Inspect available columns
df.columns

Index(['id', 'text'], dtype='object')

In [None]:
# 1.2 Preview raw tweet data

df[['id', 'text']].head(10)

| | id | text |
|---|---|---|
| 0 | 1596647314030231552 | @DonutOperator @elonmusk @stillgray It's fiery... |
| 1 | 1596647313887346689 | @SenMarkey @elonmusk Anti-freedom is anti-Amer... |
| 2 | 1596647313719853056 | @FoxNews Elon Musk voices support for Trump ri... |
| 3 | 1596647313346215941 | @elonmusk @CollinRugg Having meetings about me... |
| 4 | 1596647312746754048 | @GregA06555436 @elonmusk @TimRunsHisMouth Yes!... |
| 5 | 1596647312088076291 | @elonmusk @Liz_Wheeler Good luck. |
| 6 | 1596647311597502464 | @Molly85224872 @RealSaavedra @elonmusk @AGHami... |
| 7 | 1596647309244133377 | @RelaxedRelic @atdavidhoffman @cheryela0114 @e... |
| 8 | 1596647309089153030 | @AllenFarer @_Twittmonger @Bucue4 @DogmaticTow... |
| 9 | 1596647308497600512 | @RealJamesWoods @elonmusk I left during the da... |


In [10]:
# 1.3: Pre-Filter Raw Tweets Before Sampling

# 1. Remove short tweets (fewer than 5 words)
df_sample_base = df[df['text'].str.split().apply(len) >= 5]

# 2. Keep tweets with at most two "@" characters (typically @elonmusk plus one other mention);
#    this caps the mention count but does not verify that @elonmusk is among them
df_sample_base = df_sample_base[df_sample_base['text'].str.count("@") <= 2]

# 3. Ensure tweets reference "elon" for relevance
df_sample_base = df_sample_base[df_sample_base['text'].str.contains("elon", case=False)]

# 4. Randomly sample 500 tweets for manual annotation
df_sample = df_sample_base.sample(n=500, random_state=42).reset_index(drop=True)

# Preview sampled tweets
df_sample.head(10)

| | id | text |
|---|---|---|
| 0 | 1598401435775143937 | @Realnoni4Real @elonmusk this is a fake accoun... |
| 1 | 1605087968888242176 | @elonmusk @ShellenbergerMD Invariably, the tax... |
| 2 | 1600134121228861440 | @muskQu0tes @elonmusk I was a stay at home mom... |
| 3 | 1602258083597717506 | @StationCDRKelly @elonmusk We're not going to ... |
| 4 | 1600121539088723968 | @GrahamAllen_1 @elonmusk Yes my Facebook accou... |
| 5 | 1605485612756148225 | @neiltyson @elonmusk This tweet is some deep i... |
| 6 | 1604960919632728070 | @elonmusk Are you taking yourself for somethin... |
| 7 | 1603729043265830912 | @yousuck2020 @elonmusk Four powerful natural f... |
| 8 | 1605014990456594432 | @TwitterBusiness @elonmusk Just like you wrote... |
| 9 | 1603181534944677890 | @MayraFlores2022 Amnesty for all that walk acr... |


In [11]:
# 1.4: Save Sample to CSV for Hybrid Labeling

# Export the sampled tweets to CSV for manual annotation (via ChatGPT + review)
df_sample.to_csv("elon_sample_for_hybrid_labeling.csv", index=False)


**1.5 Sentiment categories & labeling instructions**

The following labels were used during annotation, based on Martian values (Admiration, Trust, Mockery, and Fear), emotional tone and intent:

| Label        | Description                                                                 |
|--------------|----------------------------------------------------------------------------|
| `admiration` | Expressions of awe, love, or deep respect                                 |
| `trust`      | Signals of belief, support, or confidence                                 |
| `mockery`    | Sarcasm, ridicule, or laughter at the subject’s expense                   |
| `fear`       | Worry, anxiety, or concern about Elon or his actions                      |
| `irrelevant` | Tweets not expressing sentiment *about* Elon (e.g. spam, generic replies) |


Tweets marked as "irrelevant" will be excluded from model training.
All remaining labels will be used to train the supervised classifier (via Logistic Regression) in the next steps.
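
Hand-edited label files can contain stray values (inconsistent casing, extra whitespace, or free-text notes), so it can help to validate labels against the allowed set before training. The sketch below is one possible approach, not the notebook's method: the helper names and the `LABEL_FIXES` mapping are illustrative assumptions.

```python
ALLOWED_LABELS = {"admiration", "trust", "mockery", "fear", "irrelevant"}

# Hypothetical mapping of known stray annotations to canonical labels.
LABEL_FIXES = {"hinting admiration → trust": "trust"}

def normalize_label(raw):
    """Lowercase and strip a label, apply known fixes, return None if unrecognized."""
    if raw is None:
        return None
    label = str(raw).strip().lower()
    label = LABEL_FIXES.get(label, label)
    return label if label in ALLOWED_LABELS else None

def split_for_training(labels):
    """Partition raw labels into usable training labels and rejected values."""
    kept, rejected = [], []
    for raw in labels:
        label = normalize_label(raw)
        if label is None:
            rejected.append(raw)        # unknown or missing: needs manual review
        elif label != "irrelevant":
            kept.append(label)          # 'irrelevant' is dropped silently
    return kept, rejected

kept, rejected = split_for_training(
    ["Mockery", " trust ", "irrelevant", "Hinting admiration → trust", "joy", None]
)
print(kept)
print(rejected)
```

Collecting rejected values in a list, rather than dropping them, makes it easy to spot annotation slips before they reach the classifier as an extra class.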

In [18]:
# 1.6: Load Hybrid-Labeled Dataset for Model Training

# Load manually annotated and cleaned dataset
labeled_file_path = "/content/drive/MyDrive/2025, Apr-Jul: AI BOOT CAMP/Projects/Project #3 [Module 3]/elon_hybrid_labeled_data.csv"
df_labeled = pd.read_csv(labeled_file_path)

# Drop rows with null sentiment or irrelevant labels
df_clean = df_labeled[
    (df_labeled['sentiment'].notnull()) &
    (df_labeled['sentiment'] != 'irrelevant')
].copy()

# Save filtered version for modeling
df_clean.to_csv("elon_tweets_labeled_filtered.csv", index=False)

# Preview sample rows
df_clean.head(100)

# Check label distribution
print("Label distribution:\n")
print(df_clean['sentiment'].value_counts())


Label distribution:

sentiment
mockery                       195
trust                          95
fear                           75
admiration                     37
Hinting admiration → trust      1
Name: count, dtype: int64


In [19]:
df_clean['sentiment'] = df_clean['sentiment'].replace('Hinting admiration → trust', 'trust')

# Check label distribution
print("Label distribution:\n")
print(df_clean['sentiment'].value_counts())

Label distribution:

sentiment
mockery       195
trust          96
fear           75
admiration     37
Name: count, dtype: int64


# 2. General Pre-Processing (applied to full dataset)

This step standardizes the raw tweet data for both sentiment and topic modeling. It includes removing irrelevant characters, links, and noise, converting emojis to text tokens, and filtering out very short tweets.

By cleaning the text, we ensure better performance in both modeling tracks and create consistency across the dataset.
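
To make the effect of the noise-removal step concrete, here is a small standalone sketch of the same regular expressions applied to one made-up tweet. It assumes the text has already been lowercased and that emojis were already converted to `:name:` tokens, so it does not depend on the `emoji` package. The link pattern is shortened to `http\S+|www\S+`, which also matches `https` links.

```python
import re

def clean_noise(text):
    """Apply the noise-removal regexes to already-lowercased, demojized text."""
    text = re.sub(r"http\S+|www\S+", "", text)   # remove links
    text = re.sub(r"@\w+|#\w+", "", text)        # remove mentions and hashtags
    text = re.sub(r"[^\w\s:]", "", text)         # drop punctuation, keep colons
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

# Hypothetical example tweet
raw = "@elonmusk it's a wonderful weekend! :grinning_face: https://t.co/xyz #mars"
cleaned = clean_noise(raw.lower())
print(cleaned)   # its a wonderful weekend :grinning_face:

# Side effect worth knowing: punctuation inside numbers is removed too.
print(clean_noise("omnibus includes $2.5 million"))   # omnibus includes 25 million
```

Keeping colons preserves the emoji tokens, while stripping all other punctuation also rewrites contractions ("it's" becomes "its") and decimals ("$2.5" becomes "25").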

In [30]:
import re
import emoji
import json

# Load raw tweet data (adjust path if needed)
file_path = "/content/drive/MyDrive/2025, Apr-Jul: AI BOOT CAMP/Projects/Project #3 [Module 3]/elonmusk_tweets_raw.json"
with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)

# 2.1 Lowercasing
df['clean_text'] = df['text'].str.lower()

# 2.2 Convert emojis to text (e.g., 😨 → :fearful_face:)
df['clean_text'] = df['clean_text'].apply(lambda x: emoji.demojize(x, language='en', delimiters=(" :", ":")))

# 2.3 Remove noise: links, mentions, hashtags, punctuation (but keep colons), extra whitespace
def clean_noise(text):
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)           # remove links
    text = re.sub(r"@\w+|#\w+", "", text)                         # remove mentions and hashtags
    text = re.sub(r"[^\w\s:]", "", text)                          # remove special chars/punct, keep colons
    text = re.sub(r"\s+", " ", text).strip()                      # remove extra spaces
    return text

df['clean_text'] = df['clean_text'].apply(clean_noise)

# 2.4 Filter: remove short tweets (<5 words)
df = df[df['clean_text'].str.split().apply(len) >= 5]

# 2.5 Semantic Filtering
df = df[df['text'].str.count("@") <= 2]                           # keep tweets with at most two "@" characters
df = df[df['text'].str.contains("elon", case=False)]             # keep tweets that mention "elon" (any case)

# 2.6 Output size and preview
print(f"Number of tweets after filtering: {len(df)}")
df[['text', 'clean_text']].head()

# 2.7 Save for modeling
df.to_csv("elon_preprocessed.csv", index=False)

# 2.8 Re-load for next steps (ensures filepath consistency)
df = pd.read_csv("elon_preprocessed.csv")
df.head()

Number of tweets after filtering: 238427


| | id | text | clean_text |
|---|---|---|---|
| 0 | 1596647313719853056 | @FoxNews Elon Musk voices support for Trump ri... | elon musk voices support for trump rival ron d... |
| 1 | 1596647313346215941 | @elonmusk @CollinRugg Having meetings about me... | having meetings about meetings and communicati... |
| 2 | 1596647308497600512 | @RealJamesWoods @elonmusk I left during the da... | i left during the dark days glad youre still here |
| 3 | 1596684036826857472 | @littlemissjacob @elonmusk Have a wonderful we... | have a wonderful weekend :grinning_face: from ... |
| 4 | 1596684035241443328 | @Liz_Wheeler @elonmusk The problem is not the ... | the problem is not the phone its the ecosystem... |


# 3. NLP Pre-Processing (for labeled sample)

This section prepares the labeled sample from Section 1.6 for model training. It covers tokenization, stopword removal, lemmatization, and TF-IDF vectorization.

This transforms human-readable tweets into machine-readable vectors while preserving key semantic signals.
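
For intuition about what the vectorization produces, the sketch below computes TF-IDF by hand in plain Python for three made-up tweets. It follows the weighting that scikit-learn's `TfidfVectorizer` uses by default: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. Unlike scikit-learn's default tokenizer, this toy version keeps single-character tokens such as "a", so the numbers are illustrative rather than identical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalized TF-IDF vectors with smoothed IDF for tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many docs each token appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

# Hypothetical, already-tokenized tweets
docs = [
    "elon is a genius".split(),
    "elon is a clown".split(),
    "scared of what elon does next".split(),
]
vocab, vectors = tfidf(docs)
for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term:<8} {weight:.3f}")
```

In the first tweet, "genius" receives a higher weight than "elon" because "elon" appears in every document and so carries little discriminating information.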

In [40]:
import re
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# 3.1 Load pre-labeled & cleaned dataset
df_clean = pd.read_csv("/content/elon_tweets_labeled_filtered.csv")

# Fix stray labels if needed
df_clean['sentiment'] = df_clean['sentiment'].replace({
    'Hinting admiration → trust': 'trust'
})

# 3.2 Tokenization
def tokenize(text):
    text = re.sub(r"[^\w\s]", "", str(text))  # remove punctuation
    return text.split()

df_clean['tokens'] = df_clean['text'].apply(tokenize)

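# 3.3 Stopword removal (small custom list)
# Note: tokens are not lowercased at this point, so capitalized stopwords
# such as "The" pass through; TfidfVectorizer lowercases them later.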
stop_words = set([
    'the', 'and', 'is', 'in', 'it', 'of', 'to', 'a', 'for', 'on', 'with',
    'this', 'that', 'at', 'as', 'an', 'are', 'was', 'but', 'by', 'be', 'from'
])

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

df_clean['tokens'] = df_clean['tokens'].apply(remove_stopwords)

# 3.4 Lemmatization
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

df_clean['tokens'] = df_clean['tokens'].apply(lemmatize_tokens)

# 3.5 Join tokens for vectorization
df_clean['processed_text'] = df_clean['tokens'].apply(lambda x: " ".join(x))

# 3.6 TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_clean['processed_text'])

# 3.7 Prepare labels
y = df_clean['sentiment']

# 3.8 Token stats (for TF-IDF tuning)
df_clean['token_count'] = df_clean['tokens'].apply(len)
print(f"Average tokens per tweet: {df_clean['token_count'].mean():.2f}")
print(f"Max tokens in a tweet: {df_clean['token_count'].max()}")

# 3.9 Preview
print("TF-IDF matrix shape:", X.shape)
print("Label distribution:\n", y.value_counts())

Average tokens per tweet: 16.01
Max tokens in a tweet: 46
TF-IDF matrix shape: (403, 2336)
Label distribution:
 sentiment
mockery       195
trust          96
fear           75
admiration     37
Name: count, dtype: int64


# 4. Model Pipeline: Logistic Regression

A classic, interpretable machine learning model is trained on the labeled sample. We split the data, train the classifier, evaluate its performance, and then use it to predict sentiment for the 238,427 tweets that remain after filtering the full 500k dataset.

In [46]:
# Train & Evaluate Logistic Regression

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score

# 4.1 Train/Test Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])

# 4.2 Train Logistic Regression (balanced to handle label skew)
clf = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

# 4.3 Evaluate on Test Set
y_pred = clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("F1 Score (macro):", f1_score(y_test, y_pred, average='macro'))

Train size: 322
Test size: 81
Classification Report:
               precision    recall  f1-score   support

  admiration       0.20      0.12      0.15         8
        fear       0.25      0.13      0.17        15
     mockery       0.55      0.67      0.60        39
       trust       0.48      0.53      0.50        19

    accuracy                           0.48        81
   macro avg       0.37      0.36      0.36        81
weighted avg       0.44      0.48      0.46        81

Accuracy Score: 0.48148148148148145
F1 Score (macro): 0.3581025900287781
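
The `class_weight='balanced'` option used in the training cell reweights each class by `n_samples / (n_classes * count)`, so rare classes contribute more to the loss. As a rough worked example, the sketch below applies that formula to the label counts of the full filtered sample. The classifier itself only sees the training split, so its actual weights differ slightly.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Label counts from the filtered sample (403 tweets in total).
labels = ["mockery"] * 195 + ["trust"] * 96 + ["fear"] * 75 + ["admiration"] * 37
weights = balanced_class_weights(labels)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<11} {w:.2f}")
```

Under these counts an `admiration` example weighs roughly five times as much as a `mockery` example, which offsets the skew but cannot compensate for having only 37 examples to learn from.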


In [48]:
from sklearn.metrics import classification_report
import pandas as pd

# Get detailed classification report as dict
report = classification_report(y_test, y_pred, output_dict=True)

# Convert to DataFrame for nicer display
report_df = pd.DataFrame(report).transpose()

# Keep only the actual classes (remove avg totals)
class_metrics = report_df[~report_df.index.str.contains("avg|accuracy")]

# Round and preview
class_metrics_rounded = class_metrics[['precision', 'recall', 'f1-score']].round(2)
class_metrics_rounded

| | precision | recall | f1-score |
|---|---|---|---|
| admiration | 0.2 | 0.12 | 0.15 |
| fear | 0.25 | 0.13 | 0.17 |
| mockery | 0.55 | 0.67 | 0.6 |
| trust | 0.48 | 0.53 | 0.5 |
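
To put these numbers in context, it helps to compare them with a trivial baseline. The sketch below uses only the support, recall, and F1 values printed in the classification report. It recovers the number of correct predictions per class and compares the model's accuracy with a classifier that always predicts the most frequent class.

```python
# Per-class test support, recall, and F1 copied from the classification report.
support = {"admiration": 8, "fear": 15, "mockery": 39, "trust": 19}
recall = {"admiration": 0.12, "fear": 0.13, "mockery": 0.67, "trust": 0.53}
f1 = {"admiration": 0.15, "fear": 0.17, "mockery": 0.60, "trust": 0.50}

n_test = sum(support.values())

# Correct predictions per class, recovered from the rounded recall values.
correct = {c: round(recall[c] * support[c]) for c in support}
model_accuracy = sum(correct.values()) / n_test

# A trivial classifier that always predicts the most frequent class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / n_test

# Macro F1 from the rounded per-class values (the report computes it unrounded).
macro_f1 = sum(f1.values()) / len(f1)

print(f"model accuracy:    {model_accuracy:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f} (always '{majority_class}')")
print(f"macro F1:          {macro_f1:.3f}")
```

On this test split the model gets 39 of 81 tweets right, which is the same accuracy as always predicting `mockery`. The macro F1 is the more informative figure here, since the always-`mockery` baseline would score zero F1 on the other three classes.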


In [47]:
# 4.4 Predict Sentiment on the Filtered Tweet Dataset (238,427 of the 500k tweets)

# Load full preprocessed dataset
df_full = pd.read_csv("elon_preprocessed.csv")

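# Vectorize using trained TF-IDF (do not fit again!)
# Note: clean_text did not go through the Section 3 steps (stopword removal,
# lemmatization) that were applied to the training text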
X_full = vectorizer.transform(df_full['clean_text'])

# Predict sentiment
df_full['sentiment_pred'] = clf.predict(X_full)

# Summary
print("Predicted Sentiment Distribution (Full Dataset):")
print(df_full['sentiment_pred'].value_counts())

# Preview sample predictions
df_full[['text', 'clean_text', 'sentiment_pred']].sample(5, random_state=42)

Predicted Sentiment Distribution (Full Dataset):
sentiment_pred
mockery       73424
admiration    65672
trust         50664
fear          48667
Name: count, dtype: int64


| | text | clean_text | sentiment_pred |
|---|---|---|---|
| 78925 | @EddieZipperer @elonmusk Crackpot. You are del... | crackpot you are delusional and part of the pr... | admiration |
| 39204 | @elonmusk @FoxNews Trump needs to understand t... | trump needs to understand this | trust |
| 10390 | @elonmusk @kanyewest Drake Meme, but for Elon ... | drake meme but for elon and twitter free speec... | trust |
| 228364 | @elonmusk Omnibus includes $2.5 million for re... | omnibus includes 25 million for residential se... | fear |
| 48131 | @elonmusk @FoxNews Trump never said that. Can ... | trump never said that can you just listen his ... | admiration |
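
One thing to keep in mind when reading these predictions: the classifier was fitted on tokens derived from the raw `text` column (where a mention such as `@elonmusk` survives as the token `elonmusk`), while the full dataset is vectorized from `clean_text`, where mentions were removed. A possible way to keep both sides aligned is to put a single cleaning function inside a scikit-learn `Pipeline`. The sketch below is an assumption about how that could look, using made-up toy tweets rather than the labeled sample.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def shared_preprocess(text):
    """One cleaning function used for both training and inference (illustrative)."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # links
    text = re.sub(r"@\w+|#\w+", "", text)        # mentions and hashtags
    text = re.sub(r"[^\w\s:]", "", text)         # punctuation except colons
    return re.sub(r"\s+", " ", text).strip()

# Toy, made-up examples standing in for the labeled sample.
train_texts = [
    "@elonmusk you are a genius, thank you!",
    "@elonmusk I trust you to fix this platform",
    "@elonmusk lol what a clown show",
    "@elonmusk this is scary, I am worried",
]
train_labels = ["admiration", "trust", "mockery", "fear"]

model = Pipeline([
    # Passing `preprocessor` replaces the vectorizer's built-in lowercasing step,
    # which is why shared_preprocess lowercases the text itself.
    ("tfidf", TfidfVectorizer(preprocessor=shared_preprocess)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)),
])
model.fit(train_texts, train_labels)

# Raw tweets go straight in; the pipeline applies the same cleaning as in training.
prediction = model.predict(["@someone @elonmusk what a clown show"])[0]
print(prediction)
```

Because the cleaning lives inside the pipeline, the same object can be reused for batch prediction and for an interactive demo without re-implementing the preprocessing in each place.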


# 5. Gradio Dashboard Prototype

In [50]:
# 🐾 Install Gradio if not already installed
!pip install gradio --quiet

import gradio as gr

# ✅ Your prediction function using trained model + vectorizer
def predict_sentiment(text):
    vectorized = vectorizer.transform([text])
    prediction = clf.predict(vectorized)[0]
    return f"🪐 Predicted Sentiment: {prediction}"

# 🎛️ Create the Gradio interface
interface = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(lines=3, placeholder="Type a tweet about Elon Musk..."),
    outputs="text",
    title="🚀 Martian Sentiment Scanner",
    description="Enter a tweet about Elon Musk to see how it's perceived by the Intergalactic Branding Board (IBB)."
)

# 🚀 Launch it!
interface.launch(share=True)

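# Optional debug variant that also returns class probabilities
# (defined for manual testing; not wired into the interface above)
def predict_sentiment_debug(text):
    vectorized = vectorizer.transform([text])
    prediction = clf.predict(vectorized)[0]
    probs = clf.predict_proba(vectorized)[0]
    labels = clf.classes_
    # Cast to plain floats so the dict prints cleanly
    probs_dict = {label: round(float(prob), 3) for label, prob in zip(labels, probs)}
    return f"🪐 Prediction: {prediction}\n\n📊 Probabilities: {probs_dict}"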



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://8e65e81c866778d158.gradio.live

