# Twitter Sentiment Analysis — Complete Project

This notebook walks through five steps:

1. Data sources (sample CSV + instructions to fetch from Twitter API)
2. Preprocessing (cleaning, tokenization, stopwords)
3. Feature extraction (TF-IDF)
4. Modeling (Logistic Regression) and evaluation
5. Saving the trained model and vectorizer

In [1]:
# Setup: imports and creating a small sample dataset
import pandas as pd
from pathlib import Path

DATA_PATH = Path('/mnt/data')
DATA_PATH.mkdir(parents=True, exist_ok=True)

sample_csv = DATA_PATH / 'twitter_large_dataset.csv'  # .csv extension matches the path read in the later cell
if not sample_csv.exists():
    sample_data = [
        {'tweet_id':1, 'text':"I love the new phone! Battery life is amazing 😍 #tech", 'sentiment':'positive'},
        {'tweet_id':2, 'text':"Terrible customer service, I'm never buying from them again.", 'sentiment':'negative'},
        {'tweet_id':3, 'text':"Looks okay, but nothing special. Expected more.", 'sentiment':'neutral'},
        {'tweet_id':4, 'text':"What a fantastic update — app works much smoother now!", 'sentiment':'positive'},
        {'tweet_id':5, 'text':"App crashes every time I open it. So frustrating.", 'sentiment':'negative'},
        {'tweet_id':6, 'text':"Not bad for the price. Decent features.", 'sentiment':'neutral'},
    ]
    df = pd.DataFrame(sample_data)
    df.to_csv(sample_csv, index=False)
else:
    df = pd.read_csv(sample_csv)

print(f"Sample dataset saved to: {sample_csv}")
df.head()

Sample dataset saved to: \mnt\data\twitter_large_dataset


Unnamed: 0,tweet_id,text,sentiment
0,1,I love the new phone! Battery life is amazing ...,positive
1,2,"Terrible customer service, I'm never buying fr...",negative
2,3,"Looks okay, but nothing special. Expected more.",neutral
3,4,What a fantastic update — app works much smoot...,positive
4,5,App crashes every time I open it. So frustrating.,negative


## Fetch tweets (optional)

If you want to use real Twitter data, follow these steps:

1. Create a Twitter Developer account and create a Project & App to get API keys (Bearer token for v2 endpoints).
2. Install `tweepy` or use `requests` to call Twitter API v2 search endpoints. Example code (fill in `BEARER_TOKEN`) below.


In [2]:
# Example: fetch tweets using Twitter API v2 (BEARER_TOKEN required)
# This is a template wrapped in a triple-quoted string, so it does not run as-is.
# Fill in BEARER_TOKEN and remove the surrounding ''' lines to run it.
'''
import requests
import pandas as pd
BEARER_TOKEN = '<YOUR_BEARER_TOKEN_HERE>'  # replace with your token
query = 'climate change -is:retweet lang:en'  # example query
url = 'https://api.twitter.com/2/tweets/search/recent'
params = {'query': query, 'max_results': 100, 'tweet.fields':'id,text,created_at,lang'}
headers = {'Authorization': f'Bearer {BEARER_TOKEN}'}
resp = requests.get(url, headers=headers, params=params, timeout=30)  # requests has no default timeout
if resp.status_code == 200:
    data = resp.json()
    tweets = [{'tweet_id':t['id'], 'text':t['text']} for t in data.get('data', [])]
    pd.DataFrame(tweets).to_csv('/mnt/data/twitter_fetched.csv', index=False)
else:
    print('Request failed', resp.status_code, resp.text)
'''
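
The recent-search endpoint returns at most 100 tweets per request. Further results are reached by sending the `next_token` value from the response's `meta` object back as a query parameter. A minimal sketch of that loop follows. The helper name `collect_tweets` and the `fetch_page` callable are hypothetical; `fetch_page` stands in for the `requests.get` call in the template so the loop can be exercised without a token.

```python
from typing import Callable, Dict, List, Optional

def collect_tweets(fetch_page: Callable[[Optional[str]], Dict], max_pages: int = 5) -> List[Dict]:
    """Follow meta.next_token across pages and gather id/text pairs."""
    tweets: List[Dict] = []
    next_token: Optional[str] = None
    for _ in range(max_pages):
        payload = fetch_page(next_token)
        for t in payload.get('data', []):
            tweets.append({'tweet_id': t['id'], 'text': t['text']})
        # The API omits next_token on the last page.
        next_token = payload.get('meta', {}).get('next_token')
        if next_token is None:
            break
    return tweets

# Stand-in for the API: two canned pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {'data': [{'id': '1', 'text': 'first'}], 'meta': {'next_token': 'abc'}},
    'abc': {'data': [{'id': '2', 'text': 'second'}], 'meta': {}},
}

def fake_fetch(token: Optional[str]) -> Dict:
    return FAKE_PAGES[token]

fetched = collect_tweets(fake_fetch)
print(fetched)
```

With the real API, `fetch_page` would add `params['next_token'] = token` whenever `token` is not `None` and return `resp.json()`.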

In [3]:
# Preprocessing utilities (cleaning & basic tokenization)
import re
from sklearn.model_selection import train_test_split

def clean_tweet(text):
    text = str(text)
    # lower
    text = text.lower()
    # remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # remove mentions; drop the '#' symbol but keep the tag word
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#', '', text)
    # remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', ' ', text)
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# simple stopwords list (small, self-contained so no external downloads needed)
STOPWORDS = set(["a","an","the","and","or","is","it","this","that","for","of","to","in","on","with","was","are","i","me","my","you","we","they","them","but","so","not"])

def tokenize(text):
    toks = [t for t in text.split() if t and t not in STOPWORDS]
    return ' '.join(toks)

# Quick test
print(clean_tweet("I love this! Check http://example.com @user #awesome"))
print(tokenize(clean_tweet("I love this! Check http://example.com @user #awesome")))


i love this check awesome
love check awesome
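
`clean_tweet` keeps only `a-z` and whitespace, so emoji such as 😍 are dropped even though they often carry sentiment. One possible workaround, sketched below, replaces a few emoji with words before cleaning. The table `EMOJI_WORDS` and the helper `replace_emoji` are illustrative assumptions, not part of the original notebook, and a real project would need a much larger table.

```python
# Hypothetical emoji-to-word table; extend as needed.
EMOJI_WORDS = {
    "😍": "love",
    "😡": "angry",
    "👍": "good",
    "👎": "bad",
}

def replace_emoji(text: str) -> str:
    """Swap each known emoji for a space-padded word so it survives cleaning."""
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")
    return text

print(replace_emoji("Battery life is amazing 😍 #tech"))
```

To use it in the pipeline, call it first: `tokenize(clean_tweet(replace_emoji(text)))`.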


In [4]:
# Load dataset, preprocess and create train/test split
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(sample_csv)  # same file the setup cell wrote or loaded
# If you fetched real tweets into twitter_fetched.csv, you can load that instead
# df = pd.read_csv('/mnt/data/twitter_fetched.csv')

# If there is no 'sentiment' column, you will need to label data (manual or use distant supervision)
print('Original rows:', len(df))

# Clean and tokenize
df['clean_text'] = df['text'].apply(clean_tweet).apply(tokenize)

# Encode labels (positive/negative/neutral)
le = LabelEncoder()
df['label'] = le.fit_transform(df['sentiment'])

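# Train-test split (stratified). The test set needs at least as many rows as there are
# classes, so the 6-row sample dataset is too small for test_size=0.33.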
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42, stratify=df['label'])
print('Train size:', len(train_df), 'Test size:', len(test_df))
train_df.head()

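
A stratified split needs enough rows for every class to appear on both sides. With the six-row sample dataset and `test_size=0.33`, the test set would hold only two rows for three classes, which scikit-learn rejects. The sketch below approximates that size check in plain Python. It is an assumption about the preconditions, not scikit-learn's actual code, and the helper name `can_stratify` is hypothetical.

```python
import math
from collections import Counter

def can_stratify(labels, test_size):
    """Approximate the preconditions for train_test_split(..., stratify=labels)."""
    n = len(labels)
    counts = Counter(labels)
    n_test = math.ceil(test_size * n)   # test rows, rounded up
    n_train = n - n_test                # remaining rows go to training
    return (
        min(counts.values()) >= 2       # each class needs a row on both sides
        and n_test >= len(counts)       # at least one test row per class
        and n_train >= len(counts)      # at least one train row per class
    )

sample_labels = ['positive', 'negative', 'neutral'] * 2
print(can_stratify(sample_labels, 0.33))
print(can_stratify(sample_labels, 0.5))
```

If the check fails, either use a larger dataset or raise `test_size` until both sides can hold every class.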

In [None]:
# Feature extraction with TF-IDF and model training
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_train = vectorizer.fit_transform(train_df['clean_text'])
y_train = train_df['label']

X_test = vectorizer.transform(test_df['clean_text'])
y_test = test_df['label']

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print('Training complete')
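
To make the features less opaque, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It covers whitespace-split unigrams only, so it will not reproduce the bigram features configured above. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (unigrams, smoothed idf, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy = ["love phone", "hate phone", "love love camera"]
for row in tfidf_rows(toy):
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents (here `hate` and `camera`) receive a higher idf than terms shared across documents, which is why they dominate their rows.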


In [None]:
# Evaluation: accuracy, classification report, confusion matrix
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:\n', classification_report(y_test, y_pred, target_names=le.classes_))
print('\nConfusion matrix:\n', confusion_matrix(y_test, y_pred))
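
The classification report boils down to per-class counts of true positives, false positives, and false negatives. A small plain-Python sketch of that arithmetic follows, using made-up labels. The helper `per_class_metrics` is hypothetical and is not scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every label seen in y_true or y_pred."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return metrics

# Made-up labels for illustration only.
y_true_demo = ['pos', 'pos', 'neg', 'neg', 'neu']
y_pred_demo = ['pos', 'neg', 'neg', 'neg', 'pos']
for label, scores in per_class_metrics(y_true_demo, y_pred_demo).items():
    print(label, scores)
```

A class that is never predicted correctly, like `neu` here, scores zero on all three measures even when overall accuracy looks acceptable.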


In [None]:
# Save the trained model and vectorizer
import joblib
joblib.dump(clf, '/mnt/data/sentiment_clf.joblib')
joblib.dump(vectorizer, '/mnt/data/tfidf_vectorizer.joblib')
joblib.dump(le, '/mnt/data/label_encoder.joblib')
print('Saved: /mnt/data/sentiment_clf.joblib, tfidf_vectorizer.joblib, label_encoder.joblib')
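
Saved artifacts are restored with `joblib.load`. The sketch below shows the round trip with a plain dictionary standing in for `clf`, `vectorizer`, or `le`, and a temporary directory instead of `/mnt/data`, so it runs without a trained model. Both stand-ins are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib

# Stand-in object; in the notebook this would be clf, vectorizer, or le.
artifact = {'classes': ['negative', 'neutral', 'positive']}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'artifact.joblib'
    joblib.dump(artifact, str(path))    # write to disk
    restored = joblib.load(str(path))   # read it back

print(restored == artifact)
```

When loading real scikit-learn objects, use the same scikit-learn version that saved them where possible, since pickled estimators are not guaranteed to load across versions.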


In [None]:
# Inference: simple function to predict sentiment for new text

def predict_sentiment(texts):
    texts_clean = [tokenize(clean_tweet(t)) for t in texts]
    X = vectorizer.transform(texts_clean)
    preds = clf.predict(X)
    # decode all predictions in one call and return plain Python strings
    return le.inverse_transform(preds).tolist()

examples = [
    "I hate waiting in long lines. Worst experience ever.",
    "Absolutely love the camera on this phone!",
    "It's okay — could be better, could be worse."
]
print('Predictions:', list(zip(examples, predict_sentiment(examples))))


## Next steps / Improvements

- Use a larger labeled dataset (Sentiment140, SemEval datasets, or label your own scraped tweets).
- Use data augmentation or transfer learning (for example, fine-tune a transformer such as `bert-base-uncased` with the `transformers` library).
- Handle imbalanced classes with class weights, oversampling (SMOTE), or focal loss.
- Add more advanced preprocessing: emoji handling, slang normalization, negation handling.
- Deploy the model as an API with FastAPI, or build a demo dashboard with Streamlit.
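
As one illustration of the negation-handling idea above, the sketch below prefixes the tokens that follow a negator with `not_`, so that "not bad" and "bad" become different features. The negator list, the two-token scope, and the helper name `mark_negation` are all assumptions chosen for brevity. The step has to run before any stopword filter that removes negators.

```python
NEGATORS = {"not", "no", "never"}  # hypothetical, deliberately small list

def mark_negation(tokens, scope=2):
    """Prefix up to `scope` tokens after a negator with 'not_' and drop the negator."""
    out = []
    remaining = 0
    for tok in tokens:
        if tok in NEGATORS:
            remaining = scope          # open a new negation window
        elif remaining > 0:
            out.append("not_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negation("not bad for the price".split()))
```

A fixed window is crude, since it also marks function words such as "for". A more careful version might end the scope at punctuation or skip stopwords.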
