**Mini Project**
# Airline Tweet Sentiment Classifier using Natural Language Processing
-----------------------

**Notes:**
- Use the sample dataset: airline_tweets_sample.csv
- Steps:
1. Import libraries
2. Load and explore dataset
3. Clean and preprocess the text
4. Convert text to numerical vectors (TF-IDF)
5. Split into train and test sets
6. Train a logistic regression model
7. Evaluate accuracy and classification report
8. Predict sentiment for new example tweets

Upload by the morning of 18 Nov 2025.



1. Import libraries

In [None]:
import re, nltk, string
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

2. Load and explore dataset

In [None]:
df = pd.read_csv('/content/sample_data/airline_tweets_sample.csv')
print("Shape of data frame", df.shape)
df
#df.head(5)

Shape of data frame (30, 2)


Unnamed: 0,text,sentiment
0,"@United flight was delayed for 3 hours, worst ...",negative
1,"Loved the service on @Delta, crew was super fr...",positive
2,"@AmericanAir lost my luggage again, so disappo...",negative
3,Smooth boarding and on-time arrival. Great job...,positive
4,The seats were uncomfortable but staff was polite,neutral
5,"@JetBlue flight attendants were rude, not flyi...",negative
6,"Got a free upgrade to business class, thank yo...",positive
7,"Average flight, nothing special to mention",neutral
8,@DeltaAirLines provided excellent support with...,positive
9,The in-flight entertainment was not working,negative
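Beyond the shape, a quick look at class balance, missing values, and tweet length is a cheap way to explore the data before modelling. The sketch below is only illustrative: `sample` is a stand-in frame built from three complete rows of the preview above, not the full 30-row dataset.

```python
import pandas as pd

# Stand-in frame (assumption): three complete rows copied from the preview.
# In the notebook, the same calls would run on the full `df`.
sample = pd.DataFrame({
    "text": [
        "The seats were uncomfortable but staff was polite",
        "Average flight, nothing special to mention",
        "The in-flight entertainment was not working",
    ],
    "sentiment": ["neutral", "neutral", "negative"],
})

# Class balance: how many tweets carry each label.
label_counts = sample["sentiment"].value_counts()
print(label_counts)

# Missing values per column.
missing = sample.isna().sum()
print(missing)

# Tweet length in characters (min, max, mean, ...).
lengths = sample["text"].str.len()
print(lengths.describe())
```

With only 30 rows, an uneven label distribution directly affects how many examples of each class end up in the test split.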


3. Clean and preprocess the text

In [None]:
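def clean_text(text):
  """Lowercase a tweet and strip URLs, punctuation, and non-ASCII characters."""
  text = str(text)
  text = re.sub(r'http\S+', '', text)  # remove URLs
  text = text.translate(str.maketrans('', '', string.punctuation))  # delete punctuation (also drops the @ in mentions)
  text = re.sub(r'[^\x00-\x7F]', '', text)  # remove non-ASCII characters such as emoji
  text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace last, so earlier removals leave no double spaces
  text = text.lower()
  return text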

In [None]:
# Add a new column with the cleaned text
df['cleaned_text']=df['text'].apply(clean_text)
df.head()

Unnamed: 0,text,sentiment,cleaned_text
0,"@United flight was delayed for 3 hours, worst ...",negative,united flight was delayed for 3 hours worst ex...
1,"Loved the service on @Delta, crew was super fr...",positive,loved the service on delta crew was super frie...
2,"@AmericanAir lost my luggage again, so disappo...",negative,americanair lost my luggage again so disappointed
3,Smooth boarding and on-time arrival. Great job...,positive,smooth boarding and ontime arrival great job s...
4,The seats were uncomfortable but staff was polite,neutral,the seats were uncomfortable but staff was polite
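Row 3 shows a side effect of deleting punctuation outright: "on-time" is glued into the single token "ontime". One possible alternative, sketched below, maps each punctuation character to a space instead. `clean_text_spaced` is a hypothetical name for this variant and is not part of the notebook.

```python
import re
import string

def clean_text_spaced(text):
    """Hypothetical variant of clean_text: punctuation becomes a space."""
    text = str(text)
    text = re.sub(r'http\S+', ' ', text)        # drop URLs
    text = re.sub(r'[^\x00-\x7F]', ' ', text)   # drop non-ASCII characters
    # Map every punctuation character to a space instead of deleting it,
    # so "on-time" becomes "on time" rather than "ontime".
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
    return text.lower()

print(clean_text_spaced("Smooth boarding and on-time arrival."))
```

The trade-off is that contractions split as well ("can't" becomes "can t"), so which variant works better depends on the data.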


4. Convert text to numerical vectors (TF-IDF)
5. Split into train and test sets

In [None]:
le = LabelEncoder()
df['sentiment']=le.fit_transform(df['sentiment'])
X = df['cleaned_text']
y = df['sentiment']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

6. Train a logistic regression model

In [None]:
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

7. Evaluate accuracy and classification report

In [None]:
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.5
Classification Report:
              precision    recall  f1-score   support

           0       0.25      1.00      0.40         1
           1       0.00      0.00      0.00         1
           2       1.00      0.50      0.67         4

    accuracy                           0.50         6
   macro avg       0.42      0.50      0.36         6
weighted avg       0.71      0.50      0.51         6



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
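The `_warn_prf` output above most likely comes from scikit-learn's undefined-metric warning: class 1 is never predicted, so its precision has a zero denominator. The sketch below shows how `zero_division` and the already imported `confusion_matrix` make this visible. The two label arrays are illustrative values chosen to be consistent with the report above. They are not the notebook's actual `y_test` and `y_pred`. The class names assume that `LabelEncoder` saw exactly the three labels shown in the preview, which it orders alphabetically.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative arrays (assumption): consistent with the reported
# precision/recall/support, but not taken from the actual run.
y_true = [0, 1, 2, 2, 2, 2]
y_hat = [0, 0, 0, 0, 2, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
print(cm)

acc = accuracy_score(y_true, y_hat)
print("Accuracy:", acc)

# zero_division=0 reports 0.0 for the never-predicted class
# without emitting a warning; target_names labels the rows.
report = classification_report(
    y_true,
    y_hat,
    target_names=["negative", "neutral", "positive"],
    zero_division=0,
)
print(report)
```

The middle column of the confusion matrix is all zeros, which is exactly the "never predicted" case. With a six-tweet test set, each per-class score rests on one to four examples.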


8. Predict sentiment for new example tweets

In [None]:
new_tweets = ["y crew and bad food"]
new_tweets_cleaned = [clean_text(tweet) for tweet in new_tweets]
new_tweets_vectorized = tfidf.transform(new_tweets_cleaned)
predictions = model.predict(new_tweets_vectorized)
predictions_text = le.inverse_transform(predictions)
print("Predictions:", predictions_text)

Predictions: ['positive']
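`le.inverse_transform` turns the model's integer output back into a label. `LabelEncoder` assigns integers in sorted order of the labels it sees, so the mapping can be checked directly. The sketch below uses a separate encoder, `le_demo` (a hypothetical name), fitted on the three label strings from the dataset preview.

```python
from sklearn.preprocessing import LabelEncoder

# Separate demo encoder (assumption): fitted on the label strings
# visible in the dataset preview, in arbitrary order.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["negative", "positive", "neutral", "positive"])

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Decode integer predictions back to label strings.
decoded = le_demo.inverse_transform([2, 0])
print(decoded)
```

The same mapping explains the rows labelled 0, 1, and 2 in the classification report.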
