<a href="https://colab.research.google.com/github/benjamin-dinh/tweet-sentiment-analysis/blob/main/tweet_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis of Airline Tweets**

--- 

Analyzed tweets about U.S. airlines labeled by polarity of opinion: positive, neutral, or negative

Used the Natural Language Toolkit (nltk) to tokenize and lemmatize tweets and to count word frequencies

Used TF-IDF (term frequency–inverse document frequency) to weight word importance

Classified tweets by sentiment using logistic regression

The data set comes from https://raw.githubusercontent.com/lkyin/ECS189L/main/Tweets.csv

##  ***Top 10 Most Frequently Used Words in Each Sentiment Group***

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lkyin/ECS189L/main/Tweets.csv')

### Clean and Tokenize Tweets

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Add two columns to df: 'clean_text' (list of tokens per tweet)
# and 'clean_joined_text' (the same tokens joined by spaces)
def clean(df):
  clean_text = []
  clean_joined_text = []
  for text in df['text']:
    # Drop @mentions and #hashtags by building a new list
    # (removing items from a list while iterating over it skips words)
    words = [w for w in text.split(' ') if not w.startswith(('@', '#'))]
    # Tokenize; punctuation, emojis, and URL fragments become separate tokens
    word_tokens = word_tokenize(' '.join(words))
    # Keep alphabetic tokens longer than one character that are not stop words
    # or URL schemes, then lowercase them. The stop-word check is case-sensitive,
    # so capitalized stop words such as 'You' are kept.
    tokens = [w.lower() for w in word_tokens
              if w not in stop_words and w.isalpha()
              and w not in ('http', 'https') and len(w) > 1]
    clean_text.append(tokens)
    clean_joined_text.append(' '.join(tokens))
  df['clean_text'] = clean_text
  df['clean_joined_text'] = clean_joined_text
  return df

df = clean(df)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
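
In the cell above, the stop-word check runs before the tokens are lowercased, so capitalized stop words pass the filter (this is why `you` shows up in the positive counts further down). A minimal sketch of a case-insensitive variant follows. It is an illustration only: `clean_tweet` and `STOP_WORDS` are hypothetical names, the stop-word list is abbreviated, and a simple regular expression stands in for `word_tokenize`, so its output will not match the notebook's counts.

```python
import re

# Hypothetical, abbreviated stop-word list for illustration only;
# the notebook uses nltk's full English list.
STOP_WORDS = {"you", "are", "the", "a", "to", "for", "is", "my"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Return lowercase alphabetic tokens without mentions, hashtags, links, or stop words."""
    # Drop whitespace-separated words that are mentions, hashtags, or links.
    words = [w for w in text.split()
             if not w.startswith(("@", "#", "http://", "https://"))]
    # Simplified tokenizer (an assumption, not word_tokenize): runs of ASCII letters.
    tokens = re.findall(r"[A-Za-z]+", " ".join(words))
    # Lowercase first, then filter, so "You" and "you" are treated alike.
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if len(t) > 1 and t not in stop_words]

print(clean_tweet("@united You guys are great! Thanks for the help #flying https://t.co/abc"))
# ['guys', 'great', 'thanks', 'help']
```

Applying a change like this to the notebook would alter the frequency tables and the model metrics recorded below, since fewer tokens would survive the filter.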


### Lemmatize

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

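def lemmatize(text):
  # Flatten the per-tweet token lists into one list and lemmatize each token.
  # lemmatize() treats every word as a noun by default (pos='n'),
  # so verb forms such as 'cancelled' are left unchanged.
  return [lemmatizer.lemmatize(word) for row in text for word in row]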

neutral_df = df.loc[df['airline_sentiment'] == 'neutral']
negative_df = df.loc[df['airline_sentiment'] == 'negative']
positive_df = df.loc[df['airline_sentiment'] == 'positive']

neutral_words = lemmatize(neutral_df['clean_text'])
negative_words = lemmatize(negative_df['clean_text'])
positive_words = lemmatize(positive_df['clean_text'])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Results

In [None]:
def count_words(words):
  # Print the 10 most common words with their counts
  freq = nltk.FreqDist(words)
  for word, frequency in freq.most_common(10):
    print(f'{word}:{frequency}')
  print('\n')

print("Neutral Frequency")
count_words(neutral_words)
print("Negative Frequency")
count_words(negative_words)
print("Positive Frequency")
count_words(positive_words)

Neutral Frequency
flight:768
get:241
need:180
please:180
help:164
thanks:154
dm:129
would:127
ticket:112
tomorrow:107


Negative Frequency
flight:3326
hour:1083
get:1013
cancelled:914
customer:771
service:761
time:730
bag:656
hold:611
help:600


Positive Frequency
thanks:606
thank:452
flight:432
great:235
service:163
love:130
customer:123
guy:122
get:120
you:118
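
The negative group is much larger than the other two, so raw counts are hard to compare across groups. `nltk.FreqDist` is a subclass of `collections.Counter`, so the same top-k listing can be sketched with the standard library and extended with each word's share of all tokens. The helper name `top_words` and the sample tokens are hypothetical and not taken from the data set.

```python
from collections import Counter

def top_words(words, k=10):
    """Return (word, count, share of all tokens) for the k most common words."""
    counts = Counter(words)
    total = sum(counts.values())
    return [(w, c, c / total) for w, c in counts.most_common(k)]

# Hypothetical token list, not taken from the data set
sample = ["flight", "thanks", "flight", "great", "flight", "thanks"]
for word, count, share in top_words(sample, k=2):
    print(f"{word}:{count} ({share:.1%})")
# flight:3 (50.0%)
# thanks:2 (33.3%)
```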




##  ***Classification Model Using Logistic Regression***

### Select Dependent Variable and Independent Variables

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

np.set_printoptions(precision=3)

# Independent variables: TF-IDF weights of the cleaned tweets
tweets = df['clean_joined_text'].to_list()
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)
# Dependent variable: sentiment label of each tweet
Y = df['airline_sentiment']
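
To make the weighting concrete, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document. The function name `tfidf` and the toy corpus are hypothetical; the sketch ignores the vectorizer's own tokenization and sparse-matrix output.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in doc_freq.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize so every non-empty document has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical toy corpus, not taken from the data set
docs = [["flight", "delayed"], ["flight", "great"], ["flight", "cancelled", "hour"]]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

A word that occurs in every document, such as `flight` here, gets the minimum idf of 1, so rarer words in the same tweet receive larger weights.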

### Split Training and Testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
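
As a rough illustration of what `test_size=0.2` and `random_state=42` control, the sketch below shuffles row indices with a fixed seed and holds out 20% of them. It is not scikit-learn's implementation: `split_indices` is a hypothetical helper and Python's `random` module is a different generator, so the same seed does not reproduce the notebook's split.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and therefore the split, repeatable
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```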

### Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=100000)
model.fit(X_train,Y_train)
Y_pred = model.predict(X_test)
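
For a three-class target, prediction with a multinomial logistic regression comes down to a linear score per class, a softmax over the scores, and picking the most probable class. The sketch below shows that step only, under the assumption that the fitted model is multinomial. The function `predict` and all weights, biases, and feature values are hypothetical and not the coefficients learned in this notebook.

```python
import math

def predict(x, weights, biases, classes):
    """Return (predicted class, class probabilities) for a sparse feature dict x."""
    # Linear score per class: bias + sum of coefficient * feature value
    scores = [b + sum(w.get(f, 0.0) * v for f, v in x.items())
              for w, b in zip(weights, biases)]
    # Softmax, shifted by the maximum score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs

# Hypothetical coefficients and TF-IDF values for illustration only
classes = ["negative", "neutral", "positive"]
weights = [{"cancelled": 2.0, "great": -1.0}, {}, {"great": 2.0, "cancelled": -1.0}]
biases = [0.5, 0.0, 0.0]
label, probs = predict({"great": 0.8, "crew": 0.6}, weights, biases, classes)
print(label, [round(p, 3) for p in probs])
```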

### Model Accuracy

In [None]:
from sklearn.metrics import classification_report
report = classification_report(Y_test, Y_pred, output_dict = True)
pd.DataFrame(report).transpose()

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.803242 | 0.944415 | 0.868127 | 1889.0 |
| neutral | 0.671958 | 0.437931 | 0.530271 | 580.0 |
| positive | 0.823708 | 0.590414 | 0.687817 | 459.0 |
| accuracy | 0.788593 | 0.788593 | 0.788593 | 0.788593 |
| macro avg | 0.766303 | 0.657587 | 0.695405 | 2928.0 |
| weighted avg | 0.780444 | 0.788593 | 0.772936 | 2928.0 |
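
The summary rows of the report can be recomputed from the per-class rows, which also shows how they differ: the macro average weights every class equally, while the weighted average and the accuracy are dominated by the large negative class. The sketch below uses the precision, recall, and support values copied from the report above; the variable names are illustrative.

```python
# Per-class rows copied from the report above: (precision, recall, support)
rows = {
    "negative": (0.803242, 0.944415, 1889),
    "neutral": (0.671958, 0.437931, 580),
    "positive": (0.823708, 0.590414, 459),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(p, r) for label, (p, r, _) in rows.items()}
total = sum(support for _, _, support in rows.values())

# Macro average: unweighted mean over classes
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1_scores[label] * s for label, (_, _, s) in rows.items()) / total
# Accuracy: correctly classified tweets (recall * support per class) over all tweets
accuracy = sum(r * s for _, r, s in rows.values()) / total

print(f"macro F1 {macro_f1:.4f}, weighted F1 {weighted_f1:.4f}, accuracy {accuracy:.4f}")
```

The neutral class has the lowest recall (about 0.44), which pulls the macro-averaged F1 well below the overall accuracy.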


##  ***Airline Rankings Based on Sentiment***

### Count Number of Positive and Negative Tweets Per Airline

In [None]:
# One row per (airline, sentiment) pair with its tweet count
sentiment_df = df.groupby(["airline", "airline_sentiment"]).size().reset_index(name="count")

### Calculate Fraction of Positive and Negative Tweets

In [None]:
# Divide each count by the total number of tweets for the same airline
sentiment_df['fraction'] = sentiment_df['count'] / sentiment_df.groupby('airline')['count'].transform('sum')
sentiment_df

| | airline | airline_sentiment | count | fraction |
|---|---|---|---|---|
| 0 | American | negative | 1960 | 0.710402 |
| 1 | American | neutral | 463 | 0.167814 |
| 2 | American | positive | 336 | 0.121783 |
| 3 | Delta | negative | 955 | 0.429793 |
| 4 | Delta | neutral | 723 | 0.325383 |
| 5 | Delta | positive | 544 | 0.244824 |
| 6 | Southwest | negative | 1186 | 0.490083 |
| 7 | Southwest | neutral | 664 | 0.27438 |
| 8 | Southwest | positive | 570 | 0.235537 |
| 9 | US Airways | negative | 2263 | 0.776862 |
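
The fraction column is each count divided by the total for the same airline, and the rankings follow from sorting those fractions. The sketch below reproduces both steps with the standard library, using only the three airlines whose rows are complete in the table above; the remaining airlines are left out because not all of their counts are shown. The names `totals`, `fractions`, and `rank` are illustrative.

```python
from collections import defaultdict

# (airline, sentiment, count) rows copied from the table above
counts = [
    ("American", "negative", 1960), ("American", "neutral", 463), ("American", "positive", 336),
    ("Delta", "negative", 955), ("Delta", "neutral", 723), ("Delta", "positive", 544),
    ("Southwest", "negative", 1186), ("Southwest", "neutral", 664), ("Southwest", "positive", 570),
]

# Total number of tweets per airline
totals = defaultdict(int)
for airline, _, n in counts:
    totals[airline] += n

# Fraction of each airline's tweets that fall in each sentiment group
fractions = {(a, s): n / totals[a] for a, s, n in counts}

def rank(sentiment):
    """Return (airline, fraction) pairs for one sentiment, largest fraction first."""
    pairs = [(a, f) for (a, s), f in fractions.items() if s == sentiment]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for airline, frac in rank("positive"):
    print(f"{airline}: {frac:.6f}")
# Delta: 0.244824
# Southwest: 0.235537
# American: 0.121783
```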


### Results

*Top 3 Airlines in Terms of Fraction of Positive Tweets*
1. Virgin America
2. Delta
3. Southwest

*Top 3 Airlines in Terms of Fraction of Negative Tweets*
1. US Airways
2. American
3. United