<a href="https://colab.research.google.com/github/datasistah/ml_system_design_class_notebooks/blob/main/20230719_airline_tweet_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Twitter Sentiment Analysis

**Problem statement:** The airline industry struggled to sustain its business after COVID because of a long halt in operations. Exceeding customer expectations is therefore critical, and customer feedback is the most direct way to evaluate performance. You are given a dataset of airline tweets from real customers.

This is a sentiment analysis task about the problems of each major U.S. airline. The Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service").

You will use the text column and sentiment column to create a classification model that classifies a given tweet into one of the 3 classes - positive, negative, neutral.

**Understanding the Dataset:**

The dataset contains many columns; the most important ones are:
1. airline_sentiment - the sentiment of the tweet
2. negativereason - reason for the negative feedback (if negative)
3. text - tweet text content
4. tweet_location - location from which the tweet was posted

You can use more columns in your model training if you want.


**Steps to perform**
1. Load dataset - https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
2. Clean, preprocess data and EDA
3. Vectorise columns that contain text
4. Run Classification model to classify - positive, negative or neutral
5. Evaluate model



## Steps to Download kaggle datasets using Kaggle Public API

1. Go to your Kaggle account, scroll to the API section, and click Expire API Token to remove previous tokens.

2. Click Create New API Token. This downloads a kaggle.json file to your machine.

In [None]:
!pip install -q kaggle

In [None]:
pwd

'/content'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls

drive  sample_data


In [None]:
from google.colab import files
files.upload()

Saving kaggle (2).json to kaggle (2).json


{'kaggle (2).json': b'{"username":"mathluva","key":"<redacted>"}'}

In [None]:
! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/


In [None]:
! chmod 600 ~/.kaggle/kaggle.json
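
The upload, copy, and chmod cells above can also be done from Python without echoing the token into the notebook output. This is a minimal sketch using only the standard library; the helper name `install_kaggle_token` and its arguments are hypothetical, not part of the Kaggle tooling.

```python
import json
import os
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir=None):
    """Copy a downloaded kaggle.json into place with owner-only permissions."""
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`

    # Parse the file to check that both expected fields are present.
    creds = json.loads(Path(token_path).read_text())
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"token file is missing fields: {sorted(missing)}")

    target = kaggle_dir / "kaggle.json"
    target.touch(mode=0o600, exist_ok=True)
    os.chmod(target, 0o600)  # same effect as `chmod 600`
    target.write_text(json.dumps(creds))
    return target
```

Calling `install_kaggle_token("kaggle.json")` returns the path of the installed file and never prints the key.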

In [None]:
! kaggle datasets download crowdflower/twitter-airline-sentiment


Downloading twitter-airline-sentiment.zip to /content
 78% 2.00M/2.55M [00:00<00:00, 2.97MB/s]
100% 2.55M/2.55M [00:00<00:00, 3.15MB/s]


In [None]:
!unzip -q "twitter-airline-sentiment.zip"

In [None]:
import pandas as pd
file = "Tweets.csv"
df = pd.read_csv(file)
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
len(df)

14640

In [None]:
#! pip install ydata_profiling

#### EDA- Pandas Profile

In [None]:
# ydata-profiling (formerly pandas-profiling): https://pypi.org/project/ydata-profiling/
import pandas as pd
from ydata_profiling import ProfileReport

In [None]:
#!apt-get install -y fonts-freefont-ttf


In [None]:
#! pip install --upgrade Pillow

In [None]:
profile = ProfileReport(df,minimal=True)
profile




In [16]:
print(f"The length of the original dataset is {len(df)}")

The length of the original dataset is 14640


In [17]:
#important columns
imp_cols = ['tweet_id','airline_sentiment','negativereason','text','tweet_location']
df = df[imp_cols]
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,


In [18]:
df = df.drop_duplicates()
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,


In [19]:
print(f"The length after dropping duplicates is {len(df)}")

The length after dropping duplicates is 14532


In [20]:
df.airline_sentiment.value_counts()

negative    9118
neutral     3074
positive    2340
Name: airline_sentiment, dtype: int64
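
The counts above show a strong skew towards negative tweets. A quick piece of arithmetic, using only the three counts printed above, gives the class shares and the accuracy of a trivial baseline that always predicts the majority class. Any trained classifier should be judged against that number.

```python
# Class counts copied from the value_counts() output above
counts = {"negative": 9118, "neutral": 3074, "positive": 2340}
total = sum(counts.values())

# Share of each class in the de-duplicated dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<9}{share:.3f}")

# Accuracy of a baseline that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]
print(f"always predicting '{majority_label}' scores {baseline_accuracy:.3f}")
```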

#### Data Cleaning

In [25]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_tweet(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove user mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Remove special characters and punctuation; the translate step also
    # drops underscores, which \w would otherwise keep
    text = re.sub(r'[^\w\s]', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    text = text.lower()

    # Tokenize and remove stop words
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join the tokens back into a single string
    return ' '.join(tokens)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [26]:
df['clean_tweet'] = df['text'].apply(clean_tweet)
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location,clean_tweet
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,,said
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,,plus youve added commercials experience tacky
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play,didnt today must mean need take another trip
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,,really aggressive blast obnoxious entertainmen...
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,,really big bad thing
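
The cleaned column above still contains tokens such as `youve` and `didnt`. A likely reason is the order of operations: apostrophes are stripped before the stop-word filter runs, so contractions no longer match the stop-word list. The sketch below illustrates the effect with a tiny hand-written stop-word set and a plain whitespace split; both are assumptions standing in for NLTK's list and tokenizer.

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"i", "didn't", "you've", "to", "the"}


def strip_then_filter(text):
    """Remove punctuation first, then drop stop words (order used above)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]


def filter_then_strip(text):
    """Drop stop words first, then remove punctuation from what is left."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [re.sub(r"[^\w\s]", "", t) for t in tokens]


sample = "I didn't fly today"
print(strip_then_filter(sample))  # ['didnt', 'fly', 'today']
print(filter_then_strip(sample))  # ['fly', 'today']
```

Whether contractions should be kept is a modelling choice: negations such as "didn't" can carry sentiment, so keeping them may be preferable.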


In [27]:
#create labels for sentiment
possible_labels = df.airline_sentiment.unique()

label_dict = {}
for index, label in enumerate(possible_labels):
    label_dict[label] = index
label_dict

{'neutral': 0, 'positive': 1, 'negative': 2}

In [28]:
#create column for sentiment label as int
df['sentiment_label'] = df['airline_sentiment'].apply(lambda x: label_dict[x])
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location,clean_tweet,sentiment_label
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,,said,0
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,,plus youve added commercials experience tacky,1
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play,didnt today must mean need take another trip,0
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,,really aggressive blast obnoxious entertainmen...,2
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,,really big bad thing,2


In [29]:
df.airline_sentiment.value_counts()

negative    9118
neutral     3074
positive    2340
Name: airline_sentiment, dtype: int64

In [30]:
#use stratified sampling for imbalanced dataset, there are more negative tweets than positive or neutral

# import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [31]:
tweets = list(df.clean_tweet)
tweets[:2]

['said', 'plus youve added commercials experience tacky']

In [32]:
# create a count vectorizer to convert the text into numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

In [33]:
sentiments = list(df.sentiment_label)
sentiments[:2]

[0, 1]

### Training

In [34]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, sentiments, test_size=0.2, random_state=42, stratify = df['sentiment_label'].values)
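
In the cells above the vectorizer is fitted on all tweets before the split, so the vocabulary also reflects the test tweets. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training portion only. The sketch below uses a handful of made-up tweets as stand-ins for `df.clean_tweet` and `df.sentiment_label`; it is an illustration, not the configuration used for the results in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the cleaned tweets and integer labels (illustrative only)
texts = ["great flight", "love crew", "great crew", "love service",
         "late flight", "lost bag", "rude service", "late bag"]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Split the raw text first, keeping the class proportions in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits CountVectorizer on the training texts only
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, y_train)

vocab = pipeline.named_steps["countvectorizer"].vocabulary_
print(len(train_texts), len(test_texts), len(vocab))
print(pipeline.predict(test_texts))
```

A side benefit is that the pipeline is a single object, so saving it keeps the fitted vocabulary together with the model.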



In [35]:
import numpy as np
# convert the target arrays to NumPy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)


In [36]:
print(f" X_train.shape = {X_train.shape}\n X_test.shape = {X_test.shape}\n y_train = {y_train.shape}\n y_test = {y_test.shape}")

 X_train.shape = (11625, 13024)
 X_test.shape = (2907, 13024)
 y_train = (11625,)
 y_test = (2907,)


In [37]:
from sklearn.linear_model import LogisticRegression
# create a logistic regression model with increased max_iter
model = LogisticRegression(max_iter=1000)


In [38]:
#fit model to training data
model.fit(X_train, y_train)

In [39]:
# make predictions on the testing data
y_pred = model.predict(X_test)


In [40]:
import pickle
# save the model to disk
# note: the fitted vectorizer is also needed at inference time and is not saved here
filename = 'twitter_classification_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

In [41]:
# evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7791537667698658


In [42]:
# define a single tweet to predict the sentiment of
tweet = "I hate going to work on Mondays"

# convert the tweet to numerical features using the same vectorizer
X_tweet = vectorizer.transform([tweet])

# make a prediction using the trained model
sentiment = model.predict(X_tweet)

In [43]:
print(sentiment)

[2]
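
The prediction above is an integer id. Inverting the `label_dict` built earlier turns it back into a readable label. The dictionary is repeated here so the snippet runs on its own.

```python
# Mapping copied from the label_dict output earlier in the notebook
label_dict = {'neutral': 0, 'positive': 1, 'negative': 2}

# Invert it so integer predictions can be decoded
id_to_label = {index: label for label, index in label_dict.items()}

predicted_ids = [2]  # the value printed above
print([id_to_label[i] for i in predicted_ids])  # ['negative']
```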


In [44]:
from sklearn.metrics import classification_report

# print precision, recall, and F1 score for each class
print(classification_report(y_test, y_pred, target_names=possible_labels))


              precision    recall  f1-score   support

     neutral       0.61      0.56      0.58       615
    positive       0.76      0.65      0.70       468
    negative       0.83      0.89      0.86      1824

    accuracy                           0.78      2907
   macro avg       0.73      0.70      0.71      2907
weighted avg       0.77      0.78      0.77      2907
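
The last two rows of the report can be reproduced from the per-class rows: the macro average is a plain mean over the three classes, while the weighted average weights each class by its support. The sketch below recomputes precision and recall from the rounded values printed above, so small rounding differences against the report are possible.

```python
# (precision, recall, support) per class, copied from the report above
report = {
    "neutral": (0.61, 0.56, 615),
    "positive": (0.76, 0.65, 468),
    "negative": (0.83, 0.89, 1824),
}
total_support = sum(row[2] for row in report.values())


def macro(index):
    """Unweighted mean of one metric over the classes."""
    return sum(row[index] for row in report.values()) / len(report)


def weighted(index):
    """Mean of one metric weighted by each class's support."""
    return sum(row[index] * row[2] for row in report.values()) / total_support


print(f"macro precision    {macro(0):.2f}")
print(f"weighted precision {weighted(0):.2f}")
print(f"macro recall       {macro(1):.2f}")
print(f"weighted recall    {weighted(1):.2f}")
```

The weighted recall matches the overall accuracy, and the gap between the macro and weighted figures reflects how much the large negative class dominates the score.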



## Using SpaCy to Train Model

In [46]:
!pip install spacy
!pip install transformers -U
!python -m spacy download en_core_web_sm

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [47]:
import spacy
import pandas as pd

In [48]:
nlp = spacy.load('en_core_web_sm')

In [49]:
#Splitting the dataset into train and test
train = df.sample(frac = 0.8, random_state = 17)
test = df.drop(train.index)

In [50]:
print(len(train), len(test))


11626 2906
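
Unlike the scikit-learn split used earlier, `df.sample` does not stratify by sentiment, so the class proportions in the two parts can drift slightly. A minimal standard-library sketch of a stratified split over (text, label) pairs follows; the function name `stratified_split` and the toy pairs are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(pairs, train_frac=0.8, seed=17):
    """Split (text, label) pairs so each label keeps about train_frac in train."""
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair[1]].append(pair)

    rng = random.Random(seed)
    train_part, test_part = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        train_part.extend(items[:cut])
        test_part.extend(items[cut:])
    return train_part, test_part


# Toy data: 10 negative and 5 positive tweets
pairs = [(f"tweet {i}", "negative") for i in range(10)]
pairs += [(f"tweet {i}", "positive") for i in range(10, 15)]

train_pairs, test_pairs = stratified_split(pairs)
print(len(train_pairs), len(test_pairs))  # 12 3
```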


In [51]:
#create tuples which are pairs of text along with sentiments
#create for train and test datasets

#Creating tuples
train['tuples'] = train.apply(lambda row: (row['clean_tweet'],row['airline_sentiment']), axis=1)
train = train['tuples'].tolist()
test['tuples'] = test.apply(lambda row: (row['clean_tweet'],row['airline_sentiment']), axis=1)
test = test['tuples'].tolist()

In [52]:
type(train[0])

tuple

In [None]:
df.airline_sentiment.unique()


In [55]:
#function for converting the train and test dataset into spaCy document

def document(data):

  text = []

  for doc, label in nlp.pipe(data, as_tuples = True):
    if(label=='positive'):
        doc.cats['positive']=1
        doc.cats['negative']=0
        doc.cats['neutral'] =0
    elif(label=='negative'):
        doc.cats['positive']=0
        doc.cats['negative']=1
        doc.cats['neutral'] =0
    else:
        doc.cats['positive']=0
        doc.cats['negative']=0
        doc.cats['neutral'] =1

    # append every doc (not only the neutral ones) to the list
    text.append(doc)

  # return only after all docs have been processed
  return text

In [56]:
# Storing docs in binary format
from spacy.tokens import DocBin

In [57]:
#passing the train dataset into function 'document'
train_docs = document(train[:3])


In [58]:
train_docs[0]


possible book refundable trip willing pay extra would domestic round trip flight

In [59]:
train_docs[0].cats


{'positive': 0, 'negative': 0, 'neutral': 1}
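
The `cats` dictionary shown above is a one-hot encoding of the label: exactly one category is set to 1. The same dictionary can be built in one expression instead of three branches. In the sketch below, `one_hot_cats` and `CATEGORIES` are hypothetical names; inside the spaCy loop the result could be applied with `doc.cats.update(one_hot_cats(label))`.

```python
CATEGORIES = ("positive", "negative", "neutral")


def one_hot_cats(label):
    """Return a dict with 1 for the given label and 0 for the other categories."""
    if label not in CATEGORIES:
        raise ValueError(f"unknown label: {label!r}")
    return {category: int(category == label) for category in CATEGORIES}


print(one_hot_cats("neutral"))  # {'positive': 0, 'negative': 0, 'neutral': 1}
```

Raising on an unknown label, rather than falling through to neutral, makes typos in the label column visible early.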

In [60]:
#passing the train dataset into function 'document'
train_docs = document(train)

#Creating binary document using DocBin function in spaCy
doc_bin = DocBin(docs = train_docs)

#Saving the binary document as train.spacy
doc_bin.to_disk("train.spacy")

In [61]:
#passing the test dataset into function 'document'
test_docs = document(test)
doc_bin = DocBin(docs = test_docs)
doc_bin.to_disk("test.spacy")

In [62]:
!pwd

/content


#### Create a starter config file for spaCy at https://spacy.io/usage/training#quickstart and save it as base_config.cfg
#### Update the paths for train.spacy and test.spacy

In [64]:
from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
config = {
   "threshold": 0.5,
   "model": DEFAULT_MULTI_TEXTCAT_MODEL,
}
nlp.add_pipe("textcat_multilabel", config=config)

<spacy.pipeline.textcat_multilabel.MultiLabel_TextCategorizer at 0x7a6df398ea40>

In [67]:
#Restart kernel
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train config.cfg --output airline_tweet_model  --paths.train train.spacy --paths.dev test.spacy


In [None]:
!python -m spacy evaluate airline_tweet_model/model-best/ --output metrics.json ./test.spacy
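
The evaluate command writes its scores to metrics.json. A short sketch for reading the text-classification scores back into Python follows. It assumes those scores are stored under keys beginning with `cats_`, which should be checked against the actual file; the helper name `load_cat_metrics` is hypothetical.

```python
import json
from pathlib import Path


def load_cat_metrics(path="metrics.json"):
    """Return the entries of a metrics file whose keys start with 'cats_'."""
    metrics = json.loads(Path(path).read_text())
    return {key: value for key, value in metrics.items()
            if key.startswith("cats_")}


# Only read the file if the evaluate step has already produced it
if Path("metrics.json").exists():
    for key, value in load_cat_metrics().items():
        print(key, value)
```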
