<a href="https://colab.research.google.com/github/datasistah/ml_sytem_design_course/blob/main/airline_tweet_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Twitter Sentiment Analysis

**Problem statement:** The airline industry struggled to sustain its business after COVID because of a long halt in travel, so it is very important for airlines to exceed customer expectations. Customer feedback is one of the most direct ways to evaluate performance. You are given a dataset of airline tweets from real customers.

The data comes from a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service").

You will use the `text` column and the `airline_sentiment` column to create a classification model that classifies a given tweet into one of three classes: positive, negative, or neutral.

**Understanding the Dataset:**

The dataset contains many columns; the most important ones are:
1. airline_sentiment - the sentiment of the tweet
2. negativereason - reason for the negative feedback (if negative)
3. text - tweet text content
4. tweet_location - location from which the tweet was posted

You can use more columns in your model training if you want.


**Steps to perform**
1. Load dataset - https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
2. Clean and preprocess the data, then perform EDA
3. Vectorise the columns that contain text
4. Run a classification model to classify tweets as positive, negative or neutral
5. Evaluate model



### Use SpaCy to Train

In [1]:
#!pip install spacy

In [2]:
#!python -m spacy download en_core_web_sm


In [3]:
#pip install transformers -U 

In [4]:
import spacy



In [5]:
#!pip install tokenizers==0.9.4

In [6]:
#!pip install transformers -U


In [7]:
nlp = spacy.load('en_core_web_sm')

In [8]:
import pandas as pd

In [9]:
file = 'kaggle/Tweets.csv'
df = pd.read_csv(file)
df.head()


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [10]:
print(f"The length of the original dataset is {len(df)}")

The length of the original dataset is 14640


In [11]:
#important columns
imp_cols = ['tweet_id','airline_sentiment','negativereason','text','tweet_location']
df = df[imp_cols]
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,


In [12]:
df = df.drop_duplicates()
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,


In [13]:
print(f"The length after dropping duplicates is {len(df)}")

The length after dropping duplicates is 14532


In [14]:
df.airline_sentiment.value_counts()

negative    9118
neutral     3074
positive    2340
Name: airline_sentiment, dtype: int64
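
The counts above show a clear class imbalance, which is worth keeping in mind when reading the evaluation scores later. As a rough sketch in plain Python (the counts are copied from the output above; the variable names are illustrative), the class shares and the accuracy of a hypothetical classifier that always predicts the most frequent class can be computed like this:

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 9118, "neutral": 3074, "positive": 2340}

total = sum(counts.values())

# Share of each class in the de-duplicated dataset.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class reaches this
# accuracy on data with the same class distribution.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<9} {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Any trained model should be compared against this baseline rather than against 33%, since the three classes are far from evenly sized.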

#### Data Cleaning

In [15]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Build the stop-word set once instead of on every call.
# Assumes the NLTK 'stopwords' corpus is already downloaded.
STOP_WORDS = set(stopwords.words('english'))

def clean_tweet(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove user mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Remove special characters and punctuation
    # (the translate step also strips underscores, which \w keeps)
    text = re.sub(r'[^\w\s]', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    text = text.lower()

    # Tokenize and remove stop words
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join the tokens back into a single string
    return ' '.join(tokens)



[nltk_data] Downloading package punkt to /Users/teasletx/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
df['clean_tweet'] = df['text'].apply(lambda x: clean_tweet(x))
df.head()

Unnamed: 0,tweet_id,airline_sentiment,negativereason,text,tweet_location,clean_tweet
0,570306133677760513,neutral,,@VirginAmerica What @dhepburn said.,,said
1,570301130888122368,positive,,@VirginAmerica plus you've added commercials t...,,plus youve added commercials experience tacky
2,570301083672813571,neutral,,@VirginAmerica I didn't today... Must mean I n...,Lets Play,didnt today must mean need take another trip
3,570301031407624196,negative,Bad Flight,@VirginAmerica it's really aggressive to blast...,,really aggressive blast obnoxious entertainmen...
4,570300817074462722,negative,Can't Tell,@VirginAmerica and it's a really big bad thing...,,really big bad thing
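
To see what the regex steps do on their own, here is a small sketch that applies only the URL, mention/hashtag, punctuation and lowercase steps of `clean_tweet`, without NLTK. The helper name `strip_noise` and the sample tweets are made up for illustration.

```python
import re

def strip_noise(text):
    """Apply only the regex and lowercase steps of clean_tweet (no NLTK)."""
    text = re.sub(r'http\S+', '', text)      # URLs
    text = re.sub(r'@\w+|#\w+', '', text)    # mentions and hashtags
    text = re.sub(r'[^\w\s]', '', text)      # punctuation, including apostrophes
    return ' '.join(text.lower().split())    # lowercase, collapse whitespace

# Hypothetical tweets, written for illustration only.
print(strip_noise("@VirginAmerica I didn't fly today! http://t.co/xyz #travel"))
print(strip_noise("@united it's REALLY late..."))
```

Because apostrophes are removed before stop-word filtering, contractions turn into forms such as `didnt` or `its`. That probably explains why `didnt` and `youve` survive in the `clean_tweet` column above, assuming the stop-word list only contains the apostrophe forms. Filtering stop words before stripping punctuation is one way to avoid this.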


In [17]:
#Splitting the dataset into train and test
train = df.sample(frac = 0.8, random_state = 17)
test = df.drop(train.index)

In [18]:
print(len(train), len(test))

11626 2906
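
`df.sample` draws rows at random, so the class proportions in the train and test sets can differ slightly from the full dataset. If that matters, a stratified split is an option. The following is only a sketch in plain Python on `(text, label)` pairs; `stratified_split` is a hypothetical helper, not something the notebook uses.

```python
import random
from collections import defaultdict

def stratified_split(pairs, train_frac=0.8, seed=17):
    """Split (text, label) pairs so each label keeps roughly train_frac in train."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in pairs:
        by_label[label].append((text, label))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)                       # shuffle within each class
        cut = int(round(len(items) * train_frac))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy data for illustration only: 10 negative and 5 positive tweets.
pairs = [(f"tweet {i}", "negative") for i in range(10)]
pairs += [(f"tweet {i}", "positive") for i in range(10, 15)]

train_pairs, test_pairs = stratified_split(pairs)
print(len(train_pairs), len(test_pairs))
```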


> Reference: Building a Sentiment Classifier Using spaCy Transformers
    >> https://towardsdatascience.com/building-sentiment-classifier-using-spacy-3-0-transformers-c744bfc767b

In [19]:
#create tuples which are pairs of text along with sentiments
#create for train and test datasets

#Creating tuples
train['tuples'] = train.apply(lambda row: (row['clean_tweet'],row['airline_sentiment']), axis=1)
train = train['tuples'].tolist()
test['tuples'] = test.apply(lambda row: (row['clean_tweet'],row['airline_sentiment']), axis=1)
test = test['tuples'].tolist()

In [44]:
type(train[0])

tuple

In [45]:
df.airline_sentiment.unique()

array(['neutral', 'positive', 'negative'], dtype=object)

In [26]:
#function for converting the train and test dataset into spaCy document

def document(data):
    text = []

    for doc, label in nlp.pipe(data, as_tuples=True):
        if label == 'positive':
            doc.cats['positive'] = 1
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 0
        elif label == 'negative':
            doc.cats['positive'] = 0
            doc.cats['negative'] = 1
            doc.cats['neutral'] = 0
        else:
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 1
        # Append every doc, not only the neutral ones
        text.append(doc)

    return text

> The returned list holds spaCy `Doc` objects, spaCy's internal representation of each text, with the label stored in `doc.cats`

>> Convert the input data into binary objects, the format spaCy requires for training

In [27]:
# Storing docs in binary format
from spacy.tokens import DocBin

> Sequence length error: the sequence must be truncated to 512 tokens, e.g. `seq = seq[:512]`
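
A minimal sketch of that truncation, assuming whitespace-separated words stand in for the model's real tokens (a transformer tokenizer splits text into subwords, so actual counts will differ). `truncate_tokens` and `MAX_LEN` are illustrative names, not part of spaCy.

```python
MAX_LEN = 512  # limit mentioned in the note above

def truncate_tokens(text, max_len=MAX_LEN):
    """Keep at most max_len whitespace-separated tokens of text."""
    tokens = text.split()
    return ' '.join(tokens[:max_len])

long_text = ' '.join(['word'] * 600)
short_text = 'really big bad thing'

print(len(truncate_tokens(long_text).split()))  # number of tokens kept
print(truncate_tokens(short_text))              # short inputs pass through unchanged
```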

In [40]:
def document(data):
    text = []

    for doc, label in nlp.pipe(data, as_tuples=True):
        if label == 'positive':
            doc.cats['positive'] = 1
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 0
        elif label == 'negative':
            doc.cats['positive'] = 0
            doc.cats['negative'] = 1
            doc.cats['neutral'] = 0
        else:
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 1
        text.append(doc)
    
    return text


In [41]:
#passing the train dataset into function 'document'
train_docs = document(train[:3])

In [48]:
train_docs[0]

possible book refundable trip willing pay extra would domestic round trip flight

In [47]:
train_docs[0].cats

{'positive': 0, 'negative': 0, 'neutral': 1}
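
The `cats` dictionary shown above is a one-hot encoding of the label. As a sketch, the same mapping can be written with a dict comprehension; `one_hot_cats` and `LABELS` are hypothetical names introduced here, not part of the notebook or of spaCy.

```python
LABELS = ('positive', 'negative', 'neutral')

def one_hot_cats(label, labels=LABELS):
    """Return a cats-style dict with 1 for `label` and 0 for every other label."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    return {name: int(name == label) for name in labels}

print(one_hot_cats('neutral'))  # {'positive': 0, 'negative': 0, 'neutral': 1}
```

Unlike the `else` branch in `document`, this version raises an error on an unexpected label instead of silently treating it as neutral.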

In [49]:
#passing the train dataset into function 'document'
train_docs = document(train)

#Creating binary document using DocBin function in spaCy
doc_bin = DocBin(docs = train_docs)

#Saving the binary document as train.spacy
doc_bin.to_disk("train.spacy")


In [50]:
#passing the test dataset into function 'document'
test_docs = document(test)
doc_bin = DocBin(docs = test_docs)
doc_bin.to_disk("test.spacy")
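
Before training, it can be worth checking that the labels survive the round trip through the binary format. The sketch below does this with a blank English pipeline and a temporary file so that it runs on its own; the sample text and the file name `sample.spacy` are made up, and the same `from_disk` / `get_docs` calls should work on `train.spacy` with the notebook's `nlp.vocab`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# A blank pipeline is enough here: only the tokenizer and vocab are needed.
nlp_blank = spacy.blank("en")

doc = nlp_blank("really big bad thing")
doc.cats = {"positive": 0, "negative": 1, "neutral": 0}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.spacy"
    DocBin(docs=[doc]).to_disk(path)

    # Load the file back and rebuild Doc objects against the same vocab.
    restored = list(DocBin().from_disk(path).get_docs(nlp_blank.vocab))

print(restored[0].text, restored[0].cats)
```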


> Create a starter config file (`base_config.cfg`) with the spaCy quickstart at https://spacy.io/usage/training#quickstart
>> Update the paths for train.spacy and test.spacy

In [63]:
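# Note: this only adds a component to the in-memory `nlp` object.
# `spacy train` below builds its pipeline from config.cfg instead
# (the training log shows ['textcat'], the single-label component).
from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
config = {
   "threshold": 0.5,
   "model": DEFAULT_MULTI_TEXTCAT_MODEL,
}
nlp.add_pipe("textcat_multilabel", config=config)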

<spacy.pipeline.textcat_multilabel.MultiLabel_TextCategorizer at 0x7ff09da73860>

In [1]:
#Restart kernel
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [3]:
!python -m spacy train config.cfg --output airline_tweet_model  --paths.train train.spacy --paths.dev test.spacy 

✔ Created output directory: airline_tweet_model
ℹ Saving to output directory: airline_tweet_model
ℹ Using CPU

[2023-05-11 17:33:01,609] [INFO] Set up nlp object from config
[2023-05-11 17:33:01,616] [INFO] Pipeline: ['textcat']
[2023-05-11 17:33:01,619] [INFO] Created vocabulary
[2023-05-11 17:33:01,619] [INFO] Finished initializing nlp object
[2023-05-11 17:33:04,288] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.22       28.38    0.28
  0     200         33.69       37.68    0.38
  0     400         27.18       48.68    0.49
  0     600         25.99       59.22    0.59
  1     800         23.00       65.15    0.65
  1    1000         19.91    

>> Reference for spaCy evaluation: https://catherinebreslin.medium.com/text-classification-with-spacy-3-0-d945e2e8fc44

In [4]:
!python -m spacy evaluate airline_tweet_model/model-best/ --output metrics.json ./test.spacy

ℹ Using CPU

TOK                 100.00
TEXTCAT (macro F)   72.23
SPEED               208423

               P       R       F
positive   76.57   65.90   70.84
negative   82.46   90.69   86.38
neutral    66.14   54.05   59.48

           ROC AUC
positive      0.92
negative      0.90
neutral       0.86

✔ Saved results to metrics.json

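
As a sanity check on the evaluation table, the per-class F-scores and the macro F can be recomputed from the reported precision and recall. This is a sketch in plain Python with the numbers copied from the output above; small differences in the last digit are expected because the reported values are already rounded.

```python
# Precision, recall and F-score copied from the evaluation output above.
reported = {
    "positive": (76.57, 65.90, 70.84),
    "negative": (82.46, 90.69, 86.38),
    "neutral":  (66.14, 54.05, 59.48),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for label, (p, r, f) in reported.items():
    print(f"{label:<9} recomputed F = {f1(p, r):.2f}  reported F = {f:.2f}")

# Macro F is the unweighted mean of the per-class F-scores.
macro_f = sum(f for _, _, f in reported.values()) / len(reported)
print(f"macro F = {macro_f:.2f}")
```

The neutral class has the lowest recall, so it pulls the macro F well below the F-score of the dominant negative class.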