# Twitter US Airline Sentiment

[Dataset Link](https://www.kaggle.com/crowdflower/twitter-airline-sentiment)

This notebook uses the Twitter US Airline Sentiment dataset and preprocesses it so that it can be used for sentiment prediction.

In [1]:
# Importing the required libraries

import pandas as pd

import re

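from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Note: the stop-word list and the WordNet data must be available locally;
# if they are missing, run nltk.download('stopwords') and nltk.download('wordnet') once.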

## 1. Reading the Raw data

In [2]:
# Read the CSV file

df = pd.read_csv('./Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## 2. Choose Columns

We keep only the required columns and drop the rest.

In [3]:
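# .copy() makes the selection an independent frame, which avoids a possible
# SettingWithCopyWarning when the columns are modified later
df = df[['airline_sentiment', 'airline', 'text']].copy()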
df.head()

Unnamed: 0,airline_sentiment,airline,text
0,neutral,Virgin America,@VirginAmerica What @dhepburn said.
1,positive,Virgin America,@VirginAmerica plus you've added commercials t...
2,neutral,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,negative,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,Virgin America,@VirginAmerica and it's a really big bad thing...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   airline_sentiment  14640 non-null  object
 1   airline            14640 non-null  object
 2   text               14640 non-null  object
dtypes: object(3)
memory usage: 343.2+ KB


## 3. Know the Dataset

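In this section, we inspect the selected columns. The `df.info()` output above shows that none of the three columns contains `NULL` values; below we list the distinct values of the two categorical columns.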

In [5]:
# Airline Sentiment

df['airline_sentiment'].unique()

array(['neutral', 'positive', 'negative'], dtype=object)

In [6]:
# Airline Name

df['airline'].unique()

array(['Virgin America', 'United', 'Southwest', 'Delta', 'US Airways',
       'American'], dtype=object)
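
The `df.info()` output above already reports 14640 non-null values for each column. A more direct check is to count the missing values per column. The following is a minimal sketch on a small made-up frame; the `sample` rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows, used only to illustrate the check
sample = pd.DataFrame({
    'airline_sentiment': ['neutral', 'positive', None],
    'airline': ['Virgin America', 'United', 'Delta'],
    'text': ['first tweet', None, 'third tweet'],
})

# Count the missing values in each column
null_counts = sample.isna().sum()
print(null_counts)

# True if at least one column contains a missing value
has_nulls = bool(null_counts.any())
print(has_nulls)
```

Applied to the notebook's `df`, the same `isna().sum()` call should report zero for every column, in line with the `df.info()` output.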

**Dataset Fields**

* `airline_sentiment` : The sentiment of the tweet, one of `positive`, `negative` or `neutral`.
* `airline` : The name of the airline company.
* `text` : The text of the tweet, written by a user commenting on the airline.

## 4. Preprocessing Steps

### 4.1 Text Preprocessing

In this section we process the tweets and convert them into a standard form.

**Reference Links**
* [Regex for Twitter Hashtags](https://stackoverflow.com/questions/8376691/how-to-remove-hashtag-user-link-of-a-tweet-using-regular-expression)
* [Tweet Preprocessing](https://medium.com/analytics-vidhya/pre-processing-tweets-for-sentiment-analysis-a74deda9993e)


In [7]:
# Convert to lower case

df['text'] = df['text'].str.lower()

In [8]:
# Remove links (URLs), as these do not contribute to sentiment

df['text'] = df['text'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
# Also remove bare domains such as "www.example.com" or "example.com"
df['text'] = df['text'].apply(lambda x: re.sub(r"(?:www\.)?[a-z0-9-]+\.com\b", '', x))
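
To see what URL stripping does in isolation, here is a small sketch that uses only the standard `re` module. The names `URL_PATTERN` and `strip_urls` are illustrative and not part of the notebook, and the pattern is an assumption chosen for simplicity: it only handles links that start with `http://`, `https://` or `www.`.

```python
import re

# Illustrative pattern (an assumption, not the notebook's exact regex):
# matches http://..., https://... and www.... up to the next whitespace
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Remove URL-like substrings and collapse the leftover whitespace."""
    return ' '.join(URL_PATTERN.sub('', text).split())

# Made-up example tweets
print(strip_urls('great flight! http://t.co/abc123 thanks'))
print(strip_urls('see www.example.com for details'))
```

Removing the link before stripping punctuation matters: otherwise fragments such as `http` or `www` would survive as ordinary tokens.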

In [9]:
# Remove user names (@handles), leftover URLs, and every character that is not a letter, digit or space

df['text'] = df['text'].apply(lambda x: ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split()))

In [10]:
# Remove punctuation, emojis, numbers, etc.

df['text'] = df['text'].apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))
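
The effect of the two cells above can be traced on a single made-up tweet. This sketch reuses the same two regular expressions, written as raw strings; the helper name `clean` and the sample tweets are hypothetical.

```python
import re

# Same patterns as in the two cells above, written as raw strings
MENTION_OR_SYMBOL = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
NON_LETTER = r"[^a-z\s\(\-:\)\\\/\];='#]"

def clean(text):
    # Replace handles and non-alphanumeric characters with spaces,
    # then collapse runs of whitespace
    text = ' '.join(re.sub(MENTION_OR_SYMBOL, ' ', text).split())
    # Drop what is left that is not a lowercase letter or whitespace
    # (in practice digits, since the text is already lowercased)
    return re.sub(NON_LETTER, '', text)

sample = "@united flight 123 was late!! #fail :("  # hypothetical tweet
print(repr(clean(sample)))

# A handle containing an underscore is only partly removed, because
# the handle pattern stops at the first non-alphanumeric character
print(repr(clean("@some_user thanks")))
```

Two details are visible here. Removing the digits leaves a double space behind, which is harmless because the tokenizer in the next step splits on whitespace. And the part of a handle after an underscore survives as a regular word, so extending the handle pattern to include `_` may be worth considering.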

In [11]:
# Tokenize the Tweet and remove the stop words, and lemmatize the remaining words

# Initialize the tweet tokenizer
tknzr = TweetTokenizer()

# Initialize the stop words
stop_words = set(stopwords.words('english'))

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

def tokenize_tweet(tweet):
    # Tokenize the tweet
    tokens = tknzr.tokenize(tweet)
    # Drop the stop words and lemmatize the remaining tokens
    filtered_tweet = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Join the remaining tokens back into a single string
    return ' '.join(filtered_tweet)

# Apply the function to the tweets
df['text'] = df['text'].apply(tokenize_tweet)

In [12]:
# View the modified dataset

df.head()

Unnamed: 0,airline_sentiment,airline,text
0,neutral,Virgin America,said
1,positive,Virgin America,plus added commercial experience tacky
2,neutral,Virgin America,today must mean need take another trip
3,negative,Virgin America,really aggressive blast obnoxious entertainmen...
4,negative,Virgin America,really big bad thing


### 4.2 Change the Sentiment column to Numerical values

In [13]:
sentiment_map = {
    'negative' : 0,
    'neutral' : 1,
    'positive' : 2
}

# Map each sentiment label to its integer code
df['airline_sentiment'] = df['airline_sentiment'].map(sentiment_map)
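
As a sanity check on the label encoding, the sketch below applies the same `sentiment_map` to a small made-up Series. It uses `Series.map`, which returns `NaN` for any label that is absent from the dictionary (whereas `replace` leaves such labels unchanged), so unexpected labels are easy to spot. The `labels` values, including `'mixed'`, are hypothetical.

```python
import pandas as pd

sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}

# Hypothetical labels; 'mixed' is deliberately not in the mapping
labels = pd.Series(['neutral', 'positive', 'negative', 'mixed'])

encoded = labels.map(sentiment_map)

# Labels without a dictionary entry become NaN
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)

# Once the unmapped rows are dealt with, the codes can be cast to integers
clean_codes = encoded.dropna().astype('int64')
print(clean_codes.tolist())
```

In this dataset the three labels listed earlier are the only values in `airline_sentiment`, so no unmapped labels are expected.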

In [14]:
df.head()

Unnamed: 0,airline_sentiment,airline,text
0,1,Virgin America,said
1,2,Virgin America,plus added commercial experience tacky
2,1,Virgin America,today must mean need take another trip
3,0,Virgin America,really aggressive blast obnoxious entertainmen...
4,0,Virgin America,really big bad thing


## 5. Saving the Dataset

In [35]:
# This is the information on the modified dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   airline_sentiment  14640 non-null  int64 
 1   airline            14640 non-null  object
 2   text               14640 non-null  object
dtypes: int64(1), object(2)
memory usage: 343.2+ KB


In [36]:
# We save this modified dataset into a CSV file

df.to_csv('airline_tweet_processed.csv', index=False)
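
One thing worth checking before relying on the saved file: a tweet that consists only of handles, stop words or symbols may end up as an empty string after preprocessing, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below illustrates this round trip on a made-up frame, using an in-memory buffer in place of a file; the `processed` rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical processed rows; the second text is empty after cleaning
processed = pd.DataFrame({
    'airline_sentiment': [1, 0],
    'text': ['said', ''],
})

# Count the empty texts before saving
n_empty = int((processed['text'].str.strip() == '').sum())
print(n_empty)

# Round trip through CSV, using an in-memory buffer instead of a file
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value
n_missing = int(reloaded['text'].isna().sum())
print(n_missing)
```

If the count is non-zero for the real data, the affected rows can either be dropped before saving or handled when the file is loaded, for example by filling missing texts with an empty string.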