# Importing the required libraries to load and clean the dataset

The model is trained on the airline-tweets dataset, which contains tweets in which passengers describe their experience with an airline's service. The target feature is airline_sentiment, the rating the model predicts. It takes three values: negative, neutral and positive, which are label encoded as 0, 1 and 2 respectively before training.

In [77]:
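import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# word_tokenize and stopwords rely on NLTK data packages that must be
# downloaded once with nltk.download()
# Load the airline tweets into a DataFrame
df = pd.read_csv('Tweets.csv')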

In [78]:
df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


# Removing unwanted columns to reduce the size of the dataset

Dropping the columns that are not needed makes the dataset easier to clean and work with, and keeps unrelated fields out of the data passed to the model. Only three columns are required: airline_sentiment, airline_sentiment_confidence and text.

In [79]:
df.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [80]:
df = df[['airline_sentiment', 'airline_sentiment_confidence','text']]

In [81]:
df.shape

(14640, 3)

# Selecting the rows which have 'airline_sentiment_confidence' greater than 0.9 so as to increase the accuracy of the model

In [82]:
df[df['airline_sentiment_confidence'] > 0.9].shape

(10458, 3)

In [83]:
# Take a copy so that adding columns later does not raise SettingWithCopyWarning
df = df[df['airline_sentiment_confidence'] > 0.9].copy()

In [84]:
df

Unnamed: 0,airline_sentiment,airline_sentiment_confidence,text
0,neutral,1.0,@VirginAmerica What @dhepburn said.
3,negative,1.0,@VirginAmerica it's really aggressive to blast...
4,negative,1.0,@VirginAmerica and it's a really big bad thing...
5,negative,1.0,@VirginAmerica seriously would pay $30 a fligh...
9,positive,1.0,"@VirginAmerica it was amazing, and arrived an ..."
...,...,...,...
14631,negative,1.0,@AmericanAir thx for nothing on getting us out...
14633,negative,1.0,@AmericanAir my flight was Cancelled Flightled...
14636,negative,1.0,@AmericanAir leaving over 20 minutes Late Flig...
14637,neutral,1.0,@AmericanAir Please bring American Airlines to...


In [85]:
# NLTK names its stopword lists in lowercase
list_of_stopwords = stopwords.words('english')

# Writing a for loop to reduce the complexity of the tweets

A for loop removes the unwanted tokens from each tweet. It drops special characters and the airline names, which carry no sentiment, and removes all stopwords to shrink the vocabulary the model has to learn. The remaining words of each tweet are joined into a string, and the strings are stored in a new column named clean_tweets.

In [86]:
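clean_tweets_list = []
# Airline handles, written in lowercase to match the lowercased tokens
airline_names = ['virginamerica', 'jetblue', 'united', 'americanair', 'usairways']
special_characters = ['|', '@', '#', '$', '%', '^', '&', '*', ':']
for current_tweet in df['text'].values:
    list_of_words = word_tokenize(current_tweet)
    clean_tweet = ''
    for each_word in list_of_words:
        each_word = each_word.lower()
        # Keep the token only if it is not a stopword, an airline name or a special character
        if (each_word not in list_of_stopwords
                and each_word not in airline_names
                and each_word not in special_characters):
            # NOTE: the '' separator concatenates tokens with no space between them,
            # so each cleaned tweet becomes one fused string (visible in the
            # clean_tweets column printed afterwards). Use ' ' to keep words apart.
            clean_tweet = clean_tweet + '' + each_word
    clean_tweets_list.append(clean_tweet)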

In [87]:
df['clean_tweets'] = clean_tweets_list

In [88]:
df.head(60)

Unnamed: 0,airline_sentiment,airline_sentiment_confidence,text,clean_tweets
0,neutral,1.0,@VirginAmerica What @dhepburn said.,dhepburnsaid.
3,negative,1.0,@VirginAmerica it's really aggressive to blast...,'sreallyaggressiveblastobnoxious``entertainmen...
4,negative,1.0,@VirginAmerica and it's a really big bad thing...,'sreallybigbadthing
5,negative,1.0,@VirginAmerica seriously would pay $30 a fligh...,seriouslywouldpay30flightseatsn'tplaying.'srea...
9,positive,1.0,"@VirginAmerica it was amazing, and arrived an ...","amazing,arrivedhourearly.'regood."
11,positive,1.0,@VirginAmerica I &lt;3 pretty graphics. so muc...,lt;3prettygraphics.muchbetterminimaliconography.
12,positive,1.0,@VirginAmerica This is such a great deal! Alre...,greatdeal!alreadythinking2ndtripaustraliaamp;n...
14,positive,1.0,@VirginAmerica Thanks!,thanks!
16,positive,1.0,@VirginAmerica So excited for my first cross c...,excitedfirstcrosscountryflightlaxmco'veheardno...
17,negative,1.0,@VirginAmerica I flew from NYC to SFO last we...,flewnycsfolastweekcouldn'tfullysitseatduetwola...
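
The clean_tweets values above have no spaces between words because the loop joins tokens with an empty string. A minimal sketch of a variant that keeps token boundaries follows. To stay self-contained it swaps word_tokenize for a simple regular-expression tokenizer and uses a tiny hand-picked stopword set, so its output will not match NLTK's exactly; `clean_tweet`, `SMALL_STOPWORDS` and `AIRLINE_NAMES` are illustrative names, not part of the notebook.

```python
import re

# Hypothetical hand-picked subset; the notebook uses NLTK's full English list.
SMALL_STOPWORDS = {'what', 'it', 'was', 'and', 'an', 'a', 'to', 'the', 'is'}
AIRLINE_NAMES = {'virginamerica', 'jetblue', 'united', 'americanair', 'usairways'}

def clean_tweet(text):
    """Lowercase, tokenize, drop stopwords and airline names, join with spaces."""
    # [a-z0-9']+ keeps letters, digits and apostrophes; characters such as
    # '@', '#' and '.' are dropped by the tokenizer itself.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [t for t in tokens if t not in SMALL_STOPWORDS and t not in AIRLINE_NAMES]
    return ' '.join(kept)   # a single space keeps the words separable

print(clean_tweet('@VirginAmerica What @dhepburn said.'))   # dhepburn said
print(clean_tweet('@VirginAmerica Thanks!'))                # thanks
```

With space-separated output, a vectorizer can split each cleaned tweet back into individual words instead of treating the fused string as one token.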


# Converting the tweets into a sparse matrix so as to train the model

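CountVectorizer turns each tweet into a row of word counts, with one column per word in the vocabulary. Because a tweet contains only a handful of the vocabulary's words, almost every entry is zero. Storing the result as a sparse matrix, which records only the non-zero entries, saves a significant amount of memory and speeds up processing.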

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

In [90]:
cv = CountVectorizer()

In [91]:
cv.fit(df['clean_tweets'])

CountVectorizer()

In [92]:
sparse_matrix = cv.transform(df['clean_tweets']) 

In [93]:
sparse_matrix 

<10458x24821 sparse matrix of type '<class 'numpy.int64'>'
	with 31755 stored elements in Compressed Sparse Row format>
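
The summary above can be read as arithmetic: the matrix has 10458 × 24821 cells but stores only 31755 of them. A rough sketch of that calculation follows, together with a toy bag-of-words built with `collections.Counter` to show what storing only the non-zero entries means. The toy corpus and the helper name `to_sparse_rows` are made up for illustration, and CountVectorizer's internals differ.

```python
from collections import Counter

# Figures taken from the sparse_matrix summary printed by the notebook.
n_rows, n_cols, stored = 10458, 24821, 31755
density = stored / (n_rows * n_cols)
print(f"density: {density:.6%}")                    # share of cells that are non-zero
print(f"stored per tweet: {stored / n_rows:.2f}")   # average non-zero entries per row

def to_sparse_rows(docs):
    """Return (vocabulary, rows), where each row maps column index -> count."""
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}
    rows = [dict(Counter(vocab[w] for w in d.split())) for d in docs]
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset.
docs = ['great flight great crew', 'late flight', 'lost bag']
vocab, rows = to_sparse_rows(docs)
dense_cells = len(docs) * len(vocab)
sparse_cells = sum(len(r) for r in rows)
print(vocab)
print(rows)
print(f"{sparse_cells} stored of {dense_cells} cells")
```

For the notebook's matrix this works out to roughly 0.012% of cells being non-zero, or about three stored entries per tweet.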

# Importing LabelEncoder to encode the rating

LabelEncoder assigns integers to the classes in sorted order, so the airline_sentiment values negative, neutral and positive are encoded as 0, 1 and 2 respectively.

In [94]:
from sklearn.preprocessing import LabelEncoder

In [95]:
le = LabelEncoder()

In [96]:
df['airline_sentiment'] = le.fit_transform(df['airline_sentiment'])

In [97]:
df

Unnamed: 0,airline_sentiment,airline_sentiment_confidence,text,clean_tweets
0,1,1.0,@VirginAmerica What @dhepburn said.,dhepburnsaid.
3,0,1.0,@VirginAmerica it's really aggressive to blast...,'sreallyaggressiveblastobnoxious``entertainmen...
4,0,1.0,@VirginAmerica and it's a really big bad thing...,'sreallybigbadthing
5,0,1.0,@VirginAmerica seriously would pay $30 a fligh...,seriouslywouldpay30flightseatsn'tplaying.'srea...
9,2,1.0,"@VirginAmerica it was amazing, and arrived an ...","amazing,arrivedhourearly.'regood."
...,...,...,...,...
14631,0,1.0,@AmericanAir thx for nothing on getting us out...,thxnothinggettinguscountrybackus.brokenplane?c...
14633,0,1.0,@AmericanAir my flight was Cancelled Flightled...,"flightcancelledflightled,leavingtomorrowmornin..."
14636,0,1.0,@AmericanAir leaving over 20 minutes Late Flig...,leaving20minuteslateflight.warningscommunicati...
14637,1,1.0,@AmericanAir Please bring American Airlines to...,pleasebringamericanairlinesblackberry10
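
The mapping depends on the classes being numbered in sorted order. A small plain-Python sketch of the same idea, using a hypothetical `encode_labels` helper rather than scikit-learn itself:

```python
def encode_labels(labels):
    """Map each distinct label to an integer, numbering classes in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

# First five sentiments of the filtered dataframe (rows 0, 3, 4, 5 and 9).
sentiments = ['neutral', 'negative', 'negative', 'negative', 'positive']
classes, encoded = encode_labels(sentiments)
print(classes)   # ['negative', 'neutral', 'positive']
print(encoded)   # [1, 0, 0, 0, 2]
```

On the fitted encoder, `le.classes_` lists the classes in this same order.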


# Using train_test_split to split the dataset into training and testing sets

In [98]:
from sklearn.model_selection import train_test_split

In [99]:
X_train, X_test, Y_train, Y_test = train_test_split(sparse_matrix , df['airline_sentiment'])

In [100]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((7843, 24821), (2615, 24821), (7843,), (2615,))
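
train_test_split was called without test_size, so scikit-learn's default of 0.25 applies and the test-set size is rounded up. The sketch below reproduces the shapes above with plain arithmetic and shows a seeded shuffle split on row indices. `split_indices` is an illustrative helper, not scikit-learn's implementation.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=0):
    """Shuffle row indices with a fixed seed and use the last n_test as the test set."""
    n_test = math.ceil(test_size * n_samples)   # test size is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:-n_test], indices[-n_test:]

train_idx, test_idx = split_indices(10458)
print(len(train_idx), len(test_idx))   # 7843 2615
```

In the notebook itself, passing `random_state` to train_test_split makes the split repeatable between runs, which also makes the reported accuracy reproducible.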

# Using MultinomialNB to train the model

Naive Bayes predicts the tag of a text by calculating the probability of each tag given the text and returning the tag with the highest probability. It scales well with the number of predictors and data points, is fast enough to make real-time predictions, and tends to need less training data than more complex models.
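
The step of scoring each tag and picking the highest can be sketched in plain Python. This is a simplified multinomial Naive Bayes with add-one (Laplace) smoothing, written for illustration on a made-up four-tweet corpus. The class name `TinyNB` is hypothetical, and the code is not how MultinomialNB is implemented internally.

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, doc):
        """Unnormalised log P(class) + sum of log P(word | class), per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for word in doc.split():
                if word not in self.vocab:
                    continue   # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)   # class with the highest score

# Made-up training data for illustration only.
docs = ['love great crew', 'great flight', 'late flight', 'lost bag late']
labels = ['positive', 'positive', 'negative', 'negative']
model = TinyNB().fit(docs, labels)
print(model.log_scores('great crew'))
print(model.predict('great crew'))   # positive
print(model.predict('late bag'))     # negative
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared with a class from forcing that class's score to zero.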

In [101]:
from sklearn.naive_bayes import MultinomialNB

In [102]:
nv = MultinomialNB()

In [103]:
nv.fit(X_train,Y_train)

MultinomialNB()

# Checking the accuracy of the model

In [104]:
from sklearn.metrics import accuracy_score

In [105]:
Y_pred = nv.predict(X_test)

In [106]:
accuracy_score(Y_test,Y_pred)

0.682982791586998
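
accuracy_score reports the fraction of test tweets whose predicted label matches the true label. A plain-Python sketch of that calculation is given here, alongside the accuracy of always predicting the most common class as a reference point. The label lists are made up and the helper names are hypothetical; none of the numbers are notebook results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent true label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

# Made-up labels (0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 2, 0]
print(accuracy(y_true, y_pred))     # 0.7
print(majority_baseline(y_true))    # 0.6
```

Comparing the two numbers indicates how much of the accuracy comes from the model rather than from one class dominating the data.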

# Testing the model

In [107]:
user_sentence = 'love it'

In [108]:
user_transformed_input = cv.transform([user_sentence])

In [109]:
nv.predict(user_transformed_input)

array([2])
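
The prediction is the encoded class, so 2 corresponds to positive. A short sketch of the reverse lookup in plain Python follows. `decode` is an illustrative helper; with the fitted encoder, `le.inverse_transform` serves the same purpose.

```python
# Classes in the sorted order that label encoding numbers as 0, 1, 2.
classes = ['negative', 'neutral', 'positive']

def decode(predicted_codes):
    """Translate integer predictions back to sentiment names."""
    return [classes[code] for code in predicted_codes]

print(decode([2]))   # ['positive']
```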