## Project: Twitter US Airline Sentiment

### Background and Context:

Twitter has 330 million monthly active users, which lets businesses reach a broad audience and connect with customers without intermediaries. On the other hand, the sheer volume of information makes it difficult for brands to quickly detect negative social mentions that could harm their business.

That's why sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing.


Listening to how customers feel about the product/service on Twitter allows companies to understand their audience, keep on top of what’s being said about their brand and their competitors, and discover new trends in the industry.

 

### Data Description:

The data comes from a sentiment analysis job about the problems of each major U.S. airline. Tweets from February 2015 were scraped, and contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service").

 

### Dataset:

The dataset has the following columns:

- `tweet_id`
- `airline_sentiment`
- `airline_sentiment_confidence`
- `negativereason`
- `negativereason_confidence`
- `airline`
- `airline_sentiment_gold`
- `name`
- `negativereason_gold`
- `retweet_count`
- `text`
- `tweet_coord`
- `tweet_created`
- `tweet_location`
- `user_timezone`

### Data Summary
Add your view and opinion along with the problem statement, shape of the data, data description.
1. Import the libraries, load dataset, print the shape of data, data description. (4 Marks)

In [21]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [3]:
## Add your view 

# ********************************

In [4]:
data = pd.read_csv("Tweets.csv")

data_copy = data.copy()

In [5]:
data.shape

(14640, 15)

In [6]:
## Insights

In [7]:
data.describe()

| | tweet_id | airline_sentiment_confidence | negativereason_confidence | retweet_count |
|---|---|---|---|---|
| count | 14640.0 | 14640.0 | 10522.0 | 14640.0 |
| mean | 5.692184e+17 | 0.900169 | 0.638298 | 0.08265 |
| std | 7.791112e+14 | 0.16283 | 0.33044 | 0.745778 |
| min | 5.675883e+17 | 0.335 | 0.0 | 0.0 |
| 25% | 5.685592e+17 | 0.6923 | 0.3606 | 0.0 |
| 50% | 5.694779e+17 | 1.0 | 0.6706 | 0.0 |
| 75% | 5.698905e+17 | 1.0 | 1.0 | 0.0 |
| max | 5.703106e+17 | 1.0 | 1.0 | 44.0 |
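
The `count` row doubles as a quick missing-value check: `negativereason_confidence` has 10522 non-null entries against 14640 rows, so 14640 - 10522 = 4118 values are missing in that column. A minimal sketch of the same check, run on a small hypothetical frame (`toy` and its values are invented for illustration; on the real data the call would presumably be `data.isna().sum()`):

```python
import pandas as pd

# Hypothetical miniature stand-in for the tweets data (values are made up).
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "neutral", "negative"],
    "negativereason_confidence": [0.7, None, None, 1.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# describe()'s "count" row reports non-null entries, the same as DataFrame.count().
non_null_counts = toy.count()

print(missing_per_column)
print(len(toy) - non_null_counts["negativereason_confidence"])
```

The difference between the row count and a column's `count` gives the number of missing entries in that column.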


In [8]:
## Insights

In [9]:
data.dtypes

tweet_id                          int64
airline_sentiment                object
airline_sentiment_confidence    float64
negativereason                   object
negativereason_confidence       float64
airline                          object
airline_sentiment_gold           object
name                             object
negativereason_gold              object
retweet_count                     int64
text                             object
tweet_coord                      object
tweet_created                    object
tweet_location                   object
user_timezone                    object
dtype: object

In [10]:
## Insights

### EDA
2. Perform exploratory data analysis (EDA) based on the statements below. (9 Marks)
      a. Plot the distribution of all tweets among each airline & plot the distribution of sentiment across all the tweets. 
      b. Plot the distribution of Sentiment of tweets for each airline & plot the distribution of all the negative reasons. 
      c. Plot the word cloud graph of tweets for positive and negative sentiment separately.
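
Parts a and b reduce to frequency counts that can then be drawn as bar charts. A minimal sketch on a small hypothetical frame (`sample` and its rows are invented for illustration; in this notebook the full-column copy `data_copy` would be the natural input, since `airline` is later dropped from `data`):

```python
import pandas as pd

# Hypothetical miniature stand-in for Tweets.csv (rows are made up).
sample = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta", "Southwest"],
    "airline_sentiment": ["negative", "neutral", "positive",
                          "negative", "negative", "positive"],
})

# a. Number of tweets per airline, and number of tweets per sentiment label.
tweets_per_airline = sample["airline"].value_counts()
sentiment_counts = sample["airline_sentiment"].value_counts()

# b. Sentiment counts broken down by airline (rows: airline, columns: sentiment).
sentiment_by_airline = pd.crosstab(sample["airline"], sample["airline_sentiment"])

print(tweets_per_airline)
print(sentiment_counts)
print(sentiment_by_airline)
```

Each result can be drawn with its `.plot(kind="bar")` method, assuming matplotlib is installed.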

### 3. Understanding of data columns: (3 Marks)
     a. Drop all other columns except “text” and “airline_sentiment”.
     b. Check the shape of the data.
     c. Print the first 5 rows of data.

In [11]:
data.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [12]:
data = data.drop(['tweet_id',  'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],axis = 1)

In [13]:
data.head()

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |


In [14]:
data.shape

(14640, 2)

**Insights**

### 4. Text pre-processing: Data preparation. (12 Marks)
NOTE:- Each text pre-processing step should be mentioned in the notebook separately.<br>
     a. Html tag removal.<br>
     b. Tokenization.<br>
     c. Remove the numbers.<br>
     d. Removal of Special Characters and Punctuations.<br>
     e. Removal of stopwords<br>
     f. Conversion to lowercase.<br>
     g. Lemmatize or stemming.<br>
     h. Join the words in the list to convert back to text string in the data frame. (So that each row
          contains the data in text format.)<br>
     i. Print the first 5 rows of data after pre-processing.

In [15]:
## a. Html tag removal.

from bs4 import BeautifulSoup
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text



data['text'] = data['text'].apply(strip_html_tags)

data.head()

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |


In [16]:
## c. Remove the numbers.

def remove_digits(text):
    pattern = r'[0-9]'
    text = re.sub(pattern, '', str(text))
    return text

data['text'] = data['text'].apply(remove_digits)

data.head()

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |


In [17]:
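## d. Removal of Special Characters and Punctuations.

def remove_special_characters(text):
    # Keep only ASCII letters, digits, and whitespace (the range is A-Z, not A-z).
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text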

data['text'] = data['text'].apply(remove_special_characters)

data.head()



| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | VirginAmerica What dhepburn said |
| 1 | positive | VirginAmerica plus youve added commercials to ... |
| 2 | neutral | VirginAmerica I didnt today Must mean I need t... |
| 3 | negative | VirginAmerica its really aggressive to blast o... |
| 4 | negative | VirginAmerica and its a really big bad thing a... |
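
One detail worth checking in character-class patterns like the one in this cell: ranges follow ASCII order, so `A-z` spans codes 65 to 122 and also covers `[`, `\`, `]`, `^`, `_` and the backtick, whereas `A-Z` covers only the uppercase letters. A minimal sketch comparing the two (the names `loose`, `strict`, and `sample` are hypothetical and used only for this illustration):

```python
import re

# A-z also covers the punctuation between 'Z' and 'a' in ASCII.
loose = r'[^a-zA-z0-9\s]'
# A-Z covers uppercase letters only.
strict = r'[^a-zA-Z0-9\s]'

sample = "fly_high [gate] ^up"

loose_out = re.sub(loose, '', sample)    # underscore, brackets, caret survive
strict_out = re.sub(strict, '', sample)  # only letters, digits, whitespace remain

print(repr(loose_out))
print(repr(strict_out))
```

With the `A-Z` range, the underscore, brackets, and caret are removed along with the rest of the punctuation.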


In [18]:
## b. Tokenization.

def Tokenize_text(text):
    tokenizer=ToktokTokenizer()
    tokens=tokenizer.tokenize(text)
    return tokens

data['text'] = data['text'].apply(Tokenize_text)

data.head()

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | [VirginAmerica, What, dhepburn, said] |
| 1 | positive | [VirginAmerica, plus, youve, added, commercial... |
| 2 | neutral | [VirginAmerica, I, didnt, today, Must, mean, I... |
| 3 | negative | [VirginAmerica, its, really, aggressive, to, b... |
| 4 | negative | [VirginAmerica, and, its, a, really, big, bad,... |


In [20]:
## f. Conversion to lowercase.

def to_lower(text):
    mod_text = []
    for word in text:
        lower_word = word.lower()
        mod_text.append(lower_word)
    return mod_text

data['text'] = data['text'].apply(to_lower)

data.head()

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | [virginamerica, what, dhepburn, said] |
| 1 | positive | [virginamerica, plus, youve, added, commercial... |
| 2 | neutral | [virginamerica, i, didnt, today, must, mean, i... |
| 3 | negative | [virginamerica, its, really, aggressive, to, b... |
| 4 | negative | [virginamerica, and, its, a, really, big, bad,... |


In [None]:
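## e. Removal of stopwords

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
# Build the set once so every membership test is a constant-time lookup.
stopword_set = set(stopwords.words('english'))

def remove_stop_words(text):
    filtered_data = [token for token in text if token not in stopword_set]
    return filtered_data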

data['text'] = data['text'].apply(remove_stop_words)

data.head()

In [None]:

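## g. Lemmatize or stemming.

# Requires the model: python -m spacy download en_core_web_sm
# The legacy parse/tag/entity keyword arguments are dropped; current spaCy versions do not accept them.
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    doc = nlp(text)
    # spaCy 2.x lemmatizes pronouns to '-PRON-'; keep the original word in that case.
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])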

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")
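
At this point in the notebook the `text` column holds token lists, while `lemmatize_text` takes a single string, so the tokens need to be joined (step h) before lemmatizing. A minimal sketch of that bridge, using a tiny hypothetical lookup table (`TOY_LEMMAS`) as a stand-in for spaCy so the example runs without a model download; the helper names are also invented for illustration:

```python
# Hypothetical stand-in for a real lemmatizer such as spaCy.
TOY_LEMMAS = {"keeps": "keep", "crashing": "crash",
              "flights": "flight", "delayed": "delay"}

def toy_lemmatize(text):
    """Lemmatize a whitespace-separated string using the toy lookup table."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in text.split())

def tokens_to_lemmatized_text(tokens):
    """Join a token list into one string (step h), then lemmatize it (step g)."""
    return toy_lemmatize(" ".join(tokens))

# Two rows shaped like the token lists produced by the earlier steps.
rows = [["system", "keeps", "crashing"], ["flights", "delayed", "again"]]
processed = [tokens_to_lemmatized_text(tokens) for tokens in rows]

print(processed)
```

On the DataFrame, the same idea would presumably be applied with `data['text'].apply(...)`, swapping `toy_lemmatize` for `lemmatize_text`, followed by `data.head()` for step i.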