# Project Description: Twitter US Airline Sentiment, Natural Language Processing



<b>Data Description:</b> 

A sentiment analysis task on the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked first to classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service").


<b>Dataset:</b> 

The dataset comes from Kaggle and has to be downloaded from the Kaggle project site: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

The dataset has the following columns:
- tweet_id 
- airline_sentiment 
- airline_sentiment_confidence 
- negativereason
- negativereason_confidence 
- airline  
- airline_sentiment_gold  
- name 
- negativereason_gold 
- retweet_count  
- text  
- tweet_coord  
- tweet_created  
- tweet_location 
- user_timezone


<b>Objective:</b> 

To implement the techniques learnt as a part of the course.


<b>Learning Outcomes:</b>
- Basic understanding of text pre-processing.
- What to do after text pre-processing:
    - Bag of words
    - Tf-idf
- Build the classification model.
- Evaluate the model.

<b>Tasks:</b>

1. Import the libraries, load dataset, print shape of data, data description (5 Marks)
2. Understanding of data columns (5 Marks)<br>     
        a. Drop all other columns except “text” and “airline_sentiment”.     
        b. Check the shape of data.     
        c. Print first 5 rows of data.
3. Text pre-processing: Data preparation (20 Marks)
    
        a. Html tag removal.    
        b. Tokenization.     
        c. Remove the numbers.     
        d. Removal of Special Characters and Punctuations.     
        e. Conversion to lowercase.     
        f. Lemmatize or stemming.    
        g. Join the words in the list to convert back to text string in the dataframe. (So that each row contains the data in text format.)     
        h. Print first 5 rows of data after pre-processing. 
4. Vectorization (10 Marks)<br>
        a. Use CountVectorizer.<br>     
        b. Use TfidfVectorizer.<br> 
5. Fit and evaluate model using both types of vectorization (6+6 Marks)
6. Summarize your understanding of the application of the various pre-processing and vectorization techniques and the performance of your model on this dataset (8 Marks)
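
As a rough illustration of what the two vectorizers in task 4 compute, the sketch below builds a bag-of-words matrix and a tf-idf matrix by hand using only the standard library. The toy corpus `docs` and the helper `l2_normalize` are hypothetical and not part of the assignment. The idf formula and the L2 row normalization follow the defaults of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), while tokenization here is a plain whitespace split, so the numbers will not match scikit-learn exactly on real tweets.

```python
import math
from collections import Counter

# Hypothetical toy corpus; in the notebook this would be the cleaned "text" column.
docs = [
    "flight was late",
    "great flight great crew",
    "late again",
]

# Whitespace tokenization (an assumption made to keep the sketch short).
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of words: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in the document.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocab])

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of all zeros are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


# Tf-idf: term count times idf, followed by L2 normalization of each row.
tfidf = [
    l2_normalize([count * idf[term] for count, term in zip(row, vocab)])
    for row in bow
]

print(vocab)
print(bow)
```

In the tf-idf rows, a term that appears in fewer documents (such as "was") receives a larger weight than an equally frequent term that appears in many documents (such as "flight"), which is the main difference from raw counts.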



### Import the libraries, load dataset, print shape of data, data description 

In [1]:
#import libraries
import numpy as np
import pandas as pd
import re, string, unicodedata
import nltk
import textsearch
import contractions
from bs4 import BeautifulSoup

In [2]:
#import data and display first 5 rows
data = pd.read_csv('Tweets.csv')
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
#display shape of data and description
print('The shape of the data is', data.shape)
display(data.describe())
print(data.info())

The shape of the data is (14640, 15)


Unnamed: 0,tweet_id,airline_sentiment_confidence,negativereason_confidence,retweet_count
count,14640.0,14640.0,10522.0,14640.0
mean,5.692184e+17,0.900169,0.638298,0.08265
std,779111200000000.0,0.16283,0.33044,0.745778
min,5.675883e+17,0.335,0.0,0.0
25%,5.685592e+17,0.6923,0.3606,0.0
50%,5.694779e+17,1.0,0.6706,0.0
75%,5.698905e+17,1.0,1.0,0.0
max,5.703106e+17,1.0,1.0,44.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
tweet_id                        14640 non-null int64
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason_confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 non-null object
tweet_location                  9907 non-null object
user_timezone                   9820 non-null object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB
None


### Understanding of data columns (5 Marks)
 a. Drop all other columns except “text” and “airline_sentiment”.     
 b. Check the shape of data.     
 c. Print first 5 rows of data.

In [4]:
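#create new dataframe with only text and airline sentiment columns
#.copy() makes it an independent dataframe, so later column assignments do not trigger a SettingWithCopyWarning
sentiment = data[['text','airline_sentiment']].copy()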
print('The shape is', sentiment.shape)
display(sentiment.head())

The shape is (14640, 2)


Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials t...,positive
2,@VirginAmerica I didn't today... Must mean I n...,neutral
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative


### Text pre-processing: Data preparation (20 Marks)

 a. Html tag removal.    
 b. Tokenization.     
 c. Remove the numbers.     
 d. Removal of Special Characters and Punctuations.     
 e. Conversion to lowercase.     
 f. Lemmatize or stemming.    
 g. Join the words in the list to convert back to text string in the dataframe. (So that each row contains the data in text format.)     
 h. Print first 5 rows of data after pre-processing. 
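
The steps listed above could be chained roughly as in the sketch below. It is a standard-library approximation, not the notebook's implementation: `clean_tweet` is a hypothetical name, tags are removed with a regular expression rather than BeautifulSoup, tokenization is a whitespace split rather than NLTK's `word_tokenize`, and step f (lemmatizing or stemming) is left out because it needs an external library such as NLTK.

```python
import html
import re


def clean_tweet(text):
    """Apply steps a-e and g to one tweet and return the cleaned string."""
    # a. Remove anything that looks like an HTML tag, then decode entities
    #    such as "&lt;" (regex-based assumption; the notebook uses BeautifulSoup).
    text = re.sub(r"</?[a-zA-Z][^>]*>", " ", text)
    text = html.unescape(text)
    # e. Convert to lowercase.
    text = text.lower()
    # c. Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # d. Remove special characters and punctuation (keep letters and whitespace).
    text = re.sub(r"[^a-z\s]", " ", text)
    # b. Tokenize on whitespace (a simplification of word_tokenize).
    tokens = text.split()
    # f. Lemmatizing or stemming would go here; omitted in this sketch.
    # g. Join the tokens back into a single string.
    return " ".join(tokens)


print(clean_tweet("@VirginAmerica I &lt;3 pretty graphics. so much better!"))
```

Because the punctuation step replaces apostrophes with spaces, a word like "didn't" would come out as "didn t"; expanding contractions before this step avoids that.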


In [5]:
# Each of the next several steps builds one function and tests it on the dataframe to ensure it runs properly and without error,
# then the functions are combined at the end and the result is saved as a new, pre-processed dataframe


In [6]:
#remove html tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

In [7]:
#check to make sure function runs on dataframe
sentiment['text'].apply(strip_html)

0                      @VirginAmerica What @dhepburn said.
1        @VirginAmerica plus you've added commercials t...
2        @VirginAmerica I didn't today... Must mean I n...
3        @VirginAmerica it's really aggressive to blast...
4        @VirginAmerica and it's a really big bad thing...
5        @VirginAmerica seriously would pay $30 a fligh...
6        @VirginAmerica yes, nearly every time I fly VX...
7        @VirginAmerica Really missed a prime opportuni...
8          @virginamerica Well, I didn't…but NOW I DO! :-D
9        @VirginAmerica it was amazing, and arrived an ...
10       @VirginAmerica did you know that suicide is th...
11       @VirginAmerica I <3 pretty graphics. so much b...
12       @VirginAmerica This is such a great deal! Alre...
13       @VirginAmerica @virginmedia I'm flying your #f...
14                                  @VirginAmerica Thanks!
15           @VirginAmerica SFO-PDX schedule is still MIA.
16       @VirginAmerica So excited for my first cross c.

In [8]:
#define function for replacing contractions using the contractions library
def replace_contractions(text):
    return contractions.fix(text)

In [9]:
#check to make sure function runs on dataframe
sentiment['text'].apply(replace_contractions)

0                      @VirginAmerica What @dhepburn said.
1        @VirginAmerica plus you have added commercials...
2        @VirginAmerica I did not today... Must mean I ...
3        @VirginAmerica it is really aggressive to blas...
4        @VirginAmerica and it is a really big bad thin...
5        @VirginAmerica seriously would pay $30 a fligh...
6        @VirginAmerica yes, nearly every time I fly VX...
7        @VirginAmerica Really missed a prime opportuni...
8         @virginamerica Well, I did not…but NOW I DO! :-D
9        @VirginAmerica it was amazing, and arrived an ...
10       @VirginAmerica did you know that suicide is th...
11       @VirginAmerica I &lt;3 pretty graphics. so muc...
12       @VirginAmerica This is such a great deal! Alre...
13       @VirginAmerica @virginmedia I am flying your #...
14                                  @VirginAmerica Thanks!
15           @VirginAmerica SFO-PDX schedule is still MIA.
16       @VirginAmerica So excited for my first cross c.

In [10]:
from nltk.tokenize import word_tokenize

#download the Punkt models that word_tokenize relies on (nltk is already imported above)
nltk.download('punkt')

#tokenize
def tokenizer(text):
    return word_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jharnack\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
#check to make sure function works on dataframe
sentiment['text'].apply(tokenizer)

0           [@, VirginAmerica, What, @, dhepburn, said, .]
1        [@, VirginAmerica, plus, you, 've, added, comm...
2        [@, VirginAmerica, I, did, n't, today, ..., Mu...
3        [@, VirginAmerica, it, 's, really, aggressive,...
4        [@, VirginAmerica, and, it, 's, a, really, big...
5        [@, VirginAmerica, seriously, would, pay, $, 3...
6        [@, VirginAmerica, yes, ,, nearly, every, time...
7        [@, VirginAmerica, Really, missed, a, prime, o...
8        [@, virginamerica, Well, ,, I, didn't…but, NOW...
9        [@, VirginAmerica, it, was, amazing, ,, and, a...
10       [@, VirginAmerica, did, you, know, that, suici...
11       [@, VirginAmerica, I, &, lt, ;, 3, pretty, gra...
12       [@, VirginAmerica, This, is, such, a, great, d...
13       [@, VirginAmerica, @, virginmedia, I, 'm, flyi...
14                           [@, VirginAmerica, Thanks, !]
15       [@, VirginAmerica, SFO-PDX, schedule, is, stil...
16       [@, VirginAmerica, So, excited, for, my, first.
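
The output above shows `word_tokenize` splitting each `@` from the handle that follows it. If handles and hashtags should stay intact, one option is a small regular-expression tokenizer like the sketch below. `TOKEN_PATTERN` and `tweet_tokenize` are hypothetical names introduced for illustration, and the pattern is a simplification rather than a replacement for NLTK's tokenizers (NLTK also provides a `TweetTokenizer` class aimed at this kind of text).

```python
import re

# A token is either a word (optionally prefixed by @ or #, optionally containing
# one apostrophe as in "I'm"), or a single punctuation/symbol character.
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?|[^\w\s]")


def tweet_tokenize(text):
    """Split a tweet into tokens, keeping @handles and #hashtags whole."""
    return TOKEN_PATTERN.findall(text)


print(tweet_tokenize("@VirginAmerica What @dhepburn said."))
```

Keeping the handle as one token makes it easy to drop mentions later with a simple check such as `token.startswith('@')`.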

In [12]:
#remove numbers
def num_remover(text):
    #assumption: the original cell was left unfinished; this body deletes every run of digits from the string
    return re.sub(r'\d+', '', text)