# <div style="text-align: center; background-color:#0084b4; font-family:Impact; color: white; padding: 20px; line-height: 1;border-radius:20px">Twitter Sentiment Analysis 🕊️</div>

<div style="width:100%;text-align:center"> 
<img align=middle src = "https://ichef.bbci.co.uk/news/976/cpsprodpb/13B2F/production/_127678608_gettyimages-1244636244.jpg" width="500px">
</div>

<div style='background-color:#1dcaff;color:white;padding:4px;border-radius:25px;font-family:georgia'>
<h3 style='color:white;font-family:Impact'>&nbsp About Dataset 📑</h3>
</div>

<div style='padding:6px; font-size:16px'>
    <p>The dataset consists of Twitter messages, each paired with an entity and the sentiment of the message toward that entity. The raw labels are Positive, Negative, Neutral, and Irrelevant. Messages that are not relevant to the entity (Irrelevant) are regarded as Neutral, which leaves three classes: positive, negative, and neutral.</p>
    <ul>
        <li><mark><b>Tweet ID:</b></mark> ID of Tweet
        </li>
        <li><mark><b>Entity:</b></mark> Entity that Tweet talks about
        </li>
       <li><mark><b>Sentiment:</b></mark> Sentiment of the tweet text regarding the entity
           <br/>&nbsp Positive, Negative, Neutral, Irrelevant
        </li>
        <li><mark><b>Tweet Content:</b></mark> Tweet Text
        </li>
    </ul>
</div>

<div style='background-color:#1dcaff;color:white;padding:4px;border-radius:25px;font-family:georgia'>
<h3 style='color:white;font-family:Impact'>&nbsp Goal of the Project 🙇‍♀️</h3>
</div>

<div style='padding:6px'>
    <p>The goal of the project is to 📊 <mark>explore</mark> the data (EDA), ⚙️ <mark>perform NLP preprocessing</mark>, and 🤖 <mark>perform ML</mark> to judge the sentiment of each message about its entity.</p>
</div>

<div style='background-color:#1dcaff;color:white;padding:4px;border-radius:25px;font-family:georgia'>
    <h3 style='color:white;font-family:Impact'>&nbsp Table of Contents 🧚</h3>
</div>
<ul style='padding:6px'>
    <a href='#1'><b>1. Import Libraries 📚</b><br/></a>
    <a href='#2'><b>2. Exploratory Data Analysis 📊</b></a>
    <ul>
        <a href='#2.1'><b>2.1 Sentiment Analysis</b><br/></a>
            <ul>
                <a href='#2.1.1'>2.1.1 Distribution of Sentiment<br/></a>
                <a href='#2.1.2'>2.1.2 Distribution of Entity<br></a>
                <a href='#2.1.3'>2.1.3 Sentiment Distribution in Top 3 Entities<br></a>
            </ul>
        <a href='#2.2'><b>2.2 Text Analysis with NLP Preprocessing<br/></b></a>
            <ul>
                <a href='#2.2.1'>2.2.1 NLP Preprocessing<br/></a>
                <a href='#2.2.2'>2.2.2 Positive Sentiment Text Distribution<br/></a>
                <a href='#2.2.3'>2.2.3 Negative Sentiment Text Distribution<br></a>
                <a href='#2.2.4'>2.2.4 Neutral Sentiment Text Distribution<br></a>
            </ul>
    </ul>
    <a href='#3'><b>3. ML Pipeline Modelling 🤖</b></a>
</ul>

<a id="1"></a>
# <div style="text-align: center; background-color: #00aced; font-family:Impact; color: white; padding: 14px; line-height: 1;border-radius:20px">1. Import Libraries 📚</div>

In [None]:
## Remove Warnings ## 
import warnings
warnings.filterwarnings("ignore")

## DATA ## 
import numpy as np 
import pandas as pd 
import re

## NLP ##
import nltk 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

## Visualization ## 
from wordcloud import WordCloud
import matplotlib.pyplot as plt  
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

## ML Modelling ## 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import GridSearchCV

In [None]:
col_names = ['ID', 'Entity', 'Sentiment', 'Content']
train_df = pd.read_csv('/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv', names=col_names)
test_df = pd.read_csv('/kaggle/input/twitter-entity-sentiment-analysis/twitter_validation.csv', names=col_names)

In [None]:
train_df.head()

In [None]:
train_df.isnull().sum()

Since there are <b>686</b> null values in Content (the tweet text), I will drop those rows. 

In [None]:
train_df.dropna(subset=['Content'], inplace=True)

In [None]:
train_df['Sentiment'] = train_df['Sentiment'].replace('Irrelevant', 'Neutral')
test_df['Sentiment'] = test_df['Sentiment'].replace('Irrelevant', 'Neutral')

The cell above replaces <b>Irrelevant</b> with <b>Neutral</b> in both the training and the test data, so three sentiment classes remain.

<a id="2"></a>
# <div style="text-align: center; background-color: #00aced; font-family:Impact; color: white; padding: 14px; line-height: 1;border-radius:20px">2. Exploratory Data Analysis 📊</div>

In this section, I will first <b>analyze the sentiment</b> distribution, both overall and within the top 3 entities, then <b>preprocess the texts with NLP</b> techniques, and lastly <b>visualize the text distribution</b> for each sentiment. 

<div id='2.1' style='background-color:#51C7FF;text-align:center;padding:4px;border-radius:25px'>
    <h3 style='color:white;font-family:Impact'>2.1 Sentiment Analysis </h3>
</div>

<div id='2.1.1' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.1.1 Distribution of Sentiment</h4>
</div>

In [None]:
# sort_index() orders the classes alphabetically: Negative, Neutral, Positive
sentiment_counts = train_df['Sentiment'].value_counts().sort_index()

# Colors follow the same alphabetical order as the sorted index
sentiment_colors = ['red', 'grey', 'green']

fig = go.Figure(data=[go.Pie(labels=sentiment_counts.index, 
                             values=sentiment_counts.values,
                             textinfo='percent+value+label',
                             marker_colors=sentiment_colors,
                             textposition='auto',
                             hole=.3)])

# A pie chart has no x/y axes, so only the title and template are set
fig.update_layout(
    title_text='Sentiment Distribution',
    template='plotly_white'
)

fig.update_traces(marker_line_color='black', 
                  marker_line_width=1.5, 
                  opacity=0.8)
 
fig.show()

Neutral texts make up <b>41.9%</b> of the data, negative texts <b>30.2%</b>, and positive texts <b>27.9%</b>.
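The percentages in the pie chart are each class's count divided by the total number of tweets. Below is a minimal sketch of that arithmetic using only the standard library. The `labels` list is made-up toy data, not taken from the dataset, and the merge of Irrelevant into Neutral mirrors the replacement done earlier in the notebook.

```python
from collections import Counter

# Hypothetical toy labels (not real rows from the dataset)
labels = ["Positive", "Negative", "Neutral", "Irrelevant", "Negative",
          "Neutral", "Irrelevant", "Positive", "Neutral", "Negative"]

# Merge Irrelevant into Neutral, as the notebook does with replace()
merged = ["Neutral" if label == "Irrelevant" else label for label in labels]

counts = Counter(merged)
total = sum(counts.values())

# Share of each class in percent, in alphabetical order like sort_index()
shares = {label: round(100 * n / total, 1) for label, n in sorted(counts.items())}
print(shares)
```

On the real DataFrame, `train_df['Sentiment'].value_counts(normalize=True)` should give the same kind of shares directly.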

<div id='2.1.2' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.1.2 Distribution of Entity</h4>
</div>

In [None]:
top10_entity_counts = train_df['Entity'].value_counts().sort_values(ascending=False)[:10]

fig = px.bar(x=top10_entity_counts.index, 
             y=top10_entity_counts.values,
             color=top10_entity_counts.values,
             text=top10_entity_counts.values,
             color_continuous_scale='Blues')

fig.update_layout(
    title_text='Top 10 Twitter Entity Distribution',
    template='plotly_white',
    xaxis=dict(
        title='Entity',
    ),
    yaxis=dict(
        title='Number of Posts in Twitter',
    )
)

fig.update_traces(marker_line_color='black', 
                  marker_line_width=1.5, 
                  opacity=0.8)
 
fig.show()

Each entity has about the <b>same</b> amount of data. <mark>MaddenNFL, LeagueOfLegends, and CallOfDuty</mark> are the 3 most frequent entities in the dataset.

<div id='2.1.3' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.1.3 Sentiment Distribution in Top 3 Entities</h4>
</div>

In [None]:
top3_entity_df = train_df['Entity'].value_counts().sort_values(ascending=False)[:3]
top3_entity = top3_entity_df.index.tolist()
sentiment_by_entity = train_df.loc[train_df['Entity'].isin(top3_entity)].groupby('Entity')['Sentiment'].value_counts().sort_index()

sentiment_labels = ['Negative', 'Neutral', 'Positive']
sentiment_colors = ['red', 'grey', 'green']

row_n = 1
col_n = 3

fig = make_subplots(rows=row_n, cols=col_n, 
                    specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=top3_entity)

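for i, col in enumerate(top3_entity):
    # Reindex so the values always line up with sentiment_labels and sentiment_colors,
    # even if an entity has no tweets for one of the classes
    entity_counts = sentiment_by_entity[col].reindex(sentiment_labels, fill_value=0)
    fig.add_trace(
        go.Pie(labels=sentiment_labels, 
                values=entity_counts.values, 
                textinfo='percent+value+label',
                marker_colors=sentiment_colors,
                textposition='auto',
                name=col),
            row=i // col_n + 1, col=i % col_n + 1)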
    
fig.update_traces(marker_line_color='black', 
                  marker_line_width=1.5, 
                  opacity=0.8)

fig.show()

<b>71.3%</b> of the tweets about MaddenNFL are negative, while <b>47.5%</b> of the tweets about LeagueOfLegends and <b>44.1%</b> of the tweets about CallOfDuty are neutral. 

<div id='2.2' style='background-color:#51C7FF;text-align:center;padding:4px;border-radius:25px'>
    <h3 style='color:white;font-family:Impact'>2.2 Text Analysis with NLP Preprocessing </h3>
</div>

<div id='2.2.1' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.2.1 NLP Preprocessing</h4>
</div>

In this section, I will perform <mark>NLP Preprocessing</mark> and <mark>visualize texts</mark> for each sentiment. <hr/>
Preprocessing functions:
- <mark><b>get_all_string:</b></mark> joins all texts in a series into one cleaned, lowercased string
- <mark><b>get_word:</b></mark> returns the list of words in a string 
- <mark><b>remove_stopword:</b></mark> removes stopwords such as "the", "is", and "and"
- <mark><b>lemmatize_words:</b></mark> lemmatizes each word (e.g. "games" --> "game") 
- <mark><b>create_freq_df:</b></mark> returns a frequency dataframe given a list of words
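To make the chain of steps concrete, here is a minimal standard-library sketch of the same idea: clean, tokenize, remove stopwords, count. It is an illustration under assumptions, not the notebook's implementation. `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords, the two tweets are made up, the helper names are my own, and the lemmatization step is left out because it needs the WordNet data.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
STOPWORDS = {"the", "is", "and", "a", "i", "this", "to"}

def clean_text(text):
    """Drop URLs, strip non-alphanumeric characters, and lowercase."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^A-Za-z0-9 ]+", "", text)
    return text.lower()

def tokenize(text):
    """Split a cleaned string into word tokens."""
    return re.findall(r"\w+", text)

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

# Made-up example tweets
tweets = [
    "I love this game! https://t.co/abc123",
    "The game is fun and the servers are stable.",
]

# Join with a space so words from adjacent tweets stay separate
tokens = remove_stopwords(tokenize(clean_text(" ".join(tweets))))
term_counts = Counter(tokens)
print(term_counts.most_common(3))
```

The `Counter` plays the role of the frequency dataframe: it maps each remaining term to the number of times it occurs.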

In [None]:
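def get_all_string(sentences): 
    # Join tweets with a space so words from adjacent tweets do not fuse together
    sentence = ' '.join(sentences)
    # Remove URLs before stripping punctuation, while they are still recognizable
    sentence = re.sub(r'http\S+', '', sentence)
    sentence = re.sub('[^A-Za-z0-9 ]+', '', sentence)
    sentence = sentence.lower()
    return sentence 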

def get_word(sentence):
    return nltk.RegexpTokenizer(r'\w+').tokenize(sentence)

def remove_stopword(word_tokens):
    # A set makes each membership test fast compared with scanning a list
    stopword_list = set(stopwords.words('english'))
    return [word for word in word_tokens if word not in stopword_list]

def lemmatize_words(filtered_tokens):
    lemm = WordNetLemmatizer() 
    cleaned_tokens = [lemm.lemmatize(word) for word in filtered_tokens]
    return cleaned_tokens

In [None]:
def create_freq_df(cleaned_tokens): 
    fdist = nltk.FreqDist(cleaned_tokens)
    freq_df = pd.DataFrame.from_dict(fdist, orient='index')
    freq_df.columns = ['Frequency']
    freq_df.index.name = 'Term'
    freq_df = freq_df.sort_values(by=['Frequency'], ascending=False)
    freq_df = freq_df.reset_index()
    return freq_df

In [None]:
def preprocess(series):
    all_string = get_all_string(series)
    words = get_word(all_string)
    filtered_tokens = remove_stopword(words)
    cleaned_tokens = lemmatize_words(filtered_tokens)
    return cleaned_tokens

In [None]:
def plot_text_distribution(x_df, y_df, color, title, xaxis_text, yaxis_text):
    
    fig = px.bar(x=x_df, 
                y=y_df,
                color=y_df,
                text=y_df,
                color_continuous_scale=color)

    fig.update_layout(
        title_text=title,
        template='plotly_white',
        xaxis=dict(
            title=xaxis_text,
        ),
        yaxis=dict(
            title=yaxis_text,
        )
    )

    fig.update_traces(marker_line_color='black', 
                    marker_line_width=1.5, 
                    opacity=0.8)
    
    fig.show()

In [None]:
def create_wordcloud(freq_df, title, color):
    
    data = freq_df.set_index('Term').to_dict()['Frequency']
    
    plt.figure(figsize = (20,15))
    wc = WordCloud(width=800, 
               height=400, 
               max_words=100,
               colormap= color,
               max_font_size=200,
               min_font_size = 1 ,
               random_state=8888, 
               background_color='white').generate_from_frequencies(data)
    
    plt.imshow(wc, interpolation='bilinear')
    plt.title(title, fontsize=20)
    plt.axis('off')
    plt.show()

<div id='2.2.2' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.2.2 Positive Sentiment Text Distribution</h4>
</div>

In [None]:
positive_words = preprocess(train_df.loc[train_df['Sentiment'] == 'Positive']['Content'])
positive_words_df = create_freq_df(positive_words)
top10_positive_words = positive_words_df[:10]

plot_text_distribution(top10_positive_words['Term'], top10_positive_words['Frequency'],
                  'Greens', 'Top 10 Positive Sentiment Text Distribution', 'Text', 'Number of Texts')
create_wordcloud(positive_words_df, 'Positive Sentiment Text Distribution', 'BuGn')

<div id='2.2.3' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.2.3 Negative Sentiment Text Distribution</h4>
</div>

In [None]:
negative_words = preprocess(train_df.loc[train_df['Sentiment'] == 'Negative']['Content'])
negative_words_df = create_freq_df(negative_words)
top10_negative_words = negative_words_df[:10]

plot_text_distribution(top10_negative_words['Term'], top10_negative_words['Frequency'],
                  'Reds', 'Top 10 Negative Sentiment Text Distribution', 'Text', 'Number of Texts')
create_wordcloud(negative_words_df, 'Negative Sentiment Text Distribution', 'OrRd')

<div id='2.2.4' style='background-color:#9FDFFF;padding:2px;border-radius:25px'>
    <h4 style='font-family:Impact; color:black'>&nbsp 2.2.4 Neutral Sentiment Text Distribution</h4>
</div>

In [None]:
neutral_words = preprocess(train_df.loc[train_df['Sentiment'] == 'Neutral']['Content'])
neutral_words_df = create_freq_df(neutral_words)
top10_neutral_words = neutral_words_df[:10]

plot_text_distribution(top10_neutral_words['Term'], top10_neutral_words['Frequency'],
                  'Greys', 'Top 10 Neutral Sentiment Text Distribution', 'Text', 'Number of Texts')
create_wordcloud(neutral_words_df, 'Neutral Sentiment Text Distribution', 'binary_r')

Interestingly, the <b>most frequent</b> word in every sentiment class is <mark><b>game</b></mark>.

<a id="3"></a>
# <div style="text-align: center; background-color: #00aced; font-family:Impact; color: white; padding: 14px; line-height: 1;border-radius:20px">3. ML Pipeline Modelling 🤖</div>

In this section, I will <b>build a pipeline</b> and run a grid search to find the <mark>best hyperparameters for TfidfVectorizer and logistic regression.</mark>
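As background for the vectorizer, the sketch below shows how inverse document frequency down-weights a word that appears in most documents, and how a `max_df` threshold drops such a word entirely. It uses only the standard library and the smoothed formula that scikit-learn documents for its default `smooth_idf=True` setting, `idf = ln((1 + n) / (1 + df)) + 1`. The four documents are made up, and the sketch leaves out term frequency and normalization.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Made-up toy documents
docs = [
    "love this game",
    "hate this game",
    "game servers are down",
    "great update today",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Document frequency: in how many documents each term occurs
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: smooth_idf(len(docs), doc_freq[term]) for term in vocab}

# With max_df=0.5, terms occurring in more than half of the documents are ignored
max_df = 0.5
dropped = [term for term in vocab if doc_freq[term] / len(docs) > max_df]

print(round(idf["game"], 3), round(idf["love"], 3), dropped)
```

Here "game" occurs in three of the four documents, so it gets a lower IDF than "love" and is the only term removed by `max_df=0.5`.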

In [None]:
X_train = train_df['Content']
X_test = test_df['Content']
y_train = train_df['Sentiment']
y_test = test_df['Sentiment']

In [None]:
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english')),
    ('lr_clf', LogisticRegression(solver='liblinear'))
])

params = {'tfidf_vect__ngram_range': [(1,1), (1,2), (1,3)],
          'tfidf_vect__max_df': [0.5, 0.75, 1.0],
          'lr_clf__C': [1, 5, 10]}

grid_cv_pipe = GridSearchCV(pipeline, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_pipe.fit(X_train, y_train)
print('Optimized Hyperparameters: ', grid_cv_pipe.best_params_)

pred = grid_cv_pipe.predict(X_test)
print('Optimized Accuracy Score: {0: .3f}'.format(accuracy_score(y_test, pred)))

The best hyperparameters are <b>{'lr_clf__C': 1, 'tfidf_vect__max_df': 0.5, 'tfidf_vect__ngram_range': (1, 3)}</b>, and the accuracy on the test dataset is <mark><b>96.8%</b></mark>.
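For a sense of how much work the grid search does, the sketch below counts the model fits implied by the parameter grid and `cv=3` used above. It uses only the standard library. The extra single fit assumes `GridSearchCV` keeps its default `refit=True`, which retrains the best candidate on the full training set.

```python
from itertools import product

# Same grid as in the notebook cell above
params = {
    "tfidf_vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf_vect__max_df": [0.5, 0.75, 1.0],
    "lr_clf__C": [1, 5, 10],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
candidates = list(product(*params.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
n_cv_fits = n_candidates * cv_folds

# Assumption: refit=True (the default) adds one final fit on all training data
n_total_fits = n_cv_fits + 1

print(n_candidates, n_cv_fits, n_total_fits)
```

Adding one more value to any of the three lists adds 9 candidates, which is 27 more cross-validation fits, so the grid grows quickly.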