# Twitter US Airline Sentiment Text Generation By Markov Chain

### Here we will generate random and simple sentences based on two criteria:
1. They should be grammatically correct.
2. They should make sense—or at least some sense!

### Data: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [2]:
import nltk
import numpy as np
import pandas as pd
import random
import string
import en_core_web_sm
nlp = en_core_web_sm.load()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import gutenberg
import re
import spacy
import warnings
import markovify
from sqlalchemy import create_engine
#from chatterbot import ChatBot
#from chatterbot.trainers import ListTrainer, ChatterBotCorpusTrainer
#from chatterbot.conversation import Statement
warnings.filterwarnings("ignore")
nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✘ Couldn't link model to 'en'
Creating a symlink in spacy/data failed. Make sure you have the required
permissions and try re-running the command as admin, or use a virtualenv. You
can still import the model as a module and call its load() method, or create the
symlink manually.
C:\Users\User\anaconda3\lib\site-packages\en_core_web_sm -->
C:\Users\User\anaconda3\lib\site-packages\spacy\data\en
⚠ Download successful but linking failed
Creating a shortcut link for 'en' didn't work (maybe you don't have admin
permissions?), but you can still load the model via its full package name: nlp =
spacy.load('en_core_web_sm')
You do not have sufficient privilege to perform this operation.


## Get the data

In [3]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'twitter_sentiment'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df0 = pd.read_sql_query('select * from twitter', con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [87]:
nRow, nCol = df0.shape
print(f'There are {nRow} rows and {nCol} columns')

There are 14640 rows and 16 columns


In [88]:
df0.head(2)

Unnamed: 0,index,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)


In [89]:
# Let's focus on negative and positive sentiments seperately
df1 = df0.sample(4000)
df1_neg = df1[df1['airline_sentiment'] == 'negative']
df1_pos = df1[df1['airline_sentiment'] == 'positive']
#df1_pos = df0.copy()

In [90]:
df1_neg.shape, df1_pos.shape

((2522, 16), (620, 16))

In [91]:
# Utility function for standard text cleaning
def text_cleaner(text):
    text = re.sub(r'--','',text)
    text = re.sub("[\[]*[\]]", "", text)
    text = re.sub("\@", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    text = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)", "", text) # remove links
    text = ' '.join(text.split())
    return text

In [104]:
df1_pos['cleaned'] = df1_pos['text'].astype(str).apply(text_cleaner)
df1_neg['cleaned'] = df1_neg['text'].astype(str).apply(text_cleaner)

In [105]:
# Convert the text in column to a body of text
dialogs_neg = df1_neg['cleaned'].tolist()
dialogs_neg_doc = ''.join(dialogs_neg)

# Convert the text in column to a body of text
dialogs_pos = df1_pos['cleaned'].tolist()
dialogs_pos_doc = ''.join(dialogs_pos)

In [106]:
len(dialogs_neg_doc), len(dialogs_pos_doc)

(272869, 52720)

In [107]:
type(dialogs_neg_doc)

str

In [108]:
# Function to remove emojis
def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

In [109]:
# Remove emojis
dialogs_neg_doc = deEmojify(dialogs_neg_doc)
dialogs_pos_doc = deEmojify(dialogs_pos_doc)

In [110]:
# Adust NLP for a large body of text
nlp.max_length = 6000000 # or even higher
dialogs_neg_doc = nlp(dialogs_neg_doc)
dialogs_pos_doc = nlp(dialogs_pos_doc)

## Break the body to sentences

In [118]:
dialogs_neg_sents = [sent.text for sent in dialogs_neg_doc.sents if len(sent.text) > 1]
dialogs_pos_sents = [sent.text for sent in dialogs_pos_doc.sents if len(sent.text) > 1]

### To generate the transition probabilities, we wil use Markovify's `Text()` class. This class has a parameter called `state_size`. This parameter determines how many words the model uses as the current state. For example, if we want to generate the next word by looking at just the previous word, set `state_size=1`. If we want to generate the next word by looking at the previous two words, then set `state_size=2`. The following is set to `state_size=3`.

In [113]:
dialogs_neg_generator = markovify.Text(dialogs_neg_sents, state_size = 3)
dialogs_pos_generator = markovify.Text(dialogs_pos_sents, state_size = 3)

### At this stage, we've trained a Markov chain model from *twitter*. Now, we're all set to generate random sentences from this model:

In [114]:
# Three randomly generated negative sentences
for i in range(3):
    print(dialogs_neg_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_neg_generator.make_short_sentence(100))

None
None
give you $ voucher, like I want to put a coat in there isn't that my choice?
, I'm so sad for the baggage to begin loading.
I think it's safe to say that JetBlue has officially lost a customer.
but i should still be able to complete the trip.


In [115]:
# Three randomly generated positive sentences
for i in range(3):
    print(dialogs_pos_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_pos_generator.make_short_sentence(100))

None
None
None
I knew there was a frequent tweeter discount
Thank you!SouthwestAir thank you for contacting me.
the entire flight crew on flight from Orlando to Indy was AMAZING!


## The sentences do sound good but not very natural and to improve the performance of the model, we can use some syntactic information like part-of-speech tags. The Markovify package also supports this and can work together with spaCy as follows:

In [119]:
class POSifiedText(markovify.Text):
    
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

### Now, train a Markov chain model again. This time, use the `POSifiedText()` class:

In [120]:
dialogs_neg_generator = POSifiedText(dialogs_neg_sents, state_size = 3)
dialogs_pos_generator = POSifiedText(dialogs_pos_sents, state_size = 3)

In [121]:
# Three randomly generated negative sentences
for i in range(3):
    print(dialogs_neg_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_neg_generator.make_short_sentence(100))

AmericanAir has the worst flights and customer service , irresponsible staff and lack of snacks on the plane .
The pilot told us they would release bags as well as other passengers tickets away .
None
But not everyone is on the ground .....
No window at all .
Was supposed to be in a flight ..


In [122]:
# Three randomly generated positive sentences
for i in range(3):
    print(dialogs_pos_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_pos_generator.make_short_sentence(100))

None
You guys are awesome !
None
favoriteairline # luvforSW # southwestAirunited and to add to reasons why I fly with you
Thank you!SouthwestAir thank you for the credit .
SouthwestAir - I just had a great LA flight with Clarence and Frank !


## These generators work very well. Both negative and positve texts show meaninful semantic and syntax.