# Twitter US Airline Sentiment Text Generation By Markov Chain

### Here we will generate random and simple sentences based on two criteria:
1. They should be grammatically correct.
2. They should make sense—or at least some sense!

### Data: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [5]:
pip install markovify

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/33/92/4036691c7ea53e545e98e0ffffcef357ca19aa2405df366ae5b8b7da391a/markovify-0.8.3.tar.gz
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/74/65/91eab655041e9e92f948cb7302e54962035762ce7b518272ed9d6b269e93/Unidecode-1.1.2-py2.py3-none-any.whl (239kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.8.3-cp36-none-any.whl size=18415 sha256=6b73710f74e45bb7720303c1b1ed971b88e23a0dd60da291f784b8dd97d592ac
  Stored in directory: /root/.cache/pip/wheels/5e/e5/be/8e61715070048813947af5fb32f47b4cf9dddd37c965800bdb
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.8.3 unidecode-1.1.2


In [6]:
import nltk
import numpy as np
import pandas as pd
import random
import string
import en_core_web_sm
nlp = en_core_web_sm.load()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import gutenberg
import re
import spacy
import warnings
import markovify
from sqlalchemy import create_engine
#from chatterbot import ChatBot
#from chatterbot.trainers import ListTrainer, ChatterBotCorpusTrainer
#from chatterbot.conversation import Statement
warnings.filterwarnings("ignore")
nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## Get the data

In [7]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'twitter_sentiment'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df0 = pd.read_sql_query('select * from twitter', con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [8]:
nRow, nCol = df0.shape
print(f'There are {nRow} rows and {nCol} columns')

There are 14640 rows and 16 columns


In [9]:
df0.head(2)

| | index | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |


In [15]:
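# Let's focus on negative and positive sentiments separately
df1 = df0.sample(14000)
# .copy() gives independent frames, so adding a column later does not write to a slice of df1
df1_neg = df1[df1['airline_sentiment'] == 'negative'].copy()
df1_pos = df1[df1['airline_sentiment'] == 'positive'].copy()
#df1_pos = df0.copy()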

In [16]:
df1_neg.shape, df1_pos.shape

((8772, 16), (2245, 16))

In [17]:
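# Utility function for standard text cleaning
def text_cleaner(text):
    # Remove double dashes
    text = re.sub(r'--', '', text)
    # Remove each "]" along with any "[" characters directly before it
    text = re.sub(r"[\[]*[\]]", "", text)
    # Remove "@" signs (the handle text itself is kept)
    text = re.sub(r"@", "", text)
    # Replace standalone numbers (integers or decimals) with a space
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Remove links
    text = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)", "", text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text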
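
To see what the cleaning steps do in practice, the sketch below copies the function and runs it on two invented sample tweets (the tweets are assumptions for illustration, not rows from the dataset). The second sample suggests that the link pattern is broad: it also matches lowercase text joined by a dot, together with whatever word characters and spaces follow.

```python
import re

def text_cleaner(text):
    # Copy of the cleaning function above, with raw-string patterns
    text = re.sub(r'--', '', text)
    text = re.sub(r"[\[]*[\]]", "", text)
    text = re.sub(r"@", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    text = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)", "", text)
    return ' '.join(text.split())

# Invented sample tweets (hypothetical, for illustration only)
with_link = "@united flight 123 was late -- see http://t.co/abc"
with_dot = "great crew.thanks a lot @JetBlue"

print(text_cleaner(with_link))  # united flight was late see
print(text_cleaner(with_dot))   # great
```

If losing text after a missing space matters for a given analysis, a stricter URL pattern would be one option to consider.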

In [18]:
df1_pos['cleaned'] = df1_pos['text'].astype(str).apply(text_cleaner)
df1_neg['cleaned'] = df1_neg['text'].astype(str).apply(text_cleaner)

In [19]:
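# Concatenate the cleaned negative tweets into one body of text.
# Note: joining with '' puts no separator between tweets, so the end of one tweet
# runs directly into the start of the next; ' '.join(...) would keep them apart.
dialogs_neg = df1_neg['cleaned'].tolist()
dialogs_neg_doc = ''.join(dialogs_neg)

# Do the same for the cleaned positive tweets
dialogs_pos = df1_pos['cleaned'].tolist()
dialogs_pos_doc = ''.join(dialogs_pos)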
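
The choice of separator in `join` affects the tokens the model sees later. A minimal sketch with two invented tweets (assumptions, not dataset rows) shows how an empty separator fuses the last word of one tweet with the first word of the next, while a space keeps them apart:

```python
# Two invented cleaned tweets (hypothetical examples)
tweets = ["hopefully to the flight from PHL.", "SouthwestAir PLEASE HELP!"]

fused = "".join(tweets)    # no separator between tweets
spaced = " ".join(tweets)  # one space between tweets

print(fused)   # hopefully to the flight from PHL.SouthwestAir PLEASE HELP!
print(spaced)  # hopefully to the flight from PHL. SouthwestAir PLEASE HELP!
```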

In [20]:
len(dialogs_neg_doc), len(dialogs_pos_doc)

(945709, 180170)

In [21]:
type(dialogs_neg_doc)

str

In [22]:
# Function to remove emojis
def deEmojify(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)
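
As a quick check of what the pattern covers, the sketch below rebuilds the same four code-point ranges and applies them to two characters. The helper name `strip_emojis` and the sample strings are assumptions for illustration. Characters outside the listed ranges, such as the airplane symbol U+2708, pass through unchanged.

```python
import re

# Same four code-point ranges as the emoji-removal function above
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

def strip_emojis(text):
    # Hypothetical helper name; behaves like the function above
    return EMOJI_PATTERN.sub("", text)

grinning_face = "\U0001F600"  # inside the emoticons range
airplane = "\u2708"           # outside all four ranges

removed = strip_emojis("thanks" + grinning_face)
kept = strip_emojis("delayed" + airplane)
print(repr(removed), repr(kept))
```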

In [23]:
# Remove emojis
dialogs_neg_doc = deEmojify(dialogs_neg_doc)
dialogs_pos_doc = deEmojify(dialogs_pos_doc)

In [24]:
# Adjust spaCy's max_length for a large body of text
nlp.max_length = 6000000 # or even higher
dialogs_neg_doc = nlp(dialogs_neg_doc)
dialogs_pos_doc = nlp(dialogs_pos_doc)

## Break the body into sentences

In [33]:
dialogs_neg_sents = [sent.text for sent in dialogs_neg_doc.sents if len(sent.text) > 1]
dialogs_pos_sents = [sent.text for sent in dialogs_pos_doc.sents if len(sent.text) > 1]

### To generate the transition probabilities, we will use Markovify's `Text()` class. Its `state_size` parameter determines how many words the model uses as the current state. For example, to generate the next word by looking at just the previous word, set `state_size=1`; to look at the previous two words, set `state_size=2`. The following uses `state_size=3`.
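
To make the role of `state_size` concrete, here is a minimal pure-Python sketch of a word-level Markov chain. It is not Markovify's implementation; the function names, sentinel tokens, and toy corpus are assumptions for illustration. Each state is a tuple of `state_size` consecutive words, mapped to the words observed right after it.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"  # sentinel tokens (illustrative names)

def build_chain(sentences, state_size):
    """Map each state (a tuple of `state_size` words) to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size, rng):
    """Walk the chain from the BEGIN state until END is drawn."""
    state = (BEGIN,) * state_size
    output = []
    while True:
        next_word = rng.choice(chain[state])
        if next_word == END:
            return " ".join(output)
        output.append(next_word)
        state = state[1:] + (next_word,)

# Toy corpus (invented sentences, not from the dataset)
corpus = [
    "my flight was delayed again",
    "my flight was cancelled today",
    "my bag was lost again",
]

chain1 = build_chain(corpus, state_size=1)
chain2 = build_chain(corpus, state_size=2)
print(sorted(chain1[("was",)]))    # ['cancelled', 'delayed', 'lost']
print(chain2[("bag", "was")])      # ['lost']
print(generate(chain1, 1, random.Random(0)))
```

With `state_size=1`, the state `("was",)` can be followed by any of three words, so the chain can mix sentences (for example, a bag that gets cancelled). With `state_size=2`, `("bag", "was")` has only one continuation, so output stays closer to the training text. Larger states trade variety for coherence.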

In [34]:
dialogs_neg_generator = markovify.Text(dialogs_neg_sents, state_size = 3)
dialogs_pos_generator = markovify.Text(dialogs_pos_sents, state_size = 3)

### At this stage, we've trained Markov chain models on the negative and positive tweets. Now, we're all set to generate random sentences from these models:

In [35]:
# Three randomly generated negative sentences
for i in range(3):
    print(dialogs_neg_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_neg_generator.make_short_sentence(100))

how is it that my flight # can arrive early and be delayed due to poor communication, which sounded like it was salvaged from the 80s.
hopefully to the Late Flightr flight from PHL.SouthwestAir
PLEASE HELP!SouthwestAir if this flight is to your team.
you asked me to follow them to try and resolveAmericanAir my flight to BWI to wait for from buffalo?
now been on hold for an hour due to computers being down.
hey I got a bad exchange rate.


In [36]:
# Three randomly generated positive sentences
for i in range(3):
    print(dialogs_pos_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_pos_generator.make_short_sentence(100))

This has to be the best video I have seen in years of flying USAir.
None
I needed to change my flight that was Cancelled Flighted and rescheduled for today.
I would be so awesome to see!JetBlue
AmericanAirUnited is the best first class I have ever flown any other airlines!
SouthwestAir thanks for the show!
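
The `None` in the output above is expected behavior: `make_sentence()` returns `None` when it cannot produce an acceptable sentence within its allotted attempts, which its `tries` keyword argument controls. One possible way to handle this is sketched below. The helper name `collect_sentences` and its `max_rounds` cap are assumptions for illustration, not part of Markovify.

```python
def collect_sentences(generator, n, tries=100, max_rounds=50):
    """Collect up to `n` generated sentences, skipping None results.

    `generator` is any object with a Markovify-style make_sentence(tries=...)
    method. `max_rounds` caps the number of calls so the loop always ends.
    """
    sentences = []
    for _ in range(max_rounds):
        if len(sentences) == n:
            break
        sentence = generator.make_sentence(tries=tries)
        if sentence is not None:
            sentences.append(sentence)
    return sentences

# Example usage with the model trained above:
# for sentence in collect_sentences(dialogs_pos_generator, n=3):
#     print(sentence)
```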


## The sentences sound plausible but not very natural. To improve the model, we can use syntactic information such as part-of-speech tags. Markovify supports this and can work together with spaCy as follows:

In [37]:
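class POSifiedText(markovify.Text):

    def word_split(self, sentence):
        # Tokenize with spaCy and tag each token with its part of speech, as "text::POS"
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        # Drop the POS tags and join the tokens with single spaces
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence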
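
The round trip through `word_split` and `word_join` can be sketched without spaCy. In the example below the part-of-speech tags are hand-written assumptions (spaCy's actual tags may differ), and `encode` and `decode` are hypothetical stand-ins for the two methods. Because tokens are rejoined with single spaces, contractions and punctuation come back detached from the preceding word.

```python
def encode(tagged_tokens):
    """Mimic word_split: fuse each (token, tag) pair into 'token::TAG'."""
    return ["::".join(pair) for pair in tagged_tokens]

def decode(states):
    """Mimic word_join: drop the tags and join tokens with single spaces."""
    return " ".join(state.split("::")[0] for state in states)

# Hand-written tags for illustration only; spaCy's output may differ.
tagged = [("What", "PRON"), ("'s", "AUX"), ("your", "PRON"),
          ("excuse", "NOUN"), ("?", "PUNCT")]

states = encode(tagged)
print(states[3])       # excuse::NOUN
print(decode(states))  # What 's your excuse ?
```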

### Now, train a Markov chain model again. This time, use the `POSifiedText()` class:

In [40]:
dialogs_neg_generator = POSifiedText(dialogs_neg_sents, state_size = 3)
dialogs_pos_generator = POSifiedText(dialogs_pos_sents, state_size = 3)

In [41]:
# Three randomly generated negative sentences
for i in range(3):
    print(dialogs_neg_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_neg_generator.make_short_sentence(100))

2nd plane forced to get off the plane - so frustratedunited
can we get a refund?SouthwestAir
What 's your excuse this time for delay from LGA to BOS for any shuttle flight ?
If that 's true I never want to deal with me in person .
I take JetBlue because I 've had with Jet Blue .
WELL't be on this plane at  its  now and still have had to .


In [42]:
# Three randomly generated positive sentences
for i in range(3):
    print(dialogs_pos_generator.make_sentence())

# Three randomly generated sentences of no more than 100 characters
for i in range(3):
    print(dialogs_pos_generator.make_short_sentence(100))

None
None
non stop CMH - OAK has me daydreaming about a trip to Cali & lt;united Hi , flight .
exceptional customer service today !
Thanks so much , that helps a lot .
you should know the crew today on flight # from IND to PHX!!JetBlue


## These generators work very well. Both the negative and positive texts show meaningful semantics and syntax.