## Predicting IPO Excess Returns: A Sentiment Analysis Approach

### Sentiment Analysis

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
tweet_filepath = r'C:\Users\restr\Documents\Springboard\Capstone project 2\IPO Project\Tweet.csv'
company_tweet_filepath = r'C:\Users\restr\Documents\Springboard\Capstone project 2\IPO Project\Company_Tweet.csv'


df_tweet = pd.read_csv(tweet_filepath)
df_company_tweet = pd.read_csv(company_tweet_filepath)

# Display the first few rows of the dataframes
print(df_tweet.head())
print(df_company_tweet.head())

             tweet_id           writer   post_date  \
0  550441509175443456  VisualStockRSRC  1420070457   
1  550441672312512512      KeralaGuy77  1420070496   
2  550441732014223360      DozenStocks  1420070510   
3  550442977802207232     ShowDreamCar  1420070807   
4  550443807834402816     i_Know_First  1420071005   

                                                body  comment_num  \
0  lx21 made $10,008  on $AAPL -Check it out! htt...            0   
1  Insanity of today weirdo massive selling. $aap...            0   
2  S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...            0   
3  $GM $TSLA: Volkswagen Pushes 2014 Record Recal...            0   
4  Swing Trading: Up To 8.91% Return In 14 Days h...            0   

   retweet_num  like_num  
0            0         1  
1            0         0  
2            0         0  
3            0         1  
4            0         1  
             tweet_id ticker_symbol
0  550803612197457920          AAPL
1  550803610825928706     

In [3]:
# Show the shape of the dataframes
print('df_tweet shape:', df_tweet.shape)
print('df_company_tweet shape:', df_company_tweet.shape)

# Show general information about the dataframes
print(df_tweet.info())
print(df_company_tweet.info())

# Show descriptive statistics of the dataframes
print(df_tweet.describe())
print(df_company_tweet.describe())


df_tweet shape: (3717964, 7)
df_company_tweet shape: (4336445, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3717964 entries, 0 to 3717963
Data columns (total 7 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   tweet_id     int64 
 1   writer       object
 2   post_date    int64 
 3   body         object
 4   comment_num  int64 
 5   retweet_num  int64 
 6   like_num     int64 
dtypes: int64(5), object(2)
memory usage: 198.6+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4336445 entries, 0 to 4336444
Data columns (total 2 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   tweet_id       int64 
 1   ticker_symbol  object
dtypes: int64(1), object(1)
memory usage: 66.2+ MB
None
           tweet_id     post_date   comment_num   retweet_num      like_num
count  3.717964e+06  3.717964e+06  3.717964e+06  3.717964e+06  3.717964e+06
mean   8.797444e+17  1.498582e+09  3.123642e-01  6.214807e-01  2.219982e+00
std    1.924039e+17  4.587266e+07  1

In [4]:
print(df_tweet.isnull().sum())

tweet_id           0
writer         47273
post_date          0
body               0
comment_num        0
retweet_num        0
like_num           0
dtype: int64


In [5]:
print(df_company_tweet.isnull().sum())

tweet_id         0
ticker_symbol    0
dtype: int64


Convert post_date from a UNIX timestamp to a standard datetime, then merge the dataframes based on 'tweet_id'.

In [6]:
df_tweet['post_date'] = pd.to_datetime(df_tweet['post_date'], unit='s')

df = pd.merge(df_tweet, df_company_tweet, how='inner', on='tweet_id')
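
The two steps above can be sketched on toy data. The frames below are small stand-ins for df_tweet and df_company_tweet, and the tweet IDs and ticker assignments are made up for illustration. The sketch shows that unit='s' reads the integers as seconds since the UNIX epoch, and that an inner merge returns one row per (tweet, ticker) pair, so a tweet tagged with several tickers appears several times in the merged frame.

```python
import pandas as pd

# Toy stand-ins for df_tweet and df_company_tweet (illustrative values only).
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "post_date": [1420070457, 1514764801, 1514765151],
    "body": ["first", "second", "third"],
})
company = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "ticker_symbol": ["AAPL", "GOOGL", "AMZN", "TSLA"],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
tweets["post_date"] = pd.to_datetime(tweets["post_date"], unit="s")

# Inner merge: keep only tweet_ids present in both frames.
# Tweet 2 matches two tickers, so it yields two rows; tweets 3 and 4 are dropped.
merged = pd.merge(tweets, company, how="inner", on="tweet_id")
print(merged)
```

This is why the filtered output further down shows the same tweet_id on consecutive rows with different ticker symbols.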


In [7]:
print(df.head())

             tweet_id           writer           post_date  \
0  550441509175443456  VisualStockRSRC 2015-01-01 00:00:57   
1  550441672312512512      KeralaGuy77 2015-01-01 00:01:36   
2  550441732014223360      DozenStocks 2015-01-01 00:01:50   
3  550442977802207232     ShowDreamCar 2015-01-01 00:06:47   
4  550443807834402816     i_Know_First 2015-01-01 00:10:05   

                                                body  comment_num  \
0  lx21 made $10,008  on $AAPL -Check it out! htt...            0   
1  Insanity of today weirdo massive selling. $aap...            0   
2  S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...            0   
3  $GM $TSLA: Volkswagen Pushes 2014 Record Recal...            0   
4  Swing Trading: Up To 8.91% Return In 14 Days h...            0   

   retweet_num  like_num ticker_symbol  
0            0         1          AAPL  
1            0         0          AAPL  
2            0         0          AMZN  
3            0         1          TSLA  
4      

Filter data from 2018 onwards

In [8]:
df['post_date'] = pd.to_datetime(df['post_date'])

df = df[df['post_date'].dt.year >= 2018]

print(df.head())


                   tweet_id           writer           post_date  \
2516889  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516890  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516891  947618475958325248  ExactOptionPick 2018-01-01 00:00:25   
2516892  947619846124122113          Eric714 2018-01-01 00:05:51   
2516893  947619846124122113          Eric714 2018-01-01 00:05:51   

                                                      body  comment_num  \
2516889  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516890  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516891  Don't miss our next FREE OPTION TRADE.  Sign u...            0   
2516892  How is $AAPL @Apple going to get me to buy a #...            1   
2516893  How is $AAPL @Apple going to get me to buy a #...            1   

         retweet_num  like_num ticker_symbol  
2516889            0         2         GOOGL  
2516890            0         2          AMZN  

### TextBlob
To perform the sentiment analysis, we first use TextBlob, a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis.

By default, TextBlob's sentiment analysis is lexicon-based (its PatternAnalyzer, derived from the pattern library). A Naive Bayes classifier (NaiveBayesAnalyzer) is available as an option, but it is not used here and does not return polarity and subjectivity.

The sentiment property returns two values, which we store in two columns:

Polarity: This is a value that ranges from -1 to 1. Values closer to 1 mean that the statement is positive, values closer to -1 mean that the statement is negative, and values around 0 are neutral.

Subjectivity: This is a value that ranges from 0 to 1. Values closer to 1 are more subjective (based on or influenced by personal feelings, tastes, or opinions), and values closer to 0 are more objective (not influenced by personal feelings or opinions in considering and representing facts).
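
The cells below score each tweet body twice, once for polarity and once for subjectivity. A possible single-pass variant is sketched here. The helper add_sentiment_columns and the stand-in toy_scorer (with its tiny word lists) are hypothetical and exist only so the sketch runs without TextBlob installed; they are not how TextBlob computes its scores.

```python
import pandas as pd

def add_sentiment_columns(frame, scorer):
    """Score each tweet body once and store the (polarity, subjectivity) pair."""
    pairs = [tuple(scorer(text)) for text in frame["body"]]
    scored = pd.DataFrame(pairs, columns=["sentiment", "subjectivity"], index=frame.index)
    return frame.join(scored)

# Hypothetical stand-in scorer with a made-up word list (NOT TextBlob's lexicon).
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "weak"}

def toy_scorer(text):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    polarity = (pos - neg) / hits if hits else 0.0      # in [-1, 1]
    subjectivity = hits / len(words) if words else 0.0  # in [0, 1]
    return polarity, subjectivity

toy = pd.DataFrame(
    {"body": ["great great stock", "bad quarter", "earnings call today"]},
    index=[10, 11, 12],
)
result = add_sentiment_columns(toy, toy_scorer)
print(result)
```

With TextBlob available, the scorer passed in would be something like `lambda text: TextBlob(text).sentiment`, which returns polarity and subjectivity together.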

In [9]:
# Function to get the polarity of text
def get_polarity(text):
    return TextBlob(text).sentiment.polarity

df['sentiment'] = df['body'].apply(get_polarity)


print(df.head())

                   tweet_id           writer           post_date  \
2516889  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516890  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516891  947618475958325248  ExactOptionPick 2018-01-01 00:00:25   
2516892  947619846124122113          Eric714 2018-01-01 00:05:51   
2516893  947619846124122113          Eric714 2018-01-01 00:05:51   

                                                      body  comment_num  \
2516889  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516890  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516891  Don't miss our next FREE OPTION TRADE.  Sign u...            0   
2516892  How is $AAPL @Apple going to get me to buy a #...            1   
2516893  How is $AAPL @Apple going to get me to buy a #...            1   

         retweet_num  like_num ticker_symbol  sentiment  
2516889            0         2         GOOGL        1.0  
2516890            0    

In [10]:
# Function to get the subjectivity of text
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

df['subjectivity'] = df['body'].apply(get_subjectivity)

print(df.head())


                   tweet_id           writer           post_date  \
2516889  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516890  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516891  947618475958325248  ExactOptionPick 2018-01-01 00:00:25   
2516892  947619846124122113          Eric714 2018-01-01 00:05:51   
2516893  947619846124122113          Eric714 2018-01-01 00:05:51   

                                                      body  comment_num  \
2516889  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516890  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516891  Don't miss our next FREE OPTION TRADE.  Sign u...            0   
2516892  How is $AAPL @Apple going to get me to buy a #...            1   
2516893  How is $AAPL @Apple going to get me to buy a #...            1   

         retweet_num  like_num ticker_symbol  sentiment  subjectivity  
2516889            0         2         GOOGL        1.0           1.

### VADER
Now we will use VADER, a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.

In [11]:
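# The analyzer needs the VADER lexicon; download it if it is not already installed
nltk.download('vader_lexicon')

# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()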

# Define a function to get the sentiment scores 
def get_sentiment_scores(text):
    return sid.polarity_scores(text)

df['sentiment_scores'] = df['body'].apply(get_sentiment_scores)

print(df.head())

                   tweet_id           writer           post_date  \
2516889  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516890  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516891  947618475958325248  ExactOptionPick 2018-01-01 00:00:25   
2516892  947619846124122113          Eric714 2018-01-01 00:05:51   
2516893  947619846124122113          Eric714 2018-01-01 00:05:51   

                                                      body  comment_num  \
2516889  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516890  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516891  Don't miss our next FREE OPTION TRADE.  Sign u...            0   
2516892  How is $AAPL @Apple going to get me to buy a #...            1   
2516893  How is $AAPL @Apple going to get me to buy a #...            1   

         retweet_num  like_num ticker_symbol  sentiment  subjectivity  \
2516889            0         2         GOOGL        1.0           1

This adds a new column, 'sentiment_scores', in which each entry is a dictionary with four keys:

'neg': Proportion of the text rated as negative

'neu': Proportion of the text rated as neutral

'pos': Proportion of the text rated as positive (the 'neg', 'neu', and 'pos' values sum to approximately 1)

'compound': Compound (i.e., aggregated) sentiment score, computed by summing the valence scores of the words in the text that appear in the lexicon, adjusting for intensifiers, negations, and degree adverbs, and then normalizing the result to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single measure of sentiment.
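
The normalization step for the compound score can be sketched as follows. This assumes the form used in the reference VADER implementation, x / sqrt(x² + α) with α = 15; treat the constant as an assumption and check it against the version you have installed. The function name normalize and the sample raw scores are illustrative.

```python
import math

def normalize(raw_score, alpha=15.0):
    """Squash an unbounded summed valence into the range [-1, 1]."""
    value = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp as a safeguard against floating-point overshoot.
    return max(-1.0, min(1.0, value))

# Larger summed valences move toward +/-1 but never exceed it.
for raw in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

Because the mapping is symmetric and saturating, a tweet with many strongly positive words approaches +1 rather than growing without bound, which keeps compound scores comparable across tweets of different lengths.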

Now we create separate columns for these scores as follows:

In [12]:
df = pd.concat([df.drop(['sentiment_scores'], axis=1), df['sentiment_scores'].apply(pd.Series)], axis=1)
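
On a frame with millions of rows, apply(pd.Series) can be slow because it builds one Series per row. A possible alternative, sketched here on toy data with made-up scores, builds the score columns in one step from the list of dictionaries. Passing the original index matters because the filtered frame no longer has a 0..n-1 index.

```python
import pandas as pd

# Toy frame with made-up VADER-style scores and a non-default index.
toy = pd.DataFrame(
    {
        "body": ["a", "b"],
        "sentiment_scores": [
            {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
            {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.3},
        ],
    },
    index=[2516889, 2516890],
)

# Build all score columns at once, reusing the original index for alignment.
scores = pd.DataFrame(toy["sentiment_scores"].tolist(), index=toy.index)

# Replace the dictionary column with the expanded score columns.
expanded = toy.drop(columns="sentiment_scores").join(scores)
print(expanded)
```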

In [13]:
print(df.head())

                   tweet_id           writer           post_date  \
2516889  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516890  947618374569353216          Auscomp 2018-01-01 00:00:01   
2516891  947618475958325248  ExactOptionPick 2018-01-01 00:00:25   
2516892  947619846124122113          Eric714 2018-01-01 00:05:51   
2516893  947619846124122113          Eric714 2018-01-01 00:05:51   

                                                      body  comment_num  \
2516889  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516890  The 7 Greatest Tech Stocks of All Time monitor...            0   
2516891  Don't miss our next FREE OPTION TRADE.  Sign u...            0   
2516892  How is $AAPL @Apple going to get me to buy a #...            1   
2516893  How is $AAPL @Apple going to get me to buy a #...            1   

         retweet_num  like_num ticker_symbol  sentiment  subjectivity  neg  \
2516889            0         2         GOOGL        1.0       

### Save as CSV file:

In [14]:
df.to_csv('sentiment_analysis.csv')
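
When the file is read back, the dataframe index written by to_csv shows up as an extra unnamed column, and post_date comes back as plain text unless it is parsed. A minimal round-trip sketch on toy data follows; the compound values are made up, and an in-memory buffer stands in for the file. Writing with index=False is an optional choice here, not what the cell above does.

```python
import io
import pandas as pd

# Toy frame with the filtered (non-default) index; compound values are made up.
toy = pd.DataFrame(
    {
        "tweet_id": [947618374569353216, 947618475958325248],
        "post_date": pd.to_datetime(["2018-01-01 00:00:01", "2018-01-01 00:00:25"]),
        "compound": [0.6, -0.3],
    },
    index=[2516889, 2516891],
)

# Write without the index, then read back and parse the date column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["post_date"])
print(restored.dtypes)
```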