Dataset used: Sentiment Analysis of Financial Tweets

---


Assignments 1 & 2 consist of the following components:
1. Uploading the dataset
2. Pre-processing such as stemming and lemmatisation
3. Tokenisation and creating an inverted index

## UPLOADING THE DATASET FROM KAGGLE

In [3]:
# Upload your Kaggle API token (kaggle.json) to the Colab working directory before running this cell
! pip install kaggle 
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download vivekrathi055/sentiment-analysis-on-financial-tweets

sentiment-analysis-on-financial-tweets.zip: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
! unzip sentiment-analysis-on-financial-tweets

Archive:  sentiment-analysis-on-financial-tweets.zip
  inflating: stockerbot-export1.csv  
  inflating: tweet_sentiment.csv     


In [5]:
# Mount the Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
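import os
# KAGGLE_CONFIG_DIR must name the directory that holds kaggle.json, not the file itself
os.environ['KAGGLE_CONFIG_DIR'] = "/content"
os.chdir('/content')  # "! cd" runs in a throwaway subshell, so it would not change the notebook's working directory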

In [7]:
import pandas as pd
dataset = pd.read_csv('/content/stockerbot-export1.csv')
dataset.head()

| | id | text | timestamp | source | symbols | company_names | url | verified |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0197e+18 | VIDEO: “I was in my office. I was minding my o... | Wed Jul 18 21:33:26 +0000 2018 | GoldmanSachs | GS | The Goldman Sachs | https://twitter.com/i/web/status/1019696670777... | True |
| 1 | 1.01971e+18 | The price of lumber $LB_F is down 22% since hi... | Wed Jul 18 22:22:47 +0000 2018 | StockTwits | M | Macy's | https://twitter.com/i/web/status/1019709091038... | True |
| 2 | 1.01971e+18 | Who says the American Dream is dead? https://t... | Wed Jul 18 22:32:01 +0000 2018 | TheStreet | AIG | American | https://buff.ly/2L3kmc4 | True |
| 3 | 1.01972e+18 | Barry Silbert is extremely optimistic on bitco... | Wed Jul 18 22:52:52 +0000 2018 | MarketWatch | BTC | Bitcoin | https://twitter.com/i/web/status/1019716662587... | True |
| 4 | 1.01972e+18 | How satellites avoid attacks and space junk wh... | Wed Jul 18 23:00:01 +0000 2018 | Forbes | ORCL | Oracle | http://on.forbes.com/6013DqDDU | True |


## PRE-PROCESSING

In [10]:
tweets = dataset["text"].tolist() # convert tweets to a list 

In [11]:
import re
no_link = []

def remove_links(tweet):
    """Takes a string and removes web links from it"""
    tweet = re.sub(r'http\S+', '', tweet)   # remove http links
    tweet = re.sub(r'bit\.ly/\S+', '', tweet)  # remove bitly links
    tweet = tweet.replace('[link]', '')   # remove literal [link] markers (str.strip would trim characters, not the substring)
    tweet = re.sub(r'pic\.twitter\S+', '', tweet)  # remove pic.twitter links
    return tweet

for i in tweets:
  no_link.append(remove_links(i))

print(no_link[0])

VIDEO: “I was in my office. I was minding my own business...” –David Solomon tells $GS interns how he learned he wa… 
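One detail of `remove_links` is worth checking in isolation: `str.strip` treats its argument as a set of characters to trim from both ends of the string, not as a substring to delete. The sketch below contrasts it with `str.replace` on a made-up tweet (the `sample` string is hypothetical, not taken from the dataset).

```python
# Hypothetical example tweet; not taken from the dataset.
sample = "link to the report [link] is in bio"

# strip() trims any of the characters '[', 'l', 'i', 'n', 'k', ']' from both ends.
stripped = sample.strip('[link]')

# replace() removes the literal substring wherever it occurs.
replaced = sample.replace('[link]', '')

print(repr(stripped))
print(repr(replaced))
```

With `strip`, the leading word "link" is eaten while the `[link]` marker in the middle survives; `replace` removes only the marker.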


In [12]:
# CONVERT TO LOWER CASE
import re
low = []

def lower(tweet):
  """Collapse runs of spaces and convert the tweet to lower case"""
  data = re.sub(' +', ' ', tweet)
  return data.lower()

for i in no_link:
    low.append(lower(i))

In [13]:
print(low[0])

video: “i was in my office. i was minding my own business...” –david solomon tells $gs interns how he learned he wa… 


In [14]:
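# REMOVE PUNCTUATION
def remove_punctuation(tweets):
  """Strip every character that is not a word character or whitespace from each tweet"""
  cleaned = []
  for tweet in tweets:
    new_tweet = re.sub(r'[^\w\s]', '', tweet)
    if new_tweet != '':   # tweets that become empty are dropped
      cleaned.append(new_tweet)
  return cleaned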

new = remove_punctuation(low)
print(new[0])

video i was in my office i was minding my own business david solomon tells gs interns how he learned he wa 


In [15]:
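# REMOVE USERS 
# Note: punctuation was stripped in the previous cell, so no '@' remains and
# these patterns match nothing here. To take effect, this step has to run
# before remove_punctuation (and before lower-casing, for the 'RT' pattern).
no_user = []

def remove_users(tweet):
    """Takes a string and removes retweet and @user information"""
    tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)  # remove re-tweet
    tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)  # remove tweeted at
    return tweet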

for i in new:
    no_user.append(remove_users(i))

print(no_user[0])

video i was in my office i was minding my own business david solomon tells gs interns how he learned he wa 


In [16]:
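# REMOVE HASHTAGS
# Note: '#' was already removed together with the other punctuation, so this
# pattern matches nothing at this point in the pipeline.
no_hash = []

def remove_hashtags(tweet):
    """Takes a string and removes any hash tags"""
    tweet = re.sub(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)  # remove hash tags
    return tweet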

for i in no_user:
    no_hash.append(remove_hashtags(i))

print(no_hash[0])

video i was in my office i was minding my own business david solomon tells gs interns how he learned he wa 
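In the cells above, punctuation is stripped before `remove_users` and `remove_hashtags` run, so the `@` and `#` markers those patterns rely on are already gone by the time they are applied. A minimal sketch of the effect of the ordering, on a made-up tweet and with simplified patterns (`@\w+` and `#\w+` are assumptions for illustration, not the notebook's own expressions):

```python
import re

def strip_mentions_and_hashtags(tweet):
    """Remove @mentions and #hashtags while the marker characters are still present."""
    tweet = re.sub(r'@\w+', '', tweet)   # simplified pattern (assumption)
    tweet = re.sub(r'#\w+', '', tweet)   # simplified pattern (assumption)
    return tweet

def strip_punctuation(tweet):
    """Same character class as remove_punctuation above."""
    return re.sub(r'[^\w\s]', '', tweet)

sample = "great call by @someuser on #earnings today!"   # hypothetical tweet

# Punctuation first: '@' and '#' vanish, so the handle and tag text stay behind.
punct_first = strip_mentions_and_hashtags(strip_punctuation(sample))

# Mentions/hashtags first: both are removed before punctuation is stripped.
tags_first = strip_punctuation(strip_mentions_and_hashtags(sample))

print(punct_first.split())
print(tags_first.split())
```

Only the second ordering drops `someuser` and `earnings` from the token stream.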


In [17]:
# REMOVE SPACES AND DIGITS 
data_final = []

def spaces_digits(tweet):
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse runs of whitespace
    tweet = re.sub(r'([0-9]+)', '', tweet)  # remove numbers
    return tweet

for i in no_hash:
  data_final.append(spaces_digits(i))

print(data_final[0])

video i was in my office i was minding my own business david solomon tells gs interns how he learned he wa 


In [18]:
# REMOVE STOP WORDS

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

res = [] # resultant list of cleaned strings

for s in data_final:
  # keep only the tokens that are not stop words, then re-join them
  kept = [w for w in s.split() if w not in stop_words]
  res.append(' '.join(kept))

print(res[0]) # example cleaned string 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
video office minding business david solomon tells gs interns learned wa


### TOKENISATION

In [19]:
res_fin = [sub.split() for sub in res]  # list of token lists, one per tweet
print(res_fin[0])

['video', 'office', 'minding', 'business', 'david', 'solomon', 'tells', 'gs', 'interns', 'learned', 'wa']


### STEMMING

In [20]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
stemmed = []
for words in res_fin:
    temp = [porter.stem(i) for i in words]
    stemmed.append(temp)
print(stemmed[0])    

['video', 'offic', 'mind', 'busi', 'david', 'solomon', 'tell', 'gs', 'intern', 'learn', 'wa']


### LEMMATISATION

In [21]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [22]:
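# Note: the lemmatiser is applied to the already-stemmed tokens, so it can alter stems further (e.g. 'gs' becomes 'g' below)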
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()

lemmatised = []
for words in stemmed:
  temp = [lemmatizer.lemmatize(i) for i in words]
  lemmatised.append(temp)
print(lemmatised[0])  

['video', 'offic', 'mind', 'busi', 'david', 'solomon', 'tell', 'g', 'intern', 'learn', 'wa']


In [23]:
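# Pad with one dummy single-token tweet so the total (28440) divides evenly into 60 documents of 474 tweets; run this cell only once
lemmatised.append(["a"])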

In [33]:
# lemmatised is a list of token lists (one per tweet). Split it into 60 documents of 474 tweets each.
n_docs, doc_size = 60, 474
print(len(lemmatised))
documents = [lemmatised[d * doc_size:(d + 1) * doc_size] for d in range(n_docs)]
print(documents[0])

28440
[['video', 'offic', 'mind', 'busi', 'david', 'solomon', 'tell', 'g', 'intern', 'learn', 'wa'], ['price', 'lumber', 'lb_f', 'sinc', 'hit', 'ytd', 'high', 'maci', 'turnaround', 'still', 'happen'], ['say', 'american', 'dream', 'dead'], ['barri', 'silbert', 'extrem', 'optimist', 'bitcoin', 'predict', 'new', 'crypto', 'entrant', 'go', 'zero'], ['satellit', 'avoid', 'attack', 'space', 'junk', 'circl', 'earth', 'paid', 'oracl'], ['realmoney', 'david', 'butler', 'favorit', 'fang', 'stock', 'isnt', 'realmoneysod', 'alphabet', 'facebook'], ['dont', 'miss', 'convo', 'one', 'favorit', 'thinker', 'samharrisorg'], ['u', 'intellig', 'document', 'nelson', 'mandela', 'made', 'public'], ['senat', 'want', 'emerg', 'alert', 'go', 'netflix', 'spotifi', 'etc', 'grg'], ['hedg', 'fund', 'manag', 'marc', 'larsi', 'say', 'bitcoin', 'k', 'possibl'], ['u', 'propos', 'expedit', 'appeal', 'fight', 'atampt', 'time', 'warner', 'purchas'], ['roger', 'feder', 'uniqlo', 'deal', 'make', 'one', 'athlet', 'earn', 'en

Flatten each of the 60 documents into a single list of terms. Grouping the 28,440 tweets into 60 documents of 474 tweets each makes the posting lists denser and easier to read than indexing every tweet separately.

In [34]:
flat_list = []
for doc in documents:
  flat = []
  for tweet_tokens in doc:
    flat.extend(tweet_tokens)
  flat_list.append(flat)
print(flat_list[0][0])     # access an individual term
# flat_list is a list of 60 lists, each sublist holding all the terms of one document

video
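The split-then-flatten steps above can also be sketched generically. The helpers `chunk` and `flatten` and the toy data below are hypothetical names introduced for illustration; unlike a fixed 60 × 474 loop, slicing tolerates a shorter final chunk, so no padding element is needed.

```python
from itertools import chain

def chunk(items, size):
    """Split items into consecutive sublists of length size (the last may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def flatten(doc):
    """Merge a list of token lists into one flat list of tokens."""
    return list(chain.from_iterable(doc))

# Toy stand-in for `lemmatised`: five tokenised tweets (hypothetical data).
toy_tweets = [['a', 'b'], ['c'], ['d', 'e'], ['f'], ['g']]

toy_documents = chunk(toy_tweets, 2)                  # 3 documents of up to 2 tweets
toy_flat = [flatten(doc) for doc in toy_documents]    # one flat term list per document

print(toy_flat)
```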


## INVERTED INDEX CREATION

In [26]:
from collections import defaultdict
def create_index (data):
  index = defaultdict(list)
  for i, tokens in enumerate(data):
    for token in tokens:
      index[token].append(i)
  return index
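To see what an inverted index is for, here is a small sketch on three made-up documents. `build_index` and `and_query` are hypothetical helpers, not part of the assignment code; the variant below stores each document id once per term and keeps posting lists sorted, which is an assumption about how the index would typically be queried.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of the ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def and_query(index, *terms):
    """Return ids of documents that contain every query term."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

# Hypothetical toy documents, already tokenised.
toy_docs = [
    ['bitcoin', 'price', 'high'],
    ['stock', 'price', 'fall'],
    ['bitcoin', 'stock', 'price'],
]
toy_index = build_index(toy_docs)

print(toy_index['price'])
print(and_query(toy_index, 'bitcoin', 'stock'))
```

A boolean AND query is then just the intersection of the posting lists of its terms.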

In [27]:
def create_index_new(data):
  """Return an iterator of (document id, token list) pairs"""
  return enumerate(data)

In [28]:
index = create_index(flat_list)
# index[term] is the posting list (ids of the documents containing that term)

# remove duplicate document ids and keep each posting list sorted
for term in index:
  index[term] = sorted(set(index[term]))

# print the posting lists of the first ten terms
for term in list(index)[0:10]:
  print(index[term])

[0, 1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 14, 17, 18, 22, 24, 25, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 14, 15, 18, 20, 21, 24, 25, 31, 33, 34, 35, 42, 43, 44, 47, 52, 54, 55, 58]
[0, 3, 21, 26, 30, 33, 34, 36, 37, 42, 44, 48, 56, 57, 58, 59]
[0, 2, 5, 7, 8, 10, 12, 14, 18, 20, 21, 22, 23, 26, 29, 30, 31, 33, 34, 36, 38, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59]
[0, 8, 9, 11, 13, 14, 15, 23, 26, 27, 32, 33, 34, 35, 37, 38, 40, 50, 51, 55, 56, 58, 59]
[0]
[0, 3, 5, 11, 18, 19, 20, 21, 23, 25, 33, 37, 44, 45, 46, 48, 50, 53, 54, 55, 56, 57, 59]
[0, 1, 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,