<a href="https://colab.research.google.com/github/danieljai/CIND820-AndyLee/blob/main/AndyLee_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preload setup

Basic setup so results can utilize the full width of the screen.

In [81]:
%config IPCompleter.greedy=True
import pandas as pd
# Show up to 500 characters per column so full tweet text is visible.
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 100

# Import Dataset

The hydrated file is stored on Google Drive; the following code mounts the drive in Colaboratory.

In [82]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [83]:
df = pd.read_csv("/content/drive/My Drive/__CIND 820 - Data Analytics Project/3-data/Book1-fastsave.csv")

# Data Cleaning and Manipulation

## Readjust attribute datatype

For `retweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`
- Convert `null` values to 0
- Convert attribute as int64

In [84]:
df.retweet_id = df[df['retweet_id'].notnull()].retweet_id.astype('int64') 

In [85]:
df.retweet_id = df.retweet_id.fillna(0).astype('int64')
df.in_reply_to_status_id = df.in_reply_to_status_id.fillna(0).astype('int64')
df.in_reply_to_user_id = df.in_reply_to_user_id.fillna(0).astype('int64')
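
A column that contains missing values is held as `float64` in pandas, which is why the fill step has to come before the integer cast. The sketch below uses a made-up three-row frame (the values are illustrative, not taken from the dataset) to show the same `fillna(0).astype('int64')` pattern, plus pandas' nullable `Int64` dtype as a possible alternative that keeps missing values distinct from 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two retweets and one original tweet (missing retweet_id).
sample = pd.DataFrame({"retweet_id": [101.0, np.nan, 303.0]})
print(sample.retweet_id.dtype)  # float64, because NaN cannot be stored in int64

# Same pattern as above: fill missing values with 0, then cast.
converted = sample.retweet_id.fillna(0).astype("int64")
print(converted.tolist())

# Alternative (not used in this notebook): nullable integers keep the gap as <NA>.
nullable = sample.retweet_id.astype("Int64")
print(nullable.isna().sum())
```

With the fill approach, 0 doubles as the "no retweet" marker, which is what the later split on `retweet_id == 0` relies on.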

## Misc. cleanup to reduce noise when conducting sentiment analysis
1. Remove newline characters (`\n`)
2. Remove URLs
3. Remove user mentions (`@user`)
4. Remove hashtags

In [86]:
# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default.
df['modified_text'] = df.text.str.replace(r'\n', '', regex=True)
df['modified_text'] = df.modified_text.str.replace(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', regex=True)
df['modified_text'] = df.modified_text.str.replace(r'\B@\w+', '', regex=True)
df['modified_text'] = df.modified_text.str.replace(r'\B#\w+', '', regex=True)
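
To see what the four patterns do to a single string, here is a minimal sketch using the standard `re` module with the same regular expressions. The helper name `clean_tweet` and the example tweet are made up for illustration.

```python
import re

URL_PATTERN = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

def clean_tweet(text):
    """Apply the four cleaning steps listed above to one string."""
    text = re.sub(r'\n', '', text)        # 1. newline characters
    text = re.sub(URL_PATTERN, '', text)  # 2. URLs
    text = re.sub(r'\B@\w+', '', text)    # 3. user mentions
    text = re.sub(r'\B#\w+', '', text)    # 4. hashtags
    return text

example = "@user Stay home!\nRead https://example.com/a?b=1 #covid"
cleaned = clean_tweet(example)
print(repr(cleaned))

# \B only matches where '#' is NOT directly preceded by a word character,
# so a hashtag glued to the previous word is left in place.
print(clean_tweet("Virus#covid"))
```

Two side effects are worth keeping in mind: replacing newlines with an empty string joins the surrounding words, and hashtags or mentions attached directly to a preceding word survive the cleaning.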

## Splitting Dataframes (originals and retweets)




The collection includes both original tweets and retweets. A retweet mirrors an original tweet posted by someone other than the retweeting user, so its sentiment is already covered by the analysis of the original. We therefore split original tweets and retweets into two dataframes to avoid wasting resources.


- Original tweets: `dfOriginals`
- Retweets: `dfRetweets`


In [87]:
dfOriginals = df[df.retweet_id == 0]
dfRetweets = df[df.retweet_id != 0]
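
Adding columns to a filtered frame later on can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it is to take an explicit `.copy()` when splitting. The sketch below shows this on a made-up three-row frame; the names `tweets`, `originals`, and `retweets` are illustrative only.

```python
import pandas as pd

# Hypothetical mini-frame: retweet_id == 0 marks an original tweet.
tweets = pd.DataFrame({"text": ["a", "b", "c"], "retweet_id": [0, 42, 0]})

is_original = tweets.retweet_id == 0
originals = tweets[is_original].copy()  # independent copy, safe to add columns to
retweets = tweets[~is_original].copy()

originals["flag"] = True                # modifies the copy only, not `tweets`
print(len(originals), len(retweets))
```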

## Preview data after cleaning and manipulation

Original Tweet dataframe

In [88]:
dfOriginals.dtypes

coordinates                    object
created_at                     object
hashtags                       object
media                          object
urls                           object
favorite_count                  int64
id                              int64
in_reply_to_screen_name        object
in_reply_to_status_id           int64
in_reply_to_user_id             int64
lang                           object
place                          object
possibly_sensitive             object
retweet_count                   int64
retweet_id                      int64
retweet_screen_name            object
source                         object
text                           object
tweet_url                      object
user_created_at                object
user_screen_name               object
user_default_profile_image       bool
user_description               object
user_favourites_count           int64
user_followers_count            int64
user_friends_count              int64
user_listed_

In [89]:
dfOriginals.sample(2)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,lang,place,possibly_sensitive,retweet_count,retweet_id,retweet_screen_name,source,text,tweet_url,user_created_at,user_screen_name,user_default_profile_image,user_description,user_favourites_count,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name.1,user_statuses_count,user_time_zone,user_urls,user_verified,modified_text
721873,,Wed Apr 01 02:29:25 +0000 2020,,,,0,1245176410148397062,MrsGandhi,1245102287082491904,85657578,en,,,0,0,,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>","@MrsGandhi Arnab: This gatherings are the reason why Corona Virus has spread\n\nPanelist: You mean Victory Gathering of BJP MLAs of M.P\n\nArnab: I'm talking about religious gathering\n\nPanelist: You mean Trichy Temple Gathering , Arattu Festival &amp; Kanika's Holi party Gathering\n\nArnab: Fcuk off",https://twitter.com/PublicVoice18/status/1245176410148397062,Tue May 21 20:12:50 +0000 2019,PublicVoice18,False,Speak up if you are alive,6895,98,217,0,Mumbai,unpredictable,PublicVoice18,9080,,,False,"Arnab: This gatherings are the reason why Corona Virus has spreadPanelist: You mean Victory Gathering of BJP MLAs of M.PArnab: I'm talking about religious gatheringPanelist: You mean Trichy Temple Gathering , Arattu Festival &amp; Kanika's Holi party GatheringArnab: Fcuk off"
783757,,Wed Apr 01 04:42:54 +0000 2020,,,,1,1245210003394433024,outrotanjiro,1245209206480949248,1705801320,en,,,0,0,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@outrotanjiro Unless Ms. Corona says we can’t leave our houses for the rest of the year, and then that year becomes a decade 🤡🗿",https://twitter.com/jinternals/status/1245210003394433024,Wed Aug 01 05:47:54 +0000 2018,jinternals,False,i’m irrelevant go away #txt on 05/19/19 #SuperM on 11/11/19,6533,225,527,4,Namjoons sexy brain,shel +✶𖧵⁷,jinternals,4391,,,False,"Unless Ms. Corona says we can’t leave our houses for the rest of the year, and then that year becomes a decade 🤡🗿"


## Guessing language

Since our sentiment analysis focuses only on English tweets, we install the `langdetect` library to help filter out tweets that are not in English.

In [90]:
#https://pypi.org/project/langdetect/
!pip install langdetect
from langdetect import detect
from langdetect import DetectorFactory
DetectorFactory.seed = 0

# import multiprocessing as mp
# p=mp.Pool(4)




In [91]:
# p.map(detect,dfOriginals.sample(500).text)

In [92]:
#dfOriginals = dfOriginals[~(dfOriginals.modified_text == "")]

In [93]:
# dfOriginals[(dfOriginals.modified_text == "")]

Function to test whether tweet is English with error handling.

In [94]:
#https://stackoverflow.com/questions/60930935/exclude-non-english-rows-in-pandas

def is_en(txt):
    # detect() raises on empty or non-text input, so any failure counts as "not English".
    try:
        return detect(txt) == 'en'
    except Exception:
        return False

Pass `is_en()` as a first-class function to `apply()`; it returns a boolean that is stored in a new attribute, `guessed_language`.

In [95]:
dfOriginals['guessed_language'] = dfOriginals.modified_text.apply(is_en)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


By filtering out non-English tweets, we remove noise that would otherwise distort the sentiment scores.

In [96]:
print("There are " + str(len(dfOriginals)) + " tweets in total, and " + str(sum(dfOriginals.guessed_language)) + " detects as English.")
print("Percentage of tweets that are in English: " + str(round((sum(dfOriginals.guessed_language) / len(dfOriginals)) * 100, 4)) + "%")


There are 245389 tweets in total, and 232816 detects as English.
Percentage of tweets that are in English: 94.8763%


# Begin Sentiment Analysis

To conduct sentiment analysis, we begin by importing the NLTK library.

In [97]:
import nltk
from nltk.sentiment.util import *
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import tokenize
nltk.download('punkt')
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [98]:
# for sentence in sentences:
#      print(sentence)
#      ss = sid.polarity_scores(sentence)
#      for k in sorted(ss):
#          print('{0}: {1}, '.format(k, ss[k]), end='')
#      print()

## Applying Sentiment Analysis function

We apply the polarity score function and store results on a new attribute `sentimentscore`.



In [99]:
dfOriginals['sentimentscore'] = dfOriginals.text.apply(sid.polarity_scores)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


A quick sample preview of the text and its sentiment score.

In [100]:
dfOriginals.sample(10)[['modified_text','sentimentscore']]

Unnamed: 0,modified_text,sentimentscore
519343,"America is good at making death, it made Al Qaeda, and then ISIS, and today it makes Corona Virus#COVID2019⁦⁦#usa#USA#trump#coronavirus#CoronaVirusitaly","{'neg': 0.123, 'neu': 0.786, 'pos': 0.091, 'compound': -0.25}"
598182,Corona don humble amHe's a full blacksmith now,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}"
432309,Remember ur recent tweets and speaches. Hinduphobia is ok. Islamicphobia is not ok. U reap what u sow. Remember. Ji haadan Bibi. Corona jihad just bcoz u hate modi,"{'neg': 0.115, 'neu': 0.748, 'pos': 0.137, 'compound': -0.0772}"
68453,"Please still be cautious. As Dr Anand Ranganathan informed during TimeNow discussion on similar topic, Iran too had large malaria incidents and usage of same drug. But is badly effected due to Corona.","{'neg': 0.119, 'neu': 0.82, 'pos': 0.061, 'compound': -0.296}"
738898,All airports and their parking areas can be used as hospitals for corona affected patients.,"{'neg': 0.103, 'neu': 0.897, 'pos': 0.0, 'compound': -0.1531}"
180796,Beyond belief. They are surely on something.,"{'neg': 0.0, 'neu': 0.707, 'pos': 0.293, 'compound': 0.4404}"
719140,I'm young--ish....and would likely survive a bout with Corona. The pain of government tyranny is something you don't get over.,"{'neg': 0.148, 'neu': 0.852, 'pos': 0.0, 'compound': -0.5106}"
771778,Without comments,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}"
142401,actually you are bigger problem than Corona,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'compound': -0.4019}"
698398,Should cause alarm across the country that Ds are currently RELEASING prisoners under the guise of “corona-scare” &amp; simultaneously reporting that officers will NOT respond to certain calls!Some things never change,"{'neg': 0.206, 'neu': 0.794, 'pos': 0.0, 'compound': -0.7786}"


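`SentimentIntensityAnalyzer.polarity_scores()` returns a dictionary with four scores: `neg`, `neu`, `pos`, and `compound`. The first three are the proportions of the text rated negative, neutral, and positive. `compound` is the sum of the word-level valence scores, adjusted by VADER's rules and normalized to lie between -1 (most negative) and +1 (most positive).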
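
For reference, NLTK's VADER implementation maps the summed valence `x` to the compound score as `x / sqrt(x*x + alpha)`, with `alpha` defaulting to 15. The sketch below reimplements only that normalization step; the function name `normalize_valence` and the input sums are made up for illustration, and the rule-based adjustments that VADER applies before summing are not reproduced here.

```python
import math

def normalize_valence(total, alpha=15):
    """Squash a summed valence score into the open interval (-1, 1)."""
    return total / math.sqrt(total * total + alpha)

# Hypothetical summed valences, from strongly negative to strongly positive.
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize_valence(total), 4))
```

Because of the square root in the denominator, the score saturates: large sums in either direction approach but never reach -1 or +1, and a text with no scored words stays at exactly 0.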

Next, we expand the dictionary into their own respective attributes for easier data manipulation.

In [101]:
dfOriginalSScore = pd.json_normalize(dfOriginals.sentimentscore)
dfOriginalSScore['original_index'] = dfOriginals.index
dfOriginalSScore = dfOriginalSScore.set_index('original_index')
dfOriginals = dfOriginals.merge(dfOriginalSScore, left_index=True, right_index=True)
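
The same expansion can also be written by building the score columns directly on the original index and joining them, which avoids the temporary `original_index` column. This is a sketch on a made-up two-row frame, not a drop-in replacement tested on the full dataset; the names `frame`, `scores`, and `expanded` are illustrative.

```python
import pandas as pd

# Hypothetical frame with a non-default index, mimicking the filtered tweets.
frame = pd.DataFrame(
    {"sentimentscore": [
        {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
        {"neg": 0.278, "neu": 0.722, "pos": 0.0, "compound": -0.4019},
    ]},
    index=[3, 12],
)

# One row per dictionary, carrying the original index labels along.
scores = pd.DataFrame(frame.sentimentscore.tolist(), index=frame.index)

# Join on the shared index so each score lands next to its tweet.
expanded = frame.join(scores)
print(expanded[["neg", "neu", "pos", "compound"]])
```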

- need some explanation on the "compound" attribute
- have an answer what the compound score stands for.

In [102]:
dfOriginals.sample(10)[['modified_text','compound']]

Unnamed: 0,modified_text,compound
275082,My Dad was telling me earlier how apparently SL has its first case of corona and everyone’s in uproar 😪,0.0
406241,👎 Mark Rutte 😡,0.0
637038,This one ballad will cure the sick of Corona. Please play it in the hospitals,0.1027
398361,We did it. We stopped corona.Thank you Kira.,0.1531
425696,People in Florida a different breed of stupid,-0.5267
70022,I'm immune then.,0.296
83181,"Choked on my chewing gum in Aldi he which ended in a coughing fit and everyone was looking at me like I had corona, felt like I had killed their family members the way they glared at me 🥴🙂",-0.2732
200305,The end of corona virus will be the day before my birthday.,0.0
290221,Latest Report On !Link 👉👉👉,0.0
509217,She sent a 38 minutes vn😭😭😂😂. That has marked off the end of corona talks on whatsapp. It's in person now😭😭😭,0.0


In [103]:
dfOriginals.head(5).compound

3     0.0129
4     0.3400
6     0.4404
7     0.6369
12    0.8885
Name: compound, dtype: float64

Split the compound score into 5 classes, each 0.4 wide:
- -1 to -0.6 for extra negative
- -0.6 to -0.2 for slight negative
- -0.2 to 0.2 for neutral
- 0.2 to 0.6 for slight positive
- 0.6 to 1 for extra positive

In [109]:
dfOriginals['sentiment_class'] = pd.cut(dfOriginals['compound'], bins=[-1, -.6, -.2, 0.2, .6, 1], right=True, labels=['x_neg', 's_neg', 'neu', 's_pos','x_pos'])
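
To check how boundary values are assigned, the sketch below runs the same bins and labels over a handful of made-up scores. With `right=True`, each bin includes its upper edge, so 0.2 counts as neutral and -0.6 as extra negative. `include_lowest=True` is an addition here (the call above omits it) so that a score of exactly -1 is classed as `x_neg` instead of becoming `NaN`.

```python
import pandas as pd

# Hypothetical compound scores, including values that sit exactly on bin edges.
scores = pd.Series([-1.0, -0.6, -0.4019, 0.0, 0.2, 0.4404, 0.8885])
labels = ['x_neg', 's_neg', 'neu', 's_pos', 'x_pos']

classes = pd.cut(scores, bins=[-1, -.6, -.2, .2, .6, 1], right=True,
                 labels=labels, include_lowest=True)
print(classes.tolist())
```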

In [110]:
dfOriginals.sample(10)[['modified_text','compound','sentiment_class']]

Unnamed: 0,modified_text,compound,sentiment_class
458690,"It doesn't matter if the Gov't kept the case under wraps, at the end of the day batho ke bone ba tswang go bapala corona, not the government. They could not have told us bfr they were sure. We had one job!",0.3649,s_pos
722791,Yo it gotta be another way 🤦🏾‍♂️,0.0,neu
608765,Corona produced by China and Russia Spread by IranSimple but bitter plot,-0.5719,s_neg
194569,Just want to know sino kaya yung fake news creators?,-0.4215,s_neg
114186,Malawi Electoral Commission (MEC) said that they will include sensitization messages pertaining to Corona virus in regard for preparations for the freshpresidential polls.,0.0,neu
140216,I wouldn’t mind if Corona Virus yelled APRIL FOOLS tomorrow and fucked right off !!!!,-0.8885,x_neg
650234,"Niggha lit 🔥 with ill flow like Corona, catch your attention like a sneeze",0.296,s_pos
17061,Yeah cause apparently only gas attendants carry the disease Shut the fuck up and stay home if you’re that afraid We don’t like being around you either,0.0516,neu
643459,TKX :)I KNEW THAT AND WAY MORE THEN THAT TYPE CORONA IN THE SEARCH BAR MANY POST WITH THE NAME CORONA@POTUS,0.4588,s_pos
606745,"People saw in the internet that if you gargle with Epson Salt, you won't get Corona virus! What a joke!!!",0.5216,s_pos


In [111]:
dfOriginals.sentiment_class.value_counts()

neu      82447
s_pos    48833
s_neg    47398
x_neg    35876
x_pos    30835
Name: sentiment_class, dtype: int64

In [137]:
dfOriginals[dfOriginals.favorite_count > 1000].sample(2)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,lang,place,possibly_sensitive,retweet_count,retweet_id,retweet_screen_name,source,text,tweet_url,user_created_at,user_screen_name,user_default_profile_image,user_description,user_favourites_count,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name.1,user_statuses_count,user_time_zone,user_urls,user_verified,modified_text,guessed_language,sentimentscore,neg,neu,pos,compound,sentiment_class
376506,,Tue Mar 31 15:37:25 +0000 2020,Corona,https://twitter.com/StuBishop_LPD/status/1245012330141880320/photo/1,,1513,1245012330141880320,,0,0,en,,False,127,0,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Just a little topic change from all this depressing #Corona stuff. 😂😂 https://t.co/F3UTnq32Lu,https://twitter.com/StuBishop_LPD/status/1245012330141880320,Sun Apr 21 05:45:12 +0000 2019,StuBishop_LPD,False,PoPo🐷 | SWAT | Formerly featured on #LivePD | Cubs | #MAGA2020 🇺🇸| Views and tweets are my own| Acct not affiliated with my Dept,88772,43487,2445,58,"Indianapolis, IN",Stu Bishop,StuBishop_LPD,11140,,,False,Just a little topic change from all this depressing stuff. 😂😂,True,"{'neg': 0.214, 'neu': 0.786, 'pos': 0.0, 'compound': -0.4588}",0.214,0.786,0.0,-0.4588,s_neg
727361,,Wed Apr 01 02:41:44 +0000 2020,,,https://www.indiatoday.in/my-take/video/my-take-coronavirus-must-be-fought-by-everyone-collectively-not-by-dividing-us-on-religious-lines-1661916-2020-03-31,4094,1245179509617217536,,0,0,en,,False,710,0,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",My take: the last thing we need is to make the fight against corona into a Hindu Muslim issue.. https://t.co/2tJipxbXF0,https://twitter.com/sardesairajdeep/status/1245179509617217536,Mon Jul 13 06:14:44 +0000 2009,sardesairajdeep,False,"Citizen first. Only 'ism' is humanism. newsman, tv anchor, author, father, friend. New book: 2019: How Modi Won India. pre order here: https://t.co/IJRYyyNZzx",9170,8998619,582,8693,New Delhi,Rajdeep Sardesai,sardesairajdeep,65626,,http://rajdeepsardesai.net/,True,My take: the last thing we need is to make the fight against corona into a Hindu Muslim issue..,True,"{'neg': 0.126, 'neu': 0.874, 'pos': 0.0, 'compound': -0.3818}",0.126,0.874,0.0,-0.3818,s_neg


We look for a correlation between the compound value and the retweet count. The coefficient is close to 0, indicating no linear correlation.

In [138]:
dfOriginals.compound.corr(dfOriginals.retweet_count)

-0.0032970675393302726

Since compound ranges from -1 to 1, we also try rescaling it to the 0 to 1 range. Min-max scaling is a linear transformation, which in principle leaves the Pearson correlation unchanged, so no new correlation is expected; the result is again close to 0.

In [191]:
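from sklearn.preprocessing import MinMaxScaler

scalar = MinMaxScaler()

# Note: wrapping the scaled array in pd.DataFrame gives it a fresh 0..n-1 index, so this
# assignment aligns on index labels rather than row position; assigning the raw array
# (e.g. via .ravel()) would keep rows in step with dfOriginals.
dfOriginals['compound_norm'] = pd.DataFrame(scalar.fit_transform(dfOriginals[['compound']]))

dfOriginals['compound_norm'].corr(dfOriginals['retweet_count'])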

-0.0028806337520278997
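
As a sanity check on the reasoning, the sketch below applies min-max scaling by hand to a made-up frame with a non-default index and compares the Pearson correlation before and after. It uses plain pandas arithmetic rather than `MinMaxScaler`, so the scaled values stay aligned with their rows; all numbers are illustrative.

```python
import pandas as pd

# Hypothetical values on a non-default index, as in the filtered frame.
demo = pd.DataFrame(
    {"compound": [-0.8, -0.2, 0.0, 0.5, 0.9],
     "retweet_count": [3, 0, 7, 1, 4]},
    index=[3, 4, 6, 7, 12],
)

# Min-max scaling written out by hand; Series arithmetic keeps the index aligned.
lo, hi = demo.compound.min(), demo.compound.max()
demo["compound_norm"] = (demo.compound - lo) / (hi - lo)

raw = demo.compound.corr(demo.retweet_count)
scaled = demo.compound_norm.corr(demo.retweet_count)
print(raw, scaled)
```

The two coefficients agree up to floating-point error, so any difference seen after rescaling has to come from something other than the scaling itself, such as row misalignment.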

Since -1 and 1 are the negative and positive extremes and values close to 0 are neutral, I also tried the absolute value of the compound score, which measures sentiment strength regardless of direction. However, the result is still close to 0; no correlation.

In [197]:
dfOriginals['compound_abs'] = abs(dfOriginals.compound)
dfOriginals['compound_abs'].corr(dfOriginals['retweet_count'])

0.004760935628751365

We look for a correlation between a user's follower count and the retweet count. The result shows only a weak correlation of 0.09.

In [198]:
dfOriginals.user_followers_count.corr(dfOriginals.retweet_count)

0.09002522953041471

- impact factor of a tweet 