[Part 1.:](#part1.) Contents

[Part 2.:](#part2.) Introduction

____ [Part 2.1.:](#part2.1.) Column Names

____ [Part 2.2.:](#part2.2.) Packages

____ [Part 2.3.:](#part2.3.) Constants

____ [Part 2.4.:](#part2.4.) Read the File

[Part 3.:](#part3.) Action

____ [Step 3.1.:](#step3.1.) Basic Information

________ [Step 3.1.1.:](#step3.1.1.) Investigate the Dataframe

________ [Step 3.1.2.:](#step3.1.2.) Investigate the Sentiments

____________ [Step 3.1.2.1.:](#step3.1.2.1.) Find Duplicates

____________ [Step 3.1.2.2.:](#step3.1.2.2.) Remove Duplicates

____________ [Step 3.1.2.3.:](#step3.1.2.3.) Neutralize Sentiments

________ [Step 3.1.3.:](#step3.1.3.) Clean Out

____ [Step 3.2.:](#step3.2.) Generate Corpus

____ [Side Track:](#sidetrackmemoryusage) Memory Usage

____ [Step 3.3.:](#step3.3.) The Vocabulary

________ [Step 3.3.1.:](#step3.3.1.) Remove Unnecessary Tokens

____________ [Step 3.3.1.1.:](#step3.3.1.1.) Remove Hyperlinks

____________ [Step 3.3.1.2.:](#step3.3.1.2.) Remove @-tags

____________ [Step 3.3.1.3.:](#step3.3.1.3.) Remove #-tags

____________ [Step 3.3.1.4.:](#step3.3.1.4.) Remove punctuation

____________ [Step 3.3.1.5.:](#step3.3.1.5.) Remove "stop" words

________________ [Step 3.3.1.5.1.:](#step3.3.1.5.1.) Articles: "a", "an", "the"

________________ [Step 3.3.1.5.2.:](#step3.3.1.5.2.) Conjunctions: "and", "or", "but"

________________ [Step 3.3.1.5.3.:](#step3.3.1.5.3.) Prepositions: "at", "on", "in", "of", "to", "with", "by"

________________ [Step 3.3.1.5.4.:](#step3.3.1.5.4.) Pronouns: "he", "she", "it", "they", "we", "you"

________________ [Step 3.3.1.5.5.:](#step3.3.1.5.5.) Auxiliary verbs: "am", "is", "are", "was", "were", "be", "been", "have", "has", "had", "do", "does", "did"

________________ [Step 3.3.1.5.6.:](#step3.3.1.5.6.) Adverbs of frequency: "always", "usually", "often", "sometimes", "rarely", "never"

________________ [Step 3.3.1.5.7.:](#step3.3.1.5.7.) Interjections: "oh", "ah", "wow", "hmm"

________________ [Step 3.3.1.5.8.:](#step3.3.1.5.8.) Suggestions from other sources

________ [Step 3.3.2.:](#step3.3.2) Gather Unique Tokens

____________ [Step 3.3.2.1.:](#step3.3.2.1.)  Save a list of all the unique tokens from the 'corpus' in a csv-file for ease of use; this one we'll call 'token_list'

____________ [Step 3.3.2.2.:](#step3.3.2.2.)  Find and add the frequency of each token in the 'corpus' to the 'token_list'

________ [Step 3.3.2.3.:](#step3.3.2.3.) Split & Count

____________ [Step 3.3.2.4.:](#step3.3.2.4.) Cleaning with Regard to Sentiments

________________ [Step 3.3.2.4.1.:](#step3.3.2.4.1.) Look at the three categories of sentiments and try to find what differentiates a positive sentiment from a negative or a neutral one, with regard to the words used in their related texts

________________ [Step 3.3.2.4.2.:](#step3.3.2.4.2.) At last, we'll prune the 'text' column to leave only the words that are necessary to keep in each tweet. This is what constitutes the input to our ML system. And we'll save this as another csv-file, together with the corresponding sentiments - the outputs - which we will call the 'column'

____ [Step 3.4.:](#step3.4.) Learn from Google's Search Engine

________ [Step 3.4.1.:](#step3.4.1.)  Word frequency analysis: This algorithm counts the frequency of each word in a text and identifies the most common words.

________ [Step 3.4.2.:](#step3.4.2.)  TF-IDF: This algorithm identifies the most important words in a text by comparing their frequency in the text to their frequency in a larger corpus of texts.

________ [Step 3.4.3.:](#step3.4.3.)  Latent Dirichlet Allocation (LDA): This algorithm is used for topic modeling, which involves identifying the topics present in a text or set of texts.

________ [Step 3.4.4.:](#step3.4.4.)  Word2Vec: This algorithm creates a vector representation for each word in a text, which can be used for various NLP tasks like semantic similarity and word analogy.

________ [Step 3.4.5.:](#step3.4.5.)  Deep Learning Models: Google also uses deep learning models such as neural networks for various NLP tasks, including language translation, sentiment analysis, and question answering.

[Part 4.:](#part4.)  Side Track - Split the Date

____ [Step 4.1.:](#step4.1.)  Side Track Follow-Up

[Appendix:](#appendix) Test Zone

<style>
  body {
    color: black;
  }
  h1 {
    background-color: transparent;
    color: LightSteelBlue;
  }
  h3 {
    background-color: transparent;
    color: WhiteSmoke;
  }
  b {
    background-color: transparent;
    color: LightSteelBlue;
  }
</style>
<h1><a id='part1.'>Part 1.:</a> Introduction</h1>

<h3><a id='part2.1.'>Part 2.1.:</a> Column Names</h3>
<ul>
  <li><b>target:</b> the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)</li>
  <li><b>id:</b> The id of the tweet (2087)</li>
  <li><b>date:</b> The date of the tweet (Sat May 16 23:58:44 UTC 2009)</li>
  <li><b>flag:</b> The query (lyx). If there is no query, then this value is NO_QUERY.</li>
  <li><b>user:</b> The user that tweeted (robotickilldozr)</li>
  <li><b>text:</b> The text of the tweet (Lyx is cool)</li>
</ul>

<h3><a id='part2.2'>Part 2.2.:</a> Packages</h3>

In [169]:
%reload_ext autoreload
%autoreload 2

import packages as pkg
import functions as func
import constants as const

if __name__ == '__main__':
    print('Starting Spark Session. All data is loaded into memory')

<h3><a id='part2.3.'>Part 2.3.:</a> Constants</h3>

Constant values that are used throughout the project are stored in constants.py. This includes the path to the data, the path to the output, and the names of the columns in the data.

<h3><a id='part2.4.'>Part 2.4.:</a> Read the File</h3>

Read the csv file into a pandas dataframe

In [170]:
dframe = pkg.pd.read_csv('.databases/tweets.csv', encoding='ISO-8859-1')

<h1><a id='part3'>Part 3.:</a> Action</h1>

<h3><a id=' Step 3.1'>Step 3.1.:</a> Basic Information</h3>

Get the basic information about the content of the file

In [171]:
display(dframe)
display(dframe.shape)

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best feeling ever
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interviews! â« http://blip.fm/~8bmta
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me for details
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur


(1599999, 6)
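
Note that the column header in the output above is itself a tweet, and that the shape reports 1,599,999 rows. This is the pattern pandas produces when a file without a header row is read with the default `header` setting: the first record is consumed as column names. Below is a minimal sketch of reading such a file so that no record is lost. It assumes the file really has no header row; the toy data and the column names are illustrative (the notebook takes the real names from `const.columns`).

```python
import io
import pandas as pd

# Toy stand-in for a headerless tweets file (contents are illustrative).
raw = io.StringIO(
    '0,1,Mon Apr 06 2009,NO_QUERY,alice,"first tweet"\n'
    '4,2,Tue Jun 16 2009,NO_QUERY,bob,"second tweet"\n'
)

# Assumed column names; the notebook itself uses const.columns.
columns = ['sentiment', 'id', 'date', 'flag', 'user', 'text']

# header=None tells pandas that the first line is data, not column names.
toy = pd.read_csv(raw, header=None, names=columns)
print(toy.shape)  # (2, 6): both records survive as rows
```

Applied to the real file, the same two arguments would presumably keep the first tweet as a data row.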

In [172]:
dframe.columns = const.columns
print('The first five rows of the database')
display(dframe.head())
print('Information')
display(dframe.info())
print('Description')
dframe.describe()

The first five rows of the database


Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1599999 non-null  int64 
 1   id         1599999 non-null  int64 
 2   date       1599999 non-null  object
 3   flag       1599999 non-null  object
 4   user       1599999 non-null  object
 5   text       1599999 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


None

Description


Unnamed: 0,sentiment,id
count,1599999.0,1599999.0
mean,2.000001,1998818000.0
std,2.000001,193575700.0
min,0.0,1467811000.0
25%,0.0,1956916000.0
50%,4.0,2002102000.0
75%,4.0,2177059000.0
max,4.0,2329206000.0


<h4><a id=' Step 3.1.1'>Step 3.1.1.:</a> Investigate the Dataframe</h4>

Now let us get the number of unique values in each column. An alternative that was suggested,

**dframe.unstack().groupby(level=0).nunique()**

turned out to be over 6 times slower

In [173]:
# we had 
# dframe.unstack().groupby(level=0).nunique()
# as a suggestion but it was over 6 times slower
dframe.apply(pkg.pd.Series.nunique)

sentiment          2
id           1598314
date          774362
flag               1
user          659775
text         1581465
dtype: int64
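
As a small sketch on toy data (the frame below is made up for illustration), the `apply` call used above gives the same result as the built-in `DataFrame.nunique()`, which may be the more direct way to express the same count:

```python
import pandas as pd

toy = pd.DataFrame({
    'sentiment': [0, 4, 4, 0],
    'user': ['alice', 'bob', 'alice', 'carol'],
})

# Column-wise count of distinct values via apply, as in the cell above.
via_apply = toy.apply(pd.Series.nunique)

# DataFrame.nunique() is the built-in equivalent.
via_builtin = toy.nunique()

print(via_apply.equals(via_builtin))
```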

In [174]:
for column in dframe.columns:
    display(column)
    display(dframe[column].value_counts())

'sentiment'

sentiment
4    800000
0    799999
Name: count, dtype: int64

'id'

id
2190457769    2
1974742852    2
2062516845    2
1551586713    2
1563681287    2
             ..
2197311343    1
2197311196    1
2197311146    1
2197310899    1
2193602129    1
Name: count, Length: 1598314, dtype: int64

'date'

date
Mon Jun 15 12:53:14 PDT 2009    20
Fri May 29 13:40:04 PDT 2009    17
Mon Jun 15 13:39:50 PDT 2009    17
Fri May 22 05:10:17 PDT 2009    17
Fri Jun 05 11:05:33 PDT 2009    16
                                ..
Sun Jun 07 12:36:09 PDT 2009     1
Sun Jun 07 12:36:07 PDT 2009     1
Sun Jun 07 12:36:04 PDT 2009     1
Sun Jun 07 12:36:03 PDT 2009     1
Tue Jun 16 08:40:50 PDT 2009     1
Name: count, Length: 774362, dtype: int64

'flag'

flag
NO_QUERY    1599999
Name: count, dtype: int64

'user'

user
lost_dog           549
webwoke            345
tweetpet           310
SallytheShizzle    281
VioletsCRUK        279
                  ... 
iheartrobpattz       1
67trinity            1
Sibby                1
mAnyA_15             1
bpbabe               1
Name: count, Length: 659775, dtype: int64

'text'

text
isPlayer Has Died! Sorry                                                                              210
good morning                                                                                          118
headache                                                                                              115
Good morning                                                                                          112
Headache                                                                                              106
                                                                                                     ... 
braces  tell me it will be okay...                                                                      1
is stuck at home without curry                                                                          1
@mrsduryee I've applied to about 70 since I lost my job in March...it certainly FEELS like a lot!       1
The cheese I got @SarawithanR lost its sq


To see the whole width of the table we make a small permanent change

In [175]:
pkg.pd.set_option('display.max_colwidth', None)

Let us see if some of the functions work properly and how we can use them to gain more information about this database. Here we list all the tweets that are generated by the user 'lost_dog'. We choose to only print the texts and nothing else.

In [176]:
print(dframe[dframe['user'].astype('string') == 'lost_dog']['text'])

43934              @NyleW I am lost. Please help me find a good home. 
45573             @SallyD I am lost. Please help me find a good home. 
46918         @zuppaholic I am lost. Please help me find a good home. 
47948         @LOSTPETUSA I am lost. Please help me find a good home. 
50571     @JeanLevertHood I am lost. Please help me find a good home. 
                                      ...                             
792408       @trooppetrie I am lost. Please help me find a good home. 
793313         @Carly_FTS I am lost. Please help me find a good home. 
793609         @inathlone I am lost. Please help me find a good home. 
798607              @Kram I am lost. Please help me find a good home. 
799404         @W_Hancock I am lost. Please help me find a good home. 
Name: text, Length: 549, dtype: object


Find the 50 users that generated the most tweets

In [177]:
grouped_dframe = dframe.groupby(['user']).count()
sorted_dframe = grouped_dframe.sort_values(by=['user'], ascending=False)
largest_dframe = sorted_dframe.nlargest(50, 'flag')
display(largest_dframe)

Unnamed: 0_level_0,sentiment,id,date,flag,text
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
lost_dog,549,549,549,549,549
webwoke,345,345,345,345,345
tweetpet,310,310,310,310,310
SallytheShizzle,281,281,281,281,281
VioletsCRUK,279,279,279,279,279
mcraddictal,276,276,276,276,276
tsarnick,248,248,248,248,248
what_bugs_u,246,246,246,246,246
Karen230683,238,238,238,238,238
DarkPiano,236,236,236,236,236
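
The same ranking can probably be obtained more directly from a single column. The sketch below uses toy data (user names and texts are invented) and relies on `value_counts()`, which counts rows per value and sorts by count in descending order:

```python
import pandas as pd

toy = pd.DataFrame({
    'user': ['a', 'b', 'a', 'c', 'a', 'b'],
    'text': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

# Count tweets per user, most active first, and keep the top two.
top_users = toy['user'].value_counts().head(2)
print(top_users)
```

On the real dataframe, `head(50)` would give the 50 most active users as a Series instead of a table with one identical count per column.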


<style>
  span {
    background-color: transparent;
    color: orange;
  }
</style>
What distinguishes the tweets that share an ID with another row of this dataframe from the others is that the copies carry different sentiments; don't ask me why, but some of these labels seem totally unreasonable. For instance, why should the following tweet, with the ID# 1467863684, be both positive and negative at the same time? Is it because it mentions the word "sad"?

<span>Awwh babs... you look so sad underneith that shop entrance of &quot;Yesterday's Musik&quot; O-: I like the look of the new transformer movie</span>

<h4><a id=' Step 3.1.2'>Step 3.1.2.:</a> Investigate the Sentiments</h4>

As we saw in Step 3.1.1. (Investigate the Dataframe) above, some tweet IDs are repeated. As a matter of fact, there are only 1,598,314 distinct IDs among the 1,599,999 rows. So, let us have a look at it and see what we can find out.

<h5><a id=' Step 3.1.2.1'>Step 3.1.2.1.:</a> Find Duplicates</h5>

Try and find how many tweets have more than one sentiment related to them. For this we count the occurrences of each value in the 'id' column

In [178]:
all_dframe_ids = dframe['id'].value_counts()
display(all_dframe_ids)

id
2190457769    2
1974742852    2
2062516845    2
1551586713    2
1563681287    2
             ..
2197311343    1
2197311196    1
2197311146    1
2197310899    1
2193602129    1
Name: count, Length: 1598314, dtype: int64

We then filter the result to include only values that appear twice or more

In [179]:
duplicated_dframe_ids = all_dframe_ids[all_dframe_ids >= 2]
display(duplicated_dframe_ids)

id
2190457769    2
1974742852    2
2062516845    2
1551586713    2
1563681287    2
             ..
2015412220    2
2006617256    2
1933201064    2
2189722020    2
1982279593    2
Name: count, Length: 1685, dtype: int64

This confirms what we expected: 1,599,999 - 1,598,314 = 1,685 tweet IDs appear twice. So we print the rows in dframe for which the 'id' column is in duplicated_dframe_ids.index

In [180]:
duplicated_tweet_dframe = dframe[dframe['id'].isin(duplicated_dframe_ids.index)]
display(duplicated_tweet_dframe.sort_values(by='id').head(50))

Unnamed: 0,sentiment,id,date,flag,user,text
212,0,1467863684,Mon Apr 06 22:33:35 PDT 2009,NO_QUERY,DjGundam,Awwh babs... you look so sad underneith that shop entrance of &quot;Yesterday's Musik&quot; O-: I like the look of the new transformer movie
800260,4,1467863684,Mon Apr 06 22:33:35 PDT 2009,NO_QUERY,DjGundam,Awwh babs... you look so sad underneith that shop entrance of &quot;Yesterday's Musik&quot; O-: I like the look of the new transformer movie
274,0,1467880442,Mon Apr 06 22:38:04 PDT 2009,NO_QUERY,iCalvin,"Haven't tweeted nearly all day Posted my website tonight, hopefully that goes well Night time!"
800299,4,1467880442,Mon Apr 06 22:38:04 PDT 2009,NO_QUERY,iCalvin,"Haven't tweeted nearly all day Posted my website tonight, hopefully that goes well Night time!"
988,0,1468053611,Mon Apr 06 23:28:09 PDT 2009,NO_QUERY,mariejamora,@hellobebe I also send some updates in plurk but i upload photos on twitter! you didnt see any of my updates on plurk? Zero?
801279,4,1468053611,Mon Apr 06 23:28:09 PDT 2009,NO_QUERY,mariejamora,@hellobebe I also send some updates in plurk but i upload photos on twitter! you didnt see any of my updates on plurk? Zero?
1176,0,1468100580,Mon Apr 06 23:42:57 PDT 2009,NO_QUERY,cristygarza,good night swetdreamss to everyonee and jared never chat in kyte puff
801572,4,1468100580,Mon Apr 06 23:42:57 PDT 2009,NO_QUERY,cristygarza,good night swetdreamss to everyonee and jared never chat in kyte puff
1253,0,1468115720,Mon Apr 06 23:48:00 PDT 2009,NO_QUERY,WarholGirl,@ientje89 aw i'm fine too thanks! yeah i miss you so much on the MFC but hope we can talk later on today kisses :huglove:
801649,4,1468115720,Mon Apr 06 23:48:00 PDT 2009,NO_QUERY,WarholGirl,@ientje89 aw i'm fine too thanks! yeah i miss you so much on the MFC but hope we can talk later on today kisses :huglove:


Now we check whether the rows with duplicate ids have one 0 and one 4 in the sentiment column. We do it by summing the sentiments for each duplicated tweet. Each duplicated id appears exactly twice and the only labels are 0 and 4, so if every sum equals 4 we can conclude that each duplicate has received one 0 and one 4 as its sentiment value.

In [181]:
aggregated_sentiments = duplicated_tweet_dframe.groupby('id').agg({'sentiment': 'sum', 'text': 'first'})
display(aggregated_sentiments)

Unnamed: 0_level_0,sentiment,text
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1467863684,4,Awwh babs... you look so sad underneith that shop entrance of &quot;Yesterday's Musik&quot; O-: I like the look of the new transformer movie
1467880442,4,"Haven't tweeted nearly all day Posted my website tonight, hopefully that goes well Night time!"
1468053611,4,@hellobebe I also send some updates in plurk but i upload photos on twitter! you didnt see any of my updates on plurk? Zero?
1468100580,4,good night swetdreamss to everyonee and jared never chat in kyte puff
1468115720,4,@ientje89 aw i'm fine too thanks! yeah i miss you so much on the MFC but hope we can talk later on today kisses :huglove:
...,...,...
2193278017,4,"oh dear HH is back please twitter do something about her. I'm begging you, please pretty please"
2193403830,4,"english exam went okay revising for french, r.e and geography now, urrff"
2193428118,4,"finally finished typing!!!! Woohoooo , still need to add graphs though"
2193451289,4,"@fanafatin see, @misschimichanga tweet u to join us!! u really cant? so if thurs, when &amp; where?"


Which can be tested by asking if the number of 4s in the 'sentiment' column is the same as the size of the table

In [182]:
aggregated_sentiments['sentiment'].value_counts()

sentiment
4    1685
Name: count, dtype: int64
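
The sum-based check works because the only labels are 0 and 4. A sketch of an alternative that does not depend on the label values is shown below on toy data (ids and labels are invented); it counts the distinct labels per duplicated id instead of adding them up:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [10, 11, 10, 12, 11],
    'sentiment': [0, 0, 4, 4, 4],
})

# keep=False marks every row whose id occurs more than once.
dupes = toy[toy['id'].duplicated(keep=False)]

# For each duplicated id, count how many distinct labels it carries.
labels_per_id = dupes.groupby('id')['sentiment'].nunique()

# True if every duplicated id has exactly two different labels.
all_conflicting = bool((labels_per_id == 2).all())
print(all_conflicting)
```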

And this means <h3>YES</h3> We can go on with the next step: remove one of the duplicates and change the sentiment of the other one to 2, which stands for 'neutral'

<h5><a id=' Step 3.1.2.2'>Step 3.1.2.2.:</a> Remove Duplicates</h5>

Let us remove those rows in dframe that have the same id as in duplicated_dframe_ids and a sentiment of 0

In [183]:
dframe_wihout_duplicates = dframe[~(dframe['id'].isin(duplicated_dframe_ids.index) & (dframe['sentiment'] == 0))]
display(dframe_wihout_duplicates)

Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best feeling ever
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interviews! â« http://blip.fm/~8bmta
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me for details
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur


<h5><a id=' Step 3.1.2.3'>Step 3.1.2.3.:</a> Neutralize the Sentiments</h5>

Set the sentiment of those rows in 'dframe' that are mentioned in the list of duplicated ids 'duplicated_dframe_ids' to 2, meaning that they should be classified, or perceived, as neutral

In [184]:
# change the sentiment of those rows in dframe_wihout_duplicates that are listed in duplicated_dframe_ids to 2
neutralized_dframe = dframe_wihout_duplicates.copy()
neutralized_dframe.loc[neutralized_dframe['id'].isin(duplicated_dframe_ids.index), 'sentiment'] = 2
display(neutralized_dframe)


Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best feeling ever
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interviews! â« http://blip.fm/~8bmta
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me for details
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur


Let us check if the change has worked out

In [185]:
# neutralized_dframe.apply(pkg.pd.Series.nunique)

Yes indeed; we have three categories of sentiment now. And to make sure

In [186]:
# neutralized_dframe['sentiment'].value_counts()
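
The two steps above (dropping the negative copy, then relabelling the surviving copy) can be sketched on toy data, where the outcome is easy to verify by eye. The ids and labels below are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [10, 11, 10, 12],
    'sentiment': [0, 0, 4, 4],
})

# ids that occur at least twice
id_counts = toy['id'].value_counts()
dup_ids = id_counts[id_counts >= 2].index

# Drop the negative copy of each duplicated tweet ...
deduped = toy[~(toy['id'].isin(dup_ids) & (toy['sentiment'] == 0))].copy()
# ... and relabel the remaining copy as neutral (2).
deduped.loc[deduped['id'].isin(dup_ids), 'sentiment'] = 2

counts = deduped['sentiment'].value_counts()
print(counts.to_dict())
```

After both steps every id is unique and the sentiment column holds three categories.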

<h4><a id=' Step 3.1.3'>Step 3.1.3.:</a> Clean Out</h4>

We also drop the 'flag' column. It holds a single value, NO_QUERY, and therefore contributes nothing to any kind of investigation. If we later combine this data with other databases we may have to bring it back, but for now we just drop it.

In [187]:
flagless_dframe = neutralized_dframe.drop(columns=['flag'])

And the result is

In [188]:
flagless_dframe.head(50)

Unnamed: 0,sentiment,id,date,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,joy_wolf,@Kwesidei not the whole crew
5,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,mybirch,Need a hug
6,0,1467811594,Mon Apr 06 22:20:03 PDT 2009,coZZ,"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?"
7,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,2Hood4Hollywood,@Tatiana_K nope they didn't have it
8,0,1467812025,Mon Apr 06 22:20:09 PDT 2009,mimismo,@twittera que me muera ?
9,0,1467812416,Mon Apr 06 22:20:16 PDT 2009,erinx3leannexo,spring break in plain city... it's snowing


We have 1,685 tweets with neutral sentiment. The number of neutral sentiments is comparatively very low, but still better than nothing at all.

Let us also drop all unnecessary columns for tweet sentiment classification.

In [189]:
singledout_dframe = flagless_dframe.drop(columns=['id', 'date', 'user'])

And the result is

In [190]:
# display(singledout_dframe)

<style>span{background-color: transparent; color: orange;}</style><span>Thus, 'singledout_dframe' is the table we are going to use for training our ML solution</span>. That is only half the truth, though, since we will need to prune the column contents quite a lot before we can use it for training.

<h3><a id=' Step 3.2'>Step 3.2.:</a> Generate Corpus</h3>

Here we make one big text out of all tokens used in the 'text' column; call it 'corpus'. For concatenating all the sentences in each row of the table, we use *join* instead of *str.cat*
<ol>
    <li>corpus = singledout_dframe['text'].str.cat(sep=' ')</li>
    <li>corpus = ' '.join(singledout_dframe['text'].astype(str))</li>
</ol>

*join* turns out to have a shorter execution time but a somewhat longer overhead time, whereas *str.cat* has the opposite characteristics. And to borrow from my dear companion, ChatGPT: "In general, the *join* method may use slightly less memory than *str.cat*, especially if you use the *astype(str)* method to convert the values in the DataFrame column to strings before concatenation. This is because the *join* method creates a new string object containing the concatenated strings, whereas *str.cat* creates a new Series object containing the concatenated strings." That explanation deserves some caution: when it is called without *others*, as in option 1 above, *str.cat* also returns a single string rather than a Series.
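
A small sketch on toy data (the three texts are taken from the value counts earlier in this notebook) makes it easy to check what each call returns and that both produce the same text. It says nothing about timing or memory, which would have to be measured on the full column:

```python
import pandas as pd

texts = pd.Series(['good morning', 'headache', 'need a hug'])

via_cat = texts.str.cat(sep=' ')
via_join = ' '.join(texts.astype(str))

# Compare the type of the result and the content of both strings.
print(type(via_cat).__name__, via_cat == via_join)
```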

Anyhow, we use the *join* method for the concatenation and save the result as a txt-file.

But...

Before we do that, we also generate a table, where we save each tweet on a separate row. We call it Line-By-Line Corpus or 'lbl_corpus' and save it as a txt-file.

In [194]:
# write each tweet text on a separate line of lbl_corpus.txt
with open('lbl_corpus.txt', 'w') as f:
    for row in singledout_dframe['text']:
        f.write(row + '\n')

# write the matching sentiment of each tweet on a separate line of lbl_corpus_sentiment.txt
with open('lbl_corpus_sentiment.txt', 'w') as f:
    for row in singledout_dframe['sentiment']:
        f.write(str(row) + '\n')

# join all tweet texts into one corpus string and write it to corpus.txt
dframe_corpus = ' '.join(singledout_dframe['text'].astype(str))
print(const.repo_path+'corpus.txt')
with open(const.repo_path+'corpus.txt', 'w') as f:
    f.write(dframe_corpus)

# write the number of tokens in the original corpus to the file named by const.CORPUS_MENSURA
with open(const.repo_path+str(const.CORPUS_MENSURA), 'w') as f:
    f.write(str(len(dframe_corpus.split())))

./.repository/corpus.txt


We can readily see that on this particular system (PowerMac Laptop 1.4 GHz Quad-Core Intel Core i5 with Ventura 13.3.1, Visual Studio Code Version: 1.77.3 (Universal), Jupyter Notebook 6.4.12) a one-character string holds 1 byte of character data but carries an overhead of 49 bytes

In [108]:
# print(f'The string "T" is {len("T")} characters long')
# print(f'But it occupies a total of {pkg.sys.getsizeof("T")} bytes')

Thus, we expect the whole 'corpus', which is

In [109]:
# print(f'{len(dframe_corpus)} characters long')

to take

In [110]:
# 119980390 + 49

bytes of memory, but obviously it takes

In [111]:
# pkg.sys.getsizeof(dframe_corpus)

bytes. This yields that the overhead for this corpus is

In [112]:
# (119980463-119980390)

bytes long.

<h3><a id='sidetrackmemoryusage'>Side Track:</a> Memory Usage</h3>

To investigate the reason for the overhead and its content we install and use "memory_profiler" which is a module for monitoring memory usage of a python program. We use it to see how much memory is used by each line of code in the .py-copy of this very notebook. The result is depicted in the following graph

In [113]:
# pkg.Image(filename=const.repo_path+'memory_time-space_profile.png')

For a breakdown of the memory usage we utilize the function objgraph.show_backrefs, which recursively shows all objects that hold a reference to the input 'corpus' variable. The result is illustrated in the graph that follows. There, we see that the str object 'corpus' occupies 119,980,463 bytes of memory, which is the same number we got from the *sys.getsizeof* function. This means that we still have no idea what these extra 73 bytes are and where they come from.

In [114]:
# pkg.objgraph.show_backrefs(dframe_corpus, max_depth=4, too_many=5, filename=const.repo_path+'/corpus_backrefs_4-5.png', extra_info=lambda x: f"{pkg.sys.getsizeof(x)} bytes")
# pkg.Image(filename=const.repo_path+'/corpus_backrefs_4-5.png')
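
One plausible explanation for the 73 bytes, assuming the notebook runs on CPython: since PEP 393, CPython stores pure-ASCII strings in a more compact layout than strings that contain other Latin-1 characters, and the latter carry a larger fixed header. The file was read as ISO-8859-1 and the tables above show non-ASCII characters (for example 'â«'), so the corpus would fall into the second group. The sketch below measures both overheads on two short test strings. The exact numbers depend on the Python version and platform, so none are stated here.

```python
import sys

def overhead(s: str) -> int:
    """Bytes reported by sys.getsizeof beyond one byte per character.

    Only meaningful for strings whose characters all fit in one byte
    (code points below 256), which holds for both examples below.
    """
    return sys.getsizeof(s) - len(s)

ascii_text = 'T' * 100               # pure ASCII
latin1_text = 'T' * 99 + '\u00e2'    # one non-ASCII Latin-1 character ('â')

print('ASCII overhead:  ', overhead(ascii_text))
print('Latin-1 overhead:', overhead(latin1_text))
```

If the second number printed on the notebook's system is 73, the mystery overhead is simply the header of a non-ASCII string rather than anything specific to the corpus.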

So, back to the main track...

<h3><a id=' Step 3.3'>Step 3.3.:</a> Memory Space Clean Out</h3>

To clean out the memory space, we delete the 'dframe_corpus' variable along with all the other dataframe objects we have created so far. We have already saved it as a txt-file. Also, the singledout_dframe is saved as a csv-file, so it's safe to remove. This way, we'll save

In [115]:
import functions as func

# collect every user-defined variable whose name contains 'dframe'
keys_to_remove = [key for key, value in locals().items() if not key.startswith('_') and 'dframe' in key]
print(f'The total memory space that can be released {func.total_occupied_space(locals(), keys_to_remove)}')
print(f'The total memory occupied by the variables in the program {func.total_occupied_space(locals(), locals().keys())}')

The total memory space that can be released is 2.60 GB
The total memory occupied by the variables in the program is 2.60 GB


In [116]:
keys_to_remove = list([key for key, value in locals().items() if not key.startswith('_') and 'dframe' in key])
func.cleanup_variable_space(locals(), keys_to_remove)

Following variables are begin deleted:
['dframe', 'grouped_dframe', 'sorted_dframe', 'largest_dframe', 'all_dframe_ids', 'duplicated_dframe_ids', 'duplicated_tweet_dframe', 'dframe_wihout_duplicates', 'neutralized_dframe', 'flagless_dframe', 'singledout_dframe', 'dframe_corpus']


<h3><a id=' Step 3.4'>Step 3.4.:</a> The Vocabulary</h3>

In every database there is a vocabulary, which is a list of all the words that are used in the database. The number of words may create a problem for the ML algorithm, since it has to learn the weights of all the words in the vocabulary. In our case, the vocabulary is the set of all the words that are used in the 'text' column of the 'singledout_dframe' dataframe. We can easily get it by applying the *set* function to the tokens of the 'corpus' variable, that is, to the corpus split on whitespace. In the following substeps, we will try to reduce the size of the vocabulary by removing all the words that are not relevant for the sentiment classification.
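
A minimal sketch of building the vocabulary, using a made-up one-line corpus. Note that *set* has to be applied to the list of tokens; applied to the corpus string itself it would return the set of distinct characters instead.

```python
from collections import Counter

corpus = 'good morning good night headache'

tokens = corpus.split()          # whitespace-separated tokens
vocabulary = set(tokens)         # distinct tokens
frequencies = Counter(tokens)    # token -> number of occurrences

print(len(tokens), len(vocabulary), frequencies.most_common(1))
```

The `Counter` is not strictly needed for the vocabulary, but it yields the token frequencies in the same pass.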

<h5><a id=' Step 3.4.1.'>Step 3.4.1.:</a> Remove Unnecessary Tokens</h5>

By *token* we mean any contiguous sequence of characters that contains no *space* character. In our original corpus, we have 21,052,251 tokens.
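
A token count of this kind can be obtained without loading the whole file into memory. The sketch below is one way to do it, assuming the corpus is a plain text file; the helper `count_tokens` and the sample file are hypothetical.

```python
import os
import tempfile


def count_tokens(path):
    """Count whitespace-separated tokens in a text file, one line at a time."""
    total = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            total += len(line.split())
    return total


# Write a tiny sample file so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("good morning :)\nhttp://example.com  is   down\n")
    sample_path = tmp.name

n_tokens = count_tokens(sample_path)
os.remove(sample_path)
print(f'{n_tokens:,}')
```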

The plan is to remove hyperlinks, @- and #-tags, other punctuation, and stop words. However, emoticons enrich the text and make it more expressive, so we should protect them from removal. We will do that in the next step. Let us start by finding out which emoticons exist in the corpus.

In [117]:
# import functions as func

# global print
# print = pkg.functools.partial(print, flush=True)

# #func.list_affected_tokens('corpus', {':)': 'smiling face', ':(': 'sad face', ':\'(': 'crying face'}, 'mix')
# func.list_affected_tokens('corpus', const.emoticons, 'mix')

Experimenting with these emoticons shows, however, that we must be mindful of precedence: longer strings that contain an emoticon have to be handled first. For instance, 'http:/', which is part of the well-known hyperlink prefix, should be removed before we count emoticons. After all, we wouldn't want to keep 'http:/' as a skeptical emoticon ":/", would we? So, let us remove all hyperlinks first.
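
The precedence problem can be seen on a single made-up line. This is only a sketch: the sample text is invented and the regular expression is one possible way to drop tokens containing 'http'.

```python
import re

text = "see http://example.com :/ not sure"

# Naive count: the ':/' inside 'http://' is counted as an emoticon too
naive = text.count(':/')

# Drop every token that contains 'http' first, then count again
without_links = re.sub(r'\S*http\S*', '', text)
careful = without_links.count(':/')

print(naive, careful)
```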

What is important to consider is that every string we try to remove may occur in many variations. For example, a search for 'http' finds the lower-case prefix of the hyperlinks, but it misses 'HTTP' or 'Http' unless we check every possible variation for every word in the corpus. We also know that letter case, small or capital, won't affect the sentiment. Thus, we have to make sure the whole text is converted to lower case before we start removing any tokens.

We do it right here, right now.

In [118]:
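import gc
import shutil

# Keep a backup of the original, mixed-case corpus before overwriting it
shutil.copyfile(const.repo_path+'corpus.txt', const.repo_path+'corpus_original.txt')
# Read the whole corpus and convert it to lower case
with open(const.repo_path+'corpus.txt', 'r') as f:
    all_small = f.read().lower()
# Overwrite the corpus file with its lower-case version
with open(const.repo_path+'corpus.txt', 'w') as g:
    g.write(all_small)
# Release the memory held by the lower-cased text
del all_small
gc.collect()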

0
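
If the corpus were too large to read in one go, the same conversion could be done line by line. The following is a sketch of that variant, not the code used in this notebook; the helper `lowercase_file` and the sample file are hypothetical.

```python
import os
import tempfile


def lowercase_file(src, dst):
    """Copy `src` to `dst`, converting every line to lower case."""
    with open(src, 'r', encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(line.lower())


# Build a tiny sample corpus in a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'corpus.txt')
dst = os.path.join(workdir, 'corpus_lower.txt')
with open(src, 'w', encoding='utf-8') as f:
    f.write("HTTP is Not http\nGood Morning :D\n")

lowercase_file(src, dst)
with open(dst, 'r', encoding='utf-8') as f:
    lowered = f.read()
print(lowered)
```

Note that lower-casing also turns ':D' into ':d', so any emoticon list applied afterwards has to contain the lower-case forms.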

<h5><a id=' Step 3.4.1.1.'>Step 3.4.1.1.</a> Remove Hyperlinks</h5>

First, let's see how many URLs there are in the corpus or, more generally, how many tokens contain substrings such as 'http', 'ftp', and 'ssh' that may be part of, or refer to, hyperlinks. We start with 'http': the affected tokens are written to the file 'signa_continentur_http.txt', and we count the number of lines in the file.

In [119]:
func.string_affected_tokens('corpus', 'http', 'signa_continentur_http')
no_http = func.remove_token('corpus', 'http', 'corpus_sine_http')

signum =  http
There are 71,659 tokens in './.repository/signa_continentur_http.txt' containing 'http'
The previous corpus had ('21052251', ',') tokens and the new one has
 20,980,592; this is 71,659 less.
The previous corpus had ('21052251', ',') tokens and the new one has 20,980,592; this is 71,659 less.
Writing the new, reduced corpus to './.repository/corpus_sine_http.txt' ...
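
The implementation of func.remove_token is not shown here. As a rough sketch of what such a step does, assuming it drops every token that contains the given substring, consider the hypothetical helper `split_by_substring` below.

```python
def split_by_substring(text, needle):
    """Separate the tokens of `text` into kept tokens and tokens containing `needle`."""
    kept, affected = [], []
    for token in text.split():
        if needle in token:
            affected.append(token)
        else:
            kept.append(token)
    return kept, affected


sample = "check http://a.example now and https://b.example later"
kept, affected = split_by_substring(sample, 'http')

print(f'{len(affected):,} tokens contain the substring; {len(kept):,} tokens remain')
print(' '.join(kept))
```

Joining the kept tokens with single spaces discards the original line breaks, so a real implementation that needs one tweet per line would have to work line by line.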


Let us also check for other means of file-transport like ftp and ssh.

In [120]:
import functions as func

func.string_affected_tokens('corpus', 'ftp', 'signa_continentur_ftp')
func.string_affected_tokens('corpus', 'ssh', 'signa_continentur_ssh')

signum =  ftp
There are 107 tokens in './.repository/signa_continentur_ftp.txt' containing 'ftp'
signum =  ssh
There are 1,673 tokens in './.repository/signa_continentur_ssh.txt' containing 'ssh'


True

There are 'ftp' tokens that can be removed. The tokens containing 'ssh', however, are not SSH references in the sense we mean; they are all parts of names. So we leave those as they are and only remove the ftp-related tokens.

In [121]:
no_ftp = func.remove_token(func.strip_file_name(no_http), 'ftp', 'corpus_sine_ftp')

The previous corpus had ('20980592', ',') tokens and the new one has
 20,980,489; this is 103 less.
The previous corpus had ('20980592', ',') tokens and the new one has 20,980,489; this is 103 less.
Writing the new, reduced corpus to './.repository/corpus_sine_ftp.txt' ...


Let us check the @ sign.

In [122]:
func.string_affected_tokens('corpus', '@', 'signa_continentur_@')

signum =  @
There are 797,271 tokens in './.repository/signa_continentur_@.txt' containing '@'


True

It's natural to have so many of these signs on Twitter. We remove all tokens that contain them, since they won't affect the sentiment classification anyhow.

In [123]:
no_at = func.remove_token(func.strip_file_name(no_ftp), '@', 'signa_continentur_@')

The previous corpus had ('20980489', ',') tokens and the new one has
 20,183,282; this is 797,207 less.
The previous corpus had ('20980489', ',') tokens and the new one has 20,183,282; this is 797,207 less.
Writing the new, reduced corpus to './.repository/signa_continentur_@.txt' ...


Now let us look at tokens containing #-tags.

In [124]:
func.string_affected_tokens('corpus', '#', 'signa_continentur_#')

signum =  #
There are 44,986 tokens in './.repository/signa_continentur_#.txt' containing '#'


True

Hashtags are also very common on Twitter, and some of them, like #MakesMeSmile, can convey a sentiment. So, we leave them as they are. Hopefully, later strategies will take care of them.

So, on to cleaning punctuation:

Since punctuation marks are not part of the vocabulary, we can remove them, but they overlap a lot with the emoticons. So, we will remove them only after we have dealt with the emoticons. Let us first find all emoticons and replace them with their names.

In [125]:
func.key_value_exchange(func.strip_file_name(no_at), const.emoticons, 'clavis_valorem_commutationem')

Writing the new, reduced corpus to './.repository/clavis_valorem_commutationem.txt' ...


'./.repository/clavis_valorem_commutationem.txt'
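
How func.key_value_exchange works internally is not shown. One way to do such a replacement, sketched here under the assumption that the emoticons are given as a dictionary like the one in the commented cell above, is to build a single regular expression with the longest emoticons first, so that a short one never matches inside a longer one. The helper `replace_emoticons` is hypothetical.

```python
import re

# A small stand-in for const.emoticons (assumed to map emoticon -> name)
emoticons = {
    ':)': 'smiling face',
    ':-)': 'smiling face',
    ':(': 'sad face',
    ":'(": 'crying face',
}


def replace_emoticons(text, mapping):
    """Replace each emoticon in `text` with its name, longest emoticons first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


result = replace_emoticons("great day :-) but rain :'( later :)", emoticons)
print(result)
```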

Now we can remove the punctuation marks or, more precisely, every character that is neither a letter nor a space. First, we count them.

In [126]:
func.count_regex(func.strip_file_name(no_at), r'[^a-zA-Z]', 'signa_continentur_non_literas')
# no_non_letters = func.remove_token(func.strip_file_name(no_at), r'[^a-zA-Z]', 'corpus_sine_non_literas')

The corpus contains 9,470,402 characters that match the regex '[^a-zA-Z]'.
This is 8.66% of the whole corpus.


True
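
For reference, counting the matches of a regular expression and their share of the text can be sketched as follows. This is not the code of func.count_regex, and how that function treats whitespace is not shown here. On raw text, the pattern '[^a-zA-Z]' also matches spaces and line breaks, so the sketch compares it with a pattern that excludes whitespace.

```python
import re


def count_matches(text, pattern):
    """Return the number of regex matches and their share of the text in percent."""
    matches = len(re.findall(pattern, text))
    share = 100 * matches / len(text) if text else 0.0
    return matches, share


sample = "ok, 2 go!"

with_spaces, share_with = count_matches(sample, r'[^a-zA-Z]')       # counts spaces too
without_spaces, share_without = count_matches(sample, r'[^a-zA-Z\s]')  # ignores whitespace

print(f'{with_spaces} matches ({share_with:.2f}%) vs. '
      f'{without_spaces} matches ({share_without:.2f}%)')
```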

<h5><a id=' Step 3.1.1.2.'>Step 3.4.1.2.</a> Remove @tags</h5>

Next, let's see how many @tags there are in the corpus. We write them all to the file 'at_signis.txt' and count the number of lines in the file.

In [127]:
# with open(const.repo_path+'/corpus_sine_delata.txt', 'r') as f:
#     text = f.read()
#     at_signis = pkg.re.findall(r'\S*@\S*', text)
#     print(f'There are {len(at_signis)} @-related tokens in the http free corpus')
#     with open(const.repo_path+'/at_signis.txt', 'w') as f:
#         for ad_signum in at_signis:
#             f.write(ad_signum + '\n')

So another 797,227 tokens are identified in the corpus and saved in the file 'at_signis.txt'; they make up 3.79% of the tokens in the original corpus.

In [128]:
797227/21052251*100

3.7868967076252322

Removing them from the corpus_sine_delata.txt file, we get the corpus_sine_signo.txt file, which has 20,183,438 tokens.

In [129]:
# corpus_sine_signo = pkg.re.sub(r'\S*@\S*', '', corpus_sine_delata)
# print(f'Original corpus has {len(dframe_corpus.split())} tokens and the new corpus has {len(corpus_sine_signo.split())} tokens')
# with open(const.repo_path+'/corpus_sine_signo.txt', 'w') as f:
#     f.write(corpus_sine_signo)
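
The regular expression used in the commented cells above can be tried out on a made-up line. This is only an illustration with an invented sample text.

```python
import re

text = "@alice thanks! mail me at bob@example.com or ask @carol"

# Every run of non-space characters that contains an '@'
at_tokens = re.findall(r'\S*@\S*', text)

# The same pattern removes those tokens from the text
cleaned = re.sub(r'\S*@\S*', '', text)

print(at_tokens)
print(cleaned.split())
```

As the sample shows, the pattern removes e-mail addresses as well as @-mentions.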

<h5><a id=' Step 3.1.1.3.'>Step 3.4.1.3.</a> Remove #-tags</h5>

We do the same for the hashtags '#'. First we save the tokens containing the '#' character to the file 'nullam_marcas.txt', and then we remove them from the corpus_sine_signo.txt file, which we save as 'corpus_liberum_nullam.txt'.

In [130]:
# with open(const.repo_path+'/corpus_sine_signo.txt', 'r') as f:
#     text = f.read()
#     nullam_marcas = pkg.re.findall(r'\S*#\S*', text)
#     print(f'There are {len(nullam_marcas)} #-related tokens in the hyperlink-and-@-free corpus')
#     with open(const.repo_path+'/nullam_marcas.txt', 'w') as f:
#         for nullam_marcam in nullam_marcas:
#             f.write(nullam_marcam + '\n')

Removing the hashtags as well, with 44,702 related tokens, we get a corpus that is free from hyperlinks, hashtags, and @-signs.

In [131]:
# corpus_liberum_nullam = pkg.re.sub(r'\S*#\S*', '', corpus_sine_signo)
# print(f'Original corpus has {len(dframe_corpus.split())} tokens and the new corpus has {len(corpus_liberum_nullam.split())} tokens')
# with open(const.repo_path+'/corpus_liberum_nullam.txt', 'w') as f:
#     f.write(corpus_liberum_nullam)

The '#' involves only

In [132]:
44702/21052251*100

0.21233833854631506

0.21% of the tokens in the original corpus


2.1. COUNT SMILEYS IN THE CORPUS... THEY CAN HELP WITH THE SENTIMENT CLASSIFICATION...

These are the most common smileys used on Twitter:

:) - Smiling face

:-) - Smiling face

;) - Winking face

;-) - Winking face

:( - Frowning face

:-( - Frowning face

:D - Grinning face

:d - Grinning face

:-D - Grinning face

:-d - Grinning face

:P - Sticking out tongue

:p - Sticking out tongue

:-P - Sticking out tongue

:-p - Sticking out tongue

:-o - Shocked face

:-O - Shocked face

:o - Shocked face

:O - Shocked face

:-| - Neutral face

:| - Neutral face

:-* - Kiss

:* - Kiss

:/ - Skeptical face

:-/ - Skeptical face

<3 - Heart

</3 - Broken heart

So we need to find if we have these strings in the corpus and get the statistics of their usage.
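
A first version of such statistics can be sketched with a counter over the tokens. The emoticon set below is a small subset of the list above, and the sample text is invented.

```python
from collections import Counter

# A small subset of the emoticon list
emoticons = {':)', ':-)', ';)', ':(', ':d', ':p', ':/', '<3'}

text = "love it :) :) so tired :( <3 lol:)"

# Count only tokens that are exactly an emoticon
counts = Counter(token for token in text.split() if token in emoticons)

for emoticon, count in counts.most_common():
    print(f'{emoticon}\t{count}')
```

This sketch only counts emoticons that stand alone as tokens; the ':)' glued to 'lol' is missed, so a substring search would be needed to catch such cases.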

2.2. THERE MUST BE A LIST OF THE MOST FREQUENTLY USED SMILEYS ON TWITTER. I THINK THERE IS A DICTIONARY OF TWITTER SMILEYS SOMEWHERE ON THE INTERNET

2.3. LOOK AT ALL TUPLES OF SPECIAL CHARACTERS THAT ARE OF LENGTH 3

2.4. LOOK AT ALL TUPLES OF SPECIAL CHARACTERS THAT ARE OF LENGTH 2

3. FIND OTHER SPECIAL CHARACTERS THAT MAY HAVE BEEN USED IN THE CORPUS

4. SEE IF YOU CAN FIND ANY ANOMALIES IN THE CORPUS THAT STILL CAN AFFECT THE SENTIMENT CLASSIFICATION

5. FIND ALL NON-LETTER CHARACTERS THAT ARE USED FOR BUILDING TOKENS, WHICH ARE REPEATED MORE THAN ONCE
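
For notes 2.3 and 2.4, one possible approach is to collect all runs of special characters and group them by length. The sketch below assumes that "special" means neither a word character nor whitespace; the helper `special_runs` and the sample text are hypothetical.

```python
import re
from collections import Counter


def special_runs(text, length):
    """Count maximal runs of special characters that have exactly `length` characters."""
    runs = re.findall(r'[^\w\s]+', text)
    return Counter(run for run in runs if len(run) == length)


text = "wow!!! really?? :-) ok... :-) no!"

print(special_runs(text, 3))
print(special_runs(text, 2))
```

Note that the underscore counts as a word character in Python's `\w`, so it is not treated as special here.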

In [133]:
#x2 = func.remove_token(func.strip_file_name(x1), 'www', 'corpus_sine_www')

In [134]:
#x3 = func.get_affected_tokens(func.strip_file_name(x2), '@', 'z_test_file')

In [135]:
#STOP

<h4><a id=' Step 3.3.2'>Step 3.3.2.:</a> Gather Unique Tokens</h4>


<h4><a id=' Step 3.3.3'>Step 3.3.3.:</a> Split & Count</h4>



<h1>UNDER CONSTRUCTION</h1>

We use the 'Counter' class from the Python module 'collections' to pair each unique token with its frequency in the 'corpus'. We choose to sort the tokens in descending order of frequency.

In [136]:
# token_counts = pkg.Counter(dframe_corpus.split())
# print(token_counts.most_common(10))  # print the 10 most common tokens
# # Sort the tokens and their frequencies in descending order
# sorted_counts = sorted(token_counts.items(), key=lambda item: item[1], reverse=True)
# # Print the 10 most common tokens
# print(sorted_counts[:10])
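
A runnable miniature of the cell above, on a toy corpus, looks as follows. The string `corpus` is invented for illustration.

```python
from collections import Counter

corpus = "good day good mood bad day good"

token_counts = Counter(corpus.split())

# most_common() returns (token, count) pairs sorted by descending count
top_two = token_counts.most_common(2)
print(top_two)
```

Since `most_common()` already returns the pairs in descending order of frequency, a separate call to `sorted` is not strictly needed.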

Then we write the result to a CSV file that we call 'token_frequencies.csv'

In [137]:
# # Save the sorted counts to a CSV file
# with open('token_frequencies.csv', 'w', newline='\n') as csvfile:
#     writer = pkg.csv.writer(csvfile)
#     writer.writerow(['Token', 'Count'])  # write the header row
#     for token, count in sorted_counts:
#         writer.writerow([token, count])
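
A self-contained sketch of the same export, written to a temporary directory, is shown below. It opens the file with `newline=''`, which is what the Python documentation of the `csv` module recommends for files passed to `csv.writer` and `csv.reader`. The toy counts are invented.

```python
import csv
import os
import tempfile
from collections import Counter

token_counts = Counter("good day good mood".split())

path = os.path.join(tempfile.mkdtemp(), 'token_frequencies.csv')

# Write the header row and then one (token, count) row per token
with open(path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Token', 'Count'])
    writer.writerows(token_counts.most_common())

# Read the file back to check what was written
with open(path, 'r', newline='', encoding='utf-8') as csvfile:
    rows = list(csv.reader(csvfile))
print(rows)
```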
        

Now we make a small word cloud of what is left of the tokens in the corpus

In [138]:
# # Load the word frequencies from a CSV file
# word_freqs = {}
# with open('token_frequencies.csv', 'r') as file:
#     reader = pkg.csv.reader(file)
#     new_header = next(reader)  # skip the header row
#     for row in reader:
#         # if the word frequency is greater than 26336, add it to the dictionary
#         if int(row[1]) > 26336.:
#             word_freqs[row[0]] = int(row[1])

# # Create the tags for the word cloud
# tags = pkg.make_tags([(word, 1.) for word in word_freqs.keys()], maxsize=80)

# # Set the tag sizes based on the word frequencies
# for tag in tags:
#     tag['size'] = int(word_freqs[tag['tag']] / max(word_freqs.values()) * 100)

# # Generate the image for the word cloud
# pkg.create_tag_image(tags, 'wordcloud_gt-26336.png', size=(600, 400), fontname='Lobster')

# pkg.Image(filename='./wordcloud_gt-26336.png')

<h3><a id=' Step 3'>Step 3.:</a> Cleaning with Regard to Sentiments</h3>

<h3><a id='step4'>Step 4.:</a> Learn from Google's Search Engine</h3>

1. Remove all unnecessary words from the corpus and find
2. Build dictionary
3. Code the words and phrases
4. Run ML algorithm

<h1><a id='sidetracksplitthedate'>Side Track</a> - Split the Date</h1>

Divide the 'date' into separate columns with the names 'day name', 'year', 'month', 'day', 'hour', 'minute', 'second'

Let us see how many time zones are mentioned in the table

In [139]:
# flagless_dframe['date'].str.extract(r'([A-Z]{3})', expand=False).value_counts()

This shows that we only have PDT as the time zone. We can thus skip this by removing it from all the strings in the 'date' column
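
The same check can be sketched without pandas. The date strings below are invented examples in the shape implied by the format string used later in this side track; the pattern is the one from the commented cell above.

```python
import re
from collections import Counter

dates = [
    "Mon Apr 06 22:19:45 PDT 2009",
    "Tue Apr 07 01:02:03 PDT 2009",
]

zones = Counter()
for date in dates:
    # The first run of three capital letters; day and month names have only one capital
    match = re.search(r'[A-Z]{3}', date)
    if match:
        zones[match.group(0)] += 1

print(zones)
```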

In [140]:
# zoneless_dframe = flagless_dframe.copy()
# zoneless_dframe['date'] = zoneless_dframe['date'].str.replace('PDT', '')
# zoneless_dframe.head()

Now we can divide the date up into seven different columns, for investigations other than the main subject of this project

In [141]:
# separated_dframe = zoneless_dframe.copy()
# # Convert the 'date' column to a datetime type
# separated_dframe['date'] =pkg.pd.to_datetime(separated_dframe['date'], format='%a %b %d %H:%M:%S %Y')
# # Extract individual date components into separate columns
# separated_dframe['year'] = separated_dframe['date'].dt.year
# separated_dframe['month'] = separated_dframe['date'].dt.month
# separated_dframe['day'] = separated_dframe['date'].dt.day
# separated_dframe['hour'] = separated_dframe['date'].dt.hour
# separated_dframe['minute'] = separated_dframe['date'].dt.minute
# separated_dframe['second'] = separated_dframe['date'].dt.second
# separated_dframe['weekday'] = separated_dframe['date'].dt.weekday
# # Drop the original 'date' column
# dateless_dframe = separated_dframe.drop('date', axis=1)
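
For a single date string, the same split can be sketched with the standard library alone. The sample string is an invented example, and the format string is the one from the cell above.

```python
from datetime import datetime

raw = "Mon Apr 06 22:19:45 PDT 2009"

# Remove the time zone and collapse the double space it leaves behind
zoneless = ' '.join(raw.replace('PDT', '').split())

parsed = datetime.strptime(zoneless, '%a %b %d %H:%M:%S %Y')

parts = {
    'year': parsed.year,
    'month': parsed.month,
    'day': parsed.day,
    'hour': parsed.hour,
    'minute': parsed.minute,
    'second': parsed.second,
    'weekday': parsed.weekday(),  # Monday is 0, Sunday is 6
}
print(parts)
```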

Now show the neutralized, zoneless dframe with separated date details

In [142]:
# dateless_dframe.sort_values(by='weekday', ascending=True).head(1599999)

Show the result, sorted by 'id' in descending order

In [143]:
# dateless_dframe.sort_values(by='id', ascending=False).head(10)

<h1><a id='sidetrackfollowup'>Side Track</a> Follow-Up:</h1>

**TODO:** Find the ten dates when the most tweets were generated

**TODO:** Find out the change in the frequency of tweets per day

**TODO:** See if there is a correlation between the sentiment of the tweet and the day it was made

**TODO:** See if there is a correlation between the sentiment of the tweet and the time of the day it was made; e.g. if the tweets tend to be more negative at night compared to daytime, and so on

<a id='testzone'><h1>TEST ZONE</h1></a>

In [144]:
# # extract the row with the id number 1467813782
# dframe.loc[dframe['id'] == 1467813782]

In [145]:
# # create a sample dataframe
# daf =pkg.pd.DataFrame({'col1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})
# display(daf)
# # count the occurrences of each value in the 'col1' column
# value_counts = daf['col1'].value_counts()
# display(value_counts)
# # filter the result to include only values that appear three times or more
# repeated_values = value_counts[value_counts >= 3]
# display(repeated_values)
# # print the repeated values
# print(repeated_values.index.tolist())

In [146]:
# import pkg_resources
# print(pkg_resources.resource_filename('graphviz', ''))

In [147]:
# import pydot

# def references_graph(obj, max_depth=3, too_many=10):
#     edges = []
#     nodes = set()
#     nodeid = 0

#     def get_node(obj, depth):
#         nonlocal nodeid
#         if depth == 0:
#             return None
#         if id(obj) in nodes:
#             return str(id(obj))
#         if len(nodes) >= too_many:
#             return None

#         node_name = repr(obj)[:30]
#         nodes.add(id(obj))
#         node = pydot.Node(str(nodeid), label=node_name, shape='box')
#         nodeid += 1

#         for ref in gc.get_referents(obj):
#             child_node = get_node(ref, depth - 1.)
#             if child_node is not None:
#                 edges.append((node, child_node))

#         return node

#     get_node(obj, max_depth)
#     graph = pydot.graph_from_edges(edges, directed=True)
#     graph.write_png('corpus_backrefs.png')

In [148]:
# format(9404204284, ',')

In [149]:
# nbr = format(930850284, ',')
# print(f'{format(930850284, ",")}')

In [150]:
# print(locals())
# print([name for name in locals()])
# all_names = [type(name) for name in locals().keys()]
# index = 0
# for name, content in globals().items():
#     print('<---------------------------- ', index,' -------------------------------------->')
#     index += 1
#     print(f'{name} : {content}')
# #    print(f'{name} = {locals()[name]}')
#     if index > 27:
#         break

In [151]:
# import random

# def bubble_sort(arr):
#     n = len(arr)
#     for i in range(n):
#         for j in range(0, n-i-1):
#             if arr[j] > arr[j+1] :
#                 arr[j], arr[j+1] = arr[j+1], arr[j]

# if __name__ == '__main__':
#     arr = [random.randint(1, 100) for _ in range(1000)]
#     bubble_sort(arr)

In [152]:
# def get_var_name(var):
#     """Return the name of a variable as a string."""
#     for name in globals():
#         print(name)
#         if id(globals()[name]) == id(var):
#             return name
#     return None

# # Example usage
# x = 42
# y = 'hello'
# print(get_var_name(x)) # Output: 'x'
# print(get_var_name(y)) # Output: 'y'

In [153]:
# func.my_func()

In [154]:
# from varname.helpers import Wrapper

# foo = Wrapper(dict())

# # foo.name == 'foo'
# # foo.value == {}
# foo.value['bar'] = 2

In [155]:
# from varname import varname
# def function():
#     return varname()

# func = function()  # func == 'func'
# print(func)

In [156]:
# def function():
#     # retrieve the variable name at the 2nd frame from this one
#     return varname(frame=1)

# func = function()  # func == 'func'
# print(varname())

In [157]:
# def wrapped():
#     print(varname(frame=1))
#     return function()

# def function():
#     print(varname(frame=2))
#     # retrieve the variable name at the 2nd frame from this one
#     return function_()

# def function_():
#     print(varname(frame=3))
#     holder = 'string'
#     return holder.varname(frame=4)

# func = wrapped() # func == 'func'
# print(func)

In [158]:
# # since v0.5.4
# def func():
#     return varname(multi_vars=True)

# a = func() # a == ('a',)
# a, b = func() # (a, b) == ('a', 'b')
# [a, b] = func() # (a, b) == ('a', 'b')

# # hierarchy is also possible
# a, (b, c) = func() # (a, b, c) == ('a', 'b', 'c')
# print(a, b, c)

In [159]:
# from varname.helpers import register

# @register
# def function():
#     koja = 'koja'
#     return koja.__varname__

# func = function() # func == 'func'

# print(func)
# # @register(frame=2)
# # def function():
# #     return __varname__

# # def wrapped():
# #     return function()

# # func = wrapped() # func == 'func'

In [160]:
# from varname import varname, nameof

# p = 12
# x = p
# f = nameof('hasanali') # 'varname'
# print(nameof(func))

In [161]:
# f[:-2]

In [162]:
# from varname import varname

# def func():
#   return varname()

# # In external uses
# x = func() # 'x'
# y = func() # 'y'
# print(x, y)
# print(nameof(x))
# print(nameof(y))

In [163]:
# import pandas as pd

# my_dataframe = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['a', 'b', 'c']})

# def process_dataframe(dataframe_name):
#     # retrieve the DataFrame object from its name using the globals() function
#     print(type(dataframe_name))
#     df = globals()[dataframe_name]
#     print(type(df))
#     print(df)
#     # do something with the DataFrame object
#     # ...

# process_dataframe('my_dataframe')

In [164]:
# import psutil

# # Get a list of all running processes
# processes = psutil.process_iter()

# # Iterate over the list of processes and find the one you want to kill
# for process in processes:
#     print(process.name())

In [165]:
# import sys
# ___aaa = dframe_corpus
# print(sys.getsizeof(___aaa))
# print(locals().keys())
# if '___aaa' in globals().keys():
#     print('___aaa is')
# if 'dframe_corpus' in globals().keys():
#     print('dframe_corpus is')
# ___ninanoo = ___aaa
# del ___aaa
# del dframe_corpus
# gc.collect()
# if '___aaa' in globals().keys():
#     print('___aaa is')
# if 'dframe_corpus' in globals().keys():
#     print('dframe_corpus is')
# print(locals().keys())

In [166]:
import string

'Http is not writTen liek Zis'.lower()

'http is not written liek zis'