# Data Preprocessing

In this notebook, we preprocess the Twitter data so that we can build a dictionary (a word list) and then encode each tweet as a list of word ids based on that dictionary.

In [1]:
# import packages
import numpy as np
import pandas as pd
import re

In [2]:
# open and read the data file
result = []
with open("../Dataset/Bdata.txt", "r", encoding='utf-8') as f:
    for line in f:
        # split each line into its tab-separated fields
        result.append(line.strip().split("\t"))

# drop the rows that are missing one of the fields
result = pd.DataFrame(result).dropna(axis=0)
print(result)

                        0            1         2  \
0      681563394940473347  amy schumer  negative   
1      675847244747177984  amy schumer  negative   
2      672827854279843840  amy schumer  negative   
3      662755012129529858  amy schumer  negative   
4      671502639671042048  amy schumer  negative   
...                   ...          ...       ...   
10547  638032969383309312         zayn  positive   
10548  634711870570500096         zayn  positive   
10549  637134671797690368         zayn  positive   
10550  636413565780557824         zayn  positive   
10551  634633336124776448         zayn  positive   

                                                       3  
0      @MargaretsBelly Amy Schumer is the stereotypic...  
1      @dani_pitter I mean I get the hype around JLaw...  
2      Amy Schumer at the #GQmenoftheyear2015 party i...  
3      Amy Schumer is on Sky Atlantic doing one of th...  
4      "Amy Schumer may have brought us Trainwreck, b...  
...                  

We don't need the first, numerical column, which holds the download id of each tweet.

In [3]:
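# keep the tweet text, sentiment and topic; copy so that columns added later do not write into a view of result
df = result.iloc[:,[3,2,1]].copy()
df.columns=['twitter','sent','topic']
print(df)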

                                                 twitter      sent  \
0      @MargaretsBelly Amy Schumer is the stereotypic...  negative   
1      @dani_pitter I mean I get the hype around JLaw...  negative   
2      Amy Schumer at the #GQmenoftheyear2015 party i...  negative   
3      Amy Schumer is on Sky Atlantic doing one of th...  negative   
4      "Amy Schumer may have brought us Trainwreck, b...  negative   
...                                                  ...       ...   
10547  tomorrow I've to wake up  early so Zayn's erfo...  positive   
10548  with Zayn gone I can now definitively say that...  positive   
10549  yo don't ever say that! god forbid! may it not...  positive   
10550  "you may call me a bad fan but I sobbed so har...  positive   
10551  "zayn's voice: c'mon guys you can do it, nobod...  positive   

             topic  
0      amy schumer  
1      amy schumer  
2      amy schumer  
3      amy schumer  
4      amy schumer  
...            ...  
10547       

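We treat each tweet like a short email and normalize its text in a similar way: URLs, account names, numbers and hashtags are each mapped to a single placeholder token. We also replace the topic name in the message with its own placeholder.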

In [4]:
def processTwitter(twitter,topic):
    twitter = twitter.lower()
    # replace the topic name with one placeholder; escape it so it is matched literally
    strinfo = re.compile(re.escape(topic.lower()))
    twitter = strinfo.sub('thistopc', twitter)
    # put all website addresses into the same dictionary entry
    strinfo = re.compile(r'(http|https)://[^\s]*')
    twitter = strinfo.sub('httpaddr', twitter)
    # put all twitter accounts into the same dictionary entry
    strinfo = re.compile(r'@[^\s]+')
    twitter = strinfo.sub('twitteraddr', twitter)
    # treat all numbers equally
    strinfo = re.compile(r'\d+')
    twitter = strinfo.sub('numbr', twitter)
    # treat all hashtags equally
    strinfo = re.compile(r'#[^\s]+')
    twitter = strinfo.sub('topc', twitter)
    return twitter
# Reference: https://blog.csdn.net/weixin_36815313/article/details/105149312
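
To make the effect of these substitutions concrete, here is a small self-contained sketch. The helper name `normalize_tweet` and the sample tweet are made up for illustration; the placeholder tokens are the ones used in this notebook.

```python
import re

def normalize_tweet(tweet, topic):
    """Hypothetical compact restatement of the substitutions described above."""
    text = tweet.lower()
    # topic name -> 'thistopc' (escaped so the topic is matched literally)
    text = re.sub(re.escape(topic.lower()), 'thistopc', text)
    # URLs -> 'httpaddr'
    text = re.sub(r'https?://\S*', 'httpaddr', text)
    # account mentions -> 'twitteraddr'
    text = re.sub(r'@\S+', 'twitteraddr', text)
    # numbers -> 'numbr'
    text = re.sub(r'\d+', 'numbr', text)
    # hashtags -> 'topc'
    text = re.sub(r'#\S+', 'topc', text)
    return text

# made-up sample tweet, not taken from the dataset
sample = "@someone Amy Schumer was great in 2015 #funny http://t.co/abc"
normalized = normalize_tweet(sample, "amy schumer")
print(normalized)
# twitteraddr thistopc was great in numbr topc httpaddr
```

The order matters: URLs are replaced before numbers, so digits inside a link never turn into separate `numbr` tokens.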

In [5]:
# start processing
df['Processed'] = [processTwitter(tweet, topic) for tweet, topic in zip(df['twitter'], df['topic'])]
print(df)

                                                 twitter      sent  \
0      @MargaretsBelly Amy Schumer is the stereotypic...  negative   
1      @dani_pitter I mean I get the hype around JLaw...  negative   
2      Amy Schumer at the #GQmenoftheyear2015 party i...  negative   
3      Amy Schumer is on Sky Atlantic doing one of th...  negative   
4      "Amy Schumer may have brought us Trainwreck, b...  negative   
...                                                  ...       ...   
10547  tomorrow I've to wake up  early so Zayn's erfo...  positive   
10548  with Zayn gone I can now definitively say that...  positive   
10549  yo don't ever say that! god forbid! may it not...  positive   
10550  "you may call me a bad fan but I sobbed so har...  positive   
10551  "zayn's voice: c'mon guys you can do it, nobod...  positive   

             topic                                          Processed  
0      amy schumer  twitteraddr thistopc is the stereotypical numb...  
1      amy schu

In [6]:
# save the processed data into a new file
df[['Processed','sent','topic']].to_csv('processed_twitter.txt',sep='\t',index=False)

We pause here to generate a dictionary from the processed data; it is stored in `dictionary.txt`, which we read back below. With the dictionary in hand, we replace each word with its id.
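
The dictionary generation itself is not shown in this notebook. The following is only a minimal sketch of how such a word list could be built, assuming tokens are split on non-word characters, tokens shorter than three characters are dropped, and the file has a `word`/`frequency` header line. The function name `build_dictionary` and the toy texts are hypothetical.

```python
import re
from collections import Counter

def build_dictionary(texts, min_len=3):
    """Count tokens of at least min_len characters, most frequent first."""
    counts = Counter()
    for text in texts:
        for token in re.split(r'\W', text):
            if len(token) >= min_len:
                counts[token] += 1
    # most_common() returns (word, count) pairs sorted by descending count
    return counts.most_common()

# toy input standing in for the 'Processed' column
toy_texts = ["twitteraddr thistopc is the best", "thistopc and the numbr"]
toy_dictionary = build_dictionary(toy_texts)

# tab-separated lines; the header row is an assumption about the file layout
lines = ["word\tfrequency"] + [f"{word}\t{count}" for word, count in toy_dictionary]
print("\n".join(lines))
```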

In [7]:
# open and read the dictionary list
result2 = []
with open("dictionary.txt", "r", encoding='utf-8') as f2:
    for line in f2:
        result2.append(line.strip().split("\t"))

# skip the first line of the file and drop incomplete rows
df2 = pd.DataFrame(result2[1:]).dropna(axis=0)
df2.columns = ['word','frequency']
print(df2)

              word frequency
0         thistopc     10878
1              the      8433
2      twitteraddr      3381
3              and      3221
4            numbr      3181
...            ...       ...
13459     misfired         1
13460          ruc         1
13461        newry         1
13462       mortar         1
13463       sobbed         1

[13464 rows x 2 columns]


As you can see, words with at most two letters are excluded: they appear very frequently and most of them carry no sentiment.

The dictionary is sorted by descending frequency. A word's id is its row index plus one, and id 0 is reserved for excluded tokens.
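
As a rough illustration of the id assignment, the sketch below uses only the first five dictionary rows printed above. The names `word_to_id` and `encode` are hypothetical, and mapping unknown words to 0 is an assumption made for this toy example, not something the notebook does.

```python
import re

# first rows of the dictionary shown above, in descending frequency order
words = ['thistopc', 'the', 'twitteraddr', 'and', 'numbr']

# id = position in the dictionary + 1, so 0 stays free for excluded tokens
word_to_id = {word: position + 1 for position, word in enumerate(words)}

def encode(text):
    """Map each token to its id; short or unknown tokens get 0 (assumption)."""
    codes = []
    for token in re.split(r'\W', text):
        if len(token) < 3:
            codes.append(0)
        else:
            codes.append(word_to_id.get(token, 0))
    return codes

codes = encode("twitteraddr thistopc is the numbr")
print(codes)
# [3, 1, 0, 2, 5]
```

Building the lookup as a plain `dict` once keeps each token lookup constant-time, instead of filtering the whole dictionary table for every token.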

In [8]:
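# find the matchup in the dictionary list
# word -> id lookup: id is the dictionary row index plus one, 0 marks excluded tokens
word_to_id = dict(zip(df2['word'], df2.index + 1))
text_codes = []
for text in df['Processed']:
    listOfTokens = re.split(r'\W', text)
    # np.int was removed from recent NumPy releases; the builtin int is equivalent here
    codes = np.zeros(len(listOfTokens), dtype=int)
    for j, token in enumerate(listOfTokens):
        # tokens shorter than 3 characters are excluded and keep the code 0
        if len(token) < 3:
            continue
        codes[j] = word_to_id[token]
    text_codes.append(codes)
df['text_code'] = text_codes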

In [9]:
print(df)

                                                 twitter      sent  \
0      @MargaretsBelly Amy Schumer is the stereotypic...  negative   
1      @dani_pitter I mean I get the hype around JLaw...  negative   
2      Amy Schumer at the #GQmenoftheyear2015 party i...  negative   
3      Amy Schumer is on Sky Atlantic doing one of th...  negative   
4      "Amy Schumer may have brought us Trainwreck, b...  negative   
...                                                  ...       ...   
10547  tomorrow I've to wake up  early so Zayn's erfo...  positive   
10548  with Zayn gone I can now definitively say that...  positive   
10549  yo don't ever say that! god forbid! may it not...  positive   
10550  "you may call me a bad fan but I sobbed so har...  positive   
10551  "zayn's voice: c'mon guys you can do it, nobod...  positive   

             topic                                          Processed  \
0      amy schumer  twitteraddr thistopc is the stereotypical numb...   
1      amy sc

In [10]:
# save the id list into a new file
df[['text_code','sent','topic']].to_csv('code.csv',index=False)