# Data Preprocessing

In this notebook, we preprocess the Twitter data so that we can build a dictionary list and then encode each tweet as a list of word ids based on that dictionary.

In [2]:
# import packages
import numpy as np
import pandas as pd
import re

In [4]:
# open and read the data file
result = []
with open("../Dataset/Adata.txt", "r", encoding="utf-8") as f:
    for line in f:
        # each line holds tab-separated fields: id, sentiment, tweet text
        result.append(line.strip().split("\t"))

# drop rows that are missing any of the fields
result = pd.DataFrame(result).dropna(axis=0)
print(result)

                       0         1  \
0     628949369883000832  negative   
1     628976607420645377  negative   
2     629023169169518592  negative   
3     629179223232479232  negative   
4     629186282179153920   neutral   
...                  ...       ...   
5995  639855845958885376  positive   
5996  639979760735662080   neutral   
5997  640196838260363269   neutral   
5998  640975710354567168  positive   
5999  641034340068143104   neutral   

                                                      2  
0     dear @Microsoft the newOoffice for Mac is grea...  
1     @Microsoft how about you make a system that do...  
2     I may be ignorant on this issue but... should ...  
3     Thanks to @microsoft, I just may be switching ...  
4     If I make a game as a #windows10 Universal App...  
...                                                 ...  
5995  @Racalto_SK ok good to know. Punting at MetLif...  
5996  everyone who sat around me at metlife was so a...  
5997  what giants or 

We don't need the numerical column, which holds the download id of each tweet.

In [14]:
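# keep only the tweet text and the sentiment label;
# .copy() avoids a SettingWithCopyWarning when columns are added later
df = result.iloc[:, [2, 1]].copy()
df.columns = ['twitter', 'sent']
print(df)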

                                                twitter      sent
0     dear @Microsoft the newOoffice for Mac is grea...  negative
1     @Microsoft how about you make a system that do...  negative
2     I may be ignorant on this issue but... should ...  negative
3     Thanks to @microsoft, I just may be switching ...  negative
4     If I make a game as a #windows10 Universal App...   neutral
...                                                 ...       ...
5995  @Racalto_SK ok good to know. Punting at MetLif...  positive
5996  everyone who sat around me at metlife was so a...   neutral
5997  what giants or niners fans would wanna go to t...   neutral
5998  Anybody want a ticket for tomorrow Colombia vs...  positive
5999  Mendez told me he'd drive me to MetLife on Sun...   neutral

[6000 rows x 2 columns]


We treat each tweet like a short email and normalize its text in a similar way.

In [15]:
def processTwitter(twitter):
    # map all website addresses to a single dictionary entry
    strinfo = re.compile(r'(http|https)://[^\s]*')
    twitter = strinfo.sub('httpaddr', twitter)
    # map all twitter accounts to a single dictionary entry
    strinfo = re.compile(r'@[^\s]+')
    twitter = strinfo.sub('twitteraddr', twitter)
    # treat all numbers equally
    strinfo = re.compile(r'\d+')
    twitter = strinfo.sub('numbr', twitter)
    # treat all twitter topics (hashtags) equally
    strinfo = re.compile(r'#[^\s]+')
    twitter = strinfo.sub('topc', twitter)
    return twitter.lower()
# Reference: https://blog.csdn.net/weixin_36815313/article/details/105149312
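
To make the effect of these substitutions concrete, here is a minimal sketch that applies the same four patterns, in the same order, to a single made-up tweet. The helper name `process_tweet_sketch` and the sample text are assumptions for illustration only; they are not part of the dataset or the notebook.

```python
import re

def process_tweet_sketch(text):
    # same four substitutions as processTwitter, in the same order
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'@[^\s]+', 'twitteraddr', text)
    text = re.sub(r'\d+', 'numbr', text)
    text = re.sub(r'#[^\s]+', 'topc', text)
    return text.lower()

# invented sample tweet, not taken from the dataset
sample = "Thanks @microsoft, see https://t.co/abc #windows10 in 2015!"
processed = process_tweet_sketch(sample)
print(processed)
# thanks twitteraddr see httpaddr topc in numbr!
```

Note that `@[^\s]+` matches up to the next whitespace, so punctuation attached to a mention (the comma after `@microsoft` here) is absorbed into `twitteraddr`.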

In [16]:
# start processing: apply the normalization to every tweet
df['Processed'] = df['twitter'].apply(processTwitter)
print(df)

                                                twitter      sent  \
0     dear @Microsoft the newOoffice for Mac is grea...  negative   
1     @Microsoft how about you make a system that do...  negative   
2     I may be ignorant on this issue but... should ...  negative   
3     Thanks to @microsoft, I just may be switching ...  negative   
4     If I make a game as a #windows10 Universal App...   neutral   
...                                                 ...       ...   
5995  @Racalto_SK ok good to know. Punting at MetLif...  positive   
5996  everyone who sat around me at metlife was so a...   neutral   
5997  what giants or niners fans would wanna go to t...   neutral   
5998  Anybody want a ticket for tomorrow Colombia vs...  positive   
5999  Mendez told me he'd drive me to MetLife on Sun...   neutral   

                                              Processed  
0     dear twitteraddr the newooffice for mac is gre...  
1     twitteraddr how about you make a system that d...

In [17]:
# save the processed data into a new file
df[['Processed','sent']].to_csv('processed_twitter.txt',sep='\t',index=False)

We pause here and generate a dictionary list from this processed data. The next step is to replace each word with its id in that dictionary.
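
The notebook does not show how the dictionary was generated, so the following is only a sketch of one way it could be done. It assumes tokens are split on non-word characters, that tokens shorter than three characters are dropped, and that the output has a `word`/`frequency` header line (the loading code below skips the first row). The toy `processed_tweets` list is invented for illustration.

```python
import re
from collections import Counter

# toy stand-in for the 'Processed' column (invented for illustration)
processed_tweets = [
    "twitteraddr the new office for mac is great",
    "the update is great httpaddr",
    "the mac update",
]

counts = Counter()
for text in processed_tweets:
    for token in re.split(r'\W', text):
        # assumption: tokens shorter than 3 characters are excluded
        if len(token) >= 3:
            counts[token] += 1

# most_common() returns (word, frequency) pairs in descending frequency
dictionary = counts.most_common()

# tab-separated lines with a header row, ready to be written to a file
lines = ["word\tfrequency"] + [f"{word}\t{freq}" for word, freq in dictionary]
print("\n".join(lines))
```

Writing `lines` to `dictionary.txt` would give a file in the shape the next cell expects, under the assumptions stated above.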

In [18]:
# open and read the dictionary list
result2 = []
with open("dictionary.txt", "r", encoding="utf-8") as f2:
    for line in f2:
        result2.append(line.strip().split("\t"))

# skip the first row, which is the header line of the file
df2 = pd.DataFrame(result2[1:]).dropna(axis=0)
df2.columns = ['word', 'frequency']
print(df2)

              word frequency
0              the      4749
1      twitteraddr      2247
2         httpaddr      2205
3            numbr      2137
4              and      1671
...            ...       ...
10503     multiple         1
10504      chicken         1
10505     moonbyul         1
10506        pouya         1
10507         peru         1

[10508 rows x 2 columns]


As you can see, words with no more than two letters are excluded: they appear too frequently, and most of them carry no sentiment.

The dictionary list is sorted by descending frequency.
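
As a small worked example of the word-id replacement, the sketch below encodes one sentence against a five-word toy dictionary. A word's id is its position in the frequency-sorted list plus one, and id 0 marks excluded tokens. The `encode` helper, the toy word list, and the choice to give unknown words id 0 are assumptions for illustration.

```python
import re
import numpy as np

# toy dictionary in descending-frequency order (invented for illustration)
words = ["the", "twitteraddr", "httpaddr", "numbr", "and"]
# id = position in the sorted list + 1, so that 0 stays free for excluded tokens
word_to_id = {word: rank + 1 for rank, word in enumerate(words)}

def encode(text):
    tokens = re.split(r'\W', text)
    # short tokens and (by assumption) unknown words are mapped to 0
    ids = [word_to_id.get(t, 0) if len(t) >= 3 else 0 for t in tokens]
    return np.array(ids, dtype=int)

codes = encode("twitteraddr and the numbr of it")
print(codes.tolist())
# [2, 5, 1, 4, 0, 0]
```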

In [19]:
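# find the matchup in the dictionary list
# word -> 1-based dictionary position; id 0 is reserved for excluded tokens
word_to_id = {word: idx + 1 for idx, word in zip(df2.index, df2['word'])}
text_codes = []
for text in df['Processed']:
    listOfTokens = re.split(r'\W', text)
    # np.int was removed in NumPy 1.24; the built-in int is equivalent here
    codes = np.zeros(len(listOfTokens), dtype=int)
    for j, token in enumerate(listOfTokens):
        # tokens shorter than 3 characters are excluded and keep id 0
        if len(token) < 3:
            continue
        # a token missing from the dictionary also gets id 0 instead of raising IndexError
        codes[j] = word_to_id.get(token, 0)
    text_codes.append(codes)
df['text_code'] = pd.Series(text_codes, index=df.index, dtype=object)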

In [21]:
print(df)

                                                twitter      sent  \
0     dear @Microsoft the newOoffice for Mac is grea...  negative   
1     @Microsoft how about you make a system that do...  negative   
2     I may be ignorant on this issue but... should ...  negative   
3     Thanks to @microsoft, I just may be switching ...  negative   
4     If I make a game as a #windows10 Universal App...   neutral   
...                                                 ...       ...   
5995  @Racalto_SK ok good to know. Punting at MetLif...  positive   
5996  everyone who sat around me at metlife was so a...   neutral   
5997  what giants or niners fans would wanna go to t...   neutral   
5998  Anybody want a ticket for tomorrow Colombia vs...  positive   
5999  Mendez told me he'd drive me to MetLife on Sun...   neutral   

                                              Processed  \
0     dear twitteraddr the newooffice for mac is gre...   
1     twitteraddr how about you make a system that d.

In [22]:
# save the id list into a new file
df[['text_code','sent']].to_csv('code.csv',index=False)
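
Because `text_code` holds NumPy arrays, `to_csv` stores each one as text. The sketch below shows one way to turn those strings back into integer arrays when the file is read again. It assumes each array was written in NumPy's printed form, such as `[2 5 1 4 0 0]`. The in-memory CSV and the `parse_codes` helper are invented for illustration.

```python
import io
import numpy as np
import pandas as pd

# in-memory stand-in for code.csv (invented rows for illustration)
csv_text = "text_code,sent\n[2 5 1 4 0 0],negative\n[1 0 3],neutral\n"
loaded = pd.read_csv(io.StringIO(csv_text))

def parse_codes(cell):
    # strip the surrounding brackets, then split on any whitespace
    return np.array([int(tok) for tok in cell.strip("[]").split()], dtype=int)

loaded["text_code"] = loaded["text_code"].apply(parse_codes)
print(loaded["text_code"][0].tolist())
# [2, 5, 1, 4, 0, 0]
```

Saving the ids as a space-separated string (or in a binary format) would avoid this parsing step, at the cost of changing the output file layout.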