# 2. Data handling and visualization

IN GENERAL: INTRODUCE THINGS HERE WHICH WE NEED IN SESSIONS 3-14. THAT MEANS, WHENEVER WE 

### Textbooks & sources

- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://www.pythonlikeyoumeanit.com/index.html

### Notes

- The focus here is on flat data structures (Pandas dataframes) and mathematical data structures (NumPy arrays). Hierarchical data structures (JSON and HTML) are covered in session 4.
- Open question: should regular expressions also be introduced here, rather than inside the NLP functions?

## 2.1. Essentials

https://www.pythonlikeyoumeanit.com/module_2.html

BOX: OBJECT-ORIENTED PROGRAMMING
https://www.pythonlikeyoumeanit.com/module_4.html

## 2.2. Pandas

- https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html

### 2.2.1. TweetsCOV19 dataset

https://data.gesis.org/tweetscov19/

- Tweet Id: Long.
- Username: String. Encrypted for privacy reasons.
- Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
- #Followers: Integer.
- #Friends: Integer.
- #Retweets: Integer.
- #Favorites: Integer.
- Entities: String. For each entity, we aggregated the original text, the annotated entity and the score produced by the FEL library. Entities are separated from one another by the char ";". Within each entity, the three parts are separated by the char ":", so that an entity is stored as "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
- Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We split these two numbers by the whitespace char " ". Positive sentiment is stored first, then negative sentiment (e.g. "2 -1").
- Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with the whitespace char " ". If no mentions appear, we have stored "null;".
- Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with the whitespace char " ". If no hashtags appear, we have stored "null;".
- URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".
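To make the field formats above concrete, here is a minimal sketch that parses one `Entities` string and one `Sentiment` string. The helper names `parse_entities` and `parse_sentiment` and the example values are invented for illustration; they are not part of the dataset documentation.

```python
def parse_entities(cell):
    """Split 'text:entity:score;...' into a list of (text, entity, score) tuples."""
    if cell == "null;":
        return []
    triples = []
    for item in cell.split(";"):
        if not item:  # skip the empty piece after the trailing ';'
            continue
        # Split from the right so that a ':' inside the original text survives.
        text, entity, score = item.rsplit(":", 2)
        triples.append((text, entity, float(score)))
    return triples


def parse_sentiment(cell):
    """Split a string such as '2 -1' into (positive, negative) integers."""
    pos, neg = cell.split()
    return int(pos), int(neg)


# Made-up example values in the documented format:
entities = parse_entities("covid 19:Coronavirus_disease_2019:-1.54;china:China:-2.11;")
sentiment = parse_sentiment("2 -1")
print(entities)
print(sentiment)
```

Splitting from the right with `rsplit(":", 2)` is a design choice of this sketch: it assumes that the annotated entity and the score never contain a ":" themselves.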


Download the file https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz and store it in the data/tweetscov19/ directory.

In [1]:
import pandas as pd
import numpy as np

In [2]:
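# Path assumes the file was stored as described above; pandas infers gzip compression from the '.gz' suffix.
tweets = pd.read_csv('data/tweetscov19/TweetsCOV19_052020.tsv.gz', sep='\t', header=None)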

In [3]:
tweets.columns = ['tweet_id', 'username', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'entities', 'sentiment', 'mentions', 'hashtags', 'urls']

In [4]:
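# Dropping the erroneous rows.
# Note: tweets.index[n] is a positional lookup in the current index, so each
# position below refers to the dataframe as it is after the preceding drops.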


# Wrong hashtags:

tweets.drop(tweets.index[328062], inplace=True)
tweets.drop(tweets.index[605051], inplace=True)
tweets.drop(tweets.index[713562], inplace=True)
tweets.drop(tweets.index[1891877], inplace=True)

# Wrong mentions:

tweets.drop(tweets.index[614876], inplace=True)
tweets.drop(tweets.index[1183492], inplace=True)
tweets.drop(tweets.index[1681823], inplace=True)

In [5]:
# Splitting 'sentiment' into pos and neg and deleting 'sentiment' column:

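# str.split() without arguments splits on whitespace; expand=True returns one column per part.
# The values stay strings, as in the raw data.
sentiment = tweets['sentiment'].str.split(expand=True)
tweets['sentiment_pos'] = sentiment[0]
tweets['sentiment_neg'] = sentiment[1]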

del tweets['sentiment']

# tweets

In [6]:
# Putting values of hashtags column into lists:
# Note: 'null;' and NaN values are replaced with ['']

def f1(cell):
    # Missing values are read as float NaN; 'null;' marks an empty field.
    if cell == 'null;' or isinstance(cell, float):
        return ['']
    return cell.split()

tweets['hashtags'] = tweets['hashtags'].apply(f1)




# Putting values of mentions column into lists:
# Note: 'null;' and NaN values are replaced with ['']

tweets['mentions'] = tweets['mentions'].apply(f1)




# Putting values of entities column into lists:
# Note: 'null;' values are replaced with ['']

def f2(cell):
    if cell == 'null;':
        return ['']
    # Drop the empty piece after the trailing ';':
    return cell.split(';')[:-1]

tweets['entities'] = tweets['entities'].apply(f2)




# Putting values of urls column into lists:
# Note: 'null;' and NaN values are replaced with ['']

def f3(cell):
    # Missing values are read as float NaN; 'null;' marks an empty field.
    if cell == 'null;' or isinstance(cell, float):
        return ['']
    # Drop the piece after the final ':-:' separator:
    return cell.split(':-:')[:-1]

tweets['urls'] = tweets['urls'].apply(f3)

In [7]:
# Creating users dataframe:

# Note: This may take around 90 seconds to run

followers_max = tweets.loc[tweets.groupby('username')['followers'].idxmax()]
followers_max = followers_max.reset_index()

friends_max = tweets.loc[tweets.groupby('username')['friends'].idxmax()]
friends_max = friends_max.reset_index()

users = pd.DataFrame()
users['username'] = friends_max['username']
users['followers_max'] = followers_max['followers']
users['friends_max'] = friends_max['friends']
users = users.sort_values('followers_max', ascending = False).reset_index()
del users['index']

users

| | username | followers_max | friends_max |
|---|---|---|---|
| 0 | c1d4d177b4028f2b6ea90a3617c32fb6 | 117926717 | 606040 |
| 1 | 0b64e075d55e5221457d3e22ba3dcc14 | 111636059 | 299530 |
| 2 | 7cd534d396546a50ddd2dea9ee7f9145 | 108555597 | 224 |
| 3 | a075253a703c963c96f819be90e82a67 | 81495144 | 123005 |
| 4 | 75224fc65ae453fe9ec3ca855cd8619b | 80751709 | 46 |
| ... | ... | ... | ... |
| 1117992 | 0a3f65725f4c932569df55778c366cd6 | 0 | 1 |
| 1117993 | d609e811b62e6ad00238038db86dcb24 | 0 | 42 |
| 1117994 | 317209898026d144b894547ba2c30615 | 0 | 1 |
| 1117995 | a7646f491896749f18813387d078b79c | 0 | 1 |
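The per-user maxima can also be computed with a single `groupby(...).max()`. The following is a sketch on an invented toy dataframe (`toy` and `toy_users` are placeholder names); it has not been timed against the full dataset, and rows with equal `followers_max` may come out in a different order than in the cell above.

```python
import pandas as pd

# Toy stand-in for the tweets dataframe (values are invented):
toy = pd.DataFrame({
    "username": ["a", "b", "a", "c", "b"],
    "followers": [10, 5, 12, 0, 7],
    "friends": [3, 8, 2, 1, 9],
})

toy_users = (
    toy.groupby("username")[["followers", "friends"]]
    .max()  # column-wise maximum per user
    .rename(columns={"followers": "followers_max", "friends": "friends_max"})
    .sort_values("followers_max", ascending=False)
    .reset_index()  # turn the 'username' index back into a column
)
print(toy_users)
```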


In [8]:
# Creating tweets dataframe:

# Note: This may take around 4 minutes to run, the line of code that causes this is specified with a comment below.

tweets_table = tweets.copy()
tweets_table.rename(columns = {'tweet_id':'identifier', 'username':'user_id'}, inplace = True)

temp_users = users.reset_index()
del temp_users['followers_max']
del temp_users['friends_max']

merged = pd.merge(temp_users, tweets_table, left_on='username', right_on='user_id', how='left').drop('user_id', axis=1)
# del temp_users
merged = merged.rename(columns={"index": "user_id"})
del merged['username']

modified_timestamps = pd.to_datetime(merged['timestamp']) # This line takes around 4 minutes to run.

merged['modified_timestamps'] = modified_timestamps
merged = merged.sort_values(by=['modified_timestamps']).reset_index()
del merged['modified_timestamps']
del merged['index']

cols = ['identifier'] + ['user_id'] + [col for col in merged if (col != 'identifier' and col != 'user_id')]
merged = merged[cols]
tweets_sorted = merged
tweets_table = tweets_sorted.copy()

del tweets_table['entities']
del tweets_table['mentions']
del tweets_table['hashtags']
del tweets_table['urls']
del merged

tweets_table

| | identifier | user_id | timestamp | followers | friends | retweets | favorites | sentiment_pos | sentiment_neg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1255980246370676737 | 9835 | Thu Apr 30 22:00:00 +0000 2020 | 200432 | 1880 | 14 | 50 | 2 | -3 |
| 1 | 1255980246995570692 | 219274 | Thu Apr 30 22:00:00 +0000 2020 | 1714 | 112 | 37 | 46 | 2 | -1 |
| 2 | 1255980248161714177 | 11861 | Thu Apr 30 22:00:00 +0000 2020 | 140395 | 350 | 2 | 4 | 1 | -1 |
| 3 | 1255980247683674113 | 377 | Thu Apr 30 22:00:00 +0000 2020 | 6149624 | 462 | 32 | 104 | 1 | -1 |
| 4 | 1255980248728035329 | 14943 | Thu Apr 30 22:00:00 +0000 2020 | 120440 | 69187 | 78 | 90 | 1 | -2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1912058 | 1267214225270685696 | 789616 | Sun May 31 21:59:49 +0000 2020 | 118 | 419 | 0 | 0 | 1 | -4 |
| 1912059 | 1267214225283231744 | 753634 | Sun May 31 21:59:49 +0000 2020 | 149 | 341 | 0 | 0 | 2 | -3 |
| 1912060 | 1267214229469310978 | 322417 | Sun May 31 21:59:50 +0000 2020 | 1465 | 1566 | 0 | 0 | 1 | -1 |
| 1912061 | 1267214242052288520 | 700178 | Sun May 31 21:59:53 +0000 2020 | 205 | 218 | 0 | 0 | 3 | -1 |
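The timestamps follow one fixed pattern, so `pd.to_datetime` can be given an explicit `format` instead of having to infer it. This usually speeds up parsing considerably, though the effect has not been measured on this dataset. A sketch on two timestamps in the documented format (the variable names are placeholders):

```python
import pandas as pd

stamps = pd.Series([
    "Sun May 31 21:59:49 +0000 2020",
    "Thu Apr 30 22:00:00 +0000 2020",
])

# %a = abbreviated weekday, %b = abbreviated month name, %z = UTC offset
parsed = pd.to_datetime(stamps, format="%a %b %d %H:%M:%S %z %Y")

# Row labels in chronological order:
order = parsed.sort_values().index.tolist()
print(parsed)
print(order)
```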


In [9]:
# Creating entities dataframe:

# Count how often each distinct entity string occurs across all tweets:
counts = pd.Series(np.concatenate(tweets['entities'].values)).value_counts()

# Skip the first row: it holds the '' placeholder of tweets without entities,
# which is assumed to be the most frequent value.
counts = counts.iloc[1:]

# Split 'original_text:annotated_entity:score' into its three parts:
entities = pd.Series(counts.index)
parts = entities.str.split(':', expand=True)

entities_table = pd.DataFrame({
    'entities': entities,
    'original': parts[0],
    'annotated': parts[1],
    'score': parts[2],
    'selections': counts.to_numpy(),
})
    
entities_table

| | entities | original | annotated | score | selections |
|---|---|---|---|---|---|
| 0 | covid 19:Coronavirus_disease_2019:-1.535776454... | covid 19 | Coronavirus_disease_2019 | -1.535776454600282 | 140238 |
| 1 | quarantine:Quarantine:-2.3096035868012508 | quarantine | Quarantine | -2.3096035868012508 | 70605 |
| 2 | china:China:-2.113921624336916 | china | China | -2.113921624336916 | 57153 |
| 3 | social distancing:Social_distancing:-1.4103273... | social distancing | Social_distancing | -1.4103273474020743 | 38290 |
| 4 | ppe:Philosophy%2C_politics_and_economics:-2.48... | ppe | Philosophy%2C_politics_and_economics | -2.481280260595 | 16471 |
| ... | ... | ... | ... | ... | ... |
| 181964 | steve biko:Steve_Biko:-1.1217053124820895 | steve biko | Steve_Biko | -1.1217053124820895 | 1 |
| 181965 | jack nicklaus:Jack_Nicklaus:-0.810259216283029 | jack nicklaus | Jack_Nicklaus | -0.810259216283029 | 1 |
| 181966 | fine living:Cooking_Channel:-2.3010870202064426 | fine living | Cooking_Channel | -2.3010870202064426 | 1 |
| 181967 | 3des:Triple_DES:-2.5412059066401578 | 3des | Triple_DES | -2.5412059066401578 | 1 |


In [10]:
# Creating mentions dataframe:

# Count how often each distinct mention occurs across all tweets:
counts = pd.Series(np.concatenate(tweets['mentions'].values)).value_counts()

mentions_table = pd.DataFrame({'mentions': counts.index, 'selections': counts.to_numpy()})

# Drop the first row: it holds the '' placeholder of tweets without mentions,
# which is assumed to be the most frequent value.
mentions_table = mentions_table.drop(mentions_table.index[0]).reset_index(drop=True)

mentions_table

| | mentions | selections |
|---|---|---|
| 0 | realDonaldTrump | 37878 |
| 1 | PMOIndia | 6361 |
| 2 | narendramodi | 6342 |
| 3 | jaketapper | 5892 |
| 4 | YouTube | 5658 |
| ... | ... | ... |
| 678371 | ListenShahid1 | 1 |
| 678372 | Fingerz00 | 1 |
| 678373 | AinsleyFoods | 1 |
| 678374 | ImtiazTyab | 1 |


In [11]:
# Creating hashtags dataframe:

# Count how often each distinct hashtag occurs across all tweets:
counts = pd.Series(np.concatenate(tweets['hashtags'].values)).value_counts()

hashtags_table = pd.DataFrame({'hashtags': counts.index, 'selections': counts.to_numpy()})

# Drop the first row: it holds the '' placeholder of tweets without hashtags,
# which is assumed to be the most frequent value.
hashtags_table = hashtags_table.drop(hashtags_table.index[0]).reset_index(drop=True)

hashtags_table

| | hashtags | selections |
|---|---|---|
| 0 | COVID19 | 67421 |
| 1 | coronavirus | 30332 |
| 2 | Covid_19 | 11032 |
| 3 | covid19 | 10648 |
| 4 | null; | 9477 |
| ... | ... | ... |
| 351602 | BeckettandQuarantine | 1 |
| 351603 | 366DaysOfWords | 1 |
| 351604 | Pioppi | 1 |
| 351605 | PhotographyGame | 1 |


In [12]:
# Creating urls dataframe:

# Count how often each distinct URL occurs across all tweets.
# explode() puts every list element into its own row, so every tweet is
# covered without splitting the column into chunks by hand.
counts = tweets['urls'].explode().value_counts()

urls_table = pd.DataFrame({'urls': counts.index, 'selections': counts.to_numpy()})

# Drop the first row: it holds the '' placeholder of tweets without URLs,
# which is assumed to be the most frequent value.
urls_table = urls_table.drop(urls_table.index[0]).reset_index(drop=True)

urls_table

| | urls | selections |
|---|---|---|
| 0 | https://www.twittascope.com/?sign=5 | 549 |
| 1 | https://api.whatsapp.com/send?phone=9190393567... | 368 |
| 2 | http://rebrand.ly/work-2020 | 286 |
| 3 | https://www.twittascope.com/?sign=6 | 271 |
| 4 | https://redcross.give.asia/campaign/essentials... | 260 |
| ... | ... | ... |
| 420617 | https://baltimore.cbslocal.com/2020/05/23/ocea... | 1 |
| 420618 | https://mol.im/a/8350307 | 1 |
| 420619 | https://iq.cash | 1 |
| 420620 | https://block.fiverr.com/index.html?url=aHR0cD... | 1 |


In [13]:
# Creating tweets_entities dataframe:

tweets_entities = tweets_sorted['entities'].reset_index()

lens = list(map(len, tweets_entities['entities'].values))
tweets_entities = pd.DataFrame({'tweet_id': np.repeat(tweets_entities['index'], lens), 'entities': np.concatenate(tweets_entities['entities'].values)})

entities_table = entities_table.reset_index()
tweets_entities = tweets_entities[tweets_entities.entities != '']

merged = pd.merge(entities_table, tweets_entities, left_on='entities', right_on='entities', how='right')

merged.rename(columns = {'index':'entity_id'}, inplace = True)

del merged['entities']
del merged['original']
del merged['annotated']
del merged['score']
del merged['selections']

cols = ['tweet_id'] + ['entity_id']
merged = merged[cols]
tweets_entities = merged.copy()

del entities_table['index']

tweets_entities

| | tweet_id | entity_id |
|---|---|---|
| 0 | 0 | 16611 |
| 1 | 0 | 6716 |
| 2 | 0 | 383 |
| 3 | 1 | 9942 |
| 4 | 2 | 9554 |
| ... | ... | ... |
| 2757735 | 1912060 | 16 |
| 2757736 | 1912061 | 851 |
| 2757737 | 1912061 | 1752 |
| 2757738 | 1912061 | 130908 |


In [14]:
# Creating tweets_hashtags dataframe:

tweets_hashtags = tweets_sorted['hashtags'].reset_index()

lens = list(map(len, tweets_hashtags['hashtags'].values))
tweets_hashtags = pd.DataFrame({'tweet_id': np.repeat(tweets_hashtags['index'], lens), 'hashtags': np.concatenate(tweets_hashtags['hashtags'].values)})

hashtags_table = hashtags_table.reset_index()
tweets_hashtags = tweets_hashtags[tweets_hashtags.hashtags != '']

merged = pd.merge(hashtags_table, tweets_hashtags, left_on='hashtags', right_on='hashtags', how='right')

merged.rename(columns = {'index':'hashtag_id'}, inplace = True)
del merged['hashtags']
del merged['selections']

cols = ['tweet_id'] + ['hashtag_id']
merged = merged[cols]
tweets_hashtags = merged.copy()

del hashtags_table['index']

tweets_hashtags

| | tweet_id | hashtag_id |
|---|---|---|
| 0 | 0 | 438 |
| 1 | 1 | 179308 |
| 2 | 1 | 8943 |
| 3 | 1 | 14935 |
| 4 | 1 | 10668 |
| ... | ... | ... |
| 1551213 | 1912035 | 241 |
| 1551214 | 1912035 | 4688 |
| 1551215 | 1912047 | 9525 |
| 1551216 | 1912051 | 674 |


In [15]:
# Creating tweets_mentions dataframe:

tweets_mentions = tweets_sorted['mentions'].reset_index()

lens = list(map(len, tweets_mentions['mentions'].values))
tweets_mentions = pd.DataFrame({'tweet_id': np.repeat(tweets_mentions['index'], lens), 'mentions': np.concatenate(tweets_mentions['mentions'].values)})

mentions_table = mentions_table.reset_index()
tweets_mentions = tweets_mentions[tweets_mentions.mentions != '']

merged = pd.merge(mentions_table, tweets_mentions, left_on='mentions', right_on='mentions', how='right')

merged.rename(columns = {'index':'mention_id'}, inplace = True)

del merged['mentions']
del merged['selections']

cols = ['tweet_id'] + ['mention_id']
merged = merged[cols]
tweets_mentions = merged.copy()

del mentions_table['index']

tweets_mentions

| | tweet_id | mention_id |
|---|---|---|
| 0 | 3 | 29152 |
| 1 | 8 | 108796 |
| 2 | 9 | 15649 |
| 3 | 13 | 440995 |
| 4 | 17 | 435743 |
| ... | ... | ... |
| 2009441 | 1912060 | 90696 |
| 2009442 | 1912061 | 392590 |
| 2009443 | 1912061 | 392589 |
| 2009444 | 1912061 | 42533 |


In [16]:
# Creating tweets_urls dataframe:

# One row per (tweet, URL) pair. explode() puts every list element into its own
# row, so every tweet is covered without splitting the column into chunks by hand;
# dropna() removes the rows that explode() creates for empty lists.
tweets_urls = tweets_sorted['urls'].explode().dropna().reset_index()
tweets_urls.columns = ['tweet_id', 'urls']

urls_table = urls_table.reset_index()
tweets_urls = tweets_urls[tweets_urls.urls != '']

merged = pd.merge(urls_table, tweets_urls, left_on='urls', right_on='urls', how='right')

merged.rename(columns = {'index':'url_id'}, inplace = True)

del merged['urls']
del merged['selections']

cols = ['tweet_id'] + ['url_id']
merged = merged[cols]
tweets_urls = merged.copy()

del urls_table['index']

tweets_urls

| | tweet_id | url_id |
|---|---|---|
| 0 | 0 | 64962 |
| 1 | 2 | 11746 |
| 2 | 3 | 336798 |
| 3 | 4 | 261083 |
| 4 | 5 | 44481 |
| ... | ... | ... |
| 534002 | 1912002 | 11972 |
| 534003 | 1912010 | 390206 |
| 534004 | 1912017 | 8366 |
| 534005 | 1912028 | 73475 |


### 2.2.2. Working with a single dataframe

In [15]:
# read/save
# describe()
# changing index and column names
# grouping
# using and resetting the index
# categorize series: categories and codes
# matrix to edgelist and vice versa
# zip
# columns into dict
# datetime
# ...
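
A minimal sketch of three of the topics listed above (categories and codes, grouping, edgelist to matrix and back), on an invented toy dataframe. All names and values are placeholders and not part of TweetsCOV19.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "hashtag": ["covid19", "covid19", "lockdown", "covid19"],
})

# Categorize a series: 'categories' holds the distinct values,
# 'codes' holds one integer per row pointing into the categories.
cat = toy["user"].astype("category")
print(list(cat.cat.categories))
print(cat.cat.codes.tolist())

# Grouping: number of rows per user.
per_user = toy.groupby("user").size()

# Edgelist to matrix: rows are users, columns are hashtags, cells are counts.
matrix = pd.crosstab(toy["user"], toy["hashtag"])

# Matrix back to edgelist: one row per nonzero cell.
edgelist = matrix.stack().reset_index(name="weight")
edgelist = edgelist[edgelist["weight"] > 0].reset_index(drop=True)
print(edgelist)
```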

### 2.2.3. Working with multiple dataframes

In [24]:
# merge split concat etc
# ...
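
A minimal sketch of `concat` and `merge` on invented toy dataframes (all `toy_*` names and values are placeholders):

```python
import pandas as pd

toy_users = pd.DataFrame({"user_id": [0, 1], "username": ["alice", "bob"]})
toy_tweets_a = pd.DataFrame({"tweet_id": [10, 11], "user_id": [1, 0]})
toy_tweets_b = pd.DataFrame({"tweet_id": [12], "user_id": [2]})

# concat: stack dataframes with the same columns on top of each other.
all_tweets = pd.concat([toy_tweets_a, toy_tweets_b], ignore_index=True)

# merge: look up the username of each tweet. how="left" keeps every tweet,
# including those whose user_id has no match in toy_users.
joined = all_tweets.merge(toy_users, on="user_id", how="left")
print(joined)
```

With `how="inner"` the tweet without a matching user would be dropped instead of receiving a missing username.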

## 2.3. NumPy

- https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html
- https://www.pythonlikeyoumeanit.com/module_3.html

In [25]:
# read/save
# relationship to pandas
# ...
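
A minimal sketch of the two topics listed above, on an invented toy dataframe: moving between a dataframe and a NumPy array, and saving and loading an array. The file name `toy.npy` is a placeholder and is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"followers": [10, 200, 3], "friends": [5, 50, 1]})

# Relationship to pandas: a dataframe of numbers can be viewed as a 2-D array.
arr = df.to_numpy()
ratio = arr[:, 0] / arr[:, 1]  # element-wise division of two columns

# Save to and read from NumPy's binary .npy format.
path = os.path.join(tempfile.mkdtemp(), "toy.npy")
np.save(path, arr)
loaded = np.load(path)

# Back to a dataframe; the column names have to be supplied again.
back = pd.DataFrame(loaded, columns=df.columns)
print(ratio)
```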

## 2.4. SciPy

In [1]:
# sparse matrices
# matrix multiplication
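
A minimal sketch of both topics, assuming a link table of (tweet_id, hashtag_id) pairs like the ones built in section 2.2.1. The pairs used here are invented.

```python
import numpy as np
from scipy import sparse

# tweet_id / hashtag_id pairs of a tiny, invented link table
rows = np.array([0, 0, 2])
cols = np.array([0, 1, 0])
data = np.ones(len(rows), dtype=int)

# 3 tweets x 2 hashtags; only the three nonzero entries are stored.
B = sparse.csr_matrix((data, (rows, cols)), shape=(3, 2))

# Matrix multiplication: entry (i, j) of B.T @ B counts the tweets that
# contain both hashtag i and hashtag j (the diagonal counts tweets per hashtag).
C = B.T @ B
print(C.toarray())
```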

## 2.5. Data visualization with Seaborn & Matplotlib

SUGGESTION: TEACH PLOTTING WITH SEABORN; USE MATPLOTLIB WHERE SEABORN DOES NOT OFFER METHODS

- https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html