# 2. Data handling and visualization

https://www.pythonlikeyoumeanit.com/index.html

CHECK IF WE CAN BUILD ON: https://jakevdp.github.io/PythonDataScienceHandbook/

### Notes

- The focus here is on flat data structures (pandas DataFrames) and mathematical data structures (NumPy arrays); hierarchical data structures (JSON and HTML) are covered in session 4.

## 2.1. Essentials

https://www.pythonlikeyoumeanit.com/module_2.html

## 2.2. Pandas

### 2.2.1. TweetsCOV19 dataset

https://data.gesis.org/tweetscov19/

- Tweet Id: Long.
- Username: String. Encrypted for privacy reasons.
- Timestamp: Format ("EEE MMM dd HH:mm:ss Z yyyy").
- #Followers: Integer.
- #Friends: Integer.
- #Retweets: Integer.
- #Favorites: Integer.
- Entities: String. For each entity, we aggregated the original text, the annotated entity, and the score produced by the FEL library. Entities are separated from one another by the char ";", and the three fields within an entity are separated by the char ":", so each entity is stored as "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
- Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers with the whitespace char " ". Positive sentiment is stored first, followed by negative sentiment (e.g. "2 -1").
- Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with the whitespace char " ". If no mentions appear, we have stored "null;".
- Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with the whitespace char " ". If no hashtags appear, we have stored "null;".
- URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".


Download the file https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz and store it in the data/tweetscov19/ directory.

In [1]:
import pandas as pd

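In [19]:
# The call passes header=None, so pandas assigns integer column labels; pandas infers gzip compression from the .gz suffix
tweets = pd.read_csv('data/tweetscov19/TweetsCOV19_052020.tsv.gz', sep='\t', header=None)

In [22]:
# Name the columns in the order of the field list above
tweets.columns = ['tweet_id', 'username', 'timestamp', 'followers', 'friends', 'retweets',
                  'favorites', 'entities', 'sentiment', 'mentions', 'hashtags', 'urls']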

In [23]:
tweets

| | tweet_id | username | timestamp | followers | friends | retweets | favorites | entities | sentiment | mentions | hashtags | urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1255980348229529601 | fa5fd446e778da0acba3504aeab23da5 | Thu Apr 30 22:00:24 +0000 2020 | 29697 | 24040 | 0 | 0 | null; | 1 -1 | null; | Opinion Next2blowafrica thoughts | null; |
| 1 | 1255981220640546816 | 547501e9cc84b8148ae1b8bde04157a4 | Thu Apr 30 22:03:52 +0000 2020 | 799 | 1278 | 4 | 6 | null; | 1 -1 | null; | null; | null; |
| 2 | 1255981244560683008 | 840ac60dab55f6b212dc02dcbe5dfbd6 | Thu Apr 30 22:03:58 +0000 2020 | 586 | 378 | 1 | 2 | null; | 2 -1 | null; | null; | https://www.bbc.com/news/uk-england-beds-bucks... |
| 3 | 1255981472285986816 | 37c68a001198b5efd4a21e2b68a0c9bc | Thu Apr 30 22:04:52 +0000 2020 | 237 | 168 | 0 | 0 | null; | 1 -1 | null; | null; | https://lockdownsceptics.org/2020/04/30/latest... |
| 4 | 1255981581354905600 | 8c3620bdfb9d2a1acfdf2412c9b34e06 | Thu Apr 30 22:05:18 +0000 2020 | 423 | 427 | 0 | 0 | i hate u:I_Hate_U:-1.8786140035817729;quaranti... | 1 -4 | null; | null; | null; |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1912065 | 1267207472424660992 | ae1b1e6bf2a30cd0e1047ddd0baf5ad0 | Sun May 31 21:32:59 +0000 2020 | 15 | 45 | 0 | 0 | spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi... | 2 -1 | null; | null; | null; |
| 1912066 | 1267207883487354881 | 0e4323d01d164b9eb6e33f35564c7e25 | Sun May 31 21:34:37 +0000 2020 | 43 | 931 | 0 | 0 | china:China:-2.113921624336916;death penalty:C... | 1 -2 | null; | null; | null; |
| 1912067 | 1267209309559173122 | 00fc2c96e4012e27a6eee351723ab461 | Sun May 31 21:40:17 +0000 2020 | 256 | 451 | 0 | 0 | null; | 2 -1 | null; | null; | null; |
| 1912068 | 1267212987938545667 | 0f99a3b8b0d490f062215575d074518b | Sun May 31 21:54:54 +0000 2020 | 1467 | 1505 | 0 | 0 | omg:OMG_%28Usher_song%29:-2.580063760606172; | 2 -1 | lsddrq | null; | null; |


TO IMPLEMENT:
- SPLIT 'sentiment' INTO 'sentiment_pos' AND 'sentiment_neg' AND DELETE THE ORIGINAL SERIES
- CREATE AN 'entities' DF WITH FOUR COLUMNS ['tweet_id', 'original', 'annotated', 'score']
- CREATE A 'mentions' DF WITH ['tweet_id', 'mentionee'] COLUMNS
- CREATE A 'hashtags' DF WITH ['tweet_id', 'hashtag'] COLUMNS
- CREATE A 'urls' DF WITH ['tweet_id', 'url'] COLUMNS
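
A possible starting point for the first items on this list, sketched on three rows copied from the output above rather than on the full file. The names `sample`, `exploded`, and `parts` are illustrative only. The sketch assumes that ";" never occurs inside an entity and that the score never contains ":", which is why the entity string is split from the right; both assumptions should be checked against the real data.

```python
import pandas as pd

# Three complete rows copied from the output above (subset of columns).
sample = pd.DataFrame({
    "tweet_id": [1255980348229529601, 1267209309559173122, 1267212987938545667],
    "entities": ["null;", "null;", "omg:OMG_%28Usher_song%29:-2.580063760606172;"],
    "sentiment": ["1 -1", "2 -1", "2 -1"],
    "hashtags": ["Opinion Next2blowafrica thoughts", "null;", "null;"],
})

# Split 'sentiment' into two integer columns and drop the original series.
sentiment = sample["sentiment"].str.split(" ", expand=True).astype(int)
sample["sentiment_pos"] = sentiment[0]
sample["sentiment_neg"] = sentiment[1]
sample = sample.drop(columns="sentiment")

# 'hashtags' frame: one row per (tweet_id, hashtag); "null;" marks "no hashtags".
has_tags = sample["hashtags"] != "null;"
hashtags = (
    sample.loc[has_tags, ["tweet_id", "hashtags"]]
    .assign(hashtag=lambda df: df["hashtags"].str.split(" "))
    .explode("hashtag")
    .drop(columns="hashtags")
    .reset_index(drop=True)
)

# 'entities' frame: strip the trailing ";", split into entities, one row each.
has_entities = sample["entities"] != "null;"
exploded = (
    sample.loc[has_entities, ["tweet_id", "entities"]]
    .assign(entity=lambda df: df["entities"].str.rstrip(";").str.split(";"))
    .explode("entity")
    .reset_index(drop=True)
)
# Split from the right so that a ":" inside the original text stays intact (assumption).
parts = exploded["entity"].str.rsplit(":", n=2, expand=True)
entities = pd.DataFrame({
    "tweet_id": exploded["tweet_id"],
    "original": parts[0],
    "annotated": parts[1],
    "score": parts[2].astype(float),
})
```

The 'mentions' and 'urls' frames should follow the same pattern as `hashtags`, with " " and ":-: " as the respective separators.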

### 2.2.2. Working with a single dataframe

In [15]:
# read/save
# describe()
# changing index and column names
# grouping
# using and resetting the index
# categorize series: categories and codes
# matrix to edgelist and vice versa
# zip
# columns into dict
# datetime
# ...
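
A minimal sketch of a few of the topics listed above (datetime parsing, `describe()`, renaming columns, setting and resetting the index, grouping, and categories with codes). It works on four rows copied from the output above; `sample` and the renamed column `n_followers` are illustrative names, and the strptime format string is an assumed translation of the documented "EEE MMM dd HH:mm:ss Z yyyy" pattern.

```python
import pandas as pd

# Four rows copied from the output above (subset of columns).
sample = pd.DataFrame({
    "tweet_id": [1255980348229529601, 1255981220640546816,
                 1267209309559173122, 1267212987938545667],
    "timestamp": ["Thu Apr 30 22:00:24 +0000 2020", "Thu Apr 30 22:03:52 +0000 2020",
                  "Sun May 31 21:40:17 +0000 2020", "Sun May 31 21:54:54 +0000 2020"],
    "followers": [29697, 799, 256, 1467],
    "retweets": [0, 4, 0, 0],
})

# datetime: parse the timestamp strings into timezone-aware datetimes.
sample["timestamp"] = pd.to_datetime(sample["timestamp"],
                                     format="%a %b %d %H:%M:%S %z %Y")

# describe(): summary statistics for the numeric columns.
summary = sample[["followers", "retweets"]].describe()

# Changing column names.
sample = sample.rename(columns={"followers": "n_followers"})

# Using and resetting the index.
indexed = sample.set_index("tweet_id")
restored = indexed.reset_index()

# Grouping: total retweets per weekday.
per_weekday = sample.groupby(sample["timestamp"].dt.day_name())["retweets"].sum()

# Categorize a series: categories and their integer codes.
weekday = sample["timestamp"].dt.day_name().astype("category")
categories = list(weekday.cat.categories)
codes = weekday.cat.codes.tolist()
```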

### 2.2.3. Working with multiple dataframes

In [24]:
# merge split concat etc
# ...
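
As a rough illustration of merging, splitting, and concatenating, the sketch below joins a small tweets frame with a long-format hashtags frame on `tweet_id`. The values are copied from the output above; the frame layout mirrors the ['tweet_id', 'hashtag'] frame planned in 2.2.1 and is an assumption, not finished course material.

```python
import pandas as pd

# Two tweets and the hashtags of the first one, copied from the output above.
tweets = pd.DataFrame({
    "tweet_id": [1255980348229529601, 1255981220640546816],
    "retweets": [0, 4],
})
hashtags = pd.DataFrame({
    "tweet_id": [1255980348229529601] * 3,
    "hashtag": ["Opinion", "Next2blowafrica", "thoughts"],
})

# Inner merge: keeps only tweets that have at least one hashtag.
inner = tweets.merge(hashtags, on="tweet_id", how="inner")

# Left merge: keeps every tweet; tweets without hashtags get a missing value.
left = tweets.merge(hashtags, on="tweet_id", how="left")

# Split by rows, then concatenate the pieces back together.
first, second = tweets.iloc[:1], tweets.iloc[1:]
stacked = pd.concat([first, second], ignore_index=True)
```

The choice of `how` decides which tweets survive the join, so it is worth checking the row count after every merge.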

## 2.3. NumPy

https://www.pythonlikeyoumeanit.com/module_3.html

In [25]:
# read/save
# relationship to pandas
# ...
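
One way to illustrate both points is a round trip from a DataFrame to an array, to disk, and back. The numbers are the followers and friends counts of the first three rows in the output above; the file name `counts.npy` and the temporary directory are placeholders chosen for this sketch.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Followers and friends of the first three tweets in the output above.
counts = pd.DataFrame({
    "followers": [29697, 799, 586],
    "friends": [24040, 1278, 378],
})

# Relationship to pandas: a DataFrame with one numeric dtype maps to a 2-D array.
arr = counts.to_numpy()

# Vectorized arithmetic on whole columns of the array.
ratio = arr[:, 0] / arr[:, 1]

# Save to and read from NumPy's binary .npy format.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.npy"
    np.save(path, arr)
    loaded = np.load(path)

# Back to pandas: the array carries no column names, so they are passed again.
back = pd.DataFrame(loaded, columns=counts.columns)
```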

## 2.4. SciPy

In [None]:
# sparse matrices
# matrix multiplication
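
The following sketch shows one way these two topics could be introduced together: an edge list is stored as a sparse adjacency matrix and multiplied with itself. The four edges are hypothetical and not taken from the dataset; in the course they could come from a mentions edge list.

```python
import numpy as np
from scipy import sparse

# Hypothetical directed edge list over 4 nodes: 0->1, 0->2, 1->2, 2->3.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 3])
weights = np.ones(len(rows), dtype=int)

# Edge list to sparse adjacency matrix; only the nonzero entries are stored.
A = sparse.csr_matrix((weights, (rows, cols)), shape=(4, 4))

# Matrix multiplication: entry (i, j) of A @ A counts the paths of length 2 from i to j.
A2 = A @ A
dense = A2.toarray()

# Sparse matrix back to an edge list via the coordinate (COO) format.
coo = A.tocoo()
edges = sorted(zip(coo.row.tolist(), coo.col.tolist()))
```

For a matrix with millions of rows, `toarray()` would no longer fit into memory; it is used here only to inspect the small result.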

## 2.5. Matplotlib

In [None]:
# examples for all kinds of plots: scatter, bar, pie, etc; manipulate drawing options; export to file
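
A small sketch of what such a cell could look like: a scatter plot and a bar plot side by side, with a few drawing options changed and the figure exported to a file. The values are the followers and friends counts of the first five rows in the output above; the file name and the temporary directory are placeholders.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; omit in a notebook
import matplotlib.pyplot as plt

# Followers and friends of the first five tweets in the output above.
followers = [29697, 799, 586, 237, 423]
friends = [24040, 1278, 378, 168, 427]

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot with custom marker and color, on log-log axes.
ax_scatter.scatter(followers, friends, marker="x", color="tab:blue")
ax_scatter.set_xscale("log")
ax_scatter.set_yscale("log")
ax_scatter.set_xlabel("followers")
ax_scatter.set_ylabel("friends")

# Bar plot of the follower counts, one bar per tweet.
ax_bar.bar(range(len(followers)), followers, color="tab:orange")
ax_bar.set_xticks(range(len(followers)))
ax_bar.set_xlabel("row")
ax_bar.set_ylabel("followers")

fig.tight_layout()

# Export to file.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "followers.png"
    fig.savefig(out_path, dpi=150)
    saved_bytes = out_path.stat().st_size

plt.close(fig)
```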

## 2.6. Seaborn

In [None]:
# address differences to