# Preprocessing

#### Main Goal: To downsize the input files, by casting data types or getting rid of data that are completely useless (for analysis)
Preprocessed data will be stored under the `temp` directory as recommended by the given README.<br>
Even null values are kept. Only columns that are totally useless will be thrown away.

<a id="navigation"></a>

## Navigation
* #### Comment Data
    * [Pickle the Commment Data (for speed)](#pickle-comment)
    * [Take a deeper look into fasttext (language detection model)](#fasttext)
    * [Preprocessing Comment Data](#preprocessing-comment-data)
    * [Dealing with Contractions](#dealing-with-contractions)
    * [Add spacyTextblob Sentiment Polarity and Subjectivity](#add-spacytextblob)
* #### Post Data
    * [preprocessing post data](#post-data)

#### Suggested workflow from given README
* Use relative paths in the code
* Read data from _input_
* Export generated tables or figures to _output_
* Read/Write other temporary files from _temp_

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
%matplotlib inline

In [2]:
import os
import re
import spacy
from spacy import displacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pickle
os.chdir('/home3/r09725056/Desktop/analysis-ChingYaoL')
print(os.getcwd())
# Use relative paths in the code

/home3/r09725056/Desktop/analysis-ChingYaoL


In [3]:
for file in os.listdir():
    print(file, end=' ')

input code output temp .git README.md README_about_USFB_Data.pdf 

In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # 讓 DataFrame Output可以重複疊起來
pd.set_option('display.max_columns', None) # show all columns

In [5]:
from datetime import datetime
from pandas_profiling import ProfileReport

In [6]:
import string
import fasttext
import contractions # resolving contractions and slangs, e.g. "yall're happy now" --> "you all are happy now"
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

In [7]:
import nltk
# nltk.download('stopwords');

In [8]:
import demoji
# demoji.download_codes()

* [PyPI for fasttext](https://pypi.org/project/fasttext/)
* [medium for fasttext](https://medium.com/@c.chaitanya/language-identification-in-python-using-fasttext-60359dc30ed0)
* [GitHub for contractions](https://github.com/kootenpv/contractions)

![tree.png](tree.png)

## 2016 US Presidential Election: Nov 8, 2016 (Tue)

#### **Page**
* page_info
* politician_info
* page_politician_info 

#### **Post**
* top_1000_pages
* politician_pages

#### **Reaction**
**1. Reactions on 1000-page**<br>
A. Every-20-minutes (2016-09-29 ~ 2016-11-21)<br>
* LIKE, LOVE, HAHA, WOW, SAD, ANGRY, THANKFUL

B. LIKE by US political users (2015-01-01 ~ 2016-11-30)
* LIKE_on_1000_page

**2. Reactions on politicians**<br>
* LIKE_on_politicians

#### **Comment**




# Comment
[[Back to Navigation]](#navigation)

### Techniques Adopted for Cleaning Comment Texts

#### Minimum Preprocessing
1. Drop completely duplicate rows 
2. Add language labels (using Language Detection Model from fasttext)
 * the reason I use fasttext: [Benchmarking Language Detection for NLP](https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c)
 * [PyPI for fasttext](https://pypi.org/project/fasttext/)
 * [medium on fasttext](https://medium.com/@c.chaitanya/language-identification-in-python-using-fasttext-60359dc30ed0)
 * Meanwhile, Summarize the number of English, non-English, and empty comments, and store the summary in a global counter, `en_counter_comment`
3. Filter out non-English or empty comments; then, drop the language column

#### Further Preprocessing
4. Convert all words to lowercase
5. Expand contractions and slangs (eg. yall're cool -> you all are cool): [GitHub for contractions](https://github.com/kootenpv/contractions)
6. Remove html `<br>` tags, punctuations, links, newlines, tabs, and shrink consecutive spaces to one
7. Remove emojis
8. Remove stopwords (using nltk)
9. Lemmatize texts (using spacy)
10. Pickle (Serialize) the result

<a id="pickle-comment"></a>
##### Since it always takes a while to read the data, I decided to pickle the comment data first for easier and quicker access  in the future.

In [None]:
def pickle_comments(remove_stop=True, lemmatize=True) -> None:
    "Simple Preprocessing: Delete duplicate rows. Then, pickle the comment data."
    # Set source and destination path
    src_path = r'input/comment/2015-01-01-to-2016-11-30'
    dest_path = r'temp/comment/original'
    # Loop for each file in src_path
    for i, f_name in enumerate(sorted(os.listdir(src_path))):
        # Set source and destination files
        src_file = os.path.join(src_path, f_name)
        dest_file = os.path.join(dest_path, f'comment_{i}.pkl')
        # Read in Data
        usecols = ['comment_message', 'post_id', 'comment_created_time']
        temp = pd.read_csv(src_file, parse_dates=['comment_created_time'], usecols=usecols)
        # Simple Preprocessing
        print(f'{f_name}: {temp.duplicated().sum()} completely duplicate rows')
        temp.drop_duplicates(inplace=True, ignore_index=True)
        # Pickling
        temp.to_pickle(dest_file)

In [None]:
if __name__ == '__main__':
    pickle_comments()

<a id="fasttext"></a>

**Now, before actually preprocessing the data, I would like to <font color="blue">take a deeper look into fasttext.</font><br>**
**This helps me to gain some insight into the fasttext model itself, and figure out some ways to <font color="blue">"save" some data that shouldn't be filtered out.</font><br>**
**A total of 48128 data was saved in the end.**

**For your convenience, you may [skip to the section: Preprocessing Comment Data](#preprocessing-comment-data)**, since the the following section is a bit long and may potentially break the logic flow of your reading. What it basically does is simply trying to further optimize the preprocessing process (by getting more English comments). You may skip it without worrying not being able to understand the rest of the notebook. Thanks!

## Take a deeper look into Fasttext
[[Back to Navigation]](#navigation)

1. Why are there null values (in the prediction)?
2. Lowercasing some data for improving precision
3. Check if Punctuation removal may improve precision
4. Check if Punctuation removal for only  uppercase data may improve precision

In [44]:
# Read in the commment data
comment = pd.read_pickle(r'temp/comment/original/comment_0.pkl')
comment['language'] = comment['comment_message'].apply(predict_language)

#### 1. Why are there null values?

In [45]:
comment[comment['language'] != 'en']

Unnamed: 0,comment_message,post_id,comment_created_time,language
11,,205344452828349_1214133445282773,2015-03-27 02:48:43+00:00,
15,Linette Aleman,165583971161_10153322962686162,2015-04-19 23:38:13+00:00,es
17,Ahora van a estar todo el mes hablando de el c...,174725429795_10154116006284796,2016-01-09 04:19:47+00:00,es
42,bravo,125342830842328_918716924838244,2015-05-11 19:19:14+00:00,it
46,MAKE HILLARY AN INMATE AGAIN,39442131319_10154774303631320,2016-10-28 19:50:04+00:00,ja
...,...,...,...,...
2862948,Good r u.,169204449790211_916776711699644,2015-04-25 23:24:08+00:00,de
2862976,,127559550648374_917129971691324,2015-10-24 03:59:37+00:00,
2862986,"Estas q,opinan aquí\nQ,nathalia va hacer coron...",259955926518_10153512283491519,2016-03-16 03:50:38+00:00,es
2863006,Martin Baliola,7331091005_10154284725951006,2016-08-25 13:31:37+00:00,id


In [46]:
comment.isna().sum()

comment_message         57872
post_id                     0
comment_created_time        0
language                57872
dtype: int64

In [47]:
# If language is null, then comment_message is null
comment.loc[comment['language'].isna(), ['comment_message', 'language']].isna().sum()

comment_message    57872
language           57872
dtype: int64

In [48]:
# If comment_message is null, then language is null
comment.loc[comment['comment_message'].isna(), ['comment_message', 'language']].isna().sum()

comment_message    57872
language           57872
dtype: int64

##### All null values/labels result from null comments. Thus, we can safely drop them.

In [49]:
# We can safely drop all
comment.dropna(subset=['language'], inplace=True)

In [50]:
comment.shape

(2805137, 4)

#### 2. Lowercasing for improving precision

In [51]:
# comment['language'].unique()
# comment.loc[comment['language'] == 'es', 'comment_message'].head(50)

In [52]:
comment['comment_message_lower'] = comment['comment_message'].str.lower()
comment['lower_language'] = comment['comment_message_lower'].apply(predict_language)

In [54]:
# Define filters
lang_diff = comment['language'] != comment['lower_language']
cmt_en = comment['language'] == 'en'
cmt_lower_en = comment['lower_language'] == 'en'
# Define columns of interest
cmt_lang_cols = ['comment_message', 'language', 'comment_message_lower', 'lower_language']

In [55]:
# In the cases of lang_diff...
# Most of those predicted English are, in fact, really English!
# But were predicted otherwise after being transformed to lowercase. (Red Arrow)
comment.loc[lang_diff & cmt_en, cmt_lang_cols].sample(50, random_state=21)

Unnamed: 0,comment_message,language,comment_message_lower,lower_language
54218,Sore loser.,en,sore loser.,it
658600,Islam Tyler Bradley Albertas,en,islam tyler bradley albertas,es
2477153,Patty NG,en,patty ng,tl
1891299,Kyle Skrivanek,en,kyle skrivanek,lt
2079815,Alberto Alvarez,en,alberto alvarez,it
1463743,Kristi Hartman,en,kristi hartman,fi
1485387,Tarra Cheshire ��,en,tarra cheshire ��,pt
2642490,Oh Jr.,en,oh jr.,nl
504496,Donovan Casares ��,en,donovan casares ��,es
2296785,Amen,en,amen,ca


In [56]:
# In the cases of lang_diff...
# Most of those predicted non-English are in reality English!
# And were accurately predicted as English after being transformed to lowercase. (Green Arrow)
comment.loc[lang_diff & cmt_lower_en, cmt_lang_cols].sample(50, random_state=42)

Unnamed: 0,comment_message,language,comment_message_lower,lower_language
764553,Yanitza Monk Presella Tomsjansen ooo hell to t...,nl,yanitza monk presella tomsjansen ooo hell to t...,en
970903,Well......bye,pt,well......bye,en
223100,C,it,c,en
2862500,"Deadspin: ""Remember Leney Kukua? Guys? Anyone!?""",de,"deadspin: ""remember leney kukua? guys? anyone!?""",en
1811534,TRUMP/PENCE����,ru,trump/pence����,en
1992414,Thats cute ^°^ ����,ru,thats cute ^°^ ����,en
1594121,Happy Birthday Pres.Bush,es,happy birthday pres.bush,en
440355,POOR ICE ICE BABY!!,ja,poor ice ice baby!!,en
2274308,Scumbag!,de,scumbag!,en
521840,Aww,jbo,aww,en


**Data saved!**

In [57]:
# Green Arrow: convert those predicted non-English above into English
indexer_to_en = comment.loc[lang_diff & cmt_lower_en, cmt_lang_cols].index
comment.loc[indexer_to_en, 'language'] = 'en'

In [75]:
len(indexer_to_en)

48128

In [68]:
# Check out the results
# comment.loc[lang_diff & cmt_lower_en, cmt_lang_cols].head(40)

In [72]:
comment = comment.loc[comment['language'] == 'en', ['comment_message', 'post_id', 'comment_created_time']]

In [74]:
comment.to_pickle(r'temp/comment/english_only/comment_0.pkl')

In [59]:
# Beware that the filters lang_diff, cmt_en, and cmt_lower_en are not updated yet! (boolean series)
# comment.loc[lang_diff & cmt_lower_en, cmt_lang_cols]

In [60]:
# Update the filters
lang_diff = comment['language'] != comment['lower_language']
cmt_en = comment['language'] == 'en'
cmt_lower_en = comment['lower_language'] == 'en'

#### 3. Punctuation Removal (failed attempt)
* remove_punc_symbol does not remove emojis
* It seems that the effectiveness of fasttext is heavily affected by th presence of punctuations. (At least this is seen when a comment is in full uppercase. We need to check for the general case, which includes both cases of uppercase and lowercase)
* NOT favorable

In [61]:
# Create new columns
comment['comment_message_nopunc'] = comment['comment_message'].apply(remove_punc_symbol)
comment['nopunc_language'] = comment['comment_message_nopunc'].apply(predict_language)
# Define new filters
nopunc_lang_diff = comment['language'] != comment['nopunc_language']
cmt_nopunc_en = comment['nopunc_language'] == 'en'
# Define columns of interest
cmt_lang_cols_nopunc = ['comment_message', 'language', 'comment_message_nopunc', 'nopunc_language']

In [62]:
# FIRST, we need to check if removing puctuations removes emojis...
# Create regex finding emojis
emoji_regex = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"  # includes question mark symbol
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
# Create filters about emoji existence
cmt_contains_emoji = comment['comment_message'].apply(lambda x: True if emoji_regex.search(x) else False)
cmt_contains_emoji_besides_question_mark = comment['comment_message'].str.replace('�', '').apply(lambda x: True if emoji_regex.search(x) else False)

In [63]:
# ANSWER: remove_punc_symbol does not removes emojis
cmt_contains_emoji.sum()
comment['comment_message_nopunc'].apply(lambda x: True if emoji_regex.search(x) else False).sum()

140495

140495

In [64]:
# Back to the topic: Punctuation Removal (without removing emojis)
# In the cases of nopunc_lang_diff...
# Remove punctuations and check out the overall results
comment.loc[nopunc_lang_diff, cmt_lang_cols_nopunc].sample(15)

Unnamed: 0,comment_message,language,comment_message_nopunc,nopunc_language
868232,yawn...,vi,yawn,en
830365,Luv my .9mm ....,en,Luv my 9mm,zh
237350,HEE HEE. HEE,es,HEE HEE HEE,de
630209,Lol!,tr,Lol,en
821,Lolol,en,Lolol,sv
1217855,Nah.,id,Nah,en
924454,Niklas Hengesbach GTI #pressure,nl,Niklas Hengesbach GTI pressure,en
2690377,Haters,en,Haters,de
2793986,Denise FernRos Rosalm hoppas du gjorde det samma!,da,Denise FernRos Rosalm hoppas du gjorde det samma,sv
791560,Tiffany Luk,en,Tiffany Luk,id


In [65]:
# In the cases of nopunc_lang_diff...
# The specific results for those originally predicted English
# punctuations removal ISN'T favorable
comment.loc[nopunc_lang_diff & cmt_en, cmt_lang_cols_nopunc].sample(10)

Unnamed: 0,comment_message,language,comment_message_nopunc,nopunc_language
645139,No!!!,en,No,es
2642386,Very intelligent man !!!!,en,Very intelligent man,de
211820,Lmao Uzair Piotr Bee,en,Lmao Uzair Piotr Bee,pl
611778,https://www.youtube.com/watch?v=xI3sA004xlQ,en,v xI3sA004xlQ,ru
331739,Ashley Larson,en,Ashley Larson,sv
1829581,NO MOTHER FUCKER NEVER DID FOIIOW YOU NOT...,en,NO MOTHER FUCKER NEVER DID FOIIOW YOU NOT NEVE...,ja
1676961,Fuck no.,en,Fuck no,it
1312799,"""Bro""",en,Bro,fr
1528138,Aww.,en,Aww,jbo
1272934,ISLAMIC TERRORIST IN THE HOUSE !,en,ISLAMIC TERRORIST IN THE HOUSE,ja


In [69]:
# In the cases of nopunc_lang_diff...
# The specific results for those originally predicted non-English
comment.loc[nopunc_lang_diff & cmt_nopunc_en, cmt_lang_cols_nopunc].sample(15)

Unnamed: 0,comment_message,language,comment_message_nopunc,nopunc_language
2842033,R.I.P.,es,R I P,en
2548359,Amen.....,pt,Amen,en
1719750,;),el,,en
2542721,Nit wit.,la,Nit wit,en
660641,Amen!,eo,Amen,en
1379953,Fuckin Obamuslima!,pt,Fuckin Obamuslima,en
410166,R.I.P.,es,R I P,en
720964,Amen!!,sv,Amen,en
1126069,Happy Birthday Condoleezza!,it,Happy Birthday Condoleezza,en
1791100,Blah.,id,Blah,en


#### 4. Punctuation Removal only for Uppercase comments (failed attempt)
* Doesn't seem favorable either

In [31]:
# In the cases of nopunc_lang_diff...
# The "Union" case: Remove punctuations only for uppercase comments

In [42]:
comment.loc[comment['comment_message'].str.isupper() & nopunc_lang_diff, 'nopunc_language'].apply(lambda x: x == 'en').sum()

1529

In [43]:
# In the cases of nopunc_lang_diff...
# The "Union" case: Remove punctuations only for Uppercase Comments
# Punctuations are actually helpful in language prediction! esp. for uppercase comments!
comment.loc[comment['comment_message'].str.isupper() & nopunc_lang_diff, cmt_lang_cols_nopunc].sample(50, random_state=42)

Unnamed: 0,comment_message,language,comment_message_nopunc,nopunc_language
1453961,TRUMP WON OHIO!!!!,en,TRUMP WON OHIO,ja
344938,YES !,ta,YES,en
2752525,TRUE!,ar,TRUE,de
500940,GOOD COUPLE,en,GOOD COUPLE,ja
521793,JOIN THE N.R.A.,ru,JOIN THE N R A,ja
1424381,#BIAS,eu,BIAS,fr
450994,TREASON !!!!!!!!!!!!!!!!!!!!!!!,en,TREASON,ja
2229053,THE POWERS OF CHRIST COMPELS YOU,en,THE POWERS OF CHRIST COMPELS YOU,ja
429128,YES!!!! KILL KILL KILL!!!!,en,YES KILL KILL KILL,zh
683288,BARACK OBAMA IS THE GREATEST THREAT AMERICA HA...,en,BARACK OBAMA IS THE GREATEST THREAT AMERICA HA...,ja


In [34]:
# In the cases of lang_diff...
# Remove punctuations for non-uppercase comments and check out the results
comment.loc[~comment['comment_message'].str.isupper() & lang_diff, 'language'].apply(lambda x: x == 'en').sum()
comment.loc[~comment['comment_message'].str.isupper() & nopunc_lang_diff, 'nopunc_language'].apply(lambda x: x == 'en').sum()

105568

110932

In [35]:
comment.loc[~comment['comment_message'].str.isupper() & lang_diff, cmt_lang_cols].sample(50, random_state=42)

Unnamed: 0,comment_message,language,comment_message_lower,lower2language
146740,Hahahahaha!,it,hahahahaha!,de
2506981,Douglas Brito,en,douglas brito,de
19606,Sad RIP,ja,sad rip,it
366247,Omg,sv,omg,nl
1520728,Dios!,la,dios!,el
630217,No,es,no,pt
1816613,Aww costarricense <3,es,aww costarricense <3,pt
895627,Nina Sobi,en,nina sobi,es
1562136,Race traitor white cunt.,la,race traitor white cunt.,si
1090302,HAAaaa!!,nl,haaaaa!!,ru


<a id="preprocessing-comment-data"></a>
## Preprocessing Comment Data

[[Back to Navigation]](#navigation)

#### Minimum Preprocessing
1. Drop completely duplicate rows 
2. Add language labels (using Language Detection Model from fasttext)
 * the reason I use fasttext: [Benchmarking Language Detection for NLP](https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c)
 * [PyPI for fasttext](https://pypi.org/project/fasttext/)
 * [medium on fasttext](https://medium.com/@c.chaitanya/language-identification-in-python-using-fasttext-60359dc30ed0)
 * Meanwhile, Summarize the number of English, non-English, and empty comments, and store the summary in a global counter, `en_counter_comment`
3. Filter out non-English or empty comments; then, drop the language column

#### Further Preprocessing
4. Convert all words to lowercase
5. Expand contractions and slangs (eg. yall're cool -> you all are cool): [GitHub for contractions](https://github.com/kootenpv/contractions)
6. Remove html `<br>` tags, punctuations, links, newlines, tabs, and shrink consecutive spaces to one
7. Remove emojis
8. Remove stopwords (using nltk)
9. Lemmatize texts (using spacy)
10. Pickle (Serialize) the result

In [9]:
# Create a decorator to decorate text-transformation functions
def func_to_string(func):
    """
    A decorator that allows a function to bypass nonstring arguments. That is, func applies only to string columns.
    """
    def wrapper_func(x):
        try:
            return func(x)
        except:
            return x
    return wrapper_func

In [10]:
# Functions to apply to textual column
@func_to_string
def contract_message(x):
    return contractions.fix(x)

pretrained_model = "/home3/r09725056/.conda/envs/usfb/lib/python3.7/site-packages/fasttext/lid.176.bin"
model = fasttext.load_model(pretrained_model)    
@func_to_string
def predict_language(sent):
    sent = sent.replace('\n', ' ')
    pred = model.predict(sent) # model.predict() returns a tuple like this: (('__label__en',), array([0.95346403]))
    return pred[0][0].split('_')[-1]

@func_to_string
def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
print(nlp.pipe_names)
@func_to_string
def lemmatize(text):
    return " ".join(token.lemma_ for token in nlp(text))

@func_to_string
def remove_punc_symbol(text):
    "Remove html <br> tags, links, newlines, tabs, punctuations, and shrink spaces"
    # Remove html <br> tags, links, newlines, tabs
    pattern = re.compile(r'(<br>|https?://[a-zA-Z0-9-./|\s]+)', re.IGNORECASE)
    text = re.sub(pattern, ' ', text)
    # Remove puntuations
    mapping_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(mapping_table)
    # Shrink spaces
    text = re.sub(r'[ ]+', ' ', text) # replace multiple spaces with one space
    return text

stop = set(stopwords.words('english')) # Get stopwords from nltk
@func_to_string
def remove_stopwords(text):
    x = text
    return " ".join(x for x in x.split() if x not in stop) # remove stopwords



['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']


In [11]:
# Self-defined functions
def check_lang_code(code):
    with open('temp/temp_variables/lang_code.pkl', 'rb') as f:
        lang_code = pickle.load(f)
    return lang_code.loc[lang_code['639-1'] == code, 'ISO language name']

In [None]:
# A global counter for language
en_counter_comment = {'en': 0, 'non_en': 0, 'null': 0}

def preprocess_pickle_comments(remove_stop=True, lemmatize=True) -> None:
    "Minimum & Further Preprocessing"
    # Set source and destination path
    src_path = r'input/comment/2015-01-01-to-2016-11-30'
    dest_path = r'temp/comment/preprocessed'
    # Loop for each file in src_path
    for i, f_name in enumerate(sorted(os.listdir(src_path))):
        # Set source and destination files
        src_file = os.path.join(src_path, f_name)
        dest_file = dest_file = os.path.join(dest_path, f'comment_{i}.pkl')
        # Read in Data
        usecols = ['comment_message', 'post_id', 'comment_created_time']
        temp = pd.read_csv(src_file, parse_dates=['comment_created_time'], usecols=usecols)
        # Preprocessing
        temp.drop_duplicates(inplace=True, ignore_index=True)
        temp['language'] = temp['comment_message'].apply(predict_language) # Add language labels
        # Update language counts
        total = len(temp)
        n_null = temp['language'].value_counts()[1:].sum()
        n_en = temp['language'].value_counts()['en']
        n_non_en = total - n_null - n_en
        en_counter_comment['en'] += n_en # update english
        en_counter_comment['non_en'] += n_non_en # update non-english
        en_counter_comment['null'] += n_null # update null values
        # Preprocessing (part 2)
        temp = temp[temp['language'] == 'en'] # filter out non-English and NaN
        temp.drop(columns='language', inplace=True) # drop the language column
        temp['comment_message'] = temp['comment_message'].str.lower() # set all words to lowercase
        temp['comment_message'] = temp['comment_message'].apply(contract_message) # expand contractions
        temp['comment_message'] = temp['comment_message'].apply(remove_punc_symbol) # remove punctuations and special characters
        temp['comment_message'] = temp['comment_message'].apply(lambda x: remove_emoji(x)) # remove emojis
        if remove_stop:
            stop = set(stopwords.words('english')) # Get stopwords from nltk
            temp['comment_message'] = temp['comment_message'].apply(lambda x: " ".join(x for x in x.split() if x not in stop)) # remove stopwords
        if lemmatize:
            try: # lemmatize words (in a faster way)
                temp['comment_message'] = [' '.join(token.lemma_ for token in doc) for doc in nlp.pipe(temp['comment_message'])]
            except TypeError: # lemmatize in a slower way (which accepts null values)
                temp['comment_message'] = temp['comment_message'].apply(lemmatize)
        # Pickling
        temp.to_pickle(dest_file)
        

In [27]:
if __name__ == '__main__':
    preprocess_pickle_comments() # DONE

In [None]:
# Save the result of the global counter
with open('temp/temp_variables/en_counter_comment.pkl') as f:
    pickle.dump(en_counter_comment, f)

In [31]:
en_counter_comment

{'en': 418313529, 'non_en': 9717226, 'null': 50211963}

<a id="dealing-with-contractions"></a>

## Dealing with Contractions...
[[Back to Navigation]](#navigation)

Some related sources about contractions:
* [Common Informal Contractions in English You Need to Know](https://www.fluentland.com/common-informal-contractions/)<br>
* [50個超常用的簡訊縮寫](https://tw.blog.voicetube.com/archives/41825/9-posctn-ttyl-k-%E5%88%B0%E5%BA%95%E6%98%AF%E5%95%A5%E6%84%8F%E6%80%9D%EF%BC%9F50%E5%80%8B%E8%B6%85%E5%B8%B8%E7%94%A8%E7%9A%84%E7%B0%A1%E8%A8%8A%E7%B8%AE%E5%AF%AB/)
* [internetslang.com](https://www.internetslang.com/list.asp?i=all) ([scraped - my code](#scrape))

I have made full use of the above resources to expand my contraction dictionary.

In [None]:
def update_my_contractions(mapping_dict):
    with open('temp/temp_variables/my_contractions.pkl', 'rb') as read_f:
        my_contractions = pickle.load(read_f)
    for slang, meaning in mapping_dict.items():
        if slang not in my_contractions:
            slang = slang.lower()
            meaning = meaning.lower()
            my_contractions[slang] = meaning
            print(f"({slang}: {meaning}) is successfully added.")
        else:
            print(f"{slang.lower()} is included already...")
    with open('temp/temp_variables/my_contractions.pkl', 'wb') as write_f:
        pickle.dump(my_contractions, write_f)
# Set (slang: meaning) dictionary
mapping_dict = {
    "whatcha": "what have you" # I have manually updated contractions whenever I see new ones...
}
# Update!!!
update_my_contractions(mapping_dict) # Save result on disk
with open('temp/temp_variables/my_contractions.pkl', 'rb') as read_f:
    my_contractions = pickle.load(read_f)
for slang, meaning in my_contractions.items():
    contractions.add(slang, meaning)

<a id="scrape"></a>
#### This is the piece of code with which I wrote to scrape slangs and their meanings from internetslang.com

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
from pprint import pprint
from functools import reduce
import pandas as pd
import requests
import re
import os
import pickle


def get_page(path, verbose=False):
    # Get the webpage
    try:
        html = requests.get(path)
    except HTTPError as e:
        # couldn't find the webpage on the server
        print(e)
    except URLError as e:
        # More severe problem: cannot find the server
        print('The server could not be found!')
    except Exception as e:
        # Other unknown problems
        print('Sorry, something went wrong...')
        print(e)
    else:
        # Test if no exception are thrown
        if verbose:
            print('page successfully retrieved!')
        
    return html

def strip_replace(meaning_list):
    return [m.strip().replace('&amp;', '&') for m in meaning_list]

def create_slang_df(full_slang_list_untackled):
    df = pd.DataFrame(full_slang_list_untackled, columns=['slang', 'meaning', 'untackled'])
    df.drop_duplicates(subset='slang', inplace=True, ignore_index=False)
    # Add important slangs that're not scraped
    my_slang = [('WTF', ['What the Fuck?'], False), \
                ("YW", ["You're Welcome"], False), \
                ('SMH', ['Shaking my head. Used to express shock or disappointment.'], False)
               ]
    my_slang = pd.DataFrame(my_slang, columns=df.columns)
    df = pd.concat([df, my_slang], ignore_index=True)
    df.loc[df.untackled, 'new_meaning'] = df.loc[df.untackled, 'meaning'].apply(lambda x: get_meaning(x[0]))
    df.loc[~df.untackled, 'new_meaning'] = df.loc[~df.untackled, 'meaning']
    return df

def scrape_slangs(path, verbose=False):
    """
    Returns a list of tuples in the form of (slang, meaning, untackled)
    
    Return:
    ----------------------------------------------
    slang: str
        The word to look up
    meaning: list
        a list of possible meanings (str)
    untackled: bool
        equals True if `meaning` still contains slangs/contractions. 
        This is caused by the website adding link(s) in the meaning section instead of its meaning(s) directly.
    """
    # Get the webpage
    html = get_page(path)
    # Create a BeautifulSoup Object from the webpage
    soup = BeautifulSoup(html.text, 'html.parser')
    # Get all slangs & meanings, and store them in rows
    pattern = re.compile(r'nodepath:/html/body/table/tr\[1\]/td/table\[3\]/tr\[\d+\];display:table-row;')
    rows = soup.find_all('tr', style=pattern)
    
    if path.endswith('=1'): # The first page
        rows = rows[4:] # Get rid of the first four rows

    # Store all slangs and their meanings in slang_list
    slang_list = []
    for row in rows:
        # Get the slang
        slang = row.find('a').text
        if slang == 'All': # Not slang (but page link instead)
            break
        # Parse out the meaning & Add a label (UNTACKLED)
        meaning_section = row.find_all('td')[-1]
        untackled = False
        tags_contained = meaning_section.find_all(True)
        if tags_contained: # meaning_section contains at least a tag
            tag_names = [tag.name for tag in tags_contained]
            if 'a' in tag_names: # contains <a> tag
                meaning_list = [meaning_section.find('a').decode_contents()]
                # Mark the meaning_list as "UNTACKLED"; will tackle these later by look them up
                untackled = True
            elif 'br' in tag_names: # contains <br> tag
                meaning_list = meaning_section.decode_contents().split('<br/>')
                meaning_list = [m for m in meaning_list]
        else: # meaning_section doesn't contain any tag
            meaning_list = [meaning_section.decode_contents()] # with length of one
        # Preprocess each meaning in the meaning_list
        meaning_list = strip_replace(meaning_list)
        # Store the results
        vocab = (slang, meaning_list, untackled)
        slang_list.append(vocab)
        # Print for testing
        if verbose:
            print("{}: {}, UNTACKLED={}".format(*vocab))
    # End of For Loop
    return slang_list

def first_scrape():
    "Scrape all slangs for the first time"
    full_slang_list = []
    for i in range(1, 26):
        path = f'https://www.internetslang.com/list.asp?i=all&ezpage={i}'
        slang_list = scrape_slangs(path, verbose=False) # Change to True to see the results in real time
        full_slang_list.extend(slang_list)
    return full_slang_list

def main() -> None:
    os.chdir('/home3/r09725056/Desktop/analysis-ChingYaoL')
    full_slang_list_untackled = first_scrape()
    with open(r'temp/temp_variables/full_slang_list_untackled.pkl', 'wb') as f:
        pickle.dump(full_slang_list_untackled, f)
        
    slang_df = create_slang_df(full_slang_list_untackled)
    with open(r'temp/temp_variables/slang_df.pkl', 'wb') as f:
        pickle.dump(slang_df, f)
        
if __name__ == '__main__':
    main()

<a id="add-spacytextblob"></a>

## Add spacyTextblob Sentiment Polarity and Subjectivity
[[Back to Navigation]](#navigation)

In [39]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [40]:
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("spacytextblob")
print(nlp.pipe_names)

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f3a481f94d0>

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'spacytextblob']


### Quick test
I discovered that `nlp.pipe` runs faster on pickled data than on the original data.

In [47]:
i = 0
cmt = pd.read_pickle(f'temp/comment/preprocessed/comment_{i}.pkl')

In [48]:
%timeit [(doc._.polarity, doc._.subjectivity, doc._.assessments) for doc in nlp.pipe(cmt['comment_message'].iloc[:10000])]
# 15.5 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

15.5 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [49]:
test = pd.read_pickle(f'temp/comment/original/comment_{i}.pkl')

In [50]:
%timeit [(doc._.polarity, doc._.subjectivity, doc._.assessments) for doc in nlp.pipe(test['comment_message'].iloc[:10000])]
# 28.2 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

28.2 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Thus, in the function below, I would use `nlp.pipe` in combination with pickled data

In [None]:
def add_sentiment_to_preprocessed_comment() -> None:
    "Calculate sentiment for each comment (preprocessed), and dump to new directory"
    for i in range(500):
        cmt = pd.read_pickle(f'temp/comment/preprocessed/comment_{i}.pkl') # Use pickled data
        # Create new columns
        cmt[['polarity', 'subjectivity', 'assessments']] = [(doc._.polarity, doc._.subjectivity, doc._.assessments) for doc in nlp.pipe(cmt['comment_message'])]
        cmt.to_pickle(f'temp/comment/preprocessed_with_sentiment/comment_{i}.pkl')

In [None]:
if __name__ == '__main__':
    add_sentiment_to_preprocessed_comment()

<a id="post-data"></a>

# Post
Data stored in the variable: top_1000_pages

[[Back to Navigation]](#navigation)

### Techniques Adopted for Cleaning Post Texts

##### Basic Preprocessing
0. Drop unwanted columns ('post_created_date_CT', 'post_updated_date_CT', 'post_created_time', 'post_updated_time')
1. Drop completely duplicate rows
2. Expand contractions and slangs (eg. yall're cool -> you all are cool): [GitHub for contractions](https://github.com/kootenpv/contractions)
3. Add language labels (using Language Detection Model from fasttext)
 * the reason I use fasttext: [Benchmarking Language Detection for NLP](https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c)
 * [PyPI for fasttext](https://pypi.org/project/fasttext/)
 * [medium on fasttext](https://medium.com/@c.chaitanya/language-identification-in-python-using-fasttext-60359dc30ed0)
4. Summarize the number of English, non-English, and empty comments, and store the summary in a global counter, `en_counter_post`
5. Filter out non-English or empty comments; then, drop the language column
6. Convert all words to lowercase
7. Remove html `<br>` tags, punctuations, links, newlines, tabs, and shrink consecutive spaces to one
8. Remove emojis
9. Remove stopwords (using nltk)
10. Lemmatize texts (using spacy)
11. Pickle (Serialize) the result

In [None]:
# Pickle the original top_1000_pages data
top_1000_pages = pd.read_csv(r'input/post/1000-page/2015-01-01-to-2017-04-08.csv', parse_dates=['post_created_time_CT', 'post_updated_time_CT', 'post_created_date_CT', 'post_updated_date_CT', 'post_created_time', 'post_updated_time'])
top_1000_pages.to_pickle(r'temp/post/original/top_1000_pages.pkl') # DONE

In [None]:
# A global counter for language
en_counter_post = {'en': 0, 'non_en': 0, 'null': 0}

def preprocess_pickle_post(remove_stop=True, lemmatize=True) -> None:
    # Set source and destination path
    src_path = r'temp/post/original'
    dest_path = r'temp/post/preprocessed'
    # Set source and destination files
    src_file = os.path.join(src_path, 'top_1000_pages.pkl')
    dest_file = dest_file = os.path.join(dest_path, 'top_1000_page_preprocessed.pkl')
    # Read in Data
    temp = pd.read_pickle(src_file)
    # Preprocessing
    temp.drop(columns=['post_created_date_CT', 'post_updated_date_CT', 'post_created_time', 'post_updated_time'], inplace=True)
    temp.drop_duplicates(inplace=True, ignore_index=True)
    temp['post_message_converted'] = temp['post_message'].apply(contract_message) # expand contractions
    temp['language'] = temp['post_message'].apply(predict_language) # Add language labels
    # Update language counts
    total = len(temp)
    n_null = temp['language'].value_counts()[1:].sum()
    n_en = temp['language'].value_counts()['en']
    n_non_en = total - n_null - n_en
    en_counter_post['en'] += n_en # update english
    en_counter_post['non_en'] += n_non_en # update non-english
    en_counter_post['null'] += n_null # update null values
    # Preprocessing (part 2)
    temp = temp[temp['language'] == 'en'] # filter out non-English and NaN
    temp.drop(columns='language', inplace=True) # drop the language column
    temp['post_message_converted'] = temp['post_message_converted'].str.lower() # set all words to lowercase
    temp['post_message_converted'] = temp['post_message_converted'].apply(remove_punc_symbol) # remove punctuations and special characters
    temp['post_message_converted'] = temp['post_message_converted'].apply(lambda x: remove_emoji(x)) # remove emojis
    if remove_stop:
        stop = set(stopwords.words('english')) # Get stopwords from nltk
        temp['post_message_converted'] = temp['post_message_converted'].apply(lambda x: " ".join(x for x in x.split() if x not in stop)) # remove stopwords
    if lemmatize:
        try: # lemmatize words (in a faster way)
            temp['post_message_converted'] = [' '.join(token.lemma_ for token in doc) for doc in nlp.pipe(temp['post_message_converted'])]
        except TypeError: # lemmatize in a slower way (which accepts null values)
            temp['post_message_converted'] = temp['post_message_converted'].apply(lemmatize)
    # Pickling
    temp.to_pickle(dest_file)

In [None]:
if __name__ == '__main__':
    preprocess_pickle_post() # DONE

In [None]:
with open('temp/temp_variables/en_counter_post.pkl', 'wb') as f:
    pickle.dump(en_counter_post, f) # DONE

### Graph: Number of Samples for Top Languages

In [None]:
# Get rid of unwanted columns
cols_to_drop = ['post_picture', 'post_link', 'post_created_time', 'post_updated_time']
top_1000_pages.drop(columns=cols_to_drop, inplace=True)

In [None]:
top_1000_pages['post_message'] = top_1000_pages['post_message'].apply(contract_message)
top_1000_pages['language'] = top_1000_pages['post_message'].apply(predict_language)

In [None]:
# Check the results
# print(f"{top_1000_pages.post_name.isna().mean():.6f}")
# print(f"{top_1000_pages.post_message.isna().mean():.6f}")
# print(f"{top_1000_pages.post_caption.isna().mean():.6f}")
# print(f"{top_1000_pages.post_description.isna().mean():.6f}")
# print(f"{top_1000_pages.post_reactions.isna().mean():.6f}")
# print(f"{top_1000_pages.post_likes.isna().mean():.6f}")
# print(f"{top_1000_pages.post_comments.isna().mean():.6f}")
# print(f"{top_1000_pages.post_shares.isna().mean():.6f}")

In [None]:
def filter_top_unique_elements(ser, thres=0.001, ndecimal=6, show_null_rate=True):
    """
    Return top unique elements which take up no less than 0.001 (or thres) in the entire pd.Series
    This is done through ser.value_counts(normalize=True, dropna=False)
    """
    filt = (ser.value_counts(dropna=False, normalize=True) >= thres)
    if show_null_rate:
        print(f"null rate = {ser.isnull().mean() * 100:.6f} %")
    return ser.value_counts(dropna=False).loc[filt].round(ndecimal)

# filter for languages that's at least 0.1%, then assign to top_languages
lang_thres = 0.001
top_languages = filter_top_unique_elements(top_1000_pages['language'], lang_thres)

In [None]:
# A simpler way to format yticks
# fig, ax = plt.subplots(figsize=(12, 8))
# f = mticker.ScalarFormatter(useOffset=False, useMathText=True)
# g = lambda x, pos : "${}$".format(f._formatSciNotation('%1.10e' % x))
# plt.gca().yaxis.set_major_formatter(mticker.FuncFormatter(g))
# top_languages.plot.bar()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
# Customize Yticks
class MathTextSciFormatter(mticker.Formatter):
    def __init__(self, fmt="%1.2e"):
        self.fmt = fmt
    def __call__(self, x, pos=None):
        s = self.fmt % x
        decimal_point = '.'
        positive_sign = '+'
        tup = s.split('e')
        significand = tup[0].rstrip(decimal_point)
        sign = tup[1][0].replace(positive_sign, '')
        exponent = tup[1][1:].lstrip('0')
        if exponent:
            exponent = '10^{%s%s}' % (sign, exponent)
        if significand and exponent:
            s =  r'%s{\times}%s' % (significand, exponent)
        else:
            s =  r'%s%s' % (significand, exponent)
        return "${}$".format(s)

# Format with 2 decimal places
plt.gca().yaxis.set_major_formatter(MathTextSciFormatter("%1.2e"))
# Plot Top Languages
plt.title("Number of Samples for Top Languages", size=20)
plt.xticks(size=20)
plt.yticks(size=15)
top_languages.plot.bar();

In [None]:
top_1000_pages_cleaned = pd.read_pickle(r'temp/post/preprocessed/top_1000_page_preprocessed.plk')
# Get the post_id whose post contains emojis
filt = top_1000_pages_cleaned['post_message'].apply(lambda x: len(demoji.findall(x)) != 0)
emoji_post_ids = top_1000_pages_cleaned.loc[filt, 'post_id']
with open(r'temp/temp_variables/emoji_post_ids.pkl', 'wb') as f:
    pickle.dump(emoji_post_ids, f) # DONE

# Appendix
* Reaction
    * Reaction on 1000-page
    * Reactions on Politicians
* Side Note

# Reaction
The reaction data wasn't used in further analysis. Thus, its preprocessing is just for reference as follows.

### 1. Reaction on 1000-page: 
#### A. Every-20-minutes (2016-09-29 ~ 2016-11-21)
1. Sum the number of each REACTIONs on Pages for each post in each 20min, and Dump to pickle files (split by chunks)
 * chunksize = 10000000 rows in the original data (before aggregation)
2. Summarize the number of each reactions, and save the summary in a dictionary, `num_reactions`

In [None]:
# Reaction (LIKE, LOVE, HAHA, WOW, SAD, ANGRY, THANKFUL)
## 1. Reactions on 1000-page
#### A. Every-20-minutes (2016-09-29 ~ 2016-11-21)
def reaction_pickler(reaction_type, chunksize=10000000):
    "Read in reaction data by chunks, and process them to return desired output."
    # Create an iterator (chunk_reader) to read files by chunks
    path = r'input/reaction/1000-page/20-min/by-reaction-type/{}.csv'.format(reaction_type)
    date_parser = lambda unixTime: pd.to_datetime(unixTime, unit='s')
    chunk_reader = pd.read_csv(path, usecols=['post_id', 'reaction_time'], parse_dates=['reaction_time'], date_parser=date_parser, chunksize=chunksize)

    # Iteratively write "processed" chunks to pickle files
    for i, chunk in enumerate(chunk_reader):
        aggregated_chunk = chunk.groupby(['post_id', 'reaction_time'])['reaction_time'].count()
        aggregated_chunk.to_pickle(r'temp/reaction/1000-page/20-min/{}_{}.pkl'.format(reaction_type, i))

# Actually pickle the files
reaction_types = ['LIKE', 'LOVE', 'HAHA', 'WOW', 'SAD', 'ANGRY', 'THANKFUL']
for rt in reaction_types:
    reaction_pickler(rt) # DONE

In [None]:
reaction_types = ['LIKE', 'LOVE', 'HAHA', 'WOW', 'SAD', 'ANGRY', 'THANKFUL']
num_reactions = dict(zip(reaction_types, [0]*7))
# Store the count for each reactions
for f_name in os.listdir(r'temp/reaction/1000-page/20-min'):
    reaction = f_name.split('_')[0]
    num_reaction = pd.read_pickle(r'temp/reaction/1000-page/20-min/{}'.format(f_name)).sum() # Sum of the Series
    try:
        num_reactions[reaction] += num_reaction
    except KeyError:
        print(f"Reaction {reaction} doesn't exist.")
# Save the variable: num_reactions
with open('temp/temp_variables/num_reactions.pkl', 'wb') as f:
    pickle.dump(num_reactions, f) # DONE

In [11]:
with open('temp/temp_variables/num_reactions.pkl', 'rb') as f:
    num_reactions = pickle.load(f)

#### B. LIKE by US political users (2015-01-01 ~ 2016-11-29)
Sum the number of each REACTIONs on Pages for each post, and Dump to pickle files (split by dates)

In [None]:
## 1. Reactions on 1000-page
#### B. LIKE by US political users (2015-01-01 ~ 2016-11-30)
like_on_page_post_dates = []
src_path = r'input/reaction/1000-page/2015-01-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date'
dest_path = r'temp/reaction/1000-page/LIKE'
for fileName in sorted(os.listdir(src_path)):
    if not fileName.startswith('.'): # Not a hidden file
        like_on_page_post_dates.append(fileName.split('.')[0])

for post_date in like_on_page_post_dates:
    # Set source and destination files
    src_file = os.path.join(src_path, f'{post_date}.csv')
    dest_file = os.path.join(dest_path, f'page_likes_{post_date}.pkl')
    # Read in a file
    temp_df = pd.read_csv(src_file)
    # Compute num of likes for each post
    aggregated_temp = temp_df.groupby('post_id').count()
    # Save as pickle (under the temp directory)
    aggregated_temp.to_pickle(dest_file) # DONE

### 2. Reactions on politicians
#### LIKE by US political users (2015-05-01 ~ 2016-11-29)
Sum the number of LIKEs on Politicians for each post, and Dump to pickle files (split by dates)

In [None]:
## 2. Reactions on Politician 
#### A. LIKE by US political users (2015-05-01 ~ 2016-11-29) # NOT 11-30!
like_on_politician_post_dates = []
src_path = r'input/reaction/politician/2015-05-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date'
dest_path = r'temp/reaction/politician/LIKE'
for fileName in sorted(os.listdir(src_path)):
    if not fileName.startswith('.'): # Not a hidden file
        like_on_politician_post_dates.append(fileName.split('.')[0])

for post_date in like_on_politician_post_dates:
    # Set source and destination files
    src_file = os.path.join(src_path, f'{post_date}.csv')
    dest_file = os.path.join(dest_path, f'politician_likes_{post_date}.pkl')
    # Read in a file
    temp_df = pd.read_csv(src_file)
    # Compute num of likes for each post
    aggregated_temp = temp_df.groupby('post_id').count()
    # Save as pickle (under the temp directory)
    aggregated_temp.to_pickle(dest_file) # DONE

### Side Notes:
Below is the original intended way to read the files.<br>
Unfortunately, the files were too large, and I have to shrink their sizes before actually reading to memory...<br>
Thus I need this notebook "Preprocessing.ipynb" to pickle everything.
```python
# Reaction (LIKE, LOVE, HAHA, WOW, SAD, ANGRY, THANKFUL)
## 1. Reactions on 1000-page
#### A. Every-20-minutes (2016-09-29 ~ 2016-11-21)
date_parser = lambda unixTime: pd.to_datetime(unixTime, unit='s')
every20min_like = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/LIKE.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_love = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/LOVE.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_haha = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/HAHA.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_wow = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/WOW.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_sad = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/SAD.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_angry = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/ANGRY.csv', parse_dates=['reaction_time'], date_parser=date_parser)
every20min_thankful = pd.read_csv(r'input/reaction/1000-page/20-min/by-reaction-type/THANKFUL.csv', parse_dates=['reaction_time'], date_parser=date_parser)

#### B. LIKE by US political users (2015-01-01 ~ 2016-11-30)
# pd.read_csv(r'input/reaction/1000-page/2015-01-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date/')
like_by_post_dates = []
for fileName in sorted(os.listdir(r'input/reaction/1000-page/2015-01-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date')):
    if not fileName.startswith('.'): # Not a hidden file
        like_by_post_dates.append(fileName.split('.')[0])

LIKE_dfs = []
for post_date in like_by_post_dates:
    path = r'input/reaction/1000-page/2015-01-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date/{}.csv'.format(post_date)
    temp_df = pd.read_csv(path)
    temp_df['post_date'] = datetime.strptime(post_date, "%Y-%m-%d") # Add a column specifying the date, which is (part of) the fileName
    LIKE_dfs.append(temp_df)
    
LIKE_on_1000_page = pd.concat(LIKE_dfs)

## 2. Reactions on Politician 
#### A. LIKE by US political users (2015-05-01 ~ 2016-11-30)
like_by_post_dates = []
for fileName in sorted(os.listdir(r'input/reaction/politician/2015-05-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date')):
    if not fileName.startswith('.'): # Not a hidden file
        like_by_post_dates.append(fileName.split('.')[0])

LIKE_dfs = []
for post_date in like_by_post_dates:
    path = r'input/reaction/politician/2015-05-01-to-2016-11-30/us-political-user/by-reaction-type/LIKE/by-post-date/{}.csv'.format(post_date)
    temp_df = pd.read_csv(path)
    temp_df['post_date'] = datetime.strptime(post_date, "%Y-%m-%d") # Add a column specifying the date, which is (part of) the fileName
    LIKE_dfs.append(temp_df)
    
LIKE_on_politician = pd.concat(LIKE_dfs)

# Comment
## a total of 500 csv files
## NOTE: The files are LARGE! eg. 000000000000.csv, as DataFrame, has a shape of (2863013, 4) and memory usage of 87.4+ MB
tables = []
for fileName in os.listdir(r'input/comment/2015-01-01-to-2016-11-30'):
    tables.append(pd.read_csv(r'input/comment/2015-01-01-to-2016-11-30/{}'.format(fileName), parse_dates=["comment_created_time"]))
comments = pd.concat(tables)
```