# Transformation of the UCI "Microblog PCU" dataset into CSV format

The main objective is to transform this UCI dataset into a CSV file with the following format: <Id, Date/Order, Attribute1, Attribute2, ...>.

The MicroblogPCU dataset is used to generate a new dataset that tracks the posts of each user over time. Each tuple corresponds to a post with the following information: <User Id, Date, Gender, Content, Keywords, Topic, Number of retweets, Number of comments>.


- Find the file encoding:

In [1]:
import chardet
import pandas as pd

# look at the first ten thousand bytes to guess the character encoding
with open("..\\Data\\Raw\\microblogPCU\\user_post.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

pd.show_versions()  # check the pandas version (must be >= 1.3 for the encoding_errors parameter of read_csv)


{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

INSTALLED VERSIONS
------------------
commit           : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python           : 3.9.0.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Rwanda.utf8

pandas           : 1.3.1
numpy            : 1.21.1
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.2.0
Cython           : None
pytest           : 6.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.4
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.3
IPython          : 7.29.0
pandas_datareader: 

- Load the CSV files with users (weibo_user.csv) and posts (user_post.csv).

In [2]:
import pandas as pd

users = pd.read_csv("..\\Data\\Raw\\microblogPCU\\weibo_user.csv", header=0, usecols=['user_id','gender'], encoding_errors='replace')
posts = pd.read_csv("..\\Data\\Raw\\microblogPCU\\user_post.csv", header=0, usecols=['post_time','content','poster_id','repost_num','comment_num'], parse_dates=['post_time'], encoding='GB2312', encoding_errors='replace')
posts


Unnamed: 0,post_time,content,poster_id,repost_num,comment_num
0,2014-08-17 21:00:00,置顶#代理须知#拒绝泛滥，我不会每天一直更新图片的，图片有专门的地方，为了不影响其他顾客正常...,2.364770e+09,0.0,0.0
1,2014-11-16 16:45:00,爆美来袭独家实拍质量看的见招微信代理一件代发专卖国际品牌名鞋品质赢得市场诚信赚得人心V信：a...,2.364770e+09,0.0,0.0
2,2014-11-16 16:29:00,爆美来袭独家实拍质量看的见招微信代理一件代发专卖国际品牌名鞋品质赢得市场诚信赚得人心V信：a...,2.364770e+09,0.0,0.0
3,2014-11-16 16:19:00,爆美来袭独家实拍招微信代理一件代发专卖国际品牌名鞋品质赢得市场诚信赚得人心V信：asicsh...,2.364770e+09,0.0,0.0
4,2014-11-16 16:11:00,独家实拍招微信代理一件代发专卖国际品牌名鞋品质赢得市场诚信赚得人心V信：asicshoesQ...,2.364770e+09,0.0,0.0
...,...,...,...,...,...
48808,2014-07-10 21:21:00,等你们被黑你们就关注了,1.933466e+09,0.0,2.0
48809,2014-07-10 21:21:00,支付宝黑钱了都没人关注么？,1.933466e+09,0.0,1.0
48810,2014-07-10 21:16:00,@陈巨飞@蔡炎�|@褚乔@杜子建各位仁兄帮忙转发，不用这办法真的没法诉权，让支付宝黑了5000元,1.933466e+09,1.0,1.0
48811,2014-07-10 21:12:00,#张耀扬吸毒被查#,1.933466e+09,0.0,1.0
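
The replacement characters (�) in the output suggest that some bytes in the file fall outside GB2312, which covers simplified characters only. One option, not used in this notebook, is to decode with GB18030, a superset of GB2312 and GBK that Python also supports. The sketch below uses a made-up sample string rather than the real file, so whether it recovers more text in `user_post.csv` is an assumption that would need checking.

```python
# Hypothetical sample: a phrase containing the traditional character '會',
# encoded with GBK (an extension of GB2312).
raw = "演唱會".encode("gbk")

# GB2312 cannot decode the traditional character and substitutes U+FFFD.
as_gb2312 = raw.decode("gb2312", errors="replace")

# GB18030 is backward compatible with GBK and decodes the whole phrase.
as_gb18030 = raw.decode("gb18030")

print(as_gb2312)
print(as_gb18030)
```

With pandas, the same idea would amount to passing `encoding='gb18030'` to `pd.read_csv` while keeping `encoding_errors='replace'` as a safety net.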


- Join data between users and posts files:

In [3]:
microblog = users.set_index(['user_id']).join( posts.set_index(['poster_id']))
microblog


Unnamed: 0,gender,post_time,content,repost_num,comment_num
1041514813,male,2014-11-21 06:57:00,【中国最危险的村庄之郭亮村全方位解析】(分享自@今日头条)|中国最危险的村庄之郭亮村全方位解析,0.0,0.0
1041514813,male,2014-11-21 06:52:00,【康巴传奇之康北旅游】(分享自@今日头条)|康巴传奇之康北旅游,0.0,0.0
1041514813,male,2014-11-20 06:59:00,【“青龙、白虎、朱雀、玄武”四圣兽的来龙去脉】(分享自@今日头条)|“青龙、白虎、朱雀、玄武...,0.0,1.0
1041514813,male,2014-11-20 06:57:00,【19张惊艳四座的暗色调照片！】(分享自@今日头条)|19张惊艳四座的暗色调照片！,0.0,0.0
1041514813,male,2014-11-20 06:52:00,【顾里20句话道破人生真相！话虽毒却句句戳心！】(分享自@今日头条)|顾里20句话道破人生真...,0.0,0.0
...,...,...,...,...,...
5365515981,female,2014-11-14 02:54:00,【你适合跟谁在一起】据说经过这个测试的情侣不会分手；你和她将是最完美的酸碱中和，永远在一起。...,0.0,0.0
5365515981,female,2014-11-13 18:32:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0
5365515981,female,2014-11-13 11:19:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0
5365519449,female,NaT,,,
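
`DataFrame.join` performs a left join by default, which is why a user without any post (such as the last row of the output) still appears, with `NaT`/`NaN` in the post columns. The toy frames below are invented for illustration and only reuse the column names of the notebook.

```python
import pandas as pd

# Toy data: user 2 has no post.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["male", "female"]})
posts = pd.DataFrame({"poster_id": [1, 1], "content": ["first post", "second post"]})

# Default left join: every user is kept, one row per matching post.
joined = users.set_index("user_id").join(posts.set_index("poster_id"))

# Inner join: users without posts are dropped at join time.
inner = users.set_index("user_id").join(posts.set_index("poster_id"), how="inner")

print(joined)
print(inner)
```

Using `how="inner"` would remove users without posts directly, but posts whose content is itself empty would still need a separate `dropna`.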


- Delete tuples with an empty post (NaN value):

In [4]:
microblog.dropna(subset=['content'], inplace=True)
microblog.head(22)

Unnamed: 0,gender,post_time,content,repost_num,comment_num
1041514813,male,2014-11-21 06:57:00,【中国最危险的村庄之郭亮村全方位解析】(分享自@今日头条)|中国最危险的村庄之郭亮村全方位解析,0.0,0.0
1041514813,male,2014-11-21 06:52:00,【康巴传奇之康北旅游】(分享自@今日头条)|康巴传奇之康北旅游,0.0,0.0
1041514813,male,2014-11-20 06:59:00,【“青龙、白虎、朱雀、玄武”四圣兽的来龙去脉】(分享自@今日头条)|“青龙、白虎、朱雀、玄武...,0.0,1.0
1041514813,male,2014-11-20 06:57:00,【19张惊艳四座的暗色调照片！】(分享自@今日头条)|19张惊艳四座的暗色调照片！,0.0,0.0
1041514813,male,2014-11-20 06:52:00,【顾里20句话道破人生真相！话虽毒却句句戳心！】(分享自@今日头条)|顾里20句话道破人生真...,0.0,0.0
1041514813,male,2014-11-20 06:51:00,【做人最难得的十个字】(分享自@今日头条)http://t.cn/RzZDmJO,0.0,1.0
1041514813,male,2014-11-20 06:51:00,【雾霾不怕不怕啦，盘点中国最适合洗肺的7个城市】(分享自@今日头条)|雾霾不怕不怕啦，盘点中...,0.0,0.0
1041514813,male,2014-11-19 21:47:00,【经典电脑桌面原来都有实景地哪天按图索骥走一遍】(分享自@今日头条)|经典电脑桌面原来都有实...,0.0,0.0
1041514813,male,2014-11-19 21:46:00,【这些话，只说给女人听，请静静地看完它】(分享自@今日头条)|这些话，只说给女人听，请静静地看完它,0.0,0.0
1041514813,male,2014-11-19 21:46:00,【摄影技术没有进步的5个常见原因】(分享自@今日头条)|摄影技术没有进步的5个常见原因,0.0,0.0


- Translate the posts (in Chinese in the original dataset) into English using the Google Cloud Translation API:

Because the number of posts is very large, the translations are sent in batches of 400 tuples and stored in temporary files, which are then regrouped.

In [5]:
# from os import environ

# from google.cloud import translate_v2 as translate

# # project_id = 'translate-chinese-dataset'
# # parent = f"projects/{project_id}"
# # print(project_id)

# client = translate.Client()

# # Translate posts and save data into separate files
# i = 29600
# chunk_size = 400
# while(i < len(microblog.index) ):
#     current_chunk = microblog.iloc[i:i+chunk_size].copy()
#     current_chunk['content'] = current_chunk['content'].apply(lambda x: client.translate(x, target_language='en')['translatedText'])
#     current_chunk.to_csv('../CSV/tmp/microblog_tmp_'+str(i)+'.csv', index=False)
#     i+=chunk_size
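
The regrouping step is not shown in the notebook. A minimal sketch of the bookkeeping it needs is given here: computing the batch boundaries and ordering the temporary files by their numeric offset, since a plain alphabetical sort would place `microblog_tmp_10000.csv` before `microblog_tmp_400.csv`. The helper names `chunk_bounds` and `offset_of` are hypothetical, and the file names are examples following the pattern used in the loop.

```python
import re
from pathlib import Path


def chunk_bounds(n_rows, chunk_size=400, start=0):
    """Yield (start, stop) row ranges of at most chunk_size rows."""
    for lo in range(start, n_rows, chunk_size):
        yield lo, min(lo + chunk_size, n_rows)


def offset_of(path):
    """Return the numeric offset embedded in a name like microblog_tmp_400.csv."""
    match = re.search(r"_(\d+)\.csv$", Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


bounds = list(chunk_bounds(1000))

# Example file names, deliberately out of order.
names = ["microblog_tmp_10000.csv", "microblog_tmp_400.csv", "microblog_tmp_0.csv"]
ordered = sorted(names, key=offset_of)

print(bounds)
print(ordered)
```

Once ordered, the files could be read with `pd.read_csv` and combined with `pd.concat` to rebuild a single DataFrame in the original row order.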


Another solution for translating a large number of posts is to run batch translations on the Google Cloud service.

In [6]:
# from google.cloud import translate

# def batch_translate_text(
#     input_uri="gs://microblog/chinese_posts.txt",
#     output_uri="gs://microblog/results/",
#     project_id="translate-chinese-dataset",
#     timeout=1800,
# ):
#     """Translates a batch of texts on GCS and stores the result in a GCS location."""

#     client = translate.TranslationServiceClient()

#     location = "us-central1"
#     # Supported file types: https://cloud.google.com/translate/docs/supported-formats
#     gcs_source = {"input_uri": input_uri}

#     input_configs_element = {
#         "gcs_source": gcs_source,
#         "mime_type": "text/plain",  # Can be "text/plain" or "text/html".
#     }
#     gcs_destination = {"output_uri_prefix": output_uri}
#     output_config = {"gcs_destination": gcs_destination}
#     parent = f"projects/{project_id}/locations/{location}"

#     # Supported language codes: https://cloud.google.com/translate/docs/languages
#     operation = client.batch_translate_text(
#         request={
#             "parent": parent,
#             "source_language_code": "zh-CN",
#             "target_language_codes": ["en"],  # Up to 10 language codes here.
#             "input_configs": [input_configs_element],
#             "output_config": output_config,
#         }
#     )

#     print("Waiting for operation to complete...")
#     response = operation.result(timeout)

#     print("Total Characters: {}".format(response.total_characters))
#     print("Translated Characters: {}".format(response.translated_characters))

# # save the "content" data of the dataframe in a csv file for batch processing in google cloud
# # microblog['content'].to_csv('../CSV/tmp/chinese_posts.txt', header=False, index=False)

# batch_translate_text()

- Read the file with the posts translated into English and store the data in a DataFrame:

In [7]:
# read the translations (one line per post) and close the file afterwards
with open("..\\Data\\CSV\\tmp\\results_microblog_chinese_posts_en_translations.txt", "r") as translated_file:
    microblog['english content'] = translated_file.readlines()
microblog

Unnamed: 0,gender,post_time,content,repost_num,comment_num,english content
1041514813,male,2014-11-21 06:57:00,【中国最危险的村庄之郭亮村全方位解析】(分享自@今日头条)|中国最危险的村庄之郭亮村全方位解析,0.0,0.0,"[All-round analysis of Guo Liangcun, the most ..."
1041514813,male,2014-11-21 06:52:00,【康巴传奇之康北旅游】(分享自@今日头条)|康巴传奇之康北旅游,0.0,0.0,[Kamba Legend of Kangbei Travel] (Shared from ...
1041514813,male,2014-11-20 06:59:00,【“青龙、白虎、朱雀、玄武”四圣兽的来龙去脉】(分享自@今日头条)|“青龙、白虎、朱雀、玄武...,0.0,1.0,[The Ins and Outs of the Four Sacred Beasts of...
1041514813,male,2014-11-20 06:57:00,【19张惊艳四座的暗色调照片！】(分享自@今日头条)|19张惊艳四座的暗色调照片！,0.0,0.0,[19 stunning dark-toned photos of the four sea...
1041514813,male,2014-11-20 06:52:00,【顾里20句话道破人生真相！话虽毒却句句戳心！】(分享自@今日头条)|顾里20句话道破人生真...,0.0,0.0,[Gu Li reveals the truth of life in 20 words! ...
...,...,...,...,...,...,...
5365515981,female,2014-11-14 14:21:00,全国神经病等级测试题！还敢说自己不是“深井冰”？,0.0,0.0,National Neuropathy Level Test Questions! Dare...
5365515981,female,2014-11-14 02:54:00,【你适合跟谁在一起】据说经过这个测试的情侣不会分手；你和她将是最完美的酸碱中和，永远在一起。...,0.0,0.0,[Who do you suit to be with] It is said that t...
5365515981,female,2014-11-13 18:32:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still..."
5365515981,female,2014-11-13 11:19:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still..."
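
Assigning the result of `readlines()` to a column relies on line *i* of the translated file corresponding to row *i* of the DataFrame. A small sanity check before the assignment can catch a mismatch early. The sketch below uses an in-memory stand-in for the file and a made-up row count, so it illustrates the check rather than the notebook's actual data.

```python
import io

# Hypothetical stand-in for the translation output file (one post per line).
translated_file = io.StringIO("First post\nSecond post\nThird post\n")
expected_rows = 3  # would be len(microblog) in the notebook

# Drop the trailing newline that readlines() would otherwise keep.
lines = [line.rstrip("\n") for line in translated_file]

# Fail loudly if the file and the DataFrame are out of step.
if len(lines) != expected_rows:
    raise ValueError(f"{len(lines)} translated lines for {expected_rows} posts")

print(lines)
```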


- Delete remaining Chinese characters in posts:

In [8]:
import re

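# remove a leading '-', runs of '#', and every character outside the allowed set;
# inside the character class, ' -.' is a range (from space to '.'), so characters such as ! " % & ' ( ) are kept as well
microblog['english content'] = microblog['english content'].apply(lambda x: re.sub(r'^-|#+|[^A-Za-z0-9 -.,;]+', "", x))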
microblog['english content'].head(3386)

microblog.loc[1705586121]

Unnamed: 0,gender,post_time,content,repost_num,comment_num,english content
1705586121,female,2014-11-11 16:54:00,置顶G.E.M.X.X.X.Live世界巡��演唱��，最新公�蚜讼愀壅�(4��)、�R�...,3029.0,6755.0,"Top G.E.M.X.X.X.Live world tour concerts, the ..."
1705586121,female,2014-11-17 22:36:00,杜莎夫人��像�^，你��以後��多了一��很�鄢缘淖】袜福∈澄����幔抗�哈哈哈！,4302.0,7763.0,"Madame Tussauds is like , you will have a lot ..."
1705586121,female,2014-11-16 10:32:00,一天之��在於晨。大家早上在�致铮课铱��旱��g海，一大早就�_始���R！http://...,6144.0,17994.0,"A day is in the morning. In the morning, every..."
1705586121,female,2014-11-15 21:13:00,�o�r�o刻都���浜米约骸Ｒ��槲杼��凫段���每��人，我��只要勇敢站上去！,6715.0,12042.0,"Everyone, I just have to stand up bravely!"
1705586121,female,2014-11-14 23:50:00,�@�Y拜一直����煮�|西，就是�榱私裢怼０����Y婚24�L年了，�@是我�樗���...,8766.0,16442.0,"Y has been cooking West, that is, 0. Y has bee..."
...,...,...,...,...,...,...
1705586121,female,2014-09-08 21:48:00,中秋晚上的活�邮牵�一家人打桌球：）http://t.cn/Rht7LRr,5173.0,10996.0,"The Mid-Autumn Festival night live, mail, and ..."
1705586121,female,2014-09-08 20:40:00,中秋��能回家吃��真的太幸福了！一家�F�A，感恩(拍�@��照片，爸爸在旁配音：�W友�...,6096.0,13049.0,Its so happy to be able to go home to eat duri...
1705586121,female,2014-09-08 18:09:00,各位中秋��快�愤@�N浪漫�A�M的一天，我自己要先跟大家�f一句：喜�g你！！！然後咧，...,6617.0,10483.0,"For the Mid-Autumn Festival, you are very angr..."
1705586121,female,2014-09-08 00:07:00,中秋��前夕，在家�e弄甜�c�o家人朋友！心中充�M激�樱�弄完後好有成就感，嘻嘻,12375.0,22401.0,"On the eve of the Mid-Autumn Festival, sweeten..."
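
To make the effect of the regular expression from cell In [8] concrete, the sketch below applies it to a few invented strings. The helper name `clean` and the sample inputs are illustrative only. Note how a URL loses its `:` and `/` characters and collapses to a string starting with `httpt`.

```python
import re

PATTERN = r'^-|#+|[^A-Za-z0-9 -.,;]+'


def clean(text):
    """Apply the same substitution as the notebook to a single string."""
    return re.sub(PATTERN, "", text)


url = clean("http://t.cn/RzZDmJO")
mixed = clean("【康巴】Kamba Legend #travel# (Shared from @今日头条)")
dashed = clean("-leading dash, 50% off!")

print(url)
print(mixed)
print(dashed)
```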


- Delete posts with invalid or empty translations:

In [9]:
import numpy as np

# delete whitespace at the start and end
microblog['english content'] = microblog['english content'].apply(lambda x: x.strip())

# replace with NaN the strings that are too short, that do not start with a word followed by a space, that contain only special characters, or that are leftover URLs
microblog['english content'] = microblog['english content'].apply(lambda x: x if len(x)>5 and re.match(r'^[a-zA-Z]* [a-zA-Z]*', x) and not re.match(r'^[_\W]+$', x) and not re.match(r'^httpt.*',x) else np.nan )

# replace HTML entities with readable strings
microblog['english content'].replace( {'&amp;': ' and '}, regex=True, inplace=True)
microblog['english content'].replace( {'&gt;': ' > '}, regex=True, inplace=True)

# delete rows with NaN
microblog.dropna(subset=['english content'], inplace=True)

microblog['english content']

1041514813    Kamba Legend of Kangbei Travel (Shared from )K...
1041514813    The Ins and Outs of the Four Sacred Beasts of ...
1041514813    Gu Li reveals the truth of life in 20 words! A...
1041514813    Ten Rare Words to Be a Man (Shared from ) http...
1041514813    Dont be afraid of smog, take inventory of 7 ci...
                                    ...                        
5365515981    National Neuropathy Level Test Questions! Dare...
5365515981    Who do you suit to be with It is said that the...
5365515981    In this age of passion and lace, can you still...
5365515981    In this age of passion and lace, can you still...
5372497984    I found that the  used by UC currently is supe...
Name: english content, Length: 31907, dtype: object
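
The validity test from cell In [9] can be isolated in a small function to see which strings it keeps. The function name `is_valid` and the sample strings are assumptions made for illustration. One consequence worth noting is that a post whose first word starts with a digit or contains a hyphen is rejected, because the pattern requires letters immediately followed by a space at the start.

```python
import re


def is_valid(text):
    """Same conditions as the notebook's lambda, returned as a boolean."""
    return bool(
        len(text) > 5
        and re.match(r'^[a-zA-Z]* [a-zA-Z]*', text)
        and not re.match(r'^[_\W]+$', text)
        and not re.match(r'^httpt.*', text)
    )


samples = [
    "Kamba Legend of Kangbei Travel",     # kept
    "19 stunning dark-toned photos",      # rejected: starts with a digit
    "All-round analysis of Guo Liangcun", # rejected: first word is hyphenated
    "httpt.cnRzZDmJO",                    # rejected: leftover URL
    "Hello",                              # rejected: too short
]
results = [is_valid(s) for s in samples]
print(results)
```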

- Extract 5 keywords for each post:

In [10]:
import spacy
import spacy_ke  # https://github.com/talmago/spacy_ke

# load spacy model
nlp = spacy.load("en_core_web_sm")

# spacy v3.0.x factory.
# if you're using spacy v2.x.x switch to `nlp.add_pipe(spacy_ke.Yake(nlp))`
nlp.add_pipe("yake")

# extract up to 5 keywords from the English content and store them directly as a Python list in a dedicated column
microblog['keywords'] = microblog['english content'].apply(lambda x: [keyword[0].text for keyword in nlp(x)._.extract_keywords(n=5)])

microblog



Unnamed: 0,gender,post_time,content,repost_num,comment_num,english content,keywords
1041514813,male,2014-11-21 06:52:00,【康巴传奇之康北旅游】(分享自@今日头条)|康巴传奇之康北旅游,0.0,0.0,Kamba Legend of Kangbei Travel (Shared from )K...,"[Kangbei Travel, Kamba Legend, Travel, Shared,..."
1041514813,male,2014-11-20 06:59:00,【“青龙、白虎、朱雀、玄武”四圣兽的来龙去脉】(分享自@今日头条)|“青龙、白虎、朱雀、玄武...,0.0,1.0,The Ins and Outs of the Four Sacred Beasts of ...,"[Blue Dragon, White Tiger, Sacred Beasts, Suza..."
1041514813,male,2014-11-20 06:52:00,【顾里20句话道破人生真相！话虽毒却句句戳心！】(分享自@今日头条)|顾里20句话道破人生真...,0.0,0.0,Gu Li reveals the truth of life in 20 words! A...,"[Gu Li, Gu Li reveals, words, poisonous, poign..."
1041514813,male,2014-11-20 06:51:00,【做人最难得的十个字】(分享自@今日头条)http://t.cn/RzZDmJO,0.0,1.0,Ten Rare Words to Be a Man (Shared from ) http...,"[Rare Words, Shared, Man]"
1041514813,male,2014-11-20 06:51:00,【雾霾不怕不怕啦，盘点中国最适合洗肺的7个城市】(分享自@今日头条)|雾霾不怕不怕啦，盘点中...,0.0,0.0,"Dont be afraid of smog, take inventory of 7 ci...","[Shared, lung washing, smog, China, inventory]"
...,...,...,...,...,...,...,...
5365515981,female,2014-11-14 14:21:00,全国神经病等级测试题！还敢说自己不是“深井冰”？,0.0,0.0,National Neuropathy Level Test Questions! Dare...,"[Level Test Questions, Neuropathy Level Test, ..."
5365515981,female,2014-11-14 02:54:00,【你适合跟谁在一起】据说经过这个测试的情侣不会分手；你和她将是最完美的酸碱中和，永远在一起。...,0.0,0.0,Who do you suit to be with It is said that the...,"[base neutralization, perfect acid, base, acid..."
5365515981,female,2014-11-13 18:32:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still...","[dear, lace, confirm, passion and lace, confir..."
5365515981,female,2014-11-13 11:19:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still...","[dear, lace, confirm, passion and lace, confir..."


- Rename columns and change their order:

In [11]:
# reset the index to integrate it as a column
microblog.reset_index(inplace=True)

# renaming columns
microblog = microblog.rename(columns={'index':'Id', 'gender':'Gender', 'post_time':'Date', 'repost_num':'Nb reposts', 'comment_num':'Nb comments', 'keywords':'Keywords'})

# delete rows whose keyword list is empty or contains only empty strings
microblog = microblog[microblog['Keywords'].apply(any)].copy()

microblog

Unnamed: 0,Id,Gender,Date,content,Nb reposts,Nb comments,english content,Keywords
0,1041514813,male,2014-11-21 06:52:00,【康巴传奇之康北旅游】(分享自@今日头条)|康巴传奇之康北旅游,0.0,0.0,Kamba Legend of Kangbei Travel (Shared from )K...,"[Kangbei Travel, Kamba Legend, Travel, Shared,..."
1,1041514813,male,2014-11-20 06:59:00,【“青龙、白虎、朱雀、玄武”四圣兽的来龙去脉】(分享自@今日头条)|“青龙、白虎、朱雀、玄武...,0.0,1.0,The Ins and Outs of the Four Sacred Beasts of ...,"[Blue Dragon, White Tiger, Sacred Beasts, Suza..."
2,1041514813,male,2014-11-20 06:52:00,【顾里20句话道破人生真相！话虽毒却句句戳心！】(分享自@今日头条)|顾里20句话道破人生真...,0.0,0.0,Gu Li reveals the truth of life in 20 words! A...,"[Gu Li, Gu Li reveals, words, poisonous, poign..."
3,1041514813,male,2014-11-20 06:51:00,【做人最难得的十个字】(分享自@今日头条)http://t.cn/RzZDmJO,0.0,1.0,Ten Rare Words to Be a Man (Shared from ) http...,"[Rare Words, Shared, Man]"
4,1041514813,male,2014-11-20 06:51:00,【雾霾不怕不怕啦，盘点中国最适合洗肺的7个城市】(分享自@今日头条)|雾霾不怕不怕啦，盘点中...,0.0,0.0,"Dont be afraid of smog, take inventory of 7 ci...","[Shared, lung washing, smog, China, inventory]"
...,...,...,...,...,...,...,...,...
31902,5365515981,female,2014-11-14 14:21:00,全国神经病等级测试题！还敢说自己不是“深井冰”？,0.0,0.0,National Neuropathy Level Test Questions! Dare...,"[Level Test Questions, Neuropathy Level Test, ..."
31903,5365515981,female,2014-11-14 02:54:00,【你适合跟谁在一起】据说经过这个测试的情侣不会分手；你和她将是最完美的酸碱中和，永远在一起。...,0.0,0.0,Who do you suit to be with It is said that the...,"[base neutralization, perfect acid, base, acid..."
31904,5365515981,female,2014-11-13 18:32:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still...","[dear, lace, confirm, passion and lace, confir..."
31905,5365515981,female,2014-11-13 11:19:00,在这个基情四射，蕾丝满天的时代里，你还能100%确认你的性向么？做个测试看看再拍拖吧，以免害...,0.0,0.0,"In this age of passion and lace, can you still...","[dear, lace, confirm, passion and lace, confir..."


- Save the dataset in a CSV file:

In [12]:
microblog.to_csv('../Data/CSV/microblogPCU.csv', columns=['Id', 'Date', 'Gender', 'Keywords', 'Nb comments', 'Nb reposts'], index=False, sep=';')
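
Because the `Keywords` column holds Python lists, `to_csv` writes each list as its string representation. When the file is read back, the column is plain text unless it is parsed again. A possible way to restore the lists, sketched here on an invented one-row DataFrame with an in-memory buffer instead of the real file, is to pass `ast.literal_eval` as a converter.

```python
import ast
import io

import pandas as pd

# Toy frame reusing the notebook's column names.
df = pd.DataFrame({"Id": [1], "Keywords": [["Kangbei Travel", "Kamba Legend"]]})

# Write with the same separator as the notebook, into memory.
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=";")
csv_text = buffer.getvalue()

# Default read: the keywords come back as a single string.
as_text = pd.read_csv(io.StringIO(csv_text), sep=";")

# With a converter: the string is parsed back into a list.
as_list = pd.read_csv(
    io.StringIO(csv_text), sep=";", converters={"Keywords": ast.literal_eval}
)

print(type(as_text.loc[0, "Keywords"]))
print(as_list.loc[0, "Keywords"])
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for this kind of parsing.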