# Data Cleaning
In this notebook, I lay out the steps I took to read the full Leiden Weibo Corpus CSV file into a pandas DataFrame and clean it: selecting the relevant columns, stripping extraneous characters, removing rows that `pandas.read_csv` parsed incorrectly, converting column data types, and writing subsets of the data to CSV files for later analysis.

In [2]:
# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set(style="darkgrid")

In [5]:
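# assign placeholder column names (col1 ... col9)
col_names = ["col" + str(i) for i in range(1, 10)]
# load the full data set and convert the CSV null marker \N to NaN in the pandas dataframe
# (raw string so the backslash in the Windows path is not treated as an escape)
AllData = pd.read_csv(r'lwc_data\parsed_messages.csv', quotechar='"', names=col_names, na_values='\\N')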
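
As a small check of what `na_values='\\N'` does, here is a minimal sketch on an in-memory toy CSV. The two rows and the column names `id`, `province`, and `gender` are made up for illustration and are not taken from the corpus.

```python
import io
import pandas as pd

# Toy CSV (hypothetical rows); the literal text \N marks a missing value.
raw = "1,\\N,m\n2,44,f\n"

# Same idea as the real load: explicit column names plus \N treated as NaN.
toy = pd.read_csv(io.StringIO(raw), names=["id", "province", "gender"], na_values="\\N")

# True where the cell was \N in the raw text.
missing = toy["province"].isna().tolist()
```

Without `na_values`, the `\N` cell would be kept as a string and the column could not be treated as numeric.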

In [6]:
AllData.head()

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7,col8,col9
0,"""3399658666084059""","""44""","""1""","""m""","""soulleoo""","""一分耕耘，一分收获！//@soulleoo 的小小乐园的最具吸引力乐园为5031，超过@败...","""25""","""一 分 耕耘 ， 一 分 收获 ！ 的 小小 乐园 的 最 具 吸引力 乐园 为 5031...","""一/CD 分/M 耕耘/VV ，/PU 一/CD 分/M 收获/NN ！/PU 的/DEG..."
1,"""3399658666083922""","""35""","""1""","""m""","""Tony_Heqi""","""太给力了，是不是？//@Tony_Heqi 在达人麻将 获得了成就Good Job!，获取...","""36""","""太 给力 了 ， 是 不 是 ？ 在 达人 麻将 获得 了 成就 Good Job ， 获...","""太/AD 给力/VV 了/AS ，/PU 是/VC 不/AD 是/VC ？/PU 在/P ..."
2,"""3399658665799759""","""44""","""8""","""f""","""謝秋如""","""3399658665799719""",,,
3,"""3399658665799554""","""44""","""3""","""m""","""yinjiawei""","""3399658665799504""",,,
4,"""3399658665575454""","""31""","""1000""","""m""","""Wings_of_Winds""","""为什么每次不在中国都会在他们家？@田鸡W @-RJ-爱喝乙醇的甲醇""","""12""","""为什么 每 次 不 在 中国 都 会 在 他们 家 ？""","""为什么/AD 每/DT 次/M 不/AD 在/P 中国/NR 都/AD 会/VV 在/P ..."


The column names assigned below are those documented for the full data set on the Leiden Weibo Corpus [website](http://lwc.daanvanesch.nl/openaccess.php).

In [8]:
# rename columns
AllData.columns = ['MessageID', 'Province', 'City', 'Gender', 'Username', 'Message', 'NumWords', 'WithWordBounds', 'POStags']

In [9]:
AllData.head()

Unnamed: 0,MessageID,Province,City,Gender,Username,Message,NumWords,WithWordBounds,POStags
0,"""3399658666084059""","""44""","""1""","""m""","""soulleoo""","""一分耕耘，一分收获！//@soulleoo 的小小乐园的最具吸引力乐园为5031，超过@败...","""25""","""一 分 耕耘 ， 一 分 收获 ！ 的 小小 乐园 的 最 具 吸引力 乐园 为 5031...","""一/CD 分/M 耕耘/VV ，/PU 一/CD 分/M 收获/NN ！/PU 的/DEG..."
1,"""3399658666083922""","""35""","""1""","""m""","""Tony_Heqi""","""太给力了，是不是？//@Tony_Heqi 在达人麻将 获得了成就Good Job!，获取...","""36""","""太 给力 了 ， 是 不 是 ？ 在 达人 麻将 获得 了 成就 Good Job ， 获...","""太/AD 给力/VV 了/AS ，/PU 是/VC 不/AD 是/VC ？/PU 在/P ..."
2,"""3399658665799759""","""44""","""8""","""f""","""謝秋如""","""3399658665799719""",,,
3,"""3399658665799554""","""44""","""3""","""m""","""yinjiawei""","""3399658665799504""",,,
4,"""3399658665575454""","""31""","""1000""","""m""","""Wings_of_Winds""","""为什么每次不在中国都会在他们家？@田鸡W @-RJ-爱喝乙醇的甲醇""","""12""","""为什么 每 次 不 在 中国 都 会 在 他们 家 ？""","""为什么/AD 每/DT 次/M 不/AD 在/P 中国/NR 都/AD 会/VV 在/P ..."


In [11]:
AllData.tail()

Unnamed: 0,MessageID,Province,City,Gender,Username,Message,NumWords,WithWordBounds,POStags
5103591,"""3407331444661006""","""32""","""1""","""f""","""damondede""","""还是烦，我在烦恼什么却不清楚""",,,
5103592,"""3407331444404506""","""44""","""13""","""m""","""陈乐明_希望如此""","""呜呜！手手！掉扣了！！[泪][泪][衰] 今天吃一个苹果 手就掉扣了！一天一苹果医生接...",,,
5103593,"""3407331444369205""","""11""","""1000""","""m""","""黄硕harry""","""-_","发现出处了，上微博的人不会不知道这个表情的....哈哈哈哈[喝多了]""",,
5103594,"""3407331444369201""","""44""","""7""","""m""","""Long_love""","""分享图片[黑线]""",,,
5103595,"""3407331443121006""","""31""","""1000""","""f""","""极品ET""","""有种说法，25岁是个界限，这以后别轻易换恋人，因为性格已经定型，很难再与他人磨合。在看戏这...",,,


## Various Cleaning Endeavors
Some entries in the original CSV are strings that contain commas, and pandas appears to split such entries across several columns when reading the file. Row 5103593 in the tail above is an example: its message is split between `Message` and `NumWords`. Any later assumption about the data should take this into account.
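
To illustrate the kind of misparse described here, the following is a minimal sketch on a made-up two-line CSV. It is not the actual corpus format, and the column names, including `extra`, are hypothetical. An unquoted comma inside a message pushes part of the text into the next column.

```python
import io
import pandas as pd

# Toy data: the second message contains an unquoted comma.
raw = "1,m,hello world\n2,f,hello, world\n"

# One spare column ("extra") so the over-long row can be read without an error.
toy = pd.read_csv(io.StringIO(raw), names=["id", "gender", "message", "extra"])

# Rows where the spare column is filled are the ones that were split.
split_rows = toy[toy["extra"].notna()]
```

Here `split_rows` holds only the second row, whose `message` is cut off at the comma. On the real data, the analogous symptom is a value landing in a column where it does not belong.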

### Subsetting the Data
Going forward, only the message and demographic data matter, so I select the following columns.

In [14]:
# subset message and demographic data
AllMessages = AllData[['Province', 'City', 'Gender', 'Message']]

In [15]:
AllMessages.head()

Unnamed: 0,Province,City,Gender,Message
0,"""44""","""1""","""m""","""一分耕耘，一分收获！//@soulleoo 的小小乐园的最具吸引力乐园为5031，超过@败..."
1,"""35""","""1""","""m""","""太给力了，是不是？//@Tony_Heqi 在达人麻将 获得了成就Good Job!，获取..."
2,"""44""","""8""","""f""","""3399658665799719"""
3,"""44""","""3""","""m""","""3399658665799504"""
4,"""31""","""1000""","""m""","""为什么每次不在中国都会在他们家？@田鸡W @-RJ-爱喝乙醇的甲醇"""


### Removing Extraneous Characters
In the original CSV file, each entry is enclosed in double quotes (i.e. "content"), and these quotes survive as part of the string values after loading, so I strip them first.

In [22]:
# remove extraneous double quotes 
AllMessages['Province'] = AllMessages['Province'].str.replace("\"", "")
AllMessages['City'] = AllMessages['City'].str.replace("\"", "")
AllMessages['Gender'] = AllMessages['Gender'].str.replace("\"", "")
AllMessages['Message'] = AllMessages['Message'].str.replace("\"", "")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
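
The warning above appears because `AllMessages` was created by selecting columns from `AllData`, so pandas cannot tell whether the assignment is meant to modify the original frame. A minimal sketch of one common way to avoid it follows, assuming an explicit copy is affordable in memory. The toy frame and the name `subset` are illustrative only.

```python
import pandas as pd

# Toy stand-in for AllData, with values still wrapped in double quotes.
toy = pd.DataFrame({
    "Province": ['"44"', '"35"'],
    "Gender": ['"m"', '"f"'],
    "Username": ['"a"', '"b"'],
})

# .copy() makes the subset an independent frame, so later assignments are unambiguous.
subset = toy[["Province", "Gender"]].copy()

# Strip the literal double quotes from every selected column.
for col in subset.columns:
    subset[col] = subset[col].str.replace('"', "", regex=False)
```

The original `toy` frame is left untouched, which is the behavior the warning is asking the author to make explicit.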
  

In [29]:
AllMessages.head()

Unnamed: 0,Province,City,Gender,Message
0,44,1,m,一分耕耘，一分收获！//@soulleoo 的小小乐园的最具吸引力乐园为5031，超过@败家...
1,35,1,m,太给力了，是不是？//@Tony_Heqi 在达人麻将 获得了成就Good Job!，获取速...
2,44,8,f,3399658665799719
3,44,3,m,3399658665799504
4,31,1000,m,为什么每次不在中国都会在他们家？@田鸡W @-RJ-爱喝乙醇的甲醇


### Removing Misparsed Data
Here I find that 128 rows were read incorrectly by pandas's `read_csv` function, such that `Gender` holds a value other than `m` or `f`.

In [43]:
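# positions of rows whose Gender value is neither 'm' nor 'f'
# (assumption: genders is the Gender column as a list; its definition is not shown in the original cell)
genders = AllMessages['Gender'].tolist()
rows_todelete = []

for i in range(len(genders)):
    if genders[i] != 'm' and genders[i] != 'f':
        rows_todelete.append(i)

len(rows_todelete)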

128

The 128 incorrectly coded rows are negligible compared with the 5,103,596 rows read from the CSV file (about 0.0025%), so I remove them to simplify later analysis.
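
The count and the removal can also be expressed without an explicit loop. This is a minimal sketch on toy data, assuming (as above) that any `Gender` value other than `m` or `f` marks a misparsed row. The frame and the names `valid`, `n_bad`, and `cleaned` are illustrative.

```python
import pandas as pd

# Toy stand-in for AllMessages with two bad Gender values ("1000" and a missing value).
toy = pd.DataFrame({
    "Gender": ["m", "f", "1000", "m", None],
    "Message": ["a", "b", "c", "d", "e"],
})

# Boolean mask: True where Gender is a valid code.
valid = toy["Gender"].isin(["m", "f"])

n_bad = int((~valid).sum())   # number of rows that would be dropped
cleaned = toy[valid].copy()   # keep only the valid rows, as an independent frame
```

Dropping rows this way keeps the original index labels, which is why the index of the cleaned corpus still runs up to 5103595.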

In [56]:
AllMessages['Gender']

0          m
1          m
2          f
3          m
4          m
          ..
5103591    f
5103592    m
5103593    m
5103594    m
5103595    f
Name: Gender, Length: 5103596, dtype: object

In [59]:
AllMessages = AllMessages[(AllMessages.Gender == 'm') | (AllMessages.Gender == 'f')]

In [60]:
AllMessages

Unnamed: 0,Province,City,Gender,Message
0,44,1,m,一分耕耘，一分收获！//@soulleoo 的小小乐园的最具吸引力乐园为5031，超过@败家...
1,35,1,m,太给力了，是不是？//@Tony_Heqi 在达人麻将 获得了成就Good Job!，获取速...
2,44,8,f,3399658665799719
3,44,3,m,3399658665799504
4,31,1000,m,为什么每次不在中国都会在他们家？@田鸡W @-RJ-爱喝乙醇的甲醇
...,...,...,...,...
5103591,32,1,f,还是烦，我在烦恼什么却不清楚
5103592,44,13,m,呜呜！手手！掉扣了！！[泪][泪][衰] 今天吃一个苹果 手就掉扣了！一天一苹果医生接近...
5103593,11,1000,m,-_
5103594,44,7,m,分享图片[黑线]


### Changing Data Types
The columns of the data set need the following data types so that SnowNLP can be used to analyze the data.
* `Gender`: binary (Male --> 1; Female --> 0)
* `Province`: integer
* `City`: integer
* `Message`: unicode string
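
The target types listed above can be sketched and checked on a toy frame as follows. The rows are made up, and the final checks are only one possible way to confirm the result.

```python
import pandas as pd

# Toy stand-in for the cleaned corpus; every value starts out as a string.
toy = pd.DataFrame({
    "Province": ["44", "35"],
    "City": ["1", "1000"],
    "Gender": ["m", "f"],
    "Message": ["一分耕耘", "太给力了"],
})

toy["Gender"] = toy["Gender"].map({"m": 1, "f": 0})   # male -> 1, female -> 0
toy["Province"] = pd.to_numeric(toy["Province"])       # raises on non-numeric values
toy["City"] = pd.to_numeric(toy["City"])

# Confirm the three coded columns are integers and every message is a str.
all_int = all(pd.api.types.is_integer_dtype(toy[c]) for c in ["Province", "City", "Gender"])
all_str = all(isinstance(m, str) for m in toy["Message"])
```

In Python 3, `str` is already a Unicode string, so the Chinese text needs no separate conversion.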

In [61]:
# recode Gender as binary where male --> 1 and female --> 0
gender = {'m': 1,'f': 0}

AllMessages['Gender'] = AllMessages['Gender'].map(gender)



In [68]:
# change data type of Province and City from string to integer
# (default errors='raise' fails loudly if a value is not numeric instead of silently leaving strings)
AllMessages['Province'] = pd.to_numeric(AllMessages['Province'])
AllMessages['City'] = pd.to_numeric(AllMessages['City'])



In [70]:
# make sure every Message is a Python str; Python 3 str is already Unicode, so Chinese characters are preserved
AllMessages['Message'] = AllMessages['Message'].astype(str)



## Subsetting Data by Topic
I have now selected the relevant columns, removed the misparsed rows, and applied the correct data types to the full Leiden Weibo Corpus data set. Next, I write subsets of the full data set to separate CSV files, each containing the Weibo messages pertaining to one topic.
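
How a topic is defined is not specified at this point, so the following is only a sketch under the assumption that a topic is identified by a literal keyword in `Message`. The helper `write_topic_subset`, the keyword, and the output file name are all hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the cleaned AllMessages frame.
toy = pd.DataFrame({
    "Province": [44, 32],
    "City": [13, 1],
    "Gender": [1, 0],
    "Message": ["今天吃一个苹果", "还是烦，我在烦恼什么却不清楚"],
})

def write_topic_subset(df, keyword, path):
    """Write the rows whose Message contains keyword to a CSV file; return the row count."""
    mask = df["Message"].str.contains(keyword, regex=False, na=False)
    subset = df[mask]
    subset.to_csv(path, index=False, encoding="utf-8")
    return len(subset)

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "topic_subset.csv")
n_written = write_topic_subset(toy, "苹果", out_path)

# Read the file back to confirm the Chinese text survives the round trip.
round_trip = pd.read_csv(out_path, encoding="utf-8")
```

Writing with `index=False` keeps the row labels out of the file, so reading it back does not produce an extra `Unnamed: 0` column.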

In [80]:
AllMessages.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5103468 entries, 0 to 5103595
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   Province  int64 
 1   City      int64 
 2   Gender    int64 
 3   Message   object
dtypes: int64(3), object(1)
memory usage: 194.7+ MB
