# Data Cleaning and Preparation

In this part, we clean the data to prepare it for sentiment analysis.

Import essential libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
cmts_voo = pd.read_csv('../datasets/reddit_comment_voo.csv')

In [3]:
display(cmts_voo)

|     | author | id | created_utc | permalink | body | score | subreddit |
| --- | ------ | -- | ----------- | --------- | ---- | ----- | --------- |
| 0 | coinslinger88 | kxwqx9i | 2024-04-04 04:28:47 | /r/ETFs/comments/1buybes/sso_vs_spyvoo/kxwqx9i/ | $VOO is for people who hate money | 1 | ETFs |
| 1 | coinslinger88 | kxwr28w | 2024-04-04 04:29:32 | /r/ETFs/comments/1buvulh/just_starting_brokera... | $VOO is for homeless people | 0 | ETFs |
| 2 | Key-Mark4536 | kxsmrdk | 2024-04-03 10:17:45 | /r/ETFs/comments/1buf5zo/voo/kxsmrdk/ | It's not a bad idea, but it wouldn’t add anyth... | 6 | ETFs |
| 3 | thefreewheeler | kxsjef1 | 2024-04-03 09:55:36 | /r/ETFs/comments/1buf5zo/voo/kxsjef1/ | No. VOO and VTI are nearly identical. Either a... | 3 | ETFs |
| 4 | Fun_Grapefruit_3416 | kxwl2i1 | 2024-04-04 03:57:51 | /r/ETFs/comments/1buf5zo/voo/kxwl2i1/ | No. You already have VOO within VTI and the ta... | 3 | ETFs |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | Speedbot_3000 | kwwcf3u | 2024-03-28 10:22:57 | /r/ETFs/comments/1bo2cac/voo_or_qqq/kwwcf3u/ | I too have that as well except for IBIT and I'... | 2 | ETFs |
| 222 | iIiiiiIlIillliIilliI | kwmcam7 | 2024-03-26 18:03:15 | /r/ETFs/comments/1bo2cac/voo_or_qqq/kwmcam7/ | If I am 34? (What age is the cutoff where VOO ... | 2 | ETFs |
| 223 | noctilucus | kws1br2 | 2024-03-27 18:35:10 | /r/ETFs/comments/1bo2cac/voo_or_qqq/kws1br2/ | I'd also go for a mix of both, it's not like O... | 1 | ETFs |
| 224 | Decent-Bed9289 | kwwq0dh | 2024-03-28 12:05:51 | /r/ETFs/comments/1bo2cac/voo_or_qqq/kwwq0dh/ | Yeah there’s a bit of overlap, but some overla... | 1 | ETFs |


The table below lists the attribute definitions from the ```PRAW 7.7.1``` documentation.

| Attribute        | Description                                                                                    |
| ---------------- |------------------------------------------------------------------------------------------------|
| ```author```     | Provides an instance of Redditor.                                                              |
| ```id```         | The ID of the comment.                                                                         | 
| ```created_utc```| Time the comment was created, represented in Unix Time.                                        |
| ```permalink```  | A permalink for the comment. Comment objects from the inbox have a context attribute instead.  |
| ```body```       | The body of the comment, as Markdown.                                                          |
| ```score```      | The number of upvotes for the comment.                                                         |
| ```subreddit```  | Provides an instance of Subreddit. The subreddit that the comment belongs to.                  |
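
The definition above describes ```created_utc``` as Unix Time, while the displayed data already shows readable timestamps. A minimal sketch of how such a conversion might be done with ```pd.to_datetime``` follows; the toy frame, its ids, and its epoch values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values: seconds since the Unix epoch, as a float or int.
raw = pd.DataFrame({
    "id": ["a1", "b2"],                       # made-up comment ids
    "created_utc": [1712204927, 1712205000],  # made-up epoch seconds
})

# unit="s" tells pandas the numbers are seconds since 1970-01-01 00:00:00 UTC.
raw["created_utc"] = pd.to_datetime(raw["created_utc"], unit="s")

# Render the timestamps in a "YYYY-MM-DD HH:MM:SS" layout.
formatted = raw["created_utc"].dt.strftime("%Y-%m-%d %H:%M:%S").tolist()
print(formatted)
```

The result of ```pd.to_datetime(..., unit="s")``` is timezone-naive but corresponds to UTC; whether the timestamps in the CSV are UTC or local time is not stated here.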

In [4]:
print("Data type: ", type(cmts_voo))
print("Dims: ", cmts_voo.shape)

Data type:  <class 'pandas.core.frame.DataFrame'>
Dims:  (226, 7)


Here, we check for any null values.

In [5]:
cmts_voo.isnull().sum()

author         0
id             0
created_utc    0
permalink      0
body           0
score          0
subreddit      0
dtype: int64

The ```isnull().sum()``` output reports no null values in any column. As a safeguard, in case the data is collected again and some comments lack an author, we still fill any missing values in the ```author``` column with 'Unknown'.

In [6]:
# The dict key must match the column name exactly (labels are case-sensitive).
cmts_voo.fillna({'author': "Unknown"}, inplace=True)
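
Column labels in pandas are case-sensitive, and ```fillna``` skips dict keys that do not match any column rather than raising an error. The sketch below illustrates this on a hypothetical toy frame (the names ```toy```, ```wrong_key``` and ```right_key``` are made up for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing author.
toy = pd.DataFrame({
    "author": ["alice", np.nan],
    "body": ["some text", "more text"],
})

# A key that matches no column is silently ignored: nothing is filled.
wrong_key = toy.fillna({"AUTHOR": "Unknown"})

# A key that matches the column label fills the missing value.
right_key = toy.fillna({"author": "Unknown"})

print("missing after wrong key:", int(wrong_key["author"].isna().sum()))
print("missing after right key:", int(right_key["author"].isna().sum()))
```

Re-running ```isnull().sum()``` after the fill is a cheap way to confirm that it had the intended effect.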

In [7]:
cmts_voo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       226 non-null    object
 1   id           226 non-null    object
 2   created_utc  226 non-null    object
 3   permalink    226 non-null    object
 4   body         226 non-null    object
 5   score        226 non-null    int64 
 6   subreddit    226 non-null    object
dtypes: int64(1), object(6)
memory usage: 12.5+ KB
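
A non-null count only covers true missing values. Text columns can still contain empty strings, placeholder text, or repeated rows. The following is a rough sketch of extra checks on a hypothetical toy frame; the placeholder strings ```[deleted]``` and ```[removed]``` are an assumption about how deleted or removed comments may appear in scraped Reddit data, and nothing here indicates that they occur in this dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "id": ["c1", "c2", "c2", "c3", "c4"],
    "body": ["VOO and chill", "[deleted]", "[deleted]", "   ", "[removed]"],
})

# Assumed placeholder strings for deleted or removed comments.
placeholders = {"[deleted]", "[removed]"}

# Treat whitespace-only bodies and placeholders as empty content.
stripped = toy["body"].str.strip()
is_empty = stripped.eq("") | stripped.isin(placeholders)

n_null = int(toy["body"].isnull().sum())      # what isnull() reports
n_empty = int(is_empty.sum())                 # what isnull() misses
n_dup = int(toy["id"].duplicated().sum())     # repeated comment ids

# Keep rows with real content, one row per comment id.
cleaned = toy[~is_empty].drop_duplicates(subset="id")

print(n_null, n_empty, n_dup, len(cleaned))
```

Whether such rows should be dropped or kept depends on the sentiment analysis that follows; the sketch only shows how to detect them.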


### Save the Cleaned DataFrame as a CSV File for Sentiment Analysis

In [8]:
import os

folder_path = '../datasets'

file_path = os.path.join(folder_path, 'cleaned_cmts_voo.csv')

cmts_voo.to_csv(file_path, index=False)
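
After saving, it can be worth reading the file back to confirm that it round-trips. The sketch below does this with a hypothetical toy frame and a temporary directory so that it does not touch ```../datasets```; the file name ```cleaned_toy.csv``` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
toy = pd.DataFrame({"author": ["alice", "Unknown"], "score": [3, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "cleaned_toy.csv")

    # index=False keeps the row index out of the file.
    toy.to_csv(file_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(file_path)

# Compare column names, shape, and values with the original frame.
round_trip_ok = (
    list(reloaded.columns) == list(toy.columns)
    and reloaded.shape == toy.shape
    and reloaded["author"].tolist() == toy["author"].tolist()
    and reloaded["score"].tolist() == toy["score"].tolist()
)
print(round_trip_ok)
```

Without ```index=False```, the index is written as an unnamed first column, which ```read_csv``` then loads as ```Unnamed: 0```.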