# Data Cleaning and Preparation

In this part, we clean the data to prepare it for sentiment analysis.

First, import the essential libraries.

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [46]:
cmts_voo = pd.read_csv('../datasets/reddit_comment_voo.csv')

In [47]:
display(cmts_voo)

Unnamed: 0,author,id,created_utc,permalink,body,score,subreddit
0,lotterytix,kwh3sji,2024-03-25 20:10:23,/r/ETFs/comments/1bmqbxg/new_to_investing_is_v...,Maybe consider VOO and a mid/small cap value f...,1,ETFs
1,AlgoTradingQuant,kwczgum,2024-03-25 00:51:21,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,I’m retired and hold a 100% equities portfolio...,8,ETFs
2,foldinthechhese,kwdbk25,2024-03-25 02:02:08,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,The more experienced investors recommend a ble...,5,ETFs
3,SirChetManly,kwd6nto,2024-03-25 01:33:43,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,It isn't *risky* by any stretch. You're exclud...,2,ETFs
4,ZAROV8862,kwei3zo,2024-03-25 06:17:54,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,Enough said :)),2,ETFs
...,...,...,...,...,...,...,...
927,Financial_Pickle_987,kvqcowi,2024-03-20 21:48:06,/r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery...,"Lots of downs, lots of ups, but average is aro...",2,ETFs
928,platskol,kvaljn3,2024-03-17 23:49:51,/r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery...,That is a Reddit thing. As soon as people say ...,8,ETFs
929,phillip_jay,kv9pe1j,2024-03-17 20:01:30,/r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery...,Did you read it?,4,ETFs
930,Rand-Seagull96734,kvhyeh5,2024-03-19 06:54:48,/r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery...,"Let's say you decided to invest some ""play mon...",1,ETFs


The attribute definitions below are taken from the ```PRAW 7.7.1``` documentation.

| Attribute        | Description                                                                                    |
| ---------------- |------------------------------------------------------------------------------------------------|
| ```author```     | Provides an instance of Redditor.                                                              |
| ```id```         | The ID of the comment.                                                                         | 
| ```created_utc```| Time the comment was created, represented in Unix Time.                                        |
| ```permalink```  | A permalink for the comment. Comment objects from the inbox have a context attribute instead.  |
| ```body```       | The body of the comment, as Markdown.                                                          |
| ```score```      | The number of upvotes for the comment.                                                         |
| ```subreddit```  | Provides an instance of Subreddit. The subreddit that the comment belongs to.                  |
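
Although PRAW describes ```created_utc``` as Unix time, the values in this CSV are already formatted as date-time strings, and the column loads with the ```object``` dtype. If later steps need time-based operations, a conversion along the following lines might help. This is only a minimal sketch on a small hypothetical frame (```sample``` and ```from_unix``` are illustrative names), not part of the original pipeline.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the string format seen in the CSV.
sample = pd.DataFrame({
    "created_utc": ["2024-03-25 20:10:23", "2024-03-25 00:51:21"],
})

# Parse the strings into datetime values; an explicit format avoids guessing.
sample["created_utc"] = pd.to_datetime(
    sample["created_utc"], format="%Y-%m-%d %H:%M:%S"
)

# If the column held raw Unix timestamps in seconds, unit="s" would apply instead.
from_unix = pd.to_datetime(1711397423, unit="s")

print(sample.dtypes)
print(from_unix)
```

With a real datetime column, accessors such as ```sample["created_utc"].dt.date``` become available for grouping comments by day.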

In [48]:
print("Data type: ", type(cmts_voo))
print("Dims: ", cmts_voo.shape)

Data type:  <class 'pandas.core.frame.DataFrame'>
Dims:  (932, 7)


### Cleaning the Headers

In [49]:
cmts_voo.columns = cmts_voo.columns.str.upper()
cmts_voo.head()

Unnamed: 0,AUTHOR,ID,CREATED_UTC,PERMALINK,BODY,SCORE,SUBREDDIT
0,lotterytix,kwh3sji,2024-03-25 20:10:23,/r/ETFs/comments/1bmqbxg/new_to_investing_is_v...,Maybe consider VOO and a mid/small cap value f...,1,ETFs
1,AlgoTradingQuant,kwczgum,2024-03-25 00:51:21,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,I’m retired and hold a 100% equities portfolio...,8,ETFs
2,foldinthechhese,kwdbk25,2024-03-25 02:02:08,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,The more experienced investors recommend a ble...,5,ETFs
3,SirChetManly,kwd6nto,2024-03-25 01:33:43,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,It isn't *risky* by any stretch. You're exclud...,2,ETFs
4,ZAROV8862,kwei3zo,2024-03-25 06:17:54,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,Enough said :)),2,ETFs
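
The header cleanup above only changes the case. As a sketch of a slightly more defensive variant (an assumption, not something this dataset is known to need), stray whitespace could be stripped in the same step. The frame ```demo``` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with untidy headers, for illustration only.
demo = pd.DataFrame({" author ": ["user_a"], "Score": [1]})

# Chain vectorized string methods: trim whitespace first, then upper-case.
demo.columns = demo.columns.str.strip().str.upper()

print(list(demo.columns))
```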


Here, we check for any null values.

In [50]:
cmts_voo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 932 entries, 0 to 931
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   AUTHOR       918 non-null    object
 1   ID           932 non-null    object
 2   CREATED_UTC  932 non-null    object
 3   PERMALINK    932 non-null    object
 4   BODY         932 non-null    object
 5   SCORE        932 non-null    int64 
 6   SUBREDDIT    932 non-null    object
dtypes: int64(1), object(6)
memory usage: 51.1+ KB


We can see that there are 14 null values in the ```AUTHOR``` column (918 of 932 entries are non-null). We handle these null values by filling them with 'Unknown'.

In [51]:
cmts_voo.fillna({'AUTHOR': "Unknown"}, inplace=True)
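
The null check and fill can also be sketched on a small hypothetical frame, using ```isna().sum()``` for a direct per-column count and assigning the result of ```fillna``` back instead of using ```inplace=True```. The names ```sample```, ```nulls_before```, and ```nulls_after``` are illustrative, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the second author is missing.
sample = pd.DataFrame({
    "AUTHOR": ["user_a", np.nan, "user_c"],
    "SCORE": [1, 8, 2],
})

# Count missing values per column before filling.
nulls_before = sample.isna().sum()

# Fill only the AUTHOR column and assign the new frame back.
sample = sample.fillna({"AUTHOR": "Unknown"})

# Count again to confirm nothing is missing any more.
nulls_after = sample.isna().sum()

print(nulls_before["AUTHOR"], nulls_after["AUTHOR"])
```

Passing a dictionary to ```fillna``` restricts the fill to the named columns, so a missing value in any other column would be left untouched.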

In [52]:
cmts_voo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 932 entries, 0 to 931
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   AUTHOR       932 non-null    object
 1   ID           932 non-null    object
 2   CREATED_UTC  932 non-null    object
 3   PERMALINK    932 non-null    object
 4   BODY         932 non-null    object
 5   SCORE        932 non-null    int64 
 6   SUBREDDIT    932 non-null    object
dtypes: int64(1), object(6)
memory usage: 51.1+ KB


### Save the Cleaned DataFrame as a CSV File for Sentiment Analysis

In [53]:
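import os

# Assumption: the cleaned file belongs next to the raw CSV, which was read from '../datasets'.
folder_path = '../datasets'
os.makedirs(folder_path, exist_ok=True)  # create the folder if it is missing

file_path = os.path.join(folder_path, 'cleaned_cmts_voo.csv')

# index=False keeps the row index out of the saved file.
cmts_voo.to_csv(file_path, index=False)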
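
Before moving on to sentiment analysis, it may be worth confirming that the saved file reads back as expected. The following is a minimal sketch of such a round-trip check; it writes a hypothetical stand-in frame (```cleaned```) to a temporary directory rather than touching the real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned comments frame.
cleaned = pd.DataFrame({"AUTHOR": ["Unknown", "user_a"], "SCORE": [2, 5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_sample.csv")

    # index=False stops the row index from being written as an extra column.
    cleaned.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded["AUTHOR"].isna().sum())
```

If the index were written as well, reading the file back would produce an extra ```Unnamed: 0``` column, which is why ```index=False``` is used when saving.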