# Overview

We'll add more politics content to the dataset:

- [Kaggle: Politics on Reddit](https://www.kaggle.com/datasets/gpreda/politics-on-reddit)


# Import Required Python Components

The overall project uses TensorFlow to develop the model, but this notebook only prepares data, so the TensorFlow imports are left commented out. We also need pandas and a few other Python libraries to work with our CSV.

In [1]:
import re
import pandas as pd
import csv
import numpy as np
# import tensorflow as tf
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

# Used for Troubleshooting
from IPython.display import display

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Set Hyperparameters

This section sets the parameters used throughout the notebook. For now that is only the path to the source CSV file.

In [2]:
# The file that contains the data.
FILE_MESSAGES = "./data/sources/politics-reddit.csv"

# Read the CSV data

Read the CSV contents into a DataFrame. All columns are loaded here; the preprocessing steps below keep only the fields we need.

In [3]:
# Open file and save to dataframe.
df = pd.read_csv(FILE_MESSAGES)

#print(df.columns)
display(df)

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,A Right Wing Group in Texas Is Making up Fake ...,166,ov1ll3,https://www.vice.com/en/article/wx5bg5/blm-whi...,34,1.627710e+09,,2021-07-31 08:35:47
1,DOJ sues Texas over Gov. Abbott’s order for la...,85,ouwc9i,https://www.kxan.com/news/texas-politics/doj-s...,17,1.627688e+09,,2021-07-31 02:26:12
2,"From white evangelicals to QAnon believers, wh...",57,ouqkxi,https://www.modbee.com/news/coronavirus/articl...,27,1.627671e+09,,2021-07-30 21:45:09
3,DeSantis says he’ll sign order allowing parent...,269,oun2lc,https://www.orlandosentinel.com/politics/os-ne...,138,1.627660e+09,,2021-07-30 18:43:05
4,"Show on the road: In Utah, Florida Gov. Ron De...",31,ouipnz,https://www.tallahassee.com/story/news/politic...,28,1.627644e+09,,2021-07-30 14:21:54
...,...,...,...,...,...,...,...,...
28058,Comment,1,hociwir,,0,1.639375e+09,lil'wayne got a pardon and not them ah ah,2021-12-13 07:48:46
28059,Comment,1,hociv7d,,0,1.639375e+09,So you think it will be called unconstitutiona...,2021-12-13 07:48:25
28060,Comment,1,hociupn,,0,1.639374e+09,The left of America has out numbered the right...,2021-12-13 07:48:16
28061,Comment,1,hociuet,,0,1.639374e+09,Everyone spread the word…I just set fire on water,2021-12-13 07:48:10
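
Since only the body column is used later in this notebook, the load could also be restricted to that column up front. The following is a minimal sketch, assuming the `usecols` parameter of `pd.read_csv` and a small hypothetical in-memory sample (`SAMPLE_CSV`) standing in for the real file:

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the column layout of the source CSV.
SAMPLE_CSV = io.StringIO(
    "title,score,id,url,comms_num,created,body,timestamp\n"
    "Some headline,166,abc123,https://example.com,34,1627710000.0,,2021-07-31 08:35:47\n"
    "Comment,1,def456,,0,1639375000.0,First comment text,2021-12-13 07:48:46\n"
    "Comment,1,ghi789,,0,1639374000.0,Second comment text,2021-12-13 07:48:16\n"
)

# usecols tells pandas to parse only the listed columns.
sample_df = pd.read_csv(SAMPLE_CSV, usecols=["body"])

print(sample_df.columns.tolist())
print(len(sample_df))
```

Loading every column, as the cell above does, is still useful while exploring the data. Restricting columns mainly helps once the required fields are known.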


# Preprocess Data

As part of the machine learning process, we remove fields that are not required, drop rows with missing values, remove noisy data, and apply any other steps needed to prepare the data for training.
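
This notebook does not show a noise-removal step, so the following is only a sketch of what one might look like. It assumes that "noise" means embedded URLs and irregular whitespace; `clean_message` and both patterns are hypothetical names, not part of the original code.

```python
import re

# Assumption: URLs and runs of whitespace count as noise in a message.
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_message(text: str) -> str:
    """Remove URLs, then collapse repeated whitespace into single spaces."""
    without_urls = URL_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", without_urls).strip()


print(clean_message("Read this:   https://example.com/story \n now"))
# → Read this: now
```

A function like this could be applied to the message column with `df["body"].map(clean_message)` after missing values have been removed.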

## Remove Empty Messages

In [4]:
# Remove rows that have empty data (missing values) in the specified column
df = df[df["body"].notna()]
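
The `notna()` filter removes missing values only. A body that contains nothing but spaces is not missing and would pass through. A minimal sketch of an extra check, assuming such whitespace-only messages should also be dropped (the sample data is hypothetical):

```python
import pandas as pd

# Hypothetical sample: one real comment, one missing value, two blank strings.
sample_df = pd.DataFrame({"body": ["A real comment", None, "   ", ""]})

# Step 1: the same filter as the cell above, which drops missing values.
non_missing = sample_df[sample_df["body"].notna()]

# Step 2 (assumption): also drop strings that are empty after stripping whitespace.
non_blank = non_missing[non_missing["body"].str.strip() != ""]

print(len(non_missing))
print(len(non_blank))
```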

## Keep Labels and Messages

We will keep only the columns that are important to the model. At this point that is just the body column, which holds the comment text.

In [5]:
# Keep specific columns.
df = df[["body"]]

print(df.columns)
print(f'Total number of rows: {len(df)}')

Index(['body'], dtype='object')
Total number of rows: 18068


## Add Reason Column

The two features we want to keep are:

- reason
- singleMessage

The steps below reshape the DataFrame to match that layout.

In [6]:
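# Add a new column called 'reason' with the same value for every row.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas can raise when a column is set on a filtered DataFrame.
df = df.assign(reason="Politics not allowed outside of references to the market.")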

## Rename Column Names

In [7]:
df = df.rename(columns={"body": "singleMessage"})

## Change Order of Columns

In [8]:
df = df[["reason", "singleMessage"]]
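
The three formatting steps (add the reason column, rename, reorder) can also be written as one method chain. This is a sketch on a small hypothetical sample, not a replacement for the cells above; `REASON`, `sample_df`, and `formatted` are illustrative names.

```python
import pandas as pd

REASON = "Politics not allowed outside of references to the market."

# Hypothetical sample standing in for the filtered DataFrame.
sample_df = pd.DataFrame({"body": ["First comment", "Second comment"]})

formatted = (
    sample_df
    .assign(reason=REASON)                      # add the constant reason column
    .rename(columns={"body": "singleMessage"})  # rename the message column
    .loc[:, ["reason", "singleMessage"]]        # put the columns in the target order
)

print(formatted.columns.tolist())
```

Each method returns a new DataFrame, so `sample_df` itself is left unchanged.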

# Review Data Results

In [9]:
display(df)

Unnamed: 0,reason,singleMessage
34,Politics not allowed outside of references to ...,I had the same reasoning when I watch fox news...
35,Politics not allowed outside of references to ...,Unethical fucks will always find a loophole.
36,Politics not allowed outside of references to ...,Failed actual coup.
37,Politics not allowed outside of references to ...,Why is trump even in the news anymore?
38,Politics not allowed outside of references to ...,And it could be my head in a basket...
...,...,...
28058,Politics not allowed outside of references to ...,lil'wayne got a pardon and not them ah ah
28059,Politics not allowed outside of references to ...,So you think it will be called unconstitutiona...
28060,Politics not allowed outside of references to ...,The left of America has out numbered the right...
28061,Politics not allowed outside of references to ...,Everyone spread the word…I just set fire on water


# 🚧 Save Data to Disk

Let's save all our hard work formatting the dataframe to a CSV for future reference.

- [Pandas DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [10]:
df.to_csv('data/output/2-dataset-politics.csv', index=False)
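
`to_csv` does not create missing directories, so the call above fails if the output folder does not exist yet. A minimal sketch that creates the folder first and reads the file back as a sanity check; it writes to a temporary directory and uses a hypothetical sample and file name so it can run anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample in the final two-column layout.
sample_df = pd.DataFrame(
    {"reason": ["Example reason", "Example reason"],
     "singleMessage": ["First comment", "Second comment"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "data" / "output" / "sample.csv"

    # Create the output folder, including any missing parents.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, as in the cell above.
    sample_df.to_csv(output_path, index=False)

    # Read the file back to confirm the columns and rows survived the round trip.
    reloaded = pd.read_csv(output_path)

print(reloaded.shape)
```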