# Overview

This notebook shapes the clean message data into two columns: `reason` and `singleMessage`.

# Import Required Python Components

The project uses TensorFlow to develop the model. This notebook also needs several other Python libraries to work with the CSV.

In [1]:
import re
import pandas as pd
import csv
import numpy as np

# Used for Troubleshooting
from IPython.display import display

# Set Hyperparameters

Set the path to the source CSV. The final output must have its columns in the following order: reason, singleMessage

In [2]:
# The file that contains the data.
FILE_MESSAGES = "../artifacts/data/sources/20230328-clean-data.csv"

# Read the CSV Data

Read the CSV contents and keep only specific fields.

In [3]:
# Open file and save to dataframe.
df = pd.read_csv(FILE_MESSAGES)

#print(df.columns)
display(df)

,id,message
0,7537646,gm
1,7537647,gm
2,7537648,hi
3,7537649,GM
4,7537650,gm
...,...,...
110717,7648753,it has not. i made a nice little chunk on [$P...
110718,7648754,@Kyle. Agreed. Algo scalping is not easy. If I...
110719,7648755,I have not seen it yet. I will look for it. ...
110720,7648756,"@Tanner. Agreed. It's dead, lol"
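If only some fields are needed, pandas can skip the rest at read time through the `usecols` parameter of `read_csv`. The sketch below is a minimal illustration on a small in-memory CSV; the sample rows and the `sample_csv` and `sample_df` names are hypothetical stand-ins, not the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; column names mirror the output above.
sample_csv = io.StringIO(
    "id,message\n"
    "7537646,gm\n"
    "7537647,hi\n"
)

# usecols limits parsing to the listed columns, so 'id' is never loaded.
sample_df = pd.read_csv(sample_csv, usecols=["message"])

print(sample_df.columns.tolist())
```

Selecting columns at read time can reduce memory use on large files, while selecting them later (as this notebook does) keeps the full table available for inspection first.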


# Preprocess Data

As part of the machine learning process, we remove fields that are not required, handle missing values, remove noisy data, and take any additional steps needed to prepare the data for training.

## Remove Empty Messages

In [4]:
# Remove rows that have empty data (missing values) in the specified column
df = df[df["message"].notna()]
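Note that `notna()` removes only true missing values (`None` or `NaN`); empty or whitespace-only strings pass through. The sketch below shows the difference on a toy frame. The extra blank-string filter is an assumption about what "empty" might need to cover, not a step the notebook performs.

```python
import pandas as pd

# Toy data (hypothetical): one missing value, one empty string, one whitespace-only string.
toy = pd.DataFrame({"message": ["gm", None, "", "   ", "hi"]})

# Same filter as the notebook: drops only None/NaN.
no_missing = toy[toy["message"].notna()]

# Optional extra step (assumption): also drop messages that are blank after stripping.
no_blank = no_missing[no_missing["message"].str.strip() != ""]

print(len(toy), len(no_missing), len(no_blank))
```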

## Keep Labels and Messages

We will keep only the columns that are important to the model. For this dataset, that is the `message` column.

In [5]:
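# Keep specific columns. The explicit copy avoids a SettingWithCopyWarning
# when new columns are added to the dataframe later.
df = df[["message"]].copy()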

print(df.columns)
print(f'Total number of rows: {len(df)}')

Index(['message'], dtype='object')
Total number of rows: 110721


## Add Reason Column

The two columns we want in the final dataset are:

- reason
- singleMessage

We need to shape our data to match that layout.

In [6]:
# Add a new column called 'reason' with a single value for all rows
df["reason"] = "Clean"

## Rename Column Names

In [7]:
df = df.rename(columns={"message": "singleMessage"})

## Change Order of Columns

In [8]:
df = df[["reason", "singleMessage"]]
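The add, rename, and reorder steps can also be written as one method chain, which leaves the input frame untouched. This is a sketch on toy data under the same column names; the `toy` and `shaped` names are hypothetical.

```python
import pandas as pd

# Toy stand-in (hypothetical) for the dataframe after loading.
toy = pd.DataFrame({"id": [1, 2], "message": ["gm", "hi"]})

shaped = (
    toy[["message"]]                                # keep only the message column
    .assign(reason="Clean")                         # add a constant label column
    .rename(columns={"message": "singleMessage"})   # rename to the target name
    [["reason", "singleMessage"]]                   # enforce the column order
)

print(shaped)
```

Because `assign` and `rename` return new frames, `toy` keeps its original columns, which can make it easier to re-run cells in any order.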

# Review Data Results

In [9]:
display(df)

,reason,singleMessage
0,Clean,gm
1,Clean,gm
2,Clean,hi
3,Clean,GM
4,Clean,gm
...,...,...
110717,Clean,it has not. i made a nice little chunk on [$P...
110718,Clean,@Kyle. Agreed. Algo scalping is not easy. If I...
110719,Clean,I have not seen it yet. I will look for it. ...
110720,Clean,"@Tanner. Agreed. It's dead, lol"
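Beyond eyeballing the table, a few assertions can confirm that the frame has the expected layout. The helper below is a hypothetical sketch, not part of the notebook; `check_shape` and the toy frame are illustrative names.

```python
import pandas as pd


def check_shape(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if the layout is not as expected."""
    # Columns must be exactly these two, in this order.
    assert list(frame.columns) == ["reason", "singleMessage"], "unexpected columns or order"
    # No missing messages should remain after preprocessing.
    assert frame["singleMessage"].notna().all(), "missing messages remain"
    # Every row in this notebook carries the same label.
    assert (frame["reason"] == "Clean").all(), "unexpected label value"


# Toy frame (hypothetical) in the target layout.
toy = pd.DataFrame({"reason": ["Clean", "Clean"], "singleMessage": ["gm", "hi"]})
check_shape(toy)
```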


# 🚧 Save Data to Disk

Let's save all our hard work formatting the dataframe to a CSV for future reference.

- [Pandas DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [10]:
df.to_csv('data/output/1-preprocess-clean.csv', index=False)
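`to_csv` does not create missing directories, so it can help to create the output folder first and then read the file back as a quick check. The sketch below assumes a temporary directory and a hypothetical `example.csv` file name rather than the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical); the second message contains a comma to exercise CSV quoting.
toy = pd.DataFrame(
    {
        "reason": ["Clean", "Clean"],
        "singleMessage": ["gm", "@Tanner. Agreed. It's dead, lol"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "output" / "example.csv"
    # Create the parent folders if they do not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm the messages survive the round trip.
    restored = pd.read_csv(out_path)

print(restored["singleMessage"].tolist() == toy["singleMessage"].tolist())
```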