<h1>Data Cleaning</h1>

<p>This is the second step of the NLP pipeline. For details on the first step (data collection), look under "scrapers/" to find the script that scrapes Reuters for company news data based on the stock ticker.</p>

<h2>Scraped Data</h2>

<p>The scraped data comprises:</p>
<ul>
    <li>Dates</li>
    <li>Headlines</li>
    <li>First sentences of articles</li>
</ul>

<p>To start off, let's take a look at the raw dataset.</p>

In [1]:
import pandas as pd

In [2]:
# Note that although I'm using AAPL here, the Reuters scraper I wrote has been abstracted to scrape data for any ticker
original_data = pd.read_csv("../dataset/AAPL_news.csv")

In [3]:
original_data

Unnamed: 0,Date,Headline,Sentence
0,"JULY 23, 2020",U.S. Congressional hearing to question tech gi...,A U.S. congressional hearing scheduled for nex...
1,"JULY 23, 2020",UPDATE 1-U.S. Congressional hearing to questio...,A U.S. congressional hearing scheduled for nex...
2,"JULY 23, 2020",California appeals court rejects Apple TV cons...,A California appeals court has declined a bid ...
3,"JULY 23, 2020",Apple faces deceptive trade practices probe by...,Multiple U.S. states are investigating Apple I...
4,"JULY 23, 2020",Apple faces consumer protection probe by multi...,Multiple U.S. states are investigating Apple I...
...,...,...,...
678,"APRIL 15, 2019",Huawei says not discussed 5G chipsets with App...,China's Huawei Technologies said on Tuesday it...
679,"APRIL 15, 2019","Apple, allies seek billions in U.S. trial test...",Apple Inc and its allies on Monday will kick o...
680,"APRIL 12, 2019",Apple hit with East Texas patent case over iPh...,A New York-based patent monetization company o...
681,"APRIL 12, 2019",Chinese group to get control of Japan Display ...,A Chinese-Taiwanese group will take control of...


<h2>Null values</h2>

<p>A common issue in the data cleaning process is dealing with null values. Let's see how our original dataset fares in this respect.</p>

In [4]:
original_data.isnull().any()

Date        False
Headline    False
Sentence    False
dtype: bool

<p>As we can see, there aren't any null values in the original dataset, which suggests that our scraper performed relatively well. That said, null values tend to be a bigger concern with numeric data. For text, a clean null check says little about the quality of the strings themselves, so we will soon have to apply text pre-processing techniques to this dataset.</p>
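<p>A null check does not catch strings that are present but empty or whitespace-only. The sketch below shows one way to count both cases side by side. The <code>sample</code> frame and its values are made up for illustration and are not taken from the scraped dataset.</p>

```python
import pandas as pd

# Illustrative frame, NOT the real dataset: one null and two blank strings
sample = pd.DataFrame({
    "Headline": ["Apple faces probe", "   ", None],
    "Sentence": ["Multiple U.S. states are investigating Apple", "", "Some text"],
})

# Count true nulls per column
null_counts = sample.isnull().sum()

# Count strings that are empty once surrounding whitespace is stripped;
# nulls compare unequal to "" and are therefore not counted here
blank_counts = sample.apply(lambda col: (col.str.strip() == "").sum())

print(null_counts)
print(blank_counts)
```

<p>On the real data, the same two counts could be computed by replacing <code>sample</code> with the text columns of <code>original_data</code>.</p>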

<h2>Date Format</h2>

<p>Let's convert the "Date" column into pandas Timestamp values (the pandas equivalent of Python's datetime objects) and take a look at the results.</p>

In [5]:
df = original_data.copy() # making a copy of the original DataFrame to work with
df["Date"] = pd.to_datetime(df["Date"])

df.head()

Unnamed: 0,Date,Headline,Sentence
0,2020-07-23,U.S. Congressional hearing to question tech gi...,A U.S. congressional hearing scheduled for nex...
1,2020-07-23,UPDATE 1-U.S. Congressional hearing to questio...,A U.S. congressional hearing scheduled for nex...
2,2020-07-23,California appeals court rejects Apple TV cons...,A California appeals court has declined a bid ...
3,2020-07-23,Apple faces deceptive trade practices probe by...,Multiple U.S. states are investigating Apple I...
4,2020-07-23,Apple faces consumer protection probe by multi...,Multiple U.S. states are investigating Apple I...


<p>That's pretty much everything we have to do to the "Date" column as far as cleaning it up is concerned.</p>
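<p>If every scraped date follows the "MONTH DD, YYYY" pattern seen in the sample rows (an assumption, since only a few rows are shown), a slightly more defensive variant is to pass the format explicitly and coerce failures to <code>NaT</code> so they can be counted. The <code>raw_dates</code> series below is illustrative only.</p>

```python
import pandas as pd

# Illustrative values: two in the format seen in the dataset, one deliberately malformed
raw_dates = pd.Series(["JULY 23, 2020", "APRIL 15, 2019", "not a date"])

# Assumes English month names (locale-dependent %B).
# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

# Number of rows that could not be parsed
n_unparsed = int(parsed.isna().sum())

print(parsed)
print("unparsed rows:", n_unparsed)
```

<p>A non-zero count of unparsed rows would point to dates the scraper captured in a different layout.</p>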

<h2>Common Functions to Clean Text Data</h2>

<p>There are several ways to clean and process text data before feeding it into a model. The exact functions required vary across use cases. Here are some of the processing steps we need to employ for this project:</p>

<ul>
    <li>Converting all text to lowercase</li>
    <li>Getting rid of special characters and punctuation (', ,, :, %, # etc.)</li>
    <li>Getting rid of stop words (words that add little to no meaning to the sentence on their own - a, an, the etc.)</li>
    <li>Getting rid of text between angle brackets</li>
    <li>Replacing occurrences of "U.S" with "united states"</li>
    <li>Replacing occurrences of "EU" with "europe"</li>
    <li>Tokenization - Splitting the text into its constituent words</li>
    <li>Lemmatization - Reducing inflected word forms to a common base form (e.g. "risen" to "rise")</li>
    <li>And potentially more</li>
</ul>