<h1>Data Cleaning</h1>

<p>This is the second step of the NLP pipeline. For details on the first step (data collection), look under the "scrapers" folder to find the script that scrapes Reuters for company news data based on the stock ticker.</p>

<h2>Scraped Data</h2>

<p>The scraped data comprises:</p>
<ul>
    <li>Dates</li>
    <li>Headlines</li>
    <li>First sentences of articles</li>
</ul>

<p>To start off, let's take a look at the raw dataset.</p>

In [1]:
import pandas as pd

In [2]:
# Note that although I'm using AAPL here, the Reuters scraper I wrote has been abstracted to scrape data for any ticker
original_data = pd.read_csv("../dataset/AAPL_news.csv")

In [3]:
original_data

Unnamed: 0,Date,Headline,Sentence
0,"JULY 23, 2020",U.S. Congressional hearing to question tech gi...,A U.S. congressional hearing scheduled for nex...
1,"JULY 23, 2020",UPDATE 1-U.S. Congressional hearing to questio...,A U.S. congressional hearing scheduled for nex...
2,"JULY 23, 2020",California appeals court rejects Apple TV cons...,A California appeals court has declined a bid ...
3,"JULY 23, 2020",Apple faces deceptive trade practices probe by...,Multiple U.S. states are investigating Apple I...
4,"JULY 23, 2020",Apple faces consumer protection probe by multi...,Multiple U.S. states are investigating Apple I...
...,...,...,...
678,"APRIL 15, 2019",Huawei says not discussed 5G chipsets with App...,China's Huawei Technologies said on Tuesday it...
679,"APRIL 15, 2019","Apple, allies seek billions in U.S. trial test...",Apple Inc and its allies on Monday will kick o...
680,"APRIL 12, 2019",Apple hit with East Texas patent case over iPh...,A New York-based patent monetization company o...
681,"APRIL 12, 2019",Chinese group to get control of Japan Display ...,A Chinese-Taiwanese group will take control of...


<h2>Null values</h2>

<p>A common issue in the data cleaning process is dealing with null values. Let's see how our original dataset fares in this respect.</p>

In [4]:
original_data.isnull().any()

Date        False
Headline    False
Sentence    False
dtype: bool

<p>As we can see, there aren't any null values in the original dataset, which suggests that our scraper performed reasonably well. That said, isnull() only flags entries that are missing outright. It says nothing about the quality of the text itself (empty strings, stray characters, inconsistent casing), so we still have to apply text pre-processing techniques to this dataset.</p>
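<p>Since a null check does not catch empty or whitespace-only strings, a slightly broader test can be useful. The following is a minimal sketch, not part of the original notebook: is_missing is a hypothetical helper, and the sample values are made up for illustration.</p>

```python
import math


def is_missing(value):
    """Return True for None, NaN, or a string with no visible characters."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


# Made-up sample values: one real headline and four kinds of "missing"
samples = ["Apple faces probe", "", "   ", None, float("nan")]
print([is_missing(s) for s in samples])  # [False, True, True, True, True]
```

<p>Assuming the DataFrame loaded above, the helper could be applied per column with something like original_data["Headline"].map(is_missing).sum() to count the suspicious rows.</p>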

<h2>Date Format</h2>

<p>Let's convert the "Date" column into pandas Timestamps (the pandas equivalent of Python's datetime objects) and take a look at the results.</p>

In [5]:
df = original_data.copy() # making a copy of the original DataFrame to work with
df["Date"] = pd.to_datetime(df["Date"])

df.head()

Unnamed: 0,Date,Headline,Sentence
0,2020-07-23,U.S. Congressional hearing to question tech gi...,A U.S. congressional hearing scheduled for nex...
1,2020-07-23,UPDATE 1-U.S. Congressional hearing to questio...,A U.S. congressional hearing scheduled for nex...
2,2020-07-23,California appeals court rejects Apple TV cons...,A California appeals court has declined a bid ...
3,2020-07-23,Apple faces deceptive trade practices probe by...,Multiple U.S. states are investigating Apple I...
4,2020-07-23,Apple faces consumer protection probe by multi...,Multiple U.S. states are investigating Apple I...


<p>That's pretty much everything we have to do to the "Date" column as far as cleaning it up is concerned.</p>
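<p>If the scraper ever returns a date in a different shape, spelling out the expected format makes the problem visible instead of leaving it to format inference. Below is a minimal sketch under a few assumptions: month names are in English, and parse_reuters_date and DATE_FORMAT are names invented here. It uses the standard library's datetime so that it runs without the dataset; the same format string can be passed to pd.to_datetime through its format parameter.</p>

```python
from datetime import datetime

# Full month name, day of month, four-digit year, e.g. "JULY 23, 2020"
DATE_FORMAT = "%B %d, %Y"


def parse_reuters_date(raw):
    """Parse a date string in the scraped format; return None if it does not match."""
    try:
        # strptime matches month names case-insensitively, so "JULY" is accepted
        return datetime.strptime(raw.strip(), DATE_FORMAT)
    except ValueError:
        return None


print(parse_reuters_date("JULY 23, 2020"))   # 2020-07-23 00:00:00
print(parse_reuters_date("APRIL 15, 2019"))  # 2019-04-15 00:00:00
print(parse_reuters_date("not a date"))      # None
```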

<h2>Common Functions to Clean Text Data</h2>

<p>There are several ways to clean and process text data before feeding it into a model, and the exact functions required vary from one use case to another. Here are some of the processing steps we need for this project:</p>

<ul>
    <li>Converting to lowercase</li>
</ul>

In [6]:
def to_lower(text):
    return text.lower()

<p>For the next set of cleanup functions, we need a couple of imports:</p>

In [7]:
import re
import string

<ul>
    <li>Getting rid of punctuation</li>
</ul>

In [8]:
def remove_punctuation(text):
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)
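<p>To make the effect concrete, here is a small sketch that runs the same function on a made-up headline and compares it with an alternative built on str.translate. remove_punctuation_translate is a hypothetical name, not something the notebook defines. Note that string.punctuation covers ASCII punctuation only, so a typographic apostrophe survives this step.</p>

```python
import re
import string


def remove_punctuation(text):
    # Same approach as the cell above: a character class of all ASCII punctuation
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)


# Alternative: a translation table that maps every punctuation character to None
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_translate(text):
    return text.translate(_PUNCT_TABLE)


sample = "Apple's $1.5 bln deal: what's next?"  # made-up headline
print(remove_punctuation(sample))            # Apples 15 bln deal whats next
print(remove_punctuation_translate(sample))  # Apples 15 bln deal whats next

# A curly apostrophe (U+2019) is not in string.punctuation, so it is kept
print(remove_punctuation("Apple\u2019s") == "Apple\u2019s")  # True
```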

<ul>
    <li>Replacing special characters (e.g. â€™ standing in for an apostrophe, stray character sequences, etc.)</li>
</ul>

In [9]:
def replace_special_chars(text):
    text = text.replace("â€™", "\'")
    text = text.replace("Ã¤ÃŸ", "")
    text = text.replace("Ã¼", "")
    
    return text
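<p>Sequences like these are typical of UTF-8 text that was decoded as Windows-1252 somewhere along the way. If that is indeed the cause here (an assumption, since the notebook does not say how the text was decoded), the damage can often be reversed rather than patched string by string. A minimal sketch, with fix_mojibake as a hypothetical helper name:</p>

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        # Recover the original bytes, then decode them the way they were meant to be
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the kind of damage this round trip handles; leave the text alone
        return text


# "\u00e2\u20ac\u2122" is the three-character garbled form of a curly apostrophe
garbled = "Apple\u00e2\u20ac\u2122s"
print(fix_mojibake(garbled) == "Apple\u2019s")  # True

# "\u00c3\u00bc" is the garbled form of "ü"
print(fix_mojibake("\u00c3\u00bc") == "\u00fc")  # True

# Clean text passes through unchanged
print(fix_mojibake("plain ASCII"))  # plain ASCII
```

<p>Unlike the replacements above, this keeps the recovered characters instead of deleting them, so whether it fits depends on what the later steps expect.</p>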

<ul>
    <li>Replacing country names (abbreviations)</li>
</ul>

In [10]:
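def replace_country_names(text):
    # \b restricts each match to a whole token, so "EU" inside "REUTERS" is left alone
    text = re.sub(r"\bU\.S\.?", "United States", text)
    text = re.sub(r"\bEU\b", "Europe", text)
    text = re.sub(r"\bUK\b", "United Kingdom", text)

    return text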

<ul>
    <li>Removing digits and hyphens from words (e.g. COVID-19 becomes COVID, 034220.KS becomes .KS)</li>
</ul>

In [11]:
def remove_numbers(text):
    return re.sub(r"[\d-]", "", text)  # raw string avoids an invalid-escape warning for \d
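<p>Because the pattern removes hyphens as well as digits, it has two side effects worth knowing about: a token like "5G" loses its digit, and hyphenated words are fused together. The sketch below shows both, followed by one possible variant that turns hyphens into word breaks instead. remove_numbers_spaced is a hypothetical alternative, not the notebook's function.</p>

```python
import re


def remove_numbers(text):
    # Same behavior as the cell above: delete every digit and every hyphen
    return re.sub(r"[\d-]", "", text)


def remove_numbers_spaced(text):
    text = re.sub(r"\d", "", text)  # drop digits only
    text = text.replace("-", " ")   # treat a hyphen as a word break
    return " ".join(text.split())   # collapse doubled or trailing spaces


for raw in ["COVID-19", "034220.KS", "5G chipsets", "New York-based"]:
    print(repr(remove_numbers(raw)), repr(remove_numbers_spaced(raw)))
# 'COVID' 'COVID'
# '.KS' '.KS'
# 'G chipsets' 'G chipsets'
# 'New Yorkbased' 'New York based'
```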

<p>This is a good start. As we progress, if we notice unexpected results, we can come back and process the text further by implementing:</p>
    <ul>
        <li>Lemmatization - Reducing inflected words to a base form (ex. "taken" and "took" both become "take")</li>
        <li>Creating n-grams - Considering n words at a time (ex. a bi-gram would be "United States" as opposed to considering "united" and "states" separately)</li>
        <li>Parts of speech tagging - Recognizing nouns, pronouns (ex. "Canada", "he", "it" etc.)</li>
        <li>Removing words containing numbers - Not done for now because "COVID-19" can be impactful on the result of the analysis</li>
        <li>Removal of meaningless text</li>
    </ul>
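<p>Two of the items above need nothing beyond the standard library, so they can be sketched right away. The function names ngrams and drop_tokens_with_digits are made up for this illustration, and tokens are obtained by a plain whitespace split, which is an assumption about how the text would be tokenized. Lemmatization and part-of-speech tagging usually rely on a dedicated NLP library and are left out.</p>

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the token list; it stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


def drop_tokens_with_digits(tokens):
    """Remove whole tokens that contain at least one digit."""
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]


tokens = "united states congressional hearing".split()
print(ngrams(tokens, 2))
# [('united', 'states'), ('states', 'congressional'), ('congressional', 'hearing')]

print(drop_tokens_with_digits(["covid-19", "cases", "034220.ks"]))
# ['cases']
```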

<p>For now, let's clean up the text in our dataset using the utility functions defined above.</p>

In [12]:
# Applies all the cleanup functions to a piece of text
def clean_up_text(text):
    text = replace_country_names(text)
    text = replace_special_chars(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = to_lower(text)
    
    return text


clean = clean_up_text  # plain alias; wrapping the function in a lambda adds nothing
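<p>The order of the steps matters: the abbreviation replacement matches uppercase text with periods, so it has to run before lowercasing and punctuation removal. The sketch below illustrates this with two simplified stand-ins (expand_us and apply_steps are hypothetical names, and the sample headline is made up).</p>

```python
import re
from functools import reduce


def expand_us(text):
    # Stand-in for replace_country_names: case-sensitive match on "U.S." or "U.S"
    return re.sub(r"\bU\.S\.?", "United States", text)


def to_lower(text):
    return text.lower()


def apply_steps(text, steps):
    # Feed the output of each step into the next one, left to right
    return reduce(lambda acc, step: step(acc), steps, text)


sample = "U.S. states probe Apple"  # made-up headline for illustration

print(apply_steps(sample, [expand_us, to_lower]))  # united states states probe apple
print(apply_steps(sample, [to_lower, expand_us]))  # u.s. states probe apple
```

<p>With lowercasing first, the abbreviation is never expanded, and a later punctuation pass would reduce it to "us".</p>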

In [13]:
# Creating a DataFrame with clean text
headline_clean = pd.DataFrame(df["Headline"].apply(clean))
sentence_clean = pd.DataFrame(df["Sentence"].apply(clean))
df_clean = pd.concat([headline_clean, sentence_clean], axis=1)
df_clean = pd.concat([df["Date"], df_clean], axis=1)

df_clean

Unnamed: 0,Date,Headline,Sentence
0,2020-07-23,united states congressional hearing to questio...,a united states congressional hearing schedule...
1,2020-07-23,update united states congressional hearing to ...,a united states congressional hearing schedule...
2,2020-07-23,california appeals court rejects apple tv cons...,a california appeals court has declined a bid ...
3,2020-07-23,apple faces deceptive trade practices probe by...,multiple united states states are investigatin...
4,2020-07-23,apple faces consumer protection probe by multi...,multiple united states states are investigatin...
...,...,...,...
678,2019-04-15,huawei says not discussed g chipsets with appl...,chinas huawei technologies said on tuesday it ...
679,2019-04-15,apple allies seek billions in united states tr...,apple inc and its allies on monday will kick o...
680,2019-04-12,apple hit with east texas patent case over iph...,a new yorkbased patent monetization company on...
681,2019-04-12,chinese group to get control of japan display ...,a chinesetaiwanese group will take control of ...
