<h1>Exploratory Data Analysis (EDA)</h1>

<p>This is the next stage of the NLP pipeline, following data cleaning. EDA is carried out to verify that the cleaned data makes sense. If it doesn't, we go back to the data cleaning notebook and apply further techniques to transform the data into a more usable format.</p>

In [None]:
import pandas as pd
import seaborn as sns
from textblob import TextBlob

# Read in corpus and DTMs
headlines = pd.read_pickle("./pickles/headline_dtm.pkl")
text = pd.read_pickle("./pickles/text_dtm.pkl")
df = pd.read_pickle("./pickles/corpus.pkl")

# Transposing the datasets for easier comprehension
headlines = headlines.transpose()
text = text.transpose()
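
<p>As a rough illustration of why the transpose helps, the sketch below uses a tiny made-up document-term matrix (the terms and document labels are hypothetical, and it assumes the pickled DTMs hold one row per document). After transposing, each column is a document, so its most frequent terms can be read off directly.</p>

```python
import pandas as pd

# Hypothetical 2-document DTM: one row per document, one column per term
dtm = pd.DataFrame(
    {"stock": [2, 0], "rises": [1, 0], "falls": [0, 3]},
    index=["doc0", "doc1"],
)

# After transposing: one row per term, one column per document
tdm = dtm.transpose()

# Most frequent term in the second document
top_doc1 = tdm["doc1"].sort_values(ascending=False).head(1)
print(tdm.shape)
print(top_doc1)
```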

<p>EDA is done by visualizing aspects of the cleaned dataset.</p>

<p><i>(For each visualization that's presented, there will be two variants - headlines and display text.)</i></p>

<li>Comparing sentiment polarity using TextBlob</li>

In [None]:
# Headline sentiment
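# The lambda argument is named s so it does not shadow the text DTM loaded above
df["Headline Polarity"] = df["Headline"].map(lambda s: TextBlob(s).sentiment.polarity)

# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it
sns.histplot(df["Headline Polarity"], kde=True, stat="density")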

In [None]:
# Display text sentiment (headline + first sentence)
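# The lambda argument is named s so it does not shadow the text DTM loaded above
df["Text Polarity"] = df["Text"].map(lambda s: TextBlob(s).sentiment.polarity)

# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it
sns.histplot(df["Text Polarity"], kde=True, stat="density")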

<p>This tells us that most of the news is close to neutral in sentiment, leaning slightly positive. Let's compare this with the distribution of stock price inflections to see whether the two agree.</p>
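
<p>Before moving on, the "mostly neutral, slightly positive" reading can be checked numerically instead of by eye. The sketch below is only illustrative: the scores are made up and stand in for <code>df["Headline Polarity"]</code>, and the helper <code>label</code> is a hypothetical name. It reports the share of positive, neutral and negative scores along with the mean polarity.</p>

```python
import pandas as pd

# Hypothetical polarity scores standing in for df["Headline Polarity"];
# TextBlob polarity lies in [-1.0, 1.0], with 0.0 meaning neutral.
polarity = pd.Series([0.0, 0.25, -0.1, 0.0, 0.5, 0.1])


def label(score):
    """Bucket a polarity score by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Share of each sentiment bucket, plus the average score
shares = polarity.map(label).value_counts(normalize=True)
print(shares)
print("mean polarity:", polarity.mean())
```

<p>On the real data, the same two lines would run on <code>df["Headline Polarity"]</code> and <code>df["Text Polarity"]</code>.</p>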

<li>Stock price inflection</li>

In [None]:
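# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it
sns.histplot(df["Inflection"], kde=True, stat="density", color="r")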

<p>So there are indeed more upward than downward inflections in the price, which is consistent with the slightly positive sentiment.</p>
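
<p>The same claim can be expressed as a single number. This is a minimal sketch on made-up values, assuming that a positive <code>Inflection</code> means the price moved up and a non-positive one means it did not; the real column may be encoded differently.</p>

```python
import pandas as pd

# Hypothetical stand-in for df["Inflection"]; assumes positive values
# mean an upward move (the actual encoding may differ).
toy = pd.DataFrame({"Inflection": [1.2, -0.4, 0.8, 0.3, -1.1]})

# Fraction of rows with an upward inflection
share_up = (toy["Inflection"] > 0).mean()
print(f"Upward inflections: {share_up:.0%}")
```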

<p>Let's take a look at the length of the text.</p>

<li>News length</li>

In [None]:
# Headline length
df["Headline Length"] = df["Headline"].str.len()  # length in characters

# distplot is deprecated since seaborn 0.11; draw the step histogram and the KDE separately
sns.histplot(df["Headline Length"], stat="density", element="step", fill=False,
             linewidth=3, color="g")
sns.kdeplot(df["Headline Length"], color="b", lw=2, label="KDE")

In [None]:
# Display text length (headline + first sentence)
df["Text Length"] = df["Text"].str.len()  # length in characters

# distplot is deprecated since seaborn 0.11; draw the step histogram and the KDE separately
sns.histplot(df["Text Length"], stat="density", element="step", fill=False,
             linewidth=3, color="k")
sns.kdeplot(df["Text Length"], color="r", lw=2, label="KDE")

<p>The modal headline length is around 70 characters, while the modal display text length is around 280 characters. Let's see whether these lengths carry any signal by plotting each of them against Inflection with a fitted regression line.</p>
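
<p>Reading a mode off a plot is approximate, so one option is to bin the lengths and pick the fullest bin. The sketch below uses made-up lengths and an arbitrary bin width of 10 characters; both are assumptions for illustration, not values from the dataset.</p>

```python
import pandas as pd

# Hypothetical headline lengths standing in for df["Headline Length"]
lengths = pd.Series([62, 68, 71, 70, 74, 69, 120, 45, 72, 66])

# Bin into 10-character intervals: (0, 10], (10, 20], ..., (130, 140]
binned = pd.cut(lengths, bins=list(range(0, 141, 10)))

# The modal bin is the interval containing the most headlines
modal_bin = binned.value_counts().idxmax()
print(modal_bin)
```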

<li>Correlation between news length and price inflection</li>

In [None]:
# Relationship between headline length and price inflection
sns.pairplot(df, vars=["Headline Length", "Inflection"], kind="reg")

In [None]:
# Relationship between display text length and price inflection
sns.pairplot(df, vars=["Text Length", "Inflection"], kind="reg")
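
<p>The regression pair plots show the trend visually; a correlation coefficient summarizes it in one number. This is a minimal sketch on made-up values (the toy frame is hypothetical) using pandas' default Pearson correlation. A value near 0 would suggest that length has little linear relationship with the inflection.</p>

```python
import pandas as pd

# Hypothetical rows standing in for df[["Headline Length", "Inflection"]]
toy = pd.DataFrame({
    "Headline Length": [40, 55, 63, 70, 82, 95],
    "Inflection": [0.5, -0.2, 0.1, 0.4, -0.3, 0.2],
})

# Pearson correlation between the two columns (pandas' default method)
r = toy["Headline Length"].corr(toy["Inflection"])
print(f"Pearson r = {r:.3f}")
```

<p>On the real data, <code>df[["Headline Length", "Text Length", "Inflection"]].corr()</code> would give the full correlation matrix in one call.</p>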