# Raw Data EDA
* This notebook performs EDA on the raw scraped data. 
* The goals of this EDA are to find out: 
     * Whether there is any missing or corrupt data. 
     * Whether there are any interesting patterns in the data. 


## Installing Required Libraries

In [1]:
! pip install pandas
! pip install numpy
! pip install plotly
! pip install nbformat
! pip install ipykernel
! pip install matplotlib
! pip install wordcloud


## Import Libraries

In [2]:
import json

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Read Data Set
* We scraped the data from `foxnews.com` using the `scrapping.ipynb` notebook. 
* The scraped data is written to `fox_data.csv`.

In [3]:
## Importing the data
data = pd.read_csv("./data/fox_data.csv")

## Explore Data

In [4]:
## lets see the data
data.head()

Unnamed: 0,imageUrl,title,description,url,publicationDate,lastPublishedDate,category,isBreaking,isLive,duration,authors,text
0,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
1,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...
2,https://a57.foxnews.com/static.foxnews.com/fox...,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,https://www.foxnews.com/politics/nyc-naked-cow...,2022-11-02 21:58:25-04:00,2022-11-02T21:58:25-04:00,"{'name': 'New York City', 'url': '/category/us...",False,False,,[{'name': 'Adam Sabes'}],The famous Naked Cowboy in New York City's Tim...
3,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
4,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...


In [5]:
## Lets look at the shape of the data
data.shape

(39787, 12)

In [6]:
## Let's look at the data types of the columns

pd.DataFrame(data.dtypes, columns=['DataType'])

Unnamed: 0,DataType
imageUrl,object
title,object
description,object
url,object
publicationDate,object
lastPublishedDate,object
category,object
isBreaking,bool
isLive,bool
duration,float64


##### Notes
* For our use case of `topic modeling` and `sentiment analysis`, we probably only need `url`, `text`, `authors` and `title`. We can create a smaller dataset with just these columns and write it to a separate file. 
* The `authors` column is an array of `json` objects, each with a `name` property. We should reduce it to just the names for simplicity. 
* Columns like `isBreaking`, `isLive` and `duration` might not be applicable here since this is text content. 
* We'll also need to convert the timestamp columns to `datetime` for EDA. 
* We'll still do EDA on all the columns to see if there are any interesting insights. 
* We'll also rename the columns to follow naming conventions and best practices. 
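
The first note above could look roughly like the sketch below. It runs on a toy frame rather than the scraped data, and the output file name `fox_data_min.csv` is a hypothetical choice, not something the notebook defines.

```python
import pandas as pd

# Toy stand-in for the scraped frame (hypothetical rows, same column names).
raw = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "title": ["Title A", "Title B"],
    "authors": ["[{'name': 'Jane Doe'}]", "[]"],
    "text": ["Body of article A.", "Body of article B."],
    "duration": [None, None],
})

# Keep only the columns needed for topic modeling and sentiment analysis.
keep_cols = ["url", "title", "authors", "text"]
minified = raw[keep_cols].copy()

# Writing it out is optional; the file name is an assumption.
# minified.to_csv("fox_data_min.csv", index=False)
print(minified.columns.tolist())
```

Taking a `.copy()` keeps later edits to the smaller frame from touching the original.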

### Rename Columns

In [7]:
data.rename(columns={
     'Unnamed: 0': 'id',
     'imageUrl': 'image_url',
     'publicationDate': 'publication_date',
     'lastPublishedDate': 'last_published_date',
     'isBreaking': 'is_breaking',
     'isLive': 'is_live',     
}, inplace=True)

data.head()

Unnamed: 0,image_url,title,description,url,publication_date,last_published_date,category,is_breaking,is_live,duration,authors,text
0,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
1,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...
2,https://a57.foxnews.com/static.foxnews.com/fox...,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,https://www.foxnews.com/politics/nyc-naked-cow...,2022-11-02 21:58:25-04:00,2022-11-02T21:58:25-04:00,"{'name': 'New York City', 'url': '/category/us...",False,False,,[{'name': 'Adam Sabes'}],The famous Naked Cowboy in New York City's Tim...
3,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
4,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...


### Convert to `datetime`

In [31]:
## Convert `last_published_date` to datetime and derive day and month columns

data["last_published_date"] = pd.to_datetime(data["last_published_date"])
data["published_day"] = data["last_published_date"].dt.day
data["published_month"] = data["last_published_date"].dt.month

In [9]:
## lets check the data types again
pd.DataFrame(data.dtypes, columns=['DataType'])

Unnamed: 0,DataType
image_url,object
title,object
description,object
url,object
publication_date,object
last_published_date,"datetime64[ns, pytz.FixedOffset(-240)]"
category,object
is_breaking,bool
is_live,bool
duration,float64


### Check for Missing Data

In [10]:
## lets check if there are any null values in the data
data.isnull().mean()

image_url              0.000000
title                  0.000000
description            0.000503
url                    0.000000
publication_date       0.000000
last_published_date    0.000000
category               0.000000
is_breaking            0.000000
is_live                0.000000
duration               1.000000
authors                0.000000
text                   0.000754
published_day          0.000000
published_month        0.000000
dtype: float64

##### Notes
* The `duration` column has `100%` of its data missing, which makes sense because this is text content. We can safely drop this column. 
* Both `description` and `text` have less than `1%` missing data. We don't care too much about `description` since we are not going to use it for our models, but for missing `text` we have 2 options: 
     * Option 1: Drop the rows with missing data. Since it's less than `1%`, we won't be losing too much data. 
     * Option 2: Try to scrape the missing data again. 
* I am going to start with `Option 2` and try to scrape the missing data. If that doesn't work we can drop the rows. 
* For `description` we can just add a placeholder "Not Available" string. 
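
As a rough illustration of the fallback path described above (drop the empty column, fill `description`, drop rows without `text`), here is a minimal sketch on a toy frame. The rows are made up; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame with the same kinds of gaps as the real data (hypothetical values).
toy = pd.DataFrame({
    "description": ["A summary", None, "Another summary"],
    "text": ["Full text", "More text", None],
    "duration": [None, None, None],
})

# Fraction of missing values per column.
missing = toy.isnull().mean()

# Columns that are entirely empty carry no information and can be dropped.
empty_cols = missing[missing == 1.0].index.tolist()
cleaned = toy.drop(columns=empty_cols)

# Placeholder for missing descriptions; assign back instead of using inplace.
cleaned["description"] = cleaned["description"].fillna("Not Available")

# Rows without article text are unusable for modeling.
cleaned = cleaned.dropna(subset=["text"])
print(cleaned)
```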

In [11]:
## using fillna to fill the null values in description
## (assigning back avoids relying on inplace on a column selection)
data['description'] = data['description'].fillna('Not Available')

In [12]:
## lets try and check the URL for missing data
data[data["text"].isnull()]["url"].unique()

array(['https://www.foxnews.com/politics/cartoons-slideshow',
       'https://www.foxnews.com/politics/photos-president-trump-family-gather-funeral-ivana-trump',
       'https://www.foxnews.com/politics/supreme-court-overturns-roe-v-wade-photos-protesters-crowds-outside'],
      dtype=object)

##### Notes
* I checked the URLs manually and it looks like these are just image slideshows with no text content, so it should be OK to drop these rows. 

### Dropping Missing Data


In [13]:
## lets check the dropping of the null values - ignoring inplace for now
data.dropna(subset=["text"]).isnull().mean()

## lets check the shape of the data
data.dropna(subset=["text"]).shape

(39757, 14)

##### Notes
* So we'll lose 30 rows, which raises the question of why 3 URLs account for 30 rows. 

In [14]:
## drop rows in place
data.dropna(subset=["text"], inplace=True)
data.shape

(39757, 14)

### Checking for duplicates
* Earlier we saw 30 rows sharing just 3 URLs. That shouldn't be the case; the assumption is that each row is a unique news article. Let's confirm that. 

In [15]:
## lets check for duplicate urls
data["url"].unique().shape

(3972,)

##### Notes
* We only have `3972` unique URLs! That means our data is about 10 times smaller than expected. 
* Let's investigate some more. 


In [16]:
data["url"].unique()

array(['https://www.foxnews.com/politics/hassan-bolduc-trade-fire-final-showdown-after-gop-nominee-comes-under-attack-arriving-debate',
       'https://www.foxnews.com/politics/biden-speech',
       'https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order',
       ...,
       'https://www.foxnews.com/politics/florida-desantis-campaign-warns-donations-pac-dont-benefit-reelection',
       'https://www.foxnews.com/politics/west-point-cadets-taught-critical-race-theory-addressing-whiteness-docs-show',
       'https://www.foxnews.com/politics/cia-officer-turned-democratic-congresswoman-spanberger-emphasizes-proud-american'],
      dtype=object)

In [17]:
len(data.query("url=='https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order'"))

9

##### Notes
* So it looks like we have a lot of duplicates, most probably due to our web scraper. 
* Let's remove the duplicates and see how that affects our dataset size. 
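
One way to quantify the duplication before removing it is to count the copies per URL. The sketch below uses a toy frame with made-up URLs; it is an illustration of the idea, not the notebook's actual cell.

```python
import pandas as pd

# Toy frame with repeated URLs (hypothetical values).
toy = pd.DataFrame({
    "url": ["u1", "u2", "u1", "u3", "u1", "u2"],
    "title": ["a", "b", "a", "c", "a", "b"],
})

# How many times each URL appears; values above 1 indicate duplicates.
copies_per_url = toy["url"].value_counts()

# Rows flagged as repeats of an earlier row with the same URL.
n_repeats = int(toy.duplicated(subset=["url"]).sum())

# Keep the first occurrence of each URL.
deduped = toy.drop_duplicates(subset=["url"])
print(copies_per_url)
print(n_repeats, deduped.shape)
```

Looking at `copies_per_url` on the real data would show whether every article is repeated about equally often or only a few are, which may hint at where the scraper revisits pages.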

### Removing Duplicates

In [18]:
## lets check dataset size after removing duplicate urls
data.drop_duplicates(subset=["url"]).shape


(3972, 14)

##### Notes
* After removing duplicates we'll have `3972` records, which is a lot less than I wanted but might be enough for topic labelling. We can start with these records and, once the steps and process are in place, try to scrape more data. 

In [19]:
## dropping duplicates in place
data.drop_duplicates(subset=["url"], inplace=True)
data.shape

(3972, 14)

### Removing unused columns
* So columns like `category`, `is_live`, `is_breaking` and `duration` are not useful to us because, 
     * `category`: Is just `json` object with `relative url` and one word `category` for the news article. 
     * `is_live` is boolean for whether the news is live or not, not applicable to text news content, all values are false. 
     * `is_breaking` is boolean for whether the news is breaking news or not, not applicable for our use case and all values are false. 
     * `duration` is video duration, again not applicable to our use case. 
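* I also think the `image_url` and `publication_date` columns are not useful. `image_url` is just a link to the image used in the article, and `publication_date` largely overlaps with `last_published_date` (the two can differ, e.g. in the `biden-speech` row above, presumably because the article was updated after first publication). 
* Let's drop these columns to further reduce the dataset size. 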


In [20]:
data.drop(["is_live", "is_breaking", "duration", "category", "image_url", "publication_date"], axis=1, inplace=True)

In [21]:
data.shape

(3972, 8)

### Feature Engineering Authors
* Right now the `authors` column is a `json`-like array of author objects. We can extract the author names from the array and save them as a string. 

In [22]:
data.head()["authors"]

0     [{'name': 'Paul Steinhauser'}]
1           [{'name': 'Haris Alic'}]
2           [{'name': 'Adam Sabes'}]
9        [{'name': 'Bradford Betz'}]
22       [{'name': 'Bradford Betz'}]
Name: authors, dtype: object

##### Notes
* So the assumption here is that each article has just one author, lets first confirm that by adding new column `num_authors` 

In [23]:
## using ast to convert string to list
import ast

## helper function to count the number of authors
def find_num_author(authorArrayString):
   # the column holds Python-literal strings such as "[{'name': 'Adam Sabes'}]"
   authorArray = ast.literal_eval(authorArrayString)
   return len(authorArray)

## lets apply the function to the data
data["num_authors"] = data["authors"].apply(find_num_author)

## lets check how many rows have more than 1 author
data[data["num_authors"] > 1].shape

(498, 9)

In [24]:
## lets check the distribution of num_authors
data["num_authors"].value_counts()

1    3373
2     434
0     101
3      53
4      10
5       1
Name: num_authors, dtype: int64

In [25]:
## just to confirm lets also check the values there. 
data[data["num_authors"] > 1]["authors"]

75       [{'name': 'Adam Shaw'}, {'name': 'Bill Melugin'}]
91       [{'name': 'Aubrie Spady'}, {'name': 'Sophia Sl...
107      [{'name': 'Kyle Morris'}, {'name': 'Sophia Sla...
181      [{'name': 'Paul Steinhauser'}, {'name': 'Brook...
291      [{'name': 'Timothy Nerozzi'}, {'name': 'Courtn...
                               ...                        
39476    [{'name': 'Brandon Gillespie'}, {'name': 'Thom...
39638    [{'name': 'Lisa Bennatan'}, {'name': 'Megan My...
39699    [{'name': 'Lisa Bennatan'}, {'name': 'Jon Raas...
39726    [{'name': 'Chad Pergram'}, {'name': 'Paul Best'}]
39786    [{'name': 'Paul Steinhauser'}, {'name': 'Andre...
Name: authors, Length: 498, dtype: object

##### Notes
* We have ~500 records with more than 1 author, so our assumption was clearly wrong. 
* There are ~100 records with no author; we'll just use "Unknown" there. 
* To feature engineer this, we can create a comma-separated string of authors, with the words in each author's name joined by an underscore. 
* That way searching and analyzing will be easier. 

In [26]:
## helper function to convert author object to string
def find_author(authorArrayString):
   authorArray = ast.literal_eval(authorArrayString)
   if(len(authorArray) == 0):
      return "Unknown"
   authors = ["_".join(author["name"].split(" ")) for author in authorArray]
   return ",".join(authors)
   

## lets apply the function to the data
data["author"] = data["authors"].apply(find_author)
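
Because multi-author articles end up as one comma-separated string, per-author statistics need that string split back apart. A minimal sketch of one way to do this with `str.split` and `explode`, on a toy frame with hypothetical URLs:

```python
import pandas as pd

# Toy frame using the same encoding as the `author` column (hypothetical rows).
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "author": ["Adam_Shaw,Bill_Melugin", "Adam_Shaw", "Unknown"],
})

# Split the comma-separated string into a list, then emit one row per author.
per_author = (
    toy.assign(author=toy["author"].str.split(","))
       .explode("author")
)

# Number of articles each author contributed to.
articles_per_author = per_author["author"].value_counts()
print(articles_per_author)
```

With this layout a co-authored article counts once for each of its authors, which may or may not be the attribution wanted for the later analysis.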

### Feature Engineering Word Count
* Even though it might not be 100% accurate lets create a feature with word count for EDA


In [27]:
## helper function to create word count from text column
def word_count(text):
   return len(text.split(" "))

data["word_count"] = data["text"].apply(word_count)
data["word_count"].describe()

count    3972.000000
mean      610.804884
std       340.785100
min        31.000000
25%       399.000000
50%       534.000000
75%       734.000000
max      9672.000000
Name: word_count, dtype: float64

##### Notes
* Interesting: there is an outlier article with `9672` words! It would be worth checking out. 
* The shortest article has just `31` words; we'll need to check that one out in EDA as well. 
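
A common way to flag such outliers is the 1.5 × IQR rule. The sketch below applies it to a short toy series whose values are only loosely modeled on the summary above; it is an assumption about how the check could be done, not the notebook's method.

```python
import pandas as pd

# Toy word counts (illustrative values, not the real column).
word_counts = pd.Series([31, 399, 420, 534, 610, 734, 800, 9672], name="word_count")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = word_counts.quantile(0.25)
q3 = word_counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = word_counts[(word_counts < lower) | (word_counts > upper)]

# Row labels of the shortest and longest entries, for manual inspection.
shortest, longest = word_counts.idxmin(), word_counts.idxmax()
print(outliers.tolist(), shortest, longest)
```

On this toy series the rule flags only the very long entry, because the lower fence falls below zero. Short articles are therefore better inspected directly, for example through `idxmin`.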

### Feature Engineering Line Count

* Again even though it might not be 100% accurate lets create a feature with line count for EDA


In [28]:
## helper function to create line count from text column
def create_line_count(text):
   return len(text.split("."))

data["line_count"] = data["text"].apply(create_line_count)
data["line_count"].describe()

count    3972.000000
mean       31.652064
std        19.232276
min         3.000000
25%        21.000000
50%        28.000000
75%        37.000000
max       647.000000
Name: line_count, dtype: float64

##### Notes
* As expected after looking at the word counts, the min and max articles seem to be outliers with `3` and `647` lines respectively. 
* We'll need to investigate that. 
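
Part of the inaccuracy comes from splitting on every period, which also breaks on decimals and leaves an empty piece after the final period. The sketch below compares that approach with a slightly stricter regex split. Both function names are hypothetical, and the regex version will still split after abbreviations such as "Sen." when they are followed by a space.

```python
import re

def naive_sentence_count(text):
    # Same idea as the cell above: count the pieces between periods.
    return len(text.split("."))

def regex_sentence_count(text):
    # Split only on whitespace that follows ., ! or ?, and ignore empty pieces.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])

sample = "The bill passed 52.5% to 47.5%. Critics disagreed."
print(naive_sentence_count(sample), regex_sentence_count(sample))
```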

In [29]:
data.isnull().mean()

title                  0.0
description            0.0
url                    0.0
last_published_date    0.0
authors                0.0
text                   0.0
published_day          0.0
published_month        0.0
num_authors            0.0
author                 0.0
word_count             0.0
line_count             0.0
dtype: float64

## Exploratory Data Analysis
* Let's do some EDA to get stats on the following:
     * Average number of articles per day/week/month
     * Average number of articles per author
     * Average number of words an author writes
     * Average number of articles an author writes per day/week/month. 
     * Analyze the outliers
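
Most of the statistics listed above reduce to `groupby` aggregations. Here is a minimal sketch on a toy frame that reuses the notebook's column names with made-up rows. It assumes one author per row, so multi-author rows would need to be split first.

```python
import pandas as pd

# Toy frame with the engineered columns (hypothetical rows).
toy = pd.DataFrame({
    "last_published_date": pd.to_datetime([
        "2022-11-01T09:00:00-04:00",
        "2022-11-01T15:30:00-04:00",
        "2022-11-02T10:00:00-04:00",
        "2022-11-02T22:47:00-04:00",
    ]),
    "author": ["Jane_Doe", "John_Roe", "Jane_Doe", "Jane_Doe"],
    "word_count": [400, 600, 500, 300],
})

# Articles per calendar day, then the average across days.
per_day = toy.groupby(toy["last_published_date"].dt.date).size()
avg_per_day = per_day.mean()

# Article count and mean word count per author.
per_author = toy.groupby("author")["word_count"].agg(["count", "mean"])
print(avg_per_day)
print(per_author)
```

Weekly or monthly averages follow the same pattern with a different grouping key, for example the `published_month` column.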


In [32]:
data.head()

Unnamed: 0,title,description,url,last_published_date,authors,text,published_day,published_month,num_authors,author,word_count,line_count
0,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....,2,11,1,Paul_Steinhauser,1271,62
1,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 22:15:46-04:00,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...,2,11,1,Haris_Alic,478,22
2,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,https://www.foxnews.com/politics/nyc-naked-cow...,2022-11-02 21:58:25-04:00,[{'name': 'Adam Sabes'}],The famous Naked Cowboy in New York City's Tim...,2,11,1,Adam_Sabes,205,18
9,Wisconsin courts shoot down liberal groups' at...,A Wisconsin appeals court and a circuit judge ...,https://www.foxnews.com/politics/wisconsin-cou...,2022-11-02 21:44:40-04:00,[{'name': 'Bradford Betz'}],Liberal groups in Wisconsin seeking to change ...,2,11,1,Bradford_Betz,381,20
22,Texas gubernatorial candidate Beto O'Rourke jo...,Texas gubernatorial nominee Beto O’Rourke is t...,https://www.foxnews.com/politics/texas-guberna...,2022-11-02 20:38:30-04:00,[{'name': 'Bradford Betz'}],Texas gubernatorial nominee Beto O’Rourke is a...,2,11,1,Bradford_Betz,267,15
