# Raw Data EDA
* This notebook performs EDA on the raw scraped data. 
* We are conducting this EDA to find out: 
     * Whether there is any missing or corrupt data. 
     * Whether there are any interesting patterns in the data. 

## Table Of Contents


* [Installing Required Libraries](#installation)
* [Importing Libraries](#importing)
* [Reading Data Set](#reading)
* [Exploring Data](#exploring)
     * [Renaming Columns](#renaming)
     * [Datetime Conversion](#datetime_conversion)
     * [Checking Missing Data](#missing_data)
     * [Dropping Missing Data](#dropping_missing_data)
     * [Checking Duplicates](#checking_duplicates)
     * [Removing Duplicates](#removing_duplicates)
     * [Removing Unused Columns](#removing_unused_columns)
     * [Feature Engineering Authors](#feature_engineering_authors)
     * [Feature Engineering Word Count](#feature_engineering_wordcounts)
     * [Feature Engineering Line Count](#feature_engineering_linecounts)
* [Exploratory Data Analysis](#eda)
     * [Number of Daily Articles](#num_daily_articles)
     * [Average Daily Articles Per Month](#avg_daily_articles)
     * [Total Monthly Articles](#total_monthly_articles)
     * [Number Of Articles Per Author](#num_articles_per_author)
     * [Number of Daily Articles Per Author](#num_daily_articles_per_author)
     * [Analyzing The Outliers](#outlier_analysis)
* [Writing Data](#writing)

## Installing Required Libraries <a class="anchor" id="installation"></a>

In [166]:
! pip install pandas
! pip install numpy
! pip install plotly
! pip install nbformat
! pip install ipykernel
! pip install matplotlib
! pip install wordcloud


## Importing Libraries <a class="anchor" id="importing"></a>

In [167]:
import json

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Reading Data Set <a class="anchor" id="reading"></a>
* We scraped the data from `foxnews.com` using the `scrapping.ipynb` notebook. 
* The scraped data is written to `fox_data.csv`.

In [168]:
## Importing the data
data = pd.read_csv("./data/fox_data.csv")

## Exploring Data <a class="anchor" id="exploring"></a>

In [169]:
## lets see the data
data.head()

Unnamed: 0,imageUrl,title,description,url,publicationDate,lastPublishedDate,category,isBreaking,isLive,duration,authors,text
0,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
1,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...
2,https://a57.foxnews.com/static.foxnews.com/fox...,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,https://www.foxnews.com/politics/nyc-naked-cow...,2022-11-02 21:58:25-04:00,2022-11-02T21:58:25-04:00,"{'name': 'New York City', 'url': '/category/us...",False,False,,[{'name': 'Adam Sabes'}],The famous Naked Cowboy in New York City's Tim...
3,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
4,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...


In [170]:
## Lets look at the shape of the data
data.shape

(39787, 12)

In [171]:
## Lets look at the data types of the columns
# read character column

pd.DataFrame(data.dtypes, columns=['DataType'])

Unnamed: 0,DataType
imageUrl,object
title,object
description,object
url,object
publicationDate,object
lastPublishedDate,object
category,object
isBreaking,bool
isLive,bool
duration,float64


##### Notes
* For our use case of `topic modeling` and `sentiment analysis`, I think we only need `url`, `text`, `authors` and `title`. We can create a smaller data set with just these columns for efficiency and write it to a separate file. 
* The `authors` column is an array of `json` objects with a `name` property. We should reduce it to just the names for simplicity. 
* Some of the columns like `isBreaking`, `isLive` and `duration` might not be applicable here since this is text content. 
* We'll also need to convert the timestamps to datetime columns for EDA. 
* We'll still do an EDA on all the columns to see if there are any interesting insights. 
* We'll also change the column names to follow naming conventions and best practices. 

### Renaming Columns <a class="anchor" id="renaming"></a>

In [172]:
data.rename(columns={
     'Unnamed: 0': 'id',
     'imageUrl': 'image_url',
     'publicationDate': 'publication_date',
     'lastPublishedDate': 'last_published_date',
     'isBreaking': 'is_breaking',
     'isLive': 'is_live',     
}, inplace=True)

data.head()

Unnamed: 0,image_url,title,description,url,publication_date,last_published_date,category,is_breaking,is_live,duration,authors,text
0,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
1,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...
2,https://a57.foxnews.com/static.foxnews.com/fox...,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,https://www.foxnews.com/politics/nyc-naked-cow...,2022-11-02 21:58:25-04:00,2022-11-02T21:58:25-04:00,"{'name': 'New York City', 'url': '/category/us...",False,False,,[{'name': 'Adam Sabes'}],The famous Naked Cowboy in New York City's Tim...
3,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,https://www.foxnews.com/politics/hassan-bolduc...,2022-11-02 22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}],Former governor and first term Democratic Sen....
4,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,https://www.foxnews.com/politics/biden-speech,2022-11-02 19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}],President Biden urged Democrats on Wednesday t...


### Datetime Conversion <a class="anchor" id="datetime_conversion"></a>

In [173]:
## Converting `last_published_date` to datetime and deriving day and month columns from it

data["last_published_date"] = pd.to_datetime(data["last_published_date"])
data["published_day"] = data["last_published_date"].dt.day
data["published_month"] = data["last_published_date"].dt.month
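The conversion above keeps the fixed `-04:00` offset found in the scraped strings. As a minimal sketch, assuming the timestamps should be read in US Eastern time (the `America/New_York` zone name and the toy `sample` frame are assumptions, not taken from the notebook), parsing to UTC first and then converting to a named zone keeps the derived day and month consistent even if offsets change with daylight saving time:

```python
import pandas as pd

# Hypothetical sample mimicking the ISO 8601 strings in `last_published_date`.
sample = pd.DataFrame({
    "last_published_date": [
        "2022-11-02T22:47:00-04:00",
        "2022-06-20T13:00:12-04:00",
    ]
})

# Parse everything to UTC, then convert to a named zone (assumed here).
parsed = pd.to_datetime(sample["last_published_date"], utc=True)
local = parsed.dt.tz_convert("America/New_York")

# Derive calendar features from the local timestamps.
sample["published_date"] = local.dt.date
sample["published_day"] = local.dt.day
sample["published_month"] = local.dt.month
print(sample)
```

Keeping a full `published_date` column alongside day and month may also make it easier to group by calendar date later without looping over months.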

In [174]:
## lets check the data types again
pd.DataFrame(data.dtypes, columns=['DataType'])

Unnamed: 0,DataType
image_url,object
title,object
description,object
url,object
publication_date,object
last_published_date,"datetime64[ns, pytz.FixedOffset(-240)]"
category,object
is_breaking,bool
is_live,bool
duration,float64


### Checking Missing Data <a class="anchor" id="missing_data"></a>

In [175]:
## lets check if there are any null values in the data
data.isnull().mean()

image_url              0.000000
title                  0.000000
description            0.000503
url                    0.000000
publication_date       0.000000
last_published_date    0.000000
category               0.000000
is_breaking            0.000000
is_live                0.000000
duration               1.000000
authors                0.000000
text                   0.000754
published_day          0.000000
published_month        0.000000
dtype: float64

##### Notes
* The `duration` column has `100%` of its data missing, which makes sense because this is text content. We can safely drop this column. 
* Both `description` and `text` have less than `1%` missing data. We don't care too much about `description` since we are not going to use it for our models, but for missing `text` we have 2 options: 
     * Drop the rows with missing data; since it's less than `1%`, we won't be losing too much data. 
     * Try to scrape the missing data again. 
* I am going to start with `Option 2` and try to scrape the missing data. If that doesn't work, we can drop the rows. 
* For `description` we can just add a placeholder "Not Available" string. 
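The missing-data check can also report absolute row counts next to the fractions, which makes a decision like "drop or re-scrape" easier to judge. This is a minimal sketch on a hypothetical toy frame (`toy` and `missing` are illustrative names; only the column names come from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names follow the notebook.
toy = pd.DataFrame({
    "description": ["a", None, "c", "d"],
    "text": ["x", "y", None, "z"],
    "duration": [np.nan, np.nan, np.nan, np.nan],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "missing_rows": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)
print(missing)
```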

In [176]:
## using fillna to fill the null values in description
## (assigning the result back avoids relying on inplace=True on a selected column)
data['description'] = data['description'].fillna('Not Available')

In [177]:
## lets try and check the URL for missing data
data[data["text"].isnull()]["url"].unique()

array(['https://www.foxnews.com/politics/cartoons-slideshow',
       'https://www.foxnews.com/politics/photos-president-trump-family-gather-funeral-ivana-trump',
       'https://www.foxnews.com/politics/supreme-court-overturns-roe-v-wade-photos-protesters-crowds-outside'],
      dtype=object)

##### Notes
* I checked these URLs manually, and they are image slideshows with no article text, so it should be OK to drop the rows. 

### Dropping Missing Data <a class="anchor" id="dropping_missing_data"></a>

In [178]:
## lets check the dropping of the null values - ignoring inplace for now
data.dropna(subset=["text"]).isnull().mean()

## lets check the shape of the data
data.dropna(subset=["text"]).shape

(39757, 14)

##### Notes
* So we'll lose ~30 rows, which raises the question of why 30 rows share only 3 URLs.  

In [179]:
## drop rows in place
data.dropna(subset=["text"], inplace=True)
data.shape

(39757, 14)

### Checking Duplicates <a class="anchor" id="checking_duplicates"></a>
* Earlier we saw ~30 rows sharing just 3 URLs. That shouldn't happen, since the assumption is that each row is a unique news article. Let's confirm that. 

In [180]:
## lets check for duplicate urls
data["url"].unique().shape

(3972,)

##### Notes
* We only have `3972` unique URLs, which means our data is about 10 times smaller than expected. 
* Let's investigate some more.


In [181]:
data["url"].unique()

array(['https://www.foxnews.com/politics/hassan-bolduc-trade-fire-final-showdown-after-gop-nominee-comes-under-attack-arriving-debate',
       'https://www.foxnews.com/politics/biden-speech',
       'https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order',
       ...,
       'https://www.foxnews.com/politics/florida-desantis-campaign-warns-donations-pac-dont-benefit-reelection',
       'https://www.foxnews.com/politics/west-point-cadets-taught-critical-race-theory-addressing-whiteness-docs-show',
       'https://www.foxnews.com/politics/cia-officer-turned-democratic-congresswoman-spanberger-emphasizes-proud-american'],
      dtype=object)

In [182]:
len(data.query("url=='https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order'"))

9

##### Notes
* It looks like we have a lot of duplicates, most probably introduced by our web scraper. 
* Let's try to remove the duplicates and see how it affects our dataset size. 
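Before dropping rows by `url` alone, it can help to confirm that the repeated rows really are identical copies. The sketch below uses a hypothetical toy frame (`toy`, `copies_per_url` and `conflicting` are illustrative names) to count copies per URL and to flag any URL whose duplicates carry different `text`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the scraped data.
toy = pd.DataFrame({
    "url": ["a", "b", "a", "c", "a", "b"],
    "text": ["t1", "t2", "t1", "t3", "t1", "t2"],
})

# How many rows exist for each URL.
copies_per_url = toy["url"].value_counts()

# Rows that would be removed by drop_duplicates(subset=["url"]).
n_extra_rows = int(toy.duplicated(subset=["url"]).sum())

# URLs whose duplicate rows disagree on the article text.
texts_per_url = toy.groupby("url")["text"].nunique()
conflicting = texts_per_url[texts_per_url > 1]

print(copies_per_url)
print(n_extra_rows, "rows would be dropped;", len(conflicting), "URLs have conflicting text")
```

If `conflicting` is empty, keeping the first row per URL loses no information.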

### Removing Duplicates <a class="anchor" id="removing_duplicates"></a>

In [183]:
## lets check dataset size after removing for duplicate urls
data.drop_duplicates(subset=["url"]).shape


(3972, 14)

##### Notes
* After removing duplicates we'll still have `3972` records, which is a lot less than I wanted but might be enough for topic labelling. We can start with these records and, once we have the steps and process in place, try to scrape more data. 

In [184]:
## dropping duplicates in place
data.drop_duplicates(subset=["url"], inplace=True)
data.shape

(3972, 14)

### Removing Unused Columns <a class="anchor" id="removing_unused_columns"></a>
* So columns like `category`, `is_live`, `is_breaking` and `duration` are not useful to us because, 
     * `category`: Is just `json` object with `relative url` and one word `category` for the news article. 
     * `is_live` is boolean for whether the news is live or not, not applicable to text news content, all values are false. 
     * `is_breaking` is boolean for whether the news is breaking news or not, not applicable for our use case and all values are false. 
     * `duration` is video duration, again not applicable to our use case. 
* I also think the `image_url` and `publication_date` columns are not useful. `image_url` is just a link to the image used in the article, and `publication_date` is `last_published_date` in local timezone. 
* Lets drop these columns to further reduce the dataset size. 


In [185]:
data.drop(["is_live", "is_breaking", "duration", "category", "image_url", "publication_date"], axis=1, inplace=True)

In [186]:
data.shape

(3972, 8)

### Feature Engineering Authors <a class="anchor" id="feature_engineering_authors"></a>
* Right now the `authors` column is a `json`-like array of author names. We can extract the names from the array and save them as a string.

In [187]:
data.head()["authors"]

0     [{'name': 'Paul Steinhauser'}]
1           [{'name': 'Haris Alic'}]
2           [{'name': 'Adam Sabes'}]
9        [{'name': 'Bradford Betz'}]
22       [{'name': 'Bradford Betz'}]
Name: authors, dtype: object

##### Notes
* So the assumption here is that each article has just one author, lets first confirm that by adding new column `num_authors` 

In [188]:
## using ast to convert the string representation into a list
import ast

## helper function to count the number of authors
def find_num_author(authorArrayString):
   # the authors column stores a Python-style list literal, e.g. "[{'name': 'Adam Shaw'}]"
   authorArray = ast.literal_eval(authorArrayString)
   return len(authorArray)

## lets apply the function to the data
data["num_authors"] = data["authors"].apply(find_num_author)

## lets check how many articles have more than 1 author
data[data["num_authors"] > 1].shape

(498, 9)
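A short illustration of why `ast.literal_eval` is used here rather than `json.loads`: the stored strings use single quotes, which is a valid Python literal but not valid JSON. The `raw` value below is copied from the notebook output; the other names are illustrative.

```python
import ast
import json

# One value from the `authors` column, as it appears in the CSV.
raw = "[{'name': 'Adam Shaw'}, {'name': 'Bill Melugin'}]"

# json.loads rejects single-quoted strings.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates the Python-style literal.
authors = ast.literal_eval(raw)
names = [author["name"] for author in authors]

print(json_ok, names)
```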

In [189]:
## lets check the distribution of num_authors
data["num_authors"].value_counts()

1    3373
2     434
0     101
3      53
4      10
5       1
Name: num_authors, dtype: int64

In [190]:
## just to confirm lets also check the values there. 
data[data["num_authors"] > 1]["authors"]

75       [{'name': 'Adam Shaw'}, {'name': 'Bill Melugin'}]
91       [{'name': 'Aubrie Spady'}, {'name': 'Sophia Sl...
107      [{'name': 'Kyle Morris'}, {'name': 'Sophia Sla...
181      [{'name': 'Paul Steinhauser'}, {'name': 'Brook...
291      [{'name': 'Timothy Nerozzi'}, {'name': 'Courtn...
                               ...                        
39476    [{'name': 'Brandon Gillespie'}, {'name': 'Thom...
39638    [{'name': 'Lisa Bennatan'}, {'name': 'Megan My...
39699    [{'name': 'Lisa Bennatan'}, {'name': 'Jon Raas...
39726    [{'name': 'Chad Pergram'}, {'name': 'Paul Best'}]
39786    [{'name': 'Paul Steinhauser'}, {'name': 'Andre...
Name: authors, Length: 498, dtype: object

##### Notes
* So we have ~500 records with more than 1 author; our assumption was clearly wrong. 
* There are ~100 records with no author; we'll label those as `Unknown`. 
* To feature engineer this, we can create a comma-separated string of authors, with the spaces inside each author name replaced by underscores (for example `Adam_Shaw,Bill_Melugin`). 
* That way searching and analyzing will be easier. 

In [191]:
## helper function to convert author object to string
def find_author(authorArrayString):
   authorArray = ast.literal_eval(authorArrayString)
   if(len(authorArray) == 0):
      return "Unknown"
   authors = ["_".join(author["name"].split(" ")) for author in authorArray]
   return ",".join(authors)
   

## lets apply the function to the data
data["author"] = data["authors"].apply(find_author)

### Feature Engineering Word Count <a class="anchor" id="feature_engineering_wordcounts"></a>
* Even though it might not be 100% accurate lets create a feature with word count for EDA


In [192]:
## helper function to create word count from text column
def word_count(text):
   return len(text.split(" "))

data["word_count"] = data["text"].apply(word_count)
data["word_count"].describe()

count    3972.000000
mean      610.804884
std       340.785100
min        31.000000
25%       399.000000
50%       534.000000
75%       734.000000
max      9672.000000
Name: word_count, dtype: float64

##### Notes
* Interestingly, there is an outlier article with `9672` words! It would be interesting to check it out. 
* There is also an article with just `31` words; we'll need to check such articles out in the EDA.
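One reason the count "might not be 100% accurate" is that `split(" ")` produces empty strings for repeated spaces and does not split on newlines. A hedged sketch of a whitespace-aware alternative, using a hypothetical `texts` series, compares the two approaches:

```python
import pandas as pd

# Hypothetical texts with a double space, a newline, and padding.
texts = pd.Series(["one  two three", "line one\nline two", " padded text "])

# Approach used in the notebook: split on single spaces only.
naive = texts.apply(lambda t: len(t.split(" ")))

# Alternative: split on any run of whitespace (vectorized).
robust = texts.str.split().str.len()

print(pd.DataFrame({"naive": naive, "robust": robust}))
```

The two columns differ on every row in this toy example, so the summary statistics for `word_count` may shift slightly if the whitespace-aware version is adopted.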

### Feature Engineering Line Count <a class="anchor" id="feature_engineering_linecounts"></a>

* Again even though it might not be 100% accurate lets create a feature with line count for EDA


In [193]:
## helper function to create line count from text column
def create_line_count(text):
   return len(text.split("."))

data["line_count"] = data["text"].apply(create_line_count)
data["line_count"].describe()

count    3972.000000
mean       31.652064
std        19.232276
min         3.000000
25%        21.000000
50%        28.000000
75%        37.000000
max       647.000000
Name: line_count, dtype: float64

#### Notes
* As expected after looking at the word counts, the min and max articles seem to be outliers with `3` and `647` lines respectively. 
* We'll need to investigate that. 
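Splitting on `"."` also counts decimal points and the empty string after the final period. As a rough sketch (not the notebook's method, and still imperfect for abbreviations such as "U.S."), a regex that splits only after sentence-ending punctuation followed by whitespace gives a closer sentence count. Both function names are hypothetical:

```python
import re

def naive_line_count(text: str) -> int:
    """Count used in the notebook: pieces between periods."""
    return len(text.split("."))

def sentence_count(text: str) -> int:
    """Split after ., ! or ? only when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])

example = "Prices rose 3.5 percent. Officials agreed."
print(naive_line_count(example), sentence_count(example))
```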

## Exploratory Data Analysis <a class="anchor" id="eda"></a>
* Lets try and do some EDA to get some stats on following
     * Average number of articles per day/week/month
     * Total number of articles per month. 
     * Average number of articles per author
     * Average number of words an author writes
     * Average number of articles an author writes per day/week/month. 
     * Analyze the outliers


### Number of Daily Articles <a class="anchor" id="num_daily_articles"></a>

In [194]:
## lets group by published_day and published_month
daily_numbers = pd.DataFrame({'count' : data.groupby([ "published_month", "published_day"])["text"].count()}).reset_index()

## plotting line chart for daily numbers
for mon in [6,7,8,9,10,11]:
   fig = px.line(daily_numbers.query("published_month == @mon"), x="published_day", y="count", title=f"Daily Numbers for {mon}")
   fig.show()

### Average Daily Articles Every Month <a class="anchor" id="avg_daily_articles"></a>

In [195]:
## calculating the average number of articles per day for each month
average_monthly_articles = pd.DataFrame({"average": daily_numbers.groupby("published_month")["count"].mean()}).reset_index()
fig = px.bar(average_monthly_articles, x='published_month', y='average', title='Average number of articles published per day, by month')
fig.show()

##### Notes
* There is a spike in the average number of daily articles in `November`. This could be because of the `US midterm` elections that took place on `11/08`, or because we only have a few days of data from `November`, which would skew the average. Let's check that. 

In [196]:
## checking the latest date in the data
data["last_published_date"].max()

Timestamp('2022-11-02 22:47:00-0400', tz='pytz.FixedOffset(-240)')

In [197]:
## checking the earliest date in the data
data["last_published_date"].min()

Timestamp('2022-06-20 13:00:12-0400', tz='pytz.FixedOffset(-240)')

##### Notes
* So as expected we only have data from 2 days of `November`.
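The same check can be made for every month at once by counting how many distinct days each month actually covers. This is a minimal sketch on a hypothetical toy frame (`toy` and `coverage` are illustrative names; the column names follow the notebook):

```python
import pandas as pd

# Hypothetical toy data: one row per article.
toy = pd.DataFrame({
    "published_month": [10, 10, 10, 11, 11, 11, 11],
    "published_day":   [1, 1, 2, 1, 1, 2, 2],
})

# Articles and number of distinct days with data, per month.
coverage = toy.groupby("published_month").agg(
    total_articles=("published_day", "size"),
    days_with_data=("published_day", "nunique"),
)

# Average per covered day, to read alongside the day coverage.
coverage["avg_per_covered_day"] = coverage["total_articles"] / coverage["days_with_data"]
print(coverage)
```

A month with very few `days_with_data` signals that its average rests on a small sample.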

### Total Monthly Articles <a class="anchor" id="total_monthly_articles"></a>

In [198]:
## calculating the total number of articles per month
total_monthly_articles = pd.DataFrame({"total": daily_numbers.groupby("published_month")["count"].sum()}).reset_index()
fig = px.bar(total_monthly_articles, x='published_month', y='total', title='Total number of articles published per month')
fig.show()

##### Notes 
* This difference in data makes sense since we only have `2` days of data from `November` 

### Number of Articles Per Author <a class="anchor" id="num_articles_per_author"></a>

##### Notes
* To get the number of articles per author, we'll first need to create a unique list of authors. 

In [199]:
## creating a unique list of authors
unique_author_list = []
for author in list(data["author"]):
     author_lst = author.split(",")
     for auth in author_lst:
          if auth not in unique_author_list:
               unique_author_list.append(auth)
               
               
## printing the length of the unique author list               
len(unique_author_list)

149

In [200]:
## lets create a list of dicts with author, number of articles, word count and line count
author_attributes = []
for author in unique_author_list:
     ## substring match on the comma-separated author string; regex=False treats the name literally
     author_rows = data[data["author"].str.contains(author, regex=False)]
     author_attributes.append({
          "author": author,
          "num_of_articles": author_rows["author"].count(),
          "word_count": author_rows["word_count"].sum(),
          "line_count": author_rows["line_count"].sum()
     })

author_attributes

[{'author': 'Paul_Steinhauser',
  'num_of_articles': 245,
  'word_count': 198212,
  'line_count': 9276},
 {'author': 'Haris_Alic',
  'num_of_articles': 163,
  'word_count': 97488,
  'line_count': 5408},
 {'author': 'Adam_Sabes',
  'num_of_articles': 115,
  'word_count': 50526,
  'line_count': 2825},
 {'author': 'Bradford_Betz',
  'num_of_articles': 116,
  'word_count': 51746,
  'line_count': 2762},
 {'author': 'Brooke_Singman',
  'num_of_articles': 205,
  'word_count': 164635,
  'line_count': 7811},
 {'author': 'Louis_Casiano',
  'num_of_articles': 35,
  'word_count': 13720,
  'line_count': 755},
 {'author': 'Adam_Shaw',
  'num_of_articles': 210,
  'word_count': 151282,
  'line_count': 7620},
 {'author': 'Bill_Melugin',
  'num_of_articles': 26,
  'word_count': 15243,
  'line_count': 825},
 {'author': 'Aubrie_Spady',
  'num_of_articles': 171,
  'word_count': 91764,
  'line_count': 5136},
 {'author': 'Sophia_Slacik',
  'num_of_articles': 18,
  'word_count': 10843,
  'line_count': 677},
 

In [201]:
## Converting the list to dataframe
author_attributes_df = pd.DataFrame(author_attributes)
author_attributes_df["author"] = author_attributes_df["author"].str.replace("_", " ")
author_attributes_df["avg_word_count"] = author_attributes_df["word_count"] / author_attributes_df["num_of_articles"]
author_attributes_df["avg_line_count"] = author_attributes_df["line_count"] / author_attributes_df["num_of_articles"]
author_attributes_df

Unnamed: 0,author,num_of_articles,word_count,line_count,avg_word_count,avg_line_count
0,Paul Steinhauser,245,198212,9276,809.028571,37.861224
1,Haris Alic,163,97488,5408,598.085890,33.177914
2,Adam Sabes,115,50526,2825,439.356522,24.565217
3,Bradford Betz,116,51746,2762,446.086207,23.810345
4,Brooke Singman,205,164635,7811,803.097561,38.102439
...,...,...,...,...,...,...
144,Nick Kalman,1,428,25,428.000000,25.000000
145,Breck Dumas,1,236,15,236.000000,15.000000
146,Sam Dorman,1,1081,63,1081.000000,63.000000
147,Mitch Shin,1,1388,62,1388.000000,62.000000


In [202]:
author_attributes_df.describe()


Unnamed: 0,num_of_articles,word_count,line_count,avg_word_count,avg_line_count
count,149.0,149.0,149.0,149.0,149.0
mean,30.510067,18831.100671,976.187919,663.226064,35.813674
std,54.101708,35629.364111,1789.452651,285.794882,16.564549
min,1.0,208.0,13.0,203.0,11.5
25%,1.0,1040.0,58.0,496.7625,24.75
50%,6.0,3883.0,188.0,598.08589,31.5
75%,27.0,15749.0,806.0,781.142857,40.0
max,245.0,198212.0,9276.0,1900.0,112.0


##### Notes
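* The maximum number of articles by a single author is `245`, written by `Paul Steinhauser` over the roughly four and a half months covered by the data (`2022-06-20` to `2022-11-02`).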

In [203]:
## lets look at top 10 authors by number of articles
author_attributes_df.sort_values("num_of_articles", ascending=False).head(10)

Unnamed: 0,author,num_of_articles,word_count,line_count,avg_word_count,avg_line_count
0,Paul Steinhauser,245,198212,9276,809.028571,37.861224
15,Ronn Blitzer,212,102770,5350,484.764151,25.235849
6,Adam Shaw,210,151282,7620,720.390476,36.285714
4,Brooke Singman,205,164635,7811,803.097561,38.102439
11,Kyle Morris,197,155708,7173,790.395939,36.411168
41,Houston Keene,180,121940,5903,677.444444,32.794444
18,Timothy Nerozzi,175,67389,3775,385.08,21.571429
25,Tyler Olson,172,111648,6535,649.116279,37.994186
8,Aubrie Spady,171,91764,5136,536.631579,30.035088
32,Jessica Chasmar,169,117943,5731,697.887574,33.911243


##### Notes
* This information doesn't give any insights right now, but after topic labelling and sentiment analysis, analyzing stats of authors would be interesting. 
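The per-author statistics in this section rely on substring matching against the comma-separated `author` string, which could over-count if one author's name is contained in another's. A possible alternative, sketched here on a hypothetical toy frame (`toy` and `per_author` are illustrative names), splits the string into a list and uses `explode` so that each author is matched exactly:

```python
import pandas as pd

# Hypothetical toy frame using the notebook's author encoding.
toy = pd.DataFrame({
    "author": ["Adam_Shaw,Bill_Melugin", "Adam_Shaw", "Unknown"],
    "word_count": [600, 400, 100],
    "line_count": [30, 20, 5],
})

# One row per (article, author) pair, then aggregate per author.
per_author = (
    toy.assign(author=toy["author"].str.split(","))
       .explode("author")
       .groupby("author", sort=False)
       .agg(
           num_of_articles=("word_count", "size"),
           word_count=("word_count", "sum"),
           line_count=("line_count", "sum"),
       )
       .reset_index()
)
per_author["avg_word_count"] = per_author["word_count"] / per_author["num_of_articles"]
print(per_author)
```

As in the notebook, a co-authored article counts fully toward each of its authors.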

### Number of Daily Articles per Author <a class="anchor" id="num_daily_articles_per_author"></a>


In [204]:
## lets create a function that returns daily groupby object for author
def get_daily_numbers(author):
   return pd.DataFrame({'count' : data[data["author"].str.contains(author)].groupby([ "published_month", "published_day"])["text"].count()}).reset_index()


## lets create a function that returns monthly groupby object for author
def get_monthly_numbers(author):
     return pd.DataFrame({'count' : data[data["author"].str.contains(author)].groupby([ "published_month"])["text"].count()}).reset_index()


In [205]:
## lets create bar charts for the top 10 authors by number of articles
def create_bar_chart(author):
     monthly_numbers = get_monthly_numbers(author)
     fig = px.bar(monthly_numbers, x='published_month', y='count', title=f"Monthly Numbers for {' '.join(author.split('_'))}");
     fig.show()
     
for author in list(author_attributes_df.sort_values("num_of_articles", ascending=False).head(10)["author"]):
     create_bar_chart("_".join(author.split(" ")))

##### Notes
* We can reuse these functions in our application for more custom, input-based analytics and dashboards. 

### Analyzing The Outliers <a class="anchor" id="outlier_analysis"></a>


##### Notes
* We found a few outliers among the articles with the maximum and minimum word and line counts. Although these shouldn't affect `topic modeling` and `sentiment analysis`, it would be interesting to check them out. 

In [206]:
## lets start by sorting the dataframe by number of words
data.loc[:, ["url","word_count"]].sort_values(by="word_count", ascending=False).head(10)

Unnamed: 0,url,word_count
5751,https://www.foxnews.com/politics/fox-news-sund...,9672
30949,https://www.foxnews.com/politics/supreme-court...,3790
25781,https://www.foxnews.com/politics/trump-targete...,3326
29070,https://www.foxnews.com/politics/democrats-res...,3002
25655,https://www.foxnews.com/politics/hunter-biden-...,2374
986,https://www.foxnews.com/politics/fox-news-powe...,2311
14345,https://www.foxnews.com/politics/fox-news-powe...,2246
16503,https://www.foxnews.com/politics/republicans-s...,2224
2921,https://www.foxnews.com/politics/hunter-biden-...,2212
16723,https://www.foxnews.com/politics/2022-primary-...,2176


##### Notes
* The record with the maximum word count is a transcript of the `Fox Sunday Show`. I'm not sure how it ended up among the news articles, but I think it should be OK to keep. 
* Let's also look at the articles with the fewest words. 

In [207]:
data.loc[:, ["url","word_count"]].sort_values(by="word_count", ascending=True).head(10)

Unnamed: 0,url,word_count
22660,https://www.foxnews.com/politics/oklahoma-2022...,31
22664,https://www.foxnews.com/politics/new-york-2022...,31
22656,https://www.foxnews.com/politics/florida-2022-...,32
38545,https://www.foxnews.com/politics/supreme-court...,43
16671,https://www.foxnews.com/politics/new-hampshire...,45
16662,https://www.foxnews.com/politics/rhode-island-...,45
20082,https://www.foxnews.com/politics/doj-filing-tr...,51
21181,https://www.foxnews.com/politics/trump-raid-se...,53
18659,https://www.foxnews.com/politics/massachusetts...,57
25175,https://www.foxnews.com/politics/marjorie-tayl...,66


##### Notes
* I visually inspected a few of these articles, and they appear to be state-wise midterm election results pages. They seem auto-generated, with just a few lines of text and some graphs. 
* It should be OK to ignore these, but I will keep them for now. 
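The outliers above were found by sorting. If a repeatable rule is wanted later, one common option is the 1.5 × IQR fence. This is a swapped-in technique, not something the notebook uses, and the sample values below are a hypothetical subset chosen around the figures reported by `describe()`:

```python
import pandas as pd

# Hypothetical sample of word counts.
word_counts = pd.Series([31, 399, 450, 534, 600, 734, 9672], name="word_count")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = word_counts.quantile(0.25)
q3 = word_counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = word_counts[(word_counts < lower) | (word_counts > upper)]
print(lower, upper, list(outliers))
```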

## Writing Data <a class="anchor" id="writing"></a>

##### Notes
* I've decided to write this duplicate-free, feature-engineered data to a separate file so that we can use it directly for `topic labeling` without duplicating effort. 

In [208]:
## keeping this line commented so that it does not run again and accidentally overwrite the file with corrupt data. 
# data.to_csv("./data/fox_data_clean.csv", index=False)
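Instead of commenting the line out, the write could be guarded so it only happens when the target file does not exist yet. This is a minimal sketch; `write_once` is a hypothetical helper, and the demonstration writes into a temporary directory rather than `./data`:

```python
from pathlib import Path
import tempfile

import pandas as pd

def write_once(df: pd.DataFrame, path) -> bool:
    """Write df to path only if the file is not already there."""
    path = Path(path)
    if path.exists():
        return False  # leave the existing file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return True

# Demonstration in a temporary directory with a hypothetical toy frame.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "fox_data_clean.csv"
    toy = pd.DataFrame({"url": ["a"], "text": ["t"]})
    first = write_once(toy, target)
    second = write_once(toy, target)

print(first, second)
```

With such a guard the cell can be re-run safely, and deleting the file becomes the explicit step needed to regenerate it.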