# Data Extraction and Cleaning
<br>
Date: 01/22/2021

## About this Notebook
This notebook cleans the raw article data for use by the model. <br><br>

### Data
https://www.kaggle.com/snapcrack/all-the-news

## Administrative Activity

### Import Packages

In [19]:
import os, json, sys

import pandas as pd
import numpy as np

from itertools import chain

from time import time #duration

#NLTK
from nltk.corpus import stopwords  # stopwords
from nltk.stem.porter import PorterStemmer #Stemming
from nltk.stem import WordNetLemmatizer # Lemmatization
import re, string #Text cleaning

#Text cleaning
from text_cleaner import text_cleaner

#HTML Display
from html_functions import ez_display as d

In [3]:
d("<b>Current Python Version Used:</b> Python " +  sys.version.split('(')[0].strip())

In [9]:
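# Build the path with os.path.join so it works on any OS and avoids the invalid "\R" escape
raw_data = os.path.join("data", "RAW")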

## Pulling Data

In [11]:
# Read every file in the raw data folder into its own DataFrame
csv = []
for file in os.listdir(raw_data):
    data = pd.read_csv(os.path.join(raw_data,file))
    csv.append(data)

In [30]:
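# Stack the per-file DataFrames; reset_index() keeps the old per-file index as a new first column
df = pd.concat(csv).reset_index()
# Drop the old index column and the "Unnamed: 0" column read from the CSV files
df.drop(df.columns[0:2],axis=1,inplace=True)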

In [31]:
d('<b>Dataframe Shape:</b> '+str(df.shape))

In [32]:
df.head()

| | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | | WASHINGTON — Congressional Republicans have... |
| 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | | After the bullet shells get counted, the blood... |
| 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | | Death may be the great equalizer, but it isn’t... |
| 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | | SEOUL, South Korea — North Korea’s leader, ... |


## Cleaning Text

#### Cleaning
- __Stopwords:__ Drops common terms.
- __Lemmatization:__ Removes inflectional endings only, returning the base or dictionary form of a word.
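
The two steps listed above can be sketched as a single pass over whitespace-separated tokens. This is only an illustration, not the code inside `text_cleaner`: the function name `drop_stopwords_and_lemmatize`, the toy stopword set, and the toy lemma table are all hypothetical stand-ins chosen so the example runs without downloading any NLTK corpora.

```python
from typing import Callable, Iterable


def drop_stopwords_and_lemmatize(
    text: str,
    stopwords: Iterable[str],
    lemmatize: Callable[[str], str],
) -> str:
    """Remove stopwords, then reduce each remaining token to its base form.

    Hypothetical helper for illustration; it assumes the text is already
    lowercased and free of punctuation.
    """
    stopword_set = set(stopwords)  # set membership is faster than a list scan
    tokens = text.split()
    kept = [lemmatize(tok) for tok in tokens if tok not in stopword_set]
    return " ".join(kept)


# Toy stand-ins (assumptions) for the NLTK stopword list and lemmatizer.
TOY_STOPWORDS = {"the", "are", "a", "of"}
TOY_LEMMAS = {"republicans": "republican", "subsidies": "subsidy"}

print(drop_stopwords_and_lemmatize(
    "the republicans are worried about subsidies",
    TOY_STOPWORDS,
    lambda tok: TOY_LEMMAS.get(tok, tok),
))
# prints: republican worried about subsidy
```

With NLTK installed and its `stopwords` and `wordnet` corpora downloaded, the same helper could be called with `stopwords.words('english')` and `WordNetLemmatizer().lemmatize` in place of the toy arguments.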

In [18]:
STOPWORDS = stopwords.words('english')
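# Strip punctuation from the stopwords (e.g. "don't" -> "dont") so they can match text whose punctuation has already been removed
STOPWORDS = [word.translate(str.maketrans('','',string.punctuation)) for word in STOPWORDS]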
LEMMING =  WordNetLemmatizer()

### Defining Function to Clean
- URLs, emails, and duplicate spaces within the text bring no additional value to the analysis.
- Special characters, punctuation, and numbers within the text bring no additional value to the analysis.
- Non-ASCII characters can cause problems in the analysis.
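
As a rough sketch of the rules above, the function below removes each kind of noise with the standard `re` module. The `text_cleaner` module used in this notebook is not shown here, so the name `simple_clean`, the regular expressions, and the lowercasing step are assumptions for illustration rather than the actual implementation.

```python
import re

# Hypothetical patterns; the real text_cleaner module may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def simple_clean(text: str) -> str:
    """Lowercase the text and strip URLs, emails, non-ASCII characters,
    punctuation, digits, and repeated whitespace."""
    text = text.lower()  # assumption: the cleaned output is lowercase
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Drop non-ASCII characters such as curly quotes.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace anything that is not a lowercase letter or whitespace.
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse runs of whitespace into a single space.
    return WHITESPACE_RE.sub(" ", text).strip()


print(simple_clean("Visit https://example.com or mail me@example.com 42 times!!"))
# prints: visit or mail times
```

URLs and emails are removed before punctuation so that their dots and slashes do not leave stray word fragments behind.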

In [35]:
%%time
df['simple_clean'] = text_cleaner(df['content'])

Wall time: 58.2 s


In [40]:
%%time
df['stopwords_clean'] =  text_cleaner(df['simple_clean'],
                          SIMPLE = False,
                          STOPWORDS = STOPWORDS)

Wall time: 2min 6s


In [41]:
%%time
df['lemming_clean'] =  text_cleaner(df['stopwords_clean'],
                          SIMPLE = False,
                          LEMMING = LEMMING)

Wall time: 2min 33s


## Saving DataFrames
To use feather, make sure pyarrow is installed: "pip install pyarrow"

In [42]:
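fn = "articles.feather"
# Build the output path portably; makedirs with exist_ok=True does not fail if the folder already exists
cleaned_dir = os.path.join("data", "cleaned")
os.makedirs(cleaned_dir, exist_ok=True)
df.to_feather(path=os.path.join(cleaned_dir, fn))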

In [47]:
df['lemming_clean'][0]

'washington congressional republican new fear come health care lawsuit obama administration might win incoming trump administration could choose longer defend executive branch suit challenge administration authority spend billion dollar health insurance subsidy american handing house republican big victory issue sudden loss disputed subsidy could conceivably cause health care program implode leaving million people without access health insurance republican prepared replacement could lead chaos insurance market spur political backlash republican gain full control government stave outcome republican could find awkward position appropriating huge sum temporarily prop obama health care law angering conservative voter demanding end law year another twist donald j trump administration worried preserving executive branch prerogative could choose fight republican ally house central question dispute eager avoid ugly political pileup republican capitol hill trump transition team gaming handle la