# Data Extraction and Cleaning
<br>
Date: 01/22/2021

## About this Notebook
This notebook cleans the raw news-article data so that it can be used by the model. <br><br>

### Data
https://www.kaggle.com/snapcrack/all-the-news

## Administrative Activity

### Import Packages

In [2]:
import os, json, sys

import pandas as pd
import numpy as np

from time import time #duration

# NLTK
from nltk.corpus import stopwords  # stopwords
from nltk.stem import WordNetLemmatizer # Lemmatization
import re, string #Text cleaning

#Custom Code
from bin.text_cleaner import text_cleaner
from bin.html_functions import ez_display as d

In [3]:
d("<b>Current Python Version Used:</b> Python " +  sys.version.split('(')[0].strip())

In [4]:
data_folder = "data"
raw_data_folder = os.path.join(data_folder,'RAW')
cleaned_data_folder = os.path.join(data_folder,'cleaned')
cleaned_data_filename = "articles.feather"
cleaned_data_filepath = os.path.join(cleaned_data_folder,cleaned_data_filename)
article_filenames = ['articles1.csv', 'articles2.csv', 'articles3.csv']

## Creating Directories if not available

In [5]:
# exist_ok=True makes each call a no-op when the folder already exists;
# makedirs also creates the parent data folder if it is missing.
os.makedirs(raw_data_folder, exist_ok=True)
os.makedirs(cleaned_data_folder, exist_ok=True)

## Download dataset

In [6]:
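# Check that every expected CSV is already in the raw data folder
available_files = set(os.listdir(raw_data_folder))
check = all(file in available_files for file in article_filenames)
if not check:
    # The dataset has to be downloaded manually, so show instructions
    url = "https://www.kaggle.com/snapcrack/all-the-news/download"
    d(f'1. Download data from {url}')
    d(f'2. Place here: {raw_data_folder}')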

## Pulling Data

In [7]:
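# Read only the expected article files, in a fixed order, so that stray
# files in the raw folder (e.g. the downloaded archive) are not parsed.
csv = []
for file in article_filenames:
    data = pd.read_csv(os.path.join(raw_data_folder, file))
    csv.append(data)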

In [8]:
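# ignore_index=True builds a fresh 0..n-1 index instead of repeating each file's own
df = pd.concat(csv, ignore_index=True)
# Drop the first column: the row index that was saved into the CSV files
df.drop(df.columns[0], axis=1, inplace=True)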

In [9]:
d('<b>Dataframe Shape:</b> '+str(df.shape))

In [10]:
df.head()

| | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | NaN | WASHINGTON — Congressional Republicans have... |
| 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | NaN | After the bullet shells get counted, the blood... |
| 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | NaN | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | NaN | Death may be the great equalizer, but it isn’t... |
| 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | NaN | SEOUL, South Korea — North Korea’s leader, ... |


## Cleaning Text

#### Cleaning
- __Stopwords:__ Drops common terms (e.g. "the", "and") that carry little meaning on their own.
- __Lemmatization (`LEMMING`):__ Removes inflectional endings only, returning the base or dictionary form of a word.
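
As a rough illustration of the two steps above, the sketch below removes stopwords and then maps words to a base form. To stay self-contained it uses a tiny hand-written stopword set and lookup table (`TOY_STOPWORDS` and `TOY_LEMMAS`, both hypothetical) instead of NLTK's `stopwords` corpus and `WordNetLemmatizer`. It is not the implementation inside `text_cleaner`.

```python
# Toy stand-ins (assumptions for illustration only); the notebook itself
# relies on nltk.corpus.stopwords and nltk.stem.WordNetLemmatizer.
TOY_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were"}
TOY_LEMMAS = {"leaders": "leader", "cities": "city", "votes": "vote"}


def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def lemmatize(tokens):
    """Map each token to its base form when one is known, else keep it."""
    return [TOY_LEMMAS.get(t, t) for t in tokens]


tokens = "the leaders of the cities counted votes".split()
cleaned = lemmatize(drop_stopwords(tokens))
print(cleaned)  # ['leader', 'city', 'counted', 'vote']
```

With the NLTK corpora downloaded (`nltk.download('stopwords')` and `nltk.download('wordnet')`), the real counterparts would be `stopwords.words('english')` and `WordNetLemmatizer().lemmatize(word)`.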

In [11]:
LEMMING = WordNetLemmatizer()

### Defining Function to Clean
- URLs, emails, and duplicate spaces within the articles add no value to the analysis.
- Special characters, punctuation, and numbers within the articles add no value to the analysis.
- Non-ASCII characters can cause problems in the analysis.
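
The rules above could be approximated with a few regular expressions. The function below, `simple_clean_sketch`, is a hypothetical illustration; the actual `text_cleaner` from `bin.text_cleaner` is not shown in this notebook and may use different patterns or a different order of steps.

```python
import re

# Patterns are assumptions chosen for illustration, not the project's own.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
MULTISPACE_RE = re.compile(r"\s+")


def simple_clean_sketch(text):
    """Lowercase text and strip URLs, emails, non-ASCII, digits and punctuation."""
    # Drop non-ASCII characters such as curly quotes
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.lower()
    # Remove URLs and emails before punctuation is stripped,
    # otherwise their fragments would survive as ordinary words
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Replace digits, punctuation and other special characters with spaces
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse duplicate whitespace
    return MULTISPACE_RE.sub(" ", text).strip()


sample = "Email  me@example.com or visit https://example.com, it’s 100% free!"
print(simple_clean_sketch(sample))  # email or visit its free
```

Removing URLs and emails first matters: once punctuation is stripped, `example.com` would otherwise be left behind as the words `example` and `com`.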

In [12]:
%%time
df['simple_clean'] = text_cleaner(df['content'], STOPWORDS=False, YEARS=False, MIN_CHAR_LENGTH=False)

Wall time: 54.4 s


In [13]:
%%time
df['stopwords_clean'] = text_cleaner(df['simple_clean'],
                                     SIMPLE=False,
                                     YEARS=True,
                                     MIN_CHAR_LENGTH=3)

Wall time: 11min 31s


In [14]:
%%time
df['lemming_clean'] = text_cleaner(df['stopwords_clean'],
                                   SIMPLE=False,
                                   YEARS=False,
                                   MIN_CHAR_LENGTH=False,
                                   LEMMING=LEMMING)

Wall time: 9min 1s


## Saving DataFrames
Saving to Feather requires pyarrow. If it is not installed, run `pip install pyarrow` first.

In [15]:
df.to_feather(path=cleaned_data_filepath)
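
Since the save step depends on pyarrow, a small guard can surface a missing install before the long cleaning cells are run. The sketch below is one possible check; `feather_backend_available` is a hypothetical helper, not part of pandas or of this project.

```python
import importlib.util


def feather_backend_available():
    """Return True when the pyarrow package can be found for import."""
    return importlib.util.find_spec("pyarrow") is not None


if not feather_backend_available():
    print('pyarrow not found; install it with "pip install pyarrow"')
```

Running this near the top of the notebook avoids discovering the missing dependency only after the cleaning cells have finished.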