<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 2 Assignment 1*

Analyze a corpus of text using text visualization of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Stemming
- Lemmatization
- Custom stopword removal
- Frequency-based stopword removal

You are free to use any dataset you are interested in. Kaggle is a great place to start. Feel free to sample the data if the dataset is too large to handle in memory.
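
As a rough illustration of the last technique in the list, here is a minimal sketch of frequency-based stopword removal. It assumes the documents are already tokenized into lists of strings. The helper name `frequency_stopwords` and the `max_doc_ratio` threshold are hypothetical choices for this example, not part of the assignment.

```python
from collections import Counter


def frequency_stopwords(tokenized_docs, max_doc_ratio=0.8):
    """Return tokens that appear in more than max_doc_ratio of the documents.

    Hypothetical helper: the threshold is an assumption to tune per corpus.
    """
    doc_counts = Counter()
    for tokens in tokenized_docs:
        # Count each token once per document (document frequency)
        doc_counts.update(set(tokens))
    n_docs = len(tokenized_docs)
    return {tok for tok, n in doc_counts.items() if n / n_docs > max_doc_ratio}


# Toy corpus, invented for illustration only
docs = [
    "cocoa prices rose said the trader".split(),
    "wheat exports fell said the ministry".split(),
    "the bank said rates were unchanged".split(),
]

common = frequency_stopwords(docs, max_doc_ratio=0.8)
filtered = [[tok for tok in doc if tok not in common] for doc in docs]
print(sorted(common))
print(filtered[0])
```

Using document frequency rather than raw counts keeps one long article from deciding on its own which words count as stopwords. A lower threshold removes more words.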

In [1]:
# Start here 
"""
Import Statements
"""

# Base
from collections import Counter
import re

import pandas as pd

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns

# NLP Libraries
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")

In [4]:
import os

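def gather_data(filefolder):
    """Read every file in filefolder whose name ends in 'txt' and return the raw contents."""
    data = []
    for article in os.listdir(filefolder):
        path = os.path.join(filefolder, article)
        if path.endswith('txt'):
            # Files are opened in binary mode, so each entry is a bytes object
            with open(path, 'rb') as f:
                data.append(f.read())
    return data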

In [5]:
data = gather_data('./data/reuters21578')

In [6]:
data[0][:50]

b'adb-africa\nadb-asia\naibd\naid\nanrpc\nasean\natpc\nbis\n'

In [13]:
# Load in each data file (zfill pads out integers with leading zeros)
text_data = []
for index in range(22):
    filename = './data/reuters21578/reut2-{0}.sgm'.format(str(index).zfill(3))
    with open(filename, 'r', encoding = 'utf-8', errors = 'ignore') as infile:
        text_data.append(infile.read())
# Print first 300 characters of first file
print(text_data[0][:300])

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXC


In [15]:
# Separate each text file into articles
from bs4 import BeautifulSoup

articles = []
for textfile in text_data:
    # Parse text as html using beautiful soup
    parsed_text = BeautifulSoup(textfile, 'html.parser')
    # Extract article between <BODY> and </BODY> and convert to standard text. Add to list of articles
    articles += [article.get_text() for article in parsed_text.find_all('body')]
# print the first article as an example
print(articles[0])

Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export 

In [17]:
import string

# Convert each article to all lower case
articles = [article.lower() for article in articles]
# Strip all punctuation from each article
# This uses str.translate to map all punctuation to the empty string
table = str.maketrans('', '', string.punctuation)
articles = [article.translate(table) for article in articles]
# Replace each run of digits with the token 'num' using regular expressions
# (punctuation is already stripped, so a value such as 5.93 becomes a single 'num')
articles = [re.sub(r'\d+', 'num', article) for article in articles]
# Print the first article as a running example
print(articles[0])

showers continued throughout the week in
the bahia cocoa zone alleviating the drought since early
january and improving prospects for the coming temporao
although normal humidity levels have not been restored
comissaria smith said in its weekly review
    the dry period means the temporao will be late this year
    arrivals for the week ended february num were num bags
of num kilos making a cumulative total for the season of num
mln against num at the same stage last year again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures
    comissaria smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end with total bahia crop estimates
around num mln bags and sales standing at almost num mln there
are a few hundred thousand bags still in the hands of farmers
middlemen exporters and processors
    there are doubts as to how much of this cocoa would be fit
for export as shippers are
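
The order of the cleaning steps matters. Stripping punctuation before anything else fuses tokens such as "June/July" into one word. The sketch below reproduces the cell's steps on a made-up sentence and compares them with one possible alternative that replaces numbers first and maps punctuation to spaces. The names `normalize` and `normalize_spaced` and the number pattern are assumptions for illustration.

```python
import re
import string

# Same approach as the cell above: delete punctuation, then replace digit runs
DELETE_PUNCT = str.maketrans('', '', string.punctuation)


def normalize(text):
    text = text.lower().translate(DELETE_PUNCT)
    return re.sub(r'\d+', 'num', text)


# Alternative sketch: treat 155,221 or 5.93 as one number, then turn
# the remaining punctuation into spaces instead of deleting it
NUMBER = re.compile(r'\d+(?:[.,]\d+)*')
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def normalize_spaced(text):
    text = NUMBER.sub('num', text.lower())
    # split/join collapses the extra whitespace left behind by the mapping
    return ' '.join(text.translate(PUNCT_TO_SPACE).split())


# Invented sample sentence, loosely modeled on the article above
sample = "Arrivals were 155,221 bags of 60 kilos, a total of 5.93 mln in June/July."
print(normalize(sample))
print(normalize_spaced(sample))
```

The first version yields `junejuly` as a single token. The second keeps `june` and `july` separate, at the cost of also collapsing line breaks.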

In [19]:
import nltk

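# Create the stopword list (requires a one-time nltk.download('stopwords')) and convert it to a set for speed
# 'reuter' and the '\x03' end-of-text control character are added as custom stopwords for this dataset
stopwords = set(nltk.corpus.stopwords.words('english') + ['reuter', '\x03'])
# Split each article on whitespace and drop the stopwords; each article becomes a list of tokens
articles = [[word for word in article.split() if word not in stopwords] for article in articles]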
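
Because punctuation was stripped before this step, contractions in the text no longer match stopword entries that contain apostrophes. The sketch below shows the effect and one possible workaround: apply the same punctuation stripping to the stopword list. It uses a small stand-in set rather than the NLTK list so that it runs without downloads.

```python
import string

table = str.maketrans('', '', string.punctuation)

# Small stand-in stopword list (an assumption for this example, not the NLTK list)
raw_stopwords = {"the", "is", "don't", "isn't", "it's"}

# Add punctuation-free variants so "dont" and "isnt" are also caught
stopwords = raw_stopwords | {word.translate(table) for word in raw_stopwords}

# Invented sample text
text = "the market isn't open and traders don't expect it's reopening"
tokens = text.translate(table).split()

naive = [tok for tok in tokens if tok not in raw_stopwords]
kept = [tok for tok in tokens if tok not in stopwords]
print(naive)
print(kept)
```

The naive filter leaves `isnt`, `dont`, and `its` in place, while the extended set removes them.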

In [20]:
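# Reduce each token to its stem with the Porter stemmer,
# then rejoin the tokens into one space-separated string per article
stemmer = nltk.stem.PorterStemmer()
articles = [" ".join([stemmer.stem(word) for word in article]) for article in articles]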
# print the first article as a running example
print(articles[0])

shower continu throughout week bahia cocoa zone allevi drought sinc earli januari improv prospect come temporao although normal humid level restor comissaria smith said weekli review dri period mean temporao late year arriv week end februari num num bag num kilo make cumul total season num mln num stage last year seem cocoa deliv earlier consign includ arriv figur comissaria smith said still doubt much old crop cocoa still avail harvest practic come end total bahia crop estim around num mln bag sale stand almost num mln hundr thousand bag still hand farmer middlemen export processor doubt much cocoa would fit export shipper experienc dificulti obtain bahia superior certif view lower qualiti recent week farmer sold good part cocoa held consign comissaria smith said spot bean price rose num num cruzado per arroba num kilo bean shipper reluct offer nearbi shipment limit sale book march shipment num num dlr per tonn port name new crop sale also light open port junejuli go num num dlr num n
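
The assignment asks for a view of token frequency, so here is a minimal sketch of a frequency table built with `Counter`. It assumes each article is a space-separated string of tokens, as produced by the stemming step above. The helper name `token_frequencies` and the sample strings are hypothetical.

```python
from collections import Counter


def token_frequencies(docs, top_n=10):
    """Return (rank, token, count, share of all tokens) for the top_n tokens."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    total = sum(counts.values())
    return [
        (rank, token, count, count / total)
        for rank, (token, count) in enumerate(counts.most_common(top_n), start=1)
    ]


# Invented sample documents in the same style as the stemmed output
sample_docs = [
    "cocoa price rose num dlr",
    "cocoa export fell num tonn",
    "bank said rate num num",
]

rows = token_frequencies(sample_docs, top_n=3)
for rank, token, count, share in rows:
    # Simple text bar chart: one '#' per occurrence
    print(f"{rank:>2} {token:<8} {count:>3} {share:6.1%} {'#' * count}")
```

On the real corpus, `num` will likely dominate the table, which makes it a candidate for the custom stopword list. The counts could also feed the plotting libraries imported at the top of the notebook.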

## Stretch Goals

* Write a web scraper that can scrape "Data Scientist" job listings from indeed.com.
* Look ahead to some of the topics from later this week:
  - Part-of-Speech Tagging
  - Named Entity Recognition
  - Document Classification
* Try different visualization techniques
* Automate the process of retrieving job listings. ;)