# Importing Data for Data Science 1

## Accessing Data From Text Files

Accessing data from a text file is straightforward. 

In [None]:
# Load a simple text file into a list of lines
# The with block closes the file and frees its resources when we are finished with it
with open("data/test_text.txt", "r") as f:
    # Separates text by new line character \n. Reads until EOF. Equivalently you can use list(f)
    lines = f.readlines()
# Iterate through each line in the file and print it out
for line in lines:
    print("****", line)

In [None]:
for line in lines:
    print("****")
    words = line.split(" ")
    for word in words:
        print(word)
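
Each line returned by **readlines** keeps its trailing newline, and splitting on a single space leaves that newline attached to the last word and produces empty strings wherever two spaces appear together. The following is a minimal sketch of how **splitlines** and **split** with no argument avoid both problems. The sample string is made up and simply stands in for the contents of a text file.

```python
# Hypothetical sample text standing in for the contents of a text file
sample_text = "the quick  brown fox\njumps over the lazy dog\n"

# splitlines() drops the newline characters that readlines() keeps
sample_lines = sample_text.splitlines()

for sample_line in sample_lines:
    # split() with no argument splits on any run of whitespace,
    # so the double space does not produce an empty string
    sample_words = sample_line.split()
    print(sample_words)

# For comparison, split(" ") keeps an empty string for the double space
# and leaves the newline attached to the last word
words_naive = "the quick  brown fox\n".split(" ")
print(words_naive)
```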

We can even load a text file across the Internet by using **requests.get** from the **requests** package instead of simply **open**. We use [Project Gutenberg](http://www.gutenberg.org) in this example.

In [None]:
# Import the requests package
import requests

# Define a URL to Alice in Wonderland on Project Gutenberg (www.gutenberg.org)
url='http://www.gutenberg.org/cache/epub/11/pg11.txt'

# Read the text from the URL. This variable is just a long string. 
text_page = requests.get(url).text
# Print the first 1000 characters of the book
print(text_page[:1000])
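
The first characters of a downloaded book are usually licence and header notes rather than the story itself. The sketch below shows one way to keep only the body of the book. It assumes the header and footer are delimited by lines that begin with `***` and contain `START OF` and `END OF`; the exact marker wording has varied between files, so treat the check as an assumption to verify against the text you download. The function name and the sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both are found.

    Assumption: marker lines begin with '***' and contain 'START OF' / 'END OF'.
    If either marker is missing, the text is returned unchanged.
    """
    lines = text.splitlines()
    start = None
    end = None
    for i, line in enumerate(lines):
        if line.startswith("***") and "START OF" in line and start is None:
            start = i
        elif line.startswith("***") and "END OF" in line:
            end = i
    if start is None or end is None or end <= start:
        return text
    return "\n".join(lines[start + 1:end]).strip()


# Made-up miniature example in the same general shape as a downloaded book
sample_page = (
    "Licence and header notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "CHAPTER I.\n"
    "Down the Rabbit-Hole\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer notes\n"
)
book_body = strip_gutenberg_boilerplate(sample_page)
print(book_body)
```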

We can even fetch an HTML page in the same way, but extracting anything useful from the raw markup quickly gets hard.

In [None]:
# Connect to a URL and extract the HTML text
# url = "http://www.independent.ie/sport/soccer/international-soccer/neil-taylor-facing-longer-ban-for-seamus-coleman-horror-tackle-as-fifa-step-in-35578919.html"
url = "https://www.rte.ie/news/2019/0410/1041742-eu_brexit_summit/"
text = requests.get(url).text
print(text[:1000])

## Parsing HTML Files

Downloading a web page is straightforward. The tricky bit is extracting the useful information from it. We can use the **BeautifulSoup4** package (http://www.crummy.com/software/BeautifulSoup) to make this easier.

In [None]:
# Import the BeautifulSoup package
from bs4 import BeautifulSoup 

# Read the HTML file
# url = "http://www.independent.ie/sport/soccer/international-soccer/neil-taylor-facing-longer-ban-for-seamus-coleman-horror-tackle-as-fifa-step-in-35578919.html"
url = "https://www.rte.ie/news/2019/0410/1041742-eu_brexit_summit/"
html = requests.get(url).text

# Create a beautiful soup object from the HTML text so that we can get at the article text
article_soup = BeautifulSoup(html, "lxml")
# print(article_soup)
# Extract the actual article element - this relies on the fact that I know what the HTML looks like, not completely robust!
article = article_soup.find('article')
# find returns the matching element, including its tags
headline = article.find('h1')
# The class name used for the article body has changed over time, so this older selector is out of date now
# article_content = article.find_all('div', class_='ctx_content')
article_content = article.find_all('section', class_="medium-10 medium-offset-1 columns article-body")

# Raw data
# print(article_content)

# Start the article text by adding the headline (this will get what's in between the tags)
article_text = headline.get_text()

# Construct the article by adding together the paragraph pieces
for tag in article_content:
    article_text += tag.get_text()
    
# Print the article content
print(article_text)
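
Because the live page can change at any time, it can help to try the same calls on a small HTML string first. The sketch below uses made-up HTML and made-up variable names, and it uses Python's built-in `html.parser` so that it does not depend on lxml being installed. It also shows that **find** returns `None` when nothing matches, which is worth checking before calling **get_text** on the result.

```python
from bs4 import BeautifulSoup

# Made-up HTML in the same general shape as a news article page
sample_html = """
<html><body>
  <article>
    <h1>Example headline</h1>
    <section class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></section>
  </article>
</body></html>
"""

# html.parser ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_article = sample_soup.find("article")
sample_headline = sample_article.find("h1").get_text()

# Collect the text of every paragraph inside the article
sample_paragraphs = [p.get_text() for p in sample_article.find_all("p")]

# find returns None when no element matches, so check before using the result
missing = sample_article.find("h2")

print(sample_headline)
print(sample_paragraphs)
print(missing)
```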

A word cloud is a fun way to visualise text.

In [None]:
%matplotlib inline  
import matplotlib 
import matplotlib.pyplot as plt

# Import package for drawing word clouds. Install it with pip install wordcloud or conda install -c conda-forge wordcloud
from wordcloud import WordCloud, STOPWORDS 

# Create a word cloud
# Because everyone loves comic sans right? - This font path will probably only work on Mac. 
wordcloud = WordCloud(font_path='/Library/Fonts/Comic Sans MS.ttf',
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=2400,
                      height=2400
                     ).generate(article_text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

## Accessing RSS Feeds 

One way to access lots of news articles is to use an RSS (Really Simple Syndication) feed. We can access RSS feeds easily in Python using the **feedparser** package.

In [None]:
# For reading RSS feeds - install using conda install feedparser
import feedparser 

# Read from the Irish Times RSS feed
RSS_url = "https://www.irishtimes.com/cmlink/news-1.1319192"
it_feed = feedparser.parse(RSS_url)
print("Number of entries:", len(it_feed.entries))

# Iterate through the entries from the feed and print the title of each article and the URL for the article
for article_entry in it_feed.entries:
    article_title = article_entry['title']
    article_url = article_entry['links'][0]['href']
    print(article_title)
    print(article_url)
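
An RSS feed is an XML document in which each article is an `item` element with a `title` and a `link`. To make that structure visible, the sketch below reads a made-up feed with Python's built-in `xml.etree.ElementTree` instead of **feedparser**. It is only an illustration of what the feed contains, not a replacement for feedparser, which handles many more feed formats and quirks.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document with two items
sample_rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_rss)

# Collect a (title, link) pair for every item in the feed
sample_entries = []
for item in root.iter("item"):
    sample_entries.append((item.findtext("title"), item.findtext("link")))

print("Number of entries:", len(sample_entries))
for entry_title, entry_link in sample_entries:
    print(entry_title)
    print(entry_link)
```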

## Accessing Data From Twitter

Twitter is an obvious service to get text from. We can use the **Tweepy** package to access the Twitter API. Before using Tweepy you must have Twitter **OAuth credentials**, available from https://apps.twitter.com/. Create a new application (using your own Twitter credentials) and then generate access tokens.

In [None]:
# Import tweepy 
import tweepy

# OAuth access details for getting at the Twitter API - having these in my code is pretty insecure!!
consumer_key = "ENTER YOUR KEY HERE"
consumer_secret = "ENTER YOUR SECRET HERE"
access_token = "ENTER YOUR TOKEN HERE"
access_token_secret = "ENTER YOUR TOKEN SECRET HERE"

# Connect to the Twitter API using authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Access the tweets appearing in my timeline
coord_list = list()
public_tweets = api.home_timeline(count=25)
for tweet in public_tweets:
    print("@" + tweet.author.screen_name, "|", tweet.author.name)
    print(tweet.text)
    print() 
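
As the comment in the cell above notes, keeping credentials in the code is insecure. One common alternative is to read them from environment variables. The sketch below is a minimal version of that idea: the environment variable names and the `load_credentials` helper are hypothetical, and the demonstration uses a stand-in dictionary rather than real secrets.

```python
import os

# Hypothetical environment variable names; use whatever names suit your setup
CREDENTIAL_VARS = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]


def load_credentials(environ=os.environ):
    """Return a dict of credentials, raising a clear error if any are missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}


# Demonstration with a stand-in dictionary rather than real secrets
fake_environ = {name: "placeholder" for name in CREDENTIAL_VARS}
credentials = load_credentials(fake_environ)
print(sorted(credentials))
```

The values in the returned dictionary can then be passed to the authentication calls in place of the hard-coded strings.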
    

In [None]:
# Search for recent tweets containing a specific keyword
# Note: Tweepy 4.0 and later renamed this method to api.search_tweets
results = api.search(q="Dublin", count=10)
for tweet in results:
    print("@" + tweet.author.screen_name, "|", tweet.author.name)
    print(tweet.text)
    print() 

## An Introduction To NLTK

The **Natural Language Toolkit**, **NLTK** (http://www.nltk.org/), is a well written, widely used, and well respected toolkit for performing natural language processing in Python. It offers a wide range of useful functionality and data structures that make natural language processing, and so text analytics, much easier. Features of the NLTK include corpus management, document classification, collocation discovery, part of speech tagging, parsing, and chunking. The best reference for the NLTK is the **NLTK Book**, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper, which is freely available online at http://www.nltk.org/book/ or for sale at http://www.amazon.co.uk/Natural-Language-Processing-Python-Steven/dp/0596516495. Many of the examples in this tutorial are taken from this book.

### Import Packages

Import a set of packages that we will use in order to perform text analysis. These are very commonly used Python packages.

In [None]:
import nltk # The best known Python natural language processing toolkit
from nltk import FreqDist # Explicitly import the FreqDist class from NLTK
import numpy # Package for scientific computing
import matplotlib # Python plotting library
import matplotlib.pyplot as plt # Easy syntax access to pyplot
import re # functions for dealing with regular expressions
from wordcloud import WordCloud, STOPWORDS # package for drawing word clouds. Install it with pip install wordcloud or conda install -c conda-forge wordcloud
from urllib.request import urlopen # for accessing URLs
from bs4 import BeautifulSoup # For parsing HTML documents

# Tells IPython notebook to draw graphics inline in the webpage
%matplotlib inline 

In [None]:
# Uncomment this in order to launch the NLTK downloader to access corpora, packages etc
# nltk.download()

### Load An NLTK Built-In Corpus

In [None]:
nltk.corpus.gutenberg.fileids()

### Find Text Within a Corpus

A concordance returns a set of sentences that include a search term or terms. We first create an NLTK text object that we can manipulate.

In [None]:
hamlet = nltk.Text(nltk.corpus.gutenberg.words('shakespeare-hamlet.txt'))
print(hamlet)


Using the NLTK **concordance** function we can generate a set of sentences containing a chosen word.

In [None]:
hamlet.concordance("ophelia")
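
To see what a concordance is doing, here is a minimal plain-Python sketch of the same idea: find every occurrence of a word in a list of tokens and show a few tokens of context on each side. This is not NLTK's implementation; the function name, the `width` parameter, and the sample sentence are made up, and the sketch simply ignores case when matching.

```python
def simple_concordance(tokens, target, width=2):
    """Return (left context, word, right context) for each occurrence of target."""
    target = target.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Made-up sentence standing in for a tokenised text
kwic_tokens = "the whale surfaced and the crew watched the whale dive again".split()
kwic_rows = simple_concordance(kwic_tokens, "whale")
for left, word, right in kwic_rows:
    # Right-align the left context so the matched words line up in a column
    print(left.rjust(15), word, right)
```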

We can generate a text object containing all of the texts in our corpus and generate a concordance from this.

In [None]:
allText = nltk.Text(nltk.corpus.gutenberg.words())
# "A concordance view shows us every occurrence of a given word, together with some context."
# https://www.nltk.org/book/ch01.html
allText.concordance('sandwich')

**EXERCISE:** Generate a concordance of the occurrences of the word *whale* in *Moby Dick* and the word *computer* in the overall Gutenberg corpus.

In [None]:
moby_dick = nltk.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
moby_dick.concordance('whale')

In [None]:
allText.concordance('computer')

A **dispersion plot** is a fun data visualisation supported by NLTK that shows us where in a text words appear. It is generated using the **dispersion_plot** function.

In [None]:
hamlet.dispersion_plot(["Hamlet", "Horatio", "Ophelia", "Fortinbras", "Yorick", "death", "skull", "dagger"])
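
The data behind a dispersion plot is just the position of each occurrence of each word in the text. The sketch below computes those positions for a made-up token list; the function name is hypothetical and the matching is case-sensitive, which is an assumption of this sketch rather than a statement about how NLTK draws its plot.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Made-up sentence standing in for a tokenised text
plot_tokens = "the king spoke and the queen listened while the king paced".split()
offsets = word_offsets(plot_tokens, ["king", "queen", "ghost"])
print(offsets)
```

Plotting each list of positions along a horizontal axis, one row per word, gives the familiar dispersion picture.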

A more powerful way to find text within a corpus is to use **regular expressions**. Regular expressions are a powerful way to define textual patterns that allow us to find interesting things within a document. First, let's just find all words ending in "ings".

In [None]:
list(set([w for w in hamlet if re.search('ings$', w)]))

This list comprehension iterates through all the words in our hamlet text and keeps those that match our regular expression. Wrapping the result in a set removes duplicates. Python is great for this type of stuff!

In the next example we consider a Hamlet-based crossword puzzle in which we need to find a word that matches this pattern: \_ \_ m \_ \_ t 

In [None]:
list(set([w for w in hamlet if re.search('^..m..t$', w)]))

There are tonnes of things that you can do with regular expressions - find dates, find phone numbers, find matches for types of words, find patterns across multiple words .... The basic operators for defining regular expressions are as follows.

Operator | Behavior
----------|------------
. |	Wildcard, matches any character
^abc |	Matches some pattern abc at the start of a string
abc\$ |	Matches some pattern abc at the end of a string
[abc] |	Matches one of a set of characters
[A-Z0-9] |	Matches one of a range of characters
ed $\mid$ ing $\mid$ s |	Matches one of the specified strings (disjunction)
* |	Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+ |	One or more of previous item, e.g. a+, [a-z]+
? |	Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} |	Exactly n repeats where n is a non-negative integer
{n,} |	At least n repeats
{,n} |	No more than n repeats
{m,n} |	At least m and no more than n repeats
a(b $\mid$ c)+ |	Parentheses that indicate the scope of the operators
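
Several of the operators in the table can be tried out on a small word list. The words below are made up for illustration, and each pattern is applied with **re.search** in the same way as the Hamlet examples above.

```python
import re

regex_words = ["sing", "sings", "kings", "walked", "walking",
               "gimlet", "hamlet", "aaa", "ab", "abcbc"]

# $ anchors the pattern at the end of the string
ends_in_ings = [w for w in regex_words if re.search(r"ings$", w)]

# ^ and $ together with . : six characters, third is m, last is t
crossword = [w for w in regex_words if re.search(r"^..m..t$", w)]

# Disjunction: words ending in ed, ing, or s
suffixes = [w for w in regex_words if re.search(r"(ed|ing|s)$", w)]

# {m,n}: a string made of two or three a characters and nothing else
repeats = [w for w in regex_words if re.search(r"^a{2,3}$", w)]

# Parentheses set the scope of +: an a followed by one or more b or c
scoped = [w for w in regex_words if re.search(r"^a(b|c)+$", w)]

print(ends_in_ings)
print(crossword)
print(suffixes)
print(repeats)
print(scoped)
```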

**EXERCISE:** Load the corpus of American presidential inaugural addresses, nltk.corpus.inaugural, and find all mentions of *America*, *freedom*, and *war*.

In [None]:
inaugural_add = nltk.Text(nltk.corpus.inaugural.words())
inaugural_add.dispersion_plot(["America","freedom","war"])

### Counting Vocabulary

Counting vocabulary is a really important thing to do in text, and we can do it easily in Python with NLTK. First, let's get the number of words in Hamlet.

In [None]:
len(hamlet)

Let's extract the number of unique words - converting from a Python **list** to a Python **set** does this!

In [None]:
len(set(hamlet))

**Lexical diversity** is a measure of how varied the vocabulary of a text is. It is just the ratio of unique words to total words. Higher values indicate a richer vocabulary, which is often taken as a sign of a more complicated text.

In [None]:
# Set to get unique values which we then count
len(set(hamlet))/len(hamlet)

**EXERCISE:** Calculate the lexical diversity of *Moby Dick* and *Alice In Wonderland*.

In [None]:
alice = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
len(set(alice))/len(alice)

In [None]:
len(set(moby_dick))/len(moby_dick)

We can easily define a lexical diversity function.

In [None]:
# Define function to calculate lexical diversity
def lexical_diversity(text):
    return len(set(text)) / len(text)

# Use the newly defined function to calculate the lexical diversity of hamlet
lexical_diversity(hamlet)
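
A small worked example makes the ratio concrete. The token list below is made up. It also shows that *The* and *the* count as different words unless the text is converted to lower case first, which changes the result.

```python
def lexical_diversity(text):
    # Ratio of unique tokens to total tokens
    return len(set(text)) / len(text)


# Made-up token list: 8 tokens, 7 of them distinct when case is respected
toy_tokens = ["The", "cat", "sat", "on", "the", "mat", "the", "end"]
diversity_raw = lexical_diversity(toy_tokens)

# Lower-casing merges "The" and "the", leaving 6 distinct tokens out of 8
toy_tokens_lower = [w.lower() for w in toy_tokens]
diversity_lower = lexical_diversity(toy_tokens_lower)

print(diversity_raw)
print(diversity_lower)
```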

Counting the **most frequently occurring words** in a text is one of the most common ways to analyse the meaning of a text. The NLTK makes this job very easy for us by allowing us to quickly generate a **frequency distribution** using the **FreqDist** class.

In [None]:
%matplotlib inline
# convert all words to lower case
hamlet = [w.lower() for w in hamlet]

# Remove all punctuation from word lists - note the use of regular expressions!
# re.match returns None when a word does not consist entirely of non-word characters,
# so we keep only the words for which the match is None.
hamlet = [w for w in hamlet if re.match(r'^\W+$', w) is None]

# Remove all stop words from word lists. Build the stop word set once rather than on every iteration
stop_words = set(nltk.corpus.stopwords.words('english'))
hamlet = [w for w in hamlet if w not in stop_words]

# print(hamlet)

# Print revised lexical diversity
# print('{}{}'.format('Lexical diversity: ', lexical_diversity(hamlet)))

# Generate the frequency distribution for hamlet
hamlet_freq_dist = nltk.FreqDist(hamlet)
print(hamlet_freq_dist)


In [None]:
# Print the number of occurrences of the word hamlet
print(hamlet_freq_dist['hamlet'])
print('{}{}'.format('Frequency Hamlet: ', hamlet_freq_dist['hamlet']))

# Print the top X words. most_common returns (word, count) pairs sorted by frequency,
# whereas iterating over keys() does not guarantee frequency order
wordLimit = 20
for w, count in hamlet_freq_dist.most_common(wordLimit):
    print('{}{}{}'.format(w, ': ', count))

# Plot a nice graph of word frequencies
hamlet_freq_dist.plot(wordLimit)
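
In NLTK 3, **FreqDist** is built on Python's `collections.Counter`, so the counting behaviour can be tried out without any corpus. The sketch below uses a plain `Counter` on a made-up sentence as a stand-in for a frequency distribution. It is an illustration of the counting idea, not of FreqDist's extra plotting and statistics methods.

```python
from collections import Counter

# Made-up sentence standing in for a tokenised text
count_tokens = "the cat and the dog and the bird".split()

token_counts = Counter(count_tokens)

# The two most frequent tokens, as (token, count) pairs
top_two = token_counts.most_common(2)

# Looking up a token that never occurs gives a count of 0 rather than an error
absent_count = token_counts["hamlet"]

print(top_two)
print(absent_count)
```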


We can also plot a word cloud from this frequency distribution (watch out for the path to the font used).

In [None]:
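# Join the word list into a single string. str(hamlet) would also include the list's brackets and quotes
wordcloud = WordCloud(font_path='/Library/Fonts/Comic Sans MS.ttf',
                      stopwords=STOPWORDS,
                      background_color='black',
                      width=1800,
                      height=1400
                     ).generate(' '.join(hamlet))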


plt.imshow(wordcloud)
plt.axis('off')
# plt.savefig('./my_twitter_wordcloud_1.png', dpi=300)
plt.show()

A nice webpage explaining how to make word clouds from your Twitter feed: http://spartanideas.msu.edu/2014/11/28/turn-your-twitter-timeline-into-a-word-cloud-using-python/#A.-Downloading-Your-Twitter-Timeline-Tweets