# Getting the most out of what we've learned

So, now you know Python and NLTK! The main things we still have to do are:

1. Address some specific questions
2. Manage resources and results
3. Brainstorm some other uses for NLTK
4. Integrate IPython into your existing workflow
5. Have an open discussion about what we've done
6. Summarise and say goodbye!

This lesson is pretty light on content and structure, so please do jump in at any point to tell us about your research, and whether what you've learned here will be useful for it.

Or, ask us if Python can do a certain thing. Maybe we have some tips!

In [14]:
from __future__ import print_function, division
import nltk
import os
from urllib.request import urlopen
%matplotlib inline

### Using Beautiful Soup to read text from the web
Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

In [15]:
!pip install beautifulsoup4


In [16]:
from bs4 import BeautifulSoup

In [17]:
import urllib
from urllib.request import urlopen

In [18]:
url = "http://en.wikipedia.org/wiki/Smog"

In [19]:
raw = urlopen(url).read()
print(type(raw))
print(raw[100:200])

<class 'bytes'>
b'e>Smog - Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = docum'
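
Notice that `urlopen(url).read()` gives us `bytes`, not a `str`. Beautiful Soup is happy to take either, but if you ever want to handle the page as ordinary text yourself, you'll need to decode it first. Here is a minimal sketch, using a short made-up byte string (`sample_raw` is hypothetical, standing in for the downloaded page) and assuming the page is encoded as UTF-8:

```python
# Hypothetical stand-in for the bytes returned by urlopen(url).read()
sample_raw = '<title>Smog - Wikipedia, the free encyclopedia</title>'.encode('utf-8')

# Decode the bytes into a str, assuming the page is UTF-8 encoded
sample_html = sample_raw.decode('utf-8')

print(type(sample_raw))
print(type(sample_html))
print(sample_html[7:11])
```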


Beautiful Soup parses that single long string of HTML into its constituent parts and returns them as a `BeautifulSoup` object.

In [20]:
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


Find all the paragraphs (the `<p>` elements) and put their text into a list.

In [21]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])

['Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmanteau of the words smoke and fog to refer to smoky fog.[1] The word was then intended to refer to what was sometimes known as green soup fog, a familiar and serious problem in Mexico from the 19th century to the mid 20th century. This kind of visible air pollution is composed of nitrogen oxides, sulfur oxides, ozone, smoke or particulates among others (less visible pollutants include carbon monoxide, CFCs and radioactive sources). Man-made smog is derived from coal emissions, vehicular emissions, industrial emissions, forest and agricultural fires and photochemical reactions of these emissions.', 'Modern smog, as found for example in Los Angeles, is a type of air pollution derived from vehicular emission from internal combustion engines and industrial fumes that react in the atmosphere with sunlight to form secondary pollutants that also combine with the primary emissions to form photoche

In [22]:
import re
regex = re.compile(r'\[[0-9]*\]')  # raw string, so the backslashes reach the regex engine intact
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print(type(joined_texts))
print(joined_texts[:100])

<class 'str'>
Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmante
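
The regular expression in the cell above strips out Wikipedia's footnote markers, such as `[1]`. Here is a small sketch of the same idea on a made-up sentence (`sample` is hypothetical), with the pattern written as a raw string (`r'...'`) so that Python leaves the backslashes alone:

```python
import re

# Hypothetical sample text containing Wikipedia-style footnote markers
sample = 'Smog is a portmanteau of smoke and fog.[1] It is harmful.[23]'

# \[ and \] match literal square brackets; [0-9]* matches any run of digits
footnote = re.compile(r'\[[0-9]*\]')
cleaned = footnote.sub('', sample)
print(cleaned)
```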


In order to work on the text, the first step is to tokenise it into words.

In [23]:
import nltk
wordlist = nltk.word_tokenize(joined_texts)
wordlist[:8]

['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

For some other types of analysis, such as concordances, we'll need to wrap the word list in an NLTK `Text` object.

In [24]:
good_text = nltk.Text(wordlist)
good_text.concordance('smog')

Displaying 25 of 39 matches:
                                     Smog is a type of air pollutant . The wor
                                     smog '' was coined in the early 20th cent
and radioactive sources ) . Man-made smog is derived from coal emissions , veh
eactions of these emissions . Modern smog , as found for example in Los Angele
mary emissions to form photochemical smog . In certain other cities , such as 
rtain other cities , such as Delhi , smog severity is often aggravated by stub
fe or death . Coinage of the term `` smog '' is generally attributed to Dr. He
 clouds of smoke that contributes to smog . Air pollution from this source has
 , as witnessed by the 2013 autumnal smog in Harbin , China , which closed roa
 major ingredient in the creation of smog in some large cities . The major cul
 ozone , and particles that comprise smog . Photochemical smog is the chemical
s that comprise smog . Photochemical smog is the chemical reaction of sunlight
active and oxidizing . 
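
To get a feel for what `concordance` is doing, here is a rough keyword-in-context sketch in plain Python. It is not NLTK's implementation, just an illustration of the idea; the function name `simple_concordance`, its `width` parameter and the sample tokens are all made up for this example.

```python
def simple_concordance(tokens, keyword, width=3):
    """Return each match of keyword with up to `width` tokens either side."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lower case so 'Smog' and 'smog' both match
        if token.lower() == keyword.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical token list, in the style produced by nltk.word_tokenize
sample_tokens = ['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.',
                 'Man-made', 'smog', 'is', 'derived', 'from', 'coal', '.']

for left, match, right in simple_concordance(sample_tokens, 'smog'):
    print('{:>25} | {} | {}'.format(left, match, right))
```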

And once we've done all that work creating clean text, it's a good idea to save it for later.

In [25]:
%cd
! mkdir smog
%cd smog

/Users/dansandiford
mkdir: smog: File exists
/Users/dansandiford/smog


In [30]:
with open("NLTK-Smog.txt", "w", encoding='UTF-8') as NLTK_file:
    NLTK_file.write(str(wordlist))

In [31]:
with open("Smog-text.txt", "w", encoding='UTF-8') as text_file:
    text_file.write(joined_texts)
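
One thing to watch: `str(wordlist)` writes the list exactly as Python prints it, brackets and quotes included, which is awkward to read back in later. A possible alternative, sketched here with a made-up token list and a hypothetical file name, is to write one token per line. The `with` statement closes the file for you, even if something goes wrong part-way through.

```python
import os
import tempfile

# Hypothetical token list standing in for wordlist
sample_tokens = ['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

# Write to a temporary folder so the sketch doesn't overwrite real files
out_path = os.path.join(tempfile.mkdtemp(), 'sample-tokens.txt')

# One token per line; the file is closed automatically at the end of the block
with open(out_path, 'w', encoding='UTF-8') as token_file:
    token_file.write('\n'.join(sample_tokens))

# Read the tokens back into a list
with open(out_path, encoding='UTF-8') as token_file:
    reloaded = token_file.read().splitlines()

print(reloaded == sample_tokens)
```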

In [32]:
joined_texts[2450:2470]

's type is still a pr'

In [33]:
#joined_texts[2450:2470]
with open("Smog-text.txt", "w", encoding='UTF-8') as text_file:
    text_file.write(joined_texts)

### Challenge!
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file
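
If you get stuck, here is one possible starting point that uses only the standard library. The token list and the tiny stop-word set are made up for illustration; in your own solution you would use your tokenised text and a fuller stop-word list, such as the one in `nltk.corpus.stopwords`. NLTK also offers `nltk.FreqDist` and `nltk.trigrams` for the same jobs.

```python
from collections import Counter

# Hypothetical tokens; replace with the output of nltk.word_tokenize
tokens = ['the', 'smog', 'in', 'the', 'city', 'is', 'the', 'worst',
          'smog', 'in', 'years']

# Tiny illustrative stop-word set (an assumption, not NLTK's list)
stop_words = {'the', 'in', 'is'}

# Most common word, ignoring stop words
content_words = [t.lower() for t in tokens if t.lower() not in stop_words]
print(Counter(content_words).most_common(1))

# Trigrams: every run of three consecutive tokens
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams[:2])
```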

### PDF


In [34]:
import os

In [35]:
if not os.path.exists('1984.pdf'):
    !wget "http://www.planetebook.com/ebooks/1984.pdf"

In [36]:
!pip install pypdf2


In [37]:
from PyPDF2 import PdfFileWriter, PdfFileReader

In [38]:
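# Open the PDF in binary mode
pdf = PdfFileReader(open('1984.pdf', "rb"))
book_text = ''
# Only the first 10 pages are read here; to read the whole book, use:
#for page in range(len(pdf.pages)):
for page in range(10):
    temppage = pdf.getPage(page)
    book_text += temppage.extractText()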
    