# Getting the most out of what we've learned

So, now you know Python and NLTK! The main things we still have to do are:

1. Address some specific questions
2. Manage resources and results
3. Brainstorm some other uses for NLTK
4. Integrate IPython into your existing workflow
5. Have an open discussion about what we've done
6. Summarise and say goodbye!

This lesson is deliberately light on content and structure. Please jump in at any point to tell us about your research, and whether what you've learned here is likely to be useful.

Or, ask us if Python can do a certain thing. Maybe we have some tips!

In [None]:
from __future__ import print_function, division
import nltk
import os
from urllib.request import urlopen
%matplotlib inline

### Using Beautiful Soup to read text from the web
Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

In [None]:
!pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup

In [None]:
# urlopen fetches the raw contents of a URL
from urllib.request import urlopen

In [None]:
url = "http://en.wikipedia.org/wiki/Smog"

In [None]:
raw = urlopen(url).read()
print(type(raw))
print(raw[100:200])
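
Note that `urlopen(url).read()` returns bytes rather than a string, which is why the type printed here is `bytes`. A minimal sketch of the difference, using a made-up byte string (`sample_raw` is an assumption standing in for a real download) so that it runs without a network connection:

```python
# sample_raw is a hypothetical stand-in for the bytes returned by urlopen(url).read()
sample_raw = b"<title>Smog \xe2\x80\x93 Wikipedia</title>"

print(type(sample_raw))
print(sample_raw[:20])   # slicing bytes gives more bytes

# Decoding turns bytes into a str; UTF-8 is assumed here
sample_text = sample_raw.decode("utf-8")
print(type(sample_text))
print(sample_text[:20])
```

Beautiful Soup accepts either bytes or a string, so decoding by hand is only needed if you want to inspect the raw page yourself.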

Beautiful Soup parses the single long string into its constituent parts, creating a `BeautifulSoup` object.

In [None]:
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

Find all the paragraphs and put their text into a list.

In [None]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])
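
To see what `find_all('p')` and `.text` do without hitting the network, here is a small sketch on a hand-written HTML snippet. The snippet and the names `sample_html`, `sample_soup` and `sample_paragraphs` are invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page fragment, standing in for the downloaded Wikipedia HTML
sample_html = """
<html><body>
<h1>Smog</h1>
<p>Smog is a type of air pollution.<sup>[1]</sup></p>
<div>Navigation links that we do not want</div>
<p>The word combines <b>smoke</b> and <b>fog</b>.<sup>[2]</sup></p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the text inside <p> tags; nested tags such as <b> are flattened
sample_paragraphs = [para.text for para in sample_soup.find_all("p")]

for paragraph in sample_paragraphs:
    print(paragraph)
```

The heading and the `<div>` are skipped, but the footnote markers `[1]` and `[2]` survive, which is why the next cell strips them out.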

In [None]:
import re
# Match Wikipedia footnote markers such as [12]; the raw string keeps the backslashes literal
regex = re.compile(r'\[[0-9]*\]')
joined_texts = '\n'.join(texts)
joined_texts = regex.sub('', joined_texts)
print(type(joined_texts))
print(joined_texts[:100])
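
The regular expression in this cell targets Wikipedia-style footnote markers such as `[12]`. A quick sketch on an invented sentence (`sample` is an assumption, not text from the page) shows what is and isn't removed:

```python
import re

# An opening bracket, any run of digits, then a closing bracket
citation = re.compile(r"\[[0-9]*\]")

sample = "Smog reduces visibility.[1] Cities monitor it daily.[23] See [note] below."
cleaned = citation.sub("", sample)
print(cleaned)
```

Brackets that contain anything other than digits, like `[note]`, are left alone.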

In order to work on the text, the first step is to tokenise it into words.

In [None]:
import nltk
wordlist = nltk.word_tokenize(joined_texts)
wordlist[:8]

For some other types of analysis, we'll need to create an NLTK `Text` object.

In [None]:
good_text = nltk.Text(wordlist)
good_text.concordance('smog')

And once we've done all that work creating clean text, it's a good idea to save it for later.

In [None]:
%cd
! mkdir smog
%cd smog

In [None]:
# The with block closes the file automatically, even if the write fails
with open("NLTK-Smog.txt", "w", encoding='UTF-8') as NLTK_file:
    NLTK_file.write(str(wordlist))

In [None]:
# The with block closes the file automatically, even if the write fails
with open("Smog-text.txt", "w", encoding='UTF-8') as text_file:
    text_file.write(joined_texts)
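
Because `str(wordlist)` writes the list's Python representation (brackets, quotes and commas included) to `NLTK-Smog.txt`, reading the tokens back takes an extra step. The sketch below uses a short invented token list and a temporary directory, so nothing is written to your `smog` folder. It shows two options: parsing that representation with `ast.literal_eval`, or writing one token per line instead. The file names are placeholders.

```python
import ast
import os
import tempfile

sample_tokens = ["Smog", "is", "a", "type", "of", "air", "pollution", "."]

with tempfile.TemporaryDirectory() as folder:
    # Option 1: save the list's representation, then parse it back into a list
    repr_path = os.path.join(folder, "tokens-repr.txt")
    with open(repr_path, "w", encoding="UTF-8") as handle:
        handle.write(str(sample_tokens))
    with open(repr_path, encoding="UTF-8") as handle:
        from_repr = ast.literal_eval(handle.read())

    # Option 2: save one token per line, which other tools can read easily
    lines_path = os.path.join(folder, "tokens-lines.txt")
    with open(lines_path, "w", encoding="UTF-8") as handle:
        handle.write("\n".join(sample_tokens))
    with open(lines_path, encoding="UTF-8") as handle:
        from_lines = handle.read().split("\n")

print(from_repr == sample_tokens, from_lines == sample_tokens)
```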

In [None]:
joined_texts[2450:2470]

### Challenge!
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file
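
If you get stuck, here is one possible starting point for the counting and trigram steps, using only the standard library on an invented token list. The tiny `stop_words` set is a placeholder assumption; NLTK ships a fuller list in `nltk.corpus.stopwords`, and offers `nltk.FreqDist` and `nltk.trigrams` for the same jobs.

```python
from collections import Counter

tokens = "the smog in the city and the smog in the valley".split()

# Placeholder stop word list, far shorter than a real one
stop_words = {"the", "in", "and"}
content_words = [word for word in tokens if word.lower() not in stop_words]

# Most common words after stop word removal
word_counts = Counter(content_words)
print(word_counts.most_common(2))

# Trigrams: every run of three consecutive tokens
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(Counter(trigrams).most_common(1))
```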

### Reading text from a PDF

In [None]:
import os

In [None]:
if not os.path.exists('1984.pdf'):
    !wget "http://www.planetebook.com/ebooks/1984.pdf"

In [None]:
!pip install pypdf2

In [None]:
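# PdfFileReader, getPage and extractText were removed in PyPDF2 3.0;
# PdfReader, .pages and .extract_text() replace them
from PyPDF2 import PdfReader

In [None]:
pdf = PdfReader('1984.pdf')
book_text = ''
# To read the whole book, use the commented-out loop instead
#for page in range(len(pdf.pages)):
for page in range(10):
    temppage = pdf.pages[page]
    book_text += temppage.extract_text()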
    