# Getting Started with Python & NLTK

We'll use Python throughout this course. The purpose of this notebook is to introduce the software we are going to use and to help you set up the environment on your laptop.

The environment will be set up in the following steps:
1. Install Anaconda
2. Create a virtual environment with dependency packages
3. Install Jupyter Notebook
4. Run a demo in Jupyter Notebook

---

## 1. Python with Anaconda

### What & Why Anaconda

Anaconda is "the world’s most popular and trusted data science ecosystem". You can find more details about it [here](https://www.continuum.io/what-is-anaconda). For me, the takeaways are:
1. It helps you set up an environment for Python (which can otherwise be a little painful on Windows)
2. The virtual environment feature gives you the flexibility to switch between Python versions and package versions. You might need Python 2.7 for one project and Python 3.5 for another, or version 1.0 of NLTK (I made up this number) for one project and 1.5 for another. Instead of installing everything globally, you can set up a separate Python environment for each project so that they do not conflict with each other
3. Unlike MATLAB, Python is open source and community driven. This results in an abundant and fast-growing set of third-party packages, which also means you often have to install those packages yourself. Using conda makes it easy to install and maintain them

### Install Anaconda

1. __Go to [download page of Anaconda](https://www.continuum.io/downloads)__
    - Download the Python 3.6 version
    - It is a graphical installer on both Windows and Mac, so this step should be straightforward. Just in case, here are the instructions for [windows](https://docs.continuum.io/anaconda/install/windows) and [mac](https://docs.continuum.io/anaconda/install/mac-os#macos-graphical-install)
    - On Linux there is only a command-line installer; just follow [these instructions](https://docs.continuum.io/anaconda/install/linux). It shouldn't be too hard either
2. __Check your installation__
    - Windows users, please use Anaconda Prompt (installed together with Anaconda); Mac and Linux users, just use the terminal
    - Enter 'conda --version'; you'll see something like 'conda 4.4.0'
3. __Why Python 3?__
    Python 3 is the current version of the language, and many newer machine learning and deep learning tools are built on it. It is also well supported by the scientific computing ecosystem. Still, with conda you can easily switch between Python 2 and Python 3
4. __If you already have Anaconda installed__
    Check [this doc](https://docs.continuum.io/anaconda/install/update-version). In short, go to the terminal and run:
    ```
    conda update conda
    conda update anaconda
    ```
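
If you prefer to run the same installation check from Python, the sketch below calls `conda --version` through the standard library. It assumes conda is on your PATH; the helper name `conda_version` is made up for this example, and it simply returns `None` when the command cannot be found.

```python
import shutil
import subprocess


def conda_version(executable="conda"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Some tools print their version to stderr, so check both streams.
    output = result.stdout.strip() or result.stderr.strip()
    return output or None


print(conda_version())
```

If this prints `None`, conda is probably not visible from the shell that started Python, which usually means you are not in Anaconda Prompt (Windows) or the installer did not update your PATH.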

### Setup Virtual Environment

Now we're going to set up the virtual environment using conda. Normally you would do this by following the instructions in this [cheat sheet](https://conda.io/docs/_downloads/conda-cheatsheet.pdf). For your convenience, I wrote a YAML file that you can use to create the environment. All you need to do is follow these steps:

1. In the terminal (Anaconda Prompt for Windows users), cd to the git repo you just cloned (e.g. path\to\ece590-08-2017\)
2. Enter the command
    ```
    conda env create
    ```
    This will automatically read the dependencies I specified in environment.yaml and install them for you
3. After installation is complete, for windows users, do:
    ```
    activate nlp-env
    ```
    and for mac/linux users, do:
    ```
    source activate nlp-env
    ```
    You should see that the prompt now starts with '(nlp-env)'
4. The last, but also very important, step is to install nb_conda, which binds the kernel of your virtual environment to Jupyter Notebook. A detailed explanation can be found in [this SO thread](https://stackoverflow.com/questions/37433363/link-conda-environment-with-jupyter-notebook). Just enter this command:
    ```
    conda install nb_conda
    ```
5. If step 4 does not work, try the following steps:
    1. Deactivate nlp-env
    2. Run
        ```
        conda install nb_conda
        ```
    3. Activate nlp-env
    4. Run
        ```
        python -m ipykernel install --user --name nlp-env
        ```
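
Once the environment is active, a quick way to confirm that its dependencies are visible to Python is to ask the import system, without actually importing anything. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the package list is my assumption about what the environment should contain (NLTK and ipykernel are the two this notebook relies on directly).

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; adjust it to match your environment file.
missing = missing_packages(["nltk", "ipykernel"])
if missing:
    print("Not found in this environment:", ", ".join(missing))
else:
    print("All packages found.")
```

Run it with the Python of the activated environment; if something is reported as missing, the environment was probably not activated or not created completely.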

---

## 2. Brief Intro to Jupyter Notebook

Jupyter Notebook is a tool I highly recommend. It was designed for data scientists to create and share code, and it now goes far beyond that. Because it supports Markdown and LaTeX, you can do a lot with it. Personally, I now use Jupyter Notebook for coding, light debugging, and even note taking. To see what Jupyter Notebook is capable of, check out this [cool list](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)

To start Jupyter Notebook, just type
```
jupyter notebook
```
in your terminal (with nlp-env activated)

Now you should see a web page open in your browser. That's the interface of Jupyter Notebook.  
You've probably already noticed that this page itself is an .ipynb file, so open the 'install_instruction.ipynb' file in the folder. Before we move to the next step, make sure the top right of the page (the right side of the toolbar) says 'conda env: nlp-env', which means you're using nlp-env with all the dependency libraries we just installed.

If it says something different, like 'Python 3', just click 'Kernel' in the toolbar, then 'Change Kernel', and select the right one.
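
You can also double-check the kernel from inside a notebook cell. The sketch below assumes the environment is a named conda environment, which conda stores by default in a folder named after it (something like `.../envs/nlp-env`); the helper name `running_in_env` is made up for illustration.

```python
import os
import sys


def running_in_env(env_name, prefix=None):
    """Guess whether the interpreter at `prefix` belongs to the environment `env_name`.

    Relies on the assumption that the environment folder is named after the
    environment, so the last component of the prefix is compared to the name.
    """
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix)) == env_name


print(sys.executable)            # the Python binary this kernel is running
print(running_in_env("nlp-env"))
```

If the second line prints `False`, the notebook is most likely running on a different kernel, and changing it from the 'Kernel' menu should fix it.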

---

## 3. Run the First NLTK Demo

OK, now we come to the last step. In this step, we'll start using Jupyter Notebook and run a simple NLTK demo.

Before we move on, let me first introduce NLTK. The Natural Language Toolkit is a popular (probably the most popular) NLP toolkit in Python. It provides tools for downloading and managing useful corpora, as well as for processing text, such as tokenization, tagging and stemming.

More detailed documentation can be found on their [website](http://www.nltk.org/).

Now let's try a 'hello world' example using NLTK (modified from [this site](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)). Just press 'shift+enter' in each cell to execute it.

In [1]:
```
import nltk
nltk.download('popular')
```

```
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cmudict.zip.
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\gazetteers.zip.
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\genesis.zip.
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\gutenberg.zip.
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     C:\Users\Bohao\AppData\Roaming\nl

True
```

As I mentioned above, NLTK helps you download and manage corpora easily. 
```
nltk.download('popular')
```
just downloads the most popular corpora for you (the full set of corpora takes more than 1 GB of space). Some of them are necessary for processing text. You can find useful info in this [SO thread](https://stackoverflow.com/questions/22211525/how-do-i-download-nltk-data)
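
To check whether a corpus is already on disk before downloading again, you can look it up with `nltk.data.find`, which raises `LookupError` when the resource is missing. The sketch below is only an illustration: the helper name `nltk_resource_available` is hypothetical, and 'corpora/gutenberg' is used as the example because gutenberg appears in the download log above.

```python
import importlib.util


def nltk_resource_available(resource_path):
    """Return True if NLTK is installed and can locate `resource_path` locally."""
    if importlib.util.find_spec("nltk") is None:
        return False  # NLTK itself is not installed in this environment
    import nltk

    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # NLTK searched its data directories and did not find the resource.
        return False


print(nltk_resource_available("corpora/gutenberg"))
```

If this prints `False` after the download finished, the data may have been saved to a directory that NLTK does not search on your machine.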

In [2]:
```
from nltk.tokenize import sent_tokenize, word_tokenize

txt = "Hi, my name is Bohao. Welcome to ECE590.08: natural language processing class. Don't be shy to ask any questions."

print(sent_tokenize(txt))
print(word_tokenize(txt))
```

```
['Hi, my name is Bohao.', 'Welcome to ECE590.08: natural language processing class.', "Don't be shy to ask any questions."]
['Hi', ',', 'my', 'name', 'is', 'Bohao', '.', 'Welcome', 'to', 'ECE590.08', ':', 'natural', 'language', 'processing', 'class', '.', 'Do', "n't", 'be', 'shy', 'to', 'ask', 'any', 'questions', '.']
```


In the first line of output, you'll see that the input text has been divided into three sentences. NLTK uses a fairly robust method rather than simply splitting on '.': 'ECE590.08' stays within a single sentence.
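
For contrast, here is what a naive split on '.' does to the same text. This is plain Python with no NLTK involved, and the function name `naive_sentence_split` is made up for this illustration.

```python
txt = ("Hi, my name is Bohao. Welcome to ECE590.08: natural language "
       "processing class. Don't be shy to ask any questions.")


def naive_sentence_split(text):
    """Split on every '.', strip whitespace, and drop empty pieces."""
    pieces = [piece.strip() for piece in text.split(".")]
    return [piece for piece in pieces if piece]


naive = naive_sentence_split(txt)
print(len(naive))  # 4 pieces instead of 3 sentences
for piece in naive:
    print(repr(piece))
```

The course number is cut in half ('Welcome to ECE590' and '08: natural language processing class'), which is exactly the mistake `sent_tokenize` avoids.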

Also, the second line divides 'Don't' into two parts ('Do' and "n't"), which marks the negation.
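
To give a feel for how such a split can be done, here is a heavily simplified regular-expression sketch. It is not NLTK's actual implementation, handles only the "n't" contraction, and the function name `split_negation` is hypothetical.

```python
import re


def split_negation(text):
    """Tokenize `text`, separating a trailing "n't" from the word it is attached to."""
    # Put a space in front of "n't" so that "Don't" becomes "Do n't".
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Tokens are: the contraction itself, runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", spaced)


print(split_negation("Don't be shy."))
```

Unlike `word_tokenize`, this toy version would also break 'ECE590.08' apart at the dot, so treat it as an illustration of the idea only.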

You'll know how to do these 'tricks' after taking this course : )

---

## 4. Some Useful Resources

1. Search engines: whenever you're stuck, for example unsure about an algorithm or about what to do with an error message, just search for it. You'll find __you are not the only one__. Lots of people may have already asked the same question on StackOverflow/Quora/... You can see I cited lots of SO links in this Jupyter notebook, and each one represents a search query I made while writing this tutorial
2. Some open classes online:
    - [Stanford CS224n](http://web.stanford.edu/class/cs224n/): a very famous course, a little bit deep learning oriented
    - [Coursera NLP Course](https://www.coursera.org/learn/natural-language-processing): a very good one, but no upcoming session is available
    - Also, check this [Quora page](https://www.quora.com/What-is-the-best-natural-language-processing-MOOC)
3. Some materials:
    - [The Hitchhiker’s Guide to Python!](https://docs.python-guide.org/): this is very detailed; you'll probably find everything you need for Python
    - [Duke STA663](http://people.duke.edu/~ccc14/sta-663-2017/index.html): these are the class notes we had when taking STA663 (statistical computing) last semester. The first few chapters are an excellent introduction to Python, Jupyter Notebook and other necessary tools for data analysis