### J-Term 2017, Harvard GSD:
### Introduction to Data Science for Building Simulation
***
Instructor: Jung Min Han, elliehan07@gmail.com <br>
Teaching Assistant: NJ Namju Lee, nj.namju@gmail.com <br>
Date/Time: Jan 9-12/ 1:00 - 3:00 p.m. <br>
Location: 20 Sumner/Room 1-D<br>
***

### PART A - 1

# Anaconda Crash Course: Using Anaconda (“Conda”) to Supplement Python

***

Python is a versatile scripting language for data management and analysis; however, managing a Python project environment can be nuanced and tricky. Anaconda is a platform built to complement Python by providing customizable, easily accessible environments in which you can run Python scripts with diverse libraries.

For reference, the Anaconda homepage is found at the following address.

https://www.continuum.io/why-anaconda

The following tutorial runs through the installation of Anaconda, and then introduces you to the concepts behind Anaconda that make it a nice and useful Python development environment.

***

### Install Anaconda (aka Conda)

The Anaconda homepage contains the materials that you need to install Anaconda on your machine. You will primarily be using Anaconda through the command line, so you will have to get comfortable working on the command line. 

## 1. Check Anaconda Version and Install

The first step is to open Terminal and check whether you have Anaconda installed. If not, we will install it. To check the version, use the following commands.

#### i. Open Terminal
#### ii. Check Version

The syntax to access Anaconda on the command line is simply ‘conda’. To check the version you have installed, use the following:

```sh
	conda info
```

If you have it installed, you will see platform information, version details, and environment paths after you hit enter. If not, the terminal will not recognize the command.


#### iii. Install Anaconda

To install ‘Conda’, navigate to the Anaconda downloads page at:

[Anaconda Homepage and Downloads](https://www.continuum.io/downloads)

Here, pick your system (Mac/Windows) and the Python version. In our case, we are going to pick Mac and select version 2.7. Use the graphical installer; it provides a wizard that steps us through the installation process. Download the installer, double-click the package file, and follow the instructions.

If you run into problems installing, or you get an error stating that Anaconda cannot be installed in the default location, visit this page for short instructions on how to troubleshoot the installation.

[Anaconda Installation Docs](http://docs.continuum.io/anaconda/install#anaconda-install)

Anaconda is contained in one directory, so if you ever need to uninstall Anaconda, use Terminal to remove the entire Anaconda directory using **rm -rf ~/anaconda**.

We used Python 2.7 and not Python 3. The main reasons are that Linux distributions and Mac still use Python 2.7 as the default, and that the Python ecosystem has amassed a significant amount of quality software over the years, some of which does not yet work on Python 3. Python 3, however, is designed to be more robust and fixes a lot of bugs in Python 2, so expect to see a continued migration to Python 3 in the future. For now, Python 2 will work just fine, and since we are using Anaconda, it will be easy to set up a Python 3 environment if we want one at some point!

## 2. Confirm the installation worked properly

Once we are finished with the installation, check to make sure it installed correctly by performing a version check.

```sh
	conda info
```

If you see a 3.XX version number along with platform and environment information, the installation worked. Now we can begin working with Conda.

***

## 3. The Anaconda 30-minute Test Drive

Now let’s familiarize ourselves with what exactly Anaconda allows us to do. On a basic level, Anaconda is a Python distribution that adds many features and streamlines work with the language. It does this by creating specific environments on your machine in which you can specify the packages that are installed and used, and it lets you easily toggle between environments. Within the individual environments, you can perform analysis, run scripts, and develop code.

Environments are the bread and butter of Anaconda. Not all Python scripts you run will use the same packages, so environments let you customize exactly what you need and create a sandbox for trying new things. Each environment saves the packages you have installed, allowing you to load it and run your scripts easily.

The Anaconda team has put together a very nice Test Drive, designed to take about half an hour, that introduces you to the concepts around Anaconda: setting up an environment, toggling between environments, managing the Python version you are using, managing the Python packages in your environments, and finally, removing or uninstalling packages and environments you no longer need.

Follow the Test Drive at the following link:

[Anaconda 30-minute Test Drive](http://conda.pydata.org/docs/test-drive.html)

Working with Anaconda can make working with Python a much more pleasant experience. For additional resources, including cheatsheets and useful links, see the Additional Reading and Resources section at the end of this document.

***

## 4. Fire up Python - Web Scraping using Beautiful Soup

Let's scrape some data. For the last part of this exercise, we are going to get into a bit of Python. Our goal will be to scrape some data from a simple website to create a dataset of rat incidents in Greater Boston, using a Python library called Beautiful Soup. To do so, we will create an Anaconda environment in which we can install the modules and packages that we need to run the scraper (Beautiful Soup and Requests). To start, let's create an environment, like we just learned about in the Anaconda Test Drive.

The website we are going to scrape is here.

[Rat Incidents in Greater Boston](http://duspviz.mit.edu/_assets/data/sample.html)

Let's get started!


#### i. Create Virtual Environment in Conda

The first step is to create our virtual environment. In this environment, we can set up the packages and program versions we need to optimally run our script. Create an environment called **souperscraper** using the following syntax in terminal. We are going to install the Beautiful Soup program when we create the environment.

```sh
	conda create --name souperscraper beautifulsoup4
```

Anaconda will ask to install a handful of new packages that Python will work with in this environment, including Beautiful Soup version 4. When prompted, hit **y** to continue the installation. You should see that we are using Python 2.7, Beautiful Soup 4.4.1, and a handful of others.

After creating the environment, launch the environment by typing the following. This will launch a virtual environment and allow us to access and run Python using the version and all packages we have installed into it.

```sh
	source activate souperscraper
```

All work we do now will be in our **souperscraper** environment.
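One way to confirm that the environment is active is to ask Python itself which interpreter is running, once you have started Python from this terminal. The sketch below is only an illustration and is not part of the Test Drive; the helper name `interpreter_info` is hypothetical. Inside an activated environment, the reported paths would normally sit inside that environment's directory.

```python
import sys


def interpreter_info():
    """Collect the path and version of the Python interpreter running this code."""
    return {
        "executable": sys.executable,  # full path to the python binary
        "prefix": sys.prefix,          # root directory of the installation or environment
        "version": "{0}.{1}".format(sys.version_info[0], sys.version_info[1]),
    }


info = interpreter_info()
for key in sorted(info):
    print("{0}: {1}".format(key, info[key]))
```

If the paths point somewhere other than the **souperscraper** environment, the environment is probably not active in this terminal window.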

#### ii. Add Packages to Environment

The next step is to add a package we can use to access websites in our script. We have already installed Beautiful Soup, so the next package we want to install is called **requests**. Requests is a package that lets us make HTTP calls to websites from right inside our script. To install requests, use **one** of the following commands.

Using [Easy Install](http://peak.telecommunity.com/DevCenter/EasyInstall) (a Python Packager, comes with Python/Conda)

```sh
	easy_install requests
```

or [PIP](https://pip.pypa.io/en/stable/) (another Python Module Packager, perhaps more common, comes with Python/Conda)

```sh
	pip install requests
```

Both of these methods will install requests. [Requests](http://docs.python-requests.org/en/latest/) is a library that makes it easy to access code from pages across the web, and view it in various forms. For data scraping, it is immensely useful.
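To check that both packages are available before writing the scraper, you can try importing them from Python. This is a minimal sketch under the assumption that a failed import means the package is missing from the active environment; the helper name `missing_packages` is made up for this example.

```python
import importlib


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# 'bs4' is the import name of the beautifulsoup4 package.
required = ["requests", "bs4"]
not_found = missing_packages(required)
if not_found:
    print("Missing packages: " + ", ".join(not_found))
else:
    print("All required packages are installed.")
```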

The workflow is the following: we will call a website using 'requests' and then parse it into components using 'beautifulsoup4'. Also, at this time, set your working directory in terminal to where you want files to save and update. To see your current working directory, type in **pwd** in terminal (this stands for Print Working Directory). To change, use **cd** followed by the path.
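The same check can be made from inside Python with the standard `os` module, which is handy once the interpreter is already running. The sketch below mirrors **pwd** and **cd**; the temporary directory is used only as an example destination and is an assumption of this example, so substitute your own project folder.

```python
import os
import tempfile

# Equivalent of `pwd`: show the directory Python will read from and write to.
start_dir = os.getcwd()
print("Current working directory: " + start_dir)

# Equivalent of `cd`: move to another directory (here, the system temp folder).
os.chdir(tempfile.gettempdir())
print("Now working in: " + os.getcwd())

# Return to where we started so later file paths behave as expected.
os.chdir(start_dir)
```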

#### iii. Simple Scraping Script using Beautiful Soup and Requests

A good way to see how useful Python is, especially for working with data, is to collect a bit of data yourself. Open up your terminal to start working with Beautiful Soup on the command line. Start by firing up Python by entering the following.

```sh
	python
```

Python will launch in the terminal. Next, let's import modules. **import requests** imports the requests module, and **import bs4** imports the Beautiful Soup library.

In [2]:
import requests
import bs4

#### Testing out Requests

Requests will allow us to load a webpage into Python so that we can parse it and manipulate it. Test this by running the following. I am using a really simple page from the Beautiful Soup documentation to explain what is happening here.

In [3]:
response = requests.get('http://duspviz.mit.edu/_assets/data/sample.html')
print response.text

<html><head><title>Where are the rats?</title></head>
<body>
<p class="title"><b>Rat Incidents in Greater Boston</b></p>

<p class="story">The following is rodent incident data for 
<a href="http://example.com/boston" class="link" id="link1">Boston</a>,
<a href="http://example.com/brookline" class="link" id="link2">Brookline</a>,
<a href="http://example.com/cambridge" class="link" id="link2">Cambridge</a>, and
<a href="http://example.com/somerville" class="link" id="link3">Somerville</a>;
and it only available here.</p>

<table>
	<thead>
		<tr>
			<th>City</th>
			<th># of rats</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td class="city">Cambridge</td>
			<td class="number">400</td>
		</tr>
		<tr>
			<td class="city">Boston</td>
			<td class="number">900</td>
		</tr>
		<tr>
			<td class="city">Somerville</td>
			<td class="number">300</td>
		</tr>
		<tr>
			<td class="city">Brookline</td>
			<td class="number">600</td>
		</tr>
	</tbody>
</table>

</body>



This allowed us to access all of the content from the source code of the webpage with Python, which we can now parse to extract data. Pretty cool!
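Before parsing, it is worth confirming that the request actually succeeded, since an error page is also returned as text. The following is a minimal sketch, not part of the original exercise: the wrapper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, while `raise_for_status()` and the `timeout` argument are standard parts of Requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising an error on a bad response."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so we never try to parse an error page by mistake.
    response.raise_for_status()
    return response.text


# Usage (requires a network connection):
# html = fetch_html('http://duspviz.mit.edu/_assets/data/sample.html')
```

Wrapping the call in a function also makes it easy to reuse the same checks for every page you scrape.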

#### Testing out Beautiful Soup

Our next big step is to test out Beautiful Soup. Let's talk about what this is...

### What is Beautiful Soup?

Beautiful Soup is a Python library for parsing data out of HTML and XML files (such as webpages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. The major concept with Beautiful Soup is that it allows you to access elements of your page by following the HTML and CSS structure, such as grabbing all links, all headers, specific classes, and more. It is a powerful library. Once we grab elements, Python makes it easy to write the elements, or relevant components of the elements, into other files, such as a CSV, that can be stored in a database or opened in other software.

The sample webpage we are using contains data on 'rodent incidents' in the greater Boston area. Let's use this file to explore the tree, and extract some data.

#### iv. Make the Soup

First, we have to turn the website code into a Python object. We have already imported the Beautiful Soup library, so we can start calling some of the methods in the library. Replace **print response.text** with the following. This turns the text into a Python object named **soup**.

An important note: you should tell Beautiful Soup which parser to use. This is done in the second argument of the BeautifulSoup constructor. If you leave it out, Beautiful Soup picks the best parser it can find on your machine and prints a warning, so results can differ between computers. Python's built-in parser is selected with **html.parser**.

You can also use **lxml** or **html5lib**, which must be installed separately. This is nicely described in the [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser). For our purposes, the built-in parser is fine.

Using the Beautiful Soup **prettify()** function, we can print the page to see the code printed in a readable and legible manner.

In [4]:
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

<html>
 <head>
  <title>
   Where are the rats?
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Rat Incidents in Greater Boston
   </b>
  </p>
  <p class="story">
   The following is rodent incident data for
   <a class="link" href="http://example.com/boston" id="link1">
    Boston
   </a>
   ,
   <a class="link" href="http://example.com/brookline" id="link2">
    Brookline
   </a>
   ,
   <a class="link" href="http://example.com/cambridge" id="link2">
    Cambridge
   </a>
   , and
   <a class="link" href="http://example.com/somerville" id="link3">
    Somerville
   </a>
   ;
and it only available here.
  </p>
  <table>
   <thead>
    <tr>
     <th>
      City
     </th>
     <th>
      # of rats
     </th>
    </tr>
   </thead>
   <tbody>
    <tr>
     <td class="city">
      Cambridge
     </td>
     <td class="number">
      400
     </td>
    </tr>
    <tr>
     <td class="city">
      Boston
     </td>
     <td class="number">
      900
     </td>
    </tr>
    <tr>
  

At any point, if you need a reference, visit the [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the official descriptions of functions. Prettify is a handy one to see our document in a clean fashion.

#### Navigating the Data Structure

With our data from the webpage nicely laid out, Beautiful Soup now allows us to navigate the data structure. We called our Beautiful Soup object **soup**, so we can run the Beautiful Soup functions on this object. Let's explore some ways to do this. Try entering some of the following into your terminal.

In [5]:
# Access the title element
soup.title

<title>Where are the rats?</title>

In [6]:
# Access the content of the title element
soup.title.string

u'Where are the rats?'

In [7]:
# Access data in the first 'p' tag
soup.p

<p class="title"><b>Rat Incidents in Greater Boston</b></p>

In [8]:
# Access data in the first 'a' tag
soup.a

<a class="link" href="http://example.com/boston" id="link1">Boston</a>

In [9]:
# Retrieve all links in the document (note it returns an array)
soup.find_all('a')

[<a class="link" href="http://example.com/boston" id="link1">Boston</a>,
 <a class="link" href="http://example.com/brookline" id="link2">Brookline</a>,
 <a class="link" href="http://example.com/cambridge" id="link2">Cambridge</a>,
 <a class="link" href="http://example.com/somerville" id="link3">Somerville</a>]

In [10]:
# Retrieve elements by class equal to link using the attributes argument
soup.findAll(attrs={'class' : 'link'})

[<a class="link" href="http://example.com/boston" id="link1">Boston</a>,
 <a class="link" href="http://example.com/brookline" id="link2">Brookline</a>,
 <a class="link" href="http://example.com/cambridge" id="link2">Cambridge</a>,
 <a class="link" href="http://example.com/somerville" id="link3">Somerville</a>]

In [11]:
# Retrieve a specific link by ID
soup.find(id="link3")

<a class="link" href="http://example.com/somerville" id="link3">Somerville</a>

In [12]:
# Access Data in the table (note it returns an array)
soup.find_all('td')

[<td class="city">Cambridge</td>,
 <td class="number">400</td>,
 <td class="city">Boston</td>,
 <td class="number">900</td>,
 <td class="city">Somerville</td>,
 <td class="number">300</td>,
 <td class="city">Brookline</td>,
 <td class="number">600</td>]
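The commands above return whole elements. As a further sketch, the example below pulls out just the link addresses and uses a CSS selector to pick table cells by class. It works on a shortened copy of the sample page so that it runs without a network connection; the names `sample_html` and `sample_soup` are chosen here so that your existing **soup** object is left untouched.

```python
import bs4

# A shortened copy of the sample page, so this runs without a network connection.
sample_html = """
<p class="story">
<a href="http://example.com/boston" class="link" id="link1">Boston</a>,
<a href="http://example.com/somerville" class="link" id="link3">Somerville</a>
</p>
<table><tbody>
<tr><td class="city">Cambridge</td><td class="number">400</td></tr>
<tr><td class="city">Boston</td><td class="number">900</td></tr>
</tbody></table>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Pull the URL and the visible text out of each link element.
links = [(a.get("href"), a.get_text()) for a in sample_soup.find_all("a")]
print(links)

# select() takes a CSS selector; "td.city" matches <td> elements with class "city".
cities = [td.get_text() for td in sample_soup.select("td.city")]
print(cities)
```

Selectors such as `td.city` can be convenient when you already know the page's CSS classes, while `find_all` is often simpler for grabbing every tag of one type.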

#### Working with Arrays

The easiest way to access elements and then either write them to file or manipulate them is to save them as objects themselves. Note that our data is organized into cities and numbers. Let's save these to arrays (Python calls them lists), which are the easiest way to work with the data.

The following gives us an array whose elements we can work with.

In [13]:
data = soup.findAll(attrs={'class':'city'})
print data[0].string
print data[1].string
print data[2].string
print data[3].string

Cambridge
Boston
Somerville
Brookline


You never want to repeat code like this, so turn this into a loop:

In [14]:
data = soup.findAll(attrs={'class':'city'})
for i in data:
    print i.string

Cambridge
Boston
Somerville
Brookline


This array only gives us cities, though. Let's get all of the data elements that have either class **city** or class **number**.

In [15]:
data = soup.findAll(attrs={'class':['city','number']})
print data

[<td class="city">Cambridge</td>, <td class="number">400</td>, <td class="city">Boston</td>, <td class="number">900</td>, <td class="city">Somerville</td>, <td class="number">300</td>, <td class="city">Brookline</td>, <td class="number">600</td>]


We have all of the data that was nested in these tags saved to a Python array. Access the elements of the array by using data[x], where x is the location in the array. In Python, arrays (lists) are indexed starting at 0, so place 1 in the array is data[0], and place 8 is data[7].

In [16]:
print data[0]
print data[1]

<td class="city">Cambridge</td>
<td class="number">400</td>


Right now, we get the whole element with those commands. To get just the content, use the following.

In [17]:
print data[0].string
print data[1].string

Cambridge
400
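Because the array alternates between a city and its number, slicing with a step of two separates the two kinds of values. The sketch below uses a plain list of strings, `values`, as a hypothetical stand-in for the `.string` contents of **data**, so it runs on its own.

```python
# Stand-in for the text of each element in data, in page order.
values = ["Cambridge", "400", "Boston", "900", "Somerville", "300", "Brookline", "600"]

first = values[0]  # place 1 in the list is index 0
last = values[7]   # place 8 in the list is index 7

cities = values[0::2]                    # every second item, starting at index 0
counts = [int(v) for v in values[1::2]]  # every second item, starting at index 1

# Pair each city with its count so values can be looked up by name.
rats_by_city = dict(zip(cities, counts))
print(first + " " + last)
print(rats_by_city["Boston"])
```

Converting the counts with `int()` matters if you plan to sort or sum them, because the scraped values arrive as text.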


#### Write Data to a File using a Simple Loop

Python makes opening a file and writing to it very easy. Let's take this simple dataset and write it to a file that saves in our current working directory. An important note: whatever the working directory is when you start Python will be the root for where your files are read from and written to.

Python also has nice iteration features that allow us to iterate through arrays, lists, and files. In the following example, we manually create a comma-separated document with our data using file writing operations and a while loop.

In pseudo-code:

1. Open up a file to write in and append data.

2. Set up parameters for the while loop

3. Write headers

4. Run while loop that will write elements of the array to file

5. When complete, close the file

Once done, open the file on your machine and see your data. Enter the following code, note what each line is doing.

In [19]:
f = open('rat_data.txt','a') # open file in append mode (created if it does not exist)

p = 0 # initial place in array
l = len(data)-1 # index of the last element in the array

f.write("City, Number\n") # write headers

while p < l: # loop while a city and number pair remains
    f.write(data[p].string + ", ") # write city and add comma
    p = p + 1 # increment
    f.write(data[p].string + "\n") # write number and line break
    p = p + 1 # increment

f.close() # close file

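Open **rat_data.txt** on your machine. It's a CSV with the data from the page! Because the file is opened in append mode, running the code again adds a second copy of the rows, so delete the file first if you want a fresh start.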

---

```sh
City, Number
Cambridge, 400
Boston, 900
Somerville, 300
Brookline, 600
```

---

To see a completed script to extract this data from the rat data website, view **rat_data.txt** in the materials.

Further reading on File Reading and Writing and Iteration in Python can be found at the following links.

[Python File Reading and Writing Documentation](https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files)

[Tutorials Point Python While Loop Statements](http://www.tutorialspoint.com/python/python_while_loop.htm)

We created this CSV manually to illustrate some basic Python. Python also has modules and features that support CSV directly, which are very handy. These are especially useful when you are reading in a CSV, because they give you access to its individual elements. You can read more about the built-in CSV module [here](https://docs.python.org/2/library/csv.html).
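As a rough sketch of what the CSV module looks like in use, the example below writes the same table and reads it back. It is written for Python 3; on Python 2.7 you would instead open the files in binary mode (`'wb'` and `'rb'`) and leave out the `newline` argument. The `rows` list and the file name `rat_data.csv` are stand-ins chosen for this example.

```python
import csv
import os
import tempfile

# Stand-in for the scraped values; in the tutorial these come from the page.
rows = [("Cambridge", 400), ("Boston", 900), ("Somerville", 300), ("Brookline", 600)]

# Write to a temporary folder so the example does not touch your project files.
path = os.path.join(tempfile.mkdtemp(), "rat_data.csv")

# newline="" is what the Python 3 csv documentation recommends when opening files.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["City", "Number"])  # header row
    writer.writerows(rows)               # one line per city

# Read the file back; each row arrives as a list of strings.
with open(path, newline="") as f:
    for row in csv.reader(f):
        print(row)
```

The `with` statement closes the file automatically, and the writer takes care of commas and quoting, which the manual version above leaves to you.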

You've got your feet wet. Over the next weeks, there will be much more to come on Python and data scraping!

***

## Additional Reading and Resources

#### Conda Command Line Cheatsheet -
http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf

#### Mac Command Line Cheatsheet -
https://github.com/0nn0/terminal-mac-cheatsheet/wiki/Terminal-Cheatsheet-for-Mac-(-basics-)

#### Python Documentation -
https://docs.python.org/2/library/index.html

#### Beautiful Soup Tutorials -
http://www.crummy.com/software/BeautifulSoup/bs4/doc/