# Importing data into pandas

There are tons of ways you can get data into a pandas dataframe. Here are a few of the more common ones.

First, let's import pandas `as` pd.

In [None]:
import pandas as pd

### From a CSV file

Pass the path to your file to `pd.read_csv()`. If your data file is delimited with something other than a comma, you'll need to specify the delimiter in the `sep` argument. For example: `pd.read_csv('../data/my-delimited-file.txt', sep='|')`

In [None]:
csv_df = pd.read_csv('../data/mlb.csv')

In [None]:
csv_df.head()
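
To see the `sep` argument in action without needing a file on disk, here's a minimal sketch that reads pipe-delimited text from an in-memory buffer. The sample rows and the names `pipe_text` and `pipe_df` are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical pipe-delimited data standing in for a file on disk
pipe_text = "name|team|salary\nPlayer A|Team X|1000\nPlayer B|Team Y|2000\n"

# StringIO lets read_csv treat the string like an open file;
# sep='|' tells pandas to split columns on the pipe character
pipe_df = pd.read_csv(StringIO(pipe_text), sep='|')

print(pipe_df)
```

The same `sep` value should work when you pass a file path instead of a buffer.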

### From a CSV file on the Internet

Just pass in the URL. This example uses the official results of the fall 2016 election in Nebraska.

In [None]:
csvi_df = pd.read_csv('http://electionresults.sos.ne.gov/resultsCSV.aspx?text=All')

In [None]:
csvi_df.head()

### From an Excel file

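First, you'll need to install a module that can read Excel files. For `.xlsx` files like the one in this example, recent versions of pandas use `openpyxl`; `xlrd` version 2.0 and later reads only the older `.xls` format. (If you're using pipenv, the command would be: `pipenv install openpyxl`)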

You might want to specify the `sheet_name` to select a worksheet -- the default is "the first one."

In [None]:
xl_df = pd.read_excel('../data/homicides2014.xlsx', sheet_name='Murders')

In [None]:
xl_df.head()

### From a Python data collection

Maybe the work you're doing in pandas happens downstream of some other Python processing, so the data exists as, say, a list of dictionaries. You can turn this (and other Python data collections, like a list of lists) into a pandas dataframe, too.

In [None]:
test_data = [
    {'name': 'Cody Winchester', 'job': 'Training director', 'location': 'Colorado Springs, CO'},
    {'name': 'Guy Fieri', 'job': 'Gourmand', 'location': 'Flavortown'},
    {'name': 'Sarah Huckabee Sanders', 'job': 'Spokeswoman', 'location': 'Washington, D.C.'}
]

py_df = pd.DataFrame(test_data)

In [None]:
py_df.head()
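
A list of lists works much the same way, except there are no dictionary keys to supply the column names, so you'd typically pass them yourself in the `columns` argument. A minimal sketch, with the names `rows` and `lol_df` chosen just for this example:

```python
import pandas as pd

# Each inner list is one row of data
rows = [
    ['Cody Winchester', 'Training director'],
    ['Guy Fieri', 'Gourmand'],
]

# Without the columns argument, pandas would label the columns 0 and 1
lol_df = pd.DataFrame(rows, columns=['name', 'job'])

print(lol_df)
```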

### From an HTML table

This one requires you to install and specify the HTML parsing engine of your choice -- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [lxml](http://lxml.de/). If you don't specify one, pandas tries `lxml` first and falls back to BeautifulSoup (with `html5lib`) if that fails.

Pulling data from an HTML table can be hit and miss, depending on how hairy the underlying HTML is, but it's good to know that it's an option.

In this example, we've installed `BeautifulSoup` (alias `bs4`) with pipenv and we're going to import [a table of media witnesses](https://www.tdcj.state.tx.us/death_row/dr_media_witness_list.html) to Texas death row executions.

We're going to pass four things to [the pandas `read_html()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html):
1. A string containing the URL we want to scrape
2. The `flavor` of parser that we'd like to use to process the HTML (`bs4`)
3. The HTML attributes of the table we're targeting (in this case, the table has a `class` called `tdcj_table`)
4. The index number of the row in the table to use as the `header`, or column names (usually it's 0 -- the first one)

Reading through the documentation for this method, we also notice that it returns a *list* of matching tables as dataframes, even when only one table matches. Our arguments were specific enough that there's only one item in the returned list, so we can just grab it with `[0]`.

👉For more details on selecting items from Python lists, see the "collections of things" section in [this notebook](../appendix/Python%20data%20types%20and%20basic%20syntax.ipynb#Collections-of-things:-Lists-and-dictionaries).

In [None]:
html_df = pd.read_html('https://www.tdcj.state.tx.us/death_row/dr_media_witness_list.html',
                       flavor='bs4',
                       attrs={'class': 'tdcj_table'},
                       header=0)[0]

In [None]:
html_df.head()

### From JSON

JSON stands for **J**ava**S**cript **O**bject **N**otation. It's a common data interchange format on the web.

Pandas can slurp in data from a local `.json` file, or from a URL -- say, a JSON API with data on dogs and cats registered in the Sunshine Coast Region of Australia.

In [None]:
json_df = pd.read_json('https://data.sunshinecoast.qld.gov.au/resource/44qj-t4fr.json')

In [None]:
json_df.head()
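
If the URL isn't reachable, you can try the same idea on a small JSON string. This is a minimal sketch: the sample records and the names `raw_json` and `sample_json_df` are made up for illustration and aren't the real Sunshine Coast data.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON standing in for an API response:
# a list of records, one object per row
raw_json = '[{"name": "Rex", "animal": "Dog"}, {"name": "Tom", "animal": "Cat"}]'

# Wrapping the string in StringIO lets read_json treat it like a file
sample_json_df = pd.read_json(StringIO(raw_json))

print(sample_json_df)
```

Each key in the JSON objects becomes a column, and each object becomes a row.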