# Importing data into pandas

There are many ways to get data into a pandas dataframe. Here are some of the more common ones.

In [2]:
import pandas as pd

### From a CSV file

If your data file is delimited with something other than a comma, you'll need to specify that in the `sep` argument. For example: `pd.read_csv('../data/my-delimited-file.txt', sep='|')`

In [2]:
csv_df = pd.read_csv('../data/mlb.csv')

In [3]:
csv_df.head()

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
1,Zack Greinke,ARI,SP,31876966,2016,2021,6
2,David Price,BOS,SP,30000000,2016,2022,7
3,Miguel Cabrera,DET,1B,28000000,2014,2023,10
4,Justin Verlander,DET,SP,28000000,2013,2019,7
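As a quick illustration of the `sep` argument mentioned above, here is a minimal sketch that reads pipe-delimited text. The in-memory `io.StringIO` buffer and the three-column layout are assumptions made for the example; they stand in for a real delimited file on disk.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data standing in for a file on disk
raw = io.StringIO(
    "NAME|TEAM|SALARY\n"
    "Clayton Kershaw|LAD|33000000\n"
    "Zack Greinke|ARI|31876966\n"
)

# sep tells read_csv which character separates the columns
pipe_df = pd.read_csv(raw, sep='|')

print(pipe_df.head())
```

Without `sep='|'`, pandas would look for commas, find none, and put each whole line into a single column.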


### From a CSV file on the Internet

Just pass in the URL. This example uses the official results of the fall 2016 election in Nebraska.

In [10]:
csvi_df = pd.read_csv('http://electionresults.sos.ne.gov/resultsCSV.aspx?text=All')

In [11]:
csvi_df.head()

Unnamed: 0,RaceID,RaceName,PartyCode,AreaType,AreaNum,OfficeSeqNo,BallotOrder,CandidateID,CandidateName,CurrentDateTime,VoteFor,CandidateVotes,CandidatePercentage,PrecinctsReporting,PartialPrecinctsReporting
8457,For President and Vice President of the United...,LIB,SW,,91,,1424,Gary Johnson and Bill Weld,3/30/2018 4:42:28 PM,1,14075,0.050716,0/0,0/0,
8457,For President and Vice President of the United...,DEM,SW,,91,,1421,Hillary Clinton and Tim Kaine,3/30/2018 4:42:28 PM,1,100465,0.362003,0/0,0/0,
8457,For President and Vice President of the United...,REP,SW,,91,,1420,Donald J. Trump and Michael R. Pence,3/30/2018 4:42:28 PM,1,159609,0.575116,0/0,0/0,
8457,For President and Vice President of the United...,PET,SW,,91,,1646,Jill Stein and Ajamu Baraka,3/30/2018 4:42:28 PM,1,3376,0.012165,0/0,0/0,
8457,For President and Vice President of the United...,PET,SW,,92,,1646,Jill Stein and Ajamu Baraka,3/30/2018 4:42:28 PM,1,3345,0.011777,0/0,0/0,


### From an Excel file

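First, you'll need to install a module that can read Excel files. For `.xlsx` files, recent versions of pandas use `openpyxl`; `xlrd` (version 2.0 and later) only reads the older `.xls` format. (If you're using pipenv, the command would be: `pipenv install openpyxl`)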

You might want to specify the `sheet_name` to select a worksheet -- the default is the first sheet in the workbook (`sheet_name=0`).

In [3]:
xl_df = pd.read_excel('../data/homicides2014.xlsx', sheet_name='Murders')

In [4]:
xl_df.head()

Unnamed: 0,City,State,Population,Murders
0,New York City,NY,8473938,333
1,Los Angeles,CA,3906772,260
2,Chicago,IL,2724121,415
3,Houston,TX,2219933,242
4,Philadelphia,PA,1559062,248


### From a Python data collection

Maybe the work you want to do in pandas exists downstream of some other Python processing you're doing, and the data is, say, in a list of dictionaries. You can turn this into a pandas dataframe, too.

In [8]:
test_data = [
    {'name': 'Cody Winchester', 'job': 'Training director', 'location': 'Colorado Springs, CO'},
    {'name': 'Guy Fieri', 'job': 'Gourmand', 'location': 'Flavortown'},
    {'name': 'Sarah Huckabee Sanders', 'job': 'Spokeswoman', 'location': 'Washington, D.C.'}
]

py_df = pd.DataFrame(test_data)

In [9]:
py_df.head()

Unnamed: 0,job,location,name
0,Training director,"Colorado Springs, CO",Cody Winchester
1,Gourmand,Flavortown,Guy Fieri
2,Spokeswoman,"Washington, D.C.",Sarah Huckabee Sanders
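If the column order matters, or if some dictionaries are missing keys, the `columns` argument of `pd.DataFrame` can help. The sketch below reuses two of the records above; the second one has its `location` key deliberately left out for illustration.

```python
import pandas as pd

records = [
    {'name': 'Guy Fieri', 'job': 'Gourmand', 'location': 'Flavortown'},
    # 'location' is omitted here on purpose to show how gaps are handled
    {'name': 'Cody Winchester', 'job': 'Training director'},
]

# columns sets the column order; keys missing from a record become NaN
ordered_df = pd.DataFrame(records, columns=['name', 'job', 'location'])

print(ordered_df)
```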


### From an HTML table

This one requires you to install an HTML parsing engine -- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [lxml](http://lxml.de/) -- and, optionally, to specify which one to use. The default is `lxml`. (The `bs4` flavor also relies on the `html5lib` package.)

The results can be kind of spotty, depending on how hairy the underlying HTML is, but it's good to know that it's an option.

In this example, we've installed `BeautifulSoup` (alias `bs4`) and we're going to import [a table of media witnesses](https://www.tdcj.state.tx.us/death_row/dr_media_witness_list.html) to Texas death row executions.

We're going to pass four things to [the `read_html()` function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html):
1. The URL we want to scrape, as a string
2. The `flavor` of parser we're using (`bs4`)
3. The HTML attributes of the table we're targeting (in this case, the table has a `class` called `tdcj_table`)
4. Which row of the table to use as the `header` (usually it's 0 -- the first one)

Reading through the documentation, we also notice that `read_html()` returns a *list* of matching tables as dataframes, even when only one table matches. Our arguments were specific enough that there's only one item in the returned list, so we can just grab it with `[0]`.

In [13]:
html_df = pd.read_html('https://www.tdcj.state.tx.us/death_row/dr_media_witness_list.html',
                       flavor='bs4',
                       attrs={'class': 'tdcj_table'},
                       header=0)[0]

In [14]:
html_df.head()

Unnamed: 0,Execution,Link,Last Name,First Name,TDCJ Number,Date,Media Witness List
0,549,Offender Information,"Rodriguez, III",Rosendo,999534,3/27/2018,"Michael Graczyk, Associated Press; Cody Stark,..."
1,548,Offender Information,Battagllia,John,999412,2/1/2018,"Michael Graczyk, Associated Press; Cody Stark,..."
2,547,Offender Information,Rayford,William,999371,01/30/2018,"Michael Graczyk, Associated Press; Cody Stark,..."
3,546,Offender Information,Shore,Anthony,999488,01/18/2018,"Michael Graczyk, Associated Press; Cody Stark,..."
4,545,Offender Information,Cardenas,Ruben,999275,11/8/2017,"Michael Graczyk, Associated Press; Cody Stark,..."
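To see the list-of-tables behavior without hitting a live site, here is a minimal sketch that parses a small HTML snippet. The snippet, the class names `demo_table` and `other_table`, and the rows are all made up for the example. It uses the default parser rather than `bs4`, and it catches the `ImportError` pandas raises when the required parser isn't installed.

```python
import io

import pandas as pd

# Hypothetical page with two tables; only one has the class we want
html = """
<table class="other_table">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
<table class="demo_table">
  <tr><th>Last Name</th><th>Date</th></tr>
  <tr><td>Doe</td><td>1/2/2018</td></tr>
  <tr><td>Roe</td><td>3/4/2018</td></tr>
</table>
"""

try:
    # attrs narrows the match to tables with class="demo_table"
    tables = pd.read_html(io.StringIO(html),
                          attrs={'class': 'demo_table'},
                          header=0)
except ImportError:
    # Raised when the parser pandas needs is not installed
    tables = None
    print('No HTML parser available; install lxml or bs4 + html5lib')

if tables is not None:
    # read_html always returns a list, even for a single match
    print(len(tables))
    demo_df = tables[0]
    print(demo_df.head())
```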
