# Data loading and storage
Pandas provides functions that read tabular data into a DataFrame object. The options for these functions fall into a few categories:
* Indexing: treat one or more columns as the index of the returned DataFrame, and take the column names from the file or from the user
* Type Inference: includes user-defined value conversions and custom lists of missing-value markers
* Datetime parsing: includes the ability to combine date and time information spread across several columns
* Iterating: support for iterating over chunks of very large files
* Unclean data issues: skipping rows or a footer, comments, or other minor things
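
To make these categories concrete, here is a minimal sketch that uses one option from most of them in a single __read_csv__ call. The in-memory file and its column names (`id`, `when`, `score`) are made up for illustration and are not part of the Titanic data used below.

```python
import io
import pandas as pd

# Hypothetical messy file: a comment line on top, a custom missing-value
# marker in the data, and a summary line at the bottom.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "id,when,score\n"
    "1,2020-01-05,3.5\n"
    "2,2020-01-06,missing\n"
    "3,2020-01-07,4.0\n"
    "total rows: 3\n"
)

df_opts = pd.read_csv(
    raw,
    comment="#",            # unclean data: ignore comment lines
    index_col="id",         # indexing: use a column as the row index
    parse_dates=["when"],   # datetime parsing
    na_values=["missing"],  # type inference: extra missing-value marker
    skipfooter=1,           # unclean data: drop the trailing summary line
    engine="python",        # skipfooter needs the python parsing engine
)
print(df_opts.dtypes)
```

Iterating over chunks is left out here because it changes the return type of __read_csv__.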

## Type Inference
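With type inference you do not have to state which columns are integer, floating point, or string: pandas inspects the values in each column and picks a dtype automatically.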

In [1]:
!head train.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S


In [2]:
import pandas as pd
df = pd.read_csv('train.csv')

In [3]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Alternatively, one can use __read_table__ and specify the delimiter explicitly, since its default separator is a tab rather than a comma.

In [4]:
pd.read_table('train.csv', sep = ',').dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
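
The inferred dtypes above can be overridden when they are not what you want. Note that `Age` was inferred as `float64` rather than `int64` because it contains missing values. The sketch below uses a three-row excerpt of the same data held in memory; the choice of a categorical `Pclass` and an upper-cased `Sex` is only an illustrative assumption.

```python
import io
import pandas as pd

# A few rows and columns copied from train.csv, kept in memory
# so the example does not depend on the file being present.
sample = io.StringIO(
    "PassengerId,Pclass,Sex,Age,Ticket\n"
    "1,3,male,22,A/5 21171\n"
    "6,3,male,,330877\n"
    "4,1,female,35,113803\n"
)

typed = pd.read_csv(
    sample,
    # dtype fixes the type of a column instead of inferring it
    dtype={"Pclass": "category", "Ticket": str},
    # converters apply a user-defined function to each raw string value
    converters={"Sex": lambda s: s.strip().upper()},
)
print(typed.dtypes)
```

Without the `dtype` entry for `Ticket`, a file whose ticket values were all digits would be read as integers, which is rarely what one wants for an identifier.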

In [5]:
pd.read_csv('train.csv', index_col = 'PassengerId').head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
pd.read_csv('train.csv', skiprows = [1, 4, 5]).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
2,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
3,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
4,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


## Reading Text files in pieces
When processing very large files, one can read only a small piece of a file or iterate through it in smaller chunks. The __nrows__ option reads just the first rows of the file.

In [7]:
pd.read_csv('train.csv', nrows = 8)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


To iterate over a file in pieces, pass __chunksize__ as a number of rows. __read_csv__ then returns a __TextFileReader__ object that yields one DataFrame of at most __chunksize__ rows per iteration.

In [8]:
chunker = pd.read_csv('train.csv', chunksize=100)

for piece in chunker:
    print(piece['Sex'].value_counts())

male      61
female    39
Name: Sex, dtype: int64
male      68
female    32
Name: Sex, dtype: int64
male      64
female    36
Name: Sex, dtype: int64
male      54
female    46
Name: Sex, dtype: int64
male      68
female    32
Name: Sex, dtype: int64
male      62
female    38
Name: Sex, dtype: int64
male      72
female    28
Name: Sex, dtype: int64
male      68
female    32
Name: Sex, dtype: int64
male      60
female    31
Name: Sex, dtype: int64
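
The per-chunk counts above are usually only an intermediate step. One possible way to combine them into totals for the whole file is sketched here; it uses a small made-up in-memory file with a single `Sex` column rather than `train.csv`.

```python
import io
import pandas as pd

# Hypothetical 12-row file: 8 "male" and 4 "female" entries.
csv_text = "Sex\n" + "male\nfemale\nmale\n" * 4
reader = pd.read_csv(io.StringIO(csv_text), chunksize=5)

# Count within each chunk, then add up the counts that share a label.
counts = [piece["Sex"].value_counts() for piece in reader]
total = pd.concat(counts).groupby(level=0).sum().sort_values(ascending=False)
print(total)
```

Only one chunk of rows is held in memory at a time; the list of per-chunk counts stays small because it has one entry per distinct value per chunk.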


To write data out to a text file, both __Series__ and __DataFrame__ have a [__to_csv__](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) method that writes the data to a comma-separated file.
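
As a small sketch of __to_csv__, the example below writes a made-up two-row frame to an in-memory buffer instead of a file, using a different delimiter and an explicit marker for missing values.

```python
import io
import pandas as pd

# Hypothetical frame with one missing value.
frame = pd.DataFrame({"a": ["x", "y"], "b": [1.5, None]})

buf = io.StringIO()  # stands in for a file path or an open file
frame.to_csv(buf, sep="|", na_rep="NULL", index=False)
out_text = buf.getvalue()
print(out_text)
```

Passing `index=False` drops the row labels; without it the index is written as an unnamed first column.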

## Web Scraping
The lxml library parses HTML and XML, and it performs consistently well on very large files. To start, open the URL and parse the resulting stream with lxml:

In [9]:
from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/quote/AAPL/options?ltr=1'))

doc = parsed.getroot()

Using this object, you can extract all HTML tags of a particular type with __findall__. Links, for example, are __a__ tags.

In [10]:
links = doc.findall('.//a')
links[15:20]

[<Element a at 0x7fceace4dc78>,
 <Element a at 0x7fceace4dcc8>,
 <Element a at 0x7fceace4dd18>,
 <Element a at 0x7fceace4dd68>,
 <Element a at 0x7fceace4ddb8>]

These objects represent HTML elements. To get the URL and the link text, use each element's __get__ method (with the attribute name __href__) and its __text_content__ method.

In [11]:
links[20].get('href')

'/topic/finalround'

In [12]:
links[20].text_content()

'The Final Round'

In [13]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]

['http://finance.yahoo.com/broker-comparison?bypass=true',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_MAIL_ML&locale=en_US&id=SLN2310&actp=productlink',
 'http://help.yahoo.com/l/us/yahoo/finance/',
 'https://yahoo.uservoice.com/forums/382977',
 'http://info.yahoo.com/privacy/us/yahoo/',
 'http://info.yahoo.com/relevantads/',
 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
 'http://twitter.com/YahooFinance',
 'http://facebook.com/yahoofinance',
 'http://yahoofinance.tumblr.com']

Finding the tables in the document works the same way: search for __table__ tags.

In [14]:
tables = doc.findall('.//table')
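
Once a table element has been found, its rows can be turned into a DataFrame. The following is a minimal sketch that runs on a small inline HTML string rather than the live page; the table contents and the helper name `unpack` are made up for illustration.

```python
from lxml.html import fromstring
import pandas as pd

# Hypothetical page with a single table.
html = """
<html><body>
<table>
  <tr><th>Strike</th><th>Last</th></tr>
  <tr><td>100</td><td>12.5</td></tr>
  <tr><td>105</td><td>9.1</td></tr>
</table>
</body></html>
"""

page = fromstring(html)
table = page.findall('.//table')[0]

def unpack(row, kind='td'):
    """Return the text of every cell of the given kind in a table row."""
    return [cell.text_content().strip() for cell in row.findall('.//' + kind)]

rows = table.findall('.//tr')
header = unpack(rows[0], kind='th')   # header cells are th tags
body = [unpack(r) for r in rows[1:]]  # data cells are td tags

# Cell text is always a string, so convert the columns to numbers.
options = pd.DataFrame(body, columns=header).apply(pd.to_numeric)
print(options)
```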

## Interacting with HTML and Web APIs
Many websites have public APIs that provide data feeds as JSON or in some other format, and one can use those APIs to fetch data. One way to call them from Python is the __requests__ package.

In [17]:
import requests
url = 'https://twitter.com/search?q=python%20pandas&src=typd&lang=en'
resp = requests.get(url)

In [18]:
resp

<Response [200]>

In [25]:
resp.text.count('href')

334
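
When an endpoint returns JSON instead of HTML, the decoded records can be passed straight to the DataFrame constructor. The sketch below does not make a network call: the `payload` string is a made-up stand-in for a response body, and its field names are assumptions.

```python
import json
import pandas as pd

# Hypothetical JSON body, as an API might return it.
payload = """
[
  {"id": 1, "user": "alice", "text": "pandas tip", "lang": "en"},
  {"id": 2, "user": "bob", "text": "read_csv tricks", "lang": "en"}
]
"""

# With a real response object, resp.json() would give the same structure.
records = json.loads(payload)

# Keep only the fields of interest; each record becomes one row.
posts = pd.DataFrame(records, columns=["id", "user", "text"])
print(posts)
```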

## Interacting with Databases
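Pandas has functions that simplify loading the results of a SQL query into a DataFrame. As an example, create an in-memory SQLite database with Python's built-in __sqlite3__ driver and insert a few rows.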

In [27]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER);
"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

In [29]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?,?,?,?)"

con.executemany(stmt, data)
con.commit()

In [30]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()

In [31]:
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [32]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [35]:
pd.DataFrame(rows)

Unnamed: 0,0,1,2,3
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
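
The frame above has integer column labels because the row tuples carry no names. The names are available as the first item of each entry in __cursor.description__, so they can be passed along explicitly. This sketch rebuilds the same table so that it runs on its own.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany(
    "INSERT INTO test VALUES (?,?,?,?)",
    [('Atlanta', 'Georgia', 1.25, 6),
     ('Tallahassee', 'Florida', 2.6, 3),
     ('Sacramento', 'California', 1.7, 5)],
)

cursor = con.execute('select * from test')
# Each description entry is a 7-tuple whose first item is the column name.
names = [col[0] for col in cursor.description]
named = pd.DataFrame(cursor.fetchall(), columns=names)
con.close()
print(named)
```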


There is a simpler way to do the above: __read_sql__ runs the query and builds the DataFrame, column names included, in a single call.

In [37]:
pd.read_sql('select * from test', con)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
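
The round trip also works in the other direction. As a sketch, assuming a plain __sqlite3__ connection, __to_sql__ writes a DataFrame into a table and __read_sql__ accepts query parameters through `params`. The threshold value `4` is arbitrary.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')

cities = pd.DataFrame({
    'a': ['Atlanta', 'Tallahassee', 'Sacramento'],
    'b': ['Georgia', 'Florida', 'California'],
    'c': [1.25, 2.6, 1.7],
    'd': [6, 3, 5],
})
# Create the table from the frame; index=False skips the row labels.
cities.to_sql('test', con, index=False)

# The ? placeholder is filled from params, not by string formatting.
big = pd.read_sql(
    'select a, d from test where d > ? order by d desc',
    con,
    params=(4,),
)
con.close()
print(big)
```

Passing values through `params` instead of building the SQL string by hand avoids quoting mistakes and SQL injection.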
