(c) 2019 Copyright Ashish Lal All rights reserved

In [1]:
import pandas as pd
import numpy as np

# Data Loading, Storage and File Formats

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

## Reading and Writing Data in Text Format

![Fig.1](imgs/pandas_2_001.png)

![Fig 2](imgs/pandas_2_002.png)

We will use read_csv most often.

In [2]:
!cat '../datasets/ex1.csv'

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


Since this file is comma-delimited, we will use read_csv.

In [3]:
df = pd.read_csv('../datasets/ex1.csv')

In [4]:
df.head()

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [5]:
df.shape

(3, 5)

A file will not always have a header. 

In [8]:
df2 = pd.read_csv('../datasets/ex2.csv', header=None)

In [9]:
df2

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [12]:
pd.read_csv('../datasets/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [14]:
pd.read_csv('../datasets/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [15]:
df2 = pd.read_csv('../datasets/ex2.csv', header=None)

In [16]:
df2

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [18]:
df2.set_index(4)

Unnamed: 0_level_0,0,1,2,3
4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [21]:
df2 = df2.set_index(4)

In [23]:
df2

Unnamed: 0_level_0,0,1,2,3
4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [24]:
df2.loc['hello']

0    1
1    2
2    3
3    4
Name: hello, dtype: int64

In [29]:
df2.iloc[2]

0     9
1    10
2    11
3    12
Name: foo, dtype: int64

In [32]:
pd.read_csv('../datasets/csv_mindex.csv')

Unnamed: 0,key1,key2,value1,value2
0,one,a,1,2
1,one,b,3,4
2,one,c,5,6
3,one,d,7,8
4,two,a,9,10
5,two,b,11,12
6,two,c,13,14


In [30]:
parsed = pd.read_csv('../datasets/csv_mindex.csv',index_col=['key1', 'key2'])

In [31]:
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14


In [34]:
parsed.loc['one']

Unnamed: 0_level_0,value1,value2
key2,Unnamed: 1_level_1,Unnamed: 2_level_1
a,1,2
b,3,4
c,5,6
d,7,8


In [35]:
parsed.loc['one', 'b']

value1    3
value2    4
Name: (one, b), dtype: int64

In [38]:
pd.read_csv('../datasets/ex4.csv')

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,# hey!
a,b,c,d,message
# just wanted to make things more difficult for you,,,,
# who reads CSV files with computers,anyway?,,,
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [40]:
pd.read_csv('../datasets/ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
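
An alternative to listing row numbers, sketched here under the assumption that every unwanted line starts with `#`, is the `comment` parameter of read_csv. The inline text below is a stand-in for ex4.csv, reconstructed from the raw output shown above so that the snippet runs without the dataset file.

```python
import io

import pandas as pd

# Inline stand-in for ex4.csv (assumed layout, based on the output above).
raw = """# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# comment='#' discards everything from '#' to the end of a line,
# so lines that begin with '#' are skipped entirely.
df_clean = pd.read_csv(io.StringIO(raw), comment='#')
print(df_clean)
```

This avoids hard-coding line positions, but it would also truncate any data value that contains a `#`, so skiprows remains the safer choice for such files.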


### Reading Text files in pieces

In [42]:
pd.read_csv('../datasets/ex6.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [43]:
pd.read_csv('../datasets/ex6.csv', chunksize=1000)

<pandas.io.parsers.TextFileReader at 0x7ff8cfcad7f0>

The TextFileReader object returned by read_csv lets you iterate over the file in pieces of chunksize rows each. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column:

In [52]:
chunker = pd.read_csv('../datasets/ex6.csv', chunksize=1000)

In [47]:
for piece in chunker:
    print(piece)

          one       two     three      four key
0    0.467976 -0.038649 -0.295344 -1.824726   L
1   -0.358893  1.404453  0.704965 -0.200638   B
2   -0.501840  0.659254 -0.421691 -0.057688   G
3    0.204886  1.074134  1.388361 -0.982404   R
4    0.354628 -0.133116  0.283763 -0.837063   Q
5    1.817480  0.742273  0.419395 -2.251035   Q
6   -0.776764  0.935518 -0.332872 -1.875641   U
7   -0.913135  1.530624 -0.572657  0.477252   K
8    0.358480 -0.497572 -0.367016  0.507702   S
9   -1.740877 -1.160417 -1.637830  2.172201   G
10   0.240564 -0.328249  1.252155  1.072796   8
11   0.764018  1.165476 -0.639544  1.495258   R
12   0.571035 -0.310537  0.582437 -0.298765   1
13   2.317658  0.430710 -1.334216  0.199679   P
14   1.547771 -1.119753 -2.277634  0.329586   J
15  -1.310608  0.401719 -1.000987  1.156708   E
16  -0.088496  0.634712  0.153324  0.415335   B
17  -0.018663 -0.247487 -1.446522  0.750938   A
18  -0.070127 -1.579097  0.120892  0.671432   F
19  -0.194678 -0.492039  2.359605  0.319

In [54]:
chunker = pd.read_csv('../datasets/ex6.csv', chunksize=1000)
for piece in chunker:
    a = piece['key'].nunique()
    print(a)

36
36
36
36
36
36
36
36
36
36


In [55]:
chunker = pd.read_csv('../datasets/ex6.csv', chunksize=1000)
for piece in chunker:
    a = piece['key'].value_counts()
    print(a)

S    48
O    44
F    40
H    39
Q    39
J    39
R    38
G    38
I    37
X    37
V    35
U    33
D    32
E    32
K    31
W    31
L    31
A    30
M    29
Y    28
C    27
T    27
N    27
Z    26
B    25
P    25
7    17
4    17
6    17
3    15
1    13
8    13
9    11
2    11
0     9
5     9
Name: key, dtype: int64
O    48
L    44
X    40
I    39
R    38
F    37
Q    35
D    34
K    33
V    33
E    32
T    31
J    31
H    31
A    31
Z    30
N    30
U    30
M    28
S    27
Y    27
P    26
G    26
W    25
B    24
C    23
5    20
8    19
0    19
9    19
4    18
3    16
1    16
6    14
2    14
7    12
Name: key, dtype: int64
A    40
O    40
E    39
X    39
M    38
H    37
T    36
G    36
K    34
L    34
U    33
F    32
B    32
N    31
V    31
P    30
S    30
J    29
W    29
Z    29
R    28
D    28
Q    28
C    27
Y    27
I    25
6    20
0    19
9    19
7    18
3    15
2    14
5    14
8    14
4    14
1    11
Name: key, dtype: int64
X    43
J    41
D    38
Q    38
V    38
C    37
E    37
N    36


fill_value : None or float value, default None (NaN)

Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
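
To make the effect of fill_value concrete, here is a small sketch with two made-up Series whose labels only partly overlap. The names `first` and `second` are illustrative and do not come from the datasets used in this notebook.

```python
import pandas as pd

first = pd.Series({'A': 2, 'B': 1})
second = pd.Series({'B': 4, 'C': 3})

# Plain addition aligns on the index; labels missing on one side become NaN.
plain = first + second

# With fill_value=0, a label missing on one side is treated as 0 instead.
filled = first.add(second, fill_value=0)

print(plain)
print(filled)
```

This is why the running total over chunks needs fill_value=0: a key that is absent from one chunk should count as zero rather than turn the total into NaN.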

In [60]:
chunker = pd.read_csv('../datasets/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')  # explicit dtype avoids the empty-Series dtype warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
V    328.0
I    327.0
U    326.0
P    324.0
D    320.0
A    320.0
R    318.0
Y    314.0
G    308.0
S    308.0
N    306.0
W    305.0
T    304.0
B    302.0
Z    288.0
C    286.0
4    171.0
6    166.0
7    164.0
8    162.0
3    162.0
5    157.0
2    152.0
0    151.0
9    150.0
1    146.0
dtype: float64


### Writing Data To Text Format

In [61]:
data = pd.read_csv('../datasets/ex6.csv')

In [63]:
data.shape

(10000, 5)

In [64]:
data.head()

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [65]:
data.to_csv('../cache/cached_ex6.csv')
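
By default to_csv also writes the row index as the first column. The following sketch, which uses a small made-up frame instead of ex6.csv, shows two commonly used options: `index=False` to leave the index out and `na_rep` to choose how missing values are written. Calling to_csv without a path returns the CSV text as a string, which is convenient for inspection.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 5],
                      'b': [2.0, None],
                      'message': ['hello', 'world']})

# No path given, so the CSV text is returned instead of written to disk.
csv_text = frame.to_csv(index=False, na_rep='NULL')
print(csv_text)
```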

It’s possible to load most forms of tabular data from disk using functions like pandas.read_csv. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_csv

In [66]:
data = pd.read_csv('../datasets/ex7.csv')

In [67]:
data

Unnamed: 0,a,b,c
0,1,2,3
1,1,2,3


ex7.csv has the following data -

"a","b","c"

"1","2","3"

"1","2","3"

In [68]:
data = pd.read_csv('../datasets/ex7.csv', header=None)

In [69]:
data

Unnamed: 0,0,1,2
0,a,b,c
1,1,2,3
2,1,2,3


In [70]:
import csv

In [74]:
f = open('../datasets/ex7.csv')

reader = csv.reader(f)

Iterating through the reader like a file yields lists of values with any quote characters removed:

In [75]:
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


In [76]:
with open('../datasets/ex7.csv') as f:
    lines = list(csv.reader(f))

Then, we split the lines into the header line and the data lines:

In [77]:
header, values = lines[0], lines[1:]

Then we can create a dictionary of data columns using a dictionary comprehension and the expression zip(*values), which transposes rows to columns:

In [79]:
data_dict = {h: v for h, v in zip(header, zip(*values))}

data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):<br>
    &emsp;lineterminator = '\n'<br>
    &emsp;delimiter = ';'<br>
    &emsp;quotechar = '"'<br>
    &emsp;quoting = csv.QUOTE_MINIMAL<br>
<br>
reader = csv.reader(f, dialect=my_dialect)

We can also give individual CSV dialect parameters as keywords to csv.reader without having to define a subclass:

reader = csv.reader(f, delimiter='|')
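
A runnable version of both approaches might look like the sketch below. The class name `SemicolonDialect` and the sample text are made up for illustration, and `doublequote` and `skipinitialspace` are set explicitly so the dialect does not rely on implicit defaults.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect: semicolon-separated, minimal quoting."""
    delimiter = ';'
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL


# The quoted field contains the delimiter and must stay in one piece.
semicolon_text = 'a;b;c\n1;"2;5";3\n'
rows = list(csv.reader(io.StringIO(semicolon_text), dialect=SemicolonDialect))
print(rows)

# The same idea without a subclass: pass the parameter as a keyword.
pipe_rows = list(csv.reader(io.StringIO('a|b|c\n1|2|3\n'), delimiter='|'))
print(pipe_rows)
```

A subclass is worth the extra lines when the same format is read and written in several places; for a one-off read, keyword arguments are shorter.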

### JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV.

In [98]:
obj = """ 
{
    "name": "Wes",
    "places_lived": ["Mumbai", "Delhi","Bangalore"],
    "pet": null,
    "siblings": [{"name": "Scott", "age": 30, 
                  "pets": ["Zeus", "Zuko"]},
                 {"name": "Katie", "age": 38,"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""


In [99]:
import json

In [100]:
result = json.loads(obj)

In [101]:
result

{'name': 'Wes',
 'places_lived': ['Mumbai', 'Delhi', 'Bangalore'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

result is a Python dictionary. Below is how to convert part of it, the list of siblings, into a pandas DataFrame, keeping only the name and age fields:

In [102]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [103]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


### XML and HTML: Web scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files.
pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

!pip install lxml<br>
!pip install beautifulsoup4 html5lib

In [105]:
tables = pd.read_html('../datasets/fdic_failed_bank_list.html')

In [108]:
len(tables)

1

In [109]:
failures = tables[0]

In [110]:
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


In [111]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

In [116]:
close_timestamps.dt.month.head()

0    9
1    8
2    5
3    4
4    3
Name: Closing Date, dtype: int64

In [117]:
type(close_timestamps.dt.month)

pandas.core.series.Series

In [122]:
!cat '../datasets/mta_perf/Performance_MNR.xml'

<?xml  version="1.0" encoding="ISO-8859-1"?>
<PERFORMANCE>
<INDICATOR>
  <INDICATOR_SEQ>28445</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>On-Time Performance (West of Hudson)</INDICATOR_NAME>
  <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time. West of Hudson services include the Pascack Valley and Port Jervis lines. Metro-North Railroad contracts with New Jersey Transit to operate service on these lines.
</DESCRIPTION>
  <PERIOD_YEAR>2008</PERIOD_YEAR>
  <PERIOD_MONTH>1</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>95.00</YTD_TARGET>
  <YTD_ACTUAL>96.90</YTD_ACTUAL>
  <MONTHLY_TARGET>95.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL>96.90</MONTHLY_ACTUAL>
</INDICATOR>
<

  <YTD_ACTUAL>97.90</YTD_ACTUAL>
  <MONTHLY_TARGET>97.60</MONTHLY_TARGET>
  <MONTHLY_ACTUAL>98.10</MONTHLY_ACTUAL>
</INDICATOR>
<INDICATOR>
  <INDICATOR_SEQ>55526</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>On-Time Performance (East of Hudson)</INDICATOR_NAME>
  <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time. East of Hudson service includes the Harlem, Hudson and New Haven lines.</DESCRIPTION>
  <PERIOD_YEAR>2009</PERIOD_YEAR>
  <PERIOD_MONTH>11</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.60</YTD_TARGET>
  <YTD_ACTUAL>97.90</YTD_ACTUAL>
  <MONTHLY_TARGET>97.60</MONTHLY_TARGET>
  <MONTHLY_ACTUAL>98.20</MONTHLY_ACTUAL>
</INDICATOR>
<INDICATOR>
  <INDICAT

In [119]:
from lxml import objectify

In [120]:
path = '../datasets/mta_perf/Performance_MNR.xml'
# Use a context manager so the file handle is closed after parsing
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

In [123]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

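for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        # This assignment must sit inside the inner loop; otherwise only the
        # last child of each record (MONTHLY_ACTUAL) is stored
        el_data[child.tag] = child.pyval
    data.append(el_data)


Lastly, convert this list of dicts into a DataFrame:

In [124]:
perf = pd.DataFrame(data)
# perf holds every field that was not skipped; show one column to keep the output narrow
perf[['MONTHLY_ACTUAL']].head()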

Unnamed: 0,MONTHLY_ACTUAL
0,96.9
1,95.0
2,96.9
3,98.3
4,95.8


## Binary Data Formats

In [126]:
df = pd.read_csv('../datasets/ex1.csv')

In [127]:
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [128]:
df.to_pickle('../cache/ex1.pkl')

In [129]:
df1 = pd.read_pickle('../cache/ex1.pkl')

In [130]:
df1

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [131]:
df == df1

Unnamed: 0,a,b,c,d,message
0,True,True,True,True,True
1,True,True,True,True,True
2,True,True,True,True,True


pickle is only recommended as a short-term storage format. 
The problem is that it is hard to guarantee that the format will be stable over time; 
an object pickled today may not unpickle with a later version of a library. 
The creators of pickle have tried to maintain backward compatibility when possible, 
but at some point in the future it may be necessary to “break” the pickle format.
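
The element-wise comparison above returns a frame of booleans. For a single True or False answer, DataFrame.equals can be used instead. The sketch below assumes a small made-up frame and writes to a temporary directory rather than the ../cache folder.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'frame.pkl')
    frame.to_pickle(pkl_path)            # write the pickle
    restored = pd.read_pickle(pkl_path)  # read it back

# equals compares shape, labels and values, and returns one boolean.
same = frame.equals(restored)
print(same)
```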

### Using HDF5 format

HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.<br>
While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects.

In [135]:
# !pip install tables

In [136]:
df = pd.DataFrame({'a': np.random.randn(100)})

In [138]:
store = pd.HDFStore('../cache/mydata.h5')
store['obj1'] = df

store['obj1_col'] = df['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: ../cache/mydata.h5

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [140]:
store['obj1'].head()

Unnamed: 0,a
0,1.286701
1,-0.717639
2,-0.667486
3,0.779048
4,-0.008632


HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:

In [142]:
store.put('obj2', df, format='table')

store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,0.882094
11,-0.3673
12,0.506153
13,-0.779353
14,-1.42994
15,0.185417


In [143]:
store.close()

If you are processing data that is stored on remote servers, like Amazon S3 or HDFS, using a different binary format designed for distributed storage like Apache Parquet may be more suitable. Python for Parquet and other such storage formats is still developing.<br>

If you work with large quantities of data locally, you are encouraged to explore PyTables and h5py to see how they can suit your needs. Since many data analysis problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate your applications.

### Reading MS EXCEL files

In [146]:
# !pip install xlrd openpyxl  # recent pandas versions read .xlsx files through openpyxl

In [147]:
xlsx = pd.ExcelFile('../datasets/ex1.xlsx')

# Data stored in a sheet can then be read into a DataFrame with read_excel:
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can also simply pass the filename to pandas.read_excel:

In [149]:
frame = pd.read_excel('../datasets/ex1.xlsx', 'Sheet1')

In [150]:
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method:

In [152]:
# !pip install openpyxl

In [156]:
writer = pd.ExcelWriter('../cache/ex2.xlsx')

In [157]:
frame.to_excel(writer, sheet_name='Sheet1')
# close() saves the workbook; writer.save() was removed in pandas 2.0
writer.close()

## Interaction with web APIs

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package.

In [159]:
import requests
# url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
url = 'https://github.com/ashishlal/itv-ml'
resp = requests.get(url)
resp

<Response [200]>

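The Response object’s json method returns the response body parsed from JSON into native Python objects (a dictionary or a list, depending on the payload). It raises an error when the body is not valid JSON, which is the case here because the URL above returns an HTML page rather than the JSON served by the commented-out GitHub API endpoint.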
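
One way to guard against a non-JSON body is sketched below. It assumes that the json method raises a subclass of ValueError when parsing fails, which as far as I know holds for current releases of requests. `FakeResponse` and `json_or_none` are hypothetical names used only so the example runs without a network call; they are not part of requests.

```python
import json


class FakeResponse:
    """Hypothetical stand-in for requests.Response (no network needed)."""

    def __init__(self, text):
        self.text = text

    def json(self):
        # Mirrors the real method's behaviour of parsing the body text.
        return json.loads(self.text)


def json_or_none(resp):
    """Return the parsed JSON body, or None when the body is not JSON."""
    try:
        return resp.json()
    except ValueError:
        # Non-JSON bodies, such as an HTML page, end up here.
        return None


html_result = json_or_none(FakeResponse('<html><body>repo page</body></html>'))
api_result = json_or_none(FakeResponse('[{"number": 1, "title": "example"}]'))
print(html_result)
print(api_result)
```

With a real response the same helper would be called as `json_or_none(resp)`.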

## Interacting with databases

In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process.

In [161]:
import sqlite3

In [164]:
query = """CREATE TABLE test(a VARCHAR(20), b VARCHAR(20),c REAL, d INTEGER );"""
con = sqlite3.connect('mydata.sqlite')


In [165]:
con.execute(query)

<sqlite3.Cursor at 0x7ff87edc96c0>

In [166]:
con.commit()

Then, insert a few rows of data:

In [169]:
data = [('Mumbai', 'Mah', 1.25, 6),('Bengaluru', 'Karnataka', 2.6, 3),('Varanasi', 'UP', 1.7, 5)]

In [170]:
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [172]:
con.executemany(stmt,data)

<sqlite3.Cursor at 0x7ff87e62d570>

In [173]:
con.commit()

In [174]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Mumbai', 'Mah', 1.25, 6),
 ('Bengaluru', 'Karnataka', 2.6, 3),
 ('Varanasi', 'UP', 1.7, 5)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute:
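
The same idea can be tried without pandas. The sketch below uses an in-memory SQLite database (an assumption made so the example does not touch mydata.sqlite) and pairs each row with the column names taken from cursor.description.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)",
                [('Mumbai', 'Mah', 1.25, 6), ('Varanasi', 'UP', 1.7, 5)])
con.commit()

cursor = con.execute('SELECT * FROM test')

# Each entry of cursor.description is a 7-tuple; the name is the first item.
column_names = [col[0] for col in cursor.description]

# Pair every row with the column names to get one dict per record.
records = [dict(zip(column_names, row)) for row in cursor.fetchall()]
con.close()

print(column_names)
print(records)
```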


In [175]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [176]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Mumbai,Mah,1.25,6
1,Bengaluru,Karnataka,2.6,3
2,Varanasi,UP,1.7,5


This is quite a bit of data munging that you’d rather not repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that enables you to read data easily from a general SQLAlchemy connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and read data from the table created before:

In [177]:
import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///mydata.sqlite')

pd.read_sql('select * from test', db)

Unnamed: 0,a,b,c,d
0,Mumbai,Mah,1.25,6
1,Bengaluru,Karnataka,2.6,3
2,Varanasi,UP,1.7,5
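
read_sql also accepts a plain sqlite3 connection and query parameters. The sketch below assumes an in-memory database, so it does not depend on mydata.sqlite or on SQLAlchemy, and uses a `?` placeholder instead of formatting the value into the SQL string.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)",
                [('Mumbai', 'Mah', 1.25, 6),
                 ('Bengaluru', 'Karnataka', 2.6, 3),
                 ('Varanasi', 'UP', 1.7, 5)])
con.commit()

# The value 4 is bound to the '?' placeholder by the database driver.
big = pd.read_sql('SELECT a, d FROM test WHERE d > ?', con, params=(4,))
con.close()

print(big)
```

Binding values through params keeps user-supplied input out of the SQL text, which is the usual way to avoid SQL injection.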
