**Reading and Writing Data**

In [162]:
!cat ex1.csv
# On Windows, use !type ex1.csv instead

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [163]:
import pandas as pd
pd.read_csv('ex1.csv') 

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo


In [164]:
pd.read_table('ex1.csv',sep=',')

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo


In [165]:
import pandas as pd
file_path='/home/ankit/Desktop/ml/InterviewPreparation/AI-ML-DS/1.Python/'
print(pd.read_csv(file_path+'ex2.csv',names=['a','b','c','d','message']))

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo


Suppose you want the message column to be the index of the returned DataFrame.
With the index_col argument you can refer to that column either by its position (4) or by its name ('message'). Passing a list of positions or names produces a hierarchical index.

In [166]:
print(pd.read_csv(file_path+'ex2.csv',names=['a','b','c','d','message'],index_col='message'))

         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12


In [167]:
print(pd.read_csv(file_path+'ex2.csv',names=['a','b','c','d','message'],index_col=[4,1]))

            a   c   d
message b            
hello   2   1   3   4
world   6   5   7   8
foo     10  9  11  12


In some cases a table does not use a fixed delimiter; the fields in ex3.txt, for example, are separated by a variable amount of whitespace.
While you could do some munging by hand, you can instead pass a regular expression as the delimiter for read_table. Here the pattern is \s+, which is written as '\\s+' in an ordinary Python string.

In [168]:
result = pd.read_table(file_path+'ex3.txt', sep='\\s+')
print(result)

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
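
As a self-contained sketch of the whitespace-delimited case, the following reads from an in-memory string instead of ex3.txt. The sample text and the names raw_text and whitespace_df are made up for illustration; the raw string r"\s+" is an equivalent way to write the pattern without doubling the backslash.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: fields separated by varying runs of spaces.
# The header names only three columns, while each data row has four fields.
raw_text = (
    "A      B      C\n"
    "aaa  -0.26  -1.03  -0.62\n"
    "bbb   0.93   0.30  -0.03\n"
)

# A raw string keeps the backslash literal, so r"\s+" equals "\\s+".
whitespace_df = pd.read_csv(StringIO(raw_text), sep=r"\s+")
print(whitespace_df)
```

Because the header line has one fewer field than the data rows, pandas uses the first column as the index, which matches the shape of the output shown above.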


In [169]:
print(pd.read_csv(file_path+'ex2.csv',names=['a','b','c','d','message'],skiprows=[0,1]))

   a   b   c   d message
0  9  10  11  12     foo


Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL, and converts them to NaN.

In [170]:
result = pd.read_table(file_path+'ex5.csv',sep=',')
print(result)
print(pd.isnull(result))


  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
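
To see the default sentinels at work without the file on disk, here is a minimal sketch that reads from an in-memory string. The CSV text is an assumed reconstruction of ex5.csv based on the output above, and the variable names are illustrative. The keep_default_na=False option switches the built-in sentinel list off.

```python
from io import StringIO

import pandas as pd

# Assumed contents of ex5.csv, reconstructed from the printed result.
csv_text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default behaviour: 'NA' and the empty field both become NaN.
default_df = pd.read_csv(StringIO(csv_text))
print(default_df.isna().sum())

# With the default sentinels disabled, 'NA' stays an ordinary string.
raw_df = pd.read_csv(StringIO(csv_text), keep_default_na=False)
print(raw_df["message"].tolist())
```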


**Marking additional values as missing (NaN)**

In [171]:
result = pd.read_table(file_path+'ex5.csv', sep=',',
                       na_values={'message': ['foo', 'world'],
                                  'something': ['two', 'three']})
print(result)

  something  a   b     c   d  message
0       one  1   2   3.0   4      NaN
1       NaN  5   6   NaN   8      NaN
2       NaN  9  10  11.0  12      NaN


In [172]:
# change the display setting to show only max 10 rows for a dataframe

# import pandas as pd
# pd.options.display.max_rows = 10

In [173]:
# Data can also be exported to a delimited format
result.to_csv(file_path+'out.csv')

In [174]:
# Other delimiters can be used, of course (writing to sys.stdout so it prints the text result to the console)
# Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value

import sys
result.to_csv(sys.stdout,sep='|',index=False,header=False,na_rep='NULL')

one|1|2|3.0|4|NULL
NULL|5|6|NULL|8|NULL
NULL|9|10|11.0|12|NULL
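
The round trip can be sketched entirely in memory. This example is an illustration under assumptions: the small frame and the names buffer and restored are made up, and a StringIO buffer stands in for a file. Since NULL is one of the default sentinels, reading the text back restores the missing values.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Small illustrative frame with one missing value per nullable column.
frame = pd.DataFrame({
    "a": [1, 5],
    "c": [3.0, np.nan],
    "message": [None, "world"],
})

# Write pipe-delimited text, marking missing values with 'NULL'.
buffer = StringIO()
frame.to_csv(buffer, sep="|", index=False, na_rep="NULL")
text = buffer.getvalue()
print(text)

# Read it back: 'NULL' is recognized as missing by default.
restored = pd.read_csv(StringIO(text), sep="|")
print(restored)
```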


**JSON Data**

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV.

The json.loads method converts a JSON string into a Python object, and json.dumps converts a Python object back into a JSON string.

In [175]:
json_obj="""
{"name":[{"myself":["ankit"],"mylife":"kiio"},{"age":[21,22],"state":["UP","Punjab"]}],
 "Institute":["ISI","IIIT-D"]
}
"""

In [176]:
json_obj

'\n{"name":[{"myself":["ankit"],"mylife":"kiio"},{"age":[21,22],"state":["UP","Punjab"]}],\n "Institute":["ISI","IIIT-D"]\n}\n'

In [177]:
import json
python_obj=json.loads(json_obj)
print(python_obj)

{'name': [{'myself': ['ankit'], 'mylife': 'kiio'}, {'age': [21, 22], 'state': ['UP', 'Punjab']}], 'Institute': ['ISI', 'IIIT-D']}


In [178]:
json_obj=json.dumps(python_obj)
print(json_obj)

{"name": [{"myself": ["ankit"], "mylife": "kiio"}, {"age": [21, 22], "state": ["UP", "Punjab"]}], "Institute": ["ISI", "IIIT-D"]}


The pandas.read_json function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.

In [179]:
data = pd.read_json(file_path+'example.json')
data

   id            name  age           city  salary   department
0   1        John Doe   28       New York   75000  Engineering
1   2      Jane Smith   32    Los Angeles   85000    Marketing
2   3     Bob Johnson   45        Chicago   95000        Sales
3   4     Alice Brown   29  San Francisco   80000  Engineering
4   5  Charlie Wilson   35         Boston   70000           HR


In [180]:
data = pd.read_json(file_path+'example1.json')
data

   id            name  age           city  salary   department
0   1        John Doe   28       New York   75000  Engineering
1   2      Jane Smith   32    Los Angeles   85000    Marketing
2   3     Bob Johnson   45        Chicago   95000        Sales
3   4     Alice Brown   29  San Francisco   80000  Engineering
4   5  Charlie Wilson   35         Boston   70000           HR


If you need to export data from pandas to JSON, one way is to use the to_json method on Series and DataFrame.

In [181]:
data.to_json('sample.json')
pd.read_json('sample.json')

   id            name  age           city  salary   department
0   1        John Doe   28       New York   75000  Engineering
1   2      Jane Smith   32    Los Angeles   85000    Marketing
2   3     Bob Johnson   45        Chicago   95000        Sales
3   4     Alice Brown   29  San Francisco   80000  Engineering
4   5  Charlie Wilson   35         Boston   70000           HR
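
The JSON layout that to_json writes is controlled by its orient argument. The sketch below is an in-memory illustration, not the notebook's own files: it uses two of the rows shown above and the made-up names people, records_json and restored to write a list of records and read it back.

```python
from io import StringIO

import pandas as pd

# Two sample rows taken from the table above.
people = pd.DataFrame({
    "id": [1, 2],
    "name": ["John Doe", "Jane Smith"],
    "age": [28, 32],
})

# orient="records" writes one JSON object per row.
records_json = people.to_json(orient="records")
print(records_json)

# read_json accepts a file-like object, so wrap the string in StringIO.
restored = pd.read_json(StringIO(records_json))
print(restored)
```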


**XML and HTML: Web Scraping**

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. 
While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files.

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within `<table>` tags. The result is a list of DataFrame objects.

In [182]:
tables=pd.read_html('table.html')
print(tables[0])

  Employee ID            Name   Department   Salary  Performance Score
0      E10234      John Smith    Marketing  $72,500                 88
1      E10235        Jane Doe  Engineering  $95,000                 92
2      E10236  Robert Johnson        Sales  $65,000                 79
3      E10237  Emily Williams  Engineering  $98,000                 95
4      E10238   Michael Brown           HR  $68,000                 85


In [183]:
tables

[  Employee ID            Name   Department   Salary  Performance Score
 0      E10234      John Smith    Marketing  $72,500                 88
 1      E10235        Jane Doe  Engineering  $95,000                 92
 2      E10236  Robert Johnson        Sales  $65,000                 79
 3      E10237  Emily Williams  Engineering  $98,000                 95
 4      E10238   Michael Brown           HR  $68,000                 85]

XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata.

In [184]:
from lxml import objectify

# Parse the XML file and get a reference to its root node
with open('Performance_MNR.xml') as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

xml_data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

# Build one dict per INDICATOR record, leaving out the skipped fields
for ele in root.INDICATOR:
    data = {}
    for child in ele.getchildren():
        if child.tag in skip_fields:
            continue
        data[child.tag] = child.pyval
    xml_data.append(data)

df = pd.DataFrame(xml_data)
df

          INDICATOR_NAME     VALUE        UNIT   PERIOD
0         Revenue Growth      15.5  Percentage  Q1 2024
1  Customer Satisfaction      92.3       Score  Q1 2024
2        Production Cost  250000.0         USD  Q1 2024
3      Employee Turnover       8.2  Percentage  Q1 2024


XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML.

In [185]:
from io  import StringIO
root=objectify.parse(StringIO('<a href="http://www.google.com">Google</a>')).getroot()
print(root.get("href"))
print(root.text)

http://www.google.com
Google
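
Element text and tag metadata (attributes) can be combined when flattening XML into records. The sketch below swaps in the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages. The XML snippet is hypothetical: in particular, storing the unit as an attribute is an assumption made for illustration, not the layout of Performance_MNR.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the unit is stored as an attribute (metadata),
# while the name and value are stored as child elements.
xml_text = """
<PERFORMANCE>
  <INDICATOR unit="Percentage">
    <INDICATOR_NAME>Revenue Growth</INDICATOR_NAME>
    <VALUE>15.5</VALUE>
  </INDICATOR>
  <INDICATOR unit="Score">
    <INDICATOR_NAME>Customer Satisfaction</INDICATOR_NAME>
    <VALUE>92.3</VALUE>
  </INDICATOR>
</PERFORMANCE>
""".strip()

root = ET.fromstring(xml_text)

# get() reads an attribute; findtext() reads the text of a child element.
records = [
    {
        "name": indicator.findtext("INDICATOR_NAME"),
        "value": float(indicator.findtext("VALUE")),
        "unit": indicator.get("unit"),
    }
    for indicator in root.iter("INDICATOR")
]
print(records)
```

A list of dicts like records can be passed straight to pd.DataFrame, as in the lxml example earlier.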


One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. 
pandas objects all have a to_pickle method that writes the data to disk in pickle format.

In [186]:
df.to_pickle('serialized_dataframe')
pd.read_pickle('serialized_dataframe')

          INDICATOR_NAME     VALUE        UNIT   PERIOD
0         Revenue Growth      15.5  Percentage  Q1 2024
1  Customer Satisfaction      92.3       Score  Q1 2024
2        Production Cost  250000.0         USD  Q1 2024
3      Employee Turnover       8.2  Percentage  Q1 2024
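
A pickle round trip can be checked for fidelity with DataFrame.equals. This is a minimal sketch that writes to a temporary directory so that nothing is left on disk; the frame contents reuse two rows from above, and the file name indicators.pkl is arbitrary.

```python
import os
import tempfile

import pandas as pd

indicators = pd.DataFrame({
    "INDICATOR_NAME": ["Revenue Growth", "Employee Turnover"],
    "VALUE": [15.5, 8.2],
})

# Write the frame to a temporary pickle file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "indicators.pkl")
    indicators.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

# equals() compares values, dtypes and labels.
print(restored.equals(indicators))
```

Unpickling can execute arbitrary code, so read_pickle should only be used on files from a trusted source.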


HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces 
available in many other languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands for hierarchical data format.

Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.

While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details.

In [187]:
# !conda install pytables -y

In [188]:
# !pip install --user tables

In [189]:
import numpy as np
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
print(store['obj1'])

           a
0   0.194782
1   1.052083
2  -0.494156
3  -0.504498
4   0.453644
..       ...
95  0.773727
96  0.541345
97 -1.075445
98 -1.608589
99 -1.087545

[100 rows x 1 columns]


In [190]:
store['obj1_col']

0     0.194782
1     1.052083
2    -0.494156
3    -0.504498
4     0.453644
        ...   
95    0.773727
96    0.541345
97   -1.075445
98   -1.608589
99   -1.087545
Name: a, Length: 100, dtype: float64

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax.

In [191]:
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])

           a
10  0.983675
11 -1.627824
12  1.655770
13  0.495177
14  1.111095
15  0.222738


In [192]:
list(store.items())

[('/obj1',
  /obj1 (Group) ''
    children := ['axis0' (Array), 'axis1' (Array), 'block0_values' (Array), 'block0_items' (Array)]),
 ('/obj1_col',
  /obj1_col (Group) ''
    children := ['index' (Array), 'values' (Array)]),
 ('/obj2',
  /obj2 (Group) ''
    children := ['table' (Table)]),
 ('/obj3',
  /obj3 (Group) ''
    children := ['table' (Table)])]

In [193]:
store.close()

The put method is an explicit version of the `store['obj2'] = frame` assignment, but it lets us set other options such as the storage format.

The pandas.read_hdf function gives you a shortcut to these tools.

In [194]:
frame.to_hdf('mydata.h5', key='obj3', format='table')
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

          a
0  0.194782
1  1.052083
2 -0.494156
3 -0.504498
4  0.453644


If you are processing data that is stored on remote servers, like Amazon S3 or HDFS, using a different binary format designed for distributed storage like Apache Parquet may be more suitable. Python for Parquet and other such storage formats is still developing.

If you work with large quantities of data locally, I would encourage you to explore PyTables and h5py to see how they can suit your needs. Since many data analysis problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate your applications.

HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function.        
Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively.


In [195]:
# !pip install xlrd==1.1.0
# !pip install openpyxl

In [196]:
# !pip uninstall xlrd -y
# !pip install xlrd

In [197]:
# !pip uninstall xlrd -y

In [198]:
# !pip install "pandas>=1.3.0" "openpyxl>=3.0.0"

In [199]:
xlsx=pd.ExcelFile(file_path+'My_Excel.xlsx', engine='openpyxl')
pd.read_excel(xlsx,sheet_name='Sheet1')

    Name College
0  Ankit     ISI
1  Kiioo  IIIT-D
2  Soumi    IISc
3  Summi     ---


To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method.

In [201]:
# Use mode='a' to append a sheet to the existing file instead of overwriting it
frame = pd.DataFrame(np.random.randn(5), columns=['Random Numbers'])

# The context manager saves and closes the workbook on exit
with pd.ExcelWriter(file_path + 'My_Excel.xlsx', mode='a', engine='openpyxl') as writer:
    frame.to_excel(writer, sheet_name='Sheet4')

**Interacting with Web APIs**

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package.

We can make an HTTP GET request using the add-on requests library.

In [202]:
import requests
response=requests.get('https://www.weatherapi.com/')
print(response)

<Response [200]>


The Response object’s json method returns the JSON payload parsed into native Python objects: a dictionary for a JSON object or, as in the next cell, a list for a JSON array.

In [203]:
import requests
data = requests.get('https://api.openf1.org/v1/drivers').json()
print(data[:3])  # Show first 3 drivers

[{'meeting_key': 1140, 'session_key': 7763, 'driver_number': 1, 'broadcast_name': 'M VERSTAPPEN', 'full_name': 'Max VERSTAPPEN', 'name_acronym': 'VER', 'team_name': 'Red Bull Racing', 'team_colour': '3671C6', 'first_name': 'Max', 'last_name': 'Verstappen', 'headshot_url': 'https://www.formula1.com/content/dam/fom-website/drivers/M/MAXVER01_Max_Verstappen/maxver01.png.transform/1col/image.png', 'country_code': 'NED'}, {'meeting_key': 1140, 'session_key': 7763, 'driver_number': 2, 'broadcast_name': 'L SARGEANT', 'full_name': 'Logan SARGEANT', 'name_acronym': 'SAR', 'team_name': 'Williams', 'team_colour': '37BEDD', 'first_name': 'Logan', 'last_name': 'Sargeant', 'headshot_url': 'https://www.formula1.com/content/dam/fom-website/drivers/L/LOGSAR01_Logan_Sargeant/logsar01.png.transform/1col/image.png', 'country_code': 'USA'}, {'meeting_key': 1140, 'session_key': 7763, 'driver_number': 4, 'broadcast_name': 'L NORRIS', 'full_name': 'Lando NORRIS', 'name_acronym': 'NOR', 'team_name': 'McLaren',
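
A list of dictionaries like this one can be passed directly to the DataFrame constructor. The sketch below avoids a network call by hard-coding a few fields copied from the response shown above; the variable names and the choice of columns are illustrative.

```python
import pandas as pd

# A few fields copied from the API response shown above.
drivers = [
    {"driver_number": 1, "full_name": "Max VERSTAPPEN",
     "name_acronym": "VER", "team_name": "Red Bull Racing"},
    {"driver_number": 2, "full_name": "Logan SARGEANT",
     "name_acronym": "SAR", "team_name": "Williams"},
    {"driver_number": 4, "full_name": "Lando NORRIS",
     "name_acronym": "NOR", "team_name": "McLaren"},
]

# The columns argument selects and orders the fields of interest.
drivers_df = pd.DataFrame(
    drivers, columns=["driver_number", "full_name", "team_name"]
)
print(drivers_df)
```

With a live response, the same constructor call would take the list returned by the json method in place of the hard-coded records.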

**Databases**

In [204]:
import sqlite3
query = """CREATE TABLE test_2 (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER );"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),('Tallahassee', 'Florida', 2.6, 3),('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test_2 VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test_2')
rows = cursor.fetchall()
rows


[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [205]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [206]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5


The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases.      
    
pandas has a read_sql function that enables you to read data easily from a general SQLAlchemy connection.
Here, we’ll connect to the same SQLite database with SQLAlchemy and read data from the table created before.

In [207]:
import sqlalchemy as sqla
import pandas as pd

db = sqla.create_engine('sqlite:///mydata.sqlite')
# Use SQLAlchemy's execute method directly
with db.connect() as conn:
    result = conn.execute(sqla.text('select * from test_2'))
    df = pd.DataFrame(result.fetchall(), columns=result.keys())

In [208]:
df

             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
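
The read_sql function mentioned above can do the query and the DataFrame construction in one step. As a minimal sketch, the example below passes a plain sqlite3 connection, which pandas also accepts, and uses an in-memory database so it does not touch mydata.sqlite. The table and rows mirror the earlier cell; the name cities is made up.

```python
import sqlite3

import pandas as pd

# In-memory database with the same schema and rows as test_2 above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test_2 (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)
rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test_2 VALUES (?, ?, ?, ?)", rows)
con.commit()

# read_sql runs the query and uses the column names from the result set.
cities = pd.read_sql("SELECT * FROM test_2", con)
con.close()
print(cities)
```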


In [209]:
import urllib.request
import json

# Fetch the transaction history of a Bitcoin address
your_url = 'https://blockchain.info/rawaddr/12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX'

with urllib.request.urlopen(your_url) as response:
    data = json.loads(response.read().decode())
    print(data)

{'hash160': '119b098e2e980a229e139a9ed01a469e518e6f26', 'address': '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX', 'n_tx': 220, 'n_unredeemed': 220, 'total_received': 5135253007, 'total_sent': 0, 'final_balance': 5135253007, 'txs': [{'hash': '308bc3b8c3987ce86f2af1c32a6e8727f18cb151604b26ab0ce6b21ae5b500b1', 'ver': 1, 'vin_sz': 1, 'vout_sz': 2, 'size': 278, 'weight': 785, 'fee': 5000, 'relayed_by': '0.0.0.0', 'lock_time': 0, 'tx_index': 6227731513824860, 'double_spend': False, 'time': 1754254633, 'block_index': 908451, 'block_height': 908451, 'inputs': [{'sequence': 4294967294, 'witness': '02473044022001e1661aba38c580640f9a5c6f2b99a60c39098f6d8a436ac80a1bf3ee7daf0d02207c4d2b534ba5c006555bfd6f61ad5f406baddee0dc172c8032171680951d2f9a01210235ad7f08fcc677848136db845a948988c671b2ce8757451203fac7f59680fa6e', 'script': '160014458a593eeb60320ededa34d98bb52cb4a8aaf3f9', 'index': 0, 'prev_out': {'type': 0, 'spent': True, 'value': 10000, 'spending_outpoints': [{'tx_index': 6227731513824860, 'n': 0}], 'n': 