# IMPORTING DATA IN PYTHON
<hr>
Importing data is a fundamental step in any data-related task, whether analysis, processing, or machine learning. It brings external data into your Python environment so that you can work with it using Python's data manipulation, analysis, and visualization tools. The sections below cover the primary methods and techniques for importing data in Python:

In [158]:
# Importing the pandas library
import pandas as pd

# Importing the NumPy library 
import numpy as np

## 1. Flatfiles/textfiles
### 1.1. Using open
Opening a file named `tweets3.txt` for reading (`r` mode)
- The `with` statement opens the file in context and ensures that the file is properly closed after reading
- The `.read()` method reads the entire contents of the file and stores it in the variable `tweets`

In [194]:
with open('datasets/importing-data/tweets3.txt', 'r') as con:
    tweets = con.read()

In [195]:
# Extract the first 1000 characters from the 'tweets' variable
tweets[:1000]

'{"in_reply_to_user_id": null, "created_at": "Tue Mar 29 23:40:17 +0000 2016", "filter_level": "low", "truncated": false, "possibly_sensitive": false, "timestamp_ms": "1459294817758", "user": {"profile_banner_url": "https://pbs.twimg.com/profile_banners/2290155049/1456586630", "created_at": "Mon Jan 13 20:05:32 +0000 2014", "utc_offset": 3600, "geo_enabled": false, "notifications": null, "lang": "en", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_image_url_https": "https://pbs.twimg.com/profile_images/700888033558720512/KexOIMN4_normal.jpg", "time_zone": "London", "listed_count": 71, "screen_name": "greyman25", "url": "http://www.get-saved-today.webnode.com", "profile_background_tile": false, "followers_count": 578, "profile_link_color": "1B95E0", "default_profile": false, "name": "Born in Britain", "follow_request_sent": null, "following": null, "profile_use_background_image": false, "profile_background_color": "000000", "id": 2290
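Because `.read()` returns the whole file as one string, a file that stores one JSON object per line is often easier to process line by line. The sketch below assumes that layout (the output above suggests it for `tweets3.txt`, but that is an assumption) and uses a small temporary file with made-up records rather than the real dataset:

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for the real tweets file
sample_lines = [
    '{"lang": "en", "text": "hello"}',
    '{"lang": "fr", "text": "bonjour"}',
]

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('\n'.join(sample_lines) + '\n')
    path = tmp.name

# Iterating over the file object yields one line at a time
records = []
with open(path, 'r', encoding='utf-8') as con:
    for line in con:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))

os.remove(path)
print(len(records), records[0]['lang'])
```

Reading line by line keeps memory use low for large files, since only one line is held at a time.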

### 1.2. Using loadtxt
Loading data from a CSV file named `titanic_sub.csv`
- `dtype=str` specifies that the data should be read as strings
- `delimiter=','` specifies that the comma `,` is used as the delimiter between values
- `skiprows=1` skips the first row (usually the header) when reading the data
- `usecols=[2, 3, 4]` specifies that only the 3rd, 4th, and 5th columns (indices 2, 3, and 4, counting from 0) should be read

In [161]:
titanic = np.loadtxt('datasets/importing-data/titanic_sub.csv', dtype=str, delimiter=',', skiprows=1, usecols=[2, 3, 4])
titanic

array([['3', 'male', '22.0'],
       ['1', 'female', '38.0'],
       ['3', 'female', '26.0'],
       ...,
       ['3', 'female', ''],
       ['1', 'male', '26.0'],
       ['3', 'male', '32.0']], dtype='<U6')
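To see how these arguments interact without the dataset on disk, here is a minimal sketch that feeds `np.loadtxt` an in-memory file. The three sample rows and the column layout are made up for illustration and only loosely mirror the Titanic file:

```python
import io

import numpy as np

# Hypothetical sample data; the real file has more columns and rows
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# Read three columns as strings, skipping the header row
subset = np.loadtxt(io.StringIO(csv_text), dtype=str, delimiter=',',
                    skiprows=1, usecols=[2, 3, 4])

# Read a single numeric column; the default dtype is float
ages = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1, usecols=4)

print(subset.shape)
print(ages)
```

`loadtxt` applies one dtype to every column it reads, which is why the mixed-type selection is read as strings here.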

### 1.3. Using genfromtxt
Loading data from a CSV file named `titanic_sub.csv` using `genfromtxt`
- `delimiter= ','` specifies that the comma `,` is used as the delimiter between values
- `dtype = None` lets NumPy infer the data types of the columns automatically
- `skip_header = 1` skips the first row (header) when reading the data
- `encoding = 'utf-8'` specifies the character encoding of the file
- `max_rows = 10` limits the number of rows to be read

In [162]:
titanic2 = np.genfromtxt('datasets/importing-data/titanic_sub.csv', delimiter=',', dtype=None, skip_header=1, encoding='utf-8', max_rows=10)
titanic2

array([( 1, 0, 3, 'male', 22., 1, 0, 'A/5 21171',  7.25  , '', 'S'),
       ( 2, 1, 1, 'female', 38., 1, 0, 'PC 17599', 71.2833, 'C85', 'C'),
       ( 3, 1, 3, 'female', 26., 0, 0, 'STON/O2. 3101282',  7.925 , '', 'S'),
       ( 4, 1, 1, 'female', 35., 1, 0, '113803', 53.1   , 'C123', 'S'),
       ( 5, 0, 3, 'male', 35., 0, 0, '373450',  8.05  , '', 'S'),
       ( 6, 0, 3, 'male', nan, 0, 0, '330877',  8.4583, '', 'Q'),
       ( 7, 0, 1, 'male', 54., 0, 0, '17463', 51.8625, 'E46', 'S'),
       ( 8, 0, 3, 'male',  2., 3, 1, '349909', 21.075 , '', 'S'),
       ( 9, 1, 3, 'female', 27., 0, 2, '347742', 11.1333, '', 'S'),
       (10, 1, 2, 'female', 14., 1, 0, '237736', 30.0708, '', 'C')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<U6'), ('f4', '<f8'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<U16'), ('f8', '<f8'), ('f9', '<U4'), ('f10', '<U1')])
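The result above is a structured array with default field names `f0`, `f1`, and so on. As a sketch of one alternative, passing `names=True` makes `genfromtxt` take the field names from the header row instead of skipping it. The sample data below is made up for illustration:

```python
import io

import numpy as np

# Hypothetical sample data with one missing Age value
csv_text = """Pclass,Sex,Age
3,male,22.0
3,male,
1,female,38.0
"""

# names=True reads the field names from the first row
passengers = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=None,
                           names=True, encoding='utf-8')

print(passengers.dtype.names)  # field names taken from the header
print(passengers['Age'])       # the missing value is read as nan
```

Fields can then be selected by name, for example `passengers['Sex']`, rather than by position.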

## 2. JSON files
### 2.1. Using open
Opening a JSON file named `batman.json` for reading

In [163]:
import json

In [164]:
with open('datasets/importing-data/batman.json') as con:
    # Loading the JSON data from the file and storing it in the variable 'batman'
    batman = json.load(con)
batman

{'Title': 'Batman',
 'Year': '1989',
 'Rated': 'PG-13',
 'Released': '23 Jun 1989',
 'Runtime': '126 min',
 'Genre': 'Action, Adventure',
 'Director': 'Tim Burton',
 'Writer': 'Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)',
 'Actors': 'Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl',
 'Plot': 'The Dark Knight of Gotham City begins his war on crime with his first major enemy being Jack Napier, a criminal who becomes the clownishly homicidal Joker.',
 'Language': 'English, French, Spanish',
 'Country': 'USA, UK',
 'Awards': 'Won 1 Oscar. Another 8 wins & 26 nominations.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.5/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '71%'},
  {'Source': 'Metacritic', 'Value': '69/100'}],
 'Metascore': '69',
 'imdbRating': '7.5',
 'imdbVotes': '329,552',
 'imdbID': 'tt
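The `Ratings` entry in the output above is a list of dictionaries nested inside the main dictionary. One way to turn such a nested list into rows is `pd.json_normalize`. This is a minimal sketch on a trimmed copy of the record, not the full file:

```python
import pandas as pd

# Trimmed copy of the record shown above, keeping only a few keys
movie = {
    'Title': 'Batman',
    'Year': '1989',
    'Ratings': [
        {'Source': 'Internet Movie Database', 'Value': '7.5/10'},
        {'Source': 'Rotten Tomatoes', 'Value': '71%'},
    ],
}

# One row per rating; Title and Year are repeated on each row
ratings = pd.json_normalize(movie, record_path='Ratings', meta=['Title', 'Year'])
print(ratings)
```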

<hr>
The loaded data can be converted to a pandas DataFrame for easier inspection.

In [165]:
with open('datasets/importing-data/movies.json') as con:
    movies = json.load(con)
movies = pd.DataFrame(movies)
movies.head()

Unnamed: 0,id,cover_url,description,rating,title
0,4fede17c312f912796000034,,,6.3,"L'affaire Gordji, histoire d'une cohabitation"
1,4fede17f312f912796000035,,Documentary telling the true story of the sink...,6.8,Le naufrage du Laconia - partie 1
2,4fede181312f912796000036,,Documentary telling the true story of the sink...,6.8,Le naufrage du Laconia - partie 2
3,4fede184312f912796000037,http://ia.media-imdb.com/images/M/MV5BMjAyMTg0...,The extraordinary story of three Rwandan kids ...,6.2,Africa United
4,4fede186312f912796000038,http://ia.media-imdb.com/images/M/MV5BMjAyNDcx...,A young man is rocked by two announcements fro...,7.2,Beginners


### 2.2. Reading JSON files directly to pandas
Reading a JSON file directly into a pandas DataFrame is a common and convenient way to import data. It is done with the pandas `pd.read_json()` function.

In [166]:
movies = pd.read_json('datasets/importing-data/movies.json')
movies.head()

Unnamed: 0,id,cover_url,description,rating,title
0,4fede17c312f912796000034,,,6.3,"L'affaire Gordji, histoire d'une cohabitation"
1,4fede17f312f912796000035,,Documentary telling the true story of the sink...,6.8,Le naufrage du Laconia - partie 1
2,4fede181312f912796000036,,Documentary telling the true story of the sink...,6.8,Le naufrage du Laconia - partie 2
3,4fede184312f912796000037,http://ia.media-imdb.com/images/M/MV5BMjAyMTg0...,The extraordinary story of three Rwandan kids ...,6.2,Africa United
4,4fede186312f912796000038,http://ia.media-imdb.com/images/M/MV5BMjAyNDcx...,A young man is rocked by two announcements fro...,7.2,Beginners
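`pd.read_json()` also accepts a file-like object, which is handy for trying it out without a file on disk. A minimal sketch, using a two-record JSON string built from the titles and ratings shown above:

```python
import io

import pandas as pd

# A JSON array of records; each object becomes one row
json_text = '[{"title": "Africa United", "rating": 6.2}, {"title": "Beginners", "rating": 7.2}]'

movies_small = pd.read_json(io.StringIO(json_text))
print(movies_small)
```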


## 3. MATLAB
MATLAB files are data files associated with the MATLAB software, a high-level programming language and interactive environment for numerical computing and data analysis. MATLAB files can contain various types of data, including variables, functions, scripts, and more. Here, we'll look at `.mat` files.

We load the MATLAB file `ja_data2.mat` using the `loadmat` function from the SciPy library and access the variable `CYratioCyt`.

In [167]:
from scipy.io import loadmat

In [168]:
# Reading the file
data = loadmat('datasets/importing-data/ja_data2.mat')

# Getting the keys (variable names) present in the loaded MATLAB file
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'rfpCyt', 'rfpNuc', 'cfpNuc', 'cfpCyt', 'yfpNuc', 'yfpCyt', 'CYratioCyt'])

In [169]:
# Check the shape of CYratioCyt
data['CYratioCyt'].shape

(200, 137)

In [170]:
# Display CYratioCyt data
data['CYratioCyt']

array([[0.        , 1.53071547, 1.54297013, ..., 1.34990123, 1.35329984,
        1.34922173],
       [0.        , 1.28605578, 1.29385656, ..., 1.31307311, 1.30039694,
        1.30563938],
       [0.        , 1.32731222, 1.32884617, ..., 1.24887565, 1.24506205,
        1.25825831],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.44552606, 1.42862357, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.45794466, 0.        , ..., 1.1229479 , 1.12224652,
        1.1486481 ]])
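As the keys above show, `loadmat` returns a dictionary that mixes the stored variables with metadata entries whose names start with double underscores. The sketch below writes a small `.mat` file with `savemat` and reads it back, filtering out the metadata keys. The file name and the variable name `ratios` are made up for illustration:

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Hypothetical variable to store
ratios = np.arange(6, dtype=float).reshape(2, 3)

# Write a .mat file to a temporary directory, then load it again
path = os.path.join(tempfile.mkdtemp(), 'example.mat')
savemat(path, {'ratios': ratios})
loaded = loadmat(path)

# Keep only the stored variables, skipping entries such as '__header__'
variable_names = [key for key in loaded if not key.startswith('__')]
print(variable_names)
print(loaded['ratios'].shape)
```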

## 4. Stata
Files with the `.dta` extension are the primary data files used by Stata. They can contain datasets, variables, labels, and other metadata.

They can be read with the pandas `pd.read_stata()` function.

In [171]:
auto = pd.read_stata('datasets/importing-data/auto.dta')
auto.head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic


## 5. Excel
### 5.1. Using pandas read_excel
Excel files can be read directly with the pandas `pd.read_excel()` function. Additional packages may need to be installed, for example `xlrd` for `.xls` files and `openpyxl` for `.xlsx` files.

In [172]:
# Using Pandas to read an Excel file named 'latitude.xls'
latitude = pd.read_excel('datasets/importing-data/latitude.xls')
latitude.head()

Unnamed: 0,country,1700
0,Afghanistan,34.565
1,Akrotiri and Dhekelia,34.616667
2,Albania,41.312
3,Algeria,36.72
4,American Samoa,-14.307


#### 5.1.1. Multiple sheets
If the file has multiple sheets, specify the sheet name.

In [173]:
# Using Pandas to read an Excel file named 'CO2-Dataset.xlsx'
co2 = pd.read_excel('datasets/importing-data/CO2-Dataset.xlsx', sheet_name='CO2 (kt) for Split')
co2.sample(7)

Unnamed: 0,Country Code,Country Name,Region,Year,CO2 (kt)
5764,LCA,St. Lucia,Latin America & Caribbean,2004,355.699
1798,CHI,Channel Islands,Europe & Central Asia,1990,
5235,KHM,Cambodia,East Asia & Pacific,1995,1551.141
1574,BWA,Botswana,Sub-Saharan Africa,1974,88.008
10004,TTO,Trinidad and Tobago,Latin America & Caribbean,1981,17260.569
11097,ZWE,Zimbabwe,Sub-Saharan Africa,1982,8811.801
8219,PRY,Paraguay,Latin America & Caribbean,1963,410.704


### 5.2. Using pandas .ExcelFile
`pd.ExcelFile` opens the workbook as an object whose `sheet_names` attribute lists every sheet it contains.

In [174]:
# Read the file
co2_wide = pd.ExcelFile('datasets/importing-data/CO2-Dataset.xlsx')

# Display the sheets in the workbook
co2_wide.sheet_names

['About',
 'CO2 (kt) Pivoted',
 'CO2 (kt) RAW DATA',
 'CO2 Data Cleaned',
 'CO2 (kt) for Split',
 'CO2 for World to Union',
 'CO2 Per Capita RAW DATA',
 'CO2 Per Capita (Pivoted)',
 'Metadata - Countries']

To load a particular sheet as a DataFrame, call the `.parse()` method on the `ExcelFile` object with a single argument: either the sheet name as a string or the sheet's index, counting from 0.

In [175]:
# Parse using the string name of the sheet
co2_clean = co2_wide.parse('CO2 Data Cleaned')
co2_clean.head()

Unnamed: 0,Country Code,Country Name,Region,Year,CO2 (kt),CO2 Per Capita (metric tons)
0,ABW,Aruba,Latin America & Caribbean,1960,,
1,ABW,Aruba,Latin America & Caribbean,1961,,
2,ABW,Aruba,Latin America & Caribbean,1962,,
3,ABW,Aruba,Latin America & Caribbean,1963,,
4,ABW,Aruba,Latin America & Caribbean,1964,,


In [176]:
# Parse using the index location of the sheet.
co2_clean_index = co2_wide.parse(3)
co2_clean_index.head()

Unnamed: 0,Country Code,Country Name,Region,Year,CO2 (kt),CO2 Per Capita (metric tons)
0,ABW,Aruba,Latin America & Caribbean,1960,,
1,ABW,Aruba,Latin America & Caribbean,1961,,
2,ABW,Aruba,Latin America & Caribbean,1962,,
3,ABW,Aruba,Latin America & Caribbean,1963,,
4,ABW,Aruba,Latin America & Caribbean,1964,,


## 6. Comma-separated files
CSV stands for Comma-Separated Values. It is a widely used file format for storing tabular data in plain text. In a CSV file, each line represents a row of data, and within each line, individual values are separated by commas (,). CSV files typically have a `.csv` file extension.

In [177]:
movies = pd.read_csv('datasets/importing-data/Movie-Data.csv', index_col = 'Movie Title')
movies.head()

Movie Title,Release Date,Wikipedia URL,Genre,Director (1),Director (2),Cast (1),Cast (2),Cast (3),Cast (4),Cast (5),Budget,Revenue
10 Cloverfield Lane,2016-03-08,https://en.wikipedia.org/wiki/10_Cloverfield_Lane,Thriller,Dan Trachtenberg,,Mary Elizabeth Winstead,John Goodman,John Gallagher,,,"$15,000,000.00","$108,300,000.00"
13 Hours: The Secret Soldiers of Benghazi,2016-01-15,https://en.wikipedia.org/wiki/13_Hours:_The_Se...,Action,Michael Bay,,James Badge Dale,John Krasinski,Toby Stephens,Pablo Schreiber,Max Martini,"$45,000,000.00","$69,400,000.00"
2 Guns,2013-08-02,https://en.wikipedia.org/wiki/2_Guns,Action,Baltasar Kormákur,,Mark Wahlberg,Denzel Washington,Paula Patton,Bill Paxton,Edward James Olmos,"$61,000,000.00","$131,900,000.00"
21 Jump Street,2012-03-16,https://en.wikipedia.org/wiki/21_Jump_Street_(...,Comedy,Phil Lord,Chris Miller,Jonah Hill,Channing Tatum,Ice Cube,Brie Larson,Rob Riggle,"$55,000,000.00","$201,500,000.00"
22 Jump Street,2014-06-04,https://en.wikipedia.org/wiki/22_Jump_Street,Action,Phil Lord,Chris Miller,Channing Tatum,Jonah Hill,Ice Cube,,,"$84,500,000.00","$331,300,000.00"
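In the output above, `Budget` and `Revenue` are strings such as `$15,000,000.00`. One way to get numbers instead is to pass a converter function to `pd.read_csv`. The sketch below uses a two-row in-memory sample taken from the rows above, and the helper `parse_currency` is a hypothetical name introduced here:

```python
import io

import pandas as pd

# Two rows copied from the output above, reduced to three columns
csv_text = '''Movie Title,Release Date,Budget
10 Cloverfield Lane,2016-03-08,"$15,000,000.00"
2 Guns,2013-08-02,"$61,000,000.00"
'''

def parse_currency(value):
    """Convert a string like '$15,000,000.00' to a float."""
    return float(value.replace('$', '').replace(',', ''))

movies_small = pd.read_csv(
    io.StringIO(csv_text),
    index_col='Movie Title',
    parse_dates=['Release Date'],          # parse this column as dates
    converters={'Budget': parse_currency}, # apply the helper to each value
)
print(movies_small.dtypes)
```

The converter runs on the raw string of each cell, so it assumes every `Budget` cell is non-empty and follows the same format.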


## 7. Python files
You can read and display the contents of a Python file as plain text using the `open` function.

In [178]:
with open('datasets/importing-data/tweet_listener.py', 'r') as con:
    tweet = con.read()

print(tweet)

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(MyStreamListener, self).__init__()
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")

    def on_status(self, status):
        tweet = status._json
        self.file.write( json.dumps(tweet) + '\n' )
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True
        else:
            return False
        self.file.close()

    def on_error(self, status):
        print(status)


In [179]:
with open('datasets/importing-data/test_arithmetic.py', 'r') as con:
    doc = con.read()
    
print(doc[:300])

import operator

import numpy as np
import pytest

import pandas as pd
import pandas._testing as tm
from pandas.arrays import FloatingArray


@pytest.fixture
def data():
    return pd.array(
        [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False],
        dtype="boolean"


## 8. SAS files

In [180]:
# Import sas7bdat
from sas7bdat import SAS7BDAT

In [181]:
# Opening the SAS file 'consumption.sas7bdat'
consumption = SAS7BDAT('datasets/importing-data/consumption.sas7bdat')

# Converting the SAS data to a DataFrame
consumption_df = consumption.to_data_frame()

consumption_df.head(7)

Unnamed: 0,INC,CONS,DUR
0,8369.0,7537.0,428.0
1,8436.0,7651.0,434.0
2,8567.0,7655.0,404.0
3,8692.0,7885.0,475.0
4,8775.0,7947.0,491.0
5,8865.0,7967.0,486.0
6,8794.0,7917.0,482.0


## 9. SQL
Python can connect to SQLite database files and query them. Here, SQLAlchemy provides the connection and pandas runs the queries.

In [182]:
# Importing necessary modules from SQLAlchemy
from sqlalchemy import create_engine, inspect, text

# Creating an engine to connect to the SQLite database 'Chinook.sqlite'
engine = create_engine('sqlite:///datasets/importing-data/Chinook.sqlite')

# Inspecting the engine to get the table names in the database
table_names = inspect(engine).get_table_names()
table_names

['Album',
 'Artist',
 'Customer',
 'Employee',
 'Genre',
 'Invoice',
 'InvoiceLine',
 'MediaType',
 'Playlist',
 'PlaylistTrack',
 'Track']

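Some syntax written for older SQLAlchemy releases does not run in SQLAlchemy 2.0.20. The queries below have been adapted to run in that version: each SQL string is wrapped in `text()` and executed on a connection obtained from `engine.connect()`.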

In [183]:
# Executing an SQL query to select all columns from the 'album' table
album = pd.read_sql_query(text("SELECT * FROM Album"), engine.connect())

# Displaying the first few rows of the 'album' DataFrame
album.head(7)

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3
5,6,Jagged Little Pill,4
6,7,Facelift,5


Retrieve all columns from the `artist` table in the connected database and load them into a Pandas DataFrame

In [184]:
artist = pd.read_sql_query(text('SELECT * FROM artist'), engine.connect())
artist.head()

Unnamed: 0,ArtistId,Name
0,1,AC/DC
1,2,Accept
2,3,Aerosmith
3,4,Alanis Morissette
4,5,Alice In Chains


Retrieve all columns from the `track` table.

In [185]:
tracks = pd.read_sql_query(text('SELECT * FROM Track'), engine.connect())
tracks.head()

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
1,2,Balls to the Wall,2,2,1,,342562,5510424,0.99
2,3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...",230619,3990994,0.99
3,4,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...",252051,4331779,0.99
4,5,Princess of the Dawn,3,2,1,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99


In [186]:
genre = pd.read_sql_query(text('SELECT * FROM Genre'), engine.connect())
genre.head()

Unnamed: 0,GenreId,Name
0,1,Rock
1,2,Jazz
2,3,Metal
3,4,Alternative & Punk
4,5,Rock And Roll


Retrieve information from the `Track` table and use inner joins to combine it with the `Album`, `Artist`, and `Genre` tables.

In [187]:
union = pd.read_sql_query(text(
    """
    SELECT Track.NAME as track, artist.Name as artist, Genre.Name as genre, album.title as album FROM Track 
    INNER JOIN Album on track.AlbumId = Album.AlbumId 
    INNER JOIN Artist on Album.ArtistId = Artist.ArtistId 
    INNER JOIN Genre on Track.GenreId = Genre.GenreID  
    ORDER BY artist 
    LIMIT 30"""), engine.connect())
union

Unnamed: 0,track,artist,genre,album
0,For Those About To Rock (We Salute You),AC/DC,Rock,For Those About To Rock We Salute You
1,Put The Finger On You,AC/DC,Rock,For Those About To Rock We Salute You
2,Let's Get It Up,AC/DC,Rock,For Those About To Rock We Salute You
3,Inject The Venom,AC/DC,Rock,For Those About To Rock We Salute You
4,Snowballed,AC/DC,Rock,For Those About To Rock We Salute You
5,Evil Walks,AC/DC,Rock,For Those About To Rock We Salute You
6,C.O.D.,AC/DC,Rock,For Those About To Rock We Salute You
7,Breaking The Rules,AC/DC,Rock,For Those About To Rock We Salute You
8,Night Of The Long Knives,AC/DC,Rock,For Those About To Rock We Salute You
9,Spellbound,AC/DC,Rock,For Those About To Rock We Salute You
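The join pattern can be tried without the Chinook file. The sketch below swaps in Python's built-in `sqlite3` module and an in-memory database instead of SQLAlchemy, and builds two tiny tables from rows shown in the outputs above. It also passes the artist name as a query parameter rather than embedding it in the SQL string:

```python
import sqlite3

import pandas as pd

# In-memory database with a few rows copied from the outputs above
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept');
INSERT INTO Album VALUES
    (1, 'For Those About To Rock We Salute You', 1),
    (2, 'Balls to the Wall', 2),
    (3, 'Restless and Wild', 2);
""")

# Match each album to its artist, keeping only one artist's albums
query = """
SELECT Album.Title AS album, Artist.Name AS artist
FROM Album
INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = ?
ORDER BY Album.AlbumId
"""
albums = pd.read_sql_query(query, conn, params=('Accept',))
conn.close()
print(albums)
```

An inner join keeps only rows with a match in both tables, so an album whose `ArtistId` had no entry in `Artist` would be dropped from the result.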


## 10. Pickle
### 10.1. Using pandas

In [188]:
avocado = pd.read_pickle('datasets/importing-data/avoplotto.pkl')
avocado.head()

Unnamed: 0,date,type,year,avg_price,size,nb_sold
0,2015-12-27,conventional,2015,0.95,small,9626901.09
1,2015-12-20,conventional,2015,0.98,small,8710021.76
2,2015-12-13,conventional,2015,0.93,small,9855053.66
3,2015-12-06,conventional,2015,0.89,small,9405464.36
4,2015-11-29,conventional,2015,0.99,small,8094803.56


In [189]:
wards = pd.read_pickle('datasets/joining-data/chicago_ward.p')
wards.head()

Unnamed: 0,ward,alderman,address,zip
0,1,"Proco ""Joe"" Moreno",2058 NORTH WESTERN AVENUE,60647
1,2,Brian Hopkins,1400 NORTH ASHLAND AVENUE,60622
2,3,Pat Dowell,5046 SOUTH STATE STREET,60609
3,4,William D. Burns,"435 EAST 35TH STREET, 1ST FLOOR",60616
4,5,Leslie A. Hairston,2325 EAST 71ST STREET,60649


### 10.2. Open function

In [190]:
import pickle

with open('datasets/importing-data/avoplotto.pkl', 'rb') as con:
    avocado2 = pickle.load(con)

avocado2.head()

Unnamed: 0,date,type,year,avg_price,size,nb_sold
0,2015-12-27,conventional,2015,0.95,small,9626901.09
1,2015-12-20,conventional,2015,0.98,small,8710021.76
2,2015-12-13,conventional,2015,0.93,small,9855053.66
3,2015-12-06,conventional,2015,0.89,small,9405464.36
4,2015-11-29,conventional,2015,0.99,small,8094803.56


In [191]:
with open('datasets/joining-data/chicago_ward.p', 'rb') as con:
    ward = pickle.load(con)
ward.head()

Unnamed: 0,ward,alderman,address,zip
0,1,"Proco ""Joe"" Moreno",2058 NORTH WESTERN AVENUE,60647
1,2,Brian Hopkins,1400 NORTH ASHLAND AVENUE,60622
2,3,Pat Dowell,5046 SOUTH STATE STREET,60609
3,4,William D. Burns,"435 EAST 35TH STREET, 1ST FLOOR",60616
4,5,Leslie A. Hairston,2325 EAST 71ST STREET,60649
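Pickle files are written with `pickle.dump` and read back with `pickle.load`, both on files opened in binary mode. A minimal round-trip sketch, using a made-up list of dictionaries and a temporary file:

```python
import os
import pickle
import tempfile

# Hypothetical object to serialize
wards_sample = [{'ward': 1, 'zip': '60647'}, {'ward': 2, 'zip': '60622'}]

path = os.path.join(tempfile.mkdtemp(), 'wards.pkl')

# 'wb' and 'rb': pickle data is bytes, not text
with open(path, 'wb') as con:
    pickle.dump(wards_sample, con)

with open(path, 'rb') as con:
    restored = pickle.load(con)

print(restored == wards_sample)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.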


## 11. HDF5
HDF5 is a general purpose library and file format for storing scientific data. HDF5 can store two primary types of objects: datasets and groups. 

In [192]:
# Importing the necessary module
import h5py

# Opening an HDF5 file in read mode
file = h5py.File('datasets/importing-data/L-L1_LOSC_4_V1-1126259446-32.hdf5', 'r')

# Getting the keys (group names) in the HDF5 file
file.keys()

<KeysViewHDF5 ['meta', 'quality', 'strain']>

To access a specific dataset named `Strain` within the group `strain` in the HDF5 file and print the contents:

In [193]:
# Accessing the dataset 'Strain' within the group 'strain' in the HDF5 file
strain_data = file['strain']['Strain']

# Converting the dataset 'Strain' to a NumPy array
numpy_strain_data = np.array(strain_data)
numpy_strain_data

array([-1.77955839e-18, -1.76552067e-18, -1.71049117e-18, ...,
       -1.76375155e-18, -1.72364846e-18, -1.71969299e-18])
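The group and dataset structure can be explored on a small file created on the spot. The sketch below writes a made-up dataset into a group and reads it back, opening the file with a `with` block so that it is closed automatically. The file name and values are assumptions for illustration and the layout only mimics the `strain` group above:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'example.hdf5')

# Create a group containing one dataset
with h5py.File(path, 'w') as f:
    group = f.create_group('strain')
    group.create_dataset('Strain', data=np.linspace(0.0, 1.0, 5))

# Read it back; slicing with [:] copies the data into a NumPy array,
# so the array stays usable after the file is closed
with h5py.File(path, 'r') as f:
    group_names = list(f.keys())
    strain = f['strain']['Strain'][:]

print(group_names)
print(strain)
```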