# 1. Importing Data in Python

Most of the time, you’ll use either NumPy or pandas to import your data:

In [None]:
import numpy as np
import pandas as pd

# 2. Help



In [None]:
import numpy as np
np.info(np.ndarray.dtype)
help(pd.read_csv)

# Text Files



## Plain Text Files

In [None]:
filename = 'huck_finn.txt'
file = open(filename, mode='r')  # Open the file for reading
text = file.read()  # Read the file's contents
print(file.closed)  # Check whether the file is closed
file.close()  # Close the file
print(text)

Using the context manager `with`:

In [None]:
with open('huck_finn.txt', 'r') as file:
  print(file.readline()) #Read a single line
  print(file.readline())
  print(file.readline())
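
A file object is also an iterator over its lines, so a `for` loop can stand in for repeated `readline()` calls. The sketch below writes a small temporary file first so that it runs without `huck_finn.txt`; the file name `sample.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical name and content)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# Iterate over the file line by line; the file is closed when the block ends
lines = []
with open(sample_path, "r") as f:
    for line in f:
        lines.append(line.rstrip("\n"))

print(lines)
print(f.closed)
```

Because the `with` block closes the file automatically, no explicit `close()` call is needed.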

## Table Data: Flat files



### Importing Flat Files with NumPy

### Files with one data type

In [None]:
filename = 'mnist.txt'
data = np.loadtxt(filename,
                  delimiter=',',   # String used to separate values
                  skiprows=2,      # Skip the first 2 lines
                  usecols=[0, 2],  # Read the 1st and 3rd columns
                  dtype=str)       # Data type of the resulting array
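
To see what `skiprows` and `usecols` do without needing `mnist.txt`, here is a minimal sketch that feeds `np.loadtxt()` an in-memory buffer. The sample rows are invented, and `dtype=int` is used instead of `str` so the result is numeric.

```python
import io
import numpy as np

# Hypothetical file content: two header lines, then comma-separated integers
raw = io.StringIO(
    "made-up header line\n"
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
)

digits = np.loadtxt(raw,
                    delimiter=",",   # values are comma-separated
                    skiprows=2,      # skip both header lines
                    usecols=[0, 2],  # keep the 1st and 3rd columns only
                    dtype=int)

print(digits)
print(digits.shape)
```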

### Files with mixed data type

In [None]:
filename = 'titanic.csv'

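data = np.genfromtxt(filename,
                     delimiter=',',
                     names=True,  # Look for a column header
                     dtype=None)  # Infer the type of each column

# np.recfromcsv() defaults to dtype=None. It was removed from the main
# namespace in NumPy 2.0; there, np.genfromtxt() as above is the replacement.
data_array = np.recfromcsv(filename)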
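
With `names=True` and `dtype=None`, `np.genfromtxt()` returns a structured array whose columns can be selected by header name. A small sketch, assuming invented passenger rows rather than the real `titanic.csv`; `encoding="utf-8"` is passed explicitly so text columns come back as `str`:

```python
import io
import numpy as np

# Hypothetical mixed-type data: a text column, an integer column, a float column
raw = io.StringIO(
    "name,age,fare\n"
    "Alice,29,7.25\n"
    "Bob,41,71.28\n"
)

passengers = np.genfromtxt(raw,
                           delimiter=",",
                           names=True,        # first row holds the column names
                           dtype=None,        # infer a type per column
                           encoding="utf-8")  # decode text columns as str

print(passengers.dtype.names)  # column names taken from the header
print(passengers["age"])       # select one column by name
```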

## Importing Flat Files with Pandas

In [None]:
filename = 'winequality-red.csv'

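data = pd.read_csv(filename,
                   nrows=5,         # Number of rows of the file to read
                   header=None,     # Row number to use as column names
                   sep='\t',        # Delimiter to use
                   comment='#',     # Character that marks the rest of a line as a comment
                   na_values=[""])  # Strings to recognize as NA/NaN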
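
The effect of `comment`, `na_values`, and `nrows` is easiest to see on a tiny input. The following sketch uses an in-memory, tab-separated buffer with invented values, and leaves `header` at its default so the first non-comment row supplies the column names.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with a comment line and one missing value
raw = io.StringIO(
    "# made-up sample data\n"
    "acidity\tsugar\tquality\n"
    "7.4\t1.9\t5\n"
    "7.8\t\t5\n"
    "11.2\t1.9\t6\n"
)

wine = pd.read_csv(raw,
                   sep="\t",         # columns are tab-separated
                   comment="#",      # skip the commented first line
                   na_values=[""],   # treat empty fields as NaN
                   nrows=2)          # read only the first two data rows

print(wine)
print(wine["sugar"].isna().tolist())
```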

# Exploring Your Data



## NumPy Arrays

In [None]:
data_array.dtype  # Data type of array elements
data_array.shape  # Array dimensions
len(data_array)  # Length of array

## Pandas Dataframes

In [None]:
df = data  # The DataFrame loaded with pd.read_csv() above

In [None]:
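df.head()  # Return the first rows of the DataFrame
df.tail()  # Return the last rows of the DataFrame
df.index  # Describe the index
df.columns  # Describe the DataFrame columns
df.info()  # Print a summary of the DataFrame
data_array = data.values  # Convert a DataFrame to a NumPy array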

## SAS File

In [None]:
from sas7bdat import SAS7BDAT
# 'data.sas7bdat' is a placeholder; pass the path to your own SAS file
with SAS7BDAT('data.sas7bdat') as file:
    df_sas = file.to_data_frame()

## Stata File

In [None]:
data = pd.read_stata('urbanpop.dta')

## Excel Spreadsheets

In [None]:
file = 'urbanpop.xlsx'
data = pd.ExcelFile(file)

df_sheet2 = data.parse('1960-1966',  # Select a sheet by name
                       skiprows=[0],
                       names=['Country',
                              'AAM: War(2002)'])

df_sheet1 = data.parse(0,  # Select a sheet by position
                       usecols=[0],  # Replaces the removed parse_cols argument
                       skiprows=[0],
                       names=['Country'])

To access the sheet names, use the `sheet_names` attribute:

In [None]:
data.sheet_names

# Relational Databases



In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Northwind.sqlite')  # Three slashes: path relative to the working directory

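Use the inspector's `get_table_names()` method to fetch a list of table names (`engine.table_names()` was removed in SQLAlchemy 2.0):

In [None]:
from sqlalchemy import inspect
table_names = inspect(engine).get_table_names()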

## Querying Relational Databases

In [None]:
from sqlalchemy import text
con = engine.connect()
rs = con.execute(text("SELECT * FROM Orders"))  # SQLAlchemy 2.0 requires text() for raw SQL strings
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
con.close()

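Using the context manager `with`:

In [None]:
from sqlalchemy import text
with engine.connect() as con:
    # The query is reused from the previous cell as a stand-in; substitute your own
    rs = con.execute(text("SELECT * FROM Orders"))
    df = pd.DataFrame(rs.fetchmany(size=5))  # Fetch only the first 5 rows
    df.columns = rs.keys()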
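
The same fetch-a-few-rows pattern can be tried without `Northwind.sqlite`. This sketch swaps SQLAlchemy for Python's built-in `sqlite3` module and an in-memory database; the `Orders` table and its two columns are invented for illustration.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (hypothetical schema and rows)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (OrderID INTEGER, Customer TEXT)")
con.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(i, f"customer_{i}") for i in range(1, 11)])

# Run the query and pull only the first three rows of the result
cur = con.execute("SELECT * FROM Orders ORDER BY OrderID")
first_rows = cur.fetchmany(3)

# cursor.description holds one tuple per column; item 0 is the column name
orders_head = pd.DataFrame(first_rows,
                           columns=[col[0] for col in cur.description])
con.close()

print(orders_head)
```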

## Querying relational databases with pandas

In [None]:
df = pd.read_sql_query("SELECT * FROM Orders", engine)
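
`pd.read_sql_query()` also accepts a plain `sqlite3` connection and a `params` argument for parameterized queries. A minimal sketch, assuming an invented `Orders` table in an in-memory database:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database (hypothetical schema and rows)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (OrderID INTEGER, Freight REAL)")
con.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(1, 32.4), (2, 11.6), (3, 65.8)])
con.commit()

# The ? placeholder is filled from params, which avoids building SQL by hand
heavy = pd.read_sql_query(
    "SELECT * FROM Orders WHERE Freight > ? ORDER BY OrderID",
    con,
    params=(20,),
)
con.close()

print(heavy)
```

The query, the fetch, and the column naming all happen in one call, which is why this is usually shorter than the `execute()` and `fetchall()` route.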

# Pickled Files


In [None]:
import pickle
with open('pickled_fruit.pkl', 'rb') as file:  # 'rb': read in binary mode
  pickled_data = pickle.load(file)
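
For a runnable round trip, the sketch below first writes a pickle and then reads it back. The dictionary contents and the file name `fruit.pkl` are made up. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical data to serialize
fruit = {"apples": 3, "pears": 5}

# Write the object to a temporary pickle file in binary mode
path = os.path.join(tempfile.mkdtemp(), "fruit.pkl")
with open(path, "wb") as f:
    pickle.dump(fruit, f)

# Read it back, again in binary mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```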

# MATLAB files


In [None]:
import scipy.io
filename = 'workspace.mat'
mat = scipy.io.loadmat(filename)
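
`scipy.io.loadmat()` returns a dictionary that maps MATLAB variable names to NumPy arrays, alongside a few metadata keys that start with double underscores. A small sketch, assuming a file written with `scipy.io.savemat()` and an invented variable called `signal`:

```python
import os
import tempfile

import numpy as np
import scipy.io

# Write a small .mat file to a temporary location (hypothetical variable name)
path = os.path.join(tempfile.mkdtemp(), "example.mat")
scipy.io.savemat(path, {"signal": np.array([1.0, 2.0, 3.0])})

# Load it back; the result is a plain dict
workspace = scipy.io.loadmat(path)

# Filter out metadata keys such as __header__ and __version__
variables = [key for key in workspace if not key.startswith("__")]

print(variables)
print(workspace["signal"].shape)  # MATLAB arrays are at least 2-D
```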

# HDF5 Files


In [None]:
import h5py
filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
data = h5py.File(filename, 'r')

# Exploring Dictionaries

In [None]:
print(mat.keys()) #Print dictionary keys
for key in data.keys(): #Print dictionary keys
  print(key)

meta

quality

strain

In [None]:
pickled_data.values()  # Return dictionary values
print(mat.items())  # Print the (key, value) pairs of the dictionary

## Accessing Data Items with Keys

In [None]:
for key in data['meta'].keys():
    print(key)  # Explore the HDF5 structure

Description

DescriptionURL

Detector

Duration

GPSstart

Observatory

Type

UTCstart

In [None]:
# Retrieve the value for a key; [()] replaces the .value attribute removed in h5py 3.0
print(data['meta']['Description'][()])
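
The group-and-dataset layout can be explored on a small file of your own. This sketch writes a temporary HDF5 file and reads it back; the group and dataset names mirror the ones printed above, but the values are invented.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a tiny HDF5 file with one group and two datasets (made-up values)
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    meta = f.create_group("meta")
    meta.create_dataset("Duration", data=4096)
    f.create_dataset("strain", data=np.arange(5.0))

# Read it back: groups behave like dictionaries, datasets like arrays
with h5py.File(path, "r") as f:
    top_level = sorted(f.keys())
    duration = f["meta"]["Duration"][()]  # [()] reads a scalar dataset
    strain = f["strain"][:]               # [:] reads the whole array

print(top_level)
print(duration)
print(strain)
```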

# Navigating Your FileSystem



## Magic Commands

In [None]:
# List directory contents (shell command)
!ls
# Change current working directory
%cd ..
# Return the current working directory path
%pwd

## OS Library

In [None]:
import os
path = "/usr/tmp"
wd = os.getcwd()  # Store the name of the current directory in a string
os.listdir(wd)  # Return the contents of the directory as a list
os.chdir(path)  # Change current working directory
os.rename("test1.txt", "test2.txt")  # Rename a file
os.remove("test2.txt")  # Delete an existing file (test1.txt was just renamed)
os.mkdir("newdir")  # Create a new directory
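
To try these calls without touching real files, the sketch below runs the same sequence inside a temporary directory created with `tempfile.mkdtemp()` and then returns to the starting directory. The file and directory names are the ones used above.

```python
import os
import tempfile

start_dir = os.getcwd()        # remember where we started
work_dir = tempfile.mkdtemp()  # scratch directory for the demo
os.chdir(work_dir)

# Create a file so there is something to rename and delete
with open("test1.txt", "w") as f:
    f.write("scratch")

os.rename("test1.txt", "test2.txt")  # rename the file
os.mkdir("newdir")                   # create a new directory
after_rename = sorted(os.listdir("."))

os.remove("test2.txt")               # delete the renamed file
after_remove = sorted(os.listdir("."))

os.chdir(start_dir)                  # go back to the original directory

print(after_rename)
print(after_remove)
```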