# Importing Data in Python (Part 1)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from jupyterthemes import jtplot

jtplot.style()
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format='retina'

## Chapter 1. Introduction and flat files

## 1. Reading a text file

* `r`: open for reading (default)  
* `w`: open for writing; an existing file is truncated, and a new file is created if none exists  
* `x`: open for writing only if the file does not exist; otherwise an exception is raised  
* `a`: open for appending; new data is added to the end of the file  
* `b`: open in binary mode  
* `t`: open in text mode (default)  
* `+`: open for both reading and writing  
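
The modes listed above can be tried out on a scratch file. This is a minimal sketch; the temporary directory and the file name `modes_demo.txt` are hypothetical and not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

# 'w' creates the file (or truncates an existing one)
with open(path, 'w') as f:
    f.write('first line\n')

# 'a' appends to the end instead of overwriting
with open(path, 'a') as f:
    f.write('second line\n')

# 'r' (the default) reads the content back as text
with open(path, 'r') as f:
    lines = f.read().splitlines()

# 'x' refuses to overwrite: it raises FileExistsError if the file exists
try:
    with open(path, 'x') as f:
        f.write('never written\n')
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

print(lines)
print(exclusive_failed)
```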

```python
filename = 'huck_finn.txt'
file = open(filename, 'r')  # open for reading
text = file.read()
file.close()  # release the file handle when done

```

```python
with open(filename, 'r') as file:
    print(file.read())
    
```

In [2]:
with open('Importing_Data_in_Python_Part1/moby_dick.txt', 'w') as file:
    file.write('CHAPTER 1. Loomings.\nCall me Ishmael. Some years ago--never mind how long \
               precisely--having\n little or no money in my purse, and nothing particular to \
               interest me on\n shore, I thought I would sail about a little and see the \
               watery part of\n the world. It is a way I have of driving off the spleen and \
               regulating\n the circulation. Whenever I find myself growing grim about the \
               mouth;\n whenever it is a damp, drizzly November in my soul; whenever I \
               find\n myself involuntarily pausing before coffin warehouses, and bringing \
               up\n the rear of every funeral I meet; and especially whenever my hypos get\n \
               such an upper hand of me, that it requires a strong moral principle to\n \
               prevent me from deliberately stepping into the street, and methodically\n \
               knocking people\'s hats off--then, I account it high time to get to sea\n as \
               soon as I can. This is my substitute for pistol and ball. With a\n \
               philosophical flourish Cato throws himself upon his sword; I quietly\n take \
               to the ship. There is nothing surprising in this. If they but knew\n it, \
               almost all men in their degree, some time or other, cherish very\n nearly the \
               same feelings towards the ocean with me.')
    # No explicit close() needed: the with block closes the file on exit
    

In [3]:
with open('Importing_Data_in_Python_Part1/moby_dick.txt', 'r') as file:
    print(file.read())
    # The with block closes the file automatically on exit

CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long                precisely--having
 little or no money in my purse, and nothing particular to                interest me on
 shore, I thought I would sail about a little and see the                watery part of
 the world. It is a way I have of driving off the spleen and                regulating
 the circulation. Whenever I find myself growing grim about the                mouth;
 whenever it is a damp, drizzly November in my soul; whenever I                find
 myself involuntarily pausing before coffin warehouses, and bringing                up
 the rear of every funeral I meet; and especially whenever my hypos get
                such an upper hand of me, that it requires a strong moral principle to
                prevent me from deliberately stepping into the street, and methodically
                knocking people's hats off--then, I account it high time to get to sea
 as                soon as I can. This

In [4]:
file.closed

True

## 2. Importing flat files using NumPy

```python
file = 'digits_header.txt'

# Skip the 1st row and choose 1st and 3rd columns
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

```
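
To see what `skiprows` and `usecols` do without needing the course file, here is a small sketch that reads from an in-memory string. The sample content is made up and only stands in for `digits_header.txt`.

```python
import io
import numpy as np

# Hypothetical tab-separated content: one header line, two data rows
raw = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

# skiprows=1 drops the header line; usecols=[0, 2] keeps the 1st and 3rd columns
data = np.loadtxt(io.StringIO(raw), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data.shape)
print(data)
```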

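* In the `np.genfromtxt()` call below, the first argument is the filename, the second specifies the delimiter, and the third, `names=True`, indicates that the file has a header row. Passing `dtype=None` lets NumPy infer the type of each column.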

```python
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
```
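
With `names=True` the result is a structured array whose columns can be accessed by header name. The sketch below assumes a tiny made-up CSV (not the real `titanic.csv`) and passes `encoding='utf-8'` so that text columns come back as `str`.

```python
import io
import numpy as np

# Hypothetical CSV content with a header row and mixed column types
raw = "name,age,fare\nAnn,29,7.25\nBob,41,71.5\n"

data = np.genfromtxt(io.StringIO(raw), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)  # field names taken from the header row
print(data['age'])       # one column, selected by name
```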

```python
np.recfromcsv('filename')
```

## 3. Importing flat files using Pandas

```python
import pandas as pd

data_frame = pd.read_csv('filename.csv', nrows=10, header=None)

data_array = data_frame.values

```
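
A quick way to check the effect of `nrows` and `header=None` is to read from an in-memory string. The sample rows below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical headerless CSV content with three rows
raw = "1,2,3\n4,5,6\n7,8,9\n"

# header=None: pandas numbers the columns 0, 1, 2; nrows=2: only the first two rows are read
data_frame = pd.read_csv(io.StringIO(raw), nrows=2, header=None)
data_array = data_frame.values

print(data_frame.columns.tolist())
print(data_array.shape)
```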

* Handle comments, empty lines and missing values that occur in flat files

```python
data = pd.read_csv(file, sep='\t', comment='#',
                   na_values=['NA', 'NaN', 'Nothing'])
```
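
The following sketch shows the same arguments on a small made-up sample: the comment line is skipped and the strings listed in `na_values` are read as missing values.

```python
import io
import pandas as pd

# Hypothetical tab-separated content with a comment line and two missing markers
raw = "name\tscore\n# a comment line\nalice\t3.5\nbob\tNothing\ncarol\tNA\n"

data = pd.read_csv(io.StringIO(raw), sep='\t', comment='#',
                   na_values=['NA', 'NaN', 'Nothing'])

print(data)
print(data['score'].isna().sum())  # number of missing scores
```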

## Chapter 2. Importing data from other file types

## 4. Introduction to other file types

### Other filetypes:

* Excel spreadsheets
* MATLAB files
* SAS files
* STATA files
* HDF5 files

### Pickled files

* File type native to Python
* Motivation: many datatypes for which it isn't obvious how to store them
* Pickled files are serialized
* Serialize = convert object to byte stream

```python
import pickle

with open('pickled_fruit.pkl', 'rb') as file:
    data = pickle.load(file)
    
print(data)

```

```
{'apples': 14, 'bananas': 43, 'oranges': 6}
```
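
For completeness, a pickled file like this can be produced with `pickle.dump()`. The round trip below is a sketch that writes to a temporary directory, so the path is an assumption and not the course file.

```python
import os
import pickle
import tempfile

fruit = {'apples': 14, 'bananas': 43, 'oranges': 6}

# Hypothetical location for the pickled file
path = os.path.join(tempfile.mkdtemp(), 'pickled_fruit.pkl')

# Serialize: 'wb' because a pickle is a byte stream
with open(path, 'wb') as file:
    pickle.dump(fruit, file)

# Deserialize: 'rb' to read the bytes back into a Python object
with open(path, 'rb') as file:
    data = pickle.load(file)

print(data)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.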

### Importing Excel spreadsheets

```python
file = 'urbanpop.xlsx'

data = pd.ExcelFile(file)

print(data.sheet_names)
```
```
['1960-1969', '1970-1979', '1980-1989']
```

```python
df1 = data.parse('1960-1969') # import '1960-1969' sheet as df1 variable
```

### The `os` library

### Loading a pickled file

```python
import pickle

with open('data.pkl', 'rb') as file:
    d = pickle.load(file)

print(type(d))

```
`<class 'dict'>`

### Customizing your spreadsheet import

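```python
xl = pd.ExcelFile('urbanpop.xlsx')  # same workbook as above

# Parse the second sheet: skip one row, keep only the first column, name it 'Country'
df2 = xl.parse(xl.sheet_names[1], usecols=[0],
               skiprows=1, names=['Country'])
```

* The additional arguments `skiprows`, `names` and `usecols` skip rows, name the columns and designate which columns to parse, respectively. Each can be given a list containing the specific row numbers, strings or column numbers, as appropriate. (`usecols` replaces the older `parse_cols` argument, which current pandas versions no longer accept.)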

## 5. Importing SAS/Stata files using pandas

### SAS and Stata files

* __SAS__ - Statistical Analysis System
* __Stata__ - 'Statistics' + 'data'


* __SAS__: business analytics and biostatistics
* __Stata__: academic social sciences research

### SAS files

* Used for:
  * Advanced analytics
  * Multivariate analysis
  * Business intelligence
  * Data management
  * Predictive analytics
* Standard for computational analysis
* File extensions: `.sas7bdat` (dataset files) or `.sas7bcat` (catalog files)

### Importing SAS files

```python
import pandas as pd
from sas7bdat import SAS7BDAT

with SAS7BDAT('urbanpop.sas7bdat') as file:
    df_sas = file.to_data_frame()
    
```

### Importing Stata files

```python
data = pd.read_stata('file_name.dta')

```
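
If no `.dta` file is at hand, one can be created with `DataFrame.to_stata()` and read back. This is a sketch under that assumption; the column names and the temporary path are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for a real Stata dataset
df = pd.DataFrame({'year': [1960, 1970], 'urban_pop': [1.5, 2.5]})

path = os.path.join(tempfile.mkdtemp(), 'file_name.dta')
df.to_stata(path, write_index=False)  # write a .dta file without the index column

data = pd.read_stata(path)            # read it back into a DataFrame
print(data)
```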

## 6. Importing HDF5 files

### HDF5 files

* Hierarchical Data Format version 5
* Standard for storing large quantities of numerical data
* Datasets can be hundreds of gigabytes or terabytes
* HDF5 can scale to exabytes

### Importing HDF5 files

```python
In [1]: import h5py
    
In [2]: filename = 'H-H1_LOSC_4_V1-815411200-4096.hdf5'
    
In [3]: data = h5py.File(filename, 'r') # 'r' is to read
    
In [4]: print(type(data))
        <class 'h5py._hl.files.File'>
        
In [5]: for key in data.keys():
            print(key)
Out[5]: meta
        quality
        strain
        
In [6]: print(type(data['meta']))
        <class 'h5py._hl.group.Group'>

```

### The structure of HDF5 files

```python
In [7]: for key in data['meta'].keys():
            print(key)
        
Out[7]: Description
        DescriptionURL
        Detector
        Duration
        GPSstart
        Observatory
        Type
        UTCstart
        
        # .value was removed in h5py 3.0; index with [()] to read the dataset
In [8]: print(data['meta']['Description'][()],
              data['meta']['Detector'][()])
    
        b'Strain data time series from LIGO' b'H1'
```


### The HDF Project

* Actively maintained by the HDF Group
* Based in Champaign, Illinois

## 7. Importing MATLAB files

### MATLAB

* 'Matrix Laboratory'
* Industry standard in engineering and science
* Data saved as .mat files

### SciPy to the rescue!

* `scipy.io.loadmat()` - read .mat files
* `scipy.io.savemat()` - write .mat files

### Importing a .mat file

```python
In [1]: import scipy.io
    
In [2]: filename = 'workspace.mat'
    
In [3]: mat = scipy.io.loadmat(filename)
    
In [4]: print(type(mat))
    
        <class 'dict'>
    
In [5]: print(type(mat['x']))
    
        <class 'numpy.ndarray'>
```

* keys = MATLAB variable names
* values = objects assigned to variables
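
The key/value structure can be checked with a round trip through `scipy.io.savemat()` and `scipy.io.loadmat()`. The variable name `x` and the temporary path are assumptions for this sketch. Note that `loadmat()` also adds a few metadata keys that start with `__`, and that a 1-D array comes back as a 2-D row, as MATLAB stores it.

```python
import os
import tempfile
import numpy as np
import scipy.io

# Hypothetical workspace with a single variable 'x'
path = os.path.join(tempfile.mkdtemp(), 'workspace.mat')
scipy.io.savemat(path, {'x': np.array([1.0, 2.0, 3.0])})

mat = scipy.io.loadmat(path)

# Filter out metadata keys such as '__header__' and '__version__'
user_keys = [key for key in mat if not key.startswith('__')]

print(user_keys)
print(mat['x'].shape)
```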

## Chapter 3. Working with relational databases in Python

## 8. Introduction to relational databases

### What is a relational database?

* Based on relational model of data
* First described by Edgar 'Ted' Codd

### Relational model

* Widely adopted
* Codd's 12 Rules/Commandments
* Consists of 13 rules (zero-indexed!)
* Describes what a Relational Database Management System must adhere to in order to be considered relational

### Relational Database Management Systems

* PostgreSQL
* MySQL
* SQLite
* SQL = Structured Query Language

### Relational Database

* Each row or record in a table represents an instance of an entity type
* Each column in a table represents an attribute or feature of an instance
* Every table contains a primary key column, which has a unique entry for each row
* There are relations between tables
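
The points above can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch on an in-memory database; the two tables are simplified, hypothetical versions of Northwind tables, not the real schema.

```python
import sqlite3

con = sqlite3.connect(':memory:')

# Each table has a primary key; EmployeeTerritory refers back to Employee
con.execute('CREATE TABLE Employee (Id INTEGER PRIMARY KEY, LastName TEXT)')
con.execute('CREATE TABLE EmployeeTerritory ('
            'EmployeeId INTEGER REFERENCES Employee(Id), TerritoryId TEXT)')

# Each row is one instance of the entity; each column is an attribute
con.execute("INSERT INTO Employee VALUES (1, 'Davolio')")
con.execute("INSERT INTO Employee VALUES (2, 'Fuller')")
con.execute("INSERT INTO EmployeeTerritory VALUES (1, '06897')")

# The primary key must be unique, so a second row with Id = 1 is rejected
try:
    con.execute("INSERT INTO Employee VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = con.execute('SELECT COUNT(*) FROM Employee').fetchone()[0]
con.close()

print(duplicate_rejected, row_count)
```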


## 9. Creating a database engine in Python

### Creating a database engine

* SQLite database
  * Fast and simple
* SQLAlchemy
  * Works with many Relational Database Management Systems


In [5]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')

### Getting table names

In [6]:
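from sqlalchemy import create_engine, inspect

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')

# engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0
table_names = inspect(engine).get_table_names()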
print(table_names)

['Category', 'Customer', 'CustomerCustomerDemo', 'CustomerDemographic', 'Employee', 'EmployeeTerritory', 'Order', 'OrderDetail', 'Product', 'Region', 'Shipper', 'Supplier', 'Territory']


## 10. Querying relational databases in Python

### Basic SQL query

`SELECT * FROM Table_Name`

* Returns all columns of all rows of the table
* Example:

`SELECT * FROM Orders`

* We’ll use SQLAlchemy and pandas

### Workflow of SQL querying

* Import packages and functions
* Create the database engine
* Connect to the engine
* Query the database
* Save query results to a DataFrame
* Close the connection
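
The steps above can be sketched end to end on an in-memory SQLite database. The table and its two rows are hypothetical stand-ins for the Northwind file, and the SQL is wrapped in `text()`, which SQLAlchemy 2.0 requires and 1.4 also accepts.

```python
# 1. Import packages and functions
import pandas as pd
from sqlalchemy import create_engine, text

# 2. Create the database engine (in-memory SQLite, hypothetical data)
engine = create_engine('sqlite://')
with engine.begin() as con:
    con.execute(text('CREATE TABLE Employee (Id INTEGER PRIMARY KEY, LastName TEXT)'))
    con.execute(text("INSERT INTO Employee VALUES (1, 'Davolio'), (2, 'Fuller')"))

# 3. Connect to the engine; the with block closes the connection (step 6)
with engine.connect() as con:
    # 4. Query the database
    rs = con.execute(text('SELECT * FROM Employee'))
    columns = list(rs.keys())
    rows = [tuple(row) for row in rs.fetchall()]

# 5. Save the query results to a DataFrame
df = pd.DataFrame(rows, columns=columns)
print(df)
```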

### Your first SQL query

In [7]:
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')

con = engine.connect()
# Wrap raw SQL in text(); SQLAlchemy 2.0 no longer accepts plain strings in execute()
rs = con.execute(text('SELECT * FROM Employee'))

df = pd.DataFrame(rs.fetchall())
con.close()

### Printing your query results

In [8]:
print(df.head())

   0          1         2                      3     4           5   \
0   1    Davolio     Nancy   Sales Representative   Ms.  1980-12-08   
1   2     Fuller    Andrew  Vice President, Sales   Dr.  1984-02-19   
2   3  Leverling     Janet   Sales Representative   Ms.  1995-08-30   
3   4    Peacock  Margaret   Sales Representative  Mrs.  1969-09-19   
4   5   Buchanan    Steven          Sales Manager   Mr.  1987-03-04   

           6                           7         8              9        10  \
0  2024-05-01  507 - 20th Ave. E. Apt. 2A   Seattle  North America    98122   
1  2024-08-14          908 W. Capital Way    Tacoma  North America    98401   
2  2024-04-01          722 Moss Bay Blvd.  Kirkland  North America    98033   
3  2025-05-03        4110 Old Redmond Rd.   Redmond  North America    98052   
4  2025-10-17             14 Garrett Hill    London  British Isles  SW1 8JR   

    11              12    13    14  \
0  USA  (206) 555-9857  5467  None   
1  USA  (206) 555-9482

### Set the DataFrame column names

In [9]:
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')

con = engine.connect()
# Wrap raw SQL in text(); SQLAlchemy 2.0 no longer accepts plain strings in execute()
rs = con.execute(text('SELECT * FROM Employee'))

df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
con.close() 

print(df.head())

   Id   LastName FirstName                  Title TitleOfCourtesy   BirthDate  \
0   1    Davolio     Nancy   Sales Representative             Ms.  1980-12-08   
1   2     Fuller    Andrew  Vice President, Sales             Dr.  1984-02-19   
2   3  Leverling     Janet   Sales Representative             Ms.  1995-08-30   
3   4    Peacock  Margaret   Sales Representative            Mrs.  1969-09-19   
4   5   Buchanan    Steven          Sales Manager             Mr.  1987-03-04   

     HireDate                     Address      City         Region PostalCode  \
0  2024-05-01  507 - 20th Ave. E. Apt. 2A   Seattle  North America      98122   
1  2024-08-14          908 W. Capital Way    Tacoma  North America      98401   
2  2024-04-01          722 Moss Bay Blvd.  Kirkland  North America      98033   
3  2025-05-03        4110 Old Redmond Rd.   Redmond  North America      98052   
4  2025-10-17             14 Garrett Hill    London  British Isles    SW1 8JR   

  Country       HomePhone 

### Using the context manager

* Method `.fetchall()` imports all rows
* Method `.fetchmany(size=n)` imports `n` rows

In [10]:
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')
with engine.connect() as con:
    # Wrap raw SQL in text(); SQLAlchemy 2.0 no longer accepts plain strings in execute()
    rs = con.execute(text('SELECT Id, FirstName, LastName FROM Employee'))
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()

### Filtering your query results

```SQL
SELECT * FROM Customer WHERE Country = 'Canada'
```

### Ordering your SQL records 

```SQL
SELECT * FROM Customer ORDER BY SupportRepId
```
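
To try `WHERE` and `ORDER BY` without the course database, the sketch below builds a tiny in-memory `Customer` table with made-up rows. It passes a plain `sqlite3` connection to `pd.read_sql_query()`, which pandas supports directly.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Customer ('
            'Id INTEGER PRIMARY KEY, Country TEXT, SupportRepId INTEGER)')
# Hypothetical rows for illustration only
con.executemany('INSERT INTO Customer VALUES (?, ?, ?)',
                [(1, 'Canada', 5), (2, 'Brazil', 3), (3, 'Canada', 4)])

# WHERE keeps only the rows that satisfy the condition
canadians = pd.read_sql_query("SELECT * FROM Customer WHERE Country = 'Canada'", con)

# ORDER BY sorts the rows by the given column (ascending by default)
ordered = pd.read_sql_query('SELECT * FROM Customer ORDER BY SupportRepId', con)
con.close()

print(canadians)
print(ordered)
```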

### Shortcut

In [11]:
df = pd.read_sql_query('SELECT * FROM Employee', engine)
df = pd.read_sql_query('SELECT * FROM Employee WHERE Id >= 6 ORDER BY BirthDate', engine)

## 11. Advanced querying: exploiting table relationships

### INNER JOIN in Python (pandas)

In [12]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///Importing_Data_in_Python_Part1/Northwind.sqlite')

df = pd.read_sql_query('SELECT FirstName, LastName, TerritoryId FROM Employee \
                        INNER JOIN EmployeeTerritory \
                        on Employee.Id = EmployeeTerritory.EmployeeID', engine)
print(df.head())

  FirstName LastName TerritoryId
0     Nancy  Davolio       06897
1     Nancy  Davolio       19713
2    Andrew   Fuller       01581
3    Andrew   Fuller       01730
4    Andrew   Fuller       01833


### Example: joining the tables 'Employee' and 'EmployeeTerritory' on the employee ID to select the columns 'FirstName', 'LastName' and 'TerritoryId'

In [13]:
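from sqlalchemy import text

with engine.connect() as con:
    # Wrap raw SQL in text(); SQLAlchemy 2.0 no longer accepts plain strings in execute()
    rs = con.execute(text('SELECT FirstName, LastName, TerritoryId FROM Employee \
                           INNER JOIN EmployeeTerritory \
                           ON Employee.Id = EmployeeTerritory.EmployeeID'))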
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

print(df.head())

  FirstName LastName TerritoryId
0     Nancy  Davolio       06897
1     Nancy  Davolio       19713
2    Andrew   Fuller       01581
3    Andrew   Fuller       01730
4    Andrew   Fuller       01833
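
An `INNER JOIN` keeps only the rows that have a match in both tables. The sketch below shows this on a small in-memory database with made-up rows (simplified versions of the two tables, not the Northwind file): the employee without a territory does not appear in the result.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Employee ('
            'Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)')
con.execute('CREATE TABLE EmployeeTerritory (EmployeeId INTEGER, TerritoryId TEXT)')

# Hypothetical rows: Janet (Id 3) has no entry in EmployeeTerritory
con.executemany('INSERT INTO Employee VALUES (?, ?, ?)',
                [(1, 'Nancy', 'Davolio'), (2, 'Andrew', 'Fuller'),
                 (3, 'Janet', 'Leverling')])
con.executemany('INSERT INTO EmployeeTerritory VALUES (?, ?)',
                [(1, '06897'), (1, '19713'), (2, '01581')])

query = '''
SELECT FirstName, LastName, TerritoryId
FROM Employee
INNER JOIN EmployeeTerritory
    ON Employee.Id = EmployeeTerritory.EmployeeId
ORDER BY Employee.Id, TerritoryId
'''
df = pd.read_sql_query(query, con)
con.close()

print(df)
```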
