## 21.1 Extract, Transform, Load (ETL)

* **Extract**: opening a file and reading its contents.
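* **Transform**: cleaning and converting the extracted data into a consistent form.
* **Load**: storing the transformed data in its destination, such as a database or a new file.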
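
The three stages can be sketched as three small functions chained together. This is only an illustration under assumed inputs: the function names `extract`, `transform`, and `load`, the sample text, and the in-memory `io.StringIO` source and destination are all hypothetical stand-ins for real files or databases.

```python
import io

def extract(source):
    """Extract: read the raw lines from a file-like object."""
    return source.read().splitlines()

def transform(lines):
    """Transform: trim whitespace, drop blank lines, normalize capitalization."""
    return [line.strip().title() for line in lines if line.strip()]

def load(names, destination):
    """Load: write one cleaned value per line to the destination."""
    destination.write("\n".join(names) + "\n")

# Hypothetical raw input with stray whitespace, mixed case, and a blank line.
raw = io.StringIO("  illinois \nOHIO\n\n iowa\n")
out = io.StringIO()

load(transform(extract(raw)), out)
print(out.getvalue())
```

Keeping each stage in its own function makes it easier to swap the source or the destination without touching the cleaning logic.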

## 21.2 Reading text files

### 21.2.1 Text encoding: ASCII, Unicode

In [2]:
open('test.txt', 'wb').write(bytes([65, 66, 67, 255, 192, 193])) # writes the bytes and returns the number of bytes written

6

In [3]:
x = open('test.txt').read() # by default errors='strict'

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

In [4]:
open('test.txt', errors='ignore').read()

'ABC'

In [8]:
open('test.txt', errors='replace').read() # each undecodable byte is replaced by the U+FFFD replacement character

'ABC���'

In [6]:
open('test.txt', errors='surrogateescape').read()

'ABC\udcff\udcc0\udcc1'

In [7]:
open('test.txt', errors='backslashreplace').read()

'ABC\\xff\\xc0\\xc1'
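
Besides choosing an `errors` handler, the `encoding` argument can be set explicitly when the real encoding is known. The sketch below assumes, purely for illustration, that the bytes are Latin-1; it also shows that `surrogateescape` is reversible. The temporary file path is created only for this example.

```python
import os
import tempfile

raw = bytes([65, 66, 67, 255, 192, 193])
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(raw)

# Latin-1 assigns a character to every byte value, so this read cannot fail.
with open(path, encoding="latin-1") as f:
    text = f.read()
print(ascii(text))  # 'ABC\xff\xc0\xc1'

# surrogateescape keeps undecodable bytes as lone surrogates, so encoding
# with the same handler restores the original bytes exactly.
with open(path, encoding="utf-8", errors="surrogateescape") as f:
    escaped = f.read()
restored = escaped.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

Decoding as Latin-1 always succeeds, but the result is only meaningful if the file really was written in that encoding.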

### 21.2.2 Unstructured text

In [13]:
moby_text = open('moby_01.txt').read()
moby_paragraphs = moby_text.split('\n\n')
print(moby_paragraphs[0])

Para1 word2, end.


In [14]:
len(moby_paragraphs)

4

In [15]:
moby_paragraphs[3]

''

In [16]:
moby_paragraphs[0].lower()

'para1 word2, end.'

In [18]:
moby = moby_paragraphs[0].lower()

In [19]:
moby = moby.replace(".", "")
moby = moby.replace(",", "")
moby_words = moby.split(" ")
print(moby_words)

['para1', 'word2', 'end']
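
Chained `replace` calls get tedious once more punctuation marks are involved. One possible alternative, sketched here with the same sample paragraph, is to delete all ASCII punctuation with `str.translate` and split on any whitespace; the variable name `remove_punct` is just for illustration.

```python
import string

paragraph = "Para1 word2, end."

# Translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)

# split() with no argument splits on any run of whitespace, including newlines.
words = paragraph.lower().translate(remove_punct).split()
print(words)  # ['para1', 'word2', 'end']
```

Note that `string.punctuation` covers ASCII only, so typographic quotes and dashes would need to be handled separately.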


### 21.2.3 Delimited flat files

In [20]:
line = "Illinois|1979/01/01|17.48|994"
print(line.split("|"))

['Illinois', '1979/01/01', '17.48', '994']


In [24]:
results = []
for line in open('temp_data_pipes_00a.txt'):
    fields = line.strip().split("|")
    results.append(fields)
results

[['State',
  'Month Day, Year Code',
  'Avg Daily Max Air Temperature (F)',
  'Record Count for Daily Max Air Temp (F)'],
 ['Illinois', '1979/01/01', '17.48', '994'],
 ['Illinois', '1979/01/02', '4.64', '994'],
 ['Illinois', '1979/01/03', '11.05', '994'],
 ['Illinois', '1979/01/04', '9.51', '994'],
 ['Illinois', '1979/05/15', '68.42', '994'],
 ['Illinois', '1979/05/16', '70.29', '994'],
 ['Illinois', '1979/05/17', '75.34', '994'],
 ['Illinois', '1979/05/18', '79.13', '994'],
 ['Illinois', '1979/05/19', '74.94', '994']]
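
Every field comes back as a string, and the header row is mixed in with the data. A minimal sketch of separating the header and converting the numeric columns follows; it uses an in-memory `io.StringIO` copy of the first rows instead of the real file, and the tuple layout of `records` is an assumption made for the example.

```python
import io

# In-memory stand-in for the first lines of the pipe-delimited file.
sample = io.StringIO(
    "State|Month Day, Year Code|Avg Daily Max Air Temperature (F)|"
    "Record Count for Daily Max Air Temp (F)\n"
    "Illinois|1979/01/01|17.48|994\n"
    "Illinois|1979/01/02|4.64|994\n"
)

# Consume the first line as the header, then iterate over the data lines.
header = next(sample).strip().split("|")

records = []
for line in sample:
    state, date, avg_max, count = line.strip().split("|")
    records.append((state, date, float(avg_max), int(count)))

print(header[0])   # State
print(records[0])  # ('Illinois', '1979/01/01', 17.48, 994)
```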

### 21.2.4 The csv module

In [26]:
import csv

results = [fields for fields in csv.reader(open('temp_data_pipes_00a.txt'), delimiter="|")]
results

[['State',
  'Month Day, Year Code',
  'Avg Daily Max Air Temperature (F)',
  'Record Count for Daily Max Air Temp (F)'],
 ['Illinois', '1979/01/01', '17.48', '994'],
 ['Illinois', '1979/01/02', '4.64', '994'],
 ['Illinois', '1979/01/03', '11.05', '994'],
 ['Illinois', '1979/01/04', '9.51', '994'],
 ['Illinois', '1979/05/15', '68.42', '994'],
 ['Illinois', '1979/05/16', '70.29', '994'],
 ['Illinois', '1979/05/17', '75.34', '994'],
 ['Illinois', '1979/05/18', '79.13', '994'],
 ['Illinois', '1979/05/19', '74.94', '994']]

#### Complex data
Here, some fields are double-quoted and others are not, the first field of each data row is empty, and some quoted fields contain commas.
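
To see why such data needs a real CSV parser, the sketch below compares a plain `split(",")` with `csv.reader` on one shortened data line (the line is abbreviated here for readability).

```python
import csv
import io

line = ',"Illinois","17","Jan 01, 1979","1979/01/01",17.48,994\n'

# A plain split breaks the quoted date apart and leaves the quotes in place.
naive = line.strip().split(",")
print(len(naive))  # 8

# csv.reader honors the quoting rules and keeps the date as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['', 'Illinois', '17', 'Jan 01, 1979', '1979/01/01', '17.48', '994']
```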

In [34]:
print(open('temp_data_01.csv').read())

"Notes","State","State Code","Month Day, Year","Month Day, Year Code",Avg Daily Max Air Temperature (F),Record Count for Daily Max Air Temp (F),Min Temp for Daily Max Air Temp (F),Max Temp for Daily Max Air Temp (F),Avg Daily Min Air Temperature (F),Record Count for Daily Min Air Temp (F),Min Temp for Daily Min Air Temp (F),Max Temp for Daily Min Air Temp (F),Avg Daily Max Heat Index (F),Record Count for Daily Max Heat Index (F),Min for Daily Max Heat Index (F),Max for Daily Max Heat Index (F),Daily Max Heat Index (F) % Coverage
,"Illinois","17","Jan 01, 1979","1979/01/01",17.48,994,6.00,30.50,2.89,994,-13.60,15.80,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 02, 1979","1979/01/02",4.64,994,-6.40,15.80,-9.03,994,-23.60,6.60,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 03, 1979","1979/01/03",11.05,994,-0.70,24.70,-2.17,994,-18.30,12.90,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 04, 1979","1979/01/04",9.51,994,0.20,27.60,-0.43,994,-16.30,16.30,Missing,0,Missi

In [37]:
results2 = [fields for fields in csv.reader(open('temp_data_01.csv'))]
results2 # commas inside double quotes are intact

[['Notes',
  'State',
  'State Code',
  'Month Day, Year',
  'Month Day, Year Code',
  'Avg Daily Max Air Temperature (F)',
  'Record Count for Daily Max Air Temp (F)',
  'Min Temp for Daily Max Air Temp (F)',
  'Max Temp for Daily Max Air Temp (F)',
  'Avg Daily Min Air Temperature (F)',
  'Record Count for Daily Min Air Temp (F)',
  'Min Temp for Daily Min Air Temp (F)',
  'Max Temp for Daily Min Air Temp (F)',
  'Avg Daily Max Heat Index (F)',
  'Record Count for Daily Max Heat Index (F)',
  'Min for Daily Max Heat Index (F)',
  'Max for Daily Max Heat Index (F)',
  'Daily Max Heat Index (F) % Coverage'],
 ['',
  'Illinois',
  '17',
  'Jan 01, 1979',
  '1979/01/01',
  '17.48',
  '994',
  '6.00',
  '30.50',
  '2.89',
  '994',
  '-13.60',
  '15.80',
  'Missing',
  '0',
  'Missing',
  'Missing',
  '0.00%'],
 ['',
  'Illinois',
  '17',
  'Jan 02, 1979',
  '1979/01/02',
  '4.64',
  '994',
  '-6.40',
  '15.80',
  '-9.03',
  '994',
  '-23.60',
  '6.60',
  'Missing',
  '0',
  'Missing',
  '

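#### Without the csv module

Parsing the same file by hand is fragile: in the output below, the header row loses most of its columns and the empty first field is merged into the `State` value (`',Illinois'`).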

In [51]:
for line in open('temp_data_01.csv'):
    line = line.strip()
    fields = line.split('",')
    fields = fields[:4] + fields[4].split(',')
    fields = [field.replace('"', "") for field in fields]
    print(fields)

['Notes', 'State', 'State Code', 'Month Day, Year', 'Month Day', ' Year Code']
[',Illinois', '17', 'Jan 01, 1979', '1979/01/01', '17.48', '994', '6.00', '30.50', '2.89', '994', '-13.60', '15.80', 'Missing', '0', 'Missing', 'Missing', '0.00%']
[',Illinois', '17', 'Jan 02, 1979', '1979/01/02', '4.64', '994', '-6.40', '15.80', '-9.03', '994', '-23.60', '6.60', 'Missing', '0', 'Missing', 'Missing', '0.00%']
[',Illinois', '17', 'Jan 03, 1979', '1979/01/03', '11.05', '994', '-0.70', '24.70', '-2.17', '994', '-18.30', '12.90', 'Missing', '0', 'Missing', 'Missing', '0.00%']
[',Illinois', '17', 'Jan 04, 1979', '1979/01/04', '9.51', '994', '0.20', '27.60', '-0.43', '994', '-16.30', '16.30', 'Missing', '0', 'Missing', 'Missing', '0.00%']
[',Illinois', '17', 'May 15, 1979', '1979/05/15', '68.42', '994', '61.00', '75.10', '51.30', '994', '43.30', '57.00', 'Missing', '0', 'Missing', 'Missing', '0.00%']
[',Illinois', '17', 'May 16, 1979', '1979/05/16', '70.29', '994', '63.40', '73.50', '48.09', '994'

### 21.2.5 Reading a CSV file as a list of dictionaries
* The result is a list of rows, where each row is a dictionary keyed by column name.
* `csv.DictReader` returned an `OrderedDict` for each row in Python 3.6 and 3.7; since Python 3.8 it returns a regular `dict`. Either way, the fields stay in their original order.
* For large data sets, `DictReader` takes on the order of twice as long as a plain `csv.reader` to read the same amount of data.
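
The speed difference can be checked with a rough `timeit` comparison. This is only a sketch: the two-column data set is synthetic, the helper names `read_lists` and `read_dicts` are hypothetical, and the actual ratio will vary with the Python version and the shape of the data.

```python
import csv
import io
import timeit

# Synthetic data: one header line plus 10,000 identical rows.
data = "State,Temp\n" + "Illinois,17.48\n" * 10_000

def read_lists():
    """Read every row as a list of strings."""
    return list(csv.reader(io.StringIO(data)))

def read_dicts():
    """Read every data row as a dictionary keyed by column name."""
    return list(csv.DictReader(io.StringIO(data)))

print("csv.reader    :", timeit.timeit(read_lists, number=5))
print("csv.DictReader:", timeit.timeit(read_dicts, number=5))
```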

In [55]:
results = [fields for fields in csv.DictReader(open('temp_data_01.csv'))]
results[0]

{'Notes': '',
 'State': 'Illinois',
 'State Code': '17',
 'Month Day, Year': 'Jan 01, 1979',
 'Month Day, Year Code': '1979/01/01',
 'Avg Daily Max Air Temperature (F)': '17.48',
 'Record Count for Daily Max Air Temp (F)': '994',
 'Min Temp for Daily Max Air Temp (F)': '6.00',
 'Max Temp for Daily Max Air Temp (F)': '30.50',
 'Avg Daily Min Air Temperature (F)': '2.89',
 'Record Count for Daily Min Air Temp (F)': '994',
 'Min Temp for Daily Min Air Temp (F)': '-13.60',
 'Max Temp for Daily Min Air Temp (F)': '15.80',
 'Avg Daily Max Heat Index (F)': 'Missing',
 'Record Count for Daily Max Heat Index (F)': '0',
 'Min for Daily Max Heat Index (F)': 'Missing',
 'Max for Daily Max Heat Index (F)': 'Missing',
 'Daily Max Heat Index (F) % Coverage': '0.00%'}

In [59]:
results[0]['State']

'Illinois'

## 21.3 Excel Files

In [62]:
from openpyxl import load_workbook

wb = load_workbook('temp_data_01.xlsx')
ws = wb.worksheets[0]

results = []
for row in ws.iter_rows():
    results.append([cell.value for cell in row])
print(results)    

[['Notes', 'State', 'State Code', 'Month Day, Year', 'Month Day, Year Code', 'Avg Daily Max Air Temperature (F)', 'Record Count for Daily Max Air Temp (F)', 'Min Temp for Daily Max Air Temp (F)', 'Max Temp for Daily Max Air Temp (F)', 'Avg Daily Max Heat Index (F)', 'Record Count for Daily Max Heat Index (F)', 'Min for Daily Max Heat Index (F)', 'Max for Daily Max Heat Index (F)', 'Daily Max Heat Index (F) % Coverage'], [None, 'Illinois', 17, 'Jan 01, 1979', '1979/01/01', 17.48, 994, 6, 30.5, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'Jan 02, 1979', '1979/01/02', 4.64, 994, -6.4, 15.8, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'Jan 03, 1979', '1979/01/03', 11.05, 994, -0.7, 24.7, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'Jan 04, 1979', '1979/01/04', 9.51, 994, 0.2, 27.6, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'May 15, 1979', '1979/05/15', 68.42, 994, 61, 75.1, 'Missing', 0, '

#### Challenges

* Spreadsheets often guess data types automatically: most will interpret `1E20` as the number 1.00E+20, even when it was meant as a string.
* Formatting and macros can carry significant meaning but are hard to process, so they must be handled with care.
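
The type-guessing problem is easy to reproduce in Python. The helper `guess_type` below is a hypothetical imitation of that kind of guess, not how any particular spreadsheet works: anything that parses as a number is treated as one.

```python
def guess_type(text):
    """Return a float if the text parses as a number, else the text itself."""
    try:
        return float(text)
    except ValueError:
        return text

# An identifier such as "1E20" is silently turned into a number.
print(guess_type("1E20"))  # 1e+20
print(guess_type("A17"))   # A17
```

If such a value is really an identifier, it has to be kept as a string explicitly rather than left to automatic conversion.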

It's best to use CSV whenever possible; a spreadsheet can usually be saved as a CSV file and processed from there.

## 21.4 Data cleaning

The process of dealing with problems such as null values, illegal values, and extra whitespace is called *data cleaning*.
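
As a minimal sketch of what such cleaning might look like for the temperature data above, the hypothetical helper `clean_field` strips whitespace, maps empty and `'Missing'` values to `None`, and converts numbers and percentages to floats. The conversion rules are assumptions chosen for this example, not a general recipe.

```python
def clean_field(value):
    """Normalize one raw string field from the temperature data."""
    value = value.strip()
    # Treat empty strings and the 'Missing' marker as null values.
    if value in ("", "Missing"):
        return None
    try:
        # Percentages such as '0.00%' become fractions.
        if value.endswith("%"):
            return float(value[:-1]) / 100
        return float(value)
    except ValueError:
        # Anything that is not numeric stays a string.
        return value

row = ['', 'Illinois', ' 17 ', 'Jan 01, 1979', '17.48', 'Missing', '0.00%']
print([clean_field(field) for field in row])
# [None, 'Illinois', 17.0, 'Jan 01, 1979', 17.48, None, 0.0]
```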

### 21.4.1 Cleaning

### 21.4.2 Sorting

### 21.4.3 Data cleaning issues and pitfalls

## 21.5 Writing data files

### 21.5.1 CSV and other delimited files

### 21.5.2 Writing Excel files

### 21.5.3 Packaging data files

## Lab 21: Weather Observations