# Week 9
# Data Loading and Storage

Accessing data is a necessary first step for most data science projects. In this chapter we will learn about:
- Reading and writing data in text format (.txt, .csv, .json)
- Reading data from webpages (web scraping)
- Reading and writing data in binary format (.pickle, .feather, .h5)
- Interacting with databases

Reading:
- Textbook, Chapter 6

## I. Reading and Writing Data in Text Format

### 1. csv file

In [None]:
# Let's create a data frame first
# Note: a NumPy array holds a single dtype, so these mixed values are all stored as strings
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95, 'A'],
    [55, 60, 45, 'F'],
    [70, 75, 90, 'A'],
    [75, 70, 60, 'D'],
    [60, 73, 75, 'C'],
    [72, 63, -1, 'NA']
])
df = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final', 'LetterGrade'],
                   index=['Alex', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
df

In [None]:
# Write to a csv file using .to_csv()
import os
print('Does path "Data/temp/" exist?', os.path.exists("Data/temp/"))

if not os.path.exists("Data/temp"):
    # makedirs also creates the parent "Data" directory if it is missing
    os.makedirs("Data/temp")
    print('Directory "Data/temp" created.')

df.to_csv("Data/temp/grades.csv")

In [None]:
# Load the csv file
df2 = pd.read_csv("Data/temp/grades.csv", sep=",")
df2

In [None]:
# Load only the first 3 rows
df3 = pd.read_csv("Data/temp/grades.csv", nrows=3)
df3

In [None]:
# Load the file, skipping lines 2 and 4 (0-based line numbers; line 0 is the header)
df4 = pd.read_csv("Data/temp/grades.csv", skiprows=[2, 4])
df4
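
Since `skiprows` counts lines of the file rather than rows of the data frame, it is easy to skip the wrong record. The sketch below illustrates the numbering on a small in-memory CSV; the sample text and the name `df_skip` are made up for this illustration and are not part of the grades file.

```python
import io
import pandas as pd

# Hypothetical sample data; line 0 is the header, lines 1-4 are records
csv_text = "name,score\nAlex,100\nBob,55\nChris,70\nDoug,75\n"

# Skip file lines 2 (Bob) and 4 (Doug); the header on line 0 is kept
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=[2, 4])
print(df_skip)
```

Only the records for Alex and Chris should remain, because the header occupies line 0 of the file.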

In [None]:
# Skip the header row of the csv file and supply our own column names
# df5 = pd.read_csv("Data/temp/grades.csv", header=None, names=['Name', 'Midterm', 'Project', 'Final', 'LetterGrade'])
df5 = pd.read_csv("Data/temp/grades.csv", names=[1, 2, 3, 4, 5], skiprows=[0])
df5

In [None]:
# Set first column as index
df6 = pd.read_csv("Data/temp/grades.csv", index_col=0)
df6

In [None]:
# Treat -1 and 63 as missing values (NaN)
df7 = pd.read_csv("Data/temp/grades.csv", na_values=[-1, 63])
df7

### 2. Load txt file with values separated by spaces

In [None]:
with open("Data/temp/values.txt", 'w') as file:
    file.write("Index Category     Value\n")
    file.write("1            A      2.92\n")
    file.write("2            B     12.14\n")
    file.write("3            C    123.56\n")

In [None]:
# read_csv() still applies, but a single-space delimiter fails (or misaligns the columns)
# because the fields are separated by runs of several spaces
df = pd.read_csv("Data/temp/values.txt", sep=" ")
df

In [None]:
# Use a regular expression that matches one or more whitespace characters
df = pd.read_csv("Data/temp/values.txt", sep=r"\s+")
df

### 3. Load JSON files

**JavaScript Object Notation (JSON)** is a popular file format for storing unstructured data because it is easy for both humans and computers to read.
- Its structure is very similar to a Python dictionary
- Parse a JSON string with json.loads(), or read a JSON file with json.load()
- Convert an object to a JSON string with json.dumps(), or write it to a JSON file with json.dump()

In [None]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [None]:
import json
result = json.loads(obj)
result

In [None]:
# A JSON object is represented as a python dictionary
?result
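
To make the correspondence between JSON and Python types concrete, here is a small sketch. The JSON string and the names `sample` and `type_names` are hypothetical and used only for illustration.

```python
import json

# Hypothetical JSON text covering the common value types
sample = json.loads(
    '{"name": "Wes", "pet": null, "places_lived": ["Spain"], "age": 30, "active": true}'
)

# Map each key to the name of the Python type it was parsed into
type_names = {key: type(value).__name__ for key, value in sample.items()}
print(type_names)
```

A JSON object becomes a `dict`, an array becomes a `list`, `null` becomes `None`, and `true`/`false` become `True`/`False`.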

In [None]:
asjson = json.dumps(result) # Convert back to string

In [None]:
asjson

In [None]:
# Use json.dump(object, file) to write the content to file
with open("Data/temp/People.json", 'w') as file:
    json.dump(result, file)
    
# The with statement is equivalent to the following:
# file = open("Data/temp/People.json", 'w')
# json.dump(result, file)
# file.close()

In [None]:
# Load from People.json
with open("Data/temp/People.json", "r") as file:
    people = json.load(file)
people

In [None]:
# Load the content as a data frame
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age', 'pets'])
siblings

## II. Web Scraping
When performing data science tasks, it's common to want to use data found on the internet. You'll usually be able to access the data in csv format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you'll want to use a technique called **web scraping** to get the data from the web page into a format you can work with in your analysis.

In [None]:
# Download a webpage
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page  # a 2xx status code (e.g. 200) usually means a successful download

In [None]:
# Show what is downloaded
print(page.content)

We will use the **Beautiful Soup** library to extract useful information from the HTML document.

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

In [None]:
# using the children attribute to select all the top-level tags
list(soup.children)

In [None]:
# type of each child
print([type(item) for item in list(soup.children)])

In [None]:
# select the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]
print(html)

In [None]:
print('\n'.join([str(idx) + ':\n' + str(item) \
                 for idx, item in enumerate(list(html.children))]))

In [None]:
len(list(html.children))

In [None]:
print([type(item) for item in list(html.children)])

In [None]:
body = list(html.children)[3]
print(body)

In [None]:
print(list(body.children))

In [None]:
p = list(body.children)[1]
print(p)

In [None]:
p.get_text()

In [None]:
# Exercise: find the name "Brian J. Murphy" on Dr. Murphy's website.
page = requests.get("http://comet.lehman.cuny.edu/bmurphy/")
soup2 = BeautifulSoup(page.content, 'html.parser')
print(soup2.prettify())

In [None]:
level1_children = list(soup2.children)
print(len(level1_children))

In [None]:
level2_children = list(level1_children[0])
print(len(level2_children))
print(level2_children[3])

In [None]:
level3_children = list(level2_children[3])
print(len(level3_children))
print(level3_children[5])

In [None]:
level4_children = list(level3_children[13])
print(len(level4_children))
print(level4_children[1])

In [None]:
name = level4_children[1]
print(name.get_text())

In [None]:
# Ex: Extract all the button labels

index_list = [3, 5, 7, 9, 11]
for i in index_list:
    button = list(list(list(soup2.children)[0].children)[3])[i]
    print(button['value'])

#### Finding all instances of a tag at once

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('input')

In [None]:
all_buttons = soup.find_all('input')
for button in all_buttons:
    print(button['value'])

In [None]:
# Find the first instance of a tag
soup.find('input')

#### Searching for tags by class and id

In [None]:
# Let's look at another webpage with classes and id's
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

In [None]:
# Find all tags of a class
soup.find_all(class_="first-item")

In [None]:
soup.find_all(id="first")

In [None]:
soup.find_all('p')
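
The pages above are fetched over the network and may change or become unavailable, so it can help to try the same searches offline. The following is a minimal sketch on a hand-written HTML string; the snippet, its class and id values, and the names `demo_soup`, `by_class`, `by_id` and `by_css` are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modelled on a page with classes and ids
html_doc = """
<html><body>
  <p class="first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
  <p class="inner-text first-item" id="second"><b>Third paragraph.</b></p>
</body></html>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

by_class = demo_soup.find_all(class_="first-item")  # every tag carrying this class
by_id = demo_soup.find(id="second")                 # first tag with this id
by_css = demo_soup.select("p.inner-text")           # CSS selector: <p> tags with this class

print([tag.get_text() for tag in by_class])
print(by_id.get_text())
print(len(by_css))
```

Note that `class_` matches a tag if any one of its classes equals the given value, so the third paragraph is found by both the class search and the CSS selector.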

#### Downloading the weather data
1. Open the [weather forecast page](https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE)
2. Display the source code (On Chrome use "Developer Tools")
3. Identify the item containing data (On Chrome right click the values and select "Inspect")

In [None]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
# print(len(seven_day))
# print(seven_day)
forecast_items = seven_day.find_all(class_="tombstone-container")
# print(len(forecast_items))
# print(forecast_items)
tonight = forecast_items[0]
print(tonight.prettify())

In [None]:
# Find today's weather

items = soup.find_all(class_="myforecast-current-lrg")
items[0].get_text()

In [None]:
names = soup.find_all(class_="period-name")
# This statement creates a list of tags holding the forecast period names

In [None]:
# Convert the list of tag objects to a list of strings (loop version, then list comprehension)
days = []
for obj in names:
#     print(obj.get_text())
    days.append(obj.get_text())
print(days)

days = [obj.get_text() for obj in names]
print(days)

In [None]:
temps = soup.find_all(class_="temp")
# Retrieve all the temperature data

In [None]:
# Extract the text from each temperature object
temperatures = [obj.get_text() for obj in temps]
print(temperatures)

In [None]:
import numpy as np
data = np.array([days, temperatures]).T
# transpose the array so that each list becomes a column

In [None]:
import pandas as pd
# Create a data frame with the days and the temperatures
df = pd.DataFrame(data, columns=["Day", "Temperature"])
df

In [None]:
# Find weather forecast for the week
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

In [None]:
# Find the short descriptions for the week
short_desc_tags = seven_day.select(".tombstone-container .short-desc")
short_descs = [obj.get_text() for obj in short_desc_tags]
print(short_descs)

In [None]:
# Find the long descriptions, stored in the title attribute of each forecast icon
long_desc_tags = seven_day.select(".tombstone-container .forecast-icon")
descs = [obj['title'] for obj in long_desc_tags]
print(descs)

In [None]:
temp_tags = seven_day.select(".tombstone-container .temp")
temps = [obj.get_text() for obj in temp_tags]

In [None]:
# Load the weather data as a data frame
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

In [None]:
# extract numeric temperature
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
weather["temp_num"] = temp_nums.astype('int')
weather

In [None]:
# Distinguish night temperatures (labelled "Low") from day temperatures
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

In [None]:
weather[is_night]
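
The forecast page changes over time, so the extraction steps can also be checked on fixed text. This is a sketch under the assumption that temperature labels look like "Low: 45 °F" and "High: 60 °F"; the sample strings and the names `sample_temps`, `sample_nums` and `sample_is_night` are made up for illustration.

```python
import pandas as pd

# Hypothetical temperature labels in the assumed "Low/High: NN °F" style
sample_temps = pd.Series(["Low: 45 °F", "High: 60 °F", "Low: 48 °F"])

# The named group captures the first run of digits in each string
sample_nums = sample_temps.str.extract(r"(?P<temp_num>\d+)", expand=False).astype(int)

# Boolean mask marking the night-time entries
sample_is_night = sample_temps.str.contains("Low")

print(sample_nums.tolist())
print(sample_temps[sample_is_night].tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a data frame, which is why it can be converted with `astype` directly.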

In [None]:
# Get news headlines from the New York Times?
# Get current stock prices?
# Monitor alarms?
# Download files?

## III. Binary File Formats

### 1. pickle
The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure. The format is specific to Python, so only Python can properly read and write pickle files.

In [None]:
# Let's create a data frame first
# Note: a NumPy array holds a single dtype, so these mixed values are all stored as strings
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95, 'A'],
    [55, 60, 45, 'F'],
    [70, 75, 90, 'A'],
    [75, 70, 60, 'D'],
    [60, 73, 75, 'C'],
    [72, 63, -1, 'NA']
])
df = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final', 'LetterGrade'],
                   index=['Alex', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
df

In [None]:
# Save as a .pickle file
df.to_pickle("data.pickle")

In [None]:
# Load the pickle file
df_pickle = pd.read_pickle("data.pickle")
df_pickle

In [None]:
# A pickle file can contain multiple objects.
import pickle
a = 5
b = ['a', 'b', 'c']
with open('temp.pickle', 'wb') as file:
    pickle.dump(a, file)
    pickle.dump(b, file)
    pickle.dump(df_pickle, file)

In [None]:
with open('temp.pickle', 'rb') as file:
    a = pickle.load(file)
    b = pickle.load(file)
    df_pickle = pickle.load(file)
    
print(a)
print(b)
df_pickle.head()
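
Serialization does not have to go through a file. As a small sketch, `pickle.dumps()` and `pickle.loads()` perform the same conversion in memory; the dictionary and the names `record`, `payload` and `restored` are hypothetical.

```python
import pickle

# Hypothetical Python object to serialize
record = {"name": "Alex", "scores": [100, 80, 95], "grade": "A"}

payload = pickle.dumps(record)    # serialize to a bytes object
restored = pickle.loads(payload)  # rebuild an equal, independent object

print(type(payload).__name__)
print(restored == record)
```

The restored object compares equal to the original but is a separate copy, which is the same behavior seen when loading from a `.pickle` file.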

### 2. HDF5
"HDF" stands for "hierarchical data format". HDF5 can be a good choice for working with very large datasets that don't fit into memory, as you can efficiently read and write small sections of large arrays.

In [None]:
df = pd.DataFrame({
    'Col1': np.random.randn(100),
    'Col2': np.random.randn(100)
})
df.head(5)

In [None]:
# The PyTables package (installed as "tables") may need to be updated
!pip3 install --upgrade tables

In [None]:
# format='table' stores the data in a queryable layout, which read_hdf's `where` requires
df.to_hdf('data.h5', key='obj1', format='table')

In [None]:
# Read back only the rows whose index is less than 3
df_hdf5 = pd.read_hdf('data.h5', key='obj1', where=['index < 3'])
df_hdf5

## IV. Interacting with Databases
In a business setting, most data may not be stored in text or binary files. SQL-based relational databases (such as MySQL) are in wide use.

Python's standard library includes the sqlite3 package for working with SQLite databases, and pandas has some functions to simplify the process.

In [None]:
# Create a SQLite database
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('data.sqlite')
con.execute(query)
con.commit()

In [None]:
# query = """
# DROP TABLE test
# """
# con.execute(query)
# con.commit()

In [None]:
# Insert a few rows of data
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [None]:
# Select data
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

In [None]:
# Retrieve column names
cursor.description

In [None]:
# Create a pandas data frame
columns = [x[0] for x in cursor.description]
df = pd.DataFrame(rows, columns=columns)
df
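
The fetch-then-convert steps above can also be done in a single call with `pd.read_sql_query()`. The sketch below uses a separate in-memory SQLite database so it does not touch `data.sqlite`; the names `mem_con` and `df_sql` and the filter on column `d` are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database, so nothing is written to disk
mem_con = sqlite3.connect(":memory:")
mem_con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
mem_con.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [('Atlanta', 'Georgia', 1.25, 6),
     ('Tallahassee', 'Florida', 2.6, 3),
     ('Sacramento', 'California', 1.7, 5)],
)
mem_con.commit()

# Run the query and build the data frame in one step; column names come from the query
df_sql = pd.read_sql_query("SELECT * FROM test WHERE d > ?", mem_con, params=(4,))
mem_con.close()
print(df_sql)
```

Passing values through `params` rather than formatting them into the SQL string keeps the query safe from malformed input.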