# Week 9
# Data Loading and Storage

Accessing data is a necessary first step for most data science projects. In this chapter we will cover:
- Reading and writing data in text format (.txt, .csv, .json)
- Reading data from web pages (web scraping)
- Reading and writing data in binary format (.pickle, .feather, .h5)
- Interacting with databases

Reading:
- Textbook, Chapter 6

## I. Reading and Writing Data in Text Format

### 1. csv file

In [1]:
# Let's create a data frame first.
# Note: mixing numbers and strings in one NumPy array stores every value as a string.
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95, 'A'],
    [55, 60, 45, 'F'],
    [70, 75, 90, 'A'],
    [75, 70, 60, 'D'],
    [60, 73, 75, 'C'],
    [72, 63, -1, 'NA']
])
df = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final', 'LetterGrade'],
                   index=['Alex', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
df

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
Frank,72,63,-1,


In [2]:
# This statement removes a folder completely.
# import shutil
# shutil.rmtree("Data/temp/")

In [3]:
# Write to a csv file using .to_csv()
import os
print('Does path "Data/temp/" exist?', os.path.exists("Data/temp/"))

# os.makedirs() also creates the parent "Data" folder if it is missing
if not os.path.exists("Data/temp"):
    os.makedirs("Data/temp")
    print('File path "Data/temp" created.')

df.to_csv("Data/temp/grades.csv")

Does path "Data/temp/" exist? True


In [4]:
# Load the csv file
df2 = pd.read_csv("Data/temp/grades.csv", index_col=0)
df2

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
Frank,72,63,-1,


In [5]:
# Load only the first 3 rows
df3 = pd.read_csv("Data/temp/grades.csv", nrows=3, index_col=0)
df3

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A


In [6]:
# Load the file, skipping its first two lines (the header and Alex's row),
# so Bob's row is treated as the header
df4 = pd.read_csv("Data/temp/grades.csv", skiprows=[0, 1])
df4

Unnamed: 0,Bob,55,60,45,F
0,Chris,70,75,90,A
1,Doug,75,70,60,D
2,Eva,60,73,75,C
3,Frank,72,63,-1,


In [7]:
# Skip the header row of the csv file and supply our own column names
# df5 = pd.read_csv("Data/temp/grades.csv", header=None, names=['Name', 'Midterm', 'Project', 'Final', 'LetterGrade'])
df5 = pd.read_csv("Data/temp/grades.csv", 
                  names=['Column1', 'Column 2', 'Column 3',
                         'Column4', 'Column 5'],
                  skiprows=[0])
df5

Unnamed: 0,Column1,Column 2,Column 3,Column4,Column 5
0,Alex,100,80,95,A
1,Bob,55,60,45,F
2,Chris,70,75,90,A
3,Doug,75,70,60,D
4,Eva,60,73,75,C
5,Frank,72,63,-1,


In [8]:
# Set first column as index
df6 = pd.read_csv("Data/temp/grades.csv", index_col=0)
df6

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
Frank,72,63,-1,


In [9]:
# With -1 in the Final column, the average final exam score will 
# be incorrect.
df6['Final'].mean()

60.666666666666664

In [10]:
# Identify -1 as NaN (Not a number)
df7 = pd.read_csv("Data/temp/grades.csv", na_values=[-1])
df7

Unnamed: 0.1,Unnamed: 0,Midterm,Project,Final,LetterGrade
0,Alex,100,80,95.0,A
1,Bob,55,60,45.0,F
2,Chris,70,75,90.0,A
3,Doug,75,70,60.0,D
4,Eva,60,73,75.0,C
5,Frank,72,63,,


In [11]:
df7['Final'].mean()

73.0

### 2. Load txt file with values separated by spaces

In [12]:
with open("Data/temp/values.txt", 'w') as file:
    file.write("Index Category     Value\n")
    file.write("1            A      2.92\n")
    file.write("2            B     12.14\n")
    file.write("3            C    123.56\n")

In [13]:
# read_csv() still works here, but a single-space delimiter treats every space as a
# column break, so the padded columns are split incorrectly
df = pd.read_csv("Data/temp/values.txt", sep=' ')
df

Unnamed: 0,Unnamed: 1,Unnamed: 2.1,Unnamed: 3.1,Unnamed: 4.1,Unnamed: 5.1,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Index,Category,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Value
1,,,,,,,,,,,,A,,,,,,2.92
2,,,,,,,,,,,,B,,,,,12.14,
3,,,,,,,,,,,,C,,,,123.56,,


In [14]:
# Use a raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv("Data/temp/values.txt", sep=r"\s+")
df

Unnamed: 0,Index,Category,Value
0,1,A,2.92
1,2,B,12.14
2,3,C,123.56


### 3. Load JSON files

**JavaScript Object Notation (JSON)** is a popular file format for storing semi-structured data because it is easy for both humans and computers to read.
- Its structure is very similar to a Python dictionary
- Parse a JSON string with json.loads(); read from a JSON file with json.load()
- Convert an object to a JSON string with json.dumps(); write to a JSON file with json.dump()
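
The difference between the string functions and the file functions can be sketched as follows. This is only an illustration: `record` is a made-up object, and an `io.StringIO` buffer stands in for a file opened with `open()`.

```python
import io
import json

record = {"name": "Wes", "pet": None, "places_lived": ["United States", "Spain"]}

# dumps() and loads() work on strings.
text = json.dumps(record)
from_text = json.loads(text)

# dump() and load() work on file-like objects; StringIO stands in for an open file here.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(text)
print(from_text == record, from_file == record)
```

Note that Python's `None` is written as JSON `null` and read back as `None`.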

In [15]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [16]:
import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [17]:
# A JSON object is represented as a python dictionary
?result

In [18]:
asjson = json.dumps(result) # Convert back to string

In [19]:
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

In [20]:
# Use json.dump(object, file) to write the content to file
with open("Data/temp/People.json", 'w') as file:
    json.dump(result, file)
    
# The with block is roughly equivalent to the following
# (it also closes the file if an error occurs):
# file = open("Data/temp/People.json", 'w')
# json.dump(result, file)
# file.close()

In [21]:
# Load from People.json
with open("Data/temp/People.json", "r") as file:
    people = json.load(file)
people

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [22]:
# Load the content as a data frame
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age', 'pets'])
siblings

Unnamed: 0,name,age,pets
0,Scott,30,"[Zeus, Zuko]"
1,Katie,38,"[Sixes, Stache, Cisco]"
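
In the data frame above each `pets` cell holds a whole Python list. One possible way to get one row per pet is `DataFrame.explode`; the sketch below rebuilds the sibling records inline, and the names `siblings_data` and `pets_long` are hypothetical.

```python
import pandas as pd

# Same sibling records as in the JSON example, rebuilt here so the sketch is self-contained.
siblings_data = [
    {"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
    {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]},
]
siblings = pd.DataFrame(siblings_data)

# explode() repeats the other columns once for each element of the list column.
pets_long = siblings.explode("pets").reset_index(drop=True)
print(pets_long)
```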


## II. Web Scraping
When performing data science tasks, it's common to want to use data found on the internet. You'll usually be able to access the data in csv format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you'll want to use a technique called **web scraping** to get the data from the web page into a format you can work with in your analysis.

In [23]:
# Download a webpage
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page  # A 2xx status code (e.g. 200) usually means the download succeeded

<Response [200]>

In [24]:
# Show what is downloaded
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


We will use the **Beautiful Soup** library to extract useful information from the HTML source.

In [25]:
from bs4 import BeautifulSoup

In [26]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [27]:
# using the children attribute to select all the top-level tags
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [28]:
# type of each child
print([type(item) for item in list(soup.children)])

[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>]


In [29]:
# select the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]
print(html)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [30]:
print('\n'.join([str(idx) + ':\n' + str(item) \
                 for idx, item in enumerate(list(html.children))]))

0:


1:
<head>
<title>A simple example page</title>
</head>
2:


3:
<body>
<p>Here is some simple content for this page.</p>
</body>
4:




In [31]:
len(list(html.children))

5

In [32]:
print([type(item) for item in list(html.children)])

[<class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>, <class 'bs4.element.NavigableString'>]


In [33]:
body = list(html.children)[3]
print(body)

<body>
<p>Here is some simple content for this page.</p>
</body>


In [34]:
print(list(body.children))

['\n', <p>Here is some simple content for this page.</p>, '\n']


In [35]:
p = list(body.children)[1]
print(p)

<p>Here is some simple content for this page.</p>


In [36]:
p.get_text()

'Here is some simple content for this page.'

In [37]:
# Exercise: find the name "Brian J. Murphy" on Dr. Murphy's website.
page = requests.get("http://comet.lehman.cuny.edu/bmurphy/")
soup2 = BeautifulSoup(page.content, 'html.parser')
print(soup2.prettify())

<html>
 <head>
  <script>
   var csDept = "http://www.lehman.edu/computer-science/index.php";
var lehman = "http://www.lehman.edu";

var email = "Brian."+"Murphy"+"@"+"lehman."+"cuny."+"edu";


function popup(show, hide)
{
   hide.style.position = "absolute";

   show.style.position = "static";

   return show;
}


function goto(url)
{
   window.location.href = url;
}
  </script>
  <style>
   html 
{
   zoom  : 100%;
}

.boxed
{
   position		: relative;
   top			: 0;
   left			: 0;
   width		: 82%;
   text-align		: center;
}

.gray
{
   color		: #CCCCCC;
   text-shadow		: 4px 3px #000000;
   font-family		: Arial;
   font-size		: 42;
}
 
.black
{
   color		: black;
   text-shadow		: 3px 2px #666666;
   font-family		: Arial;
   font-size		: 30;
}

input
{
   position		: absolute;
   font-family		: Arial;
   font-size		: 28;
   width		: 17%;
   opacity		: .7;
   left			: 82%;
   box-shadow		: 4px 4px #444444
}

img
{
   positio

In [38]:
level1_children = list(soup2.children)
print(len(level1_children))

1


In [39]:
level2_children = list(level1_children[0])
print(len(level2_children))
print(level2_children[3])

5
<body>
<img src=".\images\Gillet Side.jpg"/>
<input onclick="current = popup(courses, current)" style="top:25%" type="BUTTON" value="Courses"/>
<input onclick="current = popup(office,  current)" style="top:35%" type="BUTTON" value="Office Hours"/>
<input onclick="current = popup(contact, current)" style="top:45%" type="BUTTON" value="Contact"/>
<input onclick="goto(csDept)" style="top:55%" type="BUTTON" value="CS Dept."/>
<input onclick="goto(lehman)" style="top:65%" type="BUTTON" value="Lehman"/>
<div class="boxed">
<div class="gray">
BRIAN J. MURPHY
</div>
<div class="black">
Department of Computer Science<br/>
Lehman College<br/>
The City University of New York<br/>
</div>
</div>
<table border="0" height="50%" width="83%">
<tr>
<td>
<div class="popup" id="courses" style="width: 420"><!--510 width:740)-->
<u>Courses - Fall 2023</u>:
<br/><br/>
<a href=".\CMP338\">CMP 338</a> - T/H 6:00pm-7:40pm
<br/><br/>
<a href=".\CMP428\">CMP 428</a> - T/H 7:50pm-9:30pm
<br/><br/>
<a hre

In [40]:
level3_children = list(level2_children[3])
print(len(level3_children))
print(level3_children[3])

19
<input onclick="current = popup(courses, current)" style="top:25%" type="BUTTON" value="Courses"/>


In [41]:
button1 = level3_children[3]
print(button1['value'])

Courses


In [42]:
# Ex: Extract all the button labels

index_list = [3, 5, 7, 9, 11]
for i in index_list:
    button = list(list(list(soup2.children)[0].children)[3])[i]
    print(button['value'])

Courses
Office Hours
Contact
CS Dept.
Lehman


#### Finding all instances of a tag at once

In [43]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('input')

[<input onclick="current = popup(courses, current)" style="top:25%" type="BUTTON" value="Courses"/>,
 <input onclick="current = popup(office,  current)" style="top:35%" type="BUTTON" value="Office Hours"/>,
 <input onclick="current = popup(contact, current)" style="top:45%" type="BUTTON" value="Contact"/>,
 <input onclick="goto(csDept)" style="top:55%" type="BUTTON" value="CS Dept."/>,
 <input onclick="goto(lehman)" style="top:65%" type="BUTTON" value="Lehman"/>]

In [44]:
all_buttons = soup.find_all('input')
for button in all_buttons:
    print(button['value'])

Courses
Office Hours
Contact
CS Dept.
Lehman


In [45]:
# Find the first instance of a tag
soup.find('input')

<input onclick="current = popup(courses, current)" style="top:25%" type="BUTTON" value="Courses"/>

#### Searching for tags by class and id

In [46]:
# Let's look at another webpage with classes and id's
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [47]:
# Find all tags of a class
soup.find_all(class_="first-item")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [48]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [49]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
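
The tag, class, and id searches above can also be written as CSS selectors with `select()` and `select_one()`. The sketch below is an illustration under stated assumptions: `html_doc` is a trimmed, hand-written copy of the example page so that no network access is needed, and `soup_demo` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written copy of the example page (an assumption, not the downloaded content).
html_doc = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

# "div p" matches <p> tags nested anywhere inside a <div>.
inner = [p.get_text(strip=True) for p in soup_demo.select("div p")]

# "p.outer-text" matches <p> tags that carry the outer-text class.
outer = [p.get_text(strip=True) for p in soup_demo.select("p.outer-text")]

# "#first" matches the element whose id is "first"; select_one() returns one tag or None.
first = soup_demo.select_one("#first").get_text(strip=True)

print(inner)
print(outer)
print(first)
```

Selectors are convenient when the target is defined by its position inside another element, which `find_all(class_=...)` alone cannot express.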

#### Downloading the weather data
1. Open the [weather forecast page](https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE)
2. Display the source code (On Chrome use "Developer Tools")
3. Identify the item containing data (On Chrome right click the values and select "Inspect")

In [50]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="//www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>
   <

In [51]:
# Find today's weather

items = soup.find_all(class_="myforecast-current-lrg")
items[0].get_text()

'44°F'

In [62]:
names = soup.find_all(class_="period-name") 
# find_all() returns the tags that hold the forecast period names
for name in names:
    print(name.get_text())

Today
Tonight
Thursday
ThursdayNight
Friday
FridayNight
Saturday
SaturdayNight
Sunday


In [63]:
temperatures = soup.find_all(class_="temp")
for obj in temperatures:
    print(obj.get_text())

High: 49 °F
Low: 38 °F
High: 49 °F
Low: 41 °F
High: 56 °F
Low: 48 °F
High: 59 °F
Low: 49 °F
High: 61 °F


In [67]:
# Create a data frame with these data
name_list = []
for name in names:
    name_list.append(name.get_text())
print(name_list)

temperature_list = []
for temperature in temperatures:
    temperature_list.append(temperature.get_text())
print(temperature_list)

data = [name_list, temperature_list]
data = np.array(data).T

df = pd.DataFrame(data, columns=["Period", "Temperature"])

df

['Today', 'Tonight', 'Thursday', 'ThursdayNight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday']
['High: 49 °F', 'Low: 38 °F', 'High: 49 °F', 'Low: 41 °F', 'High: 56 °F', 'Low: 48 °F', 'High: 59 °F', 'Low: 49 °F', 'High: 61 °F']


Unnamed: 0,Period,Temperature
0,Today,High: 49 °F
1,Tonight,Low: 38 °F
2,Thursday,High: 49 °F
3,ThursdayNight,Low: 41 °F
4,Friday,High: 56 °F
5,FridayNight,Low: 48 °F
6,Saturday,High: 59 °F
7,SaturdayNight,Low: 49 °F
8,Sunday,High: 61 °F


In [68]:
# Exercise: Extract the temperature values
def extract_temp(string):
    
    index1 = string.index(":")
    index2 = string.index("°")
    value = string[(index1+1):index2]
    value = int(value)
    return value

In [70]:
test = "High: 56 °F"
type(extract_temp(test))

int

In [71]:
df['Value'] = df['Temperature'].apply(extract_temp)
df

Unnamed: 0,Period,Temperature,Value
0,Today,High: 49 °F,49
1,Tonight,Low: 38 °F,38
2,Thursday,High: 49 °F,49
3,ThursdayNight,Low: 41 °F,41
4,Friday,High: 56 °F,56
5,FridayNight,Low: 48 °F,48
6,Saturday,High: 59 °F,59
7,SaturdayNight,Low: 49 °F,49
8,Sunday,High: 61 °F,61


In [74]:
# Approach 2: Use the space to isolate the value
def extract_temp2(string):
#     strings = string.split(' ')
#     value = strings[1]
#     value = int(value)
#     return value

    return int(string.split(' ')[1])

# test = "High: 56 °F"
# type(extract_temp2(test))
# extract_temp2(test)

df['Value'] = df['Temperature'].apply(extract_temp2)
df

Unnamed: 0,Period,Temperature,Value
0,Today,High: 49 °F,49
1,Tonight,Low: 38 °F,38
2,Thursday,High: 49 °F,49
3,ThursdayNight,Low: 41 °F,41
4,Friday,High: 56 °F,56
5,FridayNight,Low: 48 °F,48
6,Saturday,High: 59 °F,59
7,SaturdayNight,Low: 49 °F,49
8,Sunday,High: 61 °F,61


In [75]:
# Approach 3: Use lambda expression to define the function
df['Value'] = df['Temperature'].apply(lambda string: int(string.split(' ')[1]))
df

Unnamed: 0,Period,Temperature,Value
0,Today,High: 49 °F,49
1,Tonight,Low: 38 °F,38
2,Thursday,High: 49 °F,49
3,ThursdayNight,Low: 41 °F,41
4,Friday,High: 56 °F,56
5,FridayNight,Low: 48 °F,48
6,Saturday,High: 59 °F,59
7,SaturdayNight,Low: 49 °F,49
8,Sunday,High: 61 °F,61
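
A fourth option is a regular expression, which does not depend on the exact position of the colon or the spaces. This is a sketch only: `extract_temp_re` is a hypothetical helper, and the labels with a negative value and with no number are made-up test inputs rather than scraped data.

```python
import re

def extract_temp_re(label):
    """Return the first integer found in a forecast label, or None if there is none."""
    # -? allows an optional minus sign, \d+ matches one or more digits.
    match = re.search(r"-?\d+", label)
    return int(match.group()) if match else None

print(extract_temp_re("High: 56 °F"))
print(extract_temp_re("Low: -3 °F"))
print(extract_temp_re("No data"))
```

Returning `None` for labels without a number lets the caller decide how to treat missing values instead of raising an error in the middle of `apply()`.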


**Example 2:** Create a more comprehensive weather forecast

In [78]:
# Find weather forecast for the week
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE")
seven_day = BeautifulSoup(page.content, 'html.parser')
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday']

In [80]:
# Find the short descriptions for the week (the long descriptions follow in the next cell)
short_desc_tags = seven_day.select(".tombstone-container .short-desc")
short_descs = [obj.get_text() for obj in short_desc_tags]
print(short_descs)

['DecreasingClouds', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny']


In [82]:
long_desc_tags = seven_day.select(".tombstone-container .forecast-icon")
descs = [obj['title'] for obj in long_desc_tags]
print(descs)

['Today: Cloudy, then gradually becoming mostly sunny, with a high near 49. North wind 11 to 13 mph. ', 'Tonight: Mostly clear, with a low around 38. Northwest wind 9 to 13 mph. ', 'Thursday: Sunny, with a high near 49. West wind 5 to 7 mph. ', 'Thursday Night: Mostly clear, with a low around 41. Southwest wind around 7 mph. ', 'Friday: Sunny, with a high near 56. Southwest wind 6 to 10 mph. ', 'Friday Night: Partly cloudy, with a low around 48.', 'Saturday: Mostly sunny, with a high near 59.', 'Saturday Night: Partly cloudy, with a low around 49.', 'Sunday: Mostly sunny, with a high near 61.']


In [83]:
temp_tags = seven_day.select(".tombstone-container .temp")
temps = [obj.get_text() for obj in temp_tags]

In [84]:
# Load the weather data as a data frame
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

Unnamed: 0,period,short_desc,temp,desc
0,Today,DecreasingClouds,High: 49 °F,"Today: Cloudy, then gradually becoming mostly ..."
1,Tonight,Mostly Clear,Low: 38 °F,"Tonight: Mostly clear, with a low around 38. N..."
2,Thursday,Sunny,High: 49 °F,"Thursday: Sunny, with a high near 49. West win..."
3,ThursdayNight,Mostly Clear,Low: 41 °F,"Thursday Night: Mostly clear, with a low aroun..."
4,Friday,Sunny,High: 56 °F,"Friday: Sunny, with a high near 56. Southwest ..."
5,FridayNight,Partly Cloudy,Low: 48 °F,"Friday Night: Partly cloudy, with a low around..."
6,Saturday,Mostly Sunny,High: 59 °F,"Saturday: Mostly sunny, with a high near 59."
7,SaturdayNight,Partly Cloudy,Low: 49 °F,"Saturday Night: Partly cloudy, with a low arou..."
8,Sunday,Mostly Sunny,High: 61 °F,"Sunday: Mostly sunny, with a high near 61."


In [86]:
# extract numeric temperature
# Use a raw string so that \d is passed to the regex engine unchanged
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
weather["temp_num"] = temp_nums.astype('int')
weather

Unnamed: 0,period,short_desc,temp,desc,temp_num
0,Today,DecreasingClouds,High: 49 °F,"Today: Cloudy, then gradually becoming mostly ...",49
1,Tonight,Mostly Clear,Low: 38 °F,"Tonight: Mostly clear, with a low around 38. N...",38
2,Thursday,Sunny,High: 49 °F,"Thursday: Sunny, with a high near 49. West win...",49
3,ThursdayNight,Mostly Clear,Low: 41 °F,"Thursday Night: Mostly clear, with a low aroun...",41
4,Friday,Sunny,High: 56 °F,"Friday: Sunny, with a high near 56. Southwest ...",56
5,FridayNight,Partly Cloudy,Low: 48 °F,"Friday Night: Partly cloudy, with a low around...",48
6,Saturday,Mostly Sunny,High: 59 °F,"Saturday: Mostly sunny, with a high near 59.",59
7,SaturdayNight,Partly Cloudy,Low: 49 °F,"Saturday Night: Partly cloudy, with a low arou...",49
8,Sunday,Mostly Sunny,High: 61 °F,"Sunday: Mostly sunny, with a high near 61.",61


In [87]:
# Distinguish night-time lows from daytime highs
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [88]:
weather[is_night]

Unnamed: 0,period,short_desc,temp,desc,temp_num,is_night
1,Tonight,Mostly Clear,Low: 38 °F,"Tonight: Mostly clear, with a low around 38. N...",38,True
3,ThursdayNight,Mostly Clear,Low: 41 °F,"Thursday Night: Mostly clear, with a low aroun...",41,True
5,FridayNight,Partly Cloudy,Low: 48 °F,"Friday Night: Partly cloudy, with a low around...",48,True
7,SaturdayNight,Partly Cloudy,Low: 49 °F,"Saturday Night: Partly cloudy, with a low arou...",49,True
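
Once the numeric temperature and the night flag exist, the daytime and night-time periods can be summarized separately. The sketch below copies the values from the forecast table into a small frame; `forecast`, `part`, and `mean_by_part` are hypothetical names.

```python
import pandas as pd

# Values copied from the forecast table shown above.
forecast = pd.DataFrame({
    "period": ["Today", "Tonight", "Thursday", "ThursdayNight", "Friday",
               "FridayNight", "Saturday", "SaturdayNight", "Sunday"],
    "temp_num": [49, 38, 49, 41, 56, 48, 59, 49, 61],
    "is_night": [False, True, False, True, False, True, False, True, False],
})

# Turn the boolean flag into readable labels, then average within each group.
forecast["part"] = forecast["is_night"].map({True: "night", False: "day"})
mean_by_part = forecast.groupby("part")["temp_num"].mean()
print(mean_by_part)
```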


In [None]:
# Other ideas for web scraping:
# Get news headlines from the New York Times?
# Get current stock prices?
# Monitor alarms?
# Download files?

## III. Binary File Formats

### 1. pickle
The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure. Only Python can properly read and write pickle files, and a pickle file should only be loaded if it comes from a trusted source.

In [89]:
# Let's create a data frame first
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95, 'A'],
    [55, 60, 45, 'F'],
    [70, 75, 90, 'A'],
    [75, 70, 60, 'D'],
    [60, 73, 75, 'C'],
    [72, 63, -1, 'NA']
])
df = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final', 'LetterGrade'],
                   index=['Alex', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
df

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
Frank,72,63,-1,


In [90]:
# Save as a .pickle file
df.to_pickle("data.pickle")

In [91]:
# Load the pickle file
df_pickle = pd.read_pickle("data.pickle")
df_pickle

Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
Frank,72,63,-1,


In [92]:
# A pickle file can contain multiple objects.
import pickle
a = 5
b = ['a', 'b', 'c']
with open('temp.pickle', 'wb') as file:
    pickle.dump(a, file)
    pickle.dump(b, file)
    pickle.dump(df_pickle, file)

In [93]:
with open('temp.pickle', 'rb') as file:
    a = pickle.load(file)
    b = pickle.load(file)
    df_pickle = pickle.load(file)
    
print(a)
print(b)
df_pickle.head()

5
['a', 'b', 'c']


Unnamed: 0,Midterm,Project,Final,LetterGrade
Alex,100,80,95,A
Bob,55,60,45,F
Chris,70,75,90,A
Doug,75,70,60,D
Eva,60,73,75,C
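
The cell above works because we know that exactly three objects were written. When the number of objects is unknown, one common pattern is to call `pickle.load()` until it raises `EOFError`. The sketch below assumes an in-memory `io.BytesIO` buffer in place of a real file, and `load_all` is a hypothetical helper.

```python
import io
import pickle

def load_all(file):
    """Yield every object stored in an open binary pickle stream."""
    while True:
        try:
            yield pickle.load(file)
        except EOFError:
            # pickle.load() raises EOFError once the stream is exhausted.
            return

# BytesIO stands in for a file opened with mode 'wb' and then 'rb'.
buffer = io.BytesIO()
for obj in (5, ["a", "b", "c"], {"course": "Week 9"}):
    pickle.dump(obj, buffer)

buffer.seek(0)  # rewind before reading back
objects = list(load_all(buffer))
print(objects)
```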


### 2. HDF5
"HDF" stands for "hierarchical data format". HDF5 can be a good choice for working with very large datasets that don't fit into memory, because you can efficiently read and write small sections of large arrays.

In [94]:
df = pd.DataFrame({
    'Col1': np.random.randn(100),
    'Col2': np.random.randn(100)
})
df.head(5)

Unnamed: 0,Col1,Col2
0,0.093113,-0.910027
1,-0.379575,-0.122672
2,-0.000714,0.219606
3,0.605131,0.252806
4,-0.672866,0.910832


In [95]:
# The PyTables package (installed as "tables") may need to be upgraded
!pip3 install --upgrade tables

Collecting tables
  Downloading tables-3.9.1-cp39-cp39-win_amd64.whl (4.4 MB)
     ---------------------------------------- 4.4/4.4 MB 18.6 MB/s eta 0:00:00
Collecting blosc2>=2.2.8
  Downloading blosc2-2.2.9-cp39-cp39-win_amd64.whl (2.3 MB)
     ---------------------------------------- 2.3/2.3 MB 21.1 MB/s eta 0:00:00
Collecting ndindex>=1.4
  Downloading ndindex-1.7-py3-none-any.whl (85 kB)
     ---------------------------------------- 85.7/85.7 kB ? eta 0:00:00
Installing collected packages: ndindex, blosc2, tables
  Attempting uninstall: blosc2
    Found existing installation: blosc2 2.0.0
    Uninstalling blosc2-2.0.0:
      Successfully uninstalled blosc2-2.0.0
  Attempting uninstall: tables
    Found existing installation: tables 3.8.0
    Uninstalling tables-3.8.0:
      Successfully uninstalled tables-3.8.0
Successfully installed blosc2-2.2.9 ndindex-1.7 tables-3.9.1


In [96]:
# Pass key as a keyword argument; newer pandas versions deprecate passing it positionally
df.to_hdf('data.h5', key='obj1', format='table')

In [97]:
df_hdf5 = pd.read_hdf('data.h5', 'obj1', where=['index < 3'])
df_hdf5

Unnamed: 0,Col1,Col2
0,0.093113,-0.910027
1,-0.379575,-0.122672
2,-0.000714,0.219606


## IV. Interacting with Databases
In a business setting, much of the data is not stored in text or binary files. SQL-based relational databases (such as MySQL) are in wide use.

Python's standard library includes the `sqlite3` module for working with SQLite databases, and pandas has functions that simplify the process.

In [98]:
# Create a SQLite database
import sqlite3
query = """
CREATE TABLE tb
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('data.sqlite')
con.execute(query)
con.commit()

In [None]:
# query = """
# DROP TABLE tb
# """
# con.execute(query)
# con.commit()

In [100]:
# Insert a few rows of data
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO tb VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [101]:
# Select data
cursor = con.execute('select * from tb')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [102]:
# Retrieve column names
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [103]:
# Create a pandas data frame
columns = [x[0] for x in cursor.description]
df = pd.DataFrame(rows, columns=columns)
df

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
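
The fetch-then-build steps above can be collapsed into one call with `pd.read_sql_query`, which runs the query and returns a data frame with the column names filled in. The sketch below is a minimal illustration, assuming an in-memory database that mirrors the `tb` table; `con_demo` and `df_sql` are hypothetical names.

```python
import sqlite3
import pandas as pd

# An in-memory database keeps the sketch self-contained; the table mirrors `tb` above.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE tb (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con_demo.executemany(
    "INSERT INTO tb VALUES (?, ?, ?, ?)",
    [("Atlanta", "Georgia", 1.25, 6),
     ("Tallahassee", "Florida", 2.6, 3),
     ("Sacramento", "California", 1.7, 5)],
)
con_demo.commit()

# read_sql_query() executes the query and builds the data frame in one step.
# The ? placeholder keeps the filter value out of the SQL string.
df_sql = pd.read_sql_query("SELECT * FROM tb WHERE d >= ?", con_demo, params=(5,))
con_demo.close()
print(df_sql)
```

Passing values through `params` rather than formatting them into the query string is the usual way to avoid SQL injection.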
