# I. Load Data Using a Web API

Many websites (Twitter, Facebook, Kaggle, Reddit, ...) offer Application Programming Interfaces (APIs) that provide access to data on their web servers. Today we will use the Twitter API to download and analyze some tweets.

**Get started with the Twitter API:**

1. Sign up on Twitter.
2. Apply for [developer access](https://developer.twitter.com/en/apps).
3. Create a Twitter app.

In [None]:
# Install tweepy package for Python
!pip install tweepy

In [None]:
import tweepy

In [None]:
# Copy and paste tokens from "Keys and Access Tokens" tab


In [None]:
# consumer_key = "(put your token here)"
# consumer_secret = "(put your token here)"
# access_token = "(put your token here)"
# access_token_secret = "(put your token here)"
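
Pasting tokens into a notebook makes them easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the four values from environment variables. The variable names and the `load_credentials` helper are hypothetical, not part of Tweepy, and the demonstration uses a stand-in mapping rather than real tokens.

```python
import os

# Hypothetical variable names; use whatever names you export in your shell.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "placeholder-" + str(i) for i, name in enumerate(CREDENTIAL_VARS)}
creds = load_credentials(demo_env)
print(sorted(creds))
```

With the real variables set, calling `load_credentials()` with no argument returns the same dict, and `consumer_key = creds["TWITTER_CONSUMER_KEY"]` (and so on) supplies the names the authentication cell expects.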

In [None]:
# User authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [None]:
# Create API object to access twitter data
api = tweepy.API(auth)

## Task 1: Retrieve Tweets from Your Timeline

In [None]:
# My timeline
public_tweets = api.home_timeline(tweet_mode = "extended")

In [None]:
# Inspect the data for a single tweet (index 1 is the second tweet in the timeline)
tweet = public_tweets[1]
tweet

In [None]:
# The _json attribute holds the tweet's raw data as a Python dict
tweet._json

In [None]:
# Find specific info
print(tweet.full_text)
print(tweet.author.name)
print(tweet.created_at)
print(tweet.author.location)
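
Because `_json` is a nested dictionary, the same fields can also be read with ordinary dict operations. The sketch below works on a hand-written, heavily trimmed stand-in for `tweet._json`. The values are made up and the `summarize` helper is hypothetical; real payloads contain many more keys.

```python
# Hypothetical, trimmed stand-in for tweet._json; all values are invented.
sample = {
    "created_at": "Tue Oct 01 12:00:00 +0000 2019",
    "full_text": "Example tweet text",
    "user": {"name": "Example User", "location": ""},
    "entities": {"hashtags": [{"text": "example"}]},
}

def summarize(payload):
    """Pull a few fields out of a tweet-like dict, tolerating missing keys."""
    user = payload.get("user", {})
    hashtags = [h["text"] for h in payload.get("entities", {}).get("hashtags", [])]
    return {
        "text": payload.get("full_text", ""),
        "author": user.get("name"),
        "location": user.get("location") or None,  # treat an empty string as missing
        "hashtags": hashtags,
    }

summary = summarize(sample)
print(summary)
```

Using `.get` with defaults keeps the helper from raising `KeyError` when a field is absent, which is common for optional fields such as the author's location.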

## Task 2: Retrieve Tweets from Another User

In [None]:
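name = "nytimes"
tweetCount = 20
# screen_name identifies the account by its handle; newer Tweepy releases no longer accept the id keyword
results = api.user_timeline(screen_name=name, count=tweetCount, tweet_mode="extended")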

In [None]:
for tweet in results:
    print('-' * 80)
    print(tweet.full_text)

## Task 3: Search for Tweets

In [None]:
search_words = "wildfires -filter:retweets"
date_since = "2019-10-01"
# Note: Tweepy 4.0 renamed api.search to api.search_tweets; use that name on newer versions
cursor = tweepy.Cursor(api.search,
                       q=search_words,
                       lang="en",
                       since=date_since,
                       tweet_mode = "extended")

In [None]:
tweets = cursor.items(10)
for tweet in tweets:
    print('-' * 80)
    print(tweet.full_text)
    print(tweet.author.name)
    print(tweet.author.location)

In [None]:
# Create a Pandas DataFrame to store tweets, authors, and locations
import pandas as pd

# Create an empty data frame
tweets_df = pd.DataFrame(columns=['Name', 'Location', 'Text'])
tweets_df

In [None]:
# Append tweet data to the data frame, one row per tweet
tweets = cursor.items(10)
for count, tweet in enumerate(tweets):
    tweets_df.loc[count, :] = [tweet.author.name,
                               tweet.author.location,
                               tweet.full_text]

tweets_df
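
Assigning rows one at a time with `.loc` is fine for ten tweets, but it becomes slow as the number of rows grows. A common alternative, sketched here, is to collect one dict per tweet and build the frame in a single call. The tuples below are made-up stand-ins for `(tweet.author.name, tweet.author.location, tweet.full_text)`, so the example runs without API access.

```python
import pandas as pd

# Hypothetical stand-ins for (author name, author location, full text)
fake_tweets = [
    ("User A", "Somewhere", "first example text"),
    ("User B", "", "second example text"),
]

# Collect one dict per tweet, then build the data frame in one step
records = [
    {"Name": name, "Location": location, "Text": text}
    for name, location, text in fake_tweets
]
demo_df = pd.DataFrame(records, columns=["Name", "Location", "Text"])
print(demo_df)
```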

# II. Binary File Formats

## 1. pickle
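The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure. Pickle files are Python-specific: other languages cannot reliably read or write them. Unpickling can execute arbitrary code, so only load pickle files from sources you trust.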
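
To make "serializing and de-serializing" concrete, here is a minimal sketch that uses the standard-library `pickle` module directly, without pandas or files. The `record` dictionary is made up for illustration.

```python
import pickle

# Any picklable Python object works; this nested structure is invented for illustration
record = {"name": "Alex", "scores": [100, 80, 95], "grade": "A"}

payload = pickle.dumps(record)    # serialize the object to a bytes value
restored = pickle.loads(payload)  # rebuild an equal but independent object

print(type(payload).__name__)
print(restored == record, restored is record)
```

`pickle.dump` and `pickle.load` do the same thing against an open binary file, which is roughly what the pandas `to_pickle` and `read_pickle` helpers wrap.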

In [1]:
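# Let's create a data frame first.
# Note: np.array stores rows that mix numbers and strings as strings,
# so every column of this data frame holds text.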
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95, 'A'],
    [55, 60, 45, 'F'],
    [70, 75, 90, 'A'],
    [75, 70, 60, 'D'],
    [60, 73, 75, 'C'],
    [72, 63, -1, 'NA']
])
df = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final', 'LetterGrade'],
                   index=['Alex', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
df

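|       | Midterm | Project | Final | LetterGrade |
|-------|---------|---------|-------|-------------|
| Alex  | 100     | 80      | 95    | A           |
| Bob   | 55      | 60      | 45    | F           |
| Chris | 70      | 75      | 90    | A           |
| Doug  | 75      | 70      | 60    | D           |
| Eva   | 60      | 73      | 75    | C           |
| Frank | 72      | 63      | -1    | NA          |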


In [2]:
# Save as a .pickle file
df.to_pickle("Data/temp/data.pickle")

In [3]:
# Load the pickle file
df_pickle = pd.read_pickle("Data/temp/data.pickle")
df_pickle

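|       | Midterm | Project | Final | LetterGrade |
|-------|---------|---------|-------|-------------|
| Alex  | 100     | 80      | 95    | A           |
| Bob   | 55      | 60      | 45    | F           |
| Chris | 70      | 75      | 90    | A           |
| Doug  | 75      | 70      | 60    | D           |
| Eva   | 60      | 73      | 75    | C           |
| Frank | 72      | 63      | -1    | NA          |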


## 2. HDF5
HDF stands for "hierarchical data format". HDF5 can be a good choice for working with very large datasets that don't fit into memory, because you can efficiently read and write small sections of large arrays. The pandas HDF5 functions rely on the PyTables package (`tables`).

In [7]:
df = pd.DataFrame({
    'Col1': np.random.randn(100),
    'Col2': np.random.randn(100)
})
df.head(5)

|   | Col1      | Col2      |
|---|-----------|-----------|
| 0 | -0.20768  | -0.137898 |
| 1 | -0.267059 | -0.477401 |
| 2 | 0.923666  | 1.516953  |
| 3 | -0.042093 | -1.018903 |
| 4 | -2.822723 | 2.128009  |


In [9]:
# format='table' stores the data so that it can be queried with where= when reading
df.to_hdf('Data/temp/data.h5', key='obj1', format='table')

In [10]:
# Read back only the rows whose index is below 3
df_hdf5 = pd.read_hdf('Data/temp/data.h5', key='obj1', where=['index < 3'])
df_hdf5

|   | Col1      | Col2      |
|---|-----------|-----------|
| 0 | -0.20768  | -0.137898 |
| 1 | -0.267059 | -0.477401 |
| 2 | 0.923666  | 1.516953  |


## 3. feather
Feather is a column-oriented binary format built on Apache Arrow and designed for exchanging data frames between Python and R. It offers very fast reads and writes.

In [14]:
import time
start = time.time()
df.to_feather('Data/temp/data.feather')
end = time.time()
print("Time cost:", (end - start))

Time cost: 0.004143953323364258


In [18]:
import time
start = time.time()
df_feather = pd.read_feather('Data/temp/data.feather')
end = time.time()
print("Time cost:", (end - start))
df_feather

Time cost: 0.0027763843536376953


|     | Col1      | Col2      |
|-----|-----------|-----------|
| 0   | -0.207680 | -0.137898 |
| 1   | -0.267059 | -0.477401 |
| 2   | 0.923666  | 1.516953  |
| 3   | -0.042093 | -1.018903 |
| 4   | -2.822723 | 2.128009  |
| ... | ...       | ...       |
| 95  | -0.371647 | -1.801467 |
| 96  | -1.158852 | -0.749160 |
| 97  | 1.319792  | -0.716860 |
| 98  | -0.666237 | 0.211813  |
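
A single `time.time()` reading can be noisy for operations that finish in a few milliseconds. One possible refinement, sketched below, uses `time.perf_counter` and keeps the best of several runs. The `best_time` helper is hypothetical, and the workload is an in-memory pickle round trip chosen only so the example needs no extra packages or files; a call such as `df.to_feather(...)` could be passed in its place.

```python
import pickle
import time

def best_time(func, repeats=5):
    """Run func several times and return the shortest wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations)

data = list(range(100_000))  # stand-in workload, not the data frame from this section
elapsed = best_time(lambda: pickle.loads(pickle.dumps(data)))
print("Best of 5 runs:", elapsed, "seconds")
```

Taking the minimum rather than the mean reduces the influence of one-off delays from other processes on the machine.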


# III. Interacting with Databases
In a business setting, much of the data is not stored in text or binary files. SQL-based relational databases (such as MySQL) are in wide use.

Python's built-in `sqlite3` module lets you work with SQLite databases, and pandas provides functions that simplify the process.

In [51]:
# Create a SQLite database
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('Data/temp/data.sqlite')
con.execute(query)
con.commit()

In [48]:
# query = """
# DROP TABLE test
# """
# con.execute(query)
# con.commit()

In [52]:
# Insert a few rows of data
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [53]:
# Select data
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
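
The same `?` placeholders used for the insert also work for filtering in a `SELECT`. The sketch below repeats the table from this section in an in-memory database, so it does not touch `Data/temp/data.sqlite`. The names `demo_con`, `min_d`, and `matches` are hypothetical.

```python
import sqlite3

# An in-memory database keeps this sketch from touching files on disk
demo_con = sqlite3.connect(":memory:")
demo_con.row_factory = sqlite3.Row  # rows can then be indexed by column name

with demo_con:  # commits on success, rolls back if an exception is raised
    demo_con.execute(
        "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
    )
    demo_con.executemany(
        "INSERT INTO test VALUES (?, ?, ?, ?)",
        [("Atlanta", "Georgia", 1.25, 6),
         ("Tallahassee", "Florida", 2.6, 3),
         ("Sacramento", "California", 1.7, 5)],
    )

# The ? placeholder passes the threshold as data rather than as SQL text
min_d = 4
matches = demo_con.execute(
    "SELECT a, d FROM test WHERE d > ? ORDER BY d", (min_d,)
).fetchall()
result = [(row["a"], row["d"]) for row in matches]
print(result)
demo_con.close()
```

Passing values through placeholders instead of building the SQL string by hand avoids quoting mistakes and SQL injection.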

In [54]:
# Retrieve column names
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [55]:
# Create a pandas data frame
columns = [x[0] for x in cursor.description]
df = pd.DataFrame(rows, columns=columns)
df

|   | a           | b          | c    | d |
|---|-------------|------------|------|---|
| 0 | Atlanta     | Georgia    | 1.25 | 6 |
| 1 | Tallahassee | Florida    | 2.6  | 3 |
| 2 | Sacramento  | California | 1.7  | 5 |


Use the `sqlalchemy` package to create a data frame directly from a database.

In [56]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///Data/temp/data.sqlite')
df = pd.read_sql('select * from test', db)
df

|   | a           | b          | c    | d |
|---|-------------|------------|------|---|
| 0 | Atlanta     | Georgia    | 1.25 | 6 |
| 1 | Tallahassee | Florida    | 2.6  | 3 |
| 2 | Sacramento  | California | 1.7  | 5 |
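
For SQLite specifically, pandas can also work with a plain `sqlite3` connection, in both directions. The sketch below writes a data frame to an in-memory database with `to_sql` and reads a filtered subset back with `read_sql_query`. The names `mem_con`, `cities`, and `subset` are hypothetical, and the data repeats the rows used earlier in this section.

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk
mem_con = sqlite3.connect(":memory:")

cities = pd.DataFrame({
    "a": ["Atlanta", "Tallahassee", "Sacramento"],
    "b": ["Georgia", "Florida", "California"],
    "c": [1.25, 2.6, 1.7],
    "d": [6, 3, 5],
})

# Write the data frame as a table; index=False skips the row labels
cities.to_sql("test", mem_con, index=False)

# Read back only the rows with c below 2.0, passing the limit as a parameter
subset = pd.read_sql_query(
    "SELECT a, c FROM test WHERE c < ? ORDER BY a", mem_con, params=(2.0,)
)
print(subset)
mem_con.close()
```

For other database systems, pandas expects a SQLAlchemy engine or connection like the one created with `create_engine` in the cell above.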
