# Data Ingestion

## Topics
* Common Import Issues (Inferring Types, Inferring Missing Values, Records with Errors)
* Types of Ingestion (CSV, Excel, APIs/JSONs, Relational Database)

## Common Import Issues
* Data types: 
    * pandas infers column types, but it may infer them incorrectly, so set them explicitly: read_csv(dtype = {"col":type})
    * pandas infers missing values, but it may miss custom markers: read_csv(na_values={"col" : 0}), where 0 in "col" should be interpreted as a missing value.
    * Lines with errors: a record can have more values than there are columns, which causes a parsing error: read_csv(on_bad_lines = 'warn') skips such records and shows a message for each one. (The older error_bad_lines and warn_bad_lines arguments were deprecated in pandas 1.3 and removed in pandas 2.0.)
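The three options above can be sketched on a small in-memory CSV. The sample data and the column names `zip` and `count` are made up for illustration, and `on_bad_lines` assumes pandas 1.3 or later.

```python
import io
import pandas as pd

# Made-up sample: ZIP codes with leading zeros, a 0 used as a "missing" marker,
# and one record (10001,2,99) that has one field too many
raw = "zip,count\n00501,3\n0,7\n10001,2,99\n10002,5\n"

df = pd.read_csv(
    io.StringIO(raw),
    dtype={"zip": str},      # keep leading zeros instead of inferring integers
    na_values={"zip": 0},    # treat 0 in the zip column as a missing value
    on_bad_lines="skip",     # drop the record with too many fields
)

print(df)
print("Missing ZIP codes:", df["zip"].isna().sum())
```

Using `on_bad_lines="warn"` instead of `"skip"` drops the same record but also reports it, which is usually the safer choice while exploring a new file.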


## Types of Ingestion

### Ingesting from a local CSV
* Flat Files
* Source: https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi

In [32]:
import pandas as pd

# Specify the path to your CSV file
file_path = "/workspace/sources/datacamp/general_datasets/us_tax_data_2016.csv"

# Create list of columns to use
cols = ["ZIPCODE", "AGI_STUB", "MARS1", "MARS2", "NUMDEP"]

# Create dict specifying data types for agi_stub and zipcode
data_types = {'AGI_STUB':'category',
			  'ZIPCODE':str}

# Create dict specifying that 0s in zipcode are NA values
null_values = {'ZIPCODE':0}

try:
  # Set on_bad_lines to warn about (and skip) bad records, alongside the other parameters
  data = pd.read_csv(file_path, 
                     nrows=1000,
                     skiprows=0,
                     usecols=cols,
                     dtype = data_types,
                     na_values=null_values,
                     on_bad_lines = 'warn')
  
  # View first 5 records
  print(data.head())
  
except pd.errors.ParserError:
    print("Your data contained rows that could not be parsed.")

# Print data types of resulting frame
print(data.dtypes.head())
print(data.shape)

# View rows with NA ZIP codes
print(data[data["ZIPCODE"].isna()])

  ZIPCODE AGI_STUB   MARS1   MARS2   NUMDEP
0     NaN        0  825680  748830  1417040
1   35004        0    2150    2140     3430
2   35005        0    1340     890     2170
3   35006        0     430     600      820
4   35007        0    4770    5140     8840
ZIPCODE       object
AGI_STUB    category
MARS1          int64
MARS2          int64
NUMDEP         int64
dtype: object
(1000, 5)
    ZIPCODE AGI_STUB    MARS1    MARS2   NUMDEP
0       NaN        0   825680   748830  1417040
594     NaN        0   173420   125440   204720
750     NaN        0  1315560  1068920  2026700


### Ingesting from a local Excel
* Unlike flat files, spreadsheets can contain formatting and formulas, and a single workbook can hold multiple sheets.
    * Single file
    * Multiple files
* Source: https://github.com/freeCodeCamp/2021-new-coder-survey

In [None]:
import pandas as pd

# ONE SPREADSHEET WITHIN EXCEL

# Specify the path to your Excel file
file_path = "fcc_survey.xlsx"

# Create string of lettered columns to load
col_string = "AD, AW:BA"

# Try reading the Excel file with the 'openpyxl' engine
try:
    survey_responses = pd.read_excel(file_path, 
                       engine='openpyxl',
                       skiprows = 1, 
                       usecols = col_string)
    print("File read successfully using openpyxl engine.")
except Exception as e:
    print("Error:", e)
    # If 'openpyxl' fails, try with 'xlrd'
    try:
        survey_responses = pd.read_excel(file_path, 
                           engine='xlrd',
                           skiprows = 1, 
                           usecols = col_string)
        print("File read successfully using xlrd engine.")
    except Exception as e:
        print("Error:", e)
        print("Unable to read the Excel file.")

# View the names of the columns selected
print(survey_responses.columns)

In [None]:
import pandas as pd

# MULTIPLE SPREADSHEETS WITHIN EXCEL

# Specify the path to your Excel file
file_path = "fcc_survey.xlsx"

# Examples of loading sheets (each call overwrites all_survey_data)
# 1) Load ALL sheets in the Excel file
all_survey_data = pd.read_excel(file_path,
                                sheet_name = None)

# 2) Load the first sheet by index and the 2017 sheet by name
all_survey_data = pd.read_excel(file_path,
                                sheet_name = [0, '2017'])

# 3) Load both the 2016 and 2017 sheets by name
all_survey_data = pd.read_excel(file_path,
                                sheet_name = ['2016', '2017'])
# View the data type of all_survey_data
print(type(all_survey_data)) # a dictionary where keys are the sheet names and values are the dataframes
print(all_survey_data.keys())

# 4) Create an empty dataframe to hold all loaded sheets and concatenate
all_responses = pd.DataFrame()

# Set up for loop to iterate through the dataframes in all_survey_data
for df in all_survey_data.values():
    # Print the number of rows being added
    print("Adding {} rows".format(df.shape[0]))
    # Concatenate all_responses and df, assign result
    all_responses = pd.concat([all_responses, df])
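When sheets are combined, it can help to record which sheet each row came from. A minimal sketch of that idea follows; the `sheets` dictionary is a made-up stand-in for the result of `pd.read_excel(..., sheet_name=None)`, and the `Age` and `sheet` column names are assumptions.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): sheet name -> dataframe
sheets = {
    "2016": pd.DataFrame({"Age": [25, 31]}),
    "2017": pd.DataFrame({"Age": [22, 40, 35]}),
}

# Tag each dataframe with its sheet of origin, then concatenate once
frames = [df.assign(sheet=name) for name, df in sheets.items()]
all_responses = pd.concat(frames, ignore_index=True)

print(all_responses)
```

Collecting the frames in a list and calling `pd.concat()` once avoids re-copying the growing dataframe on every loop iteration.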

### Ingesting from a JSON or an API
* JSONs are:
    * Non-tabular (unstructured) data: information is stored in dictionaries.
    * Schema-less: no predefined rules or constraints are imposed on the structure or content of the JSON data. In other words, there is no formal schema definition that specifies which keys are allowed, which data types are expected, or how the JSON object should be structured.
    * Because of these properties, a JSON might not be "dataframe-ready", as the dictionary values might be nested objects or have varying "schemas".
        * Varying schemas means, for example, that the first element is a dictionary with 2 column names, the second element is a dictionary with 3 column names, etc.
* JSON can be record- or column-oriented, so specifying the orientation is important.
    * read_json()
        * orient='columns' tells pandas how the dictionary values are stored; in this case, column-wise, i.e., keys are column names and values are column values
        * easier to use when there's some structure in the dictionary, like all elements following the same pattern (same number of columns or same types of nested objects)
    * import json -> open() as file -> json.load()
        * better to use when the JSON has a varying schema
    * Single JSON
    * Nested JSONs
        * A JSON contains objects with attribute-value pairs, like a dictionary. A JSON is nested when a value is itself an object. The idea is to flatten the nested JSONs. For that, we use: 
        * pandas.json_normalize() (previously located in the pandas.io.json submodule).
            * It takes a dictionary or a list of dictionaries and returns a flattened dataframe
        * Naming meta columns can get tedious for datasets with many attributes, and code is susceptible to breaking if column names or nesting levels change. In such cases, you may have to write a custom function and employ techniques like recursion to handle the data.
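A small sketch of the orientation and varying-schema points above. The JSON strings are made up for illustration; `io.StringIO` is used so that `read_json()` receives a file-like object.

```python
import io
import json
import pandas as pd

# The same two rows stored record-wise and column-wise (made-up data)
records_json = '[{"name": "a", "score": 1}, {"name": "b", "score": 2}]'
columns_json = '{"name": {"0": "a", "1": "b"}, "score": {"0": 1, "1": 2}}'

df_records = pd.read_json(io.StringIO(records_json), orient="records")
df_columns = pd.read_json(io.StringIO(columns_json), orient="columns")
print(df_records)
print(df_columns)

# Varying schema: the first element lacks the "score" key.
# Loading with the json module and pd.DataFrame() fills the gap with NaN.
varying = json.loads('[{"name": "a"}, {"name": "b", "score": 2}]')
df_varying = pd.DataFrame(varying)
print(df_varying)
```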

* API: APIs are the most common source of JSON data. 
    * APIs limit the amount of data you can get in a given timeframe, and often per request, but parameters let you control what is returned (in Yelp, the limit and offset parameters).
    * An API provides an endpoint to send requests to. 
    * Its documentation describes what a request should look like (such as its parameters).
    * The requests library is one option. It allows users to send and receive data from a URL
        * requests.get(url_string, params = , headers = , ...)
        * response.json(): returns just the data as a dict
            * We cannot use read_json(), because it expects a JSON string or file, not a dictionary
            * We need to use pd.DataFrame()
In [None]:
# Reading a JSON
# Wrapping the read in a try-except block (for logging, check the data pipelines jupyter notebook)
import pandas as pd
import matplotlib.pyplot as plt

try:
    # Load the JSON without keyword arguments
    df = pd.read_json("dhs_report_reformatted.json")
    
    # Plot total population in shelters over time
    df["date_of_census"] = pd.to_datetime(df["date_of_census"])
    df.plot(x="date_of_census", 
            y="total_individuals_in_shelter")
    plt.show()
    
except ValueError:
    print("pandas could not parse the JSON.")

In [2]:
# API about cafes in New York City
import requests
import pandas as pd
import os

# API authentication: https://www.yelp.com/developers/v3/manage_app 
# API endpoint: https://api.yelp.com/v3/businesses/search
    # parameters we want: term, location
    # Dictionary with authentication info

api_url = "https://api.yelp.com/v3/businesses/search"

# Get the Yelp API key from the environment (e.g., loaded from a .env file)
api_key = os.environ.get("YELP_API_KEY")

# Create dictionary with authentication info
headers = {
    "Authorization": "Bearer {}".format(api_key)
}

# Create dictionary to query API for cafes in NYC
parameters = {"term": "cafe",
          	  "location": "NYC"}

# Query the Yelp API with headers and params set
response = requests.get(api_url,
                headers=headers,
                params=parameters)

# Check if the request was successful
if response.status_code == 200:
    # Extract JSON data from response
    data = response.json()

    # Load "businesses" values to a dataframe and print head
    cafes = pd.DataFrame(data["businesses"])

    # View the data's dtypes
    print(cafes.dtypes)
    print(cafes.head())
    print("This is the shape of the dataframe returned by Yelp, where only 20 records at a time are retrieved by default: ", cafes.shape)
else:
    print("Error:", response.status_code)

id                object
alias             object
name              object
image_url         object
is_closed           bool
url               object
review_count       int64
categories        object
rating           float64
coordinates       object
transactions      object
price             object
location          object
phone             object
display_phone     object
distance         float64
attributes        object
dtype: object
                       id               alias       name  \
0  ED7A7vDdg8yLNKJTSVHHmg    arabica-brooklyn  % Arabica   
1  -2UtjTxrt1Xzd-HPsLJ7mA   butler-brooklyn-2     Butler   
2  d2y35lqplnZvK0cbMWz7xQ   kijitora-brooklyn   Kijitora   
3  bJDU8KNLQMrZG0Ngs4AY0w  le-phin-new-york-2    Le Phin   
4  DE0ROwygh-86i4s-WLp8wQ   maman-new-york-22      maman   

                                           image_url  is_closed  \
0  https://s3-media2.fl.yelpcdn.com/bphoto/RZ7MHl...      False   
1  https://s3-media3.fl.yelpcdn.com/bphoto/bdMNkv...      False   

In [160]:
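# Now, suppose we have already loaded the first 50 records in an earlier request by setting "limit": 50.
# Yelp returns only 20 records per request by default, so we page through the results: "limit": 50 with "offset": 50 returns records 51-100.
# Note that cafes is overwritten here; to keep both pages, concatenate the dataframes with pd.concat().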
parameters2 = {"term": "cafe", 
          "location": "NYC",
          "sort_by": "rating", 
          "limit": 50,
          "offset": 50}

# Query the Yelp API with headers and params set
response = requests.get(api_url,
                headers=headers,
                params=parameters2)

# Check if the request was successful
if response.status_code == 200:
    # Extract JSON data from response
    data = response.json()

    # Load "businesses" values to a dataframe and print head
    cafes = pd.DataFrame(data["businesses"])

    print("This is the new shape of the dataframe returned by Yelp: ", cafes.shape)
else:
    print("Error:", response.status_code)

This is the new shape of the dataframe returned by Yelp:  (50, 17)
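The paging idea can be sketched as a loop that requests one page at a time and concatenates the results. To keep the example runnable offline, `fetch_page()` is a hypothetical stand-in for the `requests.get(...).json()` call, and the total of 45 records is made up.

```python
import pandas as pd

def fetch_page(limit, offset):
    """Hypothetical stand-in for one API call that returns a 'businesses' payload."""
    total = 45  # pretend the API has 45 matching records
    ids = range(offset, min(offset + limit, total))
    return {"businesses": [{"id": i, "name": f"cafe-{i}"} for i in ids]}

def fetch_all(limit=20):
    """Request pages until an empty one comes back, then combine them."""
    pages = []
    offset = 0
    while True:
        businesses = fetch_page(limit, offset)["businesses"]
        if not businesses:
            break
        pages.append(pd.DataFrame(businesses))
        offset += limit
    return pd.concat(pages, ignore_index=True)

cafes_all = fetch_all(limit=20)
print(cafes_all.shape)
```

With a real API, the loop would also need to respect the provider's rate limits and any cap on the maximum offset.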


In [1]:
# Nested JSONs

# Example: The Yelp API response data is nested. 
# The idea is to look at bookstores in San Francisco.
# Your job is to flatten out the next level of data in the coordinates and location columns.

import pandas as pd
import requests
import os

# Set up headers, parameters, and API endpoint
api_url = "https://api.yelp.com/v3/businesses/search"

# Get the Yelp API key from the environment (e.g., loaded from a .env file)
api_key = os.environ.get("YELP_API_KEY")

if api_key is None:
    # Stop here: without a key, the request below cannot be authenticated
    raise RuntimeError("YELP_API_KEY environment variable is not set!")

headers = {"Authorization": "Bearer {}".format(api_key)}
params = {"term": "bookstore", 
          "location": "San Francisco"}

# Make the API call and extract the JSON data
response = requests.get(api_url, headers=headers, params=params)
bookstore_data_dict = response.json()

print("This is the JSON's type from the API request:", type(bookstore_data_dict))
print("What the JSON looks like: ")
for key in bookstore_data_dict.keys():
    print(key, ":", bookstore_data_dict[key])
print("The JSON has 3 keys: ", bookstore_data_dict.keys())
print("Note that the 'businesses' value is of a list type: ", type(bookstore_data_dict["businesses"]))
print("More precisely, it is a list of dictionaries: ")
biz_bookstore_data_list = bookstore_data_dict["businesses"]
print(type(biz_bookstore_data_list))
print("There are", len(biz_bookstore_data_list), "elements in the 'businesses' key.")
print("Accessing the first element, which is of type: ", type(biz_bookstore_data_list[0]))
print(biz_bookstore_data_list[0])
print("Accessing the 'categories' key of the first item in the 'businesses' key: ", biz_bookstore_data_list[0]['categories'])

This is the JSON's type from the API request: <class 'dict'>
What the JSON looks like: 
businesses : [{'id': 'Uu2yJ_LoL1nTcr3Vf2Oz6g', 'alias': 'borderlands-books-san-francisco', 'name': 'Borderlands Books', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/rxMcttdR9XapCMW-56shkg/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/borderlands-books-san-francisco?adjust_creative=RraRAWZZ1IgIuxDvaATISQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=RraRAWZZ1IgIuxDvaATISQ', 'review_count': 215, 'categories': [{'alias': 'bookstores', 'title': 'Bookstores'}], 'rating': 4.8, 'coordinates': {'latitude': 37.7696454422024, 'longitude': -122.45099043497255}, 'transactions': [], 'price': '$$', 'location': {'address1': '1740 Haight St', 'address2': '', 'address3': '', 'city': 'San Francisco', 'zip_code': '94117', 'country': 'US', 'state': 'CA', 'display_address': ['1740 Haight St', 'San Francisco, CA 94117']}, 'phone': '+14158248203', 'display_phone': '(415)

In [151]:
# So, there are many jsons nested inside jsons. We need to flatten them. 
import numpy as np

# Flatten business data into a dataframe, replace separator
bookstore_data_df = pd.json_normalize(bookstore_data_dict["businesses"], sep = '_')

# View the data
print("Dataframe columns: \n", list(bookstore_data_df.columns))
print("The coordinates_latitude column: \n", bookstore_data_df["coordinates_latitude"].head(2))
print("However, note that 'categories' is still nested: \n", bookstore_data_df['categories'].head(2))
print("This is the alias from the 'categories' value within the 'businesses' value: \n", bookstore_data_df['categories'][0][0]["alias"])
print("This is the alias from the 'businesses' value: \n", bookstore_data_df['alias'][0])
print("Now, printing alias (businesses key), alias (categories key) and coordinates_latitude (already flattened):" )
print("alias (businesses key): ", bookstore_data_df['alias'][0])
print("alias (categories key): ", bookstore_data_df['categories'][0][0]["alias"])
print("coordinates_latitude (already flattened)", bookstore_data_df['coordinates_latitude'][0])
print("-------")

# Flatten the nested 'categories' records, load other business attributes and set meta prefix
bookstore_data_flat_df = pd.json_normalize(bookstore_data_dict["businesses"],
                            sep="_",
                            record_path="categories",  # one output row per entry in each business's 'categories' list
                            meta=['name',
                                  'alias',
                                  'rating',
                                  ['coordinates', 'latitude'],    # this will flatten coordinates
                                  ['coordinates', 'longitude']],  # this will flatten coordinates
                            meta_prefix="biz_")

# View the data
print("Flattened dataframe columns (based on categories): \n", list(bookstore_data_flat_df.columns))
print(bookstore_data_flat_df[["biz_alias", "alias", "biz_coordinates_latitude"]].head(1))
print("Notice that both dataframes describe the same businesses. However, the one flattened on 'categories' is more denormalized: a business with several categories appears in several rows, so the values in its biz_ columns are duplicated.")


Dataframe columns: 
 ['id', 'alias', 'name', 'image_url', 'is_closed', 'url', 'review_count', 'categories', 'rating', 'transactions', 'price', 'phone', 'display_phone', 'distance', 'coordinates_latitude', 'coordinates_longitude', 'location_address1', 'location_address2', 'location_address3', 'location_city', 'location_zip_code', 'location_country', 'location_state', 'location_display_address', 'attributes_business_temp_closed', 'attributes_open24_hours', 'attributes_waitlist_reservation']
The coordinates_latitude column: 
 0    37.769645
1    37.769310
Name: coordinates_latitude, dtype: float64
However, note that 'categories' is still nested: 
 0     [{'alias': 'bookstores', 'title': 'Bookstores'}]
1    [{'alias': 'bookstores', 'title': 'Bookstores'...
Name: categories, dtype: object
This is the alias from the 'categories' value within the 'businesses' value: 
 bookstores
This is the alias from the 'businesses' value: 
 borderlands-books-san-francisco
Now, printing alias (businesses ke
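When the nesting depth is not known in advance, a small recursive helper is one option, as mentioned in the notes above. The sketch below is an illustration, not a replacement for `pd.json_normalize()`: `flatten()` is a hypothetical helper, it only descends into dictionaries (lists are left as-is), and the sample record is abbreviated from the Yelp output shown earlier.

```python
import pandas as pd

def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dictionaries into a single-level dictionary."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend one level and prefix the child keys with the parent key
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Abbreviated record based on the first business in the output above
business = {
    "name": "Borderlands Books",
    "coordinates": {"latitude": 37.7696454422024, "longitude": -122.45099043497255},
    "location": {"city": "San Francisco", "zip_code": "94117"},
}

flat_df = pd.DataFrame([flatten(business)])
print(list(flat_df.columns))
```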

### Ingesting from a (Relational) Database
* Step 1: Connect to a database (SQLAlchemy)
    * Create a Database Engine to handle database connections: create_engine ()
* Step 2: Query the database (SQL or Pandas)
    * pd.read_sql(sql_query, engine): to load in data from a database

* Databases examples:
    * SQLite
    * Postgres

### Method 0: SQLAlchemy & Pandas
* Getting the full data
* Getting partial data with SQL refinements (example: SELECT DISTINCT, WHERE, etc)

In [None]:
# Import pandas and sqlalchemy's create_engine() and inspect() functions
import pandas as pd
from sqlalchemy import create_engine, inspect

# Create the database engine
engine = create_engine("sqlite:///data.db")

# View the tables in the database
# (engine.table_names() was removed in SQLAlchemy 2.0; use an inspector instead)
print(inspect(engine).get_table_names())
    # ['boro_census', 'hpd311calls', 'weather']
# Load hpd311calls without any SQL
hpd_calls = pd.read_sql('hpd311calls', engine) #load the hpd311calls table by name (without any SQL) into a pandas dataframe

# View the first few rows of data
print(hpd_calls.head())

# Create a SQL query to load the entire weather table
query = """
SELECT * 
  FROM weather;
"""

# Load weather with the SQL query
weather = pd.read_sql(query, engine) #load the weather table by a SQL query
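Getting partial data with SQL refinements can be sketched on a throwaway in-memory database. To stay dependency-free, this example passes a standard-library `sqlite3` connection to pandas instead of a SQLAlchemy engine; the table contents and the `station` and `tmax` column names are made up.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory SQLite database with a tiny, made-up weather table
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "station": ["A", "A", "B"],
    "tmax": [30, 45, 52],
}).to_sql("weather", conn, index=False)

# SELECT DISTINCT: unique values only
distinct_stations = pd.read_sql(
    "SELECT DISTINCT station FROM weather ORDER BY station", conn
)

# WHERE with a bound parameter: only rows above a threshold
warm_days = pd.read_sql(
    "SELECT * FROM weather WHERE tmax > ? ORDER BY tmax", conn, params=(40,)
)
conn.close()

print(distinct_stations)
print(warm_days)
```

Filtering in SQL means only the rows you need are transferred into the dataframe, which matters for large tables.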

### Method 1: Here we use Pandas & SQLAlchemy & Faker to ingest fake data into the Postgres database.

In [1]:
# We will use the SQLAlchemy package to access a Postgres database

# We start by importing the create_engine function.
    # The engine manages connections and sends our SQL queries to the database 
import pandas as pd
from sqlalchemy import create_engine, text, inspect
from faker import Faker


# Create the engine
engine = create_engine('postgresql://myuser:mypassword@postgres/mydatabase')

# Checking the table names within the database
insp = inspect(engine)
print(insp.get_table_names(schema="schema_test")) # recall that Postgres folds unquoted names to lower case 

# Connecting through the engine, inserting rows, and then executing a SELECT query
with engine.connect() as conn:

    faker = Faker('en_US')

    # Insert fake data
    for i in range(10):
        test_id = faker.random_int(min=1, max=200)
        amount = faker.random_int(min=100, max=10000)
        #created_at: recall that the created_at is defined in the init.sql
        #insert_query = text(f"INSERT INTO SCHEMA_TEST.TABLE_TEST (test_id, amount) VALUES ({test_id}, {amount})")
        insert_query = text("INSERT INTO SCHEMA_TEST.TABLE_TEST (test_id, amount) VALUES (:test_id, :amount)")
        conn.execute(insert_query, {"test_id": test_id, "amount": amount})

    # Commit the transaction
    conn.commit() # committing refers to finalizing and applying the changes made within a transaction to the database.

    # Fetch and print the table after inserting the data
    select_query = text("SELECT * FROM SCHEMA_TEST.TABLE_TEST")
    result = conn.execute(select_query) # Created a SQLAlchemy object that is assigned to the result variable
    df = pd.DataFrame(result.fetchall()) # Fetches all rows
    df.columns = result.keys() # set the dataframe column names
    # Print the table after inserting the data
df.head()


['table_test']


   test_id  amount                  created_at
0      192     192  2024-04-29 09:54:23.833128
1      138    2992  2024-04-29 09:54:23.833128
2      120     880  2024-04-29 09:54:23.833128
3       31    5607  2024-04-29 09:54:23.833128
4       98    5471  2024-04-29 09:54:23.833128
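The bound parameters (`:test_id`, `:amount`) used above keep values out of the SQL string, unlike the commented-out f-string variant. The sketch below shows the same pattern with all rows sent in a single `execute()` call. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite database and fixed values instead of Postgres and Faker, so it runs without a server.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the Postgres database (assumption for this sketch)
engine = create_engine("sqlite://")

# Fixed rows instead of Faker output, so the result is predictable
rows = [
    {"test_id": 1, "amount": 100},
    {"test_id": 2, "amount": 250},
    {"test_id": 3, "amount": 975},
]

# engine.begin() opens a transaction and commits it when the block exits
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE table_test (test_id INTEGER, amount INTEGER)"))
    # Passing a list of dicts runs the statement once per row
    conn.execute(
        text("INSERT INTO table_test (test_id, amount) VALUES (:test_id, :amount)"),
        rows,
    )

with engine.connect() as conn:
    total = conn.execute(text("SELECT COUNT(*), SUM(amount) FROM table_test")).one()

print(tuple(total))
```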


### Method 2: Here we use Pandas & SQLAlchemy & Faker to ingest fake data into the Postgres database, then read it back more concisely with pd.read_sql_query().

In [7]:
# We will use the SQLAlchemy package to access a Postgres database, but with pandas at the end to query it

# We start by importing the create_engine function.
    # The engine manages connections and sends our SQL queries to the database 
from sqlalchemy import create_engine, text, inspect
from faker import Faker
import pandas as pd

# Create the engine
engine = create_engine('postgresql://myuser:mypassword@postgres/mydatabase')

# Checking the table names within the database
insp = inspect(engine)
print(insp.get_table_names(schema="schema_test")) # recall that Postgres folds unquoted names to lower case 

# Connecting through the engine and inserting rows
with engine.connect() as conn:

    faker = Faker('en_US')

    # Insert fake data
    for i in range(10):
        test_id = faker.random_int(min=1, max=200)
        amount = faker.random_int(min=100, max=10000)
        #created_at: recall that the created_at is defined in the init.sql
        #insert_query = text(f"INSERT INTO SCHEMA_TEST.TABLE_TEST (test_id, amount) VALUES ({test_id}, {amount})")
        insert_query = text("INSERT INTO SCHEMA_TEST.TABLE_TEST (test_id, amount) VALUES (:test_id, :amount)")
        conn.execute(insert_query, {"test_id": test_id, "amount": amount})

    # Commit the transaction
    conn.commit() # committing refers to finalizing and applying the changes made within a transaction to the database.

df = pd.read_sql_query("SELECT * FROM SCHEMA_TEST.TABLE_TEST", engine)
df.head()

['table_test']


   test_id  amount                  created_at
0      200    1909  2024-04-15 14:50:40.478758
1      198    4832  2024-04-15 14:50:40.478758
2      173    1485  2024-04-15 14:50:40.478758
3      174     929  2024-04-15 14:50:40.478758
4       91    1693  2024-04-15 14:50:40.478758


### Method 3: Here we use Pandas & urllib to ingest CSV data from a URL

In [8]:
# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 
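The `sep=';'` argument matters because this file is semicolon-delimited. A short sketch of what happens with and without it, using an in-memory excerpt of three columns and two rows taken from the output above:

```python
import io
import pandas as pd

# Excerpt of the wine data in its semicolon-delimited form
raw = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# Default separator (comma): every line collapses into a single column
wrong = pd.read_csv(io.StringIO(raw))

# Correct separator: three properly typed columns
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
print(right.dtypes)
```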

### Method 4: Here we ingest data from a URL with HTTP requests

In [1]:
# Import package
import requests

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send the request and catch the response r
r = requests.get(url)

# Extract the response
text = r.text

# Print the html
print(text)

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131}button,html{font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}@media (prefers-color-scheme:dark){body{background-color:#222;color:#d9d9d9}body a{color:#fff}body a:hover{color:#ee730a;text-decoration:underline}body .lds-ring div{border-color:#999 transparent transparent}body .font-red{color:#b20f03}body .big-button,body .pow-button{background-color:#4693ff;color:#1d1d1d}body #challenge-success-text{background-image:url(data:image/svg+xml;base64,PH

### Method 5: Here we Scrape the web using BeautifulSoup and HTTP requests

In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response r
r = requests.get(url)

# Extracts the response as html
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks)
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

<title>Guido's Personal Home Page</title>
pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
images/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
Resume.html
https://docs.python.org
https://github.com/python/cpython/issues
https://discuss.python.org
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
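Several of the links printed above are relative (for example `pics.html`). One way to turn them into absolute URLs is `urllib.parse.urljoin`. The sketch below works on a tiny inline HTML snippet, which is made up but reuses two hrefs from the output, so it runs without a network connection.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.python.org/~guido/"

# Made-up snippet with one relative link, one absolute link, and one anchor without href
html_doc = (
    '<a href="pics.html">Pictures</a> '
    '<a href="http://www.python.org">Python</a> '
    '<a>No link here</a>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# href=True skips <a> tags that have no href attribute
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```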


### Method 6: Here we Ingest data from APIs and JSONs, anonymously (without an account)

In [1]:
# Import package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

# Package the request, send the request and catch the response r
r = requests.get(url)

# Decode the JSON data into a dictionary
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

# or this: Print each key-value pair in json_data
# for k, v in json_data.items():
#     print(k + ': ', v)

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin, Ben Mezrich
Actors:  Jesse Eisenberg, Andrew Garfield, Justin Timberlake
Plot:  As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  United States
Awards:  Won 3 Oscars. 173 wins & 187 nominations total
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.8
imdbVotes:  754,796
imdbID:  tt1285016
Type:  movie
DVD:  05 Jun 2012
BoxOffice:  $9
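In the response above, `Ratings` is a list of dictionaries nested inside the movie record. A possible way to tabulate it is `pd.json_normalize()` with `record_path` and `meta`. The `movie` dictionary below is an abbreviated copy of the output shown above.

```python
import pandas as pd

# Abbreviated copy of the OMDb response printed above
movie = {
    "Title": "The Social Network",
    "Year": "2010",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "7.8/10"},
        {"Source": "Rotten Tomatoes", "Value": "96%"},
        {"Source": "Metacritic", "Value": "95/100"},
    ],
}

# One row per rating, with the movie title repeated as a meta column
ratings = pd.json_normalize(movie, record_path="Ratings", meta=["Title"])
print(ratings)
```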

### Method 7: Here we Ingest data from APIs and nested JSONs, anonymously (without an account)

In [2]:
# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response r
r = requests.get(url)

# Decode the JSON data into a dictionary
json_data = r.json()

# Print the Wikipedia page extract (nested jsons)
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)


<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1033289096">
<p class="mw-empty-elt">

</p>
<p><b>Pizza</b> (<span></span> <i title="English pronunciation respelling"><span>PEET</span>-sə</i>, <span>Italian:</span> <span lang="it-Latn-fonipa">[ˈpittsa]</span>; <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1177148991"><span>Neapolitan:</span> <span lang="nap-Latn-fonipa">[ˈpittsə]</span>) is a dish of Italian origin consisting of a flat base of leavened wheat-based dough topped with tomato, cheese, and other ingredients, baked at a high temperature, traditionally in a wood-fired oven.</p><p>The term <i>pizza</i> was first recorded in the year 997 AD, in a Latin manuscript from the southern Italian town of Gaeta, in Lazio, on the border with Campania. Raffaele Esposito is often credited for creating modern pizza in Naples. In 2009, Neapolitan pizza was registered with the European Union as a traditional speciality guaranteed dish. In 2017, 

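The page id `'24768'` is hard-coded in the lookup above. A sketch of reading the extract without knowing the id in advance is to iterate over the `pages` dictionary. The `json_data` below is a shortened stand-in for `r.json()`, with the extract text abbreviated.

```python
# Shortened stand-in for r.json(); the extract text is abbreviated
json_data = {
    "query": {
        "pages": {
            "24768": {
                "pageid": 24768,
                "title": "Pizza",
                "extract": "<p><b>Pizza</b> is a dish of Italian origin</p>",
            }
        }
    }
}

pages = json_data["query"]["pages"]

# List every page returned, whatever its id is
for page_id, page in pages.items():
    print(page_id, page["title"])

# Take the first (here: only) page without hard-coding its id
first_page = next(iter(pages.values()))
pizza_extract = first_page["extract"]
print(pizza_extract)
```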
### Method 8: Here we Ingest data from Twitter APIs and nested JSONs, with an account (with authentication credentials). We use Tweepy and we filter tweets for specific tags

In [12]:
# Import packages
import json
import os
import pandas as pd
import tweepy #uncomment the tweepy installation in requirements.txt

# Read credentials from environment variables instead of hard-coding them in the notebook
# (the variable names below are assumptions; use the names defined in your environment)
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN")
access_token_secret = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET")

# Subclass tweepy.Stream to add logic to on_status
# (Tweepy 4.0 merged StreamListener into Stream; this requires a Tweepy release
#  that still provides tweepy.Stream, which the installed one does not, as the error below shows)
class MyStream(tweepy.Stream):

    def on_status(self, status):
        print(status.text)

# Create your Stream object with credentials
stream = MyStream(consumer_key, consumer_secret, access_token, access_token_secret)

# Filter your Stream variable for tweets that mention the candidates
stream.filter(track=["clinton", "trump", "sanders", "cruz"])

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets (this will be a list of dictionaries)
tweets_data = []

# Open the file, read in tweets and store them in the list
# (the with block closes the file automatically)
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns = ['text', 'lang'])

# Print head of DataFrame
print(df.head())


# Count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'
import re

# Helper assumed by the loop below (it is not defined elsewhere in this notebook):
# returns True if word occurs in text, ignoring case
def word_in_text(word, text):
    return bool(re.search(word, text, flags=re.IGNORECASE))

# Initialize the tweet counts
clinton, trump, sanders, cruz = 0, 0, 0, 0

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])



AttributeError: module 'tweepy' has no attribute 'Stream'
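Since the streaming call fails here, the parsing and counting steps can still be sketched offline. The example below assumes the tweets are stored one JSON object per line, as in `tweets.txt`. The three tweets are made up, and a vectorized `str.contains()` replaces the `iterrows()` loop.

```python
import io
import pandas as pd

# Made-up sample in JSON Lines form (one JSON object per line)
raw_lines = (
    '{"text": "Clinton and Trump debate tonight", "lang": "en"}\n'
    '{"text": "Sanders rally draws a crowd", "lang": "en"}\n'
    '{"text": "trump responde a cruz", "lang": "es"}\n'
)

# lines=True tells pandas to parse each line as a separate record
df = pd.read_json(io.StringIO(raw_lines), lines=True)[["text", "lang"]]

# Case-insensitive count of tweets mentioning each name
names = ["clinton", "trump", "sanders", "cruz"]
counts = {name: int(df["text"].str.contains(name, case=False).sum()) for name in names}

print(counts)
```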

### Method 9: Here we Ingest & Stream data from APIs, with an account (with authentication credentials)

In [10]:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
d
df = pd.DataFrame(data=d)
df

   col1  col2
0     1     3
1     2     4
