## Learning Objectives

- How to store a large dataset on S3 and read it back using Boto3

- Practice by uploading the churn dataset (`Churn_Modelling.csv`) to S3 and training a Keras deep learning model on it

### Why use Cloud to Store Data?

- Changes to the dataset are shared consistently across the company
- Resources are allocated more efficiently: the data is stored in one place, rather than everyone keeping their own copy on their machine



S3 + Boto setup:
- `pip install awscli` (`!pip install awscli` on Google Colab)
- `aws configure` (`!aws configure` on Google Colab), which prompts for:
    - AWS Access Key ID [None]: ...
    - AWS Secret Access Key [None]: ...
    - Default region name [None]: ...
    - Default output format [None]: ...

In [8]:
# adapted from this blog: https://dluo.me/s3databoto3

# when installing packages like Boto3 with pip, this flag may help: --use-feature=2020-resolver

# using the AWS API
import boto3
import pandas as pd

client = boto3.client('s3')  # low-level functional API

BUCKET_NAME = ''  # this is SENSITIVE!

resource = boto3.resource('s3')  # high-level object-oriented API
my_bucket = resource.Bucket(BUCKET_NAME)  # substitute your S3 bucket name

# making a Pandas DF
# Bucket is the name of the bucket
# Key is the path of the file inside the bucket
obj = client.get_object(Bucket=BUCKET_NAME, Key='')
df = pd.read_csv(obj['Body'])

# use environment variables, even if the repo is PRIVATE!
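
One way to follow the advice about environment variables is sketched below. The variable name `S3_BUCKET_NAME`, the helper `get_bucket_name`, and the demo value `my-example-bucket` are all hypothetical, chosen only for illustration.

```python
import os


def get_bucket_name(env_var="S3_BUCKET_NAME"):
    """Read the bucket name from an environment variable (the variable name is an assumption)."""
    name = os.environ.get(env_var)
    if not name:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return name


# Demo only: in practice, export the variable in your shell or Colab session
# instead of setting it inside the notebook.
os.environ.setdefault("S3_BUCKET_NAME", "my-example-bucket")
BUCKET_NAME = get_bucket_name()
print(BUCKET_NAME)
```

With this pattern the notebook can be committed without the bucket name ever appearing in it.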

In [9]:
df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [None]:
import pandas as pd
import boto3

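# placeholders: substitute your own bucket name and file key
bucket = "makeschooldata"
file_name = "data/Churn_Modelling.csv"

s3 = boto3.client('s3')
# 's3' is the name of the AWS service; this creates a client using your default config,
# with access to the buckets your credentials allow

obj = s3.get_object(Bucket=bucket, Key=file_name)
# fetch the object stored under the key (file path) in the bucket

df = pd.read_csv(obj['Body'])  # 'Body' is the response key that holds the file contents as a stream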
print(df.head())
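
The cells above only read from S3. Uploading the CSV in the first place could look roughly like the sketch below. The helper `upload_csv` and the bucket name in the usage comment are hypothetical; `upload_file(Filename, Bucket, Key)` is the Boto3 S3 client method for sending a local file.

```python
def upload_csv(client, local_path, bucket, key):
    """Upload a local file to S3 and return its s3:// URI."""
    # The client is passed in, so the same function works with any Boto3 S3 client
    client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage with real AWS credentials (not run here):
#   import boto3
#   s3 = boto3.client('s3')
#   upload_csv(s3, 'Churn_Modelling.csv', 'my-example-bucket', 'data/Churn_Modelling.csv')
```

Passing the client as an argument is a design choice of this sketch: it keeps the credentials and configuration in one place.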

# Churn Prediction

- Let's first read: https://medium.com/@pushkarmandot/build-your-first-deep-learning-neural-network-model-using-keras-in-python-a90b5864116d

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix


print(df.head())

# get features and output of dataset, selecting columns by name so that an
# extra index column (like 'Unnamed: 0') does not shift the positions
X = df.loc[:, 'CreditScore':'EstimatedSalary'].values
y = df['Exited'].values

print(X)
print(X.shape)
print(y)

# encode the categorical columns as integers: column 1 is Geography, column 2 is Gender
label_encoder_X_1 = LabelEncoder()
X[:, 1] = label_encoder_X_1.fit_transform(X[:, 1])
label_encoder_X_2 = LabelEncoder()
X[:, 2] = label_encoder_X_2.fit_transform(X[:, 2])
print(X)
print(X.shape)

# one-hot encode Geography; OneHotEncoder(categorical_features=[1]) was removed
# from scikit-learn, so a ColumnTransformer selects the column instead
one_hot_encoder = ColumnTransformer(
    [('geography', OneHotEncoder(), [1])], remainder='passthrough', sparse_threshold=0)
X = one_hot_encoder.fit_transform(X).astype(float)
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap
# print('M:')
# print(X[:, :10])
# print(X[:, 10])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape)

# MLP network
# (Keras 2+ argument names: units, kernel_initializer, epochs; formerly output_dim, init, nb_epoch)
classifier = Sequential()
# Adding the input layer (11 features) and the first hidden layer
classifier.add(keras.Input(shape=(11,)))
classifier.add(Dense(units=6, kernel_initializer='random_uniform', activation='relu'))
# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer='random_uniform', activation='relu'))
# Adding the output layer
classifier.add(Dense(units=1, kernel_initializer='random_uniform', activation='sigmoid'))
# Compiling Neural Network
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Fitting our model
classifier.fit(X_train, y_train, batch_size=10, epochs=50, verbose=1)
# Predicting the Test set results
y_predict = classifier.predict(X_test)
print(y_predict)
y_predict = (y_predict > 0.5)  # turn probabilities into True/False predictions
cm = confusion_matrix(y_test, y_predict)
print(cm)
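
To read the printed confusion matrix, it may help to turn it into a couple of summary numbers. The sketch below assumes the scikit-learn layout for binary labels, `[[TN, FP], [FN, TP]]`; the helper name and the counts in `example_cm` are made up for illustration and are not results from the model above.

```python
def summarize_confusion_matrix(cm):
    """Return (accuracy, recall) from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # recall: the share of customers who really churned that the model caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Made-up counts, for illustration only
example_cm = [[1500, 100], [200, 200]]
accuracy, recall = summarize_confusion_matrix(example_cm)
print(f"accuracy={accuracy:.3f}, churn recall={recall:.3f}")
```

On an imbalanced dataset like churn, accuracy alone can look good while recall on the churned class stays low, which is why both are computed here.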

## SQL

Review the basics of SQL vs NoSQL (BEW 1.1):

SQL (lower priority)

- not a db itself
- stands for Structured Query Language
- lets you write db queries in a structured way
- lots of keywords
- a table = the samples of one resource
- usually used with a relational db
    - lots of tables, one for each resource
    - the tables are assumed to be related to each other in some way
        - the ways this can happen are covered in BEW 1.1!
        - kinds of relationships are one-to-one, one-to-many, many-to-one, or many-to-many
        - many-to-many: implemented as a table in between two tables, each of which has a one-to-many relationship to it
- Schema = the fields of a resource, for one table in the db
    - fields = columns
    - rows = records, or samples
    - all records in the table must have the same fields (a value may be NULL, unless the schema forbids it)
        - "adhering to the schema"
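
The many-to-many case in the notes above can be sketched with `sqlite3`. The tables `Student`, `Course`, and `Enrollment` and their rows are hypothetical, invented for this example; `Enrollment` plays the role of the table in between.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory db
cur = con.cursor()
cur.executescript("""
CREATE TABLE Student(id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Course(id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- the table in between: each row links one student to one course
CREATE TABLE Enrollment(
    student_id INTEGER NOT NULL REFERENCES Student(id),
    course_id INTEGER NOT NULL REFERENCES Course(id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Student VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO Course VALUES (1, 'Databases'), (2, 'Deep Learning');
INSERT INTO Enrollment VALUES (1, 1), (1, 2), (2, 1);
""")

# join through the in-between table to list who takes what
rows = cur.execute("""
    SELECT Student.name, Course.title
    FROM Enrollment
    JOIN Student ON Student.id = Enrollment.student_id
    JOIN Course ON Course.id = Enrollment.course_id
    ORDER BY Student.name, Course.title
""").fetchall()
for name, title in rows:
    print(name, title)
con.close()
```

The `NOT NULL` constraints are also a small example of a schema forcing every record to carry a value for a field.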
        

NoSQL:

- see the MongoDB section below

In [2]:
import sqlite3 as lite  # sqlite3 lets you run SQL in your Python code; it ships with Python, so no pip install is needed


con = lite.connect('population.db')  # the db file is population.db (lowercase p); the table is Population (capital P)

with con:
    cur = con.cursor()
    # start fresh, so re-running this cell neither fails nor inserts the records again
    cur.execute("DROP TABLE IF EXISTS Population")
    # set the schema
    cur.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
    # adding records (Spain is inserted twice, so the table contains a duplicate)
    cur.execute("INSERT INTO Population VALUES(NULL,'Germany',81197537)")
    cur.execute("INSERT INTO Population VALUES(NULL,'France', 66415161)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Italy', 60795612)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    

In [4]:
import pandas as pd 
# connect to db
con = lite.connect('population.db')
# write query
search = "SELECT country FROM Population where population > 50000000;"
# get the records 
print(pd.read_sql_query(search, con))

   country
0  Germany
1   France
2    Italy


In [16]:
import pandas as pd
import sqlite3

# listing the countries in the db, using SQL queries
conn = sqlite3.connect('population.db')
# only countries with a population above 50 million are selected
query = "SELECT country FROM Population WHERE population > 50000000;"
# SELECT * would instead return all FIELDS of the records that match the query


# From here, it's all up to your Pandas skills
# making a df from the query (1 column)
df = pd.read_sql_query(query, conn)
for country in df['country']:
    print(country)

Germany
France
Italy


In [11]:
# connect to the db
con = lite.connect('population.db')
# make the query
query = "SELECT country FROM Population WHERE country LIKE 'S%'"  # % is a wildcard in LIKE patterns (not a regex): it matches any sequence of zero or more characters
# get the records
countries_start_with_s = pd.read_sql_query(query, con)
# output the result
for country in countries_start_with_s['country']:
    print(country)
    

# other SQL exercises:
# 1. how to get duplicate values, or count them?

# Cool thing? A lot of Pandas commands are condensed equivalents of SQL queries!

Spain
Spain
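
For the exercise on duplicates, one possible approach is `GROUP BY` with `HAVING`. This is a sketch that rebuilds the same `Population` records in an in-memory database, so it does not depend on `population.db`.

```python
import sqlite3

records = [
    ("Germany", 81197537),
    ("France", 66415161),
    ("Spain", 46439864),
    ("Italy", 60795612),
    ("Spain", 46439864),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
con.executemany("INSERT INTO Population VALUES(NULL, ?, ?)", records)

# group the rows by country and keep only the groups with more than one row
duplicates = con.execute(
    "SELECT country, COUNT(*) FROM Population GROUP BY country HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)
con.close()
```

The rough Pandas counterpart would be `df['country'].value_counts()` followed by a filter on counts above 1.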


## On NoSQL or "MongoDB world"

NoSQL - what is it?

MongoDB - the most popular NoSQL db

NoSQL dbs don't have tables, but collections

Collections are made of documents

- like rows in SQL - except they don't all need to have the same schema!
- much more flexible!

- no relations between collections in NoSQL though, or at least they are much less relied upon
- instead, all the related info is put in one place
- more popular at an earlier stage of a business's website
- therefore, queries that join data together are used less


SQL vs. NoSQL - which is better?
Neither! Both have their own strengths and weaknesses, and each is good for certain use cases

Scalability:

1. Horizontal Scaling - easier to do with NoSQL, because there are no relationships to get in the way

- we add more power by adding more servers
- have to distribute the db across the servers
- often NOT supported in SQL

2. Vertical Scaling

- adding more power to a single server


SQL
- many read/write operations at once can be problematic, if you're doing very complicated queries


NoSQL
- if the data is related, NoSQL stores it redundantly (the same info is copied into several documents)
- data is typically merged and nested in a few collections

- data is structured like JSON
- table in SQL = collection in NoSQL
- record in SQL = document in NoSQL
- field in SQL = kinda like key value pair of a doc in NoSQL
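
To make the mapping above concrete, here is a small sketch using plain Python dicts as stand-ins for documents. The `customers` collection and all of its values are made up for illustration.

```python
import json

# Two documents from the same hypothetical "customers" collection.
# They do not share a schema, and related data is nested instead of joined.
customers = [
    {
        "name": "Example Customer A",
        "geography": "France",
        "products": [{"type": "credit card", "active": True}],
    },
    {
        "name": "Example Customer B",
        "geography": "Spain",
        "estimated_salary": 50000.0,
    },
]

for doc in customers:
    print(sorted(doc.keys()))  # the set of fields differs per document

# documents serialize naturally to JSON
print(json.dumps(customers[0], indent=2))
```

In a relational db the nested `products` list would typically live in its own table, linked back to the customer by a key.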

The hard truth
- you can pretty much build any application you want, with either kind of db
- SQL vs NoSQL really only presents problems at SCALE

## Set up MongoDB, then insert and query documents in Python

Read: https://marcobonzanini.com/2015/09/07/getting-started-with-mongodb-and-python/

In [None]:
from pymongo import MongoClient
from datetime import datetime, timezone

# set up connection (defaults to localhost:27017)
client = MongoClient()
# get db, and the collection
db = client['tutorial']
coll = db['articles']
# add a new doc to the db
doc = {
    "title": "An article about MongoDB and Python",
    "author": "Marco",
    "publication_date": datetime.now(timezone.utc),  # datetime.utcnow() is deprecated since Python 3.12
    # more fields
}

doc_id = coll.insert_one(doc).inserted_id

In [None]:
from pymongo import MongoClient

# reading from the db
client = MongoClient()

db = client['tutorial']
coll = db['articles']

for doc in coll.find():
    print(doc)
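
`find()` with no arguments returns every document. Passing a filter document narrows the result, as in this sketch; the helper name `find_by_author` is hypothetical, and the usage in the comments assumes the `tutorial` database and `articles` collection from the cells above.

```python
def find_by_author(coll, author):
    """Return all documents whose 'author' field equals the given name."""
    # find() accepts a filter document; {"author": author} is an equality match
    return list(coll.find({"author": author}))


# Usage against a running MongoDB server (not run here):
#   from pymongo import MongoClient
#   coll = MongoClient()['tutorial']['articles']
#   for doc in find_by_author(coll, "Marco"):
#       print(doc["title"])
```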

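### Shell commands to start MongoDB locally:

Create the default data directory:

sudo mkdir -p /data/db

Find your username, then make that user the owner of the directory (replace `miladtoutounchian` with the output of `whoami`):

whoami

sudo chown miladtoutounchian /data/db

Start the server from the MongoDB install folder:

./bin/mongod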


## Download MongoDB Compass

# GitHub RESTful API example

In [28]:
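import requests  # package for calling public APIs; the standard library alternative is urllib.request

# note: the search API filters by language with a qualifier inside q (e.g. q=tensorflow+language:python);
# the type=python parameter here is most likely ignored
url = 'https://api.github.com/search/repositories?q=tensorflow&type=python'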

tf_repos = requests.get(url)

repos = tf_repos.json()

# print the top 10
for item in repos['items'][:10]:
    print(item['full_name'])
    

tensorflow/tensorflow
romeokienzler/TensorFlow
aymericdamien/TensorFlow-Examples
czy36mengfei/tensorflow2_tutorials_chinese
jikexueyuanwiki/tensorflow-zh
jtoy/awesome-tensorflow
tensorflow/models
yao62995/tensorflow
lyhue1991/eat_tensorflow2_in_30_days
tensorflow/docs
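
Rather than writing the query string by hand, the URL can be assembled from its parts. The sketch below uses only the standard library; the helper `build_search_url` is hypothetical, and it assumes the GitHub search API's `language:` qualifier and `per_page` parameter.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(term, language=None, per_page=10):
    """Build a repository search URL, optionally restricted to one language."""
    # the language filter is a qualifier inside the q parameter
    query = term if language is None else f"{term} language:{language}"
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "per_page": per_page})


url = build_search_url("tensorflow", language="python")
print(url)
# With network access, the request itself would then be: requests.get(url).json()
```

`urlencode` takes care of escaping the space and the colon, which is easy to get wrong when concatenating strings by hand.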


## Review

1. Why We Use Cloud Storage
2. Using SQL and NoSQL
3. Calling Public APIs with the Requests package