## Learning Objectives

- How to store a large dataset on S3 and read it with Boto

- Practice this by uploading the churn dataset to S3, then training a Keras deep learning model on `Churn_Modelling.csv`

S3 + Boto setup:
- Install the AWS CLI: `pip install awscli` (`!pip install awscli` on Google Colab)
- Run `aws configure` (`!aws configure` on Google Colab) and answer the prompts:
  - AWS Access Key ID [None]: ...
  - AWS Secret Access Key [None]: ...
  - Default region name [None]: ...
  - Default output format [None]: ...

In [2]:
import pandas as pd
import boto3

bucket = "makeschoolds"
file_name = "data/Churn_Modelling.csv"

s3 = boto3.client('s3')
# 's3' is the service name; the client uses the default credentials and region set by `aws configure`

obj = s3.get_object(Bucket=bucket, Key=file_name)
# fetch the object stored under the key `file_name` in `bucket`

df = pd.read_csv(obj['Body'])  # 'Body' is the response key that holds the file contents as a stream
print(df.head())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00              1          1               1   
1       1   83807.86              1          0               1   
2       8  159660.80              3          1               0   
3       1       0.00              2          0               0   
4       2  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         790
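
The objectives mention uploading the churn dataset to S3, but the cell above only shows the read. A minimal sketch of the upload step, assuming the same bucket and key as above and a local copy of `Churn_Modelling.csv`. `upload_csv` is a hypothetical helper name; `upload_file(Filename, Bucket, Key)` is the boto3 S3 client method it wraps.

```python
def upload_csv(s3_client, local_path, bucket, key):
    """Upload a local file to S3 and return its s3:// URI.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    # upload_file reads the file from disk and switches to multipart uploads for large files
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"


# Usage (assumes credentials were set with `aws configure` and you can write to the bucket):
# import boto3
# print(upload_csv(boto3.client('s3'), "Churn_Modelling.csv",
#                  "makeschoolds", "data/Churn_Modelling.csv"))
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real bucket.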

# Churn Prediction

- Let's first read: https://medium.com/@pushkarmandot/build-your-first-deep-learning-neural-network-model-using-keras-in-python-a90b5864116d

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix


print(df.head())

X = df.iloc[:, 3:13].values
y = df.iloc[:, 13].values

print(X)
print(X.shape)
print(y)

label_encoder_X_1 = LabelEncoder()
X[:, 1] = label_encoder_X_1.fit_transform(X[:, 1])
label_encoder_X_2 = LabelEncoder()
X[:, 2] = label_encoder_X_2.fit_transform(X[:, 2])
print(X)
print(X.shape)

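# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22.
# ColumnTransformer one-hot encodes column 1 (Geography) and passes the other columns through.
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer([('geography', OneHotEncoder(), [1])],
                                       remainder='passthrough', sparse_threshold=0)
X = column_transformer.fit_transform(X).astype(float)
X = X[:, 1:]  # drop the first dummy column to avoid the dummy variable trap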
# print('M:')
# print(X[:, :10])
# print(X[:, 10])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape)

classifier = Sequential()
# Adding the input layer and the first hidden layer
# (Keras 2+ names: units, kernel_initializer and epochs replace output_dim, init and nb_epoch)
classifier.add(Dense(units=6, kernel_initializer='random_uniform', activation='relu', input_dim=11))
# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer='random_uniform', activation='relu'))
# Adding the output layer
classifier.add(Dense(units=1, kernel_initializer='random_uniform', activation='sigmoid'))
# Compiling Neural Network
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Fitting our model
classifier.fit(X_train, y_train, batch_size=10, epochs=50, verbose=1)
# Predicting the Test set results
y_predict = classifier.predict(X_test)
print(y_predict)
y_predict = (y_predict > 0.5)
cm = confusion_matrix(y_test, y_predict)
print(cm)
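
The confusion matrix printed above is easier to interpret once it is reduced to a few summary numbers. A minimal sketch, assuming scikit-learn's layout (rows are true labels, columns are predictions); `summarize_confusion_matrix` is a hypothetical helper and the example counts are made up, not results from this notebook.

```python
def summarize_confusion_matrix(cm):
    """Return accuracy, precision and recall for a 2x2 confusion matrix.

    Follows scikit-learn's layout: cm[0][0] = TN, cm[0][1] = FP,
    cm[1][0] = FN, cm[1][1] = TP. Works with nested lists or NumPy arrays.
    """
    tn, fp = cm[0][0], cm[0][1]
    fn, tp = cm[1][0], cm[1][1]
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up counts for a 2000-row test set, for illustration only
example_cm = [[1500, 100], [200, 200]]
print(summarize_confusion_matrix(example_cm))
```

For churn data, where customers who exit are the minority class, recall on the positive class is often more informative than accuracy alone.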

## SQL

In [3]:
import sqlite3 as lite

con = lite.connect('population.db')

with con:
    cur = con.cursor()
    cur.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Germany',81197537)")
    cur.execute("INSERT INTO Population VALUES(NULL,'France', 66415161)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Italy', 60795612)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")

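OperationalError: table Population already exists

This error is raised when the cell is run a second time: `population.db` already contains the `Population` table from the first run. Delete `population.db` or drop the table before re-running the cell.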
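
One way to make the table setup safe to re-run is to drop the table first and insert the rows with placeholders. This is a sketch under assumptions: it uses an in-memory database instead of `population.db`, the helper name `build_population_table` is hypothetical, and the row list leaves out the duplicate Spain row from the cell above.

```python
import sqlite3

rows = [
    ("Germany", 81197537),
    ("France", 66415161),
    ("Spain", 46439864),
    ("Italy", 60795612),
]


def build_population_table(con, rows):
    """(Re)create the Population table and insert (country, population) rows."""
    with con:  # commits on success, rolls back if a statement fails
        con.execute("DROP TABLE IF EXISTS Population")
        con.execute(
            "CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)"
        )
        # '?' placeholders let sqlite3 bind the values instead of formatting them into the SQL
        con.executemany(
            "INSERT INTO Population(country, population) VALUES(?, ?)", rows
        )


con = sqlite3.connect(":memory:")  # in-memory database, so no file is left behind
build_population_table(con, rows)
build_population_table(con, rows)  # a second run does not raise OperationalError
print(con.execute("SELECT COUNT(*) FROM Population").fetchone()[0])
```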

### Write a SQL query in Python that returns all records from the Population table in the population database where the population field is greater than or equal to 50M

In [4]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('population.db')
query = "SELECT country FROM Population WHERE population >= 50000000;"

df = pd.read_sql_query(query, conn)

for country in df['country']:
    print(country)

Germany
France
Italy


In [6]:
query_1 = "SELECT country FROM Population WHERE country LIKE 'S%'"     # query countries that begin with S

df = pd.read_sql_query(query_1, conn)

for country in df['country']:
    print(country)

Spain
Spain
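
Both queries above hard-code their values in the SQL string. A hedged sketch of the same two lookups with `?` placeholders, so the threshold and prefix can vary; it rebuilds the table in an in-memory database as a stand-in for `population.db`, and the function names are hypothetical. `DISTINCT` is added here so that the duplicate Spain row is reported once.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for population.db
con.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
con.executemany(
    "INSERT INTO Population(country, population) VALUES(?, ?)",
    [("Germany", 81197537), ("France", 66415161), ("Spain", 46439864),
     ("Italy", 60795612), ("Spain", 46439864)],
)


def countries_at_least(con, min_population):
    """Countries whose population is greater than or equal to min_population."""
    cur = con.execute(
        "SELECT DISTINCT country FROM Population WHERE population >= ? ORDER BY country",
        (min_population,),
    )
    return [row[0] for row in cur]


def countries_starting_with(con, prefix):
    """Countries whose name starts with prefix (the % wildcard is appended here)."""
    cur = con.execute(
        "SELECT DISTINCT country FROM Population WHERE country LIKE ? ORDER BY country",
        (prefix + "%",),
    )
    return [row[0] for row in cur]


print(countries_at_least(con, 50_000_000))   # ['France', 'Germany', 'Italy']
print(countries_starting_with(con, "S"))     # ['Spain']
```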


## Set up MongoDB, then insert and query documents in Python

Read: https://marcobonzanini.com/2015/09/07/getting-started-with-mongodb-and-python/

In [None]:
from pymongo import MongoClient
from datetime import datetime

client = MongoClient()

db = client['tutorial']
coll = db['articles']

doc = {
    "title": "An article about MongoDB and Python",
    "author": "Marco",
    "publication_date": datetime.utcnow(),
    # more fields
}

doc_id = coll.insert_one(doc).inserted_id

In [None]:
from pymongo import MongoClient


client = MongoClient()

db = client['tutorial']
coll = db['articles']

for doc in coll.find():
    print(doc)
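
The loop above calls `find()` with no arguments, which returns every document. A minimal sketch of a filtered query, assuming documents shaped like the one inserted earlier; `titles_by_author` is a hypothetical helper, and `coll` is expected to be a pymongo collection.

```python
def titles_by_author(coll, author):
    """Return the titles of all documents whose "author" field equals `author`.

    `coll` is expected to be a pymongo Collection such as client['tutorial']['articles'].
    """
    # The first argument to find() is a filter document; field/value pairs must all match
    cursor = coll.find({"author": author})
    return [doc.get("title") for doc in cursor]


# Usage against the collection created above (needs a running mongod):
# from pymongo import MongoClient
# coll = MongoClient()['tutorial']['articles']
# print(titles_by_author(coll, "Marco"))
```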

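### Shell commands to start a local MongoDB server (run these before the Python cells above):

- `sudo mkdir -p /data/db` creates the default MongoDB data directory.

- `whoami` prints your username.

- `sudo chown miladtoutounchian /data/db` gives that user ownership of the data directory (replace `miladtoutounchian` with the output of `whoami`).

- `./bin/mongod` starts the MongoDB server; run it from the folder where MongoDB was extracted.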


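### Call the GitHub API in Python

To do this, use the following hints:

1. `import requests`

2. `url = 'https://api.github.com/search/repositories?q=tensorflow'`

Let's print the top 10 repositories for TensorFlow from the GitHub API. The cell below prints only `total_count`, the number of matching repositories; the repositories themselves are in the `items` list of the response.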

In [7]:
import requests

In [9]:
url = 'https://api.github.com/search/repositories?q=tensorflow'
print(requests.get(url).json()['total_count'])

91022
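
The cell above stops at `total_count`. A hedged sketch of listing the top 10 repositories from the same search response: `top_repositories` is a hypothetical helper, and it relies on each entry in the response's `items` list carrying `full_name` and `stargazers_count` fields.

```python
def top_repositories(payload, n=10):
    """Return (full_name, stargazers_count) for the first `n` items of a search response."""
    items = payload.get("items", [])
    return [(repo["full_name"], repo["stargazers_count"]) for repo in items[:n]]


# Usage (performs a network request; unauthenticated calls are rate limited):
# import requests
# url = 'https://api.github.com/search/repositories'
# params = {'q': 'tensorflow', 'sort': 'stars', 'order': 'desc', 'per_page': 10}
# for name, stars in top_repositories(requests.get(url, params=params).json()):
#     print(name, stars)
```

Passing `sort` and `order` asks the API to rank by stars; without them the search endpoint returns results ordered by its own best-match score.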
