# Data Management: SQL and NoSQL, MongoDB

### First, note that there are different types of data:
- structured
- unstructured
- semi-structured

#### unstructured data
From https://en.wikipedia.org/wiki/Unstructured_data:

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. 

- This means that unstructured data is hard to manage and maintain

#### structured data
Conversely, structured data is information that has a pre-defined data model and is organized in a pre-defined manner.

#### semi-structured data
From https://en.wikipedia.org/wiki/Semi-structured_data:

Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

- XML
- JSON
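
To make "self-describing" concrete, here is a small sketch using Python's standard `json` module. The records and the `origin` field are made up for illustration; the point is only that each record carries its own field names and the two records need not share a fixed schema.

```python
import json

# Two records that carry their own field names ("self-describing").
# The second record has an extra nested field; no fixed schema is required.
raw = """
[
  {"name": "Ford", "models": ["Fiesta", "Focus", "Mustang"]},
  {"name": "BMW", "models": ["320", "X3", "X5"], "origin": {"country": "Germany"}}
]
"""

records = json.loads(raw)

# The field names come from the data itself, not from a table definition.
field_names = [sorted(record.keys()) for record in records]
print(field_names)
```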

## Why do we need databases?

Usually we have four operations on our data: CRUD.
- CREATE
- READ
- UPDATE
- DELETE

With plain text files such as CSV, these operations are time-consuming; for example, checking for duplicates when creating a new item means scanning the whole file.

A database provides more powerful tools, such as indexes, which speed up searches.

A database is also more manageable. Merging the data of two databases is comparatively easy, whereas merging two text files by hand is hard.
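
As a rough illustration of what an index changes, the sketch below uses SQLite (bundled with Python as `sqlite3`) and asks it how it plans to run the same query before and after an index is created. The `items` table and the index name are invented for this example, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item{i}") for i in range(1000)],
)

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM items WHERE name = 'item500'"

# Without an index, SQLite has to scan every row of the table.
plan_without_index = plan(query)

# With an index on the searched column, it can look the value up directly.
conn.execute("CREATE INDEX idx_items_name ON items (name)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```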

## SQL/Relational Database

- MySQL
- SQLite
- PostgreSQL
- MariaDB
- Microsoft SQL Server
- Oracle

### Database Normalization

- reduce data redundancy
- improve data integrity

<img src='http://www.ecommerce-digest.com/images/normalization.jpg'>

### A better example:

| Customer | Address   | Order   | Product No.   |
|------|------|------|------|
|   Rick Evans  | A| 111 | 1 |
|   Rick Evans  | A| 111 | 2 |
|   Rick Evans  | A| 111 | 3 |
|   Michael Li  | B| 112 | 1 |
|   Michael Li  | B| 112 | 2 |


| Customer | Address   | Order   | 
|------|------|------|
|   Rick Evans  | A| 111 | 
|   Michael Li  | B| 112 | 

| Order | Product No.   |     
|------|------|
|   111  | 1 | 
|   111  | 2|  
|   111  | 3|
|112|1|
|112|2|

Splitting the table reduces data redundancy.<br/>
Before normalization, we need 20 table cells (5 rows × 4 columns) to hold the data.<br/>
After normalization, we only need 16 cells (2 × 3 + 5 × 2).
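
The same split can be tried out in SQLite. This is only a sketch: the table and column names (`orders`, `order_products`, `order_no`, `product_no`) are our own choice for the example. A JOIN rebuilds the original wide table, and the cell counts match the numbers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_no INTEGER PRIMARY KEY, customer TEXT, address TEXT);
CREATE TABLE order_products (order_no INTEGER, product_no INTEGER);
INSERT INTO orders VALUES (111, 'Rick Evans', 'A'), (112, 'Michael Li', 'B');
INSERT INTO order_products VALUES (111, 1), (111, 2), (111, 3), (112, 1), (112, 2);
""")

# Joining the two normalized tables reconstructs the original wide table.
rows = conn.execute("""
    SELECT o.customer, o.address, o.order_no, p.product_no
    FROM orders AS o
    JOIN order_products AS p ON p.order_no = o.order_no
    ORDER BY o.order_no, p.product_no
""").fetchall()

# Count the cells needed by each layout.
slots_before = len(rows) * 4
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
n_links = conn.execute("SELECT COUNT(*) FROM order_products").fetchone()[0]
slots_after = n_orders * 3 + n_links * 2

print(slots_before, slots_after)
```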

### Tables
SQL databases are usually made of multiple tables.<br/>
Each table can represent an entity, such as a person or a dog, and its columns hold that entity's attributes.<br/>
Tables can also hold constraints, such as a primary key or unique columns.




| Customer | Address   | Order   | Product No.   |
|------|------|------|------|
|   Rick Evans  | A| 111 | 1 |
|   Rick Evans  | A| 111 | 2 |
|   Rick Evans  | A| 111 | 3 |
|   Michael Li  | B| 112 | 1 |
|   Michael Li  | B| 112 | 2 |

### Primary Key
A very important concept. A primary key is like your ID number: every row has its own unique primary key, and the row is identified by it. <br/>
We can also add a constraint such as UNIQUE to other columns. Think of your email: each email account name needs to be unique, but it does not need to be the primary key.
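
Here is a minimal sketch of the difference, again with SQLite. The `people` table and its rows are hypothetical. Both a repeated primary key and a repeated UNIQUE value are rejected, but only `id` is the column that identifies the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id    INTEGER PRIMARY KEY,  -- identifies the row
        name  TEXT,
        email TEXT UNIQUE           -- must be unique, but is not the key
    )
""")
conn.execute("INSERT INTO people VALUES (1, 'Rick Evans', 'rick@example.com')")

def insert_error(row):
    """Try to insert a row; return the error message, or None on success."""
    try:
        conn.execute("INSERT INTO people VALUES (?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

same_id = insert_error((1, "Michael Li", "michael@example.com"))   # duplicate primary key
same_email = insert_error((2, "Michael Li", "rick@example.com"))   # duplicate email
ok = insert_error((2, "Michael Li", "michael@example.com"))        # both values are new

print(same_id, same_email, ok, sep="\n")
```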

### Basic SQL commands
Let's look at SQL syntax first. The commands below cover the READ part of CRUD.

- SELECT

    <pre>
    SELECT * FROM trips;
    SELECT id, fare_amount FROM trips;</pre>

- WHERE

    <pre>
    SELECT * 
    FROM trips 
    WHERE vendor_id = 1 AND fare_amount > 10.0;</pre>

- LIMIT

    <pre>
    SELECT * 
    FROM trips 
    WHERE vendor_id = 1 AND fare_amount > 10.0
    LIMIT 10;</pre>

- DISTINCT

    <pre>
    SELECT DISTINCT vendor_id FROM trips;</pre>

- COUNT

    <pre>
    SELECT COUNT(*) FROM trips;</pre>

- ORDER BY

    <pre>
    SELECT fare_amount, passenger_count 
    FROM trips 
    ORDER BY fare_amount DESC;</pre>

- JOIN

    <pre>
    SELECT t.fare_amount, tzp.borough 
    FROM trips AS t
    LEFT JOIN taxi_zones AS tzp ON tzp.locationid = t.pickup_location_id
    LIMIT 10</pre>

- Putting it all together

```
    SELECT 
        t.fare_amount
        ,t.trip_distance
        ,tzp.borough as pickup_borough
        ,tzd.borough as dropoff_borough
    FROM
        trips AS t
        LEFT JOIN taxi_zones AS tzp ON tzp.locationid = t.pickup_location_id
        LEFT JOIN taxi_zones AS tzd ON tzd.locationid = t.dropoff_location_id
    WHERE 
        t.fare_amount > 0 
        AND tzp.borough = 'Manhattan'
    ORDER BY
        t.trip_distance DESC
    LIMIT 10
```
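
To try the combined query without a database server, it can be run against an in-memory SQLite database. The schema and the four sample trips below are assumptions made up for this sketch (the real `trips` and `taxi_zones` tables will have more columns); only the query itself follows the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxi_zones (locationid INTEGER PRIMARY KEY, borough TEXT);
CREATE TABLE trips (
    id INTEGER PRIMARY KEY, vendor_id INTEGER, fare_amount REAL,
    trip_distance REAL, pickup_location_id INTEGER, dropoff_location_id INTEGER
);
INSERT INTO taxi_zones VALUES (1, 'Manhattan'), (2, 'Brooklyn');
INSERT INTO trips VALUES
    (1, 1, 12.5, 3.0, 1, 2),
    (2, 2,  8.0, 1.2, 2, 1),
    (3, 1, 30.0, 9.5, 1, 1),
    (4, 1,  0.0, 0.0, 1, 2);
""")

# taxi_zones is joined twice: once for the pickup zone, once for the dropoff zone.
query = """
    SELECT t.fare_amount, t.trip_distance,
           tzp.borough AS pickup_borough, tzd.borough AS dropoff_borough
    FROM trips AS t
    LEFT JOIN taxi_zones AS tzp ON tzp.locationid = t.pickup_location_id
    LEFT JOIN taxi_zones AS tzd ON tzd.locationid = t.dropoff_location_id
    WHERE t.fare_amount > 0 AND tzp.borough = 'Manhattan'
    ORDER BY t.trip_distance DESC
    LIMIT 10
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```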

## NoSQL
"Not Only SQL"

- key-value (Redis, Berkeley DB, DynamoDB)
- document store (MongoDB, DocumentDB)
- wide column (Cassandra, HBase)
- graph (Neo4j, Giraph)

key-value store<br/>
Often used as a cache. It stores values under keys, the way a Python dictionary does.
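
The dictionary analogy can be sketched directly. The helper names `cache_set` and `cache_get` and the keys are hypothetical; a real key-value store such as Redis adds persistence, expiry and network access on top of the same get/set idea.

```python
# A dictionary as a stand-in for a key-value store: values are looked up
# only by key, and the store does not inspect what is inside a value.
cache = {}

def cache_set(key, value):
    cache[key] = value

def cache_get(key, default=None):
    # A missing key is a "cache miss" and returns the default.
    return cache.get(key, default)

cache_set("user:42:name", "Amy")
cache_set("page:/home", "<html>...</html>")

hit = cache_get("user:42:name")
miss = cache_get("user:99:name")
print(hit, miss)
```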

wide column store
<img src='https://studio3t.com/wp-content/uploads/2017/12/cassandra-column-family.png'>


<img src='https://database.guide/wp-content/uploads/2016/06/wide_column_store_database_example_column_family-1.png'>


## Document store
Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems, conceptually the document-store is designed to offer a richer experience with modern programming techniques.

Document databases contrast strongly with the traditional relational database (RDB). Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This eliminates the need for object-relational mapping while loading data into the database.

### MongoDB <br/>The most popular document database

Installation:<br/>
- download mongodb from https://www.mongodb.com/download-center/community
- download NoSQLBooster from https://nosqlbooster.com/downloads
        

After the installation of NoSQLBooster
- Start it.
- Click 'Create'
- Click 'Test Connection'
- Click 'Save and Connect'

### Basic concepts in MongoDB

- db

- collection <br/>
The same idea as a SQL table

- find(filter,projection)
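
To see what the two arguments of `find` mean, here is a plain-Python imitation that needs no MongoDB server. It is not how MongoDB is implemented, and it only handles exact-match filters and inclusion projections; the function and sample documents are our own. The filter decides which documents come back, and the projection decides which fields of each document come back.

```python
docs = [
    {"_id": 1, "name": "Ford", "models": ["Fiesta", "Focus", "Mustang"]},
    {"_id": 2, "name": "BMW", "models": ["320", "X3", "X5"]},
]

def find(documents, query=None, projection=None):
    """Plain-Python imitation of find(): equality filter, inclusion projection."""
    query = query or {}
    for doc in documents:
        # Keep a document only if every filter field matches exactly.
        if all(doc.get(field) == value for field, value in query.items()):
            if projection is None:
                yield dict(doc)
            else:
                # Return only the requested fields (plus _id, which MongoDB
                # also includes by default).
                keep = {"_id"} | {field for field, on in projection.items() if on}
                yield {k: v for k, v in doc.items() if k in keep}

result = list(find(docs, {"name": "Ford"}, {"models": 1}))
print(result)
```

With pymongo, the corresponding call on a collection would be `coll.find({"name": "Ford"}, {"models": 1})`.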

In [31]:
import pymongo

In [37]:
# start up our client; it defaults to the local machine on the default port 27017
# equivalent to:
# client = pymongo.MongoClient('localhost', 27017)
client = pymongo.MongoClient()


In [38]:
# get a connection to a database
db = client.mydb

In [39]:
coll = db.mytable

In [40]:
# get one record
coll.find_one()

{'_id': ObjectId('5cb4a42472673b1a4812daf2'),
 'name': 'Ford',
 'models': ['Fiesta', 'Focus', 'Mustang']}

### Database/Collection CREATE


In [41]:
newdb = client['new_db']

In [42]:
newtb = newdb["new_table"]

In [43]:
newtb.insert_one({"firstName":"Jack","lastName":"Who"})

<pymongo.results.InsertOneResult at 0x1f2d9014e08>

### CREATE/FIND

In [11]:
a = {"name":"michael"}
post_id = coll.insert_one(a).inserted_id
post_id

ObjectId('5cb4999d62e26918b8a730c5')

In [12]:
# documents in the same collection do not need to share a schema
a = {"hello":"world"}
post_id = coll.insert_one(a).inserted_id
post_id

ObjectId('5cb4999e62e26918b8a730c6')

### What is ObjectID?

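The `_id` field is the 'primary key' in MongoDB, and an ObjectId is its default value: if we do not supply an `_id`, one is generated for us. Each document/record has a distinct ObjectId, so we know exactly which record we are dealing with.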

In [13]:
mylist = [
  { "name": "Amy", "hobby": "baseball"},
  { "name": "Amy", "hobby": "baseball"},
  { "name": "Amy", "hobby": "baseball"},

]
x = coll.insert_many(mylist)
print(x.inserted_ids)


[ObjectId('5cb499a062e26918b8a730c7'), ObjectId('5cb499a062e26918b8a730c8'), ObjectId('5cb499a062e26918b8a730c9')]


This means the database handles uniqueness for us: even identical documents each get their own `_id`, so we can focus on our work.<br/> In SQL, inserting two rows with the same primary key raises an error and forces us to change one of them.

### READ/FIND

In [14]:
#find one document for us
res=coll.find_one({"name": "Amy"})
res

{'_id': ObjectId('5cb499a062e26918b8a730c7'),
 'name': 'Amy',
 'hobby': 'baseball'}

In [15]:
#find by ObjectID
coll.find_one({"_id": res['_id']})

{'_id': ObjectId('5cb499a062e26918b8a730c7'),
 'name': 'Amy',
 'hobby': 'baseball'}

In [16]:
#find many, it will give us a cursor
res=coll.find({"name": "Amy"})
res

<pymongo.cursor.Cursor at 0x1f2d8fb6208>

In [17]:
for record in res:
    print(record)

{'_id': ObjectId('5cb499a062e26918b8a730c7'), 'name': 'Amy', 'hobby': 'baseball'}
{'_id': ObjectId('5cb499a062e26918b8a730c8'), 'name': 'Amy', 'hobby': 'baseball'}
{'_id': ObjectId('5cb499a062e26918b8a730c9'), 'name': 'Amy', 'hobby': 'baseball'}


In [18]:
#read all documents 
for document in coll.find():
    print(document)

{'_id': ObjectId('5cb4959e6e19912ad0394328'), 'name': 'Ford', 'models': ['Fiesta', 'Focus', 'Mustang']}
{'_id': ObjectId('5cb4959e6e19912ad0394329'), 'name': 'BMW', 'models': ['320', 'X3', 'X5']}
{'_id': ObjectId('5cb4959e6e19912ad039432a'), 'nickname': 'Fiat', 'models': ['500', 'Panda'], 'hobby': 'basketball'}
{'_id': ObjectId('5cb4999d62e26918b8a730c5'), 'name': 'michael'}
{'_id': ObjectId('5cb4999e62e26918b8a730c6'), 'hello': 'world'}
{'_id': ObjectId('5cb499a062e26918b8a730c7'), 'name': 'Amy', 'hobby': 'baseball'}
{'_id': ObjectId('5cb499a062e26918b8a730c8'), 'name': 'Amy', 'hobby': 'baseball'}
{'_id': ObjectId('5cb499a062e26918b8a730c9'), 'name': 'Amy', 'hobby': 'baseball'}


### UPDATE

In [19]:
# with upsert=True, a new document would be inserted if no document matched the filter
result = coll.update_one({'name':"Amy"}, {"$set": {"hobby":"Basketball"}}, upsert=False)
result.matched_count

1

In [20]:
# update_many applies the same update to every matching document
result = coll.update_many({'name':"Amy"}, {"$set": {"hobby":"Basketball"}}, upsert=False)
result.matched_count

3

`$set` is an update operator: it sets the listed fields to the given values and leaves the document's other fields unchanged.
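
The effect of `$set` on a single document can be imitated in plain Python. This is only a sketch of the semantics for top-level fields, not MongoDB's implementation, and `apply_set` is a name we made up.

```python
def apply_set(document, update):
    """Return a copy of document with the fields under "$set" overwritten or added."""
    updated = dict(document)
    updated.update(update["$set"])
    return updated

amy = {"_id": 1, "name": "Amy", "hobby": "baseball"}
after = apply_set(amy, {"$set": {"hobby": "Basketball"}})

# Only "hobby" changed; "_id" and "name" are untouched.
print(after)
```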

### DELETE

In [21]:
x = coll.delete_one({"name":"Amy"})
x.deleted_count

1

In [22]:
x = coll.delete_many({"name":"Amy"})
x.deleted_count

2

In [23]:
x = coll.delete_many()
x.deleted_count

TypeError: delete_many() missing 1 required positional argument: 'filter'

`delete_many()` requires a filter argument. To delete every document in a collection, pass an empty filter: `coll.delete_many({})`.

### Count

In [24]:
coll.count_documents({})

5

In [25]:
coll.count_documents({"hello": "world"})

1

### Range query

In [26]:
# named "people" rather than "list" so the built-in list type is not shadowed
people = [
    {"name":"Jason","age":20},
    {"name":"Jane", "age":21},
    {"hobby":"Swimming", "age":25},
    {"name":"Jack","age":30}
]

In [27]:
x=coll.insert_many(people)
x.inserted_ids

[ObjectId('5cb499ad62e26918b8a730ca'),
 ObjectId('5cb499ad62e26918b8a730cb'),
 ObjectId('5cb499ad62e26918b8a730cc'),
 ObjectId('5cb499ad62e26918b8a730cd')]

In [28]:
for post in coll.find({"age": {"$lt": 26}}).sort("age"):
    print(post)

{'_id': ObjectId('5cb499ad62e26918b8a730ca'), 'name': 'Jason', 'age': 20}
{'_id': ObjectId('5cb499ad62e26918b8a730cb'), 'name': 'Jane', 'age': 21}
{'_id': ObjectId('5cb499ad62e26918b8a730cc'), 'hobby': 'Swimming', 'age': 25}


In [29]:
for post in coll.find({"age": {"$gt": 26}}).sort("age"):
    print(post)

{'_id': ObjectId('5cb499ad62e26918b8a730cd'), 'name': 'Jack', 'age': 30}


In [30]:
for post in coll.find({"age": {"$eq": 25}}).sort("age"):
    print(post)

{'_id': ObjectId('5cb499ad62e26918b8a730cc'), 'hobby': 'Swimming', 'age': 25}


See the full list of operators:
    https://docs.mongodb.com/manual/reference/operator/query/
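
For intuition, the three comparison operators used above can be imitated in plain Python. This is a sketch of their meaning only, covering just `$lt`, `$gt` and `$eq` on top-level fields; `matches` and `OPS` are names invented for the example.

```python
import operator

# Map each query operator to the Python comparison it stands for.
OPS = {"$lt": operator.lt, "$gt": operator.gt, "$eq": operator.eq}

def matches(doc, query):
    """True if doc satisfies every {field: {operator: bound}} condition."""
    for field, condition in query.items():
        if field not in doc:
            return False
        for op_name, bound in condition.items():
            if not OPS[op_name](doc[field], bound):
                return False
    return True

people = [
    {"name": "Jason", "age": 20},
    {"name": "Jane", "age": 21},
    {"hobby": "Swimming", "age": 25},
    {"name": "Jack", "age": 30},
]

# Same idea as coll.find({"age": {"$lt": 26}}).sort("age")
under_26 = sorted(
    (p for p in people if matches(p, {"age": {"$lt": 26}})),
    key=lambda p: p["age"],
)
print([p["age"] for p in under_26])
```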

Reference:

- Bryan Gibson, COMS W4995 007 2018 3, Elements for Data Science, slides for Week 11: Recommendation Engines (cont.), Data Management, Webscraping and Review
- https://en.wikipedia.org/wiki/Document-oriented_database
- https://www.w3schools.com/python/python_mongodb_insert.asp
- http://api.mongodb.com/python/current/tutorial.html