# Assignment 2 helper 

As part of assignment 2, you will be going through a guided process of creating a workflow by answering some of the questions with our tweets data. 

This assignment is a mini version of the workflow you will produce for your final project. In the final project, you will also need to include other useful concepts, such as indexing and warehousing solutions, wherever appropriate. In lecture 6, we will see how indexing and warehousing solutions apply to our assignment dataset.

Basic knowledge of SQL and Python is necessary for these assignments. This document is written for students who need additional support with the SQL used in this assignment. You will mainly be using SQL commands that you are already familiar with. One reason some students feel overwhelmed by this assignment is probably that it is the first time they use SQL for real in a workflow, which is why the questions are not as straightforward as in a regular assignment.

Colby will give another session on the basic Python you need during his office hours.

The idea of working in the cloud is that you don't have to worry about installing anything on your laptop, and you can spin up a database and bring it down whenever you want. You all got 100 credits, so make sure you use them; it's okay to spin up a bigger instance if you want a faster experience. However, remember that bigger instances cost more, so shut the instance down when you are not using it.

It's your choice whether to spin up a different database instance for this assignment or reuse the previous one. However, I strongly suggest you use a separate instance for your project.

The process of setting it up will only take a minute from your end, and from the AWS end, it might take around 10 minutes for it to be available. [I am putting a small video on doing it.](https://ubc.zoom.us/rec/play/a6qhchUCkFCSs-7aT1zeMDWG0Gsu2RK1Gj66NZpNhXvWPqxQBtQ353cCdeP9kMy9CQXFvlaOTxGYTjfv.t7YkUbcpCSYe1MOL?startTime=1642465726000)

First, we want to load the data into the database. Why are we loading a dump again? Last time, we put the raw data into our database. After that, I did a lot of data cleaning to normalize it for our analysis. [You can check this notebook](https://canvas.ubc.ca/files/18519982/download?download_frd=1) to see how I cleaned it. You don't need to run this notebook for your assignment; it is here only for reference, though it may help when you work on group assignments.

I loaded the cleaned data into the tweets schema and took a dump of that schema. Next, we want to load this dump into our database.

Hopefully, you are all comfortable uploading dumps from our previous communications. In any case, [I created this video showing how to upload a dump](https://ubc.zoom.us/rec/share/MzaxQiS4FmDqw2JUipi6PB6y3llcy5Y7-wAtTqfFIZhzdJfQ54V9qK4fv18nH8wD.Q9iSG0syxrBgTuuF?startTime=1642468195000); it will only take ~3 minutes from your end.

Later you can check your schema and tables in your pgadmin. [Check out this small video.](https://ubc.zoom.us/rec/play/qDHX9a2cmv1Sc0NTrCiA4YauF2NXS_E162eTqom12yJQbEdPvzq2D4_6RYI8d7IcBuTiECaUrDRUSTA4.xk4_nplwVjqguXI4?startTime=1642468736000) 

Check [in slack](https://bait-580.slack.com/archives/C02SU3M63S5/p1642528551003000) for the passcode for the above videos.

If you want any help with creating the .env file, you [can check out the first few minutes of this office hour](https://ubc.zoom.us/rec/share/DovyeJvun1E3Sb06bXAuQS_GZbwlTbJktsUcqZxHVs4ekFuCV8cfEzpm4O86l38.D3b10SWO9QllKx_y), where we set it up on both Windows and Mac.

Check out the course channel for access codes for the above videos.

```
# Put your code here.  More than likely you can copy/paste right in.
# Make sure you have your .env file in the same folder as this project file.

import os
import psycopg2
from dotenv import load_dotenv
import numpy
import sparklines
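import pandas as pd

# Read the variables from the .env file into the environment;
# without this call, os.environ.get() below returns None.
load_dotenv()

conString = {'host':os.environ.get('DB_HOST'),
             'dbname':os.environ.get('DB_NAME'),
             'user':os.environ.get('DB_USER'),
             'password':os.environ.get('DB_PASS'),
             'port':os.environ.get('DB_PORT')}
# Print the settings to check that they loaded, but keep the password hidden.
print({**conString, 'password': '***'})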

conn = psycopg2.connect(**conString)
cur = conn.cursor()
```
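If the connection fails, a common cause is a variable missing from the `.env` file. The sketch below assumes the file defines the five keys read in the cell above (`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASS`, `DB_PORT`). The helper name `missing_settings` and the sample values are made up for illustration and are not part of the assignment.

```python
import os

# Keys that the connection cell reads from the environment.
REQUIRED_KEYS = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASS", "DB_PORT"]

def missing_settings(env=None):
    """Return the required keys that are unset or empty in env (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Example with a hand-made dictionary instead of the real environment.
example_env = {"DB_HOST": "localhost", "DB_NAME": "postgres", "DB_PORT": "5432"}
missing = missing_settings(example_env)
print(missing)  # ['DB_USER', 'DB_PASS']
```

Calling `missing_settings()` with no argument checks the real environment, so you could run it right after `load_dotenv()`.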

The SQL commands you need for this assignment include some fundamental ones:

- SELECT
- LIMIT
- DISTINCT
- COUNT
- INNER JOIN
- GROUP BY
- ORDER BY
- WHERE

Some commands or built-in functions that are less common include:
- string functions: LTRIM and regexp_matches
- TO_DATE
- WITH and IN combination queries
- date_trunc

For some of these less common commands, I put instructions and references in the assignment so you can look them up. Here, I will go through all of them and explain why we generally use them.

```
## creating a dummy table 
query = """DROP TABLE IF EXISTS import.example;
CREATE TABLE import.example (id text, date text, sentence text);
INSERT INTO import.example(id, date,sentence) VALUES('111','Thu May 18 22:00:00 +0000 2017','This is a test sentence $NEW $OLD $small');
INSERT INTO import.example(id, date,sentence) VALUES('222','Fri Jun 20 09:00:00 +0000 2017','Testing sentence $MBAN');
INSERT INTO import.example(id, date,sentence) VALUES('333','Mon Jul 18 21:00:00 +0000 2017','Sentence $XXX');
INSERT INTO import.example(id, date,sentence) VALUES('444','Mon Jul 18 14:00:00 +0000 2017','Sentence different type $XXX');
INSERT INTO import.example(id, date,sentence) VALUES('555','Thu May 18 14:00:00 +0000 2017','I am tweeting again $OLD');
INSERT INTO import.example(id, date,sentence) VALUES('666','Fri Jun 20 16:00:00 +0000 2017','Sentence different type $MBAN');
"""
cur.execute(query)
conn.commit()
```

## regexp_matches

regexp_matches is helpful when we want to extract some valuable text from a text column. In this dummy example, that is whatever follows the `$`. You can check the official docs linked in the assignment, but you don't need all of that; you can probably just check out [this](https://www.postgresqltutorial.com/postgresql-regexp_matches/), as it explains exactly what we want. So please don't worry too much about regex.

The basic syntax is `REGEXP_MATCHES(source_string, pattern [, flags])`

- source_string: here, the value of the text column
- pattern: the regex pattern
- flags: optional, but we will pass the flag 'g', which searches globally and returns every occurrence instead of only the first.

Regex patterns are a study area of their own and can be very useful in many situations. You can use this [regex generator](https://regex-generator.olafneumann.org/?sampleText=2020-03-12T13%3A34%3A56.123Z%20INFO%20%20%5Borg.example.Class%5D%3A%20This%20is%20a%20%23simple%20%23logline%20containing%20a%20%27value%27.&flags=i&onlyPatterns=false&matchWholeLine=false&selection=) to come up with your pattern.

Here is how it behaves in our dummy example:

```
# raw string, so Python keeps the backslash in '\$' exactly as written
query=r"""SELECT eg.id, 
  regexp_matches(eg.sentence, '\$[A-Z]+', 'g')
FROM import.example AS eg;"""
cur.execute(query)
cur.fetchall()
```
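To get a feel for what the pattern `\$[A-Z]+` picks up without touching the database, here is a small sketch using Python's `re` module on two of the dummy sentences. Python's regex engine is not the one PostgreSQL uses, but the two should agree on a pattern this simple. Note that PostgreSQL returns one row per match, each holding a one-element array, whereas this sketch collects one list per sentence.

```python
import re

# Two of the dummy sentences from import.example, keyed by id.
sentences = {
    "111": "This is a test sentence $NEW $OLD $small",
    "222": "Testing sentence $MBAN",
}

# A literal '$' followed by one or more uppercase letters.
pattern = r"\$[A-Z]+"

matches = {id_: re.findall(pattern, text) for id_, text in sentences.items()}
print(matches)  # {'111': ['$NEW', '$OLD'], '222': ['$MBAN']}
```

`$small` is skipped because the pattern only allows uppercase letters after the `$`.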

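## Why we use unnest
`unnest` expands the array returned by `regexp_matches` into rows, so each match becomes a plain text value instead of a one-element array.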

```
cur = conn.cursor()
query=r"""SELECT eg.id, 
  unnest(regexp_matches(eg.sentence, '\$[A-Z]+', 'g')) AS substring
FROM import.example AS eg;"""
cur.execute(query)
cur.fetchall()
```
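The effect of `unnest` can be sketched in plain Python. The rows below are hand-written to mimic what psycopg2 hands back for the `regexp_matches` query (PostgreSQL arrays arrive as Python lists); they are not real query output.

```python
# Hand-written stand-in for the rows before unnest:
# every match sits inside a one-element list.
rows_with_arrays = [("111", ["$NEW"]), ("111", ["$OLD"]), ("222", ["$MBAN"])]

# Flatten: one output row per array element, as unnest does.
flat_rows = [(id_, value) for id_, values in rows_with_arrays for value in values]
print(flat_rows)  # [('111', '$NEW'), ('111', '$OLD'), ('222', '$MBAN')]
```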

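## LTRIM
Syntax:
    LTRIM(string,trimming_text)
It removes from the start of the string every leading character that appears in trimming_text (here, the `$`). There are also RTRIM and BTRIM, which trim the end and both ends of the string.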

```
conn.rollback()
cur = conn.cursor()
query=r"""SELECT eg.id, 
  LTRIM(unnest(regexp_matches(eg.sentence, '\$[A-Z]+', 'g')),'$') AS substring
FROM import.example AS eg LIMIT 10;"""
cur.execute(query)
cur.fetchall()
```
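Python's `str.lstrip` follows the same rule as LTRIM: the second argument is a set of characters to strip from the start, not a fixed prefix. The sketch below uses made-up sample strings to show this.

```python
# Stripping a leading '$' from some sample tags (made-up values).
tags = ["$NEW", "$$OLD", "MBAN"]
trimmed = [tag.lstrip("$") for tag in tags]
print(trimmed)  # ['NEW', 'OLD', 'MBAN']

# The argument is a character set: leading x, y, and z are all removed,
# in any order, until a different character is reached.
set_example = "zzzytest".lstrip("xyz")
print(set_example)  # test
```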

## Constructing a dataframe from query returned.
As we mentioned in our 3rd lecture, psycopg2 returns the data as a list of tuples. The best way to work with these tuples is to convert them into a pandas DataFrame. After that, you are in the Python world and can do any further transformation before getting to the visualization. You might need to do this for Question 5a.

```
# re-run the last query and keep the rows it returns
cur.execute(query)
stocktweets = cur.fetchall()
```

```
# as you can see, this is a list of tuples
print(stocktweets)
```

```
# rebuild the list of tuples as a dataframe
stockdf = pd.DataFrame(stocktweets, columns=['id', 'substring'])
```

```
stockdf
```
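Instead of typing the column names by hand, you can read them from the cursor: after a query, `cur.description` holds one entry per column, with the name in position 0. The sketch below uses an in-memory SQLite database as a stand-in so that it runs without the course database. The table `demo` and its rows are made up; psycopg2 cursors expose `description` in the same way.

```python
import sqlite3

# In-memory SQLite stand-in for the course database (illustrative only).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE demo (id TEXT, substring TEXT)")
demo_cur.executemany("INSERT INTO demo VALUES (?, ?)", [("111", "NEW"), ("222", "MBAN")])

demo_cur.execute("SELECT id, substring FROM demo ORDER BY id")
rows = demo_cur.fetchall()

# Column names come from the cursor, so they always match the query.
columns = [col[0] for col in demo_cur.description]
records = [dict(zip(columns, row)) for row in rows]
print(columns)  # ['id', 'substring']
print(records)
```

With pandas, passing the same list as `pd.DataFrame(rows, columns=columns)` should give a dataframe with those headers.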

Let's move on to some data transformations. As mentioned in lecture 4, we need to be very careful when dealing with dates, as there is a wide variety of date formats out there, and we need to make sure that they are aligned across all the datasets. I have already converted the Twitter table to the clean tweets table, and you will be using that from question 3b onwards, but to get a feel for the conversion, we will use 3a to convert a column to the date type.

TO_DATE - You can use this to convert text to a date. You specify the format your text is in, and it returns a value of the date type. Why convert to the date type? Because once a column is a date, we can apply a variety of [date time operations](https://www.postgresql.org/docs/9.1/functions-datetime.html).

```
query = """SELECT TO_DATE(date, 'Dy Mon DD HH24:MI:SS +0000 YYYY') AS date
           FROM import.example AS tw"""
cur.execute(query)
cur.fetchall()
```
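The same idea exists on the Python side: you describe the layout of the text and get a proper date value back. This sketch parses one of the dummy date strings with `datetime.strptime`. The format codes are Python's, not PostgreSQL's, and the day and month names are matched according to the current locale, so it assumes an English locale.

```python
from datetime import datetime

raw = "Thu May 18 22:00:00 +0000 2017"

# %a = short weekday, %b = short month, %z = UTC offset such as +0000
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

print(parsed.date())  # 2017-05-18
print(parsed.hour)    # 22
```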

The cell below is not required for your assignment, but I am creating a column named `datecol` with the timestamp type so that I can demonstrate how we can apply `date_trunc`.

```
query = r"""DROP TABLE IF EXISTS import.exampleclean;
CREATE TABLE import.exampleclean AS 
SELECT id,
       LTRIM(unnest(regexp_matches(eg.sentence, '\$[A-Z]+', 'g')),'$') AS instahash,
       TO_TIMESTAMP(eg.date, 'Dy Mon DD HH24:MI:SS +0000 YYYY') AS datecol
FROM import.example AS eg;"""
cur.execute(query)
conn.commit()
```

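`date_trunc` rounds a timestamp down to the precision you ask for (for example `'day'`), which makes it handy for grouping rows by day. I came across an excellent article about it [here](https://mode.com/blog/date-trunc-sql-timestamp-function-count-on/), so I will not go into more detail.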

```
query = """SELECT instahash,date_trunc('day',datecol),COUNT(*) AS total
FROM import.exampleclean
GROUP BY instahash,date_trunc('day',datecol)"""
cur.execute(query)
cur.fetchall()
```
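As a rough Python picture of what this query does, the sketch below truncates each timestamp to midnight and counts rows per (tag, day) pair. The rows are typed in by hand from the dummy data rather than fetched from the database, and `trunc_day` is a made-up helper name.

```python
from collections import Counter
from datetime import datetime

# (instahash, datecol) pairs, hand-copied from part of the dummy data.
rows = [
    ("NEW", datetime(2017, 5, 18, 22)),
    ("OLD", datetime(2017, 5, 18, 22)),
    ("OLD", datetime(2017, 5, 18, 14)),
    ("MBAN", datetime(2017, 6, 20, 9)),
    ("MBAN", datetime(2017, 6, 20, 16)),
]

def trunc_day(ts):
    """Drop the time of day, like date_trunc('day', ts) (hypothetical helper)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

# GROUP BY instahash, date_trunc('day', datecol) with COUNT(*)
totals = Counter((tag, trunc_day(ts)) for tag, ts in rows)
print(totals[("OLD", datetime(2017, 5, 18))])  # 2
```

The two `$OLD` tweets were posted at different hours, but they land in the same group once the time of day is dropped.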

## Using WITH AS and IN
This combination is useful in many places. You can read about it [here](https://www.postgresql.org/docs/9.1/queries-with.html). You can think of it as creating a temporary table by querying another table. For example, suppose we take the following query ...

```sql
SELECT instahash,date_trunc('day',datecol),COUNT(*) AS total
FROM import.exampleclean
GROUP BY instahash,date_trunc('day',datecol) LIMIT 1
```

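...and capture the result in a kind of temporary table called `random`. This `random` can then be queried from another SQL query using IN. Since LIMIT 1 without ORDER BY returns an arbitrary row, the name fits. The runnable cell at the end uses a simpler `SELECT * ... LIMIT 1` to build `random`, but the idea is the same. You have probably learned about IN already, but check [this out](https://www.techonthenet.com/postgresql/in.php) to know more.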

```sql
select * from import.exampleclean
where instahash IN (select instahash from random)
```

```
## https://www.postgresql.org/docs/9.1/queries-with.html
query="""WITH random as (SELECT * FROM import.exampleclean LIMIT 1)
select * from import.exampleclean
where instahash IN (select instahash from random)"""
cur.execute(query)
cur.fetchall()
```
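If you want to experiment with WITH and IN without the course database, the pattern can be tried on an in-memory SQLite database, which also supports both. This is only a stand-in: the table has no `import.` schema prefix, the rows are a hand-made subset of the dummy data, and an `ORDER BY` is added inside the WITH part so that the picked row is the same on every run.

```python
import sqlite3

# In-memory SQLite stand-in for import.exampleclean (illustrative only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE exampleclean (id TEXT, instahash TEXT)")
demo_conn.executemany(
    "INSERT INTO exampleclean VALUES (?, ?)",
    [("111", "NEW"), ("111", "OLD"), ("222", "MBAN"), ("555", "OLD"), ("666", "MBAN")],
)

query = """
WITH picked AS (
    SELECT instahash FROM exampleclean ORDER BY instahash LIMIT 1
)
SELECT id, instahash FROM exampleclean
WHERE instahash IN (SELECT instahash FROM picked)
ORDER BY id
"""
picked_rows = demo_conn.execute(query).fetchall()
print(picked_rows)  # [('222', 'MBAN'), ('666', 'MBAN')]
```

The WITH part picks one tag (`MBAN`, the first alphabetically), and the outer query returns every row that carries that tag.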