In the previous file we learned some basics of Postgres by creating a table to store user accounts loaded from a CSV file. However, the table that we created used unoptimized datatypes for the data we were loading into it.

In this file, we are going to learn how to use the proper datatypes when loading our dataset. Choosing appropriate datatypes saves space on the database server, which in turn allows faster reads and writes. Furthermore, proper datatypes ensure that many errors in our data are caught at insertion and that our data can be queried the way we expect.

First, we will discuss the datatypes that Postgres provides and some of their implementation details. Then, we will alter a table so that it fits our dataset with proper types. Finally, we will load the data and check that our data model is appropriate.

The data set that we will be using is a set of game reviews from a site called [IGN](https://www.ign.com/). This set includes multiple datatypes that will introduce us to types other than `text` and `integer`, which we used previously. Here's a snippet of the dataset we will use, which is found in the `ign.csv` tab of the terminal.

![image.png](attachment:image.png)

Throughout this file, we are going to assume that an empty table has been created to accommodate the data contained in `ign.csv`. However, the data types chosen for this table are not appropriate for the data that we want to store. We will learn how to identify the data types used within the table and how to find and change them into more appropriate ones.

**Task**

We have created a query that inserts the first game from the `ign.csv` file into the database. Our task is to execute it by first connecting to the database.

![image.png](attachment:image.png)

We will get an error message. Try to understand what it means and why it happened.

**Answer**

`query_string = """
    INSERT INTO ign_reviews VALUES(
        5249979066121302000, 
        'Amazing', 
        'LittleBigPlanet PS Vita', 
        '/games/littlebigplanet-vita/vita-98907', 
        'PlayStation Vita', 
        9.0,
        'Platformer', 
        'Y', 
        2012, 
        9, 
        12
    );
"""`

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute(query_string)`

Above we tried to insert the first row from `ign.csv` into the `ign_reviews` table but got an error message that looks like this:

![image.png](attachment:image.png)

The last line is the important one here: `DataError: integer out of range`. This indicates that some field containing an integer value has a value that is too large. Let's look at the data contained in the first row:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

`SELECT * FROM table_name LIMIT 0;`

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute('SELECT * FROM ign_reviews LIMIT 0;')
print(cur.description)`

Above we inspected the description of the `ign_reviews` table by printing the contents of the `cursor.description` attribute. The result was a tuple of `Column` objects. Here is the first element of that tuple:

`Column(name='id', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None)`

As we probably noticed, the printed description is hard to read. To make the information easier to follow, we show the column information organized in a table:

![image.png](attachment:image.png)

![image.png](attachment:image.png)
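A readable listing like this can also be produced in code. The following is a minimal sketch: the helper `describe` is a hypothetical name introduced here, and the `Column` namedtuple is only a stand-in for the objects in `cur.description` so that the example runs without a database. The sample values for `score_phrase` are illustrative.

```python
from collections import namedtuple

# Stand-in for the Column objects found in cursor.description, so the sketch
# runs without a database. With psycopg2 we would pass cur.description.
Column = namedtuple(
    "Column",
    "name type_code display_size internal_size precision scale null_ok",
)

def describe(description):
    """Return one aligned text line per column: name, type code, size."""
    lines = [f"{'name':<16}{'type_code':>10}{'internal_size':>15}"]
    for col in description:
        lines.append(
            f"{col.name:<16}{str(col.type_code):>10}{str(col.internal_size):>15}"
        )
    return lines

# Illustrative sample; the first entry matches the Column shown above.
sample = (
    Column("id", 23, None, 4, None, None, None),
    Column("score_phrase", 25, None, -1, None, None, None),
)
print("\n".join(describe(sample)))
```

With a live connection, calling `describe(cur.description)` right after the `SELECT * FROM ign_reviews LIMIT 0;` query should list every column of the table.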

We are going to see how we can get the name of a data type given its type code.

From the [documentation of the `pg_catalog.pg_type` table](https://www.postgresql.org/docs/12/catalog-pg-type.html), we see that the first two columns are `oid` and `typname`, as shown in the following extract of the table:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

In this exercise we are going to find out the names of the data types for the codes `25` and `700`.

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()`

`cur.execute("SELECT typname FROM pg_catalog.pg_type WHERE oid = 25;")
type_name_25 = cur.fetchone()[0]`

`cur.execute("SELECT typname FROM pg_catalog.pg_type WHERE oid = 700;")
type_name_700 = cur.fetchone()[0]`

Above we saw that type code 25 corresponds to the `text` datatype and that 700 corresponds to the `float4` datatype.
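When several type codes need to be resolved, one query can replace the repeated lookups. The sketch below assumes a psycopg2 cursor, which adapts a Python tuple to a value list suitable for the SQL `IN` operator. The function name `lookup_type_names` is hypothetical, and the tuple must not be empty.

```python
def lookup_type_names(cur, type_codes):
    """Map each type code (oid) to its typname using a single query.

    `cur` is an open cursor connected to Postgres, for example the
    psycopg2 cursor used in the answer above.
    """
    cur.execute(
        "SELECT oid, typname FROM pg_catalog.pg_type WHERE oid IN %s;",
        (tuple(type_codes),),  # psycopg2 adapts a tuple to a SQL value list
    )
    # Each fetched row is an (oid, typname) pair, so dict() builds the mapping.
    return dict(cur.fetchall())
```

Calling `lookup_type_names(cur, [25, 700])` should return a dictionary that agrees with the two names found above.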

The following table summarizes the types of the columns in the `ign_reviews` table:

![image.png](attachment:image.png)

We saw that when we tried to add a game with identifier `52499790661213` we got an error saying that some value was out of range, and we conjectured that the problem was the value for the `id` column. Now we know that the data type of the `id` column is `int4`, which stands for 4-byte integer. But what values is `int4` able to represent?

Here's a list of the numeric types that Postgres supports found in the [Postgres numeric datatypes documentation](https://www.postgresql.org/docs/12/datatype-numeric.html):

![image.png](attachment:image.png)

The first column in the above table represents the names that we get inside the `pg_catalog.pg_type` Postgres internal table. These names are not identical to the ones used in the documentation, as they encode only the kind of data and its size in bytes. In the rest of this file we will use the Postgres internal name and the documentation name interchangeably.

![image.png](attachment:image.png)

We see that we need 46 bits to represent this id, which is more than the 32 bits that the `integer` type allows for. On the other hand, the `bigint` data type uses 8 bytes, that is, 8 × 8 = 64 bits, so it is large enough. In practice we would need to check the id of every row in our data set to make sure that they all fit into the `bigint` data type. Fortunately, we have already checked that, so we can focus on learning how to change the data type of the `id` column from `integer` to `bigint`.
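The bit counts can be checked directly in Python. This is a minimal sketch: the helper `fits_signed` is a hypothetical name introduced for illustration, and it assumes the usual two's-complement ranges for signed 4-byte and 8-byte integers.

```python
def fits_signed(value, num_bytes):
    """Return True if value fits in a signed integer of num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1

game_id = 52499790661213  # the identifier quoted in the text

print(game_id.bit_length())     # bits needed for the magnitude of the id
print(fits_signed(game_id, 4))  # integer / int4
print(fits_signed(game_id, 8))  # bigint / int8
```

The same check can be run over every id in the CSV file before settling on a column type.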

To alter the data type of a column in SQL, we can issue the following query:

`ALTER TABLE table_name 
ALTER COLUMN column_name TYPE new_type;`

The placeholder `new_type` above can be replaced by either the Postgres internal name or the regular name.

**Task**

In this exercise we are going to alter the type of the `id` column to `bigint`.

![image.png](attachment:image.png)

Commit changes and close the connection.

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()`

`cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN id TYPE bigint;
""")
conn.commit()
conn.close()`

Since we are already handling numeric datatypes, let's continue with the decimal values found in the `score` column. In this column, we see entries like `8.0`, `9.0`, `6.5`, etc.

We have the following datatypes for representing decimal numbers:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

Change the datatype of column `score` to `DECIMAL(3, 1)`.

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()`

`cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN score TYPE DECIMAL(3, 1);
""")
conn.commit()
conn.close()`
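To get a feel for which values `DECIMAL(3, 1)` can hold, here is a rough sketch using Python's `decimal` module. It assumes that a value is first rounded to the declared scale and is then rejected if it needs more digits than the declared precision. The helper `fits_decimal` is a hypothetical name, not part of any library.

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_decimal(value, precision, scale):
    """Return True if value, rounded to `scale` decimal places, needs at
    most `precision` digits in total."""
    step = Decimal(1).scaleb(-scale)  # e.g. scale=1 gives 0.1
    rounded = Decimal(value).quantize(step, rounding=ROUND_HALF_UP)
    # precision - scale digits remain for the integer part
    return abs(rounded) < Decimal(10) ** (precision - scale)

for text in ["9.0", "6.5", "99.9", "99.96", "100.0"]:
    print(text, fits_decimal(text, 3, 1))
```

All scores seen in the `score` column, such as `8.0`, `9.0` and `6.5`, fit comfortably in three digits with one decimal place.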

So far, we have taken a look at the numeric types in Postgres and evaluated the different sizes each type can accommodate. Now, we will focus on the string-like types that Postgres provides.

Above, we used the `text` type for our string-like columns. The `text` type allows strings of any length, which gives us maximum flexibility with our string-like data. This flexibility is not always a good thing, as the following example illustrates.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

Let's compute the maximum size of any phrase for representing the score.

**Answer**

![image.png](attachment:image.png)

![image.png](attachment:image.png)
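One possible way to compute this in Python is sketched below. It assumes that `score_phrase` is the second column of the CSV file, as the order of values in the earlier `INSERT` suggests. The helper `max_length` is a hypothetical name, and the in-memory sample only stands in for `ign.csv`.

```python
import csv
import io

def max_length(lines, column_index):
    """Return the longest value length in one column, skipping the header."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    return max(len(row[column_index]) for row in reader)

# Tiny in-memory sample standing in for ign.csv (values are illustrative).
sample = io.StringIO(
    "id,score_phrase,title\n"
    "1,Amazing,Game A\n"
    "2,Masterpiece,Game B\n"
    "3,Okay,Game C\n"
)
print(max_length(sample, 1))
```

With the real file, the same function can be called on the open file object, for example `max_length(open('ign.csv', newline=''), 1)`.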

**Task**

In this exercise we are going to alter the type of the `score_phrase` column to `varchar(11)`. The value 11 here comes from above where we saw that this was the maximum length of any score phrase.

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN score_phrase TYPE varchar(11);
""")
conn.commit()
conn.close()`

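Above we restricted the `score_phrase` column to at most eleven characters by changing its datatype from `text` to `varchar(11)`, since we found out that no value in this column requires more than that. Note that Postgres stores `varchar(n)` and `text` values in the same way, so the benefit here is the length check at insertion rather than a smaller footprint on disk.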

![image.png](attachment:image.png)

Another string like `"University"` would also be accepted if we use the type `varchar(11)` since it contains only 10 characters. However, this does not represent a valid value for `score_phrase`.

In these cases, an even better solution is to create an [enumerated datatype](https://www.postgresql.org/docs/12/datatype-enum.html) containing these values. Internally, each possible value of an enumerated datatype is assigned a 4-byte index and stored in a separate table called [pg_enum](https://www.postgresql.org/docs/12/datatype-enum.html). Then, instead of storing the value explicitly in the `ign_reviews` table, Postgres stores these indexes. This fixes the size required to represent any of these values at 4 bytes, compared with up to 11 bytes in the solution above!

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

Our task is to write code that creates the `evaluation_enum` datatype and changes the type of the `score_phrase` column to it.

**Answer**


`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("""
    CREATE TYPE evaluation_enum AS ENUM (
    'Great',       'Mediocre', 'Bad', 
    'Good',        'Awful',    'Okay', 
    'Masterpiece', 'Amazing',  'Unbearable', 
    'Disaster',    'Painful');
""")`

`cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN score_phrase TYPE evaluation_enum 
    USING score_phrase::evaluation_enum;
""")
conn.commit()
conn.close()`

Above we introduced enumerated datatypes. To recap, enumerated datatypes are useful in situations where a given column can only contain a **predefined** set of values. In our IGN reviews example, we had the `score_phrase` column which can only contain one out of eleven different values.

![image.png](attachment:image.png)

The value in the first column, `16847`, represents the `evaluation_enum` datatype code. This is set when the datatype is created and might have a different value when we execute it, depending on the current state of our database.

Then, using the `pg_enum` table, Postgres will automatically convert these indexes back to the original values when queries are performed.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

* Write code that creates two enumerated datatypes named `platform_enum` and `genre_enum` containing the possible values for the `platform` and `genre` columns.

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()`

`cur.execute("""
    CREATE TYPE genre_enum AS ENUM (
    'Adventure', 'Strategy', 'Shooter', 'Virtual Pet', 'Hardware', 'Adult', 'Baseball', 
    'Sports', 'Flight', 'Unknown', 'Racing', 'Battle', 'Fighting', 'Simulation', 'Party', 'Card', 
    'Productivity', 'Puzzle', 'Educational', 'Casino', 'RPG', 'Board', 'Other', 'Pinball', 'Platformer', 
    'Hunting', 'Action', 'Music', 'Compilation', 'Wrestling', 'Trivia');
""")`

`cur.execute("""
    CREATE TYPE platform_enum AS ENUM (
    'PC', 'Game Boy', 'Sega CD', 'Saturn', 'DVD / HD Video Game', 'Nintendo DSi', 
    'Arcade', 'Wii U', 'Lynx', 'Super NES', 'WonderSwan Color', 'TurboGrafx-CD', 
    'Windows Phone', 'TurboGrafx-16', 'N-Gage', 'Xbox One', 'Atari 2600', 
    'Pocket PC', 'Vectrex', 'Nintendo DS', 'Wireless', 'Ouya', 'Nintendo 64DD', 
    'Atari 5200', 'PlayStation 4', 'GameCube', 'Android', 'Wii', 'Game Boy Color', 
    'PlayStation 2', 'New Nintendo 3DS', 'Linux', 'Dreamcast VMU', 'Game Boy Advance', 
    'Windows Surface', 'Genesis', 'Xbox 360', 'Macintosh', 'Web Games', 'Nintendo 3DS', 'iPhone', 
    'SteamOS', 'Commodore 64/128', 'Dreamcast', 'PlayStation 3', 'NES', 'NeoGeo Pocket Color', 
    'Game.Com', 'PlayStation Portable', 'Master System', 'Sega 32X', 'NeoGeo', 'WonderSwan', 'iPad', 
    'Nintendo 64', 'PlayStation Vita', 'Xbox', 'iPod', 'PlayStation');
""")`

`cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN platform TYPE platform_enum 
    USING platform::platform_enum;
""")`

`cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN genre TYPE genre_enum
    USING genre::genre_enum;
""")`

`cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN title TYPE varchar(200);
""")`

`cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN url TYPE varchar(200);
""")`

`conn.commit()
conn.close()`
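The lists of values used in these enumerated datatypes can be gathered from the CSV file itself. The following sketch assumes the header row names the columns `platform` and `genre`. The helper `distinct_values` is a hypothetical name and the sample data is illustrative. Because `csv.DictReader` treats the first line as the header, the column title itself is never collected as a value.

```python
import csv
import io

def distinct_values(lines, column_name):
    """Return the sorted distinct values of a named CSV column."""
    reader = csv.DictReader(lines)  # consumes the header row as field names
    return sorted({row[column_name] for row in reader})

# Tiny in-memory sample standing in for ign.csv (values are illustrative).
sample = io.StringIO(
    "title,platform,genre\n"
    "Game A,PC,Puzzle\n"
    "Game B,Wii,Action\n"
    "Game C,PC,Action\n"
)
print(distinct_values(sample, "genre"))
```

Checking the length of the returned list is also a quick way to decide whether an enumerated datatype is practical for a column.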

The next field we will look at is the `editors_choice` field. For this field, we see either a `y` or `n` character for each entry, with no blank values. Our first instinct might be to use either `varchar(1)` or an enumerated datatype with values `y` and `n`. We wouldn't be wrong to suggest that. However, there is another type that is better suited to values that look a lot like `True` and `False`.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We can see from the above table that the `y` and `n` values are already in line with the values accepted to represent boolean data. If we remember from our type inspection, the datatype of this column is currently `text` which is a big waste of space as we only need one byte to represent this data. Let's fix that!

Just as when we converted `text` columns to enumerated datatypes, to convert a `text` column to the `boolean` datatype we need to cast the column values to `boolean`.

**Task**

Change the datatype of column `editors_choice` to `boolean`.

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("""
    ALTER TABLE ign_reviews 
    ALTER COLUMN editors_choice TYPE boolean
    USING editors_choice::boolean;
""")
conn.commit()
conn.close()`
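If the same conversion ever has to happen on the Python side, for example while cleaning rows before insertion, it could look like the sketch below. The helper `to_bool` is a hypothetical name. It ignores case and surrounding whitespace, and it covers only some of the spellings that Postgres accepts for boolean input.

```python
TRUE_WORDS = {"true", "yes", "on", "1", "t", "y"}
FALSE_WORDS = {"false", "no", "off", "0", "f", "n"}

def to_bool(text):
    """Convert a yes/no style string to a Python bool."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a recognised boolean literal: {text!r}")

print(to_bool("Y"), to_bool("n"))
```

Raising an error for anything unexpected mirrors the idea from the start of this file: bad data should be caught as early as possible.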

The final types we will discuss are the date and time types. In Postgres, there are quite a few of them available, ranging from timestamps, which aim to represent an exact moment (accurate to the microsecond), to the `date` datatype, which is used to represent a specific day. We can find the full details in the [documentation of the date/time datatypes](https://www.postgresql.org/docs/current/datatype-datetime.html). Here we will focus on the `date` type.

![image.png](attachment:image.png)

However, this comes at the cost of having to clean and process the dataset. We will leave this for later. For now, let's focus on creating the `release_date` column with the appropriate datatype and removing the redundant columns mentioned above.

To add a column to a database table in SQL we can use the `ALTER TABLE` command together with the `ADD COLUMN` clause, as shown in the following example:

`ALTER TABLE table_name
ADD COLUMN new_column_name data_type;`

As with other table altering commands, we will need to commit our changes for them to be applied definitively. Removing a column is done in a very similar way:

`ALTER TABLE table_name
DROP COLUMN column_name;`

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("ALTER TABLE ign_reviews ADD COLUMN release_date date;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_year;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_month;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_day;")
conn.commit()
conn.close()`

Now we have optimized the datatypes of a Postgres database table that had been created to represent the IGN reviews data contained in `ign.csv`. All that remains to do is to load the data from the CSV file into our Postgres table.

We learned in the previous file how to use the [csv module](https://docs.python.org/3/library/csv.html) to read a CSV file. We will do the same here. The only difference is that we cannot load the rows exactly as they are, since we replaced the three columns containing the date data with a single `release_date` column.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Now we have all the necessary ingredients to be able to load the data into the `ign_reviews` table.

**Task**

Load the data contained in the file `ign.csv` into the `ign_reviews` table by doing the following for each row:

* Process the date.
* Insert the processed row into the `ign_reviews` table.

**Answer**

`import datetime
import psycopg2
import csv`

`conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()`

`with open('ign.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    next(reader) # skip csv header (first row with column titles)
    for row in reader:
        year = int(row[8]) # the elements in row are strings so we need to convert to int
        month = int(row[9])
        day = int(row[10])
        date = datetime.date(year, month, day)
        row = row[:-3]
        row.append(date)
        cur.execute("INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);", row)`

`conn.commit()
conn.close()`
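The row processing step can also be pulled out into a small function, which makes it easy to check without touching the database. This is only a sketch of the same logic as above. The name `process_row` is hypothetical, and the sample row reuses the values from the first `INSERT` query in this file, written as strings the way `csv.reader` would yield them.

```python
import datetime

def process_row(row):
    """Replace the trailing year, month and day strings with a single date."""
    year, month, day = (int(part) for part in row[-3:])
    return row[:-3] + [datetime.date(year, month, day)]

first_row = [
    "5249979066121302000", "Amazing", "LittleBigPlanet PS Vita",
    "/games/littlebigplanet-vita/vita-98907", "PlayStation Vita",
    "9.0", "Platformer", "Y", "2012", "9", "12",
]
print(process_row(first_row))
```

The returned list has nine elements, matching the nine placeholders in the `INSERT` statement.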

In this file, we learned about:

![image.png](attachment:image.png)

In the next file we are going to learn about prepared statements: a SQL mechanism that can be used to safely set up parametrized queries that take input from the user.