# Technical Report

## Overview
This notebook walks through the creation of our ETL process for the video game sales database. For instructions on setting up the database and running the extract, transform and load steps, see the [README](https://github.com/asundquistdavis/Project-2/blob/main/README.md). We obtained two data sets from [Kaggle](https://www.kaggle.com/) as CSV files, one with world population figures and one with video game sales.

The first decision we made was to use a PostgreSQL database. Relational databases are typically easier to query, but they can be harder to set up and they restrict what kind of data can be added. Since the data sets we are working with already have some relational structure, using SQL rather than MongoDB or another non-relational database is a natural choice. We created our database's ERD (shown below) using [quickDBD](https://www.quickdatabasediagrams.com/). The database is in 1st normal form: every value is atomic and cannot be broken down further, which makes the data easier to query and manipulate. We did not aim for 2nd normal form. The "sales" table also lacks a single-column primary key (its rows are identified by game and region together); since adding one would not drastically impact functionality, we decided against it to keep the model simpler.

![erd](project-2_database_erd.png)

## Extract
The extraction step is straightforward: both data sets are downloaded as CSV files. We load each one into a dataframe with the pandas CSV reader, then preview it and check the data types and non-null counts per column to make sure the contents make sense.

In [None]:
# import pandas
import pandas as pd

# define the paths to the csv files that contain the raw data
world_pop_path = "Resources/world_population.csv"
vg_sales_path = "Resources/vgsales.csv"

# read the population data as a dataframe  and preview it
world_pop_df = pd.read_csv(world_pop_path)
world_pop_df.head()

In [None]:
# view data types and amount of non-null data per column
world_pop_df.info()

The world population data contains 17 columns and all of them appear to contain 234 non-null values. In the transform step we will drop quite a bit of this data and clean it up so we are left with the data that is most useful for the project.

In [None]:
# read the video games data as a dataframe and preview it
vg_sales_df = pd.read_csv(vg_sales_path)
vg_sales_df.head()

In [None]:
# view data types and amount of non-null data per column
vg_sales_df.info()

The video game sales data contains 11 columns with 16,598 non-null values in every column except publisher and release year. We could drop the rows with incomplete data, but for this project we are only concerned with the sales figures, so we keep all rows as is and drop the unused columns in the transform step.
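As a complement to `info()`, the missing values can also be counted directly. The following is a minimal sketch on a made-up stand-in dataframe (`toy_sales_df` and its values are assumptions for illustration, not the real data); the same two lines would apply to `vg_sales_df`.

```python
import pandas as pd

# Toy stand-in for vg_sales_df; the values are invented for illustration
toy_sales_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006.0, None, 2010.0],
    "Publisher": ["Pub X", "Pub Y", None],
    "Global_Sales": [1.5, 2.0, 0.7],
})

# Count missing values per column, then keep only the columns that have gaps
missing_counts = toy_sales_df.isna().sum()
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

Because the sales columns have no gaps in this toy frame, only the descriptive columns are listed, which is the situation described above.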

## Transform
First we work on creating the world population table. The table needs to store the population of each region - Japan, North America, Europe and the world. For this project, we only need population data for 2020. This is because our sales data are cumulative up until 2020.

Japan's population is in the dataframe already so it can be loaded as is. To get the populations for each continent, we take the data and aggregate-sum it by continent. Finally, to get the world population, we sum over all the continents.

In [None]:
# group by continent and sum the numeric columns
world_pop_by_continents = world_pop_df.groupby('Continent').sum(numeric_only=True)
world_pop_by_continents

In [None]:
# extract values for population table

# na and eu populations are taken from the group by dataframe
na_pop = world_pop_by_continents.loc['North America', '2020 Population']
eu_pop = world_pop_by_continents.loc['Europe', '2020 Population']

# wo population is calculated by summing over all continents
wo_pop = world_pop_by_continents['2020 Population'].sum()

# jp is selected from the original df
jp_pop = world_pop_df.loc[world_pop_df['Country'] == 'Japan', '2020 Population'].values[0]

# save region populations into table
population = pd.DataFrame({"region": ["na", "eu", "jp", "wo"], "population": [na_pop, eu_pop, jp_pop, wo_pop]})
population

This is the finished table, ready to be loaded into the database. Its primary key is region and the populations for each region are stored as whole numbers. 

Next we work on the video games table. Each row of the dataframe already represents a video game and platform pair, so we just need to assign a primary key to each row of the original dataframe and retain the video game and platform data. The 'Rank' column of the original dataframe gives the game/platform pair's rank in total sales. While we are not interested in this number, it can serve as the primary key because it is unique to each row of the table.

In [None]:
# rename the 'Rank' column to 'game_id' and lowercase the name and platform columns
vg_sales_df = vg_sales_df.rename(columns={"Rank": "game_id", "Name": "name", "Platform": "platform"})

# define the video games df
video_games = vg_sales_df[["game_id", "name", "platform"]]
video_games.head()

This is the video_games table ready to load into the database. 
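Before loading, it can be worth confirming that the chosen key really is unique and seeing whether any name/platform pair repeats. This is a minimal sketch on an invented stand-in table (`toy_video_games` is an assumption, not the real data); the same checks could be run on `video_games`.

```python
import pandas as pd

# Toy stand-in for the video_games table; the rows are invented for illustration
toy_video_games = pd.DataFrame({
    "game_id": [1, 2, 3, 4],
    "name": ["Game A", "Game A", "Game B", "Game B"],
    "platform": ["Wii", "PS3", "Wii", "Wii"],
})

# A primary key candidate must not contain repeated values
key_is_unique = bool(toy_video_games["game_id"].is_unique)

# List every row whose (name, platform) pair occurs more than once
repeated_pairs = toy_video_games[
    toy_video_games.duplicated(subset=["name", "platform"], keep=False)
]
print(key_is_unique)
print(repeated_pairs)
```

In the toy data the key is unique even though one name/platform pair appears twice, which shows why the key check and the pair check are separate questions.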

Lastly we create the sales table. The original dataframe stores the sales for each region in a separate column. In the sales table we want one row for each combination of game_id **and** region, so we use pandas to reshape the dataframe from wide to long format.

In [None]:
# The melt function takes data from multiple columns and stores it as separate rows
sales = vg_sales_df.melt(id_vars=["game_id"], value_vars=["NA_Sales", "EU_Sales", "JP_Sales", "Global_Sales"],
                         var_name="region", value_name="sales")

# rename the regions to match the region names in the population table
sales["region"] = sales["region"].replace({"NA_Sales": "na", "EU_Sales": "eu", "JP_Sales": "jp", "Global_Sales": "wo"})

# the raw sales figures are recorded in millions, so scale them to single units
sales["sales"] = sales["sales"] * 1000000

sales.head()

The resulting dataframe contains the columns game_id, region and sales and is ready to load into the database.
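The wide-to-long reshape is easier to see on a tiny example. The sketch below uses an invented two-game dataframe (`toy_wide` and its numbers are assumptions for illustration). It also rounds the scaled sales to whole numbers, which is an optional extra step not taken in the notebook, and checks that each game_id/region pair occurs only once.

```python
import pandas as pd

# Toy wide-format data: one row per game, one column per region (values invented)
toy_wide = pd.DataFrame({
    "game_id": [1, 2],
    "NA_Sales": [1.5, 0.4],
    "EU_Sales": [1.0, 0.2],
    "JP_Sales": [0.5, 0.1],
    "Global_Sales": [3.0, 0.7],
})

# Reshape to long format: one row per (game_id, region) pair
toy_long = toy_wide.melt(
    id_vars=["game_id"],
    value_vars=["NA_Sales", "EU_Sales", "JP_Sales", "Global_Sales"],
    var_name="region",
    value_name="sales",
)

# Map the original column names to the short region codes
toy_long["region"] = toy_long["region"].replace(
    {"NA_Sales": "na", "EU_Sales": "eu", "JP_Sales": "jp", "Global_Sales": "wo"}
)

# Scale from millions to single units; rounding removes floating point noise
toy_long["sales"] = (toy_long["sales"] * 1_000_000).round().astype("int64")

# Each (game_id, region) pair should appear exactly once
pair_is_unique = not toy_long.duplicated(subset=["game_id", "region"]).any()
print(toy_long)
print(pair_is_unique)
```

Two games and four regions give eight rows, so the row count of the long table is a quick sanity check on the reshape.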

## Load
The last step is to load the data into our PostgreSQL database. We use pgAdmin to create a database named "project-2_db" and run the table [schema](schema.sql) in it. SQLAlchemy creates the engine that connects to the PostgreSQL database, and pandas loads the dataframes into the tables.

In [None]:
# import create_engine and inspect from SQLAlchemy
from sqlalchemy import create_engine, inspect

# define the PostgreSQL connection settings and build the connection string
protocol = 'postgresql'
username = 'postgres'
password = 'bootcamp'
host = 'localhost'
port = 5432
database_name = 'project-2_db'
rds_connection_string = f'{protocol}://{username}:{password}@{host}:{port}/{database_name}'

# create Postgres engine
engine = create_engine(rds_connection_string)

# check to see if SQLAlchemy found the tables in the database
inspector = inspect(engine)
inspector.get_table_names()

All three tables are found in the PostgreSQL database.
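The connection cell above hard-codes the password. One possible alternative, sketched here under assumptions, is to read it from an environment variable and URL-encode it so that special characters do not break the connection string. The variable name `PROJECT2_DB_PASSWORD` and the helper `build_connection_string` are hypothetical names chosen for this example.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(password, username="postgres", host="localhost",
                            port=5432, database_name="project-2_db"):
    """Build a PostgreSQL connection string with a URL-encoded password."""
    # quote_plus escapes characters such as '@' and '/' that would otherwise
    # be read as part of the URL structure
    return f"postgresql://{username}:{quote_plus(password)}@{host}:{port}/{database_name}"


# PROJECT2_DB_PASSWORD is a hypothetical variable name; fall back to the
# notebook's local development password when it is not set
db_password = os.environ.get("PROJECT2_DB_PASSWORD", "bootcamp")
connection_string = build_connection_string(db_password)
```

The resulting string can be passed to `create_engine` in the same way as `rds_connection_string`.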

In [None]:
# load population table into database
population.to_sql(name='population', con=engine, if_exists='append', index=False)

# load video_games table into database
video_games.to_sql(name='video_games', con=engine, if_exists='append', index=False)

# load sales table into database
sales.to_sql(name='sales', con=engine, if_exists='append', index=False)

In [None]:
# query the video game table as a check that db is loaded!
pd.read_sql_query('select * from video_games', con=engine).head()

And now we have the data successfully loaded as a database ready to be analyzed! 
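A slightly stronger check than previewing a few rows is to compare row counts between the dataframe and the loaded table. The sketch below demonstrates the idea with an in-memory SQLite database as a stand-in for PostgreSQL, so that it runs without a server; the toy table and its numbers are invented for illustration. Against the real database, the same comparison would use `engine` and the actual dataframes.

```python
import sqlite3

import pandas as pd

# Toy population table; the numbers are invented, not the real 2020 figures
toy_population = pd.DataFrame({
    "region": ["na", "eu", "jp", "wo"],
    "population": [500, 700, 100, 7000],
})

# In-memory SQLite stands in for the PostgreSQL database in this sketch
conn = sqlite3.connect(":memory:")
toy_population.to_sql(name="population", con=conn, if_exists="append", index=False)

# Count the rows that arrived in the table and compare with the dataframe
loaded_count = pd.read_sql_query("select count(*) as n from population", con=conn)["n"].iloc[0]
rows_match = int(loaded_count) == len(toy_population)
conn.close()

print(rows_match)
```

Note that because the load cells use `if_exists='append'`, running them twice against the same tables would make a count check like this fail, which is a useful signal in itself.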