# GPU Visualization Data Preparation

This notebook documents how we prepared our data for the GPU visualization (Worldwide Reddit Post Activity).

## Import Libraries

We import the libraries that the rest of the notebook needs.

In [11]:
import os
import pandas as pd
import numpy as np
import datashader as ds
import warnings

## Read in Dataset

We read in our synthetic test dataset, a `.parquet` file produced by our earlier data manipulation steps.

The `os.getcwd()` function returns the current working directory, which should be the `data_wrangling` folder. The data file lives elsewhere, so we replace `data_wrangling` in that path with `synthetic_data` to point at the directory that contains the file. Once that is done, we can go ahead and read in the data and perform the necessary manipulations.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('data_wrangling', 'synthetic_data')

In [3]:
new_comb = pd.read_parquet(DATA_DIR + '/final_combinedsubclus.parquet') 
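One caveat with the string replacement above: `str.replace` changes every occurrence of `data_wrangling` in the path, not only the last folder. A minimal sketch of an alternative, assuming `synthetic_data` sits next to `data_wrangling` in the same parent folder (the helper name `sibling_data_dir` and the example path are hypothetical, for illustration only):

```python
import os

def sibling_data_dir(cwd, sibling="synthetic_data"):
    """Return the path of a folder located next to `cwd` (hypothetical helper)."""
    # Drop the last path component, then append the sibling folder name.
    parent = os.path.dirname(os.path.normpath(cwd))
    return os.path.join(parent, sibling)

# Illustrative working directory; in the notebook this would be os.getcwd().
example_cwd = os.path.join("project", "data_wrangling")
data_dir = sibling_data_dir(example_cwd)

# os.path.join picks the right separator for the current platform.
parquet_path = os.path.join(data_dir, "final_combinedsubclus.parquet")
print(parquet_path)
```

Building the path from its components keeps the result independent of where else `data_wrangling` might appear in the full path.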

## Convert Hour Variable to Int

Because we extracted the hour of day from a string, we needed to convert it to an int so that the multiselect tool in the GPU visualization would work.

In [4]:
new_comb['hour'] = new_comb['hour'].astype(int) 
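To illustrate what this conversion does, here is a small sketch on made-up hour strings (the sample values are assumptions, not rows from our dataset). It also checks that the converted hours fall in the expected 0 to 23 range:

```python
import pandas as pd

# Made-up sample of hour strings, including zero-padded values.
hours = pd.Series(["00", "07", "23"])

# astype(int) parses each string as an integer, dropping the zero padding.
hours_as_int = hours.astype(int)

# Sanity check: every hour of day should lie between 0 and 23 inclusive.
all_valid = bool(hours_as_int.between(0, 23).all())
print(hours_as_int.tolist(), all_valid)
```

Note that `astype(int)` raises an error if any value cannot be parsed as an integer, which is a useful early signal that the extraction step produced something unexpected.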

## Mapping Day of the Week to Numbers

To make our day-of-week dropdown work, we needed to assign each day of the week a number. Each day is expressed as a condition paired with a value, and NumPy's `select` function gives each row the value of the first condition it matches. Note that rows matching none of the conditions receive `np.select`'s default of 0, the same value as Sunday. The resulting column backs the label map used in the visualization's dropdown menu.

In [5]:
conditions = [(new_comb['day_of_week'] == 'Sunday'),
             (new_comb['day_of_week'] == 'Monday'),
             (new_comb['day_of_week'] == 'Tuesday'),
             (new_comb['day_of_week'] == 'Wednesday'),
             (new_comb['day_of_week'] == 'Thursday'),
             (new_comb['day_of_week'] == 'Friday'),
             (new_comb['day_of_week'] == 'Saturday')]

values = [0, 1, 2, 3, 4, 5, 6]
new_comb['day_as_num'] = np.select(conditions, values)
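Because `np.select` falls back to 0 when no condition matches, a misspelled or missing day label would silently be counted as Sunday. A possible alternative, sketched here on a made-up dataframe (the `sample` data and the `"Funday"` label are illustrative assumptions), is a dictionary lookup with `Series.map`, which leaves unmatched labels as `NaN` so they can be inspected:

```python
import pandas as pd

# Same day-to-number assignment as above, written as a lookup table.
day_to_num = {
    "Sunday": 0, "Monday": 1, "Tuesday": 2, "Wednesday": 3,
    "Thursday": 4, "Friday": 5, "Saturday": 6,
}

# Made-up sample; "Funday" stands in for an unexpected label.
sample = pd.DataFrame({"day_of_week": ["Sunday", "Wednesday", "Saturday", "Funday"]})

# map() looks each label up in the dictionary; unknown labels become NaN.
sample["day_as_num"] = sample["day_of_week"].map(day_to_num)

# Rows whose label was not found in the lookup table.
unmatched = sample[sample["day_as_num"].isna()]
print(unmatched["day_of_week"].tolist())
```

Once the unmatched rows have been dealt with, the column can be converted with `.astype(int)` to match the integer values produced by `np.select`.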

## Extract Certain Columns of Data

To make a GPU visualization, our dataframe needs to be a `cudf` dataframe, which requires a conversion. To make that conversion more efficient, we keep only the columns that we want to use in the visualization.

In [6]:
# Take an explicit copy so that later assignments modify temp_file, not a slice of new_comb
temp_file = new_comb[["long", "lat", "hour", "day_as_num"]].copy()

## Convert Longitude/Latitude Points to Mercator Points

With the longitude/latitude coordinates that our data originally had, we were able to plot all of the points, but the map would not display properly. Converting the points to Web Mercator coordinates (in meters) with datashader's `lnglat_to_meters` allowed the map to load behind the data points.

In [12]:
warnings.filterwarnings("ignore")  # suppress all warnings from this point on

In [13]:
temp_file.loc[:, 'long'], temp_file.loc[:, 'lat'] = ds.utils.lnglat_to_meters(temp_file['long'], temp_file['lat'])
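For readers curious about what this conversion computes, the sketch below applies the standard spherical Web Mercator formulas with plain NumPy. It is an illustration under that assumption, not datashader's own source code, and the function name `lnglat_to_web_mercator` is hypothetical:

```python
import numpy as np

def lnglat_to_web_mercator(lng, lat):
    """Project longitude/latitude in degrees to Web Mercator meters (illustrative)."""
    lng = np.asarray(lng, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Half the circumference of a sphere with the WGS84 equatorial radius.
    origin_shift = np.pi * 6378137.0
    # Longitude scales linearly to meters.
    easting = lng * origin_shift / 180.0
    # Latitude is stretched more and more toward the poles.
    northing = np.log(np.tan((90.0 + lat) * np.pi / 360.0)) * origin_shift / np.pi
    return easting, northing

# Made-up sample points: the origin and a point on the antimeridian.
easting, northing = lnglat_to_web_mercator([0.0, 180.0], [0.0, 0.0])
print(easting, northing)
```

The projection is undefined at the poles (latitude of ±90 degrees), so points at those latitudes would need to be filtered out or clipped before converting.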

## Save to Parquet File

Now that the data has been prepared for the GPU visualization, we save the dataframe as a parquet file, so that the visualization notebook only needs to read it in once.

In [15]:
temp_file.to_parquet(DATA_DIR + '/gpufile_combinedsubclus.parquet') 