<a id="top"></a>
# Preprocessing Data
Glenn Abastillas | February 26, 2020

This notebook contains functions and logic to preprocess raw data before uploading it to the domain.

Contents
  1. [Load Data](#load_data)
  1. [View Data](#view_data)
  1. [Normalize Values](#normalize_values)
  1. [Augment Data](#augment_data)
  1. [Subset Data](#subset_data)
  1. [Save Data and Subsets](#save_data)


### Load Data <a id="load_data"></a>

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

PATH_TO_DATA = Path("../resources/data")
THESIS = PATH_TO_DATA / "thesis_data.csv"
LETTERS = PATH_TO_DATA / "letters.csv"

thesis, letters = pd.read_csv(THESIS), pd.read_csv(LETTERS)

[to top](#top)

### View Data <a id="view_data"></a>

Visually inspect a random sample of the data before processing it.

In [2]:
thesis.sample(5)

| Unnamed: 0 | region | cs | tweet | lat | lon | language |
|---|---|---|---|---|---|---|
| 208 | d | False | @polkamoca Are we even still friends? Hahahs | 7.076248 | 125.617328 | 0 |
| 4444 | c | False | Som girls are like this. :( http://t.co/lTMe7... | 10.243556 | 123.812597 | 0 |
| 6994 | m | False | @awkwardposts: How to get over a breakup http... | 124.651338 | 8.476431 | 0 |
| 7352 | n | False | Kagwapo ani niya \ud83d \ude0d \ud83d \ude0d ... | 9.514999 | 123.159509 | 0 |
| 4770 | c | False | Well uhm for... mondays to sundays lovers may... | 10.243859 | 123.832365 | 0 |


[to top](#top)

### Normalize Values <a id="normalize_values"></a>

Make sure each column's data type matches the values it holds.

In [3]:
thesis.dtypes

region       object
cs           object
tweet        object
lat         float64
lon         float64
language     object
dtype: object

Define intended datatypes for each column.

In [4]:
columns = thesis.columns
datatypes = ['object', 'bool', 'object', 'float64', 'float64', 'object']

Assign columns correct datatypes.

In [5]:
for column, datatype in zip(columns, datatypes):
    thesis[column] = thesis[column].astype(datatype)
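
The loop above pairs columns and datatypes by position, so it relies on the column order staying fixed. As a minimal sketch of an alternative, `DataFrame.astype` also accepts a dictionary keyed by column name. The `toy` frame and the `intended` mapping below are illustrative stand-ins, not the notebook's actual data.

```python
import pandas as pd

# Toy frame standing in for `thesis`; the values are illustrative only.
toy = pd.DataFrame({
    "region": ["d", "c"],
    "lat": ["7.07", "10.24"],    # numbers stored as strings
    "lon": ["125.61", "123.81"],
})

# Hypothetical mapping from column name to intended dtype.
intended = {"region": "object", "lat": "float64", "lon": "float64"}

# astype with a dict converts each named column, regardless of column order.
toy = toy.astype(intended)
print(toy.dtypes)
```

Naming each column makes a mismatch between the two lists harder to introduce when columns are added or reordered.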

Inspect data to ensure integrity.

In [6]:
thesis.head()

| Unnamed: 0 | region | cs | tweet | lat | lon | language |
|---|---|---|---|---|---|---|
| 0 | d | True | @gaaaabrielle_x watch ka? (: | 6.936058 | 125.471194 | Cebuano |
| 1 | d | True | \ud83d \ude4a \ud83d \ude0d \ud83d \udc95 \ud... | 7.039869 | 125.504724 | 0 |
| 2 | d | True | @guibz11 @marcantonyaco tanggala na hacker to oh | 7.077326 | 125.615586 | Cebuano |
| 3 | d | True | Im already addicted to Angel Eyes' OSTs \ud83... | 7.051989 | 125.561496 | 0 |
| 4 | d | True | impromtu meetup with hs friends (c) ipay http... | 7.112516 | 125.618947 | Cebuano |
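
Every row shown after the conversion has `cs` set to `True`, while the sample taken before it showed `False`. One possible explanation, assuming `cs` was read as strings such as `"True"` and `"False"` (the `object` dtype is consistent with this but does not confirm it), is that casting an object column to `bool` treats any non-empty string as `True`. The sketch below reproduces that effect on toy data and shows an explicit mapping as one way around it.

```python
import pandas as pd

# Toy column: boolean values stored as strings (an assumption about `cs`).
cs = pd.Series(["True", "False", "False"], dtype="object")

# Casting uses Python truthiness, so every non-empty string becomes True.
naive = cs.astype("bool")

# Mapping the string labels explicitly preserves the intended values.
mapped = cs.map({"True": True, "False": False})

print(naive.tolist())
print(mapped.tolist())
```

Values missing from the mapping come back as `NaN`, which makes unexpected labels easy to spot before any further cast.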


[to top](#top)

### Augment Data <a id="augment_data"></a>

Add new columns, derived from existing ones, to augment the current dataset.

#### Bucket Coordinate Data (Function Definition) <a id="augment_data_function_bucket"></a>

In [7]:
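def bucket(data, coord='lat', step=1):
    ''' Return a copy of data with a new bucketed_<coord> column.

    coord is the coordinate column to bucket ('lat' or 'lon') and
    step is the bucket width in degrees.
    '''
    
    desc = data.describe()[coord]
    
    # Pad the range by one unit on each side so every value falls in a bin.
    start = np.floor(desc['min']) - 1
    stop = np.ceil(desc['max']) + 1
    
    # Bin edges; each bin is labelled with its lower edge.
    buckets = np.arange(start, stop, step)
    labels = buckets[:-1]
    
    # pd.cut bins are right-closed: (lower, upper].
    bucketed_data = pd.cut(data[coord], buckets, labels=labels)
    
    return data.assign(**{f"bucketed_{coord}" : bucketed_data})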
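
To make the labelling concrete, here is a small worked example on made-up latitudes. It uses a copy of `bucket` with the same logic so the block runs on its own; the `toy` frame is illustrative only. Because `pd.cut` builds right-closed bins and each bin is labelled with its lower edge, a value that sits exactly on an edge, such as `8.0`, lands in the bin below it.

```python
import numpy as np
import pandas as pd

def bucket(data, coord='lat', step=1):
    ''' Return data set with new bucketed lon or lat columns'''
    desc = data.describe()[coord]
    start = np.floor(desc['min']) - 1
    stop = np.ceil(desc['max']) + 1
    buckets = np.arange(start, stop, step)   # edges: 6, 7, ..., 11 for the toy data
    labels = buckets[:-1]                    # lower edge of each bin
    bucketed_data = pd.cut(data[coord], buckets, labels=labels)
    return data.assign(**{f"bucketed_{coord}": bucketed_data})

# Made-up latitudes, including one value exactly on a bin edge.
toy = pd.DataFrame({"lat": [7.07, 8.0, 9.51, 10.33]})

result = bucket(toy)
bucketed = result["bucketed_lat"].astype(float).tolist()
print(bucketed)  # [7.0, 7.0, 9.0, 10.0]
```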

Test the function on a sample of the data.

In [8]:
bucket(thesis).sample(3)

| Unnamed: 0 | region | cs | tweet | lat | lon | language | bucketed_lat |
|---|---|---|---|---|---|---|---|
| 755 | d | True | Next meeting na lage daw. Tagal makaintindi \... | 7.071775 | 125.60363 | Cebuano | 7.0 |
| 5479 | c | True | @zeerrific wala. Y u so oa | 10.334674 | 123.900262 | Cebuano | 10.0 |
| 7032 | m | True | @BrentRivera \u2764 \u2764 \u2764 | 124.660672 | 8.474473 | 0 | 124.0 |
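
Row 7032 above has a `lat` of 124.660672, which is outside the valid latitude range of -90 to 90, and its `lon` of 8.474473 looks like a latitude. A range check can flag such rows before bucketing. The sketch below uses two rows copied from the View Data sample; the swap step is an assumption (that flagged rows simply have `lat` and `lon` exchanged), not something the notebook itself does.

```python
import pandas as pd

# Two rows from the sample output: one plausible, one with lat out of range.
toy = pd.DataFrame({
    "lat": [7.076248, 124.651338],
    "lon": [125.617328, 8.476431],
})

# Valid latitudes lie in [-90, 90]; anything outside is suspect.
suspect = ~toy["lat"].between(-90, 90)

# Assumption: a suspect row has its two coordinates exchanged.
# to_numpy() avoids label alignment undoing the swap.
fixed = toy.copy()
fixed.loc[suspect, ["lat", "lon"]] = toy.loc[suspect, ["lon", "lat"]].to_numpy()

print(suspect.tolist())
print(fixed)
```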


Bucket data and reassign to variable.

In [9]:
thesis = bucket(thesis, 'lat')
thesis = bucket(thesis, 'lon')

In [10]:
thesis.sample(3)

| Unnamed: 0 | region | cs | tweet | lat | lon | language | bucketed_lat | bucketed_lon |
|---|---|---|---|---|---|---|---|---|
| 5515 | c | True | I've always loved rain! Brought me back to my... | 10.336549 | 123.907715 | 0 | 10.0 | 123.0 |
| 4553 | c | True | Thinking bout downloading snapchat again \ud8... | 10.264805 | 123.824452 | 0 | 10.0 | 123.0 |
| 7538 | n | True | Follow @AwesomeTrixia01 \ud83d \ude0a | 9.313211 | 123.275857 | 0 | 9.0 | 123.0 |


[to top](#top)

### Subset Data <a id="subset_data"></a>

Split the data by region, one subset per region code.

In [11]:
region_d = thesis[thesis.region == 'd']
region_c = thesis[thesis.region == 'c']
region_m = thesis[thesis.region == 'm']
region_n = thesis[thesis.region == 'n']
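
The four filters above can also be written as a single `groupby`, which picks up every region code present in the data without listing them by hand. This is a minimal sketch on a made-up frame; the `subsets` dictionary is a hypothetical name, not one used elsewhere in the notebook.

```python
import pandas as pd

# Made-up frame with the same `region` codes as the thesis data.
toy = pd.DataFrame({
    "region": ["d", "c", "m", "n", "c"],
    "lat": [7.08, 10.24, 8.48, 9.51, 10.33],
})

# Iterating over a groupby yields (key, sub-frame) pairs.
subsets = {region: frame for region, frame in toy.groupby("region")}

print(sorted(subsets))
print(len(subsets["c"]))
```

Each value in `subsets` holds the same rows as the matching boolean filter, for example `toy[toy.region == 'c']`.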

### Save Data and Subsets <a id="save_data"></a>

Indicate the file type to save the data as, either `csv` or `json`. This run uses `json`.

In [14]:
filetype = 'json'

In [16]:
all_data = [thesis, region_d, region_c, region_m, region_n]
filenames = ['thesis_data_', 'thesis_data_region_d', 'thesis_data_region_c', 'thesis_data_region_m', 'thesis_data_region_n']

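for datum, filename in zip(all_data, filenames):
    # Pick the writer and its keyword arguments for the chosen file type.
    if filetype == 'csv':
        save = datum.to_csv
        index = {"index" : False}
    elif filetype == 'json':
        save = datum.to_json
        index = {}
    else:
        # Fail early instead of hitting a NameError on `save` below.
        raise ValueError(f"Unsupported filetype: {filetype!r}")
    
    save(PATH_TO_DATA / f"{filename}.{filetype}", **index)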
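
A cheap way to confirm that a save worked is to read a file back and compare it with the frame in memory. The sketch below does this for both supported file types on a made-up frame, writing to a temporary directory rather than `PATH_TO_DATA`; the file names are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for one of the saved subsets.
toy = pd.DataFrame({
    "region": ["d", "c"],
    "lat": [7.08, 10.24],
    "lon": [125.62, 123.81],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    json_path = Path(tmp) / "toy.json"

    # Same writer calls as the save loop: no index for CSV, defaults for JSON.
    toy.to_csv(csv_path, index=False)
    toy.to_json(json_path)

    # Read both files back while the temporary directory still exists.
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)

# Matching shapes are a first sanity check, not a full equality test.
same_shape = from_csv.shape == toy.shape == from_json.shape
print(same_shape)
```

Column types can change on the way back in. A categorical column such as `bucketed_lat`, for instance, would be read back as plain numbers from either file type.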

# END