<a id="top"></a>
# Preprocessing Data
Glenn Abastillas | February 26, 2020

This notebook contains the functions and logic used to preprocess the raw data before it is uploaded to the domain.

Contents
  1. [Load Data](#load_data)
  1. [View Data](#view_data)
  1. [Normalize Values](#normalize_values)
  1. [Augment Data](#augment_data)
  1. [Subset Data](#subset_data)
  1. [Save Data and Subsets](#save_data)
  

### Load Data <a id="load_data"></a>

In [20]:
import pandas as pd
import numpy as np
from pathlib import Path

PATH_TO_DATA = Path("../resources/data")
THESIS = PATH_TO_DATA / "thesis_data.csv"
LETTERS = PATH_TO_DATA / "letters.csv"

thesis, letters = pd.read_csv(THESIS), pd.read_csv(LETTERS)

[to top](#top)

### View Data <a id="view_data"></a>

Visually inspect the data before processing it.

In [21]:
thesis.sample(5)

| | region | cs | tweet | lat | lon | language |
| --- | --- | --- | --- | --- | --- | --- |
| 7335 | m | False | My view right now.. Thank You Lord for this w... | 124.6153 | 8.423269 | 0 |
| 1430 | d | False | @carlsulla @alanithegreat scare tolerance... | 7.07317 | 125.583464 | 0 |
| 3640 | c | False | @AngeloOuanos thank you \ud83d \ude02 \ud83d ... | 10.318129 | 123.906451 | 0 |
| 3081 | c | True | wtf nakatog ko nya wa pakoy tuon | 10.31458 | 123.885872 | Cebuano |
| 5705 | m | False | Anew | 124.617858 | 8.459436 | 0 |
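
In the sample, the rows from region `m` have `lat` values near 124, which is outside the valid latitude range of -90 to 90, so the two coordinate columns may be swapped for those rows. A minimal sketch for flagging such rows is shown here on a toy frame; the name `sample` is hypothetical and the values are copied from the output above.

```python
import pandas as pd

# Toy rows copied from the sample output above (tweet text omitted).
sample = pd.DataFrame({
    "region": ["m", "d", "c"],
    "lat": [124.6153, 7.07317, 10.318129],
    "lon": [8.423269, 125.583464, 123.906451],
})

# Valid latitudes lie in [-90, 90]; a larger magnitude cannot be a latitude.
out_of_range = sample[sample["lat"].abs() > 90]

# Count the flagged rows per region to see where the problem concentrates.
print(out_of_range["region"].value_counts())
```

Whether these rows should be corrected or left as they are depends on how the coordinates are used downstream.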


[to top](#top)

### Normalize Values <a id="normalize_values"></a>

Make sure each column's data type matches the values it holds.

In [22]:
thesis.dtypes

region       object
cs           object
tweet        object
lat         float64
lon         float64
language     object
dtype: object

Define intended datatypes for each column.

In [23]:
columns = thesis.columns
datatypes = ['object', 'bool', 'object', 'float64', 'float64', 'object']

Assign columns correct datatypes.

In [24]:
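# Caution: astype('bool') on an object column treats every non-empty
# string as True, including the string 'False'.
for column, datatype in zip(columns, datatypes):
    thesis[column] = thesis[column].astype(datatype)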
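
Because `cs` is read as `object`, casting it straight to `bool` may not preserve the original values. The sketch below shows the pitfall and one possible workaround on a toy frame. It assumes `cs` holds the strings `"True"` and `"False"`, which the notebook does not confirm, and the names `toy`, `naive`, and `parsed` are hypothetical.

```python
import pandas as pd

# Toy frame; a string-valued `cs` column is an assumption about the raw CSV.
toy = pd.DataFrame({
    "cs": pd.Series(["True", "False", "True"], dtype=object),
    "lat": [7.1, 10.3, 8.4],
})

# astype('bool') uses truthiness, so every non-empty string becomes True.
naive = toy["cs"].astype("bool")

# An explicit mapping keeps the intended values; unmapped entries become NaN.
parsed = toy["cs"].map({"True": True, "False": False})

# A dict passed to astype names each column instead of relying on column order.
toy = toy.assign(cs=parsed).astype({"cs": "bool", "lat": "float64"})
```

Comparing `naive` with `toy["cs"]` shows whether the direct cast changed any values.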

Inspect data to ensure integrity.

In [25]:
thesis.head()

| | region | cs | tweet | lat | lon | language |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | d | True | @gaaaabrielle_x watch ka? (: | 6.936058 | 125.471194 | Cebuano |
| 1 | d | True | \ud83d \ude4a \ud83d \ude0d \ud83d \udc95 \ud... | 7.039869 | 125.504724 | 0 |
| 2 | d | True | @guibz11 @marcantonyaco tanggala na hacker to oh | 7.077326 | 125.615586 | Cebuano |
| 3 | d | True | Im already addicted to Angel Eyes' OSTs \ud83... | 7.051989 | 125.561496 | 0 |
| 4 | d | True | impromtu meetup with hs friends (c) ipay http... | 7.112516 | 125.618947 | Cebuano |


[to top](#top)

### Augment Data <a id="augment_data"></a>

Add new columns derived from existing ones to augment the dataset.

#### Bucket Coordinate Data (Function Definition) <a id="augment_data_function_bucket"></a>

In [45]:
def bucket(data, coord='lat', step=1):
    '''Return a copy of data with a new bucketed_<coord> column.'''

    # Pad the range by one unit on each side of the observed min and max.
    start = np.floor(data[coord].min()) - 1
    stop = np.ceil(data[coord].max()) + 1

    # Bin edges; each bin is labeled with its left edge.
    buckets = np.arange(start, stop, step)
    labels = buckets[:-1]

    # pd.cut uses right-closed bins by default, i.e. (left, right].
    bucketed_data = pd.cut(data[coord], buckets, labels=labels)

    return data.assign(**{f"bucketed_{coord}": bucketed_data})
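
To make the binning concrete, the sketch below repeats the same `np.arange` and `pd.cut` steps on three toy latitudes. The names `lat`, `edges`, `binned`, and `left_closed` are hypothetical and the values are chosen for illustration.

```python
import numpy as np
import pandas as pd

lat = pd.Series([7.062825, 10.272719, 8.0])

# Same edge construction as bucket() with step=1: edges 6, 7, ..., 11.
edges = np.arange(np.floor(lat.min()) - 1, np.ceil(lat.max()) + 1, 1)

# Default bins are right-closed, (left, right], and labeled by the left edge.
binned = pd.cut(lat, edges, labels=edges[:-1])

# With right=False the bins become [left, right), which moves exact integers.
left_closed = pd.cut(lat, edges, labels=edges[:-1], right=False)

print(binned.tolist())       # [7.0, 10.0, 7.0]
print(left_closed.tolist())  # [7.0, 10.0, 8.0]
```

The only difference is the value that sits exactly on an edge: 8.0 is labeled 7.0 with the default bins and 8.0 with `right=False`.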

Test the function on the data.

In [47]:
bucket(thesis).sample(3)

| | region | cs | tweet | lat | lon | language | bucketed_lat |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4185 | c | True | @taralalalalala #Inspired | 10.272719 | 123.848912 | 0 | 10.0 |
| 7203 | m | True | Lami pa kaayo matulog! :( pero daghan kaayo k... | 124.776796 | 8.510268 | 0 | 124.0 |
| 439 | d | True | 'Your account balance is not sufficient' \ud8... | 7.062825 | 125.596613 | 0 | 7.0 |


Bucket both coordinates and reassign the result to `thesis`.

In [49]:
thesis = bucket(thesis, 'lat')
thesis = bucket(thesis, 'lon')

In [50]:
thesis.sample(3)

| | region | cs | tweet | lat | lon | language | bucketed_lat | bucketed_lon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | d | True | basically the team was betrayed by one of the... | 7.107943 | 125.623106 | 0 | 7.0 | 125.0 |
| 252 | d | True | Hi les. Pansinin mo nanan ako oh? \ud83d \udc... | 7.101586 | 125.511443 | Cebuano | 7.0 | 125.0 |
| 5790 | m | True | mkaligo nga muna | 124.589496 | 8.469174 | 0 | 124.0 | 8.0 |


[to top](#top)

### Subset Data <a id="subset_data"></a>

Split the data by region.

In [51]:
region_d = thesis[thesis.region == 'd']
region_c = thesis[thesis.region == 'c']
region_m = thesis[thesis.region == 'm']
region_n = thesis[thesis.region == 'n']
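
As an alternative sketch, the same split can be built with `groupby`, which produces one sub-frame for each region value present in the data. The toy frame and the names `toy` and `subsets` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["d", "c", "m", "d"],
    "lat": [7.1, 10.3, 8.4, 7.0],
})

# One sub-frame per region value that actually occurs in the column.
subsets = {region: frame for region, frame in toy.groupby("region")}

print(sorted(subsets))    # ['c', 'd', 'm']
print(len(subsets["d"]))  # 2
```

Unlike the boolean masks above, this produces no entry for a region with no rows, such as `n` in the toy frame, so a lookup for a missing region raises a `KeyError` instead of returning an empty frame.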

### Save Data and Subsets <a id="save_data"></a>

In [53]:
all_data = [thesis, region_d, region_c, region_m, region_n]
# 'thesis_data_' (trailing underscore) is written to thesis_data_.csv,
# so the raw thesis_data.csv in the same folder is not overwritten.
filenames = ['thesis_data_', 'thesis_data_region_d', 'thesis_data_region_c', 'thesis_data_region_m', 'thesis_data_region_n']

for datum, filename in zip(all_data, filenames):
    datum.to_csv(PATH_TO_DATA / f"{filename}.csv", index=False)
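
A CSV round trip does not preserve every dtype: the bucketed columns produced by `pd.cut` are categorical in memory, and `pd.read_csv` infers a plain dtype when the file is loaded again. The sketch below illustrates this on a toy frame written to a temporary directory; the names `toy`, `restored`, and `toy.csv` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"region": ["d", "c"], "lat": [7.1, 10.3]})
toy["bucketed_lat"] = pd.cut(toy["lat"], [6, 8, 11], labels=[6.0, 8.0])

# Write without the index, as in the save loop above, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

dtype_before = str(toy["bucketed_lat"].dtype)
dtype_after = str(restored["bucketed_lat"].dtype)
print(dtype_before, dtype_after)  # category float64
```

If the categorical dtype matters downstream, it has to be restored after loading, for example with `astype('category')`.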

# END