# 2019: Week 11
April 24, 2019

This week is all about stocks, and you have Ian Baldwin to thank for this challenge. He challenged us to take the JSON output from a shares website and turn it into a file for use within Tableau.

Tableau Prep does not have a connector to download the data from the site (yet??) or to parse JSON (yet??), but we can take a very raw file and manipulate it into the kind of table we would commonly use in Tableau Desktop.

# Requirements

<img src="https://2.bp.blogspot.com/-Xfgu2dHIS7w/XMAKxms_HCI/AAAAAAAAANY/HuIlrr_e-VIkBrqHb0uaisxRn5UogxrJwCLcBGAs/s320/11%2Binput.JPG" width="500" height="300"/>

+ Input data from the .csv
+ Break up the JSON_Name field
+ In the split column that contains them, exclude the 'meta' and blank ('') records so that only 'indicators' and 'timestamp' remain
+ For the column containing our metrics, if this is blank, take the value from the 'indicators' / 'timestamp' column. Rename this field as 'Data Type'
+ There is a column that will contain just numbers (up to 502). If this column is blank then take the value from the other column that contains similar values up to 502. Rename this field to 'Row'
+ Rename 'JSON_ValueString' to 'Value'
+ Only leave fields in your data set that have been renamed as per the instruction above.
+ Pivot fields to form final table structure
+ Turn Unix Epoch time into a real date
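
To make the 'Data Type' and 'Row' rules concrete, here is a minimal sketch of how a single dotted `JSON_Name` path maps to those two fields. The example paths are assumptions inferred from the split columns used later in this notebook (the real file may differ), and `parse_path` is a hypothetical helper, not part of the challenge.

```python
def parse_path(path):
    """Return (data_type, row) for one dotted JSON_Name path, or None to exclude it."""
    parts = path.split(".")
    # Pad to 8 parts so shorter paths have blank trailing positions
    parts += [None] * (8 - len(parts))
    section = parts[3]  # 4th part: 'meta', 'indicators' or 'timestamp'
    if section not in ("indicators", "timestamp"):
        return None
    data_type = parts[6] or section  # 7th part, else fall back to the section name
    row = parts[7] or parts[4]       # 8th part, else fall back to the 5th part
    return data_type, int(row)


# Hypothetical example paths (assumed structure, not copied from the input file)
examples = [
    "chart.result.0.meta.currency",
    "chart.result.0.timestamp.5",
    "chart.result.0.indicators.quote.0.open.5",
]
for p in examples:
    print(p, "->", parse_path(p))
```

The 'meta' path is excluded, the 'timestamp' path takes both of its values from the fallback positions, and the 'indicators' path takes them from the 7th and 8th parts.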

# Output

<img src="https://2.bp.blogspot.com/-wXrrYgStbuc/XMAKxgJDNbI/AAAAAAAAANk/bUSwdu-mZPQyqBoLnUY7NffAVqrpi64pgCEwYBhgL/s320/11%2BOutput.JPG" width="600" height="300" />

+ 8 columns
+ 503 rows of data (504 including headers)
+ No null cells
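
These output checks can be written as assertions and run against the final table. This is a minimal sketch: `check_output` is a hypothetical helper, and the two-row frame is only a stand-in so the example runs without the input file.

```python
import pandas as pd


def check_output(df, expected_rows=503, expected_cols=8):
    """Raise AssertionError if df does not match the expected output."""
    assert df.shape == (expected_rows, expected_cols), f"unexpected shape {df.shape}"
    assert not df.isna().any().any(), "output contains null cells"


# Stand-in frame; with the real result, call check_output(output) instead
sample = pd.DataFrame({"Row": [0, 1], "close": [53.98, 53.63]})
check_output(sample, expected_rows=2, expected_cols=2)
print("checks passed")
```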


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("input.csv")
print(df.head(20))
print("====================Data types====================")
print(df.dtypes)
print("==================================================")

                                            JSON_Name  JSON_ValueString
0                        chart.result.0.meta.currency               USD
1                          chart.result.0.meta.symbol              DATA
2                    chart.result.0.meta.exchangeName               NYQ
3                  chart.result.0.meta.instrumentType            EQUITY
4                  chart.result.0.meta.firstTradeDate        1368777600
5                       chart.result.0.meta.gmtoffset            -14400
6                        chart.result.0.meta.timezone               EDT
7            chart.result.0.meta.exchangeTimezoneName  America/New_York
8              chart.result.0.meta.chartPreviousClose             54.09
9                       chart.result.0.meta.priceHint                 2
10  chart.result.0.meta.currentTradingPeriod.pre.t...               EDT
11  chart.result.0.meta.currentTradingPeriod.pre.s...        1556006400
12   chart.result.0.meta.currentTradingPeriod.pre.end        155

In [3]:
# Find the maximum number of '.' separators in JSON_Name to determine how many columns the split produces
max_dot_counts = df['JSON_Name'].apply(lambda x: x.count('.')).max()

# Name the columns that the split will produce
split_col_names = [f'JSON_Name - split {i+1}' for i in range(max_dot_counts+1)]

# Split JSON_Name into those columns, then drop the original JSON_Name column
df[split_col_names] = df['JSON_Name'].str.split('.', expand=True)
df.drop(columns='JSON_Name', inplace=True)
print(df.head(5))

  JSON_ValueString JSON_Name - split 1 JSON_Name - split 2  \
0              USD               chart              result   
1             DATA               chart              result   
2              NYQ               chart              result   
3           EQUITY               chart              result   
4       1368777600               chart              result   

  JSON_Name - split 3 JSON_Name - split 4 JSON_Name - split 5  \
0                   0                meta            currency   
1                   0                meta              symbol   
2                   0                meta        exchangeName   
3                   0                meta      instrumentType   
4                   0                meta      firstTradeDate   

  JSON_Name - split 6 JSON_Name - split 7 JSON_Name - split 8  
0                None                None                None  
1                None                None                None  
2                None                None   

In [4]:
# Replace NaN with '' so that str.contains can be used for filtering
df['JSON_Name - split 4'] = df['JSON_Name - split 4'].fillna('')

# Keep only the rows whose split 4 value is 'timestamp' or 'indicators'
df = df[df['JSON_Name - split 4'].str.contains('timestamp|indicators')]
# print(sub_df.head(5))
# print("====================Data types====================")
print(df.info())
print("==================================================")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3521 entries, 34 to 3554
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   JSON_ValueString     3521 non-null   object
 1   JSON_Name - split 1  3521 non-null   object
 2   JSON_Name - split 2  3521 non-null   object
 3   JSON_Name - split 3  3521 non-null   object
 4   JSON_Name - split 4  3521 non-null   object
 5   JSON_Name - split 5  3521 non-null   object
 6   JSON_Name - split 6  3018 non-null   object
 7   JSON_Name - split 7  3018 non-null   object
 8   JSON_Name - split 8  3018 non-null   object
dtypes: object(9)
memory usage: 275.1+ KB
None
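
Because `str.contains` does substring matching, it would also keep any value that merely contains 'timestamp' or 'indicators'. A minimal sketch of an exact-match alternative using `Series.isin`, shown on a small illustrative frame (the sample rows are made up and only mimic the shape of the input):

```python
import pandas as pd

sample = pd.DataFrame({
    "JSON_Name - split 4": ["meta", "timestamp", "indicators", None],
    "JSON_ValueString": ["USD", "1493040600", "53.98", "x"],
})

# isin matches whole values only and treats None/NaN as "not in the list",
# so no fillna('') step is needed beforehand
keep = sample["JSON_Name - split 4"].isin(["timestamp", "indicators"])

# .copy() gives an independent frame, which avoids SettingWithCopyWarning
# when new columns are assigned to the filtered result later
filtered = sample[keep].copy()
print(filtered)
```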


In [5]:
# Where split 7 is null, fall back to the value from split 4
df['JSON_Name - split 7'] = df.apply(lambda row: row['JSON_Name - split 4'] if pd.isna(row['JSON_Name - split 7']) else row['JSON_Name - split 7'], axis=1)

# Rename the split 7 column to 'Data Type'
df.rename(columns={'JSON_Name - split 7': 'Data Type'}, inplace=True)
print(df.info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3521 entries, 34 to 3554
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   JSON_ValueString     3521 non-null   object
 1   JSON_Name - split 1  3521 non-null   object
 2   JSON_Name - split 2  3521 non-null   object
 3   JSON_Name - split 3  3521 non-null   object
 4   JSON_Name - split 4  3521 non-null   object
 5   JSON_Name - split 5  3521 non-null   object
 6   JSON_Name - split 6  3018 non-null   object
 7   Data Type            3521 non-null   object
 8   JSON_Name - split 8  3018 non-null   object
dtypes: object(9)
memory usage: 275.1+ KB
None


In [6]:
# Where split 8 is null, fall back to the value from split 5
df['JSON_Name - split 8'] = df.apply(lambda row: row['JSON_Name - split 5'] if pd.isna(row['JSON_Name - split 8']) else row['JSON_Name - split 8'], axis=1)

# Rename the columns
df.rename(columns={'JSON_Name - split 8': 'Row', 'JSON_ValueString': 'Value'}, inplace=True)

# Keep only the renamed columns by dropping the leftover split columns (split 1 to split 6)
df.drop(columns=[f'JSON_Name - split {i}' for i in range(1, 7)], inplace=True)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3521 entries, 34 to 3554
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Value      3521 non-null   object
 1   Data Type  3521 non-null   object
 2   Row        3521 non-null   object
dtypes: object(3)
memory usage: 110.0+ KB
None
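
The row-wise `apply` calls in the two cells above work, but the same null fallback can be written as a vectorized column operation. A minimal sketch using `Series.fillna` with another Series, on a small made-up frame whose values only mimic the shape of the real data:

```python
import pandas as pd

sample = pd.DataFrame({
    "JSON_Name - split 4": ["timestamp", "indicators"],
    "JSON_Name - split 5": ["0", "quote"],
    "JSON_Name - split 7": [None, "open"],
    "JSON_Name - split 8": [None, "0"],
})

# fillna with a Series fills each null with the value at the same index label
sample["Data Type"] = sample["JSON_Name - split 7"].fillna(sample["JSON_Name - split 4"])
sample["Row"] = sample["JSON_Name - split 8"].fillna(sample["JSON_Name - split 5"])
print(sample[["Data Type", "Row"]])
```

This avoids calling a Python function once per row, which tends to matter more as the input grows.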


In [7]:
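# Pivot rows to columns: one row per 'Row' value, one column per 'Data Type'
df = df.pivot(index='Row', columns='Data Type', values='Value')

# Reset the index so that 'Row' becomes a regular column
df.reset_index(inplace=True)

# Cast the columns to numeric data types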
df['adjclose'] = df['adjclose'].astype(float)
df['close'] = df['close'].astype(float)
df['high'] = df['high'].astype(float)
df['low'] = df['low'].astype(float)
df['open'] = df['open'].astype(float)
df['volume'] = df['volume'].astype(int)
df['Row'] = df['Row'].astype(int)
print(df.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Row        503 non-null    int32  
 1   adjclose   503 non-null    float64
 2   close      503 non-null    float64
 3   high       503 non-null    float64
 4   low        503 non-null    float64
 5   open       503 non-null    float64
 6   timestamp  503 non-null    object 
 7   volume     503 non-null    int32  
dtypes: float64(5), int32(2), object(1)
memory usage: 27.6+ KB
None
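
The reshape and the casts can be seen more easily on a tiny long-format frame. This is a minimal sketch, assuming each ('Row', 'Data Type') pair appears exactly once; the four values are taken from the first two rows of this notebook's final output.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Row": ["0", "0", "1", "1"],
    "Data Type": ["close", "volume", "close", "volume"],
    "Value": ["53.98", "1349500", "53.63", "777400"],
})

# pivot needs each (Row, Data Type) pair to be unique; it raises ValueError otherwise
wide = long_df.pivot(index="Row", columns="Data Type", values="Value").reset_index()
wide.columns.name = None  # drop the leftover 'Data Type' label on the column axis

# astype accepts a column -> dtype mapping, so all casts fit in one call
wide = wide.astype({"Row": "int64", "close": "float64", "volume": "int64"})
print(wide)
```

Clearing `columns.name` is optional; without it, 'Data Type' is printed in the top-left corner of the table.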


In [8]:
# Round the figures to 2 decimal places
output = df.round(2)
print(output)

Data Type  Row  adjclose  close   high    low   open   timestamp   volume
0            0     53.98  53.98  54.57  53.00  54.36  1493040600  1349500
1            1     53.63  53.63  54.42  53.62  54.37  1493127000   777400
2           10     60.05  60.05  60.55  59.44  59.53  1494250200  2401000
3          100     75.23  75.23  75.89  74.73  75.28  1505395800   601900
4          101     75.02  75.02  75.07  74.27  74.85  1505482200   708000
..         ...       ...    ...    ...    ...    ...         ...      ...
498         95     72.24  72.24  72.98  72.13  72.35  1504791000  1138700
499         96     72.85  72.85  72.98  71.80  72.39  1504877400   656400
500         97     74.80  74.80  75.41  73.42  73.42  1505136600  1303900
501         98     75.63  75.63  75.75  74.24  74.94  1505223000   641800
502         99     75.41  75.41  76.27  75.33  75.63  1505309400   881600

[503 rows x 8 columns]
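
In the printed output, `Row` runs 0, 1, 10, 100, ... because the pivot sorted its index while the values were still strings, and `timestamp` still holds raw Unix epoch values. A minimal sketch of the remaining clean-up, using three rows copied from the output above. Treating the timestamps as seconds in UTC, naming the new column `Date`, and replacing `timestamp` with it are all assumptions.

```python
import pandas as pd

# Three rows from the output above, in the order they were printed
sample = pd.DataFrame({
    "Row": [10, 100, 99],
    "close": [60.05, 75.23, 75.41],
    "timestamp": ["1494250200", "1505395800", "1505309400"],
})

# Row is an integer column by now, so this sort is numeric: 10, 99, 100
sample = sample.sort_values("Row").reset_index(drop=True)

# timestamp came out of the pivot as strings, so cast to int before converting;
# unit="s" interprets the numbers as seconds since 1970-01-01 UTC
sample["Date"] = pd.to_datetime(sample["timestamp"].astype("int64"), unit="s").dt.date
sample = sample.drop(columns="timestamp")
print(sample)
```

Keeping the full datetime (dropping `.dt.date`) is an equally reasonable choice if the time of day matters in Tableau.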
