# Spark Profiling

In this notebook we aim to create a general spark script to perform initial table profiling. the tasks involve:

- ### Add the data as RDD
    - Can multiple formats: csv, delta, text, etc.
- ### Overal dataset information
    - Number of Variables
    - Number of records
    - total missing percentage
    - Total size in memory
    - Average record size in memory
    - Recomended partition size during spark analysis
- ### Variable analysis:
    - Number of categorical
    - Number of Numeric
    - Number of Date
    - Number of Text(unique)
    - Number of rejected
- ### some overal info about the dataset like:
    - GeoLocation has 7315 / 19.0% missing values Missing
    - GeoLocation has a high cardinality: 17100 distinct values Warning
    - mass_g is highly skewed (γ1 = 76.916)
    - recclass has a high cardinality: 466 distinct values Warning
    - reclat has 7315 / 19.0% missing values Missing
    - reclat has 6438 / 14.1% zeros
    - reclat_city is highly correlated with reclat (ρ = 0.99423) Rejected
    - reclong has 7315 / 19.0% missing values Missing
    - reclong has 6214 / 13.6% zeros
    - source has constant value NASA Rejected
- ### Create a table with the column names and Analysis of each column
    - Devide columns into numeric vs none-nummeric
    - some of the features for nun-numeric:
        - Record count
        - Unique Values
        - Empty Strings
        - Null Values
        - Percent Fill
        - Percent Numeric
        - Max Length
        - if float,int 
            - max_value
            - min_value
            - mean
            - std
        - if string
            - shortest value 
            - longest value
            - average length
        - if date time
            - min_date
            - max_date
    - include some graphs for each (seperate numeric vs categorical)
- ### Reproduction tab:
    - date and time of profiling start time
    - date and time of profiling end time
    - name of the database/table
    - version of the data profiler

In later notebooks we will discover profiling visualization using Jinja2 and rendering front-end using HTML templates.

- The codebase has refferences to other repositories like: [this](https://github.com/gandalf1819/NYCOpenData-Profiling-Analysis/blob/master/Task-1-Generic-profiling.py) and [this](https://github.com/pandas-profiling/pandas-profiling)

# Project Progress

### TODO improvements:

- create a front-end or graphs
    - create a html page
    - https://github.com/pallets/jinja/blob/main/examples/basic/test.py
- find composite keys from a small subset of the table
- find the correlation amoung the columns
- give warning for columns that have more than certain amount warnings
- save report as csv or pdf
- perform relational analysis: correl , cordinality, skewness, etc.


#### Import Libraries

In [15]:
import os
import sys
import json
from datetime import datetime

In [16]:
# Pyspark sessions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as D

#### Create spark session

In [17]:
spark = SparkSession.builder.appName('SparkProfiling').getOrCreate()
sc = spark.sparkContext                # if you need the sparkConext
spark

### Helper Functions

In [18]:
# Create directory
def create_dir(path):
    if not os.path.exists(path):
        os.makedirs(path)

# List the directory
def get_data_input_list(path):
    return os.listdir(path)

def count_not_null(c, nan_as_null=False):
    pred = F.col(c).isNotNull() & (~isnan(c) if nan_as_null else F.lit(True))
    return F.sum(pred.cast('integer')).alias(c)

def get_top_five_frequent_record(DF, col):
    frequency_dataframe=DF.groupBy(col).count().sort(F.desc('count'))
    frequency_dataframe=frequency_dataframe.where(F.col(col).isNotNull())
    top_frequency_five=[]
    if frequency_dataframe.count()<5:
        top_frequency_five=[row[0] for row in frequency_dataframe.collect()]
    else:
        top_frequency_five=[row[0] for row in frequency_dataframe.take(5)]
    return top_frequency_five

# This method, randomly samples 10% of the data and cache the samples to find the size. 
# This method is not acurate and can be costly.
def estimate_rdd_memory_size_mb(df):
    estimated_df = df.sample(fraction = 0.1)
    estimated_df.cache().foreach(lambda x: x)
    catalyst_plan = estimated_df._jdf.queryExecution().logical()
    test_kb = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
    return test_kb * 10 / (1024 * 1024)

# type cast validation functions
def validate_string_to_integer(d):
    if type(d)==str:
        try:
            z=int(d)
            return z
        except:
            return None
    else:
        return None
    
def validate_string_to_float(d):
    if type(d)==str:
        try:
            z=float(d)
            return z
        except:
            return None
    else:
        return None

def validate_date(d):
    try:
        z=parse(d)
        return str(z)
    except:
        return None

# Saves the Json file
def save_json(json_file,path):
    with open(path, 'w') as outfile:
        json.dump(json_file, outfile)

In [19]:
def profile_table(rdd_df):
    completed_profile = {}
    columns_names = rdd_df.columns
    total_rows = rdd_df.count()

    # Initial column related analysis
    compute_unique_values = rdd_df.agg(*[F.countDistinct(F.col(c)) for c in rdd_df.columns]).collect()[0]
    compute_not_null_columns = rdd_df.agg(*[count_not_null(c) for c in rdd_df.columns]).collect()[0]
    compute_null_columns=[(total_rows-count_notNull) for count_notNull in compute_not_null_columns]
    compute_null_proportion=[round((count_Nulls / total_rows)*100, 3) for count_Nulls in compute_null_columns]

    # general database analysis
    general_df_info = {}
    general_df_info["total_records"] = total_rows
    general_df_info["total_variables"] = len(columns_names)
    general_df_info["total_missing_percentage"] = sum(compute_null_columns) / (general_df_info["total_records"] * general_df_info["total_variables"])
    general_df_info["estimate_size_in_memory_Mb"] = estimate_rdd_memory_size_mb(rdd_df)
    general_df_info["average_record_size_in_memory_Mb"] = general_df_info["estimate_size_in_memory_Mb"] / total_rows
    completed_profile["general_db_info"] = general_df_info

    attribute_analysis = {}
    attribute_analysis['Numeric'] = len([item[0] for item in rdd_df.dtypes if (item[1].startswith('int') or item[1].startswith('float'))])
    attribute_analysis['Categorical'] = len([item[0] for item in rdd_df.dtypes if item[1].startswith('string')])
    attribute_analysis['Date'] = len([item[0] for item in rdd_df.dtypes if item[1].startswith('date')])
    completed_profile["attribute_analysis"] = attribute_analysis

    # UDF function for type casting
    get_int=F.udf(lambda x: x if type(x)==int else None, D.IntegerType())
    get_str=F.udf(lambda x: x if type(x)==str else None, D.StringType())
    get_flt=F.udf(lambda x: x if type(x)==float else None, D.FloatType())
    get_dt=F.udf(lambda x: validate_date(x), D.StringType())
    get_string_int=F.udf(lambda x: validate_string_to_integer(x), D.IntegerType())
    get_string_flt=F.udf(lambda x: validate_string_to_float(x), D.FloatType())

    # Column wide Analysis
    column_analysis = {}
    column_analysis['cols_data'] = []
    # cols_data=[]

    for i, cols in enumerate(rdd_df.columns):
        # Base case
        if total_rows==0:
            continue
        
        columns_data={}
        columns_data['column_name']=cols
        columns_data['dtype']= DF.dtypes[i][1]
        columns_data['record_count']=total_rows
        columns_data['unique_values']=compute_unique_values[i]
        columns_data['number_non_empty_cells']=compute_not_null_columns[i]
        columns_data['number_empty_cells']=compute_null_columns[i]
        columns_data['null_proportion']=compute_null_proportion[i]
        columns_data['top_five_value'] = get_top_five_frequent_record(rdd_df, cols)
        
        # Data type specific analysis (check for possible type casting)
        int_col=cols+' '+'int_type'
        str_col=cols+' '+'str_type'
        float_col=cols+ ' '+ 'float_type'
        date_col=cols+' '+'date_type'
        str_int_col=cols + ' '+'str_int'
        str_float_col=cols +' '+'str_float'
        
        df=rdd_df.select([get_int(cols).alias(int_col), 
                    get_str(cols).alias(str_col), 
                    get_flt(cols).alias(float_col), 
                    get_dt(cols).alias(date_col),
                    get_string_int(cols).alias(str_int_col),
                    get_string_flt(cols).alias(str_float_col)
                    ])
        
        int_df = df.select(int_col).where(F.col(int_col).isNotNull())
        str_df = df.select(str_col).where(F.col(str_col).isNotNull())
        float_df = df.select(float_col).where(F.col(float_col).isNotNull())
        date_df = df.select(date_col).where(F.col(date_col).isNotNull())
        str_int_df = df.select(str_int_col).where(F.col(str_int_col).isNotNull())
        str_float_df = df.select(str_float_col).where(F.col(str_float_col).isNotNull())
        
        columns_data['data_types']=[]
        
        if float_df.count()>1:
            type_data={}
            type_data['type']='REAL'
            type_data['count']=float_df.count()
            type_data['max_value']=float_df.agg({float_col: "max"}).collect()[0][0]
            type_data['min_value']=float_df.agg({float_col: "min"}).collect()[0][0]
            type_data['mean']=float_df.agg({float_col: "avg"}).collect()[0][0]
            type_data['stddev']=float_df.agg({float_col: 'stddev'}).collect()[0][0]
            columns_data['data_types'].append(type_data)

        if int_df.count()>1:
            type_data={}
            type_data['type']='INTEGER (LONG)'
            type_data['count']=int_df.count()
            type_data['max_value']=int_df.agg({int_col: 'max'}).collect()[0][0]
            type_data['min_value']=int_df.agg({int_col: 'min'}).collect()[0][0]
            type_data['mean']=int_df.agg({int_col: 'avg'}).collect()[0][0]
            type_data['stddev']=int_df.agg({int_col: 'stddev'}).collect()[0][0]
            columns_data['data_types'].append(type_data)

        if str_df.count()>1:
            type_data={'type':'TEXT', 'count': str_df.count()}
            str_rows=str_df.distinct().collect()
            str_arr=[row[0] for row in str_rows]
            if len(str_arr)<=5:
                type_data['shortest_values']=str_arr
                type_data['longest_values']=str_arr
            else:
                str_arr.sort(key=len, reverse=True)
                type_data['shortest_values']=str_arr[:-6:-1]
                type_data['longest_values']=str_arr[:5]

            type_data['average_length']=sum(map(len, str_arr))/len(str_arr) #this needs work since it is getting average length of the ditinct values not all values
            # also average length sometimes prints giberish ex"  {'name': 'Ronald Bruce Sith'"
            columns_data['data_types'].append(type_data)

        if date_df.count()>1:
            type_data={"type":"DATE/TIME", "count":date_df.count()}
            min_date, max_date = date_df.select(F.min(date_col), F.max(date_col)).first()
            type_data['max_value']=max_date
            type_data['min_value']=min_date
            columns_data['data_types'].append(type_data)

        if str_float_df.count()>1:
            type_data={}
            type_data['type']='REAL'
            type_data['count']=str_float_df.count()
            type_data['max_value']=str_float_df.agg({str_float_col: "max"}).collect()[0][0]
            type_data['min_value']=str_float_df.agg({str_float_col: "min"}).collect()[0][0]
            type_data['mean']=str_float_df.agg({str_float_col: "avg"}).collect()[0][0]
            type_data['stddev']=str_float_df.agg({str_float_col: 'stddev'}).collect()[0][0]
            columns_data['data_types'].append(type_data)

        if str_int_df.count()>1:
            type_data={}
            type_data['type']='INTEGER (LONG)'
            type_data['count']=str_int_df.count()
            type_data['max_value']=str_int_df.agg({str_int_col: 'max'}).collect()[0][0]
            type_data['min_value']=str_int_df.agg({str_int_col: 'min'}).collect()[0][0]
            type_data['mean']=str_int_df.agg({str_int_col: 'avg'}).collect()[0][0]
            type_data['stddev']=str_int_df.agg({str_int_col: 'stddev'}).collect()[0][0]
            columns_data['data_types'].append(type_data)
        column_analysis['cols_data'].append(columns_data)
    
    completed_profile["column_analysis"] = column_analysis

    return(completed_profile)

In [20]:
# If folder 'Result' does not exist, create one
create_dir('../Result') 

In [23]:
# List data inputs ready to use
files = get_data_input_list('../Data')
print("the files available for profiling in Data directory:")
print(files)

# debug: get user input
print(10*'_')
for idx, file in enumerate(files):
    print(idx, ':', file)
input_id = int(input('please choose the pofiling file_index:'))

# Choose the dataset file (in case csv)
profiling_input = files[input_id]             # choose the input file you want
print("The current profiling table:", profiling_input)

the files available for profiling in Data directory:
['.DS_Store', 'NASDAQ_100_Data_From_2010.tsv', 'COVID_19_vs_Vaccine_in_Turkey.csv', 'data_paper_sample_10k.csv', 'NASDAQ_100_Data_From_2010.csv']
__________
0 : .DS_Store
1 : NASDAQ_100_Data_From_2010.tsv
2 : COVID_19_vs_Vaccine_in_Turkey.csv
3 : data_paper_sample_10k.csv
4 : NASDAQ_100_Data_From_2010.csv
please choose the pofiling file_index:2
The current profiling table: COVID_19_vs_Vaccine_in_Turkey.csv


### start Profiling

In [24]:
# read the dataframe
filepath='../Data/'+profiling_input
DF = spark.read.format('csv').options(header='true',inferschema='true').load(filepath)
DF.show(5)

+----------+-----------+-----------+----------------+----------------+------------+------------+-----------+-----------------------------+------------------------------+------------------------------+--------------+------------------+------------+------------------------------------------------------------+-----------------------------------------------------------+
|dd_mm_yyyy|daily_cases|total_cases|daily_recoveries|total_recoveries|daily_deaths|total_deaths|daily_tests|daily_first_dose_vaccinations|daily_second_dose_vaccinations|total_second_dose_vaccinations|total_boosters|total_vaccinations|vaccine_type|daily_deaths_over_total_second_dose_vaccinations_per_million|daily_cases_over_total_second_dose_vaccinations_per_million|
+----------+-----------+-----------+----------------+----------------+------------+------------+-----------+-----------------------------+------------------------------+------------------------------+--------------+------------------+------------+---------------

In [25]:
# show column names
columns_names = DF.columns
columns_names

['dd_mm_yyyy',
 'daily_cases',
 'total_cases',
 'daily_recoveries',
 'total_recoveries',
 'daily_deaths',
 'total_deaths',
 'daily_tests',
 'daily_first_dose_vaccinations',
 'daily_second_dose_vaccinations',
 'total_second_dose_vaccinations',
 'total_boosters',
 'total_vaccinations',
 'vaccine_type',
 'daily_deaths_over_total_second_dose_vaccinations_per_million',
 'daily_cases_over_total_second_dose_vaccinations_per_million']

In [26]:
profiling_result = profile_table(DF)
profiling_result

{'general_db_info': {'total_records': 244,
  'total_variables': 16,
  'total_missing_percentage': 0.015368852459016393,
  'estimate_size_in_memory_Mb': 0.020046234130859375,
  'average_record_size_in_memory_Mb': 8.215669725762039e-05},
 'attribute_analysis': {'Numeric': 11, 'Categorical': 2, 'Date': 0},
 'column_analysis': {'cols_data': [{'column_name': 'dd_mm_yyyy',
    'dtype': 'string',
    'record_count': 244,
    'unique_values': 244,
    'number_non_empty_cells': 244,
    'number_empty_cells': 0,
    'null_proportion': 0.0,
    'top_five_value': ['3.02.2021',
     '15.04.2021',
     '29.08.2021',
     '6.08.2021',
     '27.01.2021'],
    'data_types': [{'type': 'TEXT',
      'count': 244,
      'shortest_values': ['7.06.2021',
       '6.03.2021',
       '7.03.2021',
       '4.04.2021',
       '9.07.2021'],
      'longest_values': ['27.01.2021',
       '10.06.2021',
       '23.01.2021',
       '20.02.2021',
       '14.07.2021'],
      'average_length': 9.704918032786885}]},
   {'c

### Store Json

In [27]:
# take the file name
profiling_input_name = str(profiling_input).split('.')[0]

# Save the Json file
date = datetime.now().strftime("%Y_%m_%d-%I_%M_%p")
file_name = f"../Result/{profiling_input_name}_profiled_table_{date}"
save_json(profiling_result,f"{file_name}.json")

### Create CSV format from the JSON 

In [None]:
# make CSV file
spark_df = spark.read.json(sc.parallelize(profiling_result["column_analysis"]["cols_data"]))
data_exploded = spark_df.select('column_name', 
                                'dtype', 
                                'record_count',
                                'number_non_empty_cells',
                                'number_empty_cells',
                                'null_proportion',
                                'unique_values',
                                'top_five_value',
                                F.explode('data_types').alias('data_types')
                               ) 

data_exploded = data_exploded.select('column_name', 
                                     'dtype',
                                     'record_count',
                                     'number_non_empty_cells',
                                     'number_empty_cells',
                                     'null_proportion',
                                     'unique_values',
                                     'top_five_value', 
                                     'data_types.*'
                                    )          
# Write the CSV file
data_exploded.withColumn("top_five_value", F.col("top_five_value").cast("string"))   \
             .withColumn("longest_values", F.col("longest_values").cast("string"))   \
             .withColumn("shortest_values", F.col("shortest_values").cast("string")) \
             .coalesce(1)                                                            \
             .write.option("header",True).csv(file_name)

# For Debug Purpose

### Column wide analysis

In [104]:
cols_data=[]

for i, cols in enumerate(DF.columns):
    if total_rows==0:
        continue
    columns_data={}
    columns_data['column_name']=cols
    columns_data['dtype']= DF.dtypes[i][1]
    columns_data['record_count']=total_rows
    columns_data['unique_values']=compute_unique_values[i]
    columns_data['number_non_empty_cells']=compute_not_null_columns[i]
    columns_data['number_empty_cells']=compute_null_columns[i]
    columns_data['null_proportion']=compute_null_proportion[i]
    columns_data['top_five_value'] = get_top_five_frequent_record(DF, cols)
    
    
    # Data type specific analysis
    int_col=cols+' '+'int_type'
    str_col=cols+' '+'str_type'
    float_col=cols+ ' '+ 'float_type'
    date_col=cols+' '+'date_type'
    str_int_col=cols + ' '+'str_int'
    str_float_col=cols +' '+'str_float'
    
    df=DF.select([get_int(cols).alias(int_col), 
                  get_str(cols).alias(str_col), 
                  get_flt(cols).alias(float_col), 
                  get_dt(cols).alias(date_col),
                  get_string_int(cols).alias(str_int_col),
                  get_string_flt(cols).alias(str_float_col)
                 ])
    
    int_df = df.select(int_col).where(F.col(int_col).isNotNull())
    str_df = df.select(str_col).where(F.col(str_col).isNotNull())
    float_df = df.select(float_col).where(F.col(float_col).isNotNull())
    date_df = df.select(date_col).where(F.col(date_col).isNotNull())
    str_int_df = df.select(str_int_col).where(F.col(str_int_col).isNotNull())
    str_float_df = df.select(str_float_col).where(F.col(str_float_col).isNotNull())
    
    columns_data['data_types']=[]
    
    if float_df.count()>1:
        type_data={}
        type_data['type']='REAL'
        type_data['count']=float_df.count()
        type_data['max_value']=float_df.agg({float_col: "max"}).collect()[0][0]
        type_data['min_value']=float_df.agg({float_col: "min"}).collect()[0][0]
        type_data['mean']=float_df.agg({float_col: "avg"}).collect()[0][0]
        type_data['stddev']=float_df.agg({float_col: 'stddev'}).collect()[0][0]
        columns_data['data_types'].append(type_data)

    if int_df.count()>1:
        type_data={}
        type_data['type']='INTEGER (LONG)'
        type_data['count']=int_df.count()
        type_data['max_value']=int_df.agg({int_col: 'max'}).collect()[0][0]
        type_data['min_value']=int_df.agg({int_col: 'min'}).collect()[0][0]
        type_data['mean']=int_df.agg({int_col: 'avg'}).collect()[0][0]
        type_data['stddev']=int_df.agg({int_col: 'stddev'}).collect()[0][0]
        columns_data['data_types'].append(type_data)

    if str_df.count()>1:
        type_data={'type':'TEXT', 'count': str_df.count()}
        str_rows=str_df.distinct().collect()
        str_arr=[row[0] for row in str_rows]
        if len(str_arr)<=5:
            type_data['shortest_values']=str_arr
            type_data['longest_values']=str_arr

        else:
            str_arr.sort(key=len, reverse=True)
            type_data['shortest_values']=str_arr[-5:]
            type_data['longest_values']=str_arr[:5]

        type_data['average_length']=sum(map(len, str_arr))/len(str_arr)
        columns_data['data_types'].append(type_data)

    if date_df.count()>1:
        type_data={"type":"DATE/TIME", "count":date_df.count()}
        min_date, max_date = date_df.select(F.min(date_col), F.max(date_col)).first()
        type_data['max_value']=max_date
        type_data['min_value']=min_date
        columns_data['data_types'].append(type_data)

    if str_float_df.count()>1:
        type_data={}
        type_data['type']='REAL'
        type_data['count']=str_float_df.count()
        type_data['max_value']=str_float_df.agg({str_float_col: "max"}).collect()[0][0]
        type_data['min_value']=str_float_df.agg({str_float_col: "min"}).collect()[0][0]
        type_data['mean']=str_float_df.agg({str_float_col: "avg"}).collect()[0][0]
        type_data['stddev']=str_float_df.agg({str_float_col: 'stddev'}).collect()[0][0]
        columns_data['data_types'].append(type_data)

    if str_int_df.count()>1:
        type_data={}
        type_data['type']='INTEGER (LONG)'
        type_data['count']=str_int_df.count()
        type_data['max_value']=str_int_df.agg({str_int_col: 'max'}).collect()[0][0]
        type_data['min_value']=str_int_df.agg({str_int_col: 'min'}).collect()[0][0]
        type_data['mean']=str_int_df.agg({str_int_col: 'avg'}).collect()[0][0]
        type_data['stddev']=str_int_df.agg({str_int_col: 'stddev'}).collect()[0][0]
        columns_data['data_types'].append(type_data)
    cols_data.append(columns_data)


In [214]:
pandas_df = data_exploded.toPandas()
pandas_df = pandas_df.set_index("column_name")
# pandas_df = pandas_df[['dtype','record_count','number_non_empty_cells','number_empty_cells','null_proportion','unique_values','top_five_value','data_types']]
pandas_df

Unnamed: 0_level_0,dtype,record_count,number_non_empty_cells,number_empty_cells,null_proportion,unique_values,top_five_value,max_value,min_value,mean,stddev,count,longest_values,shortest_values,average_length,type
column_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
_c0,int,291,291,0,0.0,291,"[44, 159, 192, 271, 31]",290.0,0.0,145.0,84.14868,291,,,,INTEGER (LONG)
id,int,291,291,0,0.0,291,"[73220, 136660, 2067561, 5140417, 8232766]",8870360.0,40449.0,4705530.0,2536564.0,291,,,,INTEGER (LONG)
authors,string,291,291,0,0.0,291,"[[{'name': 'Thierry Despeyroux', 'id': 2721087...",,,,,291,"[[{'name': 'Érika Cota', 'org': 'PPGC---Inst. ...","[[{'name': 'Li Gong', 'id': 2479801864}], [{'n...",145.329897,TEXT
title,string,291,291,0,0.0,291,[ESL/EFL Websites: What Should the Teachers an...,,,,,291,[A Study of the Experimental Validation of Fau...,"[ 'id': 2155134101}]"", 'id': 98411351}]"", 'i...",61.487973,TEXT
year,string,291,291,0,0.0,12,"[2001, Irvine', Nanjing, {'name': 'Fanny Wa...",,,,,291,[Evaluation of an Automatically Obtained Shape...,"[ 'id': 2661349709}, Villetaneuse, Irvine', ...",30.166667,TEXT
year,string,291,291,0,0.0,12,"[2001, Irvine', Nanjing, {'name': 'Fanny Wa...",2001.0,2001.0,2001.0,0.0,280,,,,REAL
year,string,291,291,0,0.0,12,"[2001, Irvine', Nanjing, {'name': 'Fanny Wa...",2001.0,2001.0,2001.0,0.0,280,,,,INTEGER (LONG)
n_citation,string,291,291,0,0.0,46,"[0, 1, 2, 4, 3]",,,,,291,"[ People's Republic of China#TAB#"""", Dual Pert...","[6, 9, 1, 4, 2]",5.26087,TEXT
n_citation,string,291,291,0,0.0,46,"[0, 1, 2, 4, 3]",2001.0,0.0,38.10915,242.4903,284,,,,REAL
n_citation,string,291,291,0,0.0,46,"[0, 1, 2, 4, 3]",2001.0,0.0,38.10915,242.4903,284,,,,INTEGER (LONG)


### Fix TSV to CSV format tables

In [6]:
import pandas as pd
test_df = pd.read_csv("../Data/COVID_19_vs_Vaccine_in_Turkey.csv",sep=',')
test_df

Unnamed: 0,dd_mm_yyyy,daily_cases,total_cases,daily_recoveries,total_recoveries,daily_deaths,total_deaths,daily_tests,daily_first_dose_vaccinations,daily_second_dose_vaccinations,total_second_dose_vaccinations,total_boosters,total_vaccinations,vaccine_type,daily_deaths_over_total_second_dose_vaccinations_per_million,daily_cases_over_total_second_dose_vaccinations_per_million
0,13.01.2021,9554,2355839,9463,2227927,173,23325,173603,0,0.0,0,0,0,Sinovac,,
1,14.01.2021,8962,2364801,9011,2236938,170,23495,169847,279452,0.0,0,0,279452,Sinovac,,
2,15.01.2021,8314,2373115,9109,2246047,169,23664,167211,337200,0.0,0,0,616652,Sinovac,,
3,16.01.2021,7550,2380665,8005,2254052,168,23832,156792,60251,0.0,0,0,676903,Sinovac,,
4,17.01.2021,6436,2387101,8812,2262864,165,23997,148636,29548,0.0,0,0,706451,Sinovac,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,9.09.2021,23846,6590414,31322,6055819,257,59170,314793,50930700,301452.0,39599848,9737377,100267925,"BioNTech, Sinovac",6.0,602.0
240,10.09.2021,23562,6613976,35083,6090902,214,59384,318835,51234095,363388.0,39963236,9851336,101048667,"BioNTech, Sinovac",5.0,590.0
241,11.09.2021,22923,6636899,30144,6121046,259,59643,314046,51411717,195137.0,40158373,9884664,101454754,"BioNTech, Sinovac",6.0,571.0
242,12.09.2021,21352,6658251,25616,6146662,243,59886,310546,51536013,147339.0,40305712,9906933,101748658,"BioNTech, Sinovac",6.0,530.0


In [2]:
import pandas as pd 
tsv_file='../Data/NASDAQ_100_Data_From_2010.tsv'
csv_table=pd.read_table(tsv_file,sep='\t')
csv_table.to_csv('../Data/NASDAQ_100_Data_From_2010.csv',index=False)
#NASDAQ_100_Data_From_2010.csv

### Front end Practices

In [10]:
from jinja2 import Template

name = input("Enter your name: ")

tm = Template("Hello {{ name }}")
msg = tm.render(name=name)

print(msg)

Enter your name: Erfan
Hello Erfan


In [13]:
from jinja2 import Environment, FileSystemLoader

persons = [
    {'name': 'Andrej', 'age': 34}, 
    {'name': 'Mark', 'age': 17}, 
    {'name': 'Thomas', 'age': 44}, 
    {'name': 'Lucy', 'age': 14}, 
    {'name': 'Robert', 'age': 23}, 
    {'name': 'Dragomir', 'age': 54}
]

file_loader = FileSystemLoader('templates')
env = Environment(loader=file_loader)

template = env.get_template('showpersons.txt')

output = template.render(persons=persons)
print(output)

Andrej 34
Mark 17
Thomas 44
Lucy 14
Robert 23
Dragomir 54



In [21]:
from jinja2 import Environment, FileSystemLoader

persons = [
    {'name': 'Andrej', 'age': 34}, 
    {'name': 'Mark', 'age': 17}, 
    {'name': 'Thomas', 'age': 44}, 
    {'name': 'Lucy', 'age': 14}, 
    {'name': 'Robert', 'age': 23}, 
    {'name': 'Dragomir', 'age': 54}, 
]

file_loader = FileSystemLoader('templates')
env = Environment(loader=file_loader)
env.trim_blocks = True
env.lstrip_blocks = True
env.rstrip_blocks = True

template = env.get_template('showminors.txt')

output = template.render(persons=persons)
print(output)

Mark
Lucy



In [29]:
from jinja2 import Environment, FileSystemLoader

file_loader = FileSystemLoader('templates')
env = Environment(loader=file_loader)
env.trim_blocks = True
env.lstrip_blocks = True
env.rstrip_blocks = True

template = env.get_template('index.html')

output = template.render(segment='')
print(output)

<!DOCTYPE html>
<html lang="en">

<head>
    <title>
        Jinja Datta Able -  Dashboard  | AppSeed
    </title>
    <!-- HTML5 Shim and Respond.js IE10 support of HTML5 elements and media queries -->
    <!--[if lt IE 10]>
		<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
		<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
		<![endif]-->
    <!-- Meta -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=0, minimal-ui">
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <link rel="canonical" href="https://appseed.us/admin-dashboards/flask-dashboard-dattaable">
    
    <meta name="description" content="Datta Able Bootstrap admin template made using Bootstrap 4 and it has huge amount of ready made feature, UI components, pages which completely fulfills any dashboard needs." />
    <meta name="keywords" content="admin templates, bootstrap adm

In [30]:
# Display the HTML on IPython notebook

from IPython.core.display import HTML
HTML(output)

0,1,2,3
,Isabella Christensen  Lorem Ipsum is simply…,11 MAY 12:56,RejectApprove
,Mathilde Andersen  Lorem Ipsum is simply text of…,11 MAY 10:35,RejectApprove
,Karla Sorensen  Lorem Ipsum is simply…,9 MAY 17:38,RejectApprove
,Ida Jorgensen  Lorem Ipsum is simply text of…,19 MAY 12:56,RejectApprove
,Albert Andersen  Lorem Ipsum is simply dummy…,21 July 12:56,RejectApprove

User,Activity,Time,Status,Unnamed: 4
Ida Jorgensen,The quick brown fox,3:28 PM,Done,
Albert Andersen,Jumps over the lazy,2:37 PM,Missed,
Silje Larsen,Dog the quick brown,10:23 AM,Delayed,
Ida Jorgensen,The quick brown fox,4:28 PM,Done,

User,Activity,Time,Status,Unnamed: 4
Albert Andersen,Jumps over the lazy,2:37 PM,Missed,
Ida Jorgensen,The quick brown fox,3:28 PM,Done,
Ida Jorgensen,The quick brown fox,4:28 PM,Done,
Silje Larsen,Dog the quick brown,10:23 AM,Delayed,

User,Activity,Time,Status,Unnamed: 4
Silje Larsen,Dog the quick brown,10:23 AM,Delayed,
Ida Jorgensen,The quick brown fox,3:28 PM,Done,
Albert Andersen,Jumps over the lazy,2:37 PM,Missed,
Ida Jorgensen,The quick brown fox,4:28 PM,Done,


### Spin up a flask backend to host the html page

In [25]:
from flask import Flask, render_template
app = Flask(__name__, template_folder="templates")

@app.route('/')
def home():
   return render_template('index.html',segment='')

app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
[2m   Use a production WSGI server instead.[0m
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET / HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/fonts/fontawesome/css/fontawesome-all.min.css HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/plugins/animation/css/animate.min.css HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/css/style.css HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/js/vendor-all.min.js HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/plugins/bootstrap/js/bootstrap.min.js HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/js/pcoded.min.js HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/images/user/avatar-1.jpg HTTP/1.1[0m" 200 -
127.0.0.1 - - [17/Sep/2021 21:41:27] "[37mGET /static/assets/images/user/avatar-3.jpg HTTP/1.1[0

In [31]:
spark.stop()