#!/usr/bin/env python
#coding: utf-8

### How to generate python file from ipynb notebook
When generating a .py file from an .ipynb notebook:
1. ALWAYS edit the .ipynb file in Jupyter Notebook and make sure it runs without errors.
2. Convert the notebook into a Python script (output.py) by running the following command at the Windows command prompt:
    * ipynb-py-convert chinmay_tools.ipynb output.py
3. Never modify the .py file directly; always regenerate it with the "ipynb-py-convert" tool.
4. Compare the generated Python file against the pre-existing Python file to see what changed in the code.
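
The comparison in step 4 can also be done from Python. The following is a minimal sketch using the standard `difflib` module; the helper name `diff_scripts` and the two inline script texts are assumptions made for illustration, and in practice the texts would be read from the old and the regenerated .py files.

```python
import difflib

def diff_scripts(old_text, new_text, old_name="old.py", new_name="output.py"):
    """Return a unified diff between two script texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))

# Inline stand-ins for the two files (hypothetical content)
old_script = "def add(a, b):\n    return a + b\n"
new_script = "def add(a, b):\n    return a + b + 0\n"

for line in diff_scripts(old_script, new_script):
    print(line)
```

An empty list means the regenerated script is identical to the previous one.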

### How to generate html or pdf from the ipynb file for documentation
* jupyter nbconvert --help
    * This prints the help on the command
* jupyter nbconvert --to=html test.ipynb
    * This creates an HTML file from the ipynb file, which can later be printed as a PDF
    * In older nbconvert releases the command is equivalent to "jupyter nbconvert test.ipynb" because html is the default output format; newer releases may require --to to be given explicitly.
    * For PDF, either print the HTML or produce the PDF directly with --to=pdf; direct PDF generation is more troublesome because it needs additional installations (for example a LaTeX distribution).

##### Merging two cells (SHIFT+M) and splitting a cell (SHIFT+CTRL+MINUS)
* To split a cell in two, press "SHIFT+CTRL+MINUS" with the cursor at the split point while editing the cell
* To merge two cells, press "SHIFT+M" while the cell is in command mode (not editing)

##### How to find the time taken by the commands in a Jupyter notebook cell
* Put "%%time" as the first line in the cell to get the time taken by one run of the cell
* Put "%%timeit" as the first line to get the average time per run, measured by running the cell's commands repeatedly in multiple loops
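
The two magics only work inside IPython/Jupyter. Outside a notebook, a similar measurement can be sketched with the standard `timeit` module; the function `sum_of_squares` and the repeat/loop counts below are assumptions chosen only for illustration.

```python
import timeit

def sum_of_squares(n):
    """Toy workload used only for illustration."""
    return sum(i * i for i in range(n))

# One run, roughly what %%time reports as wall time
start = timeit.default_timer()
sum_of_squares(10_000)
elapsed = timeit.default_timer() - start
print(f"single run: {elapsed:.6f} s")

# Several repeats of many loops, similar in spirit to %%timeit
LOOPS = 50
runs = timeit.repeat(lambda: sum_of_squares(10_000), repeat=3, number=LOOPS)
per_loop = [t / LOOPS for t in runs]
print(f"mean per loop: {sum(per_loop) / len(per_loop):.6f} s")
```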

#### How to get Help on anything in Jupyter Notebook
* If you do not know the package details, go to the documentation site (such as spark.apache.org) and search under the API section
* If you know the package and it is already imported into the notebook, then you can
    * put a question mark after the module or property name and press SHIFT+ENTER to get detailed help
    * press SHIFT+TAB after the module name, or inside the brackets after a function name, to get inline help
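
The question-mark syntax and SHIFT+TAB are notebook features. In plain Python, roughly the same information is available through `help()` and the standard `inspect` module; the sketch below uses `json.dumps` purely as an example target.

```python
import inspect
import json

# help(json.dumps) would print the full help text to the console

# Call signature, similar to what the SHIFT+TAB tooltip shows
print(inspect.signature(json.dumps))

# First line of the docstring, similar to the start of the "?" output
first_doc_line = inspect.getdoc(json.dumps).splitlines()[0]
print(first_doc_line)
```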

### How to add our own Python Library in another python script
###### To add this library (the current file) to the Python path so that other scripts can import from it, follow the steps below:
* Write / generate your library, the python file.
* Put the path of the folder containing the python file in the "PYTHONPATH" environment variable, or add it to sys.path at run time (Python copies the PYTHONPATH entries into sys.path at startup)
* sys.path is the list of directories that Python searches for modules.
* We can add the location of our python scripts using sys.path.append(my_lib_path) to append at the end of the list, or sys.path.insert(npos, my_lib_path) to insert at a specific position.
    * import sys
    * sys.path.insert(1, my_py_location)  ## This adds "my_py_location" before the second element; with 0 it is placed before the first element
* Now import what you need from your library
    * SYNTAX: <i>from my_python_script import my_module</i>
    >* my_python_script is the python file name without the ".py" extension
    >* my_module is a function (or class) defined inside the above python file
* Example:
    * <i>import sys</i>
    * <i>sys.path.append('C:/Users/chinuser/my_pyclass_folder')</i>
    * <i>from tools.chinmay_tools import *</i>
    * The above code snippet assumes that we have a "tools" folder under "C:/Users/chinuser/my_pyclass_folder" which contains our custom library files such as "chinmay_tools.py", and that we want to import all functions from this file.
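
The same steps can be tried end to end without touching a real library folder. This is a minimal sketch: the temporary folder, the module name `demo_tools` and the function `greet` are all hypothetical and are created only for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical library folder and module, created here only for the demo
lib_dir = tempfile.mkdtemp()
with open(os.path.join(lib_dir, "demo_tools.py"), "w", encoding="utf-8") as f:
    f.write("def greet(name):\n    return 'Hello, ' + name\n")

# Make the folder importable for the current session only
if lib_dir not in sys.path:
    sys.path.insert(1, lib_dir)

# The file was created after the interpreter started, so refresh the import caches
importlib.invalidate_caches()
from demo_tools import greet

print(greet("world"))
```

Changes to sys.path last only for the running session; setting PYTHONPATH makes the folder available to every new Python process.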
    
#### If you want to import functions from a pre-existing notebook (.ipynb) directly
* Follow the link https://stackoverflow.com/questions/20186344/ipynb-import-another-ipynb-file
* ALTERNATIVELY, we can convert the notebook file (.ipynb) into a python file using the tool "ipynb-py-convert" and use the python file as a library as mentioned earlier.
    * Syntax:  "ipynb-py-convert in.ipynb out.py" [use "ipynb-py-convert.exe --help" to get help]

# SPARK DOCUMENTATION

#### pyspark API Documentation:
* http://spark.apache.org/docs/latest/
* http://spark.apache.org/docs/latest/ml-guide.html
* https://spark.apache.org/docs/latest/api/python/

## [Introduction to Statistical Learning](<https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf>)

#### Importing Jupyter Notebooks as Modules
https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Importing%20Notebooks.html

###### Using the functions from one .ipynb notebook in another .ipynb file
* https://stackoverflow.com/questions/44116194/import-a-function-from-another-ipynb-file
* https://github.com/ipython/ipynb

###### Jupyter Notebook Shortcuts
* https://www.dataquest.io/blog/advanced-jupyter-notebooks-tutorial/
* https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

# MY LIBRARY

##### Enable the shell to print multiple results (instead of only the last result)

In [None]:
## Enable the shell to print multiple results (instead of only the last result)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

###### Changing text format using python print statement
* https://kite.com/python/answers/how-to-print-in-bold-in-python
* ASCII TABLE: http://ascii-table.com/ansi-escape-sequences.php
* https://en.wikipedia.org/wiki/ANSI_escape_code

###### Printing formatted text using python print() function
* Escape character:  \x1B (hexadecimal) or \033 (octal)
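
As a quick check of the escape character above, the sketch below builds the escape sequences from numeric attribute codes. The helper name `sgr` is an assumption made for illustration; joining several codes with ";" in one sequence is standard ANSI behavior, but how the result renders depends on the terminal or notebook front end.

```python
ESC = "\x1B"          # same character as "\033"
RESET = ESC + "[0m"   # turns all attributes off

def sgr(*codes):
    """Build an escape sequence from numeric attribute codes, e.g. sgr(1, 4)."""
    return ESC + "[" + ";".join(str(c) for c in codes) + "m"

# Hexadecimal and octal spellings denote the same character
assert "\x1B" == "\033"

print(sgr(1) + "bold" + RESET)
print(sgr(4) + "underlined" + RESET)
print(sgr(1, 4, 31) + "bold, underlined, red" + RESET)
```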

In [None]:
def printHelpOnFormattedText():
    r"""
    This method explains the starting and ending codes for printing formatted text in python.
    The print syntax is: print("\x1B[S" + string_to_be_formatted + "\x1B[E")
    where S is the starting attribute indicator and E is the ending attribute indicator.
    S--> Starting indicator--> 7m:color_inversion, 1m:bold, 4m:underline, 31m-37m:font_colors, 40m-47m:font_backgnd, 92m-97m:font_colors
    E--> Ending indicator--> 0m --> This resets all the attributes set till now.
    """
    print("\nFORMAT: print(\"\\x1B[S\" + string_to_be_formatted + \"\\x1B[E\")")

    print("Where in Starting format (\"\\x1B[S\") \n\tS is starting attribute indicator\n\t 7m:color_inversion, 1m:bold, 4m:underline, 31m-37m:font_colors, 40m-47m:font_backgnd, 92m-97m:font_colors")
    print("and in Ending format (\"\\x1B[E\") \n\tE is ending attribute indicator to turn attributes off, and it is '0m' i.e. [0m")
    print("\nEXAMPLE: print(\"\\x1B[4m\\x1B[7m\" + string_to_be_formatted + \"\\x1B[0m\")")

    # Live samples; the trailing text describes what each combination actually produces
    print("\x1B[43m\x1B[4m\x1B[7m" + "string_to_be_formatted" + "\x1B[0m" + " produces underlined, color-inverted text with a yellow background set")
    print("\x1B[4m" + "string_to_be_formatted" + "\x1B[0m" + " produces underlined text")


def getFormatted(string1, formatCode):
    """
    Returns the passed-in string formatted with the starting attribute indicator passed in.
    
    Parameters
    ----------
    formatCode: the starting attribute indicator for the desired format. E.g: "4m" to underline the text passed as first parameter.
    Starting indicator--> 7m:color_inversion, 1m:bold, 4m:underline, 31m-37m:font_colors, 40m-47m:font_backgnd, 92m-97m:font_colors
    """
    return("\x1B["+ formatCode + string1 + "\x1B[0m")

def printFormatted(string1, formatCode):
    """
    Prints the passed-in string formatted with the starting attribute indicator passed in.
    
    Parameters
    ----------
    formatCode: the starting attribute indicator for the desired format. E.g: "4m" to underline the text passed as first parameter.
    Starting indicator--> 7m:color_inversion, 1m:bold, 4m:underline, 31m-37m:font_colors, 40m-47m:font_backgnd, 92m-97m:font_colors
    """
    print(getFormatted(string1, formatCode))

def getBold(string1):
    """
    Returns the passed-in string formatted as bold
    """
    return getFormatted(string1, "1m")

def getUnderlined(string1):
    """
    Returns the passed-in string as underlined
    """
    return getFormatted(string1, "4m")

def getColorInverted(string1):
    """
    Returns the passed-in string with foreground and background color swapped.
    """
    return getFormatted(string1, "7m")

def printHighlighted(string1):
    """
    Prints the passed-in string as bold with foreground and background color swapped.
    """
    print(getColorInverted(getBold(string1)))

def printUnderlined(string1):
    """
    Prints the passed-in string as bold and underlined
    """
    print(getUnderlined(getBold(string1)))

###### Print contents of a text file to the console

In [None]:
# from codecs import open
# Print contents of a file

def printTextFile(file_name):
    """
    Opens a text file (txt, csv, xml etc.) with 'utf-8' encoding and prints its contents.
    It accepts a relative path or a full path to a file in the same file system as this utility.

    Parameters:
    ----------
    file_name: Path of the text file to print, e.g. the xml call log generated by the android SuperBackup application.
    """
    # The 'with' block closes the file even if reading fails
    with open(file_name, 'r', encoding='utf-8') as f:
        file_contents = f.read()
    print(file_contents)

###### Tools using Spark DataFrames
* Convert Pandas DataFrame to Spark DataFrame and vice versa
* Get Json from the Spark DataFrame
* Mask specified columns in a Spark DataFrame

In [None]:
from pyspark.sql import SparkSession

def getSparkDFfromPandasDF(pandasDF):
    """
    Takes a Pandas DataFrame and converts it into a Spark DataFrame
    """
    tempSparkSession = SparkSession.builder.appName("chin_conv").getOrCreate()

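    # Enable Arrow-based columnar data transfers.
    # From Spark 3.0 this key is deprecated in favor of "spark.sql.execution.arrow.pyspark.enabled",
    # so on some machines it may give a warning, which may be ignored.
    tempSparkSession.conf.set("spark.sql.execution.arrow.enabled", "true")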

    # Generate a random pandas DataFrame
    # temp_pandasDF = pd.DataFrame(np.random.rand(100, 3))

    # Create a Spark DataFrame from a pandas DataFrame using Arrow
    sparkDF = tempSparkSession.createDataFrame(pandasDF)
    return sparkDF

def getPandasDFfromSparkDF(sparkDF):
    """
    Takes a Spark DataFrame and converts it into a Pandas DataFrame
    """
    # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
    result_pandasDF = sparkDF.select("*").toPandas()
    return result_pandasDF

import json
import pyspark

"""
Here I am importing the full pyspark package for readability.
In the method getJsonFromSparkDF(), I validate whether the parameter is a Spark DataFrame.
I could have compared with DataFrame after a selective import (from pyspark.sql.dataframe import DataFrame),
but users may confuse it with the pandas DataFrame if they accidentally miss the import statement.

So I use fully qualified names for the comparison, which improves readability.

DATA ISSUE:
In the xml output of the SuperBackUp application the data is generated in HTML-encoded form,
e.g. if a name is "Rama's friend" then it is generated as "Rama&apos;s friend",
and during python processing with data frames it gets replaced with "Rama\'s friend".
JSON validators like Firefox fail at that escape sequence.
* So once the json is generated, remove the backslash character, i.e. replace "Rama\'s friend" with "Rama's friend"
ELSE
* Remove single or double quotes from your phone contact names so that this problem does not occur at all in the SuperBackUp application backup output.
"""

def getJsonFromSparkDF(sparkDF):
    """
    Takes a Spark DataFrame, collects its rows as a list of Spark Rows (type pyspark.sql.types.Row)
    and returns them as a JSON string of the form {"Row 0": {...}, "Row 1": {...}, ...}.
    If the input is not a Spark DataFrame, or it has no rows, the empty JSON object string '{}' is returned.
    """
    if not isinstance(sparkDF, pyspark.sql.dataframe.DataFrame):
        return '{}'
    sparkRowList = sparkDF.collect()   # collect() returns the DataFrame rows as a list of Row objects
    if not sparkRowList:               # an empty DataFrame would otherwise fail on sparkRowList[0]
        return '{}'
    if not isinstance(sparkRowList[0], pyspark.sql.types.Row):
        return '{}'

    # Build one "Row n" entry per row and let json.dumps handle separators and escaping
    rowsByLabel = {"Row " + str(i): row.asDict() for i, row in enumerate(sparkRowList)}
    return json.dumps(rowsByLabel)


def getMaskedSpartDF(sparkDF, cols_to_mask):
    """
    Masks the specified columns in the passed spark dataframe.
    Returns the input unchanged in case of any input error.
    
    Parameters:
    ----------
    sparkDF: The input spark data frame
    cols_to_mask: List of column names which need to be masked in the passed in dataframe.
    """
    if not (isinstance(sparkDF, pyspark.sql.dataframe.DataFrame)):
        return sparkDF

    # cols_to_mask = ['Close', 'High', 'Volume']  # Received from parameter
    cols_from_df = sparkDF.columns
    # Final column list to mask: only the requested columns that exist in the dataframe
    cols_to_consider = [s for s in cols_to_mask if s in cols_from_df]
    if len(cols_to_consider) == 0:
        return sparkDF
    
    from pyspark.sql.functions import lit   # To write string literals
    
    MASK_PATTERN = '*** masked ***'
    sdf_masked = sparkDF
    # Loop over the filtered list so that no new columns are created for unknown names
    for cn in cols_to_consider:
        sdf_masked = sdf_masked.withColumn(cn, lit(MASK_PATTERN))

    return sdf_masked
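
The DATA ISSUE note in the cell above (the `&apos;` entity and the `\'` sequence that JSON validators reject) can be illustrated with the standard library alone. This is a minimal sketch, not the processing used by the functions above; the sample name is the one from the note, and the `broken` string is a hand-written stand-in for JSON produced elsewhere.

```python
import html
import json

raw_name = "Rama&apos;s friend"      # as found in the exported XML text
name = html.unescape(raw_name)       # decode HTML entities into plain characters
record = {"name": name}

# json.dumps does not put a backslash before an apostrophe
as_json = json.dumps(record)
print(as_json)

# If a JSON string produced elsewhere already contains \', strip the backslash
broken = '{"name": "Rama\\\'s friend"}'
fixed = broken.replace("\\'", "'")
print(json.loads(fixed)["name"])
```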

###### Get a pandas DataFrame of the call logs generated by the SuperBackUp android application (convert it with getSparkDFfromPandasDF() for the Spark DataFrame version)

In [None]:
def getCallLogXmlFromSuperbackup(call_log_xml_file="test_data/calllogs_20200512130135.xml"):
    """
    Parses the call logs (xml) exported from the SuperBackUp android application and returns them as a pandas DataFrame.
    The xml file structure is "alllogs > log", i.e. "log" records under the "alllogs" root.
    Each log record has the attributes ["number", "time", "date", "type", "name", "dur"].
    To read the xml we use the module "xml.etree.ElementTree"

    Parameters
    ----------
    call_log_xml_file : A string representing the xml file name (may be a full path) - This is the log file produced from the SuperBackUp android application.

    """
    import pandas as pd 
    import xml.etree.ElementTree as etree

    tree = etree.parse(call_log_xml_file)
    root = tree.getroot()
    columns = ["number", "time", "date", "type", "name", "dur"] #The column list is closely tied to the call log xml

    # Collect one row per <log> node; a missing attribute becomes None.
    # Building the DataFrame once avoids DataFrame.append(), which was removed in pandas 2.0,
    # and avoids shadowing the built-in name 'type'.
    rows = [[node.attrib.get(col) for col in columns] for node in root]
    df_Calllogs = pd.DataFrame(rows, columns=columns)
    
    return df_Calllogs
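
The parsing step can be tried without a real export file. The sketch below uses only `xml.etree.ElementTree` on an inline string; the two sample records and all their values are hypothetical, and only the element and attribute names are taken from the description above.

```python
import xml.etree.ElementTree as etree

# Hypothetical two-record sample in the "alllogs > log" layout
SAMPLE_XML = """<alllogs>
  <log number="12345" time="12 May 2020 13:01" date="1589268695000" type="1" name="Rama" dur="42" />
  <log number="67890" time="12 May 2020 13:05" date="1589268935000" type="2" name="Sita" dur="7" />
</alllogs>"""

COLUMNS = ["number", "time", "date", "type", "name", "dur"]

def parse_call_logs(xml_text):
    """Return one dict per <log> element, keyed by the attribute names."""
    root = etree.fromstring(xml_text)
    return [{col: node.attrib.get(col) for col in COLUMNS} for node in root]

records = parse_call_logs(SAMPLE_XML)
print(records[0]["name"], records[0]["dur"])
```

All attribute values come back as strings, so numeric columns such as "dur" need an explicit conversion before any arithmetic.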

#### TEST CODE
* Finds the subset of a list that exists in another list

In [None]:
def checkFilteredList():
    printHighlighted("Testing for logic inside 'getMaskedSpartDF()'")
    requested = ['a5', 'a1', 'a2', 'a3']   # named 'requested' to avoid shadowing the built-in input()
    cols = ['a1', 'a4', 'a2', 'a4', 'a5', 'b1']
    output = []
    for s in requested:
        if s in cols:
            output.append(s)
    print(output)
    if len(output) == 0:
        print('Nothing to process')
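
The same filtering can be written as a list comprehension. This is a small sketch with a hypothetical helper name `filter_existing`; it keeps the order of the requested names and uses a set for the membership test.

```python
def filter_existing(requested, available):
    """Keep the requested names that also appear in `available`, preserving order."""
    available_set = set(available)   # set membership is O(1) on average
    return [name for name in requested if name in available_set]

print(filter_existing(['a5', 'a1', 'a2', 'a3'], ['a1', 'a4', 'a2', 'a4', 'a5', 'b1']))
```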

### LATEX EXAMPLES

#### Some Text Formatting
Examples of text styles in Markdown

*italic* (with SINGLE STAR), **bold** (with DOUBLE STARS), _italic_ (with UNDERSCORE), ~~strikethrough~~ (with DOUBLE TILDES)

#### Writing scientific equations with Latex
* Refer: https://www.tutorialspoint.com/tex_commands/hash.htm
* Mathematical Notations: https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html

###### Some important LATEX equations:
<details><summary><b>Expand to view more examples..</b>$$\sqrt{\sum_{i=1}^n i = \frac{n(n+1)}{2}}$$</summary>

%%latex is optional
<table border=1  width="80%">
    <tr>
        <td>$\sum_{i=1}^n i = \frac{n(n+1)}{2}$ Used single '\$' for inline expression</td>
        <td>$$\sum_{i=1}^n i = \frac{n(n+1)}{2}$$ Used double '\$\$' for equation form on a different line</td>
        <td>$\sum\limits_{i=1}^n i = \frac{n(n+1)}{2}$ '\limits with \sum' gives same result as '\$\$' but puts as inline text </td>
    </tr>
    <tr>
        <td>$E = mc^2$</td>
        <td>$e^{i \pi} = -1$</td>
        <td>FILL IT</td>
    </tr>
    <tr>
        <td>$\sum\limits_{i=1}^n i = \frac{n(n+1)}{2}$</td>
        <td>$\sum\limits_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}$</td>
        <td>FILL IT</td>
    </tr>
    <tr>
        <td>$B^C = A  \impliedby \log_B A = C$</td>
        <td>$\ln A = \log_e A$</td>
        <td>$\log_B A = \frac{\log A}{\log B} = \frac {\log_{10} A}{\log_{10} B}$</td>
    </tr>
    <tr>
        <td>SAMPLE: $d^2 = (a+b)^2 + C^2*(T^2+2) $</td>
        <td>$\implies C = \sqrt[2]{\frac{d^2 - (a+b)^2}{T^2+2}}$ :SAMPLE</td>
        <td>SAMPLE</td>
    </tr>
</table>

# $\psi(x)\phi(x) \Pi \pi \sum \Sigma \propto \alpha \beta \Gamma \gamma \Delta \Lambda \lambda \circlearrowleft \circlearrowright
\frac{\log_a b^c}{\log_b d}\sqrt[3]{\frac{d^2 - (a+b)^2}{T^2+2} * \int_v\frac{dy}{dx} * \iint_w\frac{d^2y}{dx^2} }$

###### Failed attempt to rotate text using LaTeX
$\textit{usepackage} {graphicx}
\rotatebox[origin=c]{90}{\pi \sum\frac{log_ab^c}{log_bd}}$

$\displaystyle \frac{a}{b}$
</details>

###### How to print multiple header types on a single line.
* By default, each html header is displayed on a separate line.
* Add the 'style="display: inline"' attribute to the header tags to display them on a single line
* <h1 style="display: inline">H1</h1><h2 style="display: inline">H2</h2>
<h3 style="display: inline">H3</h3><h4 style="display: inline">H4</h4>
<h5 style="display: inline">H5</h5><h6 style="display: inline">H6</h6>

#### DISPLAYING IMAGES FROM LOCAL DRIVE or a Remote Server
* To show an image from a remote server, pass the "url" parameter (instead of "filename") to IPython.display.Image

###### Embedding images in a Markdown document 
<details><summary>show images</summary>
    
* ![](./logistic_regression_steps.png)
* <table><tr>
    <td><img src="logistic_regression_steps.png" width="500" /></td>
    <td><img src="confusiton_matrix_metrics_ratios.png" width="400" /></td>
  </tr></table>
</details>

###### Displaying images using a python command

In [6]:
def displayMyImages():
    from IPython.display import Image, display
    # A bare Image(...) expression inside a function is discarded; display() renders it
    display(Image(filename="images/linkedin_profile_9mar2020.jpg"))
    display(Image(filename="images/itbhu_cse_group_before1994_hardy_12jan2010..JPG"))
    display(Image(filename="logistic_regression_steps.png"))
    display(Image(filename="confusiton_matrix_metrics_ratios.jpg"))