<div class="alert alert-block alert-info">
Notebook Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

<div class="alert alert-block alert-info">
Acknowledgements:<br>
This dataset was generated from The Movie Database API (https://www.kaggle.com/datasets/tmdb/themoviedb.org). This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here: https://www.themoviedb.org/documentation/api.
</div>

# Table of Contents

- [Notebook Goal](#Notebook-Goal)<br>
- [Source Data](#Source-Data)<br>
- [Library Loading](#Library-Loading)<br>
- [Default Jupyter Notebook Settings](#Default-Jupyter-Notebook-Settings)<br>
- [Progress Status Function](#Progress-Status-Function)<br>
- [Data Loading](#Data-Loading)<br>
- [Categorical Columns Menu Option Lists for Filtering](#Categorical-Columns-Menu-Option-Lists-for-Filtering)<br>
- [Text Normalization](#Text-Normalization)<br>
- [Number Formatting Functions](#Number-Formatting-Functions)<br>
- [Dashboard Widgets and Functions](#Dashboard-Widgets-and-Functions)<br>
    - [Data Filtering Widget](#Data-Filtering-Widget)<br>
    - [Filtered Dataframe: Unique Values Dictionary](#Filtered-Dataframe:-Unique-Values-Dictionary)<br>
    - [Statistics Widget Tab](#Statistics-Widget-Tab)<br>
    - [Similarity Search Functions](#Similarity-Search-Functions)<br>
    - [Data Transformation Functions](#Data-Transformation-Functions)<br>
    - [Data Plotting Functions](#Data-Plotting-Functions)<br>
    - [Text Analysis and Text Modeling Widget](#Text-Analysis-and-Text-Modeling-Widget)<br>
        - [Text Analysis and Text Modeling Widget Functions](#Text-Analysis-and-Text-Modeling-Widget-Functions)<br>
        - [Bag of Words Text Model Function](#Bag-of-Words-Text-Model-Function)<br>
        - [Dimensionality Reduction Functions](#Dimensionality-Reduction-Functions)<br>
        - [PCA Dimensionality Reduction Functions](#PCA-Dimensionality-Reduction-Functions)<br>
        - [TSNE Dimensionality Reduction Functions](#TSNE-Dimensionality-Reduction-Functions)<br>
        - [Cluster Plots and Plot Projection Function](#Cluster-Plots-and-Plot-Projection-Function)<br>
        - [DBSCAN Clustering Functions](#DBSCAN-Clustering-Functions)<br>
        - [DBSCAN Clustering (% Records to Cluster) Functions](#DBSCAN-Clustering-(%-Records-to-Cluster)-Functions)<br>
        - [DBSCAN Clustering (Max Clusters w/Optimal EPS) Functions](#DBSCAN-Clustering-(Max-Clusters-w/Optimal-EPS)-Functions)<br>
        - [DBSCAN Clustering (Most Dense Cluster w/Optimal EPS) Widget and Functions](#DBSCAN-Clustering-(Most-Dense-Cluster-w/Optimal-EPS)-Widget-and-Functions)<br>
        - [DBSCAN Clustering (Custom EPS and Min Samples in a Cluster) Functions](#DBSCAN-Clustering-(Custom-EPS-and-Min-Samples-in-a-Cluster)-Functions)<br>
        - [KMeans Clustering Functions](#KMeans-Clustering-Functions)<br>
    - [Top Terms WordCloud Widget and Clustering Plots Functions](#Top-Terms-WordCloud-Widget-and-Clustering-Plots-Functions)<br>
        - [Top Terms Wordcloud Widget](#Top-Terms-Wordcloud-Widget)<br>
    - [Predictive Modeling Widget](#Predictive-Modeling-Widget)<br>
    - [Reports Widget Functions](#Reports-Widget-Functions)<br>
    - [Export Filtered Dataframe Function](#Export-Filtered-Dataframe-Function)<br>
    - [Reclustering: Filters the Selected Cluster](#Reclustering:-Filters-the-Selected-Cluster)<br>
    - [Main Data Filtering Widget](#Main-Data-Filtering-Widget)<br>

# Data Analysis Dashboard Template

Python offers several libraries for building data dashboards. In this notebook, a dashboard is developed using [Jupyter Notebook Widgets](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html). Other Python options include Anaconda Panel, Plotly Dash, Streamlit, and Voila, among others. Commercial platforms such as Amazon Web Services, MS Azure, Google Cloud Platform, MicroStrategy, Palantir Foundry, and Tableau provide their own frameworks and libraries for developing dashboard-like data analysis applications.

The main advantage of Jupyter Notebook Widgets is that they can be developed quickly inside a Jupyter Notebook. This lets a data scientist prototype a concept, explore the potential capabilities of a dashboard, and identify early limitations of the data and the tools without spending resources on a web-based dashboard. The main disadvantage is that they cannot easily be deployed in a website framework or an application, so the dashboard will need to be moved to another platform if the stakeholder decides to proceed. The other libraries listed above each have their own advantages and limitations and would need to be evaluated separately, depending on the needs of the stakeholders and the organization.

References: <br>
- https://medium.com/spatial-data-science/the-best-tools-for-dashboarding-in-python-b22975cb4b83

# Notebook Goal
This notebook uses the output of the 1_Data_Cleaning_Template.ipynb Notebook and continues with the next step: analyzing the data further to extract insights. This includes, but is not limited to, providing visualizations, plots, charts, statistics, correlations, relationships between features, patterns, trends, and anomalies; testing hypotheses for feature selection; and identifying potential features to use in clustering, classification, and other AI/ML algorithms.

The main goal is to show what the EDA and its visualizations can reveal and what to look for. There are hundreds of visualization types, and the plots you create should be driven by what you are trying to achieve. What story are you trying to tell? What decision do you want to make? How can you convince your stakeholders? The data will also limit what you can do and show. The purpose of this notebook is to give you some ideas for plots and the general functions used to create them. More advanced functions may also be available, so it is worth reviewing the example galleries of visualization libraries such as Matplotlib, Seaborn, Plotly, and D3.js, which may give you more ideas.

This notebook can be used as a template for quickly developing data analysis dashboards for any dataset, as long as the data has been cleaned using 1_Data_Cleaning_Template.ipynb. Once the data is loaded as a dataframe (DF), most functions can be used with only minor modifications. These include updating the filtering options (in the section named "Categorical Columns Menu Option Lists for Filtering" and in the "DF_DATA" dictionary), the filtering options in the "Main Data Filtering Widget", and the df_data_filtering logic and all variables tied to the filtering column names and unique values in the "Main Data Filtering Widget" and the "Data Filtering Widget". The easiest way to identify the other functions that need changes is to run the widgets and fix the errors as they appear.

# Source Data
A detailed description of the data can be found in the 1_Data_Cleaning_Template.ipynb Notebook. This Jupyter Notebook uses the outputs of that notebook.

# Library Loading
[Return to Table of Contents](#Table-of-Contents)<br>

In [1]:
# Python Libraries
import pandas as pd
import numpy as np
from numpy import unique, where
from collections import Counter
from statistics import mean # Mean function from statistics module
import re

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import MaxNLocator

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # For bag of words.
from sklearn.cluster import KMeans, DBSCAN # K-Means and DBSCAN
from sklearn.metrics import silhouette_score, pairwise_distances, silhouette_samples, mean_squared_error, r2_score
from sklearn.decomposition import PCA # used for PCA Dimensionality Reduction
from sklearn.manifold import TSNE # Used for TSNE Dimensionality Reduction.
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression # Linear Regression Model.
from sklearn.feature_selection import f_regression # Metrics. F-Regression (p-value and f-statistic)


import seaborn as sns
import scipy as sp
from scipy import stats
from scipy.stats import poisson # Poisson Distribution
from scipy.spatial.distance import cdist

import nltk # Natural Language Toolkit
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer # For lemmatization and stemming
from nltk import pos_tag # For parts of speech
from nltk import word_tokenize # To create tokens
from nltk.corpus import stopwords, wordnet # Stopwords and POS tags
#nltk.download('stopwords') # One-time download of 'stopwords'
#nltk.download('punkt') # One-time download of 'punkt'
#nltk.download('averaged_perceptron_tagger') # One-time download of 'averaged_perceptron_tagger'

from wordcloud import WordCloud

import ipywidgets as widgets
from IPython.display import Image, display, HTML, clear_output
from ipywidgets import interact, interact_manual

# Default Jupyter Notebook Settings
[Return to Table of Contents](#Table-of-Contents)

References: <br>
- Color cycling (https://matplotlib.org/stable/gallery/color/named_colors.html)

In [2]:
pd.set_option('display.max_colwidth', None) # Pandas truncates displayed text at 50 characters by default. This removes the limit and shows the full text.
pd.options.display.float_format = "{:.4f}".format # Pandas displays float numbers as 4 decimal places.

In [3]:
# ipyWidget Vertical Scroll Threshold

In [4]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

In [5]:
# Set the default color cycle for MatPlotLib Plots
#plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['b', 'r', 'g']) # Specifies specific color cycling (https://matplotlib.org/stable/gallery/color/named_colors.html)
#plt.style.available # Available Color Styles
plt.style.use('tableau-colorblind10') # Defining a specific color style to use. Tableau-Colorblind
#plt.style.use('seaborn-colorblind') # Defining a specific color style to use. Seaborn-colorblind

#colors = plt.rcParams['axes.prop_cycle'].by_key()['color'] # Extract Colors being defined in the plt.style.use                       
#print('\n'.join(color for color in colors))

# Progress Status Function
[Return to Table of Contents](#Table-of-Contents)

In [6]:
# Progress Bar Function. Used in loops.
def progress_status(step, total_steps):
    #Progress Status
    clear_output(wait=True)
    print(f"Currently processing step: {step} of {total_steps}.")

# Data Loading
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- Encoding https://stackoverflow.com/questions/57061645/why-is-%C3%82-printed-in-front-of-%C2%B1-when-code-is-run
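
The effect described in the reference above can be reproduced with the standard library alone. The following is a minimal illustrative sketch, not part of the loading cell: it shows how UTF-8 bytes read as cp1252 produce the stray "Â" character, and how the "utf-8-sig" codec adds and strips a byte order mark (BOM). The variable names are chosen for this example only.

```python
# Illustrative only: why a stray 'Â' can appear in front of '±'.
text = "±"

utf8_bytes = text.encode("utf-8")       # two bytes: 0xC2 0xB1
misread = utf8_bytes.decode("cp1252")   # each byte is read as its own character

# "utf-8-sig" writes a byte order mark (BOM) in front of the UTF-8 bytes,
# which some spreadsheet programs use to detect the encoding.
bom_bytes = text.encode("utf-8-sig")
roundtrip = bom_bytes.decode("utf-8-sig")  # the BOM is stripped again on read

print(repr(misread))     # the mis-decoded text
print(bom_bytes[:3])     # the BOM bytes
print(repr(roundtrip))   # the original text
```

Reading and writing the file with the same encoding on both ends avoids the mis-decoded characters.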

In [7]:
# LOADING CSV FILE
# na_values may need to be reviewed as some datasets may include an acronym that looks like a missing-value marker.
# For example, 'NA' may be an abbreviation for 'North America'.
df_data = pd.read_csv('./output_data/df_data_clean.csv', 
                      encoding = "utf-8-sig",
                      parse_dates=['release_date', 'release_cy_quarter', 'release_cy_month'],
                      keep_default_na=False,
                      na_values=['', '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A','N/A', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan']) 

# IF LOADING EXCEL FILE: use pd.read_excel.
#df_data = pd.read_excel('./input_data/FILE_NAME.xlsx', parse_dates=['Date', 'Final Date'])

# Encoding "cp1252" or "utf-8-sig" used so that Excel does not create special characters. Standard Python is utf-8.
# See reference for explanation https://stackoverflow.com/questions/57061645/why-is-%C3%82-printed-in-front-of-%C2%B1-when-code-is-run

In [8]:
#df_data.head(3)

In [9]:
#df_data.columns

In [10]:
# Column list used to select columns in loops in the data cleaning, visualization and model functions below.
dfcolumns = list(df_data.columns.values)
dfcolumns_index = pd.DataFrame(dfcolumns, columns=['column'])
pd.set_option('display.max_rows', None)
#dfcolumns_index

# Categorical Columns Menu Option Lists for Filtering
[Return to Table of Contents](#Table-of-Contents)<br>

This section creates a list of the unique categorical values in each feature or column so that they can later be used as the filtering options of the dropdown menus.
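
As a minimal sketch of the idea, the snippet below builds a sorted, de-duplicated option list from raw column values in plain Python. The helper name `unique_sorted_options` and the sample values are assumptions made for illustration; the cells that follow work directly on `df_data` instead.

```python
def unique_sorted_options(values, prepend=None):
    """Return the sorted unique values, optionally led by extra menu entries.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    unique_values = sorted(set(values))
    return list(prepend or []) + unique_values


# Made-up sample data standing in for a dataframe column.
sample_years = [2015, 2012, 2015, 2009, 2012]
year_options = unique_sorted_options(sample_years)
print(year_options)

# Extra entries (e.g., a 'None' choice) can be placed at the top of the menu.
language_options = unique_sorted_options(['fr', 'en', 'fr'], prepend=['None'])
print(language_options)
```

A list like this can then be supplied as the `options` of a dropdown or multiple-selection widget.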

In [11]:
# Changes the type of the data so that the NaN values are treated as strings.
# In some cases you may need to change the NaN values to 0 to fix issues with the stacking.
#df_data['feature_name'] = df_data['feature_name'].fillna(0)

In [12]:
# Converts the "genres" column to a string.
df_data['genres'] = df_data['genres'].astype(str)

In [13]:
# Record Date Calendar Years
release_cy_list = (df_data['release_cy'].unique()).tolist()
release_cy_list.sort()
#release_cy_list = ['NaT']+release_cy_list # Includes 'NaT'. Only needed if data has undefined year.
release_cy_list = list(dict.fromkeys([element for element in release_cy_list])) # Removes any duplicates
#release_cy_list

In [14]:
# Develops a unique list of genres to be able to use it in the menus.

# First and Last index of the Movie Genres. 
# Two methods: First using the columns of the main dataframe, the other of the dfcolumns_index dataframe.
first_genre_index = df_data.columns.to_list().index("Action")
#first_genre_index = dfcolumns_index.index[dfcolumns_index['column'] == str('Action')][0] 

last_genre_index = df_data.columns.to_list().index("Western")
#last_genre_index = dfcolumns_index.index[dfcolumns_index['column'] == str('Western')][0] 

# List of Unique Genres.
genres_list = sorted(dfcolumns_index['column'][first_genre_index:last_genre_index+1].to_list())
#genres_list

In [15]:
# TREND OPTIONS LISTS
trends = [3, 5, 10] # Note the start_year needs the trend to be adjusted to -1.

# TEXT MODELING AND CLUSTERING OPTIONS LIST
TEXT_MODEL_list = ['None', 'PLACEHOLDER: Topic Model: LDA', 'Clustering: Kmeans (user defined clusters)', 
                   'Clustering: Kmeans w/Optimal K',
                   'Clustering: DBSCAN (Max Clusters w/Optimal EPS)', 
                   'Clustering: DBSCAN (Densest Clusters w/Optimal EPS)', 
                   'Clustering: DBSCAN (% Records to Cluster)',
                   'Clustering: DBSCAN (Custom EPS and Min Samples)']

# Text Normalization
[Return to Table of Contents](#Table-of-Contents)<br>

Recall that the text was normalized in the data cleaning notebook, which created the norm_text_lemmatized and norm_text_stemmed columns. For text clustering, similarity searching, and other NLP tasks that rely on text normalization to work correctly, the text normalization function in this notebook must be the same as the one used in the data cleaning notebook.
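
The point about matching functions can be illustrated with a rough, NLTK-free sketch of the cleaning steps. The helper `basic_normalize` and the tiny stop-word set are hypothetical and exist only for this example; the notebook's own `text_normalization` function additionally applies lemmatization or stemming.

```python
import re


def basic_normalize(text, stop_words):
    """Lower-case, strip digits/punctuation/underscores, drop stop words.

    Hypothetical, simplified stand-in for the notebook's normalization.
    """
    text = str(text).lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove special characters
    text = text.replace("_", " ")         # remove underscores
    tokens = text.split()                 # also collapses repeated whitespace
    return " ".join(w for w in tokens if len(w) > 1 and w not in stop_words)


# Assumed mini stop-word list; 'not' is deliberately kept, as in the notebook.
demo_stop_words = {"the", "but", "in"}

stored_text = basic_normalize("The 2 heroes_fight, but NOT in 1999!", demo_stop_words)
query_text = basic_normalize("Heroes fight not", demo_stop_words)
print(stored_text)
print(stored_text == query_text)
```

Because both strings pass through the same function, the query and the stored text end up in the same vocabulary. If the two sides were normalized differently, terms that should match might no longer match.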

References: <br>
- Lemmatization: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizer

In [16]:
# Data Specific Stopwords. This list is added in post processing as part of input in the data dashboards.
# This allows exploring the effect of adding stop words in real time.
stopwords_to_add_L2 = ['']

In [17]:
# Stopwords to add need to be developed by a SME familiar with the corpus.
# These stopwords_to_add are added in pre-processing as part of the text normalization function.
stopwords_to_add = [''] # If you wanted to add a corpus related stopword list add them here.

# NLTK library stopwords
stopwords_custom = stopwords.words('english') + [x.lower() for x in stopwords_to_add]

# In some cases you want to consider 2-grams especially with the word 'no', 'not', 'nor'.
# For example 'no fire'.  Removing the word 'no' from the stopwords list allows this to occur.
remove_as_stopword = ['no', 'not', 'nor']
stopwords_custom = list(filter(lambda w: w not in remove_as_stopword, stopwords_custom))

In [18]:
def text_normalization(text, word_reduction_method):
    # The regex steps use the re module directly instead of a round trip through a dataframe,
    # so the dataframe's printed index and header never leak into the tokens.
    text = str(text).lower() # Convert narrative to a lower-case string.
    text = re.sub(r"\d+", " ", text) # Remove numbers
    text = re.sub(r"[^\w\s]", " ", text) # Remove special characters
    text = text.replace("_", " ") # Remove underscore characters
    text = re.sub(r"\s+", " ", text) # Replace multiple spaces with single
    tokenizer = RegexpTokenizer(r'\w+') # Tokenizer.
    tokens = tokenizer.tokenize(text) # Tokenize words.
    filtered_words = [w for w in tokens if len(w) > 1 and w not in stopwords_custom] # Note remove words of 1 letter only. Can increase to higher value as needed.
    if word_reduction_method == 'Lemmatization':
        lemmatizer = WordNetLemmatizer()
        reduced_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words] # Lemmatization.  The second argument is the POS tag.
    elif word_reduction_method == 'Stemming':
        stemmer = PorterStemmer() # Stemming also could make the word unreadable but is faster than lemmatization.
        reduced_words = [stemmer.stem(w) for w in filtered_words]
    else: # Fail with a clear message instead of an undefined-variable error.
        raise ValueError("word_reduction_method must be 'Lemmatization' or 'Stemming'.")
    return " ".join(reduced_words) # Join words with space.

def get_wordnet_pos(word): # Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizer
    #"""Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Number Formatting Functions
[Return to Table of Contents](#Table-of-Contents)<br>

The functions in this section dynamically change the format of numbers (e.g., currency, integer, and float numbers) based on their magnitude. For example, a value of 1,000,000 is automatically displayed as 1M.
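
A simplified sketch of this kind of magnitude-based formatting is shown here. The helper `compact_label` is hypothetical and uses fewer tiers than the functions in this section; it only illustrates the `(value, tick position)` signature that Matplotlib tick formatters pass to such a function.

```python
def compact_label(x, pos=None, prefix=""):
    """Format a number with a K or M suffix depending on its magnitude.

    Hypothetical, simplified helper; `pos` is the tick position and is
    accepted only so the function matches the (value, position) signature.
    """
    if abs(x) >= 1e6:
        return f"{prefix}{x * 1e-6:,.2f}M"
    if abs(x) >= 1e3:
        return f"{prefix}{x * 1e-3:,.1f}K"
    return f"{prefix}{x:,.0f}"


labels = [compact_label(v) for v in (1_000_000, 2_500, 999)]
currency_label = compact_label(2_500, prefix="$")
print(labels, currency_label)
```

A function with this signature can be wrapped in `matplotlib.ticker.FuncFormatter` and assigned to an axis with `set_major_formatter`.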

In [19]:
# Function to define the axis label with a dollar sign and a magnitude suffix (K, M). Args are the value and tick position
def currency_number_format(x, pos): 
    if x >= 1e6:
        s = '${:,.2f}M'.format(x*1e-6) # E.g., $1.00M.  If numbers are too big might have to create another level
    elif (1e6 > x) & (x >=1e3):
        s = '${:,.1f}K'.format(x*1e-3) # E.g., $999.9K - $1.0K
    elif (1e3 > x) & (x >= 1e2):
        s = '${:,.1f}'.format(x) # E.g., $999.9 - $100.0
    elif (1e2 > x) & (x >= 0.01):
        s = '${:,.2f}'.format(x) # E.g., $99.99 - $0.01
    elif 0.01 > x: # E.g., $0
        s = '${:,.0f}'.format(x)
    return s

In [20]:
# Function to define the axis label with a magnitude suffix (K, M). Args are the value and tick position
def number_format(x, pos): 
    if x >= 1e6:
        s = '{:,.2f}M'.format(x*1e-6) # E.g., 1.00M.  If numbers are too big might have to create another level
    elif (1e6 > x) & (x >=1e3):
        s = '{:,.1f}K'.format(x*1e-3) # E.g., 999.9K - 1.0K
    elif (1e3 > x) & (x >= 1e2):
        s = '{:,.1f}'.format(x) # E.g., 999.9 - 100.0
    elif (1e2 > x) & (x >= 0.01):
        s = '{:,.2f}'.format(x) # E.g., 99.99 - 0.01
    elif 0.01 > x: # E.g., 0
        s = '{:,.0f}'.format(x)
    return s

In [21]:
# Function to define the axis label for integer values with a magnitude suffix (K, M). Args are the value and tick position
def int_number_format(x, pos): 
    if x >= 1e6:
        s = '{:,.3f}M'.format(x*1e-6) # E.g., 1.000M.  If numbers are too big might have to create another level
    elif (1e6 > x) & (x >=1e4):
        s = '{:,.1f}K'.format(x*1e-3) # E.g., 999.9K - 10.0K
    else:
        s = '{:,.0f}'.format(x) # E.g., 9,999 - 0. Also covers negative values so that s is always defined.
    return s

# Dashboard Widgets and Functions
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- List of widgets: https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html

## Data Filtering Widget
[Return to Table of Contents](#Table-of-Contents)<br>

The Data Filtering Widget takes as input all the filters of the Main Data Filtering Widget and then calls the objects and sub-widgets in each of the tabs.
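
The genre filter in the function below supports an "AND" mode (every selected genre must be present) and an "OR" mode (any selected genre is enough), both built as regular expressions. The following standalone sketch shows the two patterns using only the `re` module. The helper names, the sample genre strings, and the use of `re.escape` are assumptions added for illustration and are not part of the notebook's function.

```python
import re


def build_and_pattern(genres):
    """One lookahead per genre: all of them must occur somewhere in the text."""
    lookaheads = [r"(?=.*\b" + re.escape(g) + r"\b)" for g in genres]
    return "^" + "".join(lookaheads) + ".*$"


def build_or_pattern(genres):
    """Alternation: at least one genre must occur in the text."""
    return "|".join(re.escape(g) for g in genres)


selected = ["Action", "Drama"]
# Made-up examples of what a 'genres' string could look like.
rows = ["Action, Drama, Thriller", "Action, Comedy", "Drama"]

and_hits = [r for r in rows
            if re.search(build_and_pattern(selected), r, flags=re.IGNORECASE)]
or_hits = [r for r in rows
           if re.search(build_or_pattern(selected), r, flags=re.IGNORECASE)]
print(and_hits)
print(or_hits)
```

The lookahead form checks each genre independently of its position in the string, which is why the order of the genres in the data does not matter.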

In [22]:
# PRIMARY WIDGET DATA FUNCTION
def filter_data(release_cy_filter, mult_genres_filter, genres_filter, 
                SIM_SEARCH_filter, SIM_SEARCH_THRESHOLD_filter, SIM_SEARCH_WORD_PROCESS_filter, 
                SIM_SEARCH_VECT_filter, STOPWORDS_filter, PRINT_filter):
    
    global df_data_filtered, print_winput, reclustering_counter, release_cy_winput, genres_winput, STOPWORDS_winput
    
    pd.set_option('display.max_columns', None)

    # Redefining the input filter variables as a new name to pass it to other functions thru a global variable. 
    release_cy_winput = release_cy_filter
    genres_winput = genres_filter
    mult_genres_filter = mult_genres_filter
    STOPWORDS_winput = STOPWORDS_filter.split(',') # Note that stopwords added here are later passed thru the text normalization.
    print_winput = PRINT_filter
    
    # Filtering Data
    if mult_genres_filter == 'AND':
        df_data_filtered = df_data.loc[(df_data['release_cy'] >= min(release_cy_winput)) &
                                       (df_data['release_cy'] <= max(release_cy_winput)) &
                                       (df_data['genres'].str.contains(r'^(?=.*\b' + r'\b)(?=.*\b'.join(genres_winput) 
                                                                       + r'\b).*$', flags=re.IGNORECASE, regex=True))
                                      ].copy().reset_index(drop = True)
        
    if mult_genres_filter == 'OR':
        df_data_filtered = df_data.loc[(df_data['release_cy'] >= min(release_cy_winput)) &
                                       (df_data['release_cy'] <= max(release_cy_winput)) &
                                       (df_data['genres'].str.contains('|'.join(genres_winput), 
                                                                       flags=re.IGNORECASE, regex=True))
                                      ].copy().reset_index(drop = True)
    
    if len(SIM_SEARCH_filter) > 0: # If there is text in the Input Box runs this function to further filter by the Similarity Search parameters
        bow_vectorizer_data(input_data = df_data_filtered,
                            vect = SIM_SEARCH_VECT_filter,
                            word_preprocessing = SIM_SEARCH_WORD_PROCESS_filter, 
                            max_df = 1.0, min_df = 0.0001)
        input_bow_rank_df(text = SIM_SEARCH_filter, 
                          word_reduction_method = SIM_SEARCH_WORD_PROCESS_filter, 
                          threshold = SIM_SEARCH_THRESHOLD_filter)
        
    # Dynamic Sub-Widget Dictionary of text description, feature and unique categories list.
    df_unique_values_dict()
    
    # TABS
    statistics_tab = widgets.Output()
    predictive_tab = widgets.Output()
    correlation_tab = widgets.Output()
    auto_trend_tab = widgets.Output()
    feature_plots_tab = widgets.Output()
    top_terms_tab = widgets.Output()
    text_modelling_tab = widgets.Output()
    text_plot_tab = widgets.Output()
    similarity_stats_tab = widgets.Output()
    networkx_tab = widgets.Output()
    summarization_tab = widgets.Output()
    filtered_reports_tab = widgets.Output()
    recluster_tab = widgets.Output() 
    
    # TABS PROPERTIES
    tab = widgets.Tab(children = [statistics_tab, predictive_tab, correlation_tab, auto_trend_tab, feature_plots_tab, 
                                  top_terms_tab, text_modelling_tab, text_plot_tab, similarity_stats_tab, networkx_tab,  
                                  summarization_tab, filtered_reports_tab, recluster_tab])
    tab.set_title(0, 'Stats')
    tab.set_title(1, 'Predictive')
    tab.set_title(2, 'Correlation')
    tab.set_title(3, 'AutoTrends')
    tab.set_title(4, 'Ftr. Plots')
    tab.set_title(5, 'Top Terms')
    tab.set_title(6, 'Txt Model')
    tab.set_title(7, 'Txt Plot')
    tab.set_title(8, 'Sim. Stats')
    tab.set_title(9, 'Network')
    tab.set_title(10, 'Summary')
    tab.set_title(11, 'Reports')
    tab.set_title(12, 'Recluster')

    display(tab)
    
    # Resets reclustering counter if the Main widget runs.
    reclustering_counter = 0
    
    # TABS INFORMATION AND WIDGET AND FUNCTION CALLING
    with statistics_tab:
        display(HTML('<h3>Data Statistics and Plots</h3>'))
        stats_title_and_records_widget()

    with predictive_tab:
        print('The predictive analysis aims to leverage predictive and statistical models such as Regressions (e.g., Linear Regression, Multiple Linear Regression), Poisson Distribution, Monte Carlo Simulations, and other models to provide insights on future direction of records.')
        predictive_widget()
        
    with correlation_tab:
        print('The Correlation Tab allows exploring the correlation between the features in the data using a correlation heatmap and different correlation coefficients (e.g., Pearson, Spearman, Chi-Square, Cramers V).')
        print('https://towardsdatascience.com/statistics-in-python-using-chi-square-for-feature-selection-d44f467ca745')
        print('https://www.kaggle.com/chrisbss1/cramer-s-v-correlation-matrix')
        print('https://link.medium.com/Y7ZKRp1LEnb')
        print('https://datagy.io/python-correlation-matrix/')
        
    with auto_trend_tab:
        print('The Automatic Trending aims to leverage automating the calculation of the slope of each of the features (e.g., categories, etc.) and provide results of the highest increasing trends.')
        
    with feature_plots_tab:
        print('Feature Plots tab will allow selecting and plotting any two features to evaluate relationship.')
    
    with top_terms_tab:        
        display(HTML('<h3>Top Terms within Corpus</h3>'))
        print('The Top Terms provide a list of the top terms given the parameters used.')
        top_terms_widget()
    
    with text_modelling_tab:
        display(HTML('<h3>Text Analysis and Modeling</h3>'))
        print('The text analysis and modeling algorithms perform several types of analyses. These include Latent Dirichlet Allocation (LDA), which is a type of topic model, as well as text clustering using Kmeans and DBSCAN algorithms. These algorithms provide insights by clustering reports into groups of similar reports and giving statistics on the top terms. The non-zero cluster labels in this dashboard are sorted by the mean of the top 10 terms scores. For example, Cluster #1 will have a higher average score than Cluster #2 and so on.')
        text_modelling_widget()

    with text_plot_tab:
        print('The Text Plot tab will use the Scatter Text library to visualize text and provide insights.')
    
    with similarity_stats_tab:
        print('The Similarity Statistics will allow insights such as the number of records with high similarity, number of records with low similarity, histogram of records based on Similarity Scores as well as other insights.')
        print('When performing a search, this could be used to get a similarity score per year (or another time period, such as quarter). The documents can be plotted using the categories, where each is a potential precursor to the previous. For the similarity, the user could set a threshold (e.g., 0.1) to count the data. The threshold will filter by report score, and the result can then be used to plot counts per year and for each category. The cumulative similarity could be plotted instead of counts. To validate, the description of an accident could be used to try to find similar records. There should be a large 1.0 at the accident, and if there are many related records there should also be a spike.')
    with networkx_tab:
        print('The Network Analysis will provide insights of how the records are connected and allow to filter based on a threshold.')

    with summarization_tab:
        print('The Summarization uses Convolutional Neural Network, DistilBART and Sentence Transformer libraries to develop AI based summaries of all the selected records, rows and/or reports.')
        print('Reference: https://huggingface.co/philschmid/distilbart-cnn-12-6-samsum')
        print('Example summarization at: movie_plots_summarization.ipynb')
        print('Could be used to input a long movie plot description and make it shorter.')
        
    with filtered_reports_tab:        
        display(HTML('<h4>Show Filtered Reports</h4>'))
        print('Please select the data features to show and the start and end rows of the records to show. \nNotes:')
        print('(1) Re-run the reports module to make sure all features appear as options. Some modules create features and columns (e.g., "cluster_label").')
        print('(2) May be slow if a lot of records are selected.')
        print('(3) Easier to read if less than 5 columns are selected.')
              
        filtered_reports_widget()
        export_df_widget()
        
    with recluster_tab:       
        display(HTML('<h3>Recluster Module</h3>'))
        print('This widget allows filtering on a specific cluster number and using that filtered dataset with all the other analysis tools, including clustering. This allows performing additional levels of clustering until running out of records.')
        display(HTML('<h3>Recluster Process Steps</h3>'))
        print('1. Run the "Text Analysis and Modeling" module with either Kmeans or DBSCAN and make note of which cluster you would like to "recluster".')
        print('2. Select "Run" below.')
        print('3. Code will ask which Cluster to select and select "Run" again.')
        print('4. The data will be replaced with the cluster. To check, go to reports; all the reports should have the selected cluster number under the "cluster_label" feature (i.e., the last column).')
        print('5. To reset run the main filter widget.')
        recluster_widget()
        

## Filtered Dataframe: Unique Values Dictionary
[Return to Table of Contents](#Table-of-Contents)<br>

The DF Unique Values Dictionary is used for stacking plots. After the main dataframe is filtered, the data dictionary recalculates the unique values and uses them for the stacking. Stacking can be used with columns and features that have a small number of unique values (e.g., categorical, one-hot-encoded, etc.). Continuous features would have too many unique values and may need to be converted to bins before they can be used in stacking plots.
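
The binning mentioned above can be sketched with the standard library. The helper `bin_label`, the bin edges, and the sample runtime values are hypothetical and only illustrate how a continuous feature could be reduced to a handful of categories suitable for stacking.

```python
from bisect import bisect_right


def bin_label(value, edges):
    """Map a number to a text label for the bin it falls into.

    Hypothetical helper; `edges` must be sorted in ascending order.
    """
    i = bisect_right(edges, value)
    if i == 0:
        return f"< {edges[0]}"
    if i == len(edges):
        return f">= {edges[-1]}"
    return f"{edges[i - 1]} to < {edges[i]}"


# Made-up continuous values (e.g., minutes) and assumed bin edges.
sample_values = [45, 60, 89, 95, 120, 200]
edges = [60, 90, 120]

binned = [bin_label(v, edges) for v in sample_values]
unique_bins = list(dict.fromkeys(binned))  # unique labels, first-seen order
print(binned)
print(unique_bins)
```

Six distinct values collapse into four labels here, which is the kind of small category set that stacking plots can handle.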

In [23]:
def df_unique_values_dict():
    global DATA_dict
    
    # Note could do a dictionary of 'feature' and 'list' by looping thru the data features. 
    # Could also find a mapping of description to features and map it to my data file features.
    DATA_dict = {'description': ['None',
                                 #'Genres',
                                 'Original Language',
                                 'Action',
                                 'Drama',
                                 'Release Calendar Year',],
                'feature': ['None',
                            #'genres',
                            'original_language',
                            'Action',
                            'Drama',
                            'release_cy',],
                'list': [[], 
                         #sorted([x for x in df_data_filtered['genres'].unique()]), # Genres can't be used for 
                         sorted([x for x in df_data_filtered['original_language'].unique()]),
                         sorted([x for x in df_data_filtered['Action'].unique()]),
                         sorted([x for x in df_data_filtered['Drama'].unique()]),
                         sorted([x for x in df_data_filtered['release_cy'].unique()]),
                        ]
               }
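
As an illustration of the binning mentioned above, the following is a minimal, dependency-free sketch of how a continuous feature could be mapped to a small set of labels before stacking. The function `bin_values`, the bin edges, the labels and the runtime values are all hypothetical and are not part of the dashboard; inside the notebook, `pd.cut` would likely be the more natural tool.

```python
from bisect import bisect_right

def bin_values(values, edges, labels):
    """Map each number to the label of the half-open bin [edge_i, edge_i+1) it falls in."""
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    binned = []
    for value in values:
        if value < edges[0] or value >= edges[-1]:
            binned.append("Out of range")  # value lies outside every bin
        else:
            # bisect_right counts the edges <= value, which identifies the bin
            binned.append(labels[bisect_right(edges, value) - 1])
    return binned

# Hypothetical runtimes in minutes (illustrative values, not taken from the dataset).
runtimes = [45, 88, 95, 121, 240]
runtime_edges = [0, 90, 120, 300]
runtime_labels = ["Short", "Medium", "Long"]
runtime_bins = bin_values(runtimes, runtime_edges, runtime_labels)
print(runtime_bins)
```

The binned labels have only a few unique values, so a column built this way could be added to the dictionary in the same manner as `original_language`.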

## Statistics Widget Tab
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- Addition/Count of Non-Zeros: https://stackoverflow.com/questions/26053849/counting-non-zero-values-in-each-column-of-a-dataframe-in-python

In [24]:
def stats_title_and_records_widget():
    interact_manual(stats_title_and_records)

In [25]:
def stats_title_and_records():
    # TITLE of tab and record numbers when NO reclustering applied. 
    if reclustering_counter == 0:
        if (len(df_data) == len(df_data_filtered)):
            display(HTML(f'<h2>Data Statistics and Plots for ALL Records</h2>'))
            print(f'The source data has {len(df_data)} records and ALL {len(df_data_filtered)} records are selected.')
        if (len(df_data) > len(df_data_filtered)):
            display(HTML(f'<h2>Data Statistics and Plots for PARTIAL records (see selection filters)</h2>'))
            print(f'The source data has {len(df_data)} records and ONLY {len(df_data_filtered)} records are selected.')
            
    # TITLE of tab and number of records when reclustering is applied
    if reclustering_counter != 0:
        display(HTML(f'<h2>Data Statistics for Recluster #{reclustering_counter}</h2>'))
        print(f'Total number of records in the source data is {len(df_data)}.')
        print(f'Total Number of records in the recluster data is {len(df_data_filtered)}.')
        
    # PLOTS
    display(HTML(f'<h3>Records per Fiscal Year Start/End and Length</h3>'))
    print(f'Note 1: Total number of selected records is {df_data_filtered.shape[0]}.\n')
    print(f'Note 2: When selecting a trendline the dashboard removes records that have unassigned time periods (e.g., CY, FY, etc.) if any.')
    records_per_time_widget()

    display(HTML(f'<h3>Record Statistics Vertical Bar Plots</h3>'))
    display(HTML(f'<h3>NOTE: THIS WOULD BE A GREAT EXAMPLE TO NORMALIZE.</h3>'))
    print('Data for feature counts is sorted by descending order.')
    feature_counts_widget()
        
    display(HTML(f'<h3>Record Statistics Horizontal Stacked Visualization</h3>'))
    display(HTML(f'<h3>NOTE: NEED TO ALSO DO RELATIVE TO 100%.</h3>'))
    barh_stats_widget()

## Similarity Search Functions
[Return to Table of Contents](#Table-of-Contents)<br>

Input Data Filtering and Ranking

In [26]:
def input_bow_rank_df(text, word_reduction_method, threshold): # Returns the records with a cosine value above the threshold, ranked from highest to lowest.
    global cosine_value, inputtext, df_bow, df_data_filtered # Making global value so that it can be called outside of the function.
   
    inputtext = str(text)
    inputtext_normalized = text_normalization(inputtext, word_reduction_method) # Executing function to perform text normalization
    bow = vectorizer.transform([inputtext_normalized]).toarray() # applying bow
    # Calculating the cosine_value of the input_text against every Target Text Row (i.e., Question) in the dataframe.
    # pairwise_distances returns the cosine distance, so 1 - distance gives the cosine similarity.
    cosine_value = 1 - pairwise_distances(df_bow, bow, metric = 'cosine')

    # Defines and creates table for ranks
    df_cosine = np.round(cosine_value, 4)
    # Converting array to a pandas dataframe (table with index).
    df_cosine_table = pd.DataFrame({'cosine_value': df_cosine[:, 0]})
    # Concatenating the dataframes to show the results for Cosine Value (given inputtext), Original Data, norm Text and BoW
    df_data_filtered = pd.concat([df_cosine_table, df_data_filtered], axis=1)
    # Sorting values with highest Cosine Value on the Top and removing records below threshold
    df_data_filtered = df_data_filtered.sort_values(by=['cosine_value'], ascending=False)
    df_data_filtered = df_data_filtered[df_data_filtered['cosine_value'] > threshold].reset_index(drop = True)
    df_data_filtered.insert(0, "Input_Text", inputtext, True)
    
    return (df_data_filtered)
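
The ranking logic above can be sketched without the notebook's vectorizer. This is a simplified illustration only: it uses raw word counts on whitespace-split text rather than the fitted `vectorizer` and `text_normalization`, and the names `cosine_similarity`, `rank_by_similarity` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words count dictionaries."""
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, documents, threshold):
    """Score every document against the query, drop low scores, sort descending."""
    query_counts = Counter(query.lower().split())
    scored = [(round(cosine_similarity(query_counts, Counter(doc.lower().split())), 4), doc)
              for doc in documents]
    kept = [pair for pair in scored if pair[0] > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

# Hypothetical documents, used only to illustrate the ranking.
docs = ["space battle adventure", "romantic drama in paris", "space station drama"]
ranked = rank_by_similarity("space drama", docs, threshold=0.5)
print(ranked)
```

Only the third document shares both query terms, so it is the only one that clears the 0.5 threshold in this example.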

## Data Transformation Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In [27]:
def records_per_time_widget():
    interact(records_per_time_data,
             feature_to_plot = widgets.Dropdown(options = ['release_cy'],
                                                 value = 'release_cy',   
                                                 description = 'Records per',
                                                 style={'description_width': 'initial'},
                                                 disabled = False),
             add_trend = widgets.Dropdown(options = ['None', 'Linear'],
                                          value = 'None',   
                                          description = 'Trendline (for unstacked plots)',
                                          style={'description_width': 'initial'},
                                          disabled = False),
             add_equation = widgets.Checkbox(value = False,
                                             description = 'Show Trendline Equation',
                                             disabled = False,
                                             indent = False),
             stack_by_desc = widgets.Dropdown(options = DATA_dict['description'],
                                              value = 'None',   
                                              description = 'Stack By',
                                              style={'description_width': 'initial'},
                                              disabled = False)
            )

In [28]:
# Function creates stacked data plots.
def records_per_time_data(feature_to_plot, add_trend, add_equation, stack_by_desc):
    global df_time_record_counts
    
    stack_by_list = DATA_dict['list'][DATA_dict['description'].index(stack_by_desc)] 
    # Accessing the list from dictionary given the feature
    stack_by = DATA_dict['feature'][DATA_dict['description'].index(stack_by_desc)]
    
    if (stack_by == 'None'):
        df_time_record_counts = df_data_filtered[feature_to_plot].value_counts().sort_index(ascending = True,
                                                                                           ).rename_axis(feature_to_plot,
                                                                                                        ).reset_index(name='counts')
        if (add_trend == 'Linear') and (feature_to_plot == 'release_cy'): 
            # For adding a trendline need to remove the null values and change the data type to float.
            # notna() is used because a comparison against np.nan is always True and filters nothing.
            df_time_record_counts = df_time_record_counts.loc[(df_time_record_counts[feature_to_plot] != 'NaT') &
                                                              (df_time_record_counts[feature_to_plot].notna())]
            df_time_record_counts = df_time_record_counts.reset_index(drop = True) # Keeps row lookups by position valid in the plot.
            df_time_record_counts[feature_to_plot] = df_time_record_counts[feature_to_plot].astype(float)
        
            bar_chart_wtrend(input_data = df_time_record_counts, x_feature = feature_to_plot, y_feature = 'counts', 
                             add_trend = add_trend, add_equation = add_equation)
        
        else:
            if add_trend == 'Linear': # Only show the note when a trendline was actually requested.
                print('NOTE: Linear Trend is not set up to calculate a trendline for the Quarters or Months yet.')
            bar_chart_wtrend(input_data = df_time_record_counts, x_feature = feature_to_plot, y_feature = 'counts', 
                             add_trend = 'None', add_equation = add_equation)
                                                        
    else:
        stacked_bar_plot_data(input_data = df_data_filtered, feature_to_plot = feature_to_plot, 
                              stack_by = stack_by, stack_by_list = stack_by_list)
        
    # NOTE THAT THE EXCHANGE DASHBOARDS HAVE OTHER FEATURE PLOTS HERE SUCH AS ACTIVE PROJECTS AND PROJECT LENGTHS IF NEEDED.      

## Data Plotting Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In [29]:
def stacked_bar_plot_data(input_data, feature_to_plot, stack_by, stack_by_list):
    df_data_stacked = input_data[[feature_to_plot, 
                                  stack_by]].pivot_table(index=[feature_to_plot], 
                                                         columns=[stack_by], 
                                                         aggfunc=len,
                                                         dropna=False,
                                                         fill_value=0)
    df_data_stacked.reset_index(level=0, inplace=True)
    df_data_stacked["Row_Total"] = df_data_stacked[list(stack_by_list)].sum(axis=1)
    stacked_bar_plot(input_data = df_data_stacked, feature_to_plot = feature_to_plot, 
                     stack_by = stack_by, stack_by_list = stack_by_list)

In [30]:
def stacked_bar_plot(input_data, feature_to_plot, stack_by, stack_by_list):
    fig_width = max(14, (int(len(input_data[feature_to_plot].unique())/4))) 
    # Adjust the width of the figure dynamically depending on number of unique values in the X axis.
    ax = input_data.plot.bar(x = feature_to_plot, 
                        y = stack_by_list, 
                        stacked=True, 
                        figsize=(fig_width, 4))
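    # Legend.legendHandles is not available in recent Matplotlib releases, so the handles are read from the axes.
    handles, _ = ax.get_legend_handles_labels()
    plt.legend(handles[::-1], list(reversed(stack_by_list)),
               loc="upper left", bbox_to_anchor=(1, 1), ncol=1)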
    plt.xticks(rotation = 45)
    plt.ylabel('No. of Records', size=16)
    plt.grid(axis = 'y')
    
    # Write values inside stacked bars
    for rect in ax.patches: # .patches is everything inside of the chart
        # Find where everything is located
        height = int(rect.get_height())
        width = rect.get_width()
        x = rect.get_x()
        y = rect.get_y()
        # The height of the bar is the data value and can be used as the label
        label_text = f'{height}'  # f'{height:.2f}' to format decimal values
        # ax.text(x, y, text)
        label_x = x + width / 2
        label_y = y + height / 2
        if height > 15: # Write value on plot only when height is greater than specified value
            ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)
    
    plt.show();

In [31]:
def bar_chart_wtrend(input_data, x_feature, y_feature, add_trend, add_equation):
    fig_width = max(14, (int(len(input_data[x_feature].unique())/4))) 
    # Adjust the width of the figure dynamically depending on number of unique values in the X axis.
    plt.figure(figsize=(fig_width, 4))
    plt.bar(x = input_data[x_feature], height = input_data[y_feature],) # Bars variable to access bar attributes
    ax = plt.gca()    
    plt.xlabel(x_feature, fontsize=12)
    if len(input_data) > 10:
        plt.xticks(ticks = input_data[x_feature].unique(), rotation = 90, size = 14)
    else:
        plt.xticks(ticks = input_data[x_feature].unique(), rotation = 0, size = 14)  
    plt.ylabel("No. of Records", fontsize=16)
    if (add_trend != 'None'):
        ax.ticklabel_format(useOffset=False) 
        # Fixes an issue with the x-axis showing as integers and +202X in the clustering trends. 
        # Using the if allows this function to be used with the 'NaT' 
    plt.yticks(size = 14)
    plt.ylim(bottom = 0)
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.grid(axis = 'y')   
    
    label_rotation = 90 if len(input_data) > 10 else 0 # Rotates the value labels when there are many bars.
    for i in range(len(input_data)):
        # Prints values on plot. Positional access (.iloc) works even if the index is not 0..n-1.
        plt.text(input_data[x_feature].iloc[i], 
                 input_data[y_feature].iloc[i]+input_data[y_feature].max()*.01, 
                 input_data[y_feature].iloc[i], 
                 size = 14,
                 ha='center',
                 rotation = label_rotation)
            
    # Linear Trendline
    if (add_trend == 'Linear'): 
        add_LR_trend(input_data, x_feature, y_feature, add_trend, add_equation)

    plt.show();

In [32]:
def add_LR_trend(input_data, x_feature, y_feature, add_trend, add_equation):
    # Linear Trendline equation. polyfit returns the coefficients from lowest degree up: [intercept, slope].
    df_freq_fit = np.polynomial.polynomial.polyfit(input_data[x_feature], input_data[y_feature], 1)
    x_min = input_data[x_feature].min()
    y_at_min_year = df_freq_fit[1]*x_min+df_freq_fit[0] # Trendline value at the first time period.
    
    # Drawing the linear trendline
    plt.plot(input_data[x_feature], df_freq_fit[1] * input_data[x_feature] + df_freq_fit[0], color='red', linewidth=2)
    if add_equation == True:
        # The equation is written relative to the first time period so that the constant term matches the plot.
        plt.text((input_data[x_feature].max()+1.5), 
                 (input_data[y_feature].min()+input_data[y_feature].max()*0.1), 
                 'Trendline Equation: y={:.2f}*(x-{:.0f})+{:.2f}'.format(df_freq_fit[1], x_min, y_at_min_year), 
                 color='darkblue', 
                 size=16)
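
To make the trendline arithmetic concrete, here is a small sketch of an ordinary least-squares line fit written without NumPy. It returns the coefficients in the same lowest-degree-first order that `np.polynomial.polynomial.polyfit` uses. The helper name `fit_line` and the sample years and counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope  # lowest degree first

# Hypothetical record counts per calendar year (illustrative only).
years = [2010, 2011, 2012, 2013]
counts = [10, 12, 14, 16]
intercept, slope = fit_line(years, counts)
value_at_first_year = slope * min(years) + intercept
print(intercept, slope, value_at_first_year)
```

With calendar years on the x-axis the true intercept (the value at x = 0) is a large negative number that is hard to read, which is why the trendline value at the first year is often the more useful constant to display.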

In [33]:
def feature_counts_widget():
    interact(feature_counts_data,
             x_feature = widgets.Dropdown(options = DATA_dict['feature'][1:],
                                          value = DATA_dict['feature'][1],
                                          description = 'Feature to Plot',
                                          style={'description_width': 'initial'},
                                          disabled = False),
             Pareto_chkbox = widgets.Checkbox(value = False,
                                              description = 'Pareto Chart',
                                              disabled = False,
                                              indent = False),
             Pareto_Axs_chkbox = widgets.Checkbox(value = True,
                                                  description = 'Pareto Chart Y-axis',
                                                  disabled = False,
                                                  indent = False),
             Sort_by_feature_name_chkbox = widgets.Checkbox(value = False,
                                                     description = 'Sort by Feature Name',
                                                     disabled = False,
                                                     indent = False)
            )

In [34]:
def feature_counts_data(x_feature, Pareto_chkbox, Pareto_Axs_chkbox, Sort_by_feature_name_chkbox):
    global df_feature_counts
    
    if Sort_by_feature_name_chkbox == False:
        df_feature_counts = df_data_filtered[x_feature].value_counts().rename_axis(x_feature).reset_index(name='counts')
    if Sort_by_feature_name_chkbox == True:
        df_feature_counts = df_data_filtered[x_feature].value_counts().rename_axis(x_feature).reset_index(name='counts').sort_values(x_feature)

    bar_chart_wpareto(input_data = df_feature_counts, x_feature = x_feature, 
                      Pareto_chkbox = Pareto_chkbox, Pareto_Axs_chkbox = Pareto_Axs_chkbox)
    
    #horizontal_bar_chart(input_data = df_location_counts.sort_values(by = 'counts'), y_feature = 'counts', x_feature = x_feature)

In [35]:
# Location based count bar plot with Pareto
def bar_chart_wpareto(input_data, x_feature, Pareto_chkbox, Pareto_Axs_chkbox):
    global ax1    
    fig_width = max(14, (int(len(input_data[x_feature].unique())/3)))
    fig, ax1 = plt.subplots(figsize=(fig_width, 4))
    plt.bar(x = input_data[x_feature], height = input_data['counts'])
    plt.xlabel(f'{x_feature}', fontsize=12)
    plt.xticks(ticks = input_data[x_feature].unique(), rotation = 90, size = 14)
    plt.ylabel("No. of Records", fontsize=16)
    plt.yticks(size = 14)
    plt.ylim(bottom = 0)
    #plt.gca().yaxis.set_major_formatter(int_number_format) # Calls the function and formats the y-axis.
    ax1 = plt.gca()
    ax1.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax1.grid(axis = 'y')
    
    # Show the values of each bar
    #if len(input_data[x_feature].unique()) <= 15:
    for i in range(len(input_data)):
        # Prints values on plot
        plt.text(input_data[x_feature][i], 
                 input_data['counts'][i]+input_data['counts'].max()*0.01, 
                 int_number_format(input_data['counts'][i], 0), 
                 size = 14)
        
    # PARETO CHART
    if Pareto_chkbox == True:
        # PARETO CHART Data for Plot 
        x = input_data[x_feature].values
        y = input_data['counts'].values
        pareto_chart(x, y, ax1, Pareto_Axs_chkbox)
    
    plt.show();

In [36]:
def pareto_chart(x, y, ax, Pareto_Axs_chkbox):
    global ax2
    weights = y / y.sum()
    cumsum = weights.cumsum()

    # Drawing Pareto Chart on Secondary Y-Axis (uses the axis passed in rather than the global ax1)
    ax2 = ax.twinx()
    ax2.plot(x, cumsum, color = 'green', marker = 'o', linestyle = 'dashed')
    ax2.set_ylabel('', color='g')
    ax2.tick_params('y', colors='g')
    ax2.yaxis.set_ticks(np.arange(0, 1.1, 0.20))
    ax2.set_yticklabels(['{:,.0%}'.format(x) for x in ax2.get_yticks().tolist()])
    ax2.axhline(0.80, color="orange", linestyle="dashed") # Horizontal Line at 80%.
        
    # Y-labels on Secondary Y-Axis (i.e., right side)        
    if Pareto_Axs_chkbox == False:
        ax2.set_yticks([])
        formatted_weights = ['{0:.0%}'.format(x) for x in cumsum]
        for i, txt in enumerate(formatted_weights):
            ax2.annotate(txt, (x[i], cumsum[i]), color = 'black', fontweight='heavy') 
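
The cumulative line drawn by `pareto_chart` can be illustrated with plain Python. This sketch assumes the counts are sorted in descending order first, which is what makes the curve a Pareto curve; the helper names and the sample counts are hypothetical.

```python
def pareto_cumulative_shares(counts):
    """Return the cumulative share of the total, with counts sorted descending."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    shares, running = [], 0
    for count in ordered:
        running += count
        shares.append(running / total)
    return shares

def categories_to_reach(shares, target=0.80):
    """Number of top categories needed to reach the target cumulative share."""
    for position, share in enumerate(shares, start=1):
        if share >= target:
            return position
    return len(shares)

# Hypothetical feature counts (illustrative only).
feature_counts = [5, 50, 15, 30]
shares = pareto_cumulative_shares(feature_counts)
top_n = categories_to_reach(shares)
print(shares, top_n)
```

In this example the two largest categories already account for 80% of the records, which corresponds to the point where the cumulative line meets the dashed 80% line in the chart.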

In [37]:
def horizontal_bar_chart(input_data, y_feature, x_feature):
    fig_height = max(10, (int(len(input_data[x_feature].unique())/4)))
    plt.figure(figsize=(4, fig_height))
    plt.barh(y = input_data[x_feature], width = input_data[y_feature])
    for index, value in enumerate(input_data[y_feature]):
        plt.text(value, index, str(value))

In [38]:
def barh_stats_widget():
    interact_manual(barh_stats_stack_widget,
                    x_feature_W = widgets.Dropdown(options = DATA_dict['feature'][1:],
                                          value = DATA_dict['feature'][1],   
                                          description = 'Feature to Plot',
                                          style={'description_width': 'initial'},
                                          disabled = False)
            )

In [39]:
def barh_stats_stack_widget(x_feature_W):
    global x_feature
    x_feature = x_feature_W
    
    # Removes the x_feature from the list of stackable features dynamically.
    stack_by_list = [feature for feature in DATA_dict['feature'][1:] if feature != x_feature]
    
    interact(barh_stats_sort_widget,
             stack_by_W = widgets.Dropdown(options = stack_by_list,
                                           value = stack_by_list[0],   
                                           description = 'Stack By',
                                           style={'description_width': 'initial'},
                                           disabled = False)
            )

In [40]:
def barh_stats_sort_widget(stack_by_W):
    global stack_by
    stack_by = stack_by_W
    
    interact(barh_stats_data,           
             sort_first = widgets.Dropdown(options = ['None']+DATA_dict['list'][DATA_dict['feature'].index(stack_by)],
                                           value = 'None',  
                                           description = 'First Level Sort',
                                           style={'description_width': 'initial'},
                                           disabled = False),
             sort_second = widgets.Dropdown(options = ['None']+DATA_dict['list'][DATA_dict['feature'].index(stack_by)],
                                            value = 'None',   
                                            description = 'Second Level Sort',
                                            style={'description_width': 'initial'},
                                            disabled = False)
            )

In [41]:
def barh_stats_data(sort_first, sort_second):
    global labels_list, df
    # DATA TRANSFORMATION FOR BAR PLOT WITH PRINTED VALUES
    df_pivot = df_data_filtered[[x_feature]+[stack_by]].pivot_table(index=x_feature,
                                                                    columns = stack_by, 
                                                                    aggfunc=len,
                                                                    dropna=False,
                                                                    fill_value=0,
                                                                    margins=True).reset_index()
    # SORTS THE FEATURES AND COLUMNS TO BE USED IN THE PLOTS 
    sort_list = ['None']+DATA_dict['list'][DATA_dict['feature'].index(stack_by)]
    sorted_columns_list = [sort_first, sort_second]+sort_list+['All']
    sorted_columns_list = list(dict.fromkeys([element for element in sorted_columns_list if element != 'None'])) # Removes 'None' and Duplicates
    sorted_columns_list.insert(0, x_feature)

    df_pivot = df_pivot[sorted_columns_list] # Sorts all dataframe features in the right order.
    df_pivot = df_pivot.drop([len(df_pivot)-1]) # Drops last "All" row

    # SORTING THE DATA FOR THE PLOTS
    if (sort_first == 'None') & (sort_second == 'None'): # Sorts in this order by Totals if none order specified.
        df_pivot = df_pivot.sort_values(['All']) 
    if (sort_first != 'None') & (sort_second == 'None'): # Sorts by specified one level
        df_pivot = df_pivot.sort_values([sort_first])   
    if (sort_first != 'None') & (sort_second != 'None'): # Sorts by the specified two levels
        df_pivot = df_pivot.sort_values([sort_first]+[sort_second]) 
  
    df = df_pivot.drop(columns = 'All', axis = 0) # Drops last "All" column
    df_total = df_pivot['All']
    df_percent = df[df.columns[1:]].div(df_total, 0)*100
    
    # DEVELOPING THE LABELS LIST FROM THE SORTED COLUMNS AND REMOVING THE X_FEATURE AND ALL COLUMNS  
    labels_list = sorted_columns_list
    labels_list.remove(x_feature)
    labels_list.remove('All')

    barh_stats_chart(input_data = df, input_data_pivot = df_pivot, input_data_total = df_total,
                     input_data_percent = df_percent)
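
The column-ordering step in `barh_stats_data` relies on `dict.fromkeys` to remove duplicates while keeping the first occurrence of each value. A small sketch of that idea follows; the function name `ordered_columns` and the sample values are hypothetical.

```python
def ordered_columns(sort_first, sort_second, stack_values, x_feature):
    """Put the chosen sort levels first, then the remaining values, then 'All'."""
    candidates = [sort_first, sort_second] + list(stack_values) + ['All']
    # dict.fromkeys keeps insertion order, so duplicates are dropped without reordering.
    deduplicated = list(dict.fromkeys(c for c in candidates if c != 'None'))
    return [x_feature] + deduplicated

# Hypothetical language codes used as stack values (illustrative only).
columns = ordered_columns('fr', 'None', ['en', 'es', 'fr'], 'release_cy')
print(columns)
```

Because the selected sort level appears first in the candidate list, it also ends up as the first stacked segment in the horizontal bar plot.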

In [42]:
def barh_stats_chart(input_data, input_data_pivot, input_data_total, input_data_percent):
    # HORIZONTAL BAR PLOT WITH LABELS
    print('Note 1: When stacking, percentage is shown for stack bars with 10 or more records and less than 100%.')
    print('Note 2: Plot with undefined sorting level is performed on the total number of records.')

    # figure and axis
    fig_height = max(2, min(30, (int(len(input_data_pivot)*0.8))))
    fig, ax = plt.subplots(1, figsize=(12, fig_height))
    # plot bars
    left = len(input_data_pivot) * [0]
    for idx, name in enumerate(labels_list):
        plt.barh(y = input_data_pivot[x_feature], width = input_data_pivot[name], height = 0.9, left = left)
        left = left + input_data_pivot[name]
    # title, legend, labels
    plt.title('', loc='left')
    plt.legend(labels_list, bbox_to_anchor=([1, 1]), frameon=False)
    plt.xlabel('Number of Records')
    # remove spines
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    # adjust limits and draw grid lines
    plt.ylim(-0.5, ax.get_yticks()[-1] + 0.5)
    ax.set_axisbelow(True)
    ax.xaxis.grid(color='gray', linestyle='dashed')

    va = ['top', 'bottom']
    va_idx = 0
    
    for n in input_data_percent:
        va_idx = 1 - va_idx
        # cs = cumulative sum, ab = absolute number of records, pc = percent of total records, tot = total.
        for i, (cs, ab, pc, tot) in enumerate(zip(input_data.iloc[:, 1:].cumsum(1)[n], 
                                                  input_data[n], 
                                                  input_data_percent[n], 
                                                  input_data_total)):
            # TOTAL FOR EACH BAR.
            plt.text(tot+1, i, str(tot), va='center') # Shows the Total of the Bar.
            
            # PERCENTAGE OF STACKED BARS GIVEN THE CONDITIONS.
            if (ab >= 30) & (pc < 100): # Shows value horizontally if the absolute value is >=30 and percentage is <100.
                plt.text(cs - ab/2, i, str(int(pc)) + '%', va='center', ha='center')
            if (30 > ab) & (ab >= 10) & (pc < 100): # Shows value rotated if the absolute value is <30 and >=10 and percentage is <100.
                plt.text(cs - ab/2, i, str(int(pc)) + '%', va='center', ha='center', rotation=90)
            # If total is 100% it does not show the value.
    plt.show();

## Text Analysis and Text Modeling Widget
[Return to Table of Contents](#Table-of-Contents)<br>

The text analysis and modeling algorithms perform several types of analyses. These include Latent Dirichlet Allocation (LDA), which is a type of topic model, as well as text clustering using the Kmeans and DBSCAN algorithms. These algorithms provide insights by grouping similar reports into clusters and giving statistics on the top terms.

LDA: 

KMEANS: In the Kmeans algorithm the user specifies the number of clusters, and the algorithm separates the records into that many clusters. The dashboard also offers an option that selects an optimal number of clusters, i.e., the number for which the distances between data points are optimal.

DBSCAN: The DBSCAN algorithm uses two main parameters: EPS and the minimum number of samples in a cluster. The EPS value is the maximum distance between two data points for them to be considered part of the same neighborhood. The minimum samples value is the number of data points required to form a cluster. Note that a low minimum samples value (e.g., 2) can result in a large number of clusters.

Dimensionality Reduction: Clustering algorithms (non-LDA algorithms) can break down on datasets with high dimensionality and may not provide meaningful clusters. Dimensionality reduction techniques (e.g., PCA and TSNE) are used in combination with the clustering algorithms (i.e., Kmeans and DBSCAN) to project high-dimensional data to lower dimensions (e.g., 2D, 3D, etc.) in which clusters can be identified. The scatter plots here are 2D projections based on the TSNE and PCA techniques. TSNE naturally expands dense clusters and contracts sparse ones, evening out cluster sizes given the Perplexity parameter.

Recommendation: Based on experience, term scores are the best guide for identifying meaningful clusters and how strongly the terms relate to them. For example, terms with TFIDF scores above 0.10 are considered to have a good relationship to the cluster. Lower values can also indicate a good relationship as long as several terms are related. To explore terms and topics, it is recommended to run Kmeans with no Dimensionality Reduction and DBSCAN with both PCA and TSNE.
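
Regarding the DBSCAN minimum samples parameter described above, the dashboard code derives a recommended value from the number of filtered records: 0.5% of the rows, but never fewer than 2 and never more than 500. The sketch below restates that rule as a standalone function; the function name and its keyword arguments are hypothetical.

```python
import math

def recommended_min_samples(n_records, fraction=0.005, floor=2, cap=500):
    """Clamp a fraction of the record count between a floor and a cap."""
    return int(min(max(math.ceil(n_records * fraction), floor), cap))

# Illustrative record counts only.
for n_records in (100, 4803, 200000):
    print(n_records, recommended_min_samples(n_records))
```

Small selections fall back to the floor of 2, which is the case where a large number of small clusters is most likely.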

References: <br>
- Random States: https://towardsdatascience.com/manipulating-machine-learning-results-with-random-state-2a6f49b31081
- Random States: https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution

In [43]:
def text_modelling_widget():
    interact_manual(text_modelling, 
                    text_model_W = widgets.Dropdown(options=TEXT_MODEL_list,
                                                            value = 'None',
                                                            description = 'Text Analysis Method', 
                                                            disabled=False, style={'description_width': 'initial'},
                                                            layout=widgets.Layout(width='40%')),
                    vect_W = widgets.Dropdown(options= ['TFIDF', 'Count'], 
                                              description = 'Word Vectorizer',
                                              value = 'TFIDF',
                                              disabled=False, style={'description_width': 'initial'},
                                              layout=widgets.Layout(width='40%')),
                    word_preprocessing_W = widgets.Dropdown(options= ['Lemmatization', 'Stemming'], 
                                                                       description = 'Word Preprocessing Method',
                                                                       value = 'Lemmatization',
                                                                       disabled=False, style={'description_width': 'initial'},
                                                                       layout=widgets.Layout(width='40%')),
                    dimensionality_reduction_W = widgets.Dropdown(options= ['None', 'PCA', 'TSNE'], 
                                                                       description = 'Dimensionality Reduction Method',
                                                                       value = 'PCA',
                                                                       disabled=False, style={'description_width': 'initial'},
                                                                       layout=widgets.Layout(width='40%')),
                    max_df_W = widgets.FloatSlider(min=0.05, 
                                                   max=1.0, 
                                                   value=0.95, 
                                                   step=0.05,
                                                   description="Term Maximum Document Frequency",
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='50%'),
                                                   readout_format='.2f',),
                    min_df_W = widgets.FloatSlider(min=0.0001, 
                                                   max=0.1, 
                                                   value=0.01, 
                                                   step=0.0005,
                                                   description="Term Minimum Document Frequency",
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='50%'),
                                                   readout_format='.4f',),
                    cluster_two_D_plot_W = widgets.Dropdown(options= ['None', 'PCA', 'TSNE'], 
                                                            description = 'Comparison 2D Plot',
                                                            value = 'None',
                                                            disabled=False, style={'description_width': 'initial'},
                                                            layout=widgets.Layout(width='40%')),
                    trend_timeperiod_W = widgets.Dropdown(options= ['None', 'release_cy'],
                                                                  description = 'Cluster Trend by',
                                                                  value = 'release_cy',
                                                                  disabled=False, style={'description_width': 'initial'},
                                                                  layout=widgets.Layout(width='40%')),
                   )
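
The two document frequency sliders control which terms are kept in the vocabulary. When given as fractions, terms that appear in a larger share of documents than the maximum, or in a smaller share than the minimum, are ignored. The following is a simplified, library-free sketch of that filter; the function name and the tokenized sample documents are hypothetical, and the notebook itself relies on its vectorizer for this step.

```python
from collections import Counter

def filter_terms_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, freq in doc_freq.items()
                  if min_df <= freq / n_docs <= max_df)

# Hypothetical tokenized documents (illustrative only).
sample_docs = [["space", "war"],
               ["space", "love"],
               ["space", "war", "robot"],
               ["space", "love", "war"]]
kept_terms = filter_terms_by_document_frequency(sample_docs, min_df=0.3, max_df=0.95)
print(kept_terms)
```

Here "space" is dropped because it appears in every document and "robot" is dropped because it appears in only one, which mirrors how very common and very rare terms are removed before clustering.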

## Text Analysis and Text Modeling Widget Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In [44]:
def text_modelling(text_model_W, vect_W, word_preprocessing_W, dimensionality_reduction_W, max_df_W, min_df_W, 
                   cluster_two_D_plot_W, trend_timeperiod_W):
    global text_model, word_preprocessing, dimensionality_reduction, max_df, min_df, cluster_two_D_plot, bow_calc_array_scaled, recommended_min_samples_inacluster, trend_timeperiod
    text_model = text_model_W
    vect = vect_W
    word_preprocessing = word_preprocessing_W
    dimensionality_reduction = dimensionality_reduction_W
    max_df = max_df_W
    min_df = min_df_W
    cluster_two_D_plot = cluster_two_D_plot_W
    trend_timeperiod = trend_timeperiod_W
    

    bow_vectorizer_data(input_data = df_data_filtered, vect = vect_W, word_preprocessing = word_preprocessing,
                        max_df = max_df, min_df = min_df)
    
    if text_model_W == 'None':
        display(HTML('<h2>Please Select a Text Analysis Method to obtain results.</h2>'))
    
    # Dimensionality Reduction Functions 
    # Needs to be called before the clustering functions as these functions create the bow_calc_array_scaled
    if dimensionality_reduction_W == 'None':
        bow_calc_array_scaled = bow_array
        
    if dimensionality_reduction_W == 'PCA':
        pca_variance_plot()
        PCA_func()
        
    if dimensionality_reduction_W == 'TSNE':
        TSNE_func()
    
    # Kmeans Clustering
    if text_model_W == 'Clustering: Kmeans (user defined clusters)':
        kmeans_k_clusters_widget()
        
    if text_model_W == 'Clustering: Kmeans w/Optimal K':
        kmeans_optimal_k_widget()
        kmeans_k_clusters_widget()
    
    # DBSCAN Clustering
    if text_model_W == 'Clustering: DBSCAN (Max Clusters w/Optimal EPS)':
        distances_func()
        dbscan_max_clusters_opt_eps_widget()
    
    if text_model_W == 'Clustering: DBSCAN (Densest Clusters w/Optimal EPS)':
        distances_func()
        dbscan_dense_clusters_opt_eps_widget()
    
    if text_model_W == 'Clustering: DBSCAN (% Records to Cluster)':
        recommended_min_samples_inacluster = int(min(max(np.ceil(df_data_filtered.shape[0]*0.005), 2), 500))
        # Used as Widget Input for the recommended minimum number of records in a cluster 
        # (i.e., 0.5% of total rows, rounded up, but no lower than 2 and no higher than 500).
        distances_func()
        dbscan_cluster_by_percent_widget()
        
    if text_model_W == 'Clustering: DBSCAN (Custom EPS and Min Samples)':
        recommended_min_samples_inacluster = int(min(max(np.ceil(df_data_filtered.shape[0]*0.005), 2), 500))
        # Used as Widget Input for the recommended minimum number of records in a cluster 
        # (i.e., 0.5% of total rows, rounded up, but no lower than 2 and no higher than 500).
        distances_func()
        dbscan_cluster_custom_widget()

## Bag of Words Text Model Function 
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- Explanation of min_df and max_df in scikit vectorizers: https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer
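
As a rough illustration of the `min_df` and `max_df` parameters referenced above, the sketch below filters a vocabulary by document frequency in plain Python. It is a simplified stand-in, not the scikit-learn implementation: the helper `filter_vocabulary` and the toy overviews are hypothetical, and the only behavior assumed from scikit-learn is that integers are read as document counts and floats as proportions of documents.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Hypothetical helper: integers are absolute document counts,
    floats are a proportion of the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

# Toy data (not taken from the movie dataset).
toy_overviews = [
    "space crew finds alien ship",
    "alien invasion threatens earth",
    "detective hunts killer in city",
    "crew explores alien planet",
]

# "alien" appears in 3 of 4 documents (above max_df=0.5) and is dropped;
# terms that appear in a single document fall below min_df=2.
vocab = filter_vocabulary(toy_overviews, min_df=2, max_df=0.5)
print(vocab)
```

Raising `min_df` removes rare terms and lowering `max_df` removes terms that are so common they carry little information for separating clusters.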

In [45]:
def bow_vectorizer_data(input_data, vect, word_preprocessing, max_df, min_df):
    global bow_array, features, df_bow, vectorizer, stopwords_DATA
    # DATA Specific Stopwords
    stopwords_DATA = text_normalization(text = stopwords_to_add_L2+STOPWORDS_winput, 
                                        word_reduction_method = word_preprocessing).split()

    # Creating the BOW for the target DataFrame with the selected vectorizer (TFIDF or Count).
    if vect == 'TFIDF':
        vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words=stopwords_DATA, 
                                     ngram_range=(1, 2), max_df=max_df, min_df=min_df) 
    if vect == 'Count':
        vectorizer = CountVectorizer(lowercase=True, analyzer='word', stop_words=stopwords_DATA, 
                                     ngram_range=(1, 2), max_df=max_df, min_df=min_df) 
    # Text Preprocessing
    if word_preprocessing == 'Lemmatization':
        bow_array = vectorizer.fit_transform(input_data['norm_text_lemma']).toarray()
    if word_preprocessing == 'Stemming':
        bow_array = vectorizer.fit_transform(input_data['norm_text_stem']).toarray()

    # Feature names (terms) of the vocabulary and the BOW array as a DataFrame.
    features = vectorizer.get_feature_names_out()
    df_bow = pd.DataFrame(bow_array, columns = features)
        
    input_data_stats_desc()

In [46]:
def input_data_stats_desc():
    print(f'The input data has {df_data_filtered.shape[0]} records.')
    print(f'The {df_bow.shape[0]} records have {df_bow.shape[1]} terms (unigrams and bigrams) in the Bag of Words model.') 
    print(f'Data Specific and filter added stopwords: {stopwords_DATA}.')
    print('NLTK stopwords are also used during cleaning.')

## Dimensionality Reduction Functions
[Return to Table of Contents](#Table-of-Contents)<br>

This notebook offers two methods for reducing the dimensionality of high-dimensional data: PCA and TSNE. When a dataset has a large number of dimensions and will be clustered, some form of dimensionality reduction is typically recommended. These methods find a mathematical representation of multi-dimensional data (i.e., >3D) in two or three dimensions. In this case every term in the bag of words built from each record (e.g., movie title and overview) is a dimension.

For both methods (i.e., PCA and TSNE) an array of reduced dimensions (bow_calc_array) is created for the clustering calculation. It is a 2-dimensional projection of the full bag of words array (bow_array). The bow_calc_array is then scaled to bow_calc_array_scaled, which is used by the clustering algorithm. The bow_array includes all terms and words (i.e., features). In these functions, bow_calc_array_scaled is always used to calculate the clusters, while bow_array is always used to determine the top words within a cluster. 

When no Dimensionality Reduction is used, bow_calc_array_scaled = bow_array and the clusters are calculated on the full bag of words. The cluster assignments can still be viewed within a PCA or TSNE projection, which allows verification of the correct behavior of the algorithm. 
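
The scaling step that turns bow_calc_array into bow_calc_array_scaled can be sketched without any library. The helper `min_max_scale` and the toy projection below are hypothetical; the sketch assumes the default behavior of a min-max scaler, where each column is mapped to the range [0, 1] independently.

```python
def min_max_scale(points):
    """Scale each column of a list of (x, y) points to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    # A constant column has zero span; use 1.0 to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((value - low) / span for value, low, span in zip(row, lows, spans))
        for row in points
    ]

# Toy 2-D projection (hypothetical values, not from the dataset).
projection = [(-2.0, 10.0), (0.0, 20.0), (2.0, 40.0)]
scaled = min_max_scale(projection)
print(scaled)
```

Scaling both components to the same range keeps a single EPS distance meaningful along both axes when DBSCAN is applied to the projection.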

## PCA Dimensionality Reduction Functions
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- https://github.com/DhruvilKarani/PCA-blog-notebook/blob/master/PCA.ipynb
- https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
- https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-7-unsupervised-learning-pca-and-clustering-db7879568417
- https://towardsdatascience.com/explaining-k-means-clustering-5298dc47bad6
- https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-kmeans-clustering-pca-b7ba6fafa74


In [47]:
def pca_variance_plot():
    NUM_COMPONENTS = min(df_bow.shape[0], bow_array.shape[1]) # Number of components. Cannot be larger than the number of n_samples or features. df_bow.shape[0] is the number of samples (e.g., records) in the sample. The bow_array.shape[1] is the number of features (terms or words) in the dataframe. 
    pca = PCA(n_components = NUM_COMPONENTS) # Defining the PCA model variable
    bow_calc_array = pca.fit_transform(bow_array) # The bow_calc_array is the reduced bow array to be used for the calculation of clusters.
    variance_explained = np.cumsum(pca.explained_variance_ratio_) # Cumulative fraction of explained variance (0 to 1).
    
    print('The y-axis is the fraction of cumulative explained variance for the number of components on the x-axis. Ideally we would select the number of components that explains a high share of the variability. For example, to explain 80% of the variability we would need the number of components at which the curve reaches 0.8. However, if that number is high it would probably not be feasible. For simplicity the PCA Dimensionality Reduction here reduces the dimensions to 2 components so the result can be visualized in a 2D plot.')
    fig, ax = plt.subplots(figsize=(8, 4))
    plt.plot(range(1, len(variance_explained) + 1), variance_explained, color='r') # The x-axis starts at 1 component.
    ax.grid(True)
    plt.title(label = f"Cumulative Explained Variance vs. {NUM_COMPONENTS} Components")
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.show();
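
The cumulative explained variance curve plotted above can also be read numerically. The sketch below is a minimal illustration, assuming the per-component variances are already available as a list; the helper `components_for_target` and the toy variances are hypothetical and do not come from the dataset.

```python
from itertools import accumulate

def components_for_target(variances, target=0.8):
    """Return the smallest number of components whose cumulative
    fraction of explained variance reaches the target, plus the curve."""
    total = sum(variances)
    cumulative = [running / total for running in accumulate(variances)]
    for k, fraction in enumerate(cumulative, start=1):
        if fraction >= target:
            return k, cumulative
    return len(variances), cumulative

# Toy per-component variances, sorted from largest to smallest.
toy_variances = [4.0, 3.0, 2.0, 1.0]
k, curve = components_for_target(toy_variances, target=0.8)
print(k, curve)
```

With real bag of words data the number of components needed for 80% is often large, which is why the notebook fixes the reduction at 2 components for plotting.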

In [48]:
def PCA_func():
    global bow_calc_array, bow_calc_array_scaled
    pca = PCA(n_components = 2) # PCA Model with 2 Components
    bow_calc_array = pca.fit_transform(bow_array) # The bow_calc_array is the reduced bow_array to be used for the calculation of clusters.

    # Projection Plot (scaled) of records in data.
    scaler = MinMaxScaler()
    scaler.fit(bow_calc_array)
    bow_calc_array_scaled = scaler.transform(bow_calc_array)

## TSNE Dimensionality Reduction Functions
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a
- https://distill.pub/2016/misread-tsne/
- https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
- https://towardsdatascience.com/high-dimension-clustering-w-t-sne-dbscan-dcec77e6a39b
- https://link.medium.com/zNc9QSQj0cb
- https://github.com/dougfoo/machineLearning/blob/master/covid/COVID-global-clustering.ipynb

In [49]:
def TSNE_func():
    global perplexity, bow_calc_array, bow_calc_array_scaled
    # TSNE with a defined perplexity. The perplexity shapes how the TSNE plot looks along its two dimensions. Ideally the user wants separable clusters and would iterate through different perplexities and select the one with the most defined clusters. With highly variable data (e.g., different filters that could be applied), reduced dimensionality, and users who are SMEs in other areas, it would be tough to evaluate and select a perplexity for each unique case, so a fixed value is used.
    perplexity = 30
    print(f'PLEASE WAIT: Running TSNE function is slow.  TSNE Perplexity parameter is set at {perplexity}.') # Printed before fitting so the message is visible while TSNE runs.
    # Note: scikit-learn 1.5 renamed the TSNE n_iter parameter to max_iter; on newer versions use max_iter=5000 instead.
    bow_calc_array = TSNE(n_components=2, perplexity = perplexity, random_state=42, n_iter=5000).fit_transform(bow_array) # TSNE Model with 2 Components
    
    # Scaling the projection of the records to the [0, 1] range.
    scaler = MinMaxScaler()
    scaler.fit(bow_calc_array)
    bow_calc_array_scaled = scaler.transform(bow_calc_array)

## Cluster Plots and Plot Projection Function
[Return to Table of Contents](#Table-of-Contents)<br>

These functions create the cluster plots, including the projections of the data in 2 dimensions. The projections can be used to verify that the clustering algorithm is behaving correctly. For example, when Dimensionality Reduction is used, each data point in the projection plot should share its cluster color with the points near it, unless it sits at the edge of a cluster. When no dimensionality reduction is used, the expected behavior is that the data points of different clusters are superimposed on one another. This is expected because the clustering algorithm used all the dimensions rather than the reduced dimensionality data.

The projection function and its corresponding PCA_func and TSNE_func are called at the end of the clustering algorithm so that overwriting bow_calc_array_scaled does not affect the clusters. In this case the bow_calc_array_scaled is used for the projection but does not affect the calculation of clusters.

References: <br>
- Euclidean Distance: https://www.kite.com/python/answers/how-to-find-euclidean-distance-in-python

In [50]:
def cluster_plots():
    global cluster_pred
    
    cluster_pred = df_data_filtered['cluster_label_by_function'].to_numpy() # Converting cluster_label assignments to an array.
    df_best_features = get_top_features_cluster(bow_array = bow_array, prediction = cluster_pred, n_feats = 20) # Gets top n_feats (e.g., 20) from the BOW_array.
    
    # Develops the Plot for records per Cluster
    cluster_count_figure = sns.countplot(x='cluster_label_rev', data = df_data_filtered)
    cluster_count_figure.set(ylim=(0, max(df_data_filtered['cluster_label_rev'].value_counts())*1.1)) # Increases the y-axis limit by 10% to ensure the annotation is inside of the border.
    plt.xlabel("Cluster Label", fontsize=14)
    plt.ylabel("No. of Records", fontsize=14)
        
    for p in cluster_count_figure.patches:
        cluster_count_figure.annotate(format(p.get_height()), 
                                      (p.get_x()+p.get_width()/2, p.get_height()), 
                                      ha = 'center', 
                                      va = 'center', 
                                      xytext = (0, 10),
                                      textcoords = 'offset points')
 
    if df_data_filtered['cluster_label_rev'].min() == 0:
        plotWords(df_best_features = df_best_features, top_n_terms = 10, elements_pie_chart = True, adjust_label = False)
    if df_data_filtered['cluster_label_rev'].min() == 1:
        plotWords(df_best_features = df_best_features, top_n_terms = 10, elements_pie_chart = True, adjust_label = True)
    if (df_data_filtered['cluster_label_rev'].min() != 0) & (df_data_filtered['cluster_label_rev'].min() != 1):
        print('Error with the Cluster Numbers. Please Check Clustering.')
    cluster_plot_projection()

In [51]:
def cluster_plot_projection():
    # Cluster 2-Dimension Plot Visualization. Can be used to verify clustering algorithm behavior.
    if (text_model != 'None') & (text_model != 'Topic Model: LDA') & (dimensionality_reduction == 'PCA'):
        cluster_plot_projection_w_name(plot_name = dimensionality_reduction)
    if (text_model != 'None') & (text_model != 'Topic Model: LDA') & (dimensionality_reduction == 'TSNE'):
        cluster_plot_projection_w_name(plot_name = dimensionality_reduction)
    
    if (text_model != 'None') & (text_model != 'Topic Model: LDA') & (cluster_two_D_plot != 'None'):
        if (cluster_two_D_plot == 'PCA') & (dimensionality_reduction != 'PCA'):
            PCA_func()
            cluster_plot_projection_w_name(plot_name = cluster_two_D_plot)
        if (cluster_two_D_plot == 'TSNE') & (dimensionality_reduction != 'TSNE'):
            TSNE_func()
            cluster_plot_projection_w_name(plot_name = cluster_two_D_plot)

In [52]:
def cluster_plot_projection_w_name(plot_name):
        # Projection Plot (non-scaled) of records in data. # Uncomment to show the non-scaled plot.
        #display(HTML(f'<h2>{cluster_two_D_plot} Scatter Plot Projection<h2>'))
        #plt.figure(figsize=(6,6))
        #plt.scatter(bow_calc_array[:, 0], bow_calc_array[:, 1], marker='.', s=100, lw=0, alpha=1, c=None, edgecolor=None)
        #plt.title(label = f"PCA Plot with Data Reduced to {2} Components")
        #plt.show();

        # Projection Plot (scaled) of records in data.
        display(HTML(f'<h3>Scaled {plot_name} Scatter Plot Projection</h3>'))
        plt.figure(figsize=(6,6))
        plt.scatter(bow_calc_array_scaled[:, 0], bow_calc_array_scaled[:, 1], marker='.', s=100, lw=0, alpha=1, c=None, edgecolor=None)
        plt.title(label = f"Scaled {plot_name} Plot with Data Reduced to {2} Components")
        plt.show();
        
        # Projection Plot (scaled) of records in data and assigned clusters.
        display(HTML(f'<h3>Scaled {plot_name} Scatter Plot Projection with Assigned Clusters</h3>'))
        plt.figure(figsize=(6,6))
        plt.scatter(bow_calc_array_scaled[:, 0], bow_calc_array_scaled[:, 1], marker='.', s=100, lw=0, alpha=1, c=df_data_filtered["cluster_label_rev"].to_numpy(), edgecolor=None)
        plt.title(label = f"Clusters {plot_name} Scatter Plot Projection. Total Clusters = {total_clusters}")
        plt.show();

## DBSCAN Clustering Functions
[Return to Table of Contents](#Table-of-Contents)<br>

The optimal value of EPS is at the inflection point (elbow) of the "EPS (Distances) vs. Samples" plot. When the optimal value is used, it is calculated automatically as the point on the curve closest to the bottom right corner of the plot when dimensionality reduction is applied, or furthest from it when no dimensionality reduction is used. 
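
A minimal sketch of this elbow search follows, covering only the "closest to the bottom right corner" case. The helper `elbow_eps` and the toy distances are hypothetical; the sketch assumes the nearest-neighbor distances are already sorted in ascending order, that there are at least two of them, and that both axes are scaled to [0, 1] as in the functions of this section.

```python
import math

def elbow_eps(sorted_distances):
    """Return the scaled EPS at the elbow of a sorted distance curve,
    taken as the point closest to the bottom right corner of the plot."""
    n = len(sorted_distances)
    d_max = max(sorted_distances)
    xs = [i / (n - 1) for i in range(n)]          # scaled sample index
    ys = [d / d_max for d in sorted_distances]    # scaled EPS
    corner = (1.0, min(ys))  # largest sample index, smallest EPS
    best = min(range(n), key=lambda i: math.dist((xs[i], ys[i]), corner))
    return ys[best], best

# Toy nearest-neighbor distances: flat at first, then a sharp rise.
toy_distances = [0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 2.0, 4.0]
eps_scaled, elbow_index = elbow_eps(toy_distances)
print(eps_scaled, elbow_index)
```

The elbow falls on the last point before the distances rise sharply, so records beyond it are the ones most likely to end up as outliers.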

The code allows the user to specify the percent of records to cluster, which can help identify outliers (e.g., by clustering 99% or another high percent of the records). Given the high dimensionality and variability of the data it might be challenging to find an optimal value for all cases.

The code also allows the user to identify the maximum number of clusters with the optimal EPS value and the lowest minimum samples in a cluster (i.e., 2). These represent all, or the maximum number of, clusters within the optimal EPS value in the dataset. Identifying all the clusters and their record counts can help confirm common records. At a minimum, the topic models resulting from the clusters with the top counts (the clusters could number in the hundreds) could be used to verify the most frequently occurring topics, or those that have many records within a cluster. If a traditional trending analysis identifies electrical safety records as the most common, the top non-zero cluster should represent electrical safety topics. If that is not the case then the analyst should verify why other topics are being identified as common. At the lowest number of minimum samples (e.g., 2), the results represent the theoretical optimal maximum number of clusters.

The code also allows the user to identify the densest clusters with the optimal EPS value and the highest minimum samples that still results in clusters. The densest cluster should represent the most common records within the dataset using DBSCAN. Currently, this can be calculated using the structured data. However, this approach can identify specific topics and insights that may not be captured at the Keyword level.
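
The "non-zero cluster" wording used in this section comes from how the labels are post-processed: DBSCAN marks noise with the label -1, and the clustering function in this section adds 1 to every label so that outliers become Cluster #0. The sketch below illustrates that bookkeeping on a hypothetical label list; the helper name `shift_and_summarize` is an assumption made for the example.

```python
def shift_and_summarize(labels):
    """Shift DBSCAN labels by one and summarize the result.

    Returns the shifted labels, the percent of records in a non-zero
    cluster, and the number of clusters (excluding outlier Cluster #0).
    """
    shifted = [label + 1 for label in labels]  # -1 (noise) becomes 0
    clustered = sum(1 for label in shifted if label != 0)
    percent = round(100 * clustered / len(shifted), 1)
    total_clusters = max(shifted)
    return shifted, percent, total_clusters

# Hypothetical raw labels: two noise points and three clusters.
raw_labels = [-1, 0, 0, 1, 1, 1, -1, 2, 2, 0]
shifted, percent, total = shift_and_summarize(raw_labels)
print(shifted, percent, total)
```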

References: <br>
- https://towardsdatascience.com/how-to-use-dbscan-effectively-ed212c02e62
- https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc
- https://iopscience.iop.org/article/10.1088/1755-1315/31/1/012012/pdf

In [53]:
def distances_func(): # Calculates the distances between data points and estimates the optimal EPS.
    global df_eps_vs_samples, eps_optimal, distances_scaled, eps_scaled_min, eps_scaled_max, eps_scaled_step
    neigh = NearestNeighbors(n_neighbors=2)
    nbrs = neigh.fit(bow_calc_array_scaled)
    distances, indices = nbrs.kneighbors(bow_calc_array_scaled)
    distances = np.sort(distances, axis=0)
    distances = distances[:,1] # Distances represent the values for EPS.

    # DataFrame for the "Samples vs. EPS", normalization/scaling and Optimal EPS.
    # Scaling transform all values between [0-1].
    # It should be the same as using the SkLearn MaxAbsScaler or MinMaxScaler with range 0, 1.
    df_eps_vs_samples = pd.DataFrame(distances, columns=['EPS'])
    df_eps_vs_samples['samples_index'] = df_eps_vs_samples.index
    df_eps_vs_samples['samples_index_scaled'] = df_eps_vs_samples['samples_index']/max(df_eps_vs_samples['samples_index'])
    df_eps_vs_samples['EPS_scaled'] = df_eps_vs_samples['EPS']/max(df_eps_vs_samples['EPS'])
    df_eps_vs_samples['euclidean_distance_to_brpoint'] = ""
       
    print("Please Wait, Process is Slow.")
    # Calculating the Euclidean distance from each point to the bottom right point (brpoint) of the EPS vs. Samples plot. The closest point is used to determine the optimal EPS.
    for i in range(0, df_eps_vs_samples.shape[0], 1):
        if (dimensionality_reduction == 'TSNE'):
            #Progress Status. But clears all of the other output in that cell.
            clear_output(wait=True)
            print(f'Current progress:', np.round(100*i/df_eps_vs_samples.shape[0], 0), '%')
        
        # Using the bottom right point as the reference point for the inflection point calculation. The top left could also be used. Note that when no dimensionality reduction is used the inflection point changes, which is taken into consideration below.
        df_eps_vs_samples.at[i, 'euclidean_distance_to_brpoint'] = np.linalg.norm(np.array((df_eps_vs_samples.at[i, 'EPS_scaled'], df_eps_vs_samples.at[i, 'samples_index_scaled'])) - np.array((min(df_eps_vs_samples['EPS_scaled']), max(df_eps_vs_samples['samples_index_scaled']))))
        #df_eps_vs_samples.at[i, 'euclidean_distance_to_tlpoint'] = np.linalg.norm(np.array((df_eps_vs_samples.at[i, 'EPS_scaled'], df_eps_vs_samples.at[i, 'samples_index_scaled'])) - np.array((max(df_eps_vs_samples['EPS_scaled']), min(df_eps_vs_samples['samples_index_norm']))))

    distances_scaled = np.array(df_eps_vs_samples['EPS_scaled'])
    
    # Optimal EPS is the point of inflection of the plot (i.e., the lowest distance between the curve and the bottom right point when dimensionality reduction is used, and the highest distance when it is not).
    if dimensionality_reduction == 'None':
        df_eps_vs_samples = df_eps_vs_samples.sort_values(by = ['euclidean_distance_to_brpoint'], ascending=False).reset_index()
    else:
        df_eps_vs_samples = df_eps_vs_samples.sort_values(by = ['euclidean_distance_to_brpoint'], ascending=True).reset_index()
    eps_optimal = df_eps_vs_samples['EPS_scaled'][0] # Optimal scaled EPS
    
    eps_scaled_min = min(df_eps_vs_samples['EPS_scaled']) # Min value of EPS scaled
    eps_scaled_max = max(df_eps_vs_samples['EPS_scaled']) # Max value of EPS scaled
    eps_scaled_step = (max(df_eps_vs_samples['EPS_scaled'])/100) # The step is 1% of the max EPS scaled
    
    input_data_stats_desc() # The Current Progress removes the input data stats description and hence added here again.
    
    # Plots for Samples Distance vs. EPS (scaled) and (non-scaled)
    plot1 = plt.figure(1, figsize=(6,6))
    plt.title(label = f"EPS vs. Samples (scaled)")
    plt.scatter(df_eps_vs_samples['samples_index_scaled'], df_eps_vs_samples['EPS_scaled'], marker='.', s=30, lw=0, alpha=1, c=None, edgecolor=None)    
    plt.show();
    
    #plot2 = plt.figure(2, figsize=(6,6))
    #plt.title(label = f"EPS vs. Samples (non-scaled)")
    #plt.plot(distances);
    #plt.show();
    

def dbscan_clusters(eps_1, min_samples): # Performs DBSCAN Clustering. 
    global cluster_pred, clusters, yhat, df_cluster_label, cluster_percent, total_clusters

    # Define the DBSCAN model.  EPS is the distance between points, and represents a density value between data points.
    model = DBSCAN(eps= eps_1, min_samples = min_samples)
    # Fit model and predict clusters
    yhat = model.fit_predict(bow_calc_array_scaled)
    # retrieve unique clusters
    clusters = unique(yhat)
    
    # Saving Model Labels to a variable
    cluster_pred = model.labels_
    
    df_cluster_label = pd.DataFrame(cluster_pred, columns=['cluster_label_by_function']) # Model.labels_ gives an array and converting the array to a dataframe.
    df_cluster_label['cluster_label_by_function'] += 1 # This modifies cluster "-1" as cluster "0" and all other clusters up.  This modification corrects an issue in the word plotting function where -1 is not plotting.
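    df_data_filtered['cluster_label_by_function'] = df_cluster_label['cluster_label_by_function'].to_numpy() # Adding column with cluster_labels to the full dataset dataframe. Assigned by position so a non-default index (e.g., after filtering) cannot misalign the labels.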
    
    cluster_percent = round(100*(df_cluster_label['cluster_label_by_function'].astype(bool).sum(axis=0))/(df_cluster_label.shape[0]),1) # Percent of non-zero cluster 
    total_clusters = df_cluster_label['cluster_label_by_function'].unique().max()

## DBSCAN Clustering (% Records to Cluster) Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In DBSCAN Clustering by Percent of Records to Cluster, the user selects the percentage of records to place in a non-zero cluster. The functions then iterate through the EPS values until they find the EPS where the selected percentage is met. 
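
The EPS sweep can be sketched under one simplifying assumption: with a minimum of 2 samples in a cluster, a record is clustered as soon as its nearest neighbor lies within EPS, so the clustered percentage can be computed from nearest-neighbor distances alone. The helper `eps_for_target_percent` and the toy distances are hypothetical, and the sketch does not run DBSCAN itself.

```python
def eps_for_target_percent(nn_distances, target_percent, step=0.01):
    """Sweep scaled EPS from `step` to 1 and return the first EPS (and the
    clustered percent) at which the target percent is met.

    Assumes min_samples = 2, so a record is clustered when its nearest
    neighbor distance is <= EPS.
    """
    n = len(nn_distances)
    n_steps = round(1 / step)
    percent = 0.0
    for k in range(1, n_steps + 1):
        eps = k * step  # integer counter avoids accumulating float error
        clustered = sum(1 for d in nn_distances if d <= eps)
        percent = round(100 * clustered / n, 1)
        if percent >= target_percent:
            return eps, percent
    return n_steps * step, percent

# Toy scaled nearest-neighbor distances (hypothetical values).
toy_nn = [0.015, 0.015, 0.045, 0.045, 0.095, 0.095, 0.295, 0.295, 0.895, 0.895]
eps_found, percent_found = eps_for_target_percent(toy_nn, target_percent=80)
print(round(eps_found, 2), percent_found)
```

A smaller step gives a finer EPS at the cost of more clustering runs, which is the trade-off behind the coarser step the notebook uses when no dimensionality reduction is applied.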

In [54]:
def dbscan_cluster_by_percent_widget():
    interact_manual(dbscan_cluster_by_percent, 
                    dbscan_percent_tocluster_W = widgets.IntSlider(min=1, 
                                                       max=100, 
                                                       value=95, 
                                                       step=1,
                                                       description="% of Records to Cluster",
                                                       disabled=False, style={'description_width': 'initial'},
                                                       layout=widgets.Layout(width='50%')),
                    min_samples_inacluster_W = widgets.IntSlider(min=1, 
                                                       max = 500, 
                                                       value = recommended_min_samples_inacluster,
                                                       step = 1,
                                                       description = "Min. Samples in a Cluster",
                                                       disabled = False, style={'description_width': 'initial'},
                                                       layout = widgets.Layout(width='50%')),
                   )

In [55]:
def dbscan_cluster_by_percent(dbscan_percent_tocluster_W, min_samples_inacluster_W):
    global dbscan_percent_tocluster, min_samples_inacluster, eps_iter, eps_iter_max, cluster_percent, total_clusters
    
    dbscan_percent_tocluster = dbscan_percent_tocluster_W
    min_samples_inacluster = min_samples_inacluster_W
    
    # Since EPS are scaled it goes from 0 to 1. 
    eps_iter = 0 
    eps_iter_max = 1
    # DBSCAN is really slow when not using dimensionality reduction hence step is increased to 0.1. May not give good results
    if dimensionality_reduction == 'None':
        counter_min = 0.1
    else:
        counter_min = 0.01

    while eps_iter < eps_iter_max:
        #Progress Status
        clear_output(wait=True)
        print(f'Current progress:', np.round(100*eps_iter/eps_iter_max, 0), '%')
        
        # Iteration of EPS and Clustered Reports.
        eps_iter += counter_min
        dbscan_clusters(eps_iter, min_samples_inacluster)
        
        if cluster_percent >= dbscan_percent_tocluster:  # Breaks once the specified percent to cluster threshold is met.
            break
        
    # Percent of records not assigned to a cluster.
    print('Results:\n')
    print(f'(1) A minimum cluster size of {min_samples_inacluster}, results in {cluster_percent}% of the {df_data_filtered.shape[0]} records being identified as non-zero cluster.  {round(100-cluster_percent, 1)}% of records are outliers (Cluster #0).\n')
    print(f'(2) The scaled EPS value used by the algorithm is {round(eps_iter, 2)}. There are a total of {total_clusters} clusters (not including outliers).\n')
    print(f'Note: The optimal scaled EPS value for this dataset is {round(eps_optimal, 2)}. The optimal EPS value is not used in this case and is for information purposes only.')
    
    cluster_plots() # Plots for term scores and elements and trending of records within the cluster.

## DBSCAN Clustering (Max Clusters w/Optimal EPS) Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In DBSCAN Clustering with Max Clusters and Optimal EPS, the user selects the minimum number of samples in a cluster (default 2, up to 500). As implemented below, DBSCAN is run with the optimal EPS and that value, and the function reports the resulting number of clusters and the percent of records assigned to them. At the lowest minimum samples (e.g., 2) the result represents the maximum number of clusters.

In [56]:
def dbscan_max_clusters_opt_eps_widget():
    interact_manual(dbscan_max_clusters_opt_eps, 
                    min_samples_inacluster_W = widgets.IntSlider(min=1, 
                                                       max = 500, 
                                                       value = 2,
                                                       step = 1,
                                                       description = "Min. Samples in a Cluster",
                                                       disabled = False, style={'description_width': 'initial'},
                                                       layout = widgets.Layout(width='50%')),
                   )

In [57]:
def dbscan_max_clusters_opt_eps(min_samples_inacluster_W):
    global min_samples_inacluster
    
    min_samples_inacluster = min_samples_inacluster_W
    min_samples_vs_percent = []
        
    dbscan_clusters(eps_1 = eps_optimal, min_samples = min_samples_inacluster)
    cluster_plots()
    
    row = [min_samples_inacluster, cluster_percent, df_cluster_label['cluster_label_by_function'].unique().max()]
    min_samples_vs_percent.append(row) 
    # If you want to see the dataframe of min_samples in a cluster vs percent add to global

    print(f'The results represent the number of clusters at the optimal EPS of {round(eps_optimal, 2)} and the minimum samples of {int(min_samples_inacluster)}. At the lowest number of minimum samples (e.g., 2), the results represent the theoretical optimal maximum number of clusters.\n')
    print(f'There are a total of {total_clusters} clusters with {cluster_percent}% of records. Outlier reports (i.e., Cluster #0) include {round(100-cluster_percent,2)}% of records.')

## DBSCAN Clustering (Most Dense Cluster w/Optimal EPS) Widget and Functions
[Return to Table of Contents](#Table-of-Contents)<br>

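In DBSCAN Clustering with the Most Dense Cluster and Optimal EPS, the algorithm increases the minimum samples in a cluster, starting at 2 and going up to the value selected by the user, while keeping the optimal EPS. It stops when no records are assigned to a cluster and steps back to the previous value, which is the highest minimum samples that still yields a cluster and therefore highlights the densest concentration of records.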
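
The search can be illustrated with a small pure-Python stand-in for DBSCAN on 1-D points. Both helpers and the toy points are hypothetical; the sketch relies only on the DBSCAN definitions that a core point has at least the minimum number of samples (itself included) within EPS, and that a record is not noise when it is a core point or lies within EPS of one.

```python
def percent_clustered(points, eps, min_samples):
    """Percent of 1-D points that DBSCAN would not label as noise."""
    # Core points have at least min_samples points (self included) within eps.
    core = [p for p in points
            if sum(1 for q in points if abs(p - q) <= eps) >= min_samples]
    # Non-noise points are core points or lie within eps of a core point.
    clustered = sum(1 for p in points if any(abs(p - c) <= eps for c in core))
    return round(100 * clustered / len(points), 1)

def densest_min_samples(points, eps, max_min_samples):
    """Highest min_samples (from 2 upward) that still yields a cluster."""
    best = None
    for min_samples in range(2, max_min_samples + 1):
        if percent_clustered(points, eps, min_samples) == 0:
            break  # no record is clustered any more
        best = min_samples
    return best

# Toy data: a dense group near 0, a pair near 5, and an isolated point.
toy_points = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 9.0]
best_min_samples = densest_min_samples(toy_points, eps=0.25, max_min_samples=10)
print(best_min_samples, percent_clustered(toy_points, 0.25, best_min_samples))
```

In the toy data only the dense group near 0 survives at the highest setting, which mirrors how the notebook isolates the most concentrated records.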

In [58]:
def dbscan_dense_clusters_opt_eps_widget():
    interact_manual(dbscan_dense_cluster_opt_eps, 
                    max_iter_samples_inacluster_W = widgets.IntSlider(min=1, 
                                                                   max = 500, 
                                                                   value = 500,
                                                                   step = 1,
                                                                   description = "Min. Samples in a Cluster",
                                                                   disabled = False, style={'description_width': 'initial'},
                                                                   layout = widgets.Layout(width='50%')),
                   )

In [59]:
def dbscan_dense_cluster_opt_eps(max_iter_samples_inacluster_W):
    global min_samples_inacluster
    # Max Clusters Using the Optimal EPS value.
    min_samples_inacluster = 1 # Iteration starts at 1 up to the max.
    min_samples_vs_percent = []

    while min_samples_inacluster < max_iter_samples_inacluster_W: # Iterates up to a max of max_iter_samples_inacluster_W.
        #Progress Status
        clear_output(wait=True)
        print(f'Current progress:', np.round(100*min_samples_inacluster/max_iter_samples_inacluster_W, 0), '%')
        
        row = []
        min_samples_inacluster += 1
        dbscan_clusters(eps_1 = eps_optimal, min_samples = min_samples_inacluster)
    
        row = [min_samples_inacluster, cluster_percent, df_cluster_label['cluster_label_by_function'].unique().max()]
        min_samples_vs_percent.append(row)
    
        if cluster_percent == 0:  # Breaks when no records are assigned to a cluster.
            break

    min_samples_inacluster -= 1 # Steps back to the previous min_samples value, the highest that still results in a cluster. This highlights the highest concentration of records.

    dbscan_clusters(eps_1 = eps_optimal, min_samples = min_samples_inacluster)
    cluster_plots()
    
    row = [min_samples_inacluster, cluster_percent, df_cluster_label['cluster_label_by_function'].unique().max()]
    min_samples_vs_percent.append(row)
    
    print(f'The results represent the most dense cluster at the optimal EPS of {round(eps_optimal, 3)} and the highest minimum samples that results in a cluster(s) {int(min_samples_inacluster)}.')
    print(f'There are a total of {total_clusters} clusters with {cluster_percent}% of records. Outlier reports (i.e., Cluster #0) include {round(100-cluster_percent,2)}% of records.')

## DBSCAN Clustering (Custom EPS and Min Samples in a Cluster) Functions
[Return to Table of Contents](#Table-of-Contents)<br>

In DBSCAN Custom EPS and Min Samples in a Cluster, the user can select any EPS and any minimum number of samples in a cluster.
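
The minimum samples slider starts at the recommended value computed in text_modelling: 0.5% of the filtered records, rounded up, but never lower than 2 or higher than 500. The sketch below restates that rule as a standalone function; the name `recommended_min_samples` and its keyword arguments are hypothetical and chosen only for the example.

```python
import math

def recommended_min_samples(n_records, fraction=0.005, floor=2, cap=500):
    """Recommended minimum samples in a cluster for a dataset size."""
    return int(min(max(math.ceil(n_records * fraction), floor), cap))

# Small, medium, and very large filtered datasets.
for n in (100, 4803, 200000):
    print(n, recommended_min_samples(n))
```

The floor keeps small filtered datasets from using a single-record cluster, and the cap matches the upper limit of the slider.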

In [60]:
def dbscan_cluster_custom_widget():
    interact_manual(dbscan_cluster_custom, 
                    eps_custom_W = widgets.FloatSlider(min=eps_scaled_min, 
                                                     max=eps_scaled_max, 
                                                     value=eps_optimal, 
                                                     step=eps_scaled_step,
                                                     description="Scaled EPS",
                                                     disabled=False, style={'description_width': 'initial'},
                                                     layout=widgets.Layout(width='50%')),
                    min_samples_inacluster_custom_W = widgets.IntSlider(min=1, 
                                                                      max = 500, 
                                                                      value = recommended_min_samples_inacluster,
                                                                      step = 1,
                                                                      description = "Min. Samples in a Cluster",
                                                                      disabled = False, style={'description_width': 'initial'},
                                                                      layout = widgets.Layout(width='50%')),
                   )

In [61]:
def dbscan_cluster_custom(eps_custom_W, min_samples_inacluster_custom_W):
    global eps_custom, min_samples_inacluster_custom
    eps_custom = eps_custom_W
    min_samples_inacluster_custom = min_samples_inacluster_custom_W
    
    dbscan_clusters(eps_1 = eps_custom, min_samples = min_samples_inacluster_custom)
    
    # Percent of reports not assigned to a cluster.
    print(f'Results:\n')
    print(f'(1) A scaled EPS of {round(eps_custom, 2)} and a minimum cluster size of {min_samples_inacluster_custom} result in a total of {total_clusters} clusters (not including outlier Cluster #0), with {cluster_percent}% of the {df_data_filtered.shape[0]} records assigned to a non-zero cluster. {round(100-cluster_percent, 2)}% of records are identified as outliers (Cluster #0).\n')
    print(f'(2) The optimal scaled EPS value is {round(eps_optimal, 2)} and the recommended minimum samples in a cluster is {recommended_min_samples_inacluster}.\n')
    print(f'Note: The optimal scaled EPS value for this dataset is {round(eps_optimal, 2)}. The optimal EPS value is not used in this case and is for information purposes only.')
    
    cluster_plots() # Plots for term scores and elements and trending of records within the cluster.

## KMeans Clustering Functions
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- https://medium.com/@lucasdesa/text-clustering-with-k-means-a039d84a941b
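
One way to read the optimal K plots is to pick the K with the highest silhouette score. The sketch below does this on synthetic data. The three-blob array `X` and the names `scores` and `best_k` are assumptions for illustration; the notebook itself scores `bow_calc_array_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy data (assumption): three well-separated 2-D blobs of 30 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit KMeans for each candidate K and record the silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)

# The K with the highest silhouette score is one reasonable choice.
best_k = max(scores, key=scores.get)
print(f'Best K by silhouette score: {best_k}')
```

On real text data the silhouette curve is usually much flatter than in this toy case, so it is worth comparing it with the elbow plots before settling on a value.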

In [62]:
def kmeans_optimal_k_widget():
    interact(kmeans_optimal_k, 
             kmeans_optimal_k_W = widgets.Dropdown(options= ['Silhouette Score', 'KMeans Distortions', 'KMeans Score'], 
                                                   description = 'Optimal K Method',
                                                   value = 'Silhouette Score',
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='40%')),
            )

In [63]:
def kmeans_optimal_k(kmeans_optimal_k_W):
    # Caps the candidate K values at the number of records, which avoids an error when there are fewer than 50 records.
    K = range(2, min(50, df_data_filtered.shape[0]))
    km_scores= []
    km_silhouette = []
    distortions = []
    
    for i in K:
        km = KMeans(n_clusters=i, max_iter=1000, n_init=10, random_state=0).fit(bow_calc_array_scaled)
        preds = km.predict(bow_calc_array_scaled)
        km_scores.append(-km.score(bow_calc_array_scaled)) # Negated score, i.e., the sum of squared distances to the closest centroid.
        distortions.append(sum(np.min(cdist(bow_calc_array_scaled, km.cluster_centers_, 'euclidean'), axis=1)) / bow_calc_array_scaled.shape[0]) # Mean distance to the closest centroid.
        silhouette = silhouette_score(bow_calc_array_scaled, preds)
        km_silhouette.append(silhouette)

    if kmeans_optimal_k_W == 'Silhouette Score':
        plt.figure(figsize=(14,4))
        plt.title("The silhouette coefficient method \nfor determining the optimal number of clusters\n",fontsize=16)
        plt.scatter(x=[i for i in K],y=km_silhouette,s=150,edgecolor='k')
        plt.grid(True)
        plt.xlabel("Number of clusters",fontsize=12)
        plt.ylabel("Silhouette score",fontsize=15)
        plt.xticks([i for i in K],fontsize=10)
        plt.yticks(fontsize=12)
        print('Silhouette score: a measure of how similar an object is to its own cluster compared to other clusters. Values range from −1 to +1. A high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A value of 1 means that the clusters are very dense and well separated. A score of 0 means that clusters overlap. A score below 0 suggests that records may have been assigned to the wrong cluster.')
        print('References: "Silhouette (clustering)" at Wikipedia and "KMeans Silhouette Score Explained With Python Example" Published at DZone with permission of Ajitesh Kumar.')
        plt.show();
        
    if kmeans_optimal_k_W == 'KMeans Distortions':# Plot the elbow using Kmeans Distortions
        plt.figure(figsize=(14,4))
        plt.title("The elbow method for determining number of clusters\n",fontsize=16)
        plt.scatter(K, distortions, s=150,edgecolor='k')
        plt.grid(True)
        plt.xlabel("Number of clusters", fontsize=14)
        plt.ylabel("K-means Distortion", fontsize=15)
        plt.xticks(K,fontsize=12)
        plt.yticks(fontsize=10)
        plt.show();
        
    if kmeans_optimal_k_W == 'KMeans Score':# Plot the elbow using Kmeans Score
        plt.figure(figsize=(14,4))
        plt.title("The elbow method for determining number of clusters\n",fontsize=16)
        plt.scatter(x=[i for i in K],y=km_scores,s=150,edgecolor='k')
        plt.grid(True)
        plt.xlabel("Number of clusters",fontsize=14)
        plt.ylabel("K-means score",fontsize=15)
        plt.xticks([i for i in K],fontsize=10)
        plt.yticks(fontsize=12)
        plt.show();

In [64]:
def kmeans_k_clusters_widget():
    interact_manual(kmeans_k_clusters, 
                    total_clusters_W = widgets.IntSlider(min=2, 
                                                   max=50, 
                                                   value=3, 
                                                   step=1,
                                                   description="KMEANS No. Clusters",
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='50%')
                                                  )
                   )

In [65]:
def kmeans_k_clusters(total_clusters_W):
    global cluster_pred, total_clusters
    total_clusters = total_clusters_W
    
    if total_clusters_W > len(df_data_filtered):
        print(f'Please select a lower number of clusters. There are more clusters ({total_clusters_W}) than records ({len(df_data_filtered)}).')
    else:
        print(f'{total_clusters_W} Clusters were selected.')
        # Kmeans function to assign predicted cluster for record.
        cluster_pred = KMeans(n_clusters=total_clusters_W, random_state=42, max_iter=1000, n_init=10).fit_predict(bow_calc_array_scaled)

        df_cluster_label = pd.DataFrame(cluster_pred, columns=['cluster_label_by_function']) # Converts the array of predicted cluster labels to a dataframe.
        df_cluster_label['cluster_label_by_function'] += 1 # Shifts every label up by one so that cluster "0" becomes cluster "1". KMeans has no outlier cluster, so there is no Cluster 0.
    
        # Percent of non-zero cluster.
        cluster_percent = round(100*(df_cluster_label['cluster_label_by_function'].astype(bool).sum(axis=0))/(df_cluster_label.shape[0]),1) 
        print(f'{cluster_percent}% of records assigned to a cluster.')
    
        df_data_filtered['cluster_label_by_function'] = df_cluster_label # Adding column with cluster_labels to the full dataset dataframe.    
    
        # WOULD BE NICE TO CALCULATE/SORT THE CLUSTER NUMBER BASED ON AVERAGE TOP 10 TERM SCORES WHERE CLUSTER 1 HAD THE HIGHEST AND DESCENDING FROM THERE.
    
        cluster_plots()

In [66]:
### TBD: Implement if needed. This is for selecting the optimal K clusters in a similar fashion to MicroStrategy.

def kmeans_optimal_k_clusters(total_clusters_W, K_clusters_limit):
    # Similar to how MicroStrategy defines it, this would put an upper limit on total_clusters_W. It could use the best elbow or silhouette value and run KMeans with that optimal value.
    print('Running with the optimal number of clusters still needs to be set up. The optimal K can be selected manually using Kmeans w/Optimal K.')
    print(total_clusters_W)

# Top Terms WordCloud Widget and Clustering Plots Functions
[Return to Table of Contents](#Table-of-Contents)<br>

This section contains the functions that extract the top terms per cluster and draw the word cloud, the term bar charts, and the cluster trend plots.
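
The core idea behind the top terms is to average each term's score over the records in a cluster and keep the highest averages. The sketch below illustrates this on a tiny hand-made matrix. The arrays `features`, `bow`, and `labels` and the helper `top_terms_per_cluster` are hypothetical stand-ins for illustration, not the notebook's own objects.

```python
import numpy as np

# Toy inputs (assumptions): 4 documents x 3 terms, and one cluster label per document.
features = np.array(["space", "love", "war"])
bow = np.array([[0.9, 0.0, 0.2],
                [0.8, 0.1, 0.0],
                [0.0, 0.7, 0.2],
                [0.1, 0.9, 0.0]])
labels = np.array([1, 1, 2, 2])

def top_terms_per_cluster(bow, labels, features, n_feats=2):
    """Return {label: [(term, mean_score), ...]} sorted by descending mean score."""
    result = {}
    for label in np.unique(labels):
        cluster_mean = bow[labels == label].mean(axis=0)      # Average score per term.
        top_idx = np.argsort(cluster_mean)[::-1][:n_feats]    # Indices of the highest averages.
        result[int(label)] = [(str(features[i]), float(cluster_mean[i])) for i in top_idx]
    return result

top = top_terms_per_cluster(bow, labels, features)
for label, terms in top.items():
    print(label, terms)
```

The same per-term averages, taken over all records instead of per cluster, are what the word cloud uses as term frequencies.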

In [67]:
def get_top_features_cluster(bow_array, prediction, n_feats):
    global df_best_features, df_label_by_ftrs_score, df_best_features_raw
    # Note that the Clustering Functions assigns the clustering labels in somewhat random fashion (cluster_label_by_function). This function here revises the non-zero cluster labels to be sorted by mean of the top 10 term scores.
    
    labels = np.unique(prediction)
    df_best_features_raw = []
    df_label_by_ftrs_score = pd.DataFrame()

    for label in labels:
        id_temp = np.where(prediction==label) # Indices of the records within the cluster
        x_means = np.mean(bow_array[id_temp], axis = 0) # Returns average score across cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats] # indices with top n_feats scores
        best_features = [(features[i], x_means[i]) for i in sorted_means] # Variable "features" is created in the bow_vectorizer_data function.
        df = pd.DataFrame(best_features, columns = ['Terms', 'Avg. Score']) # Dataframes with the Terms and Avg. Scores 
        df_best_features_raw.append(df) # List of Dataframes with the Terms and Avg. Scores

        # Calculates the top 10 terms average score for each Cluster Label.
        # Note that the "label-1" is the index of the label in the df_best_features_raw list of dataframes and needs to be used for clustering methods that don't produce a cluster number 0.  
        if min(labels) == 0: # Used by the Top Terms and DBSCAN, which both assign records to cluster zero.
            df2 = pd.DataFrame(data = [[label, round(mean(df_best_features_raw[label][:10]['Avg. Score']),4), int(0)]], 
                               columns = ['cluster_label_by_function', 
                                          'top_10_terms_avg_score', 'cluster_label_rev'])
        if min(labels) > 0: # Used by KMeans or DBSCAN when there is no cluster zero.
            df2 = pd.DataFrame(data = [[label, round(mean(df_best_features_raw[label-1][:10]['Avg. Score']),4), int(0)]], 
                               columns = ['cluster_label_by_function', 
                                          'top_10_terms_avg_score', 'cluster_label_rev'])  
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
        df_label_by_ftrs_score = pd.concat([df_label_by_ftrs_score, df2], ignore_index = True)
        
    # If there is a cluster 0 (records not assigned to a cluster), sets its row aside in df_temp so that it keeps label 0 and is excluded from the sorting below.
    df_temp = pd.DataFrame()
    if (min(df_label_by_ftrs_score['cluster_label_by_function']) == 0):
        df_temp = df_label_by_ftrs_score.loc[(df_label_by_ftrs_score['cluster_label_by_function'] == 0)].copy()
        df_label_by_ftrs_score = df_label_by_ftrs_score[df_label_by_ftrs_score['cluster_label_by_function'] != 0]
    
    # Sorts the non-zero clusters by the top_10_terms_avg_score
    df_label_by_ftrs_score = df_label_by_ftrs_score.sort_values(by='top_10_terms_avg_score', ascending=False)
    # Once sorted, creates a revised cluster label where the cluster with the highest avg score is first and descending from there.
    df_label_by_ftrs_score.cluster_label_rev = range(1, 1 + len(df_label_by_ftrs_score))

    # Reinserts cluster 0 into the final dataframe, where clusters have been sorted by top 10 average feature score
    if not df_temp.empty:
        df_label_by_ftrs_score = pd.concat([df_temp, df_label_by_ftrs_score], axis=0)
    
    # Final Dataframe with the crosswalk of the revised label sorted by the top 10 average score.
    df_label_by_ftrs_score = df_label_by_ftrs_score.reset_index(drop = True) # Resets index.
    
    # Adds the revised cluster label to the df_data_filtered
    # Uses the crosswalk from df_label_by_ftrs_score to look up, for every record, the top 10 terms average score and the revised cluster label of its original cluster label.
    crosswalk = df_label_by_ftrs_score.set_index('cluster_label_by_function')
    df_data_filtered['top_10_terms_avg_score'] = df_data_filtered['cluster_label_by_function'].map(crosswalk['top_10_terms_avg_score']).round(4)
    df_data_filtered['cluster_label_rev'] = df_data_filtered['cluster_label_by_function'].map(crosswalk['cluster_label_rev']).astype(int)
    
    if min(labels) == 0: # Used by the Top Terms and DBSCAN, which both use cluster zero. The label equals the list index in df_best_features_raw.
        rev_order = df_label_by_ftrs_score['cluster_label_by_function'].tolist()
    if min(labels) > 0: # Used by KMeans or DBSCAN when there is no cluster zero. The label is one more than the list index.
        rev_order = (df_label_by_ftrs_score['cluster_label_by_function']-1).tolist()
    df_best_features = [df_best_features_raw[i] for i in rev_order] # Sorts the list of dataframes based on the mean of the top 10 terms average score and keeping the cluster 0 as 0. 
    
    return df_best_features
 
        
def plotWords(df_best_features, top_n_terms, elements_pie_chart, adjust_label):
    fig_height = top_n_terms/2.5
    for label in range(0, len(df_best_features)):
        # Note that the "df_best_features" data frame resets its index to 0 hence the adjusting of the label+1 which is only needed for Kmeans or when DBSCAN finds no Cluster 0.   
        if adjust_label == False:
            cluster_label = label
        if adjust_label == True:
            cluster_label = label+1 # Adjusting the Cluster Label list to the df_to_cluster labels.  

        f, ax = plt.subplots(figsize=(6, fig_height))
        plt.title((f"Most Common Terms in Cluster #{cluster_label}"), fontsize=10, fontweight='bold')
        sns.barplot(x = 'Avg. Score', y = 'Terms', data = df_best_features[label][:top_n_terms], color = '#006BA4')
        ax.set(xlim=(0, max(df_best_features[label][:top_n_terms]['Avg. Score'])*1.1)) # Increases the x-axis limit by 10% to ensure the annotation is inside of the border.
        
        for p in ax.patches:
            width = p.get_width()    # bar length (the average score)
            ax.text(width,       # x position: the text starts at the right end of the bar
                    p.get_y() + p.get_height()/2, # y position: vertical center of the bar
                    '{:.3f}'.format(width), # text to display, 3 decimals
                    ha = 'left',   # horizontal alignment
                    va = 'center')  # vertical alignment
        plt.show();
        
        if elements_pie_chart == True:
            # Pie chart of original_language. Only languages with more than show_threshold records in the cluster are shown.
            show_threshold = 3
            language_counts = (df_data_filtered.loc[df_data_filtered['cluster_label_rev'] == cluster_label, 'original_language']
                                               .value_counts(dropna = False)
                                               .loc[lambda x : x > show_threshold])
            if not language_counts.empty:
                plt.subplots(figsize=(6, fig_height))
                language_counts.plot.pie()
                plt.title(f"Original Languages for Cluster #{cluster_label}",fontsize=16)
                plt.xlabel(f"Original Languages (*> {show_threshold} Records)",fontsize=12)
                plt.ylabel("",fontsize=15)
                plt.show();
            else:
                print(f'PIE CHART NOT SHOWN: Cluster #{cluster_label} has no Original Language with more than {show_threshold} records.')
            
        # Prints the trend plot if a timeperiod is selected (!=None) and there are at least two time periods.
        if (trend_timeperiod != 'None'):
            if (len(df_data_filtered[(df_data_filtered[trend_timeperiod] != 'NaT')][trend_timeperiod].value_counts()) > 1):
                plot_cluster_trend(input_data = df_data_filtered, cluster_label = cluster_label)


def plot_cluster_trend(input_data, cluster_label): 
    global df_cluster_counts
    # Dataframe of Cluster's Record Counts per Time.
    df_cluster_counts = (input_data.loc[(input_data['cluster_label_rev'] == cluster_label)]
                                 .copy()
                                 .groupby([trend_timeperiod],as_index=False)
                                 .agg({'cluster_label_rev': 'count'})) # Aggregating counting the items in the filtered dataframe by year.
    df_cluster_counts = df_cluster_counts.rename(columns={"cluster_label_rev": "record_counts"})
    
    # Removes NaT/NaN values and converts to float so the trendline equation values are calculated correctly.
    df_cluster_counts = df_cluster_counts.loc[(df_cluster_counts[trend_timeperiod] != 'NaT') & 
                                              (df_cluster_counts[trend_timeperiod].notna())].copy() # A comparison with np.nan is always True, so notna() is used.
    df_cluster_counts[trend_timeperiod] = df_cluster_counts[trend_timeperiod].astype(float)

    # Plots the data and trendline only when the data has more than 1 time period (e.g., year). Otherwise does not calculate the trend.
    if (len(df_cluster_counts[trend_timeperiod].unique()) > 1):
        print(f'Linear Trend of Records within Cluster #{cluster_label}')                                                           
        bar_chart_wtrend(input_data = df_cluster_counts, x_feature = trend_timeperiod, 
                         y_feature='record_counts', add_trend ='Linear', add_equation = True)
    else:
        print(f'Records within Cluster #{cluster_label} (only one time period, so no trend is calculated)')
        bar_chart_wtrend(input_data = df_cluster_counts, x_feature = trend_timeperiod, 
                         y_feature='record_counts', add_trend ='None', add_equation = False)
    print('Note: Algorithm removes Null values for time period plots and calculating the trendline.')
    plt.show(); 
    
    
def wordcloud_plot(n_feats):
    Cloud = WordCloud(background_color="white",
                      max_words = n_feats,
                      random_state = 49,
                      normalize_plurals = False,
                      colormap= 'tab20').generate_from_frequencies(df_bow.T.mean(axis=1))
    plt.figure(figsize=(9, 6))
    plt.imshow(Cloud, interpolation='bilinear')
    plt.axis("off")
    plt.show();   

## Top Terms Wordcloud Widget
[Return to Table of Contents](#Table-of-Contents)<br>

In [68]:
def top_terms_widget():
    interact_manual(top_terms, 
                    word_preprocessing_W = widgets.Dropdown(options= ['Lemmatization', 'Stemming'], 
                                                                       description = 'Word Preprocessing Method',
                                                                       value = 'Lemmatization',
                                                                       disabled=False, style={'description_width': 'initial'},
                                                                       layout=widgets.Layout(width='40%')),
                    vect_W = widgets.Dropdown(options= ['TFIDF', 'Count'], 
                                              description = 'Word Vectorizer',
                                              value = 'TFIDF',
                                              disabled=False, style={'description_width': 'initial'},
                                              layout=widgets.Layout(width='40%')),
                    n_terms_W = widgets.IntSlider(min=10, 
                                                  max=100, 
                                                  value=20, 
                                                  step=1,
                                                  description="Number of Terms to Plot",
                                                  disabled=False, style={'description_width': 'initial'},
                                                  layout=widgets.Layout(width='50%')),
                    max_df_W = widgets.FloatSlider(min=0.05, 
                                                   max=1.0, 
                                                   value=0.95, 
                                                   step=0.05,
                                                   description="Term Maximum Document Frequency",
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='50%'),
                                                   readout_format='.2f'),
                    min_df_W = widgets.FloatSlider(min=0.0001, 
                                                   max=0.1, 
                                                   value=0.01, 
                                                   step=0.005,
                                                   description="Term Minimum Document Frequency",
                                                   disabled=False, style={'description_width': 'initial'},
                                                   layout=widgets.Layout(width='50%'),
                                                   readout_format='.4f'),
                    trend_timeperiod_W = widgets.Dropdown(options= ['None', 'release_cy'],
                                                                  description = 'Cluster Trend by',
                                                                  value = 'release_cy',
                                                                  disabled=False, style={'description_width': 'initial'},
                                                                  layout=widgets.Layout(width='40%')),
                   )    

In [69]:
def top_terms(word_preprocessing_W, vect_W, n_terms_W, max_df_W, min_df_W, trend_timeperiod_W):
    global word_preprocessing, vect, n_terms, max_df, min_df, cluster_pred, df_best_features, trend_timeperiod
    word_preprocessing = word_preprocessing_W
    vect = vect_W
    n_terms = n_terms_W
    max_df = max_df_W
    min_df = min_df_W
    trend_timeperiod = trend_timeperiod_W
    
    df_data_filtered['cluster_label_by_function'] = 0 # Sets the cluster label to 0 as the initial value to extract the top terms across the filtered data without running the cluster functions.
    cluster_pred = df_data_filtered["cluster_label_by_function"].to_numpy() # Converts the cluster label assignments to an array.
    
    bow_vectorizer_data(input_data = df_data_filtered, vect = vect, word_preprocessing = word_preprocessing,
                       max_df = max_df, min_df = min_df)
    
    df_best_features = get_top_features_cluster(bow_array = bow_array, prediction = cluster_pred, n_feats = n_terms)
    wordcloud_plot(n_feats = n_terms)
    plotWords(df_best_features = df_best_features, top_n_terms = n_terms, elements_pie_chart = False, adjust_label = False)

# Predictive Modeling Widget
[Return to Table of Contents](#Table-of-Contents)<br>

This widget uses linear regression and a Poisson distribution to predict the record count for the time period that follows the selected range.

References: <br>
- Box Plots: https://en.wikipedia.org/wiki/Box_plot
- Kruskal-Wallis H statistic: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html
- Scatter Plot with Confidence Intervals: https://stackoverflow.com/questions/27164114/show-confidence-limits-and-prediction-limits-in-scatter-plot
- Poisson Distributions: https://www.statology.org/poisson-distribution-python/
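
As a rough sketch of the two forecasts, the example below extrapolates a linear trend by one period and draws Poisson samples around a chosen lambda. The yearly counts are made up for illustration, and `np.polyfit` and NumPy's random generator are swapped in here for brevity; the notebook itself uses `LinearRegression` and `poisson.rvs`.

```python
import numpy as np

# Toy yearly counts (assumption), standing in for the number of records per release_cy.
years = np.array([2016, 2017, 2018, 2019])
counts = np.array([100, 110, 125, 130])

# Linear trend: fit counts = slope * t + intercept, with t measured from the first year,
# then extrapolate one period beyond the selected range.
t = years - years.min()
slope, intercept = np.polyfit(t, counts, deg=1)
next_year = int(years.max()) + 1
lr_forecast = slope * (next_year - years.min()) + intercept

# Poisson: lambda is the count of the last selected period (the mean of all periods is the alternative).
lmbda = int(counts[-1])
rng = np.random.default_rng(0)
samples = rng.poisson(lam=lmbda, size=10_000)
low, median, high = np.percentile(samples, [2.5, 50, 97.5])

print(f'{next_year} linear trend forecast: {lr_forecast:.1f}')
print(f'{next_year} Poisson median: {median:.0f} (95% of samples between {low:.0f} and {high:.0f})')
```

The linear forecast follows the trend in the counts, while the Poisson samples describe the spread to expect if the next period behaves like the period used for lambda.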


In [70]:
def predictive_widget():
    interact_manual(predictive_models, 
                    predictive_model_W = widgets.Dropdown(options=['LR and Poisson', 'PLACEHOLDER: Monte Carlo'],
                                                            value = 'LR and Poisson',
                                                            description = 'Predictive Model', 
                                                            disabled=False, style={'description_width': 'initial'},
                                                            layout=widgets.Layout(width='50%'))
                   )

In [71]:
def predictive_models(predictive_model_W):
    if predictive_model_W == 'LR and Poisson':
        lr_poisson_prediction_widget()

In [72]:
def lr_poisson_prediction_widget():
    print('The Linear Regression and Poisson Distribution models use the filtered data to forecast the expected value of the following time period (e.g., CY, FY).')
    print('\nFor example, if the user filters for 2016 to 2019 CY data and selects release_cy, the model shows the predicted value for CY 2020 using a LR (red bar) and a Poisson Distribution (blue boxplot). Where data is available for the following year, it is shown as a yellow bar.')
    print('\nThe Linear Regression equation for the selected data is provided in the summary table.')
    print('\nThe Poisson Distribution uses a parameter called Lambda (shown as an orange dashed line). This parameter is the expected count per period. In this tool, Lambda can be set to the average count of all selected periods or fixed to the count of the last selected period.')
    
    interact_manual(lr_poisson_prediction, 
                    x_feature = widgets.Dropdown(options=['release_cy'],
                                                 value = 'release_cy',
                                                 description = 'Time Period', 
                                                 disabled=False, style={'description_width': 'initial'},
                                                 layout=widgets.Layout(width='50%')),
                    lmbda_period_term = widgets.Dropdown(options=['Last Selected Period', 'Average of All Selected Periods'],
                                                 value = 'Last Selected Period',
                                                 description = 'Poisson Lambda Value', 
                                                 disabled=False, style={'description_width': 'initial'},
                                                 layout=widgets.Layout(width='50%'))
                   )

In [73]:
def lr_poisson_prediction(x_feature, lmbda_period_term):
    global input_data
    # Value Counts of records per selected x_feature/timeframe (i.e., release_cy, DATE_FY, etc.).
    input_data = df_data_filtered[x_feature].value_counts().rename_axis(x_feature).reset_index(name='counts').sort_values(by=[x_feature]) # input data source. Could use the already defined function. May want to clean the name out of per FY or define in the sameway a value_counts.
    input_data = input_data.loc[(input_data[x_feature] != 'NaT') & 
                                (input_data[x_feature].notna())].copy().reset_index(drop = True) # Removes NaT/NaN values. A comparison with np.nan is always True, so notna() is used.
    input_data = input_data.apply(pd.to_numeric, errors='raise') # Converts the dataframe columns to numeric (recall the 'NaT' was string and is removed in previous step).

    # Used to calculate the following year data point. 
    # Does the same filtering as the main function without using the time related filters.
    df_data_filtered_wo_dates = df_data.loc[(df_data['genres'].str.contains('|'.join(genres_winput), 
                                                                            flags=re.IGNORECASE, regex=True))
                                            ].copy().reset_index(drop = True)
    
    # Count of x_features (e.g., release_cy) in the source data.  Used to find if the predicted year has data and show it.
    df_data_counts = df_data_filtered_wo_dates[x_feature].value_counts().rename_axis(x_feature).reset_index(name='counts')
    
    # X and Y Data values
    x_values = input_data[x_feature]
    y_values = input_data['counts']
    
    # GENERIC INPUT PARAMETERS AND INDEX 
    end_period_term_index = input_data.shape[0] # Index for last year for mean and loop calculations.
    timeperiod_predicted = max(x_values)+1 # Creates the variable for the time period being predicted (i.e., plus one the last period in selected range).
    

    # POISSON PARAMETERS
    sample_size = 10000
    if lmbda_period_term == 'Last Selected Period':
        lmbda = int(mean(y_values[end_period_term_index-1:end_period_term_index])) # Count for the last selected period.
    elif lmbda_period_term == 'Average of All Selected Periods':
        lmbda = int(mean(y_values)) # Mean count across all selected periods.
    
    x_random = poisson.rvs(mu=lmbda, size=sample_size) # Poisson random samples. Used in Histogram.
    x_random_sorted = np.sort(x_random) # Sorting the Poisson random sample values (i.e., occurrence). Used in CDF and df_CDF
    cdf_y_values = np.arange(sample_size)/float(sample_size) # Empirical CDF probabilities (i.e., values of y) for the sorted samples. Used in CDF and df_CDF.
    df_CDF = pd.DataFrame({f'{timeperiod_predicted}':x_random_sorted, 'probability':cdf_y_values})

    # LINEAR TREND PARAMETERS AND EQUATION
    linreg = LinearRegression()
    linreg.fit(input_data[[x_feature]], y_values)
    y_intercept_min_year = linreg.coef_[0]*x_values.min()+linreg.intercept_ # Fitted trendline value at the first (minimum) time period.
    y_pred = linreg.predict(input_data[[x_feature]]) # y_predicted values.
    
    # STATISTIC FIT MEASURES AND METRICS
    f_reg = f_regression(X = input_data[[x_feature]], y = y_values, center = True)
    f_statistic = f_reg[0]
    p_value = f_reg[1]
    
    # LINEAR REGRESSION METRICS USING SCIPY
    slope_scipy, intercept_scipy, r_value_scipy, p_value_scipy, std_err_scipy = stats.linregress(x = x_values, y = y_values)
    confidence_interval_scipy = 2*std_err_scipy # Approximates the 95% confidence half-width of the slope as 2 standard errors (1.96 for a normal distribution).
    
    # Using Polyfit Library
    df_freq_fit2 = np.polynomial.polynomial.polyfit(x_values, y_values, 1)
    y_intercept_min_year2 = df_freq_fit2[1]*x_values.min()+df_freq_fit2[0]

    # COUNT TREND PLOT and PREDICTIVE PLOT
    display(HTML('<h3>Trend and Prediction Plot</h3>'))
    fig_width = min(int(len(x_values.unique()))+8, 14)
    fig_width_ratio = min(4, int(len(x_values.unique())))
    fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, 
                                   gridspec_kw={'width_ratios': [fig_width_ratio, 1]},
                                   sharey = True,
                                   figsize=(fig_width, 8))

    # FIRST PLOT: Plot of count time period data
    ax1.bar(x = x_values, height = y_values)
    ax1.set_title('Previous time periods count data\n', size = 16)
    ax1.set_xlabel(x_feature, fontsize=16)
    ax1.set_ylabel("No. of Records", fontsize=16)
    ax1.set_ylim(bottom = 0, top = max(max(y_values)+max(y_values)*0.10, max(df_CDF.iloc[:,0]))) # Set the lower y-limit to 0 and the upper to the larger of 10% above the max count or the max Poisson sample value.
    ax1.axhline(lmbda, color="orange", linestyle="dashed") # Horizontal Line at the value of lambda Poisson parameter
    
    # Drawing the linear trendline
    ax1.plot(x_values, linreg.coef_[0]*x_values+linreg.intercept_, color='red', linewidth = 2)
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))

    # Show the values of each bar
    for i in range(len(input_data)):
        # Prints values on plot barchart
        ax1.text(x_values[i], 
                 y_values[i]+y_values.max()*0.01, 
                 int_number_format(y_values[i], 0), 
                 size = 14)

    # Secondary Plot: Predictions and Current Time Period
    # Value of x needs to be 1 as the boxplot uses an index of 1. 
    ax2.set_title('Poisson Box Plot \nLinear Trend Forecast (red bar)\nCurrent Period Data (yellow bar)', size = 16)

    # Post time period count data.
    if timeperiod_predicted in df_data_counts[x_feature].tolist():
        post_timeperiod_counts = df_data_counts['counts'].tolist()[df_data_counts[x_feature].tolist().index(timeperiod_predicted)] # If the timeperiod_predicted is found in the source data uses that year as the data to show.
    else:
        post_timeperiod_counts = 0
    
    ax2.bar(x = 1, height = post_timeperiod_counts, color = 'yellow', width = 0.6)
    ax2.text(x = 1.3, y = post_timeperiod_counts*0.99, s = f'{post_timeperiod_counts}', size = 14)

    # Drawing BoxPlot Chart on Secondary Plot
    ax2.boxplot(x = df_CDF[str(timeperiod_predicted)], patch_artist = True, labels = [df_CDF.columns[0]], widths = 0.2)

    # Predicted value using the Linear Trend method 
    pred_value_LR = int(linreg.coef_[0]*((input_data.at[input_data.shape[0]-1, x_feature]+1)-(input_data.at[0, x_feature]))+y_intercept_min_year)
    ax2.bar(x = 1, # Need to use this as the boxplot uses an index of 1 for the boxplot. 
            height = pred_value_LR, 
            color = 'red',
            width = 0.25) 
    ax2.text(x = 0.95, y = pred_value_LR/2, s = f'{pred_value_LR}', size = 14)

    plt.tight_layout()
    plt.show();

    # PREDICTIVE STATISTICS SUMMARY TABLE
    display(HTML('<h3>Summary</h3>'))
    df_predictive_statistics = pd.DataFrame({'Description': ['Time Period Forecast',
                                                             'Linear trendline equation',
                                                             'Mean squared error',
                                                             'Coefficient of Determination (R2)',
                                                             'P-Value',
                                                             'F-Statistic',
                                                             'Standard Error',
                                                             'Confidence Interval at 2 Std. Deviations',
                                                             'Mean',
                                                             'Variance',
                                                             'Standard Deviation',
                                                             'Linear trendline equation (Original)',
                                                             'Forecast Value using the linear trendline',
                                                             'Poisson: Figure Box Plot',
                                                             'Poisson: Lambda Parameter',
                                                             'Poisson: Probability at Lambda',
                                                             'Poisson: 5th percentile',
                                                             'Poisson: 25th percentile', 
                                                             'Poisson: 50th percentile',
                                                             'Poisson: 75th percentile',
                                                             'Poisson: 95th percentile',
                                                             'Poisson: Kruskal-Wallis H statistic, corrected for ties',
                                                             'Poisson: Kruskal-Wallis P-value'],
                                             'Value': [str(timeperiod_predicted),
                                                       str('y={:.2f}*x+{:.2f}'.format(linreg.coef_[0].astype(float), y_intercept_min_year)),
                                                       round(mean_squared_error(y_values, y_pred), 2),
                                                       round(r2_score(y_values, y_pred), 2),
                                                       round(p_value[0], 6),
                                                       round(f_statistic[0], 4),
                                                       round(std_err_scipy, 4),
                                                       round(confidence_interval_scipy, 4),
                                                       round(y_values.values.mean()),
                                                       round(y_values.values.var()),
                                                       round(y_values.values.std()),
                                                       str('y={:.2f}*x+{:.2f}'.format(df_freq_fit2[1], y_intercept_min_year2)),
                                                       pred_value_LR,
                                                       str('N/A'),
                                                       str(lmbda),
                                                       str(round(poisson.pmf(k=lmbda, mu=lmbda),4)),
                                                       str(int(np.interp(0.05, df_CDF['probability'], df_CDF[str(timeperiod_predicted)]))), 
                                                       str(int(np.interp(0.25, df_CDF['probability'], df_CDF[str(timeperiod_predicted)]))), 
                                                       str(int(np.interp(0.50, df_CDF['probability'], df_CDF[str(timeperiod_predicted)]))), 
                                                       str(int(np.interp(0.75, df_CDF['probability'], df_CDF[str(timeperiod_predicted)]))), 
                                                       str(int(np.interp(0.95, df_CDF['probability'], df_CDF[str(timeperiod_predicted)]))),
                                                       stats.kruskal(x_values.values.tolist(), y_values.values.tolist())[0], #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html
                                                       stats.kruskal(x_values.values.tolist(), y_values.values.tolist())[1]],
                                             'Notes': ['Time period used for forecasting (e.g., CY, calendar quarter, FY, fiscal quarter, etc.).',
                                                       '"y = mx+b" where: "m" is trend slope, "x" is Max-Min time periods and "b" is intercept at Min time period.',
                                                       'MEAN SQUARE ERROR ADD EXPLANATION NOTE',
                                                       'R2 SCORE EXPLANATION NOTE: CONFIRMED WITH STATISTICIAN VALUE.',
                                                       'P-VALUE EXPLANATION NOTE: CONFIRMED WITH STATISTICIAN VALUE.',
                                                       'F-STATISTIC EXPLANATION NOTE: CONFIRMED WITH STATISTICIAN VALUE.',
                                                       'STANDARD ERROR: ADD EXPLANATION NOTE.',
                                                       'CONFIDENCE INTERVAL: ADD EXPLANATION NOTE',
                                                       'Mean of the Values',
                                                       'Variance of the Values',
                                                       'Standard Deviation of the Values',
                                                       'Linear trendline equation using Polyfit Library',
                                                       'Forecast value at the specified time period above.',
                                                       'For the definition of boxplot quartiles see: https://en.wikipedia.org/wiki/Box_plot.',
                                                       'Average rate at which records are being observed/specified/calculated (i.e., dashed orange line in plot).', 
                                                       'Probability that there are exactly lambda records in the predicted time period.', 
                                                       'There is a 0.05 probability that there will be fewer than the specified number of records.', 
                                                       'There is a 0.25 probability that there will be fewer than the specified number of records.', 
                                                       'There is a 0.50 probability that there will be fewer than the specified number of records.', 
                                                       'There is a 0.75 probability that there will be fewer than the specified number of records.', 
                                                       'There is a 0.95 probability that there will be fewer than the specified number of records.',
                                                       "NEED VALUE CONFIRMATION: Poisson's distribution Kruskal-Wallis H statistic, corrected for ties",
                                                       "NEED VALUE CONFIRMATION: Poisson's distribution Kruskal-Wallis P-value for the test using the assumption that H has a chi square distribution. The p-value returned is the survival function of the chi square distribution evaluated at H. The chi-square approximation may not be accurate with low number of samples."]
                                            })

    display(HTML(df_predictive_statistics.to_html(index=False)))

    # POISSON PDF and CDF Plots
    display(HTML('<h3>Poisson PDF and CDF</h3>'))
    fig, ax1 = plt.subplots() # Figure with two axes
    # Plot for the CDF
    ax1.set_title(f'Poisson PDF and CDF with sample size of {sample_size}', size = 16)
    ax1.set_xlabel('# of Records', size = 14)
    ax1.set_ylabel('Cumulative Probability', size = 14, color = 'blue')
    ax1.plot(x_random_sorted, cdf_y_values, color = 'blue')
    ax1.tick_params(axis='y', size = 14)

    # Plot of Poisson distribution using the random sample size of "sample_size".
    ax2 = ax1.twinx()  # Instantiate a second axes that shares the same x-axis
    ax2.set_ylabel('# of Samples', color = 'red', size = 14)  # we already handled the x-label with ax1
    ax2.hist(x_random, histtype = u'step', color = 'red')
    ax2.tick_params(axis='y', size = 14)

    fig.tight_layout()
    plt.show();
    
    # SCATTER PLOT WITH CONFIDENCE INTERVALS AT 95% https://stackoverflow.com/questions/27164114/show-confidence-limits-and-prediction-limits-in-scatter-plot
    p, cov = np.polyfit(x_values, y_values, 1, cov=True) # Parameters and covariance matrix from the fit of a first-degree polynomial.

    # Statistics
    n = y_values.size # number of observations
    m = p.size # number of parameters
    dof = n - m # degrees of freedom
    confidence_interval = 0.95 # Confidence level
    t = stats.t.ppf(1 - (1 - confidence_interval)/2, dof) # Two-sided t quantile (0.975 for a 0.95 level); used for CI and PI bands
    # Estimates of Error in Data/Model
    resid = y_values - y_pred                          
    chi2 = np.sum((resid / y_pred)**2) # chi-squared; estimates error in data
    chi2_red = chi2 / dof # reduced chi-squared; measures goodness of fit
    s_err = np.sqrt(np.sum(resid**2) / dof) # standard deviation of the error

    # Plotting --------------------------------------------------------------------
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.set_title(f'Record Counts vs. {x_feature} with {confidence_interval} confidence interval', size = 16)
    ax.set_ylabel('Record Counts', size = 14)
    ax.set_xlabel(f'{x_feature}', size = 14)
    
    # Data
    ax.plot(x_values, y_values, "o", color="#b9cfe7", markersize=8, 
        markeredgewidth=1, markeredgecolor="b", markerfacecolor="None")

    # Fit
    ax.plot(x_values, y_pred, "-", color="0.1", linewidth=1.5, alpha=0.5, label="Fit")  
    
    x2 = np.linspace(np.min(x_values), np.max(x_values), 100)
    y2 = np.polyval(p, x2)

    # Confidence Interval (select one)
    if ax is None:
        ax = plt.gca()
    
    ci_test = t * s_err * np.sqrt(1/n + (x2 - np.mean(x_values))**2 / np.sum((x_values - np.mean(x_values))**2))
    ax.fill_between(x2, y2 + ci_test, y2 - ci_test, color="#b9cfe7")
   
    # Prediction Interval
    pi = t * s_err * np.sqrt(1 + 1/n + (x2 - np.mean(x_values))**2 / np.sum((x_values - np.mean(x_values))**2))   
    ax.fill_between(x2, y2 + pi, y2 - pi, color="None", linestyle="--")
    ax.plot(x2, y2 - pi, "--", color="0.5", label="95% Prediction Limits")
    ax.plot(x2, y2 + pi, "--", color="0.5")

    plt.show();
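
The summary table estimates the Poisson percentiles by interpolating the empirical CDF of a random sample (`np.interp` over `df_CDF`). As a cross-check, the same percentiles can be computed exactly from the Poisson CDF. The following is a minimal sketch using only the standard library. The helper names `poisson_cdf` and `poisson_quantile` and the rate of 10 are assumptions for illustration and are not part of the notebook.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term from the PMF."""
    term = math.exp(-mu)       # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i         # P(X = i) derived from P(X = i - 1)
        total += term
    return total

def poisson_quantile(q, mu):
    """Smallest integer k such that P(X <= k) >= q."""
    k = 0
    while poisson_cdf(k, mu) < q:
        k += 1
    return k

lmbda_example = 10  # hypothetical rate, standing in for the notebook's lmbda
for q in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f'{q:.2f} quantile: {poisson_quantile(q, lmbda_example)}')
```

This quantile definition (smallest count whose cumulative probability reaches the target) is the usual one for discrete distributions, so the results should agree with the sampled percentiles up to sampling noise.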

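The confidence and prediction bands in the scatter plot follow the standard formulas for simple linear regression. A two-sided band at level 0.95 uses the 0.975 quantile of the t distribution with n - 2 degrees of freedom. The sketch below isolates that calculation. The function name `linear_fit_bands` and the yearly counts are made up for illustration.

```python
import numpy as np
from scipy import stats

def linear_fit_bands(x, y, level=0.95):
    """Return grid, fitted line, and CI/PI half-widths for a straight-line fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    dof = n - 2                                   # two fitted parameters
    s_err = np.sqrt(np.sum(resid**2) / dof)       # residual standard error
    t = stats.t.ppf(1 - (1 - level) / 2, dof)     # two-sided t quantile
    x_grid = np.linspace(x.min(), x.max(), 100)
    leverage = 1 / n + (x_grid - x.mean())**2 / np.sum((x - x.mean())**2)
    ci = t * s_err * np.sqrt(leverage)            # band for the mean response
    pi = t * s_err * np.sqrt(1 + leverage)        # band for a new observation
    return x_grid, slope * x_grid + intercept, ci, pi

years = [2010, 2011, 2012, 2013, 2014, 2015]      # hypothetical time periods
counts = [12, 15, 11, 18, 17, 21]                 # hypothetical record counts
x_grid, y_fit, ci, pi = linear_fit_bands(years, counts)
print(f'CI half-width at centre: {ci[50]:.2f}, PI half-width at centre: {pi[50]:.2f}')
```

The prediction band is always wider than the confidence band because it adds the variance of a single new observation to the uncertainty of the fitted mean.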
# Reports Widget Functions
[Return to Table of Contents](#Table-of-Contents)<br>

The Reports widget functions display df_data_filtered as a styled table. The user can specify which columns to show and which range of rows to include.

In [74]:
def filtered_reports_widget():
    interact_manual(filtered_reports_options_widget)

In [75]:
def filtered_reports_options_widget():
    interact_manual(filtered_reports,
                    columns_to_show_W = widgets.SelectMultiple(options=df_data_filtered.columns,
                                                               value = ['original_title', 'overview', 'budget', 
                                                                        'revenue', 'runtime'],
                                                               rows = 8,
                                                               description = 'Data Features to Show', 
                                                               disabled=False, style={'description_width': 'initial'}),
                    row_range_W = widgets.IntRangeSlider(value=[0, 20],
                                                         min=0, 
                                                         max=df_data_filtered.shape[0],
                                                         step=20,
                                                         description='Row Selection:', disabled=False,
                                                         continuous_update=False,
                                                         orientation='horizontal',
                                                         readout=True,
                                                         readout_format='d',
                                                         style = {'description_width': 'initial'},
                                                         layout = widgets.Layout(width='40%')),
                   )

In [76]:
def filtered_reports(columns_to_show_W, row_range_W):
    # Styling. This changes the dataframe to a Styler Object.
    columns_to_hide = list(set(columns_to_show_W).symmetric_difference(set(df_data_filtered.columns)))
    
    styled_df_filtered = df_data_filtered.loc[row_range_W[0]:row_range_W[1]].copy() # Keeps the original dataframe.
    # Otherwise the function below changes the format of the df_data_filtered after this function runs.
    
    # NOTE: Styler.hide_columns() and Styler.hide_index() are deprecated in favour of
    # Styler.hide(axis='columns') and Styler.hide(axis='index'), available from pandas 1.4.
    styled_df_filtered = styled_df_filtered.style.set_properties(subset=['original_title_overview', 'norm_text_stem', 'norm_text_lemma'], 
                                                                 **{'width': '400px'},
                                                                ).hide(columns_to_hide, axis='columns').hide(axis='index')

    return(styled_df_filtered)
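
The report is built from two selections: a label-based row slice and a list of columns to hide. A minimal sketch of the same selection on a plain dataframe is shown below. The dataframe contents and the names `df_example`, `columns_to_show`, and `row_range` are made up for illustration. Note that `.loc` slicing includes both endpoints, so a row range of (1, 2) returns two rows.

```python
import pandas as pd

# Hypothetical stand-in for df_data_filtered.
df_example = pd.DataFrame({'original_title': ['Example A', 'Example B', 'Example C', 'Example D'],
                           'budget': [100, 200, 300, 400],
                           'revenue': [150, 180, 900, 350],
                           'runtime': [90, 105, 120, 98]})

columns_to_show = ['original_title', 'revenue']   # what the user would pick in the widget
row_range = (1, 2)                                # slider value (start label, end label)

# Every column not selected is a column to hide.
columns_to_hide = [col for col in df_example.columns if col not in columns_to_show]

# Label-based slice: both row_range[0] and row_range[1] are included.
report = df_example.loc[row_range[0]:row_range[1], columns_to_show]
print(columns_to_hide)
print(report)
```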

## Export Filtered Dataframe Function
[Return to Table of Contents](#Table-of-Contents)<br>

In [77]:
def export_df_widget():
    interact_manual(export_df_data_filtered, 
                    file_type_W = widgets.Dropdown(options=['CSV', 'Excel', 'JSON'],
                                                            value = 'CSV',
                                                            description = 'Export Filtered Data (File Type)', 
                                                            disabled=False, style={'description_width': 'initial'},
                                                            layout=widgets.Layout(width='50%'))
                   )

In [78]:
def export_df_data_filtered(file_type_W):
    if file_type_W == 'CSV':
        df_data_filtered.to_csv(r'.\output_data\df_data_filtered.csv', encoding='utf-8-sig', index = False, header = True)
    elif file_type_W == 'Excel':
        # The "encoding" argument of to_excel() was deprecated and then removed in pandas 2.0, so it is not passed here.
        df_data_filtered.to_excel(r'.\output_data\df_data_filtered.xlsx', index = False, header = True)
    elif file_type_W == 'JSON': # NEED TO TEST WITH THE FUNCTION.
        print('NOT CONVERTED YET NEED TO ADD FUNCTION AND TEST')
        #df_data_filtered.to_json(r'.\output_data\df_data_filtered.json', orient= 'records', date_format=None, force_ascii=True, lines=False, indent=None, storage_options=None)
        return # Nothing was written, so skip the success message below.
    print(f'File successfully exported as {file_type_W}.')
# Encoding "cp1252" or "utf-8-sig" used so that Excel does not create special characters. Standard Python is utf-8.
# See reference for explanation https://stackoverflow.com/questions/57061645/why-is-%C3%82-printed-in-front-of-%C2%B1-when-code-is-run
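
The JSON option above is still a placeholder. One possible way to complete it is sketched below, assuming a list-of-records layout (`orient='records'`) is acceptable. The function name `export_dataframe`, the output file names, and the sample data are assumptions for illustration. The sketch writes to a temporary folder instead of `.\output_data` and also shows that `utf-8-sig` places a byte order mark at the start of the CSV file, which is what lets Excel detect the encoding.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def export_dataframe(df, file_type, output_dir):
    """Write df as CSV or JSON into output_dir and return the file path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    if file_type == 'CSV':
        path = output_dir / 'df_data_filtered.csv'
        df.to_csv(path, encoding='utf-8-sig', index=False)
    elif file_type == 'JSON':
        path = output_dir / 'df_data_filtered.json'
        df.to_json(path, orient='records')   # one JSON object per row
    else:
        raise ValueError(f'Unsupported file type: {file_type}')
    return path

# Hypothetical data with a non-ASCII character.
df_example = pd.DataFrame({'original_title': ['Café Example', 'Example B'],
                           'budget': [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = export_dataframe(df_example, 'CSV', tmp)
    json_path = export_dataframe(df_example, 'JSON', tmp)
    csv_starts_with_bom = csv_path.read_bytes().startswith(b'\xef\xbb\xbf')
    json_records = json.loads(json_path.read_text(encoding='utf-8'))

print(csv_starts_with_bom)
print(json_records)
```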

## Reclustering: Filters the Selected Cluster
[Return to Table of Contents](#Table-of-Contents)<br>

The reclustering widget updates df_data_filtered so that it contains only the records assigned to the selected cluster. The user can then return to any of the tabs and use their functions with the filtered cluster data.

In [79]:
def recluster_widget():
    interact_manual(recluster_options)

In [80]:
def recluster_options():
    global CLUSTER_LABEL_list
    # Clustering is treated as performed only if 'cluster_label_rev' exists and holds more than one label.
    if 'cluster_label_rev' not in list(df_data_filtered.columns): # If clustering has not been performed print message.
        print('Please run the "Text Analysis and Modeling" module with either Kmeans or DBSCAN.')
    elif len(df_data_filtered['cluster_label_rev'].unique()) == 1: # A single label means there is nothing left to filter.
        print('Please run the "Text Analysis and Modeling" module with either Kmeans or DBSCAN.')
    else:
        CLUSTER_LABEL_list = sorted(df_data_filtered['cluster_label_rev'].unique().tolist()) # unique() already removes duplicates.
        recluster_select_cluster_widget(cluster_options = CLUSTER_LABEL_list)

In [81]:
def recluster_select_cluster_widget(cluster_options):
    interact_manual(recluster_select_cluster,
                    selected_cluster = widgets.Dropdown(options = ['None']+list(cluster_options), # Uses the labels passed in as an argument.
                                                    value = 'None',
                                                    description = 'Cluster To Filter', 
                                                    disabled=False, style={'description_width': 'initial'},
                                                    layout=widgets.Layout(width='40%'))
                   )

In [82]:
def recluster_select_cluster(selected_cluster):
    global df_data_filtered, reclustering_counter
    if selected_cluster != 'None':
        # Match the selected label as a whole value; list(str(label)) would split it into single characters.
        df_data_filtered = df_data_filtered.loc[(df_data_filtered['cluster_label_rev'].isin([selected_cluster]))
                                               ].reset_index(drop = True)
        reclustering_counter = reclustering_counter+1
        df_unique_values_dict()
        print(f'Filtering of cluster {selected_cluster} complete.')
    else:
        print('Please select a cluster number.')
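
When filtering on a selected cluster, the label has to be matched as a whole value. Converting a label with `list(str(...))` splits it into single characters, so a multi-character label such as `10` or `-1` would match other clusters instead of itself. A minimal pure-Python illustration follows, using made-up string labels.

```python
# Hypothetical cluster labels stored as strings.
labels = ['0', '1', '10', '-1', '1', '10']
selected = '10'

# Splitting the label into characters produces ['1', '0'].
as_characters = list(str(selected))
wrong = [label for label in labels if label in as_characters]

# Wrapping the label in a one-element list keeps it whole.
right = [label for label in labels if label in [selected]]

print(as_characters)
print(wrong)
print(right)
```

With these labels, the character-based match returns clusters 0 and 1 and misses cluster 10 entirely, while the whole-value match returns only cluster 10.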

#### Main Filter Widget Test Function

# Main Data Filtering Widget
[Return to Table of Contents](#Table-of-Contents)<br>

References: <br>
- List of widgets: https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html

In [83]:
# PRIMARY WIDGET FILTER OPTIONS
display(HTML('<h3>MAIN FILTER WIDGET</h3>'))
print("NOTE: Caution when selecting filters. Some filters may overlap (e.g., FY and CY) while other filters may be mutually exclusive.")

_ = interact_manual(filter_data, # "filter_data" function filters the data and calls sub-widgets and related functions.
                    release_cy_filter = widgets.IntRangeSlider(value=[min(release_cy_list), max(release_cy_list)],
                                                            min=min(release_cy_list),
                                                            max=max(release_cy_list),
                                                            step=1,
                                                            description='Release Calendar Year', disabled=False,
                                                            continuous_update=False,
                                                            orientation='horizontal',
                                                            readout=True,
                                                            readout_format='d',
                                                            style = {'description_width': 'initial'},
                                                            layout = widgets.Layout(width='50%')),
                    mult_genres_filter = widgets.RadioButtons(options=['OR', 'AND'],
                                                              value='OR', # Default
                                                              description='Multiple Genres Selection',
                                                              disabled=False, style={'description_width': 'initial'}),
                    genres_filter = widgets.SelectMultiple(options= genres_list,
                                                             value = genres_list,
                                                             rows = 6,
                                                             description = 'Genres', 
                                                             disabled=False, style={'description_width': 'initial'}),
                    SIM_SEARCH_filter = widgets.Text(value='',
                                                     placeholder='Type something to search',
                                                     description='Similarity Search',
                                                     disabled=False, 
                                                     style={'description_width': 'initial'}, 
                                                     layout = widgets.Layout(width='50%')),
                    SIM_SEARCH_THRESHOLD_filter = widgets.FloatSlider(min=0.00, 
                                                                      max=1.0, 
                                                                      value=0.00, 
                                                                      step=0.01,
                                                                      description="Similarity Search Threshold",
                                                                      disabled=False, 
                                                                      style={'description_width': 'initial'},
                                                                      layout=widgets.Layout(width='50%'),
                                                                      readout_format='.2f'),  
                    SIM_SEARCH_WORD_PROCESS_filter = widgets.Dropdown(options= ['Lemmatization', 'Stemming'], 
                                                                      description = 'Similarity Search Word Preprocessing',
                                                                      value = 'Stemming',
                                                                      disabled=False, style={'description_width': 'initial'},
                                                                      layout=widgets.Layout(width='40%')),
                    SIM_SEARCH_VECT_filter = widgets.Dropdown(options= ['TFIDF', 'Count'], 
                                                              description = 'Similarity Search Word Vectorizer',
                                                              value = 'TFIDF',
                                                              disabled=False, style={'description_width': 'initial'}),
                    STOPWORDS_filter = widgets.Text(value='',
                                                    placeholder='Add stopwords followed by comma.',
                                                    description='Add Stopwords',
                                                    disabled=False, 
                                                    style={'description_width': 'initial'},
                                                    layout = widgets.Layout(width='50%')),
                    PRINT_filter=widgets.Dropdown(options= ['NO', 'YES'], 
                                                  description = 'Print Report',
                                                  value = 'NO',
                                                  disabled=False, style={'description_width': 'initial'})
                   )
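
The Multiple Genres Selection option offers two ways of combining the selected genres: OR keeps a record that has at least one selected genre, and AND keeps a record only if it has all of them. The `filter_data` function is defined elsewhere in the notebook and is not reproduced here. The following is a hypothetical sketch of that logic using sets. The helper name `matches_genres` and the records are assumptions for illustration.

```python
def matches_genres(record_genres, selected_genres, mode='OR'):
    """Return True if a record passes the genre filter under the given mode."""
    record = set(record_genres)
    selected = set(selected_genres)
    if mode == 'AND':
        return selected.issubset(record)   # every selected genre must be present
    return bool(record & selected)         # at least one selected genre is present

# Hypothetical records mapping a title to its genres.
records = {'Example A': ['Action', 'Comedy'],
           'Example B': ['Drama'],
           'Example C': ['Action', 'Drama']}
selected = ['Action', 'Drama']

kept_or = [title for title, genres in records.items() if matches_genres(genres, selected, 'OR')]
kept_and = [title for title, genres in records.items() if matches_genres(genres, selected, 'AND')]
print(kept_or)
print(kept_and)
```

Under this reading, leaving every genre selected while switching to AND would keep almost no records, which is one reason OR is a sensible default.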


# NOTEBOOK END

# Add Filter to Columns
budget <br>
original_language<br>
popularity<br>
production_companies (categorical)<br>
production_countries (categorical)<br>
vote_average<br>
vote_count<br>

# FIX
C:\Users\felix\AppData\Local\Temp\ipykernel_6432\2648881204.py:6: FutureWarning: this method is deprecated in favour of `Styler.hide(axis='columns')`
  styled_df_filtered = styled_df_filtered.style.set_properties(subset=['original_title_overview', 'norm_text_stem', 'norm_text_lemma'],
C:\Users\felix\AppData\Local\Temp\ipykernel_6432\2648881204.py:6: FutureWarning: this method is deprecated in favour of `Styler.hide(axis='index')`
  styled_df_filtered = styled_df_filtered.style.set_properties(subset=['original_title_overview', 'norm_text_stem', 'norm_text_lemma'],
