<div class="alert alert-block alert-info">
Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

# Table of Contents
[Introduction to Data Science: Story Telling with Data](#Introduction-to-Data-Science:-Story-Telling-with-Data)

[How does it all Fits?](#How-does-it-all-Fits?)

[Data Cleaning](#Data-Cleaning)

[Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))

[Artificial Intelligence and Machine Learning (AI/ML)](#Artificial-Intelligence-and-Machine-Learning-(AI/ML))

[Ethics](#Ethics)

[Data Science Tools](#Data-Science-Tools)

[Data Science Example](#Data-Science-Example)

# Introduction to Data Science: Story Telling with Data
[Return to Table of Contents](#Table-of-Contents)

In the last few years, interest in data science, artificial intelligence, and machine learning has increased rapidly. The field is difficult to define because it is very broad and encompasses the application of various other fields. The following definitions are an attempt to simplify the relationships between the fields that are used in data science. Various definitions exist, and my advice is not to focus on where one field begins and another ends but rather on the goal.

- Data Analytics: Obtaining and transforming data to obtain insight and produce visualizations that aid decision-making
- Artificial Intelligence: The ability of a machine to emulate reasoning, knowledge representation, learning, language processing, and decision-making.
- Machine Learning: (part of AI) A machine is taught to learn from and understand data in order to make decisions without being explicitly programmed to do so.
- Big Data: Volumes of data that are too large or complex to be dealt with by traditional analysis and processing methods.
- Data Science: Unification of all the above (and more) to analyze and obtain insights from data.
- AI Algorithm: Computer code/instructions/steps that simulate logic.

For any problem there are various steps: defining the problem, identifying data that can be used (either already collected or to be collected), processing the data, and developing visualizations, models, and tools that can be used to support decision-making or solve the problem. In all cases you want to build and document your story with data.

# How does it all Fits?
[Return to Table of Contents](#Table-of-Contents)

![Data%20Science-How%20Does%20it%20All%20Fits.png](attachment:Data%20Science-How%20Does%20it%20All%20Fits.png)

![Data%20Science%20Problem%20Overview.png](attachment:Data%20Science%20Problem%20Overview.png)

# Data Cleaning
[Return to Table of Contents](#Table-of-Contents)

The terms data cleaning, data preparation, data wrangling, data pipeline, data transformation, and data profiling are often used interchangeably. Although they overlap considerably, in many cases they also cover different tasks of the data science process shown in the figure above, "Overview of Typical Data Science Problem". However, they are currently not defined in the few available data science consensus standards or documents. From this point on, no distinction is made between these steps, and the task is referred to as data cleaning. Data cleaning is often the most critical step in a data science project and is typically the one where the data scientist spends the most time. Data cleaning prepares the data as input to the exploratory data analysis and to the artificial intelligence algorithm and modeling steps. Not addressing issues in the data may negatively affect the insights obtained from the data analysis and the outputs and performance of the algorithms and models used.

Data cleaning includes the following tasks:
1. Remove duplicates (e.g., full duplicates, duplicated unique feature)
2. Remove irrelevant data (e.g., outliers) when appropriate
3. Remove unneeded features.
4. Missing/null values:
    - Remove the missing/null values?
    - Remove the whole record?
    - Impute the missing data (replace the missing value with another):
        - For time series data, can the average of adjacent points be used? If multiple consecutive points are missing, is it appropriate to use forecasting to estimate those values?
        - Replace the missing/null value with the mean of the group's values?
        - Replace the missing/null value with a conservative, non-conservative, realistic, or best-estimate value, with stated assumptions, when possible. For example, the minimum or maximum of the group values may be a conservative assumption.
5. Identify and fix errors where possible. Can the team correct the errors, or should the record with the error be dropped? Examples of errors include misclassifications, negative values where a positive value is expected, wrong dates, misspellings, and order-of-magnitude errors (e.g., a person several hundred years old).
6. Verify data types and convert as needed
7. Data balance: Check whether all classes or groups are adequately represented in the data.
8. Data bias: Evaluating biases in the data should be part of most data science tasks from data cleaning to model implementation to model deployment.
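
As a rough illustration of tasks 1, 4, 5, and 6 above, the following sketch uses pandas on a small made-up table. The column names, the 120-year plausibility cutoff, and the choice of group-mean imputation are assumptions made for this example only; the right choices depend on the dataset and on input from subject matter experts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [34.0, 51.0, 51.0, np.nan, 420.0, 29.0],
    "joined": ["2020-01-15", "2019-07-01", "2019-07-01",
               "2021-03-09", "2018-11-30", "not recorded"],
})

# Task 1: remove full duplicates (the second record with id 2).
clean = raw.drop_duplicates().copy()

# Task 5: treat implausible values as errors. The cutoff of 120 years
# is an assumption for this sketch; flagged values become missing.
clean.loc[clean["age"] > 120, "age"] = np.nan

# Task 4: impute missing ages with the mean of the record's group.
group_mean = clean.groupby("group")["age"].transform("mean")
clean["age"] = clean["age"].fillna(group_mean)

# Task 6: verify and convert data types. Unparseable dates become NaT
# so they can be reviewed instead of raising an error.
clean["joined"] = pd.to_datetime(clean["joined"], errors="coerce")

print(clean)
print(clean.dtypes)
```

Here the implausible age is first converted to a missing value and then imputed together with the value that was missing from the start, so both end up with the group mean. Dropping those records instead would be an equally valid choice in some projects.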

Data issues can also be evaluated using sensitivity testing. This refers to evaluating the effect of a data issue on a calculated statistic, visualization, or model by comparing results before and after a change or correction.
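
A minimal sketch of sensitivity testing, assuming a made-up series of ages in which the value 420 is suspected to be a typo for 42: the same statistics are computed before and after the correction and then compared.

```python
import pandas as pd

# Hypothetical data: 420 is assumed to be an order-of-magnitude typo for 42.
ages = pd.Series([34, 51, 29, 420, 46])
corrected = ages.replace(420, 42)

# Compute the same statistics before and after the correction.
summary = pd.DataFrame({
    "before": ages.agg(["mean", "median"]),
    "after": corrected.agg(["mean", "median"]),
})
summary["change"] = summary["after"] - summary["before"]

print(summary)
```

In this example the mean drops from 116.0 to 40.4 while the median only moves from 46 to 42, which suggests that a median-based summary is much less sensitive to this particular data issue.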

A dataset can also be enhanced or augmented by creating new derived features, processing unstructured data, performing natural language processing, or combining it with another dataset. Depending on the approach a project is using, an initial step of data processing may be done during the data cleaning stage. In other cases it may be done during the exploratory data analysis (EDA) or model deployment. Examples include:
1. Convert multi-label or multi-class features to one-hot-encoding if needed
2. Combine columns if needed
3. Derived or calculated features
4. Text normalization (e.g., clear formatting, standardize capitalization, remove special characters, text lemmatization/stemming, etc.)
5. Language translation if needed
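
Items 1, 3, and 4 of the list above could look roughly like the following in pandas. The table, its column names, and the `revenue_to_budget` feature are hypothetical, and the text normalization shown is deliberately simple (lowercasing and stripping special characters); lemmatization or stemming would need an NLP library and is not shown.

```python
import pandas as pd

# Hypothetical movie-like table; titles and amounts are made up.
movies = pd.DataFrame({
    "title": ["  An EXAMPLE Film!! ", "Another-Film (2001)"],
    "genre": ["Action", "Animation"],
    "budget": [50_000_000, 20_000_000],
    "revenue": [200_000_000, 30_000_000],
})

# Item 1: one-hot encode a categorical (multi-class) feature.
encoded = pd.get_dummies(movies, columns=["genre"], prefix="genre")

# Item 3: derived feature calculated from existing columns.
encoded["revenue_to_budget"] = encoded["revenue"] / encoded["budget"]

# Item 4: basic text normalization: lowercase, drop special characters,
# collapse repeated whitespace, and trim the ends.
encoded["title_norm"] = (
    encoded["title"]
    .str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(encoded)
```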

The extent of the data cleaning, data wrangling, and data transformation will also take into consideration input from other team members and stakeholders, including but not limited to subject matter experts (SMEs), data scientists, statisticians, project managers, advisors, management, customers, and users of the data. Further, issues with the data can be found during other steps, which will require returning to the data cleaning step to address them.

# Exploratory Data Analysis (EDA)
[Return to Table of Contents](#Table-of-Contents)

Exploratory data analysis (EDA) refers to the process of exploring, studying, investigating, visualizing, plotting, and charting the data in order to obtain insights about statistics, correlations, relationships between features, patterns, trends, and anomalies, and to test hypotheses. EDA may also include exploring feature selection to determine which features are most useful or relevant for your model.
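
A first pass at EDA often starts with summary statistics and a correlation matrix. The sketch below uses synthetic data (the `budget`, `revenue`, and `runtime` columns and the relationship between them are invented for illustration) and ranks features by their absolute correlation with a chosen target, which is one simple, assumed way to start exploring feature selection.

```python
import numpy as np
import pandas as pd

# Synthetic data: revenue is built to depend on budget, runtime is unrelated.
rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=200)
df = pd.DataFrame({
    "budget": budget,
    "revenue": 3 * budget + rng.normal(0, 20, size=200),
    "runtime": rng.normal(110, 15, size=200),
})

# Summary statistics (count, mean, std, quartiles) for every numeric column.
stats = df.describe()

# Pairwise Pearson correlations between features.
corr = df.corr()

# Rank candidate features by absolute correlation with the target.
ranking = corr["revenue"].drop("revenue").abs().sort_values(ascending=False)

print(stats)
print(corr)
print(ranking)
```

Correlation only captures linear relationships, so plots such as scatter plots and histograms are still needed to spot nonlinear patterns and anomalies.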

# Artificial Intelligence and Machine Learning (AI/ML)
[Return to Table of Contents](#Table-of-Contents)

Applying artificial intelligence, machine learning (AI/ML), and deep learning algorithms refers to developing models and applying algorithms for tasks such as analyzing data, obtaining insights from the data, or automating a process. These tasks include supervised machine learning (e.g., classification, time series prediction and forecasting), unsupervised machine learning (e.g., clustering), information retrieval (e.g., using similarity searching), topic modeling (e.g., using latent Dirichlet allocation (LDA) to extract insights from text or speech data), network analysis (e.g., visualizing relationships between records), and other AI/ML applications. Other ML approaches include reinforcement learning and deep learning (e.g., neural networks). As part of the AI/ML steps, the team needs to explore the available data and determine whether the data can support the final goal.

### Supervised Machine Learning
In supervised ML, a dataset is divided into two parts, typically 80% as training data and 20% as testing data. The model is developed and fitted with the training data. The testing data is then used to verify the performance of the model. Regression tasks use metrics such as R², adjusted R², root mean square error (RMSE), and mean absolute error (MAE), among others. Classification tasks use metrics such as accuracy, precision and recall, F1-score, area under the ROC curve (AUROC), null accuracy, the confusion matrix, sensitivity, and specificity, among others. Data considerations and issues that may affect supervised machine learning:
1. Data balance (e.g., are all classes represented?). Null accuracy, the accuracy obtained by always predicting the most frequent class, may be a worthwhile metric.
2. Amount of data. Evaluate if there is enough data for dividing the dataset into training/testing (typically 80/20).
3. Data bias
4. Explore the need for using synthetic/augmented data.
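
To show why data balance and null accuracy matter, the sketch below compares a model's accuracy against the null accuracy on a small, made-up, imbalanced set of labels. The labels and predictions are invented for illustration; only the scikit-learn metric functions are real.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical imbalanced labels: 8 negatives and 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Null accuracy: accuracy of always predicting the most frequent class.
null_accuracy = np.bincount(y_true).max() / len(y_true)   # 0.8

accuracy = accuracy_score(y_true, y_pred)                 # 0.8
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel() # 7, 1, 1, 1
precision = precision_score(y_true, y_pred)               # 0.5
recall = recall_score(y_true, y_pred)                     # 0.5

print(f"null accuracy={null_accuracy:.2f}, accuracy={accuracy:.2f}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

An accuracy of 80% sounds reasonable on its own, but here it is no better than always predicting the majority class, and precision and recall show that the minority class is handled poorly.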

Developing and evaluating a model in supervised ML (e.g., classification) includes the following steps:
- Defining the values of the independent variable or variables (x)
- Defining the values of the dependent variable, i.e., the variable to be predicted (y)
- Scaling/normalizing if needed
- Splitting the data into training and testing sets (typically 80/20)
- Fitting the model to the training data
- Calculating predictions using the testing data
- Calculating metrics to evaluate the predictions
- Deploying and using the model
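
The steps above can be sketched with scikit-learn as follows. The iris dataset, logistic regression, and the `random_state` value are choices made only for this illustration. One deliberate difference from the order listed above is that the scaler is fitted after the split and on the training data only, so that no information from the testing data leaks into training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the independent variables (X) and the variable to predict (y).
X, y = load_iris(return_X_y=True)

# Train/test split (80/20), stratified so each class keeps its proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling: fit on the training data only, then apply to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model to the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Calculate predictions on the testing data and evaluate them.
y_pred = model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {test_accuracy:.3f}")
```

Deployment is not shown; in practice the fitted scaler and model would be saved together so that new data goes through the same transformation before prediction.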

### Unsupervised Machine Learning
In unsupervised ML, an algorithm finds patterns in the data without using labeled data for training the model. Unsupervised ML is used to analyze and cluster unlabeled datasets, discover similarities, find hidden patterns, find outliers, etc. Various data considerations and issues may affect unsupervised machine learning, including:
1. Knowing the shape of the clusters helps in deciding which clustering algorithm would be best (e.g., centroid-based, density-based, etc.).
2. Is there overlap between clusters?
3. Do I need to identify outliers?

Many clustering algorithms exist, including KMeans, DBSCAN, MeanShift, and OPTICS. The scikit-learn documentation has examples of these and other algorithms (https://scikit-learn.org/stable/modules/clustering.html).
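
The effect of cluster shape on the choice of algorithm can be seen on a synthetic "two moons" dataset. This sketch compares a centroid-based algorithm (KMeans) with a density-based one (DBSCAN). The `eps` and `min_samples` values are assumptions tuned to this toy data, and the true labels are used only to score the result, which is not possible on real unlabeled data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not blob-shaped.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Centroid-based: assumes roughly spherical clusters around a center.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: follows dense regions; label -1 marks outliers/noise.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
n_outliers = int((dbscan_labels == -1).sum())

print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f} (outliers flagged: {n_outliers})")
```

On data like this, the density-based algorithm is expected to recover the two moons much better than KMeans, while on compact, well-separated blobs KMeans is usually sufficient and simpler to tune.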

Other examples of unsupervised ML algorithms include dimensionality reduction. These algorithms convert a high-dimensional problem into a lower-dimensional one by creating a lower-dimensional projection. Example algorithms include [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA), [T-distributed Stochastic Neighbor Embedding (TSNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), and [singular value decomposition (SVD)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), among others. With high-dimensional data (e.g., text), applying dimensionality reduction (e.g., PCA or TSNE) may improve the performance of AI/ML models.
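
As a small illustration of dimensionality reduction, the sketch below projects the 64-feature handwritten digits dataset bundled with scikit-learn onto 10 principal components. The dataset and the number of components are arbitrary choices for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each sample is an 8x8 image flattened into 64 features.
X, _ = load_digits(return_X_y=True)

# Project the data onto the 10 directions of largest variance.
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance kept by the 10 components.
explained = pca.explained_variance_ratio_.sum()

print(f"Original shape: {X.shape}, reduced shape: {X_reduced.shape}")
print(f"Variance retained: {explained:.1%}")
```

The retained variance gives a rough sense of how much information the projection keeps; the number of components can be increased until the retained variance is acceptable for the task.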

Other Unsupervised ML References:
- https://machinelearningmastery.com/clustering-algorithms-with-python/

The following figure shows main areas under machine learning.
![ML_Supervised_Unsupervised.png](attachment:ML_Supervised_Unsupervised.png)

The scikit-learn machine learning library has the following flowchart guide for choosing estimators and ML algorithms (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html). Note that this is only a guide and does not include many other algorithms and methods that are available.
![image.png](attachment:image.png)

# Ethics
[Return to Table of Contents](#Table-of-Contents)

Ethics is very important when automating a process using AI/ML/NLP. Algorithms may be designed to automate decision-making. However, because a non-human system is making the decision, we still need to consider whether these systems are making decisions ethically, respecting organizational values, protecting civil rights, civil liberties, and privacy, and complying with all applicable laws. Various documents provide guidance to help make sure that your application follows important identified principles. These documents include:
- U.S. Executive Order (EO) 13859 (https://www.federalregister.gov/documents/2019/02/14/2019-02544/maintaining-american-leadership-in-artificial-intelligence)
- U.S. EO 13960 (https://www.federalregister.gov/documents/2020/12/08/2020-27065/promoting-the-use-of-trustworthy-artificial-intelligence-in-the-federal-government)
- U.S. National Institute of Standards and Technology AI Risk Management Framework (https://www.nist.gov/itl/ai-risk-management-framework)
- U.S. Government Accountability Office AI Accountability Framework (https://www.gao.gov/products/gao-21-519sp)
- U.S. Department of Energy AI Risk Management Playbook (https://www.energy.gov/ai/doe-ai-risk-management-playbook-airmp)
- IEEE Standards:
    - IEEE 3652.1-2020: IEEE Guide for Architectural Framework and Application of Federated Machine Learning
    - IEEE 2830-2021: IEEE Standard for Technical Framework and Requirements of Trusted Execution Environment based Shared Machine Learning
    - IEEE 2937-2022: IEEE Standard for Performance Benchmarking for AI Server Systems
    - IEEE 2941-2021: IEEE Standard for Artificial Intelligence (AI) Model Representation, Compression, Distribution, and Management

Other references that provide examples and can help stimulate the conversation on Ethics and AI:
- Top 9 Ethical Dilemmas of AI and How to Navigate Them in 2022 (https://research.aimultiple.com/ai-ethics/)
- Jumpstart article "AI Gone Wrong: 5 Biggest AI Failures of All Time" (https://www.jumpstartmag.com/ai-gone-wrong-5-biggest-ai-failures-of-all-time/)
- Evil AI Cartoons: https://www.evilaicartoons.com/
- Rogan grills Zuckerberg on how Facebook moderates controversial content (https://www.youtube.com/watch?v=irCFLEVUuIo): Mr. Zuckerberg provides a good example of the trade-off between algorithms catching as many bad guys as possible while also catching a few good guys along the way. Is the cost of catching a few good guys too high, so that it is better to let a few bad guys go by? This is a classification problem dilemma, and as a data scientist your job is to work with your stakeholders to identify which situations are acceptable and unacceptable for your problem.
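
The classification dilemma in the last reference can be made concrete with a decision threshold. In the sketch below, the scores and labels are invented for a hypothetical content-moderation classifier (1 = harmful content); lowering the threshold catches more harmful items but wrongly flags more harmless ones, and raising it does the opposite.

```python
import numpy as np

# Hypothetical classifier scores and true labels (1 = harmful content).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.65, 0.40, 0.60, 0.80, 0.95])


def tradeoff(threshold):
    """Count outcomes when everything scoring >= threshold is flagged."""
    flagged = scores >= threshold
    return {
        "threshold": threshold,
        "caught": int(np.sum(flagged & (y_true == 1))),           # true positives
        "wrongly_flagged": int(np.sum(flagged & (y_true == 0))),  # false positives
        "missed": int(np.sum(~flagged & (y_true == 1))),          # false negatives
    }


results = [tradeoff(t) for t in (0.3, 0.5, 0.7)]
for row in results:
    print(row)
# threshold 0.3: caught 4, wrongly flagged 3, missed 0
# threshold 0.5: caught 3, wrongly flagged 2, missed 1
# threshold 0.7: caught 2, wrongly flagged 0, missed 2
```

No threshold removes both kinds of error on this data, which is why the acceptable balance has to be agreed with stakeholders rather than chosen by the data scientist alone.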

# Data Science Tools
[Return to Table of Contents](#Table-of-Contents)

Tools often have a combination of capabilities (e.g., visualization, AI, ML, NLP, etc.). Examples of readily available tools:
1. Open source (e.g., Python, R, Jupyter, D3.js, etc.)

Examples of data analytics and data science platforms, all with different capabilities, limitations, and learning curves:
1. Anaconda (Data science distribution package Python/R/Jupyter)
2. SkySpark
3. MicroStrategy
4. Palantir Foundry
5. PNNL Inspire (text analysis)
6. MS Power Apps/Power BI/Azure
7. Amazon Web Services
8. Tableau
9. Google Cloud Services Platform
10. Hadoop
11. Low code or no-code development platforms

# Data Science Example
[Return to Table of Contents](#Table-of-Contents)

This Jupyter Notebook is part of a multi-part series of notebooks that explore various data science topics, including data cleaning, exploratory data analysis (EDA), trending and forecasting, applications of natural language processing (NLP), clustering, classification, and dashboarding. The TMDB movie dataset is used as the example because it includes various features that are prepared differently. These include data and features related to currency, categories, numbers (float/integer), dates and time series, and text.

This example is divided into several Jupyter Notebooks that have the following characteristics:
- Steps are self-contained in individual notebooks by the type of process (e.g., data cleaning, EDA, dashboarding, etc.).
- This makes efficient use of personal computer resources and allows processes that take a relatively long time (e.g., text normalization) to be separated and run on their own.
- The results of each process step can be explored independently, as the data is saved at the end of each notebook and reloaded in the next.
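
The hand-off between notebooks could be as simple as writing a file at the end of one notebook and reading it at the start of the next. The sketch below is an assumption about how this might look, not the format these notebooks actually use: the file name `movies_clean.csv`, the columns, and the temporary folder are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a project data folder; the file name is hypothetical.
data_dir = Path(tempfile.mkdtemp())
handoff_file = data_dir / "movies_clean.csv"

# End of one notebook (e.g., data cleaning): save the processed data.
cleaned = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_date": pd.to_datetime(["2001-05-04", "2010-11-19"]),
    "budget": [1.5e6, 2.0e7],
})
cleaned.to_csv(handoff_file, index=False)

# Start of the next notebook (e.g., EDA): reload the saved data.
# CSV does not store data types, so dates have to be parsed again.
reloaded = pd.read_csv(handoff_file, parse_dates=["release_date"])

print(reloaded.dtypes)
```

CSV is used here because it needs no extra dependencies; a binary format that preserves data types would avoid the re-parsing step.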

The Jupyter Notebooks are divided into the following notebooks:
1. 0_Data_Science-Story Telling with Data: Provides an overview of what data science is and description of the example.
2. 1_Data_Cleaning: Initial exploration of the data to identify and address data quality issues related to but not limited to missing/null values, irrelevant data, outliers, errors, data types, and modify the data as needed.
3. 2_Exploratory_Data_Analysis: Provide visualizations, plots, charts, statistics, correlations, relationship between features, patterns, trends, anomalies, test hypothesis for feature selection, identify potential features to use in clustering or classification and other AI/ML algorithms, etc.

TBD:
4. 3_Trending_Forecasting: This notebook uses regressions to forecast applicable features. For example, we can attempt to forecast the future budgets and revenues of major films and how well the industry could perform.
5. 4_NLP: Depending on the use of the data, this notebook provides examples of how NLP can be used, specifically using similarity search techniques to rank movies.
6. 5_Clustering: Uses unsupervised machine learning to perform clustering (e.g., text clustering).
7. 6_Classification: Uses supervised machine learning to perform classification. For example, a classification algorithm could be written that reads a new film overview and classifies its genre.

Dashboarding
8. 7_Dashboarding: Provide a demonstration of how a dashboard can be developed within Jupyter Notebooks.

<div class="alert alert-block alert-info">
<b>NOTEBOOK END</b>
</div>