# Data Science Toolkit

Data science is the process of answering questions through data, and there are multiple tools that we can use to carry it out. No single tool fits all needs, and each tool has its own pros and cons.

There are multiple tasks that we perform in data science. We can define them broadly as -
* Data management - the process of persisting and retrieving data
* Data integration and transformation (also called ETL - Extract, Transform and Load, or data refining and cleaning) - the process of retrieving data from remote data sources, transforming it and loading it into a local data system
* Data Visualization - part of initial data exploration and of the final deliverable
* Model Building - creating a machine learning (ML) or deep learning (DL) model
* Model Deployment - making the model available to other applications
* Model monitoring and assessment - continuous quality checks on deployed models for accuracy, fairness, etc.
* Code asset management - version control (e.g. git) and team collaboration
* Data asset management - data backup, replication and access management
* Development Environment (or IDE) setup - an environment that helps data scientists develop, test and deploy their work
* Execution environment setup - tools where data processing, model training and deployment take place
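
The data management and ETL tasks above can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not a recommended pipeline: the CSV text, the `people` table and the function names `extract`, `transform` and `load` are all hypothetical, and an in-memory SQLite database stands in for the local data system.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a remote source (assumption for illustration).
RAW_CSV = """name,age,city
 Alice ,34,Pune
Bob,,Delhi
Carol,29, Mumbai
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, drop rows with a missing age, cast age to int."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        age = row["age"].strip()
        city = row["city"].strip()
        if not age:
            continue  # skip incomplete records
        cleaned.append((name, int(age), city))
    return cleaned

def load(rows, conn):
    """Load: persist the cleaned rows in a local SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)

# Data management: retrieve the persisted data with a query.
result = conn.execute("SELECT name, age, city FROM people ORDER BY age").fetchall()
print(result)
```

Dedicated integration tools automate and schedule steps like these at a much larger scale; the sketch only shows the shape of the extract, transform and load stages.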

## Tools for Data Science tasks - 

To carry out the data science tasks, there are multiple tools, both open-source and commercial. Each type of tool has its own pros and cons, and no single tool fits all needs. The choice of a specific tool depends on our capabilities and requirements. The tools listed here are standalone tools; programming libraries are not included.

### Open Source tools 

* Data management -
    * Relational DB - MySQL, PostgreSQL
    * NoSQL DB - MongoDB, CouchDB
    * File system based tools - HDFS (Hadoop Distributed File System), Ceph (distributed storage system), Elasticsearch
    
    
* Data integration - 
    * Apache Airflow
    * Kubeflow - pipeline execution on Kubernetes
    * Apache Kafka - distributed event streaming platform
    * Apache NiFi
    * Apache Spark SQL
    * Node-RED - provides a visual editor
    
    
* Data Visualization - This category comprises programming libraries that can be used to create visualizations and tools that inherently provide visualization capabilities. Some of the common tools are -
     * Hue - visualization from SQL query
     * Kibana - For Elasticsearch
     * Apache Superset
     
     
* Model Deployment - Once the model is developed, it should be consumable by others (via APIs)
    * Apache PredictionIO - supports deployment of Apache Spark ML models
    * Seldon - supports the majority of frameworks. Runs on top of Kubernetes or OpenShift
    * MLeap
    * TensorFlow Serving - used to deploy TensorFlow models
    
    
* Model monitoring - Once a model is deployed, it is important to track its performance as new data arrives, in order to check whether the model has become outdated.
    * ModelDB - machine learning model database. Supports Apache Spark ML pipelines and scikit-learn
    * Prometheus
    
    
* Code asset management - 
    * git - platforms like GitHub, GitLab, Bitbucket, etc.
    
    
* Data asset management - 
    * Apache Atlas
    * ODPi Egeria
    * Kylo - data lake management platform
    
    
* Development Environment - 
    * Jupyter
    * Apache Zeppelin
    * RStudio
    * Spyder
    
    
* Execution Environments - 
    * Apache Spark - widely used cluster computing framework. Its key property is linear scalability (doubling the servers roughly doubles the capacity). It is primarily a batch data processing engine
    * Apache Flink - stream processing engine (can process real-time data streams)
    * Ray (from RISELab)
    
    
* Fully integrated visual tools - These are tools that help with many or all of the data science tasks defined above and do not require much programming knowledge to use. Some of them are -
    * KNIME
    * Orange - easier to use

### Commercial tools

* Data management -
    * Oracle DB
    * Microsoft SQL Server
    * Amazon DynamoDB
    * IBM Cloudant (based on CouchDB)
    * IBM Db2
    * etc
    
    
* Data integration - 
    * Informatica
    * SAP
    * SAS
    
    
* Data Visualization - This category comprises programming libraries that can be used to create visualizations and tools that inherently provide visualization capabilities. Some of the common tools are -
     * Tableau
     * Power BI


* Model Building -
    * Google Cloud

     
* Model Deployment - Once the model is developed, it should be consumable by others (via APIs)
    * SPSS Modeler
    * SAS Enterprise Miner
    
    
    
* Model monitoring - Once a model is deployed, it is important to track its performance as new data arrives, in order to check whether the model has become outdated.
    * Amazon SageMaker
    
    
* Code asset management - 
    * git based platforms
    
    
* Data asset management - 
    * Informatica
    
    
* Development Environment - 
    * IBM Watson Studio
    
    
* Execution Environments - 
    * IBM Watson
    
    
* Fully integrated visual tools - These are tools that help with many or all of the data science tasks defined above and do not require much programming knowledge to use. Some of them are -
    * Watson Studio + Watson OpenScale
    * H2O.ai
    * Azure Machine Learning

## Programming Languages

In data science, it is not necessary to learn to code, but doing so can help a lot. The languages most commonly used in the data science domain are -

* Python
* R
* SQL
* JavaScript

Other languages include Scala, Java, C++, Julia, Go, Ruby, PHP, Visual Basic, etc.


## Libraries(Python) - 

* Scientific computing - 
    * Pandas - Data structures and tools
    * NumPy - Arrays and matrices
    
    
* Visualization - 
    * Matplotlib - most popular
    * Seaborn - high level, based on Matplotlib
    
    
* Machine learning and deep learning - 
    * Scikit-learn
    * Keras - high-level API for deep learning neural networks
    * TensorFlow - deep learning production and deployment
    * PyTorch
    
    
* PySpark - Python API for Apache Spark; covers data processing and machine learning from the categories above

## Datasets - 

* Definition - A structured collection of data. The data can be organized in structures such as -
    * Tabular data - commonly in CSV format
    * Hierarchical/network data - used to represent relationships in data
    * Raw files - images/audio


* Data ownership - 
    * Private data - datasets that are kept private because they contain confidential, private or personal information, or are commercially sensitive. They are generally not shared publicly.
    * Open data - these are datasets that institutions, governments, companies or organizations have made available to the general public for use.


* Sources of public datasets - publicly available open datasets can be found from multiple sources such as -
    * Open data portal list from around the world - http://datacatalogs.org/
    * Government and organizational websites like [UN](https://data.un.org/), [US govt data](https://data.gov) and [Indian govt data](https://data.gov.in/)
    * Other sources - 
        * [Kaggle](https://kaggle.com/datasets)
        * [Google](https://datasetsearch.research.google.com/)
        
        
* License consideration - When using a publicly available dataset (or any dataset, for that matter), it is important to check its license and whether it allows the usage we require. A common data sharing license is the CDLA (Community Data License Agreement)
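
To make the difference between tabular and hierarchical data concrete, here is a small sketch using the standard `csv` and `json` modules. The sample values (`pen`, `book`, the store layout) are made up for illustration only.

```python
import csv
import io
import json

# Tabular data: every row has the same columns (hypothetical sample values).
csv_text = "id,product,price\n1,pen,10\n2,book,50\n"
table = list(csv.DictReader(io.StringIO(csv_text)))

# Hierarchical data: records nest inside each other, expressing relationships.
json_text = '{"store": "A", "departments": [{"name": "stationery", "products": ["pen", "book"]}]}'
tree = json.loads(json_text)

first_product = table[0]["product"]                      # access by row and column
nested_product = tree["departments"][0]["products"][1]   # access by walking the hierarchy
print(first_product, nested_product)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as `price` need an explicit conversion before analysis.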

## Machine Learning models -

Data contains a wealth of information. Manual analysis is often impractical today because the amount of data is huge and traditional data analysis techniques cannot handle the volume. Machine learning models help identify patterns in data. Model training is the process through which a machine learning model learns from the data. Once the model is trained, it can be used to make predictions on new data.
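
The train-then-predict cycle described above can be illustrated with a very small model. The sketch below fits a straight line by ordinary least squares in plain Python; the technique, the data points and the helper names `train` and `predict` are assumptions chosen for illustration, not something prescribed by these notes.

```python
def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Prediction: apply the learned parameters to new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data that follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = train(xs, ys)             # the model learns the pattern from the data
prediction = predict(model, 10.0)  # the trained model is used on unseen input
print(model, prediction)
```

In practice a library such as scikit-learn would handle both steps, but the split into a training phase and a prediction phase stays the same.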

There are 3 basic classes of machine learning -

* Supervised learning - the most common type; both the input data and the correct output are provided. The model tries to find the relation between input and output. Generally used for regression and classification.
* Unsupervised learning - the data is not labeled (the correct output, structure, type, etc. are not known). The model tries to identify patterns in this kind of data without any help from humans. Generally used for clustering problems and anomaly detection.
* Reinforcement learning - similar to the way humans learn: by trial and error, identifying actions that maximize reward over time.
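
As a small illustration of unsupervised learning, the sketch below groups unlabeled numbers into two clusters. It uses a simplified two-cluster, one-dimensional k-means, picked here only as an example; the function name `two_means` and the sample values are hypothetical.

```python
def two_means(values, iterations=10):
    """Split unlabeled numbers into two clusters (simplified 1D k-means)."""
    # Deterministic starting centers: the smallest and largest value.
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        # Assignment step: each value joins the cluster with the nearest center.
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled data with two visible groups.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, clusters = two_means(data)
print(centers, clusters)
```

No correct output is ever supplied: the grouping emerges only from the distances between the values, which is what separates this from the supervised case.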

Deep learning is a specialized type of machine learning based on multi-layered neural networks.