<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## ***NumPy `numpy`***

NumPy (Numerical Python) is the foundational library for numerical and scientific computing in Python. It supports N-dimensional arrays (`ndarray`), vectorized operations, linear algebra, and random sampling.

***Official Link:*** ***[https://numpy.org/](https://numpy.org/)*** <br>
***GitHub:*** [https://github.com/numpy/numpy](https://github.com/numpy/numpy) <br>
***Docs:*** [https://numpy.org/doc/stable/user/index.html](https://numpy.org/doc/stable/user/index.html) <br>

#### **Key Strengths/Use :**

1. ***Multidimensional Arrays:*** Efficient storage and manipulation of large numerical datasets, of homogeneous types (e.g., all float32 or int64).

2. ***Performance:*** Core routines are written in C for low-level speed. Vectorized operations on whole arrays are significantly faster than native Python loops, and broadcasting lets arrays of compatible shapes be combined without explicit loops. Many inner loops also use SIMD instructions.
(SIMD, Single Instruction, Multiple Data, is a type of parallel processing where a single instruction is executed on multiple data points simultaneously.)

3. ***Scientific Computing:*** Provides linear algebra, Fourier transforms, random number generation, and more.

4. ***Interoperability:*** Widely adopted in libraries like: Pandas, SciPy, Scikit-learn, TensorFlow, PyTorch, OpenCV, etc.

5. Easy C/C++/Fortran bindings.

6. ***Mature, stable and serves as a backbone for most ML/AI libraries***
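The strengths listed above can be illustrated with a minimal sketch. The temperature values, the small linear system, and all variable names here are made up for illustration; only the NumPy calls themselves are real.

```python
import numpy as np

# Homogeneous N-dimensional array: every element shares one dtype.
temps_c = np.array([[20.0, 22.5, 19.0],
                    [25.0, 27.5, 24.0]], dtype=np.float32)

# Vectorized arithmetic: the whole array is converted without a Python loop.
temps_f = temps_c * 9 / 5 + 32

# Broadcasting: the (3,) vector of column means is applied to both rows.
col_means = temps_c.mean(axis=0)
centered = temps_c - col_means

# Linear algebra: solve the (illustrative) system  2x + y = 3,  x + 3y = 5.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(a, b)

# Random sampling with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(temps_f)
print(centered)
print(x)
```

Subtracting `col_means` works because NumPy broadcasts the one-dimensional vector along the first axis, so no explicit loop over rows is needed.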

#### **Key Limitations:**

1. No GPU support - use CuPy, JAX

2. No native autograd/differentiation - use JAX, PyTorch, TensorFlow

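3. Mostly single-threaded unless explicitly parallelized: element-wise operations run on one core, although linear algebra routines may use a multithreaded BLAS/LAPACK backend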

#### **Few Alternatives:** 

- ***CuPy :*** A NumPy-compatible library for GPU-accelerated (uses CUDA) computing with Python.

- ***JAX :*** 

    a) [JAX](https://github.com/jax-ml/jax) is a numerical computing library with an extended NumPy API, developed by Google. JAX brings automatic differentiation and hardware acceleration (CPU, GPU, TPU) to NumPy programs, enabling efficient numerical and ML research workflows through advanced function transformations.

    b) Provides automatic differentiation (autograd), JIT compilation via [XLA (Accelerated Linear Algebra)](https://openxla.org/xla), vectorization (vmap), and parallelization (pmap).

    c) It does not provide a built-in deep learning framework — unlike PyTorch or TensorFlow. To use JAX for deep learning, you need libraries built on top of JAX, such as: [Flax](https://github.com/google/flax), [Equinox](https://docs.kidger.site/equinox/) etc.

- **PyTorch :** PyTorch is an open-source machine learning library primarily developed by Meta AI. It's renowned for its flexibility, "Pythonic" design, and ease of use, making it extremely popular for deep learning research and increasingly in production.

- **TensorFlow** : An enterprise-oriented framework with a strong production ecosystem, including tools like TF Serving, TensorFlow Lite, and TensorBoard.

## ***Pandas `pandas`***

Pandas is an open-source Python library designed for data manipulation, analysis, and preparation. It provides highly optimized data structures such as:

- ***Series :*** 1D labeled array
- ***DataFrame :*** 2D labeled table whose columns can hold different data types (like an Excel table)

It’s built on ***NumPy*** and often used alongside Matplotlib, Seaborn and others, in data workflows.
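The two data structures described above can be sketched as follows. The fruit inventory and every column name are hypothetical, chosen only to show labels and mixed column types.

```python
import pandas as pd

# Series: a 1D array of values with an index of labels.
prices = pd.Series(
    [1.20, 0.50, 3.75],
    index=["apple", "banana", "cherry"],
    name="price",
)

# DataFrame: a 2D table whose columns may each have a different dtype.
inventory = pd.DataFrame({
    "item": ["apple", "banana", "cherry"],
    "price": [1.20, 0.50, 3.75],
    "in_stock": [True, False, True],
    "quantity": [10, 0, 4],
})

banana_price = prices["banana"]    # label-based lookup on a Series
price_column = inventory["price"]  # each DataFrame column is itself a Series
column_kinds = inventory.dtypes    # one dtype per column

print(banana_price)
print(column_kinds)
```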

***Official Link:*** [https://pandas.pydata.org](https://pandas.pydata.org) <br>
***GitHub:*** [pandas-dev/pandas](https://github.com/pandas-dev/pandas) <br>
***Docs:*** [pandas.pydata.org/docs](https://pandas.pydata.org/docs/) <br>

#### **Main Strengths/Use**:

1. **Tabular Data Handling:** Cleaning, exploring, transforming, and analyzing structured data like CSVs, Excel files, SQL tables, JSON, etc. Native support for various file formats.

2. **Data Wrangling & Cleaning:** Handling missing values (NaN), filtering, grouping, joining, type casting and conversion, etc.

3. **Time Series Analysis:** Datetime-aware indexing, date-range selection, and related tools.

4. **Exploratory Data Analysis (EDA):** Quick statistics and structural inspection of large datasets, for example with the `describe()` or `groupby()` methods.

5. Intuitive API for Python users

6. ***Ideal for data exploration, prototyping, and cleaning***
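A minimal sketch of the wrangling, time series, and EDA features listed above. The `sales` table, its column names, and the choice to fill missing values with zero are illustrative assumptions, not a recommended cleaning policy.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
    "region": ["north", "south", "north", "south"],
    "units": [10.0, np.nan, 7.0, 5.0],
})

# Cleaning: fill the missing value, then cast the column to integers.
sales["units"] = sales["units"].fillna(0).astype("int64")

# Filtering and grouping.
north_only = sales[sales["region"] == "north"]
units_by_region = sales.groupby("region")["units"].sum()

# Quick descriptive statistics for EDA.
summary = sales["units"].describe()

# Datetime-aware indexing: select a date range with string labels.
by_date = sales.set_index("date")
first_two_days = by_date.loc["2024-01-01":"2024-01-02"]

print(units_by_region)
print(summary)
print(first_two_days)
```

Label-based slicing on a `DatetimeIndex` includes both endpoints, so `first_two_days` holds the rows for January 1 and January 2.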

#### **Key Limitations:**

1. Single-threaded by default (not ideal for multicore CPUs)

2. Memory-bound - struggles with datasets > RAM

3. Slower than parallel or lazily evaluated engines for large-scale operations

#### **Few Alternatives:** 

- ***polars*** : A relatively new but rapidly growing DataFrame library written in Rust, offering very fast performance and good memory efficiency. It uses columnar storage based on the Apache Arrow memory model and executes queries in parallel.

- ***pyspark*** (Apache Spark) : It's a distributed computing framework that can handle petabytes of data across clusters of machines.

- ***dask*** : Dask can operate on datasets larger than RAM by intelligently partitioning and processing data in chunks. It can scale from a single machine to a cluster.


## ***Scikit-learn `sklearn`***

Scikit-learn is a powerful, open-source Python library for machine learning built on top of NumPy, SciPy, and Matplotlib. It is an essential tool in any data scientist's toolkit, and it integrates well with the rest of the Python data stack.

***Official Link:*** [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/) <br>
***GitHub:*** [https://github.com/scikit-learn/scikit-learn](https://github.com/scikit-learn/scikit-learn) <br>
***Docs:*** [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html) <br>

#### **Main Strength/Use:**

It provides simple and efficient tools for:

1. **Machine Learning (and related) Algorithms:** Out-of-the-box implementations for classification and regression (logistic regression, decision trees, SVMs, random forests, k-NN), clustering (KMeans, DBSCAN), dimensionality reduction, etc.

2. **Preprocessing:** Provides scaling, normalization, encoding (e.g., OneHotEncoder, LabelEncoder), etc.

3. **Model Evaluation (Metrics):** Provides an easy way to compute model evaluation metrics like accuracy, F1, ROC-AUC, MSE, etc.
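The three areas above (algorithms, preprocessing, metrics) can be combined in a short sketch. The use of the bundled Iris dataset, a 75/25 train/test split, and a scaler-plus-logistic-regression pipeline are assumptions made for illustration; none of them come from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model chained into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation metrics on the held-out data.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")

print(f"accuracy={accuracy:.3f}  macro F1={macro_f1:.3f}")
```

Wrapping the scaler and classifier in one pipeline means the scaling statistics are learned from the training split only, which avoids leaking information from the test split.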

#### **Few Limitations:**

- No GPU acceleration out of the box
- Not optimized for very large datasets or real-time inference

#### **Few Alternatives:** 

- Specialized ML Libraries (for specific tasks or algorithms) like XGBoost / LightGBM / CatBoost, Statsmodels etc.

- **TensorFlow/PyTorch:** Better for deep learning research and flexible neural network architectures, but usually overkill for tasks well-suited to classical algorithms and smaller datasets.


## ***Matplotlib `matplotlib`***

Matplotlib is a comprehensive plotting library for Python that provides control over 2D (and limited 3D) visualizations. It was originally developed by John Hunter in 2003 and is now part of the scientific Python ecosystem alongside NumPy, SciPy, and Pandas.

It is best known for its flexibility and low-level control, similar to MATLAB’s plotting environment.

***Official Link:*** [https://matplotlib.org](https://matplotlib.org) <br>
***GitHub:*** [https://github.com/matplotlib/matplotlib](https://github.com/matplotlib/matplotlib) <br>
***Docs:*** [https://matplotlib.org/stable/users/index](https://matplotlib.org/stable/users/index) <br>

#### **Key Strengths/Use:**

1. ***Unrivaled Customization :*** For creating publication-quality figures for academic papers, presentations, or specific branding requirements, Matplotlib offers fine-grained control over nearly every element of a figure, including axes, ticks, fonts, colors, and layout.

2. ***Foundation and Compatibility :*** Its position as the underlying engine for many other popular libraries (Seaborn, Pandas plot) means that learning Matplotlib provides a valuable fundamental understanding and ensures compatibility across different plotting tools.

3. ***Static Plot Excellence :*** For producing static images (PNG, PDF, SVG) for reports, papers, or dashboards that don't require user interaction, Matplotlib is highly reliable and performs exceptionally well.

4. ***Flexibility for Complex Layouts :*** Creating complex multi-panel figures with shared axes, insets, or highly custom grid layouts is very powerful with its object-oriented API.

5. ***Mature and Stable :*** Decades of development mean it's robust, well-tested, and has a vast community for support.

6. ***Versatile Plot Types :*** Matplotlib supports almost every common 2D plot type, including line plots, scatter plots, bar charts (vertical and horizontal), histograms, pie charts, heatmaps, and contour plots, plus basic 3D plotting through the mplot3d toolkit.
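A minimal sketch of the object-oriented API mentioned above: two panels with a shared x-axis, saved as a static PNG. The output filename `shared_axes_demo.png` and the plotted functions are arbitrary choices for illustration. The `Agg` backend is selected here on the assumption that the code runs as a script; inside a notebook that line can be dropped so the figure renders inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, suitable for writing files

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# One Figure holding two Axes stacked vertically, sharing the x-axis.
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True, figsize=(6, 4))

ax_top.plot(x, np.sin(x), color="tab:blue", label="sin(x)")
ax_top.set_ylabel("amplitude")
ax_top.legend(loc="upper right")

ax_bottom.plot(x, np.cos(x), color="tab:orange", linestyle="--", label="cos(x)")
ax_bottom.set_xlabel("x (radians)")
ax_bottom.set_ylabel("amplitude")
ax_bottom.legend(loc="upper right")

fig.suptitle("Two panels with a shared x-axis")
fig.tight_layout()

# Static output; the file extension selects the format (PNG, PDF, SVG, ...).
fig.savefig("shared_axes_demo.png", dpi=150)
plt.close(fig)
```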

#### **Few Limitations:**

1. ***Dated Default Aesthetics***: Out-of-the-box, Matplotlib plots can sometimes look a bit dated or less aesthetically pleasing compared to libraries like Seaborn or Plotly, requiring more code for visual appeal.

2. ***Verbosity for Simple Plots***: For straightforward plots, the object-oriented API can be a bit verbose, requiring explicit management of figures and axes, whereas higher-level libraries simplify this.

3. ***Statistical Graphics Simplicity***: While it can create statistical plots, ***Seaborn*** simplifies the creation of complex statistical visualizations with less code and better defaults.

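4. ***Limited Interactivity***: Interactive backends support basic zooming and panning, but there is no built-in web-native interactivity of the kind Plotly or Bokeh provide.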

#### **Few Alternatives:** 

- ***Plotly :*** Primarily designed for interactive, web-based visualizations.

- ***Seaborn :*** Built on top of Matplotlib, easier for statistical plots.

- Bokeh, Altair, and others.