# Lecture 6c: Updated, informed data management modules. Advanced material.   

**Optional for the purpose of this course, necessary if you aim to work with python and Data Science.**

**Work in Progress**, *thanks to class of 2024 for always asking for more.*  

Generated by asking the right prompts to an AI, curated, corrected and filtered by you know who.     
The best example is always at the end.  

Note: Upload the full version on eclass, after 50% of students submitted their practise sets.

## 1. More data management modules.

### 1.2 Multithreading and RAM efficiency: Polars   
Polars can be more efficient than Pandas: it is multithreaded and, through its lazy API, it can avoid loading all the data into RAM at once.  
Pandas and Polars are both data analysis libraries in Python, but they have some key differences:

1. **Performance**: Polars is designed to be more efficient than Pandas, especially when dealing with large datasets. In addition to an eager API, it offers a lazy API in which computations are not performed until the result is requested. This lets Polars optimize the whole query first, for example by reading only the columns it needs and filtering rows as early as possible.

2. **Memory Usage**: Polars is often more memory efficient than Pandas. In lazy mode it can process data in chunks (streaming) instead of loading everything into RAM at once, which can be a big advantage when working with datasets that do not fit into memory. In eager mode (e.g. `pl.read_csv`), the whole dataset is still loaded into memory.

3. **API Compatibility**: While Polars provides a similar API to Pandas for data manipulation, there are some differences. If you're used to working with Pandas, you might need to learn new ways of doing things in Polars.

4. **Multi-threading**: Polars is designed to utilize multi-threading, which means it can use all available CPU cores for computations. This can lead to faster execution times compared to Pandas, which runs most operations on a single thread.

In summary, if you're working with larger datasets and find that Pandas is too slow or uses too much memory, Polars might be a good alternative to consider. However, for smaller datasets, the performance difference might not be noticeable, and using Pandas could be simpler and more straightforward.

### 1.3 Big data frameworks: PySpark

Pandas and PySpark are both data analysis libraries in Python, but they have some key differences:

1. **Data Size**: Pandas is typically used for small to medium-sized datasets that can fit into memory. On the other hand, PySpark is designed for distributed computing on big data using the Apache Spark framework. It can handle much larger datasets that don't fit into memory.

2. **Performance**: PySpark can be faster than Pandas when dealing with large datasets because it uses the Spark engine for distributed computing. It can distribute the data and computations across multiple nodes in a cluster, which can significantly speed up processing times.

3. **API**: While both provide similar data manipulation capabilities, their APIs are different. If you're used to working with Pandas, you might need to learn new ways of doing things in PySpark.

4. **Environment**: PySpark typically runs on a Spark cluster (for example on Hadoop YARN or Kubernetes), although it can also run locally on one machine. Pandas runs in a single-machine environment.

5. **Functionality**: PySpark provides more functionality beyond data manipulation, including SQL queries and distributed machine learning through MLlib. Note that Spark's graph library, GraphX, has no Python API; from Python, graph processing is usually done with the separate GraphFrames package.

In summary, if you're working with larger datasets and need distributed computing capabilities, PySpark might be a good choice. However, for smaller datasets and simpler data analysis tasks, Pandas could be simpler and more straightforward to use.

### 1.4 Other alternatives to PySpark for big data.
**CuPy**, **Vaex**, and **Datatable** are all libraries in Python that are used for data manipulation and analysis, similar to Pandas. However, they each have their unique features and use-cases:

1. **CuPy**: CuPy is a GPU-accelerated library for numerical computations. It provides a numpy-like interface while offloading heavy computations to GPUs for better performance. It's particularly useful for tasks that require heavy numerical computations. However, it doesn't provide the same data manipulation capabilities as Pandas.

2. **Vaex**: Vaex is a high performance Python library for lazy, out-of-core DataFrames (similar to Pandas), built to visualize and explore big tabular datasets. It can handle datasets beyond the capacity of Pandas because it uses memory mapping and lazy evaluation, so data is read from disk only when needed.

3. **Datatable**: Datatable is a library in Python that is designed for big data processing. It's similar to Pandas but is designed to be more efficient with larger datasets. It uses a columnar storage approach (data is stored by columns rather than by rows), which can make certain operations much faster. It's also multithreaded, which can provide a significant speed boost over Pandas.

In summary, while all these libraries can be used for data manipulation and analysis, the choice between them depends on your specific use case.    
* If you're dealing with very large datasets, Vaex or Datatable might be more suitable.    
* If your work involves heavy numerical computations, CuPy might be the better choice.
* For general data manipulation tasks, especially with smaller to medium-sized datasets, Pandas is often the go-to library.

### 1.5 Other alternatives to PySpark for big data and distributed processing.

There are several alternatives to PySpark for distributed data processing. Here are a few:

1. **Dask**: Dask is a flexible library for parallel computing in Python. It's built with the Python ecosystem in mind and integrates well with tools like NumPy, Pandas, and Scikit-Learn. Dask provides dynamic task scheduling and parallel collections that extend the functionality of these tools to larger datasets.

2. **Ray**: Ray is a general-purpose distributed computing framework. In addition to simple map-reduce style computations, Ray supports task dependencies and also provides libraries for distributed training of machine learning models, hyperparameter tuning, reinforcement learning and more.

3. **Koalas**: Koalas is a project that brought the pandas API to Apache Spark, to make the transition from pandas to Spark easier. Since Spark 3.2 it is part of PySpark itself as the pandas API on Spark (`pyspark.pandas`), so new code should use that instead of the standalone Koalas package.

4. **Vaex**: Vaex (see section 1.4) is a Python library for lazy, out-of-core DataFrames, similar to pandas, but designed to be used with big data. It scales on a single machine through memory mapping rather than by distributing work across a cluster.

5. **Modin**: Modin is a library that speeds up your pandas workflows by changing a single line of code. It uses parallelism to accomplish this, via Ray or Dask.

Remember to choose the tool that best fits your specific needs and constraints.

### 1.6 Scalable pandas code: Modin   
Modin and Pandas are both data analysis libraries in Python, but they have some key differences:

1. **Performance**: Modin is designed to speed up your Pandas workflows by changing a single line of code. It uses parallelism to accomplish this, via Ray or Dask. This means that operations in Modin are designed to utilize all available CPU cores, which can lead to significant speed improvements for larger datasets.

2. **API Compatibility**: Modin is a drop-in replacement for Pandas, meaning it aims to have complete API compatibility with Pandas. This means you can use Modin as a direct replacement for Pandas in your code, and it should work as expected. However, there might be some edge cases where Modin's behavior is slightly different from Pandas.

3. **Scalability**: While Pandas is excellent for small to medium-sized datasets, it can struggle with larger datasets. Modin, on the other hand, is designed to handle larger datasets more efficiently by distributing the computations across all cores.

4. **Memory Usage**: Modin partitions the DataFrame across workers, which can help with datasets that strain Pandas. Do not expect large memory savings, though: by default Modin still holds the whole dataset in memory (see section 2.1), and the parallel engine adds some overhead of its own.

In summary, if you're working with larger datasets and find that Pandas is too slow or uses too much memory, Modin might be a good alternative to consider. However, for smaller datasets, the performance difference might not be noticeable, and using Pandas could be simpler and more straightforward.
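
A minimal sketch of the "single line of code" change, assuming Modin is installed together with one of its engines (e.g. `pip install "modin[ray]"`). The `try`/`except` fallback is an addition for this example, so that the cell still runs with plain pandas when Modin is missing. The data is made up.

```python
# The only Modin-specific line is the import; the fallback keeps the cell runnable without Modin.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# From here on, the code is ordinary pandas syntax.
df = pd.DataFrame(
    {"region": ["north", "south", "north", "east"], "amount": [120, 80, 200, 50]}
)
totals = df.groupby("region")["amount"].sum()
north_total = int(totals["north"])

# Shows which library actually ran: "modin.pandas" or "pandas".
print(pd.__name__, north_total)
```

On a DataFrame this small, Modin is likely to be slower than pandas because of the engine start-up cost. The swap pays off only on larger data.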

## 2. Comparing modules for various use cases.

### 2.1 Comparing Modin versus Polars.

1. **Performance and Scalability**: Both Modin and Polars are designed to handle larger datasets more efficiently than traditional Pandas. Modin achieves this by parallelizing computations across all available CPU cores using either Ray or Dask. Polars, on the other hand, uses a lazy evaluation approach, which means that computations are not performed until the result is needed. This can lead to significant performance improvements, especially when dealing with large datasets.

2. **Memory Usage**: Polars is generally more memory efficient than Modin. With its lazy API, it can process data in chunks instead of loading everything into RAM at once. This can be a big advantage when working with large datasets that do not fit into memory. Modin, while it can be more memory efficient than Pandas, by default still loads the entire dataset into memory.

3. **API Compatibility**: Modin aims to be a drop-in replacement for Pandas, meaning it strives for complete API compatibility with Pandas. This makes transitioning from Pandas to Modin relatively straightforward. Polars, while providing a similar API to Pandas, does have some differences. If you're used to working with Pandas, you might need to learn new ways of doing things in Polars.

4. **Multi-threading and parallelism**: Both Modin and Polars are designed to use all available CPU cores. Polars does this with multi-threading inside a single process, while Modin distributes the work across worker processes through Ray or Dask. Either approach can lead to faster execution times compared to Pandas, which runs most operations on a single thread.

In summary, if you're working with larger datasets and find that Pandas is too slow or uses too much memory, both Modin and Polars might be good alternatives to consider. The choice between Modin and Polars will depend on your specific needs and constraints, such as memory usage, API familiarity, and the specific nature of your data and computations.

#### 2.1.1. Advantages of Polars over Modin.

Polars and Modin are both data analysis libraries in Python designed to handle larger datasets more efficiently than traditional Pandas. However, they have some key differences and advantages over each other:

1. **Memory Efficiency**: Polars is generally more memory efficient than Modin. With its lazy API, Polars can process data in chunks instead of loading everything into RAM at once. This can be a big advantage when working with large datasets that do not fit into memory. Modin, while it can be more memory efficient than Pandas, by default still loads the entire dataset into memory.

2. **Lazy Evaluation**: Polars offers a lazy API, in which computations are not performed until the result is requested, so the whole query can be optimized before it runs. This can lead to significant performance improvements, especially when dealing with large datasets. Modin, on the other hand, follows the Pandas model and performs each operation as it is called.

3. **Multi-threading**: Both Polars and Modin can use all available CPU cores for computations. However, Polars is multi-threaded from the ground up within a single process, while Modin relies on an external engine (Ray or Dask) to spread work across worker processes. The Polars approach avoids the overhead of moving data between processes.

4. **API Differences**: While Modin aims to be a drop-in replacement for Pandas, meaning it strives for complete API compatibility with Pandas, Polars provides a similar but not identical API to Pandas. This means that transitioning from Pandas to Polars might require learning new ways of doing things, but it also means that Polars can offer features and functionality that are not available in Pandas or Modin.

In summary, the advantages of Polars over Modin mainly lie in its memory efficiency, lazy evaluation approach, and potentially more efficient multi-threading. However, the choice between Polars and Modin will depend on your specific needs and constraints, such as memory usage, API familiarity, and the specific nature of your data and computations.

#### 2.1.2. Advantages of Modin over Polars.   

Modin and Polars are both data analysis libraries in Python designed to handle larger datasets more efficiently than traditional Pandas. However, they have some key differences and advantages over each other:

1. **API Compatibility**: Modin aims to be a drop-in replacement for Pandas, meaning it strives for complete API compatibility with Pandas. This makes transitioning from Pandas to Modin relatively straightforward. Polars, while providing a similar API to Pandas, does have some differences. If you're used to working with Pandas, you might need to learn new ways of doing things in Polars.

2. **Parallelism Frameworks**: Modin provides the flexibility to choose the parallelism framework that suits your needs. It supports both Ray and Dask as backends, which allows you to choose based on your specific use case or environment. Polars, on the other hand, does not offer this flexibility.

3. **Ease of Use**: Given its goal to be a drop-in replacement for Pandas, Modin can be easier to use for those who are already familiar with the Pandas API. The transition from Pandas to Modin can be as simple as changing a single line of code in some cases.

4. **Community and Ecosystem**: Modin grew out of the same UC Berkeley lab (RISELab) as Ray and runs on top of Ray or Dask, both of which have large communities and a wide range of tools for distributed computing. Because it keeps the Pandas API, most Pandas tutorials and answers also apply. This can be an advantage in terms of finding help, resources, and integrations with other tools.

In summary, the advantages of Modin over Polars mainly lie in its API compatibility with Pandas, flexibility in choosing parallelism frameworks, ease of use, and its community and ecosystem. However, the choice between Modin and Polars will depend on your specific needs and constraints, such as API familiarity, the specific nature of your data and computations, and your computational resources.

## 3. PyArrow data types in pandas
[Apache Arrow development platform for in-memory analytics](https://arrow.apache.org/docs/python/index.html)   

[PyArrow functionality](https://pandas.pydata.org/docs/user_guide/pyarrow.html#pyarrow-functionality)   

**To use this functionality, please ensure you have installed the [minimum supported PyArrow version.](https://pandas.pydata.org/docs/getting_started/install.html#install-optional-dependencies)**   
PyArrow is one of pandas' optional dependencies.

## 4. Always stay up to date on Python data management. Examples from data modules that generalize to the whole Python ecosystem.   
!TODO: Split this notebook, refactor as lecture 6 Instructions. (Think about it?).     
Or, just add corresponding examples in each lecture (?).

### 4.1. Read the official docs.   
e.g. https://pandas.pydata.org/

### 4.2 Sign up for and watch relevant conferences (PyData).  
https://pydata.org/global2023/   

https://www.youtube.com/@PyDataTV  
[PyData Global 2023 complete video gallery](https://www.youtube.com/playlist?list=PLGVZCDnMOq0rCyO6B53u1eFT4owN8Lvwj)

[PyData Global 2022 complete video gallery](https://www.youtube.com/playlist?list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq)

### 4.3 Ask questions on Stack Overflow.   
[How to ask a good question](https://stackoverflow.com/help/how-to-ask)   

### 4.4. Follow the right people and join the right communities on social media.   
Cassie Kozyrkov.   

### 4.5. Double check AI copilots and assistants 

`argythana:`  
> append_df = df1.append(df2)  AttributeError: 'DataFrame' object has no attribute 'append'
> Why is this wrong?   


`GitHub Copilot:`
> The error message you're seeing typically indicates that the method you're trying to use doesn't exist for the object you're using it on. However, the `append` method is a valid method for a DataFrame object in pandas.
> The error might be due to a different reason such as:
> 1. You might have overwritten the pandas DataFrame class with a variable of the same name. Make sure you haven't used `DataFrame` as a variable elsewhere in your code.
> 2. There might be an issue with your pandas installation. Try reinstalling pandas using pip or conda.
> 3. You might be using an older version of pandas where the `append` method is not available. You can check your pandas version with `print(pd.__version__)`. If it's not the latest version, consider upgrading it.
> If none of these are the case, please provide more context or check the rest of your code for potential issues.

`argythana:`   

> when was df.append() method deprecated and when was it totally removed from pandas?

`GitHub Copilot:`
> The `append()` method is not deprecated and has not been removed from pandas. It is still a valid method for appending rows of one DataFrame to the end of another DataFrame, returning a new object. The error you're encountering might be due to a different reason. Please ensure that `df1` and `df2` are indeed pandas DataFrames. If the issue persists, consider checking your pandas installation or version.

`argythana on 2nd of April, 2024:`  
> which version of github copilot is this?

`GitHub Copilot:`  
> This is the GPT-4 version of GitHub Copilot.

Meanwhile, since pandas 1.4 (January 22, 2022) `df.append()` has been [deprecated](https://pandas.pydata.org/pandas-docs/version/1.4/reference/api/pandas.DataFrame.append.html).        


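`df.append()` was then removed completely in pandas 2.0.0, released on April 3, 2023, which was a full year before the conversation above. The replacement is `pd.concat()`.   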
Read this [reply.](https://stackoverflow.com/a/75956237)
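
A minimal sketch of how to check this yourself instead of trusting the assistant, assuming only that pandas is installed. The DataFrames are made up; the variable names follow the question above.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df2 = pd.DataFrame({"id": [3], "value": ["c"]})

# Step 1: check which pandas version you are actually running.
print(pd.__version__)

# Step 2: ask the library itself whether the method exists (False on pandas 2.0 and later).
has_append = hasattr(pd.DataFrame, "append")
print(has_append)

# Step 3: use pd.concat, which works on both old and new versions.
append_df = pd.concat([df1, df2], ignore_index=True)
print(append_df)
```

Two lines of code (`pd.__version__` and `hasattr`) settle the question that the assistant got wrong twice.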