DATA TOOLKIT -theory

1. What is NumPy, and why is it widely used in Python?
-NumPy, which stands for Numerical Python, is an essential library in the Python ecosystem, widely used for scientific computing. It provides a high-performance multidimensional array object, ndarray, and tools for working with these arrays.

Key Features of NumPy

N-dimensional Array Support: At its core, NumPy offers the ndarray object, enabling efficient storage and manipulation of homogeneous data types across multiple dimensions.

Performance: NumPy operations are implemented in C, which provides a performance boost compared to pure Python implementations. This is particularly beneficial for tasks involving large datasets or complex mathematical computations.

Comprehensive Mathematical Functions: The library includes functions for statistical analysis, linear algebra, and random number generation, among others, making it a versatile tool for data analysis and scientific research.

Interoperability: NumPy arrays are used as the standard data container for many other libraries in the scientific Python ecosystem, facilitating data exchange and integration.

Ease of Use: With its clear and concise syntax, NumPy is accessible to programmers from various backgrounds, simplifying the transition to scientific computing in Python.

Practical Example: Array Operations

Here's a simple demonstration of creating and manipulating a NumPy array:

import numpy as np

# Create a 2-dimensional array and perform operations
x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
print(x)
# Output:
# [[  0   1   2   3   4]
#  [-99   6 -99   8 -99]
#  [-99  11 -99  13 -99]]

# Find the maximum value in each row
max_values = x.max(axis=1)
print(max_values)
# Output: [ 4  8 13]
Why NumPy is Faster Than Lists

NumPy arrays are stored contiguously in memory, providing the benefit of locality of reference. This storage method, combined with the fact that operations are performed in compiled C code, results in significant performance gains over traditional Python lists.
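
As a rough illustration of this difference, the sketch below times the same sum of squares computed from a Python list and from a NumPy array. The array size, the repeat count, and the helper names are arbitrary choices made for this example, and the measured times depend on the machine.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary size chosen for illustration
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)


def sum_squares_list():
    # Pure Python: one interpreted multiplication per element
    return sum(v * v for v in values_list)


def sum_squares_array():
    # NumPy: the multiplication and the reduction run in compiled code
    return int((values_array * values_array).sum())


# Both approaches must agree on the result
assert sum_squares_list() == sum_squares_array()

list_time = timeit.timeit(sum_squares_list, number=20)
array_time = timeit.timeit(sum_squares_array, number=20)
print(f"list:  {list_time:.4f} s")
print(f"array: {array_time:.4f} s")
```

On typical hardware the array version is expected to be noticeably faster, but the exact ratio varies, so no timings are quoted here.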

Open Source and Community-Driven

NumPy is an open-source project, distributed under a BSD license, and maintained by a vibrant community on GitHub. Its open nature encourages collaboration and contributions, ensuring the library stays up-to-date with the latest computing architectures and paradigms.

Conclusion

NumPy is a cornerstone in Python's scientific computing stack, offering efficient array manipulation and a suite of mathematical tools. Its integration with other libraries and ease of use make it an indispensable resource for data scientists, researchers, and engineers alike. Whether you're performing complex numerical simulations or analyzing large datasets, NumPy provides the functionality and performance necessary to get the job done efficiently.

2. How does broadcasting work in NumPy?
-Broadcasting in NumPy is a powerful feature that allows operations on arrays of different shapes without explicitly reshaping or replicating data. It simplifies mathematical operations by automatically expanding smaller arrays to match the shape of larger ones, following specific rules.

How Broadcasting Works
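When performing operations on two arrays, NumPy compares their shapes one dimension at a time, starting from the trailing (rightmost) dimension and working left. If one array has fewer dimensions, its shape is treated as if it were padded with 1s on the left. The following rules are applied to each pair of dimensions: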

Rule 1: Matching Dimensions
If the dimensions of the two arrays are the same, they are compatible.

Rule 2: Size of 1
If one of the dimensions is 1, it can be stretched (broadcasted) to match the other dimension.

Rule 3: Incompatible Shapes
If the dimensions do not match and neither is 1, broadcasting is not possible, and a ValueError is raised.

Examples
1. Scalar and Array
A scalar can be broadcasted to any array:


import numpy as np

arr = np.array([1, 2, 3])
result = arr + 5  # Scalar 5 is broadcasted
print(result)  # Output: [6 7 8]
2. Arrays with Different Shapes

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # Shape: (2, 3)
arr2 = np.array([10, 20, 30])            # Shape: (3,)
result = arr1 + arr2  # arr2 is broadcasted to shape (2, 3)
print(result)
# Output:
# [[11 22 33]
#  [14 25 36]]
3. Higher Dimensions

arr1 = np.array([1, 2, 3])               # Shape: (3,)
arr2 = np.array([[10], [20], [30]])      # Shape: (3, 1)
result = arr1 + arr2  # both arrays are broadcast to shape (3, 3)
print(result)
# Output:
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
Key Points
Broadcasting avoids memory overhead: the smaller operand is not physically copied out to the larger shape, although the result is still a full-size array.
It works only when the shapes are compatible according to the rules.
If shapes are incompatible, you’ll get a ValueError.
This feature is particularly useful in scientific computing, where operations on multi-dimensional data are common.
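
The rules above can be checked without doing any arithmetic. The sketch below uses np.broadcast_shapes (available in NumPy 1.20 and later) to compute the result shape for a few shape pairs, then shows an incompatible pair raising ValueError and one way to fix it. The specific shapes are arbitrary examples.

```python
import numpy as np

# Shapes are aligned from the right; missing leading dimensions count as 1
print(np.broadcast_shapes((2, 3), (3,)))       # (2, 3)
print(np.broadcast_shapes((3, 1), (3,)))       # (3, 3)
print(np.broadcast_shapes((4, 1, 5), (3, 1)))  # (4, 3, 5)

a = np.ones((2, 3))
b = np.ones((2,))  # trailing dimensions are 3 and 2: not compatible

try:
    a + b
except ValueError as exc:
    print("ValueError:", exc)

# Adding an explicit axis of length 1 makes the shapes compatible
fixed = a + b[:, np.newaxis]  # (2, 3) + (2, 1) -> (2, 3)
print(fixed.shape)            # (2, 3)
```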

3. What is a Pandas DataFrame?
-A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is one of the primary data structures in pandas.

Creating a DataFrame

You can create a DataFrame using various methods such as from a dictionary, list, or even from a CSV file.

Example: Creating DataFrame from Dictionary

import pandas as pd

data = {
"Name": ["Tom", "Nick", "Krish", "Jack"],
"Age": [20, 21, 19, 18]
}

df = pd.DataFrame(data)
print(df)
Output:

    Name  Age
0    Tom   20
1   Nick   21
2  Krish   19
3   Jack   18
Accessing Data

You can access data in a DataFrame using various methods like loc[] and iloc[].

Example: Using loc[] to Access Rows by Label

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df.loc["day2"])
Output:

calories    380
duration     40
Name: day2, dtype: int64
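
For comparison, here is a short sketch of position-based access with iloc[] on the same data. The variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# iloc selects by integer position, regardless of the index labels
second_row = df.iloc[1]       # same row as df.loc["day2"]
first_two = df.iloc[0:2]      # positions 0 and 1 (the end is exclusive)
single_value = df.iloc[2, 0]  # row position 2, column position 0

# loc slices by label and includes the end label
labelled = df.loc["day1":"day2", "calories"]

print(second_row)
print(first_two)
print(single_value)
print(labelled)
```

Note the asymmetry: an iloc slice excludes its end position, while a loc slice includes its end label.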
Handling Missing Data

Pandas provides functions like isnull(), fillna(), and dropna() to handle missing data.

Example: Filling Missing Values

import pandas as pd
import numpy as np

data = {
'First Score': [100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score': [52, 40, 80, 98]
}

df = pd.DataFrame(data)
df_filled = df.fillna(0)
print(df_filled)
Output:

   First Score  Second Score  Third Score
0        100.0          30.0           52
1         90.0           0.0           40
2          0.0          45.0           80
3         95.0          56.0           98
Pandas DataFrames are versatile and powerful for data manipulation and analysis.

4. Explain the use of the groupby() method in Pandas?
-The groupby method in Pandas is a powerful tool for grouping data and performing operations on those groups. It is commonly used for data aggregation, transformation, and analysis. Here's a concise explanation of its use:

What does groupby do?
The groupby method splits the data into groups based on some criteria (e.g., a column or multiple columns), applies a function to each group, and then combines the results into a new DataFrame or Series.

Key Steps in groupby:
Splitting: Divide the data into groups based on values in one or more columns.
Applying: Perform an operation (e.g., aggregation, transformation, or filtering) on each group.
Combining: Merge the results into a new structure.
Common Use Cases:
Aggregation: Summarize data using functions like sum(), mean(), count(), etc.
Transformation: Modify data within each group (e.g., normalize values).
Filtering: Select groups that meet specific criteria.
Example 1: Aggregation

import pandas as pd

# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Group by 'Category' and calculate the sum of 'Values'
result = df.groupby('Category')['Values'].sum()
print(result)
Output:


Category
A    90
B    60
Name: Values, dtype: int64
Example 2: Multiple Aggregations

# Group by 'Category' and calculate multiple aggregations
result = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(result)
Output:


          sum  mean  count
Category                    
A          90  30.0      3
B          60  30.0      2
Example 3: Transformation

# Normalize 'Values' within each group
df['Normalized'] = df.groupby('Category')['Values'].transform(lambda x: x / x.sum())
print(df)
Output:


  Category  Values  Normalized
0        A      10    0.111111
1        B      20    0.333333
2        A      30    0.333333
3        B      40    0.666667
4        A      50    0.555556
Key Parameters of groupby:
by: Column(s) or keys to group by.
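axis: Whether to group rows (axis=0, default) or columns (axis=1). Grouping with axis=1 is deprecated in recent pandas releases (2.1 and later), so prefer grouping rows.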
level: Group by a specific level in a MultiIndex.
The groupby method is highly versatile and can be combined with other Pandas functions to perform complex data manipulations efficiently.
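
The examples above cover aggregation and transformation. The filtering use case can be sketched with the same sample data; the threshold of 70 is a hypothetical cutoff chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                   "Values": [10, 20, 30, 40, 50]})

# Keep only the rows that belong to groups whose total exceeds the threshold
threshold = 70  # hypothetical cutoff
large_groups = df.groupby("Category").filter(lambda g: g["Values"].sum() > threshold)
print(large_groups)
```

Group A sums to 90 and is kept, while group B sums to 60 and is dropped, so the result contains the original rows 0, 2, and 4.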

5. Why is Seaborn preferred for statistical visualizations?
-Seaborn is a Python library that provides a high-level interface for creating attractive and informative statistical graphics. It is built on top of matplotlib and integrates closely with pandas data structures, making it an essential tool for data analysis and visualization.

Creating Visualizations with Seaborn

Seaborn simplifies the process of creating visualizations by providing a dataset-oriented API. This means you can focus on the meaning of your data and the story you want to tell, rather than the mechanics of plotting. Here's an example of how you can create a visualization with Seaborn:

import seaborn as sns
import pandas as pd

# Load an example dataset
tips = sns.load_dataset("tips")

# Create a visualization
sns.relplot(
data=tips,
x="total_bill",
y="tip",
col="time",
hue="smoker",
style="smoker",
size="size",
)
In this example, Seaborn automatically handles the details of creating a relational plot, including the mapping of dataframe columns to visual attributes like color and size.

Statistical Estimation and Plot Types

Seaborn excels at statistical estimation and offers a variety of plot types to represent data distributions and relationships. Some of the plot types available in Seaborn include:

Line plots: Useful for visualizing relationships involving time or ordered categories.

Scatter plots: Ideal for showing the relationship between two continuous variables.

Box plots: Provide a visual summary of the distribution of a dataset, highlighting the median, quartiles, and outliers.

Violin plots: Similar to box plots but include a kernel density estimation to show the distribution shape.

Bar plots: Represent an estimate of central tendency for a numeric variable with error bars to indicate uncertainty.

Count plots: Show the counts of observations in each categorical bin using bars.

KDE plots: Visualize the probability density of a continuous variable.

Customization and Flexibility

Seaborn offers a balance between ease of use and customization. It comes with opinionated defaults that create presentable plots with minimal effort. However, it also allows for extensive customization to fine-tune your visualizations for publication quality. You can adjust the plot's theme, scale, color palette, and more to fit the context of your presentation or publication.

Integration with Pandas

Seaborn's integration with pandas makes it straightforward to work with dataframes. You can pass pandas dataframes directly to Seaborn's plotting functions, and it will internally perform the necessary semantic mapping and statistical aggregation. This integration streamlines the visualization process, especially when working with complex datasets.

Example of Seaborn with Pandas

Here's an example of how you can use Seaborn together with pandas to create a box plot:

import seaborn as sns
import pandas as pd

# Load data from a CSV file
data = pd.read_csv("nba.csv")

# Create a box plot
sns.boxplot(data=data, x="Age", y="Weight")
This example demonstrates the simplicity of creating a box plot with Seaborn, where 'Age' and 'Weight' are columns in the 'nba.csv' dataframe.

Conclusion

Seaborn is a powerful tool for data visualization in Python. It simplifies the creation of complex statistical graphics, allowing you to convey insights effectively. Whether you're exploring data interactively or preparing a final presentation, Seaborn's high-level interface and integration with pandas make it an indispensable library for data analysts and scientists.

6. What are the differences between NumPy arrays and Python lists?
-1. Performance
NumPy Arrays: Faster due to their implementation in C and optimized for numerical computations.
Python Lists: Slower as they are general-purpose containers and not optimized for numerical operations.
2. Data Type
NumPy Arrays: Homogeneous; all elements must be of the same data type (e.g., integers, floats).
Python Lists: Heterogeneous; can store elements of different data types (e.g., integers, strings, objects).
3. Memory Efficiency
NumPy Arrays: More memory-efficient because they store data in contiguous blocks of memory.
Python Lists: Less memory-efficient as they store references to objects, which can lead to overhead.
4. Functionality
NumPy Arrays: Provide a wide range of mathematical, statistical, and linear algebra operations directly.
Python Lists: Limited built-in operations; require loops or external libraries for numerical computations.
5. Indexing and Slicing
NumPy Arrays: Support advanced slicing, broadcasting, and multidimensional indexing.
Python Lists: Support basic slicing but lack advanced features like broadcasting.
6. Mutability
NumPy Arrays: Mutable; elements can be changed, but resizing is less flexible.
Python Lists: Mutable and can be resized dynamically (e.g., appending or removing elements).
7. Use Case
NumPy Arrays: Ideal for scientific computing, data analysis, and large-scale numerical operations.
Python Lists: Better suited for general-purpose programming and small-scale tasks.
In summary, NumPy arrays are specialized for numerical and scientific tasks, while Python lists are versatile and better for general use.
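
A few of these differences can be seen directly in a short sketch. The values are arbitrary.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# The same operator means different things for the two types
print(py_list * 2)    # [1, 2, 3, 1, 2, 3]  (repetition)
print(np_array * 2)   # [2 4 6]             (element-wise arithmetic)
print(py_list + [4])  # [1, 2, 3, 4]        (concatenation)
print(np_array + 4)   # [5 6 7]             (broadcast addition)

# A list can mix types; an array converts its elements to one dtype
mixed_list = [1, 2.5, "three"]
upcast = np.array([1, 2.5, 3])
print(upcast.dtype)   # float64

# Arrays have a fixed size: np.append returns a new array
grown = np.append(np_array, 4)
print(np_array.size, grown.size)  # 3 4
```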

7. What is a heatmap, and when should it be used?
-A heatmap is a graphical representation of data where individual values are represented as colors. It is particularly useful for visualizing matrix data, where the color intensity represents the magnitude of the values.

Using Seaborn to Create a Heatmap

Seaborn is a powerful Python library for data visualization based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The seaborn.heatmap() function is used to create heatmaps.

Basic Heatmap

To create a basic heatmap, you need a 2D dataset. Here is an example using random data:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate random data
data = np.random.randint(low=1, high=100, size=(10, 10))

# Create a heatmap
sns.heatmap(data)
plt.show()
Customizing the Heatmap

You can customize various aspects of the heatmap, such as the colormap, annotations, and colorbar.

Colormap

You can change the colormap using the cmap parameter:

sns.heatmap(data, cmap="YlGnBu")
plt.show()
Annotations

To display the data values in each cell, use the annot parameter:

sns.heatmap(data, annot=True, fmt="d")
plt.show()
Colorbar

You can hide the colorbar by setting the cbar parameter to False:

sns.heatmap(data, cbar=False)
plt.show()
Using Plotly to Create a Heatmap

Plotly is another powerful library for creating interactive visualizations. The plotly.express.imshow() function can be used to create heatmaps.

Basic Heatmap

Here is an example of creating a basic heatmap using Plotly:

import numpy as np
import plotly.express as px

# Generate random data
data = np.random.randint(low=1, high=100, size=(10, 10))

# Create a heatmap
fig = px.imshow(data)
fig.show()
Customizing the Heatmap

You can customize the heatmap by adding text annotations and adjusting the aspect ratio.

Text Annotations

To add text annotations, use the text_auto parameter:

fig = px.imshow(data, text_auto=True)
fig.show()
Aspect Ratio

To adjust the aspect ratio, use the aspect parameter:

fig = px.imshow(data, aspect="auto")
fig.show()
Conclusion

Heatmaps are a versatile tool for visualizing matrix data. Both Seaborn and Plotly provide powerful functions to create and customize heatmaps. Seaborn is great for static visualizations, while Plotly offers interactive capabilities. Choose the library that best fits your needs and start visualizing your data effectively.
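
A common situation where a heatmap fits well is a correlation matrix, where every cell is a number on the same scale. The sketch below is one way to do this with Seaborn, assuming synthetic data; the column names and the random seed are made up for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + rng.normal(scale=0.5, size=200),  # correlated with x
    "independent": rng.normal(size=200),                  # unrelated to x
})

corr = df.corr()  # 3x3 matrix of pairwise Pearson correlations

# Fix the colour scale to [-1, 1] so that zero correlation sits at the midpoint
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
# Call matplotlib.pyplot.show() or ax.figure.savefig(...) to view the result
```

A diverging colormap with symmetric limits is a reasonable choice here because correlations can be negative or positive.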

8. What does the term “vectorized operation” mean in NumPy?
-In NumPy, a vectorized operation refers to performing element-wise computations directly on entire arrays without using explicit loops. This makes NumPy operations significantly faster and more efficient than traditional Python loops because they utilize optimized low-level implementations.

For example, instead of looping through arrays to perform operations, NumPy allows you to apply operations directly:

import numpy as np

# Creating NumPy arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Vectorized addition
c = a + b  # No need for explicit looping

print(c)  # Output: [5 7 9]
Since NumPy operations are highly optimized, they run much faster than equivalent loop-based implementations. This is one of the reasons NumPy is widely used in scientific computing and machine learning.
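
To make the contrast with loops concrete, the sketch below writes the same conversion both ways and also shows that a condition can be vectorized with np.where. The temperature values and the 25-degree cutoff are made up for illustration.

```python
import numpy as np

temps_c = np.array([-5.0, 0.0, 12.5, 30.0, 41.0])  # made-up sample data

# Loop version: one Python-level operation per element
fahrenheit_loop = []
for t in temps_c:
    fahrenheit_loop.append(t * 9 / 5 + 32)

# Vectorized version: arithmetic on the whole array at once
fahrenheit_vec = temps_c * 9 / 5 + 32

# Conditions are vectorized too: a boolean mask replaces an if statement
labels = np.where(temps_c > 25, "hot", "mild")

print(fahrenheit_vec)
print(labels)
```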

9. How does Matplotlib differ from Plotly?
- Matplotlib and Plotly are both powerful visualization libraries in Python, but they have distinct features and use cases.

Key Differences
Interactivity:

Matplotlib: Primarily used for static plots. Interactive backends provide basic pan and zoom in the figure window, but it is not as dynamic as Plotly.

Plotly: Designed for interactive plots with features like hover effects, zooming, and clickable elements. Ideal for web-based applications and dashboards.

Ease of Use:

Matplotlib: Requires more manual formatting but gives deep control over plot elements. Often used in scientific computing and academic research.

Plotly: More user-friendly with built-in themes and automatic styling.
Customization:

Matplotlib: Offers fine-grained control but requires additional work for complex customizations.

Plotly: Provides automatic layouts and formatting, making complex visualizations easier.

Performance & Data Handling:

Matplotlib: Works well for static visualizations; rendering can slow down with very large numbers of points.

Plotly: Renders in the browser and integrates well with Dash for dashboards; very large datasets may need WebGL-based traces or downsampling to stay responsive.

Use Cases:

Matplotlib: Best for academic plots, research papers, and non-interactive visualizations.
Plotly: Great for data dashboards, web apps, and interactive presentations.

10. What is the significance of hierarchical indexing in Pandas?
-Hierarchical indexing, also known as multi-level indexing, is a powerful feature in Pandas that allows you to work with data in a multi-dimensional way within a two-dimensional DataFrame or Series. It provides a way to handle and analyze data with multiple levels of indexing, making it easier to organize, filter, and manipulate complex datasets.

Significance of Hierarchical Indexing in Pandas
Organizing Complex Data:

Hierarchical indexing allows you to represent data with multiple dimensions (e.g., rows and columns) in a compact and structured way.
For example, you can group data by categories and subcategories, making it easier to analyze relationships between them.
Efficient Data Selection:

It enables you to perform slicing and subsetting operations on multiple levels of the index.
You can access specific rows or columns using tuples or by specifying levels, which simplifies working with large datasets.
Facilitates Grouping and Aggregation:

Hierarchical indexing works seamlessly with grouping operations (groupby) and allows for easy aggregation of data at different levels.
For example, you can calculate statistics for each group or subgroup in a dataset.
Improved Data Representation:

It makes the data more readable and intuitive by organizing it hierarchically.
For example, sales data can be indexed by region, then by city, and then by product category.
Flexibility in Reshaping Data:

Hierarchical indexing is essential for reshaping operations like stack() and unstack(), which allow you to pivot data between wide and long formats.
This is particularly useful for preparing data for visualization or analysis.
Handling Missing Data:

It provides better control over missing data by allowing you to align data at multiple levels of the index.
Example of Hierarchical Indexing

import pandas as pd
import numpy as np

# Creating a DataFrame with hierarchical indexing
data = pd.DataFrame(
    np.random.randn(6, 2),
    index=[['Region1', 'Region1', 'Region2', 'Region2', 'Region3', 'Region3'],
           ['CityA', 'CityB', 'CityA', 'CityB', 'CityA', 'CityB']],
    columns=['Metric1', 'Metric2']
)

# Setting hierarchical index
data.index.names = ['Region', 'City']

print(data)
Output (the values shown are illustrative; np.random.randn produces different numbers on each run):


                 Metric1   Metric2
Region  City                     
Region1 CityA  0.123456  1.234567
        CityB -0.987654  0.876543
Region2 CityA  0.456789 -1.234567
        CityB -0.654321  0.543210
Region3 CityA  1.111111 -0.222222
        CityB -0.333333  0.444444
Key Operations:
Accessing Data: data.loc['Region1'] or data.loc[('Region1', 'CityA')]
Slicing: data.loc['Region1':'Region2']
Unstacking: data.unstack(level='City')
Stacking: data.stack()
Hierarchical indexing is a cornerstone of Pandas' flexibility, enabling you to work with complex datasets in a clean and efficient manner.
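
The key operations listed above can be tried on a small frame with fixed values, which makes the results easy to verify by hand. The "Sales" column and its numbers are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Region1", "Region2"], ["CityA", "CityB"]], names=["Region", "City"]
)
sales = pd.DataFrame({"Sales": [100, 150, 200, 250]}, index=index)

region1 = sales.loc["Region1"]                       # all cities in one region
one_cell = sales.loc[("Region2", "CityB"), "Sales"]  # a single value: 250
city_a = sales.xs("CityA", level="City")             # cross-section on the inner level
per_region = sales.groupby(level="Region").sum()     # aggregate over the outer level
wide = sales["Sales"].unstack(level="City")          # cities become columns

print(per_region)
print(wide)
```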

11. What is the role of Seaborn’s pairplot() function?
-Pair plots are a powerful tool for visualizing the relationships between multiple variables in a dataset. In Python, the Seaborn library provides a convenient function pairplot to create a grid of scatter plots that compare each variable in your dataset against all others. Additionally, it generates histograms or Kernel Density Estimates (KDEs) along the diagonal to show the distribution of each variable.

Code Example

Here's a basic example of how to use pairplot in Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# Load an example dataset
penguins = sns.load_dataset("penguins")

# Create a pair plot
sns.pairplot(penguins)

# Display the plot
plt.show()
Customization and Additional Features

The pairplot function is highly customizable. You can color the points using a categorical variable with the hue parameter, change the kind of plot for both the grid and the diagonal with kind and diag_kind, and control the size and aspect ratio of each subplot with height and aspect.

For example, to create a pair plot with different markers for each species in the penguins dataset, you could do the following:

sns.pairplot(penguins, hue="species", markers=["o", "s", "D"])
If you want to focus on specific variables, use the vars, x_vars, and y_vars parameters to select them. To create a corner plot, which only includes the lower triangle of the grid, set corner=True.

Advanced Customization

For more advanced customization, pairplot returns a PairGrid object, which can be further modified. For instance, you can map additional functions to different parts of the grid to overlay different kinds of plots:

g = sns.pairplot(penguins, diag_kind="kde")
g.map_lower(sns.kdeplot, levels=4, color=".2")
This will add contour lines to the lower triangle of the grid, enhancing the visual representation of density.

Considerations

When using pairplot, note that it can be computationally intensive for large datasets, since the number of subplots grows with the square of the number of variables. By default only numeric columns are plotted; non-numeric columns are ignored unless you use one as hue, so pass vars explicitly if you want to control which variables appear.

Seaborn's pairplot is a high-level interface for PairGrid. If you require more flexibility than pairplot provides, consider using PairGrid directly. This allows for more granular control over the types of plots displayed and their properties.

In summary, pairplot is a versatile function that can quickly give you a comprehensive overview of the pairwise relationships within your dataset, with the flexibility to tailor the output to your specific analysis needs.

12. What is the purpose of the describe() function in Pandas?
-Pandas is a powerful and flexible open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. The primary data structures in Pandas are Series and DataFrame.

Key Features of Pandas

Data Structures

Series: A one-dimensional labeled array capable of holding any data type.

DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a table in a database or an Excel spreadsheet.

DataFrame.describe() Method

The describe() method in Pandas is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Syntax

DataFrame.describe(percentiles=None, include=None, exclude=None)
Parameters

percentiles: List-like of numbers between 0 and 1 to include in the output. Default is [.25, .5, .75].

include: List-like of dtypes or 'all' to include in the result. Default is None.

exclude: List-like of dtypes to exclude from the result. Default is None.

Returns

Series or DataFrame: Summary statistics of the Series or DataFrame provided.

Examples

Numeric Series

import pandas as pd
s = pd.Series([1, 2, 3])
print(s.describe())
Output:

count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
Categorical Series

s = pd.Series(['a', 'a', 'b', 'c'])
print(s.describe())
Output:

count     4
unique    3
top       a
freq      2
dtype: object
DataFrame

df = pd.DataFrame({
'categorical': pd.Categorical(['d', 'e', 'f']),
'numeric': [1, 2, 3],
'object': ['a', 'b', 'c']
})
print(df.describe())
Output:

       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Important Considerations

For numeric data, the result includes count, mean, std, min, max, and percentiles.

For object data, the result includes count, unique, top, and freq.

The include and exclude parameters can be used to limit which columns are analyzed.

Pandas is an essential tool for data analysis in Python, offering robust data manipulation capabilities and a wide range of functionalities to handle various data types and operations efficiently.
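
The include, exclude, and percentiles parameters described above can be sketched on the same three-column DataFrame. The choice of the 10th and 90th percentiles is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "categorical": pd.Categorical(["d", "e", "f"]),
    "numeric": [1, 2, 3],
    "object": ["a", "b", "c"],
})

everything = df.describe(include="all")         # numeric and non-numeric columns together
non_numeric = df.describe(exclude=[np.number])  # only the non-numeric columns
custom = df["numeric"].describe(percentiles=[0.1, 0.9])  # custom percentiles

print(everything)
print(non_numeric)
print(custom)
```

With include="all", statistics that do not apply to a column (for example mean for a text column) appear as NaN.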

13. Why is handling missing data important in Pandas?
-Missing data is a common issue in real-world datasets and can significantly impact data analysis and machine learning models. In Pandas, missing data is represented by None or NaN (Not a Number). Pandas provides several functions to detect, remove, and replace missing values.

Detecting Missing Values

To identify missing values in a DataFrame, you can use the isnull() and notnull() functions. These functions return a DataFrame of Boolean values indicating the presence of missing values.

import pandas as pd
import numpy as np

# Sample DataFrame
data = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())
Removing Missing Values

You can remove rows or columns with missing values using the dropna() function. This function provides flexibility to drop rows or columns based on the presence of missing values.

# Drop rows with at least one missing value
df_cleaned = df.dropna()
print(df_cleaned)

# Drop columns with at least one missing value
df_cleaned = df.dropna(axis=1)
print(df_cleaned)
Filling Missing Values

Instead of removing missing values, you can fill them using the fillna() function. This function allows you to replace missing values with a specified value, such as the mean, median, or mode of the column.

# Fill missing values with a specified value
df_filled = df.fillna(0)
print(df_filled)

# Fill missing values with the mean of the column
df['First Score'] = df['First Score'].fillna(df['First Score'].mean())
print(df)
Interpolating Missing Values

For numerical data, you can use the interpolate() function to estimate missing values using various interpolation methods.

# Interpolate missing values using linear method
df_interpolated = df.interpolate(method='linear')
print(df_interpolated)
Replacing Missing Values

The replace() function can be used to swap missing values for a specified value. To fill gaps from another DataFrame with matching labels, pass that DataFrame to fillna().

# Replace missing values with a specified value
df_replaced = df.replace(np.nan, -99)
print(df_replaced)

# Fill missing values from another DataFrame with the same index and columns
data2 = {'First Score': [10, 20, 30, 40],
'Second Score': [10, 20, 30, 40],
'Third Score': [10, 20, 30, 40]}
df2 = pd.DataFrame(data2)
df_filled = df.fillna(df2)
print(df_filled)
Important Considerations

Handling missing data is crucial for accurate data analysis and modeling. Depending on the context and the nature of the data, you can choose to remove, fill, or interpolate missing values. Each method has its advantages and limitations, and the choice should be based on the specific requirements of your analysis.
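
Before choosing a strategy it usually helps to measure how much is missing. The sketch below counts missing values per column and then applies two of the options discussed above; the thresh value of 3 is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

missing_per_column = df.isnull().sum()  # count of NaN in each column
missing_share = df.isnull().mean()      # fraction of NaN in each column

# Keep only rows with at least three non-missing values
mostly_complete = df.dropna(thresh=3)

# Fill each column with its own median
filled = df.fillna(df.median())

print(missing_per_column)
print(mostly_complete)
print(filled)
```

Here each column has exactly one missing value, only the row at index 1 is fully populated, and the median fill leaves no NaN behind.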

14. What are the benefits of using Plotly for data visualization?
-Interactive data visualization allows users to explore and understand data more effectively by providing dynamic and engaging visual representations. Python offers several powerful libraries for creating interactive visualizations, including Bokeh and Plotly.

Bokeh

Bokeh is a Python library for creating interactive visualizations that can be rendered in web browsers using HTML and JavaScript. It is particularly useful for building web-based dashboards and applications. Here are the steps to create a visualization with Bokeh:

Prepare the Data: Use libraries like Pandas and Numpy to handle and transform your data.

Determine Where the Visualization Will Be Rendered: You can generate a static HTML file or render the visualization inline in a Jupyter Notebook.

Set up the Figure: Customize the figure, including titles, tick marks, and tools for user interactions.

Connect to and Draw Your Data: Use Bokeh's renderers to draw your data with various markers and shapes.

Organize the Layout: Arrange multiple figures in a grid or tabbed layout.

Preview and Save: View your visualization in a browser or notebook and save it to an image file if desired.

Here is an example of creating a simple scatter plot with Bokeh:

from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook

# Prepare the data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# Output to a static HTML file
output_file("scatter.html")

# Create a new plot
p = figure(title="Simple Scatter Plot", x_axis_label='X', y_axis_label='Y')

# Add a scatter renderer with circle markers
# (p.circle with a size argument is deprecated in recent Bokeh releases)
p.scatter(x, y, size=10, color="navy", alpha=0.5)

# Show the results
show(p)
Plotly

Plotly is another powerful library for creating interactive visualizations in Python. It offers a variety of graph types, such as line charts, scatter plots, bar charts, histograms, and more. Plotly visualizations are highly interactive, allowing users to zoom in, hover for data insights, and customize the appearance.

To get started with Plotly, you need to install it using the following command:

pip install plotly
Here is an example of creating a line chart with Plotly:

import plotly.express as px

# Sample data
df = px.data.iris()

# Create a line chart
fig = px.line(df, x='sepal_width', y='sepal_length', title='Sepal Width vs Length')

# Show the plot
fig.show()
Adding Interactivity

Both Bokeh and Plotly provide various ways to add interactivity to your visualizations. For example, you can add hover actions, selection tools, and linked axes to enhance the user experience.

Bokeh Example: Adding Hover Tool

from bokeh.models import HoverTool

# Add hover tool
hover = HoverTool()
hover.tooltips = [("X", "@x"), ("Y", "@y")]
p.add_tools(hover)

# Show the results
show(p)
Plotly Example: Adding Dropdown Menu

import plotly.graph_objects as go

# Create a scatter plot
fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 5, 6], mode='markers')])

# Add dropdown menu
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["type", "scatter"],
                    label="Scatter Plot",
                    method="restyle"
                ),
                dict(
                    args=["type", "bar"],
                    label="Bar Chart",
                    method="restyle"
                )
            ]),
            direction="down"
        )
    ]
)

# Show the plot
fig.show()
By leveraging these libraries, you can create interactive and visually appealing data visualizations that help users explore and understand complex datasets.

15. How does NumPy handle multidimensional arrays?
-NumPy is a powerful library in Python that excels at handling multidimensional arrays, also known as ndarrays. Here's a concise overview of how it manages them:

1. Creation of Multidimensional Arrays
NumPy allows you to create arrays of any dimension using functions like numpy.array(), numpy.zeros(), and numpy.ones(), or the generators in the numpy.random module (for example numpy.random.rand()).
Example:
import numpy as np
array_2d = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array
array_3d = np.ones((2, 3, 4))  # 3D array filled with ones

2. Efficient Storage and Operations
NumPy arrays are stored in contiguous memory blocks, making operations like slicing, indexing, and mathematical computations highly efficient.
Operations are applied element-wise, and broadcasting allows operations between arrays of different shapes.
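As a minimal sketch of the two points above, the snippet below applies arithmetic element by element and inspects the memory layout flag. The array names and values are illustrative assumptions. Note that contiguity is the default for freshly created arrays, while views such as a transpose can have a different layout.

```python
import numpy as np

# Two same-shaped arrays (illustrative values)
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 20, 30], [40, 50, 60]])

# Arithmetic operators act element by element
elementwise_sum = a + b
elementwise_product = a * b

# A freshly created array is C-contiguous (row-major) by default
is_contiguous = bool(a.flags['C_CONTIGUOUS'])

# A transposed view of a 2x3 array reuses the same buffer but is not C-contiguous
transpose_is_contiguous = bool(a.T.flags['C_CONTIGUOUS'])

print(elementwise_sum)
print(elementwise_product)
print(is_contiguous, transpose_is_contiguous)
```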
3. Shape and Dimensions
The .shape attribute provides the dimensions of the array, while .ndim gives the number of dimensions.
Example:
print(array_2d.shape)  # Output: (2, 3)
print(array_2d.ndim)   # Output: 2

4. Indexing and Slicing
You can access elements using indices, and slicing works seamlessly across multiple dimensions.
Example:
print(array_2d[1, 2])  # Access element at row 1, column 2
print(array_2d[:, 1])  # Slice all rows, column 1

5. Reshaping and Transposing
Arrays can be reshaped using .reshape() and transposed using .T.
Example:
reshaped = array_2d.reshape(3, 2)  # Reshape to 3x2
transposed = array_2d.T            # Transpose the array

6. Broadcasting
NumPy supports broadcasting, allowing operations between arrays of different shapes by automatically expanding dimensions.
Example:
array = np.array([[1, 2, 3], [4, 5, 6]])
result = array + np.array([10, 20, 30])  # Adds [10, 20, 30] to every row: [[11, 22, 33], [14, 25, 36]]

7. Advanced Features
Axis Operations: Functions like sum(), mean(), etc., can operate along specific axes.
print(array_2d.sum(axis=0))  # Sum down each column: [5 7 9]
print(array_2d.sum(axis=1))  # Sum across each row: [ 6 15]

Masking and Boolean Indexing: You can filter elements using conditions.
print(array_2d[array_2d > 3])  # Elements greater than 3

Why NumPy for Multidimensional Arrays?
Performance: Faster than Python lists due to optimized C-based implementation.
Flexibility: Supports a wide range of operations, from basic arithmetic to complex linear algebra.
Scalability: Handles large datasets efficiently.

NumPy's multidimensional array handling is a cornerstone of scientific computing in Python, making it indispensable for tasks like data analysis, machine learning, and numerical simulations.

16. What is the role of Bokeh in data visualization?
-Bokeh is a powerful Python library for creating interactive visualizations that are web-friendly and scalable. Its primary role in data visualization includes:

Key Features & Role
Interactive Plots – Bokeh allows users to build highly interactive visualizations with tools like zooming, panning, and hover effects.

Web-Ready – Unlike static libraries like Matplotlib, Bokeh generates plots as JavaScript-enabled HTML documents, making it perfect for web applications.

Scalability – It can handle large datasets efficiently and integrate with databases and streaming data sources.

Customizable Dashboards – Bokeh works seamlessly with Flask and Django, making it great for building web-based dashboards.

Supports Various Plot Types – From simple scatter plots to complex network graphs, Bokeh can generate dynamic visuals.

from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()  # Enables inline display in Jupyter

# Create figure
p = figure(title="Simple Bokeh Plot", x_axis_label="X", y_axis_label="Y")

# Add line
p.line([1, 2, 3, 4], [10, 20, 30, 40], line_width=2)

# Show plot
show(p)


This will generate an interactive plot directly in your notebook or as an HTML file.

17. Explain the difference between apply() and map() in Pandas?
-In Pandas, both apply() and map() are used to apply functions to data, but they differ in their scope and use cases. Here's a clear breakdown:

1. map()
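Scope: Works on a Series (one-dimensional data). Note that pandas 2.1 added a separate DataFrame.map() for element-wise operations on whole DataFrames; this answer covers Series.map().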
Functionality: Applies a function element-wise to each value in the Series.
Use Case: Best for simple transformations or mappings on a single column or Series.
Input: Can take a function, dictionary, or Series as an argument.

Example:

import pandas as pd

# Sample Series
s = pd.Series([1, 2, 3, 4])

# Using map to square each element
result = s.map(lambda x: x**2)
print(result)


Output:

0     1
1     4
2     9
3    16
dtype: int64
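The dictionary input listed for map() can be sketched as follows. The Series contents and the lookup table are hypothetical; the point is that each value is used as a key, and values with no matching key become NaN (which also makes the result a float Series).

```python
import pandas as pd

# Hypothetical letter grades and a lookup table that deliberately omits 'c'
grades = pd.Series(['a', 'b', 'c', 'a'])
points = {'a': 4, 'b': 3}

# Each value is looked up in the dictionary; missing keys become NaN
mapped = grades.map(points)

print(mapped)
```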

2. apply()
Scope: Works on both Series and DataFrames (one-dimensional or two-dimensional data).
Functionality: Applies a function along an axis (rows or columns) for DataFrames or element-wise for Series.
Use Case: More versatile; used for complex operations, row/column-wise transformations, or custom logic.
Input: Can take any callable function.

Example with Series:

# Using apply to square each element in a Series
result = s.apply(lambda x: x**2)
print(result)


Example with DataFrame:

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Using apply to sum values row-wise
result = df.apply(lambda row: row.sum(), axis=1)
print(result)


Output:

0     5
1     7
2     9
dtype: int64

Key Differences:
Feature	map()	apply()
Scope	Series only	Series and DataFrame
Axis Control	Not applicable	Can specify axis (rows/columns)
Complexity	Simple element-wise operations	Complex row/column-wise operations
Input	Function, dictionary, or Series	Callable function only

In summary, use map() for straightforward element-wise operations on a Series, and apply() for more flexible and complex transformations, especially when working with DataFrames.

18. What are some advanced features of NumPy?
-NumPy is a powerful Python library used for numerical computing, particularly with arrays. One of its essential functions is numpy.mean, which computes the arithmetic mean of array elements along a specified axis.

Definition and Usage

The numpy.mean function calculates the average of the array elements. By default, it computes the mean of the flattened array, but you can specify an axis to compute the mean along that axis. The function signature is as follows:

numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)
Parameters:

a: Array-like object containing numbers whose mean is desired.

axis: Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.

dtype: Data type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.

out: Alternate output array in which to place the result. It must have the same shape as the expected output.

keepdims: If set to True, the axes which are reduced are left in the result as dimensions with size one.

where: Elements to include in the mean.

Returns:

m: An ndarray containing the mean values.

Examples

Here are some examples to illustrate the usage of numpy.mean:

import numpy as np

# Example 1: Compute the mean of a 2D array
a = np.array([[1, 2], [3, 4]])
print(np.mean(a)) # Output: 2.5

# Example 2: Compute the mean along the first axis (axis=0), giving one mean per column
print(np.mean(a, axis=0)) # Output: [2. 3.]

# Example 3: Compute the mean along the second axis (axis=1), giving one mean per row
print(np.mean(a, axis=1)) # Output: [1.5 3.5]

# Example 4: Compute the mean with a specified dtype
b = np.zeros((2, 512*512), dtype=np.float32)
b[0, :] = 1.0
b[1, :] = 0.1
print(np.mean(b)) # Output: 0.54999924 (inaccurate due to float32)
print(np.mean(b, dtype=np.float64)) # Output: 0.55000000074505806 (more accurate)
Important Considerations

Precision: For floating-point input, the mean is computed using the same precision as the input. This can cause inaccuracies, especially for float32. Specifying a higher-precision accumulator using the dtype keyword can alleviate this issue.

Performance: NumPy arrays are stored in contiguous memory locations, making them faster to process compared to Python lists. This is particularly beneficial in data science and scientific computing where large datasets are common.

By understanding and utilizing numpy.mean, you can efficiently compute the average values of arrays, which is a fundamental operation in many numerical and data analysis tasks.
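The keepdims and where parameters described above are not covered by the examples, so here is a minimal sketch of both. The mask values are an illustrative assumption, and the where argument is only available in newer NumPy releases (1.20 or later).

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# keepdims=True leaves the reduced axis in place with size one, giving shape (2, 1)
row_means = np.mean(a, axis=1, keepdims=True)

# Because the shape is (2, 1), the means broadcast back against the original array
centered = a - row_means

# where= restricts the mean to the elements selected by a boolean mask
mask = np.array([[True, False], [True, True]])  # illustrative mask: skip the value 2
masked_mean = np.mean(a, where=mask)            # (1 + 3 + 4) / 3

print(row_means.shape)
print(centered)
print(masked_mean)
```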

19. How does Pandas simplify time series analysis?
-Pandas is a powerful library in Python that significantly simplifies time series analysis by providing intuitive and efficient tools for handling, analyzing, and visualizing time-based data. Here's how it helps:

1. Date and Time Handling
Datetime Indexing: Pandas allows you to set a DatetimeIndex for your data, enabling easy slicing and filtering based on dates or time ranges.
Datetime Conversion: It can convert strings or other formats into datetime objects using pd.to_datetime(), making it easier to work with inconsistent date formats.
Date Components: You can easily extract components like year, month, day, hour, etc., using attributes like .dt.year, .dt.month, etc.
2. Resampling and Aggregation
Resampling: Pandas provides the .resample() method to aggregate data into different time frequencies (e.g., daily, monthly, yearly). For example, converting hourly data to daily averages is straightforward.
Rolling and Expanding Windows: With .rolling() and .expanding(), you can calculate moving averages, cumulative sums, or other window-based statistics.
3. Time Zone Support
Pandas supports time zone-aware operations, allowing you to localize timestamps to specific time zones and convert between them seamlessly.
4. Missing Data Handling
Time series often have missing data. Pandas provides methods like .fillna() and .interpolate() to handle gaps effectively.
5. Shifting and Lagging
You can use .shift() to create lagged versions of your data, which is useful for calculating changes over time or creating lagged features for modeling.
6. Powerful Plotting
Pandas integrates well with Matplotlib, enabling quick and easy visualization of time series data with .plot().
7. Built-in Statistical Functions
Pandas offers built-in methods for descriptive statistics, correlation, and other analyses, making it easier to explore time series trends and patterns.
Example:
import pandas as pd

# Create a time series
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = [10, 12, 15, None, 18, 20, 22, 25, 28, 30]
time_series = pd.Series(data, index=date_range)

# Resample to calculate weekly mean
weekly_mean = time_series.resample('W').mean()

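# Fill missing values by carrying the last valid observation forward
# (.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
time_series_filled = time_series.ffill()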

print("Original Time Series:")
print(time_series)
print("\nWeekly Mean:")
print(weekly_mean)
print("\nFilled Time Series:")
print(time_series_filled)
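Several of the features listed above (interpolation, rolling windows, shifting, and date-based slicing) can be sketched on a small series as well. The values and variable names here are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Six daily observations with one gap (illustrative values)
idx = pd.date_range(start='2023-01-01', periods=6, freq='D')
ts = pd.Series([10.0, 12.0, None, 18.0, 20.0, 22.0], index=idx)

# Linear interpolation fills the gap between 12 and 18 with 15
filled = ts.interpolate()

# 3-day moving average; the first two entries are NaN because the window is incomplete
rolling_mean = filled.rolling(window=3).mean()

# Day-over-day change, built from a one-step lag
daily_change = filled - filled.shift(1)

# Label-based slicing on a DatetimeIndex includes both endpoints
jan_2_to_4 = filled.loc['2023-01-02':'2023-01-04']

print(filled)
print(rolling_mean)
print(daily_change)
print(jan_2_to_4)
```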


By abstracting away much of the complexity, Pandas makes time series analysis more accessible and efficient, even for beginners.

20. What is the role of a pivot table in Pandas?
-A pivot table in Pandas is a powerful tool for data analysis and summarization. It allows you to reorganize and aggregate data in a flexible way, making it easier to extract meaningful insights. Here's a concise explanation of its role:

Role of a Pivot Table in Pandas

Data Summarization: It helps summarize large datasets by grouping data based on one or more keys (columns) and applying aggregation functions like sum, mean, count, etc.

Example: Summing up sales by region and product category.

Reshaping Data: Pivot tables allow you to reshape data into a more readable or analyzable format by creating a matrix-like structure with rows and columns.

Example: Converting long-format data into a wide-format table.

Custom Aggregations: You can apply custom aggregation functions to calculate metrics like averages, totals, or percentages for specific groups.

Multi-Level Indexing: Pivot tables support hierarchical indexing, enabling you to analyze data across multiple dimensions (e.g., grouping by both region and year).

Key Features
pivot_table() Function: The primary function in Pandas to create pivot tables.
Parameters:
index: Defines rows of the pivot table.
columns: Defines columns of the pivot table.
values: Specifies the data to aggregate.
aggfunc: Defines the aggregation function (e.g., sum, mean, count).
Example Code
import pandas as pd

# Sample dataset
data = {
    'Region': ['North', 'South', 'North', 'East', 'South'],
    'Product': ['A', 'B', 'A', 'C', 'B'],
    'Sales': [100, 200, 150, 300, 250]
}

df = pd.DataFrame(data)

# Creating a pivot table
pivot = pd.pivot_table(df,
                       index='Region',
                       columns='Product',
                       values='Sales',
                       aggfunc='sum',
                       fill_value=0)

print(pivot)

Output
Product      A    B    C
Region                    
East         0    0  300
North      250    0    0
South        0  450    0


This example demonstrates how pivot tables can transform raw data into a structured summary, making it easier to analyze trends and patterns.
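The multi-level indexing and custom aggregation points above can be sketched in the same way. The Year column and the sales_range helper are hypothetical additions for illustration only.

```python
import pandas as pd

# Same sales data as above, plus a hypothetical 'Year' column
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East', 'South'],
    'Year': [2022, 2022, 2023, 2023, 2023],
    'Product': ['A', 'B', 'A', 'C', 'B'],
    'Sales': [100, 200, 150, 300, 250],
})

# Two index keys produce a hierarchical (MultiIndex) row index;
# a list of aggregation names produces one column group per function
summary = pd.pivot_table(df,
                         index=['Region', 'Year'],
                         values='Sales',
                         aggfunc=['sum', 'mean'])

# Hypothetical custom aggregation: spread between largest and smallest sale
def sales_range(values):
    return values.max() - values.min()

range_by_region = pd.pivot_table(df,
                                 index='Region',
                                 values='Sales',
                                 aggfunc=sales_range)

print(summary)
print(range_by_region)
```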

21. Why is NumPy’s array slicing faster than Python’s list slicing?
-NumPy is a fundamental package for scientific computing in Python, providing efficient operations on large arrays and matrices. It is significantly faster than Python lists for several reasons:

Homogeneous Data and Contiguous Memory

NumPy arrays store elements of the same data type, making them more compact and memory-efficient than lists, which can hold elements of varying data types. This homogeneity allows NumPy to store elements in contiguous memory locations, reducing memory fragmentation and enabling faster access.

Vectorized Operations

NumPy supports vectorized operations, which means that operations are applied element-wise to entire arrays without the need for explicit Python loops; the looping happens inside compiled code. Broadcasting extends this to arrays of different but compatible shapes. For example:

import numpy as np

# Create two NumPy arrays
array1 = np.arange(1000000)
array2 = np.arange(1000000)

# Perform element-wise multiplication
resultantArray = array1 * array2
In contrast, performing the same operation with Python lists requires explicit loops, which are slower due to Python's interpretation overhead.
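For slicing specifically, one further reason is worth sketching: basic slicing of a NumPy array returns a view onto the same memory, whereas slicing a list builds a new list and copies the selected references into it. The variable names below are illustrative assumptions.

```python
import numpy as np

numbers = list(range(10))
arr = np.arange(10)

# List slicing creates a new list holding copies of the selected references
list_slice = numbers[2:5]

# Basic array slicing creates a view: no element data is copied
array_slice = arr[2:5]

# Writing through each slice shows the difference
list_slice[0] = -1   # the original list is unaffected
array_slice[0] = -1  # the original array changes, because memory is shared

print(numbers[2])                          # 2
print(arr[2])                              # -1
print(np.shares_memory(arr, array_slice))  # True
```

Because a view only records a new offset, shape, and strides, its cost does not grow with the length of the slice, while a list slice grows linearly with the number of elements copied.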

22. What are some common use cases for Seaborn?
-Seaborn is a powerful Python library for data visualization built on top of Matplotlib. It is widely used for creating informative and aesthetically pleasing statistical graphics. Here are some common use cases for Seaborn:

1. Exploratory Data Analysis (EDA)
Visualizing distributions: Use plots like histplot, kdeplot, or displot to understand the distribution of a single variable (the older distplot function is deprecated).
Comparing distributions: Use boxplot, violinplot, or stripplot to compare distributions across categories.
Pairwise relationships: Use pairplot or jointplot to explore relationships between multiple variables.
2. Correlation and Relationships
Heatmaps: Use heatmap to visualize correlation matrices or other tabular data.
Scatter plots: Use scatterplot or relplot to examine relationships between two continuous variables.
Regression analysis: Use regplot or lmplot to visualize linear regression and trends.
3. Categorical Data Visualization
Bar plots: Use barplot to show aggregated values (e.g., mean, median) for categories.
Count plots: Use countplot to display the frequency of categories.
Swarm and strip plots: Use swarmplot or stripplot for detailed visualization of individual data points within categories.
4. Time Series Analysis
Use lineplot to visualize trends over time, such as stock prices, weather data, or sales trends.
5. Customizing and Enhancing Visualizations
Themes: Apply built-in themes like darkgrid, whitegrid, or ticks to improve aesthetics.
Faceted plots: Use FacetGrid or catplot to create multi-panel plots for subgroup analysis.
Color palettes: Use Seaborn's color palettes (e.g., coolwarm, viridis) for visually appealing plots.
6. Statistical Analysis
Confidence intervals: Automatically include confidence intervals in plots like lineplot or barplot.
Distribution fitting: Use kdeplot to fit and visualize kernel density estimates.

Seaborn simplifies complex visualizations and integrates seamlessly with Pandas, making it an essential tool for data scientists and analysts.

PRACTICAL----

In [None]:
1. How do you create a 2D NumPy array and calculate the sum of each row?
-import numpy as np

# Creating a 2D NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Summing each row (axis=1)
row_sums = np.sum(arr, axis=1)

print(row_sums)  # Output: [ 6 15 24]


In [None]:
2. Write a Pandas script to find the mean of a specific column in a DataFrame.
-import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data)

# Finding the mean of the 'Salary' column
mean_salary = df['Salary'].mean()

print(f"Mean Salary: {mean_salary}")


In [None]:
3. Create a scatter plot using Matplotlib?
-import matplotlib.pyplot as plt
import numpy as np

# Generating sample data
x = np.random.rand(50) * 10  # Random values for X-axis
y = np.random.rand(50) * 10  # Random values for Y-axis

# Creating the scatter plot
plt.scatter(x, y, color='blue', marker='o', alpha=0.7)

# Adding labels and title
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Simple Scatter Plot")

# Display the plot
plt.show()


In [None]:
4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
-import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [2, 3, 4, 5, 6],
        'C': [5, 8, 7, 6, 5],
        'D': [10, 9, 7, 6, 4]}

df = pd.DataFrame(data)

# Calculating correlation matrix
corr_matrix = df.corr()

# Plotting the heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()


In [None]:
5. Generate a bar plot using Plotly?
-import plotly.graph_objects as go

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 20, 15, 25, 30]

# Creating the bar plot
fig = go.Figure(data=[go.Bar(x=categories, y=values, marker_color='blue')])

# Adding title and labels
fig.update_layout(title="Simple Bar Plot with Plotly",
                  xaxis_title="Categories",
                  yaxis_title="Values")

# Display the plot
fig.show()


In [None]:
6. Create a DataFrame and add a new column based on an existing column?
-import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data)

# Adding a new column 'Tax' (10% of Salary)
df['Tax'] = df['Salary'] * 0.1

print(df)


In [None]:
7. Write a program to perform element-wise multiplication of two NumPy arrays?
-import numpy as np

# Creating two NumPy arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

# Using '*' operator
result1 = array1 * array2

# Using np.multiply()
result2 = np.multiply(array1, array2)

print("Result using '*':", result1)
print("Result using np.multiply():", result2)

======OUTPUT=========
Result using '*': [ 5 12 21 32]
Result using np.multiply(): [ 5 12 21 32]



In [None]:
8. Create a line plot with multiple lines using Matplotlib?
-import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]  # First line
y2 = [1, 3, 5, 7, 9]   # Second line
y3 = [3, 6, 9, 12, 15]  # Third line

# Plot multiple lines
plt.plot(x, y1, label="Line 1", color="blue", linestyle="--", marker="o")
plt.plot(x, y2, label="Line 2", color="red", linestyle="-.", marker="s")
plt.plot(x, y3, label="Line 3", color="green", linestyle=":", marker="^")

# Adding title and labels
plt.title("Multiple Line Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Display legend
plt.legend()

# Show the plot
plt.show()

In [None]:
9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold?
-import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data)

# Filtering rows where 'Salary' is greater than 60000
filtered_df = df[df['Salary'] > 60000]

print(filtered_df)
==========OUTPUT=========
      Name  Age  Salary
2  Charlie   35   70000
3    David   40   80000


In [None]:
10. Create a histogram using Seaborn to visualize a distribution?
-import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generating sample data (random normal distribution)
data = np.random.randn(1000)  # 1000 data points with a normal distribution

# Creating the histogram
plt.figure(figsize=(8, 5))
sns.histplot(data, bins=30, kde=True, color="blue")

# Adding labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram with Seaborn")

# Show the plot
plt.show()


In [None]:
11. Perform matrix multiplication using NumPy?
-import numpy as np

# Creating two matrices (2x2)
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Using '@' operator
result1 = A @ B

# Using np.matmul()
result2 = np.matmul(A, B)

print("Result using '@':\n", result1)
print("Result using np.matmul():\n", result2)


============OUTPUT=============
Result using '@':
 [[19 22]
 [43 50]]
Result using np.matmul():
 [[19 22]
 [43 50]]


In [None]:
12. Use Pandas to load a CSV file and display its first 5 rows?
-import pandas as pd

# Loading the CSV file (ensure the path is correct)
df = pd.read_csv("your_file.csv")

# Displaying the first 5 rows
print(df.head())
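Since "your_file.csv" is only a placeholder, the same steps can be tried without any file on disk by reading from an in-memory string. This is a sketch under that assumption; the column names and rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = """name,age,salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000
David,40,80000
Eve,45,90000
Frank,50,100000
"""

# read_csv accepts any file-like object, so StringIO stands in for a path
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default
first_five = df.head()

print(first_five)
```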


In [None]:
13. Create a 3D scatter plot using Plotly.
-import plotly.graph_objects as go
import numpy as np

# Generating sample data
x = np.random.rand(50) * 10
y = np.random.rand(50) * 10
z = np.random.rand(50) * 10

# Creating the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=x, y=y, z=z,
    mode='markers',
    marker=dict(size=5, color=z, colorscale='Viridis', opacity=0.8)
)])

# Adding title and labels
fig.update_layout(title="3D Scatter Plot",
                  scene=dict(xaxis_title="X-axis",
                             yaxis_title="Y-axis",
                             zaxis_title="Z-axis"))

# Display the plot
fig.show()
