# DATA TOOLKIT

THEORY:

1. What is NumPy, and why is it widely used in Python?
  - NumPy is a fundamental Python library for numerical computing, providing efficient tools for working with multi-dimensional arrays and matrices. It is widely used for its speed, memory efficiency, and ability to perform complex mathematical operations on large datasets.
2. How does broadcasting work in NumPy?
  - In NumPy, broadcasting is a mechanism that lets element-wise operations (addition, subtraction, multiplication, etc.) run on arrays of different shapes without explicitly copying data. Shapes are compared starting from the trailing dimension: two dimensions are compatible when they are equal or when one of them is 1, and a size-1 dimension is stretched to match the other.
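The compatibility rules can be seen in a small sketch. The arrays below are made-up sample data, and the variable names are only illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

# (2, 3) + (3,): the row is reused for every row of the matrix
added_row = matrix + row

# (2, 3) + (2, 1): the column is reused for every column of the matrix
added_column = matrix + column

print(added_row)
print(added_column)
```

Both results keep the shape (2, 3); neither `row` nor `column` is physically repeated in memory.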
3. What is a Pandas DataFrame?
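  - A Pandas DataFrame is a two-dimensional, labeled data structure for tabular data. It has a row index and named columns, and each column can hold a different data type.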
4. Explain the use of the groupby() method in Pandas.
  - The groupby() method in Pandas is used to split data into groups based on some criteria, and then apply operations to each group independently. It's a very powerful tool for data aggregation, transformation, and analysis.
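A minimal sketch of the split-apply-combine idea, assuming a small made-up `sales` table (the column names are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 150, 200, 50, 300],
})

# Split rows by region, then sum the amount within each group
total_by_region = sales.groupby("region")["amount"].sum()

# Several aggregates at once for each group
summary_by_region = sales.groupby("region")["amount"].agg(["mean", "count"])

print(total_by_region)
print(summary_by_region)
```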
5. Why is Seaborn preferred for statistical visualizations?
  - i) High-Level Interface for Drawing Attractive and Informative Graphics.
  - ii) Built-in Support for Statistical Plotting.
  - iii) Integrated with Pandas DataFrames.
  - iv) Automatic Estimation and Plotting of Statistical Aggregates.
  - v) Enhanced Aesthetics and Style Control.
  - vi) Facilitates Multivariate Visualization.
  - vii) Extensibility and Compatibility.
6. What are the differences between NumPy arrays and Python lists?
  - Performance and Speed :
  - i) Python Lists: Are slower because each element is a separate Python object with metadata.
  - ii) NumPy Arrays: Are faster and more memory-efficient due to contiguous memory allocation and lower-level optimizations (implemented in C).
  - Functionality :
  - i) Python Lists: Provide basic operations like appending, removing, and slicing.
  - ii) NumPy Arrays: Offer a wide range of mathematical and statistical operations (e.g., vectorized operations, broadcasting, matrix multiplication) that are not natively supported by lists.
7. What is a heatmap, and when should it be used?
  - A heatmap is a type of data visualization that uses color to represent the values in a matrix or 2D grid. Use it when you want to spot patterns, correlations, or outliers at a glance, for example when displaying a correlation matrix or comparing a value across two categorical dimensions. It is particularly useful for large matrices where reading individual numbers is impractical.
8. What does the term “vectorized operation” mean in NumPy?
  - A vectorized operation in NumPy refers to performing operations on entire arrays (vectors, matrices, etc.) without using explicit loops. These operations are applied element-wise and are implemented in highly optimized compiled code (C/C++), making them much faster and more efficient than standard Python loops.
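As a rough illustration, the same computation can be written with an explicit Python loop and as a single vectorized expression. The array here is sample data only.

```python
import numpy as np

values = np.arange(5)   # array([0, 1, 2, 3, 4])

# Explicit Python loop: one element at a time
squared_loop = np.array([v ** 2 for v in values])

# Vectorized: the whole array in one expression, looped in compiled code
squared_vectorized = values ** 2

print(squared_vectorized)
print(np.array_equal(squared_loop, squared_vectorized))
```

Both give the same result; the speed difference only becomes noticeable on large arrays.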
9. How does Matplotlib differ from Plotly?
  - Matplotlib:
  - i) Focuses on static 2D plots (line, bar, scatter, histogram, etc.).
  - ii) Basic 3D support via mpl_toolkits.mplot3d, but limited.
  - iii) Low-level and highly customizable, but can be more verbose and requires more code for complex plots.
  - Plotly:
  - i) Excels in interactive and web-based plots.
  - ii) Offers rich 3D visualization, animations, and interactivity (hover, zoom, drag).
  - iii) Higher-level syntax with easy-to-use functions for complex and beautiful visualizations quickly.
10. What is the significance of hierarchical indexing in Pandas?
   - Hierarchical indexing (also called MultiIndexing) in Pandas allows you to have multiple levels of index on rows or columns. It’s a way to organize and work with higher-dimensional data in a 2D DataFrame.
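A short sketch of a two-level row index, using made-up revenue figures (the level names `year` and `quarter` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Select everything under one outer-level key
revenue_2024 = revenue.loc["2024"]

# Take a cross-section on the inner level across all years
q1_all_years = revenue.xs("Q1", level="quarter")

print(revenue_2024)
print(q1_all_years)
```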
11. What is the role of Seaborn’s pairplot() function?
   - Seaborn's pairplot() function creates a grid of pairwise relationships in a dataset: each numeric variable is plotted against every other one (scatter plots by default), with the distribution of each variable shown on the diagonal.
12. What is the purpose of the describe() function in Pandas?
   - The describe() function generates descriptive statistics for a Pandas DataFrame or Series. For numeric data it returns the count, mean, standard deviation, minimum, the 25th, 50th (median), and 75th percentiles, and the maximum, giving a concise summary of central tendency and spread.
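A minimal sketch on a made-up Series, showing how individual statistics can be read from the result:

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40, 50])

# describe() returns a Series indexed by statistic name
summary = scores.describe()

print(summary)
print("Median:", summary["50%"])
```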
13. Why is handling missing data important in Pandas?
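  - Handling missing data in Pandas is crucial because missing values (NaN) can bias summary statistics, propagate through calculations, and cause errors in machine learning models that do not accept them. Pandas provides isna() to detect missing values, dropna() to remove them, and fillna() to replace them.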
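One possible workflow for detecting, dropping, and filling missing values is sketched below. The DataFrame is made-up sample data, and filling with the mean or a placeholder string is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 35],
    "city": ["Pune", "Delhi", None],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill each column with a value chosen for that column
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})

print(missing_per_column)
print(dropped)
print(filled)
```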
14. What are the benefits of using Plotly for data visualization?
   - Plotly produces interactive charts that make data easy to explore and understand: you can zoom in, hover over data points for more details, and click to drill into the data. Plotly uses JavaScript to handle interactivity, but you don't need to write any JavaScript when using it from Python.
15. How does NumPy handle multidimensional arrays?
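   - NumPy represents multidimensional data with the ndarray, an N-dimensional array whose shape attribute gives the size along each axis. To create one, pass a nested list of values to the np.array() function; elements are then accessed with one index per axis, and operations such as sum() can be applied along a chosen axis.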
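A small sketch of a 3D array, assuming sample values generated with `np.arange`:

```python
import numpy as np

# 24 values arranged as 2 blocks of 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)    # number of axes
print(cube.shape)   # size along each axis

# One index per axis selects a single element
last_element = cube[1, 2, 3]

# Summing along axis 0 collapses the two blocks into one (3, 4) array
block_sum = cube.sum(axis=0)

print(last_element)
print(block_sum.shape)
```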
16. What is the role of Bokeh in data visualization?
   - Bokeh is a Python library that enables users to create interactive visualizations for web browsers, ranging from basic plots to complex dashboards.
17. Explain the difference between apply() and map() in Pandas?
   - map() Function:
   - i) Series.map() works on a Pandas Series (i.e., a single column or row).
   - ii) It applies a function element-wise to each value in the Series.
   - iii) Mainly used for simple transformations like modifying values or mapping values from one form to another.
   - iv) It can take a function, dictionary, or Series as input for mapping.
   - apply() Function:
   - i) The apply() function works with both Series and DataFrames.
   - ii) For a Series, it behaves similarly to map(), applying a function to each element.
   - iii) For a DataFrame, apply() can be used to apply a function along either rows or columns (using the axis parameter).
   - iv) Suitable for more complex operations, such as applying a function to multiple columns at once.
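The difference can be sketched on a small made-up DataFrame; the column names and the grade-to-points dictionary are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "math": [80, 65, 90],
    "science": [70, 75, 85],
})

# map(): element-wise lookup on a single Series, here with a dictionary
grade_points = df["grade"].map({"a": 4, "b": 3})

# apply() with axis=1: the function receives one row at a time
row_total = df[["math", "science"]].apply(
    lambda row: row["math"] + row["science"], axis=1
)

# apply() with the default axis=0: the function receives one column at a time
value_range = df[["math", "science"]].apply(lambda col: col.max() - col.min())

print(grade_points)
print(row_total)
print(value_range)
```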
18. What are some advanced features of NumPy?
   - Broadcasting, array manipulation (including reshaping, stacking, and splitting), universal functions (ufuncs), fancy indexing, linear algebra operations, and random number generation.
19. How does Pandas simplify time series analysis?
   - The DatetimeIndex makes performing most operations in Pandas very simple, as it allows you to index, slice, and resample data based on the date and time.
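A brief sketch of date-based slicing and resampling, assuming a made-up daily series:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Slice by date strings; both endpoints are included
first_three = ts.loc["2024-01-01":"2024-01-03"]

# Resample the daily values into 2-day bins and sum each bin
two_day_sum = ts.resample("2D").sum()

print(first_three)
print(two_day_sum)
```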
20. What is the role of a pivot table in Pandas?
   - A pivot table summarizes a DataFrame by grouping it on one or more index columns and one or more column keys, then aggregating a values column for each combination (the mean by default). It turns long-format data into a compact grid that is easy to compare across categories.
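A minimal sketch with `pivot_table()`, using a made-up `sales` table and a sum as the aggregation (the default would be the mean):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 50, 300],
})

# Rows become regions, columns become products, cells hold summed amounts
table = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)

print(table)
```

The two North/A rows are combined into a single cell, which is what distinguishes a pivot table from a plain reshape.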
21. Why is NumPy’s array slicing faster than Python’s list slicing?
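   - Basic slicing of a NumPy array returns a view onto the same memory rather than a copy, so it only has to create a small object describing the new shape and offset. Slicing a Python list builds a new list and copies a reference for every element in the slice. NumPy arrays also store elements of the same data type in contiguous memory, which keeps later operations on the slice fast.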
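One point worth illustrating is that a basic NumPy slice is a view onto the original data, while a list slice is a new list. The sketch below uses made-up values to show the difference in behavior.

```python
import numpy as np

numbers_list = [0, 1, 2, 3, 4]
list_slice = numbers_list[1:4]     # new list with copied references
list_slice[0] = 99                 # does not affect numbers_list

numbers_array = np.arange(5)
array_slice = numbers_array[1:4]   # view onto the same memory
array_slice[0] = 99                # also changes numbers_array

print(numbers_list)
print(numbers_array)
print(np.shares_memory(numbers_array, array_slice))
```

Use `array_slice.copy()` when an independent array is needed.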
22. What are some common use cases for Seaborn?
   - Exploratory data analysis (EDA), statistical analysis, and visualizing machine learning model performance.


PRACTICAL:

In [None]:
# 1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np

# Create a 2D NumPy array
array_2d = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

# Calculate the sum of each row
row_sums = np.sum(array_2d, axis=1)

# Print the results
print("2D Array:")
print(array_2d)
print("Sum of each row:")
print(row_sums)

In [None]:
# 2. Write a Pandas script to find the mean of a specific column in a DataFrame
import pandas as pd

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Specify the column you want to find the mean of
column_name = 'Salary'

# Calculate the mean
mean_value = df[column_name].mean()

# Print the result
print(f"Mean of '{column_name}' column:", mean_value)

In [None]:
# 3. Create a scatter plot using Matplotlib
import matplotlib.pyplot as plt

# Sample data
x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 30, 45]

# Create scatter plot
plt.scatter(x, y, color='blue', marker='o', s=100)  # s is size of points

# Add labels and title
plt.title("Sample Scatter Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")

# Show grid (optional)
plt.grid(True)

# Show the plot
plt.show()

In [None]:
# 4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Create or load your DataFrame
# Example data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)

# 2. Calculate the correlation matrix
corr_matrix = df.corr()

# 3. Create the heatmap using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

# 4. Add a title
plt.title("Correlation Matrix Heatmap")

# 5. Show the plot
plt.show()

In [None]:
# 5. Generate a bar plot using Plotly.
import plotly.express as px
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'B', 'C', 'D'],
    'Values': [23, 45, 12, 34]
}

# Create DataFrame
df = pd.DataFrame(data)

# Create bar plot
fig = px.bar(df, x='Category', y='Values', title='Sample Bar Plot', color='Category')

# Show plot
fig.show()

In [None]:
# 6. Create a DataFrame and add a new column based on an existing column
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 62, 90, 70]
}

df = pd.DataFrame(data)

# Add a new column based on 'Score'
# Let's say we want to classify the scores as 'Pass' or 'Fail'
df['Result'] = df['Score'].apply(lambda x: 'Pass' if x >= 75 else 'Fail')

# Print the updated DataFrame
print(df)

In [None]:
# 7. Write a program to perform element-wise multiplication of two NumPy arrays
import numpy as np

# Create two NumPy arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([10, 20, 30, 40])

# Perform element-wise multiplication
result = array1 * array2

# Print the result
print("Array 1:", array1)
print("Array 2:", array2)
print("Element-wise multiplication:", result)

In [None]:
# 8. Create a line plot with multiple lines using Matplotlib.
import matplotlib.pyplot as plt

# Sample data
x = [0, 1, 2, 3, 4, 5]
y1 = [0, 1, 4, 9, 16, 25]   # Line 1: y = x^2
y2 = [0, -1, -4, -9, -16, -25]  # Line 2: y = -x^2
y3 = [0, 2, 8, 18, 32, 50]   # Line 3: y = 2x^2

# Create a figure
plt.figure(figsize=(8, 6))

# Plot multiple lines
plt.plot(x, y1, label='y = x^2', color='blue', marker='o')  # First line
plt.plot(x, y2, label='y = -x^2', color='red', marker='x')  # Second line
plt.plot(x, y3, label='y = 2x^2', color='green', marker='^')  # Third line

# Adding labels and title
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Multiple Line Plot')

# Show legend
plt.legend()

# Add a grid and show the plot
plt.grid(True)
plt.show()

In [None]:
# 9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [23, 35, 45, 30, 50],
    'Salary': [50000, 60000, 70000, 65000, 80000]
}

df = pd.DataFrame(data)

# Set the threshold for filtering
salary_threshold = 60000

# Filter rows where 'Salary' is greater than the threshold
filtered_df = df[df['Salary'] > salary_threshold]

# Print the filtered DataFrame
print(f"Filtered DataFrame (Salary > {salary_threshold}):")
print(filtered_df)

In [None]:
# 10. Create a histogram using Seaborn to visualize a distribution.
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [23, 45, 56, 78, 45, 67, 89, 90, 45, 34, 67, 89, 56, 45, 23, 34, 56, 78, 90]

# Create a seaborn histogram
sns.histplot(data, kde=True, bins=10, color='blue')

# Add title and labels
plt.title('Distribution of Sample Data')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [None]:
# 11. Perform matrix multiplication using NumPy
import numpy as np

# Define two matrices
matrix1 = np.array([[1, 2],
                    [3, 4]])

matrix2 = np.array([[5, 6],
                    [7, 8]])

# Perform matrix multiplication
result = np.dot(matrix1, matrix2)

# Alternatively, you can use the @ operator
# result = matrix1 @ matrix2

# Print the result
print("Matrix 1:")
print(matrix1)
print("Matrix 2:")
print(matrix2)
print("Result of matrix multiplication:")
print(result)

In [None]:
# 12. Use Pandas to load a CSV file and display its first 5 rows
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('your_file.csv')  # Replace 'your_file.csv' with the actual path

# Display the first 5 rows of the DataFrame
print(df.head())

In [None]:
# 13. Create a 3D scatter plot using Plotly
import plotly.express as px
import pandas as pd

# Sample data for 3D scatter plot
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [5, 4, 3, 2, 1],
    'Z': [1, 2, 3, 4, 5],
    'Category': ['A', 'B', 'C', 'D', 'E']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Create 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', color='Category', title="3D Scatter Plot")

# Show plot
fig.show()