#Data Toolkit Assignment

##Theory Questions

###1. What is NumPy, and why is it widely used in Python ?

- `NumPy` is a fundamental Python library for numerical computing, providing support for multidimensional arrays and a wide range of mathematical functions. It's used extensively in scientific computing, data analysis, and machine learning because it offers efficient and optimized array operations, which are often required in these fields.

###2. How does broadcasting work in NumPy ?

- `Broadcasting` is the mechanism NumPy uses to perform element-wise operations on arrays of different shapes without explicitly replicating data. The smaller array is treated as if it were stretched to match the shape of the larger one during operations such as addition or multiplication.

- Basic Rules of Broadcasting
  - NumPy compares the two shapes dimension by dimension, starting from the trailing (rightmost) dimension. If one array has fewer dimensions, its shape is padded with 1s on the left.

  - Two dimensions are compatible when they are equal, or when one of them is 1.

- If the shapes are not compatible, NumPy raises a `ValueError`.
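
The rules above can be checked with a small sketch. The array names and values here are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# (2, 3) with (3,): the 1-D array is treated as (1, 3) and reused for every row
row_sum = matrix + row

# (2, 3) with (2, 1): the single column is reused for every column
col_sum = matrix + column

# (2, 3) with (2,): trailing dimensions 3 and 2 differ and neither is 1
try:
    matrix + np.array([1, 2])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(row_sum)
print(col_sum)
print("Incompatible shapes raised:", type(mismatch_error).__name__)
```

No data is copied to build the stretched arrays; NumPy only reuses the existing values while it computes the result.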

###3. What is a Pandas DataFrame ?

- `Pandas DataFrame` is a way to represent and work with tabular data. It can be seen as a table that organizes data into rows and columns, making it a two-dimensional data structure. A DataFrame can be created from scratch, or you can use other data structures, like NumPy arrays.

###4. Explain the use of the groupby() method in Pandas.

- The `groupby()` method in Pandas is used to split your data into groups based on one or more keys (like columns), then apply some operation to each group, and finally combine the results back into a DataFrame or Series.

- This is often referred to as the "split-apply-combine" strategy.

- Uses of groupby()
  - Aggregation
  - Filtering
  - Transformation
  - Iteration
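
A minimal sketch of the four uses listed above, assuming a small hypothetical `sales` table (the column names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "revenue": [100, 150, 200, 50, 250],
})

grouped = sales.groupby("region")["revenue"]

# Aggregation: one value per group
totals = grouped.sum()

# Transformation: a result aligned with the original rows
share = sales["revenue"] / grouped.transform("sum")

# Filtering: keep only the groups that satisfy a condition
large_regions = sales.groupby("region").filter(lambda g: g["revenue"].sum() > 300)

# Iteration: loop over (key, sub-DataFrame) pairs
group_sizes = {name: len(group) for name, group in sales.groupby("region")}

print(totals)
print(share)
print(large_regions)
print(group_sizes)
```

Each call follows the same split-apply-combine pattern; only the "apply" step and the shape of the combined result differ.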

###5.  Why is Seaborn preferred for statistical visualizations ?

- `Seaborn` is often preferred for statistical visualizations because it’s designed to make beautiful, informative, and statistically meaningful plots with minimal code. It's built on top of Matplotlib, but adds high-level abstraction and built-in support for complex visualizations.



###6. What are the differences between NumPy arrays and Python lists ?

- Comparison between a `NumPy array` and a `Python list`:

- 1. Data Type Consistency
  - NumPy arrays: All elements share one data type (e.g., all floats or all ints), which is what allows fast computation.

  - Python lists: Can hold mixed types (e.g., [1, 'a', True] is valid).

- 2. Performance (Speed)
  - NumPy arrays are typically much faster than lists for numerical operations because the operations run in compiled C code over a single typed block of memory.

  - Lists are slower for numerical work because every element is a separate Python object whose type is checked at run time.

- 3. Functionality
  - NumPy arrays support vectorized operations, broadcasting, and a large ecosystem of functions (e.g., linear algebra, stats).

  - Lists don't support element-wise math; you'd need a loop or list comprehension.

- 4. Memory Efficiency
  - NumPy arrays store raw values in one buffer, so large numeric arrays usually need less memory than the equivalent list.

  - Lists have extra overhead per element because each is a separate Python object.
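
The differences in typing and functionality can be seen in a short sketch (the variable names are illustrative only):

```python
import numpy as np

values = [1, 2, 3, 4]
array = np.array(values)

# Element-wise math: a list needs a comprehension, an array does not
doubled_list = [v * 2 for v in values]
doubled_array = array * 2

# On a list, * means repetition rather than multiplication
repeated_list = values * 2

# Mixed numeric input is converted to a single common dtype
mixed = np.array([1, 2.5])

print(doubled_list, doubled_array)
print(repeated_list)
print(mixed.dtype)
```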

###7. What is a heatmap, and when should it be used ?

- A `heatmap` is a data visualization technique that uses color to represent values across two variables, effectively translating complex data into a visual summary. It's commonly used to show patterns and relationships within large datasets, like website user behavior or product performance, making it easier to identify trends and areas needing improvement.

- When to use a heatmap:
  - Web Analytics
  - Data Analysis
  - Business Strategy
  - Manufacturing
  - Scientific Visualization


###8. What does the term “vectorized operation” mean in NumPy ?

- A `vectorized operation` in NumPy means applying a function or operation directly to entire arrays, element-wise, without writing explicit loops in Python. Instead of looping through each element manually, NumPy runs the operation underneath in fast, compiled code (usually in C).
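
As a rough illustration, the same temperature conversion is written below with an explicit loop and as a vectorized expression. The data is invented for the example.

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Explicit Python loop: one element at a time
fahrenheit_loop = np.empty_like(celsius)
for i in range(len(celsius)):
    fahrenheit_loop[i] = celsius[i] * 9 / 5 + 32

# Vectorized: the whole array in one expression, no Python-level loop
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_loop)
print(fahrenheit_vec)
```

Both versions give the same values; the vectorized one is shorter and, on large arrays, usually much faster.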



###9. How does Matplotlib differ from Plotly ?

- `Matplotlib` is often preferred for academic or highly customized static plots because you can fine-tune almost any aspect of the figure: fonts, margins, axis scales, and so on. `Plotly` is also highly customizable, but its real strength lies in interactivity and web-based visuals.

###10. What is the significance of hierarchical indexing in Pandas ?

- Hierarchical Indexing, also known as MultiIndexing, is a powerful feature in Pandas that allows you to have multiple levels of indexing on an axis (row or column). This capability is particularly useful when dealing with high-dimensional data.
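
A small sketch of a two-level row index, using made-up yearly and quarterly revenue figures:

```python
import pandas as pd

# Two index levels: year and quarter (hypothetical figures)
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Select everything under one outer-level key
year_2024 = revenue.loc["2024"]

# Take a cross-section on the inner level
q1_all_years = revenue.xs("Q1", level="quarter")

# Move the inner level into the columns to get a 2D table
wide = revenue.unstack("quarter")

print(year_2024)
print(q1_all_years)
print(wide)
```

Here a one-dimensional Series holds data that is logically two-dimensional, which is the sense in which a MultiIndex helps with higher-dimensional data.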

###11. What is the role of Seaborn’s pairplot() function ?

- `Seaborn's pairplot()` function in Python creates a grid of pairwise relationships (scatter plots and histograms) between variables in a dataset. It's a quick and useful tool for exploratory data analysis, especially when you want to visualize relationships between multiple features in your dataset. The diagonal plots show the distribution of individual variables.

###12. What is the purpose of the describe() function in Pandas ?

- The `describe()` function in Pandas is used to generate descriptive statistics of a DataFrame's columns. It provides a quick summary of key statistical metrics like mean, standard deviation, percentiles, and more for numeric data. For non-numeric data, it offers statistics like count, unique, top, and frequency.
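
A brief sketch with a hypothetical `people` table shows the numeric summary and the summary of a text column:

```python
import pandas as pd

# Hypothetical data with one numeric and one text column
people = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = people.describe()

# Text column: count, unique, top (most common value), freq
city_summary = people["city"].describe()

print(numeric_summary)
print(city_summary)
```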

###13. Why is handling missing data important in Pandas ?

- `Handling missing data` in Pandas is crucial because missing values (often represented as NaN) can severely impact data analysis and modeling. They can lead to incorrect results, biased models, and inaccurate conclusions. Pandas provides various tools to identify, address, and manage missing data, ensuring the integrity and reliability of your analysis.
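
A minimal sketch of identifying, dropping, and filling missing values. The sensor readings are invented, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two missing temperatures
readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "temp": [20.0, np.nan, 22.0, np.nan],
})

# Identify: count the missing values in each column
missing_per_column = readings.isna().sum()

# Drop: remove the rows where 'temp' is missing
dropped = readings.dropna(subset=["temp"])

# Fill: replace missing temperatures with the column mean (NaN is skipped)
filled = readings.fillna({"temp": readings["temp"].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```

Which option is appropriate depends on the data: dropping rows loses information, while filling introduces assumed values.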

###14. What are the benefits of using Plotly for data visualization ?

- `Plotly` produces interactive charts that are easy to explore: you can zoom in, hover over data points for more details, and click to drill into the data. The interactivity is handled by JavaScript in the browser, but you don't need to write any JavaScript when using Plotly from Python.

###15. How does NumPy handle multidimensional arrays ?

- NumPy handles multidimensional arrays using the `ndarray` (N-dimensional array) object, which can have any number of dimensions (1D, 2D, 3D, ...). In machine learning contexts these arrays are often referred to as tensors.
- It offers fast, flexible indexing, reshaping, and operations along any axis, all implemented in optimized compiled code.
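
For illustration, a 3D array can be indexed, sliced, reduced, and reshaped as follows (the shape is arbitrary):

```python
import numpy as np

# A 3D array: 2 blocks, each with 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)     # number of dimensions
print(cube.shape)    # size along each dimension

last_value = cube[1, 2, 3]       # one index per dimension
first_rows = cube[:, 0, :]       # first row of every block
block_sum = cube.sum(axis=0)     # add the two blocks together
flat_table = cube.reshape(6, 4)  # same data viewed as a 6x4 table
```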

###16. What is the role of Bokeh in data visualization ?

- `Bokeh` is a Python library primarily used for creating interactive and visually appealing data visualizations for modern web browsers. It allows users to build a wide variety of plots and charts, from simple to complex, and can be used to create standalone documents or server-backed applications. Unlike some other Python visualization libraries, Bokeh uses HTML and JavaScript to render its plots, which enables interactive features like zooming, panning, and tooltips.


###17. Explain the difference between apply() and map() in Pandas.

- `apply()` and `map()` in Pandas are both used to apply functions to data, but they work a bit differently and are used in different contexts.
- `Series.map()` works on a single column (a Pandas Series) and transforms each element individually, using a function, a dict, or another Series as the mapping.
- You can also use map() to:

  - Replace values with a dict

  - Apply functions to strings

  - Clean up or reformat data

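- `apply()` is more flexible: on a Series it calls the function on each element, and on a DataFrame it passes each whole column or row to the function.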
- apply() works on:

  - Series (like map, but more powerful)

  - DataFrames (apply across rows or columns)
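
The contrast can be sketched with a small hypothetical table (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob"],
    "grade": ["A", "B"],
    "score": [90, 75],
    "bonus": [5, 10],
})

# map() on a Series: replace values using a dict
points = df["grade"].map({"A": 4, "B": 3})

# map() on a Series: apply a function to each string
names = df["name"].map(str.title)

# apply() on a DataFrame, axis=1: the function receives one row at a time
total = df[["score", "bonus"]].apply(lambda row: row["score"] + row["bonus"], axis=1)

# apply() on a DataFrame, default axis=0: the function receives one column at a time
col_max = df[["score", "bonus"]].apply(lambda col: col.max())

print(points.tolist(), names.tolist())
print(total.tolist())
print(col_max)
```

For simple arithmetic like `total`, a plain vectorized expression (`df["score"] + df["bonus"]`) is usually faster; `apply()` is shown here only to illustrate row-wise calls.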

###18.  What are some advanced features of NumPy ?

- `NumPy` is not just about arrays and basic math — it also has advanced features that make it a powerhouse for scientific computing, data analysis, and machine learning.
- Some `advanced features` of NumPy are as follows :
  - Broadcasting
  - Vectorization
  - Linear Algebra Module
  - Structured Arrays/Record Arrays
  - Fancy Indexing and Boolean Masking

###19. How does Pandas simplify time series analysis ?

- `Pandas` simplifies time series analysis by providing a suite of tools designed for date and time data, whether you're dealing with stock prices, IoT sensor data, web traffic, or any other time-indexed dataset.

- Here's how Pandas simplifies time series analysis:
  - Datetime Indexing
  - Powerful Date Parsing
  - Date Range Generation
  - Resampling and Frequency Conversion
  - Shifting and Lagging Data
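
The features listed above can be sketched on a tiny daily series. The dates and visit counts are hypothetical.

```python
import pandas as pd

# Date range generation: six consecutive days
dates = pd.date_range("2024-01-01", periods=6, freq="D")
visits = pd.Series([10, 20, 30, 40, 50, 60], index=dates)

# Date parsing: turn a string into a Timestamp and use it as a key
parsed = pd.to_datetime("2024-01-03")
visits_on_parsed = visits.loc[parsed]

# Datetime indexing: slice by date strings (both ends are included)
first_three_days = visits["2024-01-01":"2024-01-03"]

# Resampling: total visits in two-day bins
two_day_totals = visits.resample("2D").sum()

# Shifting: the previous day's value, aligned to the current day
previous_day = visits.shift(1)

print(first_three_days)
print(two_day_totals)
print(previous_day)
```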

###20. What is the role of a pivot table in Pandas ?

- A `pivot table` restructures a DataFrame into a summary table: the unique values of one column become the row index, the unique values of another column become the column headers, and a values column fills the cells. When several rows fall into the same cell, `pivot_table()` combines them with an aggregation function (the mean by default). The result is a summary that is easy to read and analyze.
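
A minimal sketch with a hypothetical `sales` table. Note that `aggfunc="sum"` is chosen here for the example; the default aggregation is the mean.

```python
import pandas as pd

# Hypothetical sales records; North/A appears twice
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [100, 150, 200, 50, 30],
})

# Rows: region, columns: product, cells: summed revenue
summary = sales.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
)

print(summary)
```

The two North/A rows (100 and 30) are combined into a single cell, which is the aggregation step that distinguishes `pivot_table()` from a plain reshape.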

###21. Why is NumPy’s array slicing faster than Python’s list slicing ?

- `NumPy` slicing is faster than `Python` list slicing because:
  - No copying: basic NumPy slicing returns a view onto the same memory, whereas slicing a list builds a new list and copies a reference for every element.

  - Contiguous memory: NumPy arrays keep their values in a single block of memory, so a slice only needs an offset, a shape, and strides to describe it.

  - Optimized operations: operations applied to a slice run in the same compiled code as operations on the full array.

  - Homogeneous data: every element has the same type and size, so there is less overhead in managing slices.
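
The view-versus-copy difference is easy to observe. In this small sketch, writing to a slice changes the original array but not the original list:

```python
import numpy as np

numbers = list(range(5))
array = np.arange(5)

list_slice = numbers[1:4]   # new list (a copy of the references)
array_slice = array[1:4]    # view onto the same memory

list_slice[0] = 99
array_slice[0] = 99

print(numbers)  # original list is unchanged
print(array)    # original array now contains 99 at index 1
print(np.shares_memory(array, array_slice))
```

Because a view shares memory with its parent, call `.copy()` on the slice when an independent array is needed.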

###22. What are some common use cases for Seaborn ?

- `Seaborn` is commonly used for visualizing data distributions, relationships between variables, and statistical insights in various fields like data science, machine learning, and exploratory data analysis. Its strength lies in creating aesthetically pleasing and informative plots that are easy to understand and can be used for publication or presentations.

##Practical Questions

In [None]:
#1. How do you create a 2D NumPy array and calculate the sum of each row ?


#To create a 2D NumPy array and calculate the sum of each row, you can follow these steps:

#1. Create a 2D NumPy Array

#First, import the NumPy library and create a 2D array using np.array() or np.random.randint() for random integers.

import numpy as np

# Define a 2D array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Or create a 3x3 array with random integers between 0 and 10
arr_random = np.random.randint(0, 10, size=(3, 3))

print(arr)
print(arr_random)


#2. Calculate the Sum of Each Row
#Use the array's sum() method (or np.sum()) with axis=1 to sum along each row:

# Sum of each row
row_sums = arr.sum(axis=1)

print("Sum of each row:", row_sums)


In [None]:
#2. Write a Pandas script to find the mean of a specific column in a DataFrame.

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)

# Calculate the mean of the 'Age' column
mean_age = df['Age'].mean()

print("Mean Age:", mean_age)


In [None]:
#3. Create a scatter plot using Matplotlib.

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 5, 2, 8, 4, 7, 3, 9, 6, 10])
y = np.array([3, 7, 1, 9, 5, 8, 2, 10, 6, 4])
colors = np.random.rand(10)  # Random values, mapped to colors by the colormap
sizes = 100 * np.random.rand(10)  # Random sizes for each point

# Create the scatter plot
plt.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')

# Add labels and title
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
plt.title("Simple Scatter Plot")

# Add a colorbar (optional, but useful if you're mapping data to colors)
plt.colorbar(label='Color Intensity')

# Add a grid (optional)
plt.grid(True)

# Show the plot
plt.show()


In [None]:
#4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap ?

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample DataFrame (replace with your actual data)
data = {'A': [1, 2, 3, 4, 5],
        'B': [5, 4, 3, 2, 1],
        'C': [1, 3, 5, 2, 4],
        'D': [2, 2, 4, 4, 1]}
df = pd.DataFrame(data)

# 1. Calculate the correlation matrix (pandas computes it; Seaborn only draws it)
correlation_matrix = df.corr()

# 2. Visualize the Correlation Matrix with a Heatmap
plt.figure(figsize=(8, 6))  # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()


In [None]:
#5. Generate a bar plot using Plotly.

import plotly.express as px

# Load the gapminder dataset and filter for Canada
df = px.data.gapminder().query("country == 'Canada'")

# Create a bar plot showing population over the years
fig = px.bar(df, x='year', y='pop', title='Population of Canada Over Time')

# Display the plot
fig.show()


In [None]:
#6. Create a DataFrame and add a new column based on an existing column.

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Add a new column 'Age_in_5_Years' by adding 5 to the 'Age' column
df['Age_in_5_Years'] = df['Age'] + 5

print(df)


In [None]:
#7. Write a program to perform element-wise multiplication of two NumPy arrays.

import numpy as np

# Create two 1D NumPy arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Perform element-wise multiplication using np.multiply()
result = np.multiply(a, b)

# Alternatively, using the * operator
result_alt = a * b

# Print the results
print("Using np.multiply():", result)
print("Using * operator:", result_alt)


In [None]:
#8. Create a line plot with multiple lines using Matplotlib.

import matplotlib.pyplot as plt
import numpy as np

# Generate some sample data
x = np.linspace(0, 10, 100)  # 100 points from 0 to 10
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.tan(x)

# Create the line plot
plt.plot(x, y1, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
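plt.plot(x, y3, label='tan(x)')

# tan(x) grows without bound near its asymptotes, so limit the y-axis
# to keep the sin(x) and cos(x) curves visible
plt.ylim(-5, 5)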

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Line Plot')

# Add a legend
plt.legend()

# Show the plot
plt.show()


In [None]:
#9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Define the threshold
threshold = 30

# Filter rows where 'Age' is greater than the threshold
filtered_df = df[df['Age'] > threshold]

print(filtered_df)


In [None]:
#10. Create a histogram using Seaborn to visualize a distribution.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate random data
data = np.random.normal(loc=0, scale=1, size=1000)

# Create the histogram
sns.histplot(data, bins=30, kde=True, color='skyblue')

# Add titles and labels
plt.title('Histogram with KDE', fontsize=16)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Display the plot
plt.show()


In [None]:
#11. Perform matrix multiplication using NumPy.

import numpy as np

# Define two matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Perform matrix multiplication
result = np.matmul(A, B)

# Alternatively, using the @ operator
result_alt = A @ B

print("Result using np.matmul():\n", result)
print("Result using @ operator:\n", result_alt)


In [None]:
#12. Use Pandas to load a CSV file and display its first 5 rows.

import pandas as pd

# Load the CSV file into a DataFrame (replace 'your_file.csv' with the path to your file)
df = pd.read_csv('your_file.csv')

# Display the first 5 rows
print(df.head())


In [None]:
#13. Create a 3D scatter plot using Plotly.

import plotly.express as px
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(42)
num_points = 100
x = np.random.rand(num_points)
y = np.random.rand(num_points)
z = np.random.rand(num_points)
color = np.random.rand(num_points)
size = np.random.rand(num_points) * 10

# Create a DataFrame
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z, 'Color': color, 'Size': size})

# Create a 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', color='Color', size='Size', title='3D Scatter Plot using Plotly')

# Display the plot
fig.show()
