# **THEORY QUESTIONS**

1. What is NumPy, and why is it widely used in Python?

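NumPy is a Python library for numerical computing. It provides large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them efficiently. It is widely used because its array operations run in compiled code, which makes them much faster than equivalent Python loops, and because libraries such as Pandas, SciPy, and scikit-learn are built on top of it.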


2. How does broadcasting work in NumPy?

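Broadcasting lets NumPy perform arithmetic on arrays of different shapes. Shapes are compared dimension by dimension starting from the right; two dimensions are compatible when they are equal or when one of them is 1, and the size-1 dimension is virtually stretched to match the other without copying data.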
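
The broadcasting rule can be illustrated with a small sketch. The arrays and variable names below are invented for illustration; the example only assumes that NumPy is installed.

```python
import numpy as np

# A (3, 3) matrix and a (3,) vector: shapes are compared right to left,
# so the vector is treated as shape (1, 3) and reused for every row.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])
added = matrix + row

# A (3, 1) column combined with a (3,) row broadcasts to a (3, 3) grid.
column = np.array([[0], [1], [2]])
grid = column * row

print(added)
print(grid.shape)  # (3, 3)
```

Shapes that cannot be aligned this way, such as (3, 3) and (2,), raise a `ValueError` instead of being stretched.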


3. What is a Pandas DataFrame?

A DataFrame is a two-dimensional, labeled data structure in Pandas, similar to a spreadsheet or SQL table, that allows data manipulation and analysis.


4. Explain the use of the groupby() method in Pandas.

The groupby() method is used to group data based on certain keys and apply aggregate functions like sum, mean, or count on grouped data.
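
As a minimal sketch of grouping and aggregating, the example below uses a small sales table that is invented for illustration (the column names `region` and `amount` are assumptions, not part of any real dataset).

```python
import pandas as pd

# Hypothetical sales records, made up for this example.
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North'],
    'amount': [100, 150, 200, 50, 300],
})

# Split the rows by region, aggregate each group, and combine the results
# into one table with a row per region.
summary = sales.groupby('region')['amount'].agg(['sum', 'mean', 'count'])
print(summary)
```

Passing a list to `agg()` computes several aggregates in one pass, which is usually tidier than calling `sum()`, `mean()`, and `count()` separately.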


5. Why is Seaborn preferred for statistical visualizations?

Seaborn provides a high-level interface for creating attractive and informative statistical graphics, with built-in themes and color palettes.

6. What are the differences between NumPy arrays and Python lists?

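NumPy arrays store elements of a single data type in one contiguous block of memory, which makes numerical operations fast and enables broadcasting. Python lists can hold objects of mixed types and grow or shrink freely, but they are slower and use more memory for large numerical data.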
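
A short sketch of two behavioral differences, using invented values: the `*` operator means different things for the two types, and an array coerces its elements to one dtype.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# On a list, * repeats the sequence; on an array it multiplies each element.
doubled_list = values_list * 2      # [1, 2, 3, 1, 2, 3]
doubled_array = values_array * 2    # array([2, 4, 6])

# A list can mix types freely; an array converts everything to one dtype.
mixed_list = [1, 'two', 3.0]
mixed_array = np.array([1, 2.5, 3])  # all elements become float64

print(doubled_list)
print(doubled_array)
print(mixed_array.dtype)
```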



7. What is a heatmap, and when should it be used?

A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color. It is useful for visualizing correlations or data density.

8. What does the term "vectorized operation" mean in NumPy?

Vectorized operations allow you to perform operations on entire arrays without using loops, making computations faster and more efficient.
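
To make the contrast concrete, here is a minimal sketch that converts temperatures with an explicit loop and then with a single vectorized expression. The values are invented; both versions should produce the same result.

```python
import numpy as np

celsius = np.array([0.0, 25.0, 37.5, 100.0])

# Loop version: one Python-level operation per element.
fahrenheit_loop = np.empty_like(celsius)
for i in range(len(celsius)):
    fahrenheit_loop[i] = celsius[i] * 9 / 5 + 32

# Vectorized version: the whole array is processed in compiled code.
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
print(np.allclose(fahrenheit_loop, fahrenheit_vec))  # True
```

On arrays this small the speed difference is negligible; the gap becomes noticeable as the array grows.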



9. How does Matplotlib differ from Plotly?

Matplotlib is geared toward static, publication-quality plots, while Plotly produces interactive, web-based visualizations with built-in zoom, pan, and hover.




10. What is the significance of hierarchical indexing in Pandas?

Hierarchical indexing allows multiple index levels on an axis, enabling more complex data representations and easier data manipulation.
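
The following sketch shows a two-level index on a Series and a few ways to select from it. The level names (`year`, `quarter`) and the revenue figures are invented for illustration.

```python
import pandas as pd

# Build a two-level (year, quarter) index.
index = pd.MultiIndex.from_tuples(
    [('2023', 'Q1'), ('2023', 'Q2'), ('2024', 'Q1'), ('2024', 'Q2')],
    names=['year', 'quarter'],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name='revenue')

# Select every row under one outer label.
revenue_2024 = revenue.loc['2024']

# Select a single value with a full (outer, inner) key.
q2_2023 = revenue.loc[('2023', 'Q2')]

# Cross-section on the inner level: Q1 of every year.
all_q1 = revenue.xs('Q1', level='quarter')

# Pivot the inner level into columns.
table = revenue.unstack('quarter')
print(table)
```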

11. What is the role of Seaborn's pairplot() function?

pairplot() creates a grid of scatter plots showing the pairwise relationships between the numerical columns of a dataset, with each variable's own distribution on the diagonal. It is often used for exploratory data analysis.


12. What is the purpose of the describe() function in Pandas?

The describe() function provides summary statistics for the numerical columns of a DataFrame: count, mean, standard deviation, minimum, the 25th, 50th (median), and 75th percentiles, and maximum.
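
A small sketch of what describe() returns by default, using an invented DataFrame with one numeric and one text column. Note that the text column is left out of the default summary.

```python
import pandas as pd

# Invented data: one numeric column and one text column.
scores = pd.DataFrame({
    'math': [50, 60, 70, 80, 90],
    'name': ['a', 'b', 'c', 'd', 'e'],
})

stats = scores.describe()
print(stats)
print(list(stats.index))
```

The row labelled `50%` is the median. Passing `include='all'` asks describe() to summarize non-numeric columns as well.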


13. Why is handling missing data important in Pandas?

Missing data can skew results or cause errors in analysis, so it is essential to detect it (isna()) and handle it by dropping rows or columns (dropna()), filling in values (fillna()), or imputing them.
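
The sketch below walks through counting, dropping, and filling missing values on a tiny invented DataFrame. Filling the numeric column with its mean is just one simple imputation choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in each column.
df = pd.DataFrame({
    'temp': [20.0, np.nan, 24.0, 22.0],
    'city': ['Pune', 'Delhi', None, 'Mumbai'],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each column with a sensible default.
filled = df.fillna({'temp': df['temp'].mean(), 'city': 'Unknown'})

print(missing_per_column)
print(dropped)
print(filled)
```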


14. What are the benefits of using Plotly for data visualization?

Plotly enables the creation of interactive, web-ready visualizations with rich features like zoom, hover info, and animations.


15. How does NumPy handle multidimensional arrays?

NumPy uses the ndarray object to handle multidimensional arrays efficiently, supporting various mathematical operations on them.
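
As a minimal illustration, the sketch below builds a three-dimensional ndarray and reduces it along different axes. The shape (2, 3, 4) is an arbitrary choice for the example.

```python
import numpy as np

# 24 consecutive integers arranged as 2 layers of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape)  # 3 (2, 3, 4)

# Index with one position per axis: layer 1, row 2, column 3.
last_value = cube[1, 2, 3]

# Reduce along axis 0: adds the two layers element by element.
sum_over_layers = cube.sum(axis=0)

# Reduce along axes 1 and 2: one total per layer.
layer_totals = cube.sum(axis=(1, 2))

print(last_value)
print(sum_over_layers.shape)
print(layer_totals)
```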

16. What is the role of Bokeh in data visualization?

Bokeh is a Python library for creating interactive and real-time visualizations for modern web browsers.


17. Explain the difference between apply() and map() in Pandas.

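apply() works on both Series and DataFrames: on a DataFrame it passes each column (or each row, with axis=1) to the function, and on a Series it passes each value. map() is primarily a Series method that transforms values element-wise and also accepts a dict or Series as a lookup table. Recent pandas versions (2.1 and later) also provide DataFrame.map() for element-wise operations, replacing applymap().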
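
A brief sketch of the difference, using an invented DataFrame. The column names and lambdas are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 250, 80], 'qty': [2, 1, 5]})
sizes = pd.Series(['S', 'M', 'L'])

# Series.map with a dict: each value is looked up element-wise.
size_names = sizes.map({'S': 'small', 'M': 'medium', 'L': 'large'})

# Series.map with a function: also element-wise.
price_plus_fee = df['price'].map(lambda p: p + 10)

# DataFrame.apply with axis=0 (the default): the function gets each column.
column_max = df.apply(lambda col: col.max())

# DataFrame.apply with axis=1: the function gets each row as a Series.
row_total = df.apply(lambda row: row['price'] * row['qty'], axis=1)

print(size_names.tolist())
print(column_max)
print(row_total.tolist())
```

For simple arithmetic like `row_total`, the vectorized form `df['price'] * df['qty']` is usually faster than a row-wise apply().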


18. What are some advanced features of NumPy?

NumPy supports broadcasting, linear algebra, FFT, masked arrays, and integration with C/C++ code for high performance.
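
Three of these features can be sketched briefly: solving a linear system, finding the dominant frequency of a signal with the FFT, and ignoring a sentinel value with a masked array. All numbers below, including the -999 sentinel, are invented for illustration.

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)  # x = 1, y = 3

# FFT: a 5 Hz sine wave sampled at 100 Hz for one second.
sample_rate = 100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
dominant = freqs[np.argmax(spectrum)]  # strongest frequency component

# Masked arrays: exclude a sentinel value from the statistics.
raw = np.array([1.0, -999.0, 3.0])
masked = np.ma.masked_equal(raw, -999.0)
masked_mean = masked.mean()  # mean of 1.0 and 3.0 only

print(solution, dominant, masked_mean)
```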


19. How does Pandas simplify time series analysis?

Pandas offers powerful tools for time series like date range generation, frequency conversion, resampling, and time-based indexing.
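
The sketch below touches three of these tools on an invented daily series: date range generation, time-based slicing, and resampling to weekly totals. The start date is chosen to fall on a Monday so the weekly bins are easy to follow.

```python
import pandas as pd

# Ten consecutive days starting Monday 2024-01-01, with values 0..9.
dates = pd.date_range('2024-01-01', periods=10, freq='D')
readings = pd.Series(range(10), index=dates)

# Time-based indexing: slice by date strings (both ends inclusive).
first_week = readings.loc['2024-01-01':'2024-01-07']

# Resampling: downsample daily values to weekly totals.
# The 'W' frequency groups into weeks that end on Sunday.
weekly = readings.resample('W').sum()

print(first_week)
print(weekly)
```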


20. What is the role of a pivot table in Pandas?

A pivot table summarizes data by grouping and aggregating, making it easier to analyze relationships between variables.
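
As a minimal sketch, the example below pivots an invented orders table so that regions become rows, products become columns, and each cell holds the total units. The column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical order records, made up for this example.
orders = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'units': [10, 5, 7, 3, 4],
})

# Rows: region. Columns: product. Cells: summed units.
pivot = orders.pivot_table(
    index='region',
    columns='product',
    values='units',
    aggfunc='sum',
    fill_value=0,  # shown instead of NaN when a combination has no rows
)
print(pivot)
```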


21. Why is NumPy's array slicing faster than Python's list slicing?

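Slicing a NumPy array returns a view onto the same memory buffer: only new shape, stride, and offset metadata are created, so no elements are copied. Slicing a Python list builds a new list and copies a reference for every selected element, so its cost grows with the length of the slice.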
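
The difference between a view and a copy can be observed directly. In this small sketch, writing through a NumPy slice changes the original array, while writing to a list slice does not.

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]  # a view: shares memory with arr
lst_slice = lst[2:5]  # a new list: element references are copied

# Modify the first element of each slice.
arr_slice[0] = 99
lst_slice[0] = 99

print(arr[2])                            # 99, the original array changed
print(lst[2])                            # 2, the original list is untouched
print(np.shares_memory(arr, arr_slice))  # True
```

When an independent copy of an array slice is needed, call `.copy()` on it explicitly.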



22. What are some common use cases for Seaborn?

Seaborn is commonly used for correlation analysis, distribution plots, categorical plots, and visualizing linear relationships.

# **PRACTICAL QUESTIONS**

In [1]:
### 1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np
array_2d = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])

row_sums = array_2d.sum(axis=1)


print("2D Array:")
print(array_2d)
print("\nSum of each row:")
print(row_sums)

print("\nVerification:")
for i, row in enumerate(array_2d):
    print(f"Row {i}: {row} → Sum = {sum(row)} (matches {row_sums[i]})")

2D Array:
[[ 5 10 15]
 [20 25 30]
 [35 40 45]]

Sum of each row:
[ 30  75 120]

Verification:
Row 0: [ 5 10 15] → Sum = 30 (matches 30)
Row 1: [20 25 30] → Sum = 75 (matches 75)
Row 2: [35 40 45] → Sum = 120 (matches 120)


In [3]:
### 2. Write a Pandas script to find the mean of a specific column in a DataFrame.
import pandas as pd

data = {
    'Employee': ['VISHAL', 'SHARMA', 'PW', 'ROHIT', 'RAHUL'],
    'Age': [28, 34, 29, 42, 33],
    'Salary': [72000, 65000, 80000, 90000, 75000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)

target_column = 'Salary'

print("=== DATAFRAME ===")
print(df)

# Check the dtype first: calling .mean() on a non-numeric column raises an error
if pd.api.types.is_numeric_dtype(df[target_column]):
    mean_value = df[target_column].mean()
    print(f"\nMean of '{target_column}' column: ${mean_value:,.2f}")
    print(f"\nStatistical summary for '{target_column}':")
    print(df[target_column].describe())
else:
    print(f"\nWarning: '{target_column}' is non-numeric - cannot calculate mean")

=== DATAFRAME ===
  Employee  Age  Salary Department
0   VISHAL   28   72000         HR
1   SHARMA   34   65000         IT
2       PW   29   80000    Finance
3    ROHIT   42   90000         IT
4    RAHUL   33   75000  Marketing

Mean of 'Salary' column: $76,400.00

Statistical summary for 'Salary':
count        5.000000
mean     76400.000000
std       9343.446901
min      65000.000000
25%      72000.000000
50%      75000.000000
75%      80000.000000
max      90000.000000
Name: Salary, dtype: float64


In [None]:
### 3. Create a scatter plot using Matplotlib.
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.normal(50, 15, 100)
y = 2 * x + np.random.normal(0, 20, 100)

plt.figure(figsize=(8, 6), dpi=100)
ax = plt.gca()

scatter = ax.scatter(
    x, y,
    c=np.sqrt(x**2 + y**2),
    s=np.abs(y)/5,
    alpha=0.7,
    cmap='viridis',
    edgecolor='black',
    linewidth=0.5
)


cbar = plt.colorbar(scatter)
cbar.set_label('Distance from Origin')


ax.set_title('Advanced Scatter Plot Example', pad=20)
ax.set_xlabel('X Values (arbitrary units)', labelpad=10)
ax.set_ylabel('Y Values (arbitrary units)', labelpad=10)
ax.grid(True, linestyle='--', alpha=0.6)


ax.axhline(y=np.mean(y), color='red', linestyle=':', label=f'Mean Y ({np.mean(y):.1f})')
ax.axvline(x=np.mean(x), color='blue', linestyle=':', label=f'Mean X ({np.mean(x):.1f})')
ax.legend()


plt.tight_layout()
plt.savefig('scatter_plot.png', bbox_inches='tight', dpi=120)
plt.show()

In [None]:
### 4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


np.random.seed(42)
data = pd.DataFrame({
    'Sales': np.random.normal(100, 20, 100),
    'Marketing': np.random.normal(50, 10, 100) + np.random.normal(0, 5, 100),
    'R&D': np.random.normal(30, 5, 100) * np.random.uniform(0.8, 1.2, 100),
    'Profit': np.random.normal(20, 3, 100) + np.random.normal(0, 2, 100),
    'Customers': np.random.poisson(150, 100)
})


corr_matrix = data.corr(method='pearson')


plt.figure(figsize=(10, 8), dpi=100)
ax = sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    vmin=-1, vmax=1,
    center=0,
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .8},
    mask=np.triu(np.ones_like(corr_matrix, dtype=bool))  # hide the upper triangle and diagonal
)

ax.set_title('Feature Correlation Matrix\n', fontsize=14, pad=20)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)


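# Star strong correlations (|r| > 0.5) whose Pearson p-value is below 0.05
from scipy import stats  # assumes SciPy is installed

for i in range(len(corr_matrix.columns)):
    for j in range(i):
        _, pval = stats.pearsonr(data.iloc[:, i], data.iloc[:, j])
        if abs(corr_matrix.iloc[i, j]) > 0.5 and pval < 0.05:
            ax.text(j+0.5, i+0.5, '*', ha='center', va='center', color='black', fontsize=14)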


plt.tight_layout()
plt.savefig('correlation_heatmap.png', bbox_inches='tight', dpi=120)
plt.show()

In [None]:
### 5. Generate a grouped bar plot using Plotly
import plotly.graph_objects as go
import pandas as pd
import numpy as np


np.random.seed(123)
categories = ['Electronics', 'Clothing', 'Groceries', 'Furniture', 'Books']
sales_data = pd.DataFrame({
    'Category': categories,
    'Q1_Sales': np.random.randint(50, 200, size=5),
    'Q2_Sales': np.random.randint(60, 250, size=5),
    'Q3_Sales': np.random.randint(70, 300, size=5),
    'Q4_Sales': np.random.randint(80, 350, size=5)
})

fig = go.Figure()


quarters = ['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']
colors = ['#636EFA', '#EF553B', '#00CC96', '#AB63FA']

for q, color in zip(quarters, colors):
    fig.add_trace(go.Bar(
        x=sales_data['Category'],
        y=sales_data[q],
        name=q.replace('_', ' '),
        marker_color=color,
        hovertemplate='<b>%{x}</b><br>' +
                     f'{q.replace("_", " ")}: %{{y:,}}<extra></extra>',
        text=sales_data[q],
        textposition='auto'
    ))


fig.update_layout(
    title='<b>Quarterly Sales by Category</b>',
    title_x=0.5,
    xaxis_title='Product Category',
    yaxis_title='Sales Volume (units)',
    barmode='group',
    template='plotly_white',
    hoverlabel=dict(
        bgcolor='white',
        font_size=14,
        font_family='Arial'
    ),
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=1.02,
        xanchor='right',
        x=1
    ),
    margin=dict(l=50, r=50, b=100, t=100, pad=10),
    height=600,
    width=900
)
fig.add_annotation(
    text="Source: Synthetic Data | Visualization: Plotly",
    xref="paper", yref="paper",
    x=0.5, y=-0.15,
    showarrow=False,
    font=dict(size=10, color='grey')
)


fig.show()


In [None]:
### 6. Create a DataFrame and add a new column based on an existing column
import pandas as pd
import numpy as np


data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, 800, 450, 300, 80],
    'Units_Sold': [150, 320, 210, 180, 95]
}
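df = pd.DataFrame(data)

# Show the DataFrame before any columns are added
print("Original DataFrame:")
print(df)

# Arithmetic on two existing columns
df['Revenue'] = df['Price'] * df['Units_Sold']

# Conditional column: nested np.where assigns a price tier
df['Tier'] = np.where(df['Price'] > 1000, 'Premium',
                     np.where(df['Price'] > 500, 'Mid-Range', 'Budget'))

# Function applied to each value (a 10% discount)
def apply_discount(price):
    return price * 0.9

df['Discounted_Price'] = df['Price'].apply(apply_discount)

# String operations: fixed prefix plus the first three letters, upper-cased
df['Product_Code'] = 'ELEC_' + df['Product'].str.upper().str[:3]

print("\nModified DataFrame:")
print(df)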

df.to_csv('product_data_with_new_columns.csv', index=False)

In [None]:
### 7. Write a program to perform element-wise multiplication of two NumPy arrays
import numpy as np


array1 = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

array2 = np.array([[9, 8, 7],
                   [6, 5, 4],
                   [3, 2, 1]])



result1 = array1 * array2


result2 = np.multiply(array1, array2)

result3 = np.empty_like(array1)
np.multiply(array1, array2, out=result3)


print("Array 1:")
print(array1)
print("\nArray 2:")
print(array2)

print("\nElement-wise product (using * operator):")
print(result1)

print("\nElement-wise product (using np.multiply()):")
print(result2)

print("\nElement-wise product (with out parameter):")
print(result3)


assert np.array_equal(result1, result2), "Results don't match!"
assert np.array_equal(result1, result3), "Results don't match!"
print("\nAll methods produced identical results ✓")


vector = np.array([10, 20, 30])
broadcast_result = array1 * vector
print("\nBroadcasting example (array * vector):")
print(broadcast_result)

In [None]:
### 8. Create a line plot with multiple lines using Matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


np.random.seed(42)
x = np.linspace(0, 10, 100)
data = pd.DataFrame({
    'Time': x,
    'Sensor_A': np.sin(x) + np.random.normal(0, 0.1, 100),
    'Sensor_B': np.cos(x) + np.random.normal(0, 0.15, 100),
    'Sensor_C': 0.5 * np.sin(0.5 * x) + np.random.normal(0, 0.2, 100)
})


# Set the style before creating the figure so that it applies to this figure
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(10, 6), dpi=100)


sensors = ['Sensor_A', 'Sensor_B', 'Sensor_C']
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
line_styles = ['-', '--', '-.']
markers = ['', 'o', '']

for sensor, color, style, marker in zip(sensors, colors, line_styles, markers):
    plt.plot(
        'Time', sensor,
        data=data,
        color=color,
        linestyle=style,
        marker=marker,
        markersize=4,
        markevery=10,
        linewidth=2,
        alpha=0.8,
        label=sensor.replace('_', ' ')
    )


plt.title('Multi-Sensor Time Series Data', pad=20, fontsize=14)
plt.xlabel('Time (seconds)', labelpad=10)
plt.ylabel('Sensor Values', labelpad=10)
plt.grid(True, linestyle=':', alpha=0.7)
# Draw the threshold lines before building the legend so their labels are included
plt.axhline(y=0.8, color='red', linestyle=':', label='Upper Threshold')
plt.axhline(y=-0.8, color='purple', linestyle=':', label='Lower Threshold')
plt.legend(frameon=True, shadow=True)


plt.tight_layout()
plt.savefig('multi_line_plot.png', bbox_inches='tight', dpi=120)
plt.show()

In [None]:
### 9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
import pandas as pd
import numpy as np


np.random.seed(42)
data = pd.DataFrame({
    'Product_ID': [f'P{1000 + i}' for i in range(20)],
    'Category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Sports'], 20),
    'Price': np.round(np.random.uniform(10, 500, 20), 2),
    'Rating': np.round(np.random.uniform(1, 5, 20), 1),
    'Units_Sold': np.random.randint(5, 200, 20)
})


threshold = 100  # Price threshold


filtered_1 = data[data['Price'] > threshold]


filtered_2 = data.query('Price > @threshold')


filtered_3 = data.loc[data['Price'] > threshold]

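# The three approaches (boolean mask, query, .loc) select the same rows
assert filtered_1.equals(filtered_2) and filtered_1.equals(filtered_3)

# Display results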
print("══════════════ ORIGINAL DATAFRAME (20 rows) ══════════════")
print(data.head())

print(f"\n═════════ FILTERED RESULTS (Price > ${threshold}) ═════════")
print(f"Method 1: {len(filtered_1)} rows")
print(filtered_1.head())


high_rated = data[(data['Price'] > threshold) &
                 (data['Rating'] >= 4.0) &
                 (data['Units_Sold'] > 50)]

print("\n════════ HIGH-RATED PRODUCTS (Price>$100, Rating≥4, Sold>50) ════════")
print(high_rated)


filtered_1.to_csv('filtered_products.csv', index=False)

In [None]:
### 10. Create a histogram using Seaborn to visualize a distribution
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. GENERATE SAMPLE DATA
np.random.seed(42)
data = pd.DataFrame({
    'Scores': np.concatenate([
        np.random.normal(70, 10, 500),
        np.random.normal(90, 5, 300)
    ]),
    'Group': ['Class A'] * 500 + ['Class B'] * 300
})

# 2. CREATE HISTOGRAM PLOT
plt.figure(figsize=(10, 6), dpi=100)

# Single distribution histogram
ax1 = plt.subplot(1, 2, 1)
sns.histplot(
    data=data,
    x='Scores',
    bins=30,
    kde=True,
    color='skyblue',
    edgecolor='white',
    linewidth=0.5,
    alpha=0.8
)
plt.title('Overall Score Distribution')
plt.xlabel('Test Scores')
plt.ylabel('Frequency')

ax2 = plt.subplot(1, 2, 2)
sns.histplot(
    data=data,
    x='Scores',
    hue='Group',
    bins=25,
    kde=True,
    palette=['#FF7F0E', '#1F77B4'],
    edgecolor='white',
    linewidth=0.3,
    alpha=0.6,
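    element='step'
)
plt.title('Score Distribution by Group')
plt.xlabel('Test Scores')
plt.ylabel('Frequency')
# Seaborn builds the hue legend itself; plt.legend() would replace it with an
# empty one, so only the title of the existing legend is changed
ax2.get_legend().set_title('Class')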

# 3. ENHANCE VISUALIZATION
plt.suptitle('Exam Score Distributions', y=1.02, fontsize=14)
plt.tight_layout()

# 4. SAVE AND SHOW
plt.savefig('score_distributions.png', bbox_inches='tight', dpi=120)
plt.show()

In [None]:
### 11. Perform matrix multiplication using NumPy
import numpy as np


matrix_A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

matrix_B = np.array([
    [9, 8, 7],
    [6, 5, 4],
    [3, 2, 1]
])


result_1 = matrix_A @ matrix_B

result_2 = np.matmul(matrix_A, matrix_B)

result_3 = np.dot(matrix_A, matrix_B)

print("Matrix A:")
print(matrix_A)
print("\nMatrix B:")
print(matrix_B)

print("\nResult (using @ operator):")
print(result_1)

print("\nResult (using np.matmul()):")
print(result_2)

print("\nResult (using np.dot()):")
print(result_3)


assert np.array_equal(result_1, result_2), "Results don't match!"
assert np.array_equal(result_1, result_3), "Results don't match!"
print("\nAll methods produced identical results ✓")


vector = np.array([1, 2, 3])
print("\nVector-matrix product:")
print(vector @ matrix_A)


batch_A = np.random.rand(5, 3, 4)
batch_B = np.random.rand(5, 4, 2)
batch_result = np.matmul(batch_A, batch_B)
print("\nBatch matrix multiplication shape:", batch_result.shape)


In [None]:
### 12. Use Pandas to load a CSV file and display its first 5 rows
import pandas as pd


file_path = 'data.csv'  # replace with the path to your own CSV file
df = pd.read_csv(file_path)


print("═══════════════════════════════════════")
print(f"First 5 rows of '{file_path}':")
print("═══════════════════════════════════════")
print(df.head())  # Default shows 5 rows
print("═══════════════════════════════════════")



In [None]:
### 13. Create a 3D scatter plot using Plotly
import plotly.graph_objects as go
import numpy as np
import pandas as pd

np.random.seed(42)
n_points = 200

df = pd.DataFrame({
    'X': np.random.normal(0, 10, n_points),
    'Y': np.random.normal(5, 3, n_points),
    'Z': np.random.normal(10, 5, n_points),
    'Value': np.random.uniform(1, 100, n_points),
    'Category': np.random.choice(['A', 'B', 'C'], size=n_points)
})

fig = go.Figure()


for category in df['Category'].unique():
    category_data = df[df['Category'] == category]
    fig.add_trace(go.Scatter3d(
        x=category_data['X'],
        y=category_data['Y'],
        z=category_data['Z'],
        mode='markers',
        name=category,
        marker=dict(
            size=category_data['Value']/10,
            color=category_data['Value'],
            colorscale='Viridis',
            opacity=0.8,
            colorbar=dict(title='Value'),
        ),  # close the marker dict; hovertemplate belongs to the trace itself
        hovertemplate=
            '<b>X</b>: %{x:.2f}<br>' +
            '<b>Y</b>: %{y:.2f}<br>' +
            '<b>Z</b>: %{z:.2f}<br>' +
            '<b>Value</b>: %{marker.color:.1f}<extra></extra>'
    ))


fig.update_layout(
    title='<b>Interactive 3D Scatter Plot</b>',
    scene=dict(
        xaxis_title='X Axis',
        yaxis_title='Y Axis',
        zaxis_title='Z Axis',
        camera=dict(
            eye=dict(x=1.5, y=1.5, z=0.1)
        )
    ),
    width=900,
    height=700,
    margin=dict(l=50, r=50, b=50, t=50),
    legend=dict(
        yanchor='top',
        y=0.99,
        xanchor='left',
        x=0.01
    )
)


fig.update_layout(
    annotations=[
        dict(
            text="Size represents Value<br>Color represents Value",
            xref="paper", yref="paper",
            x=0.05, y=0.05,
            showarrow=False,
            font=dict(size=10)
        )
    ]
)


fig.show()