**Basic Array Operations**

Convert the mpg column into a NumPy array and calculate:
* The mean, median, and standard deviation of mpg.
* The number of cars with mpg greater than 25.

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv("auto-mpg.csv")

mpg_array = df['mpg'].values

mean = np.mean(mpg_array)
median = np.median(mpg_array)
std_dev = np.std(mpg_array)

print(f"Mean of mpg: {mean}")
print(f"Median of mpg: {median}")
print(f"Standard deviation of mpg: {std_dev}")

cars_greater_than_25_mpg = np.sum(mpg_array > 25)
print(f"Number of cars with mpg greater than 25: {cars_greater_than_25_mpg}")
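
One detail worth checking in the cell above: `np.std` divides by N by default (population standard deviation), whereas pandas' `Series.std` divides by N - 1. The sketch below illustrates the difference on a small made-up array; the values are illustrative only and do not come from auto-mpg.csv.

```python
import numpy as np

# Made-up mpg values for illustration only (not taken from auto-mpg.csv).
sample_mpg = np.array([18.0, 22.0, 26.0, 30.0, 34.0])

# ddof=0 (the NumPy default) divides by N: population standard deviation.
population_std = np.std(sample_mpg)

# ddof=1 divides by N - 1: sample standard deviation (the pandas default).
sample_std = np.std(sample_mpg, ddof=1)

# Counting True values in a boolean mask; equivalent to np.sum(mask).
count_above_25 = np.count_nonzero(sample_mpg > 25)

print(population_std, sample_std, count_above_25)
```

Which of the two is appropriate depends on whether the rows are treated as the whole population or as a sample.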

**Filtering**

*   Using NumPy, filter all cars with more than 6 cylinders.
*   Return the corresponding car_name as a list.

In [None]:
cylinders_array = df['cylinders'].values

car_names = df['car name'].values

filtered_cars = car_names[cylinders_array > 6]

filtered_cars_list = filtered_cars.tolist()

filtered_cars_list

**Statistical Analysis**

Compute the 25th, 50th, and 75th percentiles of the weight column using NumPy.

In [None]:
weight_array = df['weight'].values

percentiles = np.percentile(weight_array, [25, 50, 75])

print(f"25th percentile of weight: {percentiles[0]}")
print(f"50th percentile of weight: {percentiles[1]}")
print(f"75th percentile of weight: {percentiles[2]}")
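
By default `np.percentile` interpolates linearly between neighbouring sorted values, so a percentile need not be a value that occurs in the data. A minimal sketch on a made-up four-element array (not the real weight column):

```python
import numpy as np

# Made-up weights for illustration only.
weights = np.array([2000, 3000, 4000, 5000])

# With 4 values, the 25th percentile sits at sorted position 0.75,
# i.e. three quarters of the way from 2000 to 3000.
q25, q50, q75 = np.percentile(weights, [25, 50, 75])

print(q25, q50, q75)
```

The 50th percentile coincides with `np.median(weights)`.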


**Array Manipulation**

Convert the acceleration column into a NumPy array and normalize its values
(scale between 0 and 1).

In [None]:
acceleration_array = df['acceleration'].values

min_accel = np.min(acceleration_array)
max_accel = np.max(acceleration_array)
normalized_acceleration = (acceleration_array - min_accel) / (max_accel - min_accel)

print("\nNormalized Acceleration Array:")
normalized_acceleration
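
The min-max formula used above divides by `max - min`, which is zero when every value is identical. The helper below is a hypothetical sketch (the name `min_max_normalize` and the choice to return zeros for a constant array are assumptions, not part of the exercise) that guards against that case.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values to [0, 1]; return zeros if all values are equal."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # Assumption: a constant array maps to all zeros instead of NaN.
        return np.zeros_like(values)
    return (values - values.min()) / value_range

# Made-up acceleration values for illustration only.
normalized = min_max_normalize([8.0, 12.0, 16.0, 24.0])
constant_case = min_max_normalize([5.0, 5.0, 5.0])

print(normalized)
print(constant_case)
```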

**Broadcasting**

Increase all horsepower values by 10% and store the updated values in a new
NumPy array. Handle missing data (if any) by replacing it with the mean of the
column before applying the increase.

In [None]:
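# pandas and numpy are already imported in the first cell.
# Coerce non-numeric entries (if any) to NaN so they can be replaced by the mean.
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')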

mean_horsepower = df['horsepower'].mean()

df['horsepower'] = df['horsepower'].fillna(mean_horsepower)

horsepower_array = df['horsepower'].values

updated_horsepower_array = horsepower_array * 1.10

print("\nUpdated Horsepower Array:")
print(updated_horsepower_array)
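
The same fill-then-scale step can also be done with NumPy alone once the values are a float array containing NaN. This is a sketch on made-up numbers, assuming the missing entries are already NaN:

```python
import numpy as np

# Made-up horsepower values; np.nan marks a missing entry.
hp = np.array([130.0, np.nan, 150.0, 110.0])

# np.nanmean ignores NaN when averaging.
mean_hp = np.nanmean(hp)

# Replace NaN with the mean, keep every other value unchanged.
filled_hp = np.where(np.isnan(hp), mean_hp, hp)

# Broadcasting: the scalar 1.10 is applied to each element.
boosted_hp = filled_hp * 1.10

print(boosted_hp)
```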

**Boolean Indexing**

Find the average displacement of cars with an origin of 2 (Europe) using NumPy
indexing.

In [None]:
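origin_array = df['origin'].values
displacement_array = df['displacement'].values

# Boolean mask: True where the car's origin code is 2 (Europe).
european_mask = origin_array == 2
european_displacement_array = displacement_array[european_mask]

avg_european_displacement = np.mean(european_displacement_array)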

print(f"\nAverage displacement of European cars: {avg_european_displacement}")
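
Boolean masks can be combined element-wise with `&` and `|` before indexing. A small sketch on made-up arrays (the values are illustrative, not from the dataset):

```python
import numpy as np

# Made-up origin codes and displacements for illustration only.
origin = np.array([1, 2, 2, 3, 1, 2])
displacement = np.array([350.0, 97.0, 121.0, 85.0, 302.0, 107.0])

# One mask per condition; each is an array of True/False values.
is_european = origin == 2
is_large = displacement > 100

# Indexing with a mask keeps only the positions where it is True.
avg_european = displacement[is_european].mean()

# Parentheses matter: & binds more tightly than comparison operators.
large_european = displacement[is_european & is_large]

print(avg_european, large_european)
```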

**Matrix Operations**

Create a 2D NumPy array containing the columns mpg, horsepower, and weight.
Compute the dot product of this matrix with a given vector [1, 0.5, -0.2].


In [None]:
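# Make sure horsepower is numeric and has no missing values before building
# the matrix (harmless if this was already done in an earlier cell).
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
mean_horsepower = df['horsepower'].mean()
df['horsepower'] = df['horsepower'].fillna(mean_horsepower)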

matrix = df[['mpg', 'horsepower', 'weight']].values
vector = np.array([1, 0.5, -0.2])
dot_product_result = np.dot(matrix, vector)

print("\nDot product of the matrix and the vector:")
dot_product_result

**Sorting**

Use NumPy to sort the cars by model_year in descending order and display the first
five car names.

In [None]:
model_year_array = df['model year'].values
car_name_array = df['car name'].values

sorted_indices = np.argsort(model_year_array)[::-1]

sorted_car_names = car_name_array[sorted_indices]

print("\nFirst five car names sorted by model year (descending):")
print(sorted_car_names[:5])
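
Many cars share a model year, so the order among ties depends on how the descending sort is built. Reversing an ascending sort also reverses the order of tied rows; sorting the negated keys with `kind='stable'` keeps ties in their original order. A sketch with made-up years and placeholder names:

```python
import numpy as np

# Made-up model years and placeholder car names for illustration only.
years = np.array([70, 82, 76, 82, 70])
names = np.array(['a', 'b', 'c', 'd', 'e'])

# Ascending stable sort, then reversed: tied rows come out in reverse order.
reversed_idx = np.argsort(years, kind='stable')[::-1]

# Stable sort on negated keys: descending, ties keep their original order.
stable_desc_idx = np.argsort(-years, kind='stable')

print(names[reversed_idx])
print(names[stable_desc_idx])
```

Either ordering is a valid descending sort; the second is only preferable when the original row order among ties should be preserved.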

**Correlation**

Compute the Pearson correlation coefficient between mpg and weight using
NumPy.

In [None]:
correlation_coefficient = np.corrcoef(mpg_array, weight_array)[0, 1]

print(f"\nPearson correlation coefficient between mpg and weight: {correlation_coefficient}")
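
`np.corrcoef` returns a 2x2 correlation matrix, and the `[0, 1]` entry is the coefficient between the two inputs. As a sanity check, the sketch below computes Pearson's r directly from its definition on made-up data and compares it with `np.corrcoef`:

```python
import numpy as np

# Made-up mpg and weight values for illustration only.
x = np.array([15.0, 20.0, 25.0, 30.0])
y = np.array([4000.0, 3500.0, 2800.0, 2200.0])

# Pearson's r: sum of products of deviations divided by the
# square root of the product of the sums of squared deviations.
x_centered = x - x.mean()
y_centered = y - y.mean()
manual_r = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered ** 2) * np.sum(y_centered ** 2)
)

numpy_r = np.corrcoef(x, y)[0, 1]

print(manual_r, numpy_r)
```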

**Conditional Aggregates**

Calculate the mean mpg for cars grouped by the number of cylinders using NumPy
techniques.

In [None]:
cylinders_array = df['cylinders'].values
mpg_array = df['mpg'].values

unique_cylinders = np.unique(cylinders_array)

mean_mpg_by_cylinders = {}

for cylinder_count in unique_cylinders:
  mask = cylinders_array == cylinder_count
  mpg_for_cylinder_count = mpg_array[mask]
  mean_mpg = np.mean(mpg_for_cylinder_count)
  # Store plain Python types so the printed dictionary is easy to read.
  mean_mpg_by_cylinders[int(cylinder_count)] = float(mean_mpg)

print("\nMean mpg for cars grouped by the number of cylinders:")
print(mean_mpg_by_cylinders)
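
The loop over unique values can be replaced by a vectorized variant. The sketch below is one possible approach, not the only one: `np.unique(..., return_inverse=True)` assigns each row a group index, and `np.bincount` sums and counts per group. The data is made up for illustration.

```python
import numpy as np

# Made-up cylinder counts and mpg values for illustration only.
cylinders = np.array([4, 8, 4, 6, 8, 4])
mpg = np.array([30.0, 14.0, 28.0, 20.0, 16.0, 32.0])

# inverse[i] is the position of cylinders[i] within unique_cylinders.
unique_cylinders, inverse = np.unique(cylinders, return_inverse=True)

# Per-group sum of mpg and per-group row count.
group_sums = np.bincount(inverse, weights=mpg)
group_counts = np.bincount(inverse)

group_means = group_sums / group_counts

mean_by_cylinders = {
    int(c): float(m) for c, m in zip(unique_cylinders, group_means)
}
print(mean_by_cylinders)
```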

**Basic Exploration**

Load the dataset into a Pandas DataFrame. Display:


*  The first 10 rows
*  The total number of rows and columns
*  Summary statistics for numerical columns

In [None]:
print("First 10 rows of the DataFrame:")
print(df.head(10))

print("\nShape of the DataFrame (rows, columns):")
print(df.shape)

print("\nSummary statistics for numerical columns:")
print(df.describe())

**Filtering and Indexing**

Find all cars manufactured in 1975 with a weight less than 3000.

Return the DataFrame with selected columns: car_name, weight, and mpg.

In [None]:
filtered_df = df[(df['model year'] == 75) & (df['weight'] < 3000)]
result_df = filtered_df[['car name', 'weight', 'mpg']]
print("\nCars manufactured in 1975 with weight less than 3000:")
result_df

**Handling Missing Data**

Identify if there are any missing values in the dataset.

Replace missing values in the horsepower column with the column's median.

In [None]:
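# Coerce first so that non-numeric placeholders (if any) are counted as missing.
# Note: if the Broadcasting or Matrix Operations cells above were already run,
# horsepower has been mean-filled and no missing values remain at this point.
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')

print("\nMissing values before handling:")
print(df.isnull().sum())

median_horsepower = df['horsepower'].median()

df['horsepower'] = df['horsepower'].fillna(median_horsepower)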

print("\nMissing values after handling:")
print(df.isnull().sum())

**Data Transformation**

Add a new column power_to_weight_ratio, calculated as horsepower / weight.

In [None]:
df['power_to_weight_ratio'] = df['horsepower'] / df['weight']
print("\nDataFrame with 'power_to_weight_ratio' column:")
print(df.head())

**Group By**

Group the cars by origin and calculate the mean mpg for each group.

In [None]:
mpg_by_origin = df.groupby('origin')['mpg'].mean()
print("\nMean mpg by origin:")
mpg_by_origin

**Sorting**

Sort the DataFrame by mpg in descending order and display the top 10 cars with
the highest mpg.

In [None]:
sorted_df_mpg = df.sort_values(by='mpg', ascending=False)
top_10_mpg = sorted_df_mpg.head(10)
print("\nTop 10 cars with the highest mpg:")
top_10_mpg

**Apply Function**

Create a new column performance_score using a custom function:

    def performance_score(row):
        return row['mpg'] * row['acceleration'] / row['weight']

Apply this function to each row and store the result in the new column.


In [None]:
def performance_score(row):
  return row['mpg'] * row['acceleration'] / row['weight']

df['performance_score'] = df.apply(performance_score, axis=1)
print("\nDataFrame with 'performance_score' column:")
print(df.head())
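
Because the formula only uses column arithmetic, the row-wise `apply` can be compared with a vectorized expression that operates on whole columns at once. The sketch below uses a tiny made-up DataFrame (`cars` is a hypothetical name, and the values are illustrative) to show that both give the same result:

```python
import pandas as pd

# Made-up rows for illustration only.
cars = pd.DataFrame({
    'mpg': [18.0, 30.0],
    'acceleration': [12.0, 16.0],
    'weight': [3600.0, 2000.0],
})

def performance_score(row):
    return row['mpg'] * row['acceleration'] / row['weight']

# Row-wise: the function is called once per row.
row_wise = cars.apply(performance_score, axis=1)

# Vectorized: the same arithmetic applied to whole columns.
vectorized = cars['mpg'] * cars['acceleration'] / cars['weight']

print(row_wise.tolist())
print(vectorized.tolist())
```

The exercise asks for `apply`, so the vectorized form is shown only as a cross-check; on large frames it is usually the faster of the two.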

**Visualization Preparation**

Generate a summary DataFrame with the average mpg, weight, and horsepower for
each model_year.

In [None]:
summary_df = df.groupby('model year')[['mpg', 'weight', 'horsepower']].mean().reset_index()
print("\nSummary DataFrame by Model Year:")
summary_df

**Exporting Data**

Save a subset of the data containing only mpg, cylinders, horsepower, and weight
for cars with mpg > 30 into a CSV file named high_mpg_cars.csv.

In [None]:
high_mpg_df = df[df['mpg'] > 30]
subset_df = high_mpg_df[['mpg', 'cylinders', 'horsepower', 'weight']]
subset_df.to_csv('high_mpg_cars.csv', index=False)

print("\nSubset data saved to high_mpg_cars.csv")
print(subset_df.head())

**Finding Anomalies**

Identify potential outliers in the mpg column using the Interquartile Range (IQR)
method. Specifically:

*   Calculate the IQR for mpg.
*   Define outliers as values less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.
*   Create a DataFrame of cars classified as outliers, displaying car_name, mpg, and model_year.

In [None]:
Q1 = df['mpg'].quantile(0.25)
Q3 = df['mpg'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_df = df[(df['mpg'] < lower_bound) | (df['mpg'] > upper_bound)]

outliers_info_df = outliers_df[['car name', 'mpg', 'model year']]

print("\nCars identified as outliers in mpg using IQR method:")
outliers_info_df
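
The IQR rule can be traced by hand on a small array. The sketch below uses NumPy and made-up values (one deliberately extreme) to show how the bounds are derived and which values fall outside them:

```python
import numpy as np

# Made-up values for illustration only; 50.0 is deliberately extreme.
values = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 50.0])

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values outside [lower_bound, upper_bound].
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, q3, iqr)
print(lower_bound, upper_bound)
print(outliers)
```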

What is the distribution of miles per gallon (mpg) in the dataset?

Plot a histogram of mpg values.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.hist(df['mpg'], bins=20, edgecolor='black')
plt.title('Distribution of Miles Per Gallon (mpg)')
plt.xlabel('mpg')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

How does mpg vary with the number of cylinders?

Use a boxplot to compare mpg across different cylinders.

In [None]:
df.boxplot(column='mpg', by='cylinders', figsize=(10, 6))
plt.title('MPG Distribution by Number of Cylinders')
plt.xlabel('Number of Cylinders')
plt.ylabel('MPG')
plt.suptitle('')
plt.show()

Is there a relationship between horsepower and mpg? Summarize your observation.

Plot a scatter plot of horsepower vs. mpg.


In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['horsepower'], df['mpg'], alpha=0.5)
plt.title('Relationship between Horsepower and MPG')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.grid(True)
plt.show()

How does car weight influence mpg?

Plot a scatter plot with a trend line for weight vs. mpg.


In [None]:
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.regplot(x='weight', y='mpg', data=df, scatter_kws={'alpha':0.5})
plt.title('Relationship between Weight and MPG with Trend Line')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.grid(True)
plt.show()

What is the trend of average mpg across model years?

Plot a line chart of average mpg per model year.


In [None]:
avg_mpg_by_year = df.groupby('model year')['mpg'].mean().reset_index()

plt.figure(figsize=(10, 6))
plt.plot(avg_mpg_by_year['model year'], avg_mpg_by_year['mpg'], marker='o')
plt.title('Trend of Average MPG Across Model Years')
plt.xlabel('Model Year')
plt.ylabel('Average MPG')
plt.grid(True)
plt.show()

How is the count of cars distributed by origin?

Use a bar chart to show the number of cars for each origin.

In [None]:
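# Sort by origin code (1, 2, 3) so the bars line up with the tick labels below;
# value_counts() on its own orders the result by frequency.
origin_counts = df['origin'].value_counts().sort_index()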

plt.figure(figsize=(10, 6))
origin_counts.plot(kind='bar', rot=0)
plt.title('Count of Cars by Origin')
plt.xlabel('Origin')
plt.ylabel('Number of Cars')
plt.xticks(ticks=[0, 1, 2], labels=['USA', 'Europe', 'Japan'])
plt.grid(axis='y', alpha=0.75)
plt.show()

How do acceleration values vary across different cylinders?

Use a boxplot of acceleration grouped by cylinders.


In [None]:
df.boxplot(column='acceleration', by='cylinders',figsize=(10, 6))
plt.title('Acceleration Distribution by Number of Cylinders')
plt.xlabel('Number of Cylinders')
plt.ylabel('Acceleration')
plt.suptitle('')
plt.show()

Which year had the most car entries?

Plot a histogram or bar chart of car counts by model year.

In [None]:
model_year_counts = df['model year'].value_counts().sort_index()

most_common_year = model_year_counts.idxmax()
most_common_year_count = model_year_counts.max()

print(f"The year with the most car entries is: {most_common_year} with {most_common_year_count} entries.")

plt.figure(figsize=(12, 6))
model_year_counts.plot(kind='bar')
plt.title('Number of Car Entries by Model Year')
plt.xlabel('Model Year')
plt.ylabel('Number of Cars')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()

Is there a clustering pattern among weight, horsepower, and mpg?

Create a 3D scatter plot of these three variables.


In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['weight'], df['horsepower'], df['mpg'], c=df['mpg'], cmap='viridis', marker='o')

ax.set_xlabel('Weight')
ax.set_ylabel('Horsepower')
ax.set_zlabel('MPG')
ax.set_title('3D Scatter Plot of Weight, Horsepower, and MPG')

plt.show()

Which 10 cars have the best fuel efficiency?

Plot a horizontal bar chart showing the top 10 car names with the highest mpg.


In [None]:
plt.figure(figsize=(12, 8))
sns.barplot(
    x='mpg',
    y='car name',
    data=top_10_mpg,
    orient='h',
    hue='car name',
    dodge=False,
    palette='viridis',
    legend=False
)
plt.title('Top 10 Cars with the Best Fuel Efficiency (MPG)')
plt.xlabel('MPG')
plt.ylabel('Car Name')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()