<a href="https://colab.research.google.com/github/devikaajay/DSA-Case-study-on-numpy-pandas/blob/main/case_study_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Basic Array Operations

Convert the mpg column into a NumPy array and calculate:

○ The mean, median, and standard deviation of mpg.
○ The number of cars with mpg greater than 25.

In [10]:
import numpy as np
import pandas as pd
df = pd.read_csv('auto-mpg.csv')
mpg_array = df['mpg'].to_numpy()  # Convert the 'mpg' column to a NumPy array
mean_mpg = np.mean(mpg_array)
median_mpg = np.median(mpg_array)
std_mpg = np.std(mpg_array)  # Population standard deviation (ddof=0, NumPy's default)
cars_greater_than_25 = np.sum(mpg_array > 25)  # True counts as 1, so the sum is a count
print("Mean MPG:", mean_mpg)
print("Median MPG:", median_mpg)
print("Standard Deviation MPG:", std_mpg)
print("Number of cars with MPG greater than 25:", cars_greater_than_25)

Mean MPG: 23.514572864321607
Median MPG: 23.0
Standard Deviation MPG: 7.806159061274433
Number of cars with MPG greater than 25: 158
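
The standard deviation reported above comes from np.std with its default ddof=0, which divides by n (population formula). If the cars are treated as a sample, ddof=1 divides by n - 1 instead. A minimal sketch of the difference, using a made-up array (`sample` is hypothetical and not taken from auto-mpg.csv):

```python
import numpy as np

# Hypothetical values for illustration only; not taken from auto-mpg.csv.
sample = np.array([18.0, 15.0, 18.0, 16.0, 17.0])

population_std = np.std(sample)       # divides by n (ddof=0, NumPy's default)
sample_std = np.std(sample, ddof=1)   # divides by n - 1 (Bessel's correction)

print("Population std:", population_std)
print("Sample std:", sample_std)
```

For a column with several hundred rows the two values differ only slightly, but the choice should match how the result is going to be interpreted.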


2. Filtering

Using NumPy, filter all cars with more than 6 cylinders.
Return the corresponding car_name as a list.

In [6]:
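# Locate the car-name column; its exact label is looked up rather than hard-coded.
car_name_column = next((col for col in df.columns if 'car' in col and 'name' in col), None)
if car_name_column is None:
    raise KeyError("Could not find a column containing 'car' and 'name' for car names.")
print(f"Using column '{car_name_column}' for car names.")

# Filter with a NumPy boolean mask, as the task asks, instead of pandas row selection.
cylinders_array = df['cylinders'].to_numpy()
car_names_array = df[car_name_column].to_numpy()
car_names_list = car_names_array[cylinders_array > 6].tolist()

print("Cars with more than 6 cylinders:")
print(car_names_list)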

Using column 'car name' for car names.
Cars with more than 6 cylinders:
['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala', 'plymouth fury iii', 'pontiac catalina', 'amc ambassador dpl', 'dodge challenger se', "plymouth 'cuda 340", 'chevrolet monte carlo', 'buick estate wagon (sw)', 'ford f250', 'chevy c20', 'dodge d200', 'hi 1200d', 'chevrolet impala', 'pontiac catalina brougham', 'ford galaxie 500', 'plymouth fury iii', 'dodge monaco (sw)', 'ford country squire (sw)', 'pontiac safari (sw)', 'chevrolet impala', 'pontiac catalina', 'plymouth fury iii', 'ford galaxie 500', 'amc ambassador sst', 'mercury marquis', 'buick lesabre custom', 'oldsmobile delta 88 royale', 'chrysler newport royal', 'amc matador (sw)', 'chevrolet chevelle concours (sw)', 'ford gran torino (sw)', 'plymouth satellite custom (sw)', 'buick century 350', 'amc matador', 'chevrolet malibu', 'ford gran torino', 'dodge coronet c

3. Statistical Analysis

Compute the 25th, 50th, and 75th percentiles of the weight column using NumPy.

In [11]:
weight_array = df['weight'].to_numpy()

percentile_25 = np.percentile(weight_array, 25)
percentile_50 = np.percentile(weight_array, 50)
percentile_75 = np.percentile(weight_array, 75)

print(f"25th Percentile (Q1): {percentile_25}")
print(f"50th Percentile (Median/Q2): {percentile_50}")
print(f"75th Percentile (Q3): {percentile_75}")

25th Percentile (Q1): 2223.75
50th Percentile (Median/Q2): 2803.5
75th Percentile (Q3): 3608.0
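
np.percentile also accepts a sequence of percentiles, so the three quartiles can be computed in one call and the interquartile range derived from them. A small sketch on made-up data (the `weights` array is hypothetical):

```python
import numpy as np

# Hypothetical weights for illustration only; not taken from auto-mpg.csv.
weights = np.array([1800, 2200, 2600, 3000, 3400])

# Passing a sequence of percentiles returns all of them in one call.
q1, q2, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)
```

Applied to the quartiles printed above, the same subtraction gives an interquartile range of 3608.0 - 2223.75 = 1384.25.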


4. Array Manipulation

Convert the acceleration column into a NumPy array and normalize its values
(scale between 0 and 1).

In [12]:
acceleration_array = df['acceleration'].to_numpy()

# Normalize the acceleration values to the range [0, 1].
min_acceleration = np.min(acceleration_array)
max_acceleration = np.max(acceleration_array)

normalized_acceleration = (acceleration_array - min_acceleration) / (max_acceleration - min_acceleration)

print("Original Acceleration Array:")
print(acceleration_array)
print("\nNormalized Acceleration Array:")
print(normalized_acceleration)

Original Acceleration Array:
[12.  11.5 11.  12.  10.5 10.   9.   8.5 10.   8.5 10.   8.   9.5 10.
 15.  15.5 15.5 16.  14.5 20.5 17.5 14.5 17.5 12.5 15.  14.  15.  13.5
 18.5 14.5 15.5 14.  19.  13.  15.5 15.5 15.5 15.5 12.  11.5 13.5 13.
 11.5 12.  12.  13.5 19.  15.  14.5 14.  14.  19.5 14.5 19.  18.  19.
 20.5 15.5 17.  23.5 19.5 16.5 12.  12.  13.5 13.  11.5 11.  13.5 13.5
 12.5 13.5 12.5 14.  16.  14.  14.5 18.  19.5 18.  16.  17.  14.5 15.
 16.5 13.  11.5 13.  14.5 12.5 11.5 12.  13.  14.5 11.  11.  11.  16.5
 18.  16.  16.5 16.  21.  14.  12.5 13.  12.5 15.  19.  19.5 16.5 13.5
 18.5 14.  15.5 13.   9.5 19.5 15.5 14.  15.5 11.  14.  13.5 11.  16.5
 17.  16.  17.  19.  16.5 21.  17.  17.  18.  16.5 14.  14.5 13.5 16.
 15.5 16.5 15.5 14.5 16.5 19.  14.5 15.5 14.  15.  15.5 16.  16.  16.
 21.  19.5 11.5 14.  14.5 13.5 21.  18.5 19.  19.  15.  13.5 12.  16.
 17.  16.  18.5 13.5 16.5 17.  14.5 14.  17.  15.  17.  14.5 13.5 17.5
 15.5 16.9 14.9 17.7 15.3 13.  13.  13.9 12.8 15.4 14.5
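
The min-max formula used in this cell divides by (max - min), which is zero when every value is identical. A hedged sketch of the same scaling wrapped in a helper that guards against that case (the function name `min_max_normalize` and the sample values are assumptions made for illustration):

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant array maps to all zeros."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # Avoid dividing by zero when every element is identical.
        return np.zeros_like(values)
    return (values - values.min()) / value_range

# Hypothetical acceleration values for illustration only.
normalized = min_max_normalize([8.0, 12.0, 16.0, 24.0])
print(normalized)
```

The smallest input maps to 0, the largest to 1, and everything else falls proportionally in between.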

5. Broadcasting

Increase all horsepower values by 10% and store the updated values in a new
NumPy array. Handle missing data (if any) by replacing it with the mean of the
column before applying the increase.

In [14]:
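df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')  # Non-numeric entries become NaN
# Assign the result back instead of calling fillna(..., inplace=True) on df['horsepower']:
# the chained in-place form triggers a pandas FutureWarning and stops working in pandas 3.0.
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())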

# Convert to NumPy array
horsepower_array = df['horsepower'].to_numpy()

# Increase horsepower by 10%
increased_horsepower = horsepower_array * 1.10

print("Original Horsepower Array:")
print(horsepower_array)
print("\nIncreased Horsepower Array:")
print(increased_horsepower)

Original Horsepower Array:
[130.         165.         150.         150.         140.
 198.         220.         215.         225.         190.
 170.         160.         150.         225.          95.
  95.          97.          85.          88.          46.
  87.          90.          95.         113.          90.
 215.         200.         210.         193.          88.
  90.          95.         104.46938776 100.         105.
 100.          88.         100.         165.         175.
 153.         150.         180.         170.         175.
 110.          72.         100.          88.          86.
  90.          70.          76.          65.          69.
  60.          70.          95.          80.          54.
  90.          86.         165.         175.         150.
 153.         150.         208.         155.         160.
 190.          97.         150.         130.         140.
 150.         112.          76.          87.          69.
  86.          92.          97.          80. 

Note: pandas emits a FutureWarning for the chained form df['horsepower'].fillna(df['horsepower'].mean(), inplace=True). The intermediate object df['horsepower'] always behaves as a copy, so from pandas 3.0 onward this in-place call will no longer update df.

Use df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean()) or df.fillna({'horsepower': df['horsepower'].mean()}, inplace=True) instead; both operate on the original object.
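
The mean-fill and the 10% increase can also be done entirely in NumPy once the missing entries are NaN. A minimal sketch, assuming a made-up array in which np.nan marks the missing value (the array `horsepower` here is hypothetical):

```python
import numpy as np

# Hypothetical horsepower values; np.nan marks a missing entry.
horsepower = np.array([130.0, np.nan, 150.0])

# np.nanmean ignores NaN when averaging.
mean_hp = np.nanmean(horsepower)

# Replace NaN with the mean, then let broadcasting apply the scalar 1.10
# to every element of the array.
filled = np.where(np.isnan(horsepower), mean_hp, horsepower)
increased = filled * 1.10

print(filled)
print(increased)
```

Multiplying by a scalar is the simplest case of broadcasting: the scalar is treated as if it had the same shape as the array.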


6. Boolean Indexing

Find the average displacement of cars with an origin of 2 (Europe) using NumPy
indexing.

In [15]:
# Convert the columns to NumPy arrays.
origin_array = df['origin'].to_numpy()
displacement_array = df['displacement'].to_numpy()
# Use boolean indexing to keep only the displacement values where origin == 2.
european_displacements = displacement_array[origin_array == 2]
# Calculate the average displacement.
average_european_displacement = np.mean(european_displacements)
print(f"Average displacement of European cars: {average_european_displacement}")

Average displacement of European cars: 109.14285714285714
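
To make the mask itself visible, here is a small sketch on made-up data (the arrays `origin` and `displacement` are hypothetical). It also shows np.mean's `where` argument, available in NumPy 1.20 and later, as an alternative that skips the intermediate filtered array:

```python
import numpy as np

# Hypothetical data for illustration only; not taken from auto-mpg.csv.
origin = np.array([2, 1, 2, 3])
displacement = np.array([100.0, 300.0, 120.0, 90.0])

# Comparing an array with a scalar yields one boolean per element.
mask = origin == 2

# Indexing with the mask keeps only the elements where it is True.
mean_indexed = displacement[mask].mean()

# The `where` argument averages only the selected elements in place.
mean_where = np.mean(displacement, where=mask)

print(mask)
print(mean_indexed, mean_where)
```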


7. Matrix Operations

Create a 2D NumPy array containing the columns mpg, horsepower, and weight.
Compute the dot product of this matrix with a given vector [1, 0.5, -0.2].

In [19]:
# Make sure horsepower is numeric and has no missing values before building the matrix.
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
selected_columns = df[['mpg', 'horsepower', 'weight']].to_numpy()  # Create the 2D NumPy array
vector = np.array([1, 0.5, -0.2])  # Define the vector
dot_product = np.dot(selected_columns, vector)  # Matrix-vector product: one value per car

print("2D NumPy Array:")
print(selected_columns)
print("\nVector:")
print(vector)
print("\nDot Product:")
print(dot_product)

2D NumPy Array:
[[  18.  130. 3504.]
 [  15.  165. 3693.]
 [  18.  150. 3436.]
 ...
 [  32.   84. 2295.]
 [  28.   79. 2625.]
 [  31.   82. 2720.]]

Vector:
[ 1.   0.5 -0.2]

Dot Product:
[-617.8        -641.1        -594.2        -595.6        -602.8
 -754.2        -746.8        -740.9        -758.5        -660.
 -612.6        -627.8        -662.2        -490.7        -402.9
 -497.1        -488.3        -453.9        -355.         -318.
 -465.9        -417.         -402.5        -364.3        -463.6
 -805.5        -765.2        -760.4        -840.9        -355.
 -379.8        -373.1        -331.96530612 -457.8        -619.3
 -598.8        -597.4        -589.6        -745.3        -791.3
 -740.3        -730.2        -889.         -851.2        -927.5
 -519.4        -423.6        -587.4        -565.8        -378.
 -351.6        -349.8        -345.         -291.1        -253.1
 -309.8        -330.         -384.1        -360.2        -400.8
 -416.6        -381.2        -759.3        -775.
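
Each entry of the dot product is a weighted sum of one row: mpg * 1 + horsepower * 0.5 + weight * (-0.2). The sketch below rebuilds only the first three rows printed above and checks them by hand; the `@` operator is used here as an equivalent of np.dot for this 2D-by-1D case:

```python
import numpy as np

# First three rows of the [mpg, horsepower, weight] matrix printed above.
rows = np.array([
    [18.0, 130.0, 3504.0],
    [15.0, 165.0, 3693.0],
    [18.0, 150.0, 3436.0],
])
vector = np.array([1, 0.5, -0.2])

# For a 2D matrix and a 1D vector, `@` gives the same result as np.dot.
result = rows @ vector

# Row 0 by hand: 18*1 + 130*0.5 + 3504*(-0.2) = 18 + 65 - 700.8 = -617.8
print(result)
```

The three results agree with the first three values of the Dot Product output above: -617.8, -641.1, and -594.2.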

8. Sorting

Use NumPy to sort the cars by model_year in descending order and display the first
five car names.

In [22]:
model_year_array = df['model year'].to_numpy()
car_name_array = df['car name'].to_numpy()
sorted_indices = np.argsort(model_year_array)[::-1]  # argsort is ascending; [::-1] reverses it to descending
sorted_car_names = car_name_array[sorted_indices]

print("First five car names sorted by model_year (descending):")
print(sorted_car_names[:5])

First five car names sorted by model_year (descending):
['dodge aries se' 'pontiac phoenix' 'pontiac j2000 se hatchback'
 'chevrolet cavalier 2-door' 'chevrolet cavalier wagon']
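
Many cars share a model year, and reversing an ascending argsort also reverses the order of those ties. One alternative, sketched here on made-up data, is to sort the negated years with kind="stable" so that cars from the same year keep their original row order. The arrays `model_year` and `car_name` are hypothetical, and on the real dataset this approach may list a different five cars from the output above:

```python
import numpy as np

# Hypothetical data for illustration only; not taken from auto-mpg.csv.
model_year = np.array([70, 82, 71, 82])
car_name = np.array(["car a", "car b", "car c", "car d"])

# Negating the keys turns an ascending sort into a descending one;
# kind="stable" keeps tied years in their original row order.
order = np.argsort(-model_year, kind="stable")
sorted_names = car_name[order]

print(order)
print(sorted_names)
```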


9. Correlation

Compute the Pearson correlation coefficient between mpg and weight using
NumPy.

In [23]:
mpg_array = df['mpg'].to_numpy()
weight_array = df['weight'].to_numpy()
correlation_coefficient = np.corrcoef(mpg_array, weight_array)[0, 1]

print(f"Pearson Correlation Coefficient between MPG and Weight: {correlation_coefficient}")

Pearson Correlation Coefficient between MPG and Weight: -0.8317409332443352
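
np.corrcoef returns a 2x2 correlation matrix, and the [0, 1] entry is the coefficient between the two inputs. As a cross-check, the same value can be computed from the definition of Pearson's r. A minimal sketch on made-up data (the arrays `x` and `y` are hypothetical):

```python
import numpy as np

# Hypothetical data for illustration only; not taken from auto-mpg.csv.
x = np.array([30.0, 25.0, 20.0, 15.0, 14.0])
y = np.array([2000.0, 2500.0, 3000.0, 3600.0, 4200.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

A value close to -1, like the one reported above for mpg and weight, indicates a strong negative linear relationship: heavier cars tend to have lower mpg.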


10. Conditional Aggregates

Calculate the mean mpg for cars grouped by the number of cylinders using NumPy
techniques.

In [27]:
cylinders_array = df['cylinders'].to_numpy()
mpg_array = df['mpg'].to_numpy()
unique_cylinders = np.unique(cylinders_array)

mean_mpg_by_cylinders = {}  # Maps each cylinder count to its mean mpg
for cylinder in unique_cylinders:
    mean_mpg = np.mean(mpg_array[cylinders_array == cylinder])
    mean_mpg_by_cylinders[cylinder] = mean_mpg

print("Mean MPG by Cylinder:")
for cylinder, mean_mpg in mean_mpg_by_cylinders.items():
    print(f"Cylinders {cylinder}: Mean MPG = {mean_mpg}")

Mean MPG by Cylinder:
Cylinders 3: Mean MPG = 20.55
Cylinders 4: Mean MPG = 29.28676470588235
Cylinders 5: Mean MPG = 27.366666666666664
Cylinders 6: Mean MPG = 19.985714285714284
Cylinders 8: Mean MPG = 14.963106796116506
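
The loop above builds one boolean mask per cylinder count. A loop-free alternative, sketched here on made-up data, combines np.unique(return_inverse=True) with np.bincount to compute all group means at once (the arrays `cylinders` and `mpg` are hypothetical):

```python
import numpy as np

# Hypothetical data for illustration only; not taken from auto-mpg.csv.
cylinders = np.array([4, 8, 4, 6, 8])
mpg = np.array([30.0, 14.0, 28.0, 20.0, 16.0])

# `inverse` maps each row to the index of its group in `groups`.
groups, inverse = np.unique(cylinders, return_inverse=True)

# Sum mpg per group, then divide by the group sizes.
group_sums = np.bincount(inverse, weights=mpg)
group_counts = np.bincount(inverse)
group_means = group_sums / group_counts

for cyl, mean in zip(groups, group_means):
    print(f"Cylinders {cyl}: Mean MPG = {mean}")
```

Both approaches give the same means; the bincount version avoids rescanning the array once per group, which matters mainly when there are many distinct groups.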
