

---



## 1. Importing the required libraries for EDA

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline
sns.set(color_codes=True)



---



## 2. Loading the data into the data frame.

In [None]:
df = pd.read_csv("sample_data/EDAdata.csv")
# To display the top 5 rows
df.head(5)

In [None]:
df.tail(5)                        # To display the bottom 5 rows



---



## 3. Checking the types of data

In [None]:
df.dtypes
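If `df.dtypes` reports a column such as HP or Price as `object`, the column probably holds text rather than numbers. The following is a minimal sketch of converting such a column with `pd.to_numeric`; the values in `raw_hp` are made up for illustration and are not taken from EDAdata.csv.

```python
import pandas as pd

# Hypothetical text column, standing in for a column read as strings
raw_hp = pd.Series(["335", "300", "not available"], name="HP")

# errors="coerce" turns entries that cannot be parsed into NaN
# instead of raising an exception
hp = pd.to_numeric(raw_hp, errors="coerce")

print(hp)
print("unparseable entries:", int(hp.isna().sum()))
```

Entries coerced to NaN then show up in the missing-value check in section 7.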



---



## 4. Dropping irrelevant columns

In [None]:
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)



---



## 5. Renaming the columns

In [None]:
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "DriveMode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)



---



## 6. Dropping the duplicate rows

In [None]:
df.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows:", duplicate_rows_df.shape[0])

In [None]:
df.count()      # Counts the non-null values in each column

In [None]:
df = df.drop_duplicates()
df.head(5)

In [None]:
df.count()
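How `duplicated()` and `drop_duplicates()` behave is easier to see on a small frame. This is a sketch on made-up rows (the `toy` DataFrame is hypothetical, not part of the dataset):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from EDAdata.csv)
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Audi"],
    "Year": [2011, 2011, 2012, 2013],
    "Price": [46135, 46135, 30000, 30000],
})

# duplicated() marks a row True when an identical row appeared earlier
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("rows before:", len(toy), "rows after:", len(deduped))
```

Only rows that match in every column count as duplicates, so the two Audi rows, which differ in Year, are both kept.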



---



## 7. Dropping the missing or null values.

In [None]:
print(df.isnull().sum())

In [None]:
df = df.dropna()    # Dropping the missing values.
df.count()

In [None]:
print(df.isnull().sum())   # After dropping the values
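Because `dropna()` removes every row that has at least one missing value, it can help to check how many rows are lost. A minimal sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values, not taken from EDAdata.csv)
toy = pd.DataFrame({
    "HP": [335.0, np.nan, 230.0, 300.0],
    "Cylinders": [6.0, 6.0, np.nan, 4.0],
    "Price": [46135, 40650, 36350, 29450],
})

# Missing values per column
missing_per_column = toy.isnull().sum()

# dropna() removes a row if any of its columns is missing
cleaned = toy.dropna()
rows_lost = len(toy) - len(cleaned)

print(missing_per_column)
print("rows lost:", rows_lost)
```

Here each of two columns has a single missing value, but the gaps sit in different rows, so two rows are dropped.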



---



## 8. Detecting Outliers

In [None]:
sns.boxplot(x=df['Cylinders'])

In [None]:
#df['HP'] = pd.to_numeric(df['HP'])
sns.boxplot(x=df['HP'])

In [None]:
df['Price'] = pd.to_numeric(df['Price'])
sns.boxplot(x=df['Price'])

In [None]:
# First and third quartiles of Price, and the interquartile range
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

The code snippet calculates the interquartile range (IQR), a measure of statistical dispersion that is often used to identify outliers in a dataset.

Q1 = df['Price'].quantile(0.25): df['Price'] refers to the "Price" column of the DataFrame df. .quantile(0.25) calculates the first quartile (Q1), which is the 25th percentile of the data: 25% of the data points are below this value.

Q3 = df['Price'].quantile(0.75): .quantile(0.75) calculates the third quartile (Q3), which is the 75th percentile of the data: 75% of the data points are below this value, and 25% are above.

IQR = Q3 - Q1: the IQR is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). This gives the width of the range within which the middle 50% of the data lies. The IQR is useful for detecting outliers: values that fall below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$ are considered potential outliers.

print(IQR): prints the value of the IQR, which represents the spread of the middle 50% of the data.
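The calculation described above can be followed by hand on a small series. This is a sketch with made-up numbers (the `prices` values are hypothetical), using the default linear interpolation of `quantile`:

```python
import pandas as pd

# Nine made-up prices, one of them far from the rest
prices = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 500])

q1 = prices.quantile(0.25)   # first quartile
q3 = prices.quantile(0.75)   # third quartile
iqr = q3 - q1                # spread of the middle 50%

# Values outside these two bounds are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("bounds:", lower, upper)
print("potential outliers:", outliers.tolist())
```

With these numbers Q1 is 30 and Q3 is 70, so the IQR is 40 and the bounds are -30 and 130; only the value 500 falls outside them.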

In [None]:
numerical_df = df.select_dtypes(include=[np.number])

# Outlier bounds for Price, built from Q1, Q3 and IQR computed above
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Keep only rows whose Price lies within the bounds
numerical_df = numerical_df[numerical_df['Price'].between(lower, upper)]
print('numerical_df shape:', numerical_df.shape)
print('df shape:', df.shape)

This code first creates a new DataFrame, numerical_df, containing only the numeric columns. It then filters out rows with outliers based on the IQR of the "Price" column. Finally, it prints the shapes of numerical_df and the original df, so the number of rows removed as outliers can be read from the difference in row counts. df itself is left unchanged.
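The filter above uses bounds derived from Price only. A possible variant, shown here as a sketch rather than as the notebook's method, computes separate bounds for each numeric column and drops a row if any of its values is out of range. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy numeric data (hypothetical): HP contains one extreme value
toy = pd.DataFrame({
    "HP":    [100, 110, 120, 130, 140, 150, 160, 170, 900],
    "Price": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# Column-wise quartiles: each result is a Series indexed by column name
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparing a DataFrame with a Series aligns the Series on the columns,
# so every column is checked against its own bounds
outlier_mask = ((toy < lower) | (toy > upper)).any(axis=1)
filtered = toy[~outlier_mask]

print("rows before:", len(toy), "rows after:", len(filtered))
```

This variant removes more rows than a Price-only filter whenever other columns contain extreme values, so which one is appropriate depends on the analysis.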

In [None]:
df.head(5)

In [None]:
numerical_df.head()



---



## 9. Plot different features against one another (scatter), against frequency (histogram)



### Histogram


In [None]:
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');

### Scatterplot

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['Year'], df['Price'])
ax.set_xlabel('Year')
ax.set_ylabel('Price')
plt.show()

In [None]:

fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(numerical_df['HP'], numerical_df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()