# Today's Session: Exploratory Data Analysis (EDA)


Today's Agenda:
1. Data Exploration
2. Data Cleaning 
3. Data Visualization
4. Data Transformation 


## 1. Data Exploration

### Topic
1. Data Types
2. Data Dimensions
3. Data Summary
4. Checking for Missing Values
5. Checking for Duplicates
6. Checking for Outliers
7. Checking for Unique Values
8. Checking for Distribution of Data (Skewness)
9. Checking for Value Counts


## 2. Data Cleaning

### Topic 
1. Transforming Data Types (Categorical to Numerical)
2. Handling Missing Values
3. Handling Duplicates
4. Handling Outliers

## 3. Data Visualization

### Different Types of Plots

1. Line Plot
2. Scatter Plot
3. Histogram
4. Box Plot
5. Bar Plot
6. Pie Chart
7. Stacked Bar Plot
8. Stacked Area Plot
9. Density Plot
10. Distribution Plot


### How are these plots useful in EDA? 
1. Univariate Analysis
     - Continuous Variables
         - Distribution Plot
         - Histogram
         - Box Plot
         - Density Plot
         - Scatter Plot
     - Categorical Variables
         - Count Plot
         - Bar Plot
         - Pie Chart

2. Bivariate Analysis
    - Numerical
         - Scatter Plot
    - Categorical
         - Bar Plot
         - Scatter Plot
3. Multivariate Analysis
     - Correlation Matrix
     - Heatmap
     - Pair Plot


## 4. Data Transformation

### Topic
1. Scaling [Min-Max, Standardization, Normalization]
2. Normalization [Log, Square Root, Cube Root]
3. Standardization [Z-Score, Min-Max]
4. Binning 
5. Encoding [Label Encoding, One Hot Encoding]
6. Feature Engineering [Feature Extraction, Feature Generation, Feature Transformation]
7. Feature Selection

## 5. Data Preprocessing

### Topic
1. Train Test Split 












In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Read in the data
df = pd.read_csv('./Datasets/data.csv')

In [None]:
df.head()

## 1. Data Exploration

### Topic

1. Data Dimensions
2. Data Summary
8. Data Types
7. Checking for Value Counts
6. Checking for Unique Values
3. Checking for Missing Values
4. Checking for Duplicates
5. Checking for Outliers
9. Checking for Distribution of Data (Skewness)


In [None]:
#check the shape of the data
df.shape

In [None]:
#check the columns
df.columns

In [None]:
#describe the data
df.describe()

In [None]:
#check the data types
df.dtypes

In [None]:
#describe all the columns
df.describe(include='all')

In [None]:
#describe a single column
df['Year'].describe()

In [None]:
#lets understand the data column wise 
#value counts
df['Make'].value_counts()

In [None]:
df['Make'].value_counts().index

In [None]:
#make a bar plot
plt.bar(df['Make'].value_counts().index, df['Make'].value_counts())

In [None]:
#make a horizontal bar plot
plt.barh(df['Make'].value_counts().index, df['Make'].value_counts())

In [None]:
#understanding matplotlib
# fig = plt.figure(figsize=(10,5))
fig, ax = plt.subplots(figsize=(10,15))
ax.barh(df['Make'].value_counts().index, df['Make'].value_counts())
plt.show()

In [None]:
# plt.figure(figsize=(10,15))
fig, ax = plt.subplots(figsize=(10,15))
# bars = ax.barh(indexes, values)

bars = ax.barh(df['Make'].value_counts().index,df['Make'].value_counts().values)
ax.bar_label(bars)
plt.show()

In [None]:
#second column
df['Model'].value_counts()[:10]

In [None]:
# plt.figure(figsize=(10,15))
fig, ax = plt.subplots(figsize=(10,15))
# bars = ax.barh(indexes, values)

bars = ax.barh(df['Model'].value_counts()[:10].index,df['Model'].value_counts()[:10].values)
ax.bar_label(bars)
plt.show()

In [None]:
#third column
df['Year'].value_counts()

In [None]:
df['Year'].nunique()

In [None]:
#get min and max year
df['Year'].min(), df['Year'].max()

In [None]:
df

In [None]:
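#yearDist is created by the histogram cell below (counts, bin edges, bar container)
#run that cell first, then uncomment the next line to inspect it
# yearDist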

In [None]:
#lets see the distribution of the year
fig, ax = plt.subplots(figsize=(10,10))
yearDist = ax.hist(df['Year'], bins=28, orientation='horizontal')
ax.bar_label(yearDist[2])
plt.show()

In [None]:
#value counts for Engine Fuel Type
df['Engine Fuel Type'].value_counts()

In [None]:
#fourth column
df['Engine HP'].value_counts()

In [None]:
#line plot
fig, ax = plt.subplots(figsize=(10,10))
ax.plot(df['Engine HP'])
plt.show()


In [None]:
#fifth column
df['Engine Cylinders'].value_counts()

In [None]:
#sixth column
df['Transmission Type'].value_counts()

In [None]:
#seventh column
df['Driven_Wheels'].value_counts()


In [None]:
#eighth column
df['Number of Doors'].value_counts()


In [None]:
#ninth column
df['Market Category'].value_counts()

In [None]:
#tenth column
df['Vehicle Size'].value_counts()

In [None]:
#eleventh column
df['Vehicle Style'].value_counts()


In [None]:
df['highway MPG'].max()

In [None]:
#inspect the row(s) holding the maximum highway MPG
df[df['highway MPG'] == df['highway MPG'].max()]

In [None]:
#value counts for highway MPG
df['highway MPG'].value_counts()

In [None]:
#thirteenth column
df['city mpg'].value_counts()


In [None]:
#fourteenth column
df['Popularity'].value_counts()

In [None]:
#fifteenth column
df['MSRP'].value_counts()

In [None]:
df.describe(include='all')

In [None]:
df.dtypes

In [None]:
df['Engine Fuel Type'].isnull()

In [None]:
#Check for missing values
df.isnull().sum()
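
Raw counts are easier to judge next to percentages, since the later fill-or-drop decisions depend on how much of each column is missing. A minimal sketch on made-up data; `toy` and the helper `missing_summary` are hypothetical names, not part of the session's dataset:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# toy data standing in for the real dataset (assumption)
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Engine HP": [100.0, np.nan, 250.0, np.nan],
    "Number of Doors": [2.0, 4.0, np.nan, 4.0],
})
report = missing_summary(toy)
print(report)
```

On the real data the same helper could be called as `missing_summary(df)`.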

In [None]:
df['Engine Fuel Type'].value_counts()

In [None]:
df[df['Engine Fuel Type'].isnull()]

In [None]:
# get all cars of model Verona from Suzuki
df[(df['Make']=='Suzuki') & (df['Model']=='Verona')]

In [None]:
df[(df['Make']=='Suzuki') & (df['Model']=='Verona')]['Engine Fuel Type'].mode()

In [None]:
#get mode of Engine Fuel Type
df['Engine Fuel Type'].mode()[0]

In [None]:
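# fill the missing Engine Fuel Type values with the mode of the column
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['Engine Fuel Type'] = df['Engine Fuel Type'].fillna(df['Engine Fuel Type'].mode()[0])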

In [None]:
df.isnull().sum()

In [None]:
#check the cars with missing Engine HP
df[df['Engine HP'].isnull()]

In [None]:
#check the percentage of missing values
df['Engine HP'].isnull().sum()/df.shape[0]*100

In [None]:
#drop the rows with missing Engine HP
df.dropna(subset=['Engine HP'], inplace=True)

In [None]:
df.isnull().sum()

In [None]:
#check the cars with missing Engine Cylinders
df[df['Engine Cylinders'].isnull()]

In [None]:
# get all chevrolet cars and Bolt EV model
df[(df['Make']=='Chevrolet') & (df['Model']=='Bolt EV')]

In [None]:
# get all Volkswagen cars and e-Golf model
df[(df['Make']=='Volkswagen') & (df['Model']=='e-Golf')]

In [None]:
# get all unique values of Make and Model in df[df['Engine Cylinders'].isnull()]
df[df['Engine Cylinders'].isnull()][['Make', 'Model']].drop_duplicates()

In [None]:
for x, y in df[df['Engine Cylinders'].isnull()][['Make', 'Model']].drop_duplicates().values:
    print(x, y)
    print(df[(df['Make']==x) & (df['Model']==y)]['Engine Cylinders'])
    print()

In [None]:
#check the percentage of missing values
df['Engine Cylinders'].isnull().sum()/df.shape[0]*100

In [None]:
#drop the rows with missing Engine Cylinders
df.dropna(subset=['Engine Cylinders'], inplace=True)

In [None]:
df.isnull().sum()

In [None]:
#check the cars with missing Number of Doors
df[df['Number of Doors'].isnull()]

In [None]:
# check for Ferrari and FF model
df[(df['Make']=='Ferrari') & (df['Model']=='FF')]

In [None]:
#fill the missing values with 2 (assign the result back instead of using inplace=True on a column)
df['Number of Doors'] = df['Number of Doors'].fillna(2)

In [None]:
df.isnull().sum()

In [None]:
#check the cars with missing Market Category
df[df['Market Category'].isnull()]

In [None]:
#drop the rows with missing Market Category
df.dropna(subset=['Market Category'], inplace=True)

In [None]:
df.isnull().sum()

In [None]:
#check shape of the data
df.shape

6. Checking for Unique Values
3. Checking for Missing Values
4. Checking for Duplicates
5. Checking for Outliers
9. Checking for Distribution of Data (Skewness)

In [None]:
#check for duplicates
df.duplicated().sum()

In [None]:
df[df.duplicated()]

In [None]:
df[df.duplicated(keep=False)] #keep=False will show all the duplicates

In [None]:
#drop the duplicates
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

In [None]:
df.shape
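
The difference between the default `duplicated()` and `duplicated(keep=False)` is easiest to see on a tiny frame. A minimal sketch on made-up data (`toy` is a hypothetical stand-in, not the session's dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["A", "A", "B", "A"],
    "Model": ["x", "x", "y", "x"],
})

# default keep='first': only the later copies are flagged
later_copies = toy.duplicated().sum()
# keep=False: every row that has a twin is flagged, including the first one
all_copies = toy.duplicated(keep=False).sum()

# drop_duplicates keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(later_copies, all_copies, len(deduped))
```

Resetting the index after dropping rows is optional; it simply avoids gaps in the row labels.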

In [None]:
df['MSRP']

In [None]:
#What are outliers?
#Outliers are data points that lie far away from the rest of the data.
#Why do we need to handle outliers?
#Outliers can distort summary statistics such as the mean and standard deviation,
#and they can skew any model fitted to the data.

#methods to detect outliers
#1. Box plot
#2. Scatter plot
#3. IQR
#4. Z-score


#1. Box plot
#box plot is a graphical representation of numerical data through their quartiles.

#lets see the box plot of MSRP
plt.boxplot(df['MSRP'])
plt.show()

In [None]:
#lets see the box plot of Engine HP
plt.boxplot(df['Engine HP'])
plt.show()

In [None]:
df['Engine Cylinders'].value_counts()

In [None]:
#lets see the box plot of Engine Cylinders
plt.boxplot(df['Engine Cylinders'])
plt.show()

In [None]:
df['Engine Cylinders'].value_counts()

In [None]:
df.dtypes

In [None]:
#category columns
cat_cols = df.select_dtypes(include='object').columns
cat_cols

In [None]:
#numerical columns
num_cols = df.select_dtypes(exclude='object').columns
num_cols

In [None]:
#lets see the box plot of all numerical columns
for col in num_cols:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()


In [None]:
plt.boxplot(df[df['MSRP']<80000]['MSRP'])
plt.show()

In [None]:
#2. Scatter plot
#scatter plot is a graphical representation of the relationship between two numerical variables.

#how does a scatter plot help in detecting outliers?
#points that sit far from the main cloud of points stand out visually as potential outliers.

#lets see the scatter plot of MSRP and Engine HP
plt.scatter(df['MSRP'], df['Engine HP'])
# plt.title('MSRP vs Engine HP')
# plt.xlabel('MSRP')
# plt.ylabel('Engine HP')
plt.show()


In [None]:
#3. IQR
#IQR is the interquartile range.
#IQR = Q3 - Q1
#Q1 = 25th percentile
#Q3 = 75th percentile


#writing a function to detect outliers
def detect_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    outliers = df[(df[col]<lower_bound) | (df[col]>upper_bound)]
    return outliers.shape[0]


In [None]:
# lets see the outliers in MSRP
detect_outliers(df, 'MSRP')
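
Once outliers are counted, one common way to handle them without dropping rows is to cap values at the IQR bounds. A minimal sketch, assuming the same 1.5 × IQR rule as above; `cap_outliers` and the toy `prices` series are hypothetical, and capping is only one option among several (dropping or transforming are others):

```python
import pandas as pd

def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# toy data: five ordinary values and one extreme value
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 500.0])
capped = cap_outliers(prices)
print(capped.tolist())
```

Here Q1 is 12.5 and Q3 is 17.5, so the upper bound is 25 and only the extreme value is changed.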

In [None]:
for col in num_cols:
    print(col, detect_outliers(df, col))
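
Z-score detection is a common complement to the IQR rule: a value is flagged when it lies more than a chosen number of standard deviations from the mean. A minimal sketch on made-up data; the helper `zscore_outliers` and the threshold of 3 are assumptions for illustration, not something the session defines:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the values whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]

# toy data: 20 ordinary values plus one extreme value
values = pd.Series([100.0] * 10 + [110.0] * 10 + [1000.0])
flagged = zscore_outliers(values)
print(flagged)
```

Unlike the IQR rule, the z-score uses the mean and standard deviation, which are themselves pulled by extreme values, so it tends to be less reliable on small or heavily skewed columns.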

In [None]:
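#check the distribution of data (skewness)
#numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas
df.skew(numeric_only=True)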

In [None]:
#plot the distribution of data (skewness)
for col in num_cols:
    sns.displot(df[col], kde=True)
    plt.title(col)
    plt.show()
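
A right-skewed column such as a price is often made more symmetric with a log transform, which is the "Log" option listed under Data Transformation. A minimal sketch on made-up prices (the `prices` values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# toy right-skewed prices: many moderate values, a few very expensive ones
prices = pd.Series([20_000, 22_000, 25_000, 27_000, 30_000,
                    32_000, 35_000, 40_000, 250_000, 1_500_000], dtype=float)

skew_before = prices.skew()
# log1p computes log(1 + x), which stays defined when a value is 0
skew_after = np.log1p(prices).skew()
print(round(skew_before, 2), round(skew_after, 2))
```

The transformed column is still right-skewed here, but noticeably less so than the raw values.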

### How are these plots useful in EDA? 
1. Univariate Analysis
     - Continuous Variables
         - Distribution Plot 
         - Box Plot
         - Scatter Plot  
     - Categorical Variables
         - Count Plot
         - Bar Plot
         - Pie Chart

2. Bivariate Analysis
    - Numerical
         - Scatter Plot
    - Categorical
         - Bar Plot
         - Scatter Plot
3. Multivariate Analysis
     - Correlation Matrix
     - Heatmap
     - Pair Plot

# Univariate Analysis

## Continuous Variables
What is a continuous variable?
- A continuous variable is a variable that can take any value within a range.
- For example, the height of a person can be any value between 0 and 7 feet.

## Categorical Variables
What is a categorical variable?
- A categorical variable is a variable that can take only a limited number of values.
- For example, Transmission Type and Vehicle Size in this dataset.
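
In practice the split is usually made from the column dtypes, with a second look at numeric columns that have very few distinct values. A minimal sketch on made-up data; `toy` and the cutoff of 2 distinct values are assumptions chosen only to make the example small:

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["A", "B", "A"],
    "Engine HP": [150.0, 300.0, 210.0],
    "Number of Doors": [2, 4, 4],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# a numeric column with very few distinct values often behaves like a category
low_cardinality = [c for c in numeric_cols if toy[c].nunique() <= 2]
print(numeric_cols, other_cols, low_cardinality)
```

A column like Number of Doors is stored as a number but is better explored with a count plot than with a distribution plot.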

In [None]:
cat_cols

In [None]:
num_cols

In [None]:
#lets see the distribution of data for all numerical columns
for col in num_cols:
    sns.displot(df[col], kde=True)
    plt.title(col)
    plt.show()

In [None]:
for col in num_cols:
    sns.boxplot(x=col, data=df)
    plt.title(col)
    plt.show()

In [None]:
#scatter plot of MSRP
sns.scatterplot(x=[i for i in range(df.shape[0])], y='MSRP', data=df)
plt.show()

In [None]:
df.columns

In [None]:
#scatter plot of Engine HP
sns.scatterplot(x=[i for i in range(df.shape[0])], y='Engine HP', data=df)
plt.show()

In [None]:
#lets see the distribution of data for all categorical columns
for col in cat_cols:
    sns.countplot(x=df[col])
    plt.title(col)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
#Pie chart of Transmission Type
plt.pie(df['Transmission Type'].value_counts(), labels=df['Transmission Type'].value_counts().index, autopct='%1.1f%%')
plt.show()

In [None]:
#Pie chart of Vehicle Size
plt.pie(df['Vehicle Size'].value_counts(), labels=df['Vehicle Size'].value_counts().index, autopct='%1.1f%%')
plt.show()

In [None]:
#histogram of Engine HP
plt.hist(df['Engine HP'], bins=20)
plt.show()

In [None]:
df.columns

# Bivariate Analysis
- Numerical
    - Scatter Plot
- Categorical
    - Bar Plot
    - Scatter Plot

In [None]:
# scatter plot of MSRP and Engine HP
plt.scatter(df['MSRP'], df['Engine HP'])
plt.show()

In [None]:
#observation: cars with more than 1000 HP are priced above 15 lakh (1.5 million)

In [None]:
#scatter plot of MSRP and Engine Cylinders
plt.scatter(df['MSRP'], df['Engine Cylinders'])
plt.show()

In [None]:
#scatter plot of MSRP and Number of Doors
plt.scatter(df['MSRP'], df['Number of Doors'])
plt.show()

In [None]:
#scatter plot of MSRP and Highway MPG
plt.scatter(df['MSRP'], df['highway MPG'])
plt.show()

In [None]:
df.columns

In [None]:
group = df.groupby('Make')['MSRP'].mean().sort_values(ascending=False).head(10)
group

In [None]:
#group by Vehicle Style and mean of Popularity
group = df.groupby('Vehicle Style')['Popularity'].mean().sort_values(ascending=False).head(10)
group

In [None]:
#group by Vehicle Size and mean of Popularity
group = df.groupby('Vehicle Size')['Popularity'].mean().sort_values(ascending=False).head(10)
group

In [None]:
#value counts of Vehicle Size
df['Vehicle Size'].value_counts()

In [None]:
#groupby Make and mean of MSRP
group = df.groupby('Make')['MSRP'].mean().sort_values(ascending=False).head(10).astype(int)
group

In [None]:
#groupby Make and year and mean of MSRP
group = df.groupby(['Make', 'Year'])['MSRP'].mean().sort_values(ascending=False).head(10).astype(int)
group

In [None]:
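#bar plot of mean MSRP per Make
#(passing the raw rows to plt.bar draws one overlapping bar per car, which is slow and misleading)
mean_msrp = df.groupby('Make')['MSRP'].mean().sort_values(ascending=False)
plt.bar(mean_msrp.index, mean_msrp.values)
plt.xticks(rotation=90)
plt.show()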

# Multivariate Analysis
- Correlation Matrix
- Heatmap
- Pair Plot

In [None]:
#correlation matrix (numeric columns only; text columns cannot be correlated)
df.corr(numeric_only=True)

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
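
When one column is the quantity of interest, it can help to read a single column of the correlation matrix sorted by strength instead of scanning the whole heatmap. A minimal sketch on made-up data; treating `MSRP` as the target is an assumption for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D", "E"],          # non-numeric, excluded by numeric_only
    "Engine HP": [100, 150, 200, 300, 400],
    "city mpg": [35, 30, 25, 18, 15],
    "MSRP": [15_000, 22_000, 30_000, 48_000, 70_000],
})

corr = toy.corr(numeric_only=True)
# correlation of every numeric feature with the target, strongest first
with_target = corr["MSRP"].drop("MSRP").sort_values(key=abs, ascending=False)
print(with_target)
```

Sorting by absolute value keeps strong negative relationships, such as fuel economy against price in this toy frame, near the top.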

In [None]:
#pair plot
sns.pairplot(df)
plt.show()

## 4. Data Transformation

### Topic
1. Scaling [Min-Max, Standardization, Normalization]
5. Encoding [Label Encoding, One Hot Encoding]
6. Feature Engineering [Feature Extraction, Feature Generation, Feature Transformation]
7. Feature Selection

In [None]:
#import minmax scaler and standard scaler
from sklearn.preprocessing import MinMaxScaler, StandardScaler

#instantiate the scaler
scaler = MinMaxScaler()

#fit the scaler to the data
scaler.fit(df[num_cols])

#transform the data
scaled_data = scaler.transform(df[num_cols])

#keep the original row labels so df2 lines up with df after the earlier row drops
df2 = pd.DataFrame(scaled_data, columns=num_cols, index=df.index)
df2.head()

In [None]:

#instantiate the scaler
scaler = StandardScaler()

#fit the scaler to the data
scaler.fit(df[num_cols])

#transform the data
scaled_data = scaler.transform(df[num_cols])

#keep the original row labels so df2 lines up with df after the earlier row drops
df2 = pd.DataFrame(scaled_data, columns=num_cols, index=df.index)
df2.head()
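
The two scalers apply simple formulas that can be reproduced by hand, which makes their output easier to interpret. A minimal sketch on a made-up column (the `hp` values are hypothetical):

```python
import pandas as pd

hp = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0], name="Engine HP")

# min-max scaling: (x - min) / (max - min), result lies in [0, 1]
min_max = (hp - hp.min()) / (hp.max() - hp.min())

# standardization: (x - mean) / std, result has mean 0 and unit variance
# ddof=0 gives the population standard deviation, which StandardScaler also uses
standardized = (hp - hp.mean()) / hp.std(ddof=0)

print(min_max.tolist())
print(standardized.round(3).tolist())
```

Min-max scaling is sensitive to extreme values because the minimum and maximum define the range, so it is usually applied after outliers have been handled.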

In [None]:
#import label encoder
from sklearn.preprocessing import LabelEncoder

#instantiate the encoder
encoder = LabelEncoder()

#fit it for all cols in cat_cols
for col in cat_cols:
    df2[col] = encoder.fit_transform(df[col])

df2.head()

# df2 = pd.DataFrame(encoded_data, columns=cat_cols)
# df2.head()

In [None]:
df

In [None]:
#import one hot encoder
from sklearn.preprocessing import OneHotEncoder

#instantiate the encoder
encoder = OneHotEncoder()

#fit it for all cols in cat_cols
encoder.fit(df[cat_cols])

#transform the data
encoded_data = encoder.transform(df[cat_cols])

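#get_feature_names was removed from recent scikit-learn releases; get_feature_names_out replaces it
df2 = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(cat_cols))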
df2.head()
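
For quick exploration, pandas offers a one-line alternative to the scikit-learn encoder. A minimal sketch on made-up data (`toy` is hypothetical); it is shown as a convenience, not as a replacement for `OneHotEncoder` in a modelling pipeline, where a fitted encoder can be reused on new data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Vehicle Size": ["Compact", "Large", "Midsize", "Compact"],
    "MSRP": [20_000, 45_000, 30_000, 22_000],
})

# one new 0/1 column per category; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Vehicle Size"], dtype=int)
print(encoded.columns.tolist())
```

A column with many distinct values, such as Model, would produce one new column per value, so one-hot encoding is best kept for low-cardinality columns.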