# **Exploratory Data Analysis**

Data cleaning is one of the most tedious and time-consuming tasks in Data Science. No single template fits every case, because each data set is unique and contains its own noise that needs to be carefully filtered out.

Exploratory Data Analysis, or EDA, is the first task that a dataset goes through. EDA lets us understand the data and thus helps us prepare it for the upcoming tasks.

Some of the key steps in EDA are identifying the features, counting the observations, and checking for null values or empty cells.

## **Importing the dataset**

In [None]:
!python -m pip install pip --upgrade --user -q
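# scikit-learn is the PyPI package that provides sklearn; openpyxl is needed by pd.read_excel for .xlsx files
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn openpyxl --user -q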

In [None]:
# Restart the kernel so that the newly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import pandas as pd
training_set = pd.read_excel("EDA_Data_Train.xlsx")
test_set = pd.read_excel("EDA_Data_Test.xlsx")

We now have two data frames: one holds the data used to train the model, and the other holds the data for which we will predict the target value, which in this case is the price of the car.

In [None]:
training_set

In [None]:
test_set

## **Identifying the number of features or columns**

In [None]:
#checking the number of features in the Datasets
print("\n\nNumber of features in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20, len(training_set.columns))
print("\nTest Set : \n",'-' * 20,len(test_set.columns))

## **Identifying the features or columns**

In [None]:
#checking the features in the Datasets
print("\n\nFeatures in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20, list(training_set.columns))
print("\nTest Set : \n",'-' * 20,list(test_set.columns))

## **Identifying the data types of features**

In [None]:
#checking the data types of features
print("\n\nDatatypes of features in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20,"\n", training_set.dtypes)
print("\nTest Set : \n",'-' * 20,"\n",test_set.dtypes)
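As a complement to the dtype listing above, the columns can be grouped into numeric and non-numeric ones with `select_dtypes`. This is a minimal sketch on a hypothetical toy frame; the column values are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for training_set (values are invented)
toy = pd.DataFrame({
    "Name": ["Brand_A Model X", "Brand_B Model Y"],
    "Year": [2015, 2012],
    "Mileage": ["20.0 unit", "18.5 unit"],
})

# Columns that are already numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (typically text) still needs cleaning or encoding
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns     :", numeric_cols)
print("Non-numeric columns :", non_numeric_cols)
```

The non-numeric list is a quick way to see which features will need attention during cleaning.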

## **Identifying the number of observations**

In [None]:
#checking the number of rows
print("\n\nNumber of observations in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20,len(training_set))
print("\nTest Set : \n",'-' * 20,len(test_set))
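The number of observations and the number of features can also be read in one step from `DataFrame.shape`. The sketch below uses a hypothetical toy frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame (values are invented)
toy = pd.DataFrame({"Year": [2015, 2012, 2018], "Seats": [5, 7, 5]})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} features")
```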

## **Checking if the dataset has empty cells or samples**

In [None]:
#checking for NaNs or empty cells
print("\n\nEmpty cells or Nans in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20,training_set.isnull().values.any())
print("\nTest Set : \n",'-' * 20,test_set.isnull().values.any())

## **Identifying the number of empty cells by features or columns**

In [None]:
#checking for NaNs or empty cells by features
print("\n\nNumber of empty cells or Nans in the datasets :\n",'#' * 40)
print("\nTraining Set : \n",'-' * 20,"\n", training_set.isnull().sum())
print("\nTest Set : \n",'-' * 20,"\n",test_set.isnull().sum())
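Raw counts are easier to judge as a share of all rows. A possible sketch, assuming a hypothetical toy frame with invented values: the mean of the boolean null mask gives the fraction of missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are invented)
toy = pd.DataFrame({
    "Seats": [5, np.nan, 7, 5],
    "New_Price": [np.nan, np.nan, np.nan, "8.6 unit"],
    "Year": [2015, 2012, 2018, 2011],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```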

## **Exploring Categorical features**

The code block below explores the categorical features, identifying the unique categories in the training and test sets combined.

In [None]:
#combining training set and test set data
all_brands = list(training_set.Name) + list(test_set.Name)
all_locations = list(training_set.Location) + list(test_set.Location)
all_fuel_types = list(training_set.Fuel_Type) + list(test_set.Fuel_Type)
all_transmissions = list(training_set.Transmission) + list(test_set.Transmission)
all_owner_types = list(training_set.Owner_Type) + list(test_set.Owner_Type)

print("\nNumber Of Unique Values In Name : \n ", len(set(all_brands)))
#print("\nThe Unique Values In Name : \n ", set(all_brands))

print("\nNumber Of Unique Values In Location : \n ", len(set(all_locations)))
print("\nThe Unique Values In Location : \n ", set(all_locations) )

print("\nNumber Of Unique Values In Fuel_Type : \n ", len(set(all_fuel_types)))
print("\nThe Unique Values In Fuel_Type : \n ", set(all_fuel_types) )

print("\nNumber Of Unique Values In Transmission : \n ", len(set(all_transmissions)))
print("\nThe Unique Values In Transmission : \n ", set(all_transmissions) )

print("\nNumber Of Unique Values In Owner_Type : \n ", len(set(all_owner_types)))
print("\nThe Unique Values In Owner_Type : \n " ,set(all_owner_types) )
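The same exploration can be sketched with pandas itself, which also makes it easy to spot categories that occur in the test set but never in the training set. The frames and category values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frames (category values are invented)
toy_train = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Petrol"]})
toy_test = pd.DataFrame({"Fuel_Type": ["Diesel", "Electric"]})

# Combine the column from both sets and count the unique categories
combined = pd.concat([toy_train["Fuel_Type"], toy_test["Fuel_Type"]], ignore_index=True)
print("Unique values :", combined.nunique())
print(combined.value_counts())

# Categories the model would never see during training
unseen = set(toy_test["Fuel_Type"]) - set(toy_train["Fuel_Type"])
print("Only in test set :", unseen)
```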

## **Data Cleaning**

In this stage, we will remove unwanted data or noise from the data set to prepare it for the data preprocessing stage. The goal is to clean the data in such a way that all of it can be successfully converted into a numerical type in the preprocessing stage.

Through Exploratory Data Analysis, we found that the majority of the features in the data set are of type object. These features hold strings, much of whose content is insignificant for a predictive model. We will go through these features one by one, cleaning each for both the training set and the test set.

### **Feature/Column : Name**

We will start with the column “Name”. By going through the data, we can see that each cell in the column consists of multiple words that describe both the brand and the model of the car. We can thus simplify the dataset by splitting this feature into two different features, Brand and Model. We will then replace the “Name” feature with the two derived features.

In [None]:
# Split "Name" into two features: Brand (first word) and Model (the rest)

def split_name(names):
    """Return (brand, model) lists; non-string names yield None for both."""
    brand, model = [], []
    for i, name in enumerate(names):
        if isinstance(name, str):
            words = name.strip().split(" ")
            brand.append(words[0])
            model.append(" ".join(words[1:]).strip())
        else:
            # Report the bad cell but keep both lists aligned with the rows
            print("ERR ! - ", name, "@", i)
            brand.append(None)
            model.append(None)
    return brand, model

# Training Set
training_set["Brand"], training_set["Model"] = split_name(training_set.Name)
training_set.drop(labels = ['Name'], axis = 1, inplace = True)

# Test Set
test_set["Brand"], test_set["Model"] = split_name(test_set.Name)
test_set.drop(labels = ['Name'], axis = 1, inplace = True)
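As an alternative to looping over the rows, the split can be sketched with pandas string methods. This is not the code used above, only a vectorized variant shown on a hypothetical toy frame with invented names.

```python
import pandas as pd

# Hypothetical toy frame (names are invented)
toy = pd.DataFrame({"Name": ["Brand_A Model X 1.2", "Brand_B Model_Y", "Brand_C"]})

# Split once, at the first space: [brand, rest-of-name]
parts = toy["Name"].str.strip().str.split(" ", n=1)

toy["Brand"] = parts.str[0]
# Names with a single word have no second part; use an empty model string
toy["Model"] = parts.str[1].fillna("")
toy = toy.drop(columns=["Name"])

print(toy)
```

Splitting with `n=1` keeps everything after the first word together, so multi-word model names stay intact.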

### **Feature/Column : Mileage**

In the given dataset, each value in the ‘Mileage’ column has its unit appended to it. The unit carries no meaning for the model, so we will remove it and convert the feature to float type with the following code.

In [None]:
# Remove the unit text and convert the value to float

# Training Set
import numpy as np

mileage = list(training_set.Mileage)
for i in range(len(mileage)):
   try :
       mileage[i] = float(mileage[i].split(" ")[0].strip())
   except:
       mileage[i] = np.nan
training_set['Mileage'] = mileage

Repeat the above code block for the test set by replacing all training_set with test_set.

In [None]:
# Remove the unit text and convert the value to float

# Test Set
import numpy as np

mileage = list(test_set.Mileage)
for i in range(len(mileage)):
   try :
       mileage[i] = float(mileage[i].split(" ")[0].strip())
   except:
       mileage[i] = np.nan
test_set['Mileage'] = mileage
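The loop above can also be written without an explicit `for`, using `pd.to_numeric` with `errors="coerce"` so that anything unparseable becomes NaN. The sketch below is an alternative, not the notebook's own method, and the sample values and the unit label are placeholders.

```python
import pandas as pd

# Hypothetical toy frame; "unit" stands in for whatever unit text the real data carries
toy = pd.DataFrame({"Mileage": ["19.67 unit", "26.6 unit", None, "null unit"]})

# Keep only the first whitespace-separated token of each cell
first_token = toy["Mileage"].str.split(" ").str[0]

# Convert to float; missing or non-numeric tokens become NaN
toy["Mileage"] = pd.to_numeric(first_token, errors="coerce")

print(toy["Mileage"])
```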

### **Feature/Column : Engine**

Similar to the Mileage column, the Engine column also has units in its values. We will remove the units and convert the values to int, as all of them are integers. Note that cells that cannot be parsed are set to NaN, and pandas stores a numeric column that contains NaN as float.

In [None]:
# Remove the unit text and convert the value to integer

# Training Set
engine = list(training_set.Engine)
for i in range(len(engine)):
   try :
       engine[i] = int(engine[i].split(" ")[0].strip())
   except:
       engine[i] = np.nan
training_set['Engine'] = engine

Repeat the above code block for the test set by replacing all training_set with test_set.

In [None]:
# Remove the unit text and convert the value to integer

# Test Set
engine = list(test_set.Engine)
for i in range(len(engine)):
   try :
       engine[i] = int(engine[i].split(" ")[0].strip())
   except:
       engine[i] = np.nan
test_set['Engine'] = engine
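If the Engine column should stay integer-typed even when some cells are missing, one option is pandas' nullable `Int64` dtype. This is a sketch of that alternative on a hypothetical toy frame; the values and the unit label are invented.

```python
import pandas as pd

# Hypothetical toy frame; "unit" stands in for the real unit text
toy = pd.DataFrame({"Engine": ["1582 unit", "998 unit", None]})

first_token = toy["Engine"].str.split(" ").str[0]

# to_numeric yields floats with NaN; Int64 (capital I) stores integers plus <NA>
toy["Engine"] = pd.to_numeric(first_token, errors="coerce").astype("Int64")

print(toy["Engine"])
```

Some downstream libraries may expect plain NumPy dtypes, so whether to use `Int64` or leave the column as float is a judgment call.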

### **Feature/Column : Power**

We will do the same for the feature Power.


In [None]:
# Remove the unit text and convert the value to float

# Training Set
power = list(training_set.Power)
for i in range(len(power)):
   try :
       power[i] = float(power[i].split(" ")[0].strip())
   except:
       power[i] = np.nan
training_set['Power'] = power

Repeat the above code block for the test set by replacing all training_set with test_set.

In [None]:
# Remove the unit text and convert the value to float

# Test Set
power = list(test_set.Power)
for i in range(len(power)):
   try :
       power[i] = float(power[i].split(" ")[0].strip())
   except:
       power[i] = np.nan
test_set['Power'] = power

### **Feature/Column : New_Price**

Since this feature has a huge number of null values compared to the entire dataset, we will choose to remove the feature itself from the datasets. (It is also possible to fill the nulls with zeros or unit values to check its significance in predictions.)

In [None]:
training_set.drop(labels = ['New_Price'], axis = 1, inplace = True)
test_set.drop(labels = ['New_Price'], axis = 1, inplace = True)
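The decision to drop a sparse feature can be made explicit with a cut-off on the share of missing values. The sketch below assumes a hypothetical threshold of 50% and a toy frame with invented values; the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are invented)
toy = pd.DataFrame({
    "Seats": [5, np.nan, 7, 5],
    "New_Price": [np.nan, np.nan, np.nan, "8.6 unit"],
})

max_missing = 0.5  # hypothetical cut-off: drop columns with more than 50% nulls

# Fraction of missing cells per column
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > max_missing].index.tolist()

toy = toy.drop(columns=to_drop)
print("Dropped :", to_drop)
```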

### **Reordering The Dataset**

We have now cleaned the dataset. Let's reorder the columns and have a look at the new, cleaner dataset.


In [None]:

#Re-ordering the columns
training_set = training_set[['Brand', 'Model', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission',
      'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Price']]
test_set = test_set[['Brand', 'Model', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission',
      'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats']]

In [None]:
training_set

In [None]:
test_set

Read more about it [here](https://analyticsindiamag.com/tutorial-get-started-with-exploratory-data-analysis-and-data-preprocessing/).

# **Pandas Visual Analysis – Way To Speed-Up Data Visualization**

Exploratory Data Analysis is an important step in understanding what the data is about, but it generally takes a lot of time because we need to write code for analyzing and visualizing the data. What if we could automate this process?

Pandas Visual Analysis is an open-source Python library for visually analyzing data in just a single line of code. It creates a user interface in which different plots and graphs can be built from different attributes. It supports a large variety of graphs and plots, and all of them are created using Plotly, so they are highly interactive, visually appealing, and easily downloadable.

In [None]:
!pip install pandas-visual-analysis --user

Read more about it, [here](https://analyticsindiamag.com/hands-on-guide-to-pandas-visual-analysis-way-to-speed-up-data-visualization/).

## **Importing Required Libraries**

For the analysis, we import VisualAnalysis from pandas_visual_analysis, along with pandas for working with the dataset. We also import seaborn to load one of its built-in datasets, named tips.

In [None]:
import pandas as pd
from pandas_visual_analysis import VisualAnalysis
import seaborn as sns

## **Loading the dataset**

We will explore pandas visual analysis using the tips dataset loaded from seaborn. It contains restaurant data with attributes like ‘total bill’, ‘tip’, etc.

In [None]:
df1= sns.load_dataset('tips')

df1

## **Visual Analysis**

This is the final step that will load our data in the form of a Graphical User Interface where we have a variety of graphs and plots defined and we can select different attributes to visualize.

In [None]:
VisualAnalysis(df1)

## **Graphical User Interface**

Here you can see that we have created an interface with different sections to analyze and visualize the dataset we are working on. Although it is a multivariate dataset, pandas visual analysis created the interface easily. Let us look at the different sections.

**1. Statistical Analysis**

The first section helps us analyze the statistical properties of the data. We can inspect metrics like the mean, quartiles, and median for all the numerical attributes.
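For comparison, similar summary statistics can be produced in plain pandas with `describe()`. This sketch uses a small hypothetical frame with invented values instead of the real tips data, and it is independent of what the Pandas Visual Analysis widget computes internally.

```python
import pandas as pd

# Hypothetical toy frame with two numeric attributes (values are invented)
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 7.0],
})

# count, mean, std, min, quartiles (25%, 50%, 75%) and max per numeric column
summary = toy.describe()
print(summary)
```

The `50%` row is the median of each column.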

**2. Distribution using Scatter Plot**

Using this, we can analyze the distribution of, and the relationship between, two attributes using a scatter plot.

**3. Distribution Using Histogram**

In this section, we analyze the distribution of a single attribute using a histogram.
